# Learning Pandas data structures
[Pandas Tutorial: DataFrames in Python](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python)
- Pandas Series, DataFrame and Panel are built on top of NumPy arrays.
    - Series = indexed 1D NumPy arrays
    - DataFrames = indexed 2D NumPy arrays
    - Panels = indexed 3D NumPy arrays (deprecated and removed from recent pandas releases)

Indexes can be labeled -> Pandas objects are dictionary-based arrays
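
As a minimal sketch of what "dictionary-based" means here, the snippet below looks up a value in a Series both by label and by position. The variable names (`capitals`, `by_label`, `by_position`, `has_label`) are made up for illustration.

```python
import pandas as pd

# A Series maps index labels to values, much like a dictionary
capitals = pd.Series({"UK": "London", "France": "Paris"})

by_label = capitals["France"]    # label-based lookup, as with a dict key
by_position = capitals.iloc[0]   # position-based lookup, as with an array
has_label = "UK" in capitals     # membership tests check the index labels

print(by_label, by_position, has_label)
```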

In [2]:
import numpy as np
import pandas as pd

In [2]:
# A structured array - manipulate data by named fields
my_array = np.ones(3, dtype=[('foo', int), ('bar', float)])
print(my_array)
print(my_array["foo"])
print(type(my_array))

[(1, 1.) (1, 1.) (1, 1.)]
[1 1 1]
<class 'numpy.ndarray'>


In [3]:
# A record array - access fields of structured arrays by attribute rather than by index
my_array2 = my_array.view(np.recarray)
print(my_array2)
print(my_array2.foo)
print(type(my_array2))

[(1, 1.) (1, 1.) (1, 1.)]
[1 1 1]
<class 'numpy.recarray'>


In [4]:
# Another example
x = np.array([(1, 2),(3,4)], dtype=[('a', float), ('b', np.int8)])
print(x)
print(x["a"])

[(1., 2) (3., 4)]
[1. 3.]


# 1. How to create Pandas DataFrames

Build DataFrames from scratch or convert lists and NumPy arrays to Pandas DataFrames.

## Convert an array into a DataFrame

In [5]:
# First, build an ndarray from a nested list (array() is a function while ndarray is a type)
data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2], 
                 ['Row2', 3, 4]])

# Then, convert the ndarray into a DataFrame
# Note that rows are specified as "index"
df = pd.DataFrame(data=data[1:,1:],
                  index=data[1:,0],
                  columns=data[0,1:])

print(df)

     Col1 Col2
Row1    1    2
Row2    3    4


## Convert a nested list into a DataFrame

In [6]:
my_list = [[1, 2],
           [3, 4],
           [5, 6]]
df = pd.DataFrame(my_list, 
                  index=["A", "B", "C"], 
                  columns=["Class", "Speed"])
df

Unnamed: 0,Class,Speed
A,1,2
B,3,4
C,5,6


## Convert a dictionary into a DataFrame

In [7]:
# Build a dictionary
my_dict = {"A": ['1', '3'], "B": ['1', '2'], "C": ['2', '4']}

# Convert into a DataFrame
# Note that the keys are used as columns while indexes are automatically set from 0
df1 = pd.DataFrame(my_dict)
df1

Unnamed: 0,A,B,C
0,1,1,2
1,3,2,4


## Convert a list of dictionaries into a DataFrame

In [6]:
data = [{"Name": "Francis", "Age": 35, "Occupation": "waiter"},
        {"Name": "Maria", "Age": 17, "Occupation": "lawyer", "Degree": "doctorate"}]

df = pd.DataFrame(data)
df.head()

Unnamed: 0,Name,Age,Occupation,Degree
0,Francis,35,waiter,
1,Maria,17,lawyer,doctorate


## Take DataFrame as input to DataFrame

In [8]:
my_df = pd.DataFrame(data=[1,2,3,4], index=range(0,4), columns=["A"])
print(my_df)

df = pd.DataFrame(my_df)
print(df)

   A
0  1
1  2
2  3
3  4
   A
0  1
1  2
2  3
3  4


## Take a Series as input to a DataFrame

In [9]:
my_dict = {"UK": "London", "France": "Paris", "US": "Washington"}
my_series = pd.Series(my_dict) # The indexes are the keys of the dictionary
print(my_series)

df = pd.DataFrame(my_series) # the columns are indexed from 0 when not specified
print(df)

UK            London
France         Paris
US        Washington
dtype: object
                 0
UK          London
France       Paris
US      Washington


## Learn more about your dataframe

In [10]:
my_dict = {"A": ['1', '3'], "B": ['1', '2'], "C": ['2', '4']}
df = pd.DataFrame(my_dict)
print(df.shape)
print(len(df.index))
print(len(df.columns))
print(df["A"].count()) # WARNING: this excludes the NaN values
print(df.columns.values)

(2, 3)
2
3
2
['A' 'B' 'C']
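
The warning about `count()` is easier to see on a column that actually contains a missing value. This is a small sketch on a made-up frame (`scores` is a hypothetical name, not part of the example above):

```python
import numpy as np
import pandas as pd

# One of the three values is missing
scores = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

n_rows = len(scores)                   # counts every row, NaN included
n_non_null = scores["A"].count()       # counts only the non-NaN values
n_missing = scores["A"].isna().sum()   # counts only the NaN values

print(n_rows, n_non_null, n_missing)
```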


# 2. How to select an index or column from a Pandas DataFrame

## Access values of a dataframe

In [11]:
my_dict = {"A": ['1', '3'], "B": ['1', '2'], "C": ['2', '4']}
df = pd.DataFrame(my_dict)
print(df)

# row index, column index
print(df.iloc[0,0])
print(df.iat[0,0])

# row index, column label
print(df.loc[0,"A"])
print(df.at[0,"A"])

   A  B  C
0  1  1  2
1  3  2  4
1
1
1
1


## Selecting rows and columns

In [12]:
# Select row indexed 0
print(df.iloc[0,:]) # This is a Series
print(df.iloc[0])
print(df.loc[0,:])
print(df.loc[0])

# Select columns
print(df.iloc[:,0])
print(df.loc[:,"A"])

A    1
B    1
C    2
Name: 0, dtype: object
A    1
B    1
C    2
Name: 0, dtype: object
A    1
B    1
C    2
Name: 0, dtype: object
A    1
B    1
C    2
Name: 0, dtype: object
0    1
1    3
Name: A, dtype: object
0    1
1    3
Name: A, dtype: object


# 3. Add an index, row or column to a DataFrame

In [13]:
my_dict = {"A": ['1', '3'], "B": ['1', '2'], "C": ['2', '4']}
df = pd.DataFrame(my_dict)
print(df)

# Set one of the columns as index for the other columns
df.set_index("C", inplace=True)
print(df)

   A  B  C
0  1  1  2
1  3  2  4
   A  B
C      
2  1  1
4  3  2


## Add rows to a DataFrame

.loc[] works on the labels of the index (both row and column) whereas .iloc[] works on integer positions. The older .ix[] indexer, which accepted either a position or a label, was deprecated and then removed in pandas 1.0.

In [14]:
df = pd.DataFrame(data=np.array([[1,2,3], [4,5,6], [7,8,9]]), 
                  index=[2,"A",4],
                  columns=[48,-2,"Zoe"])
print(df)

print(df.loc[2]) # Will get first row (labeled 2, position 0)
print(df.iloc[2]) # Will get third row (labeled 4, position 2)

   48  -2  Zoe
2   1   2    3
A   4   5    6
4   7   8    9
48     1
-2     2
Zoe    3
Name: 2, dtype: int32
48     7
-2     8
Zoe    9
Name: 4, dtype: int32


In [15]:
# Adding a row to a dataframe
df.loc["feast"] = [12, 34, "dk"]
print(df)

       48  -2 Zoe
2       1   2   3
A       4   5   6
4       7   8   9
feast  12  34  dk


## Add a column to a dataframe

In [16]:
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['A', 'B', 'C'])

# Method 1: Use `.index`
df['D'] = df.index # This places the "index" column in column "D"
print(df)

# Method 2: use loc to append a Series to the existing DataFrame
# Remember: each column of a DataFrame is a Series
my_series = pd.Series([100, 99, 299], index=df.index)
df.loc[:,4] = my_series
print(df)

   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
   A  B  C  D    4
0  1  2  3  0  100
1  4  5  6  1   99
2  7  8  9  2  299


In [17]:
df.index = ["De", "la", "merde"]
print(df)

# Make the current index column an actual column of the dataframe
print(df.reset_index()) 

# drop the index column
print(df.reset_index(drop=True)) 

# if the index has several levels, `level` selects which level(s) to move into the columns (here level 0, the only one)
print(df.reset_index(level=0)) 



       A  B  C  D    4
De     1  2  3  0  100
la     4  5  6  1   99
merde  7  8  9  2  299
   index  A  B  C  D    4
0     De  1  2  3  0  100
1     la  4  5  6  1   99
2  merde  7  8  9  2  299
   A  B  C  D    4
0  1  2  3  0  100
1  4  5  6  1   99
2  7  8  9  2  299
   index  A  B  C  D    4
0     De  1  2  3  0  100
1     la  4  5  6  1   99
2  merde  7  8  9  2  299


# 4. Delete indices, rows, columns from a dataframe

## Remove duplicate index values from a dataframe

In [18]:
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [40, 50, 60], [23, 35, 37]]), 
                  index= [2.5, 12.6, 4.8, 4.8, 2.5], 
                  columns=[48, 49, 50])
print(df)

df1 = df.reset_index()
print(df1)

df2 = df.reset_index().drop_duplicates(subset="index", keep="last")
print(df2)

df3 = df.reset_index().drop_duplicates(subset="index", keep="last").set_index("index")
print(df3)

      48  49  50
2.5    1   2   3
12.6   4   5   6
4.8    7   8   9
4.8   40  50  60
2.5   23  35  37
   index  48  49  50
0    2.5   1   2   3
1   12.6   4   5   6
2    4.8   7   8   9
3    4.8  40  50  60
4    2.5  23  35  37
   index  48  49  50
1   12.6   4   5   6
3    4.8  40  50  60
4    2.5  23  35  37
       48  49  50
index            
12.6    4   5   6
4.8    40  50  60
2.5    23  35  37


## Delete a column from a dataframe

In [19]:
df = pd.DataFrame({"A": [1,2,3], 'b':[4,5,6], "C":[7,8,9]}, index=[2,5,8])

df1 = df.drop("A", axis=1, inplace=False)
print(df1)

# Equivalent
df2 = df.drop(df.columns[0],axis=1)
print(df2)

# Equivalent
df3 = df.drop(columns=["A"])
print(df3)

# remove several columns at once
df4 = df.drop(columns=["A","b"])
print(df4)
print(type(df4))

   b  C
2  4  7
5  5  8
8  6  9
   b  C
2  4  7
5  5  8
8  6  9
   b  C
2  4  7
5  5  8
8  6  9
   C
2  7
5  8
8  9
<class 'pandas.core.frame.DataFrame'>


## Removing a row from a dataframe

In [20]:
df = pd.DataFrame({"A": [1,2,2], 'b':[4,3,6], "C":[7,8,9]}, index=[3,5,8])
print(df)

# Drop the duplicates in df
print(df.drop_duplicates(["A"], keep="last"))

   A  b  C
3  1  4  7
5  2  3  8
8  2  6  9
   A  b  C
3  1  4  7
8  2  6  9


In [21]:
# drop the row at a certain position of the index
print(df)
print(list(df.index))
df.drop(df.index[1])

   A  b  C
3  1  4  7
5  2  3  8
8  2  6  9
[3, 5, 8]


Unnamed: 0,A,b,C
3,1,4,7
8,2,6,9


In [22]:
# After dropping an index, we may want to reset it 
df.drop(df.index[1]).reset_index(drop=True)

Unnamed: 0,A,b,C
0,1,4,7
1,2,6,9


# 5. Rename the index or columns of dataframes

In [23]:
df = pd.DataFrame(data=[[1,4,2],
                        [3,45,20],
                        [2,3,7]],
                  columns=["A","B","C"],
                  index=["a","b","c"])
print(df)

new_columns = {"A": "D", "B":"E", "C":"F"}
df.rename(columns=new_columns, inplace=True)
print(df)

new_indexes = {"c": "d"}
df.rename(index=new_indexes)

   A   B   C
a  1   4   2
b  3  45  20
c  2   3   7
   D   E   F
a  1   4   2
b  3  45  20
c  2   3   7


Unnamed: 0,D,E,F
a,1,4,2
b,3,45,20
d,2,3,7


# 6. Format the data in a dataframe

## Replace all occurrences of a string in a dataframe

In [24]:
my_dict = {"Exam 1": ["Awful", "Perfect", "Acceptable", "Perfect", "Ok"],
           "Exam 2": ["Poor", "Poor", "Ok", "Perfect", "Awful"],
           "Exam 3": ["Perfect", "Acceptable", "Poor", "Ok", "Poor"]}

df = pd.DataFrame(my_dict, index=["Ben", "Mikael", "Sandra", "Maria", "Brendon"])
print(df)

# Replace the strings by numerical values
df.replace(["Awful", "Poor", "Ok", "Acceptable", "Perfect"], [0, 1, 2, 3, 4], inplace=True)
df

             Exam 1   Exam 2      Exam 3
Ben           Awful     Poor     Perfect
Mikael      Perfect     Poor  Acceptable
Sandra   Acceptable       Ok        Poor
Maria       Perfect  Perfect          Ok
Brendon          Ok    Awful        Poor


Unnamed: 0,Exam 1,Exam 2,Exam 3
Ben,0,1,4
Mikael,4,1,3
Sandra,3,2,1
Maria,4,4,2
Brendon,2,0,1


In [25]:
my_array = np.array([[1,2,"+3b"],
                     [4,5,"-6B"],
                     [7,8,"+9A"]])
df = pd.DataFrame(my_array, columns=["Class", "Text", "Result"])
print(df)

# Use the map function and remove unwanted parts of strings
df["Result"] = df["Result"].map(lambda x: x.lstrip("+-").rstrip("aAbBcC"))
df

  Class Text Result
0     1    2    +3b
1     4    5    -6B
2     7    8    +9A


Unnamed: 0,Class,Text,Result
0,1,2,3
1,4,5,6
2,7,8,9
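
The same clean-up can probably be written without a lambda, using the vectorized `.str` accessor. A minimal sketch on a standalone Series (`results` and `cleaned` are illustrative names):

```python
import pandas as pd

results = pd.Series(["+3b", "-6B", "+9A"], name="Result")

# Strip leading signs, then trailing letters, for every element at once
cleaned = results.str.lstrip("+-").str.rstrip("aAbBcC")

print(cleaned.tolist())
```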


## Splitting text in a column into multiple rows

In [26]:
data_dict = {"Age": [34, 22, 19],
             "PlusOne": [0, 0, 1],
             "Ticket": ["23:44:55", "66:77:88", "43:68:05 56:34:12"]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,Age,PlusOne,Ticket
0,34,0,23:44:55
1,22,0,66:77:88
2,19,1,43:68:05 56:34:12


In [27]:
# Split the two time records in the third row
series = df["Ticket"].str.split(" ")
print(type(series))
series

<class 'pandas.core.series.Series'>


0              [23:44:55]
1              [66:77:88]
2    [43:68:05, 56:34:12]
Name: Ticket, dtype: object

In [28]:
# Create a second column
series = df["Ticket"].str.split(" ").apply(pd.Series)
print(type(series))
series

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,0,1
0,23:44:55,
1,66:77:88,
2,43:68:05,56:34:12


In [29]:
# And stack the values
series = df["Ticket"].str.split(" ").apply(pd.Series).stack()
print(type(series))
series

<class 'pandas.core.series.Series'>


0  0    23:44:55
1  0    66:77:88
2  0    43:68:05
   1    56:34:12
dtype: object

In [30]:
# After stacking the values, the index has two levels
# Drop the index level to line up with the Dataframe
series.index = series.index.droplevel(-1)
print(type(series))
series

<class 'pandas.core.series.Series'>


0    23:44:55
1    66:77:88
2    43:68:05
2    56:34:12
dtype: object

In [31]:
# Make the above series a dataframe 
ticket_df = pd.DataFrame(series)
print(ticket_df)

# Remove the ticket column from the original dataframe
print(df)
del df["Ticket"]
print(df)

# Join the ticket_df dataframe to df (rows are matched on the index)
df = df.join(ticket_df)
df.rename(columns={0: "Ticket"}, inplace=True)
df

          0
0  23:44:55
1  66:77:88
2  43:68:05
2  56:34:12
   Age  PlusOne             Ticket
0   34        0           23:44:55
1   22        0           66:77:88
2   19        1  43:68:05 56:34:12
   Age  PlusOne
0   34        0
1   22        0
2   19        1


Unnamed: 0,Age,PlusOne,Ticket
0,34,0,23:44:55
1,22,0,66:77:88
2,19,1,43:68:05
2,19,1,56:34:12
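
On pandas versions that provide `DataFrame.explode`, the split/stack/join sequence above can likely be condensed into a single chain. This is a sketch of that alternative, not the method used in the cells above; `tickets` and `exploded` are illustrative names.

```python
import pandas as pd

tickets = pd.DataFrame({"Age": [34, 22, 19],
                        "PlusOne": [0, 0, 1],
                        "Ticket": ["23:44:55", "66:77:88", "43:68:05 56:34:12"]})

# Turn each Ticket string into a list, then give every list element its own row.
# The other columns and the original index label are repeated for the new rows.
exploded = tickets.assign(Ticket=tickets["Ticket"].str.split(" ")).explode("Ticket")

print(exploded)
```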


## Applying a function to a pandas dataframe's columns or rows

In [32]:
df = pd.DataFrame(data=[[1,2],
                        [3,4],
                        [5,6]],
                  columns=["C1", "C2"],
                  index=["R1", "R2", "R3"])
df

Unnamed: 0,C1,C2
R1,1,2
R2,3,4
R3,5,6


In [33]:
# Double the values of a column
doubler = lambda x: x*2
df["C1"] = df["C1"].apply(doubler)
df

Unnamed: 0,C1,C2
R1,2,2
R2,6,4
R3,10,6


In [34]:
# Double the values of a row
df.loc["R1"] = df.loc["R1"].apply(doubler)
df

Unnamed: 0,C1,C2
R1,4,4
R2,6,4
R3,10,6


In [35]:
# Double the values of the entire dataframe
# (`applymap` is deprecated since pandas 2.1; use `df.map(doubler)` on recent versions)
df = df.applymap(doubler)
df

Unnamed: 0,C1,C2
R1,8,8
R2,12,8
R3,20,12
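
For simple arithmetic like doubling, plain operators work element-wise and avoid calling a Python function per value. A short sketch on a fresh copy of the data (`table`, `doubled` and `one_column` are illustrative names):

```python
import pandas as pd

table = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6]],
                     columns=["C1", "C2"],
                     index=["R1", "R2", "R3"])

doubled = table * 2           # every value in the dataframe
one_column = table["C1"] * 2  # a single column

print(doubled)
print(one_column)
```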


# 7. Create an Empty DataFrame
The dataframe's shape is predefined but not its data.

In [36]:
# Prefill a dataframe with some value (any type)
df = pd.DataFrame(np.nan, index=[0,1,2,3], columns=["A"])
df

Unnamed: 0,A
0,
1,
2,
3,


In [37]:
# Force the data in the dataframe to be of a particular type
df = pd.DataFrame(index=range(0,4),columns=["A", "B"], dtype="float")
df

Unnamed: 0,A,B
0,,
1,,
2,,
3,,


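# 8. Does pandas recognize dates when importing data?
Yes, if you ask for it with the <code>parse_dates</code> argument of <code>read_csv</code>. <code>pd.read_csv("file.csv", parse_dates=True)</code> tries to parse the index as dates; pass a list of column names instead to parse specific columns.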
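
A minimal sketch of date parsing on import, reading from an in-memory string rather than a real file. The CSV content and the column names `date` and `value` are invented for the example.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = "date,value\n2021-01-05,10\n2021-02-17,20\n"

# Ask pandas to parse the "date" column as datetimes
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(parsed.dtypes)
```

Once the column is a datetime column, the `.dt` accessor (for example `parsed["date"].dt.month`) becomes available.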

# 9. When, why, how to reshape a pandas dataframe?
To make your data suitable for data analysis.
3 ways to reshape a dataframe: pivoting, stacking/unstacking, melting

## Method 1: pivoting a dataframe

In [38]:
# Create your DataFrame
products = pd.DataFrame({'category': ['Cleaning', 'Cleaning', 'Entertainment', 'Entertainment', 'Tech', 'Tech'],
        'store': ['Walmart', 'Dia', 'Walmart', 'Fnac', 'Dia','Walmart'],
        'price':[11.42, 23.50, 19.99, 15.95, 55.75, 111.55],
        'testscore': [4, 3, 5, 7, 5, 8]})

products

Unnamed: 0,category,store,price,testscore
0,Cleaning,Walmart,11.42,4
1,Cleaning,Dia,23.5,3
2,Entertainment,Walmart,19.99,5
3,Entertainment,Fnac,15.95,7
4,Tech,Dia,55.75,5
5,Tech,Walmart,111.55,8


In [39]:
# Use `pivot()` to pivot the DataFrame
pivot_products = products.pivot(index='category', columns='store', values='price')
pivot_products

store,Dia,Fnac,Walmart
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cleaning,23.5,,11.42
Entertainment,,15.95,19.99
Tech,55.75,,111.55


In [40]:
# Without specifying the "values" field
pivot_products = products.pivot(index='category', columns='store')
pivot_products
# This creates a multilevel column index

Unnamed: 0_level_0,price,price,price,testscore,testscore,testscore
store,Dia,Fnac,Walmart,Dia,Fnac,Walmart
category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Cleaning,23.5,,11.42,3.0,,4.0
Entertainment,,15.95,19.99,,7.0,5.0
Tech,55.75,,111.55,5.0,,8.0


In [41]:
# Use pivot_table when an index/columns pair occurs more than once: the duplicates are aggregated (here averaged)
data = [["Hargrave", 619, "France", 42, "yes"],
        ["Hill", 608, "Spain", 41, "no"],
        ["Onio", 502, "France", np.nan, "yes"],
        ["Boni", 699, "France", 39, "no"],
        ["Mitchell", 850, "Spain", 43, "no"],
        ["Chu", np.nan, "Spain", 44, "yes"]]

df = pd.DataFrame(data=data, 
                  columns=["surname", "credit_score", "geography", "age", "exited"])

df_new = pd.pivot_table(df, index=["geography"], columns=["exited"], values="credit_score", aggfunc="mean")
df_new

exited,no,yes
geography,Unnamed: 1_level_1,Unnamed: 2_level_1
France,699.0,560.5
Spain,729.0,
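
To illustrate why `pivot_table` is needed here, the sketch below builds a reduced, made-up version of the data in which the pair (France, yes) occurs twice. `pivot()` is expected to raise a `ValueError` on such duplicates, while `pivot_table()` aggregates them. The names `customers`, `pivot_failed` and `averaged` are illustrative.

```python
import pandas as pd

customers = pd.DataFrame({"geography": ["France", "France", "Spain"],
                          "exited": ["yes", "yes", "no"],
                          "credit_score": [619, 502, 608]})

# pivot() cannot decide which of the two (France, yes) values to keep
try:
    customers.pivot(index="geography", columns="exited", values="credit_score")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table() combines the duplicates with the aggregation function
averaged = customers.pivot_table(index="geography", columns="exited",
                                 values="credit_score", aggfunc="mean")
print(averaged)
```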


## Method 2: stacking and unstacking
Stacking a dataframe makes it taller. <br>
The innermost column index becomes the innermost row index.

In [42]:
data_dict = {"Age": [34, 22, 19],
             "PlusOne": [0, 0, 1],
             "Ticket": ["23:44:55", "66:77:88", "43:68:05 56:34:12"]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,Age,PlusOne,Ticket
0,34,0,23:44:55
1,22,0,66:77:88
2,19,1,43:68:05 56:34:12


In [43]:
# The innermost column level (the only one here, since the columns have a single level)
# becomes the innermost level of the row index
print(type(df.stack()))
df.stack() # stacking results in a Series object

<class 'pandas.core.series.Series'>


0  Age                       34
   PlusOne                    0
   Ticket              23:44:55
1  Age                       22
   PlusOne                    0
   Ticket              66:77:88
2  Age                       19
   PlusOne                    1
   Ticket     43:68:05 56:34:12
dtype: object

In [44]:
pd.DataFrame(df.stack().unstack()) # results in a DataFrame object

Unnamed: 0,Age,PlusOne,Ticket
0,34,0,23:44:55
1,22,0,66:77:88
2,19,1,43:68:05 56:34:12


## Method 3: Melting a dataframe with <code>melt()</code>
Useful when the dataframe contains both "identifier" variables and "measured" variables. <br>
The measured variables are unpivoted to the row axis: they start out as columns (along the width of the dataframe) and end up as rows (along its height) after melting.

In [45]:
people = pd.DataFrame({'FirstName' : ['John', 'Jane'],
                       'LastName' : ['Doe', 'Austen'],
                       'BloodType' : ['A-', 'B+'],
                       'Weight' : [90, 64]})

people


Unnamed: 0,FirstName,LastName,BloodType,Weight
0,John,Doe,A-,90
1,Jane,Austen,B+,64


In [46]:
melted_df = pd.melt(people, 
                    id_vars=['FirstName', 'LastName'], 
                    var_name='measurements')
melted_df

Unnamed: 0,FirstName,LastName,measurements,value
0,John,Doe,BloodType,A-
1,Jane,Austen,BloodType,B+
2,John,Doe,Weight,90
3,Jane,Austen,Weight,64
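
Melting can usually be undone with `pivot()`. The sketch below assumes a pandas version in which `pivot()` accepts a list of columns for `index`; the names `melted`, `wide` and `restored` are illustrative.

```python
import pandas as pd

people = pd.DataFrame({"FirstName": ["John", "Jane"],
                       "LastName": ["Doe", "Austen"],
                       "BloodType": ["A-", "B+"],
                       "Weight": [90, 64]})

melted = pd.melt(people, id_vars=["FirstName", "LastName"], var_name="measurements")

# Spread the measured variables back out along the column axis
wide = melted.pivot(index=["FirstName", "LastName"],
                    columns="measurements",
                    values="value")

# Turn the identifier levels of the index back into ordinary columns
restored = wide.reset_index()
print(restored)
```

Because the `value` column mixes strings and numbers, the restored columns keep the `object` dtype and may need converting back.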


# 10. Iterate over a dataframe
## <code>for</code> loop with <code>iterrows()</code>.

In [47]:
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['A', 'B', 'C'])

for index, row in df.iterrows() :
    print(row['A'], row['B'])
    
for index, row in df.iterrows():
    print(index)

1 2
4 5
7 8
0
1
2
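
`iterrows()` builds a Series for every row, so `itertuples()` is often suggested as a lighter alternative. A minimal sketch (`frame` and `pairs` are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["A", "B", "C"])

pairs = []
# Each row is a namedtuple: the index is in `Index`, the columns are attributes
for row in frame.itertuples():
    pairs.append((row.Index, row.A, row.B))

print(pairs)
```

Attribute access only works for column names that are valid Python identifiers.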


## <code>for</code> loop with <code>items()</code> (formerly <code>iteritems()</code>, removed in pandas 2.0).

In [48]:
df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
                  'population': [1864, 22000, 80000]},
                  index=['panda', 'polar', 'koala'])

df

Unnamed: 0,species,population
panda,bear,1864
polar,bear,22000
koala,marsupial,80000


In [49]:
for label, content in df.items():
    print(label)
    print(content)
    # print(f'label: {label}')
    # print(f'content: {content}')

species
panda         bear
polar         bear
koala    marsupial
Name: species, dtype: object
population
panda     1864
polar    22000
koala    80000
Name: population, dtype: int64


# 11. Write a dataframe to a file (csv or excel)

In [50]:
df = pd.DataFrame({'category': ['Cleaning', 'Cleaning', 'Entertainment', 'Entertainment', 'Tech', 'Tech'],
        'store': ['Walmart', 'Dia', 'Walmart', 'Fnac', 'Dia','Walmart'],
        'price':[11.42, 23.50, 19.99, 15.95, 55.75, 111.55],
        'testscore': [4, 3, 5, 7, 5, 8]})

# create a csv with "tab" delimiter
df.to_csv("myDataFrame.csv", sep="\t", encoding="utf-8")

# Write to excel format
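# The context manager saves and closes the file on exit (`writer.save()` was removed in pandas 2.0)
with pd.ExcelWriter("myDataFrame.xlsx") as writer:
    df.to_excel(writer, sheet_name="DataFrame")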
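
To check that the tab-delimited file round-trips, it can be read back with the same separator. This sketch writes to a temporary directory so nothing is left behind; the frame contents and the names `frame`, `path` and `restored` are illustrative.

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({"store": ["Walmart", "Dia"], "price": [11.42, 23.50]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "myDataFrame.csv")
    frame.to_csv(path, sep="\t", encoding="utf-8")

    # The index was written as the first column, so read it back as the index
    restored = pd.read_csv(path, sep="\t", index_col=0)

print(restored)
```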