
# Pandas


In [1]:
import numpy as np
import pandas as pd

In [2]:
pd.__version__

'1.3.5'

## Series

Pandas provides access to a DataFrame data structure and a Series data structure. First, we'll briefly look at the Series data structure, since each column in a DataFrame could be considered a Series as well.

A Series object is an array of data with axis labels or index values. Notice that when we display Series, we can see the index values on the left and the Series values on the right.

In [None]:
vals = np.array([1,2,3,4,5])
idxs = np.array(["a","b","c","d","e"])

my_series = pd.Series(vals, idxs)
my_series

a    1
b    2
c    3
d    4
e    5
dtype: int32

For a Series, we can get the values with the .values attribute and the index with the .index attribute.

In [None]:
my_series.values

array([1, 2, 3, 4, 5])

In [None]:
my_series.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

We can subset a Series by index labels such as "a", or by numeric position. The following two examples return the same subset of the Series.

In [None]:
my_series[0:2]

a    1
b    2
dtype: int32

In [None]:
my_series[["a","b"]]

a    1
b    2
dtype: int32
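
The two selections above rely on plain square brackets, which accept both positions and labels. As a hedged sketch (rebuilding the same my_series so the cell is self-contained), the .iloc and .loc accessors make the intent explicit: .iloc always means position and .loc always means label.

```python
import numpy as np
import pandas as pd

# Rebuild the Series from the earlier cell so this example runs on its own.
my_series = pd.Series(np.array([1, 2, 3, 4, 5]), index=["a", "b", "c", "d", "e"])

first_two_by_position = my_series.iloc[0:2]     # positions 0 and 1; the end is excluded
first_two_by_label = my_series.loc[["a", "b"]]  # explicit list of labels
label_slice = my_series.loc["a":"c"]            # label slices include the end label

print(first_two_by_position.equals(first_two_by_label))  # True
print(list(label_slice.index))                           # ['a', 'b', 'c']
```

Note that a label slice includes its end point, unlike a positional slice.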

### Pandas Dataframe

DataFrames can be thought of as 2D NumPy arrays with more features, such as indexing options and column headers, among others.

We load the Iris dataset with the Pandas read_csv() function, which reads a file into a DataFrame. The default delimiter is the comma, but it can be changed with the sep parameter. The .shape attribute gives the dimensions of the DataFrame as (rows, columns).

#### Load external data with Pandas

In [None]:
import os
import pandas as pd
# The data folder is expected next to the folder that holds this notebook.
parent_path = os.path.dirname(os.getcwd())
data_path = os.path.join(parent_path, "data", "iris.csv")

In [None]:
df = pd.read_csv(data_path)
df.shape

(150, 6)
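
To illustrate the delimiter option without depending on a file on disk, here is a minimal sketch that reads semicolon-separated text from an in-memory buffer. The sample text and the variable names are made up for illustration; only read_csv and its sep parameter come from Pandas.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated sample; not the real iris.csv.
raw = "Id;SepalLengthCm;Species\n1;5.1;Iris-setosa\n2;4.9;Iris-setosa\n"

# sep tells read_csv which character separates the fields.
small = pd.read_csv(io.StringIO(raw), sep=";")

print(small.shape)          # (2, 3)
print(list(small.columns))  # ['Id', 'SepalLengthCm', 'Species']
```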

We can use .head() to see the first five rows and .tail() to see the last five. If we pass an integer, it sets the number of rows to display.

In [None]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
df.tail()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica


In [None]:
df.head(2)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa


Pandas columns have data types, such as integer or floating point. The type of data affects everything from subsetting to the way machine learning models interpret the data. Column data types can be accessed with the .dtypes attribute.

In [None]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

The data type of a column can be changed with the .astype method. The parameter will be the new data type.

In [None]:
df["SepalLengthCm"]  = df["SepalLengthCm"].astype(str)

In [None]:
df.dtypes

Id                 int64
SepalLengthCm     object
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

In [None]:
df["SepalLengthCm"]  = df["SepalLengthCm"].astype(float)

In [None]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object
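
When several columns need converting at once, .astype also accepts a dictionary that maps column names to types. The sketch below uses a small made-up DataFrame (the column names length and count are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data: numbers stored as strings in one column.
sample = pd.DataFrame({"length": ["5.1", "4.9"], "count": [1, 2]})

# Convert two columns in one call; astype returns a new DataFrame.
converted = sample.astype({"length": "float64", "count": "int32"})

print(converted.dtypes)
```

Because .astype returns a new object, the result has to be assigned back if we want to keep it.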

We can access the column names with the .columns attribute.

In [None]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

Note that Id is read as an ordinary column. We can make a column the index with the .set_index method.

In [None]:
df = df.set_index("Id")
df.head()

Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa


We can rename all the columns at once by assigning a list of new names to the .columns attribute. The list must have one name per column.

In [None]:
df.columns = ["a", "b", "c", "d", "e"]

In [None]:
df.head()

Id,a,b,c,d,e
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
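
Assigning to .columns replaces every name. If only some columns should change, the .rename method takes a mapping from old to new names. This is a sketch on a small made-up DataFrame, not the notebook's df.

```python
import pandas as pd

# Hypothetical two-column frame using names from the Iris data.
sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9],
    "Species": ["Iris-setosa", "Iris-setosa"],
})

# Only the columns listed in the mapping are renamed; the original is unchanged.
renamed = sample.rename(columns={"SepalLengthCm": "a"})

print(list(renamed.columns))  # ['a', 'Species']
print(list(sample.columns))   # ['SepalLengthCm', 'Species']
```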


To access columns, we enclose the column name in square brackets. Accessing a column results in a Series. Selecting multiple columns at the same time returns a DataFrame.

In [None]:
df["a"]

Id
1      5.1
2      4.9
3      4.7
4      4.6
5      5.0
      ... 
146    6.7
147    6.3
148    6.5
149    6.2
150    5.9
Name: a, Length: 150, dtype: float64

In [None]:
df[["a", "b"]]

Id,a,b
1,5.1,3.5
2,4.9,3.0
3,4.7,3.2
4,4.6,3.1
5,5.0,3.6
...,...,...
146,6.7,3.0
147,6.3,2.5
148,6.5,3.0
149,6.2,3.4


#### Delete columns

Columns can be dropped with the .drop method.
Setting axis to 1 tells Pandas to look for the labels to remove among the columns rather than the rows. Either a single column name or a list of column names can be passed. The method returns a new DataFrame and leaves the original unchanged.

In [None]:
df_1 = df.drop("a", axis = 1)
df_1.head()

Id,b,c,d,e
1,3.5,1.4,0.2,Iris-setosa
2,3.0,1.4,0.2,Iris-setosa
3,3.2,1.3,0.2,Iris-setosa
4,3.1,1.5,0.2,Iris-setosa
5,3.6,1.4,0.2,Iris-setosa


In [None]:
df_2 = df.drop(["a", "b"], axis = 1)
df_2.head()

Id,c,d,e
1,1.4,0.2,Iris-setosa
2,1.4,0.2,Iris-setosa
3,1.3,0.2,Iris-setosa
4,1.5,0.2,Iris-setosa
5,1.4,0.2,Iris-setosa
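
As an alternative to passing axis, .drop also accepts a columns keyword, which some readers find easier to read. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

# Hypothetical frame with three columns.
sample = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

without_a = sample.drop(columns="a")          # same effect as drop("a", axis=1)
without_ab = sample.drop(columns=["a", "b"])  # a list removes several columns

print(list(without_a.columns))   # ['b', 'c']
print(list(without_ab.columns))  # ['c']
print(list(sample.columns))      # ['a', 'b', 'c'] (original is untouched)
```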


iloc[] selects rows and columns by integer position, with the same syntax used to subset 2D NumPy arrays: iloc[rows,columns].

Note that the index starts at 1, as we used the id column in the iris.csv file earlier. We introduce the reset_index() method to reset the index to 0 as we start using iloc.

In [None]:
df.head()

Id,a,b,c,d,e
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
df = df.reset_index()

In [None]:
df.head()

Unnamed: 0,Id,a,b,c,d,e
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


#### .iloc and .loc

In the following example we will get the first row and all the columns.

In [None]:
df.iloc[0]

Id              1
a             5.1
b             3.5
c             1.4
d             0.2
e     Iris-setosa
Name: 0, dtype: object

We can also pass a list of row positions to select a subset.

In [None]:
df.iloc[[0,1]]

Unnamed: 0,Id,a,b,c,d,e
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa


The following example will return the first five rows and only the first column.

In [None]:
df.iloc[0:5,0:1]

Unnamed: 0,Id
0,1
1,2
2,3
3,4
4,5


Instead of slicing with a colon, we can use lists to specify exactly which row and column positions we want in the subset.

In [None]:
df.iloc[[1],[0,1]]

Unnamed: 0,Id,a
1,2,4.9


If the index is made of strings, we can use .loc to subset by label. Here's the difference between .loc and .iloc with a DataFrame that has a character-based index.

Note that .iloc is hard to use when we don't know the position of a row, for example when there are millions of rows of data. In that case it is more practical to look the row up by label with .loc.

In [None]:
tmp = pd.DataFrame([[1,2,3],[3,4,5]], index = ["a", "b"])
tmp

Unnamed: 0,0,1,2
a,1,2,3
b,3,4,5


In [None]:
tmp.loc["a"]

0    1
1    2
2    3
Name: a, dtype: int64

In [None]:
tmp.iloc[0]

0    1
1    2
2    3
Name: a, dtype: int64

In [None]:
tmp.loc[["a", "b"]]

Unnamed: 0,0,1,2
a,1,2,3
b,3,4,5
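
Both accessors also take a column selector after the comma. The sketch below rebuilds the same tmp DataFrame and picks single cells and a sub-table; note that the column labels here happen to be the integers 0, 1, 2, so .loc uses them as labels while .iloc uses them as positions.

```python
import pandas as pd

# Same small frame as above: string row labels, default integer column labels.
tmp = pd.DataFrame([[1, 2, 3], [3, 4, 5]], index=["a", "b"])

cell_by_label = tmp.loc["b", 2]      # row label "b", column label 2
cell_by_position = tmp.iloc[1, 2]    # second row, third column
sub_table = tmp.loc[["a"], [0, 2]]   # lists keep the result a DataFrame

print(cell_by_label, cell_by_position)  # 5 5
print(sub_table.shape)                  # (1, 2)
```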


#### Descriptive Statistics

With the .describe() method, summary statistics are obtained to understand how the data is arranged.

In [None]:
df.head()

Unnamed: 0,Id,a,b,c,d,e
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
df.describe()

Unnamed: 0,Id,a,b,c,d
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


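The .corr method creates a matrix of pairwise correlations between the numeric columns of our DataFrame. In the Pandas version used here, non-numeric columns such as e are left out automatically; from Pandas 2.0 onward they have to be excluded explicitly, for example with numeric_only=True.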

In [None]:
df.corr()

Unnamed: 0,Id,a,b,c,d
Id,1.0,0.716676,-0.397729,0.882747,0.899759
a,0.716676,1.0,-0.109369,0.871754,0.817954
b,-0.397729,-0.109369,1.0,-0.420516,-0.356544
c,0.882747,0.871754,-0.420516,1.0,0.962757
d,0.899759,0.817954,-0.356544,0.962757,1.0
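
One way to make the correlation call independent of the Pandas version is to select the numeric columns first with select_dtypes. This is a sketch on made-up data (the column names x, y, and label are assumptions), where y is exactly twice x, so the correlation should be 1.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column.
sample = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "label": ["p", "q", "r", "s"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = sample.select_dtypes(include="number")
corr_matrix = numeric.corr()

print(list(corr_matrix.columns))            # ['x', 'y']
print(round(corr_matrix.loc["x", "y"], 6))  # 1.0
```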


#### Sums of rows and columns

As with NumPy arrays, we can do calculations across rows and columns with the axis argument. Instead of giving us a NumPy array as a result, it returns a Series. In the following example we show you the sum operation, although we can do others like max or mean. This is similar to what we saw in NumPy.

In [None]:
df.sum(axis = 0)

Id                                                11325
a                                                 876.5
b                                                 458.1
c                                                 563.8
d                                                 179.8
e     Iris-setosaIris-setosaIris-setosaIris-setosaIr...
dtype: object

In [None]:
df.sum(axis = 1)

0       11.2
1       11.5
2       12.4
3       13.4
4       15.2
       ...  
145    163.2
146    162.7
147    164.7
148    166.3
149    165.8
Length: 150, dtype: float64
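
The axis argument is easier to follow on a tiny table. In this sketch (made-up numbers), axis=0 collapses the rows and gives one result per column, while axis=1 collapses the columns and gives one result per row.

```python
import pandas as pd

# Hypothetical 3x2 numeric frame.
sample = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

column_totals = sample.sum(axis=0)  # one value per column
row_totals = sample.sum(axis=1)     # one value per row
column_means = sample.mean(axis=0)  # other reductions take the same argument

print(column_totals.tolist())  # [6, 60]
print(row_totals.tolist())     # [11, 22, 33]
print(column_means.tolist())   # [2.0, 20.0]
```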

We add a column with the syntax DataFrame[column_name] = values. Here we assign the single value 5, which Pandas repeats for every row.

In [None]:
df["new_col"] = 5
df.head()

Unnamed: 0,Id,a,b,c,d,e,new_col
0,1,5.1,3.5,1.4,0.2,Iris-setosa,5
1,2,4.9,3.0,1.4,0.2,Iris-setosa,5
2,3,4.7,3.2,1.3,0.2,Iris-setosa,5
3,4,4.6,3.1,1.5,0.2,Iris-setosa,5
4,5,5.0,3.6,1.4,0.2,Iris-setosa,5


#### Add columns

We can also set column values using a list. In the following example, we use a list comprehension, and with the range function, we make a column for 0-149.

In [None]:
df["new_col_1"] = [x for x in range(150)]

In [None]:
df.head()

Unnamed: 0,Id,a,b,c,d,e,new_col,new_col_1
0,1,5.1,3.5,1.4,0.2,Iris-setosa,5,0
1,2,4.9,3.0,1.4,0.2,Iris-setosa,5,1
2,3,4.7,3.2,1.3,0.2,Iris-setosa,5,2
3,4,4.6,3.1,1.5,0.2,Iris-setosa,5,3
4,5,5.0,3.6,1.4,0.2,Iris-setosa,5,4


#### Subsets

DataFrames can be subset by setting rules. To do so, we use square brackets. Take a look at the following example, which will return a True if the condition is met or a False if it is not. We can pass this Boolean Series with square brackets to make a subset of the DataFrame.

In [None]:
df["a"] > 6

0      False
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149    False
Name: a, Length: 150, dtype: bool

In [None]:
tmp = df[df["a"] > 6]
tmp.shape

(61, 8)

We can also use the & sign to specify that rules 1 and 2 have to be met.

In [None]:
tmp_1 = df[((df["a"] > 6) & (df["b"] > 3))]
tmp_1.shape

(23, 8)

We could replace & with | to specify that rule 1 or rule 2 must be met. The & and | operators let us stack several criteria in one filter. Each rule must be wrapped in its own parentheses.

In [None]:
tmp_2 = df[((df["a"] > 6) | (df["b"] > 3))]
tmp_2.shape

(105, 8)

The .isin method filters a DataFrame by requiring that a column's value be in a list of items. Take a look at the following example. We keep the rows where column e is "Iris-setosa" or "Iris-versicolor". This is useful when we have a very long list of items to filter on. In the example we could just write two rules and separate them with |, but in many cases it would not be practical to write out every rule.

In [None]:
df.head()

Unnamed: 0,Id,a,b,c,d,e,new_col,new_col_1
0,1,5.1,3.5,1.4,0.2,Iris-setosa,5,0
1,2,4.9,3.0,1.4,0.2,Iris-setosa,5,1
2,3,4.7,3.2,1.3,0.2,Iris-setosa,5,2
3,4,4.6,3.1,1.5,0.2,Iris-setosa,5,3
4,5,5.0,3.6,1.4,0.2,Iris-setosa,5,4


In [None]:
tmp_3 = df[df["e"].isin(["Iris-setosa", "Iris-versicolor"])]
tmp_3.shape

(100, 8)

We can also use the str.contains method to filter rows where a column has certain strings. In this example we filter rows where column e contains the string "setosa".

In [None]:
tmp_4 = df[df["e"].str.contains("setosa")]
tmp_4.shape

(50, 8)

The ~ symbol in front of a Boolean Series converts True to False, and vice versa. The resulting subset therefore contains the rows that do not meet the rule.

In [None]:
tmp_5 = df[~df["e"].str.contains("setosa")]
tmp_5.shape

(100, 8)

In [None]:
tmp_6 = df[(df["a"] > 6) & (df["b"] > 3)]
tmp_6.shape

(23, 8)

In [None]:
tmp_7 = df[~((df["a"] > 6) & (df["b"] > 3))]
tmp_7.shape

(127, 8)
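
The shapes above suggest a useful check: a rule and its negation always split the rows between them (23 + 127 = 150). The following sketch reproduces that on a small made-up DataFrame and stores the masks in variables, which tends to keep long filters readable.

```python
import pandas as pd

# Hypothetical measurements, loosely modelled on columns a and b above.
sample = pd.DataFrame({"a": [5.0, 6.5, 7.0, 6.1], "b": [3.5, 2.9, 3.2, 3.0]})

# Each comparison is wrapped in parentheses before combining with & or |.
both = (sample["a"] > 6) & (sample["b"] > 3)
either = (sample["a"] > 6) | (sample["b"] > 3)

print(both.sum(), either.sum())                  # 1 4
print(len(sample[both]) + len(sample[~both]))    # 4, the full row count

# A mask can also be combined with a column selection through .loc.
selected = sample.loc[both, ["a"]]
print(selected.shape)                            # (1, 1)
```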

#### Combinations and concatenations

To demonstrate how to combine or concatenate DataFrames in Pandas, we will create several test DataFrames.

In [None]:
columns = ["a","b"]

a = pd.DataFrame([[1,1], [2,2]], columns = columns)
b = pd.DataFrame([[3,3], [4,4]], columns = columns)
c = pd.DataFrame([[5,5], [6,6]], columns = columns)

In [None]:
a

Unnamed: 0,a,b
0,1,1
1,2,2


In [None]:
b

Unnamed: 0,a,b
0,3,3
1,4,4


In [None]:
c

Unnamed: 0,a,b
0,5,5
1,6,6


With the concat function we can pass a list of DataFrames and stack them vertically. Note that when we stack DataFrames the index is no longer unique, so it's a good idea to call the reset_index() method afterwards.

In [None]:
d = pd.concat([a,b,c])
d

Unnamed: 0,a,b
0,1,1
1,2,2
0,3,3
1,4,4
0,5,5
1,6,6


In [None]:
d = d.reset_index(drop = True)
d

Unnamed: 0,a,b
0,1,1
1,2,2
2,3,3
3,4,4
4,5,5
5,6,6


In the next example the columns are not the same in each DataFrame. The concat function returns nulls for the cells that have no value.

In [None]:
columns = ["a","b"]

a = pd.DataFrame([[1,1], [2,2]], columns = columns)
b = pd.DataFrame([[3,3], [4,4]], columns = ["e", "f"])

d = pd.concat([a,b]).reset_index(drop = True)
d

Unnamed: 0,a,b,e,f
0,1.0,1.0,,
1,2.0,2.0,,
2,,,3.0,3.0
3,,,4.0,4.0


If we change the axis to 1, the DataFrames are placed side by side (stacked horizontally), aligned on the index.

In [None]:
columns = ["a","b"]

a = pd.DataFrame([[1,1], [2,2]], columns = columns)
b = pd.DataFrame([[3,3], [4,4]], columns = ["e", "f"])

d = pd.concat([a,b], axis = 1).reset_index(drop = True)
d

Unnamed: 0,a,b,e,f
0,1,1,3,3
1,2,2,4,4
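
As a sketch of a shortcut for the vertical case, concat accepts ignore_index=True, which renumbers the rows in the same call instead of a separate reset_index(drop = True). The horizontal example below uses add_prefix only to keep the column names distinct; the prefix other_ is an arbitrary choice for illustration.

```python
import pandas as pd

a = pd.DataFrame([[1, 1], [2, 2]], columns=["a", "b"])
b = pd.DataFrame([[3, 3], [4, 4]], columns=["a", "b"])

# Vertical stack with a fresh 0..n-1 index in one step.
stacked = pd.concat([a, b], ignore_index=True)

# Side-by-side stack; rows are matched on the index labels 0 and 1.
side_by_side = pd.concat([a, b.add_prefix("other_")], axis=1)

print(list(stacked.index))         # [0, 1, 2, 3]
print(list(side_by_side.columns))  # ['a', 'b', 'other_a', 'other_b']
```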


To show merges (joins) in Pandas we will create two DataFrames. To build them we pass a dictionary where the keys are the column names and the values are the cell values. Here we use NumPy arrays to store the values of one column and a single number, which is repeated in every row, for the other.

In [None]:
col_1 = np.array(["A", "B", "C", "D", "E"])
col_2 = np.array(["A", "B", "C"])

a = pd.DataFrame({
    "col_1":col_1,
    "col_1_ind":1
})

b = pd.DataFrame({
    "col_1":col_2,
    "col_2_ind":1
})

a

Unnamed: 0,col_1,col_1_ind
0,A,1
1,B,1
2,C,1
3,D,1
4,E,1


In [None]:
b

Unnamed: 0,col_1,col_2_ind
0,A,1
1,B,1
2,C,1


We can combine DataFrames with the merge method. Its first parameter is the other DataFrame, which becomes the "right" DataFrame; the DataFrame the method is called on is the "left" one. The how parameter specifies the type of join, while left_on and right_on name the columns to join on in each DataFrame. If the column name is the same in both DataFrames, simply use the on parameter.

In [None]:
a.merge(b, how = "inner", left_on = "col_1", right_on = "col_1")

Unnamed: 0,col_1,col_1_ind,col_2_ind
0,A,1,1
1,B,1,1
2,C,1,1


In [None]:
a.merge(b, how = "inner", on = "col_1")

Unnamed: 0,col_1,col_1_ind,col_2_ind
0,A,1,1
1,B,1,1
2,C,1,1


Next, we perform a “left join”, which keeps every row of the “left” DataFrame (a in this case) and returns nulls where the right DataFrame has no match.

In [None]:
c = a.merge(b, how = "left", left_on = "col_1", right_on = "col_1")
c

Unnamed: 0,col_1,col_1_ind,col_2_ind
0,A,1,1.0
1,B,1,1.0
2,C,1,1.0
3,D,1,
4,E,1,
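
To see which rows found a match, merge offers an indicator parameter that adds a _merge column with the values both, left_only, or right_only. This sketch rebuilds a and b with plain lists so it runs on its own.

```python
import pandas as pd

# Same content as the a and b DataFrames above.
a = pd.DataFrame({"col_1": ["A", "B", "C", "D", "E"], "col_1_ind": 1})
b = pd.DataFrame({"col_1": ["A", "B", "C"], "col_2_ind": 1})

# indicator=True records where each row came from.
merged = a.merge(b, how="left", on="col_1", indicator=True)

# Rows of a that had no partner in b.
unmatched = merged[merged["_merge"] == "left_only"]

print(list(merged["_merge"]))    # ['both', 'both', 'both', 'left_only', 'left_only']
print(list(unmatched["col_1"]))  # ['D', 'E']
```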


#### Deal with null values

In the previous example we saw that Pandas returns nulls when there is nothing to match in the merge. We can use the .isnull() method to see which cells are null.

In [None]:
c.isnull()

Unnamed: 0,col_1,col_1_ind,col_2_ind
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,True
4,False,False,True


We can then chain the sum method after the isnull method to find the number of null values in each column. Trues are treated as ones and Falses as zeros, so a total greater than zero means there are nulls in that column.

In [None]:
c.isnull().sum()

col_1        0
col_1_ind    0
col_2_ind    2
dtype: int64

Null values can be "filled" with the fillna method, which includes the value to use as a parameter.

In [None]:
c.fillna(0)

Unnamed: 0,col_1,col_1_ind,col_2_ind
0,A,1,1.0
1,B,1,1.0
2,C,1,1.0
3,D,1,0.0
4,E,1,0.0


We could also use the .dropna method to drop the rows that have null values.

In [None]:
c.dropna()

Unnamed: 0,col_1,col_1_ind,col_2_ind
0,A,1,1.0
1,B,1,1.0
2,C,1,1.0
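
Both methods can be made more selective. As a hedged sketch on a small made-up DataFrame shaped like c, fillna accepts a dictionary of per-column fill values, dropna(subset=...) only looks at the listed columns, and dropna(axis=1) drops columns instead of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with nulls in a single column.
c = pd.DataFrame({
    "col_1": ["A", "B", "C"],
    "col_1_ind": [1, 1, 1],
    "col_2_ind": [1.0, np.nan, np.nan],
})

filled = c.fillna({"col_2_ind": 0})         # fill value chosen per column
kept_rows = c.dropna(subset=["col_2_ind"])  # drop rows with a null in this column
kept_columns = c.dropna(axis=1)             # drop columns that contain any null

print(filled["col_2_ind"].tolist())  # [1.0, 0.0, 0.0]
print(len(kept_rows))                # 1
print(list(kept_columns.columns))    # ['col_1', 'col_1_ind']
```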


We will treat null values according to the problem we need to solve. In the last example we've seen, we pad nulls with a zero, but that might not be the most appropriate way to deal with nulls in another situation.

#### Add data to files

Pandas offers a wide variety of options for writing data to csv or Excel files. The to_csv and to_excel methods write a DataFrame to either kind of file.

In [None]:
df.head(1)

Unnamed: 0,Id,a,b,c,d,e,new_col,new_col_1
0,1,5.1,3.5,1.4,0.2,Iris-setosa,5,0


In [None]:
df.to_csv("test.csv")

In [None]:
df.to_excel("test.xlsx")
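
By default both methods also write the index as the first column. When such a csv file is read back, that column shows up as Unnamed: 0. The sketch below demonstrates this with in-memory buffers instead of real files (the sample data is made up); passing index=False avoids the extra column.

```python
import io
import pandas as pd

# Hypothetical two-row frame.
sample = pd.DataFrame({"Id": [1, 2], "Species": ["Iris-setosa", "Iris-setosa"]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Id', 'Species']
print(list(reloaded.columns))             # ['Id', 'Species']
```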

ExcelWriter opens an Excel file for writing so that several DataFrames can be written to different sheets. For example, if we have five DataFrames, we can write each one to a different tab with the to_excel method, passing the ExcelWriter object as the first parameter and the tab name as sheet_name. When we finish writing the file, we close it with the close method.

In [None]:
writer = pd.ExcelWriter('test.xlsx')

In [None]:
df.to_excel(writer, sheet_name='df_1')
df.to_excel(writer, sheet_name='df_2')

In [None]:
writer.close()

Just as we load csv files with read_csv, we load Excel files with the read_excel function. Its sheet_name parameter lets us specify which sheet to load.