## Week 1: Pandas Review

### Structured Data Manipulation and Data Wrangling

- Pandas data frames: indexing, slicing and subsetting
- Handling categorical data with Pandas: binning and encoding
- Grouping and aggregation with Pandas
- Concatenating and merging data frames
- Dealing with missing data and duplicates 
- Lab: Structured Data Manipulation

In [1]:
import numpy as np
import pandas as pd

### What is Structured Data?
#### Pandas vs NumPy objects. Row Index

In [2]:
s1 = pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'], name = "Column_1") ## custom index and name are optional
s1  ## Pandas Series is a 1d array with an index (i.e., custom row names)

a    1
b    2
c    3
d    4
Name: Column_1, dtype: int64

In [3]:
s2 = pd.Series([10,20,30,40], index = ['a', 'b', 'c', 'd'])

In [4]:
np.array([1, 2, 3, 4])

array([1, 2, 3, 4])

In [5]:
## element-wise mathematical operations with numerical series
s2 / s1

a    10.0
b    10.0
c    10.0
d    10.0
dtype: float64

In [6]:
## operations between series are aligned on the row index, not on position;
## labels present in only one series give NaN
s2 = pd.Series([10,20,30,40], index = ['b', 'c', 'd', 'e'])
s2 / s1

a         NaN
b    5.000000
c    6.666667
d    7.500000
e         NaN
dtype: float64
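
The contrast with NumPy can be made explicit. The sketch below reuses the values from the cells above; the names `aligned` and `positional` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
s2 = pd.Series([10, 20, 30, 40], index=["b", "c", "d", "e"])

# Series arithmetic matches values by index label; labels found in only
# one of the two series produce NaN in the result.
aligned = s2 / s1

# NumPy arrays carry no labels, so the same division is purely positional.
positional = s2.to_numpy() / s1.to_numpy()

print(aligned)
print(positional)
```

If positional behaviour is what you want, converting to arrays first (as above) is one way to opt out of label alignment.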

### Pandas Data Frame


#### Reading and writing data from/to disk

In [7]:
## import data from a CSV file
df_iris = pd.read_csv("../data/iris.csv") #review options and arguments
df_iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [8]:
# writing to disk
## df.to_csv("myFileName.csv")
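
A minimal round-trip sketch, assuming a small made-up frame and an in-memory buffer instead of a file on disk. It shows one option worth reviewing: by default `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for a real data set.
df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

# Default behaviour: the row index is written as an unnamed first column.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False leaves the row index out of the file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Alternatively, `pd.read_csv(..., index_col=0)` restores the first column as the row index when reading.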

Pandas `read_` functions: read from text, JSON, Excel, HTML, SQL, etc. (there is no built-in PDF reader; PDF tables need a third-party package).

https://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening

In [9]:
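## the file has 16 columns but only 15 names are supplied,
## so pandas uses the first column as the row index (see the output below)
pd.read_table('../data/crx.data', 
              sep = ',', header = None, names = list('ABCDEFGHIJKLMNO')).head()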

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [10]:
### from json
pd.read_json("https://api.exchangerate-api.com/v4/latest/GBP").head()

Unnamed: 0,provider,WARNING_UPGRADE_TO_V6,terms,base,date,time_last_updated,rates
AED,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,GBP,2024-07-06,1720224001,4.7
AFN,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,GBP,2024-07-06,1720224001,90.6
ALL,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,GBP,2024-07-06,1720224001,118.56
AMD,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,GBP,2024-07-06,1720224001,495.72
ANG,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,GBP,2024-07-06,1720224001,2.29


In [11]:
### from html: Pandas will look for anything that looks like a table
### and will return a list of data frames

djia = pd.read_html('https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average')

https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average

In [12]:
len(djia)

23

In [13]:
type(djia)

list

In [14]:
df_djia = djia[1]
df_djia.head()
## if you cannot access the internet from Jupyter, use df_djia = pd.read_csv('djia.csv')

Unnamed: 0,Company,Exchange,Symbol,Industry,Date added,Notes,Index weighting
0,3M,NYSE,MMM,Conglomerate,1976-08-09,As Minnesota Mining and Manufacturing,1.54%
1,American Express,NYSE,AXP,Financial services,1982-08-30,,3.64%
2,Amgen,NASDAQ,AMGN,Biopharmaceutical,2020-08-31,,4.80%
3,Amazon,NASDAQ,AMZN,Retailing,2024-02-26,,2.93%
4,Apple,NASDAQ,AAPL,Information technology,2015-03-19,,3.04%


#### Inspecting a data frame

In [15]:
## methods and attributes:
## info, describe, head, shape, dtypes, nunique, columns, index
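
The inspection tools listed above can be tried on any frame. The sketch below uses a small made-up frame (not the course data) so that it runs without external files.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate the inspection calls.
df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 5.8],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

df.info()                 # column names, non-null counts, dtypes, memory use
print(df.describe())      # summary statistics for the numeric columns
print(df.head(2))         # first two rows
print(df.shape)           # (rows, columns); an attribute, so no parentheses
print(df.dtypes)          # dtype of each column
print(df.nunique())       # number of distinct values per column
print(df.columns)         # column labels
print(df.index)           # row labels
```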

### Subsetting, indexing and slicing Pandas data frames

In [16]:
## select columns

df_iris[["sepal_length", "sepal_width"]].head()  ##list of column names

Unnamed: 0,sepal_length,sepal_width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6


In [17]:
## selecting a single column with [] and [[]]. What's the difference?

df_iris[['species']].head()  ## this is a data frame with one column

Unnamed: 0,species
0,setosa
1,setosa
2,setosa
3,setosa
4,setosa


In [18]:
df_iris['species'].head() ## and this is a series!

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object

In [19]:
df_iris.species.head()  ## attribute-style access also works, but fails for names with spaces or names that clash with DataFrame methods

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object
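
A short sketch of where attribute-style access breaks down. The frame and its column names are made up for the illustration.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another shares
# its name with the DataFrame.count() method.
df = pd.DataFrame({"petal width": [0.2, 1.4], "count": [3, 7]})

by_bracket = df["count"]   # always returns the column
by_attribute = df.count    # returns the bound method, not the column

print(type(by_bracket).__name__)
print(callable(by_attribute))
# df.petal width is not valid Python; only df["petal width"] works here.
```

For this reason bracket notation is the safer default in scripts and shared notebooks.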

In [20]:
df_iris[:6]  ## a bare slice inside [] selects rows (a single label inside [] selects a column)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa


In [21]:
### 2-dimensional subsetting using row index and column names (.loc)

df_iris.loc[10:12, ['sepal_width', 'species']]  ### end-inclusive!

Unnamed: 0,sepal_width,species
10,3.7,setosa
11,3.4,setosa
12,3.0,setosa


In [22]:
## .loc also works for custom row index
df_iris.index = ["Row_" + str(i) for i in range(150)]
df_iris.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
Row_0,5.1,3.5,1.4,0.2,setosa
Row_1,4.9,3.0,1.4,0.2,setosa
Row_2,4.7,3.2,1.3,0.2,setosa


In [23]:
df_iris.loc[['Row_2', "Row_10"], ['sepal_width', 'sepal_length']]

Unnamed: 0,sepal_width,sepal_length
Row_2,3.2,4.7
Row_10,3.7,5.4


In [24]:
## numerical indexing NumPy-style with .iloc

df_iris.iloc[0:3,2:4]

Unnamed: 0,petal_length,petal_width
Row_0,1.4,0.2
Row_1,1.4,0.2
Row_2,1.3,0.2
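
The difference between the two slicing rules is easiest to see side by side. A minimal sketch on a made-up frame with the default integer index:

```python
import pandas as pd

# Hypothetical frame with a default integer row index 0..4.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

by_label = df.loc[1:3]      # label slice: rows 1, 2 and 3 (end included)
by_position = df.iloc[1:3]  # positional slice: rows 1 and 2 (end excluded)

print(len(by_label), len(by_position))
```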


In [25]:
df_iris.iloc[[1, 6, 12, 11, 6],[4, 3, 2, 1, 4]] ## pass row and column indices as lists

Unnamed: 0,species,petal_width,petal_length,sepal_width,species.1
Row_1,setosa,0.2,1.4,3.0,setosa
Row_6,setosa,0.3,1.4,3.4,setosa
Row_12,setosa,0.1,1.4,3.0,setosa
Row_11,setosa,0.2,1.6,3.4,setosa
Row_6,setosa,0.3,1.4,3.4,setosa


#### Conditional Subsetting

In [26]:
df_iris[df_iris['petal_width'] > 1].head(10)  ## keep only the rows that meet a condition

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
Row_50,7.0,3.2,4.7,1.4,versicolor
Row_51,6.4,3.2,4.5,1.5,versicolor
Row_52,6.9,3.1,4.9,1.5,versicolor
Row_53,5.5,2.3,4.0,1.3,versicolor
Row_54,6.5,2.8,4.6,1.5,versicolor
Row_55,5.7,2.8,4.5,1.3,versicolor
Row_56,6.3,3.3,4.7,1.6,versicolor
Row_58,6.6,2.9,4.6,1.3,versicolor
Row_59,5.2,2.7,3.9,1.4,versicolor
Row_61,5.9,3.0,4.2,1.5,versicolor


In [27]:
df_iris[(df_iris['petal_width'] > 0.4) & (df_iris['species'] == 'setosa')].head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
Row_23,5.1,3.3,1.7,0.5,setosa
Row_43,5.0,3.5,1.6,0.6,setosa
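
Boolean masks combine with `&` (and), `|` (or) and `~` (not). When conditions are written inline, each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `==`. The sketch below uses a made-up four-row frame; `DataFrame.query` is shown as an alternative spelling of the same filter.

```python
import pandas as pd

# Hypothetical stand-in for a few iris rows.
df = pd.DataFrame({
    "petal_width": [0.2, 0.5, 1.4, 2.3],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

wide = df["petal_width"] > 0.4
setosa = df["species"] == "setosa"

both = df[wide & setosa]       # AND: both conditions hold
either = df[wide | setosa]     # OR: at least one condition holds
not_setosa = df[~setosa]       # NOT: invert a mask

# DataFrame.query expresses the same AND filter as a string.
same_as_both = df.query("petal_width > 0.4 and species == 'setosa'")

print(both)
```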


In [28]:
df_iris.loc[df_iris['petal_width'] > 2, 'species'].head(10)

Row_100    virginica
Row_102    virginica
Row_104    virginica
Row_105    virginica
Row_109    virginica
Row_112    virginica
Row_114    virginica
Row_115    virginica
Row_117    virginica
Row_118    virginica
Name: species, dtype: object

In [29]:
df_iris[df_iris["species"].isin(["setosa", "virginica"])]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
Row_0,5.1,3.5,1.4,0.2,setosa
Row_1,4.9,3.0,1.4,0.2,setosa
Row_2,4.7,3.2,1.3,0.2,setosa
Row_3,4.6,3.1,1.5,0.2,setosa
Row_4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
Row_145,6.7,3.0,5.2,2.3,virginica
Row_146,6.3,2.5,5.0,1.9,virginica
Row_147,6.5,3.0,5.2,2.0,virginica
Row_148,6.2,3.4,5.4,2.3,virginica


In [30]:
df_iris[df_iris.index.isin(["Row_0", "Row_41", "Row_9"])]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
Row_0,5.1,3.5,1.4,0.2,setosa
Row_9,4.9,3.1,1.5,0.1,setosa
Row_41,4.5,2.3,1.3,0.3,setosa


In [31]:
df_iris.loc[:, df_iris.columns.str.contains('sepal')].head()

Unnamed: 0,sepal_length,sepal_width
Row_0,5.1,3.5
Row_1,4.9,3.0
Row_2,4.7,3.2
Row_3,4.6,3.1
Row_4,5.0,3.6


### Data Wrangling

#### Data Transformation and Calculated Columns

In [32]:
df_djia.head()

Unnamed: 0,Company,Exchange,Symbol,Industry,Date added,Notes,Index weighting
0,3M,NYSE,MMM,Conglomerate,1976-08-09,As Minnesota Mining and Manufacturing,1.54%
1,American Express,NYSE,AXP,Financial services,1982-08-30,,3.64%
2,Amgen,NASDAQ,AMGN,Biopharmaceutical,2020-08-31,,4.80%
3,Amazon,NASDAQ,AMZN,Retailing,2024-02-26,,2.93%
4,Apple,NASDAQ,AAPL,Information technology,2015-03-19,,3.04%


In [33]:
## apply a function to a column: remove "%" from the last column's values (the result is still text)

df_djia["Index weighting"].apply(lambda x: x.split("%")[0]).head()

0    1.54
1    3.64
2    4.80
3    2.93
4    3.04
Name: Index weighting, dtype: object

In [34]:
## convert character strings to datetime objects with pd.to_datetime()

pd.to_datetime(df_djia["Date added"]).head()

0   1976-08-09
1   1982-08-30
2   2020-08-31
3   2024-02-26
4   2015-03-19
Name: Date added, dtype: datetime64[ns]
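
Neither cell above stores its result. A sketch of how the two transformations might be saved as calculated columns, using two made-up rows in the same shape as the DJIA table; the new column names (`weight_pct`, `date_added`, `year_added`) are assumptions for this example.

```python
import pandas as pd

# Hypothetical rows in the same shape as the DJIA table.
df = pd.DataFrame({
    "Date added": ["1976-08-09", "2020-08-31"],
    "Index weighting": ["1.54%", "4.80%"],
})

# Vectorised string methods can replace the lambda; astype(float) makes
# the result numeric instead of leaving it as text.
df["weight_pct"] = df["Index weighting"].str.rstrip("%").astype(float)

# Once parsed, datetime components are available through the .dt accessor.
df["date_added"] = pd.to_datetime(df["Date added"])
df["year_added"] = df["date_added"].dt.year

print(df.dtypes)
```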

#### Categorical data type

In [35]:
pd.Categorical(df_iris["species"])

['setosa', 'setosa', 'setosa', 'setosa', 'setosa', ..., 'virginica', 'virginica', 'virginica', 'virginica', 'virginica']
Length: 150
Categories (3, object): ['setosa', 'versicolor', 'virginica']

In [36]:
pd.Categorical(df_iris["species"], ordered = True) ## ordered = True is meant for ordinal data; species is nominal, so this ordering has no real meaning

['setosa', 'setosa', 'setosa', 'setosa', 'setosa', ..., 'virginica', 'virginica', 'virginica', 'virginica', 'virginica']
Length: 150
Categories (3, object): ['setosa' < 'versicolor' < 'virginica']

#### Binning: convert numerical to categorical

In [37]:
pd.cut(df_iris["sepal_length"], bins = 3, labels = ["low", "mid", "high"])[::15]  ## this is an ordinal categorical column

Row_0       low
Row_15      mid
Row_30      low
Row_45      low
Row_60      low
Row_75      mid
Row_90      low
Row_105    high
Row_120    high
Row_135    high
Name: sepal_length, dtype: category
Categories (3, object): ['low' < 'mid' < 'high']

In [38]:
pd.cut(df_iris["sepal_length"], bins = 3, labels = ["low", "mid", "high"]).value_counts()

mid     71
low     59
high    20
Name: sepal_length, dtype: int64
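
`pd.cut` with an integer `bins` splits the range of the data into equal-width intervals, so the bins can end up with very different counts. `pd.qcut` is the quantile-based alternative. The sketch below compares the two on made-up, deliberately skewed values.

```python
import pandas as pd

# Hypothetical skewed values: eight small numbers and one outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ["low", "mid", "high"]

# pd.cut: three bins of equal width across the range of the data;
# retbins=True also returns the bin edges.
equal_width, edges = pd.cut(values, bins=3, labels=labels, retbins=True)

# pd.qcut: three bins holding roughly the same number of observations.
equal_count = pd.qcut(values, q=3, labels=labels)

print(edges)
print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

Which one is appropriate depends on whether the bin boundaries or the bin sizes matter more for the analysis.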

#### Dummy Columns: convert categorical to binary (0 or 1)

In [40]:
pd.get_dummies(df_iris["species"])[::15]

Unnamed: 0,setosa,versicolor,virginica
Row_0,1,0,0
Row_15,1,0,0
Row_30,1,0,0
Row_45,1,0,0
Row_60,0,1,0
Row_75,0,1,0
Row_90,0,1,0
Row_105,0,0,1
Row_120,0,0,1
Row_135,0,0,1


In [41]:
pd.get_dummies(df_iris["species"], drop_first=True)[::15]

Unnamed: 0,versicolor,virginica
Row_0,0,0
Row_15,0,0
Row_30,0,0
Row_45,0,0
Row_60,1,0
Row_75,1,0
Row_90,1,0
Row_105,0,1
Row_120,0,1
Row_135,0,1
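
A note on versions: from pandas 2.0 onward `get_dummies` returns boolean columns by default, so the 0/1 values shown above may appear as `False`/`True`. A minimal sketch, on a made-up series, of requesting integers explicitly:

```python
import pandas as pd

species = pd.Series(["setosa", "versicolor", "virginica", "setosa"])

# dtype=int requests 0/1 columns regardless of the pandas version.
dummies = pd.get_dummies(species, dtype=int)

# drop_first=True drops the first category ("setosa"), which becomes the
# baseline encoded as all zeros.
reduced = pd.get_dummies(species, drop_first=True, dtype=int)

print(dummies)
print(reduced)
```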


#### Concatenating data frames

In [42]:
pd.concat([df_iris["species"], pd.get_dummies(df_iris["species"], prefix='iris')], axis = 1)[::15]
## axis defines concat dimension: 0 for vertical, 1 for horizontal

Unnamed: 0,species,iris_setosa,iris_versicolor,iris_virginica
Row_0,setosa,1,0,0
Row_15,setosa,1,0,0
Row_30,setosa,1,0,0
Row_45,setosa,1,0,0
Row_60,versicolor,0,1,0
Row_75,versicolor,0,1,0
Row_90,versicolor,0,1,0
Row_105,virginica,0,0,1
Row_120,virginica,0,0,1
Row_135,virginica,0,0,1
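
The cell above concatenates horizontally. For completeness, a sketch of the vertical case (`axis = 0`) with two made-up single-row frames:

```python
import pandas as pd

# Two hypothetical frames with identical columns.
top = pd.DataFrame({"species": ["setosa"], "petal_width": [0.2]})
bottom = pd.DataFrame({"species": ["virginica"], "petal_width": [2.3]})

# axis=0 stacks rows; ignore_index=True renumbers them 0..n-1 instead of
# keeping the duplicated label 0 from each input.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

print(stacked)
```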


### Grouping and aggregation

In [43]:
df_iris.groupby("species").mean()

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
setosa,5.006,3.428,1.462,0.246
versicolor,5.936,2.77,4.26,1.326
virginica,6.588,2.974,5.552,2.026


In [44]:
df_djia.groupby("Industry").count()

Unnamed: 0_level_0,Company,Exchange,Symbol,Date added,Notes,Index weighting
Industry,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Aerospace and defense,1,1,1,1,0,1
Biopharmaceutical,1,1,1,1,0,1
Broadcasting and entertainment,1,1,1,1,0,1
Chemical industry,1,1,1,1,0,1
Clothing industry,1,1,1,1,0,1
Conglomerate,2,2,2,2,2,2
Construction and mining,1,1,1,1,0,1
Drink industry,1,1,1,1,1,1
Fast-moving consumer goods,1,1,1,1,0,1
Financial services,4,4,4,4,0,4


In [45]:
df_iris.groupby("species").describe().T

Unnamed: 0,species,setosa,versicolor,virginica
sepal_length,count,50.0,50.0,50.0
sepal_length,mean,5.006,5.936,6.588
sepal_length,std,0.35249,0.516171,0.63588
sepal_length,min,4.3,4.9,4.9
sepal_length,25%,4.8,5.6,6.225
sepal_length,50%,5.0,5.9,6.5
sepal_length,75%,5.2,6.3,6.9
sepal_length,max,5.8,7.0,7.9
sepal_width,count,50.0,50.0,50.0
sepal_width,mean,3.428,2.77,2.974


In [46]:
df_djia.groupby("Industry").get_group('Aerospace and defense')

Unnamed: 0,Company,Exchange,Symbol,Industry,Date added,Notes,Index weighting
5,Boeing,NYSE,BA,Aerospace and defense,1987-03-12,,3.36%
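
When different columns need different summaries, `.agg` with named aggregation is one option. A sketch on a made-up frame; the output column names (`mean_length`, `max_width`, `n_values`) are chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a few iris rows.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 5.5, 6.5],
    "petal_width": [0.2, 0.4, 2.0, 2.5],
})

# Each keyword names an output column: (input column, aggregation).
summary = df.groupby("species").agg(
    mean_length=("petal_length", "mean"),
    max_width=("petal_width", "max"),
    n_values=("petal_length", "count"),
)

print(summary)
```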


## Your Turn! _Lab 1: Part 1_

# ===============================

### Merging data frames

In [47]:
df_gdp = pd.read_csv("../data/gdp_by_country.csv")
df_continents = pd.read_csv("../data/continents.csv")

In [48]:
df_gdp

Unnamed: 0,Country,IMF_Estimate,IMF_Year,WB_Estimate,WB_Year,CIA_Estimate,CIA_Year
0,China,30074380.0,2022.0,24273360.0,2020.0,23009780,2020
1,United States,25035164.0,2022.0,20936600.0,2020.0,19846720,2020
2,India,11665486.0,2022.0,9907028.0,2020.0,8443360,2020
3,Japan,6109961.0,2022.0,5328033.0,2019.0,5224850,2019
4,Germany,5316933.0,2022.0,4469546.0,2020.0,4238800,2020
...,...,...,...,...,...,...,...
223,Tuvalu,63.0,2022.0,55.0,2020.0,49,2019
224,Wallis and Futuna,,,,,60,2004
225,"Saint Helena, Ascension and Tristan da Cunha",,,,,31,2009
226,Niue,,,,,10,2003


In [49]:
df_gdp_imf = df_gdp[['Country', "IMF_Estimate"]].dropna()
df_gdp_imf

Unnamed: 0,Country,IMF_Estimate
0,China,30074380.0
1,United States,25035164.0
2,India,11665486.0
3,Japan,6109961.0
4,Germany,5316933.0
...,...,...
215,Kiribati,268.0
217,Marshall Islands,245.0
218,Palau,236.0
222,Nauru,145.0


In [50]:
df_continents.head()

Unnamed: 0,Country,Continent
0,China,Asia
1,United States,Americas
2,India,Asia
3,Japan,Asia
4,Germany,Europe


In [51]:
## inner merge

df_merged = pd.merge(df_continents, df_gdp_imf, on = "Country", how = "inner")
df_merged.head()

Unnamed: 0,Country,Continent,IMF_Estimate
0,China,Asia,30074380.0
1,United States,Americas,25035164.0
2,India,Asia,11665486.0
3,Japan,Asia,6109961.0
4,Germany,Europe,5316933.0


In [52]:
## left (or right) merge

left_merge_df = pd.merge(df_continents, df_gdp_imf, on = "Country", how = "left")
left_merge_df.tail(10)

Unnamed: 0,Country,Continent,IMF_Estimate
218,Palau,Oceania,236.0
219,Falkland Islands,Americas,
220,Anguilla,Americas,
221,Montserrat,Americas,
222,Nauru,Oceania,145.0
223,Tuvalu,Oceania,63.0
224,Wallis and Futuna,Oceania,
225,"Saint Helena, Ascension and Tristan da Cunha",Africa,
226,Niue,Oceania,
227,Tokelau,Oceania,


In [53]:
left_merge_df.info() ## NAs introduced by left merging

<class 'pandas.core.frame.DataFrame'>
Int64Index: 228 entries, 0 to 227
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country       228 non-null    object 
 1   Continent     228 non-null    object 
 2   IMF_Estimate  192 non-null    float64
dtypes: float64(1), object(2)
memory usage: 7.1+ KB
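
To see which rows failed to match, `pd.merge` accepts `indicator = True`. The sketch below uses miniature made-up tables (the "Atlantis" row is invented so that the right-hand table has an unmatched key) and an outer merge, which keeps unmatched rows from both sides.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
continents = pd.DataFrame({
    "Country": ["China", "Niue", "Germany"],
    "Continent": ["Asia", "Oceania", "Europe"],
})
gdp = pd.DataFrame({
    "Country": ["China", "Germany", "Atlantis"],  # "Atlantis" is invented
    "IMF_Estimate": [30074380.0, 5316933.0, 1.0],
})

# indicator=True adds a "_merge" column with the values
# "both", "left_only" or "right_only".
merged = pd.merge(continents, gdp, on="Country", how="outer", indicator=True)

print(merged[["Country", "_merge"]])
```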



### Handling missing data and duplicates

#### inspecting missing values

In [54]:
left_merge_df.isna().sum()

Country          0
Continent        0
IMF_Estimate    36
dtype: int64

In [55]:
left_merge_df.notna().sum(axis = 0)

Country         228
Continent       228
IMF_Estimate    192
dtype: int64

In [56]:
left_merge_df.isnull().any(axis=0)  ## isnull() is an alias of isna(); any() flags columns with at least one NaN

Country         False
Continent       False
IMF_Estimate     True
dtype: bool

#### dropping missing values

In [57]:
left_merge_df.dropna(axis = 1).head() #remove columns containing at least one NaN

Unnamed: 0,Country,Continent
0,China,Asia
1,United States,Americas
2,India,Asia
3,Japan,Asia
4,Germany,Europe


In [58]:
left_merge_df.dropna(axis = 0) #remove rows containing at least one NaN

Unnamed: 0,Country,Continent,IMF_Estimate
0,China,Asia,30074380.0
1,United States,Americas,25035164.0
2,India,Asia,11665486.0
3,Japan,Asia,6109961.0
4,Germany,Europe,5316933.0
...,...,...,...
215,Kiribati,Oceania,268.0
217,Marshall Islands,Oceania,245.0
218,Palau,Oceania,236.0
222,Nauru,Oceania,145.0


In [59]:
left_merge_df.fillna(left_merge_df.mean(numeric_only=True))  ## replace each NaN with its column mean

Unnamed: 0,Country,Continent,IMF_Estimate
0,China,Asia,3.007438e+07
1,United States,Americas,2.503516e+07
2,India,Asia,1.166549e+07
3,Japan,Asia,6.109961e+06
4,Germany,Europe,5.316933e+06
...,...,...,...
223,Tuvalu,Oceania,6.300000e+01
224,Wallis and Futuna,Oceania,8.404846e+05
225,"Saint Helena, Ascension and Tristan da Cunha",Africa,8.404846e+05
226,Niue,Oceania,8.404846e+05


In [60]:
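## forward fill: each NaN takes the last valid value above it
## (.ffill() replaces fillna(method = 'ffill'), which is deprecated in recent pandas)
pd.concat([left_merge_df, left_merge_df.ffill()], axis = 1).tail(15)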

Unnamed: 0,Country,Continent,IMF_Estimate,Country.1,Continent.1,IMF_Estimate.1
213,Micronesia,Oceania,386.0,Micronesia,Oceania,386.0
214,Cook Islands,Oceania,,Cook Islands,Oceania,386.0
215,Kiribati,Oceania,268.0,Kiribati,Oceania,268.0
216,Saint Pierre and Miquelon,Americas,,Saint Pierre and Miquelon,Americas,268.0
217,Marshall Islands,Oceania,245.0,Marshall Islands,Oceania,245.0
218,Palau,Oceania,236.0,Palau,Oceania,236.0
219,Falkland Islands,Americas,,Falkland Islands,Americas,236.0
220,Anguilla,Americas,,Anguilla,Americas,236.0
221,Montserrat,Americas,,Montserrat,Americas,236.0
222,Nauru,Oceania,145.0,Nauru,Oceania,145.0
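
Duplicates are handled with `duplicated()` and `drop_duplicates()`. A minimal sketch on a made-up frame in which the second row repeats the first:

```python
import pandas as pd

# Hypothetical frame; row 1 is an exact copy of row 0.
df = pd.DataFrame({
    "Country": ["China", "China", "India", "India"],
    "Continent": ["Asia", "Asia", "Asia", "Asia"],
    "Year": [2022, 2022, 2022, 2020],
})

dup_mask = df.duplicated()       # True for rows that repeat an earlier row
deduped = df.drop_duplicates()   # compares all columns by default

# subset restricts the comparison to chosen columns;
# keep="last" retains the final occurrence of each key.
one_per_country = df.drop_duplicates(subset="Country", keep="last")

print(dup_mask.sum(), "duplicated row(s)")
print(one_per_country)
```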


### Sorting data frames

In [61]:
df_iris.sort_values("petal_length", ascending = False, ignore_index = True)[:10]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,7.7,2.6,6.9,2.3,virginica
1,7.7,2.8,6.7,2.0,virginica
2,7.7,3.8,6.7,2.2,virginica
3,7.6,3.0,6.6,2.1,virginica
4,7.9,3.8,6.4,2.0,virginica
5,7.3,2.9,6.3,1.8,virginica
6,7.4,2.8,6.1,1.9,virginica
7,7.2,3.6,6.1,2.5,virginica
8,7.7,3.0,6.1,2.3,virginica
9,6.3,3.3,6.0,2.5,virginica
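
`sort_values` also accepts several keys, each with its own direction. A sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical stand-in for a few iris rows.
df = pd.DataFrame({
    "species": ["virginica", "setosa", "virginica", "setosa"],
    "petal_length": [6.9, 1.4, 5.1, 1.7],
})

# Sort by species A-Z, then by petal_length from largest to smallest
# within each species.
ordered = df.sort_values(
    ["species", "petal_length"],
    ascending=[True, False],
    ignore_index=True,
)

print(ordered)
```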


## Your turn! _Lab 1: Part 2_

# ===============================