Andrea Calef

💁‍♂️ My details: a.calef@uea.ac.uk; office hours Wednesdays 4-6pm on Teams. Or meet live by appointment!

The materials for this week are available as a Jupyter notebook. Jupyter notebooks mix rich text with runnable Python code, so you can follow along with this lecture, run the Python examples, and even add your own notes and code. To do this, go to

https://mybinder.org/v2/gh/tturocy/eco7026a/HEAD

Alternatively, you can copy and paste code from here into the Python command line or an IDE such as Spyder.

### Using Jupyter notebooks

To get your own copy of this notebook, choose File above then Download.

When you have done that, click in the field below, and either press the play button or type Shift+Enter. This executes the Python cell.

## Notes: necessary libraries to replicate this lecture.

import numpy as np
<br>
import pandas as pd

In [None]:
import pandas as pd
import numpy as np

In [None]:
df1=pd.read_csv("Databases/Labourdata.csv") # Importing data for Labour Force
df2=pd.read_excel("Databases/UKHLS_missing.xlsx") # Importing data for the UK Household Longitudinal Study

In [None]:
print(df1)

In [None]:
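# Percentage of missing values in each of the first three columns
for i in df1.columns[:3]:
    pct_missing = np.mean(df1[i].isnull())  # the mean of True/False values is the share of True
    print('{} - {}%'.format(i, round(pct_missing*100)))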


In [None]:
round(df1.iloc[:,:3].isna().mean()*100,2) 
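
Both cells above rely on the same idea: `isna()` returns True/False values, and the mean of booleans is the share of True. A minimal sketch on a toy table (the data and the name `toy` are made up for illustration, not taken from Labourdata.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "wage": [np.nan, np.nan, 12.5, 9.0],
    "region": ["N", "S", "S", "N"],
})

# Share of missing values per column, expressed as a percentage
pct_missing = (toy.isna().mean() * 100).round(2)
print(pct_missing)
```

Here one of four ages and two of four wages are missing, so the shares are 25% and 50%, while `region` has none.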

In [None]:
df1 = df1.set_index('PERSONID') # Setting PERSONID as the new index Column for Labour force data

In [None]:
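print(df1.iloc[:2, :3])   # first two rows, first three columns
print(df1.iloc[0:3, :3])  # first three rows, first three columns
print(df1.iloc[:, :3])    # all rows, first three columns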

In [None]:
round(df2.isna().mean()*100,2) 

### Data cleaning

#### Solution 1: Dropping observations with missing values for UKHLS data. This will reduce the number of rows. 

In [None]:
df2.dropna(inplace = True) 

In [None]:
for i in df2.columns:
    pct_missing = np.mean(df2[i].isnull())
    print('{} - {}%'.format(i, round(pct_missing*100)))
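
Dropping every row with any missing value can be costly. As a sketch of some gentler options, `dropna` also accepts `subset` and `how`; the toy table and column names below are hypothetical, not the UKHLS variables:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 3 is entirely missing
toy = pd.DataFrame({
    "income": [1200.0, np.nan, 900.0, np.nan],
    "donation": [10.0, 5.0, np.nan, np.nan],
})

all_complete = toy.dropna()                    # keep rows with no missing value at all
income_known = toy.dropna(subset=["income"])   # only require 'income' to be present
not_all_empty = toy.dropna(how="all")          # drop rows where every column is missing

print(len(toy), len(all_complete), len(income_known), len(not_all_empty))
```

Comparing the row counts before and after shows how many observations each rule discards.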

#### Solution 2: Dropping columns.

In [None]:
columns_to_remove = ['EDUC', 'TIMETRND', 'MOTHERED', 'FATHERED', 'BRKNHOME', 'SIBLINGS'] 
df1.drop(columns_to_remove, inplace=True, axis=1) 
round(df1.isna().mean()*100,2) 

#### Solution 3: Filling in with mean values.  

In [None]:
df2=pd.read_excel("Databases/UKHLS_missing.xlsx")
print(round(df2.isna().mean()*100,2))  # print explicitly: only the last expression of a cell is displayed
df2['How much do you donate to charity?'].info()
df2['How much do you donate to charity?'].describe()

In [None]:
df2['How much do you donate to charity?'] = df2['How much do you donate to charity?'].fillna(df2['How much do you donate to charity?'].mean()) 
for i in df2.columns:
    pct_missing = np.mean(df2[i].isnull())
    print('{} - {}%'.format(i, round(pct_missing*100)))
df2['How much do you donate to charity?'].info()
df2['How much do you donate to charity?'].describe()

Any comment?
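
One thing worth checking when comparing the two `describe()` outputs: filling with the mean leaves the mean unchanged but shrinks the spread, because every filled value sits exactly at the centre. A small sketch on made-up numbers (not the charity variable):

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values
s = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 50.0])
filled = s.fillna(s.mean())

print(s.mean(), filled.mean())   # the mean is preserved
print(s.std(), filled.std())     # the standard deviation falls after filling
```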

#### Solution 4: Interpolation

Depending on the research question, interpolation may be appropriate for <u>time series</u>.
<br>
<br>
**.interpolate()** can be used with both Series and DataFrames. With no options, the interpolation is linear.
<br>
<br>
**.interpolate(method='polynomial', order=n)** fits a polynomial of order n instead of straight lines; this method requires SciPy.
<br>
<br>
For more information, please click <a href='https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html'>here</a>. 
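
To see what the default does, here is a minimal sketch on a short, made-up daily series (the dates and values are illustrative assumptions). With no options, each gap is filled on the straight line between its neighbours, treating observations as equally spaced:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two gaps
idx = pd.date_range("2020-01-01", periods=5, freq="D")
ts = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0], index=idx)

linear = ts.interpolate()   # default method is 'linear'
print(linear)
```

The first gap lies between 1 and 3 and becomes 2; the second lies between 3 and 7 and becomes 5.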
 

In [None]:
df2=pd.read_excel("Databases/UKHLS_missing.xlsx")
df2['How much do you donate to charity?'].interpolate(method='polynomial', order=9)

In [None]:
df2

#### Note

This interpolation is meaningless, as the data set is just a cross-section!

### Data manipulation

In [None]:
df = pd.read_csv("Databases/WEO_Data.csv") # Load WEO data 

df.drop(['Units', 'Scale', 'Country/Series-specific Notes', 'Estimates Start After'], axis = 1, inplace = True)


In [None]:
df

In [None]:
df['Subject Descriptor'] = df['Subject Descriptor'].str.replace('Gross domestic product, constant prices', 'r_gdp')
df['Subject Descriptor'] = df['Subject Descriptor'].str.replace('Gross domestic product, current prices', 'n_gdp')
df['Subject Descriptor'] = df['Subject Descriptor'].str.replace('Gross domestic product, deflator', 'defl')
df['Subject Descriptor'] = df['Subject Descriptor'].str.replace('Unemployment rate', 'u')

In [None]:
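print(df.T[:4])  # transpose, then keep the first four rows (the first four original columns)
print(df[:4].T)  # keep the first four rows, then transpose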

In [None]:
df1 = df.stack()
df1.head(10)

In [None]:
df1.index = df1.index.droplevel(0) # drop indicator index 
df1.head(10)

### Problem: it does not work! 

We need a different strategy, implemented through the following four steps: 

1. Create the column "Year", which tracks the time dimension. This reduces the number of columns and increases the number of rows, moving from a "wide" to a "long" panel data layout.
<br>
<br>
2. Create a new column, "our_variables", that holds a list with the four values of the variables of interest for each country and year. Data are sorted by Year and Country, which become the index.
<br>
<br>
3. Create columns for the variables of interest.
<br>
<br>
4. Merge the two previously created dataframes.

In [None]:
# Step 1: wide to long, with one row per country, variable and year
one = pd.melt(df, id_vars=['Country', 'Subject Descriptor'], var_name='Year', value_name='our_variables')
one.head() 

In [None]:
df

In [None]:
two = one.groupby(['Year','Country']).agg(list).drop('Subject Descriptor', axis = 1)
two.head() 

In [None]:
three = pd.DataFrame(two['our_variables'].to_list(), columns=[s.upper() for s in set(one['Subject Descriptor'].tolist())]) 
three.head()

In [None]:
four = pd.concat([two,three], axis=1).drop('our_variables', axis = 1)
four.head() 


In [None]:
two.reset_index(col_level=1, inplace=True)
two

In [None]:
four = pd.concat([two,three], axis=1).drop('our_variables', axis = 1)
four.head() 

In [None]:
df_sorted = four.sort_values(by=['Country','Year'])  
df_sorted.rename(columns={"DEFL": "deflator", "N_GDP": "nominal_gdp", "R_GDP": "real_gdp", "U": "unemployment_rate"}, inplace=True) 
df_sorted.head() 


Wait a second ... are the column labels attached to the right values?

#### Fix 1

Let us sort ...

In [None]:
three = pd.DataFrame(two['our_variables'].to_list(), columns=[s.upper() for s in sorted(set(one['Subject Descriptor'].tolist()))]) 
three

#### Fix 2

Let us simplify ...

In [None]:
three = pd.DataFrame(two['our_variables'].to_list(), columns=[s.upper() for s in one['Subject Descriptor'].tolist()]) 
three

#### Too much simplification ...

In [None]:
three = pd.DataFrame(two['our_variables'].to_list(), columns=[s.upper() for s in one['Subject Descriptor'][:4].tolist()]) 
three
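
The `[:4]` slice assumes that the first four rows of `one` hold exactly one row per variable, in the same order as the values inside each list. A slightly more defensive sketch takes the distinct labels in order of first appearance with `Series.unique()`. The toy table below (names ending in `_toy`) is hypothetical and only mimics the layout of `one`:

```python
import pandas as pd

# Hypothetical long table with the same layout as `one`
one_toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Subject Descriptor": ["r_gdp", "n_gdp", "u", "r_gdp", "n_gdp", "u"],
    "Year": ["2020"] * 6,
    "our_variables": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

labels = one_toy["Subject Descriptor"].unique().tolist()  # order of first appearance
print(labels)
print(sorted(set(labels)))  # alphabetical order, which need not match the data order

# Collect the values per (Year, Country) and attach the labels in data order
two_toy = one_toy.groupby(["Year", "Country"])["our_variables"].agg(list)
three_toy = pd.DataFrame(two_toy.to_list(), columns=[s.upper() for s in labels])
print(three_toy)
```

This still relies on every country listing its variables in the same order, so it is a check on the labels rather than a guarantee.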

In [None]:
# The full pipeline in a single cell
# Step 1: wide to long
one = pd.melt(df, id_vars=['Country', 'Subject Descriptor'], var_name='Year', value_name='our_variables')
# Step 2: one list of values per (Year, Country)
two = one.groupby(['Year','Country']).agg(list).drop('Subject Descriptor', axis = 1)
# Step 3: one column per variable, labelled in data order
three = pd.DataFrame(two['our_variables'].to_list(), columns=[s.upper() for s in one['Subject Descriptor'][:4].tolist()]) 
# Turn Year and Country back into ordinary columns so that the rows align with `three`
two.reset_index(col_level=1, inplace=True)
# Step 4: merge and tidy up
four = pd.concat([two,three], axis=1).drop('our_variables', axis = 1)
df_sorted = four.sort_values(by=['Country','Year'])  
df_sorted.rename(columns={"DEFL": "deflator", "N_GDP": "nominal_gdp", "R_GDP": "real_gdp", "U": "unemployment_rate"}, inplace=True) 
df_sorted.head() 
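
As a cross-check, the same reshaping can be sketched with `melt` followed by `pivot`. This is a different technique from the list-based route used in this lecture: `pivot` matches values to columns by label, so it does not depend on the order of the variables. The toy table and the names `wide`, `long` and `tidy` are assumptions for illustration, not the WEO data:

```python
import pandas as pd

# Hypothetical wide panel: one row per country and variable, one column per year
wide = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Subject Descriptor": ["r_gdp", "u", "r_gdp", "u"],
    "2020": [100.0, 5.0, 200.0, 7.0],
    "2021": [110.0, 4.5, 190.0, 8.0],
})

# Wide to long: one row per country, variable and year
long = pd.melt(wide, id_vars=["Country", "Subject Descriptor"],
               var_name="Year", value_name="value")

# Long to tidy: one row per country and year, one column per variable
tidy = (long.pivot(index=["Country", "Year"],
                   columns="Subject Descriptor", values="value")
            .reset_index()
            .rename_axis(None, axis=1))  # remove the leftover columns name
print(tidy)
```

Passing a list to `index` in `pivot` needs a reasonably recent pandas (1.1 or later).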

Finally, we can save the result in whichever data formats we wish, e.g.

In [None]:
df_sorted.to_csv('WEO_Data_Sorted.csv', index = False)
df_sorted.to_latex('WEO_Data_Sorted.tex', index = False)
df_sorted.to_stata('WEO_Data_Sorted.dta', write_index = False)
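
After saving, a quick sanity check is to read the file back and compare it with the original. The sketch below does this for CSV on a made-up table, writing to a temporary folder so that nothing is left on disk; the names `toy`, `back` and `toy.csv` are illustrative assumptions:

```python
import os
import tempfile

import pandas as pd

# Hypothetical table standing in for df_sorted
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Year": [2020, 2020],
    "real_gdp": [100.0, 200.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    toy.to_csv(path, index=False)   # index=False avoids writing an extra index column
    back = pd.read_csv(path)        # read the file back in

print(back.equals(toy))  # True when values and column types survive the round trip
```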

#### It has been a great pleasure to be your teacher!