
# Description

In this exercise, we use the `GrowthDJ` economic growth dataset from the AER package for R: https://www.rdocumentation.org/packages/AER/versions/1.2-9/topics/GrowthDJ

The purpose of this exercise is to explore the determinants of **gdpgrowth**.


#### Variables

- **oil**: Is the country an oil-producing country?
- **inter**: Does the country have better quality data?
- **oecd**: Is the country a member of the OECD?
- **gdp60**: Per capita GDP in 1960.
- **gdp85**: Per capita GDP in 1985.
- **gdpgrowth**: Average growth rate of per capita GDP from 1960 to 1985 (in percent).
- **popgrowth**: Average growth rate of working-age population from 1960 to 1985 (in percent).
- **invest**: Average ratio of investment (including government investment) to GDP from 1960 to 1985 (in percent).
- **school**: Average fraction of working-age population enrolled in secondary school from 1960 to 1985 (in percent).


In [1]:
url = 'https://vincentarelbundock.github.io/Rdatasets/csv/AER/GrowthDJ.csv'

In [2]:
import pandas as pd
import numpy as np

# Read the data

In [3]:
df_gdp = pd.read_csv(url)

In [4]:
df_gdp.head()

Unnamed: 0.1,Unnamed: 0,oil,inter,oecd,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
0,1,no,yes,no,2485.0,4371.0,4.8,2.6,24.1,4.5,10.0
1,2,no,no,no,1588.0,1171.0,0.8,2.1,5.8,1.8,5.0
2,3,no,no,no,1116.0,1071.0,2.2,2.4,10.8,1.8,5.0
3,4,no,yes,no,959.0,3671.0,8.6,3.2,28.3,2.9,
4,5,no,no,no,529.0,857.0,2.9,0.9,12.7,0.4,2.0


# Data wrangling

#### Check the data dimensionality using `.shape`

In [11]:
df_gdp.shape

(100, 11)

#### How many rows with missing values?

- Check the functionality of `.isna()` and `.dropna()`

In [5]:
df_gdp.isna()

Unnamed: 0.1,Unnamed: 0,oil,inter,oecd,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
116,False,False,False,False,False,False,False,False,False,False,False
117,False,False,False,False,False,True,False,True,False,False,True
118,False,False,False,False,False,False,False,False,False,False,False
119,False,False,False,False,False,False,False,False,False,False,False


In [6]:
df_gdp.dropna()

Unnamed: 0.1,Unnamed: 0,oil,inter,oecd,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
0,1,no,yes,no,2485.0,4371.0,4.8,2.6,24.1,4.5,10.0
1,2,no,no,no,1588.0,1171.0,0.8,2.1,5.8,1.8,5.0
2,3,no,no,no,1116.0,1071.0,2.2,2.4,10.8,1.8,5.0
4,5,no,no,no,529.0,857.0,2.9,0.9,12.7,0.4,2.0
5,6,no,no,no,755.0,663.0,1.2,1.7,5.1,0.4,14.0
...,...,...,...,...,...,...,...,...,...,...,...
115,116,no,yes,no,10367.0,6336.0,1.9,3.8,11.4,7.0,63.0
116,117,no,yes,yes,8440.0,13409.0,3.8,2.0,31.5,9.8,100.0
118,119,no,yes,no,879.0,2159.0,5.5,1.9,13.9,4.1,39.0
119,120,no,yes,yes,9523.0,12308.0,2.7,1.7,22.5,11.9,99.0


In [7]:
df_gdp.isna().sum()

Unnamed: 0     0
oil            0
inter          0
oecd           0
gdp60          5
gdp85         13
gdpgrowth      4
popgrowth     14
invest         0
school         3
literacy60    18
dtype: int64
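The per-column counts above do not directly answer how many rows contain at least one missing value, because a single row can be missing several fields. A minimal sketch of one way to count such rows, using a small hypothetical frame (`toy`) built from the first five rows shown by `df_gdp.head()` rather than the full dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: three columns from the first five rows of df_gdp.head()
toy = pd.DataFrame({
    "gdpgrowth": [4.8, 0.8, 2.2, 8.6, 2.9],
    "school": [4.5, 1.8, 1.8, 2.9, 0.4],
    "literacy60": [10.0, 5.0, 5.0, np.nan, 2.0],
})

# True for each row that has at least one missing value
rows_with_na = toy.isna().any(axis=1)

n_rows_with_na = int(rows_with_na.sum())  # rows that dropna() would remove
n_complete = len(toy.dropna())            # rows that dropna() would keep

print(n_rows_with_na, n_complete)
```

The same expression, `df_gdp.isna().any(axis=1).sum()`, should work on the full data frame.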

#### Finally, drop the rows with missing values

In [17]:
df_gdp.dropna(inplace=True)

# Data Subsetting

Try creating the following datasets:

- OECD countries
- Countries with an above-average literacy rate

In [10]:
df_oecd = df_gdp[df_gdp["oecd"] == "yes"]
df_oecd.shape

(22, 11)

In [14]:
avg_literacy = df_gdp["literacy60"].mean()

In [None]:
df_gdp
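The second subset can be built with the same boolean-mask pattern used for the OECD countries. The following is only a sketch: `toy` is a hypothetical frame holding six rows copied from the outputs above, and `df_literate` is an illustrative name, not one used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy frame: six complete rows copied from the outputs above
toy = pd.DataFrame({
    "oecd": ["no", "no", "no", "yes", "no", "yes"],
    "literacy60": [10.0, 5.0, 63.0, 100.0, 39.0, 99.0],
})

# Average literacy rate across all rows
avg_literacy = toy["literacy60"].mean()

# Keep only the rows whose literacy rate is strictly above the average
df_literate = toy[toy["literacy60"] > avg_literacy]

print(df_literate.shape)
```

Replacing `toy` with `df_gdp` should give the corresponding subset of the full dataset.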

# Data Exploration

#### Calculate the mean and standard deviation of `gdpgrowth`

In [15]:
df_gdp["gdpgrowth"].mean()

3.9730000000000003
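The cell above covers only the mean. A minimal sketch of the standard deviation, on a hypothetical `toy` series of five `gdpgrowth` values copied from the outputs above; note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy series: five gdpgrowth values copied from the outputs above
toy = pd.Series([4.8, 0.8, 2.2, 2.9, 1.2], name="gdpgrowth")

growth_mean = toy.mean()
growth_std = toy.std()  # sample standard deviation (ddof=1 by default)

# Both statistics in a single call
summary = toy.agg(["mean", "std"])

print(summary)
```

On the full data, `df_gdp["gdpgrowth"].std()` should match the `std` row reported by `.describe()`.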

#### Run `.describe()` to see the data description

In [16]:
df_gdp.describe()

Unnamed: 0.1,Unnamed: 0,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,61.56,3837.02,5634.37,3.973,2.274,17.434,5.453,49.43
std,35.207988,8005.468973,5625.719893,1.81547,1.016592,7.821341,3.460147,35.102233
min,1.0,383.0,412.0,-0.9,0.3,4.1,0.4,1.0
25%,31.75,1001.25,1182.25,2.7,1.7,11.625,2.4,15.0
50%,60.5,1945.0,3484.5,3.8,2.4,16.55,4.85,46.0
75%,93.25,4776.5,7718.75,5.125,2.9,23.325,8.075,84.0
max,121.0,77881.0,25635.0,9.2,6.8,36.9,11.9,100.0


#### Calculate the group averages

- For each categorical variable (`oil`, `inter`, `oecd`), calculate the mean of `gdpgrowth`

In [18]:
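# numeric_only=True skips the text columns (inter, oecd), which recent pandas versions refuse to average
df_gdp.groupby("oil").mean(numeric_only=True)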

oil,Unnamed: 0,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
no,61.84375,3026.75,5351.239583,3.944792,2.186458,17.567708,5.403125,50.614583
yes,54.75,23283.5,12429.5,4.65,4.375,14.225,6.65,21.0
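The exercise asks for the mean of `gdpgrowth` by each of the three categorical variables, while the cell above groups by `oil` only and averages every numeric column. One possible way to cover all three is sketched below, on a hypothetical `toy` frame of six rows copied from the outputs above; `group_means` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy frame: six rows copied from the outputs above
toy = pd.DataFrame({
    "oil": ["no", "no", "no", "no", "no", "no"],
    "inter": ["no", "yes", "yes", "yes", "yes", "yes"],
    "oecd": ["no", "no", "no", "yes", "no", "yes"],
    "gdpgrowth": [0.8, 4.8, 1.9, 3.8, 5.5, 2.7],
})

# Mean gdpgrowth within each level of each categorical variable
group_means = {
    col: toy.groupby(col)["gdpgrowth"].mean()
    for col in ["oil", "inter", "oecd"]
}

for col, means in group_means.items():
    print(means, end="\n\n")
```

Selecting the `gdpgrowth` column before calling `.mean()` keeps the result focused on the variable of interest and avoids averaging the text columns.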


#### Calculate the correlation between `gdpgrowth` and possible explanatory variables

In [19]:
df_gdp.iloc[:,-5:].corr()

Unnamed: 0,gdpgrowth,popgrowth,invest,school,literacy60
gdpgrowth,1.0,0.249679,0.369635,0.280325,0.1384
popgrowth,0.249679,1.0,-0.356105,-0.21618,-0.427336
invest,0.369635,-0.356105,1.0,0.631852,0.635772
school,0.280325,-0.21618,0.631852,1.0,0.812373
literacy60,0.1384,-0.427336,0.635772,0.812373,1.0
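To read off only the correlations with `gdpgrowth`, the relevant column of the matrix can be extracted and sorted. The sketch below selects columns by name instead of by position (`iloc[:, -5:]`), which is an alternative chosen here so the result does not depend on column order; `toy` is a hypothetical frame of six rows copied from the outputs above, so its correlations will differ from those of the full data.

```python
import pandas as pd

# Hypothetical toy frame: six rows copied from the outputs above
toy = pd.DataFrame({
    "gdpgrowth": [4.8, 0.8, 2.2, 2.9, 1.2, 1.9],
    "popgrowth": [2.6, 2.1, 2.4, 0.9, 1.7, 3.8],
    "invest": [24.1, 5.8, 10.8, 12.7, 5.1, 11.4],
    "school": [4.5, 1.8, 1.8, 0.4, 0.4, 7.0],
    "literacy60": [10.0, 5.0, 5.0, 2.0, 14.0, 63.0],
})

explanatory = ["popgrowth", "invest", "school", "literacy60"]

# Pearson correlation of each explanatory variable with gdpgrowth,
# sorted from most positive to most negative
corr_with_growth = (
    toy[["gdpgrowth"] + explanatory]
    .corr()["gdpgrowth"]
    .drop("gdpgrowth")
    .sort_values(ascending=False)
)

print(corr_with_growth)
```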


In [20]:
df_gdp.tail(5)

Unnamed: 0.1,Unnamed: 0,oil,inter,oecd,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
115,116,no,yes,no,10367.0,6336.0,1.9,3.8,11.4,7.0,63.0
116,117,no,yes,yes,8440.0,13409.0,3.8,2.0,31.5,9.8,100.0
118,119,no,yes,no,879.0,2159.0,5.5,1.9,13.9,4.1,39.0
119,120,no,yes,yes,9523.0,12308.0,2.7,1.7,22.5,11.9,99.0
120,121,no,no,no,1781.0,2544.0,3.5,2.1,16.2,1.5,29.0


In [21]:
df_gdp.loc[:,["invest","school"]]

Unnamed: 0,invest,school
0,24.1,4.5
1,5.8,1.8
2,10.8,1.8
4,12.7,0.4
5,5.1,0.4
...,...,...
115,11.4,7.0
116,31.5,9.8
118,13.9,4.1
119,22.5,11.9
