# Module 7 Assignment


A few things you should keep in mind when working on assignments:

1. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, and Restart & Run all. 
2. Make sure that you save your work.
3. Upload your notebook to Compass.

-----

# Problem 1: Load and examine the dataset

The dataset, wealth_health_2007.csv, contains GDP per capita and life expectancy information of over 100 countries in 2007. 

In the following code cells:

- Load dataset to a DataFrame **df**.
- Display the first 5 rows in the DataFrame.
- Display the basic information of the DataFrame with info() method. Examine missing values.

In [1]:
# Load dataset to a DataFrame and display first 5 rows
import pandas as pd
df = pd.read_csv('wealth_health_2007.csv')
df.head()

Unnamed: 0,country,population,continent,life_exp,gdp_cap
0,Afghanistan,31889923,Asia,43.828,$974
1,Albania,3600523,Europe,76.423,"$5,937"
2,Algeria,33333216,Africa,72.301,
3,Angola,12420476,Africa,,"$4,797"
4,Argentina,40301927,Americas,75.32,"$12,779"


In [2]:
# Display basic information of the DataFrame, examine missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     142 non-null    object 
 1   population  142 non-null    int64  
 2   continent   142 non-null    object 
 3   life_exp    138 non-null    float64
 4   gdp_cap     137 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 5.7+ KB


# Problem 2: Clean up string values

- Work on **df** created from Problem 1.
- Convert values in country and continent to all upper case.
- Convert gdp_cap to float.
- Display the first 5 rows of **df** to verify the result.
- Display the unique values from the continent column.

In [3]:
# Clean up string values and display first 5 rows.
df.country = df.country.str.upper()
df.continent = df.continent.str.upper()
df.gdp_cap = df.gdp_cap.str.replace('$','').str.replace(',','').astype(float)
df.head()

Unnamed: 0,country,population,continent,life_exp,gdp_cap
0,AFGHANISTAN,31889923,ASIA,43.828,974.0
1,ALBANIA,3600523,EUROPE,76.423,5937.0
2,ALGERIA,33333216,AFRICA,72.301,
3,ANGOLA,12420476,AFRICA,,4797.0
4,ARGENTINA,40301927,AMERICAS,75.32,12779.0


In [4]:
# Display unique values in contient column
df.continent.unique()

array(['ASIA', 'EUROPE', 'AFRICA', 'AMERICAS', 'OCEANIA'], dtype=object)

# Problem 3: Handle missing values

- Work on **df** created from Problem 1.
- Drop rows if country, population or continent is missing.
- Fill missing values in life expectancy with mean life expectancy of the corresponding continent.
- Fill missing values in GDP per capita with mean GDP per capita of the corresponding continent.
- Save the resulting DataFrame to **df_clean**.
- Display basic information of **df_clean** to verify there's no missing values.

**Note**: You may get a `SettingWithCopyWarning`, just ignore it.

In [5]:
# Handle missing values, create df_clean
df_clean = df.dropna(subset=['country', 'population', 'continent'])
df_clean.life_exp.fillna(df_clean.groupby('continent').life_exp.transform('mean'), inplace=True)
df_clean.gdp_cap.fillna(df_clean.groupby('continent').gdp_cap.transform('median'), inplace=True)
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142 entries, 0 to 141
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     142 non-null    object 
 1   population  142 non-null    int64  
 2   continent   142 non-null    object 
 3   life_exp    142 non-null    float64
 4   gdp_cap     142 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 6.7+ KB


In [6]:
# Verify df_clean has no missing values with info()
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142 entries, 0 to 141
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     142 non-null    object 
 1   population  142 non-null    int64  
 2   continent   142 non-null    object 
 3   life_exp    142 non-null    float64
 4   gdp_cap     142 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 6.7+ KB


# Problem 4: Predict life expectancy with a linear regression

Construct a linear regression model using statsmodels to predict life expectancy by GDP per capita and contient.

- Work with **df_clean** created from Problem 3.
- Life expectancy will be the dependent variable.
- GDP per capita and contient will be the dependent variables.
- Enclose categorical features with "C()" in the regression formula.
- Create a linear regression model with the ols method in statsmodels.formula.api.
- Fit the model and assign the fitted model to variable **result**.
- Print result summary.

In [7]:
# Predict life expectancy by GDP per captia and continent
import statsmodels.formula.api as smf
formula = "life_exp ~ gdp_cap + C(continent)"
model = smf.ols(formula, data=df_clean)
result = model.fit()
result.summary()


0,1,2,3
Dep. Variable:,life_exp,R-squared:,0.714
Model:,OLS,Adj. R-squared:,0.704
Method:,Least Squares,F-statistic:,67.98
Date:,"Sun, 03 Jan 2021",Prob (F-statistic):,2.7000000000000003e-35
Time:,11:23:27,Log-Likelihood:,-464.83
No. Observations:,142,AIC:,941.7
Df Residuals:,136,BIC:,959.4
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,54.0248,0.919,58.774,0.000,52.207,55.843
C(continent)[T.AMERICAS],15.9751,1.657,9.640,0.000,12.698,19.252
C(continent)[T.ASIA],12.4378,1.556,7.992,0.000,9.360,15.515
C(continent)[T.EUROPE],14.9812,1.992,7.521,0.000,11.042,18.921
C(continent)[T.OCEANIA],18.1102,4.876,3.714,0.000,8.468,27.753
gdp_cap,0.0003,5.73e-05,5.949,0.000,0.000,0.000

0,1,2,3
Omnibus:,10.242,Durbin-Watson:,1.882
Prob(Omnibus):,0.006,Jarque-Bera (JB):,20.931
Skew:,0.21,Prob(JB):,2.85e-05
Kurtosis:,4.834,Cond. No.,156000.0


# Problem 5: Interpret the regression result

Answer following questions from the regression result.
- What is the impact of GDP per capita on life expectancy? Is the coefficient significant?
- Which continent is the intercept based on?
- Given an African country with GDP per capita $942, what is the life expectancy predicted by the regression model? (If you use `predict()` method, be aware that values in continent are all in upper case)

Feel free to add code cells and markdown cells if needed.

In [9]:
dfc = pd.DataFrame({'continent':['AFRICA'], 'gdp_cap':[942]})
result.predict(dfc)

0    54.345893
dtype: float64

- Life expectancy increases by 0.0003 year with every $1 increase in GDP per capita. The coefficient is significant at 95% confidence level.
- The intercept is based on Africa
- The predicted life expectancy is 54.3.