In [None]:
# initializing otter-grader
import otter
grader = otter.Notebook()

# Lab 9: Putting it all Together

In this lab assignment we will work through selected pieces of what a project looks like. This is not a full project, but it covers the following ideas and links them to the analysis of a dataset:
* Visualization and Exploratory Data Analysis (EDA)
* Multilinear regression
* Correlation
* Bonus: PCA

This can be considered practice for your final project, but your project should not be as simple as this lab and should analyze the data in more depth than we can cover here.

## Due Date
This assignment is due **Friday, May 29 at 11:59pm PST.**

**Collaboration Policy**

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** in the cell below.



**Collaborators**: ...

In [1]:
import numpy as np
import pandas as pd
import altair as alt
import scipy
import sklearn
from sklearn.linear_model import LinearRegression

## Life Expectancy Data

This data was taken from [this source](https://www.kaggle.com/kumarajarshi/life-expectancy-who), and the goal of this dataset is to see what factors are contributors to life expectancy.

In [2]:
df = pd.read_csv('life_expectancy.csv')
df.head()

In today's lab we will aim to answer these three questions:
1. What is the general trend of life expectancy over time?
2. What is a multilinear regression model that can predict life expectancy, and how accurate is it?
3. In developing nations, does life expectancy mainly reflect childhood mortality, or how long people are expected to live overall?

## Question 0
Before we can begin with data analysis, we have to do a little bit of data exploration. Let's start familiarizing ourselves with the data by asking some basic questions:
1. Are there any missing values? If so, how many?
2. Are there any unique values? Any values that might seem strange?
3. Are there any outliers? How does the distribution of the data change if you remove them?
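
As a rough starting point, the three questions above can be sketched with a few pandas calls. The frame `toy` and its column names are hypothetical stand-ins, not the lab's data, and the 1.5 × IQR rule is only one common convention for flagging outliers, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative, not the lab's.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "gdp": [1.0, 1.2, np.nan, 0.9, 1.1, 50.0],
    "bmi": [22.0, np.nan, 24.0, 23.0, np.nan, 21.0],
})

# 1. Count the missing values in each column
missing_counts = toy.isna().sum()

# 2. Count the distinct (non-missing) values in each column
unique_counts = toy.nunique()

# 3. Flag outliers in one column with the 1.5 * IQR rule (an assumed convention)
q1, q3 = toy["gdp"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["gdp"] < q1 - 1.5 * iqr) | (toy["gdp"] > q3 + 1.5 * iqr)

print(missing_counts)
print(unique_counts)
print(toy.loc[is_outlier, "gdp"])

# Compare summary statistics with and without the flagged rows
print(toy["gdp"].describe())
print(toy.loc[~is_outlier, "gdp"].describe())
```

Comparing the two `describe()` outputs is one simple way to see how the distribution shifts once the flagged rows are dropped.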

### Question 0a
In the code cell below, include your code for investigating these questions. As with any data exploration, there is more than one way to approach these questions. There is no single correct answer, so you will get points for clear, commented code.
<!--
BEGIN QUESTION
name: q0a
points: 3
manual: true
-->
<!-- EXPORT TO PDF -->

In [3]:
## ADD DATA EXPLORATION CODE HERE.


### Question 0b
Answer the three questions in the cell below.
For convenience:
1. Are there any missing values? If so, how many?
2. Are there any unique values? Any values that might seem strange?
3. Are there any outliers? How does the distribution of the data change if you remove them?
<!--
BEGIN QUESTION
name: q0b
points: 3
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

## Question 1
Let's start with the easiest question. What is the general trend of life expectancy over time?


### Question 1a
There are a lot of countries, so let's group them into two categories, Developing and Developed, and take the mean life expectancy of each category in each year. Plot the year on the x-axis, the mean life expectancy on the y-axis, and the binary value "Status" as the color. Be sure to set the scale of your axes so that the visualization makes sense.
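
The grouping step can be sketched on toy data as follows. The frame `toy` and the names `year`, `status`, and `life_exp` are hypothetical and do not match the lab's column names; the padding of 1 around the axis range is likewise an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not the lab's.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "status": ["Developed", "Developing", "Developing",
               "Developed", "Developing", "Developing"],
    "life_exp": [78.0, 60.0, 64.0, 79.0, 61.0, 65.0],
})

# One mean per (year, status) pair, returned as a long-form table
yearly_mean = toy.groupby(["year", "status"], as_index=False)["life_exp"].mean()

# A y-axis range that hugs the data instead of starting at zero
y_domain = [yearly_mean["life_exp"].min() - 1, yearly_mean["life_exp"].max() + 1]

print(yearly_mean)
print(y_domain)
```

A long-form table like `yearly_mean` has one column each for the x position, the y position, and the color, which is the shape Altair encodings work with. A range such as `y_domain` could then be supplied to the y encoding through `alt.Scale(domain=...)`.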

<!--
BEGIN QUESTION
name: q1a
points: 3
manual: true
-->
<!-- EXPORT TO PDF -->

In [4]:
# GROUP THE DATA AND CREATE YOUR PLOT BELOW
...

### Question 1b
What can you conclude about our original question?

<!--
BEGIN QUESTION
name: q1b
points: 3
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

## Question 2
Now let's ask the second question. What is the multilinear regression line that predicts life expectancy?

### Question 2a
To do this, let's convert the "Status" column into an indicator variable by mapping "Developing" to 0 and "Developed" to 1. Then, as we did in Lab 8, fit the multilinear regression and calculate its **mean squared error**. You will learn more about analyzing the significance of each variable in this regression in PSTAT 126.

Since you already implemented the mean squared loss and the bias column in the previous lab, they are provided for you. However, you have to create the rest of the regression.
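
As a rough illustration of the fit, predict, and score workflow, here is a minimal sketch on synthetic data. The arrays `X` and `y`, the coefficients, and the noise level are all made up for the example and are not the lab's data. The sketch also checks that a least-squares fit with an explicit column of ones (the role a bias column plays) recovers the same intercept that `LinearRegression` fits by default.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 4 + X @ [2, -1, 0.5] + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_coef + rng.normal(scale=0.1, size=200)

# Fit, predict on the training data, and score with mean squared error
model = LinearRegression()
model.fit(X, y)
y_hat = model.predict(X)
mse = np.mean((y - y_hat) ** 2)

# Same fit via least squares with an explicit column of ones for the bias
X_bias = np.column_stack([np.ones(len(X)), X])
theta, *_ = np.linalg.lstsq(X_bias, y, rcond=None)

print(mse)
print(model.intercept_, model.coef_)
print(theta)
```

Because `LinearRegression` uses `fit_intercept=True` by default, the first entry of `theta` should agree with `model.intercept_` and the remaining entries with `model.coef_`, up to floating-point error.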

In [5]:
def mean_squared_error(y, y_hat):
    return np.mean((y-y_hat)**2)

def add_bias(data):
    data.insert(0,'ones',1)
    return data

Before we fit a linear regression, we need to clean the data. This is a very important step because this dataset, like many others that you will encounter, contains missing values. In addition to converting the "Status" column to an indicator, let's impute missing data with the mean of the column. While it would be better to impute using a more representative value, in the interest of simplicity we will just use the mean of each column.
* Hint: Look at how to use `fillna()`
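
To make the hint concrete, here is a small sketch of indicator encoding and mean imputation on toy data. The frame `toy`, its columns, and the labels `"low"` and `"high"` are hypothetical and are not taken from the lab's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and labels are illustrative only.
toy = pd.DataFrame({
    "group": ["low", "high", "low", "high"],
    "score": [1.0, np.nan, 3.0, 5.0],
    "weight": [10.0, 20.0, np.nan, 30.0],
})

# Turn a two-level text column into a 0/1 indicator
toy["group"] = toy["group"].map({"low": 0, "high": 1})

# Fill each missing entry with the mean of its own column
filled = toy.fillna(toy.mean(numeric_only=True))

print(filled)
```

Passing a Series of column means to `fillna()` fills each column with its own mean, since the Series index is matched against the column names.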

<!--
BEGIN QUESTION
name: q2a
points: 3
manual: false
-->

In [6]:
# Replace the "Status" column with numerical values
df = ...
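# Impute missing values with the mean of each numeric column
# (numeric_only=True skips text columns such as "Country")
df_clean = df.fillna(df.mean(numeric_only=True))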

<!--
BEGIN QUESTION
name: q2a2
points: 3
manual: false
-->

In [10]:
# Pick out the target value
y = ...
# Remove "Year", "Country", and the target value from the dataframe
x = ...
# Fit a linear regression model using sklearn
model = LinearRegression()
...
# Get y_hat by predicting using the model
y_hat = ...
loss = mean_squared_error(y,y_hat)
loss

### Question 2b

Let's take a look at the coefficients of our multilinear regression.

In [12]:
model.coef_

List two observations about the regression coefficients and interpret each of them.
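
When reading coefficients, it can help to label them with their feature names and to keep the units of each feature in mind. The sketch below uses synthetic data with two made-up features, `small_scale` and `large_scale`, which are assumptions for illustration and not columns from the lab. It shows how a feature measured on a large scale can receive a tiny coefficient while contributing just as much to the prediction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features on very different scales (assumed for illustration)
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "small_scale": rng.normal(size=300),
    "large_scale": rng.normal(scale=1000.0, size=300),
})
y = 3.0 * X["small_scale"] + 0.003 * X["large_scale"] + rng.normal(scale=0.01, size=300)

model = LinearRegression().fit(X, y)

# Label each coefficient with the feature it belongs to
coefs = pd.Series(model.coef_, index=X.columns)

# Coefficient times feature standard deviation: the change in the prediction
# for a one-standard-deviation change in that feature
scaled = coefs * X.std()

print(coefs)
print(scaled)
```

The raw coefficients differ by roughly a factor of a thousand, yet the scaled values are similar, so coefficient magnitude alone does not rank feature importance.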

<!--
BEGIN QUESTION
name: q2b
points: 4
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

## Question 3

Now let us consider the last question. In developing nations, does life expectancy mainly reflect childhood mortality, or how long people are expected to live overall?

### Question 3a

First filter the dataset to only consider developing nations (remember that "Status" is now 0 for these rows). Then find the $r^2$ value between life expectancy and each of the following columns:
* Infant Deaths
* Deaths under 5

Hint: use `corr()`
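
A minimal sketch of the hint on toy data: filter rows by an indicator column, compute the Pearson correlation $r$ between two columns with `corr()`, and square it. The frame `toy` and the names `x`, `y`, and `flag` are hypothetical and are not the lab's columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 30.0],
    "flag": [0, 0, 0, 0, 0, 1],
})

# Keep only the rows where the indicator is 0
subset = toy[toy["flag"] == 0]

# Pearson correlation between two columns, then squared
r = subset["x"].corr(subset["y"])
r_squared = r ** 2

print(r, r_squared)
```

Squaring discards the sign of $r$, so it is worth looking at $r$ itself as well to see whether the relationship is increasing or decreasing.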

<!--
BEGIN QUESTION
name: q3a
points: 3
manual: false
-->

In [13]:
# First filter the dataset
df_developing = df_clean[df_clean["Status"] == 0]

# Calculate r^2 for "Infant deaths"
inf_deaths_r2 = ...
print(f'r^2 of Infant Deaths: {inf_deaths_r2}')
# Calculate r^2 for "Deaths under 5"
child_deaths_r2 = ...
print(f'r^2 of Deaths under 5: {child_deaths_r2}')

### Question 3b
What do these $r^2$ values tell you about the importance of childhood mortality statistics in predicting life expectancy?

<!--
BEGIN QUESTION
name: q3b
points: 3
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

## BONUS: PCA

Using the same `x` as in Question 2, what is the minimum number of principal components on which we can run multilinear regression and still get a loss within 10% of our original loss, when the data is centered and scaled? Feel free to use the two imports given below.

Show all your work and get the correct number of principal components to get bonus points.
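
One possible shape for this kind of search is sketched below on synthetic data. Everything here is an assumption made for illustration: the data is generated from two hidden factors, the helper `mse` is defined locally, and "within 10%" is read as a loss of at most 1.1 times the baseline loss. It is not the lab's data and not necessarily the intended solution.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): six correlated features driven by two hidden factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 6))
y = latent @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=300)

def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return np.mean((y_true - y_pred) ** 2)

# Center and scale, then record the loss of a regression on all features
X_scaled = StandardScaler().fit_transform(X)
baseline = mse(y, LinearRegression().fit(X_scaled, y).predict(X_scaled))

# Regress on the first k principal components for each k
losses = {}
for k in range(1, X.shape[1] + 1):
    Z = PCA(n_components=k).fit_transform(X_scaled)
    losses[k] = mse(y, LinearRegression().fit(Z, y).predict(Z))

# Smallest k whose loss is at most 10% above the baseline (assumed reading)
min_k = next(k for k, loss in losses.items() if loss <= 1.1 * baseline)

print(baseline)
print(losses)
print(min_k)
```

Using every component reproduces the baseline loss, because the full set of components is only a rotation of the scaled features, so the search always finds some `k`.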

In [16]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# SCRATCH WORK
...

<!--
BEGIN QUESTION
name: bonus
points: 3
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

# Running Built-in Tests
1. All tests are in the `tests` directory
1. Each Python file in `tests` is a test
1. `grader.check('testname')` runs test `'testname'`, e.g. `'q1'`
1. `grader.check_all()` runs all visible tests

In [None]:
# Run built-in checks
grader.check_all()

In [None]:
# Generate pdf in classic notebook (does not work in JupyterLab)
import nb2pdf
nb2pdf.convert('lab09.ipynb')

# To generate pdf using command-line, run in terminal,
# nb2pdf lab09.ipynb

# Submission Checklist
1. Check filename is 'lab09.ipynb'
1. Save file to confirm all changes are on disk
1. Run *Kernel > Restart & Run All* to execute all code from top to bottom
1. Check `grader.check_all()` output
1. Save file again to write any new output to disk
1. Check the generated PDF to confirm that all responses are displayed correctly
1. Submit to Gradescope