<table style="width: 100%;">
    <tr style="background-color: transparent;"><td>
        <img src="https://data-88e.github.io/assets/images/blue_text.png" width="250px" style="margin-left: 0;" />
    </td><td>
        <p style="text-align: right; font-size: 10pt;"><strong>Economic Models</strong>, Spring 2020<br>
            Dr. Eric Van Dusen<br>
            Notebook by Chris Pyles</p></td></tr>
</table>

# Project 3: Econometrics and Data Science

This project focuses on the application of the data science techniques from lecture. You will practice single variable ordinary least squares regression in the Data 8 style, go through a guided introduction to multivariate OLS using the package `statsmodels`, and finally create your own multivariate OLS model.

After this project, you should be able to:

1. Write and apply the necessary functions to perform single variable OLS
2. Use the `statsmodels` package to create multivariate OLS models
3. Understand how to quantitatively evaluate models using the root-mean-squared error
4. Look for and use relationships between variables to select features for regression

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import warnings

from ipywidgets import interact, Dropdown, IntSlider

warnings.simplefilter(action='ignore')
%matplotlib inline
plt.style.use('seaborn-muted')
plt.rcParams["figure.figsize"] = [10,7]

In this project, we will be working with data on credit card defaults and billing. The data covers April to September 2005, with one row for each cardholder. It has the following columns:

| Column | Description |
|-----|-----|
| `credit` | Total amount of credit |
| `sex` | Cardholder sex |
| `education` | Cardholder education level |
| `marital_status` | Cardholder marital status |
| `age` | Cardholder age |
| `bill_{month}05` | Bill amount for the specified month (`apr` through `sep`) |
| `paid_{month}05` | Amount paid in the specified month (`apr` through `sep`) |
| `default` | Whether the cardholder defaulted |

In the cell below, we load the dataset.

In [4]:
defaults = pd.read_csv("defaults.csv")
defaults

Unnamed: 0,credit,sex,education,marital_status,age,bill_sep05,bill_aug05,bill_jul05,bill_jun05,bill_may05,bill_apr05,paid_sep05,paid_aug05,paid_jul05,paid_jun05,paid_may05,paid_apr05,default
0,20000,female,undergraduate,married,24,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,120000,female,undergraduate,single,26,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,female,undergraduate,single,34,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,female,undergraduate,married,37,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,male,undergraduate,married,57,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,220000,male,diploma,married,39,188948,192815,208365,88004,31237,15980,8500,20000,5003,3047,5000,1000,0
29996,150000,male,diploma,single,43,1683,1828,3502,8979,5190,0,1837,3526,8998,129,0,0,0
29997,30000,male,undergraduate,single,37,3565,3356,2758,20878,20582,19357,0,0,22000,4200,2000,3100,1
29998,80000,male,diploma,married,41,-1645,78379,76304,52774,11855,48944,85900,3409,1178,1926,52964,1804,1


**Question 0.1:** Which of the columns in `defaults` would need dummies before they can be used in an OLS model? Assign `q0_1` to a list of these column _labels_.

In [5]:
q0_1 = ["sex", "education", "marital_status"] # SOLUTION
q0_1

['sex', 'education', 'marital_status']

In [6]:
def test_q0_1(q0_1):
    assert len(q0_1) in [3, 4]
    assert "sex" in q0_1
    assert "education" in q0_1
    assert "marital_status" in q0_1

test_q0_1(q0_1) # IGNORE

In order to use the columns you chose, we will need to create dummies for them. In lecture, we showed a function, `pd.get_dummies` (available through the `pandas` import above), that will get dummies for a variable for you.

**Question 0.2:** Use `pd.get_dummies` to get dummies for the variables you listed in `q0_1`.

In [7]:
defaults = pd.get_dummies(defaults, columns=q0_1) # SOLUTION

In [8]:
def test_q0_2(defaults):
    assert "education" not in defaults.columns
    assert "marital_status" not in defaults.columns
    assert "sex_male" in defaults.columns

test_q0_2(defaults) # IGNORE
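
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row table (the `toy` DataFrame and its values are hypothetical and are not taken from `defaults.csv`). It shows that each category becomes its own indicator column and that the original column is dropped. The second call uses the `drop_first=True` option, which keeps one fewer indicator per variable; this is a common choice when a model also includes an intercept, though the project itself does not require it.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "age": [24, 26, 34],
    "sex": ["female", "male", "female"],
})

# One indicator column per category; the original "sex" column is replaced
toy_dummies = pd.get_dummies(toy, columns=["sex"])
print(toy_dummies.columns.tolist())   # ['age', 'sex_female', 'sex_male']

# Optional: drop the first category of each encoded variable
toy_dropped = pd.get_dummies(toy, columns=["sex"], drop_first=True)
print(toy_dropped.columns.tolist())   # ['age', 'sex_male']
```

Columns that are not listed in `columns` (here, `age`) are passed through unchanged.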

## Part 1: Single Variable OLS

We'll start by doing some single variable linear regression, à la Data 8. To begin, recall that we can model $y$ based on $x$ using the form

$$\Large
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
$$

We can define the **correlation coefficient** of two variables to be the mean of the product of their values in standard units.

**Question 1.1:** Complete the `corr` function below to compute the correlation coefficient of two arrays `x` and `y` based on the formula

$$\Large
r = \text{mean} \left ( x_\text{SU} \cdot y_\text{SU} \right )
$$

_Hint:_ You may find the `su` function, which converts an array to standard units, helpful.

In [9]:
def su(arr):
    """Converts array arr to standard units"""
    return (arr - np.mean(arr)) / np.std(arr)

def corr(x, y):
    """Calculates the correlation coefficient of two arrays"""
    return np.mean(su(x) * su(y)) # SOLUTION

In [10]:
def test_q1_1_1(np, corr):
    np.random.seed(1234)
    x2 = np.random.uniform(0, 10, 5)
    y2 = np.random.uniform(0, 10, 5)
    assert np.isclose(corr(x2, y2), 0.6410799722591175)

test_q1_1_1(np, corr) # IGNORE

In [11]:
""" # BEGIN TEST CONFIG
points: 1
hidden: true
""" # END TEST CONFIG
def test_q1_1_2(np, corr):
    np.random.seed(2345)
    x2 = np.random.uniform(0, 10, 5)
    y2 = np.random.uniform(0, 10, 5)
    assert np.isclose(corr(x2, y2), -0.4008555019904271)

test_q1_1_2(np, corr) # IGNORE
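
As a sanity check on the standard-units definition, the sketch below compares it with NumPy's built-in `np.corrcoef` on a small hand-made example (the arrays `x` and `y` are made up for illustration). The two agree because `np.std` defaults to the population standard deviation, which matches taking a plain mean of the products.

```python
import numpy as np

def su(arr):
    """Converts array arr to standard units"""
    return (arr - np.mean(arr)) / np.std(arr)

def corr(x, y):
    """Mean of the product of x and y in standard units"""
    return np.mean(su(x) * su(y))

# Hypothetical toy data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r_manual = corr(x, y)               # standard-units definition
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix

print(round(r_manual, 4), round(r_numpy, 4))   # 0.7746 0.7746
```

A perfectly linear relationship gives $r = 1$ or $r = -1$: for example, `corr(x, 3 * x + 1)` is 1 and `corr(x, -x)` is -1, up to floating-point error.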

Given $r$, we can compute the slope $\beta_1$ and intercept $\beta_0$ of the best-fit line using the formulas below, where $\hat{\mu}$ and $\hat{\sigma}$ denote the mean and standard deviation of each array.

$$\Large
\beta_1 = r \frac{\hat{\sigma}_y}{\hat{\sigma}_x}
\qquad \text{ and } \qquad
\beta_0 = \hat{\mu}_y - \beta_1 \cdot \hat{\mu}_x
$$
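
One way to see where these formulas come from (a sketch following the standard-units view of regression, not a full derivation): in standard units, the best-fit line passes through the origin with slope $r$. Converting that line back to the original units and collecting terms gives the slope and intercept above.

$$\Large
\begin{aligned}
\frac{\hat{y} - \hat{\mu}_y}{\hat{\sigma}_y} &= r \cdot \frac{x - \hat{\mu}_x}{\hat{\sigma}_x} \\
\hat{y} &= \hat{\mu}_y + r \frac{\hat{\sigma}_y}{\hat{\sigma}_x} \left ( x - \hat{\mu}_x \right ) \\
\hat{y} &= \underbrace{\left ( \hat{\mu}_y - \beta_1 \cdot \hat{\mu}_x \right )}_{\beta_0} + \underbrace{r \frac{\hat{\sigma}_y}{\hat{\sigma}_x}}_{\beta_1} x
\end{aligned}
$$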

**Question 1.2:** Using your `corr` function, fill in the `slope` and `intercept` functions below, which compute the values of $\beta_1$ and $\beta_0$ for the line of best fit that predicts `y` based on `x`. Your functions should use vectorized arithmetic (i.e. no `for` loops).

_Hint:_ You may find your `slope` function useful in `intercept`.

In [12]:
def slope(x, y):
    """Computes the slope of the best-fit line of y based on x"""
    return np.std(y) * corr(x, y) / np.std(x) # SOLUTION

def intercept(x, y):
    """Computes the intercept of the best-fit line of y based on x"""
    return np.mean(y) - slope(x, y) * np.mean(x) # SOLUTION

In [13]:
def test_q1_2_1(np, slope):
    np.random.seed(1234)
    x2 = np.random.uniform(0, 10, 5)
    y2 = np.random.uniform(0, 10, 5)
    assert np.isclose(slope(x2, y2), 0.853965497371089)

test_q1_2_1(np, slope) # IGNORE

In [14]:
def test_q1_2_2(np, intercept):
    np.random.seed(1234)
    x2 = np.random.uniform(0, 10, 5)
    y2 = np.random.uniform(0, 10, 5)
    assert np.isclose(intercept(x2, y2), 1.5592892975597108)

test_q1_2_2(np, intercept) # IGNORE

In [15]:
""" # BEGIN TEST CONFIG
points: 0.5
hidden: true
""" # END TEST CONFIG
def test_q1_2_3(np, slope):
    np.random.seed(2345)
    x2 = np.random.uniform(0, 10, 5)
    y2 = np.random.uniform(0, 10, 5)
    assert np.isclose(slope(x2, y2), -0.5183482739336265)

test_q1_2_3(np, slope) # IGNORE

In [16]:
""" # BEGIN TEST CONFIG
points: 0.5
hidden: true
""" # END TEST CONFIG
def test_q1_2_4(np, intercept):
    np.random.seed(2345)
    x2 = np.random.uniform(0, 10, 5)
    y2 = np.random.uniform(0, 10, 5)
    assert np.isclose(intercept(x2, y2), 7.777051922080558)

test_q1_2_4(np, intercept) # IGNORE
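
To check that the correlation-based formulas really produce the least-squares line, the sketch below compares them with `np.polyfit` using a degree-1 fit on a small hand-made example (the arrays `x` and `y` are made up for illustration). The helper functions are repeated so the example runs on its own.

```python
import numpy as np

def su(arr):
    """Converts array arr to standard units"""
    return (arr - np.mean(arr)) / np.std(arr)

def corr(x, y):
    """Mean of the product of x and y in standard units"""
    return np.mean(su(x) * su(y))

def slope(x, y):
    """Slope of the best-fit line of y based on x"""
    return corr(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    """Intercept of the best-fit line of y based on x"""
    return np.mean(y) - slope(x, y) * np.mean(x)

# Hypothetical toy data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

b1, b0 = slope(x, y), intercept(x, y)

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
p1, p0 = np.polyfit(x, y, deg=1)

print(round(b1, 4), round(b0, 4))   # 0.6 2.2
print(np.isclose(b1, p1), np.isclose(b0, p0))   # True True
```

For this data the fitted line is $\hat{y} = 2.2 + 0.6x$, so the prediction at $x = 3$ (the mean of `x`) is 4, the mean of `y`. The best-fit line always passes through the point of means.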

---

### References

* Data from https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#