In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 9 – Models and Pipelines 🔁

## DSC 80, Fall 2022

### Due Date: Thursday, December 1st at 11:59PM ‼️

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook.

<span style='color:red'><b>Note: For Lab 9, there are no hidden tests!</b></span> The tests you see when you run `grader.check` are the final tests that will determine your grade. In addition, when you submit Lab 9 to Gradescope you will see your results on the assignment right away.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying `lab.py` file will be tested (à la DSC 20),
2. The notebook may be graded (if it contains free response questions or asks you to draw plots).

**Do not change the function names in the `lab.py` file!**
- The functions in the `lab.py` file are how your assignment is graded, and they are graded by their name.
- If you changed something you weren't supposed to, just use git to revert! Ask us if you need help with this, or google around for `git revert`.

**Tips for working in the notebook**:
- The notebooks serve to present the questions and give you a place to present your results for later review.
- The notebooks in *lab assignments* are not graded (only the `lab.py` file is submitted and graded).
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `lab.py` file. You can write code here, but make sure that all of your real work is in the `lab.py` file.

**Tips for developing in the `lab.py` file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional helper functions to solve the lab! 
- Always document your code!

### Importing code from `lab.py`

* We import our `lab.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because Python caches imported modules (in `sys.modules`). Once `lab` has been imported, subsequent imports reuse the already-loaded module instead of re-reading `lab.py`, so your edits would not be picked up.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from lab import *

In [3]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from pipeline_testing_util import get_transformers

***Note:*** While working on the lab, check the Campuswire post titled "Lab 9 Released!" for any clarifications.

## Part 1: `sklearn` Pipelines 🧠

The file `data/toy.csv` contains an example dataset that consists of 4 columns:

- `'group'`: a categorical column with 3 categories
- `'c1'`: a numeric attribute
- `'c2'`: a numeric attribute
- `'y'`: the target variable (that you want to predict) 

In the following questions, you will build `Pipeline`s that combine feature engineering with a linear regression model.

In [5]:
fp = os.path.join('data', 'toy.csv')
data = pd.read_csv(fp)
data.head()

### Question 1

First, you will train a regression model using only a *log-scaled* `'c2'` variable. Create a `Pipeline` that:
1. log-scales `'c2'`, then
2. predicts `'y'` using a linear regression model (using your transformed `'c2'`).

That is, create a function `simple_pipeline` that takes in a DataFrame like `data` and returns a **tuple** consisting of 
- An already-fit `Pipeline`, and
- An array containing the predictions your model makes on `data` (after being trained on `data`).

***Note:*** By "log", we're referring to the natural logarithm.
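As a rough illustration of the two-step structure described above, here is a minimal sketch on made-up data. The DataFrame `demo`, its columns `'x'` and `'y'`, and the step names are hypothetical and are not part of the lab; your `simple_pipeline` still needs to work with the columns of `data`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-in data (not toy.csv): y is exactly linear in log(x).
demo = pd.DataFrame({'x': [1.0, 2.0, 4.0, 8.0, 16.0]})
demo['y'] = 3 * np.log(demo['x']) + 1

# Step 1 applies the natural log; step 2 fits a linear regression on the result.
demo_pl = Pipeline([
    ('log', FunctionTransformer(np.log)),
    ('lin-reg', LinearRegression()),
])
demo_pl.fit(demo[['x']], demo['y'])

# Predictions on the same data the pipeline was trained on.
demo_preds = demo_pl.predict(demo[['x']])
```

Because `'y'` was constructed to be linear in `log(x)`, the predictions here match `demo['y']` up to floating-point error. That will generally not be the case on real data.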

In [7]:
# don't change this cell, but do run it -- it is needed for the tests to work
q1_fp = os.path.join('data', 'toy.csv')
q1_data = pd.read_csv(q1_fp)
q1_pl, q1_preds = simple_pipeline(q1_data)

In [None]:
grader.check("q1")

### Question 2

Now, you will engineer features from the other columns and use them to train a regression model.  Create a `Pipeline` that:
1. uses `'c1'` as is,
1. log-scales `'c2'`,
1. one-hot encodes `'group'`, and
1. predicts `'y'` using a linear regression model built on the three variables above. (Note that your model will have more than three "features", because one-hot encoding `'group'` will create multiple columns. Don't drop any of them.)

That is, create a function `multi_type_pipeline` that takes in a DataFrame like `data` and returns a **tuple** consisting of
- An already-fit `Pipeline`, and
- An array containing the predictions your model makes on `data` (after being trained on `data`).

***Hint:*** Use `ColumnTransformer`, as we did in [Lecture 15](https://github.com/dsc-courses/dsc80-2022-fa/blob/main/lectures/15-pipelines/notebook/lecture.ipynb).
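To make the hint concrete, the sketch below shows one way a `ColumnTransformer` can route different columns to different transformers before a regression step. The data, the column names (`'cat'`, `'u'`, `'v'`), and the step names are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical stand-in data; column names are for illustration only.
demo = pd.DataFrame({
    'cat': ['a', 'b', 'c', 'a', 'b', 'c'],
    'u': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'v': [1.0, 10.0, 100.0, 2.0, 20.0, 200.0],
    'y': [2.0, 4.1, 6.3, 5.0, 7.2, 9.1],
})
demo_X = demo.drop(columns='y')

# Each tuple is (name, transformer, columns the transformer applies to).
preproc = ColumnTransformer([
    ('as-is', 'passthrough', ['u']),              # leave 'u' unchanged
    ('log', FunctionTransformer(np.log), ['v']),  # natural log of 'v'
    ('ohe', OneHotEncoder(), ['cat']),            # one column per category
])
demo_pl = Pipeline([('preproc', preproc), ('lin-reg', LinearRegression())])
demo_pl.fit(demo_X, demo['y'])
demo_preds = demo_pl.predict(demo_X)

# 1 passthrough column + 1 logged column + 3 one-hot columns.
n_features = demo_pl.named_steps['preproc'].transform(demo_X).shape[1]
```

Note that the model ends up with five features even though only three input columns were used, since the categorical column expands into one column per category.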

In [16]:
# don't change this cell, but do run it -- it is needed for the tests to work
q2_fp = os.path.join('data', 'toy.csv')
q2_data = pd.read_csv(q2_fp)
q2_pl, q2_preds = multi_type_pipeline(q2_data)

In [None]:
grader.check("q2")

### Question 3

It seems like `'c1'` and `'c2'` have strong associations with the values of `'group'` (to see this, run the cell below). This suggests that group-wise scaling might make good features. 


Now, we want to standardize (i.e. z-scale) both `'c1'` and `'c2'` **within each `'group'`** (`'A'`, `'B'`, and `'C'`). Unfortunately, there is no built-in transformer in `sklearn` that performs group-wise standardization, so **you will need to create your own transformer!**

Your job is to complete the implementation of the `StdScalerByGroup` transformer class, meaning that you need to implement the `fit` and `transform` methods, along with the constructor (`__init__`).
- The `StdScalerByGroup` transformer works on an input array/DataFrame `X` whose first column contains groups, and whose remaining columns are quantitative and need to be standardized (within each group).
- The `fit` method should determine the mean and standard deviation of each quantitative column within each group in the input data `X` and save them in the instance variable `grps_`. (For instance, one of the quantities you may calculate here is the standard deviation of `'c1'`, but only for the rows whose `'group'` is `'B'`.)
- The `transform` method should take in an input array/DataFrame `X`, standardize each quantitative column separately using the means and standard deviations stored in `grps_`, and return a DataFrame containing the transformed quantitative columns.


If you `fit` and `transform` a `StdScalerByGroup` transformer on the `toy` DataFrame (without the `'y'` column), you should get back a DataFrame with two columns, `'c1'` and `'c2'`, with groups stored in the index (if you end up creating a `MultiIndex`, that is fine).


***Notes:***
1. You may decide on whatever structure you'd like for the `grps_` variable. This question will be graded on the correctness of the output of your transformer. (Verify your work by checking the output by hand!)
2. At no point should you loop over the **rows** of `data` (in fact, our solution doesn't use any loops).
3. The `'group'` column in the doctest is named `'g'` instead of `'group'`. Remember, the first column will **always** contain the groups, even if the first column's name is something other than `'group'`.
4. Do not worry about cases where the standard deviation is equal to 0.
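The sketch below shows the general shape of a custom transformer that standardizes within groups. It is not the required `StdScalerByGroup` implementation: the class name `GroupZScaler` and the attributes `means_` and `stds_` are hypothetical, and it assumes the sample standard deviation (pandas' default, `ddof=1`), which may or may not be the convention the tests expect.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupZScaler(BaseEstimator, TransformerMixin):
    """Hypothetical illustration: z-scale columns within the groups in column 0."""

    def fit(self, X, y=None):
        df = pd.DataFrame(X)
        grouped = df.groupby(df.columns[0])
        # Per-group statistics; pandas' std uses ddof=1 by default (an assumption here).
        self.means_ = grouped.mean()
        self.stds_ = grouped.std()
        return self

    def transform(self, X, y=None):
        df = pd.DataFrame(X)
        df = df.set_index(df.columns[0])
        # Look up each row's group statistics by its group label; no row loop needed.
        means = self.means_.loc[df.index].to_numpy()
        stds = self.stds_.loc[df.index].to_numpy()
        return (df - means) / stds

demo_X = pd.DataFrame({'g': ['A', 'A', 'B', 'B'], 'c1': [1, 2, 3, 4], 'c2': [1, 2, 3, 4]})
demo_out = GroupZScaler().fit(demo_X).transform(demo_X)
```

In this small example each group has two values that are one unit apart, so every standardized value is plus or minus `1 / sqrt(2)` under the `ddof=1` assumption.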

In [29]:
# The scatter plot referenced at the start of Question 3
# This is not needed to answer the question, but motivates why we are standardizing
sns.scatterplot(data=data, x='c1', y='y', hue='group');

In [30]:
# don't change this cell, but do run it -- it is needed for the tests to work
# test fit 
q3_test_fit_cols = {'g': ['A', 'A', 'B', 'B'], 'c1': [1, 2, 2, 2], 'c2': [3, 1, 2, 0]}
q3_test_fit_X = pd.DataFrame(q3_test_fit_cols)
q3_test_fit_std = StdScalerByGroup().fit(q3_test_fit_X)

# test transform
q3_test_transform_cols = {'g': ['A', 'A', 'B', 'B'], 'c1': [1, 2, 3, 4], 'c2': [1, 2, 3, 4]}
q3_test_transform_X = pd.DataFrame(q3_test_transform_cols)
q3_test_transform_std = StdScalerByGroup().fit(q3_test_transform_X)
q3_test_transform_out = q3_test_transform_std.transform(q3_test_transform_X)

In [31]:
# don't change this cell, but do run it -- it is needed for the tests to work
q3_fit_data = pd.read_csv('data/toy.csv')

N = 2*10**6
a = np.random.choice(['A', 'B'], size=(N,1)).astype('object')
b = np.random.multivariate_normal([1, 2], [[1, 0],[0, 100]], size=N)
arr = np.hstack([a, b])
q3_transform_data = pd.DataFrame(arr)
q3_transform_data[1] = q3_transform_data[1].astype(float)
q3_transform_data[2] = q3_transform_data[2].astype(float)

In [None]:
grader.check("q3")

### Question 4

`Pipeline`s are supposed to help you easily try different model configurations. Create a function `eval_toy_model` which returns a hard-coded **list of tuples** consisting of the (RMSE, $R^2$) of three different modeling `Pipeline`s, fit and evaluated on the entire input dataset `data`. The three different `Pipeline`s are:
1. The `Pipeline` in Question 1.
1. The `Pipeline` in Question 2.
1. A `Pipeline` consisting of a linear regression model fit on features generated by applying `StdScalerByGroup` to `'c1'`, log-scaling `'c2'`, and applying `OneHotEncoder` to `'group'`.
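For reference, here is a small sketch of how RMSE and $R^2$ can be computed for a fitted regression model. The data and the helper name `rmse` are made up for illustration; the same two calculations apply to any fitted `Pipeline` that ends in a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def rmse(actual, pred):
    """Root mean squared error between two equal-length arrays."""
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(pred)) ** 2))

# Hypothetical data: y = x^2, which a straight line cannot fit exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])

model = LinearRegression().fit(X, y)
preds = model.predict(X)

demo_rmse = rmse(y, preds)    # square root of the average squared residual
demo_r2 = r2_score(y, preds)  # same value as model.score(X, y)
```

For regressors in `sklearn`, the `score` method returns $R^2$, so it can be used in place of `r2_score` when the model and data are at hand.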

In [None]:
grader.check("q4")

## Part 2: Overfitting 😟

### Question 5

In this question, you will train two different classes of prediction models – **decision tree and k-Nearest Neighbor regressors** – on Galton's child heights dataset from lecture and explore different ways in which overfitting can appear.

#### `tree_reg_perf` 🌲

A decision tree regressor is trained similarly to a decision tree classifier: each split is chosen to minimize the variance of the (training data) response values within the leaves that the split creates. A decision tree regressor predicts the response value of a (new) observation using the **average target value** of the training observations lying in the same leaf node.

One **hyperparameter** of a decision tree regressor that affects model complexity is the **depth** of the tree. Larger depths correspond to more complicated decision trees. We will explore this parameter in this question.

Create a function `tree_reg_perf` that takes in a DataFrame like `galton` and:
- Splits the data into training and test sets,
- Trains 20 decision trees – one for each depth between 1 and 20, and
- Computes both the training RMSE and testing RMSE of each tree.

Store and return your results in a DataFrame that has two columns, `'train_err'` and `'test_err'`, and an index that corresponds to tree depths (i.e. 1, 2, ..., 20).
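The overall pattern (split once, then fit one tree per depth and record both errors) might look like the sketch below. It uses synthetic data and only depths 1 through 5; the variable names and the `random_state` values are assumptions made for reproducibility of the illustration, not requirements of the lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))

# Synthetic data: a noisy sine curve (not the Galton dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

depths = range(1, 6)
rows = []
for depth in depths:
    # max_depth is the hyperparameter that caps how deep the tree may grow.
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    rows.append({
        'train_err': rmse(y_train, tree.predict(X_train)),
        'test_err': rmse(y_test, tree.predict(X_test)),
    })

demo_errs = pd.DataFrame(rows, index=depths)
```

The loop here runs over hyperparameter values, not over rows of the data, and the same split is reused for every depth so that the errors are comparable.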

<br>

#### `knn_reg_perf` 👉👈

A k-Nearest Neighbors (k-NN) regressor predicts the response value of a (new) observation by computing the average value of the k-closest observations in the training set. The most common distance metric is Euclidean distance, i.e. $L_2$ distance.

One **hyperparameter** of a k-NN regressor that affects model complexity is k, **the number of neighbors averaged over**. Larger values of k correspond to more complicated regressors. We will explore this hyperparameter in this question.

Create a function `knn_reg_perf` that takes in a DataFrame like `galton` and:
- Splits the data into training and test sets,
- Trains 20 k-NN regressors – one for each value of k between 1 and 20, and
- Computes both the training RMSE and testing RMSE of each regressor.

Again, store and return your results in a DataFrame that has two columns, `'train_err'` and `'test_err'`, and an index that corresponds to values of k (i.e. 1, 2, ..., 20).
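To see what "averaging the k closest observations" means in practice, the sketch below compares a `KNeighborsRegressor` prediction against the same calculation done by hand. The tiny training set is invented for illustration, and it relies on the estimator's default settings (Euclidean distance and uniform weights).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature training set.
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([1.0, 3.0, 5.0, 20.0])
x_new = np.array([[1.2]])

# sklearn's prediction with k = 2 neighbors.
knn = KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)
sk_pred = knn.predict(x_new)[0]

# The same prediction by hand: find the 2 closest training points, average their targets.
dists = np.abs(X_train[:, 0] - x_new[0, 0])  # distances: 1.2, 0.2, 0.8, 8.8
nearest = np.argsort(dists)[:2]              # indices of x = 1.0 and x = 2.0
manual_pred = y_train[nearest].mean()        # (3.0 + 5.0) / 2
```

Both approaches give the same value, which is simply the mean of the two nearest targets.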

<br>

**Some guidelines for both subparts:**

- In all cases, we are using all other columns in `galton` to predict `'childHeight'`.
- You need to import the necessary classes from `sklearn` **inside** the functions you create. (Unlike before, we haven't imported them for you because we want you to figure out what to import!)
- If you're unsure how to create training and testing sets, refer to [Lecture 16](https://github.com/dsc-courses/dsc80-2022-fa/blob/main/lectures/16-bias_and_variance/notebook/lecture.ipynb). Use a test set size of 0.25.
    - For the purposes of this question, do not use any cross-validation.
- Don't write the formula for RMSE four times – define a helper function!

In [53]:
# Use `galton` to test your work
galton = pd.read_csv('data/galton.csv')
galton.head()

In [None]:
grader.check("q5")

After you've implemented both functions, run the cells below to plot training and testing error for both models.

In [65]:
np.random.seed(9) # For reproducibility

tree = tree_reg_perf(galton)
knn = knn_reg_perf(galton)
hyp = np.arange(1, 21)

plt.subplots(1, 2, figsize=(10, 4), dpi=100)

plt.subplot(1, 2, 1)
plt.plot(hyp, tree.iloc[:, 0], label='Training Error')
plt.plot(hyp, tree.iloc[:, 1], label='Testing Error')
plt.legend()
plt.xlabel('Tree Depth')
plt.xticks(np.arange(1, 21, 2))
plt.title('Error vs. Tree Depth for Decision Tree Regressor')

plt.subplot(1, 2, 2)
plt.plot(hyp, knn.iloc[:, 0], label='Training Error')
plt.plot(hyp, knn.iloc[:, 1], label='Testing Error')
plt.legend()
plt.xlabel('k (# neighbors)')
plt.xticks(np.arange(1, 21, 2))
plt.title('Error vs. # Neighbors for k-NN Regressor');

If your training and evaluation routines are correct, you should notice a few things:
- In both models, testing error initially decreases, and then (perhaps slowly) increases.
- With the decision tree, training error **decreases** as depth increases.
- With the k-NN regressor, training error **increases** as k (the number of neighbors looked at) increases.

You should think about **why** you observe each of the above phenomena. In particular, the last point may seem confusing – one would think that because larger values of k correspond to more complicated models (because the regressor is looking at more information to make a prediction), larger values of k should have lower training errors. But the nature of k-NN regressors is quite different from that of, say, decision tree regressors or linear regression models.

Lastly, in both cases, identify the ideal **hyperparameter** choice based on the graphs of testing error. You don't have to write the answer anywhere.

## Part 3: Predicting Survival on the Titanic 🛳🧊

### Question 6

Predicting whether or not passengers on the Titanic survived is a common first assignment when learning about classification – now it's your turn!

Create a function `titanic_model` that takes in a DataFrame `titanic` containing **training data only** and returns a `Pipeline` object fit to the training data. 


#### Requirements

You have freedom to build your own model. That is, **you can use any classification algorithm**, but your model should satisfy the following requirements:

* The model is built on the (binary) response column `'Survived'`.
* The model uses features derived from **all other columns in `titanic`**. Below, we specify which columns to "engineer"; you may "engineer" features using other columns, but be sure to include every column in your model (even if you choose to leave some columns as-is).

* Required feature engineering:
    * Derive a feature from the "title" in the `'Name'` field (e.g. "Mr", "Miss", "Mrs" – the names themselves should not be used as a feature; think about why).
    * Derive a feature that standardizes passengers' ages among their `'Pclass'` (use Question 3!).
    
#### Evaluation
    
Your model must achieve an accuracy of at least 0.78 on both the training set and the test set. Note that while you have access to the test set, you are still encouraged to perform your own model validation.

**Extra credit: If your model consistently achieves an accuracy above 0.83 on the test set, you can earn 5 points of extra credit on the lab!**

Some guidance:

- `Pipeline` objects can have other `Pipeline` objects within them. While this isn't a requirement, you may find this useful to break down your model into smaller, more manageable steps.

- Your submitted `titanic_model` function should have the model's hyperparameters (e.g. tree depth) hard-coded in it. That is, the `Pipeline` object doesn't have to include the hyperparameter selection process.

- You will find [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) useful. If you want your transformer to output a categorical feature, you will need to select `validate=False`.

- When using [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer), you may find the `remainder` keyword helpful.

- If you are aiming for those extra 5 points, consider building some meaningful features before fine-tuning the hyperparameters of your model. Do an EDA on the dataset: what kinds of people were more likely to survive?
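Several of the points above (nested `Pipeline`s, `FunctionTransformer` with `validate=False`, and the `remainder` keyword) can be combined as in the sketch below. It only builds features and stops short of a classifier. The helper `extract_title`, the regular expression, and the three-row DataFrame are assumptions made for illustration; in particular, the pattern assumes names shaped like "Last, Title. First", which you should verify against the actual data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical three-row sample in the "Last, Title. First" format.
demo = pd.DataFrame({
    'Name': [
        'Braund, Mr. Owen Harris',
        'Heikkinen, Miss. Laina',
        'Futrelle, Mrs. Jacques Heath',
    ],
    'Fare': [7.25, 7.925, 53.1],
})

def extract_title(names_df):
    """Pull the text between the comma and the first period out of the name column."""
    titles = names_df.iloc[:, 0].str.extract(r',\s*([^.]+)\.', expand=False)
    return titles.to_frame(name='Title')

# A small Pipeline nested inside a ColumnTransformer: extract the title, then one-hot encode it.
title_pl = Pipeline([
    ('extract', FunctionTransformer(extract_title, validate=False)),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

# remainder='passthrough' keeps the columns not listed (here, 'Fare') unchanged.
preproc = ColumnTransformer([('title', title_pl, ['Name'])], remainder='passthrough')
demo_features = preproc.fit_transform(demo)
demo_titles = extract_title(demo[['Name']])['Title'].tolist()
```

Setting `handle_unknown='ignore'` is one way to keep the encoder from raising an error when the test set contains a title that never appeared in the training data.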

In [67]:
# Experiment using `titanic` below – remember, this is only your training data
titanic = pd.read_csv('data/titanic.csv')
titanic.head()

In [None]:
grader.check("q6")

There is **a ton** of material out there on analyzing data from the Titanic. After you build your model, look online for other examples (e.g. [on Kaggle](https://www.kaggle.com/c/titanic)) and think about how you could improve your model.

## Congratulations! You've finished _the final lab of the quarter_! 🎉🥳

Submit your `lab.py` file to Gradescope. Note that you only need to submit the `lab.py` file; this notebook should not be uploaded.

Before submitting, you should ensure that all of your work is in the `lab.py` file. You can do this by running the doctests below, which will verify that your work passes the public tests **and** that your work is in the `lab.py` file. Run the cell below; you should see no output.

In [79]:
!python -m doctest lab.py

In addition, `grader.check_all()` will verify that your work passes the public tests.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()