# Problem Session 4

The problems in this notebook cover the content from our Regression lectures, including:
- Regularization
- Principal Component Analysis
- Categorical Variables and Interactions
- Pipelines

In [2]:
## We first load in packages we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

#### 1. Practice creating mock data and fitting models to it

Creating your own fake data and fitting models to that data is a good way to practice.  It is nice because you have access to the "ground truth" when you make your own data.

Another more practical usage of simulation is parametric bootstrapping, which we will cover in a few lectures.

It is also *very common* to need to mock up some data during an interview.

##### a.  

We will start by creating a design matrix $X$ with $20$ rows and $10$ columns whose first $9$ columns are all quite close to each other.  The last column will be distinct, so that $X$ is "close to" being a rank 2 matrix.  

Then we will have $y$ be the sum of all columns plus some noise.

So the true functional relationship is just to sum all the features!
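
The hint in the next cell points to `np.tile`. As a rough illustration of how it repeats an array (using a hypothetical toy vector `v`, not the `x` from the exercise), consider:

```python
import numpy as np

# Hypothetical toy vector, used only to illustrate np.tile.
v = np.array([1.0, 2.0, 3.0])

# Repeat v as 4 identical rows: the result has shape (4, 3).
rows = np.tile(v, (4, 1))

# Repeat v as 4 identical columns: reshape to (3, 1), then tile along axis 1.
cols = np.tile(v.reshape(-1, 1), (1, 4))

print(rows.shape)  # (4, 3)
print(cols.shape)  # (3, 4)
```

Which orientation you want depends on whether the repeated vector should form the rows or the columns of the result.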

In [378]:
# nobs is short for "number of observations"
nobs = 20

# A numpy array of shape (nobs,) whose entries are drawn uniformly from the interval [0,10]
x = 

# A numpy array of shape (nobs, 10).  Each column should be x plus normal errors with standard deviation 0.1
# Note that we will overwrite the final column in the next step.
# Hint: use np.tile.  Look up the docs!
X = 

# Overwrite the last column of X with new independent draws from [0,10]


# A numpy array of shape (nobs,) which is equal to the sum of the columns of X plus normal errors of variance 1.
y = 

In [379]:
# Making a train test split.  Don't specify a random state this time!  We will be rerunning this later.
from sklearn.model_selection import 
X_train, X_test, y_train, y_test = 

##### b.

We will now fit a standard linear regression, ridge regression, and PCA regression model to the data.  We want to compare mean squared error on the test set.
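
As a reminder of the general pipeline pattern from lecture, here is a minimal sketch on separate toy data. The names `X_toy`, `y_toy`, and `toy_pipe`, the step names, and the choice of a single component are all assumptions made for this illustration, not the setup asked for in the exercise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: two nearly identical columns, y is their sum plus noise.
base = rng.uniform(0, 10, size=50)
X_toy = np.column_stack([base, base + rng.normal(0, 0.1, size=50)])
y_toy = X_toy.sum(axis=1) + rng.normal(0, 1, size=50)

# Scale, project onto one principal component, then regress on that component.
toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=1)),
    ("reg", LinearRegression()),
])
toy_pipe.fit(X_toy, y_toy)

toy_preds = toy_pipe.predict(X_toy)
toy_mse = mean_squared_error(y_toy, toy_preds)
print(f"Toy training MSE: {toy_mse:.3f}")
```

Swapping the last two steps for a single `Ridge` step gives the analogous scaled ridge pipeline.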

In [380]:
from sklearn.linear_model import 
from sklearn.decomposition import 
from sklearn.pipeline import 
from sklearn.preprocessing import 
from sklearn.metrics import 

In [None]:
# Instantiate the models. Which ones need scaling?  Use 2 components for the PCA.
lr = 
ridge_pipe = 
pca_pipe = 

# Fit the models to the training data

# Find the model predictions on the training set
lr_train_preds = 
ridge_train_preds = 
pca_train_preds = 

# Find the model predictions on the test set
lr_test_preds = 
ridge_test_preds = 
pca_test_preds = 

# Find the mse on the training set
lr_train_mse = 
ridge_train_mse = 
pca_train_mse = 

# Find the mse on the test set
lr_test_mse = 
ridge_test_mse =
pca_test_mse =

# Results
print(f"OLS Training MSE: {lr_train_mse}")
print(f"Ridge Training MSE: {ridge_train_mse}")
print(f"PCA Training MSE: {pca_train_mse}")
print(f"OLS Test MSE: {lr_test_mse}")
print(f"Ridge Test MSE: {ridge_test_mse}")
print(f"PCA Test MSE: {pca_test_mse}")

In [382]:
# Use this cell to click "run all above" a number of times.  Discuss with your group.
# Will OLS always outperform the other two models on the training set?  Will it ever outperform on the testing set?

##### c.

Lasso deals with multicollinearity poorly in the sense that it will often "randomly" choose which columns to keep and which to discard.  So we should expect that Lasso will keep the last feature (which is not correlated with the first $9$), and randomly choose from the first $9$.  Let's see if that pans out.
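
To see how fitted coefficients can be pulled out of a scaled Lasso pipeline, here is a small sketch on a three-column toy problem. The data, the names `X_demo` and `demo_pipe`, the step names, and `alpha=0.1` are all assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical toy design: two nearly collinear columns and one independent column.
shared = rng.uniform(0, 10, size=n)
X_demo = np.column_stack([
    shared + rng.normal(0, 0.1, size=n),
    shared + rng.normal(0, 0.1, size=n),
    rng.uniform(0, 10, size=n),
])
y_demo = X_demo.sum(axis=1) + rng.normal(0, 1, size=n)

# Scale first, then fit Lasso; alpha=0.1 is an arbitrary illustrative choice.
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", Lasso(alpha=0.1, max_iter=100000)),
])
demo_pipe.fit(X_demo, y_demo)

# The fitted coefficients live on the Lasso step of the pipeline.
demo_coefs = demo_pipe.named_steps["lasso"].coef_
print(demo_coefs)
```

Rerunning with a different seed is one way to check how stable the split between the two collinear columns is.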

In [383]:
from sklearn.linear_model import 

In [None]:
# Values for the Lasso hyperparameter
alphas = np.exp(np.linspace(-6,-1,8))

# coefs will store the lasso coefficients.  One row for each alpha, one column for each coefficient.
coefs = np.zeros((len(alphas), 10))

for i,alpha in enumerate(alphas):
    # Make a pipeline where you first scale and then lasso.
    # Use max_iter=100000 in your Lasso to avoid some convergence issues.
    lasso_pipe = 

    # Fit it to the training data

    # Store the coefficients in the ith row of coefs.  Replace the ?s appropriately 
    coefs[?,?] = 

# Just to make the display of coefs nicer I put them in a dataframe.
pd.DataFrame(coefs)


In [None]:
# Use this cell to click "run all above" a number of times.  Discuss with your group.
# Does Lasso always select feature 9?  Is the choice of other features to keep consistent?

#### Bonus 

Only do this if you got done with everything above in the first 20 minutes or so.  Otherwise come back to it at the end.

Write a function which repeats everything from parts (a) and (b) above 1000 times and records how often ridge regression and PCA regression each beat OLS on the test set.

Note:  If you only got through part 1 in this problem session that is fine.  You have already rehearsed all of the skills from lecture, and hopefully had a few insights as well.

The next part is therefore "optional".  I am including it for the groups that are really speedy.  It also feels bad to have a problem session without real data!  You can certainly treat part 2 as "homework" if you run out of time.

The new dataset also doesn't really seem to benefit much from regularization, so the new techniques from this week are not very rewarding.  Think of it as just more regression practice.  There is also one new idea of "controlling for a variable" which is introduced in this section.

#### 2. The diamonds dataset

We introduce a new "classic" dataset.  Our task is to predict the price of diamonds.

* price: Price in US dollars.
* carat: Weight of the diamond.
* cut: Cut quality (ordered worst to best).
* color: Color of the diamond (ordered best to worst).
* clarity: Clarity of the diamond (ordered worst to best).
* x: Length in mm.
* y: Width in mm.
* z: Depth in mm.
* depth: Total depth percentage: 100 * z / mean(x, y)
* table: Width of the top of the diamond relative to the widest point.

Homepage: https://ggplot2.tidyverse.org/reference/diamonds.html

In [3]:
df = pd.read_csv('../../data/diamonds.csv')

In [None]:
df

For the sake of time we will restrict ourselves to just one categorical feature (`cut`) and one continuous feature (`carat`) in our modeling.  This is only being done for pedagogical purposes!  In a real situation you would want to carefully explore all of the data you have available.

In [5]:
df = df[['cut', 'carat', 'price']]

##### a.

In [6]:
# Make a train/test split with 20% of data held aside as the test set.
from sklearn.model_selection import 

df_train, df_test = 

X_train = 
X_test = 
y_train = 
y_test = 

##### b. 

What percentage of samples belongs to each level of the `cut` feature?

##### c. 

Look at the distribution of price at each level of the `cut` feature.  Do you notice anything strange or unexpected?

##### d. 

One thing which might be a bit confusing is that the cut quality does not seem to be a very good indicator of price.  Why might that be?

Sometimes this happens when two predictors which each have a positive **causal** impact on the outcome are negatively correlated with each other.  In other words, it might be that **all else being equal** a higher quality cut will increase the price, and a larger carat will increase the price, but higher quality cuts are negatively correlated with the size in carats.

Use the `groupby` and `describe` methods to look at some summary statistics of carat size sorted by cut quality.
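
As a quick refresher on the pattern, here is a sketch on a tiny made-up frame. The values below are invented for illustration and are not taken from the diamonds data.

```python
import pandas as pd

# Hypothetical toy frame with invented values.
toy = pd.DataFrame({
    "cut": ["Fair", "Fair", "Ideal", "Ideal"],
    "carat": [1.0, 1.2, 0.5, 0.7],
})

# One row of summary statistics per group, for the selected column.
summary = toy.groupby("cut")["carat"].describe()
print(summary[["count", "mean", "min", "max"]])
```

Selecting the column before calling `describe` keeps the output to a single block of statistics per group.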

We can see that "Fair" cuts have the largest mean carat size, while "Ideal" cuts have the smallest. I am not a domain expert, but this could be because jewelers need to cut away more of the original stone to produce better cuts.  This would be something to consult a jeweler about.

##### e.

Graph price against carat with color coded by cut quality.

In [None]:
# Graph here

##### f.

The relationship you obtained above does not look linear.  Graph the log of the price against the log of the carat size.  This should look substantially more linear!
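
One way to see why a log-log plot can straighten things out: if price followed an exact power law, $\text{price} = a \cdot \text{carat}^b$, then $\log(\text{price}) = \log(a) + b \log(\text{carat})$, which is a straight line in the logged variables. The sketch below checks this numerically with made-up constants `a` and `b`. Real prices will only follow such a law approximately, if at all.

```python
import numpy as np

# Hypothetical power-law constants, chosen only for illustration.
a, b = 5000.0, 1.7

carat = np.linspace(0.2, 3.0, 30)
price = a * carat ** b

# Fit a straight line to the logged values; the slope should recover b
# and the intercept should recover log(a).
slope, intercept = np.polyfit(np.log(carat), np.log(price), deg=1)
print(slope, intercept)
```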

In [12]:
df_train['log_price'] = 
df_train['log_carat'] = 

In [None]:
# Graph here

##### g.

We do not have the ability to **experimentally** adjust `cut` and `carat` independently to see the impact on price, but we can still use **statistical control**.

We will run a linear regression of `log_price` against `cut` and `log_carat`.  Do better cuts contribute to higher prices when controlling for carat?
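
To build intuition for statistical control, here is a small simulation under assumed coefficients. The variables `quality`, `size`, and `outcome` are synthetic stand-ins, not the diamonds data. Both predictors raise the outcome, but they are negatively correlated with each other, so regressing on `quality` alone gives a misleading slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic setup (all numbers are assumptions): higher quality goes with smaller size.
quality = rng.integers(0, 5, size=n).astype(float)
size = 3.0 - 0.5 * quality + rng.normal(0, 0.3, size=n)

# Both predictors have a positive effect on the outcome.
outcome = 1.0 * quality + 4.0 * size + rng.normal(0, 0.5, size=n)

ones = np.ones(n)

# Regress on quality alone: the slope absorbs the effect of the omitted size variable.
marginal, *_ = np.linalg.lstsq(np.column_stack([ones, quality]), outcome, rcond=None)

# Regress on quality and size together: the quality coefficient is now "controlled for" size.
controlled, *_ = np.linalg.lstsq(np.column_stack([ones, quality, size]), outcome, rcond=None)

print(f"Quality slope without control: {marginal[1]:.2f}")
print(f"Quality slope controlling for size: {controlled[1]:.2f}")
```

In this simulation the uncontrolled slope comes out negative even though the true effect of `quality` is positive, which is the same kind of pattern the question above asks you to look for.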

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.linear_model import LinearRegression

In [None]:
# Discuss what you think preprocessor does with your team.  Can you test that it does what you think it should?
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(), ['cut']),
        ('identity', FunctionTransformer(func = None), ['log_carat'])
        ])

# Write a pipeline which first uses preprocessor and then uses LinearRegression(fit_intercept = False).
# Name the steps 'preprocess' and 'linear', since the code below looks them up by those names.
# Why do I not want to fit the intercept term?
model = 

# Fit it on the training set using the 'cut' and 'log_carat' features (in that order).

# It is a bit difficult to access the feature names of one part of a pipeline, so I have done it for you.
one_hot_feature_names = model.named_steps['preprocess'].named_transformers_['cat'].get_feature_names_out(['cut'])
feature_names = np.append(one_hot_feature_names, ['log_carat'])  # Manually add log_carat

# Map coefficients to feature names and sort them
cut_adjustments = pd.Series(model.named_steps['linear'].coef_, index=feature_names).sort_values()

cut_adjustments

##### h. Evaluating residuals

Make a plot of residuals against predicted values.  Discuss the implications for your model.

The lines in the residual plot are due to the apparent thresholds on price in the training data.  Prices seem to have a soft cap at around \$18k and a soft minimum of around \$350.

##### i. Quantifying model performance

Let's use [mean absolute percentage error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_percentage_error.html) and [mean absolute error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) as our performance metrics.  

Remember to use these in the units of the original target, not the transformed target!
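
As a sketch of what "original units" means here: predictions made on the log scale can be mapped back with `np.exp` before scoring. The numbers below are invented for illustration and do not come from any fitted model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Invented true prices (in dollars) and invented predictions on the log scale.
true_prices = np.array([500.0, 1000.0, 2000.0])
log_preds = np.log(np.array([550.0, 900.0, 2000.0]))

# Undo the log transform so that errors are measured in dollars.
price_preds = np.exp(log_preds)

mae = mean_absolute_error(true_prices, price_preds)
mape = mean_absolute_percentage_error(true_prices, price_preds)
print(f"MAE: {mae:.2f} dollars, MAPE: {mape:.4f}")
```

Note that scikit-learn reports MAPE as a fraction rather than a percentage, so multiply by 100 if you want percent.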

How does our model perform on the training set?

In [17]:
from sklearn.metrics import 