In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab15.ipynb")

## Lab 15: Linear regression (25 minutes)

  
**Submission instruction**: Please create a zip file (using the export cell at the end of this notebook) and a PDF via File -> Print (or cmd + P on Mac), then upload both to Gradescope.

In [None]:
# edit these names to your name and your final project partner's name
me = ["Rick Marks"]
partner = ["Piper Marks"]
...

In [None]:
grader.check("name")



Let's revisit the ROUSes dataset. 

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
rouses = pd.read_csv('ROUSes.csv')
print(rouses.shape)
rouses.head()

Run the following code to drop the column `Temperament` from the table `rouses`. 

In [None]:
rouses = rouses.drop(columns='Temperament')

Here our goal is to predict the `Weight` of ROUSes using numerical values. Start by answering the following questions: (2 points)

- Classification or regression?
- Predictor variables?
- Target variables?
- Model: is this simple or multiple linear regression?

[Your answer]

### Exploratory analysis

Run the code `rouses.corr()`. This calculates the correlations between pairs of numerical columns.
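
If you have not read a correlation table before, here is a minimal sketch on a made-up table. The columns `x`, `y`, and `z` are hypothetical and have nothing to do with the ROUSes data. Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.

```python
import pandas as pd

# Hypothetical toy table (not the ROUSes data)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # exactly 2 * x, so its correlation with x is +1
    "z": [5, 3, 4, 1, 2],   # tends to fall as x rises, so its correlation with x is negative
})

toy_corr = toy.corr()  # Pearson correlation between every pair of numerical columns
print(toy_corr.round(2))
```

Each cell of the output is the correlation between the row's column and the column's column, so the diagonal is always 1.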

In [None]:
# write your code here
...


Based on the correlations, which variable do you expect to be more helpful in predicting `Weight`: `Age` or `Length`? (1 point)


[Your answer]

In an ideal world, we would train our model on the entire dataset, collect data from new ROUSes, then test the model on the new data. However, collecting new data is not feasible in this case. In situations like this, most people randomly split the existing samples into a train set and a test set, and "pretend" that the samples in the test set come from ROUSes they haven't met yet.

Fill in the 4 blanks such that this sentence describes what the code does:

    We will randomly put 80% of the ___1____ into the __2___ set, and put the remaining __3_____ into the ____4___ set.

- Options for 1: rows or columns
- Options for 2: train or test
- Options for 3: rows or columns
- Options for 4: train or test

The corresponding code:

```Python
train = rouses.sample(frac=0.8, random_state=42)
test = rouses.drop(index=train.index)
print(train.shape, test.shape)
```
Write your answers below, then copy the code above into the code cell and run it. (2 points)

[Your answers here]

1

2

3

4 

In [None]:
# Copy code from above here
...


Each of your tables needs to be split into two parts (`X` and `y`) for the automated regression algorithm to understand. Remember that `X` corresponds to a table where each column is a feature or a predictor variable, and `y` corresponds to an array with the target variable.

I've given you partial code that creates four new variables `y_train`, `X_train`, `X_test`, `y_test`. **Fill in the missing parts marked with ...**, then copy and run the code. The answer is a single column name that is the same in all four places. Pause and make sure you understand what is going on. (1 point)

```Python
y_train = train[...] # select column with target variable
X_train = train.drop(columns=[...]) # keep all other columns with predictor variables
print(X_train.shape, y_train.shape)

y_test = test[...]
X_test = test.drop(columns=[...]) 
print(X_test.shape, y_test.shape)
```

In [None]:
# Write your code here
...


### Setting up regression pipeline

We'll write our machine learning code inside a function so that we can call it multiple times later. Copy the following code, then fill in the missing spots marked by `...`. (2 points)

**Hint**: The six `...` are from these options: `X_train`, `X_test`, `y_train`, `y_test`.

```Python
from sklearn.linear_model import LinearRegression
def rouses_lr(X_train, X_test, y_train, y_test): 
    # Create "empty" model    
    lr = LinearRegression(fit_intercept=True)
    
    # Fit model to data (or train model)
    lr.fit(..., ...)
    
    # Save coefficients of the trained model
    coefs = pd.DataFrame(lr.coef_,
                         index=lr.feature_names_in_,
                         columns=['Coefficient vals'])
    
    # Save model performance on train and test
    results = coefs
    results.loc['Train R2 score'] = lr.score(..., ...)
    results.loc['Test R2 score'] = lr.score(..., ...)
    return coefs
``` 


In [None]:
# complete and run your code here
...


### Normalizing features

In class, we talked about interpreting linear regression coefficients, and mentioned an important caveat: to use the **magnitude of the coefficients** as an indication of relative feature importance, **the features have to be in the same range or scale**. Today, we'll see a few different ways to achieve that goal. The technical term for this is **feature normalization**, if you want to Google it for your project.
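
To see why scale matters, here is a small sketch on made-up numbers (not `ROUSes.csv`). It fits the same relationship twice, once with a length measured in meters and once in centimeters. As an assumption for the sake of a self-contained example, it uses NumPy's least-squares solver instead of the `LinearRegression` class used in this lab; `fit_slope` is a hypothetical helper.

```python
import numpy as np

# Made-up data: weight depends on length (hypothetical numbers)
length_m = np.array([0.5, 0.8, 1.0, 1.3, 1.6])
weight = 3.0 + 10.0 * length_m   # exact linear relationship, for clarity
length_cm = length_m * 100       # same information, different unit

def fit_slope(x, y):
    # Least-squares fit of y = intercept + slope * x
    design = np.column_stack([np.ones_like(x), x])
    intercept, slope = np.linalg.lstsq(design, y, rcond=None)[0]
    return slope

slope_m = fit_slope(length_m, weight)
slope_cm = fit_slope(length_cm, weight)
print(slope_m, slope_cm)
```

The two slopes differ by a factor of 100 even though the feature carries exactly the same information, so the raw coefficient size says as much about the unit as about the feature's importance.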

First, call the function `describe` on the table `X_train` to see the summary statistics of the numerical columns.

In [None]:
# write your code here
...


Question: Do `Age` and `Length` have the same range? Or the same mean? (1 point)

[Your answer]

We'll see how that affects the final model and interpretation of coefficients.

### 1. Standardization

The first method is called **standardization**. This means that each variable will have a mean of 0 and a standard deviation of 1 after this process. You can achieve this with the following steps:
   1. Within each column, subtract its mean. For example, if the mean `Age` is 13, subtract 13 from everyone's age.
   2. Within each column, divide by its standard deviation. For example, if the std of `Age` is 10, divide everyone's age by 10.

The code to achieve this is very simple. Make sure you understand what's going on, then copy and run the lines below.

```Python
X_train_standardized = (X_train - X_train.mean()) / X_train.std() # Subtract column mean and divide by column std
X_test_standardized = (X_test - X_train.mean()) / X_train.std() # Subtract train column mean and divide by train column std

X_train_standardized.describe()
```


In [None]:
# copy and run the code
...

Before we move on, check that each feature in `X_train_standardized` now has a mean of 0 and a standard deviation of 1 as expected.
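
If you prefer a programmatic check over reading the `describe` output, one possible sketch is shown below. It uses hypothetical toy tables (`toy_train`, `toy_test`) rather than the ROUSes data, and `np.allclose` to allow for tiny floating-point errors.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features (not the ROUSes data)
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 30.0, 50.0, 70.0]})
toy_test = pd.DataFrame({"a": [2.0, 6.0], "b": [20.0, 90.0]})

# Same recipe as above: always use the *train* mean and std
toy_train_std = (toy_train - toy_train.mean()) / toy_train.std()
toy_test_std = (toy_test - toy_train.mean()) / toy_train.std()

# True only if every train column has mean 0 and std 1 (up to rounding error)
train_ok = np.allclose(toy_train_std.mean(), 0) and np.allclose(toy_train_std.std(), 1)
print(train_ok)
print(toy_test_std.mean())  # generally not 0: test rows were shifted by the train mean
```

Note that the standardized test set usually does not have a mean of exactly 0, because it is shifted and scaled with the train statistics. That is expected.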

### 2. Min-max scaling

The second method is called **min-max scaling**. This makes all values in each column fall between 0 and 1. The minimum value becomes 0, the maximum value becomes 1, and everything in between is scaled linearly.

Here is the code that does that. As before, make sure you understand what's going on.

```Python
col_ranges = X_train.max() - X_train.min() # calculate the range per column
X_train_minmax = X_train - X_train.min() # this turns minimum value of each column to 0
X_train_minmax = X_train_minmax/col_ranges # this turns maximum value of each column to 1

X_test_minmax = X_test - X_train.min() # offset by X_train min
X_test_minmax = X_test_minmax/col_ranges # scale by X_train range

X_train_minmax.describe()
```

In [None]:
# copy and run the code
...

Before we move on, check that each feature in `X_train_minmax` now has minimum 0 and maximum 1 as expected.
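
The same kind of check can be sketched for min-max scaling. The tables below are hypothetical toy data, chosen so that the test set contains values outside the range seen in training.

```python
import pandas as pd

# Hypothetical toy features (not the ROUSes data)
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 30.0, 50.0, 70.0]})
toy_test = pd.DataFrame({"a": [2.0, 6.0], "b": [5.0, 40.0]})

toy_min = toy_train.min()                      # per-column minimum of the train set
toy_range = toy_train.max() - toy_train.min()  # per-column range of the train set

toy_train_mm = (toy_train - toy_min) / toy_range
toy_test_mm = (toy_test - toy_min) / toy_range  # scaled with train min and range

print(toy_train_mm.min().tolist(), toy_train_mm.max().tolist())
print(toy_test_mm)
```

The train columns land exactly on the interval from 0 to 1, while test values can fall below 0 or above 1 when a test row is outside the train range.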

### Check the results!

Finally it's time to put everything together. Run the following code to check:

rows: 
- the resulting coefficients (`Age`, `Length`) of the linear regression model, and the
- $R^2$ scores (`Train R2 score`, `Test R2 score`), for

columns:
- the three different versions of feature tables (unnormalized `X_train`, standardized `X_train_standardized`, min-max scaled `X_train_minmax`), and
- the unnormalized coefficients multiplied by the std of the unnormalized features (`Unnormalized * std`)

In [None]:
results = pd.DataFrame(columns = ["Unnormalized", "Standardized", "Min-max scaled"])
# rouses_lr returns a one-column table named 'Coefficient vals'
results['Unnormalized'] = rouses_lr(X_train, X_test, y_train, y_test)['Coefficient vals']
results['Standardized'] = rouses_lr(X_train_standardized, X_test_standardized, y_train, y_test)['Coefficient vals']
results['Min-max scaled'] = rouses_lr(X_train_minmax, X_test_minmax, y_train, y_test)['Coefficient vals']

results
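
One way the `Unnormalized * std` column could be computed is sketched below. This is only an illustration under stated assumptions: the data is synthetic (random numbers standing in for `X_train` and `y_train`), `fit_coefs` is a hypothetical helper, and the fit uses NumPy's least-squares solver rather than the `rouses_lr` function from this lab.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train / y_train (made-up numbers, not ROUSes.csv)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"Age": rng.uniform(1, 20, 50), "Length": rng.uniform(50, 200, 50)})
toy_y = 2.0 * toy_X["Age"] + 0.3 * toy_X["Length"] + rng.normal(0, 1, 50)

def fit_coefs(X, y):
    # Least-squares fit with an intercept; returns one coefficient per column of X
    design = np.column_stack([np.ones(len(X)), X.to_numpy()])
    solution = np.linalg.lstsq(design, y.to_numpy(), rcond=None)[0]
    return pd.Series(solution[1:], index=X.columns)

unnormalized = fit_coefs(toy_X, toy_y)
standardized = fit_coefs((toy_X - toy_X.mean()) / toy_X.std(), toy_y)

comparison = pd.DataFrame({
    "Unnormalized": unnormalized,
    "Standardized": standardized,
    "Unnormalized * std": unnormalized * toy_X.std(),  # rescale by each feature's std
})
print(comparison)
```

In this sketch the last two columns agree up to rounding error. In the notebook itself, the analogous expression would presumably be something like `results.loc[X_train.columns, 'Unnormalized'] * X_train.std()`, assuming `results` and `X_train` are defined as in the cells above.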

I hope this convinces you to normalize your features for your project if you decide to do a predictive model, and especially if you plan to do linear regression.

Although you get the same $R^2$ score regardless of feature normalization here, this is not the case for every dataset and every ML model. Normalization often improves the accuracy of the model.
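
A short sketch of why the scores match for plain linear regression with an intercept: both normalizations replace each feature $x_j$ with $x'_j = (x_j - a_j) / s_j$, where $a_j$ and $s_j$ are constants computed from the train set (mean and std for standardization, min and range for min-max scaling). Any prediction written with the original features can be rewritten with the normalized ones:

$$\hat{y} = b_0 + \sum_j b_j x_j = \Big(b_0 + \sum_j b_j a_j\Big) + \sum_j (b_j s_j)\, x'_j$$

So the best-fitting model makes exactly the same predictions in both cases, which gives the same $R^2$. Only the coefficients change, from $b_j$ to $b_j s_j$.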

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Submit the zip and PDF files to Gradescope under Lab 15.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)