## EC Notebook for Lecture 31: RMSE in Linear Regression

This extra credit Python notebook lets you practice the material you saw in lecture. Completing all parts of this notebook will add +1 extra credit point to your grade in STAT 107! :)

This notebook is worth +1 if turned in before 11:30 am on **Monday, Nov. 11** *(30 minutes before the next STAT 107 lecture)*. Feel free to complete it at any time for extra practice.

## RMSE

In this EC notebook, we will use the `beer-dataset`. Our goals are as follows: <br>
1) Fit a linear regression model of `Carbohydrates` (dependent variable, **y**) on `PercentAlcohol` (independent variable, **x**). <br>
2) Calculate the RMSE of the fitted model using the formula: <br>
$$RMSE=SD_{errors}=(1-r^2)^{1/2}*SD_y$$ <br>
3) Calculate the RMSE directly as the square root of the mean of the squared differences between the predicted and the original values of `Carbohydrates`. We will write a function for this step.

##  1. Linear Model

Load the `beer-dataset`, then fit a linear regression model with `PercentAlcohol` as the independent variable (**x**) and `Carbohydrates` as the dependent variable (**y**). Also, find the slope and the intercept.

In [2]:
import pandas as pd
# load the dataset and store it in df
df = pd.read_csv('beer-dataset.csv')

from sklearn.linear_model import LinearRegression
model = LinearRegression()
# Now, use model.fit
reg = model.fit(df[['PercentAlcohol']],df['Carbohydrates'])
slope = reg.coef_
intercept = reg.intercept_
print(slope)
print(intercept)




[3.03190839]
-3.544086376301504
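
To see what the two printed numbers mean, here is a minimal sketch that plugs them into the fitted line `y = slope * x + intercept`. The helper name `predict_carbs` and the 5% input are illustrative assumptions, not part of the assignment.

```python
# Slope and intercept copied from the printed output above
slope = 3.03190839
intercept = -3.544086376301504

def predict_carbs(percent_alcohol):
    """Predicted Carbohydrates from the fitted line y = slope * x + intercept."""
    return slope * percent_alcohol + intercept

# Hypothetical beer with 5% alcohol (illustrative input, not taken from the dataset)
example_prediction = predict_carbs(5.0)
print(round(example_prediction, 2))
```

Each additional percentage point of alcohol raises the predicted `Carbohydrates` by the slope, about 3.03.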


In [3]:
## == TEST CASES for 1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(round(slope[0],2) == 3.03), "The slope is incorrect."
assert(round(intercept,2) == -3.54), "The intercept is incorrect."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")
print()

🎉 All tests passed! 🎉



##  2. RMSE using the formula

As mentioned above, we will use the following formula to calculate RMSE:
$$RMSE=SD_{errors}=(1-r^2)^{1/2}*SD_y$$
We will need to calculate **r** and $SD_y$ for this purpose. <br>

1) **r** is the correlation between `Carbohydrates` and `PercentAlcohol`. <br>
2) $SD_y$ is the std (standard deviation) of `Carbohydrates`. <br>

A syntax hint: <br>
`df["a"].corr(df["b"])` gives correlation between column `a` and column `b`.
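
As a quick illustration of the syntax, the sketch below applies `.corr()`, `.std()`, and the formula to a tiny made-up table. The DataFrame `toy` and its columns `a` and `b` are hypothetical and are not related to the beer data.

```python
import pandas as pd

# Tiny made-up table (hypothetical values, for syntax practice only)
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 7, 8]})

# Correlation between the two columns
r_toy = toy["a"].corr(toy["b"])

# Standard deviation of the dependent variable (pandas uses ddof=1 by default)
sd_b = toy["b"].std()

# RMSE from the formula (1 - r^2)^(1/2) * SD_y
rmse_toy = (1 - r_toy**2) ** 0.5 * sd_b

print(r_toy, sd_b, rmse_toy)
```

The same three steps apply to the beer data once the right columns are substituted.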


In [9]:
r = df['Carbohydrates'].corr(df['PercentAlcohol'])
# Use std()...but on which column!?
SD_y = df['Carbohydrates'].std()
# Now, use the formula!
RMSE = (1-r**2)**0.5*SD_y
print(RMSE)





3.8585450410889286


In [10]:
## == TEST CASES for 2 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(round(r,2) == 0.42), "The r is incorrect."
assert(round(SD_y,2) == 4.25), "The SD_y is incorrect."
assert(round(RMSE,2) == 3.86), "The RMSE is incorrect."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")
print()

🎉 All tests passed! 🎉



##  3. RMSE with the help of predicted values

First, we will predict values of Carbohydrates based on our linear regression. <br>
Then, we will write a function which takes two columns as input and outputs the root mean squared difference between the two columns. We will use this function to calculate RMSE.
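
Before applying this to the beer data, it can help to check the idea on numbers small enough to verify by hand. The sketch below uses a hypothetical helper `rmse_sketch` and two made-up series; the differences are 0, 0, and -3, so the mean squared difference is 9 / 3 = 3 and the RMSE is the square root of 3.

```python
import pandas as pd

def rmse_sketch(column1, column2):
    """Root of the mean of the squared differences between two columns."""
    difference = column1 - column2
    return (difference ** 2).mean() ** 0.5

# Made-up values chosen so the arithmetic is easy to follow
predicted = pd.Series([1.0, 2.0, 3.0])
actual = pd.Series([1.0, 2.0, 6.0])

toy_rmse = rmse_sketch(predicted, actual)
print(toy_rmse)  # square root of 3
```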

In [14]:
# Use model.predict
df["predicted"] = reg.predict(df[['PercentAlcohol']])

def rmse(column1,column2):
    # Use local variables so that calling the function does not add columns to df
    difference = column1 - column2
    squared_difference = difference**2
    mean = squared_difference.mean()
    return mean**0.5

print(rmse(df["predicted"],df["Carbohydrates"])) # This is our RMSE




3.835780564884537
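
This value (3.8358) is slightly smaller than the formula-based value from Part 2 (3.8585). One likely explanation is that pandas `.std()` divides by n - 1 by default (`ddof=1`), while the RMSE here divides by n. The ratio of the two printed values is about 1.006, roughly what that factor would produce for a dataset on the order of 85 rows, although the row count is not shown here. The sketch below checks the idea on synthetic data; the data, the sample size `n`, and the use of `np.polyfit` in place of `LinearRegression` are all assumptions made to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the beer dataset); n is an arbitrary illustrative size
n = 50
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(5.0, 1.0, size=n))
y = 3.0 * x - 3.5 + pd.Series(rng.normal(0.0, 4.0, size=n))

# Least-squares line via np.polyfit (stand-in for LinearRegression)
slope, intercept = np.polyfit(x.to_numpy(), y.to_numpy(), deg=1)
predicted = slope * x + intercept

# RMSE computed directly from the prediction errors (divides by n)
rmse_direct = ((predicted - y) ** 2).mean() ** 0.5

# RMSE from the formula, with the two choices of standard deviation
r = y.corr(x)
rmse_sample_sd = (1 - r**2) ** 0.5 * y.std()            # ddof=1 (pandas default)
rmse_population_sd = (1 - r**2) ** 0.5 * y.std(ddof=0)  # divides by n

print(rmse_direct, rmse_population_sd, rmse_sample_sd)
```

With `ddof=0` the formula reproduces the direct RMSE, and the default `ddof=1` version is larger by a factor of `(n / (n - 1)) ** 0.5`.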


In [15]:
## == TEST CASES for 3 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any error or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(round(rmse(df["predicted"],df["Carbohydrates"]),2) == 3.84), "The RMSE is incorrect."


## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")
print()

🎉 All tests passed! 🎉



## 4. Submit Your Work

In this notebook:

1. Click **File** -> **Save and Checkpoint** (to save your work)
2. Click **File** -> **Close and Halt** (to exit this notebook)

Follow the instructions on the STAT 107 website to submit your work.