## EC Notebook for Lecture 30: Linear Regression

This extra credit Python notebook will let you practice the material you saw in lecture.  Completing all parts of this notebook will earn +1 extra credit point to your grade in STAT 107! :)

This notebook is worth +1 if turned in before 11:30 am on **Friday, Nov. 8** *(30 minutes before the next STAT 107 lecture)*.  You can feel free to complete it anytime for extra practice.

## Linear Regression

In today's lecture, we learned the basic idea behind linear regression.

Intuition: Linear regression tries to fit a line as closely as possible to as many points as possible. The most common technique is to fit the line that minimizes the sum of the squared vertical distances from the points to the line. This technique is known as Ordinary Least Squares (OLS) regression.
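
To make the idea of minimizing squared distances concrete, here is a minimal sketch on a tiny made-up dataset. The data values and the helper name `sum_squared_error` are illustrative assumptions, not something from lecture, and `np.polyfit` with `deg=1` is used only as one convenient way to obtain a least-squares line.

```python
import numpy as np

# Made-up data for illustration only (not the diamonds dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_squared_error(slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

# A line chosen by eye
guess_sse = sum_squared_error(slope=1.5, intercept=1.0)

# The least-squares line; for degree 1, polyfit returns [slope, intercept]
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)
ols_sse = sum_squared_error(ols_slope, ols_intercept)

print(f"guessed line SSE: {guess_sse:.4f}")
print(f"OLS line SSE:     {ols_sse:.4f}")
```

Any other choice of slope and intercept should give a sum of squared errors at least as large as the one for the least-squares line.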

In the following, we will use the **diamond** dataset to fit linear regressions.

##  1. Regression Equation

In class, we learned that we can describe our regression with an equation in slope-intercept form, since it is estimated as a straight line. Thus, one way to make predictions is to use the regression equation, i.e., to calculate the slope and intercept from their definitions:

Slope of the regression line: $r \cdot (\text{SD of } y)/(\text{SD of } x)$.  
Intercept of the regression line: $(\text{average of } y) - \text{slope} \cdot (\text{average of } x)$
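
As a quick sanity check of these two formulas, the sketch below applies them to a small made-up dataset and compares the result with a least-squares fit from `np.polyfit`. The data and the variable names (`toy_slope`, `toy_intercept`, and so on) are illustrative assumptions, and this does not use the diamonds data.

```python
import numpy as np

# Made-up data for illustration only (not the diamonds dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample standard deviations (ddof=1 matches the pandas .std() default)
sd_x = x.std(ddof=1)
sd_y = y.std(ddof=1)

# Correlation coefficient r between x and y
r = np.corrcoef(x, y)[0, 1]

# Slope and intercept from the definitions
toy_slope = r * sd_y / sd_x
toy_intercept = y.mean() - toy_slope * x.mean()

# Cross-check against the least-squares line from np.polyfit
check_slope, check_intercept = np.polyfit(x, y, deg=1)
print(toy_slope, check_slope)
print(toy_intercept, check_intercept)
```

The two approaches should agree up to floating-point rounding, as long as the same kind of SD (sample or population) is used for both x and y.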

In the following, please use these formulas to calculate the slope and intercept of the linear regression with `price` as the dependent variable and `depth` as the independent variable.

In [1]:
# Import the diamonds dataset as df
import pandas as pd
df = pd.read_csv('Diamonds.csv')

# Mean and (sample) standard deviation of each variable
mean_depth = df['depth'].mean()
sd_depth = df['depth'].std()
mean_price = df['price'].mean()
sd_price = df['price'].std()

# Correlation coefficient r: sum of the products of the z-scores, divided by n - 1
df['z_depth'] = (df['depth'] - mean_depth) / sd_depth
df['z_price'] = (df['price'] - mean_price) / sd_price
df['z_product'] = df['z_depth'] * df['z_price']
r = df['z_product'].sum() / (len(df['z_product']) - 1)

# x is depth (independent variable); y is price (dependent variable)
sd_x = sd_depth
sd_y = sd_price


In [2]:
slope = r * sd_y / sd_x
intercept = mean_price - slope * mean_depth


In [3]:
## == TEST CASES for 1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(round(slope,5) == -29.64997), "The slope is incorrect."
assert(round(intercept,5) == 5763.66772), "The intercept is incorrect."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")
print()

🎉 All tests passed! 🎉



##  2. Building a Linear Regression in Python

In class, we also learned how to use the scikit-learn (sklearn) Python library to fit a linear regression in Python.


In the following, we still focus on the *diamond* dataset, but we do not compute the slope and intercept from their definitions. Instead, we use scikit-learn directly to fit the linear regression and make predictions!
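
Before working with the diamonds data, here is a minimal sketch of the scikit-learn workflow on a small made-up dataset. The data and the name `toy_model` are illustrative assumptions; `LinearRegression`, `fit`, `coef_`, and `intercept_` are the scikit-learn names used throughout this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only (not the diamonds dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features),
# so a single feature must be reshaped into one column
X = x.reshape(-1, 1)

toy_model = LinearRegression().fit(X, y)

# coef_ holds one slope per feature; intercept_ is a single number
print(toy_model.coef_)
print(toy_model.intercept_)
```

The 2-D requirement is also why a pandas column is usually selected with double brackets (a one-column DataFrame) rather than single brackets (a Series) when it is passed as the features.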

##  2.1.1 Use scikit to fit simple linear regression

In this question, first use scikit-learn to fit a simple linear regression with `price` as the dependent variable and `depth` as the independent variable. Then find the intercept and slope and compare them to those from the previous question.

In [4]:
# Remember to import the library.
from sklearn.linear_model import LinearRegression
model = LinearRegression()

# Double brackets keep the feature as a 2-D DataFrame, as scikit-learn expects
linear_reg = model.fit(df[['depth']], df['price'])
linear_slope = linear_reg.coef_  # array with one slope per feature
linear_intercept = linear_reg.intercept_


##  2.1.2 Use scikit to fit simple linear regression (Continued)

Using the model from the last question, please predict the price of a diamond with a depth of `3.5` using simple linear regression.
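
As a hedged illustration of what `predict` does for a simple linear regression, the sketch below fits a model on made-up data and checks that the prediction equals intercept + slope × x. The data and the names `toy_model`, `new_x`, and `by_hand` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only (not the diamonds dataset)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
toy_model = LinearRegression().fit(X, y)

# predict takes a 2-D input (one row per new observation)
# and returns a 1-D array with one prediction per row
new_x = 6.0
toy_pred = toy_model.predict([[new_x]])

# The same number computed by hand from the fitted equation
by_hand = toy_model.intercept_ + toy_model.coef_[0] * new_x
print(toy_pred[0], by_hand)
```

Because `predict` returns an array, a single prediction is read with index `[0]`.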

In [8]:
# Pass the new value as a DataFrame with the same column name used in fit
linear_pred = linear_reg.predict(pd.DataFrame({'depth': [3.5]}))


In [9]:
## == TEST CASES for 2 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(round(linear_slope[0],5) == -29.64997), "The slope is incorrect."
assert(round(linear_intercept,5) == 5763.66772), "The intercept is incorrect."
assert(round(linear_pred[0],5) == 5659.89282), "The price of 3.5 depth diamond is incorrect."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")
print()

🎉 All tests passed! 🎉



##  2.2.1 Use scikit to fit multiple linear regression

In this question, first use scikit-learn to fit a multiple linear regression with `price` as the dependent variable and `depth`, `table`, and `carat` as the independent variables. Then find the intercept and the coefficient of each independent variable.
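
To show how the coefficients of a multiple regression line up with the input columns, here is a minimal sketch on synthetic data. The feature names `a`, `b`, `c`, the true coefficients, and the name `toy_multi` are all hypothetical choices made for this example; the target is noise-free so the fit should recover the coefficients almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features for illustration only (not the diamonds dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
feature_names = ["a", "b", "c"]  # hypothetical names, one per column of X

# Noise-free target built from known coefficients
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]

toy_multi = LinearRegression().fit(X, y)

# coef_ has one entry per feature, in the same order as the columns of X
coef_by_name = dict(zip(feature_names, toy_multi.coef_))
print(toy_multi.intercept_)
print(coef_by_name)
```

Since `coef_` follows the column order passed to `fit`, indexing it by position only makes sense if you keep track of that order.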

In [10]:
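# Fit a fresh estimator so the simple model stored in linear_reg is not overwritten
# (calling model.fit again would refit the same object in place)
multilinear_reg = LinearRegression().fit(df[['depth', 'table', 'carat']], df['price'])
multilinear_intercept = multilinear_reg.intercept_
# coef_ follows the column order passed to fit: depth, table, carat
depth_coef = multilinear_reg.coef_[0]
table_coef = multilinear_reg.coef_[1]
carat_coef = multilinear_reg.coef_[2]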


In [11]:
## == TEST CASES for 3 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors or output, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(round(depth_coef,5) == -151.23635), "The coefficient of depth is incorrect."
assert(round(table_coef,5) == -104.47278), "The coefficient of table is incorrect."
assert(round(carat_coef,5) == 7858.77051), "The coefficient of carat is incorrect."
assert(round(multilinear_intercept,5) == 13003.44052), "The intercept of the regression model is incorrect."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")
print()

🎉 All tests passed! 🎉



## 3. Submit Your Work

In this notebook:

1. Click **File** -> **Save and Checkpoint** (to save your work)
2. Click **File** -> **Close and Halt** (to exit this notebook)

Follow the instructions on the STAT 107 website to submit your work.