<center> <img src="res/ds3000.png"> </center>

<center> <h1> Week 11 - Day 1 </h1> </center>

<center> <h2> Part 2: Multiple Linear Regression</h2></center>

## Outline
1. <a href='#1'>Multiple Linear Regression</a>
2. <a href='#2'>Data Preparation</a>
3. <a href='#3'>Training the Regression Model</a>
4. <a href='#4'>Testing the Model</a>
5. <a href='#5'>Regression Model Metrics</a>
6. <a href='#6'>Visualizing the Expected vs. Predicted Prices</a>

<a id="1"></a>

## 1. Multiple Linear Regression
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

### 1.1. California Housing Dataset
* [**California Housing dataset**](http://lib.stat.cmu.edu/datasets) bundled with scikit-learn 
* **Larger real-world dataset** 
    * **20,640 samples**, each with **eight numerical features**
	* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html 
* Perform **multiple linear regression** using **all eight numerical features** 
    * Make **more sophisticated housing price predictions** than if we were to use only a **single feature** or a **subset of the features**
* **`LinearRegression`** estimator performs **multiple linear regression** by default

* According to the California Housing Prices dataset’s description in scikit-learn
> "This dataset was **derived from the 1990 U.S. census**, using **one row per census block group**.  
>  
> "A **block group** is the **smallest geographical unit** for which the U.S. Census Bureau publishes sample data (typically has a **population of 600 to 3,000 people**)."

* The dataset has **20,640 samples**—**one per block group**—with **eight features** each:
	* **median income**—in tens of thousands, so 8.37 would represent $83,700
	* **median house age**—in the dataset, the maximum value for this feature is 52
	* **average number of rooms** 
	* **average number of bedrooms** 
	* **block population**
	* **average house occupancy**
	* **house block latitude**
	* **house block longitude**

* **Target** &mdash; **median house value** in hundreds of thousands, so 3.55 would represent \$355,000
    * **Maximum** value of the target is **5**, representing **\$500,000** 
* Reasonable to expect **more bedrooms**, **more rooms** or **higher income** would mean **higher house value**
* **Combine all numeric features to make predictions**
    * More likely to get **more accurate predictions** than with simple linear regression
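The unit conventions above can be checked with a short sketch. The helper names `income_to_dollars` and `value_to_dollars` are hypothetical and are not part of scikit-learn; the scale factors are taken from the feature and target descriptions in this section.

```python
# Hypothetical helpers (not part of scikit-learn) that convert the
# dataset's scaled units back to dollars.

def income_to_dollars(med_inc):
    """Median income is stored in tens of thousands of dollars."""
    return round(med_inc * 10_000)


def value_to_dollars(med_value):
    """Median house value is stored in hundreds of thousands of dollars."""
    return round(med_value * 100_000)


print(income_to_dollars(8.37))  # 83700
print(value_to_dollars(3.55))   # 355000
print(value_to_dollars(5))      # 500000
```

Rounding is used because the scaled values are floating-point numbers, so the raw product may be off by a tiny fraction of a dollar.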

In [None]:
import pandas as pd
df = pd.read_csv('res/boston_weather.csv')
df.columns = ["Date", "Temperature", "Anomaly"]
df.head()

### 1.2. Loading the Dataset
* Use the `fetch_california_housing` function from `sklearn.datasets`

In [None]:
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()  # Bunch object

In [None]:
print(california.DESCR)

* Confirm number of samples/features, number of targets, feature names

In [None]:
california.data.shape

In [None]:
california.target.shape

In [None]:
california.feature_names

<a id="2"></a>

## 2.  Data Preparation

In [None]:
import pandas as pd

df = pd.DataFrame(california.data, columns=california.feature_names)
df["Value"] = california.target

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
features = df.drop("Value", axis=1)

In [None]:
features.head()

In [None]:
target = df["Value"]
target[:5]

### 2.1. Visualizing the Dataset

In [None]:
%matplotlib notebook

In [None]:
sample_df = df.sample(frac=0.1, random_state=3000)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

for feature in california.feature_names:
    plt.figure(figsize=(8, 4.5))  # 8"-by-4.5" Figure
    sns.scatterplot(data=sample_df, x=feature, y='Value', hue='Value', palette='cool', legend=False)

<a id="3"></a>

## 3. Training the Regression Model
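* Use the `LinearRegression` estimator
* `LinearRegression` fits by **ordinary least squares**: it chooses the coefficients and intercept that minimize the sum of the squared differences between the observed target values and the model's predictions
* With more than one feature, the fitted model is a plane (or hyperplane) rather than a line, and there is no learning rate or iteration count to tune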
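As a minimal sketch of a least-squares fit, the example below recovers known coefficients from synthetic, noise-free data. It uses `numpy.linalg.lstsq` as a stand-in solver; the data, the variable names and the choice of solver are assumptions for illustration, not a description of scikit-learn's internals.

```python
import numpy as np

# Synthetic, noise-free data (an assumption for illustration):
# y = 2.0 * x1 - 3.0 * x2 + 5.0
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0

# Append a column of ones so the intercept is estimated with the coefficients.
X_design = np.column_stack([X, np.ones(len(X))])

# Find w that minimizes the sum of squared residuals ||X_design @ w - y||^2.
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coefficients, intercept = w[:2], w[2]

print(coefficients, intercept)
```

Because the data contain no noise, the recovered coefficients and intercept should match the values used to generate `y` up to floating-point error.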

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


#split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)

#select a regression estimator and create the model by fitting the training data
model = LinearRegression().fit(X=X_train, y=y_train)


### 3.1. Regression  Equation
* Once the model is fitted, the estimator stores one **coefficient** per feature and a single **intercept** 
* We can make **predictions** with 


\begin{equation}
y = m_1 x_1 + m_2 x_2 + ... + m_n x_n + b
\end{equation}


* <em>m</em><sub>1</sub>, <em>m</em><sub>2</sub>, …, <em>m</em><sub><em>n</em></sub> are the **feature coefficients** (stored in **`coef_`** attribute)
* <em>b</em> is the **intercept** (stored in **`intercept_`** attribute)
* <em>x</em><sub>1</sub>, <em>x</em><sub>2</sub>, …, <em>x</em><sub><em>n</em></sub> are **feature values**
* <em>y</em> is the **predicted value**
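The equation above can be evaluated by hand. The sketch below uses made-up coefficients, intercept and feature values purely for illustration; with a fitted model, the same numbers would come from `coef_` and `intercept_`.

```python
# Made-up values for illustration only; a fitted model would supply
# these through its coef_ and intercept_ attributes.
coefficients = [0.5, -0.25, 2.0]  # m_1, m_2, m_3
intercept = 1.0                   # b
x = [4.0, 8.0, 0.5]               # x_1, x_2, x_3

# y = m_1*x_1 + m_2*x_2 + m_3*x_3 + b
y = sum(m * xi for m, xi in zip(coefficients, x)) + intercept
print(y)  # 2.0
```

This is the same arithmetic that `predict` performs for each row of feature values.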

In [None]:
model.coef_

In [None]:
intercept = model.intercept_
intercept

In [None]:
for i, name in enumerate(california.feature_names):
    print(f'{name}: {model.coef_[i]}')  

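#### The equation:
* Coefficients are rounded; the exact fitted values depend on the training/testing split

\begin{equation}
\begin{aligned}
\text{MedianHouseValue} &= 0.447\,\text{MedInc} + 0.010\,\text{HouseAge} - 0.122\,\text{AveRooms} + 0.727\,\text{AveBedrms} \\
&\quad - 0.000007\,\text{Population} - 0.004\,\text{AveOccup} - 0.420\,\text{Latitude} - 0.431\,\text{Longitude} - 36.587
\end{aligned}
\end{equation}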

<a id="4"></a>

## 4. Testing the Model
* Test the model using the data in **`X_test`** and check some of the **predictions**

In [None]:
predicted = model.predict(X_test)

In [None]:
expected = y_test

In [None]:
predicted[:5]

In [None]:
expected[:5]

In [None]:
for p, e in zip(predicted[:5], expected[:5]):  
    print(f'predicted: {p:.2f}, expected: {e:.2f}')

In [None]:
results_df = pd.DataFrame(expected.values, columns=["expected"])

In [None]:
results_df["predicted"] = predicted

In [None]:
results_df

<a id="5"></a>

## 5. Regression Model Metrics
* Coefficient of Determination
* Mean Squared Error

### 5.1. Coefficient of Determination (**$R^{2}$ score**; at most 1.0)
* Use `r2_score()`
    * **1.0**: estimator **perfectly predicts** the **target variable's value**, given predictor variables' values
    * **0.0**: model does **no better than always predicting the mean** of the target 
    * **Below 0.0**: model does **worse** than always predicting the mean
* Calculate with arrays representing the **expected** and **predicted results**
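To make the score concrete, the sketch below computes $R^{2}$ by hand as $1 - SS_{res}/SS_{tot}$ on small made-up arrays. The function name `r_squared` and the sample values are assumptions for illustration. Note that the score can fall below 0 when predictions are worse than simply predicting the mean.

```python
def r_squared(expected, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(expected) / len(expected)
    ss_res = sum((e - p) ** 2 for e, p in zip(expected, predicted))
    ss_tot = sum((e - mean) ** 2 for e in expected)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
expected = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]    # matches expected exactly
mean_only = [2.5, 2.5, 2.5, 2.5]  # always predicts the mean
reversed_ = [4.0, 3.0, 2.0, 1.0]  # worse than predicting the mean

print(r_squared(expected, perfect))    # 1.0
print(r_squared(expected, mean_only))  # 0.0
print(r_squared(expected, reversed_))  # -3.0
```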

In [None]:
from sklearn.metrics import r2_score
#TODO in class

### 5.2. Mean Squared Error (**$MSE$**; 0.0 or greater)
* Use `mean_squared_error()`
    * The closer MSE is to 0, the closer the predictions are to the actual values
    * MSE has no upper bound and is expressed in squared units of the target
* Calculate with arrays representing the **expected** and **predicted results**
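As a hand-computed sketch, MSE is the average of the squared differences between expected and predicted values. The function name `mse` and the sample arrays are made up for illustration; the second case shows that MSE can exceed 1.

```python
def mse(expected, predicted):
    """Mean of the squared differences between expected and predicted."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)


# Made-up values for illustration only.
expected = [1.0, 2.0, 3.0, 4.0]
close = [1.5, 2.0, 2.5, 4.0]  # small errors
far = [3.0, 0.0, 5.0, 2.0]    # every prediction is off by 2

print(mse(expected, close))  # 0.125
print(mse(expected, far))    # 4.0
```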

In [None]:
from sklearn.metrics import mean_squared_error
#TODO in class

<a id="6"></a>

## 6. Visualizing the Expected vs. Predicted Prices

In [None]:
import plotly.express as px
import plotly.graph_objects as go
#produce the scatter plot
graph = px.scatter(results_df, x="expected", y="predicted", template="none", color="predicted", opacity=.7)

#add the "perfect prediction" line; this is not the regression line
graph.update_layout(
    
    shapes=[    
        go.layout.Shape(
            type="line",
            x0=0, y0=0,
            x1=5, y1=5,
            line=dict(color="coral", width=2, dash="dash")
        )
    ]
)

#need to change axes limits; otherwise, plotly will auto-scale, leading to confusion
graph.update_layout(xaxis = dict(range = [0,6]))
graph.update_layout(yaxis = dict(range = [0,6]))

graph.show()