# Lesson 9: Standardization with Multiple Linear Regression (using sklearn)

Standardization means rescaling a feature so that it has a mean of 0 and a standard deviation of 1: subtract the feature's mean from each value, then divide by the feature's standard deviation.
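
As a minimal sketch of this definition, the snippet below standardizes a small array by hand with NumPy. The values are made up for illustration (they are not taken from the lesson's dataset), and the population standard deviation (`ddof=0`, NumPy's default) is assumed.

```python
import numpy as np

# Made-up feature values, for illustration only (not from the lesson's CSV).
feature = np.array([1600.0, 1700.0, 1800.0, 1900.0, 2000.0])

# Standardize: subtract the mean, then divide by the standard deviation.
# np.std uses the population formula (ddof=0) by default.
standardized = (feature - feature.mean()) / feature.std()

print(standardized)
print(standardized.mean(), standardized.std())
```

After this step the array should have a mean of (numerically) 0 and a standard deviation of 1, whatever units the original values were in.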

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

from sklearn.linear_model import LinearRegression

## Load data 

In [4]:
data = pd.read_csv("1.02.Multiple_linear_regression.csv")

In [5]:
data.head()

Unnamed: 0,SAT,GPA,"Rand 1,2,3"
0,1714,2.4,1
1,1664,2.52,3
2,1760,2.54,3
3,1685,2.74,3
4,1693,2.83,2


In [6]:
data.describe()

Unnamed: 0,SAT,GPA,"Rand 1,2,3"
count,84.0,84.0,84.0
mean,1845.27381,3.330238,2.059524
std,104.530661,0.271617,0.855192
min,1634.0,2.4,1.0
25%,1772.0,3.19,1.0
50%,1846.0,3.38,2.0
75%,1934.0,3.5025,3.0
max,2050.0,3.81,3.0


## Create the multiple linear regression 

### Declare the independent and dependent variables 

In [8]:
x = data[["SAT","Rand 1,2,3"]]
y = data["GPA"]

### Standardization

In [36]:
from sklearn.preprocessing import StandardScaler

In [37]:
scaler = StandardScaler()                    # Create a StandardScaler object (nothing is computed yet).

In [45]:
scaler.fit(x)           # This computes and stores the mean and std of each column of x.
                        # Nothing is scaled yet; fit only defines the scaling mechanism.

StandardScaler()
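
To show what `fit` actually stores, here is a small sketch on made-up numbers (the array `X` is an assumption, not the lesson's data). `fit` only learns the per-column mean and standard deviation, exposed as `mean_` and `scale_`, and `transform` then applies them. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas' `describe()` reports the sample standard deviation (`ddof=1`), so the two values differ slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up two-column data, for illustration only.
X = np.array([[1600.0, 1.0],
              [1700.0, 3.0],
              [1800.0, 2.0],
              [2000.0, 3.0]])

scaler = StandardScaler()
scaler.fit(X)  # learns the column means and standard deviations; X itself is not changed

print(scaler.mean_)   # per-column means
print(scaler.scale_)  # per-column population standard deviations (ddof=0)

# transform() applies (X - mean_) / scale_ column by column.
manual = (X - scaler.mean_) / scaler.scale_
print(np.allclose(scaler.transform(X), manual))
```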

In [53]:
x_scaled = scaler.transform(x)         # Here we apply the fitted scaler to transform our raw data into standardized data.
x_scaled

array([[-1.26338288, -1.24637147],
       [-1.74458431,  1.10632974],
       [-0.82067757,  1.10632974],
       [-1.54247971,  1.10632974],
       [-1.46548748, -0.07002087],
       [-1.68684014, -1.24637147],
       [-0.78218146, -0.07002087],
       [-0.78218146, -1.24637147],
       [-0.51270866, -0.07002087],
       [ 0.04548499,  1.10632974],
       [-1.06127829,  1.10632974],
       [-0.67631715, -0.07002087],
       [-1.06127829, -1.24637147],
       [-1.28263094,  1.10632974],
       [-0.6955652 , -0.07002087],
       [ 0.25721362, -0.07002087],
       [-0.86879772,  1.10632974],
       [-1.64834403, -0.07002087],
       [-0.03150724,  1.10632974],
       [-0.57045283,  1.10632974],
       [-0.81105355,  1.10632974],
       [-1.18639066,  1.10632974],
       [-1.75420834,  1.10632974],
       [-1.52323165, -1.24637147],
       [ 1.23886453, -1.24637147],
       [-0.18549169, -1.24637147],
       [-0.5608288 , -1.24637147],
       [-0.23361183,  1.10632974],
       [ 1.68156984,

## Regression with scaled features 

In [42]:
reg = LinearRegression()
reg.fit(x_scaled,y)

LinearRegression()

In [43]:
reg.coef_

array([ 0.17181389, -0.00703007])

In [44]:
reg.intercept_

3.330238095238095

### Creating a summary table 

In [47]:
reg_summary = pd.DataFrame([["Bias"],["SAT"],["Rand 1,2,3"]], columns=["Features"])
reg_summary["Weights"] = reg.intercept_, reg.coef_[0], reg.coef_[1]

# In the line above I used "weights", which is just coefficients in machine learning language. Similarly,
# "bias" is used instead of "intercept".

reg_summary

Unnamed: 0,Features,Weights
0,Bias,3.330238
1,SAT,0.171814
2,"Rand 1,2,3",-0.00703


In the standardized version of the analysis, the larger the magnitude of a coefficient, the more important
the corresponding feature is. In the example above, the feature "Rand 1,2,3" is not important, as its weight is close to 0.
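
This comparison can be illustrated on synthetic data. In the sketch below, the data, the seed, and the names `signal` and `noise` are all made up: the target depends on `signal` only, so after standardization the weight on `noise` should be much smaller in magnitude.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic features on very different scales (hypothetical data).
signal = rng.normal(loc=1800, scale=100, size=200)   # drives the target
noise = rng.integers(1, 4, size=200).astype(float)   # random 1, 2 or 3; unrelated to the target
target = 0.002 * signal + rng.normal(scale=0.1, size=200)

X = np.column_stack([signal, noise])
X_scaled = StandardScaler().fit_transform(X)

model = LinearRegression().fit(X_scaled, target)

# With standardized inputs, the weights are directly comparable in magnitude.
for name, weight in zip(["signal", "noise"], model.coef_):
    print(f"{name}: {weight:.4f}")
```

Without standardization the raw coefficients would not be comparable this way, because each one is expressed in the units of its own feature.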

## Making predictions with weights 

In [49]:
new_data = pd.DataFrame(data=[[1700,2],[1800,1]],columns=["SAT","Rand 1,2,3"])
new_data

Unnamed: 0,SAT,"Rand 1,2,3"
0,1700,2
1,1800,1


In [50]:
reg.predict(new_data)



array([295.39979563, 312.58821497])

The above result is nonsense: the model was trained on standardized inputs, but new_data is still in raw units. We have 
to use the same "scaler" object (already fitted on the training data) to standardize new_data before predicting.

In [55]:
new_data_scaled = scaler.transform(new_data)
new_data_scaled

array([[-1.39811928, -0.07002087],
       [-0.43571643, -1.24637147]])

In [60]:
reg.predict(new_data_scaled)

array([3.09051403, 3.26413803])
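
One way to avoid forgetting the scaling step is to bundle the scaler and the regression into a single scikit-learn pipeline, so that `predict` standardizes new inputs automatically. This is an alternative to the manual approach used in this lesson, sketched on made-up data (the arrays below are assumptions, not the lesson's CSV).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: two features and a target (illustration only).
X_train = np.array([[1600.0, 1.0], [1700.0, 3.0], [1800.0, 2.0],
                    [1900.0, 1.0], [2000.0, 3.0]])
y_train = np.array([2.9, 3.1, 3.3, 3.5, 3.7])

# The pipeline fits the scaler on the training data,
# then fits the regression on the scaled data.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Raw (unscaled) new inputs can be passed directly; the pipeline scales them first.
X_new = np.array([[1700.0, 2.0], [1800.0, 1.0]])
pipeline_pred = model.predict(X_new)

# Same result as scaling and predicting by hand.
scaler = StandardScaler().fit(X_train)
manual_reg = LinearRegression().fit(scaler.transform(X_train), y_train)
manual_pred = manual_reg.predict(scaler.transform(X_new))

print(np.allclose(pipeline_pred, manual_pred))
```

The design choice here is convenience and safety: the scaler fitted on the training data travels with the model, so raw inputs cannot be passed to the regression by mistake.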

## What if we removed the "Rand 1,2,3" variable? 

In [57]:
reg_simple = LinearRegression()
x_simple_matrix = x_scaled[:,0].reshape(-1,1)
reg_simple.fit(x_simple_matrix,y)

LinearRegression()

In [59]:
reg_simple.predict(new_data_scaled[:,0].reshape(-1,1))

array([3.08970998, 3.25527879])