### Multiple Linear Regression

### Import the relevant libraries

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

### Load Data

In [4]:
data = pd.read_csv(r'C:\Users\ronni\OneDrive\Documents\Data Science Course\Part 4 (Jupyter)\Data Sets\1.02.+Multiple+linear+regression.csv')

In [5]:
data.head()

|   | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| 0 | 1714 | 2.4 | 1 |
| 1 | 1664 | 2.52 | 3 |
| 2 | 1760 | 2.54 | 3 |
| 3 | 1685 | 2.74 | 3 |
| 4 | 1693 | 2.83 | 2 |


In [5]:
data.describe()

|   | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| count | 84.0 | 84.0 | 84.0 |
| mean | 1845.27381 | 3.330238 | 2.059524 |
| std | 104.530661 | 0.271617 | 0.855192 |
| min | 1634.0 | 2.4 | 1.0 |
| 25% | 1772.0 | 3.19 | 1.0 |
| 50% | 1846.0 | 3.38 | 2.0 |
| 75% | 1934.0 | 3.5025 | 3.0 |
| max | 2050.0 | 3.81 | 3.0 |


### Create the multiple linear regression

### Declare the dependent and independent variables

In [6]:
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']

### Regression itself

In [7]:
reg = LinearRegression()
reg.fit(x,y)
reg.get_params()

{'copy_X': True,
 'fit_intercept': True,
 'n_jobs': None,
 'normalize': False,
 'positive': False}

In [8]:
reg.coef_

array([ 0.00165354, -0.00826982])

In [9]:
reg.intercept_

0.29603261264909486

### Calculating the R-squared

In [10]:
R_Squared = reg.score(x,y)

### Formula for Adjusted R^2

$R^2_{adj.} = 1 - (1-R^2)\times\frac{n-1}{n-p-1}$

Here $n$ is the number of observations and $p$ is the number of predictors.

In [11]:
x.shape

(84, 2)

In [12]:
n = x.shape[0]

p = x.shape[1]

adj_R_Squared = 1-(1-R_Squared)*((n-1)/(n-p-1))
adj_R_Squared

0.39203134825134023

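Note: Adjusted R^2 is always below R^2 when the model has at least one predictor, because it penalizes every added variable. The gap matters when comparing models: if adding a variable lowers the adjusted R^2, that variable has little or no explanatory power.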
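To make the penalty concrete, here is a minimal sketch on synthetic data (the data, the seed and the helper name `adjusted_r2` are made up for illustration and are not part of the notebook's data set). It fits one model with a single informative feature and another with an extra random feature, then prints R^2 and adjusted R^2 for both. R^2 cannot go down when a predictor is added, while adjusted R^2 rises only if the new predictor helps enough to offset the penalty.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (hypothetical helper)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Synthetic data, made up for illustration: y depends on x1 only.
rng = np.random.default_rng(0)
n = 84
x1 = rng.normal(size=n)
noise_feature = rng.integers(1, 4, size=n)  # random 1, 2 or 3, unrelated to y
y = 2.0 * x1 + rng.normal(size=n)

X_one = x1.reshape(-1, 1)
X_two = np.column_stack([x1, noise_feature])

r2_one = LinearRegression().fit(X_one, y).score(X_one, y)
r2_two = LinearRegression().fit(X_two, y).score(X_two, y)

adj_one = adjusted_r2(r2_one, n, 1)
adj_two = adjusted_r2(r2_two, n, 2)

print("one feature :", r2_one, adj_one)
print("two features:", r2_two, adj_two)
```

Comparing `adj_one` with `adj_two` is the useful check: the plain R^2 values alone would always favour the larger model.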

How do we detect which variables are unneeded in a model?


Feature selection simplifies models, improves speed, and prevents a series of unwanted issues that arise from having too many features.

feature_selection.f_regression

f_regression fits a simple linear regression of the dependent variable on each feature separately and returns the F-statistic and the p-value of each.
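The "one simple regression per feature" idea can be checked with a short sketch. The data below is synthetic and the variable names are assumptions made for illustration. It compares the p-values from `f_regression` with those from `scipy.stats.linregress` run on each column separately; with the default settings of both functions the two should agree.

```python
import numpy as np
from scipy.stats import linregress
from sklearn.feature_selection import f_regression

# Synthetic data, made up for illustration.
rng = np.random.default_rng(1)
n = 84
useful = rng.normal(size=n)    # drives y
useless = rng.normal(size=n)   # unrelated to y
y = 0.5 * useful + rng.normal(scale=0.5, size=n)
X = np.column_stack([useful, useless])

# f_regression returns a tuple: (F-statistics, p-values), one entry per column.
f_stats, p_values = f_regression(X, y)

# The same p-values from one simple linear regression per column.
p_simple = np.array([linregress(X[:, j], y).pvalue for j in range(X.shape[1])])

print(p_values)
print(p_simple)
```

Because each feature is tested on its own, these p-values ignore any overlap between features, which is worth keeping in mind when predictors are correlated.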

### Feature selection

In [13]:
from sklearn.feature_selection import f_regression

In [14]:
f_regression(x,y)

(array([56.04804786,  0.17558437]), array([7.19951844e-11, 6.76291372e-01]))

Note: The first array holds the F-statistics and the second array holds the p-values.

In [16]:
p_values = f_regression(x,y)[1]
p_values

array([7.19951844e-11, 6.76291372e-01])

In [17]:
p_values.round(3)

array([0.   , 0.676])

Note: 'SAT' has a p-value that rounds to 0.000, so it is statistically significant. 'Rand 1,2,3' has a p-value of 0.676, far above the usual 0.05 threshold, so there is no evidence that it helps explain GPA.

### Creating a summary table

In [20]:
reg_summary = pd.DataFrame(data = x.columns.values, columns=['Features'])
reg_summary

|   | Features |
|---|---|
| 0 | SAT |
| 1 | Rand 1,2,3 |


In [21]:
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)

In [22]:
reg_summary

|   | Features | Coefficients | p-values |
|---|---|---|---|
| 0 | SAT | 0.001654 | 0.0 |
| 1 | Rand 1,2,3 | -0.00827 | 0.676 |


Note: P-values are one of the best ways to determine whether a variable is redundant, but they say nothing about how useful a variable is.

### Let's try the same Multiple Regression but with Feature Scaling

### Declare the dependent and independent variables

In [23]:
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']

### Standardization/Feature Scaling

In [24]:
from sklearn.preprocessing import StandardScaler

In [25]:
scaler = StandardScaler()

In [26]:
scaler.fit(x)

StandardScaler()

In [27]:
x_scaled = scaler.transform(x)

In [28]:
x_scaled

array([[-1.26338288, -1.24637147],
       [-1.74458431,  1.10632974],
       [-0.82067757,  1.10632974],
       [-1.54247971,  1.10632974],
       [-1.46548748, -0.07002087],
       [-1.68684014, -1.24637147],
       [-0.78218146, -0.07002087],
       [-0.78218146, -1.24637147],
       [-0.51270866, -0.07002087],
       [ 0.04548499,  1.10632974],
       [-1.06127829,  1.10632974],
       [-0.67631715, -0.07002087],
       [-1.06127829, -1.24637147],
       [-1.28263094,  1.10632974],
       [-0.6955652 , -0.07002087],
       [ 0.25721362, -0.07002087],
       [-0.86879772,  1.10632974],
       [-1.64834403, -0.07002087],
       [-0.03150724,  1.10632974],
       [-0.57045283,  1.10632974],
       [-0.81105355,  1.10632974],
       [-1.18639066,  1.10632974],
       [-1.75420834,  1.10632974],
       [-1.52323165, -1.24637147],
       [ 1.23886453, -1.24637147],
       [-0.18549169, -1.24637147],
       [-0.5608288 , -1.24637147],
       [-0.23361183,  1.10632974],
       [ 1.68156984,

Note: When we ran this model without standardization, we could not compare the effect of each variable on the output, because the two features are on very different scales: 'SAT' ranges from 1634 to 2050 in this sample (on a 600 to 2400 scale), while 'Rand 1,2,3' ranges from 1 to 3.
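As a minimal sketch of what standardization does, the snippet below reproduces `StandardScaler` by hand on a tiny made-up array (the values are illustrative, not taken from the data set): each column has its mean subtracted and is divided by its population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample: column 0 is on an SAT-like scale, column 1 is 1 to 3.
X = np.array([[1700.0, 2.0],
              [1800.0, 1.0],
              [1950.0, 3.0],
              [2050.0, 2.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Manual version: subtract the column mean, divide by the column std.
# np.std uses ddof=0 by default, which matches StandardScaler.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both columns are expressed in standard deviations from their mean, so the fitted weights become comparable.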

### Regression with scaled features

In [31]:
reg = LinearRegression()
reg.fit(x_scaled,y)

LinearRegression()

In [32]:
reg.coef_

array([ 0.17181389, -0.00703007])

In [33]:
reg.intercept_

3.330238095238095

### Creating a summary table

In [38]:
reg_summary = pd.DataFrame([['Bias'],['SAT'],['Rand 1,2,3']],columns=['Features'])
reg_summary['Weights'] = reg.intercept_, reg.coef_[0], reg.coef_[1]

In [39]:
reg_summary

|   | Features | Weights |
|---|---|---|
| 0 | Bias | 3.330238 |
| 1 | SAT | 0.171814 |
| 2 | Rand 1,2,3 | -0.00703 |


Note: "Weights" is the machine learning term for coefficients, and "bias" is the term for the intercept. With standardized features, the larger the absolute value of a weight, the bigger its impact on the prediction.

Note: We can clearly see that 'Rand 1,2,3' barely contributes to our output, if at all. Therefore, it can be removed.
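The link between the unscaled coefficients and the standardized weights can be sketched as follows. The data is synthetic and the names `sat_like` and `rand_like` are assumptions for illustration. For ordinary least squares, each standardized weight equals the raw coefficient multiplied by that feature's standard deviation, and the bias equals the mean of the target. The second point is visible in the tables above: the bias of 3.330238 matches the mean GPA reported by `data.describe()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration.
rng = np.random.default_rng(2)
n = 84
sat_like = rng.normal(1850, 100, size=n)
rand_like = rng.integers(1, 4, size=n).astype(float)
X = np.column_stack([sat_like, rand_like])
y = 0.3 + 0.0017 * sat_like + rng.normal(scale=0.2, size=n)

scaler = StandardScaler().fit(X)
reg_raw = LinearRegression().fit(X, y)
reg_scaled = LinearRegression().fit(scaler.transform(X), y)

# Raw coefficient times the feature's standard deviation gives the weight.
weights_from_raw = reg_raw.coef_ * scaler.scale_

print(reg_scaled.coef_)
print(weights_from_raw)
print(reg_scaled.intercept_, y.mean())
```

This is why a tiny raw coefficient on a wide-ranging feature such as SAT can still correspond to the largest weight.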

### Making predictions with the standardized coefficients (weights)

In [40]:
new_data = pd.DataFrame(data=[[1700,2],[1800,1]], columns=['SAT','Rand 1,2,3'])
new_data

|   | SAT | Rand 1,2,3 |
|---|---|---|
| 0 | 1700 | 2 |
| 1 | 1800 | 1 |


In [41]:
reg.predict(new_data)

array([295.39979563, 312.58821497])

Note: These are not valid GPAs. The model was fitted on x_scaled, so it expects inputs on the same standardized scale.

In [42]:
new_data_scaled = scaler.transform(new_data)
new_data_scaled

array([[-1.39811928, -0.07002087],
       [-0.43571643, -1.24637147]])

In [44]:
reg.predict(new_data_scaled)

array([3.09051403, 3.26413803])

Note: After transforming our new data, the output is correct! GPA makes sense. 
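One way to avoid forgetting the transform step is to bundle the scaler and the regression together. The sketch below uses scikit-learn's `make_pipeline` on synthetic training data (the training values are made up; only the column names and the two new rows mirror the notebook) and checks that the pipeline gives the same predictions as scaling by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data, made up for illustration.
rng = np.random.default_rng(3)
n = 84
train = pd.DataFrame({
    'SAT': rng.integers(1634, 2051, size=n),
    'Rand 1,2,3': rng.integers(1, 4, size=n),
})
gpa = 0.3 + 0.0017 * train['SAT'] + rng.normal(scale=0.2, size=n)

# The pipeline scales the inputs and then fits the regression.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(train, gpa)

new_data = pd.DataFrame(data=[[1700, 2], [1800, 1]], columns=['SAT', 'Rand 1,2,3'])
pipeline_pred = model.predict(new_data)  # raw inputs, scaling happens inside

# The same steps done by hand, as in the cells above.
scaler = StandardScaler().fit(train)
reg = LinearRegression().fit(scaler.transform(train), gpa)
manual_pred = reg.predict(scaler.transform(new_data))

print(pipeline_pred)
print(manual_pred)
```

With the pipeline, raw SAT scores can be passed straight to `predict`, since the fitted scaler is applied automatically.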

### What if we removed the 'Rand 1,2,3' variable? Would our predictions change?

In [46]:
reg_simple = LinearRegression()
x_simple_matrix = x_scaled[:,0].reshape(-1,1)
reg_simple.fit(x_simple_matrix,y)

LinearRegression()

In [47]:
reg_simple.predict(new_data_scaled[:,0].reshape(-1,1))

array([3.08970998, 3.25527879])

Note: The predictions barely changed (both differ by less than 0.01), which is what we expected since the weight of the variable 'Rand 1,2,3' was close to zero.