# Multiple linear regression

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
from sklearn.linear_model import LinearRegression

## Load the data

In [2]:
data = pd.read_csv('/home/home02/earshar/data_science/main/data/csv_datasets/multiple_linear_regression.csv')
data.head()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
0,1714,1,2.4
1,1664,3,2.52
2,1760,3,2.54
3,1685,3,2.74
4,1693,2,2.83


In [3]:
data.describe()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
count,84.0,84.0,84.0
mean,1845.27381,2.059524,3.330238
std,104.530661,0.855192,0.271617
min,1634.0,1.0,2.4
25%,1772.0,1.0,3.19
50%,1846.0,2.0,3.38
75%,1934.0,3.0,3.5025
max,2050.0,3.0,3.81


## Create the multiple linear regression

### Declare the dependent and independent variables

- There are two independent variables (`SAT` and `Rand 1,2,3`), and a single dependent variable (`GPA`)

In [4]:
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']

### Regression itself

- We start by creating a linear regression object
- The whole learning process boils down to fitting the regression

In [5]:
reg = LinearRegression()
reg.fit(x,y)

LinearRegression()

### Coefficients of the regression

- The output is an array with one coefficient per feature
- The coefficients follow the column order of `x`: `SAT` first, then `Rand 1,2,3`

In [6]:
reg.coef_

array([ 0.00165354, -0.00826982])

### Intercept of the regression

- Note that the result is a single float, because a regression with one dependent variable has only one intercept

In [7]:
reg.intercept_

0.29603261264909486
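To make the roles of `coef_` and `intercept_` concrete, the sketch below rebuilds a model's predictions by hand as intercept plus the weighted sum of the features. It uses synthetic stand-in data, and the names `X_demo`, `y_demo` and `demo_reg` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumed values, not the notebook's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.5 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

demo_reg = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: intercept + sum over features of (coefficient * feature value)
manual_pred = demo_reg.intercept_ + X_demo @ demo_reg.coef_

# The hand-built predictions should agree with predict() up to floating-point error
print(np.allclose(manual_pred, demo_reg.predict(X_demo)))
```

The same check should work with the notebook's own `reg` and `x` in place of the demo objects.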

### Calculating the R-squared

- `reg.score(x,y)` returns the R-squared of a linear regression (both simple and multiple)
- This method takes two arguments: the inputs and the targets (outputs)

In [8]:
reg.score(x,y)

0.40668119528142843
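As a cross-check on what `score` reports, the R-squared can be computed directly from the residual and total sums of squares. This is a minimal sketch on synthetic data; `X_demo`, `y_demo` and `demo_reg` are hypothetical names, not the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumed values, not the notebook's dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 2))
y_demo = 0.3 + 1.2 * X_demo[:, 0] + rng.normal(scale=0.5, size=60)

demo_reg = LinearRegression().fit(X_demo, y_demo)

# R-squared = 1 - SS_res / SS_tot
residuals = y_demo - demo_reg.predict(X_demo)
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# Should agree with the value returned by score()
print(np.isclose(r2_manual, demo_reg.score(X_demo, y_demo)))
```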

### Adjusted R-squared

- Adjusts the R-squared for the number of variables included in the model, penalizing predictors that add little explanatory power
- Is a fairer basis than the plain R-squared for comparing models with different numbers of predictors

### Formula for Adjusted R-squared

$R^2_{adj.} = 1 - (1-R^2) \times \frac{n-1}{n-p-1}$

- n = 84 (number of observations)
- p = 2 (number of predictors)

In [9]:
x.shape

(84, 2)

In [10]:
r2 = reg.score(x,y)
num_obs = x.shape[0]
num_predictors = x.shape[1]
adj_rsquared = 1 - (1-r2) * ((num_obs-1)/(num_obs - num_predictors -1))
adj_rsquared

0.39203134825134023
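The same calculation can be wrapped in a small helper so it is easy to reuse for other models. The function name `adjusted_r2` is a hypothetical choice, and the guard against a non-positive denominator is an added assumption. The inputs are the R-squared, number of observations and number of predictors reported above.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        # The formula is undefined without at least one residual degree of freedom
        raise ValueError("need n_obs > n_predictors + 1")
    return 1 - (1 - r2) * ((n_obs - 1) / dof)


# Values taken from the cells above: R^2 = 0.40668..., n = 84, p = 2
adj_two_predictors = adjusted_r2(0.40668119528142843, 84, 2)

# Same R^2 with one predictor fewer: the penalty is smaller
adj_one_predictor = adjusted_r2(0.40668119528142843, 84, 1)

print(adj_two_predictors, adj_one_predictor)
```

With the R-squared held fixed, the version with fewer predictors gets the higher adjusted value, which is the penalty for extra variables at work.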

### Feature selection

- Full documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html
- Import `f_regression` from the `feature_selection` module of sklearn
- This module allows us to select the most appropriate features for our regression
- There are many different approaches to feature selection; we will use one of the simplest

In [11]:
from sklearn.feature_selection import f_regression

### We will look into: `f_regression`

- `f_regression` finds the F-statistics for the *simple* regressions created with each of the independent variables. In our case, this means running a simple linear regression on `GPA` where `SAT` is the independent variable, and a simple linear regression on `GPA` where `Rand 1,2,3` is the independent variable. The limitation of this approach is that it does not take into account the mutual effect of the two features.
- There are two output arrays: the first one contains the F-statistics for each of the regressions, and the second one contains the p-values of these F-statistics

In [12]:
f_regression(x,y)

(array([56.04804786,  0.17558437]), array([7.19951844e-11, 6.76291372e-01]))
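To illustrate that these are univariate statistics, the sketch below fits one simple regression per feature and recomputes the F-statistic as R² / (1 − R²) × (n − 2), with the p-value taken from an F distribution with 1 and n − 2 degrees of freedom. It runs on synthetic data with hypothetical names (`X_demo`, `y_demo`) and assumes the default settings of `f_regression`.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumed values): column 0 drives y, column 1 is pure noise
rng = np.random.default_rng(2)
n = 80
X_demo = rng.normal(size=(n, 2))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=n)

f_stats, p_vals = f_regression(X_demo, y_demo)

# Recompute each F-statistic from a separate simple regression
manual_f = []
for j in range(X_demo.shape[1]):
    col = X_demo[:, [j]]  # keep 2D shape for sklearn
    r2_j = LinearRegression().fit(col, y_demo).score(col, y_demo)
    manual_f.append(r2_j / (1 - r2_j) * (n - 2))
manual_f = np.array(manual_f)

# p-values from the upper tail of the F(1, n - 2) distribution
manual_p = stats.f.sf(manual_f, 1, n - 2)

print(np.allclose(f_stats, manual_f), np.allclose(p_vals, manual_p))
```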

### Since we are more interested in the latter (p-values), we can just take the second array

- To evaluate them quickly, we can round the result to three decimal places
- The first p-value refers to the first column of `x`, the second to the second column, and so on
- These are univariate p-values obtained from simple linear models. They do not reflect the interconnection of the features in our multiple linear regression.

In [14]:
p_values = f_regression(x,y)[1]
p_values.round(3)

array([0.   , 0.676])

### Creating a summary table

- As a side note, p-values are one of the best ways to determine if a variable is redundant, but they provide no information about how useful a variable is: a low p-value says nothing about the size of the variable's effect.

In [19]:
# reg_summary = pd.DataFrame(data=['SAT','Rand 1,2,3'], columns=['Features'])
reg_summary = pd.DataFrame(x.columns.values, columns=['Features'])
reg_summary

Unnamed: 0,Features
0,SAT
1,"Rand 1,2,3"


In [20]:
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary

Unnamed: 0,Features,Coefficients,p-values
0,SAT,0.001654,0.0
1,"Rand 1,2,3",-0.00827,0.676
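One way to read the table is to flag features whose p-value exceeds a chosen significance level. The sketch below rebuilds the summary from the values shown above; the 0.05 threshold and the column name `Candidate to drop` are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Values copied from the summary table above
summary = pd.DataFrame({
    'Features': ['SAT', 'Rand 1,2,3'],
    'Coefficients': [0.00165354, -0.00826982],
    'p-values': [0.000, 0.676],
})

alpha = 0.05  # assumed significance level

# Flag features whose univariate p-value is above the threshold
summary['Candidate to drop'] = summary['p-values'] > alpha
print(summary)
```

Under this assumed threshold only `Rand 1,2,3` is flagged. Because the p-values are univariate, the flag is a hint for further checking rather than a final decision.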
