# Multiple linear regression

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load the data

In [2]:
data = pd.read_excel('catalyst_efficiency_multiple.xlsx')
data.head()

|   | efficiency | size | age |
|---|---|---|---|
| 0 | 28.117697 | 6.4309 | 41 |
| 1 | 27.429783 | 6.5622 | 28 |
| 2 | 33.79516 | 4.8729 | 28 |
| 3 | 48.150673 | 15.0475 | 33 |
| 4 | 55.040911 | 12.7546 | 34 |


In [3]:
data.describe()

|   | efficiency | size | age |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 35.074736 | 8.530242 | 43.3 |
| std | 9.246207 | 2.97942 | 11.073218 |
| min | 18.513855 | 4.7975 | 23.0 |
| 25% | 28.113618 | 6.4333 | 34.0 |
| 50% | 33.670886 | 6.96405 | 44.0 |
| 75% | 40.286844 | 10.293225 | 54.0 |
| max | 60.081735 | 18.4251 | 62.0 |


## Create the multiple linear regression

### Declare the dependent and the independent variables

In [4]:
y = data['efficiency'] # target
x = data[['size', 'age']] # features
x.shape

(100, 2)

### Regression

In [6]:
reg = LinearRegression()
reg.fit(x,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Coefficients

In [8]:
reg.coef_

array([ 2.68498611,  0.03506309])

### Intercept

In [9]:
reg.intercept_

10.652923498249017
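
The coefficients and the intercept together define the fitted equation: efficiency ≈ intercept + b_size · size + b_age · age. As a minimal sketch, the snippet below hard-codes the values printed above and applies them to the first row shown by `data.head()`. The helper name `predict_efficiency` is hypothetical and not part of scikit-learn.

```python
import numpy as np

# Values copied from the reg.coef_ and reg.intercept_ outputs above
coef = np.array([2.68498611, 0.03506309])  # order matches x: [size, age]
intercept = 10.652923498249017

def predict_efficiency(size, age):
    """Hypothetical helper: evaluate the fitted linear equation by hand."""
    return intercept + coef @ np.array([size, age])

# First row of data.head(): size = 6.4309, age = 41
manual_prediction = predict_efficiency(6.4309, 41)
print(manual_prediction)
```

With the fitted model still in memory, `reg.predict(x.iloc[[0]])` should give the same number, up to the rounding of the copied coefficients.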

### R-squared

In [11]:
reg.score(x,y)

0.74649760476908522
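
For a regressor, `reg.score(x, y)` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below recomputes that quantity by hand. It uses a small synthetic dataset and a plain NumPy least-squares fit, so the data, the seed, and the helper name `r_squared` are assumptions for illustration, not the catalyst data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption): two features, linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=100)

# Ordinary least squares with an explicit intercept column
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

r2_manual = r_squared(y, y_hat)
print(r2_manual)
```

On the notebook's own data, `r_squared(y, reg.predict(x))` should agree with `reg.score(x, y)`.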

### Adjusted R-squared

The formula for adjusted R-squared is

$R^2_{adj.} = 1 - (1-R^2)\cdot\frac{n-1}{n-p-1}$

where

$R^2$ is our usual R-squared value,

$n$ is the number of observations (also called samples in machine learning terminology), and

$p$ is the number of features.


Adjusted R-squared for multiple regression

In [11]:
r2 = 0.74649760476908522 # R-squared returned by reg.score(x,y) above
n = 100
p = 2

adj_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adj_r2

0.7412707512591694
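
The same calculation can be wrapped in a small function so that the three inputs are not retyped each time. This is only a sketch: the name `adjusted_r2` is hypothetical, and the R-squared value is copied from the output above.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p features."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R^2 needs n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared of the multiple regression, copied from the output above
adj_multiple = adjusted_r2(0.74649760476908522, n=100, p=2)
print(adj_multiple)
```

With the fitted model in memory, `adjusted_r2(reg.score(x, y), *x.shape)` avoids hard-coding anything, because `x.shape` is `(n, p)`.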

Adjusted R-squared for simple regression

In [36]:
r2 = 0.74473918658475857 # R-squared of the simple (one-feature) regression
n = 100
p = 1

adj_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adj_r2

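0.742134484407052

The adjusted R-squared of the simple regression (0.7421) is slightly higher than that of the multiple regression (0.7413). Once the extra feature is penalized, adding a second feature does not improve the model.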

### Calculate the univariate p-values of the variables

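Here's an activity: familiarize yourself with the F-statistic and the concept of p-values. This webpage explains the F-statistic: https://online.stat.psu.edu/stat501/lesson/6/6.2

Note that `f_regression` tests each feature separately, as if it were the only feature in a simple regression. The resulting p-values are therefore univariate and do not account for the other features.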

In [28]:
from sklearn.feature_selection import f_regression

In [29]:
f_regression(x,y)

(array([  2.85921052e+02,   1.15223820e-03]),
 array([  8.12763222e-31,   9.72990343e-01]))

In [30]:
p_values = f_regression(x,y)[1]
p_values

array([  8.12763222e-31,   9.72990343e-01])

In [31]:
p_values.round(3)

array([ 0.   ,  0.973])
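
To see where such a p-value comes from, here is a minimal sketch of a univariate F-test for a single feature. It assumes the usual centered setting: with $r$ the Pearson correlation between the feature and the target, $F = \frac{r^2}{1-r^2}(n-2)$, and the p-value is the upper tail of an F distribution with $(1, n-2)$ degrees of freedom. The synthetic data and the helper name `univariate_f_test` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def univariate_f_test(feature, target):
    """F-statistic and p-value for a one-feature linear regression."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(target)
    r = np.corrcoef(feature, target)[0, 1]  # Pearson correlation
    dof = n - 2                             # slope and intercept are estimated
    f_stat = r ** 2 / (1 - r ** 2) * dof
    p_value = stats.f.sf(f_stat, 1, dof)    # upper-tail probability
    return f_stat, p_value

# Synthetic data (assumption): one informative and one unrelated feature
rng = np.random.default_rng(42)
informative = rng.normal(size=100)
irrelevant = rng.normal(size=100)
target = 2.0 * informative + rng.normal(size=100)

f_inf, p_inf = univariate_f_test(informative, target)
f_irr, p_irr = univariate_f_test(irrelevant, target)
print(f_inf, p_inf)
print(f_irr, p_irr)
```

A p-value near zero, like the one for `size` above, means the feature on its own is strongly linearly related to the target. A large one, like 0.973 for `age`, means there is no evidence of such a relationship.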

### Create a summary table with your findings

In [32]:
reg_summary = pd.DataFrame(data = x.columns.values, columns=['Features'])
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary

|   | Features | Coefficients | p-values |
|---|---|---|---|
| 0 | size | 2.684986 | 0.0 |
| 1 | age | 0.035063 | 0.973 |
