# Multiple linear regression and adjusted R-squared

###### Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable).
    
<img src="1.jpg">
    
###### For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance and gender. Alternatively, you could use multiple regression to understand whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoker type, income and gender.

###### Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.

## Quiz
Question:
Why do we prefer using a multiple linear regression model to a simple linear regression model?

    a) Easier to compute
    b) Having more independent variables makes the graphical representation better
    c) More realistic - things often depend on 2,7,10 or even more factors
    d) None of the above

## Adjusted R-squared

1. R-squared measures how much of the total variability is explained by our model.
2. Multiple regressions always have an R-squared at least as high as the simple ones: with each additional variable you add, the explanatory power may increase or stay the same, but it never decreases. That alone does not make the larger model better.

Adjusted R-squared is almost always smaller than R-squared, because it penalizes each additional variable.
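
The penalty can be made concrete with the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below is a minimal illustration, not part of the original notebook; the function name `adjusted_r_squared` is hypothetical, and the inputs are the rounded values that appear in this notebook's regression summary.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model that includes an intercept."""
    dof = n_obs - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Values reported in this notebook's regression summary:
# R-squared = 0.407, 84 observations, 2 predictors (SAT and Rand 1,2,3)
adj = adjusted_r_squared(0.407, n_obs=84, n_predictors=2)
print(round(adj, 3))  # 0.392, matching the Adj. R-squared in the summary
```

Because the factor (n - 1)/(n - p - 1) is greater than 1 whenever p is at least 1, the adjusted value sits below R-squared for any model with R-squared under 1.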

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn
seaborn.set()

## Load the data

In [2]:
# Load the data from a .csv in the same folder
data = pd.read_csv('1.02. Multiple linear regression.csv')

In [3]:
data.head()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| 0 | 1714 | 2.4 | 1 |
| 1 | 1664 | 2.52 | 3 |
| 2 | 1760 | 2.54 | 3 |
| 3 | 1685 | 2.74 | 3 |
| 4 | 1693 | 2.83 | 2 |


## About the data

I've added another variable, Rand 1,2,3, which randomly assigns 1, 2, or 3 to each student.

By construction, this variable cannot predict college GPA, so we expect it to be statistically insignificant.

So this is our new model:

$$\text{GPA} = b_0 + b_1 \cdot \text{SAT} + b_2 \cdot \text{Rand 1,2,3}$$

In [4]:
# Let's check what's inside this data frame
data.head(9)

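| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| 0 | 1714 | 2.4 | 1 |
| 1 | 1664 | 2.52 | 3 |
| 2 | 1760 | 2.54 | 3 |
| 3 | 1685 | 2.74 | 3 |
| 4 | 1693 | 2.83 | 2 |
| 5 | 1670 | 2.91 | 1 |
| 6 | 1764 | 3.0 | 2 |
| 7 | 1764 | 3.0 | 1 |
| 8 | 1792 | 3.01 | 2 |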


In [5]:
data.shape

(84, 3)

## Descriptive statistics of data frame

In [6]:
data.describe()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| count | 84.0 | 84.0 | 84.0 |
| mean | 1845.27381 | 3.330238 | 2.059524 |
| std | 104.530661 | 0.271617 | 0.855192 |
| min | 1634.0 | 2.4 | 1.0 |
| 25% | 1772.0 | 3.19 | 1.0 |
| 50% | 1846.0 | 3.38 | 2.0 |
| 75% | 1934.0 | 3.5025 | 3.0 |
| max | 2050.0 | 3.81 | 3.0 |


## Create your first multiple regression

In [8]:
# Following the regression equation, our dependent variable (y) is the GPA
y = data['GPA']
# Our independent variables (x1) are the SAT score and the random variable 'Rand 1,2,3'
x1 = data[['SAT', 'Rand 1,2,3']]

In [9]:
# Add a constant. Essentially, we are adding a new column (equal in length to x1), which consists only of 1s
x = sm.add_constant(x1)
# Fit the model, according to the OLS (ordinary least squares) method with a dependent variable y and independent variables x
results = sm.OLS(y, x).fit()

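
To see why the column of 1s matters, the two steps above can be mimicked with plain NumPy. This is only a sketch under stated assumptions: the data are synthetic and noise-free (not the SAT/GPA file), and `np.linalg.lstsq` stands in for the statsmodels fit so that the role of the constant column is visible.

```python
import numpy as np

# Hypothetical noise-free data so the fit is easy to check:
# y = 1.0 + 2.0 * x1 - 0.5 * x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 0.5 * x2

# Same idea as sm.add_constant: prepend a column of 1s,
# whose coefficient becomes the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: choose b to minimise ||y - X b||^2
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coefs, 3))  # approximately b0 = 1.0, b1 = 2.0, b2 = -0.5
```

Without the column of 1s the fitted plane would be forced through the origin, and the slopes would absorb the missing intercept.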


In [10]:
# Print a nice summary of the regression.
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | GPA | R-squared: | 0.407 |
| Model: | OLS | Adj. R-squared: | 0.392 |
| Method: | Least Squares | F-statistic: | 27.76 |
| Date: | Fri, 13 Dec 2019 | Prob (F-statistic): | 6.58e-10 |
| Time: | 07:47:53 | Log-Likelihood: | 12.72 |
| No. Observations: | 84 | AIC: | -19.44 |
| Df Residuals: | 81 | BIC: | -12.15 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2960 | 0.417 | 0.710 | 0.480 | -0.533 | 1.125 |
| SAT | 0.0017 | 0.000 | 7.432 | 0.000 | 0.001 | 0.002 |
| Rand 1,2,3 | -0.0083 | 0.027 | -0.304 | 0.762 | -0.062 | 0.046 |

| | | | |
|---|---|---|---|
| Omnibus: | 12.992 | Durbin-Watson: | 0.948 |
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 16.364 |
| Skew: | -0.731 | Prob(JB): | 0.00028 |
| Kurtosis: | 4.594 | Cond. No. | 33300.0 |
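
The coefficient table can be read as a prediction rule. The sketch below plugs the rounded coefficients from the summary into the model equation; the helper `predict_gpa` and the SAT score of 1700 are illustrative assumptions, and because the coefficients are rounded to four decimals the predictions are only approximate.

```python
# Coefficients copied from the summary table (rounded there to 4 decimals)
B0_CONST = 0.2960
B1_SAT = 0.0017
B2_RAND = -0.0083


def predict_gpa(sat: float, rand_value: int) -> float:
    """GPA = b0 + b1 * SAT + b2 * Rand 1,2,3, using the fitted coefficients."""
    return B0_CONST + B1_SAT * sat + B2_RAND * rand_value


# Hypothetical student with SAT = 1700, for each possible random label
for rand_value in (1, 2, 3):
    print(rand_value, round(predict_gpa(1700, rand_value), 3))
```

Moving the random label from 1 to 3 shifts the predicted GPA by less than 0.02, which is consistent with its p-value of 0.762: the variable contributes essentially nothing.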


## Comparing the above results with Simple Linear Regression results

##### The Adj. R-squared value has gone down from 0.399 in the simple linear regression to 0.392 in the multiple linear regression.

###### We've been penalized for adding another variable which is not significant (its p-value is 0.762).

##### So we should discard this model.

##### An additional variable does not always improve regression results.
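
The same effect can be reproduced on synthetic data. The following is a sketch under assumptions, not a rerun of the notebook: the data are generated (they are NOT the SAT/GPA file), the helper `r2_and_adjusted` is hypothetical, and the fit uses NumPy least squares rather than statsmodels. It compares a one-predictor model with the same model plus a random 1/2/3 label.

```python
import numpy as np


def r2_and_adjusted(X: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Fit OLS with an intercept and return (R-squared, adjusted R-squared)."""
    n_obs, n_predictors = X.shape
    design = np.column_stack([np.ones(n_obs), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adj = 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
    return r2, adj


# Synthetic stand-in data: one genuine predictor plus noise
rng = np.random.default_rng(42)
n = 84
signal = rng.normal(size=n)
y = 3.0 + 0.5 * signal + rng.normal(scale=0.5, size=n)
random_label = rng.integers(1, 4, size=n).astype(float)  # values 1, 2 or 3

r2_simple, adj_simple = r2_and_adjusted(signal.reshape(-1, 1), y)
r2_multi, adj_multi = r2_and_adjusted(np.column_stack([signal, random_label]), y)

print(f"simple:   R2={r2_simple:.4f}  adj R2={adj_simple:.4f}")
print(f"multiple: R2={r2_multi:.4f}  adj R2={adj_multi:.4f}")
```

R-squared cannot fall when a column is added, so the multiple model never looks worse on that measure. Whether adjusted R-squared falls depends on the particular random draw, which is why it is the more honest number to compare.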

<img style="float: left;" src="SimpleLinearResults.JPG">

## Checking the F-statistic

### The F-statistic tests the overall significance of the model

### For the simple regression model, F-statistic = 56.05
### For the multiple regression model, F-statistic = 27.76

### So the simple regression model is more significant overall than the multiple regression model, although the multiple model is still significant (Prob (F-statistic) = 6.58e-10)
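
For a regression with an intercept, the overall F-statistic can be recovered from R-squared as F = (R² / p) / ((1 - R²) / (n - p - 1)). The sketch below applies this standard identity to the rounded values from the multiple regression summary; the function name `f_statistic` is hypothetical, and the small gap to the reported 27.76 comes from R-squared being rounded to three decimals.

```python
def f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Overall F-statistic of an OLS model with an intercept."""
    explained = r_squared / n_predictors  # explained share per predictor
    unexplained = (1.0 - r_squared) / (n_obs - n_predictors - 1)  # per residual d.o.f.
    return explained / unexplained


# Multiple regression summary: R-squared = 0.407, n = 84, p = 2
f_multi = f_statistic(0.407, n_obs=84, n_predictors=2)
print(round(f_multi, 1))  # 27.8, close to the reported 27.76
```

The formula shows why a useless variable lowers F: R-squared barely moves, but it is now divided by 2 predictors instead of 1.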

## Quiz
Q1:
The adjusted R-squared is a measure that:

    a) measures how well your model fits the data
    b) measures how well your model fits the data but penalizes excessive use of variables
    c) measures how well your model fits the data but penalizes excessive use of p-values
    d) measures how well your data fits your model but penalizes the excessive use of variables

Q2:
The adjusted R-squared is:

    a) usually bigger than R-squared
    b) usually smaller than R-squared
    c) usually equal to R-squared
    d) No relation to R-squared
    
Q3:
What can you tell about a new parameter if adding it increases R-squared but decreases adjusted R-squared?

    a) The variable improves our model
    b) The variable can be omitted since it holds no predictive power
    c) It has a quadratic relationship with dependent variable
    d) None of the above

A1:
b) Like the R-squared, the adjusted R-squared measures how well your model fits the data. However, it penalizes the use of variables that are meaningless for the regression.

A2:
b) Almost always, the adjusted R-squared is smaller than the R-squared. The statement is not true only in the extreme occasions of small sample sizes and a high number of independent variables.

A3:
b) The variable can be omitted since it holds no predictive power.