# ++++++++++++++++++ Multiple Linear Regression ++++++++++++++++++

## Good models often need several explanatory variables to address complex problems. The more variables (determinants) you include, the more factors the model takes into account. 

## Population multiple regression model looks like: 

#### Y = beta0 + beta1 *X1 + beta2 *X2 + ............ + betak *Xk + error

## Sample multiple regression model:

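#### y_hat = b0 + b1 *x1 + b2 *x2 + .................... + bk *xk

#### Here y_hat = predicted value, b0, b1.....bk are the estimated coefficients and x1, x2, x3.....xk are the independent variables. The residual e = y - y_hat is the part of the observed y that the prediction misses, so it does not appear in the formula for y_hat. 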

### Multiple regression is not about the best-fitting line anymore. Once there is more than one IV, we cannot draw the model as a line in 2 dimensions (two IVs give a plane in 3 dimensions, and more than two cannot be plotted directly). The end result of MLR is the model that gives the minimum SSE [Sum of Squared Errors]. 

### We lower the SSE by increasing the explanatory power of the model, i.e. by adding variables. Adding a variable never increases the SSE, even when the variable is irrelevant. 

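### Logic: SST = SSR + SSE, and SST (the total variability of the DV) is fixed for a given dataset. If SSE decreases, SSR (the variability explained by the regression) increases by the same amount, and vice-versa. 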
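
The relationship between SST, SSR and SSE can be checked numerically. The sketch below is only an illustration: it uses a single predictor and made-up toy values (not the SAT/GPA data) so that it runs without any libraries, but the same identity holds for any least-squares fit that includes an intercept.

```python
# Toy data (hypothetical values, not taken from the SAT/GPA dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Ordinary least squares with one predictor and an intercept.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares.
sst = sum((yi - y_mean) ** 2 for yi in y)               # total variability
ssr = sum((yh - y_mean) ** 2 for yh in y_hat)           # explained by the regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # left unexplained

r_squared = ssr / sst
print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}, R-squared = {r_squared:.4f}")
```

Because SST does not change when variables are added, any reduction in SSE shows up directly as an increase in SSR and therefore in R-squared.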

# Adjusted R-Squared. 
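### When comparing models with different numbers of variables, statisticians usually rely on the Adjusted R-squared rather than the R-squared. 
### Adjusted R-squared <= R-squared (they are equal only in the degenerate case R-squared = 1)
### Adjusted R-squared penalises excessive use of variables: it rises only when a new variable improves the fit by more than chance alone would. 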
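
The penalty can be made concrete with the standard formula Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - k - 1), where n is the number of observations and k the number of IVs. The helper name `adjusted_r_squared` below is ours, chosen for illustration; the inputs are the values reported in the regression summary further down in this notebook (R-squared 0.407, 84 observations, 2 IVs).

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors IVs."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read off the OLS summary in this notebook.
adj = adjusted_r_squared(r_squared=0.407, n_obs=84, n_predictors=2)
print(round(adj, 3))

# Same R-squared with more predictors gives a lower adjusted value.
adj_more_vars = adjusted_r_squared(r_squared=0.407, n_obs=84, n_predictors=10)
print(round(adj_more_vars, 3))
```

The first value agrees with the `Adj. R-squared` entry of the summary after rounding. The second call shows the penalty at work: with an unchanged R-squared, more variables push the adjusted measure down.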


# Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

import seaborn
seaborn.set()         # Apply seaborn's default styling to the matplotlib plots. 


# Read data

In [2]:
df = pd.read_csv('Multiple_Linear_Regression_SAT_GPA.csv')
print(df.describe())


# Observe the data. 
# Here GPA = DV, SAT = IV and 'Rand 1,2,3' is a column containing a random integer from 1 to 3 assigned to each row. 
# Being random, this column should have no real effect on the GPA score. 
print(df)


               SAT        GPA  Rand 1,2,3
count    84.000000  84.000000   84.000000
mean   1845.273810   3.330238    2.059524
std     104.530661   0.271617    0.855192
min    1634.000000   2.400000    1.000000
25%    1772.000000   3.190000    1.000000
50%    1846.000000   3.380000    2.000000
75%    1934.000000   3.502500    3.000000
max    2050.000000   3.810000    3.000000
     SAT   GPA  Rand 1,2,3
0   1714  2.40           1
1   1664  2.52           3
2   1760  2.54           3
3   1685  2.74           3
4   1693  2.83           2
5   1670  2.91           1
6   1764  3.00           2
7   1764  3.00           1
8   1792  3.01           2
9   1850  3.01           3
10  1735  3.02           3
11  1775  3.07           2
12  1735  3.08           1
13  1712  3.08           3
14  1773  3.12           2
15  1872  3.17           2
16  1755  3.17           3
17  1674  3.17           2
18  1842  3.17           3
19  1786  3.19           3
20  1761  3.19           3
21  1722  3.19           3
2

# Create the multiple regression model


## Declare the dependent and the independent variables. We have 2 explanatory variables - 'SAT' and 'Rand 1,2,3'
Model looks like:    GPA = b0 + b1*SAT + b2*'Rand 1,2,3'  

In [3]:
y = df.GPA                          # Get the DV
x1 = df[['SAT', 'Rand 1,2,3']]      # Here x1 is a dataframe containing 2 IV series. 


## Regression itself

In [4]:
# Note: We will use the library statsmodels to perform simple/multiple Linear regression. 
# General multiple regression equation: y_hat = beta0 + beta1.x1 + beta2.x2
# By default, statsmodels.regression.linear_model.OLS (available as sm.OLS) does not include the INTERCEPT beta0. 
# We have to add one explicitly if required. 
# 'statsmodels' provides a convenience function called 'add_constant()' that adds a constant column (of ones) to the input dataset.

x = sm.add_constant(x1)
results = sm.OLS(y,x).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                    GPA   R-squared:                       0.407
Model:                            OLS   Adj. R-squared:                  0.392
Method:                 Least Squares   F-statistic:                     27.76
Date:                Tue, 01 Jan 2019   Prob (F-statistic):           6.58e-10
Time:                        16:35:45   Log-Likelihood:                 12.720
No. Observations:                  84   AIC:                            -19.44
Df Residuals:                      81   BIC:                            -12.15
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2960      0.417      0.710      0.4
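
Why the constant column matters can be sketched without statsmodels. The example below is an illustration on made-up toy values (not the SAT/GPA data) with a single predictor: it fits the same points once with the line forced through the origin, which is what happens when no constant is supplied, and once with an intercept.

```python
# Toy data (hypothetical values) whose true intercept is far from zero.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.2, 11.1, 11.9, 13.1, 13.9]

# Fit WITHOUT a constant: y_hat = b * x, so the line must pass through the origin.
b_no_const = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
sse_no_const = sum((yi - b_no_const * xi) ** 2 for xi, yi in zip(x, y))

# Fit WITH a constant: y_hat = b0 + b1 * x.
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
b0 = y_mean - b1 * x_mean
sse_with_const = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(f"no constant:   slope = {b_no_const:.3f}, SSE = {sse_no_const:.3f}")
print(f"with constant: b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse_with_const:.3f}")
```

Without the constant, the slope is distorted to compensate for the missing intercept and the SSE is much larger, which is why the notebook calls `add_constant()` before fitting.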

# Interpretations
1. The R-squared: 0.407, i.e. the model explains about 41% of the variability in GPA, which is not so great. 
2. The adjusted R-squared value: 0.392
3. Hypothesis (tested separately for each coefficient, using its t-statistic): 
        
        H0: beta_j = 0 i.e. the regression coefficient is insignificant. 
        H1: beta_j != 0 i.e. the regression coefficient is significant. 
        
4. The regression coefficients: 
        b0: 0.2960  ....constant
        b1: 0.0017
        b2: -0.0083
        
5. The p-values: 
        Rule: The IV is a significant contributor in explaining the variability in the DV, if the p-value < 0.05. 
        In other words, for a coefficient to be STATISTICALLY SIGNIFICANT at the 5% level, the p-value should be less than 0.05. 
        
        In our case, the p-value corresponding to the IV 'Rand 1,2,3' is 0.762 > 0.05 i.e. we cannot reject the NULL Hypothesis 
        (we fail to reject it, which is not the same as proving it true). The variable 'Rand 1,2,3' is STATISTICALLY INSIGNIFICANT, 
        adds no explanatory value and thus should be dropped.
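
The decision rule above is simple enough to write down directly. This is a minimal sketch: the helper name `is_significant` and the 0.05 default are ours, and the only p-values used are the two that are legible in the summary output (0.762 for 'Rand 1,2,3' and 6.58e-10 for the F-statistic).

```python
ALPHA = 0.05  # significance level used throughout this notebook


def is_significant(p_value, alpha=ALPHA):
    """Return True when H0 is rejected at the given significance level."""
    return p_value < alpha


# p-values read off the OLS summary.
p_rand = 0.762      # t-test for the coefficient of 'Rand 1,2,3'
p_f_test = 6.58e-10  # F-test for the model as a whole

print("Rand 1,2,3 significant:", is_significant(p_rand))
print("Overall model significant:", is_significant(p_f_test))
```

The two results are not in conflict: the model as a whole can be significant while an individual variable in it is not.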
      

# Assess the overall significance of the model - using the F-statistic. 

### Just as the:
### z-statistic follows a Standard Normal distribution
### and the t-statistic follows a Student's t-distribution,
### the F-statistic follows an F-distribution.

#### F-test:
    H0: All betas (except the intercept) are 0 i.e. beta1 = beta2 = beta3 .....= betak = 0 i.e. none of the IVs helps to explain the DV.  
    H1: At least one of the betas is non-zero.
#### In our case:
    F-statistic:  27.76
    Prob (F-statistic): 6.58e-10   .... (The p-value corresponding to this F-statistic). Here the p-value is very 
    low, practically 0.000, which is less than 0.05. Thus the OVERALL MODEL IS SIGNIFICANT. 

### Rule of thumb: the lower the F-statistic, the closer the model is to being non-significant. The F-statistic thus plays an important role when we have to compare 2 models fitted to the same data. 
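
For a model with an intercept, the overall F-statistic can be recovered from R-squared as F = (R-squared / k) / ((1 - R-squared) / (n - k - 1)). The sketch below applies this to the rounded values in the summary (R-squared 0.407, n = 84, k = 2); the function name `f_statistic_from_r2` is ours, chosen for illustration.

```python
def f_statistic_from_r2(r_squared, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept."""
    df_model = n_predictors               # 'Df Model' in the summary
    df_resid = n_obs - n_predictors - 1   # 'Df Residuals' in the summary
    explained_per_df = r_squared / df_model
    unexplained_per_df = (1.0 - r_squared) / df_resid
    return explained_per_df / unexplained_per_df


f_value = f_statistic_from_r2(r_squared=0.407, n_obs=84, n_predictors=2)
print(f"F = {f_value:.2f}")
```

The result is close to the reported 27.76; the small gap comes from using R-squared rounded to three decimals. If SciPy is available, `scipy.stats.f.sf(f_value, 2, 81)` gives the corresponding p-value.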
    
