# $$\textrm{Multiple Linear Regression}$$
---

# Table of contents

* [Task](#T1)
* [Packages used](#T21) and [Dataset](#T22)
* [Data Visualization](#T3)
* [Task 1: Training regression model](#T4)
    * [Pre-processing data](#T41)
    * [Estimation by multiple regressor](#T42)
    * [Interpreting regression coefficients](#T43)
    * [Significance check based on OLS summary](#T44)
* [Task 2: Training regression model](#T5)
    - [Model selection based on R2 score](#T51)
* [Task 3: Train a regression model](#T6)
    - [Insignificant variable based on OLS summary](#T61)
    - [Retrain the model](#T62)
    - [Observations True vs Predict](#T63)

<a id="T1"></a>
# Task

1) Develop an estimated multiple linear regression equation with mbap as response variable and sscp & hscp as the two predictor variables. Interpret the regression coefficients and check whether they are significant based on the summary output.

---
2) Estimate a multiple regression equation for each of the below scenarios and based on the model’s R-square comment which model is better. 
    
    (i) Use mbap as outcome variable and sscp & degreep as the two predictor variables.
    
    (ii) Use mbap as outcome variable and hscp & degreep as the two predictor variables. 
---
3) Show the functional form of a multiple regression model. Build a regression model with mbap as dependent variable and sscp, hscp and degree_p as three independent variables. 
    
    Divide the dataset in the ratio of 80:20 for train and test set (set seed as 1001) and use the train set to build the model. Show the model summary and interpret the p-values of the regression coefficients. 
    
    Remove any insignificant variables and rebuild the model. 
    
    Use this model for prediction on the test set and show the first few observations’ actual value of the test set in comparison to the predicted value.

<a id="T21"></a>
# Packages used

In [None]:
#Libraries used in the kernel

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # graph plotting
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, classification_report
from statsmodels.api import OLS

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="T22"></a>
# Import Dataset

In [None]:
dataframe = pd.read_csv("../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv", index_col="sl_no")
dataframe.head()

In [None]:
#Make copies of dataframe
data_reg = dataframe.copy()
data_class = dataframe.copy()

<a id="T3"></a>
# Data Visualization

**This plot shows the distribution of MBA percentage scores by gender**

In [None]:
sns.kdeplot(dataframe.mba_p[ dataframe.gender=="M"])
sns.kdeplot(dataframe.mba_p[ dataframe.gender=="F"])
plt.legend(["Male", "Female"])
plt.xlabel("mba percentage")
plt.show()

_The density plot suggests that male students have a lower average MBA score than female students._
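A visual impression from a density plot can be backed up with group statistics. The sketch below uses a tiny made-up frame (the values are illustrative, not taken from the dataset) that only mirrors the `gender` and `mba_p` column names.

```python
import pandas as pd

# Made-up scores for illustration only; the real data lives in `dataframe`.
demo = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "F"],
    "mba_p": [60.0, 62.0, 61.0, 64.0, 66.0, 65.0],
})

# Mean, median and group size of the MBA score for each gender.
summary = demo.groupby("gender")["mba_p"].agg(["mean", "median", "count"])
print(summary)
```

Applied to the real data, the same `groupby` call on `dataframe` would show whether the difference seen in the plot also appears in the means.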

**Correlation between different features**

In [None]:
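#keep numeric columns only; recent pandas versions raise an error on text columns
matrix = dataframe.select_dtypes(include="number").corr()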
plt.figure(figsize=(8,6))
#plot heat map
g=sns.heatmap(matrix,annot=True,cmap="YlGn_r")

_The secondary school (SSC) score has a higher correlation with the MBA score than the higher secondary (HSC) score._
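Reading a single column off a heat map is easier when the correlations with the target are listed and sorted. This is a minimal sketch on synthetic data (the numbers are generated, not the placement data); only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the placement data (values are generated for illustration).
rng = np.random.default_rng(0)
n = 200
ssc_p = rng.normal(67, 10, n)
hsc_p = 0.5 * ssc_p + rng.normal(33, 9, n)
mba_p = 0.15 * ssc_p + 0.10 * hsc_p + rng.normal(45, 5, n)
demo = pd.DataFrame({"ssc_p": ssc_p, "hsc_p": hsc_p, "mba_p": mba_p})

# Correlation of every feature with the target, strongest first.
target_corr = demo.corr()["mba_p"].drop("mba_p").sort_values(ascending=False)
print(target_corr)
```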

In [None]:
plt.figure(figsize=(12,8))
sns.regplot(x="ssc_p",y="mba_p",data=dataframe)
sns.regplot(x="hsc_p",y="mba_p",data=dataframe)
plt.legend(["ssc percentage", "hsc percentage"])
plt.ylabel("mba percentage")
plt.show()

_The SSC percentage carries slightly more weight than the HSC percentage in predicting a good MBA score._

<a id="T4"></a>
# Task 1: Training regression model

1. **Develop an estimated multiple linear regression equation with mbap as response variable and sscp & hscp as the two predictor variables. Interpret the regression coefficients and check whether they are significant based on the summary output**

<a id="T41"></a>
## Pre-processing data

In [None]:
# Separating independent and dependent variables
#independent variables: ssc_p, hsc_p; dependent variable: mba_p
X = data_class.iloc[:,[1,3]].values
y = data_class.iloc[:,-3].values.reshape(-1,1)
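The positional indices above depend on the column order of the CSV. As a hedged alternative, the same selection can be written with column names; the small frame below is made up and only imitates part of the dataset's layout.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
demo = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "ssc_p": [61.0, 74.5, 68.0],
    "ssc_b": ["Others", "Central", "Central"],
    "hsc_p": [70.0, 66.5, 72.0],
    "mba_p": [59.0, 63.5, 61.0],
})

# Positional selection, in the style of the cell above.
X_by_pos = demo.iloc[:, [1, 3]].values

# Name-based selection: same result, but robust to column reordering.
X_by_name = demo[["ssc_p", "hsc_p"]].values
y_by_name = demo["mba_p"].values.reshape(-1, 1)
```

Name-based selection also makes it obvious which variables are predictors and which one is the response.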

In [None]:
#splitting into training and test set
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)
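To illustrate what `test_size=0.2` and `random_state` do, here is a small sketch on placeholder arrays (100 made-up rows, not the placement data). It checks the 80:20 row counts and that repeating the call with the same seed reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100).reshape(-1, 1)

# Two calls with the same seed return the same partition.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
```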

<a id="T42"></a>
## Estimation by multiple regressor

In [None]:
#Multiple linear regression
#import library
#from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

#train the model
regressor.fit(X_train, y_train)

#predict the test set(mba_p)
y_pred_m = regressor.predict(X_test)

In [None]:
#from sklearn.metrics import r2_score, classification_report
print("R2 score: " + str(r2_score(y_test, y_pred_m)))
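For reference, the R2 score compares the model's squared errors with those of a baseline that always predicts the mean. A minimal sketch with hand-picked numbers (not model output); `r2_manual` is a helper name introduced here for illustration:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hand-picked values: ss_res = 4, ss_tot = 20, so R2 = 0.8.
y_true_demo = [60.0, 62.0, 64.0, 66.0]
y_pred_demo = [61.0, 61.0, 65.0, 65.0]
r2_demo = r2_manual(y_true_demo, y_pred_demo)
print(r2_demo)
```

A score of 0 means the model does no better than predicting the mean, and the score can be negative on a test set.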

<a id="T43"></a>
## Interpreting regression coefficients

In [None]:
print(regressor.coef_)
print(regressor.intercept_)

The equation of our multiple linear regression model is:

$$\textrm{mba_p} = 0.14 \times \textrm{ssc_p} + 0.13 \times \textrm{hsc_p} + 44.05$$
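Each coefficient is the expected change in `mba_p` for a one-point increase in that predictor while the other is held fixed. The sketch below plugs the rounded coefficients from the equation into a small helper (`predict_mba`, a name introduced here) for a hypothetical student.

```python
# Rounded coefficients and intercept as reported in the equation above.
coef_ssc, coef_hsc, intercept = 0.14, 0.13, 44.05

def predict_mba(ssc_p, hsc_p):
    """Apply the fitted linear equation to one student's scores."""
    return coef_ssc * ssc_p + coef_hsc * hsc_p + intercept

# Hypothetical student: 70% in SSC and 65% in HSC.
example = predict_mba(70, 65)
print(round(example, 2))  # 62.3

# One extra SSC point, HSC unchanged: the prediction moves by the SSC coefficient.
delta = predict_mba(71, 65) - predict_mba(70, 65)
```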


<a id="T44"></a>
## Significance check based on OLS summary

* _The significance of a regression coefficient is assessed with its t-ratio: the estimated coefficient divided by the standard error of that estimate._
* _For statistical significance we expect the absolute value of the t-ratio to be greater than 2, or the p-value to be less than the chosen significance level (α = 0.01, 0.05 or 0.1)._
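The t-ratio described above can be computed by hand. This is a sketch on synthetic data, using the standard least-squares formulas, with one predictor that has a real effect and one that has none. It is not the notebook's model.

```python
import numpy as np

# Synthetic data: x1 drives y, x2 has no true effect.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

# Design matrix with an explicit column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-ratio = estimated coefficient / its standard error.
t_ratios = beta / std_err
print(t_ratios)
```

The entry for `x1` should sit well above 2 in absolute value, while the entry for `x2` is the kind of value one would inspect before deciding whether to keep the variable.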

In [None]:
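#from statsmodels.api import OLS
#Note: OLS does not add an intercept on its own, so this fit has no constant term
#(unlike LinearRegression above); wrap X_train in statsmodels' add_constant to include one
summ=OLS(y_train,X_train).fit()
summ.summary()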

**The regression model summary shows that the hsc and ssc predictor variables are statistically significant: their p-values are reported as 0.000, which is below every significance level listed above.**

<a id="T5"></a>
# Task 2: Training regression model

2. **Estimate a multiple regression equation for each of the below scenarios and based on the model’s R-square comment which model is better.** 
    
    (i) Use mbap as outcome variable and sscp & degreep as the two predictor variables.    

In [None]:
# Separating independent and dependent variables
#independent variables: ssc_p, degree_p; dependent variable: mba_p
X = data_class.iloc[:,[1,6]].values
y = data_class.iloc[:,-3].values.reshape(-1,1)

#splitting into training and test set
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)

#Multiple linear regression
#from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#predict the dependent variable
y_pred_m = regressor.predict(X_test)

#from sklearn.metrics import r2_score, classification_report
print("R2 score: " + str(r2_score(y_test, y_pred_m)))
print("regression coeff: " + str(regressor.coef_))
print("regression intercept: " + str(regressor.intercept_))
print("mba_p = 0.12 x ssc_p + 0.22 x degree_p + 39.66")

2. **Estimate a multiple regression equation for each of the below scenarios and based on the model’s R-square comment which model is better.** 
    
    (ii) Use mbap as outcome variable and hscp & degreep as the two predictor variables.

In [None]:
# Separating independent and dependent variables
#independent variables: hsc_p, degree_p; dependent variable: mba_p
X = data_class.iloc[:,[3,6]].values
y = data_class.iloc[:,-3].values.reshape(-1,1)

#splitting into training and test set
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)

#Multiple linear regression
#from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred_m = regressor.predict(X_test)

#from sklearn.metrics import r2_score, classification_report
print("R2 score:" + str(r2_score(y_test, y_pred_m)))
print("regression coeff:" + str(regressor.coef_))
print("regression intercept:" + str(regressor.intercept_))
print("mba_p = " + str(regressor.coef_[0][0]) + " x hsc_p + " + str(regressor.coef_[0][1]) + " x degree_p + " + str(regressor.intercept_[0]))

<a id="T51"></a>
## Model selection based on R2 score

> **The model with mbap as outcome variable and sscp & degreep as the two predictor variables is better, since its r2_score of 0.267 is the higher of the two.**
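The two scenarios differ only in the predictor pair, so the comparison can be written as a loop. This is a sketch on synthetic data with generated values; the scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the three percentage columns and the target.
rng = np.random.default_rng(42)
n = 300
features = {
    "ssc_p": rng.uniform(45, 90, n),
    "hsc_p": rng.uniform(45, 90, n),
    "degree_p": rng.uniform(50, 85, n),
}
y = (40 + 0.15 * features["ssc_p"] + 0.05 * features["hsc_p"]
     + 0.2 * features["degree_p"] + rng.normal(0, 3, n))

# Fit one model per candidate pair and record its test-set R2.
candidates = [("ssc_p", "degree_p"), ("hsc_p", "degree_p")]
scores = {}
for pair in candidates:
    X = np.column_stack([features[name] for name in pair])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    scores[pair] = r2_score(y_te, model.predict(X_te))

best_pair = max(scores, key=scores.get)
print(scores, best_pair)
```

Because both models use the same seed and the same number of predictors, their R2 scores are computed on the same test rows and can be compared directly.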

<a id="T6"></a>
# Task 3: Train a regression model

3) **Show the functional form of a multiple regression model. Build a regression model with mbap as dependent variable and sscp, hscp and degree_p as three independent variables.** 
    
    Divide the dataset in the ratio of 80:20 for train and test set (set seed as 1001) and use the train set to build the model. Show the model summary and interpret the p-values of the regression coefficients. 
    
    Remove any insignificant variables and rebuild the model. 
    
    Use this model for prediction on the test set and show the first few observations’ actual value of the test set in comparison to the predicted value.

**_functional form_**
> Multiple regression model with mba_p as dependent variable and ssc_p, hsc_p and degree_p as three independent variables.

$$\textrm{mba_p} = x_1 \times \textrm{ssc_p} + x_2 \times \textrm{hsc_p} + x_3 \times \textrm{degree_p} + \textrm{constant}$$
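As a sketch of what fitting this form means, the same equation can be written for each student $i$ with an error term. Here $x_1, x_2, x_3$ are the coefficients from the formula above, while $c$ (the constant) and $\varepsilon_i$ (the part of the score the predictors do not explain) are symbols introduced only for this illustration:

$$\textrm{mba_p}_i = x_1 \times \textrm{ssc_p}_i + x_2 \times \textrm{hsc_p}_i + x_3 \times \textrm{degree_p}_i + c + \varepsilon_i$$

Ordinary least squares then chooses the coefficients and the constant that minimise the sum of squared errors over the $n$ training rows:

$$\min_{x_1, x_2, x_3, c} \sum_{i=1}^{n} \left(\textrm{mba_p}_i - x_1 \times \textrm{ssc_p}_i - x_2 \times \textrm{hsc_p}_i - x_3 \times \textrm{degree_p}_i - c\right)^2$$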

In [None]:
# Separating independent and dependent variables
#independent variables: ssc_p, hsc_p, degree_p; dependent variable: mba_p
X = data_class.iloc[:,[1,3,6]].values
y = data_class.iloc[:,-3].values.reshape(-1,1)

#splitting into training and test set
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=1001)

#Multiple linear regression
#from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

<a id="T61"></a>
## Insignificant variable based on OLS summary

* _The significance of a regression coefficient is assessed with its t-ratio: the estimated coefficient divided by the standard error of that estimate._
* _For statistical significance we expect the absolute value of the t-ratio to be greater than 2, or the p-value to be less than the chosen significance level (α = 0.01, 0.05 or 0.1)._

In [None]:
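#Summary of the model
#from statsmodels.api import OLS
#Note: OLS does not add an intercept on its own, so this fit has no constant term
#(unlike LinearRegression above); wrap X_train in statsmodels' add_constant to include one
summ=OLS(y_train,X_train).fit()
summ.summary()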

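> **The regression model summary shows that the hsc and degree predictor variables are statistically significant: their p-values are reported as 0.000.**

> **The ssc variable (x1 in the summary) has the largest p-value, 0.004. That is still below 0.01, so ssc is the least significant of the three predictors rather than clearly insignificant.**

**_Drop the ssc feature, as the least significant variable_**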
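Whether a variable counts as insignificant depends on the chosen significance level. The sketch below applies a simple threshold rule to the p-values reported in the summary above; `insignificant` is a helper introduced here for illustration.

```python
# p-values as reported in the OLS summary above (0.000 means "below 0.0005").
reported_p = {"ssc_p": 0.004, "hsc_p": 0.000, "degree_p": 0.000}

def insignificant(p_values, alpha):
    """Return the variables whose p-value is not below the significance level."""
    return [name for name, p in p_values.items() if p >= alpha]

# Which variables would be dropped at each level?
for alpha in (0.1, 0.05, 0.01, 0.001):
    print(alpha, insignificant(reported_p, alpha))
```

At the usual levels (0.1, 0.05, 0.01) no variable is flagged; only a stricter level such as 0.001 singles out `ssc_p`.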

<a id="T62"></a>
## Retrain the model

In [None]:
# Separating independent and dependent variables
#independent variables: hsc_p, degree_p; dependent variable: mba_p
X = data_class.iloc[:,[3,6]].values
y = data_class.iloc[:,-3].values.reshape(-1,1)

#splitting into training and test set
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=1001)

#Multiple linear regression
#from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
#predict the values
y_pred_m = regressor.predict(X_test)

In [None]:
#Summary of the model
#from statsmodels.api import OLS
summ=OLS(y_train,X_train).fit()
summ.summary()

In [None]:
#from sklearn.metrics import r2_score, classification_report
#R2 score
print("R2 score:" + str(r2_score(y_test, y_pred_m)))

#model coefficients and intercept
print("regression coeff:" + str(regressor.coef_))
print("regression intercept:" + str(regressor.intercept_))
print("mba_p = " + str(regressor.coef_[0][0]) + " x hsc_p + " + str(regressor.coef_[0][1]) + " x degree_p + " + str(regressor.intercept_[0]))

<a id="T63"></a>
## Observations true vs predicted

In [None]:
np.set_printoptions(precision=2)
#flatten the (n,1) arrays so that each cell holds a plain number
dff = pd.DataFrame({"Target": y_test.ravel(), "Predicted": y_pred_m.ravel().round(2)})
dff.head(8)
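A residual column makes the side-by-side comparison easier to read. This sketch uses made-up target and prediction values shaped like `y_test` and `y_pred_m`; the numbers are not model output.

```python
import numpy as np
import pandas as pd

# Made-up (n,1) arrays standing in for y_test and y_pred_m.
y_test_demo = np.array([[62.0], [58.5], [70.0], [65.5]])
y_pred_demo = np.array([[61.0], [60.5], [67.0], [65.5]])

comparison = pd.DataFrame({
    "Target": y_test_demo.ravel(),
    "Predicted": y_pred_demo.ravel().round(2),
})

# Residual = actual minus predicted; the mean absolute value summarises the typical miss.
comparison["Residual"] = comparison["Target"] - comparison["Predicted"]
mae = comparison["Residual"].abs().mean()
print(comparison)
print(mae)
```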
