# <font color='Blue'>Linear Regression</font>

## <font color='Blue'>1. Building a Simple Linear Regression Model</font>

A dataset contains the salaries of 50 graduating MBA students of a business school in 2016, along with their Class X (Grade 10) percentage marks. We will develop a model to understand and predict salary from the Grade 10 percentage.
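Written as an equation, the model described above is a simple linear regression with one predictor. The symbols are notation introduced here for clarity ($\beta_0$ for the intercept, $\beta_1$ for the slope, $\varepsilon_i$ for the error term of student $i$):

$$
\text{Salary}_i = \beta_0 + \beta_1 \,(\text{Percentage in Grade 10})_i + \varepsilon_i, \qquad i = 1, \dots, 50
$$

Fitting the model means estimating $\beta_0$ and $\beta_1$ from the data so that the sum of squared errors is as small as possible.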

### <font color='Blue'>Steps in building a linear regression model</font>
     STEP 1: Collect, Extract, Analyze Data
     STEP 2: Process Data
     STEP 3: Dividing data into training and validation datasets
     STEP 4: Build the model
     STEP 5: Measure model accuracy and validate model
     STEP 6: Making Predictions

#### <font color='Green'>STEP 1: Collect, Extract, Analyze Data</font>

In [1]:
# Loading Data

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np 

# Setting NumPy print options: 4 decimal places, 100-character line width
np.set_printoptions(precision=4, linewidth=100) 

mba_salary_df = pd.read_csv( 'MBA Salary.csv' )

In [2]:
mba_salary_df.head( 10 )

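| | S. No. | Percentage in Grade 10 | Salary |
|---|---|---|---|
| 0 | 1 | 62.0 | 270000 |
| 1 | 2 | 76.33 | 200000 |
| 2 | 3 | 72.0 | 240000 |
| 3 | 4 | 60.0 | 250000 |
| 4 | 5 | 61.0 | 180000 |
| 5 | 6 | 55.0 | 300000 |
| 6 | 7 | 70.0 | 260000 |
| 7 | 8 | 68.0 | 235000 |
| 8 | 9 | 82.8 | 425000 |
| 9 | 10 | 59.0 | 240000 |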


In [3]:
mba_salary_df.info()

# Look for Missing data, Handle if any
# Handle Categorical Variables if any

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   S. No.                  50 non-null     int64  
 1   Percentage in Grade 10  50 non-null     float64
 2   Salary                  50 non-null     int64  
dtypes: float64(1), int64(2)
memory usage: 1.3 KB
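The two checks named in the comments (missing data, categorical variables) can be made explicit. The sketch below is a minimal illustration that rebuilds only the ten rows shown by `head(10)`; the names `sample_df`, `missing_per_column` and `categorical_columns` are hypothetical, and on the real data the same calls would be applied to `mba_salary_df`.

```python
import pandas as pd

# First ten rows as displayed by head(10); the full file has 50 rows.
sample_df = pd.DataFrame({
    "S. No.": list(range(1, 11)),
    "Percentage in Grade 10": [62.0, 76.33, 72.0, 60.0, 61.0,
                               55.0, 70.0, 68.0, 82.8, 59.0],
    "Salary": [270000, 200000, 240000, 250000, 180000,
               300000, 260000, 235000, 425000, 240000],
})

# Number of missing values in each column
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)

# Non-numeric columns would need encoding before they can enter the model
categorical_columns = sample_df.select_dtypes(exclude="number").columns.tolist()
print("Categorical columns:", categorical_columns)
```

For this dataset the `info()` output already shows 50 non-null values in every column and only numeric dtypes, so no imputation or encoding is needed.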


#### <font color='Green'>STEP 2: Process Data</font>

In [4]:
# Creating Feature Set and Outcome Variable
import statsmodels.api as sm

X = sm.add_constant( mba_salary_df['Percentage in Grade 10'] )
print("Feature Set")
print(X.head(10))

Feature Set
   const  Percentage in Grade 10
0    1.0                   62.00
1    1.0                   76.33
2    1.0                   72.00
3    1.0                   60.00
4    1.0                   61.00
5    1.0                   55.00
6    1.0                   70.00
7    1.0                   68.00
8    1.0                   82.80
9    1.0                   59.00
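The `const` column of ones is what lets the model estimate an intercept. As a rough sketch of the idea, the block below builds the same two-column layout with NumPy and solves the least-squares problem on the ten rows shown by `head(10)` only, so its coefficients will differ from the notebook's fit on the training set. The variable names are hypothetical.

```python
import numpy as np

# Ten rows from head(10); illustrative subset, not the training set.
percentage = np.array([62.0, 76.33, 72.0, 60.0, 61.0, 55.0, 70.0, 68.0, 82.8, 59.0])
salary = np.array([270000, 200000, 240000, 250000, 180000,
                   300000, 260000, 235000, 425000, 240000], dtype=float)

# Same layout as the feature set: a column of ones, then the predictor.
design_matrix = np.column_stack([np.ones_like(percentage), percentage])

# Ordinary least squares: minimise ||salary - design_matrix @ coefficients||^2
coefficients, *_ = np.linalg.lstsq(design_matrix, salary, rcond=None)
intercept, slope = coefficients
residuals = salary - design_matrix @ coefficients

print("design matrix shape:", design_matrix.shape)
print("intercept:", intercept, "slope:", slope)
print("sum of residuals:", residuals.sum())
```

With an intercept in the model, the residuals sum to zero up to floating-point error. Without the column of ones, the fitted line would be forced through the origin.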


In [5]:
print("Outcome Variable")
Y = mba_salary_df['Salary']
print(Y.head(10))

Outcome Variable
0    270000
1    200000
2    240000
3    250000
4    180000
5    300000
6    260000
7    235000
8    425000
9    240000
Name: Salary, dtype: int64


#### <font color='Green'>STEP 3: Dividing data into training and validation datasets</font>

In [6]:
# Splitting the dataset into training and validation sets

from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split( X , Y, train_size = 0.8, random_state = 100 )
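Conceptually, the split shuffles the row positions with a fixed seed and then cuts them into two groups. The sketch below illustrates that idea with the standard library only. It is not scikit-learn's implementation, and the rows it selects will not match those chosen by `train_test_split`; `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_rows, train_fraction, seed):
    """Shuffle row positions with a fixed seed, then cut them into two groups."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_train = int(n_rows * train_fraction)
    return positions[:n_train], positions[n_train:]

train_idx, test_idx = split_indices(n_rows=50, train_fraction=0.8, seed=100)
print(len(train_idx), len(test_idx))

# Re-running with the same seed reproduces the same split.
same_split = split_indices(50, 0.8, 100) == (train_idx, test_idx)
print("reproducible:", same_split)
```

With 50 rows and a training fraction of 0.8 this gives 40 training rows and 10 validation rows. Fixing the seed (the role of `random_state = 100`) keeps the split, and therefore the reported metrics, reproducible between runs.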

In [7]:
# Datatypes

print("train_X")
print(train_X.info())
print("___________________________________________________________________")
print("test_X")
print(test_X.info())
print("___________________________________________________________________")
print("train_y",type(train_y))
print(len(train_y))
print("___________________________________________________________________")
print("test_y",type(test_y))
print(len(test_y))

train_X
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40 entries, 0 to 8
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   const                   40 non-null     float64
 1   Percentage in Grade 10  40 non-null     float64
dtypes: float64(2)
memory usage: 960.0 bytes
None
___________________________________________________________________
test_X
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 6 to 42
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   const                   10 non-null     float64
 1   Percentage in Grade 10  10 non-null     float64
dtypes: float64(2)
memory usage: 240.0 bytes
None
___________________________________________________________________
train_y <class 'pandas.core.series.Series'>
40
___________________________________________________________________
test_y <class 'pandas.core.series.Series'>
10

#### <font color='Green'>STEP 4: Build the model</font>

In [11]:
mba_salary_lm = sm.OLS( train_y, train_X ).fit()

In [12]:
mba_salary_lm.summary2()

# R-squared is 0.211, model explains 21.1% of the variance in y
# p-value of t-test of β1 is < 0.05, so β1 is statistically significant
# p-value of F-test of the model is < 0.05, so the model is statistically significant

| | | | |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.19 |
| Dependent Variable: | Salary | AIC: | 1008.868 |
| Date: | 2021-10-15 12:13 | BIC: | 1012.2458 |
| No. Observations: | 40 | Log-Likelihood: | -502.43 |
| Df Model: | 1 | F-statistic: | 10.16 |
| Df Residuals: | 38 | Prob (F-statistic): | 0.00287 |
| R-squared: | 0.211 | Scale: | 5012100000.0 |

| | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 30587.2857 | 71869.4497 | 0.4256 | 0.6728 | -114904.8089 | 176079.3802 |
| Percentage in Grade 10 | 3560.5874 | 1116.9258 | 3.1878 | 0.0029 | 1299.4892 | 5821.6855 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.048 | Durbin-Watson: | 2.611 |
| Prob(Omnibus): | 0.359 | Jarque-Bera (JB): | 1.724 |
| Skew: | 0.369 | Prob(JB): | 0.422 |
| Kurtosis: | 2.3 | Condition No.: | 413.0 |
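Several numbers in the summary are tied together by simple formulas, which gives a quick way to sanity-check the table. The sketch below uses only values copied from the summary; the variable names are hypothetical, and small differences from the table come from rounding in the displayed values.

```python
# Values copied from the regression summary
n_obs = 40
n_predictors = 1
r_squared = 0.211
slope = 3560.5874
slope_std_err = 1116.9258

# Residual degrees of freedom: observations minus estimated parameters
df_resid = n_obs - n_predictors - 1

# t-statistic of the slope is the coefficient divided by its standard error
t_stat = slope / slope_std_err

# With a single predictor, the model F-statistic equals the squared t-statistic
f_stat = t_stat ** 2

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(f"Df residuals  = {df_resid}")
print(f"t (slope)     = {t_stat:.4f}")
print(f"F-statistic   = {f_stat:.2f}")
print(f"Adj. R-squared = {adj_r_squared:.3f}")
```

The equality of F and t squared is also why the p-value of the F-test (0.00287) and the p-value of the slope's t-test (0.0029) agree: with one predictor they test the same hypothesis, $\beta_1 = 0$.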


#### <font color='Green'>STEP 5: Measure model accuracy and validate model</font>

<b> Making predictions in validation set </b>

In [13]:
pred_y = mba_salary_lm.predict( test_X )
round(pred_y,2)

6     279828.40
36    272707.23
37    215737.83
28    237101.35
43    295851.05
49    247071.00
5     226419.59
33    308313.10
20    254904.29
42    295494.99
dtype: float64
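Each prediction is just the fitted line evaluated at that student's Grade 10 percentage. As a small check, the sketch below plugs the coefficients from the model summary into the line for two validation rows whose percentages are visible in `head(10)`: index 6 (70.0) and index 5 (55.0). `predict_salary` is a hypothetical helper, not part of statsmodels.

```python
# Coefficients copied from the model summary
intercept = 30587.2857   # const
slope = 3560.5874        # Percentage in Grade 10

def predict_salary(grade_10_percentage):
    """Evaluate the fitted line: intercept + slope * percentage."""
    return intercept + slope * grade_10_percentage

for row_index, pct in ((6, 70.0), (5, 55.0)):
    print(row_index, pct, round(predict_salary(pct), 2))
```

The results agree with the `predict` output for the same rows to within rounding of the displayed coefficients. The slope also has a direct reading: each additional percentage point in Grade 10 is associated with roughly 3,561 more in predicted salary.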

<b> Finding R-Square and RMSE </b>

In [15]:
from sklearn.metrics import r2_score, mean_squared_error

# R-squared is kept signed: a negative value on the validation set means the
# model predicts worse than simply using the mean of test_y
rsquare_train = round(r2_score(train_y, mba_salary_lm.fittedvalues), 4)
rsquare_valid = round(r2_score(test_y, pred_y), 4)

# RMSE is in the same units as Salary
rmse_train = round(np.sqrt(mean_squared_error(train_y, mba_salary_lm.fittedvalues)), 4)
rmse_valid = round(np.sqrt(mean_squared_error(test_y, pred_y)), 4)

print("R-Square_Train: ", rsquare_train, " RMSE_Train: ", rmse_train)
print("R-Square_Valid: ", rsquare_valid, " RMSE_Valid: ", rmse_valid)

R-Square_Train:  0.211  RMSE_Train:  69003.4482
R-Square_Valid:  -0.1566  RMSE_Valid:  73458.0435
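Both metrics can be computed by hand from their definitions, which makes the sign of R-squared visible. The sketch below uses the ten validation predictions printed above and the actual salaries for the same rows as listed in the final prediction table; the variable names are hypothetical.

```python
import math

# Actual and predicted salaries for the ten validation rows (same row order)
actual = [260000, 177600, 236000, 360000, 250000,
          300000, 300000, 330000, 120000, 300000]
predicted = [279828.40, 272707.23, 215737.83, 237101.35, 295851.05,
             247071.00, 226419.59, 308313.10, 254904.29, 295494.99]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: error of the model
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total sum of squares: error of always predicting the mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
rmse = math.sqrt(ss_res / len(actual))

print(f"R-squared (validation) = {r_squared:.4f}")
print(f"RMSE (validation)      = {rmse:.2f}")
```

On these ten rows the RMSE comes out at about 73,458 and R-squared at about -0.157. A negative R-squared is possible on held-out data: it means the residual sum of squares exceeds the total sum of squares, so the model does worse than predicting the validation mean for every student. Taking an absolute value would hide that.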


#### <font color='Green'>STEP 6: Making Predictions</font>

In [16]:
predictions = mba_salary_lm.get_prediction(test_X)
predictions_df = round(predictions.summary_frame(alpha=0.05),2)

# Store all the values in a dataframe
pred_y_df = pd.DataFrame( { 'grade_10_perc': test_X['Percentage in Grade 10'],
                            'test_y': test_y,
                            'pred_y': predictions_df['mean'],
                            'pred_y_left': predictions_df['obs_ci_lower'],
                            'pred_y_right': predictions_df['obs_ci_upper'],
                            'pred_interval_size':predictions_df['obs_ci_upper']-predictions_df['obs_ci_lower']} )
pred_y_df

| | grade_10_perc | test_y | pred_y | pred_y_left | pred_y_right | pred_interval_size |
|---|---|---|---|---|---|---|
| 6 | 70.0 | 260000 | 279828.4 | 134000.16 | 425656.65 | 291656.49 |
| 36 | 68.0 | 177600 | 272707.23 | 127260.89 | 418153.57 | 290892.68 |
| 37 | 52.0 | 236000 | 215737.83 | 68302.61 | 363173.04 | 294870.43 |
| 28 | 58.0 | 360000 | 237101.35 | 91458.13 | 382744.58 | 291286.45 |
| 43 | 74.5 | 250000 | 295851.05 | 148658.29 | 443043.8 | 294385.51 |
| 49 | 60.8 | 300000 | 247071.0 | 101837.28 | 392304.72 | 290467.44 |
| 5 | 55.0 | 300000 | 226419.59 | 80034.71 | 372804.47 | 292769.76 |
| 33 | 78.0 | 330000 | 308313.1 | 159585.7 | 457040.5 | 297454.8 |
| 20 | 63.0 | 120000 | 254904.29 | 109799.23 | 400009.35 | 290210.12 |
| 42 | 74.4 | 300000 | 295494.99 | 148340.06 | 442649.91 | 294309.85 |
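The `obs_ci_lower` and `obs_ci_upper` columns are 95% prediction intervals for an individual student's salary (`alpha=0.05`). One simple way to read the table is to check how many actual salaries fall inside their interval and how wide the intervals are. The sketch below does this with values copied from the table; the names `rows`, `coverage` and `widths` are hypothetical.

```python
# (test_y, pred_y_left, pred_y_right) copied from the prediction table
rows = [
    (260000, 134000.16, 425656.65),
    (177600, 127260.89, 418153.57),
    (236000, 68302.61, 363173.04),
    (360000, 91458.13, 382744.58),
    (250000, 148658.29, 443043.80),
    (300000, 101837.28, 392304.72),
    (300000, 80034.71, 372804.47),
    (330000, 159585.70, 457040.50),
    (120000, 109799.23, 400009.35),
    (300000, 148340.06, 442649.91),
]

# Share of actual salaries that lie inside their prediction interval
covered = [lower <= actual <= upper for actual, lower, upper in rows]
coverage = sum(covered) / len(rows)

# Width of each interval, as in the pred_interval_size column
widths = [upper - lower for _, lower, upper in rows]

print(f"coverage = {coverage:.0%}")
print(f"narrowest interval = {min(widths):.2f}")
print(f"widest interval    = {max(widths):.2f}")
```

All ten actual salaries fall inside their intervals, but ten rows are too few to judge the nominal 95% coverage. The intervals are each roughly 290,000 to 297,000 wide, which is larger than most of the salaries themselves, so the Grade 10 percentage alone gives only a very coarse prediction of an individual's salary.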


## Thank you