Linear Regression - A predictive modeling technique that helps us predict a continuous numerical value.
Linear Regression fits a line of best fit, which is the line that keeps the overall distance between the data points and the line as small as possible. This distance is called the error (or residual). More precisely, the line of best fit is drawn so that the sum of the squared errors across all data points is minimized.
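The line of best fit is represented as Y = mX + C for a simple linear regression and Y = m1X1 + m2X2 + C for a multiple linear regression (shown here with two independent variables),
where Y is the target (dependent) variable and X is the independent variable.
Y has to be a continuous numerical variable, and X must be numerical as well (categorical inputs have to be encoded as numbers first).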
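To make the roles of m and C concrete, here is a minimal, dependency-free sketch of how the slope and intercept of a simple linear regression can be computed with the standard least-squares formulas. The helper name `fit_line` is hypothetical, and the five (LastCTC, CTCoffered) pairs are just the first five rows of the CTC data used later in this document, not the full data set. Those five rows happen to lie exactly on Y = X + 1, so both values come out at approximately 1.

```python
# First five (LastCTC, CTCoffered) rows of the CTC data; illustrative subset only
last_ctc = [18, 16, 16, 8, 9]
ctc_offered = [19, 17, 17, 9, 10]

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = co-variation of X and Y divided by the variation of X
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The least-squares line always passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return slope, intercept

m, c = fit_line(last_ctc, ctc_offered)
print(round(m, 4), round(c, 4))
```

On the full data set the fitted values will differ, because real data points do not all sit on one line.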
The Linear Regression model itself produces a single predicted value, but in practice we report a range around that value rather than the single point. This range is calculated using the RMSE.
Each Linear Regression model has an R Square value (and an Adjusted R Square value, which also accounts for the number of independent variables). R Square represents the percentage of variance in Y that is explained by X.
A Linear Regression model with a HIGH R Square value and a LOW RMSE is an ideal model.
R-sq value (>= 75% as a rule of thumb): % of variation in Y which can be explained by variation in X.

Approach for building model in Python
 
Step 1: Read and access data
Step 2: Identify the variables (i.e. dependent and independent variables)
Step 3: Divide the data into training and test data
Step 4: Create the model using the training data set
Step 5: Find the values of slope, intercept and R Square (applicable only when we build a regression-based model)
Step 6: Predict the values on test data using your model
Step 7: Calculate the RMSE value from test data
Step 8: Predict the values for your validation data using the model and RMSE

ABC Inc. is an IT firm operating in the telecom domain. It has around 191 employees in India. Lately it has started benchmarking the CTC offered to employees against their last CTC. HR has the CTC data. Help them perform an analysis that shows the relationship between last CTC and the CTC offered, and predict what CTC can be offered to a candidate with a last CTC of 6 lakhs.

Data set used: “CTCdata”

In [1]:
import pandas as pd

In [2]:
#Step 1: Read and access data
ctc=pd.read_csv(r"E:/Dataset/CTCdata.csv")
ctc

Unnamed: 0,CTCoffered,LastCTC,Interview rating,Skill Set Index,Highest qualification,Total years of work exp
0,19,18,4,3,3,8.5
1,17,16,4,3,3,7.7
2,17,16,4,3,3,7.9
3,9,8,3,1,2,2.7
4,10,9,5,4,4,9.7
...,...,...,...,...,...,...
186,7,5,4,2,2,5.5
187,21,19,3,2,2,5.3
188,14,14,5,4,4,10.3
189,10,8,5,4,4,9.5


In [3]:
df = ctc[['CTCoffered','LastCTC']]
df

Unnamed: 0,CTCoffered,LastCTC
0,19,18
1,17,16
2,17,16
3,9,8
4,10,9
...,...,...
186,7,5
187,21,19
188,14,14
189,10,8


In [4]:
# Step 2: Identify the variables (i.e. dependent and independent variables)
Y = df[['CTCoffered']]
 
X = df[['LastCTC']]
 
# Step 3: Dividing the data into training and test data
 
from sklearn.model_selection import train_test_split
 
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size = 0.8, random_state=1234)

len(X_train), len(X_test), len(Y_train), len(Y_test)

(152, 39, 152, 39)
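The sizes above are consistent with the training count being 80% of the 191 rows rounded down, with the remainder going to the test set. The short check below reproduces that arithmetic; the rounding rule is an assumption inferred from the output, not a statement about scikit-learn's internals.

```python
import math

n_rows = 191        # 152 + 39 rows, from the output above
train_size = 0.8    # same fraction passed to train_test_split above

# Assumption: the training count is the fraction of rows rounded down,
# and the test set receives whatever is left over.
n_train = math.floor(train_size * n_rows)
n_test = n_rows - n_train
print(n_train, n_test)
```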

In [5]:
X_train

Unnamed: 0,LastCTC
148,5
102,15
39,17
25,6
129,6
...,...
152,17
116,8
53,10
38,6


In [6]:
# Step 4: Create the model using the training data set

# step a: Create a model object
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

# step b: Fit the model object to the training data to build a model
model = lr.fit(X_train,Y_train)
model

LinearRegression()

The simple Linear Regression form is: Y = mX + C

CTCoffered = m * LastCTC + Constant

Final model : CTCoffered = (0.93693697 * LastCTC) + 1.86356049

m-> Slope -> 0.93693697

c-> Constant -> 1.86356049

R-sq -> 0.9747618578793376, or about 97.48% on the training data (R-sq should be as high as possible)

In [7]:
# Step 5: Find the values of slope, intercept and R Square (applicable only when we build a regression-based model)
# Rule of thumb for a linear model: R-sq >= 75%

In [8]:
# To get the slope (m) of the fitted model
model.coef_

array([[0.93693697]])

In [9]:
# To get the constant (intercept, C) of the fitted model
model.intercept_

array([1.86356049])

In [10]:
# To calculate the R-sq value for model
rsq = model.score(X_train, Y_train)
rsq

0.9747618578793376
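Note that `model.score` returns the plain R Square, not the Adjusted R Square mentioned in the introduction. As a rough sketch, the adjusted value can be derived from it with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of training rows and p the number of independent variables. The helper name `adjusted_r_squared` is hypothetical; the inputs are copied from the outputs above.

```python
def adjusted_r_squared(r_squared, n_samples, n_features):
    """Adjusted R-squared: penalises R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_features - 1)

rsq = 0.9747618578793376   # training R-squared reported above
n_samples = 152            # training rows, from the split above
n_features = 1             # only LastCTC is used as a predictor

adj_rsq = adjusted_r_squared(rsq, n_samples, n_features)
print(round(adj_rsq, 4))
```

With a single independent variable and 152 rows the adjustment is tiny (the result is about 0.9746); it matters more in multiple linear regression with many independent variables.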

We should not report only a single point value. Whenever we are required to make a prediction, we should always give a prediction range.

In order to calculate the prediction range, we should know the RMSE.

RMSE - Root <- Mean <- Square <- Error (read it backwards: take the error, square it, take the mean, then take the square root)

For a good model, the RMSE should be as low as possible.

Lower Prediction = Predicted - RMSE

Upper Prediction = Predicted + RMSE
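Written out as formulas, the definitions above look as follows. The symbols are notation introduced here for illustration: Y_i is the actual value, \hat{Y}_i the predicted value for row i, and n the number of test rows.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2},
\qquad
\text{Lower Prediction} = \hat{Y} - \mathrm{RMSE},
\qquad
\text{Upper Prediction} = \hat{Y} + \mathrm{RMSE}
```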

In [11]:
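# Step 6: Predict the values on test data using your model

# Work on a copy so that adding columns to the split data cannot trigger pandas' SettingWithCopyWarning
Y_test = Y_test.copy()
Y_test['predicted'] = model.predict(X_test)
Y_test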

Unnamed: 0,CTCoffered,predicted
65,10,10.295993
185,12,13.106804
40,11,11.23293
31,18,17.791489
95,13,14.043741
48,19,17.791489
146,11,12.169867
130,21,19.665363
104,12,12.169867
163,17,16.854552


In [12]:
# Step 7: Calculate the RMSE value from test data. RMSE - Root <- Mean <- Square <- Error
Y_test['error'] = Y_test['CTCoffered'] - Y_test['predicted']
Y_test['sq-error'] = Y_test['error'] ** 2

Error_mean = Y_test['sq-error'].mean()
Error_mean  # This is the mean of the squared errors (MSE)

# Take the square root of the mean squared error -> RMSE
import math
RMSE = math.sqrt(Error_mean)
RMSE

0.8817060801700444

RMSE is approximately 0.8817.

The unit of RMSE is the same as the unit of the target variable (i.e. 0.88 lakhs). It means that predictions made with this model typically differ from the actual CTC offered by about ±0.88 lakhs.
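The ±RMSE band is best read as a typical error size rather than a guaranteed limit. As a quick illustration, the sketch below counts how many of the ten test rows displayed in this document have an actual value inside predicted ± RMSE. It uses only those displayed rows (not all 39 test rows), so the count is indicative only.

```python
rmse = 0.8817060801700444  # RMSE computed on the full test set above

# (actual, predicted) pairs for the test rows displayed in this document
rows = [
    (10, 10.295993), (12, 13.106804), (11, 11.232930), (18, 17.791489),
    (13, 14.043741), (19, 17.791489), (11, 12.169867), (21, 19.665363),
    (12, 12.169867), (17, 16.854552),
]

# Count rows whose actual value lies inside predicted +/- RMSE
inside = sum(1 for actual, pred in rows if abs(actual - pred) <= rmse)
print(f"{inside} of {len(rows)} displayed rows fall within +/- RMSE")
```

Here 5 of the 10 displayed rows fall inside the band, so some actual values should be expected to land outside the reported range.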

In [13]:
Y_test

Unnamed: 0,CTCoffered,predicted,error,sq-error
65,10,10.295993,-0.295993,0.087612
185,12,13.106804,-1.106804,1.225015
40,11,11.23293,-0.23293,0.054256
31,18,17.791489,0.208511,0.043477
95,13,14.043741,-1.043741,1.089396
48,19,17.791489,1.208511,1.460499
146,11,12.169867,-1.169867,1.368589
130,21,19.665363,1.334637,1.781256
104,12,12.169867,-0.169867,0.028855
163,17,16.854552,0.145448,0.021155


In [14]:
# Step 8: Predict the values for your validation data using model and RMSE
# Predict what CTC can be offered to a candidate with 6 lakhs of Last ctc.

val_data = pd.DataFrame({'LastCTC':[6]})
val_data

Unnamed: 0,LastCTC
0,6


In [15]:
pred_ctc_offered = model.predict(val_data)
pred_ctc_offered

array([[7.48518232]])

In [16]:
lower_range = pred_ctc_offered - RMSE

upper_range = pred_ctc_offered + RMSE
print(lower_range, upper_range)

[[6.60347624]] [[8.3668884]]
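The same calculation can be wrapped in a small helper so that ranges for several candidates can be produced at once. This is only a sketch: `ctc_range` is a hypothetical name, the constants are copied from the outputs above, and the inputs 10 and 15 are made-up example values.

```python
SLOPE = 0.93693697          # model.coef_ from the fitted model above
INTERCEPT = 1.86356049      # model.intercept_ from the fitted model above
RMSE = 0.8817060801700444   # RMSE from the test data above

def ctc_range(last_ctc):
    """Return (lower, predicted, upper) CTC offered, in lakhs."""
    predicted = SLOPE * last_ctc + INTERCEPT
    return predicted - RMSE, predicted, predicted + RMSE

# 6 is the candidate from the problem statement; 10 and 15 are illustrative
for last_ctc in (6, 10, 15):
    lower, predicted, upper = ctc_range(last_ctc)
    print(f"LastCTC={last_ctc}: {lower:.2f} to {upper:.2f} (point estimate {predicted:.2f})")
```

For a last CTC of 6 lakhs this reproduces the range above, roughly 6.60 to 8.37 lakhs.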
