# Importing Libraries and Dataset

Regression analysis is one of the most widely used methods of prediction. Linear regression is appropriate when the relationship between the variables is approximately linear. As the name suggests, simple linear regression has one independent variable (predictor) and one dependent variable (response).

The simple linear regression equation is y = a + bx, where x is the explanatory variable, y is the dependent variable, b is the slope coefficient, and a is the intercept.
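To make the roles of a and b concrete, here is a minimal sketch of the closed-form least-squares estimates for one predictor, written in plain Python. The function name `fit_simple_linear_regression` and the toy data are assumptions for illustration only; the notebook itself relies on Scikit-learn for the fit.

```python
def fit_simple_linear_regression(x, y):
    """Return (a, b) for y = a + b*x using the least-squares formulas."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = y_mean - b * x_mean
    return a, b


# Toy data lying exactly on y = 1 + 2x (hypothetical, not the salary data).
a, b = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```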

For regression analysis, first we have to import libraries.

In [1]:
#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import iplot

After importing the libraries, the dataset is loaded.

In [2]:
#Importing dataset
dataset = pd.read_csv('../input/salary-data-simple-linear-regression/Salary_Data.csv')

To see the first five rows of the dataset we can use dataset.head().

# Assigning dependent variable to y and independent variable to X.

In [3]:
#Assigning values to X & y
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [4]:
print(X)

[[ 1.1]
 [ 1.3]
 [ 1.5]
 [ 2. ]
 [ 2.2]
 [ 2.9]
 [ 3. ]
 [ 3.2]
 [ 3.2]
 [ 3.7]
 [ 3.9]
 [ 4. ]
 [ 4. ]
 [ 4.1]
 [ 4.5]
 [ 4.9]
 [ 5.1]
 [ 5.3]
 [ 5.9]
 [ 6. ]
 [ 6.8]
 [ 7.1]
 [ 7.9]
 [ 8.2]
 [ 8.7]
 [ 9. ]
 [ 9.5]
 [ 9.6]
 [10.3]
 [10.5]]


In [5]:
print(y)

[ 39343.  46205.  37731.  43525.  39891.  56642.  60150.  54445.  64445.
  57189.  63218.  55794.  56957.  57081.  61111.  67938.  66029.  83088.
  81363.  93940.  91738.  98273. 101302. 113812. 109431. 105582. 116969.
 112635. 122391. 121872.]


The dataset has to be split into a training set and a test set. This can be done with the train_test_split function from the model_selection module of the Scikit-learn library.


# Splitting Dataset

In [6]:
#Splitting dataset into X_train,X_test,y_train,y_test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.33,random_state=42)

Now the dataset is divided into X_train, X_test, y_train, and y_test based on the test_size provided.

Here the dataset has 30 observations and test_size is 0.33, i.e. 33% of the observations. The test set therefore has 0.33 × 30 = 9.9, rounded up to 10 observations, and the training set has the remaining 20. random_state is set to a fixed value so that the random shuffle, and hence the split, is reproducible across runs.
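The size arithmetic can be checked by hand. The sketch below assumes, as the printed outputs in this notebook suggest, that a fractional test_size is rounded up and the training set takes whatever is left; the variable names are illustrative.

```python
import math

n_samples = 30
test_size = 0.33

# A fractional test size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
# The training set receives the remaining rows.
n_train = n_samples - n_test

print(n_test, n_train)  # 10 20
```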

In [7]:
print(X_train)

[[ 2.2]
 [ 5.1]
 [ 2.9]
 [ 4.1]
 [ 4. ]
 [ 7.9]
 [ 1.3]
 [ 1.5]
 [ 9. ]
 [ 2. ]
 [ 7.1]
 [ 9.5]
 [ 5.9]
 [10.5]
 [ 6.8]
 [ 3.2]
 [ 3.9]
 [ 4.5]
 [ 6. ]
 [ 3. ]]


In [8]:
print(X_test)

[[ 9.6]
 [ 4.9]
 [ 8.2]
 [ 5.3]
 [ 3.2]
 [ 3.7]
 [10.3]
 [ 8.7]
 [ 4. ]
 [ 1.1]]


In [9]:
print(y_train)

[ 39891.  66029.  56642.  57081.  55794. 101302.  46205.  37731. 105582.
  43525.  98273. 116969.  81363. 121872.  91738.  54445.  63218.  61111.
  93940.  60150.]


In [10]:
print(y_test)

[112635.  67938. 113812.  83088.  64445.  57189. 122391. 109431.  56957.
  39343.]


random_state seeds the random shuffle that divides the data into the training set and the test set. With random_state=42 the same split is produced on every run; with a different value, such as 47, the dataset would be divided in a different, but again reproducible, way.
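The effect of a seed can be illustrated without Scikit-learn. The following is only an analogy built on Python's standard `random` module: the helper `split_indices` is hypothetical, and the indices it produces will not match the split shown above, because Scikit-learn uses its own random number generator.

```python
import random


def split_indices(n_samples, n_test, seed):
    """Shuffle row indices with a seeded generator and cut off a test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffled order
    return indices[n_test:], indices[:n_test]  # (train, test)


train_a, test_a = split_indices(30, 10, seed=42)
train_b, test_b = split_indices(30, 10, seed=42)
train_c, test_c = split_indices(30, 10, seed=47)

print(test_a == test_b)  # True: the same seed reproduces the same split
print(sorted(test_a))    # test rows chosen with seed 42
print(sorted(test_c))    # test rows chosen with seed 47
```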

# Linear Regression

To perform linear regression, the LinearRegression class is imported from the linear_model module of the Scikit-learn library. The simple regression model is an instance of LinearRegression.

In [12]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

# Predict

In [13]:
#Predicting y from the linear regression model
y_pred = lr.predict(X_test)

The data is visualized below: the training points (X_train, y_train) and the test points (X_test, y_test) are drawn as markers, and the line labelled prediction is the regression line, or best-fit line.

# ScatterPlot

In [14]:
x_range = np.linspace(X.min(), X.max(), 100)
y_range = lr.predict(x_range.reshape(-1, 1))

fig = go.Figure([
        go.Scatter(x=X_train.squeeze(), y=y_train, 
                   name='train', mode='markers'),
        go.Scatter(x=X_test.squeeze(), y=y_test, 
                   name='test', mode='markers'),
        go.Scatter(x=x_range, y=y_range, 
                   name='prediction')
    ])

fig.show()

# Coefficient and Intercept 

To write down the linear regression equation, the coefficient and intercept are needed. Both can be read from the fitted model's coef_ and intercept_ attributes, as below.

In [15]:
#Assigning Coefficient (slope) to b
b = lr.coef_

In [16]:
print("Coefficient  :" , b)

Coefficient  : [9426.03876907]


In [17]:
#Assigning Y-intercept to a
a = lr.intercept_

In [18]:
print("Intercept : ", a)

Intercept :  25324.33537924433


# Predicting Unknown Values

In [19]:
# y_pred = 9426.03876907 × (years of experience) + 25324.33537924433
# Predict the salary for 11 years of experience
print(lr.predict([[11]]))

[129010.76183907]
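As a sanity check, the same prediction can be reproduced by plugging the printed coefficient and intercept into y = a + bx. This is a sketch using the rounded values shown in the outputs above, so it agrees with lr.predict only up to that rounding.

```python
# Values copied from the printed outputs of the notebook.
b = 9426.03876907        # coefficient (slope), as printed
a = 25324.33537924433    # intercept, as printed

years_of_experience = 11
predicted_salary = a + b * years_of_experience

# Agrees with lr.predict([[11]]) up to the rounding of the printed slope.
print(predicted_salary)
```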


# Evaluation

In [20]:
#Mean Squared Error (MSE)
from sklearn import metrics

In [21]:
print('Mean Squared Error (MSE)  : ', metrics.mean_squared_error(y_test, y_pred))

Mean Squared Error (MSE)  :  35301898.887134895
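MSE is the average of the squared differences between actual and predicted values, so it is expressed in squared salary units. A minimal sketch of the definition follows; the helper `mse` and the toy numbers are assumptions for illustration. Taking the square root of the MSE reported above gives the root mean squared error (RMSE), a little under 6,000 in salary units, which is easier to interpret.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example (hypothetical data): errors are 1, 0 and -2.
mse_toy = mse([3, 5, 7], [2, 5, 9])
print(mse_toy)

# RMSE for the MSE value printed in the notebook output.
rmse_notebook = math.sqrt(35301898.887134895)
print(round(rmse_notebook, 2))
```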


In [22]:
import statsmodels.api as sm

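The ordinary least squares (OLS) estimator, available as sm.OLS in statsmodels.api, can be used to get a full regression summary. Because OLS does not add an intercept by itself, a constant column is added to X_train with sm.add_constant.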

In [23]:
X_stat = sm.add_constant(X_train)
regsummary = sm.OLS(y_train, X_stat).fit()
regsummary.summary()

Dep. Variable:     y                  R-squared:            0.955
Model:             OLS                Adj. R-squared:       0.952
Method:            Least Squares      F-statistic:          381.3
Date:              Fri, 15 Sep 2023   Prob (F-statistic):   1.45e-13
Time:              09:46:00           Log-Likelihood:       -200.48
No. Observations:  20                 AIC:                  405.0
Df Residuals:      18                 BIC:                  406.9
Df Model:          1
Covariance Type:   nonrobust

         coef        std err    t        P>|t|   [0.025      0.975]
const    2.532e+04   2743.538   9.231    0.000   1.96e+04    3.11e+04
x1       9426.0388   482.706    19.527   0.000   8411.911    1.04e+04

Omnibus:         0.822   Durbin-Watson:      1.772
Prob(Omnibus):   0.663   Jarque-Bera (JB):   0.819
Skew:            0.38    Prob(JB):           0.664
Kurtosis:        2.363   Cond. No.           12.4
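Some columns of the coefficient table can be recomputed from others, which helps when reading the summary. The sketch below uses the rounded numbers printed in the summary and assumes a two-sided 95% critical value of about 2.101 for a Student t distribution with 18 residual degrees of freedom (a standard table value, not taken from the notebook). The t statistic is the coefficient divided by its standard error, and the confidence interval is the coefficient plus or minus the critical value times the standard error.

```python
# Rounded values as printed in the regression summary.
coef_slope, se_slope = 9426.0388, 482.706
coef_const, se_const = 2.532e+04, 2743.538

# Assumed critical value: two-sided 95%, Student t with 18 degrees of freedom.
t_crit = 2.101

# t statistic = coefficient / standard error.
t_slope = coef_slope / se_slope
t_const = coef_const / se_const

# 95% confidence interval for the slope.
ci_low = coef_slope - t_crit * se_slope
ci_high = coef_slope + t_crit * se_slope

print(round(t_slope, 2), round(t_const, 2))
print(round(ci_low), round(ci_high))
```

Up to rounding, these reproduce the t column (19.527 and 9.231) and the interval [8411.911, 1.04e+04] reported for x1.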


The R-squared and adjusted R-squared values can be read from the fitted results as below.

In [24]:
print("Adjusted R-Square : ",regsummary.rsquared_adj)
print("R-Square : ",regsummary.rsquared)

Adjusted R-Square :  0.9524194554302405
R-Square :  0.9549236946181227
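The two values are linked by the usual adjustment for the number of predictors, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). A short check with the numbers from this notebook (n = 20 training observations, p = 1 predictor); the variable names are illustrative.

```python
r_squared = 0.9549236946181227  # R-squared as printed above
n = 20                          # number of training observations
p = 1                           # number of predictors

# Penalize R-squared for the number of predictors used.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(adjusted_r_squared)
```

The result matches the printed adjusted value, 0.9524194554302405, up to floating-point rounding.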


If only the R-squared value is of interest, r2_score can be imported from the metrics module of the Scikit-learn library. Here it is computed on the training data, so it matches the statsmodels value above.

In [25]:
from sklearn.metrics import r2_score

In [26]:
r2_score(y_train, lr.predict(X_train))

0.9549236946181227
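For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements that definition in plain Python; the helper `r_squared_score` and the toy data are assumptions for illustration, not part of the notebook.

```python
def r_squared_score(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    # Residual sum of squares: variation the model fails to explain.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variation around the mean of the actual values.
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy example (hypothetical data): SS_res = 0.10, SS_tot = 5.0.
r2_toy = r_squared_score([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
print(round(r2_toy, 4))  # 0.98
```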

# Reference

https://medium.com/geekculture/simple-linear-regression-analysis-using-python-c5b2f637942