This is a simple tutorial intended to help beginners understand and implement Simple Linear Regression from scratch.



<font color='blue'> Simple Linear Regression </font> is a great first machine learning algorithm to implement as it requires you to estimate properties from your training dataset, but is simple enough for beginners to understand. Linear regression is a prediction method that is more than 200 years old. In this tutorial, you will discover how to implement the simple linear regression algorithm from scratch in Python.

After completing this tutorial you will know:<br>
&#9632; How to estimate statistical quantities from training data.<br>
&#9632; How to estimate linear regression coefficients from data.<br>
&#9632; How to make predictions using linear regression for new data.<br>


Linear regression assumes a **linear or straight line relationship between the input variables (X) and the single output variable (y).** More specifically, it assumes that the output (y) can be calculated from a linear combination of the input variables (X). When there is a single input variable, the method is referred to as simple linear regression.

In simple linear regression we can use statistics on the training data to estimate the coefficients required by the model to make predictions on new data.

The line for a simple linear regression model can be written as:

$$ y = b_0 + b_1 * x $$
where $b_0$ (the intercept) and $b_1$ (the slope) are the coefficients we must estimate from the training data. Once the coefficients are known, we can use this equation to estimate output values for $y$ given new input examples of $x$. Estimating the coefficients requires calculating statistical properties of the data such as **mean, variance** and **covariance.**



## <font color = 'blue'> Swedish Insurance Dataset</font>
We will use a real dataset to demonstrate simple linear regression. The dataset is called the **“Auto Insurance in Sweden”** dataset and involves **<font color='blue'> predicting the total payment for all the claims in thousands of Swedish Kronor (y) given the total number of claims (x). </font>**

This means that for a new number of claims (x) we will be able to predict the total payment of claims (y).

Let's load the basic Python libraries that we will need over the course of this tutorial.

In [1]:
# library for manipulating the csv data
import pandas as pd

# library for scientific calculations on numbers + linear algebra
import numpy as np
import math

# library for regular plot visualizations
import matplotlib.pyplot as plt

#library for responsive visualizations
import plotly.express as px


### Reading the data

In [2]:
data = pd.read_csv('../input/auto-insurance-in-sweden/swedish_insurance.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X       63 non-null     int64  
 1   Y       63 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 1.1 KB


In [3]:
print(data.columns)
data.head(10)

Index(['X', 'Y'], dtype='object')


     X      Y
0  108  392.5
1   19   46.2
2   13   15.7
3  124  422.2
4   40  119.4
5   57  170.9
6   23   56.9
7   14   77.5
8   45  214.0
9   10   65.3


Let's have a look at the data itself. You can use either `matplotlib.pyplot` or `plotly` for visualization. The latter produces interactive visualizations: try hovering over the points on the graph to see the actual values.

In [4]:
fig = px.box(data['X'], points = 'all')
fig.update_layout(title = f'Distribution of X',title_x=0.5, yaxis_title= "Number of Insurance Claims")
fig.show()

fig = px.box(data['Y'], points = 'all')
fig.update_layout(title = f'Distribution of Y',title_x=0.5, yaxis_title= "Amount of Insurance Paid")
fig.show()

In [5]:
fig = px.scatter(x = data['X'], y=data['Y'])
fig.update_layout(title = 'Swedish Automobiles Data', title_x=0.5, xaxis_title= "Number of Claims", yaxis_title="Payment in Claims", height = 500, width = 700)
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

**This tutorial is broken down into five parts:<br>**
&#9832; Calculate Mean and Variance.<br>
&#9832; Calculate Covariance (X,Y).<br>
&#9832; Estimate Coefficients.<br>
&#9832; Make Predictions.<br>
&#9832; Visual Comparison for Correctness.<br>
These steps will give you the foundation you need to implement and train simple linear regression models for your own prediction problems.

### 1. Calculate Mean and Variance.
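As mentioned earlier, simple linear regression uses the mean and variance of the given data. We will use the built-in `numpy` functions `np.mean` and `np.var` to calculate them. Note that `np.var` divides by $n$ by default (population variance), which is consistent with the covariance formula used in the next step.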

In [6]:
data['Y']

0     392.5
1      46.2
2      15.7
3     422.2
4     119.4
      ...  
58     87.4
59    209.8
60     95.5
61    244.6
62    187.5
Name: Y, Length: 63, dtype: float64

In [7]:
mean_x = np.mean(data['X'])
mean_y = np.mean(data['Y'])

var_x = np.var(data['X'])
var_y = np.var(data['Y'])


print('x stats: mean= %.3f   variance= %.3f' % (mean_x, var_x))
print('y stats: mean= %.3f   variance= %.3f' % (mean_y, var_y))

x stats: mean= 22.905   variance= 536.658
y stats: mean= 98.187   variance= 7505.052
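
To see what `np.mean` and `np.var` compute, here is a minimal from-scratch sketch in plain Python. The helper names `mean` and `variance` are illustrative, and the sample values are only the first ten rows shown by `data.head(10)`, so the results differ from the full-dataset statistics printed above. The variance divides by $n$, which is NumPy's default (`ddof=0`).

```python
# First ten rows of the dataset, copied from the data.head(10) output.
x = [108, 19, 13, 124, 40, 57, 23, 14, 45, 10]
y = [392.5, 46.2, 15.7, 422.2, 119.4, 170.9, 56.9, 77.5, 214.0, 65.3]


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def variance(values):
    """Population variance: average squared deviation from the mean (divides by n)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


print('x sample: mean= %.3f   variance= %.3f' % (mean(x), variance(x)))
print('y sample: mean= %.3f   variance= %.3f' % (mean(y), variance(y)))
```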


### 2. Calculate Covariance.
The covariance of two groups of numbers describes how those numbers change together. It is closely related to correlation: correlation is the covariance rescaled by the standard deviations of the two groups, so it always lies between -1 and 1, whereas covariance keeps the units of the data. Covariance is calculated with the following formula, where $n$ is the number of observations.
$$ Cov(X,Y) = \frac{\sum{(X_i - \overline{X})}{(Y_i - \overline{Y})}}{n} $$

You can implement it yourself or use the built-in function `numpy.cov()`. Note that `numpy.cov()` divides by $n-1$ by default; pass `bias=True` to divide by $n$ as in the formula above.


In [8]:
# Calculate covariance between x and y
def covariance(x, y):
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar/len(x)



covar_xy = covariance(data['X'], data['Y'])
print(f'Cov(X,Y): {covar_xy}')

Cov(X,Y): 1832.0543461829182
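
The same calculation can be written more compactly with `zip`. This is a sketch rather than a replacement for the function above: the `ddof` argument is an assumption added here to show the difference between dividing by $n$ (the formula above) and by $n-1$ (the default of `numpy.cov()`), and the data are again only the first ten rows of the dataset.

```python
# First ten rows of the dataset, copied from the data.head(10) output.
x = [108, 19, 13, 124, 40, 57, 23, 14, 45, 10]
y = [392.5, 46.2, 15.7, 422.2, 119.4, 170.9, 56.9, 77.5, 214.0, 65.3]


def covariance(x, y, ddof=0):
    """Covariance of two equal-length sequences.

    ddof=0 divides by n (population covariance, as in the formula above);
    ddof=1 divides by n - 1 (sample covariance).
    """
    if len(x) != len(y):
        raise ValueError('x and y must have the same length')
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of the products of paired deviations from the means.
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return total / (len(x) - ddof)


print('population covariance:', covariance(x, y))
print('sample covariance:    ', covariance(x, y, ddof=1))
```

Whichever divisor you choose, use the same one for the variance, otherwise the slope estimate in the next step will be off by a factor of $n/(n-1)$.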


### 3. Estimate Coefficients
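We must estimate the values of two coefficients in simple linear regression. The slope $b_1$ is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept $b_0$ then follows from the two means:
$$ b_1 = \frac{Cov(X,Y)}{Var(X)} $$
$$ b_0 = \overline{Y} - b_1 * \overline{X} $$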

In [9]:
b1 = covar_xy / var_x
b0 = mean_y - b1 * mean_x

print(f'Coefficients:\n b0: {b0}  b1: {b1} ')


Coefficients:
 b0: 19.99448575911481  b1: 3.413823560066367 
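
The whole estimation step can be wrapped in one function. The sketch below is self-contained plain Python; the name `coefficients` is hypothetical, and it is run on the first ten rows only, so its output will not equal the full-dataset coefficients printed above. Because the covariance and the variance are both divided by $n$, the $n$ cancels and only the sums of deviations are needed.

```python
# First ten rows of the dataset, copied from the data.head(10) output.
x = [108, 19, 13, 124, 40, 57, 23, 14, 45, 10]
y = [392.5, 46.2, 15.7, 422.2, 119.4, 170.9, 56.9, 77.5, 214.0, 65.3]


def coefficients(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # n * covariance and n * variance; the common factor n cancels in the ratio.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError('x must contain at least two distinct values')
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = coefficients(x, y)
print(f'b0: {b0}  b1: {b1}')

# Sanity check: the fitted line passes through the point (mean_x, mean_y).
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
print(abs((b0 + b1 * mean_x) - mean_y) < 1e-9)
```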


### 4. Make Predictions
The simple linear regression model is a line defined by coefficients estimated from training data. Once the coefficients are estimated, we can use them to make predictions. The equation to make predictions with a simple linear regression model is as follows:
$$ \hat{y} = b_0 + b_1 * x $$

In [10]:
x = data['X'].values.copy()
x

array([108,  19,  13, 124,  40,  57,  23,  14,  45,  10,   5,  48,  11,
        23,   7,   2,  24,   6,   3,  23,   6,   9,   9,   3,  29,   7,
         4,  20,   7,   4,   0,  25,   6,   5,  22,  11,  61,  12,   4,
        16,  13,  60,  41,  37,  55,  41,  11,  27,   8,   3,  17,  13,
        13,  15,   8,  29,  30,  24,   9,  31,  14,  53,  26])

In [11]:
# Take the X values from the dataframe as a NumPy array.
# They are left unsorted so that each prediction stays aligned with its Y value.
x = data['X'].values.copy()
# x.sort()
print(f'x: {x}')

# Predict the outputs from the calculated coefficients.
y_hat = b0 + b1 * x
print(f'\n\ny_hat: {y_hat}')

y = data['Y'].values
print(f'\n\ny: {y}')

x: [108  19  13 124  40  57  23  14  45  10   5  48  11  23   7   2  24   6
   3  23   6   9   9   3  29   7   4  20   7   4   0  25   6   5  22  11
  61  12   4  16  13  60  41  37  55  41  11  27   8   3  17  13  13  15
   8  29  30  24   9  31  14  53  26]


y_hat: [388.68743025  84.8571334   64.37419204 443.30860721 156.54742816
 214.58242868  98.51242764  67.7880156  173.61654596  54.13272136
  37.06360356 183.85801664  57.54654492  98.51242764  43.89125068
  26.82213288 101.9262512   40.47742712  30.23595644  98.51242764
  40.47742712  50.7188978   50.7188978   30.23595644 118.995369
  43.89125068  33.64978     88.27095696  43.89125068  33.64978
  19.99448576 105.34007476  40.47742712  37.06360356  95.09860408
  57.54654492 228.23772292  60.96036848  33.64978     74.61566272
  64.37419204 224.82389936 159.96125172 146.30595748 207.75478156
 159.96125172  57.54654492 112.16772188  47.30507424  30.23595644
  78.02948628  64.37419204  64.37419204  71.20183916  47.30507424
 118.99536
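
The same equation also gives a prediction for any single claim count. A minimal sketch, assuming the coefficients printed in step 3 (hard-coded here so that the snippet runs on its own); `predict` is a hypothetical helper name.

```python
# Coefficients copied from the output of step 3.
b0 = 19.99448575911481
b1 = 3.413823560066367


def predict(x, b0, b1):
    """Predicted total payment (thousands of Kronor) for x claims."""
    return b0 + b1 * x


# 19 claims is the second row of the data; this matches the second entry of y_hat.
print(round(predict(19, b0, b1), 4))   # 84.8571

# Predictions for a few other claim counts.
for claims in [0, 50, 100]:
    print(claims, round(predict(claims, b0, b1), 2))
```

With zero claims the prediction is simply the intercept `b0`, which is worth keeping in mind when judging whether the fitted line is plausible near the edges of the data.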

### 5. Visual Comparison for Correctness 

In [12]:
import plotly.graph_objects as go
fig = go.Figure()

fig.add_trace(go.Scatter(x=data['X'], y=data['Y'], name='train', mode='markers', marker_color='rgba(152, 0, 0, .8)'))
fig.add_trace(go.Scatter(x=data['X'], y=y_hat, name='prediction', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))

fig.update_layout(title = f'Swedish Automobiles Data\n (visual comparison for correctness)',title_x=0.5, xaxis_title= "Number of Claims", yaxis_title="Payment in Claims")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

## Let's Understand Mean Squared Error
<p>The mean squared error (MSE) is a risk function that <b>measures the average of the squared errors</b>, that is, the squared differences between the actual and the predicted values. When performing regression, use MSE if you believe your target is normally distributed and you want large errors to be penalized more than small ones.</p>

<img src='https://www.simplilearn.com/ice9/free_resources_article_thumb/Reg_Line.png'></img>

In [13]:
#import linear regression model
from sklearn.linear_model import LinearRegression

In [14]:
data.head()

     X      Y
0  108  392.5
1   19   46.2
2   13   15.7
3  124  422.2
4   40  119.4


In [15]:
data.shape

(63, 2)

In [16]:
new_x = data.drop('Y',axis=1)

In [17]:
new_y = data['Y']

In [18]:
reg = LinearRegression()

In [19]:
reg.fit(new_x, new_y)

LinearRegression()

In [20]:
reg.predict([[19]])

array([84.8571334])

In [21]:
reg.coef_

array([3.41382356])

In [22]:
reg.intercept_

19.994485759114795

## Finding the Mean Squared Error


The Mean Squared Error is calculated as:

<strong>MSE = (1/n) * Σ(actual - forecast)<sup>2</sup></strong>
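
The formula translates directly into a few lines of plain Python. This is a sketch: `mse` and `rmse` are illustrative helper names, and the numbers in the example are made up to keep the arithmetic easy to follow.

```python
import math


def mse(actual, predicted):
    """Mean squared error: average of the squared differences."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE, in the units of y."""
    return math.sqrt(mse(actual, predicted))


# Made-up values: the errors are 0, 0 and -2, so MSE = (0 + 0 + 4) / 3.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(mse(actual, predicted))
print(rmse(actual, predicted))
```

Note that both arguments are values of the output variable: the actual targets and the model's predictions for the same rows.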

In [23]:
from sklearn.metrics import mean_squared_error

In [24]:
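# Compare the actual targets with the model's predictions (not X with Y).
# By default this returns the MSE; its square root is the RMSE.
mse = mean_squared_error(new_y, reg.predict(new_x))

In [25]:
mse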