This is a simple tutorial to help beginners understand and implement Simple Linear Regression from scratch.



<font color='blue'> Simple Linear Regression </font> is a great first machine learning algorithm to implement as it requires you to estimate properties from your training dataset, but is simple enough for beginners to understand. Linear regression is a prediction method that is more than 200 years old. In this tutorial, you will discover how to implement the simple linear regression algorithm from scratch in Python.

After completing this tutorial you will know:<br>
&#9632; How to estimate statistical quantities from training data.<br>
&#9632; How to estimate linear regression coefficients from data.<br>
&#9632; How to make predictions using linear regression for new data.<br>


Linear regression assumes a **linear or straight line relationship between the input variables (X) and the single output variable (y).** More specifically, it assumes that the output (y) can be calculated from a linear combination of the input variables (X). When there is a single input variable, the method is referred to as simple linear regression.

In simple linear regression we can use statistics on the training data to estimate the coefficients required by the model to make predictions on new data.

The line for a simple linear regression model can be written as:

$$ y = b_0 + b_1 * x $$
where $b_0$ and $b_1$ are the coefficients we must estimate from the training data. Once the coefficients are known, we can use this equation to estimate output values for $y$ given new input examples of $x$. Estimating the coefficients requires calculating statistical properties of the data such as the **mean, variance** and **covariance.**



## <font color = 'blue'> Swedish Insurance Dataset</font>
We will use a real dataset to demonstrate simple linear regression. The dataset is called the **“Auto Insurance in Sweden”** dataset and involves **<font color='blue'> predicting the total payment for all the claims in thousands of Swedish Kronor (y) given the total number of claims (x). </font>**

This means that for a new number of claims (x) we will be able to predict the total payment of claims (y).

Let's load some basic Python libraries that we will need over the course of this tutorial.

In [None]:
# library for manipulating the csv data
import pandas as pd

# library for scientific calculations on numbers + linear algebra
import numpy as np
import math

# library for regular plot visualizations
import matplotlib.pyplot as plt

#library for responsive visualizations
import plotly.express as px


In [None]:
data = pd.read_csv('../input/auto-insurance-in-sweden/swedish_insurance.csv')
data.sort_values('X', inplace= True)
data.info()

In [None]:
# Show the first 10 rows (a notebook cell only displays its last expression)
data.head(10)

# Manipulating the data (Task) 
The data has been manipulated to see the effect of outliers on the prediction model. It was seen that the MSE rises with the following manipulations. But overall, the line of best fit remains pretty much the same as before.

The manipulations have been commented out so as to show the MSE for the data that has been provided.

In [None]:
# Manipulating Data to create some outliers

# data.loc[data.X==61,'Y'] = 450
# data.loc[data.X==40,'Y'] = 50


Let's have a look at the data itself. You can either use `matplotlib.pyplot` or `plotly` for visualization. The latter one produces responsive visualizations. Try hovering over the points on the graph to see the actual values.

In [None]:
fig = px.scatter(x = data['X'], y=data['Y'])
fig.update_layout(title = 'Swedish Automobiles Data', title_x=0.5, xaxis_title= "Number of Claims", yaxis_title="Payment in Claims", height = 500, width = 700)
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

**This tutorial is broken down into five parts:<br>**
&#9832; Calculate Mean and Variance.<br>
&#9832; Calculate Covariance.<br>
&#9832; Estimate Coefficients.<br>
&#9832; Make Predictions.<br>
&#9832; Visual Comparison for Correctness.<br>
These steps will give you the foundation you need to implement and train simple linear regression models for your own prediction problems.

### 1. Calculate Mean and Variance.
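As noted earlier, simple linear regression uses the mean and variance of the data. We will use `numpy` built-in functions to calculate them. Note that `np.var` computes the population variance (dividing by n) by default, which matches the covariance formula used in the next step.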

In [None]:
mean_x = np.mean(data['X'])
mean_y = np.mean(data['Y'])

var_x = np.var(data['X'])
var_y = np.var(data['Y'])


print('x stats: mean= %.3f   variance= %.3f' % (mean_x, var_x))
print('y stats: mean= %.3f   variance= %.3f' % (mean_y, var_y))
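
To make it clearer what the two built-in functions compute, here is a minimal sketch that calculates the mean and the population variance by hand. The array `x_toy` is made-up illustration data, not the insurance dataset.

```python
import numpy as np

# Made-up toy values for illustration only (not the insurance data)
x_toy = np.array([1.0, 2.0, 4.0, 3.0, 5.0])

# Mean: sum of the values divided by the number of values
mean_manual = x_toy.sum() / x_toy.size

# Population variance: average squared distance from the mean (divides by n)
var_manual = ((x_toy - mean_manual) ** 2).sum() / x_toy.size

print(f'mean: {mean_manual}  variance: {var_manual}')

# np.var uses ddof=0 by default, so it should agree with the manual value
print(np.isclose(var_manual, np.var(x_toy)))
```

Passing `ddof=1` to `np.var` would instead divide by n - 1 (the sample variance), which gives a slightly larger number.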

### 2. Calculate Covariance.
The covariance of two groups of numbers describes how those numbers change together. It is closely related to correlation: correlation is the covariance divided by the product of the two standard deviations, which makes it unitless and bounded between -1 and 1, whereas covariance keeps the units of the data. Covariance is calculated by the following formula.
$$ Cov(X,Y) = \frac{\sum_{i=1}^{n}{(X_i - \overline{X})}{(Y_i - \overline{Y})}}{n} $$

You can implement it yourself or use the built-in function `numpy.cov()`. Note that `numpy.cov()` returns a covariance matrix and divides by n - 1 by default; pass `bias=True` to divide by n as in the formula above.


In [None]:
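# Calculate the population covariance between x and y (divides by n)
def covariance(x, y):
    # Convert to NumPy arrays so that x[i] is positional, even when a
    # pandas Series with a reordered index is passed in
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar / len(x)

covar_xy = covariance(data['X'], data['Y'])
print(f'Cov(X,Y): {covar_xy}')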
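
As a sanity check on a hand-written covariance, it can help to compare it against `numpy.cov()`. The sketch below does this on made-up toy arrays (`x_toy`, `y_toy` are assumptions for illustration, not the insurance data) and shows the effect of the `bias` argument.

```python
import numpy as np

# Made-up toy values for illustration only
x_toy = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y_toy = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# Vectorised population covariance: mean of the products of deviations
cov_manual = np.mean((x_toy - x_toy.mean()) * (y_toy - y_toy.mean()))

# np.cov returns a 2x2 matrix; element [0, 1] is Cov(x, y).
# bias=True divides by n, matching the formula used in this tutorial.
cov_population = np.cov(x_toy, y_toy, bias=True)[0, 1]

# The default (bias=False) divides by n - 1 instead
cov_sample = np.cov(x_toy, y_toy)[0, 1]

print(cov_manual, cov_population, cov_sample)
```

The first two values should agree, while the third is larger by a factor of n / (n - 1).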

### 3. Estimate Coefficients
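We must estimate the values of two coefficients in simple linear regression: the slope $b_1$ and the intercept $b_0$. Both follow from the statistics computed in the previous steps:
$$ b_1 = \frac{Cov(X,Y)}{Var(X)} \qquad b_0 = \overline{Y} - b_1 * \overline{X} $$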

In [None]:
b1 = covar_xy / var_x
b0 = mean_y - b1 * mean_x

print(f'Coefficients:\n b0: {b0}  b1: {b1} ')
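
One way to gain confidence in the covariance/variance estimates is to cross-check them against a library least-squares fit. The sketch below uses `np.polyfit` with degree 1 on made-up toy data; the names `x_toy`, `y_toy`, `b0_toy` and `b1_toy` are illustrative assumptions, and `np.polyfit` is a swapped-in check rather than the method used in this tutorial.

```python
import numpy as np

# Made-up toy values for illustration only
x_toy = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y_toy = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# Slope and intercept from covariance and variance (both divide by n)
cov_xy = np.mean((x_toy - x_toy.mean()) * (y_toy - y_toy.mean()))
b1_toy = cov_xy / np.var(x_toy)
b0_toy = y_toy.mean() - b1_toy * x_toy.mean()

# np.polyfit returns coefficients from highest degree to lowest,
# so for deg=1 the result is (slope, intercept)
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)

print(f'manual : b0={b0_toy:.3f}  b1={b1_toy:.3f}')
print(f'polyfit: b0={intercept:.3f}  b1={slope:.3f}')
```

Both approaches solve the same least-squares problem, so the two lines should match up to floating-point rounding.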


### 4. Make Predictions
The simple linear regression model is a line defined by coefficients estimated from training data. Once the coefficients are estimated, we can use them to make predictions. The equation to make predictions with a simple linear regression model is as follows:
$$ \hat{y} = b_0 + b_1 * x $$

In [None]:
# Take the X values from the dataframe and sort them so the fitted line is easy to plot later on
x = data['X'].values.copy()
x.sort()
print(f'x: {x}')

# Predict the outputs from the calculated coefficients.
y_hat = b0 + b1 * x
print(f'y_hat: {y_hat}')
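
For inputs that were not in the training data, the same equation applies. The sketch below wraps it in a small helper; `predict` is a hypothetical function name, and the coefficients `b0_demo` and `b1_demo` are placeholder values chosen for illustration, not the ones fitted on the insurance data.

```python
import numpy as np

def predict(x_new, b0, b1):
    """Return b0 + b1 * x for a scalar or an array-like of inputs."""
    return b0 + b1 * np.asarray(x_new, dtype=float)

# Placeholder coefficients for illustration only
b0_demo, b1_demo = 0.4, 0.8

# Predictions for three hypothetical numbers of claims
preds = predict([0, 10, 25], b0_demo, b1_demo)
print(preds)
```

In the notebook you would pass the fitted `b0` and `b1` instead of the placeholders.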

### 5. Visual Comparison for Correctness 

In [None]:
import plotly.graph_objects as go
fig = go.Figure()

fig.add_trace(go.Scatter(x=data['X'], y=data['Y'], name='train', mode='markers', marker_color='rgba(152, 0, 0, .8)'))
fig.add_trace(go.Scatter(x=x, y=y_hat, name='prediction', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))

fig.update_layout(title = 'Swedish Automobiles Data<br>(visual comparison for correctness)', title_x=0.5, xaxis_title= "Number of Claims", yaxis_title="Payment in Claims", height = 500, width = 800)
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

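# Writing the MSE function (Task)
The mean squared error (MSE) is the average of the squared differences between the observed values $y_i$ and the predictions $\hat{y}_i$:
$$ MSE = \frac{1}{n}\sum_{i=1}^{n}{(y_i - \hat{y}_i)^2} $$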


In [None]:
def MSE(y_true, y_pred):
    # Mean squared error: the average of the squared residuals
    diff = np.subtract(y_true, y_pred)
    squared_diff = np.square(diff)
    return np.sum(squared_diff) / diff.size


# data was sorted by X when it was loaded, so data['Y'] lines up with y_hat
mse = MSE(data['Y'].values, y_hat)
print(mse)
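
Because MSE is expressed in squared units, its square root (RMSE) is often easier to interpret, since it is in the same units as `y`. The sketch below is one possible way to compute both on made-up toy values; `mse_score` and `rmse_score` are hypothetical helper names, and the arrays are not the insurance data.

```python
import numpy as np

def mse_score(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def rmse_score(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    return np.sqrt(mse_score(y_true, y_pred))

# Made-up observations and predictions for illustration only
y_obs = np.array([1.0, 3.0, 3.0, 2.0, 5.0])
y_fit = np.array([1.2, 2.0, 3.6, 2.8, 4.4])

print(f'MSE : {mse_score(y_obs, y_fit):.3f}')
print(f'RMSE: {rmse_score(y_obs, y_fit):.3f}')
```

For the insurance data, the RMSE would be read in thousands of Swedish Kronor.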
    

# Observations
* Manipulating the data to create some outliers resulted in a higher value of MSE.
* But overall, the line of best fit remained pretty much the same.
* It is easy to see how linear regression extends to multiple features: each feature gets its own input variable and its own coefficient (parameter).
* For example, with two features (say, $x_1$ and $x_2$), the linear regression model would be:
$$ \hat{y} = b_0 + b_1 * x_1 + b_2 * x_2 $$

* Finally, more training data generally gives more reliable coefficient estimates, provided the relationship really is close to linear.
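
The two-feature model above can be fitted with ordinary least squares. The sketch below is one possible approach using `np.linalg.lstsq` rather than the covariance/variance formulas from this tutorial; the data is synthetic, generated from known coefficients so the result can be checked, and all variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic features and a noise-free target built from known coefficients:
# y = 1 + 2 * x1 + 3 * x2
x1_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2_toy = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y_multi = 1.0 + 2.0 * x1_toy + 3.0 * x2_toy

# Design matrix: a column of ones for the intercept, then one column per feature
A = np.column_stack([np.ones_like(x1_toy), x1_toy, x2_toy])

# Least-squares solution of A @ coeffs = y_multi
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, y_multi, rcond=None)
b0_m, b1_m, b2_m = coeffs

print(f'b0: {b0_m:.3f}  b1: {b1_m:.3f}  b2: {b2_m:.3f}')
```

Since the target was generated without noise, the recovered coefficients should match the ones used to build it, up to floating-point rounding.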


## Where To Go From Here 
* <font color="red">Can you find out the squared error of the predictions?</font>
* Extend the same problem for multiple input features. 