# Introduction to Linear Regression

* Regression analysis is likely the first and simplest of the predictive models we will use in this course
* It estimates the relationship between a dependent variable (also called the target) and an independent variable (also called the predictor)
* The least squares method was first published in the early 1800s by Legendre and Gauss
* The term "regression" was coined by Francis Galton (a cousin of Charles Darwin) to describe the biological phenomenon of extreme values moving towards the population mean
* It can be classified as a supervised learning algorithm
* It is very easy to interpret
* It is also called the least squares method

# Table of Contents
<ul>
    <li> <a href="#fakedata"> Creating Data </a> </li>
    <li> <a href="#fitting"> Model Fitting </a> </li>
    <li> <a href="#evaluation"> Model Evaluation (R-Square value) </a> </li>
    <li> <a href="#assumptions"> Assumptions of linear regression</a> </li>
</ul>

Let's import all the necessary Python libraries

In [None]:
import sys
import re
import numpy as np
from scipy import stats  # polyfit is taken from NumPy (np.polyfit) below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import resample
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import warnings
warnings.simplefilter("ignore")

<a id='fakedata'></a>
# Creating Data
* We will create some fake height vs. weight data

In [None]:
height = np.random.uniform(40,75, 20)
beta0 = 10
beta1 = 2
weight = beta0 + beta1*height

# Add some random error

weight  = weight + np.random.normal(0,10, len(height))


Create a DataFrame for easy viewing

In [None]:
df = pd.DataFrame({'Height':height, 'Weight':weight})
df = df.round(1)
df.head(10)

In [None]:
fig,axes = plt.subplots(figsize=(5,5))
plt.plot(height,weight, linestyle='', marker='o')
plt.xlabel('Height [inches]')
plt.ylabel('Weight [Pounds]')

<a id='fitting'></a>
# Model Fitting
* Method of least squares fitting
* The method finds the best line, defined as the one that makes the sum of the squared errors between the line and the data points as small as possible

<img src="img/least-squares2.svg" width=300, align="center" />

## Mathematics for least square fitting

The standard notation is:

$$y^{model} = \beta_{0} + \beta_{1}x$$

where $\beta_{0}$ and $\beta_{1}$ are the regression coefficients we need to determine.

Let us define the error as:

$$e_{i} = y_{i}^{actual} - y_{i}^{model}$$ 

From this error definition, the sum of the squared errors over all data points is:

$$\sum\limits_{i=1}^n e_{i}^2 = \sum\limits_{i=1}^n (y_{i}^{actual} - y_{i}^{model})^2 = \sum\limits_{i=1}^n (y_{i}^{actual} - (\beta_{0} + \beta_{1}x_{i}))^2$$

We minimize the above expression with respect to $\beta_{0}$ and $\beta_{1}$ by setting each partial derivative to zero. Solving the resulting pair of equations gives the values of our regression coefficients.

$$\frac{\partial }{\partial \beta_{0}} \sum\limits_{i=1}^n (y_{i}^{actual} - (\beta_{0} + \beta_{1}x_{i}))^2 = 0$$

$$\frac{\partial }{\partial \beta_{1}} \sum\limits_{i=1}^n (y_{i}^{actual} - (\beta_{0} + \beta_{1}x_{i}))^2 = 0$$
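
The minimization condition above can be checked numerically without solving the equations by hand. The sketch below is only an illustration: the helper names `sum_squared_errors` and `numerical_gradient` and the sample data are assumptions, not part of this notebook. It estimates both partial derivatives with central differences and shows that they are close to zero at the coefficients returned by `np.polyfit`, and clearly non-zero elsewhere.

```python
import numpy as np

def sum_squared_errors(beta0, beta1, x, y):
    """Sum of squared errors for the line y = beta0 + beta1 * x."""
    return np.sum((y - (beta0 + beta1 * x)) ** 2)

def numerical_gradient(beta0, beta1, x, y, step=1e-4):
    """Central-difference estimate of the two partial derivatives."""
    d_beta0 = (sum_squared_errors(beta0 + step, beta1, x, y)
               - sum_squared_errors(beta0 - step, beta1, x, y)) / (2 * step)
    d_beta1 = (sum_squared_errors(beta0, beta1 + step, x, y)
               - sum_squared_errors(beta0, beta1 - step, x, y)) / (2 * step)
    return d_beta0, d_beta1

# Hypothetical sample data, generated like the height/weight data above
rng = np.random.default_rng(0)
x_demo = rng.uniform(40, 75, 20)
y_demo = 10 + 2 * x_demo + rng.normal(0, 10, len(x_demo))

# Least-squares coefficients from np.polyfit (highest power first)
slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)

grad_at_fit = numerical_gradient(intercept_demo, slope_demo, x_demo, y_demo)
grad_elsewhere = numerical_gradient(intercept_demo + 5, slope_demo, x_demo, y_demo)
print("gradient at the fitted coefficients:", grad_at_fit)
print("gradient away from the fit:", grad_elsewhere)
```

Because the sum of squared errors is quadratic in both coefficients, the central difference is exact up to rounding, so a small step size is enough here.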


    

## <font color='red'>Assignment related to the least squares method</font>
1. Find the values of $\beta_{0}$ and $\beta_{1}$ from the above expressions. This is the analytical solution for the regression coefficients.

## <font color='black'>Let us build our model using the NumPy polyfit routine, which calculates the best-fit line using the least squares method</font>

In [None]:
parameters, cov_matrix = np.polyfit(df['Height'], df['Weight'], 1, cov=True)
intercept, intercept_error = parameters[1], np.sqrt(cov_matrix[1][1])
slope, slope_error = parameters[0], np.sqrt(cov_matrix[0][0])

print("beta0 = {:.2g} +/- {:.2g} , beta1 = {:.2f} +/- {:.2g} ".format(intercept, intercept_error,slope, slope_error))

print("Actual value")

print("beta0 = {:.2g}, beta1 = {:.2f}".format(beta0, beta1))
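
The cell above reads the slope from `parameters[0]` and the intercept from `parameters[1]` because `np.polyfit` returns coefficients with the highest power first. A minimal sketch on noise-free data makes the ordering easy to verify (the names `x_line`, `y_line`, and `coeffs` are illustrative assumptions):

```python
import numpy as np

# Hypothetical noise-free data on the line y = 10 + 2x
x_line = np.array([40.0, 50.0, 60.0, 70.0])
y_line = 10 + 2 * x_line

# Degree-1 fit: coeffs[0] is the slope, coeffs[1] is the intercept
coeffs = np.polyfit(x_line, y_line, 1)
slope_line, intercept_line = coeffs
print("slope = {:.2f}, intercept = {:.2f}".format(slope_line, intercept_line))
# prints: slope = 2.00, intercept = 10.00
```

The diagonal of the covariance matrix returned with `cov=True` follows the same ordering, which is why the slope error is taken from `cov_matrix[0][0]`.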


In [None]:
x_min, x_max = np.min(df['Height']), np.max(df['Height'])
x = np.linspace(x_min-5,x_max+5,100)

In [None]:
fig,axes = plt.subplots(figsize=(8,8))
plt.plot(df['Height'], df['Weight'], marker='o', linestyle='none', label='Data')
plt.plot(x, x*slope + intercept, label="slope={:.2g}, intercept={:.2f}".format(slope, intercept))
plt.legend()
plt.xlabel('Height')
plt.ylabel('Weight(Y)')

## <font color='red'>Assignment related to model fitting</font>
1. Compare the value of $\beta_{0}$ and $\beta_{1}$ from your calculation to the value calculated by the standard libraries

<a id='evaluation'></a>
# Model Evaluation (or Model Goodness)

## R-Squared value
* R-Square value: the proportion of the variance in the dependent variable that is predictable from the independent variable. In other words, it is a statistical measure of how close the data are to the fitted regression line.
* The best possible fit corresponds to an R-Square of 1, where the line passes through every data point


$$R^{2} = \frac{\sum\limits_{i=1}^n (y_{i}^{model} - y^{mean})^2}{\sum\limits_{i=1}^n (y_{i}^{actual} - y^{mean})^2}$$

In [None]:
def r2_score(x,y,parameters):
    ybar = np.mean(y)
    ymodel = parameters[0]*x + parameters[1]
    ssreg = np.sum((ymodel-ybar)**2)
    sstot = np.sum((y - ybar)**2)
    
    return (ssreg / sstot)
    

In [None]:
print("R-Square value from our calculation = {:.2f}".format(r2_score(height, weight, parameters)))

Let us compare this value with the one from a standard package

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(height, weight)
print("R-squared from standard package: {:.3f}".format(r_value**2))
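
For a least-squares line with an intercept, the ratio used in `r2_score` agrees with two other common ways of writing R-Square: one minus the residual sum of squares over the total sum of squares, and the squared Pearson correlation. The sketch below checks this on synthetic data; all variable names and the data are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical data, generated like the height/weight data above
rng = np.random.default_rng(1)
x_demo = rng.uniform(40, 75, 20)
y_demo = 10 + 2 * x_demo + rng.normal(0, 10, len(x_demo))

slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)
y_model = slope_demo * x_demo + intercept_demo

ss_reg = np.sum((y_model - np.mean(y_demo)) ** 2)   # explained sum of squares
ss_res = np.sum((y_demo - y_model) ** 2)            # residual sum of squares
ss_tot = np.sum((y_demo - np.mean(y_demo)) ** 2)    # total sum of squares

r2_explained = ss_reg / ss_tot
r2_residual = 1 - ss_res / ss_tot
r2_pearson = stats.linregress(x_demo, y_demo).rvalue ** 2
print(r2_explained, r2_residual, r2_pearson)
```

The three values coincide only when the model is the least-squares fit including an intercept; for other lines the first two expressions can differ.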

## Residual Plot
* The residual plot is one of the most useful plots for judging whether the predictions of a linear regression model are acceptable
* A good residual plot looks like a random scatter of points around zero

In [None]:
fig,axes = plt.subplots(1,2,figsize=(12,6))
sns.regplot(x='Height', y='Weight', data=df, ax = axes[0])
sns.residplot(x='Height', y='Weight', data=df, ax = axes[1])

axes[0].set_title('Regression plot')
axes[1].set_title('Residual plot')

## <font color='red'>Assignment related to model evaluation</font>
1. Write a function for the adjusted R-Squared value.
2. Explain why adjusted R-Squared is a better measure of model goodness than R-Squared.
3. Write a function that draws the residual plot and compare it with the plot above. Do you see the same plot from your own function?


<a id='assumptions'></a>
# Assumptions of linear regression
* There should be no measurement error in the observed values of the target and of each input variable
* The relationship between the dependent variable and each of the independent variables is linear
* The observations should be independent
* The variance of the errors around the regression line should be constant (homoscedasticity)
* Outliers should be treated properly; otherwise they will bias your model

## Violation of Homoscedasticity
The residual plot is a good way to check whether the data are homoscedastic (meaning the spread of the residuals is roughly constant along the regression line). The following scatter plots show examples of data that are not homoscedastic.

<img src="img/homoscedasticity.png" width=500, align="center" />

In [None]:
height = np.random.uniform(40,75, 100)
beta0 = 10
beta1 = 2
weight = beta0 + beta1*height

# Add random error whose spread grows with the weight.
# Note: the second argument of np.random.normal is a standard deviation, not a variance.
noise_scale = (weight / 20).astype(int) ** 4
new_weight = weight + np.random.normal(0, noise_scale)


In [None]:
df = pd.DataFrame({'Height':height, 'Weight':new_weight})
df = df.round(1)
df.head(10)

In [None]:
fig,axes = plt.subplots(1,2,figsize=(12,6))
sns.regplot(x='Height', y='Weight', data=df, ax = axes[0])
sns.residplot(x='Height', y='Weight', data=df, ax = axes[1])

axes[0].set_title('Regression plot')
axes[1].set_title('Residual plot')
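
A rough numeric companion to the residual plot is to compare the spread of the residuals in the lower and upper halves of the predictor range. This is only a sketch under assumed data (noise whose standard deviation grows linearly with `x_demo`), not a formal test for heteroscedasticity.

```python
import numpy as np

rng = np.random.default_rng(2)
x_demo = rng.uniform(40, 75, 200)
# Hypothetical noise whose standard deviation grows with x
y_demo = 10 + 2 * x_demo + rng.normal(0, 0.5 * (x_demo - 35))

slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)
residuals = y_demo - (slope_demo * x_demo + intercept_demo)

# Split the observations at the median of the predictor
lower_half = x_demo < np.median(x_demo)
spread_low = np.std(residuals[lower_half])
spread_high = np.std(residuals[~lower_half])
print("residual std, lower half of x: {:.1f}".format(spread_low))
print("residual std, upper half of x: {:.1f}".format(spread_high))
```

For homoscedastic data the two numbers should be similar; a large gap matches the funnel shape seen in the residual plot.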

## Violation of linear relation

* If curvature is present in the residuals, it is likely that there is curvature in the relationship between the response and the predictor that is not explained by our model
* In that case, a linear model does not adequately describe the relationship between the predictor and the response

<img src="img/linearity.png" width=500, align="center" />


In [None]:
height = np.random.uniform(40,75, 50)
beta0 = 10
beta1 = 2
weight = beta0 + beta1*height*height + np.random.normal(0,10, len(height))

df = pd.DataFrame({'Height':height, 'Weight':weight})
df = df.round(1)
df.head(10)

In [None]:
fig,axes = plt.subplots(1,2,figsize=(12,6))
sns.regplot(x='Height', y='Weight', data=df, ax = axes[0])
sns.residplot(x='Height', y='Weight', data=df, ax = axes[1])

axes[0].set_title('Regression plot')
axes[1].set_title('Residual plot')
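
One way to confirm that the curved residuals come from a missing quadratic term is to compare the sum of squared errors of a straight-line fit with that of a degree-2 fit. The sketch below uses assumed data generated like the cell above; the degree-2 fit is shown only as a check, not as a recommended model.

```python
import numpy as np

rng = np.random.default_rng(3)
x_demo = rng.uniform(40, 75, 50)
# Hypothetical data with a quadratic dependence on x
y_demo = 10 + 2 * x_demo**2 + rng.normal(0, 10, len(x_demo))

linear_coeffs = np.polyfit(x_demo, y_demo, 1)
quadratic_coeffs = np.polyfit(x_demo, y_demo, 2)

# np.polyval evaluates a polynomial given highest-power-first coefficients
sse_linear = np.sum((y_demo - np.polyval(linear_coeffs, x_demo)) ** 2)
sse_quadratic = np.sum((y_demo - np.polyval(quadratic_coeffs, x_demo)) ** 2)
print("SSE, straight line: {:.0f}".format(sse_linear))
print("SSE, degree-2 fit:  {:.0f}".format(sse_quadratic))
```

A large drop in the sum of squared errors indicates that the straight line was missing real structure in the data rather than just noise.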

## Effect of Outlier

* Outliers can have a big influence on the fit of the regression line
* In the example below, a single extreme outlier tilts the regression line
* As a result, the model will not predict well for many of the observations

<img src="img/outlier.png" width=500, align="center" />

In [None]:
x = np.random.uniform(0,5,10)
y = 2*x + 5 + np.random.normal(0,3,size=len(x))

x_outlier = np.append(x,4)
y_outlier = np.append(y,200)

slope, intercept = np.polyfit(x,y,1)
slope_outlier, intercept_outlier = np.polyfit(x_outlier, y_outlier,1)

fig, axes = plt.subplots(figsize=(6,6))

plt.plot(x_outlier, y_outlier, marker='o', linestyle='none', label='Data')

plt.plot(x, x*slope_outlier+intercept_outlier, label='Fit with outlier')

plt.plot(x, x*slope+intercept, label='Fit without outlier')

plt.legend()

plt.xlabel('X [arbitrary units]')
plt.ylabel('Y [arbitrary units]')
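
The tilt visible in the plot can also be read directly from the fitted coefficients. A minimal sketch, using assumed data generated like the cell above and illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(4)
x_demo = rng.uniform(0, 5, 10)
y_demo = 2 * x_demo + 5 + rng.normal(0, 3, size=len(x_demo))

# Fit once without and once with a single extreme point at (4, 200)
slope_clean, intercept_clean = np.polyfit(x_demo, y_demo, 1)
slope_out, intercept_out = np.polyfit(np.append(x_demo, 4), np.append(y_demo, 200), 1)

print("without outlier: slope = {:.2f}, intercept = {:.2f}".format(slope_clean, intercept_clean))
print("with outlier:    slope = {:.2f}, intercept = {:.2f}".format(slope_out, intercept_out))
```

Because the added point lies far above the line and to the right of the mean of `x_demo`, it pulls the slope upward.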