### Topics to Discuss:
>What is linear regression?<br>
>Analyzing Advertisement dataset.<br>
>Building a simple linear regression model & multiple linear regression model.<br>
>Understanding OLS methods to estimate model parameters.<br>
>How to use the statsmodels API in Python?<br>
>Interpreting the coefficients of the model.<br>
>How to find if the parameters estimated are significant?<br>
>Making predictions using the model.<br>
>Finding model residuals and analyzing them.<br>
>Evaluating model efficiency using RMSE and R-Square values.<br>
>Understanding gradient descent approach to find model parameters.<br>
>Splitting datasets and cross validating models.<br>

### Advertisement Dataset
>The advertising dataset captures the sales revenue generated with respect to advertisement spend across multiple channels 
>such as radio, TV and newspaper.

### Attribute Descriptions
>TV - Spend on TV Advertisements <br>
>Radio - Spend on radio Advertisements <br>
>Newspaper - Spend on newspaper Advertisements <br>
>Sales - Sales revenue generated <br>
Note: The amounts are in different units

#### import the packages and the data required for analysis

In [None]:
# import the packages
import pandas as pd
import numpy as np

In [None]:
# import the packages for charts/plots
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# import the pandas profiling package
# (newer releases of this package are published as ydata-profiling, imported as ydata_profiling)
import pandas_profiling as pp

In [None]:
# load the data set
advt = pd.read_csv('D:/SampleData/Advertising.csv')

#### data inspection:

In [None]:
# inspect the metadata
advt.info()

In [None]:
# create a pandas profiling report
profile_report = pp.ProfileReport(advt)
profile_report.to_file(output_file = 'profile_report.html')

In [None]:
# remove the first column
advt = advt[['TV', 'Radio', 'Newspaper', 'Sales']]

In [None]:
# create UDF - general function that returns multiple stats for continuous variables
def var_summary(x):
    return pd.Series(
        [x.count(), x.isnull().sum(), x.sum(), x.mean(), x.median(),
         x.std(), x.var(), x.min(), x.quantile(0.01), x.quantile(0.05),
         x.quantile(0.10), x.quantile(0.25), x.quantile(0.50), x.quantile(0.75),
         x.quantile(0.90), x.quantile(0.95), x.quantile(0.99), x.max()],
        index = ['N', 'NMISS', 'SUM', 'MEAN', 'MEDIAN', 'STD', 'VAR', 'MIN', 'P1',
                 'P5', 'P10', 'P25', 'P50', 'P75', 'P90', 'P95', 'P99', 'MAX'])

In [None]:
# Data audit Report for continuous variables
advt.loc[:, advt.dtypes == 'float64'].apply(lambda x: var_summary(x)).T

#### data cleaning/data treatment

In [None]:
# Outlier treatment: cap all the numeric variables at 1% and 99%
advt = advt.loc[:, advt.dtypes == 'float64'].apply(lambda x: x.clip(lower = x.quantile(0.01), upper = x.quantile(0.99)))

In [None]:
# get the fraction of missing values in each column (multiply by 100 for the %age)
1 - advt.count()/advt.shape[0]

In [None]:
# Handling Missings : fill with mean/median/mode
# not required as no missings in the data

In [None]:
# dummy variable creation
# not required as no categorical variables in the data

#### exploratory data analysis : univariate analysis

In [None]:
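# create user defined function to create the distribution plots
def fn_distplot(pd_series):
    plt.figure(figsize = (5, 3))
    # sns.distplot has been deprecated since seaborn 0.11; histplot with a KDE overlay replaces it
    sns.histplot(pd_series, kde = True)
    print('This is a chart for ' + pd_series.name)
    plt.show()
    return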

In [None]:
# create dist plots for all float type variables
advt.loc[:, advt.dtypes == 'float64'].apply(lambda x: fn_distplot(x))
plt.show()

### Notes:
> 1. Sales appears to be approximately normally distributed. 
> 2. Spend on newspaper advertisements appears to be right-skewed.
> 3. Most newspaper spends are fairly low, whereas spends on radio and TV appear to be roughly uniformly distributed. 
> 4. Spends on TV are comparatively higher than spends on radio and newspaper.

#### exploratory data analysis : bivariate analysis

In [None]:
# create user defined function to create the joint plots
def fn_jointplot(y_variable, x_variable):
    sns.jointplot(x = x_variable, y = y_variable, height = 5)
    print('This is a chart for ' + y_variable.name + ' vs ' + x_variable.name)
    plt.show()
    return

In [None]:
# Is there a relationship between sales and spend on the various advertising channels?
advt.loc[:, advt.dtypes == 'float64'].apply(lambda x: fn_jointplot(advt.Sales, x))
plt.show()

### Notes:
>Sales and newspaper spend are not highly correlated, whereas sales and TV spend are highly correlated.

In [None]:
# Visualizing pairwise correlation
sns.pairplot(advt)
plt.show()

In [None]:
# get the correlation table : calculating correlations
advt.corr()

In [None]:
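# Visualizing the correlations : with the 'Blues' colormap, the darker the color, the stronger the correlation
# (the default seaborn colormap runs the other way, with lighter colors for higher values)
sns.heatmap(advt.corr(), annot = True, cmap = 'Blues')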
plt.show()

### NOTES:
> 1. The diagonal of the above matrix shows the correlation of each variable with itself, which is always 1. 
> 2. The correlation between TV and Sales is the highest at 0.78, followed by the correlation between Radio and Sales at 0.576.

**Correlations can vary from -1 to +1.**
<br/>
**A value close to +1 means a strong positive correlation, a value close to -1 means a strong negative correlation, and a value close to 0 means little or no linear correlation.**
<br/>
**Variables that correlate strongly with the response are the most probable candidates for model building.**
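
> The interpretation of correlation values above can be checked on a small synthetic example. The sketch below is only illustrative: the data are randomly generated (not the advertising dataset), and the helper name `pearson_r` is hypothetical. It computes the Pearson correlation by hand and cross-checks it against `np.corrcoef`.

```python
import numpy as np

# Synthetic data for illustration only (not the advertising dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_pos = 2.0 * x + rng.normal(scale=0.5, size=200)    # strong positive relationship
y_neg = -2.0 * x + rng.normal(scale=0.5, size=200)   # strong negative relationship
y_none = rng.normal(size=200)                        # generated independently of x

def pearson_r(a, b):
    """Pearson correlation: co-variation of a and b scaled by their individual variation."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)

print('positive :', round(r_pos, 3))
print('negative :', round(r_neg, 3))
print('unrelated:', round(r_none, 3))

# cross-check the hand-written formula against numpy
print('numpy    :', round(np.corrcoef(x, y_pos)[0, 1], 3))
```

> The hand-written value and the `np.corrcoef` value should agree, and `advt.corr()` uses the same Pearson definition by default.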

### Building Regression Model
> 1. Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. 
> 2. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression
> 3. A simple linear regression model is given by Y = mX + b
<br/>
> where m is the slope and b is the y-intercept. Y is the dependent variable and X is the explanatory variable. <br>

**Very briefly and simplistically, Linear Regression is a class of techniques for fitting a straight line to a set of data points**
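
> To make Y = mX + b concrete, here is a minimal sketch of how ordinary least squares (OLS) picks m and b for a single explanatory variable. It assumes synthetic data with a known slope and intercept (the values 0.05 and 7.0 are made up for illustration) and uses the standard closed-form OLS formulas, cross-checked against `np.polyfit`.

```python
import numpy as np

# Synthetic data with a known line: Y = 7.0 + 0.05 * X + noise (illustrative values)
rng = np.random.default_rng(42)
X = rng.uniform(0, 300, size=200)
Y = 7.0 + 0.05 * X + rng.normal(scale=1.0, size=200)

# Closed-form OLS estimates for one explanatory variable:
#   m = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
#   b = mean(Y) - m * mean(X)
x_c = X - X.mean()
m = (x_c * (Y - Y.mean())).sum() / (x_c ** 2).sum()
b = Y.mean() - m * X.mean()

# Cross-check with numpy's least-squares polynomial fit of degree 1
m_check, b_check = np.polyfit(X, Y, deg=1)

print('slope     m =', round(m, 4), '| polyfit:', round(m_check, 4))
print('intercept b =', round(b, 4), '| polyfit:', round(b_check, 4))
```

> The estimates should land close to the slope and intercept used to generate the data, which is the same fitting problem the OLS model in the next cells solves for the advertising data.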

In [None]:
# import the package for ols modelling
import statsmodels.formula.api as smf

In [None]:
# list all the variables in the data
advt.columns

In [None]:
# build the model
lm = smf.ols('Sales ~ TV', advt).fit()

In [None]:
# model statistics
print(lm.summary())

### Notes:
> A parameter estimate is considered statistically significant if its p-value is less than 0.05. <br>
> Here the p-value for TV is below 0.05, so TV is a significant predictor and its parameter estimate can be accepted. <br><br>
> <b>So, the linear model is</b> <br>
> Sales = 7.1392 + 0.047 ∗ TV
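
> The p-value in the model summary is derived from a t-statistic: the estimated coefficient divided by its standard error. The sketch below reproduces that calculation by hand for a one-variable model, on synthetic data. The helper `slope_t_stat` and the variables `x_signal` and `x_noise` are hypothetical names used only for illustration. For a sample of this size, an absolute t-statistic above roughly 2 corresponds to a p-value below 0.05.

```python
import numpy as np

# Synthetic data: y depends on x_signal, while x_noise is generated independently of y
rng = np.random.default_rng(1)
n = 200
x_signal = rng.uniform(0, 300, size=n)
x_noise = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x_signal + rng.normal(scale=2.0, size=n)

def slope_t_stat(x, y):
    """Return (slope, standard error, t-statistic) for the simple regression of y on x."""
    x_c = x - x.mean()
    slope = (x_c * (y - y.mean())).sum() / (x_c ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    # residual variance uses n - 2 degrees of freedom (two estimated parameters)
    sigma2 = (resid ** 2).sum() / (len(x) - 2)
    se = np.sqrt(sigma2 / (x_c ** 2).sum())
    return slope, se, slope / se

slope_signal, se_signal, t_signal = slope_t_stat(x_signal, y)
slope_noise, se_noise, t_noise = slope_t_stat(x_noise, y)

print('x_signal: slope = %.4f, se = %.4f, t = %.2f' % (slope_signal, se_signal, t_signal))
print('x_noise : slope = %.4f, se = %.4f, t = %.2f' % (slope_noise, se_noise, t_noise))
```

> A predictor that truly drives the response produces a large t-statistic, while an unrelated predictor usually produces one near zero. In statsmodels the same quantities are available on the fitted model as `lm.params`, `lm.bse`, `lm.tvalues` and `lm.pvalues`.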


### Evaluating Model Accuracy
> R-squared is a statistical measure of how close the data are to the fitted regression line. <br>
> It is the percentage of variation in the response variable that is explained by the model. <br>
> R-squared = Explained variation / Total variation <br>
> Total variation is the variation of the response variable around its mean. <br>
> R-squared varies between 0 and 100%. 0% signifies that the model explains none of the variability, <br>
> while 100% signifies that the model explains all the variability of the response. <br>
> The closer R-squared is to 100%, the better the model fits the data. <br>
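
> As a small worked example of the definition above, R-squared can be computed by hand. For an OLS model with an intercept, "explained variation / total variation" equals 1 minus "unexplained variation / total variation", which is the form used in this sketch. The function name `r_squared` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared as 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # variation left unexplained
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # variation around the mean
    return 1 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# predictions that match exactly: the model explains all the variability
perfect = r_squared(y_true, y_true)

# always predicting the mean: the model explains none of the variability
baseline = r_squared(y_true, np.full_like(y_true, y_true.mean()))

# predictions that are each off by 0.5
partial = r_squared(y_true, np.array([2.5, 5.5, 6.5, 9.5]))

print(perfect, baseline, round(partial, 2))  # 1.0 0.0 0.95
```

> For a fitted statsmodels model the same value is reported as `lm.rsquared`.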

In [None]:
# make predictions of the sales
lmpredict = lm.predict(advt.TV)

# give the name to the series
lmpredict.name = 'Predicted Sales'

In [None]:
# original and the predicted values of the sales
pd.concat([advt.Sales, lmpredict], axis = 1).round(2).head()

### Calculating Root Mean Square Error (RMSE)
> RMSE measures the typical difference between the actual and predicted values of the response variable. <br>
> It is the square root of the mean of the squared errors. <br> 
> Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors. <br>
> The lower the RMSE value, the better the model.
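
> The difference between RMSE and Mean Absolute Error (MAE) is easiest to see on a toy example. In the sketch below (made-up numbers, hypothetical helper names `rmse_score` and `mae_score`), two sets of predictions have the same MAE, but the set with one large error has a much higher RMSE.

```python
import numpy as np

def rmse_score(y_true, y_pred):
    """Root mean square error: square root of the mean of the squared errors."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae_score(y_true, y_pred):
    """Mean absolute error: mean of the absolute errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([10.0, 10.0, 10.0, 10.0])
even_errors = np.array([12.0, 8.0, 12.0, 8.0])       # every prediction is off by 2
one_big_error = np.array([10.0, 10.0, 10.0, 18.0])   # three exact, one off by 8

print('even errors  : MAE =', mae_score(y_true, even_errors),
      'RMSE =', rmse_score(y_true, even_errors))      # MAE = 2.0 RMSE = 2.0
print('one big error: MAE =', mae_score(y_true, one_big_error),
      'RMSE =', rmse_score(y_true, one_big_error))    # MAE = 2.0 RMSE = 4.0
```

> Both prediction sets have a total absolute error of 8, but squaring the errors before averaging makes the single error of 8 dominate the RMSE.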

In [None]:
# import the package
from sklearn import metrics

In [None]:
# validate model accuracy : MSE (Mean Square Error)
mse = metrics.mean_squared_error(advt.Sales, lmpredict)

In [None]:
# validate model accuracy : RMSE (Root Mean Square Error)
rmse = np.sqrt(mse)

In [None]:
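# assumption check: residuals/errors should be normally distributed
# (sns.distplot has been deprecated since seaborn 0.11; histplot with a KDE overlay replaces it)
sns.histplot(lm.resid, kde = True)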
plt.show()

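> One of the assumptions is that the residuals are normally distributed, i.e. they behave like random noise around zero.
> The residuals should be plotted against the fitted (predicted) values, and the plot should not show any pattern. Plotting them against the observed response is less informative, because OLS residuals are positively correlated with the observed response whenever the fit is not perfect.

In [None]:
# assumption: residuals/errors of the model should show no pattern against the fitted (predicted) values
sns.jointplot(x = lm.fittedvalues, y = lm.resid)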
plt.show()
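
> A small numerical check of how OLS residuals relate to the fitted values and to the observed response. This is a sketch on synthetic data (not the advertising dataset), fitted with `np.polyfit` rather than statsmodels. For an OLS fit with an intercept, the residuals are uncorrelated with the fitted values, while their correlation with the observed response equals the square root of (1 - R-squared).

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(7)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(scale=3.0, size=200)

# least-squares line, fitted values and residuals
m, b = np.polyfit(x, y, deg=1)
fitted = m * x + b
resid = y - fitted

# R-squared of the fit
r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

corr_fitted = np.corrcoef(fitted, resid)[0, 1]    # essentially zero by construction
corr_observed = np.corrcoef(y, resid)[0, 1]       # equals sqrt(1 - R-squared)

print('corr(residuals, fitted)  :', round(corr_fitted, 6))
print('corr(residuals, observed):', round(corr_observed, 4))
print('sqrt(1 - R-squared)      :', round(np.sqrt(1 - r2), 4))
```

> This is why a residual plot against the fitted values is the more useful diagnostic: any visible pattern there points to a problem with the model rather than to a built-in property of least squares.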

### Multiple Linear Regression Model

In [None]:
# build the model
lm = smf.ols( 'Sales ~ TV + Radio', advt ).fit()

In [None]:
# model statistics
print(lm.summary())

In [None]:
# predict the values
lmpredict = lm.predict(advt[['TV', 'Radio']])

In [None]:
# original and the predicted values of the sales
pd.concat([advt.Sales, lmpredict], axis = 1).round(2).head()

In [None]:
# validate model accuracy : MSE (Mean Square Error)
mse = metrics.mean_squared_error(advt.Sales, lmpredict)

In [None]:
# validate model accuracy : RMSE (Root Mean Square Error)
rmse = np.sqrt(mse)

In [None]:
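# assumption check: residuals/errors should be normally distributed
# (sns.distplot has been deprecated since seaborn 0.11; histplot with a KDE overlay replaces it)
sns.histplot(lm.resid, kde = True)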
plt.show()

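> One of the assumptions is that the residuals are normally distributed, i.e. they behave like random noise around zero.
> The residuals should be plotted against the fitted (predicted) values, and the plot should not show any pattern. Plotting them against the observed response is less informative, because OLS residuals are positively correlated with the observed response whenever the fit is not perfect.

In [None]:
# assumption: residuals/errors of the model should show no pattern against the fitted (predicted) values
sns.jointplot(x = lm.fittedvalues, y = lm.resid)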
plt.show()