# Taste of Supervised Learning


# Case: Sales Prediction via Linear Regression

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

## Reading and Understanding the Data

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Import the numpy and pandas packages
import numpy as np
import pandas as pd

# Data Visualisation
import matplotlib.pyplot as plt 
import seaborn as sns

In [None]:
advertising = pd.read_csv('./dataset/advertising.csv')
advertising.head()

## Data Inspection

In [None]:
advertising.shape

In [None]:
advertising.info()

In [None]:
advertising.describe()

## Data Cleaning

In [None]:
# Checking Null values
advertising.isnull().sum()*100/advertising.shape[0]
# There are no NULL values in the dataset, hence it is clean.

In [None]:
# Outlier Analysis
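fig, axs = plt.subplots(3, figsize = (5,5))
# Pass the column as x= explicitly so each box plot is drawn horizontally across seaborn versions
plt1 = sns.boxplot(x = advertising['TV'], ax = axs[0])
plt2 = sns.boxplot(x = advertising['Newspaper'], ax = axs[1])
plt3 = sns.boxplot(x = advertising['Radio'], ax = axs[2])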
plt.tight_layout()
plt.show()

### Inference: There are no considerable outliers present in the data.

## Exploratory Data Analysis

### Univariate Analysis

#### Sales (Target Variable)

In [None]:
sns.boxplot(x = advertising['Sales'])
plt.show()

In [None]:
# Let's see how Sales is related to the other variables using scatter plots.
sns.pairplot(advertising, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, aspect=1, kind='scatter')
plt.show()

In [None]:
# Let's see the correlation between different variables.
sns.heatmap(advertising.corr(), cmap="YlGnBu", annot = True)
plt.show()

As is visible from the pairplot and the heatmap, the variable `TV` seems to be most correlated with `Sales`. So let's go ahead and perform simple linear regression using `TV` as our feature variable.
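The heatmap values are Pearson correlation coefficients, which is what `DataFrame.corr()` computes by default. As a rough sketch of how that number is obtained and how the "most correlated" feature can be picked, the plain-Python example below uses hypothetical toy columns (not values from `advertising.csv`); the helper name `pearson_r` is an assumption made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of co-deviations and sums of squared deviations
    cov = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    var_x = sum((xi - x_mean) ** 2 for xi in x)
    var_y = sum((yi - y_mean) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy columns, NOT taken from advertising.csv
sales = [10.0, 12.0, 15.0, 19.0, 24.0]
features = {
    "TV": [100.0, 120.0, 160.0, 200.0, 250.0],
    "Radio": [30.0, 10.0, 25.0, 15.0, 35.0],
}

# Correlation of each feature with the target, then the strongest one by magnitude
correlations = {name: pearson_r(values, sales) for name, values in features.items()}
best_feature = max(correlations, key=lambda name: abs(correlations[name]))
print(best_feature, correlations)
```

On the real data, the same ranking can be read directly off the `Sales` row of the heatmap.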

## Model Building

### Performing Simple Linear Regression

Equation of linear regression:<br>
$y = c + m_1x_1 + m_2x_2 + \dots + m_nx_n$

- $y$ is the response
- $c$ is the intercept
- $m_1$ is the coefficient for the first feature
- $m_n$ is the coefficient for the $n$th feature

In our case:

$y = c + m_1 \times TV$

The $m$ values are called the model **coefficients** or **model parameters**.
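For a single feature, the least-squares values of $c$ and $m_1$ have a simple closed form: the slope is the covariance of the feature and the response divided by the variance of the feature, and the intercept makes the line pass through the point of means. The sketch below illustrates this in plain Python; the function name `fit_simple_linear_regression` and the toy data are assumptions for illustration, not part of the notebook.

```python
def fit_simple_linear_regression(x, y):
    """Return (c, m_1) minimising the sum of squared errors of y = c + m_1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-deviation of x and y divided by the squared deviation of x
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    m_1 = s_xy / s_xx
    # Intercept: the fitted line passes through (x_mean, y_mean)
    c = y_mean - m_1 * x_mean
    return c, m_1


# Hypothetical toy data lying exactly on y = 2 + 3x
tv = [1.0, 2.0, 3.0, 4.0]
sales = [5.0, 8.0, 11.0, 14.0]
c, m_1 = fit_simple_linear_regression(tv, sales)
print(c, m_1)
```

A library estimator such as scikit-learn's `LinearRegression` should arrive at the same line for one feature, while also handling the multi-feature case.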

---

### Generic Steps in model building 

Assign the feature variable, `TV`, in this case, to the variable `X` and the response variable, `Sales`, to the variable `y`.

In [None]:
X = advertising['TV']
y = advertising['Sales']

#### Train-Test Split

- Split the variables into training and testing sets.
- Do this by importing `train_test_split` from the `sklearn.model_selection` module.
- A common practice is to keep 70% of the data in the training set and the remaining 30% in the test set.

In [None]:
from sklearn.model_selection import train_test_split
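# 70/30 split as described above; random_state=100 is an assumed seed, used only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)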

In [None]:
# Let's now take a look at the train dataset
X_train.head()

In [None]:
y_train.head()

In [None]:
X_train = np.array(X_train).reshape(-1,1)
X_train.shape

In [None]:
y_train = np.array(y_train).reshape(-1,1)
y_train.shape

#### Building a Linear Model

- Import the `linear_model` module from scikit-learn to perform the linear regression.

In [None]:
from sklearn import linear_model

### Linear Regression
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
    

In [None]:
# Create an ordinary least-squares regressor (fits an intercept by default)
lr = linear_model.LinearRegression()

In [None]:
# Fit the model on the training data
lr.fit(X_train, y_train)

In [None]:
# Print the parameters, i.e. the intercept and the slope of the regression line fitted

In [None]:
# Intercept (c) of the fitted line
print(lr.intercept_)

In [None]:
# Slope (m_1), i.e. the coefficient for TV
print(lr.coef_)

In [None]:
# The coefficient for TV is about 0.054
# Model form: y = c + m_1 * x
# Here: Sales = c + m_1 * TV

---
The fitted slope is positive, so predicted Sales increase with TV spend. Let's visualize how well the model fits the data.

From the parameters that we get, our linear regression equation becomes:

$Sales = 6.948 + 0.054 \times TV$

In [None]:
plt.scatter(X_train, y_train)
plt.plot(X_train, 6.948 + 0.054*X_train, 'r')
plt.show()

## Model Evaluation

### Residual analysis 
Residual analysis validates the assumptions of the model, and hence the reliability of any inference drawn from it.

#### Distribution of the error terms
We need to check whether the error terms are normally distributed, which is in fact one of the major assumptions of linear regression. Let us plot a histogram of the error terms and see what it looks like.

In [None]:
y_train_pred = lr.predict(X_train)
res = (y_train - y_train_pred)

In [None]:
fig = plt.figure()
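# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(res.ravel(), bins = 15, kde = True)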
fig.suptitle('Error Terms', fontsize = 15)                  # Plot heading 
plt.xlabel('y_train - y_train_pred', fontsize = 15)         # X-label
plt.show()

The residuals are approximately normally distributed with a mean of 0. All good!
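One caveat is worth keeping in mind: for a least-squares line fitted with an intercept, the training residuals average to zero by construction, so the informative part of the histogram is its shape rather than its centre. The plain-Python sketch below illustrates this on hypothetical toy data; the helper `fit_line` is an assumption for illustration and stands in for the fitted `lr` model.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single feature."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
        (xi - x_mean) ** 2 for xi in x
    )
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical noisy toy data, NOT taken from advertising.csv
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)

# Residual = observed value minus fitted value, on the training data
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
residual_mean = sum(residuals) / len(residuals)

# The mean is zero up to floating-point rounding
print(abs(residual_mean) < 1e-9)
```

This guarantee applies only to the data the model was fitted on; residuals on the test set need not average to zero.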

### Predictions on the Test Set

In [None]:
# Reshape X_test into a 2D column array, matching the shape used for training
X_test = np.array(X_test).reshape(-1,1)

# Predict the y values corresponding to X_test
y_pred = lr.predict(X_test)

In [None]:
y_pred

### $R^2$ - score in Linear Regression

- **R-squared**, also known as the **coefficient of determination**, measures the proportion of the variance in the dependent variable (target or response) that is explained by the independent variables (features or predictors).

- The **`sklearn.metrics`** module provides performance evaluation metrics.
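To make the definition concrete, $R^2$ can be written as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below is a minimal plain-Python version of that formula on hypothetical toy values; the function name `r_squared_by_hand` is an assumption for illustration, and the code is meant to mirror the standard definition rather than replace `r2_score`.

```python
def r_squared_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true
    ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, NOT taken from advertising.csv
y_true = [3.0, 5.0, 7.0, 9.0]
close_pred = [2.5, 5.5, 6.5, 9.5]  # small errors around the true values
poor_pred = [9.0, 7.0, 5.0, 3.0]   # trend reversed, worse than predicting the mean

r2_close = r_squared_by_hand(y_true, close_pred)
r2_poor = r_squared_by_hand(y_true, poor_pred)
print(r2_close, r2_poor)
```

The second case shows that the score drops below zero whenever the predictions are worse than simply predicting the mean of the true values.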

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

<img src = './images/r2-score.png'></img>

##### Looking at the RMSE

In [None]:
#Returns the mean squared error; we'll take a square root
np.sqrt(mean_squared_error(y_test, y_pred))
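RMSE is expressed in the same units as `Sales`, which makes it easier to interpret than the raw mean squared error. As a minimal sketch of what the cell above computes, the plain-Python version below uses hypothetical toy values; the function name `rmse_by_hand` is an assumption for illustration.

```python
import math


def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical toy values, NOT taken from advertising.csv
toy_true = [3.0, 5.0, 7.0]
toy_pred = [2.0, 5.0, 9.0]  # squared errors: 1, 0, 4

toy_rmse = rmse_by_hand(toy_true, toy_pred)
print(toy_rmse)
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones.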

<img src = ./images/r2-formula.png width=500></img>

###### Checking the R-squared on the test set
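 - $R^2$ is at most 1; on a held-out test set it can be negative when the model predicts worse than simply using the mean of `y_test`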

In [None]:
r_squared = r2_score(y_test, y_pred)
r_squared

## Different metrics for evaluating a linear regression model. 
<img src = './images/r2-score-eq.png'></img>

##### Visualizing the fit on the test set

In [None]:
plt.scatter(X_test, y_test)
plt.plot(X_test, 6.948 + 0.054 * X_test, 'r')
plt.show()

## Feel free to contact

Shankar Gangisetty - https://sites.google.com/site/shankarsetty