# CrashDS

#### Module 2 : Linear Regression

Dataset from ISLR by *James et al.* : `Advertising.csv`         
Source: http://faculty.marshall.usc.edu/gareth-james/ISL/data.html     

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.    
You may install any library using `conda install <library>`.    
Most of the libraries come by default with the Anaconda platform.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

We will also need the essential Python libraries for (basic) Machine Learning.      
Scikit-Learn (`sklearn`) will be our de-facto Machine Learning library in Python.   

> `LinearRegression` model from `sklearn.linear_model` : Our main model for Regression   
> `train_test_split` method from `sklearn.model_selection` : Random Train-Test splits     
> `mean_squared_error` metric from `sklearn.metrics` : Primary performance metric for us 

In [None]:
# Import essential models and functions from sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

---

## Case Study : Advertising Budget vs Sales


### Import the Dataset

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` method.

In [None]:
# Load the CSV file and check the format
advData = pd.read_csv('Advertising.csv')
advData.head()

Check the vital statistics of the dataset using the `type` function and the `shape` attribute.     
Check the variables (and their types) in the dataset using the `info()` method.

In [None]:
print("Data type : ", type(advData))
print("Data dims : ", advData.shape)
advData.info()

### Format the Dataset

Drop the `Unnamed: 0` column as it contributes nothing to the problem.   
Rename the other columns for homogeneity in nomenclature and style.      

Check the format and vital statistics of the modified dataframe.     


In [None]:
# Drop the first column (axis = 1) by its name
advData = advData.drop('Unnamed: 0', axis = 1)

# Rename the other columns as per your choice
advData = advData.rename(columns={"TV": "TV", "radio": "RD", "newspaper" : "NP", "sales" : "Sales"})

# Check the modified dataset
advData.info()

---

## Uni-Variate Regression : Predicting Sales using TV

We will start by setting up a Uni-Variate Linear Regression problem, with `Sales` as our target variable.   

Response Variable : **Sales**     
Predictor Feature : **TV**    

> Regression Model : Sales = $a$ $\times$ TV + $b$  

Check the mutual relationship between the variables to start with.

In [None]:
# 2D scatterplot of two variables to observe their relationship
f = plt.figure(figsize=(16, 8))
sb.scatterplot(x = "TV", y = "Sales", data = advData)
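
Alongside the scatterplot, a single number can summarize how strongly the two variables move together. The sketch below computes the Pearson correlation on a synthetic stand-in for the data (the `demo` frame and its values are made up for illustration, not taken from `Advertising.csv`); on the real data you would presumably call the same method on `advData`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Advertising data (values are made up)
rng = np.random.default_rng(5)
demo = pd.DataFrame({"TV": rng.uniform(0, 300, size=200)})
demo["Sales"] = 7.0 + 0.05 * demo["TV"] + rng.normal(0, 3, size=200)

# Pearson correlation between the Predictor and the Response
corr = demo["TV"].corr(demo["Sales"])
print("Correlation (TV, Sales) :", round(corr, 3))
```

A value close to +1 or -1 suggests a strong linear relationship; a value near 0 suggests a weak one.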

### Preparing the Dataset

Extract the Response and Predictor variables as two individual Pandas `DataFrame` objects.

In [None]:
# Extract Response and Predictors
y = pd.DataFrame(advData["Sales"])
X = pd.DataFrame(advData[["TV"]])

Split the dataset randomly into Train and Test datasets using `train_test_split`.

In [None]:
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
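
Because the split is random, every run of the cell above produces a different Train and Test set, and hence slightly different results. If repeatable results are needed, one option is to fix the `random_state` parameter of `train_test_split`. The sketch below uses a synthetic `demo` frame (made-up values) and an arbitrary seed of `42`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Advertising data (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"TV": rng.uniform(0, 300, size=200)})
demo["Sales"] = 7.0 + 0.05 * demo["TV"] + rng.normal(0, 3, size=200)

X_demo = demo[["TV"]]
y_demo = demo[["Sales"]]

# Fixing random_state makes the random split repeatable across runs
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

print("Train shape :", Xa_train.shape)
print("Test shape  :", Xa_test.shape)
print("Same split  :", Xa_train.index.equals(Xb_train.index))
```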

### Fitting the Regression Model

`LinearRegression` is a class for the regression model in `sklearn`.     
We need to create an object of the `LinearRegression` class, as follows.

In [None]:
# Create a Linear Regression object
linreg = LinearRegression()

Train the Linear Regression model using the Train Set `X_train` and `y_train`.   

In [None]:
# Train the Linear Regression model
linreg.fit(X_train, y_train)

You have *trained* the model to fit the following formula.

>  Regression Model : Sales = $a$ $\times$ TV + $b$

Check the Coefficient ($a$) and the Intercept ($b$) of the regression line.

In [None]:
# Coefficients of the Linear Regression line
print('Intercept \t b = ', linreg.intercept_)
print('Coefficients \t a = ', linreg.coef_)
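
For a single predictor, the least-squares line has a well-known closed form: $a = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $b = \bar{y} - a \bar{x}$. The sketch below checks that `LinearRegression` agrees with these formulas on synthetic data (the arrays `tv` and `sales` are made up). Note that it passes a 1-D response, so `coef_` is a 1-D array here, unlike the `DataFrame` response used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
tv = rng.uniform(0, 300, size=100)
sales = 7.0 + 0.05 * tv + rng.normal(0, 2, size=100)

# Closed-form least-squares estimates for Sales = a * TV + b
a_hat = np.sum((tv - tv.mean()) * (sales - sales.mean())) / np.sum((tv - tv.mean()) ** 2)
b_hat = sales.mean() - a_hat * tv.mean()

# The same fit using sklearn (predictors must be 2-D)
model = LinearRegression().fit(tv.reshape(-1, 1), sales)

print("Coefficient a :", a_hat, "vs", model.coef_[0])
print("Intercept   b :", b_hat, "vs", model.intercept_)
```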

Plot the regression line based on *coefficients-intercept* form.

In [None]:
# Formula for the Regression line
regline_x = X_train
regline_y = linreg.intercept_ + linreg.coef_ * X_train

# Plot the Linear Regression line
f = plt.figure(figsize=(16, 8))
plt.scatter(X_train, y_train)
plt.plot(regline_x, regline_y, 'r-', linewidth = 3)
plt.show()

Plot the regression line by *prediction* using the trained model.

In [None]:
# Predict the Response on the Train Set
y_train_pred = linreg.predict(X_train)

# Plot the Linear Regression line
f = plt.figure(figsize=(16, 8))
plt.scatter(X_train, y_train)
plt.scatter(X_train, y_train_pred, color = "red")
plt.show()

### Goodness of Fit of the Model

Check how good the predictions are on the Train Set.    
Metrics : Explained Variance and Mean Squared Error.

In [None]:
# Explained Variance (R^2)
print("Explained Variance (R^2) \t", linreg.score(X_train, y_train))

# Mean Squared Error (MSE)
y_train_pred = linreg.predict(X_train)
print("Mean Squared Error (MSE) \t", mean_squared_error(y_train, y_train_pred))
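
Both metrics can be derived from the residuals, which may help in interpreting them: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ and $MSE = \frac{SS_{res}}{n}$. The sketch below computes them by hand on synthetic data (made-up values) and compares the results against `score` and `mean_squared_error`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data, for illustration only
rng = np.random.default_rng(2)
x = rng.uniform(0, 300, size=(100, 1))
y = 7.0 + 0.05 * x[:, 0] + rng.normal(0, 2, size=100)

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares around the mean

r2_manual = 1 - ss_res / ss_tot
mse_manual = ss_res / len(y)

print("R^2 :", r2_manual, "vs", model.score(x, y))
print("MSE :", mse_manual, "vs", mean_squared_error(y, y_hat))
```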

Test the Linear Regression model using the Test Set.   

In [None]:
# Predict the Response on the Test Set
y_test_pred = linreg.predict(X_test)

# Plot the Predictions
f = plt.figure(figsize=(16, 8))
plt.scatter(X_test, y_test, color = "green")
plt.scatter(X_test, y_test_pred, color = "red")
plt.show()

Check how good the predictions are on the Test Set.   

In [None]:
# Mean Squared Error (MSE)
y_test_pred = linreg.predict(X_test)
print("Mean Squared Error (MSE) \t", mean_squared_error(y_test, y_test_pred))

It is often informative to compare the Predictions against the True values of the Response variable; points close to the diagonal line indicate accurate predictions.

In [None]:
# Predict the Response for both Train and Test
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test)

# Plot the Predictions vs the True values
f, axes = plt.subplots(1, 2, figsize=(16, 8))
axes[0].scatter(y_train, y_train_pred, color = "blue")
axes[0].plot(y_train, y_train, 'w-', linewidth = 1)
axes[0].set_xlabel("True values of the Response Variable (Train)")
axes[0].set_ylabel("Predicted values of the Response Variable (Train)")
axes[1].scatter(y_test, y_test_pred, color = "green")
axes[1].plot(y_test, y_test, 'w-', linewidth = 1)
axes[1].set_xlabel("True values of the Response Variable (Test)")
axes[1].set_ylabel("Predicted values of the Response Variable (Test)")
plt.show()

---

## Linear Regression : Generic Function

Let us write a generic function to perform Linear Regression, as before.      
Our Predictor variable(s) will be $X$ and the Response variable will be $y$.   

> Regression Model : $y$ = $a$ $X$ + $b$  
> Train data : (`X_train`, `y_train`)    
> Test data : (`X_test`, `y_test`)

In [None]:
def performLinearRegression(X_train, y_train, X_test, y_test):
    '''
        Function to perform Linear Regression with X_train, y_train,
        and test out the performance of the model on X_test, y_test.
    '''    
    linreg = LinearRegression()         # create the linear regression object
    linreg.fit(X_train, y_train)        # train the linear regression model

    # Predict Response corresponding to Predictors
    y_train_pred = linreg.predict(X_train)
    y_test_pred = linreg.predict(X_test)

    # Plot the Predictions vs the True values
    f, axes = plt.subplots(1, 2, figsize=(16, 8))
    axes[0].scatter(y_train, y_train_pred, color = "blue")
    axes[0].plot(y_train, y_train, 'w-', linewidth = 1)
    axes[0].set_xlabel("True values of the Response Variable (Train)")
    axes[0].set_ylabel("Predicted values of the Response Variable (Train)")
    axes[1].scatter(y_test, y_test_pred, color = "green")
    axes[1].plot(y_test, y_test, 'w-', linewidth = 1)
    axes[1].set_xlabel("True values of the Response Variable (Test)")
    axes[1].set_ylabel("Predicted values of the Response Variable (Test)")
    plt.show()

    # Check the Goodness of Fit (on Train Data)
    print("Goodness of Fit of Model \tTrain Dataset")
    print("Explained Variance (R^2) \t", linreg.score(X_train, y_train))
    print("Mean Squared Error (MSE) \t", mean_squared_error(y_train, y_train_pred))
    print()

    # Check the Goodness of Fit (on Test Data)
    print("Goodness of Fit of Model \tTest Dataset")
    print("Mean Squared Error (MSE) \t", mean_squared_error(y_test, y_test_pred))
    print()

Try out the Generic Function to perform Linear Regression on `Sales` against `RD`.

In [None]:
# Specify the Predictors and Response
response = "Sales"
predictors = ["RD"]

# Extract Response and Predictors
y = pd.DataFrame(advData[response])
X = pd.DataFrame(advData[predictors])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Perform Linear Regression with Train-Test
performLinearRegression(X_train, y_train, X_test, y_test)

Try out the Generic Function to perform Linear Regression on `Sales` against `NP`.

In [None]:
# Specify the Predictors and Response
response = "Sales"
predictors = ["NP"]

# Extract Response and Predictors
y = pd.DataFrame(advData[response])
X = pd.DataFrame(advData[predictors])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Perform Linear Regression with Train-Test
performLinearRegression(X_train, y_train, X_test, y_test)

---

## Multi-Variate Linear Regression

Let us set up a Multi-Variate Linear Regression problem.   

Response Variable : **Sales**     
Predictor Feature : **TV, RD, NP**       

> Regression Model : Sales = $a_1$ $\times$ TV + $a_2$ $\times$ RD + $a_3$ $\times$ NP + $b$      

Fortunately, our generic Linear Regression function works in this case as well.   

In [None]:
# Specify the Predictors and Response
response = "Sales"
predictors = ["TV", "RD", "NP"]

# Extract Response and Predictors
y = pd.DataFrame(advData[response])
X = pd.DataFrame(advData[predictors])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Perform Linear Regression with Train-Test
performLinearRegression(X_train, y_train, X_test, y_test)
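
With several predictors, it can help to label each fitted coefficient with the predictor it belongs to. The sketch below does this on a synthetic `demo` frame (made-up values) in which, by construction, `NP` has no effect on `Sales`; the real data may of course behave differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Advertising data (values are made up)
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, size=200),
    "RD": rng.uniform(0, 50, size=200),
    "NP": rng.uniform(0, 100, size=200),
})
# Assumed ground truth for illustration only: NP does not influence Sales
demo["Sales"] = 3.0 + 0.05 * demo["TV"] + 0.2 * demo["RD"] + rng.normal(0, 1, size=200)

predictors = ["TV", "RD", "NP"]
model = LinearRegression().fit(demo[predictors], demo["Sales"])

# Pair each coefficient a_i with the name of its predictor
coefs = pd.Series(model.coef_, index=predictors)
print(coefs)
print("Intercept b =", model.intercept_)
```

In this constructed example the coefficient of `NP` should come out close to zero, while those of `TV` and `RD` should be close to the values used to generate the data.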

---

## Prediction using a Regression Model

Once we have trained a Regression Model, we may use it to predict the Response.   

In [None]:
# Specify the Predictors and Response
response = "Sales"
predictors = ["TV"]

# Extract Response and Predictors
y = pd.DataFrame(advData[response])
X = pd.DataFrame(advData[predictors])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Perform Linear Regression with Train-Test
linreg = LinearRegression()         # create the linear regression object
linreg.fit(X_train, y_train)        # train the linear regression model

In [None]:
# Predict Response corresponding to Predictors
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test)

Let us predict the value of the Response for a few specific Data Points using the Regression Model trained above.   

In [None]:
# Extract random Data Points for Prediction
advData_pred = advData.sample(5)
advData_pred

In [None]:
# Extract Predictors for Prediction
X_pred = pd.DataFrame(advData_pred[predictors])

# Predict Response corresponding to Predictors
y_pred = linreg.predict(X_pred)
y_pred

### Prediction Errors

Let us check the errors in the Predicted values, compared to the Actuals.

In [None]:
# Summarize the Actuals, Predictions and Errors
y_pred = pd.DataFrame(y_pred, columns = ["Predicted"], index = advData_pred.index)
advData_acc = pd.concat([advData_pred[response], y_pred], axis = 1)

y_errs = 100 * abs(advData_acc[response] - advData_acc["Predicted"]) / advData_acc[response]
y_errs = pd.DataFrame(y_errs, columns = ["Error %"], index = advData_pred.index)
advData_acc = pd.concat([advData_acc, y_errs], axis = 1)

advData_acc
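
The `Error %` column is the absolute percentage error of each prediction, $100 \times \frac{|Actual - Predicted|}{Actual}$. A small worked example with made-up numbers (the arrays `actual` and `predicted` are hypothetical) shows the arithmetic:

```python
import numpy as np

actual = np.array([20.0, 10.0, 16.0])      # made-up actual Sales values
predicted = np.array([18.0, 11.0, 16.0])   # made-up predictions

# Absolute percentage error for each data point
abs_pct_err = 100 * np.abs(actual - predicted) / np.abs(actual)

print("Error % per point :", abs_pct_err)
print("Mean Error %      :", abs_pct_err.mean())
```

Note that this measure is undefined when an actual value is zero, and it becomes very large when actual values are close to zero.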

### Prediction Interval

The confidence in a Prediction depends on the Distribution and Deviation of the Errors in Prediction.    
We obtain the Mean Squared Error on the Train Set while fitting/training the Linear Regression Model.    

The Standard Error of Prediction may be estimated as $StdE = \sqrt{\frac{n}{n-2} MSE}$ from the Train Set.

In [None]:
MSE_train = mean_squared_error(y_train, y_train_pred)
StdE_pred = np.sqrt(len(y_train) * MSE_train/(len(y_train) - 2))

print("Mean Squared Error (MSE) \t:", np.round(MSE_train, 2))
print("Pred Standard Error (SE) \t:", np.round(StdE_pred, 2))
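
The factor $\frac{n}{n-2}$ converts the MSE, which divides the residual sum of squares by $n$, into an estimate that divides by $n-2$ instead, since two parameters ($a$ and $b$) were estimated from the data. The sketch below checks this equivalence on synthetic data (made-up values), using `np.polyfit` for the straight-line fit rather than `LinearRegression`.

```python
import numpy as np

# Synthetic data with a known noise level (standard deviation 3)
rng = np.random.default_rng(4)
x = rng.uniform(0, 300, size=140)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=140)

# Least-squares straight line; polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

n = len(y)
mse = np.mean(residuals ** 2)

stde_from_mse = np.sqrt(n / (n - 2) * mse)                   # formula used above
stde_from_ssr = np.sqrt(np.sum(residuals ** 2) / (n - 2))    # residual SS / (n - 2)

print("StdE via MSE      :", stde_from_mse)
print("StdE via residuals:", stde_from_ssr)
```

Both values coincide, and in this constructed example they should land near the noise level used to generate the data.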

In Prediction, we assume a Gaussian (Normal) Distribution for the Prediction Errors.    
The `95%` Prediction Interval for any data point is given by $Prediction \pm 1.96 \times StdE$.    
The `99%` Prediction Interval for any data point is given by $Prediction \pm 2.58 \times StdE$.

In [None]:
y_95l = pd.DataFrame(advData_acc["Predicted"] - 1.96*StdE_pred).rename(columns = {"Predicted" : "95 Lower"})
y_95u = pd.DataFrame(advData_acc["Predicted"] + 1.96*StdE_pred).rename(columns = {"Predicted" : "95 Upper"})
y_99l = pd.DataFrame(advData_acc["Predicted"] - 2.58*StdE_pred).rename(columns = {"Predicted" : "99 Lower"})
y_99u = pd.DataFrame(advData_acc["Predicted"] + 2.58*StdE_pred).rename(columns = {"Predicted" : "99 Upper"})

advData_int = pd.concat([advData_acc, y_95l, y_95u, y_99l, y_99u], axis = 1)
advData_int
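
The multipliers `1.96` and `2.58` are quantiles of the standard Normal distribution. The sketch below recovers them with Python's standard-library `statistics.NormalDist`; the helper `z_multiplier` is a hypothetical name introduced here for illustration, and the same idea would give the multiplier for any other coverage level.

```python
from statistics import NormalDist

def z_multiplier(level):
    """Two-sided standard Normal multiplier for a given coverage level (e.g. 0.95)."""
    # A central interval with coverage `level` leaves (1 - level) / 2 in each tail
    return NormalDist().inv_cdf(0.5 + level / 2)

z95 = z_multiplier(0.95)
z99 = z_multiplier(0.99)

print("95% multiplier :", round(z95, 2))   # 1.96
print("99% multiplier :", round(z99, 2))   # 2.58
```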

*NOTE : You can always go back and try fitting a model with more predictors to check the difference.*