# Workbench

**Importing the required libraries**

In [None]:
# Import the numpy and pandas packages
import numpy as np
import pandas as pd

# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Import the warnings
import warnings

# Import statsmodels
import statsmodels.formula.api as smf

# Import RMSE
from statsmodels.tools.eval_measures import rmse

# Import Linear Regression from scikit-learn
from sklearn.linear_model import LinearRegression

# Import Polynomial Features
from sklearn.preprocessing import PolynomialFeatures

# Import Metrics
from sklearn.metrics import mean_squared_error, r2_score

# configuration settings
%matplotlib inline
sns.set(color_codes=True)
warnings.filterwarnings('ignore') ## Suppress the warnings

**Load the data into a dataframe**

In [None]:
# load the data into a dataframe called supermarket_till_transactions_df
supermarket_till_transactions_df = pd.read_csv("")

In [None]:
# view the top five records
supermarket_till_transactions_df.head(5)

To illustrate multiple linear regression we need just four variables, three predictors and the outcome (SPEND):
1. SHOP_WEEKDAY
2. SHOP_HOUR
3. QUANTITY
4. SPEND

In [None]:
# keep only the four columns listed above
supermarket_till_transactions_df = supermarket_till_transactions_df[["SHOP_WEEKDAY", "SHOP_HOUR", "QUANTITY", "SPEND"]]
supermarket_till_transactions_df.head(5)

**Using statsmodels**

Simple linear regression can easily be extended to include multiple features. This is called multiple linear
regression:
    
$y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$

Each $x$ represents a different feature, and each feature has its own coefficient. In this case:

$y = \beta_0 + \beta_1 \times \text{SHOP\_WEEKDAY} + \beta_2 \times \text{SHOP\_HOUR} + \beta_3 \times \text{QUANTITY}$
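As a rough illustration of what estimating these coefficients involves, the sketch below fits the same kind of equation by ordinary least squares with NumPy. The data is synthetic and the names (`true_betas`, `design`, `estimated_betas`) are made up for this example; it does not use the supermarket dataset.

```python
import numpy as np

# Synthetic, illustrative data (not the supermarket dataset):
# three predictors and a target built from known coefficients.
rng = np.random.default_rng(0)
n_rows = 200
X = rng.uniform(0, 10, size=(n_rows, 3))
true_betas = np.array([5.0, -2.0, 0.5, 3.0])  # [beta_0, beta_1, beta_2, beta_3]
y = true_betas[0] + X @ true_betas[1:] + rng.normal(0, 0.1, size=n_rows)

# Prepend a column of ones so the first coefficient acts as the intercept.
design = np.column_stack([np.ones(n_rows), X])

# Ordinary least squares: choose betas that minimise ||design @ betas - y||^2.
estimated_betas, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(estimated_betas, 2))
```

With little noise in the data, the estimates should land close to the coefficients used to generate it.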

Let's use Statsmodels to estimate these coefficients:

In [None]:
# Initialise and fit linear regression model using `statsmodels`
stats_model = smf.ols('SPEND ~ SHOP_WEEKDAY + SHOP_HOUR + QUANTITY', data=supermarket_till_transactions_df)
stats_model = stats_model.fit()

We no longer have to calculate the intercept and the coefficients ourselves, as the fit() method estimates them for us. Calling
stats_model.params shows the model's parameters:

In [None]:
stats_model.params

From the results above:
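1. $\beta_0 = 366.8018$ - This is the y-intercept: the predicted SPEND when all three predictors are zero
2. $\beta_1 = -18.779187$ - This is the regression coefficient that gives the expected change in SPEND for a one-unit
increase in SHOP_WEEKDAY, holding the other predictors constant
3. $\beta_2 = -12.306012$ - This is the regression coefficient that gives the expected change in SPEND for a one-unit
increase in SHOP_HOUR, holding the other predictors constant
4. $\beta_3 = 53.434573$ - This is the regression coefficient that gives the expected change in SPEND for a one-unit
increase in QUANTITY, holding the other predictors constant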
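To make the meaning of a coefficient concrete, the sketch below plugs the reported values into the model equation and compares two transactions that differ only in QUANTITY. The helper `predicted_spend` and the example transactions are assumptions introduced for illustration.

```python
# Coefficients as reported in the results above.
beta_0, beta_1, beta_2, beta_3 = 366.8018, -18.779187, -12.306012, 53.434573


def predicted_spend(shop_weekday, shop_hour, quantity):
    """Evaluate the fitted linear equation for one transaction."""
    return beta_0 + beta_1 * shop_weekday + beta_2 * shop_hour + beta_3 * quantity


# Hypothetical transactions that differ only in QUANTITY (2 vs 3 items).
spend_two_items = predicted_spend(shop_weekday=1, shop_hour=18, quantity=2)
spend_three_items = predicted_spend(shop_weekday=1, shop_hour=18, quantity=3)

# The difference equals beta_3, because the other predictors are held fixed.
print(round(spend_three_items - spend_two_items, 6))
```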

In [None]:
# print a summary of the fitted model
stats_model.summary()

**R Squared**

**The coefficient of determination, R-Squared**: This measures the proportion of the variation in the
outcome that is explained by the independent variables. R-Squared never decreases as more
predictors are added to the MLR model, even when the added predictors are unrelated to the outcome variable.

R-Squared by itself therefore cannot be used to decide which predictors should be included in a model and which should be
excluded. For a least-squares model with an intercept, R-Squared lies between 0 and 1, where 0 indicates that the
independent variables explain none of the variation in the outcome and 1 indicates that the outcome can be
predicted without error from the independent variables.
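The behaviour described above can be checked numerically. The sketch below computes R-Squared by hand as $1 - SS_{res}/SS_{tot}$ on synthetic data, then refits after adding a predictor that is pure noise. The function `r_squared` and all variable names are assumptions made for this example.

```python
import numpy as np


def r_squared(design, y):
    """R^2 = 1 - SS_res / SS_tot for a least-squares fit of y on design.

    `design` is expected to already contain a column of ones (the intercept).
    """
    betas, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ betas
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic data: y depends on x only; noise_feature is unrelated to y.
rng = np.random.default_rng(42)
n_rows = 100
x = rng.normal(size=n_rows)
noise_feature = rng.normal(size=n_rows)
y = 2.0 * x + rng.normal(size=n_rows)

ones = np.ones(n_rows)
r2_one_predictor = r_squared(np.column_stack([ones, x]), y)
r2_with_noise = r_squared(np.column_stack([ones, x, noise_feature]), y)

print(r2_one_predictor, r2_with_noise)
```

The second value should be at least as large as the first, even though the extra predictor carries no information about the outcome.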

In [None]:
# print the R-squared value for the model
stats_model.rsquared

**This means that <font color=red>12.025%</font> of the variation in SPEND can be explained by SHOP_WEEKDAY, SHOP_HOUR and
QUANTITY**

**Adjusted R-Squared**

When we add more predictor variables to the equation, R-Squared never decreases, so it becomes a less reliable
measure of fit as the number of predictor variables grows.

Adjusted R-Squared accounts for the number of predictor variables by penalising each additional one.

Because of this penalty, the adjusted R-Squared is always lower than or equal to the R-Squared.
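For reference, the usual formula for a model with an intercept is $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. A minimal sketch follows; the sample size of 100 is a made-up value, since the notebook does not state how many rows the data has.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_observations - 1) / (
        n_observations - n_predictors - 1
    )


# R^2 of 0.12025 is the value quoted earlier; n_observations=100 is a
# hypothetical sample size used only for illustration. The model has 3 predictors.
adjusted = adjusted_r_squared(r_squared=0.12025, n_observations=100, n_predictors=3)
print(round(adjusted, 4))
```

The result comes out below the unadjusted value, reflecting the penalty for the three predictors.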

In [None]:
# print the Adjusted R-squared value for the model
stats_model.rsquared_adj

**Perform the prediction**

In [None]:
new_shop_weekday = 1
new_shop_hour = 18
new_quantity = 2
stats_model_ypred = stats_model.predict({"SHOP_WEEKDAY": new_shop_weekday,"SHOP_HOUR": new_shop_hour,"QUANTITY": new_quantity})

**RMSE**

The root-mean-square error (RMSE) is a frequently used measure of the differences between the values
predicted by a model and the values actually observed. It is expressed in the same units as the outcome, here the units of SPEND.

The smaller the value, the better the fit.
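The calculation itself is short: square each error, average the squares, and take the square root. A minimal sketch with made-up numbers (the function name and the sample values are assumptions for illustration):

```python
import numpy as np


def root_mean_squared_error(observed, predicted):
    """Square root of the mean of the squared differences."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


# Illustrative values only: the errors are 1, 0 and -2.
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
rmse_value = root_mean_squared_error(observed, predicted)
print(round(rmse_value, 4))
```

Note that both inputs must have one entry per observation, so the predictions passed in should cover the same rows as the observed values.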

In [None]:
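# calc rmse between the observed SPEND values and the model's fitted values
stats_model_rmse = rmse(supermarket_till_transactions_df["SPEND"], stats_model.fittedvalues)
stats_model_rmse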

**_Confidence in the model_**

A confidence interval gives an estimated range of values which is likely to include an unknown population
parameter, the estimated range being calculated from a given set of sample data.

The width of a confidence interval reflects how much uncertainty surrounds an estimate: the wider the interval, the less
precise the estimate. For a regression coefficient, the interval is the estimate plus or minus a margin of error, and
conf_int() returns 95% intervals by default. If the interval for a coefficient contains zero, the data are consistent
with that predictor having no linear effect on the outcome.
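As a sketch of how such an interval is built, the example below computes estimate ± z × standard error using a normal approximation. This is an assumption made to keep the example dependency-free: statsmodels uses the t distribution, which gives slightly wider intervals on small samples. The standard error of 5.0 is hypothetical.

```python
from statistics import NormalDist


def normal_confidence_interval(estimate, standard_error, confidence=0.95):
    """Two-sided interval: estimate +/- z * standard_error (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    margin = z * standard_error
    return estimate - margin, estimate + margin


# 53.434573 is the QUANTITY coefficient reported above; the standard error
# of 5.0 is a hypothetical value chosen for illustration only.
lower, upper = normal_confidence_interval(estimate=53.434573, standard_error=5.0)
print(round(lower, 2), round(upper, 2))
```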

In [None]:
# print the confidence intervals for the model coefficients
stats_model.conf_int()

**_Hypothesis Testing and P-Values_**

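**p-values** indicate how statistically significant each variable is: a p-value is the probability of seeing a coefficient
at least as far from zero as the estimate if the true coefficient were zero. A small p-value (commonly below 0.05)
suggests the variable is related to the outcome. Removing variables with high p-values cannot increase R-Squared, but it
can increase the adjusted R-Squared and may lower the p-values of the remaining variables, which is a good sign.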
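The sketch below shows roughly how a p-value follows from a coefficient and its standard error. It uses a normal approximation (statsmodels uses the t distribution), and the coefficient and standard-error pairs are hypothetical values chosen for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(estimate, standard_error):
    """Probability of a test statistic at least this far from zero under the null."""
    test_statistic = estimate / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(test_statistic)))


# Hypothetical estimate / standard-error pairs, for illustration only.
p_strong = two_sided_p_value(estimate=53.4, standard_error=5.0)   # estimate far from zero
p_weak = two_sided_p_value(estimate=-12.3, standard_error=10.0)   # estimate close to zero

print(p_strong < 0.05, p_weak < 0.05)
```

An estimate that is large relative to its standard error gives a small p-value, while one that is about the size of its standard error does not clear the 0.05 threshold.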

In [None]:
# print the p-values for the model coefficients
stats_model.pvalues

**Using scikit-learn**

In [None]:
# Build linear regression model using SHOP_WEEKDAY,SHOP_HOUR and QUANTITY as predictors
# Split data into predictors X and output Y
predictors = ['SHOP_WEEKDAY', 'SHOP_HOUR', 'QUANTITY']
X = supermarket_till_transactions_df[predictors]
y = supermarket_till_transactions_df['SPEND']

# Initialise and fit model
lm = LinearRegression()
scikit_model = lm.fit(X, y)

In [None]:
print(f'alpha = {scikit_model.intercept_}')
print(f'betas = {scikit_model.coef_}')

Therefore, our model can be written as:

$\text{SPEND} = 366.802 - 18.779 \times \text{SHOP\_WEEKDAY} - 12.306 \times \text{SHOP\_HOUR} + 53.435 \times \text{QUANTITY}$
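As a quick sanity check, the equation can be evaluated by hand as a dot product between the coefficient vector and a transaction. This sketch uses the rounded coefficients from the equation and the same example transaction used elsewhere in this notebook (SHOP_WEEKDAY = 1, SHOP_HOUR = 18, QUANTITY = 2); the variable names are assumptions for illustration.

```python
import numpy as np

# Rounded coefficients from the equation above:
# intercept, SHOP_WEEKDAY, SHOP_HOUR, QUANTITY.
coefficients = np.array([366.802, -18.779, -12.306, 53.435])

# The leading 1 multiplies the intercept; then SHOP_WEEKDAY=1, SHOP_HOUR=18, QUANTITY=2.
transaction = np.array([1, 1, 18, 2])

hand_prediction = float(coefficients @ transaction)
print(hand_prediction)
```

The result should agree with the model's .predict() output to within the rounding of the coefficients.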

We can predict values by simply using .predict():

In [None]:
new_X = [[1, 18, 2]] # Sunday 6pm buying 2 items
scikit_learn_ypred = scikit_model.predict(new_X)

**Calculate the RMSE when scikit-learn is used**

In [None]:
# calc rmse between the observed SPEND values and the model's predictions for the same rows
scikit_learn_rmse = rmse(y, scikit_model.predict(X))
scikit_learn_rmse

**From the prediction above, anyone buying 2 items at 6pm on Sunday is predicted to spend about 233.38 Shillings.**