# Workbench

**Importing the required libraries**

In [None]:
# Import the numpy and pandas packages
import numpy as np
import pandas as pd

# Import Standard operations
import operator

# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Import the warnings
import warnings

# Import statsmodels
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

# Import RMSE
from statsmodels.tools.eval_measures import rmse

# Import Linear Regression from scikit-learn
from sklearn.linear_model import LinearRegression

# Import Polynomial Features
from sklearn.preprocessing import PolynomialFeatures

# Import the Train Test Split capability from sk-learn
from sklearn.model_selection import train_test_split

# Import the metrics
from sklearn.metrics import mean_squared_error, r2_score

# configuration settings
%matplotlib inline
sns.set(color_codes=True)
warnings.filterwarnings('ignore')  # Suppress the warnings

**Load the data into a dataframe**

In [None]:
# load the data into a dataframe called supermarket_till_transactions_df
supermarket_till_transactions_df = pd.read_csv("")

In [None]:
# view the top five records
supermarket_till_transactions_df.head(5)

To illustrate polynomial regression we need only two variables:
1. SHOP_HOUR
2. SPEND

In [None]:
supermarket_till_transactions_df = supermarket_till_transactions_df[["SHOP_HOUR","SPEND"]]
supermarket_till_transactions_df.head(5)

**Visualize the linear regression and compare to polynomial regression line**

In [None]:
x = supermarket_till_transactions_df.iloc[:,:-1].values
y = supermarket_till_transactions_df.iloc[:,-1].values

**Display the Linear Regression Line**

In [None]:
linear_regression_model = LinearRegression()
linear_regression_model.fit(x, y)
y_pred = linear_regression_model.predict(x)

linear_rmse = np.sqrt(mean_squared_error(y,y_pred))
linear_r2 = r2_score(y,y_pred)

# Visualizing the Linear Regression results
def display_linear_regression():
    plt.scatter(x, y, s=10)
    plt.plot(x, y_pred, color='r')
    plt.title('Linear Regression')
    plt.xlabel('SHOP Hour')
    plt.ylabel('SPEND')
    plt.show()
    return

In [None]:
# Plot the linear regression line
display_linear_regression()

**Calculate the RMSE**

We can see that the straight line is unable to capture the patterns in the data, which makes it an example of
under-fitting.

To overcome the under-fitting, we need to increase the complexity of the model.

In [None]:
print("The RMSE is : {} ".format(linear_rmse))
print("The R-Squared is : {} ".format(linear_r2))

To generate a higher order equation we can add powers of the original features as new features and thus the
linear model

$             Y = \theta_0 + \theta_1x $

can be transformed to

$            Y = \theta_0 + \theta_1x +\theta_2x^2 $

To convert the original features into their higher order terms we will use the PolynomialFeatures class provided
by scikit-learn and then train using Linear Regression
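
As a rough illustration of what the expansion does, the sketch below builds by hand the columns that `PolynomialFeatures(degree=2)` would produce for a single input feature with its default bias column: a column of ones, $x$, and $x^2$. The `hours` values are hypothetical and only NumPy is used.

```python
import numpy as np

# Three hypothetical SHOP_HOUR values, one feature per row.
hours = np.array([[8.0], [12.0], [20.0]])

# Degree-2 expansion: a bias column of ones, x, and x squared.
expanded = np.column_stack([np.ones_like(hours), hours, hours ** 2])
print(expanded)
```

Each row now has three columns, so an ordinary linear regression on `expanded` fits $\theta_0$, $\theta_1$ and $\theta_2$ at once.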

**Display the Polynomial Regression Line**

In [None]:
polynomial_features= PolynomialFeatures(degree=5)
x_poly = polynomial_features.fit_transform(x)

polynomial_regression_model = LinearRegression()
polynomial_regression_model.fit(x_poly, y)
y_poly_pred = polynomial_regression_model.predict(x_poly)

polynomial_regression_rmse = np.sqrt(mean_squared_error(y,y_poly_pred))
polynomial_regression_r2 = r2_score(y,y_poly_pred)

plt.scatter(x, y, s=10)

# sort the values of x before the line plot, leaving the original x and y untouched
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x, y_poly_pred), key=sort_axis)
x_sorted, y_poly_pred_sorted = zip(*sorted_zip)
plt.plot(x_sorted, y_poly_pred_sorted, color='r')
plt.title('Polynomial Regression')
plt.xlabel('SHOP HOUR')
plt.ylabel('SPEND')
plt.show()

**Calculate the RMSE**

We can see that the curve tries to capture as many data points as possible, so when we check the R-Squared
value it should increase.

The new line clearly fits the data better than the linear one.

In [None]:
print("The RMSE is : {} ".format(polynomial_regression_rmse))
print("The R-Squared is : {} ".format(polynomial_regression_r2))

We can see that the RMSE has decreased and the R-Squared has increased as compared to the linear
regression model

**Using statsmodels**

Simple linear regression can easily be extended to include multiple features. This is called multiple linear
regression:

$y = β_0 + β_1x_1+. . . +β_nx_n$

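Each $x$ represents a different feature, and each feature has its own coefficient. In this case the features are the
powers of SHOP_HOUR generated by `PolynomialFeatures`:

$y = β_0 + β_1 × SHOPHOUR + β_2 × SHOPHOUR^2 + . . . + β_5 × SHOPHOUR^5$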

Let's use Statsmodels to estimate these coefficients:

In [None]:
# Build the degree-5 polynomial features for the `statsmodels` model
polynomial_features= PolynomialFeatures(degree=5)
xp = polynomial_features.fit_transform(x)
xp.shape

In [None]:
stats_model = sm.OLS(y, xp).fit()
ypred = stats_model.predict(xp)
ypred.shape

In [None]:
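# sort by x so the fitted curve is drawn from left to right
order = np.argsort(np.ravel(x))
plt.scatter(x, y)
plt.plot(np.ravel(x)[order], ypred[order], color='r')
plt.title('Polynomial Regression (Using statsmodels)')
plt.xlabel('SHOP HOUR')
plt.ylabel('SPEND')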

**Plotting the upper and lower confidence intervals**

In [None]:
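# wls_prediction_std returns the standard error, then the lower and upper bounds
_, lower, upper = wls_prediction_std(stats_model)

order = np.argsort(np.ravel(x))  # sort by x so the lines are drawn left to right
x_line = np.ravel(x)[order]
plt.scatter(x, y)
plt.plot(x_line, ypred[order])
plt.plot(x_line, upper[order], '--', label="Upper")  # upper bound of the interval
plt.plot(x_line, lower[order], ':', label="Lower")  # lower bound of the interval
plt.legend(loc='upper left')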

In [None]:
stats_model.summary()

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 1.57e+09. This might indicate that there are
strong multicollinearity or other numerical problems.

We no longer have to calculate alpha and beta ourselves as this method does it automatically for us! Calling
`stats_model.params` will show us the model's parameters.

From the results above:

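1. $β_0 = 366.8018$ - This is the y intercept, the predicted SPEND when $x$ is zero
2. $β_2 = -12.306012$ - This is the estimated coefficient on one of the polynomial terms of SHOP_HOUR. In a polynomial model the change in SPEND for a one-hour change in SHOP_HOUR depends on all of the terms together, not on a single coefficient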
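
To see how a set of fitted parameters turns back into predictions, here is a minimal sketch with hypothetical coefficients (not the values from the summary above). It multiplies a small degree-2 design matrix by a coefficient vector, which is the same matrix product an ordinary least squares model uses to predict.

```python
import numpy as np

# Hypothetical coefficients in the order [intercept, x, x^2].
beta = np.array([1.0, 2.0, 3.0])

# Hypothetical SHOP_HOUR values.
hours = np.array([0.0, 1.0, 2.0])

# Design matrix with a column of ones, x, and x squared.
design = np.column_stack([np.ones_like(hours), hours, hours ** 2])

# Each prediction is the sum of every column times its coefficient.
predictions = design @ beta
print(predictions)
```

At `hours = 0` only the intercept contributes, which is why $β_0$ is read as the value of $y$ when $x$ is zero.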

**R Squared**

**The Coefficient of determination, R-Squared** – This measures how much of the variation in the
outcome can be explained by the variation in the independent variables. R-Squared never decreases as more
predictors are added to the **MLR** model, even when those predictors are unrelated to the outcome variable.

R-Squared by itself therefore cannot be used to decide which predictors should be included in a model and which
should be excluded. For a least-squares model with an intercept, R-Squared lies between 0 and 1, where 0 indicates
that the outcome cannot be predicted by any of the independent variables and 1 indicates that the outcome can be
predicted without error from the independent variables.
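
The effect of adding an unrelated predictor can be checked on synthetic data. The sketch below is an illustration only: the data, the `r_squared` helper and the `noise` column are all made up for the example, and the fit uses NumPy's least-squares solver rather than the notebook's models.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
y = 2.0 + 0.5 * x1 + rng.normal(size=n)  # synthetic outcome
noise = rng.normal(size=n)               # predictor unrelated to y


def r_squared(design, target):
    """R-Squared of an ordinary least squares fit of target on design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


ones = np.ones(n)
r2_base = r_squared(np.column_stack([ones, x1]), y)
r2_more = r_squared(np.column_stack([ones, x1, noise]), y)
print(r2_base, r2_more)
```

The second value is never smaller than the first, even though `noise` carries no information about `y`.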

In [None]:
# print the R-squared value for the model
stats_model.rsquared

**This means that <font color=red>8.62%</font> of the variation in SPEND can be explained by SHOP_HOUR**

**Adjusted R-Squared**

When we add more predictor variables to the equation, R-Squared never decreases, which makes it an unreliable
measure of fit as the number of predictor variables grows.

Adjusted R-Squared accounts for the number of predictor variables.

Because of the nature of the equation, the adjusted R-Squared is always lower than or equal to the R-Squared.
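
For reference, the standard definition is $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors excluding the intercept. A minimal sketch with hypothetical numbers (not taken from the supermarket data) shows the penalty growing with $p$; the function name is an assumption made for the example.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-Squared: penalises R-Squared for each extra predictor."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical values, not taken from the supermarket data.
r2 = 0.50
adj_one = adjusted_r_squared(r2, n_obs=50, n_predictors=1)
adj_five = adjusted_r_squared(r2, n_obs=50, n_predictors=5)
print(adj_one, adj_five)
```

With the same R-Squared, the model with five predictors gets the lower adjusted value.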

In [None]:
# print the Adjusted R-squared value for the model
stats_model.rsquared_adj

**RMSE**

The root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample
and population values) predicted by a model and the values actually observed

The smaller the value the better
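
As a small worked example, RMSE is the square root of the mean squared difference between observed and predicted values. The numbers below are hypothetical.

```python
import numpy as np

# Hypothetical observed and predicted values.
observed = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the mean squared error is 5/3.
rmse_value = np.sqrt(np.mean((observed - predicted) ** 2))
print(rmse_value)
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.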

In [None]:
# calc rmse
stats_model_rmse = rmse(y, ypred)
stats_model_rmse

**Confidence in the model**

A confidence interval gives an estimated range of values which is likely to include an unknown population
parameter, the estimated range being calculated from a given set of sample data.

A confidence interval expresses how much uncertainty there is in a particular statistic. Confidence intervals are
often used with a margin of error. It tells you how confident you can be that the results reflect what you would
expect to find if it were possible to study the entire population.
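
The idea can be sketched on a simpler statistic than a regression coefficient. The example below computes an approximate 95% interval for a sample mean on synthetic data, assuming a normal approximation with $z = 1.96$; `stats_model.conf_int()` reports intervals for the model coefficients and is based on the t distribution, so this is an illustration of the concept rather than the same calculation.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=100)  # synthetic data

mean = sample.mean()
# Standard error of the mean, using the sample standard deviation.
std_error = sample.std(ddof=1) / np.sqrt(sample.size)

z = 1.96  # normal approximation for a 95% interval (an assumption)
lower, upper = mean - z * std_error, mean + z * std_error
print(lower, mean, upper)
```

A larger sample shrinks `std_error`, and with it the width of the interval.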

In [None]:
# print the confidence intervals for the model coefficients
stats_model.conf_int()

**Hypothesis Testing and P-Values**

**p-values** tell you how statistically significant each variable is. Removing variables with high p-values can
increase the adjusted R-Squared and can lower the p-values of the remaining variables, which is a good sign.
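
A p-value for a coefficient comes from its test statistic, the estimate divided by its standard error. The sketch below uses a normal approximation and hypothetical numbers; statsmodels bases its OLS p-values on the t distribution, so the values will differ slightly for small samples. The helper name is an assumption made for the example.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a test statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical coefficient estimate and standard error.
estimate, std_error = 1.96, 1.0
z_stat = estimate / std_error
p_value = two_sided_p_value(z_stat)
print(z_stat, p_value)
```

A statistic of 1.96 gives a p-value of about 0.05, the conventional threshold for significance.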

In [None]:
# print the p-values for the model coefficients
stats_model.pvalues

**Notes**

To prevent over-fitting, we can add more training samples so that the algorithm doesn't learn the noise in the
system and can become more generalized.

To strike a balance between under-fitting and over-fitting you need to understand a statistical term called
**Bias-Variance Trade-Off**
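
One common way to look at this trade-off is to hold out part of the data and compare training and test error as model complexity grows. The sketch below is an assumption-laden illustration: the data is synthetic, the split is a simple slice, and `np.polyfit` stands in for the notebook's models.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=80)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=80)  # synthetic

# Hold out the last 20 points as a test set.
x_train, x_test = x[:60], x[60:]
y_train, y_test = y[:60], y[60:]


def rmse(actual, predicted):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


train_rmse, test_rmse = {}, {}
for degree in (1, 5):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_rmse[degree] = rmse(y_train, np.polyval(coefs, x_train))
    test_rmse[degree] = rmse(y_test, np.polyval(coefs, x_test))

print(train_rmse)
print(test_rmse)
```

The training error cannot go up when moving from degree 1 to degree 5, so it is the test error that shows whether the extra complexity generalizes or only fits noise.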