# Workbench

**Importing the required libraries**

In [None]:
# Import the numpy and pandas packages
import numpy as np
import pandas as pd

#Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

#Import the warnings
import warnings

#Import statsmodels
import statsmodels.formula.api as smf

#Import RMSE
from statsmodels.tools.eval_measures import rmse

#Import Linear Regression from scikit-learn
from sklearn.linear_model import LinearRegression

#configuration settings
%matplotlib inline
sns.set(color_codes=True)
warnings.filterwarnings('ignore')  # Suppress the warnings

**Load the data into a dataframe**

In [None]:
#Load the data into a dataframe called supermarket_till_transactions_df
supermarket_till_transactions_df =pd.read_csv("")

In [1]:
#View the top five records
supermarket_till_transactions_df.head()

In order to illustrate Simple Linear Regression we just need two variables which are:
1. SHOP_WEEKDAY
2. SPEND

In [None]:
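# Keep only the two columns we need; .copy() gives an independent dataframe we can add columns to later
supermarket_till_transactions_df = supermarket_till_transactions_df[["SHOP_WEEKDAY","SPEND"]].copy()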
supermarket_till_transactions_df.head(5)

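**Using the Ordinary Least Squares (OLS) Method**

There are two kinds of variables in a linear regression model:
1. The input or predictor variable, commonly referred to as X
2. The output variable that we want to predict, commonly referred to as Y

The model is a straight line:

$$Y_e = \alpha + \beta X$$

where Yₑ is the estimated or predicted value of Y based on our linear equation.
The objective of the Ordinary Least Squares method is to find the values of α and β in the Yₑ = α + βX linear
regression equation that minimise the sum of the squared differences between Y and Yₑ:

$$\sum_{i=1}^{n} (Y_i - Y_{e,i})^2$$

The values that minimise this sum are:

$$\beta = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \alpha = \bar{Y} - \beta \bar{X}$$

where X̄ is the mean of the X values and Ȳ is the mean of the Y values.
Equivalently, β is simply Cov(X, Y) / Var(X).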

If we are able to determine the optimum values of these two parameters, then we will have the line of best fit
that we can use to predict the values of Y, given the value of X.
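
As a sanity check on β = Cov(X, Y) / Var(X), here is a minimal sketch in plain Python. It uses a handful of made-up points rather than the supermarket data, takes the intercept as the mean of Y minus β times the mean of X, and the helper name `ols_fit` is an assumption introduced only for this illustration.

```python
from statistics import mean

def ols_fit(xs, ys):
    """Return (alpha, beta) for the least-squares line y = alpha + beta * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator: sum of cross-deviations (proportional to Cov(X, Y))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of X (proportional to Var(X))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Hypothetical points that lie exactly on the line y = 10 - 2x
demo_x = [1, 2, 3, 4, 5]
demo_y = [8, 6, 4, 2, 0]
demo_alpha, demo_beta = ols_fit(demo_x, demo_y)
print(demo_alpha, demo_beta)  # 10.0 -2.0
```

Because the made-up points sit exactly on a line, the fit recovers that line's intercept and slope; with real, noisy data the same two sums give the line of best fit.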

In [None]:
x = supermarket_till_transactions_df["SHOP_WEEKDAY"]
y = supermarket_till_transactions_df["SPEND"]

#calculate the mean of x and y
xmean = np.mean(x)
ymean = np.mean(y)

#Calculate the terms needed for the numerator and denominator of beta
supermarket_till_transactions_df['xycov'] = (x - xmean) * (y - ymean)
supermarket_till_transactions_df['xvar'] = (x - xmean) ** 2

#Calculate beta and alpha
beta = supermarket_till_transactions_df['xycov'].sum() / supermarket_till_transactions_df['xvar'].sum()
alpha = ymean - (beta * xmean)

In [None]:
# View the alpha and beta values
print(f'alpha = {alpha}')
print(f'beta = {beta}')

Great, we now have an estimate for alpha and beta! Our model can be written as Yₑ = 330.098 - 30.045 X,
and we can start making predictions:

In [None]:
ypred = alpha + beta * x

In [None]:
# View the predictions
ypred

Let’s plot our prediction ypred against the actual values of y to get a better visual understanding of our model.

In [None]:
# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(x, ypred) # regression line
plt.plot(x, y, 'ro') # scatter plot showing actual data
plt.title('Actual vs Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.show()

The blue line is our line of best fit, i.e. Yₑ = 330.098 - 30.045 X

We can see from this graph that there is a negative linear relationship between X and y. Using our model, we
can predict y for any value of X.

For example, for X = 7 (according to the data description, 7 represents Saturday), the model predicts:

Yₑ = 330.098 - 30.045 (7) = 119.783

In other words, the model predicts that customer spend decreases as the week progresses, from Sunday (1) to
Saturday (7).
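
The arithmetic above can be reproduced for every day code with a few lines of Python. This is only a sketch: the coefficients are the rounded values quoted in the text, and `predict_spend` is a hypothetical helper name, not part of any library.

```python
# Rounded coefficients as quoted in the text above
ALPHA = 330.098
BETA = -30.045

def predict_spend(shop_weekday):
    """Predicted spend for a SHOP_WEEKDAY code, using the fitted line."""
    return ALPHA + BETA * shop_weekday

# Print the prediction for each day code from 1 to 7
for day in range(1, 8):
    print(day, round(predict_spend(day), 3))
```

Each step of one day lowers the prediction by the same amount, about 30.045, which is exactly what a constant slope means.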

**Using statsmodels**

In [None]:
# Initialise and fit linear regression model using `statsmodels`
stats_model = smf.ols('SPEND ~ SHOP_WEEKDAY', data=supermarket_till_transactions_df)
stats_model = stats_model.fit()

We no longer have to calculate alpha and beta ourselves as this method does it automatically for us! Calling
stats_model.params will show us the model’s parameters:

In [None]:
stats_model.params

From the results above:
1. β0 = 330.097882 - This is the y-intercept, i.e. the predicted SPEND when SHOP_WEEKDAY is zero
2. β1 = -30.045026 - This is the regression coefficient, i.e. the change in predicted SPEND for a one-unit increase in SHOP_WEEKDAY

The negative regression coefficient for SHOP_WEEKDAY means that SPEND tends to decrease as SHOP_WEEKDAY
increases.

In [None]:
stats_model.summary()

**R-Squared**

**The coefficient of determination, R-Squared** – This measures how much of the variation in the
outcome can be explained by the variation in the independent variables. R-Squared never decreases as more
predictors are added to a multiple linear regression (MLR) model, even when the added predictors are unrelated
to the outcome variable.

R-Squared by itself therefore can't be used to decide which predictors to include in a model and which to
exclude. For a least-squares model with an intercept, R-Squared lies between 0 and 1, where 0 indicates that the
independent variables explain none of the variation in the outcome and 1 indicates that the outcome can be
predicted without error from the independent variables.
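
To make the two extremes concrete, the sketch below computes R-Squared as 1 minus the ratio of the residual sum of squares to the total sum of squares. The numbers are made up for illustration and `r_squared` is a hypothetical helper, not the statsmodels implementation.

```python
from statistics import mean

def r_squared(actual, predicted):
    """R-Squared = 1 - SS_res / SS_tot."""
    y_bar = mean(actual)
    # Variation left unexplained by the predictions
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total variation of the outcome around its mean
    ss_tot = sum((y - y_bar) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]   # predictions that match every observation
mean_only = [6.0] * 4            # predictions that ignore the predictor entirely

print(r_squared(actual, perfect))    # 1.0
print(r_squared(actual, mean_only))  # 0.0
```

Predicting the mean for every observation explains none of the variation, which is why it scores 0.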

In [None]:
# print the R-squared value for the model
stats_model.rsquared

**This means that <font color= red>2.335%</font> of the variation in SPEND can be explained by SHOP_WEEKDAY**

**Adjusted R-Squared**

When we add more predictor variables to the equation, R-Squared never decreases, which makes it an unreliable
measure of fit as the number of predictor variables grows.

Adjusted R-Squared accounts for the number of predictor variables by penalising each additional one.

Because of the nature of the equation, the adjusted R-Squared is always lower than or equal to the R-Squared.
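
The usual formula is adjusted R-Squared = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below applies it to the R-Squared quoted earlier; the sample size of 100 is a made-up value for illustration only, and `adjusted_r_squared` is a hypothetical helper name.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-Squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

r2 = 0.02335   # R-Squared quoted in the text above
n_obs = 100    # hypothetical sample size, NOT taken from the data

adj_one = adjusted_r_squared(r2, n_obs, n_predictors=1)
adj_five = adjusted_r_squared(r2, n_obs, n_predictors=5)
print(adj_one, adj_five)
```

For the same R-Squared, the adjusted value drops as the number of predictors grows, and it can even fall below zero when the predictors explain very little.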

In [None]:
# print the Adjusted R-squared value for the model
stats_model.rsquared_adj

**Confidence in the model**

A confidence interval gives an estimated range of values that is likely to include an unknown population
parameter, the range being calculated from a given set of sample data.

A confidence interval expresses how much uncertainty there is in a particular statistic, and is often reported
together with a margin of error. It tells you how confident you can be that the results reflect what you would
expect to find if it were possible to study the entire population.
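
As a rough illustration of where such an interval comes from, the sketch below builds a 95% interval for a regression slope as estimate ± critical value × standard error. It rests on several assumptions: the data are made up, `slope_confidence_interval` is a hypothetical helper, and it uses a normal critical value for simplicity, whereas OLS intervals are normally based on the t distribution and so come out somewhat wider for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean

def slope_confidence_interval(xs, ys, level=0.95):
    """Return (lower, slope, upper) using a normal approximation."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope: sqrt(residual variance / Sxx)
    se_slope = sqrt(ss_res / (n - 2) / sxx)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return slope - z * se_slope, slope, slope + z * se_slope

# Hypothetical spend values for day codes 1 to 7
demo_days = [1, 2, 3, 4, 5, 6, 7]
demo_spend = [310, 265, 250, 200, 190, 140, 125]
low, slope, high = slope_confidence_interval(demo_days, demo_spend)
print(low, slope, high)
```

If the whole interval lies on one side of zero, as it does for these made-up values, the data are consistent with a relationship in that direction.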

In [None]:
# print the confidence intervals for the model coefficients
stats_model.conf_int()

**Hypothesis Testing and P-Values**

In [None]:
# print the p-values for the model coefficients
stats_model.pvalues

Now that we’ve fit a simple regression model, we can try to predict the values of spend based on the equation
we just derived using the .predict method.

In [None]:
# Predict values
spend_pred = stats_model.predict()
# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(supermarket_till_transactions_df['SHOP_WEEKDAY'], supermarket_till_transactions_df['SPEND'], 'o')  # actual data
plt.plot(supermarket_till_transactions_df['SHOP_WEEKDAY'], spend_pred, 'r', linewidth=2)
plt.xlabel('SHOP_WEEKDAY')
plt.ylabel('SPEND')
plt.title('SHOP_WEEKDAY vs SPEND')
plt.show()

With this model, we can predict spend for a given day of the week. For example, for Sunday (SHOP_WEEKDAY = 1),
the model predicts a spend of about 300.05 shillings:

In [None]:
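new_X = 1
# Pass the new value as a dataframe so the formula can look up SHOP_WEEKDAY by name
new_pred = stats_model.predict(pd.DataFrame({"SHOP_WEEKDAY": [new_X]}))
new_pred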

**RMSE**

The root-mean-square error (RMSE) is a frequently used measure of the differences between the values predicted
by a model and the values actually observed. It is expressed in the same units as the outcome variable.

The smaller the value, the better the fit.
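
To show what the measure actually computes, here is a minimal hand-rolled version: square each error, average the squares, and take the square root. The values are made up, and `root_mean_squared_error` is a hypothetical helper written for this illustration rather than the statsmodels function.

```python
from math import sqrt

def root_mean_squared_error(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical observed and predicted spend values
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
demo_rmse = root_mean_squared_error(observed, predicted)
print(round(demo_rmse, 3))  # 8.165
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.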

In [None]:
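# Calculate the RMSE of the statsmodels predictions against the observed spend
# (stored under a new name so the imported rmse function is not overwritten)
rmse_value = rmse(y, spend_pred)
rmse_value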