<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# **Retail Sales Dataset 2018-2022**
# **Lab 4. Model Development**

Estimated time needed: **1** hour

### **Dataset Attributes**
*   Date: year and month
*   SKU: unique code consisting of letters and numbers that identify each product
*   Group: group of related products which share some common attributes
*   Units Pkg: package weight (kg)
*   Avg Price Pkg: average price per package
*   Sales Pkg: total package sales per month

### **Target Field**
*   Turnover per month

### **Objectives**

After completing this lab you will be able to:

*   Develop prediction models


<p>In this section, we will develop several models that will predict the turnover per month using the variables or features. This is just an estimate but should give us an objective idea of how much the turnover will cost.</p>


The question we want to ask in this module

<ul>
    <li>What will be the turnover in X months?</li>
</ul>
<p>In data analytics, we often use <b>Model Development</b> to help us predict future observations from the data we have.</p>

<p>A model will help us understand the exact relationship between different variables and how these variables are used to predict the result.</p>


<h4><b>Setup</b></h4>


Import libraries:


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
#If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:
#install specific version of libraries used in lab
#! mamba install pandas==1.3.3-y
#! mamba install numpy=1.21.2-y
#! mamba install sklearn=0.20.1-y

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

This dataset was hosted on IBM Cloud object. Click <a href="https://cocl.us/DA101EN_object_storage?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


Load the data and store it in dataframe `df`:


In [ ]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0KT0EN/clean_sales_1.csv")
df

<h2><b>1. Linear Regression and Multiple Linear Regression</b></h2>


<h4><b>Linear Regression</b></h4>


<p>One example of a Data  Model that we will be using is:</p>
<b>Simple Linear Regression</b>

<br>
<p>Simple Linear Regression is a method to help us understand the relationship between two variables:</p>
<ul>
    <li>The predictor/independent variable -> <b>X</b></li>
    <li>The response/dependent variable (that we want to predict, our target variable) -> <b>Y</b></li>
</ul>

<p>The result of Linear Regression is a <b>linear function</b> that predicts the response (dependent) variable as a function of the predictor (independent) variable.</p>


$$
Y: Response \ Variable\\\\\\\\\\\\
X: Predictor \ Variables
$$


<b>Linear Function</b>
$$
Yhat = a + b  X
$$


<ul>
    <li><b>a</b> refers to the <b>intercept</b> of the regression line, in other words: the value of Y when X is 0</li>
    <li><b>b</b> refers to the <b>slope</b> of the regression line, in other words: the value with which Y changes when X increases by 1 unit</li>
</ul>


<h4>Let's load the modules for linear regression:</h4>


In [ ]:
from sklearn.linear_model import LinearRegression

In [ ]:
from sklearn import set_config
set_config(display="diagram")

<h4>Create the linear regression object:</h4>


In [ ]:
lm = LinearRegression()
lm

<h4>How could "Sales Pkg" help us predict turnover?</h4>


For this example, we want to look at how Sales Pkg can help us predict turnover per month.
Using simple linear regression, we will create a linear function with "Sales Pkg" as the predictor variable and the "Turnover per month" as the response variable.


In [ ]:
X = df[['Sales Pkg']]
Y = df['Turnover per month']

Fit the linear model using highway-mpg:


In [ ]:
lm.fit(X, Y)

We can output a prediction of turnover:


In [ ]:
Y_hat = lm.predict(X)
Y_hat[0:5]   

Let's see how much predicted values overlap actual values:


In [ ]:
plt.figure(figsize=(12, 10))

plt.title("Predicted vs real value")
plt.xlabel("Index of Sales Pkg (X) from 0 to 4137")
plt.ylabel("Turnover (Y)")


Y.plot()
plt.plot(Y.index, Y_hat)

This plot of the actual values of a dependent variable Y (turnover) against the predicted values of the same variable Y_hat using a linear regression model that was fit to a set of independent variables Z.


The resulting plot shows the relationship between the actual (blue) and predicted (orange, Y_hat) values of Y. If the linear regression model is a good fit for the data, the predicted values should closely match the actual values.</br>
It's clear, that predicted values highly overlap the real ones, so we could say that this model (lm) is a good fit.


<h4>What is the value of the intercept (a)?</h4>


In [ ]:
lm.intercept_

<h4>What is the value of the slope (b)?</h4>


In [ ]:
lm.coef_

<h3>What is the final estimated linear model we get?</h3>


As we saw above, we should get a final linear model with the structure:


$$
Yhat = a + b  X
$$


Plugging in the actual values we get:


<b>Turnover</b> = -7.09 + 8.88*<b>Sales Pkg</b>


**If Sales Pkg = 0, our turnover value is going to be negative. This means that this model (lm) is inadequate, so we must keep searching a good model for our data.**


Let's remember what we have done:


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #1 a):</b>

<b>Create a linear regression object called "lm_example1".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
lm_example1 = LinearRegression()
lm_example1
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #1 b):</b>

<b>Train the model using "Units Pkg" as the independent variable and "Turnover per month" as the dependent variable?</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
# note that "Units Pkg" is not defined as a good predictor, but we do this for practice purposes


<details><summary>Click here for the solution</summary>

```python
lm_example1.fit(df[["Units Pkg"]], df[["Turnover per month"]])
lm_example1
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #1 c):</b>

<b>Find the slope and intercept of the model.</b>

</div>


<h4>Slope</h4>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>
    
```python
# Slope 
lm_example1.coef_
```
</details>


<h4>Intercept</h4>


In [ ]:
# Write your code below and press Shift+Enter to execute 



<details><summary>Click here for the solution</summary>

```python
# Intercept
lm_example1.intercept_
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #1 c):</b>

<b>What is the equation of the predicted line? You can use x and yhat or variables names.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python 
turnover = -7.09 + 8.88*df['Units Pkg']
turnover

```

</details>


<h4><b>Multiple Linear Regression</b></h4>


<p>What if we want to predict turnover value using more than one variable?</p>

<p>If we want to use more variables in our model to predict car price, we can use <b>Multiple Linear Regression</b>.
Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and <b>two or more</b> predictor (independent) variables.
Most of the real-world regression models involve multiple predictors. We will illustrate the structure by using four predictor variables, but these results can generalize to any integer:</p>


$$
Y: Response \ Variable\\\\\\\\\\\\
X\_1 :Predictor\ Variable \ 1\\\\
X\_2: Predictor\ Variable \ 2\\\\
X\_3: Predictor\ Variable \ 3\\\\
X\_4: Predictor\ Variable \ 4\\\\
$$


$$
a: intercept\\\\\\\\\\\\
b\_1 :coefficients \ of\ Variable \ 1\\\\
b\_2: coefficients \ of\ Variable \ 2\\\\
b\_3: coefficients \ of\ Variable \ 3\\\\
b\_4: coefficients \ of\ Variable \ 4\\\\
$$


The equation is given by:


$$
Yhat = a + b\_1 X\_1 + b\_2 X\_2 + b\_3 X\_3 + b\_4 X\_4
$$


<p>From the previous section we know that there is only one good prefictor of turnover: Sales pkg, but for practice purposes let's include these variables as potential predictors::</p>
<ul>
    <li>Avg Price Pkg</li>
    <li>Units Pkg</li>
    <li>Sales Pkg</li>
</ul>
Let's develop a model using these variables as the predictor variables.


In [ ]:
lm1 = LinearRegression()
lm1

In [ ]:
Z = df[['Avg Price Pkg', 'Units Pkg', 'Sales Pkg']]

Fit the linear model using the three above-mentioned variables.


In [ ]:
lm1.fit(Z, Y)

What is the value of the intercept(a)?


In [ ]:
lm1.intercept_

What are the values of the coefficients (b1, b2, b3)?


In [ ]:
lm1.coef_

What is the final estimated linear model that we get?


As we saw above, we should get a final linear function with the structure:

$$
Yhat = a + b\_1 X\_1 + b\_2 X\_2 + b\_3 X\_3 + b\_4 X\_4
$$

What is the linear function we get in this example?


<b>Turnover</b> = -1518.0531995292617 + 171.19867466*<b>Avg Price Pkg</b> - 5.89834868*<b>Units Pkg</b> + 8.94327586*<b>Sales Pkg</b>


In [ ]:
Y_hat1 = lm1.predict(Z)
Y_hat1

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #2 a):</b>
    
<b>Create and train a Multiple Linear Regression model "lm_example2" where the response variable is "Turnover per month", and the predictor variable are "Sales Pkg" and "Avg Price Pkg".</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
lm_example2 = LinearRegression()
lm_example2.fit(df[["Sales Pkg", "Avg Price Pkg"]], Y)


```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #2 b):</b>
    
<b>What is the equation of the predicted line?</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
Turnover = -1642.4794547903107 + 8.97092365*df["Sales Pkg"] + 175.48454914*df["Avg Price Pkg"]
```

</details>


<h2><b>2. Model Evaluation Using Visualization</b></h2>


Now that we've developed some models, how do we evaluate our models and choose the best one? One way to do this is by using a visualization.


Import the visualization package, seaborn:


In [ ]:
# import the visualization package: seaborn
import seaborn as sns
%matplotlib inline 

<h3><b>Regression Plot</b></h3>


<p>When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using <b>regression plots</b>.</p>

<p>This plot will show a combination of a scattered data points (a <b>scatterplot</b>), as well as the fitted <b>linear regression</b> line going through the data. This will give us a reasonable estimate of the relationship between the two variables, the strength of the correlation, as well as the direction (positive or negative correlation).</p>


Let's visualize **Sales Pkg** as a potential predictor variable of turnover:


In [ ]:
width = 8
height = 6

plt.figure(figsize=(width, height)) 
sns.regplot(x="Sales Pkg", y="Turnover per month", data=df)

<p>We can see from this plot that Turnover is positively correlated to Sales packages since the regression slope is positive.

One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for this data.
    
Let's compare this plot to the regression plot of <b>"Units Pkg".</b></p>


In [ ]:
plt.figure(figsize=(width, height))
sns.regplot(x="Units Pkg", y="Turnover per month", data=df)

<p>Comparing the regression plot of "Units Pkg" and "Sales Pkg", we see that the points for "Sales Pkg" are much closer to the generated line and, on average, increase. The points for "Units Pkg" have more spread around the predicted line and it is much harder to determine if the points are decreasing or increasing as the "Units Pkg" increases.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #3:</b>
    
<b>Given the regression plots above, is "Units Pkg" or "Sales Pkg" more strongly correlated with "Turnover per month"?</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
# answer: Sales Pkg


<details><summary>Click here for the solution</summary>

```python
df[["Sales Pkg", "Units Pkg", "Turnover per month"]].corr()

```

</details>


<h3><b>Residual Plot</b></h3>

<p>A good way to visualize the variance of the data is to use a residual plot.</p>

<p><ol>What is a <b>residual</b>?</ol></p>

<p>The difference between the observed (actual) value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.</p>

<p><ol>So what is a <b>residual plot</b>?</ol></p>

<p>A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.</p>

<p><ol>What do we pay attention to when looking at a residual plot?</ol></p>

<p>We look at the spread of the residuals:</p>

<p>- If the points in a residual plot are <b>randomly spread out around the x-axis</b>, then a <b>linear model is appropriate</b> for the data.

<b>Why is that?</b> Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.</p>


In [ ]:
width = 8
height = 6

plt.figure(figsize=(width, height))
sns.residplot(x=df['Sales Pkg'],y=df['Turnover per month'])
plt.show()

<i>What is this plot telling us?</i>

<p>We can see from this residual plot that the residuals are not randomly spread around the x-axis, it's giving horizontal V-shape, leading us to believe that maybe a non-linear model is more appropriate for this data.</p>


<h3>Multiple Linear Regression</h3>


<p>How do we visualize a model for Multiple Linear Regression? This gets a bit more complicated because you can't visualize it with regression or residual plot.</p>

<p>One way to look at the fit of the model is by looking at the <b>distribution plot</b>. We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.</p>


We compare Y (Turnover per month given in df) and Y_hat1 (predicted Turnover using multiple linear regression model)


In [ ]:
plt.figure(figsize=(10, 8))

ax1 = sns.distplot(df['Turnover per month'], hist=False, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values" , ax=ax1)


plt.title('Actual vs Fitted Values for Turnover')
plt.xlabel('Turnover per month')
plt.ylabel('Proportion of sales')

plt.show()

<p>We can see that the fitted values are reasonably close to the actual values since the two distributions overlap a bit. However, there is definitely some room for improvement.</p>


In [ ]:
plt.figure(figsize=(12, 10))

plt.title("Predicted vs real value")
plt.xlabel("Indexes of 'Avg Price Pkg', 'Units Pkg', 'Sales Pkg' (Z)")
plt.ylabel("Turnover")

Y.plot()
plt.plot(Y.index, Y_hat1)

We see that although predicted values highly overlap actual values, there are also turnover with negative values. It's unappropriate, so this model (lm1) is bad.


<h2><b>3. Polynomial Regression and Pipelines</b></h2>


<h3><b>Simple Polynomial Regression</b></h3>


<p><b>Polynomial regression</b> is a particular case of the general linear regression model or multiple linear regression models.</p> 
<p>We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.</p>

<p>There are different orders of polynomial regression:</p>


<center><b>Quadratic - 2nd Order</b></center>
$$
Yhat = a + b_1 X +b_2 X^2 
$$

<center><b>Cubic - 3rd Order</b></center>
$$
Yhat = a + b_1 X +b_2 X^2 +b_3 X^3\\\\\\\\\\\\
$$

<center><b>Higher-Order</b>:</center>
$$
Y = a + b_1 X +b_2 X^2 +b_3 X^3 ....\\\\
$$


<p>We saw earlier that a linear model did not provide the best fit while using "Sales Pkg" as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.</p>


<p>We will use the following function to plot the data:</p>


In [ ]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    plt.figure(figsize=(8, 6))
    
    x_new = np.linspace(0, 3500, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Turnover')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Turnover per month')

    plt.show()
    plt.close()

Let's get the variables:


In [ ]:
x = df['Sales Pkg']
Y = df['Turnover per month']

Let's fit the polynomial using the function <b>polyfit</b>, then use the function <b>poly1d</b> to display the polynomial function.


In [ ]:
# Here we use a polynomial of the 7th order
function = np.polyfit(x, Y, 7)
poly_model1 = np.poly1d(function)
print(poly_model1)

Let's plot the function:


In [ ]:
PlotPolly(poly_model1, x, Y, 'Sales Pkg')

Now let's see if our model is good:


In [ ]:
Y_hat_poly1 = pow(4.45e-19*df["Sales Pkg"], 7) - pow(4.606e-15*df["Sales Pkg"], 6) + pow(1.801e-11*df["Sales Pkg"], 5) - pow(3.279e-08*df["Sales Pkg"], 4) + pow(2.733e-05*df["Sales Pkg"], 3) - pow(0.008502*df["Sales Pkg"], 2) + 9.314*df["Sales Pkg"] + 11.78
Y_hat_poly1

In [ ]:
plt.figure(figsize=(10, 8))

plt.title("Predicted vs real value")
plt.xlabel("Indexes of Sales Pkg (x)")
plt.ylabel("Turnover (y)")

Y.plot()
plt.plot(Y.index, Y_hat_poly1)

Now we haven't negative turnover values and overlap is good, so we can assume that this model (poly_model1) is good.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #4:</b>
    
<b>Create 11 order polynomial model with the variables x and y from above.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# Here we use a polynomial of the 11th order 
function1 = np.polyfit(x, y, 11)
poly_model_example1 = np.poly1d(function1)
print(poly_model_example1)
PlotPolly(poly_model_example1, x, y, 'Sales Pkg')

```

</details>


<h3><b>Multiple Polynomial Regression</b></h3>


<p>The analytical expression for Multivariate Polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:</p>


$$
Yhat = a + b\_1 X\_1 +b\_2 X\_2 +b\_3 X\_1 X\_2+b\_4 X\_1^2+b\_5 X\_2^2
$$


We can perform a polynomial transform on multiple features. First, we import the module:


In [ ]:
from sklearn.preprocessing import PolynomialFeatures

We create a <b>PolynomialFeatures</b> object of degree 2:


In [ ]:
pr = PolynomialFeatures(degree=2)
pr

Let's build our model using 'Avg Price Pkg', 'Units Pkg', 'Sales Pkg' (we also used them before in multiple linear regression)


In [ ]:
Z_pr = pr.fit_transform(Z)
Z_pr

In the original data, there are 4138 samples and 3 features.


In [ ]:
Z.shape

After the transformation, there are 4138 samples and 10 features.


In [ ]:
Z_pr.shape

<h2><b>Pipeline</b></h2>


<p>Data Pipelines simplify the steps of processing the data. We use the module <b>Pipeline</b> to create a pipeline. We also use <b>StandardScaler</b> as a step in our pipeline.</p>


A pipeline is a series of data processing steps combined into one single model. Pipelines can be used to automate a workflow, especially when you want to perform the same steps on multiple datasets. 

For instance, a common workflow in machine learning involves preprocessing the data, performing feature selection, and then fitting a model. By creating a pipeline that encapsulates these three steps, we can write more concise code and easily experiment with different variations of the workflow.


StandardScaler is a preprocessing step that is used to standardize the data. It scales the data so that it has a mean of 0 and a standard deviation of 1. This is done to ensure that all the features are on the same scale, so that the model can learn the weights of the features more effectively.


In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.


In [ ]:
Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]

We input the list as an argument to the pipeline constructor:


In [ ]:
pipe = Pipeline(Input)
pipe

First, we convert the data type Z to type float to avoid conversion warnings that may appear as a result of StandardScaler taking float inputs.

Then, we can normalize the data,  perform a transform and fit the model simultaneously.


In [ ]:
Z = Z.astype(float)
pipe.fit(Z, Y)

Similarly,  we can normalize the data, perform a transform and produce a prediction  simultaneously.


In [ ]:
ypipe = pipe.predict(Z)
ypipe

Let's see if this model is a good fit:


In [ ]:
plt.figure(figsize=(10, 8))

plt.title("Predicted vs real value")
plt.xlabel("Indexes of 'Avg Price Pkg', 'Units Pkg', 'Sales Pkg' (x)")
plt.ylabel("Turnover (y)")

Y.plot()
plt.plot(Y.index, ypipe)

This model (pipe) fits perfectly. 


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #5:</b>
    
<b>Create a pipeline that standardizes the data, then produce a prediction using a linear regression model using the features "Units Pkg" and "Sales Pkg" and target Y.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
Input_example = [('scale', StandardScaler()), ('model',LinearRegression())]

pipe_example = Pipeline(Input_example)

pipe.fit(df[["Units Pkg", "Sales Pkg"]], Y)

ypipe_example = pipe.predict(df[["Units Pkg", "Sales Pkg"]])
ypipe_example

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #6:</b>
    
<b>Answer if this model is good or not?</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
plt.figure(figsize=(10, 8))

plt.title("Predicted vs real value")
plt.xlabel("Indexes of Units Pkg and Sales Pkg (x)")
plt.ylabel("Turnover (y)")

Y.plot()
plt.plot(Y.index, ypipe_example)
# this model is a way worse than previous
```

</details>


<h2><b>4. Measures for In-Sample Evaluation</b></h2>


<p>When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.</p>

<p>Two very important measures that are often used in Statistics to determine the accuracy of a model are:</p>
<ul>
    <li><b>R^2 / R-squared</b></li>
    <li><b>Mean Squared Error (MSE)</b></li>
</ul>

<b>R-squared</b>

<p>R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line.</p>

<p>The value of the R-squared is the percentage of variation of the response variable (Y) that is explained by a linear model.</p>

<b>Mean Squared Error (MSE)</b>

<p>The Mean Squared Error measures the average of the squares of errors. That is, the difference between actual value (y) and the estimated value (ŷ).</p>
<p>The goal is to achieve the lowest possible MSE.</p>


<h3><b>Model 1: Simple Linear Regression (SLR)</b></h3>


Let's calculate the R^2:


In [ ]:
lm_r2 =  lm.score(X, Y)
print('The R-square is: ', lm_r2) # Y = df["Turnover per month"], X = df["Sales Pkg"]

We can say that \~91.69% of the variation of the turnover is explained by this simple linear model lm using Sales Pkg as an independent variable.


Let's calculate the MSE:


We already have Y_hat (predicted turnover) where X is the input variable:


In [ ]:
print('The output of the first four predicted value is: ', Y_hat[0:4])

Let's import the function <b>mean_squared_error</b> from the module <b>metrics</b>:


In [ ]:
from sklearn.metrics import mean_squared_error

We can compare the predicted results with the actual results:


In [ ]:
lm_mse = mean_squared_error(df['Turnover per month'], Y_hat)
print('The mean square error of turnover and predicted value is: ', lm_mse)

<h3><b>Model 2: Multiple Linear Regression (MLR)</b></h3>


Let's calculate the R^2:


In [ ]:
lm1_r2 = lm1.score(Z, Y)
print('The R-square is: ', lm1_r2) # Y = df["Turnover per month"], Z = df[['Avg Price Pkg', 'Units Pkg', 'Sales Pkg']]

We can say that \~94.35% of the variation of turnover is explained by this multiple linear regression "lm1".


Let's calculate the MSE.


We compare the predicted results (Y_hat1, calculated earlier) with the actual results:


In [ ]:
lm1_mse = Y_hat1
print('The mean square error of turnover and predicted value using lm1 is: ', \
      mean_squared_error(df['Turnover per month'], lm1_mse))

<h3><b>Model 3: Simple Polynomial Fit (SPF)</b></h3>


Let's calculate the R^2.


Let’s import the function <b>r2\_score</b> from the module <b>metrics</b> as we are using a different function.


In [ ]:
from sklearn.metrics import r2_score

We apply the function to get the value of R^2:


In [ ]:
spf_r2 = r2_score(Y, Y_hat_poly1)
print('The R-square value is: ', spf_r2)

We can say that \~91.44% of the variation of turnover is explained by this polynomial fit.


We can also calculate the MSE:


In [ ]:
spf_mse = mean_squared_error(df['Turnover per month'], Y_hat_poly1)
spf_mse

<h3><b>Model 4: Multiple Polynomial Fit (MPF)</b></h3>


Let's calculate the R^2.


In [ ]:
mpf_r2 = r2_score(Y, ypipe)
print('The R-square value is: ', mpf_r2) # ypipe = predicted Y

If the R-squared value is equal to 1, it means that all the variance in the dependent variable (y) can be explained by the independent variables (z) in the model (pipe). In other words, the model fits the data perfectly, with no errors.


We can also calculate the MSE:


In [ ]:
mpf_mse = mean_squared_error(df['Turnover per month'], ypipe)
mpf_mse

MSE is very close to zero, R-squared is 1, so we can fully agree that model "pipe" is perfect.


Let's visualize our results:


<table>
    <tr>
        <td></td>
        <th>SLR</th>
        <th>MLR</th>
        <th>SPF</th>
        <th>MPF</th>
    </tr>
        <th scope="row">R-squared</th>
        <td>0.9169</td> 
        <td>0.9435</td>
        <td>0.9144</td>
        <td>1.0000</td>
    <tr>
        <th scope="row">MSE</th>
        <td>1257876</td>
        <td>854612</td>
        <td>1295687</td>
        <td>3.82*10^(-23)</td>
    </tr>

</table>


It's clearly seen, that MPF is the best fitting model.


<h2><b>5. Prediction and Decision Making</b></h2>
<h3><b>Prediction</b></h3>

<p>In the previous section, we trained the model using the method <b>fit</b>. Now we will use the method <b>predict</b> to produce a prediction.


Create a new input:


In [ ]:
new_input = np.arange(1, 100, 1).reshape(-1, 1) # our made up sales packages values

Fit the model:


In [ ]:
lm.fit(X, Y) # X - Sales Pkg
lm

Produce a prediction:


In [ ]:
yhat = lm.predict(new_input)
yhat

We can plot the data:


In [ ]:
plt.plot(new_input, yhat)
plt.xlabel("Sales Pkg")
plt.ylabel("Turnover")
plt.show()

<h3><b>Decision Making</b>: Determining a Good Model Fit</h3>


<p>Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?
<ul>
    <li><i>What is a good R-squared value?</i></li>
</ul>
</p>

<p>When comparing models, <b>the model with the higher R-squared value is a better fit</b> for the data.
<ul>
    <li><i>What is a good MSE?</i></li>
</ul>
</p>

<p>When comparing models, <b>the model with the smallest MSE value is a better fit</b> for the data.</p>

<h4>Let's take a look at the values for the different models.</h4>
<p>Simple Linear Regression: Using Sales Pkg as a Predictor Variable of Turnover per month.
<ul>
    <li>R-squared: 0.9169272199026435</li>
    <li>MSE: 1257876.4091968017</li>
</ul>
</p>

<p>Multiple Linear Regression: Using 'Avg Price Pkg', 'Units Pkg', 'Sales Pkg' as Predictor Variables of Turnover per month.
<ul>
    <li>R-squared: 0.9435596184343996</li>
    <li>MSE: 854612.3581542918</li>
</ul>
</p>

<p>Simple Polynomial Fit: Using Sales Pkg as a Predictor Variable of Turnover per month.
<ul>
    <li>R-squared: 0.9144301021099479</li>
    <li>MSE: 1295687.41731205</li>
</ul>
</p>

<p>Multiple Polynomial Fit: Using 'Avg Price Pkg', 'Units Pkg', 'Sales Pkg' as a Predictor Variables of Turnover per month.
<ul>
    <li>R-squared: 1.0</li>
    <li>MSE: 3.828607614112173e-23 (~0)</li>
</ul>
</p>


<h3><b>Simple Linear Regression Model (SLR) vs Multiple Linear Regression Model (MLR)</b></h3>


<p>Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful and even act as noise. As a result, you should always check the MSE and R^2.</p>

<p>In order to compare the results of the MLR vs SLR models, we look at a combination of both the R-squared and MSE to make the best conclusion about the fit of the model.
<ul>
    <li><b>MSE</b>: The MSE of MLR is smaller.</li>
    <li><b>R-squared</b>: The R-squared of the SLR and the R-squared of the MLR are almost the same, but MLR has higher R-squared.</li>
</ul>
</p>

This R-squared in combination with the MSE show that MLR seems like the better model fit in this case compared to SLR.


<h3><b>Simple Linear Model (SLR) vs. Simple Polynomial Fit</b></h3>


<ul>
    <li><b>MSE</b>: MSE of SLR is smaller that MSE of SPF.</li> 
    <li><b>R-squared</b>: The R-squared of the SLR and the R-squared of the SPF are almost the same, but SLR has higher R-squared.</li>
</ul>
<p>SLR is better model that SPF.</p>


<h3><b>Simple Linear Regression (SLR) vs. Multiple Polynomial Fit</b></h3>


There is no doubt, that MPF is better than SLR, because MPF has perfect R-squared (1) abd perfect MSE (~0).


<h3><b>Multiple Linear Regression (MLR) vs. Simple Polynomial Fit</b></h3>


<ul>
    <li><b>MSE</b>: MSE of MLR is smaller that MSE of SPF.</li> 
    <li><b>R-squared</b>: The R-squared of the MLR and the R-squared of the SPF are almost the same, but MLR has higher R-squared.</li>
</ul>
<p>MLR is better model that SPF.</p>


<h3><b>Multiple Linear Regression (MLR) vs. Multiple Polynomial Fit</b></h3>


There is no doubt, that MPF is better than MLR, because MPF has perfect R-squared (1) abd perfect MSE (~0).


<h3><b>Multiple Polynomial Fit vs. Simple Polynomial Fit</b></h3>


There is no doubt, that MPF is better than SPF, because MPF has perfect R-squared (1) abd perfect MSE (~0).


<h2><b>Conclusion</b></h2>


<p>Comparing these three models, we conclude that the <b>MPF</b> model that uses these predictors such as 'Avg Price Pkg', 'Units Pkg', 'Sales Pkg' is the best model to be able to predict turnover from our dataset. </p>


### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/rosana_klym?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0KT0EN3043-2023-01-01">Rosana Klym</a>

## Contributors
<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0KT0EN3043-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0KT0EN3043-2023-01-01">Olga Kavun</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                            |
| ----------------- | ------- | ---------- | --------------------------------------------- |
| 2023-04-25       | 2.0    | Rosana    | Changed url of csv                            |
| 2023-05-03      | 2.1     | Rosana    | Changed year to 2023 below                    |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. <h3/>
