<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Motorcycle sales analysis

# *Lab 4. Data analysis with Python*
Estimated time needed: **1 hour**

## Objectives

After completing this lab you will be able to:

*   Develop prediction models


<details><summary><b style="font-size: 2em; font-weight: bold;">Click here to see content, description of dataset, source of dataset and licence</b></summary>
<br/>
    <b style="font-size: 1.2em; font-weight: bold;">Content</b>
<p>You work in the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.</p>

<b style="font-size: 1.2em; font-weight: bold;">Dataset Glossary (Column-wise)</b>
<ul>
    <li>Date<p>The date on which the client bought the products.</p></li>
    <li>Warehouse<p>The warehouse location.</p></li>
    <li>Client type<p>How the client bought the products. This column can only be Retail or Wholesale.</p></li>
    <li>Product line<p>The name of the product (a motorcycle part).</p></li>
    <li>Quantity<p>The number of units bought.</p></li>
    <li>Unit price<p>The cost of one unit of the product.</p></li>
    <li>Total<p>The total purchase price.</p></li>
    <li>Payment<p>The method of payment for the purchase. This dataset has three types of payment: credit card, cash, or transfer.</p></li>
</ul>

<b style="font-size: 1.2em; font-weight: bold;">Target field</b>
<ul>
    <li>Total</li>
</ul>

<b style="font-size: 1.2em; font-weight: bold;">Data source and licence</b>
    <p>The Motorcycle sales analysis Dataset is an online source, and it is in a CSV (comma separated value) format. Let's use this dataset as an example to practice data reading.</p>

<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX04ZPEN2976-2023-01-01">https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis</a></li>
    <li>Data type: csv</li>
    <li>Licence: <a href="https://creativecommons.org/publicdomain/zero/1.0/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX04ZPEN2976-2023-01-01">CC0: Public Domain</a></li>
</ul>
<p>
This dataset is released under the CC0: Public Domain license. The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
</p>
</details>


<p>In this section, we will develop several models that predict the total cost of a purchase from the other variables (features). The predictions are only estimates, but they should give us an objective idea of what turnover to expect.</p>


<p>In data analytics, we often use <b>Model Development</b> to help us predict future observations from the data we have.</p>

<p>A model will help us understand the exact relationship between different variables and how these variables are used to predict the result.</p>


<b style="font-size: 1.5em; font-weight: bold;">Table of Contents</b>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#id1">Linear Regression and Multiple Linear Regression</a></li>
    <li><a href="#id2">Model Evaluation Using Visualization</a></li>
    <li><a href="#id3">Polynomial Regression and Pipelines</a></li>
    <li><a href="#id4">Measures for In-Sample Evaluation</a></li>
    <li><a href="#id5">Prediction and Decision Making</a></li>
</ol>

</div>

<hr>


<b style="font-size: 1.5em; font-weight: bold;">Setup</b>


Import libraries:


In [ ]:
# If you run the lab locally using Anaconda, you can install the library versions used in this lab by uncommenting the following lines:
#! mamba install pandas==1.3.3 -y
#! mamba install numpy==1.21.2 -y
#! mamba install scikit-learn==0.20.1 -y

! mamba install scikit-learn -y

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OrdinalEncoder, PolynomialFeatures, StandardScaler
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import set_config
set_config(display="diagram")
warnings.filterwarnings('ignore')

Load the data and store it in dataframe `df`:


In [ ]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04ZPEN/new_motorcycles.csv"
df = pd.read_csv(path)
df.head()

<b style="font-size: 1.5em; font-weight: bold;">Change data types</b>


<p>First, we need to correct the data types in the dataset. All columns that hold text values (dtype 'object') should be converted to the 'category' type.</p>


In [ ]:
columns = df.columns[df.dtypes=='object']
columns

In [ ]:
df[columns] = df[columns].astype('category')
df.dtypes

<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id1"><font color="black">1. Linear Regression and Multiple Linear Regression</font></a></b>


<b style="font-size: 1em; font-weight: bold;">Linear Regression</b>


<p>One example of a Data  Model that we will be using is:</p>
<b>Simple Linear Regression</b>

<br>
<p>Simple Linear Regression is a method to help us understand the relationship between two variables:</p>
<ul>
    <li>The predictor/independent variable (X)</li>
    <li>The response/dependent variable (that we want to predict)(Y)</li>
</ul>

<p>The result of Linear Regression is a <b>linear function</b> that predicts the response (dependent) variable as a function of the predictor (independent) variable.</p>


$$
\begin{aligned}
Y &: \text{Response Variable}\\
X &: \text{Predictor Variable}
\end{aligned}
$$


<b>Linear Function</b>
$$
Yhat = a + b  X
$$


<ul>
    <li>a refers to the <b>intercept</b> of the regression line, in other words: the value of Y when X is 0</li>
    <li>b refers to the <b>slope</b> of the regression line, in other words: the value with which Y changes when X increases by 1 unit</li>
</ul>
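To make the roles of a and b concrete, here is a minimal sketch that computes them with the closed-form least-squares formulas in plain Python. The data points are made up for illustration and are not taken from the motorcycle dataset, and `fit_simple_linear_regression` is a hypothetical helper name.

```python
def fit_simple_linear_regression(x, y):
    """Return (a, b) of the least-squares line Yhat = a + b * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    a = mean_y - b * mean_x
    return a, b


# Made-up data that lies exactly on the line Total = 2 + 3 * Quantity
quantity = [1, 2, 3, 4, 5]
total = [5, 8, 11, 14, 17]

a, b = fit_simple_linear_regression(quantity, total)
print("intercept a =", a)
print("slope b =", b)
```

Because the sample data is exactly linear, the sketch recovers the intercept 2 and the slope 3. With real data the points scatter around the line, and a and b describe the best straight-line approximation.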


<b style="font-size: 1em; font-weight: bold;">The <code>LinearRegression</code> class was already imported in the Setup section.</b>


<b style="font-size: 1em; font-weight: bold;">Create the linear regression object:</b>


In [ ]:
lm1 = LinearRegression()
lm1

<b style="font-size: 1em; font-weight: bold;">How could "Quantity" help us predict total price?</b>


For this example, we want to look at how the quantity of products bought can help us predict the total cost of a purchase.
Using simple linear regression, we will create a linear function with "Quantity" as the predictor variable and "Total" as the response variable.


In [ ]:
X = df[['Quantity']]
Y = df['Total']

Fit the linear model using quantity:


In [ ]:
lm1.fit(X,Y)

We can output a prediction:


In [ ]:
Yhat1 = lm1.predict(X)

<p>Let's see the first five predicted values and real values:</p>


In [ ]:
print("PREDICTED VALUES")
Yhat1[0:5]

In [ ]:
print("REAL VALUES")
Y[0:5]

<b style="font-size: 1em; font-weight: bold;">What is the value of the intercept (a)?</b>


In [ ]:
lm1.intercept_

<b style="font-size: 1em; font-weight: bold;">What is the value of the slope (b)?</b>


In [ ]:
lm1.coef_

<b style="font-size: 1em; font-weight: bold;">What is the final estimated linear model we get?</b>


As we saw above, we should get a final linear model with the structure:


$$
Yhat = a + b  X
$$


Plugging in the actual values we get:


<b>Total</b> = -3.089298009617778 + 31.10189441 x <b>Quantity</b>


<p>Let's plot the predicted values and real values</p>


In [ ]:
plt.figure()
plt.title('Predicted values vs real values')
plt.xlabel('Index')
plt.ylabel('Total cost')
Y.plot()
plt.plot(Y.index,Yhat1)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #1</b>

<ul>
    <li><b style="font-size: 1.2em">Create a linear regression object called "lm1_1".</b></li>
    <li><b style="font-size: 1.2em">Train the model using 'Retail' as the independent variable and 'Total' as the dependent variable</b></li>
    <li><b style="font-size: 1.2em">Find the intercept (coefficient (a)) and the slope (coefficient (b))</b></li>
    <li><b style="font-size: 1.2em">Find the predicted values using <code>predict()</code> and save them in the variable <code>Yhat1_1</code>.</b></li>
    <li><b style="font-size: 1.2em">Print first five predicted values and real values</b></li>
    <li><b style="font-size: 1.2em">Plot the real and predicted values as you see above</b></li>
</ul>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
lm1_1 = LinearRegression()
lm1_1.fit(df[['Retail']],Y)
print('Coefficient a = ',lm1_1.intercept_)
print('Coefficient b = ',lm1_1.coef_)
print()
Yhat1_1 = lm1_1.predict(df[['Retail']])
print('PREDICTED VALUES')
print(Yhat1_1[0:5])
print()
print('REAL VALUES')
print(Y[0:5])
plt.figure()
plt.title('Predicted values vs real values')
plt.xlabel('Index')
plt.ylabel('Total cost')
Y.plot()
plt.plot(Y.index,Yhat1_1)
```

</details>


<b style="font-size: 1.5em; font-weight: bold;"> Multiple Linear Regression</b>


<p>What if we want to predict total cost of purchase using more than one variable?</p>

<p>If we want to use more variables in our model to predict total, we can use <b>Multiple Linear Regression</b>.
Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and <b>two or more</b> predictor (independent) variables.
Most real-world regression models involve multiple predictors. We will illustrate the structure by using four predictor variables, but these results generalize to any number of predictors:</p>


$$
\begin{aligned}
Y &: \text{Response Variable}\\
X_1 &: \text{Predictor Variable 1}\\
X_2 &: \text{Predictor Variable 2}\\
X_3 &: \text{Predictor Variable 3}\\
X_4 &: \text{Predictor Variable 4}
\end{aligned}
$$


$$
\begin{aligned}
a &: \text{intercept}\\
b_1 &: \text{coefficient of Variable 1}\\
b_2 &: \text{coefficient of Variable 2}\\
b_3 &: \text{coefficient of Variable 3}\\
b_4 &: \text{coefficient of Variable 4}
\end{aligned}
$$


The equation is given by:


$$
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
$$
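The equation above is simply the intercept plus a weighted sum of the predictor values. A minimal sketch in plain Python, assuming made-up coefficients and feature values (none of these numbers come from the dataset, and `predict_linear` is a hypothetical helper name):

```python
def predict_linear(a, coefficients, features):
    """Evaluate Yhat = a + b_1 * X_1 + ... + b_n * X_n for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("Need exactly one coefficient per feature")
    return a + sum(b_i * x_i for b_i, x_i in zip(coefficients, features))


# Hypothetical intercept, coefficients b_1..b_4 and predictor values X_1..X_4
a = 1.0
b = [2.0, 3.0, 4.0, 5.0]
x = [1.0, 0.0, 2.0, 1.0]

yhat = predict_linear(a, b, x)
print(yhat)
```

A fitted `LinearRegression` object performs the same calculation for every row when `predict()` is called, using its `intercept_` and `coef_` attributes.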


<p>From the previous section  we know that other good predictors of total could be:</p>
<ul>
    <li>Quantity</li>
    <li>Retail</li>
    <li>Transfer</li>
    <li>Product line</li>
    <li>Quantity binned</li>
    <li>Total ranged</li>
</ul>
Let's develop a model using these variables as the predictor variables.
But first we need to replace text values by numeric values in columns 'Product line', 'Quantity binned', 'Total ranged' using OrdinalEncoder


In [ ]:
# Run this block only once so that you do not lose your original data
enc = OrdinalEncoder()
df[['Product line','Quantity binned','Total ranged']] = enc.fit_transform(df[['Product line','Quantity binned','Total ranged']])
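To illustrate what ordinal encoding does, the following plain-Python sketch maps each distinct text value to a number and back again. It assumes the default behavior of sorting the distinct categories and numbering them from 0. The category labels are made up for illustration, and `ordinal_encode` and `ordinal_decode` are hypothetical helper names, not part of scikit-learn.

```python
def ordinal_encode(values):
    """Map each distinct value to 0.0, 1.0, 2.0, ... in sorted order."""
    categories = sorted(set(values))
    mapping = {category: float(index) for index, category in enumerate(categories)}
    return [mapping[value] for value in values], categories


def ordinal_decode(codes, categories):
    """Reverse the encoding by looking each code up in the category list."""
    return [categories[int(code)] for code in codes]


# Made-up product line labels
product_line = ["Engine", "Brakes", "Engine", "Suspension"]

codes, categories = ordinal_encode(product_line)
print(codes)       # [1.0, 0.0, 1.0, 2.0]
print(categories)  # ['Brakes', 'Engine', 'Suspension']

restored = ordinal_decode(codes, categories)
print(restored == product_line)  # True
```

The numbers carry no meaning beyond identifying the category, so a linear model will treat them as if "Suspension" were twice "Engine". Keep this in mind when interpreting coefficients of encoded columns.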

In [ ]:
Z = df[df.columns[df.columns != 'Total']]

Fit the linear model using the six above-mentioned variables.


In [ ]:
lm2 = LinearRegression()
lm2.fit(Z, Y)

What is the value of the intercept(a)?


In [ ]:
lm2.intercept_

What are the values of the coefficients (b1, b2, b3, b4, b5, b6)?


In [ ]:
lm2.coef_

What is the final estimated linear model that we get?


As we saw above, we should get a final linear function with the structure:

$$
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 + b_5 X_5 + b_6 X_6
$$

What is the linear function we get in this example?


<b>Total</b> = -60.512043854954186 + 24.83509961 x <b>Quantity</b> + 32.84125557 x <b>Retail</b> - 14.02757299 x <b>Transfer</b> + 3.53041191 x <b>Product line</b> -3.53041191 x <b>Quantity binned</b> - 7.78875037 x <b>Total ranged</b>


<p>Let's see the first five predicted and real values</p>


In [ ]:
Yhat2 = lm2.predict(Z)
print('PREDICTED VALUES')
print(Yhat2[0:5])
print('REAL VALUES')
print(Y[0:5])
plt.figure()
plt.title('Predicted values vs real values')
plt.xlabel('Index')
plt.ylabel('Total cost')
Y.plot()
plt.plot(Y.index,Yhat2)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #2</b>

<ul>
    <li><b style="font-size: 1.2em">Create a linear regression object called "lm2_2".</b></li>
    <li><b style="font-size: 1.2em">Train the model using 'Quantity' and 'Transfer' as the independent variables and 'Total' as the dependent variable</b></li>
    <li><b style="font-size: 1.2em">Find the intercept (coefficient (a)) and the slopes (coefficients (b))</b></li>
    <li><b style="font-size: 1.2em">Using method <code>predict()</code> find <code>Yhat2_2</code></b></li>
    <li><b style="font-size: 1.2em">Print first five predicted values and real values</b></li>
    <li><b style="font-size: 1.2em">Plot the predicted and real values</b></li>
</ul>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
lm2_2 = LinearRegression()
lm2_2.fit(df[['Quantity' , 'Transfer']],Y)
print('Coefficient a: ',lm2_2.intercept_)
print('Coefficient b: ',lm2_2.coef_)
Yhat2_2 = lm2_2.predict(df[['Quantity' , 'Transfer']])
print('PREDICTED VALUES')
print(Yhat2_2[0:5])
print('REAL VALUES')
print(Y[0:5])
plt.figure()
plt.title('Predicted values vs real values')
plt.xlabel('Index')
plt.ylabel('Total cost')
Y.plot()
plt.plot(Y.index,Yhat2_2)
```
</details>


<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id2"><font color="black"> 2. Model Evaluation Using Visualization</font></a></b>


Now that we've developed some models, how do we evaluate our models and choose the best one? One way to do this is by using a visualization.


The visualization package seaborn was already imported in the Setup section as `sns`.


<b style="font-size: 1.5em; font-weight: bold;">Residual Plot</b>

<p>A good way to visualize the variance of the data is to use a residual plot.</p>

<p>What is a <b>residual</b>?</p>

<p>The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). The residual is the vertical distance from the data point to the fitted regression line.</p>

<p>So what is a <b>residual plot</b>?</p>

<p>A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.</p>

<p>What do we pay attention to when looking at a residual plot?</p>

<p>We look at the spread of the residuals:</p>

<p>- If the points in a residual plot are <b>randomly spread out around the x-axis</b>, then a <b>linear model is appropriate</b> for the data.

Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.</p>
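As a small numeric illustration of the definition above, the following sketch computes residuals by hand. The four data points are made up, and the intercept and slope are the least-squares values for exactly these points.

```python
# Made-up observations
x = [1, 2, 3, 4]
y = [5.5, 7.5, 11.5, 13.5]

# Least-squares line for the points above: Yhat = 2.5 + 2.8 * X
a, b = 2.5, 2.8

yhat = [a + b * xi for xi in x]

# Residual e = observed value minus predicted value
residuals = [yi - yhi for yi, yhi in zip(y, yhat)]

for xi, ei in zip(x, residuals):
    print("x = {}, residual = {:+.1f}".format(xi, ei))

print("sum of residuals = {:.6f}".format(abs(sum(residuals))))
```

For a least-squares line with an intercept the residuals always sum to zero, so the sum alone says nothing about the fit. What matters in a residual plot is whether the residuals show a pattern along the x-axis.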


In [ ]:
width = 12
height = 10
plt.figure(figsize = (width, height))
sns.residplot(x = df['Quantity'],y = Y)
plt.show()

<i>What is this plot telling us?</i>

<p>We can see from this residual plot that the residuals are not randomly spread around the x-axis, leading us to believe that maybe a non-linear model is more appropriate for this data.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #3</b>
<br>
<b style="font-size: 1.2em">Use a residual plot to display the relationship between 'Product line' and 'Total'</b>
</div>


In [ ]:
#Write your code here


<details><summary>Click here for the solution</summary>

```python
width = 12
height = 10
plt.figure(figsize = (width, height))
sns.residplot(x = df['Product line'],y = Y)
plt.show()
```
</details>


<b style="font-size: 1.5em; font-weight: bold;"> Visualize Multiple Linear Regression</b>


<p>How do we visualize a model for Multiple Linear Regression? This gets a bit more complicated because you can't visualize it with a regression plot or a residual plot.</p>

<p>One way to look at the fit of the model is by looking at the <b>distribution plot</b>. We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.</p>


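Recall that `Yhat2` holds the predictions of the multiple linear regression model fitted above. We plot its distribution together with the distribution of the actual values: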


In [ ]:
plt.figure(figsize = (width, height))


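# Kernel density estimates of the actual and the fitted values on the same axes
# (sns.distplot is deprecated; sns.kdeplot draws the same curve as distplot with hist = False)
ax1 = sns.kdeplot(x = Y, color = "r", label = "Actual Value")
sns.kdeplot(x = Yhat2, color = "b", label = "Predicted values", ax = ax1)


plt.title('Actual vs Fitted Values for Total')
plt.xlabel('Total')
plt.ylabel('Density')
plt.legend()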

plt.show()
plt.close()

<p>We can see that the fitted values are reasonably close to the actual values since the two distributions overlap a bit. However, there is definitely some room for improvement.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #4</b>
<br>
<b style="font-size: 1.2em">Use a distribution plot to compare the actual values of 'Total' with the values predicted from 'Quantity' and 'Transfer'</b>
    <p style="font-size: 1.2em">Hint: you should already have the predicted values for these columns from Question #2</p>
</div>


In [ ]:
#Write your code here


<details><summary>Click here for the solution</summary>

```python
plt.figure(figsize = (width, height))


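# Kernel density estimates of the actual and the fitted values on the same axes
# (sns.distplot is deprecated; sns.kdeplot draws the same curve as distplot with hist = False)
ax1 = sns.kdeplot(x = Y, color = "r", label = "Actual Value")
sns.kdeplot(x = Yhat2_2, color = "b", label = "Predicted values", ax = ax1)


plt.title('Actual vs Fitted Values for Total')
plt.xlabel('Total')
plt.ylabel('Density')
plt.legend()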

plt.show()
plt.close()
```
</details>


<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id3"><font color="black"> 3. Polynomial Regression and Pipelines</font></a></b>


<p><b>Polynomial regression</b> is a particular case of the general linear regression model or multiple linear regression models.</p> 
<p>We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.</p>

<p>There are different orders of polynomial regression:</p>


<center><b>Quadratic - 2nd Order</b></center>
$$
Yhat = a + b_1 X +b_2 X^2 
$$

<center><b>Cubic - 3rd Order</b></center>
$$
Yhat = a + b_1 X +b_2 X^2 +b_3 X^3
$$

<center><b>Higher-Order</b>:</center>
$$
Yhat = a + b_1 X +b_2 X^2 +b_3 X^3 + \ldots
$$


<p>We saw earlier that a linear model did not provide the best fit while using "quantity" as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.</p>


<p>We will use the following function to plot the data:</p>


In [ ]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on an evenly spaced grid covering the observed range
    x_new = np.linspace(independent_variable.min(), independent_variable.max(), 100)
    y_new = model(x_new)

    # Data points as dots, fitted polynomial as a line
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Total ~ {:}'.format(Name))
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Total cost')

    plt.show()
    plt.close()

Let's get the variable:


In [ ]:
x = df['Quantity']

Let's fit the polynomial using the function <b>polyfit</b>, then use the function <b>poly1d</b> to display the polynomial function.


In [ ]:
# Here we use a polynomial of the 6th order
f = np.polyfit(x, Y, 6)
p = np.poly1d(f)
print(p)
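`np.polyfit` returns the coefficients with the highest power first, and `np.poly1d` evaluates the polynomial they define. The following plain-Python sketch shows that evaluation step using Horner's method. The coefficients are made up for illustration, and `evaluate_polynomial` is a hypothetical helper name.

```python
def evaluate_polynomial(coefficients, x):
    """Evaluate a polynomial whose coefficients are ordered highest power first."""
    result = 0.0
    for coefficient in coefficients:
        # Horner's method: multiply the running value by x, then add the next coefficient
        result = result * x + coefficient
    return result


# Made-up 2nd-order polynomial: 1 * x^2 - 2 * x + 3
coefficients = [1.0, -2.0, 3.0]

value = evaluate_polynomial(coefficients, 2.0)
print(value)  # 2^2 - 2 * 2 + 3 = 3.0
```

The last coefficient plays the role of the intercept a, and the remaining ones correspond to b_1, b_2, ... in reverse order.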

Let's plot the function:


In [ ]:
PlotPolly(p, x, Y, 'Quantity')

In [ ]:
plt.figure()
plt.title('Predicted values vs real values')
plt.xlabel('Index')
plt.ylabel('Total cost')
Y.plot()
plt.plot(Y.index,p(x))

<p>We can already see from the plot that this polynomial model performs almost the same as the multiple linear regression and simple linear regression models.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #5</b>
<br>
<b style="font-size: 1.2em">Create a 3rd-order polynomial model with 'Product line' as x and 'Total' as y</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# Here we use a polynomial of the 3rd order 
f1 = np.polyfit(df['Product line'], Y, 3)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1,df['Product line'], Y, 'Products')

```

</details>


<p>The analytical expression for Multivariate Polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:</p>


$$
Yhat = a + b_1 X_1 +b_2 X_2 +b_3 X_1 X_2+b_4 X_1^2+b_5 X_2^2
$$


We can perform a polynomial transform on multiple features.


We create a <b>PolynomialFeatures</b> object of degree 2:


In [ ]:
pr = PolynomialFeatures(degree=2)
pr

In [ ]:
Z_pr = pr.fit_transform(Z)

In the original data, there are 1000 samples and 6 features.


In [ ]:
Z.shape

After the transformation, there are 1000 samples and 28 features.


In [ ]:
Z_pr.shape
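The jump from 6 to 28 features can be checked by enumerating the polynomial terms. The sketch below is a plain-Python illustration of which terms a degree-2 transform generates; it is not how scikit-learn implements it, and `polynomial_terms` is a hypothetical helper name.

```python
from itertools import combinations_with_replacement
from math import comb


def polynomial_terms(feature_names, degree, include_bias=True):
    """List every product of features up to the given total degree."""
    terms = []
    lowest_degree = 0 if include_bias else 1
    for d in range(lowest_degree, degree + 1):
        for combo in combinations_with_replacement(feature_names, d):
            # The empty combination (d = 0) is the constant bias term
            terms.append(" * ".join(combo) if combo else "1")
    return terms


# Two features, degree 2: the six terms of the equation above
two_feature_terms = polynomial_terms(["X1", "X2"], 2)
print(two_feature_terms)

# Six features, degree 2
six_features = ["X{}".format(i) for i in range(1, 7)]
with_bias = polynomial_terms(six_features, 2)
without_bias = polynomial_terms(six_features, 2, include_bias=False)
print(len(with_bias), len(without_bias))
```

With n features and degree d the number of terms, including the bias, is the binomial coefficient C(n + d, d). For n = 6 and d = 2 that is `comb(8, 2)`, which equals 28. Leaving out the bias column gives 27.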

<b style="font-size: 1.5em; font-weight: bold;"> Pipeline</b>


<p>Data Pipelines simplify the steps of processing the data. We use the module <b>Pipeline</b> to create a pipeline. We also use <b>StandardScaler</b> as a step in our pipeline.</p>


We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.


In [ ]:
Input = [('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]

We input the list as an argument to the pipeline constructor:


In [ ]:
pipe = Pipeline(Input)
pipe

First, we convert Z to type float to avoid conversion warnings that may appear because StandardScaler works with float inputs.

Then, we can normalize the data, perform the polynomial transform and fit the model in a single step.


In [ ]:
Z = Z.astype(float)
pipe.fit(Z,Y)
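The scaling step of the pipeline standardizes each column to zero mean and unit variance. A minimal plain-Python sketch of that calculation for one column, using made-up values and the population standard deviation (`standardize` is a hypothetical helper name):

```python
from statistics import fmean, pstdev


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(value - mean) / std for value in values]


# Made-up quantities with mean 5 and population standard deviation 2
quantity = [2, 4, 4, 4, 5, 5, 7, 9]

scaled = standardize(quantity)
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Inside a pipeline the mean and standard deviation are learned during `fit()` and reused during `predict()`, so new data is scaled with the statistics of the training data.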

Similarly, we can normalize the data, perform the transform and produce a prediction in a single step.


In [ ]:
ypipe = pipe.predict(Z)
print('PREDICTED VALUES')
print(ypipe[0:5])
print('REAL VALUES')
print(Y[0:5])

In [ ]:
plt.figure()
plt.title('Predicted values vs real values')
plt.xlabel('Index')
plt.ylabel('Total cost')
Y.plot()
plt.plot(Y.index,ypipe)

<p>The plot suggests that this model fits the data better than the previous three.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #6</b>
<br>
<b style="font-size: 1.2em">Create a pipeline as above and use the features 'Quantity' and 'Transfer' and target Y.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
Input1 = [('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias = False)), ('model',LinearRegression())]
Z1 = df[['Quantity','Transfer']]
pipe1 = Pipeline(Input1)
Z1 = Z1.astype(float)
pipe1.fit(Z1,Y)

ypipe1 = pipe1.predict(Z1)
print('PREDICTED VALUES')
print(ypipe1[0:5])
print('REAL VALUES')
print(Y[0:5])
plt.figure()
plt.title('Predicted values vs real values')
plt.xlabel('Index')
plt.ylabel('Total cost')
Y.plot()
plt.plot(Y.index,ypipe1)
```

</details>


<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id4"><font color="black">4. Measures for In-Sample Evaluation</font></a></b>


<p>When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.</p>

<p>Two very important measures that are often used in Statistics to determine the accuracy of a model are:</p>
<ul>
    <li><b>$R^{2}$ / R-squared</b></li>
    <li><b>Mean Squared Error (MSE)</b></li>
</ul>

<b>R-squared</b>

<p>R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line.</p>

<p>The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.</p>

<b>Mean Squared Error (MSE)</b>

<p>The Mean Squared Error is the average of the squared errors, where an error is the difference between the actual value (y) and the estimated value (ŷ).</p>
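Both measures can be computed by hand from the actual and the predicted values. The sketch below uses made-up numbers to show the two formulas; `mse_manual` and `r_squared_manual` are hypothetical helper names that mirror what `mean_squared_error` and `r2_score` report.

```python
def mse_manual(y, yhat):
    """Average of the squared differences between actual and predicted values."""
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, yhat)) / len(y)


def r_squared_manual(y, yhat):
    """1 minus (sum of squared errors / total sum of squares around the mean)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - yhi) ** 2 for yi, yhi in zip(y, yhat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values
y = [3, 5, 7, 9]
yhat = [2.5, 5.5, 6.5, 9.5]

mse = mse_manual(y, yhat)
r2 = r_squared_manual(y, yhat)
print("MSE = {:.2f}".format(mse))  # 0.25
print("R^2 = {:.2f}".format(r2))   # 0.95
```

MSE is expressed in squared units of the target, so it can only be compared between models that predict the same variable. R-squared is unitless, and a value of 1 means the predictions match the actual values exactly.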


<b style="font-size: 1.5em; font-weight: bold;">We have four models:</b>
<ul>
    <li>Simple Linear Regression (SLR)</li>
    <li>Multiple Linear Regression (MLR)</li>
    <li>Polynomial Regression (PR)</li>
    <li>Multiple Polynomial Regression (MPR)</li>
</ul>
<b style="font-size: 1.5em; font-weight: bold;">Let's calculate $R^{2}$ and MSE for each model</b>



In [ ]:
# Find R^2 and MSE for each of the four models
R2_1 = lm1.score(X, Y)
mse1 = mean_squared_error(Y, Yhat1)
R2_2 = lm2.score(Z, Y)
mse2 = mean_squared_error(Y, Yhat2)
R2_3 = r2_score(Y, p(x))
mse3 = mean_squared_error(Y, p(x))
R2_4 = r2_score(Y, ypipe)
mse4 = mean_squared_error(Y, ypipe)
# Print the table header first, then one row per model
print('\tR^2\tMSE')
print("SLR\t{:.3f}\t{:.2f}\nMLR\t{:.3f}\t{:.2f}\nPR\t{:.3f}\t{:.2f}\nMPR\t{:.3f}\t{:.2f}".format(R2_1,mse1,R2_2,mse2,R2_3,mse3,R2_4,mse4))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #7</b>
<br>
<b style="font-size: 1.2em">Find $R^{2}$ and MSE for linear regression (between Retail and Total), multiple regression (between Quantity, Transfer and Total), polynomial regression (between Product line and Total), multiple polynomial regression (between Quantity, Transfer and Total) for models which you created at Question #1, Question #2, Question #5 and Question #6. Display them as table from the example above</b>
</div>


In [ ]:
#Write your code here


<details><summary>Click here for the solution</summary>

```python
R2_1_1 = lm1_1.score(df[['Retail']], Y)
mse1_1 = mean_squared_error(Y, Yhat1_1)
R2_2_2 = lm2_2.score(df[['Quantity','Transfer']], Y)
mse2_2 = mean_squared_error(Y, Yhat2_2)
R2_3_3 = r2_score(Y, p1(df['Product line']))
mse3_3 = mean_squared_error(Y, p1(df['Product line']))
R2_4_4 = r2_score(Y, ypipe1)
mse4_4 = mean_squared_error(Y, ypipe1)

# Print the table header first, then one row per model
print('\tR^2\tMSE')
print("SLR\t{:.3f}\t{:.2f}\nMLR\t{:.3f}\t{:.2f}\nPR\t{:.3f}\t{:.2f}\nMPR\t{:.3f}\t{:.2f}".format(R2_1_1,mse1_1,R2_2_2,mse2_2,R2_3_3,mse3_3,R2_4_4,mse4_4))
```

</details>


<b style="font-size: 1em; font-weight: bold;">And now we restore the original data in dataset using <code>inverse_transform()</code> from OrdinalEncoder()</b>


In [ ]:
df[['Product line','Quantity binned','Total ranged']] = enc.inverse_transform(df[['Product line','Quantity binned','Total ranged']])
df

<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id5"><font color="black">5. Prediction and Decision Making</font></a></b>
<p>In the previous sections, we trained the models using the method <b>fit</b>. Now we will use the method <b>predict</b> to produce a prediction for new input values. Let's use <b>pyplot</b> for plotting; we will also be using some functions from numpy.</p>


Create a new input:


In [ ]:
new_input=np.arange(df['Quantity'].mean(), df['Quantity'].max()+1, 1).reshape(-1, 1)

Produce a prediction:


In [ ]:
yhat1=lm1.predict(new_input)

We can plot the data:


In [ ]:
plt.title('Prediction values ')
plt.xlabel('New quantity values')
plt.ylabel('Total')
plt.plot(new_input, yhat1)
plt.show()

<b style="font-size: 1.5em; font-weight: bold;"> Decision Making: Determining a Good Model Fit</b>


<p>Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?
<ul>
    <li><i>What is a good R-squared value?</i></li>
</ul>
</p>

<p>When comparing models, <b>the model with the higher R-squared value is a better fit</b> for the data.
<ul>
    <li><i>What is a good MSE?</i></li>
</ul>
</p>

<p>When comparing models, <b>the model with the smallest MSE value is a better fit</b> for the data.</p>

<h4>Let's take a look at the values for the different models.</h4>
<p>Simple Linear Regression: Using Quantity as a Predictor Variable of Total.
<ul>
    <li>R-squared: 0.757</li>
    <li>MSE: 28901.21</li>
</ul>
</p>

<p>Multiple Linear Regression: Using all fields from dataset, except for 'Total' as Predictor Variables of Total.
<ul>
    <li>R-squared: 0.775</li>
    <li>MSE: 26830.47</li>
</ul>
</p>

<p>Polynomial Fit: Using Quantity as a Predictor Variable of Total.
<ul>
    <li>R-squared: 0.758</li>
    <li>MSE: 28758.81</li>
</ul>
</p>

<p>Multiple Polynomial Fit: Using all fields from dataset, except for 'Total', as Predictor Variables of Total.
<ul>
    <li>R-squared: 0.877</li>
    <li>MSE: 14635.21</li>
</ul>
</p>
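The two selection rules can be applied mechanically to the values listed above. A minimal sketch; the dictionary layout and variable names are chosen for illustration only:

```python
# R-squared and MSE values as reported above for the four models
results = {
    "Simple Linear Regression": {"r2": 0.757, "mse": 28901.21},
    "Multiple Linear Regression": {"r2": 0.775, "mse": 26830.47},
    "Polynomial Fit": {"r2": 0.758, "mse": 28758.81},
    "Multiple Polynomial Fit": {"r2": 0.877, "mse": 14635.21},
}

# Higher R-squared is better, lower MSE is better
best_by_r2 = max(results, key=lambda name: results[name]["r2"])
best_by_mse = min(results, key=lambda name: results[name]["mse"])

print("Best by R-squared:", best_by_r2)
print("Best by MSE:", best_by_mse)
```

Both criteria agree here. Note that these are in-sample measures: they describe how well each model fits the data it was trained on, not how it will perform on new purchases.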


<p>Let's see the result of all four models</p>


In [ ]:
figure, axis = plt.subplots(2, 2)

# Each panel shows the actual values of Total together with one model's predictions
axis[0, 0].plot(Y)
axis[0, 0].plot(Y.index, Yhat1)
axis[0, 0].set_title("Simple Linear Regression")

axis[0, 1].plot(Y)
axis[0, 1].plot(Y.index, Yhat2)
axis[0, 1].set_title("Multiple Linear Regression")
  
axis[1, 0].plot(Y)
axis[1, 0].plot(Y.index, p(x))
axis[1, 0].set_title("Polynomial Fit")
  
axis[1, 1].plot(Y)    
axis[1, 1].plot(Y.index, ypipe)
axis[1, 1].set_title("Multiple Polynomial Fit")
  
plt.show()

<b style="font-size: 1.5em; font-weight: bold;"> Conclusion</b>


<p>Comparing these four models, we conclude that <b>the Multiple Polynomial model is the best model</b> for predicting the total from our dataset. This result makes sense: the degree-2 polynomial transformation expands the six predictors into many more terms (28 in the transformation shown earlier), and we know that more than one of the original variables is a potential predictor of the final total cost of a purchase.</p>


### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/victor_dyrenko?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX04ZPEN2976-2023-01-01">Victor Dyrenko</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX04ZPEN2976-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX04ZPEN2976-2023-01-01">Olga Kavun</a>

## Change Log

| Date (YYYY-MM-DD) | Version |     Changed By   | Change Description                                         |
| ----------------- | ------- | ---------------- | ---------------------------------------------------------- |
| 2023-04-24        | 1       | Victor Dyrenko   | Finished lab                                               |


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
