<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0GJYEN/SN_web_lightmode.png?1677849873482" width="300" alt="cognitiveclass.ai logo">
</center>

# Investigation of BTC/BUSD cryptocurrency using ADOSC, NATR, TRANGE indicators, and other cryptocurrencies.


## Lab 4. Model development.

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Develop prediction models


<h3>Table of Contents</h3>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><u>Linear Regression and Multiple Linear Regression</u></li>
        <ul>
            <li><u>Linear Regression</u></li>
            <li><u>Multiple Linear Regression</u></li>
        </ul>
    <li><u>Model Evaluation Using Visualization</u></li>
    <li><u>Polynomial Regression and Pipelines</u></li>
        <ul>
            <li><u>Polynomial Regression</u></li>
            <li><u>Pipeline</u></li>
        </ul>
    <li><u> Measures for In-Sample Evaluation</u></li>
        <ul>
            <li><u> Simple Linear Regression</u></li>
            <li><u>Multiple Linear Regression</u></li>
            <li><u>Polynomial Fit</u></li>
        </ul>
</ol>

</div>

<hr>


#### ***Dataset description***

The dataset used in this lab contains time-series data on various attributes of Bitcoin (BTC) and other cryptocurrencies, aggregated at 1-minute intervals. The dataset index is the timestamp of each 1-minute period. The dataset also contains the average prices of other cryptocurrencies, together with binned (categorical) versions of those prices.

<hr>

**Attributes:**

* ***General:***
    * `open` - the opening price of a **BTC** during a specific time period.
    * `high` - the highest price of a **BTC** during a specific time period.
    * `low` - the lowest price of a **BTC** during a specific time period.
    * `close` - the closing price of a **BTC** during a specific time period.
    * `rec_count` - the number of records or data points in the dataset for a given time period.
    * `volume` - the total amount of trading activity (buying and selling) for a **BTC** during a specific time period.
    * `avg_price` - the average price of a **BTC** during a specific time period.


* ***Indicators***
    * `ADOSC` - an indicator used in technical analysis to measure the momentum of buying and selling pressure for ***Bitcoin***.
    * `NATR` - an indicator used in technical analysis to measure the volatility of ***Bitcoin***.
    * `TRANGE` - an indicator used in technical analysis to measure the range of prices (from high to low) for ***Bitcoin*** during a specific time period.


* ***Other cryptocurrencies:***
    * `ape_avg_price` - the average price of ***APE*** during a specific time period.
    * `bnb_avg_price` - the average price of ***BNB*** during a specific time period.
    * `doge_avg_price` - the average price of ***DOGE coin*** during a specific time period.
    * `eth_avg_price` - the average price of ***Ethereum*** during a specific time period.
    * `xrp_avg_price` - the average price of ***XRP*** during a specific time period.
    * `matic_avg_price` - the average price of ***MATIC*** during a specific time period.
* ***Categorical:***
    * `category` - binned ***average price for BTC(`avg_price`)*** to bins: `high`, `medium-high`, `medium`, `medium-low`, `low`.
    * `ape_category` - binned `ape_avg_price` to bins: `high`, `medium-high`, `medium`, `medium-low`, `low`.
    * `bnb_category` - binned `bnb_avg_price` to bins: `high`, `medium-high`, `medium`, `medium-low`, `low`.
    * `doge_category` - binned `doge_avg_price` to bins: `high`, `medium-high`, `medium`, `medium-low`, `low`.
    * `eth_category` - binned `eth_avg_price` to bins: `high`, `medium-high`, `medium`, `medium-low`, `low`.
    * `xrp_category` - binned `xrp_avg_price` to bins: `high`, `medium-high`, `medium`, `medium-low`, `low`.
    * `matic_category` - binned `matic_avg_price` to bins: `high`, `medium-high`, `medium`, `medium-low`, `low`.
    
<hr>

*The indicators `ADOSC`, `NATR`, and `TRANGE` are used in technical analysis to provide insights into the momentum, volatility, and price ranges of financial instruments or assets. The other attributes represent the average prices of different cryptocurrencies during a specific time period.*

<hr>

<h4>Setup</h4>


In [ ]:
# uncomment these lines if you use pip
# !pip install pandas
# !pip install --upgrade numpy
# !pip install matplotlib
# !pip install --upgrade scipy
# !pip install seaborn
# !pip install scikit-learn
# !pip install statsmodels

In [ ]:
!mamba install pandas -y
!mamba install numpy -y
!mamba install matplotlib -y
!mamba install scipy -y
!mamba install seaborn -y 
!mamba install statsmodels -y
!mamba install scikit-learn -y 

Import libraries:


In [ ]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

%matplotlib inline 

Create an array with the names of the other currencies so that all columns related to them are easy to access.

In [ ]:
currencies = np.array(['ape', 'eth', 'bnb', 'xrp', 'doge', 'matic'])

In [ ]:
file_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0GJYEN/BTCBUSD_1min_categories.csv'

In [ ]:
df = pd.read_csv(file_path, index_col='ts')

# converting index to datetime
df.index = pd.to_datetime(df.index)
df.head()

In [ ]:
# create an array of column names for the other cryptocurrencies, in the format <name>_avg_price
currencies_avg_price = np.char.add(currencies, '_avg_price')
currencies_avg_price

In [ ]:
corr = df[np.append(currencies_avg_price, 'avg_price')].corr()
corr

In [ ]:
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True)

## 1. Linear Regression and Multiple Linear Regression


### Linear Regression


One example of a data model that we will be using is:

* **Simple Linear Regression**


Simple Linear Regression is a method to help us understand the relationship between two variables:

* The predictor/independent variable (X)
* The response/dependent variable (Y), which we want to predict


The result of Linear Regression is a **linear function** that predicts the response (dependent) variable as a function of the predictor (independent) variable.


$$
\begin{aligned}
&Y: \text{Response Variable} \\
&X: \text{Predictor Variable}
\end{aligned}
$$


**Linear Function:**

$$
\hat Y = a + b  X
$$


<ul>
    <li>a refers to the <b>intercept</b> of the regression line, in other words: the value of Y when X is 0</li>
    <li>b refers to the <b>slope</b> of the regression line, in other words: the value with which Y changes when X increases by 1 unit</li>
</ul>
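
To make the roles of a and b concrete, here is a minimal sketch that computes both with the closed-form least-squares formulas. The arrays `x` and `y` are made-up illustrative values, not data from this lab's dataset.

```python
import numpy as np

# Hypothetical toy data (not from the BTC dataset): y = 2 + 3x exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Closed-form least squares for a single predictor:
# b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```

Because the toy data lie exactly on a line, the recovered intercept and slope match the values used to generate them. `LinearRegression` solves the same least-squares problem for us.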


##### Create the linear regression object:


In [ ]:
lm = LinearRegression()
lm

<h4>How could other cryptocurrencies help us predict the average price of our main cryptocurrency (BTC)?</h4>


In the previous lab we found that **DOGE and APE** are more likely to affect our main currency (BTC) than the other currencies.
So we will use these values for our linear regression models.

The correlation matrix also suggests that **DOGE** may be a good predictor of the average price, since it indicates a linear relationship between the two. That is why we will use the **DOGE average price for our linear regression model**. We may also use the **ETH, APE, and XRP** values for other models.

In [ ]:
# y is a target value column - the value we want to predict
y = df['avg_price']
# deleting target column from X dataset
X = df.drop('avg_price', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

Let's examine shapes:

In [ ]:
print(f' X train shape: {X_train.shape}\n X test shape: {X_test.shape}\n Y train shape: {y_train.shape}\n Y test shape: {y_test.shape}')
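
Since `shuffle=False` is passed to `train_test_split`, the rows keep their chronological order: the earliest 80% of observations go to the training set and the most recent 20% to the test set, so the model is never trained on data from the future. A minimal sketch of that behaviour, using a hypothetical list of ten observations in place of the real DataFrame:

```python
import math

# Hypothetical stand-in for ten consecutive 1-minute observations
observations = list(range(10))
test_size = 0.2

# With shuffle=False the last `test_size` share of rows becomes the test set
n_test = math.ceil(len(observations) * test_size)
train, test = observations[:-n_test], observations[-n_test:]

print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```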

Fit the linear model using `doge_avg_price`:


In [ ]:
lm.fit(X_train[['doge_avg_price']], y_train)

We can predict some average price values using the method `predict`. To do that, we pass testing data as a parameter into `predict` method.

In [ ]:
y_hat=lm.predict(X_test[['doge_avg_price']])
print(f"Data used to predict values: {X_test[['doge_avg_price']][0:5]}\n Predicted values: {y_hat[0:5]}")

#### What is the value of the intercept (a)?


In [ ]:
lm.intercept_

#### What is the value of the slope (b)?


In [ ]:
lm.coef_

#### What is the final estimated linear model we get?


As we saw above, we should get a final linear model with the structure:


$$
\hat Y = a + b  X
$$


Plugging in the actual values we get:



$$Price(BTC) = 12246.2325 - 51507.52254 X$$


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;"> Question #1 a):</b>

  <b>Create a linear regression object called "lm1".</b>
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

<details><summary>Click here for the solution</summary>

```python
lm1 = LinearRegression()
lm1
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;"> Question #1 b):</b>

  <b>Train the model using `eth_avg_price` as the independent variable and `avg_price` as the dependent variable.</b>
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click here for the solution</summary>

```python
lm1.fit(X_train[['eth_avg_price']], y_train)
lm1
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;"> Question #1 c):</b>

  <b>Find the slope and intercept of the model.</b>
    
</div>


#### **Slope**


In [ ]:
# Write your code below and press Shift+Enter to execute 

<details><summary>Click here for the solution</summary>
    
```python
# Slope(b)
lm1.coef_
```
</details>


#### **Intercept**


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# Intercept(a)
lm1.intercept_
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;"> Question #1 d):</b>

  <b>What is the equation of the predicted line? You can use x and yhat, or `eth_avg_price` and `avg_price`.</b>
    
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute 
# using X and Y  


<details><summary>Click here for the solution</summary>

```python
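# using X and Y: plug the fitted intercept and slope into Yhat = a + b*X
x = X_test['eth_avg_price']
y_hat_ = lm1.intercept_ + lm1.coef_[0] * x
# or, naming the variables after the columns
Price = lm1.intercept_ + lm1.coef_[0] * X_test['eth_avg_price']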

```

</details>


##### **LR with statsmodels**

Now let's create a linear regression model using `statsmodels`. The model from this library may look a bit more challenging to understand initially, but it has some advantages. Later, you will see the difference between `sklearn` and `statsmodels` linear regressions.

We will use **DOGE average price as the predictor** of average price.

Before we train the model, we need to add a column of ones to the inputs so that `statsmodels` calculates the intercept (called `a` above, often written `𝑏₀`). It does not include an intercept by default.

You can add the column of ones to `X_train` using `add_constant()` function, which takes `X_train` as input array and returns a new array with the column of ones inserted at the beginning.
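
To see why the column of ones matters, the following sketch solves the same least-squares problem with plain NumPy, once with and once without that column. The data are hypothetical, and `np.linalg.lstsq` is used here only as a stand-in for what an OLS fit computes.

```python
import numpy as np

# Hypothetical toy predictor and target: y = 5 + 2x exactly
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 5.0 + 2.0 * x

# Design matrix with a leading column of ones (the role add_constant plays)
X_with_const = np.column_stack([np.ones_like(x), x])
coef_with, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

# Without the column of ones the fitted line is forced through the origin
coef_without, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

print(coef_with)     # intercept and slope
print(coef_without)  # slope only
```

With the constant, the fit recovers the intercept 5 and slope 2. Without it, the only free parameter is the slope, which is pulled away from 2 to compensate for the missing intercept.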

In [ ]:
# The current version of statsmodels emits a FutureWarning that we can ignore (it originates in the library and may be fixed in a future release)
warnings.filterwarnings('ignore', category=FutureWarning)

X_train_stats = sm.add_constant(X_train["doge_avg_price"])
X_train_stats

Now, let's create the linear regression model and train it.

In [ ]:
# create model object
model = sm.OLS(y_train, X_train_stats)
# train the model
stats_lm = model.fit()

Add column of ones to `X_test` as well.

In [ ]:
X_test_stats = sm.add_constant(X_test["doge_avg_price"])
X_test_stats

Predict values of average price using `predict` method:

In [ ]:
print('Predicted values:\n')
stats_lm.predict(X_test_stats)

In [ ]:
# Set future warnings back to default
warnings.filterwarnings('default', category=FutureWarning)

### Multiple Linear Regression


<p>What if we want to predict average price using more than one variable?</p>

<p>If we want to use more variables in our model to predict average price, we can use <b>Multiple Linear Regression</b>.
Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and <b>two or more</b> predictor (independent) variables.
Most real-world regression models involve multiple predictors. We will illustrate the structure with the first four predictor variables written out, but it generalizes to any number of predictors:</p>


$$
\begin{aligned}
&Y: \text{Response Variable} \\
&X_1: \text{Predictor Variable 1} \\
&X_2: \text{Predictor Variable 2} \\
&\dots \\
&X_n: \text{Predictor Variable } n
\end{aligned}
$$


$$
\begin{aligned}
&a: \text{intercept} \\
&b_1: \text{coefficient of Variable 1} \\
&b_2: \text{coefficient of Variable 2} \\
&\dots \\
&b_n: \text{coefficient of Variable } n
\end{aligned}
$$


The equation is given by:


$$
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 + ... + b_n X_n
$$
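
The equation above is simply the intercept plus a dot product between the coefficients and the predictor values. A minimal sketch with made-up numbers (the coefficients and observations below are hypothetical, not fitted values from this lab):

```python
import numpy as np

# Hypothetical intercept, coefficients, and two observations with three predictors each
a = 1.0
b = np.array([0.5, -2.0, 3.0])
X = np.array([[2.0, 1.0, 0.0],
              [4.0, 0.0, 1.0]])

# Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3, evaluated for every row at once
y_hat = a + X @ b
print(y_hat)
```

The first row gives 1 + 0.5·2 − 2·1 + 3·0 = 0 and the second gives 1 + 0.5·4 − 2·0 + 3·1 = 6.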


<p>From the previous section  we know that other good predictors of price could be:</p>
<ul>
    <li>DOGE</li>
    <li>ETH</li>
    <li>XRP</li>
</ul>
Let's develop a model using these variables as the predictor variables.


In [ ]:
M_train = X_train[['doge_avg_price', 'eth_avg_price', 'xrp_avg_price']]
M_test = X_test[['doge_avg_price', 'eth_avg_price', 'xrp_avg_price']]

Create model:

In [ ]:
mult_lm = LinearRegression()

Fit the linear model using the three above-mentioned variables.


In [ ]:
mult_lm.fit(M_train, y_train)

What is the value of the intercept(a)?


In [ ]:
mult_lm.intercept_

What are the values of the coefficients (b1, b2, b3)?


In [ ]:
mult_lm.coef_

What is the final estimated linear model that we get?


In [ ]:
mult_lm_yhat = mult_lm.predict(M_test)
mult_lm_yhat 

As we saw above, we should get a final linear function with the structure:

$$
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3
$$

What is the linear function we get in this example?


$$Price = 8759.0594 + 9.0944 DOGE + 3.7751 ETH + 6.6993 XRP $$


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question  #2 a):</b>

  <b>Create and train a Multiple Linear Regression model `mult_lm2` where the response variable is `avg_price`, and the predictor variables are `eth_avg_price` and `ape_avg_price`.</b>
    
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click here for the solution</summary>

```python
mult_lm2 = LinearRegression()
mult_lm2.fit(X_train[["eth_avg_price", "ape_avg_price"]], y_train)


```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;"> Question  #2 b):</b>

  <b>Find the coefficient of the model.</b>
    
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click here for the solution</summary>

```python
mult_lm2.coef_

```

</details>


## 2. Model Evaluation Using Visualization


Now that we've developed some models, how do we evaluate our models and choose the best one? One way to do this is by using a visualization.


### Regression Plot


<p>When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using <b>regression plots</b>.</p>

<p>This plot shows a combination of scattered data points (a <b>scatterplot</b>) and the fitted <b>linear regression</b> line going through the data. This gives us a reasonable estimate of the relationship between the two variables, the strength of the correlation, and its direction (positive or negative correlation).</p>


Let's visualize **doge_avg_price** as potential predictor variable of price:


In [ ]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x='doge_avg_price', y='avg_price', data=df, line_kws={"color": "k"})
plt.ylim(df['avg_price'].min(),df['avg_price'].max())

One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data and whether a linear model would be the best fit or not.

Let's compare this plot to the regression plot of "xrp_avg_price".


In [ ]:
plt.figure(figsize=(width, height))
sns.regplot(x='xrp_avg_price', y="avg_price", data=df, line_kws={"color": "k"})
plt.ylim(df['avg_price'].min(),df['avg_price'].max())
# plt.xlim(df['xrp_avg_price'].min(),)

<p>Comparing the regression plot of "xrp_avg_price" and "doge_avg_price", we see that the points for "doge_avg_price" are closer to the generated line. </p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;"> Question #3:</b>

  <b>Given the regression plots above, is `doge_avg_price` or `xrp_avg_price` more strongly correlated with `avg_price`? Use the method  `.corr()` to verify your answer.</b>
    
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute 

<details><summary>Click here for the solution</summary>

```python
# Compare the `avg_price` column of the correlation matrix: the variable with the larger absolute coefficient is more strongly correlated with `avg_price`. You can verify it using the following command:

df[["doge_avg_price","xrp_avg_price","avg_price"]].corr()

```

</details>


#### Residual Plot

<p>A good way to visualize the variance of the data is to use a residual plot.</p>

<p>What is a <b>residual</b>?</p>

<p>The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.</p>

<p>So what is a <b>residual plot</b>?</p>

<p>A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.</p>

<p>What do we pay attention to when looking at a residual plot?</p>

<p>We look at the spread of the residuals:</p>

<p>- If the points in a residual plot are <b>randomly spread out around the x-axis</b>, then a <b>linear model is appropriate</b> for the data.

Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.</p>
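
As a small illustration of what a non-random residual pattern looks like, the sketch below fits a straight line to deliberately curved, made-up data and computes the residuals e = y − Yhat. The data are hypothetical and chosen only to make the pattern obvious.

```python
import numpy as np

# Hypothetical curved data: y = x^2, which a straight line cannot capture
x = np.arange(-3.0, 4.0)          # -3, -2, ..., 3
y = x ** 2

# Fit a straight line (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

print(np.round(residuals, 2))
```

The residuals are positive at both ends and negative in the middle. A systematic U-shape like this, rather than a random scatter around zero, is the signal that a linear model is missing structure in the data.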


In [ ]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(x=df['doge_avg_price'],y=df['avg_price'])
plt.show()

**What is this plot telling us?**

We can see from this residual plot that the residuals are not randomly spread around the x-axis, leading us to believe that maybe a non-linear model is more appropriate for this data.


### Multiple Linear Regression


<p>How do we visualize a model for Multiple Linear Regression? This gets a bit more complicated because you can't visualize it with regression or residual plot.</p>

<p>One way to look at the fit of the model is by looking at the <b>distribution plot</b>. We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.</p>


First, let's make a prediction:


In [ ]:
mult_y_hat = mult_lm.predict(M_test)
mult_y_hat

In [ ]:
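width = 15
height = 12

plt.figure(figsize=(width, height))

# overlay the two densities; the labels make the curves distinguishable
sns.kdeplot(mult_y_hat, label='Fitted values')
sns.kdeplot(y_test, label='Actual values')

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price')
plt.ylabel('Density')
plt.legend()

plt.show()
plt.close()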

<p>We can see that the fitted values are reasonably close to the actual values since the two distributions overlap a bit. However, there is definitely some room for improvement.</p>


## 3. Polynomial Regression and Pipelines


### **Polynomial Regression**

<p><b>Polynomial regression</b> is a particular case of the general linear regression model or multiple linear regression models.</p> 
<p>We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.</p>

<p>There are different orders of polynomial regression:</p>


<center><b>Quadratic - 2nd Order</b></center>
$$
Yhat = a + b_1 X +b_2 X^2 
$$

<center><b>Cubic - 3rd Order</b></center>
$$
Yhat = a + b_1 X +b_2 X^2 +b_3 X^3
$$

<center><b>Higher-Order</b>:</center>
$$
Yhat = a + b_1 X +b_2 X^2 +b_3 X^3 + \dots
$$


<p>We will use the following function to plot the data:</p>


In [ ]:
def PlotPolly(model, independent_variable, dependent_variable, Name, minimum=0, maximum=1000, step=100):
    # `step` is the number of evenly spaced points at which the fitted curve is evaluated
    x_new = np.linspace(minimum, maximum, step)
    y_new = model(x_new)

    # raw data as dots, fitted polynomial as a line
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('avg_price')

    plt.show()
    plt.close()

Let's get the variables:

In [ ]:
xp_train = X_train['doge_avg_price']
yp_train = y_train
xp_test = X_test['doge_avg_price']
yp_test = y_test

Let's fit the polynomial using the function <b>polyfit</b>, then use the function <b>poly1d</b> to display the polynomial function.


In [ ]:
# Here we use a polynomial of the 3rd order (cubic) 
f = np.polyfit(xp_train, yp_train, 3)
p = np.poly1d(f)
print(p)

Let's plot the function:


In [ ]:
PlotPolly(p, xp_train, yp_train, 'doge_avg_price', minimum=df['doge_avg_price'].min(), maximum=df['doge_avg_price'].max(), step=10)

<p>The analytical expression for Multivariate Polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:</p>


$$
\hat{Y} = a + b_1 X_1 +b_2 X_2 +b_3 X_1 X_2+b_4 X_1^2+b_5 X_2^2
$$
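
To see how quickly the number of terms grows, the following sketch enumerates the monomials of a polynomial expansion. `polynomial_terms` is a hypothetical helper written for this illustration; it is not part of scikit-learn.

```python
from itertools import combinations_with_replacement
from math import comb

def polynomial_terms(names, degree):
    """List every monomial of total degree <= `degree` (hypothetical helper)."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo) if combo else "1")
    return terms

two = polynomial_terms(["X1", "X2"], 2)
three = polynomial_terms(["X1", "X2", "X3"], 2)

print(two)                           # ['1', 'X1', 'X2', 'X1*X1', 'X1*X2', 'X2*X2']
print(len(three), comb(3 + 2, 2))    # 10 10
```

Two variables give the six terms of the equation above (the constant `1` corresponds to the intercept a). With n variables and degree d, the count is the binomial coefficient C(n + d, d), so three variables at degree 2 yield 10 terms.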


We can perform a polynomial transform on multiple features using `PolynomialFeatures`, which we imported at the start of the lab.


We create a <b>PolynomialFeatures</b> object of degree 2:


In [ ]:
pr=PolynomialFeatures(degree=2)
pr

In [ ]:
M_pr=pr.fit_transform(M_train)

In the original training data, there are 13341 samples and 3 features.


In [ ]:
M_train.shape

After the transformation, there are 13341 samples and 10 features.


In [ ]:
M_pr.shape

### **Pipeline**


<p>Data Pipelines simplify the steps of processing the data. We use the module <b>Pipeline</b> to create a pipeline. We also use <b>StandardScaler</b> as a step in our pipeline.</p>


We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.


In [ ]:
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]

Let's set config to diagram mode, so we can display our pipeline.

In [ ]:
from sklearn import set_config
set_config(display="diagram")

We input the list as an argument to the pipeline constructor:


In [ ]:
pipe=Pipeline(Input)
pipe

First, we convert `M_train` to type float to avoid conversion warnings that may appear as a result of StandardScaler taking float inputs.

Then, we can normalize the data,  perform a transform and fit the model simultaneously.


In [ ]:
M_train = M_train.astype(float)
pipe.fit(M_train,y_train)

Similarly,  we can normalize the data, perform a transform and produce a prediction  simultaneously.


In [ ]:
ypipe=pipe.predict(M_test)
ypipe[0:4]
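
A pipeline is a convenience: it runs the same steps we could perform by hand, in order. The sketch below checks this on synthetic data. `X_demo` and `y_demo` are hypothetical arrays standing in for `M_train` and `y_train`, and the step names mirror the ones used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical synthetic data standing in for M_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 1.0 + X_demo[:, 0] - 2.0 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=50)

# Pipeline: every step is fitted and applied in order
pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
])
pipe_pred = pipe_demo.fit(X_demo, y_demo).predict(X_demo)

# The same steps carried out by hand
scaler = StandardScaler().fit(X_demo)
poly = PolynomialFeatures(include_bias=False).fit(scaler.transform(X_demo))
features = poly.transform(scaler.transform(X_demo))
manual_pred = LinearRegression().fit(features, y_demo).predict(features)

print(np.allclose(pipe_pred, manual_pred))
```

The main practical benefit is that the scaler and polynomial transform are fitted on the training data only and then reused unchanged at prediction time, which is easy to get wrong when the steps are chained manually.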

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;"> Question #4:</b>

  <b>Create a pipeline that standardizes the data, then produce a prediction using a linear regression model with the features `M_train` and target `y_train`.</b>
    
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
Input=[('scale',StandardScaler()),('model',LinearRegression())]

pipe=Pipeline(Input)

pipe.fit(M_train, y_train)

ypipe=pipe.predict(M_train)
ypipe[0:10]

```

</details>


## 4. Measures for In-Sample Evaluation


<p>When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.</p>

<p>Important measures that are often used in Statistics to determine the accuracy of a model are:</p>
<ul>
    <li><b>$R^2$ / R-squared</b></li>
    <li><b>Mean Squared Error (MSE)</b></li>
    <li><b>Root Mean Squared Error (RMSE)</b></li>
    <li><b>Mean absolute error (MAE)</b></li>
</ul>


### Model 1: Simple Linear Regression


Let's calculate different metrics for linear regression model from previous section.

**R-squared**

R squared, also known as the coefficient of determination, is a **measure to indicate how close the data is to the fitted regression line**.

The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.
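
In formula terms, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of y. A minimal sketch with made-up values (`y_true` and `y_pred` are hypothetical, not results from this lab):

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"R^2 = {r2_manual:.3f}")  # R^2 = 0.975
```

Note that when R² is computed on held-out data, as in this lab, it can be negative if the model predicts worse than simply using the mean of y.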

Let's calculate the $R^2$:


In [ ]:
# Find the R^2
r2 = lm.score(X_test[['doge_avg_price']], y_test)
print(f"The R-square is: {r2:.4f}")

We can say that \~61.1% of the variation of the price ($R^2 \approx 0.611$) is explained by this simple linear model.


**Mean Squared Error**

**MSE** is **the average squared difference between the predicted and actual values**(**the difference between actual value (y) and the estimated value (ŷ)**). Lower values indicate a better fit.

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

We are going to use 'y_hat' (predicted values that linear regression model gave us).

Let's calculate the **Mean Squared Error (MSE)**:


In [ ]:
mse = mean_squared_error(y_test, y_hat)
print(f"The mean square error of price and predicted value is: {mse:.4f}")

For more info on MSE refer to:
1. [https://en.wikipedia.org/wiki/Mean_squared_error](https://en.wikipedia.org/wiki/Mean_squared_error)
2. [https://statisticsbyjim.com/regression/mean-squared-error-mse/](https://statisticsbyjim.com/regression/mean-squared-error-mse/)

**Root Mean Squared Error**

It is easier to interpret RMSE than the MSE because it is in the same units as the target variable. A smaller RMSE indicates better accuracy of the predictions.

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

Let's calculate **Root Mean Squared Error(RMSE)**:

In [ ]:
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")

For more info on RMSE refer to:

1. [https://en.wikipedia.org/wiki/Root-mean-square_deviation#:~:text=The%20root%2Dmean%2Dsquare%20deviation,estimator%20and%20the%20values%20observed](https://en.wikipedia.org/wiki/Root-mean-square_deviation#:~:text=The%20root%2Dmean%2Dsquare%20deviation,estimator%20and%20the%20values%20observed)
2. [https://www.statisticshowto.com/probability-and-statistics/regression-analysis/rmse-root-mean-square-error/](https://www.statisticshowto.com/probability-and-statistics/regression-analysis/rmse-root-mean-square-error/)

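**Mean Absolute Error (MAE)**

**MAE** is the average absolute difference between the predicted and actual values. Like RMSE, it is expressed in the same units as the target variable.

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$$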

In [ ]:
mae = mean_absolute_error(y_test, y_hat)

Let's output previously calculated metrics: 

In [ ]:
print(f" R2: {r2:.4f}\n MSE: {mse:.4f}\n RMSE: {rmse:.4f}\n MAE: {mae:.4f}")

For more info on MAE refer to:
1. [https://en.wikipedia.org/wiki/Mean_absolute_error](https://en.wikipedia.org/wiki/Mean_absolute_error)

Earlier we used `statsmodels` for linear regression and noted that you would see the difference later.

With the model we trained using `statsmodels`, we can easily retrieve many metrics at once:

In [ ]:
stats_lm.summary()

In [ ]:
# because we will calculate the same metrics for other models, let's create a function
def get_metrics(y_real, y_pred) -> tuple:
    """
    :returns tuple of metrics (MSE, RMSE, MAE)
    """
    mse = mean_squared_error(y_real, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_real, y_pred)
    return mse, rmse, mae

### Model 2: Multiple Linear Regression


Let's calculate the $R^2$:


In [ ]:
# Find the R^2
r2 = mult_lm.score(M_test, y_test)
print(f"The R-square is: {r2:.4f}")

We can say that \~81.52% of the variation of price ($R^2 \approx 0.8152$) is explained by this multiple linear regression model.


As you may remember, we stored the predictions for this model in `mult_y_hat`:

In [ ]:
mult_y_hat

Let's calculate our metrics(MSE, RMSE, MAE)

In [ ]:
mse, rmse, mae = get_metrics(y_test, mult_y_hat)

Let's output our metrics:

In [ ]:
print(f" R2: {r2:.4f}\n MSE: {mse:.4f}\n RMSE: {rmse:.4f}\n MAE: {mae:.4f}") # 4 decimal places

### Model 3: Polynomial Fit


Let's calculate the **$R^2$** by applying the `r2_score` function to the polynomial model's test-set predictions:


In [ ]:
pol_y_hat = p(xp_test)
r_squared = r2_score(yp_test, pol_y_hat)
print(f"The R-square value is: {r_squared:.4f}")

We can say that \~77% of the variation of price ($R^2 \approx 0.77$) is explained by this polynomial fit.


We can also calculate the **MSE**, **RMSE**, and **MAE**:


In [ ]:
mse, rmse, mae = get_metrics(yp_test, pol_y_hat)

Let's output our metrics:

In [ ]:
print(f" R2: {r_squared:.4f}\n MSE: {mse:.4f}\n RMSE: {rmse:.4f}\n MAE: {mae:.4f}")

### Simple Linear Regression Model (SLR) vs Multiple Linear Regression Model (MLR)


Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful and even act as noise. As a result, you **should always check the MSE, RMSE, $R^2$,  MAE**.
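
One reason to look at several metrics, and to evaluate on held-out data as this lab does, is that the in-sample $R^2$ of a least-squares fit cannot decrease when a variable is added, even if that variable is pure noise. The sketch below illustrates this on synthetic data. `r2_in_sample` is a hypothetical helper and all arrays are made up.

```python
import numpy as np

def r2_in_sample(X, y):
    """In-sample R^2 of an OLS fit with intercept (hypothetical helper)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(42)
x_useful = rng.normal(size=100)
x_noise = rng.normal(size=100)            # unrelated to the target
y = 3.0 * x_useful + rng.normal(size=100)

r2_one = r2_in_sample(x_useful.reshape(-1, 1), y)
r2_two = r2_in_sample(np.column_stack([x_useful, x_noise]), y)

print(r2_two >= r2_one)
```

The noise column does not carry any information about `y`, yet the training-set fit does not get worse. That is why a higher in-sample score alone does not show that the extra variable helps.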

<h2>Conclusion</h2>


Comparing these three models, we conclude that **the MLR model is the best model to be able to predict price from our dataset**.


# **Thank you for completing Lab 4!**

## Authors

<a href="https://author.skills.network/instructors/nazar_kohut">Nazar Kohut</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By   | Change Description                                         |
| ----------------- | ------- | -------------| ---------------------------------------------------------- |
|     2023-03-18    |   1.0   | Nazar Kohut  | Lab created                                                |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>