<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="400" alt="cognitiveclass.ai logo">
</center>

# **Investigating relationships between the BTC/BUSD exchange rate and the ADOSC, NATR, TRANGE indicators**

## Lab 4. Model Development

Estimated time needed: **30** minutes


### Objectives

After completing this lab you will be able to:

*   Develop prediction models
*   Visualize the fitted models
*   Measure the performance of models using metrics
*   Use a Pipeline

<h3>Table of Contents</h3>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import and Load Data</li>
    <li>Linear Regression and Multiple Linear Regression</li>
    <li>Model Evaluation Using Visualization</li>
        <ul>
            <li>Regression Plot</li>
            <li>Residual Plot</li>
            <li>Multiple Linear Regression Plot</li>
        </ul>
    <li>Polynomial Regression</li>
            <ul>
                <li>Multivariate Polynomial function</li>
            </ul>
    <li>Pipeline</li>
    <li>Measures for In-Sample Evaluation</li>
    <li>Decision Making: Determining a Good Model Fit</li>
</ol>

</div>
<hr>


## Dataset Description

### Context
The dataset contains historical values of the ***BTC/BUSD*** exchange rate and the ***ADOSC, NATR, TRANGE indicators*** for the period from *11/11/2022 to 11/24/2022* with a *1-minute* aggregation time.

### Columns

#### Input columns
* ***Time*** - the timestamp of the record
* ***Open*** -  the price of the asset at the beginning of the trading period
* ***High*** -  the highest price of the asset during the trading period
* ***Low*** - the lowest price of the asset during the trading period.
* ***Close*** - the price of the asset at the end of the trading period
* ***Volume*** - the total number of shares or contracts of a particular asset that are traded during a given period
* ***Count*** -  the number of individual trades or transactions that have been executed during a given time period
* ***ADOSC*** - Chaikin oscillator indicator
* ***NATR*** - normalized average true range (ATR) indicator
* ***TRANGE*** - true range indicator

#### Target column
* ***Price*** - the average price at which a particular asset has been bought or sold during a given period


----


<p>In this section, we will develop several models that predict the cryptocurrency price from the variables or features. This is just an estimate, but it should give us an objective idea of how much the cryptocurrency price should rise or fall.</p>


<p>In data analytics, we often use <b>Model Development</b> to help us predict future observations from the data we have. A model will help us understand the exact relationship between different variables and how these variables are used to predict the result.</p>


## 1. Import and Load Data


### Setup


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
# Install the specific versions of the libraries used in the lab
# ! mamba install pandas==1.3.3 -y
# ! mamba install numpy -y
# ! mamba install scipy -y
# ! mamba install scikit-learn -y
! conda install -c conda-forge scikit-learn -y

Import libraries:


In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from scipy import stats
import seaborn as sns
import matplotlib.patches as  mpatches
%matplotlib inline 
from sklearn import set_config
set_config(display = 'diagram')

### Load Data


First, we assign the URL of the dataset to <code>"path"</code>.


In [ ]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0CQNEN/BTCBUSD_trades_1m.csv"

Use the Pandas method <code>read_csv()</code> to load the data from the web address. Set the parameter <code>index_col=0</code> in order to use the first column of the CSV file as the index of the dataframe.


In [ ]:
df = pd.read_csv(path, index_col=0)

Set dataframe index column type to <strong>datetime</strong> using <code>pd.to_datetime()</code> method for correct time series analysis. 


In [ ]:
df.index = pd.to_datetime(df.index)

In the previous lab we calculated technical financial indicators. Because each indicator is computed from the values of previous periods, the first few rows of the dataframe contain `NaN` values.

The methods for recovering missing data do not work correctly on the first rows of a time series. Therefore, we remove the rows with `NaN` values using the `df.dropna(inplace=True)` method.


In [ ]:
df.dropna(inplace=True)
df.head()

## 2. Linear Regression and Multiple Linear Regression


#### Could the indicators help us predict cryptocurrency price?


For this example, we want to look at how one of the indicators can help us predict our cryptocurrency price.

In the previous module we ran the Granger causality test, which helped us determine whether specific parameters could assist in forecasting the cryptocurrency price. This showed us which indicators' movements may be useful in projecting the BTC price.

Still, the correlation values between the price and the indicators point to no meaningful or strong linear relationship between the variables. Nevertheless, for educational purposes, we will use linear models and later try to find non-linear dependencies.


### Linear Regression


#### Simple Linear Regression
<p><em>Simple linear regression</em> is a regression model that estimates the relationship between one independent variable and one dependent variable (that we want to predict) using a straight line. It is a method to help us understand the relationship between two variables:</p>
<ul>
    <li><em>$The\ predictor\ or\ independent\ variable\ -\ X$</em></li>
    <li><em>$The\ response\ or\ dependent\ variable\ -\ Y$</em></li>
</ul>

<p>The result of Linear Regression is a <strong><em>linear function</em></strong> <em>that predicts the response (dependent) variable as a function of the predictor (independent) variable.</em></p>


$$
Y: Response \ Variable\\
X: Predictor \ Variables
$$


#### Linear Function

$$
Yhat = a + b  X
$$


<ul>
    <li>$a$ refers to the <strong><em>intercept</em></strong> of the regression line, in other words: <em>the value of $Y$ when $X$ is 0</em></li>
    <li>$b$ refers to the <strong><em>slope</em></strong> of the regression line, in other words: <em>the value with which $Y$ changes when $X$ increases by 1 unit</em></li>
</ul>


<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0CQNEN/LR.jpg" width="400" height="400"></img>
</center>
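
To make the roles of the intercept $a$ and the slope $b$ concrete, here is a minimal sketch that computes both with the closed-form least-squares estimates, $b = cov(X, Y) / var(X)$ and $a = mean(Y) - b \cdot mean(X)$, and compares them with scikit-learn. The data is synthetic and the names `x_demo`, `y_demo` and `lm_demo` are hypothetical; they are not part of the lab's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): a true line y = 3 + 2x plus some noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200)
y_demo = 3.0 + 2.0 * x_demo + rng.normal(0, 0.5, size=200)

# Closed-form least-squares estimates of the slope and the intercept
b_manual = np.cov(x_demo, y_demo, ddof=1)[0, 1] / np.var(x_demo, ddof=1)
a_manual = y_demo.mean() - b_manual * x_demo.mean()

# The same fit with scikit-learn (expects a 2-D array of predictors)
lm_demo = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)

print("intercept:", a_manual, lm_demo.intercept_)
print("slope:    ", b_manual, lm_demo.coef_[0])
```

Both approaches should agree up to floating-point error, and both should land close to the values used to generate the data.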


Let's start by creating the linear regression object:


In [ ]:
lm = LinearRegression()
lm

Given the Granger causality test results, we will use simple linear regression to create a linear function with "Volume" as the predictor variable and "Price" as the response variable.


In [ ]:
X = df[['Volume']]
Y = df['Price']

Fit the linear model:


In [ ]:
lm.fit(X,Y)

We can output a prediction:


In [ ]:
Yhat = lm.predict(X)
Yhat[0:5]   

#### What is the value of the intercept $-\ a$?


In [ ]:
lm.intercept_

We obtained an <em>intercept value</em> of ~16533.351. 

<h5>However, what does it mean?</h5>

<em>In our example, the value of $a$ is the average BTC price that the model predicts when the volume is 0.</em>


#### What is the value of the slope $-\ b$?


In [ ]:
lm.coef_

We obtained a <em>slope value</em> of -0.465. 

<h5>However, what does it mean?</h5>

<em>The value of the slope indicates that for each one-unit increase in volume, the price decreases, on average, by 0.465 units.</em>


#### What is the final estimated linear model we get?


As we saw above, we should get a final linear model with the structure:


$$
Yhat = a + b  X
$$


Plugging in the actual values we get:


$$
Price = 16533.351 - 0.465 \times Volume
$$
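
As a quick check of how to read this equation, the sketch below evaluates it for a few volumes. The coefficients are the rounded values reported above; the helper `predict_price` and the chosen volumes are hypothetical and serve only as an illustration.

```python
# Rounded coefficients as reported in the text above
a_hat = 16533.351   # intercept
b_hat = -0.465      # slope for Volume

def predict_price(volume):
    """Evaluate Price = a_hat + b_hat * Volume for a given volume."""
    return a_hat + b_hat * volume

# Hypothetical volumes, chosen only to show the effect of the slope
for volume in (0, 10, 100):
    print(volume, round(predict_price(volume), 3))
```

At a volume of 0 the prediction equals the intercept, and each additional unit of volume lowers the predicted price by 0.465.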


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #1 a):</b>

<b>Create a linear regression object called <strong>"lm1"</strong>.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
lm1 = LinearRegression()
lm1

<details><summary>Click here for the solution</summary>

```python
lm1 = LinearRegression()
lm1
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #1 b):</b>

<b>Train the model using ADOSC indicator as the independent variable and cryptocurrency price as the dependent variable.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute

lm1.fit(df[['ADOSC']], df[['Price']])
lm1

<details><summary>Click here for the solution</summary>

```python
lm1.fit(df[['ADOSC']], df[['Price']])
lm1
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #1 c):</b>

<b>Find the slope and intercept of the model.</b>

</div>


#### Slope


In [ ]:
# Write your code below and press Shift+Enter to execute 
lm1.coef_

<details><summary>Click here for the solution</summary>
    
```python
# Slope 
lm1.coef_
```
</details>


#### Intercept


In [ ]:
# Write your code below and press Shift+Enter to execute 
lm1.intercept_


<details><summary>Click here for the solution</summary>

```python
# Intercept
lm1.intercept_
```

</details>


### Multiple Linear Regression


<strong><em>What if we want to predict cryptocurrency price using more than one variable?</em></strong>

If we want to use more variables in our model to predict cryptocurrency price, we can use <strong><em>Multiple Linear Regression</em></strong>.

<strong><em>Multiple Linear Regression</em></strong> is very similar to Simple Linear Regression, but this method is used to explain the relationship between <em>one continuous response (dependent) variable</em> and <em>two or more predictor (independent) variables</em>.

Most real-world regression models involve multiple predictors. We will illustrate the structure by using four predictor variables, but the results generalize to any number of predictors:


$$
Y: Response \ Variable\\
X_1: Predictor\ Variable\ 1\\
X_2: Predictor\ Variable\ 2\\
X_3: Predictor\ Variable\ 3\\
X_4: Predictor\ Variable\ 4
$$


$$
a: intercept\\
b_1: coefficient \ of\ Variable \ 1\\
b_2: coefficient \ of\ Variable \ 2\\
b_3: coefficient \ of\ Variable \ 3\\
b_4: coefficient \ of\ Variable \ 4
$$


The equation is given by:


$$
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
$$


<p>From the previous section  we know that the good predictors of price could be:</p>
<ul>
    <li>Volume</li>
    <li>ADOSC</li>
    <li>NATR</li>
    <li>TRANGE</li>
</ul>

Let's develop a model using these variables as the predictor variables.


In [ ]:
Z = df[['Volume', 'ADOSC', 'NATR', 'TRANGE']]

Fit the linear model using the four above-mentioned variables.


In [ ]:
lm.fit(Z, df['Price'])

#### What is the value of the intercept $-\ a$?


In [ ]:
lm.intercept_

We obtained an <em>intercept value</em> of ~16536.597. 

<h5>However, what does it mean?</h5>

The intercept in a multiple regression model is the mean for the response when all of the explanatory variables take on the value 0.


#### What are the values of the coefficients $b_1,\ b_2,\ b_3,\ b_4$?


In [ ]:
lm.coef_

We obtained several <em>slope values</em>. 

##### However, what do they mean?

A regression coefficient in multiple regression is the slope of the linear relationship between the criterion variable and the part of a predictor variable that is independent of all other predictor variables. <em>In other words, it represents the change in the criterion variable associated with a change of one in the predictor variable when all other predictor variables are held constant.</em>
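
The "held constant" interpretation can be checked directly. The following sketch, which uses synthetic data and hypothetical names (`X_demo`, `mlr_demo`), raises a single predictor by one unit while leaving the others unchanged and compares the change in the prediction with that predictor's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): four predictors with known coefficients
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = 5.0 + X_demo @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(0, 0.1, size=300)

mlr_demo = LinearRegression().fit(X_demo, y_demo)

# Take one observation and raise only the second predictor by one unit
row = X_demo[:1].copy()
row_shifted = row.copy()
row_shifted[0, 1] += 1.0

change = mlr_demo.predict(row_shifted)[0] - mlr_demo.predict(row)[0]
print(change, mlr_demo.coef_[1])
```

The change in the prediction should match the second coefficient up to floating-point error, whichever observation is used as the starting point.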


#### What is the final estimated linear model that we get?


As we saw above, we should get a final linear function with the structure:

$$
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
$$

What is the linear function we get in this example?


$$
Price = 16536.597 - 0.803 \times Volume + 0.287 \times ADOSC - 467.211 \times NATR + 3.261 \times TRANGE
$$


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #2 a):</b>
    
Create and train a Multiple Linear Regression model "lm2" where the response variable is "Volume", and the predictor variables are "NATR" and  "TRANGE".
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 

lm2 = LinearRegression()
lm2.fit(df[['NATR', 'TRANGE']],df['Volume'])

<details><summary>Click here for the solution</summary>

```python
lm2 = LinearRegression()
lm2.fit(df[['NATR' , 'TRANGE']],df['Volume'])
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #2 b):</b>
    
<b>Find the coefficients of the model.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
lm2.coef_

<details><summary>Click here for the solution</summary>

```python
lm2.coef_

```

</details>


## 3. Model Evaluation Using Visualization


Now that we've developed some models, how do we evaluate our models and choose the best one? One way to do this is by using a visualization.


### Regression Plot


<p>When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using <b>regression plots</b>.</p>

<p>This plot shows a combination of scattered data points (a <b>scatterplot</b>) and the fitted <b>linear regression</b> line going through the data. This gives us a reasonable estimate of the relationship between the two variables, the strength of the correlation, and its direction (positive or negative correlation).</p>


Let's visualize **Volume** as potential predictor variable of our main cryptocurrency price:


In [ ]:
width, height = 8, 6
plt.figure(figsize=(width, height))
sns.regplot(x="Volume", y="Price", data=df)

One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for this data.

#### What conclusions can be drawn from the graph above? 

In the graph above we can observe a weak negative correlation between the two variables. It is hard to tell whether the price rises or falls as "Volume" changes. Moreover, a significant part of the data does not lie along the regression line, which indicates that a linear model is not the best choice for these data. It also suggests that Volume is probably not a good predictor of the BTC price.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #3:</b>
    
<b>Visualize TRANGE indicator as potential predictor variable of our main cryptocurrency price using <code>regplot()</code> function.</b>

What can we find out from the resulting graph?
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
sns.regplot(x="TRANGE", y="Price", data=df)

<details><summary>Click here for the solution</summary>

```python
sns.regplot(x="TRANGE", y="Price", data=df)
```

</details>


<em>What conclusion can be made from the graph above?</em>

As in the previous case, we observe a chaotic distribution of data points with the impossibility of accurately determining how one variable depends on another. In such a situation, it is also worth considering a <em>residual plot</em> that shows the difference between the observed and the fitted values.


### Residual Plot

<p>A good way to visualize the variance of the data is to use <em>a residual plot.</em></p>

#### What is a <em>residual</em>?

<p>The difference between <em>the observed value $y$</em> and <em>the predicted value $Y_{hat}$</em> is called <strong><em>the residual</em></strong> $e$. When we look at a regression plot, <em>the residual is the distance from the data point to the fitted regression line.</em></p>

#### So what is a <em>residual plot</em>?

<p><em>A residual plot</em> is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.</p>

#### What do we pay attention to when looking at <em>a residual plot?</em>

<p>We look at the spread of the residuals. If the points in a residual plot are <strong><em>randomly spread out around the x-axis</em></strong>, then a <strong><em>linear model is appropriate</em></strong> for the data.

<em>Why is that?</em> Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.</p>
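
To make the definition of a residual concrete, here is a minimal sketch that computes residuals by hand for a fitted line. The data is synthetic and the names (`x_demo`, `model_demo`, `residuals`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): a linear relationship with constant noise
rng = np.random.default_rng(2)
x_demo = rng.uniform(0, 10, size=200).reshape(-1, 1)
y_demo = 1.0 + 0.8 * x_demo[:, 0] + rng.normal(0, 1.0, size=200)

model_demo = LinearRegression().fit(x_demo, y_demo)

# Residual = observed value minus predicted value
residuals = y_demo - model_demo.predict(x_demo)

print("mean of residuals:", residuals.mean())
print("correlation with x:", np.corrcoef(x_demo[:, 0], residuals)[0, 1])
```

For any least-squares line with an intercept, the residuals average to zero and are uncorrelated with the predictor, so these two numbers cannot reveal a poor fit. That is why we look at the shape of the residual plot instead: curvature or a changing spread signals that a straight line is not appropriate.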


In [ ]:
width, height = 6, 6
plt.figure(figsize=(width, height))
sns.residplot(x=df['Volume'], y=df['Price'])
plt.show()

#### What is this plot telling us?

<p>We can see from this residual plot that the residuals are not randomly spread around the x-axis, leading us to believe that maybe <em>a non-linear model is more appropriate for this data.</em></p>


### Multiple Linear Regression Plot


#### How do we visualize a model for Multiple Linear Regression?

<p>This gets a bit more complicated because you can't visualize it with a regression or residual plot.</p>

<p>One way to look at the fit of the model is by looking at the <b>distribution plot</b>. We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.</p>


First, let's make a prediction:


In [ ]:
Y_hat = lm.predict(Z)

In [ ]:
plt.figure(figsize=(width, height))

# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
ax1 = sns.kdeplot(x=df['Price'], color="tab:orange", label="Actual Values")
sns.kdeplot(x=Y_hat, color="tab:blue", label="Fitted Values", ax=ax1)

# Build the legend from the plot labels so the colors match the curves
plt.legend()

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price')
plt.ylabel('Density')

plt.show()
plt.close()

<p>We can see that the fitted values are not reasonably close to the actual values, since the two distributions do not follow each other's pattern. The multiple linear regression does better than the simple linear regression, but there is definitely room for improvement.</p>


## 4. Polynomial Regression


#### What is a <em>polynomial regression</em>?

<em>Polynomial regression</em> is a particular case of the general linear regression model or multiple linear regression models.
We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.

<p>There are different orders of polynomial regression:</p>


<center><strong>Quadratic - 2nd Order</strong></center>
$$
Yhat = a + b_1 X +b_2 X^2 
$$
<br>
<center><strong>Cubic - 3rd Order</strong></center>
$$
Yhat = a + b_1 X +b_2 X^2 +b_3 X^3
$$
<br>
<center><strong>Higher-Order</strong>:</center>
$$
Yhat = a + b_1 X +b_2 X^2 +b_3 X^3 + \dots
$$


#### What is the polynomial regression for?

Two variables may be correlated, yet <em>the relationship may not look linear.</em> In this case, we can use a polynomial regression to fit a polynomial equation to the data.

Polynomial regression can then approximate the relationship between the dependent and independent variable more closely than a straight line, and on the resulting regression plot the fitted curve follows the data distribution pattern more closely.

<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0CQNEN/SLRvsPL.jpg" width="800" height="800"></img>
</center>


<em>Advantages of using Polynomial Regression:</em>

<ul>
<li>a polynomial can approximate a curved relationship between the dependent and independent variable that a straight line cannot</li>
<li>a broad range of functions can be fitted with it</li>
<li>a polynomial fits a wide range of curvature</li>
</ul>
<br>

<em>Disadvantages of using Polynomial Regression</em>:

<ul>
<li>the presence of one or two outliers in the data can seriously affect the results of the nonlinear analysis</li>
<li>it is very sensitive to outliers, that is, <em>observations that lie an abnormal distance from other values in a sample</em></li>
<li>there are unfortunately fewer model validation tools for the detection of outliers in nonlinear regression than there are for linear regression</li>
</ul>
<br>
    
Read more about it <a href='https://towardsdatascience.com/introduction-to-linear-regression-and-polynomial-regression-f8adc96f31cb?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CQNEN2434-2023-01-01'>here</a>.


<p>We saw earlier that a linear model did not provide the best fit while using "Volume" as the predictor variable. Let's see whether a polynomial model fits the data better, this time using "ADOSC" as the predictor variable.</p>


<p>We will use the following function to plot the data:</p>


In [ ]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on 100 evenly spaced points across the predictor's range
    x_new = np.linspace(min(independent_variable), max(independent_variable), 100)
    y_new = model(x_new)

    # Observed points as dots, fitted polynomial as a line
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-', markersize=1)
    plt.title('Polynomial Fit')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)  # Name labels the x-axis, so pass the predictor's name
    plt.ylabel('Price')

    plt.show()
    plt.close()

Let's get the variables:


In [ ]:
x = df['ADOSC']
y = df['Price']

Let's fit the polynomial using the function <b>polyfit</b>, then use the function <b>poly1d</b> to display the polynomial function.


In [ ]:
# Here we use a polynomial of the 3rd order (cubic) 
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)

Let's plot the function:


In [ ]:
PlotPolly(p, x, y, 'ADOSC')

In [ ]:
np.polyfit(x, y, 3)

<p>We can already see from the plot that this polynomial model does not do noticeably better than the linear model, although it does try to follow the pattern. This may be because the ADOSC indicator is not a suitable predictor of the BTC price.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #4:</b>
    
<b>Create an 11th-order polynomial model with the variables x and y from above.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1, x, y, 'ADOSC')

<details><summary>Click here for the solution</summary>

```python
# Here we use a polynomial of the 11th order
f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1, x, y, 'ADOSC')

```

</details>


From the graph above we can see that as the order of the polynomial regression model increases, the fitted curve may follow the data points more closely.

<strong><em>Nevertheless, this graph clearly conveys one of the main disadvantages of polynomial regression: the influence of outliers.</em></strong>

We observe how the points most distant from the main cluster cause sharp and significant changes in the regression curve, which very often produce false predictions.

Furthermore, in the field of financial services, polynomial models of order higher than 3 are generally not used.


### Multivariate Polynomial function

<p>The analytical expression for Multivariate Polynomial function gets complicated. For example, the expression for <em>a second-order (degree=2) polynomial </em> with two variables is given by:</p>


$$
Yhat = a + b_1 X_1 +b_2 X_2 +b_3 X_1 X_2+b_4 X_1^2+b_5 X_2^2
$$


We can perform a polynomial transform on multiple features. First, we create a <em>PolynomialFeatures object of degree 2:</em>


In [ ]:
pr=PolynomialFeatures(degree=2)
pr

In [ ]:
Z_pr=pr.fit_transform(Z)

In the original data, there are 16751 samples and 4 features.


In [ ]:
Z.shape

After the transformation, there are 16751 samples and 15 features.


In [ ]:
Z_pr.shape

## 5. Pipeline


<p>Data Pipelines simplify the steps of processing the data. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. 
    
We use the module <strong>Pipeline</strong> to create a pipeline of transforms with a final estimator. We also use <strong><em>StandardScaler</em></strong> as a transform step in our pipeline.</p>


#### What is a StandardScaler?

In machine learning, <em><strong>the StandardScaler</strong></em> is used to resize the distribution of values so that <em>the mean of the observed values is 0</em> and <em>the standard deviation is 1.</em>

<em><strong>StandardScaler</strong></em> is an important technique that is primarily performed as a pre-processing step before many machine learning models to standardize the range of the input dataset.

#### When to use StandardScaler?

The StandardScaler is applied when the characteristics of the input dataset vary widely between ranges, or when they are measured in different units.
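
As a small illustration, the sketch below standardizes two hypothetical columns measured on very different scales and checks the result against the formula $z = (x - mean) / std$. The array `X_demo` is made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

scaled = StandardScaler().fit_transform(X_demo)

# The same transformation by hand; StandardScaler uses the population
# standard deviation (ddof=0), which is also NumPy's default
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(scaled)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so both features contribute on a comparable scale.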


We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.


In [ ]:
Input = [('scale',StandardScaler()), 
    ('polynomial', PolynomialFeatures(include_bias=False)), 
    ('model',LinearRegression())]

We input the list as an argument to the pipeline constructor:


In [ ]:
pipe = Pipeline(Input)
pipe

First, we convert the data type Z to type float to avoid conversion warnings that may appear as a result of StandardScaler taking float inputs.

Then, we can normalize the data,  perform a transform and fit the model simultaneously.


In [ ]:
Z = Z.astype(float)
pipe.fit(Z, y)

Similarly,  we can normalize the data, perform a transform and produce a prediction  simultaneously.


In [ ]:
ypipe=pipe.predict(Z)
ypipe[0:4]
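
To show what the pipeline does internally, here is a minimal sketch that runs the same three steps by hand and compares the predictions. It uses synthetic data and hypothetical names (`X_demo`, `pipe_demo`, `manual_model`) so that it does not interfere with the lab's variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): two features with a non-linear relationship
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 2))
y_demo = 1.0 + X_demo[:, 0] ** 2 - X_demo[:, 1] + rng.normal(0, 0.1, size=100)

# All three steps chained in a pipeline
pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

# The same steps performed one at a time
scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)
features = PolynomialFeatures(include_bias=False).fit_transform(X_scaled)
manual_model = LinearRegression().fit(features, y_demo)

same = np.allclose(pipe_demo.predict(X_demo), manual_model.predict(features))
print(same)
```

The pipeline gives the same predictions as the manual version, but keeps the fitted scaler, transform and model together in one object.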

<div class="alert alert-danger alertdanger" style="margin-top: 20px">

<b style="font-size: 2em; font-weight: 600;">Question #5:</b>
    
<b>Create a pipeline that standardizes the data, then produce a prediction using a linear regression model using the features Z and target y.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
Input=[('scale',StandardScaler()),('model',LinearRegression())]

pipe=Pipeline(Input)

pipe.fit(Z,y)

ypipe=pipe.predict(Z)
ypipe[0:10]

<details><summary>Click here for the solution</summary>

```python
Input=[('scale',StandardScaler()),('model',LinearRegression())]

pipe=Pipeline(Input)

pipe.fit(Z,y)

ypipe=pipe.predict(Z)
ypipe[0:10]

```

</details>


## 6. Measures for Evaluation


<p>When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.</p>

<p>Three very important measures that are often used in Statistics to determine the accuracy of a model are:</p>
<ul>
    <li><strong><em>R-squared</em></strong> $(R^{2})$</li>
    <li><strong><em>Mean Squared Error</em></strong> $(MSE)$</li>
    <li><strong><em>P-value</em></strong></li>
</ul>

### R-squared

<p><em><strong>R squared</strong></em>, also known as <strong><em>the coefficient of determination</em></strong>, is a measure to indicate how close the data is to the fitted regression line.
The value of the R-squared is the percentage of variation of <em>the response variable</em> $y$ that is explained by a linear model.</p>
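
As a worked example, R-squared can be computed by hand as $1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the actual values from their mean. The values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 6.5, 9.5])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)          # unexplained variation
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total variation
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

Here $SS_{res} = 1$ and $SS_{tot} = 20$, so $R^{2} = 0.95$: the predictions account for 95% of the variation in these made-up values.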

### Mean Squared Error (MSE)

<p><strong><em>The Mean Squared Error</em></strong> measures the average of the squared errors, where each error is the difference between the actual value $y$ and the estimated value $\hat{y}$.</p>

$$
MSE = \frac{1}{n} \sum \limits _{i=1} ^{n} (y_i - \hat{y}_i)^{2},
$$ 
<center>where $y_i$ is the actual value and $\hat{y}_i$ is the forecast value.</center>
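
The formula can be checked by hand on a few made-up values and compared with scikit-learn's `mean_squared_error`. The arrays below are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and forecast values
y_actual_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_forecast_demo = np.array([2.5, 5.5, 6.5, 9.5])

# Average of the squared differences, exactly as in the formula
mse_manual = np.mean((y_actual_demo - y_forecast_demo) ** 2)

print(mse_manual, mean_squared_error(y_actual_demo, y_forecast_demo))
```

Every forecast is off by 0.5, so each squared error is 0.25 and the MSE is 0.25 as well. Note that the MSE is expressed in squared units of the target.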


### P-value

<em><strong>The P-value</strong></em> is the probability of observing a correlation at least as strong as the one in our data if there were in fact no correlation between the two variables. Normally, we choose a significance level of 0.05, which means that we treat the correlation as statistically significant when the p-value is below 0.05.

By convention, when

<ul>
    <li><em>the p-value</em> is $<$ <strong>0.001</strong>: we say there is strong evidence that the correlation is significant.</li>
    <li><em>the p-value</em> is $<$ <strong>0.05</strong>: there is moderate evidence that the correlation is significant.</li>
    <li><em>the p-value</em> is $<$ <strong>0.1</strong>: there is weak evidence that the correlation is significant.</li>
    <li><em>the p-value</em> is $>$ <strong>0.1</strong>: there is no evidence that the correlation is significant.</li>
</ul>
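
One common way to obtain a p-value for a correlation is the Pearson correlation test, available as `scipy.stats.pearsonr`. The sketch below is an illustration on synthetic data with hypothetical names; it is not the test used by the function defined next in this lab.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): one related and one unrelated variable
rng = np.random.default_rng(4)
x_demo = rng.normal(size=500)
y_related = 0.5 * x_demo + rng.normal(size=500)
y_unrelated = rng.normal(size=500)

# pearsonr returns the correlation coefficient and its p-value
r_rel, p_rel = stats.pearsonr(x_demo, y_related)
r_unrel, p_unrel = stats.pearsonr(x_demo, y_unrelated)

print(f"related:   r = {r_rel:.3f}, p-value = {p_rel:.3g}")
print(f"unrelated: r = {r_unrel:.3f}, p-value = {p_unrel:.3g}")
```

For the related pair the p-value falls far below 0.001, which the convention above reads as strong evidence of a significant correlation.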


Let's declare a function responsible for calculating above metrics.


In [ ]:
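def calculate_metrics(y_pred, y):
    # R-squared and MSE compare each prediction with the matching actual value
    r_squared = r2_score(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    # f_oneway runs a one-way ANOVA: it tests whether y and y_pred have the same mean,
    # not whether they are correlated
    p_value = stats.f_oneway(y, y_pred)[1]
    
    print(f'The R-square value is: {r_squared:.4f}')
    print(f'The MSE value is: {mse:.4f}')
    print(f'The P-value is: {p_value:.4f}')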

### Model 1: Simple Linear Regression

Let's calculate metrics for simple linear model:


In [ ]:
lm = LinearRegression()
lm.fit(X,Y)
Yhat = lm.predict(X)

calculate_metrics(Yhat, Y)

### Model 2: Multiple Linear Regression

Let's calculate metrics for multiple linear model:


In [ ]:
# fit the model
lm.fit(Z, Y)
Yhat_multifit = lm.predict(Z)

calculate_metrics(Yhat_multifit, Y)

### Model 3: Polynomial Fit

Let's calculate metrics for polynomial linear model:


In [ ]:
calculate_metrics(p(x), y)

## 7. Decision Making: Determining a Good Model Fit


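<p>Now that we have visualized the different models and generated the P-values, R-squared and MSE values for the fits, how do we determine a good model fit? First, all three models returned a P-value of 1. The one-way ANOVA in <code>calculate_metrics</code> compares the means of the actual and fitted values, and a least-squares fit with an intercept always reproduces the mean of the actual values, so this result says nothing about how well the models fit. We will therefore ignore the P-values.
</p>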

#### <em>What is a good R-squared value?</em>

When comparing models, <b>the model with <em>the higher R-squared value is a better fit</em></b> for the data.

#### <em>What is a good MSE?</em>

<p>When comparing models, <b>the model with <em>the smallest MSE value is a better fit</em></b> for the data.</p>

Let's take a look at the values for the different models.
<p><strong><em>Simple Linear Regression:</em></strong> Using <em>Volume</em> as a Predictor Variable of our main currency price.
<ul>
    <li><strong>R-squared</strong>: 0.0084</li>
    <li><strong>MSE</strong>: 97820.9490 </li>
</ul>
</p>

<p><strong><em>Multiple Linear Regression:</em></strong> Using <em>Volume</em> and <em>indicators</em> as Predictor Variables of our main currency price.
<ul>
    <li><strong>R-squared</strong>: 0.0172</li>
    <li><strong>MSE</strong>: 96947.1260 </li>
</ul>
</p>

<p><strong><em>Polynomial Fit:</em></strong> Using <em>ADOSC</em> as a Predictor Variable of our main currency price.
<ul>
    <li><strong>R-squared</strong>: 0.0028</li>
    <li><strong>MSE</strong>: 98370.3123 </li>
</ul>
</p>


### Simple Linear Regression Model (SLR) vs Multiple Linear Regression Model (MLR)


<p>Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful and even act as noise. As a result, you should always check the MSE and  $R^{2}$.</p>

<p>In order to compare the results of <em><strong>the MLR vs SLR models</strong></em>, we look at a combination of both the <strong><em>R-squared</em></strong> and <em><strong>MSE</strong></em> to make the best conclusion about the fit of the model.
<ul>
    <li><strong>MSE</strong>: The MSE of SLR is <em>97820.9490</em>, while MLR has an MSE of <em>96947.1260</em>. <strong>The MSE of MLR is slightly smaller</strong> (by about 0.9%).</li>
    <li><strong>R-squared</strong>: The $R^{2}$ for the MLR (~0.0172) is about twice the $R^{2}$ for the SLR (~0.0084), but both are close to 0, so neither model explains more than about 2% of the variation in price.</li>
</ul>
</p>

Taken together, the <strong>R-squared</strong> and the <strong>MSE</strong> show that <strong><em>MLR</em></strong> is the better model fit in this case compared to <em><strong>SLR</strong></em>, although the improvement is small.


### Simple Linear Model (SLR) vs. Polynomial Fit


<ul>
    <li><strong>MSE</strong>: The MSE of the Polynomial Fit (<em>98370.3123</em>) is larger than the MSE of the SLR (<em>97820.9490</em>), so the Polynomial Fit did not bring down the MSE.</li> 
    <li><strong>R-squared</strong>: The $R^{2}$ for the Polynomial Fit (~0.0028) is smaller than the $R^{2}$ for the SLR (~0.0084).</li>
</ul>
<p>Since the Polynomial Fit resulted in a higher <strong>MSE</strong> and a lower $R^{2}$, we conclude that the cubic fit with ADOSC as the predictor variable is not a better model than the <strong><em>SLR</em></strong> with Volume as the predictor variable. Note that the two models use different predictors, so the comparison reflects the choice of predictor as well as the choice of model.</p>


### Multiple Linear Regression (MLR) vs. Polynomial Fit


<ul>
    <li><strong>MSE</strong>: <strong>The MSE for the MLR is smaller</strong> than the MSE for the Polynomial Fit.</li>
    <li><strong>R-squared</strong>: The $R^{2}$ for the MLR (~0.0172) is also larger than for the Polynomial Fit (~0.0028).</li>
</ul>


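<p>Comparing these three models, we conclude that <strong><em>the MLR model is the best regression model</em></strong> of the three for predicting price from our dataset. Even so, its $R^{2}$ of ~0.0172 means that it explains less than 2% of the variation in price.</p>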
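
The comparison can also be laid out as a small table. The sketch below uses the R-squared and MSE values reported in this section; the dataframe `results` and its row labels are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Metrics as reported above for the three models
results = pd.DataFrame(
    {
        "R2": [0.0084, 0.0172, 0.0028],
        "MSE": [97820.9490, 96947.1260, 98370.3123],
    },
    index=["SLR (Volume)", "MLR (Volume + indicators)", "Polynomial (ADOSC)"],
)

# Higher R-squared and lower MSE both indicate a better fit
best_by_r2 = results["R2"].idxmax()
best_by_mse = results["MSE"].idxmin()

print(results.sort_values("MSE"))
print("Best by R-squared:", best_by_r2)
print("Best by MSE:      ", best_by_mse)
```

Both criteria select the same model here. When they disagree, it is worth checking that the models were evaluated on the same data.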


### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/yaryna_beida?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CQNEN2434-2023-01-01">Yaryna Beida</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CQNEN2434-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0CQNEN2434-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                            |
| ----------------- | ------- | ---------- | --------------------------------------------- |
|     2023-03-18    |   1.0   |Yaryna Beida| Lab created                                   |


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
