<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="500" alt="cognitiveclass.ai logo"  />
</center>

# **Investigation of MATIC/BUSD exchange rate dynamics, calculation and analysis of individual technical financial indicators of the cryptocurrency market (ATR, OBV, RSI, AD)**

## **Lab 4. Model development**


Estimated time needed: **30** minutes

## **Objectives**

After completing this lab you will be able to:

*   Develop prediction models
*   Compare models
*   Perform model evaluation using visualization


In this section, we will develop several models that will predict **"Avg_price"** of the cryptocurrency using the technical indicators. This is just an estimate but should give us an objective idea of how much the cryptocurrency should cost.


Some questions we want to ask in this module:

* How good are models in predicting **"Avg_price"**?
* Which model predicts the best?

## **Table of Contents**

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import Data</li>
    <li>Linear Regression and Multiple Linear Regression</li>
        <ul>
            <li>Linear Regression</li>
            <li>Multiple Linear Regression</li>
        </ul>
    <li>Model Evaluation Using Visualization</li>
        <ul>
            <li>Regression Plot</li>
            <li>Residual Plot</li>
            <li>Multiple Linear Regression</li>
        </ul>
    <li>Polynomial Regression and Pipelines</li>
        <ul>
            <li>Polynomial Regression</li>
            <li>Pipeline</li>
        </ul>
    <li>Measures for In-Sample Evaluation</li>
    <ul>
        <li>$R^2$ / $R$-squared</li>
        <li>Mean Squared Error (MSE)</li>
        <li>F-test score</li>
        <li>P-value</li>
        <li>Model 1: Simple Linear Regression</li>
        <li>Model 2: Multiple Linear Regression</li>
        <li>Model 3: Polynomial Fit</li>
    </ul>
    <li>Prediction and Decision Making</li>
    <ul>
        <li>Prediction</li>
        <li>Decision Making: Determining a Good Model Fit</li>
        <li>Simple Linear Regression Model (SLR) vs Multiple Linear Regression Model (MLR)</li>
        <li>Simple Linear Model (SLR) vs. Polynomial Fit</li>
        <li>Multiple Linear Regression (MLR) vs. Polynomial Fit</li>
    </ul>
    <li>Sources</li>
</ol>

</div>

<hr>


## **Dataset Description**

### **Files**
* #### **MATICBUSD_trades_1m_preprocessed.csv** - the file contains historical changes of the pair **MATIC/BUSD** and ATR, OBV, RSI, AD indicators for the period from 11/11/2022 to 12/29/2022 with an aggregation time of 1 minute

### **Columns**

* #### `Ts` - the timestamp of the record
* #### `Open` -  the price of the asset at the beginning of the trading period
* #### `High` -  the highest price of the asset during the trading period
* #### `Low` - the lowest price of the asset during the trading period.
* #### `Close` - the price of the asset at the end of the trading period
* #### `Volume` - the total number of shares or contracts of a particular asset that are traded during a given period
* #### `Rec_count` -  the number of individual trades or transactions that have been executed during a given time period
* #### `Avg_price` - the average price at which a particular asset has been bought or sold during a given period
* #### `ATR` - average true range indicator
* #### `OBV` - on-balance volume indicator
* #### `RSI` - relative strength index indicator
* #### `AD` - accumulation / distribution indicator

In data analytics, we often use **Model Development** to help us predict future observations from the data we have. A model will help us understand the exact relationship between different variables and how these variables are used to predict the result.

# **1. Import data**

Run the following cell to install required libraries:


In [ ]:
# If you run the lab locally using Anaconda, you can install any missing libraries by uncommenting the corresponding lines below:

# ! conda install -q -y pandas
! conda install -q -y numpy
# ! conda install -q -y matplotlib
# ! conda install -q -y seaborn
# ! conda install -q -y -c anaconda statsmodels

! conda install -q -y -c anaconda scikit-learn
# ! conda install -q -y scipy

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import set_config
from scipy import stats

import warnings
from typing import Callable, Union
pd.set_option("display.precision", 5)  # number of digits shown after the decimal point
pd.options.display.float_format = "{:.5f}".format
set_config(display="diagram")
warnings.filterwarnings("ignore")

The dataset is hosted <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX087UEN/MATICBUSD_trades_1m_preprocessed.csv">HERE</a>.


In [ ]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX087UEN/MATICBUSD_trades_1m_preprocessed.csv"

Load the data and store it in dataframe `df`:


In [ ]:
df = pd.read_csv(path, parse_dates=["Ts"])

In [ ]:
df.head()

We need to drop the first 15 rows, which contain `NaN` values:

In [ ]:
df = df.dropna()

# **2. Linear Regression and Multiple Linear Regression**


## **Linear regression**

**Linear regression** is an algorithm that provides a linear relationship between an independent variable and a dependent variable to predict the outcome of future events. It is a statistical method used in data science and machine learning for predictive analysis.

The independent variable, also called the predictor or explanatory variable, is not affected by changes in the other variables. The dependent variable, in contrast, changes as the independent variable fluctuates. The regression model predicts the value of the dependent variable, which is the response or outcome variable being analyzed or studied.

Thus, linear regression is a supervised learning algorithm that simulates a mathematical relationship between variables and makes predictions for *continuous or numeric variables such as sales, salary, age, product price, etc*.

This analysis method is advantageous when at least two variables are available in the data, as observed in stock market forecasting, portfolio management, scientific analysis, etc.

A sloped straight line represents the linear regression model.

<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX087UEN/25-4.png" width="550" alt="linear regression image"  />
</center>

<center>Best Fit Line for a Linear Regression Model</center>

In the above figure,

$X$-axis = Independent variable

$Y$-axis = Output / dependent variable

Line of regression = Best fit line for a model

Here, a line is plotted that fits the given data points as closely as possible. Hence, it is called the **"best fit line"**. The goal of the linear regression algorithm is to find this best fit line seen in the above figure.

The result of Linear Regression is a **linear function** that predicts the dependent variable as a function of the independent variable.

<h3>
$$
Y: \text{Dependent Variable}\\\\
X: \text{Independent Variable}
$$
</h3>

**Linear Function**

<center>
<h3>
$
\widehat{Y} = a + b  X
$
</h3>
</center>


$a$ refers to the intercept of the regression line, in other words: the value of $Y$ when $X$ is 0 <br>
$b$ refers to the slope of the regression line, in other words: the amount by which $Y$ changes when $X$ increases by 1 unit
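
To make the roles of $a$ and $b$ concrete, here is a minimal sketch of the closed-form least-squares estimates on a tiny made-up dataset. The helper `fit_line` and the arrays `x_demo` and `y_demo` are illustrative names and values, not part of the lab data.

```python
import numpy as np


def fit_line(x, y):
    """Return the least-squares intercept a and slope b of y = a + b * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: covariance of x and y divided by the variance of x
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Intercept: the fitted line always passes through the point of means
    a = y.mean() - b * x.mean()
    return a, b


# Made-up data that lies exactly on the line y = 2 + 0.5 * x
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 + 0.5 * x_demo

a_demo, b_demo = fit_line(x_demo, y_demo)
print(f"intercept a = {a_demo:.3f}, slope b = {b_demo:.3f}")
```

Because the demo points lie exactly on a line, the estimates recover the intercept 2 and the slope 0.5.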

**Advantages:**

* Easy to understand and interpret the results.
* It is computationally efficient and requires minimal computational power.
* Can be used to identify the relationship between two variables, making it useful for predictive modeling.
* Useful when the relationship between the dependent and independent variable is linear.

**Disadvantages:**

* Assumes a linear relationship between the dependent and independent variables.
* It is sensitive to outliers, which can negatively affect the results.
* In its simple form, it cannot handle multiple independent variables.

Create the linear regression object:

In [ ]:
lm = LinearRegression()
lm

**How could "ATR" help us predict "Avg_price"?**


In the previous lab we ran a Granger causality test and concluded that **"ATR"** Granger-causes **"Avg_price"**, so we will use **"ATR"** as the independent variable.

In [ ]:
X_lr = df[["ATR"]]
Y = df["Avg_price"]

Fit the linear model using **"ATR"** and **"Avg_price"**:

In [ ]:
lm.fit(X_lr, Y)

We can output a prediction:


In [ ]:
Y_hat = lm.predict(X_lr)
Y_hat[0:5]

**What is the value of the intercept ($a$)?**

In [ ]:
lm.intercept_

**What is the value of the slope ($b$)?**

In [ ]:
lm.coef_

**What is the final estimated linear model we get?**


As we saw above, we should get a final linear model with the structure:


<center>
    <h3>
$
\widehat{Y} = a + b  X
$
    </h3>
</center>

Plugging in the actual values we get:


**Avg_price** = 0.8439822752478543 + 22.49597877 x **ATR**

Now let's do the same, but with **"OBV"** and **"AD"** as the independent variables:

In [ ]:
lm_obv = LinearRegression()

X_obv = df[["OBV"]]

lm_obv.fit(X_obv, Y)

Y_obv = lm_obv.predict(X_obv)
print(f"First 5 predictions: {Y_obv[:5]}")
print(f"Y-intercept: {lm_obv.intercept_}")
print(f"Coef: {lm_obv.coef_}")

In [ ]:
lm_ad = LinearRegression()

X_ad = df[["AD"]]

lm_ad.fit(X_ad, Y)

Y_ad = lm_ad.predict(X_ad)
print(f"First 5 predictions: {Y_ad[:5]}")
print(f"Y-intercept: {lm_ad.intercept_}")
print(f"Coef: {lm_ad.coef_}")

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question #1 a): </strong></h1>

**Create a linear regression object called `lm_rsi`.**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
lm_rsi = LinearRegression()
lm_rsi
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question #1 b):</strong></h1>

**Train the model using "RSI" as the independent variable and "Avg_price" as the dependent variable.**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
lm_rsi.fit(df[["RSI"]], df["Avg_price"])
lm_rsi
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question #1 c):</strong></h1>

**Find the slope and intercept of the model.**

</div>


Slope

In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# Slope 
lm_rsi.coef_
```

</details>


Intercept

In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# Intercept
lm_rsi.intercept_
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1><strong>Question #1 d):</strong></h1>

**What is the equation of the predicted line? Assign the equation to the `y_hat` variable.**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
y_hat = 0.00014489 * df["RSI"] + 0.8573564226902424

```

</details>


## **Multiple Linear Regression**


**What if we want to predict "Avg_price" using more than one variable?**

If we want to use more variables in our model to predict price, we can use **Multiple Linear Regression**.
Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and **two or more** predictor (independent) variables.

**Example:** Consider the task of calculating blood pressure. In this case, height, weight, and amount of exercise can be considered independent variables. Here, we can use multiple linear regression to analyze the relationship between the three independent variables and one dependent variable, as all the variables considered are quantitative.

**What Can Multiple Linear Regression Tell You?**

Simple linear regression is a function that allows an analyst or statistician to make predictions about one variable based on the information that is known about another variable. Linear regression can only be used when one has two continuous variables — an independent variable and a dependent variable. The independent variable is the parameter that is used to calculate the dependent variable or outcome. A multiple regression model extends to several explanatory variables.

**How Are Multiple Regression Models Used in Finance?**

Any econometric model that looks at more than one variable may be a multiple regression. Factor models compare two or more factors to analyze relationships between variables and the resulting performance. The Fama and French Three-Factor Model is such a model: it expands on the capital asset pricing model (**CAPM**) by adding size risk and value risk factors to the market risk factor in **CAPM** (which is itself a regression model). It reflects the observation that small-cap and value stocks tend to outperform the market. By including these two additional factors, the model adjusts for this outperforming tendency, which is thought to make it a better tool for evaluating manager performance.

Most real-world regression models involve multiple predictors. We will illustrate the structure by using four predictor variables, but the results generalize to any number of predictors:


<h3>
$$
Y: \text{Dependent Variable}\\\\
X_1: \text{Independent Variable 1}\\\\
X_2: \text{Independent Variable 2}\\\\
X_3: \text{Independent Variable 3}\\\\
X_4: \text{Independent Variable 4}\\\\
$$
</h3>

<h3>
$$
a: \text{intercept}\\\\
b_1: \text{coefficient of Variable 1}\\\\
b_2: \text{coefficient of Variable 2}\\\\
b_3: \text{coefficient of Variable 3}\\\\
b_4: \text{coefficient of Variable 4}\\\\
$$
</h3>

The equation is given by:


<h3>
$$
\widehat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
$$
</h3>
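
As a rough illustration of how the intercept and the four coefficients in this equation can be estimated, the sketch below solves the least-squares problem directly with NumPy on synthetic data. The names `X_demo`, `true_a` and `true_b` and all values are made up for the example; this is one possible way to compute the estimates, not necessarily what any particular library does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 4 predictors, generated from known parameters
X_demo = rng.normal(size=(200, 4))
true_a = 1.5
true_b = np.array([0.5, -2.0, 0.0, 3.0])
y_demo = true_a + X_demo @ true_b

# Prepend a column of ones so the first fitted parameter plays the role of a
design = np.column_stack([np.ones(len(X_demo)), X_demo])
params, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

a_hat, b_hat = params[0], params[1:]
print("intercept:", round(a_hat, 4))
print("coefficients:", np.round(b_hat, 4))
```

Since the synthetic target contains no noise, the fitted parameters match the ones used to generate the data.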

From the previous section, we know that good predictors of **"Avg_price"** could be:

* **"ATR"**
* **"OBV"**
* **"RSI"**
* **"AD"**

**Advantages:**

* Can be used to identify the relationship between multiple independent variables and a dependent variable.
* It can account for the effect of confounding variables on the dependent variable.
* Can improve the accuracy of predictions compared to simple linear regression.
* Can identify which independent variable(s) have the most impact on the dependent variable.

**Disadvantages:**

* Requires more data than simple linear regression.
* The results can be difficult to interpret when there are multiple independent variables.
* Assumes a linear relationship between the dependent and independent variables.

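Let's calculate the correlation between these indicators and the price:

In [ ]:
df.drop(columns="Ts").corr()  # "Ts" is a datetime column, so it is excluded from the correlation matrix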

Let's develop a model using these variables as the predictor variables.

In [ ]:
Z = df[["ATR", "OBV", "RSI", "AD"]]

Fit the linear model using the four above-mentioned variables.


In [ ]:
mlr = LinearRegression()
mlr.fit(Z, Y)

In [ ]:
Y_mlr = mlr.predict(Z)

**What is the value of the intercept ($a$)?**


In [ ]:
mlr.intercept_

**What are the values of the coefficients ($b_1$, $b_2$, $b_3$, $b_4$)?**


In [ ]:
mlr.coef_

**What is the final estimated linear model that we get?**


As we saw above, we should get a final linear function with the structure:

<h3>
$$
\widehat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
$$
</h3>
    
**What is the linear function we get in this example?**


**Avg_price** = 1.0480399345639373 - 0.993894360 x **ATR** + 0.00000000672461253 x **OBV** + 0.0000742896268 x **RSI** + 0.00000000578434954 x **AD**


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #2 a):**
    
**Create and train a Multiple Linear Regression model `lm2` where the dependent variable is "Avg_price", and the independent variables are "ATR" and "AD".**
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
lm2 = LinearRegression()
lm2.fit(df[["ATR", "AD"]], df["Avg_price"])


```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #2 b):**
    
**Find the coefficients of the model.**
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
lm2.coef_

```

</details>


# **3. Model Evaluation Using Visualization**


Now that we've developed some models, how do we evaluate our models and choose the best one? One way to do this is by using a visualization.


## **Regression Plot**


When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using **regression plots**.

This plot shows a combination of scattered data points (a **scatterplot**) and the fitted **linear regression** line going through the data. This gives us a reasonable estimate of the relationship between the two variables, the strength of the correlation, and its direction (positive or negative correlation).
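
The strength and direction that a regression plot shows visually can also be expressed as a single number, the Pearson correlation coefficient. The following sketch computes it with `np.corrcoef` for three small made-up series; the array names and values are illustrative only.

```python
import numpy as np

x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_rising = 2.0 * x_demo + 1.0                    # perfect positive relationship
y_falling = -0.5 * x_demo + 4.0                  # perfect negative relationship
y_noisy = np.array([2.0, 1.0, 4.0, 3.0, 5.0])    # positive, but not perfectly linear

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r_rising = np.corrcoef(x_demo, y_rising)[0, 1]
r_falling = np.corrcoef(x_demo, y_falling)[0, 1]
r_noisy = np.corrcoef(x_demo, y_noisy)[0, 1]

print(r_rising, r_falling, r_noisy)  # approximately 1.0, -1.0 and 0.8
```

The sign gives the direction of the relationship and the absolute value its strength.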


Let's visualize **"AD"** as potential predictor variable of **"Avg_price"**:


In [ ]:
width = 6
height = 5
plt.figure(figsize=(width, height))
plt.title("AD as potential predictor variable of Avg_price")
sns.regplot(x="AD", y="Avg_price", data=df, scatter_kws={"s": 1})

As we can see, the points do not lie along a straight line, so the regression line does not describe these data well.

One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for this data.

Let's compare this plot to the regression plot of **"ATR"**.


In [ ]:
plt.figure(figsize=(width, height))
plt.title("ATR as potential predictor variable of Avg_price")
sns.regplot(x="ATR", y="Avg_price", data=df, scatter_kws={"s": 1})

We can see a positive correlation and many outliers, but the line does not describe the data well.

<div class="alert alert-danger alertdanger" style="margin-top: 20px">

# **Question #3:**
    
**Given the regression plots above, is "AD" or "ATR" more strongly correlated with "Avg_price"? Use the method  `.corr()` to verify your answer.**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# The variable "AD" has a stronger correlation with "Avg_price": approximately 0.481835, compared to approximately 0.334486 for "ATR". You can verify it using the following command:

df[["AD", "ATR", "Avg_price"]].corr()

```

</details>


## **Residual Plot**

A good way to visualize the variance of the data is to use a residual plot.

**What is a residual?**

The difference between the observed value $(y)$ and the predicted value $(\widehat{y})$ is called the residual $(e)$. When we look at a regression plot, the residual is the vertical distance from the data point to the fitted regression line.

**So what is a residual plot?**

A residual plot is a graph that shows the residuals on the vertical $y$-axis and the independent variable on the horizontal $x$-axis.

**What do we pay attention to when looking at a residual plot?**

We look at the spread of the residuals:

If the points in a residual plot are **randomly spread out around the $x$-axis**, then a **linear model is appropriate** for the data.

**Why is that?** Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.
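
As a small numerical companion to the definition of a residual, the sketch below fits a straight line to synthetic data and computes the residuals by hand. The data and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a straight line plus random noise with constant variance
x_demo = np.linspace(0.0, 10.0, 50)
y_demo = 3.0 + 2.0 * x_demo + rng.normal(scale=0.5, size=x_demo.size)

# Fit a first-degree polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x_demo, y_demo, 1)
y_fit = intercept + slope * x_demo

# Residual = observed value minus predicted value
residuals = y_demo - y_fit

print("mean of residuals:", residuals.mean())
print("largest absolute residual:", np.abs(residuals).max())
```

For a least-squares line with an intercept the residuals average out to zero, so what matters in a residual plot is whether their spread shows any pattern along the $x$-axis.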

In [ ]:
width = 6
height = 5
plt.figure(figsize=(width, height))
plt.title("Residual plot of AD (x) and Avg_price (y)")
sns.residplot(x=df["AD"], y=df["Avg_price"], scatter_kws={"s": 1})
plt.show()

**What is this plot telling us?**

We can see from this residual plot that the residuals are not randomly spread around the $x$-axis, leading us to believe that maybe a non-linear model is more appropriate for this data.

## **Multiple Linear Regression**

**How do we visualize a model for Multiple Linear Regression?** This gets a bit more complicated because you can't visualize it with a regression or residual plot.

One way to look at the fit of the model is by looking at the **distribution plot**. We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.


First, let's make a prediction:


In [ ]:
Y_hat_mlr = mlr.predict(Z)

In [ ]:
temp_df = pd.DataFrame({"Avg_price": df["Avg_price"], "Fitted values": Y_hat_mlr})
# DataFrame.plot.kde creates its own figure, so no separate plt.figure() call is needed
temp_df.plot.kde(figsize=(width, height))

plt.title("Actual vs Fitted Values for Avg_price")
plt.xlabel("Price (in BUSD)")
plt.ylabel("Density")

plt.show()
plt.close()

<p>We can see that the fitted values are reasonably close to the actual values. However, there is definitely some room for improvement.</p>


# **4. Polynomial Regression and Pipelines**


## **Polynomial regression**

In statistics, **polynomial regression** is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modelled as an $n$-th degree polynomial in $x$. **Polynomial regression** is a particular case of the general linear regression model or multiple linear regression models. We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.

There are different orders of polynomial regression:

<center>Quadratic - 2nd Order</center>
<h3 style="margin-bottom: 5px; margin-top: 5px">
$$
\widehat{Y} = a + b_1 X + b_2 X^2
$$
</h3>
    
<center>Cubic - 3rd Order</center>
<h3 style="margin-top: 5px">
$$
\widehat{Y} = a + b_1 X + b_2 X^2 + b_3 X^3
$$
</h3> 

<center>Higher-Order:</center>
<h3 style="margin-top: 5px">
$$
\widehat{Y} = a + b_1 X + b_2 X^2 + b_3 X^3 + \dots + b_n X^n
$$
</h3>


**Example:** Imagine you want to predict how many likes your new social media post will have at any given point after publication. There is no linear correlation between the number of likes and the time that passes. Your new post will probably get many likes in the first 24 hours after publication, and then its popularity will decrease.

**Why do we need Polynomial Regression?**

A simple linear regression algorithm only works well when the relationship in the data is linear. If the data is non-linear, linear regression cannot draw a good best-fit line. The diagram below shows a non-linear relationship together with a linear regression fit, which performs poorly and does not come close to the actual data. Polynomial regression overcomes this problem by capturing the curvilinear relationship between the independent and dependent variables.

<center><img width="550" src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX087UEN/kawer8rc.5_(5).png" alt="poly"/></center>
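
The same effect can be reproduced numerically. The sketch below, which uses synthetic data generated from a known quadratic rather than the lab data, fits a straight line and a second-order polynomial with `np.polyfit` and compares their mean squared errors. All names and values are illustrative.

```python
import numpy as np

# Synthetic, noise-free data following a quadratic relationship
x_demo = np.linspace(-3.0, 3.0, 61)
y_demo = 1.0 + 0.5 * x_demo + 2.0 * x_demo ** 2

# Fit polynomials of degree 1 (straight line) and degree 2 (quadratic)
linear_fit = np.poly1d(np.polyfit(x_demo, y_demo, 1))
quadratic_fit = np.poly1d(np.polyfit(x_demo, y_demo, 2))

mse_linear = np.mean((y_demo - linear_fit(x_demo)) ** 2)
mse_quadratic = np.mean((y_demo - quadratic_fit(x_demo)) ** 2)

print("MSE of the linear fit:   ", mse_linear)
print("MSE of the quadratic fit:", mse_quadratic)
```

The straight line leaves a large error because it cannot bend, while the quadratic reproduces the curve almost exactly.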

**Advantages:**


* Can fit complex curves and relationships between the dependent and independent variables.
* Can improve the accuracy of predictions compared to linear regression models.
* Can be used when the relationship between the dependent and independent variables is non-linear.

**Disadvantages:**

* Can overfit the data, leading to poor performance on new data.
* The higher the degree of the polynomial, the more complex and difficult to interpret the model becomes.
* Requires more data than simple linear regression models.

We saw earlier that a linear model did not provide the best fit while using **"AD"** as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.


<p>We will use the following function to plot the data:</p>


In [ ]:
def plot_poly(model: Callable, independent_variable: pd.Series, dependent_variable: pd.Series, name: str) -> None:
    """
    Plots polynomial function
    
    Parameters
    ----------
    model: Callable
        Function which will calculate polynomial
    independent_variable: pd.Series
        Independent variable (x)
    dependent_variable: pd.Series
        Dependent variable (y)
    name: str
        Name of indicator which will be used as label of x-axis
    """
    x_new = np.linspace(min(independent_variable) - 100, max(independent_variable) + 100, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, ".", x_new, y_new, "-", markersize=1)
    plt.title("Polynomial Fit with Matplotlib for Price")
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(name)
    plt.ylabel("Avg_price")

    plt.show()
    plt.close()

Let's get the variables:


In [ ]:
X_poly = df["AD"]

Let's fit the polynomial using the function <code>polyfit</code>, then use the function <code>poly1d</code> to display the polynomial function.


In [ ]:
# Here we use a polynomial of the 3rd order (cubic) 
f = np.polyfit(X_poly, Y, 3)
p = np.poly1d(f)
Y_poly = p(X_poly)
print(p)

Let's plot the function:


In [ ]:
plot_poly(p, X_poly, Y, "AD")

In [ ]:
np.polyfit(X_poly, Y, 3)

<p>We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function  "hits" more of the data points.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

# **Question #4:**
    
**Create a 2nd-order polynomial model with the variables `X_poly` and `Y` from above.**
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# Here we use a polynomial of the 2nd order
f1 = np.polyfit(X_poly, Y, 2)
p1 = np.poly1d(f1)
print(p1)
plot_poly(p1, X_poly, Y, "AD")

```

</details>


The analytical expression for Multivariate Polynomial function gets complicated. For example, the expression for a second-order (`degree=2`) polynomial with two variables is given by:


<h3>
$$
\widehat{Y} = a + b_1 X_1 + b_2 X_2 +b_3 X_1 X_2 + b_4 X_1^2 + b_5 X_2^2
$$
</h3>
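
The six terms of the expression above can be enumerated programmatically, which also shows how quickly the number of terms grows with more variables. This is a minimal sketch using only the standard library; the helper `polynomial_terms` is a hypothetical name, and it assumes a constant (bias) term is included.

```python
from itertools import combinations_with_replacement


def polynomial_terms(names, degree):
    """List all monomials of the given variables up to the given total degree."""
    terms = []
    for d in range(degree + 1):
        # Each combination with replacement corresponds to one monomial of degree d
        for combo in combinations_with_replacement(names, d):
            terms.append(" * ".join(combo) if combo else "1")
    return terms


terms_two = polynomial_terms(["X1", "X2"], 2)
terms_four = polynomial_terms(["ATR", "OBV", "RSI", "AD"], 2)

print(terms_two)        # six terms for two variables
print(len(terms_four))  # 15 terms for four variables
```

With four input variables, a second-order expansion therefore has 15 columns: one constant, four linear terms and ten squared or interaction terms.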

We can perform a polynomial transform on multiple features. We create a <code>PolynomialFeatures</code> object of degree 2:


In [ ]:
pr = PolynomialFeatures(degree=2)
pr

In [ ]:
Z_pr = pr.fit_transform(Z)

In the original data, there are 66846 samples and 4 features.


In [ ]:
Z.shape

After the transformation, the number of samples is unchanged and there are 15 features.


In [ ]:
Z_pr.shape

## **Pipeline**

Data Pipelines simplify the steps of processing the data. We use the class `Pipeline` to create a pipeline. We also use `StandardScaler` as a step in our pipeline.
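
To make the scaling step concrete, here is a sketch of the standardization it performs, assuming the default `StandardScaler` behaviour (subtract the column mean, divide by the population standard deviation). The helper `standardize` and the array `features_demo` are illustrative names, not part of the lab.

```python
import numpy as np


def standardize(columns):
    """Center each column to mean 0 and scale it to unit population standard deviation."""
    columns = np.asarray(columns, dtype=float)
    return (columns - columns.mean(axis=0)) / columns.std(axis=0)


# Two made-up features on very different scales
features_demo = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

scaled_demo = standardize(features_demo)
print(scaled_demo)
print("column means:", scaled_demo.mean(axis=0))
print("column stds: ", scaled_demo.std(axis=0))
```

After scaling, both columns are on the same footing, which keeps features with large raw values from dominating the polynomial terms.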

We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.


In [ ]:
Input = [
    ("scale", StandardScaler()), 
    ("polynomial", PolynomialFeatures(include_bias=False)), 
    ("model", LinearRegression())
]

We input the list as an argument to the pipeline constructor:


In [ ]:
pipe = Pipeline(Input)
pipe

First, we convert `Z` to type float to avoid conversion warnings that may appear because `StandardScaler` expects float inputs.

Then we can standardize the data, perform the polynomial transform, and fit the model in a single call.


In [ ]:
Z = Z.astype(float)
pipe.fit(Z, Y)

Similarly, we can standardize the data, perform the transform, and produce a prediction in a single call.


In [ ]:
y_pipe = pipe.predict(Z)
y_pipe[0:4]

<div class="alert alert-danger alertdanger" style="margin-top: 20px">

# **Question #5:**
    
**Create a pipeline that standardizes the data, then produce a prediction using a linear regression model using the features `Z` and target `Y`.**
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
Input = [
    ("scale", StandardScaler()),
    ("model", LinearRegression())
]

pipe = Pipeline(Input)

pipe.fit(Z, Y)

y_pipe = pipe.predict(Z)
y_pipe[0:10]
```

</details>


# **5. Measures for In-Sample Evaluation**


When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.

Four very important measures that are often used in Statistics to determine the accuracy of a model are:

* **$R^2$ / R-squared**
* **Mean Squared Error (MSE)**
* **F-test score**
* **P-value**

## **R-squared**

**R-squared**, also known as the **coefficient of determination**, is a measure to indicate how close the data is to the fitted regression line.

The value of the R-squared is the percentage of variation of the response variable ($y$) that is explained by a linear model.

R-squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%. An R-squared of 100% means that all movements of a security (or another dependent variable) are completely explained by movements in the index (or the independent variable(s) you are interested in).

In finance, an R-Squared above 0.7 would generally be seen as showing a high level of correlation, whereas a measure below 0.4 would show a low correlation.
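
As a worked example of the definition, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The values below are made up for illustration.

```python
import numpy as np

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.1, 1.9, 3.2, 3.8])

# Variation left unexplained by the model
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
# Total variation of the target around its mean
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_demo = 1.0 - ss_res / ss_tot
print(r2_demo)  # approximately 0.98
```

Here about 98% of the variation in the made-up target is explained by the predictions.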

## **Mean Squared Error (MSE)**

The **Mean Squared Error** measures the average of the squared errors, where an error is the difference between the actual value ($y$) and the estimated value ($\widehat{y}$).
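
A short worked example with made-up numbers shows the calculation step by step; the array names are illustrative.

```python
import numpy as np

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.1, 1.9, 3.2, 3.8])

errors = y_true_demo - y_pred_demo   # -0.1, 0.1, -0.2, 0.2
squared_errors = errors ** 2         # 0.01, 0.01, 0.04, 0.04
mse_demo = squared_errors.mean()

print(mse_demo)  # approximately 0.025
```

Because the errors are squared, large deviations weigh more heavily than small ones, and the result is expressed in squared units of the target.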

## **F-test score**

The **F-test score** is the ratio of the between-group variability to the within-group variability, an idea that goes back to Fisher. In this lab it is computed with `stats.f_oneway`, a one-way ANOVA that treats the actual values and the predicted values as two groups and tests whether their means are equal. A score close to 0 means the group means are practically identical; a large score means they differ.
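
As a rough illustration of a between-group to within-group ratio, the sketch below computes the one-way ANOVA F statistic by hand. The helper `one_way_f` and the sample groups are made up for illustration; treat it as a sketch of the statistic under that assumption, not as the exact routine used elsewhere in the lab.

```python
import numpy as np


def one_way_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = all_values.size    # total number of observations

    # Variability of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


f_same = one_way_f([1, 2, 3, 4], [2, 2, 3, 3])   # both group means are 2.5
f_diff = one_way_f([1, 2, 3], [4, 5, 6])         # group means are 2 and 5

print(f_same, f_diff)  # 0.0 and 13.5
```

Two groups with identical means give a score of 0 regardless of how different the individual values are.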

## **P-value**

The **P-value** is the probability of obtaining a result at least as extreme as the observed one if the null hypothesis (for example, that there is no correlation between two variables) were true. Normally, we choose a significance level of 0.05, which means that we treat a result as statistically significant when its p-value is below 0.05.

By convention, when the p-value is

* $<$ 0.001: there is strong evidence that the correlation is significant.
* $<$ 0.05: there is moderate evidence that the correlation is significant.
* $<$ 0.1: there is weak evidence that the correlation is significant.
* $>$ 0.1: there is no evidence that the correlation is significant.

Let's define a function that will calculate all the needed metrics:

In [ ]:
def calculate_metrics(y_pred: Union[pd.Series, np.ndarray], y: Union[pd.Series, np.ndarray]) -> tuple:
    """
    Calculates R-squared, MSE, F-test score, P-value
    
    Parameters
    ----------
    y_pred: pd.Series or np.ndarray
        output of model
    y: pd.Series or np.ndarray
        true values of y
    
    Returns
    -------
    r2: float
        R^2 / R-squared
    mse: float
        MSE
    f_val: float
        F-test score
    p_val: float
        P-value
    """
    r2 = r2_score(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    f_val, p_val = stats.f_oneway(y, y_pred)
    
    return r2, mse, f_val, p_val

In [ ]:
temp_df = pd.DataFrame({
    "y_true": Y,
    "y_pred": Y_hat,
    "ts": df["Ts"]
})
temp_df.plot(x="ts", y=["y_true", "y_pred"], title="Comparison of true values and values predicted by SLR")
plt.ylabel("BUSD")
plt.xlabel("Time")

## **Model 1: Simple Linear Regression**

In [ ]:
calculate_metrics(Y_hat, Y)

**Conclusion:**

* **$R^2$ / R-squared:** We can say that ≈ 11.18% of the variation of **"Avg_price"** is explained by this simple linear model.
* **MSE:** We got 0.0026927796894885793.
* **F-test score:** 4.99 * $10^{-26}$, which means that the difference between the two groups being compared (true and predicted values) is not statistically significant.
* **P-value:** 1.0, which means there is no evidence that the means of the true and predicted values differ.
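
An F-test score near zero and a P-value of 1.0 are what one would expect whenever `stats.f_oneway` compares true values with least-squares predictions, regardless of how good the fit is. The sketch below illustrates this on synthetic data (not the MATIC/BUSD dataset), assuming an ordinary least-squares model with an intercept: such a model's in-sample predictions have the same mean as the targets, so the between-group variance vanishes. All variable names are hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with a weak signal and a lot of noise (assumption)
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = 0.3 * x[:, 0] + rng.normal(size=500)

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

r2 = r2_score(y, y_pred)                    # low: the fit is poor
mean_gap = abs(y.mean() - y_pred.mean())    # essentially zero
f_val, p_val = stats.f_oneway(y, y_pred)    # F is essentially zero as well

print(f"R2={r2:.4f}, mean gap={mean_gap:.2e}, F={f_val:.2e}, p={p_val}")
```

Because the test only compares group means, it cannot distinguish a poor fit from a good one here, which is why R-squared and MSE carry the model comparison.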

## **Model 2: Multiple Linear Regression**

In [ ]:
calculate_metrics(Y_mlr, Y)

**Conclusion:**

* **$R^2$ / R-squared:** We can say that ≈ 78.91% of the variation of **"Avg_price"** is explained by this multiple linear model.
* **MSE:** We got 0.0006392628534897521.
* **F-test score:** 3.7 * $10^{-27}$, which means that the difference between the two groups being compared (true and predicted values) is not statistically significant.
* **P-value:** 1.0, which means there is no evidence that the means of the true and predicted values differ.

## **Model 3: Polynomial Fit**

In [ ]:
calculate_metrics(Y_poly, Y)

**Conclusion:**

* **$R^2$ / R-squared:** We can say that ≈ 29.29% of the variation of **"Avg_price"** is explained by this polynomial model.
* **MSE:** We got 0.0021438907476508517.
* **F-test score:** 8.02 * $10^{-23}$, which means that the difference between the two groups being compared (true and predicted values) is not statistically significant.
* **P-value:** 1.0, which means there is no evidence that the means of the true and predicted values differ.

# **6. Prediction and Decision Making**

## **Prediction**

In the previous section, we trained the model using the method `fit`. Now we will use the method `predict` to produce a prediction.

In [ ]:
# Generating simple input
new_input = np.linspace(min(df["ATR"]), max(df["ATR"]), len(df)).reshape(-1, 1)
new_input

Fit the model:

In [ ]:
lm.fit(X_lr, Y)
lm

Produce a prediction:


In [ ]:
yhat = lm.predict(new_input)
yhat[0:5]

As expected for a simple linear model, the predictions form a straight line.


In [ ]:
plt.plot(new_input, yhat)
plt.title("Prediction based on ATR")
plt.xlabel("ATR")
plt.ylabel("Avg_price")
plt.show()
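
The straight line follows directly from how a fitted linear model predicts: every output is the intercept plus the slope times the input. A minimal sketch on made-up numbers standing in for the **"ATR"** and **"Avg_price"** columns (the data and the names `lm_demo`, `grid`, `manual` are assumptions, not part of the lab's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up stand-ins for "ATR" (x) and "Avg_price" (y)
x = np.array([0.01, 0.02, 0.03, 0.04, 0.05]).reshape(-1, 1)
y = np.array([0.80, 0.83, 0.85, 0.88, 0.90])

lm_demo = LinearRegression().fit(x, y)

# Evenly spaced inputs between the smallest and largest x, as in the cell above
grid = np.linspace(x.min(), x.max(), 20).reshape(-1, 1)
pred = lm_demo.predict(grid)

# The same predictions computed by hand from the fitted parameters
manual = lm_demo.intercept_ + lm_demo.coef_[0] * grid[:, 0]

# The slope between consecutive points is constant, so the points lie on a line
slopes = np.diff(pred) / np.diff(grid[:, 0])
print(lm_demo.intercept_, lm_demo.coef_[0])
```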

## **Decision Making: Determining a Good Model Fit**

Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?

* *What is a good R-squared value?*

When comparing models, **the model with the higher R-squared value is a better fit** for the data.

* *What is a good MSE?*

When comparing models, **the model with the smallest MSE value is a better fit** for the data.

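* *What is a good F-test score?*

As computed here with `stats.f_oneway` on the true and predicted values, **a smaller F-test value means the mean of the predictions is closer to the mean of the true values**. It does not measure how well the individual predictions match.

* *What is a good P-value?*

For the same test, **a higher P-value means there is less evidence that the two means differ**. Like the F-test score, it does not measure the point-by-point fit.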
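
When two models are evaluated on the same target values, R-squared and MSE always rank them the same way, because $R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y)$ with the population variance of the true values. The sketch below checks this identity on made-up numbers; the arrays and names are assumptions used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and two hypothetical sets of predictions
y_true = np.array([0.80, 0.83, 0.85, 0.88, 0.90])
pred_a = np.array([0.81, 0.82, 0.86, 0.87, 0.91])   # closer to y_true
pred_b = np.array([0.84, 0.80, 0.88, 0.85, 0.93])   # further from y_true

var_y = np.var(y_true)  # population variance (ddof=0)

results = {}
for name, pred in [("A", pred_a), ("B", pred_b)]:
    mse = mean_squared_error(y_true, pred)
    r2 = r2_score(y_true, pred)
    r2_from_mse = 1 - mse / var_y  # the identity stated above
    results[name] = (mse, r2, r2_from_mse)
    print(name, mse, r2, r2_from_mse)
```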

Let's take a look at the values for the different models.

We prefer R-squared and MSE when comparing models.

Simple Linear Regression: Using **"ATR"** as an independent variable of **"Avg_price"**.

* **R-squared:** 0.1118
* **MSE:** 0.002693
* **F-test score:** 4.99 * $10^{-26}$
* **P-value:** 1.0

Multiple Linear Regression: Using **"ATR"**, **"OBV"**, **"RSI"** and **"AD"** as predictor variables of **"Avg_price"**.

* **R-squared:** 0.7891
* **MSE:** 0.000639
* **F-test score:** 3.7 * $10^{-27}$
* **P-value:** 1.0

Polynomial Fit: Using **"ATR"**, **"OBV"**, **"RSI"** and **"AD"** as predictor variables of **"Avg_price"**.

* **R-squared:** 0.2929
* **MSE:** 0.002144
* **F-test score:**  8.02 * $10^{-23}$
* **P-value:** 1.0
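
The reported values can be collected into one table so that the ranking is easy to read. The numbers below are copied from the model conclusions in this lab; the DataFrame layout and the names `summary`, `ranked`, `best_by_mse` and `best_by_r2` are just one possible arrangement.

```python
import pandas as pd

# R-squared and MSE as reported for each model in this lab
summary = pd.DataFrame(
    {
        "r2": [0.1118, 0.7891, 0.2929],
        "mse": [
            0.0026927796894885793,
            0.0006392628534897521,
            0.0021438907476508517,
        ],
    },
    index=["SLR", "MLR", "Polynomial"],
)

# Lower MSE is better, so sort ascending by MSE
ranked = summary.sort_values("mse")
best_by_mse = summary["mse"].idxmin()
best_by_r2 = summary["r2"].idxmax()

print(ranked)
print("Best by MSE:", best_by_mse, "| Best by R-squared:", best_by_r2)
```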

## **Simple Linear Regression Model (SLR) vs Multiple Linear Regression Model (MLR)**

Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful and even act as noise. As a result, you should always check the MSE and $R^2$ / R-squared.

In order to compare the results of the MLR vs SLR models, we look at a combination of both the R-squared and MSE to make the best conclusion about the fit of the model.

* **MSE:** The MSE of SLR is 0.002693 while MLR has an MSE of 0.000639. The MSE of MLR is much smaller.
* **R-squared:** The R-squared for the SLR (≈ 0.1118) is much smaller than the R-squared for the MLR (≈ 0.7891).

Together, the R-squared and the MSE show that MLR is the better fit in this case compared to SLR.

## **Simple Linear Model (SLR) vs. Polynomial Fit**

* **MSE:** We can see that Polynomial Fit brought down the MSE, since this MSE is smaller than the one from the SLR.
* **R-squared:** The R-squared for the Polynomial Fit is larger than the R-squared for the SLR, so the Polynomial Fit also brought up the R-squared quite a bit.

Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that it fits better than the simple linear regression for predicting **"Avg_price"** with **"ATR"** as a predictor variable.


## **Multiple Linear Regression (MLR) vs. Polynomial Fit**

* **MSE:** The MSE for the MLR is smaller than the MSE for the Polynomial Fit.
* **R-squared:** The R-squared for the MLR is larger than for the Polynomial Fit.

# **Conclusion:**

Comparing these three models, we conclude that **the MLR model is the best model** for predicting **"Avg_price"** from our dataset. This result makes sense since we have 4 technical indicators in total and we know that more than one of those indicators is a potential predictor of **"Avg_price"**.

# **7. Sources:**

<ul>
    <li><a href="https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-linear-regression" target="_blank">https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-linear-regression</a></li>
    <li><a href="https://www.investopedia.com/terms/m/mlr.asp" target="_blank">https://www.investopedia.com/terms/m/mlr.asp</a></li>
</ul>

# **Thank you for completing this lab!**

## Author

<a href="https://author.skills.network/instructors/borys_melnychuk" target="_blank" >Borys Melnychuk</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>



## Change Log

| Date (YYYY-MM-DD) | Version | Changed By      | Change Description                                         |
| ----------------- | ------- | ----------------| ---------------------------------------------------------- |
|     2023-03-18    |   1.0   | Borys Melnychuk | Creation of the lab                                        |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
