<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="500" alt="cognitiveclass.ai logo">
</center>

# **Investigation of MATIC/BUSD exchange rate dynamics, calculation and analysis of individual technical financial indicators of the cryptocurrency market (ATR, OBV, RSI, AD)**

## **Lab 5. Model Evaluation and Refinement**

## **The tasks**
* Evaluate models using functions for plotting, training and testing, and cross-validation;
* Analyze overfitting and underfitting, and select a model;
* Build a Ridge Regression model;
* Use Grid Search.

Estimated time needed: **30** minutes

## **Objectives**

After completing this lab you will be able to:

* evaluate and refine prediction models;
* use Cross-Validation;
* build a Ridge Regression model;
* use Grid Search.


## **Table of Contents**

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li>Import Data</li>
    <li>Model Evaluation</li>
    <ul>
        <li>Functions for Plotting</li>
        <li>Training and Testing</li>
        <li>Cross-Validation</li>
    </ul>
    <li>Overfitting, Underfitting and Model Selection</li>
    <ul>
       <li>Overfitting</li> 
    </ul>
    <li>Ridge Regression</li>
    <li>Grid Search</li>
    <li>Sources</li>
</ol>

</div>

<hr>


## **Dataset Description**

### **Files**
* #### **MATICBUSD_trades_1m_preprocessed.csv** - the file contains historical changes of the pair **MATIC/BUSD** and ATR, OBV, RSI, AD indicators for the period from 11/11/2022 to 12/29/2022 with an aggregation time of 1 minute. **MATIC/BUSD** - the exchange rate of **MATIC** cryptocurrency to **BUSD** cryptocurrency

### **Columns**

* #### `Ts` - the timestamp of the record
* #### `Open` -  the price of the asset at the beginning of the trading period
* #### `High` -  the highest price of the asset during the trading period
* #### `Low` - the lowest price of the asset during the trading period.
* #### `Close` - the price of the asset at the end of the trading period
* #### `Volume` - the total number of shares or contracts of a particular asset that are traded during a given period
* #### `Rec_count` -  the number of individual trades or transactions that have been executed during a given time period
* #### `Avg_price` - the average price at which a particular asset has been bought or sold during a given period
* #### `ATR` - average true range indicator
* #### `OBV` - on-balance volume indicator
* #### `RSI` - relative strength index indicator
* #### `AD` - accumulation / distribution indicator


# **1. Import Data**


Run the following cell to install required libraries:


In [ ]:
# install specific version of libraries used in lab
# ! conda install -q -y pandas
! conda install -q -y numpy
! conda install -q -y -c anaconda scikit-learn
# ! conda install -q -y -c anaconda ipywidgets
# ! conda install -q -y tqdm

In [ ]:
import pandas as pd
import numpy as np
from ipywidgets import interact, interactive, fixed, interact_manual

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV, train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error
from sklearn import set_config

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import warnings

warnings.filterwarnings("ignore")
set_config(display="diagram")

# setting hyperparameters
WIDTH, HEIGHT = 12, 10

The dataset is hosted <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08XFEN/MATICBUSD_trades_1m_preprocessed.csv">here</a>.


In [ ]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08XFEN/MATICBUSD_trades_1m_preprocessed.csv"

Let's read and print the dataset


In [ ]:
df = pd.read_csv(path)
df.head()

The first 15 rows contain `NaN` values, so we drop them:


In [ ]:
df = df.dropna()

# **2. Model evaluation**


## **Functions for Plotting**


Let's define 2 functions: <code>dist_plot</code> (for plotting distributions) and <code>poly_plot</code> (for plotting Polynomial Regression)


In [ ]:
def dist_plot(y: pd.Series, y_pred: pd.Series, title: str) -> None:
    """
    Plots distribution plot of `y` and `y_pred` with title = `title`
    
    Parameters
    ----------
    y: pd.Series
        True values
    y_pred: pd.Series
        Predicted values
    title: str
        The title of plot
    """
    temp_df = pd.DataFrame({"y": y, "y_pred": y_pred})
    # DataFrame.plot.kde creates its own figure, so a separate plt.figure() call
    # would only leave an empty figure behind
    temp_df.plot.kde(figsize=(WIDTH, HEIGHT))

    plt.title(title)
    plt.xlabel("Price (in BUSD)", labelpad=0.02)
    plt.ylabel("Density")

    plt.show()

In [ ]:
def poly_plot(x_train: pd.Series, x_test: pd.Series, y_train: pd.Series, y_test: pd.Series, lr: LinearRegression, poly_transform: PolynomialFeatures, indicator: str) -> None:
    """
    Plots plot of polynomial regression
    
    Parameters
    ----------
    x_train: pd.Series
        Train x
    x_test: pd.Series
        Test x
    y_train: pd.Series
        Train y
    y_test: pd.Series
        Test y
    lr: sklearn.linear_model.LinearRegression
        Model of LinearRegression
    poly_transform: sklearn.preprocessing.PolynomialFeatures
        Fitted polynomial feature transformer
    indicator: str
        x-label
    """
    plt.figure(figsize=(WIDTH, HEIGHT))
 
    x_max = max([x_train.values.max(), x_test.values.max()])
    x_min = min([x_train.values.min(), x_test.values.min()])
    x = np.arange(x_min, x_max, 100)

    plt.plot(x_train, y_train, "ro", label="Training Data")
    plt.plot(x_test, y_test, "go", label="Test Data")
    plt.plot(x, lr.predict(poly_transform.transform(x.reshape(-1, 1))), label="Predicted Function")
    
    plt.title("Plot of train, test data and predicted function")
    plt.xlabel(indicator)
    plt.ylabel("Price")
    plt.legend()

## **Training and Testing**

<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08XFEN/PartitionTwoSets.svg" alt="train-test-dataset">
</center>

<center>We split the dataset into 2 parts. One for training, the other for testing</center>


Splitting a dataset into **training** and **testing** datasets is a critical step in developing machine learning models. Here are some reasons why:

1. Avoiding overfitting: Machine learning models aim to find patterns in data and learn from them. However, if the model becomes too complex, it may fit the training data too closely, capturing noise in the data rather than generalizing to new data. Splitting the dataset into training and testing sets allows us to assess whether the model is overfitting to the training data.

2. Evaluating model performance: Splitting the dataset into training and testing sets allows us to evaluate the performance of the model on new, unseen data. By testing the model on the testing set, we can assess how well the model generalizes to new data and avoid overestimating the model's performance.

3. Hyperparameter tuning: Splitting the dataset into training and validation sets allows us to tune the hyperparameters of the model. Hyperparameters are parameters that are not learned from the data but are set before training the model, such as the learning rate or the number of layers in a neural network. By evaluating the model's performance on the validation set, we can tune the hyperparameters to improve the model's performance.
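
The three roles above (training, validation, testing) can be sketched with two calls to `train_test_split`. This is a minimal illustration on synthetic, time-ordered numbers, not the lab's dataset; the names `X_demo`, `y_demo` and the chosen proportions are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, time-ordered data standing in for a real dataset (assumption)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# First hold out the last 20% of the rows as the test set, keeping the order
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)
# Then hold out the last 25% of the remainder (20% of the total) for validation
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, shuffle=False
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

With `shuffle=False` every validation row comes after every training row, and every test row comes after every validation row, which is usually what we want for time series.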


**Overfitting** will be discussed a bit later. Let's assign target value to <code>y_data</code>


In [ ]:
y_data = df["Avg_price"]

Drop the target column and keep the remaining columns in the dataframe <code>x_data</code>:


In [ ]:
x_data = df.drop("Avg_price", axis=1)

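Now, we split our data into training and testing data using the function <code>train_test_split</code>. By default this function splits arrays or matrices into random train and test subsets; because our data is a time series, we pass <code>shuffle=False</code> so that the rows keep their order and the test set is the last part of the data.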


In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, shuffle=False)

print("Number of test samples :", x_test.shape[0])
print("Number of training samples:", x_train.shape[0])

The <code>test_size</code> parameter sets the proportion of data that is split into the testing set. In the above, the testing set is 10% of the total dataset.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

# **Question  #1):**

**Use the function `train_test_split` to split up the dataset such that 40% of the data samples will be utilized for testing and `shuffle=False`. Print the shape of train and test $x$**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.4, shuffle=False) 
print("Number of test samples :", x_test1.shape[0])
print("Number of training samples:", x_train1.shape[0])
```

</details>


We create a Linear Regression object:


In [ ]:
lre = LinearRegression()
lre

We fit the model using the feature **"AD"**:


In [ ]:
lre.fit(x_train[["AD"]], y_train)

Let's calculate the $R^2$ on the test data:


In [ ]:
lre.score(x_test[["AD"]], y_test)

We can see that the $R^2$ on the test data is negative. $R^2$ is negative when the chosen model does not follow the trend of the data, so it fits worse than a horizontal line at the mean of the target.


In [ ]:
lre.score(x_train[["AD"]], y_train)
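
To see how $R^2$ can become negative, recall its definition: $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the target from its mean. The sketch below uses toy values (an assumption, not data from this lab) in which the predictions trend in the opposite direction to the truth.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values (assumption): the predictions go down while the truth goes up
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([4.0, 3.0, 2.0, 1.0])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)         # residual sum of squares: 20
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # spread around the mean: 5
r2_manual = 1 - ss_res / ss_tot                           # 1 - 20 / 5 = -3

print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

Whenever $SS_{res} > SS_{tot}$, predicting the mean of the target for every sample would have been better than the model, and $R^2$ drops below zero.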

Because $R^2$ alone is hard to interpret here, let's also calculate the $MSE$ on both datasets and compare the values:


In [ ]:
mean_squared_error(y_train, lre.predict(x_train[["AD"]]))

In [ ]:
mean_squared_error(y_test, lre.predict(x_test[["AD"]]))

The $MSE$ on the training data is about half the $MSE$ on the test data, which means the model's loss is lower on the training data than on the test data.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

# **Question  #2):**
    
**Split dataset (40% for testing, 60% for training, `shuffle=False`). Create LinearRegression `lre2` and train on the train data. Find the $R^2$ and $MSE$ on the test data using 40% of the dataset for testing.**
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train2, x_test2, y_train2, y_test2 = train_test_split(x_data, y_data, test_size=0.4, shuffle=False)
lre2 = LinearRegression()
lre2.fit(x_train2[["AD"]], y_train2)
print(f'R-squared: {lre2.score(x_test2[["AD"]], y_test2)}, MSE: {mean_squared_error(y_test2, lre2.predict(x_test2[["AD"]]) ) }')
```

</details>


Sometimes you do not have sufficient testing data; as a result, you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation.


## **Cross-Validation**


<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08XFEN/1676051591395.jpg" alt="cross-validation">
</center>

<center>An example of 4-fold cross-validation</center>


**Cross-validation**, sometimes called **rotation estimation** or **out-of-sample testing**, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. **Cross-validation** is a resampling method that uses different portions of the data to test and train a model on different iterations. 

It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (called the validation dataset or testing set). 

The goal of **cross-validation** is to test the model's ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).
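
As a rough sketch of what happens under the hood, the loop below performs 4-fold cross-validation by hand and compares the result with `cross_val_score`. The data is synthetic and the names `X_demo`, `y_demo`, `manual_scores` are assumptions made for the illustration. For a regressor and an integer `cv`, scikit-learn uses `KFold` without shuffling, which is what the manual loop reproduces.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption): a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=40)

# Manual 4-fold cross-validation: train on 3 folds, score R^2 on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=4).split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# The same procedure in one call
library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=4)
print(np.allclose(manual_scores, library_scores))
```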


We input the object, the feature (**"AD"**), and the target data (`y_data`). The parameter `cv` determines the number of folds. In this case, it is 4.


In [ ]:
R_cross = cross_val_score(lre, x_data[["AD"]], y_data, cv=4, scoring="r2")

The default scoring for a regressor is $R^2$. Each element in the array is the $R^2$ value for one fold:


In [ ]:
R_cross

Because the model predicts the data poorly, we get negative $R^2$ scores. We can calculate the average and standard deviation of our estimate:


In [ ]:
print("The mean of the folds is", R_cross.mean(), "and the standard deviation is", R_cross.std())

<div class="alert alert-danger alertdanger" style="margin-top: 20px">

# **Question  #3):**

**Calculate the average $R^2$ using two folds, then find the average $R^2$ for the second fold utilizing the "AD" feature:**
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
R_cross_2 = cross_val_score(lre, x_data[["AD"]], y_data, cv=2)
print("Average R^2:", R_cross_2.mean())
print("R^2 of the second fold:", R_cross_2[1])
```

</details>


You can also use the function <code>cross_val_predict</code> to predict the output. The function splits the data into the specified number of folds; each fold is used once for testing while the remaining folds are used for training.


We input the object, the feature **"AD"**, and the target data `y_data`. The parameter <code>cv</code> determines the number of folds. In this case, it is 4. We can produce an output:


In [ ]:
y_cvp = cross_val_predict(lre, x_data[["AD"]], y_data, cv=4)
y_cvp[0:5]
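
Each value returned by `cross_val_predict` is an out-of-fold prediction: it comes from a model that did not see that sample during training. A small sketch on synthetic data (the names `X_demo`, `y_demo`, `oof_pred` are assumptions for the example) rebuilds the predictions of the first fold by hand to illustrate this.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data (assumption)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(20, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.3, size=20)

# One prediction per sample, each made by a model trained on the other folds
oof_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=4)

# Rebuild the predictions for the first fold by hand
train_idx, test_idx = next(KFold(n_splits=4).split(X_demo))
first_fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
manual_pred = first_fold_model.predict(X_demo[test_idx])

print(oof_pred.shape, np.allclose(oof_pred[test_idx], manual_pred))
```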

Let's plot our predicted values and compare them to actual ones


In [ ]:
plt.figure(figsize=(WIDTH, HEIGHT))
plt.plot(df[["AD"]], y_data, label="Avg_price")
plt.plot(df[["AD"]], y_cvp, label="Avg_price predicted")
plt.title("Plot of actual Avg_price vs predicted by cv")
plt.xlabel("AD")
plt.ylabel("Price (in BUSD)")
plt.legend()

As we can see, our model predicts the data points poorly, as we mentioned above.


# **3. Overfitting, Underfitting and Model Selection**

It turns out that the test data, sometimes referred to as the "out of sample data", is a much better measure of how well your model performs in the real world.  One reason for this is overfitting.

Let's go over some examples. It turns out these differences are more apparent in Multiple Linear Regression and Polynomial Regression so we will explore overfitting in that context.


Let's create Multiple Linear Regression objects and train the model using **"ATR"**, **"OBV"**, **"RSI"**, **"AD"** as features.


In [ ]:
lr = LinearRegression()
lr.fit(x_train[["ATR", "OBV", "RSI", "AD"]], y_train)

Prediction using training data:


In [ ]:
y_lr_train = lr.predict(x_train[["ATR", "OBV", "RSI", "AD"]])
y_lr_train[0:5]

Prediction using test data:


In [ ]:
y_lr_test = lr.predict(x_test[["ATR", "OBV", "RSI", "AD"]])
y_lr_test[0:5]

Let's perform some model evaluation using our training and testing data separately. Let's examine the distribution of the predicted values of the training data.


In [ ]:
title = "Distribution plot of predicted values using training data"
dist_plot(y_train, y_lr_train, title)

**Figure 1:** Plot of predicted values using the training data compared to the actual values of the training data.


So far, the model seems to be doing well in learning from the training dataset. But what happens when the model encounters new data from the testing dataset? When the model generates new values from the test data, we see the distribution of the predicted values is much different from the actual target values.


In [ ]:
title = "Distribution plot of predicted values using test data"
dist_plot(y_test, y_lr_test, title)

**Figure 2:** Plot of predicted value using the test data compared to the actual values of the test data.


Comparing **Figure 1** and **Figure 2**, we can see that the predicted and actual distributions are much closer to each other in **Figure 1** than in **Figure 2**. The difference in **Figure 2** is most apparent in the ranges $(0.76; 0.81)$ and $(0.82; 0.85)$, where the shapes of the two distributions differ. Let's see if polynomial regression also exhibits a drop in prediction accuracy when analysing the test dataset.


## **Overfitting**

<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08XFEN/6360ef2568a0381c60b26049_overfitting-and-underfitting-in-machine-learning-1.png" alt="overfitting">
</center>

<center>The example of overfitting</center>


**Overfitting** is an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data. When data scientists use machine learning models for making predictions, they first train the model on a known data set. Then, based on this information, the model tries to predict outcomes for new data sets. An overfit model can give inaccurate predictions and cannot perform well for all types of new data.

**Overfitting** occurs when the model fits the noise, but not the underlying process. Therefore, when testing your model using the test set, your model does not perform as well since it is modelling noise, not the underlying process that generated the relationship. Let's create a degree 5 polynomial model.
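
Before working with the real data, here is a small sketch of the effect on synthetic data where the true relationship is linear. All names and values below are assumptions chosen for the illustration. A more flexible polynomial can never fit the training points worse than a straight line, but its error on held-out points typically grows because it starts to follow the noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): linear signal plus noise
rng = np.random.default_rng(42)
x_demo = np.sort(rng.uniform(-1, 1, size=30)).reshape(-1, 1)
y_demo = 2.0 * x_demo[:, 0] + rng.normal(scale=0.3, size=30)

# Alternate points between train and test so both cover the same range
x_tr, y_tr = x_demo[::2], y_demo[::2]
x_te, y_te = x_demo[1::2], y_demo[1::2]

results = {}
for degree in (1, 10):
    pf = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(pf.fit_transform(x_tr), y_tr)
    results[degree] = (
        mean_squared_error(y_tr, model.predict(pf.transform(x_tr))),  # train MSE
        mean_squared_error(y_te, model.predict(pf.transform(x_te))),  # test MSE
    )
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, test MSE={results[degree][1]:.4f}")
```

A large gap between a low train MSE and a much higher test MSE is the usual sign of overfitting.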


Let's use 55 percent of the data for training and the rest for testing:


In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, shuffle=False)

We will perform a degree 5 polynomial transformation on the feature "AD".


In [ ]:
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train[["AD"]])
x_test_pr = pr.transform(x_test[["AD"]])  # reuse the transformer fitted on the training data
pr
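
To make the transformation concrete, here is a tiny sketch with a degree 3 transformer applied to two made-up values (an assumption for illustration). For a single feature $x$, `PolynomialFeatures` produces the columns $1, x, x^2, \dots$ up to the chosen degree.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up samples of a single feature (assumption)
x_small = np.array([[2.0], [3.0]])

# Columns of the result: 1, x, x^2, x^3
expanded = PolynomialFeatures(degree=3).fit_transform(x_small)
print(expanded)  # rows: [1, 2, 4, 8] and [1, 3, 9, 27]
```

A linear regression fitted on these columns is a polynomial regression in the original feature.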

Now, let's create a Linear Regression model `poly` and train it.


In [ ]:
poly = LinearRegression()
poly.fit(x_train_pr, y_train)

We can see the output of our model using the method <code>predict</code>. We assign the values to `y_poly`.


In [ ]:
y_poly = poly.predict(x_test_pr)
y_poly[0:5]

Let's take the first five predicted values and compare them to the actual targets.


In [ ]:
print("Predicted values:", y_poly[0:5])
print("True values:", y_test[0:5].values)

We will use the function <code>poly_plot</code> that we defined at the beginning of the lab to display the training data, testing data, and the predicted function.


In [ ]:
poly_plot(x_train[["AD"]], x_test[["AD"]], y_train, y_test, poly, pr, "AD")

**Figure 3:** A polynomial regression model where red dots represent training data, green dots represent test data, and the blue line represents the model prediction.


Let's calculate $R^2$ for the training and test datasets.


$R^2$ of the training data:


In [ ]:
poly.score(x_train_pr, y_train)

$R^2$ of the test data:


In [ ]:
poly.score(x_test_pr, y_test)

We got a negative $R^2$ score on the test dataset again. This means that our model predicts unseen data points poorly.


Let's see how the $MSE$ changes on the test data for different order polynomials and then plot the results:


In [ ]:
mse_test = []

order = [1, 2, 3, 4]
for n in order:
    pr = PolynomialFeatures(degree=n)
    x_train_pr = pr.fit_transform(x_train[["AD"]])
    x_test_pr = pr.transform(x_test[["AD"]])
    
    lrt = LinearRegression()
    lrt.fit(x_train_pr, y_train)
    
    mse = mean_squared_error(y_test, lrt.predict(x_test_pr) )
    mse_test.append(mse)

plt.figure(figsize=(WIDTH, HEIGHT))
plt.plot(order, mse_test)
plt.xlabel("Order")
plt.ylabel("MSE")
plt.title("MSE using test data")

We see that the $MSE$ on the test data increases as the order of the polynomial increases.
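
As a side note, the transform-then-fit steps can be bundled into a single estimator with `make_pipeline`, so that the transformer is always fitted on the training data only. The sketch below uses synthetic data, and the names `X_demo`, `poly_model`, `test_r2` are assumptions; it is an alternative way to organise the code, not the approach used in the rest of this lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): quadratic signal plus noise
rng = np.random.default_rng(7)
X_demo = rng.uniform(-1, 1, size=(50, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.05, size=50)

X_tr, X_te = X_demo[:40], X_demo[40:]
y_tr, y_te = y_demo[:40], y_demo[40:]

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_tr, y_tr)              # fits transformer and regressor on training data only
test_r2 = poly_model.score(X_te, y_te)  # transforms, predicts and computes R^2 in one call
print(round(test_r2, 3))
```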


The following function will be used in the next section. Please run the cell below.


In [ ]:
def f(order: int, test_size: float, indicator: str) -> None:
    """
    Plots train, test data and predicted function with test_size = `test_size` and order = `order`
    
    Parameters
    ----------
    order: int
        Degree of the polynomial
    test_size: float
        Proportion of the dataset used for testing
    indicator: str
        Label of the x-axis (the model itself is fitted on "AD")
    """
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_size, shuffle=False)
    pr = PolynomialFeatures(degree=order)
    x_train_pr = pr.fit_transform(x_train[["AD"]])
    x_test_pr = pr.transform(x_test[["AD"]])
    
    poly = LinearRegression()
    poly.fit(x_train_pr, y_train)
    
    poly_plot(x_train[["AD"]], x_test[["AD"]], y_train, y_test, poly, pr, indicator)
    plt.show()

The following interface allows you to experiment with different polynomial orders and different amounts of data.


In [ ]:
interact(f, order=(0, 6, 1), test_size=(0.05, 0.95, 0.05), indicator="AD")

We can move the sliders to change the order and the test size and see how the predicted function changes. As we increase the order, the function becomes more complex. As we increase <code>test_size</code>, the number of green (test) points increases and the number of red (training) points decreases.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #4 a):**

**We can perform polynomial transformations with more than one feature. Create a `PolynomialFeatures` object `pr1` of degree two.**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
pr1 = PolynomialFeatures(degree=2)

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #4 b):**

**Transform the training and testing samples for the features "ATR", "OBV", "RSI", "AD". Hint: use the method `fit_transform` on the training samples and `transform` on the testing samples.**
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train_pr1 = pr1.fit_transform(x_train[["ATR", "OBV", "RSI", "AD"]])
x_test_pr1 = pr1.transform(x_test[["ATR", "OBV", "RSI", "AD"]])
```

</details>



<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #4 c):**

**How many dimensions does the new feature have?**

**Hint: use the attribute `.shape`.**
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# there are now 15 features
x_train_pr1.shape
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #4 d):**

**Create a linear regression model `poly1`. Train the object using the method `fit` using the polynomial features.**
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
poly1 = LinearRegression()
poly1.fit(x_train_pr1, y_train)
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    
# **Question  #4 e):**
    
**Use the method  `predict` to predict an output on the polynomial features, then use the function `dist_plot` to display the distribution of the predicted test output vs the actual test data.**
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
y_poly1 = poly1.predict(x_test_pr1)

title = "Distribution plot of predicted values using test data"

dist_plot(y_test, y_poly1, title)

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

# **Question  #4 f):**

**Using the distribution plot above, describe (in words) the two regions where the predicted prices deviate most from the actual prices.**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# 1. 0.75 - 0.81 We see how `y_pred` is much lower than `y`
# 2. 0.81 - 0.89 `y_pred` increased too much
```

</details>


# **4. Ridge Regression**


**Ridge regression** is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also known as Tikhonov regularization, named for Andrey Tikhonov, it is a method of regularization of ill-posed problems. It is particularly useful to mitigate the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters. In general, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias (see bias–variance tradeoff).
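
For intuition, ridge regression without an intercept has a closed-form solution, $w = (X^T X + \alpha I)^{-1} X^T y$. The sketch below checks this formula against scikit-learn on synthetic data with two nearly collinear columns. The helper `ridge_closed_form` and the data are assumptions made for the illustration, and `fit_intercept=False` is used only to keep the formula short; the models in this lab use the default intercept.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption) with two nearly collinear columns
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 3))
X_demo[:, 2] = X_demo[:, 1] + rng.normal(scale=0.01, size=50)
y_demo = X_demo @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=50)

def ridge_closed_form(X, y, alpha):
    """Hypothetical helper: solve (X^T X + alpha * I) w = X^T y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in (0.1, 1.0, 10.0):
    w_manual = ridge_closed_form(X_demo, y_demo, alpha)
    w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_
    print(alpha, np.round(w_manual, 3), np.allclose(w_manual, w_sklearn))
```

Larger values of alpha shrink the coefficient vector towards zero, which is the bias accepted in exchange for lower variance.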


In this section, we will review Ridge Regression and see how the parameter alpha changes the model. Just a note, here our test data will be used as validation data.


Let's perform a degree two polynomial transformation on our data.


In [ ]:
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train[["ATR", "OBV", "RSI", "AD"]])
x_test_pr = pr.transform(x_test[["ATR", "OBV", "RSI", "AD"]])

Let's create a <code>Ridge</code> regression object, setting the regularization parameter (alpha) to 1


In [ ]:
rm = Ridge(alpha=1)

Like regular regression, you can fit the model using the method <code>fit</code>


In [ ]:
rm.fit(x_train_pr, y_train)

Similarly, you can obtain a prediction:


In [ ]:
y_pr = rm.predict(x_test_pr)

Let's compare the first five predicted samples to our test set:


In [ ]:
print("Predicted:", y_pr[0:5])
print("Test set :", y_test[0:5].values)

We select the value of alpha that minimizes the test error. To do so, we can use a for loop. We have also created a progress bar to see how many iterations we have completed so far.


In [ ]:
R_squared_test = []
R_squared_train = []

alphas = np.array(range(0, 5))
pbar = tqdm(alphas)

for alpha in pbar:
    ridge_model = Ridge(alpha=alpha) 
    ridge_model.fit(x_train_pr, y_train)
    test_score, train_score = ridge_model.score(x_test_pr, y_test), ridge_model.score(x_train_pr, y_train)
    
    pbar.set_postfix({"Test Score": test_score, "Train Score": train_score})

    R_squared_test.append(test_score)
    R_squared_train.append(train_score)

We can plot out the value of $R^2$ for different alphas:


In [ ]:
plt.figure(figsize=(WIDTH, HEIGHT))
plt.plot(alphas, R_squared_test, label="validation data")
plt.plot(alphas, R_squared_train, "r", label="training Data")
plt.title("Plot of Alpha vs R-squared")
plt.xlabel("Alpha")
plt.ylabel("R-squared")
plt.legend()

**Figure 4:** The blue line represents the $R^2$ of the validation data, and the red line represents the $R^2$ of the training data. The x-axis represents the different values of alpha. We see an improvement when alpha reaches 1, but $R^2$ remains the same after that.


The red line in **Figure 4** represents the $R^2$ of the training data, where the model is fitted and scored on the same data. This $R^2$ remains the same for alpha from 0 to 4.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">

# **Question  #5):**

**Perform Ridge regression. Calculate the $R^2$ using the polynomial features, use the training data to train the model and use the test data to test the model. The parameter alpha should be set to 10.**

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
ridge_model2 = Ridge(alpha=10) 
ridge_model2.fit(x_train_pr, y_train)
ridge_model2.score(x_test_pr, y_test)

```

</details>


# **5. Grid Search**

<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08XFEN/An-introduction-to-GridSearchCV-and-RandomizedSearchCV-Image.webp" alt="grid-search">
</center>

<center>The example of how Grid Search works</center>


**Grid search** is the process of hyperparameter tuning that determines the optimal hyperparameter values for a given model. The performance of a model depends significantly on the values of its hyperparameters. There is no way to know the best values in advance, so ideally we would try all candidate values and compare the results. Doing this manually could take a considerable amount of time and resources, so we use `GridSearchCV` to automate the tuning of hyperparameters.


The term alpha is a hyperparameter. Sklearn has the class <code>GridSearchCV</code> to make the process of finding the best hyperparameter simpler.
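
Conceptually, a grid search is a loop over the candidate values combined with cross-validation. The sketch below writes that loop by hand on synthetic data and compares the winner with `GridSearchCV`; the names `X_demo`, `alpha_grid`, `best_alpha_manual` are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data (assumption)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=60)

alpha_grid = [0.01, 0.1, 1, 10]

# Manual grid search: mean cross-validated R^2 for every candidate alpha
mean_scores = [
    cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=4).mean() for a in alpha_grid
]
best_alpha_manual = alpha_grid[int(np.argmax(mean_scores))]

# The same search with GridSearchCV
search = GridSearchCV(Ridge(), {"alpha": alpha_grid}, cv=4).fit(X_demo, y_demo)
print(best_alpha_manual, search.best_params_["alpha"])
```

In addition to picking the best candidate, `GridSearchCV` refits the estimator with the best parameters on all the data passed to `fit` and exposes it as `best_estimator_`.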


We create a parameter grid, a list with one dictionary that maps the parameter name to its candidate values:


In [ ]:
parameters = [{"alpha": [0.01, 0.1, 1, 10]}]
parameters

Create a <code>Ridge</code> regression object:


In [ ]:
rr = Ridge()
rr

Create a ridge grid search object:


In [ ]:
grid = GridSearchCV(rr, parameters, cv=4)

Fit the model:


In [ ]:
grid.fit(x_data[["ATR", "OBV", "RSI", "AD"]], y_data)

The object selects the best parameter values by cross-validation. We can obtain the estimator with the best parameters and assign it to the variable `best_rr` as follows:


In [ ]:
best_rr = grid.best_estimator_
best_rr

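We now score the model on the test data. Note that `grid` was fitted on all of `x_data`, which includes the test rows, so this score is optimistic: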


In [ ]:
best_rr.score(x_test[["ATR", "OBV", "RSI", "AD"]], y_test)

Let's plot the values predicted by the best model and compare them to the actual ones:


In [ ]:
y_rr = best_rr.predict(x_test[["ATR", "OBV", "RSI", "AD"]])
dist_plot(y_test, y_rr, "Plot of distributions of `y_test` and `y_rr`")

Despite the high $R^2$ score, the predicted and actual distributions are not as close to each other as they were for multiple linear regression.
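
The grid search above was fitted on the full dataset, so the rows used for the final score were also seen during the search. A stricter setup, sketched below on synthetic data, runs the search on the training part only and uses `TimeSeriesSplit` so that every fold is validated on rows that come after its training rows. The data, the split sizes and the variable names are assumptions; this is one possible variant, not the procedure used in this lab.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split

# Synthetic, time-ordered data (assumption)
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.3, size=200)

# Hold out the last 20% in time order; the search never sees these rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)

search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1, 10]},
    cv=TimeSeriesSplit(n_splits=4),  # each fold trains on the past and validates on the future
)
search.fit(X_tr, y_tr)

print(search.best_params_, search.best_estimator_.score(X_te, y_te))
```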


# **Conclusion:**

We learned how to split a dataset into training and test sets. We found out what **overfitting**, **Ridge regression**, **Grid Search** and **Cross-validation** are, learned how to use them, and obtained the best model with an $R^2$ of 0.762.


# **6. Sources**

<ul>
    <li><a target="_blank" href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://en.wikipedia.org/wiki/Cross-validation_(statistics)</a></li>
    <li><a target="_blank" href="https://aws.amazon.com/what-is/overfitting/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://aws.amazon.com/what-is/overfitting/</a></li>
    <li><a target="_blank" href="https://en.wikipedia.org/wiki/Ridge_regression?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://en.wikipedia.org/wiki/Ridge_regression</a></li>
    <li><a target="_blank" href="https://www.mygreatlearning.com/blog/gridsearchcv/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://www.mygreatlearning.com/blog/gridsearchcv/</a></li>
    <li><a target="_blank" href="https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data</a></li>
    <li><a target="_blank" href="https://developers.google.com/static/machine-learning/crash-course/images/PartitionTwoSets.svg?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://developers.google.com/static/machine-learning/crash-course/images/PartitionTwoSets.svg</a></li>
    <li><a target="_blank" href="https://es.mathworks.com/discovery/cross-validation.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://es.mathworks.com/discovery/cross-validation.html</a></li>
    <li><a target="_blank" href="https://es.mathworks.com/discovery/cross-validation/_jcr_content/mainParsys/image.adapt.full.medium.jpg/1676051591395.jpg?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://es.mathworks.com/discovery/cross-validation/_jcr_content/mainParsys/image.adapt.full.medium.jpg/1676051591395.jpg</a></li>
    <li><a target="_blank" href="https://www.superannotate.com/blog/overfitting-and-underfitting-in-machine-learning?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://www.superannotate.com/blog/overfitting-and-underfitting-in-machine-learning</a></li>
    <li><a target="_blank" href="https://uploads-ssl.webflow.com/614c82ed388d53640613982e/6360ef2568a0381c60b26049_overfitting-and-underfitting-in-machine-learning-1.png?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://uploads-ssl.webflow.com/614c82ed388d53640613982e/6360ef2568a0381c60b26049_overfitting-and-underfitting-in-machine-learning-1.png</a></li>
    <li><a target="_blank" href="https://sqlrelease.com/an-introduction-to-gridsearchcv-and-randomizedsearchcv?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">https://sqlrelease.com/an-introduction-to-gridsearchcv-and-randomizedsearchcv</a></li>
    <li><a target="_blank" href="https://i0.wp.com/sqlrelease.com/wp-content/uploads/2021/08/An-introduction-to-GridSearchCV-and-RandomizedSearchCV-Image.jpg?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01&ssl=1">https://i0.wp.com/sqlrelease.com/wp-content/uploads/2021/08/An-introduction-to-GridSearchCV-and-RandomizedSearchCV-Image.jpg?ssl=1</a></li>
</ul>


# **Thank you for completing this lab!**

## Authors

<a href="https://author.skills.network/instructors/borys_melnychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX08XFEN2550-2023-01-01">Borys Melnychuk</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>



## Change Log

| Date (YYYY-MM-DD) | Version | Changed By      | Change Description                                         |
| ----------------- | ------- | ----------------| ---------------------------------------------------------- |
|     2023-03-25    |   1.0   | Borys Melnychuk | Creation of the lab                                        |

<hr>

<h3 align="center">© IBM Corporation 2023. All rights reserved.</h3>
