<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="500" alt="cognitiveclass.ai logo"  />
</center>

# Data Analysis with Python: BTC/USD and the Technical Financial Indicators ATR, OBV, ADV, RSI, AD

# Lab 5. Model Evaluation and Refinement

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Evaluate and refine prediction models


## Table of Contents
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
    <li>Model Evaluation</li>
    <li>Over-fitting, Under-fitting and Model Selection</li>
    <li>Ridge Regression</li>
    <li>Grid Search</li>
    </ol>
</div>
<hr>

#### Setup


If you run the lab locally using Anaconda, you can install the required libraries by running the following cell:


In [ ]:
# install the libraries used in this lab
! mamba install ipywidgets -y
! mamba install tqdm -y
! mamba install pandas -y
! mamba install numpy -y
! mamba install scikit-learn -y

In [ ]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from ipywidgets import interact, interactive, fixed, interact_manual
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import set_config
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from tqdm import tqdm
from sklearn.model_selection import GridSearchCV
#set precision 2
pd.options.display.float_format = '{:.2f}'.format
set_config(display="diagram")

This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/DA101EN_object_storage?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX09Y8EN/BTC.csv'

You will need to read the dataset:


In [ ]:
df = pd.read_csv(path)

First, let's only use numeric data:


In [ ]:
df = df.select_dtypes(include='number')  # public API for keeping numeric columns
df.head()

## Functions for Plotting

In [ ]:
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    # sns.distplot is deprecated; kdeplot draws the same density curves
    ax1 = sns.kdeplot(RedFunction, color="r", label=RedName)
    ax2 = sns.kdeplot(BlueFunction, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of BTC')
    plt.legend()

    plt.show()
    plt.close()

In [ ]:
def PollyPlot(xtrain, xtest, y_train, y_test, lr,poly_transform):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))
    
    
    # xtrain, y_train: training data
    # xtest, y_test: testing data
    # lr: fitted linear regression object
    # poly_transform: fitted polynomial transformation object
 
    xmax=max([xtrain.values.max(), xtest.values.max()])

    xmin=min([xtrain.values.min(), xtest.values.min()])

    x=np.arange(xmin, xmax, 0.1)


    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.transform(x.reshape(-1, 1))), label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()

## Part 1: Training and Testing 

<p>An important step in testing your model is to split your data into training and testing data. We will place the target data <b>Avg_price</b> in a separate dataframe <b>y_data</b>:</p>


In [ ]:
y_data = df['Avg_price']

Drop the Avg_price column and store the remaining features in the dataframe **x_data**:


In [ ]:
x_data=df.drop('Avg_price',axis=1)

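Now, we split our data into training and testing data using the function <b>train_test_split</b>. Because we pass <b>shuffle=False</b>, the rows are not shuffled: the first 90% of the rows form the training set and the last 10% form the testing set, which preserves the original row order of the data.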


In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, shuffle=False)


print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])


The <b>test_size</b> parameter sets the proportion of data that is split into the testing set. In the above, the testing set is 10% of the total dataset.
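
To see what <b>shuffle=False</b> does in practice, here is a minimal sketch on a toy array. The 20-row array is made up for illustration and only stands in for the BTC data. Without shuffling, the test set is simply the final rows; with shuffling and a fixed <b>random_state</b>, rows are drawn at random but reproducibly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows in their original order (hypothetical, not the BTC dataset)
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Ordered split: the last 10% of rows become the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, shuffle=False)
print("ordered test rows:", y_te)

# Random split: 40% of rows are sampled for testing, reproducibly
X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(X, y, test_size=0.40, random_state=0)
print("random test rows:", np.sort(y_te_r))
```

For time-ordered data such as prices, the ordered split avoids training on rows that come after the rows being tested.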


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #1:</b></p>

<b>Use the function "train_test_split" to split up the dataset such that 40% of the data samples will be utilized for testing. Set the parameter "random_state" equal to zero. The output of the function should be the following:  "x_train1" , "x_test1", "y_train1" and  "y_test1".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.4, random_state=0) 
print("number of test samples :", x_test1.shape[0])
print("number of training samples:",x_train1.shape[0])
```

</details>


Let's recall how a linear regression model is scored:

In Python, <code>LinearRegression.score()</code> is a method provided by scikit-learn, a popular machine learning library, which calculates the coefficient of determination, also known as R-squared, for a linear regression model.

The R-squared value is a statistical measure that represents how well the regression model fits the observed data. A value of 1 indicates that the model perfectly fits the data, and 0 indicates that the model explains none of the variability, which is what a model that always predicts the mean would achieve. On the data used for fitting, an ordinary least-squares model scores between 0 and 1; on new data the value can be negative.

The <code>score()</code> method takes two required parameters:

- X: An array-like object or sparse matrix of shape (n_samples, n_features) containing the input data.
- y: An array-like object of shape (n_samples,) or (n_samples, n_targets) containing the target values.


The method returns the R-squared value for the provided input data and target values, using the fitted linear regression model.
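
As a rough illustration of the definition, the sketch below computes R-squared by hand as 1 - SS_res / SS_tot. The helper name `r_squared` and the four-point array are hypothetical; the formula is the one scikit-learn documents for regressor scores.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                      # exact predictions
baseline = r_squared(y_true, np.full(4, y_true.mean()))  # always predict the mean
worse = r_squared(y_true, y_true[::-1])                  # predictions in reverse order
print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```

The last case shows how predictions that are worse than the mean push the score below zero.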

We create a Linear Regression object:


In [ ]:
lre=LinearRegression()

We fit the model using the feature "OBV" because OBV and Avg_price are strongly correlated:


In [ ]:
df[['Avg_price','OBV']].corr()

Now we will fit the model:

In [ ]:
lre.fit(x_train[['OBV']], y_train)

Let's calculate the R<sup>2</sup> on the test data:


In [ ]:
print("{:.2f}".format(lre.score(x_test[['OBV']], y_test)) )

R<sup>2</sup> can be negative on held-out data. This happens when the model predicts the test data worse than a constant model that always predicts the mean of the test targets, for example because the model is a poor fit or because the dependent variable varies in ways the independent variables cannot explain.

A negative R<sup>2</sup> on the test data can be a sign of overfitting, but with an unshuffled split it can also reflect a change in the data between the training rows and the test rows.

We can see the R<sup>2</sup> is smaller using the test data compared to the training data.


In [ ]:
print("{:.2f}".format(lre.score(x_train[['OBV']], y_train)))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #2:</b></p>
<b> 
Find the R^2  on the test data using 40% of the dataset for testing.
</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.4, random_state=0)
lre.fit(x_train1[['OBV']],y_train1)
lre.score(x_test1[['OBV']],y_test1)

```

</details>


Sometimes you do not have sufficient testing data; as a result, you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation.


## Cross-Validation Score

<p> Cross-validation is a technique used in machine learning to assess the performance of a model on an independent data set. Cross-validation score is a performance metric that is calculated by averaging the results of cross-validation.</p>

<p>In k-fold cross-validation, the data set is split into k subsets, or "folds". The model is trained on k-1 folds and tested on the remaining fold, and this process is repeated so that each fold serves as the test set exactly once. The results from each fold are then averaged to produce the cross-validation score.</p>

<p>The advantage of using cross-validation is that it provides a more reliable estimate of the performance of the model on new data than a single train/test split, because the estimate depends less on which rows happened to land in one particular test set.</p>

<p>With scikit-learn's scoring conventions, higher values indicate better performance. The cross-validation score can be computed using the <code>cross_val_score()</code> function.</p>

<p>Note that there are different types of cross-validation techniques, such as k-fold cross-validation, stratified k-fold cross-validation, leave-one-out cross-validation, and more. The choice of cross-validation technique depends on the specific requirements of the problem and the characteristics of the data set.</p>
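
The k-fold procedure can be sketched by hand with NumPy. This is only an illustration on synthetic data (the noisy line and all variable names are assumptions, not part of the lab); it uses contiguous, unshuffled folds and a straight-line fit from `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): a noisy straight line
x = np.linspace(0.0, 10.0, 40)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

k = 4
folds = np.array_split(np.arange(x.size), k)  # contiguous, unshuffled folds

scores = []
for i, test_idx in enumerate(folds):
    # Train on every fold except the i-th one
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

    # Score the held-out fold with R^2 = 1 - SS_res / SS_tot
    y_hat = slope * x[test_idx] + intercept
    ss_res = np.sum((y[test_idx] - y_hat) ** 2)
    ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
    scores.append(1.0 - ss_res / ss_tot)

scores = np.array(scores)
print("R^2 per fold:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

Each fold's score is computed only on rows the model did not see during fitting, which is what makes the average a fairer estimate than a training score.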

We input the object, the feature ("OBV"), and the target data (y_data). The parameter 'cv' determines the number of folds. In this case, it is 4.


In [ ]:
Rcross = cross_val_score(lre, x_data[['OBV']], y_data, cv=4)

The default scoring is R<sup>2</sup>. Each element in the array is the R<sup>2</sup> value for one fold:


In [ ]:
Rcross

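A negative R<sup>2</sup> for a fold means the model predicted that fold worse than the fold's own mean would have. This can be a sign of overfitting or, with unshuffled folds, of a change in the data from one fold to the next.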

We can calculate the average and standard deviation of our estimate:


In [ ]:
print("The mean of the folds is", "{:.2f}".format( Rcross.mean()), "and the standard deviation is" , "{:.2f}".format(Rcross.std()))

We can use the negative mean squared error as a score by setting the parameter 'scoring' to 'neg_mean_squared_error'. Multiplying the result by -1 gives the mean squared error for each fold.


In [ ]:
-1 * cross_val_score(lre,x_data[['OBV']], y_data,cv=4,scoring='neg_mean_squared_error')

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #3:</b></p>
<b> 
Calculate the average R^2 using two folds, then find the R^2 for the second fold, utilizing the "OBV" feature: 
</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
Rc=cross_val_score(lre,x_data[['OBV']], y_data,cv=2)
print(Rc.mean())  # average R^2 over the two folds
print(Rc[1])      # R^2 of the second fold

```

</details>


You can also use the function 'cross_val_predict' to predict the output. The function splits the data into the specified number of folds; for each fold, the model is trained on the other folds and then predicts the values of the held-out fold.


We input the object, the feature <b>"OBV"</b>, and the target data <b>y_data</b>. The parameter 'cv' determines the number of folds. In this case, it is 4. We can produce an output:


In [ ]:
yhat = cross_val_predict(lre,x_data[['OBV']], y_data,cv=4)
yhat[0:5]
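
The following is a minimal sketch of what 'cross_val_predict' returns, using synthetic data in place of the "OBV" column (the arrays and the name `y_oof` are assumptions for illustration). Every sample receives exactly one prediction, made by a model that was trained without that sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic single-feature data (hypothetical, stands in for the 'OBV' column)
X = rng.normal(size=(40, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=40)

# Each prediction comes from a model that did not see that row during training
y_oof = cross_val_predict(LinearRegression(), X, y, cv=4)

print(y_oof.shape)                  # one prediction per sample
print(np.corrcoef(y, y_oof)[0, 1])  # agreement between targets and predictions
```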

## Part 2: Overfitting, Underfitting and Model Selection

<p>It turns out that the test data, sometimes referred to as the "out of sample data", is a much better measure of how well your model performs in the real world.  One reason for this is overfitting.

Overfitting is a common problem in machine learning and refers to a situation where a model becomes too complex and fits the training data too closely. This can result in poor performance on new or unseen data.

Let's go over some examples. It turns out these differences are more apparent in Multiple Linear Regression and Polynomial Regression so we will explore overfitting in that context.</p>


Let's create Multiple Linear Regression objects and train the model using <b>'ADL', 'Volume', 'ATR', 'OBV'</b> as features.


In [ ]:
lr = LinearRegression()
lr.fit(x_train[['ADL', 'Volume', 'ATR', 'OBV']], y_train)

Prediction using training data:


In [ ]:
yhat_train = lr.predict(x_train[['ADL', 'Volume', 'ATR', 'OBV']])
yhat_train[0:5]

Prediction using test data:


In [ ]:
yhat_test = lr.predict(x_test[['ADL', 'Volume', 'ATR', 'OBV']])
yhat_test[0:5]

Let's perform some model evaluation using our training and testing data separately.


Let's examine the distribution of the predicted values of the training data.


In [ ]:
Title = 'Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)

Figure 1: Plot of predicted values using the training data compared to the actual values of the training data.


So far, the model seems to be doing well in learning from the training dataset. But what happens when the model encounters new data from the testing dataset? When the model generates new values from the test data, we see the distribution of the predicted values is much different from the actual target values.


In [ ]:
Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test,yhat_test,"Actual Values (Test)","Predicted Values (Test)",Title)

Figure 2: Plot of predicted value using the test data compared to the actual values of the test data.


<p>Comparing Figure 1 and Figure 2, it is evident that the predicted distribution in Figure 1 (training data) fits the actual values much better than the one in Figure 2 (test data). The difference in Figure 2 is most apparent in the range of 16,500 to 17,000, where the shapes of the two distributions are extremely different. Let's see if polynomial regression also exhibits a drop in the prediction accuracy when analysing the test dataset.</p>


#### Overfitting
<p>Overfitting occurs when the model fits the noise, but not the underlying process. Therefore, when testing your model using the test set, your model does not perform as well since it is modelling noise, not the underlying process that generated the relationship. Let's create a degree 5 polynomial model.</p>
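
Before turning to the BTC data, the idea can be tried on synthetic data. The sketch below is an illustration only: the sine curve, the noise level and the chosen degrees are assumptions. It fits polynomials of increasing degree with NumPy and reports R<sup>2</sup> on the points used for fitting and on held-out points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (hypothetical): a smooth curve plus noise
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=x.size)

# First 30 points for training, the remaining 30 held out for testing
x_tr, y_tr = x[:30], y[:30]
x_te, y_te = x[30:], y[30:]

train_r2, test_r2 = {}, {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_tr, y_tr, deg=degree)  # least-squares polynomial fit
    train_r2[degree] = r_squared(y_tr, model(x_tr))
    test_r2[degree] = r_squared(y_te, model(x_te))
    print(degree, round(train_r2[degree], 3), round(test_r2[degree], 3))
```

The training R<sup>2</sup> cannot go down as the degree grows, because each higher-degree model contains the lower-degree ones. The test R<sup>2</sup> has no such guarantee, and a widening gap between the two is the usual symptom of overfitting.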


Let's use 55 percent of the data for training and the rest for testing:


In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)

We will perform a degree 5 polynomial transformation on the feature <b>'ADL'</b>.


In [ ]:
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train[['ADL']])
x_test_pr = pr.transform(x_test[['ADL']])  # reuse the transformer fitted on the training data
pr
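
The usual scikit-learn convention is to call `fit_transform` on the training data and `transform` on the test data. A minimal sketch with made-up numbers (the arrays and the name `pf` are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy single-feature arrays (hypothetical values)
train = np.array([[2.0], [3.0]])
test = np.array([[4.0]])

pf = PolynomialFeatures(degree=2)
train_poly = pf.fit_transform(train)  # learn the feature layout, then expand
test_poly = pf.transform(test)        # reuse the fitted layout on new data

print(train_poly)  # columns: 1, x, x^2 -> [[1. 2. 4.], [1. 3. 9.]]
print(test_poly)   # [[ 1.  4. 16.]]
```

`PolynomialFeatures` learns only the number of input columns, so refitting it on the test data would produce the same numbers. Transformers that learn statistics from the data, such as scalers, would not, which is why keeping the two calls distinct is a good habit.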

Now, let's create a Linear Regression model "poly" and train it.


In [ ]:
poly = LinearRegression()
poly.fit(x_train_pr, y_train)

We can see the output of our model using the method "predict." We assign the values to "yhat".


In [ ]:
yhat = poly.predict(x_test_pr)
yhat[0:5]

Let's take the first four predicted values and compare them to the actual targets.


In [ ]:
print("Predicted values:", yhat[0:4])
print("True values:", y_test[0:4].values)

We will use the function "PollyPlot" that we defined at the beginning of the lab to display the training data, testing data, and the predicted function.


In [ ]:
PollyPlot(x_train[['ADL']], x_test[['ADL']], y_train, y_test, poly,pr)

Figure 3: A polynomial regression model where red dots represent training data, green dots represent test data, and the blue line represents the model prediction.


R<sup>2</sup> of the training data:


In [ ]:
print("{:.2f}".format(poly.score(x_train_pr, y_train)))

R<sup>2</sup> of the test data:


In [ ]:
print("{:.2f}".format(poly.score(x_test_pr, y_test)))

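We see the R<sup>2</sup> for the training data is 0.67 while the R<sup>2</sup> on the test data is also 0.67, so for this split the degree 5 model scores about the same on unseen data as on the data it was trained on. The lower the R<sup>2</sup>, the worse the model; a test R<sup>2</sup> far below the training R<sup>2</sup>, or a negative one, would be a sign of overfitting.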


Let's see how the R<sup>2</sup> changes on the test data for different order polynomials and then plot the results:


In [ ]:
Rsqu_test = []

order = [1, 2, 3, 4, 5, 6, 7]
for n in order:
    pr = PolynomialFeatures(degree=n)

    # fit the transformer on the training data, then reuse it on the test data
    x_train_pr = pr.fit_transform(x_train[['ADL']])
    x_test_pr = pr.transform(x_test[['ADL']])

    lr.fit(x_train_pr, y_train)

    Rsqu_test.append(lr.score(x_test_pr, y_test))

plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
plt.text(3, 0.75, 'Maximum R^2 ')    

We see the R<sup>2</sup> gradually increases until an order three polynomial is used. Then, the R<sup>2</sup> dramatically decreases at an order four polynomial.


The following function will be used in the next section. Please run the cell below.


In [ ]:
def f(order, test_data):
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_data, random_state=0)
    pr = PolynomialFeatures(degree=order)
    x_train_pr = pr.fit_transform(x_train[['ADL']])
    x_test_pr = pr.transform(x_test[['ADL']])
    poly = LinearRegression()
    poly.fit(x_train_pr,y_train)
    PollyPlot(x_train[['ADL']], x_test[['ADL']], y_train,y_test, poly, pr)

The following interface allows you to experiment with different polynomial orders and different amounts of data.


In [ ]:
interact(f, order=(0, 6, 1), test_data=(0.05, 0.95, 0.05))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #4a):</b></p>

<b>We can perform polynomial transformations with more than one feature. Create a "PolynomialFeatures" object "pr1" of degree two.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
pr1=PolynomialFeatures(degree=2)

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #4b):</b></p>

<b> 
 Transform the training and testing samples for the features 'OBV', 'ADL', 'ATR' and 'Volume'. Hint: use the method "fit_transform" on the training samples and "transform" on the testing samples.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train_pr1=pr1.fit_transform(x_train[[ 'OBV', 'ADL', 'ATR', 'Volume']])

x_test_pr1=pr1.transform(x_test[[ 'OBV', 'ADL', 'ATR', 'Volume']])


```

</details>




<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #4c):</b></p>
<b> 
How many dimensions does the new feature have? Hint: use the attribute "shape".
</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train_pr1.shape #there are now 15 features


```

</details>
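
The count of 15 can be checked without running the transformation. With the default settings of `PolynomialFeatures` (bias column included, all interaction terms kept), n input features expanded to degree d give C(n + d, d) output columns. The helper name `n_poly_features` below is hypothetical.

```python
from math import comb

def n_poly_features(n_inputs, degree):
    """Number of monomials of total degree <= `degree` in `n_inputs` variables,
    including the constant (bias) column."""
    return comb(n_inputs + degree, degree)

print(n_poly_features(4, 2))  # four inputs, degree two -> 15
print(n_poly_features(1, 5))  # one input, degree five -> 6
```

The column count grows quickly with the degree, which is one reason high-degree polynomial models overfit easily.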


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #4d):</b></p>

<b> 
Create a linear regression model "poly1". Train the object using the method "fit" using the polynomial features.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
poly1=LinearRegression().fit(x_train_pr1,y_train)


```

</details>


 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #4e):</b></p>
<b>Use the method  "predict" to predict an output on the polynomial features, then use the function "DistributionPlot" to display the distribution of the predicted test output vs. the actual test data.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
yhat_test1=poly1.predict(x_test_pr1)

Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'

DistributionPlot(y_test, yhat_test1, "Actual Values (Test)", "Predicted Values (Test)", Title)

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #4f):</b></p>

<b>Using the distribution plot above, describe (in words) the two regions where the predicted prices deviate most from the actual prices.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# The predicted value is higher than the actual value for BTC where the price is between $16,200 and $16,500; conversely, the predicted price is lower than the actual price in the $16,500 to $17,000 range. As such, the model is not as accurate in these ranges.

```

</details>


## Part 3: Ridge Regression

Ridge regression is a linear regression technique used for regression analysis, especially in the presence of high multicollinearity among the independent variables. In traditional linear regression, the objective is to minimize the sum of squared errors between the predicted and actual values. However, when there is high multicollinearity among the independent variables, the estimates of regression coefficients become unstable, and the model may overfit to the training data.

In ridge regression, a regularization term is added to the sum of squared errors to constrain the magnitude of the coefficients. This term is the squared L2-norm of the coefficient vector multiplied by a hyperparameter lambda (λ), which scikit-learn calls <code>alpha</code>. The effect of this term is to shrink the coefficient values towards zero, but not exactly to zero, so the final model still includes all the features. The hyperparameter controls the strength of the regularization, and it can be tuned using cross-validation to find the optimal value for a given dataset.

Ridge regression is useful in situations where there are many independent variables, and some of them are highly correlated with each other. Because it does not set coefficients exactly to zero, it does not perform feature selection; Lasso regression, which uses an L1 penalty, is the usual choice for that. Ridge therefore keeps every variable in the model, which can be a drawback if the dataset contains many irrelevant variables.
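
The shrinkage effect can be seen with a few lines of NumPy. The sketch below is an illustration under stated assumptions: the data are synthetic, the helper `ridge_coefficients` is hypothetical, and the data are centred so that no intercept is needed. It uses the closed-form minimiser of the ridge objective, the sum of squared errors plus alpha times the squared L2-norm of the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): two nearly collinear features plus an independent one
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 * x1 + 1.0 * x2 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

# Centre the data so that no intercept term is needed
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(X, y, alpha):
    """Closed-form minimiser of ||y - Xw||^2 + alpha * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = []
for alpha in (0.0, 1.0, 100.0):
    w = ridge_coefficients(Xc, yc, alpha)
    norms.append(np.linalg.norm(w))
    print(alpha, w.round(3), round(norms[-1], 3))
```

As alpha grows, the length of the coefficient vector shrinks, yet every coefficient stays non-zero.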

In this section, we will review Ridge Regression and see how the parameter alpha changes the model. Just a note, here our test data will be used as validation data.


Let's perform a degree two polynomial transformation on our data.


In [ ]:
pr=PolynomialFeatures(degree=2)
x_train_pr=pr.fit_transform(x_train[['ADL', 'Volume', 'ATR', 'OBV']])
x_test_pr=pr.transform(x_test[['ADL', 'Volume', 'ATR', 'OBV']])

Let's create a Ridge regression object, setting the regularization parameter (alpha) to 1:


In [ ]:
RigeModel=Ridge(alpha=1)

Like regular regression, you can fit the model using the method <b>fit</b>.


In [ ]:
RigeModel.fit(x_train_pr, y_train)

Similarly, you can obtain a prediction:


In [ ]:
yhat = RigeModel.predict(x_test_pr)

Let's compare the first four predicted samples to our test set:


In [ ]:
print('predicted:', yhat[0:4])
print('test set :', y_test[0:4].values)

We want the value of alpha that gives the best score on the validation data. To find it, we can use a for loop over candidate values and record the R<sup>2</sup> for each one. We have also created a progress bar to see how many iterations we have completed so far.


In [ ]:
Rsqu_test = []
Rsqu_train = []
Alpha = 10 * np.arange(1000)  # candidate alphas: 0, 10, ..., 9990
pbar = tqdm(Alpha)

for alpha in pbar:
    RigeModel = Ridge(alpha=alpha) 
    RigeModel.fit(x_train_pr, y_train)
    test_score, train_score = RigeModel.score(x_test_pr, y_test), RigeModel.score(x_train_pr, y_train)
    
    pbar.set_postfix({"Test Score": test_score, "Train Score": train_score})

    Rsqu_test.append(test_score)
    Rsqu_train.append(train_score)

We can plot out the value of R<sup>2</sup> for different alphas:


In [ ]:
width = 12
height = 10
plt.figure(figsize=(width, height))

plt.plot(Alpha,Rsqu_test, label='validation data  ')
plt.plot(Alpha,Rsqu_train, 'r', label='training Data ')
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.legend()

**Figure 4**: The blue line represents the R<sup>2</sup> of the validation data, and the red line represents the R<sup>2</sup> of the training data. The x-axis represents the different values of Alpha.


Here the model is trained on the training data and evaluated on the held-out test data, which serves as the validation data.

The red line in Figure 4 represents the R<sup>2</sup> of the training data. As alpha increases, the R<sup>2</sup> stays about the same.

The blue line represents the R<sup>2</sup> on the validation data.
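
Once the validation scores are collected, picking the best alpha is a one-line lookup. A minimal sketch on synthetic data (the data, the short alpha list and the names `val_scores` and `best_alpha` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, not the BTC features)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=1.0, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

alphas = np.array([0.01, 0.1, 1.0, 10.0, 100.0, 1000.0])
val_scores = np.array([
    Ridge(alpha=a).fit(X_tr, y_tr).score(X_val, y_val) for a in alphas
])

best_alpha = alphas[np.argmax(val_scores)]  # highest validation R^2
print("validation R^2:", val_scores.round(3))
print("best alpha:", best_alpha)
```

Because the same held-out data are used to choose alpha, the winning score is an optimistic estimate; a separate test set or cross-validation gives a fairer one.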


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<p><b style="font-size: 2em; font-weight: bold;">Question #5:</b></p>

Perform Ridge regression. Calculate the R^2 using the polynomial features, use the training data to train the model and use the test data to test the model. The parameter alpha should be set to 10.

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
RigeModel = Ridge(alpha=10) 
RigeModel.fit(x_train_pr, y_train)
RigeModel.score(x_test_pr, y_test)

```

</details>


## Part 4: Grid Search

Grid search is a technique used in machine learning to find the optimal hyperparameters for a model. Hyperparameters are parameters that are set before the model is trained, and they affect how the model learns from the training data. Examples of hyperparameters include the learning rate, regularization strength, number of hidden layers in a neural network, and the number of trees in a random forest.

Grid search involves creating a grid of hyperparameter values and training a model for each combination of hyperparameters in the grid. The performance of each model is evaluated using a validation set, and the hyperparameters that produce the best performance are selected as the optimal hyperparameters for the model. The validation set is typically a subset of the training data that is held out for evaluation purposes.

The grid of hyperparameters is defined by the user, typically as a list of candidate values for each hyperparameter, and libraries such as Scikit-learn then evaluate every combination. The size of the grid can be adjusted depending on the computational resources available and the time it takes to train and evaluate each model.

Grid search is a simple and effective method for finding optimal hyperparameters, but it can be computationally expensive for large datasets or models with many hyperparameters. In such cases, more advanced techniques such as Bayesian optimization or randomized search can be used to reduce the search space and speed up the optimization process.

The term alpha is a hyperparameter. Sklearn has the class <b>GridSearchCV</b> to make the process of finding the best hyperparameter simpler.


We create a dictionary of parameter values:


In [ ]:
parameters1= [{'alpha': [0.001,0.1,1, 10, 100, 1000, 10000, 100000]}]
parameters1

Create a Ridge regression object:


In [ ]:
RR=Ridge()
RR

Create a ridge grid search object:


In [ ]:
Grid1 = GridSearchCV(RR, parameters1,cv=4)

Fit the model:


In [ ]:
# fit on the training rows only, so the test rows stay unseen for the final check
Grid1.fit(x_train[['ADL', 'Volume', 'ATR', 'OBV']], y_train)

The object finds the best parameter values on the validation data. We can obtain the estimator with the best parameters and assign it to the variable BestRR as follows:


In [ ]:
BestRR=Grid1.best_estimator_
BestRR
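
Besides `best_estimator_`, a fitted `GridSearchCV` object exposes `best_params_`, `best_score_` and `cv_results_`. The sketch below shows them on synthetic data; the data, the shortened grid and the name `search` are assumptions for illustration, not part of the lab.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, not the BTC features)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=120)

param_grid = {"alpha": [0.001, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), param_grid, cv=4)  # default scoring for Ridge is R^2
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("mean cross-validated R^2 of best alpha:", round(search.best_score_, 3))

# Mean validation score for every candidate in the grid
for alpha, score in zip(search.cv_results_["param_alpha"],
                        search.cv_results_["mean_test_score"]):
    print(alpha, round(score, 3))
```

Looking at all the candidate scores, not only the best one, shows whether the optimum sits at the edge of the grid, in which case the grid may need to be extended.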

We now test our model on the test data:


In [ ]:
print("{:.2f}".format(BestRR.score(x_test[['ADL', 'Volume', 'ATR', 'OBV']], y_test)))

In [ ]:
Title='Distribution Plot of Training Values vs All Values from df[\'Avg_price\']'
DistributionPlot(df['Avg_price'], y_train, "All Values", "Training Values", Title)

### Thank you for completing this lab!

## Authors

<a href="https://www.linkedin.com/in/bohdan-tsisinskyi-539913255/ " target="_blank" >Bohdan Tsisinskyi</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk">Prof. Mariya Fleychuk, DrSc, PhD</a>.

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                         |
| ----------------- | ------- | ---------- | ------------------------------------------ |
|2023-03-25|1.0|Bohdan Tsisinskyi|Lab created|


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
