<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Motorcycle sales analysis

# *Lab5. Model Evaluation and Refinement*

Estimated time needed: **1 hour**

## Objectives

After completing this lab you will be able to:

*   Evaluate and refine prediction models


<details><summary><b style="font-size: 1.5em; font-weight: bold;">Click here to see content, description of dataset, source of dataset and licence</b></summary>
<br/>
    <b style="font-size: 1.2em; font-weight: bold;">Content</b>
<p>You work in the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.</p>

<b style="font-size: 1.2em; font-weight: bold;">Dataset Glossary (Column-wise)</b>
<ul>
    <li>Date<p>The date on which the client bought the products.</p></li>
    <li>Warehouse<p>The warehouse location.</p></li>
    <li>Client type<p>How the client bought the products. The value is either Retail or Wholesale.</p></li>
    <li>Product line<p>The name of the product (a motorcycle part).</p></li>
    <li>Quantity<p>The number of units bought.</p></li>
    <li>Unit price<p>The price of one unit.</p></li>
    <li>Total<p>The total purchase price.</p></li>
    <li>Payment<p>The payment method for the purchase. The dataset has three payment types: Credit card, Cash, or Transfer.</p></li>
</ul>

<b style="font-size: 1.2em; font-weight: bold;">Target field</b>
<ul>
    <li>Total</li>
</ul>

<b style="font-size: 1.2em; font-weight: bold;">Data source and licence</b>

<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0TE4EN3049-2023-01-01">https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis</a></li>
    <li>Data type: csv</li>
    <li>Licence: <a href="https://creativecommons.org/publicdomain/zero/1.0/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0TE4EN3049-2023-01-01">CC0: Public Domain</a></li>
</ul>
<p>
This dataset is released under the CC0: Public Domain license. The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
</p>
You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
</details>


<b style="font-size: 1.5em; font-weight: bold;">Table of Contents</b>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#id1">Training and Testing</a></li>
    <li><a href="#id2">Underfitting and Model Selection</a></li>
    <li><a href="#id3">Ridge Regression</a></li>
    <li><a href="#id4">Grid Search</a></li>
</ol>

</div>

<hr>


Import libraries


In [ ]:
#install specific version of libraries used in lab
#! mamba install pandas -y
#! mamba install numpy -y
#! mamba install sklearn -y
#! mamba install   ipywidgets -y
#! mamba install tqdm

! mamba install scikit-learn -y

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder, PolynomialFeatures
from ipywidgets import interact, interactive, fixed, interact_manual
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, train_test_split, cross_val_predict, GridSearchCV
from tqdm import tqdm
from sklearn import set_config
set_config(display="diagram")

The dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0TE4EN/new_motorcycles.csv">HERE</a> to download it.


In [ ]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0TE4EN/new_motorcycles.csv'

In [ ]:
df = pd.read_csv(path)
df.head()

<p>First, we need to use <code>OrdinalEncoder</code> to transform all our non-numeric values into numbers.</p>
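
To illustrate what the encoder does before applying it to the dataset, here is a minimal sketch on a toy dataframe. The product names are hypothetical and are not taken from the dataset. Note that the integer codes follow alphabetical order, so the numeric ordering carries no real meaning.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({"Product line": ["Engine", "Braking system", "Engine", "Frame & body"]})

toy_enc = OrdinalEncoder()
codes = toy_enc.fit_transform(toy[["Product line"]])

# Categories are sorted alphabetically and each value is replaced by its index.
print(toy_enc.categories_[0])   # the learned category order
print(codes.ravel())            # one float code per row
```

The attribute `categories_` lets you map a code back to the original label when you interpret the results.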


In [ ]:
enc = OrdinalEncoder()
df[['Product line','Quantity binned','Total ranged']] = enc.fit_transform(df[['Product line','Quantity binned','Total ranged']])

In [ ]:
df.head()

<b style="font-size: 1.2em; font-weight: bold;">Functions for Plotting</b>


In [ ]:
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    width = 12
    height = 10
    plt.figure(figsize = (width, height))

    # sns.distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
    ax1 = sns.kdeplot(RedFunction, color = "r", label = RedName)
    ax2 = sns.kdeplot(BlueFunction, color = "b", label = BlueName, ax = ax1)

    plt.title(Title)
    plt.xlabel('Total')
    plt.ylabel('Density')
    plt.legend()

    plt.show()
    plt.close()

In [ ]:
def PollyPlot(xtrain, xtest, y_train, y_test, lr, poly_transform):
    # xtrain, xtest: single-column dataframes with the training and testing feature
    # y_train, y_test: training and testing targets
    # lr: fitted linear regression object
    # poly_transform: fitted polynomial transformation object
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    xmax = max([xtrain.values.max(), xtest.values.max()])
    xmin = min([xtrain.values.min(), xtest.values.min()])

    # evenly spaced grid over the observed feature range, keeping the original column name
    x = pd.DataFrame(np.arange(xmin, xmax, 0.1), columns = xtrain.columns)

    plt.plot(xtrain, y_train, 'ro', label = 'Training Data')
    plt.plot(xtest, y_test, 'go', label = 'Test Data')
    # transform (not fit_transform): reuse the already fitted transformer
    plt.plot(x, lr.predict(poly_transform.transform(x)), label = 'Predicted Function')
    plt.ylim([min(y_train), max(y_train)])
    plt.ylabel('Total')
    plt.legend()
    plt.show()

<b style="font-size: 1.5em; font-weight: bold"><a name="id1" style="text-decoration: none;"><font color="black">1. Training and Testing</font></a></b>

<p>An important step in testing your model is to split your data into training and testing data. We will place the target data <b>Total</b> in a separate dataframe <b>y_data</b>:</p>


In [ ]:
y_data = df['Total']

Drop the column Total and keep the remaining features in the dataframe **x_data**:


In [ ]:
x_data = df.drop('Total',axis=1)

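Now, we split our data into training and testing data using the function <b>train_test_split</b>. Because we set <code>shuffle = False</code>, the split keeps the original row order instead of sampling rows at random.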


In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2, shuffle = False)
print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])

The <b>test_size</b> parameter sets the proportion of data that is split into the testing set. In the above, the testing set is 20% of the total dataset.
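
The effect of <b>test_size</b> and <b>shuffle</b> can be seen on a small synthetic array. The sketch below uses made-up data (`X_demo`, `y_demo` are illustrative names, not part of the lab) and compares an ordered split with a shuffled one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: ten rows with values 0..9 (illustrative only).
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# shuffle=False: the last 20% of the rows become the test set, in original order.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)
print(y_tr, y_te)

# shuffle=True (the default): rows are sampled at random; random_state fixes the draw.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(y_tr_s, y_te_s)
```

With `shuffle=False` the parameter `random_state` has no effect, which is why it only appears in the second call.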


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #1:</b>

<b style="font-size: 1.2em">Use the function "train_test_split" to split up the dataset such that 30% of the data samples will be utilized for testing. Set the parameter <code>shuffle = False</code>; with shuffling disabled, the parameter "random_state" has no effect and can be omitted. The output of the function should be the following:  "x_train1" , "x_test1", "y_train1" and  "y_test1".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size = 0.3, shuffle = False) 
print("number of test samples :", x_test1.shape[0])
print("number of training samples:",x_train1.shape[0])
```

</details>


We create a Linear Regression object:


In [ ]:
lre = LinearRegression()

We fit the model using the feature "Quantity":


In [ ]:
lre.fit(x_train[['Quantity']], y_train)

<b>Let's calculate the $R^{2}$ on the test data:</b>


In [ ]:
lre.score(x_test[['Quantity']], y_test)

Let's calculate the $R^{2}$ on the train data:


In [ ]:
lre.score(x_train[['Quantity']], y_train)

We can see that the $R^{2}$ is almost equal for the train data and the test data.
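
For reference, the value returned by `score` is the coefficient of determination, $R^{2} = 1 - SS_{res} / SS_{tot}$. The following sketch, on synthetic data that is not part of the lab, computes it by hand and compares it with `LinearRegression.score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, roughly linear data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 10, size=(50, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=50)

model = LinearRegression().fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# Residual sum of squares and total sum of squares around the mean.
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))
```

A value close to 1 means the model explains most of the variance of the target; a value close to 0 means it does little better than predicting the mean.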


Now let's create model <code>lrem</code> for multiple linear regression with all features except for 'Total' and find $R^{2}$ for test and train data


In [ ]:
lrem = LinearRegression()
lrem.fit(x_train, y_train)
print('R^2 for test data: ',lrem.score(x_test, y_test))
print('R^2 for train data: ',lrem.score(x_train, y_train))

You can see that for multiple regression the $R^{2}$ scores for the test and train data are almost equal and do not differ much from the $R^{2}$ score for simple linear regression.


You can see that the $R^{2}$ scores for the test data and the train data are almost equal, but neither is high, which is a sign of underfitting. Underfitting is covered in section **2**.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #2: </b>
    <li><b style="font-size:1.2em">Create new Linear Regression object <code>lre1</code> for simple linear regression</b></li>
    <li><b style="font-size:1.2em">Use method <code>fit()</code> to train your model with the feature Quantity</b></li>
    <li><b style="font-size:1.2em">Find the $R^{2}$ on the test data and train data using 30% of the dataset for testing</b></li>
    <li><b style="font-size:1.2em">Create new Linear Regression object <code>lrem1</code> for multiple linear regression</b></li>
    <li><b style="font-size:1.2em">Use method <code>fit()</code> to train your model with all features, except for 'Total'</b></li>
    <li><b style="font-size:1.2em">Find the $R^{2}$ on the test data and train data using 30% of the dataset for testing</b></li>
<br>
<b style="font-size:1.2em">Hint: you have test and train data from Question #1</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
lre1 = LinearRegression()
lre1.fit(x_train1[['Quantity']],y_train1)
lrem1 = LinearRegression()
lrem1.fit(x_train1,y_train1)
print("R^2 in SLR for test data: ",lre1.score(x_test1[['Quantity']],y_test1))
print("R^2 in SLR for train data: ",lre1.score(x_train1[['Quantity']],y_train1))
print()
print("R^2 in MLR for test data: ",lrem1.score(x_test1,y_test1))
print("R^2 in MLR for train data: ",lrem1.score(x_train1,y_train1))
```

</details>


Sometimes you do not have sufficient testing data; as a result, you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation.


<b style="font-size: 1.2em; font-weight: bold;">Cross-Validation Score</b>


Let's use <b>cross_val_score</b> from the module <b>model_selection</b>.


We input the object, the feature ("Quantity"), and the target data (y_data). The parameter 'cv' determines the number of folds. In this case, it is 3.


In [ ]:
Rcross = cross_val_score(lre, x_data[['Quantity']], y_data, cv = 3)

The default scoring is $R^{2}$. Each element in the array is the $R^{2}$ value obtained on one fold:
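
To make the fold mechanics concrete, here is a sketch on synthetic data (not the lab dataset) that reproduces `cross_val_score` with an explicit loop. It assumes the behaviour for regressors with an integer `cv`, where the folds are consecutive, unshuffled blocks as produced by `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, roughly linear data (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.uniform(1, 10, size=(60, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=60)

scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=3)

# Manual version: train on two folds, score R^2 on the held-out fold.
manual = []
for train_idx, test_idx in KFold(n_splits=3).split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

print(scores)
print(np.array(manual))
```

Both arrays hold one score per fold, so their length equals the value of `cv`.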


In [ ]:
Rcross

We can calculate the average and standard deviation of our estimate:


In [ ]:
print("The mean of the folds are {:.3f} and the standard deviation is {:.3f}".format(Rcross.mean(), Rcross.std()))

We can use the negative mean squared error as a score by setting the parameter 'scoring' to 'neg_mean_squared_error'. Multiplying the result by -1 gives the mean squared error for each fold.


In [ ]:
-1 * cross_val_score(lre,x_data[['Quantity']], y_data,cv = 3,scoring = 'neg_mean_squared_error')

Let's calculate the cross-validation score for multiple linear regression:


In [ ]:
Rcrossm = cross_val_score(lrem, x_data, y_data, cv = 3)
print("The mean of the folds are {:.3f} and the standard deviation is {:.3f}".format(Rcrossm.mean(), Rcrossm.std()))
print('The mean squared error per fold\n',\
     -1 * cross_val_score(lrem,x_data, y_data,cv = 3,scoring = 'neg_mean_squared_error'))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #3: </b><br>
<b style="font-size:1.2em"> 
Calculate the average $R^{2}$ and its standard deviation using four folds for the linear model <code>lre1</code> and the model <code>lrem1</code>. Set the parameter <code>cv = 4</code>.
</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
Rc = cross_val_score(lre1,x_data[['Quantity']], y_data,cv = 4)
Rc1 = cross_val_score(lrem1, x_data, y_data, cv = 4)
print("The mean of the folds are {:.3f} and the standard deviation is {:.3f}".format(Rc.mean(), Rc.std()))
print("The mean of the folds are {:.3f} and the standard deviation is {:.3f}".format(Rc1.mean(), Rc1.std()))
```

</details>


You can also use the function 'cross_val_predict' to predict the output. The function splits the data into the specified number of folds; each fold is used once for testing while the remaining folds are used for training.


We input the object, the feature <b>"Quantity"</b>, and the target data <b>y_data</b>. The parameter 'cv' determines the number of folds. In this case, it is 3. We can produce an output:
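
The sketch below, again on synthetic data that is not part of the lab, shows that `cross_val_predict` returns one out-of-fold prediction per sample: each value comes from a model that did not see that sample during training. The manual loop assumes unshuffled `KFold` splits, which is what an integer `cv` gives for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic, roughly linear data (illustrative only).
rng = np.random.default_rng(2)
X_demo = rng.uniform(1, 10, size=(30, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=30)

oof = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=3)

# Manual version: fill in the predictions fold by fold.
manual = np.empty_like(y_demo)
for train_idx, test_idx in KFold(n_splits=3).split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual[test_idx] = fold_model.predict(X_demo[test_idx])

print(oof[:5])
print(manual[:5])
```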


In [ ]:
yhat = cross_val_predict(lre,x_data[['Quantity']], y_data,cv = 3)
print('PREDICTED VALUES\n',yhat[0:5])
print('REAL VALUES\n',y_data[0:5])

And plot the predicted and real values


In [ ]:
plt.figure()
plt.title('Predicted values vs real values')
plt.xlabel('Index')
plt.ylabel('Total cost')
y_data.plot()
plt.plot(y_data.index,yhat)

And for several predictors


In [ ]:
yhatm = cross_val_predict(lrem,x_data, y_data,cv = 3)
print('PREDICTED VALUES\n',yhatm[0:5])
print('REAL VALUES\n',y_data[0:5])

In [ ]:
plt.figure()
plt.title('Predicted values vs real values')
plt.xlabel('Index')
plt.ylabel('Total cost')
y_data.plot()
plt.plot(y_data.index,yhatm)

<b style="font-size: 1.5em; font-weight: bold"><a name="id2" style="text-decoration: none;"><font color="black">2. Underfitting and Model Selection</font></a></b>
<p>Sometimes, our model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training data and the test data. One reason for this is underfitting.</p>
<p>Let's go over some examples of underfitting in Multiple Linear Regression and Polynomial Regression.</p>
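
As a minimal illustration of the idea, the sketch below fits a straight line to synthetic, noise-free quadratic data (made up for this example, not the lab dataset). The line scores poorly on both the training and the test points, while a degree-two polynomial model fits both well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic relationship (illustrative only).
x = np.linspace(-3, 3, 61).reshape(-1, 1)
y = x[:, 0] ** 2

# Hold out every third point as test data.
test_mask = np.arange(len(x)) % 3 == 0
x_tr, x_te = x[~test_mask], x[test_mask]
y_tr, y_te = y[~test_mask], y[test_mask]

# Too simple a model: a straight line underfits the curve.
line = LinearRegression().fit(x_tr, y_tr)
print(line.score(x_tr, y_tr), line.score(x_te, y_te))

# Adding a squared term lets the model capture the relationship.
quad = PolynomialFeatures(degree=2)
curve = LinearRegression().fit(quad.fit_transform(x_tr), y_tr)
print(curve.score(quad.transform(x_tr), y_tr), curve.score(quad.transform(x_te), y_te))
```

The pattern to look for is the same as in the lab: similar but low scores on both sets point to underfitting rather than overfitting.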


<b style="font-size: 1.2em; font-weight: bold;">Underfitting</b>


We already have the Multiple Linear Regression model <code>lrem</code> that we created before.
Prediction using training data:


In [ ]:
yhat_train = lrem.predict(x_train)
print('PREDICTED TRAIN VALUES\n',yhat_train[0:5])
print('REAL TRAIN VALUES\n',y_train[0:5])

Prediction using test data:


In [ ]:
yhat_test = lrem.predict(x_test)
print('PREDICTED TEST DATA\n',yhat_test[0:5])
print('REAL TEST DATA\n',y_test[0:5])

Let's perform some model evaluation using our training and testing data separately.


First, examine the distribution of the predicted values of the training data.


In [ ]:
Title = 'Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)

**Figure 1**: Plot of predicted values using the training data compared to the actual values of the training data.


In [ ]:
plt.figure()
plt.title('Predicted values vs real values in Train data')
plt.xlabel('Index')
plt.ylabel('Total cost')
y_train.plot()
plt.plot(y_train.index, yhat_train)

In [ ]:
Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test,yhat_test,"Actual Values (Test)","Predicted Values (Test)",Title)

**Figure 2**: Plot of predicted value using the test data compared to the actual values of the test data.


In [ ]:
plt.figure()
plt.title('Predicted values vs real values in Test data')
plt.xlabel('Index')
plt.ylabel('Total cost')
y_test.plot()
plt.plot(y_test.index, yhat_test)

Comparing **Figure 1** and **Figure 2**, you can see that the predictions for the train data and the test data are similar to each other, but both differ from the real data.


In [ ]:
print('R^2 for train data: ',lrem.score(x_train,y_train))
print('R^2 for test data: ',lrem.score(x_test, y_test))

Let's use 60 percent of the data for training and the rest for testing:


In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.4, shuffle = False)

We will perform a degree 5 polynomial transformation on the feature <b>'Quantity'</b>.
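
To see what a polynomial transformation produces, here is a small sketch with two hypothetical quantity values and a lower degree so the output stays readable. Each row is expanded into the powers $1, x, x^{2}, x^{3}$.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical quantity values (illustrative only).
q = np.array([[2.0], [3.0]])

expanded = PolynomialFeatures(degree=3).fit_transform(q)
print(expanded)
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```

For a single feature, a degree $n$ transformation yields $n + 1$ columns, so degree 5 gives six columns including the constant term.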


In [ ]:
pr = PolynomialFeatures(degree = 5)
x_train_pr = pr.fit_transform(x_train[['Quantity']])
# the transformer is already fitted on the training data, so only transform the test data
x_test_pr = pr.transform(x_test[['Quantity']])
pr

Now, let's create a Linear Regression model "poly" and train it.


In [ ]:
poly = LinearRegression()
poly.fit(x_train_pr, y_train)

We can see the output of our model using the method "predict." We assign the values to "yhat".


In [ ]:
yhat = poly.predict(x_test_pr)

Let's take the first five predicted values and compare them to the actual targets.


In [ ]:
print("PREDICTED VALUES:", yhat[0:5])
print("REAL VALUES:", y_test[0:5].values)


We will use the function "PollyPlot" that we defined at the beginning of the lab to display the training data, testing data, and the predicted function.


In [ ]:
PollyPlot(x_train[['Quantity']], x_test[['Quantity']], y_train, y_test, poly, pr)

**Figure 3**: A polynomial regression model where red dots represent training data, green dots represent test data, and the blue line represents the model prediction.


In [ ]:
plt.figure()
plt.title('Predicted values vs real values in Test data')
plt.xlabel('Index')
plt.ylabel('Total cost')
y_test.plot()
plt.plot(y_test.index, yhat)

We see that the estimated function appears to track the data over part of the range of <b>Quantity</b>; look for the region where it begins to diverge from the data points.


$R^{2}$ of the training data:


In [ ]:
poly.score(x_train_pr, y_train)

$R^{2}$ of the test data:


In [ ]:
poly.score(x_test_pr, y_test)

We see the $R^{2}$ for the training data is 0.77 while the $R^{2}$ on the test data is 0.73. This is an example of underfitting: the scores for the train data and the test data are almost equal, but neither is high.


Let's see how the $R^{2}$ changes on the test data for different order polynomials and then plot the results:


In [ ]:
Rsqu_test = []

order = [1, 2, 3, 4, 5]
for n in order:
    pr = PolynomialFeatures(degree = n)

    x_train_pr = pr.fit_transform(x_train[['Quantity']])
    x_test_pr = pr.transform(x_test[['Quantity']])

    # use a fresh model so the multiple regression model lrem is not overwritten
    lr_n = LinearRegression()
    lr_n.fit(x_train_pr, y_train)

    Rsqu_test.append(lr_n.score(x_test_pr, y_test))

plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')

The following function will be used in the next section. Please run the cell below.


In [ ]:
def f(order, test_data):
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = test_data, random_state = 0)
    pr = PolynomialFeatures(degree = order)
    x_train_pr = pr.fit_transform(x_train[['Quantity']])
    x_test_pr = pr.transform(x_test[['Quantity']])
    poly = LinearRegression()
    poly.fit(x_train_pr,y_train)
    PollyPlot(x_train[['Quantity']], x_test[['Quantity']], y_train,y_test, poly, pr)

The following interface allows you to experiment with different polynomial orders and different amounts of data.


In [ ]:
interact(f, order = (0, 6, 1), test_data = (0.05, 0.95, 0.05))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #4:</b><br>
<b style="font-size:1.2em">We can perform polynomial transformations with more than one feature.</b> 
<li><b style="font-size:1.2em">Create a "PolynomialFeatures" object "pr1" of degree two.</b></li>
<li><b style="font-size:1.2em">Transform the training and testing samples for all features, except for 'Total'. Use the method <code>fit_transform()</code> for the training data and <code>transform()</code> for the test data. Save the training data in the variable <code>x_train_pr1</code> and the test data in the variable <code>x_test_pr1</code></b></li>
<li><b style="font-size:1.2em">Find the dimensions of the new feature matrix for the training data</b></li>
<li><b style="font-size:1.2em">Create a linear regression model "poly1". Train the object using the method "fit" using the polynomial features.</b></li>
<li><b style="font-size:1.2em">Use the method "predict" to predict an output on the polynomial features, then use the function "DistributionPlot" to display the distribution of the predicted test output vs. the actual test data.</b></li>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
pr1 = PolynomialFeatures(degree = 2)
x_train_pr1 = pr1.fit_transform(x_train)
x_test_pr1 = pr1.transform(x_test)
print('Dimension of training data: ',x_train_pr1.shape)
poly1 = LinearRegression().fit(x_train_pr1,y_train)
yhat_test1 = poly1.predict(x_test_pr1)
Title = 'Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test, yhat_test1, "Actual Values (Test)", "Predicted Values (Test)", Title)
```

</details>


<b style="font-size: 1.5em; font-weight: bold"><a name="id3" style="text-decoration: none;"><font color="black">3. Ridge Regression</font></a></b> 


In this section, we will review Ridge Regression and see how the parameter alpha changes the model. Just a note, here our test data will be used as validation data.
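
As background for how alpha changes the model, Ridge regression minimizes $\lVert y - Xw \rVert^{2} + \alpha \lVert w \rVert^{2}$, so a larger alpha pulls the coefficients toward zero. The sketch below uses synthetic data and a hypothetical helper `ridge_closed_form` (not part of the lab) that implements the closed-form solution without an intercept, and compares it with `Ridge(fit_intercept=False)`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with three features (illustrative only).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=40)

def ridge_closed_form(X, y, alpha):
    """Hypothetical helper: solve (X^T X + alpha * I) w = X^T y, no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = []
matches = []
for alpha in [0.1, 10.0, 1000.0]:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_
    matches.append(np.allclose(w, w_sklearn))
    norms.append(np.linalg.norm(w))
    print(alpha, np.round(w, 3), norms[-1])
```

The printed coefficient norm shrinks as alpha grows, which is the regularization effect explored in the rest of this section.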


We will reuse the degree-two polynomial features <code>x_train_pr1</code> and <code>x_test_pr1</code> created in Question #4.<br>
<b style="font-size: 1.2em; font-weight: bold;">Attention! If you skipped the previous task, the code below will not run.</b>


Let's create a Ridge regression object, setting the regularization parameter (alpha) to 0.1


In [ ]:
RidgeModel = Ridge(alpha = 0.1)

Like regular regression, you can fit the model using the method <b>fit</b>.


In [ ]:
RidgeModel.fit(x_train_pr1, y_train)

Similarly, you can obtain a prediction:


In [ ]:
yhat = RidgeModel.predict(x_test_pr1)

Let's compare the first five predicted samples to our test set:


In [ ]:
print('predicted:', yhat[0:5])
print('test set :', y_test[0:5].values)
plt.figure()
plt.title('Predicted values vs real values in Test data')
plt.xlabel('Index')
plt.ylabel('Total cost')
y_test.plot()
plt.plot(y_test.index, yhat)

We select the value of alpha that minimizes the test error. To do so, we can use a for loop. We have also created a progress bar to see how many iterations we have completed so far.


In [ ]:
Rsqu_test = []
Rsqu_train = []
Alpha = 0.1 * np.arange(1000)
pbar = tqdm(Alpha)

for alpha in pbar:
    RidgeModel = Ridge(alpha = alpha) 
    RidgeModel.fit(x_train_pr1, y_train)
    test_score, train_score = RidgeModel.score(x_test_pr1, y_test), RidgeModel.score(x_train_pr1, y_train)
    
    pbar.set_postfix({"Test Score": test_score, "Train Score": train_score})

    Rsqu_test.append(test_score)
    Rsqu_train.append(train_score)

We can plot out the value of $R^{2}$ for different alphas:


In [ ]:
width = 12
height = 10
plt.figure(figsize = (width, height))

plt.plot(Alpha,Rsqu_test, label = 'validation data  ')
plt.plot(Alpha,Rsqu_train, 'r', label = 'training Data ')
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.legend()

**Figure 4**: The blue line represents the $R^{2}$ of the validation data, and the red line represents the $R^{2}$ of the training data. The x-axis represents the different values of Alpha.


Here the model is trained on the training data and evaluated on the held-out test data, which serves as validation data.

The red line in Figure 4 represents the $R^{2}$ of the training data. As alpha increases the $R^{2}$ decreases. 

The blue line represents the $R^{2}$ on the validation data. As the value for alpha increases, the $R^{2}$ decreases.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 1.5em; font-weight: bold;"> Question  #5:</b><br>

Perform Ridge regression. Calculate the $R^{2}$ using the polynomial features, use the training data to train the model and use the test data to test the model. The parameter alpha should be set to 10.

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
RidgeModel = Ridge(alpha=10) 
RidgeModel.fit(x_train_pr1, y_train)
RidgeModel.score(x_test_pr1, y_test)

```

</details>


<b style="font-size: 1.5em; font-weight: bold"><a name="id4" style="text-decoration: none"><font color="black">4. Grid Search</font></a></b>


The term alpha is a hyperparameter. Sklearn has the class <b>GridSearchCV</b> to make the process of finding the best hyperparameter simpler.
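
Before applying it to the lab data, here is a self-contained sketch of the grid search workflow on synthetic data. The arrays, the grid values, and the variable names (`param_grid`, `search`) are illustrative assumptions, not the lab's own. The search is fitted on the training split only, so the test split remains unseen until the final score.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data with five features (illustrative only).
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0, 0.0, 0.7]) + rng.normal(0, 0.5, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Candidate values for the hyperparameter alpha.
param_grid = {"alpha": [0.01, 1, 100]}
search = GridSearchCV(Ridge(), param_grid, cv=3)
search.fit(X_tr, y_tr)   # cross-validation happens inside the training split only

print(search.best_params_)                        # alpha with the best mean CV score
print(search.cv_results_["mean_test_score"])      # mean CV R^2 for each candidate
print(search.best_estimator_.score(X_te, y_te))   # final check on unseen data
```

By default `GridSearchCV` refits the best candidate on all the data passed to `fit`, which is why `best_estimator_` can be scored directly.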


We create a dictionary of parameter values:


In [ ]:
parameters1= [{'alpha': [0.001,0.1,1, 10, 100, 1000, 10000, 100000]}]
parameters1

Create a Ridge regression object:


In [ ]:
RR=Ridge()
RR

Create a ridge grid search object:


In [ ]:
Grid1 = GridSearchCV(RR, parameters1, cv = 3)

Fit the model:


In [ ]:
# fit on the training data only, so the test data stays unseen until the final evaluation
Grid1.fit(x_train, y_train)

The object finds the best parameter values on the validation data. We can obtain the estimator with the best parameters and assign it to the variable BestRR as follows:


In [ ]:
BestRR=Grid1.best_estimator_
BestRR

We now test our model on the test data:


In [ ]:
BestRR.score(x_test, y_test)

Let's plot the real test data and predicted test data with the best alpha parameter


In [ ]:
yhat = BestRR.predict(x_test)
plt.figure()
plt.title('Predicted values vs real values in Test data')
plt.xlabel('Index')
plt.ylabel('Total cost')
y_test.plot()
plt.plot(y_test.index, yhat)

### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/victor_dyrenko?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0TE4EN3049-2023-01-01">Victor Dyrenko</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0TE4EN3049-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0TE4EN3049-2023-01-01">Olga Kavun</a>

## Change Log

| Date (YYYY-MM-DD) | Version |     Changed By   | Change Description                                         |
| ----------------- | ------- | ---------------- | ---------------------------------------------------------- |
| 2023-05-05        | 1       | Victor Dyrenko   | Finished lab                                               |


<hr>

<h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
