<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# **Retail Sales Dataset 2018-2022**
# **Lab 5. Model Evaluation and Refinement**

Estimated time needed: **1** hour

### **Dataset Attributes**
*   Date: year and month
*   SKU: unique code consisting of letters and numbers that identify each product
*   Group: group of related products which share some common attributes
*   Units Pkg: package weight (kg)
*   Avg Price Pkg: average price per package
*   Sales Pkg: total package sales per month

### **Target Field**
*   Turnover per month

### **Objectives**

After completing this lab you will be able to:

*   Evaluate and refine prediction models


<div class="alert alert-block alert-info" style="margin-top: 20px">
<h2>Table of Contents</h2>
<ul>
    <li><a href="#ref1">Training and Testing</a></li>
    <li><a href="#ref2">Over-fitting, Under-fitting and Model Selection </a></li>
    <li><a href="#ref3">Ridge Regression </a></li>
    <li><a href="#ref4">Grid Search</a></li>
</ul>
</div>


#### **Setup**


If you run the lab locally using Anaconda, you can install the required libraries by uncommenting the following lines:


In [ ]:
# Install the libraries used in this lab
#! mamba install pandas -y
#! mamba install numpy -y
#! mamba install scikit-learn -y
#! mamba install ipywidgets -y
#! mamba install tqdm -y

In [ ]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

Libraries for plotting:


In [ ]:
from ipywidgets import interact, interactive, fixed, interact_manual
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/DA101EN_object_storage?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX044EN/clean_sales_1.csv'
df = pd.read_csv(path)
df.head()

<b>In the previous lab, we said that 'Sales Pkg' was an important variable for predicting turnover. In fact, it was suspiciously good. If you wonder why, the 'Overfitting' section shows that no matter which polynomial order or test-set size we choose, the model always fits perfectly.<br>
That is because 'Sales Pkg' is directly proportional to 'Turnover per month': turnover = sales pkg * avg price pkg. It is impractical to use an input variable that is directly proportional to the output variable.</b>
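
To see why such a feature leaks the target, here is a minimal sketch on synthetic data. The arrays `sales_pkg`, `avg_price_pkg` and `turnover` are made-up stand-ins, not the lab's dataset, and the only assumption is the relationship stated above (turnover = sales pkg * avg price pkg). A degree 2 polynomial expansion contains the product of the two inputs, so a linear model can reproduce the target almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in values (hypothetical, for illustration only)
rng = np.random.default_rng(0)
sales_pkg = rng.integers(1, 500, size=200).astype(float)
avg_price_pkg = rng.uniform(1.0, 20.0, size=200)
turnover = sales_pkg * avg_price_pkg  # the relationship stated above

# Degree 2 features include the column sales_pkg * avg_price_pkg
X_leak = np.column_stack([sales_pkg, avg_price_pkg])
X_leak_poly = PolynomialFeatures(degree=2).fit_transform(X_leak)

leaky_model = LinearRegression().fit(X_leak_poly, turnover)
leaky_r2 = leaky_model.score(X_leak_poly, turnover)
print("R^2 with the leaking feature:", leaky_r2)
```

A score this close to 1 says nothing about predictive skill: the model has simply been handed the formula for the target.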


So let's turn our categorical variables into numerical so that we can make predictions and analysis using them.


To make 'Group' and 'Date' fields numerical:


In [ ]:
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
df = df.copy()  # work on a copy so the encoded columns do not touch the loaded frame
df['Group'] = enc.fit_transform(df[['Group']]).astype(int)
df['Date'] = enc.fit_transform(df[['Date']]).astype(int)

df.head()
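
As a small illustration of what the encoder does, the sketch below applies `OrdinalEncoder` to a toy frame. The group names and date strings are hypothetical; the real 'Group' and 'Date' values may look different. Each column's categories are sorted and replaced by their position in that sorted order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy values, not taken from the lab's dataset
toy = pd.DataFrame({
    "Group": ["Tea", "Coffee", "Tea", "Cocoa"],
    "Date": ["2018-02", "2018-01", "2018-03", "2018-01"],
})

toy_enc = OrdinalEncoder()
codes = toy_enc.fit_transform(toy[["Group", "Date"]]).astype(int)

print(toy_enc.categories_)  # sorted categories, one array per column
print(codes)                # each value is the index of its category
```

Note that the integer codes impose an order. For year-month strings that order is chronological, but for product groups it is only alphabetical, so a linear model will treat an arbitrary ranking as if it were meaningful.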


First, let's only use numeric data:


In [ ]:
df = df.select_dtypes(include='number')  # public API instead of the private _get_numeric_data()
df.head()

## **Functions for Plotting**


We defined similar functions in the previous lab.


In [ ]:
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    width = 10
    height = 8
    plt.figure(figsize=(width, height))

    # kdeplot replaces the deprecated distplot(..., hist=False)
    ax1 = sns.kdeplot(x=RedFunction, color="r", label=RedName)
    ax2 = sns.kdeplot(x=BlueFunction, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.xlabel('Turnover')
    plt.ylabel('Proportion of Sales')
    plt.legend()

    plt.show()
    plt.close()

In [ ]:
def PollyPlot(xtrain, xtest, y_train, y_test, lr, poly_transform):
    width = 10
    height = 8
    plt.figure(figsize=(width, height))
    
    
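    # xtrain, xtest: single-column DataFrames holding the feature
    # y_train, y_test: target values of the training and testing data
    # lr: fitted linear regression object
    # poly_transform: polynomial transformation object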
 
    xmax = max([xtrain.values.max(), xtest.values.max()])

    xmin = min([xtrain.values.min(), xtest.values.min()])

    x = np.arange(xmin, xmax, 0.1)


    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))), label='Predicted Function')
    plt.ylabel('Turnover')
    plt.legend()

<div style="margin-top: 1em;">
<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="ref1"><font color="black">Part 1: Training and Testing</font></a></b>
</div>

<p>An important step in testing your model is to split your data into training and testing data.
Training data is used to train the model, while testing data is used to measure the model's accuracy.<br>
We will place the target data <b>turnover per month</b> in a separate Series <b>y_data</b>:</p>


In [ ]:
y_data = df['Turnover per month']

We drop the turnover column to build the feature DataFrame **x_data**. <code>df.drop()</code> returns a new DataFrame that contains all the columns of df except 'Turnover per month':


In [ ]:
x_data = df.drop('Turnover per month', axis=1)

Now, we randomly split our data into training and testing data using the function <b>train_test_split</b>.


In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=3)


print("number of test samples :", x_test.shape[0])
print("number of training samples:", x_train.shape[0])

x_train and y_train are the input features and target variable, respectively, for the training set, and x_test and y_test are the input features and target variable, respectively, for the test set. 


3724 rows (90% of the observations) make up the training data, and 414 rows (10%) make up the testing data.


The <b>test_size</b> parameter sets the proportion of data that is split into the testing set. In the above, the testing set is 10% of the total dataset.
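
The row counts quoted above can be checked without the dataset. The sketch below builds a placeholder array with 4138 rows (3724 + 414) and applies the same `test_size` and `random_state`; the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4138  # 3724 + 414, the row counts quoted above
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder feature
y_demo = np.arange(n_rows)                 # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=3
)
print("training rows:", X_tr.shape[0], "testing rows:", X_te.shape[0])

# The same random_state reproduces exactly the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=3
)
same_split = np.array_equal(X_te, X_te2)
print("identical split on second call:", same_split)
```

The test-set size is rounded up (10% of 4138 is 413.8, giving 414 rows), and fixing `random_state` keeps the split reproducible between runs.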


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #1:</b>

<b>Use the function "train_test_split" to split up the dataset such that 40% of the data samples will be utilized for testing. Set the parameter "random_state" equal to zero. The output of the function should be the following:  "x_train1" , "x_test1", "y_train1" and  "y_test1".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.4, random_state=0) 
print("number of test samples :", x_test1.shape[0])
print("number of training samples:",x_train1.shape[0])
```

</details>


We create a Linear Regression object:


In [ ]:
lm = LinearRegression()

We fit the model using the feature "Units Pkg":


In [ ]:
lm.fit(x_train[['Units Pkg']], y_train)

In [ ]:
Yhat_train = lm.predict(x_train[['Units Pkg']])
Yhat_test = lm.predict(x_test[['Units Pkg']])

Let's calculate the R^2 on the test data:


In [ ]:
lm.score(x_test[['Units Pkg']], y_test)

The R^2 on the test data is not much higher than the R^2 on the training data, which we calculate next:


In [ ]:
lm.score(x_train[['Units Pkg']], y_train)

Both the training and the test R^2 scores are very small, which is a sign of <b>underfitting</b>: the model is too simple to capture the relationship in the data.
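
For reference, the value returned by `score` is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the target from its mean. The sketch below verifies this on synthetic data in which the target barely depends on the feature; all names and values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x_weak = rng.uniform(0, 10, size=(100, 1))
# The target is mostly noise, so a straight line explains very little
y_weak = 0.1 * x_weak[:, 0] + rng.normal(0, 5, size=100)

weak_lm = LinearRegression().fit(x_weak, y_weak)
y_weak_pred = weak_lm.predict(x_weak)

ss_res = np.sum((y_weak - y_weak_pred) ** 2)    # unexplained variation
ss_tot = np.sum((y_weak - y_weak.mean()) ** 2)  # total variation
r2_manual = 1 - ss_res / ss_tot

print("manual R^2:", r2_manual)
print("score()   :", weak_lm.score(x_weak, y_weak))
```

An R^2 near 0 means the model does barely better than always predicting the mean of the target.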


Let's visualize:


In [ ]:
y_train

In [ ]:
res = pd.DataFrame(y_train)
res['Yhat'] = Yhat_train
# DataFrame.plot opens its own figure, so the size is passed here
res.sort_index().plot(figsize=(8, 6))

plt.title("Predicted data using training data vs real value using training data")
plt.xlabel("Index of Units Pkg (X)")
plt.ylabel("Turnover (Y)")

In [ ]:
res = pd.DataFrame(y_test)
res['Yhat'] = Yhat_test
# DataFrame.plot opens its own figure, so the size is passed here
res.sort_index().plot(figsize=(8, 6))

plt.title("Predicted data using testing data vs real value using testing data")
plt.xlabel("Index of Units Pkg (X)")
plt.ylabel("Turnover (Y)")

It is clear that our predicted data (orange) hardly overlaps the real data (blue) in either plot. <b>Our model did not learn from the given data</b>.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #2:</b>
    
<b>Find the R^2  on the test and train data using 40% of the dataset for testing.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train_ex, x_test_ex, y_train_ex, y_test_ex = train_test_split(x_data, y_data, test_size=0.4, random_state=0)

lm_ex = LinearRegression()
lm_ex.fit(x_train_ex[['Units Pkg']], y_train_ex)

print("r-score for test data", lm_ex.score(x_test_ex[['Units Pkg']], y_test_ex))
print("r-score for train data", lm_ex.score(x_train_ex[['Units Pkg']], y_train_ex))

```

</details>


Sometimes you do not have sufficient testing data; as a result, you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation.


## **Cross-Validation Score**


We input the object, the feature ("Units Pkg"), and the target data (y_data).


The parameter <code>cv</code> specifies the number of folds or partitions that the dataset will be split into. In this case, the dataset is divided into 4 parts, or folds, and the model is trained and tested 4 times, with each fold serving as the test set once and the remaining folds as the training set. The <code>cross_val_score()</code> function returns an array of scores, one for each fold, which can be used to evaluate the performance of the model.


In [ ]:
Rcross = cross_val_score(lm, x_data[['Units Pkg']], y_data, cv=4)

The default scoring is R^2. Each element in the array is the R^2 value for one fold:


In [ ]:
Rcross

We can calculate the average and standard deviation of our estimate:


In [ ]:
print("The mean of the folds is", Rcross.mean(), "and the standard deviation is", Rcross.std())
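
To make the fold mechanics concrete, the following sketch reproduces `cross_val_score` with an explicit loop on synthetic data. For a regressor and an integer `cv`, scikit-learn uses `KFold` without shuffling; the data and the names `X_cv`, `y_cv` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X_cv = rng.uniform(0, 10, size=(200, 1))
y_cv = 3.0 * X_cv[:, 0] + rng.normal(0, 2, size=200)

manual_scores = []
for train_idx, test_idx in KFold(n_splits=4).split(X_cv):
    # Train on three folds, score on the held-out fold
    fold_model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=4)
print("manual loop     :", manual_scores)
print("cross_val_score :", auto_scores)
```

Because the folds are contiguous blocks of rows, data that is sorted (for example by date) gives each fold a different slice of the ordering, which can make the fold scores vary considerably.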

We can use the negative mean squared error as the score by setting the 'scoring' parameter to 'neg_mean_squared_error'. Multiplying the result by -1 gives the usual (positive) mean squared error:


In [ ]:
-1 * cross_val_score(lm, x_data[['Units Pkg']], y_data, cv=4, scoring='neg_mean_squared_error')

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #3:</b>
    
<b>Calculate the average R^2 using two folds, then find the R^2 for the second fold, utilizing the "Units Pkg" feature:</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
Rcross_ex = cross_val_score(lm, x_data[["Units Pkg"]], y_data, cv=2)
print("average R^2:", Rcross_ex.mean())
print("R^2 of the second fold:", Rcross_ex[1])

```

</details>


You can also use the function <code>cross_val_predict</code> to predict the output. The function splits the data into the specified number of folds; each fold is used once for testing while the remaining folds are used for training.


We input the object, the feature <b>"Units Pkg"</b>, and the target data <b>y_data</b>. The parameter 'cv' determines the number of folds. In this case, it is 4. We can produce an output:


In [ ]:
yhat = cross_val_predict(lm, x_data[['Units Pkg']], y_data, cv=4)
yhat
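
Every value returned by `cross_val_predict` comes from a model that did not see that row during training. The sketch below checks this for the first fold on synthetic data; the names `X_oof`, `y_oof` and `oof_pred` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(3)
X_oof = rng.uniform(0, 10, size=(120, 1))
y_oof = 2.0 * X_oof[:, 0] + rng.normal(0, 1, size=120)

# One out-of-fold prediction per row
oof_pred = cross_val_predict(LinearRegression(), X_oof, y_oof, cv=4)

# Reproduce the predictions of the first fold by hand
train_idx, test_idx = next(KFold(n_splits=4).split(X_oof))
first_fold_model = LinearRegression().fit(X_oof[train_idx], y_oof[train_idx])
first_fold_pred = first_fold_model.predict(X_oof[test_idx])

print("first fold matches:", np.allclose(oof_pred[test_idx], first_fold_pred))
print("R^2 of pooled out-of-fold predictions:", r2_score(y_oof, oof_pred))
```

The pooled R^2 computed here is generally not identical to the mean of the per-fold scores from `cross_val_score`, since it is computed once over all rows.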

Let's see how good our model is:


In [ ]:
plt.figure(figsize=(8, 6))

plt.title("Predicted vs real value")
plt.xlabel("Index of Units Pkg (X) from 0 to 4137")
plt.ylabel("Turnover (Y)")

y_data.plot()
plt.plot(y_data.index, yhat)

We see that our model fits very poorly.


<div style="margin-top: 1em;">
<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="ref2"><font color="black">Part 2: Overfitting, Underfitting and Model Selection</font></a></b>
</div>

In the previous part, the 'lm' model gave us an example of underfitting: a model that learns poorly from the data.
<p>It turns out that the test data, sometimes referred to as the "out of sample data", is a much better measure of how well your model performs in the real world. One reason for this is overfitting.

Let's go over some examples. It turns out these differences are more apparent in Multiple Linear Regression and Polynomial Regression, so we will explore overfitting in that context.</p>


#### **Underfitting**


Let's create a <b>Multiple Linear Regression</b> object and train the model using 'Units Pkg', 'Avg Price Pkg' and 'Group' as features.


In [ ]:
lm1 = LinearRegression()
lm1.fit(x_train[['Units Pkg', 'Avg Price Pkg', 'Group']], y_train)

Prediction using training data:


In [ ]:
yhat_train = lm1.predict(x_train[['Units Pkg', 'Avg Price Pkg', 'Group']])
yhat_train[0:5]

Prediction using test data:


In [ ]:
yhat_test = lm1.predict(x_test[['Units Pkg', 'Avg Price Pkg', 'Group']])
yhat_test[0:5]

Calculate r-scores:


In [ ]:
lm1.score(x_train[['Units Pkg', 'Avg Price Pkg', 'Group']], y_train)

In [ ]:
lm1.score(x_test[['Units Pkg', 'Avg Price Pkg', 'Group']], y_test)

A very low R^2 score on both the training and the test data is a sign of underfitting.


Let's perform some model evaluation using our training and testing data separately.


Let's examine the distribution of the predicted values of the training data.


In [ ]:
Title = 'Distribution Plot of Predicted Value Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title) # red color - actual data, blue - predicted

<b>Figure 1</b>: Plot of predicted values using the training data compared to the actual values of the training data.


The model seems to learn poorly from the training dataset. We can already guess that it will do just as badly on the testing data. Let's check:


In [ ]:
Title = 'Distribution Plot of Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test, yhat_test, "Actual Values (Test)", "Predicted Values (Test)", Title)

<b>Figure 2</b>: Plot of predicted value using the test data compared to the actual values of the test data.


<p>Comparing Figure 1 and Figure 2, we can see that the predicted distributions fit the actual distributions almost equally badly for the training data (Figure 1) and the test data (Figure 2). Let's see whether polynomial regression predicts the test dataset more accurately than linear regression.</p>


#### **Overfitting**
Overfitting occurs when a model is so complex that it fits the noise rather than the underlying process. The result is a model that is highly accurate on the training data but performs poorly on new data that it has not seen before: it is too tailored to the training data to generalize. Overfitting can also occur when there is not enough data to train the model properly.
<p>Therefore, when you evaluate your model on the test set, it does not perform as well, since it is modelling noise, not the underlying process that generated the relationship. Let's create a degree 7 polynomial model.</p>
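
Before turning to the sales data, the effect can be sketched on a small synthetic problem. The data below follow a noisy quadratic; the names and values are hypothetical. Raising the polynomial degree can only improve the training R^2, whereas the test R^2 typically stops improving, and often drops, once the model starts fitting noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x_syn = rng.uniform(-1, 1, size=(40, 1))
y_syn = 1.0 + 2.0 * x_syn[:, 0] ** 2 + rng.normal(0, 0.3, size=40)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_syn, y_syn, test_size=0.5, random_state=0
)

scores = {}
for degree in (1, 2, 12):
    poly_feats = PolynomialFeatures(degree=degree)
    # Fit the transformer on the training data, then reuse it on the test data
    model = LinearRegression().fit(poly_feats.fit_transform(x_tr), y_tr)
    scores[degree] = (
        model.score(poly_feats.transform(x_tr), y_tr),  # training R^2
        model.score(poly_feats.transform(x_te), y_te),  # test R^2
    )
    print("degree", degree, "train/test R^2:", scores[degree])
```

A large gap between the two numbers for the highest degree is the signature of overfitting described above.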


To obtain an overfitted model for this example, let's use 45 percent of the data for training and the rest for testing:


In [ ]:
x_train_poly, x_test_poly, y_train_poly, y_test_poly = train_test_split(x_data, y_data, test_size=0.55, random_state=0)

We will perform a degree 7 polynomial transformation on the feature <b>'Sales Pkg' (used here only as an example of overfitting)</b>.


In [ ]:
pr = PolynomialFeatures(degree=7)
x_train_pr = pr.fit_transform(x_train_poly[['Sales Pkg']])
x_test_pr = pr.transform(x_test_poly[['Sales Pkg']])  # reuse the transformer fitted on the training data
pr

Now, let's create a Linear Regression model "poly" and train it.


In [ ]:
poly = LinearRegression()
poly.fit(x_train_pr, y_train_poly)

We can see the output of our model using the method <code>predict</code>. We assign the values to "yhat_poly".


In [ ]:
yhat_poly = poly.predict(x_test_pr)
yhat_poly[0:5]

Let's take the first four predicted values and compare them to the actual targets.


In [ ]:
print("Predicted values:", yhat_poly[0:4])
print("True values:", y_test_poly[0:4].values)

We will use the function "PollyPlot" that we defined at the beginning of the lab to display the training data, testing data, and the predicted function.


In [ ]:
PollyPlot(x_train_poly[['Sales Pkg']], x_test_poly[['Sales Pkg']], y_train_poly, y_test_poly, poly, pr)

Figure 3: A polynomial regression model where red dots represent training data, green dots represent test data, and the blue line represents the model prediction.


We see that the test data and the training data overlap heavily and that both appear to follow the predicted function.


R^2 of the training data:


In [ ]:
poly.score(x_train_pr, y_train_poly)

R^2 of the test data:


In [ ]:
poly.score(x_test_pr, y_test_poly)

We see the R^2 for the training data is 0.89 while the R^2 on the test data is -1.77. The lower the R^2, the worse the model, and a negative R^2 means the model predicts worse than simply using the mean of the target. <b>If the R^2 scores for the training and test data differ widely and the training score is high, it is a sign of overfitting.</b> In our case, the sign is very clear: the model (poly) learned the training data well but cannot predict the test data.


Let's see how the R^2 changes on the <b>test</b> data for different order polynomials and then plot the results:


In [ ]:
Rsqu_test = []

order = [1, 2, 3, 4, 5, 6, 7]
for n in order:
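    pr = PolynomialFeatures(degree=n)

    x_train_pr = pr.fit_transform(x_train_poly[['Sales Pkg']])

    # reuse the transformer fitted on the training data
    x_test_pr = pr.transform(x_test_poly[['Sales Pkg']])

    # fit a fresh model so that lm1 from the previous section is not overwritten
    lr = LinearRegression()
    lr.fit(x_train_pr, y_train_poly)

    Rsqu_test.append(lr.score(x_test_pr, y_test_poly))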
    
# plt.figure(figsize=(3, 2))
plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
plt.text(6, 0.95, 'Max R^2')  

We see that the R^2 gradually increases until an order 6 polynomial is used. Then the R^2 drops dramatically for the order 7 polynomial.
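
Choosing the order from a single test split can be sensitive to how the data happened to be divided. As an alternative sketch, not the method used in this lab, the polynomial transform and the regression can be combined in a pipeline and compared with cross-validation. The synthetic cubic data and all names below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
x_sel = rng.uniform(-1, 1, size=(150, 1))
y_sel = x_sel[:, 0] ** 3 - x_sel[:, 0] + rng.normal(0, 0.05, size=150)

cv_means = {}
for degree in range(1, 8):
    # The pipeline refits the transform inside every fold
    pipe = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    cv_means[degree] = cross_val_score(pipe, x_sel, y_sel, cv=4).mean()

best_degree = max(cv_means, key=cv_means.get)
print(cv_means)
print("degree with the highest mean CV R^2:", best_degree)
```

The pipeline keeps the transform and the model together, so the same object can later be fitted on the training data and applied directly to new rows.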


The following function will be used in the next section. Please run the cell below.


In [ ]:
def f(order, test_data):
    x_train_sample, x_test_sample, y_train_sample, y_test_sample = train_test_split(x_data, y_data, test_size=test_data, random_state=0)
    # Create polynomial features for training and test sets
    pr_sample = PolynomialFeatures(degree=order)
    X_train_poly_sample = pr_sample.fit_transform(x_train_sample[['Sales Pkg']])
    X_test_poly_sample = pr_sample.fit_transform(x_test_sample[['Sales Pkg']])

    # Fit linear regression model to training data
    lin_reg = LinearRegression()
    lin_reg.fit(X_train_poly_sample, y_train_sample)

    # Predict on test data and calculate R^2 score
    y_pred = lin_reg.predict(X_test_poly_sample)
    r2_score = lin_reg.score(X_test_poly_sample, y_test_sample)

    # Plot the test data and polynomial fit
    PollyPlot(x_train_sample[['Sales Pkg']], x_test_sample[['Sales Pkg']], y_train_sample, y_test_sample, lin_reg, pr_sample)
    plt.show()

The following interface allows you to experiment with different polynomial orders and different amounts of data.


In [ ]:
interact(f, order=(0, 10, 1), test_data=(0.05, 0.95, 0.05))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #4 a):</b>

<b>We can perform polynomial transformations with more than one feature. Create a "PolynomialFeatures" object "pr1" of degree two.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
pr1 = PolynomialFeatures(degree=2)

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #4 b):</b>
    
<b>Transform the training and testing samples for the features 'Date', 'Units Pkg', 'Avg Price Pkg'. Hint: use the method "fit_transform".</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.2, random_state=0)

x_train1_pr = pr1.fit_transform(x_train1[['Date', 'Units Pkg', 'Avg Price Pkg']])
x_test1_pr = pr1.fit_transform(x_test1[['Date', 'Units Pkg', 'Avg Price Pkg']])
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #4 c):</b>
    
<b>How many dimensions does the new feature have? Hint: use the attribute "shape".
</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
x_train1_pr.shape #10 features

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #4 d):</b>
    
<b>Create a linear regression model "poly1". Train the object using the polynomial features.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
poly1 = LinearRegression()
poly1.fit(x_train1_pr, y_train1)
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #4 e):</b>
    
<b>Predict an output on the polynomial features, then use the function "DistributionPlot" to display the distribution of the predicted test output vs. the actual test data.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
y_hat_poly1 = poly1.predict(x_test1_pr)

Title = 'Distribution Plot of Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test1, y_hat_poly1, "Actual Values (Test)", "Predicted Values (Test)", Title) 

```

</details>


<div style="margin-top: 1em;">
<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="ref3"><font color="black">Part 3: Ridge Regression</font></a></b>
</div>


In this section, we will review Ridge Regression and see how the parameter <b>alpha</b> changes the model. Just a note, here our test data will be used as validation data.


Ridge regression is a linear regression method used for dealing with multicollinearity (high correlation between independent variables) in data. It adds a penalty term, alpha times the sum of the squared coefficients, to the cost function of linear regression, which is the sum of squared differences between the predicted values and actual values.
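
The effect of the penalty can be sketched on synthetic data with two nearly identical features. All names and values below are hypothetical. scikit-learn's `Ridge` minimizes ||y - Xw||^2 + alpha * ||w||^2, so a larger alpha pulls the coefficients towards zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(6)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # almost a copy of x1
X_col = np.column_stack([x1, x2])
y_col = 3.0 * x1 + rng.normal(scale=0.5, size=200)

# Without a penalty, collinear features can receive large offsetting weights
ols_coef = LinearRegression().fit(X_col, y_col).coef_
print("OLS coefficients:", ols_coef)

coef_norms = []
for alpha in (0.1, 10.0, 1000.0):
    ridge = Ridge(alpha=alpha).fit(X_col, y_col)
    coef_norms.append(np.linalg.norm(ridge.coef_))
    print("alpha =", alpha, "coefficients:", ridge.coef_)
```

The coefficient norm shrinks as alpha grows. A small alpha stabilizes the weights of the correlated features, while a very large alpha shrinks them so much that the model underfits.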


Let's perform a degree two polynomial transformation on our data.


In [ ]:
pr2 = PolynomialFeatures(degree=2)
x_train2_pr = pr2.fit_transform(x_train[['Units Pkg', 'Avg Price Pkg', 'Group', 'Date']])
x_test2_pr = pr2.fit_transform(x_test[['Units Pkg', 'Avg Price Pkg', 'Group', 'Date']])

Let's import <b>Ridge</b> from the module <b>linear_model</b>.


In [ ]:
from sklearn.linear_model import Ridge

Let's create a Ridge regression object, setting the regularization parameter (alpha) to 1:


In [ ]:
RidgeModel = Ridge(alpha=1) 

Like regular regression, you can fit the model using the method <b>fit</b>.


In [ ]:
RidgeModel.fit(x_train2_pr, y_train)

Similarly, you can obtain a prediction:


In [ ]:
y_hat_poly2 = RidgeModel.predict(x_test2_pr)

Let's compare the first four predicted samples to our test set:


In [ ]:
print('predicted:', y_hat_poly2[0:4])
print('test set :', y_test[0:4].values)

Let's visualize:


In [ ]:
res = pd.DataFrame(y_test)
res['Yhat'] = y_hat_poly2
# DataFrame.plot opens its own figure, so the size is passed here
res.sort_index().plot(figsize=(8, 6))

plt.title("Predicted data using testing data vs real value using testing data")
plt.xlabel("x_test2_pr")
plt.ylabel("Turnover (Y)")

We select the value of alpha that minimizes the test error. To do so, we can use a for loop. We have also created a progress bar to see how many iterations we have completed so far.


In [ ]:
from tqdm import tqdm

Rsqu_test1 = []
Rsqu_train1 = []
Alpha = 10 * np.arange(0, 1000)
pbar = tqdm(Alpha)

for alpha in pbar:
    RidgeModel = Ridge(alpha=alpha) 
    RidgeModel.fit(x_train2_pr, y_train)
    test_score, train_score = RidgeModel.score(x_test2_pr, y_test), RidgeModel.score(x_train2_pr, y_train)
    
    pbar.set_postfix({"Test Score": test_score, "Train Score": train_score})

    Rsqu_test1.append(test_score)
    Rsqu_train1.append(train_score)

We can plot out the value of R^2 for different alphas:


In [ ]:
plt.figure(figsize=(8, 6))

plt.plot(Alpha, Rsqu_test1, label='validation (test) data  ')
plt.plot(Alpha, Rsqu_train1, 'r', label='training data ')
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.legend()

**Figure 4**: The blue line represents the R^2 of the validation data, and the red line represents the R^2 of the training data. The x-axis represents the different values of Alpha. 


As alpha increases, both R^2 values decrease. Therefore, as alpha increases, the model performs worse on both the training data and the validation data.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #5:</b>

<b>Perform Ridge regression. Calculate the R^2 using the polynomial features, use the training data to train the model and use the test data to test the model. The parameter alpha should be set to 10.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
RidgeModel1 = Ridge(alpha=10) 
RidgeModel1.fit(x_train2_pr, y_train)
RidgeModel1.score(x_test2_pr, y_test)

```

</details>


<div style="margin-top: 1em;">
<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="ref4"><font color="black">Part 4: Grid Search</font></a></b>
</div>


The term alpha is a hyperparameter. Sklearn has the class <b>GridSearchCV</b> to make the process of finding the best hyperparameter simpler.


Let's import <b>GridSearchCV</b> from  the module <b>model_selection</b>.


In [ ]:
from sklearn.model_selection import GridSearchCV

We create a dictionary of possible parameter alpha values:


In [ ]:
parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]
parameters1

Create a Ridge regression object:


In [ ]:
RR = Ridge()
RR

Create a ridge grid search object:


In [ ]:
Grid1 = GridSearchCV(RR, parameters1, cv=4)


Fit the model:


In [ ]:
Grid1.fit(x_train[['Units Pkg', 'Avg Price Pkg', 'Group', 'Date']], y_train)

The object finds the best parameter values on the validation data. We can obtain the estimator with the best parameters and assign it to the variable BestRR as follows:


In [ ]:
BestRR = Grid1.best_estimator_
BestRR

In [ ]:
BestRR.score(x_test[['Units Pkg', 'Avg Price Pkg', 'Group', 'Date']], y_test)
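
The fitted search object exposes more than the best estimator. The following sketch on synthetic data (hypothetical names `X_gs`, `y_gs` and `search`) reads the selected alpha, its mean cross-validated score and the scores of all candidates.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)
X_gs = rng.normal(size=(200, 4))
y_gs = X_gs @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

param_grid = {"alpha": [0.001, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(Ridge(), param_grid, cv=4)
search.fit(X_gs, y_gs)

print("best alpha:", search.best_params_["alpha"])
print("mean CV R^2 of the best alpha:", search.best_score_)
print("mean CV R^2 per alpha:", search.cv_results_["mean_test_score"])
```

With the default `refit=True`, `best_estimator_` has already been refitted on all the data passed to `fit`, so it can be used for prediction directly.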

The best alpha is equal to 100. Let's create a Ridge model with this value and fit it on the training data:


In [ ]:
RR = Ridge(alpha=100) 
RR.fit(x_train[['Units Pkg', 'Avg Price Pkg', 'Group', 'Date']], y_train)

We now test our model on the test data:


In [ ]:
y_grid_hat = RR.predict(x_test[['Units Pkg', 'Avg Price Pkg', 'Group', 'Date']])

print('predicted:', y_grid_hat[0:4])
print('test set :', y_test[0:4].values)

Visualize predicted using test data vs real test data:


In [ ]:
res = pd.DataFrame(y_test)
res['Yhat'] = y_grid_hat
# DataFrame.plot opens its own figure, so the size is passed here
res.sort_index().plot(figsize=(8, 6))

plt.title("Predicted data using testing data vs real value using testing data")
plt.xlabel("x_test")
plt.ylabel("Turnover (Y)")

Unfortunately, the model (RR) fits the data poorly.


### Thank you for completing this lab!

## Authors
<a href="https://author.skills.network/instructors/rosana_klym?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX044EN3173-2023-01-01">Rosana Klym</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX044EN3173-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX044EN3173-2023-01-01">Olga Kavun</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                  |
| ----------------- | ------- | ---------- | ----------------------------------- |
| 2023-05-03       | 2.0     | Rosana     | Finished                                                         |
| 2023-05-03       | 2.1     | Rosana     | Changed URL                                                      |
| 2023-05-06       | 2.2     | Rosana     | Changed title styles, added anchor links and added plot RR model |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
