# Train a `Ridge` Regression Model

You are developing a machine learning model to predict diamond prices. However, you notice that the dataset of diamond features has many correlated variables, so you are using a **Ridge** regression (linear model with L2 regularization) to manage the variance of the model.

In the attached workspace, you will read data from a file and split it into training and test sets. Then you will fit a **Ridge** model on the training set and evaluate its performance on the test set. Use the sklearn implementation; you may refer to its documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).

You'll need to specify this random state in your notebook: 
> random_state = 27

and this value of α, the regularization strength:
> alpha = 2
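
For reference, the objective that `Ridge` minimizes can be written as follows. The symbols are the ones used in the scikit-learn documentation rather than in this notebook: $X$ is the feature matrix, $y$ the target, $w$ the coefficient vector, and $\alpha$ the regularization strength given above.

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

A larger $\alpha$ shrinks the coefficients more strongly toward zero. With the default settings, the intercept is fitted separately and is not penalized.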

The following items will be graded:

| Name | Type | Description |
| ---- | ---- | ---- |
| `Xtr` | pandas data frame | Training data (features). |
| `Xts` | pandas data frame | Test data (features). |
| `ytr` | pandas series OR pandas data frame OR 1d numpy array | Training data (target). |
| `yts` | pandas series OR pandas data frame OR 1d numpy array | Test data (target). |
| `Xtr_std` | 2d numpy array | Training data (features) after standardizing. |
| `Xts_std` | 2d numpy array | Test data (features) after standardizing. |
| `yts_hat` | 1d numpy array | Model prediction for test data. |
| `mse` | float | MSE of model on test data. |

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler

In this notebook, we are interested in predicting the price of a diamond from its physical characteristics.

First, load in the data:

In [2]:
df = pd.read_csv('diamonds.csv')

Get a quick view of the columns in this data by running the cell below.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


This dataset contains information about 53,940 round-cut diamonds. It includes variables measuring various pieces of information about the diamonds. 

There are 3 variables with an ordered factor structure: `cut`, `color`, & `clarity`.

* `cut` can be, from worst to best: Fair, Good, Very Good, Premium, Ideal
* `color` can range from J (worst) to D (best)
* `clarity` can be I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)

There are 6 variables that are of numeric structure: `carat`, `depth`, `table`, `x`, `y`, `z`. These relate to the physical size of the diamond.

* `carat` is the weight of the diamond
* `depth` is the total depth percentage, computed from `z` relative to the mean of `x` and `y`
* `table` gives the width of the top of the diamond relative to its widest point
* `x`, `y`, and `z` are the length, width, and depth in mm, respectively.


Finally, the `price` variable (which will be our target variable in this analysis) is also numeric, and gives the price of the diamond in US dollars.


The following cell maps the three categorical variables to numeric equivalents according to their order, so that we can use them in our regression.

In [4]:
df["cut"] = df["cut"].map({"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5})
df["color"] = df["color"].map({"J": 1, "I": 2, "H": 3, "G": 4, "F": 5, "E": 6, "D": 7})
df["clarity"] = df["clarity"].map({"I1": 1, "SI2": 2, "SI1": 3, "VS2": 4, "VS1": 5, "VVS2": 6, "VVS1": 7, "IF": 8})
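
One thing worth checking after a dictionary-based `map` is that every label was actually covered, because `Series.map` returns `NaN` for any value missing from the dictionary. The sketch below uses a small hypothetical series (not values read from `diamonds.csv`) with one deliberately misspelled label to show the effect.

```python
import pandas as pd

# Hypothetical toy labels; "ideal" (lowercase) is intentionally not in the mapping
cuts = pd.Series(["Ideal", "Good", "ideal", "Fair"])
cut_order = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

mapped = cuts.map(cut_order)

# Labels that are absent from the dictionary are mapped to NaN
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print("unmapped labels:", n_unmapped)
```

On the notebook's data frame, the same idea would be `df[["cut", "color", "clarity"]].isna().sum()`, which should report zero for all three columns if the mappings are complete.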

For convenience, the cell below stores the names of the columns we want to use as features in a list called `x_names`, and the name of the column we want to use as the target in a list called `y_names`.

In [5]:
x_names = ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'x', 'y', 'z']
y_names = ['price']

Using `df`, `x_names`, and `y_names`, split the data into training and test sets using `sklearn`'s implementation of `train_test_split`. 

* Reserve 30% of the data for testing, and leave 70% for training.
* Shuffle the data, and use the random state specified in the PrairieLearn question page.

In [6]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
random_state = 27
# Split features and target in a single call so that the rows stay aligned
Xtr, Xts, ytr, yts = train_test_split(
    df[x_names], df[y_names], test_size=0.3, random_state=random_state, shuffle=True
)
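
Whether features and target are split in one call or in two calls that share the same `random_state`, the selected rows should line up. The following sketch checks this on a hypothetical toy frame (the column names `a`, `b`, and `target` are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    "a": np.arange(10),
    "b": np.arange(10) * 2.0,
    "target": np.arange(10) * 3.0,
})

# One call: features and target share the same shuffle
Xa, Xb, ya, yb = train_test_split(
    toy[["a", "b"]], toy[["target"]], test_size=0.3, random_state=27, shuffle=True
)

# A separate call with the same random_state selects the same rows
Xa2, Xb2 = train_test_split(toy[["a", "b"]], test_size=0.3, random_state=27, shuffle=True)

same_rows = Xa.index.equals(ya.index) and Xa.index.equals(Xa2.index)
print(len(Xa), len(Xb), same_rows)
```

Comparing the index of the feature and target splits in this way is a cheap guard against silently pairing features with the wrong prices.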

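When using a regularized model, standardize the features first (remove the mean and scale to unit variance), because the L2 penalty depends on the scale of each coefficient and would otherwise treat features measured in different units unequally. Use the `sklearn` implementation, `StandardScaler`: fit it on the training features only, then apply the same transformation to the test features. Save the results in `Xtr_std` and `Xts_std`.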

In [7]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
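# Learn the mean and standard deviation from the training features only
scaler = StandardScaler().fit(Xtr)
# Apply the same transformation to both sets
Xtr_std = scaler.transform(Xtr)
Xts_std = scaler.transform(Xts)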
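
To see what the scaler does, the sketch below standardizes synthetic data (hypothetical values generated with NumPy, not the diamonds data). The training features end up with mean 0 and standard deviation 1 per column, while the test features are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical data)
X_train = rng.normal(loc=[0.8, 60.0], scale=[0.5, 2.0], size=(200, 2))
X_test = rng.normal(loc=[0.8, 60.0], scale=[0.5, 2.0], size=(50, 2))

# Statistics are estimated from the training data only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print("train mean/std:", X_train_std.mean(axis=0), X_train_std.std(axis=0))
print("test mean/std: ", X_test_std.mean(axis=0), X_test_std.std(axis=0))
```

Fitting the scaler on the test set as well would leak information about the test data into the preprocessing step.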

Now we are ready to fit the `Ridge` model. 

* Specify the `random_state` as indicated in the PrairieLearn question page
* Specify the `alpha` as indicated in the PrairieLearn question page
* Leave other settings unspecified, so that the default values will be used

and fit the model on the training data. Then, use it to make predictions for the test samples, and save this prediction in `yts_hat`. Evaluate the MSE of the model on the test data, and save this in `mse`.

In [8]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)
random_state = 27
alpha = 2
model = Ridge(alpha=alpha, random_state=random_state)
# Flatten the target so that predict() returns a 1d array, as required for yts_hat
model.fit(Xtr_std, np.ravel(ytr))
yts_hat = model.predict(Xts_std)
mse = mean_squared_error(yts, yts_hat)
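
As a sanity check on what `Ridge` computes, the sketch below compares the fitted coefficients with the closed-form ridge solution on synthetic standardized data. The data, the true coefficients, and the variable names `w` and `b` are assumptions made for illustration. It relies on the default `fit_intercept=True`, under which the intercept is not penalized.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(27)

# Synthetic standardized features and a noisy linear target (hypothetical data)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([3.0, -2.0, 0.5]) + 10.0 + rng.normal(scale=0.1, size=100)

alpha = 2
model = Ridge(alpha=alpha).fit(X, y)

# Closed form on centered data: w = (X^T X + alpha * I)^(-1) X^T (y - mean(y))
w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ (y - y.mean()))
# With zero-mean features, the unpenalized intercept is the mean of y
b = y.mean()

print(np.allclose(w, model.coef_), np.isclose(b, model.intercept_))
```

For interpreting the result of the notebook itself, `np.sqrt(mse)` expresses the test error in the same units as `price` (US dollars).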