### Required Assignment 8.3: Evaluating Multiple Models

**Estimated Time: 120 Minutes**

**Total points: 100**

This assignment focuses on solving a specific regression problem using basic cross-validation with a train/test/validation split.  In addition to using the methods explored, this assignment also aims to familiarize you with further utilities for data transformation, including the `OneHotEncoder` and `OrdinalEncoder` along with their use in a `make_column_transformer`.  

The operations of encoding categorical features will be introduced using `sklearn`.  This will allow you to streamline your model-building pipelines.  A string-typed feature is encoded differently depending on whether it is **ordinal** (its categories have a natural order) or **categorical** (they do not).  The `OrdinalEncoder` is used for ordinal features, which do not need to be binarized because of their underlying order, and the `OneHotEncoder` is used for categorical features (similar to the `pd.get_dummies()` function in pandas).  By the end of the assignment, you will see how to chain multiple feature encoding methods together, including the earlier `PolynomialFeatures` for numeric features.

<center>
    <img src = images/pipes.png width = 50% />
</center>

#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)
- [Problem 7](#Problem-7)
- [Problem 8](#Problem-8)
- [Problem 9](#Problem-9)
- [Problem 10](#Problem-10)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error 
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector

from sklearn import set_config

set_config(display="diagram") #setting this will display your pipelines as seen above

### The Data: Ames Housing

This dataset is a popular introductory dataset for teaching regression.  The task is to use specific features of houses to predict the price of the house.  In addition, as discussed in video 8.10, this dataset is used in an ongoing competition where you can use `test.csv` to submit your model's predictions.  Accordingly, the two data files have the same columns, except that `test.csv` does not contain the target feature.

The data contains 81 columns of different information on the individual houses and their sale price.  A full description of the data is attached [here](data/data_description.txt).  In this assignment, you will use a small subset of the features to begin modeling with that includes ordinal, categorical, and numeric features. As an optional exercise, you are encouraged to continue engineering additional features and attempt to improve the performance of your model, including submitting the predictions on Kaggle. 

In [2]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [4]:
train.head()

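| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |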


In [5]:
#note the difference in one column from train to test
[i for i in train.columns if i not in test.columns]

['SalePrice']

[Back to top](#Index:) 

### Problem 1

#### Train/Test split

**5 Points**

Despite having a test dataset, you want to create a holdout set to assess your models' performance.  To do so, use sklearn's `train_test_split` to split `X` and `y` with arguments:

- `test_size = 0.3`
- `random_state = 22`

Assign your results to `X_train, X_test, y_train, y_test`.
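As a rough illustration of what the two arguments do, the sketch below splits a made-up ten-row frame. The `toy` data and its column names are hypothetical and are not part of the Ames data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten rows, one feature, one target
toy = pd.DataFrame({"sqft": range(10), "price": range(100, 110)})
X_toy = toy.drop("price", axis=1)
y_toy = toy["price"]

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=22
)
# Repeating the call with the same random_state reproduces the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=22
)

print(Xa_train.shape, Xa_test.shape)         # (7, 1) (3, 1)
print(Xa_test.index.equals(Xb_test.index))   # True
```

Because the inputs are a DataFrame and a Series, the returned pieces keep those types and their original row labels.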


In [6]:
X = train.drop('SalePrice', axis = 1)
y = train['SalePrice']

In [7]:
### GRADED

X_train, X_test, y_train, y_test = '', '', '', ''
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = .3, random_state = 22)
# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(X_train.shape)
print(X_test.shape)
print(type(X_train), type(y_train))#should be DataFrame and Series

(1022, 80)
(438, 80)
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'>


[Back to top](#Index:) 

### Problem 2

#### Baseline Predictions

**10 Points**

Before building a regression model, you should set a baseline to compare your later models to.  One way to do this is to guess the mean of the `SalePrice` column.  For the variables `baseline_train` and `baseline_test`, create arrays of the same shape as `y_train` and `y_test` respectively.  Every entry of `baseline_train` should equal `y_train.mean()`, and every entry of `baseline_test` should equal `y_test.mean()`.


Use the `mean_squared_error` function to calculate the error between `baseline_train` and `y_train`. Assign the result to `mse_baseline_train`.

Use the `mean_squared_error` function to calculate the error between `baseline_test` and `y_test`. Assign the result to `mse_baseline_test`.
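A small sketch may help interpret the baseline number: when every prediction is the mean of the target, the mean squared error is the population variance of that target. The values in `y_toy` are hypothetical and chosen only to keep the arithmetic easy.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical target values, chosen only for illustration
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

# A constant prediction equal to the mean, repeated for every observation
baseline = np.full(y_toy.shape, y_toy.mean())

mse = mean_squared_error(y_toy, baseline)
print(mse)            # 1.25
print(np.var(y_toy))  # 1.25, the population variance (ddof=0)
```

Any later model should beat this value, since a constant guess uses no features at all.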


In [8]:
### GRADED

baseline_train = np.full_like(y_train, y_train.mean(), dtype=float)
baseline_test = np.full_like(y_test, y_test.mean(), dtype=float)
mse_baseline_train = mean_squared_error(baseline_train,y_train)
mse_baseline_test =  mean_squared_error(baseline_test,y_test)

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(baseline_train.shape, baseline_test.shape)
print(f'Baseline for training data: {mse_baseline_train}')
print(f'Baseline for testing data: {mse_baseline_test}')

(1022,) (438,)
Baseline for training data: 6277713446.182904
Baseline for testing data: 6374354899.510017


[Back to top](#Index:) 

### Problem 3

#### Examining the Correlations

**5 Points**

What feature has the highest positive correlation with `SalePrice`?  Assign your answer as a string matching the column name exactly to `highest_corr` below.  
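One possible way to approach this, sketched on a hypothetical toy frame rather than the Ames data: keep only the numeric columns, correlate them with the target, drop the target's correlation with itself, and take the largest remaining value.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "quality": [1, 2, 3, 4, 5],
    "age": [50, 40, 30, 20, 10],
    "zone": ["a", "b", "a", "b", "a"],   # non-numeric, excluded below
    "price": [10, 20, 30, 40, 50],
})

# Keep numeric columns, correlate each with the target, drop the target itself
corrs = (
    toy.select_dtypes(include="number")
       .corr()["price"]
       .drop("price")
       .sort_values(ascending=False)
)
top_feature = corrs.idxmax()
print(top_feature)  # quality
```

Note that `idxmax` picks the highest positive correlation; a strongly negative feature such as `age` here would need `corrs.abs()` to surface.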

In [9]:
### GRADED

highest_corr = ''
#correlation of every numeric column with SalePrice, sorted from highest to lowest
corr_with_price = train.corr(numeric_only=True)['SalePrice'].drop('SalePrice').sort_values(ascending = False)
print(corr_with_price)
highest_corr = corr_with_price.idxmax()
# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(highest_corr)

OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
ScreenPorch      0.111447
PoolArea         0.092404
MoSold           0.046432
3SsnPorch        0.044584
BsmtFinSF2      -0.011378
BsmtHalfBath    -0.016844
MiscVal         -0.021190
Id              -0.021917
LowQualFinSF    -0.025606
YrSold          -0.028923
OverallCond     -0.077856
MSSubClass      -0.084284
EnclosedPorch   -0.128578
KitchenAbvGr    -0.135907
Name: SalePrice, dtype: float64
Overal

[Back to top](#Index:) 

### Problem 4

#### Simple Model

**10 Points**

Complete the code below according to these instructions:

- Define a variable `X1` and assign to it the values in the `OverallQual` column of `X_train`, kept as a DataFrame (`X_train[['OverallQual']]`).
- Instantiate a `LinearRegression` model and use the `fit` function to train it using `X1` and `y_train`. Assign your result to `lr`.
- Use the `mean_squared_error` function to calculate the error between `y_train` and `lr.predict(X1)`. Assign the result to `model_1_train_mse`.
- Use the `mean_squared_error` function to calculate the error between `y_test` and `lr.predict(X_test[['OverallQual']])`. Assign the result to `model_1_test_mse`.

In [10]:
### GRADED

model_1_train_mse = ''
model_1_test_mse = ''
X1 = X_train[['OverallQual']]
lr =  LinearRegression().fit(X1,y_train)
model_1_train_mse = mean_squared_error(lr.predict(X1),y_train)
model_1_test_mse =  mean_squared_error(lr.predict(X_test[['OverallQual']]),y_test)
# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(f'Train MSE: {model_1_train_mse: .2f}')
print(f'Test MSE: {model_1_test_mse: .2f}')

Train MSE:  2269766380.95
Test MSE:  2578831820.35


[Back to top](#Index:) 

### Problem 5

#### Using `OneHotEncoder`

**10 Points**

Similar to the `pd.get_dummies()` function encountered earlier, scikit-learn has a utility for encoding categorical features in the same way.  Below, the `OneHotEncoder` is demonstrated on the `CentralAir` column.  You are to use these results to build a model where the only feature is the `CentralAir` column.  Note the two arguments used in the `OneHotEncoder`:

- `sparse_output = False`: returns a dense array that we can inspect; with `sparse_output = True` (the default) you are returned a sparse matrix, a memory-saving representation. This argument was named `sparse` in scikit-learn versions before 1.2.
- `drop = 'if_binary'`: returns a single column for any binary category.  This avoids redundant features in our regression model.

In the code cell below, instantiate a `LinearRegression` model and use the `fit` function to train it using `model_2_train` and `y_train`. Assign your result to `model_2`.
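To make the effect of `drop = 'if_binary'` concrete, here is a minimal sketch on hypothetical toy columns (`air` and `zone` are made-up names). It leaves the sparse setting at its default and calls `.toarray()` instead, so it does not depend on which name the installed scikit-learn version uses for that argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy columns: one binary, one with three categories
toy = pd.DataFrame({
    "air": ["Y", "Y", "N", "Y"],
    "zone": ["RL", "RM", "FV", "RL"],
})

ohe_toy = OneHotEncoder(drop="if_binary")
# The default output is a sparse matrix; .toarray() makes it viewable
encoded = ohe_toy.fit_transform(toy).toarray()

print(ohe_toy.categories_)  # categories found per column, sorted
print(encoded)
# [[1. 0. 1. 0.]
#  [1. 0. 0. 1.]
#  [0. 1. 0. 0.]
#  [1. 0. 1. 0.]]
```

The binary `air` column collapses to a single 0/1 column, while the three-level `zone` column keeps one column per category.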

In [11]:
#extract the features
central_air_train = X_train[['CentralAir']]
central_air_test = X_test[['CentralAir']]

In [12]:
#a categorical feature
central_air_train.head()

| | CentralAir |
|---|---|
| 1079 | Y |
| 601 | Y |
| 1015 | Y |
| 194 | Y |
| 1248 | N |


In [13]:
#Instantiate an OHE object
#sparse_output = False returns a dense array so we can view it
ohe = OneHotEncoder(sparse_output = False, drop='if_binary')
print(ohe.fit_transform(central_air_train)[:5])

[[1.]
 [1.]
 [1.]
 [1.]
 [0.]]


In [14]:
model_2_train = ohe.fit_transform(central_air_train)
model_2_test = ohe.transform(central_air_test)

In [15]:
### GRADED

model_2 = LinearRegression().fit(model_2_train,y_train)

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(model_2.coef_)

[84484.53030402]




To build a model using both the `OverallQual` column and the `CentralAir` column, you could use the `OneHotEncoder` to transform `CentralAir`, and then concatenate the results back into a DataFrame or numpy array.  To streamline this process, the `make_column_transformer` can be used to separate specific columns for certain transformations.  Below, a `make_column_transformer` has been created for you to do just this.  


The arguments are tuples of the form `(transformer, columns)` that specify a transformation to perform on the given columns.  Further, the `remainder = 'passthrough'` argument says to pass the other columns through unchanged.  You are returned a numpy array in which the binarized `CentralAir` column comes first, followed by the passed-through `OverallQual` feature.


For an example using the `make_column_transformer` see [here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py).
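The column order of the result can be surprising, so here is a small sketch on a hypothetical two-column frame (`qual` and `air` are made-up names). The `sparse_threshold=0` setting is an addition for this illustration only, used to guarantee a dense array.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with the numeric column listed first
toy = pd.DataFrame({
    "qual": [5, 6, 8],
    "air": ["Y", "N", "Y"],
})

ct = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), ["air"]),
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
out = ct.fit_transform(toy)
print(out)
# [[1. 5.]
#  [0. 6.]
#  [1. 8.]]
```

Even though `qual` comes first in the input, the explicitly transformed column is placed first and the remainder columns are appended after it.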


In [16]:
col_transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['CentralAir']), 
                                          remainder='passthrough',force_int_remainder_cols=False)

In [17]:
col_transformer.fit_transform(X_train[['OverallQual', 'CentralAir']])

array([[1., 5.],
       [1., 6.],
       [1., 8.],
       ...,
       [0., 5.],
       [1., 5.],
       [1., 9.]])

[Back to top](#Index:) 

### Problem 6

#### Using `make_column_transformer`

**10 Points**


Complete the code below according to these instructions:


- Use `Pipeline` to create a pipeline object. Inside the pipeline, define a tuple where the first element is the string identifier `col_transformer` and the second element is the `col_transformer` object defined above. Then define another tuple where the first element is the string identifier `linreg` and the second element is an instance of `LinearRegression`. Assign the pipeline object to the variable `pipe_1`.
- Use the `fit` function on `pipe_1` to train your model on `X_train[['OverallQual', 'CentralAir']]` and `y_train`.

In [18]:
### GRADED

pipe_1 = Pipeline([('col_transformer',col_transformer),('linreg',LinearRegression())])
pipe_1.fit(X_train[['OverallQual', 'CentralAir']],y_train)
# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(pipe_1.named_steps)#col_transformer and linreg should be keys
pipe_1

{'col_transformer': ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
                  transformers=[('onehotencoder',
                                 OneHotEncoder(drop='if_binary'),
                                 ['CentralAir'])]), 'linreg': LinearRegression()}




Not all columns warrant binarization as done on the `CentralAir` column.  For example, consider the `HeatingQC` feature -- representing the quality of the heating in the house.  From the data description the unique values are described as:

```
HeatingQC: Heating quality and condition

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
```

These are ordered values, and rather than binarizing them a numeric value representing the scale can be used.  For example, using a scale of 0 - 4 you may associate the categories with an order in a list from least to greatest as:

```
['Po', 'Fa', 'TA', 'Gd', 'Ex']
```

Creating an `OrdinalEncoder` with these categories will transform the `HeatingQC` feature mapping each category as

```
Po: 0
Fa: 1
TA: 2
Gd: 3
Ex: 4
```

This is demonstrated below, and in a similar manner the use of the `make_column_transformer` is shown using the three columns `['OverallQual', 'CentralAir', 'HeatingQC']`, applying the appropriate transformations to each column and passing the remaining numeric feature through.  
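The explicit category list matters because, by default, `OrdinalEncoder` sorts the categories it finds alphabetically. The sketch below contrasts the two behaviours on a hypothetical one-column frame that contains each quality level once.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column containing every quality level once
toy = pd.DataFrame({"HeatingQC": ["Po", "Fa", "TA", "Gd", "Ex"]})

# Default: categories are sorted alphabetically (Ex, Fa, Gd, Po, TA)
default_codes = OrdinalEncoder().fit_transform(toy).ravel()

# Explicit: categories listed from worst to best
ordered = OrdinalEncoder(categories=[["Po", "Fa", "TA", "Gd", "Ex"]])
ordered_codes = ordered.fit_transform(toy).ravel()

print(default_codes)  # [3. 1. 4. 2. 0.]
print(ordered_codes)  # [0. 1. 2. 3. 4.]
```

With the default, `Ex` would receive the lowest code and `TA` the highest, which does not reflect the quality scale.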

In [19]:
oe = OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']])

In [20]:
oe.fit_transform(X_train[['HeatingQC']])

array([[3.],
       [2.],
       [4.],
       ...,
       [2.],
       [3.],
       [4.]])

In [21]:
X_train['HeatingQC'].head()

1079    Gd
601     TA
1015    Ex
194     TA
1248    Fa
Name: HeatingQC, dtype: object

In [22]:
ordinal_ohe_transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['CentralAir']),
                                          (OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                          remainder='passthrough')

In [23]:
ordinal_ohe_transformer.fit_transform(X_train[['OverallQual', 'CentralAir', 'HeatingQC']])[:5]

array([[1., 3., 5.],
       [1., 2., 6.],
       [1., 4., 8.],
       [1., 2., 5.],
       [0., 1., 6.]])

In [24]:
X_train[['OverallQual', 'CentralAir', 'HeatingQC']].head()

| | OverallQual | CentralAir | HeatingQC |
|---|---|---|---|
| 1079 | 5 | Y | Gd |
| 601 | 6 | Y | TA |
| 1015 | 8 | Y | Ex |
| 194 | 5 | Y | TA |
| 1248 | 6 | N | Fa |


[Back to top](#Index:) 

### Problem 7

#### Using `OrdinalEncoder`

**10 Points**


Complete the code below according to these instructions:


- Use `Pipeline` to create a pipeline object. Inside the pipeline, define a tuple where the first element is the string identifier `transformer` and the second element is the `ordinal_ohe_transformer` object defined above. Then define another tuple where the first element is the string identifier `linreg` and the second element is an instance of `LinearRegression`. Assign the pipeline object to the variable `pipe_2`.
- Use the `fit` function on `pipe_2` to train your model on `X_train[['OverallQual', 'CentralAir', 'HeatingQC']]` and `y_train`.
- Use the `predict` function on `pipe_2` to make your predictions for `X_train[['OverallQual', 'CentralAir', 'HeatingQC']]`. Assign the result to `pred_train`.
- Use the `predict` function on `pipe_2` to make your predictions for `X_test[['OverallQual', 'CentralAir', 'HeatingQC']]`. Assign the result to `pred_test`.
- Use the `mean_squared_error` function to calculate the MSE between `y_train` and `pred_train`. Assign the result to `pipe_2_train_mse`.
- Use the `mean_squared_error` function to calculate the MSE between `y_test` and `pred_test`. Assign the result to `pipe_2_test_mse`.

In [25]:
### GRADED

pipe_2 = Pipeline([('transformer',ordinal_ohe_transformer),('linreg',LinearRegression())])
pipe_2.fit(X_train[['OverallQual', 'CentralAir', 'HeatingQC']],y_train)
pred_train = pipe_2.predict(X_train[['OverallQual', 'CentralAir', 'HeatingQC']])
pred_test = pipe_2.predict(X_test[['OverallQual', 'CentralAir', 'HeatingQC']])
pipe_2_train_mse = mean_squared_error(pred_train,y_train)
pipe_2_test_mse = mean_squared_error(pred_test,y_test)

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(pipe_2.named_steps)
print(f'Train MSE: {pipe_2_train_mse: .2f}')
print(f'Test MSE: {pipe_2_test_mse: .2f}')
pipe_2

{'transformer': ColumnTransformer(remainder='passthrough',
                  transformers=[('onehotencoder',
                                 OneHotEncoder(drop='if_binary'),
                                 ['CentralAir']),
                                ('ordinalencoder',
                                 OrdinalEncoder(categories=[['Po', 'Fa', 'TA',
                                                             'Gd', 'Ex']]),
                                 ['HeatingQC'])]), 'linreg': LinearRegression()}
Train MSE:  2211416025.54
Test MSE:  2597701543.40


The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



[Back to top](#Index:) 

### Problem 8

#### Including `PolynomialFeatures`

**10 Points**

Finally, the earlier transformation of continuous columns using the `PolynomialFeatures` with `degree = 2` can be implemented alongside the `OneHotEncoder` and `OrdinalEncoder`.  

The `make_column_transformer` is again used, and you are to create a `Pipeline` with steps `transformer` and `linreg`.  

The `Pipeline` is fit on the training data using features `['OverallQual', 'CentralAir', 'HeatingQC']`.  

- Use the `predict` function on `pipe_3` to predict the values of `X_train[['OverallQual', 'CentralAir', 'HeatingQC']]`. Assign your result to `quad_train_preds`.
- Use the `predict` function on `pipe_3` to predict the values of `X_test[['OverallQual', 'CentralAir', 'HeatingQC']]`. Assign your result to `quad_test_preds`.
- Use the `mean_squared_error` function to calculate the MSE between `y_train` and `quad_train_preds`. Assign the result to `quad_train_mse`.
- Use the `mean_squared_error` function to calculate the MSE between `y_test` and `quad_test_preds`. Assign the result to `quad_test_mse`.
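As a reminder of what the `PolynomialFeatures` step contributes, the following sketch applies it to hypothetical toy arrays. With a single input column and `include_bias = False`, degree 2 yields the column and its square; with two columns it also adds their product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical single numeric feature, e.g. a quality score
x = np.array([[2.0], [3.0]])

# degree=2 without the bias column yields [x, x^2]
quad = PolynomialFeatures(degree=2, include_bias=False)
single = quad.fit_transform(x)
print(single)
# [[2. 4.]
#  [3. 9.]]

# With two input columns a and b, the output is [a, b, a^2, a*b, b^2]
ab = np.array([[2.0, 3.0]])
pair = quad.fit_transform(ab)
print(pair)
# [[2. 3. 4. 6. 9.]]
```

In the transformer above, only `OverallQual` is routed to this step, so the model gains a single squared term.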

In [26]:
poly_ordinal_ohe = make_column_transformer((OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                           (OneHotEncoder(drop = 'if_binary'), ['CentralAir']),
                                           (PolynomialFeatures(include_bias = False, degree = 2), ['OverallQual']))
pipe_3 = Pipeline([('transformer', poly_ordinal_ohe), 
                  ('linreg', LinearRegression())])

In [27]:
pipe_3.fit(X_train[['OverallQual', 'CentralAir', 'HeatingQC']], y_train)

In [28]:
### GRADED

quad_train_mse = ''
quad_test_mse = ''
quad_train_preds = pipe_3.predict(X_train[['OverallQual', 'CentralAir', 'HeatingQC']])
quad_test_preds = pipe_3.predict(X_test[['OverallQual', 'CentralAir', 'HeatingQC']])
quad_train_mse =  mean_squared_error(quad_train_preds, y_train)
quad_test_mse = mean_squared_error(quad_test_preds, y_test)
# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(f'Train MSE: {quad_train_mse: .2f}')
print(f'Test MSE: {quad_test_mse: .2f}')

Train MSE:  1856951076.95
Test MSE:  2207864528.17


[Back to top](#Index:) 

### Problem 9

#### Including More Features

**20 Points**

Use the following features to build a new `make_column_transformer` and fit 5 different models of degree 1 - 5 using the `degree` argument in your `PolynomialFeatures` transformer.  Keep track of the subsequent train mean squared error and test set mean squared error with the lists `train_mses` and `test_mses` respectively.  

The `poly_ordinal_ohe` object contains the different transformers needed.  Note that rather than passing a list of columns to the `PolynomialFeatures` transformer, the `make_column_selector` function is used to select any numeric feature.  For more information on the `make_column_selector` see [here](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html).
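To see which columns such a selector would pick, you can call it directly on a DataFrame. The sketch below uses a hypothetical two-row frame that borrows three column names from the feature list.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame mixing string and numeric columns
toy = pd.DataFrame({
    "CentralAir": ["Y", "N"],
    "OverallQual": [5, 6],
    "GrLivArea": [990.0, 1375.0],
})

selector = make_column_selector(dtype_include=np.number)
# Calling the selector on a DataFrame returns the matching column names
numeric_cols = selector(toy)
print(numeric_cols)  # ['OverallQual', 'GrLivArea']
```

Inside a column transformer the same call happens at fit time, so integer and float columns are both sent to `PolynomialFeatures`, while string columns are left for the other transformers.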



In [29]:
features = ['CentralAir', 'HeatingQC', 'OverallQual', 'GrLivArea', 'KitchenQual', 'FullBath']

In [30]:
X_train[features].head()

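| | CentralAir | HeatingQC | OverallQual | GrLivArea | KitchenQual | FullBath |
|---|---|---|---|---|---|---|
| 1079 | Y | Gd | 5 | 990 | TA | 1 |
| 601 | Y | TA | 6 | 1375 | Gd | 1 |
| 1015 | Y | Ex | 8 | 1665 | Gd | 2 |
| 194 | Y | TA | 5 | 864 | TA | 1 |
| 1248 | N | Fa | 6 | 2058 | TA | 1 |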


In [31]:
quality_levels = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
#OrdinalEncoder needs one list of categories per encoded column
poly_ordinal_ohe = make_column_transformer((PolynomialFeatures(), make_column_selector(dtype_include=np.number)),
                                           (OrdinalEncoder(categories = [quality_levels, quality_levels]), ['HeatingQC', 'KitchenQual']),
                                           (OneHotEncoder(drop = 'if_binary', sparse_output = False), ['CentralAir']))
poly_ordinal_ohe

In [32]:
### GRADED

train_mses = []
test_mses = []
#for degree in 1 - 5
for i in range(1, 6):
    #create pipeline with PolynomialFeatures degree i
    #note: KitchenQual is not assigned to any transformer here, so the default remainder='drop' discards it
    poly_ordinal_ohe = make_column_transformer((PolynomialFeatures(degree = i), make_column_selector(dtype_include=np.number)),
                                               (OrdinalEncoder(categories = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]), ['HeatingQC']),
                                               (OneHotEncoder(drop = 'if_binary'), ['CentralAir']))

    pipe = Pipeline([('transform',poly_ordinal_ohe),('linreg', LinearRegression())])
    #fit on train
    pipe.fit(X_train[features],y_train)
    #predict on train and test, compute the mean squared errors,
    #and append them to train_mses and test_mses respectively
    train_mses.append(mean_squared_error(pipe.predict(X_train[features]),y_train))
    test_mses.append(mean_squared_error(pipe.predict(X_test[features]),y_test))

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(train_mses)
print(test_mses)
pipe


[1635858627.9553099, 1281188434.12774, 1175803979.7815003, 1201144713.4081798, 1160014390.6629136]
[2038392771.0031447, 1391506503.4380589, 6994945719.208903, 62824055788.33885, 282785674224.44305]


[Back to top](#Index:) 

### Problem 10

#### Optimal Model Complexity 

**10 Points**

Based on your model's mean squared error on the testing data in **Problem 9** above, what was the optimal complexity?  Assign your answer as an integer to `best_complexity` below.  Compute the **MEAN SQUARED ERROR** of this model and assign it to `best_mse` as a float. 
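One way to read the answer off the lists, sketched here with the values copied from the Problem 9 output: take the position of the smallest test error and add 1, since the degrees start at 1. The ratio of test to train error is an extra diagnostic added for illustration, not something the assignment asks for.

```python
import numpy as np

# Train and test MSEs for degrees 1 through 5, copied from the Problem 9 output
train_mses = [1635858627.9553099, 1281188434.12774, 1175803979.7815003,
              1201144713.4081798, 1160014390.6629136]
test_mses = [2038392771.0031447, 1391506503.4380589, 6994945719.208903,
             62824055788.33885, 282785674224.44305]

# argmin is zero-based, while the degrees start at 1
best_degree = int(np.argmin(test_mses)) + 1
lowest_test_mse = float(min(test_mses))

# A test/train ratio far above 1 signals overfitting
gap_ratios = [test / train for train, test in zip(train_mses, test_mses)]

print(best_degree, lowest_test_mse)
print([round(r, 2) for r in gap_ratios])
```

The training error stays roughly flat beyond degree 2 while the test error grows by orders of magnitude, which is the usual signature of an overly complex model.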

In [40]:
### GRADED

best_complexity = 2
best_mse  = min(test_mses)
print(best_mse)
# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(f'The best degree polynomial model is:  {best_complexity}')
print(f'The smallest mean squared error on the test data is : {best_mse: .2f}')



1391506503.4380589
The best degree polynomial model is:  2
The smallest mean squared error on the test data is :  1391506503.44


### Further Exploration

This activity was meant to introduce you to a more streamlined modeling process using the `sklearn` library.  While your models should be performing better than the baseline, it is likely that with a bit more feature engineering and cross-validation, you would be able to further improve the performance.  You are encouraged to explore further feature engineering and encoding, particularly with handling missing values.  

Additionally, other transformations on the data may be appropriate.  For example, if you look at the distribution of errors in your model, you will note that they are slightly skewed.  An assumption of a Linear Regression model is that these should be roughly normally distributed.  By building a model on the logarithm of the target column and evaluating the model on the logarithm of the testing data, you move closer to satisfying this assumption.  Note that the actual Kaggle exercise is judged on the **ROOT MEAN SQUARED ERROR** of the logarithm of the target feature.

If interested, scikit-learn also provides a meta-estimator, `TransformedTargetRegressor`, that will accomplish this transformation by wrapping a regressor or an entire pipeline. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html) for more information on this transformer.
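A minimal sketch of the idea, assuming synthetic data in which the target grows exponentially with a single feature (`X_toy` and `y_toy` are hypothetical and unrelated to the Ames data). The wrapper fits the regressor on `log(y)` and applies `exp` to the predictions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: the target grows exponentially with x,
# so log(y) is exactly linear in x
X_toy = np.linspace(0, 2, 50).reshape(-1, 1)
y_toy = np.exp(1.0 + 0.5 * X_toy.ravel())

# Fit on log(y); predictions are mapped back with exp automatically
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
log_model.fit(X_toy, y_toy)
preds = log_model.predict(X_toy)

# Root mean squared error on the log scale
rmse_log = np.sqrt(np.mean((np.log(preds) - np.log(y_toy)) ** 2))
print(rmse_log)
```

The `regressor` argument could also be one of the pipelines built above, so the encoders and the target transformation would live in a single object.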