# Feature Engineering of the Housing Dataset.

Feature engineering efforts mainly have two goals:
- Preparing the proper input dataset, compatible with the machine learning algorithm requirements.
- Improving the performance of machine learning models.

In this notebook, I try to improve on an OLS model through feature engineering. The models compared are:
- **Model 1:** Linear Regression with original features
- **Model 2:** Elastic Net with the original features, which regularizes the OLS regression with a combined L1-L2 penalty term. Note: this type of regularization becomes quite powerful when there are many regressors.
- **Model 3:** Elastic Net with additional engineered features, to further improve on Model 2
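
For reference, the objective that scikit-learn's `ElasticNet` minimizes, as stated in its documentation, can be written as below. Here $n$ is the number of samples, $\alpha$ is the `alpha` parameter (overall penalty strength) and $\rho$ is `l1_ratio` (the share of the penalty given to the L1 term); the symbols $w$, $X$ and $y$ are notation introduced here, not used elsewhere in this notebook.

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty reduces to the Lasso (pure L1) penalty, and with $\rho = 0$ to a Ridge-style (pure L2) penalty.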

Goal:
- (a) Add at least 10 more regressors by applying non-linear transformations to the features in the dataset. Make pipelines of transformers.
- (b) Fit the Elastic Net model (**Model 3**) and see if we have improvement on the MSE obtained by OLS (**Model 1**).
    - Run a 10-fold cross-validation on the training set to find the MSE distribution of the new Elastic Net (**Model 3**).
    - Compare it to the MSE distribution of the OLS model (**Model 1**).
    - Is there any improvement in both bias and variance?
    - For each set of features added, find the optimal Elastic Net model using a grid search of the parameter space. Does the best model remain the best in terms of MSE on the test set? Explain.

### Setup & Import Data

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
# Build the path from components so it works on any OS (no backslash escapes)
HOUSING_PATH = os.path.join("Data_Science_Applications", "Housing_Feature_Engineering_DS6", "datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
        
    tgz_path = os.path.join(housing_path, "housing.tgz")
    if not os.path.isfile(tgz_path): #download data if not already there
        urllib.request.urlretrieve(housing_url, tgz_path)
        
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

fetch_housing_data()

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing



Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


# Train & Test Split:

* Training X: ``housing_train_X``
* Training y: ``housing_labels_train_y``
* Testing X: ``housing_test_X``
* Testing y: ``housing_labels_test_y``
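
As a rough illustration of why the split below is stratified on an income category, the following sketch applies the same binning rule to synthetic data and checks that the category shares in the test fold match the overall shares. The `demo` frame and its lognormal incomes are made up for illustration and are not the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing frame (values are made up for illustration).
rng = np.random.RandomState(42)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})

# Same binning rule as the notebook: divide by 1.5, round up, cap at 5.
demo["income_cat"] = np.ceil(demo["median_income"] / 1.5).clip(upper=5.0)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(demo, demo["income_cat"]))

# Category shares in the full data vs. the test fold should be nearly identical.
overall = demo["income_cat"].value_counts(normalize=True).sort_index()
in_test = demo.iloc[test_idx]["income_cat"].value_counts(normalize=True).sort_index()
max_gap = (overall - in_test).abs().max()

print(pd.DataFrame({"overall": overall, "test": in_test}))
print("largest share difference:", max_gap)
```

A purely random split would usually show larger gaps between the two columns, especially for the rarer income categories.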

In [2]:
from sklearn.model_selection import StratifiedShuffleSplit
# to make this notebook's output identical at every run
np.random.seed(42)

# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Label those above 5 as 5
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
    
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
strat_train_set.shape

(16512, 10)

In [3]:
housing_train_X = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels_train_y = strat_train_set["median_house_value"].copy()

housing_test_X = strat_test_set.drop("median_house_value", axis=1)
housing_labels_test_y = strat_test_set["median_house_value"].copy()

##### Split the training set into numerical and categorical parts:

In [4]:
# split the training set into numerical and Categorical parts
housing_num = housing_train_X.drop("ocean_proximity", axis=1)
housing_cat = housing_train_X["ocean_proximity"]

# dealing with missing numerical value
from sklearn.impute import SimpleImputer

# dealing with categorical value
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder


array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])

Create a class to select numerical or categorical columns, since the Scikit-Learn pipeline steps below work on NumPy arrays rather than DataFrames.
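
As an aside, scikit-learn 0.20 and later ship `ColumnTransformer`, which can route DataFrame columns by name without a hand-written selector. The sketch below is an alternative to the approach used in this notebook, not the notebook's own method; the `toy` frame and the step names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with one missing numeric value and one categorical column.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})

numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# ColumnTransformer selects columns by name, so no selector class is needed.
prep = ColumnTransformer([
    ("num", numeric_steps, ["total_rooms", "total_bedrooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

toy_prepared = prep.fit_transform(toy)
print(toy_prepared.shape)  # two scaled numeric columns plus two one-hot columns
```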

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

Let's create our own transformer that adds combined attributes to the numerical variables.
- `BaseEstimator` is the base class we always inherit from; it provides `get_params()` and `set_params()`.
- `TransformerMixin` adds the `fit_transform()` method once the `fit()` and `transform()` methods are implemented.
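
To make the role of the two base classes concrete, here is a minimal sketch of a custom transformer. `LogPlusOne` and `X_demo` are hypothetical names used only for this illustration; the class defines just `fit()` and `transform()` and still gets `fit_transform()` and `get_params()` through inheritance.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogPlusOne(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends log(1 + x) of one column."""

    def __init__(self, column_ix=0):
        self.column_ix = column_ix

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.c_[X, np.log1p(X[:, self.column_ix])]

X_demo = np.array([[0.0, 10.0], [1.0, 20.0], [3.0, 30.0]])
adder = LogPlusOne(column_ix=0)

out = adder.fit_transform(X_demo)  # provided by TransformerMixin
params = adder.get_params()        # provided by BaseEstimator
print(out.shape, params)
```

Keeping every constructor argument as a plain attribute of the same name is what lets `get_params()` work, which in turn is what grid search relies on to tune transformer options.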

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
# column indices
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing else to do
        
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

# combine in a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()), # a transformer which scales the variables
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

Then, dealing with categorical ones:

In [None]:
# LabelBinarizer gives the same result as OneHotEncoder, but its fit_transform() returns a dense array,
# i.e. the equivalent of fit_transform().toarray() of OneHotEncoder.
# Unfortunately LabelBinarizer isn't pipeline friendly, so we'll have to extend it as below:
from sklearn.preprocessing import LabelBinarizer 
class PipelineFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(PipelineFriendlyLabelBinarizer, self).fit_transform(X)

# dealing with categorical 
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat_encoded = encoder.fit_transform(housing_cat)
# already have housing_cat_encoded
ppflb = PipelineFriendlyLabelBinarizer()
housing_cat_1hot_lb = ppflb.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot_lb
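
For comparison, the sketch below one-hot encodes string categories directly with `OneHotEncoder`, which accepts string columns from scikit-learn 0.20 onward. It is an alternative to the `LabelBinarizer` subclass above, shown on a made-up sample (`cat_demo`) rather than the real column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up sample of the categorical column.
cat_demo = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "NEAR OCEAN"]
})

encoder_demo = OneHotEncoder()  # returns a SciPy sparse matrix by default
one_hot = encoder_demo.fit_transform(cat_demo).toarray()

print(encoder_demo.categories_[0])  # column order of the one-hot matrix
print(one_hot)
```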

##### Full Pipeline for preparation:

In [5]:
# let's now combine the numerical and categorical pipelines
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', PipelineFriendlyLabelBinarizer()),
    ])

# and concatenate them with FeatureUnion class
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

# this is the final transformation result!
housing_train_X_prepared = full_pipeline.fit_transform(housing_train_X)
housing_test_X_prepared = full_pipeline.transform(housing_test_X)
housing_train_X_prepared

array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

##### Ready for Regression:

In [6]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_train_X_prepared, housing_labels_train_y)

print("Predictions on testing set:", lin_reg.predict(housing_test_X_prepared))
print('Actual Values of testing set: \n',housing_labels_test_y)

Predictions on testing set: [424327.91587129 264520.09425443 228109.45155968 ... 290423.67564301
 192142.16080923 151202.36199678]
Actual Values of testing set: 
 5241     500001.0
10970    240300.0
20351    218200.0
6568     182100.0
13285    121300.0
           ...   
20519     76400.0
17430    134000.0
4019     311700.0
12107    133500.0
2398      78600.0
Name: median_house_value, Length: 4128, dtype: float64


#### Full Pipeline for Preparation & Prediction:
##### Predictor: Linear Regression:

In [7]:
full_pipeline_with_predictor_linreg = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression())
    ])

full_pipeline_with_predictor_linreg.fit(housing_train_X, housing_labels_train_y)

prediction_on_train_linreg = full_pipeline_with_predictor_linreg.predict(housing_train_X)
prediction_on_test_linreg = full_pipeline_with_predictor_linreg.predict(housing_test_X)

##### Predictor: Elastic Net (default l1_ratio)

In [8]:
from sklearn.linear_model import ElasticNet

full_pipeline_with_predictor_enet = Pipeline([
        ("preparation", full_pipeline),
        ("linear", ElasticNet(random_state=0))
    ])

full_pipeline_with_predictor_enet.fit(housing_train_X, housing_labels_train_y)

prediction_on_train_enet = full_pipeline_with_predictor_enet.predict(housing_train_X)
prediction_on_test_enet = full_pipeline_with_predictor_enet.predict(housing_test_X)

#### Looking at Distribution of MSE from each model by 10-fold CV:

In [9]:
# by CV, see distribution of MSE
from sklearn.model_selection import cross_val_score

def create_full_pipeline_with_predictor(predictor):
    pipe = Pipeline([
        ("preparation", full_pipeline),
        ("linear", predictor)
    ])
    return pipe

def run_describe_predictors(predictors):
    mse_distribution = {}
    scores = {}
    for name, predictor in predictors.items():
        print(name)
        pipe = create_full_pipeline_with_predictor(predictor)
        scores[name] = cross_val_score(
            pipe, housing_train_X, housing_labels_train_y, 
            scoring="neg_mean_squared_error",
            cv=10)

        mse_distribution[name] = pd.Series(-scores[name])
        print(mse_distribution[name].describe())
        print('')
    return scores, mse_distribution

predictors = {'Linear Regression':LinearRegression(), 'Elastic Net': ElasticNet(random_state=0)}

scores, mse_distribution = run_describe_predictors(predictors)

Linear Regression
count    1.000000e+01
mean     4.775625e+09
std      4.023328e+08
min      4.221053e+09
25%      4.507385e+09
50%      4.645265e+09
75%      5.038537e+09
max      5.585961e+09
dtype: float64

Elastic Net
count    1.000000e+01
mean     6.195185e+09
std      2.989564e+08
min      5.806348e+09
25%      6.022578e+09
50%      6.137844e+09
75%      6.318498e+09
max      6.703124e+09
dtype: float64
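
MSE values of this size are hard to read because they are in squared units of the target. As a small sketch, the mean CV MSE figures printed above can be converted back to the units of `median_house_value` by taking a square root. Note this is the root of the mean MSE, not the mean of per-fold RMSEs.

```python
import numpy as np

# Mean 10-fold CV MSE values copied from the output above.
mean_cv_mse = {
    "Linear Regression": 4.775625e+09,
    "Elastic Net": 6.195185e+09,
}

# The square root puts the error back in the units of median_house_value.
rmse = {name: float(np.sqrt(mse)) for name, mse in mean_cv_mse.items()}
for name, value in rmse.items():
    print(name, "RMSE of the mean CV MSE:", round(value))
```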



##### Grid Search for a better Elastic Net param:

In [10]:
# let's do a grid search for the best params
from sklearn.model_selection import GridSearchCV

predictor = ElasticNet(random_state=0)
param_grid = [
    # vary l1_ratio only (alpha and tol stay at their defaults)
    {
        'l1_ratio': [ 0.125, 0.25, 0.5, 0.75, 0.875],
    },
  ]
grid_search = GridSearchCV(predictor, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_train_X_prepared, housing_labels_train_y)

GridSearchCV(cv=5, error_score=nan,
             estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True,
                                  l1_ratio=0.5, max_iter=1000, normalize=False,
                                  positive=False, precompute=False,
                                  random_state=0, selection='cyclic',
                                  tol=0.0001, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'l1_ratio': [0.125, 0.25, 0.5, 0.75, 0.875]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_mean_squared_error', verbose=0)

In [11]:
# Our best Elastic Net:
grid_search.best_estimator_

ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.875,
           max_iter=1000, normalize=False, positive=False, precompute=False,
           random_state=0, selection='cyclic', tol=0.0001, warm_start=False)
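
The search above varies only `l1_ratio` and keeps `alpha` at its default of 1.0. The sketch below shows, on synthetic data from `make_regression`, how both parameters could be searched together and how `cv_results_` can be inspected. The grid values and the `demo_*` names are assumptions chosen for illustration, not settings used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the prepared housing matrix.
X_syn, y_syn = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

demo_grid = {
    "alpha": [0.01, 0.1, 1.0],       # overall penalty strength (assumed range)
    "l1_ratio": [0.25, 0.5, 0.875],  # L1 share of the penalty
}
demo_search = GridSearchCV(ElasticNet(random_state=0, max_iter=10000), demo_grid,
                           cv=5, scoring="neg_mean_squared_error")
demo_search.fit(X_syn, y_syn)

# One row per parameter combination, best (rank 1) first.
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_alpha", "param_l1_ratio", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(demo_search.best_params_)
print(results.head())
```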

##### Compare the Distributions of MSE from 
* OLS
* Elastic Net 0.5
* Elastic Net 0.875

In [12]:
predictors = {
    'Linear Regression':LinearRegression(), 
    'Elastic Net with l1_ratio 0.5': ElasticNet(random_state=0),
    'Elastic Net with l1_ratio 0.875': ElasticNet(random_state=0, l1_ratio=0.875)
}
_,_ = run_describe_predictors(predictors)

Linear Regression
count    1.000000e+01
mean     4.775625e+09
std      4.023328e+08
min      4.221053e+09
25%      4.507385e+09
50%      4.645265e+09
75%      5.038537e+09
max      5.585961e+09
dtype: float64

Elastic Net with l1_ratio 0.5
count    1.000000e+01
mean     6.195185e+09
std      2.989564e+08
min      5.806348e+09
25%      6.022578e+09
50%      6.137844e+09
75%      6.318498e+09
max      6.703124e+09
dtype: float64

Elastic Net with l1_ratio 0.875
count    1.000000e+01
mean     5.057377e+09
std      3.069818e+08
min      4.593228e+09
25%      4.886414e+09
50%      5.003651e+09
75%      5.286114e+09
max      5.527204e+09
dtype: float64



# Homework 6
## 3. Feature Engineering of the Housing Dataset.
In class we used elastic net to regularize the OLS regression with an L1-L2 penalty term.
This type of regularization becomes quite powerful when we have many regressors.
### (a) Add more regressors to the problem by applying non-linear transformations of your choice to the features in the dataset. Add at least 10 more regressors by modifying appropriately the pipeline of your code.


In [13]:
from sklearn.base import BaseEstimator, TransformerMixin

longitude_ix, latitude_ix, housing_median_age_ix, rooms_ix, bedrooms_ix, population_ix, household_ix, median_income_ix = 0, 1, 2, 3, 4, 5, 6, 7

class MyCombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self): 
        return None
    def fit(self, X, y=None):
        return self  
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]                      #47
        population_per_household = X[:, population_ix] / X[:, household_ix]            #67
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]                         #45
        bedrooms_per_household = X[:, bedrooms_ix] / X[:, household_ix] #1             #57
        longitude_latitude_ratio = X[:, longitude_ix] / X[:, latitude_ix] #2           #12
        population_per_room = X[:, population_ix] / X[:, rooms_ix] #3                  #46
        population_per_bedroom = X[:, population_ix] / X[:, bedrooms_ix] #4            #56
        housing_total_age = X[:, housing_median_age_ix] * X[:, household_ix] #5        #37
        housing_total_income = X[:, median_income_ix] * X[:, household_ix] #6          #87
        population_total_income = X[:, median_income_ix] * X[:, population_ix] #7      #86
        median_age_rooms = X[:, housing_median_age_ix] * X[:, rooms_ix] #8
        median_age_bedrooms = X[:, housing_median_age_ix] *  X[:, bedrooms_ix] #9
        median_income_bedrooms =  X[:, median_income_ix]*X[:, bedrooms_ix] #10
        
        # population_quadratic =  X[:, population_ix] *  X[:, population_ix] 
        # housing_median_age_quadratic = X[:, median_income_ix] * X[:, median_income_ix] 
        # bedroom_quadratic = X[:, bedrooms_ix] * X[:, bedrooms_ix]
        
        return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room, 
                     bedrooms_per_household, longitude_latitude_ratio, population_per_room, 
                     population_per_bedroom, housing_total_age, housing_total_income, 
                     population_total_income, 
                     median_age_rooms, median_age_bedrooms, median_income_bedrooms]
                     # population_quadratic, housing_median_age_quadratic, bedroom_quadratic]

In [14]:
# create new full pipeline
my_num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', MyCombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

my_full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", my_num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
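
The products above are written out by hand. As a sketch of another way to generate non-linear regressors, scikit-learn's `PolynomialFeatures` produces all squares and pairwise products automatically. It does not produce the ratio features used above, and it is not the approach taken in this notebook; `X_small` holds made-up values.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three made-up numeric columns standing in for a subset of the housing features.
X_small = np.array([
    [41.0, 880.0, 8.3252],
    [21.0, 7099.0, 8.3014],
    [52.0, 1467.0, 7.2574],
])

# degree=2 adds every square and pairwise product;
# include_bias=False skips the constant column.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_small)

print(X_small.shape, "->", X_poly.shape)
```

With many input columns the number of generated features grows quickly, which is one situation where a regularized model such as Elastic Net is typically preferred over plain OLS.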

### (b) Fit the elastic net model and see if you will improve on the MSE obtained in class.
### Run a ``10-fold cross validation`` on the training set to find the MSE distribution of your model, and compare it to the MSE distribution of the OLS model with the features used in class.

In [15]:
def my_create_full_pipeline_with_predictor(predictor):
    pipe = Pipeline([
        ("preparation", my_full_pipeline),
        ("linear", predictor)
    ])
    return pipe

def my_run_describe_predictors(predictors):
    mse_distribution = {}
    scores = {}
    for name, predictor in predictors.items():
        print(name)
        pipe = my_create_full_pipeline_with_predictor(predictor)
        scores[name] = cross_val_score(
            pipe, housing_train_X, housing_labels_train_y, 
            scoring="neg_mean_squared_error", # scikit-learn scorers follow a greater-is-better convention, so MSE is returned negated
            cv=10) #10-fold cross-validation

        mse_distribution[name] = pd.Series(-scores[name])
        print(mse_distribution[name].describe())
        print('')
    return scores, mse_distribution

my_predictors = {'Elastic Net': ElasticNet(random_state=0)}
my_scores, my_mse_distribution = my_run_describe_predictors(my_predictors)

Elastic Net
count    1.000000e+01
mean     5.766017e+09
std      3.689235e+08
min      5.398735e+09
25%      5.535117e+09
50%      5.642344e+09
75%      5.888159e+09
max      6.445011e+09
dtype: float64



In [16]:
inclass_predictors = {'Linear Regression':LinearRegression()}
_, inclass_mse_distribution = run_describe_predictors(inclass_predictors)

Linear Regression
count    1.000000e+01
mean     4.775625e+09
std      4.023328e+08
min      4.221053e+09
25%      4.507385e+09
50%      4.645265e+09
75%      5.038537e+09
max      5.585961e+09
dtype: float64



In [17]:
print('MSE ENet / MSE OLS =', my_mse_distribution['Elastic Net'].mean()/ inclass_mse_distribution['Linear Regression'].mean())
print('  Var ENet =', my_mse_distribution['Elastic Net'].std()**2)
print('  Var OLS =', inclass_mse_distribution['Linear Regression'].std() **2)
print('Var ENet / Var OLS =', my_mse_distribution['Elastic Net'].std()**2/inclass_mse_distribution['Linear Regression'].std() **2)

MSE ENet / MSE OLS = 1.2073849074589555
  Var ENet = 1.3610451747521131e+17
  Var OLS = 1.618717209364811e+17
Var ENet / Var OLS = 0.8408171401885516


##### Compare:

**How about bias?**

The bias of OLS is 0: under the Gauss-Markov assumptions, the OLS estimator is BLUE (the best linear unbiased estimator).

**Is there a biased estimator that is better than the OLS estimator in terms of MSE?**

My Elastic Net produces worse results than the in-class OLS in terms of mean MSE:
its MSE is about **20%** larger than the in-class OLS MSE.

(But my Elastic Net is better than the in-class Elastic Net, whose MSE is about 30% larger than that of OLS.)

However, it is **not fair** to compare two models by looking only at MSE, since
**MSE can be decomposed into two components: a bias contribution + a variance contribution.**
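
For reference, the standard decomposition of the expected squared prediction error at a point $x$ is sketched below. It assumes $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$; the symbols $f$, $\hat f$ and $\sigma^2$ are notation introduced here.

$$
\mathbb{E}\big[(y - \hat f(x))^2\big] \;=\; \big(\mathbb{E}[\hat f(x)] - f(x)\big)^2 \;+\; \operatorname{Var}\big[\hat f(x)\big] \;+\; \sigma^2
$$

The first term is the squared bias, the second is the variance of the fitted model, and the third is irreducible noise. The spread of the fold MSEs used in this notebook serves only as a rough proxy for the variance term.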

What we have now is:

var(in-class OLS above) = (4.023328e+08)^2 **>** var(my ElaNet) = (3.689235e+08)^2

My Elastic Net has a **lower variance** of the fold MSEs than the in-class OLS: the ratio Var ENet / Var OLS is about 0.84, i.e. the in-class OLS variance is about 19% larger than mine.


### Can you improve on both the bias and the variance? No. OLS is already unbiased, so only the variance can be improved:
**Above all:**
* MSE(ElaNet) > MSE(OLS)
* Bias(ElaNet) > Bias(OLS) = 0
* Var(ElaNet) < Var(OLS)

**Notice that the Elastic Net increases the MSE by adding some bias, but it also decreases the variance.**

As noted above, it is **not fair** to compare two models by looking only at MSE.

It would be bad, of course, if we increased the MSE and the variance at the same time (although increasing both bias and variance at once happens quite often).

**Therefore, my Elastic Net model gives a reasonable trade-off here: the mean MSE goes up by about 20%, while the variance of the fold MSEs goes down by about 16%.**

---------
To take a closer look at my Elastic Net's performance, I further optimize the model by tuning ``l1_ratio``, and compare the optimized Elastic Net with OLS on the **test set**.

### For each set of features you consider, find the optimal elastic net model using a grid search of the parameter space. Does your best model remain the best in terms of MSE of the test set? Explain.

In [18]:
# prepare training data with my newly created pipeline:
my_housing_train_X_prepared = my_full_pipeline.fit_transform(housing_train_X)

In [19]:
# Grid search for the best params
from sklearn.model_selection import GridSearchCV

target_predictor = ElasticNet(random_state=0)
my_param_grid = [
    # l1_ratio is the L1/L2 mixing parameter of the penalty (1.0 = pure L1, 0.0 = pure L2)
    {
        'l1_ratio': [ 0.125, 0.25, 0.5, 0.75, 0.875, 0.9, 0.95, 0.975],
    },
  ]
my_grid_search = GridSearchCV(target_predictor, my_param_grid, cv=10,
                           scoring='neg_mean_squared_error')
my_grid_search.fit(my_housing_train_X_prepared, housing_labels_train_y)
optimal_Elastic_Net = my_grid_search.best_estimator_
optimal_Elastic_Net

ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.975,
           max_iter=1000, normalize=False, positive=False, precompute=False,
           random_state=0, selection='cyclic', tol=0.0001, warm_start=False)

##### MSE on TESTING SET :
* OLS 
* my optimal Elastic Net

In [20]:
# Inclass OLS predictions on testing
prediction_on_test_linreg = full_pipeline_with_predictor_linreg.predict(housing_test_X)

# My optimal Elastic Net predictions on testing
my_optimal_full_pipeline_with_predictor_enet = Pipeline([
        ("preparation", my_full_pipeline),
        ("linear", optimal_Elastic_Net)
    ])
my_optimal_full_pipeline_with_predictor_enet.fit(housing_train_X, housing_labels_train_y)
prediction_on_test_enet_optimal = my_optimal_full_pipeline_with_predictor_enet.predict(housing_test_X)


# mean squared error
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(housing_labels_test_y, prediction_on_test_linreg)
enet_mse = mean_squared_error(housing_labels_test_y, prediction_on_test_enet_optimal)

print('MSE of inclass OLS:',lin_mse)
print('MSE of my Optimal Elastic Net:', enet_mse)
print('MSE ENet Optimal / MSE OLS:',enet_mse/lin_mse)

MSE of inclass OLS: 4477213162.344775
MSE of my Optimal Elastic Net: 4416153774.006561
MSE ENet Optimal / MSE OLS: 0.9863621886820693


### In terms of MSE on the TESTING SET, my optimal Elastic Net model improves on the in-class OLS (MSE ratio of about 0.986, i.e. roughly 1.4% lower).

In [21]:
l1_ratios = [ 0.125, 0.25, 0.5, 0.75, 0.875, 0.9, 0.95, 0.975]
pds = [ElasticNet(random_state=0, l1_ratio=l) for l in l1_ratios]
pls = [my_create_full_pipeline_with_predictor(pred) for pred in pds]

# use a separate loop variable so the list of ratios is not overwritten
for pipl, l1_ratio in zip(pls, l1_ratios):
    pipl.fit(housing_train_X, housing_labels_train_y)
    predictions_on_test = pipl.predict(housing_test_X)
    print('MSE on Test Set when L1_ratio='+str(l1_ratio)+':', mean_squared_error(housing_labels_test_y, predictions_on_test))
print('MSE on Test Set of inclass OLS:',lin_mse)

MSE on Test Set when L1_ratio=0.125: 6367458023.8448
MSE on Test Set when L1_ratio=0.25: 6129324734.859161
MSE on Test Set when L1_ratio=0.5: 5593427085.613493
MSE on Test Set when L1_ratio=0.75: 4970958349.3001795
MSE on Test Set when L1_ratio=0.875: 4641864935.072175
MSE on Test Set when L1_ratio=0.9: 4578922023.980293
MSE on Test Set when L1_ratio=0.95: 4463632316.049739
MSE on Test Set when L1_ratio=0.975: 4416153774.006561
MSE on Test Set of inclass OLS: 4477213162.344775
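
The printed values can be summarized with a few lines of arithmetic. The sketch below copies the test-set MSEs from the output above and reports which `l1_ratio` is best, which values beat the in-class OLS, and the relative gain of the best one; the variable names are introduced here for illustration.

```python
# Test-set MSE values copied from the output above.
test_mse_by_l1_ratio = {
    0.125: 6367458023.8448,
    0.25: 6129324734.859161,
    0.5: 5593427085.613493,
    0.75: 4970958349.3001795,
    0.875: 4641864935.072175,
    0.9: 4578922023.980293,
    0.95: 4463632316.049739,
    0.975: 4416153774.006561,
}
ols_test_mse = 4477213162.344775

best_l1_ratio = min(test_mse_by_l1_ratio, key=test_mse_by_l1_ratio.get)
beats_ols = [r for r, mse in test_mse_by_l1_ratio.items() if mse < ols_test_mse]
relative_gain = 1.0 - test_mse_by_l1_ratio[best_l1_ratio] / ols_test_mse

print("best l1_ratio on the test set:", best_l1_ratio)
print("l1_ratio values that beat OLS:", beats_ols)
print("relative MSE reduction vs OLS: {:.2%}".format(relative_gain))
```

The best value sits at the upper edge of the searched grid and the test MSE falls steadily as `l1_ratio` rises, so values closer to 1.0 may be worth adding to the grid.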
