**Chapter 2 – End-to-end Machine Learning project**

*Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts.*

*This notebook contains all the sample code and solutions to the exercices in chapter 2.*

# Setup

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
from sklearn.model_selection import train_test_split


# ------- Pipeline 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin # para Criar custom transformer
from sklearn.compose import ColumnTransformer # para unificar os pipelines



# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")


# to make this notebook's output identical at every run
np.random.seed(42)

# Get the data

In [2]:
import os
import tarfile
import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

In [3]:
fetch_housing_data()

In [4]:
import pandas as pd
pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 999)

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    print("\n Lendo arquivo do caminho:",csv_path)
    return pd.read_csv(csv_path)

## Info do dataset

In [5]:
df = load_housing_data()
df.head()


 Lendo arquivo do caminho: datasets/housing/housing.csv


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


# train test split
    - Lembre que o split permite multiplos dfs
    - Precisa ser stratificado?
    - Vai incluir validação? Calibração de probabilidades?

In [6]:
'''
OPCIONAL
Criando variável para extratificar o split
'''

df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                          labels=[1, 2, 3, 4, 5])

In [7]:
# padrão 20% para teste
df_train, df_test = train_test_split(df, test_size = 0.2, random_state=75)

print("df.shape:      ",df.shape)
print("df_train.shape:",df_train.shape)
print("df_test.shape: ",df_test.shape)

df.shape:       (20640, 11)
df_train.shape: (16512, 11)
df_test.shape:  (4128, 11)


In [8]:
'''
CODIGO EXLCUSIVO

Split Stratificado
'''

# padrão 20% para teste
df_train, df_test = train_test_split(df,
                                     test_size = 0.2,
                                     random_state=75,
                                     stratify = df["income_cat"])

print("df.shape:      ",df.shape)
print("df_train.shape:",df_train.shape)
print("df_test.shape: ",df_test.shape)

# Checando a estratificação
pd.DataFrame({"original" : df["income_cat"].value_counts(normalize=True),
              "train" : df_train["income_cat"].value_counts(normalize=True),
              "test" : df_test["income_cat"].value_counts(normalize=True)
             })

df.shape:       (20640, 11)
df_train.shape: (16512, 11)
df_test.shape:  (4128, 11)


Unnamed: 0,original,train,test
3,0.350581,0.350594,0.350533
2,0.318847,0.318859,0.318798
4,0.176308,0.176296,0.176357
5,0.114438,0.114402,0.114583
1,0.039826,0.03985,0.039729


In [9]:
# removendo a variavel criada para estratificação
for set_ in (df_train, df_test):
    set_ = set_.drop("income_cat", axis=1)

# Prepare the data for Machine Learning algorithms

In [10]:
X_train = df_train.drop("median_house_value", axis=1) # drop labels for training set
y_train = df_train["median_house_value"].copy()

### Numerical columns

In [11]:
# selecting numerical columns
cols_num = X_train.select_dtypes(include='number').columns # all numericals
list(cols_num)

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']

### categorical 

In [12]:
# selecting categorical columns
cols_cat = X_train.select_dtypes(include='object').columns
list(cols_cat)

['ocean_proximity']

# Criando uma função para adicionar variáveis

In [13]:
# column index
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

# Pipeline

https://stackoverflow.com/questions/54646709/sklearn-pipeline-get-feature-names-after-onehotencode-in-columntransformer

https://stackoverflow.com/questions/57528350/can-you-consistently-keep-track-of-column-labels-using-sklearns-transformer-api

In [14]:
X_train

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,income_cat
7758,-118.13,33.91,34.0,916.0,162.0,552.0,164.0,4.9107,<1H OCEAN,4
8846,-118.40,34.10,27.0,3979.0,510.0,1351.0,520.0,15.0001,<1H OCEAN,5
11828,-121.00,39.00,4.0,170.0,23.0,93.0,27.0,10.9891,INLAND,5
8656,-118.38,33.85,28.0,4430.0,928.0,2131.0,885.0,4.9384,<1H OCEAN,4
14631,-117.21,32.81,26.0,2496.0,407.0,1062.0,380.0,5.5413,NEAR OCEAN,4
...,...,...,...,...,...,...,...,...,...,...
1244,-121.99,39.15,17.0,6440.0,1204.0,3266.0,1142.0,2.7137,INLAND,2
9193,-119.53,37.34,26.0,4047.0,702.0,571.0,199.0,2.3482,INLAND,2
4273,-118.32,34.09,34.0,2473.0,1171.0,2655.0,1083.0,1.6331,<1H OCEAN,2
13858,-117.27,34.49,7.0,2344.0,351.0,846.0,314.0,4.7361,INLAND,4


In [66]:
num_attribs = cols_num
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([ 
        ('imputer', SimpleImputer(strategy="median"), cols_num),
#         ('attribs_adder', CombinedAttributesAdder(), cols_num),
        ('std_scaler', StandardScaler(), cols_num),
        ("cat", OneHotEncoder(), cat_attribs)],
        # MUITA Atenção aos argumentos do ColumnTransformer:
        remainder='drop',
        sparse_threshold=0.3,
        n_jobs=None,
        transformer_weights=None,
        verbose=False,
)

full_pipeline.fit(X_train)


# X_train_prepared = pd.DataFrame(full_pipeline.fit_transform(X_train))


ColumnTransformer(transformers=[('imputer', SimpleImputer(strategy='median'),
                                 Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'],
      dtype='object')),
                                ('std_scaler', StandardScaler(),
                                 Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'],
      dtype='object')),
                                ('cat', OneHotEncoder(), ['ocean_proximity'])])

In [67]:
full_pipeline.get_feature_names()

AttributeError: Transformer imputer (type SimpleImputer) does not provide get_feature_names.

# teste

In [302]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'brand': ['A', 'B', 'C', np.NaN],
                   'num1': [1, 1, np.NaN, 0],
                   'category': ['A', 'A', np.NaN, 'D'],
                    'target': [2, 4, 8, 10]})


df

Unnamed: 0,brand,num1,category,target
0,A,1.0,A,2
1,B,1.0,A,4
2,C,,,8
3,,0.0,D,10


In [313]:
# numeric_transformer
numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# categorical transformer
categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Preprocessor 
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Preprocessor & Regressor
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor',  LinearRegression())])

# Fit
clf.fit(df.drop('target', 1), df['target'])

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['num1']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehot',
                                                                   OneHotEncoder(hand

In [315]:
preprocessor.transform(df)

array([[ 0.57735027,  1.        ,  0.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ],
       [ 0.57735027,  0.        ,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ],
       [ 0.57735027,  0.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  1.        ],
       [-1.73205081,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ,  0.        ]])

# working

In [None]:

def get_feature_out(estimator, feature_in):
    if hasattr(estimator,'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}' \
                for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in


def get_ct_feature_names(ct):
    # handles all estimators, pipelines inside ColumnTransfomer
    # doesn't work when remainder =='passthrough'
    # which requires the input column names.
    output_features = []

    for name, estimator, features in ct.transformers_:
        if name!='remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator=='passthrough':
            print(ct._feature_names_in[features])
            output_features.extend(ct._feature_names_in[features])
                
    return output_features

In [299]:
train = pd.DataFrame({'age': [23,12, 12, np.nan],
                     'income': ['high','low','low','medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo' : [1,0,0,1],
                      'Gender': ['M','F', np.nan, 'F'],
                      'anohter_cat': ['aa','bb', "ccc", 'a'],
                      'y': [0,1,1,1]})
numeric_columns = ['age']
cat_columns     = ["anohter_cat",'Gender','income']



numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline     = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns),
('simple_transformer', MinMaxScaler(), ['sales']),
]

combined_pipe = ColumnTransformer(transformers, remainder='drop',)

transformed_data = combined_pipe.fit_transform(train.drop('y',1), train['y'])

In [300]:
train

Unnamed: 0,age,income,sales,foo,Gender,anohter_cat,y
0,23.0,high,10000,1,M,aa,0
1,12.0,low,100020,0,F,bb,1
2,12.0,low,110000,0,,ccc,1
3,,medium,100,1,F,a,1


In [301]:
pd.DataFrame(transformed_data, 
             columns=get_ct_feature_names(combined_pipe))

Unnamed: 0,age,anohter_cat_a,anohter_cat_aa,anohter_cat_bb,anohter_cat_ccc,Gender_F,Gender_M,income_high,income_low,income_medium,sales
0,1.732051,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.090082
1,-0.57735,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.90919
2,-0.57735,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0
3,-0.57735,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [None]:
aa

In [283]:
pd.DataFrame(transformed_data, 
             columns=get_ct_feature_names(combined_pipe))

Unnamed: 0,age,Gender_F,Gender_M,income_high,income_low,income_medium,sales,foo
0,1.732051,0.0,1.0,1.0,0.0,0.0,0.090082,1.0
1,-0.57735,1.0,0.0,0.0,1.0,0.0,0.90919,0.0
2,-0.57735,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,-0.57735,1.0,0.0,0.0,0.0,1.0,0.0,1.0


In [None]:
aa

In [225]:
df.head()

Unnamed: 0,brand,num1,category,target
0,A,1.0,A,2
1,B,1.0,A,4
2,C,,,8
3,,0.0,D,10


In [226]:
n_steps_cat = len(categorical_transformer.steps)
n_steps_num = len(numeric_transformer.steps)
n_steps_tot = n_steps_cat + n_steps_num 

print("cat:", n_steps_cat, "num:", n_steps_num, "tot:", n_steps_tot)

cat: 2 num: 2 tot: 4


In [227]:
numeric_transformer.steps[0][0]

'imputer'

In [230]:
clf['preprocessor'].transformers_[1][1]

Pipeline(steps=[('imputer',
                 SimpleImputer(fill_value='missing', strategy='constant')),
                ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [228]:
clf['preprocessor'].transformers_[1][1]['onehot']\
                   .get_feature_names(categorical_features)

array(['brand_A', 'brand_B', 'brand_C', 'brand_missing', 'category_A',
       'category_D', 'category_missing'], dtype=object)

# getting the names

In [195]:
clf.named_steps['preprocessor'].transformers_[0][1]

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [182]:
clf.named_steps['preprocessor'].transformers_[0][1]

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [162]:
# buscando a transofmração de categoricas
clf.named_steps['preprocessor'].transformers_[0][1].named_steps['onehot'].get_feature_names(categorical_features)

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [85]:
aux

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.57735,1.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.57735,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,0.57735,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,-1.732051,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [84]:
aux.columns = ['brand_A', 'brand_B', 'brand_C', 'brand_missing', 'category_A',
       'category_D', 'category_missing']

ValueError: Length mismatch: Expected axis has 8 elements, new values have 7 elements

In [None]:
aux.columns = []

In [61]:
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), ,
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])


num_attribs = cols_num
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([ 
#         ("num", num_pipeline, cols_num),
        ("cat", OneHotEncoder(), cat_attribs)],
        # MUITA Atenção aos argumentos do ColumnTransformer:
        remainder='drop',
        sparse_threshold=0.3,
        n_jobs=None,
        transformer_weights=None,
        verbose=False,
)

full_pipeline.fit(X_train)


# X_train_prepared = pd.DataFrame(full_pipeline.fit_transform(X_train))


ColumnTransformer(transformers=[('cat', OneHotEncoder(), ['ocean_proximity'])])

In [62]:
full_pipeline.get_feature_names()

['cat__x0_<1H OCEAN',
 'cat__x0_INLAND',
 'cat__x0_ISLAND',
 'cat__x0_NEAR BAY',
 'cat__x0_NEAR OCEAN']

In [47]:
full_pipeline.named_transformers_['num'].get_feature_names()

AttributeError: 'Pipeline' object has no attribute 'get_feature_names'

In [44]:
full_pipeline.named_transformers_['num'].steps[1][1].get_feature_names(categorical_features)

AttributeError: 'CombinedAttributesAdder' object has no attribute 'get_feature_names'

In [42]:
full_pipeline.named_transformers_['num'].steps[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)


AttributeError: 'CombinedAttributesAdder' object has no attribute 'named_steps'

In [23]:
full_pipeline.get_feature_names()

TypeError: get_feature_names() missing 1 required positional argument: 'self'

In [16]:
X_train_prepared

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.719297,-0.808004,0.428863,-0.799577,-0.900436,-0.769940,-0.880678,0.550573,0.061484,0.029981,-0.554024,1.0,0.0,0.0,0.0,0.0
1,0.584496,-0.718942,-0.128419,0.625090,-0.064832,-0.068210,0.051887,5.878027,0.863266,-0.043355,-1.277473,1.0,0.0,0.0,0.0,0.0
2,-0.713588,1.577942,-1.959490,-1.146558,-1.234197,-1.173061,-1.239559,3.760119,0.337310,0.037488,-1.171649,0.0,1.0,0.0,0.0,0.0
3,0.594481,-0.836130,-0.048808,0.834860,0.938853,0.616833,1.008029,0.565199,-0.163434,-0.061519,-0.069204,1.0,0.0,0.0,0.0,0.0
4,1.178619,-1.323631,-0.208031,-0.064685,-0.312151,-0.322027,-0.314852,0.883545,0.442889,-0.024570,-0.759023,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16507,-1.207859,1.648255,-0.924537,1.769755,1.601574,1.613658,1.681257,-0.609498,0.082382,-0.018347,-0.403925,0.0,1.0,0.0,0.0,0.0
16508,0.020329,0.799814,-0.208031,0.656719,0.396191,-0.753253,-0.788994,-0.802491,5.784696,-0.017444,-0.604461,0.0,1.0,0.0,0.0,0.0
16509,0.624437,-0.723629,0.428863,-0.075383,1.522335,1.077041,1.526703,-1.180081,-1.219584,-0.057354,3.854437,1.0,0.0,0.0,0.0,0.0
16510,1.148663,-0.536128,-1.720655,-0.135383,-0.446617,-0.511731,-0.487743,0.458380,0.790731,-0.034167,-0.956917,0.0,1.0,0.0,0.0,0.0


In [17]:
aa


NameError: name 'aa' is not defined

# Fine-tune your model - cross_val_acore

In [None]:
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import cross_val_score

In [None]:
def display_scores(scores, print_list_scores=False):
    if print_list_scores:
        print("Scores:", list(scores))
    
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

<br>
<br>
<br>

---

### Ensemble + Cross_val_score
<br>

In [None]:
forest_reg = RandomForestRegressor(n_estimators=10,
                                   random_state=42,
                                   verbose=0)

forest_reg.fit(X_train_prepared, y_train)

In [None]:
forest_scores = cross_val_score(forest_reg, 
                                X_train_prepared, 
                                y_train,
                                scoring="neg_mean_squared_error", 
                                cv=10)

forest_rmse_scores = np.sqrt(-forest_scores)

display_scores(forest_rmse_scores)

<br>
<br>
<br>

---

### GridSearch CV
<br>

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30],
     'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False],
    'n_estimators': [3, 10],
     'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 

grid_search = GridSearchCV(forest_reg,
                           param_grid,
                           cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True) #If ``False``, the ``cv_results_`` attribute will not include training scores.

grid_search.fit(X_train_prepared, y_train)

The best hyperparameter combination found:

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

<br>
<br>
<br>

---

### RandomizedSearchCV
        Para quando o espaço de hyperparamtros é maior

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)

rnd_search = RandomizedSearchCV(forest_reg,
                                param_distributions=param_distribs,
                                n_iter=10,
                                cv=5,
                                scoring='neg_mean_squared_error',
                                random_state=42)

rnd_search.fit(X_train_prepared, y_train)

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

In [None]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old solution
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

In [None]:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

In [None]:
final_rmse

We can compute a 95% confidence interval for the test RMSE:

In [None]:
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

We could compute the interval manually like this:

In [None]:
m = len(squared_errors)
mean = squared_errors.mean()
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)

Alternatively, we could use a z-scores rather than t-scores:

In [None]:
zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)

# Extra material

## A full pipeline with both preparation and prediction

In [None]:
full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression())
    ])

full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)

## Model persistence using joblib

In [None]:
my_model = full_pipeline_with_predictor

In [None]:
import joblib
joblib.dump(my_model, "my_model.pkl") # DIFF
#...
my_model_loaded = joblib.load("my_model.pkl") # DIFF

## Example SciPy distributions for `RandomizedSearchCV`

In [None]:
from scipy.stats import geom, expon
geom_distrib=geom(0.5).rvs(10000, random_state=42)
expon_distrib=expon(scale=1).rvs(10000, random_state=42)
plt.hist(geom_distrib, bins=50)
plt.show()
plt.hist(expon_distrib, bins=50)
plt.show()

# Exercise solutions

## 1.

Question: Try a Support Vector Machine regressor (`sklearn.svm.SVR`), with various hyperparameters such as `kernel="linear"` (with various values for the `C` hyperparameter) or `kernel="rbf"` (with various values for the `C` and `gamma` hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best `SVR` predictor perform?

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
        {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
        {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
         'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    ]

svm_reg = SVR()
grid_search = GridSearchCV(svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2)
grid_search.fit(housing_prepared, housing_labels)

The best model achieves the following score (evaluated using 5-fold cross validation):

In [None]:
negative_mse = grid_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse

That's much worse than the `RandomForestRegressor`. Let's check the best hyperparameters found:

In [None]:
grid_search.best_params_

The linear kernel seems better than the RBF kernel. Notice that the value of `C` is the maximum tested value. When this happens you definitely want to launch the grid search again with higher values for `C` (removing the smallest values), because it is likely that higher values of `C` will be better.

## 2.

Question: Try replacing `GridSearchCV` with `RandomizedSearchCV`.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, reciprocal

# see https://docs.scipy.org/doc/scipy/reference/stats.html
# for `expon()` and `reciprocal()` documentation and more probability distribution functions.

# Note: gamma is ignored when kernel is "linear"
param_distribs = {
        'kernel': ['linear', 'rbf'],
        'C': reciprocal(20, 200000),
        'gamma': expon(scale=1.0),
    }

svm_reg = SVR()
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                n_iter=50, cv=5, scoring='neg_mean_squared_error',
                                verbose=2, random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

The best model achieves the following score (evaluated using 5-fold cross validation):

In [None]:
negative_mse = rnd_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse

Now this is much closer to the performance of the `RandomForestRegressor` (but not quite there yet). Let's check the best hyperparameters found:

In [None]:
rnd_search.best_params_

This time the search found a good set of hyperparameters for the RBF kernel. Randomized search tends to find better hyperparameters than grid search in the same amount of time.

Let's look at the exponential distribution we used, with `scale=1.0`. Note that some samples are much larger or smaller than 1.0, but when you look at the log of the distribution, you can see that most values are actually concentrated roughly in the range of exp(-2) to exp(+2), which is about 0.1 to 7.4.

In [None]:
expon_distrib = expon(scale=1.)
samples = expon_distrib.rvs(10000, random_state=42)
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.title("Exponential distribution (scale=1.0)")
plt.hist(samples, bins=50)
plt.subplot(122)
plt.title("Log of this distribution")
plt.hist(np.log(samples), bins=50)
plt.show()

The distribution we used for `C` looks quite different: the scale of the samples is picked from a uniform distribution within a given range, which is why the right graph, which represents the log of the samples, looks roughly constant. This distribution is useful when you don't have a clue of what the target scale is:

In [None]:
reciprocal_distrib = reciprocal(20, 200000)
samples = reciprocal_distrib.rvs(10000, random_state=42)
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.title("Reciprocal distribution (scale=1.0)")
plt.hist(samples, bins=50)
plt.subplot(122)
plt.title("Log of this distribution")
plt.hist(np.log(samples), bins=50)
plt.show()

The reciprocal distribution is useful when you have no idea what the scale of the hyperparameter should be (indeed, as you can see on the figure on the right, all scales are equally likely, within the given range), whereas the exponential distribution is best when you know (more or less) what the scale of the hyperparameter should be.

## 3.

Question: Try adding a transformer in the preparation pipeline to select only the most important attributes.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_top_k(arr, k):
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self
    def transform(self, X):
        return X[:, self.feature_indices_]

Note: this feature selector assumes that you have already computed the feature importances somehow (for example using a `RandomForestRegressor`). You may be tempted to compute them directly in the `TopFeatureSelector`'s `fit()` method, however this would likely slow down grid/randomized search since the feature importances would have to be computed for every hyperparameter combination (unless you implement some sort of cache).

Let's define the number of top features we want to keep:

In [None]:
k = 5

Now let's look for the indices of the top k features:

In [None]:
top_k_feature_indices = indices_of_top_k(feature_importances, k)
top_k_feature_indices

In [None]:
np.array(attributes)[top_k_feature_indices]

Let's double check that these are indeed the top k features:

In [None]:
sorted(zip(feature_importances, attributes), reverse=True)[:k]

Looking good... Now let's create a new pipeline that runs the previously defined preparation pipeline, and adds top k feature selection:

In [None]:
preparation_and_feature_selection_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k))
])

In [None]:
housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)

Let's look at the features of the first 3 instances:

In [None]:
housing_prepared_top_k_features[0:3]

Now let's double check that these are indeed the top k features:

In [None]:
housing_prepared[0:3, top_k_feature_indices]

Works great!  :)

## 4.

Question: Try creating a single pipeline that does the full data preparation plus the final prediction.

In [None]:
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('svm_reg', SVR(**rnd_search.best_params_))
])

In [None]:
prepare_select_and_predict_pipeline.fit(housing, housing_labels)

Let's try the full pipeline on a few instances:

In [None]:
some_data = housing.iloc[:4]
some_labels = housing_labels.iloc[:4]

print("Predictions:\t", prepare_select_and_predict_pipeline.predict(some_data))
print("Labels:\t\t", list(some_labels))

Well, the full pipeline seems to work fine. Of course, the predictions are not fantastic: they would be better if we used the best `RandomForestRegressor` that we found earlier, rather than the best `SVR`.

## 5.

Question: Automatically explore some preparation options using `GridSearchCV`.

In [None]:
param_grid = [{
    'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(1, len(feature_importances) + 1))
}]

grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,
                                scoring='neg_mean_squared_error', verbose=2)
grid_search_prep.fit(housing, housing_labels)

In [None]:
grid_search_prep.best_params_

The best imputer strategy is `most_frequent` and apparently almost all features are useful (15 out of 16). The last one (`ISLAND`) seems to just add some noise.

Congratulations! You already know quite a lot about Machine Learning. :)