## 07. A Machine Learning Project

#### Resources

[Machine Learning in SKL & Tensorflow (pdf)](./docs/Hands.Machine.Learning.Scikit.Learn.Tensorflow.5225.pdf#page=58)<br/>
[Machine Learning in SKL & Tensorflow (Repo)](https://github.com/ageron/handson-ml)<br/>
[Machine Learning in SKL & Tensorflow (Notebook)](https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb)<br/>
[Matplotlib Colormaps](https://matplotlib.org/users/colormaps.html)

#### Modules

In [None]:
import os
import tarfile
import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pdpf
from six.moves import urllib
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder, LabelBinarizer, StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from pandas.plotting import scatter_matrix
%matplotlib inline

#### Getting Started

**Checklist**  

The basic steps you will go through when taking on an ML project are as follows:  
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

#### 1. Frame the Problem and Look at the Big Picture

The first question to ask is what exactly is the business objective; building a model is probably not the end goal. How does the company expect to use and benefit from this model? This is important
because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.

In this case we're going to build a model to predict a district’s median housing price. This will be **Pipelined** into another Machine Learning system, along with many other signals.
This downstream system will determine whether it is worth investing in a given area or not. Getting this right is critical, as it directly affects revenue.  

The next question to ask is what the current solution looks like (if any). It will often give you a reference performance, as well as insights on how to solve the problem.

Then, you need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques? Before you read on, pause and try to answer these questions for yourself.

In this case, we have a typical supervised learning task since we are given labeled training examples (each instance comes with the expected output, i.e., the district’s median
housing price). Moreover, it is also a typical regression task, since you are asked to predict a value. More specifically, this is a multivariate regression problem since the system will use multiple features to make a prediction (it will use the district’s population, the median income, etc.). Previously, you predicted life satisfaction based on just one feature, the GDP per capita, so it was a univariate regression problem. Finally, there is no continuous flow of data coming in the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.

**Pipelines**

A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply. Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. 

Each component is fairly self-contained: the interface between components is simply the data store. This makes the system quite simple to grasp (with the help of a data flow graph), and different teams can focus on different components. Moreover, if a component breaks down, the downstream components can often continue to run normally (at least for a while) by just using the last output from the broken component. This makes the architecture quite robust. On the other hand, a broken component can go unnoticed for some time if proper monitoring is not implemented. The data gets stale and the overall system’s performance drops. 

**Selecting a Performance Measure**  

Your next step is to select a performance measure. A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. For example, if the errors are roughly normally distributed around zero, an RMSE equal to 50,000 means that about 68% of the system’s predictions fall within \$50,000 of the actual value, and about 95% of the predictions fall within \$100,000 of the actual value.  

The formula for RMSE is as follows:  

<span style="color:#888888">
    ${\displaystyle 
        RMSE (\textbf{X}, h) = \sqrt{ 
            {\frac {1}{m} } 
            {\sum_{i=1}^m }
            (h(\textbf{x}^{(i)}) -
            y^{(i)})^{2}
        }
    }$
</span>

Where:

<span style="color:#888888">
$RMSE (\textbf{X}, h)$ = Cost function measured on the set of examples using your hypothesis $h$  
$\textbf{X}$ = Matrix containing all the feature values (excluding labels) of all instances in the dataset  
$h$ = System’s prediction function, also called a *hypothesis*  
$m$ = Number of instances in the dataset  
$\textbf{x}^{(i)}$ = Vector of all the feature values (excluding the label) of the $i$th instance in the dataset  
$y^{(i)}$ = Label (the desired output value) of the $i$th instance in the dataset  
</span>

Lowercase italic font is used for scalar values (such as $m$ or $y^{(i)}$) and function names (such as $h$), lowercase bold font for vectors (such as ${\textbf x^{(i)} }$), and uppercase bold font for matrices (such as $\textbf X$).
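The formula above can be written in a few lines of NumPy. This is a minimal sketch: the `rmse` helper and the four label/prediction pairs are made up for illustration and are not part of the housing dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Hypothetical labels y^(i) and predictions h(x^(i)) for m = 4 districts
y_true = np.array([200_000, 150_000, 320_000, 410_000])
y_pred = np.array([210_000, 140_000, 300_000, 440_000])

example_rmse = rmse(y_true, y_pred)
print(example_rmse)
```

Because the errors are squared before averaging, the single \$30,000 miss contributes more to the result than the two \$10,000 misses combined.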

**Check the Assumptions**  

Lastly, it is good practice to list and verify the assumptions that were made so far (by you or others); this can catch serious issues early on.  

For example, the district prices that your system outputs are going to be fed into a downstream Machine Learning system, and we assume that these prices are going to be used as such. But what if the downstream system actually converts the prices into categories (e.g., “cheap,” “medium,” or “expensive”) and then uses those categories instead of the prices themselves? In this case, getting the price perfectly right is not important at all; your system just needs to get the category right. If that’s so, then the problem should have been framed as a classification task, not a regression task. You don’t want to find this out after working on a regression system for months.

#### Data Import & Exploration

**Questions to ask of the data**

* How was it gathered?
* Is it a sample or a full population?
* What pre-processing, if any, has the data undergone? Are there any other variables missing?
* If it's currently used, what is it used for?
* Are there any schema or descriptions available?

In [None]:
df = pd.read_csv('./data/housing.csv')       # Importing the data

In [None]:
# Basic Exploration

print(df.info())
df.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
df.plot(
    kind="scatter", 
    x="longitude", 
    y="latitude", 
    alpha=0.4,
    s=df["population"]/100, 
    label="population",
    c="median_house_value", 
    cmap=plt.get_cmap("hot"), 
    colorbar=True,
    figsize=(16,12)
)
plt.legend()    # Creating a geo plot of the lat/lon data

In [None]:
pdpf.ProfileReport(df)

**Notes**  

* Median income attribute does not look like it is expressed in US dollars (USD). After checking with the team that collected the data, you are told that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes. Working with preprocessed attributes is common in Machine Learning, and it is not necessarily a problem, but you should try to understand how the data was computed.  

* The housing median age and the median house value were also capped. The latter may be a serious problem since it is your target attribute (your labels), and your algorithms may learn that prices never go beyond that limit.  

* These attributes have very different scales.

* Finally, many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes later on to have more bell-shaped distributions.

* Households is highly correlated with Population

* Total Bedrooms is highly correlated with Total Rooms   

#### Intro to SciKit Learn

Scikit-Learn's API is consistent and built around a small number of design principles:

* **Estimators**: Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is an estimator).
* **Transformers**: Some estimators (such as an imputer) can also transform a dataset; these are called transformers.
* **Predictors**: Finally, some estimators are capable of making predictions given a dataset; they are called predictors.  
* **Inspection**: All the estimator’s hyperparameters are accessible directly via public instance variables.
* **Nonproliferation of classes**: Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes.
* **Composition**: Existing building blocks are reused as much as possible.
* **Sensible defaults**: Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.
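The principles above can be seen in a few lines. This is a minimal sketch on made-up data; it uses `StandardScaler` as the transformer and `LinearRegression` as the predictor, both chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): y = 2 * x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Transformer: fit() learns parameters, transform() applies them
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Inspection: hyperparameters are plain attributes,
# learned parameters carry a trailing underscore
print(scaler.with_mean)   # hyperparameter
print(scaler.mean_)       # learned from the data

# Predictor: fit() on features and labels, then predict()
model = LinearRegression()
model.fit(X_scaled, y)
predictions = model.predict(X_scaled)
```

Note that both the input and the output of `transform()` are plain NumPy arrays, in line with the nonproliferation-of-classes principle.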

#### Data Preparation

Most median income values are clustered around 2–5 (tens of thousands of dollars), but some median incomes go far beyond 6. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum’s importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories), and rounding up using ceil (to have discrete categories), and then merging all the categories greater than 5 into category 5:

In [None]:
# Creating an income_cat variable

df["income_cat"] = np.ceil(df["median_income"] / 1.5)
df["income_cat"] = df["income_cat"].where(df["income_cat"] < 5, 5.0)  # Merging categories above 5 into category 5

train, test = train_test_split(df, test_size=0.2, random_state=42)     # Creating a train / test split
df = train.copy()                                                      # Working on a copy of the train set

You should make sure that the income_cat variable is fairly represented in both the train and test datasets.

In [None]:
train["income_cat"].value_counts() / len(df)
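The split above is purely random, so the category proportions are only approximately preserved. One way to guarantee them is a stratified split. The sketch below uses `StratifiedShuffleSplit` (imported in the Modules cell) on a synthetic `median_income` column; the data and the names `toy`, `strat_train` and `strat_test` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (values are made up)
rng = np.random.RandomState(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})

# Same binning as above: divide by 1.5, round up, cap at category 5
toy["income_cat"] = np.ceil(toy["median_income"] / 1.5).clip(upper=5.0)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in splitter.split(toy, toy["income_cat"]):
    strat_train = toy.iloc[train_index]
    strat_test = toy.iloc[test_index]

# Category proportions in the full set and in the stratified test set
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "stratified_test": in_test}))
```

The two columns should agree closely, which is the property you want from a test set that is meant to represent the full population.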

In [None]:
corr_matrix = df.select_dtypes(include=[np.number]).corr()    # Correlations are only defined for numeric columns
corr_matrix['median_house_value'].sort_values(ascending=False)

**Checking for Correlation**  

We can use pandas' `scatter_matrix` function to check for correlations:

In [None]:
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(df[attributes], figsize=(20, 14), s=4 )

The median_house_value by median_income plot reveals a few things.  

First, the correlation is indeed very strong; you can clearly see the upward trend and the points are not too dispersed. Second, the price cap that we noticed earlier is clearly visible as a horizontal line at 500k. But this plot reveals other less obvious straight lines: a horizontal line around 450k, another around 350k, perhaps one around 280k, and a few more below that. You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce these data quirks.

In [None]:
# Adding some more meaningful variables to the dataset

df["rooms_per_household"] = df["total_rooms"]/df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"]/df["total_rooms"]
df["population_per_household"]=df["population"]/df["households"]

corr_matrix = df.select_dtypes(include=[np.number]).corr()    # Correlations are only defined for numeric columns
corr_matrix['median_house_value'].sort_values(ascending=False)

The new bedrooms_per_room attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently houses with a lower bedroom/room ratio
tend to be more expensive. The number of rooms per household is also more informative than the total number of rooms in a district — obviously the larger the houses, the more expensive they are.

#### Preparing the Data for Machine Learning

In [None]:
# Creating a fresh copy of the training set

df = train.drop("median_house_value", axis=1)
df_labels = train["median_house_value"].copy()

**Handling Numeric Variables**  

Machine Learning algorithms do not cope well with missing data, so this needs to be dealt with. In dealing with missing data you have three options:  
* Get rid of the column
* Get rid of the row
* Fill the value with another value (e.g. mean / median)  

Scikit-Learn has an `Imputer` class that helps us deal with missing values:

In [None]:
df_copy = df.drop("ocean_proximity", axis=1)      # Dropping non-numeric values as Imputer won't deal with these
imputer = Imputer(strategy="median")              # Creating the imputer
imputer.fit(df_copy)                              # fit() computes the median of each column
imputer.statistics_                               # The computed medians are stored in statistics_
df_imputed = imputer.transform(df_copy)           # transform() replaces missing values with those medians
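For comparison, the three options listed above can also be expressed directly in pandas. This is a minimal sketch on a tiny made-up frame; the names `toy` and `option_1` to `option_3` are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one missing value in total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

option_1 = toy.drop("total_bedrooms", axis=1)        # get rid of the column
option_2 = toy.dropna(subset=["total_bedrooms"])     # get rid of the row
median = toy["total_bedrooms"].median()              # computed on known values only
option_3 = toy.fillna({"total_bedrooms": median})    # fill with the median
```

If you choose the third option, keep the median computed on the training set so that the same value can be used later on the test set and on new data.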

**Handling Categorical Variables**

Earlier we left out the categorical attribute ocean_proximity because it is a text attribute so we cannot compute its median. Most Machine Learning algorithms prefer to work with numbers anyway, so let’s
convert these text labels to numbers. Scikit-Learn provides a **transformer** for this task called LabelEncoder.

In [None]:
encoder = LabelEncoder()                          # Creating the Encoder
df_cat = df["ocean_proximity"]                    # Creating the category variable from the df
df_cat_encoded = encoder.fit_transform(df_cat)    # Encoding the category 
print(df_cat_encoded)                             # The coded variable
print(encoder.classes_)                           # The classes

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. Obviously this is not the case with categorical variables. To fix this issue, a common solution is to create one binary attribute per category with values of 0 & 1. This is called **one-hot encoding**, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold).

In [None]:
encoder = OneHotEncoder()                                 # Creating the One Hot Encoder
cat = encoder.fit_transform(df_cat_encoded.reshape(-1,1)) # Reshaping the encoded category to a 2D column
print(cat)

This creates a SciPy sparse matrix, which stores only the locations of the non-zero elements to save memory. It can be converted to a dense NumPy array using the .toarray() method.  

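We can apply both steps (from text categories to integer codes, then from codes to one-hot vectors) in one shot using the LabelBinarizer class, which returns a dense NumPy array by default: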

In [None]:
encoder = LabelBinarizer()                         # Creating the one-shot one-hot Encoder
cat = encoder.fit_transform(df["ocean_proximity"]) # Encoding the category
cat
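To make the two encoding steps concrete, here is a minimal sketch in plain NumPy. The category names are assumed to match those in `ocean_proximity`, and the four sample values are made up.

```python
import numpy as np

# Category names assumed to match ocean_proximity; the sample rows are made up
categories = np.array(["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"])
values = np.array(["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"])

# Step 1: integer codes (the kind of output LabelEncoder produces)
codes = np.searchsorted(categories, values)

# Step 2: one binary column per category (one-hot encoding)
one_hot = np.eye(len(categories), dtype=int)[codes]
print(codes)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column of its category, so no ordering between categories is implied.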

**Custom Transformers**  
Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need is to create a class and implement three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args
and \**kwargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning. Below, there is a small transformer class
that adds the combined attributes we discussed earlier.

In this example the transformer has one hyperparameter, add_bedrooms_per_room, set to True by default (it is often helpful to provide sensible defaults). This hyperparameter will allow you to easily find out whether adding this attribute helps the Machine Learning algorithms or not. More generally, you can add a hyperparameter to gate any data preparation step that you are not 100% sure about. The more you automate these data preparation steps, the more combinations you can automatically try out, making it much more likely that you will find a great combination (and saving you a lot of time).

In [None]:

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self # nothing else to do

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        

# attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
# housing_extra_attribs = attr_adder.transform(df.values)
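The methods you get for free from the two base classes can be checked on a smaller transformer. `RatioAdder` below is a hypothetical class written for this sketch, and the input array is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column `num_ix` / column `den_ix`."""

    def __init__(self, num_ix=0, den_ix=1, add_ratio=True):
        self.num_ix = num_ix
        self.den_ix = den_ix
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        if not self.add_ratio:
            return X
        return np.c_[X, X[:, self.num_ix] / X[:, self.den_ix]]

X = np.array([[6.0, 2.0], [9.0, 3.0]])
adder = RatioAdder()

params = adder.get_params()            # provided by BaseEstimator
with_ratio = adder.fit_transform(X)    # provided by TransformerMixin
adder.set_params(add_ratio=False)      # provided by BaseEstimator
without_ratio = adder.fit_transform(X)
```

Because `get_params()` and `set_params()` read the constructor arguments by name, each argument should be stored unchanged in an attribute of the same name, as done in `__init__` here.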

**Feature Scaling**  

One of the most important transformations you need to apply to your data is **feature scaling**. With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have
very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required. There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.  

**Min-max scaling** (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if you don’t want 0–1 for some reason.  

**Standardization** is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected. Scikit-Learn provides a transformer called StandardScaler for standardization.
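Both scaling methods, and the effect of the outlier described above, can be reproduced with a short NumPy sketch. The helper functions and the evenly spaced incomes are made up for illustration.

```python
import numpy as np

def min_max_scale(x):
    """Shift and rescale so the values range from 0 to 1."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - x.mean()) / x.std()

# Made-up median incomes spread evenly between 0 and 15,
# plus a copy with one erroneous value of 100 appended
incomes = np.linspace(0.0, 15.0, 1000)
with_outlier = np.append(incomes, 100.0)

mm_clean = min_max_scale(incomes)
mm_outlier = min_max_scale(with_outlier)[:-1]   # the legitimate values only
std_clean = standardize(incomes)
std_outlier = standardize(with_outlier)[:-1]

print("min-max range of legitimate values:", mm_outlier.min(), mm_outlier.max())
print("standardized range, clean:", std_clean.min(), std_clean.max())
print("standardized range, with outlier:", std_outlier.min(), std_outlier.max())
```

With the outlier present, min-max scaling squeezes the legitimate values into a small fraction of the 0 to 1 range, while the standardized values keep most of their original spread.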

**WARNING!** As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
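In practice this means calling `fit_transform()` on the training set and only `transform()` on the test set. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(800, 1))   # made-up training data
X_test = rng.normal(loc=50.0, scale=10.0, size=(200, 1))    # made-up test data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # statistics come from the training set
X_test_scaled = scaler.transform(X_test)         # reused as-is, no refitting
```

The scaled test set will generally not have exactly zero mean and unit variance, and that is expected: it was scaled with the training set's statistics.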

**Transformation Pipelines**  

As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Each subpipeline starts with a selector transformer: it simply transforms the data by selecting the desired attributes (numerical or categorical), dropping the rest, and converting the resulting DataFrame to a NumPy array.

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like.  

When you call the pipeline’s fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method. The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (it also has a fit_transform method that we could have used instead of calling fit() and then transform()).
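The sequential behaviour described above can be verified on a toy example by comparing a pipeline with the same steps chained by hand. The data and the choice of two scalers as steps are assumptions made only to keep the sketch short.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 900.0]])   # made-up data

pipeline = Pipeline([
    ("min_max", MinMaxScaler()),
    ("std_scaler", StandardScaler()),
])
piped = pipeline.fit_transform(X)

# The same steps chained by hand
step_1 = MinMaxScaler().fit_transform(X)
manual = StandardScaler().fit_transform(step_1)

print(np.allclose(piped, manual))
print(hasattr(pipeline, "transform"))   # exposed because the last step is a transformer
```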

You now have a pipeline for numerical values, and you also need to apply the LabelBinarizer on the categorical values: how can you join these transformations into a single pipeline? Scikit-Learn provides a FeatureUnion class for this. You give it a list of transformers (which can be entire transformer pipelines), and when its transform() method is called it runs each transformer’s transform() method
in parallel, waits for their output, and then concatenates them and returns the result (and of course calling its fit() method calls each transformer’s fit() method). A full pipeline handling both numerical and categorical attributes may look like this:

In [None]:
# Copy of what's been done so far

df = pd.read_csv('./data/housing.csv')                             # Importing the data
train, test = train_test_split(df, test_size=0.2, random_state=42) # Creating a train / test split

train_labels = train["median_house_value"].copy()                  # Creating a copy of the Median House Value
train = train.drop("median_house_value", axis=1)                   # Getting rid of Median House Value in the training set
train_num = train.drop("ocean_proximity", axis=1)                  # Dropping non-numeric values as Imputer won't deal with these

# Adding an Attributes adder class to add a hyperparameter
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self # nothing else to do

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        
        else:
            return np.c_[X, rooms_per_household, population_per_household]

num_attribs = list(train_num)
cat_attribs = ["ocean_proximity"]

# Dataframe selector

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
        
# Building the numeric pipeline        
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

# Building the categorical pipeline
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

# Full pipeline
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(train)
housing_prepared.shape

# Building the model

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, train_labels)
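To see what FeatureUnion does in isolation, the following sketch feeds the same made-up array to two transformers and checks that their outputs are concatenated column-wise. The two scalers stand in for the numerical and categorical sub-pipelines.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])   # made-up data

union = FeatureUnion(transformer_list=[
    ("std", StandardScaler()),
    ("min_max", MinMaxScaler()),
])
combined = union.fit_transform(X)

# Columns 0-1 come from StandardScaler, columns 2-3 from MinMaxScaler
print(combined.shape)
```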

Once we've built our ML model we can now test it with some data:

In [None]:
df = pd.read_csv('./data/housing.csv')                          # Importing the data
labels = df["median_house_value"].copy()                        # Creating the labels dataset
df = df.drop("median_house_value", axis=1)                      # Getting rid of Median House Value in the features


some_data = df.iloc[:10]
some_labels = labels.iloc[:10]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:\t", lin_reg.predict(some_data_prepared))
print("Actuals:\t\t", list(some_labels))

The model isn't performing brilliantly. We can measure the RMSE using Scikit-Learn's `mean_squared_error` function:

In [None]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(train_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

This is better than nothing but clearly not a great score: most districts’ median_house_value figures range between \$120,000 and \$265,000, so a typical prediction error of \$68,628 is not very satisfying.
This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful
enough. As we saw in the previous chapter, the main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the
model. This model is not regularized, so this rules out the last option. You could try to add more features (e.g., the log of the population), but first let’s try a more complex model to see how it does.

In [None]:
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, train_labels)

Let's evaluate the model:

In [None]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(train_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

No error at all? Could this model really be absolutely perfect? Of course, it is much more likely that the model has badly overfit the data. How can you be sure? As we saw earlier, you don’t want
to touch the test set until you are ready to launch a model you are confident about, so you need to use part of the training set for training, and part for model validation. One way to evaluate the Decision Tree model would be to use the train_test_split function to split the training set into a smaller training set and a validation set, then train your models against the smaller
training set and evaluate them against the validation set. It’s a bit of work, but nothing too difficult and it would work fairly well. A great alternative is to use Scikit-Learn’s cross-validation feature. The following code performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores:

In [None]:
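tree_rmse_scores = cross_val_score(tree_reg, housing_prepared, train_labels,
    scoring="neg_mean_squared_error", cv=10)    # Returns the negative MSE of each fold (higher is better)
rmse_scores = np.sqrt(-tree_rmse_scores)        # Flipping the sign before taking the square root gives the RMSE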

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
    
display_scores(rmse_scores)
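The mechanics of K-fold cross-validation can be spelled out by hand. The sketch below does not use `cross_val_score`: it shuffles the indices itself, fits an ordinary least-squares model on made-up linear data, and returns one RMSE per fold. The `k_fold_rmse` helper and the data are assumptions for illustration.

```python
import numpy as np

def k_fold_rmse(X, y, k=10, seed=42):
    """Return one validation RMSE per fold for an ordinary least-squares fit."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the other k - 1 folds (a column of ones models the intercept)
        A_train = np.c_[np.ones(len(train_idx)), X[train_idx]]
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Evaluate on the held-out fold
        A_val = np.c_[np.ones(len(val_idx)), X[val_idx]]
        errors = A_val @ coef - y[val_idx]
        scores.append(np.sqrt(np.mean(errors ** 2)))
    return np.array(scores)

# Made-up linear data with Gaussian noise of standard deviation 2
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(0, 2.0, size=500)

scores = k_fold_rmse(X, y, k=10)
print("Mean:", scores.mean(), "Standard deviation:", scores.std())
```

Every instance is used for validation exactly once, so the mean score reflects performance on data the model did not see during fitting, and the standard deviation indicates how precise that estimate is.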

In [None]:
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, train_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(train_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
forest_rmse_scores = cross_val_score(forest_reg, housing_prepared, train_labels,
    scoring="neg_mean_squared_error", cv=10)

rmse_scores = np.sqrt(-forest_rmse_scores)
display_scores(rmse_scores)

Random Forests look very promising. However, note that the score on the training set is still much lower than on the validation sets, meaning that the model is still overfitting the training set. Possible solutions for overfitting are to simplify the model, constrain it (i.e., regularize it), or get a lot more training data. However, before you dive much deeper in Random Forests, you should try out many other models from various categories of Machine Learning algorithms (several Support Vector Machines with different kernels, possibly a neural network, etc.), without spending too much time tweaking the hyperparameters.

#### Fine Tuning

Once you have a shortlist of promising models, there are several ways to fine-tune them:

* Grid Search
* Randomized Search
* Ensemble Methods

**Grid Search**

One way to fine tune would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to
explore many combinations. Instead you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation. For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, train_labels)

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of n_estimators and max_features hyperparameter values specified in the first dict (don’t worry about what these
hyperparameters mean for now; they will be explained in Chapter 7), then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the bootstrap hyperparameter set to
False instead of True (which is the default value for this hyperparameter). All in all, the grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor
hyperparameter values, and it will train each model five times (since we are using five-fold cross validation). In other words, all in all, there will be 18 × 5 = 90 rounds of training! It may take quite a long time, but when it is done you can get the best combination of parameters like this:

In [None]:
grid_search.best_params_
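The combination counts quoted above can be reproduced with the standard library alone. This sketch expands the same `param_grid` by hand; the `expand` helper is hypothetical and only mimics how a grid of lists turns into individual settings.

```python
from itertools import product

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

def expand(grid):
    """List every hyperparameter combination described by one dict of the grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

combinations = [combo for grid in param_grid for combo in expand(grid)]
cv_folds = 5

print(len(combinations))              # number of candidate settings
print(len(combinations) * cv_folds)   # number of training rounds
```

The count grows multiplicatively with every value you add to a list, which is why an exhaustive grid becomes expensive quickly.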

You can check the results as follows:

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

In this example, we obtain the best solution by setting the max_features hyperparameter to 6, and the n_estimators hyperparameter to 30. The RMSE score for this combination is 49,959, which is slightly
better than the score you got earlier using the default hyperparameter values (which was 52,634). Congratulations, you have successfully fine-tuned your best model!

**Randomized Search**

The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use
RandomizedSearchCV instead. This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations
by selecting a random value for each hyperparameter at every iteration. This approach has two main benefits:

* If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the
grid search approach).  
* You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
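
A randomized search might look like the sketch below. It is not taken from the original notebook: the data is a synthetic stand-in generated with make_regression (swap in housing_prepared and train_labels for the real thing), and the ranges passed to randint are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption, not the housing dataset).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Distributions to sample from, instead of fixed lists of values.
param_distribs = {
    'n_estimators': randint(low=5, high=50),
    'max_features': randint(low=1, high=9),  # upper bound is exclusive, so 1..8
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distribs,
    n_iter=5,  # the computing budget: number of sampled combinations
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
rnd_search.fit(X, y)

print(rnd_search.best_params_)
print(np.sqrt(-rnd_search.best_score_))  # RMSE of the best sampled combination
```

The n_iter argument is the knob that controls the budget: doubling it doubles the number of sampled combinations, independently of how many hyperparameters are being searched.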

**Ensemble Methods**  

Another way to fine-tune your system is to try to combine the models that perform best. The group (or “ensemble”) will often perform better than the best individual model (just like Random Forests perform
better than the individual Decision Trees they rely on), especially if the individual models make very different types of errors.
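
The simplest form of ensembling is to average the predictions of several trained models. The sketch below does this on synthetic data (an assumption; it does not use the housing dataset), with three model types chosen only for illustration. Averaging is not guaranteed to beat the best individual model, so compare the printed scores before adopting it.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data, split into a training and a validation part.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=0),
    'forest': RandomForestRegressor(n_estimators=30, random_state=0),
}

# Train each model and keep its validation predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_val)
    print(name, np.sqrt(mean_squared_error(y_val, predictions[name])))

# Ensemble prediction: the plain average of the individual predictions.
ensemble_pred = np.mean(list(predictions.values()), axis=0)
ensemble_rmse = np.sqrt(mean_squared_error(y_val, ensemble_pred))
print('ensemble', ensemble_rmse)
```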

**Analyze the Best Models and Their Errors**  

You will often gain good insights on the problem by inspecting the best models. For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions:

In [None]:
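# Importance scores come from the best estimator found by the grid search
feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.active_features_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)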

With this information, you may want to try dropping some of the less useful features. You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem (adding extra features or, on the contrary, getting rid of uninformative ones, cleaning up outliers, etc.).
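
One possible way to drop the less useful features is to keep only the k columns with the highest importance scores. The sketch below uses synthetic data, and the value of k is a hypothetical choice; in practice you would treat it as something to tune.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data in which only 3 of the 8 features are informative.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
forest = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

k = 3  # hypothetical number of features to keep

# Indices of the k largest importances, sorted so the column order is preserved.
top_k_indices = np.sort(np.argsort(forest.feature_importances_)[-k:])
X_reduced = X[:, top_k_indices]

print(top_k_indices, X_reduced.shape)
```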

**Evaluate Your System on the Test Set**  

After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set. There is nothing special about this process; just get the predictors and the labels from your test set, run your full_pipeline to transform the data (call transform(), not fit_transform()!), and evaluate the final model on the test set:

In [None]:
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_
X_test = test.drop("median_house_value", axis=1)
y_test = test["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)  # transform(), not fit_transform()
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

The performance will usually be slightly worse than what you measured using cross-validation if you did a lot of hyperparameter tuning (because your system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on unknown datasets). It is not the case in this example, but when this happens you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.

Now comes the project prelaunch phase: you need to present your solution (highlighting what you have learned, what worked and what did not, what assumptions were made, and what your system’s limitations are), document everything, and create nice presentations with clear visualizations and easy-to-remember statements (e.g., “the median income is the number one predictor of housing prices”).

**Launch, Monitor, and Maintain Your System**  

Perfect, you got approval to launch! You need to get your solution ready for production, in particular by plugging the production input data sources into your system and writing tests. You also need to write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops. This is important to catch not only sudden breakage, but also gradual performance degradation. Degradation is quite common because models tend to “rot” as data evolves over time, unless they are regularly retrained on fresh data.
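
A performance check of this kind could be as small as the sketch below. The function name, the baseline RMSE, and the 10% tolerance are all hypothetical; a real system would also need a scheduler and an alerting channel, which are out of scope here.

```python
import numpy as np

def check_live_performance(y_true, y_pred, baseline_rmse, tolerance=0.10):
    """Return (rmse, alert); alert is True when the RMSE on recent labelled
    samples exceeds the baseline by more than the given relative tolerance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    alert = rmse > baseline_rmse * (1.0 + tolerance)
    return rmse, alert

# A batch that matches the baseline, and one that is clearly worse.
rmse_ok, alert_ok = check_live_performance([100.0, 200.0], [110.0, 190.0], baseline_rmse=10.0)
rmse_bad, alert_bad = check_live_performance([100.0, 200.0], [150.0, 120.0], baseline_rmse=10.0)

print(rmse_ok, alert_ok)
print(rmse_bad, alert_bad)
```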


Evaluating your system’s performance will require sampling the system’s predictions and evaluating them. This will generally require a human analysis. These analysts may be field experts, or workers on a
crowdsourcing platform (such as Amazon Mechanical Turk or CrowdFlower). Either way, you need to plug the human evaluation pipeline into your system. You should also make sure you evaluate the system’s input data quality. Sometimes performance will degrade slightly because of a poor quality signal (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale), but it may take a while before your system’s performance degrades enough to trigger an alert. If you monitor your system’s inputs, you may catch this earlier. Monitoring the inputs is particularly important for online learning systems.
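
Input monitoring can start with a very simple heuristic: compare the mean of each feature in a recent batch with the mean seen at training time. The sketch below is one such heuristic and not a complete drift detector; the function name and the z-score threshold are assumptions, and the data is synthetic, with one column shifted on purpose to imitate a malfunctioning sensor.

```python
import numpy as np

def find_drifting_features(train_X, live_X, z_threshold=3.0):
    """Return the indices of features whose live batch mean is far from the
    training mean, measured in standard errors of the batch mean."""
    train_X = np.asarray(train_X, dtype=float)
    live_X = np.asarray(live_X, dtype=float)
    train_mean = train_X.mean(axis=0)
    std_err = train_X.std(axis=0) / np.sqrt(len(live_X))
    # Constant training features get an infinite denominator, so they never trigger.
    safe_std_err = np.where(std_err > 0, std_err, np.inf)
    z = np.abs(live_X.mean(axis=0) - train_mean) / safe_std_err
    return np.flatnonzero(z > z_threshold)

rng = np.random.default_rng(0)
train_X = rng.normal(size=(1000, 3))
live_X = rng.normal(size=(200, 3))
live_X[:, 1] += 5.0  # simulate a sensor that starts sending shifted values

drifting = find_drifting_features(train_X, live_X)
print(drifting)
```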


Finally, you will generally want to train your models on a regular basis using fresh data. You should automate this process as much as possible. If you don’t, you are very likely to refresh your model only every six months (at best), and your system’s performance may fluctuate severely over time. If your system is an online learning system, you should make sure you save snapshots of its state at regular intervals so you can easily roll back to a previously working state. 
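
Saving snapshots of an online learning model can be sketched with joblib, as below. The SGDRegressor, the synthetic batches, and the temporary snapshot directory are all stand-ins chosen for illustration; a production system would write to durable storage and keep some metadata alongside each snapshot.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_weights = np.array([1.0, -2.0, 0.5, 3.0])

model = SGDRegressor(random_state=0)
snapshot_dir = tempfile.mkdtemp()  # hypothetical location for the snapshots

snapshot_paths = []
for step in range(3):
    # Each iteration stands in for a new batch of live data.
    X_batch = rng.normal(size=(50, 4))
    y_batch = X_batch @ true_weights
    model.partial_fit(X_batch, y_batch)

    # Save the model state after every update.
    path = os.path.join(snapshot_dir, f"model_step_{step}.joblib")
    joblib.dump(model, path)
    snapshot_paths.append(path)

# Roll back to the state saved after the first batch.
restored = joblib.load(snapshot_paths[0])
print(restored.coef_)
```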

#### ML Project Checklist

**Pre Project**

1. What's the business issue? Is it a Machine Learning problem?  
2. What does the current solution look like, if indeed there is one?
3. What does the data look like? Some ideas:
    * How was it gathered?  
    * Is it a sample or a full population?  
    * What pre-processing, if any, has the data undergone? Are there any other variables missing?  
    * If it's currently used, what is it used for?  
    * Are there any schema or descriptions available?  
4. How do you anticipate solving the problem? Should you use supervised, unsupervised, or reinforcement learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?  
5. What performance measure will you use?
6. What assumptions have been made, either by you or others?



**Data Preparation** 

1. Perform some basic exploratory analysis of the data (missing values, column names, cardinality, correlations, categorical / numeric variables, overall cleanliness, etc.).
2. Visualise the data (spread & distribution, patterns, pre-processing, etc.).
3. Eyeball a sample of the data. Are there any glaring issues (e.g. user error, sample bias, inconsistency)?
4. Simplify the dataset where appropriate: averaging, combining attributes, dimensionality reduction, creating categorical variables.

**Pipelining**  

1. Deal with missing values (SKL Imputer)
2. Encode Categorical Variables (SKL Label Encoder / One Hot Encoder)
3. Consider Feature Scaling if the variables have vastly different scales  
4. Consider exposing data preparation choices (e.g. whether to add a derived attribute) as hyperparameters, so they can be tuned along with the model.
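
The first three steps above can be sketched as a single preprocessing object. This sketch assumes scikit-learn 0.20 or later, where SimpleImputer replaces the older Imputer class and ColumnTransformer routes columns to the right transformer. The tiny DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with missing numeric values and one categorical column.
df = pd.DataFrame({
    'rooms': [3.0, np.nan, 5.0, 4.0],
    'income': [2.5, 4.0, np.nan, 8.0],
    'proximity': ['INLAND', 'NEAR BAY', 'INLAND', 'ISLAND'],
})

# Numeric columns: fill missing values with the median, then standardize.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Categorical column: one-hot encode, ignoring categories unseen during fit.
preprocess = ColumnTransformer([
    ('num', num_pipeline, ['rooms', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['proximity']),
])

prepared = preprocess.fit_transform(df)
print(prepared.shape)  # 2 numeric columns + 3 one-hot columns
```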

**Building / Running the model**

1. What's the error?
2. Is the model overfitting? 
3. Is the model underfitting?
4. Consider trying different model types.
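
A quick way to answer the overfitting question is to compare the error on the training set with the cross-validated error. The sketch below uses synthetic data and an unconstrained decision tree, chosen because it memorizes the training set; a training error far below the cross-validation error points to overfitting, while two similarly high errors point to underfitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with some noise.
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, y)

# Error on the data the model was trained on.
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# Error estimated on held-out folds.
scores = cross_val_score(tree, X, y, scoring='neg_mean_squared_error', cv=5)
cv_rmse = np.sqrt(-scores).mean()

print(train_rmse, cv_rmse)
```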

**Fine Tuning**  

1. Consider Grid Search, Randomised Search or Ensemble Methods
2. Analyse the best models and their errors. What variables can we remove?
3. Evaluate on the test set.
