## 07. A Machine Learning Project

#### Resources

[Machine Learning in SKL & Tensorflow (pdf)](./docs/Hands.Machine.Learning.Scikit.Learn.Tensorflow.5225.pdf#page=58)<br/>
[Machine Learning in SKL & Tensorflow (Repo)](https://github.com/ageron/handson-ml)<br/>
[Matplotlib Colormaps](https://matplotlib.org/users/colormaps.html)

#### Modules

In [None]:
import os
import tarfile
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pdpf
from six.moves import urllib
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix    # pandas.tools.plotting was removed in later pandas releases
%matplotlib inline

#### Getting Started

**Checklist**  

The basic steps you will go through when taking on an ML project are as follows:  
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

#### 1. Frame the Problem and Look at the Big Picture

The first question to ask is what exactly is the business objective; building a model is probably not the end goal. How does the company expect to use and benefit from this model? This is important
because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.

In this case we're going to build a model to predict a district’s median housing price. Its output will be **pipelined** into another Machine Learning system, along with many other signals.
This downstream system will determine whether it is worth investing in a given area or not. Getting this right is critical, as it directly affects revenue.  

The next question to ask is what the current solution looks like (if any). It will often give you a reference performance, as well as insights on how to solve the problem.

Then, you need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques? Before you read on, pause and try to answer these questions for yourself.

In this case, we have a typical supervised learning task since we are given labeled training examples (each instance comes with the expected output, i.e., the district’s median
housing price). Moreover, it is also a typical regression task, since you are asked to predict a value. More specifically, this is a multivariate regression problem since the system will use multiple features to make a prediction (it will use the district’s population, the median income, etc.). Previously, you predicted life satisfaction based on just one feature, the GDP per capita, so it was a univariate regression problem. Finally, there is no continuous flow of data coming in the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.

**Pipelines**

A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply. Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. 

Each component is fairly self-contained: the interface between components is simply the data store. This makes the system quite simple to grasp (with the help of a data flow graph), and different teams can focus on different components. Moreover, if a component breaks down, the downstream components can often continue to run normally (at least for a while) by just using the last output from the broken component. This makes the architecture quite robust. On the other hand, a broken component can go unnoticed for some time if proper monitoring is not implemented. The data gets stale and the overall system’s performance drops.
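
The idea of components that only talk to each other through a data store can be sketched in a few lines. This is a toy illustration, not a real pipeline framework: the `store` dictionary, the component names, and the input rows are all hypothetical.

```python
# Hypothetical in-memory "data store"; a real system would use files or a database
store = {}

def clean_component(raw_rows):
    """Upstream component: drops rows with missing prices and writes its output."""
    store["cleaned"] = [row for row in raw_rows if row["price"] is not None]

def average_component():
    """Downstream component: reads whatever the upstream component last wrote."""
    cleaned = store["cleaned"]
    store["average_price"] = sum(row["price"] for row in cleaned) / len(cleaned)

clean_component([{"price": 100.0}, {"price": None}, {"price": 300.0}])
average_component()
print(store["average_price"])

# If the upstream component stops running, the downstream one still works,
# but only on stale data: its output no longer reflects any new input.
average_component()
```

Because the second call to `average_component` succeeds silently on old data, a setup like this needs monitoring to notice that the upstream component has stopped.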

**Selecting a Performance Measure**  

Your next step is to select a performance measure. A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It is the square root of the average squared prediction error, so it gives a sense of how large the system’s errors typically are, with larger errors weighted more heavily.
For example, if the errors are roughly normally distributed around zero, an RMSE equal to 50,000 means that about 68% of the system’s predictions fall within \$50,000 of the actual value, and about 95% of the predictions fall within \$100,000 of the actual value.  

The formula for RMSE is as follows:  

<span style="color:#888888">
    ${\displaystyle 
        RMSE (\textbf{X}, h) = \sqrt{ 
            {\frac {1}{m} } 
            {\sum_{i=1}^m }
            (h(\textbf{x}^{(i)}) -
            y^{(i)})^{2}
        }
    }$
</span>

Where:

<span style="color:#888888">
$RMSE (\textbf{X}, h)$ = Cost function measured on the set of examples using your hypothesis $h$  
$\textbf{X}$ = Matrix containing all the feature values (excluding labels) of all instances in the dataset  
$h$ = System’s prediction function, also called a *hypothesis*  
$m$ = Number of instances in the dataset  
$\textbf{x}^{(i)}$ = Vector of all the feature values (excluding the label) of the $i$th instance in the dataset  
$y^{(i)}$ = Label (the desired output value) of the $i$th instance in the dataset  
</span>

Lowercase italic font is used for scalar values (such as $m$ or $y^{(i)}$) and function names (such as $h$), lowercase bold font for vectors (such as ${\textbf x^{(i)} }$), and uppercase bold font for matrices (such as $\textbf X$).
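
As a rough illustration of the RMSE formula, here is a minimal NumPy sketch. The function name `rmse` and the sample labels and predictions are made up for this example; they are not taken from the housing data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square each error, average over the m instances, then take the root
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical labels and predictions (in USD), for illustration only
y_true = [200_000, 150_000, 300_000, 250_000]
y_pred = [210_000, 140_000, 330_000, 250_000]
print(rmse(y_true, y_pred))
```

Note how the single \$30,000 error dominates the result: squaring gives large errors more weight than small ones.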

**Check the Assumptions**  

Lastly, it is good practice to list and verify the assumptions that were made so far (by you or others); this can catch serious issues early on.  

For example, the district prices that your system outputs are going to be fed into a downstream Machine Learning system, and we assume that these prices are going to be used as such. But what if the downstream system actually converts the prices into categories (e.g., “cheap,” “medium,” or “expensive”) and then uses those categories instead of the prices themselves? In this case, getting the price perfectly right is not important at all; your system just needs to get the category right. If that’s so, then the problem should have been framed as a classification task, not a regression task. You don’t want to find this out after working on a regression system for months.
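
To make this concrete, here is a small sketch of what such a downstream conversion might look like. The thresholds and category labels are purely hypothetical, since the actual downstream system is not described here.

```python
import pandas as pd

# Hypothetical price thresholds (USD); the real downstream categories are unknown
bins = [0, 150_000, 300_000, float("inf")]
labels = ["cheap", "medium", "expensive"]

actual = pd.Series([120_000, 200_000, 450_000])
predicted = pd.Series([145_000, 160_000, 320_000])  # off by up to $130,000

actual_cat = pd.cut(actual, bins=bins, labels=labels)
predicted_cat = pd.cut(predicted, bins=bins, labels=labels)

# Compare as plain strings to keep the comparison simple
matches = actual_cat.astype(str).eq(predicted_cat.astype(str))
print(matches.all())
```

In this made-up case, price errors of tens of thousands of dollars still leave every district in the right category, which is why the framing of the task matters.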

#### Data Import & Exploration

**Questions to ask of the data**

* How was it gathered?
* Is it a sample or a full population?
* What pre-processing, if any, has the data undergone? Are there any other variables missing?
* If it's currently used, what is it used for?
* Are there any schema or descriptions available?

In [None]:
df = pd.read_csv('./data/housing.csv')       # Importing the data

In [None]:
# Basic Exploration

df.info()    # info() prints its own summary and returns None, so no print() is needed
df.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
df.plot(
    kind="scatter", 
    x="longitude", 
    y="latitude", 
    alpha=0.4,
    s=df["population"]/100, 
    label="population",
    c="median_house_value", 
    cmap=plt.get_cmap("hot"), 
    colorbar=True,
    figsize=(16,12)
)
plt.legend()    # Creating a geo plot of the lat/lon data

In [None]:
pdpf.ProfileReport(df)

**Notes**  

* The median income attribute does not look like it is expressed in US dollars (USD). After checking with the team that collected the data, you are told that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes. Working with preprocessed attributes is common in Machine Learning, and it is not necessarily a problem, but you should try to understand how the data was computed.  

* The housing median age and the median house value were also capped. The latter may be a serious problem since it is your target attribute (your labels): your algorithms may learn that prices never go beyond that limit.  

* These attributes have very different scales.

* Finally, many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes later on to have more bell-shaped distributions.

* Households is highly correlated with Population.

* Total Bedrooms is highly correlated with Total Rooms.
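
Two of the notes above (capped values and tail-heavy distributions) can be checked numerically as well as visually. The sketch below uses a tiny synthetic DataFrame with made-up values; `share_at_max` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

def share_at_max(series):
    """Fraction of values equal to the column maximum (a spike suggests capping)."""
    return float((series == series.max()).mean())

# Small synthetic stand-in for the housing data (values are made up)
demo = pd.DataFrame({
    "median_house_value": [120_000, 250_000, 500_001, 500_001, 500_001, 310_000],
    "population": [300, 1_200, 850, 40_000, 950, 1_100],
})

print(share_at_max(demo["median_house_value"]))  # 3 of 6 rows sit at the maximum
print(share_at_max(demo["population"]))          # only 1 of 6 rows

# A log transform pulls in a long right tail, which lowers the skewness
raw_skew = demo["population"].skew()
log_skew = np.log1p(demo["population"]).skew()
print(raw_skew, log_skew)
```

On the real dataset, a large share of rows sitting exactly at the maximum would be one hint that a column was capped; it is not proof, so it is still worth confirming with whoever collected the data.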

#### Data Preparation

Most median income values are clustered around 2–5 (tens of thousands of dollars), but some median incomes go far beyond 6. Because median income is an important attribute, the test set should represent the various income categories (strata) of the whole dataset. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum’s importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories), rounding up using `np.ceil` (to have discrete categories), and then merging all the categories greater than 5 into category 5:

In [None]:
# Creating an income_cat variable

df["income_cat"] = np.ceil(df["median_income"] / 1.5)
df["income_cat"] = df["income_cat"].where(df["income_cat"] < 5, 5.0)    # Merging categories above 5 into category 5

train, test = train_test_split(df, test_size=0.2, random_state=42)     # Creating a train / test split
df = train.copy()                                                      # Working on a copy of the train set from here on

You should make sure that the income_cat variable is fairly represented in both the train and test datasets.

In [None]:
train["income_cat"].value_counts() / len(train)    # Share of each income category in the train set
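
The split above is purely random, so the income categories are only approximately balanced between the two sets. One way to enforce the balance is `StratifiedShuffleSplit`, which is already imported in the Modules cell. The sketch below runs on synthetic incomes (the `demo` DataFrame is made up) so that it is self-contained; on the real data you would pass `df` and `df["income_cat"]` instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data: 1,000 made-up median incomes
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)})

# Same income categories as above: divide by 1.5, round up, cap at 5
demo["income_cat"] = np.ceil(demo["median_income"] / 1.5).clip(upper=5)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(demo, demo["income_cat"]))
strat_train, strat_test = demo.iloc[train_idx], demo.iloc[test_idx]

# Category proportions should be nearly identical in both sets
proportions = pd.DataFrame({
    "train": strat_train["income_cat"].value_counts(normalize=True),
    "test": strat_test["income_cat"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

A shorter alternative is to pass `stratify=df["income_cat"]` to `train_test_split`, which performs the same kind of stratified sampling in a single call.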

In [None]:
corr_matrix = df.corr(numeric_only=True)    # numeric_only skips any text columns (requires pandas >= 1.5)
corr_matrix['median_house_value'].sort_values(ascending=False)

**Checking for Correlation**  

We can use the pandas `scatter_matrix` function to check for correlations:

In [None]:
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(df[attributes], figsize=(16, 12), s=4)

The median_house_value by median_income plot reveals a few things.  

First, the correlation is indeed very strong; you can clearly see the upward trend and the points are not too dispersed. Second, the price cap that we noticed earlier is clearly visible as a horizontal line at 500k. But this plot reveals other less obvious straight lines: a horizontal line around 450k, another around 350k, perhaps one around 280k, and a few more below that. You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce these data quirks.
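
One possible way to find and drop such quirks is to look for exact target values that occur suspiciously often. The sketch below uses a made-up price series, and the threshold of 3 repeats is an arbitrary assumption for this toy data; on the real dataset it would have to be chosen by inspecting the counts.

```python
import pandas as pd

# Synthetic stand-in for the target column (values are made up)
prices = pd.Series(
    [500_001] * 5 + [450_000] * 4 + [350_000] * 3
    + [123_400, 187_500, 212_300, 264_900, 301_200, 398_700]
)

# Hypothetical threshold: treat any exact value seen 3+ times as a quirk
counts = prices.value_counts()
quirk_values = counts[counts >= 3].index

# Keep only the rows whose price is not one of the over-represented values
cleaned = prices[~prices.isin(quirk_values)]
print(sorted(quirk_values))
print(len(prices), "->", len(cleaned))
```

Dropping rows this way also discards real districts, so it is worth comparing model performance with and without the filter.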

In [None]:
# Adding some more meaningful variables to the dataset

df["rooms_per_household"] = df["total_rooms"]/df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"]/df["total_rooms"]
df["population_per_household"] = df["population"]/df["households"]
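
A natural follow-up is to check how the new ratio attributes correlate with the target. The sketch below repeats the same three ratios on a handful of illustrative rows (the `demo` DataFrame is a stand-in, so the printed correlations say nothing about the real dataset).

```python
import pandas as pd

# Illustrative stand-in for the housing data; far too small for real conclusions
demo = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274, 1627],
    "total_bedrooms": [129, 1106, 190, 235, 280],
    "population": [322, 2401, 496, 558, 565],
    "households": [126, 1138, 177, 219, 259],
    "median_house_value": [452_600, 358_500, 352_100, 341_300, 342_200],
})

demo["rooms_per_household"] = demo["total_rooms"] / demo["households"]
demo["bedrooms_per_room"] = demo["total_bedrooms"] / demo["total_rooms"]
demo["population_per_household"] = demo["population"] / demo["households"]

# Correlation of every attribute with the target, strongest first
corr = demo.corr()["median_house_value"].sort_values(ascending=False)
print(corr)
```

On the full train set, the same two lines at the end show whether the engineered ratios are more informative than the raw totals they were built from.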