# Chapter 2: End-to-End Machine Learning Project

As the title implies, this chapter walks us through a machine learning project from start to finish. It's very nice in the sense that we get a feel for what a real-life machine learning project (at least from the author's perspective) may look like, what we need to think about, and what steps we need to take.

I'll be covering each section of the chapter briefly. The chapter is rather long, but it shouldn't be too much if we stick to the main points.

I'll keep an appendix of all the Python commands/methods used, with a link to the corresponding documentation.

### ※ _Many parts of the book are either not up-to-date or too... "from scratch." I'll skip over those parts due to time constraints, but if interested please feel free to go back and check for yourself_.

<br><br>

## Table of Contents

**1. Working with Real Data**

**2. Look at the Big Picture**
   1. Frame the Problem
   2. Select a Performance Measure
   3. Check the Assumptions
   
**3. Get the Data** $\rightarrow$ _**code**_
   1. Create the Workspace
   2. Download the Data
   3. Take a Quick Look at the Data Structure
   4. Create a Test Set
   
**4. Discover and Visualize the Data to Gain Insights** $\rightarrow$ _**code**_
   1. Visualizing Geographical Data
   2. Looking for Correlations
   3. Experimenting with Attribute Combinations
   
**5. Prepare the Data for Machine Learning Algorithms** $\rightarrow$ _**code**_
   1. Data Cleaning
   2. Handling Text and Categorical Attributes
   3. Custom Transformers
   4. Feature Scaling
   5. Transformation Pipelines
   
**6. Select and Train a Model** $\rightarrow$ _**code**_
   1. Training and Evaluating on the Training Set
   2. Better Evaluation Using Cross-Validation
   
**7. Fine-Tune Your Model** $\rightarrow$ _**code**_
   1. Grid Search
   2. Randomized Search
   3. Ensemble Methods
   4. Analyze the Best Models and Their Errors
   5. Evaluate Your System on the Test Set
   
**8. Launch, Monitor, and Maintain Your System**

**9. Try It Out!**

**10. Exercises**

# 1. Working with Real Data

#### Rather than always working with artificial datasets (e.g. the ones that people make just for school projects), it's better to work with real-world datasets. One example is the famous iris dataset.

##### I personally think this is much better because there are plenty of real-world datasets that are small and simple enough to use as a learning experience.

#### Some popular open data repositories are:

### 1. Kaggle (https://www.kaggle.com/datasets)

![Kaggle](https://github.com/seankala/ml_study_group/blob/master/Machine%20Learning/hands_on_ml/images/Kaggle_logo.png?raw=true)

### 2. UC Irvine Machine Learning Repository (http://mlr.cs.umass.edu/ml/)

![UC Irvine](https://github.com/seankala/ml_study_group/blob/master/Machine%20Learning/hands_on_ml/images/uci_ml_repo.png?raw=true)

### 3. Amazon AWS Datasets (https://registry.opendata.aws/)

![AWS](https://github.com/seankala/ml_study_group/blob/master/Machine%20Learning/hands_on_ml/images/aws.jpg?raw=true)

<br>

---

<br>

## This chapter will use the California Housing Prices dataset from the StatLib repository (it's not up-to-date and is from the 1990s).

![alt-text](https://github.com/seankala/ml_study_group/blob/master/Machine%20Learning/hands_on_ml/images/cali_housing_prices.png?raw=true)

<br>
<br>

# 2. Look at the Big Picture

_Background: You're a newly hired data scientist for the Machine Learning Housing Corporation (it's fake). Your job is to use machine learning techniques on the California Housing Prices dataset in order to gain insight from the data._

## 2-1. Frame the Problem

#### What is our objective? What do we plan to gain? These questions are important because they determine which machine learning algorithms we use, which data cleaning processes we use, etc.

The company's objective is to determine whether a house is worth investing in or not. Our machine learning component will be one part of a larger data pipeline: its output (a predicted price) is fed into a downstream system that makes that decision.

A pipeline (or data pipeline) is basically a sequence of components, where one component's output is fed into the next component as input.

![pipeline](https://github.com/seankala/ml_study_group/blob/master/Machine%20Learning/hands_on_ml/images/data_pipline_housing.png?raw=true)
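To make the idea concrete, here is a minimal sketch of a pipeline as a chain of plain Python functions. The component names (`clean`, `predict_price`, `decide_investment`), the pricing rule, and the numbers are all hypothetical and made up for illustration; they are not from the book.

```python
from functools import reduce

# Hypothetical components; each one takes the previous component's output as input.
def clean(district):
    # Drop missing values (None) from the raw record.
    return {key: value for key, value in district.items() if value is not None}

def predict_price(district):
    # Stand-in for our ML component: a made-up rule, not a trained model.
    district = dict(district)
    district["predicted_price"] = 50_000 * district["median_income"]
    return district

def decide_investment(district):
    # Stand-in for the downstream system that consumes our prediction.
    return district["predicted_price"] > 200_000

def run_pipeline(data, components):
    # Feed the output of each component into the next one.
    return reduce(lambda value, component: component(value), components, data)

worth_investing = run_pipeline(
    {"median_income": 5.0, "ocean_proximity": None},
    [clean, predict_price, decide_investment],
)
print(worth_investing)
```

Because each component only depends on the output of the one before it, any single stage could be swapped out without touching the others, which is the main appeal of the design.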

#### Given the information you have gathered, how would you frame the problem? Is it a regression problem or a classification problem? Should we use batch learning or online learning?

1. First of all, we have the labels of the houses. Therefore, we can conclude that it's a _**supervised learning problem**_.

2. Next, we can deduce that it's a _**regression**_ problem and not a classification problem because we're trying to estimate the (continuous) value of samples rather than (discrete) classes.

3. Last, we should be fine using a _**batch learning**_ system and not an online one because we're not going to be continuously receiving a flow of data and therefore don't need to frequently update our model.

<br><br>

## 2-2. Select a Performance Measure

_Next, we have to select a performance measure. This is going to be used to determine how well our model is performing._

For regression problems, the _**Root Mean Square Error (RMSE)**_ is a typical performance measure. The mathematical equation is as follows:

$$\text{RMSE}(\mathbf{X},\ h) = \sqrt{\frac{1}{m}\sum_{i = 1}^m\left(h(\vec{x}^{(i)}) - y^{(i)}\right)^2}$$

* $\vec{x}$ or $\mathbf{x}$ $\rightarrow$ one single sample (in this case house)
  * $\vec{x}^{(3)}$ $\rightarrow$ this means the 3rd sample
* $\mathbf{X} \rightarrow$ matrix containing all samples $\vec{x}$
* $m \rightarrow$ number of samples
* $h \rightarrow$ our system's prediction function (a.k.a. hypothesis)
* $y^{(i)} \rightarrow$ the label (price in this case) for the $i$th sample

Basically, we square the difference between each prediction and the actual price, average those squares over all $m$ samples, and take the square root.

The RMSE corresponds to the _Euclidean norm_ (i.e. the distance formula) of the vector of prediction errors, scaled by $1/\sqrt{m}$. The Euclidean norm is also called the $\ell_2$ norm ($\Vert\ \Vert_2$ or $\Vert\ \Vert$).
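As a quick sanity check of the formula, here is a minimal sketch in plain Python. The helper name `rmse` and the three prices are made up for illustration and are not taken from the housing dataset.

```python
import math

def rmse(predictions, labels):
    # Square each error, average over the m samples, then take the square root.
    m = len(labels)
    squared_errors = ((p - y) ** 2 for p, y in zip(predictions, labels))
    return math.sqrt(sum(squared_errors) / m)

# Made-up prices for three houses (not from the dataset).
labels = [200_000, 150_000, 300_000]
predictions = [210_000, 140_000, 330_000]

print(rmse(predictions, labels))
```

The result is in the same unit as the labels (dollars here), which is what makes the RMSE easy to interpret.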

<br>

Another well-known performance measure is the _**Mean Absolute Error (MAE)**_. The math equation is as follows:

$$\text{MAE}(\mathbf{X},\ h) = \frac{1}{m}\sum_{i = 1}^m\bigg|\ h(\vec{x}^{(i)}) - y^{(i)}\ \bigg|$$

The symbols used in the notation are the same as in the RMSE. The notable difference here is that we don't square the errors and take the square root; we simply take the absolute value of the difference between our prediction and the actual value.

The MAE corresponds to the $\ell_1$ norm ($\Vert\ \Vert_1$), sometimes called the Manhattan norm.
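The sketch below, again with made-up prices and a hypothetical helper name `mae`, shows why the absolute value matters: without it, errors of opposite sign would cancel each other out.

```python
def mae(predictions, labels):
    # Average of the absolute errors over the m samples.
    m = len(labels)
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / m

# Made-up prices: one prediction is 10,000 too high, the other 10,000 too low.
labels = [200_000, 150_000]
predictions = [210_000, 140_000]

mean_signed_error = sum(p - y for p, y in zip(predictions, labels)) / len(labels)
print(mean_signed_error)         # the signed errors cancel out
print(mae(predictions, labels))  # the absolute errors do not
```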

<br>

In general, the $\ell_k$ norm of a vector $\vec{x}$ containing $n$ elements is defined as:

$$\Vert \vec{x} \Vert_k = (\vert x_1 \vert^k + \vert x_2 \vert^k + \ldots + \vert x_n \vert^k)^{1/k}$$

### The higher the norm index, the more it focuses on large values and neglects small ones. This is why RMSE is more sensitive to outliers than MAE (because it emphasizes large values, it gives more weight to predictions that land far from the actual label).

The fact that the RMSE focuses more on outliers isn't necessarily a bad thing: when outliers are rare, the RMSE performs very well and is generally preferred.
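To see this effect numerically, here is a small sketch. The function name `lk_norm` and the error values are assumptions chosen for illustration. It computes the $\ell_k$ norm of a made-up error vector for several values of $k$ and reports how much of the sum inside the norm comes from the single outlier.

```python
def lk_norm(vector, k):
    # (|x_1|^k + |x_2|^k + ... + |x_n|^k)^(1/k)
    return sum(abs(x) ** k for x in vector) ** (1 / k)

# Made-up prediction errors: three small ones and one outlier.
errors = [1, 1, 1, 10]

for k in (1, 2, 10):
    norm = lk_norm(errors, k)
    # Share of the sum of k-th powers contributed by the outlier alone.
    outlier_share = abs(errors[-1]) ** k / sum(abs(x) ** k for x in errors)
    print(f"k={k}: norm={norm:.3f}, outlier share={outlier_share:.3f}")
```

As $k$ grows, the outlier's share of the sum approaches 1 and the norm approaches the largest absolute error, so the small errors barely register.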

<br><br>

## 2-3. Check the Assumptions

It's always important to keep track of the assumptions that we or others have made. For example, what if you find out that your company is actually going to take the prices of the houses and label them as "cheap," "medium," or "expensive"? Then what you have is not a regression problem but a classification problem. You don't want to find this out later on down the road.
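As a rough illustration of that scenario, the sketch below turns prices into categories. The function name `price_category` and both thresholds are hypothetical; nothing in the chapter specifies them.

```python
def price_category(price, cheap_below=150_000, expensive_from=300_000):
    # Thresholds are hypothetical, chosen only for illustration.
    if price < cheap_below:
        return "cheap"
    if price < expensive_from:
        return "medium"
    return "expensive"

prices = [120_000, 210_000, 450_000]  # made-up values
categories = [price_category(p) for p in prices]
print(categories)
```

If the downstream system only ever sees these categories, getting the exact price right no longer matters; getting the category right does, which is why the problem would need to be framed as classification.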

<br><br><br><br>

# 3. Get the Data

## 3-1. Create the Workspace

#### This section starts off with a walkthrough on how to install Python and other necessary modules (e.g. NumPy, pandas, etc.). It will be assumed that everyone has Python installed. If not, it's not hard to install, so please Google "how to install Python."

<br><br>

## 3-2. Download the Data

You can download the data manually from the official GitHub repository (https://github.com/ageron/handson-ml/tree/master/datasets), or you can write a small Python function that downloads it automatically.

The author asserts that the function is the better option if the data changes regularly and you need to continuously update your system. It's also useful if you're downloading the data on multiple devices. Here's the code to do that:

```Python
import os
import tarfile
import urllib.request


DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it doesn't exist yet.
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    # Download the archive into the target directory.
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # Extract the archive's contents next to it, then close the file.
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
```

More details can be found here:

[`os.path.join()`](https://docs.python.org/3/library/os.path.html#os.path.join)

[`os.path.isdir()`](https://docs.python.org/3/library/os.path.html#os.path.isdir)

[`urllib.request.urlretrieve()`](https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve)

[`tarfile.open()`](https://docs.python.org/3/library/tarfile.html#tarfile.open)

[`TarFile.extractall()`](https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.extractall)

[`TarFile.close()`](https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.close)
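To try the extraction steps of `fetch_housing_data()` without touching the network, one option is to build a tiny stand-in archive locally and extract it with the same `tarfile` calls. This is only a sketch: the file names (`sample.csv`, `sample.tgz`) and the CSV contents are made up and are not the real housing data.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a tiny stand-in archive (hypothetical file names and contents).
    csv_path = os.path.join(workdir, "sample.csv")
    with open(csv_path, "w") as f:
        f.write("longitude,latitude\n-122.23,37.88\n")
    tgz_path = os.path.join(workdir, "sample.tgz")
    with tarfile.open(tgz_path, "w:gz") as archive:
        archive.add(csv_path, arcname="sample.csv")

    # Same extraction steps as fetch_housing_data(), minus the download.
    extract_dir = os.path.join(workdir, "extracted")
    os.makedirs(extract_dir)
    with tarfile.open(tgz_path) as archive:
        archive.extractall(path=extract_dir)
    extracted_files = os.listdir(extract_dir)

print(extracted_files)
```

Opening the archive in a `with` block closes it automatically, so no explicit `close()` call is needed in this variant.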