## 07. A Machine Learning Project

#### Resources

[Machine Learning in SKL & Tensorflow (pdf)](./docs/Hands.Machine.Learning.Scikit.Learn.Tensorflow.5225.pdf#page=58)<br/>
[Machine Learning in SKL & Tensorflow (Repo)](https://github.com/ageron/handson-ml)<br/>

#### Modules

#### Getting Started

**Checklist**  

The basic steps you will go through when taking on an ML project are as follows:  
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

#### 1. Frame the Problem and Look at the Big Picture

The first question to ask is what exactly the business objective is; building a model is probably not the end goal. How does the company expect to use and benefit from this model? This matters because it determines how you frame the problem, which algorithms you select, which performance measure you use to evaluate your model, and how much effort you should spend tweaking it.

In this case we're going to build a model to predict a district’s median housing price. The model’s output will be **pipelined** into another Machine Learning system, along with many other signals.
This downstream system will determine whether it is worth investing in a given area or not. Getting this right is critical, as it directly affects revenue.  

The next question to ask is what the current solution looks like (if any). It will often give you a reference performance, as well as insights on how to solve the problem.

Then, you need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques? Before you read on, pause and try to answer these questions for yourself.

In this case, we have a typical supervised learning task since we are given labeled training examples (each instance comes with the expected output, i.e., the district’s median housing price). Moreover, it is also a typical regression task, since you are asked to predict a value. More specifically, this is a multiple regression problem since the system will use multiple features to make a prediction (it will use the district’s population, the median income, etc.). It is also a univariate regression problem, since it predicts a single value for each district; predicting several values per district would make it multivariate. Previously, you predicted life satisfaction based on just one feature, the GDP per capita, so that was a simple (single-feature) regression problem. Finally, there is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.

**Pipelines**

A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply. Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. 

Each component is fairly self-contained: the interface between components is simply the data store. This makes the system quite simple to grasp (with the help of a data flow graph), and different teams can focus on different components. Moreover, if a component breaks down, the downstream components can often continue to run normally (at least for a while) by just using the last output from the broken component. This makes the architecture quite robust. On the other hand, a broken component can go unnoticed for some time if proper monitoring is not implemented. The data gets stale and the overall system’s performance drops.
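
The robustness trade-off described above can be sketched in a few lines of Python. This is a minimal illustration, not how any real pipeline framework works: `DataStore`, `Component`, and the table names are hypothetical, the store is just an in-memory dictionary, and the components run one after another rather than asynchronously.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class DataStore:
    """Hypothetical in-memory stand-in for a shared data store."""
    tables: Dict[str, list] = field(default_factory=dict)

    def write(self, name: str, rows: list) -> None:
        self.tables[name] = list(rows)

    def read(self, name: str) -> Optional[list]:
        return self.tables.get(name)


@dataclass
class Component:
    """One pipeline stage: read from `source`, transform, write to `sink`."""
    name: str
    source: str
    sink: str
    transform: Callable[[list], list]

    def run(self, store: DataStore) -> bool:
        rows = store.read(self.source)
        if rows is None:
            return False  # nothing to process yet
        try:
            store.write(self.sink, self.transform(rows))
            return True
        except Exception:
            # On failure the sink keeps its last successful output,
            # so downstream stages keep running on stale data.
            return False


store = DataStore()
store.write("raw", [1.0, None, 3.0])

clean = Component("clean", "raw", "cleaned",
                  lambda rows: [r for r in rows if r is not None])
scale = Component("scale", "cleaned", "scaled",
                  lambda rows: [r * 10 for r in rows])

first_run = [c.run(store) for c in (clean, scale)]


def broken(rows: list) -> list:
    raise RuntimeError("upstream component is down")


# Simulate a breakdown of the first component while new raw data arrives.
clean.transform = broken
store.write("raw", [5.0, 6.0])
second_run = [c.run(store) for c in (clean, scale)]

print(first_run, second_run, store.read("scaled"))
```

In the second run the `scale` stage still succeeds, but only because it reprocesses the old `cleaned` table. Nothing in the output signals that the data is stale, which is why the boolean returned by `run` (or some equivalent health signal) needs to be monitored.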

**Selecting a Performance Measure**  

Your next step is to select a **Performance Measure**. A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, and it equals the standard deviation of the errors when they average out to zero. For example, if the errors are roughly normally distributed, an RMSE equal to 50,000 means that about 68% of the system’s predictions fall within $50,000 of the actual value, and about 95% of the predictions fall within $100,000 of the actual value. Equation 2-1 shows the mathematical formula to compute the RMSE.
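
For reference, the standard definition of the RMSE is given below. The notation is an assumption made here for illustration: $m$ is the number of instances, $\mathbf{x}^{(i)}$ is the feature vector of the $i$-th instance, $y^{(i)}$ is its label, and $h$ is the model's prediction function.

$$
\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^{2}}
$$

The following sketch computes this quantity with the Python standard library and checks the 68% / 95% rule of thumb on synthetic data. The prices and errors are made up: the errors are drawn from a zero-mean normal distribution with a standard deviation of 50,000, which is exactly the assumption under which the rule holds.

```python
import math
import random


def rmse(predictions, labels):
    """Root Mean Square Error between two equal-length sequences."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fraction_within(predictions, labels, tolerance):
    """Share of predictions whose absolute error is at most `tolerance`."""
    hits = sum(abs(p - y) <= tolerance for p, y in zip(predictions, labels))
    return hits / len(labels)


# Synthetic data (an assumption for illustration, not the housing dataset):
# "true" prices plus zero-mean Gaussian errors with standard deviation 50,000.
rng = random.Random(42)
labels = [rng.uniform(100_000, 500_000) for _ in range(50_000)]
predictions = [y + rng.gauss(0, 50_000) for y in labels]

score = rmse(predictions, labels)
within_1 = fraction_within(predictions, labels, score)
within_2 = fraction_within(predictions, labels, 2 * score)

print(f"RMSE: {score:,.0f}")
print(f"within 1 x RMSE: {within_1:.1%}")
print(f"within 2 x RMSE: {within_2:.1%}")
```

With errors that are skewed or heavy-tailed, the RMSE is still well defined, but the two percentages can differ noticeably from 68% and 95%.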

