## Hands-On ML with Scikit-Learn

### Chapter 2. End-to-End Machine Learning Project

1. Look at the big picture

2. Get the data

3. Explore and visualize the data to gain insights

4. Prepare the data for ML algorithms

5. Select a model and train it

6. Fine-tune your model

7. Present your solution

8. Launch, monitor and maintain your system


**Working with real data**:

* [OpenML.org](https://openml.org/)

* [Kaggle](https://www.kaggle.com/datasets)

* [PapersWithCode.com](https://paperswithcode.com/datasets)

* [UC Irvine ML repo](https://archive.ics.uci.edu/)

* [Amazon AWS datasets](https://registry.opendata.aws/)

* [TensorFlow datasets](https://www.tensorflow.org/datasets)

* [DataPortals (Meta-portal)](https://dataportals.org/)

* [Open Data Monitor (Meta-portal)](https://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex)

#### The Project: Housing Prices in California

##### 1. Look at the big picture

Use California census data to **build a model of housing prices in the state**. 

This data includes metrics such as the population, median income, and median housing price for each *block group* ('district') in California.

Your model **should learn from this data and be able to predict the median housing price** in any district, given all the other metrics.

**Frame the Problem**

* *Know the objective*:

The first question you and your customer should ask is: *How does the company expect to use and benefit from this model?*

This will help you frame the problem, select algorithms and performance measures, and decide how much effort to spend tweaking the model.

Your customer answers that your model's output (a prediction of a district's median housing price) will be fed to another ML system (see Figure 2-2). This downstream system will determine whether it is worth investing in a given area.

![](https://abhijitramesh.me/static/images/part2-learning-hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/Screenshot_2021-03-05_at_1.17.49_PM.png)

* *Know the current situation*:

The next question is: *What does the current situation look like?*

This will give you a reference for performance, as well as insights on how to solve the problem.

Your customer answers that district housing prices are currently estimated manually by experts: a team gathers up-to-date information about a district and estimates its median housing price using complex rules.

* *Design your solution*:

Finally, ask: *What kind of training supervision will the model need? Is it a classification, regression, or reinforcement learning task? Should I use batch learning or online learning techniques?...*

The solution required in this case is a typical **regression task**, since the model is asked to predict a value:

* Multiple regression: the model uses multiple features to make a prediction.

* Univariate regression: the model predicts only a single value for each district.

* Plain batch learning: there is no continuous flow of data coming into the system and no particular need to adjust to changing data rapidly. The data set is also small enough to fit in memory.
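To make the framing above concrete, here is a minimal sketch of a multiple, univariate regression trained in batch mode. It uses scikit-learn's `LinearRegression` on synthetic data; the three features, the weights, and the sample size are made up for illustration and are not the California census data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for district data (an assumption, not the real census data).
rng = np.random.default_rng(42)
n_districts = 200

# Multiple regression: each district is described by several features.
X = rng.normal(size=(n_districts, 3))

# Univariate regression: the target is a single value per district.
true_weights = np.array([3.0, -1.5, 0.5])  # hypothetical weights
y = X @ true_weights + 2.0 + rng.normal(scale=0.1, size=n_districts)

# Batch learning: the model is fit once on the whole in-memory training set.
model = LinearRegression()
model.fit(X, y)

# One predicted value for each of the first five districts.
predictions = model.predict(X[:5])
```

An online learner would instead be updated incrementally as new data arrives, which this problem does not call for.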

##### Pipelines

A *Pipeline* is a sequence of data processing components. Pipelines are very common in ML systems, since there is a lot of data to manipulate and many data transformations to apply.

Components typically run asynchronously.

* Each component pulls in a large amount of data, processes it and spits out the result in another data store.

* Then, some time later, the next component in the pipeline pulls in this data and spits out its own output.

* Each component is fairly self-contained: the interface between components is simply the data store.
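The idea that components only communicate through a data store can be sketched in a few lines. In this toy example the data store is a plain dictionary and the component names (`ingest`, `clean`, `summarize`) and sample rows are hypothetical; a real system would more likely use files or a database and run each component on its own schedule.

```python
from statistics import median

# Hypothetical stand-in for a shared data store.
data_store = {}

def ingest(store):
    """Write made-up raw district rows into the store."""
    store["raw"] = [
        {"district": "A", "income": 3.2},
        {"district": "B", "income": None},  # missing value
        {"district": "C", "income": 5.1},
    ]

def clean(store):
    """Read raw rows, fill missing incomes with the median, write the result."""
    rows = store["raw"]
    fill = median(r["income"] for r in rows if r["income"] is not None)
    store["clean"] = [
        {**r, "income": fill if r["income"] is None else r["income"]}
        for r in rows
    ]

def summarize(store):
    """Read cleaned rows and write a small summary."""
    incomes = [r["income"] for r in store["clean"]]
    store["summary"] = {
        "n_districts": len(incomes),
        "mean_income": sum(incomes) / len(incomes),
    }

# Run sequentially here for simplicity; each component only touches the store,
# so in practice they could run at different times.
for component in (ingest, clean, summarize):
    component(data_store)
```

Because no component calls another directly, any one of them can be replaced or rerun as long as it reads and writes the same keys in the store.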

**Select Performance Measures**:


