# [Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
by Aurélien Géron

## Chapter 2 - End-to-End Machine Learning Project

**Machine Learning Project Checklist** (see _Machine Learning Project Checklist.md_ for full breakdown)
1. Look at the big picture.
2. Get the data.
3. Explore and visualize the data to gain insights.
4. Prepare the data for machine learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

When learning, it is important to work with real-world datasets rather than artificial ones.
- See the _Real-world datasets_ list in README.md.

### Look at the Big Picture - Project Requirements
The project is to use California census data to build a model of housing prices in the state.
- Data includes metrics such as population, median income, and median housing price for each block group (**district**).
- The **goal** is to predict the median housing price in any district, given all the other metrics.

**System Design Overview**
- Supervised learning: the model is trained on fully labeled data.
- This is a regression task, since the model predicts a numeric value.
  - Multiple regression: the prediction uses multiple features (inputs).
  - Univariate regression: a single value is predicted for each district.
- The model can be trained using plain batch learning.
- **Performance Metric**
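  - Root mean square error (RMSE): the standard performance measure for regression, but sensitive to outliers because each error is squared before averaging.
  - Mean absolute error (MAE): less sensitive to outliers than RMSE, so it can be preferable when there are many outlier districts.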
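
To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. The helper names `rmse` and `mae` and all the price values are hypothetical, chosen only for illustration; they are not taken from the housing dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square each error, average, then take the root."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute errors."""
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical median house values and predictions, in dollars.
y_true = [200_000, 300_000, 400_000, 500_000]
y_pred_close = [210_000, 290_000, 410_000, 490_000]    # every error is 10,000
y_pred_outlier = [210_000, 290_000, 410_000, 400_000]  # one error is 100,000

# With uniform errors the two metrics agree.
print(rmse(y_true, y_pred_close), mae(y_true, y_pred_close))

# With a single large error, RMSE grows much faster than MAE.
print(rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```

In the second case the MAE is 32,500 while the RMSE is roughly 50,744, which shows how one large error dominates the squared average.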

In [1]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

In [5]:
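data_path = Path("data")


def load_housing_data():
    """Download and extract the housing dataset if needed, then load it as a DataFrame."""
    file_path = Path(data_path, "housing/housing.csv")
    # Download only when the CSV is not already cached locally.
    if not file_path.is_file():
        data_path.mkdir(parents=True, exist_ok=True)
        tarball_path = Path(data_path, "housing.tgz")
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            # filter="data" rejects unsafe archive members; it requires Python 3.12
            # or a recent security release of an older version.
            housing_tarball.extractall(path=data_path, filter="data")
    return pd.read_csv(file_path)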

housing = load_housing_data()
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
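
The five sample rows above are enough to sketch what "multiple regression, univariate prediction" means for this dataset: many input columns, one target column. The snippet below rebuilds those rows in memory so it runs without downloading anything. The variable names `sample`, `X`, and `y` are assumptions made for illustration, and splitting off the label this way is only one possible approach.

```python
import io

import pandas as pd

# The first five rows of the housing data, copied from the output above.
sample_csv = io.StringIO(
    "longitude,latitude,housing_median_age,total_rooms,total_bedrooms,"
    "population,households,median_income,median_house_value,ocean_proximity\n"
    "-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY\n"
    "-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY\n"
    "-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY\n"
    "-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY\n"
    "-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY\n"
)
sample = pd.read_csv(sample_csv)

# Multiple regression: every column except the target is an input feature.
X = sample.drop(columns="median_house_value")

# Univariate regression: a single value is predicted per district.
y = sample["median_house_value"]

print(X.shape)  # (5, 9): five districts, nine features
print(y.shape)  # (5,): one target value per district
```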
