# PROJECT

### Frame the problem

### Select a performance measure

* Root Mean Square Error (RMSE): the usual choice for regression problems
* Mean Absolute Error (MAE): preferable for regression when the data contains many outliers

Both measure the distance between two vectors: the targets and the predictions.
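
The difference between the two measures can be sketched with a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names `rmse` and `mae` and the sample values are made up for this example. Because RMSE squares the errors before averaging, a single large miss raises it more than it raises MAE.

```python
import numpy as np

def rmse(targets, predictions):
    """Root mean square error between two equal-length vectors."""
    errors = np.asarray(predictions, dtype=float) - np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(targets, predictions):
    """Mean absolute error between two equal-length vectors."""
    errors = np.asarray(predictions, dtype=float) - np.asarray(targets, dtype=float)
    return float(np.mean(np.abs(errors)))

# Made-up values, for illustration only: three exact predictions and one large miss.
targets = np.array([3.0, 5.0, 2.0, 7.0])
predictions = np.array([3.0, 5.0, 2.0, 15.0])

print(rmse(targets, predictions))  # 4.0
print(mae(targets, predictions))   # 2.0
```

Here the single error of 8 yields an RMSE of 4.0 but an MAE of only 2.0, which is why MAE is often preferred when outliers are frequent.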

### Check the assumptions

# GET THE DATA

### Create a dedicated workspace

### Download the data

In [1]:
# Download from a compressed csv file

import os
import tarfile
import urllib.request  # `import urllib` alone does not guarantee the request submodule is loaded
import pandas as pd

DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/ageron/handson-ml2/master/'
HOUSING_PATH = os.path.join('datasets', 'housing')
HOUSING_URL = DOWNLOAD_ROOT + 'datasets/housing/housing.tgz'

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    '''Create the datasets/housing directory in the workspace, download
    housing.tgz into it, and extract the archive there.
    '''
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, 'housing.tgz')
    urllib.request.urlretrieve(housing_url, tgz_path)
    # the context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

def load_housing_data(housing_path=HOUSING_PATH):
    '''Load the extracted csv file and return a DataFrame object
    containing all the data.
    '''
    csv_path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(csv_path)

### Take a quick look at the data structure

In [None]:
# call fetch_housing_data() once beforehand so that housing.csv exists locally
housing = load_housing_data()

# check first rows
housing.head()

# description of the data: total number of rows, each attribute's dtype and number of non-null values
housing.info()

# summary of each numerical attribute: count, mean, std dev, min, percentiles, max
housing.describe()

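# plot a histogram of each numerical attribute and study the distributions
# (the magic below only works in a Jupyter notebook; a trailing comment on the magic line can break it)
%matplotlib inline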
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20, 15))
plt.show()

### Create a test set

In [None]:
## --- METHOD 1

import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), len(test_set))

In [None]:
## --- METHOD 2
# Use a hash of each instance's identifier; put that instance in the test set if its hash
# is lower than or equal to 20% of the maximum hash value
# => ensures that the test set remains consistent across multiple runs, even if you refresh the dataset

from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# If the data has no id column, the row index can serve as the ID, but then
# make sure that new data is appended at the end and that no row is ever deleted.
# If that is not possible, build an identifier from stable features instead
housing_with_id = housing.reset_index()  # adds an 'index' column

train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'index')

In [None]:
## --- METHOD 3
# Scikit-Learn's built-in function: same as our split_train_test(), with extra features
# such as random_state to fix the random seed

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
## --- METHOD 4
# Stratified sampling: ensure that test set is representative of various values of a category
# See book for more details

from sklearn.model_selection import StratifiedShuffleSplit

# the split needs a categorical column to stratify on; these bins on
# median_income follow the book's example
housing['income_category'] = pd.cut(housing['median_income'],
                                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                    labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_category']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
    
# check proportions in the test set
strat_test_set['income_category'].value_counts() / len(strat_test_set)

# EXPLORE THE DATA TO GAIN INSIGHTS

### Visualize data

### Look for correlations

### Experiment with attribute correlations

# PREPARE THE DATA FOR ML ALGORITHMS

### Clean data

### Handling text and categorical attributes

### Custom transformers

### Feature scaling

Most common ways:
* **normalization** (also called min-max scaling): Scikit-Learn `MinMaxScaler` transformer; its `feature_range` param sets the target range
* **standardization**: Scikit-Learn `StandardScaler` transformer

Fit the scaler on the training data only, then use it to transform both the training set and the test set.
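
The fit-on-train, transform-both pattern can be sketched as follows. This is a minimal illustration on a made-up single-feature array (`X_train` and `X_test` are hypothetical names and values, not the housing data); it assumes Scikit-Learn and NumPy are installed.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single-feature data, for illustration only.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[0.0], [6.0]])

# Normalization: learn min and max from the training set only.
min_max = MinMaxScaler(feature_range=(0, 1))
X_train_norm = min_max.fit_transform(X_train)
X_test_norm = min_max.transform(X_test)  # reuses the training min and max

# Standardization: learn mean and standard deviation from the training set only.
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)  # reuses the training mean and std

print(X_train_norm.ravel())  # training values mapped into [0, 1]
print(X_test_norm.ravel())   # test values can fall outside [0, 1]
```

Because the scalers never see the test set during fitting, test values outside the training range land outside `feature_range`; that is expected and avoids leaking test-set information into preprocessing.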

### Transformation pipelines

# SELECT AND TRAIN A MODEL

### Train and evaluate on training set

### Better evaluation with cross-validation

# FINE-TUNE THE MODEL

### Grid Search

### Randomized Search

### Ensemble methods

### Analyse the best models and their errors

### Evaluate your system on the test set

# LAUNCH, MONITOR, MAINTAIN YOUR SYSTEM

# Try it out!