<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/ml/blob/main/mod3/mlflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/ml/blob/main/mod3/mlflow.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>




A complete ML procedure
---
_homl3 ch2_

1. Look at the big picture.
2. Get the data.
3. Explore and visualize the data to gain insights.
4. Prepare the data for machine learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, matplotlib as mpl
import sklearn as skl, sklearn.datasets as skds

📝 Practice: Where to find datasets?
---
- [Explore the List of datasets for machine-learning research on Wikipedia](https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research)
- We will go through a complete ML procedure with the `California Housing Prices` dataset from the StatLib repository


# Look at the big picture
- Goal: use California census data to build a model of housing prices
- The model should 
  - learn from this data 
  - be able to predict the median housing price in any district, 
    - given all the other metrics for that district, such as the population and median income

## Frame the Problem
- This is a typical `supervised learning task`
  - since the model can be trained with labeled examples
- This is a `multiple regression problem`
  - since the system will use multiple features to make a prediction
- It is also a `univariate regression problem`
  - since we are only trying to predict a single value for each district
- the data is small enough to fit in memory
  - so plain batch learning should do just fine

## Select a Performance Measure
Two typical performance measures for regression problems are
1. the root mean square error (RMSE) given $n$ samples

$\displaystyle \operatorname{RMSE}(\mathbf{X},h) = \sqrt{\frac{1}{n}\sum_{i=1}^n (h(\mathbf{x}^{(i)})-y^{(i)})^2}$

- $\mathbf{x}^{(i)}$ is the $i^{th}$ instance in the dataset
  - it is a vector of all the feature values
- $y^{(i)}$ is the label of $\mathbf{x}^{(i)}$, the desired output value for this instance
- $\mathbf{X}$ is the instance matrix, which contains each instance as a row
- $h$ is the ML model's prediction function
- $\hat{y}^{(i)} = h(\mathbf{x}^{(i)})$ is the predicted value for $\mathbf{x}^{(i)}$
  - with a prediction error of $\hat{y}^{(i)}-y^{(i)}$

2. the mean absolute error (MAE), also called the average absolute deviation

$\displaystyle \operatorname{MAE}(\mathbf{X},h) = \frac{1}{n}\sum_{i=1}^n \left|h(\mathbf{x}^{(i)})-y^{(i)}\right|$

### The distance between two vectors
- RMSE corresponds to the Euclidean norm, the ordinary distance between two vectors
  - also called the $\ell_2$ norm, noted $||\cdot||_2$ or just $||\cdot||$
- MAE corresponds to the $\ell_1$ norm, noted $||\cdot||_1$ or just $|\cdot|$
- the general $\ell_p$ norm is defined as $\displaystyle ||\mathbf{v}||_p=\left(\sum_{i=1}^m |v_i|^p\right)^\frac{1}{p}$
  - $\ell_0$ norm gives the number of $\mathbf{v}$'s nonzero components
  - $\ell_\infty$ norm gives $\mathbf{v}$'s maximum absolute component
- The higher the norm index $p$, the more the norm focuses on large values and neglects small ones
  - so RMSE is more sensitive to outliers than MAE
  - but when outliers are exponentially rare like in a bell-shaped distribution,
    - the RMSE generally performs well and is preferred
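
The relationship between the two measures and the vector norms can be checked numerically. The sketch below uses made-up labels and predictions (they are not from the housing dataset) and assumes that RMSE and MAE are the $\ell_2$ and $\ell_1$ norms of the error vector rescaled by $1/\sqrt{n}$ and $1/n$ respectively.

```python
import numpy as np

# Toy labels and predictions (made-up values, for illustration only)
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 6.0])
errors = y_pred - y_true  # prediction errors: [-1, 0, 4]
n = len(errors)

# Direct definitions
rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# The same measures expressed through norms of the error vector
rmse_from_l2 = np.linalg.norm(errors, ord=2) / np.sqrt(n)
mae_from_l1 = np.linalg.norm(errors, ord=1) / n

# l_0 counts the nonzero components, l_inf picks the largest absolute component
l0 = np.linalg.norm(errors, ord=0)
linf = np.linalg.norm(errors, ord=np.inf)

print(rmse, mae, l0, linf)
```

The single large error (4) dominates the RMSE more than the MAE, which is the outlier sensitivity described above.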

## Classification vs regression
- If the house prices are required to be partitioned into categories such as
  - expensive, medium and cheap
  - then this becomes a classification problem 
    - and predicting the exact price is unimportant
- In this ML flow, actual prices are needed so it is a regression problem
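
As a rough illustration of how the same target could be turned into class labels, the sketch below bins a few made-up prices with `pd.cut()`. The thresholds (150,000 and 300,000) are arbitrary assumptions chosen for the example, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up median house values in USD
prices = pd.Series([90_000, 150_000, 240_000, 500_001])

# Hypothetical category boundaries; each bin includes its right edge
price_cat = pd.cut(prices,
                   bins=[0, 150_000, 300_000, np.inf],
                   labels=["cheap", "medium", "expensive"])

print(price_cat.tolist())
```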

# Get and explore the data

In [None]:
# 1. Get the data
housing = pd.read_csv("../datasets/housing.csv")

In [None]:
# 2. peek first 5 rows in the dataset
housing.head()

In [None]:
# 3. get a quick description of the data
# pay attention to the attributes

housing.info()

In [None]:
housing["total_bedrooms"].isnull().sum()

- pay attention to missing data such as 
  - total_bedrooms has 207=20640-20433 null values
- pay attention to data types such as
  - ocean_proximity has a data type object
    - it is probably a categorical attribute based on the first 5 rows

In [None]:
# 4. find all categories of a categorical attribute
housing['ocean_proximity'].value_counts()

In [None]:
# 5. show a statistic summary of the numerical attributes
# a percentile indicates the value below which 
# a given percentage of observations fall

housing.describe()

In [None]:
# 6. find the value distribution of each numerical attribute
#   with histogram
#  A histogram shows the number of instances (on the vertical axis) 
#   that fall in a given value range (on the horizontal axis)
fig1, axes1 = plt.subplots(3,3,figsize=(12,8))
housing.hist(bins=50, ax= axes1);

- Do the values above  make sense?
  - median income attribute, only 0-15? what unit?
    - scaled and capped to between [0.5, 15]
    - unit: tens of thousands of dollars
    - ∴ the median income is between [$5000, $150,000]
  - The housing median age and the median house value were also capped
    - the median house value is our target attribute, or label
  - These attributes have very different scales
    - feature scaling is needed
  - many histograms are skewed right
    - need to transform these attributes to have more symmetrical and bell-shaped distributions
- How do we know the hidden information?
  - ask the dataset collectors and publishers

## Create a Test Set
- rule of thumb for splitting the dataset: 
  - 80% for training and 20% for testing
- Test set generation is a critical part of an ML project
  - but it is often neglected, which leads to poor or even useless ML models

In [None]:
# 1. shuffle then split data

def shuffle_and_split_data(data, test_ratio):
  shuffled_indices = np.random.permutation(len(data))
  test_set_size = int(len(data) * test_ratio)
  test_indices = shuffled_indices[:test_set_size]
  train_indices = shuffled_indices[test_set_size:]
  return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = shuffle_and_split_data(housing, 0.2)
len(train_set), len(test_set)

- Issues: 
  - the random shuffling will generate different training set and test set every time
  - over time, the ML model will see the whole dataset
- Tentative solutions:
  1. save the test set on the first run and then load it in subsequent runs
  2. fix the random number generator’s seed so that 
     - it always generates the same shuffled indices 
  - Problem: both solutions break the next time an updated dataset is fetched
- Common solution:
  - use each instance’s identifier to decide whether or not it should go in the test set
    - assuming instances have unique and immutable identifiers
    - here, these identifiers are like container ids
  - make these identifiers comparable such as hashing them
    - then let the test set contain the instances whose hashes are no larger than 20% of the maximum hash value
    - This ensures that the test set will remain consistent across multiple runs
      - even if the dataset is updated
- These purely random sampling methods are generally fine if the dataset is large enough
  - i.e. number of rows ≫ number of features
  - otherwise, a significant sampling bias could be introduced

In [None]:
# 2. use a hash to ensure that 
# the test set will remain consistent across multiple runs
from zlib import crc32

def is_id_in_test_set(identifier, test_ratio):
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
 
# The housing dataset does not have an identifier column. 
#  2.1 The simplest solution is to use the row index as the ID  
# But this requires that 
#   new data gets appended to the end of the dataset 
#   and that no row ever gets deleted
housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "index")  

#  2.2 Another solution is 
#   combining district’s latitude and longitude into an ID like
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "id")
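
The stability of a hash-based split under dataset growth can be checked with a small experiment. The sketch below is a standard-library-only variant of the idea: `in_test_set` is a hypothetical helper that hashes the 8-byte little-endian representation of an integer ID, and the ID ranges stand in for a dataset before and after 200 rows are appended.

```python
from zlib import crc32

def in_test_set(identifier, test_ratio):
    # crc32 returns a value in [0, 2**32); keep the IDs whose hash falls
    # in the lowest `test_ratio` fraction of that range
    id_bytes = identifier.to_bytes(8, "little", signed=True)
    return crc32(id_bytes) < test_ratio * 2**32

old_ids = range(1000)   # IDs of the original dataset
new_ids = range(1200)   # the same dataset after 200 rows were appended

old_test = {i for i in old_ids if in_test_set(i, 0.2)}
new_test = {i for i in new_ids if in_test_set(i, 0.2)}

# Every original instance keeps its assignment after the update
print(old_test == {i for i in new_test if i < 1000})
```

Because the decision depends only on the instance's own ID, no previously seen training instance can migrate into the test set when new rows arrive.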

- Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways
- `train_test_split()` is the simplest function among them
  - similar to the home-made `shuffle_and_split_data()` above
  - but with a couple of additional features
    - it has a `random_state` parameter for setting the random generator seed
    - it can split multiple datasets of the same size on the same indices
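
The second feature is handy when the predictors and the labels live in separate objects. A minimal sketch, using made-up toy data in which each label is ten times the `x1` feature so that the alignment is easy to verify:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: the label of each row is 10 * x1
features = pd.DataFrame({"x1": np.arange(10), "x2": np.arange(10) * 2})
labels = pd.Series(np.arange(10) * 10, name="y")

# Both objects are split on the same shuffled indices
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
print((y_test.values == X_test["x1"].values * 10).all())  # rows and labels still match
```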

In [None]:
# 3. split dataset with train_test_split
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
# the split randomness looks uniform and acceptable
test_set["total_bedrooms"].isnull().sum()/housing["total_bedrooms"].isnull().sum()

### Stratified sampling
- divides the population into homogeneous subgroups called strata
- then samples the right number of instances from each stratum
- guarantees that the test set is representative of the overall population
- In a dataset, 
  - it is important to have a sufficient number of instances for each stratum
  - otherwise, the estimate of a stratum’s importance may be biased
- `sklearn.model_selection` package provides a number of splitter classes
- Each splitter has a `split()` method that returns an iterator 
  - over different training/test splits of the same data
  - it yields the training and test indices, not the data itself
  - this is very useful for cross-validation
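
To make the index-yielding behaviour concrete, here is a small sketch with `StratifiedShuffleSplit` on made-up data (10 instances, 80% in class 0 and 20% in class 1). The data and the 50% test size are assumptions chosen so the class proportions are easy to read.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.zeros((10, 1))              # feature values are irrelevant for the split itself
y = np.array([0] * 8 + [1] * 2)    # 80% class 0, 20% class 1

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=42)
splits = list(splitter.split(X, y))   # each item is (train_indices, test_indices)

for train_idx, test_idx in splits:
    # the splitter yields integer positions; index the data yourself
    print("test indices:", test_idx, "class-1 share:", y[test_idx].mean())
```

Each test set keeps the 20% share of class 1, while the selected indices differ from split to split.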


In [None]:
# Stratified sampling
# 1. the US population is 51.1% females and 48.9% males, 
#   so a well-conducted survey in the US would try to maintain 
#   this ratio in the sample: 511 females and 489 males

# 1.1 the probability of getting a biased sample with 
#   <48.5% female or
#   >53.5% female
# The `cdf()` method of the binomial distribution gives us 
# the probability of `the number of females ≤ the given value`

from scipy.stats import binom

sample_size = 1000
ratio_female = 0.511
proba_too_small = binom(sample_size, ratio_female).cdf(485 - 1)
proba_too_large = 1 - binom(sample_size, ratio_female).cdf(535)
print(proba_too_small + proba_too_large)

In [None]:
# 1.2 Obtain the same result by simulation
np.random.seed(42)

samples = (np.random.rand(100_000, sample_size) < ratio_female).sum(axis=1)
((samples < 485) | (samples > 535)).mean()

In [None]:

# 2. Stratify the median incomes in the California housing dataset
#   create an income category attribute with five categories 
#   labeled from 1 to 5

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

In [None]:
# 2.1 visualize the income categories

axes2 = housing["income_cat"].value_counts().sort_index().plot.bar(rot=0, grid=True)
axes2.set_xlabel("Income category")
axes2.set_ylabel("Number of districts");

In [None]:
# 3. generate 10 different splits of the California housing dataset
#    stratified on the income categories

from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
strat_splits = []
for train_index, test_index in splitter.split(housing, housing["income_cat"]):
    strat_train_set_n = housing.iloc[train_index]
    strat_test_set_n = housing.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])

In [None]:
# 3.1 use the last split
strat_train_set, strat_test_set = strat_splits[-1]

In [None]:
# 3.2 a short way to get a single stratified split
#   with `train_test_split`

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

In [None]:
# 3.3 check if the stratified split worked as expected
#   check the similarity of this figure with the previous one

axes3 = (strat_test_set["income_cat"].value_counts() / len(strat_test_set)).sort_index().plot.bar(rot=0, grid=True);

In [None]:
# 3.4 compare the income category proportions in  
#   the overall dataset, 
#   the test set generated with stratified sampling, 
#   the test set generated using purely random sampling

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall %": income_cat_proportions(housing),
    "Stratified %": income_cat_proportions(strat_test_set),
    "Random %": income_cat_proportions(test_set),
}).sort_index()
compare_props.index.name = "Income Category"
compare_props["Strat. Error %"] = (compare_props["Stratified %"] /
                                   compare_props["Overall %"] - 1)
compare_props["Rand. Error %"] = (compare_props["Random %"] /
                                  compare_props["Overall %"] - 1)
(compare_props * 100).round(2)

The table shows
- the test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, 
- whereas the test set generated using purely random sampling is skewed

In [None]:
# 3.5 the income_cat column can be dropped if it is not used again
for set_ in (strat_train_set, strat_test_set):
  set_.drop("income_cat", axis=1, inplace=True)

# Explore and Visualize the Data to Gain Insights
- put the test set aside and explore the training set only
- it is better to work on a copy of the training set 
  - since various transformations will be tried out
- sample an exploration set if the training set is very large

In [None]:
# 1. make a copy of the original training set since it is small
housing = strat_train_set.copy()

## Visualizing Geographical Data

In [None]:
# 1 the data point distribution looks like California

fig4, axes4 = plt.subplots(figsize=(6,4))
housing.plot(ax=axes4, kind="scatter", x="longitude", y="latitude", grid=True);

In [None]:
# 2 visualize data point density with partial transparency
# What are those areas of high density?

fig5, axes5 = plt.subplots(figsize=(6,4))
housing.plot(ax=axes5, kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2);

In [None]:
# 3. visualize 
#   housing prices with pseudo-colors
#   district's population with circle sizes
# It shows the housing prices are very much related to 
#   the location and 
#   the population density

fig6, axes6 = plt.subplots(figsize=(10,7))
housing.plot(ax=axes6, kind="scatter", x="longitude", y="latitude", grid=True,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True,
             legend=True, sharex=False);

In [None]:
# 4. visualize house price and district's population on the map of California
filename = "../datasets/california.png"

housing_renamed = housing.rename(columns={
    "latitude": "Latitude", "longitude": "Longitude",
    "population": "Population",
    "median_house_value": "Median house value (ᴜsᴅ)"})

fig7, axes7 = plt.subplots(figsize=(10,7))
housing_renamed.plot(ax=axes7,
             kind="scatter", x="Longitude", y="Latitude",
             s=housing_renamed["Population"] / 100, label="Population",
             c="Median house value (ᴜsᴅ)", cmap="jet", colorbar=True,
             legend=True, sharex=False)

california_img = plt.imread( filename)
axis = (-124.55, -113.95, 32.45, 42.05)
axes7.axis(axis)
axes7.imshow(california_img, extent=axis);

## Look for [Correlations](https://en.wikipedia.org/wiki/Correlation) between features
- The correlation coefficient only measures linear correlations, such as
  - x goes up, y generally goes up/down
  - this has nothing to do with the slope
- It may completely miss out on nonlinear relationships
- Purposes
  - identify and clean outliers
  - identify and transform skewed distributions
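
Both caveats can be seen on synthetic data. In the sketch below (made-up values, not the housing data), two straight lines with very different slopes have the same correlation coefficient, while a perfect quadratic relationship has a coefficient of essentially zero.

```python
import numpy as np

x = np.linspace(-1, 1, 201)   # symmetric around 0

# Perfect linear relationships: the slope does not change the coefficient
r_steep = np.corrcoef(x, 5 * x)[0, 1]
r_shallow = np.corrcoef(x, 0.1 * x)[0, 1]

# Perfect but nonlinear relationship: y is fully determined by x
r_quadratic = np.corrcoef(x, x ** 2)[0, 1]

print(r_steep, r_shallow, r_quadratic)
```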

In [None]:
# 1. compute the standard correlation coefficient between every pair of attributes
#     also called Pearson’s r
corr_matrix = housing.corr(numeric_only=True)

In [None]:
# 1.2 find out how much each attribute correlates with the median house value
# Explain
#   - positive correlation, negative correlation and no correlation

corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
# 2. check for correlation between attributes 
#     using the Pandas scatter_matrix() function
# It plots every numerical attribute against every other numerical attribute. 
#    - Since there are now 11 numerical attributes, 
#       - you would get 11² = 121 plots
#    - we may choose a few attributes that seem 
#       - most correlated with the median housing value
# Note: Pandas displays a histogram of each attribute on the main diagonal

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]

fig8, axes8 = plt.subplots(4,4, figsize=(12, 8))
scatter_matrix(housing[attributes], ax=axes8);

- there is a strong positive correlation between `median_income` and `median_house_value`

In [None]:
# 3. zoom in the correlation between `median_income` and `median_house_value`
# there are several horizontal lines in the figure
# the most obvious horizontal line is at $500,000,
#   which is the price cap
# It is better to remove the corresponding districts from the data set

fig9, axes9 = plt.subplots(figsize=(6, 4))
housing.plot(ax=axes9, kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1, grid=True);

## Experiment with Attribute Combinations
- attribute combinations could be more meaningful than their attributes alone
- such as,
  - the number of rooms per household
  - the population per household
  - the ratio of the number of bedrooms to the number of rooms

In [None]:
# 1. Attribute combination

housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

In [None]:
# 2. verify the correlations between the attribute combinations and the target
# Are they more correlated to the target than their component attributes?
#  - bedrooms_ratio vs. total_rooms or total_bedrooms
#  - rooms_per_house vs. total_rooms or households
#  - people_per_house vs. population or households

corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

# Prepare the Data for Machine Learning Algorithms

In [None]:
# 0. revert to a clean training set
#   - separate the predictors and the labels

housing = strat_train_set.drop("median_house_value", axis=1) # no change in place
housing_labels = strat_train_set["median_house_value"].copy()

## Clean the data
- Three options for an attribute with some missing values
  1. drop the rows (districts) with missing values with `dropna()`
  2. remove the whole attribute with `drop()`
  3. fill the missing values with some values such as zero, the mean, the median, etc.
     - with `fillna()`
     - this is called `imputation`
- use `SimpleImputer` for option 3 instead of `fillna()`, because `SimpleImputer` has more features
  - it will store the median value of each feature
  - it is possible to impute missing values on 
    - the training set, the validation set, 
    - the test set, and any new data fed to the model
- the `sklearn.impute` package has more powerful imputers to replace missing values
  - `KNNImputer` uses the mean of the k-nearest neighbors’ values of that feature
  - `IterativeImputer` trains a regression model per feature to predict the missing values from the other features
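
A small side-by-side comparison may help show the difference between median and nearest-neighbor imputation. The array below is made-up toy data, and `n_neighbors=2` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy matrix with two missing entries
X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]])

# Median imputation: every NaN in a column gets that column's median
X_median = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: each NaN gets the mean of that feature over the 2 nearest rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_median)
print(X_knn)
```

With the median strategy all missing entries of a column receive the same value, whereas the KNN estimate depends on which rows are closest to the incomplete one.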

In [None]:
# show the rows that originally contained a NaN value
null_rows_idx = housing.isnull().any(axis=1)
housing.loc[null_rows_idx].head()

In [None]:
# option 1: drop the missing values

housing_option1 = housing.copy()
housing_option1.dropna(subset=["total_bedrooms"], inplace=True)  
housing_option1.loc[null_rows_idx].head()

In [None]:
# option 2: remove the whole attribute, i.e. column
housing_option2 = housing.copy()
housing_option2.drop("total_bedrooms", axis=1, inplace=True) 
housing_option2.loc[null_rows_idx].columns

In [None]:
# option 3: fill the missing values with some values such as zero, the mean, the median, etc.
housing_option3 = housing.copy()
median = housing["total_bedrooms"].median()
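# assign the result back; fillna(inplace=True) on a column selection may not modify the DataFrame
housing_option3["total_bedrooms"] = housing_option3["total_bedrooms"].fillna(median)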
housing_option3.loc[null_rows_idx].head()

In [None]:
# 2.1 Create an imputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median") # replace NaNs with the median of the attribute
# other strategies: mean, most_frequent, constant, etc.

In [None]:
# 2.2 strategy="median" only works on numerical attributes
# create a copy of the data with only the numerical attributes
housing_num = housing.select_dtypes(include=[np.number])

In [None]:
# 2.3 fit the imputer instance to the training data
imputer.fit(housing_num)

In [None]:
# 2.4 the imputer simply computed the median of each attribute and
#  stored the result in its statistics_ instance variable.
#  It is safer to apply the imputer to all the numerical attributes
#  preparing for any missing values in new data.
imputer.statistics_

In [None]:
# check the above results with
housing_num.median().values

In [None]:
# 2.5 use this “trained” imputer to transform the training set
#   by replacing missing values with the learned medians
X = imputer.transform(housing_num)

In [None]:
imputer.feature_names_in_

In [None]:
imputer.strategy

In [None]:
# 2.6 X is a NumPy array, 
# it can be wrapped in a DataFrame and 
#   recover the column names and index from housing_num.
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)

In [None]:
housing_tr.loc[null_rows_idx].head()

In [None]:
# 2.7 Predict outliers
from sklearn.ensemble import IsolationForest

isolation_forest = IsolationForest(random_state=42)
outlier_pred = isolation_forest.fit_predict(X)

In [None]:
outlier_pred

In [None]:
# 2.8 Drop outliers
# uncomment and run these lines if you want to drop outliers
#housing = housing.iloc[outlier_pred == 1]
#housing_labels = housing_labels.iloc[outlier_pred == 1]

## Handling Text and Categorical Attributes
- The values of categorical attributes are limited
- They can be encoded in integers with `OrdinalEncoder` class

In [None]:
# 1. There is only one text attribute `ocean_proximity` in housing

housing_cat = housing[["ocean_proximity"]]
housing_cat.value_counts()

In [None]:
# 2. `ocean_proximity` has just 5 different values
#  each of which represents a category, so it is a categorical attribute.
# Convert these categories from text to numbers 
#   since most machine learning algorithms prefer to work with numbers.
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

In [None]:
housing_cat_encoded[:5], housing_cat[:5]

In [None]:
ordinal_encoder.categories_

- Issue of `OrdinalEncoder`: ML algorithms will assume that two nearby values are more similar than two distant values
  - fine with ordered categories such as “bad”, “average”, “good”, and “excellent”
  - but bad with unordered categories such as USA state names
- `one-hot encoding` can fix this issue
  - implemented in sklearn class `OneHotEncoder`
  - it encodes the categories with a binary string
    - the length of the binary string equals the number of categories, 
    - each code has only one bit that is set to 1 for the coded category
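
The encoding itself is simple enough to sketch by hand with NumPy: indexing an identity matrix with integer category codes yields one row per instance with a single 1 in the column of its category. The three category names and the codes below are illustrative assumptions, not the output of the encoders in this notebook.

```python
import numpy as np

categories = ["INLAND", "NEAR BAY", "NEAR OCEAN"]  # hypothetical subset of categories
codes = np.array([0, 2, 1, 0])                     # integer code of each instance

# Row i of the identity matrix is the one-hot vector of category i
one_hot = np.eye(len(categories))[codes]

print(one_hot)
```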

In [None]:
# 3. encode `ocean_proximity` categories with `OneHotEncoder`
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

In [None]:
# By default, the `OneHotEncoder` class returns a SciPy sparse matrix,
# because it is a very efficient representation for matrices that contain mostly zeros:
#  it only stores the nonzero values and their positions.
housing_cat_1hot

In [None]:
# we can convert it to a dense array if needed by calling the `toarray()` method:
housing_cat_1hot.toarray()[:5] # compare it with housing_cat_encoded[:5]

In [None]:
# or specify that `OneHotEncoder` class returns a dense array
cat_encoder = OneHotEncoder(sparse_output=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot[:5]

In [None]:
cat_encoder.categories_

- Issue of `OneHotEncoder`
  - results in a large number of input features if the categorical attribute has a large number of possible categories
  - i.e. the binary code is very long
  - This may slow down training and degrade performance
- Possible solutions:
  - replace the categorical input with useful numerical features related to the categories
    - e.g. replace the `ocean_proximity` feature with the distance to the ocean
  - replace each category with a learnable, low-dimensional vector called an embedding

## Feature Scaling and Transformation
- Feature scaling is one of the most important transformations on data
  - Most ML algorithms don't perform well on numerical attributes with very different scales
  - they bias toward the features with large scales
- Two common feature scaling techniques:
  -  `min-max scaling` implemented in sklearn class `MinMaxScaler`
     - $\displaystyle v'=\frac{v-v_{min}}{v_{max}-v_{min}}$
  -  `standardization` in `StandardScaler`
     - $\displaystyle v'=\frac{v-v_{mean}}{σ_v}$
     - $σ_v$ is the standard deviation
- standardization 
  - does not restrict values to a specific range
  - is much less affected by outliers
  - cannot center a sparse matrix without breaking its sparsity, so set its `with_mean` hyperparameter to `False` to scale sparse data
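
The two formulas can be checked against the scikit-learn classes on a tiny made-up column. The sketch assumes the default settings of both scalers (a `[0, 1]` range for `MinMaxScaler`, and the population standard deviation for `StandardScaler`); the value 100 plays the role of an outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with an outlier (made-up values)
v = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Formulas from the text, written out by hand
min_max_manual = (v - v.min()) / (v.max() - v.min())
standard_manual = (v - v.mean()) / v.std()   # np.std uses the population formula (ddof=0)

# Same transformations with scikit-learn
min_max = MinMaxScaler().fit_transform(v)
standard = StandardScaler().fit_transform(v)

print(min_max.ravel())
print(standard.ravel())
```

In the min-max output the four ordinary values end up crowded near 0 because the outlier defines the top of the range.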

In [None]:
# 1. `min-max` scaling
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

In [None]:
# 2. `standardization` scaling
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

- A long-tailed distribution is generally transformed to be roughly symmetrical before training
  - otherwise both min-max scaling and standardization will squash most values into a small range
- Ways to shrink the heavy tail before scaling
  - replace the feature with its square root
    - or raise the feature to a power between 0 and 1
  - replace a feature that follows a [power law distribution](https://en.wikipedia.org/wiki/Power_law) with its logarithm
  - bucket the feature
    - chop its distribution into roughly equal-sized buckets, 
    - and replace each feature value with the index of the bucket it belongs to

In [None]:
# 1. replace the feature under a power law distribution
# by its logarithm

figa, axesa = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
housing["population"].hist(ax=axesa[0], bins=50)
housing["population"].apply(np.log).hist(ax=axesa[1], bins=50)
axesa[0].set(xlabel="Population", title='A long tail')
axesa[1].set(xlabel="Log of population", title='Close to normal distribution')
axesa[0].set_ylabel("Number of districts");

In [None]:
# 2. replace each value with its percentile
# Bucketizing with equal-sized buckets results in 
# a feature with an almost uniform distribution.

percentiles = [np.percentile(housing["median_income"], p)
               for p in range(1, 100)]
flattened_median_income = pd.cut(housing["median_income"],
                                 bins=[-np.inf] + percentiles + [np.inf],
                                 labels=range(1, 100 + 1))
figb, axesb = plt.subplots(1, 1, figsize=(5, 3))
flattened_median_income.hist(bins=50, ax=axesb)
axesb.set_xlabel("Median income percentile")
axesb.set_ylabel("Number of districts");

# Select a model and train it



# Fine-tune your model



# Present your solution



# Launch, monitor, and maintain your system

# References
- [sklearn user guide](https://scikit-learn.org/stable/user_guide.html)