# Machine Learning with Python

The aim of this lab is to implement a complete ML project in Python. We will go through the steps of data exploration, cleaning and transformation, followed by model training and selection.

The lab consists of two parts. The first part is an interactive tutorial adapted from [Aurélien Géron's excellent ML book](https://github.com/ageron/handson-ml2) with some modifications to make it a classification tutorial instead of a regression one. We will learn to classify neighborhoods by median house value. The tutorial contains conceptual questions as well as fill-in code that we'll give you a few minutes to write.

In the second part of this lab, we expect you to pick any dataset of your choosing and go through the steps mentioned above. You should try at least 3 different classes of models, and one of them should be an `xgboost` classifier or regressor. We recommend that you pick a dataset from [Kaggle](https://www.kaggle.com/). We will be very flexible in grading this lab; we just need to see that you've taken the right steps. You should submit your notebook to Gradescope. Submission instructions will be given on Piazza.

## Setup
Run the following code to install libraries and download required files.

In [None]:
import sys
assert sys.version_info >= (3, 8) # You might later get errors if you have a lower version.

# Install necessary libraries
!pip install scikit-learn matplotlib numpy pandas xgboost

# Do all necessary imports
import sklearn
import numpy as np
import pandas as pd
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import xgboost
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [None]:
# Download necessary files
import tarfile
import urllib.request

def perform_downloads():
    # Download data file
    root_url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
    housing_url = f"{root_url}datasets/housing/housing.tgz"
    tgz_path = "housing.tgz"
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path="./")
    

def to_classification():
    # The original dataset was for a regression task.
    # Transform it into a classification task.
    classes = ['LOW', 'MEDIUM', 'HIGH']
    housing = pd.read_csv('housing.csv')
    housing["house_value_class"] = pd.cut(housing["median_house_value"], bins=[0.0, 200000, 400000, np.inf], labels=classes)
    housing.drop('median_house_value', axis=1, inplace=True)
    housing.to_csv("housing_classification.csv", index=False)

    
    
perform_downloads()
to_classification()

## Preliminary
Before we begin, we should split our data into a training and a testing set. All our learning should be done on the training set. The testing set is going to be used only once after training.  
**Q: Why is this a good idea?**

You could manually implement the splitting logic, but Scikit-Learn has builtin functions that do this for us. There are many splitting methodologies, but the simplest one that works well for us is as follows:

In [None]:
# Import splitting function
from sklearn.model_selection import train_test_split

# Read data
data = pd.read_csv("housing_classification.csv")

# Split data. 80% is used as training; 20% as testing. The data will be randomly split.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=37)

print(f"data size={len(data)}; training size={len(train_set)}; testing size={len(test_set)}")
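
One of the other splitting methodologies is stratified splitting, which keeps the class proportions the same in both sets. The sketch below illustrates it with the `stratify` argument of `train_test_split` on a small made-up label array (the toy data and variable names are assumptions, not part of the housing dataset). Whether stratification is worth it for your own dataset is a judgment call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels (hypothetical data): 80 samples of class 0, 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the same class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=37, stratify=y
)

train_share = y_train.mean()  # fraction of class 1 in the training split
test_share = y_test.mean()    # fraction of class 1 in the testing split
print(f"class-1 share: train={train_share:.2f}, test={test_share:.2f}")
```

Both shares should match the 20% share of class 1 in the full toy data, which a purely random split does not guarantee.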

## Data Exploration

In [None]:
# Read data, and take a look at the attributes
housing = train_set # Rename for convenience
housing.head()

Each row contains housing data aggregated by *block group*, which is how the US officially reports housing data. Our goal is to predict the housing value class (low, medium or high) given information about a block. Let's take a look at the columns within the dataframe.

In [None]:
# Info gives a quick summary
housing.info()

First, note that the `total_bedrooms` column contains missing values.  
**Q: What techniques can we use to handle them?** We'll implement one of them in the *Data Preparation* section.

---

We also note that only the target class `house_value_class` and the feature `ocean_proximity` are categorical. In the following code, find the distinct values of both columns:

In [None]:
# TODO: Distinct values and count of house_value_class

# TODO: Distinct values and count of ocean_proximity.

**Q: What's the standard way of dealing with categorical features in ML algorithms?** We'll implement this in the *Data Preparation* section.

----

As for the numerical attributes, we can quickly summarize their distribution as follows:

In [None]:
housing.describe()

We see that some columns (e.g., `latitude` vs `population`) have very different ranges of values.  
**Q: What's the standard way to deal with such differences in numerical data distributions?** (This will also be implemented in the *Data Preparation* section.)

----

While these numbers are informative, plots are usually a lot more intuitive. Using [pandas.DataFrame.hist](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html), show a histogram for all the numerical columns.  
**Q: What stands out most to you?** Hint: Look at the `housing_median_age` column.

In [None]:
# TODO: Plot histograms (< 2min)
# Hint: Set figure size to (15, 15), and the number of bins to 50 for good formatting.

It's also possible to visualize multiple attributes together. Using [pandas.DataFrame.plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html), we can plot the relationship between total number of rooms and population.

In [None]:
housing.plot(kind="scatter", x="population", y="total_rooms", alpha=0.2)

Naturally, we see a fairly linear, positive trend. These two values strongly depend on one another.    
**Q: How can you reduce the number of features in the dataset using correlation?**   
Since we don't have too many features, we won't worry about this here, but you might have to in your project.
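
For your project, one possible way to spot nearly redundant features is to inspect the absolute correlation matrix. The sketch below uses synthetic data with hypothetical column names, and the 0.9 threshold is an arbitrary choice for illustration rather than a recommended value.

```python
import numpy as np
import pandas as pd

# Synthetic toy data (hypothetical columns, not the housing data).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_noisy_copy": 2.0 * a + rng.normal(scale=0.1, size=200),  # nearly redundant with "a"
    "b": rng.normal(size=200),                                   # unrelated to "a"
})

# Absolute pairwise Pearson correlations.
corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # arbitrary cut-off chosen for illustration
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Here only `a_noisy_copy` is flagged, since it carries almost the same information as `a`. Correlation only captures linear relationships, so treat the result as a hint rather than a rule.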

----

More interesting visualizations are possible (there will be a lab about them). For now, take a look at the following:  

**Q: What pattern does it show?** Hint: [Look at this map.](https://www.google.com/maps/place/California/@36.4998513,-121.7064629,6.55z/data=!4m5!3m4!1s0x808fb9fe5f285e3d:0x8b5109a227086f55!8m2!3d36.778261!4d-119.4179324)

In [None]:
colors = {'LOW': 'lightblue', 'MEDIUM': 'lightyellow', 'HIGH': 'red'}

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2, s=housing["population"]/100, label="population", figsize=(10,7), c=housing['house_value_class'].map(colors))

If interesting patterns exist in the dataset you choose, try showcasing them.

---

We are now done with data exploration. We know what the data looks like, and what preparation we need to do before moving on to training.

## Data Preparation

In [None]:
# First separate features from target variables.
features = housing.drop("house_value_class", axis=1)
target = housing["house_value_class"]

We need the following transformations to make our data work with ML algorithms:
1. One-hot encoding of the categorical features.
2. Median imputation of the numerical features.
3. Standardization of the numerical features.
4. Encoding of the class labels as integers (0, 1, 2 instead of LOW, MEDIUM, HIGH). This is what the ML libraries we'll use expect.

`sklearn` has builtin functions for all of these. It also has convenience functions to put them all together.

----

One hot encoding with `sklearn` works as follows.

In [None]:
# One hot encoding
from sklearn.preprocessing import OneHotEncoder

# Get columns to transform.
categorical_col_names = ['ocean_proximity'] # stored as a separate variable for later.
categorical = features[categorical_col_names]

# Do one hot encoding.
encoder = OneHotEncoder()
categorical_one_hot = encoder.fit_transform(categorical)
categorical_one_hot.toarray()
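
To make the output format concrete, here is a minimal sketch on a toy column. The category values are made up for illustration, and `handle_unknown="ignore"` is an optional setting not used in the cell above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (hypothetical values, not the real ocean_proximity categories).
toy = np.array([["coast"], ["inland"], ["coast"], ["island"]])

toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_one_hot = toy_encoder.fit_transform(toy).toarray()

# categories_ holds the learned categories, in the same order as the output columns.
print(toy_encoder.categories_[0])
print(toy_one_hot)

# A category never seen during fit becomes an all-zero row instead of raising an error.
unseen = toy_encoder.transform(np.array([["desert"]])).toarray()
print(unseen)
```

Each row has exactly one 1, in the column of its category. With the default `handle_unknown="error"`, the last call would raise an exception instead.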

As for imputation, `sklearn` has a [variety of implementations](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute) (corresponding to the techniques we described in class). We will stick to simple median imputation, but feel free to try something different in your custom project.

In [None]:
from sklearn.impute import SimpleImputer

# Get columns to transform.
numerical_col_names = [c for c in features.columns if c != 'ocean_proximity'] # stored as a separate variable for later.
numerical = features[numerical_col_names]

# Print missing count
print("Num missing before imputation:", np.count_nonzero(np.isnan(numerical)))

# Do transformation.
imputer = SimpleImputer(strategy="median")
imputed_numerical = imputer.fit_transform(numerical)

# Print missing count. Should be 0.
print("Num missing after imputation:", np.count_nonzero(np.isnan(imputed_numerical)))

Take a look at the documentation for [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) and use it to standardize the numerical columns in `imputed_numerical`.

In [None]:
from sklearn.preprocessing import StandardScaler

# TODO: Do scaling (~2 mins)
# Hint: Reuse numerical column names from above
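
As a reminder of what standardization computes, each column is shifted to zero mean and rescaled to unit standard deviation: $z = (x - \mu) / \sigma$. The sketch below checks this on a toy array (made-up values, not the housing data) and is not meant as the solution to the exercise above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical data (hypothetical values): two columns with very different ranges.
toy = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

# Manual standardization: subtract the column mean, divide by the column standard deviation.
# np.std uses the population standard deviation (ddof=0) by default, as StandardScaler does.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same computation with StandardScaler.
toy_scaler = StandardScaler()
scaled = toy_scaler.fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1 (up to floating-point error), regardless of their original ranges.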

`sklearn` has a simple API to chain all of these transformations together. Take a look at the documentation for [sklearn.pipeline.Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and [sklearn.compose.ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


# Categorical Pipeline
categorical_pipeline = Pipeline([
    ("one_hot_encoder", OneHotEncoder()),
])

# TODO live: Numerical Pipeline
numerical_pipeline = Pipeline([
])


# Full transformation
feature_transformer = ColumnTransformer([
    # Categorical
    ("categorical", categorical_pipeline, categorical_col_names),
    # TODO: Numerical
])

features_prepped = feature_transformer.fit_transform(features)
pd.DataFrame(features_prepped)

Finally, we need to encode the labels as integers (LOW=i, MEDIUM=j, HIGH=k) to satisfy `sklearn`.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode labels
label_encoder = LabelEncoder()
target_prepped = label_encoder.fit_transform(target)
pd.Series(target_prepped).unique()
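
Note that `LabelEncoder` assigns integers according to the sorted order of the labels, so the codes do not necessarily follow LOW < MEDIUM < HIGH. A small sketch on a toy label list (variable names are illustrative) shows how to inspect the mapping:

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["LOW", "MEDIUM", "HIGH", "LOW"]

toy_label_encoder = LabelEncoder()
encoded = toy_label_encoder.fit_transform(toy_labels)

# classes_[i] is the label that was assigned the integer i.
mapping = {str(label): code for code, label in enumerate(toy_label_encoder.classes_)}
print(mapping)
print(encoded)
```

Because the labels are sorted alphabetically, HIGH is encoded as 0, LOW as 1, and MEDIUM as 2. This is fine for a classification target, where the integers only serve as class identifiers.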

The data is now ready for ML models.

## Training and Validation
We'll now train and validate a few models to find the best performing one.
We start with a simple kind of model: a linear classifier. We will then give you a few minutes to try out gradient boosted trees.

Training a model is simple with `sklearn`.

In [None]:
from sklearn.linear_model import SGDClassifier

# Train model
lin_sgd = SGDClassifier(loss='hinge')
lin_sgd.fit(features_prepped, target_prepped)

Performing predictions is equally easy.

In [None]:
# Get some random input
sample = housing.sample(n=10, random_state=1)

# Transform input
sample_prepped = feature_transformer.transform(sample)
# Make prediction
prediction = lin_sgd.predict(sample_prepped)
# Decode prediction
label_encoder.inverse_transform(prediction)

---
Let's validate the efficacy of the linear model. `sklearn` provides builtin k-fold cross-validation.  
**Q: Briefly explain what k-fold cross-validation is and why it's used?**
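
As a rough sketch of the mechanics, the loop below reproduces by hand what `cross_val_score` does with a `KFold` splitter. It uses synthetic data and a `LogisticRegression` model purely for illustration; neither is part of the housing tutorial.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic toy data (not the housing data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
toy_model = LogisticRegression()
toy_kfold = KFold(n_splits=5, shuffle=True, random_state=37)

manual_scores = []
for train_idx, val_idx in toy_kfold.split(X):
    fold_model = clone(toy_model)                # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on the other k-1 folds
    # Validate on the held-out fold; score() returns accuracy for classifiers.
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

# The library helper runs the same procedure with the same splitter.
library_scores = cross_val_score(toy_model, X, y, cv=toy_kfold)
print(np.allclose(manual_scores, library_scores))
```

Every sample is used for validation exactly once, and the spread of the per-fold scores gives a sense of how stable the estimate is.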

In [None]:
from sklearn.model_selection import cross_val_score, KFold

# Create model
lin_sgd = SGDClassifier(loss='hinge')

# Do cross validation
# The cv parameter sets the splitting strategy; here, 10 shuffled folds.
kfold = KFold(n_splits=10, shuffle=True, random_state=37)
scores = cross_val_score(lin_sgd, features_prepped, target_prepped, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (scores.mean()*100, scores.std()*100))

We get a validation accuracy of ~$76\%$. This may or may not be acceptable depending on your application. Next, try `xgboost.XGBClassifier`.

In [None]:
from xgboost import XGBClassifier

# You'll need at least these flags
xgb_cls = XGBClassifier(eval_metric='mlogloss', use_label_encoder=False)

# TODO: Do cross validation with xgboost. Can be somewhat slow.

You should have obtained an accuracy better than before.  
**Q: Why do you think `xgboost` performs better than linear SGD?**

---

**Q: Given a model, how should we tune it to improve training performance?**
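
One common option is a cross-validated grid search over hyperparameters, for example with `sklearn.model_selection.GridSearchCV`. The sketch below runs on synthetic data, and the grid values are placeholders chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic toy data (not the housing data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: the values are placeholders, not recommendations.
param_grid = {"alpha": [1e-4, 1e-3, 1e-2], "penalty": ["l2", "l1"]}

search = GridSearchCV(
    SGDClassifier(loss="hinge", random_state=37),
    param_grid,
    cv=5,                # 5-fold cross-validation for every parameter combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

The search only ever sees the data passed to `fit`, so running it on the training set leaves the testing set untouched for the final evaluation.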

## Testing
We can now test our model to see how well it performs in the "real-world".  
**Q: Why can't we compare our models in the testing phase?**

Fill in the following code.

In [None]:
from sklearn.metrics import accuracy_score

# TODO: Do training on all training data.

# TODO: Implement the testing code here.

Thanks to k-fold cross-validation, you should see that your test error is fairly similar to your validation error. K-fold cross-validation lets us estimate performance on unseen data, as long as its distribution is reasonably similar to that of the training data.

# Go Your Own Way
Now pick any dataset of your choosing and roughly follow the steps above.

In [None]:
# ...