# Lab 02: Training and Evaluating an ML Model

After completing these exercises, you should be able to:

* Identify the components of an ML annotation project
* Split a dataset for use in an ML project
* Run a baseline ML algorithm on gold standard data

Written by: Dr. Stephen Wu

References: [Titanic on Kaggle](https://www.kaggle.com/competitions/titanic)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Install scikit-learn if it is not already installed
%pip install scikit-learn
import sklearn # machine learning algorithms

As before, we will load a dataset about who survived the Titanic disaster.

In [None]:
# Get the data (originally from Kaggle) from the Lab 01 folder
dl_train_url = "../lab01/train.csv"
dl_test_url = "../lab01/test.csv"

train_data = pd.read_csv(dl_train_url)
print(f"The Dataframe (matrix) has {train_data.shape[0]} rows "
  + f"and {train_data.shape[1]} columns.\n")
print("Here's what some rows of data look like:")
display(train_data.head())


## Exercise 2.1
Eventually, we want to _predict_ whether people `Survived` or not. What kind of an ML problem will this be? (Mark with an X)

    a) Clustering
    b) Classification
    c) Regression
    d) Structured Prediction
    e) Generation

## Exercise 2.2: Train & Evaluate an ML model
We'd like to evaluate how well a _baseline_ ML model works. Here, we'll prepare all the data and then evaluate a popular algorithm called a Random Forest. (Out of scope, but here's an [article](https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/) and a [video](https://youtu.be/v6VJ2RO66Ag?si=2PD6JQjg1LbPljAi) about Random Forests.)

The Titanic data has a pre-defined train-test split -- `test.csv` being separate from `train.csv`. (In real life, we would need to create this split ourselves.) We will only evaluate `test_data` at the very end.


In [None]:
# Load held-out test data now, but don't use it until the end
test_data = pd.read_csv(dl_test_url)

def print_relative_size_of_datasets(df1, df2):
  num_train_samples = df1.shape[0]
  num_test_samples = df2.shape[0]
  tot_samples = num_train_samples + num_test_samples
  print(f"The first dataset has {num_train_samples} samples "
        f"while the second dataset has {num_test_samples}; a")
  print(f"{num_train_samples/tot_samples*100:.2f}"
        f"/{num_test_samples/tot_samples*100:.2f}"
        " split of the data.")

print_relative_size_of_datasets(train_data, test_data)

But not all of the data is clean. Let's count the missing (`NaN`) values in each column.

In [None]:
# Count NaNs per column (wrapped in display() so the result is actually shown)
display(train_data.isna().sum())
display(train_data)

Let's do some quick-and-dirty processing to get rid of `NaN`s (not optimal, because it would be best to consider each variable one at a time) and throw away some columns that probably won't help in the classification (they're unique or nearly unique for each passenger).

In [None]:
# Forward-fill to eliminate NaNs
# (.ffill() replaces the deprecated fillna(method='ffill'))
train_data = train_data.ffill()
test_data = test_data.ffill()

# Drop some variables that probably won't be useful
train_data = train_data.drop(['Name', 'Ticket', 'Cabin'], axis=1)
test_data = test_data.drop(['Name', 'Ticket', 'Cabin'], axis=1)

display(train_data.head())
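As a point of comparison with forward-filling, here is a minimal sketch of handling each variable on its own terms: the median for numeric columns and the most frequent value for a categorical column. The toy DataFrame and its values are made up for illustration, and the choice of median/mode is an assumption, not the lab's prescribed method.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a few Titanic-like columns (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, np.nan, 8.05],
})

toy_filled = toy.copy()

# Numeric columns: fill NaNs with the column median
for col in ["Age", "Fare"]:
    toy_filled[col] = toy_filled[col].fillna(toy_filled[col].median())

# Categorical column: fill NaNs with the most frequent value (the mode)
most_common_port = toy_filled["Embarked"].mode()[0]
toy_filled["Embarked"] = toy_filled["Embarked"].fillna(most_common_port)

print(toy_filled)
```

In a real project, the fill values would normally be computed on the training data only and then reused for the validation and test data.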

## Data Splitting
In the meantime, we'll split the training dataset again, into a training and development/validation set. (Remember that we need a validation set so that we can set _hyperparameters_ before running an algorithm on the test data.)

We'll use the classic ML library `sklearn` for a utility to help us do this.

In [None]:
from sklearn.model_selection import ShuffleSplit

# Get indices for a split
split = ShuffleSplit(n_splits = 1, test_size = 0.2)
# Iterate through (only 1) split, setting train/val data
for train_indices, val_indices in split.split(train_data):
    # split() yields positional indices, so select rows with .iloc
    train_set = train_data.iloc[train_indices]
    val_set = train_data.iloc[val_indices]

display(train_set['Pclass'].value_counts() / train_set.shape[0] * 100)
display(val_set['Pclass'].value_counts() / val_set.shape[0] * 100)
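For a single random split like the one above, `sklearn` also provides the convenience function `train_test_split`. The sketch below uses a made-up toy DataFrame (the names `toy`, `toy_train`, and `toy_val` are hypothetical) so that it can run on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 30 rows with two Titanic-like columns
toy = pd.DataFrame({"Pclass": [1, 2, 3] * 10, "Survived": [0, 1] * 15})

# One shuffled 80/20 split; random_state makes the split repeatable
toy_train, toy_val = train_test_split(toy, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_val))
```

Setting `random_state` is optional, but it makes results reproducible from run to run.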

## Assignment 2a: Stratified sampling
If, for example, your training data is 70% men and 30% women, but your test data is 80% women and 20% men, your ML model may not perform well on the test data. This is called _sampling bias_. To help with this problem, we can try _stratifying_ the data according to variables of interest (e.g., `Sex`). This ensures that the training and validation sets have similar distributions of the `Survived`, `Pclass`, and `Sex` features, for a less biased model evaluation.

* Write an alternative split, stratifying the split according to `Survived`, `Pclass`, and `Sex`. Save the output as `strat_train_set` and `strat_val_set`. (Hint: pass in the columns you want to stratify via `y` in `StratifiedShuffleSplit`'s [`split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit.split).)
* Verify that the percentage of samples for each `Pclass` value and each `Sex` value is approximately the same in `strat_train_set` and `strat_val_set`. (Hint: use the DataFrame's `value_counts()` method.)
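To illustrate the idea of stratification (not as the assignment's solution), here is a minimal sketch on made-up toy data. It joins two columns into one label per row, which is one possible way to stratify on several variables at once and is an assumption of this sketch; all `toy_*` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (made up): 70% "male", 30% "female", two passenger classes
toy = pd.DataFrame({
    "Sex": ["male"] * 70 + ["female"] * 30,
    "Pclass": [1] * 20 + [3] * 50 + [1] * 10 + [3] * 20,
})

# Combine the stratification columns into a single label per row
strata = toy["Sex"] + "_" + toy["Pclass"].astype(str)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
toy_train_idx, toy_val_idx = next(sss.split(toy, strata))
toy_strat_train = toy.iloc[toy_train_idx]
toy_strat_val = toy.iloc[toy_val_idx]

# The percentages per category should be close in both subsets
print(toy_strat_train["Sex"].value_counts(normalize=True) * 100)
print(toy_strat_val["Sex"].value_counts(normalize=True) * 100)
```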

In [None]:
# <YOUR CODE HERE>



## Establishing a baseline
Besides splitting the data, we need a few more steps of data preparation before our machine learning algorithm can work.

1. We can get rid of columns that are unlikely to contribute to the predictions: see `.drop()` below.
2. We need to transform string data into categorical/numerical values (for the way some machine learning algorithms are optimized): see `pd.get_dummies()` below.

How to do this more reliably is another issue; for now, here are quick and dirty ways to do these.


In [None]:
X_train = pd.get_dummies(train_set.drop(["Survived"], axis=1))
y_train = train_set["Survived"]

X_val = pd.get_dummies(val_set.drop(["Survived"], axis=1))
y_val = val_set["Survived"]

display(X_train.head())
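One thing to watch for when calling `pd.get_dummies()` separately on two datasets: if a category is missing from one of them, the resulting columns will not match. The sketch below shows this on made-up toy data and aligns the columns with `reindex`; the `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy data (made up): the validation set has no "C" or "Q" in Embarked
toy_train = pd.DataFrame({"Sex": ["male", "female", "male"],
                          "Embarked": ["S", "C", "Q"]})
toy_val = pd.DataFrame({"Sex": ["female", "male"],
                        "Embarked": ["S", "S"]})

# One-hot encode the string columns of each set independently
toy_X_train = pd.get_dummies(toy_train)
toy_X_val = pd.get_dummies(toy_val)
print(list(toy_X_train.columns))
print(list(toy_X_val.columns))  # fewer columns than the training set

# Align validation columns to the training columns; missing dummies become 0
toy_X_val = toy_X_val.reindex(columns=toy_X_train.columns, fill_value=0)
print(list(toy_X_val.columns))
```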

We'll use a Random Forest with some pre-chosen settings as our baseline to classify between `Survived=0` and `Survived=1`.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=1)

clf.fit(X_train, y_train)
acc = clf.score(X_val, y_val)

print(f"Your {clf.__class__.__name__} predicts 'Survived'")
print(f" with a validation set accuracy of {acc*100:.2f}%")

Congratulations! You've trained and run your first ML algorithm!

## Assignment 2b: Evaluation practice
Now that we have a working ML classifier, let's look at the evaluation environment.

* ML algorithms often give different answers even with the same parameters. Write a loop that trains the same type of classifier 5 times and averages the scores. (Hint: vary or remove `random_state`.)
* ML algorithms have lots of options. Today, we're not focusing on what those options mean, but on how to test between them. Write a loop or other function that tests out the hyperparameters (options) for [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier): `n_estimators` and `max_depth`. Which values for each option give the best (averaged over 5) results?
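As a starting point for the first bullet, here is a minimal sketch of averaging validation accuracy over several random seeds. It uses synthetic data from `make_classification` rather than the Titanic data, and the helper name `mean_val_accuracy` and the chosen settings are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

def mean_val_accuracy(n_estimators, max_depth, n_runs=5):
    """Train n_runs forests with different seeds; return mean validation accuracy."""
    scores = []
    for seed in range(n_runs):
        forest = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=seed)
        forest.fit(X_tr, y_tr)
        scores.append(forest.score(X_va, y_va))
    return sum(scores) / len(scores)

toy_avg = mean_val_accuracy(n_estimators=50, max_depth=3)
print(f"Mean validation accuracy over 5 runs: {toy_avg:.3f}")
```

A helper like this could then be called inside a loop over candidate values of `n_estimators` and `max_depth`.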

In [None]:
# <YOUR CODE HERE>




In [None]:
average_scores

### Extra: Cross-validation
Cross-validation on a training set can be very helpful for finding the best hyperparameter values. In `sklearn` you can do this with less code.

Use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) and/or [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) to find the best values for hyperparameters.
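For orientation, here is a minimal sketch of how `GridSearchCV` is typically called. It runs on synthetic data, and the grid values, `cv=3`, and the name `toy_search` are arbitrary choices for illustration rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination in the grid is evaluated with 3-fold cross-validation
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 5]}
toy_search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy")
toy_search.fit(X, y)

print(toy_search.best_params_)
print(f"Best cross-validated accuracy: {toy_search.best_score_:.3f}")
```

After fitting, `best_estimator_` holds a model retrained on all the data passed to `fit()` using the best parameter combination.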

In [None]:
# <YOUR CODE HERE>



In [None]:
final_clf = grid_search.best_estimator_
final_clf.score(X_val, y_val)

# Extra: Held-out data evaluation on Kaggle
The Titanic data can actually be tested further on held-out data, which does not have gold standard labels (`Survived` = 0 or 1). To do this:

* Download this Jupyter notebook
* Sign up for [kaggle.com](https://www.kaggle.com)
* Visit the page for the [Titanic competition](https://www.kaggle.com/competitions/titanic), from which we borrowed this data and problem setting, and sign up to participate
* Upload this Jupyter notebook (Note: Kaggle has a native Jupyter notebook editor that's very similar to Google Colab)
* Change the input files to point to local competition files (i.e., `"/kaggle/input/titanic/train.csv"` and `"/kaggle/input/titanic/test.csv"`)
* Create the output file `submission.csv` below, and then click "Submit" to _run all the cells in the Notebook_, _re-create_ `submission.csv`, and submit that for scoring on the Competition's leaderboard.

In [None]:
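model = clf  # use the baseline Random Forest, or replace with your model from Assignment 2b
# Encode the test data the same way as train, and match the training column order
X_test = pd.get_dummies(test_data).reindex(columns=X_train.columns, fill_value=0)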

predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Now, on your Kaggle notebook page, click on the 3 vertical dots for options, go to `Submit to competition`, and click on `Submit`.

You've finished your first Kaggle submission! Check your `Score`, and then keep on learning how to code AI!

# Solutions
The solutions for this notebook can be found [here](https://colab.research.google.com/drive/13mV9dGzf4HIQh97zt6uFZOx5crDt1J1P?usp=sharing).