# Lab 02 (remixed): Training and Evaluating a Decision Tree

Answer the exercise questions.

**Objectives**: After completing these exercises, you should be able to:

* Identify the components of an ML annotation project
* Clean some data
* Code a decision tree

Written by: Dr. Stephen Wu

References: [Titanic on Kaggle](https://www.kaggle.com/competitions/titanic)

## Setting up the environment and Titanic data

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Install scikit-learn if not already installed, and run
%pip install scikit-learn
import sklearn # machine learning algorithms

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Get the Titanic data (originally from Kaggle)
dl_train_url = "../lab01/train.csv"
dl_test_url = "../lab01/test.csv"

train_data = pd.read_csv(dl_train_url)
test_data = pd.read_csv(dl_test_url)

The first dataset (`train_data`) has 891 samples and the second (`test_data`) has 418, a 68.07/31.93 split of the data.
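The percentages follow directly from the two row counts. As a quick check, here is a minimal sketch that uses the counts quoted in the text rather than reloading the CSV files (the variable names are only illustrative):

```python
# Row counts quoted in the text above
n_train, n_test = 891, 418
total = n_train + n_test

# Share of all samples that falls in each file, as a percentage
train_pct = 100 * n_train / total
test_pct = 100 * n_test / total
print(f"{train_pct:.2f}/{test_pct:.2f} split")
```

With the data loaded, `len(train_data)` and `len(test_data)` could be used in place of the hard-coded counts.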


### Exercise 2.1: Problem setup (GRADED)
Mark the correct answer with an `X`.

1. Eventually, we want to _predict_ whether people `Survived` or not. What kind of an ML problem will this be?

```
    a) Clustering
    b) Classification
    c) Regression
    d) Generation
```

2. What's the difference between `train_data` and `test_data`?

```
    a) `test_data` has extra variables that `train_data` doesn't
    b) `test_data` lacks the output variable
    c) `test_data` is used first to help the algorithm find patterns on real data
    d) `train_data` is 
```

3. The split between `train_data` and `test_data` in this problem enables you to do what kind of learning?

```
    a) Supervised learning
    b) Unsupervised learning
    c) Reinforcement learning
    d) Representation learning
```


### Exercise 2.2: Code interpretation (GRADED)
Write a comment using `# <WRITE A COMMENT>` at the end of each line. Hint: Check the documentation for [`DataFrame.fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna) and [`DataFrame.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html#pandas.DataFrame.drop).

In [8]:
# NaN stands for "Not a Number" and is a common way to represent missing data
# What if our machine learning algorithm doesn't know how to handle missing data?
print(train_data.isna().sum())
train_data = train_data.fillna(method='ffill')
test_data = test_data.fillna(method='ffill')

# Some variables aren't useful for prediction, or aren't easy to use
train_data = train_data.drop(['Name', 'Ticket', 'Cabin'], axis=1, errors='ignore')
test_data = test_data.drop(['Name', 'Ticket', 'Cabin'], axis=1, errors='ignore')

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64




## Decision Trees (pt 1): Data to test the algorithm
In class, we talked about decision trees. The pseudocode given (from the textbook AIMA 19.3) was as follows:

**function** LEARN-DECISION-TREE(_examples_, _attributes_, _parent\_examples_) **returns** a tree<br>
&nbsp;&nbsp;&nbsp;&nbsp;**if** _examples_ is empty **then return** PLURALITY-VALUE(_parent\_examples_)<br>
&nbsp;&nbsp;&nbsp;&nbsp;**else if** all _examples_ have the same classification **then return** the classification<br>
&nbsp;&nbsp;&nbsp;&nbsp;**else if** _attributes_ is empty **then return** PLURALITY-VALUE(_examples_)<br>
&nbsp;&nbsp;&nbsp;&nbsp;**else**<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;_A_ = argmax<sub>_a_ ∈ _attributes_</sub> IMPORTANCE(_a_, _examples_)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;_tree_ = a new decision tree with root test _A_<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**for each** value _v_ of _A_ **do**<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;_exs_ = {_e_ : _e_ in _examples_ **and** _e.A_ = _v_}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;_subtree_ = LEARN-DECISION-TREE(_exs_, _attributes_ - _A_, _examples_)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;add a branch to _tree_ with label (_A_ = _v_) and subtree _subtree_<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**return** _tree_


You will write this function in Python!

In [36]:
# Information Gain is the standard "Importance" for Decision Trees
# Take these for granted for now
def information_gain(attribute, examples):
    return entropy(examples) - remainder(attribute, examples)

def entropy(examples):
    num_survived = len(examples[examples['Survived'] == 1])
    num_died = len(examples[examples['Survived'] == 0])
    total = len(examples)
    if num_survived == 0 or num_died == 0:
        return 0
    p_survived = num_survived / total
    p_died = num_died / total
    return -p_survived * np.log2(p_survived) - p_died * np.log2(p_died)

def remainder(attribute, examples):
    total = len(examples)
    remainder = 0
    for value in examples[attribute].unique():
        exs = examples[examples[attribute] == value]
        remainder += len(exs) / total * entropy(exs)
    return remainder
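To build some intuition for these helpers, here is a minimal sketch on a made-up four-passenger frame. The `toy` data is hypothetical, and the three functions are restated compactly (with `groupby` instead of the loop above) only so that the block runs on its own; they are meant to compute the same quantities.

```python
import numpy as np
import pandas as pd

def entropy(examples):
    # Binary entropy (in bits) of the 'Survived' column
    p = (examples['Survived'] == 1).mean()
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def remainder(attribute, examples):
    # Weighted entropy that is left after splitting on `attribute`
    return sum(len(exs) / len(examples) * entropy(exs)
               for _, exs in examples.groupby(attribute))

def information_gain(attribute, examples):
    return entropy(examples) - remainder(attribute, examples)

# Hypothetical toy data: 'Sex' separates the classes perfectly, 'Pclass' not at all
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male'],
    'Pclass':   [1, 3, 1, 3],
    'Survived': [1, 1, 0, 0],
})

gain_sex = information_gain('Sex', toy)        # 1 bit: every branch is pure
gain_pclass = information_gain('Pclass', toy)  # 0 bits: each branch is still 50/50
print(gain_sex, gain_pclass)
```

An attribute with a higher information gain leaves less uncertainty in its branches, which is why the pseudocode picks the attribute with the largest IMPORTANCE.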

### Exercise 2.3: Decision tree implementation (GRADED)
Turn the pseudocode for LEARN-DECISION-TREE into a real Python function, calling the `information_gain()` function defined above.

You also need to define `plurality_value()`, which should return a `0` if there are more 0s left in the `examples`, or a `1` if more 1s are left in the `examples`.

In [41]:
def plurality_value(examples):
    # <YOUR CODE HERE>
    return 0 # Placeholder

def learn_decision_tree(examples, attributes, parent_examples):
    # <YOUR CODE HERE>
    return {"no attribute" : {}} # Placeholder


Now, let's test our code. At any given place in the tree (node), one of these 4 cases applies:

1. Examples at this node are all 1 (survivors) or all 0 (non-survivors).
2. Examples at this node are mixed between 1 and 0.
3. There are no examples at this node.
4. There are no attributes at this node.

Below, let's mimic each of the 4 cases of data. There are 4 "mini" data sets.

In [None]:
mini_data = train_data.head(10)

# Option A
parent_examples_A = mini_data[mini_data['Sex'] == 'female']
examples_A = mini_data[
    (mini_data['Sex'] == 'female') & (mini_data['Survived'] == 1)]

# Option B
examples_B = mini_data[mini_data['Age'] > 40]

# Option C
examples_C = mini_data[mini_data['Age'] > 90]

print("A")
print(examples_A)
print("B")
print(examples_B)
print("C")
print(examples_C)


A
   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
1            2         1       1  female  38.0      1      0  71.2833        C
2            3         1       3  female  26.0      0      0   7.9250        S
3            4         1       1  female  35.0      1      0  53.1000        S
8            9         1       3  female  27.0      0      2  11.1333        S
9           10         1       2  female  14.0      1      0  30.0708        C
B
   PassengerId  Survived  Pclass   Sex   Age  SibSp  Parch     Fare Embarked
6            7         0       1  male  54.0      0      0  51.8625        S
C
Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked]
Index: []


In [35]:
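# Attributes are the input columns; exclude the target column 'Survived'
attributes = mini_data.columns.drop('Survived')
output = learn_decision_tree(mini_data, attributes, None)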
print(output)

{'PassengerId': {np.int64(1): np.int64(0), np.int64(2): np.int64(1), np.int64(3): np.int64(1), np.int64(4): np.int64(1), np.int64(5): np.int64(0), np.int64(6): np.int64(0), np.int64(7): np.int64(0), np.int64(8): np.int64(0), np.int64(9): np.int64(1), np.int64(10): np.int64(1), np.int64(11): np.int64(1), np.int64(12): np.int64(1), np.int64(13): np.int64(0), np.int64(14): np.int64(0), np.int64(15): np.int64(0), np.int64(16): np.int64(1), np.int64(17): np.int64(0), np.int64(18): np.int64(1), np.int64(19): np.int64(0), np.int64(20): np.int64(1), np.int64(21): np.int64(0), np.int64(22): np.int64(1), np.int64(23): np.int64(1), np.int64(24): np.int64(1), np.int64(25): np.int64(0), np.int64(26): np.int64(1), np.int64(27): np.int64(0), np.int64(28): np.int64(0), np.int64(29): np.int64(1), np.int64(30): np.int64(0)}}


# FOR LATER (12 Mar 2025)
## Data Splitting
In the meantime, we'll split the training dataset again, into a training and development/validation set. (Remember that we need a validation set so that we can set _hyperparameters_ before running an algorithm on the test data.)

We'll use the classic ML library `sklearn` for a utility to help us do this.

In [None]:
from sklearn.model_selection import ShuffleSplit

# Get indices for a split
split = ShuffleSplit(n_splits = 1, test_size = 0.2)
# Iterate through (only 1) split, setting train/val data
for train_indices, val_indices in split.split(train_data):
    # ShuffleSplit yields positional indices, so index with .iloc
    train_set = train_data.iloc[train_indices]
    val_set = train_data.iloc[val_indices]

display(train_set['Pclass'].value_counts() / train_set.shape[0] * 100)
display(val_set['Pclass'].value_counts() / val_set.shape[0] * 100)

## Assignment 2a: Stratified sampling
If, for example, your training data is 70% men and 30% women, but your test data is 80% women and 20% men, your ML model may not perform well on the test data. This is called _sampling bias_. To help with this problem, we can try _stratifying_ the data according to variables of interest (e.g., `Sex`). Stratifying on 'Survived', 'Pclass', and 'Sex' keeps the distributions of those variables similar in the training and validation sets, so that evaluation on the validation set is less biased.
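To see the idea in isolation, here is a minimal sketch on hypothetical toy data, stratified on a single column with `train_test_split` (a different utility from the one this assignment asks for). The `toy` frame and its column `x` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70% 'male', 30% 'female'
toy = pd.DataFrame({'Sex': ['male'] * 70 + ['female'] * 30,
                    'x': range(100)})

# stratify= asks for the same label proportions in both parts of the split
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy['Sex'], random_state=0)

# Proportion of each 'Sex' value in each part
train_share = toy_train['Sex'].value_counts(normalize=True)
val_share = toy_val['Sex'].value_counts(normalize=True)
print(train_share, val_share, sep="\n")
```

Both parts keep the 70/30 proportion here. Without `stratify`, the proportions in a small validation set can drift noticeably from those of the full data.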

* Write an alternative split, stratifying the split according to 'Survived', 'Pclass', and 'Sex'. Save the output as `strat_train_set` and `strat_val_set`. (Hint: pass in the columns you want to stratify via `y` in `StratifiedShuffleSplit`'s [`split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit.split).)
* Verify that the percentage of samples for each 'Pclass' value and each 'Sex' value is approximately the same in `strat_train_set` and `strat_val_set`. (Hint: use DataFrame's `value_counts()` method.)

In [None]:
# <YOUR CODE HERE>



## Establishing a baseline
Besides splitting the data, we need a few more steps of data preparation before our machine learning algorithm can work.

1. We can get rid of columns that are unlikely to contribute to the predictions: see `.drop()` below.
2. We need to transform string data into categorical/numerical values (for the way some machine learning algorithms are optimized): see `pd.get_dummies()` below.

How to do this more reliably is another issue; for now, here are quick and dirty ways to do these.


In [None]:
X_train = pd.get_dummies(train_set.drop(["Survived"], axis=1))
y_train = train_set["Survived"]

X_val = pd.get_dummies(val_set.drop(["Survived"], axis=1))
y_val = val_set["Survived"]

display(X_train.head())
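To make the effect of `pd.get_dummies()` concrete, here is a small sketch on a hypothetical three-row frame (the `toy` values are made up). With default arguments, string columns are expanded into one indicator column per category, while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy data with one numeric and two string columns
toy = pd.DataFrame({'Pclass': [1, 3, 2],
                    'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'S']})

# One indicator column per category; 'Pclass' is left as-is
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Each new column is named `<original column>_<category>`, for example `Sex_female` and `Embarked_S`.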

We'll use a Random Forest with some pre-chosen settings as our baseline to classify between `Survived=0` and `Survived=1`.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=1)

clf.fit(X_train, y_train)
acc = clf.score(X_val, y_val)

print(f"Your {clf.__class__.__name__} predicts 'Survived'")
print(f" with a validation set accuracy of {acc*100:.2f}%")

Congratulations! You've trained and run your first ML algorithm!

## Assignment 2b: Evaluation practice
Now that we have a working ML classifier, let's look at the evaluation environment.

* ML algorithms often give different answers even with the same parameters. Write a loop that trains the same type of classifier 5 times and averages the scores. (Hint: vary or remove `random_state`.)
* ML algorithms have lots of options. Today, we're not focusing on what those options mean, but on how to test between them. Write a loop or other function that tests out the hyperparameters (options) for [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier): `n_estimators` and `max_depth`. Which values for each option give the best (averaged over 5) results?

In [None]:
# <YOUR CODE HERE>




In [None]:
average_scores

### Extra: Cross-validation
Cross-validation on a training set can be very helpful for finding the best values. In `sklearn` you can do this with less code.
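As an illustration of plain k-fold cross-validation (without any hyperparameter search), here is a minimal sketch using `cross_val_score`. It runs on synthetic stand-in data from `make_classification`, not on the Titanic set, and the model settings are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration, not the Titanic set)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)

# 5-fold CV: each fold is held out once while the model trains on the other four
scores = cross_val_score(model, X, y, cv=5)
print(len(scores), scores.mean())
```

Averaging the per-fold scores gives a more stable estimate than a single train/validation split, at the cost of training the model once per fold.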

Use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) and/or [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) to find the best values for hyperparameters.

In [None]:
# <YOUR CODE HERE>



In [None]:
final_clf = grid_search.best_estimator_
final_clf.score(X_val,y_val)

# Extra: Held-out data evaluation on Kaggle
The Titanic data can actually be tested further on held-out data, which does not have gold standard labels (`Survived` = 0 or 1). To do this:

* Download this Jupyter notebook
* Sign up for [kaggle.com](https://www.kaggle.com)
* Visit the page for the [Titanic competition](https://www.kaggle.com/competitions/titanic), from which we borrowed this data and problem setting, and sign up to participate
* Upload this Jupyter notebook (Note: Kaggle has a native Jupyter notebook editor that's very similar to Google Colab)
* Change the input files to point to local competition files (i.e., `"/kaggle/input/titanic/train.csv"` and `"/kaggle/input/titanic/test.csv"`)
* Create the output file `submission.csv` below, and then click "Submit" to _run all the cells in the Notebook_, _re-create_ `submission.csv`, and submit that for scoring on the Competition's leaderboard.

In [None]:
model = clf # use the baseline Random Forest or replace with Assignment 3
X_test = pd.get_dummies(test_data) # clean the test data in the same way as train

predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
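One thing to watch for: `pd.get_dummies()` builds its columns from whichever categories appear in the frame it is given, so the encoded test data can end up with different columns from the encoded training data. A hedged sketch of one way to line them up, using hypothetical toy frames and `reindex()`:

```python
import pandas as pd

# Hypothetical toy stand-ins: the "test" frame is missing the category 'Q'
toy_train = pd.DataFrame({'Fare': [7.25, 71.28, 8.05],
                          'Embarked': ['S', 'C', 'Q']})
toy_test = pd.DataFrame({'Fare': [7.83, 9.69],
                         'Embarked': ['S', 'C']})

X_toy_train = pd.get_dummies(toy_train)

# Reuse the training columns; categories absent from the test frame
# become all-zero columns, and any extra test-only columns are dropped
X_toy_test = pd.get_dummies(toy_test).reindex(
    columns=X_toy_train.columns, fill_value=0)

print(list(X_toy_test.columns))
```

Applied to this notebook, the same pattern would reindex the encoded test data against `X_train.columns` before calling `predict()`.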

Now, on your Kaggle page, click on the 3 vertical dots for options, go to `Submit to competition`, and click on `Submit`.

You've finished your first Kaggle submission! Check your `Score`, and then keep on learning how to code AI!

# Solutions
The solutions for this notebook can be found [here](https://colab.research.google.com/drive/13mV9dGzf4HIQh97zt6uFZOx5crDt1J1P?usp=sharing).