# Titanic with random forests

A look at the [Titanic](https://www.kaggle.com/competitions/titanic) Kaggle competition using random forests.

## Data

In [1]:
from fastai.imports import *

frames = pd.read_csv("data/titanic/train.csv")
test_frames = pd.read_csv("data/titanic/test.csv")
modes = frames.mode().iloc[0]

In [2]:
# Split frames into train and validation
from numpy import random
from sklearn.model_selection import train_test_split

random.seed(42)
train_frames, val_frames = train_test_split(frames, test_size=0.25)

In [3]:
# Let's list our variables and how we will treat them

cats = ["Sex", "Embarked"]  # Categorical
conts = ["Age", "SibSp", "Parch", "LogFare", "Pclass"]  # Continuous; note that Pclass is in here
dep = "Survived"  # Dependent

In [4]:
# Process the data
# Random forests don't need dummy variables, so we convert these columns to pandas categoricals
# A categorical is just an integer code with a lookup back to the string; we then replace the strings with those codes


def process_data(frames):
    frames["Fare"] = frames.Fare.fillna(0)
    frames.fillna(modes, inplace=True)
    frames["LogFare"] = np.log1p(frames.Fare)
    frames["Embarked"] = pd.Categorical(frames.Embarked)
    frames["Sex"] = pd.Categorical(frames.Sex)

    frames[cats] = frames[cats].apply(lambda x: x.cat.codes)


process_data(train_frames)
process_data(val_frames)
process_data(test_frames)

train_frames.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | LogFare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | 299 | 1 | 1 | Saalfeld, Mr. Adolphe | 1 | 24.0 | 0 | 0 | 19988 | 30.5 | C106 | 2 | 3.449988 |
| 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | 1 | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.05 | B96 B98 | 2 | 2.085672 |
| 247 | 248 | 1 | 2 | Hamalainen, Mrs. William (Anna) | 0 | 24.0 | 0 | 2 | 250649 | 14.5 | B96 B98 | 2 | 2.74084 |
| 478 | 479 | 0 | 3 | Karlsson, Mr. Nils August | 1 | 22.0 | 0 | 0 | 350060 | 7.5208 | B96 B98 | 2 | 2.14251 |
| 305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | 1 | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | 2 | 5.027492 |
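
One caveat with `process_data`: `pd.Categorical` infers its categories from whichever frame it is given, so the integer codes only line up across the train, validation and test frames when each of them contains the same set of values. That is probably the case for this data, but it is not guaranteed. A minimal sketch of one way to guard against it, assuming a hand-written category list (`EMBARKED_CATEGORIES` is a hypothetical name, and the two small frames are made up for illustration):

```python
import pandas as pd

# Assumed, hand-written category list (hypothetical, not from the notebook)
EMBARKED_CATEGORIES = ["C", "Q", "S"]

# Two made-up frames; the second one has no "C" or "Q" rows
a = pd.DataFrame({"Embarked": ["C", "S"]})
b = pd.DataFrame({"Embarked": ["S", "S"]})

# Inferred categories: "S" gets code 1 in `a` but code 0 in `b`
inferred_a = pd.Categorical(a.Embarked).codes
inferred_b = pd.Categorical(b.Embarked).codes

# Fixed categories: "S" gets code 2 in both frames
fixed_a = pd.Categorical(a.Embarked, categories=EMBARKED_CATEGORIES).codes
fixed_b = pd.Categorical(b.Embarked, categories=EMBARKED_CATEGORIES).codes

print(list(inferred_a), list(inferred_b))
print(list(fixed_a), list(fixed_b))
```

The same idea would apply to `Sex` with a two-item list.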


## Binary splits

Random forests are built from decision trees, and a decision tree is built from binary splits.

A binary split groups rows by whether they fall above or below some threshold in some column, e.g. male/female. In our dataset far more women (about 70%) survived than men (about 20%), so we could make a model that simply predicts that all women survive and all men do not.

How good would this model be?

In [5]:
from sklearn.metrics import mean_absolute_error


# Split a frame into independent (x) and dependent (y) variables, returned as a tuple
def get_xy(frame):
    x = frame[cats + conts].copy()
    y = frame[dep] if dep in frame else None
    return x, y


train_x, train_y = get_xy(train_frames)
val_x, val_y = get_xy(val_frames)

In [6]:
# Predict survival for women only (Sex code 0 is female). Apparently not that bad
preds = val_x.Sex == 0
mean_absolute_error(val_y, preds)

0.21524663677130046
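
Because both the labels and the predictions are 0/1 here, the mean absolute error is simply the fraction of rows the model gets wrong, so an error of about 0.215 corresponds to roughly 78.5% accuracy. A small sketch of that relationship, using made-up labels and predictions (toy values, not taken from the Titanic data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up labels and boolean predictions, for illustration only
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([True, False, True, True, False, False, True, False])

# Cast the booleans to 0/1 so both metrics see plain integers
y_pred_int = y_pred.astype(int)

mae = mean_absolute_error(y_true, y_pred_int)
acc = accuracy_score(y_true, y_pred_int)

# Two of the eight predictions are wrong, so the error is 2/8
print(mae, acc, np.isclose(mae, 1 - acc))
```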

What about splitting on a continuous column like LogFare? Did people who paid more survive more often?

In [7]:
# Less good but not that bad
thresh = 2.7  # Avg of survivors was 2.5
preds = val_x.LogFare > thresh
mean_absolute_error(val_y, preds)

0.336322869955157

In [8]:
# Let's find a way to score any split
# The score measures how similar the rows within each of the 2 groups are (lower is better)
# We take the std deviation of the dependent var on each side of the split, weighted by the number of rows on that side


def side_score(side, y):
    total = side.sum()
    if total <= 1:
        return 0

    return y[side].std() * total


def score(col, y, split):
    lhs = col <= split
    rhs = ~lhs

    return (side_score(lhs, y) + side_score(rhs, y)) / len(y)


# Lower is better: the ranking agrees with our previous manual results (Sex beats LogFare)
score(train_x["Sex"], train_y, 0.5), score(train_x["LogFare"], train_y, 2.7)

(0.40787530982063946, 0.47180873952099694)
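
To build some intuition for the score, it can help to run it on a tiny column where the answer is obvious. The sketch below repeats the two functions so that it runs on its own; the `col` and `y` values are invented for illustration. A split that separates the two classes perfectly scores 0, and a split that leaves one side mixed scores higher:

```python
import pandas as pd


def side_score(side, y):
    # Std deviation of y on one side, weighted by the number of rows on that side
    total = side.sum()
    if total <= 1:
        return 0
    return y[side].std() * total


def score(col, y, split):
    lhs = col <= split
    rhs = ~lhs
    return (side_score(lhs, y) + side_score(rhs, y)) / len(y)


# Made-up data: y flips from 0 to 1 once col goes above 3
col = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([0, 0, 0, 1, 1, 1])

perfect = score(col, y, 3)  # left side is all 0s, right side is all 1s
poor = score(col, y, 1)  # right side mixes 0s and 1s

print(perfect, poor)
```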

In [11]:
# Pop it in a GUI
from ipywidgets import interact
from fastai.vision.widgets import *


def getScore(nm, split):
    col = train_x[nm]
    return score(col, train_y, split)


interact(nm=conts, split=15.5)(getScore), interact(nm=cats, split=2)(getScore);


In [22]:
# That is fun but impractical. Let's write a function that finds the best split point for an independent var and tells us its score
def best_score(frames, col_name):
    # Pull the single column
    x = frames[col_name]
    y = frames[dep]

    # All possible split points
    split_points = x.dropna().unique()

    # Calc score at every split
    # NaNs were already dropped above, so no extra filtering is needed here
    scores = np.array([score(x, y, point) for point in split_points])

    # Find the best (min) score
    best_idx = scores.argmin()

    return split_points[best_idx], scores[best_idx]


best_score(train_frames, "Age")

(6.0, 0.478316717508991)

In [24]:
# We can calculate this for all of the columns
{name: best_score(train_frames, name) for name in cats + conts}

{'Sex': (0, 0.40787530982063946),
 'Embarked': (0, 0.47883342573147836),
 'Age': (6.0, 0.478316717508991),
 'SibSp': (4, 0.4783740258817434),
 'Parch': (0, 0.4805296527841601),
 'LogFare': (2.4390808375825834, 0.4620823937736597),
 'Pclass': (2, 0.46048261885806596)}

It turns out that our first pick, Sex, was the best split.

This is pretty much a reinvention of the [OneR classifier](https://link.springer.com/article/10.1023/A:1022631118932), a decision tree with a single split. The 1993 paper that introduced it found that such very simple rules performed surprisingly well against more complex models on many commonly used datasets.
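
For comparison, a single-split tree can also be fitted with scikit-learn. This is only a sketch on made-up data: `DecisionTreeClassifier` chooses its split by Gini impurity by default instead of the standard-deviation score used above, so it is a related technique and not the same calculation. The toy columns are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: column 0 plays the role of Sex (0 = female, 1 = male),
# column 1 is a numeric column that carries no information about y
X = np.array([[0, 1], [0, 2], [0, 1], [0, 2], [1, 1], [1, 2], [1, 1], [1, 2]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])

# max_depth=1 limits the tree to one binary split (a "stump")
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Node 0 is the root: which column it splits on, and at what threshold
split_feature = stump.tree_.feature[0]
split_threshold = stump.tree_.threshold[0]

print(split_feature, split_threshold)
print(stump.predict([[0, 1], [1, 2]]))
```

On this toy data the stump splits on column 0, which mirrors the "all women survive" rule from earlier.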