# Hands-on introduction to ML training
In this notebook, we will compare two classifiers: **Random Forests and K-Nearest Neighbours**. We will tackle the same Titanic survival prediction problem as in the last lesson to see which model performs better.

### Step 1: Load and explore data
The first step is to identify the data source. Here we use a pre-existing dataset. We will:
1. Create a folder named `data`
2. Download the file from a public GitHub repository using the Python package `requests`, and save it as `titanic.csv` in the `data` folder.

In [1]:
%config IPCompleter.greedy=True #Helps with auto-complete

import numpy as np
import pandas as pd
import os

try:
    os.mkdir('data')
except OSError as error:
    print(error)

import requests, csv

url = 'https://raw.githubusercontent.com/techno-nerd/ML_101_Course/main/05%20Decision%20Tree/data/titanic.csv'
r = requests.get(url)
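#newline='' is recommended by the csv module; without it, Windows writes an extra carriage return per row
#Note: splitting on ',' assumes that no field contains a comma
with open('data/titanic.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for line in r.iter_lines():
        writer.writerow(line.decode('utf-8').split(','))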

[Errno 17] File exists: 'data'
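
The `[Errno 17] File exists` message above only means that the `data` folder was already there from an earlier run. As a minimal sketch of an alternative (the helper name `save_text` is hypothetical and not part of the original notebook), `os.makedirs(..., exist_ok=True)` creates the folder silently, and writing the downloaded text as-is avoids re-splitting each row on commas:

```python
import os
import tempfile

def save_text(text, folder, filename):
    """Write text to folder/filename, creating the folder if needed."""
    # exist_ok=True means no error is raised when the folder already exists
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    # newline='' writes the text exactly as given, without translating line endings
    with open(path, 'w', newline='', encoding='utf-8') as f:
        f.write(text)
    return path

# Demo with a made-up two-line CSV written to a temporary folder;
# in the notebook, the text would be r.text from requests.get(url)
demo_folder = os.path.join(tempfile.mkdtemp(), 'data')
sample = "PassengerId,Survived\n1,0\n"
first = save_text(sample, demo_folder, 'titanic.csv')
second = save_text(sample, demo_folder, 'titanic.csv')  # second call does not fail
print(first == second)
```

Calling the helper twice on the same folder is the situation that triggered the error message in the original cell.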


In [2]:
df = pd.read_csv('data/titanic.csv')

In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    float64
 1   Survived     891 non-null    float64
 2   Pclass       891 non-null    float64
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    float64
 7   Parch        891 non-null    float64
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(7), object(5)
memory usage: 83.8+ KB
None


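This data set has some missing values, mainly in Age, Cabin and Embarked, which will be handled later. No column has more than 891 non-null values out of 893 entries, which suggests that two rows are completely empty.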

In [4]:
print(df[:5])

   PassengerId  Survived  Pclass  \
0          1.0       0.0     3.0   
1          2.0       1.0     1.0   
2          3.0       1.0     3.0   
3          4.0       1.0     1.0   
4          5.0       0.0     3.0   

                                                Name     Sex   Age  SibSp  \
0                             Braund Mr. Owen Harris    male  22.0    1.0   
1  Cumings Mrs. John Bradley (Florence Briggs Tha...  female  38.0    1.0   
2                              Heikkinen Miss. Laina  female  26.0    0.0   
3        Futrelle Mrs. Jacques Heath (Lily May Peel)  female  35.0    1.0   
4                            Allen Mr. William Henry    male  35.0    0.0   

   Parch            Ticket     Fare Cabin Embarked  
0    0.0         A/5 21171   7.2500   NaN        S  
1    0.0          PC 17599  71.2833   C85        C  
2    0.0  STON/O2. 3101282   7.9250   NaN        S  
3    0.0            113803  53.1000  C123        S  
4    0.0            373450   8.0500   NaN        S  


[Kaggle Dataset](https://www.kaggle.com/competitions/titanic/data) <br>
The data comes from a Kaggle competition. It records details about passengers aboard the Titanic, including whether they survived. We will use this data set to predict whether a person would have survived the sinking of the Titanic.

### Step 2: Data preparation

There are a few tasks we need to do before we can train the model on this data:
1. Replace the string values 'male' and 'female' in Sex with integers (1 and 0)
2. Handle categorical values for Embarked
3. Get rid of null values
4. Drop unnecessary columns like Name 

Then, we will split the data the same way as last time:
1. Split the data (712 rows after dropping rows with missing values) into a training set (80%) and a test set (20%)
2. Separate the input features (details about the passenger) from target variable ("Survived")

In [5]:
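#Assign the result back instead of using inplace=True on a selected column, which newer pandas versions may not apply to df
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})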

In [6]:
print(f"Total Passengers: {df.shape[0]}")
for i in df.columns:
    print(f"{i}: {sum(df[i].isnull())}")

Total Passengers: 893
PassengerId: 2
Survived: 2
Pclass: 2
Name: 2
Sex: 2
Age: 179
SibSp: 2
Parch: 2
Ticket: 2
Fare: 2
Cabin: 689
Embarked: 4
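
A small sketch, on made-up data, of counting missing values with vectorised pandas calls and checking for rows in which every column is missing (one possible explanation for each column above reporting at least 2 gaps). The toy frame is an assumption for illustration, not the Titanic data:

```python
import numpy as np
import pandas as pd

# Toy frame: the last row is entirely empty, and Age has one extra gap
toy = pd.DataFrame({
    'Survived': [0.0, 1.0, 1.0, np.nan],
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Embarked': ['S', 'C', 'S', None],
})

missing_per_column = toy.isnull().sum()            # one count per column
empty_rows = int(toy.isnull().all(axis=1).sum())   # rows where every value is missing

print(missing_per_column)
print(f"Completely empty rows: {empty_rows}")
```

On the notebook's `df`, the same two calls would show whether the 2 gaps shared by every column come from the same 2 rows.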


In [7]:
#Deleting all rows with missing values (except cabin, as that column will be removed)
df.dropna(subset=['Age', 'Embarked'], inplace=True)
print(df.shape[0])

712


In [8]:
print(df.dtypes)

PassengerId    float64
Survived       float64
Pclass         float64
Name            object
Sex            float64
Age            float64
SibSp          float64
Parch          float64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


In [9]:
#Keep the numeric features for now (Embarked and Pclass will be joined later after being one-hot encoded)

features = df[['Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
features[:5]

   Sex   Age  SibSp  Parch     Fare
0  1.0  22.0    1.0    0.0   7.2500
1  0.0  38.0    1.0    0.0  71.2833
2  0.0  26.0    0.0    0.0   7.9250
3  0.0  35.0    1.0    0.0  53.1000
4  1.0  35.0    0.0    0.0   8.0500


In [10]:
#One_hot encoding Passenger class
one_hot_class = pd.get_dummies(df['Pclass'])
features = features.join(one_hot_class)

print(df[:2])
features[:2]

   PassengerId  Survived  Pclass  \
0          1.0       0.0     3.0   
1          2.0       1.0     1.0   

                                                Name  Sex   Age  SibSp  Parch  \
0                             Braund Mr. Owen Harris  1.0  22.0    1.0    0.0   
1  Cumings Mrs. John Bradley (Florence Briggs Tha...  0.0  38.0    1.0    0.0   

      Ticket     Fare Cabin Embarked  
0  A/5 21171   7.2500   NaN        S  
1   PC 17599  71.2833   C85        C  


   Sex   Age  SibSp  Parch     Fare    1.0    2.0    3.0
0  1.0  22.0    1.0    0.0   7.2500  False  False   True
1  0.0  38.0    1.0    0.0  71.2833   True  False  False
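
The dummy columns above are named `1.0`, `2.0`, `3.0` and hold booleans, which is why later cells convert them to integers and rename the columns to strings. A hedged sketch, on toy data, of how `pd.get_dummies` could produce readable integer columns directly through its `prefix` and `dtype` parameters:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({'Pclass': [3, 1, 3, 2], 'Embarked': ['S', 'C', 'S', 'Q']})

# prefix gives string column names; dtype=int yields 0/1 instead of True/False
one_hot_class = pd.get_dummies(toy['Pclass'], prefix='Pclass', dtype=int)
one_hot_embark = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)

encoded = one_hot_class.join(one_hot_embark)
print(encoded.columns.tolist())
print(encoded.iloc[0].tolist())
```

In the notebook, Pclass is stored as a float, so the names would come out as `Pclass_1.0` and so on unless the column is cast to an integer type first.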


In [11]:
#One_hot encoding Embarked point
one_hot_embark = pd.get_dummies(df['Embarked'])
features = features.join(one_hot_embark)

print(df[:2])
features[:2]

   PassengerId  Survived  Pclass  \
0          1.0       0.0     3.0   
1          2.0       1.0     1.0   

                                                Name  Sex   Age  SibSp  Parch  \
0                             Braund Mr. Owen Harris  1.0  22.0    1.0    0.0   
1  Cumings Mrs. John Bradley (Florence Briggs Tha...  0.0  38.0    1.0    0.0   

      Ticket     Fare Cabin Embarked  
0  A/5 21171   7.2500   NaN        S  
1   PC 17599  71.2833   C85        C  


   Sex   Age  SibSp  Parch     Fare    1.0    2.0    3.0      C      Q      S
0  1.0  22.0    1.0    0.0   7.2500  False  False   True  False  False   True
1  0.0  38.0    1.0    0.0  71.2833   True  False  False   True  False  False


In [12]:
print(features.dtypes)

Sex      float64
Age      float64
SibSp    float64
Parch    float64
Fare     float64
1.0         bool
2.0         bool
3.0         bool
C           bool
Q           bool
S           bool
dtype: object


In [13]:
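for colName in features.columns:
    if features[colName].dtype == bool:
        print(colName)
        #Assign the converted column back; replace(..., inplace=True) on a selected column may not update the DataFrame
        features[colName] = features[colName].astype('int64')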

1.0
2.0
3.0
C
Q
S


In [14]:
print(features.dtypes)

Sex      float64
Age      float64
SibSp    float64
Parch    float64
Fare     float64
1.0        int64
2.0        int64
3.0        int64
C          int64
Q          int64
S          int64
dtype: object


In [15]:
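#scikit-learn expects all feature names to be strings; the Pclass dummy columns are named with floats (1.0, 2.0, 3.0)
features.columns = features.columns.astype(str)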
print(features.columns)

Index(['Sex', 'Age', 'SibSp', 'Parch', 'Fare', '1.0', '2.0', '3.0', 'C', 'Q',
       'S'],
      dtype='object')


In [16]:
import sklearn.model_selection as ms

labels = df['Survived']

train_features, test_features, train_labels, test_labels = ms.train_test_split(features, labels, test_size=0.2)
print(train_features.shape)
print(test_features.shape)
print(train_labels.shape)
print(test_labels.shape)

(569, 11)
(143, 11)
(569,)
(143,)
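
The split above is random, so the rows in each set (and the metrics later on) differ between runs. A minimal sketch, on made-up data, of making the split repeatable with `random_state` and keeping the class ratio equal in both parts with `stratify`; the value 42 is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 40% positive labels (made up for illustration)
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.mean())                    # share of positives in the test set
print(np.array_equal(X_test, X_test2))  # same split both times
```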


### Step 3: Model Selection and Training

Instead of a single Decision Tree, we will train two classifiers: a Random Forest and K-Nearest Neighbours.

In [17]:
from sklearn.ensemble import RandomForestClassifier

r_forest = RandomForestClassifier(min_samples_leaf=5, n_estimators=9)
r_forest = r_forest.fit(train_features, train_labels)

In [18]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn = knn.fit(train_features, train_labels)
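
K-Nearest Neighbours relies on distances between rows, so features with large ranges (such as Fare and Age) can dominate small ones (such as the 0/1 dummy columns). A minimal sketch, on synthetic data, of standardising features before KNN with a scikit-learn pipeline. Whether this helps on the Titanic split is an assumption that would need to be checked:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features: one informative 0/1 column and one large-scale noise column
flag = rng.integers(0, 2, size=200)
noise = rng.normal(0, 100, size=200)
X = np.column_stack([flag, noise])
y = flag  # the label depends only on the small-scale column

# The scaler is fitted on the training rows only, then applied before KNN
scaled_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=3, weights='distance'),
)
scaled_knn.fit(X[:150], y[:150])
predictions = scaled_knn.predict(X[150:])
print(predictions.shape)
```

Wrapping both steps in one pipeline keeps the test rows out of the scaler's fit, which matters for an honest evaluation.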

### Step 4: Model evaluation and tuning
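Because this is a classification problem, Root Mean Squared Error (which we used for linear regression) is not a suitable metric. Instead, we will use three metrics, treating class 1 (survived) as the positive class: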
1. Accuracy = total correct / total predictions
2. Precision = correct class 1 / total predicted class 1
3. Recall = correct class 1 / total number of class 1's
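
Written in terms of confusion-matrix counts, with class 1 as the positive class, the three metrics take the form below. The symbols TP, TN, FP and FN (true/false positives/negatives) are standard notation that the original notebook does not use:

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN}
\end{aligned}
```

For example, with TP = 3, TN = 4, FP = 1 and FN = 2, accuracy is 7/10 = 0.7, precision is 3/4 = 0.75 and recall is 3/5 = 0.6.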

In [19]:
RF_test_pred = r_forest.predict(test_features)
KNN_test_pred = knn.predict(test_features)

In [20]:
def ClassifierMetrics(labels, predictions):
    total = labels.size
    result = (labels == predictions)
    correct = result.sum()
    accuracy = (correct)/total

    #Precision (correct '1' prediction / total '1' prediction)
    precision = (result[predictions == 1.0].sum()) / (predictions == 1.0).sum()

    #Recall = (correct '1' predictions / total number of '1's)

    recall = (result[predictions == 1.0].sum()) / (labels == 1.0).sum()

    return [accuracy, precision, recall]
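
The hand-written function divides by the number of class-1 predictions, so it would likely produce NaN if a model never predicted class 1. As a cross-check, a minimal sketch using the equivalent functions from `sklearn.metrics` on made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels and predictions: TP=3, TN=4, FP=1, FN=2
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
predictions = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])

accuracy = accuracy_score(labels, predictions)
# zero_division=0 returns 0 without a warning when the denominator is zero
precision = precision_score(labels, predictions, zero_division=0)
recall = recall_score(labels, predictions, zero_division=0)

print(accuracy, precision, recall)
```

These calls treat class 1 as the positive class by default, matching the definitions used in this notebook.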

In [21]:
RF_test_metrics = ClassifierMetrics(test_labels, RF_test_pred)
print("TEST Metrics:")
print(f"Accuracy: {RF_test_metrics[0]}")
print(f"Precision: {RF_test_metrics[1]}")
print(f"Recall: {RF_test_metrics[2]}")

TEST Metrics:
Accuracy: 0.8251748251748252
Precision: 0.8461538461538461
Recall: 0.7213114754098361


In [22]:
KNN_test_metrics = ClassifierMetrics(test_labels, KNN_test_pred)
print("TEST Metrics:")
print(f"Accuracy: {KNN_test_metrics[0]}")
print(f"Precision: {KNN_test_metrics[1]}")
print(f"Recall: {KNN_test_metrics[2]}")

TEST Metrics:
Accuracy: 0.6993006993006993
Precision: 0.6666666666666666
Recall: 0.5901639344262295


### Step 5: Model visualisation

It is possible to visualise each tree in the Random Forest, but we will not go over that in this notebook. 
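
For readers who want a quick look anyway, a minimal sketch of printing one tree of a forest as text with `sklearn.tree.export_text`. The data and the feature names `f0` to `f3` are synthetic placeholders, and the small `n_estimators` and `max_depth` values are chosen only to keep the output short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic data with made-up feature names, standing in for the Titanic features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']

forest = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0).fit(X, y)

# Each fitted tree in the forest is available through estimators_
first_tree = forest.estimators_[0]
tree_text = export_text(first_tree, feature_names=feature_names)
print(tree_text)
```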
