<img src="https://github.com/datastrider/titanic_svm/blob/main/titanic_comp_pic_rf.jpg?raw=true" ></img>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Table of Contents</p>
1. [Import Modules](#import_modules)
2. [Load Data](#load_data)
3. [Data Exploration](#data_exploration)
4. [Data Cleaning](#data_cleaning)
5. [Data Transformation](#data_transformation)
6. [Model Training](#model_train)
7. [Kaggle Submission](#kaggle_sub)<br>
    7.1 [Cleaning and Processing Test Data](#clean_test_data)<br>
    7.2 [Create submission csv](#create_submission_csv)<br>

<a class="anchor" id="import_modules"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Import Modules</p>


In [None]:
import os
import time
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score


from skopt import BayesSearchCV

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a class="anchor" id="load_data"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Load Data</p>

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

In [None]:
train_data.head()

<a class="anchor" id="data_exploration"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Data Exploration</p>

In this section, the loaded data is explored to gain an understanding of the different attributes and their datatypes. Checks are made for irregular entries in the attributes that are clearly categorical, and for missing data.

In [None]:
train_data.describe()

In [None]:
# Mean of each numeric attribute, grouped by survival outcome
train_data.select_dtypes("number").pivot_table(index="Survived")

<b>How do Age and Fare correlate with survival?</b>

Below, we can see that there was not a large disparity in survival rate across ages: the median age of passengers who survived and of those who did not was the same. Fare shows a clearer difference, with passengers who paid more being more likely to survive.

In [None]:
fig, axes = plt.subplots(1,2)
fig.set_figwidth(25)
fig.set_figheight(8)

check_cols = ["Age", "Fare"]
sns.set_style("dark")
for i in range(len(check_cols)):
    
    sns.kdeplot(data=train_data.loc[train_data["Survived"] == 1, check_cols[i]],
                  ax=axes[i],
                  label="Survived",
                  color='blue',
                  fill=True)  # 'fill' replaces the deprecated 'shade' argument

    sns.kdeplot(data=train_data.loc[train_data["Survived"] == 0, check_cols[i]],
                  ax=axes[i],
                  label="Did not survive",
                  color='red',
                  fill=True)

    # plot vertical lines
    axes[i].axvline(train_data.loc[train_data["Survived"] == 1, check_cols[i]].median(),
                   color='blue')

    axes[i].axvline(train_data.loc[train_data["Survived"] == 0, check_cols[i]].median(),
                   color='red')
    
    # plot annotations of values corresponding to the vertical lines
    axes[i].annotate("Median {} Survived: {}".format(check_cols[i],
                                                        train_data.loc[train_data["Survived"] == 1, check_cols[i]].median()),
                                                        xy=(0.45, 0.95),
                                                        xycoords='axes fraction',
                                                        fontsize=15)
    
    axes[i].annotate("Median {} Not Survived: {}".format(check_cols[i],
                                                         train_data.loc[train_data["Survived"] == 0, check_cols[i]].median()),
                                                         xy=(0.45, 0.90),
                                                         xycoords='axes fraction',
                                                         fontsize=15)
    
    axes[i].title.set_text("{} and Survival".format(check_cols[i]))
    axes[i].legend()
plt.show()
plt.close()
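
The medians annotated on the plots can also be read off as a small table. The sketch below assumes a made-up toy frame (`toy`, with invented values) in place of `train_data`; only the column names match the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for train_data (values are invented)
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [22.0, 30.0, 40.0, 19.0, 30.0, 35.0],
    "Fare": [7.25, 8.05, 13.0, 26.0, 53.1, 71.3],
})

# Median of each attribute, split by survival outcome
medians = toy.groupby("Survived")[["Age", "Fare"]].median()
print(medians)
```

In this toy example the median age is identical for both groups while the median fare differs, which is the same pattern the plots are used to check for.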

<b>How do Pclass and Sex correlate with survival?</b>

We can see that passengers in a lower class (Pclass, a proxy for socio-economic status) were more likely to have died than those in a higher class. This could be because passengers of a higher class were given priority on the lifeboats, and/or because they were able to spend more. It was shown earlier that those who paid a higher fare were also more likely to have survived than those who paid a lower fare.

<b>These plots suggest that *'Age'*, *'Fare'*, *'Pclass'* and *'Sex'* are important attributes to consider when deciding whether a passenger was likely to have survived or not.</b>

In [None]:
fig, axes = plt.subplots(1,2)
fig.set_figwidth(15)
fig.set_figheight(5)
sns.countplot(x="Pclass", hue="Survived", data=train_data, palette=["#F34D4D", "#2B72D9"], ax=axes[0])
axes[0].set_xticks([1,2,3])
sns.countplot(x="Sex", hue="Survived", data=train_data, palette=["#F34D4D", "#2B72D9"], ax=axes[1])
axes[0].set_title("Pclass and Survival")
axes[1].set_title("Sex and Survival")
plt.show()
plt.close()

In [None]:
print(train_data.isnull().sum())

<a class="anchor" id="data_cleaning"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Data Cleaning</p>

In this section, the data is tidied up by removing unnecessary data and dealing with missing values. After viewing the training data, it is clear that there are missing values, as well as attributes with awkward values that may not be useful when creating the classification model.

## Dealing With Missing Values

There are several approaches one can take when dealing with missing values. A popular approach is simply deleting the observations that contain missing values. From the earlier observations, the useful attributes that contain missing data are 'Age' and 'Embarked'.

**Age**: <br>
The approach taken to deal with missing values was to replace them with the median of all the age values present.

<b>Why the median?</b>

Below, a histogram is plotted showing the distribution of ages of passengers. From the graph, it can be seen that the data is right-skewed, meaning that the distribution has a long right tail. In this case, the median is a better choice than the mean, as the mean is pulled further towards extreme values when the data is skewed. The mean age is higher than the median age, as the two plotted vertical lines show. Thus, the missing values will be replaced with the median of the values in the *'Age'* column.
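
The effect of a long right tail on the two statistics can be illustrated with a small sketch. The values below are invented for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, one is very large
ages = pd.Series([20, 22, 24, 25, 26, 28, 30, 80], dtype=float)

mean_age = ages.mean()      # pulled towards the long right tail
median_age = ages.median()  # stays near the bulk of the data
print("mean:", mean_age, "median:", median_age)

# Filling a missing value with the median of the values that are present
with_gap = pd.Series([20.0, None, 30.0, 80.0])
filled = with_gap.fillna(with_gap.median())
print(filled.tolist())
```

Here the single large value lifts the mean well above the median, so a mean-based fill would place imputed passengers further from the typical age than a median-based fill.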

**Embarked**: <br>
The approach taken to deal with missing values was to delete entire observations where the 'Embarked' attribute was missing.


In [None]:
sns.set(rc={"figure.figsize":(15,5)}) # set size of figure plotted
sns.set_style("dark")
sns.histplot(train_data["Age"], kde=True, bins=20, color="teal")
plt.axvline(train_data["Age"].median(), c="red", label="Median Age: {:.1f}".format(train_data["Age"].median()))
plt.axvline(train_data["Age"].mean(), c="blue", label="Mean Age: {:.1f}".format(train_data["Age"].mean()))
plt.legend()
plt.suptitle("Age of Passengers: Right Skewed", fontsize=20)
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

In [None]:
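# .copy() makes the filtered frame independent, avoiding SettingWithCopyWarning
train_data = train_data[train_data['Embarked'].notna()].copy()
# Assign the result back instead of using a chained inplace fillna
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())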

## Removing Attributes
There are some attributes that do not appear to be useful when constructing models to make predictions: "PassengerId", "Name", "Ticket" and "Cabin". They are removed from the final training dataset.

In [None]:
train_data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1, inplace=True)
train_data.head(n=10)

<a class="anchor" id="data_transformation"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Data Transformation</p>

## Category Encoding

Using string data with a Random Forest classifier produces the ValueError: 'could not convert string to float'. This can be resolved by encoding the different categories. One way of doing this is assigning each category value an integer value, and incrementing by 1 until all categorical values are accounted for. An example with our data: <br>

*'male'* -> 1 <br>
*'female'* -> 2 <br>

This solution, however, only suits ordinal data. As these values are not ordinal, integer codes would imply an order between the categories that does not exist.

### Solution
A different approach is needed, where a new column for each value is created, and a 1 or 0 is assigned to its relevant attribute. Example:

| Survived | Pclass| male | female |
| --- | --- | --- | --- |
| 0 | 3 | 1 | 0 |
| 1 | 1 | 0 | 1 |
| 1 | 3 | 0 | 1 |

This will also need to be done with the *'Embarked'* attribute. *'Pclass'* is left alone, because it is ordinal data: its integer values already reflect a meaningful order.
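
For comparison, pandas offers `pd.get_dummies` for this kind of encoding. The sketch below is an alternative to the manual loop used in this notebook, not the method used here, and runs on a made-up toy frame mirroring the example table. Note that `get_dummies` prefixes the new columns with the original column name by default.

```python
import pandas as pd

# Hypothetical toy frame mirroring the example table above
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# One new 0/1 column per category; the original 'Sex' column is dropped
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)
print(encoded)
```

The resulting columns are `Sex_female` and `Sex_male`, and *'Pclass'* keeps its integer values untouched.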

In [None]:
vals_sex = train_data["Sex"].unique() # variations and NaN values have been checked and removed
vals_embarked = train_data["Embarked"].unique()

for v in vals_sex:
    train_data[v] = (train_data["Sex"] == v).astype(int)

for v in vals_embarked:
    train_data[v] = (train_data["Embarked"] == v).astype(int)

train_data.drop("Sex", axis=1, inplace=True)
train_data.drop("Embarked", axis=1, inplace=True)

train_data.head(n=10)

<a class="anchor" id="model_train"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Model Training</p>

For this problem, a random forest is used. Hyper-parameter tuning is carried out with Bayesian optimisation (`BayesSearchCV`).

In [None]:
clf = RandomForestClassifier(random_state=123)

params = {
    "n_estimators": (10, 200),
    "max_depth": (4, 15),
    "min_samples_split": (2, 10),
    "max_features": (0.3, 1.0)
}

opt = BayesSearchCV(clf,
                    params,
                    cv=10,
                    random_state=123,
                    verbose=0)

A train/test split is created to gain an understanding of how well the classifier performs. The entire training dataset will be used for the final submission to Kaggle.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_data.iloc[:, 1:],
                                                    train_data.iloc[:,0],
                                                    train_size=0.8,
                                                    random_state=123)
print(X_train.shape)
print(y_train.shape)

Optimising the hyper-parameters of the RandomForestClassifier.

A warning is produced when running this. It occurs when the optimiser proposes a point it has already evaluated, which tends to happen as it converges on a set of points. <br>

*"**UserWarning**: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated ""*

In [None]:
time_start = time.time()

_ = opt.fit(X_train, y_train)

time_end = time.time()

print("Optimisation Time: {}".format(time_end - time_start))

Score the optimised RandomForestClassifier on the test set (mean accuracy).

In [None]:
print(opt.score(X_test, y_test))

Print the confusion matrix to better understand the prediction results. <br>

The top-left to bottom-right diagonal shows the correct predictions. The off-diagonal cells show incorrect predictions, with rows giving the true class and columns the predicted class. All the numbers add up to the total number of observations in the test dataset.

In [None]:
mat = confusion_matrix(y_test, opt.predict(X_test))  # arguments are (y_true, y_pred)

plt.figure(figsize = (6,4))
sns.heatmap(mat, annot=True, fmt="d")  # fmt="d" prints whole-number counts
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
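
To make the layout of a confusion matrix concrete, here is a minimal sketch on invented labels. `confusion_matrix` takes the true labels first and the predictions second, so each row corresponds to a true class and each column to a predicted class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, invented for illustration
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

mat = confusion_matrix(y_true, y_pred)
print(mat)

correct = mat.trace()  # sum of the diagonal: correct predictions
total = mat.sum()      # all cells: number of observations
print(correct, "of", total, "predictions are correct")
```

Swapping the two arguments transposes the matrix, which leaves the diagonal unchanged but exchanges the two kinds of error.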

View best parameters after optimisation

In [None]:
for k in opt.best_params_:
    print(k, ":", opt.best_params_[k])

In [None]:
plt.figure(figsize=(8,5))
plt.barh(train_data.columns[1:], opt.best_estimator_.feature_importances_)
plt.suptitle("Feature Importance", size=20)
plt.title("Shows the importance of features with optimised hyper-parameters")
plt.show()
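
The importances can also be inspected as a sorted table rather than a plot. The sketch below uses synthetic data with made-up column names (`signal`, `noise`) and a small forest fitted only for illustration. It is not the tuned model from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends on 'signal' only, 'noise' is irrelevant
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=200),
                  "noise": rng.normal(size=200)})
y = (X["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort, largest first
importances = pd.Series(model.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```

The importances of a fitted forest sum to 1, so each value can be read as a share of the total.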

<a class="anchor" id="kaggle_sub"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Kaggle Submission</p>

<a class='anchor' id='clean_test_data'></a>
## Cleaning and Processing Test Data

The testing dataset has to be prepared so that it resembles the training dataset in structure, and so that unexpected values are dealt with.

***All steps taken with checking, data-cleaning and data-transforming on the training data are repeated below on the test data***

In [None]:
check_cols = ['Pclass', 'Sex', 'Embarked']

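# .copy() makes the filtered frame independent, avoiding SettingWithCopyWarning
test_data = test_data[test_data['Embarked'].notna()].copy()
# Fill with the median, as was done for the training data
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())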

test_id = test_data["PassengerId"].copy()

test_data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1, inplace=True)

vals_sex = test_data["Sex"].unique() # variations and NaN values have been checked and removed
vals_embarked = test_data["Embarked"].unique()

for v in vals_sex:
    test_data[v] = (test_data["Sex"] == v).astype(int)

for v in vals_embarked:
    test_data[v] = (test_data["Embarked"] == v).astype(int)

test_data.drop("Sex", axis=1, inplace=True)
test_data.drop("Embarked", axis=1, inplace=True)

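# unique() may return categories in a different order than in the training data,
# so put the columns in the same order as the training features before predicting
test_data = test_data[train_data.columns[1:]]

test_data.head(n=10)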
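
A model expects the test columns to match the training features in both name and order. The sketch below shows one way to enforce this with `DataFrame.reindex`, under the assumption that the training feature names are available as a list (`train_cols`, written out by hand here). The toy frame and its values are invented.

```python
import pandas as pd

# Hypothetical list of training feature names, in training order
train_cols = ["Pclass", "Age", "male", "female", "S", "C", "Q"]

# Hypothetical encoded test frame: different column order, and no 'C' column
toy_test = pd.DataFrame({
    "Pclass": [3, 1],
    "Age": [34.5, 47.0],
    "male": [1, 0],
    "female": [0, 1],
    "Q": [1, 0],
    "S": [0, 1],
})

# Reorder to the training layout; a category absent from the test data gets 0
aligned = toy_test.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

Using `fill_value=0` matters for the one-hot columns: a port of embarkation that never appears in the test data should be an all-zero column rather than a missing one.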

<a class="anchor" id="create_submission_csv"></a>
## Create submission csv

Create RandomForestClassifier using optimal hyper-parameters

In [None]:
clf = RandomForestClassifier(random_state=123)  # same seed as during tuning, for reproducibility
clf.set_params(**opt.best_params_)

Fit RandomForestClassifier and produce predictions

In [None]:
clf.fit(train_data.iloc[:, 1:], train_data.iloc[:, 0]) # X_train, y_train
pred = clf.predict(test_data)

Create submission file (.csv)

In [None]:
submission = pd.DataFrame(test_id)
submission["Survived"] = pred

submission.to_csv("submission.csv", index=False)