<font size="5">Titanic - Machine Learning from Disaster</font>

<font size="2">The RMS Titanic was a British passenger ship that struck an iceberg and sank on its maiden voyage from Southampton to New York City on April 15, 1912. It was one of the largest and most luxurious ships of its time, built to offer unmatched comfort and safety for its passengers. Constructed by Harland & Wolff and operated by the White Star Line, the Titanic was claimed to be "unsinkable" due to its advanced safety features, such as watertight compartments and a double bottom. However, on the night of April 14, the ship collided with an iceberg in the North Atlantic Ocean, leading to its tragic sinking. The disaster led to the deaths of over 1,500 people, making it one of the deadliest maritime tragedies in history. The Titanic Kaggle competition challenges participants to build a predictive model that determines which passengers survived the Titanic disaster based on historical data.</font>

![Image of the Titanic](titanic.webp)

In [23]:
import os
for dirname, _, filenames in os.walk('data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data\test.csv
data\train.csv


In [24]:
import pandas as pd
train_df = pd.read_csv('data/test.csv')
train_df

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


<font size="3">Features</font>

<ul style="font-size: small;">
    <li><strong>Age:</strong> Age can significantly impact survival chances, with younger passengers generally having higher survival rates.</li>
    <li><strong>Fare:</strong> The fare paid by passengers is a proxy for their socio-economic status. Higher fares often correlate with higher survival rates, as wealthier passengers might have had better access to lifeboats.</li>
    <li><strong>Sex:</strong> Gender significantly affects survival rates, with females generally having higher survival rates. This feature is encoded as binary (0 for male, 1 for female) to be used in the model.</li>
    <li><strong>Pclass:</strong> The passenger class (1, 2, or 3) is a strong indicator of survival likelihood, with first-class passengers generally having higher survival rates.</li>
    <li><strong>Family Size:</strong> A combination of the number of siblings/spouses aboard and parents/children aboard. Larger family sizes might impact survival due to the dynamics of family support and possible prioritization in lifeboat allocation.</li>
</ul>

In [25]:
from copy import deepcopy
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'Survived']
train_df_sub = deepcopy(train_df[features])
train_df_sub

Unnamed: 0,Pclass,Sex,Age,SibSp,Fare
0,3,male,34.5,0,7.8292
1,3,female,47.0,1,7.0000
2,2,male,62.0,0,9.6875
3,3,male,27.0,0,8.6625
4,3,female,22.0,1,12.2875
...,...,...,...,...,...
413,3,male,,0,8.0500
414,1,female,39.0,0,108.9000
415,3,male,38.5,0,7.2500
416,3,male,,0,8.0500



<font size="3">NaN Values</font>

<font size="2">In this section, we perform one-hot encoding on the categorical feature 'Sex' using the OneHotEncoder class from sklearn. One-hot encoding is a technique used to convert categorical variables into a numerical format suitable for machine learning models.</font>

In [26]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_columns = encoder.fit_transform(train_df_sub[['Sex']])
encoded_df = pd.DataFrame(encoded_columns, columns=encoder.get_feature_names_out(['Sex']))
encoded_df

Unnamed: 0,Sex_female,Sex_male
0,0.0,1.0
1,1.0,0.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0
...,...,...
413,0.0,1.0
414,1.0,0.0
415,0.0,1.0
416,0.0,1.0


In [27]:
train_df_sub = train_df_sub.drop('Sex', axis=1)
train_df_sub = pd.concat([train_df_sub, encoded_df], axis=1)
train_df_sub

Unnamed: 0,Pclass,Age,SibSp,Fare,Sex_female,Sex_male
0,3,34.5,0,7.8292,0.0,1.0
1,3,47.0,1,7.0000,1.0,0.0
2,2,62.0,0,9.6875,0.0,1.0
3,3,27.0,0,8.6625,0.0,1.0
4,3,22.0,1,12.2875,1.0,0.0
...,...,...,...,...,...,...
413,3,,0,8.0500,0.0,1.0
414,1,39.0,0,108.9000,1.0,0.0
415,3,38.5,0,7.2500,0.0,1.0
416,3,,0,8.0500,0.0,1.0



<font size="3">Missing Values</font>

<font size="2">In this section, we use the KNNImputer from sklearn to handle missing values in the dataset. K-Nearest Neighbors imputation is a method that estimates missing values based on the values of the nearest neighbors, considering the similarity between data points.</font>

In [28]:
from sklearn.impute import KNNImputer
# K-nearest neighbors imputation for missing values
imputer = KNNImputer(n_neighbors=5)
train_df_sub[:] = imputer.fit_transform(train_df_sub)
train_df_sub

Unnamed: 0,Pclass,Age,SibSp,Fare,Sex_female,Sex_male
0,3,34.5,0,7.8292,0.0,1.0
1,3,47.0,1,7.0000,1.0,0.0
2,2,62.0,0,9.6875,0.0,1.0
3,3,27.0,0,8.6625,0.0,1.0
4,3,22.0,1,12.2875,1.0,0.0
...,...,...,...,...,...,...
413,3,31.9,0,8.0500,0.0,1.0
414,1,39.0,0,108.9000,1.0,0.0
415,3,38.5,0,7.2500,0.0,1.0
416,3,31.9,0,8.0500,0.0,1.0



<font size="3">Training</font>

<font size="2">In this section, we split the dataset into training and testing sets to prepare for model training and evaluation. The train_test_split function from sklearn is used to divide the data into these subsets, ensuring that the model is evaluated on unseen data. The choice of 20% for testing is a common practice, and the random state of 42 is a common practice as well as an inside joke among data scientists.</font>

In [29]:
from sklearn.model_selection import train_test_split

X = train_df_sub.drop('Survived', axis=1)
y = train_df_sub['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

KeyError: "['Survived'] not found in axis"

<font size="3">The Model</font>

<font size="2">The Random Forest Classifier achieved the highest performance with an accuracy of 0.79665. While other classification techniques such as Logistic Regression, Gradient Boost, Support Vector Machine (SVM), and Decision Tree Classifier were also used, they resulted in lower accuracy, likely due to limitations in capturing complex patterns or overfitting issues.</font>

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

Accuracy: 0.8324022346368715
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.92      0.87       105
           1       0.87      0.70      0.78        74

    accuracy                           0.83       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.83      0.83       179



In [None]:
test_df = pd.read_csv('data/test.csv')
test_df_sub = deepcopy(test_df[features[:-1]])
test_df_sub

Unnamed: 0,Pclass,Sex,Age,SibSp,Fare
0,3,male,34.5,0,7.8292
1,3,female,47.0,1,7.0000
2,2,male,62.0,0,9.6875
3,3,male,27.0,0,8.6625
4,3,female,22.0,1,12.2875
...,...,...,...,...,...
413,3,male,,0,8.0500
414,1,female,39.0,0,108.9000
415,3,male,38.5,0,7.2500
416,3,male,,0,8.0500


In [None]:
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoded_columns = encoder.fit_transform(test_df_sub[['Sex']])
encoded_df = pd.DataFrame(encoded_columns, columns=encoder.get_feature_names_out(['Sex']))
encoded_df.head()



Unnamed: 0,Sex_female,Sex_male
0,0.0,1.0
1,1.0,0.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0


In [None]:
test_df_sub = test_df_sub.drop('Sex', axis=1)
test_df_sub = pd.concat([test_df_sub, encoded_df], axis=1)
test_df_sub.head()

Unnamed: 0,Pclass,Age,SibSp,Fare,Sex_female,Sex_male
0,3,34.5,0,7.8292,0.0,1.0
1,3,47.0,1,7.0,1.0,0.0
2,2,62.0,0,9.6875,0.0,1.0
3,3,27.0,0,8.6625,0.0,1.0
4,3,22.0,1,12.2875,1.0,0.0


In [None]:
test_df_sub[:] = imputer.fit_transform(test_df_sub)
test_df_sub

Unnamed: 0,Pclass,Age,SibSp,Fare,Sex_female,Sex_male
0,3,34.5,0,7.8292,0.0,1.0
1,3,47.0,1,7.0000,1.0,0.0
2,2,62.0,0,9.6875,0.0,1.0
3,3,27.0,0,8.6625,0.0,1.0
4,3,22.0,1,12.2875,1.0,0.0
...,...,...,...,...,...,...
413,3,31.9,0,8.0500,0.0,1.0
414,1,39.0,0,108.9000,1.0,0.0
415,3,38.5,0,7.2500,0.0,1.0
416,3,31.9,0,8.0500,0.0,1.0


In [None]:
x_test = test_df_sub.to_numpy()

In [None]:
test_df['Survived'] = model.predict(x_test)
test_df



Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,0
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,1
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,0
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,0


In [None]:
test_df[['PassengerId', 'Survived']].to_csv('titanic_predictions.csv', index=False)