<font size="5">Titanic - Machine Learning from Disaster</font>

<font size="2">The RMS Titanic was a British passenger liner that struck an iceberg late on April 14, 1912, during its maiden voyage from Southampton to New York City, and sank in the North Atlantic in the early hours of April 15. Built by Harland & Wolff and operated by the White Star Line, it was one of the largest and most luxurious ships of its time and was widely described as "unsinkable" because of safety features such as watertight compartments and a double bottom. More than 1,500 people died, making it one of the deadliest peacetime maritime disasters in history. The Titanic Kaggle competition challenges participants to build a model that predicts which passengers survived, based on historical passenger data.</font>

![Image of the Titanic](titanic.webp)

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


In [2]:
import pandas as pd
train_df = pd.read_csv('/kaggle/input/titanic/train.csv')
train_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


<font size="3">Feature Engineering</font>
<font size="1">Candidate features and how each relates to survival. The baseline model below uses only Pclass, Sex, Age, SibSp, and Fare:</font>

<ul>
    <li><strong>Age:</strong> Age can significantly impact survival chances, with younger passengers generally having higher survival rates. Age has missing values; this notebook fills them with K-nearest-neighbors imputation rather than a simple mean or median.</li>
    <li><strong>Fare:</strong> The fare paid by passengers is a proxy for their socio-economic status. Higher fares often correlate with higher survival rates, as wealthier passengers might have had better access to lifeboats.</li>
    <li><strong>Embarked:</strong> The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) can correlate with survival, partly because the mix of passenger classes differed between ports. As a categorical feature, it would need one-hot encoding before model training.</li>
    <li><strong>Sex:</strong> Sex strongly affects survival rates, with females generally having higher survival rates. This notebook one-hot encodes it into Sex_female and Sex_male columns.</li>
    <li><strong>Title:</strong> Derived from the passenger’s name (e.g., Mr, Mrs, Miss), titles carry information about the passenger’s social status and age group. Like Embarked, this feature would be one-hot encoded.</li>
    <li><strong>Pclass:</strong> The passenger class (1, 2, or 3) is a strong indicator of survival likelihood, with first-class passengers generally having higher survival rates. This feature is directly used in the model.</li>
    <li><strong>Family Size:</strong> The number of siblings/spouses aboard (SibSp) combined with the number of parents/children aboard (Parch). Family size might affect survival through family members supporting one another or being prioritized together for lifeboats. The model below uses only SibSp.</li>
    <li><strong>Cabin:</strong> The cabin number indicates the passenger’s location on the ship. Because missing values are prevalent, it is often reduced to a binary feature indicating whether cabin information is available.</li>
</ul>
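
Several features in the list above (Title, Family Size, Cabin availability, Embarked) are not built by the cells in this notebook. The following is a minimal sketch of how they could be derived with pandas. The helper name `add_candidate_features`, the column names `Title`, `FamilySize`, and `HasCabin`, and the "+ 1 for the passenger" family-size convention are assumptions for illustration; the three sample rows are copied from the `train.csv` preview above.

```python
import pandas as pd

# Three rows copied from the train.csv preview shown in this notebook
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             'Johnston, Miss. Catherine Helen "Carrie"',
             "Behr, Mr. Karl Howell"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 2, 0],
    "Cabin": [None, None, "C148"],
    "Embarked": ["S", "S", "C"],
})

def add_candidate_features(df):
    """Hypothetical helper: derive Title, FamilySize, HasCabin and Embarked dummies."""
    out = df.copy()
    # Title is the text between the comma and the first period in Name
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Assumed convention: siblings/spouses + parents/children + the passenger
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # 1 when a cabin is recorded, 0 when it is missing
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    # One-hot encode the port of embarkation
    dummies = pd.get_dummies(out["Embarked"], prefix="Embarked")
    return pd.concat([out, dummies], axis=1)

engineered = add_candidate_features(sample)
print(engineered[["Title", "FamilySize", "HasCabin"]])
```

Rare titles (for example Rev or Dona, both visible in the previews) would probably need grouping into a single category before one-hot encoding, so that the training and test sets end up with the same columns.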

In [3]:
from copy import deepcopy
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'Survived']
# Independent copy, so later edits do not touch train_df
train_df_sub = train_df[features].copy()
train_df_sub

Unnamed: 0,Pclass,Sex,Age,SibSp,Fare,Survived
0,3,male,22.0,1,7.2500,0
1,1,female,38.0,1,71.2833,1
2,3,female,26.0,0,7.9250,1
3,1,female,35.0,1,53.1000,1
4,3,male,35.0,0,8.0500,0
...,...,...,...,...,...,...
886,2,male,27.0,0,13.0000,0
887,1,female,19.0,0,30.0000,1
888,3,female,,1,23.4500,0
889,1,male,26.0,0,30.0000,1


In [4]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encode 'Sex' into Sex_female / Sex_male columns.
# `sparse_output` replaced the deprecated `sparse` argument in scikit-learn 1.2.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_columns = encoder.fit_transform(train_df_sub[['Sex']])
encoded_df = pd.DataFrame(encoded_columns, columns=encoder.get_feature_names_out(['Sex']))
encoded_df

In [5]:
train_df_sub = train_df_sub.drop('Sex', axis=1)
train_df_sub = pd.concat([train_df_sub, encoded_df], axis=1)
train_df_sub.head()

Unnamed: 0,Pclass,Age,SibSp,Fare,Survived,Sex_female,Sex_male
0,3,22.0,1,7.25,0,0.0,1.0
1,1,38.0,1,71.2833,1,1.0,0.0
2,3,26.0,0,7.925,1,1.0,0.0
3,1,35.0,1,53.1,1,1.0,0.0
4,3,35.0,0,8.05,0,0.0,1.0


In [6]:
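from sklearn.impute import KNNImputer
# K-nearest neighbors imputation: each missing value (here, in Age) is replaced by
# the mean of that column over the 5 most similar rows.
# Note: 'Survived' is still a column at this point, so the target also informs the neighbor search.
imputer = KNNImputer(n_neighbors=5)
train_df_sub[:] = imputer.fit_transform(train_df_sub)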
train_df_sub

Unnamed: 0,Pclass,Age,SibSp,Fare,Survived,Sex_female,Sex_male
0,3,22.0,1,7.2500,0,0.0,1.0
1,1,38.0,1,71.2833,1,1.0,0.0
2,3,26.0,0,7.9250,1,1.0,0.0
3,1,35.0,1,53.1000,1,1.0,0.0
4,3,35.0,0,8.0500,0,0.0,1.0
...,...,...,...,...,...,...,...
886,2,27.0,0,13.0000,0,0.0,1.0
887,1,19.0,0,30.0000,1,1.0,0.0
888,3,25.2,1,23.4500,0,1.0,0.0
889,1,26.0,0,30.0000,1,0.0,1.0
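
The imputation step above fits the imputer on a table that still contains `Survived`. One possible variation, shown here only as a sketch and not what this notebook does, is to fit the imputer on the training features alone and reuse the same fitted imputer for the test set. The frames `train_features` and `test_features` are hypothetical toy stand-ins with illustrative values, and `n_neighbors=2` is chosen only because the toy table is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-ins for the training and test feature tables (illustrative values)
train_features = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})
test_features = pd.DataFrame({
    "Pclass": [3, 1],
    "Age": [np.nan, 39.0],
    "Fare": [8.05, 108.9],
})

# Fit on training features only (no target column), then reuse for the test set
imputer = KNNImputer(n_neighbors=2)
train_filled = pd.DataFrame(imputer.fit_transform(train_features),
                            columns=train_features.columns)
test_filled = pd.DataFrame(imputer.transform(test_features),
                           columns=test_features.columns)
print(test_filled)
```

Because the imputer never sees the target, the same fitted object works on tables with and without `Survived`, and the test rows are filled using neighbors drawn from the training data.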


In [7]:
from sklearn.model_selection import train_test_split

X = train_df_sub.drop('Survived', axis=1)
y = train_df_sub['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
model.fit(X_train, y_train)

In [9]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

Accuracy: 0.8324022346368715
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.92      0.87       105
           1       0.87      0.70      0.78        74

    accuracy                           0.83       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.83      0.83       179
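
To make the report easier to read, the sketch below recomputes its per-class numbers from confusion counts. The counts are not printed by the notebook; they are inferred here from the reported accuracy, recalls, and supports (105 non-survivors and 74 survivors in the 179-row hold-out set), so treat them as an assumption. The helper name `prf` is hypothetical.

```python
# Confusion counts inferred from the report above (an assumption, not notebook output)
tn, fp = 97, 8    # actual class 0: predicted 0 / predicted 1
fn, tp = 22, 52   # actual class 1: predicted 0 / predicted 1

def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (tp + tn) / (tp + tn + fp + fn)
class0 = prf(tn, fn, fp)  # for class 0, the "positives" are the non-survivors
class1 = prf(tp, fp, fn)

print(f"accuracy: {accuracy:.4f}")
print("class 0:", [round(v, 2) for v in class0])
print("class 1:", [round(v, 2) for v in class1])
```

Under these counts the model misses 22 of the 74 survivors, which is what the 0.70 recall for class 1 reflects, while only 8 non-survivors are wrongly predicted to survive.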



In [10]:
test_df = pd.read_csv('/kaggle/input/titanic/test.csv')
# features[:-1] drops 'Survived', which test.csv does not contain
test_df_sub = test_df[features[:-1]].copy()
test_df_sub

Unnamed: 0,Pclass,Sex,Age,SibSp,Fare
0,3,male,34.5,0,7.8292
1,3,female,47.0,1,7.0000
2,2,male,62.0,0,9.6875
3,3,male,27.0,0,8.6625
4,3,female,22.0,1,12.2875
...,...,...,...,...,...
413,3,male,,0,8.0500
414,1,female,39.0,0,108.9000
415,3,male,38.5,0,7.2500
416,3,male,,0,8.0500


In [11]:
# Reuse the encoder fitted on the training data so the test columns are encoded identically
encoded_columns = encoder.transform(test_df_sub[['Sex']])
encoded_df = pd.DataFrame(encoded_columns, columns=encoder.get_feature_names_out(['Sex']))
encoded_df.head()



Unnamed: 0,Sex_female,Sex_male
0,0.0,1.0
1,1.0,0.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0


In [12]:
test_df_sub = test_df_sub.drop('Sex', axis=1)
test_df_sub = pd.concat([test_df_sub, encoded_df], axis=1)
test_df_sub.head()

Unnamed: 0,Pclass,Age,SibSp,Fare,Sex_female,Sex_male
0,3,34.5,0,7.8292,0.0,1.0
1,3,47.0,1,7.0,1.0,0.0
2,2,62.0,0,9.6875,0.0,1.0
3,3,27.0,0,8.6625,0.0,1.0
4,3,22.0,1,12.2875,1.0,0.0


In [13]:
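# Refit the imputer on the test features: the earlier fit included 'Survived', which test.csv lacks
test_df_sub[:] = imputer.fit_transform(test_df_sub)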
test_df_sub

Unnamed: 0,Pclass,Age,SibSp,Fare,Sex_female,Sex_male
0,3,34.5,0,7.8292,0.0,1.0
1,3,47.0,1,7.0000,1.0,0.0
2,2,62.0,0,9.6875,0.0,1.0
3,3,27.0,0,8.6625,0.0,1.0
4,3,22.0,1,12.2875,1.0,0.0
...,...,...,...,...,...,...
413,3,31.9,0,8.0500,0.0,1.0
414,1,39.0,0,108.9000,1.0,0.0
415,3,38.5,0,7.2500,0.0,1.0
416,3,31.9,0,8.0500,0.0,1.0


In [14]:
# Keep the DataFrame (rather than a bare NumPy array) so the column names match those seen in fit
x_test = test_df_sub

In [15]:
test_df['Survived'] = model.predict(x_test)
test_df



Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,0
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,1
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,0
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,0


In [16]:
test_df[['PassengerId', 'Survived']].to_csv('titanic_predictions.csv', index=False)