### #2

Kaggle competition: [\[link\]](https://www.kaggle.com/competitions/titanic/)

Entry by Robin P.M. Kras

In [202]:
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Of these variables, we remove `Name` here. `PassengerId` is kept for now because it is needed to build the submission file (in combination with 'Survived'); it is excluded from the model features later.

In [203]:
train.drop(columns=['Name'], inplace=True)
test.drop(columns=['Name'], inplace=True)

In [204]:
train.head()

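| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |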


In [205]:
train.shape

(891, 11)

From the age distribution, we can tell that a large majority of the passengers were adults (older than 18), although there were also a considerable number of babies. Age is therefore likely to play a role in whether a passenger survived. As a first check, the next cell computes the survival rate among passengers older than 18.

In [206]:
age = train.loc[train.Age > 18]["Survived"]
rate_age = sum(age)/len(age)

print(rate_age)

0.3826086956521739
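The single adult rate above is easier to interpret next to the rates for other age bands. The following is a minimal sketch, assuming a frame with `Age` and `Survived` columns; the toy rows and the bin edges are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy data (hypothetical values, not the Titanic set)
toy = pd.DataFrame({
    "Age":      [1, 4, 15, 17, 22, 30, 45, 70, None],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 1],
})

# Illustrative bin edges; right=False makes each band [lower, upper)
bins = [0, 5, 18, 60, 120]
labels = ["0-4", "5-17", "18-59", "60+"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# The mean of a 0/1 column is the survival rate;
# rows with a missing age fall into no band and are left out
rate_by_band = toy.groupby("AgeBand", observed=True)["Survived"].mean()
print(rate_by_band)
```

On the real data the same pattern would be applied to `train` before the missing ages are filled, so that imputed values do not blur the bands.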


### Data Preprocessing

In [207]:
print(f"Train set, null count: \n{train.isnull().sum()}")
print("\n")
print(f"Test set, null count: \n{test.isnull().sum()}")

Train set, null count: 
PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


Test set, null count: 
PassengerId      0
Pclass           0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


A large number of passengers (687 in the train set, 327 in the test set) do not have a cabin assigned to their ticket. Does this imply that these passengers did not have a room assigned to them? For convenience's sake, I will assign these passengers a room on the lower-end deck (G).

In [208]:
train['Cabin'] = train['Cabin'].fillna('G')
test['Cabin'] = test['Cabin'].fillna('G')

As for age: since it is a numerical variable, we fill the missing values in both sets with the mean age of the training set.

In [209]:
train['Age'] = train['Age'].fillna(train['Age'].mean())
test['Age'] = test['Age'].fillna(train['Age'].mean())
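Filling the test set with the training mean keeps both sets on the same footing, because no statistic is computed from the test rows. A minimal sketch of that pattern on made-up numbers; the helper name `fill_with_train_mean` is hypothetical and not part of the notebook.

```python
import pandas as pd

def fill_with_train_mean(train_col: pd.Series, test_col: pd.Series):
    """Fill gaps in both columns with the mean of the training column."""
    train_mean = train_col.mean()  # missing values are skipped by default
    return train_col.fillna(train_mean), test_col.fillna(train_mean)

# Toy ages (hypothetical values)
train_age = pd.Series([20.0, 30.0, None, 40.0])
test_age = pd.Series([None, 50.0])

train_filled, test_filled = fill_with_train_mean(train_age, test_age)
print(train_filled.tolist())  # [20.0, 30.0, 30.0, 40.0]
print(test_filled.tolist())   # [30.0, 50.0]
```

Computing the mean once, before either column is filled, makes the intent explicit.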

In [210]:
print(f"Train set, null count: \n{train.isnull().sum()}")
print("\n")
print(f"Test set, null count: \n{test.isnull().sum()}")

Train set, null count: 
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       2
dtype: int64


Test set, null count: 
PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin          0
Embarked       0
dtype: int64


This leaves us with a single missing fare, as well as two passengers with no recorded port of embarkation. `Embarked` is categorical, so a median is not defined for it; we fill it with the most frequent value (the mode) and fill the fare with the mean.

In [211]:
# mode() returns a Series (ties are possible), so take its first entry
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())

In [212]:
print(f"Train set, null count: \n{train.isnull().sum()}")
print("\n")
print(f"Test set, null count: \n{test.isnull().sum()}")

Train set, null count: 
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64


Test set, null count: 
PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64


In [213]:
train['Age'] = train['Age'].astype('int64')
test['Age'] = test['Age'].astype('int64')

train['Fare'] = train['Fare'].astype('int64')
test['Fare'] = test['Fare'].astype('int64')

Perfect! Now that all missing values are filled, we can continue to feature engineering.

In [214]:
train['Embarked'].value_counts()

Embarked
S    646
C    168
Q     77
Name: count, dtype: int64
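A median is only defined for ordered numeric data, so for a categorical column such as `Embarked` the most frequent value is a common fill. The sketch below shows the pattern on toy values (not the competition data). Note that `mode` has to be called with parentheses; passing the bare method name to `fillna` inserts the method object itself into the column.

```python
import pandas as pd

# Toy port codes (hypothetical values) with two gaps
embarked = pd.Series(["S", "C", "S", None, "Q", "S", None])

# mode() returns a Series because ties are possible; [0] takes the first entry
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common)                 # S
print(filled.isnull().sum())       # 0
print(filled.value_counts()["S"])  # 5
```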

In [215]:
# Map 'Sex' to 0 (male) and 1 (female)
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})
test['Sex'] = test['Sex'].map({'male': 0, 'female': 1})

# Map 'Embarked' to 0 (S), 1 (C) and 2 (Q)
train['Embarked'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
test['Embarked'] = test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# Note: 'Embarked' is dropped right away, so the mapping above never reaches the model
train = train.drop(columns=['Embarked'])
test = test.drop(columns=['Embarked'])

# Keep only the deck letter (the first character of the cabin code);
# it is likely the most relevant part, since the lower half of a ship will arguably flood faster
train['Cabin'] = train['Cabin'].astype(str).str[0]
test['Cabin'] = test['Cabin'].astype(str).str[0]

In [216]:
train['FamilyMatters'] = train['Parch'] + train['SibSp']
test['FamilyMatters'] = test['Parch'] + test['SibSp']

# Beyond this, combine age and ticket fare into one feature to check whether age matters in relation to the fare

train['Wealth'] = train['Age'] + train['Fare']
test['Wealth'] = test['Age'] + test['Fare']

# Last, combine age and sex (already encoded as 0/1) to check whether the two have something to do with each other

train['AgeSex'] = train['Age'] + train['Sex']
test['AgeSex'] = test['Age'] + test['Sex']

In [217]:
train.head()

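| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | FamilyMatters | Wealth | AgeSex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22 | 1 | 0 | A/5 21171 | 7 | G | 1 | 29 | 22 |
| 1 | 2 | 1 | 1 | 1 | 38 | 1 | 0 | PC 17599 | 71 | C | 1 | 109 | 39 |
| 2 | 3 | 1 | 3 | 1 | 26 | 0 | 0 | STON/O2. 3101282 | 7 | G | 0 | 33 | 27 |
| 3 | 4 | 1 | 1 | 1 | 35 | 1 | 0 | 113803 | 53 | C | 1 | 88 | 36 |
| 4 | 5 | 0 | 3 | 0 | 35 | 0 | 0 | 373450 | 8 | G | 0 | 43 | 35 |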


In [218]:
# Now, we remove the unwanted column of Ticket

train = train.drop(columns=['Ticket'])
test = test.drop(columns=['Ticket'])

In [219]:
test.head()

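| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | FamilyMatters | Wealth | AgeSex |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 0 | 34 | 0 | 0 | 7 | G | 0 | 41 | 34 |
| 1 | 893 | 3 | 1 | 47 | 1 | 0 | 7 | G | 1 | 54 | 48 |
| 2 | 894 | 2 | 0 | 62 | 0 | 0 | 9 | G | 0 | 71 | 62 |
| 3 | 895 | 3 | 0 | 27 | 0 | 0 | 8 | G | 0 | 35 | 27 |
| 4 | 896 | 3 | 1 | 22 | 1 | 1 | 12 | G | 2 | 34 | 23 |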


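Passing `drop_first=True` to `get_dummies()` drops one dummy column per encoded feature, which avoids perfect multicollinearity among the dummy columns.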

In [220]:
train = pd.get_dummies(train, columns=['Cabin'], drop_first=True)
test = pd.get_dummies(test, columns=['Cabin'], drop_first=True)
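To make the effect of `drop_first=True` concrete, here is a minimal sketch on toy deck letters (hypothetical values). Without it, the dummy columns of each row always sum to 1, so any one of them can be derived from the others; dropping the first category removes that redundancy and turns the dropped category into the implicit baseline.

```python
import pandas as pd

# Toy deck letters (hypothetical values)
decks = pd.DataFrame({"Cabin": ["A", "B", "G", "B"]})

full = pd.get_dummies(decks, columns=["Cabin"])
reduced = pd.get_dummies(decks, columns=["Cabin"], drop_first=True)

print(list(full.columns))     # ['Cabin_A', 'Cabin_B', 'Cabin_G']
print(list(reduced.columns))  # ['Cabin_B', 'Cabin_G']

# In the full encoding every row sums to 1, so one column is redundant
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1]
```

This is also why no `Cabin_A` column appears in the encoded frames: deck A is the category that was dropped.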

In [221]:
train.head()

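| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | FamilyMatters | Wealth | AgeSex | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22 | 1 | 0 | 7 | 1 | 29 | 22 | False | False | False | False | False | True | False |
| 1 | 2 | 1 | 1 | 1 | 38 | 1 | 0 | 71 | 1 | 109 | 39 | False | True | False | False | False | False | False |
| 2 | 3 | 1 | 3 | 1 | 26 | 0 | 0 | 7 | 0 | 33 | 27 | False | False | False | False | False | True | False |
| 3 | 4 | 1 | 1 | 1 | 35 | 1 | 0 | 53 | 1 | 88 | 36 | False | True | False | False | False | False | False |
| 4 | 5 | 0 | 3 | 0 | 35 | 0 | 0 | 8 | 0 | 43 | 35 | False | False | False | False | False | True | False |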


Excellent. Next, we select the model for this task.

### Model Selection

Since the task is binary classification, a multitude of models can be expected to give relatively good performance. I decided to use a Random Forest.

In [222]:
X = train.drop(columns=['PassengerId', 'Survived'])
y = train['Survived']  

In [223]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [224]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=300)

rf.fit(X_train, y_train)

In [225]:
y_pred = rf.predict(X_val)

In [226]:
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

Accuracy: 0.8212290502793296
              precision    recall  f1-score   support

           0       0.82      0.90      0.85       105
           1       0.83      0.72      0.77        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179



Extra: a small amount of hyperparameter tuning.

In [227]:
# from sklearn.model_selection import GridSearchCV

# param_grid = {
#     'n_estimators': [100, 200, 300],
#     'max_depth': [5, 10, 15, None],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [1, 2, 4]
# }

# grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
# grid_search.fit(X_train, y_train)

# print("Best Parameters:", grid_search.best_params_)

In [228]:
# Retrain using best parameters

rf = RandomForestClassifier(max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=300, random_state=42)

rf.fit(X_train, y_train)

In [229]:
y_pred = rf.predict(X_val)

In [230]:
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

Accuracy: 0.8100558659217877
              precision    recall  f1-score   support

           0       0.81      0.88      0.84       105
           1       0.80      0.72      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179
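The two runs above use the same hyperparameters and differ only in `random_state`, yet the validation accuracy moves from 0.821 to 0.810. That suggests a single 80/20 split gives a fairly noisy estimate. One common way to smooth this out is k-fold cross-validation; the following is a minimal sketch on synthetic stand-in data (generated with `make_classification`, not the Titanic features), with illustrative hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)

# Five folds: each row is used for validation exactly once
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

In the notebook, `X` and `y` would take the place of `X_demo` and `y_demo`; the spread of the fold scores gives a rough idea of how much a single validation number can be trusted.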



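I forgot to make sure that the train and test data contain the same deck columns! The test set has no passenger on deck T, so `get_dummies()` did not create a `Cabin_T` column for it. We add it manually, filled with `False`.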

In [231]:
test['Cabin_T'] = False
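Adding the missing column by hand works here because only one deck is absent. A more general option is to reindex the test frame to the training columns. A minimal sketch on toy frames (hypothetical values), assuming the frames contain feature columns only:

```python
import pandas as pd

# Toy frames (hypothetical values): the test set lacks deck T
train_demo = pd.DataFrame({"Cabin": ["B", "G", "T"]})
test_demo = pd.DataFrame({"Cabin": ["B", "G", "G"]})

train_enc = pd.get_dummies(train_demo, columns=["Cabin"])
test_enc = pd.get_dummies(test_demo, columns=["Cabin"])

# Give the test frame the training columns, in the same order;
# columns that were missing are created and filled with False
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=False)

print(list(test_enc.columns))        # ['Cabin_B', 'Cabin_G', 'Cabin_T']
print(test_enc["Cabin_T"].tolist())  # [False, False, False]
```

In the notebook this would be applied after dropping `Survived` from the training columns, since the test set has no target column.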

In [232]:
test.head()

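| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | FamilyMatters | Wealth | AgeSex | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 0 | 34 | 0 | 0 | 7 | 0 | 41 | 34 | False | False | False | False | False | True | False |
| 1 | 893 | 3 | 1 | 47 | 1 | 0 | 7 | 1 | 54 | 48 | False | False | False | False | False | True | False |
| 2 | 894 | 2 | 0 | 62 | 0 | 0 | 9 | 0 | 71 | 62 | False | False | False | False | False | True | False |
| 3 | 895 | 3 | 0 | 27 | 0 | 0 | 8 | 0 | 35 | 27 | False | False | False | False | False | True | False |
| 4 | 896 | 3 | 1 | 22 | 1 | 1 | 12 | 2 | 34 | 23 | False | False | False | False | False | True | False |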


In [233]:
train.head()

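| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | FamilyMatters | Wealth | AgeSex | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22 | 1 | 0 | 7 | 1 | 29 | 22 | False | False | False | False | False | True | False |
| 1 | 2 | 1 | 1 | 1 | 38 | 1 | 0 | 71 | 1 | 109 | 39 | False | True | False | False | False | False | False |
| 2 | 3 | 1 | 3 | 1 | 26 | 0 | 0 | 7 | 0 | 33 | 27 | False | False | False | False | False | True | False |
| 3 | 4 | 1 | 1 | 1 | 35 | 1 | 0 | 53 | 1 | 88 | 36 | False | True | False | False | False | False | False |
| 4 | 5 | 0 | 3 | 0 | 35 | 0 | 0 | 8 | 0 | 43 | 35 | False | False | False | False | False | True | False |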


In [234]:
X_test = test.drop(columns=['PassengerId']) 
test_predictions = rf.predict(X_test)

In [235]:
# Build the submission frame: one row per test passenger
submission_df = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': test_predictions
})

# Save CSV file
submission_df.to_csv('submission.csv', index=False)
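Before uploading, it can be worth checking that the frame has the expected shape. The helper below is a hypothetical sketch, not part of the notebook: it assumes the submission should have exactly the columns `PassengerId` and `Survived`, one row per test id in the original order, and only 0/1 labels.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> bool:
    """Return True if `sub` has the two expected columns, one row per id, and 0/1 labels."""
    return (
        list(sub.columns) == ["PassengerId", "Survived"]
        and sub["PassengerId"].tolist() == expected_ids.tolist()
        and bool(sub["Survived"].isin([0, 1]).all())
    )

# Toy ids and predictions (hypothetical values)
ids = pd.Series([892, 893, 894])
good = pd.DataFrame({"PassengerId": ids, "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": ids, "Survived": [0, 2, 0]})

print(check_submission(good, ids))  # True
print(check_submission(bad, ids))   # False
```

In the notebook the call would be `check_submission(submission_df, test['PassengerId'])`.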

Final submission score of 76.315%, not bad!

It can be argued that a much higher accuracy is not realistic, since it is hard to predict who will live and who will die in such a catastrophic event full of panic and dread.

Revisited, edited part:

In [237]:
train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

| | Sex | Survived |
|---|---|---|
| 1 | 1 | 0.742038 |
| 0 | 0 | 0.188908 |


In [239]:
train[["FamilyMatters", "Survived"]].groupby(['FamilyMatters'], as_index=False).mean().sort_values(by='Survived', ascending=False)

| | FamilyMatters | Survived |
|---|---|---|
| 3 | 3 | 0.724138 |
| 2 | 2 | 0.578431 |
| 1 | 1 | 0.552795 |
| 6 | 6 | 0.333333 |
| 0 | 0 | 0.303538 |
| 4 | 4 | 0.2 |
| 5 | 5 | 0.136364 |
| 7 | 7 | 0.0 |
| 8 | 10 | 0.0 |


In [240]:
train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

| | SibSp | Survived |
|---|---|---|
| 1 | 1 | 0.535885 |
| 2 | 2 | 0.464286 |
| 0 | 0 | 0.345395 |
| 3 | 3 | 0.25 |
| 4 | 4 | 0.166667 |
| 5 | 5 | 0.0 |
| 6 | 8 | 0.0 |


In [241]:
train[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

| | Parch | Survived |
|---|---|---|
| 3 | 3 | 0.6 |
| 1 | 1 | 0.550847 |
| 2 | 2 | 0.5 |
| 0 | 0 | 0.343658 |
| 5 | 5 | 0.2 |
| 4 | 4 | 0.0 |
| 6 | 6 | 0.0 |
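The same group-and-average pattern is repeated for each feature above, and the rates alone do not show how many passengers each group contains; some of the extreme values may rest on only a handful of rows. A minimal sketch of a reusable helper that reports both, on toy rows (hypothetical values); the name `survival_rate_by` is made up for illustration.

```python
import pandas as pd

def survival_rate_by(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Survival rate and group size per value of `feature`, best rate first."""
    return (
        df.groupby(feature)["Survived"]
          .agg(rate="mean", n="count")
          .sort_values("rate", ascending=False)
          .reset_index()
    )

# Toy rows (hypothetical values)
toy = pd.DataFrame({
    "Parch":    [0, 0, 0, 1, 1, 2],
    "Survived": [0, 0, 1, 1, 1, 0],
})
result = survival_rate_by(toy, "Parch")
print(result)
```

Calling `survival_rate_by(train, "SibSp")` would reproduce the corresponding table above with an extra count column.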


In [None]:
from sklearn.ensemble import RandomForestClassifier

y = train["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch", "FamilyMatters", "Wealth"]
X = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


New high score: 0.77751! Great.