# Titanic Survival Prediction (Kaggle)
This is my first project on Kaggle. We are given a dataset of passengers on board the Titanic, along with whether each of them survived the sinking. Our job is to predict this survival variable for the passengers in the test set.
I'm going to keep this relatively simple, but I might revisit and improve my models at some point. For now, I'll stick to some basic data cleaning, feature selection and model selection.

In [157]:
# importing all the libraries we'll need
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

In [158]:
# loading the training and test sets into memory
dat_train = pd.read_csv("train.csv")
dat_test = pd.read_csv("test.csv")

## Cleaning the data and selecting the features
I will remove the `Name`, `Ticket`, `SibSp`, `Parch` and `Cabin` features because, intuitively, I don't expect them to have much influence on the chance of survival. This is an assumption I have not tested here.

In [159]:
dat_train = dat_train.drop(["Name", "Ticket", "SibSp", "Parch", "Cabin"], axis = 1)
dat_test = dat_test.drop(["Name", "Ticket", "SibSp", "Parch", "Cabin"], axis = 1)

Let's check for any missing values and fill them in as needed.

In [160]:
print(dat_train.isnull().sum())
print(" ")
print(dat_test.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
Fare             0
Embarked         2
dtype: int64
 
PassengerId     0
Pclass          0
Sex             0
Age            86
Fare            1
Embarked        0
dtype: int64


In [161]:
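# Filling the missing embarked and fare data with the most frequent value.
# The result is assigned back rather than calling fillna(inplace=True) on a
# selected column, which recent pandas versions warn may not update the frame.
dat_train["Embarked"] = dat_train["Embarked"].fillna(dat_train["Embarked"].mode()[0])
dat_test["Fare"] = dat_test["Fare"].fillna(dat_test["Fare"].mode()[0])

# Filling the missing age values using the median of each set
for dat in [dat_train, dat_test]:
    dat["Age"] = dat["Age"].fillna(dat["Age"].median())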

print(dat_train.isnull().sum())
print(" ")
print(dat_test.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
Fare           0
Embarked       0
dtype: int64
 
PassengerId    0
Pclass         0
Sex            0
Age            0
Fare           0
Embarked       0
dtype: int64
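
To make the two fill strategies concrete, here is a minimal sketch on a made-up four-row frame. The values and the name `toy` are illustrative assumptions, not taken from the Kaggle data. The numeric column is filled with its median and the categorical column with its most frequent value.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", None, "S", "C"],
})

# Median of the known ages (22, 30, 40) fills the missing age
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Most frequent port ("S") fills the missing port
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

The median is less sensitive to a few very old or very young passengers than the mean would be, which is one reason it is a common default for a column like `Age`.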


We will also convert the categorical values into numerical ones. To prevent the model from interpreting some kind of order in the nominal values (such as male = 0, female = 1), we will use one-hot encoding for the `Sex` and `Embarked` features.

In [162]:
enc = OneHotEncoder()
enc.fit(dat_train[["Sex", "Embarked"]])
transformed_train = pd.DataFrame(enc.transform(dat_train[["Sex", "Embarked"]]).toarray(), columns=enc.get_feature_names_out())
transformed_test = pd.DataFrame(enc.transform(dat_test[["Sex", "Embarked"]]).toarray(), columns=enc.get_feature_names_out())


In [163]:
dat_train = dat_train.drop(labels=["Sex", "Embarked"], axis = 1)
dat_test = dat_test.drop(labels=["Sex", "Embarked"], axis=1)

dat_train = pd.concat([dat_train, transformed_train], axis=1)
dat_test = pd.concat([dat_test, transformed_test], axis=1)

Let's also split the training data into X and y, and split those into training and test sets. This test set is a held-out part of `train.csv`, separate from Kaggle's `test.csv`.

In [165]:
X, y = dat_train.drop("Survived", axis=1), dat_train["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=33)
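
As a rough check of how many rows land in each part, the sketch below repeats the same split on placeholder arrays. It assumes the standard Kaggle `train.csv` with 891 rows; the names `n_rows`, `X_demo`, `y_demo` and the `Xd_*`/`yd_*` variables are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # assumption: row count of the standard Kaggle train.csv

# Placeholder features and labels; only the row counts matter here
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.zeros(n_rows, dtype=int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=33
)
print(len(Xd_train), len(Xd_test))  # prints: 596 295
```

With a held-out set of this size, a single passenger changes the accuracy by roughly a third of a percentage point, so small differences between models should be read with some caution.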

## Logistic regression
Let's try logistic regression on this problem and see how it does.

In [166]:
lm_model = LogisticRegression(max_iter=300)
lm_model.fit(X_train, y_train)
lm_model.score(X_test, y_test)

0.8033898305084746

## Random Forest
We will also try a random forest classifier on this problem.

In [169]:
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X_train, y_train)
rf_model.score(X_test, y_test)

0.8305084745762712

On the held-out split, we get an accuracy of about 0.80 for the logistic regression model and about 0.83 for the random forest model. I will use the random forest model to make predictions on the test set.

In [168]:
y_pred = rf_model.predict(dat_test)
submission = pd.concat([dat_test["PassengerId"], pd.DataFrame(y_pred, columns=["Survived"])], axis=1).set_index("PassengerId")
submission.to_csv("submission.csv")
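
The model comparison earlier rests on a single train/test split, so the gap between the two scores could partly be luck. One possible follow-up, not something done in this notebook, is to compare the models with k-fold cross-validation. The sketch below uses synthetic stand-in data from `make_classification`; the names `X_demo`, `y_demo`, `candidates` and `cv_means` are assumptions for the example, and in the notebook `X` and `y` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with X and y from the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=300),
    "random forest": RandomForestClassifier(random_state=1),
}

cv_means = {}
for name, model in candidates.items():
    # 5-fold cross-validation; each score is the accuracy on one held-out fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the mean scores differ by less than the spread across folds, the two models are probably not distinguishable on this data.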