# Titanic Dataset - Prediction

In [9]:
import numpy as np
import pandas as pd

## 1 - Information Gathering and Data Exploration

In [10]:
df = pd.read_csv('../input/train.csv')
df.info()

In [11]:
df.head()

### irrelevant features
The ticket number ('Ticket') is not expected to influence whether a passenger survived, so this feature will be dropped. 'PassengerId' and 'Name' both serve only to identify passengers, so the 'Name' feature will be dropped as well.

### non-numerical features
'Sex', 'Cabin' and 'Embarked' are categorical features stored as strings; encoding these categories as numerical labels makes them usable by the models trained later on.
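As a minimal sketch of what label encoding does, the snippet below runs scikit-learn's `LabelEncoder` on a few made-up port codes (toy values, not the actual 'Embarked' column). The `mapping` dictionary is only an illustration of how labels correspond to integer codes.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for a string-valued categorical column
ports = ["S", "C", "Q", "S", "C"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

# classes_ holds the sorted unique labels; a label's code is its index in classes_
mapping = {label: index for index, label in enumerate(encoder.classes_)}

print(mapping)         # {'C': 0, 'Q': 1, 'S': 2}
print(codes.tolist())  # [2, 0, 1, 2, 0]
```

Note that the integer codes follow alphabetical order of the labels and carry no meaning beyond identifying the category.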

### dealing with missing data
The 'Age', 'Embarked' and 'Cabin' features have rows without values (NaN). Since the 'Cabin' feature has only 204 data points, listwise deletion would discard most of the records. Imputing the missing values instead is expected to result in a better dataset for training the models later on. Missing 'Embarked' values will be replaced with the most frequent value and missing 'Age' values with the mean; the 'Cabin' feature is handled separately, as described in the next section.
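A minimal sketch of substitution-based imputation on a toy frame (made-up values, not the Titanic data): the mean is used for a numeric column and the most frequent value for a categorical one, which mirrors what the cells further down do for 'Age' and 'Embarked'.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing entries per column before imputing
missing_counts = toy.isna().sum()

# Numeric column: fill with the rounded mean of the observed values
age_fill = round(toy["Age"].mean())
toy["Age"] = toy["Age"].fillna(age_fill)

# Categorical column: fill with the most frequent value (the mode)
embarked_fill = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(embarked_fill)

print(missing_counts.to_dict())
print(toy)
```

Assigning the result of `fillna` back to the column avoids relying on in-place modification of a column selection.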

### test data - missing values
The test data has very few non-missing 'Cabin' records, and there is no frequently occurring value to impute with. The feature will therefore be dropped from the test data. Although the training set could be imputed and the models trained with the feature included, the models cannot rely on a feature that the test data lacks, so 'Cabin' will be dropped from the training set as well.
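One way to make the "too sparse to impute" judgement explicit is to look at the share of missing values per column. The sketch below uses a toy frame and a cut-off named `MAX_MISSING_SHARE`; both the data and the 0.5 threshold are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: 'Cabin' is mostly empty, 'Fare' has a single gap (made-up values)
toy = pd.DataFrame({
    "Cabin": ["C85", None, None, None, None],
    "Fare": [7.25, 71.28, np.nan, 8.05, 53.1],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Hypothetical cut-off: drop columns where more than half the values are missing
MAX_MISSING_SHARE = 0.5
too_sparse = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()

reduced = toy.drop(columns=too_sparse)
print(too_sparse)
print(list(reduced.columns))
```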

In [12]:
# drop irrelevant features
df = df.drop(['Name', 'Ticket', 'Cabin'], axis='columns')

In [13]:
from collections import Counter
from sklearn import preprocessing

# compute most frequent/mean values
'''
ctr = Counter(df['Cabin'])
print("Cabin feature most common 2 data points:", ctr.most_common(2))
'''

ctr = Counter(df['Embarked'])
print("Embarked feature most common 2 data points:", ctr.most_common(2))

print("Age feature mean value:", np.mean(df['Age'].dropna()))

In [14]:
# impute the feature columns
#df['Cabin'].fillna('G6', inplace=True)

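# assign the result back; in-place fillna on a column selection is deprecated in recent pandas
df['Embarked'] = df['Embarked'].fillna('S')

df['Age'] = df['Age'].fillna(30) # the mean (29.69...) is not a whole-number age, so round it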

In [15]:
# encode the categorical features into numerical values
# (one fitted encoder per feature, kept for reuse on the test data)
embarkedEncoder = preprocessing.LabelEncoder().fit(df['Embarked'])
df['Embarked'] = embarkedEncoder.transform(df['Embarked'])
#df['Cabin'] = preprocessing.LabelEncoder().fit_transform(df['Cabin'])

sexEncoder = preprocessing.LabelEncoder().fit(df['Sex'])
df['Sex'] = sexEncoder.transform(df['Sex'])

In [16]:
df.describe()

In [17]:
df.info()

All features are now numerical and no missing values remain.
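The claim above can also be checked programmatically instead of by reading the `info()` output. The helper below, `is_model_ready`, is a hypothetical name introduced for this sketch; it is shown on toy frames rather than on `df` itself.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def is_model_ready(frame):
    """Return True if the frame has no missing values and only numeric columns."""
    no_missing = not frame.isna().any().any()
    all_numeric = all(is_numeric_dtype(dtype) for dtype in frame.dtypes)
    return no_missing and all_numeric

# Toy frames (made-up values)
clean = pd.DataFrame({"Age": [22.0, 30.0], "Sex": [1, 0]})
with_gap = pd.DataFrame({"Age": [22.0, None], "Sex": [1, 0]})
with_strings = pd.DataFrame({"Age": [22.0, 30.0], "Sex": ["male", "female"]})

print(is_model_ready(clean))
print(is_model_ready(with_gap))
print(is_model_ready(with_strings))
```

Under these assumptions, calling `is_model_ready(df)` at this point in the notebook would be expected to return `True`.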

## 2 - Model Training

In [18]:
from sklearn.model_selection import train_test_split

X = df.drop('Survived', axis='columns')
Y = df['Survived']
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.33)

print(len(trainX), 'training records and', len(testX), 'testing records')

def trainAndPredict(model):
    model.fit(trainX, trainY)
    predictions = model.predict(testX)
    mismatch = 0
    for estimate, real in zip(predictions, testY):
        if estimate != real:
            mismatch += 1
    return mismatch
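The mismatch count returned by `trainAndPredict` relates directly to accuracy: accuracy is one minus the share of mismatches. A minimal sketch with made-up prediction and label arrays (not output from the models in this notebook):

```python
import numpy as np

# Toy predictions and true labels (illustrative values only)
predicted = np.array([0, 1, 1, 0, 1])
actual = np.array([0, 1, 0, 0, 0])

# Same quantity as the loop in trainAndPredict, computed in one vectorised step
mismatch = int(np.sum(predicted != actual))

# Accuracy is the share of records that were predicted correctly
accuracy = 1 - mismatch / len(actual)

print(mismatch, "mismatches, accuracy:", round(accuracy, 2))
```

Reporting accuracy alongside the raw count makes results comparable across test sets of different sizes.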

### 2.1 - Naive Bayes

In [19]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

modelNames = ["Gaussian Naive Bayes", "Multinomial Naive Bayes", "Bernoulli Naive Bayes"]
predictionErrors = [trainAndPredict(gnb), trainAndPredict(mnb), trainAndPredict(bnb)]

for name, errors in zip(modelNames, predictionErrors):
    print(f"Out of {len(testX)} records, the {name} classifier has {errors} incorrect predictions")

### 2.2 - Support Vector Machines (SVM)

In [20]:
from sklearn import svm

svc = svm.SVC()
print(f"Out of {len(testX)} records, the SVM classifier has {trainAndPredict(svc)} incorrect predictions")

## 3 - Testing

In [21]:
testDF = pd.read_csv('../input/test.csv')

In [22]:
testDF.head()

In [23]:
testDF.info()

### data preparation
The same data preparation steps that were performed for the training data are necessary here:
- Dropping the 'Name' and 'Ticket' features
- Obtaining the most frequent values for imputing the missing values
- Label encoding the categorical feature records
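The steps above can be sketched as a single function that is applied to both data sets. Everything here is illustrative: `prepare`, `fill_values`, `encoders` and the toy frames are hypothetical names and data, and taking the fill values and encoders from the training data only is a design assumption of this sketch rather than a description of the cells below.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the training and test frames (made-up rows)
toy_train = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})
toy_test = pd.DataFrame({
    "Name": ["E", "F"],
    "Ticket": ["t5", "t6"],
    "Sex": ["female", "male"],
    "Embarked": ["C", None],
})

# Fill values come from the training data (most frequent value per column)
fill_values = {"Embarked": toy_train["Embarked"].mode()[0]}

# Fit one encoder per categorical column on the imputed training data
filled_train = toy_train.fillna(fill_values)
encoders = {
    column: LabelEncoder().fit(filled_train[column])
    for column in ["Sex", "Embarked"]
}

def prepare(frame, fill_values, encoders):
    """Drop identifier-like columns, impute, then encode with fitted encoders."""
    out = frame.drop(columns=["Name", "Ticket"])
    out = out.fillna(fill_values)
    for column, encoder in encoders.items():
        out[column] = encoder.transform(out[column])
    return out

prepared_train = prepare(toy_train, fill_values, encoders)
prepared_test = prepare(toy_test, fill_values, encoders)
print(prepared_test)
```

Reusing the fitted encoders guarantees that a given category receives the same integer code in both data sets.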

In [24]:
testDF.drop(['Name', 'Ticket'], axis='columns', inplace=True)

In [25]:
ctr = Counter(testDF['Cabin'])
print('Cabin feature most common values:', ctr.most_common(4))

meanAge = np.mean(testDF['Age'])
print('Mean age for the age feature:', meanAge)

There is an issue with the 'Cabin' feature: the dataset has very few non-missing values and no frequent value that could be imputed. Other imputation methods, such as regression, hot-deck or cold-deck imputation, are not suitable either. This feature will be excluded from the prediction process.

In [26]:
# drop the Cabin feature and perform mean substitution for missing records in the Age feature
testDF.drop('Cabin', axis='columns', inplace=True)
testDF['Age'] = testDF['Age'].fillna(30)

In [27]:
# encode the Embarked and Sex features
testDF['Embarked'] = embarkedEncoder.transform(testDF['Embarked'])
testDF['Sex'] = sexEncoder.transform(testDF['Sex'])

In [28]:
testDF.info()

The testing set has a single record without a value for the 'Fare' feature; mean substitution should work fine here.

In [29]:
testDF['Fare'] = testDF['Fare'].fillna(np.mean(testDF['Fare']))
testDF.info()

In [30]:
predictions = gnb.predict(testDF)

In [31]:
def writeCSV(predictions):
    outputDF = pd.DataFrame(np.column_stack([testDF['PassengerId'], predictions]), columns=['PassengerId', 'Survived'])
    outputDF.to_csv('./predictions.csv', index=False)

In [32]:
writeCSV(predictions)

In [33]:
predDF = pd.read_csv('./predictions.csv')

In [34]:
predDF.head()