# C4021 - Group Project 2.3

### Problem Statement: Chance of Survival on the Titanic

This is a classification problem: predict whether a given passenger survived the Titanic's voyage.


## Dataset Information

####   Source: https://www.kaggle.com/c/titanic 

#### Context: 
"A manifest of data for each passenger on the titanic. With all of the passenger information, we can create and predict the survival of a passenger."

#### Content: 
Each row represents a passenger, each column contains passenger information.

##### The data set includes information about:

- `survived`: Survival (0 = no; 1 = yes)
- `pclass`: Passenger class (1 = first; 2 = second; 3 = third)
- `name`: Name
- `sex`: Sex
- `age`: Age
- `sibsp`: Number of siblings/spouses aboard
- `parch`: Number of parents/children aboard
- `ticket`: Ticket number
- `fare`: Passenger fare
- `cabin`: Cabin
- `embarked`: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- `boat`: Lifeboat (if survived)
- `body`: Body number (if did not survive and body was recovered)
- `home.dest`: Home / destination

In [1]:
import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd

### Load the Dataset into pandas dataframe

In [2]:
# Read the 'titanic3' sheet and treat the string 'NA' as a missing value.
titanic_df = pd.read_excel('titanic3.xls', sheet_name='titanic3', index_col=None, na_values=['NA'])

FileNotFoundError: [Errno 2] No such file or directory: 'titanic3.xls'

In [None]:
titanic_df.head()

## Data Exploration

* Exploring the full dataset gives some insight into why certain passengers survived and others did not, and helps identify the features that most influence survival. We can then use those features to train a model that predicts whether a passenger survived.

* The next section will cover data exploration and insights

In [None]:
# Overall chance of survival
titanic_df['survived'].mean()

In [None]:
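# Average the numeric columns per (class, sex) group; text columns such as name cannot be averaged.
sex_group = titanic_df.groupby(['pclass','sex']).mean(numeric_only=True)
sex_group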

* A brief look at the grouped means shows that female passengers survived at a much higher rate than male passengers. Ticket class also mattered: third-class passengers had some of the lowest survival rates.
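
To make the grouping idea concrete, here is a minimal sketch on a handful of made-up passengers (the rows below are invented for illustration and are not taken from the dataset; only the column names match). Because `survived` is coded as 0/1, its mean within a group is that group's survival rate.

```python
import pandas as pd

# Made-up toy passengers (NOT rows from the real dataset), using the same column names.
toy_df = pd.DataFrame({
    "pclass":   [1, 1, 3, 3, 3, 3],
    "sex":      ["female", "male", "female", "female", "male", "male"],
    "survived": [1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate of each group.
rate_by_group = toy_df.groupby(["pclass", "sex"])["survived"].mean()
print(rate_by_group)
```

Selecting only the `survived` column before calling `mean()` keeps the result focused on the quantity we care about and avoids averaging unrelated columns.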

## Preprocessing Data 

Before I can train and test a model, I must handle null values and missing data in the dataset. Once that is done, I can split the data into training and test sets. Garbage in means garbage out: bad input data will produce unreliable model results.

In [None]:
titanic_df.count()

* We can see that `cabin`, `boat` and `body` are missing a lot of values, so we're going to drop these columns from the dataset.
* `home.dest` is also missing quite a few values, but we can default these to NA.
* `age` is missing quite a few values as well. We keep the column, because history tells us that younger people were given priority on the lifeboats, and instead remove every row that is missing an age (the `dropna()` call below also removes any rows missing other values).
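
The three decisions above can be sketched on a tiny made-up frame (the values are invented for illustration; only the column names come from the dataset). `isna().sum()` counts the gaps per column, which is the complement of what `count()` reports.

```python
import numpy as np
import pandas as pd

# Made-up toy rows (not real passengers) with gaps similar in spirit to the real data.
toy_df = pd.DataFrame({
    "age":       [22.0, np.nan, 35.0, 4.0],
    "cabin":     [np.nan, np.nan, "C85", np.nan],
    "home.dest": ["New York, NY", np.nan, np.nan, "London"],
    "survived":  [0, 1, 1, 1],
})

# Count the missing values in each column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Drop the sparse column, fill the text column with a placeholder,
# then drop the rows that still lack a value (here: the row without an age).
cleaned = toy_df.drop(columns=["cabin"])
cleaned["home.dest"] = cleaned["home.dest"].fillna("NA")
cleaned = cleaned.dropna()
print(len(cleaned))
```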

In [None]:
titanic_df = titanic_df.drop(['body','cabin','boat'], axis=1)
titanic_df["home.dest"] = titanic_df["home.dest"].fillna("NA")
titanic_df = titanic_df.dropna()
titanic_df.count()

* Every column now has the same number of non-null values, so the data is ready to be converted into a numeric format that an ML model can use.
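
Part of that formatting is turning string columns such as `sex` into integers. As a small sketch of what scikit-learn's `LabelEncoder` does (the input list is illustrative only), note that the codes follow the sorted order of the labels, not their order of appearance:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the real column has one entry per passenger.
sex_values = ["male", "female", "female", "male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex_values)

# classes_ lists the original labels in sorted order; a label's position is its integer code.
print(list(encoder.classes_))
print(list(codes))
```

Here `female` becomes 0 and `male` becomes 1. The integers are arbitrary identifiers, so for a column with more than two categories they imply an ordering that may not be meaningful.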

In [None]:
from sklearn import datasets, svm, tree, preprocessing, metrics
from sklearn.model_selection import cross_validate, train_test_split, ShuffleSplit, cross_val_score
import sklearn.ensemble as ske
from sklearn.metrics import confusion_matrix, roc_curve
import seaborn
from sklearn.linear_model import Perceptron
import itertools

def preprocess_titanic_dataset(df):
    # Work on a copy so the original dataframe is left untouched.
    processed_df = df.copy()
    # Encode the string categories as integers (fit_transform refits the encoder for each column).
    le = preprocessing.LabelEncoder()
    processed_df['sex'] = le.fit_transform(processed_df['sex'])
    processed_df['embarked'] = le.fit_transform(processed_df['embarked'])
    # Drop free-text columns that are not used as features.
    processed_df = processed_df.drop(['name', 'ticket', 'home.dest'], axis=1)
    return processed_df

In [None]:
processed_df = preprocess_titanic_dataset(titanic_df)

In [None]:
X = processed_df.drop(['survived'], axis=1).values
y = processed_df['survived'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)
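
As a rough illustration of why the split passes `stratify=y`, the sketch below uses made-up labels (60 zeros and 40 ones, with a dummy feature column) and checks that both parts keep the same class balance. The toy arrays and the `random_state` value are assumptions for the example, not part of the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 60 non-survivors (0) and 40 survivors (1), with a dummy feature column.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Compare the share of survivors in the training and test parts.
print(y_tr.mean(), y_te.mean())
```

Without stratification the test set could, by chance, contain noticeably more or fewer survivors than the training set, which would make the accuracy figures harder to compare.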

In [None]:
def plot_confusion_matrix(cm, normalize = False):
    print(cm)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title("Confusion Matrix")
    plt.colorbar()
    tick_marks = np.arange(2)
    # confusion_matrix orders classes by value: index 0 = not survived, index 1 = survived.
    labels = ['Not Survived', 'Survived']
    plt.xticks(tick_marks, labels, rotation = 45)
    plt.yticks(tick_marks, labels)
    
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if cm[i, j] > 0:
            plt.text(j, i, format(cm[i, j], fmt),
                    horizontalalignment="center",
                    color="white" if cm[i, j] > thresh else "black")
    
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.grid(False)
    plt.show(block = True)

In [None]:
shuffle_validator = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
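def test_classifier(clf):
    # Cross-validated accuracy on the full dataset (cross_val_score refits clones of clf).
    scores = cross_val_score(clf, X, y, cv=shuffle_validator)
    print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))

    # Hold-out accuracy of the already fitted classifier.
    y_pred = clf.predict(X_test)
    acc = metrics.accuracy_score(y_test, y_pred)
    print("Test accuracy: %0.4f" % acc)

    # ROC curve from the predicted probability of class 1 (survived).
    y_pred_proba = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
    plt.plot(fpr, tpr, label="ROC curve")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend(loc=4)
    plt.show()

    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, normalize=False)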

In [None]:
%matplotlib inline
clf_dt = tree.DecisionTreeClassifier(max_depth=10)
clf_dt.fit(X_train, y_train)
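train_score = clf_dt.score(X_train, y_train)
print(train_score)
# Print the test score explicitly; a bare expression in the middle of a cell is not displayed.
print(clf_dt.score(X_test, y_test))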

test_classifier(clf_dt)

In [None]:
%matplotlib inline
clf_rf = ske.RandomForestClassifier(n_estimators=50)
clf_rf.fit(X_train, y_train)
train_score = clf_rf.score(X_train, y_train)
print(train_score)
# Print the test score explicitly; a bare expression in the middle of a cell is not displayed.
print(clf_rf.score(X_test, y_test))

test_classifier(clf_rf)

In [None]:
%matplotlib inline
clf_gb = ske.GradientBoostingClassifier(n_estimators=50)
clf_gb.fit(X_train, y_train)
train_score = clf_gb.score(X_train, y_train)
print(train_score)
# Print the test score explicitly; a bare expression in the middle of a cell is not displayed.
print(clf_gb.score(X_test, y_test))

test_classifier(clf_gb)

In [None]:
%matplotlib inline
clf_prtn = Perceptron(tol=0.01, random_state=1)
# Perceptron has no predict_proba, so test_classifier (which plots a ROC curve) is not used here.
clf_prtn.fit(X_train, y_train)
train_score = clf_prtn.score(X_train, y_train)
test_score = clf_prtn.score(X_test, y_test)
print(train_score)
print(test_score)
y_pred = clf_prtn.predict(X_test)

scores =  cross_val_score(clf_prtn, X, y, cv =shuffle_validator)
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, normalize=False)

In [None]:
%matplotlib inline
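# Default hard (majority) voting is used; soft voting would need predict_proba, which Perceptron lacks.
eclf = ske.VotingClassifier(estimators=[('dt', clf_dt), ('rf', clf_rf), ('gb', clf_gb), ('prtn', clf_prtn)])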
scores =  cross_val_score(eclf, X, y, cv=shuffle_validator)
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))