Hello there, and welcome to my first notebook. Today we will look at the Estonia Disaster Passenger List dataset. First things first: we import the libraries we will need, and then we take a look at the dataset.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
dataset = pd.read_csv('../input/passenger-list-for-the-estonia-ferry-disaster/estonia-passenger-list.csv')
print(len(dataset.index))
dataset.head()

We can see that the dataset has 8 columns and 989 rows. My next step is to check for missing values.

In [None]:
dataset.isna().values.any()
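
The check above collapses the whole table into a single True/False. If it ever comes back True, a per-column count tells us where the gaps are. Here is a minimal sketch on a toy frame (the rows are made up and only stand in for the passenger list):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the passenger list (values are made up).
toy = pd.DataFrame({
    "Country": ["Estonia", "Sweden", None],
    "Age": [34, np.nan, 51],
    "Survived": [0, 1, 0],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only if at least one value anywhere in the frame is missing.
has_missing = bool(toy.isna().values.any())
print(has_missing)
```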

As the check shows, there are no NaN values, so we do not need to handle them. Let's look at some plots of the Country, Sex, Age, Category and Survived columns.

Let's start with the Country column.

In [None]:
print(dataset['Country'].value_counts())
plt.bar(dataset['Country'].value_counts().index,dataset['Country'].value_counts().values)
plt.xticks(rotation=60)
plt.xlabel('Country')
plt.ylabel('Number of passengers')

The Country column shows that most people on the ship were from Estonia and Sweden, which is expected since the ship was sailing from Estonia to Stockholm. Let's continue with the Sex column.

In [None]:
print("The number of people on the boat divided by sex:" )
print(dataset['Sex'].value_counts())

fig1, ax1 = plt.subplots()
ax1.pie(dataset['Sex'].value_counts(), explode=(0.1, 0), labels=['Men', 'Women'], autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')
plt.title('Sex')
plt.show()

So in total there were 503 men and 486 women on the ship. Now let's look at the age distribution on board.

In [None]:
print("Age range:" )
print(dataset['Age'].value_counts())
k=dataset['Age'].value_counts()
plt.bar(k.index,k.values)
plt.xlabel('Age')
plt.ylabel('Number of occurrences')
plt.title('Age range on the boat')

As the graph shows, the ages are spread fairly evenly across the range.
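
A bar for every single age can be hard to read, so grouping the ages into fixed-width bins is one way to double-check the shape of the distribution. This is a small sketch with made-up ages and an arbitrary choice of ten-year bins, not the real Age column:

```python
import numpy as np

# Made-up ages, only to illustrate the binning; not taken from the dataset.
ages = np.array([3, 17, 22, 25, 31, 38, 44, 45, 52, 59, 63, 71, 80])

# Ten-year bin edges 0, 10, ..., 90 (the last bin includes its right edge).
bin_edges = np.arange(0, 100, 10)
counts, edges = np.histogram(ages, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{int(left):>2}-{int(right):<2}: {int(count)}")
```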

In [None]:
print("The number of people on the boat divided by the category attribute:" )
print(dataset['Category'].value_counts())

fig1, ax1 = plt.subplots()
ax1.pie(dataset['Category'].value_counts(), explode=(0.1, 0), labels=['Passenger', 'Crew'], autopct='%1.1f%%',
        shadow=True, startangle=90,colors=['grey','Yellow'])
ax1.axis('equal')
plt.title('Category')
plt.show()

In [None]:
print(dataset['Survived'].value_counts())
fig1, ax1 = plt.subplots()
ax1.pie(dataset['Survived'].value_counts(), explode=(0.1, 0), labels=['Dead', 'Alive'], autopct='%1.1f%%',
        shadow=True, startangle=90,colors=['orange','Grey'])

ax1.axis('equal')
plt.title('Survived')
plt.show()

So now we have some basic information about the dataset. Let's prepare it for multiple linear regression. We already know that the passenger's name cannot play any significant role in prediction because it is unique to every person, and the same holds for PassengerId, so we can get rid of those 3 columns.

In [None]:
dataset=dataset.drop(['Firstname','Lastname','PassengerId'],axis=1)
dataset.head()

The next thing we need to do before fitting a multiple linear regression model is to change the categorical data into a numerical representation. Basically, we create new columns that represent each categorical value as 0 or 1. However, there are over 15 countries in the dataset, which means we would have to create 15+ new columns, even for countries with only 1 person on board. We decided instead to split the Country column into only 3 columns: the person is from Sweden, from Estonia, or from some other country.

In [None]:
dataset["Country_Sweden"] = np.where(dataset["Country"]=="Sweden",1, 0)
dataset["Country_Estonia"] = np.where(dataset["Country"]=="Estonia", 1 ,0)
dataset["Other_Country"] = np.where((dataset["Country"]!="Sweden") & (dataset["Country"]!="Estonia")  , 1, 0)
dataset=dataset.drop("Country",axis=1)

dataset.head()
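
The same three-way grouping can also be written by first collapsing the rare countries into one label and then one-hot encoding the result. This is just a sketch on a toy column; the label "Other" and the variable names are my own assumptions, and the resulting column names differ from the ones used in the cell above:

```python
import pandas as pd

# Toy column with made-up rows; the notebook itself uses dataset["Country"].
toy = pd.DataFrame({"Country": ["Estonia", "Sweden", "Latvia", "Finland", "Sweden"]})

# Keep the two dominant countries and map every other value to "Other".
main_countries = ["Estonia", "Sweden"]
grouped = toy["Country"].where(toy["Country"].isin(main_countries), "Other")

# One indicator column per remaining label, stored as 0/1 integers.
encoded = pd.get_dummies(grouped, prefix="Country", dtype=int)
print(encoded)
```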

As we can see, the Country column has been replaced by 3 indicator columns that together cover every country in the dataset. Let's do the same for the Sex and Category columns. Each of them has only two values, so encoding them gives two columns that are perfectly correlated: if one of them is 0 we know for sure that the other is 1, and vice versa. We therefore drop the first column of each pair and keep only one.

In [None]:
dataset=pd.get_dummies(dataset, columns=["Category"],drop_first=True)
dataset=pd.get_dummies(dataset, columns=["Sex"],drop_first=True)

dataset.head()
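
To make the effect of `drop_first=True` visible, here is a small sketch that encodes a toy frame with and without it. The rows are made up, and the codes "M"/"F" and "P"/"C" are placeholders that I assume for illustration:

```python
import pandas as pd

# Toy rows with made-up values, mirroring the Sex and Category columns.
toy = pd.DataFrame({"Sex": ["M", "F", "F", "M"], "Category": ["P", "C", "P", "P"]})

# Without drop_first every category gets its own column.
full = pd.get_dummies(toy, columns=["Sex", "Category"], dtype=int)

# With drop_first the first category of each attribute (in sorted order)
# is dropped, leaving one indicator per binary attribute.
reduced = pd.get_dummies(toy, columns=["Sex", "Category"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```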

So right now we have 7 columns in total. Next we are going to split Age into 2 categories defined by the mean value.

In [None]:
print(dataset['Age'].describe()[['mean']])

dataset["Age_under44"] = np.where(dataset["Age"]<45, 1, 0)
dataset["Age_over44"] = np.where(dataset["Age"]>=45, 1, 0)
dataset=dataset.drop("Age",axis=1)
dataset.head()
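
Note that the two age indicators are complementary, just like the Sex and Category pairs earlier, so by the same reasoning one of them already carries all the information. A quick sketch with made-up ages shows this:

```python
import numpy as np

# Made-up ages for illustration only.
ages = np.array([12, 30, 44, 45, 46, 70])

# Same rule as in the cell above: the threshold 45 splits the ages in two groups.
age_under = np.where(ages < 45, 1, 0)
age_over = np.where(ages >= 45, 1, 0)

# Every row belongs to exactly one group, so the columns always sum to 1
# and either one can be recovered from the other.
print(age_under + age_over)
```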

OK, so we have our data prepared and now we can split it into train and test datasets. We chose a 1:4 ratio of test to train data.

In [None]:
y = dataset.iloc[:, 0].values
X = dataset.iloc[:, 1:].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify=y,random_state=6)


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)
# score() returns the coefficient of determination (R^2), not an accuracy
score = regressor.score(X_test, y_test)
print("Multilinear regression R^2 score is: %.2f" % score)

As we can see from the code above, we get a score of only 0.14 on this dataset (and only with this random state; other runs gave around 0.10). For LinearRegression the score is the coefficient of determination R², so the multiple linear regression explains very little of the outcome, which is pretty bad. Next we will try the decision tree classifier.
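
Since the score of a linear regression is R² and not an accuracy, it may help to see how it is computed. This is a minimal sketch of the formula on made-up labels and predictions; the function name `r_squared` is my own:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up labels and predictions, only to exercise the formula.
y_true = [0, 0, 0, 1]
always_mean = [0.25, 0.25, 0.25, 0.25]
perfect = [0, 0, 0, 1]

print(r_squared(y_true, always_mean))  # predicting the mean of y_true
print(r_squared(y_true, perfect))      # exact predictions
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so 0.14 is much closer to "no better than the mean" than to a good fit.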

So let's have a look at the decision tree classifier. We will be using the scikit-learn library again, which implements "CART" trees. Let's see what kind of score we can get with it. (Unfortunately, as of version 0.23.2 scikit-learn does not support categorical attributes, so we will stick with the numerical encoding.)

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
fig, ax = plt.subplots(figsize=(24, 12))
tree.plot_tree(clf.fit(X_train, y_train), max_depth=4, fontsize=10)
plt.show()

print("Mean accuracy is: %.2f%% "% (clf.score(X_test,y_test)*100.0))

So our score is 0.85. For a classifier, score returns the mean accuracy on the given test data and labels. This is a much better number than the one from the multiple linear regression, although that score was R² rather than accuracy, so the two are not directly comparable.

Now let's compare the scikit-learn decision tree with XGBoost trees.
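
Accuracy is simply the fraction of predictions that match the true labels, so it is worth comparing it with what we would get by always predicting the most frequent label. Here is a small sketch on made-up labels; the function names are my own:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def majority_baseline(y_true):
    """Most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(y_true).most_common(1)[0]
    return most_common_label, count / len(y_true)

# Made-up labels: 8 zeros and 2 ones.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
```

In this toy case the classifier and the "always 0" baseline reach the same accuracy, which is the kind of situation an imbalanced dataset can produce.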

In [None]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train,y_train)
dtest = xgb.DMatrix(X_test,y_test)
param = {'max_depth': 6, 'eta': 0.3, 'objective': 'binary:logistic'}
param['eval_metric'] = ['auc', 'rmse','map']

evallist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 100
evals_result ={}
bst = xgb.train(param, dtrain, num_round, evallist, evals_result=evals_result)


In [None]:
from sklearn.metrics import accuracy_score
y_pred = bst.predict(dtest)
# binary:logistic returns probabilities, so round them to 0/1 labels
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("Area under curve score is: %.2f%%" % (evals_result['eval']['auc'][-1] * 100.0))
print("Mean average precision is: %.2f%%" % (evals_result['eval']['map'][-1] * 100.0))
xgb.plot_importance(bst)

As we can see, we get an AUC of 0.807 and a mean average precision of 0.33, but the accuracy is 85.35%, the same as with the decision tree. The importance plot shows that the model picked feature number four as the most important one.
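
The AUC can be read as the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row. This is a small pairwise sketch of that definition on made-up labels and probabilities (the function name `roc_auc` is my own, and this is not how XGBoost computes it internally):

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))
```

Unlike accuracy, this number does not depend on the 0.5 threshold, which is why the two metrics can tell different stories.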

Let's now use a simple neural network with a few layers and compare the accuracy. We will use 5 layers with the rectified linear unit (ReLU) activation function and a final layer with the sigmoid activation function. We will train it for 50 epochs with batch_size = 3, just to demonstrate the outcome.

In [None]:
# Import Keras through TensorFlow so both come from the same installation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(24, input_dim=7, activation='relu'))
model.add(Dense(18, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, batch_size=3)



In [None]:
score = model.evaluate(X_test, y_test,verbose=0)

print('Test loss: %.2f%%'% (score[0]*100)) 
print('Test accuracy: %.2f%%'% (score[1]*100))

So as we can see, we get 86.36% accuracy, which is only slightly better than the result of the XGBoost method we used.

It all seems nice, and getting 86% accuracy looks really good, BUT how good is this model really at predicting? Let's have a look at the confusion matrix for the test dataset.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

clas = ['0', '1']
# The network has a single sigmoid output, so threshold the probability at 0.5
predictions = model.predict(X_test)
rounded_predictions = (predictions > 0.5).astype(int).ravel()
cm = confusion_matrix(y_test, rounded_predictions, labels=[0, 1])

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=clas)
disp = disp.plot(cmap='Greens')
plt.show()
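
One thing to be careful about when turning the network output into class labels: with a single sigmoid output the prediction array has shape (n, 1), so taking np.argmax over the last axis returns 0 for every row no matter what the model predicted. Thresholding the probability is the usual way to get labels. A small sketch with made-up probabilities:

```python
import numpy as np

# Made-up sigmoid outputs with shape (n_samples, 1), the shape produced by
# a network whose last layer is Dense(1, activation='sigmoid').
probabilities = np.array([[0.05], [0.62], [0.49], [0.91]])

# argmax over a single column can only ever return index 0.
argmax_labels = np.argmax(probabilities, axis=-1)

# Thresholding at 0.5 turns each probability into a 0/1 label.
threshold_labels = (probabilities > 0.5).astype(int).ravel()

print(argmax_labels)
print(threshold_labels)
```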


So here we can see that every row of our test dataset was classified as 0. The accuracy is high only because the dataset is imbalanced: it has many '0' labels and just a few '1' labels. We could try oversampling to reduce this imbalance, but we just wanted to show what we can get from this dataset using simple machine learning methods and a neural network. Thanks for reading, all comments are welcome.
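
For anyone who wants to try the oversampling idea mentioned above, here is a minimal sketch of random oversampling for two classes. The function name `random_oversample` and the toy data are my own assumptions, and it should only ever be applied to the training split, never to the test split:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    deficit = int(counts.max() - counts.min())
    minority_idx = np.flatnonzero(y == minority)
    # Sample minority rows with replacement to fill the gap.
    extra_idx = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Made-up data: 6 rows of class 0 and 2 rows of class 1.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X_balanced, y_balanced = random_oversample(X, y)
print(np.bincount(y_balanced))
```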