# Titanic Survivor Prediction Model
#### 1. Data Gathering
Before anything else, a model needs a dataset. Since the main question is "what kind of people were more likely to survive", I searched Google for passenger data; thankfully, Kaggle provides a passenger dataset as part of its competition.

To begin, I imported several sklearn modules, namely **naive_bayes**, **linear_model** and **ensemble**.
I also imported **pandas** and **numpy** for dataset manipulation.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
import numpy as np

#### 2. Preprocessing Data

To preprocess the data, I first need to load it with pandas' **read_csv** function:


In [2]:
titanic_survivor = pd.read_csv("C:\\Users\\Dingus-Elite\\Downloads\\train.csv")

After loading it, I view its contents:

In [3]:
titanic_survivor

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Interpreting the dataset, I noticed several things to consider:
1. Some numeric values (**Age** and **Fare**) are stored as floating-point numbers rather than integers;
2. Some columns look irrelevant (based on my interpretation), such as **Ticket**, **PassengerId**, and **Name**;
3. There are empty cells.
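
These observations can also be checked programmatically instead of by eye. The sketch below is only an illustration: it rebuilds a handful of rows from the preview above into a small frame (the name `sample` is made up for this example) and asks pandas for the column types and the number of missing values per column. On the real data, the same two calls would be made on `titanic_survivor`.

```python
import numpy as np
import pandas as pd

# A few rows copied from the preview above, standing in for train.csv.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Sex": ["male", "female", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 23.45],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S", "S", "S"],
})

# Column types: Age and Fare show up as floats, Sex and Embarked as text.
print(sample.dtypes)

# Number of empty cells in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```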

Before going too deep into preprocessing, I first need to remove unnecessary features; I used the **.drop** method for this. The code also drops **Cabin**.

In [4]:

titanic_survivor = titanic_survivor.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)


For this next set of code, I did the following:
1. I replaced **male** and **female** with the numbers **0** and **1** respectively.
2. I filled the empty cells of **Embarked** with the letter 'S', as it is the most frequent value, and replaced the letters S, C and Q with 3, 1 and 2.
3. I then filled the empty "Age" cells with the mean age, rounded the values up using the **np.ceil** function, and converted them to integers.
4. Finally, I applied the same rounding method to the "Fare" column and converted its values to integers.

In [5]:
titanic_survivor = titanic_survivor.replace(["male" , "female"], [0,1])

titanic_survivor["Embarked"] = titanic_survivor["Embarked"].fillna('S')
titanic_survivor["Embarked"] = titanic_survivor["Embarked"].replace(["S" , "C", "Q"], [3,1,2])

titanic_survivor["Age"] = titanic_survivor["Age"].fillna(titanic_survivor["Age"].mean())
titanic_survivor["Age"] = np.ceil(titanic_survivor["Age"])
titanic_survivor["Age"] = titanic_survivor["Age"].astype('int')

titanic_survivor["Fare"] = np.ceil(titanic_survivor["Fare"])
titanic_survivor["Fare"] = titanic_survivor["Fare"].astype('int')

I then defined which columns are X and which column is y. Column 0 (**Survived**) is my y, as it is the output, and the remaining columns are my X.
I then split the dataset, with a test size of 20%.

In [6]:
X = titanic_survivor.iloc[:, 1:]
y = titanic_survivor.iloc[:, 0]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.20, random_state=10)

#### 3. Choosing a Model
As part of the constraints, I selected two models: the Random Forest Classifier and the Gradient Boosting Classifier. I also built a neural network (NN) with **keras'** Sequential model.


#### 4. Training
The training split from earlier was used to fit each model; I've divided the work into three separate cells for evaluation purposes.

In [29]:
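model = GradientBoostingClassifier(learning_rate=0.01)
model.fit(train_X, train_y)
predictions = model.predict(test_X)
# predict() already returns 0/1 class labels, so np.around leaves them unchanged
predictions = np.around(predictions)
# Only the last expression of a cell is displayed; wrap this call in print() to see the report
classification_report(test_y, predictions)
confusion_matrix(test_y,predictions)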

array([[111,   6],
       [ 18,  44]], dtype=int64)

In [30]:
print(str(accuracy_score(test_y,predictions)*100) + "%")

86.59217877094973%


In [32]:
model = RandomForestClassifier(max_depth=4, random_state=5)
model.fit(train_X, train_y)
predictions = model.predict(test_X)
predictions = np.around(predictions)
classification_report(test_y, predictions)
confusion_matrix(test_y,predictions)

array([[110,   7],
       [ 18,  44]], dtype=int64)

In [31]:
print(str(accuracy_score(test_y,predictions)*100) + "%")

86.59217877094973%


In [27]:
model = Sequential()
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_X, train_y, epochs=50, verbose=1, validation_data=(test_X, test_y), batch_size=10)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x1c93c20b760>

#### 5. Evaluation
To evaluate my model, I used **sklearn's** **confusion_matrix** function to compare its predictions (for the test_X acquired earlier) against the actual test_y results. I also printed its accuracy.

In [23]:
predictions = model.predict(test_X)
# The sigmoid output is a probability; rounding turns it into a 0/1 class label
predictions = np.around(predictions)
classification_report(test_y, predictions)
confusion_matrix(test_y,predictions)



array([[106,  11],
       [ 19,  43]], dtype=int64)

In [24]:
print(str(accuracy_score(test_y,predictions)*100) + "%")

83.24022346368714%
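
As a cross-check, accuracy can be recomputed by hand from a confusion matrix: the diagonal holds the correct predictions, and the sum of all cells is the size of the test set (179 rows here). The sketch below copies the three matrices from the outputs above; the helper name `accuracy_from_confusion` is made up for this example.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: actual 0/1, columns: predicted 0/1).
matrices = {
    "GradientBoostingClassifier": np.array([[111, 6], [18, 44]]),
    "RandomForestClassifier": np.array([[110, 7], [18, 44]]),
    "Keras Sequential": np.array([[106, 11], [19, 43]]),
}

def accuracy_from_confusion(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return np.trace(cm) / cm.sum()

accuracies = {name: accuracy_from_confusion(cm) for name, cm in matrices.items()}
for name, acc in accuracies.items():
    print(f"{name}: {acc * 100:.2f}%")
```

The Gradient Boosting and Keras matrices reproduce the printed 86.59% and 83.24%. The Random Forest matrix works out to 154/179, or about 86.03%, slightly below the printed 86.59%. One possible explanation is cell execution order: the accuracy cell is numbered In [31] and the Random Forest cell In [32], so the printed value may have been computed from the earlier predictions.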


#### 6. Hyperparameter Tuning
For the Gradient Boosting Classifier, I modified its learning rate and set it to 0.01, while for the Random Forest Classifier I set **max_depth** and **random_state**. The values were found through trial and error.
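
Trial and error can also be automated. The sketch below is not what I ran for this notebook; it shows one possible alternative using sklearn's `GridSearchCV`, which tries every combination in a grid with cross-validation. The data is synthetic (from `make_classification`) so the example runs on its own, and the candidate values in `param_grid` are hypothetical, chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the notebook this would be train_X and train_y.
demo_X, demo_y = make_classification(n_samples=200, n_features=7, random_state=10)

# Hypothetical candidate values, chosen only for illustration.
param_grid = {"learning_rate": [0.01, 0.1], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=10),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(demo_X, demo_y)

# Best combination found and its mean cross-validated accuracy.
print(search.best_params_, search.best_score_)
```

Because the search scores each combination on held-out folds of the training data, the test split stays untouched until the final evaluation.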

#### 7. Prediction
Alongside Kaggle's training dataset, they graced us with a test dataset, which I prepped for prediction below, following the same steps of dropping the unnecessary columns and changing the floating-point values to integers.

In [314]:
titanic_s = pd.read_csv("C:\\Users\\Dingus-Elite\\Downloads\\test (1).csv",encoding='latin1')

titanic_s = titanic_s.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
titanic_s = titanic_s.replace(["male" , "female"], [0,1])

titanic_s["Embarked"] = titanic_s["Embarked"].fillna('S')
titanic_s["Embarked"] = titanic_s["Embarked"].replace(["S" , "C", "Q"], [3,1,2])

titanic_s["Age"] = titanic_s["Age"].fillna(titanic_s["Age"].mean())
titanic_s["Age"] = np.ceil(titanic_s["Age"])
titanic_s["Age"] = titanic_s["Age"].astype('int')

titanic_s["Fare"] = np.ceil(titanic_s["Fare"])
titanic_s["Fare"] = titanic_s["Fare"].fillna(titanic_s["Fare"].mean())
titanic_s["Fare"] = titanic_s["Fare"].astype('int')

predictions = model.predict(titanic_s)
predictions =  np.around(predictions)
pred_conv = ['Survived' if i > 0.5 else 'Dead' for i in predictions]



In [316]:
prediction_test_out = pd.DataFrame(pred_conv, columns=['prediction'])
prediction_test_out.to_csv("C:\\Users\\Dingus-Elite\\Desktop\\billones_titanic_output.csv")  # to_csv returns None when given a path, so keep the DataFrame separately
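
Since the training and test files go through the same cleaning steps, the repeated code could be factored into one function. The sketch below is a possible refactoring, not the code that produced the results above: `preprocess` is a hypothetical helper, it uses `.map` on single columns instead of a frame-wide `.replace`, and it fills missing values before rounding up for both **Age** and **Fare** (the test cell above rounds **Fare** up first, which can give a slightly different fill value). The demo rows are copied from the preview of the training data, with one **Embarked** value blanked out on purpose to show the fill.

```python
import numpy as np
import pandas as pd

def preprocess(raw):
    """Apply the notebook's cleaning steps to a Titanic-style DataFrame."""
    df = raw.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    # Missing ports become 'S' (the most frequent value), then S/C/Q -> 3/1/2.
    df["Embarked"] = df["Embarked"].fillna("S").map({"S": 3, "C": 1, "Q": 2})
    for column in ("Age", "Fare"):
        # Fill gaps with the column mean, round up, and store as integers.
        filled = df[column].fillna(df[column].mean())
        df[column] = np.ceil(filled).astype(int)
    return df

# Rows taken from the preview; the last Embarked value is blanked for illustration.
demo = pd.DataFrame({
    "PassengerId": [1, 2, 889],
    "Survived": [0, 1, 0],
    "Pclass": [3, 1, 3],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        'Johnston, Miss. Catherine Helen "Carrie"',
    ],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, np.nan],
    "SibSp": [1, 1, 1],
    "Parch": [0, 0, 2],
    "Ticket": ["A/5 21171", "PC 17599", "W./C. 6607"],
    "Fare": [7.25, 71.2833, 23.45],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

cleaned = preprocess(demo)
print(cleaned)
```

With a helper like this, both files would go through identical steps, which lowers the risk of the two copies drifting apart.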

# Final Notes

Across all three models, the best performing are the Random Forest Classifier and the Gradient Boosting Classifier, at about 86%; I made several attempts to get my NN to the same result, but it seems to cap at 83%. I suspect that with time and experience, I might be able to design a NN that comes close to the two models described above.

The dataset was really messy, with floating-point ages and a ridiculous number of NaN cells that needed to be filled. There were several suggestions as to how they may be handled: one is to fill them with 0 (I can't imagine someone would be 0 years old and board a ship on their own), and another is to fill them with averages. I decided to do the latter. I also took notice of two things. People who embarked from Southampton are more likely to have records, but not necessarily to have survived. The same goes for Pclass: while there are more 3rd class passengers on the ship, that does not necessarily dictate their survival.

Curiously, as I looked further into the dataset, I noticed some information that might be useful for training. Cabin numbers indicate which part of the Titanic the passengers were on, and the passenger names contain titles that hint at marital status, which is useful; you can be a Mr. or Mrs. but not have a spouse, or not have a child, or both.
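
As a rough idea of how those two columns could be turned into features, the sketch below pulls the title out of **Name** and the leading letter out of **Cabin**. I did not use this in the models above. It assumes that the title always sits between the comma and the first period, and that the first character of the cabin code identifies the section of the ship; the column names `Title` and `Deck` are made up for this example. The rows are copied from the preview of the training data.

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Montvila, Rev. Juozas",
    ],
    "Cabin": [np.nan, np.nan, "C123", np.nan],
})

# The title is the text between the comma and the first period,
# e.g. "Braund, Mr. Owen Harris" -> "Mr".
passengers["Title"] = (
    passengers["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

# Assumption: the first character of the cabin code marks the deck.
# Passengers without a cabin record stay NaN.
passengers["Deck"] = passengers["Cabin"].str[0]

print(passengers[["Title", "Deck"]])
```

Both new columns would still need to be encoded as numbers before being passed to a model, in the same way **Sex** and **Embarked** were.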

This might be one of the more hands-on activities in machine learning that I've done so far; it is interesting to see how different manipulations of the dataset affect the accuracy of the model. This is why I believe the 2nd step is so important: no dataset is ever whole or clean, so it's up to the programmer to figure out how to scrub it and how best to approach it. With that in mind, machine learning is not just about feeding the dataset to an algorithm; it takes careful consideration of every parameter you can modify, and each parameter in turn shapes how the model will perform.

While I did some dataset cleaning here, to an experienced data scientist it might just be like brushing one's teeth, rather than scrubbing the toilet.\
It is clear that I still have much to learn.