<a href="https://colab.research.google.com/github/siddrrsh/StartOnAI/blob/master/Breaking_the_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Breaking the Titanic**
###### Created by **Nathan Zhao**

![Titanic Image](https://cdn.discordapp.com/attachments/634851401452879883/732674363589001317/Titanic.jpg)

# **Introduction**
In this Python Notebook, we will be discussing the code of a tutorial from a recent Kaggle competition, [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic) and its [tutorial code](https://www.kaggle.com/alexisbcook/titanic-tutorial).

Our goal in this competition is to use the Titanic's passenger data (age, ticket price, gender, etc.) to predict, with machine learning, which passengers survived and which did not.

# **Handling Data**

Our data (downloadable [here](https://www.kaggle.com/c/titanic/data)) is split into training and test data (`train.csv` and `test.csv`, respectively). We train a model on the training data, which records each passenger's details along with whether they survived, so that it can predict whether a given passenger in the test data survived from their details alone.

The table below shows the variables attached to each person in the training and test data (in the test data, the survival column is hidden).
![Table](https://cdn.discordapp.com/attachments/634851401452879883/732674052451336302/unknown.png)

Additionally, we are given a data file, `gender_submission.csv`, that shows what our submission should look like (that is, the format in which our model should output its predictions for the test data).

After downloading the data, we can use the pandas library to process it. **If you see an error when you run the code**, copy this notebook and drop the Kaggle `.csv` files into the file-space on the left. Your file-space should then look like this:

![](https://cdn.discordapp.com/attachments/634851401452879883/733395509301215322/unknown.png)

In [None]:
# We import the necessary libraries (for all our code)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

In [None]:
# We open our training data and show it
raw_train_data = pd.read_csv("/content/train.csv")
raw_train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
# We open our test data and show it
raw_test_data = pd.read_csv("/content/test.csv")
raw_test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


Let's also check out our submission format.

In [None]:
gender_submission = pd.read_csv("/content/gender_submission.csv")
gender_submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


We see that our example is somewhat suspect. Matching the test data to the sample submission by passenger ID, we see that the example model simply assumes that all women on the Titanic survived and all men died. Let's check how well this assumption holds in the training data.

In [None]:
# We get the Survived column (1 = survived, 0 = died) for every woman
women = raw_train_data.loc[raw_train_data.Sex == 'female']["Survived"]

# And find the fraction of women who survived (survivors / total women in the training data)
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

# We get the Survived column (1 = survived, 0 = died) for every man
men = raw_train_data.loc[raw_train_data.Sex == 'male']["Survived"]

# And find the fraction of men who survived (survivors / total men in the training data)
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

% of women who survived: 0.7420382165605095
% of men who survived: 0.18890814558058924


The code above shows that the assumption is only roughly true: about 74% of women and 19% of men in the training data survived. Submitting `gender_submission.csv` to Kaggle [here](https://www.kaggle.com/c/titanic/submit) confirms that this gender-only model is far from perfect (it scored ~76.6%). So, how can we improve our predictions and capture more of the information in our passenger data?
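As a reference point for that question, the gender-only rule can be sketched in a few lines. This is a minimal illustration on a hypothetical five-row table (`toy_train` is made up for this example and is not the real `train.csv`), so the number it prints says nothing about the actual Kaggle score.

```python
import pandas as pd

# Hypothetical toy rows shaped like train.csv (illustrative values, not real data)
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 1, 0],
})

# Gender-only rule: predict 1 (survived) for women and 0 (died) for men
baseline_pred = (toy_train["Sex"] == "female").astype(int)

# Accuracy is the fraction of rows where the rule matches the true label
baseline_accuracy = (baseline_pred == toy_train["Survived"]).mean()

# Survival rate per gender, the same quantity computed by hand earlier
rate_by_sex = toy_train.groupby("Sex")["Survived"].mean()

print(baseline_accuracy)
print(rate_by_sex)
```

Any model we build should beat this kind of baseline before we trust its extra complexity.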

# **Using a Random Forest Model**

The Kaggle tutorial uses a random forest model. From the training data, the model builds a set of **decision trees**, each of which uses various inputs to predict an outcome (for our purpose, whether a passenger survives).

However, these trees are not necessarily fully accurate. Thus, we will create a large pool of decision trees and have a voting system between them. If more trees vote that a passenger will survive than not, that will be our prediction.

In a random forest model with three trees (we will have more), our model would look like this:

![RFM Example](https://cdn.discordapp.com/attachments/634851401452879883/732686844277555291/unknown.png)
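The voting idea can be sketched in plain Python. The votes below are hypothetical values for a single passenger, and `majority_vote` is a helper written only for this illustration. As far as I know, `scikit-learn`'s `RandomForestClassifier` averages each tree's predicted probabilities rather than counting hard votes, so treat this as the intuition rather than the library's exact procedure.

```python
from collections import Counter

# Hypothetical votes from three decision trees for one passenger
# (1 = survived, 0 = did not survive); illustrative values only
tree_votes = [1, 0, 1]


def majority_vote(votes):
    """Return the label that the largest number of trees voted for."""
    return Counter(votes).most_common(1)[0][0]


forest_prediction = majority_vote(tree_votes)
print(forest_prediction)
```

Using an odd number of trees, as here, avoids ties between the two classes.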

The tutorial's code uses `scikit-learn`, a useful machine learning library with many models that are easy for beginners to train. If you are starting out in machine learning, looking through the premade models and examples provided by the `scikit-learn` library is very useful.

Let's check out their code:


In [None]:
# What we will be trying to predict (based on the passenger data)
y = raw_train_data["Survived"]

# Things we will be keeping track of while predicting survival rate
features = ["Pclass", "Sex", "SibSp", "Parch"]

# Process our training and test data to keep only the features we care about (get_dummies turns the text column Sex into numeric columns).
X = pd.get_dummies(raw_train_data[features])
X_test = pd.get_dummies(raw_test_data[features])

# Create and train our random forest model
random_forest = RandomForestClassifier(n_estimators=100, random_state=1)
random_forest.fit(X, y)

# Get our predictions of our test data
predictions = random_forest.predict(X_test)

# Format and save our predictions
output = pd.DataFrame({'PassengerId': raw_test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission_1.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


Look through the above code and try to work out the purpose of each line. Notice that we only take a limited number of features from the passenger data into account. Can you think of why that is? The comments we've made should help.
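One line worth a closer look is `pd.get_dummies`. The random forest needs numeric inputs, so the text column `Sex` has to be converted first. The sketch below applies it to a hypothetical three-row table (`toy_features` is made up for this example) to show the columns it produces.

```python
import pandas as pd

# Hypothetical rows with one numeric and one text feature (illustrative only)
toy_features = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# Numeric columns pass through unchanged; each distinct text value in Sex
# becomes its own indicator column (Sex_female, Sex_male)
encoded = pd.get_dummies(toy_features)

print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns hold either 0/1 integers or `True`/`False` values; the model can be trained on either.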

After submitting our `my_submission_1.csv`, we see that our model has improved (scored ~77.5%)! Nevertheless, how do we continue to improve our predictions?

# **Improving our Results**

In our above code, we don't use all of the features we are given. We use only four: `Pclass`, `Sex`, `Parch`, and `SibSp`. However, if we think about it, shouldn't age and fare (ticket cost) help us determine who survives as well?

So, we can try adding them to our features array. After doing so, we get an error: `Input contains NaN, infinity or a value too large for dtype('float32').` The age and fare columns in our data have a few unknown (missing) values.

We fix this by using pandas' `fillna()` and `mean()` methods to fill in our unknowns, which gives the code below. We continue to use the random forest classifier:



In [None]:
# What we will be trying to predict (based on the passenger data)
y = raw_train_data["Survived"]

# Things we will be keeping track of while predicting survival rate
features = ["Age","Pclass", "Sex", "SibSp", "Parch","Fare"]

X = pd.get_dummies(raw_train_data[features])
X_test = pd.get_dummies(raw_test_data[features])

# We replace NaNs with the mean age and fare.
X = X.fillna(X.mean())
X_test = X_test.fillna(X_test.mean())

# Create and train our random forest model
random_forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=500)
random_forest.fit(X, y)

# Get our predictions of our test data
predictions = random_forest.predict(X_test)

# Format and save our predictions
output = pd.DataFrame({'PassengerId': raw_test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission_2.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


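Success! We have improved our model once again (scored ~78.5%). By filling the unknowns with each column's mean, we were able to take more features into account, improving our predictions. Note that this version also limits each tree with `max_depth=5`, so part of the change may come from that setting.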
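To see what `fillna(X.mean())` does in isolation, here is a minimal sketch on a hypothetical table with one missing age and one missing fare (`toy` and its values are made up for this example).

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing values marked as NaN (illustrative only)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 20.0, np.nan],
})

# mean() skips NaNs, so Age averages to 30.0 and Fare to 15.0;
# fillna() then writes each column's mean into that column's gaps
filled = toy.fillna(toy.mean())

print(filled)
```

A common variant, not used in this notebook, fills the test set with the training set's means so that both sets are processed with the same values.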

# **Continuing to Improve in the Future**

How will we continue to improve our predictions in the future? Looking at the online Kaggle leaderboards, we can see some pretty high scores.

There are a few ways to do so:


*   **Feature engineering** - We can add more features by extracting information from fields such as names. For example, we can keep track of whether the rest of a person's family survived and use that in our prediction, or keep track of useful titles such as "Master" (traditionally used for young boys) that hint at a passenger's age or social standing. We can continue to extract more features by keeping track of noticeable patterns in our training data.
*   **Model tuning** - By altering the hyperparameters of our model, we can improve our score as well. For example, for random forest classifiers, we can alter `max_depth` and `max_leaf_nodes` to inch towards an improved model.
*   **Including more features** - Of course, we can add features that we haven't kept track of yet, such as the port a passenger embarked from or their cabin. However, if you were to analyze the data, you would see that these features provide little help (which is why I did not include them in my code).
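As a starting point for the feature engineering idea, the sketch below pulls the title out of each name. It assumes names follow the `Surname, Title. Given names` pattern seen in the first rows of `train.csv`. The regular expression and the `Title` column name are choices made for this example and are not part of the tutorial.

```python
import pandas as pd

# A few names copied from the first rows of the training data shown earlier
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
], name="Name")

# Capture the text between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
titles.name = "Title"  # hypothetical name for the new feature column

print(titles.tolist())
```

The resulting column is text, so it would need to go through `pd.get_dummies` like `Sex` before being passed to the model.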

In the end, I hope you learned more about machine learning and dealing with unknowns in your data. Additionally, if this was your first time working with random forest classifiers, I hope you learned a lot about them as well!

As a final project, if you want to, try to implement one of the three improvements above. It may take a bit of trial and error!
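If you pick model tuning, one way to organize the trial and error is to compare a few `max_depth` values with cross-validation instead of submitting each one to Kaggle. The sketch below is only an outline of that idea: it uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the candidate depths, `n_estimators=25`, and `cv=3` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the Titanic data), used so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

candidate_depths = [2, 4, 6]  # arbitrary values to compare
cv_scores = {}

for depth in candidate_depths:
    model = RandomForestClassifier(n_estimators=25, max_depth=depth, random_state=0)
    # Mean accuracy over 3 folds for this depth
    cv_scores[depth] = cross_val_score(model, X_demo, y_demo, cv=3).mean()

# Pick the depth with the highest mean cross-validated accuracy
best_depth = max(cv_scores, key=cv_scores.get)

print(cv_scores)
print("Best max_depth:", best_depth)
```

To try this on the real data, replace `X_demo` and `y_demo` with the `X` and `y` built earlier in the notebook.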