# Titanic Dataset - Python

<ul>
    <li>Ryan Erwin</li>
    <li>Experienced Associate</li>
    <li>Chicago Wacker</li>
    <li>Cell: 678.449.8950</li>
    <li>Email
        <ul>
            <li><a href="mailto:ryan.erwin@pwc.com">PwC Email</a></li>
            <li><a href="mailto:ryan.erwin1@gmail.com">Personal Email</a></li>
        </ul>
    </li>
</ul>


Analyze a popular dataset using the Random Forest algorithm to predict the survival of passengers. Random Forest is a great first choice because it's relatively easy to use and robust across a diverse array of datasets. The first thing we need to do is import the packages we'll need to perform the analysis.
<br>

In [7]:
# import packages
import pandas as pd
import numpy as np
from pprint import pprint
import re
import warnings

In [8]:
warnings.simplefilter(action = "ignore", category = RuntimeWarning) # Pandas warning (bug)

In [9]:
# let's grab the data from Github
test_url = "https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv"
train_url = "https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv"

In [10]:
# read into memory
test_set = pd.read_csv(test_url)
train_set = pd.read_csv(train_url)

<br>
# Part 1: Exploratory Analysis
Now, we have the data loaded into memory, so let's print out the first few rows of the **`train_set`**. Next, print some descriptive info about the data. This step allows us to gather some context about the dataset.

In [11]:
# print out first few rows (5 rows)
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [12]:
# get descriptive info, count (non-null), column name, and data type 
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [13]:
# get the summary stats for numerical columns
train_set.describe(percentiles=np.arange(.25, 1.25, 0.25), 
                   include=[np.number])

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,,0.0,0.0,7.9104
50%,446.0,0.0,3.0,,0.0,0.0,14.4542
75%,668.5,1.0,3.0,,1.0,0.0,31.0
100%,891.0,1.0,3.0,,8.0,6.0,512.3292
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


<br>
## First Look
The ***Survived*** variable indicates whether the passenger survived or not (1 = 'yes'; 0 = 'no'). Because it is coded as 0/1, its mean is the proportion of survivors. So, looking at the output of **`train_set.describe()`**, we see that the average of ***Survived*** is 0.38, which means only 38% of the passengers in the training set survived. If the training and test sets were chosen at random, then we'd expect the proportion of survival in the test set to be very close to 0.38. Let's check it out.

In [14]:
# get the summary stats for numerical columns, test set this time
test_set.describe(percentiles=np.arange(.25, 1.25, 0.25),
                  include=[np.number])

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,,0.0,0.0,
50%,1100.5,3.0,,0.0,0.0,
75%,1204.75,3.0,,1.0,0.0,
100%,1309.0,3.0,,8.0,9.0,
max,1309.0,3.0,76.0,8.0,9.0,512.3292


<br>
However, we don't have ***Survived*** in the test set, which is interesting...how are we supposed to test it? We'll continue on anyway.

In [15]:
# print out first 5 rows
test_set.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [16]:
# columns of test set
test_set.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [17]:
# which column(s) is in training but not test?
diff = set(train_set.columns) - set(test_set.columns)
for col in diff:
    print("'%s' is in Training but not Test" % col)

'Survived' is in Training but not Test


<br>
### Note
I'm not familiar with <a href="http://www.kaggle.com">Kaggle</a> competitions, so I'm not sure how they do their testing. In my experience, if you're going to test your model, then you need the test response. I assume Kaggle withholds the response so it can score the model once the predictions have been submitted.
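
Since the test response isn't available, one option is to hold back part of the labelled training data and evaluate on that instead. The snippet below is only a sketch of that idea on a toy frame (the `Survived` values are the first ten rows shown in this notebook); the 80/20 split, the `random_state`, and the names `labelled`, `fit_part`, and `holdout_part` are my own assumptions, not part of the analysis above.

```python
import pandas as pd

# Toy stand-in for train_set: the first ten PassengerId/Survived pairs.
labelled = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Randomly keep 80% of the rows for fitting (assumed ratio).
fit_part = labelled.sample(frac=0.8, random_state=0)

# Everything that was not sampled becomes the hold-out set,
# which still carries the Survived response for scoring.
holdout_part = labelled.drop(fit_part.index)

print(len(fit_part), "rows to fit on,", len(holdout_part), "rows held out")
```

The same two lines should work on the full `train_set`, since they rely only on the row index.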

In [18]:
# print entire train set, I want to take another look
train_set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


<br>
## Create New Features: Married and Reverend (maybe)
<p>
I don't want to lose information in the ***Name*** feature by blindly dropping it, especially since ***SibSp*** is spouse **or** sibling. So, let's see if we can pull out married information from the name feature by analyzing some general patterns. Taking a look at the fourth row of the `DataFrame` above (index=3), we see the name '*Futrelle, Mrs. Jacques Heath (Lily May Peel)*', which indicates&mdash;at least to me&mdash;that Lily May Peel is married to Jacques Heath Futrelle.
</p>
<p>
This seems fairly simple to model using <a href="https://en.wikipedia.org/wiki/Regular_expression">Regular Expressions</a>, so let's create a feature that takes on the value of 1 if married and 0 otherwise.
</p>

In [19]:
# quick test of RegEx pattern
# raw string so that \. and \s reach the regex engine unchanged
married_pattern = re.compile(r"(?P<prefix>Mrs\.*)\s*(?P<husband>\w+|\w+ \w+)\s*(?P<wife>\(.*\))")

In [20]:
test_str = "Futrelle, Mrs. Jacques Heath (Lily May Peel)" # 4th row of DataFrame

In [21]:
test_result = married_pattern.search(test_str) # search entire string for match

In [22]:
test_dict = test_result.groupdict() # get dict from named groups

In [23]:
# print out the matches
for title, name in test_dict.items():
    print("%s: %s" % (title, name))

wife: (Lily May Peel)
husband: Jacques Heath
prefix: Mrs.
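
It's also worth checking what the pattern does *not* match. The sketch below runs the same pattern over a few more names: three from the rows printed earlier and one made-up name ("Smith, Mrs. Jane") that I added to show a limitation. The `samples` and `results` names are just for illustration.

```python
import re

married_pattern = re.compile(
    r"(?P<prefix>Mrs\.*)\s*(?P<husband>\w+|\w+ \w+)\s*(?P<wife>\(.*\))"
)

samples = [
    "Nasser, Mrs. Nicholas (Adele Achem)",  # "Mrs." plus a name in parentheses
    "Heikkinen, Miss. Laina",               # no "Mrs" prefix at all
    "Allen, Mr. William Henry",             # "Mr." is not "Mrs"
    "Smith, Mrs. Jane",                     # made-up: "Mrs." but no parentheses
]

results = {}
for name in samples:
    match = married_pattern.search(name)
    # search() returns None when nothing matches, so guard before groupdict()
    results[name] = match.groupdict() if match else None
    print(name, "->", results[name])
```

Only the first name matches. The last one shows that the pattern depends on the parenthesised name, so a married woman listed without one would be missed.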


<br>
This seems doable, so I'll create a function to determine whether or not we have a match, and if so, we'll flag that person as a married woman. If I have time, I'll build a list of `husbands` and use that to determine if the men are married too.

In [24]:
# create my function
def married_woman(pattern, string):
    '''
    Search returns an object that can be used to test True/False regarding
    whether or not the pattern was matched.
    '''
    match = pattern.search(string)
    if match:
        married = 1
    else:
        married = 0
    return married

In [25]:
# now create the new column
train_set["married_woman"] = train_set["Name"].apply(lambda x: married_woman(married_pattern, x))

In [26]:
# take a look at the new data set
train_set.head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,married_woman
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,1
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,1
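
Any feature added to the training set will eventually have to exist in the test set too. One way to keep the two in sync, sketched here under my own assumptions, is to wrap the step in a helper that works on any frame with a `Name` column. The helper name `add_married_woman` and the toy frame are hypothetical; the two names come from the `test_set` rows printed earlier.

```python
import re
import pandas as pd

married_pattern = re.compile(
    r"(?P<prefix>Mrs\.*)\s*(?P<husband>\w+|\w+ \w+)\s*(?P<wife>\(.*\))"
)

def add_married_woman(frame, pattern=married_pattern):
    """Return a copy of `frame` with a 0/1 `married_woman` column."""
    out = frame.copy()  # leave the caller's frame untouched
    out["married_woman"] = out["Name"].apply(
        lambda name: int(bool(pattern.search(name)))
    )
    return out

# Toy stand-in for test_set
toy_test = pd.DataFrame({
    "Name": ["Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)"],
})
toy_test = add_married_woman(toy_test)
print(toy_test)
```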


<br>
Now I'm curious and I want to create a `married_man` column too. I'm going to utilize the "husband" named group above to build a list of husbands from the ***Name*** column. Then, I'll use the first and last name to build a new regex pattern. If it finds a match, then we have a married man, otherwise he's single.

In [27]:
# quick function
def husband_list(pattern, string):
    '''
    Return the husband named group for each match
    '''
    match = pattern.search(string)
    if match:
        husband = match.groupdict()
        husband = husband["husband"]
    else:
        husband = "single_girl"
    return husband

In [28]:
# get the husband names
husbands_matches = [husband_list(married_pattern, x) 
                    for x in train_set["Name"]]

In [29]:
# use a DataFrame to print out (looks much better than list)
pd.DataFrame(husbands_matches, columns=["Result"]).head(20)

Unnamed: 0,Result
0,single_girl
1,John Bradley
2,single_girl
3,Jacques Heath
4,single_girl
5,single_girl
6,single_girl
7,single_girl
8,Oscar W
9,Nicholas


<br>
Not looking very good, so I'm going to hold off for now. With some more RegEx work, you could get more out of the name, but we'll stop at married woman for now. 
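
As an example of that extra RegEx work, the title sits between the comma and the first period in every name shown so far, so it can be pulled out with a much simpler pattern. This is a rough sketch rather than part of the analysis: `title_pattern` and `extract_title` are names I made up, and "Doe, Rev. John" is an invented name used only to show how a Reverend flag could be derived.

```python
import re

# Title = whatever lies between the comma and the first period.
title_pattern = re.compile(r",\s*(?P<title>[^.]+)\.")

def extract_title(name):
    """Return the title in a 'Surname, Title. Given' name, or None."""
    match = title_pattern.search(name)
    return match.group("title").strip() if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Doe, Rev. John",  # made-up name, only to illustrate the Reverend case
]

titles = [extract_title(name) for name in names]
is_reverend = [int(title == "Rev") for title in titles]
print(titles)
print(is_reverend)
```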

While we're adding a feature, judging from the **`train_set.info()`** output, I think we should drop a feature as well. The ***`Cabin`*** feature only has 204 non-null values, and they're most likely too specific to be used for any type of learning (at least for now). We know how many non-null values there are, but let's print out the number of null values too.

In [30]:
# get null values
train_set.isnull().sum()

PassengerId        0
Survived           0
Pclass             0
Name               0
Sex                0
Age              177
SibSp              0
Parch              0
Ticket             0
Fare               0
Cabin            687
Embarked           2
married_woman      0
dtype: int64
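
The counts are easier to compare as fractions. Because `isnull()` returns booleans, taking the mean gives the share of missing values per column. The sketch below shows this on a small toy frame; the 0.5 cutoff and the names `toy`, `missing_fraction`, and `columns_to_drop` are my own assumptions, not a rule used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as train_set
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Mean of a boolean mask = fraction of True values, i.e. fraction missing
missing_fraction = toy.isnull().mean()

# Flag columns that are mostly empty (0.5 is an arbitrary, assumed cutoff)
columns_to_drop = missing_fraction[missing_fraction > 0.5].index.tolist()

print(missing_fraction)
print(columns_to_drop)
```

On the real data, the same calculation for ***`Cabin`*** is 687 / 891, or roughly 0.77.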

<br>
As you can see, ***`Age`*** has a fair number of null values (177 of 891, about 20%), but ***`Cabin`*** is missing about 77% of its values (687 of 891). So, I think we'll drop this feature for now.

In [31]:
train_set.drop("Cabin", axis=1, inplace=True) # drop Cabin

In [32]:
# print out columns to make sure it worked
train_set.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked', 'married_woman'],
      dtype='object')

In [33]:
# print out first few rows again
train_set.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,married_woman
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0
