In [36]:
import pandas as pd
import numpy as np

In [37]:
df = pd.read_excel("../data/titanic3.xls")
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [38]:
# percentage of null values in each column
df.isna().sum() / df.shape[0] * 100


pclass        0.000000
survived      0.000000
name          0.000000
sex           0.000000
age          20.091673
sibsp         0.000000
parch         0.000000
ticket        0.000000
fare          0.076394
cabin        77.463713
embarked      0.152788
boat         62.872422
body         90.756303
home.dest    43.086325
dtype: float64
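
To see "how many" alongside the percentage, the counts and percentages can be put side by side. This is a minimal sketch on a toy frame; `toy` and its values are invented for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values invented for illustration only).
toy = pd.DataFrame(
    {
        "age": [29.0, np.nan, 2.0, 30.0],
        "cabin": ["B5", None, None, None],
        "sex": ["female", "male", "female", "male"],
    }
)

# Count and percentage of missing values side by side, worst columns first.
null_summary = pd.DataFrame(
    {
        "n_missing": toy.isna().sum(),
        "pct_missing": toy.isna().mean() * 100,
    }
).sort_values("pct_missing", ascending=False)

# Keep only the columns that actually contain nulls.
null_summary = null_summary[null_summary["n_missing"] > 0]
print(null_summary)
```

Swapping `toy` for `df` should give the same kind of table for the real data.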

## Data dictionary
- `pclass` - Passenger class (1, 2, 3)
- `survived` - (1, 0)
- `name` - name
- `sex` - (male|female)
- `age` - age (fractional if less than 1, xx.5 if estimated)
- `sibsp` - siblings / spouses aboard
- `parch` - number of parents / children aboard
- `ticket` - ticket number
- `fare` - passenger fare (pre-1970s British pounds; conversion factors: £1 = 20s = 240d and 1s = 12d)
- `cabin` - cabin
- `embarked` - port of embarkation (C|S|Q) Cherbourg, Southampton, Queenstown
- `boat` - lifeboat
- `body` - body identification number
- `home.dest` - home/destination
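
The `fare` column is stored as decimal pounds. As a rough sketch of the pre-decimal arithmetic (£1 = 20 shillings = 240 pence, 1 shilling = 12 pence), a fare can be split back into pounds, shillings and pence. The helper name `decimal_pounds_to_lsd` is hypothetical, and rounding to the nearest whole penny is an assumption.

```python
def decimal_pounds_to_lsd(amount):
    """Split a decimal-pound amount into (pounds, shillings, pence).

    Assumes the amount should be rounded to the nearest whole penny.
    """
    total_pence = round(amount * 240)  # 240 pence to the pound
    pounds, remainder = divmod(total_pence, 240)
    shillings, pence = divmod(remainder, 12)  # 12 pence to the shilling
    return pounds, shillings, pence


# Fares taken from the first rows of df.head() above.
print(decimal_pounds_to_lsd(211.3375))  # (211, 6, 9)
print(decimal_pounds_to_lsd(151.55))    # (151, 11, 0)
```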

[This](http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf) seems to be the only good description of the data set.

In [39]:
# age has a lot of nulls, but those can probably be worked around
# cabin can be filled with "unknown"; home.dest could be too, but it is
# left as NaN here so the cells further down can fill it from embarked
df["cabin"] = df["cabin"].fillna("unknown")

In [40]:
df.loc[df["embarked"] == "C", "home.dest"].value_counts()

home.dest
New York, NY                             33
Paris, France                             7
Haverford, PA / Cooperstown, NY           5
Ottawa, ON                                5
Paris / Montreal, PQ                      4
                                         ..
?Havana, Cuba                             1
St James, Long Island, NY                 1
Gallipolis, Ohio / ? Paris / New York     1
Albany, NY                                1
Austria Niagara Falls, NY                 1
Name: count, Length: 70, dtype: int64

# Extension questions
1. Create a series for most common destinations, in which the index is the `embarked` column and the values are the most common destination for each value of `embarked`
2. Replace NaN values in the `home.dest` column with values from `embarked`
3. Use the most common destinations series to replace values in `home.dest` with the most common values from each embarkation point.

In [44]:
# book version - get the top destination for each embarkation point
sources = df["embarked"].dropna().unique()
top_dests = pd.Series([], dtype=object)
for source in sources:
    top_dests[source] = (
        df.loc[(df["embarked"] == source) & (df["home.dest"] != "unknown"), "home.dest"]
        .value_counts()
        .index[0]
    )
top_dests


S           New York, NY
C           New York, NY
Q    Ireland Chicago, IL
dtype: object
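
The loop can probably be replaced with a `groupby`. The sketch below is one possible alternative, not the book's method; it runs on a toy frame (`toy`, invented values) so that it is self-contained. Ties between equally common destinations are broken by whatever order `value_counts` returns, so the toy data avoids them.

```python
import pandas as pd

# Toy stand-in for the two relevant columns (values invented for illustration).
toy = pd.DataFrame(
    {
        "embarked": ["S", "S", "S", "C", "C", "Q", None],
        "home.dest": [
            "New York, NY",
            "New York, NY",
            "London",
            "Paris, France",
            None,
            "Ireland Chicago, IL",
            "Ottawa, ON",
        ],
    }
)

# Most common known destination per embarkation point.
# groupby drops rows whose key (embarked) is missing by default.
top_dests = (
    toy.dropna(subset=["home.dest"])
    .groupby("embarked")["home.dest"]
    .agg(lambda s: s.value_counts().index[0])
)
print(top_dests)
```

The result is a series indexed by `embarked`, the same shape as the loop version, so it can be used the same way in later cells.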

In [None]:
# book version - replace na values in home.dest with their embarkation values
df["home.dest"] = df["home.dest"].fillna(df["embarked"])
df.loc[df["home.dest"].isin(["Q", "S", "C"]), "home.dest"].value_counts()

home.dest
S    379
C     98
Q     86
Name: count, dtype: int64

In [49]:
# finally use the top destinations to replace embarkation values from home.dest
# with the most popular destination for that embarkation point
df["home.dest"] = df["home.dest"].replace(top_dests)
df["home.dest"].value_counts()

home.dest
New York, NY                      541
Ireland Chicago, IL                90
London                             14
Montreal, PQ                       10
Paris, France                       9
                                 ... 
Bennington, VT                      1
Chelsea, London                     1
Harrow-on-the-Hill, Middlesex       1
Copenhagen, Denmark                 1
Antwerp, Belgium / Stanton, OH      1
Name: count, Length: 369, dtype: int64
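
The two-step fill and replace writes the temporary `S`/`C`/`Q` codes into `home.dest`. A possible single-step variant maps `embarked` through the top-destination series and fills only the missing rows. This is a sketch on toy data; `toy`, `filled` and `was_imputed` are illustrative names, and the `top_dests` values are copied from the output above.

```python
import pandas as pd

# Toy stand-in (values invented for illustration).
toy = pd.DataFrame(
    {
        "embarked": ["S", "C", "Q", "S", None],
        "home.dest": ["London", None, None, None, None],
    }
)

# Top destination per embarkation point, as found earlier in the notebook.
top_dests = pd.Series(
    {"S": "New York, NY", "C": "New York, NY", "Q": "Ireland Chicago, IL"}
)

# Map each passenger's port to its top destination, then fill only the gaps.
# Rows with no embarked value map to NaN and therefore stay missing.
filled = toy["home.dest"].fillna(toy["embarked"].map(top_dests))

# Flag which rows were imputed, so they can be told apart from real values.
was_imputed = toy["home.dest"].isna() & filled.notna()
print(pd.DataFrame({"home.dest": filled, "was_imputed": was_imputed}))
```

Keeping a flag such as `was_imputed` makes it possible to exclude the filled rows later.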

My question here is: what value does this bring to the data? A large share of the destination data is missing (~43%), and
imputing the most common destination is low quality. For Cherbourg, for example, New York, NY accounts for only 33 of the
known destinations, which are spread across 70 distinct values. After the fill, New York, NY jumps to 541 rows, most of
them imputed.
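
One way to put a number on this concern is to compute, for each embarkation point, the share of known destinations taken by the most common one. The sketch below uses a toy frame with invented values; the names `known`, `shares` and `top_share` are illustrative.

```python
import pandas as pd

# Toy stand-in (values invented for illustration).
toy = pd.DataFrame(
    {
        "embarked": ["S", "S", "S", "S", "C", "C", "C", "C", "Q"],
        "home.dest": [
            "New York, NY", "New York, NY", "London", "Paris, France",
            "Paris, France", "New York, NY", "London", "Ottawa, ON",
            None,
        ],
    }
)

# Only rows where both the port and the destination are known.
known = toy.dropna(subset=["embarked", "home.dest"])

# Relative frequency of every destination within each port.
shares = known.groupby("embarked")["home.dest"].value_counts(normalize=True)

# Share held by the single most common destination per port.
top_share = shares.groupby(level=0).max()
print(top_share)
```

A low `top_share` for a port means the most common destination is a poor stand-in for that port's missing values.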

## Me
I think this needs more thought than the book's suggestion gives it. The `home.dest` values are frequently composite: some include state or province abbreviations, and some are chains of places (possibly origins as well as destinations). I think we can do better than what is suggested.
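
As a starting point, the composite values could be split on ` / ` and the trailing two-letter abbreviation pulled out. This is only a sketch: treating ` / ` as the separator and a trailing two-letter code as a state or province are both assumptions, and whether the first part is a home and the last a destination is not confirmed by the data description. The sample strings are taken from the outputs above.

```python
import pandas as pd

# Sample home.dest values seen earlier in the notebook, plus a missing one.
dests = pd.Series(
    [
        "St Louis, MO",
        "Montreal, PQ / Chesterville, ON",
        "Paris / Montreal, PQ",
        None,
    ]
)

# Split chains of places on " / " (assumed separator).
parts = dests.str.split(" / ")
first_place = parts.str[0]
last_place = parts.str[-1]

# Trailing two-letter code after a comma, e.g. "MO" or "ON" (assumed to be
# a state or province abbreviation); NaN when there is none.
region = last_place.str.extract(r",\s*([A-Z]{2})$", expand=False)

print(
    pd.DataFrame(
        {"first_place": first_place, "last_place": last_place, "region": region}
    )
)
```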