# Survival on the Titanic
 
This data set contains information about passengers on board the Titanic. We will use it to predict whether each passenger survived. *Source of data: Kaggle Competition*

The training set contains data we can use to train our model. It has a number of feature columns which contain various descriptive data, as well as a column of the target values we are trying to predict: in this case, **Survival**.

The testing set contains all of the same feature columns, but is missing the target value column. Additionally, the testing set usually has fewer observations (rows) than the training set.

In [0]:
import pandas as pd

In [2]:
train = pd.read_csv('https://raw.githubusercontent.com/sharontan/machine-learning/master/Titanic/train.csv')
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [3]:
test = pd.read_csv('https://raw.githubusercontent.com/sharontan/machine-learning/master/Titanic/test.csv')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


### Data cleaning

The info() output for the training set above shows 3 columns with missing data:
1. Age
2. Cabin
3. Embarked

In [4]:
train['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

Because the ages span a wide range of values, we will sort them into bins. Missing values (nan) are first set to -0.5 so that they fall into a separate Missing bin. Each bin excludes its lower edge and includes its upper edge.

* Missing : -1 to 0
* Infant: 0 to 5
* Child: 5 to 12
* Teenager: 12 to 18
* Young Adult: 18 to 35
* Adult: 35 to 60
* Senior: 60 to 100
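
As a minimal sketch of how these bins behave, the example below applies the same edges and labels to a handful of made-up ages (the sample values are assumptions, not rows from the data set). It illustrates that `pd.cut` closes each interval on the right by default, so an age equal to a bin edge goes into the lower bin.

```python
import numpy as np
import pandas as pd

# Bin edges and labels copied from the list above.
age_cut = [-1, 0, 5, 12, 18, 35, 60, 100]
age_label_names = ['Missing', 'Infant', 'Child', 'Teenager',
                   'Young Adult', 'Adult', 'Senior']

# Hypothetical sample ages, including a missing value and two edge values.
ages = pd.Series([np.nan, 0.42, 5.0, 18.0, 35.5, 80.0])

# NaN becomes -0.5, which falls in the (-1, 0] "Missing" bin.
filled = ages.fillna(-0.5)

# Intervals are right-closed by default: 5.0 is an Infant, 18.0 a Teenager.
binned = pd.cut(filled, age_cut, labels=age_label_names)
print(binned.tolist())
```

Ages above 100 or at or below -1 would fall outside every bin and come back as NaN, so the outer edges need to cover the full range of the data.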





In [0]:
age_cut = [-1,0,5,12,18,35,60,100]
age_label_names = ['Missing', 'Infant', 'Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']

In [0]:
train['Age'] = train['Age'].fillna(-0.5)

In [7]:
train['age_category'] = pd.cut(train['Age'], age_cut, labels=age_label_names)
train['age_category'].value_counts()

Young Adult    358
Adult          195
Missing        177
Teenager        70
Infant          44
Child           25
Senior          22
Name: age_category, dtype: int64

In [0]:
#drop age column
train = train.drop('Age', axis=1)

In [9]:
train['Cabin'].unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

We will extract the first letter of the **Cabin** column to get the category of cabin rather than deal with so many unique values.

In [0]:
train['cabin_category'] = train['Cabin'].str[0]

In [11]:
train['cabin_category'] = train['cabin_category'].fillna('Unknown')
train['cabin_category'].value_counts()

Unknown    687
C           59
B           47
D           33
E           32
A           15
F           13
G            4
T            1
Name: cabin_category, dtype: int64

In [0]:
train = train.drop('Cabin', axis=1)

In [13]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

Since only two values are missing from this column, we will replace them with the mode, **S**.

In [14]:
train['Embarked'] = train['Embarked'].fillna('S')
train['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64
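
The fill value above is hard-coded as 'S'. A possible alternative, sketched here on a made-up series (the sample values are assumptions), is to compute the mode so that the fill value follows the data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of embarkation ports with one missing entry.
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'])

# Series.mode() returns a Series because ties are possible; take the first value.
most_common = embarked.mode()[0]

# Replace the missing entry with the most frequent port.
filled = embarked.fillna(most_common)
print(most_common, filled.isna().sum())
```

For consistency with the rest of this notebook, the mode would be computed on the train set and reused for the test set.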

Besides applying the same steps to the test data, we also need to check whether any test set columns have missing data where the corresponding train set columns were complete.
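
One way to make this check explicit is sketched below, using two tiny made-up frames in place of the real train and test sets (the frame names and values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames standing in for train and test.
train_demo = pd.DataFrame({'Age': [22.0, np.nan], 'Fare': [7.25, 71.28]})
test_demo = pd.DataFrame({'Age': [np.nan, 30.0], 'Fare': [np.nan, 8.05]})

# Columns that contain at least one missing value in each frame.
train_missing = set(train_demo.columns[train_demo.isna().any()])
test_missing = set(test_demo.columns[test_demo.isna().any()])

# Columns incomplete only in the test frame need their own fill rule.
only_in_test = sorted(test_missing - train_missing)
print(only_in_test)
```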

The test set has missing data in **Age** and **Cabin**, as the train set does, plus one additional column: **Fare**. (**Embarked** is complete in the test set.) Since **Fare** is a float column with only one missing value, we can replace that value with the column mean. We will use the mean of the train set, which is computed from more observations.

In [0]:
test['Fare'] = test['Fare'].fillna(train['Fare'].mean())

In [0]:
#Age
test['Age'] = test['Age'].fillna(-0.5)
test['age_category'] = pd.cut(test['Age'], age_cut, labels=age_label_names)

In [0]:
test = test.drop('Age', axis=1)

In [0]:
#Cabin
test['cabin_category'] = test['Cabin'].str[0]
test['cabin_category'] = test['cabin_category'].fillna('Unknown')

In [0]:
test = test.drop('Cabin', axis=1)

In [0]:
#Embarked
test['Embarked'] = test['Embarked'].fillna('S')

In [21]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId       891 non-null int64
Survived          891 non-null int64
Pclass            891 non-null int64
Name              891 non-null object
Sex               891 non-null object
SibSp             891 non-null int64
Parch             891 non-null int64
Ticket            891 non-null object
Fare              891 non-null float64
Embarked          891 non-null object
age_category      891 non-null category
cabin_category    891 non-null object
dtypes: category(1), float64(1), int64(5), object(5)
memory usage: 77.9+ KB


In [22]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId       418 non-null int64
Pclass            418 non-null int64
Name              418 non-null object
Sex               418 non-null object
SibSp             418 non-null int64
Parch             418 non-null int64
Ticket            418 non-null object
Fare              418 non-null float64
Embarked          418 non-null object
age_category      418 non-null category
cabin_category    418 non-null object
dtypes: category(1), float64(1), int64(4), object(5)
memory usage: 33.5+ KB


### Feature engineering

As shown above, all columns are now free of missing data. However, there are still 5 object/string columns, which cannot be used directly by our machine learning models in scikit-learn.

1. Name
2. Sex
3. Ticket
4. Embarked
5. cabin_category

We can convert **Sex**, **Embarked**, and **cabin_category** into categorical columns. We will have to perform some processing before we can use the **Name** column. The **Ticket** column, however, is just the ticket number for each passenger, so we can drop it from our analysis.



In [0]:
train = train.drop('Ticket', axis=1)

In [0]:
test = test.drop('Ticket', axis=1)

In [25]:
train['Name'].head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

We can extract the titles from the name of the passengers to obtain a column with more meaningful values.

In [0]:
titles = {
        "Mr" :         "Mr",
        "Mme":         "Mrs",
        "Ms":          "Mrs",
        "Mrs" :        "Mrs",
        "Master" :     "Master",
        "Mlle":        "Miss",
        "Miss" :       "Miss",
        "Capt":        "Officer",
        "Col":         "Officer",
        "Major":       "Officer",
        "Dr":          "Officer",
        "Rev":         "Officer",
        "Jonkheer":    "Royalty",
        "Don":         "Royalty",
        "Sir" :        "Royalty",
        "Countess":    "Royalty",
        "Dona":        "Royalty",
        "Lady" :       "Royalty"
    }

In [0]:
# Raw string so that "\." is passed to the regex engine as a literal dot
extracted = train['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

In [0]:
train['title'] = extracted.map(titles)

In [0]:
train = train.drop('Name', axis=1)

In [0]:
test['title'] = test['Name'].str.extract(r'([A-Za-z]+)\.', expand=False).map(titles)

In [0]:
test = test.drop('Name', axis=1)
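
To illustrate what the extraction and mapping steps produce, here is a small sketch on three names. The first two come from the output above; the third name and the shortened `title_map` are made up for illustration.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout shown above.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Hypothetical, Mlle. Example',  # made-up name for illustration
])

# Subset of the title mapping defined above.
title_map = {'Mr': 'Mr', 'Miss': 'Miss', 'Mlle': 'Miss'}

# The group captures the run of letters immediately before the first
# literal dot, which is the title in this name layout.
extracted = names.str.extract(r'([A-Za-z]+)\.', expand=False)
mapped = extracted.map(title_map)
print(extracted.tolist(), mapped.tolist())
```

Any extracted title that is not a key of the mapping becomes NaN after `map`, so it may be worth checking the mapped column for missing values.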

In [32]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,SibSp,Parch,Fare,Embarked,age_category,cabin_category,title
0,1,0,3,male,1,0,7.25,S,Young Adult,Unknown,Mr
1,2,1,1,female,1,0,71.2833,C,Adult,C,Mrs
2,3,1,3,female,0,0,7.925,S,Young Adult,Unknown,Miss
3,4,1,1,female,1,0,53.1,S,Young Adult,C,Mrs
4,5,0,3,male,0,0,8.05,S,Young Adult,Unknown,Mr


The **Fare** column needs more processing, since it contains many distinct values spread over a wide range.

In [33]:
train['Fare'].unique()

array([  7.25  ,  71.2833,   7.925 ,  53.1   ,   8.05  ,   8.4583,
        51.8625,  21.075 ,  11.1333,  30.0708,  16.7   ,  26.55  ,
        31.275 ,   7.8542,  16.    ,  29.125 ,  13.    ,  18.    ,
         7.225 ,  26.    ,   8.0292,  35.5   ,  31.3875, 263.    ,
         7.8792,   7.8958,  27.7208, 146.5208,   7.75  ,  10.5   ,
        82.1708,  52.    ,   7.2292,  11.2417,   9.475 ,  21.    ,
        41.5792,  15.5   ,  21.6792,  17.8   ,  39.6875,   7.8   ,
        76.7292,  61.9792,  27.75  ,  46.9   ,  80.    ,  83.475 ,
        27.9   ,  15.2458,   8.1583,   8.6625,  73.5   ,  14.4542,
        56.4958,   7.65  ,  29.    ,  12.475 ,   9.    ,   9.5   ,
         7.7875,  47.1   ,  15.85  ,  34.375 ,  61.175 ,  20.575 ,
        34.6542,  63.3583,  23.    ,  77.2875,   8.6542,   7.775 ,
        24.15  ,   9.825 ,  14.4583, 247.5208,   7.1417,  22.3583,
         6.975 ,   7.05  ,  14.5   ,  15.0458,  26.2833,   9.2167,
        79.2   ,   6.75  ,  11.5   ,  36.75  ,   7.7958,  12.5

We will cut the fares into bins, as we did for age.

In [34]:
train['Fare'].min()

0.0

In [35]:
train['Fare'].max()

512.3292

In [36]:
train['Fare'].median()

14.4542

In [0]:
fare_cut = [-1, 10, 50, 100, 600]
fare_cat = ['0-10', '10-50', '50-100', '100+']

In [38]:
train['fare_category'] = pd.cut(train['Fare'], fare_cut, labels=fare_cat)
train['fare_category'].value_counts()

10-50     395
0-10      336
50-100    107
100+       53
Name: fare_category, dtype: int64
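
The lowest bin edge is -1 rather than 0 because the minimum fare is 0.0. The sketch below, using the minimum, median and maximum fares reported above, shows what would likely go wrong with a lowest edge of 0.

```python
import pandas as pd

fare_cut = [-1, 10, 50, 100, 600]
fare_cat = ['0-10', '10-50', '50-100', '100+']

# The minimum, median and maximum fares reported above.
fares = pd.Series([0.0, 14.4542, 512.3292])

# With -1 as the lowest edge, a fare of 0.0 lands in the first bin.
with_negative_edge = pd.cut(fares, fare_cut, labels=fare_cat)

# With 0 as the lowest edge, the right-closed interval (0, 10]
# excludes 0.0, which becomes NaN.
zero_edge = pd.cut(fares, [0, 10, 50, 100, 600], labels=fare_cat)

print(with_negative_edge.tolist())
print(zero_edge.isna().tolist())
```

Passing `include_lowest=True` to `pd.cut` is another way to keep the lowest edge inside the first bin.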

In [0]:
train = train.drop('Fare', axis=1)

In [0]:
test['fare_category'] = pd.cut(test['Fare'], fare_cut, labels=fare_cat)

In [0]:
test = test.drop('Fare', axis=1)

In [42]:
train['family_size'] = train['SibSp'] + train['Parch']
train['family_size'].value_counts()

0     537
1     161
2     102
3      29
5      22
4      15
6      12
10      7
7       6
Name: family_size, dtype: int64

We can reduce family size to a binary feature for our machine learning model: travelling alone or not alone.

In [43]:
train['alone'] = train['family_size'].apply(lambda x : 1 if x==0 else 0)
train['alone'].value_counts()

1    537
0    354
Name: alone, dtype: int64
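
The same flag can also be computed without `apply`. This is a sketch on made-up counts (the `demo` frame is an assumption), using a vectorized comparison that should give the same 0/1 values.

```python
import pandas as pd

# Hypothetical sibling/spouse and parent/child counts.
demo = pd.DataFrame({'SibSp': [1, 0, 0, 3], 'Parch': [0, 0, 2, 1]})

demo['family_size'] = demo['SibSp'] + demo['Parch']

# Boolean comparison cast to int: 1 when travelling alone, else 0.
demo['alone'] = (demo['family_size'] == 0).astype(int)
print(demo['alone'].tolist())
```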

In [0]:
test['family_size'] = test['SibSp'] + test['Parch']
test['alone'] = test['family_size'].apply(lambda x : 1 if x==0 else 0)

In [0]:
cols_to_category = ['fare_category', 'age_category', 'cabin_category', 'Sex', 'title', 'Embarked']

In [0]:
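def convert_to_category(df, col):
    # One-hot encode `col`: one indicator column per category, named <col>_<value>
    dummy_df = pd.get_dummies(df[col], prefix=col)
    # Append the indicator columns, then drop the original column
    df = pd.concat([df, dummy_df], axis=1)
    df = df.drop(col, axis=1)
    return df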

In [0]:
for col in cols_to_category:
    train = convert_to_category(train, col)
    test = convert_to_category(test, col)
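
Because the two frames are encoded separately, a category that occurs in only one of them produces a dummy column that the other frame lacks. A minimal sketch of one way to align them, on made-up frames (`train_demo` and `test_demo` are assumptions), is:

```python
import pandas as pd

# Hypothetical frames: category 'T' appears only in the training data.
train_demo = pd.DataFrame({'cabin_category': ['C', 'T', 'Unknown']})
test_demo = pd.DataFrame({'cabin_category': ['C', 'Unknown', 'Unknown']})

train_dummies = pd.get_dummies(train_demo['cabin_category'], prefix='cabin_category')
test_dummies = pd.get_dummies(test_demo['cabin_category'], prefix='cabin_category')

# Give the test frame the training columns, filling absent ones with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Without such a step, selecting a training-only column from the test frame would raise a `KeyError` at prediction time.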

In [48]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,SibSp,Parch,family_size,alone,fare_category_0-10,fare_category_10-50,fare_category_50-100,fare_category_100+,age_category_Missing,age_category_Infant,age_category_Child,age_category_Teenager,age_category_Young Adult,age_category_Adult,age_category_Senior,cabin_category_A,cabin_category_B,cabin_category_C,cabin_category_D,cabin_category_E,cabin_category_F,cabin_category_G,cabin_category_T,cabin_category_Unknown,Sex_female,Sex_male,title_Master,title_Miss,title_Mr,title_Mrs,title_Officer,title_Royalty,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,1,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,1
1,2,1,1,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0
2,3,1,3,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,1
3,4,1,1,1,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1
4,5,0,3,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,1


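We can drop the **Sex_male** column since it duplicates the information in **Sex_female**. We also drop **SibSp**, **Parch**, and **family_size**, now that **alone** has been derived from them.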

In [0]:
train = train.drop(['Sex_male', 'SibSp', 'Parch', 'family_size'], axis=1)
test = test.drop(['Sex_male', 'SibSp', 'Parch', 'family_size'], axis=1)
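
For a two-valued column such as **Sex**, the redundant dummy can also be avoided at encoding time. The sketch below uses the `drop_first` option of `pd.get_dummies` on a made-up series. Note that it drops the first category in sorted order, so it keeps `Sex_male` rather than `Sex_female`, unlike the cell above.

```python
import pandas as pd

# Hypothetical sample values.
sex = pd.Series(['male', 'female', 'female'], name='Sex')

# drop_first=True omits the first category in sorted order ('female'),
# leaving a single Sex_male indicator.
encoded = pd.get_dummies(sex, prefix='Sex', drop_first=True)
print(list(encoded.columns))
```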

In [50]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 34 columns):
PassengerId                 891 non-null int64
Survived                    891 non-null int64
Pclass                      891 non-null int64
alone                       891 non-null int64
fare_category_0-10          891 non-null uint8
fare_category_10-50         891 non-null uint8
fare_category_50-100        891 non-null uint8
fare_category_100+          891 non-null uint8
age_category_Missing        891 non-null uint8
age_category_Infant         891 non-null uint8
age_category_Child          891 non-null uint8
age_category_Teenager       891 non-null uint8
age_category_Young Adult    891 non-null uint8
age_category_Adult          891 non-null uint8
age_category_Senior         891 non-null uint8
cabin_category_A            891 non-null uint8
cabin_category_B            891 non-null uint8
cabin_category_C            891 non-null uint8
cabin_category_D            891 non-null uint8
ca

### Feature selection

In [0]:
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

In [52]:
columns = train.columns
columns

Index(['PassengerId', 'Survived', 'Pclass', 'alone', 'fare_category_0-10',
       'fare_category_10-50', 'fare_category_50-100', 'fare_category_100+',
       'age_category_Missing', 'age_category_Infant', 'age_category_Child',
       'age_category_Teenager', 'age_category_Young Adult',
       'age_category_Adult', 'age_category_Senior', 'cabin_category_A',
       'cabin_category_B', 'cabin_category_C', 'cabin_category_D',
       'cabin_category_E', 'cabin_category_F', 'cabin_category_G',
       'cabin_category_T', 'cabin_category_Unknown', 'Sex_female',
       'title_Master', 'title_Miss', 'title_Mr', 'title_Mrs', 'title_Officer',
       'title_Royalty', 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')

In [0]:
cols = columns[2:]

In [0]:
target = columns[1]

In [0]:
import warnings
warnings.filterwarnings('ignore')

In [56]:
rfc = RandomForestClassifier(random_state=1)
selector = RFECV(rfc, cv=10)
selector.fit(train[cols], train[target])

RFECV(cv=10,
      estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                       criterion='gini', max_depth=None,
                                       max_features='auto', max_leaf_nodes=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       n_estimators='warn', n_jobs=None,
                                       oob_score=False, random_state=1,
                                       verbose=0, warm_start=False),
      min_features_to_select=1, n_jobs=None, scoring=None, step=1, verbose=0)

In [0]:
best_features = selector.support_
features = cols[best_features]
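
`selector.support_` is a boolean mask with one entry per candidate column, so indexing `cols` with it keeps only the selected features. A small sketch with a made-up mask (the column subset and mask values are assumptions, not the result of the fit above):

```python
import numpy as np
import pandas as pd

# Three of the candidate columns from the list above.
cols_demo = pd.Index(['Pclass', 'alone', 'Sex_female'])

# Hypothetical mask in the shape of selector.support_.
support_demo = np.array([True, False, True])

# Boolean indexing keeps the columns whose mask entry is True.
features_demo = cols_demo[support_demo]
print(list(features_demo))
```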

### Model selection

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [0]:
models = [
    {
        'name': 'LogisticRegression',
        'estimator': LogisticRegression(),
        'hyperparameters': {
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
        }
    },
    {
        'name': 'KNeighborsClassifier',
        'estimator': KNeighborsClassifier(),
        'hyperparameters': {
            'n_neighbors': range(1, 20),
            'weights': ['distance', 'uniform'],
            'algorithm': ['ball_tree', 'kd_tree', 'brute'],
            'p': [1, 2]
        }
    },
    {
        'name': 'RandomForestClassifier',
        'estimator': RandomForestClassifier(),
        'hyperparameters': {
            'criterion': ['gini', 'entropy'],
            'max_depth': [2, 5, 10],
            'max_features': ["log2", "sqrt"],
            'min_samples_leaf': [1, 5, 8],
            'min_samples_split': [2, 3, 5]
        }
    }
]

In [60]:
for model in models:
    grid = GridSearchCV(model['estimator'], param_grid=model['hyperparameters'], cv=10)
    grid.fit(train[features], train[target])
    model['best_params'] = grid.best_params_
    model['best_score'] = grid.best_score_
    model['best_model'] = grid.best_estimator_
    print(model['name'])
    print(model['best_score'])
    print(model['best_params'])
    

LogisticRegression
0.8103254769921436
{'solver': 'newton-cg'}
KNeighborsClassifier
0.8282828282828283
{'algorithm': 'ball_tree', 'n_neighbors': 9, 'p': 2, 'weights': 'distance'}
RandomForestClassifier
0.8226711560044894
{'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2}
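
The three scores can also be compared programmatically instead of by eye. This sketch copies the scores printed above into a plain list (`results` is a stand-in for the `models` list) and picks the entry with the highest cross-validated accuracy.

```python
# Scores copied from the grid-search output above.
results = [
    {'name': 'LogisticRegression', 'best_score': 0.8103254769921436},
    {'name': 'KNeighborsClassifier', 'best_score': 0.8282828282828283},
    {'name': 'RandomForestClassifier', 'best_score': 0.8226711560044894},
]

# Pick the entry with the highest cross-validated accuracy.
best = max(results, key=lambda m: m['best_score'])
print(best['name'])
```

The gap between the top two scores is small, so choosing a model on other grounds remains reasonable.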


### Submission to Kaggle

In [0]:
# models[2] is the RandomForestClassifier entry
predictions = models[2]['best_model'].predict(test[features])

In [62]:
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': predictions
})
submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
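
The notebook stops before writing the submission to disk. A minimal sketch of that last step, assuming the two-column layout shown above, is given below. The rows are the five displayed above, and a `StringIO` buffer stands in for a file path such as 'submission.csv' (a hypothetical file name).

```python
import io
import pandas as pd

# The first rows of the submission table shown above.
submission_demo = pd.DataFrame({
    'PassengerId': [892, 893, 894, 895, 896],
    'Survived': [0, 0, 0, 0, 1],
})

# index=False omits the row index so only the two named columns are written.
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])
```

With the real data this would become `submission.to_csv(path, index=False)` for whatever path is chosen.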
