### Problem Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

### Libraries

In [1]:
import pandas as pd
import numpy as np

from pandas import get_dummies as dummies

### Functions

In [2]:
def read_csv_col(col_list):
    df = pd.read_csv('train.csv',
                     usecols = col_list
                    )
    return df

In [3]:
def to_list_objectis(DataFrame):
    """Print object-dtype columns with their unique-value counts; return their names."""
    obj_features = DataFrame.dtypes[DataFrame.dtypes == object].index
    obj_describe = []

    for feat in obj_features:
        # unique() counts NaN as a value too.
        n_unique = len(DataFrame[feat].unique())
        obj_describe.append([feat, n_unique])

    # Most varied columns first.
    obj_describe.sort(key = lambda x: x[1], reverse = True)
    print('Object Features:', *obj_describe, sep='\n- ')

    return list(obj_features)

In [4]:
def to_list_missing(DataFrame):
    """Print columns that contain NaN with their NaN counts; return their names."""
    missing_features = DataFrame.columns[DataFrame.isna().sum() > 0]
    mis_describe = []

    for feat in missing_features:
        n_missing = DataFrame[feat].isna().sum()
        mis_describe.append([feat, n_missing])

    # Columns with the most missing values first.
    mis_describe.sort(key = lambda x: x[1], reverse = True)
    print('Missing Features:', *mis_describe, sep='\n- ')

    return list(missing_features)

In [5]:
data = pd.read_csv('train.csv')

missin_features   = to_list_missing(data)
print('')
objectis_features = to_list_objectis(data)
del data

Missing Features:
- ['Cabin', 687]
- ['Age', 177]
- ['Embarked', 2]

Object Features:
- ['Name', 891]
- ['Ticket', 681]
- ['Cabin', 148]
- ['Embarked', 4]
- ['Sex', 2]


#### Age

In [6]:
x_age = read_csv_col(['Age']);

age_mean = np.mean(x_age['Age'])
x_age.fillna(age_mean, inplace = True)

In [7]:
n_bins = 8
_, age_bins = pd.cut(x_age['Age'], bins = n_bins, retbins=True)

In [8]:
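def helper_Age(value):
    # Return the index of the first bin edge that is >= value (1..n_bins).
    # Falls through (returns None) if value exceeds every edge.
    for is_in_bin, bound in enumerate(age_bins):
        if value <= bound:
            return is_in_bin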

In [9]:
x_age['Age'] = x_age['Age'].apply(helper_Age)/n_bins
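
The binning in cells 7 to 9 can also be expressed with `pd.cut` alone. The sketch below runs on a made-up `ages` series (an assumption, not the notebook's data). With `labels=False`, `pd.cut` returns zero-based bin codes, so adding 1 and dividing by the bin count should reproduce what the loop-based helper yields.

```python
import pandas as pd

# Hypothetical sample ages (not taken from train.csv).
ages = pd.Series([0.42, 5.0, 22.0, 29.7, 38.0, 54.0, 80.0])
n_bins = 8

# Loop-based version, mirroring helper_Age.
_, age_bins = pd.cut(ages, bins=n_bins, retbins=True)

def bin_index(value):
    for i, bound in enumerate(age_bins):
        if value <= bound:
            return i

looped = ages.apply(bin_index) / n_bins

# Vectorised version: zero-based codes shifted to 1..n_bins, then scaled.
codes = pd.cut(ages, bins=n_bins, labels=False)
vectorised = (codes + 1) / n_bins
```

Both series hold the same values; the vectorised form avoids the Python-level loop per row.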

#### Embarked  

In [10]:
x_embarked = read_csv_col(['Embarked']);

embarked_top = list(x_embarked['Embarked'].value_counts().axes[0])[0]
x_embarked.fillna(embarked_top, inplace = True)
x_embarked.replace({'C':'Cherbourg', 'S':'Southampton', 'Q':'Queenstown'}, 
                   inplace = True)

In [11]:
x_embarked = dummies(x_embarked['Embarked'], prefix = 'Embark')
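
The impute-then-encode step above can be tried on a small example. This is a minimal sketch on a made-up `embarked` series (an assumption for illustration); `Series.mode()` is used here in place of `value_counts()` to find the most frequent port, and `dtype=int` is passed so the indicator columns are 0/1 integers.

```python
import pandas as pd

# Hypothetical port codes with one missing entry.
embarked = pd.Series(['S', 'C', None, 'S', 'Q'], name='Embarked')

# Most frequent value; mode() ignores missing entries by default.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

# One indicator column per port.
one_hot = pd.get_dummies(filled, prefix='Embark', dtype=int)
```

Each row of `one_hot` has exactly one column set to 1, because every missing value was filled before encoding.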

#### Cabin

In [12]:
x_cabin = read_csv_col(['Cabin'])

floors = ('A','B','C','D','E','F','G')

In [13]:
def helper_cabin(element):
    # Return the first deck letter (checked in A..G order) found in the cabin string.
    for level in floors:
        if level in element:
            return level
    return element

In [14]:
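# Mark the 'T' cabin and missing values as unknown, then reduce each cabin to its deck letter.
# Assigning the result back avoids relying on in-place changes to a selected column.
x_cabin['Cabin'] = x_cabin['Cabin'].replace({'T':'NaN'}).fillna('NaN')
x_cabin['Cabin'] = x_cabin['Cabin'].apply(helper_cabin)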

In [15]:
x_cabin = dummies(x_cabin['Cabin'], prefix = 'Cabin')
x_cabin.drop('Cabin_NaN', axis = 1, inplace = True)

In [16]:
def helper_cabin_allocat(pdSerie):

    if pdSerie['Pclass'] == 1:
        floors = ['A','B','C','D','E']
        if pdSerie['Related'] > 2:
            floors = ['B', 'C']
        # Never reached as written: any value > 4 already satisfies the `> 2` test above.
        elif pdSerie['Related'] > 4:
            floors = ['C']
            
    elif pdSerie['Pclass'] == 2:
        floors = ['D','E','F']
        if pdSerie['Related'] > 1:
            floors = ['F']
    else:
        floors = ['E','F','G']
        if pdSerie['Related'] > 1:
            floors = ['G']
    
    col_floors = []
    for f in floors:
        col_floors += ['Cabin_' + f]
    
    returnSerie = pdSerie.copy(deep = True)
    
    for c in col_floors:
        returnSerie[c] = 1
    
    return returnSerie

In [17]:
data = read_csv_col(['SibSp','Parch','Pclass']) 

x_cabin['Related'] = data['SibSp']+data['Parch'];
x_cabin['Pclass']  = data['Pclass']

In [18]:
x_cabin = x_cabin.apply(helper_cabin_allocat, axis = 1)
x_cabin.drop(['Related','Pclass'],axis = 1, inplace = True)
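
If the narrower deck list for large first-class groups is meant to apply, the stricter threshold has to be tested before the looser one. The function `first_class_decks` below is a hypothetical, standalone sketch of that ordering. It is not the code that produced the outputs recorded in this notebook.

```python
def first_class_decks(related):
    """Candidate decks for a first-class passenger (illustrative only)."""
    if related > 4:      # strictest condition first, so it can be reached
        return ['C']
    if related > 2:
        return ['B', 'C']
    return ['A', 'B', 'C', 'D', 'E']
```

With this ordering, a passenger with five relatives aboard gets `['C']` rather than `['B', 'C']`.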

### Are Any NaNs Left?

In [19]:
print('Age NaN sum {}'.format(x_age.isna().sum()))

print('\nEmbarked NaN sum')
print(x_embarked.isna().sum())

print('\nCabin Allocation NaN sum:')
print(x_cabin.isna().sum())

Age NaN sum Age    0
dtype: int64

Embarked NaN sum
Embark_Cherbourg      0
Embark_Queenstown     0
Embark_Southampton    0
dtype: int64

Cabin Allocation NaN sum:
Cabin_A    0
Cabin_B    0
Cabin_C    0
Cabin_D    0
Cabin_E    0
Cabin_F    0
Cabin_G    0
dtype: int64


#### Name

In [20]:
x_name = read_csv_col(['Name'])

name_title = ('Mr', 'Mrs','Ms','Mme','Miss', 'Mlle',
              'Master','Dr','Don','Countess',
              'Major','Cap','Col',
              'Rev')

In [21]:
def helper_name(element):
    # Substring test in tuple order: the first title found in the name wins.
    for title in name_title:
        if title in element:
            return title
    return element

In [22]:
x_name['Name'] = x_name['Name'].apply(helper_name)
# Merge equivalent or rare titles into broader groups.
x_name['Name'] = x_name['Name'].replace({'Mme':'Ms',
                                         'Mlle':'Miss',
                                         'Major':'Arm',
                                         'Cap':'Arm',
                                         'Col':'Arm',
                                         'Countess':'Nobil',
                                         'Don':'Nobil',
                                         'Reuchlin, Jonkheer. John George':'Mr'
                                        })

In [23]:
x_name = dummies(x_name, prefix = 'Title')
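
Because `helper_name` tests substrings in tuple order, a name containing "Mrs." matches `'Mr'` before `'Mrs'` is ever tried. One hedged alternative is to capture the token between the comma and the first period with a regular expression. The sample names and both helper functions below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up names in the "Surname, Title. Given names" format.
names = pd.Series([
    'Doe, Mr. John',
    'Smith, Mrs. Jane (Mary Brown)',
    'Roe, Miss. Ann',
    'Poe, Master. Tim',
])

def substring_title(name, titles=('Mr', 'Mrs', 'Miss', 'Master')):
    # Same strategy as helper_name: first substring hit wins.
    for title in titles:
        if title in name:
            return title
    return name

def extract_title(series):
    # Capture the text between ", " and the first "." in each name.
    return series.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

by_substring = names.apply(substring_title)
by_regex = extract_title(names)
```

On this sample the substring approach labels the second name `Mr`, while the regular expression returns `Mrs`.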

#### Sex

In [24]:
x_sex = read_csv_col(['Sex'])

In [25]:
def helper_sex(element):
    if element == 'male':
        return 1
    elif element == 'female':
        return -1
    return 0

In [26]:
x_sex['Sex'] = x_sex['Sex'].apply(helper_sex)
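
The same encoding can be written as a dictionary lookup. A minimal sketch, assuming a made-up `sex` series: `Series.map` yields NaN for values absent from the dictionary, so `fillna(0)` plays the role of the fallback `return 0` in `helper_sex`.

```python
import pandas as pd

# Hypothetical values, including one missing entry.
sex = pd.Series(['male', 'female', None, 'female'])

# Unmapped or missing values become NaN, then 0.
encoded = sex.map({'male': 1, 'female': -1}).fillna(0).astype(int)
```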

#### Ticket

The `Ticket` column does not appear to add unique information, so it is not used.

### Data

In [27]:
drop_col = missin_features + list(set(objectis_features) - set(missin_features))
x = pd.read_csv('train.csv',
                usecols = lambda x: x not in drop_col+['PassengerId', 'Survived']
               )
y = read_csv_col(['Survived'])

In [28]:
x.describe()

| | Pclass | SibSp | Parch | Fare |
|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 2.308642 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.836071 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 3.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 3.0 | 1.0 | 0.0 | 31.0 |
| max | 3.0 | 8.0 | 6.0 | 512.3292 |


In [29]:
def helper_normalize(pdColumn):
    # Scale the column so its largest value becomes 1.
    maximum = pdColumn.max()
    return pdColumn/maximum

In [30]:
x = x.apply(helper_normalize)

x.describe()

| | Pclass | SibSp | Parch | Fare |
|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.769547 | 0.065376 | 0.063599 | 0.062858 |
| std | 0.27869 | 0.137843 | 0.134343 | 0.096995 |
| min | 0.333333 | 0.0 | 0.0 | 0.0 |
| 25% | 0.666667 | 0.0 | 0.0 | 0.01544 |
| 50% | 1.0 | 0.0 | 0.0 | 0.028213 |
| 75% | 1.0 | 0.125 | 0.0 | 0.060508 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |
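
Dividing by the column maximum maps the largest value to 1 and keeps zeros at zero, but it does not move a non-zero minimum to 0, which is why `Pclass` has a minimum of 0.333333 in the table above. The sketch below contrasts this with min-max scaling on a made-up frame `df` (an assumption for illustration). Min-max scaling is shown only for comparison and is not what the notebook uses.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow.
df = pd.DataFrame({'Pclass': [1, 2, 3], 'Fare': [0.0, 10.0, 40.0]})

# Max scaling: what helper_normalize does column by column.
max_scaled = df / df.max()

# Min-max scaling: shifts each column so its minimum becomes 0.
min_max_scaled = (df - df.min()) / (df.max() - df.min())
```

For `Fare`, whose minimum is already 0, the two methods agree; for `Pclass` they differ.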


In [31]:
x = pd.concat([x,
               x_age,
               x_cabin,
               x_embarked,
               x_name,
               x_sex
              ], axis = 1)

In [32]:
pd.set_option('display.max_columns', 26)
x.describe()

| | Pclass | SibSp | Parch | Fare | Age | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Embark_Cherbourg | Embark_Queenstown | Embark_Southampton | Title_Arm | Title_Dr | Title_Master | Title_Miss | Title_Mr | Title_Ms | Title_Nobil | Title_Rev | Sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.769547 | 0.065376 | 0.063599 | 0.062858 | 0.417508 | 0.227834 | 0.242424 | 0.242424 | 0.382716 | 0.811448 | 0.634119 | 0.551066 | 0.188552 | 0.08642 | 0.725028 | 0.005612 | 0.007856 | 0.044893 | 0.203143 | 0.727273 | 0.002245 | 0.002245 | 0.006734 | 0.295174 |
| std | 0.27869 | 0.137843 | 0.134343 | 0.096995 | 0.164532 | 0.41967 | 0.42879 | 0.42879 | 0.486323 | 0.391372 | 0.481947 | 0.497665 | 0.391372 | 0.281141 | 0.446751 | 0.074743 | 0.088337 | 0.207186 | 0.402564 | 0.445612 | 0.047351 | 0.047351 | 0.08183 | 0.95598 |
| min | 0.333333 | 0.0 | 0.0 | 0.0 | 0.125 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 |
| 25% | 0.666667 | 0.0 | 0.0 | 0.01544 | 0.375 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 |
| 50% | 1.0 | 0.0 | 0.0 | 0.028213 | 0.375 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1.0 | 0.125 | 0.0 | 0.060508 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [33]:
to_list_missing(x)
print('')
to_list_objectis(x)

Missing Features:

Object Features:


[]

In [34]:
x.to_csv('x_train.csv', index = False)
y.to_csv('y_train.csv', index = False)