# Titanic challenge with XGBoost - Preprocessing

This notebook resolves Titanic challenge using XGBoost model. The target is to get familiar with AWS Sagemaker and one of the most popular model - eXtreme Gradient Boosting.
The original challenge is defined at https://www.kaggle.com/c/titanic/data

This notebook is divided into 3 parts:
- The first part is about analyzing and visualize data.
- The second part is about feature engineering.
- The last part is for training and prediction.

In [2]:
# import libraries
import pandas as pd
import numpy as np


In [3]:
train_data_file = './data/raw/train.csv'
test_data_file = './data/raw/test.csv'

try:
    train_df = pd.read_csv(train_data_file)
    test_df = pd.read_csv(test_data_file)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: Data loaded into dataframe.


In [4]:
len_train_df = train_df.shape[0]

## 2. Feature engineering

#### 2.1. Keep it as it was
We keep the raw data and just do some necessary transformations to make it works with XGBoost.

In [5]:
titanic_df = pd.concat([train_df,test_df])

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  """Entry point for launching an IPython kernel.


In [6]:
def pre_processing(df):
    # drop columns Name, Ticket, Cabin
    drop_columns = [c for c in ['Name', 'Ticket', 'Cabin'] if c in df.columns]
    df = df.drop(drop_columns, axis=1)
    # one hot encoded
    encoded_columns = [c for c in ['Sex', 'Pclass', 'Embarked'] if c in df.columns]
    df = pd.get_dummies(df, columns=encoded_columns, drop_first=True)
    return df

In [7]:
titanic_df = pre_processing(titanic_df)

In [8]:
# split to train and test df again
train_df_clean = titanic_df[:len_train_df]
test_df_clean = titanic_df[len_train_df:]

In [9]:
print(train_df_clean.shape,test_df_clean.shape)
train_df_clean.head(5)

(891, 11) (418, 11)


Unnamed: 0,Age,Fare,Parch,PassengerId,SibSp,Survived,Sex_male,Pclass_2,Pclass_3,Embarked_Q,Embarked_S
0,22.0,7.25,0,1,1,0.0,1,0,1,0,1
1,38.0,71.2833,0,2,1,1.0,0,0,0,0,0
2,26.0,7.925,0,3,0,1.0,0,0,1,0,1
3,35.0,53.1,0,4,1,1.0,0,0,0,0,1
4,35.0,8.05,0,5,0,0.0,1,0,1,0,1


In [10]:
# remove the column Survived in test_df
test_df_clean = test_df_clean.drop(['Survived'], axis=1)

In [11]:
# modify the first column to make it work with xgboost 
train_df_clean = pd.concat([train_df_clean['Survived'], train_df_clean.drop(['Survived'], axis=1)], axis=1)

In [12]:
# split to training dataset and validation dataset
train_df_clean, validate_df_clean = np.split(train_df_clean.sample(frac=1, random_state=1892), [int(0.8 * len(train_df_clean))])

In [13]:
# to csv
train_df_clean.to_csv('data/processed/exp-raw/train.csv', header=False, index=False)
validate_df_clean.to_csv('data/processed/exp-raw/validation.csv', header=False, index=False)
test_df_clean.to_csv('data/processed/exp-raw/test.csv', header=False, index=False)

##### 2.2. Additional data engineering technique
Fill missing value and extract title from Name

In [14]:
def pre_processing_advantage(df):
    # drop columns Embarked, Ticket and Cabin
    drop_columns = [c for c in ['Embarked', 'Ticket', 'Cabin'] if c in df.columns]
    df = df.drop(drop_columns, axis=1)
    # handle name
    df['Name'] = df['Name'].str.split(',', expand=True)[1].str.split('.', expand=True)[0]
    replacement = {'Ms': 'Miss',
        'Mlle': 'Miss',
        'Mme': 'Mrs'}
    for i, v in replacement.items():
        train_df['Name'] = train_df['Name'].str.replace(i,v)
    # fill missing values
    # Age
    df['Age'] = df['Age'].fillna(value=df['Age'].median())
    title_median_age = df.groupby(['Name'])['Age'].median()
    for title, median_age in title_median_age.items():
        df.loc[(df['Age'].isnull()) & (df['Name']==title), 'Age'] = median_age
    # one hot encoded
    encoded_columns = [c for c in ['Sex', 'Pclass', 'Name'] if c in df.columns]
    df = pd.get_dummies(df, columns=encoded_columns, drop_first=True)
    
    return df

In [15]:
titanic_df = pd.concat([train_df,test_df])

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  """Entry point for launching an IPython kernel.


In [16]:
titanic_df = pre_processing_advantage(titanic_df)

In [17]:
# split to train and test df again
train_df_clean = titanic_df[:len_train_df]
test_df_clean = titanic_df[len_train_df:]

In [18]:
print(train_df_clean.shape,test_df_clean.shape)
train_df_clean.head(6)

(891, 26) (418, 26)


Unnamed: 0,Age,Fare,Parch,PassengerId,SibSp,Survived,Sex_male,Pclass_2,Pclass_3,Name_ Col,...,Name_ Master,Name_ Miss,Name_ Mlle,Name_ Mme,Name_ Mr,Name_ Mrs,Name_ Ms,Name_ Rev,Name_ Sir,Name_ the Countess
0,22.0,7.25,0,1,1,0.0,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0
1,38.0,71.2833,0,2,1,1.0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,26.0,7.925,0,3,0,1.0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
3,35.0,53.1,0,4,1,1.0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,35.0,8.05,0,5,0,0.0,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0
5,28.0,8.4583,0,6,0,0.0,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0


In [19]:
# modify the first column to make it work with xgboost 
train_df_clean = pd.concat([train_df_clean['Survived'], train_df_clean.drop(['Survived'], axis=1)], axis=1)

In [20]:
# split to training dataset and validation dataset
train_df_clean, validate_df_clean = np.split(train_df_clean.sample(frac=1, random_state=1892), [int(0.8 * len(train_df_clean))])

In [21]:
# remove the column Survived in test_df
test_df_clean = test_df_clean.drop(['Survived'], axis=1)

In [25]:
# to csv
train_df_clean.to_csv('./data/processed/exp-features/train.csv', header=False, index=False)
validate_df_clean.to_csv('./data/processed/exp-features/validation.csv', header=False, index=False)
test_df_clean.to_csv('./data/processed/exp-features/test.csv', header=False, index=False)