# Titanic

Feature engineering illustrated on the Titanic problem.
To learn more and join the competition, click [here](https://www.kaggle.com/c/titanic/).


First we'll import the packages we need.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


I've already downloaded the data. Now let's read it.

In [2]:
data = pd.read_csv('../input/titanic/train.csv')
# PassengerId column is not something to learn on
data.drop(columns=['PassengerId'], inplace=True)
test = pd.read_csv('../input/titanic/test.csv')
test.drop(columns=['PassengerId'], inplace=True)
X = data.drop(columns=['Survived'])
# Survived column is what we will be predicting
Y = data['Survived']

Some categorical features have too many distinct values. Let's find out which ones.

In [3]:
for col in X.columns:
    print(X[col].value_counts(), end='\n\n')

3    491
1    216
2    184
Name: Pclass, dtype: int64

Sage, Mr. Frederick                                          1
Goldsmith, Mr. Frank John                                    1
Andersson, Miss. Sigrid Elisabeth                            1
Graham, Mrs. William Thompson (Edith Junkins)                1
Nankoff, Mr. Minko                                           1
                                                            ..
Bostandyeff, Mr. Guentcho                                    1
Calic, Mr. Petar                                             1
Birkeland, Mr. Hans Martin Monsen                            1
Sirayanian, Mr. Orsen                                        1
Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)    1
Name: Name, Length: 891, dtype: int64

male      577
female    314
Name: Sex, dtype: int64

24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
         ..
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: Age, Length: 88, dtype

These columns are ***Ticket***, ***Name*** and ***Cabin***.

Let's reduce the number of distinct ***Ticket*** values by keeping only the first character of the ticket's last word. This should give us some information about the ticket's class.

In [4]:
X['Ticket'] = X['Ticket'].apply(lambda s:  str(s.split()[-1][0]))
test['Ticket'] = test['Ticket'].apply(lambda s:  str(s.split()[-1][0]))
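
To see what this transformation keeps, here is a small sketch applied to a few sample ticket strings. The strings are chosen for illustration only, and `ticket_key` is a hypothetical helper name wrapping the same expression as the lambda above.

```python
import pandas as pd

# Sample ticket strings, for illustration only
tickets = pd.Series(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', 'LINE'])

def ticket_key(s):
    # Take the last whitespace-separated word, then its first character
    return s.split()[-1][0]

reduced = tickets.apply(ticket_key)
print(reduced.tolist())
```

Tickets with a prefix and tickets without one end up in the same small set of one-character categories.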

The full name doesn't give us much information. Let's trim it to just the surname.

In [5]:
X['Name'] = X['Name'].apply(lambda s: s.split(',')[0])
test['Name'] = test['Name'].apply(lambda s: s.split(',')[0])

Let's also take into account the number of people that have the same surname, as this will probably indicate the size of the family.

In [6]:
XNamevalue_counts = X['Name'].value_counts()
testNamevalue_counts = test['Name'].value_counts()
def GetFamilySize(name):
    # Count passengers with this surname in both datasets;
    # a surname absent from one of them contributes 0
    return XNamevalue_counts.get(name, 0) + testNamevalue_counts.get(name, 0)
X['Family_size'] = X['Name'].apply(GetFamilySize)
test['Family_size'] = test['Name'].apply(GetFamilySize)
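
The same count can also be obtained from a single combined tally of surnames. The following is a minimal sketch on toy data; the names `train_names`, `test_names` and `surname_counts` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy surname columns standing in for the training and testing datasets
train_names = pd.Series(['Sage', 'Sage', 'Calic', 'Graham'])
test_names = pd.Series(['Sage', 'Graham', 'Nankoff'])

# Count each surname once over both datasets combined
surname_counts = pd.concat([train_names, test_names]).value_counts()

# Look up the combined count for every passenger
train_family_size = train_names.map(surname_counts)
test_family_size = test_names.map(surname_counts)

print(train_family_size.tolist())
print(test_family_size.tolist())
```

Note that unrelated passengers who happen to share a surname are counted together, so this is only a proxy for family size.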

Let's categorize the ***Cabins*** too, by taking their first letter.

In [7]:
# Missing cabins are NaN; str(NaN) is 'nan', so they all fall into category 'n'
X['Cabin'] = X['Cabin'].apply(lambda s: str(s)[0])
test['Cabin'] = test['Cabin'].apply(lambda s: str(s)[0])

There are some missing values for ***Age*** in both the training and testing datasets. Let's impute them with the mean of all the ages we have (from both training and testing).

In [8]:
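# number of known (non-missing) ages in each dataset
X_size_age = X['Age'].count()
test_size_age = test['Age'].count()
# weighted average of the two means, i.e. the mean over all known ages
mean_age = (X_size_age * X['Age'].mean() +
            test_size_age * test['Age'].mean()) / \
            (X_size_age + test_size_age)
# assign back: newer pandas versions warn about a chained inplace fillna
# on a selected column, and it may not modify the DataFrame
X['Age'] = X['Age'].fillna(mean_age)
test['Age'] = test['Age'].fillna(mean_age)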
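
The weighted average of the two per-dataset means is the same number as the mean of all known ages pooled together. A small sketch with made-up ages (illustrative values and variable names, not taken from the dataset) shows the two agreeing:

```python
import numpy as np
import pandas as pd

# Toy age columns with missing values (illustrative numbers)
train_age = pd.Series([20.0, 30.0, np.nan, 40.0])
test_age = pd.Series([10.0, np.nan])

# Weighted average of the per-dataset means
n_train = train_age.count()  # number of known ages in the toy training column
n_test = test_age.count()    # number of known ages in the toy testing column
weighted_mean = (n_train * train_age.mean() + n_test * test_age.mean()) / (n_train + n_test)

# Mean over all known ages pooled together (NaN values are skipped)
pooled_mean = pd.concat([train_age, test_age]).mean()

print(weighted_mean, pooled_mean)
```

Pooling with `pd.concat` is therefore a shorter way to get the same imputation value.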

There's a missing value for ***Fare*** in the testing dataset. Let's impute it the same way.

In [9]:
# number of known (non-missing) fares in each dataset
X_size_fare = X['Fare'].count()
test_size_fare = test['Fare'].count()
mean_fare = (X_size_fare * X['Fare'].mean() +
            test_size_fare * test['Fare'].mean()) / \
            (X_size_fare + test_size_fare)
# assign back rather than using a chained inplace fillna
test['Fare'] = test['Fare'].fillna(mean_fare)

Since sklearn works only with numerical values, we need to one-hot encode the non-numerical columns.

In [10]:
X_dum = pd.get_dummies(X, drop_first=True)
X_test_dum = pd.get_dummies(test, drop_first=True)
X_dum.shape , X_test_dum.shape

((891, 692), (418, 375))

We only need the columns that are present in both the training and testing datasets.

In [11]:
# symmetric difference:
# the columns that appear in only one of the two datasets
# (named Index methods are used because the ^ and & operators no longer
# perform set operations on an Index in recent pandas versions)
unnecessary_columns = X_test_dum.columns.symmetric_difference(X_dum.columns)
# these are the columns that exist in the training dataset, but not in testing
unnecessary_columns_X_dum = unnecessary_columns.intersection(X_dum.columns)
# these are the columns that exist in the testing dataset, but not in training
unnecessary_columns_X_test_dum = unnecessary_columns.intersection(X_test_dum.columns)
X_dum = X_dum.drop(columns=unnecessary_columns_X_dum)
X_test_dum = X_test_dum.drop(columns=unnecessary_columns_X_test_dum)
# after this, the training and testing datasets should have the same columns
all(X_dum.columns == X_test_dum.columns)

True
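
The effect is easier to see on a toy example. The sketch below (made-up data and illustrative variable names) one-hot encodes two small frames whose categories differ, then keeps only the shared columns with `Index.intersection`, which is a shorter route to the same kind of result:

```python
import pandas as pd

# Toy datasets: cabin 'C' appears only in training, 'D' only in testing
train = pd.DataFrame({'Fare': [7.0, 8.0, 9.0], 'Cabin': ['A', 'B', 'C']})
test = pd.DataFrame({'Fare': [5.0, 6.0, 7.5], 'Cabin': ['A', 'B', 'D']})

train_dum = pd.get_dummies(train, drop_first=True)
test_dum = pd.get_dummies(test, drop_first=True)

# Keep only the columns present in both encoded datasets
common = train_dum.columns.intersection(test_dum.columns)
train_dum = train_dum[common]
test_dum = test_dum[common]

print(list(train_dum.columns))
print(list(test_dum.columns))
```

Dummy columns for categories seen in only one dataset are dropped, so a model fitted on one frame can predict on the other.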

To fight class imbalance, let's oversample the minority class of the Survived column.
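
Random oversampling duplicates randomly chosen rows of the smaller class until the classes are the same size. The following pandas-only sketch on toy data illustrates the idea; it is not how `imblearn` implements `RandomOverSampler` internally, and all names in it are illustrative.

```python
import pandas as pd

# Toy data: 6 rows of class 0 and 2 rows of class 1
df = pd.DataFrame({'feature': range(8),
                   'Survived': [0, 0, 0, 0, 0, 0, 1, 1]})

counts = df['Survived'].value_counts()
majority_size = int(counts.max())

parts = []
for label, size in counts.items():
    rows = df[df['Survived'] == label]
    if size < majority_size:
        # Draw extra rows with replacement until this class matches the majority
        extra = rows.sample(n=majority_size - int(size), replace=True, random_state=0)
        rows = pd.concat([rows, extra])
    parts.append(rows)

balanced = pd.concat(parts, ignore_index=True)
print(balanced['Survived'].value_counts().to_dict())
```

The cell below applies the same idea to the encoded training data using `RandomOverSampler`.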

In [12]:
ros = RandomOverSampler()
A, B = ros.fit_resample(X_dum, Y)
X_ros = pd.DataFrame(A, columns=X_dum.columns)
Y_ros = pd.Series(B)

In case you want to do hyperparameter tuning, you can split the data into training and validation sets like this.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_ros,
                                                  Y_ros,
                                                  test_size=0.25)

Now it's time to fit a model to our data.

In [None]:
model_rf = RandomForestClassifier(criterion='gini', n_estimators=121, 
                                  max_features=80, random_state=18)
model_rf.fit(X_ros, Y_ros)
# accuracy on the training data itself (y_true first, then the predictions)
print('training accuracy: ', accuracy_score(Y_ros, model_rf.predict(X_ros)))
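
Accuracy measured on the rows the model was fitted on tends to be optimistic, so a held-out validation set usually gives a more honest estimate. Here is a minimal sketch on synthetic data; the `make_classification` data and the names used are assumptions for illustration, not the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Hold out a quarter of the rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

# Compare accuracy on seen rows with accuracy on unseen rows
train_acc = accuracy_score(y_tr, model.predict(X_tr))
val_acc = accuracy_score(y_va, model.predict(X_va))
print('training accuracy:  ', train_acc)
print('validation accuracy:', val_acc)
```

The same pattern applies to the `X_train`, `X_val`, `y_train`, `y_val` split shown earlier: fit on the training part and score on the validation part.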

Finally, we can predict on the test set and save the results for submission.

In [None]:
# rebuild PassengerId (dropped earlier), assuming the test IDs run consecutively from 892
pred_rf = pd.DataFrame({'PassengerId' : np.arange(len(X_test_dum)) + 892,
                        'Survived': model_rf.predict(X_test_dum)})
pred_rf.to_csv('pred_rf.csv', index=False)
pred_rf