In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

In [1]:
train = pd.read_csv('/kaggle/input/titanic/train.csv')
train.head()

# **Exploratory Data Analysis**

In [1]:
train.shape

In [1]:
train.info()

In [1]:
train.dtypes

In [1]:
def show_missing(data):
  ''' This function is used to show percentage of missing data. '''
  missing_values = data.isnull().sum()

  percent_missing = missing_values / data.shape[0] * 100
  percent_missing = percent_missing.round(2) 

  # Use a separate name so the local variable does not shadow the function itself
  summary = pd.concat([percent_missing, data.nunique(), data.dtypes], keys=['PercentageMissing', 'Nunique values', 'Dtype'], axis = 1)

  return summary

show_missing(train)



*   3/12 columns have missing values: Age (19.87%), Cabin (77.10%), and Embarked.
*   5/12 columns have object dtype and need converting before modeling.
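
To list only the columns that actually contain gaps, the summary can be filtered. The following is a minimal sketch on a hypothetical toy frame; the `toy` data and the `missing_only` helper are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (assumption).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

def missing_only(data):
    """Return the percentage of missing values, only for columns that have any."""
    percent = (data.isnull().mean() * 100).round(2)
    return percent[percent > 0].sort_values(ascending=False)

summary = missing_only(toy)
print(summary)
```

On the real data the same call would be `missing_only(train)`.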



# Data Visualization

In [1]:
survivors = train[train['Survived'] == 1].shape[0]
deads = train.shape[0] - survivors

# Format the shares directly as percentages to avoid floating-point artifacts from round(...) * 100
print(f'Survivors: {survivors / train.shape[0]:.0%}\nDead: {deads / train.shape[0]:.0%}')

sns.countplot(data=train, x='Survived')

### Sex Column

In [1]:
sns.countplot(data=train, hue='Survived', x='Sex')



*   Men were much less likely to be rescued than women.




### Pclass Column

In [1]:
# Survivors rate by class
train.groupby('Pclass')['Survived'].mean().to_frame()

In [1]:
print(train['Pclass'].value_counts())

In [1]:
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
sns.barplot(data=train, x='Pclass', y='Survived', palette=['green'], ci=None, ax=ax[0])
ax[0].set_title('Survived Rate by Pclass')
sns.countplot(data=train, x='Pclass', hue='Survived', palette=['red', 'blue'], ax=ax[1])
ax[1].set_title('Survived or Dead by Pclass')

Pclass
*   There were three classes on the ship and from the plot we see that the number of passengers in the third class was higher than the number of passengers in the first and second classes combined.
*   However, the survival rate differs by class: more than 60% of first-class passengers and around half of second-class passengers were rescued, whereas about 75% of third-class passengers did not survive the disaster.
*   For this reason, Pclass is an important feature to consider.



### Pclass & Sex Columns

In [1]:
train.groupby(['Pclass', 'Sex']).Survived.mean().to_frame()

In [1]:
sns.barplot(data=train, x='Pclass', y='Survived', hue='Sex', palette=['red', 'blue'], ci=None)
plt.title('Survival rate by Pclass and Sex')



*  Breaking the survival rate down by Sex and Pclass shows a striking pattern: 97% of first-class women and 92% of second-class women were rescued, while the rate drops to 50% for third-class women.
*  Even so, that is still higher than the 37% survival rate for first-class men.
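
The same breakdown can be read as a two-way table instead of a grouped frame. This is a small sketch using `pivot_table` on a hypothetical toy frame (the `toy` values are made up for illustration and are not Titanic figures).

```python
import pandas as pd

# Hypothetical toy data (assumption), not the real passenger list.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 1],
    "Sex":      ["female", "female", "male", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Rows are classes, columns are sexes, cells are mean survival (the survival rate).
rates = toy.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean")
print(rates)
```

With the notebook's data, replacing `toy` by `train` should give the class-by-sex rates discussed above in one table.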



### Age Column

In [1]:
train['Age'].value_counts().head(10)

In [1]:
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

sns.distplot(x=train['Age'], bins=40, kde=True, ax=ax[0], color='g')
ax[0].set_title('Age Distribution')

ax[1].set_title('Age distribution for the two subpopulations')
sns.kdeplot(train['Age'].loc[train['Survived'] == 1], color='green', ax=ax[1], shade=True, label='Survived')
sns.kdeplot(train['Age'].loc[train['Survived'] == 0], color='red', ax=ax[1], shade=True, label='Not Survived')
# Show the 'Survived' / 'Not Survived' labels passed to kdeplot
ax[1].legend()

### Age & Sex Columns

In [1]:
plt.figure(figsize=(10, 6))
sns.swarmplot(y='Sex', x='Age', hue='Survived', palette=('#C52219', '#23C552'), data=train)
plt.title('Survived by age and sex')

* At first glance, the relationship between Age and Survived is not very clear. There is a visible peak among young passengers for those who survived, but apart from that the plot is not very informative.
* The feature becomes more useful once Sex is considered too: a good number of male survivors were younger than 12, while the female group shows no particular pattern.
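
The swarm plot observation can also be checked numerically. Below is a hedged sketch on a hypothetical toy frame; the cut-off of 12 years comes from the text above, while the `AgeGroup` column name and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption), chosen only to illustrate the grouping.
toy = pd.DataFrame({
    "Age":      [4, 9, 30, 45, 6, 28, 52, 11],
    "Sex":      ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [1, 1, 0, 0, 1, 1, 0, 1],
})

# Label passengers younger than 12 as children. A missing Age compares as False,
# so on real data passengers without an age would fall into the "adult" group.
toy["AgeGroup"] = np.where(toy["Age"] < 12, "child", "adult")

rates = toy.groupby(["Sex", "AgeGroup"])["Survived"].mean()
print(rates)
```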

### Fare Column

In [1]:
train['Fare'].value_counts()

In [1]:
fig, ax = plt.subplots(1,2,figsize=(12,6))

sns.distplot(train.Fare, color='g', ax=ax[0])
ax[0].set_title('Fare distribution')

fare_range = pd.qcut(train.Fare, 4, labels = ['Low', 'Mid', 'High', 'Very high'])
sns.barplot(x=fare_range, y=train.Survived, palette='mako', ci=None, ax=ax[1])
ax[1].set_ylabel('Survival rate')

### Fare & Sex

In [1]:
sns.swarmplot(x='Sex', y='Fare', hue='Survived', palette=('#C52219', '#23C552'), data=train)
plt.title('Survived by fare and sex')

* Looking at the more detailed plot, we also see for example that all males with fare between 200 and 300 died.
* For this reason, we can leave the Fare feature as it is to avoid losing too much information; at deeper levels of a tree, a more discriminating relationship might open up and Fare could become a good group detector.
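
A claim like "all males in a fare band died" can be verified with a boolean mask rather than read off the plot. The sketch below uses `Series.between` on a hypothetical toy frame; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data (assumption).
toy = pd.DataFrame({
    "Sex":      ["male", "male", "female", "male", "female"],
    "Fare":     [211.5, 262.4, 227.5, 7.25, 512.3],
    "Survived": [0, 0, 1, 1, 1],
})

# between() is inclusive on both ends by default.
in_band = toy["Fare"].between(200, 300)

# Survival rate per sex, restricted to passengers inside the fare band.
band_rate = toy[in_band].groupby("Sex")["Survived"].mean()
print(band_rate)
```

Running the same mask on `train` would confirm or refute the observation for the real data.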

### SibSp and Parch Columns

In [1]:
# Family size: siblings/spouses + parents/children + 1 for the passenger themselves
train['Nmember'] = train['SibSp'] + train['Parch'] + 1
print(train['Nmember'].value_counts())

sns.countplot(data=train, x='Nmember', hue='Survived', palette=['red', 'blue'])

### Ticket Column

In [1]:
train['Ticket'].value_counts().head(10)

In [1]:
# Calculate length of ticket
train['Ticket_len'] = train.Ticket.apply(lambda x: len(x))
train['Ticket_len'].value_counts()

### Cabin Column

In [1]:
print(train['Cabin'].unique())

# Extract the deck letter (the first character of the cabin code)
train['Cabin'] = train['Cabin'].str.get(0)

sns.countplot(data=train, x='Cabin', hue='Survived')

In [1]:
train['Cabin'].value_counts()

### Embarked Column

In [1]:
print(train['Embarked'].value_counts())
sns.countplot(data=train, x='Embarked', hue='Survived')

# Data Processing

In [1]:
train.drop(columns=['SibSp', 'Parch'], inplace=True)
train.head()

### Functions for transforming data

In [1]:
def remove_zero_fares(row): # Treat a zero fare as a missing value
  if row.Fare == 0:
    row.Fare = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in every version
  return row

def age_transform(row): # Classify a passenger into an age group
  age = row['Age']
  # The first boundary is 8 so that ages between 7 and 8 are no longer
  # skipped by the conditions and sent to the last group.
  if age < 8:
    return 0
  elif age < 19:
    return 1
  elif age < 30:
    return 2
  elif age < 60:
    return 3
  else:
    # 60 and over. A missing Age also lands here, because comparisons with NaN are False.
    return 4

def fare_sex(row): # Relationship between the Fare column and the Sex column
  special_arrange = 200.0 <= row.Fare <= 300.0
  not_special_arrange = (row.Fare > 300.0) or (row.Fare < 200.0)
  # Note: both fare bands return the same value for a given sex,
  # so this feature currently carries the same information as nSex.
  if (row.Sex == 'male') and special_arrange:
    return 0
  elif (row.Sex == 'female') and special_arrange:
    return 1
  elif (row.Sex == 'male') and not_special_arrange:
    return 0
  else:
    return 1

def age_sex(row): # Relationship between the Age column and the Sex column
  special = 8 <= row.Age <= 12
  # 0 for females aged 8 to 12, 1 for everyone else
  if (row.Sex == 'female') and special:
    return 0
  else:
    return 1

def ticket_len_cat(row):  # Classify a ticket into a length group
  if row.Ticket_len <= 5:
    return 0
  elif row.Ticket_len <= 10:
    return 1
  else:
    return 2

In [1]:
def transform_data(data):
  # drop() returns a new frame, so it must be assigned; the caller's DataFrame stays untouched
  data = data.drop(columns=['Name'])

  # Sex column
  data['nSex'] = data['Sex'].map({'male': 0, 'female': 1})

  # Age column
  data['AgeCa'] = data.apply(age_transform, axis=1)
  AgeCa_dummies = pd.get_dummies(data['AgeCa'], prefix='AgeCa')

  # Fare column: treat zero fares as missing, then fill with the median
  data['Fare'] = data['Fare'].replace(0, np.nan)
  data['Fare'] = data['Fare'].fillna(data['Fare'].median())
  data['FareCat'] = pd.qcut(data['Fare'], 4, labels = [ 0, 1, 2, 3])
  FareCat_dummies = pd.get_dummies(data['FareCat'], prefix='FareCat')

  # Cabin column
  data['Cabin'] = data['Cabin'].fillna(value='C')
  cabin_dummies = pd.get_dummies(data['Cabin'], prefix='Cabin')

  # Ticket column
  data['TicketLen'] = data.apply(ticket_len_cat, axis=1)
  TicketLen_dummies = pd.get_dummies(data['TicketLen'], prefix='TicketLen')

  # Embarked column: fill the few missing ports with the most frequent one
  # (assigning dropna() back to the column had no effect)
  data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
  Embarked_dummies = pd.get_dummies(data['Embarked'], prefix='Embarked')

  # Nmember column
  Nmember_dummies = pd.get_dummies(data['Nmember'], prefix='Nmember')

  # Pclass column
  Pclass_dummies = pd.get_dummies(data['Pclass'], prefix='Pclass')

  # New features
  data['FareSex'] = data.apply(fare_sex, axis=1)
  data['AgeSex'] = data.apply(age_sex, axis=1)

  # The test set has no 'Survived' column, so include it only when present
  # (this replaces the bare try/except, which hid unrelated errors)
  base_cols = ['FareSex', 'AgeSex', 'nSex']
  if 'Survived' in data.columns:
    base_cols = ['Survived'] + base_cols

  new_data = pd.concat([data[base_cols],
                        AgeCa_dummies, FareCat_dummies, cabin_dummies,
                        TicketLen_dummies, Embarked_dummies, Nmember_dummies,
                        Pclass_dummies], axis=1)

  return new_data

In [1]:
train_data = transform_data(train)
train_data.head()

In [1]:
train_data.drop(columns=['Cabin_T'], inplace=True)
train_data.shape

# Model Training

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def split_data(data):
  '''Split a dataframe into scaled train and test sets.
  Input: A dataframe
  Output: X_train, X_test, y_train, y_test from dataframe input.'''

  X = data.drop(['Survived'], axis=1)
  y = data['Survived']

  # Train_test_split of data 70% - 30%
  X_train, X_test, y_train, y_test = train_test_split(X, y.values, test_size=0.3, random_state=365)

  # Fit the scaler on the training part only, so the test part does not leak into the scaling
  scale = MinMaxScaler()
  X_train = scale.fit_transform(X_train)
  X_test = scale.transform(X_test)

  return (X_train, X_test, y_train, y_test)

def base_learners_evaluation(data, base_classifiers):
  '''Fit each base classifier and report its scores on the held-out split.
  Input: A dataframe and a list of (name, classifier) tuples
  Output: A dataframe of accuracy, F1, precision and recall scores, one row per classifier. '''

  X_train, X_test, y_train, y_test = split_data(data)

  idx = []
  scores = {'Accuracy': [], 'F1_score': [], 'Precision': [], 'Recall': []}
  for bc in base_classifiers:
    lm = bc[1]
    lm.fit(X_train, y_train)

    prediction = lm.predict(X_test)

    idx.append(bc[0])

    scores['Accuracy'].append(accuracy_score(y_test, prediction))
    scores['F1_score'].append(f1_score(y_test, prediction))
    scores['Precision'].append(precision_score(y_test, prediction))
    scores['Recall'].append(recall_score(y_test, prediction))

  return pd.DataFrame(data=scores, index=idx)


In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

base_classifiers = [('Decision Tree 3', DecisionTreeClassifier(max_depth=3)),
                    ('Decision Tree 5', DecisionTreeClassifier(max_depth=5)),
                    ('Decision Tree 8', DecisionTreeClassifier(max_depth=8)),
                    ('Naive Bayes', GaussianNB()),
                    ('SVC', SVC()),
                    ('Logistic Regression', LogisticRegression(max_iter=500))]

base_learners_evaluation(train_data, base_classifiers)

# Feature Selection

- We will compare the following methods for this purpose:
    - Pearson correlation coefficient
    - chi-square test
    - f_regression
    - f_classif
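
When a selector returns a plain array, the column names are lost. As a hedged sketch, `SelectKBest.get_support()` can be used to recover which columns were kept. The synthetic data and the column names `informative`, `noise_a`, and `noise_b` are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption): one column tracks the target closely, two are pure noise.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=200), name="Survived")
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.1, size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})

# Fit the selector, then map the boolean support mask back to column names.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

The same pattern should work with `chi2` or `f_regression` as the `score_func`, which makes the selected feature sets easier to compare by name.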

## Using the Pearson correlation coefficient for feature selection

In [1]:
correlations = train_data.corr(method='pearson')['Survived'].drop('Survived')
correlations.sort_values().plot(kind='barh')

In [1]:
# Keep only the features whose absolute correlation with Survived exceeds the threshold

threshold = 0.1

pearson_feature = list(correlations[abs(correlations) > threshold].index.values)
pearson_feature

In [1]:
data_corr = pd.concat([train_data[pearson_feature], train_data['Survived']], axis=1)

base_learners_evaluation(data_corr, base_classifiers)

## Using chi2 test for feature selection

In [1]:
from sklearn.feature_selection import SelectKBest, chi2

# Finding the best 27 features using the chi2 test
data_chi2 = pd.DataFrame(SelectKBest(chi2, k=27).fit_transform(train_data.drop(["Survived"],axis = 1),train_data["Survived"]))
data_chi2.head()

In [1]:
data_chi2 = pd.concat([data_chi2, train_data['Survived']], axis=1)
base_learners_evaluation(data_chi2, base_classifiers)

## Using f_classif for feature selection

In [1]:
from sklearn.feature_selection import SelectKBest, f_classif

# Find the best 27 features with the f_classif test (k must be passed by keyword)
data_classif = pd.DataFrame(SelectKBest(f_classif, k=27).fit_transform(train_data.drop(['Survived'], axis=1), train_data['Survived']))
data_classif.head()

In [1]:
data_classif = pd.concat([data_classif, train_data['Survived']], axis=1)
base_learners_evaluation(data_classif, base_classifiers)

## Using f_regression for feature selection

In [1]:
from sklearn.feature_selection import SelectKBest, f_regression

# Find the best 27 features with the f_regression test (k must be passed by keyword)
data_regression = pd.DataFrame(SelectKBest(f_regression, k=27).fit_transform(train_data.drop(['Survived'], axis=1), train_data['Survived']))
data_regression.head()

In [1]:
data_regression = pd.concat([data_regression, train_data['Survived']], axis=1)
base_learners_evaluation(data_regression, base_classifiers)

In [1]:
public_data = pd.read_csv('/kaggle/input/titanic/test.csv')
PassengerId = public_data['PassengerId']
public_data.head()

In [1]:
public_data['Ticket_len'] = public_data.Ticket.apply(lambda x: len(x))
public_data['Nmember'] = public_data['SibSp'] + public_data['Parch'] + 1
public_data['Cabin'] = public_data['Cabin'].str.get(0)

In [1]:
public_data.head()

In [1]:
public_data.drop(columns=[ 'PassengerId', 'SibSp', 'Parch'], inplace=True)

In [1]:
X = transform_data(public_data)
X.head()

In [1]:
X.shape

# Training with ensemble models

In [1]:
def ensemble_evaluation(data, model, label='Original'):
  '''Fit the model and return its scores; label marks the Original or Filtered data.'''
  X_train, X_test, y_train, y_test = split_data(data)
  model.fit(X_train, y_train)
  prediction = model.predict(X_test)

  return pd.DataFrame({'Accuracy' : [accuracy_score(y_test, prediction)],
                       'F1_score' : [f1_score(y_test, prediction)],
                       'Precision' : [precision_score(y_test, prediction)],
                       'Recall' : [recall_score(y_test, prediction)]}, index=[label])

In [1]:
from sklearn.ensemble import VotingClassifier

models_comparison = {}

ensemble = VotingClassifier(base_classifiers)     

ensemble_data_origin = ensemble_evaluation(train_data, ensemble, label='Original')
ensemble_data_filtered = ensemble_evaluation(data_corr, ensemble, label='Filtered')

models_comparison['Voting'] = pd.concat([ensemble_data_origin, ensemble_data_filtered], axis=0)

In [1]:
models_comparison['Voting']

In [1]:
from sklearn.ensemble import BaggingClassifier

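# Pass the base model positionally: the keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases
ensemble = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                             n_estimators=10)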

ensemble_data_origin = ensemble_evaluation(train_data, ensemble, label='Original')
ensemble_data_filtered = ensemble_evaluation(data_corr, ensemble, label='Filtered')
models_comparison['Bagging'] = pd.concat([ensemble_data_origin, ensemble_data_filtered], axis=0)

In [1]:
models_comparison['Bagging']

In [1]:
from sklearn.ensemble import AdaBoostClassifier

ensemble = AdaBoostClassifier(n_estimators=365)

ensemble_data_origin = ensemble_evaluation(train_data, ensemble, label='Original')
ensemble_data_filtered = ensemble_evaluation(data_chi2, ensemble, label='Filtered')
models_comparison['AdaBoost'] = pd.concat([ensemble_data_origin, ensemble_data_filtered], axis=0)

In [1]:
models_comparison['AdaBoost']

In [1]:
from sklearn.ensemble import RandomForestClassifier

ensemble = RandomForestClassifier(n_estimators=500, max_depth=5, criterion="entropy", n_jobs=-1)

ensemble_data_origin = ensemble_evaluation(train_data, ensemble, label='Original')
ensemble_data_filtered = ensemble_evaluation(data_corr, ensemble, label='Filtered')
models_comparison['RandomForest'] = pd.concat([ensemble_data_origin, ensemble_data_filtered], axis=0)

In [1]:
models_comparison['RandomForest']

In [1]:
from xgboost import XGBClassifier

ensemble = XGBClassifier()

ensemble_data_origin = ensemble_evaluation(train_data, ensemble, label='Original')
ensemble_data_filtered = ensemble_evaluation(data_corr, ensemble, label='Filtered')
models_comparison['XGBClassifier'] = pd.concat([ensemble_data_origin, ensemble_data_filtered], axis=0)

In [1]:
models_comparison['XGBClassifier']

In [1]:
from lightgbm import LGBMClassifier

ensemble = LGBMClassifier()

ensemble_data_origin = ensemble_evaluation(train_data, ensemble, label='Original')
ensemble_data_filtered = ensemble_evaluation(data_corr, ensemble, label='Filtered')
models_comparison['LGBMClassifier'] = pd.concat([ensemble_data_origin, ensemble_data_filtered], axis=0)

In [1]:
models_comparison['LGBMClassifier']

In [1]:
# !pip install catboost
# import catboost 
# from catboost import CatBoostClassifier
# np.random.seed(42)

# ensemble = CatBoostClassifier()
# ensemble_data_origin = ensemble_evaluation(train_data, ensemble, label='Original')
# ensemble_data_filtered = ensemble_evaluation(data_corr, ensemble, label='Filtered')
# models_comparison['CatBoostClf'] = pd.concat([ensemble_data_origin, ensemble_data_filtered], axis=0)
# models_comparison['CatBoostClf']

***In general, the models above do not perform well on the public data. Finally, I will use a pipeline model for this problem.***
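
One practical benefit of a pipeline is that preprocessing is refit inside every cross-validation fold, so the validation part never influences the scaling or imputation. The following is a minimal sketch on synthetic data from `make_classification`; the step names `scale` and `model` and the choice of logistic regression are assumptions for illustration, not the setup used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is part of the estimator, so each fold fits it on its own training part only.
pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=500)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```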

# Use Pipeline for prediction

In [1]:
train.head()

In [1]:
public_data.head()

## Data preprocessing for pipeline model

In [1]:
# The Cabin column has too many null values, so we leave it out of model training
public_data['Cabin'].isnull().sum()

In [1]:
# Creation of four groups
train['Nmember'] = pd.cut(train.Nmember, [0,1,4,7,11], labels=['Solo', 'Small', 'Big', 'Very big'])
public_data['Nmember'] = pd.cut(public_data.Nmember, [0,1,4,7,11], labels=['Solo', 'Small', 'Big', 'Very big'])

In [1]:
train['Title'] = train['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

public_data['Title'] = public_data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

In [1]:
female_titles = ['Mme', 'Ms', 'Lady', 'Mlle', 'the Countess', 'Dona']
male_titles = ['Major', 'Col', 'Capt', 'Don', 'Sir', 'Jonkheer']

# Assign the result back: an in-place replace on a selected column may not update the frame
train['Title'] = train['Title'].replace(female_titles, 'Miss').replace(male_titles, 'Mr')
public_data['Title'] = public_data['Title'].replace(female_titles, 'Miss').replace(male_titles, 'Mr')

In [1]:
# Treat zero fares as missing. The unassigned fillna() calls had no effect,
# so the missing fares are left for the pipeline's median SimpleImputer.
train['Fare'] = train['Fare'].replace(0, np.nan)
public_data['Fare'] = public_data['Fare'].replace(0, np.nan)

In [1]:
train['Ticket_lett'] = train.Ticket.apply(lambda x: x[:2])

public_data['Ticket_lett'] = public_data.Ticket.apply(lambda x: x[:2])

In [1]:
# train['Ticket_len'] = train.apply(ticket_len_cat, axis=1)
# public_data['Ticket_len'] = public_data.apply(ticket_len_cat, axis=1)

In [1]:
# Create group for fare ticket
# train['FareCat'] = pd.qcut(train['Fare'], 4, labels = [ 0, 1, 2, 3])
# public_data['FareCat'] = pd.qcut(public_data['Fare'], 4, labels = [ 0, 1, 2, 3])

In [1]:
# train['FareSex'] = train.apply(fare_sex, axis=1)
# public_data['FareSex'] = public_data.apply(fare_sex, axis=1) 

In [1]:
# train['AgeSex'] = train.apply(age_sex, axis=1)
# public_data['AgeSex'] = public_data.apply(age_sex, axis=1)

In [1]:
y_train = train['Survived']
features = ['Pclass', 'Fare', 'Title', 'Embarked', 'Nmember', 'Ticket_len', 'Ticket_lett']
X_train = train[features]
X_train.head()

In [1]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

numerical_cols = ['Fare']
categorical_cols = ['Pclass', 'Title', 'Embarked', 'Nmember', 'Ticket_len', 'Ticket_lett']

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='median')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Bundle preprocessing and modeling code 
titanic_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=0, n_estimators=500, max_depth=5))
])

# Preprocessing of training data, fit model 
titanic_pipeline.fit(X_train,y_train)

print('Cross validation score: {:.3f}'.format(cross_val_score(titanic_pipeline, X_train, y_train, cv=10).mean()))

In [1]:
X_test = public_data[features]
X_test.head()

In [1]:
# Preprocessing of test data, get predictions
predictions = titanic_pipeline.predict(X_test)

In [1]:
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": predictions
    })

submission.to_csv('submission_rd.csv', index=False)

# Conclusion
The Cabin and Sex columns turned out not to be valuable for the model, even though they gave clear insight during exploration. Over-reliance on these two attributes lowered the model score, so next time I will rebuild this predictive model in a different way. Let's look forward to it.

Note: This article has references and improvements from other notebooks on Kaggle.

## Thank you!!