# Titanic: simple ensemble voting (Top 3%)

This notebook is a successful attempt to improve the result of my previous notebook:


#### [Titanic: simple voting based on cross-validation](https://www.kaggle.com/alexanderossipov/titanic-simple-voting-based-on-cross-validation)


### List of main changes:

1. One additional feature is introduced to the model and only the most important features are selected.
2. The number of models is reduced to 4.
3. An ensemble is constructed in the standard way using VotingClassifier.  

It's amazing that such a good result can be achieved without any hyperparameter tuning. If you manage to improve it further by fine-tuning the parameters, please let me know!


**Your feedback is very welcome!**

### Content:

[1. Import Libraries](#ch1)

[2. Read In and Explore the Data](#ch2)

[3. Data Analysis and Visualisation](#ch3)

[4. Cleaning and Transforming the Data](#ch4)

[5. Feature Creation](#ch5)

[6. Model and Feature Selection](#ch6)

[7. Creating Submission File](#ch7)


<a id="ch1"></a>
## 1. Import Libraries
First, we need to import several Python libraries such as numpy, pandas, matplotlib and seaborn.

In [None]:
#data analysis libraries 
import numpy as np
import pandas as pd

#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

<a id="ch2"></a>
## 2. Read in and Explore the Data  <div id="part2"> </div>
It's time to read in our training and testing data using `pd.read_csv`, and take a first look at the training data using the `describe()` function.

In [None]:
#import train and test CSV files
train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")

#take a look at the training data
train.describe(include="all")

In [None]:
# save PassengerId for a future output 
ids = test['PassengerId']

In [None]:
#get a list of the features within the dataset
print(train.columns)

In [None]:
#see a sample of the dataset to get an idea of the variables
train.sample(5)

In [None]:
#see a summary of the training dataset
train.describe(include = "all")

#### Some Observations:
* There are a total of 891 passengers in our training set.
* The Age feature is missing approximately 19.8% of its values. I'm guessing that the Age feature is pretty important to survival, so we should probably attempt to fill these gaps. 
* The Cabin feature is missing approximately 77.1% of its values. Since so much of the feature is missing, it would be hard to fill in the missing values. We'll probably drop these values from our dataset.
* The Embarked feature is missing 0.22% of its values, which should be relatively harmless.
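
The percentages above can be reproduced directly from the data. The following is a minimal sketch: `missing_percent` is a hypothetical helper name, and `demo` is a tiny stand-in frame, not the Titanic data. In the notebook the helper would be called on `train`.

```python
import numpy as np
import pandas as pd

def missing_percent(df):
    """Return the percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).round(2).sort_values(ascending=False)

# Tiny stand-in frame (assumption); replace with `train` in the notebook.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
missing = missing_percent(demo)
print(missing)
```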

In [None]:
#check for any other unusable values
print(pd.isnull(train).sum())

We can see that, apart from the missing values mentioned above, no other NaN values exist.

### Some Predictions:
* Sex: Females are more likely to survive.
* SibSp/Parch: People traveling alone are more likely to survive.
* Age: Young children are more likely to survive.
* Pclass: People of higher socioeconomic class are more likely to survive.

<a id="ch3"></a>
## 3. Data Analysis and Visualisation
It's time to analyze our data so we can see whether our predictions were accurate!

In [None]:
features = ['Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']
fig, saxis = plt.subplots(1, len(features),figsize=(len(features) * 6,6))
for ind, x in enumerate(features):
    print('Survival Correlation by:', x)
    print(train[[x, "Survived"]].groupby(x, as_index=False).mean()) 
    print('-'*10, '\n')
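    #draw a bar plot of survival by the current feature
    #(x is passed as a keyword because recent seaborn versions reject positional data arguments)
    sns.barplot(x=x, y="Survived", data=train, ax = saxis[ind])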
    

As predicted:
* Females have a much higher chance of survival than males. The Sex feature is essential in our predictions.
* People of higher socioeconomic class had a higher rate of survival (62.9% vs. 47.3% vs. 24.2%).

In general, it's clear that people with more siblings or spouses aboard were less likely to survive. However, contrary to expectations, people with no siblings or spouses were less likely to survive than those with one or two (34.5% vs. 53.4% vs. 46.4%).

People with fewer than four parents or children aboard are more likely to survive than those with four or more. Again, people traveling alone are less likely to survive than those with 1-3 parents or children.

### Age Feature

In [None]:
#sort the ages into logical categories
bins = [0, 2, 12, 17, 60, np.inf]
labels = ['baby', 'child', 'teenager', 'adult', 'elderly']
age_groups = pd.cut(train.Age, bins, labels = labels)
train['AgeGroup'] = age_groups

#draw a bar plot of Age vs. survival
sns.barplot(x="AgeGroup", y="Survived", data=train)
plt.show()

The survival probability tends to decrease with age.

### Cabin Feature
The idea here is that people with recorded cabin numbers are of higher socioeconomic class, and thus more likely to survive.

In [None]:
train["CabinBool"] = (train["Cabin"].notnull().astype('int'))
test["CabinBool"] = (test["Cabin"].notnull().astype('int'))

#calculate percentages of CabinBool vs. survived

print('Survival Correlation by: Cabin')
print(train[["CabinBool", "Survived"]].groupby("CabinBool", as_index=False).mean()) 

#draw a bar plot of CabinBool vs. survival
sns.barplot(x="CabinBool", y="Survived", data=train)
plt.show()

People with a recorded Cabin number are, in fact, more likely to survive. (66.6% vs 29.9%)

<a id="ch4"></a>
## 4. Cleaning and Transforming the Data
Time to clean our data to account for missing values and unnecessary information!

### Looking at the Test Data
Let's see how our test data looks!

In [None]:
test.describe(include="all")

* We have a total of 418 passengers.
* 1 value from the Fare feature is missing.
* Around 20.5% of the Age feature is missing, so we will need to fill that in.

### Combining Training and Test data for cleaning and transforming

In [None]:
# all_data = pd.concat([train.drop(columns='Survived'), test], ignore_index=True)
all_data = pd.concat([train, test], ignore_index=True)
print(all_data.shape)

### Filling simple missing features

In [None]:
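#complete embarked with mode
#(assign the result back instead of inplace=True on a column, which newer pandas versions may not apply)
all_data['Embarked'] = all_data['Embarked'].fillna(all_data['Embarked'].mode()[0])

#complete missing fare with median
all_data['Fare'] = all_data['Fare'].fillna(all_data['Fare'].median())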


### Filling missing features using other features

#### Age Feature

Next we'll fill in the missing values in the Age feature. Since a higher percentage of values are missing, it would be illogical to fill all of them with the same value (as we did with Embarked). Instead, let's try to find a way to predict the missing ages. 

In [None]:
#extract a title for each Name (raw string so that \. is not treated as an invalid escape)
all_data['Title'] = all_data.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

all_data['Title'].value_counts()
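
To make the pattern concrete, here is a small sketch of what the regular expression captures: the first run of letters that follows a space and ends with a period. The three names are illustrative samples in the dataset's `Surname, Title. Given names` format.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format (illustrative)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the letters between a space and the first following period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```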

In [None]:
frequent_titles = all_data['Title'].value_counts()[:5].index.tolist()
frequent_titles

In [None]:
# keep only the most frequent titles
all_data['Title'] = all_data['Title'].apply(lambda x: x if x in frequent_titles else 'Other')
# all_data.head()

In [None]:
# fill missing ages with the median age for each title
median_ages = {}
# calculate median age for different titles
for title in frequent_titles:
    median_ages[title] = all_data.loc[all_data['Title'] == title, 'Age'].median()
# rare titles grouped under 'Other' get the overall median age
median_ages['Other'] = all_data['Age'].median()
all_data.loc[all_data['Age'].isnull(), 'Age'] = all_data.loc[all_data['Age'].isnull(), 'Title'].map(median_ages)
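
The same idea can also be written without an explicit loop. This is only a sketch on a toy frame (`demo` and `AgeFilled` are made-up names), and it differs slightly from the cell above: here the 'Other' group uses its own median when it has any known ages and falls back to the overall median only when it has none.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): two frequent titles and one 'Other' row without a known age
demo = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Other"],
    "Age":   [30.0, 40.0, np.nan, 10.0, np.nan, np.nan],
})

# Median age of each title, aligned row by row with the original frame
title_median = demo.groupby("Title")["Age"].transform("median")

# Fill from the title median first, then from the overall median as a fallback
demo["AgeFilled"] = demo["Age"].fillna(title_median).fillna(demo["Age"].median())
print(demo)
```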

In [None]:
all_data.head()

### Encoding categorical features with non-numerical values
Use LabelEncoder for categorical features

In [None]:
from sklearn.preprocessing import LabelEncoder

Cat_Features = ['Sex', 'Embarked', 'Title']
for feature in Cat_Features:
    label = LabelEncoder()
    all_data[feature] = label.fit_transform(all_data[feature])

### Creating frequency bins for continuous variables and encoding
Use qcut and LabelEncoder for continuous variable bins

In [None]:
Cont_Features = ['Age', 'Fare']
num_bins = 5
for feature in Cont_Features:
    bin_feature = feature + 'Bin'
    all_data[bin_feature] = pd.qcut(all_data[feature], num_bins)
    label = LabelEncoder()
    all_data[bin_feature] = label.fit_transform(all_data[bin_feature])
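
As a rough illustration of why `qcut` is used here rather than `cut`: `qcut` builds bins with (approximately) equal numbers of passengers, while `cut` builds bins of equal width, which a single extreme fare can distort. The numbers below are made up for the sketch, and `labels=False` is used as a shortcut that returns integer bin codes directly.

```python
import pandas as pd

# Made-up fares with one extreme value
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-frequency bins: every bin receives two values
equal_freq = pd.qcut(fares, 5, labels=False)

# Equal-width bins: the outlier stretches the range, so nine values share bin 0
equal_width = pd.cut(fares, 5, labels=False)

print(equal_freq.tolist())
print(equal_width.tolist())
```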

In [None]:
all_data.head()

<a id="ch5"></a>
## 5. Feature Creation

We are going to create one additional feature that can improve the model. As observed in many other notebooks, members of families with children have a higher probability of survival.

First, we are going to identify families. 

It appears that passengers with the same surname often share the same ticket number.
Let's extract the surnames and ticket numbers and look for duplicates. Passengers who share both are likely to be members of the same family.

In [None]:
all_data['Surname'] = all_data.Name.str.extract(r'([A-Za-z]+),', expand=False)
all_data['TicketPrefix'] = all_data.Ticket.str.extract(r'(.*\d)', expand=False)
all_data['Surname_Ticket'] = all_data['Surname'] + all_data['TicketPrefix']
all_data['IsFamily'] = all_data.Surname_Ticket.duplicated(keep=False).astype(int)
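
A small sketch of how this flag behaves, using four illustrative rows (`demo` is a made-up name): `duplicated(keep=False)` marks every row whose surname-plus-ticket key occurs more than once, so two passengers with the same surname but different tickets are not treated as a family.

```python
import pandas as pd

# Illustrative rows: two passengers share surname and ticket, two share only the surname
demo = pd.DataFrame({
    "Name": [
        "Palsson, Master. Gosta Leonard",
        "Palsson, Mrs. Nils (Alma Cornelia Berglund)",
        "Allen, Mr. William Henry",
        "Allen, Miss. Elisabeth Walton",
    ],
    "Ticket": ["349909", "349909", "373450", "24160"],
})

demo["Surname"] = demo["Name"].str.extract(r'([A-Za-z]+),', expand=False)
demo["TicketPrefix"] = demo["Ticket"].str.extract(r'(.*\d)', expand=False)
demo["Surname_Ticket"] = demo["Surname"] + demo["TicketPrefix"]

# keep=False flags all members of a duplicated group, not only the later ones
demo["IsFamily"] = demo["Surname_Ticket"].duplicated(keep=False).astype(int)
print(demo[["Surname_Ticket", "IsFamily"]])
```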

Next, we find the families with children.

In [None]:
all_data['Child'] = all_data.Age.map(lambda x: 1 if x <=16 else 0)
FamilyWithChild = all_data[(all_data.IsFamily==1)&(all_data.Child==1)]['Surname_Ticket'].unique()
len(FamilyWithChild)

There are 66 families that have one or more children.

Encode each family with children and assign 0 for others.

In [None]:
all_data['FamilyId'] = 0
for ind, identifier in enumerate(FamilyWithChild):
    all_data.loc[all_data.Surname_Ticket==identifier, ['FamilyId']] = ind + 1

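For each of the families above, if at least one member is known to have survived, we assume the others are likely to have survived too (value 2). If no member is known to have survived, the value is 0. All other passengers keep the neutral value 1.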

In [None]:
# 1 = neutral default for passengers outside families with children
all_data['FamilySurvival'] = 1
# number of known survivors per family (NaN labels from the test set are skipped by sum)
Survived_by_FamilyId = all_data.groupby('FamilyId').Survived.sum()
for i in range(1, len(FamilyWithChild)+1):
    if Survived_by_FamilyId[i] >= 1:
        all_data.loc[all_data.FamilyId==i, ['FamilySurvival']] = 2
    elif Survived_by_FamilyId[i] == 0:
        all_data.loc[all_data.FamilyId==i, ['FamilySurvival']] = 0
sns.barplot(x='FamilySurvival', y='Survived', data=all_data)
plt.show()
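
For readers who prefer a vectorized form, the loop can be sketched with `groupby(...).transform` and `np.where`. This is an illustration on toy data, not a replacement that has been checked against the leaderboard score. Note that, as in the loop, a family whose members all lack a known `Survived` value (for example one that appears only in the test set) sums to 0 and therefore gets the value 0.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): FamilyId 0 means "not in a family with children"
demo = pd.DataFrame({
    "FamilyId": [0, 0, 1, 1, 2, 2, 3],
    "Survived": [1.0, 0.0, 1.0, np.nan, 0.0, 0.0, np.nan],
})

# Known survivors per family, broadcast back to every row (NaN is skipped)
survivors = demo.groupby("FamilyId")["Survived"].transform("sum")

# 1 = neutral, 2 = family with at least one known survivor, 0 = none known
demo["FamilySurvival"] = np.where(
    demo["FamilyId"] == 0, 1, np.where(survivors >= 1, 2, 0)
)
print(demo)
```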

Indeed, we can see that the chances of survival are higher for passengers in families in which at least one member survived.

### Splitting back Train and Test data

In [None]:
train = all_data[: len(train)]
test = all_data[len(train):]
train.shape

### Features to keep in train data 
We can now drop some original features that we don't need or that we have already used to create new features.

In [None]:
train.columns

In [None]:
# keep only some columns
X_train = train[['Pclass', 'Sex', 'Parch', 'Embarked', 'CabinBool', 'Title', 'AgeBin', 'FareBin', 'FamilySurvival']]
y_train = train['Survived']

<a id="ch6"></a>
## 6. Model and Feature Selection

In [None]:
# we start with this very powerful classifier 
from catboost import CatBoostClassifier

model = CatBoostClassifier(verbose=False)

Identifying the most important features

In [None]:
model.fit(X_train,y_train)
importance = pd.DataFrame({'feature':X_train.columns, 'importance': model.feature_importances_})
importance.sort_values('importance', ascending=False).set_index('feature').plot(kind='barh')
plt.show()

In [None]:
X_train.columns

Choose the most important features based on their importance and the cross-validation score.

In [None]:
main_features = ['Sex', 'FamilySurvival', 'FareBin', 'Pclass', 'Title']


X_test = test[main_features]
X_train = train[main_features]

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(estimator=model, X=X_train, y=y_train, cv=5).mean()
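
When comparing feature subsets, it can help to look at the spread of the fold scores as well as their mean, since differences smaller than the spread may not be meaningful. The sketch below uses synthetic data and `LogisticRegression` purely as stand-ins (assumptions, so that the example runs anywhere). In the notebook, `model`, `X_train` and `y_train` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption), not the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# One accuracy score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```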

### Voting Classifier

Create an ensemble of models showing the best performance in the previous notebook.

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn import svm, neighbors

ensemble = [CatBoostClassifier(verbose=False), RandomForestClassifier(), svm.NuSVC(probability=True), neighbors.KNeighborsClassifier()]

# VotingClassifier expects (name, estimator) pairs; use each class name as the label
classifiers_with_names = [(clf.__class__.__name__, clf) for clf in ensemble]
voting = VotingClassifier(classifiers_with_names, voting='hard')

cv_results = cross_validate(voting, X_train, y_train, cv=5)
print(cv_results['test_score'].mean())

voting.fit(X_train, y_train)
predictions = voting.predict(X_test)
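
With `voting='hard'`, each model casts one vote per passenger and the majority label wins. The sketch below imitates that rule with plain NumPy on made-up votes (`hard_vote` is a hypothetical helper, not part of scikit-learn). It also shows a detail worth knowing with an even number of models: a 2-2 tie goes to the smaller label, which matches the ascending-order tie rule described in the scikit-learn documentation, so a tie predicts 0 (did not survive).

```python
import numpy as np

def hard_vote(predictions):
    """Majority label per sample; predictions has shape (n_models, n_samples).

    Ties are resolved in favour of the smallest label, because argmax
    returns the first maximum of the vote counts.
    """
    predictions = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in predictions.T])

# Made-up votes of four models for three passengers
votes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
])
majority = hard_vote(votes)
print(majority)  # third passenger is a 2-2 tie and resolves to 0
```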


<a id="ch7"></a>
## 7. Creating Submission File

In [None]:
output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions.astype(int)})
output.to_csv('submission_new_session.csv', index=False)

## Sources:
* [**Titanic Survival Predictions (Beginner)**](https://www.kaggle.com/nadintamer/titanic-survival-predictions-beginner)
* [**A Data Science Framework: To Achieve 99% Accuracy**](https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy) 
* [Titanic Survival Prediction](https://www.kaggle.com/vaishnavikhilari/titanic-survival-prediction)

Your feedback is very welcome! 