# Approach
This is a brief introductory walk-through of **Extremely Randomized Trees**, which performed best among ensemble bagging classifiers on the selected subset of explanatory variables.

**Two choices in this workbook are particularly interesting. The first is Age inference using the first name as a factor, assuming that people with the same first name were born in more or less the same year.
The second is the creation of an Ability to negotiate feature, evaluated here as Fare/Pclass (lower is better), which aims to capture a skill that was likely essential to survival: the passengers' ability to work their way into a life-saving deal.**



In [None]:
from IPython.display import Image
Image("../input/negotiation/negotiation.jpg")

In [None]:
# imports
from sklearn.ensemble import  ExtraTreesClassifier
import seaborn as sns
import pandas as pd
import numpy as np
import os

# hyperparameter tuning
from sklearn.model_selection import GridSearchCV

In [None]:
# input
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')

# output
union = [train, test]
passenger_id = []

Studying the correlation among variables and printing the dataset info for an overview shows that we need to drop some variables, fill some **null** entries, and aggregate values into the most relevant numerical buckets.

In [None]:
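# numeric_only=True skips text columns such as Name and Ticket,
# which recent pandas versions no longer drop silently in corr()
corr_matrix = train.drop("PassengerId", axis=1).corr(numeric_only=True).round(2)
sns.heatmap(corr_matrix, annot=True)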

In [None]:
#info
print(train.info())

# Development

Studying the correlations of **Age** with other variables (e.g. **Sex** and **Pclass**) to deduce missing values is one way to fill **null** entries; another is to build a probability distribution and draw replacement values from it at random.

Out of **891/418** passengers in the training/testing sets, **177/86** have a null Age value, approximately **20%** of passengers.

As an interesting alternative, this study leverages **Name**: string splitting and a regular expression extract each passenger's first name, and passengers with a missing **Age** are assigned the mean Age of the passengers who share their first name.

This process infers the Age of 101/52 passengers in the training/testing sets respectively, leaving approximately 8% of all passengers with a **null** Age. The next step is to bucket ages into groups that displayed similar patterns, e.g. infants showing the highest survival rate. For simplicity, the remaining passengers with a null Age are filled with the mean Age, which places them in the working-age adult bucket (18 to 60 years old in the code below) that holds the vast majority of passengers.
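
The first-name inference described above can be sketched on a few made-up rows. This is a minimal illustration rather than the notebook's exact code: the passenger names and ages are hypothetical, the `AgeFilled` column exists only for this example, and `groupby(...).transform("mean")` stands in for the merge used in the cell that follows.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical passengers, not taken from the Titanic files)
toy = pd.DataFrame({
    "Name": [
        "Smith, Mr. John Henry",
        "Brown, Mr. John",
        "Doe, Mrs. (Mary Ann)",
        "Roe, Miss. Mary",
        "Poe, Mr. Edgar",
    ],
    "Age": [30.0, np.nan, 40.0, np.nan, np.nan],
})

# First token after the title ("Mr.", "Mrs.", ...), with parentheses stripped
first_name = (
    toy["Name"]
    .str.split(".", n=1, expand=True)[1]
    .str.split(expand=True)[0]
    .str.replace(r"[()]", "", regex=True)
)

# Mean known Age per first name, broadcast back to every row
mean_by_name = toy["Age"].groupby(first_name).transform("mean")
toy["AgeFilled"] = toy["Age"].fillna(mean_by_name)
print(toy[["Name", "Age", "AgeFilled"]])
```

A passenger whose first-name group has no known Age at all ("Edgar" here) stays null, which is why a second fill step is still needed afterwards.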

In [None]:
for i, df in enumerate(union):

    # Age inference based on first name
    name = df['Name'].str.split('.', n=1, expand = True)
    name = name[1].str.split(expand = True)[0]
    name = name.str.replace(r'[()]', '', regex=True)  # strip parentheses around first names
    df['Name'] = name
    del name
    mean_age = df[['Name','Age']].groupby(['Name']).mean()
    df = df.merge(mean_age, on='Name')
    df['Age'] = df['Age_x'].fillna(df['Age_y'])
    df = df.drop(['Age_x', 'Age_y', 'Name'], axis=1)
    
    # Remaining null Age: fill with the mean Age (falls in the adult bucket)
    df['Age'] = df['Age'].fillna(df['Age'].mean())

    # Age bucketing
    age_buckets= [0,2,10,18,60,200]
    age_labels = [0,1,2,3,4]
    df['AgeGroup'] = pd.cut(df['Age'], bins=age_buckets, labels=age_labels, right=False)

    # Parch bucketing
    parch_buckets= [0,1,200]
    parch_labels = [0,1]
    df['Parch'] = pd.cut(df['Parch'], bins=parch_buckets, labels=parch_labels, right=False)

    # SibSp bucketing
    sibsp_buckets= [0,1,2,200]
    sibsp_labels = [0,1,2]
    df['SibSp'] = pd.cut(df['SibSp'], bins=sibsp_buckets, labels=sibsp_labels, right=False).astype(np.int8)
    
    # Fare
    #df['Fare']= df['Fare'].clip(lower= df['Fare'].quantile(0.00), upper= df['Fare'].quantile(0.01), axis=0)
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    
    # Ability to bargain
    df['Pclass'] = df['Pclass'].astype(np.int8)    
    df['Ability'] = df['Fare'] / df['Pclass']
    
    # Family size
    df['Family'] = df['SibSp'].astype(np.int8) + df['Parch'].astype(np.int8) + 1
    
    # Fare and Ability bucketing
    fare_buckets= [0,23,10000]
    fare_labels = [0,1]
    df['Fare'] = pd.cut(df['Fare'], bins=fare_buckets, labels=fare_labels, right=False)
    
    ab_buckets= [0,4,9,15,20,59,70,10000]
    ab_labels = [0,1,2,3,4,5,6]
    df['Ability'] = pd.cut(df['Ability'], bins=ab_buckets, labels=ab_labels, right=False)
    
    # Cleaning
    df = df.sort_values(by=['PassengerId'])
    passenger_id.append(df["PassengerId"])
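    # Keep the first indicator column of each encoding as a single 0/1 feature
    # (assigning a whole get_dummies frame to one column can raise a ValueError on recent pandas)
    df['Sex'] = pd.get_dummies(df['Sex']).iloc[:, 0].astype(np.int8)      # 1 = female
    df['Fare'] = pd.get_dummies(df['Fare']).iloc[:, 0].astype(np.int8)    # 1 = lowest Fare bucket
    df['SibSp'] = pd.get_dummies(df['SibSp']).iloc[:, 0].astype(np.int8)  # 1 = no sibling/spouse
    df['Parch'] = pd.get_dummies(df['Parch']).iloc[:, 0].astype(np.int8)  # 1 = no parent/child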
    df = df.drop(['Embarked', 'PassengerId', 'Ticket', 'Age', 'Cabin'], axis=1)
    union[i] = df
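
To make the Ability feature and its bucketing concrete, here is a small sketch on hypothetical fares and classes. The bucket edges are the ones used in the cell above; the rows and the `AbilityGroup` column name are assumptions made for this illustration. With `right=False`, each bucket includes its left edge and excludes its right edge, so a value of exactly 4 lands in the second bucket.

```python
import pandas as pd

# Hypothetical fares and classes, chosen to land on or near bucket edges
toy = pd.DataFrame({
    "Fare":   [7.25, 8.0, 27.0, 71.0, 210.0],
    "Pclass": [3, 2, 3, 1, 1],
})

# Ability to negotiate: fare paid relative to the class obtained
toy["Ability"] = toy["Fare"] / toy["Pclass"]

# Same edges and labels as in the notebook; intervals are [left, right)
ab_buckets = [0, 4, 9, 15, 20, 59, 70, 10000]
ab_labels = [0, 1, 2, 3, 4, 5, 6]
toy["AbilityGroup"] = pd.cut(
    toy["Ability"], bins=ab_buckets, labels=ab_labels, right=False
).astype("int8")
print(toy)
```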

# Results
Let's visualise the training dataset at this point by plotting its correlation matrix with seaborn.

In [None]:
corr_matrix = union[0].corr().round(2)
sns.heatmap(corr_matrix,annot = True)

The next step is to split the dataset into separate subsets, i.e. isolate the "Survived" column as the target, to train the Extremely Randomized Trees model.

In [None]:
x_train = union[0].drop("Survived", axis=1)
y_train = union[0]["Survived"]
x_test  = union[1]
x_train.shape, y_train.shape, x_test.shape

In [None]:
# Extremely randomized trees
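ex = ExtraTreesClassifier(random_state=6, bootstrap=True, oob_score=True)
ex.fit(x_train, y_train)
y_pred = ex.predict(x_test)

# Accuracy on the training rows (optimistic: the model has already seen them)
score = round(ex.score(x_train, y_train) * 100, 2)
print('Extremely Randomized Trees (training accuracy)', score)

# Out-of-bag accuracy, available because bootstrap=True and oob_score=True
print('Out-of-bag accuracy', round(ex.oob_score_ * 100, 2))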
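
`GridSearchCV` is imported at the top for hyperparameter tuning but is not used in this walk-through. A minimal sketch of how it could be applied is shown here, under clear assumptions: the data is synthetic (a stand-in for `x_train` and `y_train`), and the parameter grid is hypothetical, chosen only to illustrate the call.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train / y_train (assumption: not the Titanic data)
X, y = make_classification(n_samples=200, n_features=8, random_state=6)

# Hypothetical grid; the values are illustrative, not tuned recommendations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# 5-fold cross-validated search over every combination in the grid
search = GridSearchCV(
    ExtraTreesClassifier(random_state=6, bootstrap=True),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_ * 100, 2))
```

Unlike the training accuracy, `best_score_` is a cross-validated estimate, so it is a fairer basis for comparing settings.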

# Feature importances

In [None]:
# Pair each feature name with its importance, shown as a whole percentage
for name, importance in zip(x_train.columns, ex.feature_importances_):
    print('%s: %d%%' % (name, int(importance * 100)))

In [None]:
#submission
submission = pd.DataFrame({
        "PassengerId": passenger_id[1],
        "Survived": y_pred
    })
submission.to_csv('/kaggle/working/submission.csv', index=False)
