<a class="anchor" id="0"></a>

# The importance of all features in different models - Advanced Visualization with Matplotlib and Seaborn (parallel_coordinates)
## Feature importance diagrams of 3 models (XGB, LGB, LinReg) and a solution built as their weighted average
## The code works for both classification and regression tasks
### Using the competition ["Titanic: Machine Learning from Disaster"](https://www.kaggle.com/c/titanic) as an example

This is based on my notebook [Merging FE & Prediction - xgb, lgb, logr, linr](https://www.kaggle.com/vbmokin/merging-fe-prediction-xgb-lgb-logr-linr)

<a class="anchor" id="0.1"></a>

## Table of Contents

1. [Import libraries](#1)
1. [Download datasets](#2)
1. [FE & EDA](#3)
1. [Preparing to modeling](#4)
1. [Tuning models, building the feature importance diagrams and prediction](#5)
    -  [LGBM](#5.1)
    -  [XGB](#5.2)
    -  [Linear Regression](#5.3)
1. [Comparison and merging of all feature importance diagrams](#6)
1. [Feature Importance - Advanced Visualization](#7)
    -  [Matplotlib](#7.1)
    -  [Seaborn](#7.2)
1. [Merging solutions and submission](#8)

## 1. Import libraries <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import parallel_coordinates
import eli5

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler

import lightgbm as lgbm
import xgboost as xgb

# Use the full option name; the short form 'max_columns' is ambiguous in newer pandas
pd.set_option('display.max_columns', 100)

## 2. Download datasets <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Download data
traindf = pd.read_csv('../input/titanic/train.csv').set_index('PassengerId')
testdf = pd.read_csv('../input/titanic/test.csv').set_index('PassengerId')
submission = pd.read_csv('../input/titanic/gender_submission.csv')

In [None]:
traindf.head(3)

In [None]:
traindf.info()

In [None]:
testdf.info()

In [None]:
submission.head()

## 3. FE & EDA <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

In [None]:
# FE - thanks to:
# https://www.kaggle.com/mauricef/titanic
# https://www.kaggle.com/vbmokin/titanic-top-3-one-line-of-the-prediction-code
#
df = pd.concat([traindf, testdf], axis=0, sort=False)
df['Title'] = df.Name.str.split(',').str[1].str.split('.').str[0].str.strip()
df['IsWomanOrBoy'] = ((df.Title == 'Master') | (df.Sex == 'female'))
df['LastName'] = df.Name.str.split(',').str[0]
family = df.groupby(df.LastName).Survived
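df['WomanOrBoyCount'] = family.transform(lambda s: s[df.IsWomanOrBoy].fillna(0).count())
# Exclude the passenger themselves from their family's woman-or-boy count.
# Series.mask keeps the result one-dimensional; df.mask(...) returns a whole DataFrame,
# which cannot be assigned to a single column in recent pandas.
df['WomanOrBoyCount'] = df.WomanOrBoyCount.mask(df.IsWomanOrBoy, df.WomanOrBoyCount - 1)
df['FamilySurvivedCount'] = family.transform(lambda s: s[df.IsWomanOrBoy].fillna(0).sum())
# Likewise, exclude the passenger's own outcome from the family survival count
df['FamilySurvivedCount'] = df.FamilySurvivedCount.mask(df.IsWomanOrBoy, df.FamilySurvivedCount - \
                                    df.Survived.fillna(0))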
df['WomanOrBoySurvived'] = df.FamilySurvivedCount / df.WomanOrBoyCount.replace(0, np.nan)
df.WomanOrBoyCount = df.WomanOrBoyCount.replace(np.nan, 0)
df['Alone'] = (df.WomanOrBoyCount == 0)

#Thanks to https://www.kaggle.com/kpacocha/top-6-titanic-machine-learning-from-disaster
#"Title" improvement
df['Title'] = df['Title'].replace('Ms','Miss')
df['Title'] = df['Title'].replace('Mlle','Miss')
df['Title'] = df['Title'].replace('Mme','Mrs')
# Embarked
df['Embarked'] = df['Embarked'].fillna('S')
# Cabin, Deck
df['Deck'] = df['Cabin'].apply(lambda s: s[0] if pd.notnull(s) else 'M')
#df.loc[(df['Deck'] == 'T'), 'Deck'] = 'A'

# Thanks to https://www.kaggle.com/erinsweet/simpledetect
# Fare
med_fare = df.groupby(['Pclass', 'Parch', 'SibSp']).Fare.median()[3][0][0]
df['Fare'] = df['Fare'].fillna(med_fare)
#Age
# transform returns a result aligned with df's index, so it can be assigned back directly
df['Age'] = df.groupby(['Sex', 'Pclass', 'Title'])['Age'].transform(lambda x: x.fillna(x.median()))
# Family_Size
df['Family_Size'] = df['SibSp'] + df['Parch'] + 1

# Thanks to https://www.kaggle.com/vbmokin/titanic-top-3-cluster-analysis
cols_to_drop = ['Name','Ticket','Cabin', 'IsWomanOrBoy', 'WomanOrBoyCount', 'FamilySurvivedCount']
df = df.drop(cols_to_drop, axis=1)

df.WomanOrBoySurvived = df.WomanOrBoySurvived.fillna(0)
df.Alone = df.Alone.fillna(0)

target = df.Survived.loc[traindf.index]
df = df.drop(['Survived'], axis=1)
# Copy the slices so that later column assignments do not act on views of df
train, test = df.loc[traindf.index].copy(), df.loc[testdf.index].copy()
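
The feature engineering cell above derives `Title` and `LastName` from `Name` and fills missing ages with a group median. The sketch below replays those two steps on a few illustrative rows so the string splitting and the group-wise fill are easier to follow. The `toy` frame and its values are assumptions made up for this example, and it groups by `Sex` and `Title` only, to keep the table small.

```python
import numpy as np
import pandas as pd

# Illustrative rows in the Titanic "Name" format: "LastName, Title. First names"
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Palsson, Master. Gosta Leonard",
             "Sage, Miss. Dorothy Edith"],
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 26.0, 2.0, np.nan],
})

# Title: the text between the comma and the first period
toy["Title"] = toy.Name.str.split(",").str[1].str.split(".").str[0].str.strip()
# LastName: the text before the comma
toy["LastName"] = toy.Name.str.split(",").str[0]

# Fill each missing age with the median age of the same (Sex, Title) group
toy["Age"] = toy.groupby(["Sex", "Title"])["Age"].transform(lambda x: x.fillna(x.median()))
print(toy[["LastName", "Title", "Age"]])
```

The last row has no age, so it receives the median of the other `female` / `Miss` rows.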

In [None]:
train.head(5)

In [None]:
train.info()

In [None]:
test.info()

## 4. Preparing to modeling <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Encoding categorical features
numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categorical_columns = []
features = train.columns.values.tolist()
for col in features:
    if train[col].dtype in numerics: continue
    categorical_columns.append(col)
for col in categorical_columns:
    if col in train.columns:
        le = LabelEncoder()
        le.fit(list(train[col].astype(str).values) + list(test[col].astype(str).values))
        train[col] = le.transform(list(train[col].astype(str).values))
        test[col] = le.transform(list(test[col].astype(str).values)) 
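
The encoder above is fitted on the union of the train and test values. A minimal sketch of why that matters: a category that occurs only in the test set still gets a code, whereas an encoder fitted on the training values alone would reject it. The `toy_train` and `toy_test` frames are assumptions made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_train = pd.DataFrame({"Deck": ["C", "M", "E"]})
toy_test = pd.DataFrame({"Deck": ["M", "G"]})   # "G" never occurs in toy_train

# Fit on all values seen in either set, then transform each set separately
le = LabelEncoder()
le.fit(pd.concat([toy_train["Deck"], toy_test["Deck"]]).astype(str))
toy_train["Deck"] = le.transform(toy_train["Deck"].astype(str))
toy_test["Deck"] = le.transform(toy_test["Deck"].astype(str))

print(list(le.classes_))
print(toy_train["Deck"].tolist(), toy_test["Deck"].tolist())
```

The codes follow the sorted order of the classes, so they carry no meaning beyond identity. Tree models handle that well; for the linear model it is a simplification.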

In [None]:
train.info()

In [None]:
test.info()

## 5. Tuning models, building the feature importance diagrams and prediction<a class="anchor" id="5"></a>

[Back to Table of Contents](#0.1)

### 5.1 LGBM <a class="anchor" id="5.1"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Split the training data into training and validation sets
Xtrain, Xval, Ztrain, Zval = train_test_split(train, target, test_size=0.2, random_state=0)
# The `silent` argument is omitted: newer LightGBM releases no longer accept it
train_set = lgbm.Dataset(Xtrain, Ztrain)
valid_set = lgbm.Dataset(Xval, Zval)

In [None]:
# Tuning LGB model
# See parameters in the documentation https://lightgbm.readthedocs.io/en/latest/Parameters.html
params = {
        'boosting_type':'gbdt',
        'objective': 'binary', # for regression task - "regression" or other
        'num_leaves': 31,
        'learning_rate': 0.05,
        'max_depth': -1,
        'subsample': 0.8,
        'bagging_fraction' : 1,
        'max_bin' : 50 ,
        'bagging_freq': 20,
        'colsample_bytree': 0.6,
        'metric': 'binary',     # eval_metric, for regression task - "rmse" or other
        'min_split_gain': 0.5,
        'min_child_weight': 1,
        'min_child_samples': 2,
        'scale_pos_weight':1,
        'zero_as_missing': True,
        'seed':0,        
    }

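modelL = lgbm.train(params, train_set = train_set, num_boost_round=2000,
                   valid_sets=[valid_set],
                   # Early stopping and logging are passed as callbacks; newer LightGBM
                   # releases no longer accept early_stopping_rounds / verbose_eval here
                   callbacks=[lgbm.early_stopping(stopping_rounds=10),
                              lgbm.log_evaluation(period=10)])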

In [None]:
# FI diagram drawing
fig =  plt.figure(figsize = (15,15))
axes = fig.add_subplot(111)
lgbm.plot_importance(modelL,ax = axes,height = 0.5)
plt.show();plt.close()

In [None]:
# FI diagram saving
feature_score = pd.DataFrame(train.columns, columns = ['feature']) 
feature_score['LGB'] = modelL.feature_importance()

In [None]:
# Prediction
y_preds_lgb = modelL.predict(test, num_iteration=modelL.best_iteration)

### 5.2 XGB<a class="anchor" id="5.2"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Wrap the training, validation and test data in DMatrix objects for XGBoost
data_tr  = xgb.DMatrix(Xtrain, label=Ztrain)
data_cv  = xgb.DMatrix(Xval   , label=Zval)
data_train = xgb.DMatrix(train)
data_test  = xgb.DMatrix(test)
evallist = [(data_tr, 'train'), (data_cv, 'valid')]

In [None]:
# Tuning XGB model
# See parameters in the documentation https://xgboost.readthedocs.io/en/latest/parameter.html
parms = {'max_depth':5, # maximum depth of a tree
         'objective':'reg:logistic', # for regression task - "reg:squarederror" or other
         'eval_metric':'error',      # for regression task - "rmse" or other
         'learning_rate':0.01,
         'subsample':0.8, # fraction of training rows sampled for each boosting round
         'colsample_bylevel':0.9,
         'min_child_weight': 2,
         'seed': 0}
modelx = xgb.train(parms, data_tr, num_boost_round=2000, evals = evallist,
                  early_stopping_rounds=300, maximize=False, 
                  verbose_eval=100)

print('score = %1.5f, n_boost_round =%d.'%(modelx.best_score,modelx.best_iteration))

In [None]:
# FI diagram drawing
fig =  plt.figure(figsize = (15,15))
axes = fig.add_subplot(111)
xgb.plot_importance(modelx,ax = axes,height = 0.5)
plt.show();plt.close()

In [None]:
# FI diagram saving
feature_score['XGB'] = feature_score['feature'].map(modelx.get_score(importance_type='weight'))
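
`get_score` returns a dictionary that only contains features used in at least one split, so mapping it onto the full feature list leaves `NaN` for the unused ones; the later `fillna(0)` call turns those into zero importance. A small sketch of that behaviour, using a made-up dictionary (`toy_scores` is an assumption, not real model output):

```python
import pandas as pd

toy_features = pd.DataFrame({"feature": ["Sex", "Age", "Alone"]})
# Hypothetical importance dictionary in which "Alone" was never used for a split
toy_scores = {"Sex": 12.0, "Age": 30.0}

# Series.map looks each feature name up in the dictionary; missing keys become NaN
toy_features["XGB"] = toy_features["feature"].map(toy_scores)
missing = toy_features["XGB"].isna().tolist()

# Treat features the model never used as having zero importance
toy_features["XGB"] = toy_features["XGB"].fillna(0)
print(toy_features)
```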

In [None]:
# Prediction
y_preds_xgb = modelx.predict(data_test)

### 5.3 Linear Regression <a class="anchor" id="5.3"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Min-max scaling of the features to [0, 1] for the regression model
Scaler_train = preprocessing.MinMaxScaler().fit(train)
train = pd.DataFrame(Scaler_train.transform(train), columns=train.columns, index=train.index)
test = pd.DataFrame(Scaler_train.transform(test), columns=test.columns, index=test.index)

In [None]:
# Fitting the Linear Regression model
linreg = LinearRegression()
linreg.fit(train, target)

In [None]:
# FI diagram drawing
coeff_linreg = pd.DataFrame(train.columns)
coeff_linreg.columns = ['feature']
coeff_linreg["LinRegress"] = pd.Series(linreg.coef_)
coeff_linreg.sort_values(by='LinRegress', ascending=False)

In [None]:
# Eli5 visualization
eli5.show_weights(linreg)

In [None]:
# FI diagram saving
coeff_linreg["LinRegress"] = coeff_linreg["LinRegress"].abs()
feature_score = pd.merge(feature_score, coeff_linreg, on='feature')
feature_score = feature_score.fillna(0)
feature_score = feature_score.set_index('feature')
feature_score

In [None]:
# Prediction
y_preds_linreg = linreg.predict(test)

## 6. Comparison and merging of all feature importance diagrams <a class="anchor" id="6"></a>

[Back to Table of Contents](#0.1)

In [None]:
# MinMax scaling all feature importances
feature_score = pd.DataFrame(
    preprocessing.MinMaxScaler().fit_transform(feature_score),
    columns=feature_score.columns,
    index=feature_score.index
)

# Create mean column
feature_score['Mean'] = feature_score.mean(axis=1)

In [None]:
# Merging FI diagram

# Set weight of models
w_lgb = 0.4
w_xgb = 0.5
w_linreg = 1 - w_lgb - w_xgb
w_linreg

# Create merging column with different weights
feature_score['Merging'] = w_lgb*feature_score['LGB'] + w_xgb*feature_score['XGB'] + w_linreg*feature_score['LinRegress']
feature_score.sort_values('Merging', ascending=False)
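
The scaling and merging steps above can be checked by hand on a small table. The sketch below applies the same column-wise min-max formula, `(x - min) / (max - min)`, with plain pandas and then forms the weighted sum with the weights 0.4, 0.5 and 0.1. The importance values in `toy` are assumptions made up for this example.

```python
import pandas as pd

# Made-up raw importances for three features and three models
toy = pd.DataFrame(
    {"LGB": [100.0, 50.0, 0.0], "XGB": [10.0, 40.0, 20.0], "LinRegress": [0.2, 0.1, 0.6]},
    index=["Sex", "Age", "Fare"],
)

# Column-wise min-max scaling: every model's importances end up in [0, 1]
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Weighted sum of the scaled importances (the weights add up to 1)
weights = {"LGB": 0.4, "XGB": 0.5, "LinRegress": 0.1}
scaled["Merging"] = sum(w * scaled[col] for col, w in weights.items())
print(scaled.sort_values("Merging", ascending=False))
```

Because each column is rescaled separately, the merged score compares the relative rank of a feature within each model rather than the raw magnitudes, which are not comparable across models.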

## 7. Feature Importance - Advanced Visualization <a class="anchor" id="7"></a>

[Back to Table of Contents](#0.1)

### 7.1 Matplotlib <a class="anchor" id="7.1"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Plot the feature importances
plot_title = "Feature Importance - Advanced Visualization with Matplotlib"
feature_score.sort_values('Merging', ascending=False).plot(kind='bar', figsize=(20, 10), title = plot_title)

### 7.2 Seaborn <a class="anchor" id="7.2"></a>

[Back to Table of Contents](#0.1)

In [None]:
def plot_feature_parallel(df, title):
    # Draw a parallel coordinates plot (pandas.plotting.parallel_coordinates), one line per feature of the given df
    
    plt.figure(figsize=(15,12))
    parallel_coordinates(df, 'feature', colormap=plt.get_cmap("tab20c"), lw=3)
    plt.title(title)
    plt.xlabel("Models")
    plt.ylabel("Feature importance")
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.savefig('graph.png')
    plt.show()

In [None]:
# List of models
feature_score_columns = feature_score.columns
feature_score_columns

In [None]:
feature_score = feature_score.reset_index(drop=False)
plot_feature_parallel(feature_score, "Feature Importance - Advanced Visualization with Seaborn")

In [None]:
feature_score

In [None]:
def features_selection_by_weights(df, threshold):
    # Select the features whose weight exceeds the threshold in at least one column (model)

    # Every column except 'feature' holds importance scores
    score_columns = [col for col in df.columns if col != 'feature']
    # True for rows where any score column is above the threshold
    is_best = (df[score_columns] > threshold).any(axis=1)
    
    return df[is_best].reset_index(drop=True)

In [None]:
# Select the best features
threshold_fi = 0.25
feature_score_best = features_selection_by_weights(feature_score, threshold_fi)
feature_score_best

In [None]:
plot_feature_parallel(feature_score_best, "Feature Importance of the best features - Advanced Visualization with Seaborn")

From here you can drop insignificant features, adjust the weights of the models' solutions, or first measure the accuracy that the current weights give and then experiment with alternatives.
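
One way to compare weight settings is to score the blended prediction on held-out data. The sketch below is a minimal illustration under stated assumptions: `blend_accuracy` is a hypothetical helper, not part of this notebook, and the predictions are made-up numbers standing in for validation predictions of three models.

```python
import numpy as np

def blend_accuracy(preds, weights, y_true, threshold=0.5):
    """Accuracy of a weighted average of probability-like predictions."""
    blended = sum(w * np.asarray(p, dtype=float) for p, w in zip(preds, weights))
    labels = (blended > threshold).astype(int)
    return float((labels == np.asarray(y_true)).mean())

# Made-up validation labels and predictions from three models
y_val = np.array([1, 0, 1, 0])
p_a = np.array([0.9, 0.2, 0.4, 0.6])
p_b = np.array([0.8, 0.1, 0.7, 0.3])
p_c = np.array([0.6, 0.4, 0.3, 0.7])

# Compare two candidate weight settings
for w in [(0.4, 0.5, 0.1), (0.8, 0.1, 0.1)]:
    print(w, blend_accuracy([p_a, p_b, p_c], w, y_val))
```

To apply this idea here, each model would need predictions on data it was not trained on. Note that the linear regression above is fitted on the full training set, so it would have to be refitted on the training split before its validation score is meaningful.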

## 8. Merging solutions and submission<a class="anchor" id="8"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Merging solutions and submission
y_preds = w_lgb*y_preds_lgb + w_xgb*y_preds_xgb + w_linreg*y_preds_linreg
submission['Survived'] = [1 if x>0.5 else 0 for x in y_preds]
submission.head()

In [None]:
submission['Survived'].hist()

In [None]:
submission.to_csv('submission.csv', index=False)

I hope you find this kernel useful and enjoyable.

Your comments and feedback are most welcome.

[Go to Top](#0)