## EDA + Light GBM - Tabular Series Apr 2021

In this kernel I will explore and create a predictive model with the T

## Import libs and data

In [None]:
# Libs to deal with tabular data
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)

# Statistics
from scipy.stats import chi2_contingency
from scipy.stats.contingency import expected_freq

# Plotting packages
import seaborn as sns
sns.set_style("darkgrid")
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')

# Machine Learning
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from boruta import BorutaPy

from lightgbm import LGBMClassifier

# Optimization
!pip uninstall optuna -y
!pip uninstall typing -y
!pip install optuna==2.3.0
import optuna
from optuna.samplers import TPESampler
from optuna.visualization import plot_contour, plot_optimization_history
from optuna.visualization import plot_param_importances, plot_slice


# To display stuff in notebook
from IPython.display import display, Markdown

# Misc
from tqdm.notebook import tqdm
import time

In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')

## Data analysis and preparation

Before we start analyzing individual variables, it's essential to take a look at the overall features of the dataset.

In [None]:
train.sample(10)

In [None]:
train.shape

In [None]:
train.info()

In [None]:
train.nunique()

### Analyzing distributions and characteristics

Here I analyze each variable individually and with respect to the target. Before we start, I define a helper function that runs Pearson's chi-squared test of independence and computes Cramer's V, which measures the strength of the association between two categorical variables.

In [None]:
def chi2_test_cramers_v(var1, var2):
    cont_freq = pd.crosstab(train[var1], train[var2]).values
    n_obs = cont_freq.sum().sum()
    chi2_test = chi2_contingency(cont_freq)
    cramers_v = np.sqrt(chi2_test[0] / (n_obs * (min(cont_freq.shape) - 1)))
    print("Cramer's V:", cramers_v)
    print('P-value:', chi2_test[1]) 
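
To make the formula inside the function above concrete, here is a minimal sketch that computes Cramer's V by hand from a contingency table. The helper name `cramers_v_from_table` and both toy tables are assumptions made for illustration, not competition data. Note that this sketch applies no continuity correction, whereas `chi2_contingency` applies Yates' correction by default on 2x2 tables, so the two can differ slightly there.

```python
import numpy as np

def cramers_v_from_table(table):
    """Cramer's V from a contingency table, without continuity correction."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: (row total * column total) / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Hypothetical toy tables (not taken from the competition data)
independent = [[20, 20], [30, 30]]   # same column split in every row
dependent = [[50, 0], [0, 50]]       # the row fully determines the column

v_independent = cramers_v_from_table(independent)
v_dependent = cramers_v_from_table(dependent)
print(v_independent, v_dependent)
```

The coefficient ranges from 0 (no association) to 1 (one variable fully determines the other), which is why values around 0.3 to 0.5 are read as moderate in the sections below.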

#### Survived

In [None]:
ax = train['Survived'].replace({
    0: 'No',
    1: 'Yes'
}).value_counts().plot.bar(rot=0)
ax.set_title('Survived', fontsize=16)
plt.show()

In [None]:
train['Survived'].replace({
    0: 'No',
    1: 'Yes'
}).value_counts()/train.shape[0]

Notice that our dataset is a little bit unbalanced. In this kind of situation, we can create a new sample from the dataset where the frequency of the positive class (survived) is closer to that of the negative class. This technique is very useful when the positive class corresponds to 10% or less of the dataset. In this case, since the imbalance is not so severe, we can keep the dataset untouched.
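
For reference, the resampling idea described above could look like the sketch below, which undersamples the majority class with pandas. The `toy` frame and its column names are hypothetical, and this step is not applied to the competition data in this kernel.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced frame: 90 negatives, 10 positives
toy = pd.DataFrame({
    'feature': np.arange(100),
    'target': [0] * 90 + [1] * 10,
})

# Size of the rarest class
n_minority = toy['target'].value_counts().min()

# Draw the same number of rows from every class
balanced = pd.concat([
    group.sample(n=n_minority, random_state=42)
    for _, group in toy.groupby('target')
])

print(balanced['target'].value_counts())
```

Undersampling throws away majority rows, so it is usually worth it only when the imbalance is severe.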

#### Passenger class 

In [None]:
ax = train['Pclass'].value_counts().sort_index().plot.bar(rot=0)
ax.set_title('Class', fontsize=16)
plt.show()

Notice that the first and second class are almost equal in frequency while the third class, which is probably the cheapest one, has more passengers than the others. When we compare this variable to the frequency of the positive event, it's clear that being in the first two classes gives passengers an advantage in terms of survival. On the other hand, a passenger in the third class will likely not survive.

In [None]:
sns.countplot(data = train, x = 'Pclass', hue = 'Survived')
plt.show()

In [None]:
chi2_test_cramers_v('Pclass', 'Survived')

Using a chi2 independence test, the p-value is effectively 0, so we reject the hypothesis that this pair of variables is independent. Also, computing the Cramer's V coefficient we get a modest value, which indicates that there is an association but it's not that strong.

#### Name

Name is a text column that at first sight doesn't give us much information about whether a passenger will survive or not. Thus, I'll drop it before I start modelling.

In [None]:
train['Name'].value_counts()

#### Sex

Analyzing the variable sex, we conclude that survival is very dependent on it. Although males are more frequent, women are much more likely to survive.

In [None]:
ax = train['Sex'].value_counts().plot.bar(rot = 0)
ax.set_title('Sex', fontsize=16)
plt.show()

In [None]:
sns.countplot(data = train, x = 'Sex', hue = 'Survived')
plt.show()

In [None]:
chi2_test_cramers_v('Sex', 'Survived')

Again our chi2 test indicates dependency, and the Cramer's V is about 0.5.

#### Age

Age has about 3400 missing values and a few estimated ages. According to the metadata, rows with decimal ages are estimated. Below we can see that most ages aren't estimates, but there are 1100 rows with an estimated age.

In [None]:
(train['Age']%1).value_counts()

The distribution doesn't look like a normal curve. Instead, age is very spread out and there are some peaks around 8, 25 and 55 years. Also, some bins of the histogram are more populated than their neighbors.

In [None]:
sns.histplot(train['Age'])
plt.title('Age', fontsize=16)
plt.xlabel('Years')
plt.show()

In [None]:
train['Age'].describe()

Regarding the dependent variable, we can draw three conclusions:
- People who survived are slightly older. The distributions shift just a bit.
- Correlation is 0.10, which is low.
- Whether age is missing doesn't influence the distribution of survived.

In [None]:
train['Age'].corr(train['Survived'])

In [None]:
sns.boxplot(data=train, x = 'Survived', y='Age')

In [None]:
train.loc[train['Age'].isnull(), 'Survived'].value_counts(normalize=True).sort_index()

In [None]:
train['age_null'] = train['Age'].isnull()
chi2_test_cramers_v('age_null', 'Survived')

#### Siblings or spouse

Over 90% of the passengers have at most one sibling or spouse.

In [None]:
ax = train['SibSp'].value_counts().sort_index().plot.bar(rot=0)
plt.title('Number of siblings or spouse', fontsize=16)
plt.show()

One could say that having a sibling or spouse would help in surviving because you have someone to rely on. On the other hand, it could be bad because you have to worry about the other person. As we can see below from the Spearman correlation, neither hypothesis is supported, because the correlation is very close to 0.

In [None]:
train['SibSp'].corr(train['Survived'], 'spearman')

In [None]:
sns.countplot(data = train, x = 'SibSp', hue = 'Survived')
plt.show()

#### Parents or children

Most people don't have a parent or child on board, but compared to the variable above, the distribution seems to be shifted, and there is a considerable group of people with 2 relatives.

In [None]:
ax = train['Parch'].value_counts().sort_index().plot.bar(rot=0)
plt.title('Number of parents or children', fontsize=16)
plt.show()

Regarding the dependent variable, this variable has a slightly higher Spearman correlation than the number of siblings and spouses, but it's still very low.

In [None]:
train['Parch'].corr(train['Survived'], 'spearman')

In [None]:
sns.countplot(data = train, x = 'Parch', hue = 'Survived')
plt.show()

#### Ticket

Ticket is a mixed variable because it can have numbers as well as letters. Also, we have 75,331 unique tickets, which means that there are a lot of people who hold identical tickets.

In [None]:
train['Ticket'].sample(10)

In [None]:
train['Ticket'].value_counts().head(10)

Although it can have letters, most tickets (over 70,000) have only numbers.

In [None]:
# number of tickets with only numbers
train['Ticket'].notnull().sum() - train['Ticket'].str.contains(r'[^\d]').sum()

Notice that almost 5 thousand passengers don't have a ticket. One hypothesis is that these people are employees of the cruise ship. Looking at the relationship between surviving and lacking a ticket, it's not strong but, according to the chi2 test, the two are dependent.

In [None]:
train['ticket_null'] = train['Ticket'].isnull()
chi2_test_cramers_v('ticket_null', 'Survived')

In [None]:
train.loc[train['ticket_null'].eq(True), 'Survived'].value_counts(normalize=True).sort_index()

#### Fare

As we can see below, fare is heavily skewed, so that most people spent little money and a few of them spent a lot.

In [None]:
sns.histplot(train['Fare'])
plt.title('Fare', fontsize=16)
plt.xlabel('Money')
plt.show()

In [None]:
train['Fare'].describe()

One solution to better visualize and understand the variable is to apply the log function. Notice how the peaks are more evident.

In [None]:
sns.histplot(train['Fare'].apply(np.log10))
plt.title('log(Fare)', fontsize=16)
plt.xlabel('Money')
plt.show()
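
As a small numeric illustration of why the log transform helps, the sketch below compares the skewness of a right-skewed series before and after applying `np.log10`. The `fares` values are hypothetical, chosen only to mimic many cheap tickets and a few expensive ones. Keep in mind that `np.log10` returns `-inf` for a fare of 0, so this assumes strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: many cheap tickets and a few very expensive ones
fares = pd.Series([5, 7, 8, 10, 12, 15, 20, 30, 250, 500], dtype=float)

raw_skew = fares.skew()
log_skew = np.log10(fares).skew()
print(raw_skew, log_skew)
```

The log compresses the long right tail, so the skewness drops and the bulk of the distribution becomes easier to see in a histogram.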

Analyzing the graph below, people who survived tend to pay more expensive fares. The distribution of the people who survived is more skewed than the other.

In [None]:
sns.boxplot(data=train, x = 'Survived', y='Fare')

Also, the Spearman coefficient is higher than the other coefficients analyzed so far.

In [None]:
train['Fare'].corr(train['Survived'], 'spearman')

#### Cabin

Cabin has the type object and can be split into two parts: the letter and the number. One problem with this variable is the high number of missing values. Notice that the same cabin can have multiple passengers.

In [None]:
train['cabin_number'] = train['Cabin'].str.extract(r'(\d+)').astype('float64')
train['cabin_letter'] = train['Cabin'].str.extract('([A-Za-z])')
train['cabin_null'] = train['Cabin'].isnull()
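
To see what the two regular expressions above capture, here is a minimal sketch on a few made-up cabin strings. The `cabins` values are hypothetical, and `expand=False` is used only so that each result is a Series.

```python
import pandas as pd

# Hypothetical cabin values, including a missing one
cabins = pd.Series(['C123', 'B45', None, 'E7'])

# First letter found in the string
letter = cabins.str.extract(r'([A-Za-z])', expand=False)
# First run of digits, converted to float so missing values stay NaN
number = cabins.str.extract(r'(\d+)', expand=False).astype('float64')

print(pd.DataFrame({'cabin': cabins, 'letter': letter, 'number': number}))
```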

In [None]:
train.loc[train['Cabin'].notnull(), 'Cabin'].sample(10)

In [None]:
train['Cabin'].value_counts().describe([.25, .5, .75, .9])

First, analyzing the relationship between a missing cabin and surviving, we see that the Cramer's V is modest, but high compared to the other variables. That's because people who have a cabin are much more likely to survive.

In [None]:
pd.crosstab(train['cabin_null'], train['Survived']).div(train['cabin_null'].value_counts().sort_index(), axis='rows')

In [None]:
chi2_test_cramers_v('cabin_null', 'Survived')

Now analyzing the cabin letters, "A", "B" and "C" are the most popular types.

In [None]:
train['cabin_letter'].value_counts().sort_index().plot.bar(rot=0)
plt.title('Cabin letter', fontsize=16)
plt.show()

By analyzing the graphs below it's clear that having a cabin is often associated with surviving the accident. The Cramer's V is also high.

In [None]:
sns.countplot(data = train, x = 'cabin_letter', hue = 'Survived')
plt.show()

In [None]:
chi2_test_cramers_v('cabin_letter', 'Survived')

Finally, as we expected, the cabin number is just random and the Pearson correlation coefficient is close to 0.

In [None]:
sns.histplot(train['cabin_number'])

In [None]:
sns.boxplot(data=train, x = 'Survived', y = 'cabin_number')

In [None]:
train['cabin_number'].corr(train['Survived'])

#### Embarked

The two most popular embarking options are Southampton and Cherbourg.

In [None]:
ax = train['Embarked'].replace({
    'S': 'Southampton',
    'C': 'Cherbourg',
    'Q': 'Queenstown'
}).value_counts().plot.bar(rot=0)
ax.set_title('Embarked', fontsize=16)
plt.show()

The variable seems important for prediction as it has a Cramer's V of 0.35. Regarding missing values, they don't have predictive power.

In [None]:
sns.countplot(data = train, x = 'Embarked', hue = 'Survived')
plt.show()

In [None]:
chi2_test_cramers_v('Embarked', 'Survived')

In [None]:
train['embarked_null'] = train['Embarked'].isnull()
chi2_test_cramers_v('embarked_null', 'Survived')

### Data preparation

In this section I will do some simple data preparation, so that we can create a simple model without fancy feature engineering.

In [None]:
# Drop useless columns or the ones that are hard to input in the model 
train = train.drop(columns = [
    'PassengerId', 'Name', 'Cabin', 'cabin_number', 'Ticket',
    'cabin_null', 'age_null', 'ticket_null', 'embarked_null'
])

# Replace missing values of cabin letter with a new category M, standing for missing
train['cabin_letter'] = train['cabin_letter'].fillna('M')

# Applying transformations
train['Fare'] = train['Fare'].apply(np.log10)

y_train = train['Survived']

x_train_raw = train.drop(columns=['Survived'])

x_train_cat_codes = x_train_raw.copy()
encoding_dict = {}
for col in ['Sex', 'Embarked', 'cabin_letter']:
    encoding_dict[col] = {val: idx for idx, val in enumerate(x_train_cat_codes[col].unique())}
    x_train_cat_codes[col] = x_train_cat_codes[col].replace(encoding_dict[col])

x_train_filled = x_train_cat_codes.copy()
median_imp = SimpleImputer(strategy='median').fit(x_train_filled.values[:,[2, 5]])
mode_imp = SimpleImputer(strategy='most_frequent').fit(x_train_filled.values[:,[6]])
x_train_filled.iloc[:,[2, 5]] = median_imp.transform(x_train_filled.values[:,[2, 5]])
x_train_filled.iloc[:,[6]] = mode_imp.transform(x_train_filled.values[:,[6]])

In [None]:
x_train_cat_codes

## Feature selection

In this section I will test a number of options regarding feature selection. 

- Correlation coefficients
- Boruta framework
- Mutual information

### Correlation coefficients

First, let's put the correlation coefficients in a single table and compare them. To do so, I will have to change the Cramer's V function a little bit.

In [None]:
def cramers_v(var1, var2):
    cont_freq = pd.crosstab(var1, var2).values
    n_obs = cont_freq.sum().sum()
    chi2_test = chi2_contingency(cont_freq)
    cramers_v = np.sqrt(chi2_test[0] / (n_obs * (min(cont_freq.shape) - 1)))
    return cramers_v

In [None]:
train[['Pclass', 'Sex', 'Embarked', 'cabin_letter']].astype("category").apply(lambda x: x.cat.codes).corrwith(train['Survived'], method = cramers_v)

In [None]:
train[['Age', 'SibSp', 'Parch', 'Fare']].corrwith(train['Survived'], method = 'spearman')

All categorical features have a correlation coefficient greater than 0.3, which is a moderate result and not negligible. The numerical features, however, don't have a high correlation even when we use the Spearman coefficient, which analyzes ranks instead of raw values. One variable we could drop immediately is SibSp, whose coefficient is very close to 0. For the rest, I think it's best to look at other methods to get another perspective.
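
To illustrate what "analyzes ranks instead of raw values" means, the sketch below computes the Spearman coefficient as a Pearson correlation on ranks. The series `x` and `y` are hypothetical: `y` grows monotonically with `x`, but not linearly.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5], dtype=float)
y = x ** 3   # monotonic but non-linear relationship

pearson = x.corr(y)                  # default method is Pearson
spearman = x.rank().corr(y.rank())   # Pearson applied to the ranks

print(pearson, spearman)
```

Because the ranks of `x` and `y` are identical, the rank-based coefficient is 1 while the Pearson coefficient is lower. A low Spearman value therefore rules out monotonic relationships in general, not only linear ones.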

### Boruta

Boruta is an all-relevant feature selection algorithm based on random forests, feature importance and random variables. In our context, Boruta is a useful algorithm because it's capable of evaluating the relationship between more than one feature and the target. Thus, we are not limited to assessing importance only with bivariate analysis between a feature and our target.

It works by first creating a shadow variable for each input feature. This shadow variable is a permuted version of the original one, so that it has the same distribution but no correlation with the target. Then, a random forest is fitted using both original and shadow features. The algorithm's key idea is that important features are those that have an importance metric greater than the importance of the best shadow feature. Such important features are recorded with a "hit" and this process of creating shadow features and fitting RFs is repeated a number of times. Repetition is important because the algorithm is random in its nature.

After that, we will have a list of how many times each variable was a hit and we can use the binomial distribution to assess the probability of getting an amount of hits merely by chance. The variables with low probabilities are considered important and the ones with high probabilities are discarded. The variables with probability between the low and high thresholds are considered tentative and it's up to the user to discard or utilize them. 
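
The mechanics described above can be sketched in a few lines. This is a simplified illustration, not the BorutaPy implementation: the importance measure is swapped for an absolute Pearson correlation instead of random forest importances, the data is synthetic, the names `importance` and `prob_at_least` are hypothetical, and the binomial check is one-sided with no multiple-testing correction.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(42)
n = 500

# Hypothetical data: 'signal' drives the target, 'noise' does not
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(float)
X = np.column_stack([signal, noise])

def importance(col, target):
    # Stand-in importance: absolute Pearson correlation with the target.
    # Real Boruta uses random forest feature importances instead.
    return abs(np.corrcoef(col, target)[0, 1])

n_rounds = 20
hits = np.zeros(X.shape[1], dtype=int)
for _ in range(n_rounds):
    # Shadow features: every column permuted independently
    shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    best_shadow = max(importance(shadows[:, j], y) for j in range(shadows.shape[1]))
    for j in range(X.shape[1]):
        if importance(X[:, j], y) > best_shadow:
            hits[j] += 1

def prob_at_least(k, n_trials, p=0.5):
    # P(X >= k) for X ~ Binomial(n_trials, p)
    return sum(comb(n_trials, i) * p**i * (1 - p)**(n_trials - i)
               for i in range(k, n_trials + 1))

p_values = [prob_at_least(int(h), n_rounds) for h in hits]
print(dict(zip(['signal', 'noise'], zip(hits.tolist(), p_values))))
```

A feature that beats the best shadow in every round gets a tiny probability of doing so by chance and would be kept, while a feature whose hit count is compatible with coin flipping would be tentative or rejected.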



In [None]:
rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=42)
feat_selector = BorutaPy(
    rf, 
    verbose=2, 
    random_state=42,
    n_estimators = 'auto', 
    two_step=False
).fit(x_train_filled.values, y_train.values)

As we can see, Boruta considered all features relevant to the problem. So, even though the number of siblings and spouse has a low correlation with the target, it has an interaction with other variables which makes it important to predict the target.

### Mutual information

Mutual information is a metric from information theory which assesses how much the observed joint distribution deviates from the joint distribution we would get if both variables were independent. Values close to 0 indicate that the variables are almost independent.
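
For two discrete variables, this definition can be computed directly from a table of joint counts, as in the sketch below. The helper `mutual_information` and the toy tables are assumptions for illustration. `mutual_info_classif` estimates the quantity differently for continuous features (with a nearest-neighbor method), so its numbers are not expected to match a hand computation like this one.

```python
import numpy as np

def mutual_information(joint_counts):
    """Mutual information (in nats) from a table of joint counts."""
    joint = np.asarray(joint_counts, dtype=float)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of the row variable
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of the column variable
    independent = p_x * p_y                 # joint distribution under independence
    nonzero = p_xy > 0                      # 0 * log(0) is treated as 0
    return float((p_xy[nonzero] * np.log(p_xy[nonzero] / independent[nonzero])).sum())

# Hypothetical toy tables
mi_independent = mutual_information([[20, 20], [30, 30]])
mi_dependent = mutual_information([[50, 0], [0, 50]])
print(mi_independent, mi_dependent)
```

The first table factorizes exactly into its marginals, so its mutual information is 0. In the second one each variable fully determines the other, which gives log(2), the maximum for two balanced binary variables.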

In [None]:
mutu_info = mutual_info_classif(x_train_filled, y_train, discrete_features=[0, 1, 3, 4, 6, 7])
mutu_info = pd.Series(mutu_info, index=x_train_filled.columns).sort_values(ascending=False)

sns.barplot(x = mutu_info.values, y = mutu_info.index, color='cornflowerblue')
plt.title('Mutual information', fontsize=16)
plt.show()

Mutual information is a good way to assess feature importance regardless of the variable type, so that we can put categorical and numerical variables under the same scale. However, we see that the overall ordering of importance didn't change significantly.

## Modelling

In this section I will be using Light GBM with Optuna, which is an automatic hyperparameter search tool. Unlike the standard use of the Light GBM library, I'm going to experiment with the random forest mode. Although it's a poorly documented mode, it's equivalent to a boosting algorithm with 1 estimator and a number of parallel trees. The learning (shrinkage) rate is fixed to 1 and we must set the parameters that control the size of the training set to use in each tree and the number of features to be used in each split.

Below we specify a range of values for the model hyperparameters and use Optuna to explore which combination yields the best AUC.

In [None]:
class Light_GBM_RF_CV:
    def __init__(self, x, y, folds=5, random_state=42):
        # Hold these implementation-specific arguments as fields of the class.
        self.x = x
        self.y = y
        self.folds = folds
        self.random_state = random_state

    def __call__(self, trial):
        cv = KFold(
            self.folds, 
            random_state = self.random_state, 
            shuffle=True
        )
        
        clf = LGBMClassifier(
            # task params
            boosting_type = 'rf',
            objective = 'binary',
            metric = 'auc',
            random_state = self.random_state,
            n_jobs = -1,
            # controlling tree growth
            num_leaves = trial.suggest_int('num_leaves', 16, 256),
            max_depth = trial.suggest_int('max_depth', 4, 8),
            min_child_samples = trial.suggest_int('min_child_samples', 5, 1000),
            # random forest params
            n_estimators = trial.suggest_int('n_estimators', 10, 500),
            subsample = trial.suggest_float('subsample', 0.5, 1),
            subsample_freq = 1,
            colsample_bytree = 1,
            feature_fraction_bynode = trial.suggest_float('feature_fraction_bynode', 0.1, 1),
            # learning parameters
            learning_rate = 1,
            reg_alpha = trial.suggest_loguniform('reg_alpha', 1e-5, 1.0),
            reg_lambda = trial.suggest_loguniform('reg_lambda', 1e-5, 1.0),
            # features params
            max_bin = trial.suggest_int('max_bin', 50, 256)
        )
        
        scores = []

        for train_index, val_index in cv.split(self.x):
            x_train, x_val = self.x[train_index], self.x[val_index]
            y_train, y_val = self.y[train_index], self.y[val_index]
            
            # Filling missing values
            median_imp = SimpleImputer(strategy='median').fit(x_train[:,[2, 5]])
            mode_imp = SimpleImputer(strategy='most_frequent').fit(x_train[:,[6]])
            x_train[:,[2, 5]] = median_imp.transform(x_train[:,[2, 5]])
            x_train[:,[6]] = mode_imp.transform(x_train[:,[6]])
            x_val[:,[2, 5]] = median_imp.transform(x_val[:,[2, 5]])
            x_val[:,[6]] = mode_imp.transform(x_val[:,[6]])

            clf.fit(x_train, y_train, verbose = False)
            scores.append(roc_auc_score(y_val, clf.predict_proba(x_val)[:,1]))

        return sum(scores) / self.folds

In [None]:
lgbm_cv = Light_GBM_RF_CV(x_train_cat_codes.values, y_train.values)
study = optuna.create_study(sampler=TPESampler(seed = 42), direction='maximize')
study.optimize(lgbm_cv, n_trials=100)

In [None]:
print('Best model')
print('***********')
print('Mean validation AUC: ', study.best_value, '\n')
print('Best hyperparameters')
print('***********')
for param, val in study.best_params.items():
    print(param + ':', val)

After 100 trials, we got an AUC of almost 0.85, which is good. Regarding our hyperparameters, the best combination resulted in a set of 218 trees of max depth equal to 8. Notice that both regularization parameters ended up being close to 0, indicating that they didn't help our algorithm. Bagging and feature fraction also had similar values of about 0.65, that is, in each tree we use 65% of our training samples and in each node split we consider only 65% of the available features.

Below we can see two graphs showing how the search progressed and which hyperparameters were the most important while tuning the random forest mode.

In [None]:
plot_optimization_history(study)

In [None]:
plot_param_importances(study)

### Retraining with the best parameters

Now I need to retrain the model with the best configuration and the whole dataset in order to evaluate feature importance and make predictions.

In [None]:
lgbm_final_clf = LGBMClassifier(
    boosting_type = 'rf',
    objective = 'binary',
    metric = 'auc',
    random_state = 42,
    n_jobs = -1,
    subsample_freq = 1,
    colsample_bytree = 1,
    learning_rate = 1,
    **study.best_params
)

lgbm_final_clf.fit(
    x_train_filled.values, 
    y_train,
    verbose = False
)

### Feature importance

To assess feature importance I'm going to use a method called permutation importance. It evaluates the importance of a variable by shuffling it and computing how much the objective metric drops. This process is repeated for each variable a number of times to generate a confidence interval.

In [None]:
perm_importances = permutation_importance(
    lgbm_final_clf,
    x_train_filled,
    y_train,
    scoring = 'roc_auc',
    n_repeats = 10,
    n_jobs = -1
)
df_perm_importances = pd.DataFrame(perm_importances.importances, index=x_train_filled.columns).T
df_perm_importances = df_perm_importances.melt(
    value_vars = df_perm_importances.columns,
    var_name = 'feature',
    value_name = 'importance'
)

plt.figure(figsize=(10,5))
sns.boxplot(data = df_perm_importances, x = 'importance', y = 'feature')
plt.title('Permutation importance using AUC (train set)', fontsize=16)
plt.show()
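
To show what the permutation procedure does under the hood, here is a minimal hand-rolled sketch. It is not scikit-learn's implementation: the data is synthetic, `predict` is a hypothetical fixed rule standing in for a fitted model, and accuracy is used instead of AUC to keep the code short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: the target depends only on the first column
X = rng.normal(size=(n, 2))
y = (X[:, 0] > 0).astype(int)

def predict(features):
    # Hypothetical "fitted model": a fixed rule that looks only at column 0
    return (features[:, 0] > 0).astype(int)

def accuracy(features, target):
    return float((predict(features) == target).mean())

baseline = accuracy(X, y)
n_repeats = 10
importances = np.zeros((X.shape[1], n_repeats))
for j in range(X.shape[1]):
    for r in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link with y
        importances[j, r] = baseline - accuracy(X_perm, y)

print(importances.mean(axis=1))
```

Shuffling the column the model relies on drops the score to roughly chance level, while shuffling the ignored column changes nothing. The spread across repeats is what the boxplot above displays for each feature.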

In this part the order of importance changed a bit from the correlation coefficients and the mutual information values. For example, according to the permutation importance, age is the least important feature. 

### Submission

Now let's read the test set and make predictions using the standard threshold of 0.5 for binary classification. Of course, we could fine-tune this value to increase our accuracy.

In [None]:
test = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')

In [None]:
test['cabin_letter'] = test['Cabin'].str.extract('([A-Za-z])')
test['cabin_letter'] = test['cabin_letter'].fillna('M')
test = test.drop(columns = ['Name', 'Cabin', 'Ticket'])
test['Fare'] = test['Fare'].apply(np.log10)

for col in ['Sex', 'Embarked', 'cabin_letter']:
    test[col] = test[col].replace(encoding_dict[col])

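# Impute by column name: PassengerId is still present here, so positional
# indices would be shifted by one relative to the training matrix
test[['Age', 'Fare']] = median_imp.transform(test[['Age', 'Fare']].values)
test[['Embarked']] = mode_imp.transform(test[['Embarked']].values)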

In [None]:
submission_df = pd.concat([
    test['PassengerId'], 
    pd.Series(lgbm_final_clf.predict(test.drop(columns='PassengerId').values), name='Survived')
], axis=1, copy=False)

submission_df.to_csv('./submission.csv', index=False)
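
The threshold tuning mentioned at the start of this section was not done in this kernel, but it could look like the sketch below. The labels in `y_val` and the probabilities in `proba` are hypothetical. In practice they would come from a held-out validation fold (for example via `predict_proba`), never from the test set.

```python
import numpy as np

# Hypothetical validation labels and predicted probabilities
y_val = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
proba = np.array([0.04, 0.11, 0.18, 0.27, 0.32, 0.41, 0.47, 0.62, 0.81, 0.93])

# Evaluate accuracy on a grid of candidate thresholds
thresholds = np.linspace(0.05, 0.95, 19)
accuracies = np.array([
    ((proba >= t).astype(int) == y_val).mean() for t in thresholds
])

best_threshold = thresholds[accuracies.argmax()]
default_accuracy = ((proba >= 0.5).astype(int) == y_val).mean()
print(best_threshold, accuracies.max(), default_accuracy)
```

With the chosen threshold, the submission would use `predict_proba(...)[:, 1] >= best_threshold` instead of `predict`, which applies 0.5.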