![](https://storage.googleapis.com/kaggle-competitions/kaggle/3338/media/gate.png)

## Amazon.com - Employee Access Challenge

When an employee at any company starts work, they first need to obtain the computer access necessary to fulfill their role. This access may allow an employee to read/manipulate resources through various applications or web portals. It is assumed that employees fulfilling the functions of a given role will access the same or similar resources. It is often the case that employees figure out the access they need as they encounter roadblocks during their daily work (e.g. not able to log into a reporting portal). A knowledgeable supervisor then takes time to manually grant the needed access in order to overcome access obstacles. As employees move throughout a company, this access discovery/recovery cycle wastes a nontrivial amount of time and money.

There is a considerable amount of data regarding an employee’s role within an organization and the resources to which they have access. Given the data related to current employees and their provisioned access, models can be built that automatically determine access privileges as employees enter and leave roles within a company. These auto-access models seek to minimize the human involvement required to grant or revoke employee access.

## Motivation for this kernel

In this kernel we show how deep learning (using TensorFlow 2.0 and Keras) can be applied effectively to tabular data. We compare a deep neural network (DNN) with a best-in-class gradient boosting machine (GBM), CatBoost, on a problem that contains high-cardinality categorical variables. You will see that the DNN is not only comparable to the GBM, but also blends nicely with it.

## Using a datagenerator for tabular data
Preparing the data is easier and more efficient with the data generator from the deep_learning_for_tabular_data repository (https://github.com/lmassaron/deep_learning_for_tabular_data). The generator routes numeric, ordinal and categorical variables into separate pipelines (low-cardinality categoricals are one-hot encoded, high-cardinality ones are fed to embedding layers) and automatically takes care of missing values. The repository also provides two activations that are useful for tabular models: gelu (https://arxiv.org/abs/1606.08415) and mish (https://arxiv.org/abs/1908.08681).
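
To make the two activations concrete, here is a minimal NumPy sketch of their formulas as given in the linked papers. It is not the repository's implementation: the names `gelu_np` and `mish_np` are made up for this illustration, and the GELU shown is the tanh approximation from the paper.

```python
import numpy as np

def gelu_np(x):
    """Tanh approximation of GELU (arXiv:1606.08415). Illustrative only."""
    x = np.asarray(x, dtype=float)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mish_np(x):
    """Mish (arXiv:1908.08681): x * tanh(softplus(x)). Illustrative only."""
    x = np.asarray(x, dtype=float)
    softplus = np.logaddexp(0.0, x)  # numerically stable log(1 + exp(x))
    return x * np.tanh(softplus)

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.round(gelu_np(xs), 4))
print(np.round(mish_np(xs), 4))
```

Both functions behave like the identity for large positive inputs and let small negative values through, unlike ReLU, which zeroes out every negative input.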

In [None]:
# Making sure you have the most recent CatBoost release
!pip install catboost -U

In [None]:
# Getting useful tabular processing and generator functions
!git clone https://github.com/lmassaron/deep_learning_for_tabular_data.git

In [None]:
# Importing core libraries
import numpy as np
import pandas as pd
from time import time
import pprint
import joblib

# Suppressing warnings because of skopt verbosity
import warnings
warnings.filterwarnings("ignore")

# Classifiers
from catboost import CatBoostClassifier, Pool

# Model selection
from sklearn.model_selection import StratifiedKFold

# Metrics
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics import make_scorer

In [None]:
# Loading data directly from CatBoost
from catboost.datasets import amazon

X, Xt = amazon()

y = X["ACTION"].apply(lambda x: 1 if x == 1 else 0).values
X.drop(["ACTION"], axis=1, inplace=True)

In [None]:
# Transforming all the labels of all variables
from sklearn.preprocessing import LabelEncoder

label_encoders = [LabelEncoder() for _ in range(X.shape[1])]

for col, column in enumerate(X.columns):
    # Fit on train and test together (pd.concat replaces Series.append, removed in pandas 2.0)
    label_encoders[col].fit(pd.concat([X[column], Xt[column]]))
    X[column] = label_encoders[col].transform(X[column])
    Xt[column] = label_encoders[col].transform(Xt[column])
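
The reason for fitting each encoder on the train and test values together can be seen on a toy example. The values below are made up; the point is that a level appearing only in the test set (9999 here) still receives a code, whereas an encoder fitted on the training values alone would raise an error when transforming it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up identifiers standing in for one categorical column
toy_train = pd.Series([4675, 1020, 4675, 7543])
toy_test = pd.Series([1020, 9999])  # 9999 never appears in the training rows

encoder = LabelEncoder()
encoder.fit(pd.concat([toy_train, toy_test]))  # learn every level from both sets

train_codes = encoder.transform(toy_train)
test_codes = encoder.transform(toy_test)
print(train_codes)  # [1 0 1 2]
print(test_codes)   # [0 3]
```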

In [None]:
# Encoding frequencies instead of labels (so we have some numeric variables)

def frequency_encoding(column, df, df_test=None):
    # Count how often each level appears in the training data
    frequencies = df[column].value_counts()
    # Mapping the counts avoids relying on the column names produced by
    # value_counts().reset_index(), which differ across pandas versions
    df_values = df[column].map(frequencies).values
    if df_test is not None:
        # Levels never seen in training get a count of 1
        df_test_values = df_test[column].map(frequencies).fillna(1).values
    else:
        df_test_values = None
    return df_values, df_test_values

for column in X.columns:
    train_values, test_values = frequency_encoding(column, X, Xt)
    X[column+'_counts'] = train_values
    Xt[column+'_counts'] = test_values
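
As a quick illustration of what frequency encoding produces, here is a sketch on a made-up column (the name `ROLE` and its values are hypothetical). Each level is replaced by its count in the training data, and a level that is unseen in training falls back to 1.

```python
import pandas as pd

toy_train = pd.DataFrame({"ROLE": [10, 10, 10, 20, 30, 30]})
toy_test = pd.DataFrame({"ROLE": [10, 30, 99]})  # 99 is unseen in training

counts = toy_train["ROLE"].value_counts()             # 10 -> 3, 30 -> 2, 20 -> 1
train_encoded = toy_train["ROLE"].map(counts)
test_encoded = toy_test["ROLE"].map(counts).fillna(1)  # unseen level -> 1

print(train_encoded.tolist())  # [3, 3, 3, 1, 2, 2]
print(test_encoded.tolist())   # [3.0, 2.0, 1.0]
```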

In [None]:
# Pointing out which variables are categorical and which are numeric
categorical_variables = [col for col in X.columns if '_counts' not in col]
numeric_variables = [col for col in X.columns if '_counts' in col]

In [None]:
X.head()

In [None]:
Xt.head()

In [None]:
# Counting unique values of categorical variables
X[categorical_variables].nunique()

In [None]:
# Describing numeric variables
X[numeric_variables].describe()

## Using CatBoost

In [None]:
# Initializing a CatBoostClassifier with best parameters
best_params = {'bagging_temperature': 0.6,
               'border_count': 200,
               'depth': 8,
               'iterations': 350,
               'l2_leaf_reg': 30,
               'learning_rate': 0.30,
               'random_strength': 0.01,
               'scale_pos_weight': 0.48}

catb = CatBoostClassifier(**best_params,
                          loss_function='Logloss',
                          eval_metric = 'AUC',
                          nan_mode='Min',
                          thread_count=2,
                          verbose = False)

In [None]:
# Setting a 5-fold stratified cross-validation (note: shuffle=True)
SEED = 42
FOLDS = 5

skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
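
Stratification matters here because the target is imbalanced. The sketch below uses a made-up label vector (90 positives, 10 negatives) to show that every validation fold keeps the same class ratio; the `toy_` names are illustrative and do not touch the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

toy_y = np.array([1] * 90 + [0] * 10)  # 10% negatives, an illustrative imbalance
toy_X = np.zeros((len(toy_y), 1))      # features are irrelevant to the split itself

toy_skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_negative_counts = [int(np.sum(toy_y[test_idx] == 0))
                        for _, test_idx in toy_skf.split(toy_X, toy_y)]
print(fold_negative_counts)  # [2, 2, 2, 2, 2]
```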

In [None]:
# CV iterations

roc_auc = list()
average_precision = list()
oof = np.zeros(len(X))
best_iteration = list()

for train_idx, test_idx in skf.split(X, y):
    X_train, y_train = X.iloc[train_idx, :], y[train_idx]
    X_test, y_test = X.iloc[test_idx, :], y[test_idx]
    
    train = Pool(data=X_train, 
             label=y_train,            
             feature_names=list(X_train.columns),
             cat_features=categorical_variables)

    test = Pool(data=X_test, 
                label=y_test,
                feature_names=list(X_test.columns),
                cat_features=categorical_variables)

    catb.fit(train,
             verbose_eval=100, 
             early_stopping_rounds=50,
             eval_set=test,
             use_best_model=True,
             #task_type = "GPU",
             plot=False)
    
    best_iteration.append(catb.best_iteration_)
    preds = catb.predict_proba(X_test)
    
    oof[test_idx] = preds[:,1]
    
    roc_auc.append(roc_auc_score(y_true=y_test, y_score=preds[:,1]))
    average_precision.append(average_precision_score(y_true=y_test, y_score=preds[:,1]))

In [None]:
print("Average cv roc auc score %0.3f ± %0.3f" % (np.mean(roc_auc), np.std(roc_auc)))
print("Average cv average precision %0.3f ± %0.3f" % (np.mean(average_precision), np.std(average_precision)))

print("Roc auc score OOF %0.3f" % roc_auc_score(y_true=y, y_score=oof))
print("Average precision OOF %0.3f" % average_precision_score(y_true=y, y_score=oof))
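
The two kinds of score reported above are not the same quantity: the first averages the metric over folds, while the out-of-fold (OOF) score pools all held-out predictions and computes the metric once. A small sketch on synthetic data, with a logistic regression standing in for the real model (both are assumptions made purely for illustration), shows the pattern:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

toy_X, toy_y = make_classification(n_samples=500, n_features=10, random_state=0)
toy_oof = np.zeros(len(toy_y))
fold_scores = []

toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, held_idx in toy_cv.split(toy_X, toy_y):
    clf = LogisticRegression(max_iter=1000).fit(toy_X[fit_idx], toy_y[fit_idx])
    fold_pred = clf.predict_proba(toy_X[held_idx])[:, 1]
    toy_oof[held_idx] = fold_pred  # every row is predicted exactly once
    fold_scores.append(roc_auc_score(toy_y[held_idx], fold_pred))

print("Mean of fold AUCs: %0.3f" % np.mean(fold_scores))
print("AUC on pooled OOF predictions: %0.3f" % roc_auc_score(toy_y, toy_oof))
```

The two numbers are usually close. A large gap can suggest that the scores produced in different folds are not on a comparable scale.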


In [None]:
# Using catboost on all the data for predictions

best_params = {'bagging_temperature': 0.6,
               'border_count': 200,
               'depth': 8,
               'iterations': int(np.median(best_iteration) * 1.3),
               'l2_leaf_reg': 30,
               'learning_rate': 0.30,
               'random_strength': 0.01,
               'scale_pos_weight': 0.48}

catb = CatBoostClassifier(**best_params,
                          loss_function='Logloss',
                          eval_metric = 'AUC',
                          nan_mode='Min',
                          thread_count=2,
                          verbose = False)

# Use the columns of the full training frame rather than X_train, a leftover from the CV loop
train = Pool(data=X, 
             label=y,            
             feature_names=list(X.columns),
             cat_features=categorical_variables)

catb.fit(train,
         verbose_eval=100,
         #task_type = "GPU",
         plot=False)

submission = pd.DataFrame(Xt.id)
Xt_pool = Pool(data=Xt[list(X.columns)],
               feature_names=list(X.columns),
               cat_features=categorical_variables)
submission['Action'] = catb.predict_proba(Xt_pool)[:,1]
submission.to_csv("catboost_submission.csv", index=False)

cat_boost_submission = submission.copy()
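
The iteration count used for the final fit comes from the early-stopping points found during cross-validation, scaled up by 1.3. The arithmetic is sketched below with made-up stopping points; the 1.3 multiplier is the notebook's own rule of thumb (the full dataset is larger than any single CV training split), not a value prescribed by CatBoost.

```python
import numpy as np

toy_best_iterations = [120, 150, 135, 160, 140]  # made-up early-stopping points
final_iterations = int(np.median(toy_best_iterations) * 1.3)
print(final_iterations)  # 182
```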

## Using deep learning

In [None]:
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense, Input, Embedding, Reshape, GlobalAveragePooling1D
from tensorflow.keras.layers import Flatten, concatenate, Concatenate, Lambda, Dropout, SpatialDropout1D
from tensorflow.keras.layers import MaxPooling1D, BatchNormalization, AveragePooling1D, Conv1D
from tensorflow.keras.layers import Activation, LeakyReLU
from tensorflow.keras.optimizers import SGD, Adam, Nadam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.regularizers import l2, l1_l2
# Import the loss from tensorflow.keras too, to avoid mixing it with standalone Keras
from tensorflow.keras.losses import binary_crossentropy

from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score

import matplotlib.pyplot as plt

In [None]:
# Registering custom activations suitable for tabular problems

from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.layers import Activation, LeakyReLU
from deep_learning_for_tabular_data.tabular import gelu, Mish, mish

# Add gelu so we can use it as a string
get_custom_objects().update({'gelu': Activation(gelu)})

# Add mish so we can use it as a string
get_custom_objects().update({'mish': Mish(mish)})

# Add leaky-relu so we can use it as a string
get_custom_objects().update({'leaky-relu': Activation(LeakyReLU(alpha=0.2))})

In [None]:
# Parametric architecture

def tabular_dnn(numeric_variables, categorical_variables, categorical_counts,
                feature_selection_dropout=0.2, categorical_dropout=0.1,
                first_dense = 256, second_dense = 256, dense_dropout = 0.2, 
                activation_type=gelu):
    
    numerical_inputs = Input(shape=(len(numeric_variables),))
    numerical_normalization = BatchNormalization()(numerical_inputs)
    numerical_feature_selection = Dropout(feature_selection_dropout)(numerical_normalization)

    categorical_inputs = []
    categorical_embeddings = []
    for category in  categorical_variables:
        categorical_inputs.append(Input(shape=[1], name=category))
        category_counts = categorical_counts[category]
        categorical_embeddings.append(
            Embedding(category_counts+1, 
                      int(np.log1p(category_counts)+1), 
                      name = category + "_embed")(categorical_inputs[-1]))

    categorical_logits = Concatenate(name = "categorical_conc")([Flatten()(SpatialDropout1D(categorical_dropout)(cat_emb)) 
                                                                 for cat_emb in categorical_embeddings])

    x = concatenate([numerical_feature_selection, categorical_logits])
    x = Dense(first_dense, activation=activation_type)(x)
    x = Dropout(dense_dropout)(x)  
    x = Dense(second_dense, activation=activation_type)(x)
    x = Dropout(dense_dropout)(x)
    output = Dense(1, activation="sigmoid")(x)
    model = Model([numerical_inputs] + categorical_inputs, output)
    
    return model
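
The embedding width used in `tabular_dnn` grows with the logarithm of the number of levels, `int(np.log1p(n) + 1)`. The sketch below evaluates that rule for a few made-up cardinalities and mimics an embedding layer as a plain NumPy lookup table; the table is random here, whereas the real layer learns it during training.

```python
import numpy as np

def embedding_size(n_levels):
    # Same rule as in tabular_dnn: logarithmic growth with cardinality
    return int(np.log1p(n_levels) + 1)

toy_levels = {"small": 1, "medium": 100, "large": 7518}  # made-up cardinalities
toy_sizes = {name: embedding_size(n) for name, n in toy_levels.items()}
print(toy_sizes)  # {'small': 1, 'medium': 5, 'large': 9}

# An embedding layer is a lookup table with one row per level (plus one spare row)
rng = np.random.default_rng(0)
table = rng.normal(size=(toy_levels["medium"] + 1, toy_sizes["medium"]))
codes = np.array([0, 5, 5, 42])  # integer-encoded levels
vectors = table[codes]           # shape (4, 5); equal codes share the same vector
print(vectors.shape)
```

Even a variable with thousands of levels is compressed into a handful of dense columns, whereas one-hot encoding would add one column per level.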

In [None]:
# Useful functions

from tensorflow.keras.metrics import AUC

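def mAP(y_true, y_pred):
    # tf.py_func is not available in TensorFlow 2.x; tf.numpy_function wraps the scikit-learn metric instead
    return tf.numpy_function(average_precision_score, (y_true, y_pred), tf.double)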

def compile_model(model, loss, metrics, optimizer):
    model.compile(loss=loss, metrics=metrics, optimizer=optimizer)
    return model

def plot_keras_history(history, measures):
    """
    history: Keras training history
    measures = list of names of measures
    """
    rows = len(measures) // 2 + len(measures) % 2
    fig, panels = plt.subplots(rows, 2, figsize=(15, 5))
    plt.subplots_adjust(top = 0.99, bottom=0.01, hspace=0.4, wspace=0.2)
    try:
        # Flatten the 2-D grid of axes; with a single row the axes are already flat
        panels = [item for sublist in panels for item in sublist]
    except TypeError:
        pass
    for k, measure in enumerate(measures):
        panel = panels[k]
        panel.set_title(measure + ' history')
        panel.plot(history.epoch, history.history[measure], label="Train "+measure)
        panel.plot(history.epoch, history.history["val_"+measure], label="Validation "+measure)
        panel.set(xlabel='epochs', ylabel=measure)
        panel.legend()
        
    plt.show()

In [None]:
# Global training settings

SEED = 42
FOLDS = 5
BATCH_SIZE = 512

In [None]:
# Defining callbacks

measure_to_monitor = 'val_auc' 
modality = 'max'

early_stopping = EarlyStopping(monitor=measure_to_monitor, 
                               mode=modality, 
                               patience=5, 
                               verbose=0)

model_checkpoint = ModelCheckpoint('best.model', 
                                   monitor=measure_to_monitor, 
                                   mode=modality, 
                                   save_best_only=True, 
                                   verbose=0)

reduce_learning = ReduceLROnPlateau(monitor=measure_to_monitor, mode=modality, 
                                    factor=0.25, patience=2, min_lr=1e-6, verbose=0)

In [None]:
from deep_learning_for_tabular_data.tabular import TabularTransformer, DataGenerator

# Setting the CV strategy
skf = StratifiedKFold(n_splits=FOLDS, 
                      shuffle=True, 
                      random_state=SEED)

# CV Iteration
roc_auc = list()
average_precision = list()
oof = np.zeros(len(X))
best_iteration = list()

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    
    tb = TabularTransformer(numeric = numeric_variables,
                        ordinal = [],
                        lowcat  = [],
                        highcat = categorical_variables)

    tb.fit(X.iloc[train_idx])
    sizes = tb.shape(X.iloc[train_idx])
    categorical_levels = dict(zip(categorical_variables, sizes[1:]))
    print(f"Input array sizes: {sizes}")
    print(f"Categorical levels: {categorical_levels}\n")
    
    model = tabular_dnn(numeric_variables, categorical_variables,
                        categorical_levels, 
                        feature_selection_dropout=0.1,
                        categorical_dropout=0.1,
                        first_dense = 256,
                        second_dense = 256,
                        dense_dropout = 0.1,
                        activation_type=gelu)
    
    model = compile_model(model, binary_crossentropy, [AUC(name='auc'), mAP], Adam(learning_rate=0.0001))
    
    train_batch = DataGenerator(X.iloc[train_idx], 
                                y[train_idx],
                                tabular_transformer=tb,
                                batch_size=BATCH_SIZE,
                                shuffle=True)
    
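    # Note: fit_generator is deprecated in later TensorFlow 2 releases, where model.fit accepts generators directly
    history = model.fit_generator(train_batch,
                                  validation_data=(tb.transform(X.iloc[test_idx]), y[test_idx]),
                                  epochs=30,
                                  callbacks=[model_checkpoint, early_stopping, reduce_learning],
                                  # Keras documents class_weight as a dict mapping class index to weight
                                  class_weight={0: 1.0, 1: (np.sum(y==0) / np.sum(y==1))},
                                  verbose=1)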
    
    print("\nFOLD %i" % fold)
    plot_keras_history(history, measures=['auc', 'loss'])
    
    best_iteration.append(np.argmax(history.history['val_auc']) + 1)
    preds = model.predict(tb.transform(X.iloc[test_idx]),
                          verbose=1,
                          batch_size=1024).flatten()

    oof[test_idx] = preds

    roc_auc.append(roc_auc_score(y_true=y[test_idx], y_score=preds))
    average_precision.append(average_precision_score(y_true=y[test_idx], y_score=preds))
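
The class weights passed to Keras in the loop above down-weight the majority class (label 1) by the ratio of negatives to positives. A sketch with made-up counts shows the effect: after weighting, both classes carry the same total weight in the loss. The weights are written in the dictionary form that Keras documents for `class_weight`.

```python
import numpy as np

toy_y = np.array([1] * 94 + [0] * 6)  # made-up, heavily imbalanced labels

ratio = np.sum(toy_y == 0) / np.sum(toy_y == 1)
toy_class_weight = {0: 1.0, 1: float(ratio)}

# Total weight contributed by each class once the weights are applied
weighted_neg = toy_class_weight[0] * np.sum(toy_y == 0)
weighted_pos = toy_class_weight[1] * np.sum(toy_y == 1)
print(toy_class_weight)
print(weighted_neg, weighted_pos)
```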

In [None]:
print("Average cv roc auc score %0.3f ± %0.3f" % (np.mean(roc_auc), np.std(roc_auc)))
print("Average cv average precision %0.3f ± %0.3f" % (np.mean(average_precision), np.std(average_precision)))

print("Roc auc score OOF %0.3f" % roc_auc_score(y_true=y, y_score=oof))
print("Average precision OOF %0.3f" % average_precision_score(y_true=y, y_score=oof))

In [None]:
# We train on all the examples, using a rule of thumb for the number of iterations

tb = TabularTransformer(numeric = numeric_variables,
                        ordinal = [],
                        lowcat  = [],
                        highcat = categorical_variables)

tb.fit(X)
sizes = tb.shape(X)
categorical_levels = dict(zip(categorical_variables, sizes[1:]))
print(f"Input array sizes: {sizes}")
print(f"Categorical levels: {categorical_levels}\n")

model = tabular_dnn(numeric_variables, categorical_variables,
                    categorical_levels, 
                    feature_selection_dropout=0.1,
                    categorical_dropout=0.1,
                    first_dense = 256,
                    second_dense = 256,
                    dense_dropout = 0.1,
                    activation_type=gelu)
    
model = compile_model(model, binary_crossentropy, [AUC(name='auc'), mAP], Adam(learning_rate=0.0001))    

train_batch = DataGenerator(X, y,
                            tabular_transformer=tb,
                            batch_size=BATCH_SIZE,
                            shuffle=True)

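# Note: fit_generator is deprecated in later TensorFlow 2 releases, where model.fit accepts generators directly
history = model.fit_generator(train_batch,
                              epochs=int(np.median(best_iteration)),
                              # Keras documents class_weight as a dict mapping class index to weight
                              class_weight={0: 1.0, 1: (np.sum(y==0) / np.sum(y==1))},
                              verbose=1)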

In [None]:
# Predicting and submission
preds = model.predict(tb.transform(Xt[X.columns]),
                      verbose=1,
                      batch_size=1024).flatten()

submission = pd.DataFrame(Xt.id)
submission['Action'] = preds
submission.to_csv("tabular_dnn_submission.csv", index=False)

tabular_dnn_submission = submission.copy()

## Blending together the GBM and DNN solutions

In [None]:
from scipy.stats import rankdata

# We use normalized ranks because the probabilities emitted by the two models may be on different scales
dnn_rank = rankdata(tabular_dnn_submission.Action, method='dense') / len(Xt)
cat_rank = rankdata(cat_boost_submission.Action, method='dense') / len(Xt)

submission = pd.DataFrame(Xt.id)
submission['Action'] = 0.5 * dnn_rank + 0.5 * cat_rank 
submission.to_csv("blended_submission.csv", index=False)
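
To see why ranks are blended rather than raw probabilities, consider two made-up score vectors: one squeezed near 1, the other spread over the whole unit interval. Averaging the raw values would let the scale of each model drive the result. Normalized dense ranks put both on the same footing, and since AUC depends only on the ordering, nothing is lost.

```python
import numpy as np
from scipy.stats import rankdata

toy_model_a = np.array([0.91, 0.95, 0.99, 0.93])  # scores squeezed near 1
toy_model_b = np.array([0.10, 0.80, 0.40, 0.20])  # scores spread over [0, 1]

rank_a = rankdata(toy_model_a, method='dense') / len(toy_model_a)
rank_b = rankdata(toy_model_b, method='dense') / len(toy_model_b)
blend = 0.5 * rank_a + 0.5 * rank_b

print(rank_a)  # [0.25 0.75 1.   0.5 ]
print(rank_b)  # [0.25 1.   0.75 0.5 ]
print(blend)   # [0.25  0.875 0.875 0.5  ]
```

Because ranks are unchanged by any strictly increasing transformation of the scores, a poorly calibrated model contributes exactly as much to the blend as a well-calibrated one with the same ordering.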