## Encoding Categorical Features

Welcome to the Categorical Feature Encoding Challenge presented by [Kaggle](https://www.kaggle.com)! A common task in machine learning pipelines is encoding categorical variables into a form digestible for algorithms without losing valuable information. We are tasked with taking a dataset **solely comprised of categorical variables** and creating appropriate encoding schemes. I won't be doing any feature engineering - just encoding.

These categorical variables include:
* **Binary** features
* **Nominal** features (of varying cardinality)
* **Ordinal** features (of varying cardinality)
* **Cyclical** features

#### Table of Contents
1. Exploration of the dataset
2. Handling of binary features
3. Handling of ordinal features
4. Handling of nominal features
5. Handling of cyclical features

**If this notebook was of help to you, upvotes are very much appreciated - they're what keep me going.**

Let's dive in!

# Data Exploration

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("bright")
%matplotlib inline

print("Data & File Sizes")
data_dir = '../input/cat-in-the-dat/'
for f in os.listdir(data_dir):
    if 'zip' not in f:
        print(f.ljust(30) + str(round(os.path.getsize(f'{data_dir}{f}') / 1000000, 2)) + 'MB')

We're given 3 files: one training set, one test set and one sample submission. They're all quite small, weighing in at less than 50MB a pop.

Let's check out the training data.

### Training set

In [None]:
df_train = pd.read_csv(f'{data_dir}train.csv')
pd.set_option('display.max_columns', 200) # show all columns
df_train.head()

Okay, so we have some pretty clear groupings of features here:
* **id:** a simple row id
* **bin_{0,4}:** binary features
* **nom_{0,9}:** nominal features. Assumption - {0,4} have low cardinality, {5,9} have high cardinality
* **ord_{0,5}:** ordinal features. Assumption - {0,3?} have low cardinality, {4?,5} have high cardinality
* **day, month:** cyclical features
* **target:** the **label** we are trying to predict. Looks to be binary.

Let's make sure the test data follows the same form.

### Test Data

In [None]:
df_test = pd.read_csv(f'{data_dir}test.csv')
df_test.head(1)

Obvious difference: no target variable in the test set. 

Now, let's look at the distribution of target variables in the training set:

In [None]:
sns.countplot(x=df_train['target']).set_title('train')
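
The plot gives the counts at a glance; for exact proportions, `value_counts(normalize=True)` does the job. A minimal sketch on a made-up series (the `target` values below are a hypothetical stand-in for `df_train['target']`):

```python
import pandas as pd

# hypothetical stand-in for df_train['target']
target = pd.Series([0, 0, 1, 0, 1, 0], name='target')

# share of each class, largest first
class_share = target.value_counts(normalize=True)
print(class_share)
```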

The 0 label occurs roughly twice as often as the 1 label. Not a ton to glean from this at the moment, so let's move on to the meat of this notebook: **the encoding**. 

First task will be bucketing the different groups of labels by type (binary, ordinal and so on):

In [None]:
bin_columns = ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']
nom_columns = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']
ord_columns = ['ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5']
cyc_columns = ['day', 'month']

Next, we're going to need to understand which nominal & ordinal features to section off as low vs. high cardinality. 

### Cardinality
What is cardinality? Mathematically it's defined as *the number of elements in a set or other grouping, as a property of that grouping*. Basically, if a column has many unique values, we could define it as having high-cardinality; low-cardinality would be the opposite.

**For Example:**

Low-cardinality `breed-of-cats` column:

<img src="https://i.imgur.com/xCgDQq4.jpg" width="400">

*There are many cats here but all of the same breed.*

High-cardinality `breed-of-cats` column: 

<img src="http://duckboss.com/wp-content/uploads/2016/02/cats1.png" width="400">

*There are many cats of differing breeds.*

We can separate the features by cardinality via looking at unique counts:

In [None]:
df_train[nom_columns+ord_columns].nunique()

There's a pretty clear cardinality separation between each. 
* `nom_{0,4}`=low-card
* `nom_{5,9}`=high-card
* `ord_{0,4}`=low-card
* `ord_5`=high-card. 

Let's make those splits:

In [None]:
lc_nom_columns = nom_columns[0:5]
hc_nom_columns = nom_columns[5:10]
lc_ord_columns = ord_columns[0:5]
hc_ord_columns = ord_columns[5:6]
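
The slices above are picked by hand from the `nunique()` output. As a rough alternative, the same split could be derived from a cutoff. This is only a sketch on a toy frame; the column names and `CARDINALITY_THRESHOLD` are made up for illustration and would need tuning for the real data:

```python
import pandas as pd

# toy frame standing in for df_train[nom_columns + ord_columns]
toy = pd.DataFrame({
    'nom_a': ['x', 'y', 'x', 'y', 'x', 'y'],
    'nom_b': ['a', 'b', 'c', 'd', 'e', 'f'],
    'ord_a': [1, 2, 3, 1, 2, 3],
})

CARDINALITY_THRESHOLD = 4  # hypothetical cutoff, tune per dataset

# number of distinct values per column
counts = toy.nunique()
low_card = counts[counts <= CARDINALITY_THRESHOLD].index.tolist()
high_card = counts[counts > CARDINALITY_THRESHOLD].index.tolist()
print(low_card, high_card)
```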

Let's get into it now.

# Binary Features
Binary features are features that contain either a 1 or a 0. For example, if we had a column `is-tabby-cat` and the following two images:
<img src="https://www.thehappycatsite.com/wp-content/uploads/2017/06/tabby-kitten.jpg" width="300"><img src="https://www.dogster.com/wp-content/uploads/2015/05/husky-puppies-01.jpg" width="300">

We would probably expect a 1 for the first image and a 0 for the second.

Before doing any encoding, let's see if we can glean some relationship between each feature's binary distribution vs. the distribution of target labels:

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(16,8))
fig.suptitle('Binary Distribution vs. Distribution of Target Variable')

for ax, name in zip(ax.flatten(), list(df_train[bin_columns].columns)):
    sns.countplot(x=df_train[name], ax=ax, hue=df_train['target'], saturation=1)

Nothing here really sticks out to me that's worth doing anything about.

Now we'll get down to **Encoding**. What our algorithms are looking for are numerical values. Out of these binary features the only ones that need to be transformed are `bin_3` and `bin_4`. To do this, we'll use `sklearn`'s `LabelEncoder`, which transforms categories into numbers.

In [None]:
from sklearn.preprocessing import LabelEncoder

for col in ['bin_3', 'bin_4']:
    le = LabelEncoder()
    # learn the label -> integer mapping on train, then reuse it on test
    df_train[col] = le.fit_transform(df_train[col])
    df_test[col] = le.transform(df_test[col])

df_train[bin_columns].head()
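
To see which label ended up as which number, `LabelEncoder` exposes `classes_`, which lists the original labels in the order of their integer codes. A small sketch on a toy column (the `'T'`/`'F'` values are an assumption for illustration, not read from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy column; the 'T'/'F' values are an assumption for illustration
toy = pd.Series(['T', 'F', 'F', 'T'])

le = LabelEncoder()
encoded = le.fit_transform(toy)

# classes_ lists the original labels in the order of their integer codes
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# an explicit dictionary gives the same result without relying on sort order
explicit = toy.map({'F': 0, 'T': 1})
```

The explicit dictionary is a bit more typing, but it makes the direction of the encoding obvious to anyone reading the notebook.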

Boom! Done. We can move on.

# Ordinal Features

Ordinal variables inherently possess a natural order. For example, if we had a `cat-age-group` column and the following image served as our data:
![cats growing](https://www.catcare4life.org/app/uploads/2018/03/lifestages.jpg)
We would expect these to be labeled left to right as something like ['kitten', ... ,'mature']

### Low-cardinality Ordinal Features

For the five low-cardinality features that we have (`ord_{0,4}`), we can assume that they follow some form like:
* `ord_0`: **Already correct!** Values range from 1 to 3
* `ord_1`: Novice < ... < Grandmaster
* `ord_2`: Cold < ... < Hot
* `ord_3`: a < ... < o (or is it o < ... < a?)
* `ord_4`: A < ... < Z (or see above)

Before encoding these via intuition, let's compare each feature's values against their distribution of labels:

In [None]:
fig, ax = plt.subplots(5, 1, figsize=(18,10))
fig.suptitle('Distribution of Target Variable Ratio = 1 \n (lowest → highest ratio)')

ordinal_ordering = {}

for ax, name in zip(ax.flatten(), list(df_train[lc_ord_columns].columns)):
    # calculate the ratio of target counts
    ct = pd.crosstab(df_train[name], df_train['target']).apply(lambda r: r/r.sum(), axis = 1)
    # stack the cross-tabulated df into long format (one row per value/target pair)
    stacked = ct.stack().reset_index().rename(columns = {0: 'ratio'}) 
    # sort by target ratio
    stacked = stacked.sort_values(['target', 'ratio'], ascending = [False, True]) 
    sns.barplot(x = stacked[name], y = stacked['ratio'], ax = ax, hue = stacked['target'])
    
    # create mapping for encoding
    ordinal_ordering[name] = stacked[name].unique()

Sweet! This follows our intuition exactly: sorting each feature's values by target ratio reproduces the logical ordering we guessed above, and the ratio rises roughly linearly along it. We can reliably move forward and encode that ordering.

*Note:*
* *Our training set follows our assumption. This may not be true for the validation/test set.*
* *Making assumptions based on the target ratio could very well lead to overfitting.*

If you have a better way of doing this encoding, please leave a comment!
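
One alternative that never looks at the target is to spell the order out by hand and let an ordered `pd.Categorical` assign the codes. A minimal sketch: the full `rank_order` list below is my assumption for illustration (only its endpoints appear above), so check it against the actual values before using it.

```python
import pandas as pd

# assumed full ordering for illustration; only the endpoints appear in the text above
rank_order = ['Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster']

toy = pd.Series(['Expert', 'Novice', 'Grandmaster', 'Novice'])

# an ordered categorical assigns codes by position in rank_order
as_ordered = pd.Categorical(toy, categories=rank_order, ordered=True)
codes = pd.Series(as_ordered.codes, index=toy.index)
print(codes.tolist())
```

Any value missing from `rank_order` gets the code `-1`, which makes gaps in a hand-written ordering easy to spot.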

In [None]:
# show the order of encoding for each ordinal column
ordinal_ordering

In [None]:
# loop through low-cardinality ordinal columns and encode them
for col in lc_ord_columns:
    # map each value to its rank in the ordering derived above
    mapping = {value: rank for rank, value in enumerate(ordinal_ordering[col])}
    df_train[col] = df_train[col].map(mapping)
    # values that never occur in train would become NaN here
    df_test[col] = df_test[col].map(mapping)

# df_train[lc_ord_columns].nunique() # quality control - should match the size of each mapping

df_train[lc_ord_columns].head()

We're good! Onwards

### High-cardinality Ordinal Features

I opted to split these out separately in case any additional steps (vs. the ones we took on the low-cardinality ord features) would be needed. The only difference here is that this column contains many more unique values. 

Before we make any decisions, let's get another look at this column's data:

In [None]:
ord_5_num_unique = len(df_train['ord_5'].unique().tolist())
print(f'unique values in ord_5: {ord_5_num_unique}')

sample = list(df_train['ord_5'].sample(10))
print(f'ex of ord_5 values: {sample}')

str_lengths = df_train['ord_5'].str.len().nunique()
print(f'different string lengths in ord_5: {str_lengths}')

So we have 192 unique values - each a 2 character string with varying positions of capital characters. 

Let's plot this column using the same ratio technique above and see if we can uncover some underlying form:

In [None]:
fig, ax = plt.subplots(figsize=(16,6))

ordinal_ordering = {}


fig.suptitle('Distribution Target Variable ratio \n (lowest → highest ratio)')

# calculate the ratio of target counts
ct = pd.crosstab(df_train['ord_5'], df_train['target']).apply(lambda r: r/r.sum(), axis = 1)
stacked = ct.stack().reset_index().rename(columns = {0: 'ratio'})
stacked = stacked.sort_values(['target', 'ratio'], ascending = [False, True])

ordinal_ordering['ord_5'] = stacked['ord_5'].unique() # for encoding

sns.barplot(x = stacked['ord_5'], y = stacked['ratio'], hue = stacked['target'])

# show less x-ticks
every_nth = 10
for n, label in enumerate(ax.xaxis.get_ticklabels()):
    if n % every_nth != 0:
        label.set_visible(False)

Nice! Looks like there's a linear relationship here between alphabetical position and target ratio. We can move forward with encoding these the same way we did above:

In [None]:
# map each ord_5 value to its rank in the ordering derived above
mapping = {value: rank for rank, value in enumerate(ordinal_ordering['ord_5'])}

df_train['ord_5'] = df_train['ord_5'].map(mapping)
# values that never occur in train would become NaN here
df_test['ord_5'] = df_test['ord_5'].map(mapping)
    
df_train['ord_5'].head()

Another technique I tried to no avail was converting the `ord_5` strings to their Unicode code point values. I then plotted their target ratio like I did above, but the linear relationship was broken, so I scrapped it. 

The code for applying that transformation to a column is as follows:
```
def sum_string(string): 
    return sum(ord(char) for char in string)
    
# apply to the raw ord_5 strings, i.e. before the rank encoding above
df_train['ord_5_ascii_val'] = df_train['ord_5'].apply(sum_string)
```

# Nominal Features

Nominal features differ from ordinal features in one way: **they hold no intrinsic order**. For example, if we had a feature `breed` and the following image was split in half:
![two cats](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTbErhwsYXKnUcdlWa-mBKfLL_IO9bX9Ay1N-2-3ZNUmFP0ud45)

...this nominal feature would be scored as `Siamese` for the left cat and `Tabby` for the right (he atacc).

### Low-cardinality Nominal Features

The only thing we need to do to these is one hot encode them. 

One hot encoding works by converting every category in a column into its own binary column, like this:

![OHE](https://i.imgur.com/mtimFxh.png)

As for our data: `nom_0` has 3 unique values, `nom_{1,3}` each have 6 unique values and `nom_4` has 4 unique values. 

One hot encoding the lot of these will grow our DataFrame by 3 + 6 + 6 + 6 + 4 columns, i.e., 25 new columns. **This process would be unwieldy for high-cardinality nominal variables as we would begin knocking on the door of memory issues.** Why? Because every unique value gets its own column, so the column count grows with the cardinality.

Back to the work. We can simply use `pandas`' built-in `get_dummies` function to achieve this one hot encoding:

In [None]:
# one hot encode low-cardinality nominal variables
df_train = pd.get_dummies(df_train, columns = lc_nom_columns)
df_test = pd.get_dummies(df_test, columns = lc_nom_columns)
df_train.filter(regex='nom_[0-4]_').head()
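
One thing to watch when calling `get_dummies` on train and test separately: if a category shows up in only one of them, the two frames end up with different columns. A hedged sketch of one way to line them up, using a toy `color` column that is not part of the dataset:

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'color': ['red', 'red', 'purple']})  # 'purple' never appears in train

train_ohe = pd.get_dummies(train, columns=['color'])
test_ohe = pd.get_dummies(test, columns=['color'])

# keep exactly the training columns: missing ones are filled with 0, extra ones are dropped
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(test_ohe)
```

Dropping `color_purple` means unseen categories become an all-zero row, which is a design choice rather than the only option.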

Done!

### High-cardinality Nominal Variables

Okay, now we up the level of complication. I don't particularly have much experience here but I know there are a few options for techniques. Some I'll implement on my own, some I'll refer to the community for (if I forget to attribute the source, please notify me in the comments). 
1. Frequency encoding: encode the frequency of values in the column
2. Feature hashing: create a hash of string values
3. Mean encoding (also target, likelihood, impact encoding): each distinct value of a categorical variable is replaced with the average value of the target variable we're trying to predict. All code for this is attributed to @dustinthewind's notebook [Making sense of mean encoding](https://www.kaggle.com/dustinthewind/making-sense-of-mean-encoding). Two techniques are used here:
    1. Additive smoothing - reduce overfitting by relying on the global average of the target variable.
    2. Cross-validation - introduce variability into encoding estimates by averaging mean over folds.

In [None]:
def freq_encoding(df, cols):
    for col in cols:
        # relative frequency of each category within df
        frequencies = df.groupby(col).size() / len(df)
        # look up the frequency of each row's category
        df[f'{col}_freq'] = df[col].map(frequencies)
    return df

In [None]:
df_train = freq_encoding(df_train, hc_nom_columns)
df_test = freq_encoding(df_test, hc_nom_columns)
df_train.filter(regex='nom_[5-9]_freq').head()
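
The cell above computes frequencies for train and test independently. A variant worth considering is to learn the frequencies on train only and reuse them on test. A sketch on toy data (the `city` column and the fallback of 0 for unseen values are assumptions for illustration):

```python
import pandas as pd

train = pd.DataFrame({'city': ['a', 'a', 'b', 'c']})
test = pd.DataFrame({'city': ['a', 'c', 'z']})  # 'z' is unseen in train

# relative frequency of each category in the training data
freq = train['city'].value_counts(normalize=True)

train['city_freq'] = train['city'].map(freq)
# unseen categories have no training frequency, so fall back to 0
test['city_freq'] = test['city'].map(freq).fillna(0)
print(test)
```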

In [None]:
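import hashlib

def feature_hashing(df, cols, n_buckets=5000):
    for col in cols:
        # built-in hash() is salted per process for strings, so use a stable digest
        df[f'{col}_hashed'] = df[col].apply(
            lambda x: int(hashlib.md5(str(x).encode('utf-8')).hexdigest(), 16) % n_buckets)
    return df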

In [None]:
df_train = feature_hashing(df_train, hc_nom_columns)
df_test = feature_hashing(df_test, hc_nom_columns)
df_train.filter(regex='nom_[5-9]_hashed').head()
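
For comparison, `sklearn` ships a `FeatureHasher` that implements the hashing trick and returns a sparse matrix of bucket columns rather than a single integer column. A minimal sketch, with made-up category strings and an arbitrarily small `n_features` so the output stays readable:

```python
from sklearn.feature_extraction import FeatureHasher

values = ['a1b2c3', 'ffee00', 'a1b2c3']  # made-up category strings

# n_features=8 is an arbitrary small bucket count chosen for readability
hasher = FeatureHasher(n_features=8, input_type='string')

# each sample must be an iterable of strings, hence the single-item lists
hashed = hasher.transform([[v] for v in values]).toarray()
print(hashed)
```

Identical strings always land in the same bucket, and with so few buckets different strings may collide, which is the usual trade-off of hashing.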

In [None]:
def encode_target_smooth(data, target, categ_variables, smooth):
    """    
    Apply target encoding with smoothing.
    
    Parameters
    ----------
    data: pd.DataFrame
    target: str, dependent variable
    categ_variables: list of str, variables to encode
    smooth: int, number of observations to weigh global average with
    
    Returns
    --------
    encoded_dataset: pd.DataFrame
    code_map: dict, mapping to be used on validation/test datasets 
    default_map: dict, mapping to replace previously unseen values with
    """
    train_target = data.copy()
    code_map = dict()    # stores mapping between original and encoded values
    default_map = dict() # stores global average of each variable
    
    for col in categ_variables:
        prior = data[target].mean()
        n = data.groupby(col).size()
        mu = data.groupby(col)[target].mean()
        # weighted average of the category mean and the global mean (prior)
        mu_smoothed = (n * mu + smooth * prior) / (n + smooth)
        
        train_target.loc[:, col] = train_target[col].map(mu_smoothed)
        code_map[col] = mu_smoothed
        default_map[col] = prior
    return train_target, code_map, default_map

In [None]:
# additive smoothing
train_target_smooth, target_map, default_map = encode_target_smooth(df_train, 'target', hc_nom_columns, 500)
for col in hc_nom_columns:
    # reuse the mapping learned on train and add the result as a new column
    df_train[f'{col}_mean_enc'] = df_train[col].map(target_map[col])
    # categories unseen in train fall back to the global target mean
    df_test[f'{col}_mean_enc'] = df_test[col].map(target_map[col]).fillna(default_map[col])
    
df_train.filter(regex='nom_[5-9]_mean_enc').head()
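
To make the smoothing concrete, here is a tiny worked example. The smoothed value is a weighted average of the category mean and the global mean, `(n * mu + m * prior) / (n + m)`, so rare categories get pulled towards the global mean. The toy frame and the weight `m = 2` are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'cat':    ['a', 'a', 'a', 'a', 'b'],
    'target': [1,   1,   1,   0,   1],
})

m = 2                                   # smoothing weight (hypothetical value)
prior = toy['target'].mean()            # global target mean
stats = toy.groupby('cat')['target'].agg(['count', 'mean'])

# weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
print(smoothed)
```

Category `b` has a single row with target 1, so its raw mean is 1.0; after smoothing it sits much closer to the global mean than category `a` moves from its raw mean of 0.75.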

In [None]:
def impact_coding_leak(data, feature, target, n_folds=20, n_inner_folds=10):
    '''
    ! Using oof_default_mean for encoding inner folds introduces leak.
    
    Source: https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features
    
    Changelog:    
    a) Replaced KFold with StratifiedKFold due to class imbalance
    b) Rewrote .apply() with .map() for readability
    c) Removed redundant apply in the inner loop
    '''
    from sklearn.model_selection import StratifiedKFold
    impact_coded = pd.Series(dtype=float)
    
    oof_default_mean = data[target].mean() # Global mean to use by default (you could further tune this)
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True) # KFold in the original
    oof_mean_cv = pd.DataFrame()
    split = 0
    for infold, oof in kf.split(data[feature], data[target]):

        kf_inner = StratifiedKFold(n_splits=n_inner_folds, shuffle=True)
        inner_split = 0
        inner_oof_mean_cv = pd.DataFrame()
        oof_default_inner_mean = data.iloc[infold][target].mean()
        
        for infold_inner, oof_inner in kf_inner.split(data.iloc[infold], data.iloc[infold][target]):
            # infold_inner holds positions within data.iloc[infold], so index through infold
            # The mean to apply to the inner oof split (a 1/n_folds % based on the rest)
            oof_mean = data.iloc[infold[infold_inner]].groupby(by=feature)[target].mean()

            # Also populate mapping (this has all group -> mean for all inner CV folds)
            inner_oof_mean_cv = inner_oof_mean_cv.join(pd.DataFrame(oof_mean), rsuffix=inner_split, how='outer')
            inner_oof_mean_cv.fillna(value=oof_default_inner_mean, inplace=True)
            inner_split += 1

        # compute mean for each value of categorical value across oof iterations
        inner_oof_mean_cv_map = inner_oof_mean_cv.mean(axis=1)

        # Also populate mapping
        oof_mean_cv = oof_mean_cv.join(pd.DataFrame(inner_oof_mean_cv), rsuffix=split, how='outer')
        oof_mean_cv.fillna(value=oof_default_mean, inplace=True)
        split += 1

        feature_mean = data.loc[oof, feature].map(inner_oof_mean_cv_map).fillna(oof_default_mean)
        impact_coded = pd.concat([impact_coded, feature_mean]) # Series.append was removed in pandas 2.0
            
    return impact_coded, oof_mean_cv.mean(axis=1), oof_default_mean

def impact_coding(data, feature, target, n_folds=20, n_inner_folds=10):
    '''
    ! Using oof_default_mean for encoding inner folds introduces leak.
    
    Source: https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features
    
    Changelog:    
    a) Replaced KFold with StratifiedKFold due to class imbalance
    b) Rewrote .apply() with .map() for readability
    c) Removed redundant apply in the inner loop
    d) Removed global average; use local mean to fill NaN values in out-of-fold set
    '''
    from sklearn.model_selection import StratifiedKFold
    impact_coded = pd.Series(dtype=float)
        
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True) # KFold in the original
    oof_mean_cv = pd.DataFrame()
    split = 0
    for infold, oof in kf.split(data[feature], data[target]):

        kf_inner = StratifiedKFold(n_splits=n_inner_folds, shuffle=True)
        inner_split = 0
        inner_oof_mean_cv = pd.DataFrame()
        oof_default_inner_mean = data.iloc[infold][target].mean()
        
        for infold_inner, oof_inner in kf_inner.split(data.iloc[infold], data.iloc[infold][target]):
            # infold_inner holds positions within data.iloc[infold], so index through infold
            # The mean to apply to the inner oof split (a 1/n_folds % based on the rest)
            oof_mean = data.iloc[infold[infold_inner]].groupby(by=feature)[target].mean()
            
            # Also populate mapping (this has all group -> mean for all inner CV folds)
            inner_oof_mean_cv = inner_oof_mean_cv.join(pd.DataFrame(oof_mean), rsuffix=inner_split, how='outer')
            inner_oof_mean_cv.fillna(value=oof_default_inner_mean, inplace=True)
            inner_split += 1

        # compute mean for each value of categorical value across oof iterations
        inner_oof_mean_cv_map = inner_oof_mean_cv.mean(axis=1)

        # Also populate mapping
        oof_mean_cv = oof_mean_cv.join(pd.DataFrame(inner_oof_mean_cv), rsuffix=split, how='outer')
        oof_mean_cv.fillna(value=oof_default_inner_mean, inplace=True) # <- local mean as default
        split += 1

        feature_mean = data.loc[oof, feature].map(inner_oof_mean_cv_map).fillna(oof_default_inner_mean)
        impact_coded = pd.concat([impact_coded, feature_mean]) # Series.append was removed in pandas 2.0
    
    oof_default_mean = data[target].mean() # Global mean to use by default (you could further tune this)
    return impact_coded, oof_mean_cv.mean(axis=1), oof_default_mean

def encode_target_cv(data, target, categ_variables, impact_coder=impact_coding):
    """Apply original function for each <categ_variables> in  <data>
    Reduced number of validation folds
    """
    train_target = data.copy() 
    
    code_map = dict()
    default_map = dict()
    for f in categ_variables:
        print(f'cv impact encoding {f}')
        train_target.loc[:, f], code_map[f], default_map[f] = impact_coder(train_target, f, target)
        
    return train_target, code_map, default_map


In [None]:
train_target_cv, code_map, default_map = encode_target_cv(df_train[hc_nom_columns+['target']], 
                                                          'target', hc_nom_columns, 
                                                          impact_coder=impact_coding)

train_target_cv = train_target_cv.drop('target', axis=1)

In [None]:
# tag every cross-validated column with a suffix so it doesn't clash with the originals
train_target_cv = train_target_cv.add_suffix('_cvmean_enc')
train_target_cv.head()

In [None]:
df_train = pd.concat([df_train, train_target_cv], axis=1)
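
The nested-fold functions above are a lot to take in. The core idea, stripped down, is that each row gets encoded with target means computed on the *other* folds only. Here's a simplified single-level sketch of that idea; it is not the attributed implementation, and `oof_mean_encode` plus the toy data are hypothetical names made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_mean_encode(df, col, target, n_splits=5, seed=0):
    """Encode `col` with target means computed only on the other folds."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        fold_means = fit_part.groupby(col)[target].mean()
        # categories missing from the fitting folds fall back to their mean target
        fallback = fit_part[target].mean()
        encoded.iloc[enc_idx] = (
            df[col].iloc[enc_idx].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded

toy = pd.DataFrame({
    'cat':    list('aabbccaabb'),
    'target': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})
toy['cat_oof'] = oof_mean_encode(toy, 'cat', 'target')
print(toy)
```

Because no row ever contributes to its own encoding, the encoded column leaks less of the label than a plain group mean would.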

Okay, heaviest stuff is done.
<img src="https://www.vtfoodbank.org/wp-content/uploads/2016/04/cat-on-computer.jpg" width="400">

# Cyclical Features

What do we mean by cyclical features? Well, we have some in our dataset in the form of **time**. Days of the week, hour of the day, etc - they all follow cycles. 

<img src="https://i.imgur.com/ZctoWQ4.png" width="400">

Cyclical features aren't only in the form of time though - "Ecological features like tide, astrological features like position in orbit, spatial features like rotation or longitude, visual features like color wheels are all naturally cyclical." - quoted from Ian London's [blog post](https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/) which also has a great talk through of the techniques we're going to use here.

The problem with raw cyclical data is that it doesn't explicitly show how the end of its cycle connects back to the start. For example, if we were to plot the unique values of our `day` column:

In [None]:
day_values = sorted(df_train['day'].unique().tolist())
print(f'day values: {day_values}')
plt.plot(day_values)

The issue here? Look at the y-axis. If we call 1 Monday and 7 Sunday, it's clear to see that their cyclical relationship dissolves. As it stands, these points are **6 days apart**...which is true...except the day after Sunday is Monday **so they're actually also 1 day apart**. 

**The solution:**

We create two new features - one a sine transformation and one a cosine transformation. I'll illustrate their combined power after encoding them.

In [None]:
def sin_cos_encode(df, cols):
    for col in cols:
        col_max_val = df[col].max() # period of the cycle, assuming values run from 1 to max
        df[f'{col}_sin'] = np.sin(2*np.pi * df[col]/ col_max_val) # sin transform
        df[f'{col}_cos'] = np.cos(2*np.pi * df[col]/ col_max_val) # cos transform
    return df

In [None]:
df_train = sin_cos_encode(df_train, cyc_columns)
df_train.filter(regex='_(sin|cos)').head()

Let's see what that actually did:

In [None]:
sample = df_train[['month_sin', 'month_cos']].sample(100)
sample.plot.scatter('month_sin', 'month_cos').set_aspect('equal')

Ta-da! These features can now be fed into algorithms and the cyclical nature will be maintained.
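
A quick numeric check of that claim, keeping the convention from earlier of calling 1 Monday and 7 Sunday: after the sine/cosine transform, the distance from Sunday to Monday should match the distance from Monday to Tuesday. The helper `to_circle` is a made-up name for this sketch:

```python
import numpy as np

def to_circle(value, period):
    """Map a 1-based cyclical value onto the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

period = 7  # days in a week
sunday, monday, tuesday = (to_circle(d, period) for d in (7, 1, 2))

wrap_dist = np.linalg.norm(sunday - monday)   # across the week boundary
step_dist = np.linalg.norm(monday - tuesday)  # ordinary one-day step
print(wrap_dist, step_dist)
```

Both distances come out the same, whereas on the raw scale Sunday and Monday are 6 apart and Monday and Tuesday only 1.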

## Modeling
<img src="https://merriam-webster.com/assets/mw/images/gallery/gal-wap-slideshow-slide/cat-using-abacus-for-arithmetic-4133-38c20f1d3412a2ecf4e756e82c1bc11e@1x.jpg" width="400">

Now that we've done a load of encoding, it's time to see how these new variables perform. I'll train an xgboost classifier and plot feature importances.

*Note - I've created this notebook as a resource and exercise in encoding, so I won't be seeing this section through to perfection.*

In [None]:
# drop the target and the raw (hexadecimal string) high-cardinality nominal columns
X_train = df_train.drop(columns=['target', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9'])
y_train = df_train['target']

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=31429)

In [None]:
import xgboost as xgb

# set parameters for xgboost
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'eta': 0.02,
    'max_depth': 4
}

d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)

watch_list = [(d_train, 'train'), (d_valid, 'valid')]

model = xgb.train(params, d_train, 400, watch_list, early_stopping_rounds=50, verbose_eval=25)

In [None]:
#model.get_score(importance_type='gain')
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()

What do we see here?
* **The bad**
    * Hash encoded nominal features performed poorly. 
    * Frequency encoded nominal features didn't fare much better.
    * Mean encoded nominal features weren't killer either.
* **The good**: 
    * Cross-validated impact encoded nominal features were sort of important!
    * Some of the simple one-hot encoded features turned out to be important.