# Categorical Encodings

In this segment we cover the various types of encodings used to process columns with categorical values. We have already seen applications of `LabelEncoder` and `OneHotEncoder`. The encodings we are going to use are:

- Label Encoding
- One Hot Encoding
- Count Encoding
- Target Encoding and variations
- Singular value decomposition

These methods will again be applied to the Kickstarter project data, and the model will predict whether a Kickstarter campaign succeeds. We will compare the effect of each technique on the validation AUC against the baseline model, which uses Label Encoding with minimal hyperparameter tuning.

This segment uses the `category_encoders` package, which can be installed with conda:

```bash
$ conda install -c conda-forge category_encoders
```

## Baseline Model

In [28]:
import os
print(f"Current working directory: {os.getcwd()}")

Current working directory: /home/raxit/kaggle


In [29]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Read data
ks = pd.read_csv('./dataset/kickstarter_project/ks-projects-201801.csv',
                 parse_dates = ['launched', 'deadline'])

# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column, "successful" = 1, others are 0
ks = ks.assign(outcome=(ks['state'] == "successful").astype(int))

# Timestamp features
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)

# LABEL ENCODING
# categorical features to consider
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)

# features to use with the model
data_cols = ['goal','hour', 'day', 'month', 'year', 'outcome']
baseline_data = ks[data_cols].join(encoded)

In [30]:
# Defining functions to help evaluate/test our encoded data

import lightgbm as lgb
from sklearn import metrics

def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe)*valid_fraction)
    # Split the dataframe
    train = dataframe[: -2*valid_size]
    valid = dataframe[-2*valid_size : -valid_size]
    test  = dataframe[-valid_size :]
    
    return train, valid, test

def train_model(train, valid):
    feature_cols = train.columns.drop('outcome')
    dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])
    param = {'num_leaves': 64, 'objective': 'binary',
             'metric': 'auc', 'seed': 7}
    print("Training model!")
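    # Early stopping and silent logging are passed as callbacks, because
    # LightGBM 4.0+ no longer accepts early_stopping_rounds / verbose_eval here
    bst = lgb.train(param, dtrain, num_boost_round=1000,
                    valid_sets=[dvalid],
                    callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False),
                               lgb.log_evaluation(period=0)])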
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['outcome'], valid_pred)
    print(f"Validation AUC scores: {valid_score:.4f}")
    return bst

In [31]:
train, valid, _ = get_data_splits(baseline_data)
bst = train_model(train, valid)

Training model!
Validation AUC scores: 0.7467


In [32]:
baseline_data.head()

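|   | goal | hour | day | month | year | outcome | category | currency | country |
|---|------|------|-----|-------|------|---------|----------|----------|---------|
| 0 | 1000.0 | 12 | 11 | 8 | 2015 | 0 | 108 | 5 | 9 |
| 1 | 30000.0 | 4 | 2 | 9 | 2017 | 0 | 93 | 13 | 22 |
| 2 | 45000.0 | 0 | 12 | 1 | 2013 | 0 | 93 | 13 | 22 |
| 3 | 5000.0 | 3 | 17 | 3 | 2012 | 0 | 90 | 13 | 22 |
| 4 | 19500.0 | 8 | 4 | 7 | 2015 | 0 | 55 | 13 | 22 |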


## Count Encoding

Count encoding replaces each value with the number of times it appears in the dataset.

We'll use the `category_encoders` package to get this encoding. The encoder itself is available as `CountEncoder`. This encoder and the other encoders in the package work like scikit-learn transformers, with `.fit` and `.transform` methods.
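To make the idea concrete, here is a minimal sketch of count encoding done by hand with pandas. The `toy` frame and its values are made up for illustration and are not taken from the Kickstarter data; `CountEncoder` may differ in details such as how it treats missing or unseen values.

```python
import pandas as pd

# Toy column, invented for illustration only
toy = pd.DataFrame({"country": ["US", "GB", "US", "CA", "US", "GB"]})

# Count how often each value occurs in the column
counts = toy["country"].value_counts()

# Replace each value with its count by mapping the counts back onto the rows
toy["country_count"] = toy["country"].map(counts)

print(toy)
```

Rare values end up with small counts and common values with large ones, so a tree-based model can split on how frequent a category is.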

In [33]:
import category_encoders as ce
cat_features = ['category', 'currency', 'country']
count_enc = ce.CountEncoder()
count_encoded = count_enc.fit_transform(ks[cat_features])

data = baseline_data.join(count_encoded.add_suffix("_count"))

In [35]:
data.head()

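|   | goal | hour | day | month | year | outcome | category | currency | country | category_count | currency_count | country_count |
|---|------|------|-----|-------|------|---------|----------|----------|---------|----------------|----------------|---------------|
| 0 | 1000.0 | 12 | 11 | 8 | 2015 | 0 | 108 | 5 | 9 | 1362 | 33853 | 33393 |
| 1 | 30000.0 | 4 | 2 | 9 | 2017 | 0 | 93 | 13 | 22 | 5174 | 293624 | 290887 |
| 2 | 45000.0 | 0 | 12 | 1 | 2013 | 0 | 93 | 13 | 22 | 5174 | 293624 | 290887 |
| 3 | 5000.0 | 3 | 17 | 3 | 2012 | 0 | 90 | 13 | 22 | 15647 | 293624 | 290887 |
| 4 | 19500.0 | 8 | 4 | 7 | 2015 | 0 | 55 | 13 | 22 | 10054 | 293624 | 290887 |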


In [42]:
# Train a model on the baseline data plus the count-encoded features
train, valid, test = get_data_splits(data)
bst = train_model(train, valid)

Training model!
Validation AUC scores: 0.7486


Count encoding gives a slight increase in validation AUC, from 0.7467 for the baseline to 0.7486.

## Target Encoding

Target encoding replaces a categorical value with the average value of the target for that value of the feature. *Any given categorical value is replaced with the average outcome of all rows containing that value.*

Since this method uses the targets to create new features, including the validation or test data in the target encodings would be a form of *target leakage*. Thus we should learn the target encodings from the training dataset *only*.

The `category_encoders` package provides the `TargetEncoder` class for target encoding. It is used in the same manner as `CountEncoder`:
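The following sketch shows the idea by hand with pandas: per-category means are learned from training rows only and then mapped onto validation rows. The `train_toy` and `valid_toy` frames are invented for illustration, and falling back to the overall training mean for unseen categories is an assumption of this sketch. A library encoder may also smooth the per-category means toward the overall mean, which is omitted here.

```python
import pandas as pd

# Toy training and validation frames, invented for illustration only
train_toy = pd.DataFrame({
    "category": ["Music", "Music", "Games", "Games", "Games", "Art"],
    "outcome":  [1, 0, 1, 1, 0, 0],
})
valid_toy = pd.DataFrame({"category": ["Games", "Art", "Film"]})

# Learn the mean outcome per category from the training rows only
means = train_toy.groupby("category")["outcome"].mean()

# Assumed fallback for categories never seen in training: overall training mean
prior = train_toy["outcome"].mean()

# Apply the learned encoding to both splits without looking at validation targets
train_toy["category_target"] = train_toy["category"].map(means)
valid_toy["category_target"] = valid_toy["category"].map(means).fillna(prior)

print(valid_toy)
```

Because `means` is computed from `train_toy` alone, no information about validation outcomes reaches the encoded feature.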

In [43]:
import category_encoders as ce
cat_features = ['category', 'currency', 'country']