# A Python Class for Encoding High Cardinality Categorical Features


### Introduction

While working on Kaggle competitions and reading others' solutions, I've noticed that a lot of time is spent on encoding categorical features. This preprocessing step is necessary in almost all Kaggle competitions, so I thought I'd write a simple Python class with a consistent API, so that people can spend less time encoding and more time on more interesting work.

In this notebook I discuss the three types of encodings I've implemented - **One-hot Encoding**, **Frequency Encoding**, and **Mean/Label/Likelihood Encoding**. I then perform a quick case study to show how to use the class, and how well each method works.

### Why Encode High Cardinality Categorical Features?

Most machine learning toolkits require that the input training/test data be in a numeric format. The canonical way of encoding categorical features (which cannot be represented as numbers with meaningful magnitude and order) is to use *one-hot encoding*.

When a categorical feature has a large number of levels, however, one-hot encoding can lead to very sparse data with many features - even more so if you want to consider the interactions between the categorical feature in question and the other features in your dataset. This often leads to a propensity to overfit, as well as slower training.

To overcome this, Kagglers often use other methods of converting categorical features into numeric datatypes, but _of lower dimension_ than that yielded by one-hot encoding. This allows one to use the information in a categorical feature while avoiding the troubles of high dimensional, sparse encodings.

### Implemented encoding schemes

In this class, I've implemented three encoding schemes (so far): **One-hot Encoding**, **Frequency Encoding**, and **Mean/Label/Likelihood Encoding**.

##### One-hot encoding

Categorical features are represented as binary vectors. Each element of a vector signifies whether the corresponding example belongs to a particular class. Every vector has exactly one element with the value `1`, and the rest are `0`.
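
As a minimal sketch of the idea, using `pandas.get_dummies` on a made-up `colors` feature (this is for illustration only, and is not how the class in this notebook builds its encodings):

```python
import pandas as pd

# toy categorical feature (hypothetical data)
colors = pd.Series(["green", "red", "green", "blue"])

# one column per level; each row contains exactly one 1
onehot = pd.get_dummies(colors, prefix="color").astype(int)
print(onehot)
```

With three distinct levels, the single feature becomes three columns; a feature with thousands of levels would become thousands of columns.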

##### Frequency Encoding

Here, each category is mapped to the frequency with which it appears in the training set. Thus, a categorical variable (or a group of categorical features) with any number of levels can be represented as a single numeric feature. 

Frequency encoding works well when the frequency of a class carries real signal about the target, and when categories of similar frequency have similar properties with respect to the target. It's a straightforward method which often offers slightly improved performance.
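
A rough sketch of the mapping with plain pandas follows. The data is made up, and the `0.0` fallback for levels never seen in training is an assumption made for this sketch; the class below handles unseen levels differently.

```python
import pandas as pd

# toy training and test features (hypothetical data)
train = pd.Series(["green", "red", "green", "blue", "yellow", "red"])
test = pd.Series(["red", "green", "purple"])

# relative frequency of each level, learned from the training set only
freq = train.value_counts(normalize=True)

train_enc = train.map(freq)
# "purple" never appears in training; 0.0 is one possible fallback (an assumption)
test_enc = test.map(freq).fillna(0.0)
```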

##### Mean/Label/Likelihood Encoding

I first encountered mean encodings in the Coursera course **How to Win a Data Science Competition**. The authors of this course claim that this method is often the key to outperforming other competitors.

Mean encoding represents each category level as the average value of the response for that category. One cannot use the response value associated with a training example to encode said training example, however, as this would lead to leaking information from the response vector to the training data, and cause overfitting. 

Therefore, folding schemes are often implemented. A training set is split into `k` folds, where the encodings of each fold are determined by averaging the response values of the categories in the remaining `k-1` folds. This folding process is often repeated recursively, to further reduce the risk of overfitting. 
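
The folding step can be sketched as a single, non-recursive pass. This is a simplified illustration rather than the class's implementation: the function name `oof_mean_encode`, the toy data, and the fallback to the overall target mean for levels missing from the other folds are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_mean_encode(categories, target, n_splits=3, random_state=0):
    """Out-of-fold mean encoding: one pass, no recursion."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, enc_idx in kf.split(categories):
        # category means are computed only from the other k-1 folds
        fold_means = target.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = categories.iloc[enc_idx].map(fold_means).to_numpy()
    # levels absent from the other folds fall back to the overall mean (an assumption)
    return encoded.fillna(target.mean()).to_numpy()

# toy data (hypothetical)
cats = ["a", "a", "b", "b", "a", "b"]
y = [1, 0, 1, 1, 0, 1]
enc = oof_mean_encode(cats, y)
```

No example's own response contributes to its encoding inside the loop, which is the point of the folding scheme.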

I felt that the details on how to implement this scheme were unclear in the course, however, and many others agreed with me. As such, I thought it would be useful to implement it once and for all, for myself and others to use.


### What's Next? 

There are still many feature encoding schemes that I have read about and would like to implement. As I continue working on Kaggle competitions/notebooks, if I find myself implementing one of these schemes, I'll add it to this class (so look out for updates!)

Some of these schemes are:

- Ridge Regression Feature Encodings
- Feature Hashing
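
Feature hashing is not implemented in this notebook. Purely as a rough sketch of the idea (the function name `hash_encode`, the use of MD5, and the `n_buckets` parameter are assumptions, not a planned design), each level can be assigned to one of a fixed number of columns with a stable hash:

```python
import hashlib
import numpy as np

def hash_encode(categories, n_buckets=8):
    """Map each level to one of `n_buckets` indicator columns via a stable hash."""
    out = np.zeros((len(categories), n_buckets), dtype=int)
    for i, level in enumerate(categories):
        # md5 is used only because it is stable across runs, not for security
        digest = hashlib.md5(str(level).encode("utf-8")).hexdigest()
        out[i, int(digest, 16) % n_buckets] = 1
    return out

# toy data (hypothetical)
hashed = hash_encode(["green", "red", "green", "blue"], n_buckets=4)
```

The output width is fixed by `n_buckets` regardless of the number of levels, at the cost of distinct levels sometimes colliding in the same column.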


## The Class

Below is the code for the Category Encoder class. After the code, I demonstrate the class's API, and show an example of how to use it. 

In [120]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder, scale
from keras.utils import to_categorical

class CategoryEncoder(object):
    
    @staticmethod
    def _to_hashable(X):
        """
        Given a 2D numpy array, convert it into a 1D numpy array
        by converting the elements into strings and joining with an underscore. 
        This makes each combination of several features hashable. 
        A 1D array is simply converted to strings and returned. 
        
        input:
            X: numpy 1D or 2D array
        """
        try:
            n_dims = len(X.shape)
        except AttributeError:
            # X is not a numpy array 
            raise ValueError("Input must be numpy array-like, with `shape` attribute.")
        # a 1D array only needs its elements converted to strings
        if n_dims == 1:
            return(X.astype(str))
        # otherwise, X must be a 2D array
        if n_dims != 2:
            raise ValueError("X input shape should be of maximum 2 dimensions")
        # Join each row into one string. A list comprehension is used because
        # `np.apply_along_axis` sizes its output from the first row's result,
        # which would truncate any longer strings in later rows.
        X_hashable = np.array(["_".join(str(e) for e in row) for row in np.asarray(X)])
        return(X_hashable)

    @classmethod
    def _validate_input_both(cls, X_train, X_test, y_train):
        """
        Validate that X_train and X_test have the same number of dimensions, and
        that they are either 1D or 2D nd-array types. 
        Convert 2D X_train and X_test to 1D arrays of hashable strings, and y_train to 1D array. 
        
        Should be called only when X_train and X_test are provided. 
        
        input:
            - X_train: Numpy Nd-Array. Each column represents a high cardinality
                categorical variable, and each row is a training example
                
            - X_test:  Numpy Nd-Array. Each column represents the same high cardinality
                categorical variable as in the training example. Warning - if the test
                set contains factor levels not present in the training set, unexpected
                behavior will occur.
                
            - y_train: Numpy Nd-Array. Target (response) variable. Should be a numeric type, 
                so that calling `y_train.mean()` makes sense. 
        
        returns:
            - X_train_transformed: reshaped version of X_train. The levels of the
                different categorical variables across the features presented as input
                are concatenated as strings to create a single, even higher cardinality
                input. 
                
            - X_test_transformed:  reshaped version of X_test
            
            - y_train_transformed: values of y_train unchanged, but returned as a numpy
                array with shape (nrows,)
            
        raises:
            - ValueError Exception
        """
        try:
            # validate that X_train and X_test have the same number of dimensions
            if (len(X_train.shape) != len(X_test.shape)):
                raise ValueError("`X_train` and `X_test` must have the same number of dimensions")
        except AttributeError:
            """
            If X_train or X_test are not numpy array-like
            """
            raise ValueError("Input must be numpy array-like, with `shape` attribute.")
        # validate that input are of maximum 2 dimensions for X
        if (len(X_train.shape) > 2):
            raise ValueError("X input shape should be of maximum 2 dimensions")
            
        # validate that y input can be represented as a numeric 1D array
        try:
            y_flattened = y_train.reshape(-1).astype(np.float32) # reshaped response
        except ValueError:
            raise ValueError("y_train must be of a datatype that can be converted to a numeric datatype naturally.")
        # checked outside the `try` so that this message is not replaced by the one above
        if(len(y_flattened) != len(y_train)):
            raise ValueError("y_train should be able to be naturally represented as a 1D array.")
        
        # Reshape X_train and X_test
        if(len(X_train.shape) == 2):
            # Convert 2D array to 1D array of hashable type
            X_train = cls._to_hashable(X_train)
            X_test = cls._to_hashable(X_test)
        
        # make sure X_train and y_train have the same length
        if(len(X_train) != len(y_flattened)):
            raise ValueError("X_train and y_train must have the same number of elements.")

        return(np.array(X_train), np.array(X_test), np.array(y_flattened))
    
    @classmethod
    def _validate_input(cls, X_train, y_train):
        """
        Validate that X_train is either 1D or 2D nd-array type. 
        Convert a 2D X_train to a 1D array of hashable strings, and y_train to 1D array. 

        
        input:
            - X_train: Numpy Nd-Array. Each column represents a high cardinality
                categorical variable, and each row is a training example
                
            - y_train: Numpy Nd-Array. Target (response) variable. Should be a numeric type, 
                so that calling `y_train.mean()` makes sense. 
        
        returns:
            - X_train_transformed: reshaped version of X_train. The levels of the
                different categorical variables across the features presented as input
                are concatenated as strings to create a single, even higher cardinality
                input. 
            
            - y_train_transformed: values of y_train unchanged, but returned as a numpy
                array with shape (nrows,)
            
        raises:
            - ValueError Exception
        """
        # validate that input are of maximum 2 dimensions for X
        try:
            if (len(X_train.shape) > 2):
                raise ValueError("X input shape should be of maximum 2 dimensions")
        except AttributeError:
            """
            If X_train doesn't have the `shape` attribute, then it is not
            array-like
            """
            raise ValueError("Input must be numpy array-like, with `shape` attribute.")
            
        # validate that y input can be represented as a numeric 1D array
        try:
            y_flattened = y_train.reshape(-1).astype(np.float32) # reshaped response
        except ValueError:
            raise ValueError("y_train must be of a datatype that can be converted to a 1D numeric datatype naturally.")
        # checked outside the `try` so that this message is not replaced by the one above
        if(len(y_flattened) != len(y_train)):
            raise ValueError("y_train should be able to be naturally represented as a 1D array.")
        
        # Reshape X_train
        if(len(X_train.shape) == 2):
            # Convert 2D array to 1D array of hashable type
            X_train = cls._to_hashable(X_train)
        
        # make sure X_train and y_train have the same length
        if(len(X_train) != len(y_flattened)):
            raise ValueError("X_train and y_train must have the same number of elements.")

        return(np.array(X_train), np.array(y_flattened))
    
    """
    Recursively mean encode the categories in the training data. 
    Not to be used by the outside - called by the instance method `mean_encode()`
    """
    def _mean_encode_train(self, X_train, y_train, depth = 2, n_splits = 5, shuffle = True, random_state = 1):
        # base case - depth = 1
        if (depth == 1):
            # create a template for the mean encoded output
            y_train_enc = np.repeat(-1.0,len(y_train))
            # iterate through the different folds
            kf = KFold(n_splits = n_splits, shuffle = shuffle, random_state = random_state)
            for train_index, test_index in kf.split(X_train):
                # compute the group means in the training folds
                train_means = pd.DataFrame({"group" : X_train[train_index], 
                                            "mean_target": y_train[train_index]}
                                          ).groupby("group", as_index = False).mean()

                # use group means to match each group with an encoding
                encoded_means = pd.DataFrame({"group": X_train[test_index]}).reset_index().merge(train_means, how = "left")
                """
                Groups that do not appear in the train fold, but do in the test fold, 
                will have NA as that particular group mean. 
                As such, fill these NA values with the average of the encodings that
                were successfully matched for the test fold. 
                """
                encoded_means = encoded_means.fillna(encoded_means.mean_target.mean())
                # store the encoded means
                y_train_enc[test_index] = encoded_means.sort_index().mean_target.values
            # return the encoded target
            return(y_train_enc)
        
        # Recursive step - if depth >= 2
        else:
            # create a template for the mean encoded output
            y_train_enc = np.repeat(-1.0,len(y_train))
            # iterate through the different folds, encoding the training output fold by fold
            kf = KFold(n_splits = n_splits, shuffle = shuffle, random_state = random_state)
            for train_index, test_index in kf.split(X_train):
                y_train_enc[test_index] = self._mean_encode_train(X_train[test_index], 
                                            y_train[test_index], depth = depth - 1, n_splits = n_splits, 
                                            shuffle = shuffle, random_state = random_state)
            """
            Now, use the encoded outputs to 're-encode' the output again
            """
            # get a new k-fold object with a different seed, which will result in different splits
            kf = KFold(n_splits = n_splits, shuffle = shuffle, random_state = random_state + 1)
            # template for re-encoded output
            y_train_reencoded = np.repeat(-1.0,len(y_train))
            for train_index, test_index in kf.split(X_train):
                # compute the group means in the training folds
                train_means = pd.DataFrame({"group" : X_train[train_index], 
                                            "mean_target": y_train_enc[train_index]}
                                          ).groupby("group", as_index = False).mean()

                # use group means to match each group with an encoding
                encoded_means = pd.DataFrame({"group": X_train[test_index]}).reset_index().merge(
                    train_means, how = "left").sort_index()
                # fill na's that result from groups in test fold that don't appear in training folds
                encoded_means = encoded_means.fillna(encoded_means.mean_target.mean())
                # store the encoded means
                y_train_reencoded[test_index] = encoded_means.mean_target.values
            return(y_train_reencoded)
    
    """
    Main instance method for mean encoding. 
    """
    def mean_encode(self, X_train, y_train, X_test = None, depth = 2, 
                    n_splits = 5, shuffle = True, random_state = 1):
        """
        Mean encode high cardinality categorical features. 
        Mean encoding is a process in which categories are re-encoded by the average
        response value they have in a separate holdout set. In this way, mean encoded
        features are similar to the predictions of a KNN classifier/regressor. 
        
        To avoid overfitting, the response of any training example should not be used
        to encode that example. This is why k-folding is used - the data is split into
        `n_splits` folds, and the values of each fold are computed by averaging the 
        remaining `n_splits - 1` folds. 
        
        To further avoid overfitting, this process is repeated recursively; the depth of
        this recursion is controlled by the parameter `depth` (default: 2)
        
        ----------
        Parameters
        ----------
        - X_train: Numpy Nd-Array. Each column represents a high cardinality
                categorical variable, and each row is a training example

        - y_train: Numpy Nd-Array. Target (response) variable. Should be a numeric type, 
                so that calling `y_train.mean()` makes sense.  
            
        - X_test (default = None) :  Numpy Nd-Array. Each column represents the same high cardinality
            categorical variable as in the training example. Warning - if the test
            set contains factor levels not present in the training set, unexpected
            behavior will occur.
        
        - depth (default = 2): Integer. Number of times to recursively use k-folding to introduce noise to the
            encodings, and avoid overfitting.
            
        - n_splits (default = 5): Integer. Number of folds (`k`) to use when using k-folding
        
        - shuffle (default = True): Boolean. Whether to use randomly split folds. 
        
        - random_state (default = 1): Integer. For reproducibility. 
        
        
        ----------
        Returns
        ----------
        
        - X_train_encoded: numeric numpy array. Encoded values of training categories. 
        
        - X_test_encoded: numeric numpy array. Only returned if `X_test` is provided 
            (not None) when method is called.
        """
        # first: Is X_test provided? 
        if X_test is not None:
            # validate the input. 
            X_train, X_test, y_train = self._validate_input_both(X_train, X_test, y_train)  
        else:
            X_train, y_train = self._validate_input(X_train, y_train)

        """
        Once inputs are validated, recursively encode the training data
        """
        y_train_encoded = self._mean_encode_train(X_train, y_train, depth, n_splits, shuffle, random_state)
        # if test set is not provided, return encoded output
        if X_test is None:
            return(y_train_encoded)
        else:
            """
            Need to group categories together. Multiple high cardinality factors may have been provided,
            so need to first combine them to one even higher cardinality factor. 
            
            Then, test encodings are the average of the train encodings within that group. 
            """
            # get the average encoded values for each group in the training set
            group_averages = pd.DataFrame({"group":X_train, "encoded_values":y_train_encoded}).groupby(
                "group", as_index = False).mean()
            # join to get test encodings
            test_encodings = pd.DataFrame({"group":X_test}).reset_index().merge(group_averages, how = "left").fillna(
                group_averages.encoded_values.mean()).sort_index()
            y_test_encoded = test_encodings.encoded_values.values
            # return both training and test encoded values
            return(y_train_encoded, y_test_encoded)
        
    def frequency_encode(self, X_train, X_test = None):
        """
        Encode categories into a single numeric variable, corresponding to the frequency of said 
        categories. 
        If a test set is provided `X_test`, then the frequencies are determined from the training
        set, so that classifiers trained on the training set generalize properly to the test set. 
        
        ----------
        Parameters
        ----------
        - X_train: Numpy Nd-Array. Each column represents a high cardinality
                categorical variable, and each row is a training example
            
        - X_test (default = None) :  Numpy Nd-Array. Each column represents the same high cardinality
            categorical variable as in the training example. Warning - if the test
            set contains factor levels not present in the training set, unexpected
            behavior will occur.
            
        ----------
        Returns
        ----------
        
        - X_train_encoded: numeric numpy array. Frequency encoded values of training categories. 
        
        - X_test_encoded: numeric numpy array. Only returned if `X_test` is provided 
            (not None) when method is called.
        """
        # collapse multiple factors (X_test only if it was provided)
        X_train = self._to_hashable(X_train)
        if X_test is not None:
            X_test = self._to_hashable(X_test)
        # encode the training set with frequency counts
        X_train_encoded = pd.DataFrame({"group":X_train}).reset_index()
        # compute a frequency table
        train_frequency = pd.DataFrame({"frequency":X_train_encoded.groupby("group").size()/len(X_train_encoded)})
        #join to get training frequency
        X_train_encoded = X_train_encoded.merge(train_frequency, left_on = "group", 
                                                right_index=True,sort=False).sort_index()
        """
        If no test set was provided, return the training frequencies as is.
        Otherwise, return both the training and the test frequencies. 
        """
        if X_test is None:
            return X_train_encoded.frequency.values
        else:
            # merge to get the test set frequencies
            X_test_encoded = pd.DataFrame({"group":X_test}).reset_index()
            X_test_encoded = X_test_encoded.merge(train_frequency, how = "left", left_on = "group", 
                                                 right_index = True, sort = False).sort_index()
            # fill NAN values with the mean encoding 
            X_test_encoded = X_test_encoded.fillna(X_test_encoded.frequency.mean())
            return X_train_encoded.frequency.values, X_test_encoded.frequency.values
    
    
    def _label_encode(self, X_train, X_test = None):
        """
        Given an array representing a categorical feature, transform into an integer array, 
        where each integer represents a level
        """
        # initialize a label encoder object
        encoder = LabelEncoder()
        """
        If test set is provided, want to train on both the training labels
        and the test labels
        """
        if X_test is not None:
            all_categories = np.append(X_train,X_test)
        else:
            all_categories = X_train
        # train the encoder on the categories
        encoder.fit(all_categories.astype(str))
        """
        If test set is provided, return both transformed sets. Otherwise, only transform the
        training set and return
        """
        if X_test is not None:
            X_train_transformed = encoder.transform(X_train)
            X_test_transformed = encoder.transform(X_test)
            return X_train_transformed, X_test_transformed, encoder.classes_
        else:
            return encoder.transform(X_train), encoder.classes_
    
    def onehot_encode(self, X_train, X_test = None, prefix = "dummy_"):
        """
        Given an array representing a categorical feature, transform into a matrix of
        one-hot vectors, encoding the feature. 
        
        This matrix is returned as Pandas DataFrame, so that column names can help identify 
        the original category levels
        
        ----------
        Parameters
        ----------
        - X_train: Numpy Nd-Array. Each column represents a high cardinality
                categorical variable, and each row is a training example
            
        - X_test (default = None) :  Numpy Nd-Array. Each column represents the same high cardinality
            categorical variable as in the training example. Warning - if the test
            set contains factor levels not present in the training set, unexpected
            behavior will occur.
            
        - prefix (default = "dummy_"): Used to prefix the column names of the returned dataframe. The category levels 
            are appended to the prefix to form the column names.
            
            
        ----------
        Returns
        ----------
        
        - X_train_encoded: Pandas DataFrame containing the one-hot encoded examples in the training set. 
        
        - X_test_encoded: Pandas DataFrame containing the one-hot encoded examples in the test set. 
            only returned if X_test is provided to the method.
        
        """
        # collapse levels into a single vector
        X_train = self._to_hashable(X_train)
        if X_test is not None:
            X_test = self._to_hashable(X_test)
        # convert the classes into integer labels
        if X_test is not None:
            X_train_labels, X_test_labels, classes = self._label_encode(X_train, X_test)
        else:
            X_train_labels, classes = self._label_encode(X_train)
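        # Expand the labels into one-hot encodings. `num_classes` is passed explicitly so that
        # the training matrix still has one column per class when some levels occur only in X_test
        X_train_onehot = to_categorical(X_train_labels, num_classes = len(classes)).astype(int)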
        if X_test is not None:
            X_test_onehot = to_categorical(X_test_labels, num_classes = len(classes)).astype(int)
        """
        Return the result(s) as dataframe(s). 
        First, build up a list of the column names to use. Then return as a dataframe,
        where each row is the one-hot encoding of an element in X_train
        """
        colnames = [prefix + c for c in classes]
        X_train_encoded = pd.DataFrame(X_train_onehot, columns = colnames)
        if X_test is None:
            return X_train_encoded
        else:
            X_test_encoded = pd.DataFrame(X_test_onehot, columns = colnames)
            return X_train_encoded, X_test_encoded


### Class API

The methods in this class follow a simple pattern - if only a training set is provided upon method call, then only encodings of the training set are returned. If the test set is also provided, both the encodings of the training and test set are returned. 

For example, consider creating frequency encodings:

In [81]:
# create vector representing a categorical variable in the training set...
train_categories = np.array(["green", "red", "green", "blue", "yellow", "red"])
# ... and the same categorical variable in the test set
test_categories = np.array(["red", "green", "yellow"])

# Initialize a category encoder
encoder = CategoryEncoder()

# Provide just the "training set" to encode just the first vector:
print("training encoding: ", encoder.frequency_encode(train_categories))
print()

# Or provide both the training sets and test sets to encode both
print("""
training encodings:%s
test encodings:    %s""" % encoder.frequency_encode(train_categories, test_categories))

training encoding:  [0.33333333 0.33333333 0.33333333 0.16666667 0.16666667 0.33333333]


training encodings:[0.33333333 0.33333333 0.33333333 0.16666667 0.16666667 0.33333333]
test encodings:    [0.33333333 0.33333333 0.16666667]


That's it! Just a couple lines of code, which save you from writing a couple more (and looking up how Pandas `merge` works again...)

For reference on how to use this class, please look at:
```python
help(CategoryEncoder())
```

## Case Study

In [117]:
# some more libraries
import plotnine
from plotnine import *
from IPython.display import display

In [144]:
# load the training and test sets
train = pd.read_csv("../data/application_train.csv")
test = pd.read_csv("../data/application_test.csv")

In [145]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.311267,0.622246,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.555912,0.729567,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.650442,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Religion,,0.322738,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    # emulating R's `str()` function
    print(train.apply(lambda x:[ x.unique()]))

### Basic Preprocessing 

...



#### Converting Binary features to numeric binary features

...

In [None]:
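# convert Y/N features to true binary (1 for "Y", 0 otherwise) in both train and test
train.FLAG_OWN_CAR = (train.FLAG_OWN_CAR == "Y").astype(int)
test.FLAG_OWN_CAR = (test.FLAG_OWN_CAR == "Y").astype(int)
train.FLAG_OWN_REALTY = (train.FLAG_OWN_REALTY == "Y").astype(int)
test.FLAG_OWN_REALTY = (test.FLAG_OWN_REALTY == "Y").astype(int)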
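To keep train and test in sync, the Y/N conversion can be wrapped in a small helper. The sketch below is only an illustration: `encode_yes_no` and the toy frames are hypothetical names and values, not part of the notebook's data or of the encoder class.

```python
import pandas as pd

# Toy frames standing in for the real train/test data (hypothetical values).
toy_train = pd.DataFrame({"FLAG_OWN_CAR": ["Y", "N", "N"],
                          "FLAG_OWN_REALTY": ["Y", "Y", "N"]})
toy_test = pd.DataFrame({"FLAG_OWN_CAR": ["N", "Y"],
                         "FLAG_OWN_REALTY": ["N", "N"]})


def encode_yes_no(df, columns):
    """Return a copy of df with the given Y/N columns mapped to 1/0."""
    out = df.copy()
    for col in columns:
        # The comparison yields booleans; astype(int) turns them into 0/1.
        out[col] = (out[col] == "Y").astype(int)
    return out


flag_cols = ["FLAG_OWN_CAR", "FLAG_OWN_REALTY"]
toy_train = encode_yes_no(toy_train, flag_cols)
toy_test = encode_yes_no(toy_test, flag_cols)
print(toy_train)
```

Applying one function to both frames makes it harder to convert a column in train and forget it in test.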

#### Filling NA values 

... (of non-categorical variables)

In [None]:
list(train.columns)

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    # Count the number of NA values in each column
    print(train.apply(lambda x: np.sum(pd.isnull(x))))
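With well over a hundred columns, the full NA table is long. A shorter view, sketched here on a toy frame with made-up values, keeps only the columns that actually contain missing values and sorts them by count. `DataFrame.isnull().sum()` gives the same per-column counts as the `apply` call above.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names borrowed from the dataset.
toy = pd.DataFrame({
    "AMT_ANNUITY": [24700.5, np.nan, 6750.0],
    "OWN_CAR_AGE": [np.nan, np.nan, 26.0],
    "CODE_GENDER": ["M", "F", "M"],
})

# Number of missing values per column.
na_counts = toy.isnull().sum()
# Keep only columns with at least one NA, largest count first.
na_counts = na_counts[na_counts > 0].sort_values(ascending=False)
print(na_counts)
```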

In [None]:
train[["AMT_ANNUITY", "AMT_GOODS_PRICE"]].hist()

In [None]:
# fill NA values of AMT_ANNUITY and AMT_GOODS_PRICE with -1
train.AMT_ANNUITY = train.AMT_ANNUITY.fillna(-1)
test.AMT_ANNUITY = test.AMT_ANNUITY.fillna(-1)
train.AMT_GOODS_PRICE = train.AMT_GOODS_PRICE.fillna(-1)
test.AMT_GOODS_PRICE = test.AMT_GOODS_PRICE.fillna(-1)

In [None]:
train.OWN_CAR_AGE.hist()

In [None]:
# fill NA values of OWN_CAR_AGE with -1
train.OWN_CAR_AGE = train.OWN_CAR_AGE.fillna(-1)
test.OWN_CAR_AGE = test.OWN_CAR_AGE.fillna(-1)

In [None]:
train[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].hist()

In [None]:
# Replace NA values in `EXT_SOURCE_N` columns with -1
train.EXT_SOURCE_1 = train.EXT_SOURCE_1.fillna(-1)
test.EXT_SOURCE_1 = test.EXT_SOURCE_1.fillna(-1)
train.EXT_SOURCE_2 = train.EXT_SOURCE_2.fillna(-1)
test.EXT_SOURCE_2 = test.EXT_SOURCE_2.fillna(-1)
train.EXT_SOURCE_3 = train.EXT_SOURCE_3.fillna(-1)
test.EXT_SOURCE_3 = test.EXT_SOURCE_3.fillna(-1)
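The fills above all follow one pattern: look at the distribution, then replace NA with a value that does not occur in the real data. A minimal sketch of that pattern as a loop is shown below. It assumes the reason for choosing -1 is that it lies outside the observed range; the toy values and the `SENTINEL` name are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames with hypothetical values.
toy_train = pd.DataFrame({"EXT_SOURCE_1": [0.08, np.nan, 0.31],
                          "EXT_SOURCE_2": [0.26, 0.62, np.nan]})
toy_test = pd.DataFrame({"EXT_SOURCE_1": [np.nan, 0.5],
                         "EXT_SOURCE_2": [0.1, 0.2]})

SENTINEL = -1  # marker for "missing"; must not collide with a real value

for col in ["EXT_SOURCE_1", "EXT_SOURCE_2"]:
    # The sentinel only marks missingness unambiguously if no observed value equals it.
    assert (toy_train[col].dropna() != SENTINEL).all()
    toy_train[col] = toy_train[col].fillna(SENTINEL)
    toy_test[col] = toy_test[col].fillna(SENTINEL)

print(toy_train)
```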

In [None]:
numeric_cols = ['APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_AVG',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'ELEVATORS_AVG',
 'ENTRANCES_AVG',
 'FLOORSMAX_AVG',
 'FLOORSMIN_AVG',
 'LANDAREA_AVG',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAREA_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAREA_AVG',
 'APARTMENTS_MODE',
 'BASEMENTAREA_MODE',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BUILD_MODE',
 'COMMONAREA_MODE',
 'ELEVATORS_MODE',
 'ENTRANCES_MODE',
 'FLOORSMAX_MODE',
 'FLOORSMIN_MODE',
 'LANDAREA_MODE',
 'LIVINGAPARTMENTS_MODE',
 'LIVINGAREA_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAREA_MODE',
 'APARTMENTS_MEDI',
 'BASEMENTAREA_MEDI',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'ELEVATORS_MEDI',
 'ENTRANCES_MEDI',
 'FLOORSMAX_MEDI',
 'FLOORSMIN_MEDI',
 'LANDAREA_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'LIVINGAREA_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'NONLIVINGAREA_MEDI',
 'TOTALAREA_MODE']

train[numeric_cols].apply(lambda x: [min(x), max(x)])

In [None]:
# fill in the missing values of these columns with -1, in both train and test
for col in numeric_cols:
    train[col] = train[col].fillna(-1)
    test[col] = test[col].fillna(-1)

In [None]:
other_cols = ['OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'DAYS_LAST_PHONE_CHANGE',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR']

train[other_cols].apply(lambda x: [min(x), max(x)])

In [None]:
for col in ['OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR']:
    train[col] = train[col].fillna(-1)
    test[col] = test[col].fillna(-1)
    
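# DAYS_LAST_PHONE_CHANGE counts days backwards (values are <= 0), so -1 could be
# a real value; use 999 as the missing-value marker instead
train.DAYS_LAST_PHONE_CHANGE = train.DAYS_LAST_PHONE_CHANGE.fillna(999)
test.DAYS_LAST_PHONE_CHANGE = test.DAYS_LAST_PHONE_CHANGE.fillna(999)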

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    # Count the number of NA values in each column
    print(train.apply(lambda x: np.sum(pd.isnull(x))))

## Encoding categorical features

In [None]:
categorical = ['NAME_CONTRACT_TYPE', 
              'CODE_GENDER', 
               'NAME_TYPE_SUITE',
               'NAME_INCOME_TYPE',
               'NAME_EDUCATION_TYPE',
               'NAME_FAMILY_STATUS',
               'NAME_HOUSING_TYPE',
               'OCCUPATION_TYPE',
               'HOUR_APPR_PROCESS_START',
               'ORGANIZATION_TYPE', 
               'FONDKAPREMONT_MODE', 
               'HOUSETYPE_MODE', 
               'WALLSMATERIAL_MODE', 
               'EMERGENCYSTATE_MODE'
              ]

In [None]:
# count how often each level of each categorical variable occurs (long format)
tmp = train[categorical].melt().groupby(["variable", "value"]).size().reset_index(name="frequency")

ggplot(tmp, aes(x = "value", y = "frequency", fill = "variable")) +\
    facet_wrap("variable", scales = "free") +\
    geom_col(show_legend = False) +\
    theme(axis_text=element_blank())+\
    ggtitle("Frequencies of all categorical variables")

In [None]:
# start with one-hot encoding
for col in categorical:
    train_encodings, test_encodings = encoder.onehot_encode(train[col],
                                                            test[col], 
                                                           prefix = col.lower() + "_")
    train = train.merge(train_encodings, left_index = True, right_index = True)
    test = test.merge(test_encodings, left_index = True, right_index = True)
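To make clear what kind of output this loop expects, here is a rough sketch of a one-hot encoder with the same call shape. `onehot_sketch` is a hypothetical stand-in, not the class's actual `onehot_encode`. It assumes that levels are learned from the training column only, so that train and test end up with identical columns and unseen test levels become all-zero rows.

```python
import pandas as pd


def onehot_sketch(train_col, test_col, prefix=""):
    """Hypothetical stand-in for encoder.onehot_encode (not the real implementation)."""
    # Learn the set of levels from the training column only.
    levels = sorted(train_col.dropna().unique())
    dtype = pd.CategoricalDtype(categories=levels)
    # Values outside `levels` become NaN, which get_dummies encodes as all zeros.
    train_enc = pd.get_dummies(train_col.astype(dtype), prefix=prefix,
                               prefix_sep="", dtype=int)
    test_enc = pd.get_dummies(test_col.astype(dtype), prefix=prefix,
                              prefix_sep="", dtype=int)
    return train_enc, test_enc


# Toy columns; "Other" is a made-up level that appears only in test.
toy_train_col = pd.Series(["Cash loans", "Revolving loans", "Cash loans"])
toy_test_col = pd.Series(["Revolving loans", "Other"])

tr, te = onehot_sketch(toy_train_col, toy_test_col, prefix="name_contract_type_")
print(tr)
print(te)
```

Because both outputs keep the index of their input column, they can be merged back with `left_index = True, right_index = True` as in the loop above.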

In [None]:
# now add frequency encodings
for col in categorical:
    train_encodings, test_encodings = encoder.frequency_encode(train[col], 
                                                          test[col])
    train[col.lower() + "_freq"] = train_encodings
    test[col.lower() + "_freq"] = test_encodings
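For reference, frequency encoding can be sketched in a few lines of pandas. `frequency_sketch` below is a hypothetical stand-in for `frequency_encode`, not the class's code. It assumes relative frequencies computed on the training column, and it assumes that levels unseen in train (or missing) get a frequency of 0; the real class may make different choices.

```python
import pandas as pd


def frequency_sketch(train_col, test_col):
    """Hypothetical stand-in for encoder.frequency_encode (not the real implementation)."""
    # Share of training rows that fall into each level.
    freqs = train_col.value_counts(normalize=True)
    # Map each row to the frequency of its level; unknown or missing levels get 0.
    train_enc = train_col.map(freqs).fillna(0.0)
    test_enc = test_col.map(freqs).fillna(0.0)
    return train_enc, test_enc


# Toy columns with made-up levels; "d" appears only in test.
toy_train_col = pd.Series(["a", "a", "b", "c"])
toy_test_col = pd.Series(["a", "d"])

tr_freq, te_freq = frequency_sketch(toy_train_col, toy_test_col)
print(tr_freq.tolist())
print(te_freq.tolist())
```

The result is a single numeric column per feature, regardless of how many levels the feature has.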

In [None]:
# finally, add the mean encodings
for col in categorical:
    train_encodings, test_encodings = encoder.mean_encode(train[col],
                                                          train.TARGET,
                                                          test[col])
    train[col.lower() + "_mean_encoding"] = train_encodings
    test[col.lower() + "_mean_encoding"] = test_encodings
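The idea behind mean encoding can be sketched as follows. `mean_sketch` is a hypothetical stand-in with the same argument order as `mean_encode`, not the class's implementation. It assumes the plainest variant: each level is replaced by the mean of the target over the training rows with that level, and unseen or missing levels fall back to the global target mean.

```python
import pandas as pd


def mean_sketch(train_col, target, test_col):
    """Hypothetical stand-in for encoder.mean_encode (not the real implementation)."""
    global_mean = target.mean()
    # Mean of the target within each level of the training column.
    level_means = target.groupby(train_col).mean()
    # Unknown or missing levels fall back to the overall target mean.
    train_enc = train_col.map(level_means).fillna(global_mean)
    test_enc = test_col.map(level_means).fillna(global_mean)
    return train_enc, test_enc


# Toy data with made-up values; "c" appears only in test.
toy_train_col = pd.Series(["a", "a", "b", "b"])
toy_target = pd.Series([1, 0, 1, 1])
toy_test_col = pd.Series(["a", "c"])

tr_mean, te_mean = mean_sketch(toy_train_col, toy_target, toy_test_col)
print(tr_mean.tolist())
print(te_mean.tolist())
```

This plain version encodes each training row using its own target value, which can leak the target into the features. Smoothing or out-of-fold estimates are common remedies; the sketch does not include them.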


In [None]:
categorical

In [None]:
pd.options.display.max_columns = None
display(train[train.columns[122:]].head())

## Experiments

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    # emulating R's `str()` function
    print(train.apply(lambda x:[ x.unique()]))

In [None]:
y_train = train.TARGET.values

In [None]:
to_remove = ['TARGET' , 'SK_ID_CURR'] + categorical
all_columns = list(train.columns)
for col in to_remove:
    all_columns.remove(col)
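The same column selection can be written as a list comprehension, which does not raise an error if a name is already absent. The sketch below uses a toy frame with made-up values; `toy`, `toy_categorical`, and `feature_cols` are illustrative names.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the notebook's pattern.
toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003],
    "TARGET": [1, 0],
    "CODE_GENDER": ["M", "F"],
    "AMT_CREDIT": [406597.5, 1293502.5],
    "code_gender_freq": [0.5, 0.5],
})
toy_categorical = ["CODE_GENDER"]

# Drop the target, the ID, and the raw categorical columns; keep their encodings.
to_remove = {"TARGET", "SK_ID_CURR", *toy_categorical}
feature_cols = [c for c in toy.columns if c not in to_remove]

X = toy[feature_cols].values
print(feature_cols)
```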

In [None]:
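# sanity check: the target and ID columns must not be among the features
assert "TARGET" not in all_columns and "SK_ID_CURR" not in all_columns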

In [None]:
all_columns