# Cat in the data
## Predict the outcome of categorical features

## Introduction
This fictional dataset was created ad hoc to offer a challenging setting for practising a common machine learning task: encoding categorical features and predicting from them. The dataset contains only categorical features, and includes:

* binary features
* low- and high-cardinality nominal features
* low- and high-cardinality ordinal features
* (potentially) cyclical features

First, let's import the needed libraries and data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import feature_selection, linear_model, model_selection, preprocessing
from copy import deepcopy

In [None]:
train = pd.read_csv("../input/cat-in-the-dat-ii/train.csv")
test = pd.read_csv("../input/cat-in-the-dat-ii/test.csv")

## Data inspection
Before getting into any kind of encoding or classification, it is appropriate to inspect the data to gain some insight into the dataset. I am going to start by looking at the dataset size, the feature names and their distributions.

In [None]:
print("Training set size: (%d, %d)" %(train.shape[0], train.shape[1]))
print("Test set size: (%d, %d)" %(test.shape[0], test.shape[1]))

In [None]:
# Visualisation options
pd.set_option('display.max_columns', None)
train.head()

In [None]:
summary = pd.DataFrame()
summary['Name'] = train.columns
summary['Type'] = train.dtypes.values
summary['Missing'] = train.isna().sum().values    
summary['Uniques'] = train.nunique().values
print(summary.to_string(index=False))

In [None]:
full_summary = pd.DataFrame(columns=["Train categories", "Test categories", "Not in test", "Not in train"],
                            index=train.drop(columns=['id','target']).columns)
for i, column in enumerate(train.drop(columns=['id','target'])):
    full_summary.iloc[i, 0] = len(train[column].value_counts())
    full_summary.iloc[i, 1] = len(test[column].value_counts())
    train_categories = set(train[column].value_counts().keys())
    test_categories = set(test[column].value_counts().keys())
    full_summary.iloc[i, 2] = len(train_categories - test_categories)
    full_summary.iloc[i, 3] = len(test_categories - train_categories)

print(full_summary)

So, the dataset is pretty large and contains 25 columns, target and id included; the test set has the same columns except for the target, for obvious reasons. Even though each feature is fictional, its type can be understood from its name, which does not explain the *meaning* of the feature but its kind (binary, nominal, ordinal, etc.). There are five binary features, ten nominal (half low- and half high-cardinality), six ordinal (three low-, two *medium*- and one high-cardinality), and two representing day and month, so ordinal but cyclical. In addition, there are many missing values which will have to be handled. Unfortunately, the set of categories in the high-cardinality columns is not the same in the training set and the test set, so special care will be needed.

All in all, it is quite a difficult dataset to handle. In fact, there will be many degrees of freedom to manipulate the dataset for classification: missing values handling, categorical encoding, other preprocessing steps and classification algorithms.

## Preprocessing
I tried out several preprocessing possibilities, but here I will show only the ones providing the best cross-validation score. First, let's drop the id and set the labels aside to work with just the features.

In [None]:
# Get target labels and test id
train_labels = train.target
test_id = test.id

# ID column and target not necessary
train = train.drop(columns=['id', 'target'])
test = test.drop(columns=['id'])

### Missing values filling
First, missing values must be filled in. The most trivial method is to use the mode of each column, but this is limiting since it does not preserve the distributions. To overcome this problem one could randomly pick the replacement from the distribution of the corresponding feature, but the picks might often be wrong and, in fact, it is not a widely used technique. The last option, which is the one yielding the best score, is to replace missing values with a separate valid category, <i>"NULL"</i>.
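The three strategies above can be compared on a toy column. This is only a minimal sketch, not the pipeline used in this notebook: the `toy` series and the helper names `fill_mode`, `fill_random` and `fill_null` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with missing values (not taken from the dataset)
toy = pd.Series(["a", "b", "a", None, "c", None, "a"])


def fill_mode(column):
    """Replace missing values with the most frequent category."""
    return column.fillna(column.mode().iloc[0])


def fill_random(column, seed=0):
    """Replace missing values by sampling from the observed distribution."""
    rng = np.random.default_rng(seed)
    # value_counts ignores missing values, so these are the observed frequencies
    frequencies = column.value_counts(normalize=True)
    filled = column.copy()
    missing = filled.isna()
    filled.loc[missing] = rng.choice(
        frequencies.index.to_numpy(), size=int(missing.sum()), p=frequencies.to_numpy()
    )
    return filled


def fill_null(column):
    """Treat missing values as a category of their own."""
    return column.fillna("NULL")


print(fill_mode(toy).tolist())
print(fill_random(toy).tolist())
print(fill_null(toy).tolist())
```

Only the last variant keeps the information that a value was missing, which may be one reason why it scores best here.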

### Categorical encoding
There are many ways to encode categorical features, but two are particularly common, namely label encoding and one-hot encoding. The first one assigns an integer to each category, taking the order into account in the case of ordinal features. The problem with this method is that, when features have different numbers of categories, the label scales will differ widely and further preprocessing will be needed to bring them to the same range. One-hot encoded columns, on the other hand, contain only ones and zeros, which avoids the scale problem, but the size of the matrix increases greatly. Furthermore, cyclical features are suitable for cyclical encoding, which preserves the proximity between labels that look *far* apart but are not (like the first and the last ones).
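To make the idea of cyclical encoding concrete, here is a minimal sketch that maps labels onto the unit circle with a sine/cosine pair. The helper `cyclical_encode` is hypothetical, and it assumes integer labels running from 1 to `period` (for example months 1 to 12); it is not the encoding finally used in this notebook.

```python
import numpy as np
import pandas as pd


def cyclical_encode(values, period):
    """Map integer labels 1..period onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * (np.asarray(values, dtype=float) - 1) / period
    return pd.DataFrame({"sin": np.sin(angle), "cos": np.cos(angle)})


# Hypothetical example: January, June and December as labels 1, 6 and 12
months = cyclical_encode([1, 6, 12], period=12)
january, june, december = months.to_numpy()

# Euclidean distances between the encoded labels
december_to_january = np.linalg.norm(december - january)
december_to_june = np.linalg.norm(december - june)

# December ends up closer to January than to June, unlike with plain labels
print(december_to_january < december_to_june)
```

With plain integer labels, December (12) would be eleven steps away from January (1); on the circle they are neighbours.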

All in all, the best-performing encoding, at least with logistic regression as the baseline estimator, is one-hot encoding with <i>"NULL"</i> filling.

### Test set encoding
Encoding the test set requires special care. I cannot simply reuse an encoder fitted on the training set because, unfortunately, some categories appear in only one of the two sets. In real-life scenarios this is something to avoid: categories unseen during training may produce errors because the model was trained without them, all the more so with one-hot encoding, where each new category becomes a whole new feature. In this case I am going to fit the encoder on the whole dataset (training and test sets together) and then split it again, with the caveat that such situations must be handled carefully in real-life scenarios. In particular, I am going to use a variance filter to remove the columns that are constant in the training set, which correspond to categories appearing only in the test set.

In [None]:
fullset = pd.concat([train, test])
for column in fullset.columns:
    fullset[column] = fullset[column].fillna("NULL").astype(str)

In [None]:
ohe_encoder = preprocessing.OneHotEncoder(dtype=np.int8).fit(fullset)
fullset = ohe_encoder.transform(fullset)

In [None]:
train = fullset[:train.shape[0]]
test = fullset[train.shape[0]:]

In [None]:
var_filter = feature_selection.VarianceThreshold(threshold=0).fit(train)
train = var_filter.transform(train)
test = var_filter.transform(test)
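In a real-life scenario, where the test set is not available when the encoder is fitted, one possible alternative is to fit on the training data only and let the encoder ignore unseen categories. The sketch below uses scikit-learn's `handle_unknown="ignore"` option on invented toy frames (`toy_train`, `toy_test`); it is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "purple" appears only in the test frame
toy_train = pd.DataFrame({"colour": ["red", "blue", "red"]})
toy_test = pd.DataFrame({"colour": ["blue", "purple"]})

# Fit on the training data only; unseen categories are encoded as all-zero rows
encoder = OneHotEncoder(dtype=np.int8, handle_unknown="ignore").fit(toy_train)
encoded_test = encoder.transform(toy_test).toarray()

print(encoder.categories_)
print(encoded_test)
```

This way the number of columns is fixed by the training set, so no variance filter is needed afterwards, at the price of discarding whatever information the unseen categories carry.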

## Features engineering
As seen above, one-hot encoding with one additional category seems to be the best method but, in all likelihood, not every feature is important for classification, and most could be eliminated to improve generalisation. I will apply a univariate feature filter, sorting the features by their univariate score in descending order (called k-score here; `SelectKBest` uses the ANOVA F-value by default) and removing the least significant ones.

In [None]:
# Univariate features filter
univariate_selection = feature_selection.SelectKBest(k='all').fit(train, train_labels)

In [None]:
plt.plot(np.cumsum(sorted(univariate_selection.scores_ / sum(univariate_selection.scores_), reverse=True)))
plt.xlabel("Number of features")
plt.ylabel("% of k-score obtained")
plt.title("Number of features (sorted by k-score) vs % of k-score explained")
plt.show()

In [None]:
kscore_filter = feature_selection.SelectKBest(k=1900).fit(train, train_labels)
train = kscore_filter.transform(train)
test = kscore_filter.transform(test)
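The value of `k` above was read off the plot. As a rough sketch of how such a choice could be automated, the hypothetical helper below returns the smallest number of top-scoring features that reaches a given share of the total score. Both the helper name and the threshold are assumptions made for illustration, not what was used to pick `k=1900`.

```python
import numpy as np


def smallest_k_for_share(scores, share):
    """Return the smallest number of top-scoring features whose scores
    add up to at least `share` of the total score."""
    ordered = np.sort(np.asarray(scores, dtype=float))[::-1]
    cumulative = np.cumsum(ordered) / ordered.sum()
    # First position where the cumulative share reaches the threshold
    k = int(np.searchsorted(cumulative, share)) + 1
    # Guard against rounding errors when share is close to 1
    return min(k, len(ordered))


# Hypothetical scores, not taken from the dataset
toy_scores = [50.0, 30.0, 10.0, 5.0, 5.0]
print(smallest_k_for_share(toy_scores, 0.85))
```

In the notebook, `univariate_selection.scores_` would be passed in place of `toy_scores`, and the result could be used as the `k` argument of `SelectKBest`.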

## Hyperparameters tuning
As a final step I need to tune the regularisation parameter (actually, <i>C</i> is the inverse of the regularisation strength) and then use the best value in the final model.

In [None]:
splitter = model_selection.StratifiedShuffleSplit(n_splits=5, test_size=0.05)
classifier = linear_model.LogisticRegression(max_iter=100_000)  # max_iter must be an integer
tuning_grid = {"C": (0.01, 0.1)}

In [None]:
grid_searcher = model_selection.GridSearchCV(estimator=classifier, param_grid=tuning_grid,
                                             scoring='roc_auc', cv=splitter, return_train_score=True)
model = grid_searcher.fit(train, train_labels)

In [None]:
print("Best hyperparameter: %s" %model.best_params_)

In [None]:
print("Best score: %.4f" %model.best_score_)
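Since <i>C</i> is the inverse of the regularisation strength, smaller values shrink the coefficients more. The sketch below illustrates this on synthetic data generated with `make_classification`; the data and the values of <i>C</i> are assumptions chosen only for illustration and are unrelated to the grid searched above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, unrelated to the competition dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Fit the same model with increasing C and record the size of the coefficients
coefficient_norms = {}
for C in (0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coefficient_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coefficient_norms.items():
    print("C=%.2f -> coefficient norm %.3f" % (C, norm))
```

The coefficient norm grows with <i>C</i>: a small <i>C</i> means strong regularisation, which matters a lot with thousands of one-hot columns.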

## Final model
After finding the best missing-value technique, encoding and features, I also found that a simple logistic regression performs better than a more complex method like extreme gradient boosting. The final step is then to fit the model on the transformed training set, predict on the test set (which went through the same transformations) and submit the results.

In the future I might try a neural network to further improve the model.

In [None]:
model = linear_model.LogisticRegression(max_iter=1_000_000, C=0.1).fit(train, train_labels)
predictions = model.predict_proba(test)[:, 1]

In [None]:
submission = pd.DataFrame({'id': test_id, 'target': predictions})
submission.to_csv('submission.csv', index=False)