**LAB TASK: Categorical Feature Encoding**
<p>Binary classification in which every feature is categorical.</p>

**Problem Statement**
<p>A common task in machine learning pipelines is encoding categorical variables in a format that lets a given algorithm capture as much useful signal as possible.

We have to handle several types of categorical columns, using multiple techniques, to get the best results.</p>

Let's begin.

Types of categorical data given to us:

* binary features
* low- and high-cardinality nominal features
* low- and high-cardinality ordinal features
* (potentially) cyclical features

The dataset is available here: https://disk.yandex.ru/d/A6zRRWGUBxytQg


<h1>Importing Libraries</h1>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler,StandardScaler

from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve

from imblearn.over_sampling import SMOTE

from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV


from category_encoders import TargetEncoder, HashingEncoder, LeaveOneOutEncoder


import string

Let's load the datasets first.

In [None]:
train = pd.read_csv('')
test = pd.read_csv('')

Let's see what the train dataset looks like.

In [None]:
train.head()

In [None]:
train.shape

In [None]:
train.isna().sum().sort_values(ascending = False)

<h1> Analyzing Categorical variables </h1>

In [None]:
# Get list of categorical variables
s = (train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

We observe a few things:

1. There are no null values in the train dataset.
2. There are multiple categorical variables, as follows:
* bin_3, bin_4 :- binary columns
* nom_0 - nom_4 :- nominal columns (with no order)
* nom_5 - nom_9 :- nominal columns with high cardinality
* ord_1 - ord_5 :- ordinal columns

We have to treat these columns in different ways to convert them into numerical data.

<h1>Encoding techniques</h1>

Reference: https://www.kaggle.com/discdiver/category-encoders-examples

1. bin_3, bin_4 :- Convert Y/N and T/F to 1/0
2. nom_0 - nom_4 :- Encode using one-hot encoding
3. nom_5 - nom_9 :- Target encode them, as they are high-cardinality variables
4. ord_1, ord_2 :- Convert to a numerical order using hard-coded values, as a label encoder does not know the intended order
5. ord_3 - ord_4 :- Encode using ASCII order, as they are alphabetical values
6. ord_5 :- Separate the two letters and then do label encoding
7. day, month :- Encode using sine and cosine values, as they are cyclical in nature
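
To see why sine and cosine values suit cyclical features, here is a minimal sketch on toy month numbers. The helper name `cyclical_encode` is hypothetical and not part of the notebook. It shows that December and January end up close together after encoding, while December and June end up far apart.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map values from a cycle of length `period` onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Toy input: December, January, June
dec, jan, jun = cyclical_encode([12, 1, 6], period=12)

# Euclidean distance between the encoded points
dist_dec_jan = np.linalg.norm(dec - jan)
dist_dec_jun = np.linalg.norm(dec - jun)

print("Dec-Jan distance:", round(dist_dec_jan, 3))
print("Dec-Jun distance:", round(dist_dec_jun, 3))
```

With the raw integers, 12 and 1 would look as far apart as two months can be, which is the problem this encoding avoids.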

<h1>Encoding data</h1>

Let's save the target variable separately.

In [None]:
target = train['target']

Let's identify the category values that appear in only one of the train and test datasets, and replace them with a common placeholder value.

In [None]:
replace_xor = lambda x: 'xor' if x in xor_values else x

In [None]:
print(set(train['ord_4'].unique()))
print(set(test['ord_4'].unique()))

In [None]:
columns_to_test = ['ord_5', 'ord_4', 'ord_3']
for column in columns_to_test:
    xor_values = set(train[column].unique()) ^ set(test[column].unique())
    if xor_values:
        print('Column', column, 'has', len(xor_values), 'XOR values')
        train[column] = train[column].apply(replace_xor)
        test[column] = test[column].apply(replace_xor)
    else:
        print('Column', column, 'has no XOR values')
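
The `^` operator on sets is the symmetric difference: the values found in exactly one of the two sets. A minimal sketch on made-up category values (the toy lists and the function name `replace_uncommon` are assumptions for illustration):

```python
# Toy category values; the real columns come from the train and test DataFrames
train_values = ["a", "b", "c", "d"]
test_values = ["b", "c", "d", "e"]

# Values present in exactly one of the two datasets
uncommon = set(train_values) ^ set(test_values)

def replace_uncommon(value, uncommon_values, placeholder="xor"):
    """Collapse values not shared by both datasets into one placeholder."""
    return placeholder if value in uncommon_values else value

train_clean = [replace_uncommon(v, uncommon) for v in train_values]
test_clean = [replace_uncommon(v, uncommon) for v in test_values]

print(sorted(uncommon))
print(train_clean)
print(test_clean)
```

After the replacement both datasets share the same set of values, so an encoder fitted on train will not meet an unseen category in test.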

In [None]:
def date_cyc_enc(df, col, max_vals):
    df[col + '_sin'] = np.sin(2 * np.pi * df[col]/max_vals)
    df[col + '_cos'] = np.cos(2 * np.pi * df[col]/max_vals)
    return df

In [None]:
map_to_ascii_index = lambda x: string.ascii_letters.index(x)
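
As an illustration of the ASCII-order idea, the sketch below ranks single letters with `string.ascii_letters` and splits a two-letter value into two ranks. The helper names are hypothetical, and it assumes every value consists only of ASCII letters, so a placeholder such as 'xor' would need separate handling.

```python
import string

def letter_rank(ch):
    """Position of a letter in string.ascii_letters (lowercase first, then uppercase)."""
    return string.ascii_letters.index(ch)

def split_two_letters(value):
    """Turn a two-letter value into a pair of ranks (assumes exactly two ASCII letters)."""
    first, second = value
    return letter_rank(first), letter_rank(second)

rank_a = letter_rank("a")
rank_Z = letter_rank("Z")
pair = split_two_letters("Zx")

print(rank_a, rank_Z, pair)
```

Note that `string.ascii_letters` places all lowercase letters before uppercase ones. Whether that matches the intended order of a mixed-case column is an assumption worth checking against the data.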

Before encoding, let's create some additional features.

In [None]:
train['merge_col1'] =  train[['nom_0', 'nom_1']].apply(lambda x: ''.join(x), axis=1)
test['merge_col1'] =  test[['nom_0', 'nom_1']].apply(lambda x: ''.join(x), axis=1)

train['merge_col2'] =  train[['nom_1', 'nom_2']].apply(lambda x: ''.join(x), axis=1)
test['merge_col2'] =  test[['nom_1', 'nom_2']].apply(lambda x: ''.join(x), axis=1)

train['merge_col3'] =  train[['nom_2', 'nom_3']].apply(lambda x: ''.join(x), axis=1)
test['merge_col3'] =  test[['nom_2', 'nom_3']].apply(lambda x: ''.join(x), axis=1)

train['merge_col4'] =  train[['nom_3', 'nom_4']].apply(lambda x: ''.join(x), axis=1)
test['merge_col4'] =  test[['nom_3', 'nom_4']].apply(lambda x: ''.join(x), axis=1)

Encoding

In [None]:
# Binary encoding
train['bin_3'] = [0 if x == 'F' else 1 for x in train['bin_3']]
train['bin_4'] = [0 if x == 'N' else 1 for x in train['bin_4']]

#Hard coded Label encoding
train['ord_1'] = [0 if x == 'Novice' else 1 if x == 'Contributor' else 2 if x == 'Expert' else 3 if x == 'Master' else 4 for x in train['ord_1']]
train['ord_2'] = [0 if x == 'Freezing' else 1 if x == 'Cold' else 2 if x == 'Warm' else 3 if x == 'Hot' else 4 if x == 'Boiling Hot' else 5 for x in train['ord_2']]

# Label encoding via LabelEncoder class
label_encoder = LabelEncoder()
train['ord_3'] = label_encoder.fit_transform(train['ord_3'])
test['ord_3'] = label_encoder.transform(test['ord_3'])

train['ord_4'] = label_encoder.fit_transform(train['ord_4'])
test['ord_4'] = label_encoder.transform(test['ord_4'])

train['ord_5'] = label_encoder.fit_transform(train['ord_5'])
test['ord_5'] = label_encoder.transform(test['ord_5'])

train = date_cyc_enc(train, 'day', 7)
train = date_cyc_enc(train, 'month', 12)
train.drop(['day', 'month'], axis=1, inplace = True)

#Leave one out encoding high cardinal variables
high_cardinal_vars = ['nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']

loo_encoder = LeaveOneOutEncoder(cols=high_cardinal_vars)
train = loo_encoder.fit_transform(train.drop(['target'], axis = 1), train['target'])

# Same for test data
test['bin_3'] = [0 if x == 'F' else 1 for x in test['bin_3']]
test['bin_4'] = [0 if x == 'N' else 1 for x in test['bin_4']]
test['ord_1'] = [0 if x == 'Novice' else 1 if x == 'Contributor' else 2 if x == 'Expert' else 3 if x == 'Master' else 4 for x in test['ord_1']]
test['ord_2'] = [0 if x == 'Freezing' else 1 if x == 'Cold' else 2 if x == 'Warm' else 3 if x == 'Hot' else 4 if x == 'Boiling Hot' else 5 for x in test['ord_2']]

#For cyclic data we convert it into sin and cosine values
test = date_cyc_enc(test, 'day', 7)
test = date_cyc_enc(test, 'month', 12)
test.drop(['day', 'month'], axis=1, inplace = True)

test = loo_encoder.transform(test)
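
To make the leave-one-out idea concrete, here is a rough sketch in plain pandas on a toy frame. It is not the `category_encoders` implementation, only the core arithmetic: each row gets the mean target of the other rows in its category, so a row's own label does not leak into its feature.

```python
import pandas as pd

# Toy data: one nominal column and a binary target
df = pd.DataFrame({
    "nom": ["a", "a", "a", "b", "b"],
    "target": [1, 0, 1, 0, 1],
})

grouped = df.groupby("nom")["target"]
category_sum = grouped.transform("sum")
category_count = grouped.transform("count")

# Exclude the current row's target from its own category mean.
# Categories with a single row would divide by zero and need a fallback.
df["nom_loo"] = (category_sum - df["target"]) / (category_count - 1)

print(df)
```

For rows without a known target, such as test data, there is nothing to leave out, so a plain per-category mean learned from train is the usual substitute.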

In [None]:
# One Hot encoding other nominal columns
train_df = pd.get_dummies(train, drop_first=True)

# Same for test data
test_df = pd.get_dummies(test, drop_first=True)
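
One thing to watch when `pd.get_dummies` is called separately on train and test: if a category is missing from one of them, the two frames can end up with different columns, and `drop_first=True` may drop a different level in each. A possible safeguard, sketched on toy data (the frames and column name are made up), is to give both frames the same categorical dtype first.

```python
import pandas as pd

train_toy = pd.DataFrame({"color": ["red", "green", "blue"]})
test_toy = pd.DataFrame({"color": ["red", "green", "green"]})  # 'blue' never appears

# Use the categories seen in train for both frames
categories = sorted(train_toy["color"].unique())
for frame in (train_toy, test_toy):
    frame["color"] = pd.Categorical(frame["color"], categories=categories)

train_dummies = pd.get_dummies(train_toy, drop_first=True)
test_dummies = pd.get_dummies(test_toy, drop_first=True)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
```

Comparing `train_df.shape` and `test_df.shape` is a quick way to spot this kind of mismatch.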

In [None]:
print(train_df.shape)
print(test_df.shape)

In [None]:
cor = train_df.corr()

In [None]:
f, ax = plt.subplots(figsize=(25, 25))
sns.heatmap(cor, annot=False, ax=ax)

The heatmap shows that the correlation between features is quite low. So let's keep all of these features and move ahead with our classification.

In [None]:
X = train_df.drop(['id'], axis=1)
y = target

<h1>Analysing imbalance in dataset</h1>

Read more about SMOTE here: https://habr.com/ru/companies/otus/articles/769242/


In [None]:
x=y.value_counts()
plt.bar(x.index,x)
plt.gca().set_xticks([0,1])
plt.title('distribution of target variable')
plt.show()

Clearly the dataset is imbalanced. We need to address this imbalance using the SMOTE technique.

Now let's do the necessary train_test_split.



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, train_size=0.7, random_state=1)

Let's address the imbalance in the dataset.

In [None]:
# Recent imbalanced-learn releases removed the `kind` argument
# and renamed fit_sample to fit_resample
sm = SMOTE()
X_tr, y_tr = sm.fit_resample(X_train, y_train)
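
The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its minority-class neighbours. The sketch below is a deliberately simplified version in NumPy. The function name is hypothetical, and it uses only the single nearest neighbour, whereas the real algorithm picks at random among several neighbours.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, rng):
    """Simplified SMOTE-style interpolation between minority samples (illustration only)."""
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)      # a point is not its own neighbour
    nearest = dists.argmin(axis=1)       # index of the nearest neighbour per point
    base = rng.integers(0, len(X_min), size=n_new)
    lam = rng.random((n_new, 1))         # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[nearest[base]] - X_min[base])

rng = np.random.default_rng(0)
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_like_oversample(X_minority, n_new=5, rng=rng)

print(X_new)
```

Every synthetic point lies on a line segment between two existing minority points. This is why oversampling is applied to the training split only, so no synthetic points end up in the evaluation data.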

Let's create functions that plot the ROC curve and calculate the ROC-AUC score for any model passed in as a parameter.



In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
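# Function for comparing different approaches
def score_dataset(X_train, X_test, y_train, y_test, model):
    model.fit(X_train, y_train)
    # ROC AUC is computed from scores, so use the positive-class
    # probability rather than hard 0/1 predictions
    probs = model.predict_proba(X_test)[:, 1]
    draw_roc(y_test, probs)
    return roc_auc_score(y_test, probs)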
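
ROC AUC measures how well the scores rank positives above negatives, so hard 0/1 predictions and predicted probabilities can give very different values. A small sketch on made-up numbers, where the model ranks every example correctly but no score reaches the 0.5 threshold:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
# Every positive scores higher than every negative, but all scores are below 0.5
scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45])
labels = (scores >= 0.5).astype(int)  # hard predictions: all zeros

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)

print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
```

The ranking is perfect, yet the thresholded labels carry no ranking information at all, so the label-based score falls to chance level.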

<h1>Scaling data</h1>

In [None]:
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_test = scaler.transform(X_test)
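
The pattern here is to fit the scaler on the training data and only transform anything held out, so that statistics from the evaluation data never influence the model. A toy sketch of what that means numerically (the arrays are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_fit = np.array([[1.0], [2.0], [3.0]])   # data the scaler learns from
X_new = np.array([[2.0], [4.0]])          # held-out data, transformed only

scaler = StandardScaler().fit(X_fit)
X_fit_scaled = scaler.transform(X_fit)
X_new_scaled = scaler.transform(X_new)

print("learned mean:", scaler.mean_, "learned std:", scaler.scale_)
print(X_new_scaled.ravel())
```

The held-out values are expressed in units of the training mean and standard deviation, so they need not have zero mean or unit variance themselves.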

<h1>Testing with vanilla versions of the models</h1>

Read more about ML metrics here: https://habr.com/ru/companies/ods/articles/328372/

Let's plot the ROC curve and calculate the score for all models.

In [None]:
#Define a simple logistic regression model (lr) with random_state 0, and solver 'lbfgs'

#Print AUC score with Logistic Regression

In [None]:
#Define a simple decision tree classifier (dt) with random_state 0.

#Print AUC score with Decision Tree

<h1>Model Refinement</h1>

In [None]:
# specify number of folds for k-fold CV
n_folds = 5

# parameters to build the model on
param_grid = {
    'max_depth': range(1, 5),
    'min_samples_leaf': range(25, 175, 50),
    'min_samples_split': range(50, 150, 50)
}

Decision Tree

In [None]:
# Uncomment to run the hyperparameter tuning. Note that it takes a fair amount of time.
'''
# instantiate the model
dt = DecisionTreeClassifier()

# fit tree on training data
grid_search_dt = GridSearchCV(estimator = dt, param_grid = param_grid,
                          cv = n_folds, verbose = 1, n_jobs = -1, scoring="roc_auc")
grid_search_dt.fit(X_tr, y_tr)
'''

In [None]:
# uncomment to see the results
'''
cv_results_dt = pd.DataFrame(grid_search_dt.cv_results_)
# printing the optimal ROC AUC score and hyperparameters
print("Decision Tree grid search ROC AUC : ", grid_search_dt.best_score_)
print(grid_search_dt.best_estimator_)
'''

Logistic Regression

In [None]:
logit_param_grid = {
    'C': [0.100, 0.150, 0.120, 0.125, 0.130, 0.135, 0.140, 0.145, 0.150]
}

logit_grid = GridSearchCV(estimator = lr, param_grid = logit_param_grid,
                          scoring='roc_auc', cv=5, n_jobs=-1, verbose=0)
logit_grid.fit(X_tr, y_tr)

best_C = logit_grid.best_params_['C']
# best_C = C = 0.125

print('Best C:', best_C)

Model fitting with tuned hyper parameters

In [None]:
# min_impurity_split and presort were removed from recent scikit-learn
# releases, so they are no longer passed here
dt = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0,
                       min_samples_leaf=25, min_samples_split=50,
                       min_weight_fraction_leaf=0.0,
                       random_state=None, splitter='best')
print('AUC score with Decision Tree :- ', score_dataset(X_tr, X_test, y_tr, y_test, dt))

In [None]:
lr = LogisticRegression(solver='lbfgs', random_state = 0, C=best_C)
print('AUC score with Logistic Regression :- ', score_dataset(X_tr, X_test, y_tr, y_test, lr))

Let's predict for the test data.

In [None]:
test_df.drop(['id'], axis=1, inplace = True)

Let's scale the test data first.

In [None]:
#Scale (transform only) test_df

Let's fit on the entire training set before making predictions. Before that, we also have to scale the entire training data.

In [None]:
#Do the same to X

In [None]:
lr = lr.fit(X,y)
dt = dt.fit(X,y)

In [None]:
y_test_final_lr = lr.predict(test_df)
y_test_final_dt = dt.predict(test_df)

In [None]:
y_test_prob_lr = lr.predict_proba(test_df)[:, 1]
y_test_prob_dt = dt.predict_proba(test_df)[:, 1]

In [None]:
submission_lr = pd.DataFrame({
        "id": test["id"],
        "target": y_test_prob_lr
    })

In [None]:
#Do the same for submission_dt