# **Hey there.**
### In this notebook I experimented with a couple of encoding techniques that one can use when one-hot encoding (OHE) increases dimensionality too much.

## Data:
### The usual tabular playground series, March 2021. These TPS datasets are really good for experimentation.
### 300000 samples, 31 predictor features, and a binary target.

## What I did:
### First, I tried OHE.
I ended up with more than 600 features.
The accuracy wasn't bad at all with LGBM, and it didn't overfit.
I tried feature selection, but that considerably increased bias.
PCA didn't help either; it only increased variance.
### Then, I implemented frequency encoding.
It's when you replace each category in each categorical feature with its frequency.
Category Encoders (link at the end) is a library that implements count encoding, which is almost the same.
But I just coded it by hand.
### Finally, Target Encoding with Gini Index.
Target encoding basically replaces each category with some statistic that describes the target variable within that category.
For continuous targets, it's usually the target mean for that category.
For categorical targets, the Category Encoders library uses, if I'm not mistaken, the posterior probability: P(target | category).
I've got a categorical (binary) target here, but instead of the posterior probability, I used the Gini index.
I just felt like trying it out, and it did well.

## In the name of God

## Importing and Exploring the Data. 

In [None]:
import numpy as np
import pandas as pd

In [None]:
data=pd.read_csv("../input/tabular-playground-series-mar-2021/train.csv")

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
for col in data.columns[1:20]:
    print("unique values in {}:\n".format(col),data[col].unique())

OHE will blow this up to 600+ dimensions.
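
You can estimate the width before actually encoding anything. Here's a minimal sketch on a toy frame (the column names and values are made up, not the real data): with `drop_first=True`, a column with k levels turns into k - 1 dummy columns.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({
    "cat0": ["A", "B", "A", "C"],
    "cat1": ["x", "y", "x", "x"],
    "cont0": [0.1, 0.5, 0.3, 0.9],
})
cat_cols = ["cat0", "cat1"]

# Number of distinct levels in each categorical column.
levels = toy[cat_cols].nunique()

# With drop_first=True, a column with k levels yields k - 1 dummy columns.
n_dummies = int((levels - 1).sum())
n_other = toy.shape[1] - len(cat_cols)
expected_width = n_dummies + n_other

# Compare against what get_dummies actually produces.
actual_width = pd.get_dummies(toy, columns=cat_cols, drop_first=True).shape[1]
print(expected_width, actual_width)
```

On the real data you would pass the categorical column names instead of `cat_cols`.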

## One-Hot Encoding

In [None]:
ohd=pd.get_dummies(data, drop_first=True)
ohd.shape

In [None]:
ohd.drop("id",axis=1,inplace=True)
y=ohd["target"]
x=ohd.drop("target", axis=1, inplace=False)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.15)

In [None]:
from sklearn.preprocessing import StandardScaler
ss=StandardScaler()

In [None]:
x_tr = ss.fit_transform(x_train)

I'll be using LGBM for this one as it's fast.

In [None]:
from lightgbm import LGBMClassifier
lgbm=LGBMClassifier()

In [None]:
lgbm.fit(x_tr,y_train)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
x_ts = ss.transform(x_test)
y_pred_tr=lgbm.predict(x_tr)
y_pred_ts=lgbm.predict(x_ts)
train_acc=accuracy_score(y_train,y_pred_tr)
test_acc=accuracy_score(y_test,y_pred_ts)
print("LGBM Results with OHE:")
print("training accuracy = {}".format(train_acc))
print("testing accuracy = {}".format(test_acc))

Surprisingly good results.  
It also didn't take long to train, but LGBM is fast.  
It would probably take much longer with other models.  

### Would feature selection by importance improve anything?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.style as style
style.use('seaborn-darkgrid')

In [None]:
feat_imp = lgbm.feature_importances_

In [None]:
plt.figure(figsize=(25,7.5))
# Recent seaborn versions require x and y to be passed as keywords.
sns.barplot(x=x.columns, y=feat_imp, palette="cool_r")
plt.title("Feature Importances",fontsize=40)
plt.xlabel("Features",fontsize=30)
plt.ylabel("Importance",fontsize=30)

So I'm going to set a threshold here, and every feature whose importance falls below it will be thrown away.  
I tried the thresholds 100, 40, 20, 10, 5 and 1:

In [None]:
for t in [100,40,20,10,5,1]:
    # Collect every feature whose importance is below the threshold.
    droplist=[]
    for j in range(x.shape[1]):
        if feat_imp[j]<t:
            droplist.append(x.columns[j])
    x_sel=x.drop(droplist, axis=1, inplace=False)
    print("Results for threshold = {}".format(t))
    print("Shape of Dataframe: {}".format(x_sel.shape))
    x_train_sel, x_test_sel, y_train_sel, y_test_sel = train_test_split(x_sel,y,test_size=0.15)
    x_tr_sel = ss.fit_transform(x_train_sel)
    lgbm.fit(x_tr_sel,y_train_sel)
    x_ts_sel = ss.transform(x_test_sel)
    y_pred_tr=lgbm.predict(x_tr_sel)
    y_pred_ts=lgbm.predict(x_ts_sel)
    # Score against the labels of this split (y_train_sel / y_test_sel).
    # y_train / y_test come from an earlier, different random split.
    train_acc=accuracy_score(y_train_sel,y_pred_tr)
    test_acc=accuracy_score(y_test_sel,y_pred_ts)
    print("LGBM training accuracy = {}".format(train_acc))
    print("LGBM testing accuracy = {}\n".format(test_acc))

BIAS.
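
The threshold filter can also be written as a boolean mask instead of building a drop list. A minimal sketch, where `x_toy` and `feat_imp_toy` are made-up stand-ins for `x` and `lgbm.feature_importances_`, and `select_by_importance` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `x` and `lgbm.feature_importances_`.
x_toy = pd.DataFrame(np.zeros((3, 4)), columns=["f0", "f1", "f2", "f3"])
feat_imp_toy = np.array([120, 3, 45, 0])

def select_by_importance(frame, importances, threshold):
    """Keep only the columns whose importance is at least `threshold`."""
    keep = np.asarray(importances) >= threshold
    return frame.loc[:, keep]

for t in [100, 40, 1]:
    print(t, list(select_by_importance(x_toy, feat_imp_toy, t).columns))
```

It keeps exactly the columns that the loop version does not drop (importance >= threshold).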

### That didn't work. How about PCA?

In [None]:
from sklearn.decomposition import PCA
pca=PCA()

In [None]:
x_train_pca=pca.fit_transform(x_train)
plt.figure(figsize=(25,20))
pca_features= list(range(0,pca.n_components_))
# Recent seaborn versions require x and y to be passed as keywords.
sns.barplot(x=pca_features, y=pca.explained_variance_,palette="winter")
plt.title("Variation along PCA Components", fontsize=40)
plt.xlabel("Components", fontsize=30)
plt.ylabel("Variation", fontsize=30)

Beyond about 200 components, the remaining components explain almost no variance; pretty much all of it is noise.
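
Instead of eyeballing the bar plot, one could pick the number of components from the cumulative share of variance. This is only a sketch: the variances below are invented numbers standing in for `pca.explained_variance_`, and `components_for` is a hypothetical helper.

```python
import numpy as np

# Invented per-component variances, standing in for pca.explained_variance_.
explained_variance = np.array([4.0, 3.0, 2.0, 0.5, 0.3, 0.2])

# Share of total variance per component, then the running total.
ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(ratio)

def components_for(cumulative_ratio, target):
    """Smallest number of components whose cumulative ratio reaches `target`."""
    return int(np.searchsorted(cumulative_ratio, target) + 1)

print(components_for(cumulative, 0.85))
print(components_for(cumulative, 0.97))
```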

In [None]:
for i in [15,35,100,200]:
    pca=PCA(n_components=i)
    x_tr = ss.fit_transform(x_train)
    x_tr_pca=pca.fit_transform(x_tr)
    lgbm.fit(x_tr_pca,y_train)
    x_ts = ss.transform(x_test)
    # Project the test set with the PCA fitted on the training set; do not refit on test data.
    x_ts_pca=pca.transform(x_ts)
    y_pred_tr=lgbm.predict(x_tr_pca)
    y_pred_ts=lgbm.predict(x_ts_pca)
    train_acc=accuracy_score(y_train,y_pred_tr)
    test_acc=accuracy_score(y_test,y_pred_ts)
    print("PCA Components: {}".format(i))
    print("LGBM training accuracy = {}".format(train_acc))
    print("LGBM testing accuracy = {}".format(test_acc))

High variance here.
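
One thing to keep in mind with PCA: the projection has to be learned on the training rows and then reused unchanged on the test rows, otherwise the two sets end up in different coordinate systems. Here's a NumPy-only sketch of that idea on synthetic data (`fit_pca` and `apply_pca` are hypothetical helpers, not sklearn functions); with sklearn it corresponds to `fit_transform` on train and `transform` on test.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 5))   # synthetic stand-in for x_tr
test = rng.normal(size=(50, 5))     # synthetic stand-in for x_ts

def fit_pca(data, n_components):
    """Return the training mean and the top principal axes (as rows)."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(data, mean, components):
    """Project data using a mean and axes that were learned elsewhere."""
    return (data - mean) @ components.T

mean, components = fit_pca(train, n_components=2)
train_proj = apply_pca(train, mean, components)
test_proj = apply_pca(test, mean, components)   # no refit on the test rows
print(train_proj.shape, test_proj.shape)
```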

### Bottom line:
To be honest, I expected PCA and feature importance to improve results.  
OHE certainly did increase dimensionality, but LGBM still achieved relatively low bias and almost zero variance.  
LGBM also didn't take too long to train.  
But still, people often look for alternatives to OHE, and a smaller dataset takes less training time.  
So brace yourself, dear viewer, for frequency encoding!

## Frequency Encoding!
You just replace categories with their frequencies. Simple as that.  
It's the same as count encoding except that you need to divide counts by the total number of samples to get frequencies.  
If you plan to standardize your data then counts and frequencies would give the same results.  
I didn't use the Category Encoders library; I felt like doing it myself.
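
As a quick check of the claim about standardization, here's a small sketch on a toy column (made-up values): counts and frequencies differ only by a constant factor, so after standardizing they come out identical.

```python
import numpy as np
import pandas as pd

col = pd.Series(["A", "B", "A", "C", "A", "B"])  # toy categorical column

# Count encoding, then frequency encoding derived from it.
counts = col.map(col.value_counts())
freqs = counts / len(col)

def standardize(s):
    """Shift to zero mean and scale to unit standard deviation."""
    return (s - s.mean()) / s.std()

same = np.allclose(standardize(counts), standardize(freqs))
print(counts.tolist())
print(same)
```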

In [None]:
frqdata=pd.DataFrame()
for col in data.columns[20:]:
    frqdata[col]=data[col]
for col in data.columns[1:20]:
    # Map each category to its count, then divide by the number of rows to get a frequency.
    d=data[col].value_counts().to_dict()
    frqdata[col]=data[col].map(d)/len(data)
frqdata.head()

That's all there is to it.  
Now we train a model.  
I'll be using the SGD Classifier from sklearn.  
It's a fast linear model that uses stochastic gradient descent.  
It seemed interesting when I read about it so I felt like trying it out.  
I also expect standardization to make a difference, and LGBM isn't affected by it whereas SGDC is.  
PS: As it wouldn't make sense to compare two encoding techniques with different models, I'll also try LGBM.

In [None]:
from sklearn.linear_model import SGDClassifier
sgdc=SGDClassifier(max_iter=1000, tol=1e-3)

In [None]:
x_frq = frqdata.drop("target",axis=1,inplace=False)

In [None]:
x_train_frq, x_test_frq, y_train_frq, y_test_frq = train_test_split(x_frq,y,test_size=0.15)

In [None]:
sgdc.fit(x_train_frq,y_train_frq)
y_pred_tr=sgdc.predict(x_train_frq)
y_pred_ts=sgdc.predict(x_test_frq)
train_acc=accuracy_score(y_train_frq,y_pred_tr)
test_acc=accuracy_score(y_test_frq,y_pred_ts)
print("SGDClassifier results with Frequency/Count Encoding:")
print("Without Feature Scaling:")
print("training accuracy = {}".format(train_acc))
print("testing accuracy = {}".format(test_acc))
x_train_frqs = ss.fit_transform(x_train_frq)
sgdc.fit(x_train_frqs,y_train_frq)
y_pred_tr=sgdc.predict(x_train_frqs)
# Scale the test set with the statistics learned from the training set; do not refit.
x_test_frqs = ss.transform(x_test_frq)
y_pred_ts=sgdc.predict(x_test_frqs)
train_acc=accuracy_score(y_train_frq,y_pred_tr)
test_acc=accuracy_score(y_test_frq,y_pred_ts)
print("With Feature Scaling:")
print("training accuracy = {}".format(train_acc))
print("testing accuracy = {}".format(test_acc))

Results are good, but I certainly didn't expect scaling to increase bias.  
Anyway, let's try LGBM.

In [None]:
lgbm.fit(x_train_frq,y_train_frq)
y_pred_tr=lgbm.predict(x_train_frq)
y_pred_ts=lgbm.predict(x_test_frq)
train_acc=accuracy_score(y_train_frq,y_pred_tr)
test_acc=accuracy_score(y_test_frq,y_pred_ts)
print("LGBM results with Frequency/Count Encoding:")
print("training accuracy = {}".format(train_acc))
print("testing accuracy = {}".format(test_acc))

Pretty much the same results as with OHE, but less training time.

## Target Encoding with Gini Index!
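You just replace each category with its Gini index: 1 - p^2 - (1 - p)^2, where p is the share of positive targets within that category. The Gini index basically tells you how mixed the target is inside the category: 0 means the category fully determines the target, and 0.5 (the maximum for a binary target) means the two classes are evenly split.  
One could use entropy instead of Gini, but Gini, from what I've read, is faster to compute.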
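
To make the numbers concrete, here's a tiny worked example in plain Python (`gini_index` is just an illustrative helper, and the counts are made up). Note that the value is symmetric: a category that is 30% positive gets the same Gini index as one that is 70% positive, so the encoding captures how pure a category is, not which way it leans.

```python
def gini_index(yes_count, no_count):
    """Gini index of a binary target from the two class counts."""
    total = yes_count + no_count
    p_yes = yes_count / total
    p_no = no_count / total
    return 1 - p_yes**2 - p_no**2

print(round(gini_index(30, 70), 4))   # mostly negative category
print(round(gini_index(70, 30), 4))   # mostly positive category, same value
print(gini_index(50, 50))             # evenly mixed: the maximum for two classes
print(gini_index(100, 0))             # pure category
```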

So I just wrote a function (next cell) that implements this, with two options:  

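**Weighted**: If true, each category's Gini index is further multiplied by that category's frequency.  
This could perhaps give better results as rare categories don't hold much information anyways.  

**Standardize**: `"mean"` subtracts the mean and `"median"` subtracts the median; both then divide by the standard deviation. Any other value (such as the default `'None'`) leaves the encoded values unscaled.  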

### Feel free to copy and paste the following encoder in your work if you wish to use it, but please cite this notebook if you do.

##### PS: It only works for a binary target.  
##### You would need to make some adjustments for multi-class targets.

In [None]:
def gini_encoder(column,target,weighted=False,standardize='None'):
    """Replace each category in `column` with the Gini index of the binary `target` within it."""
    unique=column.unique()
    total=column.shape[0]
    gini_ind=dict()
    
    for i in unique:
        # Target values of the rows that belong to category i.
        target_i = target[column==i]
        i_total = target_i.shape[0]
        yes_count = (target_i==1).sum()
        no_count = (target_i==0).sum()
        gini = 1 - (yes_count/i_total)**2 - (no_count/i_total)**2
        if weighted:
            # Scale the Gini index by the category's frequency.
            gini *= i_total/total
        gini_ind[i] = gini
    
    encoded = column.map(gini_ind)
    
    # Any other value of `standardize` (e.g. the default 'None') leaves the values unscaled.
    if standardize=="mean":
        encoded = (encoded - encoded.mean())/encoded.std()
    elif standardize=="median":
        encoded = (encoded - encoded.median())/encoded.std()
        
    return encoded
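
For what it's worth, the per-category loop can probably be replaced with a `groupby`. The following is a sketch of that idea, not the encoder used for the results in this notebook; `gini_encoder_groupby` is a hypothetical name, it assumes a 0/1 target, and it leaves out the standardization option.

```python
import pandas as pd

def gini_encoder_groupby(column, target, weighted=False):
    """Vectorized sketch: Gini index of a binary 0/1 target per category."""
    # Share of positives within each category.
    p_yes = target.groupby(column).mean()
    gini = 1 - p_yes**2 - (1 - p_yes)**2
    if weighted:
        # Multiply by each category's relative frequency (aligned on category label).
        gini = gini * column.value_counts(normalize=True)
    return column.map(gini)

# Toy data (made up): "A" is 50% positive, "B" is 75% positive.
col = pd.Series(["A", "A", "B", "B", "B", "B"])
tgt = pd.Series([1, 0, 1, 1, 1, 0])
print(gini_encoder_groupby(col, tgt).round(3).tolist())
print(gini_encoder_groupby(col, tgt, weighted=True).round(3).tolist())
```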

We'll start with LGBM:

In [None]:
gini_results_lgbm=pd.DataFrame(columns=["No Standardization","Median Standardization","Mean Standardization"],
                          index=["Training (Weighted)","Testing (Weighted)","Training (Unweighted)","Testing (Unweighted)"])

weighted=[True,False]
standardize=["None","median","mean"]

for w in weighted:
    for s in standardize:
        gini_data=pd.DataFrame()
        for col in data.columns[20:]:
            gini_data[col]=data[col]
        for col in data.columns[1:20]:
            gini_data[col]=gini_encoder(column=data[col],target=data["target"], weighted=w, standardize=s)
        x_gini=gini_data.drop("target",axis=1)
        x_train_gini, x_test_gini, y_train_gini, y_test_gini = train_test_split(x_gini,y,test_size=0.15)
        lgbm.fit(x_train_gini,y_train_gini)
        y_pred_tr=lgbm.predict(x_train_gini)
        y_pred_ts=lgbm.predict(x_test_gini)
        train_acc=accuracy_score(y_train_gini,y_pred_tr)
        test_acc=accuracy_score(y_test_gini,y_pred_ts)
        wi=weighted.index(w)
        si=standardize.index(s)
        
        gini_results_lgbm.iloc[wi*2,si] = train_acc
        gini_results_lgbm.iloc[1+wi*2,si] = test_acc

In [None]:
print("LGBM Results with Gini Target Encoding:")
gini_results_lgbm

Surprisingly, weighting made no considerable difference.  
Same with Standardization, but LGBM is already insensitive to scale.

In [None]:
gini_results_sgdc=pd.DataFrame(columns=["No Standardization","Median Standardization","Mean Standardization"],
                          index=["Training (Weighted)","Testing (Weighted)","Training (Unweighted)","Testing (Unweighted)"])

weighted=[True,False]
standardize=["None","median","mean"]

for w in weighted:
    for s in standardize:
        gini_data=pd.DataFrame()
        for col in data.columns[20:]:
            gini_data[col]=data[col]
        for col in data.columns[1:20]:
            gini_data[col]=gini_encoder(column=data[col],target=data["target"], weighted=w, standardize=s)
        x_gini=gini_data.drop("target",axis=1)
        x_train_gini, x_test_gini, y_train_gini, y_test_gini = train_test_split(x_gini,data["target"],test_size=0.15)
        sgdc.fit(x_train_gini,y_train_gini)
        y_pred_tr=sgdc.predict(x_train_gini)
        y_pred_ts=sgdc.predict(x_test_gini)
        train_acc=accuracy_score(y_train_gini,y_pred_tr)
        test_acc=accuracy_score(y_test_gini,y_pred_ts)
        wi=weighted.index(w)
        si=standardize.index(s)
        
        gini_results_sgdc.iloc[wi*2,si] = train_acc
        gini_results_sgdc.iloc[1+wi*2,si] = test_acc

In [None]:
print("SGDClassifier Results with Gini Target Encoding:")
gini_results_sgdc

Weighting only increased bias.  
Median Standardization slightly decreased variance.  
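
One caveat worth keeping in mind: in the cells above the encoding is computed on the full dataset before splitting, so the encoded features have already seen the targets of the test rows. A possible way around that is to learn the mapping on the training rows only and then apply it to the test rows. The sketch below is an assumption on my part, not what produced the results above; `fit_gini_map` and `apply_gini_map` are hypothetical helpers, and using the overall training Gini for unseen categories is just one possible fallback.

```python
import pandas as pd

def fit_gini_map(train_column, train_target):
    """Learn category -> Gini index from training rows only (binary 0/1 target)."""
    p_yes = train_target.groupby(train_column).mean()
    mapping = 1 - p_yes**2 - (1 - p_yes)**2
    # Fallback for categories never seen in training: Gini of the whole training target.
    p_all = train_target.mean()
    fallback = 1 - p_all**2 - (1 - p_all)**2
    return mapping, fallback

def apply_gini_map(column, mapping, fallback):
    """Encode a column with a previously learned mapping."""
    return column.map(mapping).fillna(fallback)

# Toy data (made up).
train_col = pd.Series(["A", "A", "B", "B"])
train_tgt = pd.Series([1, 0, 1, 1])
test_col = pd.Series(["B", "A", "C"])   # "C" does not occur in training

mapping, fallback = fit_gini_map(train_col, train_tgt)
encoded_test = apply_gini_map(test_col, mapping, fallback)
print(encoded_test.tolist())
```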

### Well, there you go, two alternatives to OHE.  
### Use them wisely.
### There are several other techniques that you can implement with the following library: https://contrib.scikit-learn.org/category_encoders/

### Glory be to You, O God, and praise be to You. I bear witness that there is no god but You. I seek Your forgiveness and turn to You in repentance.