<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports-and-setup" data-toc-modified-id="Imports-and-setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports and setup</a></span></li><li><span><a href="#Option-1---Bin-Counting-and-One-hot-encoding" data-toc-modified-id="Option-1---Bin-Counting-and-One-hot-encoding-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Option 1 - Bin Counting and One-hot encoding</a></span><ul class="toc-item"><li><span><a href="#Testing-it-out-(1)" data-toc-modified-id="Testing-it-out-(1)-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Testing it out (1)</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Bin-counting" data-toc-modified-id="Bin-counting-2.1.0.1"><span class="toc-item-num">2.1.0.1&nbsp;&nbsp;</span>Bin counting</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Option-2---Feature-hashing-and-One-hot-encoding" data-toc-modified-id="Option-2---Feature-hashing-and-One-hot-encoding-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Option 2 - Feature hashing and One-hot encoding</a></span><ul class="toc-item"><li><span><a href="#Testing-it-out-(2)" data-toc-modified-id="Testing-it-out-(2)-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Testing it out (2)</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Feature-hashing" data-toc-modified-id="Feature-hashing-3.1.0.1"><span class="toc-item-num">3.1.0.1&nbsp;&nbsp;</span>Feature hashing</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Task 2 <a class="tocSkip">

### Imports and setup

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import FeatureHasher

In [2]:
# Generating synthetic regression dataset, all numeric columns
df, y = make_regression(n_samples=1000, n_features=4, n_informative=4, random_state = 0)
df = pd.DataFrame(df, columns=['col1', 'col2', 'col3', 'col4'])
y = pd.DataFrame(y, columns=['target'])
df = pd.concat([y, df], axis=1)

# Generating synthetic categorical columns, one with high cardinality and another with low.
# np.random is not seeded here, so the exact cardinality of 'cat1' changes from run to run.
high_card_col = pd.DataFrame(np.random.choice(range(4000,4500), 1000), columns = ['cat1'])
low_card_col = pd.DataFrame(np.random.choice(range(1,10), 1000), columns = ['cat2'])

df = pd.concat([df, high_card_col, low_card_col], axis=1)

# Changing dtype away from numeric - str or object will do
df['cat1'] = df['cat1'].astype('str')
df['cat2'] = df['cat2'].astype('str')

Cardinality of 'cat1'

In [3]:
len(df['cat1'].unique())

436

Cardinality of 'cat2'

In [4]:
len(df['cat2'].unique())

9


### Option 1 - Bin Counting and One-hot encoding

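Bin counting replaces each level of a categorical column, whatever its cardinality, with a statistic computed from counts, so the column collapses into a single numeric column. Here that statistic is the log-odds of the level's relative frequency, log(p) - log(1 - p), where p is the fraction of rows that carry the level.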

Here, we employ bin counting when the cardinality of a categorical column is above 10, and one-hot encoding otherwise.
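As a rough illustration of the log-odds computation, the sketch below encodes a toy column using only the standard library. The helper name `log_odds_encode` and the sample data are hypothetical and not part of this notebook; the sketch assumes the column has at least two distinct levels, since a level that covers every row has undefined log-odds.

```python
import math
from collections import Counter


def log_odds_encode(values):
    """Replace each value with the log-odds of its relative frequency."""
    n = len(values)
    counts = Counter(values)
    # log(p / (1 - p)) with p = count / n simplifies to log(count) - log(n - count).
    # Assumes no level covers all rows, otherwise n - count would be 0.
    table = {level: math.log(c) - math.log(n - c) for level, c in counts.items()}
    return [table[v] for v in values]


# Hypothetical toy column: 'a' occurs 3 times, 'b' twice, 'c' once
toy = ["a", "b", "a", "c", "a", "b"]
encoded = log_odds_encode(toy)
print(encoded)
```

Levels that occur equally often receive the same code, so the encoding keeps frequency information but cannot tell such levels apart.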

In [5]:
def column_encoder(df, i):
    
    """
    Encodes a categorical column of the input dataframe.
    
    Columns with more than 10 levels are bin counted: each level is replaced
    by the log-odds of its relative frequency. Other categorical columns are
    one-hot encoded.
    
    Args ->
        df (pd.DataFrame): The input dataframe
        i (int): The index of the column in the df to be transformed
    
    Returns ->
        encoded (pd.Series or pd.DataFrame): A pd.Series of log-odds when bin
        counting is used, or a pd.DataFrame of indicator columns when one-hot
        encoding is used. None if the column is not categorical.
    
    Raises ->
        AssertionError: If df is empty or i is not an int
    """
    
    assert df.shape[0] > 0
    assert isinstance(i, int)
    
    col = df.iloc[:,i]
    
    # Checking if column is categorical; other columns are skipped
    if col.dtype not in ['str', 'object', 'O']:
        return None
    
    # Number of rows for each level
    counts = col.value_counts()
    
    cardinality = len(col.unique())
    
    # Bin counting
    if cardinality > 10:
    
        # Fraction of rows with / without each level
        prop = counts/df.shape[0]
        not_prop = (df.shape[0] - counts)/df.shape[0]
        # Log-odds of each level: log(p) - log(1 - p)
        log_odds = np.log(prop) - np.log(not_prop)
        encoded = col.map(log_odds.to_dict())
        
        return encoded
    
    # One-hot encoding
    else:
        # The encoder returns a sparse matrix by default. The `sparse` keyword
        # was renamed `sparse_output` in scikit-learn 1.2, so it is left unset.
        ohe = OneHotEncoder(handle_unknown='ignore')
        encoded = ohe.fit_transform(np.array(col).reshape(-1,1))
        encoded = pd.DataFrame(encoded.toarray())
        return encoded

#### Testing it out (1)

###### Bin counting

In [6]:
# 'cat1' is in index 5 of df

encoded_col = column_encoder(df, 5) 

encoded_col

0     -5.806138
1     -5.806138
2     -5.293305
3     -5.517453
4     -6.212606
         ...   
995   -5.806138
996   -6.212606
997   -5.806138
998   -5.517453
999   -5.806138
Name: cat1, Length: 1000, dtype: float64

In [7]:
set(encoded_col)

{-6.906754778648553,
 -6.212606095751518,
 -5.806138481293729,
 -5.517452896464707,
 -5.293304824724492,
 -5.109977737428519,
 -4.9548205149898585}

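The high-cardinality column 'cat1' (436 levels in the output of `In [3]` above) has been reduced to a single numeric column with just 7 unique values, one for each distinct level count.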
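Because the log-odds depends only on how often a level occurs, each encoded value can be mapped back to a count. The sketch below assumes the 1000-row dataframe used here and copies the values printed by `set(encoded_col)`; it inverts the transformation with the logistic function. `recover_count` is a hypothetical helper name, not part of this notebook.

```python
import math

N_ROWS = 1000  # number of rows in df

# Values copied from the set(encoded_col) output above
encoded_values = [
    -6.906754778648553,
    -6.212606095751518,
    -5.806138481293729,
    -5.517452896464707,
    -5.293304824724492,
    -5.109977737428519,
    -4.9548205149898585,
]


def recover_count(log_odds, n_rows=N_ROWS):
    """Invert log(p / (1 - p)) to get p, then scale by the row count."""
    p = 1.0 / (1.0 + math.exp(-log_odds))
    return round(p * n_rows)


counts = [recover_count(v) for v in encoded_values]
print(counts)
```

The recovered counts run from 1 to 7, so all levels that occur the same number of times share one encoded value.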

###### One-hot encoding

In [8]:
# 'cat2' is in index 6 of df


encoded_col = column_encoder(df, 6)

encoded_col

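| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 996 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 997 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 998 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |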


Because 'cat2' has only 9 levels, it is one-hot encoded into 9 indicator columns, one per level.


### Option 2 - Feature hashing and One-hot encoding

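Feature hashing maps a categorical column into a fixed n-dimensional space, whatever its cardinality, by applying a hash function to each level and using the result as a column index. When n is smaller than the cardinality, distinct levels can collide in the same column.

Here, we employ feature hashing (with n = 10) when the cardinality of a categorical column is above 10, and one-hot encoding otherwise.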
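The idea can be sketched with the standard library alone. This is an illustration of the hashing trick, not scikit-learn's implementation: it uses MD5 from `hashlib` rather than the MurmurHash3 function that `FeatureHasher` relies on, and the helper name `hash_encode` and the sample labels are hypothetical.

```python
import hashlib


def hash_encode(label, n_features=10):
    """Map one label to a vector of length n_features with a single signed entry."""
    digest = hashlib.md5(label.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")  # 64-bit integer taken from the digest
    index = h % n_features                 # column chosen by the hash
    sign = 1.0 if (h >> 63) & 1 == 0 else -1.0  # top bit of the hash decides the sign
    vector = [0.0] * n_features
    vector[index] = sign
    return vector


# Hypothetical labels; the first and third are identical
labels = ["4123", "4456", "4123"]
vectors = [hash_encode(label) for label in labels]
for label, vec in zip(labels, vectors):
    print(label, vec)
```

The same label always lands in the same column, and with only 10 columns and several hundred labels, many different labels necessarily share a column.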

In [9]:
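def column_encoder(df, i):
    
    """
    Encodes a categorical column of the input dataframe.
    
    Columns with more than 10 levels are feature hashed into 10 columns.
    Other categorical columns are one-hot encoded.
    
    Args ->
        df (pd.DataFrame): The input dataframe
        i (int): The index of the column in the df to be transformed
    
    Returns ->
        encoded (pd.DataFrame): A pd.DataFrame which is an encoded form of the
        original column. None if the column is not categorical.
    
    Raises ->
        AssertionError: If df is empty or i is not an int
    """
    
    assert df.shape[0] > 0
    assert isinstance(i, int)
    
    col = df.iloc[:,i]
    
    
    # Checking if column is categorical; other columns are skipped
    if col.dtype not in ['str', 'object', 'O']:
        return None
    
    # Checking number of unique levels    
    cardinality = len(col.unique())
    
    # Feature hashing
    if cardinality > 10:
        
        h = FeatureHasher(n_features = 10, input_type='string')
        # Each sample must be an iterable of strings. Wrapping every label in
        # its own list hashes the whole label as one token; a bare string
        # would be split into characters (and newer scikit-learn rejects it).
        encoded = h.transform([[value] for value in col]).toarray()
        encoded = pd.DataFrame(encoded)
        
        return encoded
    
    # One-hot encoding
    else:
        # The encoder returns a sparse matrix by default. The `sparse` keyword
        # was renamed `sparse_output` in scikit-learn 1.2, so it is left unset.
        ohe = OneHotEncoder(handle_unknown='ignore')
        encoded = ohe.fit_transform(np.array(col).reshape(-1,1))
        encoded = pd.DataFrame(encoded.toarray())
        return encoded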

#### Testing it out (2)

###### Feature hashing

In [10]:
# 'cat1' is in index 5 of df

encoded_col = column_encoder(df, 5)

encoded_col

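| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 |
| 1 | 2.0 | 1.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | -2.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | -2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 1.0 | 2.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 996 | 0.0 | 2.0 | -2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 997 | 1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 1.0 | 0.0 |
| 998 | 1.0 | 0.0 | -1.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |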


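The high-cardinality column 'cat1', whose cardinality was reported as 436 above, has been reduced to a 10-dimensional vector per row. The output shown was recorded with each label passed to `FeatureHasher` as a bare string, which is treated as an iterable of characters, so every character was hashed separately; that is why rows contain several non-zero entries and values such as 2.0. When each label is hashed as a single token, every row holds exactly one entry of 1.0 or -1.0.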