# Feature Hashing

Let's create a pandas DataFrame with a high-cardinality `Color` feature.

In [35]:
import pandas as pd
import numpy as np
import random

# Set random seeds for reproducibility (both NumPy and the stdlib random module are used below)
np.random.seed(42)
random.seed(42)

# Define the number of rows and unique categories
num_rows = 50
num_categories = 15

# List of color names
color_names = ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Indigo', 'Violet', 'White', 'Black', 'Gray', 'Pink', 'Brown', 'Cyan', 'Magenta', 'Purple']

# Generate random categorical data
categories = random.choices(color_names, k=num_rows)

# Generate random flag values (0 and 1)
flags = np.random.randint(2, size=num_rows)

# Create the DataFrame
df = pd.DataFrame({'Color': categories, 'Target': flags})
df['Color'] = df['Color'].astype('category')

# Display the DataFrame
print(df)


      Color  Target
0      Cyan       0
1     Brown       1
2      Gray       0
3    Orange       0
4    Orange       0
5     White       1
6      Gray       0
7     Brown       0
8   Magenta       0
9    Yellow       1
10   Orange       0
11  Magenta       0
12     Gray       0
13   Yellow       0
14  Magenta       1
15      Red       0
16      Red       1
17    Green       1
18     Blue       1
19  Magenta       0
20   Indigo       1
21     Pink       0
22   Violet       1
23     Cyan       1
24    Brown       1
25   Indigo       1
26   Indigo       1
27   Purple       1
28     Gray       1
29   Indigo       1
30    Green       0
31    White       0
32   Orange       1
33    White       1
34   Violet       1
35   Indigo       0
36   Indigo       1
37   Violet       0
38     Blue       0
39   Purple       0
40      Red       0
41     Blue       0
42   Purple       1
43   Violet       1
44     Cyan       1
45  Magenta       1
46     Gray       1
47    Green       0
48   Purple       1


In [3]:
df.Color.unique()

['Pink', 'Gray', 'Yellow', 'White', 'Black', ..., 'Cyan', 'Brown', 'Orange', 'Green', 'Violet']
Length: 15
Categories (15, object): ['Black', 'Blue', 'Brown', 'Cyan', ..., 'Red', 'Violet', 'White', 'Yellow']

There are 15 unique color categories in this column. Let's explore each encoding technique one by one:
1. One-hot encoding
2. Label encoding
3. Target encoding
4. Hash encoding

## Hash encoding or Feature Hashing

One-hot encoding has a major drawback: it produces one new feature per category in the original feature, which causes dimensionality problems when the cardinality is high. Hash encoding can represent the same categorical data in fewer columns.

The main advantage of using Hash Encoding is that you can control the number of numerical columns you want to create to represent categorical data.
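
To make the idea concrete, here is a minimal sketch of the hashing trick in plain Python. It assumes an MD5 digest reduced modulo the number of output columns; the helper names `hash_bucket` and `hash_encode` are hypothetical, and this is an illustration rather than the exact procedure of any particular library.

```python
import hashlib

def hash_bucket(value: str, n_components: int) -> int:
    """Map a category name to a column index in [0, n_components)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=5):
    """Return one row per value with a single 1 in the hashed column."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows

colors = ["Red", "Orange", "Yellow", "Green", "Blue"]
encoded = hash_encode(colors, n_components=5)
for color, row in zip(colors, encoded):
    print(f"{color:<7} {row}")
```

The number of columns is fixed by `n_components` no matter how many distinct categories appear, which is exactly the control described above.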

In [9]:
#!pip install category_encoders

Using the **category_encoders** library, let's hash-encode the `Color` column.

In [4]:
# import category_encoders
import category_encoders as ce

# create an encoder object and specify the number of output columns
encoder = ce.HashingEncoder(cols=['Color'], n_components=5)

# Encode the Color feature
hash_en = encoder.fit_transform(df['Color'])

# View data
hash_en.sample(5)

    col_0  col_1  col_2  col_3  col_4
33      0      1      0      0      0
0       0      0      1      0      0
34      0      0      0      1      0
12      0      0      0      0      1
10      0      0      0      1      0


Hash encoding has some significant weaknesses.

**1. Collision:** Because many categorical values are squeezed into a smaller number of features, different values can end up with the same hash value. This is called a **collision**.


If the hash size is too small, more collisions occur and model performance suffers: the model cannot learn separate coefficients for feature values that share a column. On the other hand, a larger hash size consumes more memory.


**2. No inverse mapping:** Interpretability becomes an issue. Hashing is one-way and the mapping from category names to columns is not stored, so we cannot go from feature indices back to feature names, which makes feature importances hard to interpret. This is similar to the interpretability problem with PCA components.
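
When the set of categories seen during training is small enough to keep in memory, one possible workaround is to record which categories landed in each column while encoding. The sketch below is only an illustration: `build_bucket_lookup` is a hypothetical helper, and the MD5-modulo hash is an assumption rather than the hash used by any specific encoder.

```python
import hashlib
from collections import defaultdict

def hash_bucket(value: str, n_components: int) -> int:
    """Map a category name to a column index (assumed MD5-modulo hash)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def build_bucket_lookup(categories, n_components):
    """Record, for each column index, the set of categories hashed into it."""
    lookup = defaultdict(set)
    for category in categories:
        lookup[hash_bucket(category, n_components)].add(category)
    return dict(lookup)

color_names = ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Indigo', 'Violet',
               'White', 'Black', 'Gray', 'Pink', 'Brown', 'Cyan', 'Magenta', 'Purple']

lookup = build_bucket_lookup(color_names, n_components=5)
for bucket in sorted(lookup):
    print(bucket, sorted(lookup[bucket]))
```

The lookup only narrows a column down to a group of candidate categories. Within a group the categories remain indistinguishable, so this eases interpretation without truly inverting the hash.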


In [5]:
# Concatenate the hashed columns (computed above) with the original data
full_df = pd.concat([hash_en, df], axis=1)

In [6]:
full_df.head()

   col_0  col_1  col_2  col_3  col_4   Color  Target
0      0      0      1      0      0    Pink       0
1      0      0      0      1      0    Gray       1
2      0      1      0      0      0  Yellow       0
3      1      0      0      0      0   White       0
4      0      0      0      1      0   Black       0


Let's check if there are any collisions

In [7]:
# Group the DataFrame by the hashed features
grouped_df = full_df.groupby(['col_0', 'col_1', 'col_2', 'col_3', 'col_4'])

# Get the categories with collision (more than one category per group)
collision_pairs = grouped_df['Color'].apply(lambda x: x.nunique() > 1)

# Filter the grouped DataFrame to show only the collision pairs
collision_df = grouped_df['Color'].unique()[collision_pairs]

# Display the collision pairs
print(collision_df)


col_0  col_1  col_2  col_3  col_4
0      0      0      0      1        ['Indigo', 'Purple', 'Brown', 'Orange', 'Green...
                     1      0        ['Gray', 'Black', 'Blue']
Categories (15, obje...
              1      0      0        ['Pink', 'Red', 'Magenta']
Categories (15, obj...
       1      0      0      0        ['Yellow', 'Cyan']
Categories (15, object): ['...
1      0      0      0      0        ['White', 'Violet']
Categories (15, object): [...
Name: Color, dtype: object


In [8]:
# Inspect the 'Black', 'Gray', 'Blue' group
full_df[full_df.Color.isin(['Black', 'Gray', 'Blue'])]

    col_0  col_1  col_2  col_3  col_4  Color  Target
1       0      0      0      1      0   Gray       1
4       0      0      0      1      0  Black       0
6       0      0      0      1      0  Black       0
9       0      0      0      1      0  Black       1
10      0      0      0      1      0   Blue       0
17      0      0      0      1      0   Gray       1
34      0      0      0      1      0   Blue       1
35      0      0      0      1      0   Gray       0
37      0      0      0      1      0  Black       0
39      0      0      0      1      0   Gray       0


As you can see, "Gray", "Black", and "Blue" collide: all three are represented by the same column, `col_3`.

The lower the desired dimensionality, the higher the chances of collision. To reduce the probability of collision, we can increase the desired dimensions. This is the trade-off between speed and quality of learning.
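
A rough way to quantify this trade-off is to assume an idealized hash that spreads categories uniformly and independently over the columns (an assumption, not something a specific hash function guarantees). With $k$ categories and $n$ columns, the expected number of occupied columns is then $n\,(1 - (1 - 1/n)^k)$. The function name `expected_occupied` below is hypothetical.

```python
def expected_occupied(n_columns: int, n_categories: int) -> float:
    """Expected number of non-empty columns under uniform, independent hashing."""
    return n_columns * (1 - (1 - 1 / n_columns) ** n_categories)

n_categories = 15  # number of distinct colors in this example
for n_columns in (5, 10, 20, 50, 100):
    occupied = expected_occupied(n_columns, n_categories)
    # Categories in excess of the occupied columns must share a column
    surplus = n_categories - occupied
    print(f"n={n_columns:>3}  expected occupied columns: {occupied:5.2f}  "
          f"expected surplus categories: {surplus:5.2f}")
```

Under this idealization the surplus shrinks as `n_columns` grows but never reaches exactly zero, which matches the intuition that more dimensions reduce collisions without eliminating them.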

One possible mitigation is a **signed hash function**: an additional hash bit assigns each feature a sign of +1 or -1, so that when two features collide their contributions tend to cancel out rather than accumulate.

However, the impact on accuracy due to collision is observed to be low as long as you use a sufficiently large vector space for hashing.
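
The signed-hash idea mentioned above can be sketched in a few lines. This is an illustration under stated assumptions: the column index and the sign are both taken from a single MD5 digest, and `signed_hash` and `hash_tokens` are hypothetical helpers, not a library API.

```python
import hashlib

def signed_hash(token: str, n_features: int):
    """Return (column index, sign) derived from one MD5 digest."""
    h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
    sign = 1 if h & 1 else -1        # lowest bit chooses the sign
    index = (h >> 1) % n_features    # remaining bits choose the column
    return index, sign

def hash_tokens(tokens, n_features=5):
    """Sum the signed contributions of all tokens into one fixed-size row."""
    row = [0] * n_features
    for token in tokens:
        index, sign = signed_hash(token, n_features)
        row[index] += sign
    return row

print(hash_tokens(["Red"]))                  # one token: a single +1 or -1
print(hash_tokens(["Red", "Blue", "Gray"]))  # several tokens: signed sum
```

With one token per row, the sign doubles the number of distinct codes available for a given `n_features`. With several tokens per row, colliding tokens of opposite sign offset each other instead of piling up in one column.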

In [9]:
from sklearn.feature_extraction import FeatureHasher

In [10]:
h = FeatureHasher(n_features=5, input_type="string")

How to choose `n_features` in FeatureHasher in sklearn:

https://datascience.stackexchange.com/questions/77819/how-should-i-choose-n-features-in-featurehasher-in-sklearn

In [11]:
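# Caution: a bare string sample is iterated character by character, so letters (not whole color names) are hashed
f = h.fit_transform(df['Color'])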

In [12]:
type(f)

scipy.sparse._csr.csr_matrix

To put data from a **scipy.sparse.csr_matrix** into a pandas DataFrame, you can use the `.toarray()` method to convert the sparse matrix into a dense array. Then, you can pass the dense array to the DataFrame constructor.
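
As a small sketch of that conversion, the example below hashes a toy `Series` (made up for illustration) and builds a labelled DataFrame from the dense array. It also wraps each value in a list: with `input_type="string"`, `FeatureHasher` expects every sample to be an iterable of strings, so wrapping makes it hash the whole color name as one token rather than its individual characters.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy data for illustration only
colors = pd.Series(["Red", "Blue", "Gray", "Red"], name="Color")

hasher = FeatureHasher(n_features=5, input_type="string")
# Wrap each value in a list so the whole color name is hashed as one token
sparse = hasher.transform([[c] for c in colors])

# Densify the sparse matrix and label the columns
dense = pd.DataFrame(
    sparse.toarray(),
    columns=[f"col_{i}" for i in range(sparse.shape[1])],
    index=colors.index,
)
print(dense)
```

Each row then contains a single +1 or -1, and identical colors always receive identical rows. Calling `.toarray()` is fine for a small example like this one; for wide hash spaces the dense array can become large.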

In [54]:
f.toarray().shape

(50, 5)

In [13]:
fh_hash=f.toarray()

In [13]:
fh_hash

array([[ 1., -1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  3.],
       [ 1., -1.,  0.,  0.,  0.],
       [ 0., -1.,  1.,  1.,  1.],
       [ 1., -2.,  2.,  0.,  1.],
       [ 0., -1.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  3.],
       [ 2.,  0.,  0.,  0.,  2.],
       [ 2.,  0.,  0.,  0.,  0.],
       [ 2.,  0.,  0.,  0.,  2.],
       [ 0., -3.,  0.,  0., -1.],
       [ 0.,  0.,  0.,  0.,  3.],
       [ 0., -1.,  1.,  1.,  1.],
       [ 2.,  0.,  0.,  0.,  0.],
       [ 2.,  0.,  0.,  0.,  0.],
       [ 3.,  0., -1.,  0., -1.],
       [ 0., -1.,  1.,  1.,  1.],
       [ 2.,  0.,  0.,  0.,  0.],
       [ 0., -1.,  1.,  1.,  1.],
       [ 0., -1., -1.,  1.,  1.],
       [-1., -1.,  0.,  0., -1.],
       [ 1.,  0.,  0.,  0.,  2.],
       [ 0., -3.,  0.,  0., -1.],
       [ 1., -1., -1.,  1., -1.],
       [ 0., -1.,  1.,  1.,  1.],
       [ 1.,  0.,  0.,  0.,  2.],
       [ 0., -1.,  1.,  1.,  1.],
       [ 1., -1.,  0.,  0.,  0.],
       [ 0., -1.,  1.,  1.,  1.],
       [ 1., -

In [18]:
# Create a pandas DataFrame from the dense array
new = pd.DataFrame(fh_hash, columns=['col_0', 'col_1', 'col_2', 'col_3', 'col_4'])

# Display the DataFrame
print(new)

    col_0  col_1  col_2  col_3  col_4
0     0.0   -3.0    0.0    0.0   -1.0
1     2.0    0.0    0.0    0.0    2.0
2     0.0   -1.0    1.0    1.0    1.0
3    -1.0   -1.0    0.0    0.0   -1.0
4     1.0   -1.0   -1.0    0.0    0.0
5     0.0   -3.0    0.0    0.0   -1.0
6     1.0   -1.0   -1.0    0.0    0.0
7    -1.0   -1.0    0.0    0.0   -1.0
8     0.0    0.0    0.0    0.0    3.0
9     1.0   -1.0   -1.0    0.0    0.0
10    0.0    0.0    1.0    0.0    1.0
11    3.0    0.0   -1.0    0.0   -1.0
12    0.0   -1.0   -1.0    1.0    1.0
13    0.0    0.0    0.0    0.0    3.0
14    3.0    0.0   -1.0    0.0   -1.0
15    0.0    0.0    0.0    0.0    3.0
16    1.0   -2.0    2.0    0.0    1.0
17    2.0    0.0    0.0    0.0    2.0
18    1.0   -1.0    0.0    0.0    0.0
19    0.0   -1.0   -1.0    1.0    1.0
20   -1.0   -1.0    0.0    0.0   -1.0
21    1.0   -2.0    2.0    0.0    1.0
22   -1.0   -1.0    0.0    0.0   -1.0
23    1.0   -1.0   -1.0    1.0   -1.0
24    2.0    0.0    0.0    0.0    0.0
25    2.0   

In [22]:
type(new)

pandas.core.frame.DataFrame

In [27]:
new_full = pd.concat((new, df[['Color','Target']]), axis =1)

In [28]:
new_full.head()

   col_0  col_1  col_2  col_3  col_4   Color  Target
0    0.0   -3.0    0.0    0.0   -1.0    Pink       0
1    2.0    0.0    0.0    0.0    2.0    Gray       1
2    0.0   -1.0    1.0    1.0    1.0  Yellow       0
3   -1.0   -1.0    0.0    0.0   -1.0   White       0
4    1.0   -1.0   -1.0    0.0    0.0   Black       0


In [29]:
# Group the DataFrame by the hashed features
grouped_df = new_full.groupby(['col_0', 'col_1', 'col_2', 'col_3', 'col_4'])

# Get the categories with collision (more than one category per group)
collision_pairs = grouped_df['Color'].apply(lambda x: x.nunique() > 1)

# Filter the grouped DataFrame to show only the collision pairs
collision_df2 = grouped_df['Color'].unique()[collision_pairs]

# Display the collision pairs
print(collision_df2)


col_0  col_1  col_2  col_3  col_4
0.0    -1.0   1.0    1.0    1.0      ['Yellow', 'Violet']
Categories (15, object): ...
Name: Color, dtype: object


In [31]:
# Inspect the 'Yellow', 'Violet' group
new_full[new_full.Color.isin(['Yellow', 'Violet'])]

    col_0  col_1  col_2  col_3  col_4   Color  Target
2     0.0   -1.0    1.0    1.0    1.0  Yellow       0
27    0.0   -1.0    1.0    1.0    1.0  Yellow       1
28    0.0   -1.0    1.0    1.0    1.0  Yellow       1
33    0.0   -1.0    1.0    1.0    1.0  Yellow       1
42    0.0   -1.0    1.0    1.0    1.0  Violet       1
44    0.0   -1.0    1.0    1.0    1.0  Violet       1
46    0.0   -1.0    1.0    1.0    1.0  Yellow       1
47    0.0   -1.0    1.0    1.0    1.0  Violet       0


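Here, only 'Yellow' and 'Violet' have collided, which is far fewer collisions than we saw with `HashingEncoder`. Keep in mind that the comparison is not like for like: because each color string was passed to `FeatureHasher` directly, its individual characters were hashed and summed, which is why the rows contain several nonzero, signed values. You can try a higher value for `n_features`, although more dimensions do not guarantee zero collisions, so it is worth trying a few different values.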
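
One way to try several values is to count how many distinct codes the 15 colors receive for each `n_features`. This is a sketch, not a tuning recipe: `count_distinct_codes` is a hypothetical helper, and each color is wrapped in a list so that the whole name is hashed as a single token.

```python
from sklearn.feature_extraction import FeatureHasher

color_names = ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Indigo', 'Violet',
               'White', 'Black', 'Gray', 'Pink', 'Brown', 'Cyan', 'Magenta', 'Purple']

def count_distinct_codes(categories, n_features):
    """Hash each category as one token and count the distinct encoded rows."""
    hasher = FeatureHasher(n_features=n_features, input_type="string")
    dense = hasher.transform([[c] for c in categories]).toarray()
    return len({tuple(row) for row in dense})

for n_features in (5, 8, 16, 32, 64):
    distinct = count_distinct_codes(color_names, n_features)
    print(f"n_features={n_features:>2}: {distinct} distinct codes, "
          f"{len(color_names) - distinct} fewer than the number of categories")
```

When the number of distinct codes equals the number of categories, no two colors share an encoding for that `n_features`.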

In [32]:
# Check how the hashed features correlate with the target (drop the non-numeric Color column first)
new_full.drop(columns='Color').corr()

           col_0     col_1     col_2     col_3     col_4    Target
col_0   1.000000  0.360200 -0.272709 -0.263801 -0.037063  0.121428
col_1   0.360200  1.000000 -0.273896 -0.148660  0.422281  0.051148
col_2  -0.272709 -0.273896  1.000000  0.113662  0.314826  0.078668
col_3  -0.263801 -0.148660  0.113662  1.000000  0.020761  0.142819
col_4  -0.037063  0.422281  0.314826  0.020761  1.000000 -0.150581
Target  0.121428  0.051148  0.078668  0.142819 -0.150581  1.000000


In [34]:
# Repeat the check with Spearman rank correlations (again without the non-numeric Color column)
new_full.drop(columns='Color').corr(method='spearman')

           col_0     col_1     col_2     col_3     col_4    Target
col_0   1.000000  0.372617 -0.300158 -0.273136  0.010607  0.128729
col_1   0.372617  1.000000 -0.224906 -0.297357  0.407178  0.015167
col_2  -0.300158 -0.224906  1.000000  0.160227  0.379540  0.092529
col_3  -0.273136 -0.297357  0.160227  1.000000  0.060410  0.142819
col_4   0.010607  0.407178  0.379540  0.060410  1.000000 -0.130853
Target  0.128729  0.015167  0.092529  0.142819 -0.130853  1.000000
