# Hash Bucket

The `HashBucket` operator reduces the number of distinct categories in a column by hashing each value into a fixed number of buckets.

This is useful for two reasons. First, it reduces the memory footprint of a RecSys model, because an embedding table needs one row per category. Second, and perhaps surprisingly, it can also improve predictive power.
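To put a rough number on the memory argument, here is a back-of-the-envelope sketch. The embedding dimension of 64 and the float32 storage are assumptions chosen for illustration; the category and bucket counts match the ones used later in this notebook.

```python
# Rough embedding table size: one row per category, `dim` floats per row.
# Assumptions (illustrative only): 64-dimensional float32 embeddings.
EMBEDDING_DIM = 64
BYTES_PER_FLOAT32 = 4


def embedding_table_bytes(num_categories: int, dim: int = EMBEDDING_DIM) -> int:
    """Bytes needed to store one embedding row per category."""
    return num_categories * dim * BYTES_PER_FLOAT32


full = embedding_table_bytes(10_000)  # one row per raw customer id
hashed = embedding_table_bytes(100)   # one row per hash bucket
print(f"raw: {full / 1e6:.2f} MB, hashed: {hashed / 1e6:.4f} MB")
```

The table shrinks in direct proportion to the number of categories, so the saving grows with the cardinality of the original column.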

Low-frequency categories can lead to overfitting, and the problem is exacerbated when the dimensionality of the embeddings is high.

One solution is to use the `Categorify` operator with frequency capping. The `HashBucket` operator is another popular choice, offering different performance trade-offs.
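Conceptually, hash bucketing boils down to "hash the value, then take it modulo the number of buckets". The sketch below illustrates that idea in plain Python. The `hash_bucket` helper and the use of `zlib.crc32` are assumptions made for illustration; this is not necessarily the hash function NVTabular uses internally.

```python
import zlib


def hash_bucket(value, num_buckets):
    """Map a value to an integer in [0, num_buckets) using a stable hash."""
    # crc32 is a stand-in chosen for reproducibility; Python's built-in hash()
    # is salted per process for strings, so it is avoided here.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


customer_ids = range(10_000)
buckets = [hash_bucket(cid, num_buckets=100) for cid in customer_ids]
print(min(buckets), max(buckets), len(set(buckets)))
```

No fitting step is needed because the mapping depends only on the value itself. The price is that unrelated categories can collide in the same bucket.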

Let's observe the `HashBucket` operator in action.

In [1]:
import numpy as np
import cudf
import nvtabular as nvt

In [2]:
tenk_categories = np.random.randint(0, 10_000, 50_000)

In [3]:
gdf = cudf.DataFrame(data={'customer_id': tenk_categories})
gdf.head()

   customer_id
0         4283
1         6766
2         5393
3         2413
4         3322


We have up to 10,000 distinct customers spread across just 50,000 data points, which is about five rows per customer on average.

In [4]:
gdf.customer_id.value_counts().head()

5286    16
613     15
3508    14
6800    14
206     14
Name: customer_id, dtype: int32

In [5]:
(gdf.customer_id.value_counts() > 3).sum()

7345

Only 7,345 of the up to 10,000 customers have more than three data points, so roughly a quarter of our customers appear three times or fewer!
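The size of the low-frequency tail can also be checked without a GPU. The following is a minimal CPU-only sketch using the standard library; the fixed seed and the cutoff of three rows are assumptions for illustration, so the exact counts will differ from the cuDF output above.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable; the notebook itself is unseeded
customer_ids = [random.randrange(10_000) for _ in range(50_000)]

occurrences = Counter(customer_ids)        # customer id -> number of rows
histogram = Counter(occurrences.values())  # number of rows -> number of customers

seen = len(occurrences)
rare = sum(n for count, n in histogram.items() if count <= 3)
print(f"{seen} customers seen, {rare} of them ({rare / seen:.0%}) have 3 or fewer rows")
```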

Let us address this issue using the `HashBucket` operator.

In [6]:
nvt_dataset = nvt.Dataset(gdf)

hashed = ['customer_id'] >> nvt.ops.HashBucket(num_buckets=100)

workflow = nvt.Workflow(hashed)

In [7]:
%%time

gdf = workflow.fit_transform(nvt_dataset).to_ddf().compute()

CPU times: user 20.5 ms, sys: 363 µs, total: 20.9 ms
Wall time: 19.8 ms


In [8]:
gdf.head()

   customer_id
0           84
1            4
2           63
3           41
4           41


In [9]:
gdf.customer_id.value_counts()

71    646
44    633
34    627
98    606
73    601
     ... 
55    410
68    407
40    396
24    363
38    359
Name: customer_id, Length: 100, dtype: int32

There is also another way of decreasing the number of categories, one that combines frequency capping with hashing.

Categories that occur often enough each keep an id of their own, while the long tail of low-frequency categories gets hashed. Instead of a single catch-all bucket for all the low-frequency categories, we get several buckets.
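The combined scheme can be sketched in a few lines of plain Python. The `frequency_capped_encode` helper below is hypothetical and only illustrates the idea; the id layout (buckets first, dedicated ids after) and the `zlib.crc32` hash are arbitrary choices and do not reproduce how `Categorify` assigns ids.

```python
import zlib
from collections import Counter


def frequency_capped_encode(values, freq_threshold, num_buckets):
    """Give frequent values their own id and hash the rest into shared buckets."""
    counts = Counter(values)
    # Values seen at least `freq_threshold` times keep a dedicated id.
    frequent = sorted(v for v, c in counts.items() if c >= freq_threshold)
    # Dedicated ids start after the bucket range so the two never collide.
    dedicated = {v: num_buckets + i for i, v in enumerate(frequent)}

    def encode(v):
        if v in dedicated:
            return dedicated[v]
        # Infrequent values share one of `num_buckets` hashed ids.
        return zlib.crc32(str(v).encode("utf-8")) % num_buckets

    return [encode(v) for v in values]


values = ["a"] * 5 + ["b"] * 4 + ["c", "d", "e", "f"]
encoded = frequency_capped_encode(values, freq_threshold=4, num_buckets=2)
print(encoded)
```

In this toy input, `"a"` and `"b"` meet the threshold and get ids 2 and 3, while the four singletons are spread over buckets 0 and 1.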

In [10]:
gdf = cudf.DataFrame(data={'customer_id': tenk_categories})
gdf.customer_id.value_counts().head()

5286    16
613     15
3508    14
6800    14
206     14
Name: customer_id, dtype: int32

In [11]:
nvt_dataset = nvt.Dataset(gdf)

frequency_hashed = ['customer_id'] >> nvt.ops.Categorify(freq_threshold=14, num_buckets=100)

workflow = nvt.Workflow(frequency_hashed)



In [12]:
%%time

gdf = workflow.fit_transform(nvt_dataset).to_ddf().compute()



CPU times: user 181 ms, sys: 30.6 ms, total: 212 ms
Wall time: 218 ms


In [13]:
gdf.customer_id.nunique()

112

In [14]:
gdf.customer_id.value_counts()

84     646
57     633
47     611
111    606
86     601
      ... 
11      14
10      14
12      14
5       14
4       14
Name: customer_id, Length: 112, dtype: int32

As we can see, the lower-frequency categories were each assigned to one of 100 buckets. Categories that occur at least `freq_threshold` times each kept a distinct id of their own, which is why we end up with 112 unique values rather than 100.