# Hashed Cross


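Hashed cross combines category hashing with feature crossing: it joins two or more categorical columns into a single interaction feature and hashes the result into a fixed number of buckets. Because the output cardinality is capped by the bucket count, it lends itself well to preprocessing data at scale.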
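To make the idea concrete, here is a minimal CPU-only sketch in NumPy. It is not how NVTabular implements `HashedCross`: the function name `hashed_cross`, the mixing constant, and the bit-shift step are all assumptions chosen for illustration. It only shows the general pattern of combining two ID columns into one value and reducing it modulo the bucket count.

```python
import numpy as np

def hashed_cross(left, right, num_buckets):
    """Cross two non-negative integer ID columns and hash each pair into a bucket.

    Illustrative only: the mixing scheme below is an assumption,
    not the hash function used by NVTabular.
    """
    left = np.asarray(left).astype(np.uint64)
    right = np.asarray(right).astype(np.uint64)
    # Combine the pair into a single 64-bit value (unsigned arithmetic wraps around).
    mixed = left * np.uint64(0x9E3779B97F4A7C15) + right
    # Fold the high bits into the low bits before taking the modulus.
    mixed ^= mixed >> np.uint64(32)
    return (mixed % np.uint64(num_buckets)).astype(np.int32)

rng = np.random.default_rng(0)
user_ids = rng.integers(0, 10_000, 1_000)
post_ids = rng.integers(0, 10_000, 1_000)

buckets = hashed_cross(user_ids, post_ids, num_buckets=100)
print(buckets[:5])
```

Whatever the hash, the output always lies in `[0, num_buckets)`, and the same `(user_id, post_id)` pair always lands in the same bucket.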

In [1]:
import numpy as np
import cudf
import nvtabular as nvt

The dataset below is not enormous, but it is enough to test things out: I am running this example on a laptop GPU with 4 GB of memory!

In [2]:
# 10 million random IDs per column, each drawn from 10,000 possible values
user_ids = np.random.randint(0, 10_000, 10_000_000)
post_ids = np.random.randint(0, 10_000, 10_000_000)

In [3]:
gdf = cudf.DataFrame(data={'user_id': user_ids, 'post_id': post_ids})
gdf.head()

   user_id  post_id
0     6777     9446
1     9303     9639
2     2166     8513
3     1295     4740
4     5788     9356
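As a rough check on why this fits on a small GPU, the raw size of the two ID columns can be estimated with a little arithmetic. The sketch below assumes both columns are stored as 64-bit integers (the usual `np.random.randint` default on Linux) and ignores any overhead from cuDF or intermediate buffers.

```python
import numpy as np

n_rows = 10_000_000
n_cols = 2
# Assumption: int64 columns, 8 bytes per value.
bytes_per_value = np.dtype(np.int64).itemsize

total_mib = n_rows * n_cols * bytes_per_value / 2**20
print(f"{total_mib:.1f} MiB")  # prints "152.6 MiB"
```

Under that assumption, the input data takes up only a small fraction of 4 GB.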


In [4]:
gdf.nunique()

user_id    10000
post_id    10000
dtype: int64

In [5]:
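# Wrap the cuDF DataFrame in an NVTabular dataset
nvt_dataset = nvt.Dataset(gdf)

# Cross user_id with post_id and hash each pair into one of 100 buckets
hashed = ['user_id', 'post_id'] >> nvt.ops.HashedCross(num_buckets=100)

workflow = nvt.Workflow(hashed)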

In [6]:
%%time

gdf = workflow.fit_transform(nvt_dataset).compute()

CPU times: user 50.5 ms, sys: 191 µs, total: 50.7 ms
Wall time: 58 ms


In [7]:
gdf.nunique()

user_id_X_post_id    100
dtype: int64

In [8]:
gdf.user_id_X_post_id.value_counts()

0     101748
52    100861
74    100861
38    100723
84    100664
       ...  
73     99412
45     99395
31     99372
43     99333
99     99288
Name: user_id_X_post_id, Length: 100, dtype: int32
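The counts above can be compared with what an even spread would give. The short sketch below uses only the numbers already shown (10,000,000 rows, 100 buckets, and the largest and smallest counts in the output); it does not rerun the workflow.

```python
n_rows = 10_000_000
num_buckets = 100

# With a perfectly even spread, every bucket would hold the same number of rows.
expected = n_rows / num_buckets

# Largest and smallest counts copied from the value_counts() output above.
observed_max, observed_min = 101_748, 99_288
max_rel_dev = max(abs(observed_max - expected), abs(observed_min - expected)) / expected

# 10,000 user IDs x 10,000 post IDs give this many possible pairs per bucket on average.
possible_pairs_per_bucket = 10_000 * 10_000 // num_buckets

print(f"expected rows per bucket: {expected:.0f}")           # 100000
print(f"largest relative deviation: {max_rel_dev:.2%}")      # 1.75%
print(f"possible pairs per bucket: {possible_pairs_per_bucket}")  # 1000000
```

So every bucket stays within about 2% of the even-spread count, while each one stands in for roughly a million distinct `(user_id, post_id)` pairs. That collision rate is the price of capping the cardinality at 100.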

In [9]:
!nvidia-smi

Tue Aug  2 06:33:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.72       Driver Version: 512.72       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P8     9W /  N/A |    578MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------