# LambdaOp

The `LambdaOp` operator lets you run custom-defined preprocessing on your data.

Why might you choose to use it instead of performing the operations directly on a `cudf` or `pandas` `DataFrame`?

First, you can integrate custom preprocessing with the rest of the `NVTabular` operator suite: you can use all the existing operators and define your own set of operations on top of them.

However, even if you choose to run only custom operations on your data (highly unlikely, but still), `LambdaOp` might still be your best bet.

Using `NVTabular`, you can easily split your data into chunks and perform operations on only a subset of the data at a time. By specifying the `npartitions` parameter, you can tailor your workflow to make use of the available compute resources without running out of memory.
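The idea of partition-wise processing can be sketched in plain Python. This is only a conceptual illustration and not how `NVTabular` is implemented: the helpers `iter_partitions` and `apply_per_partition` are hypothetical names introduced here, and plain lists stand in for dataframe columns. The point is that the function only ever sees one chunk at a time.

```python
from typing import Callable, Iterator, List


def iter_partitions(values: List[float], npartitions: int) -> Iterator[List[float]]:
    """Yield `npartitions` contiguous chunks of `values`, as evenly sized as possible."""
    base, extra = divmod(len(values), npartitions)
    start = 0
    for i in range(npartitions):
        # The first `extra` chunks take one additional element each.
        size = base + (1 if i < extra else 0)
        yield values[start:start + size]
        start += size


def apply_per_partition(
    values: List[float],
    npartitions: int,
    fn: Callable[[List[float]], List[float]],
) -> List[float]:
    """Apply `fn` to one chunk at a time and concatenate the results."""
    out: List[float] = []
    for chunk in iter_partitions(values, npartitions):
        out.extend(fn(chunk))
    return out


# Hypothetical stand-in data: 100,000 values split across 10 partitions.
readings = [float(i) for i in range(100_000)]
chunk_sizes = [len(chunk) for chunk in iter_partitions(readings, 10)]
doubled = apply_per_partition(readings, 10, lambda chunk: [2 * x for x in chunk])
```

In this sketch each call to `fn` handles 10,000 values, so the working set of the custom function is bounded by the chunk size rather than by the size of the whole dataset.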

In [1]:
import nvtabular as nvt
import cudf
import numpy as np

In [2]:
df = cudf.DataFrame(data={'thermal_readings': np.random.rand(100_000) * 100})
df.head()

Unnamed: 0,thermal_readings
0,52.247439
1,44.012942
2,57.839022
3,35.931038
4,10.863175


Let's now create a `Merlin Dataset`, specifying that the data be split across 10 partitions.

In [3]:
ds = nvt.Dataset(df, npartitions=10)

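Next, we define the preprocessing we would like performed: a custom operator that adds noise to the data. The function prints the shape of the column it receives so we can see how much data each call handles. Note that `np.random.randn()` called without arguments returns a single scalar, so each call scales the whole column it receives by one random factor.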

In [4]:
def noisify(col):
    print(col.shape)
    return col + col * np.random.randn() * 0.05
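To make the behaviour of `noisify` concrete, here is a minimal sketch using only the standard library, with plain lists standing in for columns. `scalar_noise` mimics the function above (one random factor shared by every value in the chunk), while `elementwise_noise` is a hypothetical variant, not part of the original notebook, that draws an independent factor for each value.

```python
import random


def scalar_noise(col, scale=0.05):
    """Mimic noisify: one random factor shared by every value in the chunk."""
    factor = random.gauss(0.0, 1.0) * scale
    return [x + x * factor for x in col]


def elementwise_noise(col, scale=0.05):
    """Hypothetical variant: an independent random factor for each value."""
    return [x + x * random.gauss(0.0, 1.0) * scale for x in col]


random.seed(0)  # fixed seed so the sketch is reproducible
chunk = [10.0, 20.0, 40.0]
shared = scalar_noise(chunk)
independent = elementwise_noise(chunk)

# Ratio of noisy value to original value, per element.
shared_ratios = [round(s / c, 9) for s, c in zip(shared, chunk)]
independent_ratios = [round(s / c, 9) for s, c in zip(independent, chunk)]
```

With a shared factor, every entry of `shared_ratios` is identical, whereas the entries of `independent_ratios` differ from one another. Which behaviour is appropriate depends on what the noise is meant to model.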

In [5]:
add_noise = ['thermal_readings'] >> nvt.ops.LambdaOp(noisify)

In [6]:
wf = nvt.Workflow(add_noise)
ds = wf.transform(ds)
ds.compute().head()

(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)
(10000,)


Unnamed: 0,thermal_readings
0,51.199102
1,43.129829
2,56.678491
3,35.210088
4,10.645207


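We have defined a custom operator and applied it to chunks of our data to limit our memory footprint. The ten `(10000,)` lines printed above show that `noisify` was called once per partition, each time on 10,000 rows. This functionality can be very handy when working with arbitrarily large data.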