In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Transforming columns

## Introduction

There are two ways to use the `transform_column` function: by passing in a function that operates elementwise (the default), or by passing in a function that operates columnwise together with `elementwise=False`.

We will show you both in this notebook.

In [None]:
import janitor
import pandas as pd
import numpy as np

## Numeric Data

In [None]:
data = np.random.normal(size=(1_000_000, 4))

In [None]:
df = pd.DataFrame(data).clean_names()

Using the elementwise application:

In [None]:
%%timeit
# We are using a lambda function that operates on each element,
# to highlight the point about elementwise operations.
df.transform_column("0", lambda x: np.abs(x), "abs_0")

And now using columnwise application:

In [None]:
%%timeit
# With elementwise=False, the function receives the whole Series at once.
df.transform_column("0", lambda s: np.abs(s), elementwise=False)

Because `np.abs` is vectorized over the entire series,
the columnwise version runs about 50X faster in this benchmark.
If you know your function is vectorized,
take advantage of that
and pass `elementwise=False` to `transform_column`.
After all, all that `transform_column` does
is provide a method-chainable way of applying the function.
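
To make the last point concrete, here is a minimal sketch of the two modes in plain pandas. `transform_column_sketch` is a hypothetical helper written for illustration, not pyjanitor's actual implementation; it assumes the elementwise mode behaves like `Series.apply` and the columnwise mode hands the whole Series to the function.

```python
import numpy as np
import pandas as pd


def transform_column_sketch(
    df, column_name, function, dest_column_name=None, elementwise=True
):
    """Hypothetical stand-in for transform_column, written with plain pandas."""
    if dest_column_name is None:
        # Overwrite the source column when no destination is given.
        dest_column_name = column_name
    if elementwise:
        # Call the function once per value in the column.
        result = df[column_name].apply(function)
    else:
        # Call the function once, on the whole Series.
        result = function(df[column_name])
    # assign returns a new DataFrame, so the helper can be used in a method chain.
    return df.assign(**{dest_column_name: result})


small = pd.DataFrame({"0": [-1.5, 2.0, -3.25]})
by_element = transform_column_sketch(small, "0", lambda x: abs(x), "abs_0")
by_column = transform_column_sketch(
    small, "0", np.abs, "abs_0", elementwise=False
)
```

On this three-row frame both modes produce the same `abs_0` column; the difference is only in how many times the function is called.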

## String Data

Let's see it in action with string-type data.

In [None]:
from random import choice

def make_strings(length: int):
    return "".join(choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ") for _ in range(length))

strings = (make_strings(30) for _ in range(1_000_000))

stringdf = pd.DataFrame({"data": list(strings)})

First, by applying the function directly, without `transform_column`:

In [None]:
def first_five(s):
    return s.str[0:5]

In [None]:
%%timeit
stringdf.assign(data=first_five(stringdf["data"]))

In [None]:
%%timeit
first_five(stringdf["data"])

In [None]:
%%timeit
stringdf["data"].str[0:5]

In [None]:
%%timeit
stringdf["data"].apply(lambda x: x[0:5])

Comparing the first two timings, it appears that assigning the result to a column with `assign`, which returns a new DataFrame, comes with a bit of overhead.

Now, by using `transform_column` with its default (elementwise) settings:

In [None]:
%%timeit
stringdf.transform_column("data", lambda x: x[0:5])

Now, by using `transform_column` with `elementwise=False`, so that `first_five` can use the vectorized `.str` accessor:

In [None]:
%%timeit
stringdf.transform_column("data", first_five, elementwise=False)
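
Timings aside, it may be worth checking that the elementwise and columnwise versions agree. The sketch below does this in plain pandas on a tiny, made-up frame (`small` and the result names are illustrative, not part of the cells above), so it runs without pyjanitor.

```python
import pandas as pd


def first_five(s):
    # Columnwise: slice every string in the Series via the .str accessor.
    return s.str[0:5]


small = pd.DataFrame({"data": ["ABCDEFGHIJ", "KLMNOPQRST", "UVW"]})

# Elementwise: slice each Python string individually.
elementwise_result = small["data"].apply(lambda x: x[0:5])

# Columnwise: one call on the whole Series.
columnwise_result = first_five(small["data"])

# Strings shorter than five characters are returned unchanged by both.
same = elementwise_result.tolist() == columnwise_result.tolist()
```

If `same` is `True`, either form can be passed to `transform_column`, and the choice comes down to speed.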