In [10]:
import pandas as pd
import numpy as np
np.random.seed(0)

**Avoid using apply() when possible because it is slow.** Use vectorized functions instead.

Let us measure how large the speedup is, using the `%time` magic function.

In [11]:
# Create random integer ratings; randint excludes the upper bound, so these values range from 1 to 4
trip_data = pd.DataFrame(np.random.randint(1, 5, size=(100_000_000, 1)), columns=['Rating'])
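
`np.random.randint(low, high)` draws from the half-open interval `[low, high)`. The short check below is a sketch on a small, hypothetical sample (the array names and the size of 10,000 are assumptions for illustration). It shows that `randint(1, 5)` never produces a 5, and that `randint(1, 6)` would be needed if ratings of 5 are wanted.

```python
import numpy as np

np.random.seed(0)

# Upper bound is exclusive: values come from {1, 2, 3, 4}
draws = np.random.randint(1, 5, size=10_000)
print(draws.min(), draws.max())

# To include 5, the upper bound has to be 6
inclusive_draws = np.random.randint(1, 6, size=10_000)
print(inclusive_draws.min(), inclusive_draws.max())
```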

The following code sets the target column to 1 if the rating is 4 or 5, and 0 otherwise.

The wrong, i.e., slow way to do it:

In [12]:
%time trip_data['target'] = trip_data.Rating.apply(lambda x: 1 if x in [4, 5] else 0)  

CPU times: total: 11.3 s
Wall time: 14.6 s


The right, i.e., fast way to do it:

In [13]:
%time trip_data['target'] = trip_data['Rating'].isin([4, 5]).astype(int)  

CPU times: total: 219 ms
Wall time: 251 ms


Measured by wall time, this is roughly 58 times faster than using `apply()` (14.6 s versus 251 ms).
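
Before swapping one approach for the other, it is worth confirming that both produce the same result. The following is a minimal sketch on a small, hypothetical sample (`sample` and its values are assumptions for illustration). It also shows `np.where` as another vectorized option, which is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample so the check runs instantly
sample = pd.DataFrame({'Rating': [1, 2, 3, 4, 5, 4, 2]})

# Slow: a Python-level function call per element
slow = sample['Rating'].apply(lambda x: 1 if x in [4, 5] else 0)

# Fast: vectorized membership test, then cast the booleans to integers
fast = sample['Rating'].isin([4, 5]).astype(int)

# Alternative vectorized form; equivalent here because ratings are integers up to 5
alt = pd.Series(np.where(sample['Rating'] >= 4, 1, 0), index=sample.index)

# All three approaches agree element by element
assert slow.tolist() == fast.tolist() == alt.tolist()
print(fast.tolist())  # [0, 0, 0, 1, 1, 1, 0]
```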

# Using the `%timeit` magic function to measure execution time

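You can also use `%timeit`, which averages the execution time over multiple loops and runs. The `-o` flag returns the result object so that it can be stored in a variable, and `-r 1` limits the measurement to a single run.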

In [14]:
# Create random integer ratings; randint excludes the upper bound, so these values range from 1 to 4
trip_data = pd.DataFrame(np.random.randint(1, 5, size=(10_000_000, 1)), columns=['Rating'])

In [15]:
apply_timeTaken = %timeit -o -r 1 trip_data['target'] = trip_data.Rating.apply(lambda x: 1 if x in [4, 5] else 0)  # create target

1.43 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [16]:
isin_timeTaken = %timeit -o -r 1 trip_data['target'] = trip_data['Rating'].isin([4, 5]).astype(int)

23 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)


In [17]:
print(f"isin() is about {apply_timeTaken.average/isin_timeTaken.average:.2f} times faster than apply().")

isin() is about 62.28 times faster than apply().
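
The `%time` and `%timeit` magics are only available in IPython and Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The frame size, the helper names `with_apply` and `with_isin`, and the repeat count below are assumptions chosen to keep the run short, so the measured ratio will differ from the figures above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical smaller frame (100,000 rows); integers(1, 6) yields ratings 1 to 5
ratings = pd.DataFrame({'Rating': rng.integers(1, 6, size=100_000)})


def with_apply():
    # Python-level function call per element
    return ratings['Rating'].apply(lambda x: 1 if x in [4, 5] else 0)


def with_isin():
    # Vectorized membership test
    return ratings['Rating'].isin([4, 5]).astype(int)


# Take the best of 3 repeats (1 call each); the minimum is less sensitive to noise
apply_time = min(timeit.repeat(with_apply, number=1, repeat=3))
isin_time = min(timeit.repeat(with_isin, number=1, repeat=3))

print(f"isin() was about {apply_time / isin_time:.1f} times faster than apply() in this run.")
```

The exact ratio depends on the machine, the pandas version, and the data size, so it is best read as an order-of-magnitude estimate.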
