In [19]:
import pandas as pd
import numpy as np
np.random.seed(0)

Avoid apply() when a vectorized alternative exists. apply() calls a Python function once per element, whereas vectorized methods such as isin() process the whole column in compiled code.

Let us look at how much of a speedup we can get.

In [20]:
# Create a random column of integer ratings from 1 to 5 (randint's upper bound is exclusive, hence 6)
trip_data = pd.DataFrame(np.random.randint(1, 6, size=(100_000_000, 1)), columns=['Rating'])

In [21]:
%time trip_data['target'] = trip_data.Rating.apply(lambda x: 1 if x in [4, 5] else 0)  # create target

CPU times: total: 8.27 s
Wall time: 14.8 s


In [22]:
%time trip_data['target'] = trip_data['Rating'].isin([4, 5]).astype(int)  

CPU times: total: 141 ms
Wall time: 256 ms
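
The timing comparison above is only meaningful if both expressions produce the same column. A minimal sketch of that check on a small sample follows; the names `sample`, `via_apply` and `via_isin` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample; integers(1, 6) draws ratings 1..5 (upper bound exclusive)
rng = np.random.default_rng(0)
sample = pd.DataFrame({"Rating": rng.integers(1, 6, size=1_000)})

# Element-wise version: one Python function call per row
via_apply = sample["Rating"].apply(lambda x: 1 if x in [4, 5] else 0)

# Vectorized version: a boolean mask cast to integers
via_isin = sample["Rating"].isin([4, 5]).astype(int)

# Compare values element by element (dtype differences are ignored)
same_values = bool((via_apply == via_isin).all())
print(same_values)
```

Comparing with `==` rather than `Series.equals` avoids a false mismatch on platforms where the two results end up with different integer widths.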


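You can also use %timeit, which averages the time over multiple loops and runs. With the -o flag it returns a TimeitResult object that you can store in a variable. The -r 1 option used below limits the measurement to a single run to keep the cells quick.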

In [23]:
# Create a random column of integer ratings from 1 to 5 (randint's upper bound is exclusive, hence 6)
trip_data = pd.DataFrame(np.random.randint(1, 6, size=(10_000_000, 1)), columns=['Rating'])

In [24]:
apply_timeTaken = %timeit -o -r 1 trip_data['target'] = trip_data.Rating.apply(lambda x: 1 if x in [4, 5] else 0)  # create target

1.48 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [25]:
isin_timeTaken = %timeit -o -r 1 trip_data['target'] = trip_data['Rating'].isin([4, 5]).astype(int)

24.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)


In [26]:
print(f"This is {apply_timeTaken.average/isin_timeTaken.average:.2f} times faster.")

This is 60.85 times faster.
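
The %timeit magic is only available in IPython and Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The helper functions `with_apply` and `with_isin`, the sample size, and the choice of three repeats are assumptions made for illustration; absolute timings will differ by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample, much smaller than the notebook's so the script finishes quickly
rng = np.random.default_rng(0)
ratings = pd.Series(rng.integers(1, 6, size=100_000), name="Rating")

def with_apply():
    # Element-wise: the lambda runs once per value
    return ratings.apply(lambda x: 1 if x in [4, 5] else 0)

def with_isin():
    # Vectorized: membership test over the whole Series at once
    return ratings.isin([4, 5]).astype(int)

# repeat=3 gives three independent measurements; number=1 runs the call once per measurement
apply_times = timeit.repeat(with_apply, repeat=3, number=1)
isin_times = timeit.repeat(with_isin, repeat=3, number=1)

# The minimum is the measurement least disturbed by other activity on the machine
speedup = min(apply_times) / min(isin_times)
print(f"isin() is roughly {speedup:.1f} times faster than apply() on this sample")
```

Unlike `%timeit -o`, which reports an average, this sketch compares the fastest run of each approach, so the ratio it prints is not directly comparable to the figure above.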
