<img src="images/logo_city.png" align="right" width="20%">

# Comparing Performance of Pandas and Numpy + Numba

In this tutorial, we will benchmark plain Pandas against Numba-enhanced Numpy.

First, import the needed libraries:

In [1]:
import numpy as np
import pandas as pd
import numba

from numba import jit

### 1. The original slow function

Our journey starts with a simple function that uses Pandas to do row selection:

In [9]:
def gen_snap_layers(table, bound):
    '''
    Generate Past layer, Now layer and Future layer for one snapshot.
    Params:
        table: pandas dataframe
        bound: time bound tuple, for example: (left_timestring, right_timestring)
    Return:
        PNF layers, a tuple: (p_layer, n_layer, f_layer).
    '''
    # left bound and right bound of time interval
    assert isinstance(bound, tuple)
    left = bound[0]
    right = bound[1]
    
    table = table.loc[:, ['tripid',
                          'tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'pulocationid',
                          'dolocationid']]

    # The condition of making snapshot should be:
    # AT LEAST ONE temporal end of a trip should be within the bounds:
    snap = table.loc[
        ((table['tpep_pickup_datetime'] >= left) &
         (table['tpep_pickup_datetime'] < right)) |
        ((table['tpep_dropoff_datetime'] >= left) &
         (table['tpep_dropoff_datetime'] < right))]

    # generate F,P,N layers
    # Use the interval to 'catch' corresponding trips.
    # future layer
    f_layer = snap.loc[(snap['tpep_pickup_datetime'] < right) &
                       (snap['tpep_pickup_datetime'] >= left) &
                       (snap['tpep_dropoff_datetime'] >= right)]
    # past layer
    p_layer = snap.loc[(snap['tpep_pickup_datetime'] < left) &
                       (snap['tpep_dropoff_datetime'] >= left) &
                       (snap['tpep_dropoff_datetime'] < right)]
    # now layer
    n_layer = snap.loc[(snap['tpep_pickup_datetime'] >= left) &
                       (snap['tpep_dropoff_datetime'] < right)]

    # Their count should add up to total trips caught
    assert snap.shape[0] == f_layer.shape[0] + p_layer.shape[0] + n_layer.shape[0]

    return p_layer, n_layer, f_layer

The above function is used to process a .csv file for a specific purpose. For now, that purpose is irrelevant; we only need to know that the function performs several row selections over a large table, which is computationally intensive.

In order to use the function, let's load some sample data:

In [2]:
file = 'dataset/nytaxi_yellow_2017_09.csv'
# this load may take some time
table = pd.read_csv(file)

In [3]:
# the table has 11 columns and MANY MANY rows
table.shape

(8879929, 11)
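
If the CSV file is not at hand, a small synthetic table with the same five columns that `gen_snap_layers` reads can stand in for it. The sketch below is only an illustration: `make_toy_table` is a hypothetical helper, and the row count, date, trip durations and location id range are arbitrary assumptions, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

def make_toy_table(n_rows=1000, seed=0):
    """Build a small synthetic stand-in for the taxi table (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    start = pd.Timestamp('2017-09-18 00:00:00')
    # Pickup times spread over one day, as offsets in whole seconds.
    pickup_offsets = rng.integers(0, 24 * 3600, size=n_rows)
    # Trip durations between 1 minute and 1 hour (assumed range).
    durations = rng.integers(60, 3600, size=n_rows)
    pickup = start + pd.to_timedelta(pickup_offsets, unit='s')
    dropoff = pickup + pd.to_timedelta(durations, unit='s')
    return pd.DataFrame({
        'tripid': np.arange(n_rows),
        'tpep_pickup_datetime': pickup,
        'tpep_dropoff_datetime': dropoff,
        # Arbitrary integer ids, only so the columns exist.
        'pulocationid': rng.integers(1, 266, size=n_rows),
        'dolocationid': rng.integers(1, 266, size=n_rows),
    })

toy = make_toy_table()
```

A table this small is useful for checking that a function runs and returns sensible layers; timings measured on it say little about the full file.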

To call the function, one more argument must be defined.

Again, let's ignore what this argument means in its original context and focus on the code itself:

In [5]:
bound = (pd.Timestamp('2017-09-18 12:00:00'), pd.Timestamp('2017-09-18 12:15:00'))

# convert the two time columns to datetime so they can be compared with the bounds
table['tpep_pickup_datetime'] = pd.to_datetime(table['tpep_pickup_datetime'])
table['tpep_dropoff_datetime'] = pd.to_datetime(table['tpep_dropoff_datetime'])

All preparations are done now. Let's time the function:

In [10]:
%timeit gen_snap_layers(table, bound)

758 ms ± 244 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 2. Rewrite Pandas code with Numpy

Can we improve performance by rewriting the function using Numpy? Let's try!

In [None]:
def gen_snap_layers(table, bound):
    '''
    Generate Past layer, Now layer and Future layer for one snapshot.
    Params:
        table: pandas dataframe
        bound: time bound tuple, for example: (left_timestring, right_timestring)
    Return:
        PNF layers, a tuple: (p_layer, n_layer, f_layer).
    '''
    # left bound and right bound of time interval
    assert isinstance(bound, tuple)
    left = bound[0]
    right = bound[1]
    
    table = table.loc[:, ['tripid',
                          'tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'pulocationid',
                          'dolocationid']]

    # The condition of making snapshot should be:
    # AT LEAST ONE temporal end of a trip should be within the bounds:
    snap = table.loc[
        ((table['tpep_pickup_datetime'] >= left) &
         (table['tpep_pickup_datetime'] < right)) |
        ((table['tpep_dropoff_datetime'] >= left) &
         (table['tpep_dropoff_datetime'] < right))]

    # generate F,P,N layers
    # Use the interval to 'catch' corresponding trips.
    # future layer
    f_layer = snap.loc[(snap['tpep_pickup_datetime'] < right) &
                       (snap['tpep_pickup_datetime'] >= left) &
                       (snap['tpep_dropoff_datetime'] >= right)]
    # past layer
    p_layer = snap.loc[(snap['tpep_pickup_datetime'] < left) &
                       (snap['tpep_dropoff_datetime'] >= left) &
                       (snap['tpep_dropoff_datetime'] < right)]
    # now layer
    n_layer = snap.loc[(snap['tpep_pickup_datetime'] >= left) &
                       (snap['tpep_dropoff_datetime'] < right)]

    # Their count should add up to total trips caught
    assert snap.shape[0] == f_layer.shape[0] + p_layer.shape[0] + n_layer.shape[0]

    return p_layer, n_layer, f_layer
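
A minimal sketch of what a Numpy-based variant might look like: pull the two datetime columns out as arrays once, build boolean masks on those arrays, and index the dataframe only at the end. The name `gen_snap_layers_np` and the tiny five-trip table are assumptions made for illustration; this is one possible rewrite, and whether it is actually faster has to be measured on the real data.

```python
import numpy as np
import pandas as pd

COLS = ['tripid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
        'pulocationid', 'dolocationid']

def gen_snap_layers_np(table, bound):
    """Same P/N/F split as the Pandas version, with masks built on raw arrays."""
    left = pd.Timestamp(bound[0]).to_datetime64()
    right = pd.Timestamp(bound[1]).to_datetime64()

    # Extract the datetime columns once as datetime64 arrays.
    pickup = table['tpep_pickup_datetime'].to_numpy()
    dropoff = table['tpep_dropoff_datetime'].to_numpy()

    # At least one temporal end of a trip must lie within [left, right).
    pu_in = (pickup >= left) & (pickup < right)
    do_in = (dropoff >= left) & (dropoff < right)
    in_snap = pu_in | do_in

    f_mask = pu_in & (dropoff >= right)                        # future layer
    p_mask = (pickup < left) & do_in                           # past layer
    n_mask = in_snap & (pickup >= left) & (dropoff < right)    # now layer

    # The three layers should partition the snapshot.
    assert in_snap.sum() == f_mask.sum() + p_mask.sum() + n_mask.sum()

    sub = table[COLS]
    return sub.loc[p_mask], sub.loc[n_mask], sub.loc[f_mask]

# Tiny hand-made example (assumed data): one trip per case.
trips = pd.DataFrame({
    'tripid': [0, 1, 2, 3, 4],
    'tpep_pickup_datetime': pd.to_datetime([
        '2017-09-18 11:50', '2017-09-18 12:01', '2017-09-18 12:10',
        '2017-09-18 11:00', '2017-09-18 11:00']),
    'tpep_dropoff_datetime': pd.to_datetime([
        '2017-09-18 12:05', '2017-09-18 12:10', '2017-09-18 12:30',
        '2017-09-18 11:30', '2017-09-18 13:00']),
    'pulocationid': [1, 2, 3, 4, 5],
    'dolocationid': [5, 4, 3, 2, 1],
})
bound = (pd.Timestamp('2017-09-18 12:00:00'), pd.Timestamp('2017-09-18 12:15:00'))
p_layer, n_layer, f_layer = gen_snap_layers_np(trips, bound)
```

In this example trip 0 lands in the past layer, trip 1 in the now layer and trip 2 in the future layer; trips 3 and 4 have neither end inside the interval and are not caught.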

### 3. Enhance Numpy with Numba
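
One way this step could be approached, sketched under assumptions rather than taken from a finished implementation: convert the two datetime columns to int64 nanoseconds, then let a compiled loop label every trip in a single pass. The names `label_layers` and `gen_snap_layers_nb` and the label encoding are hypothetical. `njit` is Numba's shorthand for `jit(nopython=True)`; the `try`/`except` fallback is only there so the sketch still runs, uncompiled and slowly, where Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for portability): run as plain Python.
    def njit(func):
        return func

@njit
def label_layers(pickup, dropoff, left, right):
    """Label each trip: 0 = not caught, 1 = past, 2 = now, 3 = future."""
    n = pickup.shape[0]
    labels = np.zeros(n, dtype=np.int8)
    for i in range(n):
        pu_in = pickup[i] >= left and pickup[i] < right
        do_in = dropoff[i] >= left and dropoff[i] < right
        if not (pu_in or do_in):
            continue                      # neither end inside the interval
        if pu_in and dropoff[i] >= right:
            labels[i] = 3                 # future layer
        elif pickup[i] < left and do_in:
            labels[i] = 1                 # past layer
        else:
            labels[i] = 2                 # now layer
    return labels

def gen_snap_layers_nb(table, bound):
    """Hypothetical wrapper: prepare int64 arrays, call the compiled loop."""
    # Timestamp.value is nanoseconds since the epoch, matching datetime64[ns].
    left = pd.Timestamp(bound[0]).value
    right = pd.Timestamp(bound[1]).value
    pickup = table['tpep_pickup_datetime'].to_numpy(dtype='datetime64[ns]').astype(np.int64)
    dropoff = table['tpep_dropoff_datetime'].to_numpy(dtype='datetime64[ns]').astype(np.int64)
    labels = label_layers(pickup, dropoff, left, right)
    sub = table[['tripid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
                 'pulocationid', 'dolocationid']]
    return sub.loc[labels == 1], sub.loc[labels == 2], sub.loc[labels == 3]

# Tiny hand-made example (assumed data): one trip per case.
trips = pd.DataFrame({
    'tripid': [0, 1, 2, 3],
    'tpep_pickup_datetime': pd.to_datetime([
        '2017-09-18 11:50', '2017-09-18 12:01',
        '2017-09-18 12:10', '2017-09-18 11:00']),
    'tpep_dropoff_datetime': pd.to_datetime([
        '2017-09-18 12:05', '2017-09-18 12:10',
        '2017-09-18 12:30', '2017-09-18 11:30']),
    'pulocationid': [1, 2, 3, 4],
    'dolocationid': [4, 3, 2, 1],
})
bound = (pd.Timestamp('2017-09-18 12:00:00'), pd.Timestamp('2017-09-18 12:15:00'))
p_layer, n_layer, f_layer = gen_snap_layers_nb(trips, bound)
```

When timing a function like this, keep in mind that the first call to a Numba-compiled function includes compilation, so it is worth calling it once before running `%timeit`.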

### 4. Discussion