# Benchmarks Comparison &mdash; pandas Versus RAPIDS cuDF

This tutorial uses `timeit` to compare the performance of pandas and RAPIDS cuDF on the same workload.

## System Details

### GPU

In [1]:
!nvidia-smi -q



Timestamp                           : Thu Jan  9 11:56:26 2020
Driver Version                      : 440.31
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:81:00.0
    Product Name                    : Tesla T4
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0561119011981
    GPU UUID                        : GPU-8b4068b3-1bcf-8dbe-978e-8eacb3c22801
    Minor Number                    : 0
    VBIOS Version                   : 90.04.38.00.03
    MultiGPU Board                  : No
    Board ID                        : 0x8100
    GPU Part Number                 : 900-2G183-0000-0

### CPU

In [2]:
!less /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
stepping        : 1
microcode       : 0xb000012
cpu MHz         : 2200.000
cache size      : 25600 KB
physical id     : 0
siblings        : 10
core id         : 0
cpu cores       : 10
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
[K:[K         : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg[7m/proc/cpuinfo[m[K

## Benchmark Setup

### Installations

Install v3io-generator to create a 1 GB data set for the benchmark.<br>
You only need to run the generator once, and then you can reuse the generated data set.

In [3]:
!pip install pytimeparse
!pip install -i https://test.pypi.org/simple/ v3io-generator --upgrade
!pip install faker

Looking in indexes: https://test.pypi.org/simple/
Requirement already up-to-date: v3io-generator in /User/.pythonlibs/jupyter/lib/python3.6/site-packages (0.0.27.dev0)
/bin/sh: pin: command not found


### Imports

In [4]:
import os
import yaml
import time
import datetime
import json
import itertools

# Generator
from v3io_generator import metrics_generator, deployment_generator

# Dataframes
import cudf
import pandas as pd

### Configurations

In [5]:
# Benchmark configurations
metric_names = ['cpu_utilization', 'latency', 'packet_loss', 'throughput']
nlargest = 10
source_file = os.path.join(os.getcwd(), 'data', 'ops.logs') # Use full path


os.environ['SOURCE_PATH'] = source_file                    # Expose for display
os.environ['SOURCE_DIR'] = os.path.dirname(source_file)    # Expose for display
os.environ['SOURCE_FILE'] = os.path.basename(source_file)  # Expose for display

### Create the Data Source

Use v3io-generator to create a time-series network-operations dataset for 100 companies, with four metrics per company (CPU utilization, latency, packet loss, and throughput).<br>
Then, write the dataset to a JSON Lines file (one JSON record per line) to be used as the data source.

In [6]:
# Create a metadata factory
dep_gen = deployment_generator.deployment_generator()
faker = dep_gen.get_faker()

# Design the metadata
dep_gen.add_level(name='company', number=100, level_type=faker.company)

# Generate a deployment structure
deployment_df = dep_gen.generate_deployment()

# Initialize the metric values
for metric in metric_names:
    deployment_df[metric] = 0

deployment_df.head()

|   | company | cpu_utilization | latency | packet_loss | throughput |
|---|---|---|---|---|---|
| 0 | Schaefer__Jones_and_Sanchez | 0 | 0 | 0 | 0 |
| 1 | Odom-Sutton | 0 | 0 | 0 | 0 |
| 2 | Estrada-Grimes | 0 | 0 | 0 | 0 |
| 3 | Gardner-Smith | 0 | 0 | 0 | 0 |
| 4 | Smith_LLC | 0 | 0 | 0 | 0 |


Specify the metrics configuration for the generator.

In [7]:
metrics_configuration = yaml.safe_load("""
errors: {length_in_ticks: 50, rate_in_ticks: 150}
timestamps: {interval: 5s, stochastic_interval: false}
metrics:
  cpu_utilization:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 70, noise: 0, sigma: 10}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  latency:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 5}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  packet_loss:
    accuracy: 0
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 2}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 50, min: 0, validate: true}
  throughput:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 250, noise: 0, sigma: 20}
    is_threshold_below: false
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 300, min: 0, validate: true}
""")

Create the data according to the given hierarchy and metrics configuration.

In [8]:
met_gen = metrics_generator.Generator_df(metrics_configuration, 
                                         user_hierarchy=deployment_df, 
                                         initial_timestamp=time.time())

metrics = met_gen.generate_range(start_time=datetime.datetime.now(),
                                 end_time=datetime.datetime.now()+datetime.timedelta(hours=62),
                                 as_df=True,
                                 as_iterator=False)

# Make sure that the source-file parent directory exists
os.makedirs(os.path.dirname(source_file), exist_ok=True)

# Write the metrics to the source file, one JSON record per line
with open(source_file, 'w') as f:
    metrics.to_json(f,
                    orient='records',
                    lines=True)
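
To make the output format concrete, the following is a minimal sketch of what `to_json` with `orient='records'` and `lines=True` produces, and how such text is read back. The two rows and the company names are made up for illustration and are not generator output.

```python
import io
import json

import pandas as pd

# Two made-up rows (hypothetical values, not generator output)
df = pd.DataFrame({
    'company': ['Acme', 'Globex'],
    'latency': [1.5, 2.5],
    'latency_is_error': [False, True],
})

# orient='records' with lines=True emits one JSON object per row, one row per line
text = df.to_json(orient='records', lines=True)
records = [json.loads(line) for line in text.splitlines() if line]
print(records[0])

# Reading the text back also needs lines=True
restored = pd.read_json(io.StringIO(text), lines=True)
print(restored.shape)
```

Because every line is a complete JSON object, the same `lines=True` argument is what the `read_json` calls in the benchmark cells rely on.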

### Validate the Target File Size

Verify the size of the generated source file.

In [9]:
!ls -lah ${SOURCE_DIR} | grep ${SOURCE_FILE}

-rw-r--r-- 1 root nogroup 1.2G Jan  9 12:05 ops.logs


In [10]:
!head ${SOURCE_PATH}

{"company":"Schaefer__Jones_and_Sanchez","cpu_utilization":60.7249169402,"cpu_utilization_is_error":false,"latency":0.0,"latency_is_error":false,"packet_loss":1.8576310021,"packet_loss_is_error":false,"throughput":266.1555833373,"throughput_is_error":false,"timestamp":1578571120848}
{"company":"Odom-Sutton","cpu_utilization":76.4322140086,"cpu_utilization_is_error":false,"latency":7.8381013211,"latency_is_error":false,"packet_loss":0.0,"packet_loss_is_error":false,"throughput":250.0232627126,"throughput_is_error":false,"timestamp":1578571120848}
{"company":"Estrada-Grimes","cpu_utilization":79.5602560259,"cpu_utilization_is_error":false,"latency":3.8517916739,"latency_is_error":false,"packet_loss":0.2517241329,"packet_loss_is_error":false,"throughput":267.5772519228,"throughput_is_error":false,"timestamp":1578571120848}
{"company":"Gardner-Smith","cpu_utilization":72.8406272809,"cpu_utilization_is_error":false,"latency":0.0,"latency_is_error":false,"packet_loss":2.1089029723,"packet_lo
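
As a rough sanity check on the reported file size, the expected number of records can be estimated from the generation parameters used above (62 hours, a 5-second interval, 100 companies). The sketch assumes that the generator writes one record per company per tick, and that each record takes about 280 bytes, which is an approximation based on the sample lines shown by `head`.

```python
# Generation parameters taken from the cells above
hours = 62
interval_seconds = 5
companies = 100

# Assumption: approximate size of one JSON line, estimated from the sample output
bytes_per_record = 280

# Assumption: one record per company per timestamp tick
ticks = hours * 3600 // interval_seconds
rows = ticks * companies
size_gib = rows * bytes_per_record / 1024**3

print(f"{rows:,} records, about {size_gib:.1f} GiB")
```

Under these assumptions the estimate lands close to the 1.2G that `ls -lah` reports.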

## Benchmark

The benchmark tests use the following flow:

- Read file
- Compute aggregations
- Get the n-largest values
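
The three steps can be tried on a tiny in-memory sample before running them on the full file. This is only an illustrative sketch with pandas: the records and company names are made up, and only two of the four metrics are used to keep the output short.

```python
import io

import pandas as pd

metric_names = ['cpu_utilization', 'latency']  # subset of the metrics, for brevity

# Three made-up records in JSON Lines form (illustrative values only)
sample = io.StringIO(
    '{"company":"Acme","cpu_utilization":60.5,"latency":1.5}\n'
    '{"company":"Globex","cpu_utilization":80.25,"latency":3.5}\n'
    '{"company":"Acme","cpu_utilization":70.5,"latency":2.5}\n'
)

# 1. Read file
df = pd.read_json(sample, lines=True)

# 2. Compute aggregations: min, max, and mean of each metric per company
agg = df.groupby('company').agg({k: ['min', 'max', 'mean'] for k in metric_names})

# 3. Get the n-largest values (from the original DataFrame)
top = df.nlargest(2, 'cpu_utilization')

print(agg)
print(top)
```

The aggregation result has two-level column labels such as `('cpu_utilization', 'mean')`, one pair for each metric and statistic.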

In [11]:
benchmark_file = source_file

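The following examples use the `%%timeit` cell magic, which executes the cell repeatedly and reports the mean and standard deviation of the run times.<br>
To change the number of runs (`-r`) and the number of loops per run (`-n`), pass the relevant options: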
```
%%timeit -n 1 -r 1
```
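
Outside a notebook, the same idea of runs and loops can be sketched with the standard-library `timeit` module. The `work` function is a hypothetical stand-in for the code under test, and the printed summary only imitates the shape of the `%%timeit` report.

```python
import statistics
import timeit


def work():
    # Hypothetical stand-in workload; replace with the code under test
    return sum(i * i for i in range(10_000))


loops = 3  # loops per run, like -n
runs = 5   # number of runs, like -r

# Each entry is the total time of one run, that is, of `loops` calls to work()
totals = timeit.repeat(work, number=loops, repeat=runs)

# Divide by the loop count to get the per-loop time of each run
per_loop = [total / loops for total in totals]
mean = statistics.mean(per_loop)
std = statistics.pstdev(per_loop)

print(f"{mean * 1e3:.3f} ms ± {std * 1e3:.3f} ms per loop "
      f"(mean ± std. dev. of {runs} runs, {loops} loops each)")
```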

### cuDF Benchmark

In [12]:
%%timeit

# Read file
gdf = cudf.read_json(benchmark_file, lines=True)

# Perform aggregation
ggdf = gdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})

# Get the n-largest values (from the original DataFrame)
raw_nlargest = gdf.nlargest(nlargest, 'cpu_utilization')

4.97 s ± 47.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### pandas Benchmark

In [13]:
%%timeit

# Read file
pdf = pd.read_json(benchmark_file, lines=True)

# Perform aggregation
gpdf = pdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})

# Get the n-largest values (from the original DataFrame)
raw_nlargest = pdf.nlargest(nlargest, 'cpu_utilization')

47.9 s ± 2.52 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Test Load Times

#### cuDF

In [15]:
%%timeit
gdf = cudf.read_json(benchmark_file, lines=True)

5.95 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### pandas

In [16]:
%%timeit
pdf = pd.read_json(benchmark_file, lines=True)

41.1 s ± 651 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Test Aggregation

Load the file into memory once with each library, so that `timeit` measures only the aggregations.

In [17]:
gdf = cudf.read_json(benchmark_file, lines=True)
pdf = pd.read_json(benchmark_file, lines=True)

#### cuDF

In [18]:
%%timeit

ggdf = gdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})
raw_nlargest = gdf.nlargest(nlargest, 'cpu_utilization')

212 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### pandas

In [19]:
%%timeit

gpdf = pdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})
raw_nlargest = pdf.nlargest(nlargest, 'cpu_utilization')

2.17 s ± 72.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
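
To summarize the measurements, the mean timings reported in the cell outputs above can be turned into speedup ratios. This sketch only does the arithmetic on those reported means; it ignores the standard deviations and applies only to this specific hardware and dataset.

```python
# Mean timings in seconds, copied from the cell outputs above
timings = {
    'full flow':   {'cudf': 4.97,  'pandas': 47.9},
    'load only':   {'cudf': 5.95,  'pandas': 41.1},
    'aggregation': {'cudf': 0.212, 'pandas': 2.17},
}

# Speedup = pandas time divided by cuDF time
speedups = {name: t['pandas'] / t['cudf'] for name, t in timings.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x")
# full flow: 9.6x
# load only: 6.9x
# aggregation: 10.2x
```

Note that the load-only cuDF timing (5.95 s) is higher than the full-flow cuDF timing (4.97 s), so these ratios are best read as rough indications rather than precise figures.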
