# Benchmark: pandas vs. cuDF
- Using *timeit*

### System details

#### GPU

In [1]:
!nvidia-smi -q



Timestamp                           : Mon Jul 22 11:32:33 2019
Driver Version                      : 418.56
CUDA Version                        : 10.1

Attached GPUs                       : 1
GPU 00000000:00:1E.0
    Product Name                    : Tesla V100-SXM2-16GB
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0323217016780
    GPU UUID                        : GPU-3ec8803d-1d6d-b362-7a9d-57b78fe42967
    Minor Number                    : 0
    VBIOS Version                   : 88.00.4F.00.09
    MultiGPU Board                  : No
    Board ID                        : 0x1e
    GPU Part Number                 : 900-2G

#### CPU

In [2]:
!head -n 19 /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
stepping        : 1
microcode       : 0xb000037
cpu MHz         : 2699.945
cache size      : 46080 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
[K:[K

## Benchmark setup

### Installations
Install v3io-generator, which is used to create the 1 GB dataset for the benchmark.

In [3]:
!pip install -i https://test.pypi.org/simple/ v3io-generator --upgrade

Looking in indexes: https://test.pypi.org/simple/
Requirement already up-to-date: v3io-generator in /User/.pythonlibs/lib/python3.6/site-packages (0.0.27.dev0)


### Imports

In [4]:
import os
import yaml
import time
import datetime
import json
import itertools

# Generator
from v3io_generator import metrics_generator, deployment_generator

# Dataframes
import cudf
import pandas as pd

### Configurations

In [5]:
# Benchmark configurations
metric_names = ['cpu_utilization', 'latency', 'packet_loss', 'throughput']
nlargest = 10
source_file = os.path.join(os.getcwd(), 'data', 'ops.logs') # Use full path


os.environ['SOURCE_PATH'] = source_file                    # Expose for display
os.environ['SOURCE_DIR'] = os.path.dirname(source_file)    # Expose for display
os.environ['SOURCE_FILE'] = os.path.basename(source_file)  # Expose for display

### Create data source
Using v3io-generator, we create a time-series network-operations dataset for 100 companies with 4 metrics (CPU utilization, latency, packet loss, throughput).

We then write the dataset to a JSON Lines file (one record per line), which serves as the benchmark source.

In [6]:
# Create meta-data factory
dep_gen = deployment_generator.deployment_generator()
faker=dep_gen.get_faker()

# Design meta-data
dep_gen.add_level(name='company',number=100,level_type=faker.company)

# Generate deployment structure
deployment_df = dep_gen.generate_deployment()

# Set up initial values (one zero-filled column per metric)
for metric in metric_names:
    deployment_df[metric] = 0

deployment_df.head()

| | company | cpu_utilization | latency | packet_loss | throughput |
|---|---|---|---|---|---|
| 0 | Rios__Pope_and_Baird | 0 | 0 | 0 | 0 |
| 1 | Ross__Calderon_and_Brown | 0 | 0 | 0 | 0 |
| 2 | Jackson_PLC | 0 | 0 | 0 | 0 |
| 3 | Reyes_Group | 0 | 0 | 0 | 0 |
| 4 | Carr-Reyes | 0 | 0 | 0 | 0 |


Specify metrics configuration for the generator

In [8]:
metrics_configuration = yaml.safe_load("""
errors: {length_in_ticks: 50, rate_in_ticks: 150}
timestamps: {interval: 5s, stochastic_interval: false}
metrics:
  cpu_utilization:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 70, noise: 0, sigma: 10}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  latency:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 5}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 100, min: 0, validate: true}
  packet_loss:
    accuracy: 0
    distribution: normal
    distribution_params: {mu: 0, noise: 0, sigma: 2}
    is_threshold_below: true
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 50, min: 0, validate: true}
  throughput:
    accuracy: 2
    distribution: normal
    distribution_params: {mu: 250, noise: 0, sigma: 20}
    is_threshold_below: false
    past_based_value: false
    produce_max: false
    produce_min: false
    validation:
      distribution: {max: 1, min: -1, validate: false}
      metric: {max: 300, min: 0, validate: true}
""")

Create the data according to the given hierarchy and metrics configuration

In [9]:
met_gen = metrics_generator.Generator_df(metrics_configuration, 
                                         user_hierarchy=deployment_df, 
                                         initial_timestamp=time.time())

metrics = met_gen.generate_range(start_time=datetime.datetime.now(),
                                 end_time=datetime.datetime.now()+datetime.timedelta(hours=62),
                                 as_df=True,
                                 as_iterator=False)

# Make sure the source file directory exists
os.makedirs(os.path.dirname(source_file), exist_ok=True)

# Write the generated metrics DataFrame to the source file as JSON Lines
with open(source_file, 'w') as f:
    metrics.to_json(f,
                    orient='records',
                    lines=True)
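
To illustrate the file layout, here is a minimal sketch of a JSON Lines round trip with pandas. The two-row `sample` frame and its company names are made up for illustration; only the `orient='records'` and `lines=True` options come from the cell above.

```python
import io

import pandas as pd

# Hypothetical two-row sample shaped like the generated records
sample = pd.DataFrame({
    "company": ["Acme_Ltd", "Example_PLC"],
    "cpu_utilization": [71.5, 64.25],
    "latency": [3.5, 7.0],
})

# orient='records' with lines=True writes one JSON object per line
buffer = io.StringIO()
sample.to_json(buffer, orient="records", lines=True)
text = buffer.getvalue()
json_lines = text.splitlines()

# lines=True reads the same layout back into a DataFrame
restored = pd.read_json(io.StringIO(text), lines=True)
```

Because each line is a complete JSON object, the file can be inspected with line-oriented tools such as `head`.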

## Target file size validation
Check that the generated test file is close to the target size (about 1 GB).

In [10]:
!ls -lah ${SOURCE_DIR} | grep ${SOURCE_FILE}

-rw-r--r-- 1 50 nogroup 1.2G Aug 13 11:00 ops.logs
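
The same check can be done from Python instead of the shell. This is a sketch only: the helpers `file_size_mb` and `meets_target` are hypothetical names, and the demonstration uses a small temporary file rather than the real dataset.

```python
import os
import tempfile


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def meets_target(path, target_mb):
    """Return True when the file is at least `target_mb` megabytes."""
    return file_size_mb(path) >= target_mb


# Demonstration on a small temporary file of exactly 2 MB
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 * 1024))
    demo_path = tmp.name

demo_size_mb = file_size_mb(demo_path)
demo_ok = meets_target(demo_path, target_mb=1)
os.remove(demo_path)
```

In the notebook, `meets_target(source_file, target_mb=1024)` would be the equivalent call for a 1 GB target.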


In [11]:
!head ${SOURCE_PATH}

{"company":"Parker_and_Sons","cpu_utilization":77.0001379709,"cpu_utilization_is_error":false,"latency":8.9685908315,"latency_is_error":false,"packet_loss":0.1182060132,"packet_loss_is_error":false,"throughput":264.5567821475,"throughput_is_error":false,"timestamp":1565015264829}
{"company":"Barnes-Fletcher","cpu_utilization":82.3951249969,"cpu_utilization_is_error":false,"latency":2.7561547101,"latency_is_error":false,"packet_loss":0.9046704441,"packet_loss_is_error":false,"throughput":260.4376162919,"throughput_is_error":false,"timestamp":1565015264829}
{"company":"Johnson_Ltd","cpu_utilization":74.2230353639,"cpu_utilization_is_error":false,"latency":7.0199027791,"latency_is_error":false,"packet_loss":0.0,"packet_loss_is_error":false,"throughput":245.2035029337,"throughput_is_error":false,"timestamp":1565015264829}
{"company":"Cameron_Ltd","cpu_utilization":61.9061750617,"cpu_utilization_is_error":false,"latency":6.8589515103,"latency_is_error":false,"packet_loss":0.0,"packet_loss_i

## Benchmark

### Flow
- Read file
- Compute aggregations
- Get nlargest()
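
As a rough illustration of the last two steps, the sketch below runs the aggregation and `nlargest()` on a tiny in-memory pandas frame. The data is hypothetical and the read step is skipped; the calls mirror the ones timed in the benchmark cells.

```python
import pandas as pd

# Hypothetical miniature dataset with two of the benchmark's metrics
df = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "cpu_utilization": [70.0, 80.0, 60.0, 90.0],
    "latency": [1.0, 3.0, 2.0, 4.0],
})
metric_names = ["cpu_utilization", "latency"]

# Per-company min, max and mean for every metric
aggregated = df.groupby(["company"]).agg(
    {name: ["min", "max", "mean"] for name in metric_names}
)

# The n rows with the highest cpu_utilization, taken from the original frame
top_rows = df.nlargest(2, "cpu_utilization")
```

The aggregation result has a two-level column index, so a single value is addressed as, for example, `aggregated.loc["A", ("cpu_utilization", "mean")]`.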

In [12]:
benchmark_file = source_file

#### cudf

In [13]:
%%timeit

# Read file
gdf = cudf.read_json(benchmark_file, lines=True)

# Perform aggregation
ggdf = gdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})

# Get N Largest (From original df)
raw_nlargest = gdf.nlargest(nlargest, 'cpu_utilization')

6.24 s ± 62.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Pandas

In [14]:
%%timeit

# Read file
pdf = pd.read_json(benchmark_file, lines=True)

# Perform aggregation
gpdf = pdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})

# Get N Largest (From original df)
raw_nlargest = pdf.nlargest(nlargest, 'cpu_utilization')

47.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
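
Outside a notebook, a similar "7 runs, 1 loop each" measurement can be approximated with the standard `timeit` module. This is a sketch under assumptions: `workload` is a hypothetical placeholder for the code being measured, and the population standard deviation is used as the spread, which may not match how `%%timeit` computes its figure.

```python
import statistics
import timeit


def workload():
    """Placeholder workload; swap in the read/aggregate steps being measured."""
    return sum(i * i for i in range(10_000))


# 7 runs of 1 loop each, the same shape as the %%timeit summaries
run_times = timeit.repeat(workload, repeat=7, number=1)

mean_s = statistics.mean(run_times)
std_s = statistics.pstdev(run_times)
summary = f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop (7 runs, 1 loop each)"
print(summary)
```

With `number=1`, each run includes any one-off costs such as cold caches, which matters for the multi-second file reads measured here.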


## Test loading times

#### cudf

In [15]:
%%timeit
gdf = cudf.read_json(benchmark_file, lines=True)

5.95 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Pandas

In [16]:
%%timeit
pdf = pd.read_json(benchmark_file, lines=True)

41.1 s ± 651 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Test aggregation
Load the file into memory first, so that %timeit measures only the aggregations

In [17]:
gdf = cudf.read_json(benchmark_file, lines=True)
pdf = pd.read_json(benchmark_file, lines=True)

#### cudf

In [18]:
%%timeit

ggdf = gdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})
raw_nlargest = gdf.nlargest(nlargest, 'cpu_utilization')

212 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Pandas

In [19]:
%%timeit

gpdf = pdf.groupby(['company']).\
            agg({k: ['min', 'max', 'mean'] for k in metric_names})
raw_nlargest = pdf.nlargest(nlargest, 'cpu_utilization')

2.17 s ± 72.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
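
For a quick comparison, the sketch below derives speedup factors from the mean timings reported above. The numbers are copied from the `%%timeit` outputs; the stage labels are chosen here for illustration.

```python
# Mean timings in seconds, copied from the %%timeit outputs in this notebook
timings = {
    "full flow": {"cudf": 6.24, "pandas": 47.6},
    "loading": {"cudf": 5.95, "pandas": 41.1},
    "aggregation": {"cudf": 0.212, "pandas": 2.17},
}

# Speedup = pandas time / cudf time, rounded to one decimal place
speedups = {
    stage: round(t["pandas"] / t["cudf"], 1) for stage, t in timings.items()
}

for stage, factor in speedups.items():
    print(f"{stage}: cudf is about {factor}x faster than pandas")
```

On this machine the ratios come out to roughly 7.6x for the full flow, 6.9x for loading, and 10.2x for aggregation. In both libraries, reading the file accounts for most of the full-flow time.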
