## Flights Agg analysis - STLmap and noavx configurations

First, let's start with the basics:


Setup:
- Single `m5.large` instance
- Everything runs single-threaded
- C++ dummy code using the csvmonkey parser
- Run over the full flight data (1987 - 2021)

For time reasons, only 2003 is analyzed, because the switch between specialized and general occurs in May 2003.

So let's crunch the numbers.

NOTE: the results shown here were measured with the CSV preloaded into memory first.

In [1]:
!tar xf results-agg-experiment-month.tar.gz

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import glob
import os
import json

In [3]:
def load_to_df(logs_path, name=None, drop_first_run=True):
    # file names are parsed as flights-<type>-run-<run>-date-<yyyymm>.txt
    paths = glob.glob(os.path.join(logs_path, '*.txt'))
    rows = []
    for path in paths:
        try:
            # use a separate variable so the `name` argument is not overwritten
            fname = os.path.basename(path)
            t = fname[:fname.find('-run')].replace('flights-', '')
            r = int(fname[fname.find('run-')+4:fname.rfind('-date')])

            yearmonth = int(fname[fname.find('date-')+5:].replace('.txt', ''))
            year = yearmonth // 100
            month = yearmonth % 100
            row = {'type' : t, 'run' : r, 'year' : year, 'month' : month}

            with open(path, 'r') as fp:
                lines = fp.readlines()

            # the last line of each log holds the timings as a JSON object
            row.update({'t_' + k : v for k, v in json.loads(lines[-1]).items()})
            rows.append(row)
        except Exception as e:
            # report the reason as well, not just the offending path
            print('ERROR: {}: {}'.format(path, e))
    df = pd.DataFrame(rows)

    if drop_first_run:
        # drop the first run, important because it's loading the file from EBS to RAM
        df = df[df['run'] != 1]

    if name is not None:
        df['name'] = name
    return df
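
The string slicing in `load_to_df` implies a file name layout of `flights-<type>-run-<run>-date-<yyyymm>.txt`. As a minimal sketch, assuming that layout holds for every log, the same fields could be extracted with one regular expression. `parse_log_name` and `LOG_NAME_RE` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import re

# Assumed file name layout, inferred from the slicing in load_to_df:
#   flights-<type>-run-<run>-date-<yyyymm>.txt
LOG_NAME_RE = re.compile(
    r'^flights-(?P<type>.+)-run-(?P<run>\d+)-date-(?P<yearmonth>\d+)\.txt$')

def parse_log_name(fname):
    """Hypothetical helper: split a log file name into type, run, year and month."""
    m = LOG_NAME_RE.match(fname)
    if m is None:
        # fail loudly instead of silently producing wrong slices
        raise ValueError('unexpected log file name: {}'.format(fname))
    yearmonth = int(m.group('yearmonth'))
    return {'type': m.group('type'),
            'run': int(m.group('run')),
            'year': yearmonth // 100,
            'month': yearmonth % 100}
```

Compared with chained `find` calls, a non-matching name raises a clear error instead of yielding a wrong slice.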

In [4]:
df = load_to_df('results-agg-experiment-month/logs/', 'month')

In [5]:
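# average over runs; numeric_only skips the string column 'name'
gdf = df.groupby(['type', 'year', 'month']).mean(numeric_only=True).reset_index()
gdf = gdf.sort_values(by=['year', 'month', 'type']).reset_index(drop=True)

# select by type instead of relying on rows alternating general/specialized
gdf_general = gdf[gdf['type'] == 'general']
gdf_special = gdf[gdf['type'] == 'specialized']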

In [6]:
gdf

| | type | year | month | run | t_output | t_transform | t_total |
|---|---|---|---|---|---|---|---|
| 0 | general | 2003 | 4 | 6.5 | 0.000105 | 1.061554 | 1.06166 |
| 1 | specialized | 2003 | 4 | 6.5 | 0.000291 | 1.014464 | 1.014752 |


Looking at the numbers, there is basically no significant difference: specialized is only about 5% faster in `t_transform` (1.014 s vs. 1.062 s).

(Note: for time reasons, the numbers above are averages of only 2 runs.)
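
To make the size of the gap concrete, the following sketch computes the relative difference directly from the two mean `t_transform` values in the table above. It assumes the timings are in seconds, as the formatting in the notebook suggests. The variable names are illustrative only.

```python
# mean t_transform for April 2003, copied from the table above (assumed seconds)
t_general = 1.061554
t_specialized = 1.014464

# > 1 means specialized is faster
speedup = t_general / t_specialized
# share of the general transform time saved by the specialized path
saved_pct = 100.0 * (t_general - t_specialized) / t_general

print('speedup: {:.3f}x, time saved: {:.1f}%'.format(speedup, saved_pct))
```

This yields a speedup of roughly 1.05x, i.e. a saving of about 4.4% of the general transform time.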

In [7]:
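# how much faster is specialized in total?

def calc_speedup(df):
    # speedup = ratio of summed transform times (general / specialized);
    # this equals the ratio of the means when both types have the same number of runs
    g_t = df[df['type'] == 'general']['t_transform'].sum()
    s_t = df[df['type'] == 'specialized']['t_transform'].sum()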
    
    return 'specialized:\t{:.2f}s\ngeneral:\t{:.2f}s\nspeedup:\t{:.2f}x'.format(s_t, g_t, g_t / s_t)

print('specialized month vs. rest:\n---\n{}'.format(calc_speedup(df)))
print()

specialized month vs. rest:
---
specialized:	10.14s
general:	10.62s
speedup:	1.05x
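
Since the switch between specialized and general happens at a month boundary, the same ratio may be more informative when computed per month rather than over all rows. The following is a minimal sketch using only the standard library; `speedup_by_month` is a hypothetical helper, not part of the notebook, and it assumes each row carries the keys `type`, `year`, `month` and `t_transform` as produced by `load_to_df`.

```python
from collections import defaultdict

def speedup_by_month(rows):
    """Hypothetical helper: general/specialized ratio of summed t_transform per (year, month)."""
    totals = defaultdict(lambda: {'general': 0.0, 'specialized': 0.0})
    for row in rows:
        if row['type'] in ('general', 'specialized'):
            totals[(row['year'], row['month'])][row['type']] += row['t_transform']
    # skip months without specialized time to avoid dividing by zero
    return {ym: t['general'] / t['specialized']
            for ym, t in sorted(totals.items()) if t['specialized'] > 0}
```

With the DataFrame from above, this could be called as `speedup_by_month(df.to_dict('records'))`.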



---
(c) 2017 - 2022 Tuplex authors