# Benchmarks

These benchmarks seek to establish the performance of tablite as a user sees it.

Overview

**Input/Output:**

- Save / Load .tpz format
- Save tables to various formats
- Import data from various formats

**Various column functions:**

- Setitem / getitem
- iter
- equal, not equal
- copy
- t += t
- t *= t
- contains
- remove all
- replace
- index
- unique
- histogram
- statistics
- count


**Various table functions:**

- **base**
  - Setitem / getitem
  - iter / rows
  - equal, not equal
  - load
  - save
  - copy
  - stack
  - types
  - display_dict
  - show
  - to_dict
  - as_json_serializable
  - index
- **core**
  - expression
  - filter
  - sort_index
  - reindex
  - drop_duplicates
  - sort
  - is_sorted
  - any
  - all
  - drop 
  - replace
  - groupby
  - pivot
  - joins
  - lookup
  - replace missing values
  - transpose
  - pivot_transpose
  - diff






In [1]:
from tablite import Table
from tablite.datasets import synthetic_order_data
import psutil, os, gc
import tempfile
from pathlib import Path
from time import perf_counter, time
from tablite.config import Config

### Create tables from synthetic data.

In [2]:
process = psutil.Process(os.getpid())

# The largest tables (~24 GB) do not fit in RAM, so each one is built by appending subtables of Config.PAGE_SIZE (1M) rows.
t = synthetic_order_data(Config.PAGE_SIZE)
real, flat = t.nbytes()
print(f"Table {len(t):,} rows is {real/1e6:,.0f} Mb on disk")

tables = [t]  # 1M rows.

for i in [2,5,10,50,100]:
    t2 = synthetic_order_data(Config.PAGE_SIZE)
    for _ in range(i-1):
        t2 += synthetic_order_data(Config.PAGE_SIZE)  # these are all unique
    real, flat = t2.nbytes()
    tables.append(t2)
    print(f"Table {len(t2):,} rows is {real/1e6:,.0f} Mb on disk")

tables[-1].show()


Table 1,000,000 rows is 240 Mb on disk
Table 2,000,000 rows is 480 Mb on disk
Table 5,000,000 rows is 1,200 Mb on disk
Table 10,000,000 rows is 2,400 Mb on disk
Table 50,000,000 rows is 12,000 Mb on disk
Table 100,000,000 rows is 24,000 Mb on disk
|     ~     |   #   |      1      |         2         |  3  | 4 |  5  | 6  | 7 | 8  | 9 |         10         |        11        |
+-----------+-------+-------------+-------------------+-----+---+-----+----+---+----+---+--------------------+------------------+
|          0|      1|1897876237916|2021-11-24T00:00:00|50868|  0| 8387|C4-3|WGA|21° |   |  1.5878192881433402|9.046682540914231 |
|          1|      2|2191019820422|2021-09-25T00:00:00|50164|  1|11017|C2-1|NRK|None|ABC|0.010731545998969043|15.3822483763811  |
|          2|      3| 466194288952|2021-11-09T00:00:00|50491|  1| 4773|C3-5|FCD|21° |XYZ|  2.0598485514016818|3.383346544302269 |
|          3|      4| 462279372300|2021-10-03T00:00:00|50056|  1|15795|C1-4|DGJ|0°  |XYZ|   1.17253016

The values in the tables above are all unique!

### Save / Load .tpz format

Without compression (fastest)

In [3]:
tmp = Path(tempfile.gettempdir()) / "junk"
tmp.mkdir(exist_ok=True)

results = Table()
results.add_columns('rows', 'save (sec)', 'load (sec)')
for t in tables:
    fn = tmp / f'{len(t)}.tpz'
    start = perf_counter()
    t.save(fn)
    end = perf_counter()
    save = round(end-start,3)
    assert fn.exists()
    print(f"saving {len(t):,} rows ({fn.stat().st_size/1e6:,.0f} Mb) took {end-start:,} seconds")
    
    start = perf_counter()
    t2 = Table.load(fn)
    end = perf_counter()
    load = round(end-start,3)
    print(f"loading {len(t2):,} rows took {end-start:,} seconds")
    del t2
    fn.unlink()
    results.add_rows(len(t), save, load)

saving 1,000,000 rows (240 Mb) took 0.4755823000014061 seconds

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 21.45it/s]

loading 1,000,000 rows took 0.5728237000002991 seconds
saving 2,000,000 rows (480 Mb) took 0.9596193000033963 seconds

importing '2000000.tpz' file: 100%|██████████| 24/24 [00:01<00:00, 20.00it/s]

loading 2,000,000 rows took 1.2175401999993483 seconds
saving 5,000,000 rows (1,200 Mb) took 2.417090199996892 seconds

importing '5000000.tpz' file: 100%|██████████| 60/60 [00:03<00:00, 19.67it/s]

loading 5,000,000 rows took 3.0794375999976182 seconds
saving 10,000,000 rows (2,400 Mb) took 4.631444800004829 seconds

importing '10000000.tpz' file: 100%|██████████| 120/120 [00:06<00:00, 19.80it/s]

loading 10,000,000 rows took 6.088362100002996 seconds
saving 50,000,000 rows (12,000 Mb) took 24.71230110000033 seconds

importing '50000000.tpz' file: 100%|██████████| 600/600 [00:34<00:00, 17.40it/s]

loading 50,000,000 rows took 34.57949820000067 seconds
saving 100,000,000 rows (24,000 Mb) took 48.25354669999797 seconds

importing '100000000.tpz' file: 100%|██████████| 1200/1200 [01:18<00:00, 15.24it/s]

loading 100,000,000 rows took 78.92228220000106 seconds


In [4]:
results['save r/sec'] = [int(a/b) if b!=0  else "nil" for a,b in zip(results['rows'], results['save (sec)']) ]
results['load r/sec'] = [int(a/b) if b!=0  else "nil" for a,b in zip(results['rows'], results['load (sec)'])]
results

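| # | rows | save (sec) | load (sec) | save r/sec | load r/sec |
|---|------|------------|------------|------------|------------|
| 0 | 1000000 | 0.476 | 0.573 | 2100840 | 1745200 |
| 1 | 2000000 | 0.96 | 1.218 | 2083333 | 1642036 |
| 2 | 5000000 | 2.417 | 3.079 | 2068680 | 1623903 |
| 3 | 10000000 | 4.631 | 6.088 | 2159360 | 1642575 |
| 4 | 50000000 | 24.712 | 34.579 | 2023308 | 1445964 |
| 5 | 100000000 | 48.254 | 78.922 | 2072367 | 1267073 |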
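
The timing and throughput pattern used in the cells above can be isolated into two small helpers. This is a minimal sketch using only the standard library; the names `timed` and `rows_per_second` are hypothetical and not part of tablite. It follows the same convention as the benchmark: elapsed time from `perf_counter`, and `"nil"` when the elapsed time is zero.

```python
from time import perf_counter


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - start


def rows_per_second(rows, seconds):
    """Whole rows per second, or "nil" when no time was measured."""
    return int(rows / seconds) if seconds != 0 else "nil"


# Recompute the first row of the table from its recorded timings.
print(rows_per_second(1_000_000, 0.476))  # 2100840
print(rows_per_second(1_000_000, 0.573))  # 1745200

# timed() wraps any callable; sum() stands in for t.save / Table.load here.
total, seconds = timed(sum, range(1_000))
print(total, rows_per_second(1_000, seconds))
```

Truncating with `int` rather than rounding matches the list comprehensions in the cell above, so the recomputed values agree with the recorded columns.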


With various compression options

In [5]:
tmp = Path(tempfile.gettempdir()) / "junk"
tmp.mkdir(exist_ok=True)

t = tables[0]  # 1M rows

import zipfile  # https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile
methods = [(None, zipfile.ZIP_STORED, "zip stored"), (None, zipfile.ZIP_LZMA, "zip lzma")]
methods += [(i, zipfile.ZIP_DEFLATED, "zip deflated") for i in range(0,10)]
methods += [(i, zipfile.ZIP_BZIP2, "zip bzip2") for i in range(1,10)]

results = Table()
results.add_columns('file size (Mb)', 'method', 'write (sec)', 'read (sec)')
for level, method, name in methods:
    fn = tmp / f'{len(t)}.tpz'
    start = perf_counter()  
    t.save(fn, compression_method=method, compression_level=level)
    end = perf_counter()
    write = round(end-start,3)
    assert fn.exists()
    size = int(fn.stat().st_size/1e6)
    print(f"saving {len(t):,} rows ({size} Mb) took {write} seconds with {name}(level={level})")
    
    start = perf_counter()
    t2 = Table.load(fn)
    end = perf_counter()
    read = round(end-start,3)
    print(f"loading {len(t2):,} rows took {end-start:,} seconds")
    
    del t2
    fn.unlink()
    results.add_rows(size, f"{name}(level={level})", write, read)

saving 1,000,000 rows (240 Mb) took 0.487 seconds with zip stored(level=None)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 21.07it/s]

loading 1,000,000 rows took 0.5881239999944228 seconds
saving 1,000,000 rows (29 Mb) took 100.386 seconds with zip lzma(level=None)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:02<00:00,  5.52it/s]

loading 1,000,000 rows took 2.2067826999991667 seconds
saving 1,000,000 rows (240 Mb) took 0.485 seconds with zip deflated(level=0)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 17.77it/s]

loading 1,000,000 rows took 0.6990915999995195 seconds
saving 1,000,000 rows (48 Mb) took 2.072 seconds with zip deflated(level=1)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 12.34it/s]

loading 1,000,000 rows took 1.0050537999995868 seconds
saving 1,000,000 rows (46 Mb) took 2.209 seconds with zip deflated(level=2)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 12.21it/s]

loading 1,000,000 rows took 0.9996115999965696 seconds
saving 1,000,000 rows (43 Mb) took 3.056 seconds with zip deflated(level=3)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 12.47it/s]

loading 1,000,000 rows took 0.9797398999944562 seconds
saving 1,000,000 rows (43 Mb) took 3.047 seconds with zip deflated(level=4)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 13.09it/s]

loading 1,000,000 rows took 0.9479739000016707 seconds
saving 1,000,000 rows (42 Mb) took 4.622 seconds with zip deflated(level=5)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 13.12it/s]

loading 1,000,000 rows took 0.9351310000056401 seconds
saving 1,000,000 rows (39 Mb) took 8.494 seconds with zip deflated(level=6)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 14.97it/s]

loading 1,000,000 rows took 0.8191517000013846 seconds
saving 1,000,000 rows (39 Mb) took 13.968 seconds with zip deflated(level=7)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 12.85it/s]

loading 1,000,000 rows took 0.957268699996348 seconds
saving 1,000,000 rows (38 Mb) took 55.835 seconds with zip deflated(level=8)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 14.15it/s]

loading 1,000,000 rows took 0.8678541999979643 seconds
saving 1,000,000 rows (37 Mb) took 115.164 seconds with zip deflated(level=9)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:00<00:00, 13.79it/s]

loading 1,000,000 rows took 0.8996915000025183 seconds
saving 1,000,000 rows (29 Mb) took 14.368 seconds with zip bzip2(level=1)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:03<00:00,  3.13it/s]

loading 1,000,000 rows took 3.8538355999990017 seconds
saving 1,000,000 rows (29 Mb) took 15.24 seconds with zip bzip2(level=2)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:04<00:00,  2.99it/s]

loading 1,000,000 rows took 4.038587100003497 seconds
saving 1,000,000 rows (29 Mb) took 16.722 seconds with zip bzip2(level=3)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:04<00:00,  2.75it/s]

loading 1,000,000 rows took 4.371564500004752 seconds
saving 1,000,000 rows (29 Mb) took 17.355 seconds with zip bzip2(level=4)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:04<00:00,  2.57it/s]

loading 1,000,000 rows took 4.6704499999978 seconds
saving 1,000,000 rows (29 Mb) took 17.892 seconds with zip bzip2(level=5)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:05<00:00,  2.31it/s]

loading 1,000,000 rows took 5.212135200003104 seconds
saving 1,000,000 rows (29 Mb) took 18.313 seconds with zip bzip2(level=6)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:05<00:00,  2.29it/s]

loading 1,000,000 rows took 5.262236700000358 seconds
saving 1,000,000 rows (29 Mb) took 19.496 seconds with zip bzip2(level=7)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:05<00:00,  2.21it/s]

loading 1,000,000 rows took 5.467584300000453 seconds
saving 1,000,000 rows (29 Mb) took 20.858 seconds with zip bzip2(level=8)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:05<00:00,  2.31it/s]

loading 1,000,000 rows took 5.222707900000387 seconds
saving 1,000,000 rows (29 Mb) took 20.54 seconds with zip bzip2(level=9)

importing '1000000.tpz' file: 100%|██████████| 12/12 [00:05<00:00,  2.09it/s]

loading 1,000,000 rows took 5.758435000003374 seconds





In [6]:
results.sort({'write (sec)':True})
load_and_save_results = results.copy()
load_and_save_results

creating sort index: 100%|██████████| 1/1 [00:00<?, ?it/s]


| # | file size (Mb) | method | write (sec) | read (sec) |
|---|----------------|--------|-------------|------------|
| 0 | 37 | zip deflated(level=9) | 115.164 | 0.9 |
| 1 | 29 | zip lzma(level=None) | 100.386 | 2.207 |
| 2 | 38 | zip deflated(level=8) | 55.835 | 0.868 |
| 3 | 29 | zip bzip2(level=8) | 20.858 | 5.223 |
| 4 | 29 | zip bzip2(level=9) | 20.54 | 5.758 |
| 5 | 29 | zip bzip2(level=7) | 19.496 | 5.468 |
| 6 | 29 | zip bzip2(level=6) | 18.313 | 5.262 |
| ... | ... | ... | ... | ... |
| 14 | 42 | zip deflated(level=5) | 4.622 | 0.935 |
| 15 | 43 | zip deflated(level=3) | 3.056 | 0.98 |
| 16 | 43 | zip deflated(level=4) | 3.047 | 0.948 |
| 17 | 46 | zip deflated(level=2) | 2.209 | 1.0 |
| 18 | 48 | zip deflated(level=1) | 2.072 | 1.005 |
| 19 | 240 | zip stored(level=None) | 0.487 | 0.588 |
| 20 | 240 | zip deflated(level=0) | 0.485 | 0.699 |
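
The size/time trade-off between compression methods can be explored without tablite, using the same `zipfile` constants on an in-memory archive. The sketch below is an illustration under assumptions: the helper names `zip_size_and_time` and `unzip` and the payload are hypothetical, and it does not claim to reproduce how `Table.save` writes a `.tpz` file. In `zipfile`, the compression level has no effect for `ZIP_STORED` and `ZIP_LZMA`, which is why those two methods are benchmarked above with `level=None`.

```python
import io
import zipfile
from time import perf_counter


def zip_size_and_time(payload: bytes, method: int, level=None):
    """Write payload into an in-memory zip archive; return (archive_bytes, seconds)."""
    buf = io.BytesIO()
    start = perf_counter()
    with zipfile.ZipFile(buf, "w", compression=method, compresslevel=level) as zf:
        zf.writestr("data.bin", payload)
    return buf.getvalue(), perf_counter() - start


def unzip(archive: bytes) -> bytes:
    """Read the single member back out of the in-memory archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read("data.bin")


# Hypothetical payload: highly repetitive text, so it compresses far better
# than the unique synthetic order data used in the benchmark.
payload = b"2021-11-24T00:00:00,50868,0,8387,C4-3,WGA\n" * 50_000

for name, method, level in [
    ("zip stored", zipfile.ZIP_STORED, None),
    ("zip deflated", zipfile.ZIP_DEFLATED, 1),
    ("zip deflated", zipfile.ZIP_DEFLATED, 9),
    ("zip bzip2", zipfile.ZIP_BZIP2, 9),
    ("zip lzma", zipfile.ZIP_LZMA, None),
]:
    archive, seconds = zip_size_and_time(payload, method, level)
    assert unzip(archive) == payload  # the round trip must be lossless
    print(f"{name}(level={level}): {len(archive):,} bytes in {seconds:.3f} s")
```

Absolute numbers from this sketch are not comparable with the table above, since the payload and the storage layout differ. It is only meant for checking the relative ordering of methods on your own data.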


### Save / load tables to / from various formats

The handlers for saving / export are:

- to_sql
- to_json
- to_xlsx
- to_ods
- to_csv
- to_tsv
- to_text
- to_html
- to_hdf5


In [7]:
n_rows = 10_000_000
L = [t for t in tables if len(t)>=n_rows]
t = L[0]

tmp = Path(tempfile.gettempdir()) / "junk"
tmp.mkdir(exist_ok=True)

In [8]:
tmp = Path(tempfile.gettempdir()) / "junk"
tmp.mkdir(exist_ok=True)

results = Table()
results.add_columns('method', 'write (s)', 'read (s)', 'rows', 'size (Mb)', 'config')

In [9]:
def to_sql_benchmark(t, rows=1_000_000):
    t2 = t[:rows]
    write_start = time()
    _ = t2.to_sql(name='1')
    write_end = time()
    write = round(write_end-write_start,3)
    # Note: this records len(t), the full input table, not len(t2), the rows actually written.
    results.add_rows( t.to_sql.__name__, write, 0, len(t), "" , "" ) 

to_sql_benchmark(t)

In [10]:
def to_json_benchmark(t, rows=1_000_000):
    t2 = t[:rows]
    path = tmp / "1.json"    
    write_start = time()
    bytestr = t2.to_json()
    with path.open('w') as fo:
        fo.write(bytestr)
    write_end = time()
    write = round(write_end-write_start,3)

    read_start = time()
    with path.open('r') as fi:
        _ = Table.from_json(fi.read())  # <-- JSON
    read_end = time()
    read = round(read_end-read_start,3)

    # Note: this records len(t), the full input table, not len(t2), the rows actually written.
    results.add_rows( t.to_json.__name__, write, read, len(t), int(path.stat().st_size/1e6), "" ) 

to_json_benchmark(t)

In [None]:
def f(results, t, args):
    rows, c1, c1_kw, c2, c2_kw = args
    t2 = t[:rows]

    call = getattr(t2, c1)
    assert callable(call)

    write_start = time()
    call(**c1_kw)
    write_end = time()
    write = round(write_end-write_start,3)

    for _ in range(10):
        gc.collect()

    read_start = time()
    if callable(c2):
        c2(**c2_kw)
    read_end = time()
    read = round(read_end-read_start,3)

    fn = c2_kw['path']
    assert fn.exists()
    fs = int(fn.stat().st_size/1e6)
    config = {k:v for k,v in c2_kw.items() if k!= 'path'}

    results.add_rows( c1, write, read, len(t2), fs , str(config))

args = [
    (   100_000, "to_xlsx", {'path': tmp/'1.xlsx'}, Table.from_file, {"path":tmp/'1.xlsx', "sheet":"pyexcel_sheet1"}),
    (   100_000,  "to_ods",  {'path': tmp/'1.ods'}, Table.from_file, {"path":tmp/'1.ods', "sheet":"pyexcel_sheet1"} ),
    ( 1_000_000,  "to_csv",  {'path': tmp/'1.csv'}, Table.from_file, {"path":tmp/'1.csv'}                           ),
    ( 1_000_000,  "to_csv",  {'path': tmp/'1.csv'}, Table.from_file, {"path":tmp/'1.csv', "guess_datatypes":False}),
    (10_000_000,  "to_csv",  {'path': tmp/'1.csv'}, Table.from_file, {"path":tmp/'1.csv', "guess_datatypes":False}),
    ( 1_000_000,  "to_tsv",  {'path': tmp/'1.tsv'}, Table.from_file, {"path":tmp/'1.tsv'}                           ),
    ( 1_000_000, "to_text",  {'path': tmp/'1.txt'}, Table.from_file, {"path":tmp/'1.txt'}                           ),
    ( 1_000_000, "to_html", {'path': tmp/'1.html'}, Table.from_file, {"path":tmp/'1.html'}                          ),
    ( 1_000_000, "to_hdf5", {'path': tmp/'1.hdf5'}, Table.from_file, {"path":tmp/'1.hdf5'}                          )
]

Config.PROCESSING_MODE = Config.FALSE
for arg in args:
    assert len(t)>=arg[0]
    print(arg[1], arg[0])
    f(results, t, arg)

import shutil
shutil.rmtree(tmp)

In [12]:
results['read r/sec'] = [int(a/b) if b!=0  else "nil" for a,b in zip(results['rows'], results['read (s)']) ]
results['write r/sec'] = [int(a/b) if b!=0  else "nil" for a,b in zip(results['rows'], results['write (s)'])]

In [13]:
t

| ~ | # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|
| 0 | 1 | 542679287742 | 2021-12-23T00:00:00 | 50095 | 1 | 17266 | C1-1 | XYM | 0° | ABC | 1.8986304837702648 | 15.097150545026425 |
| 1 | 2 | 2191550797369 | 2021-09-10T00:00:00 | 50436 | 1 | 5417 | C5-2 | FHD | None | | 1.043372674764467 | 24.25713722018791 |
| 2 | 3 | 351721316579 | 2021-10-18T00:00:00 | 50751 | 0 | 9423 | C2-2 | TSK | 6° | | 1.7526701991849603 | 7.297518443741395 |
| 3 | 4 | 1831394352386 | 2021-11-14T00:00:00 | 50754 | 0 | 16770 | C1-2 | DQF | None | XYZ | 1.9261726704599242 | 18.82744378973711 |
| 4 | 5 | 472316342615 | 2021-10-21T00:00:00 | 50292 | 1 | 10201 | C1-4 | RYO | 6° | ABC | 2.028002535079511 | 22.61428795844147 |
| 5 | 6 | 1553288002456 | 2021-09-14T00:00:00 | 50114 | 0 | 29954 | C5-2 | NOK | 6° | XYZ | 0.8669425726157858 | 24.875695709430403 |
| 6 | 7 | 636893104834 | 2021-11-25T00:00:00 | 50101 | 1 | 27681 | C1-4 | RQK | 21° | ABC | 2.0567125627667115 | 7.112373810350086 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9,999,993 | 999994 | 2197750112850 | 2021-09-19T00:00:00 | 50397 | 1 | 10238 | C4-4 | EWQ | 6° | | 1.612314295262871 | 15.466396322377262 |
| 9,999,994 | 999995 | 1584493442566 | 2021-12-18T00:00:00 | 50869 | 0 | 23183 | C2-3 | OHN | None | ABC | 1.013647495864597 | 19.50264374400631 |
| 9,999,995 | 999996 | 168390943274 | 2021-10-24T00:00:00 | 50494 | 1 | 8329 | C1-4 | CNH | 0° | ABC | 1.8819334352246124 | 16.572654053504113 |
| 9,999,996 | 999997 | 1378848691555 | 2021-07-30T00:00:00 | 50875 | 1 | 26599 | C5-4 | EGG | None | XYZ | 1.5002822943739924 | 22.460765429830104 |
| 9,999,997 | 999998 | 601550942641 | 2021-12-16T00:00:00 | 50536 | 1 | 13275 | C1-4 | RPT | None | XYZ | 1.3817648802362876 | 14.110618856992053 |
| 9,999,998 | 999999 | 1301329537489 | 2021-10-21T00:00:00 | 50095 | 0 | 6801 | C5-3 | VTQ | 21° | ABC | 1.7346629195862737 | 8.271902185269754 |
| 9,999,999 | 1000000 | 69118260322 | 2021-10-02T00:00:00 | 50051 | 1 | 8809 | C1-2 | PLW | 21° | | 1.1628641644551068 | 16.81932955829112 |


In [14]:
data_format_results = results.copy()
data_format_results

| # | method | write (s) | read (s) | rows | size (Mb) | config | read r/sec | write r/sec |
|---|--------|-----------|----------|------|-----------|--------|------------|-------------|
| 0 | to_sql | 13.467 | 0 | 10000000 | | | nil | 742555 |
| 1 | to_json | 13.384 | 4.252 | 10000000 | 142 | | 2351834 | 747160 |
| 2 | to_xlsx | 12.555 | 24.601 | 100000 | 9 | {'sheet': 'pyexcel_sheet1'} | 4064 | 7964 |
| 3 | to_ods | 73.045 | 68.588 | 100000 | 7 | {'sheet': 'pyexcel_sheet1'} | 1457 | 1369 |
| 4 | to_csv | 19.061 | 20.04 | 1000000 | 109 | {} | 49900 | 52463 |
| 5 | to_csv | 19.46 | 10.722 | 1000000 | 109 | {'guess_datatypes': False} | 93266 | 51387 |
| 6 | to_csv | 185.77 | 126.405 | 10000000 | 1090 | {'guess_datatypes': False} | 79110 | 53830 |
| 7 | to_tsv | 13.882 | 15.947 | 1000000 | 109 | {} | 62707 | 72035 |
| 8 | to_text | 14.052 | 16.023 | 1000000 | 109 | {} | 62410 | 71164 |
| 9 | to_html | 11.957 | 71.13 | 1000000 | 228 | {} | 14058 | 83633 |
| 10 | to_hdf5 | 5.267 | 11.85 | 1000000 | 300 | {} | 84388 | 189861 |
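
The write, collect garbage, read, record size sequence that the format benchmark follows can be sketched with the standard library alone. This is a hedged illustration, not tablite's implementation: `csv_round_trip` is a hypothetical helper, the data is made up, and the `csv` module stands in for `to_csv` / `Table.from_file`. Values are read back as plain strings here, loosely comparable to importing with `guess_datatypes=False`.

```python
import csv
import gc
import tempfile
from pathlib import Path
from time import perf_counter


def csv_round_trip(rows, path):
    """Write rows to CSV and read them back.

    Returns (write_seconds, read_seconds, size_bytes, rows_read).
    """
    start = perf_counter()
    with path.open("w", newline="", encoding="utf-8") as fo:
        csv.writer(fo).writerows(rows)
    write_s = perf_counter() - start

    # Collect garbage between the phases so that leftovers from the write
    # are less likely to trigger a collection during the timed read.
    gc.collect()

    start = perf_counter()
    with path.open("r", newline="", encoding="utf-8") as fi:
        rows_read = list(csv.reader(fi))
    read_s = perf_counter() - start

    return write_s, read_s, path.stat().st_size, rows_read


# Hypothetical data: three string columns, 10,000 rows.
rows = [[str(i), f"C{i % 5 + 1}-{i % 4 + 1}", str(i * 0.5)] for i in range(10_000)]

with tempfile.TemporaryDirectory() as tmpdir:
    write_s, read_s, size, rows_read = csv_round_trip(rows, Path(tmpdir) / "1.csv")

assert rows_read == rows  # strings in, strings out: no datatype guessing
print(f"{len(rows):,} rows, {size/1e6:.2f} Mb, write {write_s:.3f} s, read {read_s:.3f} s")
```

Using `tempfile.TemporaryDirectory` as a context manager removes the files even if a benchmark step raises, which a trailing `shutil.rmtree` call does not.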
