# Benchmarks

These benchmarks seek to establish the performance of tablite as a user sees it.

## Overview

**Input/Output:**

- Save / Load .tpz format
- Save tables to various formats
- Import data from various formats

**Various column functions:**

- Setitem / getitem
- iter
- equal, not equal
- copy
- t += t
- t *= t
- contains
- remove all
- replace
- index
- unique
- histogram
- statistics
- count


**Various table functions:**

- **base**
  - Setitem / getitem
  - iter / rows
  - equal, not equal
  - load
  - save
  - copy
  - stack
  - types
  - display_dict
  - show
  - to_dict
  - as_json_serializable
  - index
- **core**
  - expression
  - filter
  - sort_index
  - reindex
  - drop_duplicates
  - sort
  - is_sorted
  - any
  - all
  - drop 
  - replace
  - groupby
  - pivot
  - joins
  - lookup
  - replace missing values
  - transpose
  - pivot_transpose
  - diff






In [1]:
from tablite import Table
from tablite.datasets import synthetic_order_data
import psutil, os, gc
import tempfile
from pathlib import Path
from time import process_time
from tablite.config import Config
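
The cells that follow time each operation with `process_time`, which reports the CPU time (user plus system) of the current process and does not count time spent sleeping or waiting. Wall-clock time, which is closer to what a user waits for, can differ. A minimal sketch that records both, using only the standard library; the `Timer` name is hypothetical and not part of tablite:

```python
from time import perf_counter, process_time, sleep


class Timer:
    """Context manager that records wall-clock and CPU seconds for a block."""

    def __enter__(self):
        self._wall_start = perf_counter()
        self._cpu_start = process_time()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.cpu = process_time() - self._cpu_start
        self.wall = perf_counter() - self._wall_start
        return False  # never swallow exceptions raised by the timed block


with Timer() as timer:
    sleep(0.05)  # stands in for an I/O wait: the clock runs while the CPU is idle

print(f"wall: {timer.wall:.3f} s, cpu: {timer.cpu:.3f} s")
```

If the two numbers are close, the operation is CPU-bound; a large gap suggests time spent waiting on disk or other processes.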

### Create tables from synthetic data.

In [2]:
process = psutil.Process(os.getpid())

# The last tables are too big for RAM (~24Gb), so I create subtables of 1M rows and append them.
ram_start = process.memory_info().rss
t = synthetic_order_data(Config.PAGE_SIZE)
ram_end = process.memory_info().rss
real, flat = t.nbytes()
print(f"Table {len(t):,} rows is {real/1e6:,.0f} Mb on disk, using {(ram_end - ram_start)/1e6:,.0f} Mb ram")

tables = [t]  # 1M rows.

for i in [2,5,10,50,100]:
    for _ in range(10):
        gc.collect()

    ram_start = process.memory_info().rss
    t2 = synthetic_order_data(Config.PAGE_SIZE)
    for _ in range(i-1):
        t2 += synthetic_order_data(Config.PAGE_SIZE)  # these are all unique
    ram_end = process.memory_info().rss
    real, flat = t2.nbytes()
    tables.append(t2)
    print(f"Table {len(t2):,} rows is {real/1e6:,.0f} Mb on disk, using {(ram_end - ram_start)/1e6:,.0f} Mb ram")

tables[-1].show()


Table 1,000,000 rows is 240 Mb on disk, using 7 Mb ram
Table 2,000,000 rows is 480 Mb on disk, using 15 Mb ram
Table 5,000,000 rows is 1,200 Mb on disk, using 28 Mb ram
Table 10,000,000 rows is 2,400 Mb on disk, using 27 Mb ram
Table 50,000,000 rows is 12,000 Mb on disk, using 6 Mb ram
Table 100,000,000 rows is 24,000 Mb on disk, using 7 Mb ram
|     ~     |   #   |      1      |         2         |  3  | 4 |  5  | 6  | 7 | 8  | 9 |         10        |        11        |
+-----------+-------+-------------+-------------------+-----+---+-----+----+---+----+---+-------------------+------------------+
|          0|      1|1398629815837|2021-11-22T00:00:00|50696|  0|19084|C4-5|NHA|0°  |   |0.14986566997128964|12.10455014963018 |
|          1|      2|1538699944465|2021-11-19T00:00:00|50045|  1|20508|C1-4|YKC|0°  |XYZ|  1.426933632208182|19.600321009705745|
|          2|      3|1429530642607|2021-11-20T00:00:00|50596|  1|12877|C2-4|GKA|0°  |ABC|  1.422395435905098|2.7417023157941642|

### Save / Load .tpz format

In [3]:
tmp = Path(tempfile.gettempdir()) / "junk"
tmp.mkdir(exist_ok=True)

for t in tables:
    fn = tmp / f'{len(t)}.tpz'
    start = process_time()
    t.save(fn)
    end = process_time()
    assert fn.exists()
    print(f"saving {len(t):,} rows ({fn.stat().st_size/1e6:,.0f} Mb) took {end-start:,} seconds")

    start = process_time()
    t2 = Table.load(fn)
    end = process_time()
    print(f"loading {len(t2):,} rows took {end-start:,} seconds")
    del t2
    fn.unlink()


saving 1,000,000 rows (240 Mb) took 0.296875 seconds
loading 1,000,000 rows took 0.5 seconds
saving 2,000,000 rows (480 Mb) took 0.625 seconds
loading 2,000,000 rows took 1.046875 seconds
saving 5,000,000 rows (1,200 Mb) took 1.5 seconds
loading 5,000,000 rows took 2.484375 seconds
saving 10,000,000 rows (2,400 Mb) took 3.140625 seconds
loading 10,000,000 rows took 4.84375 seconds
saving 50,000,000 rows (12,000 Mb) took 15.765625 seconds
loading 50,000,000 rows took 29.84375 seconds
saving 100,000,000 rows (24,000 Mb) took 30.40625 seconds
loading 100,000,000 rows took 57.8125 seconds
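
Dividing the file size by the reported time gives a rough throughput per CPU-second, which makes it easier to see whether saving and loading scale linearly with table size. The sketch below does that arithmetic on the figures printed above. Because `process_time` measures CPU time, these are not wall-clock disk speeds.

```python
# (size in Mb, save seconds, load seconds), copied from the output above.
results = [
    (240, 0.296875, 0.5),
    (480, 0.625, 1.046875),
    (1_200, 1.5, 2.484375),
    (2_400, 3.140625, 4.84375),
    (12_000, 15.765625, 29.84375),
    (24_000, 30.40625, 57.8125),
]

# Mb per CPU-second for saving and loading at each table size.
throughput = [(mb, mb / save, mb / load) for mb, save, load in results]

for mb, save_rate, load_rate in throughput:
    print(f"{mb:>6,} Mb: save {save_rate:,.0f} Mb/s, load {load_rate:,.0f} Mb/s")
```

On these figures saving stays between roughly 760 and 810 Mb per CPU-second across all six sizes, and loading between roughly 400 and 500 Mb per CPU-second, so both scale close to linearly up to 100 million rows.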


### Save / load tables to / from various formats

The handlers for saving / export are:

- to_sql
- to_json
- to_xlsx
- to_ods
- to_csv
- to_tsv
- to_text
- to_html
- to_hdf5


In [4]:
t = synthetic_order_data(1_000_000)
tmp = Path(tempfile.gettempdir()) / "junk"
tmp.mkdir(exist_ok=True)


In [5]:
start = process_time()
string = t.to_sql(name='t')  # --> SQL
end = process_time()
print(f"to_sql() took {end-start:,.2f} secs for {len(t):,} rows")

# start = process_time() TODO
# Table.from_sql(string)  # <-- SQL
# end = process_time()
# print(f"from_sql() took {end-start:,.2f} secs for {len(t):,} rows")
del string

to_sql() took 12.91 secs for 1,000,000 rows


In [6]:


start = process_time()
bytestr = t.to_json()  # --> JSON
end = process_time()
print(f"to_json() took {end-start:,.2f} secs for {len(t):,} rows")

start = process_time()
Table.from_json(bytestr)  # <-- JSON
end = process_time()
print(f"from_json() took {end-start:,.2f} secs for {len(t):,} rows")
del bytestr


to_json() took 12.83 secs for 1,000,000 rows
from_json() took 3.81 secs for 1,000,000 rows


In [7]:
p = psutil.Process()
start_ram = p.memory_full_info().uss
fn = tmp / '1.xlsx'  # --> XLS
start = process_time()
t.to_xlsx(fn)
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"t.to_xlsx({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")

p = psutil.Process()
start_ram = p.memory_full_info().uss
start = process_time()
Table.from_file(fn, sheet="pyexcel_sheet1")  # <-- XLS
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"Table.from_file({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")
print(f"The file was {fn.stat().st_size/1e6:,.0f} Mb on disk")
fn.unlink()


t.to_xlsx(1.xlsx) took 133.58 secs for 1,000,000 rows and used 1,856Mb RAM
Table.from_file(1.xlsx) took 250.17 secs for 1,000,000 rows and used 5,153Mb RAM
The file was 92 Mb on disk
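
The RAM figure in these cells is the difference between `uss` before the call and `peak_wset` after it. `peak_wset` is only available on Windows and is the peak working set over the whole life of the process, so an earlier, larger operation can dominate the figure reported for a later one. As a hedged alternative, the standard library's `tracemalloc` can report the peak of Python-level allocations within a single block. The helper name `traced_peak` is hypothetical, and memory allocated outside Python's allocators is not counted:

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def traced_peak():
    """Yield a dict that is filled with current/peak traced bytes on exit."""
    result = {}
    tracemalloc.start()
    try:
        yield result
    finally:
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        result["current"] = current
        result["peak"] = peak


with traced_peak() as mem:
    block = bytearray(10_000_000)  # stands in for the export or import call
    del block                      # released before the block ends

print(f"peak: {mem['peak']/1e6:,.0f} Mb, still held: {mem['current']/1e6:,.0f} Mb")
```

Tracing adds overhead, so timing and memory tracing are best done in separate runs.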


In [8]:
p = psutil.Process()
start_ram = p.memory_full_info().uss
fn = tmp / '1.ods' # --> ODS
start = process_time()
snip = t[:100_000]  # export only 100,000 rows to limit the memory footprint.
snip.to_ods(fn)
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"t.to_ods({fn.name}) took {end-start:,.2f} secs for {len(snip):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")

p = psutil.Process()
start_ram = p.memory_full_info().uss
start = process_time()
Table.from_file(fn, sheet="pyexcel_sheet1")  # <-- ODS
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"Table.from_file({fn.name}) took {end-start:,.2f} secs for {len(snip):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")
print(f"The file was {fn.stat().st_size/1e6:,.0f} Mb on disk")
fn.unlink()

t.to_ods(1.ods) took 83.28 secs for 100,000 rows and used 1,005Mb RAM
Table.from_file(1.ods) took 74.67 secs for 100,000 rows and used 1,642Mb RAM
The file was 8 Mb on disk


In [9]:
Config.MULTIPROCESSING_MODE = Config.FALSE

p = psutil.Process()
start_ram = p.memory_full_info().uss
fn = tmp / '1.csv'  # --> CSV
start = process_time()
t.to_csv(fn)
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"t.to_csv({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")

p = psutil.Process()
start_ram = p.memory_full_info().uss
start = process_time()
Table.from_file(fn)  # <-- CSV
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"Table.from_file({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")
print(f"The file was {fn.stat().st_size/1e6:,.0f} Mb on disk")
fn.unlink()


100%|██████████| 1000000/1000000 [00:16<00:00, 59673.37it/s]


t.to_csv(1.csv) took 16.83 secs for 1,000,000 rows and used 3,124Mb RAM


importing: consolidating '1.csv': 100.00%|██████████| [00:59<00:00]


Table.from_file(1.csv) took 59.11 secs for 1,000,000 rows and used 3,153Mb RAM
The file was 109 Mb on disk
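
The CSV cell above and the TSV, TXT, HTML and HDF5 cells that follow share one pattern: export to a file, time it, import the file again, time it, report the file size, then delete the file. A sketch of that pattern as a reusable helper is shown below. It uses the standard library's `csv` module as a stand-in for the tablite writers and readers, and the names `benchmark_roundtrip`, `write_csv` and `read_csv` are hypothetical:

```python
import csv
import tempfile
from pathlib import Path
from time import process_time


def benchmark_roundtrip(write, read, path):
    """Time write(path) and read(path) in CPU seconds, then delete the file."""
    try:
        start = process_time()
        write(path)
        write_secs = process_time() - start

        size_mb = path.stat().st_size / 1e6

        start = process_time()
        result = read(path)
        read_secs = process_time() - start
    finally:
        path.unlink(missing_ok=True)  # clean up even if a step fails
    return {"write_secs": write_secs, "read_secs": read_secs,
            "size_mb": size_mb, "result": result}


rows = [(i, f"item-{i}", i * 0.5) for i in range(10_000)]  # toy data


def write_csv(path):
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)


def read_csv(path):
    with path.open(newline="") as f:
        return list(csv.reader(f))


with tempfile.TemporaryDirectory() as tmpdir:
    report = benchmark_roundtrip(write_csv, read_csv, Path(tmpdir) / "1.csv")

print(f"write {report['write_secs']:.2f} s, read {report['read_secs']:.2f} s, "
      f"{report['size_mb']:.2f} Mb on disk, {len(report['result']):,} rows read back")
```

Passing the writer and reader as callables keeps the timing logic in one place, so every format is measured the same way.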


In [10]:
Config.MULTIPROCESSING_MODE = Config.FALSE

p = psutil.Process()
start_ram = p.memory_full_info().uss
fn = tmp / '1.tsv'  # --> TSV
start = process_time()
t.to_tsv(fn)
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"t.to_tsv({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")

p = psutil.Process()
start_ram = p.memory_full_info().uss
start = process_time()
Table.from_file(fn)  # <-- TSV
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"Table.from_file({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")
print(f"The file was {fn.stat().st_size/1e6:,.0f} Mb on disk")
fn.unlink()

100%|██████████| 1000000/1000000 [00:16<00:00, 59532.87it/s]


t.to_tsv(1.tsv) took 16.84 secs for 1,000,000 rows and used 3,034Mb RAM


importing: consolidating '1.tsv': 100.00%|██████████| [00:58<00:00]

Table.from_file(1.tsv) took 58.31 secs for 1,000,000 rows and used 3,034Mb RAM
The file was 109 Mb on disk





In [11]:
Config.MULTIPROCESSING_MODE = Config.FALSE

p = psutil.Process()
start_ram = p.memory_full_info().uss
fn = tmp / '1.txt'  # --> TXT
start = process_time()
t.to_text(fn)
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"t.to_text({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")

p = psutil.Process()
start_ram = p.memory_full_info().uss
start = process_time()
Table.from_file(fn)  # <-- TXT
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"Table.from_file({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")
print(f"The file was {fn.stat().st_size/1e6:,.0f} Mb on disk")
fn.unlink()

100%|██████████| 1000000/1000000 [00:16<00:00, 59597.65it/s]


t.to_text(1.txt) took 16.83 secs for 1,000,000 rows and used 3,023Mb RAM


importing: consolidating '1.txt': 100.00%|██████████| [00:57<00:00]


Table.from_file(1.txt) took 57.69 secs for 1,000,000 rows and used 3,023Mb RAM
The file was 109 Mb on disk


In [12]:
Config.MULTIPROCESSING_MODE = Config.FALSE

p = psutil.Process()
start_ram = p.memory_full_info().uss
fn = tmp / '1.html'  # --> HTML
start = process_time()
t.to_html(fn)
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"t.to_html({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")

p = psutil.Process()
start_ram = p.memory_full_info().uss
start = process_time()
Table.from_file(fn)  # <-- HTML
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"Table.from_file({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")
print(f"The file was {fn.stat().st_size/1e6:,.0f} Mb on disk")
fn.unlink()

t.to_html(1.html) took 11.70 secs for 1,000,000 rows and used 3,330Mb RAM


from_html: 100%|██████████| 228960213/228960213 [01:19<00:00, 2893933.85it/s]


Table.from_file(1.html) took 79.22 secs for 1,000,000 rows and used 3,329Mb RAM
The file was 229 Mb on disk


In [13]:
Config.MULTIPROCESSING_MODE = Config.FALSE

p = psutil.Process()
start_ram = p.memory_full_info().uss
fn = tmp / '1.hdf5'  # --> HDF5
start = process_time()
t.to_hdf5(fn)
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"t.to_hdf5({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")

p = psutil.Process()
start_ram = p.memory_full_info().uss
start = process_time()
Table.from_file(fn)  # <-- HDF5
end = process_time()
max_ram = p.memory_full_info().peak_wset
print(f"Table.from_file({fn.name}) took {end-start:,.2f} secs for {len(t):,} rows and used {(max_ram-start_ram)/1e6:,.0f}Mb RAM")
print(f"The file was {fn.stat().st_size/1e6:,.0f} Mb on disk")
fn.unlink()

writing :, records to C:\Users\madsenbj\AppData\Local\Temp\junk\1.hdf5
writing C:\Users\madsenbj\AppData\Local\Temp\junk\1.hdf5 to HDF5 done
t.to_hdf5(1.hdf5) took 6.08 secs for 1,000,000 rows and used 3,329Mb RAM




Table.from_file(1.hdf5) took 2.83 secs for 1,000,000 rows and used 3,289Mb RAM
The file was 301 Mb on disk
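
To compare the formats directly, the reported times can be converted into rows per CPU-second. The sketch below uses the figures printed in this section. The ODS run covered 100,000 rows rather than 1,000,000, and `from_sql` was not measured.

```python
# format: (rows, export seconds, import seconds or None), copied from the outputs above.
timings = {
    "sql": (1_000_000, 12.91, None),
    "json": (1_000_000, 12.83, 3.81),
    "xlsx": (1_000_000, 133.58, 250.17),
    "ods": (100_000, 83.28, 74.67),
    "csv": (1_000_000, 16.83, 59.11),
    "tsv": (1_000_000, 16.84, 58.31),
    "txt": (1_000_000, 16.83, 57.69),
    "html": (1_000_000, 11.70, 79.22),
    "hdf5": (1_000_000, 6.08, 2.83),
}

export_rate = {fmt: rows / exp for fmt, (rows, exp, _) in timings.items()}
import_rate = {fmt: rows / imp for fmt, (rows, _, imp) in timings.items()
               if imp is not None}

# Print fastest export first.
for fmt in sorted(export_rate, key=export_rate.get, reverse=True):
    imported = f"{import_rate[fmt]:>9,.0f}" if fmt in import_rate else "      n/a"
    print(f"{fmt:<5} export {export_rate[fmt]:>9,.0f} rows/s   import {imported} rows/s")
```

Among the export formats in this section, HDF5 is the fastest in both directions on these numbers and ODS the slowest.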
