In [1]:
# !uv pip install h3pandas duckdb h3 pyarrow pandas

## Benchmarking

We install a couple of libraries.
- I ran this locally on my 16-core M3 Max MacBook Pro.
- If you know how to make any of these other libraries more performant, please open a PR. I want to be as fair as possible.  
- I'm not an expert in DuckDB, but as far as I understand, copying the data into it should be (close to) zero-cost thanks to Apache Arrow.
- I used `h3==4.1.2`, `polars==1.8.2` and `duckdb==1.1.3`. 

To run the benchmarks, either run `python -m benchmarks.main` from the CLI or run the cell below in this notebook.

**The benchmarks aim to cover most of the common operations.**

In [None]:
from collections import defaultdict
import json
import statistics

from .. import benchmarks 

param_config = benchmarks.ParamConfig(
    resolution=9,
    grid_ring_distance=3,
    num_iterations=3,
    libraries=["plh3", "duckdb", "h3_py"],
    difficulty_to_num_rows={
        "basic": 10_000_000,
        "medium": 10_000_000,
        "complex": 100_000,
    },
    # functions=["grid_path"],
    # verbose=True,
)
benchmark_engine = benchmarks.Benchmark(config=param_config)
raw_results = benchmark_engine.run_all()
prev_func = None
for result in raw_results:
    if prev_func != result.name:
        print(f"\n{result.name}")
        prev_func = result.name
    print(result)

In [16]:
# or, if you used the CLI, load the saved results instead
# import json

# with open("../benchmarks/benchmarks-results.json", "r") as f:
#     raw_results = json.load(f)

## Results
| Function                 | polars-h3 (Time) | duckdb (Time) | h3_py (Time) |
|--------------------------|------------------|---------------|--------------|
| latlng_to_cell (10M)     | 0.19s            | 3.62s         | 6.89s        |
| cell_to_latlng (10M)     | 1.42s            | 2.96s         | 36.30s       |
| get_resolution (10M)     | 0.28s            | 0.18s         | 2.10s        |
| int_hex_to_str (10M)     | 0.36s            | 1.66s         | 2.35s        |
| str_hex_to_int (10M)     | 0.13s            | 1.26s         | 2.05s        |
| is_valid_cell (10M)      | 0.03s            | 0.25s         | 2.09s        |
| are_neighbor_cells (10M) | 0.16s            | 0.77s         | 4.20s        |
| cell_to_parent (10M)     | 0.05s            | 0.18s         | 3.66s        |
| cell_to_children (10M)   | 2.23s            | 3.20s         | 62.38s       |
| grid_disk (10M)          | 3.86s            | 13.50s        | 140.69s      |
| grid_ring (10M)          | 3.18s            | 7.67s         | 90.83s       |
| grid_distance (10M)      | 0.16s            | 1.61s         | 5.23s        |
| cell_to_boundary (10M)   | 3.96s            | 39.13s        | 186.57s      |
| grid_path (100K)         | 0.74s            | 13.14s        | 28.31s       |


## Calculating Multipliers

In this section, we take the raw benchmark results and group them by the function name.
Then, for each function, we identify the fastest execution time across all libraries
and compute how many times slower the other libraries are in comparison to that fastest time.

After we compute these "multiples" (speed factors) for each library across all functions,
we then summarize the data by calculating the median and average multiples per library.
This helps us understand, on average, how much slower each library is compared to the fastest one.
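
To make the arithmetic concrete, here is a minimal sketch that applies the same steps to just two rows copied from the Results table. The names `sample_results` and `slowdown_multiples` are made up for this illustration and are not part of the benchmark package; the dict keys (`name`, `library`, `seconds`) are assumed to match the result records used in this notebook.

```python
from collections import defaultdict
import statistics

# Two rows copied from the Results table (seconds per library).
# `sample_results` is a hypothetical stand-in for the real raw results.
sample_results = [
    {"name": "latlng_to_cell", "library": "plh3", "seconds": 0.19},
    {"name": "latlng_to_cell", "library": "duckdb", "seconds": 3.62},
    {"name": "latlng_to_cell", "library": "h3_py", "seconds": 6.89},
    {"name": "get_resolution", "library": "plh3", "seconds": 0.28},
    {"name": "get_resolution", "library": "duckdb", "seconds": 0.18},
    {"name": "get_resolution", "library": "h3_py", "seconds": 2.10},
]


def slowdown_multiples(results):
    """Return {library: [multiple, ...]} relative to the fastest library per function."""
    # Group the records by function name.
    by_name = defaultdict(list)
    for r in results:
        by_name[r["name"]].append(r)

    # Divide each library's time by the fastest time for that function.
    by_lib = defaultdict(list)
    for rows in by_name.values():
        fastest = min(r["seconds"] for r in rows)
        for r in rows:
            by_lib[r["library"]].append(r["seconds"] / fastest)
    return dict(by_lib)


multiples = slowdown_multiples(sample_results)
for lib, ms in multiples.items():
    print(lib, [round(m, 2) for m in ms], "median:", round(statistics.median(ms), 2))
```

Note that the fastest library for a function always gets a multiple of exactly 1.0. In this sample that is `plh3` for `latlng_to_cell` and `duckdb` for `get_resolution`, which is why no library is guaranteed an average of 1.0 across all functions.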

In [20]:
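# Imports repeated here so this cell also works when only the JSON results were loaded.
from collections import defaultdict
import statistics

# The code below indexes each result by key, so it expects dict-like records
# with "name", "library" and "seconds" (e.g. those loaded from the JSON file).
# Group the results by function name.
by_name = defaultdict(list)
for d in raw_results:
    by_name[d["name"]].append(d)

# For each function, express every library's time as a multiple of the fastest time.
multiples = []
for speeds in by_name.values():
    fastest = min(v["seconds"] for v in speeds)
    for v in speeds:
        multiples.append((v["library"], v["seconds"] / fastest))

# Collect the multiples per library across all functions.
by_lib = defaultdict(list)
for lib, mult in multiples:
    by_lib[lib].append(mult)

# Summarize each library with the median and the mean of its multiples.
median_by_lib = {lib: round(statistics.median(ms), 2) for lib, ms in by_lib.items()}
avg_by_lib = {lib: round(sum(ms) / len(ms), 2) for lib, ms in by_lib.items()}
print("Median:")
print(median_by_lib)
print("Average:")
print(avg_by_lib)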

Median:
{'plh3': 1.0, 'duckdb': 4.69, 'h3_py': 30.93}
Average:
{'plh3': 1.04, 'duckdb': 6.97, 'h3_py': 33.55}
