A Python library, accelerated by a Rust binary, for joining / merging tables with an interval index.

This project is an improvement over a previous tool documented here: https://github.com/thehappycheese/dtimsprep
The part of the algorithm accelerated by Rust has been kept to a minimum, as it seems preferable to maintain the majority of the code in Python, where things are a bit more concise and easy to read. Rust has been used to accelerate the part of the task which runs in quadratic time `O(n²)`, wrapped in a Python library which handles the linear-time `O(n)` portion of the task.
See the releases page for install / uninstall instructions
## TODO

The API has not been finalized yet. The releases page has some example code, and there is a `notebooks` folder with some example Jupyter notebooks.
Given a target `segmentation`, and some `data`, our goal is to efficiently merge the measures and categories of interest from `data` onto the `segmentation` according to the overlap. An overlap occurs when `key` matches and the `from`/`to` ranges overlap.

We are interested in maintaining the key of the `segmentation`; i.e. the number of rows must stay the same, and the values in the columns `key`, `from`, and `to` should remain unaffected; we are just adding new columns to the `segmentation`.

There are a number of ways to aggregate the measure and category columns of the `data` table as they move onto the `segmentation`, but we will get to those later.
### `segmentation`
id | key | from | to |
---|---|---|---|
0 | 0 | 0 | 100 |
1 | 0 | 100 | 200 |
2 | 0 | 200 | 300 |
3 | 0 | 300 | 400 |
4 | 1 | 0 | 100 |
### `data`
id | key | from | to | some_measure | some_category |
---|---|---|---|---|---|
0 | 0 | 50 | 140 | 1.0 | "A" |
1 | 0 | 140 | 160 | 2.0 | "B" |
2 | 0 | 160 | 180 | 3.0 | "B" |
3 | 0 | 180 | 220 | 4.0 | "B" |
4 | 0 | 220 | 240 | 5.0 | "C" |
5 | 0 | 240 | 260 | 5.0 | "C" |
6 | 0 | 260 | 280 | 6.0 | "D" |
7 | 0 | 280 | 300 | 7.0 | "E" |
8 | 0 | 300 | 320 | 8.0 | "F" |
9 | 1 | 10 | 80 | 9.0 | "G" |
10 | 1 | 80 | 120 | 10.0 | "H" |
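For readers who want to follow along, the example tables above can be constructed in pandas as follows (a minimal sketch; the column names come from the tables above, the construction itself is just illustrative):

```python
import pandas as pd

# The target segmentation: the row count and the key/from/to columns
# must survive the merge unchanged. The `id` column above is the
# default integer index here.
segmentation = pd.DataFrame({
    "key":  [0, 0, 0, 0, 1],
    "from": [0, 100, 200, 300, 0],
    "to":   [100, 200, 300, 400, 100],
})

# The data whose measures and categories we want to merge onto the segmentation.
data = pd.DataFrame({
    "key":           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "from":          [50, 140, 160, 180, 220, 240, 260, 280, 300, 10, 80],
    "to":            [140, 160, 180, 220, 240, 260, 280, 300, 320, 80, 120],
    "some_measure":  [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "some_category": list("ABBBCCDEFGH"),
})
```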
Our main challenge is efficiently computing the overlap tables. For each row in `segmentation` whose `key` matches the `key` column in `data`, we want the `id` and `overlap_length` of the matching rows in `data`. From these we can compute the length-weighted average, length-weighted percentile, proportional sum, etc. (more detail later).
```python
# `outer_minimum` and `outer_maximum` are similar in function to the outer product,
# except they compute the element-wise minimum / maximum instead of multiplying,
# and return a huge matrix whose dimensions are the lengths of the first and
# second inputs.
overlap_tables = []
for key in unique(segmentation["key"]):
    # data rows become the rows of the matrix, and segmentation rows the
    # columns, matching the overlap tables shown below
    overlap_min = outer_maximum(
        data.loc[key, "from"],
        segmentation.loc[key, "from"],
    )
    overlap_max = outer_minimum(
        data.loc[key, "to"],
        segmentation.loc[key, "to"],
    )
    signed_overlap = overlap_max - overlap_min
    overlap = maximum(0, signed_overlap)
    overlap_tables.append(overlap)
```
Notes about the pseudocode:

- In practice, this code does not work because it requires quadratic memory 😦, however there is a strong motivation to make it work anyway, because:
  - This approach is not fussy about self-intersections in `data` or self-intersections in `segmentation`:
    - If multiple rows have overlapping `from`/`to` ranges (in either `data` or `segmentation`) then we are still able to aggregate all overlapping values from `data` in proportion to the length by which they overlap the target row in `segmentation`.
    - Typically we are not interested in a `segmentation` which is self-intersecting anyway, but it is fairly common to have self-intersecting `data`.
  - `signed_overlap` is calculated as an intermediate value. This is interesting because it would allow us to perform aggregations based on proximity (e.g. count the `data` within 1 kilometer of each `segmentation` row, count the `data` touching each `segmentation` row, etc.; a small sketch follows this list):
    - positive values are the overlap length,
    - negative values are the distance between intervals,
    - zero values indicate touching intervals.
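As a tiny numpy sketch of such a proximity query (hypothetical values; `signed_overlap` here is one matrix as computed above, with `data` rows and `segmentation` columns):

```python
import numpy as np

# Hypothetical signed_overlap matrix for one key:
# 3 data rows x 2 segmentation rows.
signed_overlap = np.array([
    [  50,  -10],  # overlaps segment 0 by 50, is 10 away from segment 1
    [ -20,    5],  # is 20 away from segment 0, overlaps segment 1 by 5
    [   0, -300],  # touches segment 0, is 300 away from segment 1
])

# Data rows overlapping, touching, or within 100 units of each segment.
near_count = (signed_overlap > -100).sum(axis=0)   # -> [3, 2]

# Data rows overlapping or touching each segment.
touch_count = (signed_overlap >= 0).sum(axis=0)    # -> [2, 1]
```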
The pseudocode above would produce the following two overlap tables. In each table the rows are `data` ids, the columns are `segmentation` ids, and the cells are overlap lengths.

Overlap table for `key == 0`:
id | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 50 | 40 | 0 | 0 |
1 | 0 | 20 | 0 | 0 |
2 | 0 | 20 | 0 | 0 |
3 | 0 | 20 | 20 | 0 |
4 | 0 | 0 | 20 | 0 |
5 | 0 | 0 | 20 | 0 |
6 | 0 | 0 | 20 | 0 |
7 | 0 | 0 | 20 | 0 |
8 | 0 | 0 | 0 | 20 |
Overlap table for `key == 1`:
id | 4 |
---|---|
9 | 70 |
10 | 20 |
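For the curious, here is a minimal runnable sketch of the dense approach, using numpy's ufunc `.outer` methods in place of the hypothetical `outer_maximum` / `outer_minimum` (this is illustrative only, not the library's actual Rust implementation):

```python
import numpy as np
import pandas as pd

def dense_overlap(segmentation: pd.DataFrame, data: pd.DataFrame) -> dict:
    """Return {key: overlap matrix} with one row per `data` row and one
    column per `segmentation` row, as in the tables above."""
    overlap_tables = {}
    for key in segmentation["key"].unique():
        seg = segmentation[segmentation["key"] == key]
        dat = data[data["key"] == key]
        # ufunc.outer builds the full (len(dat), len(seg)) matrix;
        # this is where the quadratic memory cost comes from.
        overlap_min = np.maximum.outer(dat["from"].to_numpy(), seg["from"].to_numpy())
        overlap_max = np.minimum.outer(dat["to"].to_numpy(), seg["to"].to_numpy())
        signed_overlap = overlap_max - overlap_min
        overlap_tables[key] = np.maximum(0, signed_overlap)
    return overlap_tables

# With the example DataFrames from earlier,
# dense_overlap(segmentation, data)[0] reproduces the `key == 0` table.
```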
We still effectively calculate and check every cell in the overlap table. We simply store our output in a sparse data structure, discarding all zeros.
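As a sketch of what that looks like (again illustrative, reusing the `dense_overlap` helper from the previous snippet): the dense matrix collapses into (data row, segmentation row, overlap length) triples, from which a length-weighted average is a one-liner.

```python
import numpy as np

def sparse_triples(overlap):
    """Convert a dense overlap matrix into (data_row, seg_row, overlap_length)
    triples, discarding all zero cells."""
    data_idx, seg_idx = np.nonzero(overlap)
    return data_idx, seg_idx, overlap[data_idx, seg_idx]

# Length-weighted average of `some_measure` for each segment with key == 0:
overlap = dense_overlap(segmentation, data)[0]
data_idx, seg_idx, length = sparse_triples(overlap)
measure = data.loc[data["key"] == 0, "some_measure"].to_numpy()[data_idx]
weighted_average = (
    np.bincount(seg_idx, weights=measure * length)
    / np.bincount(seg_idx, weights=length)
)
```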
So far there are no smart optimizations that reduce the `O(n²)` time. These would require either

- some assumptions about the input data (i.e. sorted, non-self-intersecting, etc.), or
- some pre-treatment to facilitate a divide-and-conquer approach.

This is an avenue we could explore in the future, but for now our simple brute-force method is still working.
For the moment we rely on the brute-force speed of iterating over the data in Rust; we can max out all CPU cores to get things done fast. For practical applications the speedup is over 100 times: tasks that previously took 20 minutes should now take about 10 seconds. But there is no escaping the quadratic-time problem. For the worst imaginable cases this still isn't fast enough, and for those cases the use of an interval index should be reconsidered in favor of traditional database keys and join operations, which are super fast even on very large data.
On Windows:

```
python -m pip install pip --upgrade
python -m venv .env
conda deactivate
./.env/Scripts/activate
pip install maturin
maturin develop
```
On the Databricks Linux web terminal:

```bash
# sudo apt-get install libssl-dev  # doesn't seem to be needed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# note: venv is part of the Python standard library; no `pip install` is needed
python -m venv .env
source .env/bin/activate
pip install maturin
maturin develop
```
Building on Windows apparently requires the `--interpreter python` argument for 'reasons':

```
maturin build --release --interpreter python
```