Table array access in TreeSequence #2424

jeromekelleher · 2022-07-20T16:15:19Z

Add array access to TreeSequence and solve some performance issues/memory leaks. See below for more details.

codecov · 2022-07-20T16:27:30Z

Codecov Report

Merging #2424 (d51f87e) into main (8c932c2) will increase coverage by 0.03%.
The diff coverage is 98.78%.

❗ Current head d51f87e differs from pull request most recent head 01ef912. Consider uploading reports for the commit 01ef912 to get more accurate results

@@            Coverage Diff             @@
##             main    #2424      +/-   ##
==========================================
+ Coverage   93.36%   93.39%   +0.03%     
==========================================
  Files          28       28              
  Lines       27056    27245     +189     
  Branches     1253     1253              
==========================================
+ Hits        25260    25445     +185     
- Misses       1762     1766       +4     
  Partials       34       34

Flag	Coverage Δ
c-tests	`92.26% <ø> (-0.01%)`	⬇️
lwt-tests	`89.05% <ø> (ø)`
python-c-tests	`71.37% <96.34%> (+0.41%)`	⬆️
python-tests	`98.94% <100.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
python/_tskitmodule.c	`90.91% <98.60%> (+0.18%)`	⬆️
python/tskit/trees.py	`98.68% <100.00%> (+0.01%)`	⬆️
c/tskit/trees.c	`94.99% <0.00%> (+0.01%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8c932c2...01ef912.

jeromekelleher · 2022-07-20T19:19:17Z

Fixes #2423:

Before using the standard ts.trees() approach:

Before using the sequential ts.at(j) approach:

This branch:

The difference is pretty stark: 60M vs around 650M. Interestingly, this doesn't look like a real "leak" per se, because the memory does hit a high-water mark. It's an awful lot more memory though, and maybe we're having to run garbage collection or something to get it back?
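To make the leak vs high-water-mark distinction concrete, here is a minimal sketch using the standard library's tracemalloc. The functions `peak_bytes`, `bounded` and `growing` are hypothetical names for illustration and have nothing to do with tskit internals. Note that tracemalloc only sees allocations made through Python's allocators, so it would not capture memory allocated directly by a C library.

```python
import tracemalloc


def peak_bytes(func, *args):
    """Return the peak number of bytes traced while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def bounded(num_iterations, block_size=100_000):
    # Each block is released before the next is allocated, so peak
    # usage stays near one block however long the loop runs.
    for _ in range(num_iterations):
        block = bytearray(block_size)
        del block


def growing(num_iterations, block_size=100_000):
    # Every block stays referenced, so usage grows with each iteration.
    kept = []
    for _ in range(num_iterations):
        kept.append(bytearray(block_size))


for n in (10, 100):
    print(n, peak_bytes(bounded, n), peak_bytes(growing, n))
```

If the peak stays flat as the iteration count goes up, we're looking at a high-water mark; if it scales with the iteration count, something is holding on to memory.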

jeromekelleher · 2022-07-21T12:51:43Z

Fixes #1916:

Benchmark:

import msprime
import time


before = time.perf_counter()
ts = msprime.sim_ancestry(
    10000,
    sequence_length=1000000,
    population_size=10_000,
    recombination_rate=1e-8,
    random_seed=1234,
)
duration = time.perf_counter() - before
print(f"Simulation of {ts.num_trees} trees done after {duration:.2f} seconds")

for _ in range(1000):
    samples = ts.samples(time=0)
    # print(samples)

Before:

This branch:

jeromekelleher · 2022-07-21T12:59:42Z

Performance improvement of ts.site(position=x). Benchmark:

import msprime
import time


before = time.perf_counter()
ts = msprime.sim_ancestry(
    10000,
    sequence_length=1000000,
    population_size=10_000,
    recombination_rate=1e-8,
    random_seed=1234,
)
ts = msprime.sim_mutations(ts, rate=1e-7, random_seed=1)
duration = time.perf_counter() - before
print(f"Simulation of {ts.num_trees} trees done after {duration:.2f} seconds")

for site in ts.sites():
    s = ts.site(position=site.position)
    if site.id >= 1000:
        break

Before:

After:
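The thread doesn't spell out how the position lookup was sped up. As a rough sketch of one common approach, and assuming site positions are held in a sorted sequence, a binary search finds the site without scanning. `site_id_at` is a hypothetical helper, not a tskit function.

```python
from bisect import bisect_left


def site_id_at(positions, x):
    """Return the index of the site at position x.

    positions must be sorted in increasing order. Raises ValueError if
    no site sits exactly at x.
    """
    j = bisect_left(positions, x)
    if j == len(positions) or positions[j] != x:
        raise ValueError(f"There is no site at position {x}")
    return j


positions = [2.0, 5.0, 11.0, 42.0]
print(site_id_at(positions, 11.0))  # prints 2
```

Each lookup is then logarithmic in the number of sites, so a loop like the benchmark above does not pay a per-call cost proportional to the table size.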

jeromekelleher · 2022-07-21T13:12:14Z

Performance for using ts.variants(left=x):

import msprime
import time


before = time.perf_counter()
ts = msprime.sim_ancestry(
    10000,
    sequence_length=1000000,
    population_size=10_000,
    recombination_rate=1e-8,
    random_seed=1234,
)
ts = msprime.sim_mutations(ts, rate=1e-7, random_seed=1)
duration = time.perf_counter() - before
print(f"Simulation of {ts.num_trees} trees done after {duration:.2f} seconds")

for _ in range(10):
    for var in ts.variants(left=1):
        pass

Before:

After:

(Note that the "before" version appears to be leaking memory. Since 0.5.1, the released version always takes a copy of the tables, so the variants iterator is currently leaking in the released version.)

Updated: opened #2427 to track the leak in the released version (0.5.1), which leaks even when we don't specify left or right.

jeromekelleher · 2022-07-21T13:52:31Z

Running edge_diffs:

import msprime
import time


before = time.perf_counter()
ts = msprime.sim_ancestry(
    10000,
    sequence_length=1000000,
    population_size=10_000,
    recombination_rate=1e-8,
    random_seed=1234,
)
# ts = msprime.sim_mutations(ts, rate=1e-7, random_seed=1)
duration = time.perf_counter() - before
print(f"Simulation of {ts.num_trees} trees done after {duration:.2f} seconds")

before = time.perf_counter()
for _ in range(100):
    for _ in ts.edge_diffs():
        pass
duration = time.perf_counter() - before

print(f"Ran edge diffs in {duration:.2f} seconds")

Before:

After:

Interestingly there's no real difference in edge_diffs - we just take a little bit off the peak memory. We don't have a memory leak beforehand, which is a bit of a head-scratcher.

jeromekelleher · 2022-07-21T14:01:12Z

This is ready for review. The basic idea is to provide array access via the TreeSequence object in lieu of the "proper" solution of making TreeSequence.tables a read-only view (which is hard, engineering-wise). Along the way I've moved internal code away from using self.tables, as these uses were all leading to significant performance problems (at best) and sometimes major memory leaks (#2428). I don't understand why some of these were leaking memory and others weren't, but we can follow that up later; at least library functions shouldn't leak and should be efficient.
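As a toy illustration of why going through a copying `tables` property inside a loop is costly, here is a minimal sketch. `ToyTables` and `ToyTreeSequence` are hypothetical stand-ins, not tskit's implementation, and the only assumption is that each `tables` access hands back a fresh copy.

```python
import copy


class ToyTables:
    """Hypothetical stand-in for a table collection holding one column."""

    def __init__(self, nodes_time):
        self.nodes_time = list(nodes_time)


class ToyTreeSequence:
    """Hypothetical stand-in; not tskit's implementation."""

    def __init__(self, nodes_time):
        self._tables = ToyTables(nodes_time)
        self.copies_made = 0

    @property
    def tables(self):
        # Assumption: every access hands back a fresh copy of all tables.
        self.copies_made += 1
        return copy.deepcopy(self._tables)

    @property
    def nodes_time(self):
        # Direct access to the underlying column: no copy is taken.
        return self._tables.nodes_time


ts = ToyTreeSequence(range(1000))

# Going through the copying property inside a loop copies every time.
total_via_tables = sum(ts.tables.nodes_time[j] for j in range(100))
copies_via_tables = ts.copies_made

# Direct array access touches the same data without copying.
total_direct = sum(ts.nodes_time[j] for j in range(100))
copies_direct = ts.copies_made - copies_via_tables

print(copies_via_tables, copies_direct)  # prints 100 0
```

The toy returns a mutable list for brevity; a real implementation would want to hand back something read-only so callers can't modify the underlying data.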

See #2426 for discussion of what we should do for the ragged columns (that I've left out here).

Since we've got some memory leaks in the released version we should probably get this out ASAP.

python/tskit/trees.py

petrelharp

NICE!!! Looks good to me - just a docs suggestion/comment.

jeromekelleher · 2022-07-26T10:45:47Z

Marking this one for merging as I need it to work with the 1.4M sample tree sequences on Zenodo (and there's another bug that needs resolving to get a VCF out of them). Please do follow up with issues for any review problems you spot @benjeffery @hyanwong

hyanwong · 2022-07-27T12:20:08Z

Sorry for the slow review. LGTM - this is a great improvement. My only query would be if / how we intend to maintain this in the long term, once we address #760. Will we then deprecate these attributes, and simply have them as permanently maintained (and undocumented?) equivalents to ts.tables.nodes.time and friends?

jeromekelleher · 2022-07-27T13:05:59Z

> Will we then deprecate these attributes, and simply have them as permanently maintained (and undocumented?) equivalents to ts.tables.nodes.time and friends?

I think we can keep them permanently as part of the public API. ts.nodes_flags needs fewer attribute lookups than ts.tables.nodes.flags, as well as a bit less typing, so there's no harm in keeping them, I think.
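The attribute-lookup point can be checked without any tskit objects by counting bytecode instructions. This is only a sketch: `count_attribute_loads` is a hypothetical helper, it assumes CPython's `LOAD_ATTR` opcode name, and it says nothing about the cost of whatever each lookup does behind the scenes.

```python
import dis


def count_attribute_loads(expression):
    """Count LOAD_ATTR bytecode instructions in a Python expression."""
    code = compile(expression, "<expr>", "eval")
    return sum(
        1 for instr in dis.get_instructions(code) if instr.opname == "LOAD_ATTR"
    )


# The expressions are only compiled, never evaluated, so no "ts" is needed.
print(count_attribute_loads("ts.tables.nodes.flags"))
print(count_attribute_loads("ts.nodes_flags"))
```

The chained form compiles to more attribute loads than the single-attribute form, which is the saving being described.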

benjeffery

Been through this closely and looks good!

benjeffery · 2022-07-27T23:50:48Z

python/tests/test_highlevel.py

+        ts = ts_fixture
+        tables = ts.tables
+
+        assert_array_equal(ts.individuals_flags, tables.individuals.flags)


Why not parameterise this with the same list as the last test?

jeromekelleher · 2022-07-28

I thought this was simpler and easier to follow. There would have been some obscure string parsing and getattring in order to get the corresponding table array, and this seemed more ... obvious.
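For comparison, the parametrised alternative might have looked roughly like the sketch below. `table_column` is a hypothetical helper and the `SimpleNamespace` objects are stand-ins for `ts` and `ts.tables`, just enough to make it run. It assumes every array name splits into a table name and a column name at the first underscore.

```python
from types import SimpleNamespace


def table_column(tables, array_name):
    """Map a name like 'nodes_time' to tables.nodes.time."""
    table_name, column_name = array_name.split("_", 1)
    return getattr(getattr(tables, table_name), column_name)


# Hypothetical stand-ins for ts.tables and ts, just enough to run.
tables = SimpleNamespace(
    individuals=SimpleNamespace(flags=[0, 1]),
    nodes=SimpleNamespace(time=[0.0, 0.0, 1.5]),
)
ts = SimpleNamespace(
    tables=tables,
    individuals_flags=[0, 1],
    nodes_time=[0.0, 0.0, 1.5],
)

for name in ["individuals_flags", "nodes_time"]:
    assert getattr(ts, name) == table_column(ts.tables, name)
```

It is shorter per column, but a reader has to work through the split and the double getattr to see which table array is being compared, which is the trade-off described above.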

jeromekelleher force-pushed the node_time_array branch from 6541762 to 86be47f Compare July 20, 2022 16:16

jeromekelleher mentioned this pull request Jul 21, 2022

Provide TreeSequence array access to ragged columns #2426

Open

jeromekelleher mentioned this pull request Jul 21, 2022

Referring to tables in TableCollection leaks memory #2428

Closed

jeromekelleher force-pushed the node_time_array branch from d51f87e to c106abe Compare July 21, 2022 13:57

jeromekelleher marked this pull request as ready for review July 21, 2022 13:58

jeromekelleher requested review from hyanwong, petrelharp and benjeffery and removed request for hyanwong and petrelharp July 21, 2022 13:58

jeromekelleher requested a review from petrelharp July 21, 2022 14:01

jeromekelleher changed the title ~~Node time array~~ Table array access in TreeSequence Jul 21, 2022

jeromekelleher force-pushed the node_time_array branch from c106abe to d655c44 Compare July 21, 2022 14:12

jeromekelleher mentioned this pull request Jul 21, 2022

Add flags_array and time_array to Tree class? #1322

Closed

petrelharp reviewed Jul 22, 2022


petrelharp approved these changes Jul 22, 2022

jeromekelleher force-pushed the node_time_array branch from d655c44 to 4e863d8 Compare July 22, 2022 09:26

jeromekelleher mentioned this pull request Jul 22, 2022

Table collection returned by ts.tables not read-only. #760

Open

jeromekelleher added 2 commits July 26, 2022 11:43

Add pin for sphinx-autodoc-typehints for devs

2a50e21

Add direct access to all non-ragged table arrays

01ef912

Also remove most references to self.tables from the TreeSequence class, resolving a number of performance/memory issues. Closes tskit-dev#1916 Closes tskit-dev#1917 Closes tskit-dev#2423 Closes tskit-dev#2427

jeromekelleher force-pushed the node_time_array branch from 4e863d8 to 01ef912 Compare July 26, 2022 10:44

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 26, 2022

mergify bot merged commit ada9596 into tskit-dev:main Jul 26, 2022

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 26, 2022

jeromekelleher deleted the node_time_array branch July 26, 2022 11:31

benjeffery reviewed Jul 28, 2022
