Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table array access in TreeSequence #2424

Merged
merged 2 commits into from Jul 26, 2022

Conversation

jeromekelleher
Copy link
Member

@jeromekelleher jeromekelleher commented Jul 20, 2022

Add array access to TreeSequence and solve some performance issues/memory leaks. See below for more details.

@codecov
Copy link

codecov bot commented Jul 20, 2022

Codecov Report

Merging #2424 (d51f87e) into main (8c932c2) will increase coverage by 0.03%.
The diff coverage is 98.78%.

❗ Current head d51f87e differs from pull request most recent head 01ef912. Consider uploading reports for the commit 01ef912 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2424      +/-   ##
==========================================
+ Coverage   93.36%   93.39%   +0.03%     
==========================================
  Files          28       28              
  Lines       27056    27245     +189     
  Branches     1253     1253              
==========================================
+ Hits        25260    25445     +185     
- Misses       1762     1766       +4     
  Partials       34       34              
Flag Coverage Δ
c-tests 92.26% <ø> (-0.01%) ⬇️
lwt-tests 89.05% <ø> (ø)
python-c-tests 71.37% <96.34%> (+0.41%) ⬆️
python-tests 98.94% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/_tskitmodule.c 90.91% <98.60%> (+0.18%) ⬆️
python/tskit/trees.py 98.68% <100.00%> (+0.01%) ⬆️
c/tskit/trees.c 94.99% <0.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c932c2...01ef912. Read the comment docs.

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Jul 20, 2022

Fixes #2423:

Before using the standard ts.trees() approach:

timedesc-leak-before-trees

Before using the sequential ts.at(j) approach:
timedesc-leak-before-at

This branch:

timedesc-leak-after

The difference is pretty stark - 60M vs around 650M. It's interesting that it doesn't seem like this is a real "leak" per-se, because the memory does hit a high-water mark. It's an awful lot more though, and maybe we're having to run garbage collection or something to get it back?

@jeromekelleher
Copy link
Member Author

Fixes #1916:

Benchmark:

import msprime
import time


before = time.perf_counter()
ts = msprime.sim_ancestry(
    10000,
    sequence_length=1000000,
    population_size=10_000,
    recombination_rate=1e-8,
    random_seed=1234,
)
duration = time.perf_counter() - before
print(f"Simulation of {ts.num_trees} trees done after {duration:.2f} seconds")

for _ in range(1000):
    samples = ts.samples(time=0)
    # print(samples)

Before:
samples-mem-before

This branch:

samples-mem-after

@jeromekelleher
Copy link
Member Author

Performance improvement of ts.site(position=x). Benchmark:

import msprime
import time


before = time.perf_counter()
ts = msprime.sim_ancestry(
    10000,
    sequence_length=1000000,
    population_size=10_000,
    recombination_rate=1e-8,
    random_seed=1234,
)
ts = msprime.sim_mutations(ts, rate=1e-7, random_seed=1)
duration = time.perf_counter() - before
print(f"Simulation of {ts.num_trees} trees done after {duration:.2f} seconds")

for site in ts.sites():
    s = ts.site(position=site.position)
    if site.id >= 1000:
        break

Before:
site-search-before

After:
site-search-after

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Jul 21, 2022

Performance for using ts.variants(left=x):

import msprime
import time


before = time.perf_counter()
ts = msprime.sim_ancestry(
    10000,
    sequence_length=1000000,
    population_size=10_000,
    recombination_rate=1e-8,
    random_seed=1234,
)
ts = msprime.sim_mutations(ts, rate=1e-7, random_seed=1)
duration = time.perf_counter() - before
print(f"Simulation of {ts.num_trees} trees done after {duration:.2f} seconds")

for _ in range(10):
    for var in ts.variants(left=1):
        pass

Before:
varstart-before

After:
varstart-after

(Note before appears to be leaking memory, and since we released a version which always takes a copy of the tables in 0.5.1, the variants iterator is leaking at the moment in the released version).

Updated: opened #2427 to track the leak in the released version in 0.5.1 which leaks even when we don't specify left or right).

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Jul 21, 2022

Running edge_diffs:

import msprime
import time


before = time.perf_counter()
ts = msprime.sim_ancestry(
    10000,
    sequence_length=1000000,
    population_size=10_000,
    recombination_rate=1e-8,
    random_seed=1234,
)
# ts = msprime.sim_mutations(ts, rate=1e-7, random_seed=1)
duration = time.perf_counter() - before
print(f"Simulation of {ts.num_trees} trees done after {duration:.2f} seconds")

before = time.perf_counter()
for _ in range(100):
    for _ in ts.edge_diffs():
        pass
duration = time.perf_counter() - before

print(f"Ran edge diffs in {duration:.2f} seconds")

Before
edge_diffs_before

After:
edge_diffs_after

Interestingly there's no real difference in edge_diffs - we just take a little bit off the peak memory. We don't have a memory leak beforehand, which is a bit of a head-scratcher.

@jeromekelleher jeromekelleher marked this pull request as ready for review July 21, 2022 13:58
@jeromekelleher jeromekelleher requested review from hyanwong, petrelharp and benjeffery and removed request for hyanwong and petrelharp July 21, 2022 13:58
@jeromekelleher
Copy link
Member Author

jeromekelleher commented Jul 21, 2022

This is ready for review. The basic idea is to provide array access via the TreeSequence object in lieu of the "proper" solution of making TreeSequence.tables a read-only view (which is hard, engineering wise). On the way I've moved all internal usage of self.tables as they were all leading to significant performance problems (at best) and sometimes major memory leaks (#2428). I don't understand why some of these were leaking memory and others aren't, but we can follow that up later - at least library functions shouldn't leak and be efficient.

See #2426 for discussion of what we should do for the ragged columns (that I've left out here).

Since we've got some memory leaks in the released version we should probably get this out ASAP.

@jeromekelleher jeromekelleher changed the title Node time array Table array access in TreeSequence Jul 21, 2022
python/tskit/trees.py Outdated Show resolved Hide resolved
Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NICE!!! Looks good to me - just a docs suggestion/comment.

Also remove most references to self.tables from the TreeSequence class,
resolving a number of performance/memory issues.

Closes tskit-dev#1916
Closes tskit-dev#1917
Closes tskit-dev#2423
Closes tskit-dev#2427
@jeromekelleher
Copy link
Member Author

Marking this one for merging as I need it to work with the 1.4M sample tree sequences on Zenodo (and there's another bug that needs resolving to get a VCF out of them). Please do follow up with issues for any review problems you spot @benjeffery @hyanwong

@mergify mergify bot merged commit ada9596 into tskit-dev:main Jul 26, 2022
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 26, 2022
@jeromekelleher jeromekelleher deleted the node_time_array branch July 26, 2022 11:31
@hyanwong
Copy link
Member

Sorry for the slow review. LGTM - this is a great improvement. My only query would be if / how we intend to maintain this in the long term, once we address #760. Will we then deprecate these attributes, and simply have them as permanently maintained (and undocumented?) equivalents to ts.tables.nodes.time and friends?

@jeromekelleher
Copy link
Member Author

Will we then deprecate these attributes, and simply have them as permanently maintained (and undocumented?) equivalents to ts.tables.nodes.time and friends?

I think we can keep them permanently as part of the public API. There is an advantage of fewer attribute lookups in ts.node_flags to ts.tables.nodes.flags as well as a bit less typing, so no harm in keeping them I think.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Been through this closely and looks good!

ts = ts_fixture
tables = ts.tables

assert_array_equal(ts.individuals_flags, tables.individuals.flags)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not parameterise this with the same list as the last test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this was simpler and easier to follow. There would have been some obscure string parsing and getattring in order to get the corresponding table array, and this seemed more ... obvious.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants