Skip to content

Conversation

@benjeffery
Copy link
Member

Fixes #2111

Draft as still needs memory perf to check we have gains.

@codecov
Copy link

codecov bot commented Feb 8, 2022

Codecov Report

Merging #2124 (85f2d21) into main (7238acb) will decrease coverage by 0.00%.
The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2124      +/-   ##
==========================================
- Coverage   93.37%   93.37%   -0.01%     
==========================================
  Files          27       27              
  Lines       25591    25586       -5     
  Branches     1161     1161              
==========================================
- Hits        23895    23890       -5     
  Misses       1666     1666              
  Partials       30       30              
Flag Coverage Δ
c-tests 92.33% <93.33%> (-0.01%) ⬇️
lwt-tests 89.05% <ø> (ø)
python-c-tests 72.10% <ø> (ø)
python-tests 98.89% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
c/tskit/tables.c 90.27% <93.33%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7238acb...85f2d21. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I think there's a simpler approach where we deal with the (small number) of const columns separately.

@benjeffery
Copy link
Member Author

At the moment this is having no effect on max ram, digging in to check the changes are being used.

@jeromekelleher
Copy link
Member

Plots from memory-profiler are super helpful for this, should let us see the different stages on a simple script like

ts = tskit.load("big-freakin-file.trees")
ts.dump("big-freakin-file-copy.trees")

I guess we might have to manually turn up the sampling interval to see accurately what's happening though.

@benjeffery
Copy link
Member Author

benjeffery commented Feb 8, 2022

Plots from memory-profiler are super helpful for this, should let us see the different stages on a simple script like

ts = tskit.load("big-freakin-file.trees")
ts.dump("big-freakin-file-copy.trees")

I guess we might have to manually turn up the sampling interval to see accurately what's happening though.

This is actually an interesting challenge as the write arrays are copied then freed without control going back to python. /usr/bin/time -v shows max ram - but this is the loading value as that still copies. I think there is a way by making the tables, then dumping, but I have to be careful with the table growing thresholds.

@jeromekelleher
Copy link
Member

If you run it with mprof python script.py you can use the --interval option to set the sampling interval. This runs it as a forked subprocess, and doesn't know anything about Pytohn.

@benjeffery
Copy link
Member Author

benjeffery commented Feb 8, 2022

Got it! Seems to work

import tskit
import time
tables = tskit.TableCollection(1)
for i in range(10000):
    tables.populations.add_row(b"v"*100000)
time.sleep(5)
tables.dump("temp.trees")

main
main
branch
branch

@jeromekelleher
Copy link
Member

Kablammo! Nicely done @benjeffery - it's rare we get such clear wins (you should post this to the slack, in case folks haven't seen it).

@benjeffery benjeffery force-pushed the zero_copy_write branch 3 times, most recently from 8f2a1dd to 229d148 Compare February 9, 2022 14:18
@benjeffery benjeffery marked this pull request as ready for review February 9, 2022 14:19
@benjeffery
Copy link
Member Author

@jeromekelleher As discussed I've bought the file format writing up a level so all the arrays to write have the correct lifetime. This simplifies things so we have one code path for everything (except the 32bit arrays).

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is much simpler!

A few layout nitpicks and we're done. (I have a general phobia of repeated complicated function calls when all that's different is one parameter. It's so easy to introduce subtle issues later if you update one call later and forget to update the other one)

@benjeffery
Copy link
Member Author

@jeromekelleher all fixed up!

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@mergify mergify bot merged commit 734311a into tskit-dev:main Feb 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Zero copy writing to kastore

2 participants