Optimize two-locus site operations #3291
Conversation
Codecov Report

    @@            Coverage Diff             @@
    ##             main    #3291      +/-   ##
    ==========================================
    - Coverage   89.79%   89.79%   -0.01%
    ==========================================
      Files          29       29
      Lines       30962    31008      +46
      Branches     5664     5673       +9
    ==========================================
    + Hits        27803    27843      +40
    - Misses       1775     1778       +3
    - Partials     1384     1387       +3

Flags with carried forward coverage won't be shown.
LGTM! Overall looks like a nice cleanup with simpler code. I haven't grokked every detail, but assuming our test coverage is good I think it's fairly safe.
A few minor details pointed out, happy to merge after that.
I've cleaned things up a bit, please have another look. I think the memory ownership of get_mutation_sample_sets makes a lot more sense now. The caller allocates all memory and get_mutation_sample_sets fills the inputs with data.

I'm happy to merge this whenever. @benjeffery how does this work fit in with our plans for tagging tskit 1.0?
I'm happy to push this out as part of 1.0.0?
c/tskit/core.c (Outdated)

    tsk_bitset_count(const tsk_bitset_t *self, tsk_size_t row)
    {
        tsk_size_t i = 0;
        uint32_t count = 0;
Nit-picky, but strictly speaking, with _TSK_BIG_TABLES and enough samples this would silently give the wrong count?
Good point, and I'm even casting to a tsk_size_t at the end. I was trying to avoid 32->64 bit casting in this loop, but I doubt it'll make much difference in the end. I've switched to tsk_size_t. If I understand correctly, this change addresses the issue you're concerned about?
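For illustration, here is a minimal sketch of the overflow-safe counter being discussed. The typedef and struct are simplified stand-ins, not tskit's actual definitions (with _TSK_BIG_TABLES, tskit's tsk_size_t is 64 bits wide), and __builtin_popcount is the GCC/Clang intrinsic:

    #include <stdint.h>

    /* Simplified stand-ins for tskit's types; with _TSK_BIG_TABLES,
     * tsk_size_t is 64 bits wide. */
    typedef uint64_t tsk_size_t;

    typedef struct {
        tsk_size_t len;     /* number of sets (rows) */
        tsk_size_t row_len; /* number of 32-bit chunks per row */
        uint32_t *data;
    } tsk_bitset_t;

    static tsk_size_t
    tsk_bitset_count(const tsk_bitset_t *self, tsk_size_t row)
    {
        const uint32_t *chunks = self->data + row * self->row_len;
        tsk_size_t i;
        tsk_size_t count = 0; /* tsk_size_t, not uint32_t: no silent wraparound
                               * once a row holds more than UINT32_MAX set bits */

        for (i = 0; i < self->row_len; i++) {
            count += (tsk_size_t) __builtin_popcount(chunks[i]);
        }
        return count;
    }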
LGTM!
Force-pushed from c5ebfab to 0f075bf.
Rebased/squashed. Thanks for the useful feedback!
This PR is a combination of three separate modifications. They are described below and in #3290. Fixes #3290.
* Two-locus malloc optimizations

This revision moves all malloc operations out of the hot loop in two-locus statistics, instead providing pre-allocated regions of memory that the two-locus framework will use to perform work. Instead of simply passing each pre-allocated array into each function call, we introduce a simple structure called two_locus_work_t, which stores the statistical results and provides temporary arrays for storing the normalisation constants. Setup and teardown methods for this work structure are provided. Python and C tests are passing and valgrind reports no errors.
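A minimal sketch of this pre-allocation pattern; the field names and signatures are illustrative, not tskit's actual layout. The caller initialises the work structure once, the hot loop borrows its buffers, and teardown happens once at the end:

    #include <stdlib.h>
    #include <string.h>

    /* Illustrative only: a work area allocated once, outside the hot loop. */
    typedef struct {
        double *result;     /* accumulated statistic for one site pair */
        double *norm;       /* scratch space for normalisation constants */
        size_t result_len;
        size_t norm_len;
    } two_locus_work_t;

    static int
    two_locus_work_init(two_locus_work_t *self, size_t result_len, size_t norm_len)
    {
        memset(self, 0, sizeof(*self));
        self->result = calloc(result_len, sizeof(*self->result));
        self->norm = calloc(norm_len, sizeof(*self->norm));
        if (self->result == NULL || self->norm == NULL) {
            return -1; /* caller cleans up via two_locus_work_free */
        }
        self->result_len = result_len;
        self->norm_len = norm_len;
        return 0;
    }

    static void
    two_locus_work_free(two_locus_work_t *self)
    {
        free(self->result);
        free(self->norm);
    }

Hoisting this alloc/free pair out of the per-site-pair loop is the point of the change: the hot loop only writes into result and norm, never calls malloc.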
* Refactor bit array API, rename to bitset

As discussed in #2834, this patch renames tsk_bit_array_t to tsk_bitset_t. Philosophically, we treat these as sets and not arrays, performing intersections, unions, and membership tests. Therefore, it makes sense to alter the API to use set-theoretic vocabulary, describing the intent more precisely. Fundamentally, the bitset structure is a list of N independent bitsets, and each operation on two sets must select the row on which to operate. The tsk_bitset_t originally tracked only len, which was N, the number of sets. For convenience, we also track row_len, the number of unsigned integers per row; multiplying row_len by TSK_BITSET_BITS gives the number of bits that each set (or row) in the list of bitsets can hold.

We had also discussed having each set-theoretic operation accept a row index instead of a pointer to a row within the bitset object. Now, each operation accepts a row index for each bitset structure passed into the function. This simplifies consumption of the API considerably, removing the need to store and track many intermediate temporary array pointers. We also see some performance improvements from this cleanup.
For DRY purposes, I've created a private macro, BITSET_DATA_ROW, which abstracts away the pointer arithmetic for selecting a row out of the list of sets. Because of these changes, tsk_bit_array_get_row is no longer needed and has been removed from the API.

This change does not alter the size of the "chunk", the unsigned integer that stores the bits; it remains a 32-bit unsigned integer, which is the most performant width for bit counting (popcount). I've streamlined the macros used to determine which integer in a row stores a particular bit: everything now revolves around the TSK_BITSET_BITS macro, which is simply 32, and bitshift operations have been converted to unsigned integer division.
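The following sketch shows the shape of the row-indexed API under simplified, assumed definitions (the real tskit signatures may differ): BITSET_DATA_ROW does the row pointer arithmetic, bit placement uses unsigned division and modulus by TSK_BITSET_BITS, and the intersection selects a row by index on each operand:

    #include <stdbool.h>
    #include <stdint.h>

    #define TSK_BITSET_BITS 32

    typedef uint64_t tsk_size_t; /* simplified stand-in */

    typedef struct {
        tsk_size_t len;     /* number of sets */
        tsk_size_t row_len; /* 32-bit chunks per set */
        uint32_t *data;
    } tsk_bitset_t;

    /* Pointer arithmetic for selecting row r out of the list of sets. */
    #define BITSET_DATA_ROW(self, r) ((self)->data + (r) * (self)->row_len)

    static void
    tsk_bitset_add(tsk_bitset_t *self, tsk_size_t row, tsk_size_t bit)
    {
        /* Division picks the chunk, modulus picks the bit within it. */
        BITSET_DATA_ROW(self, row)[bit / TSK_BITSET_BITS]
            |= UINT32_C(1) << (bit % TSK_BITSET_BITS);
    }

    static bool
    tsk_bitset_contains(const tsk_bitset_t *self, tsk_size_t row, tsk_size_t bit)
    {
        return (BITSET_DATA_ROW(self, row)[bit / TSK_BITSET_BITS]
                   >> (bit % TSK_BITSET_BITS)) & 1;
    }

    /* out[out_row] = a[a_row] & b[b_row]; rows are selected by index,
     * so callers never hold raw row pointers. */
    static void
    tsk_bitset_intersect(const tsk_bitset_t *a, tsk_size_t a_row,
        const tsk_bitset_t *b, tsk_size_t b_row,
        tsk_bitset_t *out, tsk_size_t out_row)
    {
        tsk_size_t i;

        for (i = 0; i < out->row_len; i++) {
            BITSET_DATA_ROW(out, out_row)[i]
                = BITSET_DATA_ROW(a, a_row)[i] & BITSET_DATA_ROW(b, b_row)[i];
        }
    }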
Testing has been refactored to reflect these changes, removing tests that operate on specific rows. Tests in C and Python are passing and valgrind shows no errors.

Fixes #2834.
* Precompute A/B counts and biallelic summary func

Precompute A/B counts for each sample set. We were previously computing them redundantly for each site pair in our results matrix. The precomputation happens in a function called get_mutation_sample_sets, which takes our list of sets (tsk_bitset_t) for each mutation and intersects the samples carrying a particular mutation with the sample sets passed in by the user. The result is an expanded list of sets, with one set per mutation per sample set. During this operation, we compute the number of samples carrying the given allele for each mutation, avoiding the need to perform redundant count operations on the data.
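A sketch of the caller-allocates contract, reusing the illustrative tsk_bitset_t, tsk_bitset_count, and tsk_bitset_intersect definitions from the sketches above; the signature is a stand-in, not the actual tskit prototype:

    /* Illustrative sketch. The caller allocates `out` (one row per
     * mutation per sample set) and `allele_counts`; this function only
     * fills them, so nothing is malloc'd in the hot path. */
    static void
    get_mutation_sample_sets(const tsk_bitset_t *mutation_samples,
        tsk_size_t num_mutations, const tsk_bitset_t *sample_sets,
        tsk_size_t num_sample_sets, tsk_bitset_t *out, tsk_size_t *allele_counts)
    {
        tsk_size_t m, s, row;

        for (m = 0; m < num_mutations; m++) {
            for (s = 0; s < num_sample_sets; s++) {
                row = m * num_sample_sets + s;
                /* Samples that carry mutation m AND belong to sample set s. */
                tsk_bitset_intersect(mutation_samples, m, sample_sets, s, out, row);
                /* Count once here, instead of once per site pair later. */
                allele_counts[row] = tsk_bitset_count(out, row);
            }
        }
    }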
In addition to precomputation, we add a non-normalized version of compute_general_two_site_stat_result for situations where we're computing stats from biallelic loci. We dispatch the computation of the result based on the number of alleles at the two loci we're comparing. If both loci have exactly two alleles, we simply perform an LD computation on the derived alleles of the two loci, removing the need to compute a matrix of LD values and then take a weighted sum. This is much more efficient and means that we only run the full multiallelic LD routine on sites that are actually multiallelic.
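A rough sketch of that dispatch; apart from compute_general_two_site_stat_result, the names and the simplified signatures are hypothetical stand-ins:

    /* Hypothetical prototypes standing in for the real routines, which
     * carry considerably more state in their actual signatures. */
    int compute_biallelic_two_site_stat_result(
        two_locus_work_t *work, double *result);
    int compute_general_two_site_stat_result(
        two_locus_work_t *work, double *result);

    int
    compute_two_site_stat(tsk_size_t num_alleles_a, tsk_size_t num_alleles_b,
        two_locus_work_t *work, double *result)
    {
        if (num_alleles_a == 2 && num_alleles_b == 2) {
            /* Fast path: one LD computation on the two derived alleles;
             * no per-allele-pair LD matrix and no weighted sum. */
            return compute_biallelic_two_site_stat_result(work, result);
        }
        /* General path: full multiallelic LD matrix plus weighted sum. */
        return compute_general_two_site_stat_result(work, result);
    }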