
Acceptable local assembly performance #74

Closed
iqbal-lab opened this issue Oct 19, 2018 · 41 comments

@iqbal-lab
Collaborator

@mbhall88 has been running "latest" Pandora with the perf changes you 3 produced on the full graph with a cardio sample. So far it has been running for days, sitting in the "get intervals" stage.
@mbhall88 feel free to edit/update with details.

@rmcolq
Collaborator

rmcolq commented Oct 19, 2018

Curious to know which bit it is getting stuck in specifically.
Also, did we finish adding tests for all of the work? Looking at the latest dev commit, there are some tests missing, or tests where important bits are commented out.

@mbhall88
Member

@rmcolq can you please link to examples you are talking about?

@mbhall88
Member

Job was killed 😞
Will restart it with some changes I have made.

@mbhall88
Member

#14 is related to this and identifies a possible slow spot.

@rmcolq
Collaborator

rmcolq commented Oct 20, 2018

If by examples, you mean which functions I am referring to as not having full tests:
add_pnode_coordinate_pairs seems to have a partial test; the important bit is commented out at this line:
https://github.com/rmcolq/pandora/blob/87c0d5d02edee404d183acf1b1940ec219689c3e/test/denovo_discovery/extract_reads_test.cpp#L988
collect_read_pileups has no test.

Commented-out tests that could be adjusted to the new code as extra test cases:
https://github.com/rmcolq/pandora/blob/87c0d5d02edee404d183acf1b1940ec219689c3e/test/denovo_discovery/extract_reads_test.cpp#L169
https://github.com/rmcolq/pandora/blob/87c0d5d02edee404d183acf1b1940ec219689c3e/test/denovo_discovery/extract_reads_test.cpp#L185

Also noticed we have two add_pnode_coordinate_pairs declarations in the header file; the second, along with save_read_strings_to_denovo_assemble, is junk.

When Zam says you have been "sitting in the 'get intervals' stage", do you know where exactly?

mbhall88 pushed a commit that referenced this issue Oct 22, 2018
remove redundant functions and tests as mentioned in #74
@mbhall88
Member

mbhall88 commented Oct 22, 2018

Ok. I have removed the junk.

@rmcolq Regarding the tests, I thought you said you could make some tests quite easily for those functions?

@rmcolq If I jump to the bottom of the log file at the moment, it shows the following

continue checking
check if pnodes defined
check if one is defined
again
continue checking
check if pnodes defined
check if one is defined
again
continue checking
check if pnodes defined
check if one is defined
again
continue checking
check if pnodes defined
check if one is defined
again
continue checking

This code comes from the < operator for GeneIntervalInfo https://github.com/rmcolq/pandora/blob/83e9858c467128d8d528121187fb73969cfe0160/src/denovo_discovery/gene_interval_info.cpp#L6

I am guessing this is being printed out whenever there is some kind of insertion into a map happening?
Therefore I am assuming it is collecting read pileups.
Will try and get some Valgrind CPU assessment on the de novo stuff today.
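
For context, here is a minimal hypothetical sketch (not pandora's actual types) of why logging inside a key comparator floods the log like this: ordered containers such as std::map call operator< several times for every insertion or lookup, so any message printed there repeats for every pileup collected.

#include <iostream>
#include <map>

// Hypothetical stand-in for GeneIntervalInfo: any type used as an ordered map key.
struct Interval {
    int start, end;
};

// A comparator that logs every call, mimicking the debug prints seen in the log.
// std::map invokes this O(log n) times per insertion or lookup, so the messages
// repeat for every read pileup that gets collected.
bool operator<(const Interval& lhs, const Interval& rhs) {
    std::cerr << "continue checking\n";
    return lhs.start < rhs.start || (lhs.start == rhs.start && lhs.end < rhs.end);
}

int main() {
    std::map<Interval, int> pileups;
    for (int i = 0; i < 5; ++i) {
        pileups[{i, i + 100}] += 1;  // each insert triggers several comparator calls
    }
}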

@rmcolq
Collaborator

rmcolq commented Oct 22, 2018 via email

@mbhall88
Member

Profiled CPU usage on a toy example using callgrind.

valgrind --tool=callgrind /Users/mbhall88/Projects/pandora/cmake-build-debug/pandora map --output_covgs --discover -p toy_prg.fa -r toy_5000_mapped.fastq.gz -o . --genome_size 20000 --genotype

callgrind.out.70583.txt

It segfaulted when it was running local assembly on an interval, but I have results for the code prior to that. It appears that, up until it segfaulted, 98.07% of CPU time was spent in FastaqHandler::get_id(), and more specifically on this line https://github.com/rmcolq/pandora/blob/fae5c3fd167364e25793ce250d79483fbe09d108/src/fastaq_handler.cpp#L138

Speaking with @rffrancon, it appears this line is not necessary: it seems to be used to clear the istream, but clear() is called on the istream two lines below anyway. Unless there was a specific reason for this line, @rmcolq, I will remove it and rerun the analysis.
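
Purely for illustration (this is not the actual FastaqHandler code), the kind of redundancy being described looks roughly like this: the first stream reset is pointless because the state is cleared again immediately before the seek.

#include <fstream>

// Hypothetical sketch of the redundancy described above: the first clear() adds
// nothing because the stream state is cleared again two lines below, just before
// seeking back to the start of the file.
void rewind(std::ifstream& instream) {
    instream.clear();                 // redundant reset (the line to be removed)
    // ... other bookkeeping ...
    instream.clear();                 // the reset that is actually kept
    instream.seekg(0, std::ios::beg); // rewind to the beginning
}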

screenshot 2018-10-23 10 03 51

@rmcolq
Collaborator

rmcolq commented Oct 23, 2018

Excellent, if it runs without it, and tests pass, do remove it!

@iqbal-lab
Collaborator Author

@rmcolq
Collaborator

rmcolq commented Oct 23, 2018

At one point I tried FQFeeder - at the time it was slower, but I was doing different things. It may be faster for this.

@ffranr
Contributor

ffranr commented Oct 23, 2018

Seems very popular:
https://github.com/seqan/seqan
https://seqan.readthedocs.io/en/seqan-v1.4.2/Tutorial/SequenceFileIO.html

#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open file and create RecordReader.
    std::fstream in(argv[1], std::ios::binary | std::ios::in);
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(in);

    // Read file record-wise.
    seqan::CharString id;
    seqan::Dna5String seq;
    seqan::CharString qual;
    while (!atEnd(reader))
    {
        if (readRecord(id, seq, qual, reader, seqan::Fastq()) != 0)
            return 1;  // Could not read record from file.

        std::cout << id << "\t" << seq << "\n";
    }
    
    return 0;
}

@rmcolq
Collaborator

rmcolq commented Oct 23, 2018

Now added the missing tests in commit ba4078c

@mbhall88
Member

Most recent callgrind analysis shows the following call graph

screenshot 2018-10-24 14 55 50

This is obviously only a small portion of the overall call graph, but it is where about half the run-time is being spent at the moment on my toy dataset. This is the higher-level shot:

screenshot 2018-10-24 14 58 19

This example runs local assembly on two intervals in total.

Here is the callgrind log file
callgrind.out.494.txt

@mbhall88
Member

mbhall88 commented Oct 24, 2018

Looking into this a little, it seems - for this toy example - 21.86% of time is spent on this
https://github.com/rmcolq/pandora/blob/5786548a1a1111a4990f0b8a6ec3335e4d1a5319/src/minimizer.cpp#L12

@rmcolq Is there a better way of achieving this? A lookup table or approximation perhaps? It is called 6 million times.
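
The thread doesn't say exactly what that line computes, so as a generic illustration only of the lookup-table suggestion (assuming, purely for example, that the hot work is per-base, such as complementing characters), one option is to replace per-nucleotide branching with a table built once:

#include <algorithm>
#include <array>
#include <string>

// Hypothetical illustration of the lookup-table idea: a 256-entry complement
// table built once, instead of a switch/if executed for every base of every call.
inline std::string reverse_complement(const std::string& seq) {
    static const auto table = [] {
        std::array<char, 256> t;
        t.fill('N');
        t['A'] = 'T'; t['T'] = 'A'; t['C'] = 'G'; t['G'] = 'C';
        t['a'] = 't'; t['t'] = 'a'; t['c'] = 'g'; t['g'] = 'c';
        return t;
    }();
    std::string rc(seq.rbegin(), seq.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(),
                   [](unsigned char c) { return table[c]; });
    return rc;
}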

@ffranr
Contributor

ffranr commented Oct 24, 2018

I didn't think of this before now as I don't really use asserts: https://codeyarns.com/2015/03/16/how-to-disable-assert-in-gcc/

I would go with the CMake option.

@iqbal-lab
Collaborator Author

Correct me if I'm wrong @rffrancon, but I think Robyn is saying you can set things up so you can compile the code in two ways.
In "debug mode" all asserts are present: you can run unit tests and do test runs.
In "NDEBUG mode", those extra asserts are stripped out, so they do not slow you down.

@mbhall88
Member

Is this line where I would put the compiler flag?
https://github.com/rmcolq/pandora/blob/5786548a1a1111a4990f0b8a6ec3335e4d1a5319/CMakeLists.txt#L24

@mbhall88
Member

mbhall88 commented Oct 25, 2018

Compiling with the -DNDEBUG flag changes things quite a bit in terms of the percentage of CPU time spent in certain functions. Only 2.07% of time is now spent initialising minimisers, compared to 25.68% before.

Here is an overview of the call graph now.

screenshot 2018-10-25 10 49 15

In terms of absolute time, averaging 5 runs on the toy dataset for each:

  • With asserts: 2.492s +- 0.200s
  • Without asserts: 2.228s +- 0.190s

Which is an 11% speedup on the toy dataset.

@mbhall88
Member

Going to close this as speed is now "acceptable"; #150 is now tracking the performance of the map module.

@mbhall88 mbhall88 changed the title Acceptable pandora map+de novo performance Acceptable local assembly performance Oct 27, 2020
@mbhall88 mbhall88 reopened this Oct 27, 2020
@mbhall88 mbhall88 self-assigned this Oct 27, 2020
@mbhall88
Member

mbhall88 commented Oct 27, 2020

I am opening this issue to track an investigation I am currently doing to pinpoint concrete examples of where our slow points in local assembly are. This analysis also feeds directly into #195

Whilst trying to use pandora in the head to head analysis (see mbhall88/head_to_head_pipeline#48), I have been rerunning pandora discover a lot. What I have been struggling with is that there are examples of samples that can take upwards of 4 days to get through all candidate regions. And in some cases, they never finish before I kill the job.

I selected an example of a sample that took 4 days to (successfully) run pandora discover - sample mada_1-25. In the head to head analysis, I am running each sample twice with two PRGs of different density - sparse (variants from ~80 samples) and dense (variants from ~400 samples).

I have parsed the log file to get information about how long local assembly took on each candidate region, and how long each region is, to see whether the length of a candidate region is correlated with its runtime or whether something else is at play.

[plots: candidate region length vs. local assembly runtime for sample mada_1-25]

Conclusion

The length of a candidate region, at least in this example, does not seem to have a noticeable impact on the amount of time it takes to run local assembly. We can also see that, for this sample, there is a single candidate region that takes nearly all of the time.

75% of all candidate regions take 0.26-0.37 seconds to complete, with the remaining 25% taking next to no time (these are likely regions with little to no reads to build the dBG from, plus various other GATB errors).

Take a sample with 10,000 candidate regions: if 75% of them run for ~0.3 seconds each, we will spend 37.5 minutes enumerating paths through local assembly graphs. Theoretically, if #195 is implemented, each extra thread should equate to a proportional reduction in local assembly runtime.

However, this does not address the issue of single regions that take 4 days! When I looked at this candidate region a little closer, I found it was PE_PGRS54+IGR:3936710-3936876 - interval [1916, 2046) - and for each PRG it produced 14 paths. It tried 4 combinations of start/end kmers before it was able to find any paths. Each of these enumerations took ~12 hours. I calculated the GC-content for this gene and it is one of the genes with ~80% GC-content, so I suspect that a huge number of cycles is causing this slowdown. (As an aside, I also saw another example of this with PE_PGRS53, which has 78% GC-content.)

Next step

I will pull out this PRG and the reads that map to it and play around with discover params that can help reduce this runtime.

@iqbal-lab
Collaborator Author

iqbal-lab commented Oct 27, 2020

Do you keep a count of how many "loops you meet"? Maybe the number of times you meet a kmer you've met before? Could have a ceiling. In principle this can be partially done at construction time, not runtime. We could take the reference for a locus and count how many repeated kmers there are.

@iqbal-lab iqbal-lab reopened this Oct 27, 2020
@iqbal-lab
Collaborator Author

Sorry I accidentally closed the issue

@leoisl
Collaborator

leoisl commented Oct 27, 2020

Do you keep a count of how many "loops you meet"? Maybe the number of times you meet a kmer you've met before? Could have a ceiling. In principle this can be partially done at construction time, not runtime. We could take the reference for a locus and count how many repeated kmers there are.

Sorry for the long message, it seems I get carried away when brainstorming...

We had a talk late Monday UK / early Tuesday Aus, and we actually came up with a similar idea, but at runtime. We could limit how many times we loop through cycles simply by keeping a count of how many times we have visited each kmer when building a path; if we have already visited a kmer, say, 5 times, we don't consider it as a candidate to get into the path again (sort of removing the kmer from the graph). But this was during runtime and during path building... Removing the kmer at construction time has the advantage of being faster (i.e. we won't even loop through repeat kmers, as they will be removed outright), but it might really affect results (e.g. it might be that the true denovo path to be found goes through the repeat, and removing these repeat kmers makes it impossible to find it).
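
A hedged sketch of that runtime idea (generic adjacency lists rather than pandora's GATB-based DFS; the names and the limit of 5 are illustrative only):

#include <string>
#include <unordered_map>
#include <vector>

// Sketch: cap how many times any one kmer may appear on the path currently
// being built, so cycles are traversed at most max_visits times instead of
// being enumerated endlessly.
using Graph = std::unordered_map<std::string, std::vector<std::string>>;

void enumerate_paths(const Graph& g, const std::string& node, const std::string& end,
                     std::vector<std::string>& path,
                     std::unordered_map<std::string, int>& visits,
                     std::vector<std::vector<std::string>>& out,
                     const int max_visits = 5) {
    if (++visits[node] > max_visits) {  // kmer already used too often on this path
        --visits[node];
        return;
    }
    path.push_back(node);
    if (node == end) {
        out.push_back(path);
    } else if (const auto it = g.find(node); it != g.end()) {
        for (const auto& next : it->second)
            enumerate_paths(g, next, end, path, visits, out, max_visits);
    }
    path.pop_back();
    --visits[node];
}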

I guess we need to wait for more details on this region from Michael. I would like very much to see the graph (I think GATB can export to GFA? Or just export to a tabular format, and we can view it in Cytoscape). I wonder if we have several loops over loops, etc. there. It is fine that we don't perform well on this; these repeat-induced, full-of-cycles graphs are one of the main sources of hardness in assembling genomes, and everyone struggles...

This was one of my PhD topics: trying to resolve repeats (which induce cycles), but for assembling human alternative splicing events. I did not succeed much, but some strategies that I remember:

  • Cycle-visiting threshold: Limit the number of times we visit a kmer (as we already said before);

  • Variable-k assembly: We can compute in linear time the number of cycles a digraph has, and this could give us evidence of whether the graph is complex or not. If the graph is too complex (e.g. it has too many cycles), we could increase the value of k. This will break some repeats and make the graph simpler. We can do this until the graph is not so complex, and then we process it (in the worst case, k is too big, and there are no cycles, but also no paths, and we find nothing). This means we have a variable-k denovo discovery method: for simple graphs, a small k should be enough; for complicated ones, we increase k. This is used in several assemblers;

  • Read threading: Not all paths are read coherent. A read-coherent path is a path that has a minimum number of reads supporting its existence. So the idea is to only build read-coherent paths, which is called read threading. What could help us in this example is: if in the reads we see that a given repeat is present exactly 4 or 5 times, it is not read coherent to loop through that repeat 3 or fewer times, or 6 or more times, and thus we limit our looping through that repeat to exactly 4 or 5 times. Fun fact: I started thinking about this idea towards the end of my PhD, only to find that Isaac, Zam et al. had already thoroughly thought about it, implemented it, benchmarked it and written a paper :p : https://academic.oup.com/bioinformatics/article/34/15/2556/4938484 . This is not something simple to implement though; it will take, I guess, a good time investment, and I don't know if it will solve the performance issue... I am listing it here anyway because I am wondering whether, if we switched the DBG library to mccortex, we could gain this feature for free...

There are several other ways to tackle repeats/cycles/complex assembly graphs; we might need to research them and pick the one that best fits our data.

@iqbal-lab
Collaborator Author

I was simply suggesting counting high-repeat kmers at construction time - I wasn't suggesting removing them from the middle of the graph.
If there are many (or any), we could decide this region is too hard to do local assembly on.
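
A minimal sketch of that construction-time check, assuming the locus reference sequence is available (k and the threshold below are placeholders, not proposed defaults):

#include <string>
#include <unordered_map>

// Count kmers in the locus reference that occur more than once; a region whose
// count exceeds some threshold could be flagged as too repetitive for local assembly.
unsigned count_repeated_kmers(const std::string& ref, const unsigned k) {
    if (ref.size() < k) return 0;
    std::unordered_map<std::string, unsigned> counts;
    for (std::size_t i = 0; i + k <= ref.size(); ++i)
        ++counts[ref.substr(i, k)];
    unsigned repeated = 0;
    for (const auto& kv : counts)
        if (kv.second > 1) ++repeated;
    return repeated;
}

bool too_repetitive(const std::string& ref, const unsigned k = 15,
                    const unsigned max_repeated_kmers = 20) {
    return count_repeated_kmers(ref, k) > max_repeated_kmers;
}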

@leoisl
Collaborator

leoisl commented Oct 27, 2020

Oh, I see! That is another way of deciding whether the graph is complex or not. We could increase the k-mer size and recompute this stat instead of directly abandoning the assembly - do you think it is worth it?

@iqbal-lab
Collaborator Author

The mccortex read-coherence thing is expensive and uses I/O. Right now I'm thinking more in terms of discarding candidate regions if they are too repeaty. They would also be difficult regions for mapping/normal approaches anyway.
No need for tonnes of machinery if this just loses a small amount of genome.

@iqbal-lab
Collaborator Author

The cost/benefit is all about how many regions like this there are.

@iqbal-lab
Collaborator Author

Important to avoid trying too hard to squeeze out tiny increases in theoretical recall.

@iqbal-lab
Collaborator Author

But my idea of looking at the locus might or might not spot Michael's problem region, and it might spot loads of others that local assembly copes with fine. We need data, really.

@leoisl
Collaborator

leoisl commented Oct 27, 2020

Important to avoid trying too hard to squeeze out tiny increases in theoretical recall.

I do agree with this, especially given my recent experience on the pandora paper. Finding the correct denovo path for this single region among several thousand other regions won't really make a difference to the results... I have changed my mind from trying to deal with these complicated regions to identifying and skipping them.

But my idea of looking at the locus might or might not spot Michael's problem region, and it might spot loads of others that local assembly copes with fine. We need data, really.

Yeah, I think we need a measure that identifies the complexity of the region and correlates with the time spent on it. Maybe a high GC-content could flag a complex region and we don't process it (although I don't think this was the only high-GC-content region), or maybe the number of repeated kmers, or the number of cycles, etc...

IIRC, Michael also suggested a timeout on the finding-denovo-paths algorithm. I was previously opposed to this, as things can become non-reproducible, there might be other external factors that impact runtime (e.g. filesystem latency, CPU load, etc.), and we might end up missing a bunch of regions... but rethinking it, a timeout is indeed an option, as almost everything takes <5s and a single region takes 48h (not sure this is the behaviour of other datasets).
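
For reference, a hedged sketch of what the timeout option could look like (hypothetical names, not pandora's actual code): check elapsed wall-clock time inside the path-enumeration loop and abandon the candidate region once a budget is exceeded, recording it as skipped.

#include <chrono>

// Wall-clock budget for one candidate region; the enumeration loop polls
// expired() and bails out (marking the region as skipped) once over budget.
class RegionTimer {
public:
    explicit RegionTimer(const std::chrono::seconds budget)
        : start_{std::chrono::steady_clock::now()}, budget_{budget} {}

    bool expired() const {
        return std::chrono::steady_clock::now() - start_ > budget_;
    }

private:
    std::chrono::steady_clock::time_point start_;
    std::chrono::seconds budget_;
};

// Hypothetical usage inside the enumeration loop:
//   RegionTimer timer{std::chrono::seconds{300}};
//   while (more_paths_to_try()) {
//       if (timer.expired()) { /* warn and mark region as skipped */ break; }
//       ...
//   }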

@iqbal-lab
Collaborator Author

Not sure if this link will work:
https://bacteria.ensembl.org/Mycobacterium_tuberculosis_h37rv/Location/View?r=Chromosome:3936710-3936876;site=ensemblthis;db=core

so I am attaching a screenshot. That's a lot of consecutive G/C.

Screenshot 2020-10-27 at 16 28 31

@mbhall88
Member

No, we do not currently keep track of how many times we visit a kmer. I could definitely implement this, but it would need to alter the data structure for our DFS tree. I will come back to this...

@leoisl unfortunately GATB doesn't output GFA, only HDF5. There is a hack I can do to get a cortex graph, but for the moment I have been working on other avenues as this is very time consuming.

Thanks for the ideas and discussion! I think the node visiting is a great idea.


For the long-running candidate region I have been prodding at, I've found out some more information. If I change the distance for merging candidate regions to 15 (instead of 22) then we have no runtime issues. This causes the region responsible for the long runtime to be broken into three separate regions, and each of these completes quickly.

Digging in a little deeper I suspect it is the region near the beginning of the original longer region that is causing problems. When I zoom in on that region it has a GC content of ~90%!! https://bacteria.ensembl.org/Mycobacterium_tuberculosis_h37rv/Location/View?db=core;g=Rv3508;r=Chromosome:3932920-3933052;t=CCP46330

Mycobacterium_tuberculosis_H37Rv_Chromosome3932920_3933052

If I clean the GATB graph (with the merge distance at the original 22), discover does complete on that long-runtime region in 6 minutes, but it abandons the region as there are too many paths.

If I use a discover kmer size of 15 (with the original merge distance) it also completes quickly, but it can't find a path between any anchor combination for that region.

Conclusion

I am going to look at the impact of reducing the merging distance to 15, of cleaning the GATB graph, and of --discover-k 15, all on precision/recall.

@leoisl
Collaborator

leoisl commented Oct 28, 2020

Nice investigation, this cleared up a lot of doubts! Anyway, I think you will be able to tweak some parameters here and there and run de novo discovery on these samples in a short amount of time, but I think this issue might still come up for us, or for a user, in the future.

Should we still think about a way to flag a graph as too complex, and about strategies to deal with it (i.e. try to automate what Michael is doing)? I think simply skipping the region and warning the user how many regions were skipped could already be a good first solution. Other solutions include, I think, doing what you are about to test: reducing the merging distance and checking whether the resulting subgraphs are no longer complex, or cleaning, or increasing k (it will depend on what works for you).

@mbhall88
Member

Update from overnight: I reran the pipeline with --discover-k 11 --merge-dist 15. This completed all but one sample in perfectly reasonable time and actually gave us a recall and precision boost compared to --merge-dist 22. I am now trying --clean-dbg --discover-k 13 --merge-dist 15 to see what impact (if any) cleaning the GATB graphs has on results.


Re: our ongoing discussion about the PE_PGRS54 locus

I think we actually make this complex region even more difficult for ourselves due to the way we select variants when building our Mtb PRG (mbhall88/head_to_head_pipeline#10).

We apply some filtering to the VCFs for the sparse and dense PRGs - see https://github.com/mbhall88/head_to_head_pipeline/blob/0323cbb8b088c357e39d62489721682e0fc8b707/data/H37Rv_PRG/Snakefile#L262-L298

The two bits of filtering that are important for the current issue are:

  • filtering out positions in a genome mask
  • applying VCF FILTER column

Now, these are normally totally sane filters. But we know that pe/ppe genes commonly fall inside these masks. (I also include FILTER in this, as the clockwork VCFs have a FILTER class for positions in the mask.) So whilst we think we're doing ourselves a favour by not including variants in the mask, we are actually removing complexity from these regions, with the end result being that de novo discovery struggles: it is being triggered a lot and has next to no prior information in the PRG (this is almost the same issue that was being caused by mbhall88/head_to_head_pipeline#53 and mbhall88/head_to_head_pipeline#54). To back up this claim, there are only 2 variants in the PRG for this PE_PGRS54 locus, but lots of variants in the CRyPTIC VCFs that look reasonable and are all in the mask.

Summary

I don't think we should be masking at the PRG construction stage - this should be a downstream process.

@mbhall88
Member

mbhall88 commented Nov 2, 2020

Update from overnight: I reran the pipeline with --discover-k 11 --merge-dist 15. This completed all but one sample in perfectly reasonable time and actually gave us a recall and precision boost compared to --merge-dist 22. I am now trying --clean-dbg --discover-k 13 --merge-dist 15 to see what impact (if any) cleaning the GATB graphs has on results.

Cleaning the local assembly (GATB) graph seems to hit us pretty hard in terms of recall.

The fourth pair of boxes from the left is the run with cleaning

[recall boxplot]

However, we do gain precision - but bear in mind this is without any post-filtering.

[precision boxplot]


It is beginning to look like the de novo k-mer size isn't as influential as I first thought (at least for Mtb). I'm still exploring parameters though.

@leoisl
Collaborator

leoisl commented Nov 2, 2020

That looks like the classic DBG-cleaning result: lower recall, higher precision. It is interesting nonetheless; I'd propose adding this to your Mtb paper (at least as supplementary... not sure if it fits the paper though... but I bet there will be use cases where we prefer better precision over recall, and this shows a way of getting it...).

@iqbal-lab
Collaborator Author

  1. How is recall better with no prg?
  2. Interesting about clean dbg, what does it precisely mean?
  3. All of this is with abundance threshold at default?

@mbhall88
Member

mbhall88 commented Nov 3, 2020

1. How is recall better with no prg?

That's COMPASS

2. Interesting about clean dbg,what does it precisely mean?

https://github.com/rmcolq/pandora/blob/0f1210e6927b905c0a1fe03612ed067d73471851/src/denovo_discovery/local_assembly.cpp#L256-L272

3. All of this is with abundance threshold at default?

The abundance threshold is indicated by abdunx in the names on the x-axis

@mbhall88
Member

mbhall88 commented Nov 4, 2020

I am now confident I have explored the discover parameters thoroughly enough. Based on the following recall plot, I am going to put in a PR to change the default pandora discover parameters to --discover-k 15 --merge-dist 15 -L 30 (the boxes on the far right of the plot below). In addition, I will remove the --padding option and hard-code it to twice the discover-k value.

[recall plot for the explored discover parameter combinations]

From the point of view of performance, the summary statistics of wall-time for the discover jobs for the newly proposed defaults (with 8 threads) are:

count    600.000000
mean       0.730133
std        0.378366
min        0.480000
25%        0.570000
50%        0.630000
75%        0.712500
max        4.330000
