-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
preseq pop_size may be a useful addition to the output report of MPRA-match #3
Comments
Thanks Mackenzie for flagging this. I think this step more generally could be improved. In my experience the plot we provide is more reflective of where you are on the saturation curve with respect to if additional sequencing will provide a positive ROI. I would expect, if you run the MPRA experiment the vast majority of barcodes int the RNA-seq will be assigned to an oligo. My best explanation for this is because the sequencing generates errors there are a lot of singleton barcodes that can create a misleading estimate by preseq. |
Ryan Stanton performed a spot-plating assay on the library graphed above, and computed that the total number of barcode-oligo combinations was 633333333.333. "Zooming in" and graphing only the first 10^9 reads of the preseq graph I made above: We see the same shape shown in the output of mpra_match It seems (?) that the first 10^9 reads give the appearance of convergence, even in cases where the library definitely isn't converging. I would guess that this is because the chances of seeing new barcode-oligo combinations is very high when the number of reads is low, then drops off quick. Consider that a complexity of 621148316 means that the first ~60% of the graph, the number of reads is actually less than the number of barcode-oligos! So the chance of seeing a new barcode-oligo is high and dropping fast. |
Wow, yea I think the complexity of your library is well beyond anything we anticipated and 100% agree your preseq plot that goes out to 10e12 reads is the correct way to look at your results. We will discuss how to fix this plot so it will work for a broader range of libraries. |
Seems like preseq is not quite estimating what we are actually interested in, if, say, 20% of the most common unique barcodes represents 80% of the "mass" of the library. So preseq might estimate that only 20% of the unique barcodes are recovered in MPRAmatch, but ~80% of the plasmid/RNA sequencing ends up being usable in MPRAcount. Such as in the scenario rtewhey mentioned about a lot of singleton barcodes from errors. Not sure what those "20%" and "80%" numbers would be. |
It may be worth adding the output of
preseq pop_size
to the report generated by MPRA-match as the graph can be misleading. As an example, here is a preseq graph generated from some MPRA data:It looks like it is trending towards convergence. However, the output of
preseq pop_size
reveals that we are only ~15% of the way to saturation. Graphingpreseq lc_extrap
all the way to convergence reveals:The text was updated successfully, but these errors were encountered: