Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preseq pop_size may be a useful addition to the output report of MPRA-match #3

Open
mackenzie-noon opened this issue Aug 18, 2023 · 4 comments

Comments

@mackenzie-noon
Copy link

It may be worth adding the output of preseq pop_size to the report generated by MPRA-match as the graph can be misleading. As an example, here is a preseq graph generated from some MPRA data:

image

It looks like it is trending towards convergence. However, the output of preseq pop_size reveals that we are only ~15% of the way to saturation. Graphing preseq lc_extrap all the way to convergence reveals:

unnamed-1

@rtewhey
Copy link
Contributor

rtewhey commented Aug 18, 2023

Thanks Mackenzie for flagging this. I think this step more generally could be improved. In my experience the plot we provide is more reflective of where you are on the saturation curve with respect to if additional sequencing will provide a positive ROI. I would expect, if you run the MPRA experiment the vast majority of barcodes int the RNA-seq will be assigned to an oligo. My best explanation for this is because the sequencing generates errors there are a lot of singleton barcodes that can create a misleading estimate by preseq.
Maybe a better approach would be to pass the barcodes through an error correction step first before generating the preseq plot. This might make the upper convergence line more reflective of the true plasmid population.

@rtewhey rtewhey closed this as completed Aug 18, 2023
@rtewhey rtewhey reopened this Aug 18, 2023
@mackenzie-noon
Copy link
Author

Ryan Stanton performed a spot-plating assay on the library graphed above, and computed that the total number of barcode-oligo combinations was 633333333.333. preseq pop_size estimated that the total number of unique molecules is 621148315.9. Assuming I'm reading their notes right, these numbers agree quite closely (~2% difference). Since spot-plating is not susceptible to sequencing errors, I think it's more likely that preseq is actually reporting the correct figure, and the graph has a problem.
Examining MPRA_match.wdl I see that preseq lc_extrap is called with argument -e 1000000000, which means it extrapolates to a maximum of 10^9 reads.

"Zooming in" and graphing only the first 10^9 reads of the preseq graph I made above:

unnamed

We see the same shape shown in the output of mpra_match

It seems (?) that the first 10^9 reads give the appearance of convergence, even in cases where the library definitely isn't converging. I would guess that this is because the chances of seeing new barcode-oligo combinations is very high when the number of reads is low, then drops off quick. Consider that a complexity of 621148316 means that the first ~60% of the graph, the number of reads is actually less than the number of barcode-oligos! So the chance of seeing a new barcode-oligo is high and dropping fast.

@rtewhey
Copy link
Contributor

rtewhey commented Aug 20, 2023

Wow, yea I think the complexity of your library is well beyond anything we anticipated and 100% agree your preseq plot that goes out to 10e12 reads is the correct way to look at your results.

We will discuss how to fix this plot so it will work for a broader range of libraries.
Thanks for flagging this and following up.

@stephenrong
Copy link

Seems like preseq is not quite estimating what we are actually interested in, if, say, 20% of the most common unique barcodes represents 80% of the "mass" of the library. So preseq might estimate that only 20% of the unique barcodes are recovered in MPRAmatch, but ~80% of the plasmid/RNA sequencing ends up being usable in MPRAcount. Such as in the scenario rtewhey mentioned about a lot of singleton barcodes from errors. Not sure what those "20%" and "80%" numbers would be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants