preseq pop_size may be a useful addition to the output report of MPRA-match #3

mackenzie-noon · 2023-08-18T15:24:00Z

It may be worth adding the output of preseq pop_size to the report generated by MPRA-match as the graph can be misleading. As an example, here is a preseq graph generated from some MPRA data:

It looks like it is trending towards convergence. However, the output of preseq pop_size reveals that we are only ~15% of the way to saturation. Graphing preseq lc_extrap all the way to convergence reveals:

The text was updated successfully, but these errors were encountered:

rtewhey · 2023-08-18T15:32:31Z

Thanks Mackenzie for flagging this. I think this step more generally could be improved. In my experience the plot we provide is more reflective of where you are on the saturation curve with respect to if additional sequencing will provide a positive ROI. I would expect, if you run the MPRA experiment the vast majority of barcodes int the RNA-seq will be assigned to an oligo. My best explanation for this is because the sequencing generates errors there are a lot of singleton barcodes that can create a misleading estimate by preseq.
Maybe a better approach would be to pass the barcodes through an error correction step first before generating the preseq plot. This might make the upper convergence line more reflective of the true plasmid population.

mackenzie-noon · 2023-08-19T22:08:52Z

Ryan Stanton performed a spot-plating assay on the library graphed above, and computed that the total number of barcode-oligo combinations was 633333333.333. preseq pop_size estimated that the total number of unique molecules is 621148315.9. Assuming I'm reading their notes right, these numbers agree quite closely (~2% difference). Since spot-plating is not susceptible to sequencing errors, I think it's more likely that preseq is actually reporting the correct figure, and the graph has a problem.
Examining MPRA_match.wdl I see that preseq lc_extrap is called with argument -e 1000000000, which means it extrapolates to a maximum of 10^9 reads.

"Zooming in" and graphing only the first 10^9 reads of the preseq graph I made above:

We see the same shape shown in the output of mpra_match

It seems (?) that the first 10^9 reads give the appearance of convergence, even in cases where the library definitely isn't converging. I would guess that this is because the chances of seeing new barcode-oligo combinations is very high when the number of reads is low, then drops off quick. Consider that a complexity of 621148316 means that the first ~60% of the graph, the number of reads is actually less than the number of barcode-oligos! So the chance of seeing a new barcode-oligo is high and dropping fast.

rtewhey · 2023-08-20T01:52:10Z

Wow, yea I think the complexity of your library is well beyond anything we anticipated and 100% agree your preseq plot that goes out to 10e12 reads is the correct way to look at your results.

We will discuss how to fix this plot so it will work for a broader range of libraries.
Thanks for flagging this and following up.

stephenrong · 2023-09-27T22:18:54Z

Seems like preseq is not quite estimating what we are actually interested in, if, say, 20% of the most common unique barcodes represents 80% of the "mass" of the library. So preseq might estimate that only 20% of the unique barcodes are recovered in MPRAmatch, but ~80% of the plasmid/RNA sequencing ends up being usable in MPRAcount. Such as in the scenario rtewhey mentioned about a lot of singleton barcodes from errors. Not sure what those "20%" and "80%" numbers would be.

rtewhey closed this as completed Aug 18, 2023

rtewhey reopened this Aug 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

preseq pop_size may be a useful addition to the output report of MPRA-match #3

preseq pop_size may be a useful addition to the output report of MPRA-match #3

mackenzie-noon commented Aug 18, 2023

rtewhey commented Aug 18, 2023 •

edited

Loading

mackenzie-noon commented Aug 19, 2023

rtewhey commented Aug 20, 2023 •

edited

Loading

stephenrong commented Sep 27, 2023

preseq pop_size may be a useful addition to the output report of MPRA-match #3

preseq pop_size may be a useful addition to the output report of MPRA-match #3

Comments

mackenzie-noon commented Aug 18, 2023

rtewhey commented Aug 18, 2023 • edited Loading

mackenzie-noon commented Aug 19, 2023

rtewhey commented Aug 20, 2023 • edited Loading

stephenrong commented Sep 27, 2023

rtewhey commented Aug 18, 2023 •

edited

Loading

rtewhey commented Aug 20, 2023 •

edited

Loading