
Normalize by contig size before RPKM? #1

Closed
Aciole-David opened this issue Apr 26, 2021 · 2 comments

Comments

@Aciole-David

Hi, Simon!
Sorry to bother you here, but I think you are my best source of help on this subject:

In your paper "Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity", raw counts are first normalized by contig size, and afterwards an RPKM normalization (edgeR) is applied as a correction for different library sizes.

I am confused by another paper (Rasmussen et al., 2019), which also follows yours, although they state that RPKM normalization is done to account for contig size, not library size:

"Prior any analysis the raw read counts in the vOTU-tables were normalized by reads per kilobase per million mapped reads (RPKM) [48], since the size of the viral contigs is highly variable [49]"

Is it correct to divide counts by contig size and then transform them into RPKM, or should I only do as Rasmussen et al. did?

Again, sorry if this is not the right channel for this question.
Thank you very much.

@simroux (Owner) commented Apr 26, 2021

Hi @Aciole-David

Sorry for the confusion; in short: Rasmussen et al. 2019 is correct. Basically, RPKM (at least the way we use it) provides two corrections:

  • By library size (this is the "M", for "million mapped reads")
  • By contig size (this is the "K", for "kilobase of contig"). Note that the original RPKM in e.g. RNA-Seq does this per gene, but it is the same idea.

In our benchmark, we test both a simple normalization by contig size (i.e. a kind of "RPK"-only correction) and a normalization by both contig size and library size (proper "RPKM"). For simplicity, it is probably best to do as Rasmussen et al. did and directly transform your read mapping data into RPKM, providing edgeR with (i) the library size and (ii) the contig length.
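To make the arithmetic concrete, here is a minimal sketch of the RPKM calculation described above, written in plain Python rather than with edgeR; the vOTU names, contig lengths, and read counts are made-up example values, and a real pipeline would normally use the total mapped reads per sample as the library size rather than the column sum of a small table.

```python
# Minimal sketch (not the edgeR implementation): RPKM corrects raw counts
# for contig size ("K") and library size ("M") in a single step.

def rpkm(count, contig_length_bp, total_mapped_reads):
    """Reads Per Kilobase of contig per Million mapped reads."""
    kilobases = contig_length_bp / 1_000        # "K": contig size in kb
    millions = total_mapped_reads / 1_000_000   # "M": library size in millions
    return count / (kilobases * millions)

# Hypothetical vOTU table for one sample.
contig_lengths = {"vOTU_1": 5_000, "vOTU_2": 42_000}  # contig sizes in bp
raw_counts = {"vOTU_1": 1_200, "vOTU_2": 300}         # mapped reads per contig

# Library size: here the column sum of the toy table; in practice you would
# usually pass the total number of reads mapped in the sample.
library_size = sum(raw_counts.values())

rpkm_table = {votu: rpkm(n, contig_lengths[votu], library_size)
              for votu, n in raw_counts.items()}
print(rpkm_table)
```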

Let me know if it makes sense!
Best,
Simon

@Aciole-David (Author)

Simon, it makes perfect sense to me.
You just saved me some hours of discussion here!
Thanks a lot for the quick reply.
Cheers
