High level of duplicated protein sequences #41

hegardon · 2023-02-03T15:16:46Z

Hi,
I am using PLASS (v4.687d7) on a set of metagenomes from ~100 cheese samples and it works very well, but I still have some questions.
In each dataset, a large share of the protein sequences (30% on average) are duplicated (100% identity and coverage). I understand that some sequences could be duplicated (originating from closely related species), but 30% seems quite high.
Another issue is the total number of assembled amino acids. As an example, for an initial dataset of 18 million reads (2x150 bp paired-end reads, 2.7 Gbp in total), 7 million proteins are assembled (2e+9 aa in total, almost as many as the total number of nucleotides, which seems to me to be more amino acids than expected).
Is there an explanation for these results?
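
The figures quoted above can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the numbers from this post; the factor of 3 nucleotides per amino acid is the standard codon length, and the variable names are illustrative. It does not account for six-frame translation or for reads being reused during assembly.

```python
# Numbers taken from the post above.
reads = 18_000_000          # total reads in the dataset
read_len = 150              # bp per read
assembled_aa = 2e9          # total assembled amino acids
proteins = 7_000_000        # number of assembled proteins

total_nt = reads * read_len             # input nucleotides
mean_protein_len = assembled_aa / proteins
coding_nt_equiv = assembled_aa * 3      # nucleotides needed to encode the output
ratio = coding_nt_equiv / total_nt      # output coding length vs. input length

print(f"input nucleotides:      {total_nt:.2e}")
print(f"mean protein length:    {mean_protein_len:.0f} aa")
print(f"coding nt / input nt:   {ratio:.2f}")
```

A ratio above 1 means the assembled proteins would need more coding nucleotides than the input contains, which is only possible if the same read positions contribute to several output sequences.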

I am using PLASS with the following command (other parameters left at their defaults):
plass assemble METAG_R1.fastq.gz METAG_R2.fastq.gz METAG_out.fasta -e 0.001 --num-iterations 12 --filter-proteins 1 --remove-tmp-files 1

Thanks
Helene

milot-mirdita · 2023-03-13T05:33:10Z

Since Plass can reuse each read in every iteration, it tends to create a lot of variants that are not necessarily useful. We generally use mmseqs linclust to remove fragments afterwards.
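
To get a feel for how much of an assembly is exact duplication before clustering, something like the following could be used. This is only a minimal sketch, not a replacement for mmseqs linclust: it collapses sequences that are byte-for-byte identical and does not detect fragments or near-identical variants. The function names and the tiny in-memory FASTA are made up for illustration.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def duplicate_fraction(records):
    """Fraction of records that are redundant copies of an earlier sequence."""
    counts = Counter(seq for _, seq in records)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


# Hypothetical toy input: three copies of one protein and one distinct protein.
example = [
    ">p1", "MKV",
    ">p2", "MKV",
    ">p3", "MKT",
    ">p4", "MKV",
]

fraction = duplicate_fraction(read_fasta(example))
print(f"redundant fraction: {fraction:.2f}")
```

For a real assembly, the lines would come from the output file (for example by passing an open file handle to `read_fasta`), and the result only tells you the exact-duplicate share; clustering is still needed to remove fragments.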

AnnSeidel closed this as completed Apr 16, 2024
