Ground truth metagenomic profiles in CAMI format for Sun et al.

We simulated metagenomic sequencing reads for 25 communities from distinct habitats (for example, gastrointestinal, oral, dermal, vaginal and building, five communities for each habitat; Methods). To avoid reference database bias of different metagenomic profilers, the genomes used to generate simulated communities were selected from the intersection among the reference databases of MetaPhlAn2, mOTUs2 and Kraken2.

Sun, Z., Huang, S., Zhang, M. et al. Challenges in benchmarking metagenomic profilers. Nat Methods 18, 618–626 (2021). https://doi.org/10.1038/s41592-021-01141-3

Sun et al. simulated sequencing reads for 25 metagenomic communities to benchmark metagenomic profilers, but the format of their ground truth profiles is not convenient for evaluation tools like OPAL. For example:

$ head -n 4 sun/5_building_sequence_abd.txt 
SpeciesID       sample_1        sample_2        sample_3        sample_4        sample_5
Corynebacterium_jeikeium        0.00195744970771946     0       0       0.0081377431495817      0
Lactococcus_lactis      0.0285317256732946      0       0.00218905863883039     0.00157454493494673     0.00769396428284639
Streptococcus_agalactiae        0.00126070506577494     0       0.00268546083074535     0       0.00348204130934378

Taxonomic names were used instead of TaxIds, and the names were formatted as follows:

Square brackets were deleted:

 Original:  [Clostridium] hiranonis
 Formatted: Clostridium_hiranonis

Characters other than letters and digits were replaced with underscores:

 Original:  Synechococcus sp. JA-2-3B'a(2-13)
 Formatted: Synechococcus_sp_JA_2_3B_a_2_13

Trailing underscores were deleted.
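
The three rules above can be reproduced with plain `sed`. The following is a minimal sketch for checking a single name by hand, not the script Sun et al. used; the function name `normalize_name` is hypothetical.

```shell
# normalize_name: hypothetical helper that mimics the name formatting
# described above (an assumption, not the original authors' code).
normalize_name() {
    # 1. delete square brackets
    # 2. replace each run of characters other than letters and digits with "_"
    # 3. delete trailing underscores
    printf '%s\n' "$1" \
        | sed -e 's/[][]//g' \
              -e 's/[^A-Za-z0-9]\{1,\}/_/g' \
              -e 's/_\{1,\}$//'
}

normalize_name "[Clostridium] hiranonis"            # prints: Clostridium_hiranonis
normalize_name "Synechococcus sp. JA-2-3B'a(2-13)"  # prints: Synechococcus_sp_JA_2_3B_a_2_13
```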

In addition, the version of the NCBI Taxonomy database used was not stated. The only clue is:

Indeed, in the recently updated microbial genome database (NCBI RefSeq, 6 November 2020),

This made it hard to convert taxonomic names to the right TaxIds, because NCBI Taxonomy changes frequently. I had to manually check the history of taxa via taxid-changelog, and finally found the latest available taxdump version: 2020-03-01.

DOWNLOAD

Taxonomic abundance: sun2021_gs_taxonomic_abd.profile
Sequence abundance: sun2021_gs_sequence_abd.profile

HOWTO

Resources

Datasets:

Simulated reads and ground truth profiles for the 25 communities.
Taxdump file of the original profiles: taxdmp_2020-03-01
Taxdump file for the new profiles: taxdmp_2021-10-01 or another version of the NCBI taxdump files.

Tools:

taxonkit (>= v0.9.0)
rush
csvtk

Preparing taxdump for TaxonKit

Here we use taxdmp_2021-10-01.zip as the new taxonomy version.

# download taxdmp_2021-10-01.zip
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2021-10-01.zip

mkdir taxdump-to

unzip taxdmp_2021-10-01.zip names.dmp nodes.dmp merged.dmp delnodes.dmp  -d taxdump-to
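
Before running TaxonKit against the extracted directory, it can help to confirm that all four files were actually unpacked. A small sketch, assuming only a POSIX shell; `check_taxdump` is a hypothetical helper name.

```shell
# check_taxdump: hypothetical helper; reports which of the four taxdump
# files are missing from a directory and returns non-zero if any are.
check_taxdump() {
    dir=$1
    status=0
    for f in names.dmp nodes.dmp merged.dmp delnodes.dmp; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return $status
}

# usage: check_taxdump taxdump-to
```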

Install TaxonKit

Preparing the mapping between species names and TaxIds

# wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2020-03-01.zip

taxdump=taxdump-from

mkdir -p $taxdump
unzip taxdmp_2020-03-01.zip names.dmp nodes.dmp merged.dmp delnodes.dmp -d $taxdump

# taxid -> name
taxonkit list --ids 1 --indent "" --data-dir $taxdump \
    | taxonkit lineage -n -L --data-dir $taxdump \
    > taxid2name.tsv

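# name -> taxid. Names are formatted the way Sun et al. did, in three steps:
#   1. delete `[` and `]`
#   2. replace each run of non-word characters with `_`
#   3. delete a trailing `_`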
csvtk cut -lHt -f 2,1 taxid2name.tsv \
    | csvtk replace -lHt -f 1 -p '[\[\]]' \
    | csvtk replace -lHt -f 1 -p '[\W]+' -r '_' \
    | csvtk replace -lHt -f 1 -p '_$' \
    > name2taxid.tsv
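
Because the formatting discards punctuation, two different taxon names can collapse into the same key, and a lookup for such a key is then ambiguous. The sketch below lists keys that are paired with more than one TaxId. It assumes `name2taxid.tsv` has two tab-separated columns (name, TaxId), as produced above; `ambiguous_names` is a hypothetical helper name.

```shell
# ambiguous_names: hypothetical helper; prints each formatted name in a
# two-column TSV (name, taxid) that is paired with more than one distinct taxid.
ambiguous_names() {
    # drop exact duplicate rows, keep the name column,
    # then report names that still occur more than once
    sort -u "$1" | cut -f 1 | sort | uniq -d
}

# usage: ambiguous_names name2taxid.tsv
```

Any name printed here is worth checking by hand before trusting the TaxId assigned to it.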

Reformatting

Rename building to build for consistency with the read file names, which are: Build_sample1.left.fq.gz, Gut_sample1.left.fq.gz, Oral_sample1.left.fq.gz, Skin_sample1.left.fq.gz, VG_sample1.left.fq.gz.
```
 # mv sun/5_building_sequence_abd.txt sun/5_build_sequence_abd.txt
 # mv sun/5_building_taxonomic_abd.txt sun/5_build_taxonomic_abd.txt
```
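
The rename matters because the reformatting loop derives each output name from the second underscore-separated field of the ground truth file name and capitalizes its first letter. A sketch of that derivation using portable `awk` (the loop itself uses `sed -r 's/^(.)/\U\1/'`, where `\U` is a GNU sed extension); `sample_prefix` is a hypothetical helper name.

```shell
# sample_prefix: hypothetical helper; takes the second "_"-separated field
# of a file's base name and upper-cases its first letter.
sample_prefix() {
    basename "$1" | awk -F _ '{ print toupper(substr($2, 1, 1)) substr($2, 2) }'
}

sample_prefix sun/5_build_taxonomic_abd.txt      # prints: Build
sample_prefix sun/5_building_taxonomic_abd.txt   # prints: Building
```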

Reformatting Sun's format to a two-column format (TaxId and abundance):

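 type=taxonomic_abd
 # type=sequence_abd

 # start from an empty output directory
 rm -rf "$type"
 mkdir "$type"

 for f in sun/*"$type".txt; do
     # columns 2-6 hold sample_1 to sample_5
     for c in $(seq 2 6); do
         # e.g. sun/5_build_taxonomic_abd.txt -> Build
         prefix=$(basename "$f" | awk -F _ '{print $2}' | sed -r 's/^(.)/\U\1/')
         # keep the name and one sample column, drop the header, replace names with TaxIds;
         # -K keeps the original name when it is not found in name2taxid.tsv
         cut -f 1,"$c" "$f" \
             | sed 1d \
             | csvtk replace -Ht -K -k name2taxid.tsv -p '(.+)' -r '{kv}' \
             > "$type/${prefix}_sample$((c - 1))"
     done
 done


 # checking unresolved names (rows whose first column is not purely numeric):
 cat "$type"/* | csvtk grep -Ht -f 1 -r -p '[^\d]'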
 
 # manually checking the change history via https://github.com/shenwei356/taxid-changelog
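
The same check can be done without csvtk. This sketch assumes the two-column, tab-separated files written by the loop above, where any first-column value that is not purely numeric is a name that was not mapped to a TaxId; `unresolved_names` is a hypothetical helper name.

```shell
# unresolved_names: hypothetical helper; prints the distinct first-column
# values that are not purely numeric, i.e. names left unmapped.
unresolved_names() {
    awk -F '\t' '$1 !~ /^[0-9]+$/ { print $1 }' "$@" | sort -u
}

# usage: unresolved_names taxonomic_abd/*
```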

Formatting to CAMI2 format:

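 # taxonomy label recorded in each profile;
 # note that taxdump-to above was built from taxdmp_2021-10-01
 taxver=ncbi-taxonomy-2021-12-06
 taxdump=taxdump-to

 # remove profiles left over from a previous run
 /bin/rm -f $type/*.profile
 # {%} is the base name of each input file and is used as the sample ID
 ls $type/* \
     | rush -v taxver=$taxver -v taxdump=$taxdump \
         'taxonkit profile2cami --data-dir {taxdump} -s {%} -t {taxver} {} -o {}.profile'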

Concatenating:

 cat $type/*.profile > sun2021_gs_$type.profile
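
A quick sanity check on the concatenated file is to count its sample blocks. The sketch assumes that each block begins with an `@SampleID:` header line, as in the CAMI profiling format; `count_samples` is a hypothetical helper name.

```shell
# count_samples: hypothetical helper; counts sample blocks in a CAMI
# profile by counting "@SampleID:" header lines.
count_samples() {
    grep -c '^@SampleID:' "$1"
}

# usage: count_samples sun2021_gs_taxonomic_abd.profile
```

With five habitats and five samples each, the expected count is 25.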

shenwei356/sun2021-cami-profiles
