re-format taxonomy training data for use in R (dada2)? #2

maxfarrell · 2018-06-01T22:54:04Z

Hi, I'm trying to re-format the training data to have 7 consistent taxonomy levels (Kingdom, Phylum, Class, Order, Family, Genus, Species), but I'm unsure how to parse the "mytaxon.txt" file or the FASTA headers into a format that I can easily manipulate in R.

Essentially, I want to use your database with the RDP classifier included with the DADA2 pipeline, but I need to reformat the sequence names to have this format.

Any thoughts or suggestions would be greatly appreciated, and thanks for all your work putting this reference DB together.

maxfarrell · 2018-06-04T13:40:17Z

So I figured out a way to do this within R by directly manipulating the headers, and using a few system calls in bash. I thought I'd share in case others are interested.

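# Starting with a separate file of headers speeds things up
# (compared to reading from a system call directly)
system("grep '^>' mytrainseq.fasta > headers.txt")
# quote="" and comment.char="" stop apostrophes or '#' in taxon names from breaking the parse
tax <- read.table("headers.txt", sep=";", quote="", comment.char="", stringsAsFactors=FALSE)
# The first field holds the sequence ID and the top rank, separated by a space
seqid <- sub(" .*$","", tax[,1])
tax[,1] <- sub("^.* ","", tax[,1])
tax <- cbind(seqid,tax)
names(tax) <- c("seqid","cellular","domain","kingdom","phylum","class","order","family","genus","species")

# Build the new headers from phylum down to species (prepend kingdom in paste0() if you need 7 levels)
seqnames <- with(tax, paste0(">",phylum,";",class,";",order,";",family,";",genus,";",species))
write.table(seqnames,file="dada2_headers.txt",sep="\n",row.names=FALSE, quote = FALSE, col.names=FALSE)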

# generating dada2trainseq.fasta via bash
# modified from https://www.biostars.org/p/103089/ to also convert to uppercase nucleotides
# NR%2==0 keeps every second line, so this assumes each sequence sits on a single line
system("awk 'NR%2==0' mytrainseq.fasta | tr 'a-z' 'A-Z' | paste -d'\\n' dada2_headers.txt - > dada2trainseq.fasta")
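
For anyone who would rather skip the R round trip, the same header rewrite can be sketched in a single awk pass. This is only a minimal sketch, not the method used above: the two records in toy.fasta are made up, the file names are placeholders, and the header layout (ID, a space, then nine ';'-separated ranks) is an assumption inferred from the column names in the R code.

```shell
# Toy input: two made-up records. The header layout (ID, space, nine
# ';'-separated ranks) is assumed, not taken from the real database.
cat > toy.fasta <<'EOF'
>SEQ1 cellularOrganisms;Eukaryota;Metazoa;Chordata;Mammalia;Primates;Hominidae;Homo;Homo_sapiens
acgtacgt
>SEQ2 cellularOrganisms;Eukaryota;Metazoa;Arthropoda;Insecta;Diptera;Culicidae;Aedes;Aedes_aegypti
ttgacc
EOF

# Header lines: drop the ID and the first three ranks, keep fields 4-9
# (phylum down to species). All other lines: convert to uppercase.
awk -F';' '
  /^>/ { printf ">%s;%s;%s;%s;%s;%s\n", $4, $5, $6, $7, $8, $9; next }
       { print toupper($0) }
' toy.fasta > toy_dada2.fasta

cat toy_dada2.fasta
```

Because every non-header line is uppercased as it is read, this version does not depend on each sequence sitting on a single line.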

This database is likely too large for dada2 to allocate memory for it properly, but the dada2 developers are working on this. In the meantime, you can subset to a particular taxonomic group with grep or something similar in bash.

# grep -A prints a "--" line between non-adjacent matches, so filter those out
grep -A 1 "Chordata" dada2trainseq.fasta | grep -v "^--$" > dada2trainseq_chordata.fasta
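
An alternative to the grep -A subset is to let awk decide record by record. The sketch below runs on a made-up three-record file (toy_ref.fasta and toy_chordata.fasta are placeholder names). It checks the taxon name only against header lines and prints each matching header together with whatever sequence lines follow it, so no separator lines end up in the output and a sequence wrapped over several lines would be kept whole.

```shell
# Toy reference file with made-up records in the reformatted header style.
cat > toy_ref.fasta <<'EOF'
>Chordata;Mammalia;Primates;Hominidae;Homo;Homo_sapiens
ACGTACGT
>Arthropoda;Insecta;Diptera;Culicidae;Aedes;Aedes_aegypti
TTGACC
>Chordata;Aves;Passeriformes;Corvidae;Corvus;Corvus_corax
GGCATT
EOF

# On each header line, set a flag saying whether it contains the taxon name;
# print every line (header or sequence) for as long as the flag is set.
awk -v taxon="Chordata" '
  /^>/ { keep = (index($0, taxon) > 0) }
  keep
' toy_ref.fasta > toy_chordata.fasta

# Count the records that were kept.
grep -c '^>' toy_chordata.fasta   # prints 2
```

Like grep, index() does a plain substring match, so a taxon name that is embedded in a longer name would also be picked up.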

cjfields · 2020-02-13T04:18:21Z

This is brilliant, thanks @maxfarrell !

maxfarrell closed this as completed Jun 4, 2018

cjfields mentioned this issue Feb 25, 2020

Add officially supported COI database? benjjneb/dada2#922

Closed
