# RExMap tutorial: mock community data

In [1]:
library(rexmap)

RExMap v1.0 loaded.


Specify the full path to the main folder and load the raw FASTQ files.

In [2]:
out_path = path.expand('~/data/zheng_2015_tutorial')

In [3]:
fastq_path = file.path(out_path, 'fastq')
fq_fwd = read_files(fastq_path, 'R1')
fq_rev = read_files(fastq_path, 'R2')

Generate output file paths for the pre-processing part of the pipeline:

In [4]:
sample_ids = sampleids_from_filenames(fq_fwd, separator='_')
fq_mer = file.path(out_path, 'rexmap_merged', paste0(sample_ids, '.fastq'))
fq_pcr = file.path(out_path, 'rexmap_trimmed', paste0(sample_ids, '.fastq'))
fq_fil = file.path(out_path, 'rexmap_filtered', paste0(sample_ids, '.fastq'))
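
As a rough illustration of how these paths are built, the base-R sketch below derives a sample ID from a FASTQ file name by keeping everything before the first separator. That `sampleids_from_filenames` behaves this way is an assumption, based only on the file name `V3V4Rep1_R1.fastq` and the sample ID `V3V4Rep1` that appear in the outputs of this tutorial; `sample_id_sketch` is a hypothetical helper, not a RExMap function.

```r
# Hypothetical helper (not part of RExMap): keep the part of the file
# name that precedes the first occurrence of the separator.
sample_id_sketch = function(paths, separator = '_') {
  vapply(strsplit(basename(paths), separator, fixed = TRUE),
         function(parts) parts[1], character(1))
}

fq_example = '~/data/zheng_2015_tutorial/fastq/V3V4Rep1_R1.fastq'
id_example = sample_id_sketch(fq_example)

# Build an output path the same way the cell above does.
merged_example = file.path('rexmap_merged', paste0(id_example, '.fastq'))
print(id_example)
print(merged_example)
```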

## Pre-processing
Merge reads, remove PCR primers and perform quality control and fixed-length trimming.

In [5]:
mergestats = merge_pairs(fq_fwd, fq_rev, fq_mer, verbose=TRUE)

Loading FASTQ reads: V3V4Rep1_R1.fastq, V3V4Rep1_R2.fastq ... OK.
Merging pairs... OK.
Writing output files... OK.


In [6]:
mergestats

Unnamed: 0,/Users/igor/data/zheng_2015_tutorial/fastq/V3V4Rep1_R1.fastq
total,25600
low_pct_sim,1355
low_aln_len,2
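
To put these counts in proportion, the short sketch below computes how many read pairs fall into the two rejection categories shown. It assumes that `low_pct_sim` and `low_aln_len` count pairs discarded for low overlap similarity and for a too-short overlap, respectively; the tutorial does not define them, and the table may contain further rows not displayed here.

```r
# Counts copied from the mergestats output above.
total       = 25600
low_pct_sim = 1355   # assumed: pairs rejected for low overlap similarity
low_aln_len = 2      # assumed: pairs rejected for a too-short overlap

rejected = low_pct_sim + low_aln_len
pct_rejected = 100 * rejected / total
print(rejected)
print(round(pct_rejected, 2))
```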


The primer set used to amplify the V3-V4 region was 341F and 805R, which is labelled 'V3-V4-2' in our reference table:

In [7]:
rexmap_option('blast_dbs')[6]

Primer1,Primer2,Primer1_sequence_5to3,Primer2_sequence_3to5,Hypervariable_region,DB,table
341F,805R,GACAGCCTACGGGNGGCWGCAG,GACTACHVGGGTATCTAATCC,V3-V4-2,V3-V4_337F-805R_hang21_wrefseq_sequences_unique_variants,V3-V4_337F-805R_hang21_wrefseq_table_unique_variants_R.txt


In [8]:
trimstats = remove_pcr_primers(fq_mer, fq_pcr, region='V3-V4-2', verbose=TRUE)

* PCR trimmer mode: region V3-V4-2 
(fwd: GACAGCCTACGGGNGGCWGCAG, rev: GGATTAGATACCCBDGTAGTC)
* loading file...OK.
* trimming...OK.
* saving output...OK.
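
The `rev` sequence in the log is the reverse complement of the `Primer2_sequence_3to5` entry in the reference table. The base-R sketch below reproduces that conversion, including the IUPAC ambiguity codes; `revcomp_iupac` is a hypothetical helper written for this illustration, not a RExMap function, and it handles a single string at a time.

```r
# Reverse-complement a DNA string that may contain IUPAC ambiguity codes.
# S and W are their own complements, so chartr leaves them unchanged.
revcomp_iupac = function(x) {
  comp = chartr('ACGTRYKMBDHVN', 'TGCAYRMKVHDBN', toupper(x))
  paste(rev(strsplit(comp, NULL)[[1]]), collapse = '')
}

primer2 = 'GACTACHVGGGTATCTAATCC'  # 805R as listed in the reference table
primer2_rc = revcomp_iupac(primer2)
print(primer2_rc)
```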


In [9]:
trimstats

Unnamed: 0,/Users/igor/data/zheng_2015_tutorial/rexmap_merged/V3V4Rep1.fastq
fwd_trim,24219
rev_trim,24127


Now choose the length to which all reads are trimmed before DADA2 denoising. A good rule of thumb is the length that 99% of the reads reach or exceed:

In [11]:
seqlen.ft = sequence_length_table(fq_pcr)
trim_length = ftquantile(seqlen.ft, 0.01)
trim_length
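
To make the rule of thumb concrete, here is a minimal base-R sketch of a quantile taken over a length frequency table. The toy counts are invented for illustration, and `length_quantile` is a hypothetical stand-in: the exact definition used by `ftquantile` is not shown in this tutorial and may differ in how it handles ties.

```r
# Toy frequency table of read lengths (hypothetical numbers).
len_freq = data.frame(len   = c(380, 400, 402, 405, 420),
                      count = c(5, 15, 400, 500, 80))

# Smallest length at which the cumulative fraction of reads reaches prob.
length_quantile = function(ft, prob) {
  ft = ft[order(ft$len), ]
  cum_frac = cumsum(ft$count) / sum(ft$count)
  ft$len[which(cum_frac >= prob)[1]]
}

trim_len = length_quantile(len_freq, 0.01)

# Fraction of reads that are at least trim_len long and so survive truncation.
frac_kept = sum(len_freq$count[len_freq$len >= trim_len]) / sum(len_freq$count)
print(trim_len)
print(frac_kept)
```

In this toy table only the five reads of length 380 are shorter than the chosen length, so at least 99% of the reads can be truncated to it without being dropped.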

Trim reads to this length and then filter out the remaining reads with too many (> 2) expected errors:

In [12]:
filtstats = filter_and_trim(fq_pcr, fq_fil, truncLen=trim_length)

In [13]:
filtstats

Unnamed: 0,reads.in,reads.out
V3V4Rep1.fastq,24244,22070
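
The expected number of errors in a read is the sum of its per-base error probabilities, each derived from the Phred quality score as 10^(-Q/10). The sketch below computes it for two made-up quality strings; it assumes Phred+33 encoding, and `expected_errors` is a hypothetical helper rather than the code `filter_and_trim` runs internally.

```r
# Expected number of errors in a read: the sum of per-base error
# probabilities p = 10^(-Q/10), with Q decoded from Phred+33 characters.
expected_errors = function(qual) {
  q = utf8ToInt(qual) - 33L
  sum(10^(-q / 10))
}

# Made-up quality strings: 'I' encodes Q40, '+' encodes Q10.
qual_good = strrep('I', 100)
qual_poor = paste0(strrep('I', 70), strrep('+', 30))

ee_good = expected_errors(qual_good)
ee_poor = expected_errors(qual_poor)

# A read would be kept only if its expected errors do not exceed 2.
print(c(good = ee_good <= 2, poor = ee_poor <= 2))
```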


## DADA2 denoising
Denoise the filtered and trimmed reads, then pass in the primer-trimmed reads that have not been truncated (`fq_pcr`) to retrieve the full-length sequences before aligning them against a reference database:

In [14]:
dada_result = dada_denoise(fq_fil, fq_pcr, verbose=TRUE)

* learn errors...8872140 total bases in 22070 reads from 1 samples will be used for learning the error rates.
Initializing error rates to maximum possible estimate.
selfConsist step 1 .
   selfConsist step 2
   selfConsist step 3
   selfConsist step 4
   selfConsist step 5
Convergence after  5  rounds.
 OK.
* processing  V3V4Rep1.fastq
Sample 1 - 22070 reads in 14060 unique sequences.
Trimmed length:  402  nt.
Retrieving full-length sequences...
Sample  1 . Load...OK. Consensus...OK. Update...OK.
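
A quick consistency check on this log: since every filtered read was truncated to the same length, the total number of bases divided by the number of reads should equal the reported trimmed length. The numbers below are copied from the log above.

```r
total_bases = 8872140   # from the "learn errors" log line
n_reads     = 22070     # reads that passed filtering
n_unique    = 14060     # unique sequences after dereplication

bases_per_read = total_bases / n_reads
unique_frac    = n_unique / n_reads
print(bases_per_read)
print(round(unique_frac, 3))
```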


Extract sequence abundance from DADA2 output:

In [15]:
ab.dt = sequence_abundance(dada_result)

* generating sequence table...

The sequences being tabled vary in length.


 OK.
* removing bimeras...

Identified 28 bimeras out of 45 input sequences.


 OK.
* adding together sequences that differ in shifts on lengths...collapse:
* generating temporary files...OK.
* blast word size: 322 
* running blast...blast status:  0
OK.
* selecting ends-free alignments...OK.
* no sequences need collapsing.
* cleaning up temporary files...OK.
* returning input.
 OK.
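
The collapsing step adds together sequences that differ only by a shift or by their length. According to the log, RExMap does this with BLAST and ends-free alignments; the toy sketch below swaps in exact substring matching to show the idea only. The sequences, counts and the `collapse_contained` helper are all hypothetical.

```r
# Toy data (hypothetical sequences and counts, not from the tutorial).
seqs   = c(a = 'ACGTACGTAC', b = 'CGTACGTA', c = 'TTTTGGGG')
counts = c(a = 100, b = 7, c = 30)

# Add the count of any sequence that is fully contained in a longer one
# to that longer sequence, then drop the contained sequence.
collapse_contained = function(seqs, counts) {
  ord = order(nchar(seqs), decreasing = TRUE)
  seqs = seqs[ord]
  counts = counts[ord]
  keep = rep(TRUE, length(seqs))
  for (i in seq_along(seqs)) {
    if (!keep[i]) next
    for (j in seq_along(seqs)) {
      if (j > i && keep[j] && grepl(seqs[j], seqs[i], fixed = TRUE)) {
        counts[i] = counts[i] + counts[j]
        keep[j] = FALSE
      }
    }
  }
  counts[keep]
}

collapsed = collapse_contained(seqs, counts)
print(collapsed)
```

Here sequence `b` is contained in `a`, so its count is added to `a`, while `c` is left untouched. In the tutorial run no sequences needed collapsing, so the input was returned unchanged.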


## BLAST
Align the untrimmed sequences to the RExMap reference database for the V3-V4 region:

In [16]:
blast_output = blast(ab.dt, region='V3-V4-2', verbose=TRUE)

* blast input type: abundance table
* blast out: OK. blast best: OK. copy number table: OK.
merge: OK. Fix overhang differences:..OK. OK.


## OSU abundance table
From the sequence abundance table and the BLAST output, estimate the abundance of each Operational Strain Unit (OSU):

In [17]:
osu_ab.dt = abundance(ab.dt, blast_output)

In [18]:
osu_ab.dt

sample_id,osu_id,osu_count,pctsim,species
V3V4Rep1,1074,2813,100.0,Streptococcus_mutans_[106]
V3V4Rep1,206,2436,100.0,"Escherichia_coli_[7164],Shigella_sonnei_[854],Shigella_flexneri_[67],Shigella_boydii_[32],Escherichia_fergusonii_[7],Escherichia_albertii_[2],Shigella_dysenteriae_[2],Achromobacter_sp._ATCC35328,Brenneria_alni_pvfi20,Citrobacter_braakii_SCC4,Escherichia_sp._1_1_43,Escherichia_sp._3_2_53FAA,Escherichia_sp._4_1_40B,Escherichia_sp._B1147,Escherichia_sp._KTE11,Escherichia_sp._KTE159,Escherichia_sp._KTE52,Escherichia_sp._KTE96,Escherichia_sp._TW09308,Escherichia_sp._TW15838,Escherichia_vulneris_ATCC_33821,Klebsiella_oxytoca_2880STDY5682666"
V3V4Rep1,1093,946,100.0,"Staphylococcus_epidermidis_RP62A,Staphylococcus_epidermidis_ET-024,Staphylococcus_epidermidis_FDAARGOS_157"
V3V4Rep1,2440,920,100.0,"Rhodobacter_sphaeroides_[11],Rhodobacter_johrii_JA192,Rhodobacter_megalophilus_DSM_18937,Rhodobacter_sp._AKP1"
V3V4Rep1,68,596,100.0,"Bacillus_cereus_[887],Bacillus_thuringiensis_[413],Bacillus_toyonensis_[198],Bacillus_wiedmannii_[131],Bacillus_anthracis_[118],Bacillus_pseudomycoides_[103],Bacillus_mycoides_[55],Bacillus_gaemokensis_[2],Bacillus_weihenstephanensis_[2],Bacillus_bingmayongensis_FJAT-13831,Bacillus_bombysepticus_Wang,Bacillus_licheniformis_V30,Bacillus_sp._0711P9-1,Bacillus_sp._100374,Bacillus_sp._4048,Bacillus_sp._4049,Bacillus_sp._7_6_55CFAA_CT2,Bacillus_sp._AFS012607,Bacillus_sp._AFS014408,Bacillus_sp._AFS015896,Bacillus_sp._AFS019443,Bacillus_sp._AFS023182,Bacillus_sp._AFS029637,Bacillus_sp._AFS033286,Bacillus_sp._AFS051223,Bacillus_sp._AFS054943,Bacillus_sp._AFS059628,Bacillus_sp._AFS075034,Bacillus_sp._AFS075960,Bacillus_sp._AFS094611,Bacillus_sp._AFS098217,Bacillus_sp._AKBS9,Bacillus_sp._BI3,Bacillus_sp._EB422,Bacillus_sp._FDAARGOS_235,Bacillus_sp._G3(2015),Bacillus_sp._GeD10,Bacillus_sp._H1a,Bacillus_sp._H1m,Bacillus_sp._HMSC036E02,Bacillus_sp._JH7,Bacillus_sp._K2I17,Bacillus_sp._KbaB1,Bacillus_sp._KbaL1,Bacillus_sp._L_1B0_5,Bacillus_sp._L_1B0_8,Bacillus_sp._L27,Bacillus_sp._LK2,Bacillus_sp._M13(2017),Bacillus_sp._MB353a,Bacillus_sp._MB366,Bacillus_sp._MN5_Mn5,Bacillus_sp._N24,Bacillus_sp._N35-10-2,Bacillus_sp._N35-10-4,Bacillus_sp._NH11B,Bacillus_sp._NH24A2,Bacillus_sp._Root11,Bacillus_sp._Root131,Bacillus_sp._RUTrin4,Bacillus_sp._RZ2MS9,Bacillus_sp._S1-R1J2-FB,Bacillus_sp._S1-R2T1-FB,Bacillus_sp._S1-R4H1-FB,Bacillus_sp._S1-R5C1-FB,Bacillus_sp._TD41,Bacillus_sp._TD42,Bacillus_sp._UAEU-H3K6M1,Bacillus_sp._UMTAT18,Bacillus_sp._YF23"
V3V4Rep1,1056,324,100.0,"Streptococcus_agalactiae_[466],Streptococcus_sp._HMSC036H09,Streptococcus_sp._HMSC056B03,Streptococcus_sp._HMSC056D01,Streptococcus_sp._HMSC063D10,Streptococcus_sp._HMSC064H02,Streptococcus_sp._HMSC065G04,Streptococcus_sp._HMSC068H01,Streptococcus_sp._HMSC069D09,Streptococcus_sp._HMSC070A10,Streptococcus_sp._HMSC070B09,Streptococcus_sp._HMSC072E12,Streptococcus_sp._HMSC072G02,Streptococcus_sp._HMSC074F09,Streptococcus_sp._HMSC074F10,Streptococcus_sp._HMSC075A03,Streptococcus_sp._HMSC076D11,Streptococcus_sp._HMSC076H07,Streptococcus_sp._HMSC076H08,Streptococcus_sp._HMSC076H09,Streptococcus_sp._HMSC076H10,Streptococcus_sp._HMSC078D02,Streptococcus_sp._HMSC078E03,Streptococcus_sp._HMSC078E08,Streptococcus_sp._HMSC11C05"
V3V4Rep1,2715,287,100.0,"Clostridium_beijerinckii_[13],Clostridium_saccharoperbutylacetonicum_[4],Clostridium_diolis_[2],Clostridium_sp._BL-8,Clostridium_sp._DL-VIII,Clostridium_sp._LS"
V3V4Rep1,754,264,100.0,"Pseudomonas_aeruginosa_[1853],Candidatus_Hepatobacter_penaei_NHPB,Pseudomonas_denitrificans_(nomen_rejiciendum),Pseudomonas_otitidis_LNU-E-001,Pseudomonas_sp._2_1_26,Pseudomonas_sp._HMSC057H01,Pseudomonas_sp._HMSC058A10,Pseudomonas_sp._HMSC058B07,Pseudomonas_sp._HMSC058C05,Pseudomonas_sp._HMSC059F05,Pseudomonas_sp._HMSC05H02,Pseudomonas_sp._HMSC060F12,Pseudomonas_sp._HMSC060G01,Pseudomonas_sp._HMSC060G02,Pseudomonas_sp._HMSC061A10,Pseudomonas_sp._HMSC063H08,Pseudomonas_sp._HMSC064G05,Pseudomonas_sp._HMSC065H01,Pseudomonas_sp._HMSC065H02,Pseudomonas_sp._HMSC066A08,Pseudomonas_sp._HMSC066B03,Pseudomonas_sp._HMSC067D05,Pseudomonas_sp._HMSC067F09,Pseudomonas_sp._HMSC067G02,Pseudomonas_sp._HMSC069G05,Pseudomonas_sp._HMSC070B12,Pseudomonas_sp._HMSC071F02,Pseudomonas_sp._HMSC072F09,Pseudomonas_sp._HMSC073F05,Pseudomonas_sp._HMSC075A08,Pseudomonas_sp._HMSC076A11,Pseudomonas_sp._HMSC076A12,Pseudomonas_sp._HMSC16B01,Pseudomonas_sp._P179,Pseudomonas_sp._YS-1p"
V3V4Rep1,1309,242,100.0,"Staphylococcus_aureus_[7702],Staphylococcus_argenteus_[81],Staphylococcus_schweitzeri_[2],Staphylococcus_simiae_[2],Pararheinheimera_mesophila_IITR-13,Staphylococcus_sp._HMSC035F01,Staphylococcus_sp._HMSC035F11,Staphylococcus_sp._HMSC055H04,Staphylococcus_sp._HMSC055H07,Staphylococcus_sp._HMSC057B01,Staphylococcus_sp._HMSC058E01,Staphylococcus_sp._HMSC060D01,Staphylococcus_sp._HMSC060D12,Staphylococcus_sp._HMSC063G01,Staphylococcus_sp._HMSC063H12,Staphylococcus_sp._HMSC067F10,Staphylococcus_sp._HMSC075C08,Staphylococcus_sp._HMSC34D01,Staphylococcus_sp._HMSC34H10,Staphylococcus_sp._HMSC35D08,Staphylococcus_sp._HMSC36A10,Staphylococcus_sp._HMSC36C03,Staphylococcus_sp._HMSC36D07,Staphylococcus_sp._HMSC36D12,Staphylococcus_sp._HMSC36F05,Staphylococcus_sp._HMSC36G04,Staphylococcus_sp._HMSC55F09,Staphylococcus_sp._HMSC56B09,Staphylococcus_sp._HMSC57A07,Staphylococcus_sp._HMSC57B03,Staphylococcus_sp._HMSC58A02,Staphylococcus_sp._HMSC58B01,Staphylococcus_sp._HMSC58E11,Staphylococcus_sp._HMSC64F10,Staphylococcus_sp._HMSC66A04,Staphylococcus_sp._HMSC66C11,Staphylococcus_sp._HMSC70F07,Staphylococcus_sp._HMSC72E08,Staphylococcus_sp._HMSC73A05,Staphylococcus_sp._HMSC74D05,Staphylococcus_sp._HMSC74G01,Staphylococcus_sp._HMSC74G05,Staphylococcus_sp._HMSC76G03,Staphylococcus_sp._HMSC77A05,Streptococcus_sobrinus_TCI-345"
V3V4Rep1,2336,90,100.0,Helicobacter_pylori_[587]
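
Counts like these are often easier to compare as relative abundances. The sketch below normalizes the `osu_count` values of the rows displayed above. Note the assumption: only the displayed rows are used, and the full `osu_ab.dt` table may contain additional OSUs, in which case the percentages would change.

```r
# osu_count values copied from the rows displayed above, named by osu_id.
osu_counts = c('1074' = 2813, '206' = 2436, '1093' = 946, '2440' = 920,
               '68' = 596, '1056' = 324, '2715' = 287, '754' = 264,
               '1309' = 242, '2336' = 90)

# Relative abundance among the displayed OSUs only.
rel_ab = osu_counts / sum(osu_counts)
print(round(100 * rel_ab, 1))
```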


## Taxonomy
Assign NCBI taxonomy classification for each OSU:

In [19]:
osu_tax.dt = taxonomy(osu_ab.dt)

In [21]:
osu_tax.dt

osu_id,pctsim,phylum,class,order,family,genus
206,100.0,Proteobacteria_[8143],"Gammaproteobacteria_[8142],Betaproteobacteria_[1]","Enterobacterales_[8142],Burkholderiales_[1]","Enterobacteriaceae_[8141],Alcaligenaceae_[1],Pectobacteriaceae_[1]","Escherichia_[7184],Shigella_[955],Achromobacter_[1],Brenneria_[1],Citrobacter_[1],Klebsiella_[1]"
1000017,99.29,"Proteobacteria_[8860],Firmicutes_[1]","Gammaproteobacteria_[8859],Betaproteobacteria_[1],Bacilli_[1]","Enterobacterales_[8857],Pseudomonadales_[2],Burkholderiales_[1],Lactobacillales_[1]","Enterobacteriaceae_[8856],Moraxellaceae_[2],Alcaligenaceae_[1],Pectobacteriaceae_[1],Lactobacillaceae_[1]","Escherichia_[7895],Shigella_[959],Acinetobacter_[2],Achromobacter_[1],Brenneria_[1],Citrobacter_[1],Klebsiella_[1],Lactobacillus_[1]"
2887,100.0,Proteobacteria_[1914],Gammaproteobacteria_[1914],Pseudomonadales_[1914],Moraxellaceae_[1914],Acinetobacter_[1914]
68,100.0,Firmicutes_[1970],Bacilli_[1970],Bacillales_[1970],Bacillaceae_[1970],Bacillus_[1970]
2715,100.0,Firmicutes_[22],Clostridia_[22],Clostridiales_[22],Clostridiaceae_[22],Clostridium_[22]
2336,100.0,Proteobacteria_[587],Epsilonproteobacteria_[587],Campylobacterales_[587],Helicobacteraceae_[587],Helicobacter_[587]
754,100.0,Proteobacteria_[1886],Gammaproteobacteria_[1886],Pseudomonadales_[1886],Pseudomonadaceae_[1886],"Pseudomonas_[1886],Hepatobacter_[1]"
2525,100.0,Firmicutes_[25],Bacilli_[25],Lactobacillales_[25],Lactobacillaceae_[25],Lactobacillus_[25]
1185,100.0,Firmicutes_[2048],Bacilli_[2048],Bacillales_[2048],Listeriaceae_[2048],Listeria_[2048]
1263,100.0,Proteobacteria_[458],"Betaproteobacteria_[457],Gammaproteobacteria_[1]","Neisseriales_[457],Pseudomonadales_[1]","Neisseriaceae_[457],Moraxellaceae_[1]","Neisseria_[457],Psychrobacter_[1]"


The number in brackets after each taxon name is the number of matching strains that belong to that taxon.
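
The bracketed counts can be cross-checked against the `species` column of the OSU abundance table. The base-R sketch below parses the species string of OSU 206, sums strain counts per genus, and compares them with the genus column above. It assumes that an entry without a bracketed number stands for a single strain, which is consistent with the totals shown but is not stated by the tutorial.

```r
# Species string of OSU 206, copied from the OSU abundance table.
species_206 = paste0(
  'Escherichia_coli_[7164],Shigella_sonnei_[854],Shigella_flexneri_[67],',
  'Shigella_boydii_[32],Escherichia_fergusonii_[7],Escherichia_albertii_[2],',
  'Shigella_dysenteriae_[2],Achromobacter_sp._ATCC35328,Brenneria_alni_pvfi20,',
  'Citrobacter_braakii_SCC4,Escherichia_sp._1_1_43,Escherichia_sp._3_2_53FAA,',
  'Escherichia_sp._4_1_40B,Escherichia_sp._B1147,Escherichia_sp._KTE11,',
  'Escherichia_sp._KTE159,Escherichia_sp._KTE52,Escherichia_sp._KTE96,',
  'Escherichia_sp._TW09308,Escherichia_sp._TW15838,',
  'Escherichia_vulneris_ATCC_33821,Klebsiella_oxytoca_2880STDY5682666')

entries = strsplit(species_206, ',', fixed = TRUE)[[1]]

# Entries ending in _[n] carry a strain count; the rest are assumed to be
# single strains.
has_count = grepl('_\\[[0-9]+\\]$', entries)
n_strains = rep(1L, length(entries))
n_strains[has_count] = as.integer(
  sub('^.*_\\[([0-9]+)\\]$', '\\1', entries[has_count]))

# The genus is the part of each entry before the first underscore.
genus = sub('_.*$', '', entries)
genus_counts = sort(tapply(n_strains, genus, sum), decreasing = TRUE)
print(genus_counts)
```

The per-genus sums for Escherichia and Shigella come out as 7184 and 955, matching `Escherichia_[7184]` and `Shigella_[955]` in the taxonomy table, and the grand total of 8143 matches `Proteobacteria_[8143]`.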