# <center> Mothur tutorial <br/> Downloading data for analysis</center>
Because the installation section took a relatively long time to compile, I will keep the analysis itself relatively simple.

## (a) Sample Sequence data
We will be working with mouse fecal data. The tutorial works with a subset of a bigger dataset, published in the original paper referenced in [1]. Here I will use an even smaller subset of that subset to keep the process short.

In [None]:
wget https://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
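
Re-running this cell downloads the archive again, and by default `wget` then saves the new copy under a numbered name such as `MiSeqSOPData.zip.1`. A minimal sketch of a guard that skips the download when a non-empty file is already present; the helper name `fetch_once` is hypothetical and not part of the tutorial or of `wget`:

```shell
# Hypothetical helper: download a URL only if the target file is missing or empty.
fetch_once() {
    url=$1
    out=${2:-$(basename "$url")}   # default name: last path component of the URL
    if [ -s "$out" ]; then
        echo "skip: $out already present"
        return 0
    fi
    wget -O "$out" "$url"
}

# Usage (needs network access):
# fetch_once https://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
```

`wget -nc` (no-clobber) is a built-in alternative; the explicit `-s` test here additionally retries when a previous attempt left an empty file behind.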

Let's take a look at what this archive contains. It was created on macOS, so it also contains junk files (the `__MACOSX` entries), hence the `grep -v` filter in the pipeline.

In [7]:
unzip -l MiSeqSOPData.zip | grep -v __MAC

Archive:  MiSeqSOPData.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2014-08-14 16:03   MiSeq_SOP/
  4375560  2013-03-28 09:17   MiSeq_SOP/F3D0_S188_L001_R1_001.fastq
  4370800  2013-03-28 09:17   MiSeq_SOP/F3D0_S188_L001_R2_001.fastq
  3345300  2013-03-28 09:17   MiSeq_SOP/F3D141_S207_L001_R1_001.fastq
  3340982  2013-03-28 09:17   MiSeq_SOP/F3D141_S207_L001_R2_001.fastq
  1787268  2013-03-28 09:17   MiSeq_SOP/F3D142_S208_L001_R1_001.fastq
  1784740  2013-03-28 09:17   MiSeq_SOP/F3D142_S208_L001_R2_001.fastq
  1784360  2013-03-28 09:17   MiSeq_SOP/F3D143_S209_L001_R1_001.fastq
  1782200  2013-03-28 09:17   MiSeq_SOP/F3D143_S209_L001_R2_001.fastq
  2710311  2013-03-28 09:17   MiSeq_SOP/F3D144_S210_L001_R1_001.fastq
  2706259  2013-03-28 09:17   MiSeq_SOP/F3D144_S210_L001_R2_001.fastq
  4142241  2013-03-28 09:17   MiSeq_SOP/F3D145_S211_L001_R1_001.fastq
  4135947  2013-03-28 09:17   MiSeq_SOP/F3D145_S211_L001_R2_001.fastq
  2819293  2013-03-28 09:1
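
The file names follow the pattern `<sample>_S<number>_L001_R<1|2>_001.fastq`, with one R1 and one R2 file per sample. As a small sketch, the distinct sample IDs can be pulled out of such a listing with `sed`. The helper name `sample_ids` is hypothetical, and the four paths are copied from the listing above and fed in through a here-document so that the snippet runs without the archive:

```shell
# Print the distinct sample IDs found in a list of MiSeq FASTQ paths (one per line).
sample_ids() {
    # 1st substitution: drop the directory part; 2nd: drop the _S.._L.._R.._...fastq suffix
    sed -E 's|.*/||; s/_S[0-9]+_L[0-9]+_R[12]_[0-9]+\.fastq$//' | sort -u
}

sample_ids <<'EOF'
MiSeq_SOP/F3D0_S188_L001_R1_001.fastq
MiSeq_SOP/F3D0_S188_L001_R2_001.fastq
MiSeq_SOP/F3D141_S207_L001_R1_001.fastq
MiSeq_SOP/F3D141_S207_L001_R2_001.fastq
EOF
```

For this input the function prints `F3D0` and `F3D141`, one per line.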

I extract two samples and a reference dataset: the paired-end reads of a fecal sample from a female mouse 150 days after weaning, and the Mock sample, a mixture of 32 bacterial strains. These are the samples `F3D150` and `Mock`. I extract and recompress the reads, because **Mothur** can handle them in `.fastq.gz` format as well. I also extract the `HMP_MOCK.v35.fasta` file, because it contains the known sequences of the Mock sample. It gives us the reference composition, which we will compare with the results that **Mothur** produces.

The extracted files and all the output of **Mothur** will be stored in the `Data` folder.

In [9]:
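mkdir -p Data

# Extract only the F3D150 and Mock reads plus the HMP mock reference,
# recompressing each file with gzip on the way out
for file in $(unzip -l MiSeqSOPData.zip | grep -oE "/(F3D150|Mock|HMP_).*$" | tr -d '/')
do
    unzip -p MiSeqSOPData.zip "*/$file*" | gzip > "Data/$file.gz"
done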

ls -la Data/

total 2448
drwxrwxr-x 2 viktor viktor   4096 Mar 23 10:47 .
drwxrwxr-x 4 viktor viktor   4096 Mar 23 10:34 ..
-rw-rw-r-- 1 viktor viktor 569706 Mar 23 10:48 F3D150_S216_L001_R1_001.fastq.gz
-rw-rw-r-- 1 viktor viktor 778215 Mar 23 10:48 F3D150_S216_L001_R2_001.fastq.gz
-rw-rw-r-- 1 viktor viktor   2698 Mar 23 10:48 HMP_MOCK.v35.fasta.gz
-rw-rw-r-- 1 viktor viktor 514032 Mar 23 10:48 Mock_S280_L001_R1_001.fastq.gz
-rw-rw-r-- 1 viktor viktor 626119 Mar 23 10:48 Mock_S280_L001_R2_001.fastq.gz
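
A quick sanity check on paired-end data is that the R1 and R2 files of a sample hold the same number of reads. A FASTQ record normally spans four lines, so the read count is the line count divided by four. The sketch below assumes that four-line layout (no wrapped sequences); `count_reads` is a hypothetical helper, and the example builds a two-read toy file in a temporary directory rather than touching `Data/`:

```shell
# Count reads in a gzip-compressed FASTQ file, assuming 4 lines per record.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}

# Toy example: two reads, gzip-compressed like the files in Data/.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$tmpdir/toy_R1.fastq.gz"
count_reads "$tmpdir/toy_R1.fastq.gz"   # prints 2
rm -r "$tmpdir"
```

On the real data, `count_reads Data/F3D150_S216_L001_R1_001.fastq.gz` and the same call on the matching `R2` file should print the same number.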


We now have the two samples with paired-end reads and the Mock sample composition.

## (b) Current Silva-based bacterial reference alignment (311 MB)
This file is a reference alignment of rRNA sequences with taxonomic data. It contains 50,000-column alignments of 168,111 bacterial, 4,337 archaeal, and 18,213 eukaryotic sequences. It is ~8.9 GB uncompressed, so we will need to reduce it before use. Here I just download the file; I will deal with it later.

In [None]:
wget https://www.mothur.org/w/images/b/b4/Silva.nr_v128.tgz
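
Before unpacking an archive of this size, it can be worth checking how much disk space it will need. A minimal sketch, assuming a gzip-compressed tarball: streaming the decompressed data through `wc -c` gives the size of the tar stream (file contents plus a small tar overhead) without writing anything to disk. `uncompressed_bytes` is a hypothetical helper name.

```shell
# Print the decompressed size of a .tgz/.gz file in bytes, without extracting it.
uncompressed_bytes() {
    gzip -dc "$1" | wc -c | tr -d ' '   # tr strips the padding some wc versions add
}

# Usage, once the download has finished (it reads the whole archive, so it takes a while):
# uncompressed_bytes Silva.nr_v128.tgz
```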

## (c) Current Ribosomal Database Project (RDP) reference files
A collection of 12,681 bacterial and 531 archaeal 16S rRNA gene sequences (*not aligned!*) for Bayesian classification. It also contains taxonomy data and is used to resolve OTUs at the genus level. This file is relatively small, so I extract it here. I also rename the folder this step creates to `PDS` to make access smoother later.

In [None]:
wget https://www.mothur.org/w/images/c/c3/Trainset16_022016.pds.tgz
tar xvfz Trainset16_022016.pds.tgz
mv trainset16_022016.pds PDS
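
Each FASTA record starts with a `>` header line, so counting those lines gives the number of sequences. A small sketch for checking a reference file against the counts quoted above (12,681 + 531 = 13,212 sequences in total). `count_fasta` is a hypothetical helper, and the file name in the usage comment is an assumption about what the tarball contains; check the `tar` output for the actual name.

```shell
# Count sequences in a FASTA file (plain or gzip-compressed) by counting '>' headers.
count_fasta() {
    case "$1" in
        *.gz) gzip -dc "$1" | grep -c '^>' ;;
        *)    grep -c '^>' "$1" ;;
    esac
}

# Toy example with two records.
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\nACGT\n>seq2\nTTGA\n' > "$tmpdir/toy.fasta"
count_fasta "$tmpdir/toy.fasta"   # prints 2
rm -r "$tmpdir"

# Usage on the extracted reference (file name is an assumption):
# count_fasta PDS/trainset16_022016.pds.fasta
```

The same helper can be pointed at `Data/HMP_MOCK.v35.fasta.gz` to see how many reference sequences describe the Mock sample.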

[1] https://www.mothur.org/wiki/MiSeq_SOP