# Data Retrieval 

This section describes how to find and download the data products created during computation. Data can be downloaded for an individual metagenome or for an entire project.

We distinguish between precomputed and on-demand data. Precomputed data available for download comprise all (intermediate) pipeline products, including sequence similarities. On-demand data are, for example, reads annotated with a function or organism for a given name/annotation space (e.g. SEED or RefSeq).

##### Example Data

We use data from the MG-RAST project <a href="http://metagenomics.anl.gov/linkin.cgi?project=128">The oral metagenome in health and disease</a> (mgp128). The project includes 8 samples from the human oral cavity.


MG-RAST ID | Metagenome Name | bp Count | Sequence Count | Biome | Sequence Type | Sequence Method
---------- | --------------- | -------- | -------------- | ----- | ------------- | ---------------
4447943.3 | CA_04P | 142,374,233 | 339,503 | human-associated habitat | WGS | 454
4447192.3 | NOCA_01P | 77,538,485 | 204,218 | human-associated habitat | WGS | 454
4447103.3 | CA1_01P | 203,711,161 | 464,594 | human-associated habitat | WGS | 454
4447102.3 | NOCA_03P | 100,125,112 | 244,881 | human-associated habitat | WGS | 454
4447101.3 | CA1_02P | 129,851,692 | 295,072 | human-associated habitat | WGS | 454
4447971.3 | CA_06_1.6 | 37,519,874 | 97,722 | human-associated habitat | WGS | 454
4447970.3 | CA_05_4.6 | 27,669,924 | 70,503 | human-associated habitat | WGS | 454
4447903.3 | CA_06P | 123,266,763 | 306,740 | human-associated habitat | WGS | 454

### Downloading input and intermediate data products

The MG-RAST pipeline produces various intermediate results; see the MG-RAST Manual or ...

The MG-RAST download script (mg-download.py) can:
- list all data products for a metagenome
- download a single data product
- download all data for an entire study/project
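
The three use cases above map onto a small set of option combinations. The following is a minimal sketch, not part of the MG-RAST tools, that assembles the corresponding argument lists in Python. The helper name `mg_download_command` is hypothetical, and its rule that exactly one of metagenome or project must be given is a choice of this sketch rather than documented script behavior. Only the options `--metagenome`, `--project`, `--file`, `--dir`, and `--list` from the script's help text are used.

```python
import shlex


def mg_download_command(metagenome=None, project=None, file_id=None,
                        directory=None, list_only=False):
    """Build the argument list for one mg-download.py call (hypothetical helper)."""
    # Sketch-level rule: address either a metagenome or a project, not both.
    if (metagenome is None) == (project is None):
        raise ValueError("give exactly one of metagenome or project")
    cmd = ["mg-download.py"]
    if metagenome is not None:
        cmd += ["--metagenome", metagenome]
    else:
        cmd += ["--project", project]
    if file_id is not None:
        cmd += ["--file", file_id]
    if directory is not None:
        cmd += ["--dir", directory]
    if list_only:
        cmd.append("--list")
    return cmd


# One command per use case; each list could be passed to subprocess.run().
list_cmd = mg_download_command(metagenome="mgm4441680.3", list_only=True)
single_cmd = mg_download_command(metagenome="mgm4441680.3", file_id="550.1")
project_cmd = mg_download_command(project="mgp128")

for cmd in (list_cmd, single_cmd, project_cmd):
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if an ID or directory name contains unusual characters.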

In [7]:
! mg-download.py --help


NAME
    mg-download

VERSION
    1

SYNOPSIS
    mg-download [ --help, --user <user>, --passwd <password>, --token <oAuth token>, --project <project id>, --metagenome <metagenome id>, --file <file id> --dir <directory name> --list <list files for given id>]

DESCRIPTION
    Retrieve metadata for a metagenome.

Options:
  -h, --help            show this help message and exit
  --url=URL             communities API url
  --user=USER           OAuth username
  --passwd=PASSWD       OAuth password
  --token=TOKEN         OAuth token
  --project=PROJECT     project ID
  --metagenome=METAGENOME
                        metagenome ID
  --file=FILE           file ID for given project or metagenome
  --dir=DIR             directory to do downloads
  --list                list files and their info for given ID

Output
    List available files (name and size) for given project or metagenome id.
      OR
    Download of file(s) for given project, metagenome, or file i

#### Download Products for a Single Metagenome

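The --list option shows all data products available for a metagenome, together with the file ID needed to download each one: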

In [10]:
! mg-download.py --metagenome mgm4441680.3 --list

+--------------+---------------------------------------------+---------+----------------------------------+-----------+
| Metagenome   | File Name                                   | File ID | Checksum                         | Byte Size |
+--------------+---------------------------------------------+---------+----------------------------------+-----------+
| mgm4441680.3 | mgm4441680.3.050.upload.fna                 | 050.1   | b5971c881731254ccfeed9b20c205fda |  20119214 |
| mgm4441680.3 | mgm4441680.3.100.preprocess.passed.fna      | 100.1   | None                             |  19194865 |
| mgm4441680.3 | mgm4441680.3.100.preprocess.removed.fna     | 100.2   | None                             |    924348 |
| mgm4441680.3 | mgm4441680.3.150.dereplication.passed.fna   | 150.1   | None                             |  18857673 |
| mgm4441680.3 | mgm4441680.3.150.dereplication.removed.fna  | 150.2   | None                             |    336945 |
| mgm4441680.3 | mgm4441680.3.29

Downloading the cluster mapping file for mgm4441680.3

Metagenome | File Name | File ID
----------|------------|--------
mgm4441680.3 | mgm4441680.3.550.cluster.aa90.mapping       | 550.1 
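
The file ID passed to `--file` is read from the listing. When many files are involved, the listing can be parsed instead of read by eye. The sketch below assumes the output has the bordered table layout shown above (rows delimited by `|`, borders starting with `+`); the function name `parse_listing` is hypothetical. The sample text reuses rows from the listing above, including a truncated row to show that incomplete lines are skipped.

```python
def parse_listing(text):
    """Turn the bordered table printed by --list into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Keep only complete table rows; skip borders, blanks, truncated rows.
        if not line.startswith("|") or not line.endswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells          # first complete row holds the column names
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


sample = """
+--------------+----------------------------------------+---------+----------------------------------+-----------+
| Metagenome   | File Name                              | File ID | Checksum                         | Byte Size |
+--------------+----------------------------------------+---------+----------------------------------+-----------+
| mgm4441680.3 | mgm4441680.3.050.upload.fna            | 050.1   | b5971c881731254ccfeed9b20c205fda |  20119214 |
| mgm4441680.3 | mgm4441680.3.100.preprocess.passed.fna | 100.1   | None                             |  19194865 |
| mgm4441680.3 | mgm4441680.3.29
"""

# Map file names to file IDs so an ID can be looked up by name.
file_ids = {row["File Name"]: row["File ID"] for row in parse_listing(sample)}
print(file_ids["mgm4441680.3.100.preprocess.passed.fna"])
```

The same lookup applies to the cluster mapping file, whose file ID is 550.1.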


Execute <code>mg-download.py --metagenome mgm4441680.3 --file 550.1</code> on the command line. This creates a directory named after the metagenome inside the current directory, or inside the directory given with --dir, and stores the file there.

In [29]:
! mg-download.py --metagenome mgm4441680.3 --file 550.1 

Downloading mgm4441680.3.550.cluster.aa90.mapping for mgm4441680.3 ... Done


Check the content of the download directory:
<code>ls mgm4441680.3</code>

In [32]:
ls mgm4441680.3

mgm4441680.3.550.cluster.aa90.mapping


In [23]:
less mgm4441680.3/mgm4441680.3.550.cluster.aa90.mapping

In [33]:
#! for i in `cut -f3 mgm4441680.3/mgm4441680.3.550.cluster.aa90.mapping` ; do echo $i | sed -e 's/[^,]*//g' | wc -c ; done  | sort | uniq -c

Clusters with more than one member, counted by cluster size (first column: number of clusters, second column: members per cluster):

<code>
for i in `cut -f3 mgm4441680.3/mgm4441680.3.550.cluster.aa90.mapping` ; 
    do c=`echo $i | sed -e 's/[^,]*//g' | wc -c` ; echo `expr $c + 1` ; 
done  | sort | uniq -c
</code>

In [28]:
! for i in `cut -f3 mgm4441680.3/mgm4441680.3.550.cluster.aa90.mapping` ; do c=`echo $i | sed -e 's/[^,]*//g' | wc -c` ; echo `expr $c + 1` ; done  | sort | uniq -c

1528 2
  44 3
   2 4
   1 5
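
The shell loop above can also be written in Python, which may be easier to extend. This is a sketch under the same assumption the one-liner makes: column 3 of the tab-separated mapping file holds a comma-separated list of cluster members, and one more member (the representative, assumed here to be in column 2) is added to the count. The function name and the sample read IDs are hypothetical.

```python
from collections import Counter


def cluster_size_distribution(lines):
    """Count clusters per cluster size from cluster mapping lines."""
    sizes = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2]:
            continue  # skip incomplete lines
        members = fields[2].split(",")
        # +1 mirrors `expr $c + 1` in the shell version (assumed representative).
        sizes[len(members) + 1] += 1
    return sizes


# Hypothetical sample lines in the assumed four-column layout.
sample_lines = [
    "aa90_1\tread_a\tread_b\t98.5\n",
    "aa90_2\tread_c\tread_d,read_e\t97.0,95.2\n",
    "aa90_3\tread_f\tread_g\t99.1\n",
]

distribution = cluster_size_distribution(sample_lines)
for size in sorted(distribution):
    print(distribution[size], size)
```

For a real file, pass an open file handle, for example `cluster_size_distribution(open("mgm4441680.3/mgm4441680.3.550.cluster.aa90.mapping"))`.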


#### Download Products for a Project

In [11]:
! mg-download.py --project mgp128 --list

+--------------+---------------------------------------------+---------+----------------------------------+-----------+
| Metagenome   | File Name                                   | File ID | Checksum                         | Byte Size |
+--------------+---------------------------------------------+---------+----------------------------------+-----------+
| mgm4447971.3 | mgm4447971.3.050.upload.fna                 | 050.1   | 328834dc94901ba458afc26f47fed41c |  39181148 |
| mgm4447971.3 | mgm4447971.3.100.preprocess.passed.fna      | 100.1   | 328834dc94901ba458afc26f47fed41c |  39181148 |
| mgm4447971.3 | mgm4447971.3.100.preprocess.removed.fna     | 100.2   | d41d8cd98f00b204e9800998ecf8427e |         0 |
| mgm4447971.3 | mgm4447971.3.150.dereplication.passed.fna   | 150.1   | 328834dc94901ba458afc26f47fed41c |  39181148 |
| mgm4447971.3 | mgm4447971.3.150.dereplication.removed.fna  | 150.2   | d41d8cd98f00b204e9800998ecf8427e |         0 |
| mgm4447971.3 | mgm4447971.3.29

This will download the entire study:
<code>mg-download.py --project mgp128</code> 

In [35]:
!  mg-download.py --project mgp128

Downloading mgm4447971.3.050.upload.fna for mgm4447971.3 ... Done
Downloading mgm4447971.3.100.preprocess.passed.fna for mgm4447971.3 ... Done
Downloading mgm4447971.3.100.preprocess.removed.fna for mgm4447971.3 ... Done
Downloading mgm4447971.3.150.dereplication.passed.fna for mgm4447971.3 ... Done
Downloading mgm4447971.3.150.dereplication.removed.fna for mgm4447971.3 ... Done
Downloading mgm4447971.3.299.screen.passed.fna for mgm4447971.3 ... Done
Downloading mgm4447971.3.350.genecalling.faa for mgm4447971.3 ... Done
Downloading mgm4447971.3.425.rna.filter.fna for mgm4447971.3 ... Done
Downloading mgm4447971.3.440.cluster.rna97.mapping for mgm4447971.3 ... Done
Downloading mgm4447971.3.440.cluster.rna97.fna for mgm4447971.3 ... Done
Downloading mgm4447971.3.450.rna.sims for mgm4447971.3 ... Done
Downloading mgm4447971.3.550.cluster.aa90.mapping for mgm4447971.3 ... Done
Downloading mgm4447971.3.550.cluster.aa90.faa for mgm4447971.3 ... Done
Downloading mgm4447971.3.650.protein.sims 
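
After a large download it can be worth comparing each file against the Checksum column of the listing. The sketch below assumes these checksums are MD5 digests. This is an inference, not a documented fact: the values are 32 hexadecimal characters, and the checksum listed for the 0-byte files equals the MD5 digest of empty input. The function names are hypothetical. Listings that show `None` instead of a checksum are reported as not verifiable.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Return True/False for a comparison, or None if no checksum is listed."""
    if listed_checksum in (None, "", "None"):
        return None
    return md5_of_file(path) == listed_checksum.lower()


# Demonstration with an empty temporary file, mirroring the 0-byte
# "removed" files in the listing above.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    empty_path = tmp.name
empty_ok = matches_listing(empty_path, "d41d8cd98f00b204e9800998ecf8427e")
os.remove(empty_path)
print(empty_ok)
```

Reading in chunks keeps memory use constant, which matters for sequence files of tens of megabytes or more.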

### Downloading "annotated" sequences

To see the available options, invoke any of the mg-get-sequences-for-\* scripts with the --help option.

In [2]:
! mg-get-sequences-for-function.py --help


NAME
    mg-get-sequences-for-function

VERSION
    1

SYNOPSIS
    mg-get-sequences-for-function [ --help, --user <user>, --passwd <password>, --token <oAuth token>, --id <metagenome id>, --name <function name>, --level <function level>, --source <datasource>, --evalue <evalue negative exponent>, --identity <percent identity>, --length <alignment length> ]

DESCRIPTION
    Retrieve function annotated sequences for a metagenome filtered by function containing inputted name.

Options:
  -h, --help           show this help message and exit
  --id=ID              KBase Metagenome ID
  --url=URL            communities API url
  --user=USER          OAuth username
  --passwd=PASSWD      OAuth password
  --token=TOKEN        OAuth token
  --name=NAME          function name to filter by
  --level=LEVEL        function level to filter by
  --source=SOURCE      datasource to filter results by, default is Subsystems
  --evalue=EVALUE      negative exponent value for maxim

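We want to retrieve all reads from mgm4447903.3 which are part of "Central carbohydrate metabolism", a level 2 category in the SEED Subsystems. The --evalue option takes the negative exponent of the e-value cutoff, so 10 stands for 1e-10. In the call below, error messages are appended to error.log and only the first 15 lines of output are shown.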

In [6]:
! mg-get-sequences-for-function.py --id "mgm4447903.3" --name "Central carbohydrate metabolism" --level level2 --source Subsystems --evalue 10 2>>error.log | head -n 15  

sequence id	m5nr id (md5sum)	dna sequence	semicolon separated list of annotations
mgm4447903.3|GF8803K02GAMU6	0007c0e8723384b66f266a60ea8a0219	ATTATAGCAGAGGGAGTGTGGAATGTCATGTATGAACTATTTTTCGCGGATAACTTCGATTTTATATCCATCTAAGTCTTTCACGAAGTAGTAGTTAGCAGGATGTCCTGGTAAACCTTTTAATTCTGTAACGTCATAACCTTTAGCAACGTGTTCAGCGTGAAGAGCTTCTAAGTCTTTAGCACTGATGGCGATATGACCATATCCGTCACCGATTTCGTATGGTCCGTGACCATAGTTATATGTTAATTCTAATTCATAGTCATCTCCTTCAAAAGCGAGATAAGCGATTGTGTAATTTATGTTCTGGGAAGTTCTTTTCTTCTTGTTTACTTTAAAACCGATGCTTCTTCGTAGAACCGAATGCTTTCTTCTAGTATTTTCAACACGGACACAAGTATG	SS08191
mgm4447903.3|GF8803K02HY8OA	0007c0e8723384b66f266a60ea8a0219	CAGAACTCATGCGCACACTTAAAAAGTCTCCACAGAGAATCCCCTATCCCTTTTATCTGACACCTTGTGAGTCTCTCTGGACTCCCCTTAAAAGGTTAAATTTGTATTAGTATACTCTTTCAAAGAAAAAAAGTCAAGTAGAAAACGAACATTCTACTTGACTTTACTGGATTATTTTTCACGAATCACTTCGACCTTGTAGCCGTCAGGATCCTTGACAAAGTATAGTTTGGTGGAGTTCCTGGTAGGCCTTTTGGCTCTGTAACCTCATAGCCTTTAGCGCTATGTTCTTGATGTAGGGCCTCAAGGTCAGGTGTACTGAGGGCGATATGGGCAAAACCNTCTCCTACCACGTAAGGGCC	SS08191
mgm4447903.3|G
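
The output is tab-separated with the four columns named in its header line: sequence id, m5nr id, DNA sequence, and annotations. Many downstream tools expect FASTA, so a conversion is a natural next step. The following sketch assumes exactly this four-column layout; the function name, the sample metagenome ID, and the shortened sequence are hypothetical and only illustrate the format.

```python
def annotated_rows_to_fasta(lines):
    """Convert tab-separated annotated-sequence rows to FASTA text."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header line and rows without all four columns.
        if len(fields) < 4 or fields[0] == "sequence id":
            continue
        seq_id, md5, sequence, annotations = fields[:4]
        # FASTA header: read ID, then m5nr md5 and annotations as description.
        records.append(f">{seq_id} {md5} {annotations}\n{sequence}")
    return "\n".join(records)


# Hypothetical sample: real header line, shortened illustrative data row,
# and a truncated row that is ignored.
sample_rows = [
    "sequence id\tm5nr id (md5sum)\tdna sequence\tsemicolon separated list of annotations\n",
    "mgm0000000.3|read_1\t0007c0e8723384b66f266a60ea8a0219\tATTATAGCAG\tSS08191\n",
    "mgm0000000.3|trunc",
]

fasta_text = annotated_rows_to_fasta(sample_rows)
print(fasta_text)
```

To process real output, redirect the script's output to a file and pass the open file handle to the function.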