The scripts in this directory allow the user to reproduce the entire process of downloading raw reads, processing them with the MetaPhlAn2 and HUMAnN2 pipelines, generating the final normalized profiles, and arranging them in folders exactly as done for the package.
- A single sample can be downloaded and processed as follows:

  ```sh
  $ bash curatedMetagenomicData_pipeline.sh sample_name "SRRxxyxyxx;SRRyyxxyyx"
  ```

  where `sample_name` is the name to be given to the sample, and SRRxxyxyxx etc. are the corresponding NCBI accession numbers. See within `curatedMetagenomicData_pipeline.sh` for requirements and settings.

- `sample_name` and the corresponding NCBI accession numbers for all the samples included in the package are available in the `combined_metadata` dataset provided in the package:

  ```r
  > library(curatedMetagenomicData)
  > data(combined_metadata)
  ```
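For many samples, the same call can be generated in a loop. A minimal sketch, assuming a hypothetical tab-separated `samples.tsv` of sample names and semicolon-joined accessions (not a file shipped with the package); the commands are echoed for review rather than executed:

```sh
# Create a small stand-in table: sample_name<TAB>accessions.
# (Hypothetical file; in practice these pairs come from combined_metadata.)
printf 'sampleA\tSRRxxyxyxx;SRRyyxxyyx\nsampleB\tSRRzzxzzxz\n' > samples.tsv

# Default IFS splits on the tab; drop the "echo" to actually run each command.
while read -r name accessions; do
    echo bash curatedMetagenomicData_pipeline.sh "$name" "\"$accessions\""
done < samples.tsv
# → bash curatedMetagenomicData_pipeline.sh sampleA "SRRxxyxyxx;SRRyyxxyyx"
# → bash curatedMetagenomicData_pipeline.sh sampleB "SRRzzxzzxz"
```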
- All the samples included in the package can be downloaded and processed by running the commands in `curatedMetagenomicData_pipeline_allsamples.sh`. Please be aware that this would take ages to run on a single CPU.
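On a multi-core machine, the per-sample commands can be fanned out with `xargs -P`. A hedged sketch, not part of the provided scripts: `commands.txt` stands in for the lines of `curatedMetagenomicData_pipeline_allsamples.sh`, and the `echo` makes it a dry run:

```sh
# Build an illustrative command list, one pipeline invocation per line.
printf '%s\n' \
  'bash curatedMetagenomicData_pipeline.sh sampleA SRRaaaaaaa' \
  'bash curatedMetagenomicData_pipeline.sh sampleB SRRbbbbbbb' \
  'bash curatedMetagenomicData_pipeline.sh sampleC SRRccccccc' \
  'bash curatedMetagenomicData_pipeline.sh sampleD SRRddddddd' > commands.txt

# -P 4: run up to four lines at a time; -I{} takes one whole line per job.
# Replace "echo DRY-RUN:" with "sh -c" to actually execute each command.
xargs -P 4 -I{} echo DRY-RUN: {} < commands.txt
```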
- Follow the same procedure if you want to process your own dataset.
- When done, your output profile files are properly organized and ready to be included in the curatedMetagenomicData package.
## To run the Docker container

This container is automatically built from the Dockerfile in GitHub.
There is some confusion about the location of the databases for the new MetaPhlAn2. The workaround is to duplicate the MetaPhlAn2 database files in two locations: `metaphlan2/db_v20` and `metaphlan2/databases/`.
On an AWS m5.4xlarge instance with a 50 GB volume attached, the "SRR4052038" example below took about 30 minutes.

```sh
#sudo apt-get install -y docker.io   # install Docker if needed
docker pull stevetsa/curatedmetagenomicdatahighload
docker run -it stevetsa/curatedmetagenomicdatahighload
## mount the current directory in the container for debugging
#docker run -v `pwd`:`pwd` -w `pwd` -i -t stevetsa/curatedmetagenomicdatahighload

## Inside container -
git clone https://github.com/stevetsa/curatedMetagenomicDataHighLoad.git
cd curatedMetagenomicDataHighLoad
bash setup.sh
bash curatedMetagenomicData_pipeline.sh MV_FEI4_t1Q14 "SRR4052038"
```
- `configrc`: source this to set version numbers or other configurations.
- `download_humann2_databases.sh`: install HUMAnN2, download its UniRef and ChocoPhlAn databases, and copy these to s3://curatedmetagenomics.bioconductor.org/humann2_database_downloads_$humann2_version. Note this also requires the environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID to be set.
- `parsemetadata.sh`: specify a `_metadata.tsv` filename and return pipeline commands, in the format of `curatedMetagenomicData_pipeline_allsamples.sh`, to stdout.
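As a rough illustration of what such a conversion does, the sketch below derives pipeline commands from a tiny stand-in metadata file. The `sampleID` and `NCBI_accession` column names are assumptions for this sketch and may differ from what the real `parsemetadata.sh` expects:

```sh
# Stand-in metadata file with the two assumed columns (tab-separated).
printf 'sampleID\tNCBI_accession\n' > demo_metadata.tsv
printf 'MV_FEI4_t1Q14\tSRR4052038\n' >> demo_metadata.tsv
printf 'sampleB\tSRRxxyxyxx;SRRyyxxyyx\n' >> demo_metadata.tsv

# Locate the two columns by header name, then emit one pipeline command
# per sample, in the style of curatedMetagenomicData_pipeline_allsamples.sh.
awk -F'\t' 'NR==1 { for (i=1;i<=NF;i++) { if ($i=="sampleID") s=i; if ($i=="NCBI_accession") a=i } next }
{ printf "bash curatedMetagenomicData_pipeline.sh %s \"%s\"\n", $s, $a }' demo_metadata.tsv
# → bash curatedMetagenomicData_pipeline.sh MV_FEI4_t1Q14 "SRR4052038"
# → bash curatedMetagenomicData_pipeline.sh sampleB "SRRxxyxyxx;SRRyyxxyyx"
```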