vonMeyennLab/nf_fetchgeo

Fetch GEO Pipeline

A Nextflow pipeline to download FASTQ, SRA, and processed files from the Gene Expression Omnibus (GEO) database, a public functional genomics data repository supporting MIAME-compliant data submissions.

The pipeline was created to run on the ETH Euler cluster and relies on the server's Lmod environment modules and genome files. It therefore needs to be adapted before it can run on a different HPC cluster.

Pipeline steps

  1. geofetch
  2. sradownloader

Required parameters

One or more GEO accession numbers, separated by commas. --geo_acc

--geo_acc 'GSE129393,GSE208727,GSE54651'
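A malformed accession list is easy to catch before launching the pipeline. The sketch below is not part of the pipeline: `check_geo_acc` is a hypothetical helper, and it assumes every entry is a GEO series accession of the form `GSE` followed by digits, with no spaces around the commas.

```shell
# Hypothetical pre-flight check for a --geo_acc value (not part of the pipeline).
# Assumption: every entry is a series accession of the form GSE<digits>.
check_geo_acc() {
  list="$1"
  # Reject empty input and leading, trailing, or doubled commas.
  case "$list" in
    ''|,*|*,|*,,*) return 1 ;;
  esac
  old_ifs=$IFS
  IFS=','
  for acc in $list; do
    case "$acc" in
      GSE|GSE*[!0-9]*) IFS=$old_ifs; return 1 ;;  # no digits, or a non-digit after GSE
      GSE[0-9]*) ;;                               # looks like a series accession
      *) IFS=$old_ifs; return 1 ;;                # anything else is rejected
    esac
  done
  IFS=$old_ifs
  return 0
}

if check_geo_acc 'GSE129393,GSE208727,GSE54651'; then
  echo "accession list looks well formed"
fi
```

The check is deliberately strict; relax the patterns if other accession types need to be accepted.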

Output directory where the files will be saved. --outdir

--outdir /cluster/work/nme/data/josousa/project
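As a minimal sketch, the two required parameters can be combined into a single launch command. The `nextflow run vonMeyennLab/nf_fetchgeo` form is an assumption based on the repository name (it presumes Nextflow can locate the repository's entry script); on a local clone, the path to the main script would be used instead. The helper `build_cmd` is hypothetical and only prints the command so it can be inspected before running.

```shell
# Hypothetical helper: assemble the launch command from the two required parameters.
# Assumption: the pipeline can be launched by repository name with `nextflow run`.
build_cmd() {
  geo_acc="$1"
  outdir="$2"
  printf "nextflow run vonMeyennLab/nf_fetchgeo --geo_acc '%s' --outdir '%s'\n" \
    "$geo_acc" "$outdir"
}

# Print the command for the example values used above.
build_cmd 'GSE129393,GSE208727,GSE54651' /cluster/work/nme/data/josousa/project
```

Quoting the accession list keeps the shell from splitting or expanding it before it reaches Nextflow.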

Optional parameters

  • Option to choose the file types to download from the GEO database. --output_type

    --output_type 'FastQ data' # Default
    --output_type 'SRA data'
    --output_type 'FastQ + SRA data'
    --output_type 'Processed data'
    --output_type 'SRA metadata'
    --output_type 'Processed metadata'
  • Option to specify the source of data on the GEO record to retrieve processed data. --data_source

    --data_source 'samples' # Default
    --data_source 'series'
    --data_source 'both'

    This option applies only to the processed data download. Processed data may be attached to the collective series entity or to individual samples, and this option selects which of the two is used. Allowed values are samples, series, or both (all). The option is ignored unless the 'processed' flag is set.
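Because the `--output_type` values contain spaces and mixed case, a typo is easy to make. The sketch below checks both optional parameters against the values listed above. The helpers `valid_output_type` and `valid_data_source` are hypothetical, and the assumption that the pipeline matches these strings exactly (case-sensitively) is not confirmed by the source.

```shell
# Hypothetical validators for the optional parameters (not part of the pipeline).
# Assumption: the pipeline expects exactly the strings listed in this README.
valid_output_type() {
  case "$1" in
    'FastQ data'|'SRA data'|'FastQ + SRA data'|'Processed data'|'SRA metadata'|'Processed metadata')
      return 0 ;;
    *)
      return 1 ;;
  esac
}

valid_data_source() {
  case "$1" in
    samples|series|both) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: check a processed-data request before launching the pipeline.
output_type='Processed data'
data_source='series'
if valid_output_type "$output_type" && valid_data_source "$data_source"; then
  echo "--output_type '$output_type' --data_source '$data_source'"
fi
```

Always quote the `--output_type` value on the command line; without quotes the shell would pass it as several separate words.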

Extra arguments

  • Option to add extra arguments to the package geofetch. --geofetch_args

  • Option to add extra arguments to the package sradownloader. --sradownloader_args

Additional information

The package sradownloader was modified to download files with the package Axel instead of connecting to the ENA FTP server. This was done because our HPC server does not allow connections to the FTP server. If you wish to use the pipeline outside of our group, you have to replace sradownloader_axel with sradownloader in the module sradownloader.mod.nf.
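The replacement described above can be scripted. The sketch below is one way to do it, not the authors' procedure: `use_stock_sradownloader` is a hypothetical helper, and the location of sradownloader.mod.nf inside the repository is not stated here, so the path in the usage comment is a placeholder. The helper keeps a `.bak` copy of the original file.

```shell
# Hypothetical helper: replace every occurrence of sradownloader_axel with
# sradownloader in the given module file, keeping a .bak copy of the original.
use_stock_sradownloader() {
  module_file="$1"
  cp "$module_file" "$module_file.bak" || return 1
  # Read from the backup and write the edited text to the original path.
  # This avoids `sed -i`, whose syntax differs between GNU and BSD sed.
  sed 's/sradownloader_axel/sradownloader/g' "$module_file.bak" > "$module_file"
}

# Usage (placeholder path; adjust to where the module lives in your clone):
# use_stock_sradownloader path/to/sradownloader.mod.nf
```

Running the helper a second time changes nothing in the module file, since no occurrences of sradownloader_axel remain, but it does overwrite the `.bak` copy.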

Acknowledgements

This pipeline was adapted from the Nextflow pipelines created by the Babraham Institute Bioinformatics Group and from the nf-core pipelines. We thank all the contributors to both projects. We also thank the Nextflow community and the nf-core community for all their help and support.
