sarpiens edited this page Mar 15, 2024 · 19 revisions

Overview

Omics Dataset Curation Toolkit (OMD Curation Toolkit) is a suite of programs for downloading and curating the metadata and fastq files of public omics datasets (metataxonomics, metagenomics, metatranscriptomics, RNA-Seq data, etc.). The workflow provides a standardized framework intended to ease the arduous task of curating public omics projects. Although it is centered on the European Nucleotide Archive (ENA), most of the provided tools are generic and can be used to curate datasets from other sources.

Implementation

OMD Curation Toolkit is an open-source omics dataset curation package implemented entirely in Python 3 (>=3.10). The core functionality depends mainly on the Python standard library, whereas output formatting (termcolor and tabulate libraries) and some additional functionality (pandas, mg-toolkit and parfive libraries) rely on third-party open-source Python libraries. OMD Curation Toolkit is organized as a workflow in which each program corresponds to a different curation step.

Workflow and Programs included in OMD Curation Toolkit

A) Explore the Original Publication

Most of the time, when we work with a public study, we will find that little or no metadata is available. In that case, we will need to survey the original publication. The most relevant sections are usually "Materials and Methods" (where the datasets are typically described) and "Supplementary Information" (where we can often find tables of associated metadata, quality control tables for the files, etc.). Even when metadata is available, exploring the original publication is always useful, because it provides the necessary context for subsequent work. It is also worth being aware of the Considerations for Curating Metadata in Public Datasets.

B) Collection Programs

These programs cover the basic workflow steps that allow us to obtain the metadata and fastq files associated with our dataset of interest. The following programs belong to the Collection Programs group:
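To give a feel for what collecting ENA metadata involves, here is a minimal sketch that builds a file report request for the public ENA Portal API and parses the returned TSV text. It is not the toolkit's own code: the function names, the chosen fields, and the placeholder accession `PRJEB00000` are assumptions made for illustration, and the actual download step is left out.

```python
import csv
import io
from urllib.parse import urlencode

# Public ENA Portal API endpoint for per-run file reports.
ENA_FILEREPORT_URL = "https://www.ebi.ac.uk/ena/portal/api/filereport"

# Hypothetical default field selection; adjust to the metadata you need.
DEFAULT_FIELDS = ("run_accession", "sample_accession", "fastq_ftp", "fastq_md5")


def build_filereport_url(accession, fields=DEFAULT_FIELDS):
    """Return the file report URL for a project or study accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT_URL}?{query}"


def parse_filereport(tsv_text):
    """Parse a TSV file report into a list of dicts, one per run."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Runs with several fastq files list them separated by ';'.
        row["fastq_ftp"] = [u for u in (row.get("fastq_ftp") or "").split(";") if u]
        rows.append(row)
    return rows


if __name__ == "__main__":
    # 'PRJEB00000' is a placeholder, not a real project accession.
    print(build_filereport_url("PRJEB00000"))
```

The text returned by that URL (fetched with any HTTP client) can be passed to `parse_filereport` to get one record per run, with the fastq locations already split into a list.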

C) Control Check Programs

These programs are the control points of the workflow: they allow us to check, verify and interpret the files obtained for our dataset of interest. They analyze the various files and print help messages to assist the researcher during the curation process. The following programs belong to the Control Check Programs group:

D) Optional Programs

These programs are extra workflow steps that can be very helpful in particular cases but, depending on the dataset of interest, are not always required during the curation process. They provide further functionality, including merging and filtering metadata tables, handling fastq files (copy, rename, and merge), and combining the associated metadata. The following programs belong to the Optional Programs group:
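As an illustration of what merging and filtering metadata tables means in practice, the sketch below joins two tab-separated tables on a shared key and keeps only the rows that match a set of allowed values. It uses only the standard library so that it is self-contained, and it does not reproduce the toolkit's own programs: the function names and the column names (`run_accession`, `body_site`, `library_strategy`) are hypothetical.

```python
import csv
import io


def read_table(tsv_text):
    """Read tab-separated text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def merge_tables(left, right, key):
    """Left-join two tables on `key`.

    Every row of `left` is kept. Columns from the matching `right` row are
    added; on a column name clash the value from `left` wins.
    """
    lookup = {row[key]: row for row in right}
    return [{**lookup.get(row[key], {}), **row} for row in left]


def filter_rows(rows, column, allowed):
    """Keep only rows whose `column` value is in `allowed`."""
    return [row for row in rows if row.get(column) in allowed]


if __name__ == "__main__":
    # Made-up example tables, for illustration only.
    ena = read_table("run_accession\tlibrary_strategy\nR1\tAMPLICON\nR2\tWGS\n")
    paper = read_table("run_accession\tbody_site\nR1\tgut\nR2\tskin\n")
    merged = merge_tables(ena, paper, key="run_accession")
    print(filter_rows(merged, "library_strategy", {"AMPLICON"}))
```

A left join is used here so that no run from the archive table is silently dropped when the publication's table lacks an entry for it; such rows simply carry fewer columns and can be reviewed by hand.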