Tutorial Full Example

sarpiens edited this page Mar 13, 2024 · 41 revisions

Overview

In this tutorial, we will work with the ENA project PRJEB10949 and the GSA project PRJCA001214 to perform a complete curation example using all available programs. These datasets are also used by the Test omdctk program to verify that the installation was successful.

Initial setup

To keep this tutorial tidy, we will first create a set of directories in which to run the different programs.

Commands:

## Create Example directory
mkdir Example

## Enter Example directory
cd Example

## Create PRJEB10949 directory
mkdir PRJEB10949

## Create CRA001372 directory
mkdir CRA001372

1) ENA Dataset Curation Workflow

We will use the ENA project PRJEB10949 as an example of how to use the package to curate an ENA Dataset.

1.1) Collection and Initial Processing of Metadata

1.1.1) Download Metadata from ENA

First, we download the information available in ENA using the Download Metadata ENA program. This program is exclusive to the ENA Dataset Workflow.

Commands:

## Enter PRJEB10949 directory
cd PRJEB10949

## Download PRJEB10949 ENA metadata
download_metadata_ENA -p PRJEB10949

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   download_metadata_ENA.py                                 ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────┬────────────┐
│ Argument         │ Value      │
├──────────────────┼────────────┤
│ project          │ PRJEB10949 │
│ output_directory │            │
│ plain_text       │ False      │
└──────────────────┴────────────┘

ENA Technical Metadata (ENA Browser):

Downloading:
https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB10949&result=read_run&fields=study_accession,secondary_study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,submission_accession,tax_id,scientific_name,instrument_platform,instrument_model,library_name,nominal_length,library_layout,library_strategy,library_source,library_selection,read_count,base_count,center_name,first_public,last_updated,experiment_title,study_title,study_alias,experiment_alias,run_alias,fastq_bytes,fastq_md5,fastq_ftp,fastq_aspera,fastq_galaxy,submitted_bytes,submitted_md5,submitted_ftp,submitted_aspera,submitted_galaxy,submitted_format,sra_bytes,sra_md5,sra_ftp,sra_aspera,sra_galaxy,sample_alias,broker_name,sample_title,nominal_sdev,first_created&format=tsv&download=true&limit=0

Saved file:
/home/sapies/Desktop/Prueba_concat/Example/PRJEB10949/PRJEB10949_ENA_browser.tsv

Loading file:
/home/sapies/Desktop/Prueba_concat/Example/PRJEB10949/PRJEB10949_ENA_browser.tsv

ENA Samples Metadata (mg-toolkit):

Downloading with:
mg-toolkit
0.10.4

Saving in file:
/home/sapies/Desktop/Prueba_concat/Example/PRJEB10949/PRJEB10949_mg-toolkit.tsv

Creating ENA Metadata Table:

Combining results:
Left Join by Run Accessions using ENA Technical Metadata Table as Reference

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/PRJEB10949/PRJEB10949_ENA_metadata.tsv

As we can see, the program has collected the information available in the ENA Browser and the data related to the project’s samples using the mg-toolkit package (https://pypi.org/project/mg-toolkit/). It has also performed a left join by run accession, taking the PRJEB10949_ENA_browser.tsv file as reference against the PRJEB10949_mg-toolkit.tsv file. The resulting PRJEB10949_ENA_metadata.tsv file is the one that will be used in the next workflow step.
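The request shown under "Downloading:" is a plain query against the ENA filereport endpoint. The following is a minimal sketch of how such a URL can be assembled; the helper name build_filereport_url is hypothetical, only a few of the requested fields are listed, and this is not necessarily how download_metadata_ENA builds its request.

```python
from urllib.parse import urlencode

# Base endpoint, copied from the "Downloading:" line in the output above.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(project, fields):
    """Hypothetical helper: build a filereport URL for all runs of a project."""
    params = {
        "accession": project,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
        "download": "true",
        "limit": 0,
    }
    # safe="," keeps the field list readable instead of encoding commas as %2C.
    return f"{ENA_FILEREPORT}?{urlencode(params, safe=',')}"


# Only three of the fields from the full request are used here.
url = build_filereport_url(
    "PRJEB10949", ["run_accession", "sample_alias", "fastq_ftp"]
)
print(url)
```

Adding or removing entries in the fields list changes which columns appear in the downloaded TSV.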

1.1.2) Metadata Merging

In this case, we have no additional metadata in the publication (https://doi.org/10.1371/journal.pone.0142334). However, after examining the ENA metadata, we were able to generate some extra columns of interest (sample_column, replicate, run_label, miseq_kit) which have been included in the PRJEB10949_publication_example.tsv test file. Therefore, we are going to take advantage of this file to illustrate how to merge metadata tables using the Merge Metadata program.

Command:

merge_metadata -m PRJEB10949_ENA_metadata.tsv -mc run_accession -e PRJEB10949_publication_example.tsv -ec run_accessions

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   merge_metadata.py                                        ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────────┬────────────────────────────────────┐
│ Argument             │ Value                              │
├──────────────────────┼────────────────────────────────────┤
│ main_metadata_table  │ PRJEB10949_ENA_metadata.tsv        │
│ extra_metadata_table │ PRJEB10949_publication_example.tsv │
│ main_merge_column    │ run_accession                      │
│ extra_merge_column   │ run_accessions                     │
│ pandas_merge_mode    │ left                               │
│ main_merge_suffix    │ _x                                 │
│ extra_merge_suffix   │ _y                                 │
│ output_directory     │                                    │
│ plain_text           │ False                              │
└──────────────────────┴────────────────────────────────────┘

Loading Files:

Main Metadata Table file:
PRJEB10949_ENA_metadata.tsv

Extra Metadata Table file:
PRJEB10949_publication_example.tsv

Creating Merged Metadata Table:

Combining tables:
Pandas Merge Mode: left
  o Left Table (x): Main Metadata Table
  o Right Table (y): Extra Metadata Table

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/PRJEB10949/merged_PRJEB10949_ENA_metadata.tsv

Merge Columns Intersection Checks:

Check merge columns' unique values:
Main Metadata Merge Column selected: run_accession
Extra Metadata Merge Column selected: run_accessions

Unique values:
  o Main Metadata Merge Column (total unique values): 181
  o Extra Metadata Merge Column (total unique values): 181

All unique values are common between the merge columns provided!

As we can see, the program has combined the PRJEB10949_ENA_metadata.tsv file (ENA Metadata Table) with the PRJEB10949_publication_example.tsv test file (Publication Metadata Table). A left join has been carried out, taking the ENA Metadata Table as reference against the Publication Metadata Table. After merging, the program also compared the unique values of the two merge columns. In this particular case, all values are common to both columns. This check is useful for detecting orphan merge values, that is, values present in only one of the two metadata tables. The resulting merged_PRJEB10949_ENA_metadata.tsv file is the one that will be used in the next workflow steps.
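The left join and the intersection check can be illustrated with pandas on toy data. This is a minimal sketch, not the code of merge_metadata: the accessions and values below are made up, and only the merge column names, the merge mode and the suffixes are taken from the parameters shown above.

```python
import pandas as pd

# Toy stand-ins for the two tables; accessions and values are made up.
main = pd.DataFrame({
    "run_accession": ["RUN_A", "RUN_B", "RUN_C"],
    "read_count": [12000, 8000, 15000],
})
extra = pd.DataFrame({
    "run_accessions": ["RUN_A", "RUN_B", "RUN_X"],
    "replicate": ["new", "old", "new"],
})

# Left join: every row of the main table is kept, matched or not.
merged = main.merge(
    extra,
    how="left",
    left_on="run_accession",
    right_on="run_accessions",
    suffixes=("_x", "_y"),
)

# Intersection check on the unique values of the two merge columns.
main_keys = set(main["run_accession"].dropna())
extra_keys = set(extra["run_accessions"].dropna())
common = main_keys & extra_keys
only_main = sorted(main_keys - extra_keys)   # rows that gain no extra metadata
only_extra = sorted(extra_keys - main_keys)  # extra rows lost in the left join

print(f"common: {len(common)}")
print(f"only in main: {only_main}")
print(f"only in extra: {only_extra}")
```

In this toy case RUN_C and RUN_X are orphan values; in the tutorial dataset both sets would be empty, since all 181 values are common.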

1.1.3) Initial Metadata Check

Now, we are going to check the metadata of the merged_PRJEB10949_ENA_metadata.tsv file using the Check Metadata ENA program. This program is exclusive to the ENA Dataset Workflow.

Command:

check_metadata_ENA -t merged_PRJEB10949_ENA_metadata.tsv

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   check_metadata_ENA.py                                    ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌─────────────────────┬────────────────────────────────────┐
│ Argument            │ Value                              │
├─────────────────────┼────────────────────────────────────┤
│ metadata_table      │ merged_PRJEB10949_ENA_metadata.tsv │
│ ena_download_column │ fastq_ftp                          │
│ fastq_pattern       │ .fastq.gz                          │
│ sample_column       │ sample_alias                       │
│ extra_columns_stats │                                    │
│ plain_text          │ False                              │
└─────────────────────┴────────────────────────────────────┘

Loading File:

Metadata Table file:
merged_PRJEB10949_ENA_metadata.tsv

Runs' Stats:

1) Number of run_accessions: 181

2) Appearances per scientific_name(tax_id):
                unique_values  counts
          Mus musculus(10090)     112
synthetic metagenome(1235509)      69

3) Appearances per instrument_model(instrument_platform):
           unique_values  counts
Illumina MiSeq(ILLUMINA)     181

4) Appearances per library_layout:
unique_values  counts
       PAIRED     181

5) Appearances per library_strategy+library_source:
       unique_values  counts
AMPLICON+METAGENOMIC     181

Runs' Checks:

1) Check that library_layout and ENA Download Column match:
ENA Download Column selected: fastq_ftp
Fastq File Pattern provided: .fastq.gz
- Total number of runs: 181
- Total number of warning runs: 0

All library_layouts for ENA Download Column match for all run_accessions!
- All PAIRED run_accessions have only 2 associated Fastq files

2) Check if original uploaded Fastqs are available:
- Is submitted_ftp available? True
- Is submitted_aspera available? True
- Is submitted_galaxy available? True

There are original submitted Fastqs columns available!

3) Check if there are duplicated Fastq file names in the original uploaded Fastqs column:
Using the following submitted Fastqs column: submitted_ftp
No duplicates were detected!

Sample Stats:

1) Stats for "sample_accession":

1.1) Number of unique samples for "sample_accession": 181

1.2) Samples per scientific_name(tax_id) in "sample_accession":
                       values  counts
          Mus musculus(10090)     112
synthetic metagenome(1235509)      69

1.3) Samples per instrument_model(instrument_platform) in "sample_accession":
                  values  counts
Illumina MiSeq(ILLUMINA)     181

1.4) Samples per library_layout in "sample_accession":
values  counts
PAIRED     181

1.5) Samples per library_strategy+library_source in "sample_accession":
              values  counts
AMPLICON+METAGENOMIC     181

1.6) Groups of samples in "sample_accession" by number of run_accessions:

Group with  1  run_accession(s) per sample:
- Total number of samples in this group: 181

2) Stats for "sample_alias":

2.1) Number of unique samples for "sample_alias": 181

2.2) Samples per scientific_name(tax_id) in "sample_alias":
                       values  counts
          Mus musculus(10090)     112
synthetic metagenome(1235509)      69

2.3) Samples per instrument_model(instrument_platform) in "sample_alias":
                  values  counts
Illumina MiSeq(ILLUMINA)     181

2.4) Samples per library_layout in "sample_alias":
values  counts
PAIRED     181

2.5) Samples per library_strategy+library_source in "sample_alias":
              values  counts
AMPLICON+METAGENOMIC     181

2.6) Groups of samples in "sample_alias" by number of run_accessions:

Group with  1  run_accession(s) per sample:
- Total number of samples in this group: 181

Sample Checks:

1) Does the number of run_accessions equal the number of samples?
- Number of runs equals samples in "sample_accessions": True
- Number of runs equals samples in "sample_alias": True

2) Does the number of samples equal each other between "sample_accession" and "sample_alias"?
- Number of samples in "sample_alias" equals samples in "sample_accession": True

2.1) Check samples for "sample_accession":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_accession" have only one match with "sample_alias"!

2.2) Check samples for "sample_alias":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_alias" have only one match with "sample_accession"!

3) Is there more than one library_strategy+library_source per sample?

3.1) Check samples for "sample_accession":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_accession" have only one match per "library_strategy+library_source"!

3.2) Check samples for "sample_alias":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_alias" have only one match per "library_strategy+library_source"!

4) Is there more than one scientific_name(tax_id) per sample?

4.1) Check samples for "sample_accession":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_accession" have only one match per "scientific_name(tax_id)"!

4.2) Check samples for "sample_alias":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_alias" have only one match per "scientific_name(tax_id)"!

5) Is there more than one instrument_model(instrument_platform) per sample?

5.1) Check samples for "sample_accession":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_accession" have only one match per "instrument_model(instrument_platform)"!

5.2) Check samples for "sample_alias":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_alias" have only one match per "instrument_model(instrument_platform)"!

6) Is there more than one library_layout per sample?

6.1) Check samples for "sample_accession":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_accession" have only one type of library_layout!

6.2) Check samples for "sample_alias":
- Total number of samples: 181
- Total number of warning samples: 0
All samples in "sample_alias" have only one type of library_layout!

Let's explore the metadata checking results:

  • Runs' Stats. The following relevant statistics have been calculated for the run accessions: 1) There are a total of 181 run accessions; 2) There are 112 Mus musculus(10090) runs and 69 synthetic metagenome(1235509) runs; 3) All runs used Illumina MiSeq sequencing; 4) All runs have a PAIRED fastq files layout; 5) All runs have AMPLICON+METAGENOMIC data.

  • Runs' Checks. The following relevant checks for run accessions have been performed: 1) After checking library_layouts and the fastq_ftp download column, we see that all PAIRED run accessions have only 2 associated fastq files; 2) The original submitted fastqs columns are available; 3) The submitted_ftp download column was used to check for duplicated fastq file names, but none were detected.

  • Sample Stats. After checking the relevant statistics that have been calculated for the "sample_accession" and "sample_alias" sample columns, we see that in this particular case these two sample columns are equivalent in terms of statistics: 1) There are a total of 181 sample accessions and sample aliases; 2) There are 112 Mus musculus(10090) samples and 69 synthetic metagenome(1235509) samples; 3) All samples used Illumina MiSeq sequencing; 4) All samples have a PAIRED fastq files layout; 5) All samples have AMPLICON+METAGENOMIC data; 6) All samples have only one associated run accession.

  • Sample Checks. The following relevant checks for samples have been performed: 1) The number of samples equals the number of run accessions; 2) The "sample_accession" and "sample_alias" sample columns contain the same number of samples. Furthermore, all samples in "sample_accession" have only one match with samples in "sample_alias" and vice versa; 3) All samples have only one match per "library_strategy+library_source" combination, which means that each sample has only one type of associated data (AMPLICON+METAGENOMIC); 4) All samples have only one scientific_name(tax_id) combination; 5) All samples have only one instrument_model(instrument_platform) combination; 6) All samples have only one type of library_layout (PAIRED).

In this particular case, there were no warning elements to further curate after the checking process. However, we have gained some useful knowledge about this dataset, and we can proceed with the next workflow steps.
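Two of these checks are easy to reproduce by hand. The sketch below uses a toy table with made-up accessions and file names, and it assumes that a PAIRED run should list exactly 2 fastq files and a SINGLE run exactly 1; the real check_metadata_ENA program may apply different or additional rules.

```python
import pandas as pd

# Toy table; accessions and file names are made up, column names follow
# the ENA metadata table.
runs = pd.DataFrame({
    "run_accession": ["RUN_A", "RUN_B", "RUN_C"],
    "sample_alias": ["s1", "s2", "s2"],
    "library_layout": ["PAIRED", "PAIRED", "SINGLE"],
    "fastq_ftp": [
        "host/RUN_A_1.fastq.gz;host/RUN_A_2.fastq.gz",
        "host/RUN_B.fastq.gz",
        "host/RUN_C.fastq.gz",
    ],
})

# ENA lists several files in one cell, separated by ";".
n_files = runs["fastq_ftp"].str.split(";").apply(
    lambda files: sum(name.endswith(".fastq.gz") for name in files)
)

# Assumption: 2 files expected for PAIRED runs, 1 for SINGLE runs.
expected = runs["library_layout"].map({"PAIRED": 2, "SINGLE": 1})
warning_runs = runs.loc[n_files != expected, "run_accession"].tolist()

# Samples associated with more than one run accession.
runs_per_sample = runs.groupby("sample_alias")["run_accession"].nunique()
multi_run_samples = runs_per_sample[runs_per_sample > 1].index.tolist()

print("warning runs:", warning_runs)
print("samples with several runs:", multi_run_samples)
```

Here RUN_B is flagged because it is PAIRED but lists a single file, and sample s2 is reported because two runs share it. Neither situation occurs in PRJEB10949.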

1.1.4) Metadata Filtering

Next, we are going to apply some filters to the merged_PRJEB10949_ENA_metadata.tsv file using the Filter Metadata program. For this purpose, we will use the PRJEB10949_filterfile_example.tsv test file, which contains the following information:

variable         filter_type  action         NA_treatment  values
read_count       numerical    greater_equal  no            [10000]
scientific_name  categorical  drop           no            ['synthetic metagenome']
miseq_kit        numerical    equal          keep          [3]
replicate        categorical  keep           keep          ['new']
tissue           categorical  drop           keep          ['Brain', 'Heart', 'Muscle']

For further details about the structure of this file and the expected values for the different columns, see the Filter Metadata program documentation.

Command:

filter_metadata -t merged_PRJEB10949_ENA_metadata.tsv -f PRJEB10949_filterfile_example.tsv

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   filter_metadata.py                                       ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────┬────────────────────────────────────┐
│ Argument         │ Value                              │
├──────────────────┼────────────────────────────────────┤
│ metadata_table   │ merged_PRJEB10949_ENA_metadata.tsv │
│ filter_table     │ PRJEB10949_filterfile_example.tsv  │
│ output_directory │                                    │
│ plain_text       │ False                              │
└──────────────────┴────────────────────────────────────┘

Loading Files:

Metadata Table file:
merged_PRJEB10949_ENA_metadata.tsv

Filter Table file:
PRJEB10949_filterfile_example.tsv

Filtering Metadata Table:

Number of rows in Provided Metadata Table: 181

Filter Variable:  read_count
Values:  [10000]
Filter Type:  numerical
Filter Action:  greater_equal
NA Treatment:  no

Filter Variable:  scientific_name
Values:  ['synthetic metagenome']
Filter Type:  categorical
Filter Action:  drop
NA Treatment:  no

Filter Variable:  miseq_kit
Values:  [3]
Filter Type:  numerical
Filter Action:  equal
NA Treatment:  keep

Filter Variable:  replicate
Values:  ['new']
Filter Type:  categorical
Filter Action:  keep
NA Treatment:  keep

Filter Variable:  tissue
Values:  ['Brain', 'Heart', 'Muscle']
Filter Type:  categorical
Filter Action:  drop
NA Treatment:  keep

Number of rows in Filtered Metadata Table: 21

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/PRJEB10949/filtered_merged_PRJEB10949_ENA_metadata.tsv

As we can see, the program has sequentially applied different filters to the Metadata Table. The resulting filtered_merged_PRJEB10949_ENA_metadata.tsv file is the one that will be used in the next workflow steps.
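To make the sequential filtering concrete, here is a minimal pandas sketch that applies three filter rows in order to a toy table. It is not the code of filter_metadata. The function apply_filter is hypothetical, the data are made up, and the NA handling is an assumption: "keep" retains rows with missing values in the filtered column, and any other value drops them.

```python
import pandas as pd


def apply_filter(table, variable, filter_type, action, na_treatment, values):
    """Hypothetical helper: apply one filter row and return the surviving rows."""
    column = table[variable]
    if filter_type == "numerical":
        numeric = pd.to_numeric(column, errors="coerce")
        if action == "greater_equal":
            mask = numeric >= values[0]
        elif action == "equal":
            mask = numeric == values[0]
        else:
            raise ValueError(f"action not covered by this sketch: {action}")
    else:  # categorical: keep or drop the listed values
        is_listed = column.isin(values)
        mask = is_listed if action == "keep" else ~is_listed
    # Assumed NA handling: "keep" retains missing values, anything else drops them.
    if na_treatment == "keep":
        mask = mask | column.isna()
    else:
        mask = mask & column.notna()
    return table[mask]


# Toy table with made-up accessions and values.
table = pd.DataFrame({
    "run_accession": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"],
    "read_count": [12000, 8000, 15000, 20000],
    "scientific_name": ["Mus musculus", "Mus musculus",
                        "synthetic metagenome", "Mus musculus"],
    "miseq_kit": [3, 3, 3, None],
})

filter_rows = [
    ("read_count", "numerical", "greater_equal", "no", [10000]),
    ("scientific_name", "categorical", "drop", "no", ["synthetic metagenome"]),
    ("miseq_kit", "numerical", "equal", "keep", [3]),
]

filtered = table
for row in filter_rows:
    filtered = apply_filter(filtered, *row)  # each filter sees the previous result

print(filtered["run_accession"].tolist())
```

RUN_B is removed by the read_count filter, RUN_C by the scientific_name filter, and RUN_D survives the miseq_kit filter only because its missing value is kept.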

1.1.5) Metadata Check After Filtering

Now, let's check the metadata again after filtering. This time we also pass the extra stats parameter (-e) with a list of variables for which we want additional appearance statistics.

Command:

check_metadata_ENA -t filtered_merged_PRJEB10949_ENA_metadata.tsv -e read_count scientific_name miseq_kit replicate tissue

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   check_metadata_ENA.py                                    ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌─────────────────────┬───────────────────────────────────────────────────────────────────────┐
│ Argument            │ Value                                                                 │
├─────────────────────┼───────────────────────────────────────────────────────────────────────┤
│ metadata_table      │ filtered_merged_PRJEB10949_ENA_metadata.tsv                           │
│ ena_download_column │ fastq_ftp                                                             │
│ fastq_pattern       │ .fastq.gz                                                             │
│ sample_column       │ sample_alias                                                          │
│ extra_columns_stats │ ['read_count', 'scientific_name', 'miseq_kit', 'replicate', 'tissue'] │
│ plain_text          │ False                                                                 │
└─────────────────────┴───────────────────────────────────────────────────────────────────────┘

Loading File:

Metadata Table file:
filtered_merged_PRJEB10949_ENA_metadata.tsv

Runs' Stats:

1) Number of run_accessions: 21

2) Appearances per scientific_name(tax_id):
      unique_values  counts
Mus musculus(10090)      21

3) Appearances per instrument_model(instrument_platform):
           unique_values  counts
Illumina MiSeq(ILLUMINA)      21

4) Appearances per library_layout:
unique_values  counts
       PAIRED      21

5) Appearances per library_strategy+library_source:
       unique_values  counts
AMPLICON+METAGENOMIC      21

Runs' Checks:

1) Check that library_layout and ENA Download Column match:
ENA Download Column selected: fastq_ftp
Fastq File Pattern provided: .fastq.gz
- Total number of runs: 21
- Total number of warning runs: 0

All library_layouts for ENA Download Column match for all run_accessions!
- All PAIRED run_accessions have only 2 associated Fastq files

2) Check if original uploaded Fastqs are available:
- Is submitted_ftp available? True
- Is submitted_aspera available? True
- Is submitted_galaxy available? True

There are original submitted Fastqs columns available!

3) Check if there are duplicated Fastq file names in the original uploaded Fastqs column:
Using the following submitted Fastqs column: submitted_ftp
No duplicates were detected!

Sample Stats:

1) Stats for "sample_accession":

1.1) Number of unique samples for "sample_accession": 21

1.2) Samples per scientific_name(tax_id) in "sample_accession":
             values  counts
Mus musculus(10090)      21

1.3) Samples per instrument_model(instrument_platform) in "sample_accession":
                  values  counts
Illumina MiSeq(ILLUMINA)      21

1.4) Samples per library_layout in "sample_accession":
values  counts
PAIRED      21

1.5) Samples per library_strategy+library_source in "sample_accession":
              values  counts
AMPLICON+METAGENOMIC      21

1.6) Groups of samples in "sample_accession" by number of run_accessions:

Group with  1  run_accession(s) per sample:
- Total number of samples in this group: 21

2) Stats for "sample_alias":

2.1) Number of unique samples for "sample_alias": 21

2.2) Samples per scientific_name(tax_id) in "sample_alias":
             values  counts
Mus musculus(10090)      21

2.3) Samples per instrument_model(instrument_platform) in "sample_alias":
                  values  counts
Illumina MiSeq(ILLUMINA)      21

2.4) Samples per library_layout in "sample_alias":
values  counts
PAIRED      21

2.5) Samples per library_strategy+library_source in "sample_alias":
              values  counts
AMPLICON+METAGENOMIC      21

2.6) Groups of samples in "sample_alias" by number of run_accessions:

Group with  1  run_accession(s) per sample:
- Total number of samples in this group: 21

Sample Checks:

1) Does the number of run_accessions equal the number of samples?
- Number of runs equals samples in "sample_accessions": True
- Number of runs equals samples in "sample_alias": True

2) Does the number of samples equal each other between "sample_accession" and "sample_alias"?
- Number of samples in "sample_alias" equals samples in "sample_accession": True

2.1) Check samples for "sample_accession":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_accession" have only one match with "sample_alias"!

2.2) Check samples for "sample_alias":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_alias" have only one match with "sample_accession"!

3) Is there more than one library_strategy+library_source per sample?

3.1) Check samples for "sample_accession":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_accession" have only one match per "library_strategy+library_source"!

3.2) Check samples for "sample_alias":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_alias" have only one match per "library_strategy+library_source"!

4) Is there more than one scientific_name(tax_id) per sample?

4.1) Check samples for "sample_accession":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_accession" have only one match per "scientific_name(tax_id)"!

4.2) Check samples for "sample_alias":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_alias" have only one match per "scientific_name(tax_id)"!

5) Is there more than one instrument_model(instrument_platform) per sample?

5.1) Check samples for "sample_accession":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_accession" have only one match per "instrument_model(instrument_platform)"!

5.2) Check samples for "sample_alias":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_alias" have only one match per "instrument_model(instrument_platform)"!

6) Is there more than one library_layout per sample?

6.1) Check samples for "sample_accession":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_accession" have only one type of library_layout!

6.2) Check samples for "sample_alias":
- Total number of samples: 21
- Total number of warning samples: 0
All samples in "sample_alias" have only one type of library_layout!

Extra Stats:

- Extra stats for "read_count":

Skipping extra column!
This extra column has already been deeply explored or does not make sense to get stats!

- Extra stats for "scientific_name":

1) Run_accessions per "scientific_name":
      values  counts
Mus musculus      21

2) Samples per "scientific_name" in "sample_accession":
      values  counts
Mus musculus      21

3) Samples per "scientific_name" in "sample_alias":
      values  counts
Mus musculus      21

- Extra stats for "miseq_kit":

1) Run_accessions per "miseq_kit":
 values  counts
    3.0      15
    NaN       6

2) Samples per "miseq_kit" in "sample_accession":
 values  counts
    3.0      15
    NaN       6

3) Samples per "miseq_kit" in "sample_alias":
 values  counts
    3.0      15
    NaN       6

- Extra stats for "replicate":

1) Run_accessions per "replicate":
values  counts
   new      15
   NaN       6

2) Samples per "replicate" in "sample_accession":
values  counts
   new      15
   NaN       6

3) Samples per "replicate" in "sample_alias":
values  counts
   new      15
   NaN       6

- Extra stats for "tissue":

1) Run_accessions per "tissue":
        values  counts
         Ileum      15
         Liver       3
Adipose tissue       3

2) Samples per "tissue" in "sample_accession":
        values  counts
         Ileum      15
Adipose tissue       3
         Liver       3

3) Samples per "tissue" in "sample_alias":
        values  counts
         Ileum      15
Adipose tissue       3
         Liver       3

After applying the previous filters, we see that: 1) There are a total of 21 surviving runs/samples; 2) There are 21 Mus musculus(10090) runs/samples.

We also used the extra stats parameter to get appearances statistics for the selected variables:

  • Variable read_count. In this particular case, the program skips the appearance statistics because it treats the ENA read_count column as a numerical variable. If we open the filtered_merged_PRJEB10949_ENA_metadata.tsv file, we can see that all surviving run accessions have at least 10,000 reads.

  • Variable scientific_name. No 'synthetic metagenome' entries remain in the scientific_name column.

  • Variable miseq_kit. We can see that only run accessions with NA values or with a value equal to 3 were kept in the miseq_kit column.

  • Variable replicate. We can see that only run accessions with NA values or that were stated as 'new' were kept in the replicate column.

  • Variable tissue. There are no run accessions related to 'Brain', 'Heart' or 'Muscle' in the tissue column.

Thus, all filters have been applied successfully, and we can continue to the next workflow steps. At this point, we could also use the Check Metadata Values program to check that our metadata table is within our allowed parameters, as we will see at the end of the tutorial.
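Appearance counts like those in the Extra Stats section, including the NaN rows, can be reproduced with pandas. This is a minimal sketch on a toy table with made-up values; the helper name appearances is hypothetical and the real program may compute its tables differently.

```python
import pandas as pd

# Toy filtered table; accessions and values are made up.
filtered = pd.DataFrame({
    "run_accession": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"],
    "miseq_kit": [3.0, 3.0, None, 3.0],
    "tissue": ["Ileum", "Liver", "Ileum", "Ileum"],
})


def appearances(table, column):
    """Hypothetical helper: count each value of a column, missing values included."""
    counts = table[column].value_counts(dropna=False)
    return counts.rename_axis("values").reset_index(name="counts")


kit_counts = appearances(filtered, "miseq_kit")
print(kit_counts.to_string(index=False))

# A dropped category must not appear among the surviving rows.
dropped_tissues = ["Brain", "Heart", "Muscle"]
assert not filtered["tissue"].isin(dropped_tissues).any()
```

Passing dropna=False is what makes the missing values show up as their own row, which is how the NaN entries for miseq_kit and replicate become visible.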

1.2) Collection and Checking of Fastq Files

1.2.1) Download Fastqs from ENA

Now, we download the associated fastq files available from ENA using the Download Fastqs program in ENA mode. We will first create a directory called downloads that will be used as the Output Directory.

Commands:

##Create downloads directory
mkdir downloads

##Download fastqs
download_fastqs -i filtered_merged_PRJEB10949_ENA_metadata.tsv -o downloads

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   download_fastqs.py                                       ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌─────────────────────┬─────────────────────────────────────────────┐
│ Argument            │ Value                                       │
├─────────────────────┼─────────────────────────────────────────────┤
│ input_file          │ filtered_merged_PRJEB10949_ENA_metadata.tsv │
│ mode                │ ENA                                         │
│ ena_download_column │ fastq_ftp                                   │
│ max_conn            │ 5                                           │
│ parfive_verbose     │ False                                       │
│ output_directory    │ downloads                                   │
│ plain_text          │ False                                       │
└─────────────────────┴─────────────────────────────────────────────┘

Loading Input File:

Metadata Table file:
filtered_merged_PRJEB10949_ENA_metadata.tsv

Main Information:

1) Number of run_accessions: 21

2) Number of unique sample_accessions: 21

3) Number of unique sample_alias: 21

4) Appearances per library_layout:
unique_values  counts
       PAIRED      21

5) Number of URLs to download: 42

Downloading Files (parfive):

Downloader version:
parfive
2.0.2

Resulting files saved in:
downloads

As we can see, the program has downloaded the available fastq files of interest from ENA using the parfive package (https://pypi.org/project/parfive/). A total of 42 fastq files (21 pairs, one per PAIRED run) have been downloaded using the filtered_merged_PRJEB10949_ENA_metadata.tsv file as reference. The resulting fastq files have been saved in the downloads Output Directory.
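The figure of 42 URLs follows from the 21 PAIRED runs, each contributing two files. As a rough sketch of that arithmetic, assuming the fastq_ftp column stores the URLs of a run separated by semicolons, the expected number of downloads can be counted as follows. The rows, the example.org paths and the `urls_from_fastq_ftp` helper are hypothetical and only illustrate the idea.

```python
def urls_from_fastq_ftp(cell):
    """Split one fastq_ftp cell into its individual URLs (assumed ';'-separated)."""
    return [url.strip() for url in cell.split(";") if url.strip()]

# Hypothetical rows; real ENA paths look different.
rows = [
    {"run_accession": "RUN_A",
     "fastq_ftp": "ftp.example.org/RUN_A_1.fastq.gz;ftp.example.org/RUN_A_2.fastq.gz"},
    {"run_accession": "RUN_B",
     "fastq_ftp": "ftp.example.org/RUN_B_1.fastq.gz;ftp.example.org/RUN_B_2.fastq.gz"},
]

all_urls = [url for row in rows for url in urls_from_fastq_ftp(row["fastq_ftp"])]
print(len(all_urls))  # two URLs per paired run
```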

1.2.2) Checking of Downloaded Fastq Files

Now, we are going to check the integrity of the downloaded fastq files using the Check Fastqs program in ENA mode.

Command:

check_fastqs -t filtered_merged_PRJEB10949_ENA_metadata.tsv -d downloads --md5_check

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   check_fastqs.py                                          ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────────────┬─────────────────────────────────────────────┐
│ Argument                 │ Value                                       │
├──────────────────────────┼─────────────────────────────────────────────┤
│ metadata_table           │ filtered_merged_PRJEB10949_ENA_metadata.tsv │
│ fastqs_directory         │ downloads                                   │
│ manifest_table           │                                             │
│ mode                     │ ENA                                         │
│ ena_download_column      │ fastq_ftp                                   │
│ generic_common_column_mt │                                             │
│ fastq_pattern            │ .fastq.gz                                   │
│ md5_check                │ True                                        │
│ plain_text               │ False                                       │
└──────────────────────────┴─────────────────────────────────────────────┘

Loading Files:

Metadata Table file:
filtered_merged_PRJEB10949_ENA_metadata.tsv

Main Information:

1) Number of run_accessions: 21

2) Number of unique sample_accessions: 21

3) Number of unique sample_alias: 21

4) Appearances per library_layout:
unique_values  counts
       PAIRED      21

5) Number of URLs expected to be downloaded: 42

6) Number of Fastq files in the provided Fastqs Directory: 42

Fastqs' Checks:

1) Check that expected files from the Metadata Table exist in the provided directory:
All the expected Fastq files from the Metadata Table are in the provided directory!

2) Check if there are Fastq files of the provided directory absent in the Metadata Table:
There are no extra Fastq files in the provided directory!

3) Check if there are Fastq files of the provided directory with multiple matches in the Metadata Table:
All Fastq files in the provided directory have a unique run_accession match with the Metadata Table!

4) Check that expected MD5s from the Metadata Table match calculated MD5s:

This may take a while...
  o Total number of Fastqs detected in Metadata Table: 42
  o Total number of warnings for MD5s: 0

All MD5s present in the Metadata Table match the MD5s calculated for their corresponding Fastq files!

Let's explore the fastqs checking results:

  • Main Information. The following relevant statistics have been calculated: 1) There are a total of 21 run accessions; 2) There are a total of 21 unique sample accessions; 3) There are a total of 21 unique sample aliases; 4) All runs have a PAIRED library layout; 5) A total of 42 URLs were expected to be downloaded; 6) There are 42 fastq files in the provided fastqs directory.

  • Fastqs' Checks. The following relevant checks for the downloaded fastq files have been performed: 1) All the expected fastq files from the Metadata Table are in the provided fastqs directory; 2) There are no extra fastq files in the provided directory; 3) All fastq files in the provided directory have a unique run accession match with the Metadata Table; 4) All MD5s present in the Metadata Table match the MD5s calculated for their corresponding fastq files (optional check).

In this particular case, the checks raised no warnings that would require further curation. We have nevertheless confirmed that the downloaded files are complete and consistent with the metadata, so we can proceed with the next workflow steps.
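For readers who want to verify a single file by hand, the optional MD5 check can be approximated with the Python standard library. This is a minimal sketch, not the program's implementation: the `md5_of_file` helper and the demo file are hypothetical, and in practice the expected value would come from the metadata table (for ENA tables this is assumed to be the fastq_md5 column).

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small temporary file; MD5 works on raw bytes, so the content is arbitrary.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_file.bin")
    payload = b"@read1\nACGT\n+\nIIII\n"
    with open(path, "wb") as handle:
        handle.write(payload)
    expected_md5 = hashlib.md5(payload).hexdigest()  # stands in for the metadata value
    md5_matches = md5_of_file(path) == expected_md5
    md5_mismatch_detected = md5_of_file(path) != hashlib.md5(b"other content").hexdigest()

print(md5_matches, md5_mismatch_detected)
```

Reading in chunks matters for fastq files, which are often too large to load into memory at once.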

1.3) Further Treatment of Metadata and Fastq Files

This particular dataset contains technical replicates for several samples. We will take advantage of this to treat the fastq files (copy, rename, and merge) and combine the associated metadata.

1.3.1) Make Raw Treatment Template

First, we will generate a raw treatment template using the Make Treatment Template program in ENA mode.

Command:

make_treatment_template -i filtered_merged_PRJEB10949_ENA_metadata.tsv -d downloads --extra_sample_columns sample_column

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   make_treatment_template.py                               ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────────┬─────────────────────────────────────────────┐
│ Argument             │ Value                                       │
├──────────────────────┼─────────────────────────────────────────────┤
│ input_file           │ filtered_merged_PRJEB10949_ENA_metadata.tsv │
│ fastqs_directory     │ downloads                                   │
│ mode                 │ ENA                                         │
│ ena_download_column  │ fastq_ftp                                   │
│ fastq_pattern        │ .fastq.gz                                   │
│ r1_pattern           │ _1.fastq.gz                                 │
│ r2_pattern           │ _2.fastq.gz                                 │
│ extra_sample_columns │ ['sample_column']                           │
│ output_directory     │                                             │
│ plain_text           │ False                                       │
└──────────────────────┴─────────────────────────────────────────────┘

Loading File:

Metadata Table file:
filtered_merged_PRJEB10949_ENA_metadata.tsv

Main Information:

1) Number of run_accessions: 21

2) Number of unique sample_accessions: 21

3) Number of unique sample_alias: 21

4) Appearances per library_layout:
unique_values  counts
       PAIRED      21

5) Number of URLs expected to be downloaded: 42

6) Number of Fastq files in the provided Fastqs Directory: 42

Creating Raw Treatment Template:

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/PRJEB10949/raw_treatment_template_filtered_merged_PRJEB10949_ENA_metadata.tsv

The resulting raw_treatment_template_filtered_merged_PRJEB10949_ENA_metadata.tsv file will be further curated so that it can be used by the next programs of the workflow, Treat Metadata and Treat Fastqs programs. The three programs together can be used for the extra treatment of fastq files and associated metadata.

After examining the candidate columns provided by the program, we selected the column named "sample_column" as the sample_name column. With respect to the treatment values, we will apply the copy mode to the surviving Liver samples (Liver-1, Liver-2, Liver-3), the rename mode to the surviving MAT samples (MAT-1, MAT-2, MAT-3), and the merge mode to the surviving Ileum samples (Ileum-1, Ileum-2, Ileum-3, Ileum-4, Ileum-5, Ileum-6). This yields the final treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv test file, which will be used for the remaining treatment steps.

1.3.2) Treat Metadata

Now, we will use the Treat Metadata program in ENA mode to combine and treat the metadata based on the treatment information provided by the final treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv file. After examining the filtered_merged_PRJEB10949_ENA_metadata.tsv file, we will indicate the following metadata columns (Run, Sample, run_accessions, run_label) as Extra No Warning Columns.

Command:

treat_metadata -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -m filtered_merged_PRJEB10949_ENA_metadata.tsv --extra_no_warning_columns Run Sample run_accessions run_label

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   treat_metadata.py                                        ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌────────────────────────────┬────────────────────────────────────────────────────────────────────┐
│ Argument                   │ Value                                                              │
├────────────────────────────┼────────────────────────────────────────────────────────────────────┤
│ metadata_table             │ filtered_merged_PRJEB10949_ENA_metadata.tsv                        │
│ treatment_template         │ treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv │
│ mode                       │ ENA                                                                │
│ ena_download_column        │ fastq_ftp                                                          │
│ generic_common_column_mt   │                                                                    │
│ generic_common_column_tt   │                                                                    │
│ extra_no_warning_columns   │ ['Run', 'Sample', 'run_accessions', 'run_label']                   │
│ sample_name_sep            │ _                                                                  │
│ sample_name_sep_appearance │ 1                                                                  │
│ fastq_pattern              │ .fastq.gz                                                          │
│ r1_pattern                 │ _1.fastq.gz                                                        │
│ r2_pattern                 │ _2.fastq.gz                                                        │
│ output_directory           │                                                                    │
│ plain_text                 │ False                                                              │
└────────────────────────────┴────────────────────────────────────────────────────────────────────┘

Loading Files:

Treatment Template file:
treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv

Metadata Table file:
filtered_merged_PRJEB10949_ENA_metadata.tsv

Pre-treatment Information:

  o Number of Samples in Treatment Template: 12
  o Number of Rows in the Original Metadata Table: 21

Treat Metadata:

Treating metadata ...

Post-treatment Information:

  o Number of Rows in the Treated Metadata Table: 12

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/PRJEB10949/treated_filtered_merged_PRJEB10949_ENA_metadata.tsv

Initially, the filtered_merged_PRJEB10949_ENA_metadata.tsv file contained 21 run accessions. After treatment, the resulting treated_filtered_merged_PRJEB10949_ENA_metadata.tsv file contains one row for each of the 12 samples. When combining information for a final sample name, if multiple different values are found, they are separated by a semicolon (;), and the program generates a warning report that should be used to check for possible metadata inconsistencies. In this particular case, no warnings were raised. For further details about the structure of this file, see the Treat Metadata program documentation.
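The combination rule described above (different values joined by a semicolon, with a warning unless the column is exempt) can be sketched in a few lines. This is only an illustration under stated assumptions: the `combine_rows` helper, the column names and the rows are hypothetical, and the real program's report format is not reproduced here.

```python
def combine_rows(rows, sample_column, no_warning_columns=()):
    """Collapse rows sharing a sample name; join differing values with ';'."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[sample_column], []).append(row)

    treated, warnings = [], []
    for sample, members in grouped.items():
        combined = {sample_column: sample}
        for column in members[0]:
            if column == sample_column:
                continue
            values = []
            for member in members:
                if member[column] not in values:  # keep first-seen order, drop repeats
                    values.append(member[column])
            combined[column] = ";".join(values)
            if len(values) > 1 and column not in no_warning_columns:
                warnings.append((sample, column, values))
        treated.append(combined)
    return treated, warnings

# Hypothetical rows: two runs of one merged sample plus one single-run sample.
rows = [
    {"sample_name": "Ileum-1", "run_accession": "RUN_A", "tissue": "Ileum"},
    {"sample_name": "Ileum-1", "run_accession": "RUN_B", "tissue": "Ileum"},
    {"sample_name": "Liver-1", "run_accession": "RUN_C", "tissue": "Liver"},
]
treated, warnings = combine_rows(rows, "sample_name", no_warning_columns=("run_accession",))
_, strict_warnings = combine_rows(rows, "sample_name")
print(treated[0]["run_accession"], warnings, strict_warnings)
```

Columns that are expected to differ between technical replicates, such as run identifiers, are the natural candidates for the no-warning list, which is why they were passed as Extra No Warning Columns in the command above.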

1.3.3) Treat Fastqs

Finally, we will use the Treat Fastqs program to perform the different treatment operations on the downloaded fastq files based on the treatment information in the final treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv test file. We will first create a directory called treated_files that will be used as the Output Directory.

Commands:

##Create treated_files directory
mkdir treated_files

##Treat fastq files
treat_fastqs -t treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv -i downloads -o treated_files

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   treat_fastqs.py                                          ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌────────────────────┬────────────────────────────────────────────────────────────────────┐
│ Argument           │ Value                                                              │
├────────────────────┼────────────────────────────────────────────────────────────────────┤
│ treatment_template │ treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv │
│ input_directory    │ downloads                                                          │
│ output_directory   │ treated_files                                                      │
│ fastq_pattern      │ .fastq.gz                                                          │
│ r1_pattern         │ _1.fastq.gz                                                        │
│ r2_pattern         │ _2.fastq.gz                                                        │
│ plain_text         │ False                                                              │
└────────────────────┴────────────────────────────────────────────────────────────────────┘

Loading Files:

Treatment Template file:
treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv

Pre-treatment Information:

  o Number of Samples in Treatment Template: 12
  o Number of Fastq files in Treatment Template: 42

Treat Fastqs:

Sample:  Ileum-1
Treatment: merge
Number of Fastq files: 6
Configuration: Number of pair1(s) = 3; Number of pair2(s) = 3; Number of single(s) = 0
Paired Merging:
['ERR1049857_1.fastq.gz', 'ERR1049913_1.fastq.gz', 'ERR1049924_1.fastq.gz']  >  Ileum-1_1.fastq.gz
['ERR1049857_2.fastq.gz', 'ERR1049913_2.fastq.gz', 'ERR1049924_2.fastq.gz']  >  Ileum-1_2.fastq.gz

Sample:  Ileum-2
Treatment: merge
Number of Fastq files: 6
Configuration: Number of pair1(s) = 3; Number of pair2(s) = 3; Number of single(s) = 0
Paired Merging:
['ERR1049859_1.fastq.gz', 'ERR1049915_1.fastq.gz', 'ERR1049925_1.fastq.gz']  >  Ileum-2_1.fastq.gz
['ERR1049859_2.fastq.gz', 'ERR1049915_2.fastq.gz', 'ERR1049925_2.fastq.gz']  >  Ileum-2_2.fastq.gz

Sample:  Ileum-3
Treatment: merge
Number of Fastq files: 4
Configuration: Number of pair1(s) = 2; Number of pair2(s) = 2; Number of single(s) = 0
Paired Merging:
['ERR1049911_1.fastq.gz', 'ERR1049923_1.fastq.gz']  >  Ileum-3_1.fastq.gz
['ERR1049911_2.fastq.gz', 'ERR1049923_2.fastq.gz']  >  Ileum-3_2.fastq.gz

Sample:  Ileum-4
Treatment: merge
Number of Fastq files: 4
Configuration: Number of pair1(s) = 2; Number of pair2(s) = 2; Number of single(s) = 0
Paired Merging:
['ERR1049917_1.fastq.gz', 'ERR1049926_1.fastq.gz']  >  Ileum-4_1.fastq.gz
['ERR1049917_2.fastq.gz', 'ERR1049926_2.fastq.gz']  >  Ileum-4_2.fastq.gz

Sample:  Ileum-5
Treatment: merge
Number of Fastq files: 6
Configuration: Number of pair1(s) = 3; Number of pair2(s) = 3; Number of single(s) = 0
Paired Merging:
['ERR1049861_1.fastq.gz', 'ERR1049919_1.fastq.gz', 'ERR1049927_1.fastq.gz']  >  Ileum-5_1.fastq.gz
['ERR1049861_2.fastq.gz', 'ERR1049919_2.fastq.gz', 'ERR1049927_2.fastq.gz']  >  Ileum-5_2.fastq.gz

Sample:  Ileum-6
Treatment: merge
Number of Fastq files: 4
Configuration: Number of pair1(s) = 2; Number of pair2(s) = 2; Number of single(s) = 0
Paired Merging:
['ERR1049921_1.fastq.gz', 'ERR1049928_1.fastq.gz']  >  Ileum-6_1.fastq.gz
['ERR1049921_2.fastq.gz', 'ERR1049928_2.fastq.gz']  >  Ileum-6_2.fastq.gz

Sample:  Liver-1
Treatment: copy
Number of Fastq files: 2
Configuration: Number of pair1(s) = 1; Number of pair2(s) = 1; Number of single(s) = 0
Copying file(s):
['ERR1049851_1.fastq.gz']  >  ERR1049851_1.fastq.gz
['ERR1049851_2.fastq.gz']  >  ERR1049851_2.fastq.gz

Sample:  Liver-2
Treatment: copy
Number of Fastq files: 2
Configuration: Number of pair1(s) = 1; Number of pair2(s) = 1; Number of single(s) = 0
Copying file(s):
['ERR1049852_1.fastq.gz']  >  ERR1049852_1.fastq.gz
['ERR1049852_2.fastq.gz']  >  ERR1049852_2.fastq.gz

Sample:  Liver-3
Treatment: copy
Number of Fastq files: 2
Configuration: Number of pair1(s) = 1; Number of pair2(s) = 1; Number of single(s) = 0
Copying file(s):
['ERR1049853_1.fastq.gz']  >  ERR1049853_1.fastq.gz
['ERR1049853_2.fastq.gz']  >  ERR1049853_2.fastq.gz

Sample:  MAT-1
Treatment: rename
Number of Fastq files: 2
Configuration: Number of pair1(s) = 1; Number of pair2(s) = 1; Number of single(s) = 0
Renaming file(s):
['ERR1049854_1.fastq.gz']  >  MAT-1_1.fastq.gz
['ERR1049854_2.fastq.gz']  >  MAT-1_2.fastq.gz

Sample:  MAT-2
Treatment: rename
Number of Fastq files: 2
Configuration: Number of pair1(s) = 1; Number of pair2(s) = 1; Number of single(s) = 0
Renaming file(s):
['ERR1049855_1.fastq.gz']  >  MAT-2_1.fastq.gz
['ERR1049855_2.fastq.gz']  >  MAT-2_2.fastq.gz

Sample:  MAT-3
Treatment: rename
Number of Fastq files: 2
Configuration: Number of pair1(s) = 1; Number of pair2(s) = 1; Number of single(s) = 0
Renaming file(s):
['ERR1049856_1.fastq.gz']  >  MAT-3_1.fastq.gz
['ERR1049856_2.fastq.gz']  >  MAT-3_2.fastq.gz

Post-treatment Information:

  o Number of Fastq files in Output Directory: 24

Resulting files saved in:
treated_files

As we can see, the program has treated the corresponding fastq files. A total of 42 fastq files (21 pairs) have been treated, generating a final set of 24 fastq files (12 pairs: 6 merged Ileum samples, 3 copied Liver samples, and 3 renamed MAT samples). The resulting fastq files, saved in the treated_files Output Directory, are the final fastq files for this dataset.
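To give an intuition for the merge treatment, the sketch below concatenates gzip-compressed files byte by byte, which yields a valid multi-member gzip file containing the records of all inputs in order. This is one common way to merge fastq.gz files and is shown only as an assumption; it is not necessarily how Treat Fastqs does it internally. The `merge_gzip_files` helper, the file names and the records are hypothetical.

```python
import gzip
import os
import shutil
import tempfile

def merge_gzip_files(inputs, output):
    """Concatenate gzip files in order; the result is a multi-member gzip file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

with tempfile.TemporaryDirectory() as tmp:
    parts = []
    demo_records = [
        ("runA_1.fastq.gz", b"@r1\nACGT\n+\nIIII\n"),
        ("runB_1.fastq.gz", b"@r2\nTTGA\n+\nIIII\n"),
    ]
    for name, record in demo_records:
        path = os.path.join(tmp, name)
        with gzip.open(path, "wb") as handle:
            handle.write(record)
        parts.append(path)

    merged_path = os.path.join(tmp, "sample_1.fastq.gz")
    merge_gzip_files(parts, merged_path)

    # Reading the merged file back returns the records of both inputs in order.
    with gzip.open(merged_path, "rb") as handle:
        merged_content = handle.read()

print(merged_content.decode())
```

For paired data, the R1 files and the R2 files must be merged separately and in the same run order, as in the Paired Merging lines of the output above, so that read pairs stay aligned.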

2) External Dataset Curation Workflow

We will use the GSA project PRJCA001214 as an example of how to use the package to curate an External Dataset.

2.1) Collection and Initial Processing of Metadata

2.1.1) Initial Metadata Collection

In this case, we will need to manually download the metadata from the GSA project PRJCA001214 (https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA001214), specifically for the 16S amplicon sequencing data (https://ngdc.cncb.ac.cn/gsa/browse/CRA001372). Within this subproject CRA001372, there is a Metadata CRA001372.xlsx file. This file contains 3 sheets (Sample, Experiment and Run). The first step is to generate 3 tables, one per sheet, and clean them so that they are ready to work with. Likewise, we can look for additional metadata in the original publication (https://doi.org/10.1038/s41587-019-0104-4). In this case, we will use the CRA001372_publication_metadata_example.tsv test file as the publication metadata.

2.1.2) Metadata Merging

First, we will merge the 3 table sheets. The resulting file should be equivalent to the CRA001372_main_metadata_example.tsv test file.

Commands:

##Leave the PRJEB10949 directory
cd ..

##Enter the CRA001372 directory
cd CRA001372

##Merge Run and Experiment sheets (as main and extra metadata table, respectively)
merge_metadata -m CRA001372_run_clean.tsv -mc 'Experiment accession' -e CRA001372_experiment_clean.tsv -ec Accession -ms _run -es _experiment

##Merge Previous Merged Table and Sample sheet (as main and extra metadata table, respectively)
merge_metadata -m merged_CRA001372_run_clean.tsv -mc 'BioSample accession' -e CRA001372_sample_clean.tsv -ec Accession -es _sample

Finally, we will merge this file with the CRA001372_publication_metadata_example.tsv test file:

Commands:

##Merge Main and Publication Example Tables (as main and extra metadata table, respectively)
merge_metadata -m CRA001372_main_metadata_example.tsv -mc Sample_name -e CRA001372_publication_metadata_example.tsv -ec sample_id -es _publication

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   merge_metadata.py                                        ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────────┬────────────────────────────────────────────┐
│ Argument             │ Value                                      │
├──────────────────────┼────────────────────────────────────────────┤
│ main_metadata_table  │ CRA001372_main_metadata_example.tsv        │
│ extra_metadata_table │ CRA001372_publication_metadata_example.tsv │
│ main_merge_column    │ Sample_name                                │
│ extra_merge_column   │ sample_id                                  │
│ pandas_merge_mode    │ left                                       │
│ main_merge_suffix    │ _x                                         │
│ extra_merge_suffix   │ _publication                               │
│ output_directory     │                                            │
│ plain_text           │ False                                      │
└──────────────────────┴────────────────────────────────────────────┘

Loading Files:

Main Metadata Table file:
CRA001372_main_metadata_example.tsv

Extra Metadata Table file:
CRA001372_publication_metadata_example.tsv

Creating Merged Metadata Table:

Combining tables:
Pandas Merge Mode: left
  o Left Table (x): Main Metadata Table
  o Right Table (y): Extra Metadata Table

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/CRA001372/merged_CRA001372_main_metadata_example.tsv

Merge Columns Intersection Checks:

Check merge columns' unique values:
Main Metadata Merge Column selected: Sample_name
Extra Metadata Merge Column selected: sample_id

Unique values:
  o Main Metadata Merge Column (total unique values): 828
  o Extra Metadata Merge Column (total unique values): 828

All unique values are common between the merge columns provided!

As we can see, the program has combined the CRA001372_main_metadata_example.tsv test file (Main Metadata Table) with the CRA001372_publication_metadata_example.tsv test file (Extra Metadata Table, here the publication metadata). A left join has been carried out, taking the Main Metadata Table as reference against the Extra Metadata Table. After merging, the program also performed an intersection analysis of the two provided merge columns. In this particular case, all 828 unique values are common to both merge columns. This feature is useful for detecting orphan merge values between metadata tables. The resulting merged_CRA001372_main_metadata_example.tsv file is the one that will be used in the next workflow steps.
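The left join and the intersection analysis can be illustrated with a small standard-library sketch. It is a simplification under stated assumptions: merge keys are unique in the extra table, the two tables share no column names (so no suffixes are needed), and unmatched rows get empty cells. The `merge_column_report` and `left_join` helpers, the `depth` column and the rows are hypothetical; the program itself reports using pandas for the merge.

```python
def merge_column_report(main_rows, extra_rows, main_key, extra_key):
    """Compare the unique values of both merge columns and list orphan values."""
    main_values = {row[main_key] for row in main_rows}
    extra_values = {row[extra_key] for row in extra_rows}
    return {
        "main_unique": len(main_values),
        "extra_unique": len(extra_values),
        "only_in_main": sorted(main_values - extra_values),
        "only_in_extra": sorted(extra_values - main_values),
    }

def left_join(main_rows, extra_rows, main_key, extra_key):
    """Keep every main row; attach the matching extra row or empty cells."""
    extra_columns = list(extra_rows[0]) if extra_rows else []
    index = {row[extra_key]: row for row in extra_rows}
    merged = []
    for row in main_rows:
        match = index.get(row[main_key])
        joined = dict(row)
        for column in extra_columns:
            joined[column] = match[column] if match else ""
        merged.append(joined)
    return merged

# Hypothetical tables with one orphan value on each side.
main_rows = [{"Sample_name": "S1"}, {"Sample_name": "S2"}, {"Sample_name": "S3"}]
extra_rows = [
    {"sample_id": "S1", "depth": "10"},
    {"sample_id": "S2", "depth": "20"},
    {"sample_id": "S4", "depth": "40"},
]
report = merge_column_report(main_rows, extra_rows, "Sample_name", "sample_id")
merged = left_join(main_rows, extra_rows, "Sample_name", "sample_id")
print(report)
print(merged)
```

In a left join, orphan values in the main table survive with empty extra columns, while orphan values in the extra table are silently dropped, which is why the intersection report is worth reading before continuing.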

2.1.3) Metadata Filtering

Next, we are going to apply some filters to the merged_CRA001372_main_metadata_example.tsv file using the Filter Metadata program. For this purpose, we will use the CRA001372_filterfile_example.tsv test file, which contains the following information:

variable  filter_type  action  NA_treatment  values
tissue    categorical  keep    no            ['Soil']

For further details about the structure of this file and the expected values for the different columns, see the Filter Metadata program documentation.

Command:

filter_metadata -t merged_CRA001372_main_metadata_example.tsv -f CRA001372_filterfile_example.tsv

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   filter_metadata.py                                       ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────┬────────────────────────────────────────────┐
│ Argument         │ Value                                      │
├──────────────────┼────────────────────────────────────────────┤
│ metadata_table   │ merged_CRA001372_main_metadata_example.tsv │
│ filter_table     │ CRA001372_filterfile_example.tsv           │
│ output_directory │                                            │
│ plain_text       │ False                                      │
└──────────────────┴────────────────────────────────────────────┘

Loading Files:

Metadata Table file:
merged_CRA001372_main_metadata_example.tsv

Filter Table file:
CRA001372_filterfile_example.tsv

Filtering Metadata Table:

Number of rows in Provided Metadata Table: 828

Filter Variable:  tissue
Values:  ['Soil']
Filter Type:  categorical
Filter Action:  keep
NA Treatment:  no

Number of rows in Filtered Metadata Table: 24

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/CRA001372/filtered_merged_CRA001372_main_metadata_example.tsv

As we can see, the program has applied the single filter defined in the Filter Table to the Metadata Table. Of the 828 original rows, 24 soil samples survive. The resulting filtered_merged_CRA001372_main_metadata_example.tsv file is the one that will be used in the next workflow steps.
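The effect of this filter row can be mimicked with a short sketch. It assumes that the values cell is a Python-style list literal and that an NA_treatment of "no" means missing values get no special handling, so they are compared like any other value; both points are assumptions, and the Filter Metadata documentation remains the reference. The `apply_categorical_filter` helper and the rows are hypothetical.

```python
import ast

# The filter row from the Filter Table above, as a dictionary.
filter_row = {
    "variable": "tissue",
    "filter_type": "categorical",
    "action": "keep",
    "NA_treatment": "no",
    "values": "['Soil']",
}

def apply_categorical_filter(rows, filter_row):
    """Keep (or remove) rows whose value is listed; NA handling is NOT modelled."""
    listed = set(ast.literal_eval(filter_row["values"]))
    keep = filter_row["action"] == "keep"
    variable = filter_row["variable"]
    return [row for row in rows if (row[variable] in listed) == keep]

# Hypothetical rows; the empty cell stands for a missing value.
rows = [
    {"Sample_name": "S1", "tissue": "Soil"},
    {"Sample_name": "S2", "tissue": "Root"},
    {"Sample_name": "S3", "tissue": "Soil"},
    {"Sample_name": "S4", "tissue": ""},
]
filtered = apply_categorical_filter(rows, filter_row)
print([row["Sample_name"] for row in filtered])
```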

2.2) Collection and Checking of Fastq Files

2.2.1) Download Fastqs

Now, we download the associated fastq files available from GSA using the Download Fastqs program in LINKS mode. We first need to generate a TXT file with the URLs from the "DownLoad Read file1" column of the filtered_merged_CRA001372_main_metadata_example.tsv file. This has already been done for you in the test file filtered_CRA001372_URLS_example.txt. We will also create a directory called downloads that will be used as the Output Directory.
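If you need to build such a TXT file for your own dataset, the URL column can be extracted with the standard csv module. The sketch below is an illustration only: the `extract_urls` helper, the in-memory table and the example.org URLs are hypothetical, and the one-URL-per-line layout of the TXT file is an assumption that should be checked against the Download Fastqs documentation.

```python
import csv
import io

def extract_urls(tsv_handle, column):
    """Return the non-empty values of `column` from a tab-separated table."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row[column].strip() for row in reader if row[column] and row[column].strip()]

# Hypothetical table; in practice, open the filtered metadata TSV file instead.
demo_tsv = io.StringIO(
    "Run accession\tDownLoad Read file1\n"
    "RUN_A\tftp://example.org/RUN_A.fastq.gz\n"
    "RUN_B\tftp://example.org/RUN_B.fastq.gz\n"
    "RUN_C\t\n"
)
urls = extract_urls(demo_tsv, "DownLoad Read file1")

# One URL per line (assumed format); write this text to a .txt file to use it.
links_txt = "\n".join(urls) + "\n"
print(links_txt, end="")
```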

Commands:

##Create downloads directory
mkdir downloads

##Download fastqs
download_fastqs -m LINKS -i filtered_CRA001372_URLS_example.txt -o downloads

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   download_fastqs.py                                       ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌─────────────────────┬─────────────────────────────────────┐
│ Argument            │ Value                               │
├─────────────────────┼─────────────────────────────────────┤
│ input_file          │ filtered_CRA001372_URLS_example.txt │
│ mode                │ LINKS                               │
│ ena_download_column │                                     │
│ max_conn            │ 5                                   │
│ parfive_verbose     │ False                               │
│ output_directory    │ downloads                           │
│ plain_text          │ False                               │
└─────────────────────┴─────────────────────────────────────┘

Loading Input File:

Links TXT file:
filtered_CRA001372_URLS_example.txt

Main Information:

Number of URLs to download: 24

Downloading Files (parfive):

Downloader version:
parfive
2.0.2

Resulting files saved in:
downloads

As we can see, the program has downloaded the available fastq files of interest from GSA using the parfive package (https://pypi.org/project/parfive/). A total of 24 fastq files (24 SINGLE fastq files) have been downloaded, using the filtered_CRA001372_URLS_example.txt test file as reference. The resulting fastq files have been saved in the downloads Output Directory.

2.2.2) Checking of Downloaded Fastq Files

Now, we are going to check the integrity of the downloaded fastq files using the Check Fastqs program in Generic mode.

Command:

check_fastqs -s Generic -t filtered_merged_CRA001372_main_metadata_example.tsv -d downloads -a filtered_manifest_CRA001372_example.tsv -p '.fq.gz' --md5_check

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   check_fastqs.py                                          ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────────────┬─────────────────────────────────────────────────────┐
│ Argument                 │ Value                                               │
├──────────────────────────┼─────────────────────────────────────────────────────┤
│ metadata_table           │ filtered_merged_CRA001372_main_metadata_example.tsv │
│ fastqs_directory         │ downloads                                           │
│ manifest_table           │ filtered_manifest_CRA001372_example.tsv             │
│ mode                     │ Generic                                             │
│ ena_download_column      │                                                     │
│ generic_common_column_mt │ sample_id                                           │
│ fastq_pattern            │ .fq.gz                                              │
│ md5_check                │ True                                                │
│ plain_text               │ False                                               │
└──────────────────────────┴─────────────────────────────────────────────────────┘

Loading Files:

Metadata Table file:
filtered_merged_CRA001372_main_metadata_example.tsv

Manifest Table file:
filtered_manifest_CRA001372_example.tsv

Main Information:

1) Number of rows in Metadata Table: 24

2) Number of unique samples for provided generic_merge_column in Metadata Table: 24

3) Number of Fastq files in the provided Manifest Table: 24

4) Number of unique samples for "sample_name" column in Manifest Table: 24

5) Number of Fastq files in the provided Fastqs Directory: 24

Fastqs' Checks:

1) Check that expected files from the Manifest Table exist in the provided directory:
All the expected Fastq files from the Manifest Table are in the provided directory!

2) Check if there are Fastq files of the provided directory absent in the Manifest Table:
There are no extra Fastq files in the provided directory!

3) Check if there are Fastq files of the provided directory with multiple matches in the Manifest Table:
All Fastq files in the provided directory have a unique sample_name match with the Manifest Table!

4) Check intersection between "sample_name" column from Manifest and provided generic_common_column_mt from Metadata:

Unique values:
  o Manifest Column [sample_name] (total unique values): 24
  o Metadata Column [sample_id] (total unique values): 24

All unique values are common between the columns provided!

5) Check that expected MD5s from the Manifest Table match calculated MD5s:

This may take a while...
  o Total number of Fastqs detected in Manifest Table: 24
  o Total number of warnings for MD5s: 0

All MD5s present in the Manifest Table match the MD5s calculated for their corresponding Fastq files!

Let's explore the fastqs checking results:

  • Main Information. The following relevant statistics have been calculated: 1) There are a total of 24 entries in the Metadata Table; 2) There are a total of 24 unique samples in the Metadata Table; 3) There are a total of 24 fastq files in the Manifest Table; 4) There are a total of 24 unique samples in the Manifest Table; 5) The number of fastq files in the provided fastqs directory is 24 files.

  • Fastqs' Checks. The following relevant checks for the downloaded fastq files have been performed: 1) All the expected fastq files from the Manifest Table are in the provided fastqs directory; 2) There are no extra fastq files in the provided directory; 3) All fastq files in the provided directory have a unique sample_name match with the Manifest Table; 4) Every unique value in the common columns provided for the Metadata and Manifest Tables (sample_id and sample_name) is present in both tables; 5) All MD5s present in the Manifest Table match the MD5s calculated for their corresponding fastq files (optional check).

In this particular case, there were no warning elements to further curate after the checking process. However, we have gained some useful knowledge about this dataset, and we can proceed with the next workflow steps.
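To make the optional MD5 check more concrete, here is a minimal Python sketch of the general technique: compute the MD5 of each file and compare it with the expected value. It is not the code used by Check Fastqs; the function names and the demo file are assumptions made for the illustration.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_mismatches(expected, directory):
    """Return the file names whose computed MD5 differs from the expected one."""
    return [
        name
        for name, md5 in expected.items()
        if file_md5(os.path.join(directory, name)) != md5.lower()
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Demo file with known content (hypothetical, stands in for a fastq file).
    with open(os.path.join(tmp, "demo.fq.gz"), "wb") as handle:
        handle.write(b"hello")

    good = md5_mismatches({"demo.fq.gz": hashlib.md5(b"hello").hexdigest()}, tmp)
    bad = md5_mismatches({"demo.fq.gz": "0" * 32}, tmp)

print("Mismatches with correct MD5:", good)
print("Mismatches with wrong MD5:", bad)
```

Reading in chunks keeps memory use low, which matters for large fastq files.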

2.3) Further Treatment of Metadata and Fastq Files

In this particular dataset, we will use a simple treatment of fastq files (copy, rename) and combine the associated metadata.

2.3.1) Make Raw Treatment Template

First, we will generate a raw treatment template using the Make Treatment Template program in Generic mode.

Command:

make_treatment_template -s Generic -i filtered_manifest_CRA001372_example.tsv -d downloads -p '.fq.gz' -r1 '_1.fq.gz' -r2 '_2.fq.gz'

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   make_treatment_template.py                               ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────────┬─────────────────────────────────────────┐
│ Argument             │ Value                                   │
├──────────────────────┼─────────────────────────────────────────┤
│ input_file           │ filtered_manifest_CRA001372_example.tsv │
│ fastqs_directory     │ downloads                               │
│ mode                 │ Generic                                 │
│ ena_download_column  │                                         │
│ fastq_pattern        │ .fq.gz                                  │
│ r1_pattern           │ _1.fq.gz                                │
│ r2_pattern           │ _2.fq.gz                                │
│ extra_sample_columns │                                         │
│ output_directory     │                                         │
│ plain_text           │ False                                   │
└──────────────────────┴─────────────────────────────────────────┘

Loading File:

Manifest Table file:
filtered_manifest_CRA001372_example.tsv

Main Information:

1) Number of Fastq files in the provided Manifest Table: 24

2) Number of unique samples for "sample_name" column in Manifest Table: 24

3) Number of Fastq files in the provided Fastqs Directory: 24

Creating Raw Treatment Template:

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/CRA001372/raw_treatment_template_filtered_manifest_CRA001372_example.tsv

The resulting raw_treatment_template_filtered_manifest_CRA001372_example.tsv file has to be further curated before it can be used by the next programs in the workflow, Treat Metadata and Treat Fastqs. Together, the three programs handle the extra treatment of fastq files and their associated metadata.

With respect to the treatment values, we will apply the copy mode to the Soil1 samples and the rename mode to the Soil2 samples. The result is the final treatment_template_filtered_CRA001372_example.tsv test file, which will be used for the remaining treatment steps.
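The rule used here to pick a treatment can be written down as a small Python function. This is just a sketch of the decision rule for this tutorial; choose_treatment is a hypothetical helper, and how the value is stored in the template (column names and layout) is described in the Make Treatment Template documentation, not here.

```python
def choose_treatment(sample_name):
    """Tutorial rule: Soil1 samples are copied, Soil2 samples are renamed."""
    if sample_name.startswith("Soil1"):
        return "copy"
    if sample_name.startswith("Soil2"):
        return "rename"
    raise ValueError(f"No treatment rule defined for sample {sample_name!r}")


for name in ["Soil1Ha", "Soil1Lf", "Soil2Ha", "Soil2Lf"]:
    print(name, "->", choose_treatment(name))
```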

2.3.2) Treat Metadata

Now, we will use the Treat Metadata program in Generic mode to combine and treat the metadata based on the treatment information provided by the final treatment_template_filtered_CRA001372_example.tsv test file.

Command:

treat_metadata -s Generic -t treatment_template_filtered_CRA001372_example.tsv -m filtered_merged_CRA001372_main_metadata_example.tsv -p '.fq.gz' -r1 '_1.fq.gz' -r2 '_2.fq.gz'

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   treat_metadata.py                                        ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌────────────────────────────┬─────────────────────────────────────────────────────┐
│ Argument                   │ Value                                               │
├────────────────────────────┼─────────────────────────────────────────────────────┤
│ metadata_table             │ filtered_merged_CRA001372_main_metadata_example.tsv │
│ treatment_template         │ treatment_template_filtered_CRA001372_example.tsv   │
│ mode                       │ Generic                                             │
│ ena_download_column        │                                                     │
│ generic_common_column_mt   │ sample_id                                           │
│ generic_common_column_tt   │ sample_name                                         │
│ extra_no_warning_columns   │                                                     │
│ sample_name_sep            │ _                                                   │
│ sample_name_sep_appearance │ 1                                                   │
│ fastq_pattern              │ .fq.gz                                              │
│ r1_pattern                 │ _1.fq.gz                                            │
│ r2_pattern                 │ _2.fq.gz                                            │
│ output_directory           │                                                     │
│ plain_text                 │ False                                               │
└────────────────────────────┴─────────────────────────────────────────────────────┘

Loading Files:

Treatment Template file:
treatment_template_filtered_CRA001372_example.tsv

Metadata Table file:
filtered_merged_CRA001372_main_metadata_example.tsv

Unique values:
  o Treatment Template Column [sample_name] (total unique values): 24
  o Metadata Column [sample_id] (total unique values): 24

All unique values are common between the columns provided!

Pre-treatment Information:

  o Number of Samples in Treatment Template: 24
  o Number of Rows in the Original Metadata Table: 24

Treat Metadata:

Treating metadata ...

Post-treatment Information:

  o Number of Rows in the Treated Metadata Table: 24

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/CRA001372/treated_filtered_merged_CRA001372_main_metadata_example.tsv

Initially, the filtered_merged_CRA001372_main_metadata_example.tsv file contained 24 rows. After treatment, the resulting treated_filtered_merged_CRA001372_main_metadata_example.tsv file still contains information about 24 samples, since we only changed the names of some files. When information is combined for a final sample name and multiple different values are found, they are separated by a semicolon (;), and the program generates a warning report that should be used to check for possible metadata inconsistencies. In this particular case, no warnings were raised. For further details about the structure of this file, see the Treat Metadata program documentation.
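The semicolon rule described above can be illustrated with a short Python sketch. It is an approximation under assumptions, not the program's code: combine_values is a hypothetical helper, and keeping the values in first-seen order is a choice made for this example.

```python
def combine_values(values, sep=";"):
    """Join the distinct values (first-seen order) and flag conflicts.

    Returns a tuple (combined_value, needs_warning), where needs_warning
    is True when more than one distinct value was found.
    """
    distinct = list(dict.fromkeys(values))
    return sep.join(distinct), len(distinct) > 1


# Same value in every row: nothing to report.
same_value, same_warning = combine_values(["Soil", "Soil"])

# Different values: they are joined with ';' and should be reviewed.
mixed_value, mixed_warning = combine_values(["Soil", "Root"])

print(same_value, same_warning)
print(mixed_value, mixed_warning)
```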

2.3.3) Treat Fastqs

Finally, we will use the Treat Fastqs program to perform the different treatment operations on the downloaded fastq files based on the treatment information in the final treatment_template_filtered_CRA001372_example.tsv test file. We will first create a directory called treated_files that will be used as Output Directory.

Commands:

##Create treated_files directory
mkdir treated_files

##Treat fastq files
treat_fastqs -t treatment_template_filtered_CRA001372_example.tsv -i downloads -o treated_files -p '.fq.gz' -r1 '_1.fq.gz' -r2 '_2.fq.gz'

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   treat_fastqs.py                                          ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌────────────────────┬───────────────────────────────────────────────────┐
│ Argument           │ Value                                             │
├────────────────────┼───────────────────────────────────────────────────┤
│ treatment_template │ treatment_template_filtered_CRA001372_example.tsv │
│ input_directory    │ downloads                                         │
│ output_directory   │ treated_files                                     │
│ fastq_pattern      │ .fq.gz                                            │
│ r1_pattern         │ _1.fq.gz                                          │
│ r2_pattern         │ _2.fq.gz                                          │
│ plain_text         │ False                                             │
└────────────────────┴───────────────────────────────────────────────────┘

Loading Files:

Treatment Template file:
treatment_template_filtered_CRA001372_example.tsv

Pre-treatment Information:

  o Number of Samples in Treatment Template: 24
  o Number of Fastq files in Treatment Template: 24

Treat Fastqs:

Sample:  Soil1Ha
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044580.fq.gz']  >  CRR044580.fq.gz

Sample:  Soil1Hb
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044581.fq.gz']  >  CRR044581.fq.gz

Sample:  Soil1Hc
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044582.fq.gz']  >  CRR044582.fq.gz

Sample:  Soil1Hd
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044583.fq.gz']  >  CRR044583.fq.gz

Sample:  Soil1He
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044584.fq.gz']  >  CRR044584.fq.gz

Sample:  Soil1Hf
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044585.fq.gz']  >  CRR044585.fq.gz

Sample:  Soil1La
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044586.fq.gz']  >  CRR044586.fq.gz

Sample:  Soil1Lb
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044587.fq.gz']  >  CRR044587.fq.gz

Sample:  Soil1Lc
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044588.fq.gz']  >  CRR044588.fq.gz

Sample:  Soil1Ld
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044589.fq.gz']  >  CRR044589.fq.gz

Sample:  Soil1Le
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044590.fq.gz']  >  CRR044590.fq.gz

Sample:  Soil1Lf
Treatment: copy
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Copying file(s):
['CRR044591.fq.gz']  >  CRR044591.fq.gz

Sample:  Soil2Ha
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044592.fq.gz']  >  Soil2Ha.fq.gz

Sample:  Soil2Hb
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044593.fq.gz']  >  Soil2Hb.fq.gz

Sample:  Soil2Hc
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044594.fq.gz']  >  Soil2Hc.fq.gz

Sample:  Soil2Hd
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044595.fq.gz']  >  Soil2Hd.fq.gz

Sample:  Soil2He
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044596.fq.gz']  >  Soil2He.fq.gz

Sample:  Soil2Hf
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044597.fq.gz']  >  Soil2Hf.fq.gz

Sample:  Soil2La
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044598.fq.gz']  >  Soil2La.fq.gz

Sample:  Soil2Lb
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044599.fq.gz']  >  Soil2Lb.fq.gz

Sample:  Soil2Lc
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044600.fq.gz']  >  Soil2Lc.fq.gz

Sample:  Soil2Ld
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044601.fq.gz']  >  Soil2Ld.fq.gz

Sample:  Soil2Le
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044602.fq.gz']  >  Soil2Le.fq.gz

Sample:  Soil2Lf
Treatment: rename
Number of Fastq files: 1
Configuration: Number of pair1(s) = 0; Number of pair2(s) = 0; Number of single(s) = 1
Renaming file(s):
['CRR044603.fq.gz']  >  Soil2Lf.fq.gz

Post-treatment Information:

  o Number of Fastq files in Output Directory: 24

Resulting files saved in:
treated_files

As we can see, the program has treated the corresponding fastq files. A total of 24 fastq files (24 SINGLE fastq files) have been treated, generating a final set of 24 fastq files (24 SINGLE fastq files). The resulting fastq files saved in the treated_files Output Directory correspond to the final fastq files for this dataset.
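For single fastq files, the copy and rename treatments seen in the log differ only in the name of the output file. The Python sketch below mimics that behaviour under assumptions: treat_single is a hypothetical helper, and writing a copy into the output directory in both cases (leaving the input directory untouched) is a guess for this example, not a statement about how Treat Fastqs works internally.

```python
import os
import shutil
import tempfile


def treat_single(sample, treatment, filename, in_dir, out_dir, pattern=".fq.gz"):
    """Place one single fastq file in out_dir under its final name."""
    if treatment == "copy":
        target = filename            # keep the original file name
    elif treatment == "rename":
        target = sample + pattern    # use the sample name as the file name
    else:
        raise ValueError(f"Unsupported treatment in this sketch: {treatment!r}")
    shutil.copy2(os.path.join(in_dir, filename), os.path.join(out_dir, target))
    return target


with tempfile.TemporaryDirectory() as tmp:
    in_dir = os.path.join(tmp, "downloads")
    out_dir = os.path.join(tmp, "treated_files")
    os.mkdir(in_dir)
    os.mkdir(out_dir)
    # Empty stand-in files named like two of the downloaded fastqs.
    for name in ("CRR044580.fq.gz", "CRR044592.fq.gz"):
        open(os.path.join(in_dir, name), "wb").close()

    treat_single("Soil1Ha", "copy", "CRR044580.fq.gz", in_dir, out_dir)
    treat_single("Soil2Ha", "rename", "CRR044592.fq.gz", in_dir, out_dir)
    produced = sorted(os.listdir(out_dir))

print(produced)
```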

3) Multidatasets Programs

At this point we have managed to curate two independent datasets and we would like to combine them in a safe way. To do so, the first step would be to prepare a dictionary of variables of interest and manually curate the metadata of our datasets based on it. For this purpose, we will use the variables_dictionary_example.tsv test file. For further details about the structure of this file and the expected values for the different columns, see the Check Metadata Values program documentation.

Often, the columns of interest have different names in each dataset, and we have to unify them manually on the basis of this dictionary. Likewise, we will want to remove the columns that are not relevant to our work (those absent from our variables dictionary). This has already been done, so we will directly use the curated_CRA001372_external_example_metadata_final.tsv and curated_PRJEB10949_ENA_example_metadata_final.tsv test files as an example.

3.1) Concatenate Curated Metadata Datasets

First, we will use the Concat Datasets program to safely concatenate the curated metadata tables based on the information from the Variables Dictionary.

Commands:

##Get back to the Example directory (where the test files should be manually copied)
cd ..

##Concatenate Curated Metadata Files
concat_datasets -i Example -d variables_dictionary_example.tsv -op 'tutorial'

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   concat_datasets.py                                       ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────────┬──────────────────────────────────┐
│ Argument             │ Value                            │
├──────────────────────┼──────────────────────────────────┤
│ input_directory      │ Example                          │
│ variables_dictionary │ variables_dictionary_example.tsv │
│ search_mode          │ simple                           │
│ metadata_pattern     │ _metadata_final.tsv              │
│ output_name_prefix   │ tutorial                         │
│ output_directory     │                                  │
│ plain_text           │ False                            │
└──────────────────────┴──────────────────────────────────┘

Searching Metadata Files:

- Total Metadata Files detected: 2

Loading Files:

Variables Dictionary file:
variables_dictionary_example.tsv

Datasets Metadata files:
Example/curated_CRA001372_external_example_metadata_final.tsv
Example/curated_PRJEB10949_ENA_example_metadata_final.tsv

Concatenating Metadata Tables:

- Number of final Rows: 36
- Number of final Columns: 17

Saving results in file:
/home/sapies/Desktop/Prueba_concat/Example/tutorial_concatenated_final_metadata.tsv

The generated tutorial_concatenated_final_metadata.tsv corresponds to the final concatenated metadata file of all our datasets and variables of interest based on the Variables Dictionary. For more details and options for the Concat Datasets program check the corresponding documentation.
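Conceptually, a dictionary-driven concatenation stacks the rows of every table while keeping only the variables listed in the dictionary. The Python sketch below shows that idea under assumptions: concat_tables is a hypothetical helper, filling missing variables with "NA" is a choice made for this example, and the rows (including the run accession placeholder) are invented.

```python
def concat_tables(tables, variables, missing="NA"):
    """Stack the rows of all tables, keeping only the given variables."""
    return [
        {var: row.get(var, missing) for var in variables}
        for table in tables
        for row in table
    ]


# Variables of interest, as they would come from a variables dictionary.
variables = ["run_accession", "project_accession", "treatment_fastq_type"]

# Made-up rows: the first table has an extra column, the second lacks one.
gsa_rows = [
    {"run_accession": "CRR044580", "project_accession": "CRA001372",
     "treatment_fastq_type": "single", "extra_column": "dropped"},
]
ena_rows = [
    {"run_accession": "RUN_PLACEHOLDER", "project_accession": "PRJEB10949"},
]

merged = concat_tables([gsa_rows, ena_rows], variables)
for row in merged:
    print(row)
```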

3.2) Check Curated Metadata Values

Finally, we will check that the values of the concatenated curated metadata are within our permitted values based on the information of the Variables Dictionary. In this case, we do this step for the final concatenated metadata table, but it could also have been done for the curated tables of each dataset independently.
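As a rough illustration of what checking values against a dictionary involves, the Python sketch below tests one variable for duplicates, missing values, and values outside an allowed set. It is a simplified approximation, not the Check Metadata Values implementation: check_variable, its parameters, and the "NA" token are assumptions made for this example.

```python
def check_variable(values, unique=False, allowed=None, na_allowed=False, na_token="NA"):
    """Return a list of problems found for one variable (empty list = OK)."""
    problems = []
    if unique and len(set(values)) != len(values):
        problems.append("duplicates found")
    if not na_allowed and na_token in values:
        problems.append("NA values found")
    if allowed is not None:
        bad = sorted({v for v in values if v != na_token and v not in allowed})
        if bad:
            problems.append("values not allowed: " + ", ".join(bad))
    return problems


# A nonunique variable restricted to a subset of allowed values.
ok = check_variable(
    ["single", "single", "pair1;pair2"],
    allowed={"pair1;pair2", "single"},
)

# A variable that must be unique but contains a repeated value.
dup = check_variable(["CRR044580", "CRR044580"], unique=True)

# A value outside the allowed set.
wrong = check_variable(["single", "triple"], allowed={"pair1;pair2", "single"})

print(ok, dup, wrong)
```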

Commands:

##Check Curated Metadata Values
check_metadata_values -t tutorial_concatenated_final_metadata.tsv -d variables_dictionary_example.tsv

Output:

################################################################
##                                                            ##
##    ___  __  __ ___      ___              _   _             ##
##   / _ \|  \/  |   \    / __|  _ _ _ __ _| |_(_)___ _ _     ##
##   |(_)|| |\/| | |) |   |(_| || | '_/ _` |  _| / _ \ ' \    ##
##   \___/|_|  |_|___/    \___\_,_|_| \__,_|\__|_\___/_||_|   ##
##                 _____         _ _   _ _                    ##
##                |_   _|__  ___| | |_(_) |_                  ##
##                  | |/ _ \/ _ \ | / / |  _|                 ##
##                  |_|\___/\___/_|_\_\_|\__|                 ##
##                                                            ##
##   check_metadata_values.py                                 ##
##    * v1.1.0 - 12 Mar 2024 *                                ##
##                                                            ##
################################################################

Program Parameters:
┌──────────────────────┬──────────────────────────────────────────┐
│ Argument             │ Value                                    │
├──────────────────────┼──────────────────────────────────────────┤
│ metadata_table       │ tutorial_concatenated_final_metadata.tsv │
│ variables_dictionary │ variables_dictionary_example.tsv         │
│ plain_text           │ False                                    │
└──────────────────────┴──────────────────────────────────────────┘

Loading Files:

Variables Dictionary file:
variables_dictionary_example.tsv

Curated Metadata Table file:
tutorial_concatenated_final_metadata.tsv

Checking Provided Variables:

Variable: final_files_sample_name
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: unique
Okey! Indicated as unique and no duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for final_files_sample_name!
5) Allowed Values:
- Allowed Values Treatment: any
- NAs Allowed: no
- Values Allowed: any
Okey! NAs were not allowed and no NAs were detected!

Variable: original_files_sample_names
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: unique
Okey! Indicated as unique and no duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for original_files_sample_names!
5) Allowed Values:
- Allowed Values Treatment: any
- NAs Allowed: no
- Values Allowed: any
Okey! NAs were not allowed and no NAs were detected!

Variable: treatment_fastq_type
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for treatment_fastq_type!
5) Allowed Values:
- Allowed Values Treatment: subset
- NAs Allowed: no
- Values Allowed: ['pair1;pair2', 'single']
Okey! All values are allowed values!

Variable: run_accession
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: unique
Okey! Indicated as unique and no duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: yes
- Variable to check: final_files_sample_name
Okey! No duplicates were detected in variable run_accession thus each value has a unique match with the provided variable final_files_sample_name!
- Variable to check: sample_accession
Okey! No duplicates were detected in variable run_accession thus each value has a unique match with the provided variable sample_accession!
- Variable to check: project_accession
Okey! No duplicates were detected in variable run_accession thus each value has a unique match with the provided variable project_accession!
5) Allowed Values:
- Allowed Values Treatment: any
- NAs Allowed: no
- Values Allowed: any
Okey! NAs were not allowed and no NAs were detected!

Variable: sample_accession
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: unique
Okey! Indicated as unique and no duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: yes
- Variable to check: final_files_sample_name
Okey! No duplicates were detected in variable sample_accession thus each value has a unique match with the provided variable final_files_sample_name!
- Variable to check: project_accession
Okey! No duplicates were detected in variable sample_accession thus each value has a unique match with the provided variable project_accession!
5) Allowed Values:
- Allowed Values Treatment: any
- NAs Allowed: no
- Values Allowed: any
Okey! NAs were not allowed and no NAs were detected!

Variable: project_accession
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for project_accession!
5) Allowed Values:
- Allowed Values Treatment: any
- NAs Allowed: no
- Values Allowed: any
Okey! NAs were not allowed and no NAs were detected!

Variable: instrument_model
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for instrument_model!
5) Allowed Values:
- Allowed Values Treatment: any
- NAs Allowed: no
- Values Allowed: any
Okey! NAs were not allowed and no NAs were detected!

Variable: library_strategy
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for library_strategy!
5) Allowed Values:
- Allowed Values Treatment: wholeset
- NAs Allowed: no
- Values Allowed: ['AMPLICON']
Okey! All values in the provided allowed values are present and there are no extra values in the "library_strategy" column in the metadata table!

Variable: library_source
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for library_source!
5) Allowed Values:
- Allowed Values Treatment: wholeset
- NAs Allowed: no
- Values Allowed: ['METAGENOMIC']
Okey! All values in the provided allowed values are present and there are no extra values in the "library_source" column in the metadata table!

Variable: library_selection
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for library_selection!
5) Allowed Values:
- Allowed Values Treatment: wholeset
- NAs Allowed: no
- Values Allowed: ['PCR']
Okey! All values in the provided allowed values are present and there are no extra values in the "library_selection" column in the metadata table!

Variable: associated_host
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for associated_host!
5) Allowed Values:
- Allowed Values Treatment: wholeset
- NAs Allowed: no
- Values Allowed: ['Oryza sativa', 'Mus musculus']
Okey! All values in the provided allowed values are present and there are no extra values in the "associated_host" column in the metadata table!

Variable: tissue
1) Requiredness: required
Okey! This required variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: yes
- Variable to check: associated_host
Okey! Duplicates were detected in variable tissue and each value has a unique match with the provided variable associated_host!
5) Allowed Values:
- Allowed Values Treatment: subset
- NAs Allowed: no
- Values Allowed: ['Ileum', 'Liver', 'Adipose tissue', 'Soil']
Okey! All values are allowed values!

Variable: replicate
1) Requiredness: optional
Okey! This optional variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for replicate!
5) Allowed Values:
- Allowed Values Treatment: subset
- NAs Allowed: yes
- Values Allowed: ['new']
Okey! All values are allowed values or are NAs!

Variable: run_label
1) Requiredness: optional
Okey! This optional variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for run_label!
5) Allowed Values:
- Allowed Values Treatment: any
- NAs Allowed: yes
- Values Allowed: any
Okey! Allowed values treatment "any" and NAs are allowed! Skipping check!

Variable: miseq_kit
1) Requiredness: optional
Okey! This optional variable is in the provided Curated Metadata Table!
2) Class Type: numeric
Okey! Indicated as numeric and dtype is numeric!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for miseq_kit!
5) Allowed Values:
- Allowed Values Treatment: range
- NAs Allowed: yes
- Values Allowed[min,max]: [0, 3]
Okey! All values are in the allowed numeric range or are NAs!

Variable: isolation_source
1) Requiredness: optional
Okey! This optional variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for isolation_source!
5) Allowed Values:
- Allowed Values Treatment: subset
- NAs Allowed: yes
- Values Allowed: ['root']
Okey! All values are allowed values or are NAs!

Variable: geographic_location
1) Requiredness: optional
Okey! This optional variable is in the provided Curated Metadata Table!
2) Class Type: character
Okey! Indicated as character and dtype is Object or Bool!
3) Uniqueness Within: nonunique
Okey! Indicated as nonunique! Duplicates were found within variable!
4) Uniqueness Between:
Check Uniqueness Between variables: no
- Variable to check: none
Okey! Indicated as check no! Skipping check uniqueness between variables for geographic_location!
5) Allowed Values:
- Allowed Values Treatment: any
- NAs Allowed: yes
- Values Allowed: any
Okey! Allowed values treatment "any" and NAs are allowed! Skipping check!

As we can see, the program has carried out the relevant checks for each variable. In this case no warnings were raised, which indicates that all metadata values fall within the values permitted by the Variables Dictionary. We can therefore proceed knowing that our metadata is ready to use. For further details about the available checks, see the Check Metadata Values program documentation.
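To make the log above easier to read, the following is a minimal sketch of what the "Allowed Values" treatments (any, subset, wholeset, range) and the "Uniqueness Between" check appear to verify, inferred only from the messages in the output. It is not the toolkit's implementation: the function names `check_allowed_values` and `check_uniqueness_between`, their parameters, and the use of `None` to represent NAs are all assumptions made for illustration.

```python
# Illustrative sketch only: hypothetical helpers, NOT the omdctk implementation.
# NAs are represented as None here (an assumption for this example).

def check_allowed_values(values, treatment, allowed=None, nas_allowed=False):
    """Return a list of warning strings; an empty list means the check passed."""
    warnings = []
    present = [v for v in values if v is not None]

    # NAs are checked for every treatment.
    if not nas_allowed and len(present) != len(values):
        warnings.append("NAs were not allowed but NAs were detected")

    if treatment == "any":
        # Any non-NA value is accepted.
        return warnings

    if treatment == "subset":
        # Every observed value must be one of the allowed values.
        extra = sorted(set(present) - set(allowed))
        if extra:
            warnings.append(f"values not allowed: {extra}")
    elif treatment == "wholeset":
        # No extra values, and every allowed value must be observed.
        extra = sorted(set(present) - set(allowed))
        missing = sorted(set(allowed) - set(present))
        if extra:
            warnings.append(f"values not allowed: {extra}")
        if missing:
            warnings.append(f"allowed values not present: {missing}")
    elif treatment == "range":
        # allowed is a (min, max) pair; both bounds are treated as inclusive.
        low, high = allowed
        outside = [v for v in present if not (low <= v <= high)]
        if outside:
            warnings.append(f"values outside [{low}, {high}]: {outside}")
    else:
        raise ValueError(f"unknown treatment: {treatment}")
    return warnings


def check_uniqueness_between(values, other_values):
    """Return the values that are paired with more than one value of the other variable."""
    first_match = {}
    conflicts = set()
    for value, other in zip(values, other_values):
        # setdefault stores the first pairing and returns it on later calls.
        if first_match.setdefault(value, other) != other:
            conflicts.add(value)
    return sorted(conflicts)
```

For example, with made-up rows, `check_allowed_values(["single", "pair1;pair2"], "subset", ["pair1;pair2", "single"])` returns an empty list, while `check_uniqueness_between(["Ileum", "Ileum"], ["Mus musculus", "Oryza sativa"])` reports `"Ileum"` because the same tissue would be paired with two different hosts.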