In [None]:
import sys
sys.path.append("genome_metadata/scripts/")
import json
# library used to read the (fake) FASTA files
from pysam import FastaFile
from input_core_metadata import input_core_metadata
from input_additional_metadata import input_additional_metadata
from edit_metadata_field import edit_metadata_field
from standardize_fasta_name import standardize_fasta_name
from list_metadata_collection import list_metadata_collection
from extract_specific_metadata import extract_specific_metadata

# Goals

- In a heterogeneous collection of FASTA files, retrieve what every genome represents and how it was obtained.
- Respect the FAIR principles (Findability, Accessibility, Interoperability, and Reuse of digital assets).

# State of the Art

### Why is it important?

Metadata annotation is critical because it records the data collection variables that could impact survey results. The metadata should describe the sample source, the tissue collection method, the environment, the DNA extraction method and the sequencing library preparation.

### What are the existing standards?

- MIGS: Minimum information about a genome sequence
- MIMS: Minimum information about a metagenome sequence
- MIMARKS: Minimum information about a marker gene sequence
- MISAG: Minimum information about a single amplified genome sequence
- MIMAG: Minimum information about a metagenome-assembled genome sequence
- MIUViG: Minimum Information about an Uncultivated Virus Genome

All the checklists above **share the same central set of core (minimum) descriptors**:

- Project name
- Sample name
- Taxonomy ID of DNA sample
- Geographic location (latitude and longitude)
- Geographic location (country and/or sea, region)
- Collection date
- Environment (biome, feature, and material)
- Sequencing method

However, each sequence category has its own **additional mandatory checklist descriptors**.
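
As a minimal sketch of how the shared core descriptors could be checked, the snippet below lists the core field names and reports which ones are missing from an entry. The key spellings follow the example collection used further down in this notebook; `CORE_FIELDS` and `missing_core_fields` are hypothetical names and are not part of the project's scripts.

```python
# Core descriptors shared by all checklists.
# Assumption: key spellings match the example collection in this notebook.
CORE_FIELDS = [
    "project name",
    "sample name",
    "taxonomy ID",
    "sequencing method",
    "collection date",
    "geographic latitude",
    "geographic longitude",
    "geographic location",
    "geographic country",
    "environment",
]


def missing_core_fields(entry):
    """Return the core fields that are absent, empty or 'NA' in a metadata entry."""
    return [field for field in CORE_FIELDS if entry.get(field) in (None, "", "NA")]


# Hypothetical, deliberately incomplete entry
entry = {"project name": "extreme_environments", "sample name": "thermofilum_pendens"}
print(missing_core_fields(entry))
```

A check like this could run before an entry is written to a collection, so that incomplete core metadata is reported early.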


# Generating fake FASTA files

To generate fake FASTA files, I ran the following command in my terminal:
`python fastq_generator.py generate_fasta Chr1 1000 > chr1.fasta`

Here `generate_fasta` is the function, `Chr1` is the name of the sequence, `1000` is the length of the sequence, and `chr1.fasta` is the name of the resulting FASTA file.

To cover the different existing sequence categories, I created 2 (reduced-size) FASTA files for each category. These files are not meant to represent each sequence type faithfully; they are only used for the manipulations that follow.
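
For readers who do not have `fastq_generator.py` at hand, the snippet below is a pure-Python stand-in that produces a comparable file: one header line and a random DNA sequence of a given length. It is only a sketch and does not reproduce how `fastq_generator.py` works internally; `make_fake_fasta` and its parameters are hypothetical.

```python
import random


def make_fake_fasta(name, length, seed=None, width=60):
    """Return a FASTA-formatted string containing one random DNA sequence."""
    rng = random.Random(seed)  # seeded generator so the output is reproducible
    sequence = "".join(rng.choice("ACGT") for _ in range(length))
    # Wrap the sequence at `width` characters per line, as is common in FASTA files
    lines = [sequence[i:i + width] for i in range(0, length, width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


fasta_text = make_fake_fasta("Chr1", 1000, seed=0)
print(fasta_text.splitlines()[0])
```

Writing `fasta_text` to a file gives an input that can be used in the same way as the generated files.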

Below is an example of how to read in one such file. 

In [None]:
fasta = "genome_metadata/FASTA_files/single_amplified_genome2.fasta"
# read FASTA file
sequences_object = FastaFile(fasta)
sequences_object.fetch("sequence")

# Part 1: Enter metadata for a new FASTA file

## Exploring how to organize metadata collections


In [None]:
# example of a metadata collection (a dictionary of dictionaries, stored as a JSON) 
# where only the core descriptors are completed 
# (due to time constraints and my lack of knowledge for certain fields)

metadata_collection = {
    # Example of a metadata entry for MIGS-BA
   "genome1.fasta" : 
    {
        "project name": "extreme_environments",
        "sample name": "thermofilum_pendens",
        "taxonomy ID": "archeae",
        "sequencing method": "16S",
        "collection date": "2022-10-18",
        "geographic latitude": 55,
        "geographic longitude": 21,
        "geographic location": "piton_de_la_fournaise_volcano",
        "geographic country": "la_reunion",
        "environment": "volcan_lava",
        "number of replicons" : "NA",
        "reference for biomaterial" : "NA",
        "isolation and growth condition" : "NA",
        "assembly quality" : "NA",
        "assembly software" : "NA",
        "number of contigs" : "NA",
        "extra comments": "None"
   },
    # Example of a metadata entry for MISAG
    "single_amplified_genome.fasta" : 
    {
        "project name": "stem_cells_under_pressure",
        "sample name": "stem_cell",
        "taxonomy ID": "eukaryota",
        "sequencing method": "RNA-seq",
        "collection date": "2021-08-03",
        "geographic latitude": 46,
        "geographic longitude": 6,
        "geographic location": "hopital_universitaire_geneve",
        "geographic country": "switzerland",
        "environment": "liver",
        "taxonomic identity marker" : "NA",
        "assembly quality" : "NA",
        "assembly software" : "NA",
        "completeness score" : "NA",
        "completeness software" : "NA",
        "contamination score" : "NA",
        "sorting technology" : "NA",
        "single cell lysis approach" : "NA",
        "WGA amplification approach" : "NA",
        "extra comments": "None"
   }
    
    # etc. etc. 
    
}


### A few comments on such a collection: 

**Everything is in one place**, BUT **the data standards are heterogeneous**. This can lead to incomplete and inconsistent entries later on.

For this reason, **I have opted for 1 metadata collection per sequence category**. 

**Challenge at hand**: each standard requires different additional data.

**Please note: strong possibility for further automation.** It would be better if some of the input data were not left as free text and were instead matched to a codebook. For example:

- the sample name
- the taxonomy ID
- the sequencing method
- the geographic country

This was not implemented due to time constraints. 
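
A minimal sketch of what codebook matching could look like is given below. The codebook contents and the names `SEQUENCING_METHOD_CODEBOOK` and `match_to_codebook` are assumptions made for illustration; a real codebook would come from a curated reference list.

```python
# Hypothetical codebook: accepted spellings (lowercase) mapped to one canonical value.
SEQUENCING_METHOD_CODEBOOK = {
    "16s": "16S",
    "16s rrna": "16S",
    "rna-seq": "RNA-seq",
    "rnaseq": "RNA-seq",
}


def match_to_codebook(value, codebook):
    """Return the canonical value for a free-text input, or raise ValueError."""
    key = value.strip().lower()
    if key not in codebook:
        raise ValueError(f"{value!r} is not in the codebook")
    return codebook[key]


print(match_to_codebook(" RNAseq ", SEQUENCING_METHOD_CODEBOOK))  # prints RNA-seq
```

Rejecting unknown values, rather than storing them as typed, keeps the collection comparable across files.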

## A standard naming procedure 
For standard naming, we should use fields that are easy to compare across all files. This means fields that can be automated, such as the sequencing method, the date, the taxonomy ID and the sample name. In addition, we need one free-text field to make the name unique. We would choose the project name, making sure to clean the string beforehand.
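
The naming idea can be sketched as follows. The order of the fields, the cleaning rule and the function names (`clean`, `build_standard_name`) are assumptions for illustration; the project's `standardize_fasta_name` script may build names differently.

```python
import re


def clean(text):
    """Lowercase the text and replace every run of non-alphanumerics with '_'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def build_standard_name(collection_date, taxonomy_id, sample_name,
                        sequencing_method, project_name):
    """Join the cleaned fields into one FASTA file name (assumed field order)."""
    parts = [collection_date, taxonomy_id, sample_name, sequencing_method, project_name]
    return "_".join(clean(part) for part in parts) + ".fasta"


name = build_standard_name("2020-01-06", "archeae", "thermofilum_pendens",
                           "16S", "Extreme Environments!")
print(name)  # prints 2020_01_06_archeae_thermofilum_pendens_16s_extreme_environments.fasta
```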

## Core vs Additional Metadata 

- The core data is requested for all FASTA files

- The additional data is specific to each sequence category. In the User Notebook, an example is given only for MIGS-BA; the same structure would apply to the other sequence categories.
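
One way to picture this split is a lookup from sequence category to its additional fields. The field lists below are copied from the example collection earlier in this notebook and are not an authoritative restatement of the checklists; `ADDITIONAL_FIELDS` and `empty_additional_entry` are hypothetical names.

```python
# Additional fields per sequence category.
# Assumption: lists taken from the example collection in this notebook.
ADDITIONAL_FIELDS = {
    "MIGS-BA": [
        "number of replicons",
        "reference for biomaterial",
        "isolation and growth condition",
        "assembly quality",
        "assembly software",
        "number of contigs",
    ],
    "MISAG": [
        "taxonomic identity marker",
        "assembly quality",
        "assembly software",
        "completeness score",
        "completeness software",
        "contamination score",
        "sorting technology",
        "single cell lysis approach",
        "WGA amplification approach",
    ],
}


def empty_additional_entry(sequence_category):
    """Return the additional fields of a category, all set to 'INCOMPLETE'."""
    return {field: "INCOMPLETE" for field in ADDITIONAL_FIELDS[sequence_category]}


print(sorted(empty_additional_entry("MIGS-BA")))
```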

## User Notebook

The user side of this implementation can be found in the `SDSC_New_Metadata` notebook. The idea is for the notebook to run on its own, with users completing the `"INCOMPLETE"` fields.

Some checks and cleaning are applied to user inputs, but more could be done.

A notebook was chosen for its visual aspect (compared to a terminal, for example) and because its input is easier to control than sourcing from an external CSV (where columns can be changed by the user). A GUI would be ideal; see limitations and next steps in the final section.

### Example of adding core metadata

Try out the example below!

In [None]:
metadata_collection = "genome_metadata/metadata/migs_ba_metadata_collection.json"
fasta_name = "genome_metadata/FASTA_files/archeae.fasta"
project_name = "extreme_environments"
sample_name = "thermofilum_pendens"
taxonomy_ID = "archeae"
sequencing_method = "16S"
collection_date = "2020-01-06"
geographic_latitude = "55"
geographic_longitude = "21"
geographic_location = "Piton de la fournaise Volcano"
geographic_country = "La reunion"
environment = "volcano"

input_core_metadata(metadata_collection,
                    fasta_name,
                    project_name,
                    sample_name,
                    taxonomy_ID,
                    sequencing_method,
                    collection_date,
                    geographic_latitude,
                    geographic_longitude,
                    geographic_location,
                    geographic_country,
                    environment)

### Example of adding additional metadata

Try out the example below!

In [None]:
input_additional_metadata(sequence_category = "MIUVIG",
                              metadata_collection = "genome_metadata/metadata/miuvig_metadata_collection.json",
                              fasta_name = "genome_metadata/FASTA_files/2015_02_16_taxonomy13_sample66_methodi.fasta",
                              assembly_quality = "INCOMPLETE",
                              assembly_software = "INCOMPLETE",
                              binning_parameters = "INCOMPLETE",
                              binning_software = "INCOMPLETE",
                              completeness_score = "INCOMPLETE",
                              completeness_software = "INCOMPLETE",
                              contamination_score = "INCOMPLETE",
                              isolation_and_growth_condition = "INCOMPLETE",
                              number_of_contigs = "INCOMPLETE",
                              number_of_replicons = "INCOMPLETE",
                              reference_for_biomaterial = "INCOMPLETE",
                              single_cell_lysis_approach = "INCOMPLETE",
                              source_of_uvigs = "source88",
                              sorting_technology = "INCOMPLETE",
                              target_gene = "INCOMPLETE",
                              taxonomic_identity_marker = "INCOMPLETE",
                              virus_enrichment_approach = "enrichment99",
                              wga_amplification_approach = "INCOMPLETE")

# Part 2: Edit & Access

## Edit a metadata field

An example is shown below:

In [None]:
edit_metadata_field(metadata_collection = "genome_metadata/metadata/migs_eu_metadata_collection.json",
                       fasta_name = "2022_09_19_taxonomy3_sample1_methodb.fasta",
                       field = "collection date",
                       new_value = "2022-10-19")
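
For readers curious about what such an edit involves, here is a minimal sketch of updating one field in a JSON collection. It is not the code of `edit_metadata_field`; the function `edit_field` and the demo file are hypothetical, and the sketch assumes the collection is a dictionary of dictionaries keyed by FASTA file name.

```python
import json
import os
import tempfile


def edit_field(collection_path, fasta_name, field, new_value):
    """Update one field of one entry in a JSON metadata collection."""
    with open(collection_path) as handle:
        collection = json.load(handle)
    if fasta_name not in collection:
        raise KeyError(f"no entry for {fasta_name!r}")
    if field not in collection[fasta_name]:
        raise KeyError(f"unknown field {field!r}")
    collection[fasta_name][field] = new_value
    # Write the whole collection back to the same file
    with open(collection_path, "w") as handle:
        json.dump(collection, handle, indent=4)
    return collection[fasta_name]


# Demo on a throwaway collection in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_collection.json")
    with open(path, "w") as handle:
        json.dump({"a.fasta": {"collection date": "2022-10-18"}}, handle)
    updated = edit_field(path, "a.fasta", "collection date", "2022-10-19")

print(updated)
```

Refusing unknown file names and unknown fields avoids silently creating new keys through a typo.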

## Standardize the naming of an existing FASTA file

Example: 

In [None]:
metadata_collection = "mimag_metadata_collection.json" 
unstandardized_fasta_name = "mimag_seq.fasta"

standardize_fasta_name(metadata_collection = "genome_metadata/metadata/" + metadata_collection,
                       fasta_name = "FASTA_files/" + unstandardized_fasta_name)

## Access existing genomes and their metadata easily

### List all metadata present in the metadata collection

In [None]:
metadata_collection = "genome_metadata/metadata/mimarks_s_metadata_collection.json"
list_metadata_collection(metadata_collection)

### List all FASTA files where a field equals a given value

In [None]:
metadata_collection = "genome_metadata/metadata/misag_metadata_collection.json"
field = 'collection date'
value = '2000-06-01'

extraction = extract_specific_metadata(metadata_collection,
                                              field,
                                              value)
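
The filtering step itself can be sketched in a few lines. This is an illustration on an in-memory dictionary, not the implementation of `extract_specific_metadata`; `filter_by_field` and the demo entries are hypothetical.

```python
def filter_by_field(collection, field, value):
    """Return the entries of a collection whose `field` equals `value`."""
    return {
        fasta_name: metadata
        for fasta_name, metadata in collection.items()
        if metadata.get(field) == value
    }


# Hypothetical in-memory collection
demo_collection = {
    "a.fasta": {"collection date": "2000-06-01", "environment": "liver"},
    "b.fasta": {"collection date": "2021-08-03", "environment": "liver"},
}

matches = filter_by_field(demo_collection, "collection date", "2000-06-01")
print(list(matches))  # prints ['a.fasta']
```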

# Limitations and Theoretical Next Steps

- **Optimizing with reference dictionaries / codebooks**

It would be better if some of the input data were not left as free text but were instead matched to a codebook. For example, the taxonomy ID, the sequencing method, and the geographic country.

- **Getting the metadata from the sequences of the FASTA file:** 

Based on known characteristics of each type of genome, one can imagine parsing the FASTA files to look for specific genomic sequences (viral, bacterial, human, etc.) to try to autocomplete some of the metadata fields. Of course, this information would have to be validated by a domain specialist. A quick example: bacteria have a 16S rRNA gene. If it is present in the FASTA file, the sequence is very likely prokaryotic (archaea also carry a 16S rRNA gene, so this marker alone does not prove a bacterial origin).

- **Crucial: Creating a user interface rather than a notebook for users**

Using a Jupyter notebook requires some knowledge of the terminal, Python and other coding aspects. A simple GUI (Shiny for Python) or a data collection system with mandatory fields (REDCap, ODK Cloud) would lower this barrier. A GUI like the METAGENOTE one seems quite nice.

- **Batch processing**

For the moment, files are processed one at a time. Ideally, a pipeline would process multiple files at once. A script could do so, but due to time constraints this could not be explored.
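
As a rough sketch of the batch idea, the loop below applies one function to every `.fasta` file in a directory and keeps going when a single file fails. `process_batch` and the `handler` argument are hypothetical; in practice the handler would wrap one of the existing single-file functions.

```python
import tempfile
from pathlib import Path


def process_batch(fasta_dir, handler):
    """Apply `handler` to every .fasta file in `fasta_dir`.

    Returns a dict mapping each file name to the handler's result,
    or to an error message if the handler raised an exception.
    """
    results = {}
    for path in sorted(Path(fasta_dir).glob("*.fasta")):
        try:
            results[path.name] = handler(path)
        except Exception as error:  # one bad file should not stop the batch
            results[path.name] = f"error: {error}"
    return results


# Demo with two throwaway files; the handler simply counts header lines
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("a.fasta", "b.fasta"):
        Path(tmp, file_name).write_text(">seq\nACGT\n")
    summary = process_batch(tmp, lambda p: p.read_text().count(">"))

print(summary)
```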


# References 

- METAGENOTE: a simplified web platform for metadata annotation of genomic samples and streamlined submission to NCBI’s sequence read archive. (Quinones et al.) https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03694-0

- METAGENOTE: https://metagenote.niaid.nih.gov/

- GenomicStandardsConsortium: http://www.gensc.org/pages/standards/checklists.html

- Synthetic FASTA files: https://github.com/johanzi/fastq_generator#usage