# Negative data preparation
In this project, a negative example is a pair of molecules (miRNA and mRNA) that do NOT interact. Because experimentally confirmed negative examples are lacking, the available datasets for target prediction are highly unbalanced and contain exclusively positive data (pairs that interact). To overcome this problem, a publicly available, curated negative dataset for human (hsa) miRNA target prediction is used [1]. However, since this project studies A. thaliana (ath), only genes that are homologous and highly conserved across both organisms are considered.

The process of mapping those genes across organisms is presented in this notebook.


### Preprocessing
The original dataset was published with TargetMiner [1]. In the preprocessing stage, the RefSeq IDs were converted into gene symbols using Genomics Biotools [2], and invalid IDs and duplicates were then removed from the dataset.
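The cleaning step described above is not shown in this notebook. A minimal sketch of how it could look is given below, assuming the converted dataset is a list of (miRNA name, gene symbol) pairs and that a failed conversion is marked with an empty string, `-` or `NA`. The function name `clean_pairs`, the markers, and the example pairs are all hypothetical.

```python
def clean_pairs(pairs, invalid_markers=("", "-", "NA")):
    """Drop pairs with an invalid gene symbol and remove duplicates.

    The first occurrence of each pair is kept, so the input order is preserved.
    The invalid markers are an assumption, not the converter's documented output.
    """
    seen = set()
    cleaned = []
    for mirna, symbol in pairs:
        symbol = symbol.strip()
        if symbol in invalid_markers:
            continue  # conversion failed, no usable gene symbol
        key = (mirna, symbol)
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        cleaned.append(key)
    return cleaned


# Hypothetical pairs, for illustration only.
example_pairs = [
    ("hsa-miR-1", "GENE_A"),
    ("hsa-miR-1", "GENE_A"),  # duplicate
    ("hsa-miR-2", "-"),       # failed conversion
    ("hsa-miR-2", "GENE_B"),
]
cleaned_pairs = clean_pairs(example_pairs)
print(cleaned_pairs)
```

Filtering pair by pair keeps the miRNA and target columns aligned, which matters because the later cells store them in separate files.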

### Methodology
The proposed methodology for preparing a negative dataset consists of:
- Extracting the H. sapiens (hsa) miRNA sequence.
- Getting the hsa target sequence.
- Finding the homologous miRNA sequences in A. thaliana (ath).
- Getting the homologous target sequence in ath.


### Sequence extraction
To extract the miRNA sequences, a dataset containing all known mature miRNA sequences was downloaded from miRBase [3]. To locate the target sequences, the RefSeq RNA sequences of the hsa genome assembly GRCh38.p14 were used [4, 5].

Once the sequences are extracted, they are integrated with the negative dataset [1].
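The cells in this notebook parse FASTA files by splitting the raw text on `>`. As an alternative, the sketch below reads a FASTA stream record by record and joins multi-line sequences. The helper `read_fasta` and the demo records are hypothetical and are not part of the original pipeline.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file or text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # last record in the stream


# Made-up records that only mimic the layout of a mature miRNA FASTA file.
demo = io.StringIO(
    ">hsa-demo-1 MIMAT_FAKE1 Homo sapiens demo\n"
    "UGAGGUAG\n"
    "UAGGUUGU\n"
    ">ath-demo-1 MIMAT_FAKE2 Arabidopsis thaliana demo\n"
    "UGACAGAA\n"
)
records = list(read_fasta(demo))
for header, sequence in records:
    print(header.split(" ")[0], sequence)
```

Because it is a generator, the same helper can stream a large transcript file without holding the whole text in memory.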


In [17]:
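# Load the FASTA file containing all known mature miRNA sequences for all organisms.
# Each record is '<name> <accession> <description>\n<sequence>\n'.
with open('data/mature_mirna_all_organisms.fa') as f:
    mature_mirnas = f.read().split('>')[1:]

# Isolate H. sapiens (hsa) and A. thaliana (ath) sequences.
# Match on the species prefix at the start of the record so that text
# elsewhere in the header cannot trigger a match.
# hsa: miRNA name -> sequence.
hsa_mature_mirnas_dict = {mirna.split(' ')[0]: mirna.split('\n')[1]
                          for mirna in mature_mirnas
                          if mirna.startswith('hsa-')}

# ath: sequence -> miRNA name (inverted so it can be searched by sequence).
# miRNAs that share a sequence collapse into one entry; the last name wins.
ath_mature_mirnas_dict = {mirna.split('\n')[1]: mirna.split(' ')[0]
                          for mirna in mature_mirnas
                          if mirna.startswith('ath-')}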

print(f'Total hsa miRNAs: {len(hsa_mature_mirnas_dict)}')
print(f'Total ath miRNAs: {len(ath_mature_mirnas_dict)}')


Total hsa miRNAs: 2655
Total ath miRNAs: 359


In [11]:
# Load the list of miRNAs included in the negative dataset [1].
with open('data/negative_pairs/hsa_mirnas.txt') as f:
    hsa_negative_mirnas = f.read().split('\n')


In [12]:
# Match the miRNAs of the negative dataset [1] with their mature sequences from miRBase [3].
# Names that are not found in miRBase are kept unchanged.
hsa_negative_mirnas_seq = [hsa_mature_mirnas_dict.get(mirna, mirna)
                           for mirna in hsa_negative_mirnas]

# Create a separate file holding only the sequences.
with open('data/negative_pairs/hsa_mirnas_seq.txt', 'w') as f:
    f.write('\n'.join(hsa_negative_mirnas_seq))


In [13]:
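# Get the hsa target sequences from the RefSeq RNA FASTA of the human genome assembly.
with open('data/hsa/GCF_000001405.40_GRCh38.p14_rna.fna') as f:
    hsa_genome = f.read().split('>')[1:]

# Key: gene symbol, taken from the last '(SYMBOL),' group of the header line.
# Value: the sequence with its line breaks removed.
# Transcripts that share a gene symbol overwrite each other; the last one wins.
hsa_genome_dict = {gene.split('\n')[0].split('),')[0].split(' (')[-1]: ''.join(gene.split('\n')[1:])
                   for gene in hsa_genome}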
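The dictionary above keeps whichever transcript of a gene appears last in the file. The sketch below shows one possible alternative that keeps the longest transcript per gene symbol. It assumes headers of the form `<accession> Homo sapiens <name> (SYMBOL), ...`; the regular expression, the helper names, and the demo records are hypothetical.

```python
import re

# Assumed header layout: '<accession> Homo sapiens <gene name> (SYMBOL), ...'.
SYMBOL_PATTERN = re.compile(r"\(([^()\s]+)\),")


def gene_symbol(header):
    """Return the last '(SYMBOL),' group of a header, or None if there is none."""
    matches = SYMBOL_PATTERN.findall(header)
    return matches[-1] if matches else None


def longest_transcript_per_gene(records):
    """Map each gene symbol to its longest sequence among (header, sequence) records."""
    best = {}
    for header, sequence in records:
        symbol = gene_symbol(header)
        if symbol is None:
            continue  # header does not follow the assumed layout
        if len(sequence) > len(best.get(symbol, "")):
            best[symbol] = sequence
    return best


# Made-up records that only mimic the assumed header layout.
demo_records = [
    ("NM_000001.1 Homo sapiens demo gene (DEMO1), transcript variant 1, mRNA", "ACGT"),
    ("NM_000002.1 Homo sapiens demo gene (DEMO1), transcript variant 2, mRNA", "ACGTACGT"),
    ("NR_000003.1 Homo sapiens demo RNA (DEMO2), long non-coding RNA", "GG"),
    ("XX_000004.1 header without a symbol", "TT"),
]
symbol_to_sequence = longest_transcript_per_gene(demo_records)
print(symbol_to_sequence)
```

Choosing the longest transcript is only one convention; which transcript best represents a target gene is a design decision that the original notebook does not state.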


In [14]:
# Get the list of hsa target genes included in the negative dataset.
with open('data/negative_pairs/hsa_targets.txt') as f:
    hsa_negative_targets = f.read().split('\n')


In [15]:
# Match the negative hsa target names with their sequences from the hsa RNA FASTA [4, 5].
# Targets without a matching gene symbol are written as '-'.
hsa_negative_targets_seq = [hsa_genome_dict.get(target, '-')
                            for target in hsa_negative_targets]

# Create a separate file holding only the target sequences.
with open('data/negative_pairs/hsa_targets_seq.txt', 'w') as f:
    f.write('\n'.join(hsa_negative_targets_seq))
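Targets that cannot be matched are replaced by a `-` placeholder, so it can be useful to check how many names were actually resolved. The following is a minimal sketch of such a check on toy data; `split_matches` and the toy names are hypothetical.

```python
def split_matches(names, lookup):
    """Return (matched, missing) for a list of names and a name -> sequence dict."""
    matched = {name: lookup[name] for name in names if name in lookup}
    missing = sorted({name for name in names if name not in lookup})
    return matched, missing


# Toy data, for illustration only.
toy_lookup = {"DEMO1": "ACGUACGU", "DEMO2": "GGCC"}
toy_names = ["DEMO1", "DEMO3", "DEMO2", "DEMO3"]

matched, missing = split_matches(toy_names, toy_lookup)
coverage = len(matched) / len(set(toy_names))
print(f"Matched {len(matched)} of {len(set(toy_names))} unique names; missing: {missing}")
```

In the notebook, the same function could be called with `hsa_negative_targets` and `hsa_genome_dict` to list the gene symbols that have no sequence.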


### Sequence matching - Homology search in A. thaliana
To create the negative dataset for A. thaliana, only A. thaliana sequences that are perfectly identical to sequences in the H. sapiens dataset [1] are considered. Potential target mRNA sequences from A. thaliana are extracted from the latest released genome [6], and the corresponding mature miRNA sequences are downloaded from miRBase [3].



In [25]:
# Get the names of the equivalent miRNAs in A. thaliana by exact sequence match.
# Sequences without an identical ath miRNA are kept unchanged.
ath_negative_mirnas = [ath_mature_mirnas_dict.get(mirna_seq, mirna_seq)
                       for mirna_seq in hsa_negative_mirnas_seq]
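The lookup above returns one ath name per sequence and leaves unmatched sequences in the list. A variant that keeps only the hsa miRNAs with an identical ath sequence, and reports every ath miRNA sharing that sequence, could look like the sketch below. It assumes both inputs are name -> sequence dictionaries; the function names and the demo sequences are hypothetical.

```python
from collections import defaultdict


def index_by_sequence(name_to_seq):
    """Invert a name -> sequence dict into sequence -> list of names."""
    index = defaultdict(list)
    for name, seq in name_to_seq.items():
        index[seq].append(name)
    return index


def identical_homologs(hsa_name_to_seq, ath_name_to_seq):
    """Map each hsa miRNA to the ath miRNAs with a perfectly identical sequence."""
    ath_index = index_by_sequence(ath_name_to_seq)
    return {hsa_name: sorted(ath_index[seq])
            for hsa_name, seq in hsa_name_to_seq.items()
            if seq in ath_index}


# Made-up names and sequences, for illustration only.
hsa_demo = {
    "hsa-demo-1": "UGACAGAAGAGAGUGAGCAC",
    "hsa-demo-2": "UGAGGUAGUAGGUUGUAUAGUU",
}
ath_demo = {
    "ath-demo-a": "UGACAGAAGAGAGUGAGCAC",
    "ath-demo-b": "UGACAGAAGAGAGUGAGCAC",
    "ath-demo-c": "UUGACAGAAGAUAGAGAGCAC",
}
homologs = identical_homologs(hsa_demo, ath_demo)
print(homologs)
```

Keeping a list of names per sequence avoids silently dropping ath miRNAs that share a mature sequence.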


### References

[1] S. Bandyopadhyay and R. Mitra, ‘TargetMiner: microRNA target prediction with systematic identification of tissue-specific negative examples’, Bioinformatics, vol. 25, no. 20, pp. 2625–2631, Oct. 2009, doi: 10.1093/bioinformatics/btp503.

[2] ‘refSeq Accession to Gene Symbol Converter - Genomics Biotools’. https://www.biotools.fr/mouse/refseq_symbol_converter (accessed Aug. 13, 2023).

[3] ‘miRBase - Downloads’. https://mirbase.org/download/ (accessed Aug. 13, 2023).

[4] ‘Genome’, NCBI. https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9606 (accessed Aug. 13, 2023).

[5] ‘11968211 - Assembly - NCBI’. https://www.ncbi.nlm.nih.gov/assembly/?term=GCF_000001405 (accessed Aug. 13, 2023).

[6] ‘Arabidopsis thaliana (ID 4) - Genome - NCBI’. https://www.ncbi.nlm.nih.gov/genome/4?genome_assembly_id=380024 (accessed Jul. 02, 2023).