# Training pre-miRNA classifiers for the SARS-CoV-2 genome
This notebook contains the basic code to train three machine learning models for finding new SARS-CoV-2 pre-miRNAs. It can be run standalone in [Google Colaboratory](https://colab.research.google.com); otherwise, a Python installation and a GPU are required.

More details of these models can be found in:

- L. A. Bugnon, C. Yones, D. H. Milone, G. Stegmayer, Deep neural architectures for highly imbalanced data in bioinformatics, IEEE Transactions on Neural Networks and Learning Systems, 2019, https://doi.org/10.1109/TNNLS.2019.2914471
- C. Yones, J. Raad, L.A. Bugnon, D.H. Milone, G. Stegmayer, High precision in microRNA prediction: a novel genome-wide approach based on convolutional deep residual networks, bioRxiv 2020.10.23.352179, 2020, https://doi.org/10.1101/2020.10.23.352179


In [None]:
# Run this cell ONLY if you are working in Google Colab. It downloads
# the dataset and sets the working directory.
import os 
! git clone https://github.com/sinc-lab/sarscov2-mirna-discovery.git
os.chdir("sarscov2-mirna-discovery/")

In [None]:
# If you are running on your own machine, make sure you are in the
# root of the repository and have installed all required packages
import os 
os.chdir("../")
print(os.getcwd()) # This should end with "sarscov2-mirna-discovery"

! pip3 install --user -r src/requirements_pre-miRNA_prediction.txt

# Dataset preparation
miRBase virus pre-miRNAs are used as the positive class, along with non-miRNA hairpin-like sequences from the human genome.

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import zscore  

# all the known virus pre-miRNAs are used as positive examples
features_virus_mirnas = pd.read_csv('pre-miRNAs_features/pre-miRNAs_virus.csv')

# SARS-CoV-2 hairpins are used to train deeSOM transductively
features_unlabeled_hairpins = pd.read_csv('pre-miRNAs_features/sars-cov2_hairpins.csv') # Hairpins from the SARS-CoV-2 genome

labels = np.concatenate((np.ones(len(features_virus_mirnas)), np.zeros(len(features_unlabeled_hairpins))))
# np.float was removed in NumPy 1.24; the builtin float gives the same dtype
features = np.concatenate((features_virus_mirnas.drop(columns=["sequence_names"]), 
                       features_unlabeled_hairpins.drop(columns=["sequence_names"]))).astype(float)
sequence_names = np.concatenate((features_virus_mirnas.sequence_names, features_unlabeled_hairpins.sequence_names))

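# Feature normalization
# Missing values are set to 0 first, because a single NaN would make the
# z-score of its whole column NaN
features[np.isnan(features)] = 0
features = zscore(features, axis=0)
# Constant columns have zero variance and yield NaN after z-scoring
features[np.isnan(features)] = 0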
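
The two NaN replacements in the normalization step serve different purposes, which a toy example makes easier to see. The sketch below is a NumPy-only stand-in intended to mirror `zscore(features, axis=0)` with its default `ddof=0`; the helper name `normalize_features` and the toy matrix are assumptions for illustration and are not part of the repository.

```python
import numpy as np


def normalize_features(raw):
    """Column-wise z-score with NaN handling (hypothetical helper)."""
    x = np.array(raw, dtype=float)  # work on a float copy
    # Step 1: missing values become 0 so they do not poison the column stats
    x[np.isnan(x)] = 0.0
    mean = x.mean(axis=0)
    std = x.std(axis=0)  # population std (ddof=0)
    # Step 2: z-score; constant columns divide 0 by 0 and produce NaN
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (x - mean) / std
    # Step 3: those NaN entries are replaced by 0
    z[np.isnan(z)] = 0.0
    return z


# Toy matrix: column 1 is constant, column 2 has a missing value
toy = np.array([[1.0, 5.0, np.nan],
                [2.0, 5.0, 4.0],
                [3.0, 5.0, 8.0]])
normalized = normalize_features(toy)
print(normalized)
```

After this step the constant column is all zeros and every other column has zero mean, so no NaN reaches the classifiers.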

# One-Class SVM (OC-SVM)


In [None]:
from sklearn.svm import OneClassSVM
import pickle

ocsvm = OneClassSVM(kernel="linear")
# Use only the positive class to define the decision boundary
ocsvm.fit(features[labels == 1, :])
print("Fitting OC-SVM done")
# Create the output folder if needed and close the file after writing
os.makedirs("pre-miRNAs_models", exist_ok=True)
with open("pre-miRNAs_models/ocsvm.pk", "wb") as model_file:
    pickle.dump(ocsvm, model_file)
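
Once the OC-SVM is pickled, it can be reloaded and used to rank candidate hairpins by their signed distance to the decision boundary. The following is a minimal sketch on synthetic data, assuming the same `OneClassSVM(kernel="linear")` setup as above; the arrays `toy_positives` and `toy_candidates` and the temporary file are placeholders, not the notebook's real features or paths.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-ins for normalized feature matrices (assumption)
toy_positives = rng.normal(size=(50, 4))
toy_candidates = rng.normal(size=(10, 4))

# Fit on the positive class only, as in the cell above
model = OneClassSVM(kernel="linear")
model.fit(toy_positives)

# Round-trip through pickle, closing the file handles explicitly
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ocsvm.pk")
    with open(path, "wb") as model_file:
        pickle.dump(model, model_file)
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

# Higher scores mean the sample lies further inside the learned region
scores = restored.decision_function(toy_candidates)
ranking = np.argsort(-scores)  # candidate indices, best score first
print(ranking)
```

Pickled scikit-learn models are generally only safe to reload with the same scikit-learn version that wrote them, so it may be worth recording the version next to the model file.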

# Deep Ensemble-Elastic Self-organized maps (deeSOM)
You can find more details of the model implementation in the [deeSOM repository](https://github.com/lbugnon/deeSOM).

In [None]:
!pip3 install deesom 
from deesom import DeeSOM
deesom = DeeSOM(verbosity=True)
# Train deeSOM
deesom.fit(features, labels)
print("Fitting deeSOM done")
if not os.path.isdir("pre-miRNAs_models"):
    os.mkdir("pre-miRNAs_models")
deesom.save_model("pre-miRNAs_models/deesom.pk")

# miRNA Deep Neural Network (mirDNN)
You can find more details of the model implementation in the [mirDNN repository](https://github.com/cyones/mirDNN).

In [None]:
# Download extra unlabeled sequences
! wget https://sourceforge.net/projects/sourcesinc/files/mirdata/sequences/unlabeled.tar.gz
! tar -xvf unlabeled.tar.gz
! mv unlabeled/sequences/unlabeled_hairpins.fold sequences/unlabeled_hairpins.fold

# Install mirDNN
! git clone --recurse-submodules https://github.com/cyones/mirDNN.git
! pip install -r mirDNN/requirements.txt

import numpy as np
import shutil
npos = int(np.sum(labels))
shutil.rmtree("tmp/", ignore_errors=True)
os.mkdir("tmp/")
if not os.path.isdir("pre-miRNAs_models"):
    os.mkdir("pre-miRNAs_models")

# Run the training script (the first -i gives the unlabeled sequences,
# the second -i gives the positive sequences)
! python3 mirDNN/mirdnn_fit.py -i sequences/unlabeled_hairpins.fold -i sequences/pre-miRNAs_virus.fold -m pre-miRNAs_models/mirdnn.pmt -l tmp/train.log -d "cuda" -s 160

print("Fitting mirDNN done")
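
Outside a notebook, the same training call can be assembled as an argument list, which makes it easy to check that both input files exist before a long run starts. This is only a sketch: `build_mirdnn_fit_command` is a hypothetical helper, it uses just the flags that appear in the cell above, and the value given to `-s` is passed through unchanged because its meaning is not described here.

```python
import os
import sys


def build_mirdnn_fit_command(unlabeled_fold, positive_fold, model_path,
                             log_path, device="cuda", s_value=160):
    """Return the mirdnn_fit.py call as an argument list (hypothetical helper)."""
    # Fail early with a clear message if an input file is missing
    for path in (unlabeled_fold, positive_fold):
        if not os.path.isfile(path):
            raise FileNotFoundError(f"Input file not found: {path}")
    return [
        sys.executable, "mirDNN/mirdnn_fit.py",
        "-i", unlabeled_fold,  # first -i: unlabeled sequences
        "-i", positive_fold,   # second -i: positive sequences
        "-m", model_path,
        "-l", log_path,
        "-d", device,
        "-s", str(s_value),    # same value as in the cell above
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root. Note that it uses the current interpreter (`sys.executable`) rather than a hard-coded `python3`.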