# Using pre-miRNA classifiers on the SARS-CoV-2 genome
This notebook uses the trained machine learning models to find new SARS-CoV-2 pre-miRNAs. It can run standalone in [Google Colaboratory](https://colab.research.google.com); otherwise, a Python installation and a GPU are required.

More details on the models used can be found in:

- L. A. Bugnon, C. Yones, D. H. Milone, G. Stegmayer, Deep neural architectures for highly imbalanced data in bioinformatics, IEEE Transactions on Neural Networks and Learning Systems, 2019, https://doi.org/10.1109/TNNLS.2019.2914471
- C. Yones, J. Raad, L.A. Bugnon, D.H. Milone, G. Stegmayer, High precision in microRNA prediction: a novel genome-wide approach based on convolutional deep residual networks, bioRxiv 2020.10.23.352179, 2020, https://doi.org/10.1101/2020.10.23.352179


In [None]:
# Run this cell ONLY if you are working in Google Colab. It downloads
# the dataset and sets the working directory.
import os 
! git clone https://github.com/sinc-lab/sarscov2-mirna-discovery.git
os.chdir("sarscov2-mirna-discovery/")

In [None]:
# If you are running this from your PC, check that the working directory is 
# the root of the repository
import os 
os.chdir("../")
print(os.getcwd())  # should end with "sarscov2-mirna-discovery/"
!pip3 install --user -r src/requirements_pre-miRNA_prediction.txt

# Dataset preparation
Here we load all the hairpin-like sequences found in the SARS-CoV-2 genome, along with their features.

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import zscore  

features_sarscov2 = pd.read_csv('pre-miRNAs_features/sars-cov2_hairpins.csv')
sequence_names = features_sarscov2.sequence_names.values
features_sarscov2 = features_sarscov2.drop(columns=["sequence_names"]).values.astype(float)

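# Feature normalization: replace missing values with 0, z-score each feature
# column, then zero out any NaNs that z-scoring produces (e.g. constant columns)
features_sarscov2[np.isnan(features_sarscov2)] = 0
features_sarscov2 = zscore(features_sarscov2, axis=0)
features_sarscov2[np.isnan(features_sarscov2)] = 0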
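
The normalization step can be illustrated on a toy matrix. The sketch below is a minimal, NumPy-only re-implementation of column-wise z-scoring (using the population standard deviation, which is the default in `scipy.stats.zscore`). It assumes that the second NaN replacement is there to handle columns with zero variance, where the z-score is 0/0. The helper name `normalize_features` and the toy values are hypothetical.

```python
import numpy as np

def normalize_features(features):
    """Column-wise z-score with NaNs mapped to 0 (hypothetical helper)."""
    x = np.array(features, dtype=float)  # copy, so the input is not modified
    x[np.isnan(x)] = 0                   # missing feature values become 0
    mean = x.mean(axis=0)
    std = x.std(axis=0)                  # population std (ddof=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (x - mean) / std             # constant columns give 0/0 = NaN
    z[np.isnan(z)] = 0                   # zero-variance columns are set to 0
    return z

# Toy data: column 0 varies, column 1 is constant, column 2 has a missing value
toy = [[1.0, 5.0, np.nan],
       [2.0, 5.0, 4.0],
       [3.0, 5.0, 8.0]]
normalized = normalize_features(toy)
print(normalized)
```

After this step every non-constant column has zero mean and unit variance, and the constant column is all zeros rather than NaN.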

## One-Class SVM (OC-SVM)


In [None]:
from sklearn.svm import OneClassSVM
import pickle

model_file = "pre-miRNAs_models/ocsvm.pk"
try:
    with open(model_file, "rb") as f:
        ocsvm = pickle.load(f)
except FileNotFoundError:
    # Raising a plain string is a TypeError; raise a proper exception instead
    raise FileNotFoundError(
        f"Model file {model_file} not found. You probably need to train the model first"
    )

# Create the output folder if it does not exist yet
os.makedirs("pre-miRNAs_predictions", exist_ok=True)

# Higher decision-function values indicate better candidates, so sort descending
scores = ocsvm.decision_function(features_sarscov2)
ind = np.argsort(scores)[::-1]
pd.DataFrame(np.array([sequence_names[ind], scores[ind]]).T, 
             columns=["sequence_names", "OC-SVM_scores"]).to_csv("pre-miRNAs_predictions/OC-SVM.csv",
                                                                 index=False)
# If you are working in Google Colab, the output folder "pre-miRNAs_predictions"
# should appear under the folder icon in the left panel
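
To make the ranking step concrete, the following sketch fits a `OneClassSVM` on synthetic data and ranks unseen points by `decision_function`, where larger values mean a point looks more like the training class. The data, the hyperparameters, and the candidate names are all made up for illustration; they are not those of the model stored in `pre-miRNAs_models/ocsvm.pk`.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic "known class" used for training (assumption: 5 features)
train = rng.normal(loc=0.0, scale=1.0, size=(200, 5))

# Candidates: 5 points close to the training class, 5 far away from it
near = rng.normal(loc=0.0, scale=0.1, size=(5, 5))
far = rng.normal(loc=8.0, scale=0.1, size=(5, 5))
candidates = np.vstack([near, far])
names = np.array([f"near_{i}" for i in range(5)] + [f"far_{i}" for i in range(5)])

# Illustrative hyperparameters, not the ones of the trained model
toy_ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(train)

toy_scores = toy_ocsvm.decision_function(candidates)
order = np.argsort(toy_scores)[::-1]  # descending: best candidates first
for name, score in zip(names[order], toy_scores[order]):
    print(f"{name}\t{score:.3f}")
```

The candidates drawn near the training distribution end up at the top of the list, which is the same ordering logic applied to the hairpin scores above.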

## Deep Ensemble-Elastic Self-organized maps (deeSOM)
You can find more details of the model implementation in the [deeSOM repository](https://github.com/lbugnon/deeSOM).

In [None]:
!pip3 install deesom
from deesom import DeeSOM

deesom = DeeSOM(verbosity=True)
model_file = "pre-miRNAs_models/deesom.pk"
try:
    deesom.load_model(model_file)
except FileNotFoundError:
    # Raising a plain string is a TypeError; raise a proper exception instead
    raise FileNotFoundError(
        f"Model file {model_file} not found. You probably need to train the model first"
    )

# Create the output folder if it does not exist yet
os.makedirs("pre-miRNAs_predictions", exist_ok=True)

scores = deesom.predict_proba(features_sarscov2)
ind = np.argsort(scores)[::-1]
pd.DataFrame(np.array([sequence_names[ind], scores[ind]]).T, 
             columns=["sequence_names", "deeSOM_scores"]).to_csv("pre-miRNAs_predictions/deeSOM.csv",
                                                                 index=False)

## miRNA Deep Neural Network (mirDNN)
You can find more details of the model implementation in the [mirDNN repository](https://github.com/cyones/mirDNN).

In [None]:
# Install mirDNN
! git clone --recurse-submodules https://github.com/cyones/mirDNN.git
! pip3 install -r mirDNN/requirements.txt

! python3 mirDNN/mirdnn_eval.py -i "sequences/sars-cov2_hairpins.fold" -o "pre-miRNAs_predictions/mirDNN.csv" -m pre-miRNAs_models/mirdnn.pmt -s160 -d "cpu"

# Load scores and sort them
scores = pd.read_csv("pre-miRNAs_predictions/mirDNN.csv", header=None)
sequence_names, scores = scores[0].values, scores[1].values

ind = np.argsort(scores)[::-1]
pd.DataFrame(np.array([sequence_names[ind], scores[ind]]).T, 
             columns=["sequence_names", "mirDNN_scores"]).to_csv("pre-miRNAs_predictions/mirDNN.csv",
                                                                 index=False)
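
The same sort-and-save pattern can also be written with pandas alone, which keeps the score column numeric instead of passing both columns through a single NumPy array. This is a sketch under the assumption, taken from the `header=None` call above, that the raw file is a headerless two-column CSV (name, score); the toy rows are invented.

```python
import io
import pandas as pd

# Stand-in for the raw prediction file (hypothetical names and scores)
raw_csv = io.StringIO("hairpin_a,0.12\nhairpin_b,0.97\nhairpin_c,0.55\n")

ranked = pd.read_csv(raw_csv, header=None,
                     names=["sequence_names", "mirDNN_scores"])
ranked = ranked.sort_values("mirDNN_scores", ascending=False).reset_index(drop=True)

# With no path given, to_csv returns the CSV text; pass a path to write a file
csv_text = ranked.to_csv(index=False)
print(csv_text)
```

Sorting the DataFrame directly avoids the intermediate index array and leaves the scores as floats until they are written out.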