# Using pre-miRNA classifiers on the SARS-CoV-2 genome
This notebook uses the trained machine learning models to search for new SARS-CoV-2 pre-miRNAs. It runs stand-alone in [Google Colaboratory](https://colab.research.google.com); otherwise, a Python installation and a GPU are required.

More details on the models can be found in:

- L. A. Bugnon, C. Yones, D. H. Milone, G. Stegmayer, Deep neural architectures for highly imbalanced data in bioinformatics, IEEE Transactions on Neural Networks and Learning Systems, 2019, https://doi.org/10.1109/TNNLS.2019.2914471
- C. Yones, J. Raad, L.A. Bugnon, D.H. Milone, G. Stegmayer, High precision in microRNA prediction: a novel genome-wide approach based on convolutional deep residual networks, bioRxiv 2020.10.23.352179, 2020, https://doi.org/10.1101/2020.10.23.352179


In [1]:
# Run this cell only if you are working in Google Colab. It downloads
# the dataset and sets the working directory.
! sudo apt-get install git-lfs
import os
! git clone https://github.com/sinc-lab/sarscov2-mirna-discovery.git
os.chdir("sarscov2-mirna-discovery/")
! git lfs pull  # This downloads the large files (for example, the features)


# Dataset preparation
Here we load all the hairpin-like sequences found in the SARS-CoV-2 genome, together with their features.

In [2]:
import numpy as np
import pandas as pd
from scipy.stats import zscore  

features_sarscov2 = pd.read_csv('features/sars-cov2_hairpins.csv')
sequence_names = features_sarscov2.sequence_names.values
features_sarscov2 = features_sarscov2.drop(columns=["sequence_names"]).values.astype(float)

# Feature normalization
features_sarscov2[np.where(np.isnan(features_sarscov2))] = 0
features_sarscov2 = zscore(features_sarscov2, axis=0)
features_sarscov2[np.where(np.isnan(features_sarscov2))] = 0
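
The second NaN replacement in the cell above matters because a feature column with zero variance produces 0/0 during z-scoring. The following is a minimal sketch of the same three steps on a made-up matrix, written with plain NumPy and the population standard deviation (`ddof=0`, which is also the default of `scipy.stats.zscore`). The helper name `normalize_features` and the toy values are illustrative assumptions, not part of the repository.

```python
import numpy as np


def normalize_features(x):
    """Column-wise z-score with NaNs set to 0 before and after scaling."""
    x = np.array(x, dtype=float)  # copy, so the caller's array is untouched
    x[np.isnan(x)] = 0.0          # missing feature values become 0
    mean = x.mean(axis=0)
    std = x.std(axis=0)           # population std (ddof=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (x - mean) / std      # constant columns give 0/0 = NaN
    z[np.isnan(z)] = 0.0          # so they are reset to 0 here
    return z


# Toy matrix: column 1 is constant, column 2 has a missing value
toy = np.array([[1.0, 5.0, np.nan],
                [2.0, 5.0, 4.0],
                [3.0, 5.0, 8.0]])
normalized = normalize_features(toy)
print(normalized)
```

After this step every column has zero mean, and constant columns carry no information (all zeros) instead of propagating NaNs into the classifiers.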

## One-Class SVM (OC-SVM)


In [4]:
import os
import pickle
from sklearn.svm import OneClassSVM

model_file = "models/ocsvm.pk"
try:
    with open(model_file, "rb") as f:
        ocsvm = pickle.load(f)
except FileNotFoundError:
    # A bare string cannot be raised; wrap the message in an exception
    raise FileNotFoundError(f"Model file {model_file} not found. You probably need to train the model first")

os.makedirs("predictions", exist_ok=True)

scores = ocsvm.decision_function(features_sarscov2)
ind = np.argsort(scores)[::-1]  # descending order: better candidates first
pd.DataFrame({"sequence_names": sequence_names[ind],
              "OC-SVM_scores": scores[ind]}).to_csv("predictions/OC-SVM.csv",
                                                    index=False)
# If you are working in Google Colab, you should see the output folder "predictions"
# under the folder icon on the left panel
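
As a sketch of how the ranking above works: in scikit-learn, `OneClassSVM.decision_function` returns a signed distance that is positive for inliers and negative for outliers, so sorting in descending order puts the hairpins most similar to the training class at the top. The names and scores below are made up for illustration and are not real predictions.

```python
import numpy as np

# Hypothetical hairpin names and made-up decision-function scores
names = np.array(["hairpin_a", "hairpin_b", "hairpin_c", "hairpin_d"])
scores = np.array([-0.8, 0.35, 0.05, -0.1])

# argsort is ascending, so reverse it to get the best candidates first
order = np.argsort(scores)[::-1]
ranked_names = names[order].tolist()
ranked_scores = scores[order].tolist()

# Candidates on the inlier side of the learned boundary (score > 0)
inliers = [n for n, s in zip(ranked_names, ranked_scores) if s > 0]

print(ranked_names)
print(inliers)
```

The zero threshold is only the default boundary of the model; the notebook itself stores the full ranking and leaves the choice of cutoff to the reader.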

## Deep Ensemble-Elastic Self-organized maps (deeSOM)
You can find more details of the model implementation in the [deeSOM repository](https://github.com/lbugnon/deeSOM).

In [None]:
!pip install deesom
import os
from deesom import DeeSOM

deesom = DeeSOM(verbosity=True)
model_file = "models/deesom.pk"
try:
    deesom.load_model(model_file)
except FileNotFoundError:
    # A bare string cannot be raised; wrap the message in an exception
    raise FileNotFoundError(f"Model file {model_file} not found. You probably need to train the model first")

os.makedirs("predictions", exist_ok=True)

scores = deesom.predict_proba(features_sarscov2)
ind = np.argsort(scores)[::-1]  # descending order: better candidates first
pd.DataFrame({"sequence_names": sequence_names[ind],
              "deeSOM_scores": scores[ind]}).to_csv("predictions/deeSOM.csv",
                                                    index=False)



## miRNA Deep Neural Network (mirDNN)
You can find more details of the model implementation in the [mirDNN repository](https://github.com/cyones/mirDNN).

In [None]:
# Install mirDNN
! git clone --recurse-submodules https://github.com/cyones/mirDNN.git
! pip install -r mirDNN/requirements.txt

! python mirDNN/model_eval.py -i "sequences/sars-cov2_hairpins.fold" -o "predictions/mirDNN.csv" -m models/mirDNN.pmt -s160 -d "cuda"

