# Training pre-miRNA classifiers for the SARS-CoV-2 genome
This notebook contains the basic code to train three machine learning models for finding new SARS-CoV-2 pre-miRNAs. It can run standalone in [Google Colaboratory](https://colab.research.google.com); otherwise, a Python installation and a GPU are required.

More details of these models can be found in:

- L. A. Bugnon, C. Yones, D. H. Milone, G. Stegmayer, Deep neural architectures for highly imbalanced data in bioinformatics, IEEE Transactions on Neural Networks and Learning Systems, 2019, https://doi.org/10.1109/TNNLS.2019.2914471
- C. Yones, J. Raad, L.A. Bugnon, D.H. Milone, G. Stegmayer, High precision in microRNA prediction: a novel genome-wide approach based on convolutional deep residual networks, bioRxiv 2020.10.23.352179, 2020, https://doi.org/10.1101/2020.10.23.352179


In [None]:
# Run this cell only if you are working in Google Colab. It downloads
# the dataset and sets the working directory.
! sudo apt-get install git-lfs
import os 
! git clone https://github.com/sinc-lab/sarscov2-mirna-discovery.git
os.chdir("sarscov2-mirna-discovery/")
! git lfs pull # This downloads the large files (for example, the features)
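
Since the setup cell should only run on Colab, a small guard can make that explicit. The following is a minimal sketch, assuming that Colab preloads the `google.colab` module in its kernels (a widely used convention, not something this repository documents); `running_in_colab` is a hypothetical helper name.

```python
import sys


def running_in_colab():
    """Return True when the google.colab module is already loaded.

    Assumption: Colab kernels import google.colab at start-up, so its
    presence in sys.modules is used here as a heuristic.
    """
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab detected: run the setup cell above.")
else:
    print("Local session: skip the setup cell and start from the repository root.")
```

On a local installation the check returns `False`, so the clone and `git lfs pull` steps can be skipped when the repository is already present.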

# Dataset preparation
miRBase virus miRNAs are used as the positive class, along with non-miRNA hairpin-like sequences from the human genome.

In [4]:
import numpy as np
import pandas as pd
from scipy.stats import zscore  

# All known virus miRNAs are used as positive examples
features_virus_mirnas = pd.read_csv('features/pre-miRNAs_virus.csv')

# SARS-CoV-2 sequences are used to train deeSOM transductively
features_unlabeled_hairpins = pd.read_csv('features/sars-cov2_hairpins.csv') # Hairpins from hsa genome

labels = np.concatenate((np.ones(len(features_virus_mirnas)), np.zeros(len(features_unlabeled_hairpins))))
# np.float was removed from recent NumPy releases; the builtin float is equivalent
features = np.concatenate((features_virus_mirnas.drop(columns=["sequence_names"]),
                       features_unlabeled_hairpins.drop(columns=["sequence_names"]))).astype(float)
sequence_names = np.concatenate((features_virus_mirnas.sequence_names, features_unlabeled_hairpins.sequence_names))

# Feature normalization
features[np.isnan(features)] = 0  # replace missing feature values
features = zscore(features, axis=0)
# Constant columns have zero variance, so their z-scores are NaN; set them to 0
features[np.isnan(features)] = 0
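
To see why NaNs are zeroed both before and after scaling, the sketch below reproduces the normalization on a toy matrix using NumPy only. It is an illustration under stated assumptions: `zscore_with_nan_guard` is a hypothetical helper, the toy values are made up, and the population standard deviation is used because that is the default of `scipy.stats.zscore`.

```python
import numpy as np


def zscore_with_nan_guard(x):
    """Column-wise z-score; NaNs before or after scaling are set to 0."""
    x = np.array(x, dtype=float)
    x[np.isnan(x)] = 0.0           # missing feature values
    mean = x.mean(axis=0)
    std = x.std(axis=0)            # population std (ddof=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (x - mean) / std       # constant columns give 0/0 = NaN
    z[np.isnan(z)] = 0.0           # neutralize constant columns
    return z


# Toy data: column 1 is constant, column 2 has a missing value
toy = np.array([[1.0, 5.0, np.nan],
                [2.0, 5.0, 4.0],
                [3.0, 5.0, 8.0]])
normalized = zscore_with_nan_guard(toy)
print(normalized)
```

The constant column comes out as all zeros instead of NaN, so it carries no information but does not break the downstream models.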

# One-Class SVM (OC-SVM)


In [12]:
import os
import pickle

from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(kernel="linear")
# Use only the positive class to define the decision boundary
ocsvm.fit(features[labels == 1, :])
print("Fitting OC-SVM done")
os.makedirs("models", exist_ok=True)
with open("models/ocsvm.pk", "wb") as model_file:
    pickle.dump(ocsvm, model_file)

Fitting OC-SVM done
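
Once the model is saved, it can be reloaded and used to rank candidate hairpins. The sketch below shows one way to do this with scikit-learn's `decision_function` and `predict`; it uses synthetic data and a temporary file as stand-ins, so the arrays `toy_positives` and `toy_candidates` are assumptions and not part of the dataset.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real feature matrix (assumption)
toy_positives = rng.normal(loc=0.0, scale=1.0, size=(50, 4))
toy_candidates = rng.normal(loc=0.0, scale=1.0, size=(10, 4))

# Fit on the positive class only, as in the cell above
model = OneClassSVM(kernel="linear").fit(toy_positives)

# Round-trip through pickle, mirroring how models/ocsvm.pk is written
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "ocsvm.pk"
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    with open(model_path, "rb") as handle:
        reloaded = pickle.load(handle)

# Signed distance to the boundary: positive for inliers, negative for outliers
scores = reloaded.decision_function(toy_candidates)
predictions = reloaded.predict(toy_candidates)  # +1 inlier, -1 outlier
ranking = np.argsort(-scores)                   # best-scoring candidates first
print(ranking)
```

With the real data, the same calls would be applied to the rows of `features` that correspond to unlabeled hairpins.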


# Deep Ensemble-Elastic Self-organized maps (deeSOM)
You can find more details of the model implementation in the [deeSOM repository](https://github.com/lbugnon/deeSOM).

In [None]:
!pip install deesom 
from deesom import DeeSOM
deesom = DeeSOM(verbosity=True)
# Train deeSOM
deesom.fit(features, labels)
print("Fitting deeSOM done")
if not os.path.isdir("models"):
    os.mkdir("models")
deesom.save_model("models/deesom.pk")

# miRNA Deep Neural Network (mirDNN)
You can find more details of the model implementation in the [mirDNN repository](https://github.com/cyones/mirDNN).

In [None]:
# Install mirDNN
! git clone --recurse-submodules https://github.com/cyones/mirDNN.git
! pip install -r mirDNN/requirements.txt

import os
import shutil

import numpy as np
npos = int(np.sum(labels))
shutil.rmtree("tmp/", ignore_errors=True)
os.mkdir("tmp/")
if not os.path.isdir("models"):
    os.mkdir("models")

# Run the training script (the first -i gives the unlabeled sequences,
# the second -i gives the positive sequences)
! python3 mirDNN/mirdnn_fit.py -i sequences/sequences_unlabeled_hairpins.fold -i sequences/miRNAs_virus.fold -m models/mirdnn.pmt -l tmp/train.log -d "cuda" -s 160

print("Fitting mirDNN done")

fatal: destination path 'mirDNN' already exists and is not an empty directory.
epoch	trainLoss	validAUC	last_imp
0	0.1723	1.4473	0
1	0.1132	3.3740	0
2	0.0932	7.2525	0
3	0.0849	10.9427	0
4	0.0883	14.1527	0
5	0.0968	17.1326	0
6	0.0677	20.1408	0
7	0.0547	21.3322	0
8	0.0568	24.2975	0
9	0.0422	27.7852	0
10	0.0461	30.3895	0
11	0.0411	32.3713	0
12	0.0308	34.9651	0
13	0.0438	36.4939	0
14	0.0268	38.1245	0
15	0.0293	39.9394	0
16	0.0140	41.9659	0
17	0.0229	43.3353	0
18	0.0129	45.4677	0
19	0.0100	46.4719	0
20	0.0107	47.7017	0
21	0.0122	48.7321	0
22	0.0112	49.9123	0
23	0.0075	50.6747	0
24	0.0043	51.6943	0
25	0.0132	52.2116	0
26	0.0024	53.1494	0
27	0.0034	53.1179	1
28	0.0042	54.0723	0
29	0.0025	54.3234	0
30	0.0068	54.9606	0
31	0.0016	55.8373	0
32	0.0083	55.6682	1
33	0.0058	56.4341	0
34	0.0008	57.2770	0
35	0.0045	56.5584	1
36	0.0122	56.4647	2
37	0.0024	57.7551	0
38	0.0004	58.6834	0
39	0.0000	59.6118	0
40	0.0000	60.4528	0
41	0.0000	61.1761	0
42	0.0000	61.8625	0
43	0.0000	62.4894	0
44	0.0000	63.0411	0
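
The training output above is tab-separated, so it is easy to inspect programmatically. The sketch below parses a few rows copied from that output and reports the epoch with the highest `validAUC`. It assumes that the file written through `-l tmp/train.log` uses the same columns as the printed output, and that `last_imp` counts the epochs since the last improvement; both are inferred from the output rather than from the mirDNN documentation. `parse_training_log` is a hypothetical helper.

```python
import csv
import io

# Rows copied from the training output above (epochs 33 to 37)
log_text = (
    "epoch\ttrainLoss\tvalidAUC\tlast_imp\n"
    "33\t0.0058\t56.4341\t0\n"
    "34\t0.0008\t57.2770\t0\n"
    "35\t0.0045\t56.5584\t1\n"
    "36\t0.0122\t56.4647\t2\n"
    "37\t0.0024\t57.7551\t0\n"
)


def parse_training_log(text):
    """Parse the tab-separated log into a list of typed dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "epoch": int(row["epoch"]),
            "trainLoss": float(row["trainLoss"]),
            "validAUC": float(row["validAUC"]),
            "last_imp": int(row["last_imp"]),
        }
        for row in reader
    ]


rows = parse_training_log(log_text)
best = max(rows, key=lambda row: row["validAUC"])
stalled_epochs = [row["epoch"] for row in rows if row["last_imp"] > 0]
print(best["epoch"], best["validAUC"])  # 37 57.7551
print(stalled_epochs)                   # [35, 36]
```

To analyze a full run, the contents of the log file can be read into a string and passed to the same function.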
