<a href="https://colab.research.google.com/github/stazam/ML-hackathon---genomic_benchmarks/blob/main/human_nontata_promoters_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset - human_nontata_promoters

I used two models:

1. CNN (the **CNN_model** function): with this architecture I reached a maximum accuracy of **87%** (at a loss value of about 31.5). The method is less stable, but it consistently gives results above **86%** on the test set when trained for at least 5 epochs.

2. CNN + Bi-LSTM layer: a more stable architecture. The best I reached with it was about **86.2%** (this result can be reproduced repeatedly).

Other architectures I wanted to try: using k-mers, which according to the article at https://www.hindawi.com/journals/cmmm/2021/1835056/ should lead to even better results. Unfortunately, I ran out of time to try it :D
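
The k-mer idea mentioned above was not implemented in this notebook. As a rough illustration only (this is not the method from the linked article, and the function names `kmer_tokens` and `kmer_count_vector` are made up for this sketch), splitting a sequence into overlapping k-mers and counting them could look like this:

```python
from collections import Counter
from itertools import product


def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA string into k-mers; stride=1 gives overlapping k-mers."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def kmer_count_vector(sequence, k=3):
    """Count every possible k-mer over the ACGT alphabet (vector of length 4**k)."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(kmer_tokens(sequence, k))
    return [counts[kmer] for kmer in vocabulary]


tokens = kmer_tokens("ACGTAC", k=3)
print(tokens)  # ['ACG', 'CGT', 'GTA', 'TAC']

vector = kmer_count_vector("ACGTAC", k=3)
print(len(vector), sum(vector))  # 64 4
```

The token list could feed an embedding layer, while the count vector could feed a dense network; which of the two (if either) would actually help on this dataset is untested here.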

In [20]:
%%capture
# To upgrade an existing installation instead: pip install genomic_benchmarks --upgrade
!pip install genomic_benchmarks

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import sys
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure


from keras.models import Sequential
from keras.layers import Dense, Dropout, Bidirectional, Activation, Flatten, MaxPooling1D, BatchNormalization, Conv1D
from keras import optimizers
from sklearn.model_selection import train_test_split
from pathlib import Path


from genomic_benchmarks.loc2seq import download_dataset
from genomic_benchmarks.data_check import is_downloaded, info, list_datasets

In [21]:
info('human_nontata_promoters', 0)

Dataset `human_nontata_promoters` has 2 classes: negative, positive.

All lenghts of genomic intervals equals 251.

Totally 36131 sequences have been found, 27097 for training and 9034 for testing.


| | train | test |
|---|---|---|
| negative | 12355 | 4119 |
| positive | 14742 | 4915 |


In [22]:
download_dataset("human_nontata_promoters", version=0)

Downloading 1VdUg0Zu8yfLS6QesBXwGz1PIQrTW3Ze4 into /root/.genomic_benchmarks/human_nontata_promoters.zip... Done.
Unzipping...Done.


PosixPath('/root/.genomic_benchmarks/human_nontata_promoters')

In [23]:
SEQ_PATH = Path.home() / '.genomic_benchmarks' / 'human_nontata_promoters'
# Sort the class names so the label order is deterministic (iterdir() order is arbitrary)
CLASSES = sorted(x.stem for x in (SEQ_PATH/'train').iterdir() if x.is_dir())

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=27097,
    class_names=CLASSES)

Found 27097 files belonging to 2 classes.


In [24]:
SEQ_PATH = Path.home() / '.genomic_benchmarks' / 'human_nontata_promoters'
# Sort the class names so the label order does not depend on iterdir() order
CLASSES = sorted(x.stem for x in (SEQ_PATH/'test').iterdir() if x.is_dir())

test_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'test',
    batch_size=9034,
    class_names=CLASSES) 

Found 9034 files belonging to 2 classes.


In [25]:
from google.colab import drive
drive.mount('/content/drive',  force_remount=True)


Mounted at /content/drive


In [26]:
sys.path.append('/content/drive/MyDrive/ML_Hackathon/')

from help_functions import preprocess_NN  # the only helper used in this notebook

In [27]:
X_train, y_train, sequence_size = preprocess_NN(train_dset)
X_ev, y_ev, _ = preprocess_NN(test_dset)

In [28]:
print(X_ev.shape)
print(y_ev.shape)

(9033, 251, 4)
(9033,)
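
The `preprocess_NN` helper comes from `help_functions.py`, which is not included here, so its actual behaviour is unknown. The output shape `(n, 251, 4)` suggests a one-hot encoding of the four bases. A minimal sketch of that idea follows, under stated assumptions: the name `one_hot_encode`, the A/C/G/T channel order, and the choice to skip sequences containing other symbols are all illustrative guesses, not the author's code.

```python
import numpy as np

# Assumed channel order; the real helper may use a different one.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_encode(sequences, labels):
    """Encode equal-length DNA strings as an array of shape (n, length, 4).

    Sequences containing symbols other than A, C, G, T are skipped
    (an assumption of this sketch, not a documented behaviour).
    """
    encoded, kept_labels = [], []
    for seq, label in zip(sequences, labels):
        seq = seq.upper()
        if any(base not in BASE_TO_INDEX for base in seq):
            continue  # e.g. a sequence containing N is dropped
        matrix = np.zeros((len(seq), 4), dtype=np.float32)
        matrix[np.arange(len(seq)), [BASE_TO_INDEX[b] for b in seq]] = 1.0
        encoded.append(matrix)
        kept_labels.append(label)
    return np.stack(encoded), np.array(kept_labels)


X, y = one_hot_encode(["ACGT", "ACNT", "TTGA"], [1, 0, 0])
print(X.shape, y.shape)  # (2, 4, 4) (2,)
```

Skipping invalid sequences would be one possible explanation for seeing 9033 rather than 9034 evaluation samples, but that is a guess.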


**This is my best CNN architecture, with which I reached 86% (a few times it even went above 87%).**

In [16]:
def CNN_model(sequence_size):
  model = Sequential([
        Conv1D(filters = 32, kernel_size=8, padding='same', activation = 'relu', input_shape=(sequence_size, 4)),
        Dropout(0.5),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(128, activation='relu'),
        Dense(64, activation='relu'),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])

  model.summary()

  return model

**This is my best CNN + Bi-LSTM architecture, with which I reached 86.2%.**


In [46]:
def CNN_LSTM_model(sequence_size):
  model = Sequential([
        Conv1D(filters = 32, kernel_size=8, padding='same', activation = 'relu', input_shape=(sequence_size, 4)),
        Bidirectional(keras.layers.LSTM(32, return_sequences=True)),
        Bidirectional(keras.layers.LSTM(16, return_sequences=True)),
        Dropout(0.5),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(16, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])

  model.summary()

  return model



In [47]:
# Hold out 20% of the training data for validation; new names keep the cell safe to re-run
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, train_size=0.8, random_state=20)
model = CNN_LSTM_model(sequence_size)

model.compile(
    optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

models_data = model.fit(X_tr, y_tr, batch_size=32, epochs=8, validation_data=(X_val, y_val))

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d_3 (Conv1D)           (None, 251, 32)           1056      
                                                                 
 bidirectional_6 (Bidirectio  (None, 251, 64)          16640     
 nal)                                                            
                                                                 
 bidirectional_7 (Bidirectio  (None, 251, 32)          10368     
 nal)                                                            
                                                                 
 dropout_6 (Dropout)         (None, 251, 32)           0         
                                                                 
 flatten_3 (Flatten)         (None, 8032)              0         
                                                                 
 dense_7 (Dense)             (None, 128)              
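
The parameter counts in the summary above can be checked by hand. The short sketch below uses the standard formulas for a 1D convolution and an LSTM with a single bias vector per gate; the helper names are made up for this example.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def bidirectional_lstm_params(input_dim, units):
    # A bidirectional wrapper holds two independent LSTMs (forward and backward)
    return 2 * lstm_params(input_dim, units)


print(conv1d_params(kernel_size=8, in_channels=4, filters=32))  # 1056
print(bidirectional_lstm_params(input_dim=32, units=32))        # 16640
print(bidirectional_lstm_params(input_dim=64, units=16))        # 10368
print(251 * 32)                                                 # 8032 (Flatten output size)
```

The second Bi-LSTM takes 64 input features because the first one concatenates its 32 forward and 32 backward outputs.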

In [48]:
metrics = model.evaluate(X_ev, y_ev, verbose=0)
print('model evaluation on unknown dataset [loss, accuracy]:', metrics)

model evaluation on unknown dataset [loss, accuracy]: [0.33043742179870605, 0.8612864017486572]
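
To make the accuracy figure concrete: with a sigmoid output, each prediction is a probability that is thresholded at 0.5 and compared with the true label. The sketch below reproduces that computation with NumPy on toy numbers; the helper names `binary_accuracy` and `confusion_counts` are made up for this example, and the values are not results from the model.

```python
import numpy as np


def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predictions = (np.asarray(probabilities).ravel() > threshold).astype(int)
    labels = np.asarray(labels).ravel()
    return float(np.mean(predictions == labels))


def confusion_counts(probabilities, labels, threshold=0.5):
    """Return (true_pos, true_neg, false_pos, false_neg) counts."""
    predictions = (np.asarray(probabilities).ravel() > threshold).astype(int)
    labels = np.asarray(labels).ravel()
    tp = int(np.sum((predictions == 1) & (labels == 1)))
    tn = int(np.sum((predictions == 0) & (labels == 0)))
    fp = int(np.sum((predictions == 1) & (labels == 0)))
    fn = int(np.sum((predictions == 0) & (labels == 1)))
    return tp, tn, fp, fn


toy_probabilities = [0.9, 0.2, 0.6, 0.4]  # toy values, not model output
toy_labels = [1, 0, 0, 1]
print(binary_accuracy(toy_probabilities, toy_labels))   # 0.5
print(confusion_counts(toy_probabilities, toy_labels))  # (1, 1, 1, 1)
```

In the notebook, the probabilities would come from `model.predict(X_ev)` and the labels from `y_ev`; the confusion counts show whether errors are concentrated in one class, which a single accuracy number hides.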
