<a href="https://colab.research.google.com/github/xmorcinekp/Homework-3/blob/main/DU3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DU3 - Pavel Morcinek

This notebook demonstrates how to use `genomic_benchmarks` to train a neural network classifier on one of its benchmark datasets, [human_enhancers_cohn](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/docs/human_enhancers_cohn).

In [None]:
# If you work in Google Colaboratory, run the following line to install the packages on your virtual machine
!pip install tensorflow_addons genomic-benchmarks

## Data download

With the `download_dataset` function, we can download the full-sequence form of the benchmark, split into train and test sets, with one folder for each class.

In [2]:
from pathlib import Path
import tensorflow as tf
import numpy as np

import tensorflow_addons as tfa
from tensorflow.keras.layers import (
    BatchNormalization,
    Conv1D,
    Dense,
    Dropout,
    GlobalAveragePooling1D,
    MaxPooling1D,
)
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

from genomic_benchmarks.loc2seq import download_dataset
from genomic_benchmarks.data_check import is_downloaded, info
from genomic_benchmarks.models.tf import vectorize_layer, binary_f1_score
from genomic_benchmarks.models.tf import basic_cnn_model_v0 as model  # shipped baseline; `model` is reassigned below

if not is_downloaded('human_enhancers_cohn'):
    download_dataset('human_enhancers_cohn')



Downloading 176563cDPQ5Y094WyoSBF02QjoVQhWuCh into /root/.genomic_benchmarks/human_enhancers_cohn.zip... 



Done.
Unzipping...Done.


In [3]:
# !ls /root/.genomic_benchmarks/human_enhancers_cohn/train/positive
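The commented-out `ls` above would list the files of one class folder. As a portable alternative, here is a minimal sketch that counts the files in each class folder of a split using `pathlib`. The helper name `count_files_per_class` is hypothetical and not part of `genomic_benchmarks`; the usage line assumes the dataset sits in the default download location.

```python
from pathlib import Path


def count_files_per_class(split_dir):
    """Return {class_folder_name: number_of_files} for one split directory."""
    split_dir = Path(split_dir)
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }


# Usage (assumes the default download location):
# count_files_per_class(Path.home() / ".genomic_benchmarks" / "human_enhancers_cohn" / "train")
```

If the download succeeded, the counts should agree with the table printed by `info`.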

In [4]:
info('human_enhancers_cohn', 0)

Dataset `human_enhancers_cohn` has 2 classes: negative, positive.

All lenghts of genomic intervals equals 500.

Totally 27791 sequences have been found, 20843 for training and 6948 for testing.


| class | train | test |
|---|---|---|
| negative | 10422 | 3474 |
| positive | 10421 | 3474 |


## Model definition

In [5]:
character_split_fn = lambda x: tf.strings.unicode_split(x, "UTF-8")
vectorize_layer = TextVectorization(output_mode="int", split=character_split_fn)

# one-hot encoding
onehot_layer = tf.keras.layers.Lambda(lambda x: tf.one_hot(tf.cast(x, "int64"), 4))


In [18]:
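# Binary F1 score. The model outputs logits (the loss uses from_logits=True),
# so this threshold is compared against raw logits, not probabilities.
binary_f1_score = tfa.metrics.F1Score(num_classes=1, threshold=0.75, average="micro")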

model_original = tf.keras.Sequential(
    [
        onehot_layer,
        Conv1D(32, kernel_size=8, data_format="channels_last", activation="relu"),
        BatchNormalization(),
        MaxPooling1D(),
        Conv1D(16, kernel_size=8, data_format="channels_last", activation="relu"),
        BatchNormalization(),
        MaxPooling1D(),
        Conv1D(4, kernel_size=8, data_format="channels_last", activation="relu"),
        BatchNormalization(),
        MaxPooling1D(),
        Dropout(0.3),
        GlobalAveragePooling1D(),
        Dense(1),
    ]
)

# Modified architecture: one extra Conv1D block, more filters, lower dropout

model_2 = tf.keras.Sequential(
    [
        onehot_layer,
        Conv1D(32, kernel_size=8, data_format="channels_last", activation="relu"),
        BatchNormalization(),
        MaxPooling1D(),
        Conv1D(32, kernel_size=8, data_format="channels_last", activation="relu"),
        BatchNormalization(),
        MaxPooling1D(),
        Conv1D(16, kernel_size=8, data_format="channels_last", activation="relu"),
        BatchNormalization(),
        MaxPooling1D(),
        Conv1D(8, kernel_size=8, data_format="channels_last", activation="relu"),
        BatchNormalization(),
        MaxPooling1D(),
        Dropout(0.1),
        GlobalAveragePooling1D(),
        Dense(1),
    ]
)

model = model_2
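To see how the two architectures differ in the length of the feature map that reaches `GlobalAveragePooling1D`, the sketch below traces the sequence length through the Conv1D and MaxPooling1D blocks. It assumes the Keras defaults used above (`padding="valid"`, stride 1 for `Conv1D`, `pool_size=2` for `MaxPooling1D`) and the input length of 500 reported by `info`. The helper `conv_stack_lengths` is a hypothetical name introduced for illustration.

```python
def conv_stack_lengths(input_length, n_blocks, kernel_size=8, pool_size=2):
    """Sequence length after each Conv1D + MaxPooling1D block."""
    lengths = []
    length = input_length
    for _ in range(n_blocks):
        length = length - kernel_size + 1  # Conv1D with padding="valid", stride 1
        length = length // pool_size       # MaxPooling1D with pool_size=2
        lengths.append(length)
    return lengths


print(conv_stack_lengths(500, n_blocks=3))  # model_original: [246, 119, 56]
print(conv_stack_lengths(500, n_blocks=4))  # model_2: [246, 119, 56, 24]
```

Under these assumptions, `model_2` averages over 24 positions with 8 channels, while `model_original` averages over 56 positions with 4 channels.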

## TF Dataset object

To train the model with TensorFlow, we must create a TF Dataset. Because the directory structure of the benchmark is already laid out for training (one folder per class), we can simply call the `tf.keras.preprocessing.text_dataset_from_directory` function as follows.

In [19]:
BATCH_SIZE = 64  # original
BATCH_SIZE = 128 # modification
SEQ_PATH = Path.home() / '.genomic_benchmarks' / 'human_enhancers_cohn'
CLASSES = [x.stem for x in (SEQ_PATH/'train').iterdir() if x.is_dir()]

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

Found 20843 files belonging to 2 classes.
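Changing the batch size also changes how many gradient updates happen per epoch. A minimal sketch of that arithmetic, using the 20843 training sequences reported above (the helper `steps_per_epoch` is a hypothetical name; the last batch is assumed to be kept even if it is smaller):

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover the dataset once, keeping the last partial batch."""
    return math.ceil(n_samples / batch_size)


print(steps_per_epoch(20843, 64))   # 326
print(steps_per_epoch(20843, 128))  # 163
```

Doubling the batch size halves the number of updates per epoch, which is worth keeping in mind when comparing runs with different epoch counts.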


## Text vectorization

To convert the strings to tensors, we use the TF `TextVectorization` layer, configured to split each sequence into single characters.

In [20]:
vectorize_layer.adapt(train_dset.map(lambda x, y: x))
#vectorize_layer.set_vocabulary(vocabulary=np.asarray(['a', 'c', 't', 'g', 'n']))
vectorize_layer.get_vocabulary()

['', '[UNK]', 't', 'a', 'c', 'g']

In [21]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text)-2, label

train_ds = train_dset.map(vectorize_text)
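The subtraction of 2 in `vectorize_text` shifts the vocabulary indices so that the four nucleotides land on 0 to 3, which is what the one-hot layer with depth 4 expects. The pure-Python sketch below mimics that mapping for illustration only; it is not the TensorFlow implementation. It assumes the vocabulary printed above and that characters outside it map to the unknown token, so they end up as all-zero rows after one-hot encoding (as `tf.one_hot` does for out-of-range indices).

```python
# Vocabulary as printed by vectorize_layer.get_vocabulary() above:
# index 0 is padding, index 1 is the unknown token.
VOCAB = ["", "[UNK]", "t", "a", "c", "g"]
TOKEN_TO_INDEX = {token: i for i, token in enumerate(VOCAB)}


def encode(sequence):
    """Map each character to (vocabulary index - 2), as vectorize_text does."""
    return [TOKEN_TO_INDEX.get(ch, 1) - 2 for ch in sequence.lower()]


def one_hot(indices, depth=4):
    """Indices outside [0, depth) become all-zero rows."""
    return [[1 if i == j else 0 for j in range(depth)] for i in indices]


codes = encode("ACGTN")
print(codes)           # [1, 2, 3, 0, -1]
print(one_hot(codes))  # the last row (for "N") is [0, 0, 0, 0]
```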

## Model training

To provide a baseline that other models can be compared to, the package ships with [a simple CNN model](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/main/src/genomic_benchmarks/models/tf.py). Here we train the modified `model_2` defined above instead. The dataset is vectorized before training to speed up the process.

In [22]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0), binary_f1_score])
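Because the final `Dense(1)` layer has no activation, the model outputs logits, and both metric thresholds act on the logit scale. The sketch below converts the two thresholds used in this notebook to probabilities with the sigmoid function. It assumes that the F1 metric compares the raw model output with its threshold, which is worth verifying against the installed `tensorflow_addons` version.

```python
import math


def sigmoid(logit):
    """Convert a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


print(round(sigmoid(0.0), 3))   # 0.5   (BinaryAccuracy threshold)
print(round(sigmoid(0.75), 3))  # 0.679 (F1 threshold)
```

Under that assumption, the accuracy and F1 metrics use different decision boundaries, so their values are not directly comparable.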

In [23]:
EPOCHS = 10 # original
EPOCHS = 20 # modified

history = model.fit(
    train_ds,
    epochs=EPOCHS)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


**Original model**: 

0.8057 (binary_accuracy), 0.7719 (f1_score)

**Modification 1**:

*(batch size 128 instead of original 64)*

0.8417 (binary_accuracy), 0.8161 (f1_score)

**Modification 2**:

*(batch size 128, 20 epochs)*

0.8560 (binary_accuracy), 0.8340 (f1_score)

**Modification 3**:

*(batch size 128, 20 epochs, threshold 0.75)*

0.8670 (binary_accuracy), 0.8528 (f1_score)

**Modification 4**:

*(batch size 128, 20 epochs, threshold 0.75, model 2)*

0.8840 (binary_accuracy), 0.8594 (f1_score)

**Modification 5**:

*(batch size 128, 20 epochs, threshold 0.75, model 2, dropout 0.1)*

0.9014 (binary_accuracy), 0.8800 (f1_score)

## Evaluation on the test set

Finally, we can do the same pre-processing for the test set and evaluate the F1 score of our model.

In [24]:
test_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'test',
    batch_size=BATCH_SIZE,
    class_names=CLASSES,
    shuffle=False)

test_ds = test_dset.map(vectorize_text)

Found 6948 files belonging to 2 classes.


In [25]:
model.evaluate(test_ds)



[0.9901401996612549, 0.6567357778549194, 0.6625360250473022]
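The list returned by `model.evaluate` is unlabeled. A small sketch for pairing the values with names follows; it assumes the usual Keras order (loss first, then the metrics in the order passed to `compile`) and uses the metric names that appear in the training summaries above. In a live session, `model.metrics_names` should give the actual names.

```python
# Values copied from the evaluation output above.
results = [0.9901401996612549, 0.6567357778549194, 0.6625360250473022]
# Assumed order: loss, then metrics as passed to compile().
names = ["loss", "binary_accuracy", "f1_score"]

labeled = dict(zip(names, results))
for name, value in labeled.items():
    print(f"{name}: {value:.4f}")
```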

**Original:**

0.7012 *(binary_accuracy)*, 0.6755 *(f1_score)*

**Modified:** (best result)

0.6357 *(binary_accuracy)*, 0.7265 *(f1_score)*
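For reference, the F1 score is the harmonic mean of precision and recall, so it can move independently of accuracy. The sketch below computes it from binary labels and predictions; the toy data are made up for illustration and are not results from this notebook, and `binary_f1` is a hypothetical helper name.

```python
def binary_f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: tp=2, fp=1, fn=1, so precision = recall = 2/3.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(round(binary_f1(y_true, y_pred), 3))  # 0.667
```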