# Character-Level CNN Pseudo DNA Classifier

We train a CNN model on a set of positive examples (DNA sequences) versus their random permutations. The model should learn to recognize this type of genomic sequence (e.g. intergenic sequences).

In [1]:
%tensorflow_version 1.x
import pandas as pd
import numpy as np
from tqdm import tqdm
import random

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop


Using TensorFlow backend.


## Step 1) Read DNA sequences

These sequences were generated in a previous notebook from intergenic regions.

In [5]:
df = pd.read_csv("random_seqs.csv")
print('corpus length:', sum(df.seq.str.len()))
df.shape

corpus length: 10000000


(50000, 4)

## Step 2) Text preprocessing

For simplicity, we remove every sequence containing `N` (an unknown base), reset the index, and shuffle the rows.

In [6]:
containsN = df.seq.str.contains("N")
print(sum(containsN))
df = df[~containsN]

4867


In [7]:
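# shuffle the rows first, then renumber them 0..n-1 so that the later
# label-based access (df.seq[i]) follows the shuffled order
df = df.sample(frac=1).reset_index(drop=True)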

assert all(~df.seq.str.contains("N"))
df.shape

(45133, 4)
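
The filter above only targets `N`. A slightly more general check, sketched here in plain Python as an assumption rather than as part of the original pipeline, keeps only sequences made up entirely of the four bases encoded in Step 4 (any other character would raise a `KeyError` in the `char_to_indices` lookup). The helper name `is_clean` and the toy sequences are hypothetical.

```python
VALID_BASES = set("ACGT")

def is_clean(seq):
    """Return True if the sequence contains only A, C, G and T."""
    return set(seq) <= VALID_BASES

# hypothetical toy data: two clean sequences, one with N, one with another ambiguity code
toy_seqs = ["ACGTAC", "ACNNGT", "GGGTTA", "ACRGTA"]
clean_seqs = [s for s in toy_seqs if is_clean(s)]
print(clean_seqs)
```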

## Step 3) Permutation

For each sequence, create a randomly permuted copy to serve as a negative example.

In [0]:
def random_str_shuffle(s):
  return ''.join(random.sample(s,len(s)))

In [0]:
df['seq_permuted'] = df.seq.apply(random_str_shuffle)
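
A permutation keeps the number of each base unchanged, so the classifier cannot separate the two classes by base composition alone and has to rely on the order of the bases. The following minimal sketch illustrates this on a hypothetical toy sequence; the fixed seed is only there to make the illustration repeatable.

```python
import random
from collections import Counter

def random_str_shuffle(s):
    # same helper as in the notebook: a random permutation of the characters
    return ''.join(random.sample(s, len(s)))

random.seed(0)             # fixed seed, for a repeatable illustration only
seq = "ACGTACGTTTGACCA"    # hypothetical toy sequence
permuted = random_str_shuffle(seq)

# base counts and length are preserved; only the order changes
same_composition = Counter(seq) == Counter(permuted)
same_length = len(seq) == len(permuted)
print(permuted, same_composition, same_length)
```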

## Step 4) Vectorization

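One-hot encode the sequences into a boolean NumPy array of shape `(2 * n_seq, seq_length, 4)`: real sequences first, then their permuted versions.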

In [10]:
# dictionaries to convert characters to numbers and vice-versa
chars = ['A', 'C', 'T', 'G']
num_chars = 4
char_to_indices = dict((c, i) for i, c in enumerate(chars))
indices_to_char = dict((i, c) for i, c in enumerate(chars))

seq_length = len(df.seq[0])
n_seq = df.shape[0]
seq_length, n_seq

(200, 45133)

In [11]:
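# one-hot encoded sequences: real ones in the first half, permuted ones in the second
X = np.zeros((2*n_seq, seq_length, num_chars), dtype=bool)
# labels: 1 = real sequence, 0 = permuted sequence
y = np.zeros(2*n_seq, dtype=bool)
y[:n_seq] = 1

for i in tqdm(range(n_seq)):
    seq, seq_permuted = df.seq[i], df.seq_permuted[i]
    for j in range(seq_length):
        X[i, j, char_to_indices[seq[j]]] = 1
        X[i+n_seq, j, char_to_indices[seq_permuted[j]]] = 1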


100%|██████████| 45133/45133 [04:03<00:00, 191.71it/s]


In [12]:
X.shape, y.shape

((90266, 200, 4), (90266,))
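
According to the progress bar, the character-by-character loop took about four minutes. The same encoding can be computed with NumPy indexing, as in the sketch below. It is not the notebook's original code: the function name `one_hot_encode` is hypothetical, and it assumes that all sequences have the same length and contain only the four characters in `chars`.

```python
import numpy as np

chars = ['A', 'C', 'T', 'G']  # same column order as in the notebook

def one_hot_encode(seqs):
    """Encode equal-length DNA strings as a (n, length, 4) boolean array."""
    # lookup table from ASCII code to column index; -1 marks unexpected characters
    lookup = np.full(256, -1, dtype=np.int64)
    for i, c in enumerate(chars):
        lookup[ord(c)] = i
    # view the concatenated sequences as a flat array of byte codes
    codes = np.frombuffer(''.join(seqs).encode('ascii'), dtype=np.uint8)
    idx = lookup[codes].reshape(len(seqs), -1)
    if (idx < 0).any():
        raise ValueError("sequence contains a character outside A/C/T/G")
    # comparing against 0..3 broadcasts to shape (n, length, 4)
    return idx[..., None] == np.arange(len(chars))

demo = one_hot_encode(["ACTG", "GGTA"])
print(demo.shape)
```

In the notebook, `X` could then be assembled as `np.concatenate([one_hot_encode(df.seq), one_hot_encode(df.seq_permuted)])`, with `y` set to 1 for the first half and 0 for the second.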

## Step 5) Train-Test Split

Two thirds of data will be used for training, one third for testing.

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((60478, 200, 4), (29788, 200, 4), (60478,), (29788,))
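
The split sizes follow from simple arithmetic: with a fractional `test_size`, scikit-learn rounds the test share up and assigns the remainder to training. The short check below reproduces the shapes printed above.

```python
import math

n_samples = 2 * 45133   # real + permuted sequences
test_size = 0.33

n_test = math.ceil(test_size * n_samples)   # the test share is rounded up
n_train = n_samples - n_test                # the remainder is used for training
print(n_train, n_test)
```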

## Step 6) Model definition

We use two `Conv1D` layers, the first followed by max pooling and the second by global max pooling, and a final `Dense` layer with a single output.

In [14]:
model = Sequential()
model.add(layers.Conv1D(num_chars, 7, activation='relu'))
#model.add(layers.Dropout(0.1))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(num_chars, 7, activation='relu'))
#model.add(layers.Dropout(0.1))
model.add(layers.GlobalMaxPooling1D())
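# NOTE: the output has no sigmoid activation, although 'binary_crossentropy'
# expects probabilities by default; consider Dense(1, activation='sigmoid')
model.add(layers.Dense(1))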
model.build()

model.compile(optimizer=RMSprop(lr=5e-5),
              loss='binary_crossentropy',
              metrics=['acc'])






## Step 7) Model training

Each run of the cell below trains the model for 10 more epochs (each training sequence is visited 10 times); the weights carry over between runs. About 30 epochs in total (three runs of the cell) seem to work best.

In [19]:
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

Train on 48382 samples, validate on 12096 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
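
With `validation_split=0.2`, Keras holds out the last 20% of the training arrays for validation, selected before any shuffling. Because `train_test_split` has already shuffled the data, this tail is a random subset. The sample counts reported above can be reproduced as follows.

```python
n_train = 60478
validation_split = 0.2

# the first part is used for fitting, the tail is held out for validation
split_at = int(n_train * (1.0 - validation_split))
n_fit, n_val = split_at, n_train - split_at
print(n_fit, n_val)
```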


In [20]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 194, 4)            116       
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 38, 4)             0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 32, 4)             116       
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 5         
Total params: 237
Trainable params: 237
Non-trainable params: 0
_________________________________________________________________
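
The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used in Step 6 (`'valid'` padding, stride 1 for convolutions, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size):
    # 'valid' padding, stride 1: the kernel must fit entirely inside the input
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size):
    # stride equals pool_size; an incomplete trailing window is dropped
    return length // pool_size

def conv1d_params(kernel_size, in_channels, filters):
    # one weight per (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    return in_features * units + units

seq_length, num_chars = 200, 4
len1 = conv1d_out_len(seq_length, 7)   # conv1d_1
len2 = maxpool1d_out_len(len1, 5)      # max_pooling1d_1
len3 = conv1d_out_len(len2, 7)         # conv1d_2

total_params = 2 * conv1d_params(7, num_chars, num_chars) + dense_params(num_chars, 1)
print(len1, len2, len3, total_params)
```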


## Step 8) Performance on the test set

Evaluate the trained model on the held-out test set, overall and separately for real and permuted sequences.

In [21]:
model.evaluate(X_test, y_test)



[0.4253279036710098, 0.8215053041253103]

In [22]:
y_pred = model.predict_classes(X_test)
(y_pred[:,0] == y_test).mean()

0.8226802739358131

In [25]:
# accuracy on real sequences
real_only = y_test == 1
model.evaluate(X_test[real_only,:], y_test[real_only])



[0.4246123458338136, 0.8291750503178014]

In [26]:
# accuracy on permuted (not real) sequences
model.evaluate(X_test[~real_only,:], y_test[~real_only])



[0.42604500114204646, 0.8138190616537669]
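
The per-class numbers can also be computed directly from the predictions. For a single-output model, `predict_classes` thresholds the output at 0.5, and the method is absent from newer Keras releases, so the sketch below applies the threshold explicitly with NumPy. The function name and the toy labels and scores are hypothetical; in the notebook, `scores` would be `model.predict(X_test)`.

```python
import numpy as np

def per_class_accuracy(y_true, scores, threshold=0.5):
    """Return (overall, real, permuted) accuracy for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(scores).ravel() > threshold
    correct = y_pred == y_true
    return correct.mean(), correct[y_true].mean(), correct[~y_true].mean()

# hypothetical toy labels (1 = real, 0 = permuted) and model outputs
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
scores = np.array([0.9, 0.7, 0.2, 0.1, 0.6, 0.3, 0.4, 0.8])
overall, acc_real, acc_permuted = per_class_accuracy(y_true, scores)
print(overall, acc_real, acc_permuted)
```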

## Step 9) Saving the model

Save the model for later use.

In [0]:
model_filename = 'dna_classifier.loss{0:.2f}.h5'.format(history.history['loss'][-1])
model.save(model_filename)
#files.download(model_filename)
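
The file name embeds the training loss of the last epoch, rounded to two decimals. As a small illustration with a hypothetical loss value:

```python
final_loss = 0.4253  # hypothetical value standing in for history.history['loss'][-1]
model_filename = 'dna_classifier.loss{0:.2f}.h5'.format(final_loss)
print(model_filename)
```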

## Notes

This notebook was inspired by [Convolutional Neural Networks for Sequence Processing: Part 1](https://medium.com/@jon.froiland/convolutional-neural-networks-for-sequence-processing-part-1-420dd9b500). The hyperparameters have not yet been tuned. I tried adding `Dropout` layers, but they did not improve the metrics.

The notebook is based on an old version of Keras/TensorFlow (1.x) and should be updated.