# Character-Level CNN Pseudo DNA Classifier

We train a CNN model on a set of positive examples (DNA sequences) vs. negative examples (randomly generated sequences, used here in place of random permutations). The model should learn to recognize this type of genomic sequence (e.g. intergenic sequences).

In [0]:
%tensorflow_version 1.x
import pandas as pd
import numpy as np
from tqdm import tqdm
import random

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop


## Step 1) Read DNA sequences

These sequences were generated in a previous notebook from intergenic regions.

In [10]:
df = pd.read_csv("random_seqs.csv")
print('corpus length:', sum(df.seq.str.len()))
df.shape, len(df.seq[0])

corpus length: 10000000


((50000, 4), 200)

In [11]:
df2 = pd.read_csv("random-seq_2020-04-22.115020_cZE5mD.fasta.txt")
df2.columns = ['seq']
df2 = df2[::2].reset_index().copy()  #ignore FASTA headers
print('corpus length:', sum(df2.seq.str.len()))
df2.shape, len(df2.seq[0])

corpus length: 10000000


((50000, 2), 200)
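
The cell above treats every second row as a sequence, which works as long as each FASTA record takes exactly two lines (header, then sequence). A minimal sketch of a more tolerant reader is shown below, assuming standard FASTA with `>` headers. `read_fasta` is a hypothetical helper that is not part of the original notebook, and the toy input is made up.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())  # normalise lowercase bases
    if header is not None:
        yield header, "".join(chunks)

# Toy input with one wrapped record (hypothetical data)
example = io.StringIO(">seq1\nacgt\nTTGA\n>seq2\nGGCC\n")
records = list(read_fasta(example))
print(records)
```

The resulting pairs could be turned into a data frame with `pd.DataFrame(records, columns=['header', 'seq'])`. Because the sketch upper-cases the bases while reading, no separate upper-casing step would be needed.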

In [19]:
# upper-case all sequences in one vectorized step (avoids chained assignment on a copy)
df2['seq'] = df2.seq.str.upper()
df2.head()


Unnamed: 0,index,seq
0,0,TACGTGTCGGCAAAGCAACATACCAGGCAGTGAAATTGCCCTGCTC...
1,2,ACTCCCAGGATTATCCCAAGACCACTTACAGGGAAGCGGGCTGCAG...
2,4,TAAGGTGCATTTGCACTCTCTGCTGCAGTCTCAAAACTGGAACTCT...
3,6,TGATGCCTGGCTTACAAAAAAAGTAATCCCCTAAGCTCCACTCTGC...
4,8,GTCGATCCACAGAAGACATGTTATAAGTGTACTTACCGAGGTGGAG...


## Step 2) Text preprocessing

For simplicity, we remove every sequence containing `N` (unknown base), reset the index, and shuffle the rows.

In [12]:
containsN = df.seq.str.contains("N")
print(sum(containsN))
df = df[~containsN]

4867


In [20]:
# shuffle the rows, then renumber them 0..n-1 so that label-based access (df.seq[i]) follows the shuffled order
df = df.sample(frac=1).reset_index(drop=True)

assert all(~df.seq.str.contains("N"))
df.shape

(45133, 5)

## Step 3) Permutation

Instead of permuting the real sequences, we use the randomly generated sequences loaded above (`df2`) as negative examples.

In [0]:
#def random_str_shuffle(s):
#  return ''.join(random.sample(s,len(s)))

In [0]:
df['seq_permuted'] = df2.seq[:len(df['seq'])]
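
For comparison, the permutation approach that this step replaces can be sketched as follows. It reuses the commented-out helper above with an optional seeded `random.Random` instance. The seed and the toy sequence are assumptions made only for this illustration. A shuffled sequence keeps the exact base composition of the original, so a classifier cannot separate the two classes by nucleotide frequencies alone.

```python
import random
from collections import Counter

def random_str_shuffle(s, rng=random):
    """Return a random permutation of the characters of s."""
    return "".join(rng.sample(s, len(s)))

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
real = "ACGTTTGACCA"    # toy sequence (hypothetical)
shuffled = random_str_shuffle(real, rng)

# Same length and same base counts; only the order of the bases changes
print(real, shuffled)
print(Counter(real) == Counter(shuffled))
```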

## Step 4) Vectorization

Encode the sequences into `numpy.array`.

In [22]:
# dictionaries to convert characters to numbers and vice-versa
chars = ['A', 'C', 'T', 'G']
num_chars = 4
char_to_indices = dict((c, i) for i, c in enumerate(chars))
indices_to_char = dict((i, c) for i, c in enumerate(chars))

seq_length = len(df.seq[0])
n_seq = df.shape[0]
seq_length, n_seq

(200, 45133)

In [23]:
X = np.zeros((2*n_seq, seq_length, num_chars), dtype=bool)  # np.bool was removed in NumPy 1.24
y = np.zeros((2*n_seq), dtype=bool)

# first half: real sequences (label 1); second half: generated sequences (label 0)
y[:n_seq] = 1

for i in tqdm(range(n_seq)):
    for j in range(seq_length):
        X[i][j][char_to_indices[df.seq[i][j]]] = 1
        X[i+n_seq][j][char_to_indices[df.seq_permuted[i][j]]] = 1


100%|██████████| 45133/45133 [04:57<00:00, 151.64it/s]


In [24]:
X.shape, y.shape

((90266, 200, 4), (90266,))
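
The nested Python loop above takes about five minutes. A vectorized encoding with NumPy is one possible alternative. The sketch below assumes equal-length ASCII sequences over the same four-letter alphabet. `one_hot_encode` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

chars = ['A', 'C', 'T', 'G']  # same column order as in the notebook

def one_hot_encode(seqs, alphabet=chars):
    """Encode equal-length strings as a (n_seq, seq_length, len(alphabet)) bool array."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    # Lookup table from ASCII code to column index; -1 marks unknown characters
    lookup = np.full(256, -1, dtype=np.int64)
    for i, c in enumerate(alphabet):
        lookup[ord(c)] = i
    # View all sequences as one 2-D array of byte codes
    codes = np.frombuffer("".join(seqs).encode("ascii"), dtype=np.uint8)
    idx = lookup[codes.reshape(len(seqs), length)]
    if (idx < 0).any():
        raise ValueError("sequence contains a character outside the alphabet")
    # Comparing each index against 0..len(alphabet)-1 broadcasts into the one-hot layout
    return idx[..., None] == np.arange(len(alphabet))

X_demo = one_hot_encode(["ACTG", "GGAA"])
print(X_demo.shape)
print(X_demo[0].astype(int))
```

With the notebook's data, stacking `one_hot_encode(list(df.seq))` and `one_hot_encode(list(df.seq_permuted))` with `np.concatenate` should give an array of the same shape as `X`, provided the rows are taken in the same order as in the loop.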

## Step 5) Train-Test Split

About two thirds of the data will be used for training and one third for testing (`test_size=0.33`).

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((60478, 200, 4), (29788, 200, 4), (60478,), (29788,))

## Step 6) Model definition

We will use two `Conv1D` layers, with max pooling after the first and global max pooling after the second, followed by a single `Dense` output unit.

In [0]:
model = Sequential()
model.add(layers.Conv1D(num_chars, 8, activation='relu'))
#model.add(layers.Dropout(0.1))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(num_chars, 8, activation='relu'))
#model.add(layers.Dropout(0.1))
model.add(layers.GlobalMaxPooling1D())
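# sigmoid output, so that binary_crossentropy receives a probability in [0, 1]
model.add(layers.Dense(1, activation='sigmoid'))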
model.build()

model.compile(optimizer=RMSprop(lr=2e-3),
              loss='binary_crossentropy',
              metrics=['acc'])


## Step 7) Model training

Each time you run the cell below, the model is trained for another 20 epochs (each training sequence is visited 20 times). About 30 epochs in total seem to be ideal.

In [35]:
history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=128,
                    validation_split=0.2)

Train on 48382 samples, validate on 12096 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
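
The sample counts in the first line of the training log follow from `validation_split=0.2`. The short calculation below reproduces them. It assumes the usual Keras behaviour of holding out the last fraction of the training arrays for validation. The number of batches per epoch is derived here and does not appear in the original output.

```python
import math

n_train_total = 60478      # rows in X_train (from the train-test split)
validation_split = 0.2
batch_size = 128

# Samples used for fitting, with the remainder held out for validation
n_fit = int(n_train_total * (1 - validation_split))
n_val = n_train_total - n_fit
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)   # 48382 12096 378
```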


In [36]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_7 (Conv1D)            (None, 193, 4)            132       
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 38, 4)             0         
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 31, 4)             132       
_________________________________________________________________
global_max_pooling1d_4 (Glob (None, 4)                 0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 5         
Total params: 269
Trainable params: 269
Non-trainable params: 0
_________________________________________________________________
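
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes the Keras defaults used in the model definition: `'valid'` padding and stride 1 for `Conv1D`, and a pooling stride equal to the pool size for `MaxPooling1D`. The helper names are hypothetical.

```python
def conv1d_out_len(n, kernel_size):
    """Output length of a 'valid' Conv1D with stride 1."""
    return n - kernel_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

seq_length, num_chars = 200, 4

n1 = conv1d_out_len(seq_length, 8)        # first Conv1D
n2 = n1 // 5                              # MaxPooling1D(5)
n3 = conv1d_out_len(n2, 8)                # second Conv1D
p_conv = conv1d_params(num_chars, num_chars, 8)
p_dense = num_chars * 1 + 1               # Dense(1) on the 4 globally pooled features
total = 2 * p_conv + p_dense

print(n1, n2, n3)                         # 193 38 31
print(p_conv, p_dense, total)             # 132 5 269
```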


## Step 8) Performance on the test set

Evaluate the trained model on the held-out test set, both overall and separately for real and generated sequences.

In [37]:
model.evaluate(X_test, y_test)



[0.5362805217500499, 0.7409359216690063]

In [38]:
y_pred = model.predict_classes(X_test)
(y_pred[:,0] == y_test).mean()

0.7470457902511078

In [39]:
# accuracy on real sequences
real_only = y_test == 1
model.evaluate(X_test[real_only,:], y_test[real_only])



[0.5249354685016481, 0.7046948075294495]

In [40]:
# accuracy on randomly generated sequences
model.evaluate(X_test[~real_only,:], y_test[~real_only])



[0.5476499775562064, 0.7772549986839294]
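
The two per-class numbers above correspond to sensitivity (accuracy on real sequences) and specificity (accuracy on generated sequences). As a minimal sketch, they can also be computed directly from predicted labels with NumPy. `per_class_accuracy` is a hypothetical helper, and the labels below are toy values, not results from the notebook.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Return (overall accuracy, accuracy on positives, accuracy on negatives)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    correct = y_true == y_pred
    return correct.mean(), correct[y_true].mean(), correct[~y_true].mean()

# Toy labels and predictions (hypothetical values)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 1]
overall, acc_real, acc_random = per_class_accuracy(y_true_demo, y_pred_demo)
print(overall, acc_real, acc_random)   # 0.625 0.75 0.5
```

When the two classes are balanced, as they roughly are here, the overall accuracy is close to the mean of the two per-class values.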

## Step 9) Saving the model

Save the model for later use.

In [0]:
model_filename = 'dna_classifier_rsat.loss{0:.2f}.h5'.format(history.history['loss'][-1])
model.save(model_filename)
#files.download(model_filename)

## Notes

This notebook was inspired by [Convolutional Neural Networks for Sequence Processing: Part 1](https://medium.com/@jon.froiland/convolutional-neural-networks-for-sequence-processing-part-1-420dd9b500). The hyperparameters have not yet been tuned. I have tried adding `Dropout` layers, but they did not improve the metrics.

It is based on an old version of Keras/TF and should be updated.