# Character-Level Pseudo DNA Generator

Based on a sample of genomic sequences, we train an LSTM model to generate pseudo DNA, i.e. sequences that look like real DNA but cannot be found in a genome.

In [1]:
%tensorflow_version 1.x
import keras
import pandas as pd
import numpy as np
from tqdm import tqdm
from google.colab import files    # for downloading files


Using TensorFlow backend.


## Step 1) Read DNA sequences

These sequences were generated in a previous notebook from intergenic regions.

In [2]:
df = pd.read_csv("random_seqs2.csv")
print('corpus length:', sum(df.seq.str.len()))


corpus length: 945448


## Step 2) Text preprocessing

For simplicity, we remove every sequence containing `N` (unknown).

In [0]:
containsN = df.seq.str.contains("N")
sum(containsN)
df = df[~containsN]

assert all(~df.seq.str.contains("N"))

## Step 3) Cut the text in semi-redundant sequences

For training, the text is cut into smaller pieces of the same length. Longer pieces provide more context but need more time and memory for training.

In [0]:
SEQ_LENGTH = 49   # length of sequences
STEP = 20         # shift in cursor between sequences
DEPTH = 1         # number of hidden LSTM/GRU layers
UNIT_SIZE = 64   # number of units per LSTM
DROPOUT = 0.1     # dropout parameter

In [5]:
sentences = list()
targets = list()

for s in df.seq: 
  for i in range(0, len(s) - SEQ_LENGTH - 1, STEP):
    sentences.append(s[i: i + SEQ_LENGTH])
    targets.append(s[i + 1: i + SEQ_LENGTH + 1])
print('number of sequences:', len(sentences))

number of sequences: 34056
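
To see what the windowing does, the same loop can be run on a toy string. This is only an illustrative sketch: `cut_windows` is a hypothetical helper name, and the short `seq_length` and `step` values are chosen for readability, not taken from the settings above.

```python
def cut_windows(seq, seq_length, step):
    """Return (inputs, targets); each target is its input shifted by one base."""
    inputs, targets = [], []
    # Same loop bounds as the cell above.
    for i in range(0, len(seq) - seq_length - 1, step):
        inputs.append(seq[i: i + seq_length])
        targets.append(seq[i + 1: i + seq_length + 1])
    return inputs, targets


toy_inputs, toy_targets = cut_windows("ACGTACGTACGT", seq_length=5, step=3)
for x, t in zip(toy_inputs, toy_targets):
    print(x, "->", t)
# ACGTA -> CGTAC
# TACGT -> ACGTA
```

Each target is its input shifted by one base, so the model learns to predict the next base at every position. Consecutive windows overlap whenever `step` is smaller than `seq_length`, which is what makes them semi-redundant.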


## Step 4) Vectorization

Each base is one-hot encoded. One reason to do this is that feeding raw integer codes into an RNN may not make sense, because it would impose an ordering on a categorical variable.

In [6]:
# dictionaries to convert characters to numbers and vice-versa
chars = ['A', 'C', 'T', 'G']
num_chars = 4
char_to_indices = dict((c, i) for i, c in enumerate(chars))
indices_to_char = dict((i, c) for i, c in enumerate(chars))

# plain `bool` replaces the `np.bool` alias, which newer NumPy releases removed
X = np.zeros((len(sentences), SEQ_LENGTH, num_chars), dtype=bool)
y = np.zeros((len(sentences), SEQ_LENGTH, num_chars), dtype=bool)
for i in tqdm(range(len(sentences))):
    sentence = sentences[i]
    target = targets[i]
    for j in range(SEQ_LENGTH):
        X[i][j][char_to_indices[sentence[j]]] = 1
        y[i][j][char_to_indices[target[j]]] = 1

100%|██████████| 34056/34056 [00:02<00:00, 15710.57it/s]
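
The nested loops above can also be expressed with NumPy indexing. The sketch below is an alternative, not the notebook's method: `one_hot` is a hypothetical helper that encodes a single sequence by selecting rows of an identity matrix.

```python
import numpy as np

chars = ['A', 'C', 'T', 'G']
char_to_indices = {c: i for i, c in enumerate(chars)}


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) boolean one-hot matrix."""
    idx = np.array([char_to_indices[c] for c in seq])
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(len(chars), dtype=bool)[idx]


encoded = one_hot("ACTG")
print(encoded.shape)            # (4, 4)
print(encoded.astype(int))

# Decoding takes the position of the single True value in each row.
decoded = "".join(chars[i] for i in encoded.argmax(axis=1))
print(decoded)                  # ACTG
```

With one-hot vectors every base is equally distant from every other base, whereas integer codes such as 0 to 3 would suggest that `A` is closer to `C` than to `G`.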


## Step 5) Model definition

One, two (or three) layers of LSTM with dropout, followed by a densely connected layer and softmax. The LSTM could be replaced by a GRU (`keras.layers.GRU`), and the RMSprop optimizer by SGD or Adam.

In [7]:
model = keras.models.Sequential()
for _ in range(DEPTH):
    model.add(keras.layers.LSTM(UNIT_SIZE, input_shape=(None, num_chars), return_sequences=True))
    model.add(keras.layers.Dropout(DROPOUT))
model.add(keras.layers.TimeDistributed(keras.layers.Dense(num_chars)))
model.add(keras.layers.Activation('softmax'))





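
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes `DEPTH = 1`, `UNIT_SIZE = 64`, four input characters, and the standard LSTM layout with four gates, each having an input kernel, a recurrent kernel, and a bias. The helper names are hypothetical; the totals should agree with what `model.summary()` reports.

```python
def lstm_param_count(input_dim, units):
    # Four gates (input, forget, cell, output), each with an input kernel,
    # a recurrent kernel, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units


NUM_CHARS, UNITS = 4, 64
lstm_params = lstm_param_count(NUM_CHARS, UNITS)
dense_params = dense_param_count(UNITS, NUM_CHARS)
print(lstm_params, dense_params, lstm_params + dense_params)
# 17664 260 17924
```

Dropout and the softmax activation add no trainable parameters.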


In [8]:
optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])





## Helper functions: Generating text from the model

The function `sample` takes the trained model and returns text generated from it, starting from `seed_string`. Lower temperatures make the output more conservative and repetitive (but avoid wildly improbable choices); higher temperatures make it more varied.

In [0]:
def multinomial_with_temperature(preds, temperature=1.0):
    """
    Helper function to sample from a multinomial distribution (+adj. for temperature)
    """ 
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-8) / temperature  
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def sample(model, char_to_indices, indices_to_char, 
           seed_string="A", temperature=0.2, test_length=199):
    """
    Generates text of test_length length from model starting with seed_string.
    """
    num_chars = len(char_to_indices.keys())
    for i in range(test_length):
        test_in = np.zeros((1, len(seed_string), num_chars))
        for t, char in enumerate(seed_string):
            test_in[0, t, char_to_indices[char]] = 1
        entire_prediction = model.predict(test_in, verbose=0)[0]
        next_index = multinomial_with_temperature(entire_prediction[-1], temperature)
        next_char = indices_to_char[next_index]
        seed_string = seed_string + next_char
    return seed_string
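
The effect of temperature is easiest to see on a fixed distribution. The sketch below repeats the rescaling step from `multinomial_with_temperature` without the random draw. `apply_temperature` and the example probabilities are made up for illustration, and subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import numpy as np


def apply_temperature(probs, temperature):
    """Rescale a probability vector: p_i ** (1 / temperature), renormalized."""
    logits = np.log(np.asarray(probs, dtype='float64') + 1e-8) / temperature
    exp_logits = np.exp(logits - logits.max())
    return exp_logits / exp_logits.sum()


probs = [0.4, 0.3, 0.2, 0.1]   # hypothetical next-base probabilities
for t in (0.2, 0.7, 1.0, 2.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

At temperature 1.0 the distribution is unchanged. Below 1.0 the most likely base takes an even larger share, and above 1.0 the distribution moves toward uniform.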

## Step 6) Model training

Each time you run the code below, the model is trained for another 15 epochs (each sequence is visited 15 times). If the quality of the predictions is not sufficient, you can run it again to add 15 more epochs, and so on.

In [10]:
history = model.fit(X, y,
            batch_size=1024,
            epochs=15)




Epoch 1/15





Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
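
When reading the training loss, it helps to have a reference point. A model that ignores its input and predicts each of the four bases with probability 0.25 has a categorical cross-entropy of ln 4. The sketch below just computes that number; the accuracy baseline of 0.25 additionally assumes the four bases are equally frequent, which real genomic sequence only approximates.

```python
import math

num_chars = 4

# Cross-entropy of a uniform prediction: -log(1 / num_chars).
baseline_loss = -math.log(1.0 / num_chars)
# Expected accuracy of guessing, assuming equally frequent bases.
baseline_accuracy = 1.0 / num_chars

print(round(baseline_loss, 4), baseline_accuracy)
# 1.3863 0.25
```

A trained model is only learning something about sequence context if its loss falls below this value.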


## Step 7) Generate pseudo genomic sequence

Generate a pseudogenomic sequence from the model trained above.

In [11]:
%%time
random_seed = np.random.choice(chars)
random_seq = sample(model, char_to_indices=char_to_indices, indices_to_char=indices_to_char, 
                    seed_string=random_seed, temperature=0.7)
print(len(random_seq), random_seq)

200 AGGCATGCTCCTGAAGTCCTGAATTTCAGGCGATCCACCTGCCTCAGCCTCCCAAGGTTCTGGGATTTACAGACACGCGACAACCACTCTACAAGACCTGGCTTCTCAGACCCTTGGGGGGGCTGTGTTGTGAAGGCAGGAATTCGGAGCACTTTGGGAGGCTTATGTGGCACATTACACACAAAAAAAGTTGCTGTAAG
CPU times: user 5.65 s, sys: 41.9 ms, total: 5.69 s
Wall time: 5.56 s


In [12]:
def one_generated_sequence():
  random_seed = np.random.choice(chars)
  return sample(model, char_to_indices=char_to_indices, indices_to_char=indices_to_char, 
                    seed_string=random_seed, temperature=0.7)

N = 1000
generated_seqs = [one_generated_sequence() for i in tqdm(range(N))]

100%|██████████| 1000/1000 [1:27:59<00:00,  5.26s/it]
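
The loop above calls `sample` once per sequence, and the progress bar shows roughly 5 s per sequence, about 88 minutes in total. One possible speed-up, sketched here under assumptions, is to grow all sequences in parallel so that each position costs one batched prediction call. `sample_batch` and `predict_fn` are hypothetical names; `predict_fn` is assumed to map a one-hot array of shape `(batch, time, num_chars)` to probabilities of the same shape, as `model.predict` does for the model defined above. The uniform stand-in below only makes the sketch runnable without a trained model, and `np.random.default_rng` requires a reasonably recent NumPy.

```python
import numpy as np


def sample_batch(predict_fn, chars, n_seqs, length=200, temperature=0.7, rng=None):
    """Generate n_seqs sequences in parallel, one predict call per position."""
    rng = np.random.default_rng() if rng is None else rng
    num_chars = len(chars)
    # Start every sequence from a random base.
    indices = rng.integers(num_chars, size=(n_seqs, 1))
    for _ in range(length - 1):
        one_hot = np.eye(num_chars)[indices]          # (n_seqs, t, num_chars)
        probs = np.asarray(predict_fn(one_hot))[:, -1, :].astype('float64')
        # Temperature rescaling, as in multinomial_with_temperature.
        logits = np.log(probs + 1e-8) / temperature
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Inverse-CDF sampling: one uniform draw per sequence.
        draws = rng.random((n_seqs, 1))
        next_idx = (probs.cumsum(axis=1) < draws).sum(axis=1)
        next_idx = np.minimum(next_idx, num_chars - 1)  # guard against rounding
        indices = np.concatenate([indices, next_idx[:, None]], axis=1)
    return ["".join(chars[i] for i in row) for row in indices]


def uniform_predict(x):
    # Stand-in for model.predict: every base equally likely at every position.
    return np.full(x.shape, 0.25)


demo_seqs = sample_batch(uniform_predict, ['A', 'C', 'T', 'G'], n_seqs=3,
                         length=10, rng=np.random.default_rng(0))
print(demo_seqs)
```

With the trained model, the call might look like `sample_batch(lambda x: model.predict(x, verbose=0), chars, n_seqs=1000)`. This variant still re-runs the whole prefix at every step, as `sample` does, so the gain comes only from batching.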


In [14]:
df_output = pd.DataFrame.from_dict({'generated_seqs': generated_seqs})
df_output.head()

| | generated_seqs |
|---|---|
| 0 | TTGTATCATATATATATTTTTTTAAATTTTTTATATACTATTTATA... |
| 1 | CAAACTAGAAGTAAAGAAATATAATGCTTAATTTTTTGTTTTAATA... |
| 2 | GCACACACACTCACACATATCTGCATTTGTGTGGGCTGAAAGATGT... |
| 3 | TACATTGGCACATGCTCCACTACAGGAAGCTGAACTCCCTTTGAGA... |
| 4 | TGTGCAGCAGGAATGATTGTGACAATGAGATTGATTTATTTCTTTT... |
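
One simple way to check whether generated sequences resemble the training data is to compare their base composition. The sketch below uses two hypothetical helpers, `gc_content` and `has_only_bases`, on toy strings that are not model output.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def has_only_bases(seq, alphabet="ACGT"):
    """True if the sequence contains no characters outside the alphabet."""
    return set(seq) <= set(alphabet)


toy = ["ACGT", "AAAA", "GGCC"]   # toy strings, not model output
for s in toy:
    print(s, gc_content(s), has_only_bases(s))
# ACGT 0.5 True
# AAAA 0.0 True
# GGCC 1.0 True
```

In the notebook, `df_output.generated_seqs.map(gc_content).describe()` could be compared with `df.seq.map(gc_content).describe()` to see whether the two distributions are similar.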


## Step 8) Saving the model and seqs

Save the model and the generated sequences for later use.

In [0]:
df_output.to_csv('generated_seqs.csv', index=False)

model_filename = 'random_dna.loss{0:.2f}.h5'.format(history.history['loss'][-1])
model.save(model_filename)
#files.download(model_filename)

## Notes

This notebook is based on my [Nietzsche-like text generator](https://github.com/karlafej/keras_pyconCZ/blob/master/04-Nietzsche_text_generation.ipynb) from the PyCon CZ 2017 workshop. It was adapted from Michael Zhang's [Char-RNN](https://github.com/michaelrzhang/Char-RNN) and the [lstm_text_generation.py](https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py) example in the Keras GitHub repository. Both were inspired by Andrej Karpathy's blog post [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

It is based on an old version of Keras/TF and should be updated.