# Character-level neural language model - bacterial species

Suppose we isolate novel bacterial species from the human microbiome that have not yet been described or named. How do we name them so that the names sound plausible? To come up with new names, we can take inspiration from the names of existing bacterial species. Here I train a character-level language model on bacterial species names. The names are compiled from bacterial isolates of the human microbiome that I recently used in my paper.

In [1]:
import numpy as np
from keras import Sequential
from keras.layers import Dense, LSTM, TimeDistributed
from keras.optimizers import Adam

Using TensorFlow backend.


## Prepare data

Let's take a look at the species data. I use @ as the end-of-sequence token.

In [2]:
species = []
with open("bacterial_species.txt", "r") as h:
    for line in h:
        species.append(line.replace("\n", "@").lower())
print("Number of species = {}".format(len(species)))

char_set = set()
for i in species:
    for j in i:
        char_set.add(j)
n_x = len(char_set)
print("Number of unique characters = {}".format(n_x))
print("Examples:")
species[:6]

Number of species = 718
Number of unique characters = 28
Examples:


['bifidobacterium_longum@',
 'escherichia_coli@',
 'staphylococcus_aureus@',
 'gardnerella_vaginalis@',
 'mageeibacillus_indolicus@',
 'cutibacterium_acnes@']

We need mappings to convert between indices and characters.

In [3]:
char_list = sorted(char_set)
# map each character to its position in the sorted character list
char_map = {char: idx for idx, char in enumerate(char_list)}

def char2idx(char):
    return char_map[char]

def idx2char(idx):
    return char_list[idx]
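As a quick sanity check of the two mappings, the sketch below encodes a name to indices and decodes it back. It is self-contained and assumes the 28 characters are `@`, `_` and the 26 lowercase letters, which matches the reported count but is not confirmed by the data file; `encode` and `decode` are illustrative helper names. Under that assumption, `@` and `_` sort before the letters and take indices 0 and 1, which is consistent with the later sampling code drawing the first letter from index 2 upwards.

```python
import string

# Assumed vocabulary: '@', '_' and the lowercase letters (28 characters).
char_list = sorted(set("@_" + string.ascii_lowercase))
char_map = {char: idx for idx, char in enumerate(char_list)}

def encode(name):
    """Map each character of a name to its integer index."""
    return [char_map[c] for c in name]

def decode(indices):
    """Map a sequence of integer indices back to a string."""
    return "".join(char_list[i] for i in indices)

name = "escherichia_coli@"
indices = encode(name)
round_trip = decode(indices)

print(char_list[:3])        # the three lowest-sorting characters
print(round_trip == name)   # encoding followed by decoding is lossless
```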

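Next, we convert each character index to a one-hot vector. In theory, we don't need zero padding if we process one sequence at a time. To train with mini-batch gradient descent on several sequences at once, however, all sequences in a batch must have the same length, so shorter names are zero-padded up to the maximum length.

Note that the input sequence is the species name without its final token, and the output sequence is the same name shifted by one position: the target at time step t is the character at position t + 1, and the last target is the end-of-sequence token @.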

In [4]:
maxlen = max([len(i) - 1 for i in species])
print("Maximum sequence length = {}".format(maxlen))

# one-hot encoding with zero padding
X = np.zeros((len(species), maxlen, n_x))
Y = np.zeros((len(species), maxlen, n_x))
for i, chars in enumerate(species):
    for t in range(len(chars) - 1):
        X[i, t, char2idx(chars[t])] = 1
        # label is the next character (the input shifted by one step)
        Y[i, t, char2idx(chars[t + 1])] = 1
print("Shape of training data = {}".format(X.shape))

Maximum sequence length = 37
Shape of training data = (718, 37, 28)
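To make the shift and the padding concrete, here is a small sketch that applies the same encoding to two toy names and decodes the arrays back to text. The two names and the `decode` helper are chosen for illustration only; all-zero (padded) rows are rendered as `.`.

```python
import numpy as np

names = ["coli@", "acnes@"]          # toy data, not the real species list
chars = sorted(set("".join(names)))
to_idx = {c: i for i, c in enumerate(chars)}
maxlen = max(len(n) - 1 for n in names)

# Same scheme as above: X holds characters 0..len-2, Y holds characters 1..len-1.
X = np.zeros((len(names), maxlen, len(chars)))
Y = np.zeros_like(X)
for i, n in enumerate(names):
    for t in range(len(n) - 1):
        X[i, t, to_idx[n[t]]] = 1
        Y[i, t, to_idx[n[t + 1]]] = 1

def decode(one_hot):
    """Turn one-hot rows back into characters; all-zero rows become '.'."""
    return "".join(chars[row.argmax()] if row.any() else "." for row in one_hot)

x0, y0 = decode(X[0]), decode(Y[0])
x1, y1 = decode(X[1]), decode(Y[1])
print(x0, "->", y0)   # coli. -> oli@.
print(x1, "->", y1)   # acnes -> cnes@
```

The shorter name ends with a padded step in both arrays, and every target row is the character that follows the corresponding input row.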


## Many-to-many sequence model

I use an RNN with LSTM cells for the character-level language model. At each time step, the cell takes the current character and the previous hidden state as input, and outputs a prediction for the next character.
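For intuition about what happens at a single time step, the following NumPy sketch runs one step of a standard LSTM cell followed by a softmax output layer. It is not the Keras implementation: the weights are random, and the names `lstm_step`, `W`, `U`, `b`, `W_out` and `b_out` as well as the gate ordering are assumptions of this sketch. The sizes (28 input characters, 100 hidden units) follow the model defined in this section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (n_x, 4*n_h), U: (n_h, 4*n_h), b: (4*n_h,)."""
    n_h = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:n_h])               # input gate
    f = sigmoid(z[n_h:2 * n_h])        # forget gate
    g = np.tanh(z[2 * n_h:3 * n_h])    # candidate cell state
    o = sigmoid(z[3 * n_h:])           # output gate
    c_t = f * c_prev + i * g           # new cell state
    h_t = o * np.tanh(c_t)             # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_x, n_h = 28, 100
W = rng.normal(scale=0.1, size=(n_x, 4 * n_h))
U = rng.normal(scale=0.1, size=(n_h, 4 * n_h))
b = np.zeros(4 * n_h)
W_out = rng.normal(scale=0.1, size=(n_h, n_x))
b_out = np.zeros(n_x)

# Current character as a one-hot vector, zero initial states.
x_t = np.zeros(n_x)
x_t[2] = 1
h, c = np.zeros(n_h), np.zeros(n_h)

h, c = lstm_step(x_t, h, c, W, U, b)
p_next = softmax(h @ W_out + b_out)    # distribution over the next character

n_params = W.size + U.size + b.size + W_out.size + b_out.size
print(p_next.shape, n_params)
```

Counting the arrays gives 4 × (28 × 100 + 100 × 100 + 100) = 51,600 weights for the recurrent cell and 100 × 28 + 28 = 2,828 for the output layer, so a model of this size is small enough to train quickly on a few hundred names.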

In [5]:
model = Sequential()
model.add(LSTM(100, input_shape=(None, n_x), return_sequences=True))
model.add(TimeDistributed(Dense(n_x, activation="softmax")))
optimizer = Adam(lr=0.05, beta_1=0.9, beta_2=0.999, decay=0.01)

model.compile(loss='categorical_crossentropy', optimizer=optimizer)
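One consequence of zero padding is worth spelling out. With one-hot targets, categorical cross-entropy at a time step is the negative log-probability assigned to the true character; at a padded step the target row is all zeros, so that step contributes zero loss. The sketch below illustrates this with a hand-written loss function on toy numbers (it does not call Keras). If the loss is averaged over all time steps, padded ones included, which is my understanding of the default behaviour without masking, the reported value is lower than the average over real characters only.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-time-step cross-entropy: -sum_k y_true[k] * log(y_pred[k])."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -(y_true * np.log(y_pred)).sum(axis=-1)

# Toy sequence: 3 time steps, 4 classes; the last step is zero padding.
y_true = np.array([[0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 0]], dtype=float)
y_pred = np.array([[0.10, 0.70, 0.10, 0.10],
                   [0.20, 0.20, 0.50, 0.10],
                   [0.25, 0.25, 0.25, 0.25]])

per_step = categorical_crossentropy(y_true, y_pred)
mean_all = per_step.mean()                # average over all steps, padding included
n_real = int(y_true.sum())                # number of non-padded steps
mean_real = per_step.sum() / n_real       # average over real characters only

print(per_step)
print(mean_all, mean_real)
```

This means the loss values printed during training are best read as a relative measure of progress rather than as a per-character cross-entropy.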

I train the model for 101 epochs. To see how the model improves over time, every 10 epochs I report the cross-entropy loss on the training set and also generate a random species name from the current model.

In [6]:
for epoch in range(101):
    if epoch % 10 > 0:
        model.fit(X, Y, batch_size=32, epochs=1, verbose=0)
        continue
    
    print("Epoch {}:".format(epoch))
    model.fit(X, Y, batch_size=32, epochs=1, verbose=2)
    generated = []
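    # randomly select the first letter; starting at index 2 skips the first two
    # entries of char_list (presumably '@' and '_', which sort before the letters)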
    first_idx = np.random.randint(2, n_x)
    x = np.zeros((1, 1, n_x))
    x[:, :, first_idx] = 1
    generated.append(idx2char(first_idx))
    # generate sequence until reach end-of-sequence token
    while True:
        # predict distribution of next character
        y_pred = model.predict(x, verbose=0)[0][-1]
        # sample next character
        idx_sampled = np.random.choice(n_x, size=1, p=y_pred)[0]
        char = idx2char(idx_sampled)
        if char == "@":
            break
        generated.append(char)
        x_new = np.zeros((1, 1, n_x))
        x_new[:, :, idx_sampled] = 1
        x = np.concatenate((x, x_new), axis=1)
    print("Randomly generated species name: {}\n".format("".join(generated)))

Epoch 0:
Epoch 1/1
 - 1s - loss: 1.6246
Randomly generated species name: hweopirum_serun

Epoch 10:
Epoch 1/1
 - 1s - loss: 0.5241
Randomly generated species name: xaethia_melgae

Epoch 20:
Epoch 1/1
 - 1s - loss: 0.3405
Randomly generated species name: fusobacterium_timas

Epoch 30:
Epoch 1/1
 - 1s - loss: 0.2519
Randomly generated species name: oligella_ornichii

Epoch 40:
Epoch 1/1
 - 1s - loss: 0.2027
Randomly generated species name: klebsiella_aerogenes

Epoch 50:
Epoch 1/1
 - 1s - loss: 0.1814
Randomly generated species name: rovhoibacter_lepei

Epoch 60:
Epoch 1/1
 - 1s - loss: 0.1609
Randomly generated species name: actinomyces_dentasseri

Epoch 70:
Epoch 1/1
 - 1s - loss: 0.1501
Randomly generated species name: zaphomonas_odontolotihae

Epoch 80:
Epoch 1/1
 - 1s - loss: 0.1518
Randomly generated species name: prevotella_shomensis

Epoch 90:
Epoch 1/1
 - 1s - loss: 0.1380
Randomly generated species name: clostridium_gillveaseneens

Epoch 100:
Epoch 1/1
 - 1s - loss: 0.1387
Rand

The names generated in the first few epochs are less plausible, while those from the last few epochs look realistic!
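The sampling loop used during training can also be factored into a reusable function. The sketch below is one way to do that, under a few assumptions: `predict_next` is a hypothetical callable that returns the next-character distribution for a given text (in the notebook it would wrap `model.predict` on the one-hot encoded text), and `max_len` is an added safeguard so that the loop stops even if the end-of-sequence token is never sampled. The probabilities are renormalized before sampling, because small rounding errors in model output can make NumPy reject a vector that does not sum to 1 closely enough. A deterministic stand-in model is used here so the example runs without Keras.

```python
import numpy as np

def sample_name(predict_next, char_list, prefix, max_len=50, rng=None):
    """Extend `prefix` one character at a time until '@' or `max_len`.

    `predict_next(text)` must return a probability vector over `char_list`
    for the character that follows `text`.
    """
    if rng is None:
        rng = np.random.default_rng()
    text = prefix
    while len(text) < max_len:
        p = np.asarray(predict_next(text), dtype=float)
        p = p / p.sum()                      # renormalize before sampling
        char = char_list[rng.choice(len(char_list), p=p)]
        if char == "@":                      # end-of-sequence token
            break
        text += char
    return text

# Stand-in "model": always predicts the next character of a fixed name.
char_list = sorted(set("@_abcdefghijklmnopqrstuvwxyz"))
target = "escherichia_coli@"

def toy_predict(text):
    p = np.zeros(len(char_list))
    p[char_list.index(target[len(text)])] = 1.0
    return p

name = sample_name(toy_predict, char_list, prefix="e")
truncated = sample_name(toy_predict, char_list, prefix="e", max_len=5)
print(name)
print(truncated)
```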

## Coming up with new species names

Suppose we have isolated a novel Staphylococcus species and want to give it a new name. For inspiration, we use the language model to sample a few names.

In [7]:
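# NOTE: this prefix omits the "y" of "staphylococcus_"; the output below was generated with this spelling
prefix = "staphlococcus_"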
staph = np.zeros((1, len(prefix), n_x))
for t in range(len(prefix)):
    staph[0, t, char2idx(prefix[t])] = 1

# complete species name
candidates = set()
for _ in range(10):
    x = staph.copy()
    generated = []
    while True:
        y_pred = model.predict(x, verbose=0)[0][-1]
        # sample next character
        idx_sampled = np.random.choice(n_x, size=1, p=y_pred)[0]
        char = idx2char(idx_sampled)
        if char == "@":
            break
        generated.append(char)
        x_new = np.zeros((1, 1, n_x))
        x_new[:, :, idx_sampled] = 1
        x = np.concatenate((x, x_new), axis=1)
    candidates.add(prefix + "".join(generated))

print("10 synthesized Staphlococcus species:")
for i in candidates:
    print(i)

10 synthesized Staphlococcus species:
staphlococcus_petorim
staphlococcus_hadrisienseriflavii
staphlococcus_dutri
staphlococcus_egerinolyticus
staphlococcus_peroris
staphlococcus_warneri
staphlococcus_pseudoacacipa
staphlococcus_derdolyticus
staphlococcus_eungloifaci
staphlococcus_sthuni
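Sampled completions can coincide with names that already exist; the epithet warneri in the output above, for example, already belongs to a described Staphylococcus species. Before adopting a candidate, it is therefore worth filtering out names that are already known. The following is a minimal sketch of such a filter; `novel_candidates` is an illustrative helper, and the two small collections are made-up stand-ins for the training list and the sampled set.

```python
def novel_candidates(candidates, known_species):
    """Return the candidates that do not appear among the known species.

    `known_species` may carry the trailing '@' end-of-sequence token,
    as the training list in this notebook does; it is stripped before comparing.
    """
    known = {s.rstrip("@") for s in known_species}
    return sorted(c for c in candidates if c not in known)

# Hypothetical stand-ins for the training list and the sampled names.
known_species = ["staphylococcus_aureus@", "staphylococcus_warneri@"]
candidates = {"staphylococcus_warneri", "staphylococcus_peroris"}

novel = novel_candidates(candidates, known_species)
print(novel)
```

In practice the reference list should be broader than the training set, since a sampled name may match a species that exists but was not among the isolates used here.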
