#### Antibiotic Resistant Bacteria Multiclass-Classification and Drug Discovery
#### Corey J Sinnott
# Drug Discovery Antibiotic Molecule Generator

## Executive Summary

This report was commissioned to determine a robust, fast, and reproducible means of searching for, and developing, new antibiotics, in an effort to combat antibiotic resistance. After in-depth analysis, conclusions and recommendations will be presented.
   
Data was obtained from the following source:
- Comprehensive Antibiotic Resistance Database via CARD CLI interface: 
 - https://card.mcmaster.ca
- ChEMBL via Python client library: 
 - https://www.ebi.ac.uk/chembl/ 

**Full Executive Summary, Conclusion, Recommendations, Data Dictionary and Sources can be found in README.**

## Contents:
- [Data Import & Cleaning](#Data-Import-&-Cleaning)

In [None]:
#!pip install --user tensorflow-gpu

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, BatchNormalization, LSTM, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from sklearn.feature_extraction.text import CountVectorizer
import tensorflow as tf
import keras
import random
import sys

In [2]:
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.initializers import RandomNormal
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, TensorBoard
from tensorflow.keras.utils import Sequence

In [3]:
df = pd.read_csv('./data/acinetobacter_baumannii_MIC_addFeats_addBits.csv').drop(columns = 
                        'Unnamed: 0').drop_duplicates(subset = ['canonical_smiles'])

Creating a scaled target

In [4]:
df['pMIC'] = df['standard_value'].map(lambda x: -np.log10(x * (10**-9)))

In [5]:
df_active = df[df.bioactivity_binary == 'active'].sort_values(by = ['pMIC']).tail(200)

# Molecule Generator
#### Adapted from Deep Learning with Python ch. 8.1

In [6]:
X = df_active.canonical_smiles

In [7]:
maxlen = 60 # can adjust
step = 3

In [8]:
smiles = [i for i in X]

In [9]:
smiles[0]

'CC(C)(O/N=C(\\C(=O)N[C@@H]1C(=O)N2C(C(=O)[O-])=C(C[N+]3(CCNC(=O)c4ccc(O)c(O)c4Cl)CCCC3)CS[C@H]12)c1csc(N)n1)C(=O)O'

In [10]:
text = [''.join(i) for i in smiles] #making a big jumbles
                                    #instead of working with list of strings

In [11]:
text = ''.join(text)

In [12]:
print(f'Number of SMILES: {len(text)}')

Number of SMILES: 19570


In [13]:
# chars = [sorted(list(set(i))) for i in smiles]
# chars[0]

In [14]:
from itertools import chain

chars = set(chain.from_iterable(text))

In [15]:
chars_list = list(chars) #convert set to list

In [16]:
print(f'Number of unique characters: {len(chars)}')

Number of unique characters: 33


In [17]:
char_indices = dict((char, chars_list.index(char)) for char in chars_list)

In [18]:
len(char_indices)

33

In [19]:
#again -> delete
next_chars = []
sentences = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

In [20]:
chars = sorted(list(text))

In [21]:
char_indices = dict((char, chars.index(char)) for char in chars)

In [22]:
#again -> delete
x = np.zeros((len(sentences), maxlen, len(chars)), dtype = np.bool)
y = np.zeros((len(sentences), len(chars)), dtype = np.bool)

In [23]:
# one-hot encoding
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

In [24]:
model = Sequential() 
#model.add(Embedding(input_dim = X_fgr.shape[1], output_dim = 128))
model.add(LSTM(128, input_shape = (maxlen, len(chars))))
# model.add(Dropout(0.2))
# model.add(LSTM(256, return_sequences = True))
# model.add(Dropout(0.2))
# model.add(LSTM(512, return_sequences = True))
# model.add(Dropout(0.2))
# model.add(LSTM(256, return_sequences = True))
# model.add(Dropout(0.2))
# model.add(LSTM(128))
# model.add(Dropout(0.2))
# model.add(Dense(y.shape[0], activation='softmax'))

model.add(Dense(len(chars), activation = 'softmax'))

In [25]:
optimizer = keras.optimizers.RMSprop(lr = 0.01)

In [26]:
model.compile(loss = 'categorical_crossentropy', optimizer = optimizer)

In [27]:
def reweight_distribution(original_distribution, temperature=0.5): 
    distribution = np.log(original_distribution) / temperature 
    distribution = np.exp(distribution)
    
    return distribution / np.sum(distribution)

In [28]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64') 
    preds = np.log(preds) / temperature 
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds) 
    probas = np.random.multinomial(1, preds, 1) 

    return np.argmax(probas)

In [36]:
checkpoint_list = []
generated_list = []
for epoch in range(1, 25): # less epochs - trained many times while
    print('epoch', epoch)  # troubleshooting
    model.fit(x, y, batch_size = 128, epochs = 1)
    start_index = random.randint(0, len(text) - maxlen -1)
    generated_text = text[start_index: start_index + maxlen]
    print(f'Generating with seed: {generated_text}')

    for temperature in [0.9]: #only one due to time
        print(f'temperature: {temperature}')
        sys.stdout.write(generated_text)

        for i in range(200):
            sampled = np.zeros((1, maxlen, len(chars)))

            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1

            preds = model.predict(sampled, verbose=1)[0] 
            next_index = sample(preds, temperature) 
            next_char = chars[next_index]
            
            generated_text += next_char
            checkpoint_list.append(generated_text)
            generated_text = generated_text[1:]
            generated_list.append(generated_text)
            
            sys.stdout.write(next_char)

epoch 1
Generating with seed: (CCNC(=O)c4cc(O)c(O)c(Cl)c4)CCCC3)CS[C@H]12)c1csc(N)n1)C(=O)
temperature: 0.9
(epoch 2
Generating with seed: sc(N)n2)C(=O)N1OS(=O)(=O)OO=C(NCCc1c[nH]c2ccc(Br)cc12)Nc1cc(
temperature: 0.9
Nepoch 3
Generating with seed: O)[C@H](CN)NC(=O)[C@@H](NC(=O)[C@H](CCN)NC(=O)c2ccc(Cl)c(-c3
temperature: 0.9
cepoch 4
Generating with seed: /N=C(\C(=O)N[C@@H]1C(=O)N2C(C(=O)[O-])=C(C[n+]3cc(NC(=O)c4cc
temperature: 0.9
[epoch 5
Generating with seed: c4csc(N)n4)[C@H]3SC2)CCCC1CCCCCCCCCC(=O)N[C@@H](CCN)C(=O)N[C
temperature: 0.9
Oepoch 6
Generating with seed: H](C)O)NC(=O)[C@H](CCN)NC(=O)[C@H](CCN)NC(=O)[C@H](CC(C)C)NC
temperature: 0.9
=epoch 7
Generating with seed: (O)=C(C(N)=O)C(=O)[C@@]2(O)C(O)=C3C(=O)c4c(O)c(NC(=O)CN5CC6C
temperature: 0.9
)epoch 8
Generating with seed: sc(N)n1)c1cc(=O)c(O)cn1OCN(C)c1ccc(O)c2c1C[C@H]1C[C@H]3[C@H]
temperature: 0.9
(epoch 9
Generating with seed: )N[C@@H]1C(=O)N(OS(=O)(=O)O)C1(C)C)c1csc(N)n1)c1cc(=O)c(O)cn
temperature: 0.9
Cepoch 10
Generating

In [37]:
generated_text

')OCC(C)(O/N=C(\\C(=O)N[C@@H]1C(=O)N(OS(=O)(=O)O)C1(C)C)c1csc('

In [38]:
preds

array([4.5626786e-10, 2.2545292e-12, 2.2663039e-12, ..., 2.3483364e-12,
       2.2051280e-12, 2.2817406e-12], dtype=float32)

In [44]:
generated_list[0]

'CCNC(=O)c4cc(O)c(O)c(Cl)c4)CCCC3)CS[C@H]12)c1csc(N)n1)C(=O)['

In [48]:
checkpoint_list[1000]

'H](C)O)NC(=O)[C@H](CCN)NC(=O)[C@H](CCN)NC(=O)[C@H](CC(C)C)NC('

In [50]:
pd.Series(generated_list).to_csv('generated_list_1')

4801 unique strings were generated, but due to the small training size, there isn't a lot of variation.