#### Antibiotic Resistant Bacteria Multiclass-Classification and Drug Discovery
#### Corey J Sinnott
# Drug Discovery Antibiotic Molecule Generator - Extra Code

## Executive Summary

This report was commissioned to determine a robust, fast, and reproducible means of searching for, and developing, new antibiotics, in an effort to combat antibiotic resistance. After in-depth analysis, conclusions and recommendations will be presented.
   
Data was obtained from the following source:
- Comprehensive Antibiotic Resistance Database via CARD CLI interface: 
 - https://card.mcmaster.ca
- ChEMBL via Python client library: 
 - https://www.ebi.ac.uk/chembl/ 

**Full Executive Summary, Conclusion, Recommendations, Data Dictionary and Sources can be found in README.**

## Contents:
- [Data Import & Cleaning](#Data-Import-&-Cleaning)

In [None]:
!pip install --user tensorflow-gpu

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, BatchNormalization, LSTM, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from sklearn.feature_extraction.text import CountVectorizer
import tensorflow as tf
import keras
import random
import sys

In [2]:
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.initializers import RandomNormal
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, TensorBoard
from tensorflow.keras.utils import Sequence

In [3]:
df = pd.read_csv('./data/acinetobacter_baumannii_MIC_addFeats_addBits.csv').drop(columns = 
                        'Unnamed: 0').drop_duplicates(subset = ['canonical_smiles'])

Creating a scaled target

In [4]:
df['pMIC'] = df['standard_value'].map(lambda x: -np.log10(x * (10**-9)))

In [5]:
df_active = df[df.bioactivity_binary == 'active'].sort_values(by = ['pMIC']).tail(100)

# Molecule Generator
#### Adapted from Deep Learning with Python ch. 8.1

In [9]:
import tensorflow as tf
with tf.compat.v1.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.compat.v1.Session() as sess:
    print (sess.run(c))

RuntimeError: The Session graph is empty.  Add operations to the graph before calling run().

In [10]:
X = df_active.canonical_smiles

In [11]:
maxlen = 60 # can adjust
step = 3

In [12]:
smiles = [i for i in X]

In [13]:
smiles[0]

'CC[C@H](C)CCCCC(=O)N[C@@H](CCN)C(=O)N[C@H](C(=O)N[C@@H](CCN)C(=O)N[C@H]1CCNC(=O)[C@H]([C@@H](C)O)NC(=O)[C@H](CCN)NC(=O)[C@H](CCN)NC(=O)[C@H](C(C)C)NC(=O)[C@@H](CC(C)C)NC(=O)[C@H](CCN)NC1=O)[C@@H](C)O'

In [14]:
text = [''.join(i) for i in smiles] #making a big jumbles
                                    #instead of working with list of strings

In [15]:
text = ''.join(text)

In [16]:
print(f'Number of SMILES: {len(text)}')

Number of SMILES: 9729


In [17]:
# chars = [sorted(list(set(i))) for i in smiles]
# chars[0]

In [18]:
from itertools import chain

chars = set(chain.from_iterable(text))

In [19]:
chars_list = list(chars) #convert set to list

In [20]:
print(f'Number of unique characters: {len(chars)}')

Number of unique characters: 32


In [21]:
char_indices = dict((char, chars_list.index(char)) for char in chars_list)

In [22]:
len(char_indices)

32

In [23]:
#again -> delete
next_chars = []
sentences = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

In [24]:
chars = sorted(list(text))

In [25]:
char_indices = dict((char, chars.index(char)) for char in chars)

In [26]:
#again -> delete
x = np.zeros((len(sentences), maxlen, len(chars)), dtype = np.bool)
y = np.zeros((len(sentences), len(chars)), dtype = np.bool)

In [27]:
# one-hot encoding
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

In [28]:
model = Sequential() 
#model.add(Embedding(input_dim = X_fgr.shape[1], output_dim = 128))
model.add(LSTM(128, input_shape = (maxlen, len(chars))))
# model.add(Dropout(0.2))
# model.add(LSTM(256, return_sequences = True))
# model.add(Dropout(0.2))
# model.add(LSTM(512, return_sequences = True))
# model.add(Dropout(0.2))
# model.add(LSTM(256, return_sequences = True))
# model.add(Dropout(0.2))
# model.add(LSTM(128))
# model.add(Dropout(0.2))
# model.add(Dense(y.shape[0], activation='softmax'))

model.add(Dense(len(chars), activation = 'softmax'))

In [29]:
optimizer = keras.optimizers.RMSprop(lr = 0.01)

In [30]:
model.compile(loss = 'categorical_crossentropy', optimizer = optimizer)

In [31]:
def reweight_distribution(original_distribution, temperature=0.5): 
    distribution = np.log(original_distribution) / temperature 
    distribution = np.exp(distribution)
    
    return distribution / np.sum(distribution)

In [32]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64') 
    preds = np.log(preds) / temperature 
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds) 
    probas = np.random.multinomial(1, preds, 1) 

    return np.argmax(probas)

In [33]:
for epoch in range(1, 10):
    print('epoch', epoch)
    model.fit(x, y, batch_size = 128, epochs = 1)
    start_index = random.randint(0, len(text) - maxlen -1)
    generated_text = text[start_index: start_index + maxlen]
    print(f'Generating with seed: {generated_text}')

    for temperature in [0.4, 0.8, 1.2]:
        print(f'temperature: {temperature}')
        #sys.stdout.write(generated_text)

        for i in range(200):
            sampled = np.zeros((1, maxlen, len(chars)))

            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1

            preds = model.predict(sampled, verbose=1)[0] 
            next_index = sample(preds, temperature) 
            next_char = chars[next_index]
            
            generated_text += next_char
            generated_text = generated_text[1:]
            
            #sys.stdout.write(next_char)

epoch 1
Generating with seed: )CCCOC[C@@H]1CCCN1CC(=O)Nc1cc(F)c2c(c1O)C(=O)C1=C(O)[C@]3(O)
temperature: 0.4
temperature: 0.8
temperature: 1.2
epoch 2
Generating with seed: NCc3cc4c(cn3)OCS4)(CC1)CO2CC[C@H](C)CCCCC(=O)N[C@@H](CCN)C(=
temperature: 0.4
temperature: 0.8
temperature: 1.2
epoch 3
Generating with seed: ](CCN)NC1=O)[C@@H](C)O.O=S(=O)(O)O.O=S(=O)(O)O[C-]#[N+]C(=C)
temperature: 0.4
temperature: 0.8
temperature: 1.2
epoch 4
Generating with seed: H](OC(=O)CO)CC[C@@H]1OCC(O/N=C(\C(=O)N[C@@H]1C(=O)N(OS(=O)(=
temperature: 0.4
temperature: 0.8
temperature: 1.2
epoch 5
Generating with seed: C(=O)[C@H](CCN)NC1=O)[C@@H](C)OCC(C)(O/N=C(\C(=O)N[C@@H]1C(=
temperature: 0.4
temperature: 0.8
temperature: 1.2
epoch 6
Generating with seed: )[C@@H]1CNC(=O)OCc1cc(=O)c(O)cn1O)c1csc(N)n1)C(=O)[O-].[Na+]
temperature: 0.4
temperature: 0.8
temperature: 1.2
epoch 7
Generating with seed: =O)c(O)cn2O)cs1CSCC(O/N=C(\C(=O)N[C@@H]1C(=O)N(OS(=O)(=O)O)C
temperature: 0.4
temperature: 0.8
temperature: 1.2
epoch 

In [34]:
generated_text

'O)Nl)c4c(O)c3/CC)NCNC(=O)O)cc(OC=CC[C@@]]3C[C@@]]c(O)cc2c(O)'

In [35]:
preds

array([8.769390e-09, 8.766982e-09, 8.434535e-09, ..., 9.276961e-09,
       9.177455e-09, 9.128642e-09], dtype=float32)