<h1><center>Neural Networks 101</center></h1>
<h2><center>RNN</center></h2>
<p>The goal of this project is to identify chemical names in ordinary text, using nothing more than whitespace splitting for tokenization. More specifically, the target is the set of chemicals registered with the European Chemicals Agency (ECHA).</p>

<h2><center>Literature Review</center></h2>
<p>In terms of data sets, http://cheminformatics.org/datasets/ offers a substantial amount of information on chemical compounds, including specific characteristics such as inhibitor classifications. Another data set designed for similar projects is CHEMDNER, a corpus of chemicals and drugs. The work I found on that data set relies on word embeddings and multiple classification analysis. I wanted instead to test how character embeddings would handle such a task. Nowadays most NLP processing is done with some form of RNN, and this project is no exception. Similar projects, such as https://www.depends-on-the-definition.com/lstm-with-char-embeddings-for-ner/, report accuracy above 95% and use deeper models than I initially expected.</p>

In [20]:
%matplotlib inline


<h2><center>Data preparation and augmentation</center></h2>

In [3]:
import requests
import json
import csv


def write_json_file(name,data):
    with open(name+'.json', 'w') as outfile:
        json.dump(data, outfile)

def read_json_file(name):
    with open(name+'.json', 'r', encoding='utf-8-sig') as file:
        read_data = json.load(file)
    return read_data

def read_csv(name):
    with open(name + '.csv', 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        result = [i for i in spamreader]
    return result

In [4]:
def aquaire_data_from_iuclid():
    # To obtain the data you need access to an IUCLID server; in my case a local IUCLID instance was used.

    responce = requests.get('http://localhost:8080/iuclid6-ext/api/ext/v1/inventory?l=300000')
    data=[]
    for i in responce.json()['results']:
        if 'name' in i['representation']:
            data.append(i['representation']['name'])
    write_json_file('iuclid_row',data)

<h1><center>Prepare the data</center></h1>

<p>As mentioned before, the target chemical names are in the ECHA database. To get access to them, one needs to go through the local/server IUCLID application (https://iuclid6.echa.europa.eu/download) and parse the chemical inventories available at https://iuclid6.echa.europa.eu/inventories-iuclid. The context data set used in this project was taken from the previously mentioned CHEMDNER corpus. The end result is a mix of about 150 000 chemical names and around 600 000 words and text symbols that act as noise. Both data sets were cleaned of unwanted Unicode characters such as hieroglyphs (fewer than 10 words were removed).</p>

In [None]:
data = read_json_file('iuclid_row')

<p>Some chemical names consist of two or more words, but the current flow classifies one word at a time. Therefore every name is split into its words and only the unique words are kept. The reasoning is that if two chemical words are identified one after the other, there is a high chance that they belong to the same name.</p>

In [None]:
token_frequency={}
for name in data:
    tokens = name.lower().split()
    for token in tokens:
        token_frequency[token] = token_frequency.get(token,0) + 1
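
<p>The idea that neighbouring chemical words belong together can be sketched as a small post-processing step. This is only an illustration under assumptions: the function name <code>merge_adjacent_chemicals</code> and the example labels are hypothetical and are not part of the pipeline in this notebook.</p>

```python
def merge_adjacent_chemicals(tokens, labels):
    """Join runs of consecutive tokens labelled 1 into a single name."""
    names, current = [], []
    for token, label in zip(tokens, labels):
        if label == 1:
            # Still inside a run of chemical words: keep collecting.
            current.append(token)
        elif current:
            # The run ended: emit the collected words as one name.
            names.append(' '.join(current))
            current = []
    if current:
        # Flush a run that reaches the end of the text.
        names.append(' '.join(current))
    return names


tokens = 'add sodium chloride to the ethanol'.split()
labels = [0, 1, 1, 0, 0, 1]  # hypothetical per-word predictions
print(merge_adjacent_chemicals(tokens, labels))
```

<p>With the labels above, "sodium" and "chloride" are merged into one name while "ethanol" stays on its own.</p>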

<p>Next, some non-chemical words are removed from the result set by hand, which is also a good way to get familiar with the data.</p>

In [None]:
import pandas as pd

df = pd.DataFrame.from_dict(token_frequency, orient='index', columns=['frequency'])
df.sort_values(by='frequency', ascending=False, inplace=True)
drop_words = ['and','of','reaction','mass','with','products','6','(1:2)','/','<2%','by',
              'orange','the', 'cracked','lights','from','tall','gum','cas',',','a']
df.drop(drop_words,inplace=True)

In [None]:
extracted_key_words = []
for index, row in df.iterrows():
    # Skip words containing private-use glyphs (U+F061, U+F06B) or a zero-width space (U+200B).
    if u'\uf061' in index or u'\uf06b' in index or u'\u200b' in index:
        continue
    extracted_key_words.append(index)

In [None]:
len(extracted_key_words)

In [None]:
hot_tokens=set()

greek_letters=[
    u'\u03B1',
    u'\u03B2',
    u'\u03B3',
    u'\u03B4',
    u'\u03B5',
    u'\u03B6',
    u'\u03B7',
    u'\u03B8',
    u'\u03B9',
    u'\u03BA',
    u'\u03BB',
    u'\u03BC',
    u'\u03BD',
    u'\u03BE',
    u'\u03BF',
    u'\u03C0',
    u'\u03C1',
    u'\u03C2',
    u'\u03C3',
    u'\u03C4',
    u'\u03C5',
    u'\u03C6',
    u'\u03C7',
    u'\u03C8',
    u'\u03C9']

hot_tokens.update(greek_letters)
hot_tokens.add('≤')


for word in extracted_key_words:
    for token in word:
        if token == u'\uF061' or token == u'\uf061' or token == u'\u200b':
            print(word)
        hot_tokens.add(token)

In [None]:
def add_words_to_tokens(extracted_key_words,hot_tokens):
    cheker=len(hot_tokens)
    for word in extracted_key_words:
        for token in word:
            if token == u'\uF061' or token == u'\uf061' or token == u'\u200b':
                print(word)
            hot_tokens.add(token)
    return False if len(hot_tokens)==cheker else len(hot_tokens)-cheker

In [None]:
text_tokens= read_csv('text')
text_tokens.pop(0)

In [None]:
distribution_fill_tokens = {}

for token in text_tokens:
    distribution_fill_tokens[token[3]] = distribution_fill_tokens.get(token[3],0)+1

distribution_fill_tokens

<p>Only the tokens labelled 'O' in CHEMDNER (ordinary text, i.e. not part of a chemical mention) are kept.</p>

In [None]:
text_tokens = [i[2].lower() for i in text_tokens if (i[3]=='O' and u'\xa0' not in i[2].lower())]
add_words_to_tokens(text_tokens,hot_tokens)

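<p>A Monte Carlo style random draw (the "black-jack") decides whether the next corpus entry is an ordinary word or a chemical name, so that the mix keeps the proportion and sparsity of chemical names that is expected in real text.</p>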

In [None]:
from random import randrange
from random import random

balance = len(extracted_key_words)*1.0/(len(text_tokens)+len(extracted_key_words))
print(balance)
building_corpus=[]
while True:
    black_jack=(random())
    if black_jack>balance:
        index=randrange(len(text_tokens))
        building_corpus.append([text_tokens.pop(index),0])
    else:
        index=randrange(len(extracted_key_words))
        building_corpus.append([extracted_key_words.pop(index),1])
    if not (text_tokens and extracted_key_words):
        break
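
<p>Two properties of the loop above are worth knowing: it stops as soon as one of the two lists is empty, so whatever is left in the other list never reaches the corpus, and <code>list.pop(index)</code> on a long list takes time proportional to the list length. A possible alternative, shown here only as a sketch and not as the method used in this notebook, is to label everything first and shuffle once. The name <code>mix_corpus</code> and the <code>seed</code> parameter are assumptions made for the example.</p>

```python
import random


def mix_corpus(plain_tokens, chemical_tokens, seed=None):
    """Label ordinary words 0 and chemical words 1, then shuffle them together."""
    rng = random.Random(seed)  # a seeded generator keeps the mix reproducible
    corpus = [[token, 0] for token in plain_tokens]
    corpus += [[token, 1] for token in chemical_tokens]
    rng.shuffle(corpus)  # uniform random order, no item is dropped
    return corpus


mixed = mix_corpus(['the', 'of', 'water', 'sample'], ['benzene'], seed=0)
print(mixed)
```

<p>The overall share of chemical names is the same as the <code>balance</code> value computed above, but every token ends up in the corpus.</p>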

In [None]:
write_json_file('corpus',building_corpus)

In [None]:
building_corpus

<h1><center>Encode Data</center></h1>

<p>Initially I was going for one-hot encoding, but an embedding layer can do the same job. Based on https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12 and the fact that the data set is quite big, I decided to switch to an embedding layer.</p>

In [None]:
import numpy as np

identity=np.identity(len(hot_tokens))

hot_tokens_encoded={i:list(identity[k]) for k,i in enumerate(hot_tokens)}
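
<p>Why an embedding layer can replace one-hot encoding is easiest to see on a toy example: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, and an embedding layer performs that row lookup directly. The vocabulary and the weights below are made up for illustration only.</p>

```python
vocab = ['a', 'b', 'c']
# Hypothetical 3x2 embedding matrix: one row of weights per character.
embedding = [[0.1, 0.2],
             [0.3, 0.4],
             [0.5, 0.6]]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Plain row-vector by matrix product."""
    n_columns = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_columns)]


index = vocab.index('b')
via_one_hot = vector_times_matrix(one_hot(index, len(vocab)), embedding)
via_lookup = embedding[index]  # what an embedding layer does: pick a row
print(via_one_hot, via_lookup)
```

<p>Both routes give the same vector, but the lookup never materialises the one-hot row, which matters when every word is padded to hundreds of characters.</p>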


In [None]:
write_json_file('encoding',hot_tokens_encoded)

In [None]:
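# Freeze the character set into a list; index 0 is reserved for the padding symbol.
hot_tokens = list(hot_tokens)
hot_tokens.insert(0, ' ')
# Build the lookup once instead of calling list.index() for every character.
char_to_index = {}
for position, symbol in enumerate(hot_tokens):
    char_to_index.setdefault(symbol, position)  # first occurrence wins, like list.index()
# Replace every word by the list of its character indices.
for token in building_corpus:
    token[0] = [char_to_index[character] for character in token[0]]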

In [None]:
write_json_file('encoded_corpus',building_corpus)

In [None]:
data_ready_for_model={}
data_ready_for_model['corpus']=building_corpus
data_ready_for_model['mapping']=hot_tokens
write_json_file('data_for_model',data_ready_for_model)

<p>All of the above can be skipped: run only the initial cells that define the helper functions and then continue directly from here.</p>

In [5]:
data_ready_for_model=read_json_file('data_for_model')
building_corpus=data_ready_for_model['corpus']
hot_tokens=data_ready_for_model['mapping']
import numpy as np

In [6]:
num_tokens = [len(tokens[0]) for tokens in building_corpus]
num_tokens = np.array(num_tokens)
print(np.mean(num_tokens))
print(np.max(num_tokens))
max_tokens=np.max(num_tokens)
x_data=[i[0] for i in building_corpus]
y_data=[i[1] for i in building_corpus]

9.606791610523594
411


<p>A decision was taken not to trim any data, because the chemical names were the ones lifting max(num_tokens) so high.</p>

In [15]:
import os
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, GRU, Embedding,CuDNNLSTM,LSTM,Dropout
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau,Callback


<p>When I was using the GRU layers, the GPU handled them about 7 times slower than the CPU. So pay close attention: the GPU is not always the fastest option.</p>

In [None]:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

In [8]:
pad='post'
x_data_pad = pad_sequences(x_data, maxlen=max_tokens,
                            padding=pad)
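
<p>What <code>padding='post'</code> does can be shown without TensorFlow. The helper below is a simplified stand-in written for this explanation (the name <code>pad_post</code> is an assumption, and unlike <code>pad_sequences</code> it refuses to truncate): shorter sequences get zeros appended until they reach <code>maxlen</code>, and 0 is the index that was reserved for the padding symbol in the mapping.</p>

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` to every sequence until it has length `maxlen`."""
    padded = []
    for sequence in sequences:
        if len(sequence) > maxlen:
            # This sketch does not truncate; maxlen is meant to be the maximum length.
            raise ValueError('sequence longer than maxlen')
        padded.append(list(sequence) + [value] * (maxlen - len(sequence)))
    return padded


encoded_words = [[5, 3], [7, 1, 4]]  # hypothetical character indices
print(pad_post(encoded_words, maxlen=3))
```

<p>With a mean length of about 9.6 characters and a maximum of 411, most of every padded row in this project consists of padding zeros.</p>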


In [9]:
x_data_pad.shape

len(x_data)

734730

In [10]:
x_train=x_data_pad[:720000]
y_train=y_data[:720000]
x_test=x_data_pad[720000:]
y_test=y_data[720000:]


In [11]:
720000*0.05  # calculating validation set

36000.0
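
<p>The 36 000 samples computed above are the ones that <code>validation_split=0.05</code> sets aside during training. Keras takes them from the end of the training arrays, before any shuffling. The function below is a simplified sketch of that behaviour, not the Keras implementation; the name <code>split_for_validation</code> is an assumption.</p>

```python
def split_for_validation(samples, validation_split):
    """Return (train, validation); validation is the tail of the data."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


train_part, validation_part = split_for_validation(list(range(20)), 0.25)
print(len(train_part), validation_part)

# The same arithmetic for the sizes used in this notebook.
n_train = int(720000 * (1.0 - 0.05))
n_validation = 720000 - n_train
print(n_train, n_validation)
```

<p>Because the validation samples are simply the tail of the array, the split is only representative if the corpus was mixed randomly beforehand, which is what the random draw in the data preparation step is for.</p>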

<h1><center>Building the Model</center></h1>

<p>Contrary to the initial directions, I did not start with a very simple model. I came across this paper, https://arxiv.org/pdf/1412.3555v1.pdf, and decided to test it with GRU layers. The GRU cell also looks like a transistor, so I thought it was worth a shot.</p>

In [12]:
model = Sequential()
embedding_size = len(hot_tokens)
model.add(Embedding(input_dim=embedding_size,
                    output_dim=embedding_size,
                    input_length=max_tokens))
model.add(GRU(units=16, return_sequences=True))
model.add(GRU(units=8, return_sequences=True))
model.add(GRU(units=4))
model.add(Dense(1, activation='sigmoid'))
optimizer = Adam(lr=1e-3)
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 411, 170)          28900     
_________________________________________________________________
gru (GRU)                    (None, 411, 16)           8976      
_________________________________________________________________
gru_1 (GRU)                  (None, 411, 8)            600       
_________________________________________________________________
gru_2 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense (Dense)                (None, 1)                 5         
Total params: 38,637
Trainable params: 38,637
Non-trainable params: 0
_________________________________________________________________


In [13]:
model = Sequential()
embedding_size = len(hot_tokens)
model.add(Embedding(input_dim=embedding_size,
                    output_dim=embedding_size,
                    input_length=max_tokens))
model.add(CuDNNLSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(CuDNNLSTM(32))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
# add decay                     
optimizer = Adam(lr=1e-3,decay=1e-6)
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 411, 170)          28900     
_________________________________________________________________
cu_dnnlstm (CuDNNLSTM)       (None, 411, 128)          153600    
_________________________________________________________________
dropout (Dropout)            (None, 411, 128)          0         
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, 32)                20736     
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 203,269
Trainable params: 203,269
Non-trainable params: 0
_________________________________________________________________
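
<p>The parameter counts in the two summaries can be checked by hand, which is a quick way to confirm that the layers are wired as intended. The formulas below are assumptions chosen because they reproduce the printed numbers: a GRU with three gates and one bias vector per gate, and a CuDNNLSTM with four gates and two bias vectors per gate. The helper names are hypothetical.</p>

```python
def embedding_params(vocab_size, output_dim):
    return vocab_size * output_dim


def gru_params(input_dim, units):
    # 3 gates, each with input weights, recurrent weights and one bias vector.
    return 3 * (units * (input_dim + units) + units)


def cudnn_lstm_params(input_dim, units):
    # 4 gates; the cuDNN layout keeps two bias vectors per gate.
    return 4 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    return input_dim * units + units


gru_total = (embedding_params(170, 170) + gru_params(170, 16)
             + gru_params(16, 8) + gru_params(8, 4) + dense_params(4, 1))
lstm_total = (embedding_params(170, 170) + cudnn_lstm_params(170, 128)
              + cudnn_lstm_params(128, 32) + dense_params(32, 1))
print(gru_total, lstm_total)
```

<p>The totals match the 38,637 and 203,269 parameters reported by <code>model.summary()</code>, so the LSTM model has roughly five times as many trainable parameters as the GRU model.</p>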


<p>After the initial run with GRU, the model seemed to have a bias and could not be fitted beyond a certain threshold. Therefore I constructed a similar model with LSTM layers, which ran better on the GPU. The same results appeared, so I added a decay to the optimizer with the intention of making the learning rate smaller over time and fitting the model a little better. The success can be described as moderate.</p>
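
<p>How much a decay of 1e-6 actually changes the learning rate can be estimated with a few lines of arithmetic. The sketch assumes the time-based schedule of the legacy Keras optimizers, lr / (1 + decay * iterations), with one iteration per batch; if the installed version applies decay differently, the numbers will differ. The function name <code>decayed_lr</code> is hypothetical.</p>

```python
def decayed_lr(initial_lr, decay, iterations):
    """Time-based decay, assumed to match the legacy Keras `decay` argument."""
    return initial_lr / (1.0 + decay * iterations)


# 684000 training samples in batches of 64, rounded up to whole batches.
steps_per_epoch = -(-684000 // 64)

for epochs in (2, 10, 100):
    iterations = epochs * steps_per_epoch
    print(epochs, decayed_lr(1e-3, 1e-6, iterations))
```

<p>Under this assumption the learning rate drops by only about 2% over two epochs, which may explain why the effect of the decay was moderate.</p>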

In [16]:
x_train

class TestCallback(Callback):
    """Report loss and accuracy on held-out test data after every epoch."""
    def __init__(self, test_data):
        super().__init__()
        self.test_data = test_data

    def on_epoch_end(self, epoch, logs=None):
        x, y = self.test_data
        loss, acc = self.model.evaluate(x, y, verbose=0)
        print('\nTesting loss: {}, acc: {}\n'.format(loss, acc))

<p>A handy extension of the Callback class that reports test accuracy after every epoch. Accuracy should of course be judged after training is complete, but this extra information gives a sense of where the model is heading. Note that the callback only runs if its instance is included in the callbacks list passed to model.fit.</p>

In [17]:
path_checkpoint='Project1_v1.2'
callback_checkpoint = ModelCheckpoint(filepath=path_checkpoint,
                                      monitor='val_loss',
                                      verbose=1,
                                      # save_weights_only=True,
                                      save_best_only=True)

callback_early_stopping = EarlyStopping(monitor='val_loss',
                                        patience=5, verbose=1)

callback_tensorboard = TensorBoard(log_dir='./23_logs/',
                                   histogram_freq=0,
                                   write_graph=True)
test_callback=TestCallback((x_test, y_test))
callbacks = [callback_early_stopping,
             callback_checkpoint,
             callback_tensorboard]
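
<p>The interaction between <code>patience</code> and the number of epochs is easy to overlook. The function below is a simplified model of patience-based early stopping, written for this explanation and not taken from Keras: training stops once the monitored loss has failed to improve for <code>patience</code> epochs in a row. The loss values are invented.</p>

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best value: remember it and reset the counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


long_run = [0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45]  # hypothetical val_loss values
short_run = [0.38, 0.39]
print(stopping_epoch(long_run, patience=5))
print(stopping_epoch(short_run, patience=5))
```

<p>With <code>patience=5</code>, a run of only two epochs can never be stopped early, so the callback only becomes relevant for longer training runs.</p>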

In [19]:
%%time
model.fit(x_train, y_train,
          validation_split=0.05, epochs=2, batch_size=64,callbacks=callbacks)

Train on 684000 samples, validate on 36000 samples
Epoch 1/2
Epoch 00001: val_loss did not improve from 0.37084
Epoch 2/2
Epoch 00002: val_loss did not improve from 0.37084
Wall time: 44min 44s


<tensorflow.python.keras.callbacks.History at 0x66ca37db70>

In [None]:
%%time
result = model.evaluate(x_test, y_test)

In [None]:
result
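
<p>Accuracy alone can look better than the model really is here: with roughly 150 000 chemical names against 600 000 ordinary words, a model that always answers "not a chemical" would already reach about 80% accuracy. Precision and recall for the chemical class give a clearer picture. The sketch below is a plain Python illustration with invented labels and probabilities; <code>binary_report</code> is a hypothetical helper, not a Keras function.</p>

```python
def binary_report(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall and F1 for the positive (chemical) class."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': correct / len(pairs), 'precision': precision,
            'recall': recall, 'f1': f1}


y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # two chemicals among ten words
always_negative = [0.1] * 10             # a model that never predicts "chemical"
print(binary_report(y_true, always_negative))
```

<p>In this toy case the accuracy is high while the recall is zero. With the real model, the probabilities would come from <code>model.predict(x_test)</code>.</p>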

In [None]:
model.load_weights('path_name')

<h1><center>Conclusion</center></h1>

<p>At the current state of model development, there are several steps that could improve performance. The first is to train for more than 10 epochs to see whether the network starts to improve at some point. Another is to reduce the dropout rates and see whether the fit improves. A third option is to try the RMSprop optimizer instead of Adam, because RMSprop seemed to work better in some RNNs. Overall, the goal of identifying chemical names was achieved to a satisfying degree, with potential for further development.</p>