<h1><center>Bristol-Myers Squibb - Molecular Translation</center></h1>

InChI or International Chemical Identifier is a textual identifier for chemical substances, designed to provide a standard way to encode molecular information and to facilitate the search for such information in databases and on the web.

**This notebook contains inference code only. For training details, see the Kaggle notebook** <a href="https://www.kaggle.com/sambhavsg/bms-baseline-inchi-first-layer-train-tf-keras?scriptVersionId=62450249">BMS Baseline InChI first layer Train TF Keras</a>.


# Import Modules

In [None]:
import pandas as pd
import numpy as np

import tensorflow as tf
import tensorflow.keras.models as M
import tensorflow.keras.layers as L
import tensorflow.keras.optimizers as O
import tensorflow.keras.losses as Loss

from tqdm import tqdm

from PIL import Image
import cv2

import matplotlib.pyplot as plt

In [None]:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Limit TensorFlow to 15240 MB (about 15 GB) of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=15240)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

# Loading and Preprocessing Data

## Setting global variables

### Batch size, epochs, input image dimensions, and the maximum length of the target sequence are defined here.

In [None]:
BATCH_SIZE = 800
EPOCHS = 1
DIM =(100,100)
MAX_LENGTH = 20

## Importing and preprocessing train data

# Loading test data

In [None]:
sampl = pd.read_csv('../input/bms-molecular-translation/sample_submission.csv')

### The image path is added to the image id column so that each image can be loaded for inference

In [None]:
test_path = '../input/bms-molecular-translation/test'
for i in tqdm(range(len(sampl))):
    image_id = sampl.image_id.values[i]
    sampl.image_id.values[i] = test_path+'/'+image_id[0]+'/'+image_id[1]+'/'+image_id[2]+'/'+image_id+'.png'
sampl.head()
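The directory layout used above nests each image under the first three characters of its id. As a minimal sketch of the same rule, the helper below (`build_image_path` is a hypothetical name, not part of the original notebook) builds the path for a single id:

```python
def build_image_path(image_id, root="../input/bms-molecular-translation/test"):
    """Return root/<c0>/<c1>/<c2>/<image_id>.png (hypothetical helper)."""
    # The first three characters of the id name the three nested folders.
    return "/".join([root, image_id[0], image_id[1], image_id[2], image_id + ".png"])


print(build_image_path("000011a64c74"))
```

Assuming `sampl` is the DataFrame loaded above, the same helper could be applied to the whole column with `sampl['image_id'].map(build_image_path)` instead of an explicit loop.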

During training, tokens were defined to tokenize the input data labels. To reverse that mapping, a `detokens` dictionary is created here. It is used later in this notebook to turn the predicted token ids back into characters.

In [None]:
detokens = {0: 'S', 1: 'B', 2: 'N', 3: 'I', 4: 'F', 5: 'P', 6: '3', 7: '6', 8: 'i', 9: '5', 10: '8', 11: 'C', 12: '7', 13: '4', 14: 'r', 15: '0', 16: '2', 17: 'O', 18: '9', 19: 'H', 20: 'l', 21: '$', 22: '1'} 
print(detokens)
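To make the relationship between tokens and detokens concrete, the sketch below inverts the dictionary and round-trips a short string. It assumes the training vocabulary was simply the inverse of `detokens` for these characters; the actual training notebook may tokenize differently (for example, multi-digit numbers), and `tokens` and `tokenize` are names introduced here for illustration only.

```python
detokens = {0: 'S', 1: 'B', 2: 'N', 3: 'I', 4: 'F', 5: 'P', 6: '3', 7: '6',
            8: 'i', 9: '5', 10: '8', 11: 'C', 12: '7', 13: '4', 14: 'r',
            15: '0', 16: '2', 17: 'O', 18: '9', 19: 'H', 20: 'l', 21: '$',
            22: '1'}

# Assumed inverse mapping: character -> token id.
tokens = {ch: idx for idx, ch in detokens.items()}


def tokenize(text):
    """Map each character to its token id (illustrative, character-level only)."""
    return [tokens[ch] for ch in text]


encoded = tokenize("C2H6O")
decoded = "".join(detokens[i] for i in encoded)
print(encoded, decoded)
```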

This function preprocesses each image before it is sent to the GPU for prediction. It reads and decodes the PNG as a single-channel image, rotates it by 90 degrees when it is taller than it is wide, converts it to float32, and resizes it with padding to `DIM`. Images are loaded lazily at prediction time rather than up front, because the full test set would not fit in the available RAM.

In [None]:
def preprocess_test_image(image_id):
    image = tf.io.read_file(image_id)    
    image = tf.image.decode_png(image,channels=1)
    def f1(): return tf.image.rot90(image,k=3)
    def f2(): return image
    image = tf.cond(tf.less(tf.shape(image)[1],tf.shape(image)[0]),f1,f2)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize_with_pad(image, DIM[0],DIM[1])
    return image
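The geometry of the rotate-then-pad step can be followed without TensorFlow. The sketch below is an approximation of what the function above does to the image shape: it swaps height and width for portrait images, scales to fit inside the target while keeping the aspect ratio, and centers the result with padding. `padded_geometry` is a hypothetical helper, and the exact rounding inside `tf.image.resize_with_pad` may differ by a pixel.

```python
import math


def padded_geometry(height, width, target=(100, 100)):
    """Approximate the resized size and top/left padding after preprocessing."""
    # Portrait images (width < height) are rotated, which swaps the two sides.
    if width < height:
        height, width = width, height
    # Scale so the image fits inside the target while keeping its aspect ratio.
    ratio = max(height / target[0], width / target[1])
    new_h = int(math.floor(height / ratio))
    new_w = int(math.floor(width / ratio))
    # The remaining space is split evenly; the top/left share is reported here.
    pad_top = (target[0] - new_h) // 2
    pad_left = (target[1] - new_w) // 2
    return (new_h, new_w), (pad_top, pad_left)


print(padded_geometry(300, 150))
print(padded_geometry(200, 400))
```

Both calls describe a 2:1 landscape image after rotation, so both end up 50 pixels high and 100 pixels wide inside the 100 by 100 canvas.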

In [None]:
#     image = tf.io.read_file('../input/bms-molecular-translation/train/0/0/0/000011a64c74.png')
#     image = tf.image.decode_png(image,channels=1)
#     print(tf.shape(image))

Here `tf.data.Dataset` is used to fetch the data on the fly instead of preloading everything into RAM, which would not fit in the available memory. The pipeline keeps only a batch of images in memory at a time and preprocesses them as they are read, using the function above. `prefetch` prepares upcoming batches in the background while the current one is being processed, which reduces the input bottleneck and improves speed.

In [None]:
test_data = tf.data.Dataset.from_tensor_slices(sampl.image_id.values).map(preprocess_test_image,num_parallel_calls=tf.data.AUTOTUNE).batch(2048).prefetch(tf.data.AUTOTUNE)
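The idea of loading one batch at a time can be illustrated with a plain Python generator. This is only an analogy for the `map` and `batch` steps, under the assumption that a simple sequential loader is enough to show the principle; it does not reproduce the parallel mapping or prefetching that `tf.data` performs, and `lazy_batches` and `fake_load` are hypothetical names.

```python
from itertools import islice


def lazy_batches(paths, load_fn, batch_size):
    """Yield lists of loaded items, touching only one batch of paths at a time."""
    iterator = iter(paths)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield [load_fn(path) for path in chunk]


loaded = []  # records which paths have actually been read so far


def fake_load(path):
    """Stand-in for reading and preprocessing an image."""
    loaded.append(path)
    return path.upper()


batches = lazy_batches(["a", "b", "c", "d", "e"], fake_load, 2)
first = next(batches)
loaded_after_first = list(loaded)
rest = list(batches)
```

After the first batch is requested, only the first two paths have been loaded; the remaining ones are read when later batches are consumed.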

# Loading Pretrained Model

In [None]:
model = M.load_model('../input/trained-model-for-bmsmolecular/phase1_base_model_v1.4.h5')

# Inference 

This function detokenizes the predictions using the `detokens` dictionary. Token ids that are not in the dictionary are written out as the number `id - 47`.

In [None]:
def detokenize(pred):
    string_d = []
    for i in range(len(pred)):
        a = []
        for j in range(len(pred[i])):
            if pred[i][j] in detokens.keys():
                a.append(detokens[pred[i][j]])
            else:
                a.append(str(pred[i][j]-47))
        a = "".join(a)
        string_d.append(a)
    return string_d
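A toy call helps show both branches of the function above. The compact variant below (`detokenize_sketch`, a name introduced here) is meant to behave the same way on plain Python lists; the ids 57 and 69 are picked purely for illustration of the `id - 47` branch and are not taken from the training vocabulary.

```python
detokens = {0: 'S', 1: 'B', 2: 'N', 3: 'I', 4: 'F', 5: 'P', 6: '3', 7: '6',
            8: 'i', 9: '5', 10: '8', 11: 'C', 12: '7', 13: '4', 14: 'r',
            15: '0', 16: '2', 17: 'O', 18: '9', 19: 'H', 20: 'l', 21: '$',
            22: '1'}


def detokenize_sketch(pred):
    """Turn rows of token ids into strings, mirroring the loop-based version."""
    # Ids missing from the dictionary fall back to the number id - 47.
    return ["".join(detokens.get(t, str(t - 47)) for t in row) for row in pred]


toy_pred = [
    [11, 16, 19, 7, 17, 21, 21],  # dictionary tokens followed by padding
    [11, 57, 19, 69, 21],         # 57 -> "10", 69 -> "22" via the fallback
]
print(detokenize_sketch(toy_pred))
```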

## Prediction

In [None]:
pred = model.predict(test_data,verbose=1)


## Post Processing

The model outputs a score between 0 and 1 for every token at every position (the output activation was sigmoid). `np.argmax` along the last axis picks the index of the highest-scoring token at each position. The token ids are then detokenized with the `detokenize` function, and the padding token `$` is stripped from the predicted sequences.

In [None]:
pred = np.argmax(pred,axis=-1)

In [None]:
pred = detokenize(pred)
pred = np.char.strip(pred,chars='$')
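The three post-processing steps can be traced on a tiny hand-made example. The sketch below uses plain Python lists instead of NumPy so it runs without dependencies, and a four-symbol `toy_vocab` that is an assumption made for illustration, not the notebook's vocabulary.

```python
def argmax(scores):
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


toy_vocab = {0: "C", 1: "H", 2: "4", 3: "$"}

# One sequence, five positions, one score per vocabulary entry.
toy_pred = [[
    [0.90, 0.05, 0.02, 0.03],
    [0.10, 0.80, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.00, 0.10, 0.10, 0.80],
    [0.00, 0.10, 0.10, 0.80],
]]

# Step 1: pick the best token id at each position.
ids = [[argmax(position) for position in seq] for seq in toy_pred]
# Steps 2 and 3: map ids to characters, then strip the padding symbol.
text = ["".join(toy_vocab[i] for i in seq).strip("$") for seq in ids]
print(ids, text)
```

Like `np.char.strip`, `str.strip` removes the given character from both ends of the string only, so a `$` in the middle of a sequence would be kept.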

In [None]:
sampl.InChI = pred

The file path is removed from the image id column, leaving only the 12-character image id.

In [None]:
for i in tqdm(range(len(sampl.image_id.values))):
    sampl.image_id.values[i] = sampl.image_id.values[i][46:58]
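The slice `[46:58]` depends on the exact length of `test_path`. As a hedged alternative, the sketch below recovers the id from the file name itself, so it keeps working if the directory changes; `image_id_from_path` is a hypothetical helper, not part of the original notebook.

```python
import os


def image_id_from_path(path):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]


example_path = "../input/bms-molecular-translation/test/0/0/0/000011a64c74.png"
print(image_id_from_path(example_path))
```

For the path layout used in this notebook, both approaches give the same result.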

This is a baseline for the first layer of the InChI notation only, so the remaining layers are set to the same constant string for every row.

In [None]:
text = '/c1-18(2,3)24-17(22)21-10-6-8-14(21)15-11-13(20-25-15)12-7-5-9-19-16(12)23-4/h5,7,9,11,14H,6,8,10H2,1-4H3'
sampl.InChI = sampl.InChI.values+text
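To see what the concatenation produces, the sketch below appends the constant layers to a single first-layer string and splits the result on `/`. The first-layer value `C2H6O` is a made-up placeholder rather than a real prediction, and `assemble` is a name introduced here for illustration.

```python
constant_layers = ('/c1-18(2,3)24-17(22)21-10-6-8-14(21)15-11-13(20-25-15)'
                   '12-7-5-9-19-16(12)23-4/h5,7,9,11,14H,6,8,10H2,1-4H3')


def assemble(first_layer, rest=constant_layers):
    """Append the constant layers to a predicted first layer."""
    return first_layer + rest


example = assemble("C2H6O")  # placeholder first layer, not a model output
layers = example.split("/")
print(layers[0])
print(len(layers))
```

Splitting on `/` gives the predicted first layer followed by the two constant layers, which begin with `c` and `h` respectively.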

In [None]:
sampl.head()

In [None]:
sampl.to_csv('submission.csv',index=False)