**Hello fellow Kagglers,**


This notebook is a baseline for an encoder/decoder model with attention, written in TensorFlow and running on a TPU. Several notebooks, examples and pieces of documentation served as inspiration, especially the two Kaggle notebooks below. A big thanks to their authors for sharing that work:

**Kaggle Notebook**

[Pytorch training by Eric Pasewark](https://www.kaggle.com/yasufuminakama/inchi-resnet-lstm-with-attention-starter)

[Pytorch training by Y.Nakama](https://www.kaggle.com/yasufuminakama/inchi-resnet-lstm-with-attention-starter)

**Tensorflow Code Examples/Documentation**

[Tensorflow encoder/decoder attention baseline](https://www.tensorflow.org/tutorials/text/nmt_with_attention)

[Custom Tensorflow model](https://www.tensorflow.org/guide/keras/customizing_what_happens_in_fit)

[TPU training in Tensorflow](https://www.tensorflow.org/tutorials/distribute/custom_training)

**My own preprocessing notebook**

[Advanced Image Cleaning and TFRecord Generation](https://www.kaggle.com/markwijkhuizen/advanced-image-cleaning-and-tfrecord-generation)

**Prediction Notebook (available several hours after V3 completes running)**

[BMS - Tensorflow TPU Predictions](https://www.kaggle.com/markwijkhuizen/bms-tensorflow-tpu-predictions)

I will not disclose the prediction notebook to prevent people from simply copying and submitting this notebook and thereby flooding the leaderboard with equal scores.

If you have any questions or remarks, feel free to leave a comment :D

When publishing a notebook based on this notebook, please don't forget to reference this notebook.

A small disclaimer: this is the first time I am working with sequence prediction and encoder/decoder models. Keep this in mind when reading the notebook; many improvements are possible.

**VERSION 2 UPDATES**

* Dataset converted to an iterator. Without an iterator the dataset starts at the beginning each epoch, thereby using only the first part of the train dataset. Credits go to [Darien Schettler](https://www.kaggle.com/dschettler8845) for pointing this out in the comments.

* Dynamically assign encoder dimensions. This idea is based on [Andy Penrose's](https://www.kaggle.com/andypenrose) comment

* Optimized training loop. This idea is based on [this](https://www.kaggle.com/mgornergoogle/custom-training-loop-with-100-flowers-on-tpu) training notebook made by [Martin Görner](https://www.kaggle.com/mgornergoogle). An example of this in the TensorFlow documentation can be found [here](https://www.tensorflow.org/guide/tpu#improving_performance_by_multiple_steps_within_tffunction). Multiple training steps are performed in one run on the TPU, 100 to be precise. Also, the batches of images and labels are retrieved directly on the TPU, rather than on the CPU and then sent to the TPU. This reduces the training step duration from 45 seconds to 38 seconds, a reduction of 16\% :D.
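
The iterator issue from the first bullet can be reproduced without TensorFlow. The sketch below uses a plain Python range as a stand-in for the dataset (`make_dataset` and `STEPS_PER_EPOCH` are hypothetical names, not part of this notebook) and compares re-creating the iterator every epoch with keeping one iterator alive across epochs.

```python
from itertools import islice

def make_dataset():
    """Stand-in for a finite dataset: yields sample ids 0..9."""
    return range(10)

STEPS_PER_EPOCH = 3

# Without a persistent iterator: each epoch re-creates the iterator,
# so every epoch sees the same first samples.
restarted = [list(islice(iter(make_dataset()), STEPS_PER_EPOCH)) for _ in range(2)]

# With a persistent iterator: each epoch continues where the previous one stopped.
it = iter(make_dataset())
continued = [list(islice(it, STEPS_PER_EPOCH)) for _ in range(2)]

print(restarted)  # [[0, 1, 2], [0, 1, 2]]
print(continued)  # [[0, 1, 2], [3, 4, 5]]
```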

**VERSION 3 UPDATES**

* Updated the attention mechanism based on [this](https://www.kaggle.com/konradb/model-train-efficientnet) notebook. This improves both the score and the efficiency: an epoch now takes only 27 seconds, TPUs are awesome ;)

* Modified learning rate scheduler, using lower learning rates.

* Reduced the character embedding dimension.

* Made the [prediction notebook](https://www.kaggle.com/markwijkhuizen/bms-tensorflow-tpu-predictions) public; it will finish running after V3 has completed.

In [None]:
# install tensorflow implementations of EfficientNet with noisy-student weights
!pip install -q --upgrade pip
!pip install -q efficientnet

In [None]:
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import efficientnet.tfkeras as efn

from tensorflow.keras.mixed_precision import experimental as mixed_precision
from kaggle_datasets import KaggleDatasets
from tqdm.notebook import tqdm

import unicodedata
import re
import numpy as np
import os
import io
import time
import pickle
import math
import random

In [None]:
# seed everything
SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

In [None]:
# Detect hardware, set appropriate distribution strategy (GPU/TPU)
try:
    TPU = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection. No parameters necessary if TPU_NAME environment variable is set. On Kaggle this is always the case.
    print('Running on TPU ', TPU.master())
except ValueError:
    print('Running on GPU')
    TPU = None

if TPU:
    tf.config.experimental_connect_to_cluster(TPU)
    tf.tpu.experimental.initialize_tpu_system(TPU)
    strategy = tf.distribute.experimental.TPUStrategy(TPU)
else:
    strategy = tf.distribute.get_strategy() # default distribution strategy in Tensorflow. Works on CPU and single GPU.

REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')

In [None]:
#IMG_HEIGHT = 256
#IMG_WIDTH = 448

IMG_HEIGHT = 256
IMG_WIDTH = 512

N_CHANNELS = 3
MAX_INCHI_LEN = 200

# this works for B4 on TPU; let's try GPU

#BATCH_SIZE_BASE = 128 if TPU else 64

#above too big for model b4 on TPU

BATCH_SIZE_BASE = 32 if TPU else 16 

BATCH_SIZE = BATCH_SIZE_BASE * REPLICAS

N_TEST_IMGS = 1616107
N_TEST_STEPS = N_TEST_IMGS // BATCH_SIZE + 1

TARGET_DTYPE = tf.bfloat16 if TPU else tf.float32

IMAGENET_MEAN = tf.constant([0.485, 0.456, 0.406], dtype=tf.float32)
IMAGENET_STD = tf.constant([0.229, 0.224, 0.225], dtype=tf.float32)

AUTO = tf.data.experimental.AUTOTUNE

if TPU: # get Google Cloud path to dataset for TPU
    # Given Data Train/Val/Test
    #GCS_DS_PATH_IMGS = KaggleDatasets().get_gcs_path('molecular-translation-images-cleaned-tfrecords')
    
    #try new tfrecords with 512x256
    GCS_DS_PATH_IMGS = KaggleDatasets().get_gcs_path('effnb1-tf-data')



In [None]:
#with open('/kaggle/input/molecular-translation-images-cleaned-tfrecords/vocabulary_to_int.pkl', 'rb') as handle:

with open('../input/effnb1-tf-data/vocabulary_to_int.pkl', 'rb') as handle:
    vocabulary_to_int   = pickle.load( handle)

# dictionary to convert the integer encoding to vocabulary
#with open('/kaggle/input/molecular-translation-images-cleaned-tfrecords/int_to_vocabulary.pkl', 'rb') as handle:
with open('../input/effnb1-tf-data/int_to_vocabulary.pkl', 'rb') as handle:

    int_to_vocabulary  = pickle.load( handle)
    
print(f'vocabulary_to_int head: {list(vocabulary_to_int.items())[:5]}')
print(f'int_to_vocabulary head: {list(int_to_vocabulary.items())[:5]}')

In [None]:
# configure problem
VOCAB_SIZE = len(vocabulary_to_int.values())
SEQ_LEN_OUT = MAX_INCHI_LEN
DECODER_DIM = 512
CHAR_EMBEDDING_DIM = 256
ATTENTION_UNITS = 256

print(f'vocabulary size: {VOCAB_SIZE}')

In [None]:
# Decodes the TFRecords to a tuple yielding the image and image_id
@tf.function
def decode_tfrecord_train(record_bytes):
    features = tf.io.parse_single_example(record_bytes, {
        'image': tf.io.FixedLenFeature([], tf.string),
        'image_id': tf.io.FixedLenFeature([], tf.string),
    })

    image = tf.io.decode_png(features['image'])    
    image = tf.reshape(image, [IMG_HEIGHT, IMG_WIDTH, 1])
    image = tf.cast(image, tf.float32)  / 255.0
    image = (image - IMAGENET_MEAN) / IMAGENET_STD
    image = tf.cast(image, TARGET_DTYPE)
    
    image_id = features['image_id']
    
    return image, image_id
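
A note on the normalization step in `decode_tfrecord_train`: the decoded image has a single channel, while `IMAGENET_MEAN` and `IMAGENET_STD` have three entries, so the subtraction broadcasts the grayscale image to three channels. The NumPy sketch below illustrates this on a toy array; the 4x6 image size is an arbitrary assumption, and NumPy stands in for the TensorFlow ops, which follow the same broadcasting rules.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# toy grayscale "image" with a single channel, values already scaled to [0, 1]
gray = np.full((4, 6, 1), 0.5, dtype=np.float32)

# (4, 6, 1) broadcast against (3,) gives (4, 6, 3):
# each output channel is the same grayscale image, normalized with its own mean/std
normalized = (gray - IMAGENET_MEAN) / IMAGENET_STD

print(normalized.shape)  # (4, 6, 3)
```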


In [None]:
# Benchmark function to finetune the dataset
def benchmark_dataset(dataset, num_epochs=3, bs=BATCH_SIZE, N_IMGS_PER_EPOCH=int(100e3) if TPU else 5000):
    n_steps_per_epoch = N_IMGS_PER_EPOCH // (num_epochs * bs)
    start_time = time.perf_counter()
    for epoch_num in range(num_epochs):
        epoch_start = time.perf_counter()
        for idx, (images, image_id) in enumerate(dataset.take(n_steps_per_epoch)):
            if idx == 1 and epoch_num == 0:
                print(f'image shape: {images.shape}, image dtype: {images.dtype}')
        epoch_t = time.perf_counter() - epoch_start
        mean_step_t = round(epoch_t / n_steps_per_epoch * 1000, 1)
        n_imgs_per_s = int(1 / (mean_step_t / 1000) * bs)
        print(f'epoch {epoch_num} took: {round(epoch_t, 2)} sec, mean step duration: {mean_step_t}ms, images/s: {n_imgs_per_s}')

In [None]:
# plots the first few images
def show_batch(dataset, rows=3, cols=2):
    fig, axes = plt.subplots(nrows=rows, ncols=cols, figsize=(cols*7, rows*4))
    imgs, img_ids = next(iter(dataset.unbatch().batch(rows*cols)))
    for r in range(rows):
        for c in range(cols):
            img = imgs[r*cols+c].numpy().astype(np.float32)
            img += abs(img.min())
            img /= img.max()
            axes[r, c].imshow(img)
            axes[r, c].set_title(img_ids[r*cols+c].numpy().decode(), size=16)

In [None]:
#  dataset for the test images
def get_test_dataset(bs=BATCH_SIZE):
    ignore_order = tf.data.Options()
    ignore_order.experimental_deterministic = False
    
    if TPU:
        FNAMES_TRAIN_TFRECORDS = tf.io.gfile.glob(f'{GCS_DS_PATH_IMGS}/test/*.tfrecords')
    else:
        #FNAMES_TRAIN_TFRECORDS = tf.io.gfile.glob('/kaggle/input/molecular-translation-images-cleaned-tfrecords/test/*.tfrecords')
        FNAMES_TRAIN_TFRECORDS = tf.io.gfile.glob('../input/effnb1-tf-data/test/*.tfrecords')
        
    train_dataset = tf.data.TFRecordDataset(FNAMES_TRAIN_TFRECORDS, num_parallel_reads=AUTO if TPU else os.cpu_count())
    train_dataset = train_dataset.with_options(ignore_order)
    train_dataset = train_dataset.prefetch(AUTO)
    train_dataset = train_dataset.map(decode_tfrecord_train, num_parallel_calls=AUTO if TPU else os.cpu_count())
    train_dataset = train_dataset.batch(bs)
    train_dataset = train_dataset.prefetch(1)
    
    return train_dataset

test_dataset = get_test_dataset()


In [None]:
# benchmark dataset, should be roughly 300 images a second
benchmark_dataset(test_dataset)

In [None]:
imgs, img_ids = next(iter(test_dataset))
print(f'imgs.shape: {imgs.shape}, img_ids.shape: {img_ids.shape}')
print(f'imgs dtype: {imgs.dtype}, img_ids dtype: {img_ids.dtype}')
img0 = imgs[0].numpy().astype(np.float32)
train_batch_info = (img0.mean(), img0.std(), img0.min(), img0.max())
print('test img 0 mean: %.3f, std: %.3f, min: %.3f, max: %.3f' % train_batch_info)

In [None]:
show_batch(test_dataset)

# Encoder
An encoder/decoder model with attention is used, which is based on [this](https://www.tensorflow.org/tutorials/text/nmt_with_attention) Tensorflow example.

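The encoder creates the feature maps of the images, which are then used in the decoder. In the baseline, EfficientNetB0 with pretrained noisy-student weights creates 1280 feature maps with dimensions of $14\cdot8$ pixels. These feature maps are flattened by a reshape: $14\cdot8\cdot1280 \Rightarrow 112\cdot1280$. The code below currently loads EfficientNetB7 and reads the number of feature maps from the model into `ENCODER_DIM`, so the actual shapes differ from these B0 figures; the example cell after the encoder prints them.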
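
The flattening step can be checked on a dummy array. This is a minimal NumPy sketch using the EfficientNetB0 figures quoted above and a batch size of 2 as an assumption; Keras returns feature maps in channels-last order, so the array is laid out as (batch, height, width, channels).

```python
import numpy as np

BATCH, HEIGHT, WIDTH, CHANNELS = 2, 8, 14, 1280  # B0 figures, toy batch size
feature_maps = np.zeros((BATCH, HEIGHT, WIDTH, CHANNELS), dtype=np.float32)

# same effect as tf.keras.layers.Reshape([-1, ENCODER_DIM]) in the Encoder:
# the two spatial axes are merged into one axis of 8 * 14 = 112 positions
flattened = feature_maps.reshape(BATCH, -1, CHANNELS)

print(flattened.shape)  # (2, 112, 1280)
```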

In [None]:
class Encoder(tf.keras.Model):
    def __init__(self):
        super(Encoder, self).__init__()
        
        # output: (bs, h, w, ENCODER_DIM), channels last; (bs, 8, 14, 1280) for the B0 baseline
        #self.feature_maps = efn.EfficientNetB0(include_top=False, weights='noisy-student')
        #self.feature_maps = efn.EfficientNetB2(include_top=False, weights='noisy-student')
        #self.feature_maps = efn.EfficientNetB4(include_top=False, weights='noisy-student')
        self.feature_maps = efn.EfficientNetB7(include_top=False, weights='noisy-student')
        # set global encoder dimension variable
        global ENCODER_DIM
        ENCODER_DIM = self.feature_maps.layers[-1].output_shape[-1]
        
        # output: (bs, h * w, ENCODER_DIM); (bs, 112, 1280) for the B0 baseline
        self.reshape = tf.keras.layers.Reshape([-1, ENCODER_DIM], name='reshape_featuere_maps')

        
    def call(self, x, training, debug=False):
        x = self.feature_maps(x, training=training)
        if debug:
            print(f'feature maps shape: {x.shape}')
            
        x = self.reshape(x, training=training)
        if debug:
            print(f'feature maps reshaped shape: {x.shape}')
        
        return x

In [None]:
# Example encoder output
with tf.device('/CPU:0'):
    encoder = Encoder()
    encoder_res = encoder(imgs[:BATCH_SIZE])

print ('Encoder output shape: (batch size, sequence length, units) {}'.format(encoder_res.shape))

# Attention
During decoding, the relevant encoder features differ for each predicted character. The attention mechanism takes two inputs: the hidden state of the LSTM, which is the LSTM state after the last predicted character, and the encoder features. The hidden LSTM state changes with every prediction step, but the encoder result stays the same. Using the hidden LSTM state, the attention mechanism learns which parts of the feature maps are important. The feature maps have a dimension of 8*14 pixels, which are flattened to a vector of size 112. The attention mechanism creates an importance score for each pixel. These scores form a probability distribution over the 112 pixels, summing to 1, and are multiplied with the feature map vectors to produce a single value for each feature map.

To make it a bit less abstract, take the following InChI as an example:

```C13H5F5N2/c14-7-3-6(5-19)1-2-10(7)20-13-11(17)8(15)4-9(16)12(13)18/h1-4,20H```

After predicting C13H5 the attention mechanism should focus on features containing F atoms and leave any feature maps on C or H atoms aside. The LSTM hidden state should tell the attention mechanism it has predicted C13H5 so far and the attention mechanism will learn it has to focus on F atoms after C and H atoms are predicted.
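
The computation described above can be sketched in NumPy. This is not the TensorFlow layer itself: the weight matrices `W_h`, `W_e` and `v` are random stand-ins for the three Dense layers (biases omitted), and all sizes except the 112 positions are toy assumptions. It only shows that the weights sum to 1 over the positions and that the result is one value per feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
BATCH, POSITIONS, ENCODER_DIM, DECODER_DIM, UNITS = 2, 112, 16, 8, 4

encoder_res = rng.normal(size=(BATCH, POSITIONS, ENCODER_DIM))
h = rng.normal(size=(BATCH, DECODER_DIM))

# random stand-ins for the Dense layers H, E and V
W_h = rng.normal(size=(DECODER_DIM, UNITS))
W_e = rng.normal(size=(ENCODER_DIM, UNITS))
v = rng.normal(size=(UNITS, 1))

# project both inputs to UNITS and add them; h is broadcast over all positions
hidden = np.maximum(h[:, None, :] @ W_h + encoder_res @ W_e, 0.0)  # ReLU
score = hidden @ v  # (BATCH, POSITIONS, 1)

# softmax over the positions axis (shifted by the max for numerical stability)
score = score - score.max(axis=1, keepdims=True)
weights = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)

# weighted sum over the positions: one value per feature map
context = (encoder_res * weights).sum(axis=1)  # (BATCH, ENCODER_DIM)

print(weights.shape, context.shape)
```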

In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.H = tf.keras.layers.Dense(units, name='hidden_to_attention_units')
        self.E = tf.keras.layers.Dense(units, name='encoder_res_to_attention_units')
        self.V = tf.keras.layers.Dense(1, name='score_to_alpha')

    def call(self, h, encoder_res, training, debug=False):
        # dense hidden state to attention units size and expand dimension
        h_expand = tf.expand_dims(h, axis=1) # expand dimension
        if debug:
            print(f'h shape: {h.shape}, encoder_res shape: {encoder_res.shape}')
            print(f'h_expand shape: {h_expand.shape}')
            
        h_dense = self.H(h_expand, training=training)
        
        # dense features to units size
        encoder_res_dense = self.E(encoder_res, training=training) # dense to attention

        # add vectors
        score = tf.nn.relu(h_dense + encoder_res_dense)
        if debug:
            print(f'h_dense shape: {h_dense.shape}')
            print(f'encoder_res_dense shape: {encoder_res_dense.shape}')
            print(f'score relu shape: {score.shape}')
        score = self.V(score, training=training)
        
        # softmax over the positions axis: attention weights of shape (bs, positions, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        if debug:
            score_np = score.numpy().astype(np.float32)
            print(f'score V shape: {score.shape}, score min: %.3f score max: %.3f' % (score_np.min(), score_np.max()))
            print(f'attention_weights shape: {attention_weights.shape}')
            aw = attention_weights.numpy().astype(np.float32)
            aw_print_data = (aw.min(), aw.max(), aw.mean(), aw.sum())
            print(f'aw shape: {aw.shape} aw min: %.3f, aw max: %.3f, aw mean: %.3f,aw sum: %.3f' % aw_print_data)
        
        # weight the encoder features with the attention weights: (bs, positions, ENCODER_DIM)
        context_vector = encoder_res * attention_weights
        if debug:
            print(f'first attention weights: {attention_weights.numpy().astype(np.float32)[0,0]}')
            print(f'first encoder_res: {encoder_res.numpy().astype(np.float32)[0,0,0]}')
            print(f'first context_vector: {context_vector.numpy().astype(np.float32)[0,0,0]}')
            
            print(f'42th attention weights: {attention_weights.numpy().astype(np.float32)[0,42]}')
            print(f'42th encoder_res: {encoder_res.numpy().astype(np.float32)[0,42,42]}')
            print(f'42th context_vector: {context_vector.numpy().astype(np.float32)[0,42,42]}')
            
            print(f'encoder_res abs sum: {abs(encoder_res.numpy().astype(np.float32)).sum()}')
            print(f'context_vector abs sum: {abs(context_vector.numpy().astype(np.float32)).sum()}')
            
            print(f'encoder_res shape: {encoder_res.shape}, attention_weights shape: {attention_weights.shape}')
            print(f'context_vector shape: {context_vector.shape}')
        
        # reduce to ENCODER_DIM features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        
        return context_vector

In [None]:
with tf.device('/CPU:0'):
    attention_layer = BahdanauAttention(ATTENTION_UNITS)
    context_vector = attention_layer(tf.zeros([BATCH_SIZE, DECODER_DIM]), encoder_res, debug=True)

print('context_vector shape: (batch size, units) {}'.format(context_vector.shape))
 

# Decoder
The decoder takes the encoder features and predicts one character at a time using an LSTMCell. The LSTMCell input is the context vector from the attention mechanism concatenated with the embedded previous character. The LSTMCell hidden and carry states are initialized from the encoder features. A 30\% dropout is applied to the LSTMCell output before making the final prediction; in this prediction notebook every layer is called with `training=False`, so the dropout is inactive.
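
The shapes involved in one decoder step can be traced with a small NumPy sketch. All sizes are toy assumptions, the projection matrices `W_init_h` and `W_init_c` are random stand-ins for the `init_h` and `init_c` Dense layers, and the LSTM cell itself is left out.

```python
import numpy as np

BATCH, POSITIONS = 2, 112
ENCODER_DIM, DECODER_DIM, CHAR_EMBEDDING_DIM = 16, 8, 4  # toy sizes

rng = np.random.default_rng(0)
encoder_res = rng.normal(size=(BATCH, POSITIONS, ENCODER_DIM))

# initial states: mean over the positions followed by a linear projection
mean_features = encoder_res.mean(axis=1)  # (BATCH, ENCODER_DIM)
W_init_h = rng.normal(size=(ENCODER_DIM, DECODER_DIM))
W_init_c = rng.normal(size=(ENCODER_DIM, DECODER_DIM))
h0 = mean_features @ W_init_h  # (BATCH, DECODER_DIM)
c0 = mean_features @ W_init_c  # (BATCH, DECODER_DIM)

# LSTM input for one step: attention context concatenated with the character embedding
context = rng.normal(size=(BATCH, ENCODER_DIM))  # stand-in for the attention output
char_embedding = rng.normal(size=(BATCH, CHAR_EMBEDDING_DIM))
lstm_input = np.concatenate([context, char_embedding], axis=-1)

print(h0.shape, c0.shape, lstm_input.shape)
```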

In [None]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, attention_units, encoder_dim, decoder_dim, char_embedding_dim):
        super(Decoder, self).__init__()
        self.vocab_size = vocab_size
        self.attention_units = attention_units
        self.encoder_dim = encoder_dim
        self.decoder_dim = decoder_dim
        
        self.init_h = tf.keras.layers.Dense(units=decoder_dim, input_shape=[encoder_dim], name='encoder_res_to_hidden_init')
        self.init_c = tf.keras.layers.Dense(units=decoder_dim, input_shape=[encoder_dim], name='encoder_res_to_inp_act_init')
        self.lstm_cell = tf.keras.layers.LSTMCell(decoder_dim, name='lstm_char_predictor')
        self.fcn = tf.keras.layers.Dense(units=vocab_size, input_shape=[decoder_dim], dtype=tf.float32, name='lstm_output_to_char_probs')
        self.do = tf.keras.layers.Dropout(0.30, name='prediction_dropout')
        
        self.embedding = tf.keras.layers.Embedding(vocab_size, char_embedding_dim)

        # used for attention
        self.attention = BahdanauAttention(self.attention_units)

    def call(self, char, h, c, enc_output):
        # embed previous character
        char = self.embedding(char, training=False)
        char = tf.squeeze(char, axis=1)
        # get attention alpha and context vector
        context = self.attention(h, enc_output, training=False)

        # concat context and char to create lstm input
        lstm_input = tf.concat((context, char), axis=-1)
        
        # LSTM call, get new h, c
        _, (h_new, c_new) = self.lstm_cell(lstm_input, (h, c), training=False)
        
        # compute predictions with dropout
        output = self.do(h_new, training=False)
        output = self.fcn(output, training=False)

        return output, h_new, c_new
    
    def init_hidden_state(self, encoder_out):
        mean_encoder_out = tf.math.reduce_mean(encoder_out, axis=1)
        h = self.init_h(mean_encoder_out, training=False)  # (batch_size, decoder_dim)
        c = self.init_c(mean_encoder_out, training=False)
        return h, c

# Model

In [None]:
# The start/end/pad tokens will be removed from the string when computing the Levenshtein distance
START_TOKEN = tf.constant(vocabulary_to_int.get('<start>'), dtype=tf.int64)
END_TOKEN = tf.constant(vocabulary_to_int.get('<end>'), dtype=tf.int64)
PAD_TOKEN = tf.constant(vocabulary_to_int.get('<pad>'), dtype=tf.int64)

In [None]:
# Models
tf.keras.backend.clear_session()

# enable XLA optimizations
tf.config.optimizer.set_jit(True)

with strategy.scope():
    encoder = Encoder()
    encoder.build(input_shape=[BATCH_SIZE, IMG_HEIGHT, IMG_WIDTH, N_CHANNELS])
    encoder_res = encoder(imgs[:BATCH_SIZE])
    #encoder.load_weights('../input/b2-encdec/encoder_epoch_12.h5')
    #encoder.load_weights('../input/b4-acc996-encdec/encoder_epoch_12.h5')
    
    #encoder.load_weights('../input/512x256-acc998-encdec/encoder_epoch_20.h5')
    
    #encoder.load_weights('../input/512x256-valloss-1-encdec/encoder_epoch_20.h5')
    
    #encoder.load_weights('../input/b7-acc997-51x256-encdec/encoder_epoch_30.h5')
    
    encoder.load_weights('../input/b7-lsd2d9-512x256-encdec/encoder_epoch_12.h5')
    
    
    encoder.trainable = False
    encoder.compile()

    decoder = Decoder(VOCAB_SIZE, ATTENTION_UNITS, ENCODER_DIM, DECODER_DIM, CHAR_EMBEDDING_DIM)
    h, c = decoder.init_hidden_state(encoder_res)
    preds, h, c = decoder(tf.ones([BATCH_SIZE, 1]), h, c, encoder_res)
    #decoder.load_weights('../input/b2-encdec/decoder_epoch_12.h5')
    #decoder.load_weights('../input/b4-acc996-encdec/decoder_epoch_12.h5')
    
    #decoder.load_weights('../input/512x256-acc998-encdec/decoder_epoch_20.h5')
    
    #decoder.load_weights('../input/512x256-valloss-1-encdec/decoder_epoch_20.h5')
    
    #decoder.load_weights('../input/b7-acc997-51x256-encdec/decoder_epoch_30.h5')
    decoder.load_weights('../input/b7-lsd2d9-512x256-encdec/decoder_epoch_12.h5')
    
    decoder.trainable = False
    decoder.compile()


In [None]:
encoder.summary()




In [None]:
decoder.summary()

# Predictions

Due to popular demand, the prediction loop has been modified to run on a TPU. The last batch is run on a single TPU core instead of being distributed over all 8. This adaptation is needed because the last batch has a different size that cannot be distributed evenly over all TPU cores and would therefore throw an error, as pointed out by dragon zhang. Predictions on a TPU take less than 20 minutes, about 10 times faster than on a GPU :D
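
The size of that last batch follows from the constants defined earlier. The arithmetic below assumes 8 replicas (the 8 TPU cores mentioned above) and the TPU value of `BATCH_SIZE_BASE` of 32; with other settings the numbers change accordingly.

```python
N_TEST_IMGS = 1616107
REPLICAS = 8                # assumption: 8 TPU cores
BATCH_SIZE = 32 * REPLICAS  # BATCH_SIZE_BASE * REPLICAS on TPU

n_full_batches, last_batch_size = divmod(N_TEST_IMGS, BATCH_SIZE)
n_test_steps = n_full_batches + 1  # same value as N_TEST_IMGS // BATCH_SIZE + 1

print(n_test_steps)                # 6313
print(last_batch_size)             # 235
print(last_batch_size % REPLICAS)  # 3, so the last batch cannot be split evenly over 8 cores
```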


In [None]:
# converts an integer encoded InChI prediction to a correct InChI string
# Note the "InChI=1S/" part is prepended and all <start>/<end>/<pad> tokens are ignored

END_TOKEN = vocabulary_to_int.get('<end>')
START_TOKEN = vocabulary_to_int.get('<start>')
PAD_TOKEN =  vocabulary_to_int.get('<pad>')

def int2char(i_str):
    res = 'InChI=1S/'
    for i in i_str:
        if i == END_TOKEN:
            return res
        elif i != START_TOKEN and i != PAD_TOKEN:
            res += int_to_vocabulary.get(i)
    return res
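
To see what the decoding does, here is a self-contained toy version. The vocabulary below is hypothetical (the real one is loaded from the pickle files) and `toy_int2char` only mirrors the logic of `int2char`: decoding stops at the first `<end>` token, and `<start>` and `<pad>` tokens are skipped.

```python
# hypothetical toy vocabulary, for illustration only
toy_vocabulary_to_int = {'<start>': 0, '<end>': 1, '<pad>': 2, 'C': 3, 'H': 4, '4': 5}
toy_int_to_vocabulary = {i: s for s, i in toy_vocabulary_to_int.items()}

def toy_int2char(tokens):
    res = 'InChI=1S/'
    for i in tokens:
        # everything after the first <end> token is ignored
        if i == toy_vocabulary_to_int['<end>']:
            return res
        # <start> and <pad> tokens are skipped
        if i not in (toy_vocabulary_to_int['<start>'], toy_vocabulary_to_int['<pad>']):
            res += toy_int_to_vocabulary[i]
    return res

decoded = toy_int2char([0, 3, 4, 5, 1, 2, 2])
print(decoded)  # InChI=1S/CH4
```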

In [None]:
# Makes the InChI predictions for a batch of images
def prediction_step(imgs):
    # get the feature maps from the encoder
    encoder_res = encoder(imgs)
    # initialize the hidden LSTM states given the feature maps
    h, c = decoder.init_hidden_state(encoder_res)
    
    # initialize the prediction results with the <start> token
    predictions_seq = tf.fill([len(imgs), 1], value=vocabulary_to_int.get('<start>'))
    predictions_seq = tf.cast(predictions_seq, tf.int32)
    # first decoder input is always the <start> token
    dec_input = tf.expand_dims([vocabulary_to_int.get('<start>')] * len(imgs), 1)

    # Greedy decoding: no targets at inference, so the predicted character is fed back as the next input
    for t in range(1, SEQ_LEN_OUT):
        # make character prediction and receive new LSTM states
        predictions, h, c = decoder(dec_input, h, c, encoder_res)
        
        # argmax over the vocabulary logits to get the predicted character
        dec_input = tf.math.argmax(predictions, axis=1, output_type=tf.int32)
               
        # expand dimension of prediction to make valid decoder input
        dec_input = tf.expand_dims(dec_input, axis=1)
        
        # add character to predictions
        predictions_seq = tf.concat([predictions_seq, dec_input], axis=1)
            
    return predictions_seq
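
The loop in `prediction_step` is greedy decoding: at every step the highest scoring character is taken and fed back as the next input. The pure Python sketch below shows the same control flow with a hypothetical lookup table (`NEXT_TOKEN_SCORES`) in place of the decoder. As in the notebook, decoding always runs to the fixed maximum length, and tokens after `<end>` are discarded later when converting to a string.

```python
START, END = 0, 1
MAX_LEN = 6

# hypothetical "decoder": scores for the next token given the previous token
NEXT_TOKEN_SCORES = {
    0: [0.0, 0.1, 0.7, 0.2],  # after <start>, token 2 scores highest
    2: [0.0, 0.2, 0.1, 0.7],  # after token 2, token 3 scores highest
    3: [0.0, 0.8, 0.1, 0.1],  # after token 3, <end> scores highest
    1: [0.0, 1.0, 0.0, 0.0],  # <end> keeps predicting <end>
}

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def greedy_decode():
    sequence = [START]
    token = START
    for _ in range(1, MAX_LEN):
        # the model's own prediction becomes the next input
        token = argmax(NEXT_TOKEN_SCORES[token])
        sequence.append(token)
    return sequence

print(greedy_decode())  # [0, 2, 3, 1, 1, 1]
```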


In [None]:
# distributed test step, will also run on TPU :D
@tf.function
def distributed_test_step(imgs):
    per_replica_predictions = strategy.run(prediction_step, args=[imgs])
    predictions = strategy.gather(per_replica_predictions, axis=0)
    
    return predictions

In [None]:
# perform a test step on a single device, used for the last batch, which has a different size
@tf.function
def test_step_last_batch(imgs):
    return prediction_step(imgs)


In [None]:

# list with predicted InChI's
predictions_inchi = []
# List with image id's
predictions_img_ids = []
# Distributed test set, needed for TPU
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)

# Prediction Loop
for step, (per_replica_imgs, per_replica_img_ids) in tqdm(enumerate(test_dist_dataset), total=N_TEST_STEPS):
    # special step for last batch which has a different size
    # this step will take about half a minute because the function needs to be compiled
    if TPU and step == N_TEST_STEPS - 1:
        imgs_single_device = strategy.gather(per_replica_imgs, axis=0)
        preds = test_step_last_batch(imgs_single_device)
    else:
        # make test step and get predictions
        preds = distributed_test_step(per_replica_imgs)
    
    # get image ids
    img_ids = strategy.gather(per_replica_img_ids, axis=0)
    
    # decode integer encoded predictions to characters and add to InChI's prediction list
    predictions_inchi += [int2char(p) for p in preds.numpy()]
    # add image id's to list
    predictions_img_ids += [e.decode() for e in img_ids.numpy()]


In [None]:
# create DataFrame with image ids and predicted InChI's
submission = pd.DataFrame({ 'image_id': predictions_img_ids, 'InChI': predictions_inchi }, dtype='string')
# save as CSV file so we can submit it :D
submission.to_csv('submission.csv', index=False)
# show head of submission, sanity check
pd.options.display.max_colwidth = 200
submission.head()


In [None]:
# submission csv info, important, it should contain 1616107 rows!!!
submission.info()

In [None]:
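# Optional post-processing: round-trip a predicted InChI through RDKit to normalize it.
# RDKit is not installed by this notebook; on any failure the raw prediction is returned.
def normalize_inchi(inchi):
    try:
        from rdkit import Chem  # imported lazily so the notebook also runs without RDKit
        mol = Chem.MolFromInchi(inchi)
        # an unparsable InChI yields None, keep the raw prediction in that case
        return inchi if mol is None else Chem.MolToInchi(mol)
    except Exception:
        return inchi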

In [None]:
#submission['InChI']=submission.apply(lambda x: normalize_inchi(x['InChI']), axis=1)


 
