Hello fellow Kagglers,

Getting into the medal zone of the leaderboard was a challenge for me, as I was consistently stuck around position 125.

Normalizing the predictions gave a nice LB improvement of 0.12 (2.12 -> 2.00), but this was not enough to get into the medal zone.
The normalization of the predicted InChIs was adapted from [this](https://www.kaggle.com/wuliaokaola/bmsmt-0331-normalize-your-predictions) notebook.

In the last few days of the competition I was inspired by [this](https://www.kaggle.com/c/bms-molecular-translation/discussion/242082) discussion about ensembling techniques.

I started working on this notebook and got a massive LB improvement of 0.19, which was enough to get me into the medal zone!

Hope you enjoy this notebook and see you in the next competition :D

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import re

from tqdm.notebook import tqdm

# initialize pandas apply progress bar
tqdm.pandas()

Five different submissions are used in this ensemble, all produced by one model at different training checkpoints. The concept behind ensembling is simple: further training improves the overall score, but it also introduces mistakes that were not made earlier. In other words, there will be images that the 2.12 LB submission gets wrong but the 2.21 LB submission gets right. Those are the predictions we are interested in and want to find.

In [None]:
# names of the submission files, which indicate the LB score
NAMES = ['212_v2', '212', '218', '219', '221']

# Read Submissions

Here all submission files are read and the InChI column of each is renamed after its LB score. The $N$ suffix indicates that the InChIs are normalized. Throughout this notebook, the predicted InChI refers to the InChI predicted by the model and the normalized InChI refers to the normalized predicted InChI.

In [None]:
# debugging variable to select a subset of each submission
n = int(10e10)

print('Reading 212 v2...')
submission_212_v2 = pd.read_csv('../input/bms-submissions/submission_2.12_v2.csv').head(n)
submission_212_v2.rename({ 'InChI': 'InChI212_v2' }, axis=1, inplace=True)
submission_212_v2N = pd.read_csv('../input/bms-submissions/submission_norm_2.12_v2.csv').head(n)
submission_212_v2N.rename({ 'InChI': 'InChI212_v2N' }, axis=1, inplace=True)

print('Reading 212...')
submission_212 = pd.read_csv('../input/bms-submissions/submission_2.12.csv').head(n)
submission_212.rename({ 'InChI': 'InChI212' }, axis=1, inplace=True)
submission_212N = pd.read_csv('../input/bms-submissions/submission_norm_2.12.csv').head(n)
submission_212N.rename({ 'InChI': 'InChI212N' }, axis=1, inplace=True)

print('Reading 218...')
submission_218 = pd.read_csv('../input/bms-submissions/submission_2.18.csv').head(n)
submission_218.rename({ 'InChI': 'InChI218' }, axis=1, inplace=True)
submission_218N = pd.read_csv('../input/bms-submissions/submission_norm_2.18.csv').head(n)
submission_218N.rename({ 'InChI': 'InChI218N' }, axis=1, inplace=True)

print('Reading 219...')
submission_219 = pd.read_csv('../input/bms-submissions/submission_2.19.csv').head(n)
submission_219.rename({ 'InChI': 'InChI219' }, axis=1, inplace=True)
submission_219N = pd.read_csv('../input/bms-submissions/submission_norm_2.19.csv').head(n)
submission_219N.rename({ 'InChI': 'InChI219N' }, axis=1, inplace=True)

print('Reading 221...')
submission_221 = pd.read_csv('../input/bms-submissions/submission_2.21.csv').head(n)
submission_221.rename({ 'InChI': 'InChI221' }, axis=1, inplace=True)
submission_221N = pd.read_csv('../input/bms-submissions/submission_norm_2.21.csv').head(n)
submission_221N.rename({ 'InChI': 'InChI221N' }, axis=1, inplace=True)

print('Done Reading')
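The ten near-identical reads above could also be driven by `NAMES`. This is a minimal sketch, assuming the file naming pattern seen above (`submission_2.12_v2.csv`, `submission_norm_2.12_v2.csv`). The helpers `csv_stem` and `read_pair` are hypothetical and not part of the original notebook:

```python
import pandas as pd

def csv_stem(name):
    """Map a NAMES entry such as '212_v2' to the file stem '2.12_v2'."""
    return f'{name[0]}.{name[1:]}'

def read_pair(name, folder='../input/bms-submissions', nrows=None):
    """Read the predicted and the normalized submission for one checkpoint."""
    stem = csv_stem(name)
    predicted = pd.read_csv(f'{folder}/submission_{stem}.csv', nrows=nrows)
    normalized = pd.read_csv(f'{folder}/submission_norm_{stem}.csv', nrows=nrows)
    # Rename the InChI column after the LB score, with an N suffix when normalized
    predicted = predicted.rename(columns={'InChI': f'InChI{name}'})
    normalized = normalized.rename(columns={'InChI': f'InChI{name}N'})
    return predicted, normalized
```

Calling `read_pair('221')` would then return the same two frames as the last pair of reads above, and `nrows` plays the role of the debugging variable `n`.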

# Merge Submission

Merge all submission files into a single DataFrame. Merging is done on `image_id`.

In [None]:
submission = pd.DataFrame({ 'image_id': submission_218['image_id'] })

# Adding 212_v2
submission = submission.merge(submission_212_v2, on='image_id')
submission = submission.merge(submission_212_v2N, on='image_id')

# Adding 212
submission = submission.merge(submission_212, on='image_id')
submission = submission.merge(submission_212N, on='image_id')

# Adding 218
submission = submission.merge(submission_218, on='image_id')
submission = submission.merge(submission_218N, on='image_id')

# Adding 219
submission = submission.merge(submission_219, on='image_id')
submission = submission.merge(submission_219N, on='image_id')

# Adding 221
submission = submission.merge(submission_221, on='image_id')
submission = submission.merge(submission_221N, on='image_id')
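The chain of merges above can be folded into a single call. The sketch below uses toy frames in place of the real submissions, and the helper `merge_on_image_id` is a hypothetical name. It also passes `validate='one_to_one'`, so pandas raises an error if any `image_id` is duplicated in either frame:

```python
from functools import reduce

import pandas as pd

def merge_on_image_id(frames):
    """Inner-join a list of DataFrames on 'image_id', one row per image."""
    return reduce(
        lambda left, right: left.merge(right, on='image_id', validate='one_to_one'),
        frames,
    )

# Toy frames standing in for the real submissions (note the different row order)
a = pd.DataFrame({'image_id': ['x', 'y'], 'InChI212': ['A', 'B']})
b = pd.DataFrame({'image_id': ['y', 'x'], 'InChI212N': ['b', 'a']})
merged = merge_on_image_id([a, b])
```

Because the join is on `image_id`, the row order of the individual submission files does not matter.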

In [None]:
# For each submission file the original and normalized InChI are shown
display(submission.head())

In [None]:
# Check if we indeed have 1616107 rows (info() prints its report itself)
submission.info()

This next column declaration is important and will be used later on. The boolean indicates whether the normalized InChIs of all submissions are equal for a given image, i.e. whether all predictions normalize to the same InChI.

In [None]:
submission['equal'] = (
    (submission['InChI212_v2N'] == submission['InChI212N']) &
    (submission['InChI212_v2N'] == submission['InChI218N']) &
    (submission['InChI212_v2N'] == submission['InChI219N']) &
    (submission['InChI212_v2N'] == submission['InChI221N'])
)
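The same flag can be computed without listing every column by hand. This is a small sketch on a toy frame, assuming the column naming used above; `norm_cols` and `toy` are illustrative names only:

```python
import pandas as pd

NAMES = ['212_v2', '212', '218', '219', '221']
norm_cols = [f'InChI{name}N' for name in NAMES]

toy = pd.DataFrame(
    [
        ['a'] * 5,                  # all submissions agree
        ['a', 'a', 'b', 'a', 'a'],  # one submission disagrees
        ['error'] * 5,              # all agree, but on 'error'
    ],
    columns=norm_cols,
)

# Compare every normalized column against the first one, row by row
toy['equal'] = toy[norm_cols].eq(toy[norm_cols[0]], axis=0).all(axis=1)
```

The last toy row shows that `equal` is also true when every submission normalizes to `'error'`, so a separate check for `'error'` is still needed wherever the flag is used.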

# Submission Statistics

In [None]:
# Percentage of rows where all submissions are equal
percentage_equal = submission['equal'].sum() / len(submission) * 100
print(f'percentage_equal: {percentage_equal:.3f}%')

These next percentages indicate how many InChIs were not valid and could therefore not be normalized. During normalization, a non-valid InChI would normally default to the predicted InChI. For this method to work, however, non-valid InChIs need to be normalized to "error".

In [None]:
# Error rate in each submission
for name in NAMES:
    # use `name` so the debugging variable `n` defined earlier is not overwritten
    error_rate = len(submission.loc[submission[f'InChI{name}N'] == 'error']) / len(submission) * 100
    print(f'{name.ljust(6)} error rate: {error_rate:.3f}%')
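The `'error'` sentinel counted above has to be produced during normalization. This is a minimal sketch of such a wrapper, assuming the normalizer is any callable that raises an exception or returns an empty result for an invalid InChI. Both `normalize_or_error` and `toy_normalizer` are hypothetical stand-ins, not the code from the linked normalization notebook:

```python
def normalize_or_error(inchi, normalizer):
    """Return the normalized InChI, or the sentinel 'error' if normalization fails."""
    try:
        normalized = normalizer(inchi)
    except Exception:
        return 'error'
    # Treat None or an empty string as a failed normalization too
    return normalized if normalized else 'error'

def toy_normalizer(inchi):
    # Stand-in for a real chemistry toolkit: it only checks the prefix
    if not inchi.startswith('InChI=1S/'):
        raise ValueError('not a standard InChI')
    return inchi.strip()

ok = normalize_or_error('InChI=1S/CH4/h1H4 ', toy_normalizer)
bad = normalize_or_error('CH4', toy_normalizer)
```

Here `ok` holds the cleaned string and `bad` holds `'error'`, which is the value the error-rate loop compares against.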

This next percentage shows the rows where some, but not all, predicted InChIs could not be normalized, so at least one could. This is an indication of the expected improvement. Intuitively, a normalizable InChI has a higher chance of being the correct prediction: a non-normalizable InChI does not describe an existing molecule and is therefore incorrect, whereas a normalizable InChI could be the correct one.

In [None]:
# Submission where some, but not all, have an error
error_sums = (
    (submission['InChI212_v2N'] == 'error').astype(int) +
    (submission['InChI212N'] == 'error').astype(int) +
    (submission['InChI218N'] == 'error').astype(int) +
    (submission['InChI219N'] == 'error').astype(int) +
    (submission['InChI221N'] == 'error').astype(int)
)
non_overlapping_error_ratio = ((error_sums < len(NAMES)) & (error_sums > 0)).sum() / len(submission) * 100
print(f'Non overlapping error ratio: {non_overlapping_error_ratio:.3f}%')
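The per-row error count can also be taken over all normalized columns at once. This is a sketch on a toy frame, assuming the same column naming as above; `toy`, `error_counts` and `partial` are illustrative names:

```python
import pandas as pd

NAMES = ['212_v2', '212', '218', '219', '221']
norm_cols = [f'InChI{name}N' for name in NAMES]

toy = pd.DataFrame(
    [
        ['a', 'a', 'a', 'a', 'a'],          # no errors
        ['error', 'a', 'a', 'error', 'a'],  # some, but not all, are errors
        ['error'] * 5,                      # every submission is an error
    ],
    columns=norm_cols,
)

# Number of submissions per row that could not be normalized
error_counts = toy[norm_cols].eq('error').sum(axis=1)
# Rows where at least one, but not every, submission failed
partial = (error_counts > 0) & (error_counts < len(norm_cols))
partial_ratio = partial.mean() * 100
```

Only the middle toy row counts as partial, which is exactly the kind of row where falling back to another checkpoint can help.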

# Submission Selection Process

This next loop is the beating heart of the ensemble, where the selection process takes place. The base case is when all predictions are equal and can be normalized. In this case it doesn't matter which prediction is selected, as they are all equal. If they are not all equal, the normalizable prediction from the best scoring submission is selected. If no prediction could be normalized, the raw prediction from the best scoring submission is used.

In [None]:
submission_final_dict = dict()
# Selection Statistics
selection_stats = {
    '12_v2': 0,
    '12': 0,
    '18': 0,
    '19': 0,
    '21': 0,
    'e': 0,
}

for idx, row in tqdm(submission.iterrows(), total=len(submission)):
    # All Equal and not error, use submission
    if row['equal'] and row['InChI212_v2N'] != 'error':
        submission_final_dict[row['image_id']] = row['InChI212_v2N']
    # else, choose best submission without error
    else:
        # Choose best normalized submission without error
        if row['InChI212_v2N'] != 'error':
            selection_stats['12_v2'] += 1
            submission_final_dict[row['image_id']] = row['InChI212_v2N']
        elif row['InChI212N'] != 'error':
            selection_stats['12'] += 1
            submission_final_dict[row['image_id']] = row['InChI212N']
        elif row['InChI218N'] != 'error':
            selection_stats['18'] += 1
            submission_final_dict[row['image_id']] = row['InChI218N']
        elif row['InChI219N'] != 'error':
            selection_stats['19'] += 1
            submission_final_dict[row['image_id']] = row['InChI219N']
        elif row['InChI221N'] != 'error':
            selection_stats['21'] += 1
            submission_final_dict[row['image_id']] = row['InChI221N']
        # if none could be normalized use best submission
        else:
            selection_stats['e'] += 1
            submission_final_dict[row['image_id']] = row['InChI212_v2']
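Stripped of the bookkeeping, the rule in the loop above is "take the first normalized InChI that is not `'error'`, in order of LB score, else fall back to the raw best prediction". This is a minimal sketch of that rule as a standalone function. The name `select_prediction` is hypothetical and the input strings are toy values:

```python
def select_prediction(normalized, fallback):
    """Pick the first normalized InChI that is not 'error'.

    `normalized` is ordered from best to worst LB score and `fallback` is the
    raw prediction of the best submission. Returns (choice, source_index),
    where source_index is None when the fallback was used.
    """
    for i, inchi in enumerate(normalized):
        if inchi != 'error':
            return inchi, i
    return fallback, None

first_ok = select_prediction(['a', 'a', 'a', 'a', 'a'], 'raw')
second_ok = select_prediction(['error', 'b', 'c', 'error', 'b'], 'raw')
none_ok = select_prediction(['error'] * 5, 'raw')
```

Written this way it is visible that the `equal` check in the loop does not change which InChI is selected. It only keeps the all-equal rows out of `selection_stats`.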

This pie chart shows the selection statistics. As expected, the selection process prefers high scoring submissions, which is explained by the order of the selection. Around half of the selections default to the best scoring prediction, as no prediction could be normalized. This indicates there is room for improvement, as these predictions are incorrect.

In [None]:
# Show Selection Statistics
plt.figure(figsize=(10,10))
pd.Series(selection_stats, name='fill counts').plot.pie(title='Selection Statistics', legend=False, autopct='%1.1f%%')
plt.show()

In [None]:
display(pd.Series(selection_stats).to_frame(name='fill counts'))

# Make DataFrame

In [None]:
# Transform dictionary to dataframe
submission_final = pd.DataFrame.from_dict(submission_final_dict, orient='index', columns=['InChI'])
# Set the index_name to 'image_id' as required by the submission format
submission_final.index.name = 'image_id'

# Remove /p and /q from submission

The normalized InChIs contained InChI parts which were not present in the training data, namely the /p and /q parts. Assuming the test dataset also doesn't contain these parts, an improvement could be expected from removing them. However, no LB improvement was observed, which is probably due to the low number of occurrences of these parts. It would still be interesting to know whether this removal actually improved the score.

In [None]:
removed_pq_parts = 0

def remove_pq(InChI, debug=False):
    global removed_pq_parts

    InChI_Original = InChI
    # P: find the content of every /p part (raw strings avoid invalid-escape warnings)
    substr_p = re.findall(r'(?<=/p)[^/]*', InChI)
    for substr in substr_p:
        removed_pq_parts += 1
        InChI = InChI.replace(f'/p{substr}', '')
    # Q: same for every /q part
    substr_q = re.findall(r'(?<=/q)[^/]*', InChI)
    for substr in substr_q:
        removed_pq_parts += 1
        InChI = InChI.replace(f'/q{substr}', '')

    # Optionally print the InChI before and after removal
    if debug and (substr_p or substr_q):
        print('=' * 50)
        print(InChI_Original)
        print(InChI)
        print('=' * 50)

    return InChI
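To make the effect concrete, here is a small self-contained sketch that does the same removal with a single regular expression. `strip_pq` is a hypothetical compact variant, not a replacement for the function above, and the example InChI strings were written by hand for illustration:

```python
import re

def strip_pq(inchi):
    """Drop every /p... and /q... part; return (cleaned_inchi, parts_removed)."""
    return re.subn(r'/[pq][^/]*', '', inchi)

protonated = strip_pq('InChI=1S/C2H7N/c1-2-3/h2-3H2,1H3/p+1')
charged = strip_pq('InChI=1S/Na/q+1')
untouched = strip_pq('InChI=1S/CH4/h1H4')
```

`re.subn` returns the new string together with the number of substitutions, so the count could be summed per row instead of being tracked in a global variable.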

In [None]:
# Remove /p and /q parts as they are not present in the training set
submission_final['InChI'] = submission_final['InChI'].progress_apply(remove_pq)

In [None]:
# Show statistics about /p /q removal
print(f'Total number of removed parts: {removed_pq_parts}')
removed_pq_parts_percentage = removed_pq_parts / len(submission_final) * 100
print(f'Removed /p and /q parts relative to the number of InChIs: {removed_pq_parts_percentage:.3f}%')

# Submission to CSV

In [None]:
# Show submission head
display(submission_final.head())

In [None]:
# Check if there are 1616107 rows!!! (info() prints its report itself)
submission_final.info()
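Instead of reading the row count off the printed summary, the checks can be asserted. This is a sketch assuming the submission layout built above (an index named `image_id` and a single `InChI` column). `check_submission` is a hypothetical helper and the toy frame stands in for `submission_final`:

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert df.index.name == 'image_id', 'index must be named image_id'
    assert list(df.columns) == ['InChI'], 'expected a single InChI column'
    assert len(df) == expected_rows, f'expected {expected_rows} rows, got {len(df)}'
    assert df.index.is_unique, 'duplicate image_id values'
    assert df['InChI'].notna().all(), 'missing InChI values'
    return True

toy = pd.DataFrame(
    {'InChI': ['InChI=1S/CH4/h1H4']},
    index=pd.Index(['abc'], name='image_id'),
)
is_valid = check_submission(toy, expected_rows=1)
```

For the real file the call would be `check_submission(submission_final, expected_rows=1616107)`.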

In [None]:
# write submission file to CSV
submission_final.to_csv('submission.csv', index=True)