This notebook is a work in progress to better understand the InChI format and the image dataset. This is an interesting problem statement with the flavor of Image Captioning. 

This notebook includes:

* Breakdown of the InChI format with some insights. 
* W&B Artifacts for dataset versioning.
* Data Visualization using interactive W&B dashboard. 

Hope you like the work so far. 

In [None]:
%%capture
# Install Weights and Biases.
!pip install wandb -q

In [None]:
import tensorflow as tf
print(tf.__version__)

import os
os.environ["WANDB_SILENT"] = "true"

import re
import cv2
import glob
import numpy as np
import pandas as pd
from PIL import Image
import seaborn as sns
from functools import partial
import matplotlib.pyplot as plt

from tqdm.auto import tqdm
tqdm.pandas()

from IPython.display import display

%matplotlib inline

In [None]:
import wandb
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
wandb_api = user_secrets.get_secret("wandb_api")

wandb.login(key=wandb_api)

In [None]:
WORKING_DIR = '../input/bms-molecular-translation/'
LOG_AS_ARTIFACT = True

# EDA Caption

In [None]:
train_df = pd.read_csv('../input/bms-molecular-translation/train_labels.csv')

if LOG_AS_ARTIFACT:
    # Log the raw train_labels.csv file as W&B artifact
    run = wandb.init(project='bms', job_type='raw-dataset')
    artifact = wandb.Artifact('raw', type='dataset')
    artifact.add_file(WORKING_DIR+'train_labels.csv')
    run.log_artifact(artifact)
    run.finish()  # finish() replaces the deprecated join()

display("Description of the train_labels.csv")
display(train_df.describe())

display("First 10 rows of the train_labels.csv")
display(train_df.head(10))

📌 There are 2,424,186 unique images in total, and each image has a unique InChI identifier. 

### Add path to images as a column.

In [None]:
train_df['path'] = train_df['image_id'].progress_apply(
    lambda x: "../input/bms-molecular-translation/train/{}/{}/{}/{}.png".format(
        x[0], x[1], x[2], x))

train_df.to_csv('train_labels_path.csv', index=False)

if LOG_AS_ARTIFACT:
    # Log the modified csv file as W&B artifact
    run = wandb.init(project='bms', job_type='modified-dataset')
    artifact_raw = run.use_artifact('ayush-thakur/bms/raw:v0', type='dataset')

    artifact = wandb.Artifact('labels-path', type='dataset')
    artifact.add_file('train_labels_path.csv')
    run.log_artifact(artifact)
    run.finish()  # finish() replaces the deprecated join()

display(train_df.head())

DISPLAY_IMGS = 50
paths = train_df['path'].values[:DISPLAY_IMGS]
inchis = train_df['InChI'].values[:DISPLAY_IMGS]

run = wandb.init(project='bms', job_type='image-visualization')
wandb.log({'Example Images': [wandb.Image(img_path, caption=inchi) for img_path, inchi in zip(paths, inchis)]})
run.finish()

run

* You can click on the pencil icon in the Media panel to change how you want to visualize the images. 
* I have turned on the "Smooth Image" feature. 

# Two Words on InChI

> International Chemical Identifier (InChI) is a textual identifier for chemical substances, designed to provide a standard way to encode molecular information and to facilitate the search for such information in databases and on the web. ([Source](https://en.wikipedia.org/wiki/International_Chemical_Identifier))

Let's understand the format of the label string: 

## 1. InChI starts with "InChI="

📌 Every InChI starts with the string "InChI=" followed by the version number, currently 1.

In [None]:
inchi_labels = train_df['InChI'].values
# Use startswith so that only the header is checked, not any substring of the label.
count = sum(inchi_label.startswith('InChI=1') for inchi_label in inchi_labels)

print(f'There are {count} label strings starting with InChI= followed by version 1')

## 2. S stands for Standard InChI

📌 For **standard InChIs**, the version number is followed by the letter `S`. A standard InChI is a fully standardized InChI flavor that maintains the same level of attention to structure details and the same conventions for drawing perception.
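The header described in sections 1 and 2 can also be read with a single regular expression. This is only a minimal sketch: `HEADER_RE` and `parse_inchi_header` are hypothetical names that are not used elsewhere in this notebook, and the pattern assumes an integer version number optionally followed by `S`.

```python
import re

# Hypothetical helper: "InChI=", an integer version, an optional "S", then "/".
HEADER_RE = re.compile(r"^InChI=(\d+)(S?)/")


def parse_inchi_header(inchi):
    """Return (version, is_standard) for an InChI string."""
    match = HEADER_RE.match(inchi)
    if match is None:
        raise ValueError(f"Not an InChI string: {inchi!r}")
    return int(match.group(1)), match.group(2) == "S"


print(parse_inchi_header("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
print(parse_inchi_header("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Anchoring the pattern at the start of the string avoids counting labels that merely contain the text somewhere else.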

In [None]:
inchi_labels = train_df['InChI'].values
# Use startswith so that only the header is checked, not any substring of the label.
count = sum(inchi_label.startswith('InChI=1S') for inchi_label in inchi_labels)

print(f'There are {count} standard InChI labels.')

## 3. InChI has layers and sublayers

📌 The remaining information in the string is structured as a sequence of layers and sublayers, with each layer providing one specific type of information. 

📌 The layers and sublayers are separated by '/' and start with a prefix character. InChI defines six layers; the sections below go through the ones that matter for this dataset:
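To make the '/'-separated structure concrete, here is a small sketch that splits one label into its header, formula, and prefixed sublayers. The function name `split_inchi_layers` is hypothetical, the example string is the standard InChI of ethanol, and the sketch assumes the second part is always the chemical formula (which holds for the labels in this dataset).

```python
def split_inchi_layers(inchi):
    """Split an InChI string into its version, formula, and prefixed sublayers."""
    parts = inchi.split("/")
    layers = {"version": parts[0].replace("InChI=", "", 1)}
    if len(parts) > 1:
        # The formula sublayer has no prefix character.
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # Keep only the first occurrence of a prefix; some prefixes
            # (h, b, t, m, s) can appear again inside the isotopic layer.
            layers.setdefault(part[0], part[1:])
    return layers


example = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
print(split_inchi_layers(example))
```

Keeping only the first occurrence of each prefix mirrors the `break` in the parsing cell that follows.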

In [None]:
# Ref: https://www.kaggle.com/wineplanetary/understanding-inchi-format-and-arrange-train-label/
prefix_list = ["c", "h", "b", "t", "m", "s", "i"]
formula_list = []
prefix_val_lists = {prefix: [] for prefix in prefix_list}
prefix_val_lists.update({"%s_flg" % prefix: [] for prefix in prefix_list})

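for inchi in tqdm(train_df["InChI"]):
    text_list = inchi.split("/")
    formula_list.append(text_list[1])
    for prefix in prefix_list:
        # Skip the header and the formula; only prefixed sublayers are searched.
        for text in text_list[2:]:
            if text.startswith(prefix):
                prefix_val_lists[prefix].append(text)
                prefix_val_lists["%s_flg" % prefix].append(1)
                break
        else:
            # for/else: runs only when no sublayer matched, so every list
            # receives exactly one entry per label.
            prefix_val_lists[prefix].append("")
            prefix_val_lists["%s_flg" % prefix].append(0)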

### 3.1 Main Layer

📌 This layer is separated into three sublayers:

* Chemical formula: This sublayer is available in every InChI. It starts with no prefix character. 
* Atom connections (prefix: "c"). The atoms in the chemical formula (except for hydrogens) are numbered in sequence; this sublayer describes which atoms are connected by bonds to which other ones.
* Hydrogen atoms (prefix: "h"). Describes how many hydrogen atoms are connected to each of the other atoms.
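As an illustration of the chemical formula sublayer, the sketch below turns a formula string into per-element atom counts. `formula_to_counts` is a hypothetical helper, and it assumes a single-component formula: it does not handle multi-component formulas joined by "." or leading multipliers.

```python
import re
from collections import Counter


def formula_to_counts(formula):
    """Count atoms per element in a single-component formula such as 'C13H20OS'."""
    counts = Counter()
    # An element symbol is one upper-case letter plus an optional lower-case
    # letter; a missing number means a count of 1.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return dict(counts)


print(formula_to_counts("C13H20OS"))
print(formula_to_counts("C10H14BrN"))
```

Counts like these could later serve as simple auxiliary features, for example the number of heavy atoms that the `c` sublayer has to enumerate.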


In [None]:
INDEX = 10

image_name = train_df["image_id"].loc[INDEX]
image_path = train_df["path"].loc[INDEX]
img = cv2.imread(image_path)
display(Image.fromarray(img))

display(f'Shape of image is: {img.shape}')

inchi_label = train_df["InChI"].loc[INDEX]
display(f'The InChI label is: {inchi_label}')

inchi_split = inchi_label.split('/')
display(f"Chemical Formula: {inchi_split[1]}")
display(f"Atom Connections: {inchi_split[2]}")
display(f"Hydrogen Atoms: {inchi_split[3]}")

> Chemical Formula

In [None]:
count = 0
for a in train_df["InChI"].values:
    x = a.split('/')[1]
    if x[0].islower():  # The atoms in a formula start with an upper-case character.
        print(a)
    else:
        count += 1  # Count only labels whose second part looks like a formula.

if count == len(train_df):
    print('Every InChI label has a chemical formula.')

    train_df['chemical_formula'] = train_df['InChI'].progress_apply(
        lambda x: x.split('/')[1])
    
display(train_df.head(3))

> Atom Connections

In [None]:
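def has_connection_sublayer(inchi):
    # The connection sublayer, when present, directly follows the formula.
    parts = inchi.split('/')
    return len(parts) > 2 and parts[2].startswith('c')

if train_df['InChI'].apply(has_connection_sublayer).all():
    print('Every label has an atom connections sublayer (prefix c).')

    train_df['atom_connection'] = train_df['InChI'].progress_apply(
        lambda x: x.split('/')[2])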
    
display(train_df.head(3))

> Hydrogen Atoms

In [None]:
count = 0
no_hydrogen_path = []
no_hydrogen_inchi = []

for i, a in enumerate(train_df["InChI"].values):
    x = a.split('/')[2:]
    # The hydrogen sublayer, when present, directly follows the connection sublayer.
    if len(x) > 1 and x[1].startswith('h'):
        count += 1
    else:
        no_hydrogen_path.append(train_df["path"].loc[i])
        no_hydrogen_inchi.append(train_df["InChI"].loc[i])

print(f'There are {len(train_df)-count} labels with no hydrogen sublayer.')

print('Let us look at some of these images')

run = wandb.init(project='bms', job_type='image-visualization')
wandb.log({'No Hydrogen': [wandb.Image(img_path, caption=inchi) for img_path, inchi in zip(no_hydrogen_path, no_hydrogen_inchi)]})
run.finish()

run

### 3.2 Charge Layer

* charge sublayer (prefix: "q")
* proton sublayer (prefix: "p" for "protons")

📌 The InChI labels in this dataset contain no charge or proton sublayers. 
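One way to check for these sublayers is to look for a '/'-separated part that starts with the prefix in question. The sketch below uses a hypothetical helper, `has_sublayer`, on two illustrative strings (ethanol and acetate) rather than on the dataset itself.

```python
def has_sublayer(inchi, prefix):
    """Return True if any part after the formula starts with the given prefix."""
    return any(part.startswith(prefix) for part in inchi.split("/")[2:])


# Illustrative labels, not taken from train_labels.csv.
labels = [
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",          # ethanol (neutral)
    "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1",  # acetate (proton sublayer)
]

for label in labels:
    print(label, has_sublayer(label, "q"), has_sublayer(label, "p"))
```

Applied to the whole `InChI` column, for example with `train_df['InChI'].apply(lambda s: has_sublayer(s, 'q')).sum()`, the same helper would give the number of labels carrying a charge sublayer.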

### 3.3 Stereochemical Layer 

* double bonds and cumulenes (prefix: "b")
* tetrahedral stereochemistry of atoms and allenes (prefixes: "t", "m")
* type of stereochemistry information (prefix: "s")

### 3.4 Isotopic Layer 

It has prefixes: "i", "h", as well as "b", "t", "m", "s" for isotopic stereochemistry.
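Because the prefixes "h", "b", "t", "m" and "s" can appear again after "i", it can help to separate the isotopic layer from the rest of the label before counting prefixes. The following is a rough sketch with a hypothetical helper name, `split_isotopic`, and an illustrative string written in the isotopic-layer format.

```python
def split_isotopic(inchi):
    """Split the '/'-separated parts into (non-isotopic parts, isotopic parts)."""
    parts = inchi.split("/")
    # Start at index 2 to skip the header and the chemical formula.
    for idx in range(2, len(parts)):
        if parts[idx].startswith("i"):
            return parts[:idx], parts[idx:]
    return parts, []


# Illustrative label with an isotopic layer (one deuterium on methane).
main_parts, isotopic_parts = split_isotopic("InChI=1S/CH4/h1H4/i1D")
print(main_parts)
print(isotopic_parts)
```

Everything from the first "i" part onward is treated as isotopic here, which is an assumption that ignores the fixed-H and reconnected layers of non-standard InChIs.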

In [None]:
# Compare with != rather than "is not": identity checks on string literals are unreliable.
for prefix in ['b', 't', 'm', 's']:
    count = len([x for x in prefix_val_lists[prefix] if x != ''])
    print(f'The number of occurrences of prefix {prefix}: {count}')

# Final CSV File

I am going to use a modified `train_labels.csv` file for training purposes. At this point, the `csv` file contains:

* `image_id` - Name of image
* `InChI` - Label (String)
* `path` - Relative path to the image
* `chemical_formula` - Substring giving the chemical formula of the molecule
* `atom_connection` - Substring starting with prefix `c`.

In [None]:
train_df.to_csv('final_train_labels.csv', index=False)

if LOG_AS_ARTIFACT:
    # Log the modified csv file as W&B artifact
    run = wandb.init(project='bms', job_type='final-dataset')
    artifact_raw = run.use_artifact('ayush-thakur/bms/raw:v0', type='dataset')

    artifact = wandb.Artifact('final-csv', type='dataset')
    artifact.add_file('final_train_labels.csv')
    run.log_artifact(artifact)
    run.finish()  # finish() replaces the deprecated join()

To use the final csv file in your training pipeline you can use this code snippet to download the csv file:

```
import wandb
run = wandb.init()
artifact = run.use_artifact('ayush-thakur/bms/final-csv:v0', type='dataset')
artifact_dir = artifact.download()
```

![img](https://i.imgur.com/R14gNwT.png)

# References:

* https://pubs.acs.org/doi/pdf/10.1021/acs.jchemed.8b00090
* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3599061/
* https://en.wikipedia.org/wiki/International_Chemical_Identifier
* https://www.youtube.com/watch?v=rAnJ5toz26c

# WORK IN PROGRESS (WIP)

If you find the work useful, please consider upvoting the kernel. Share your own opinion and things to improve/add.