# Introduction

* An InChI identifier describes a chemical substance as a sequence of information layers.
* One approach to constructing an InChI description is therefore to determine the layers one by one.
* To do that, the training InChI strings first need to be split into their sublayers.
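
As a minimal sketch of that splitting step, the snippet below separates one InChI string into its version, formula, and remaining layers. The helper name `split_layers` is hypothetical, and the input is the standard InChI of ethanol, used only as an illustration rather than taken from the training data.

```python
# Standard InChI of ethanol, used here purely as an illustrative input
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

def split_layers(inchi):
    """Return (version, formula, other_layers) for one InChI string."""
    parts = inchi.split("/")
    version = parts[0]        # e.g. "InChI=1S"
    formula = parts[1]        # chemical formula layer (has no prefix letter)
    other_layers = parts[2:]  # prefixed layers such as "c..." and "h..."
    return version, formula, other_layers

version, formula, other_layers = split_layers(inchi)
print(version, formula, other_layers)
```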

## Aim of this notebook
* In this notebook, I describe some basic facts about the InChI strings in the training dataset and arrange the data.
* The arranged dataset is available [here](https://www.kaggle.com/wineplanetary/bms-arranged-label).

## Reference and Acknowledgements
* https://en.wikipedia.org/wiki/International_Chemical_Identifier

## Version

version 6 : add total number of atoms

version 5 : atom order in appendix

version 4 : detect which atoms, and how many of each, are contained in each chemical formula and list them in the csv

version 1-3 : initial

## Import libraries and load train labels

In [None]:
import os
import re
import itertools
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

In [None]:
train_label_path = "../input/bms-molecular-translation/train_labels.csv"

In [None]:
df = pd.read_csv(train_label_path)

# Understanding InChI format

## Version number

* Every InChI starts with the string "InChI=" followed by the version number.
* All InChI strings in the training dataset start with "InChI=1S".
* The "S" denotes a standard InChI.

In [None]:
version_list = [inchi.split("/")[0] for inchi in tqdm(df["InChI"])]
print("A version number is always %s" % set(version_list))

## Chemical formula

* The next layer is the chemical formula, the only sublayer that must occur in every InChI.
* Every chemical substance in the training dataset consists of atoms from ['B', 'Br', 'C', 'Cl', 'F', 'H', 'I', 'N', 'O', 'P', 'S', 'Si'].
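
A compact way to list the element symbols in a formula is a single regular expression. This is only a sketch: the helper name `element_symbols` and the example formula are hypothetical, and it assumes every symbol is one uppercase letter optionally followed by one lowercase letter, which holds for the twelve elements listed above.

```python
import re

def element_symbols(formula):
    """Return the element symbols in a formula, in order of appearance."""
    # One uppercase letter, optionally followed by one lowercase letter
    return re.findall(r"[A-Z][a-z]?", formula)

# Hypothetical formula chosen to mix one- and two-letter symbols
print(element_symbols("C13H20BrNOSi"))
```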

In [None]:
chemical_formula_list = [inchi.split("/")[1] for inchi in tqdm(df["InChI"])]
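# Split each formula on its digit runs, leaving runs of element symbols such as "C", "H", "NOS"
atom_list_org = [re.split(r"\d+", chemical_formula) for chemical_formula in tqdm(chemical_formula_list)]
bounded_atom_list = set(itertools.chain.from_iterable(atom_list_org))
atom_list = []
for bounded_atom in bounded_atom_list:
    before_char = ""
    for char in bounded_atom:
        if char.isupper():
            # A new symbol starts; a pending uppercase letter was a one-letter symbol
            if before_char.isupper():
                atom_list.append(before_char)
            before_char = char
        elif char.islower():
            atom_list.append(before_char+char)
            before_char = ""
    # Keep a one-letter symbol that ends the run (it was dropped before)
    if before_char:
        atom_list.append(before_char)
print("Atoms found in the training chemical substances: %s" % set(atom_list))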

## Other layers

* The other layers also carry essential information about the chemical substance.
* In InChI, each of these layers starts with a specific prefix letter. For example, the atom connection sublayer starts with the prefix "c".
* Every InChI layer and sublayer in the training dataset starts with one of the prefixes ['b', 'c', 'h', 'i', 'm', 's', 't'].
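
To make the prefix convention concrete, the sketch below maps each prefix letter to the first layer that starts with it. The helper name `layers_by_prefix` is hypothetical and the InChI string is an illustrative input, not a row from the training data.

```python
def layers_by_prefix(inchi):
    """Map each layer prefix to the first layer that starts with it."""
    layers = {}
    for layer in inchi.split("/")[2:]:  # skip the version and the formula
        if layer:
            # setdefault keeps the first layer seen for a given prefix
            layers.setdefault(layer[0], layer)
    return layers

# Illustrative InChI string with connection, hydrogen, and stereo layers
example = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
print(layers_by_prefix(example))
```

Keeping only the first occurrence mirrors the `break` in the arranging code later in this notebook.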

In [None]:
prefix_list = [layer[0] for inchi in tqdm(df["InChI"]) for layer in inchi.split("/")[2:]]
print("Prefixes used in the training InChI strings: %s" % set(prefix_list))

# Arrange Dataset for Training

* Next, I split each training InChI into its sublayers.
* I detect which atoms, and how many of each, are contained in each chemical formula.
* I convert each image_id into the image's file path in the Kaggle notebook environment.

In [None]:
# id to path
def id2path(image_id):
    return "../input/bms-molecular-translation/train/%s/%s/%s/%s.png" % (image_id[0], image_id[1], image_id[2], image_id)

df["image_path"] = df["image_id"].apply(id2path)

In [None]:
# separate into elements
all_df = df.copy()
prefix_list = ["c", "h", "b", "t", "m", "s", "i"]
formula_list = []
prefix_val_lists = {prefix: [] for prefix in prefix_list}
prefix_val_lists.update({"%s_flg" % prefix: [] for prefix in prefix_list})

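for inchi in tqdm(df["InChI"]):
    text_list = inchi.split("/")
    formula_list.append(text_list[1])
    for prefix in prefix_list:
        # Search only the prefixed layers, skipping the version and the formula
        for text in text_list[2:]:
            if text.startswith(prefix):
                # Keep the first layer with this prefix (the prefix letter is retained)
                prefix_val_lists[prefix].append(text)
                prefix_val_lists["%s_flg" % prefix].append(1)
                break
        else:
            # No layer with this prefix: store an empty string and clear the flag
            prefix_val_lists[prefix].append("")
            prefix_val_lists["%s_flg" % prefix].append(0)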

In [None]:
# reduce formula into atoms
atom_list = ["C", "H", "B", "Br", "Cl", "F", "I", "N", "O", "P", "S", "Si", "total"]
def split_atom(formula):
    """Count the atoms of each element in a formula such as "C13H20OS"."""
    atom_dict = {atom: 0 for atom in atom_list}
    now_atom = ""
    now_num = ""
    total_atom = 0
    # The trailing uppercase "E" is a sentinel that flushes the last atom
    for char in formula+"E":
        if char.isupper():
            if now_atom != "":
                # A new symbol starts: store the previous atom (no digits means a count of 1)
                count = int(now_num) if now_num != "" else 1
                atom_dict[now_atom] = count
                total_atom += count
                now_num = ""
            now_atom = char
        elif char.islower():
            now_atom += char
        else:
            if now_atom != "":
                now_num += char
    atom_dict["total"] = total_atom
    return atom_dict

atom_num_list = [split_atom(inchi.split("/")[1]) for inchi in tqdm(df["InChI"])]
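
For comparison, the same counts can be obtained with a regular expression and `collections.Counter`. This is a sketch of an alternative, not the method used to build the published dataset: the helper name `count_atoms` is hypothetical, and unlike `split_atom` it does not add zero entries for elements that are absent from the formula.

```python
import re
from collections import Counter

def count_atoms(formula):
    """Count atoms per element in a formula, plus a "total" entry."""
    counts = Counter()
    # Each match is (symbol, digits); empty digits mean a count of 1
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    counts["total"] = sum(counts.values())
    return dict(counts)

print(count_atoms("C2H6O"))
```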

In [None]:
split_df = df.copy()
split_df["formula"] = formula_list
for prefix in prefix_list:
    split_df[prefix] = prefix_val_lists[prefix]
    split_df["%s_flg" % prefix] = prefix_val_lists["%s_flg" % prefix]
arranged_df = pd.concat([split_df, pd.DataFrame(atom_num_list)], axis=1)

In [None]:
arranged_df.head()

In [None]:
arranged_df.to_csv("arranged_bms_train_labels.csv")

## Appendix 1: Atom order

Does the formula part follow a strict atom ordering?

* The chemical formula of an organic compound follows a strict ordering.
* "C" is placed first, "H" next, and the remaining atoms follow alphabetically.
* The code below checks that the chemical formulas in the training data strictly follow this expected order.
* Note that if a chemical formula does not contain "C", all of its atoms, including "H", are ordered alphabetically.
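
The ordering rule described above, commonly known as the Hill system, can be sketched as a small function that also covers the carbon-free case. The helper name `hill_order` is hypothetical, and the sketch relies on Python's default string sorting to give the alphabetical order (for example "B" before "Br", and "S" before "Si").

```python
def hill_order(symbols):
    """Return the element symbols sorted by the Hill convention."""
    symbols = set(symbols)
    if "C" in symbols:
        # Carbon first, then hydrogen if present, then the rest alphabetically
        head = ["C"] + (["H"] if "H" in symbols else [])
        return head + sorted(symbols - {"C", "H"})
    # Without carbon, every element (including hydrogen) is alphabetical
    return sorted(symbols)

print(hill_order(["O", "H", "C", "Br"]))
print(hill_order(["H", "Br"]))
```

Note that the fixed `expected_atom_order` list in the check below always places "H" before the other non-carbon atoms, so a carbon-free formula containing hydrogen would be reported as unexpected by that check.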

In [None]:
expected_atom_order = ["C", "H", "B", "Br", "Cl", "F", "I", "N", "O", "P", "S", "Si"]
atom_only_formula_list = [re.sub(r"\d+", "", formula) for formula in tqdm(chemical_formula_list)]
for atom_only_formula, atom_dict in zip(tqdm(atom_only_formula_list), atom_num_list):
    expected_formula = ""
    for atom in expected_atom_order:
        if atom_dict[atom] != 0:
            expected_formula += atom
    if expected_formula != atom_only_formula:
        print("Unexpected Order !: expected=%s, true=%s" % (expected_formula, atom_only_formula))