# Import Packages

In [None]:
import pandas as pd
# Show full InChI strings without truncation (None replaces the deprecated -1)
pd.set_option('display.max_colwidth', None)

# InChI

The IUPAC International Chemical Identifier (InChI) is a textual identifier for chemical substances. It is designed to provide a standard way to encode molecular information and to make that information easier to search for in databases and on the web.

# Format and layers


Every InChI starts with the string "InChI=" followed by the version number, currently 1. For a standard InChI, the version number is followed by the letter S. A standard InChI is a fully standardized InChI flavor that maintains the same level of attention to structure details and the same conventions for drawing perception. The remaining information is structured as a sequence of layers and sub-layers, with each layer providing one specific type of information. The layers and sub-layers are separated by the delimiter "/", and each starts with a characteristic prefix letter (except for the chemical formula sub-layer of the main layer). The six layers and their important sub-layers are:

## Main layer
- Chemical formula (no prefix). This is the only sublayer that must occur in every InChI.
- Atom connections (prefix: "c"). The atoms in the chemical formula (except for hydrogens) are numbered in sequence; this sublayer describes which atoms are connected by bonds to which other ones.
- Hydrogen atoms (prefix: "h"). Describes how many hydrogen atoms are connected to each of the other atoms.

## Charge layer
- Charge sublayer (prefix: "q")
- Proton sublayer (prefix: "p" for "protons")

## Stereochemical layer
- Double bonds and cumulenes (prefix: "b")
- Tetrahedral stereochemistry of atoms and allenes (prefixes: "t", "m")
- Type of stereochemistry information (prefix: "s")

## Isotopic layer 
(prefixes: "i", "h", as well as "b", "t", "m", "s" for isotopic stereochemistry)

## Fixed-H layer
(prefix: "f"); contains some or all of the above types of layers except atom connections; may end with "o" sublayer; never included in standard InChI

## Reconnected layer 
(prefix: "r"); contains the whole InChI of a structure with reconnected metal atoms; never included in standard InChI
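
The layer structure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a validating parser: `parse_inchi` is a hypothetical helper name, it assumes the formula is the only segment that does not start with a lowercase prefix letter, and it keeps layers as an ordered list because prefixes such as "h" or "t" can appear more than once (for example in the isotopic layer). The example string is the standard InChI for ethanol.

```python
def parse_inchi(inchi):
    """Split an InChI string into version, formula, and (prefix, body) layers.

    Hypothetical helper for illustration only; it does not validate chemistry.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: missing 'InChI=' prefix")

    parts = inchi[len("InChI="):].split("/")
    version = parts[0]  # e.g. "1S" for a standard InChI, version 1
    result = {
        "version": version,
        "standard": version.endswith("S"),
        "formula": None,
        "layers": [],  # (prefix, body) pairs in order of appearance
    }

    rest = parts[1:]
    # Assumption: the chemical formula is the only segment without a
    # lowercase prefix letter, and it comes right after the version.
    if rest and rest[0] and not rest[0][0].islower():
        result["formula"] = rest[0]
        rest = rest[1:]

    for segment in rest:
        if segment:
            result["layers"].append((segment[0], segment[1:]))
    return result


ethanol = parse_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["formula"])  # C2H6O
print(ethanol["layers"])   # [('c', '1-2-3'), ('h', '3H,2H2,1H3')]
```

Here the "c" entry holds the atom connections and the "h" entry the hydrogen counts, matching the main-layer sublayers listed above.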

# Training Data

`train_labels.csv` gives the ground-truth InChI labels for the training images.

In [None]:
import os
os.listdir('/kaggle/input/bms-molecular-translation/')

In [None]:
train_df = pd.read_csv('/kaggle/input/bms-molecular-translation/train_labels.csv')
print(train_df.shape)
train_df

In [None]:
# Split each InChI string into its "/"-separated segments
train_df['split'] = train_df['InChI'].str.split('/')
train_df.head()

In [None]:
# Number of segments (version, formula, and layers) per InChI
train_df['split_len'] = train_df['split'].apply(len)
train_df.head()

In [None]:
train_df['split_len'].unique()

In [None]:
# Expand the segments into columns; n must be passed by keyword in recent pandas
train_InChI_df = train_df['InChI'].str.split('/', n=11, expand=True)
train_InChI_df

In [None]:
train_InChI_df[0].unique()

In [None]:
train_InChI_df[1].unique()

In [None]:
train_InChI_df[2].unique()

In [None]:
train_InChI_df[3].unique()

In [None]:
train_InChI_df[4].unique()

In [None]:
train_InChI_df[5].unique()

In [None]:
train_InChI_df[6].unique()

In [None]:
train_InChI_df[7].unique()

In [None]:
train_InChI_df[8].unique()

In [None]:
train_InChI_df[9].unique()

In [None]:
train_InChI_df[10].unique()
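
The column-by-column inspection above can be condensed by counting which prefix letter occurs at each segment position. The following is a rough sketch in plain Python, assuming a small hand-written list of well-known InChI strings (ethanol, methane, L-alanine) in place of `train_df['InChI']`; `prefix_counts_by_position` is a hypothetical helper, and positions match the column indices of `train_InChI_df` (0 is the version, 1 is the formula).

```python
from collections import Counter


def prefix_counts_by_position(inchis):
    """Count the leading prefix letter of each segment, keyed by position.

    Hypothetical helper; positions 0 (version) and 1 (formula) are skipped.
    """
    counts = {}
    for inchi in inchis:
        for pos, segment in enumerate(inchi.split("/")):
            if pos < 2:
                continue  # skip "InChI=1S" and the chemical formula
            counts.setdefault(pos, Counter())[segment[:1]] += 1
    return counts


# Illustrative sample only; these strings are not taken from the training set.
sample = [
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "InChI=1S/CH4/h1H4",
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1",
]

counts = prefix_counts_by_position(sample)
for pos in sorted(counts):
    print(pos, dict(counts[pos]))
```

On real data, `train_df['InChI'].tolist()` could be passed in place of `sample`. A position where several different prefixes appear shows that a column index does not correspond to one fixed layer, since layers such as "c" are optional.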