In [None]:
import os

import cv2
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
!ls ../input/bms-molecular-translation

# Understanding the dataset with molecular 3D models

## Contents

1. [Introduction](#1)
1. [Why check 3D models?](#2)
1. [About structural formulas](#3)
1. [Check 3D molecules on the PubChem website](#4)
1. [Check 3D models in a Kaggle notebook](#5)
1. [Points to note for data augmentation with flip](#6)

<a id="1"></a> <br>
# <div class="alert alert-block alert-info">Introduction</div>

## About this notebook

In this competition, we have to predict the InChI string from an image of a molecule's [structural formula](https://en.wikipedia.org/wiki/Structural_formula). You may have learned the structural formulae of simple molecules in school, but this competition deals with relatively complex ones. Because the data is given as images, it is tempting to apply standard data augmentation methods such as flips, but molecules are essentially three-dimensional structures, so there are pitfalls. To help us understand the characteristics of the data, this notebook uses 3D models of molecules to illustrate how to read the given images.

<a id="2"></a> <br>
# <div class="alert alert-block alert-warning">Why check 3D models?</div> 

- Even if we are not familiar with chemical notation, a 3D model makes a molecule easy to understand. 

Some atoms may be omitted in a skeletal structural formula, so the molecule you imagine may not match the actual one. For example, hydrogens bonded to carbon are often omitted, but you can see them in the 3D model (3D models usually show almost all atoms).

- We can develop intuition for molecular structures and apply it to ML. 

Intuition for molecular structures is important for this machine learning task. For example, if you do data augmentation by flipping images blindly, you may end up with different molecules (see [Points to note for data augmentation with flip](#6)). On the other hand, we can also come up with augmentation ideas that are unique to structural formulas. For example, we can change the angle from which we look at a molecule. Also, if you know how structural formulas are drawn, you may be able to generate images in various drawing styles from the InChI (this may be too difficult). In short, rotating the 3D model around is essentially a form of data augmentation.

<a id="3"></a> <br>
# <div class="alert alert-block alert-success">About structural formulas</div> 

Molecules are made up of atoms bonded together. A structural formula describes the bonds between atoms. 

There are many kinds of atoms. Knowing which elements (kinds of atoms) are present is necessary for understanding a structural formula. See the [periodic table](https://en.wikipedia.org/wiki/Periodic_table) for the types of atoms.
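The elements present in a molecule can also be read from the label itself, because the first layer of an InChI string after the version prefix is the chemical formula. The sketch below is a minimal, dependency-free illustration; the helper names `formula_layer` and `count_elements` are hypothetical, and the parser assumes a single-component formula without `.` separators or leading multipliers.

```python
import re
from collections import Counter


def formula_layer(inchi: str) -> str:
    """Return the chemical-formula layer: the text between the first two '/'."""
    return inchi.split("/")[1]


def count_elements(formula: str) -> Counter:
    """Count atoms per element in a simple formula such as 'C2H6O'.

    Assumption: one component only (no '.' separators, no leading multiplier).
    """
    counts = Counter()
    # An element symbol is one uppercase letter plus an optional lowercase
    # letter, followed by an optional count (a missing count means 1).
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


# Ethanol, used here only as a small illustrative example.
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
print(formula_layer(ethanol))
print(dict(count_elements(formula_layer(ethanol))))
```

Note that the formula layer counts every hydrogen, including those that a skeletal drawing leaves out.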

In [None]:
#From https://www.kaggle.com/ihelon/molecular-translation-exploratory-data-analysis
def convert_image_id_2_path(image_id: str) -> str:
    return "../input/bms-molecular-translation/train/{}/{}/{}/{}.png".format(
        image_id[0], image_id[1], image_id[2], image_id 
    )

def visualize_train_image(image_id, label):
    plt.figure(figsize=(10, 8))
    
    image = cv2.imread(convert_image_id_2_path(image_id))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    plt.imshow(image)
    plt.title(f"{label}", fontsize=14)
    plt.axis("off")
    
    plt.show()
    
train_labels = pd.read_csv("../input/bms-molecular-translation/train_labels.csv")
index = 1
molecule_image_id = train_labels["image_id"].iloc[index]
molecule_InChI = train_labels["InChI"].iloc[index]
print(molecule_image_id)
print(molecule_InChI)
    
visualize_train_image(molecule_image_id, molecule_InChI)

The images given as data are skeletal formulas. A skeletal formula represents a molecule with several kinds of lines and element symbols. As an example of a structural formula, here is a simple molecule with two carbons. The first row is the name of the molecule, the second row is the fully written-out structural formula, and the third row is the skeletal structural formula, as in the training data.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/Example%20for%20structural%20formulas.png" width="1000">


Experienced chemists often write skeletal structural formulas because they take less effort to draw. The number of lines drawn for a bond reflects its [bond order](https://en.wikipedia.org/wiki/Bond_order): one line for a single bond, two for a double bond, and three for a triple bond. In skeletal structural formulas, the letters for carbon and for hydrogen bonded to carbon are often omitted.

The "▼" (wedge) may be difficult to understand at first; it shows how a bond is oriented relative to the page. Compare it with the 3D model of a water molecule.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/In%20front%20of%20or%20behind%20page%3F.png" width="800">

When the molecule lies flat in the plane of the paper, each bond is drawn as a plain line. When a bond has depth, pointing out of or into the paper, we use a "▼" to indicate this. In this way, it is possible to represent the three-dimensional structure of even complex molecules using only pen and paper. This notation is also used when considering chemical reaction mechanisms, because the 3D structure is a very important factor there.

Here is a summary of the main notations introduced. If you know the correspondence between these notations and the 3D models (or actual molecules), you will be able to understand the data of this competition more clearly.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/Main%20notation%20of%20structural%20formula.png" width="800">
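One consequence of the skeletal notation described above is that the hydrogens on a carbon are implied by the bonds that are drawn. As a rough sketch, assuming a neutral carbon that always forms four bonds in total, the number of omitted hydrogens is four minus the sum of the drawn bond orders. The function name `implicit_hydrogens_on_carbon` is hypothetical.

```python
def implicit_hydrogens_on_carbon(bond_orders):
    """Hydrogens implied on a neutral carbon, given the orders of its drawn bonds.

    Assumption: the carbon is uncharged and forms four bonds in total.
    """
    used = sum(bond_orders)
    if used > 4:
        raise ValueError("a neutral carbon forms at most four bonds")
    return 4 - used


# Each carbon of these two-carbon molecules has exactly one drawn bond,
# to the other carbon: single, double, or triple.
examples = {
    "ethane (C-C)": [1],
    "ethene (C=C)": [2],
    "ethyne (C#C)": [3],
}
for name, bonds in examples.items():
    print(name, "->", implicit_hydrogens_on_carbon(bonds), "H per carbon")
```

This is why a bare line in a skeletal formula still stands for a complete molecule: the 3D model simply shows the hydrogens that the drawing leaves implicit.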

<a id="4"></a> <br>
# <div class="alert alert-block alert-info">Check 3D molecules on PubChem website</div> 


We can check chemical and structural data on [PubChem](https://pubchem.ncbi.nlm.nih.gov). First, we get the InChI string from the label. Let's take a look at the 3D model of the training example shown above (row `index = 1` of the training labels).

<u>Note</u> 

The PubChem images shown below were obtained on 3/3/2021 from the [official PubChem website](https://pubchem.ncbi.nlm.nih.gov).

## 0. Get InChI for molecule

Get the InChI of the molecule we want to check as follows.

In [None]:
train_labels = pd.read_csv("../input/bms-molecular-translation/train_labels.csv")
test_labels = pd.read_csv("../input/bms-molecular-translation/sample_submission.csv")
index = 1
molecule_image_id = train_labels["image_id"].iloc[index]
molecule_InChI = train_labels["InChI"].iloc[index]
print(molecule_InChI)

Let's go to the PubChem site and search for the molecule.

## 1. Enter the InChI in the search box and search

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/search_on_pubchem.JPG" width="500">

## 2. Select the molecule

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/search_result.JPG" width="500">

## 3. Check molecular models

It may be a little hard to tell from the image, but you can check the 3D model in addition to the structural formula. Please click on the link below to check the results.
[PubChem CID: 124916588](https://pubchem.ncbi.nlm.nih.gov/compound/124916588)

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/indivisual_molecule.JPG" width="500">

<a id="5"></a> <br>
# <div class="alert alert-block alert-info">Check 3D models on kaggle notebook</div> 

We can also check the conformation in a Kaggle notebook with py3Dmol.

In [None]:
!pip install py3Dmol  # used to render interactive 3D molecular models

In [None]:
import py3Dmol 

To query a molecule's information, we need its PubChem CID, which we can find as shown below. The CID acts as an ID that identifies a specific compound.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/get_pubchem_cid.JPG" width="500">
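Looking up the CID by hand works for a few molecules, but it could also be scripted. The sketch below is not part of the original workflow: it assumes PubChem's PUG REST service accepts an InChI via POST at `compound/inchi/cids/JSON` and answers with an `IdentifierList` object, and it needs internet access. The helper names are hypothetical, and the network call itself is defined but not executed here.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_cid_request(inchi: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks for the CIDs of an InChI."""
    url = f"{PUG_REST}/compound/inchi/cids/JSON"
    body = urllib.parse.urlencode({"inchi": inchi}).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def fetch_cids(inchi: str, timeout: float = 30.0) -> list:
    """Send the request and return the list of CIDs (empty if none is reported).

    Assumption: the JSON reply has the shape {"IdentifierList": {"CID": [...]}}.
    """
    with urllib.request.urlopen(build_cid_request(inchi), timeout=timeout) as resp:
        payload = json.load(resp)
    return payload.get("IdentifierList", {}).get("CID", [])


def to_py3dmol_query(cid: int) -> str:
    """Format a CID the way the py3Dmol `query` argument in this notebook expects."""
    return f"cid:{cid}"


print(to_py3dmol_query(124916588))
```

With internet enabled, `fetch_cids(molecule_InChI)` would be the call to try; molecules that PubChem does not know should simply yield an empty list under the assumption above.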

In [None]:
#Enter CID here!!!
cid_for_query = 'cid:124916588'

In [None]:
view = py3Dmol.view(width=680, height=300, query=cid_for_query, linked=False)
view.setStyle({'stick': {}})
view.setBackgroundColor('#f9f4fb')
view.show()

It is also possible to view different types of 3D models.

In [None]:
view = py3Dmol.view(width=500, height=1000, query=cid_for_query, viewergrid=(3,1), linked=False)
view.setStyle({'line': {'linewidth': 1}}, viewer=(0,0)) #line 3D model
view.setStyle({'stick': {}}, viewer=(1,0)) #stick 3D model
view.setStyle({'sphere': {}}, viewer=(2,0)) #sphere 3D model
view.setBackgroundColor('#ebf4fb', viewer=(0,0))
view.setBackgroundColor('#f9f4fb', viewer=(1,0))
view.setBackgroundColor('#e1e1e1', viewer=(2,0))
view.show()

<a id="6"></a> <br>
# <div class="alert alert-block alert-warning">Points to note for data augmentation with flip</div> 

## Some images in the dataset are rotated

According to the [Data page](https://www.kaggle.com/c/bms-molecular-translation/data):

> The images provided (both in the training data as well as the test data) may be rotated to different angles, be at various resolutions, and have different noise levels.

There are not many, but if we check the data carefully, we can find some rotated images.

In [None]:
#From https://www.kaggle.com/ihelon/molecular-translation-exploratory-data-analysis
def convert_image_id_2_path(image_id: str) -> str:
    return "../input/bms-molecular-translation/train/{}/{}/{}/{}.png".format(
        image_id[0], image_id[1], image_id[2], image_id 
    )

def visualize_train_image(image_id, label):
    plt.figure(figsize=(10, 8))
    
    image = cv2.imread(convert_image_id_2_path(image_id))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    plt.imshow(image)
    plt.title(f"{label}", fontsize=14)
    plt.axis("off")
    
    plt.show()
    
def visualize_train_image_with_hflip(image_id, label):
    plt.figure(figsize=(10, 8))
    
    image = cv2.imread(convert_image_id_2_path(image_id))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.flip(image, 1)  # flipCode=1 mirrors the image left-right (horizontal flip); 0 would flip it vertically

    plt.imshow(image)
    plt.title(f"InChI=????", fontsize=14)
    plt.axis("off")
    
    plt.show()
    
def convert_image_id_2_path_test(image_id: str) -> str:
    return "../input/bms-molecular-translation/test/{}/{}/{}/{}.png".format(
        image_id[0], image_id[1], image_id[2], image_id 
    )

def visualize_train_image_test(image_id, label):
    plt.figure(figsize=(10, 8))
    
    image = cv2.imread(convert_image_id_2_path_test(image_id))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    plt.imshow(image)
    plt.title(f"{label}", fontsize=14)
    plt.axis("off")
    
    plt.show()

### train data

In [None]:
index = 1516695
molecule_image_id = train_labels["image_id"].iloc[index]
molecule_InChI = train_labels["InChI"].iloc[index]
print(molecule_InChI)
    
visualize_train_image(molecule_image_id, molecule_InChI)
#rotated diagram

### test data

In [None]:
index = 1009387
molecule_image_id_test = test_labels["image_id"].iloc[index]
molecule_InChI_test = test_labels["InChI"].iloc[index]
print(molecule_InChI_test)
    
visualize_train_image_test(molecule_image_id_test, molecule_InChI_test)
#This diagram is rotated

To solve this problem, we can think of two strategies to start with.

1. Add more rotated data to the training data.

1. Return all test data to the correct (being able to read characters) orientation.
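Strategy 1 can be sketched for right-angle rotations without any image library. The example below treats an image as a nested list of pixel values and produces its four orientations, all of which would keep the same InChI label. The function names are hypothetical, and arbitrary angles would need an interpolating rotation instead.

```python
def rotate90_clockwise(grid):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    # Reversing the row order and then transposing is a clockwise quarter turn.
    return [list(row) for row in zip(*grid[::-1])]


def all_rotations(grid):
    """Return the grid rotated by 0, 90, 180 and 270 degrees clockwise."""
    variants = [grid]
    for _ in range(3):
        variants.append(rotate90_clockwise(variants[-1]))
    return variants


# A tiny stand-in for an image; every orientation shares one label.
toy_image = [[1, 2],
             [3, 4]]
for angle, variant in zip((0, 90, 180, 270), all_rotations(toy_image)):
    print(angle, variant)
```

Strategy 2 is the inverse operation: once the rotation of a test image has been estimated, applying the remaining quarter turns brings it back to the upright orientation.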

Strategy 1 is the common approach, but I expect that some augmentation methods may harm the results. I will explain why.

## Can a flip have a bad effect?

We often use vertical and horizontal flips alongside rotation, as in the following.

In [None]:
index = 945
molecule_image_id = train_labels["image_id"].iloc[index]
molecule_InChI = train_labels["InChI"].iloc[index]
print(molecule_InChI)
    
visualize_train_image(molecule_image_id, molecule_InChI)

# presenting a sample image

<div class="alert alert-block alert-warning">↓↓↓horizontal flip↓↓↓</div>

In [None]:
visualize_train_image_with_hflip(molecule_image_id, molecule_InChI)
# the same image after a horizontal flip

The mirrored atomic symbols may themselves be a problem, but there is another pitfall unique to this task: the original and the flipped drawings actually represent two different molecules. The original is "Methyl (2S)-2-amino-3-(4-fluorophenyl)sulfanylpropanoate" and the flipped one is "Methyl (2R)-2-amino-3-(4-fluorophenyl)sulfanylpropanoate".
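The difference between a rotation and a flip can be checked numerically on a toy drawing. In the sketch below, three 2D points stand in for a fragment of a drawing, and the sign of the triangle's signed area serves as a simple handedness indicator. This is an illustration only, not a stereochemistry calculation, and all names are hypothetical.

```python
def signed_area(p, q, r):
    """Signed area of triangle p-q-r: positive if counterclockwise, negative if clockwise."""
    return 0.5 * ((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]))


def rotate90(point):
    """Rotate a point by 90 degrees counterclockwise around the origin."""
    x, y = point
    return (-y, x)


def hflip(point):
    """Mirror a point left-right across the vertical axis."""
    x, y = point
    return (-x, y)


triangle = [(0, 0), (2, 0), (0, 1)]
area_original = signed_area(*triangle)
area_rotated = signed_area(*[rotate90(p) for p in triangle])
area_flipped = signed_area(*[hflip(p) for p in triangle])

print("original:", area_original)
print("rotated :", area_rotated)   # same sign as the original
print("flipped :", area_flipped)   # opposite sign
```

A rotation keeps the sign and a flip reverses it, which mirrors what happens to a drawing containing wedge bonds: rotating it shows the same molecule, while mirroring it shows the other hand.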

## What is an enantiomer?

[Enantiomers](https://en.wikipedia.org/wiki/Enantiomer) are stereoisomers that are mirror images of each other but cannot be superimposed. They are treated as different compounds. Not only do they differ geometrically, but they may also show different properties, for example in how they interact with other chiral molecules such as those in living organisms.

To simplify, consider the following tetrahedron, whose vertices are labeled A to D. Now consider a second tetrahedron that is its mirror image. Side by side, they look like the figure below.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/enantiomer1.png" width="700">

So are these two the same thing? No, they are not. No matter how we rotate them, we cannot make all the vertex labels coincide.
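This claim can be verified by brute force. The rotations of a regular tetrahedron correspond to the 12 even permutations of its four vertices, while a mirror reflection swaps two vertices, which is an odd permutation. The sketch below enumerates every rotated labeling and checks that the mirrored labeling is not among them; the variable names are hypothetical.

```python
from itertools import permutations


def is_even(perm):
    """True if the permutation has an even number of inversions."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2 == 0


def apply(perm, arrangement):
    """Place the label from position perm[i] at position i."""
    return tuple(arrangement[i] for i in perm)


original = ("A", "B", "C", "D")
# A reflection through the plane containing C, D and the midpoint of edge AB
# swaps A and B and leaves C and D in place.
mirror = ("B", "A", "C", "D")

# Rotations of a regular tetrahedron = even permutations of its 4 vertices.
rotations = [p for p in permutations(range(4)) if is_even(p)]
reachable = {apply(p, original) for p in rotations}

print("number of rotations:", len(rotations))
print("mirror image reachable by rotation:", mirror in reachable)
```

The argument relies on all four labels being different. If two labels were equal, swapping them would change nothing and the mirror image would coincide with the original.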

The atoms bonded to a carbon that has only single bonds sit at the corners of a tetrahedron around it. When the four bonded groups are all different, the situation we saw above occurs. In this molecule, the carbon marked with * is the [stereocenter](https://en.wikipedia.org/wiki/Stereocenter), which is the center of this tetrahedron.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/enantiomer3.png" width="300">

Now let's compare "Methyl (2S)-2-amino-3-(4-fluorophenyl)sulfanylpropanoate" and "Methyl (2R)-2-amino-3-(4-fluorophenyl)sulfanylpropanoate". We can see that they are mirror images of each other, although they look different from the previous figure because of how the drawing was prepared. Since they are different molecules, the sublayers starting with /m in their InChI strings differ.

<img src="https://raw.githubusercontent.com/tasotasoso/kaggle_media/main/How_to_check_conformation_of_Structural_formula/enantiomer2.png" width="900">
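The comparison of sublayers can be automated with plain string handling. The sketch below splits an InChI into its layers and reports which ones differ. The two strings are the L- and D-alanine InChIs as I recall them, used purely as a small illustrative enantiomer pair and not taken from the competition data. The helper names are hypothetical, and the parser assumes each layer prefix occurs at most once.

```python
def inchi_layers(inchi: str) -> dict:
    """Split an InChI into {'version', 'formula', <prefix letter>: <layer body>}.

    Assumption: every prefixed layer (c, h, t, m, s, ...) appears at most once.
    """
    parts = inchi.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a: str, inchi_b: str) -> list:
    """Return the sorted names of the layers whose contents differ."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


# Illustrative enantiomer pair (alanine); only the /m layer is expected to differ.
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(differing_layers(l_alanine, d_alanine))
```

The same helper could be applied to a training label and to the InChI of its suspected mirror image, to see how small the textual difference a model has to get right actually is.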

I have also prepared 3D models, so please try turning them around and see whether the arrangements around the stereocenter can be made to overlap.

### Methyl (2S)-2-amino-3-(4-fluorophenyl)sulfanylpropanoate

In [None]:
# Methyl (2S)-2-amino-3-(4-fluorophenyl)sulfanylpropanoate
view = py3Dmol.view(width=680, height=300, query="cid:94054932", linked=False)
view.setStyle({'stick': {}})
view.setBackgroundColor('#f9f4fb')
view.show()

### Methyl (2R)-2-amino-3-(4-fluorophenyl)sulfanylpropanoate

In [None]:
#Methyl (2R)-2-amino-3-(4-fluorophenyl)sulfanylpropanoate
view = py3Dmol.view(width=680, height=300, query="cid:94054933", linked=False)
view.setStyle({'stick': {}})
view.setBackgroundColor('#f9f4fb')
view.show()

----------------------

Takeaway: a mirrored structure does not necessarily have the same InChI as the original, as the examples above show. Rotating a drawing within the plane keeps the molecule the same, but a flip can turn it into its enantiomer, which differs in only a small part of the InChI string.

I was also curious about the length of the InChI strings. Let's look at the molecules with the shortest and longest InChI. It is hard to say for certain, but the length is probably related to the size and complexity of the molecule.
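The idea of picking the shortest and longest strings can be shown on a toy list before running it on the full label table. The three InChI strings below (water, ethanol, alanine) are illustrative only and are not drawn from the competition data; the variable names are hypothetical.

```python
# Toy stand-in for the InChI column of the training labels.
inchis = [
    "InChI=1S/H2O/h1H2",                                          # water
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",                          # ethanol
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1",  # alanine
]

lengths = [len(s) for s in inchis]

# Position of the minimum and maximum length; ties resolve to the first hit.
shortest_idx = min(range(len(inchis)), key=lengths.__getitem__)
longest_idx = max(range(len(inchis)), key=lengths.__getitem__)

print("shortest:", inchis[shortest_idx])
print("longest :", inchis[longest_idx])
```

In this toy list the string length does follow the size of the molecule, but stereo layers and other annotations also add characters, so length is only a rough proxy.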

### Shortest one

In [None]:
shortest_mol_idx = np.argmin(train_labels["InChI"].apply(lambda x: len(x)))  # np.argmin returns the position of the smallest value, here the shortest InChI string
train_labels["InChI"][shortest_mol_idx]

In [None]:
# 3D model of the molecule with the shortest InChI (PubChem CID 13153)
view = py3Dmol.view(width=680, height=300, query="cid:13153", linked=False)
view.setStyle({'stick': {}})
view.setBackgroundColor('#f9f4fb')
view.show()
# compare the size of this molecule with the length of its InChI string

### Longest one

In [None]:
longest_mol_idx = np.argmax(train_labels["InChI"].apply(lambda x: len(x)))
train_labels["InChI"][longest_mol_idx]

Unfortunately, there doesn't seem to be a 3D model for it.

In [None]:
# Attempt to load the molecule with the longest InChI (PubChem CID 138197614); no 3D model seems to be available
view = py3Dmol.view(width=680, height=300, query="cid:138197614", linked=False)
view.setStyle({'stick': {}})
view.setBackgroundColor('#f9f4fb')
view.show()

We can draw a 2D depiction with RDKit instead.

In [None]:
!conda install -c rdkit rdkit -y

In [None]:
from rdkit import Chem  # a bare `import rdkit` does not load the Chem submodules
from rdkit.Chem.Draw import rdMolDraw2D

# https://www.kaggle.com/brodzik/drawing-molecules-with-rdkit-inchi-to-png

# Because a 3D model for the longest-InChI molecule is not available, we use RDKit to draw its 2D structure instead

mol = Chem.MolFromInchi(train_labels["InChI"][longest_mol_idx])
d = rdMolDraw2D.MolDraw2DCairo(512, 512)
d.drawOptions().useBWAtomPalette()  # black-and-white atoms, like the competition images
d.DrawMolecule(mol)
d.FinishDrawing()  # finalize the drawing before writing it to disk
d.WriteDrawingText("0.png")
img = cv2.imread("0.png", cv2.IMREAD_GRAYSCALE)
plt.figure(figsize=(20, 20))
plt.imshow(img, "gray")
plt.show()

The distribution of the length of InChI is as follows.

In [None]:
sns.histplot(train_labels["InChI"].str.len(), kde=True)  # histplot replaces the deprecated distplot