<img src="https://cdn.pixabay.com/photo/2014/07/31/21/42/industry-406905_960_720.jpg" width="900px">

## Table of Contents
* [Introduction](#introduction)
* [Details of the Problem](#details-of-the-problem)
* [Visualise Random Images with Boundary Box](#visualise-random-images)
* [Preparing Dataset for Training](#preparing-dataset-training)
* [Create Model - ResNet50(Faster R-CNN)](#create-model-resnet50)
* [Preparing Model for Training](#preparing-model-for-training)
* [Training](#let's-train-it)
* [Calculation on unlabeled data](#calculation-on-unlabeled-data)

## Introduction<a class="anchor" id="introduction"></a>

Recognizing molecules and composing the full InChI string is a very complex task.
InChI has several layers of information, and overconfidence is not a winning mindset here.
So I decided to tackle only the first layer of InChI, the molecular formula.
The task is to recognize atoms, count them, and return the molecular formula,
so the problem is mostly about object detection.

It seems very easy, but there are a lot of options: which objects are the most useful, and which ones should be recognized?
There are many candidates. I thought about it for a long time and finally settled on something very simple that I am able to accomplish,
because added complexity makes the code much larger.
My model recognizes atoms in the depiction of a molecule.
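
To make the layers of InChI concrete, here is a minimal sketch that splits an InChI string into its parts. The example string is the standard InChI of ethanol; the helper name `split_inchi_layers` is hypothetical and is not used elsewhere in this notebook.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into a dict of layers (illustrative helper)."""
    prefix, _, rest = inchi.partition('/')
    parts = rest.split('/')
    # The first part after the version prefix is the molecular formula
    layers = {'version': prefix.replace('InChI=', ''), 'formula': parts[0]}
    for part in parts[1:]:
        # Later layers start with a one-letter prefix,
        # e.g. 'c' for connectivity and 'h' for hydrogen positions
        layers[part[0]] = part[1:]
    return layers

ethanol = 'InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3'
layers = split_inchi_layers(ethanol)
print(layers['formula'])  # C2H6O
```

Only the `formula` part is the target of this notebook; the connectivity and hydrogen layers are left out of scope.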

### Details of the Problem<a class="anchor" id="details-of-the-problem"></a>


Every molecule in this dataset is built from atoms of the following twelve elements:

'C', 'H', 'O', 'S', 'N', 'Br', 'F', 'Cl', 'P', 'Si', 'B', 'I'
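
As a rough sketch of the calculation step, detected atom labels can be counted and written out in Hill order (carbon first, hydrogen second, the remaining elements alphabetically). The function name `formula_from_labels` and the sample labels are made up for illustration.

```python
from collections import Counter

# Hill order for the twelve elements: C, H, then the rest alphabetically
HILL_ORDER = ['C', 'H', 'B', 'Br', 'Cl', 'F', 'I', 'N', 'O', 'P', 'S', 'Si']

def formula_from_labels(labels):
    """Build a molecular formula from a list of detected element symbols."""
    counts = Counter(labels)
    formula = ''
    for element in HILL_ORDER:
        n = counts.get(element, 0)
        if n == 1:
            formula += element           # a count of 1 is not written out
        elif n > 1:
            formula += f'{element}{n}'
    return formula

print(formula_from_labels(['C', 'O', 'C', 'Cl', 'N', 'C']))  # C3ClNO
```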

In [None]:
import os
import cv2
import time
import pandas as pd
import numpy as np
import re
import csv

from PIL import Image
import torch
import torchvision
import torchvision.transforms as T
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torch.utils.data import DataLoader, Dataset

import seaborn as sns
import matplotlib.pyplot as plt
import pathlib

In [None]:
EPOCHS = 1 #100
THRESHOLD = 0.5

CSV_OUT = 'submission.csv'
DIR_INPUT = "/kaggle/input/extendedlabels/"
DIR_IMAGES = DIR_INPUT + "images/train/"

DIR_TEST_IMAGES = "/kaggle/input/bms-molecular-translation/train/0/0/0/"
# DIR_TEST_IMAGES = "/kaggle/input/bms-molecular-translation/test/0/0/"


### Loading Dataset
df = pd.read_csv(DIR_INPUT + "images/train_labels.csv")
display(df.head())
df.describe()

In [None]:
### Null Values, Unique Values

unq_values = df["filename"].unique()
print("Total Records: ", len(df))
print("Unique Images: ", len(unq_values))

null_values = df.isnull().sum(axis=0)
print("\n> Null Values in each column <")
print(null_values)

### Total Classes

classes = df["class"].unique()
print("Total Classes: ", len(classes))
print("\n> Classes <\n", classes)

### Visualizing Class Distribution

plt.figure(figsize=(14,8))
plt.title('Class Distribution', fontsize=20)
sns.countplot(x="class", data=df)

## Visualise Random Images with Boundary Box<a class="anchor" id="visualise-random-images"></a>

For labeling I used two easy-to-use tools:

[labelImg](https://github.com/tzutalin/labelImg)

[xml_to_csv](https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10/blob/master/xml_to_csv.py)
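
For orientation, here is a small sketch of what the conversion step does, assuming the annotations were saved by labelImg in Pascal VOC XML format. The sample annotation (file name and coordinates) is invented, and `voc_to_rows` is a hypothetical helper, not the linked script. The output keys match the columns this notebook reads from `train_labels.csv`.

```python
import xml.etree.ElementTree as ET

# Invented example of a Pascal VOC annotation with two labeled atoms
SAMPLE_XML = """
<annotation>
  <filename>example.png</filename>
  <object>
    <name>C</name>
    <bndbox><xmin>159</xmin><ymin>115</ymin><xmax>187</xmax><ymax>142</ymax></bndbox>
  </object>
  <object>
    <name>O</name>
    <bndbox><xmin>191</xmin><ymin>41</ymin><xmax>225</xmax><ymax>72</ymax></bndbox>
  </object>
</annotation>
"""

def voc_to_rows(xml_text):
    """Turn one VOC annotation into a list of row dicts, one per box."""
    root = ET.fromstring(xml_text)
    filename = root.findtext('filename')
    rows = []
    for obj in root.iter('object'):
        box = obj.find('bndbox')
        rows.append({
            'filename': filename,
            'class': obj.findtext('name'),
            'xmin': int(box.findtext('xmin')),
            'ymin': int(box.findtext('ymin')),
            'xmax': int(box.findtext('xmax')),
            'ymax': int(box.findtext('ymax')),
        })
    return rows

rows = voc_to_rows(SAMPLE_XML)
print(rows[0])
```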

In [None]:
### Function to plot image

def plot_img(image_name, image_dir, df_labels, verbose=1):
    
    fig, ax = plt.subplots(1, 2, figsize = (14, 14))
    ax = ax.flatten()
    
    df = df_labels
    bbox = df[df['filename'] == image_name]
    img_path = os.path.join(image_dir, image_name)
    
    image = cv2.imread(img_path, cv2.IMREAD_COLOR)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
    image /= 255.0
    image2 = image.copy()  # copy, so boxes are not drawn on the original image
    
    ax[0].set_title('Original Image')
    ax[0].imshow(image)
    
    for idx, row in bbox.iterrows():
        x1 = row['xmin']
        y1 = row['ymin']
        x2 = row['xmax']
        y2 = row['ymax']
        label = row['class']
        if verbose == 1:
            print(x1, y1, x2, y2, label)
        else: 
            pass
        
        cv2.rectangle(image2, (int(x1),int(y1)), (int(x2),int(y2)), (255,0,0), 1)
        font = cv2.FONT_HERSHEY_SIMPLEX
        cv2.putText(image2, label, (int(x1),int(y1-10)), font, 0.5, (255,0,0), 1)
    
    ax[1].set_title('Image with Boundary Box')
    ax[1].imshow(image2)

    plt.show()

In [None]:
### Pass any image name as parameter
count = 0
for image in unq_values:
    if len(image) > 9: 
        plot_img(image, DIR_IMAGES, df, verbose=0)
        count += 1
        if count > 5:
            break

### Preparing Dataset for Training<a class="anchor" id="preparing-dataset-training"></a>

In [None]:
### Class <-> Int

_classes = np.insert(classes, 0, 'background', axis=0) # adding a background class for Faster R-CNN
class_to_int = {_classes[i] : i for i in range(len(_classes))}
int_to_class = {i : _classes[i] for i in range(len(_classes))}
print("class_to_int : \n", class_to_int)
print("\nint_to_class : \n", int_to_class)

In [None]:
### Creating Data (Labels & Targets) for Faster R-CNN

def get_transform():
    return T.Compose([T.ToTensor()])

class MoleculeDetectionDataset(Dataset):
    
    def __init__(self, dataframe, image_dir, mode='train', transforms=None ):
        # mode='train': dataframe holds the labels (classes and bounding boxes)
        # mode='test': dataframe must be None; the model is used for prediction
        # Example for training:
        # MoleculeDetectionDataset(df_labels, image_dir, mode='train', transforms=get_transform())
        
        super().__init__()
        
        self.paths = None
        
        if mode == 'train':
            self.image_names = dataframe["filename"].unique()
            
        else: # mode == 'test'
            if dataframe is None:
                paths = [x for x in list(pathlib.Path(image_dir).rglob('*.png'))]
                self.paths = paths
                self.image_names = np.asarray([x.name for x in paths], dtype=object)
                
            else:
                raise ValueError("Ignore input parameter 'dataframe' in a 'test' mode")
                
        self.df = dataframe
        self.image_dir = image_dir
        self.transforms = transforms
        self.mode = mode
        
    def __getitem__(self, index: int):
        

        if self.mode == 'train':
            
            # Retrieve the image name; its records (x1, y1, x2, y2, classname) are read from df below
            image_name = self.image_names[index]

            # Loading Image
            image = cv2.imread(self.image_dir + image_name, cv2.IMREAD_COLOR)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
            image /= 255.0
            
            # # # # # # # #
            
            records = self.df[self.df["filename"] == image_name]
            # Get bounding box co-ordinates for each box
            boxes = records[['xmin','ymin','xmax','ymax']].values
            
            # Getting labels for each box
            temp_labels = records[['class']].to_numpy() #.value
            labels = []
            for label in temp_labels:
                label = class_to_int[label[0]]
                labels.append(label)
                
            # Converting boxes & labels into torch tensor
            boxes = torch.as_tensor(boxes, dtype=torch.float32)
            labels = torch.as_tensor(labels, dtype=torch.int64)
            
            # Creating target
            target = {}
            target['boxes'] = boxes
            target['labels'] = labels
            
            # Transforms
            if self.transforms:
                image = self.transforms(image)
              
            return image, target, image_name
        
        elif self.mode == 'test':
            
            # Retrieve the image path and name (there are no labels in test mode)
            image_path = self.paths[index]
            image_name = image_path.name

            # Loading Image
            image = cv2.imread(str(image_path), cv2.IMREAD_COLOR) 
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32) 
            image /= 255.0
            
            #-#-#-#-#-#-#
            
            if self.transforms:
                image = self.transforms(image)
                
            return image, image_name
        
    def __len__(self):
        return len(self.image_names)
    

# test
dataset = MoleculeDetectionDataset(image_dir=DIR_IMAGES, 
                                   transforms=get_transform(), 
                                   dataframe=df, 
                                   mode='train' )
dataset[0]

In [None]:
### Preparing data for Train & Validation

def collate_fn(batch):
    return tuple(zip(*batch))

# Dataset object
dataset = MoleculeDetectionDataset(dataframe=df, 
                                   image_dir=DIR_IMAGES, 
                                   transforms=get_transform(),
                                    mode='train')

indices = torch.randperm(len(dataset)).tolist()

train_dataset = torch.utils.data.Subset(dataset, indices)

# Preparing data loaders
train_data_loader = DataLoader(
    train_dataset,
    batch_size=1, 
    shuffle=True,
    num_workers=4, 
    collate_fn=collate_fn
)

### Create Model - ResNet50(Faster R-CNN)<a class="anchor" id="create-model-resnet50"></a>

In [None]:
### Utilize GPU if available

# device = torch.device('cpu') 
# device = torch.device('cuda') 
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
torch.cuda.empty_cache()

ResNet is one of the most successful recent architectures, although DenseNet is probably an even better choice. I compared the scores of several neural networks.

For everybody who is new to Faster R-CNN, here are a couple of links.

[Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/pdf/1506.01497.pdf)

[TorchVision](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html)

In [None]:
### Create / load model

# Faster - RCNN Model - pretrained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = len(class_to_int)

# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features

# replace the pre-trained head with a new one
model.roi_heads.box_predictor =  FastRCNNPredictor(in_features, num_classes)

### Preparing Model for Training<a class="anchor" id="preparing-model-for-training"></a>

In [None]:
### Preparing model for training

# Retrieving all trainable parameters from the model (for the optimizer)
params = [p for p in model.parameters() if p.requires_grad]

# Defining the optimizer
# optimizer = torch.optim.SGD(params, lr = 0.25, momentum = 0.9)
optimizer = torch.optim.SGD(params, lr = 0.005, momentum = 0.9)


model.to(device)

# Number of epochs
epochs = EPOCHS # 30

### Training<a class="anchor" id="let's-train-it"></a>

In [None]:
### Training model

PATH_LOAD = pathlib.Path("/kaggle/input/modelsave/checkpoint_save_330ext2")


## Save/Load state_dict()
# model.load_state_dict(torch.load(PATH_LOAD))
# model.eval()

## Save/Load Entire Model
# model = torch.load(PATH_LOAD)
# model.eval()
# model.train()

def load_checkpoint(PATH, 
                   model, 
                   optimizer):
    checkpoint = torch.load(PATH, map_location=device)
    
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    loss = checkpoint['loss']
    
    return epoch, loss

## Loading a checkpoint
# preferred
epoch, loss = load_checkpoint(PATH_LOAD, model, optimizer)

itr = 1
total_train_loss = []

for epoch in range(epochs):
    
    start_time = time.time()
    train_loss = []

    # Retrieving mini-batch
    for images, targets, image_names in train_data_loader:
        
        #Loading images & targets on device
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        
        #Forward propagation
        out = model(images, targets)
        losses = sum(loss for loss in out.values())
        
        # Resetting gradients
        optimizer.zero_grad()
        
        #Back propagation
        losses.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
        optimizer.step()
        
        # Loss of this batch (averaged over the epoch below)
        loss_value = losses.item()
        train_loss.append(loss_value)

        
        if itr % 25 == 0:
            print(f"\n Iteration #{itr} loss: {out} \n")

        itr += 1
    
    epoch_train_loss = np.mean(train_loss)
    total_train_loss.append(epoch_train_loss)
    print(f'Epoch train loss is {epoch_train_loss:.4f}')

    
    time_elapsed = time.time() - start_time
    print("Time elapsed: ",time_elapsed)

In [None]:
fig, axes = plt.subplots()
fig.suptitle('Training Metrics')

axes.set_ylabel("Loss", fontsize=14)
axes.plot(total_train_loss)

plt.show()

In [None]:
sum([param.nelement() for param in model.parameters()])

In [None]:
import pathlib

pathlib.Path("/kaggle/working/model").mkdir(parents=True, exist_ok=True)

# Saving a model
PATH_SAVE = pathlib.Path('/kaggle/working/model/checkpoint_save')

# ## Saving only a checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': epoch_train_loss,},
    PATH_SAVE)

## Calculation on Unlabeled Data<a class="anchor" id="calculation-on-unlabeled-data"></a>

In [None]:
### Preparing data for Tests
# The test data is prepared with the same code as the training data.
# The only differences are the dataset mode and the model mode.

# Dataset object
dataset = MoleculeDetectionDataset(dataframe=None, 
                                   image_dir=DIR_TEST_IMAGES, 
                                   mode='test', 
                                   transforms=get_transform())

indices = torch.randperm(len(dataset)).tolist()

valid_dataset = torch.utils.data.Subset(dataset, indices)

valid_data_loader = DataLoader(
    valid_dataset,
    batch_size=4, #1,
    shuffle=True,
    num_workers=4, #1,
    collate_fn=collate_fn
)

In [None]:
threshold = THRESHOLD # local alias for the score threshold

# Indicator of iterations
itr = 1

# This sets model for prediction mode, not for training
model.eval()

model.to(device) 

list_ = []
dict_row = {}
j = 0
    
start_time = time.time()
    
# Retrieving mini-batch
for images, image_names in valid_data_loader:
     
    # Loading images on device
    images = list(image.to(device) for image in images)
    # Prediction with the already trained model; no gradients are needed here
    with torch.no_grad():
        out = model(images)
    
    for i in range(len(out)):
        # Converting tensors to array
        boxes = out[i]['boxes'].data.cpu().numpy()
        scores = out[i]['scores'].data.cpu().numpy()
        labels = out[i]['labels'].data.cpu().numpy()

        # Thresholding: a single mask keeps boxes, scores and labels aligned
        keep = scores >= threshold
        boxes_th = boxes[keep].astype(np.int32)
        scores_th = scores[keep]

        # int_to_class - labels
        labels_th = [int_to_class[label] for label in labels[keep]]

        # Appending results to csv
        for y in range(len(boxes_th)):

            # Bboxes, classname & image name
            xmin = boxes_th[y][0]
            ymin = boxes_th[y][1]
            xmax = boxes_th[y][2]
            ymax = boxes_th[y][3]
            class_name = labels_th[y]

            # Creating row for df
            dict_row[j] = {"filename": image_names[i], "xmin" : xmin, "ymin" : ymin, 
                   "xmax" : xmax, "ymax" : ymax, "class" : class_name}
            j += 1

            list_.append(image_names[i])
            
    itr += 1
    
    if itr % 50 == 0:
        print('itr = ', itr)

submission = pd.DataFrame.from_dict(dict_row, "index") # very fast approach        
time_elapsed = time.time() - start_time
print("Time elapsed: ",time_elapsed)

In [None]:
# list(set(list_))

In [None]:
print('threshold = ', threshold)
display(submission.describe())
display(submission.head())

**Hydrogen in the molecular formula.** Hydrogen can't be detected from the picture, because there are no symbols for it. I don't mean chemical groups such as NH, NH2, OH, ..., which are drawn explicitly. Hydrogen that is bonded to carbon has no symbol at all, so it has to be calculated.

Calculating hydrogen requires information about the structure of the molecule. So it is a complex task, much harder than calculating the carbon atoms!

I think it is too complex to solve here. That is why I don't print the number of hydrogen atoms.
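
To illustrate why the structure matters: for a neutral molecule with standard valences, the hydrogen count follows from the heavy-atom counts only if the degree of unsaturation (rings plus double bonds, with a triple bond counted twice) is also known, and atom detection alone does not provide it. The sketch below rests on those assumptions, handles only C, N, O, S and the halogens, and uses a hypothetical function name.

```python
def hydrogen_count(counts, unsaturation):
    """Hydrogens of a neutral molecule with standard valences.

    counts: dict of heavy-atom counts, e.g. {'C': 6}
    unsaturation: rings + double bonds + 2 * triple bonds (must be known)
    """
    halogens = sum(counts.get(x, 0) for x in ('F', 'Cl', 'Br', 'I'))
    # Divalent O and S do not change the hydrogen count
    return (2 * counts.get('C', 0) + 2 + counts.get('N', 0)
            - halogens - 2 * unsaturation)

print(hydrogen_count({'C': 6}, unsaturation=4))          # benzene: 6
print(hydrogen_count({'C': 6}, unsaturation=1))          # cyclohexane: 12
print(hydrogen_count({'C': 2, 'O': 1}, unsaturation=0))  # ethanol: 6
```

The same six carbon atoms give 6 or 12 hydrogens depending on the structure, so the atom counts alone are not enough.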

In [None]:
def submission_to_inchi(submission):
    classes = ['h', 'p', 'n', 'o', 'c', 'cl', 's', 'b', 'si', 'i', 'f', 'br']
    #CHBBrClFINOPSSi
    ordered_output = ['c', 'h', 'b', 'br', 'cl', 'f', 'i', 'n', 'o', 'p', 's', 'si']
    
    list_output = [] # (file, inchi)
        
    files = submission['filename'].unique().tolist()
    for file in files:
        df = submission.loc[submission['filename'] == file]
        labels = df['class'].tolist()
        dictionary = {}
        
        for label in classes:
            if labels.count(label) == 0:
                continue
            elif label == 'h':
                dictionary[label] = ''
            elif labels.count(label) == 1:
                dictionary[label] = ''
            else:
                dictionary[label] = labels.count(label)
                
        formula = ''
        for item in ordered_output:
            if item in dictionary:
                formula += item.title()
                formula += str(dictionary[item])
        
        m = re.search(r'(.*)\.[^.]+$', file)
        file_name = m.group(1)
        
        list_output.append((file_name, 'InChI=1S/' + formula + '/c1/h'))
        
    return list_output

In [None]:
# Write inchi to disk
list_submission = submission_to_inchi(submission)


with open(CSV_OUT, 'w', newline='') as out:
    csv_out = csv.writer(out, quotechar=None)
    csv_out.writerow(['image_id', 'InChI'])
    for row in list_submission:
        csv_out.writerow(row)

In [None]:
## Displaying Test Images
filepath_image_list = list(pathlib.Path(DIR_TEST_IMAGES).rglob('*.png'))
filepath_image_list.sort()
image_test = [x.name for x in filepath_image_list]

count = 0
for image in image_test:
    if len(image) > 9: 
        plot_img(image, DIR_TEST_IMAGES, submission, verbose=0)
        print(image)
        count += 1
        # limit output files to ...
        if count > 5:
            break

In [None]:
raise SystemExit("Stop right there!")

In [None]:
import re
import pandas as pd
import unittest

class Test02(unittest.TestCase):
    def test_01(self):
        submission = pd.DataFrame({'filename': [
                        '003781e47d63.png', 
                       '003781e47d63.png',
                       '003781e47d63.png',
                       '002781e47d63.png',
                       '003781e47d63.png'],
            'class': ['c', 'c', 'h', 'h', 'c'],
           'xmin': [159, 191, 197, 159, 75],
           'ymin': [115, 41, 114, 44, 112],
           'xmax': [187, 225, 229, 186, 108],
           'ymax': [142, 72, 143, 76, 140]
          })
        
        list_ = submission_to_inchi(submission)
        print(list_)
        
    def test_02(self):
        submission = pd.DataFrame({'filename': [
                        '003781e47d63.png', 
                       '003781e47d63.png',
                       '003781e47d63.png',
                       '003781e47d63.png',
                       '003781e47d63.png'],
            'class': ['c', 'c', 'p', 'p', 'c'],
           'xmin': [159, 191, 197, 159, 75],
           'ymin': [115, 41, 114, 44, 112],
           'xmax': [187, 225, 229, 186, 108],
           'ymax': [142, 72, 143, 76, 140]
          })
        
        list_ = submission_to_inchi(submission)
        print(list_)
        
        
unittest.main(argv=[''], verbosity=2, exit=False)

In [None]:
submission.to_csv('submission.csv', index=False)

In [None]:
# Clear the directory

import os
import glob

files = glob.glob('/kaggle/working/*')
for f in files:
    if os.path.isfile(f):  # os.remove() cannot delete directories such as model/
        os.remove(f)