# BMS Mol.Tr.: Approaches (EDA, Denoise, Baseline)
![](https://storage.googleapis.com/kaggle-competitions/kaggle/22422/logos/header.png?t=2021-02-03-02-05-31)

In a technology-forward world, sometimes the best and easiest tools are still pen and paper. Organic chemists frequently draw out molecular work with the Skeletal formula, a structural notation used for centuries. Recent publications are also annotated with machine-readable chemical descriptions (InChI), but there are decades of scanned documents that can't be automatically searched for specific chemical depictions. Automated recognition of optical chemical structures, with the help of machine learning, could speed up research and development efforts.

Unfortunately, most public data sets are too small to support modern machine learning models. Existing tools produce 90% accuracy but only under optimal conditions. Historical sources often have some level of image corruption, which reduces performance to near zero. In these cases, time-consuming, manual work is required to reliably convert scanned chemical structure images into a machine-readable format.

Bristol-Myers Squibb is a global biopharmaceutical company working to transform patients' lives through science. Their mission is to discover, develop, and deliver innovative medicines that help patients prevail over serious diseases.

In [None]:
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from collections import Counter, defaultdict
import cv2, os
import skimage.io as io

import Levenshtein

# ignoring warnings
import warnings
warnings.simplefilter("ignore")

# Quick look at the data

In [None]:
labels = pd.read_csv('../input/bms-molecular-translation/train_labels.csv')
ss = pd.read_csv('../input/bms-molecular-translation/sample_submission.csv', index_col = 0)

print('Labels:\t\t Len: {};\tUnique values: {}'.format(
    len(labels.InChI), labels.InChI.nunique()))
print('Samp. subm.:\t Len: {};\tUnique values: {}'.format(
    len(ss.InChI), ss.InChI.nunique()))

print('*'*60)
print('-'*30, 'Labels head', '-'*30)
print(labels.head(5))
print('-'*25, 'Sample submission head', '-'*25)
print(ss.head(5))

In [None]:
# Check NaN values
labels.isna().sum()

In [None]:
# Add all training paths to labels df
labels['path'] = labels['image_id'].progress_apply(
    lambda x: "../input/bms-molecular-translation/train/{}/{}/{}/{}.png".format(
        x[0], x[1], x[2], x))
labels.head()
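
The path rule used above (the first three characters of the ID become three nested folders) can be written as a small helper. This is only a sketch: the function name `train_image_path` and the ID `abc123def456` are made up for illustration, and the root folder is the one already used in this notebook.

```python
def train_image_path(image_id, root="../input/bms-molecular-translation/train"):
    """Build the image path: <root>/<1st char>/<2nd char>/<3rd char>/<image_id>.png"""
    return "{}/{}/{}/{}/{}.png".format(
        root, image_id[0], image_id[1], image_id[2], image_id)

# Hypothetical image ID, used only to show the resulting folder layout
example_path = train_image_path("abc123def456")
print(example_path)
```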

### Sample of 50k images

We use **io.imread** to read images, as this method is faster than **cv2.imread**.

In [None]:
train_sample = labels.sample(50000)
train_sample['img_tensor'] = train_sample['path'].progress_apply(lambda x: io.imread(x))
train_sample = train_sample.reset_index()
train_sample.head()

Here we keep the decoded tensors of all 50k sampled images in memory. That is fine for EDA, but it will be too costly once modeling on the full dataset starts.
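
One possible way around the memory cost is to decode images lazily, one batch at a time, instead of storing every tensor in the DataFrame. The generator below is a minimal sketch of that idea, not the approach used later in this notebook; `iter_batches` is a hypothetical helper, and `loader` stands for any function that reads one image (for example `io.imread`).

```python
def iter_batches(paths, batch_size, loader):
    """Yield lists of loaded items, at most batch_size at a time."""
    batch = []
    for path in paths:
        batch.append(loader(path))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Last, possibly smaller, batch
        yield batch

# Demo with a stand-in loader so that the sketch runs without image files
demo_batches = list(iter_batches(["a.png", "b.png", "c.png"], 2, loader=str.upper))
print(demo_batches)
```

Only one batch of tensors is alive at any moment, so memory use depends on `batch_size` rather than on the number of images.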

In [None]:
plt.figure(figsize = (15, 15))
for i in range(10):
    image = train_sample.iloc[i, 4]
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    plt.subplot(5, 2, i + 1)
    plt.imshow(image)
    plt.title(train_sample.loc[i, 'InChI'][:70] + '...', size = 9)
    plt.axis('off')

plt.show()

In [None]:
plt.figure(figsize = (15, 15))
for i in range(10):
    image = 255 - train_sample.iloc[i, 4]
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    plt.subplot(5, 2, i + 1)
    plt.imshow(image)
    plt.title(train_sample.loc[i, 'InChI'][:70] + '...', size = 9)
    plt.axis('off')

plt.show()

In [None]:
# Shape columns
train_sample['img_height'] = train_sample['img_tensor'].progress_apply(lambda x: np.shape(x)[0])
train_sample['img_width'] = train_sample['img_tensor'].progress_apply(lambda x: np.shape(x)[1])

### Image shapes distribution

In [None]:
sns.jointplot(x = train_sample['img_width'].astype('float32'), 
              y = train_sample['img_height'].astype('float32'),
              height = 8, color = '#930077')
plt.show()

The data contains both very small and very large images. Now, let's look at the labels.

In [None]:
label_lengths = labels['InChI'].progress_apply(lambda x: len(x))

In [None]:
sns.set_style("whitegrid")
plt.figure(figsize = (10, 6))
plt.title('Distribution of label length', fontsize = '15')
sns.kdeplot(label_lengths, fill = True, color = '#930077', 
            edgecolor = 'black', alpha = 0.9)
plt.xlabel('InChI length')
plt.show()

The data contains both fairly short and very long formulas.

In [None]:
label_splited = labels['InChI'].progress_apply(lambda x: x.split('/'))

In [None]:
num_elem = label_splited.apply(lambda x: len(x))
# Explicit column names keep this working across pandas versions
num_elem_count = num_elem.value_counts().sort_index().rename_axis('index').reset_index(name = 'InChI')
num_elem_count['perc'] = round(num_elem_count['InChI'] / num_elem_count['InChI'].sum() * 100, 3)

In [None]:
sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize = (10, 6))
plt.title('Distribution of labels by number of parts', fontsize = 15)
sns.barplot(x = num_elem_count['index'], y = num_elem_count['InChI'], fill = True, 
            color = '#930077', edgecolor = 'black', alpha = 0.9, ax = ax)
for idx, i in enumerate(ax.patches):
    ax.annotate("{} %".format(num_elem_count.iloc[idx, 2]), (i.get_x() + i.get_width() / 2, i.get_height() + 20000),
                 ha = 'center', fontsize = 12)
ax.set_xlabel('InChI number of parts')
plt.show()

Most labels have 4 parts: almost 80% of the 2,424,186 labels. A sizeable share of labels, about 17%, is considerably longer.

Let's look at the most common parts.

In [None]:
part_count = Counter([part for i in label_splited for part in i]).most_common()

parts, counts = [], []
for part, count in part_count[1:]:
    parts.append(part)
    counts.append(count)

We exclude the first part because it is common for all formulas.
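
To see why the first part carries no information, here is a small sketch on three well-known InChI strings (water, methane, ethanol). The list `toy_labels` is an illustration only and is not taken from the competition data.

```python
from collections import Counter

toy_labels = [
    "InChI=1S/H2O/h1H2",                  # water
    "InChI=1S/CH4/h1H4",                  # methane
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",  # ethanol
]

# Count every '/'-separated part across all labels
part_counter = Counter(part for label in toy_labels for part in label.split('/'))
prefix, prefix_count = part_counter.most_common(1)[0]
print(prefix, prefix_count)
```

The prefix `InChI=1S` appears once per label, so its count always equals the number of labels.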

In [None]:
print('The first part is common for all formulas (must be True): {}'.format(
    part_count[0][1] == len(labels)))
part_count[0]

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize = (20, 2))
plt.title('TOP-3 most common parts', fontsize = 12)
sns.barplot(x = counts[:3], y = parts[:3], fill = True, 
            color = '#930077', edgecolor = 'black', alpha = 0.9)
plt.xlabel('Frequency', fontsize = 12)

plt.figure(figsize = (20, 8))
from_ = 4
to = 18
for i in range(2):
    plt.subplot(1, 2, i + 1)
    plt.title('Most common parts ({}-{})'.format(from_, to), fontsize = 12)
    sns.barplot(x = counts[from_-1:to], y = parts[from_-1:to], fill = True, 
                color = '#930077', edgecolor = 'black', alpha = 0.9)
    plt.xlabel('Frequency', fontsize = 12)
    from_ += 15
    to += 12

plt.show()

In [None]:
print('Total unique parts in the data: %i' % len(parts))
print('Parts that occur only once: %i' % len([i for i in counts if i == 1]))

# Image denoising

There are many ways to deal with image noise. However, most of them are not suitable for our task: they filter the whole image and therefore alter the molecule drawing itself, not just the noise.

Some examples of such approaches are presented below.

In [None]:
def image_viz(image, title, figsize=(15,8)):
    """
    Function for image visualization.
    Takes image tensor, plot title (label) and figsize.
    """
    plt.figure(figsize = figsize)
    plt.imshow(image)
    plt.title(title, size = 16)
    plt.axis('off')
    plt.show()

In [None]:
image = io.imread(labels.iloc[2, 2])
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
title = 'Original image with noise'

image_viz(image, title)

In [None]:
from scipy import ndimage

image = io.imread(labels.iloc[2, 2])
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image_blur = ndimage.gaussian_filter(image, 2)
title = 'Blurred image (bad example)'

image_viz(image_blur, title)

Gaussian blur (Gaussian smoothing) looks as bad as expected. Total variation denoising (the Rudin-Osher-Fatemi model, solved here with Chambolle's algorithm) does not look much better.

In [None]:
from skimage.restoration import (denoise_tv_chambolle)

image = io.imread(labels.iloc[2, 2])
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# 'channel_axis' replaces the deprecated 'multichannel' argument in scikit-image >= 0.19
image_restorated = denoise_tv_chambolle(image, weight = 0.33, channel_axis = -1)
title = 'Restored image (still a bad example)'

image_viz(image_restorated, title)

The best solution would be an algorithm that traverses the image and removes only those elements that are smaller than a certain size (in our case, the dots that represent noise). An elegant solution to this problem is presented on [Stack Overflow](https://stackoverflow.com/questions/48681465/how-do-i-remove-the-dots-noise-without-damaging-the-text).

In [None]:
def image_denoising(img_path, dot_size = 2):
    """
    Source: https://stackoverflow.com/questions/48681465/how-do-i-remove-the-dots-noise-without-damaging-the-text
    Removes noise in the form of small dots.
    Takes the path to the image and returns the cleaned image.
    Connected components smaller than 'dot_size' pixels are removed,
    so increase 'dot_size' to remove larger dots.
    """
    image = io.imread(img_path)
    # Invert so that dark strokes become white foreground on black
    _, BW = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY_INV)
    # 'label_map' avoids shadowing the global 'labels' DataFrame
    nlabels, label_map, stats, _ = \
        cv2.connectedComponentsWithStats(BW, None, None, None, 
                                         8, cv2.CV_32S)
    sizes = stats[1:, -1]  # component areas, background (label 0) skipped
    image2 = np.zeros((label_map.shape), np.uint8)
    for i in range(0, nlabels - 1):
        if sizes[i] >= dot_size: 
            image2[label_map == i + 1] = 255
    # Invert back to dark strokes on a white background
    image = cv2.bitwise_not(image2)
    return image

In [None]:
denoise_image = image_denoising(labels.iloc[2, 2])
title = 'Image without dots (the best example)'

image_viz(denoise_image, title)
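
The size-filtering idea can also be checked on a tiny synthetic image where the expected result is obvious. The sketch below re-implements it with plain NumPy and a breadth-first search over 8-connected pixels. It is an illustration of the principle only, not the OpenCV routine used above, and `remove_small_components` is a hypothetical name.

```python
from collections import deque
import numpy as np

def remove_small_components(mask, min_size=2):
    """Return a copy of a boolean mask without 8-connected components smaller than min_size."""
    height, width = mask.shape
    visited = np.zeros_like(mask, dtype=bool)
    cleaned = np.zeros_like(mask, dtype=bool)
    for row in range(height):
        for col in range(width):
            if not mask[row, col] or visited[row, col]:
                continue
            # Breadth-first search collects one connected component
            component = [(row, col)]
            visited[row, col] = True
            queue = deque(component)
            while queue:
                r, c = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < height and 0 <= nc < width
                                and mask[nr, nc] and not visited[nr, nc]):
                            visited[nr, nc] = True
                            queue.append((nr, nc))
                            component.append((nr, nc))
            # Keep the component only if it is large enough
            if len(component) >= min_size:
                for r, c in component:
                    cleaned[r, c] = True
    return cleaned

# Synthetic drawing: a 5-pixel stroke plus one isolated noise pixel
canvas = np.zeros((7, 7), dtype=bool)
canvas[3, 1:6] = True   # stroke
canvas[0, 6] = True     # isolated dot
cleaned = remove_small_components(canvas, min_size=2)
print(int(cleaned.sum()))
```

The stroke survives and the isolated pixel disappears, which is the behaviour we want for molecule drawings with salt-and-pepper style noise.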

# Simple baseline

Before we start doing serious modeling, it is necessary to establish a level above which our model will be considered effective. For this task, there are many different approaches to create a baseline. We'll start with the "most frequent part" principle. Over time, we may replace it with a more thoughtful and correct baseline, but for a start, I think it's a good idea.

In [None]:
print('Unique values in sample submission: %i' % ss['InChI'].nunique())

The baseline score established by the competition is 109.6 (mean Levenshtein distance, so lower is better). For this, the same short label is assigned to all test images.

In [None]:
ss.head()
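
For intuition about the metric: the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The pure-Python sketch below shows the classic dynamic-programming formulation. It is an illustration only; the function names are made up here, and the `Levenshtein` package used in this notebook has its own (much faster) implementation.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def mean_levenshtein(y_true, y_pred):
    """Average edit distance over paired true and predicted strings."""
    distances = [levenshtein(t, p) for t, p in zip(y_true, y_pred)]
    return sum(distances) / len(distances)

print(levenshtein("kitten", "sitting"))
print(mean_levenshtein(["abc", "abd"], ["abc", "abc"]))
```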

In [None]:
# Data Frame with label parts
label_parts = pd.DataFrame.from_records(label_splited.values)
label_parts.columns = np.array(label_parts.columns + 1)
label_parts = label_parts.add_prefix('Part_')

label_parts.head()

I think it would be a good idea to take a look at the variation in the length of each part. To do this, let's calculate some basic statistics.

In [None]:
means = []
variations = []
mins = []
maxs = []
for i in range(len(label_parts.columns)):
    lengths = label_parts.iloc[:, i].dropna().progress_apply(lambda x: len(x))
    means.append(round(lengths.mean(), 2))
    variations.append(round(lengths.var(), 2))
    mins.append(lengths.min())
    maxs.append(lengths.max())

df_stat = pd.DataFrame({'Part': label_parts.columns,
                        'Mean length': means,
                        'Length variation': variations,
                        'Count': label_parts.count().values,
                        'Min length': mins,
                        'Max length': maxs})

In [None]:
all_lengths = label_parts.progress_apply(lambda x: [len(i) for i in x.dropna()])

sns.set_style("whitegrid")
plt.figure(figsize = (20, 20))
for i in range(11):
    plt.subplot(4, 3, i + 1)
    plt.title('Distribution of length (Part_{})'.format(i+1), fontsize = '10')
    sns.kdeplot(all_lengths.iloc[i], fill = True, color = '#930077', 
                edgecolor = 'black', alpha = 0.9)
    plt.xlabel('')
plt.show()

In [None]:
df_stat

Wow, the length of the third part varies enormously (min length is 4 and max length is 267)! This is not good news, but let's not worry about it for now.

It's time to create our new formula! It will be assembled from the most frequent parts (what will chemists say about this Frankenstein?).

In [None]:
baseline_label = ''
n = 7
for i in range(n):
    baseline_label += '/' + label_parts.iloc[:, i].value_counts().index[0]
baseline_label = baseline_label[1:]

print('Baseline {} part label:'.format(n))
print(baseline_label)
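
The same "most frequent part per position" idea, shown on four small molecules (ethanol, dimethyl ether, methanol, ethane) so that the result can be checked by hand. This is a toy sketch with a hypothetical helper name, not a replacement for the cell above.

```python
from collections import Counter
from itertools import zip_longest

def most_frequent_parts_label(inchi_labels, n_parts):
    """Join the most frequent part at each of the first n_parts positions."""
    split_labels = [label.split('/') for label in inchi_labels]
    # zip_longest pads shorter labels with None, which we skip when counting
    columns = list(zip_longest(*split_labels))[:n_parts]
    parts = []
    for column in columns:
        counts = Counter(part for part in column if part is not None)
        parts.append(counts.most_common(1)[0][0])
    return '/'.join(parts)

toy_train = [
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",  # ethanol
    "InChI=1S/C2H6O/c1-3-2/h1-2H3",       # dimethyl ether
    "InChI=1S/CH4O/c1-2/h2H,1H3",         # methanol
    "InChI=1S/C2H6/c1-2/h1-2H3",          # ethane
]
toy_baseline = most_frequent_parts_label(toy_train, n_parts=4)
print(toy_baseline)
```

The resulting string mixes the formula of one molecule with the connectivity and hydrogen layers of another, so it does not describe a real compound. It only minimises the expected edit distance position by position.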

In [None]:
def Levenshtein_dist(y_true, y_pred):
    """
    Function that calculates the average Levenshtein distance for all data.
    Takes arrays of true (y_true) and predicted (y_pred) labels as input.
    """
    # Distinct loop names avoid shadowing the function arguments
    values = []
    for true_label, pred_label in zip(y_true, y_pred):
        values.append(Levenshtein.distance(true_label, pred_label))
    return np.mean(values)

In [None]:
print('Train Levenshtein distance: %.4f' 
      % Levenshtein_dist(labels['InChI'], [baseline_label] * len(labels)))

Not bad. It looks better than the basic competition baseline. I think this is a good starting level that we need to exceed.

In [None]:
ss['InChI'] = [baseline_label] * len(ss['InChI'])
ss.to_csv("submission.csv")
ss

## WORK IN PROGRESS...