In [None]:
!conda install -y -c rdkit rdkit

# Generating New Data

This notebook shows how to create additional image/InChI pairs using RDKit.

In [None]:
import numpy as np
from skimage import color
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from PIL import Image
import io

Here we load a molecule from a SMILES string and convert it to an InChI string.

In [None]:
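# Example compound given as a SMILES string
smiles = 'Fc(c1)ccc-2c1C(=O)N(C)Cc3n2cnc3C(=O)OCC'
# Parse the SMILES into an RDKit molecule, then derive its InChI
mol = Chem.MolFromSmiles(smiles)
inchi = Chem.inchi.MolToInchi(mol)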

In [None]:
inchi

We can then use RDKit to generate a black-and-white image similar to those in the dataset.

In [None]:
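# Draw all atoms in black instead of the default per-element colors
IPythonConsole.drawOptions.useBWAtomPalette()
# Render the single molecule as a one-cell grid image
im = Draw.MolsToGridImage([mol], molsPerRow=1)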

In [None]:
# In a notebook the grid image is a display object; decode its PNG bytes into a PIL image
im = Image.open(io.BytesIO(im.data))

In [None]:
im

In [None]:
im.save('new_im.png')

With a little post-processing, such as adding noise or breaking up some of the drawn lines, the compound image could be made to look more like the scanned images in the dataset.
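As a rough illustration of that kind of post-processing, the sketch below erases a random fraction of ink pixels and sprinkles dark specks on the background. The function name `add_scan_noise`, its parameters, and the chosen probabilities are assumptions made for this example; they are not tuned against the challenge images. The demo runs on a synthetic array so it does not depend on the cells above.

```python
import numpy as np

def add_scan_noise(gray, drop_prob=0.05, speckle_prob=0.001, seed=0):
    """Degrade a grayscale image (2-D uint8 array, 0 = ink, 255 = paper).

    Hypothetical helper: the probabilities are illustrative guesses.
    """
    rng = np.random.default_rng(seed)
    out = gray.copy()
    ink = out < 128
    # Erase a random fraction of ink pixels to mimic broken strokes
    drop = ink & (rng.random(out.shape) < drop_prob)
    out[drop] = 255
    # Turn a random fraction of background pixels dark to mimic specks
    speckle = ~ink & (rng.random(out.shape) < speckle_prob)
    out[speckle] = 0
    return out

# Synthetic demo: a white canvas with one horizontal black line
demo = np.full((64, 64), 255, dtype=np.uint8)
demo[32, 8:56] = 0
noisy = add_scan_noise(demo)
```

To apply this to the generated compound image, one option is to convert it with `np.array(im.convert('L'))`, pass the array through the function, and turn the result back into an image with `Image.fromarray(...)`.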

The ability to generate new InChI/image pairs adds an interesting element to this competition. A sufficiently motivated person could download 1 billion compounds from the [ZINC database](https://zinc.docking.org/tranches/home/) to create a massive dataset of image/InChI pairs.

Even if the generated data doesn't quite match the challenge data, the sheer volume that can be created makes it possible to pre-train a model on the generated data before fine-tuning on the actual challenge data.
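Generating pairs in bulk could look something like the sketch below. The generator name `generate_pairs` and the 300x300 image size are assumptions for illustration. Note that it converts the default colored rendering to grayscale with PIL rather than using the black-and-white atom palette shown earlier, so the output is only an approximation of the images above. SMILES strings that RDKit cannot parse are skipped.

```python
from rdkit import Chem
from rdkit.Chem import Draw

def generate_pairs(smiles_iter, size=(300, 300)):
    """Yield (inchi, image) pairs for each parseable SMILES string.

    Hypothetical helper; `size` is an arbitrary choice for this sketch.
    """
    for smiles in smiles_iter:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # RDKit returns None when it cannot parse the SMILES
            continue
        inchi = Chem.MolToInchi(mol)
        # MolToImage returns a PIL image; convert it to 8-bit grayscale
        image = Draw.MolToImage(mol, size=size).convert('L')
        yield inchi, image

# Small demo: ethanol plus one invalid SMILES (unclosed ring)
pairs = list(generate_pairs(['CCO', 'C1CC']))
```

Because this is a generator, pairs can be written to disk one at a time instead of holding a very large dataset in memory.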