# Transformers for chemical reactions - reaction prediction


![](https://pubs.acs.org/cms/10.1021/acscentsci.9b00576/asset/images/medium/oc9b00576_0009.gif)
</br><center>Figure 1: SMILES-to-SMILES translation with the Molecular Transformer</center>

## Table of contents:
#### Setup
* [Data download](#first-bullet)
* [Load the data](#second-bullet)
* [Tokenization](#third-bullet)

#### OpenNMT-py main steps
* [Building the vocab](#fourth-bullet)
* [Training the Molecular Transformer](#fifth-bullet)
* [Testing](#sixth-bullet)

#### Additional stuff
* [Improvements](#seventh-bullet)
* [Further steps](#eighth-bullet)
* [Publications](#ninth-bullet)

We start by installing [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py), a widely used Python framework for neural machine translation, and [RDKit](https://www.rdkit.org), the open-source Python cheminformatics Swiss army knife.

In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !pip install rdkit-pypi==2022.3.1
    !pip install OpenNMT-py==2.2.0

In [None]:
import gdown
import os
import random
import re

import pandas as pd

from tqdm.auto import tqdm
tqdm.pandas()
from rdkit import Chem

# to display molecules
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.ipython_useSVG=True


# disable RDKit warnings
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*') 

# Data download <a class="anchor" id="first-bullet"></a>

In this short tutorial, we will look at USPTO_480k, which is a frequently used reaction prediction benchmark dataset. Please note that it does not contain stereochemistry. The original USPTO data can be downloaded from [figshare](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873).

In [None]:
def download_data():
    # links from https://github.com/coleygroup/Graph2SMILES/blob/main/scripts/download_raw_data.py
    USPTO_480k_links= [
            ("https://drive.google.com/uc?id=1RysNBvB2rsMP0Ap9XXi02XiiZkEXCrA8", "src-train.txt"),
            ("https://drive.google.com/uc?id=1CxxcVqtmOmHE2nhmqPFA6bilavzpcIlb", "tgt-train.txt"),
            ("https://drive.google.com/uc?id=1FFN1nz2yB4VwrpWaBuiBDzFzdX3ONBsy", "src-val.txt"),
            ("https://drive.google.com/uc?id=1pYCjWkYvgp1ZQ78EKQBArOvt_2P1KnmI", "tgt-val.txt"),
            ("https://drive.google.com/uc?id=10t6pHj9yR8Tp3kDvG0KMHl7Bt_TUbQ8W", "src-test.txt"),
            ("https://drive.google.com/uc?id=1FeGuiGuz0chVBRgePMu0pGJA4FVReA-b", "tgt-test.txt")
        ]
    data_path = 'USPTO_480k'
    os.makedirs(data_path, exist_ok=True)
    for url, name in USPTO_480k_links:
        target_path = os.path.join(data_path, name)
        if not os.path.exists(target_path):
            gdown.download(url, target_path, quiet=False)
        else:
            print(f"{target_path} already exists")

def canonicalize_smiles(smiles, verbose=False):  # returns '' if the SMILES is invalid
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        return Chem.MolToSmiles(mol)
    else:
        if verbose:
            print(f'{smiles} is invalid.')
        return ''


In [None]:
!rm -rf sample_data
download_data()

# Load the data <a class="anchor" id="second-bullet"></a>

Ideally, you would canonicalize all SMILES before training. To save time, we skip this step here and assume that the SMILES are already canonical; canonicalizing the full dataset could take ~20 minutes.

```python
line_count = !cat USPTO_480k/src-train.txt | wc -l
total = int(line_count[0])
with open('USPTO_480k/src-train.txt', 'r') as f:
    precursors_train = [canonicalize_smiles(line.strip().replace(' ', '')) for line in tqdm(f, total=total)]
with open('USPTO_480k/tgt-train.txt', 'r') as f:
    products_train = [canonicalize_smiles(line.strip().replace(' ', '')) for line in tqdm(f, total=total)]
    
line_count = !cat USPTO_480k/src-val.txt | wc -l
total = int(line_count[0])
with open('USPTO_480k/src-val.txt', 'r') as f:
    precursors_val = [canonicalize_smiles(line.strip().replace(' ', '')) for line in tqdm(f, total=total)]
with open('USPTO_480k/tgt-val.txt', 'r') as f:
    products_val = [canonicalize_smiles(line.strip().replace(' ', '')) for line in tqdm(f, total=total)]
    
line_count = !cat USPTO_480k/src-test.txt | wc -l
total = int(line_count[0])
with open('USPTO_480k/src-test.txt', 'r') as f:
    precursors_test = [canonicalize_smiles(line.strip().replace(' ', '')) for line in tqdm(f, total=total)]
with open('USPTO_480k/tgt-test.txt', 'r') as f:
    products_test = [canonicalize_smiles(line.strip().replace(' ', '')) for line in tqdm(f, total=total)]
```

Here we will simply read the data and load it into pandas dataframes:

In [None]:
with open('USPTO_480k/src-train.txt', 'r') as f:
    precursors_train = [line.strip().replace(' ', '') for line in f]
with open('USPTO_480k/tgt-train.txt', 'r') as f:
    products_train = [line.strip().replace(' ', '') for line in f]
with open('USPTO_480k/src-val.txt', 'r') as f:
    precursors_val = [line.strip().replace(' ', '') for line in f]
with open('USPTO_480k/tgt-val.txt', 'r') as f:
    products_val = [line.strip().replace(' ', '') for line in f]
with open('USPTO_480k/src-test.txt', 'r') as f:
    precursors_test = [line.strip().replace(' ', '') for line in f]
with open('USPTO_480k/tgt-test.txt', 'r') as f:
    products_test = [line.strip().replace(' ', '') for line in f]
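
The six nearly identical `with` blocks above could be folded into one helper. The following is a minimal sketch; `read_smiles_file` is a hypothetical name that is not part of the original notebook, and the demonstration runs on a small temporary file rather than on the USPTO data.

```python
import os
import tempfile


def read_smiles_file(path):
    """Read one SMILES string per line, dropping the spaces between tokens."""
    with open(path, "r") as f:
        return [line.strip().replace(" ", "") for line in f]


# Tiny demonstration on a temporary file holding two tokenized lines.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "src-demo.txt")
    with open(demo_path, "w") as f:
        f.write("C C O . C C ( = O ) O\nc 1 c c c c c 1\n")
    demo_smiles = read_smiles_file(demo_path)

print(demo_smiles)
```

In the notebook, the same helper could be called once per split, for example `read_smiles_file('USPTO_480k/src-train.txt')`.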

In [None]:
train_df = pd.DataFrame({'precursors': precursors_train, 'products': products_train})
print(f"The training set contains {train_df.shape[0]} reactions.")
train_df.head()

In [None]:
val_df = pd.DataFrame({'precursors': precursors_val, 'products': products_val})
print(f"The validation set contains {val_df.shape[0]} reactions.")
val_df.head()

In [None]:
test_df = pd.DataFrame({'precursors': precursors_test, 'products': products_test})
print(f"The test set contains {test_df.shape[0]} reactions.")
test_df.head()

## Sanity check - canonicalization
There is no universal standard for SMILES canonicalization, so the SMILES in the dataset may differ from the ones your RDKit installation generates. One potential reason is that the canonicalization algorithm changed between RDKit versions. So, always state the RDKit version that you are working with.

In [None]:
line_count = !cat USPTO_480k/src-val.txt | wc -l
total = int(line_count[0])+1
with open('USPTO_480k/src-val.txt', 'r') as f:
    can_precursors_val = [canonicalize_smiles(line.strip().replace(' ', '')) for line in tqdm(f, total=total)]

# print the first pair that differs after canonicalization, then stop
for smiles, can_smiles in zip(precursors_val, can_precursors_val):
    if smiles != can_smiles:
        print(smiles)
        print(can_smiles)
        break

# Tokenization <a class="anchor" id="third-bullet"></a>

To be able to train a language model, we need to split the strings into tokens.

We take the regex pattern introduced in the [Molecular Transformer](https://pubs.acs.org/doi/abs/10.1021/acscentsci.9b00576).

In [None]:
SMI_REGEX_PATTERN =  r"(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"

def smiles_tokenizer(smiles):
    """Split a SMILES string into space-separated tokens."""
    smiles_regex = re.compile(SMI_REGEX_PATTERN)
    tokens = smiles_regex.findall(smiles)
    return ' '.join(tokens)
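
Because `findall` silently skips characters that no alternative of the pattern matches, it can be worth checking that the tokens join back into the original string. The sketch below repeats the pattern so that it runs on its own; `tokenize_checked` is a hypothetical helper, not part of the Molecular Transformer code.

```python
import re

# Same pattern as SMI_REGEX_PATTERN above, repeated so the snippet is self-contained.
PATTERN = re.compile(
    r"(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize_checked(smiles):
    """Tokenize a SMILES string and fail loudly if any character was dropped."""
    tokens = PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Tokenization dropped characters from {smiles!r}")
    return " ".join(tokens)


example = tokenize_checked("CC(=O)Cl.[Na+].Brc1ccccc1")
print(example)
```

Two-letter elements such as `Cl` and `Br` and bracketed atoms such as `[Na+]` each stay a single token, which is what makes the vocabulary atom-wise.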

In [None]:
print('Tokenizing training set')
train_df['tokenized_precursors'] = train_df.precursors.progress_apply(lambda smi: smiles_tokenizer(smi))
train_df['tokenized_products'] = train_df.products.progress_apply(lambda smi: smiles_tokenizer(smi))
print('Tokenizing validation set')
val_df['tokenized_precursors'] = val_df.precursors.progress_apply(lambda smi: smiles_tokenizer(smi))
val_df['tokenized_products'] = val_df.products.progress_apply(lambda smi: smiles_tokenizer(smi))
print('Tokenizing test set')
test_df['tokenized_precursors'] = test_df.precursors.progress_apply(lambda smi: smiles_tokenizer(smi))
test_df['tokenized_products'] = test_df.products.progress_apply(lambda smi: smiles_tokenizer(smi))

## Save the preprocessed data set

Don't forget to shuffle the training set before saving it. At least earlier versions of OpenNMT-py would not shuffle it during preprocessing.

In [None]:
shuffled_train_df = train_df.sample(frac=1., random_state=42)
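
Shuffling a dataframe keeps each precursor on the same row as its product. If you work with plain lists instead, the two sides have to be shuffled together. A minimal sketch with placeholder strings (not real SMILES):

```python
import random

precursors = ["A1", "A2", "A3", "A4"]
products = ["B1", "B2", "B3", "B4"]

# Shuffle the (precursor, product) pairs, not the two lists independently,
# so that every source line still matches its target line afterwards.
pairs = list(zip(precursors, products))
random.Random(42).shuffle(pairs)
shuffled_precursors, shuffled_products = map(list, zip(*pairs))

for src, tgt in zip(shuffled_precursors, shuffled_products):
    print(src, tgt)
```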

In [None]:
data_path = 'USPTO_480k_preprocessed'

os.makedirs(data_path, exist_ok=True)
with open(os.path.join(data_path, 'precursors-train.txt'), 'w') as f:
    f.write('\n'.join(shuffled_train_df.tokenized_precursors.values))
with open(os.path.join(data_path, 'products-train.txt'), 'w') as f:
    f.write('\n'.join(shuffled_train_df.tokenized_products.values))

with open(os.path.join(data_path, 'precursors-val.txt'), 'w') as f:
    f.write('\n'.join(val_df.tokenized_precursors.values))
with open(os.path.join(data_path, 'products-val.txt'), 'w') as f:
    f.write('\n'.join(val_df.tokenized_products.values))
    
with open(os.path.join(data_path, 'precursors-test.txt'), 'w') as f:
    f.write('\n'.join(test_df.tokenized_precursors.values))
with open(os.path.join(data_path, 'products-test.txt'), 'w') as f:
    f.write('\n'.join(test_df.tokenized_products.values))

# Building the vocab <a class="anchor" id="fourth-bullet"></a>

The first step for the [OpenNMT-py pipeline](https://opennmt.net/OpenNMT-py/quickstart.html) is to build the vocabulary.

![](https://camo.githubusercontent.com/69fb11841ce1abd51a3fd7f3ed4b424857029ce123521cc301eb48a1e22bee2f/687474703a2f2f6f70656e6e6d742e6769746875622e696f2f73696d706c652d6174746e2e706e67)
</br><center>Figure 2: In contrast to a neural machine translation model for human language, we will use an atom-wise vocabulary. </center>


Please note:
- Typical sequence pairs in machine translation are much shorter than the ones you encounter in chemical reaction prediction. Hence, set `src_seq_length` and `tgt_seq_length` well above the longest tokenized sequence you expect, so that no reaction is filtered out.
- With `n_sample` set to `-1` we include the whole dataset.
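
To pick sequence length limits, it may help to measure the longest tokenized sequence first. The following is a small sketch on made-up tokenized lines; `max_token_count` is a hypothetical helper, and in the notebook you could pass `train_df.tokenized_precursors` and `train_df.tokenized_products` instead.

```python
def max_token_count(tokenized_lines):
    """Return the number of tokens in the longest space-separated line."""
    return max(len(line.split()) for line in tokenized_lines)


tokenized_precursors = ["C C O . C C ( = O ) O", "c 1 c c c c c 1 Br"]
tokenized_products = ["C C O C ( C ) = O"]

longest_src = max_token_count(tokenized_precursors)
longest_tgt = max_token_count(tokenized_products)
print(longest_src, longest_tgt)
```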

The paths to the training and validation datasets are defined in the `run_config.yaml`:

```yaml
# https://opennmt.net/OpenNMT-py/quickstart.html
# Examples in https://github.com/OpenNMT/OpenNMT-py/tree/master/config

## Where the samples will be written
save_data: example_run
## Where the vocab(s) will be written
src_vocab: example_run/uspto.vocab.src
tgt_vocab: example_run/uspto.vocab.src
# Allow overwriting existing files in the folder
overwrite: true
share_vocab: true

# Corpus opts:
data:
    corpus-1:
        path_src: USPTO_480k_preprocessed/precursors-train.txt
        path_tgt: USPTO_480k_preprocessed/products-train.txt
    valid:
        path_src: USPTO_480k_preprocessed/precursors-val.txt
        path_tgt: USPTO_480k_preprocessed/products-val.txt
```

As the source (precursors) and the target (products) are both represented as SMILES and consist of the same tokens, we share the vocabulary between source and target (`share_vocab: true`).

In [None]:
config_url = 'https://raw.githubusercontent.com/schwallergroup/dmds_language_models_for_reactions/main/example_run/run_config.yaml'
config_folder = 'example_run'
config_name = 'run_config.yaml'

os.makedirs(config_folder, exist_ok=True)
target_path = os.path.join(config_folder, config_name)
if not os.path.exists(target_path):
    gdown.download(config_url, target_path, quiet=False)
else:
    print(f"{target_path} already exists")

In [None]:
! onmt_build_vocab -config example_run/run_config.yaml \
    -src_seq_length 1000 -tgt_seq_length 1000 \
    -src_vocab_size 1000 -tgt_vocab_size 1000 \
    -n_sample -1

# Training the Molecular Transformer <a class="anchor" id="fifth-bullet"></a>

If you look at the `run_config.yaml`, you will see that we have defined some of the training parameters (but not yet the hyperparameters of the model).

```yaml
# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
save_model: example_run/model
save_checkpoint_steps: 5000
keep_checkpoint: 3
train_steps: 400000
valid_steps: 10000
report_every: 100

tensorboard: true
tensorboard_log_dir: log_dir
```

The Transformer architecture was published in the [Attention is all you need](https://proceedings.neurips.cc/paper/7181-attention-is-all-you-need) paper by Vaswani et al. (NeurIPS, 2017). The models in that paper (roughly 65M to 213M parameters) were larger than the one we use for reaction prediction (about 20M parameters).
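
As a rough plausibility check of the 20M figure, one can count the weight matrices of the attention and feed-forward blocks. The sketch below assumes the hyperparameters of the training command in this notebook (4 layers, hidden size 384, feed-forward size 2048) and deliberately ignores biases, layer normalization, and embeddings, so it is an estimate rather than an exact count.

```python
def transformer_weight_count(layers, d_model, d_ff):
    """Approximate number of weights in an encoder-decoder Transformer."""
    attention = 4 * d_model * d_model   # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff   # two linear layers
    encoder_layer = attention + feed_forward
    # a decoder layer has self-attention plus encoder-decoder attention
    decoder_layer = 2 * attention + feed_forward
    return layers * (encoder_layer + decoder_layer)


n_weights = transformer_weight_count(layers=4, d_model=384, d_ff=2048)
print(f"{n_weights / 1e6:.1f}M weights")
```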

![](https://raw.githubusercontent.com/nlp-with-transformers/notebooks/main/images/chapter01_self-attention.png)
</br><center>Figure 3: Transformer model (source: https://github.com/nlp-with-transformers). </center>

Illustrated transformer blogposts:
- https://nlp.seas.harvard.edu/2018/04/03/attention.html
- https://jalammar.github.io/illustrated-transformer/

In [None]:
# hyperparameters from https://github.com/rxn4chemistry/OpenNMT-py/tree/carbohydrate_transformer
!onmt_train -config example_run/run_config.yaml \
        -seed 42 -gpu_ranks 0  \
        -param_init 0 \
        -param_init_glorot -max_generator_batches 32 \
        -batch_type tokens -batch_size 6144 \
        -normalization tokens -max_grad_norm 0 -accum_count 4 \
        -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam  \
        -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
        -layers 4 -rnn_size  384 -word_vec_size 384 \
        -encoder_type transformer -decoder_type transformer \
        -dropout 0.1 -position_encoding -share_embeddings  \
        -global_attention general -global_attention_function softmax \
        -self_attn_type scaled-dot -heads 8 -transformer_ff 2048 \
        -tensorboard True -tensorboard_log_dir log_dir
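
The options `-decay_method noam`, `-warmup_steps 8000` and `-learning_rate 2` select the learning rate schedule from the Transformer paper: a linear warm-up followed by an inverse square root decay. The sketch below writes out that formula, assuming that OpenNMT-py scales it by `learning_rate` and uses `rnn_size` as the model dimension; `noam_learning_rate` is a hypothetical name for illustration.

```python
def noam_learning_rate(step, learning_rate=2.0, d_model=384, warmup_steps=8000):
    """Learning rate at a given training step (step >= 1) under Noam decay."""
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * d_model ** -0.5 * scale


for step in (1000, 8000, 100000, 400000):
    print(step, f"{noam_learning_rate(step):.2e}")
```

Under these assumptions the learning rate peaks at the end of the warm-up, which is why the nominal value of 2 is much larger than the rate the optimizer ever uses.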

The training can take more than 24 hours on a single GPU. Hence, we will download the trained model.

In [None]:
trained_model_url = 'https://drive.google.com/uc?id=1ywJCJHunoPTB5wr6KdZ8aLv7tMFMBHNy'
model_folder = 'models'
model_name = 'USPTO480k_model_step_400000.pt'

os.makedirs(model_folder, exist_ok=True)
target_path = os.path.join(model_folder, model_name)
if not os.path.exists(target_path):
    gdown.download(trained_model_url, target_path, quiet=False)
else:
    print(f"{target_path} already exists")

# Evaluating the model <a class="anchor" id="sixth-bullet"></a>

In [None]:
!mkdir -p results
!onmt_translate -model models/USPTO480k_model_step_400000.pt -gpu 0 \
    --src USPTO_480k_preprocessed/precursors-val.txt \
    --tgt USPTO_480k_preprocessed/products-val.txt \
    --output results/USPTO480k_model_step_400000_val_predictions.txt \
    --n_best 5 --beam_size 10 --max_length 300 --batch_size 64

In [None]:
# https://github.com/pschwllr/MolecularTransformer/blob/master/score_predictions.py

n_best = 5  # onmt_translate was run with --n_best 5
predictions = [[] for _ in range(n_best)]

with open('USPTO_480k_preprocessed/products-val.txt', 'r') as f:
    targets = [line.strip().replace(' ', '') for line in f]

evaluation_df = pd.DataFrame(targets)
evaluation_df.columns = ['target']

with open('USPTO_480k_preprocessed/precursors-val.txt', 'r') as f:
    precursors = [line.strip().replace(' ', '') for line in f]
evaluation_df['precursors'] = precursors

total = len(evaluation_df)

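# Precomputed predictions; if you ran onmt_translate above, read its output file
# (results/USPTO480k_model_step_400000_val_predictions.txt) instead.
with open('results/USPTO480k_model_step_400000_val_predictions_precomputed.txt', 'r') as f:
    # line i belongs to reaction i // n_best and has rank i % n_best + 1
    for i, line in enumerate(f):
        predictions[i % n_best].append(''.join(line.strip().split(' ')))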
for i, preds in enumerate(predictions):
    evaluation_df['prediction_{}'.format(i + 1)] = preds
    evaluation_df['canonical_prediction_{}'.format(i + 1)] = evaluation_df['prediction_{}'.format(i + 1)].progress_apply(
        lambda x: canonicalize_smiles(x))
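
The modulo indexing above relies on the prediction file listing the `n_best` candidates of each reaction on consecutive lines. A toy sketch of that de-interleaving step, with placeholder strings and `n_best = 3`:

```python
n_best = 3
# Two reactions (a and b) with three ranked candidates each, as consecutive lines.
lines = ["a1", "a2", "a3", "b1", "b2", "b3"]

# A line count that is not a multiple of n_best hints at a truncated file.
assert len(lines) % n_best == 0

predictions = [[] for _ in range(n_best)]
for i, line in enumerate(lines):
    predictions[i % n_best].append(line)

print(predictions)
```

Each inner list then holds one rank across all reactions, which is the shape needed to fill one dataframe column per rank.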

In [None]:
def get_rank(row, col_name, max_rank):
    for i in range(1, max_rank+1):
        if row['target'] == row['{}{}'.format(col_name, i)]:
            return i
    return 0
evaluation_df['rank'] = evaluation_df.progress_apply(lambda row: get_rank(row, 'canonical_prediction_', n_best), axis=1)

correct = 0

for i in range(1, n_best+1):
    correct += (evaluation_df['rank'] == i).sum()
    invalid_smiles = (evaluation_df['canonical_prediction_{}'.format(i)] == '').sum()
    
    print('Top-{}: {:.1f}% || Invalid SMILES {:.2f}%'.format(i, correct/total*100,
                                                                 invalid_smiles/total*100))
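
The top-k accuracy is cumulative: a reaction counts as correct at k if the target appears anywhere among the first k predictions. A pure-Python sketch on made-up data; `rank_of_target` and `top_k_accuracies` are hypothetical helpers that mirror the logic above without pandas.

```python
def rank_of_target(target, predictions):
    """Return the 1-based rank of the target among the predictions, or 0 if absent."""
    for rank, prediction in enumerate(predictions, start=1):
        if prediction == target:
            return rank
    return 0


def top_k_accuracies(targets, all_predictions, n_best):
    """Return the cumulative top-1 ... top-n_best accuracies as fractions."""
    ranks = [rank_of_target(t, p) for t, p in zip(targets, all_predictions)]
    correct = 0
    accuracies = []
    for k in range(1, n_best + 1):
        correct += ranks.count(k)
        accuracies.append(correct / len(targets))
    return accuracies


targets = ["CCO", "CCN", "CCC", "CCCl"]
all_predictions = [
    ["CCO", "CO"],    # correct at rank 1
    ["CN", "CCN"],    # correct at rank 2
    ["CC", ""],       # never correct; '' marks an invalid SMILES
    ["CCCl", "CCl"],  # correct at rank 1
]
accuracies = top_k_accuracies(targets, all_predictions, n_best=2)
print(accuracies)
```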
    

## Let's draw some of the reactions 

In [None]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import rdChemReactions
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import SVG, display


# https://gist.github.com/greglandrum/61c1e751b453c623838759609dc41ef1
def draw_chemical_reaction(smiles,useSmiles=True,  highlightByReactant=False, notesAtomMaps=True, font_scale=1.5):
    rxn = rdChemReactions.ReactionFromSmarts(smiles,useSmiles=useSmiles)
    trxn = rdChemReactions.ChemicalReaction(rxn)
    # move atom maps to be annotations:
    if notesAtomMaps:
        for m in trxn.GetReactants():
            moveAtomMapsToNotes(m)
        for m in trxn.GetProducts():
            moveAtomMapsToNotes(m)
    d2d = rdMolDraw2D.MolDraw2DSVG(800,300)
    d2d.drawOptions().annotationFontScale=font_scale
    d2d.DrawReaction(trxn,highlightByReactant=highlightByReactant)
    d2d.FinishDrawing()

    return d2d.GetDrawingText()

def moveAtomMapsToNotes(m):
    for at in m.GetAtoms():
        if at.GetAtomMapNum():
            at.SetProp("atomNote",str(at.GetAtomMapNum()))


In [None]:
for i, row in evaluation_df[evaluation_df['rank']==1].sample(5, random_state=1).iterrows():
    rxn_smiles = f"{row['precursors']}>>{row['canonical_prediction_1']}"
    display(SVG(draw_chemical_reaction(rxn_smiles)))
    print(rxn_smiles)

In [None]:
for i, row in evaluation_df[evaluation_df['rank']==0].sample(5, random_state=1).iterrows():
    rxn_smiles = f"{row['precursors']}>>{row['target']}.{row['canonical_prediction_1']}"
    display(SVG(draw_chemical_reaction(rxn_smiles)))
    print(rxn_smiles)

# Improvements to the chemical reaction language models <a class="anchor" id="seventh-bullet"></a>

One improvement over the plain Molecular Transformer that has been explored in the past is data augmentation:
- [Molecular Transformer](https://pubs.acs.org/doi/abs/10.1021/acscentsci.9b00576) -> one non-canonical copy of each set of precursors
- [Augmented Molecular Transformer](https://www.nature.com/articles/s41467-020-19266-y) -> extensive data augmentation on precursors and products sides

If you have a small dataset of more challenging reactions you could use transfer learning, as we explored in [Transfer learning enables the molecular transformer to predict regio- and stereoselective reactions on carbohydrates](https://www.nature.com/articles/s41467-020-18671-7).

## Data augmentations for reaction SMILES


In [None]:
# What if now we wanted to do some data augmentation on the training set

def randomize_smiles(smiles, random_type="rotated"):
    """
    Return a randomized SMILES string for the molecule given as SMILES.

    Source: https://github.com/rxn4chemistry/rxn_yields/blob/master/nbs/06_data_augmentation.ipynb
    Inspired by https://github.com/undeadpixel/reinvent-randomized and https://github.com/GLambard/SMILES-X

    :param smiles: SMILES string of the molecule
    :param random_type: The type (unrestricted, restricted, rotated) of randomization performed.
    :return: A random SMILES string of the same molecule or None if the molecule is invalid.
    """
    mol = Chem.MolFromSmiles(smiles)
    if not mol:
        print(f"{smiles} not valid.")
        return None

    if random_type == "unrestricted":
        return Chem.MolToSmiles(mol, canonical=False, doRandom=True, isomericSmiles=True)
    elif random_type == "restricted":
        new_atom_order = list(range(mol.GetNumAtoms()))
        random.shuffle(new_atom_order)
        random_mol = Chem.RenumberAtoms(mol, newOrder=new_atom_order)
        return Chem.MolToSmiles(random_mol, canonical=False, isomericSmiles=True)
    elif random_type == 'rotated':
        n_atoms = mol.GetNumAtoms()
        rotation_index = random.randint(0, n_atoms-1)
        atoms = list(range(n_atoms))
        new_atoms_order = (atoms[rotation_index%len(atoms):]+atoms[:rotation_index%len(atoms)])
        rotated_mol = Chem.RenumberAtoms(mol,new_atoms_order)
        return Chem.MolToSmiles(rotated_mol, canonical=False, isomericSmiles=True)
    raise ValueError("Type '{}' is not valid".format(random_type))
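
The `rotated` mode only changes which atom comes first; the cyclic order of the atom indices is preserved. The sketch below isolates that list rotation without RDKit, using a hypothetical helper `rotate_atom_order`. It also shows why a molecule with `n` atoms admits at most `n` distinct rotations, which limits how many different SMILES this mode can produce.

```python
def rotate_atom_order(n_atoms, rotation_index):
    """Return the atom indices 0..n_atoms-1, cyclically shifted to start at rotation_index."""
    atoms = list(range(n_atoms))
    return atoms[rotation_index:] + atoms[:rotation_index]


# All rotations of a five-atom molecule.
orders = [rotate_atom_order(5, k) for k in range(5)]
for order in orders:
    print(order)
```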

In [None]:
example_smi = 'O=C1C2=C(N=CN2C)N(C(=O)N1C)C'
mol = Chem.MolFromSmiles(example_smi)
print(f"The canonical SMILES of this caffeine molecule is: {Chem.MolToSmiles(mol)}")
mol

In [None]:
# different starting atom
rotated_random_smiles = []
for i in range (500):
    rotated_random_smiles.append(randomize_smiles(example_smi))
print(len(set(rotated_random_smiles)))
set(rotated_random_smiles)

In [None]:
restricted_random_smiles = []
for i in range (500):
    restricted_random_smiles.append(randomize_smiles(example_smi, 'restricted'))
print(len(set(restricted_random_smiles)))
list(set(restricted_random_smiles))[:5]

In [None]:
unrestricted_random_smiles = []
for i in range (10000):
    unrestricted_random_smiles.append(randomize_smiles(example_smi, random_type='unrestricted'))
print(len(set(unrestricted_random_smiles)))
list(set(unrestricted_random_smiles))[:5]

In [None]:
recanonicalised_smiles = set([Chem.MolToSmiles(Chem.MolFromSmiles(smiles)) for smiles in unrestricted_random_smiles])
assert len(recanonicalised_smiles) == 1
recanonicalised_smiles

In [None]:
# can_smiles is the precursor string left over from the sanity-check loop above
for i in range(5):
    print(randomize_smiles(can_smiles))

In [None]:
# we will include a rotated copy of all the training reactions

rotated_train_precursors = [randomize_smiles(precursors) for precursors in tqdm(train_df.precursors)]

In [None]:
rotated_train_df = pd.DataFrame({'precursors': rotated_train_precursors, 'products': products_train})
total_train_df = pd.concat([train_df, rotated_train_df])
total_train_df.shape


In [None]:
print('Tokenizing training set')
total_train_df['tokenized_precursors'] = total_train_df.precursors.apply(lambda smi: smiles_tokenizer(smi))
total_train_df['tokenized_products'] = total_train_df.products.apply(lambda smi: smiles_tokenizer(smi))
print('Tokenizing validation set')
val_df['tokenized_precursors'] = val_df.precursors.apply(lambda smi: smiles_tokenizer(smi))
val_df['tokenized_products'] = val_df.products.apply(lambda smi: smiles_tokenizer(smi))
print('Tokenizing test set')
test_df['tokenized_precursors'] = test_df.precursors.apply(lambda smi: smiles_tokenizer(smi))
test_df['tokenized_products'] = test_df.products.apply(lambda smi: smiles_tokenizer(smi))

In [None]:
# remember to shuffle your training data :)

shuffled_total_train_df = total_train_df.sample(frac=1., random_state=42)

In [None]:
shuffled_total_train_df.head()

In [None]:
data_path = 'USPTO_480k_augm_preprocessed'

os.makedirs(data_path, exist_ok=True)
with open(os.path.join(data_path, 'precursors-train.txt'), 'w') as f:
    f.write('\n'.join(shuffled_total_train_df.tokenized_precursors.values))
with open(os.path.join(data_path, 'products-train.txt'), 'w') as f:
    f.write('\n'.join(shuffled_total_train_df.tokenized_products.values))

with open(os.path.join(data_path, 'precursors-val.txt'), 'w') as f:
    f.write('\n'.join(val_df.tokenized_precursors.values))
with open(os.path.join(data_path, 'products-val.txt'), 'w') as f:
    f.write('\n'.join(val_df.tokenized_products.values))
    
with open(os.path.join(data_path, 'precursors-test.txt'), 'w') as f:
    f.write('\n'.join(test_df.tokenized_precursors.values))
with open(os.path.join(data_path, 'products-test.txt'), 'w') as f:
    f.write('\n'.join(test_df.tokenized_products.values))

### Build vocab, train, and test

Start by writing an `example_run/run_config_augm.yaml` file.
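
One way to derive the data section of such a file is to reuse the original one and point the paths at the augmented folder. This is only a sketch on a string copy of the `data` section shown earlier; whether `save_data`, the vocabulary paths, and `save_model` should also change is left to you, and nothing here writes to disk.

```python
# Data section copied from the run_config.yaml shown earlier in this notebook.
base_data_section = """\
data:
    corpus-1:
        path_src: USPTO_480k_preprocessed/precursors-train.txt
        path_tgt: USPTO_480k_preprocessed/products-train.txt
    valid:
        path_src: USPTO_480k_preprocessed/precursors-val.txt
        path_tgt: USPTO_480k_preprocessed/products-val.txt
"""

# Point every path at the folder that holds the augmented, tokenized files.
augm_data_section = base_data_section.replace(
    "USPTO_480k_preprocessed/", "USPTO_480k_augm_preprocessed/"
)
print(augm_data_section)
```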

In [None]:
! onmt_build_vocab -config example_run/run_config_augm.yaml ...

In [None]:
! onmt_train ... 

# Further steps <a class="anchor" id="eighth-bullet"></a>

## RXN for Chemistry
You can access all the trained models from [RXN for Chemistry](https://rxn.res.ibm.com) through the rxn4chemistry Python API:
https://github.com/rxn4chemistry/rxn4chemistry

There are examples in:
https://github.com/rxn4chemistry/rxn4chemistry/tree/master/examples



## RXNFP and DRFP -> chemical reaction fingerprints
- Data-driven reaction fingerprint: https://github.com/rxn4chemistry/rxnfp with a tutorial at https://rxn4chemistry.github.io/rxnfp/
- Engineered reaction fingerprint: https://github.com/reymond-group/drfp with great examples in https://github.com/reymond-group/drfp/tree/main/notebooks

## Atom-mapping 
When Transformers are trained on large datasets of unlabelled reactions represented as SMILES, they learn how atoms rearrange during chemical reactions. We used this signal to build [RXNMapper](http://rxnmapper.ai). The code can be found in: https://github.com/rxn4chemistry/rxnmapper

If you just want to play with the demo:
http://rxnmapper.ai/demo.html?rxn=CC(C)S.CN(C)C%253DO.Fc1cccnc1F.O%253DC(%255BO-%255D)%255BO-%255D.%255BK%252B%255D.%255BK%252B%255D%253E%253ECC(C)Sc1ncccc1F&selectedLayer=10&selectedHead=5&selectedTokenSide=null&selectedTokenInd=null


# Publications <a class="anchor" id="ninth-bullet"></a>
### Reaction prediction
- [“Found in Translation”: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models](https://pubs.rsc.org/en/content/articlehtml/2018/sc/c8sc02339e) 
- [Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction](https://pubs.acs.org/doi/abs/10.1021/acscentsci.9b00576)
- [Transfer learning enables the molecular transformer to predict regio- and stereoselective reactions on carbohydrates](https://www.nature.com/articles/s41467-020-18671-7)

### Retrosynthesis
- [Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy](https://pubs.rsc.org/en/content/articlehtml/2020/sc/c9sc05704h)

### Reaction fingerprints
- [Mapping the space of chemical reactions using attention-based neural networks](http://rdcu.be/cenmd)
- [Reaction classification and yield prediction using the differential reaction fingerprint DRFP](https://pubs.rsc.org/en/content/articlehtml/2022/dd/d1dd00006c)

### Yield prediction
- [Prediction of chemical reaction yields using deep learning](https://iopscience.iop.org/article/10.1088/2632-2153/abc81d/meta)
- [Data augmentation strategies to improve reaction yield predictions and estimate uncertainty](https://chemrxiv.org/engage/chemrxiv/article-details/60c75258702a9b726c18c101)

### Atom-mapping
- [Extraction of organic chemistry grammar from unsupervised learning of chemical reactions](https://www.science.org/doi/10.1126/sciadv.abe4166)

### Extensive review
- [Machine Intelligence for Chemical Reaction Space](https://wires.onlinelibrary.wiley.com/doi/full/10.1002/wcms.1604)
