# Demo of generating TS geometries from SMILES

In this demo, we use a force field to prepare the input for TS-GCN. The requirements are:
- the SMILES of all reactants and products involved in the reaction
- the reaction must belong to an RMG reaction family

Because both the conformer embedding and the graph neural network introduce randomness, you may want to run this notebook several times if the initial results are not satisfactory.

NOTE: XYZ input is also supported.
- For A = B and A = B + C (and A = B + C + D, if RMG has such a family), the reactant XYZ is used for TS-GCN and also serves as the template for generating the geometry of the product complex. The product XYZs are used only for molecule identification.
- For A + B = C + D, the XYZs of all input species are used only for molecule identification and are not used for TS-GCN.

Some of the code is adapted from https://github.com/ReactionMechanismGenerator/TS-GCN

In [2]:
import os
import sys
import subprocess

# Add RDMC to the Python path in case you have not done so already
sys.path.append(os.path.dirname(os.path.abspath('')))

from rdkit import Chem
from rdkit.Chem import rdMolAlign  # ensures Chem.rdMolAlign is loaded for the alignment step
from rdmc.mol import RDKitMol
from rdmc.view import grid_viewer, mol_viewer
from rdmc.forcefield import RDKitFF, OpenBabelFF
try:
    # import RMG dependencies
    from rdmc.external.rmg import (from_rdkit_mol,
                                   find_reaction_family,
                                   generate_reaction_complex,
                                   load_rmg_database,
                                   )
    # Load RMG database
    database = load_rmg_database()
except (ImportError, ModuleNotFoundError):
    print('You need to install RMG-Py first and run this IPYNB in rmg_env!')

try:
    # Openbabel 3
    from openbabel import openbabel as ob
    print('Using Openbabel 3...')
except ImportError:
    # Openbabel 2
    import openbabel as ob
    print('Using Openbabel 2...')


def parse_xyz_or_smiles(identifier, **kwargs):
    # Try to parse the identifier as XYZ first; fall back to SMILES
    try:
        return RDKitMol.FromXYZ(identifier, **kwargs)
    except Exception:
        mol = RDKitMol.FromSmiles(identifier)
        mol.EmbedConformer()
        return mol

# Experimental feature
# Ideally, each reaction family has its own best-performing value;
# the author is still searching for those numbers.
# The values recorded below are for reference only.
BOND_CONSTRAINT = {'1,3_Insertion_ROR': 2.5,
                   'Retroene': 2.5,
                   '1,2_Insertion': 2.5,
                   '2+2_cycloaddition_Cd': 2.5,
                   'Diels_alder_addition': 3.0,
                   'H_Abstraction': 3.0,}

%load_ext autoreload
%autoreload 2



Using Openbabel 3...
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### 1. Input molecule information
You can input SMILES, XYZs, or a mix of both. Molecule instances are then generated from the input identifiers.<br>
**RECOMMENDATIONS:**
- **Define the single-species side of the reaction as the reactant.**
- **Put the heavier product first in the list.**

Some examples are provided below.

#### 1.1: Intra H migration (A = B)

In [2]:
reactants = ["""CCC[O]""",
]

products = ["""C[CH]CO""",
]

#### 1.2: Intra_R_Add_Endocyclic (A = B)

In [3]:
reactants = ["""C=CCCO[O]""",
]

products = ["""[CH2]C1CCOO1""",
]

#### 1.3: ketoenol (A = B)
An example using XYZ input.

In [4]:
reactants = ["""O 0.898799  1.722422  0.70012
C 0.293754  -0.475947  -0.083092
C -1.182804  -0.101736  -0.000207
C 1.238805  0.627529  0.330521
H 0.527921  -1.348663  0.542462
H 0.58037  -0.777872  -1.100185
H -1.45745  0.17725  1.018899
H -1.813437  -0.937615  -0.310796
H -1.404454  0.753989  -0.640868
H 2.318497  0.360641  0.272256""",
    ]

products = ["""O 2.136128  0.058786  -0.999372
C -1.347448  0.039725  0.510465
C 0.116046  -0.220125  0.294405
C 0.810093  0.253091  -0.73937
H -1.530204  0.552623  1.461378
H -1.761309  0.662825  -0.286624
H -1.923334  -0.892154  0.536088
H 0.627132  -0.833978  1.035748
H 0.359144  0.869454  -1.510183
H 2.513751  -0.490247  -0.302535"""]

#### 1.4: Retroene (A = B + C)

In [5]:
reactants = [
"""CCC1C=CC=C1""",
]

products = [
"""C1C=CC=C1""",

"""C=C""",
]

#### 1.5: HO2 Addition (A = B + C)

In [6]:
reactants = [
"""C -1.890664  -0.709255  -0.271996
C -0.601182  0.078056  -0.018811
C 0.586457  -0.545096  -0.777924
C -0.292203  0.188974  1.451901
H -0.683164  -0.56844  2.124827
C 0.477032  1.332664  2.012529
O -0.367239  2.493656  2.288335
O -0.679966  1.393013  -0.618968
O -1.811606  2.119506  -0.074789
H -1.819659  -1.711353  0.159844
H -2.063907  -0.801665  -1.346104
H -2.739557  -0.190076  0.171835
H 0.374452  -0.548385  -1.849706
H 1.501209  0.026135  -0.608139
H 0.747239  -1.572318  -0.444379
H 1.209047  1.707778  1.296557
H 0.998836  1.047896  2.931789
H -0.994076  2.235514  2.974109
H -1.392774  2.537261  0.704151"""
]

products = [
"""C -1.395681  1.528483  -0.00216
C -0.402668  0.411601  -0.210813
C -0.997629  -0.972081  -0.127641
C 0.890607  0.678979  -0.433435
C 2.015631  -0.28316  -0.676721
O 2.741986  0.043989  -1.867415
H -0.923699  2.509933  -0.072949
H -2.200649  1.479183  -0.744922
H -1.873843  1.44886  0.981238
H -1.839799  -1.068706  -0.822233
H -0.283424  -1.765173  -0.346167
H -1.400492  -1.154354  0.875459
H 1.201336  1.7219  -0.466637
H 2.754241  -0.212398  0.127575
H 1.667906  -1.32225  -0.7073
H 2.101868  0.079395  -2.5857""",

"""O -0.168488  0.443026  0.0
O 1.006323  -0.176508  0.0
H -0.837834  -0.266518  0.0""",
]

#### 1.6: cycloaddition (A = B + C)

In [7]:
reactants = [
"""O -0.854577  1.055663  -0.58206
O 0.549424  1.357531  -0.196886
C -0.727718  -0.273028  -0.011573
C 0.76774  -0.043476  0.113736
H -1.066903  -1.044054  -0.706048
H -1.263435  -0.349651  0.939354
H 1.374762  -0.530738  -0.655177
H 1.220707  -0.172248  1.098653"""
           ]

products = [
"""O 0.0  0.0  0.682161
C 0.0  0.0  -0.517771
H 0.0  0.938619  -1.110195
H 0.0  -0.938619  -1.110195""",

"""O 0.0  0.0  0.682161
C 0.0  0.0  -0.517771
H 0.0  0.938619  -1.110195
H 0.0  -0.938619  -1.110195""",
]

#### 1.7: Disproportionation

In [8]:
reactants = [
"""[OH]""",
"""CC(C)=C(C)C""",]            

products = [
"""O""",
"""[CH2]C(C)=C(C)C""",]

#### [TEST]

In [9]:
reactants = [
"""C=CC(C)C1C=CCC1""",
]            

products = [
"""C1=CC(C2C=CCC2)C=C1""",
"""C1C=CC=C1""",]

In [9]:
with open('diene.xyz') as f:
    diene = f.read()
with open('dienophile.xyz') as f:
    dienophile = f.read()
reactants = ['\n'.join(diene.splitlines()[2:]),
             '\n'.join(dienophile.splitlines()[2:])]
products = ['C1=CC2CC1C1CC3C4C=CC(C4)C3C21']

## 2. Find RMG reaction and generate reactant/product complex

#### [OPTIONAL] XYZ perception arguments
- `backends`: the backends used for XYZ perception, tried in the listed order. This has no effect when the inputs are SMILES. Previously, `openbabel` XYZ perception was preferred over `jensen`.
- `header`: An XYZ file normally starts with a line giving the number of atoms and a line for a title/comment. If your strings do not contain these two lines, set `header` to `False`.
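To illustrate what the `header` option refers to, here is a minimal, standalone sketch in plain Python. The helper names `strip_xyz_header` and `has_xyz_header` are hypothetical and not part of RDMC; the sketch only assumes the usual XYZ layout of an atom-count line, a title/comment line, and then one record per atom.

```python
def strip_xyz_header(xyz: str) -> str:
    """Return only the atom records of an XYZ string.

    Assumes the standard layout: line 1 is the atom count, line 2 is a
    title/comment, and the remaining lines are ``symbol x y z`` records.
    """
    lines = xyz.strip().splitlines()
    return "\n".join(lines[2:])


def has_xyz_header(xyz: str) -> bool:
    """Heuristic (hypothetical helper): headed if the first line is an integer."""
    lines = xyz.strip().splitlines()
    first_line = lines[0].strip() if lines else ""
    return first_line.isdigit()


water_with_header = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469"""

water_body = strip_xyz_header(water_with_header)
print(has_xyz_header(water_with_header))  # the atom-count line is present
print(has_xyz_header(water_body))         # only atom records remain
```

A string like `water_body` corresponds to the `header = False` case, which is also the form used by the XYZ examples in section 1.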

In [10]:
############### XYZ Perception ################
# Backend perception algorithm
backends = ['openbabel', 'jensen']
# Whether each XYZ string includes the first two lines (atom number + title/comments)
header = False
################################################

Check whether this reaction matches the RMG templates. If the reaction matches at least one RMG family, the result is shown and the complexes are generated. Otherwise, this notebook cannot be applied to the reaction.

In [11]:
# For A = B + C reactions, it is better to treat A as the reactant
if len(reactants) == 2 and len(products) == 1:
    reactants, products = products, reactants

# Generate reactant and product complex
for backend in backends:
    print(f'Using \"{backend}\" method as the XYZ perception backend.')
    try:
        # Convert XYZ or SMILES to RDKitMol, honoring the `header` setting above
        reactants_rdkit = [parse_xyz_or_smiles(reactant, backend=backend, header=header) for reactant in reactants]
        products_rdkit = [parse_xyz_or_smiles(product, backend=backend, header=header) for product in products]

        # Convert rdkit mol to RMG mol
        reactant_molecules = [from_rdkit_mol(r.ToRWMol()) for r in reactants_rdkit]
        product_molecules = [from_rdkit_mol(p.ToRWMol()) for p in products_rdkit]

    except Exception as e:
        print(e)
        print(f'Cannot generate molecule instances using {backend}...')
        continue

    else:
        # A product complex with the same atom indexing as the reactant is generated
        family_label, _ = find_reaction_family(database, reactant_molecules,
                                               product_molecules, verbose=False)
        r_complex, p_complex = generate_reaction_complex(database,
                                                 reactant_molecules,
                                                 product_molecules,
                                                 verbose=True)
    if not r_complex:
        continue

    try:
        # r_mol and p_mol are the complexes in RDKitMol form;
        # r_complex and p_complex are their RMG molecule forms
        r_mol = RDKitMol.FromRMGMol(r_complex)
        p_mol = RDKitMol.FromRMGMol(p_complex)
    except Exception as e:
        # There can be some problem converting RMG mol back to RDKit
        print(e)
        continue
    else:
        # Find formed and broken bonds
        r_bonds = [{bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()} for bond in r_mol.GetBonds()]
        p_bonds = [{bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()} for bond in p_mol.GetBonds()]
        formed_bonds = [bond for bond in p_bonds if bond not in r_bonds]
        broken_bonds = [bond for bond in r_bonds if bond not in p_bonds]
        print(f'Following bonds are formed in the reaction: {formed_bonds}')
        print(f'Following bonds are broken in the reaction: {broken_bonds}')
        break

else:
    print('No matched RMG reaction is found for the given reactants and products.')

Using "jensen" method as the XYZ perception backend.
C1=CC2C3C=CC(C3)C2C1 + C1=CCC=C1 <=> C1=CC2CC1C1CC3C4C=CC(C4)C3C21
RMG family: Diels_alder_addition
Is forward reaction: False
Following bonds are formed in the reaction: []
Following bonds are broken in the reaction: [{2, 14}, {4, 5}]
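
The formed/broken bond detection in the cell above is essentially a set difference between the bond lists of the two complexes. The following standalone sketch shows the same idea without RDKit; the function name `bond_changes` and the toy bond lists are illustrative assumptions, not part of the notebook's workflow.

```python
def bond_changes(reactant_bonds, product_bonds):
    """Return (formed, broken) bonds as sorted lists of atom-index pairs.

    Each bond is an unordered pair of atom indices, so frozensets are used
    to make (1, 2) and (2, 1) compare equal and to allow set arithmetic.
    """
    r_set = {frozenset(bond) for bond in reactant_bonds}
    p_set = {frozenset(bond) for bond in product_bonds}
    formed = sorted(tuple(sorted(bond)) for bond in p_set - r_set)
    broken = sorted(tuple(sorted(bond)) for bond in r_set - p_set)
    return formed, broken


# Hypothetical 1,2-H shift: the H atom (index 3) moves from atom 0 to atom 1
reactant_bonds = [(0, 1), (1, 2), (0, 3)]
product_bonds = [(1, 0), (1, 2), (1, 3)]
formed, broken = bond_changes(reactant_bonds, product_bonds)
print(f"formed: {formed}, broken: {broken}")
```

Because both complexes share the same atom indexing, comparing index pairs is sufficient; no substructure matching is needed at this stage.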


## 3. Complex geometry generation

### Force field arguments

In [19]:
############### Force Field ###################
# Distance at which the constrained (reacting) bonds are fixed
bond_constraint_factor = 3.0 #BOND_CONSTRAINT.get(family_label, 3.0) # Unit: Angstrom

# Force Field
force_field_type = "MMFF94s"
# Convergence criteria
tol = 1e-8
# Step size
step = 5
# Max step
max_step = 10000
###############################################

def optimize_mol(mol, frozen_bonds=()):
    """Optimize a copy of `mol` with distance constraints on `frozen_bonds`."""
    mol_copy = mol.Copy()
    try:
        # Try the RDKit force field first: it is faster and generally more
        # robust, but it supports fewer atom types
        ff = RDKitFF(force_field_type)
        # With the RDKit force field, set up first and then add constraints
        ff.setup(mol_copy, ignore_interfrag_interactions=False)
        for frozen_bond in frozen_bonds:
            ff.add_distance_constraint(frozen_bond, bond_constraint_factor)
    except NotImplementedError as e:
        print(e)
        # This usually means the RDKit force field cannot handle the molecule,
        # so fall back to the Open Babel force field
        ff = OpenBabelFF(force_field_type)
        # Providing the molecule helps reconcile atom-index differences
        # between the RDKit Mol and the Open Babel Mol
        ff.mol = mol_copy
        # With the Open Babel force field, add constraints first and then set up
        for frozen_bond in frozen_bonds:
            ff.add_distance_constraint(frozen_bond, bond_constraint_factor)
        ff.setup()
    # Optimize only after one of the force fields has been set up successfully
    ff.optimize(max_step=max_step, tol=tol, step_per_iter=step)
    return ff.get_optimized_mol()

### Generate reactant complex geometry
- For A = B and A = B + C, if an XYZ is available, the reactant complex uses exactly the input geometry; otherwise, an RDKit-embedded geometry is used.
- For A + B = C + D, the reactant complex is embedded by RDKit and optimized with a force field.

In [20]:
if len(reactants) == 1 and r_mol.GetNumConformers() > 0:
    # XYZ inputs
    r_mol = reactants_rdkit[0]
        
elif len(reactants) in [1, 2]:

    # There is no 3D information, originally generated from a SMILES
    r_mol.EmbedConformer()
    r_mol = optimize_mol(r_mol, frozen_bonds=formed_bonds)
    
if r_mol:
    print('\nThe generated reactant complex:')
    viewer = mol_viewer(r_mol.ToMolBlock(), 'sdf')
    viewer.show()


The generated reactant complex:


### Generate product complex geometry
Use the reactant atom coordinates as the initial guess for the product complex geometry, then optimize the geometry with a force field.

In [21]:
try:
    p_mol.SetPositions(r_mol.GetPositions())
except ValueError as e:
    # Very occasionally, RDKit cannot embed a molecule
    # even though the molecule itself looks fine.
    # A workaround (not implemented here) is to generate an Open Babel
    # molecule first and convert it to RDKitMol.
    print(e)

p_combine = optimize_mol(p_mol, frozen_bonds=broken_bonds)

if p_combine:
    print('\nThe generated product complex:')
    viewer = mol_viewer(p_combine.ToMolBlock(), 'sdf')
    viewer.show()


The generated product complex:


### Find the best atom mapping by RMSD
At this point, all heavy atoms are mapped, but some H atoms may no longer be mapped, for example because a methyl rotor rotated during the optimization. This step is recommended but not required.

NOTE:
1. This can perform relatively poorly if the reactant and the product have different stereochemistry (cis/trans) or if most rotors are oriented very differently. However, the previous step (matching according to the RMG reaction) ensures that all heavy atoms and reacting H atoms are consistent, so only the less important H atoms are affected.
2. `AlignMol` can yield incorrect values, so `GetBestRMS` and `CalcRMS` are used instead.
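
The selection logic of this step can be sketched in plain Python: score every candidate atom mapping by RMSD and keep the lowest. This is a simplified illustration under stated assumptions. It skips the rigid-body superposition that `GetBestRMS` performs, the names `rmsd` and `best_mapping` are hypothetical, and each candidate map lists, for every reference atom, the probe atom compared with it (a convention chosen for this sketch only).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists."""
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def best_mapping(ref_coords, probe_coords, candidate_maps):
    """Return (index, rmsd) of the candidate atom map with the lowest RMSD.

    No alignment is done here: the coordinates are assumed to be already
    superimposed, unlike in the notebook's GetBestRMS-based cell.
    """
    scores = []
    for idx, atom_map in enumerate(candidate_maps):
        reordered = [probe_coords[j] for j in atom_map]
        scores.append((idx, rmsd(ref_coords, reordered)))
    return min(scores, key=lambda item: item[1])


# Hypothetical fragment: one heavy atom and two equivalent H atoms
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Same geometry, but the two H atoms are listed in swapped order
probe = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]
candidates = [(0, 1, 2), (0, 2, 1)]

best_index, best_rmsd = best_mapping(ref, probe, candidates)
print(f"best map: {candidates[best_index]}, RMSD: {best_rmsd:.3f}")
```

In this toy case the second candidate undoes the H swap, so it wins. With real molecules the number of candidate maps grows quickly with the number of equivalent H atoms, which is why the notebook enumerates them with `uniquify=False`.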

In [None]:
# Whether to find better matches by reflecting the molecule (resulting in mirror image)
reflect = False

In [None]:
# Generate all substructure matches (including symmetry-equivalent ones).
# There is no difference using `p_combine` or `p_mol` as the argument
# Since both of them have the same connectivity information
matches = p_mol.GetSubstructMatches(p_combine, uniquify=False)

# Make a copy of p_combine to preserve its original information
p_align = p_combine.Copy()

rmsds = []

# Align the optimized complex to the RMG-generated complex
# using each candidate mapping, and find the best one.
for i, match in enumerate(matches):
    atom_map = [list(enumerate(match))]
    rmsd1 = Chem.rdMolAlign.GetBestRMS(prbMol=p_align.ToRWMol(),
                                       refMol=p_mol.ToRWMol(),
                                       map=atom_map)
    if reflect:
        p_align.Reflect()
        rmsd2 = Chem.rdMolAlign.GetBestRMS(prbMol=p_align.ToRWMol(),
                                           refMol=p_mol.ToRWMol(),
                                           map=atom_map)
        p_align.Reflect()
    else:
        rmsd2 = 1e10
    if rmsd1 > rmsd2:
        rmsds.append((i, True, rmsd2,))
    else:
        rmsds.append((i, False, rmsd1,))
best = sorted(rmsds, key=lambda x: x[2])[0]
print('Match index: {0}, Reflect Conformation: {1}, RMSD: {2}'.format(*best))

# Realign and reorder atom indexes according to the best match
best_match = matches[best[0]]
Chem.rdMolAlign.GetBestRMS(prbMol=p_align.ToRWMol(),
                           refMol=p_mol.ToRWMol(),
                           map=[list(enumerate(best_match))])
if best[1]:
    p_align.Reflect()
new_atom_indexes = [best_match.index(i) for i in range(len(best_match))]
p_align = p_align.RenumberAtoms(new_atom_indexes)
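
The list comprehension that builds `new_atom_indexes` computes the inverse of the permutation stored in `best_match`. A minimal sketch of the same operation without RDKit follows; `invert_permutation` and the toy labels are hypothetical names used only for illustration.

```python
def invert_permutation(match):
    """Return `inverse` such that inverse[match[i]] == i for every i.

    Equivalent to [match.index(i) for i in range(len(match))], but it runs
    in linear rather than quadratic time.
    """
    inverse = [0] * len(match)
    for probe_idx, ref_idx in enumerate(match):
        inverse[ref_idx] = probe_idx
    return inverse


# Hypothetical match: probe atom i corresponds to reference atom match[i]
match = (2, 0, 1)
new_order = invert_permutation(match)

# Each probe atom is labeled with the reference atom it corresponds to;
# reordering with new_order lines the labels up with the reference indexing
probe_labels = ["ref2", "ref0", "ref1"]
reordered = [probe_labels[old_idx] for old_idx in new_order]
print(new_order, reordered)
```

For the small molecules in this notebook the quadratic `list.index` version is perfectly adequate; the linear form is shown only to make the inverse-permutation idea explicit.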

### 4. View Complexes

In [None]:
# mols_to_view = r_mols + [r_mol, p_align,] + p_mols
mols_to_view = [r_mol, p_align]
entries = len(mols_to_view)

viewer = grid_viewer(viewer_grid=(1, entries), viewer_size=(240 * entries, 300),)
for i in range(entries):
    mol_viewer(mols_to_view[i].ToMolBlock(), 'sdf', viewer=viewer, viewer_loc=(0, i))

print('reactant complex    product complex')
viewer.show()

### 5. Export to SDF file and run ts_gen

In [None]:
r_mol.ToSDFFile('reactant.sdf')
p_align.ToSDFFile('product.sdf')

#### 5.1 TS Gen V2

In [None]:
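# Expand "~" explicitly so the paths do not depend on shell tilde expansion
TS_GEN_PYTHON = os.path.expanduser('~/Apps/anaconda3/envs/ts_gen_v2/bin/python3.7')
TS_GEN_DIR = os.path.expanduser('~/Apps/ts_gen_v2')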

In [None]:
try:
    subprocess.run(f'export PYTHONPATH=$PYTHONPATH:{TS_GEN_DIR};'
                   f'{TS_GEN_PYTHON} {TS_GEN_DIR}/inference.py '
                   f'--r_sdf_path reactant.sdf '
                   f'--p_sdf_path product.sdf '
                   f'--ts_xyz_path TS.xyz',
                   check=True,
                   shell=True)
except subprocess.CalledProcessError as e:
    print(e)
else:
    with open('TS.xyz', 'r') as f:
        ts_xyz = f.read()
    ts = RDKitMol.FromXYZ(ts_xyz)
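
As an alternative to the shell string above, the same call can be expressed as an argument list plus an environment dictionary, which avoids `shell=True` and quoting issues. This is only a sketch: `build_ts_gen_command` is a hypothetical helper, the paths are the placeholder values from the previous cell, and the command-line flags are the ones used in the cell above.

```python
import os


def build_ts_gen_command(python_path, ts_gen_dir,
                         r_sdf="reactant.sdf", p_sdf="product.sdf",
                         ts_xyz="TS.xyz"):
    """Return (argv, env) for running inference.py without shell=True."""
    python_path = os.path.expanduser(python_path)
    ts_gen_dir = os.path.expanduser(ts_gen_dir)
    argv = [
        python_path,
        os.path.join(ts_gen_dir, "inference.py"),
        "--r_sdf_path", r_sdf,
        "--p_sdf_path", p_sdf,
        "--ts_xyz_path", ts_xyz,
    ]
    # Copy the current environment and append ts_gen_dir to PYTHONPATH
    env = dict(os.environ)
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = os.pathsep.join(p for p in (existing, ts_gen_dir) if p)
    return argv, env


argv, env = build_ts_gen_command("~/Apps/anaconda3/envs/ts_gen_v2/bin/python3.7",
                                 "~/Apps/ts_gen_v2")
print(argv)
# To execute: subprocess.run(argv, env=env, check=True)
```

The actual `subprocess.run` call is left as a comment so the sketch can be inspected without a TS-GCN installation.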

### 6. Visualize TS

In [None]:
# Align the TS to make visualization more convenient
atom_map = [(i, i) for i in range(r_mol.GetNumAtoms())]
ts.GetBestAlign(refMol=r_mol,
                atomMap=atom_map,
                keepBestConformer=True)

# View results in 3D geometries
mols_to_view = [r_mol, ts, p_align]
entries = len(mols_to_view)
viewer = grid_viewer(viewer_grid=(1, entries), viewer_size=(400 * entries, 400),)
for i in range(entries):
    mol_viewer(mols_to_view[i].ToMolBlock(), 'sdf', viewer=viewer, viewer_loc=(0, i))

print('reactant    TS      product')
viewer.show()

Print the TS geometry in XYZ format

In [None]:
print(ts.ToXYZ())