<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Read-in-USPTO-from-ORD" data-toc-modified-id="Read-in-USPTO-from-ORD-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Read in USPTO from ORD</a></span><ul class="toc-item"><li><span><a href="#Preface" data-toc-modified-id="Preface-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Preface</a></span></li><li><span><a href="#Extract-USPTO-data-from-ORD" data-toc-modified-id="Extract-USPTO-data-from-ORD-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Extract USPTO data from ORD</a></span><ul class="toc-item"><li><span><a href="#Example-of-reading-ORD-formatted-data" data-toc-modified-id="Example-of-reading-ORD-formatted-data-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Example of reading ORD formatted data</a></span></li></ul></li></ul></li><li><span><a href="#Preprocessing-of-USPTO---Molecular-AI" data-toc-modified-id="Preprocessing-of-USPTO---Molecular-AI-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing of USPTO - Molecular AI</a></span><ul class="toc-item"><li><span><a href="#Read-in-data-cleaned-by-rxn-utils" data-toc-modified-id="Read-in-data-cleaned-by-rxn-utils-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Read in data cleaned by rxn utils</a></span></li></ul></li></ul></div>

# Read in USPTO from ORD

## Preface

In [None]:
# I tried to read USPTO data via the ORD schema, e.g. the dataset behind this link:
# url = "https://github.com/Open-Reaction-Database/ord-data/blob/main/data/02/ord_dataset-026684a62f91469db49c7767d16c39fb.pb.gz?raw=true"
# However, ORD keeps literally EVERYTHING from USPTO, so this resulted in a df of around 90k x 120k, which neither
# Joe's computer nor my laptop has the memory to deal with.

# There may be 90k columns, but many of them probably hold superfluous info, e.g. a type column that only says SMILES,
# email columns, etc.
# So one possible solution would be to pre-filter the columns (delete all the unnecessary ones)
# and only load the data into a df afterwards.

# I could use the code below to do this.
# However, it's unnecessary, as Joe is parsing the original USPTO XML files!
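
The pre-filtering idea can be sketched without loading anything into pandas first. This is a minimal sketch, assuming each reaction has already been flattened into a plain dict (as `message_helpers.message_to_row` does in the commented-out cell below); the drop patterns and the example keys are made up for illustration and would need adjusting to the real column names.

```python
import re

# Hypothetical patterns for columns considered superfluous; adjust to the real keys.
DROP_PATTERNS = [
    re.compile(r"\.type$"),  # e.g. identifier type columns that only say "SMILES"
    re.compile(r"email"),    # provenance e-mail addresses
]


def prefilter_row(row, drop_patterns=DROP_PATTERNS):
    """Return a copy of a flattened row without keys matching any drop pattern."""
    return {
        key: value
        for key, value in row.items()
        if not any(pattern.search(key) for pattern in drop_patterns)
    }


# Illustrative row; the key names are assumptions, not taken from a real dataset.
example_row = {
    'inputs["m1"].components[0].identifiers[1].type': "SMILES",
    'inputs["m1"].components[0].identifiers[1].value': "CCO",
    "provenance.record_created.person.email": "someone@example.com",
}
filtered = prefilter_row(example_row)
```

Filtering row by row keeps only the wanted keys in memory, so the final DataFrame is built from the reduced rows rather than from the full 90k-column table.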


In [None]:
# import ord_schema
# from ord_schema import message_helpers, validations
# from ord_schema.proto import dataset_pb2
# import pandas as pd

# import wget

# # url = "https://github.com/Open-Reaction-Database/ord-data/blob/main/data/02/ord_dataset-026684a62f91469db49c7767d16c39fb.pb.gz?raw=true"
# url = "https://github.com/open-reaction-database/ord-data/blob/main/data/68/ord_dataset-68cb8b4b2b384e3d85b5b1efae58b203.pb.gz?raw=true"
# pb = wget.download(url)

# # Load Dataset message
# data = message_helpers.load_message(pb, dataset_pb2.Dataset)

# # Flatten only the first reaction and print its column names
# rows = []
# for d in data.reactions:
#     # print(d)
#     row = message_helpers.message_to_row(d)
#     rows.append(row)
#     for k, v in row.items():
#         print(k)
#     break
# df = pd.DataFrame(rows)

## Extract USPTO data from ORD

1. All of the USPTO grants data is available here: https://github.com/open-reaction-database/ord-data
2. It is batched by year. It's best to keep this batching, as it makes the data easier to handle (no single file gets excessively large).
3. Read in the data contained in the .pb.gz file; each entry in the "list" is a reaction. Write a for loop over the "list" and extract the following from each reaction:
    3.1 Reactants
    3.2 Products
    3.3 Solvents
    3.4 Reagents
    3.5 Catalysts
    3.6 Temperature
    3.7 Anything else?
4. Build a list for each of these, combine them into a df, and then save it as a Parquet file.
5. Repeat this for each of the 41 years (41 datasets) we have data for in USPTO. It'll probably be easiest to convert the code in this notebook into a script and then run it automatically on each dataset.
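
Steps 3 and 4 could look roughly like the sketch below. It is only an outline under stated assumptions: the components of each reaction are represented by stand-in `(role, smiles)` tuples rather than real ORD messages, and the helper names `reaction_to_record`, `records_to_columns` and `save_parquet` are hypothetical. Writing Parquet through pandas additionally needs `pyarrow` or `fastparquet` installed.

```python
from collections import defaultdict

# Column names follow the list above; temperature is a single value per reaction.
ROLE_COLUMNS = ("reactants", "products", "solvents", "reagents", "catalysts")


def reaction_to_record(components, temperature=None):
    """Group stand-in (role, smiles) pairs into one record per reaction."""
    grouped = defaultdict(list)
    for role, smiles in components:
        if role in ROLE_COLUMNS:
            grouped[role].append(smiles)
    record = {column: list(grouped[column]) for column in ROLE_COLUMNS}
    record["temperature"] = temperature
    return record


def records_to_columns(records):
    """Turn a list of per-reaction records into one list per column."""
    columns = {column: [] for column in (*ROLE_COLUMNS, "temperature")}
    for record in records:
        for column in columns:
            columns[column].append(record.get(column))
    return columns


def save_parquet(columns, path):
    """Combine the column lists into a df and write it as a Parquet file."""
    import pandas as pd  # imported lazily so the helpers above run without pandas

    pd.DataFrame(columns).to_parquet(path)


# Made-up reactions, purely to show the shape of the data.
records = [
    reaction_to_record(
        [("reactants", "CCO"), ("reactants", "CC(=O)O"), ("solvents", "O")], 25.0
    ),
    reaction_to_record([("reactants", "c1ccccc1"), ("catalysts", "[Pd]")]),
]
columns = records_to_columns(records)
```

Keeping each role as a list of SMILES (instead of one joined string) preserves component boundaries; a missing temperature stays `None`.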

### Example of reading ORD formatted data

In [None]:
# https://github.com/open-reaction-database/ord-schema/blob/main/examples/applications/Perera_Science_Granda_Nature_Suzuki/Granda_Perera_ml_example.ipynb


In [3]:
# Import modules
import ord_schema
from ord_schema import message_helpers, validations
from ord_schema.proto import dataset_pb2

import math
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import wget

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import model_selection, metrics
from glob import glob



In [2]:
# Download dataset from ord-data
#url = "https://github.com/open-reaction-database/ord-data/blob/main/data/68/ord_dataset-68cb8b4b2b384e3d85b5b1efae58b203.pb.gz?raw=true"
#https://github.com/open-reaction-database/ord-data
url = "https://github.com/Open-Reaction-Database/ord-data/blob/main/data/02/ord_dataset-026684a62f91469db49c7767d16c39fb.pb.gz?raw=true"
pb = wget.download(url)
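
Re-running the download cell fetches the ~56 MB file again every time. A minimal sketch of a cached download is given here, using only the standard library instead of `wget`; `download_once` is a hypothetical helper, and the destination path is whatever local layout is preferred.

```python
import os
import urllib.request


def download_once(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return `dest`."""
    if os.path.exists(dest):
        return dest  # already downloaded, nothing to do
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    tmp = dest + ".part"
    urllib.request.urlretrieve(url, tmp)
    os.replace(tmp, dest)  # only expose the file once the download is complete
    return dest
```

For example, `pb = download_once(url, "data/USPTO/ord_dataset-026684a62f91469db49c7767d16c39fb.pb.gz")` would return the existing local path on every run after the first.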


In [5]:
# Load Dataset message
pb = 'data/USPTO/ord_dataset-026684a62f91469db49c7767d16c39fb.pb.gz'
data = message_helpers.load_message(pb, dataset_pb2.Dataset)
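
Since the same loading step has to be repeated for every yearly dataset, it may help to plan input and output paths up front. The following is a sketch only: `output_name` and `plan_outputs` are hypothetical helpers, and the `ord_dataset-*.pb.gz` file name pattern is assumed from the file names used in this notebook.

```python
from pathlib import Path

SUFFIX = ".pb.gz"


def output_name(pb_path):
    """Map 'ord_dataset-<id>.pb.gz' to 'ord_dataset-<id>.parquet'."""
    name = Path(pb_path).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"not a {SUFFIX} file: {name}")
    return name[: -len(SUFFIX)] + ".parquet"


def plan_outputs(input_dir, output_dir):
    """Pair every dataset file in `input_dir` with its Parquet path in `output_dir`."""
    output_dir = Path(output_dir)
    return [
        (pb_path, output_dir / output_name(pb_path))
        for pb_path in sorted(Path(input_dir).glob("ord_dataset-*" + SUFFIX))
    ]
```

Each `(pb_path, parquet_path)` pair can then be processed independently, which keeps the per-year batching and makes it easy to skip files whose output already exists.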

In [6]:
type(data)

ord_schema.proto.dataset_pb2.Dataset

In [3]:
data.reactions[0]

identifiers {
  type: REACTION_CXSMILES
  value: "Br[C:2]1[C:3]([C:15]([O:17][CH2:18][CH3:19])=[O:16])=[N:4][N:5]([C:7]2[CH:12]=[CH:11][C:10]([F:13])=[C:9]([F:14])[CH:8]=2)[CH:6]=1.CC1(C)C(C)(C)OB([C:28]2[CH:29]=[C:30]([CH:35]=[CH:36][CH:37]=2)[C:31]([O:33][CH3:34])=[O:32])O1.[C:39](=O)([O-])[O-].[K+].[K+]>C(O)C.C([O-])(=O)C.[Pd+2].C([O-])(=O)C>[F:14][C:9]1[CH:8]=[C:7]([N:5]2[CH:6]=[C:2]([C:36]3[CH:37]=[CH:28][CH:29]=[C:30]([C:31]([O:33][CH2:34][CH3:39])=[O:32])[CH:35]=3)[C:3]([C:15]([O:17][CH2:18][CH3:19])=[O:16])=[N:4]2)[CH:12]=[CH:11][C:10]=1[F:13] |f:2.3.4,6.7.8|"
  is_mapped: true
}
inputs {
  key: "m1"
  value {
    components {
      identifiers {
        type: NAME
        value: "ethyl 4-bromo-1-(3,4-difluorophenyl)-1H-pyrazole-3-carboxylate"
      }
      identifiers {
        type: SMILES
        value: "BrC=1C(=NN(C1)C1=CC(=C(C=C1)F)F)C(=O)OCC"
      }
      identifiers {
        type: INCHI
        value: "InChI=1S/C12H9BrF2N2O2/c1-2-19-12(18)11-8(13)6-17(16-11)7-3-4-9(14)
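
The `REACTION_CXSMILES` value shown at the top of this output has the form `reactants>agents>products`, followed by an optional CXSMILES extension such as ` |f:2.3.4,6.7.8|`. A minimal sketch for splitting such a string is given below; `split_reaction_smiles` is a hypothetical helper, it ignores the fragment grouping encoded in the extension, and the example string is made up.

```python
def split_reaction_smiles(rxn_smiles):
    """Split 'reactants>agents>products' into three lists of molecule SMILES.

    A trailing CXSMILES extension (everything from ' |' onwards) is dropped,
    so the fragment grouping it encodes is not applied here.
    """
    core = rxn_smiles.split(" |", 1)[0].strip()
    parts = core.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got {len(parts) - 1}")
    reactants, agents, products = (
        [smi for smi in part.split(".") if smi] for part in parts
    )
    return reactants, agents, products


# Made-up esterification example, not taken from the dataset.
reactants, agents, products = split_reaction_smiles(
    "CC(=O)O.OCC>[H+]>CC(=O)OCC.O |f:0.1|"
)
```

Because the grouping is ignored, salts such as `[K+].[K+].[C](=O)([O-])[O-]` come back as separate fragments; regrouping them would require parsing the `f:` section as well.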

In [5]:
rxn = data.reactions[0]

In [34]:
rxn.inputs['m1'].components[0].identifiers[0].value

'ethyl 4-bromo-1-(3,4-difluorophenyl)-1H-pyrazole-3-carboxylate'
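
Indexing `identifiers[0]` returns whichever identifier happens to come first (here the NAME). Looking identifiers up by type is more robust. This is a small sketch: `get_identifier` is a hypothetical helper, and the stand-in objects use string types so the example runs without `ord_schema`. With real messages the type is an integer enum, and passing something like `reaction_pb2.CompoundIdentifier.SMILES` as `wanted_type` is an assumption to check against the installed schema.

```python
from types import SimpleNamespace


def get_identifier(component, wanted_type, default=None):
    """Return the value of the first identifier whose type equals `wanted_type`."""
    for identifier in component.identifiers:
        if identifier.type == wanted_type:
            return identifier.value
    return default


# Stand-in for a component such as rxn.inputs['m1'].components[0].
component = SimpleNamespace(
    identifiers=[
        SimpleNamespace(type="NAME", value="ethanol"),
        SimpleNamespace(type="SMILES", value="CCO"),
    ]
)
smiles = get_identifier(component, "SMILES")
```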

In [6]:
rxn.conditions.temperature.control.type

2
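
The bare `2` is the integer value of a protobuf enum. Its symbolic name can be recovered from the message descriptor. The sketch below relies on the generic protobuf descriptor attributes (`DESCRIPTOR.fields_by_name`, `enum_type`, `values_by_number`); `enum_field_name` is a hypothetical helper, and the stand-in object with its made-up enum names only exists so the example runs without `ord_schema`.

```python
from types import SimpleNamespace as NS


def enum_field_name(message, field_name):
    """Return the symbolic name of an enum-typed field on a protobuf message."""
    field = message.DESCRIPTOR.fields_by_name[field_name]
    if field.enum_type is None:
        raise TypeError(f"{field_name!r} is not an enum field")
    number = getattr(message, field_name)
    return field.enum_type.values_by_number[number].name


# Stand-in mimicking only the descriptor attributes used above; the enum
# names FIRST/SECOND/THIRD are invented and are not the real ORD names.
fake_enum = NS(values_by_number={0: NS(name="FIRST"), 1: NS(name="SECOND"), 2: NS(name="THIRD")})
fake_message = NS(
    DESCRIPTOR=NS(fields_by_name={"type": NS(enum_type=fake_enum)}),
    type=2,
)
name = enum_field_name(fake_message, "type")
```

On the real data the call would be `enum_field_name(rxn.conditions.temperature.control, "type")`.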

In [35]:
rxn2 = data.reactions[1]

In [36]:
rxn2

identifiers {
  type: REACTION_CXSMILES
  value: "[ClH:1].[CH3:2][N:3]([CH:11]([CH3:13])[CH3:12])[C:4]1[CH:9]=[CH:8][C:7]([NH2:10])=[CH:6][CH:5]=1.Cl.Cl.[CH2:16]([O:20][C:21]1[CH:26]=[CH:25][C:24]([NH2:27])=[CH:23][C:22]=1[NH2:28])[CH2:17][CH2:18][CH3:19].N.OO.[CH2:32](O)C>O>[ClH:1].[CH2:16]([O:20][C:21]1[C:22](=[NH:28])[CH:23]=[C:24]([NH2:27])[CH:25]([NH:10][C:7]2[CH:8]=[CH:9][C:4]([N:3]([CH2:2][CH3:32])[CH:11]([CH3:13])[CH3:12])=[CH:5][CH:6]=2)[CH:26]=1)[CH2:17][CH2:18][CH3:19] |f:0.1,2.3.4,9.10|"
  is_mapped: true
}
inputs {
  key: "m1"
  value {
    components {
      identifiers {
        type: NAME
        value: "N-methyl-N-isopropyl-4-aminoaniline hydrochloride"
      }
      identifiers {
        type: SMILES
        value: "Cl.CN(C1=CC=C(C=C1)N)C(C)C"
      }
      identifiers {
        type: INCHI
        value: "InChI=1S/C10H16N2.ClH/c1-8(2)12(3)10-6-4-9(11)5-7-10;/h4-8H,11H2,1-3H3;1H"
      }
      amount {
        moles {
          value: 0.003000000026077032
          un

In [37]:
%%time
rxn3 = data.reactions[2]
rxn3

CPU times: user 27 µs, sys: 7 µs, total: 34 µs
Wall time: 44.8 µs


identifiers {
  type: REACTION_CXSMILES
  value: "OO.S([O:7][C:8]1[CH:13]=[CH:12][C:11]([NH2:14])=[CH:10][C:9]=1[Cl:15])(O)(=O)=O.Cl.Cl.[CH2:18]([O:22][C:23]1[CH:28]=[CH:27][C:26]([NH2:29])=[CH:25][C:24]=1[NH2:30])[CH2:19][CH2:20][CH3:21].N>O.C(O)C>[Cl:15][C:9]1[C:8](=[O:7])[CH:13]=[CH:12][C:11](=[N:14][C:27]2[CH:28]=[C:23]([O:22][CH2:18][CH2:19][CH2:20][CH3:21])[C:24]([NH2:30])=[CH:25][C:26]=2[NH2:29])[CH:10]=1 |f:2.3.4|"
  is_mapped: true
}
inputs {
  key: "m1_m2_m3_m5_m6"
  value {
    components {
      identifiers {
        type: NAME
        value: "hydrogen peroxide"
      }
      identifiers {
        type: SMILES
        value: "OO"
      }
      identifiers {
        type: INCHI
        value: "InChI=1S/H2O2/c1-2/h1-2H"
      }
      amount {
        volume {
          value: 26.0
          units: MILLILITER
        }
      }
      reaction_role: REACTANT
    }
    components {
      identifiers {
        type: NAME
        value: "4-amino-2-chlorophenol sulfate"
      }
    

In [38]:
rxn4 = data.reactions[3]
rxn4

identifiers {
  type: REACTION_CXSMILES
  value: "Cl.[CH2:2]([N:4](C(C)C)[C:5]1[CH:10]=[CH:9][C:8]([NH2:11])=[CH:7][CH:6]=1)[CH3:3].Cl.Cl.[CH2:17]([O:21][C:22]1[CH:27]=[CH:26][C:25]([NH2:28])=[CH:24][C:23]=1[NH2:29])[CH2:18][CH2:19][CH3:20].N.[OH:31]O.[CH2:33]([OH:35])[CH3:34]>O>[NH2:28][C:25]1[C:26](=[N:11][C:8]2[CH:9]=[CH:10][C:5]([N:4]([CH2:2][CH2:3][OH:31])[CH2:34][CH2:33][OH:35])=[CH:6][CH:7]=2)[CH:27]=[C:22]([O:21][CH2:17][CH2:18][CH2:19][CH3:20])[C:23](=[NH:29])[CH:24]=1 |f:0.1,2.3.4|"
  is_mapped: true
}
inputs {
  key: "m1"
  value {
    components {
      identifiers {
        type: NAME
        value: "N-ethyl-N-isopropyl-4-aminoaniline hydrochloride"
      }
      identifiers {
        type: SMILES
        value: "Cl.C(C)N(C1=CC=C(C=C1)N)C(C)C"
      }
      identifiers {
        type: INCHI
        value: "InChI=1S/C11H18N2.ClH/c1-4-13(9(2)3)11-7-5-10(12)6-8-11;/h5-9H,4,12H2,1-3H3;1H"
      }
      amount {
        mass {
          value: 0.925000011920929
          units:

In [6]:
# Ensure dataset validates
valid_output = validations.validate_message(data)

[18:17:50] reactant 3 has no mapped atoms.
[18:17:50] reactant 4 has no mapped atoms.
[18:17:50] reactant 2 has no mapped atoms.
[18:17:50] reactant 3 has no mapped atoms.
[18:17:50] reactant 5 has no mapped atoms.
[18:17:50] reactant 6 has no mapped atoms.
[18:17:50] reactant 0 has no mapped atoms.
[18:17:50] reactant 2 has no mapped atoms.
[18:17:50] reactant 3 has no mapped atoms.
[18:17:50] reactant 5 has no mapped atoms.
[18:17:50] reactant 0 has no mapped atoms.
[18:17:50] reactant 2 has no mapped atoms.
[18:17:50] reactant 3 has no mapped atoms.
[18:17:50] reactant 5 has no mapped atoms.
[18:17:50] reactant 1 has no mapped atoms.
[18:17:50] reactant 2 has no mapped atoms.
[18:17:50] reactant 3 has no mapped atoms.
[18:17:50] reactant 5 has no mapped atoms.
[18:17:50] reactant 0 has no mapped atoms.
[18:17:50] reactant 2 has no mapped atoms.
[18:17:50] reactant 4 has no mapped atoms.
[18:17:50] reactant 5 has no mapped atoms.
[18:17:50] product atom-mapping number 4 found multipl
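
Validation emits two kinds of messages over and over, so a count per kind is easier to read than the raw log. This is a minimal sketch, assuming the log text has already been captured as a list of lines (how to capture it is not covered here); `summarize_log` and the category names are hypothetical.

```python
import re
from collections import Counter

UNMAPPED = re.compile(r"reactant \d+ has no mapped atoms")
DUPLICATE = re.compile(r"product atom-mapping number \d+ found multiple times")


def summarize_log(lines):
    """Count log lines per message kind; blank lines are ignored."""
    counts = Counter()
    for line in lines:
        if UNMAPPED.search(line):
            counts["unmapped_reactant"] += 1
        elif DUPLICATE.search(line):
            counts["duplicate_product_map"] += 1
        elif line.strip():
            counts["other"] += 1
    return counts


sample = [
    "[18:17:50] reactant 3 has no mapped atoms.",
    "[18:17:50] reactant 4 has no mapped atoms.",
    "",
    "[18:17:51] product atom-mapping number 24 found multiple times.",
]
summary = summarize_log(sample)
```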

[18:17:58] reactant 1 has no mapped atoms.
[18:17:58] reactant 2 has no mapped atoms.
[18:17:58] reactant 1 has no mapped atoms.
[18:17:58] reactant 1 has no mapped atoms.
[18:17:58] reactant 2 has no mapped atoms.
[18:17:58] reactant 3 has no mapped atoms.
[18:17:58] reactant 4 has no mapped atoms.
[18:17:58] reactant 5 has no mapped atoms.
[18:17:58] reactant 6 has no mapped atoms.
[18:17:58] reactant 1 has no mapped atoms.
[18:17:58] reactant 2 has no mapped atoms.
[18:17:58] reactant 1 has no mapped atoms.
[18:17:58] reactant 1 has no mapped atoms.
[18:17:58] reactant 0 has no mapped atoms.
[18:17:58] reactant 1 has no mapped atoms.
[18:17:58] reactant 4 has no mapped atoms.
[18:17:58] reactant 5 has no mapped atoms.
[18:17:58] reactant 2 has no mapped atoms.
[18:17:58] reactant 0 has no mapped atoms.
[18:17:58] reactant 3 has no mapped atoms.
[18:17:58] reactant 4 has no mapped atoms.
[18:17:59] product atom-mapping number 7 found multipl

[18:17:59] reactant 0 has no mapped atoms.
[18:17:59] reactant 1 has no mapped atoms.
[18:17:59] reactant 2 has no mapped atoms.
[18:17:59] reactant 0 has no mapped atoms.
[18:17:59] reactant 1 has no mapped atoms.
[18:17:59] reactant 2 has no mapped atoms.
[18:17:59] reactant 0 has no mapped atoms.
[18:17:59] reactant 1 has no mapped atoms.
[18:17:59] reactant 2 has no mapped atoms.
[18:17:59] reactant 0 has no mapped atoms.
[18:17:59] reactant 1 has no mapped atoms.
[18:17:59] reactant 2 has no mapped atoms.
[18:17:59] reactant 3 has no mapped atoms.
[18:17:59] reactant 2 has no mapped atoms.
[18:17:59] reactant 3 has no mapped atoms.
[18:17:59] reactant 2 has no mapped atoms.
[18:17:59] reactant 3 has no mapped atoms.
[18:17:59] reactant 4 has no mapped atoms.
[18:17:59] reactant 1 has no mapped atoms.
[18:17:59] reactant 2 has no mapped atoms.
[18:17:59] reactant 4 has no mapped atoms.
[18:17:59] reactant 5 has no mapped atoms.
[18:17:59] reactant 1 has no mapped atoms.
[18:17:59] 

[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 3 has no mapped atoms.
[18:18:00] reactant 4 has no mapped atoms.
[18:18:00] reactant 5 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 1 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 3 has no mapped atoms.
[18:18:00] reactant 1 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 3 has no mapped atoms.
[18:18:00] reactant 5 has no mapped atoms.
[18:18:00] reactant 6 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 1 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 3 has no mapped atoms.
[18:18:00] reactant 5 has no mapped atoms.
[18:18:00] reactant 1 has no mapped atoms.
[18:18:00] reactant 3 has no mapped atoms.
[18:18:00] product atom-mapping number 1 found multiple times.
[18:18:00] product atom-mapping nu

[18:18:00] reactant 0 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 4 has no mapped atoms.
[18:18:00] product atom-mapping number 2 found multiple times.
[18:18:00] reactant 1 has no mapped atoms.
[18:18:00] reactant 0 has no mapped atoms.
[18:18:00] reactant 1 has no mapped atoms.
[18:18:00] reactant 0 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 3 has no mapped atoms.
[18:18:00] reactant 0 has no mapped atoms.
[18:18:00] reactant 4 has no mapped atoms.
[18:18:00] reactant 0 has no mapped atoms.
[18:18:00] reactant 4 has no mapped atoms.
[18:18:00] reactant 1 has no mapped atoms.
[18:18:00] product atom-mapping number 39 found multiple times.
[18:18:00] reactant 1 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 0 has no mapped atoms.
[18:18:00] reactant 2 has no mapped atoms.
[18:18:00] reactant 3 has no mapped atoms.
[18:18:00] reactant 0 has no mapped atoms.
[18:18:00] re

[18:18:01] reactant 2 has no mapped atoms.
[18:18:01] reactant 3 has no mapped atoms.
[18:18:01] reactant 4 has no mapped atoms.
[18:18:01] reactant 2 has no mapped atoms.
[18:18:01] reactant 3 has no mapped atoms.
[18:18:01] reactant 4 has no mapped atoms.
[18:18:01] reactant 0 has no mapped atoms.
[18:18:01] reactant 1 has no mapped atoms.
[18:18:01] reactant 1 has no mapped atoms.
[18:18:01] reactant 3 has no mapped atoms.
[18:18:01] reactant 1 has no mapped atoms.
[18:18:01] product atom-mapping number 1 found multiple times.
[18:18:01] product atom-mapping number 2 found multiple times.
[18:18:01] product atom-mapping number 15 found multiple times.
[18:18:01] product atom-mapping number 14 found multiple times.
[18:18:01] product atom-mapping number 3 found multiple times.
[18:18:01] product atom-mapping number 10 found multiple times.
[18:18:01] product atom-mapping number 11 found multiple times.
[18:18:01] product atom-mapping number 13 found multiple times.
[18:18:01] product

[18:18:02] reactant 2 has no mapped atoms.
[18:18:02] reactant 3 has no mapped atoms.
[18:18:02] reactant 4 has no mapped atoms.
[18:18:02] reactant 5 has no mapped atoms.
[18:18:02] reactant 6 has no mapped atoms.
[18:18:02] reactant 7 has no mapped atoms.
[18:18:02] reactant 1 has no mapped atoms.
[18:18:02] reactant 1 has no mapped atoms.
[18:18:02] reactant 3 has no mapped atoms.
[18:18:02] reactant 4 has no mapped atoms.
[18:18:02] reactant 5 has no mapped atoms.
[18:18:02] reactant 2 has no mapped atoms.
[18:18:02] reactant 1 has no mapped atoms.
[18:18:02] reactant 2 has no mapped atoms.
[18:18:02] reactant 1 has no mapped atoms.
[18:18:02] reactant 2 has no mapped atoms.
[18:18:02] reactant 3 has no mapped atoms.
[18:18:02] reactant 2 has no mapped atoms.
[18:18:02] reactant 3 has no mapped atoms.
[18:18:02] reactant 4 has no mapped atoms.
[18:18:02] reactant 0 has no mapped atoms.
[18:18:02] reactant 1 has no mapped atoms.
[18:18:02] reactant 1 has no mapped atoms.
[18:18:02] 

[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 3 has no mapped atoms.
[18:18:03] reactant 4 has no mapped atoms.
[18:18:03] reactant 5 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 1 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 3 has no mapped atoms.
[18:18:03] reactant 0 has no mapped atoms.
[18:18:03] reactant 1 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 3 has no mapped atoms.
[18:18:03] reactant 4 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 3 has no mapped atoms.
[18:18:03] reactant 4 has no mapped atoms.
[18:18:03] reactant 5 has no mapped atoms.
[18:18:03] reactant 6 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 3 has no mapped atoms.
[18:18:03] 

[18:18:03] reactant 1 has no mapped atoms.
[18:18:03] product atom-mapping number 1 found multiple times.
[18:18:03] product atom-mapping number 23 found multiple times.
[18:18:03] product atom-mapping number 24 found multiple times.
[18:18:03] product atom-mapping number 25 found multiple times.
[18:18:03] product atom-mapping number 26 found multiple times.
[18:18:03] product atom-mapping number 27 found multiple times.
[18:18:03] product atom-mapping number 28 found multiple times.
[18:18:03] product atom-mapping number 2 found multiple times.
[18:18:03] product atom-mapping number 3 found multiple times.
[18:18:03] product atom-mapping number 4 found multiple times.
[18:18:03] product atom-mapping number 7 found multiple times.
[18:18:03] product atom-mapping number 8 found multiple times.
[18:18:03] product atom-mapping number 15 found multiple times.
[18:18:03] product atom-mapping number 16 found multiple times.
[18:18:03] product atom-mapping number 18 found multiple times.
[18

[18:18:03] reactant 1 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 3 has no mapped atoms.
[18:18:03] reactant 5 has no mapped atoms.
[18:18:03] reactant 2 has no mapped atoms.
[18:18:03] reactant 3 has no mapped atoms.
[18:18:03] reactant 4 has no mapped atoms.
[18:18:03] reactant 5 has no mapped atoms.
[18:18:03] reactant 6 has no mapped atoms.
[18:18:03] reactant 1 has no mapped atoms.
[18:18:03] reactant 3 has no mapped atoms.
[18:18:03] product atom-mapping number 1 found multiple times.
[18:18:03] product atom-mapping number 23 found multiple times.
[18:18:03] product atom-mapping number 24 found multiple times.
[18:18:03] product atom-mapping number 25 found multiple times.
[18:18:03] product atom-mapping number 26 found multiple times.
[18:18:03] product atom-mapping number 27 found multiple times.
[18:18:03] product atom-mapping number 28 found multiple times.
[18:18:03] product atom-mapping number 2 found multiple times.
[18:18:03] produc

[18:18:04] reactant 2 has no mapped atoms.
[18:18:04] reactant 3 has no mapped atoms.
[18:18:04] reactant 4 has no mapped atoms.
[18:18:04] reactant 5 has no mapped atoms.
[18:18:04] reactant 6 has no mapped atoms.
[18:18:04] reactant 7 has no mapped atoms.
[18:18:04] reactant 2 has no mapped atoms.
[18:18:04] reactant 1 has no mapped atoms.
[18:18:04] reactant 2 has no mapped atoms.
[18:18:04] reactant 3 has no mapped atoms.
[18:18:04] reactant 4 has no mapped atoms.
[18:18:04] reactant 2 has no mapped atoms.
[18:18:04] reactant 3 has no mapped atoms.
[18:18:04] reactant 4 has no mapped atoms.
[18:18:04] reactant 5 has no mapped atoms.
[18:18:04] reactant 0 has no mapped atoms.
[18:18:04] reactant 1 has no mapped atoms.
[18:18:04] reactant 4 has no mapped atoms.
[18:18:04] reactant 5 has no mapped atoms.
[18:18:04] reactant 2 has no mapped atoms.
[18:18:04] reactant 3 has no mapped atoms.
[18:18:04] reactant 4 has no mapped atoms.
[18:18:04] reactant 1 has no mapped atoms.
[18:18:04] 

[18:18:05] reactant 2 has no mapped atoms.
[18:18:05] reactant 3 has no mapped atoms.
[18:18:05] reactant 2 has no mapped atoms.
[18:18:06] reactant 2 has no mapped atoms.
[18:18:06] reactant 3 has no mapped atoms.
[18:18:06] reactant 4 has no mapped atoms.
[18:18:06] reactant 1 has no mapped atoms.
[18:18:06] reactant 3 has no mapped atoms.
[18:18:06] reactant 2 has no mapped atoms.
[18:18:06] reactant 3 has no mapped atoms.
[18:18:06] reactant 4 has no mapped atoms.
[18:18:06] reactant 5 has no mapped atoms.
[18:18:06] reactant 1 has no mapped atoms.
[18:18:06] reactant 2 has no mapped atoms.
[18:18:06] reactant 4 has no mapped atoms.
[18:18:06] reactant 5 has no mapped atoms.
[18:18:06] reactant 2 has no mapped atoms.
[18:18:06] reactant 3 has no mapped atoms.
[18:18:06] reactant 2 has no mapped atoms.
[18:18:06] reactant 2 has no mapped atoms.
[18:18:06] reactant 2 has no mapped atoms.
[18:18:06] reactant 3 has no mapped atoms.
[18:18:06] reactant 0 has no mapped atoms.
[18:18:06] 

ValidationError: Dataset.reactions[1478].inputs["m1_m2"].components[1].identifiers[2]: RDKit 2022.03.5 could not validate InChI identifier InChI=1S/CHCl/c1-2/h1H

In [None]:
# Convert dataset to pandas dataframe
df = message_helpers.messages_to_dataframe(data.reactions, drop_constant_columns=True)

# View dataframe
df

In [10]:
# View all columns with variation in the dataset
list(df.columns)

['inputs["Solvent_1"].components[0].identifiers[0].value',
 'inputs["Ligand in Solvent"].components[0].identifiers[0].value',
 'inputs["Ligand in Solvent"].components[0].amount.moles.value',
 'inputs["Ligand in Solvent"].components[0].amount.moles.units',
 'inputs["Ligand in Solvent"].components[0].reaction_role',
 'inputs["Ligand in Solvent"].components[1].identifiers[0].type',
 'inputs["Ligand in Solvent"].components[1].identifiers[0].value',
 'inputs["Ligand in Solvent"].components[1].amount.volume.value',
 'inputs["Ligand in Solvent"].components[1].amount.volume.units',
 'inputs["Ligand in Solvent"].components[1].amount.volume_includes_solutes',
 'inputs["Ligand in Solvent"].components[1].reaction_role',
 'inputs["Boronate in Solvent"].components[0].identifiers[0].value',
 'inputs["Boronate in Solvent"].components[1].identifiers[0].value',
 'inputs["Base in Solvent"].components[0].identifiers[0].value',
 'inputs["Base in Solvent"].components[0].amount.moles.value',
 'inputs["Base i
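
The flattened column names above follow a regular pattern, so the pre-filtering idea from the preface (keep only the informative columns) can be sketched with a regular expression. This is only a rough sketch: the sample names are copied from the list above, while `IDENTIFIER_VALUE` and `identifier_value_columns` are hypothetical helpers that are not part of `ord_schema`.

```python
import re

# A few flattened ORD column names, copied from the output above.
columns = [
    'inputs["Solvent_1"].components[0].identifiers[0].value',
    'inputs["Ligand in Solvent"].components[0].identifiers[0].value',
    'inputs["Ligand in Solvent"].components[0].amount.moles.value',
    'inputs["Ligand in Solvent"].components[0].amount.moles.units',
    'inputs["Ligand in Solvent"].components[0].reaction_role',
    'inputs["Ligand in Solvent"].components[1].identifiers[0].type',
]

# Hypothetical pattern: matches only the identifier *value* columns of an input component.
IDENTIFIER_VALUE = re.compile(
    r'^inputs\["(?P<input>[^"]+)"\]'
    r'\.components\[(?P<component>\d+)\]'
    r'\.identifiers\[\d+\]\.value$'
)


def identifier_value_columns(names):
    """Return the column names that hold an identifier value (e.g. a SMILES string)."""
    return [name for name in names if IDENTIFIER_VALUE.match(name)]


kept = identifier_value_columns(columns)
for name in kept:
    print(name)
```

With the real dataframe, something like `df[identifier_value_columns(df.columns)]` should select just those columns; which columns are actually worth keeping is a judgment call that depends on the downstream task.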

# Preprocessing of USPTO - Molecular AI

In [1]:
# Running code from:
# https://molecularai.github.io/reaction_utils/uspto.html

# From within the folder containing all the USPTO data

# First run:
# conda activate <rxnutilities>
# python -m rxnutils.data.uspto.preparation_pipeline run --nbatches 200  --max-workers 8 --max-num-splits 200

# Then I was supposed to run:
# conda activate rxnmapper
# python -m rxnutils.data.mapping_pipeline run --data-prefix uspto --nbatches 200  --max-workers 8 --max-num-splits 200
# But that didn't work, so I only ran the first part.
# It still failed even after I replaced the delimiter (from tab to comma), so I gave up on the mapping step.

## Read in data cleaned by rxn utils

In [1]:
import pandas as pd

In [2]:
cleaned_USPTO = pd.read_csv('/Users/dsw46/USPTO_data/uspto_data_cleaned.csv', sep='\t')  # tab-separated file
cleaned_USPTO.shape

(3740596, 7)

In [3]:
cleaned_USPTO['ReactionSmilesClean'][0]

'OCCBr.CCS(=O)(=O)Cl.CCOCC.CCN(CC)CC>>CCS(=O)(=O)OCCBr'
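
The cleaned reaction SMILES above has the form `reactants>agents>products`, with `.` separating the individual molecules. A minimal sketch of splitting it into its parts with plain string handling (no RDKit) is shown below. `split_reaction_smiles` is a hypothetical helper, not something from rxnutils, and it assumes there are no CXSMILES extensions after the reaction string. Note that a `.` can also separate the ions of a single salt, so each fragment is not necessarily a separate reagent.

```python
def split_reaction_smiles(rxn_smiles):
    """Split 'reactants>agents>products' into three lists of SMILES fragments."""
    parts = rxn_smiles.split('>')
    if len(parts) != 3:
        raise ValueError(
            f"expected exactly two '>' separators, got {len(parts) - 1}"
        )
    # Empty sections (e.g. no agents) become empty lists.
    return tuple([s for s in part.split('.') if s] for part in parts)


# The first cleaned reaction from the dataframe above.
rxn = 'OCCBr.CCS(=O)(=O)Cl.CCOCC.CCN(CC)CC>>CCS(=O)(=O)OCCBr'
reactants, agents, products = split_reaction_smiles(rxn)

print(reactants)
print(agents)
print(products)
```

Here the agents list comes out empty because nothing sits between the two `>` characters in this reaction.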

In [4]:
# full_USPTO = pd.read_csv('/Users/dsw46/USPTO_data/uspto_data.csv', sep='\t')
# full_USPTO.shape