<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Read-in-USPTO-from-ORD" data-toc-modified-id="Read-in-USPTO-from-ORD-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Read in USPTO from ORD</a></span><ul class="toc-item"><li><span><a href="#Preface" data-toc-modified-id="Preface-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Preface</a></span></li><li><span><a href="#Extract-USPTO-data-from-ORD" data-toc-modified-id="Extract-USPTO-data-from-ORD-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Extract USPTO data from ORD</a></span></li><li><span><a href="#Tests:-Figure-out-how-to-access-the-info-I-need-in-the-dataset-file" data-toc-modified-id="Tests:-Figure-out-how-to-access-the-info-I-need-in-the-dataset-file-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Tests: Figure out how to access the info I need in the dataset file</a></span></li></ul></li><li><span><a href="#Preprocessing-of-USPTO---Molecular-AI" data-toc-modified-id="Preprocessing-of-USPTO---Molecular-AI-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing of USPTO - Molecular AI</a></span><ul class="toc-item"><li><span><a href="#Read-in-data-cleaned-by-rxn-utils" data-toc-modified-id="Read-in-data-cleaned-by-rxn-utils-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Read in data cleaned by rxn utils</a></span></li></ul></li></ul></div>

# Read in USPTO from ORD

## Preface

In [None]:
# I tried to read USPTO data from the ord schema, e.g. the data contained through this link:
# url = "https://github.com/Open-Reaction-Database/ord-data/blob/main/data/02/ord_dataset-026684a62f91469db49c7767d16c39fb.pb.gz?raw=true"
# However, ORD reads literally EVERYTHING from USPTO, so this resulted in around 90k x 120k df, which Joe's computer
# and my laptop do not have the momory to deal with.

# There may be 90k columns, but a lot of the columns may have superfluous info, e.g. a type column = SMILES, 
# email columns etc. 
# So one possible solution would be to pre-filter the columns (delete all the unnecessary ones), 
# and then load it afterwards

# I could use the code below to do this 
# However, it's unnecessary, as Joe is parsing the original USPTO xml files!


In [None]:
# # import ord_schema
# # from ord_schema import message_helpers, validations
# # from ord_schema.proto import dataset_pb2

# # import wget

# # # url = "https://github.com/Open-Reaction-Database/ord-data/blob/main/data/02/ord_dataset-026684a62f91469db49c7767d16c39fb.pb.gz?raw=true"
# # url = "https://github.com/open-reaction-database/ord-data/blob/main/data/68/ord_dataset-68cb8b4b2b384e3d85b5b1efae58b203.pb.gz?raw=true"
# # pb = wget.download(url)

# # # Load Dataset message
# # data = message_helpers.load_message(pb, dataset_pb2.Dataset)

# rows = []
# for d in data.reactions:
#     # print(d)
#     row = message_helpers.message_to_row(d)
#     rows.append(row)
#     for k,v in row.items():
#         print(k)
#     break
# df = pd.DataFrame(rows)

In [None]:
# Or following the example here
# https://github.com/open-reaction-database/ord-schema/blob/main/examples/applications/Perera_Science_Granda_Nature_Suzuki/Granda_Perera_ml_example.ipynb

# # Download dataset from ord-data
# url = "https://github.com/open-reaction-database/ord-data/blob/main/data/68/ord_dataset-68cb8b4b2b384e3d85b5b1efae58b203.pb.gz?raw=true"
# pb = wget.download(url)

# # Load Dataset message
# data = message_helpers.load_message(pb, dataset_pb2.Dataset)

# # Ensure dataset validates
# valid_output = validations.validate_message(data)

# # Convert dataset to pandas dataframe
# df = message_helpers.messages_to_dataframe(data.reactions, drop_constant_columns=True)

# # View dataframe
# df


# # View all columns with variation in the dataset
# list(df.columns)


## Extract USPTO data from ORD

1. All of the grants USPTO data is contained here: https://github.com/open-reaction-database/ord-data
2. It is batched by year, it's best to just maintain this batching, it will make it easier to handle (each file won't get excessively large)
3. Read in the data contained in the .pb.gz file, each entry in the "list" is a reaction. Write a for loop over the "list", and extract the following from each reaction:
    - Reactants
    - Products
    - Solvents
    - Reagents
    - Catalyst
    - Temperature
    - Yield
    - Anything else?
4. Build a list for each of these, combine to a df, and then save as a paraquet file
5. repeat this for each of the 41 years (41 datasets) we have data for in USPTO. It'll probably be easiest to convert the code in this notebook into a script, and then run it automatically on each.

In [None]:
# NB: Pura requires python >=3.8, <3.10, while ord-schema requires python >=3.10
# Pura and ord-schema cannot be in same env

In [None]:
# Find the schema here
# https://github.com/open-reaction-database/ord-schema/blob/main/ord_schema/proto/reaction.proto

In [4]:
# Import modules
import ord_schema
from ord_schema import message_helpers, validations
from ord_schema.proto import dataset_pb2

import math
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import wget

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import model_selection, metrics
from glob import glob

from tqdm import tqdm

2022-12-20 10:37:32.280091: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# Download dataset from ord-data
#url = "https://github.com/open-reaction-database/ord-data/blob/main/data/68/ord_dataset-68cb8b4b2b384e3d85b5b1efae58b203.pb.gz?raw=true"
#https://github.com/open-reaction-database/ord-data
url = "https://github.com/Open-Reaction-Database/ord-data/blob/main/data/02/ord_dataset-026684a62f91469db49c7767d16c39fb.pb.gz?raw=true"
pb = wget.download(url)


  0% [                                                    ]        0 / 56336827
  0% [                                                    ]     8192 / 56336827
  0% [                                                    ]    16384 / 56336827
  0% [                                                    ]    24576 / 56336827
  0% [                                                    ]    32768 / 56336827
  0% [                                                    ]    40960 / 56336827
  0% [                                                    ]    49152 / 56336827
  0% [                                                    ]    57344 / 56336827
  0% [                                                    ]    65536 / 56336827
  0% [                                                    ]    73728 / 56336827
  0% [                                                    ]    81920 / 56336827
  0% [                                                    ]    90112 / 56336827
  0% [                    


  6% [...                                                 ]  3596288 / 56336827
  6% [...                                                 ]  3604480 / 56336827
  6% [...                                                 ]  3612672 / 56336827
  6% [...                                                 ]  3620864 / 56336827
  6% [...                                                 ]  3629056 / 56336827
  6% [...                                                 ]  3637248 / 56336827
  6% [...                                                 ]  3645440 / 56336827
  6% [...                                                 ]  3653632 / 56336827
  6% [...                                                 ]  3661824 / 56336827
  6% [...                                                 ]  3670016 / 56336827
  6% [...                                                 ]  3678208 / 56336827
  6% [...                                                 ]  3686400 / 56336827
  6% [...                 


 13% [......                                              ]  7430144 / 56336827
 13% [......                                              ]  7438336 / 56336827
 13% [......                                              ]  7446528 / 56336827
 13% [......                                              ]  7454720 / 56336827
 13% [......                                              ]  7462912 / 56336827
 13% [......                                              ]  7471104 / 56336827
 13% [......                                              ]  7479296 / 56336827
 13% [......                                              ]  7487488 / 56336827
 13% [......                                              ]  7495680 / 56336827
 13% [......                                              ]  7503872 / 56336827
 13% [......                                              ]  7512064 / 56336827
 13% [......                                              ]  7520256 / 56336827
 13% [......              


 19% [..........                                          ] 11141120 / 56336827
 19% [..........                                          ] 11149312 / 56336827
 19% [..........                                          ] 11157504 / 56336827
 19% [..........                                          ] 11165696 / 56336827
 19% [..........                                          ] 11173888 / 56336827
 19% [..........                                          ] 11182080 / 56336827
 19% [..........                                          ] 11190272 / 56336827
 19% [..........                                          ] 11198464 / 56336827
 19% [..........                                          ] 11206656 / 56336827
 19% [..........                                          ] 11214848 / 56336827
 19% [..........                                          ] 11223040 / 56336827
 19% [..........                                          ] 11231232 / 56336827
 19% [..........          


 27% [..............                                      ] 15269888 / 56336827
 27% [..............                                      ] 15278080 / 56336827
 27% [..............                                      ] 15286272 / 56336827
 27% [..............                                      ] 15294464 / 56336827
 27% [..............                                      ] 15302656 / 56336827
 27% [..............                                      ] 15310848 / 56336827
 27% [..............                                      ] 15319040 / 56336827
 27% [..............                                      ] 15327232 / 56336827
 27% [..............                                      ] 15335424 / 56336827
 27% [..............                                      ] 15343616 / 56336827
 27% [..............                                      ] 15351808 / 56336827
 27% [..............                                      ] 15360000 / 56336827
 27% [..............      


 34% [..................                                  ] 19595264 / 56336827
 34% [..................                                  ] 19603456 / 56336827
 34% [..................                                  ] 19611648 / 56336827
 34% [..................                                  ] 19619840 / 56336827
 34% [..................                                  ] 19628032 / 56336827
 34% [..................                                  ] 19636224 / 56336827
 34% [..................                                  ] 19644416 / 56336827
 34% [..................                                  ] 19652608 / 56336827
 34% [..................                                  ] 19660800 / 56336827
 34% [..................                                  ] 19668992 / 56336827
 34% [..................                                  ] 19677184 / 56336827
 34% [..................                                  ] 19685376 / 56336827
 34% [..................  


 42% [......................                              ] 23920640 / 56336827
 42% [......................                              ] 23928832 / 56336827
 42% [......................                              ] 23937024 / 56336827
 42% [......................                              ] 23945216 / 56336827
 42% [......................                              ] 23953408 / 56336827
 42% [......................                              ] 23961600 / 56336827
 42% [......................                              ] 23969792 / 56336827
 42% [......................                              ] 23977984 / 56336827
 42% [......................                              ] 23986176 / 56336827
 42% [......................                              ] 23994368 / 56336827
 42% [......................                              ] 24002560 / 56336827
 42% [......................                              ] 24010752 / 56336827
 42% [....................


 49% [.........................                           ] 27934720 / 56336827
 49% [.........................                           ] 27942912 / 56336827
 49% [.........................                           ] 27951104 / 56336827
 49% [.........................                           ] 27959296 / 56336827
 49% [.........................                           ] 27967488 / 56336827
 49% [.........................                           ] 27975680 / 56336827
 49% [.........................                           ] 27983872 / 56336827
 49% [.........................                           ] 27992064 / 56336827
 49% [.........................                           ] 28000256 / 56336827
 49% [.........................                           ] 28008448 / 56336827
 49% [.........................                           ] 28016640 / 56336827
 49% [.........................                           ] 28024832 / 56336827
 49% [....................


 57% [.............................                       ] 32137216 / 56336827
 57% [.............................                       ] 32145408 / 56336827
 57% [.............................                       ] 32153600 / 56336827
 57% [.............................                       ] 32161792 / 56336827
 57% [.............................                       ] 32169984 / 56336827
 57% [.............................                       ] 32178176 / 56336827
 57% [.............................                       ] 32186368 / 56336827
 57% [.............................                       ] 32194560 / 56336827
 57% [.............................                       ] 32202752 / 56336827
 57% [.............................                       ] 32210944 / 56336827
 57% [.............................                       ] 32219136 / 56336827
 57% [.............................                       ] 32227328 / 56336827
 57% [....................


 64% [.................................                   ] 36536320 / 56336827
 64% [.................................                   ] 36544512 / 56336827
 64% [.................................                   ] 36552704 / 56336827
 64% [.................................                   ] 36560896 / 56336827
 64% [.................................                   ] 36569088 / 56336827
 64% [.................................                   ] 36577280 / 56336827
 64% [.................................                   ] 36585472 / 56336827
 64% [.................................                   ] 36593664 / 56336827
 64% [.................................                   ] 36601856 / 56336827
 64% [.................................                   ] 36610048 / 56336827
 64% [.................................                   ] 36618240 / 56336827
 65% [.................................                   ] 36626432 / 56336827
 65% [....................


 72% [.....................................               ] 40828928 / 56336827
 72% [.....................................               ] 40837120 / 56336827
 72% [.....................................               ] 40845312 / 56336827
 72% [.....................................               ] 40853504 / 56336827
 72% [.....................................               ] 40861696 / 56336827
 72% [.....................................               ] 40869888 / 56336827
 72% [.....................................               ] 40878080 / 56336827
 72% [.....................................               ] 40886272 / 56336827
 72% [.....................................               ] 40894464 / 56336827
 72% [.....................................               ] 40902656 / 56336827
 72% [.....................................               ] 40910848 / 56336827
 72% [.....................................               ] 40919040 / 56336827
 72% [....................


 79% [.........................................           ] 44859392 / 56336827
 79% [.........................................           ] 44867584 / 56336827
 79% [.........................................           ] 44875776 / 56336827
 79% [.........................................           ] 44883968 / 56336827
 79% [.........................................           ] 44892160 / 56336827
 79% [.........................................           ] 44900352 / 56336827
 79% [.........................................           ] 44908544 / 56336827
 79% [.........................................           ] 44916736 / 56336827
 79% [.........................................           ] 44924928 / 56336827
 79% [.........................................           ] 44933120 / 56336827
 79% [.........................................           ] 44941312 / 56336827
 79% [.........................................           ] 44949504 / 56336827
 79% [....................


 87% [.............................................       ] 49373184 / 56336827
 87% [.............................................       ] 49381376 / 56336827
 87% [.............................................       ] 49389568 / 56336827
 87% [.............................................       ] 49397760 / 56336827
 87% [.............................................       ] 49405952 / 56336827
 87% [.............................................       ] 49414144 / 56336827
 87% [.............................................       ] 49422336 / 56336827
 87% [.............................................       ] 49430528 / 56336827
 87% [.............................................       ] 49438720 / 56336827
 87% [.............................................       ] 49446912 / 56336827
 87% [.............................................       ] 49455104 / 56336827
 87% [.............................................       ] 49463296 / 56336827
 87% [....................


 93% [................................................    ] 52731904 / 56336827
 93% [................................................    ] 52740096 / 56336827
 93% [................................................    ] 52748288 / 56336827
 93% [................................................    ] 52756480 / 56336827
 93% [................................................    ] 52764672 / 56336827
 93% [................................................    ] 52772864 / 56336827
 93% [................................................    ] 52781056 / 56336827
 93% [................................................    ] 52789248 / 56336827
 93% [................................................    ] 52797440 / 56336827
 93% [................................................    ] 52805632 / 56336827
 93% [................................................    ] 52813824 / 56336827
 93% [................................................    ] 52822016 / 56336827
 93% [....................

In [5]:
# Load Dataset message
pb = 'data/ORD_USPTO/ord-data/data/02/ord_dataset-02ee2261663048188cf6d85d2cc96e3f.pb.gz'
data = message_helpers.load_message(pb, dataset_pb2.Dataset)

In [19]:
valid_output = validations.validate_message(data)

[20:57:50] reactant 3 has no mapped atoms.
[20:57:50] reactant 4 has no mapped atoms.
[20:57:51] reactant 2 has no mapped atoms.
[20:57:51] reactant 3 has no mapped atoms.
[20:57:51] reactant 5 has no mapped atoms.
[20:57:51] reactant 6 has no mapped atoms.
[20:57:51] reactant 0 has no mapped atoms.
[20:57:51] reactant 2 has no mapped atoms.
[20:57:51] reactant 3 has no mapped atoms.
[20:57:51] reactant 5 has no mapped atoms.
[20:57:51] reactant 0 has no mapped atoms.
[20:57:51] reactant 2 has no mapped atoms.
[20:57:51] reactant 3 has no mapped atoms.
[20:57:51] reactant 5 has no mapped atoms.
[20:57:51] reactant 1 has no mapped atoms.
[20:57:51] reactant 2 has no mapped atoms.
[20:57:51] reactant 3 has no mapped atoms.
[20:57:51] reactant 5 has no mapped atoms.
[20:57:51] reactant 0 has no mapped atoms.
[20:57:51] reactant 2 has no mapped atoms.
[20:57:51] reactant 4 has no mapped atoms.
[20:57:51] reactant 5 has no mapped atoms.
[20:57:51] product atom-mapping number 4 found multipl

[20:57:51] reactant 2 has no mapped atoms.
[20:57:51] reactant 3 has no mapped atoms.
[20:57:51] reactant 4 has no mapped atoms.
[20:57:51] reactant 5 has no mapped atoms.
[20:57:51] reactant 6 has no mapped atoms.
[20:57:51] reactant 2 has no mapped atoms.
[20:57:51] reactant 3 has no mapped atoms.
[20:57:51] reactant 4 has no mapped atoms.
[20:57:51] reactant 5 has no mapped atoms.
[20:57:51] reactant 2 has no mapped atoms.
[20:57:51] reactant 1 has no mapped atoms.
[20:57:51] reactant 2 has no mapped atoms.
[20:57:51] reactant 3 has no mapped atoms.
[20:57:51] reactant 4 has no mapped atoms.
[20:57:51] reactant 5 has no mapped atoms.
[20:57:51] reactant 1 has no mapped atoms.
[20:57:51] reactant 1 has no mapped atoms.
[20:57:51] product atom-mapping number 24 found multiple times.
[20:57:51] product atom-mapping number 25 found multiple times.
[20:57:51] product atom-mapping number 28 found multiple times.
[20:57:51] product atom-mapping number 27 found multiple times.
[20:57:51] pr

[20:57:53] reactant 3 has no mapped atoms.
[20:57:53] reactant 4 has no mapped atoms.
[20:57:53] reactant 5 has no mapped atoms.
[20:57:53] reactant 1 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 1 has no mapped atoms.
[20:57:53] reactant 3 has no mapped atoms.
[20:57:53] reactant 4 has no mapped atoms.
[20:57:53] reactant 5 has no mapped atoms.
[20:57:53] reactant 7 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 1 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 3 has no mapped atoms.
[20:57:53] reactant 4 has no mapped atoms.
[20:57:53] reactant 5 has no mapped atoms.
[20:57:53] reactant 1 has no mapped atoms.
[20:57:53] reactant 1 has no mapped atoms.
[20:57:53] reactant 3 has no mapped atoms.
[20:57:53] reactant 4 has no mapped atoms.
[20:57:53] 

[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 1 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 3 has no mapped atoms.
[20:57:53] reactant 4 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 1 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 3 has no mapped atoms.
[20:57:53] reactant 4 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 3 has no mapped atoms.
[20:57:53] reactant 1 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 3 has no mapped atoms.
[20:57:53] reactant 4 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:53] reactant 1 has no mapped atoms.
[20:57:53] reactant 2 has no mapped atoms.
[20:57:54] reactant 1 has no mapped atoms.
[20:57:54] reactant 2 has no mapped atoms.
[20:57:54] reactant 1 has no mapped atoms.
[20:57:54] reactant 3 has no mapped atoms.
[20:57:54] 

[20:57:54] reactant 2 has no mapped atoms.
[20:57:54] reactant 3 has no mapped atoms.
[20:57:54] reactant 4 has no mapped atoms.
[20:57:54] reactant 5 has no mapped atoms.
[20:57:54] reactant 6 has no mapped atoms.
[20:57:54] reactant 1 has no mapped atoms.
[20:57:54] reactant 2 has no mapped atoms.
[20:57:54] reactant 3 has no mapped atoms.
[20:57:54] reactant 4 has no mapped atoms.
[20:57:54] reactant 2 has no mapped atoms.
[20:57:54] reactant 2 has no mapped atoms.
[20:57:55] reactant 2 has no mapped atoms.
[20:57:55] reactant 1 has no mapped atoms.
[20:57:55] reactant 2 has no mapped atoms.
[20:57:55] reactant 3 has no mapped atoms.
[20:57:55] reactant 1 has no mapped atoms.
[20:57:55] reactant 2 has no mapped atoms.
[20:57:55] reactant 4 has no mapped atoms.
[20:57:55] reactant 5 has no mapped atoms.
[20:57:55] reactant 6 has no mapped atoms.
[20:57:55] reactant 1 has no mapped atoms.
[20:57:55] reactant 3 has no mapped atoms.
[20:57:55] reactant 4 has no mapped atoms.
[20:57:55] 

[20:57:55] reactant 2 has no mapped atoms.
[20:57:55] reactant 3 has no mapped atoms.
[20:57:55] reactant 2 has no mapped atoms.
[20:57:55] reactant 3 has no mapped atoms.
[20:57:55] reactant 4 has no mapped atoms.
[20:57:55] reactant 0 has no mapped atoms.
[20:57:55] reactant 1 has no mapped atoms.
[20:57:55] reactant 4 has no mapped atoms.
[20:57:55] reactant 5 has no mapped atoms.
[20:57:55] reactant 6 has no mapped atoms.
[20:57:55] reactant 2 has no mapped atoms.
[20:57:55] reactant 3 has no mapped atoms.
[20:57:55] reactant 4 has no mapped atoms.
[20:57:55] reactant 2 has no mapped atoms.
[20:57:55] reactant 3 has no mapped atoms.
[20:57:55] reactant 4 has no mapped atoms.
[20:57:55] reactant 2 has no mapped atoms.
[20:57:55] reactant 3 has no mapped atoms.
[20:57:55] reactant 4 has no mapped atoms.
[20:57:55] reactant 0 has no mapped atoms.
[20:57:55] reactant 1 has no mapped atoms.
[20:57:55] reactant 4 has no mapped atoms.
[20:57:55] reactant 5 has no mapped atoms.
[20:57:55] 

[20:57:56] reactant 4 has no mapped atoms.
[20:57:56] reactant 5 has no mapped atoms.
[20:57:56] reactant 1 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 4 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 3 has no mapped atoms.
[20:57:56] reactant 4 has no mapped atoms.
[20:57:56] reactant 5 has no mapped atoms.
[20:57:56] reactant 3 has no mapped atoms.
[20:57:56] reactant 4 has no mapped atoms.
[20:57:56] reactant 5 has no mapped atoms.
[20:57:56] reactant 1 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 4 has no mapped atoms.
[20:57:56] reactant 1 has no mapped atoms.
[20:57:56] reactant 3 has no mapped atoms.
[20:57:56] reactant 1 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 1 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 1 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] 

[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 3 has no mapped atoms.
[20:57:56] reactant 4 has no mapped atoms.
[20:57:56] reactant 5 has no mapped atoms.
[20:57:56] reactant 6 has no mapped atoms.
[20:57:56] reactant 0 has no mapped atoms.
[20:57:56] reactant 3 has no mapped atoms.
[20:57:56] reactant 4 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 3 has no mapped atoms.
[20:57:56] reactant 4 has no mapped atoms.
[20:57:56] reactant 5 has no mapped atoms.
[20:57:56] reactant 6 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 1 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 3 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 3 has no mapped atoms.
[20:57:56] reactant 4 has no mapped atoms.
[20:57:56] reactant 3 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] reactant 2 has no mapped atoms.
[20:57:56] 

[20:57:58] reactant 1 has no mapped atoms.
[20:57:58] reactant 3 has no mapped atoms.
[20:57:58] reactant 4 has no mapped atoms.
[20:57:58] reactant 5 has no mapped atoms.
[20:57:58] reactant 7 has no mapped atoms.
[20:57:58] reactant 2 has no mapped atoms.
[20:57:58] reactant 2 has no mapped atoms.
[20:57:58] reactant 3 has no mapped atoms.
[20:57:58] reactant 4 has no mapped atoms.
[20:57:58] reactant 5 has no mapped atoms.
[20:57:58] reactant 6 has no mapped atoms.
[20:57:58] reactant 1 has no mapped atoms.
[20:57:58] reactant 2 has no mapped atoms.
[20:57:58] reactant 1 has no mapped atoms.
[20:57:58] reactant 2 has no mapped atoms.
[20:57:58] reactant 2 has no mapped atoms.
[20:57:58] reactant 4 has no mapped atoms.
[20:57:58] reactant 6 has no mapped atoms.
[20:57:58] reactant 3 has no mapped atoms.
[20:57:58] reactant 1 has no mapped atoms.
[20:57:58] reactant 2 has no mapped atoms.
[20:57:58] reactant 3 has no mapped atoms.
[20:57:58] reactant 1 has no mapped atoms.
[20:57:58] 

[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 3 has no mapped atoms.
[20:57:59] reactant 4 has no mapped atoms.
[20:57:59] reactant 5 has no mapped atoms.
[20:57:59] reactant 6 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 0 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 4 has no mapped atoms.
[20:57:59] reactant 5 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 0 has no mapped atoms.
[20:57:59] reactant 3 has no mapped atoms.
[20:57:59] reactant 4 has no mapped atoms.
[20:57:59] product atom-mapping number 7 found multiple times.
[20:57:59] product atom-mapping nu

[20:57:59] reactant 0 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 0 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 0 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 0 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 3 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 3 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 3 has no mapped atoms.
[20:57:59] reactant 4 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] reactant 2 has no mapped atoms.
[20:57:59] reactant 4 has no mapped atoms.
[20:57:59] reactant 5 has no mapped atoms.
[20:57:59] reactant 1 has no mapped atoms.
[20:57:59] 

[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] reactant 2 has no mapped atoms.
[20:58:00] reactant 2 has no mapped atoms.
[20:58:00] reactant 3 has no mapped atoms.
[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] reactant 2 has no mapped atoms.
[20:58:00] reactant 3 has no mapped atoms.
[20:58:00] reactant 5 has no mapped atoms.
[20:58:00] reactant 6 has no mapped atoms.
[20:58:00] reactant 2 has no mapped atoms.
[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] reactant 2 has no mapped atoms.
[20:58:00] reactant 3 has no mapped atoms.
[20:58:00] reactant 5 has no mapped atoms.
[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] reactant 3 has no mapped atoms.
[20:58:00] product atom-mapping number 1 found multiple times.
[20:58:00] product atom-mapping number 5 found multiple times.
[20:58:00] product atom-mapping number 6 found multiple times.
[20:58:00] product atom-mapping number 8 found multiple times.
[20:58:00] product atom-mapping number 12 found multiple tim

[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] reactant 2 has no mapped atoms.
[20:58:00] reactant 3 has no mapped atoms.
[20:58:00] reactant 5 has no mapped atoms.
[20:58:00] reactant 6 has no mapped atoms.
[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] reactant 2 has no mapped atoms.
[20:58:00] reactant 3 has no mapped atoms.
[20:58:00] reactant 5 has no mapped atoms.
[20:58:00] reactant 6 has no mapped atoms.
[20:58:00] reactant 7 has no mapped atoms.
[20:58:00] reactant 0 has no mapped atoms.
[20:58:00] reactant 3 has no mapped atoms.
[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] reactant 2 has no mapped atoms.
[20:58:00] reactant 3 has no mapped atoms.
[20:58:00] reactant 0 has no mapped atoms.
[20:58:00] reactant 0 has no mapped atoms.
[20:58:00] reactant 4 has no mapped atoms.
[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] reactant 2 has no mapped atoms.
[20:58:00] reactant 1 has no mapped atoms.
[20:58:00] 

[20:58:01] reactant 2 has no mapped atoms.
[20:58:01] reactant 3 has no mapped atoms.
[20:58:01] reactant 1 has no mapped atoms.
[20:58:01] reactant 0 has no mapped atoms.
[20:58:01] reactant 3 has no mapped atoms.
[20:58:01] reactant 1 has no mapped atoms.
[20:58:01] reactant 2 has no mapped atoms.
[20:58:01] reactant 3 has no mapped atoms.
[20:58:01] reactant 5 has no mapped atoms.
[20:58:01] reactant 6 has no mapped atoms.
[20:58:01] reactant 7 has no mapped atoms.
[20:58:01] reactant 1 has no mapped atoms.
[20:58:01] reactant 2 has no mapped atoms.
[20:58:01] reactant 3 has no mapped atoms.
[20:58:01] reactant 1 has no mapped atoms.
[20:58:01] reactant 2 has no mapped atoms.
[20:58:01] reactant 3 has no mapped atoms.
[20:58:01] reactant 1 has no mapped atoms.
[20:58:01] reactant 2 has no mapped atoms.
[20:58:01] reactant 3 has no mapped atoms.
[20:58:01] reactant 1 has no mapped atoms.
[20:58:01] reactant 1 has no mapped atoms.
[20:58:01] reactant 2 has no mapped atoms.
[20:58:01] 

[20:58:02] reactant 1 has no mapped atoms.
[20:58:02] product atom-mapping number 1 found multiple times.
[20:58:02] product atom-mapping number 2 found multiple times.
[20:58:02] product atom-mapping number 7 found multiple times.
[20:58:02] product atom-mapping number 6 found multiple times.
[20:58:02] product atom-mapping number 5 found multiple times.
[20:58:02] product atom-mapping number 8 found multiple times.
[20:58:02] product atom-mapping number 12 found multiple times.
[20:58:02] product atom-mapping number 13 found multiple times.
[20:58:02] product atom-mapping number 15 found multiple times.
[20:58:02] product atom-mapping number 16 found multiple times.
[20:58:02] product atom-mapping number 21 found multiple times.
[20:58:02] product atom-mapping number 20 found multiple times.
[20:58:02] product atom-mapping number 19 found multiple times.
[20:58:02] product atom-mapping number 22 found multiple times.
[20:58:02] product atom-mapping number 25 found multiple times.
[20

[20:58:02] reactant 2 has no mapped atoms.
[20:58:02] reactant 3 has no mapped atoms.
[20:58:02] reactant 4 has no mapped atoms.
[20:58:02] reactant 5 has no mapped atoms.
[20:58:02] reactant 6 has no mapped atoms.
[20:58:02] reactant 7 has no mapped atoms.
[20:58:02] reactant 1 has no mapped atoms.
[20:58:02] reactant 3 has no mapped atoms.
[20:58:02] reactant 4 has no mapped atoms.
[20:58:02] reactant 2 has no mapped atoms.
[20:58:02] reactant 3 has no mapped atoms.
[20:58:02] reactant 1 has no mapped atoms.
[20:58:02] reactant 2 has no mapped atoms.
[20:58:02] reactant 1 has no mapped atoms.
[20:58:02] reactant 3 has no mapped atoms.
[20:58:02] reactant 4 has no mapped atoms.
[20:58:02] reactant 1 has no mapped atoms.
[20:58:02] reactant 1 has no mapped atoms.
[20:58:02] reactant 2 has no mapped atoms.
[20:58:02] reactant 1 has no mapped atoms.
[20:58:03] reactant 2 has no mapped atoms.
[20:58:03] reactant 3 has no mapped atoms.
[20:58:03] reactant 4 has no mapped atoms.
[20:58:03] 

[20:58:03] reactant 0 has no mapped atoms.
[20:58:03] reactant 2 has no mapped atoms.
[20:58:03] product atom-mapping number 8 found multiple times.
[20:58:03] product atom-mapping number 12 found multiple times.
[20:58:03] product atom-mapping number 13 found multiple times.
[20:58:03] product atom-mapping number 15 found multiple times.
[20:58:03] product atom-mapping number 19 found multiple times.
[20:58:03] product atom-mapping number 20 found multiple times.
[20:58:03] product atom-mapping number 21 found multiple times.
[20:58:03] product atom-mapping number 22 found multiple times.
[20:58:03] product atom-mapping number 23 found multiple times.
[20:58:03] product atom-mapping number 24 found multiple times.
[20:58:03] product atom-mapping number 27 found multiple times.
[20:58:03] product atom-mapping number 28 found multiple times.
[20:58:03] product atom-mapping number 29 found multiple times.
[20:58:03] product atom-mapping number 30 found multiple times.
[20:58:03] product 

[20:58:04] reactant 1 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] product atom-mapping number 1 found multiple times.
[20:58:04] product atom-mapping number 16 found multiple times.
[20:58:04] product atom-mapping number 17 found multiple times.
[20:58:04] product atom-mapping number 18 found multiple times.
[20:58:04] product atom-mapping number 19 found multiple times.
[20:58:04] product atom-mapping number 20 found multiple times.
[20:58:04] product atom-mapping number 21 found multiple times.
[20:58:04] product atom-mapping number 2 found multiple times.
[20:58:04] product atom-mapping number 3 found multiple times.
[20:58:04] product atom-mapping number 4 found multiple times.
[20:58:04] product atom-mapping number 7 found multiple times.
[20:58:04] product atom-mapping number 8 found multiple times.
[20:58:04] product atom-mapping number 12 found multiple times.
[20:58:04] product atom-mapping number 11 found multiple times.
[20:58:04] product atom-

[20:58:04] reactant 1 has no mapped atoms.
[20:58:04] reactant 0 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] reactant 3 has no mapped atoms.
[20:58:04] reactant 4 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] reactant 3 has no mapped atoms.
[20:58:04] reactant 4 has no mapped atoms.
[20:58:04] reactant 1 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] reactant 1 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] reactant 1 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] reactant 1 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] reactant 1 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] reactant 3 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] reactant 3 has no mapped atoms.
[20:58:04] reactant 2 has no mapped atoms.
[20:58:04] 

[20:58:05] reactant 1 has no mapped atoms.
[20:58:05] reactant 2 has no mapped atoms.
[20:58:05] reactant 4 has no mapped atoms.
[20:58:05] reactant 5 has no mapped atoms.
[20:58:05] reactant 2 has no mapped atoms.
[20:58:05] reactant 3 has no mapped atoms.
[20:58:05] reactant 2 has no mapped atoms.
[20:58:05] reactant 2 has no mapped atoms.
[20:58:05] reactant 2 has no mapped atoms.
[20:58:05] reactant 3 has no mapped atoms.
[20:58:05] reactant 0 has no mapped atoms.
[20:58:05] reactant 2 has no mapped atoms.
[20:58:05] reactant 4 has no mapped atoms.
[20:58:05] reactant 5 has no mapped atoms.
[20:58:06] reactant 0 has no mapped atoms.
[20:58:06] reactant 1 has no mapped atoms.
[20:58:06] reactant 2 has no mapped atoms.
[20:58:06] reactant 3 has no mapped atoms.
[20:58:06] reactant 4 has no mapped atoms.
[20:58:06] reactant 2 has no mapped atoms.
[20:58:06] reactant 3 has no mapped atoms.
[20:58:06] reactant 2 has no mapped atoms.
[20:58:06] reactant 4 has no mapped atoms.
[20:58:06] 

ValidationError: Dataset.reactions[1478].inputs["m1_m2"].components[1].identifiers[2]: RDKit 2022.03.5 could not validate InChI identifier InChI=1S/CHCl/c1-2/h1H

In [None]:
# inputs
# REACTANT = 1;
# REAGENT = 2;
# SOLVENT = 3;
# CATALYST = 4;
# WORKUP = 5;
# INTERNAL_STANDARD = 6;
# AUTHENTIC_STANDARD = 7;
# PRODUCT = 8;

# temperature:
# UNSPECIFIED = 0;
# CUSTOM = 1;
# AMBIENT = 2;
# OIL_BATH = 3;
# WATER_BATH = 4;
# SAND_BATH = 5;
# ICE_BATH = 6;
# DRY_ALUMINUM_PLATE = 7;
# MICROWAVE = 8;
# DRY_ICE_BATH = 9;
# AIR_FAN = 10;
# LIQUID_NITROGEN = 11;

# structure
# inputs -> m1, m2, m3 ...
# conditions -> temperature, ...
# notes
# workups
# outcomes -> products, yield

In [42]:
res = [field.name for field in data.DESCRIPTOR.fields]
res
# can do data.name to get the year, e.g. data.name returns 'uspto-grants-2016'
# can do data.dataset_id to get the ord file name, e.g. 'ord_dataset-026684a62f91469db49c7767d16c39fb'

['name', 'description', 'reactions', 'reaction_ids', 'dataset_id']

In [6]:

def find_smiles(identifiers):
    for i in identifiers:
        if i.type == 2:
            return i.value
    for i in identifiers: #if there's no smiles, return the name
        if i.type == 6:
            return i.value
    return np.nan

reactants_all = []
reagents_all = []
products_all = []
solvents_all = []
catalysts_all = []

temperature_all = []

rxn_times_all = []

for i in range(len(data.reactions)):
    rxn = data.reactions[i]
    # handle rxn inputs: reactants, reagents etc
    reactants = []
    reagents = []
    solvents = []
    catalysts = []
    products = []
    
    temperatures = []

    rxn_times = []

    # inputs
    for key in rxn.inputs: #these are the keys in the 'dict' style data struct
        try:
            components = rxn.inputs[key].components
            for component in components:
                rxn_role = component.reaction_role #rxn role
                identifiers = component.identifiers
                smiles = find_smiles(identifiers)
                if rxn_role == 1: #reactant
                    reactants += [smiles]
                elif rxn_role ==2: #reagent
                    reagents += [smiles]
                elif rxn_role ==3: #solvent
                    solvents += [smiles]
                elif rxn_role ==4: #catalyst
                    catalysts += [smiles]
                elif rxn_role in [5,6,7]: #workup, internal standard, authentic standard. don't care about these
                    continue
                # elif rxn_role ==8: #product
                #     #products += [smiles]
                # there are no products recorded in rxn_role == 8, they're all stored in "outcomes"
        except IndexError:
            #print(i, key )
            continue

    # temperature
    try:
        temperatures +=[rxn.conditions.temperature.control.type]
    except IndexError:
        continue

    #rxn time
    try:
        rxn_times = (rxn.outcomes[0].reaction_time.value, rxn.outcomes[0].reaction_time.units)
    except IndexError:
        continue

    # products & yield
    products_obj = rxn.outcomes[0].products
    y1 = np.nan
    y2 = np.nan
    for product in products_obj:
        try:
            identifiers = product.identifiers
            product_smiles = find_smiles(identifiers)
            measurements = product.measurements
            for measurement in measurements:
                if measurement.details =="PERCENTYIELD":
                    y1 = measurement.percentage.value
                elif measurement.details =="CALCULATEDPERCENTYIELD":
                    y2 = measurement.percentage.value
            products += [(product_smiles, y1, y2)]
        except IndexError:
            continue

    reactants_all += [list(set(reactants))]
    reagents_all += [reagents] # no reagents apparently
    solvents_all += [list(set(solvents))]
    catalysts_all += [list(set(catalysts))]
    
    temperature_all = [temperatures]

    rxn_times_all += [rxn_times]
    products_all += [list(set(products))]
              

In [7]:
reactants_all[108]

['C(C1=CC=CC=C1)(=O)C1=CC=CC=C1',
 'C[Si](C)(C)[N-][Si](C)(C)C.[Li+]',
 'Cl.N12CC(C(CC1)C2)=O']

In [8]:
solvents_all[108]

['C1CCOC1']

In [10]:
products_all[1008]

[('ClC1=CC2=C(OC3C2CCCC3)C(=C1)C(=O)O', nan, nan)]

## Inspect data

In [43]:
data.name

'uspto-grants-1993_09'

In [63]:
data.reactions[108].inputs

{'m1_m2_m4': components {
  identifiers {
    type: NAME
    value: "1-azabicyclo[2.2.1]heptan-3-one hydrochloride"
  }
  identifiers {
    type: SMILES
    value: "Cl.N12CC(C(CC1)C2)=O"
  }
  identifiers {
    type: INCHI
    value: "InChI=1S/C6H9NO.ClH/c8-6-4-7-2-1-5(6)3-7;/h5H,1-4H2;1H"
  }
  amount {
    mass {
      value: 1.06
      units: GRAM
    }
  }
  reaction_role: REACTANT
}
components {
  identifiers {
    type: NAME
    value: "benzophenone"
  }
  identifiers {
    type: SMILES
    value: "C(C1=CC=CC=C1)(=O)C1=CC=CC=C1"
  }
  identifiers {
    type: INCHI
    value: "InChI=1S/C13H10O/c14-13(11-7-3-1-4-8-11)12-9-5-2-6-10-12/h1-10H"
  }
  amount {
    mass {
      value: 1.82
      units: GRAM
    }
  }
  reaction_role: REACTANT
}
components {
  identifiers {
    type: NAME
    value: "THF"
  }
  identifiers {
    type: SMILES
    value: "C1CCOC1"
  }
  identifiers {
    type: INCHI
    value: "InChI=1S/C4H8O/c1-2-4-5-3-1/h1-4H2"
  }
  amount {
    volume {
      value: 5.

In [121]:
is_mapped = data.reactions[108].identifiers[0].is_mapped
mapped_rxn = data.reactions[108].identifiers[0].value
mapped_rxn = mapped_rxn.split(' ')[0]

reactant, reagent, product = mapped_rxn_2.split('>')

# if it's been mapped, add to reactant, otherwise add to reagent
reactants = []
other = []
products = []
for r in reactant.split('.'):
    if '[' in r and ']' in r and ':' in r:
        reactants += [r]
    else:
        other += [r]

other += [reagent.split('.')]

for r in product.split('.'):
    if '[' in r and ']' in r and ':' in r:
        products += [r]
    else:
        other += [r]

In [122]:
mapped_rxn

'Cl.[N:2]12[CH2:8][CH:5]([CH2:6][CH2:7]1)[C:4](=[O:9])[CH2:3]2.[C:10]([C:18]1[CH:23]=[CH:22][CH:21]=[CH:20][CH:19]=1)(=[O:17])[C:11]1[CH:16]=[CH:15][CH:14]=[CH:13][CH:12]=1.C[Si]([N-][Si](C)(C)C)(C)C.[Li+]>C1COCC1>[C:18]1([C:10]([C:11]2[CH:12]=[CH:13][CH:14]=[CH:15][CH:16]=2)([OH:17])[CH:3]2[C:4](=[O:9])[CH:5]3[CH2:8][N:2]2[CH2:7][CH2:6]3)[CH:19]=[CH:20][CH:21]=[CH:22][CH:23]=1'

In [99]:
# if it's been mapped, add to reactant, otherwise add to reagent
reactants = []
other = []
products = []
for r in reactant.split('.'):
    if '[' in r and ']' in r and ':' in r:
        reactants += [r]
    else:
        other += [r]

other += [reagent.split('.')]

for r in product.split('.'):
    if '[' in r and ']' in r and ':' in r:
        products += [r]
    else:
        other += [r]

In [100]:
other

['Cl', 'C[Si]([N-][Si](C)(C)C)(C)C', '[Li+]', ['C1COCC1']]

In [101]:
products

['[C:18]1([C:10]([C:11]2[CH:12]=[CH:13][CH:14]=[CH:15][CH:16]=2)([OH:17])[CH:3]2[C:4](=[O:9])[CH:5]3[CH2:8][N:2]2[CH2:7][CH2:6]3)[CH:19]=[CH:20][CH:21]=[CH:22][CH:23]=1']

In [5]:
len(reactants_all)

93834

In [4]:
# combine reaction data into a df
data_df = pd.DataFrame(products_all, columns = ['product_1', 'product_2','product_3','product_4', 'product_5'])

In [5]:
data_df

Unnamed: 0,product_1,product_2,product_3,product_4,product_5
0,(FC=1C=C(C=CC1F)N1N=C(C(=C1)C1=CC(=CC=C1)C(=O)...,,,,
1,(Cl.C(CCC)OC=1C(C=C(C(C1)NC1=CC=C(C=C1)N(C(C)C...,,,,
2,"(ClC=1C(C=CC(C1)=NC1=C(C=C(C(=C1)OCCCC)N)N)=O,...",,,,
3,(NC=1C(C=C(C(C1)=N)OCCCC)=NC1=CC=C(C=C1)N(CCO)...,,,,
4,(Cl.NC=1C(C=C(C(C1)=N)OCC)=NC1=CC=C(C=C1)N(CCO...,,,,
...,...,...,...,...,...
93829,(C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3...,,,,
93830,(C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3...,,,,
93831,(C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2N(C1=CC...,,,,
93832,(C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2N(C1=CC...,,,,


## Save to pickle

In [7]:
# create the column headers for the df
def create_column_headers(df, base_string):
    column_headers = []
    for i in range(len(df.columns)):
        column_headers += [base_string+str(i)]
    return column_headers

In [36]:
headers = ['reactant_', 'reagents_',  'solvent_', 'catalyst_', 'temperature_', 'rxn_time_', 'product_']
data_lists = [reactants_all, reagents_all, solvents_all, catalysts_all, temperature_all, rxn_times_all, products_all]
for i in range(len(headers)):
    new_df = pd.DataFrame(data_lists[i])
    df_headers = create_column_headers(new_df, headers[i])
    new_df.columns = df_headers
    if i ==0:
        full_df = new_df
    else:
        full_df = pd.concat([full_df, new_df], axis=1)

In [40]:
full_df

Unnamed: 0,reactant_0,reactant_1,reactant_2,reactant_3,reactant_4,reactant_5,reactant_6,reactant_7,reactant_8,reactant_9,...,catalyst_3,catalyst_4,temperature_0,rxn_time_0,rxn_time_1,product_0,product_1,product_2,product_3,product_4
0,C([O-])([O-])=O.[K+].[K+],CC1(OB(OC1(C)C)C=1C=C(C(=O)OC)C=CC1)C,BrC=1C(=NN(C1)C1=CC(=C(C=C1)F)F)C(=O)OCC,"2,2-dicyclohexylphosphino-2″,6″-diisopropoxybi...",,,,,,,...,,,0.0,0.0,0,(FC=1C=C(C=CC1F)N1N=C(C(=C1)C1=CC(=CC=C1)C(=O)...,,,,
1,N,Cl.CN(C1=CC=C(C=C1)N)C(C)C,OO,C(C)O,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,20-V,,,,,...,,,,24.0,1,(Cl.C(CCC)OC=1C(C=C(C(C1)NC1=CC=C(C=C1)N(C(C)C...,,,,
2,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,N,S(=O)(=O)(O)OC1=C(C=C(C=C1)N)Cl,OO,,,,,,,...,,,,5.0,1,"(ClC=1C(C=CC(C1)=NC1=C(C=C(C(=C1)OCCCC)N)N)=O,...",,,,
3,N,C(C)O,OO,Cl.C(C)N(C1=CC=C(C=C1)N)C(C)C,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,,,,,,...,,,,24.0,1,(NC=1C(C=C(C(C1)=N)OCCCC)=NC1=CC=C(C=C1)N(CCO)...,,,,
4,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,C(C)N(CCO)C1=CC=C(C=C1)N=O,,,,,,,,,...,,,,3.0,1,(Cl.NC=1C(C=C(C(C1)=N)OCC)=NC1=CC=C(C=C1)N(CCO...,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93829,BrC=1C=CC=2NC3=CC=CC=C3C2C1,C([O-])([O-])=O.[Na+].[Na+],C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)B(O)O,CC1=C(C=CC=C1)P(C1=C(C=CC=C1)C)C1=C(C=CC=C1)C,,,,,,,...,,,,0.0,0,(C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3...,,,,
93830,C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3C...,BrC1=CC=C(C=C1)C=1C2=CC=CC=C2C(=C2C=CC=CC12)C1...,CC(C)([O-])C.[Na+],C(C)(C)(C)P(C(C)(C)C)C(C)(C)C,,,,,,,...,,,,0.0,0,(C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3...,,,,
93831,BrC1=CC=C(C=C1)C1=CC2=C(C3=CC=CC=C3C(=C2C=C1)C...,CC(C)([O-])C.[Na+],C(C)(C)(C)P(C(C)(C)C)C(C)(C)C,C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2NC1=CC=C...,,,,,,,...,,,,0.0,0,(C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2N(C1=CC...,,,,
93832,CC(C)([O-])C.[Na+],BrC=1C=C(C=CC1)C1=CC2=C(C3=CC=CC=C3C(=C2C=C1)C...,C(C)(C)(C)P(C(C)(C)C)C(C)(C)C,C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2NC1=CC=C...,,,,,,,...,,,,0.0,0,(C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2N(C1=CC...,,,,


In [43]:
#save to pickle
filename = data.name
full_df.to_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")

In [44]:
#unpickle
unpickled_df = pd.read_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")
unpickled_df

Unnamed: 0,reactant_0,reactant_1,reactant_2,reactant_3,reactant_4,reactant_5,reactant_6,reactant_7,reactant_8,reactant_9,...,catalyst_3,catalyst_4,temperature_0,rxn_time_0,rxn_time_1,product_0,product_1,product_2,product_3,product_4
0,C([O-])([O-])=O.[K+].[K+],CC1(OB(OC1(C)C)C=1C=C(C(=O)OC)C=CC1)C,BrC=1C(=NN(C1)C1=CC(=C(C=C1)F)F)C(=O)OCC,"2,2-dicyclohexylphosphino-2″,6″-diisopropoxybi...",,,,,,,...,,,0.0,0.0,0,(FC=1C=C(C=CC1F)N1N=C(C(=C1)C1=CC(=CC=C1)C(=O)...,,,,
1,N,Cl.CN(C1=CC=C(C=C1)N)C(C)C,OO,C(C)O,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,20-V,,,,,...,,,,24.0,1,(Cl.C(CCC)OC=1C(C=C(C(C1)NC1=CC=C(C=C1)N(C(C)C...,,,,
2,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,N,S(=O)(=O)(O)OC1=C(C=C(C=C1)N)Cl,OO,,,,,,,...,,,,5.0,1,"(ClC=1C(C=CC(C1)=NC1=C(C=C(C(=C1)OCCCC)N)N)=O,...",,,,
3,N,C(C)O,OO,Cl.C(C)N(C1=CC=C(C=C1)N)C(C)C,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,,,,,,...,,,,24.0,1,(NC=1C(C=C(C(C1)=N)OCCCC)=NC1=CC=C(C=C1)N(CCO)...,,,,
4,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,C(C)N(CCO)C1=CC=C(C=C1)N=O,,,,,,,,,...,,,,3.0,1,(Cl.NC=1C(C=C(C(C1)=N)OCC)=NC1=CC=C(C=C1)N(CCO...,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93829,BrC=1C=CC=2NC3=CC=CC=C3C2C1,C([O-])([O-])=O.[Na+].[Na+],C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)B(O)O,CC1=C(C=CC=C1)P(C1=C(C=CC=C1)C)C1=C(C=CC=C1)C,,,,,,,...,,,,0.0,0,(C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3...,,,,
93830,C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3C...,BrC1=CC=C(C=C1)C=1C2=CC=CC=C2C(=C2C=CC=CC12)C1...,CC(C)([O-])C.[Na+],C(C)(C)(C)P(C(C)(C)C)C(C)(C)C,,,,,,,...,,,,0.0,0,(C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3...,,,,
93831,BrC1=CC=C(C=C1)C1=CC2=C(C3=CC=CC=C3C(=C2C=C1)C...,CC(C)([O-])C.[Na+],C(C)(C)(C)P(C(C)(C)C)C(C)(C)C,C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2NC1=CC=C...,,,,,,,...,,,,0.0,0,(C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2N(C1=CC...,,,,
93832,CC(C)([O-])C.[Na+],BrC=1C=C(C=CC1)C1=CC2=C(C3=CC=CC=C3C(=C2C=C1)C...,C(C)(C)(C)P(C(C)(C)C)C(C)(C)C,C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2NC1=CC=C...,,,,,,,...,,,,0.0,0,(C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2N(C1=CC...,,,,


## Full ORD to pickle workflow - v1 (without mapped rxn)

In [6]:
# Import modules
import ord_schema
from ord_schema import message_helpers, validations
from ord_schema.proto import dataset_pb2

import math
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import wget

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import model_selection, metrics
from glob import glob

from tqdm import tqdm

2022-12-11 23:37:29.558449: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Get list of all file names

In [1]:
# get list of all file names
import os

# Set the directory you want to look in
directory = "data/ORD_USPTO/ord-data/data/"

# Use os.listdir to get a list of all files in the directory
folders = os.listdir(directory)
files = []
# Use a for loop to iterate over the files and print their names
for folder in folders:
    if not folder.startswith("."):
        new_dir = directory+folder
        file_list = os.listdir(new_dir)
        # Check if the file name starts with a .
        for file in file_list:
            if not file.startswith("."):
                new_file = new_dir+'/'+file
                files += [new_file]

In [2]:
files[0]

'data/ORD_USPTO/ord-data/data/59/ord_dataset-59f453c3a3d34a89bfd97b6b8b151908.pb.gz'

### Define helper function

In [27]:
class OrdToPickle():

    def __init__(self, ord_file_path, ):
        self.ord_file_path = ord_file_path
        self.data = message_helpers.load_message(self.ord_file_path, dataset_pb2.Dataset)
        self.filename = self.data.name

    def find_smiles(self, identifiers):
        for i in identifiers:
            if i.type == 2:
                return i.value
        for i in identifiers: #if there's no smiles, return the name
            if i.type == 6:
                return i.value
        return np.nan

    def build_rxn_lists(self):
        mapped_rxn_all = []
        reactants_all = []
        reagents_all = []
        products_all = []
        solvents_all = []
        reagents_all = []
        catalysts_all = []

        temperature_all = []

        rxn_times_all = []

        for i in tqdm(range(len(self.data.reactions))):
            rxn = self.data.reactions[i]
            # handle rxn inputs: reactants, reagents etc
            reactants = []
            reagents = []
            solvents = []
            catalysts = []
            products = []
            
            temperatures = []

            rxn_times = []

            # inputs
            for key in rxn.inputs: #these are the keys in the 'dict' style data struct
                try:
                    components = rxn.inputs[key].components
                    for component in components:
                        rxn_role = component.reaction_role #rxn role
                        identifiers = component.identifiers
                        smiles = self.find_smiles(identifiers)
                        if rxn_role == 1: #reactant
                            reactants += [smiles]
                        elif rxn_role ==2: #reagent
                            reagents += [smiles]
                        elif rxn_role ==3: #solvent
                            solvents += [smiles]
                        elif rxn_role ==4: #catalyst
                            catalysts += [smiles]
                        elif rxn_role in [5,6,7]: #workup, internal standard, authentic standard. don't care about these
                            continue
                        # elif rxn_role ==8: #product
                        #     #products += [smiles]
                        # there are no products recorded in rxn_role == 8, they're all stored in "outcomes"
                except IndexError:
                    #print(i, key )
                    continue

            # temperature
            try:
                temperatures +=[rxn.conditions.temperature.control.type]
            except IndexError:
                continue

            #rxn time
            try:
                rxn_times = (rxn.outcomes[0].reaction_time.value, rxn.outcomes[0].reaction_time.units)
            except IndexError:
                continue

            # products & yield
            products_obj = rxn.outcomes[0].products
            y1 = np.nan
            y2 = np.nan
            for product in products_obj:
                try:
                    identifiers = product.identifiers
                    product_smiles = self.find_smiles(identifiers)
                    measurements = product.measurements
                    for measurement in measurements:
                        if measurement.details =="PERCENTYIELD":
                            y1 = measurement.percentage.value
                        elif measurement.details =="CALCULATEDPERCENTYIELD":
                            y2 = measurement.percentage.value
                    products += [(product_smiles, y1, y2)]
                except IndexError:
                    continue

            reactants_all += [list(set(reactants))]
            reagents_all += [reagents] # no reagents apparently
            solvents_all += [list(set(solvents))]
            catalysts_all += [list(set(catalysts))]
            
            temperature_all = [temperatures]

            rxn_times_all += [rxn_times]
            products_all += [list(set(products))]
        
        return reactants_all, reagents_all, solvents_all, catalysts_all, temperature_all, rxn_times_all, products_all

    # create the column headers for the df
    def create_column_headers(self, df, base_string):
        column_headers = []
        for i in range(len(df.columns)):
            column_headers += [base_string+str(i)]
        return column_headers
    
    def build_full_df(self):
        headers = ['reactant_', 'reagents_',  'solvent_', 'catalyst_', 'temperature_', 'rxn_time_', 'product_']
        #data_lists = [reactants_all, reagents_all, solvents_all, catalysts_all, temperature_all, rxn_times_all, products_all]
        data_lists = self.build_rxn_lists()
        for i in range(len(headers)):
            new_df = pd.DataFrame(data_lists[i])
            df_headers = self.create_column_headers(new_df, headers[i])
            new_df.columns = df_headers
            if i ==0:
                full_df = new_df
            else:
                full_df = pd.concat([full_df, new_df], axis=1)
        return full_df

    def main(self):
        if 'uspto' in self.filename:
            full_df = self.build_full_df()

            #save to pickle
            filename = self.data.name
            full_df.to_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")
        else:
            print(f'The following does not contain USPTO data: {self.filename}')



In [28]:
files[0]

'data/ORD_USPTO/ord-data/data/59/ord_dataset-59f453c3a3d34a89bfd97b6b8b151908.pb.gz'

In [29]:
instance = OrdToPickle(files[0])

In [31]:
instance.main()

100%|██████████| 1575/1575 [00:00<00:00, 35203.62it/s]


In [35]:
#unpickle
filename = instance.filename
unpickled_df = pd.read_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")
unpickled_df

Unnamed: 0,reactant_0,reactant_1,reactant_2,reactant_3,reactant_4,reactant_5,reactant_6,reactant_7,reactant_8,reactant_9,...,catalyst_5,catalyst_6,catalyst_7,temperature_0,rxn_time_0,rxn_time_1,product_0,product_1,product_2,product_3
0,O.NN,NS(=O)(=O)CC1=C(C(=O)OC)C=CC=C1,,,,,,,,,...,,,,0.0,0.0,0,"(NS(=O)(=O)CC1=C(C(=O)NN)C=CC=C1, nan, 58.2999...",,,
1,Cl,C1(=CC=CC=C1)OC(NC1=NC(=CC(=N1)OC)OC)=O,CSC1=NN=C(O1)C1=C(C=CC=C1)CS(=O)(=O)N,CCCCC=CCCCCC,"1,5-diazbicyclo[5.4.0",,,,,,...,,,,,2.0,1,(COC1=NC(=NC(=C1)OC)NC(=O)NS(=O)(=O)CC1=C(C=CC...,,,
2,C([O-])([O-])=O.[K+].[K+],CSC1=NN=C(O1)C1=C(C=CC=C1)CS(=O)(=O)N,C(CCC)N=C=O,,,,,,,,...,,,,,8.0,1,(C(CCC)NC(=O)NS(=O)(=O)CC1=C(C=CC=C1)C=1OC(=NN...,,,
3,ClC(C)(CC(CC)(C)Cl)C,C1(=CC=CC=C1O)C,[Cl-].[Al+3].[Cl-].[Cl-],ClCCl,,,,,,,...,,,,,2.0,1,"(CC=1C(=CC=2C(CCC(C2C1)(C)C)(C)C)O, 93.0, nan)",,,
4,ClC(C)(CCC(C)(C)Cl)C,[Cl-].[Al+3].[Cl-].[Cl-],C1(=CC=CC=C1)C,,,,,,,,...,,,,,1.0,1,"(CC1=CC=2C(CCC(C2C=C1)(C)C)(C)C, 98.0, 97.4000...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1570,C(C)OC(CNC([C@@H](NC(C(CN(OCC1=CC=CC=C1)C=O)CC...,,,,,,,,,,...,,,,,0.0,0,(C(C)OC(CNC([C@@H](NC(C(CN(O)C=O)CC(C)C)=O)CC(...,,,
1571,C(C(C)C)C(C(=O)O)=C,C(C1=CC=CC=C1)ON,,,,,,,,,...,,,,,0.0,0,"(C(C1=CC=CC=C1)ONCC(C(=O)O)CC(C)C, 35.0, 34.40...",,,
1572,C(=O)O,C(C1=CC=CC=C1)ONCC(C(=O)O)CC(C)C,,,,,,,,,...,,,,,0.0,0,"(C(=O)N(CC(C(=O)O)CC(C)C)OCC1=CC=CC=C1, 100.0,...",,,
1573,Cl.C(C)OC([C@@H](N)C)=O,C(C)(C)(C)OC(=O)N[C@@H](CCCCNC(=O)OCC1=CC=CC=C...,C(C)N1CCOCC1,ClC(=O)OCC(C)C,,,,,,,...,,,,,4.0,1,(C(C)OC([C@@H](NC([C@@H](NC(=O)OC(C)(C)C)CCCCN...,,,


In [37]:
# pickle everything
for file in files:
    instance = OrdToPickle(file)
    instance.main()

100%|██████████| 1575/1575 [00:00<00:00, 28305.04it/s]
100%|██████████| 1266/1266 [00:00<00:00, 28069.63it/s]
100%|██████████| 760/760 [00:00<00:00, 28698.88it/s]
100%|██████████| 3592/3592 [00:00<00:00, 31969.28it/s]
100%|██████████| 1829/1829 [00:00<00:00, 26657.20it/s]


The following does not contain USPTO data: 


100%|██████████| 1510/1510 [00:00<00:00, 29417.44it/s]
100%|██████████| 3477/3477 [00:00<00:00, 29767.36it/s]
100%|██████████| 1220/1220 [00:00<00:00, 27413.90it/s]
100%|██████████| 881/881 [00:00<00:00, 27889.43it/s]
100%|██████████| 1152/1152 [00:00<00:00, 25567.58it/s]
100%|██████████| 2772/2772 [00:00<00:00, 29679.18it/s]
100%|██████████| 2187/2187 [00:00<00:00, 25935.71it/s]
100%|██████████| 1213/1213 [00:00<00:00, 27604.44it/s]
100%|██████████| 2981/2981 [00:01<00:00, 2574.81it/s]
100%|██████████| 4930/4930 [00:00<00:00, 30402.98it/s]
100%|██████████| 1583/1583 [00:00<00:00, 28652.11it/s]


The following does not contain USPTO data: 


100%|██████████| 8523/8523 [00:00<00:00, 29069.48it/s]
100%|██████████| 4434/4434 [00:00<00:00, 27589.19it/s]


The following does not contain USPTO data: 


100%|██████████| 1429/1429 [00:00<00:00, 18810.31it/s]
100%|██████████| 5421/5421 [00:00<00:00, 14829.18it/s]
100%|██████████| 9362/9362 [00:00<00:00, 26112.84it/s]


The following does not contain USPTO data: NiCOlit


100%|██████████| 3254/3254 [00:00<00:00, 28311.20it/s]
100%|██████████| 1264/1264 [00:00<00:00, 21350.65it/s]


The following does not contain USPTO data: synthesis of islatravir by biocatalytic cascade


100%|██████████| 1368/1368 [00:00<00:00, 26842.54it/s]
100%|██████████| 3576/3576 [00:00<00:00, 30411.81it/s]
100%|██████████| 1672/1672 [00:00<00:00, 26769.16it/s]
100%|██████████| 12099/12099 [00:00<00:00, 31924.41it/s]
100%|██████████| 5298/5298 [00:00<00:00, 24450.69it/s]


The following does not contain USPTO data: Development of an automated kinetic profiling system with online HPLC for reaction optimization


100%|██████████| 2366/2366 [00:00<00:00, 27927.32it/s]
100%|██████████| 9220/9220 [00:00<00:00, 26977.53it/s]
100%|██████████| 2844/2844 [00:00<00:00, 27370.66it/s]
100%|██████████| 1351/1351 [00:00<00:00, 26909.16it/s]
100%|██████████| 755/755 [00:00<00:00, 27644.45it/s]
100%|██████████| 2936/2936 [00:00<00:00, 28630.79it/s]
100%|██████████| 1222/1222 [00:00<00:00, 28340.83it/s]
100%|██████████| 1122/1122 [00:00<00:00, 27933.98it/s]
100%|██████████| 1838/1838 [00:00<00:00, 28892.09it/s]
100%|██████████| 3188/3188 [00:00<00:00, 25051.22it/s]
100%|██████████| 1955/1955 [00:00<00:00, 30410.08it/s]
100%|██████████| 1537/1537 [00:00<00:00, 28325.57it/s]
100%|██████████| 1846/1846 [00:00<00:00, 25585.33it/s]
100%|██████████| 3762/3762 [00:00<00:00, 28458.58it/s]
100%|██████████| 894/894 [00:00<00:00, 26687.36it/s]
100%|██████████| 8816/8816 [00:00<00:00, 32276.13it/s]
100%|██████████| 990/990 [00:00<00:00, 28565.85it/s]
100%|██████████| 1590/1590 [00:00<00:00, 28000.30it/s]
100%|██████████|

The following does not contain USPTO data: Nano CN PhotoChemistry Informers Library


100%|██████████| 2260/2260 [00:00<00:00, 22567.47it/s]
100%|██████████| 2191/2191 [00:00<00:00, 25871.52it/s]
100%|██████████| 1878/1878 [00:00<00:00, 21137.48it/s]
100%|██████████| 1541/1541 [00:00<00:00, 27748.35it/s]
100%|██████████| 2855/2855 [00:00<00:00, 23776.19it/s]
100%|██████████| 5759/5759 [00:00<00:00, 23778.56it/s]
100%|██████████| 3425/3425 [00:00<00:00, 26723.76it/s]
100%|██████████| 808/808 [00:00<00:00, 21718.77it/s]
100%|██████████| 1652/1652 [00:00<00:00, 7063.78it/s]
100%|██████████| 1870/1870 [00:00<00:00, 19439.68it/s]
100%|██████████| 1603/1603 [00:00<00:00, 25333.06it/s]
100%|██████████| 784/784 [00:00<00:00, 19911.68it/s]
100%|██████████| 918/918 [00:00<00:00, 25389.35it/s]
100%|██████████| 1056/1056 [00:00<00:00, 1345.08it/s]
100%|██████████| 4613/4613 [00:00<00:00, 28337.78it/s]


The following does not contain USPTO data: Photodehalogenation_HTE_estimated_conv_at_5hr


100%|██████████| 1415/1415 [00:00<00:00, 22273.29it/s]
100%|██████████| 3020/3020 [00:01<00:00, 2693.50it/s]
100%|██████████| 634/634 [00:00<00:00, 23953.42it/s]
100%|██████████| 2466/2466 [00:00<00:00, 29043.93it/s]
100%|██████████| 1797/1797 [00:00<00:00, 26761.13it/s]
100%|██████████| 1603/1603 [00:00<00:00, 2142.75it/s]
100%|██████████| 1988/1988 [00:00<00:00, 25617.22it/s]
100%|██████████| 2283/2283 [00:00<00:00, 27310.62it/s]
100%|██████████| 993/993 [00:00<00:00, 28791.66it/s]
100%|██████████| 1430/1430 [00:00<00:00, 26680.97it/s]
100%|██████████| 1176/1176 [00:00<00:00, 24906.34it/s]
100%|██████████| 1453/1453 [00:00<00:00, 29361.89it/s]
100%|██████████| 3622/3622 [00:00<00:00, 16304.43it/s]
100%|██████████| 1440/1440 [00:00<00:00, 5206.07it/s]
100%|██████████| 1399/1399 [00:00<00:00, 25918.90it/s]
100%|██████████| 719/719 [00:00<00:00, 10167.44it/s]
100%|██████████| 1286/1286 [00:00<00:00, 27090.74it/s]
100%|██████████| 1243/1243 [00:00<00:00, 28950.81it/s]
100%|██████████| 95

The following does not contain USPTO data: 


100%|██████████| 1735/1735 [00:00<00:00, 21793.71it/s]
100%|██████████| 3808/3808 [00:00<00:00, 24386.49it/s]
100%|██████████| 3016/3016 [00:00<00:00, 26918.49it/s]
100%|██████████| 1553/1553 [00:00<00:00, 30300.62it/s]


The following does not contain USPTO data: Deoxyfluorination screen


100%|██████████| 2019/2019 [00:00<00:00, 26828.47it/s]
100%|██████████| 12098/12098 [00:00<00:00, 18547.57it/s]
100%|██████████| 1028/1028 [00:00<00:00, 23982.25it/s]
100%|██████████| 757/757 [00:00<00:00, 28820.16it/s]
100%|██████████| 1146/1146 [00:00<00:00, 29252.61it/s]
100%|██████████| 1181/1181 [00:00<00:00, 27228.40it/s]
100%|██████████| 1039/1039 [00:00<00:00, 26446.51it/s]
100%|██████████| 2053/2053 [00:00<00:00, 27201.07it/s]
100%|██████████| 1344/1344 [00:00<00:00, 26774.06it/s]
100%|██████████| 12038/12038 [00:03<00:00, 3198.00it/s]
100%|██████████| 1610/1610 [00:00<00:00, 25157.04it/s]
100%|██████████| 13865/13865 [00:00<00:00, 29985.13it/s]
100%|██████████| 851/851 [00:00<00:00, 27736.92it/s]
100%|██████████| 2955/2955 [00:00<00:00, 27192.90it/s]
100%|██████████| 3349/3349 [00:00<00:00, 21263.33it/s]
100%|██████████| 1254/1254 [00:00<00:00, 23486.17it/s]
100%|██████████| 5651/5651 [00:00<00:00, 29300.74it/s]
100%|██████████| 5027/5027 [00:00<00:00, 28222.68it/s]
100%|████

The following does not contain USPTO data: HTE Pd-catalyzed cross-coupling screen


100%|██████████| 1211/1211 [00:00<00:00, 25710.17it/s]
100%|██████████| 3725/3725 [00:00<00:00, 27054.31it/s]
100%|██████████| 3406/3406 [00:00<00:00, 29945.56it/s]
100%|██████████| 1690/1690 [00:00<00:00, 25674.14it/s]
100%|██████████| 4197/4197 [00:00<00:00, 19571.09it/s]
100%|██████████| 1917/1917 [00:00<00:00, 24015.34it/s]
100%|██████████| 7072/7072 [00:00<00:00, 27594.18it/s]
100%|██████████| 14102/14102 [00:00<00:00, 26411.00it/s]


The following does not contain USPTO data: Coupling of α-carboxyl sp3-carbons with aryl halides


100%|██████████| 3571/3571 [00:00<00:00, 31758.59it/s]
100%|██████████| 1444/1444 [00:00<00:00, 27541.40it/s]
100%|██████████| 3511/3511 [00:00<00:00, 29560.59it/s]
100%|██████████| 886/886 [00:00<00:00, 28238.25it/s]
100%|██████████| 5107/5107 [00:00<00:00, 30646.94it/s]
100%|██████████| 10803/10803 [00:00<00:00, 29869.79it/s]
100%|██████████| 4462/4462 [00:00<00:00, 28830.30it/s]
100%|██████████| 8645/8645 [00:00<00:00, 29769.60it/s]
100%|██████████| 2855/2855 [00:00<00:00, 23262.08it/s]
100%|██████████| 1841/1841 [00:00<00:00, 24659.06it/s]
100%|██████████| 10995/10995 [00:00<00:00, 31449.88it/s]
100%|██████████| 865/865 [00:00<00:00, 28011.68it/s]
100%|██████████| 743/743 [00:00<00:00, 24246.04it/s]
100%|██████████| 11561/11561 [00:00<00:00, 26402.15it/s]
100%|██████████| 4994/4994 [00:00<00:00, 25581.93it/s]
100%|██████████| 3296/3296 [00:00<00:00, 28176.26it/s]
100%|██████████| 1186/1186 [00:00<00:00, 27586.61it/s]
100%|██████████| 1379/1379 [00:00<00:00, 24582.62it/s]
100%|█████

The following does not contain USPTO data: Imidazopyridines dataset


100%|██████████| 1848/1848 [00:00<00:00, 2106.12it/s]
100%|██████████| 1375/1375 [00:00<00:00, 29291.47it/s]
100%|██████████| 1289/1289 [00:00<00:00, 1812.62it/s]
100%|██████████| 946/946 [00:00<00:00, 24363.78it/s]
100%|██████████| 973/973 [00:00<00:00, 28271.57it/s]
100%|██████████| 2569/2569 [00:00<00:00, 28825.79it/s]
100%|██████████| 4426/4426 [00:00<00:00, 29935.98it/s]
100%|██████████| 8831/8831 [00:00<00:00, 31373.02it/s]
100%|██████████| 5140/5140 [00:00<00:00, 28504.05it/s]


The following does not contain USPTO data: 


100%|██████████| 1370/1370 [00:00<00:00, 29307.91it/s]
100%|██████████| 1655/1655 [00:00<00:00, 28737.62it/s]
100%|██████████| 2296/2296 [00:00<00:00, 26551.72it/s]
100%|██████████| 1302/1302 [00:00<00:00, 27728.05it/s]
100%|██████████| 849/849 [00:00<00:00, 26953.31it/s]
100%|██████████| 2989/2989 [00:00<00:00, 30315.97it/s]
100%|██████████| 1351/1351 [00:00<00:00, 25278.28it/s]
100%|██████████| 1165/1165 [00:00<00:00, 1547.20it/s]
100%|██████████| 1267/1267 [00:00<00:00, 28245.30it/s]
100%|██████████| 5687/5687 [00:00<00:00, 27641.98it/s]
100%|██████████| 1195/1195 [00:00<00:00, 26726.57it/s]
100%|██████████| 1232/1232 [00:00<00:00, 27036.52it/s]
100%|██████████| 1268/1268 [00:00<00:00, 28477.83it/s]
100%|██████████| 870/870 [00:00<00:00, 25560.51it/s]
100%|██████████| 5948/5948 [00:00<00:00, 31294.26it/s]
100%|██████████| 1539/1539 [00:00<00:00, 25931.43it/s]
100%|██████████| 9619/9619 [00:00<00:00, 29994.60it/s]
100%|██████████| 1066/1066 [00:00<00:00, 24903.24it/s]
100%|██████████

The following does not contain USPTO data: 750 AstraZeneca ELN dataset


100%|██████████| 1769/1769 [00:00<00:00, 28074.39it/s]
100%|██████████| 1585/1585 [00:00<00:00, 27795.31it/s]
100%|██████████| 11686/11686 [00:00<00:00, 31038.84it/s]
100%|██████████| 10176/10176 [00:00<00:00, 30552.94it/s]
100%|██████████| 1365/1365 [00:00<00:00, 24140.97it/s]
100%|██████████| 978/978 [00:00<00:00, 26756.44it/s]
100%|██████████| 2687/2687 [00:00<00:00, 28477.60it/s]
100%|██████████| 3932/3932 [00:00<00:00, 28775.88it/s]
100%|██████████| 2499/2499 [00:00<00:00, 25761.78it/s]
100%|██████████| 740/740 [00:00<00:00, 27780.83it/s]
100%|██████████| 7563/7563 [00:00<00:00, 28905.46it/s]
100%|██████████| 3038/3038 [00:00<00:00, 26339.74it/s]
100%|██████████| 1601/1601 [00:00<00:00, 21273.48it/s]
100%|██████████| 3156/3156 [00:00<00:00, 26804.84it/s]
100%|██████████| 9029/9029 [00:00<00:00, 28584.73it/s]
100%|██████████| 1158/1158 [00:00<00:00, 25994.00it/s]
100%|██████████| 1898/1898 [00:00<00:00, 18197.84it/s]
100%|██████████| 1651/1651 [00:00<00:00, 26849.87it/s]
100%|█████

The following does not contain USPTO data: Validation data from https://doi.org/10.1039/C8SC04228D


100%|██████████| 894/894 [00:00<00:00, 23369.51it/s]
100%|██████████| 3168/3168 [00:01<00:00, 2624.78it/s]
100%|██████████| 2686/2686 [00:00<00:00, 26171.41it/s]
100%|██████████| 8338/8338 [00:00<00:00, 29373.32it/s]
100%|██████████| 11282/11282 [00:00<00:00, 27030.09it/s]
100%|██████████| 1433/1433 [00:00<00:00, 26919.44it/s]
100%|██████████| 1675/1675 [00:00<00:00, 17197.09it/s]
100%|██████████| 2114/2114 [00:00<00:00, 27354.72it/s]
100%|██████████| 3204/3204 [00:00<00:00, 28334.69it/s]
100%|██████████| 9876/9876 [00:00<00:00, 31288.29it/s]
100%|██████████| 1953/1953 [00:00<00:00, 24033.34it/s]
100%|██████████| 1545/1545 [00:00<00:00, 27156.30it/s]


The following does not contain USPTO data: 39 compound library from "Building a Sulfonamide Library by Eco-Friendly Flow Synthesis"


100%|██████████| 1347/1347 [00:00<00:00, 28450.06it/s]
100%|██████████| 11334/11334 [00:00<00:00, 30909.03it/s]
100%|██████████| 3630/3630 [00:00<00:00, 27826.34it/s]
100%|██████████| 1770/1770 [00:00<00:00, 27917.55it/s]
100%|██████████| 8545/8545 [00:00<00:00, 30040.10it/s]
100%|██████████| 8677/8677 [00:02<00:00, 3092.23it/s] 
100%|██████████| 1943/1943 [00:00<00:00, 22913.31it/s]
100%|██████████| 1067/1067 [00:00<00:00, 24286.37it/s]
100%|██████████| 1224/1224 [00:00<00:00, 25093.25it/s]
100%|██████████| 878/878 [00:00<00:00, 27720.81it/s]
100%|██████████| 1120/1120 [00:00<00:00, 1841.94it/s]
100%|██████████| 2997/2997 [00:00<00:00, 28596.81it/s]
100%|██████████| 1297/1297 [00:00<00:00, 27056.26it/s]
100%|██████████| 2540/2540 [00:00<00:00, 25625.22it/s]
100%|██████████| 1424/1424 [00:00<00:00, 25517.44it/s]
100%|██████████| 1233/1233 [00:00<00:00, 29454.58it/s]
100%|██████████| 3376/3376 [00:00<00:00, 29513.34it/s]
100%|██████████| 9040/9040 [00:00<00:00, 31122.19it/s]
100%|██████

The following does not contain USPTO data: Microwave-assisted Biginelli Condensation Dataset


100%|██████████| 3248/3248 [00:00<00:00, 29932.13it/s]


The following does not contain USPTO data: Copper-Catalyzed Enantioselective Hydroamination of Alkenes


100%|██████████| 11648/11648 [00:00<00:00, 29944.75it/s]
100%|██████████| 11842/11842 [00:00<00:00, 30802.45it/s]
100%|██████████| 2662/2662 [00:00<00:00, 28156.79it/s]
100%|██████████| 4397/4397 [00:00<00:00, 27542.02it/s]
100%|██████████| 7427/7427 [00:00<00:00, 30009.07it/s]
100%|██████████| 4831/4831 [00:00<00:00, 26470.07it/s]
100%|██████████| 365/365 [00:00<00:00, 23967.45it/s]
100%|██████████| 1349/1349 [00:00<00:00, 28215.81it/s]
100%|██████████| 991/991 [00:00<00:00, 25440.41it/s]
100%|██████████| 730/730 [00:00<00:00, 23991.11it/s]
100%|██████████| 5574/5574 [00:00<00:00, 30711.76it/s]
100%|██████████| 3709/3709 [00:00<00:00, 28014.85it/s]
100%|██████████| 1859/1859 [00:00<00:00, 25947.72it/s]
100%|██████████| 3590/3590 [00:00<00:00, 28034.86it/s]
100%|██████████| 10995/10995 [00:00<00:00, 31349.33it/s]
100%|██████████| 3552/3552 [00:00<00:00, 27130.25it/s]


The following does not contain USPTO data: Pd_CN_Coupling_Informer_Library


100%|██████████| 4413/4413 [00:00<00:00, 29507.73it/s]
100%|██████████| 3363/3363 [00:00<00:00, 27698.58it/s]
100%|██████████| 1300/1300 [00:00<00:00, 26530.47it/s]
100%|██████████| 2025/2025 [00:00<00:00, 22479.00it/s]
100%|██████████| 1047/1047 [00:00<00:00, 26632.52it/s]
100%|██████████| 782/782 [00:00<00:00, 28491.29it/s]
100%|██████████| 2774/2774 [00:00<00:00, 29912.18it/s]
100%|██████████| 2278/2278 [00:00<00:00, 27333.13it/s]
100%|██████████| 1798/1798 [00:00<00:00, 28412.93it/s]
100%|██████████| 4637/4637 [00:00<00:00, 29806.28it/s]
100%|██████████| 1823/1823 [00:00<00:00, 26487.92it/s]
100%|██████████| 2430/2430 [00:00<00:00, 28141.61it/s]
100%|██████████| 1157/1157 [00:00<00:00, 24613.81it/s]
100%|██████████| 13945/13945 [00:00<00:00, 31218.66it/s]
100%|██████████| 1427/1427 [00:00<00:00, 29529.77it/s]
100%|██████████| 2326/2326 [00:00<00:00, 29477.02it/s]
100%|██████████| 730/730 [00:00<00:00, 26113.34it/s]
100%|██████████| 1492/1492 [00:00<00:00, 27032.44it/s]
100%|███████

The following does not contain USPTO data: 


100%|██████████| 13323/13323 [00:00<00:00, 26362.18it/s]
100%|██████████| 12189/12189 [00:00<00:00, 31350.45it/s]
100%|██████████| 6270/6270 [00:00<00:00, 27796.87it/s]
100%|██████████| 2050/2050 [00:00<00:00, 23110.79it/s]
100%|██████████| 1235/1235 [00:00<00:00, 24701.91it/s]
100%|██████████| 10570/10570 [00:00<00:00, 31663.60it/s]
100%|██████████| 7618/7618 [00:00<00:00, 28443.96it/s]
100%|██████████| 888/888 [00:00<00:00, 20009.36it/s]
100%|██████████| 955/955 [00:00<00:00, 25185.87it/s]
100%|██████████| 17280/17280 [00:00<00:00, 29829.88it/s]
100%|██████████| 7578/7578 [00:00<00:00, 29927.83it/s]
100%|██████████| 10749/10749 [00:00<00:00, 29660.01it/s]
100%|██████████| 4393/4393 [00:00<00:00, 26473.87it/s]


The following does not contain USPTO data: Training data from https://doi.org/10.1039/C8SC04228D


100%|██████████| 1210/1210 [00:00<00:00, 20131.41it/s]
100%|██████████| 960/960 [00:00<00:00, 14944.19it/s]
100%|██████████| 1338/1338 [00:00<00:00, 24138.69it/s]
100%|██████████| 1670/1670 [00:00<00:00, 22998.56it/s]
100%|██████████| 949/949 [00:00<00:00, 23984.78it/s]
100%|██████████| 1325/1325 [00:00<00:00, 23214.28it/s]
100%|██████████| 1569/1569 [00:00<00:00, 11094.23it/s]
100%|██████████| 1778/1778 [00:00<00:00, 26305.67it/s]
100%|██████████| 849/849 [00:00<00:00, 23080.13it/s]
100%|██████████| 2544/2544 [00:00<00:00, 28100.84it/s]
100%|██████████| 4136/4136 [00:00<00:00, 30223.95it/s]
100%|██████████| 1008/1008 [00:00<00:00, 23076.32it/s]
100%|██████████| 6653/6653 [00:00<00:00, 21284.08it/s]
100%|██████████| 17639/17639 [00:00<00:00, 31143.40it/s]
100%|██████████| 4003/4003 [00:00<00:00, 29097.33it/s]
100%|██████████| 10271/10271 [00:00<00:00, 32191.52it/s]
100%|██████████| 5109/5109 [00:00<00:00, 26718.34it/s]
100%|██████████| 9344/9344 [00:00<00:00, 31062.49it/s]
100%|███████

The following does not contain USPTO data: 


100%|██████████| 755/755 [00:00<00:00, 26647.42it/s]
100%|██████████| 6134/6134 [00:00<00:00, 31740.73it/s]
100%|██████████| 12466/12466 [00:00<00:00, 31220.39it/s]
100%|██████████| 2359/2359 [00:00<00:00, 25670.91it/s]
100%|██████████| 1721/1721 [00:00<00:00, 29260.77it/s]
100%|██████████| 8038/8038 [00:00<00:00, 24719.14it/s]


The following does not contain USPTO data: Ahneman


100%|██████████| 1983/1983 [00:00<00:00, 27301.72it/s]
100%|██████████| 1296/1296 [00:00<00:00, 28098.76it/s]
100%|██████████| 1275/1275 [00:00<00:00, 29828.80it/s]
100%|██████████| 769/769 [00:00<00:00, 24829.64it/s]
100%|██████████| 11747/11747 [00:00<00:00, 32614.52it/s]
100%|██████████| 1288/1288 [00:00<00:00, 19959.96it/s]
100%|██████████| 1669/1669 [00:00<00:00, 25477.48it/s]
100%|██████████| 1497/1497 [00:00<00:00, 22726.65it/s]
100%|██████████| 2186/2186 [00:00<00:00, 29037.82it/s]
100%|██████████| 8110/8110 [00:00<00:00, 30540.26it/s]
100%|██████████| 5065/5065 [00:00<00:00, 31291.92it/s]
100%|██████████| 2284/2284 [00:00<00:00, 28633.74it/s]
100%|██████████| 925/925 [00:00<00:00, 26049.30it/s]
100%|██████████| 879/879 [00:00<00:00, 26971.54it/s]


The following does not contain USPTO data: Test data from https://doi.org/10.1039/C8SC04228D


100%|██████████| 1257/1257 [00:00<00:00, 25842.41it/s]
100%|██████████| 6796/6796 [00:00<00:00, 31221.61it/s]
100%|██████████| 2612/2612 [00:00<00:00, 30193.06it/s]
100%|██████████| 5061/5061 [00:00<00:00, 29423.53it/s]
100%|██████████| 3675/3675 [00:00<00:00, 30603.50it/s]
100%|██████████| 3229/3229 [00:00<00:00, 29154.70it/s]
100%|██████████| 1490/1490 [00:00<00:00, 27164.12it/s]
100%|██████████| 4974/4974 [00:00<00:00, 27982.17it/s]
100%|██████████| 1971/1971 [00:00<00:00, 28706.26it/s]
100%|██████████| 1279/1279 [00:00<00:00, 24411.23it/s]
100%|██████████| 931/931 [00:00<00:00, 19074.42it/s]
100%|██████████| 1820/1820 [00:00<00:00, 29853.63it/s]
100%|██████████| 1352/1352 [00:00<00:00, 27605.25it/s]
100%|██████████| 1478/1478 [00:00<00:00, 26819.10it/s]
100%|██████████| 1188/1188 [00:00<00:00, 29603.86it/s]
100%|██████████| 1537/1537 [00:00<00:00, 27114.20it/s]
100%|██████████| 9498/9498 [00:00<00:00, 32268.71it/s]


The following does not contain USPTO data: Electroreductive Coupling of Alkenyl and Benzyl Halides


100%|██████████| 3803/3803 [00:00<00:00, 31582.95it/s]
100%|██████████| 1343/1343 [00:00<00:00, 21702.59it/s]
100%|██████████| 720/720 [00:00<00:00, 1415.19it/s]
100%|██████████| 1383/1383 [00:00<00:00, 29176.63it/s]
100%|██████████| 1643/1643 [00:00<00:00, 28759.40it/s]
100%|██████████| 3038/3038 [00:00<00:00, 31231.65it/s]
100%|██████████| 1133/1133 [00:00<00:00, 25979.09it/s]
100%|██████████| 9623/9623 [00:00<00:00, 31825.37it/s]
100%|██████████| 982/982 [00:00<00:00, 27780.97it/s]
100%|██████████| 488/488 [00:00<00:00, 25915.02it/s]
100%|██████████| 3352/3352 [00:00<00:00, 26262.54it/s]
100%|██████████| 1115/1115 [00:00<00:00, 27764.48it/s]
100%|██████████| 1512/1512 [00:00<00:00, 22467.89it/s]
100%|██████████| 17188/17188 [00:05<00:00, 3122.42it/s] 
100%|██████████| 1905/1905 [00:00<00:00, 27472.38it/s]
100%|██████████| 1450/1450 [00:00<00:00, 21185.57it/s]
100%|██████████| 1784/1784 [00:00<00:00, 26073.55it/s]
100%|██████████| 4837/4837 [00:00<00:00, 30884.27it/s]
100%|██████████

# Full ORD to pickle workflow - v2 (using mapped rxn)

In [1]:
# Import modules
import ord_schema
from ord_schema import message_helpers, validations
from ord_schema.proto import dataset_pb2

import math
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import wget

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import model_selection, metrics
from glob import glob

from tqdm import tqdm

# Import pura
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType
from pura.services import PubChem, CIR, CAS

2022-12-21 20:34:00.577239: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Get list of all file names

In [2]:
# get list of all file names
import os

# Set the directory you want to look in
directory = "data/ORD_USPTO/ord-data/data/"

# Use os.listdir to get a list of all files in the directory
folders = os.listdir(directory)
files = []
# Use a for loop to iterate over the files and print their names
for folder in folders:
    if not folder.startswith("."):
        new_dir = directory+folder
        file_list = os.listdir(new_dir)
        # Check if the file name starts with a .
        for file in file_list:
            if not file.startswith("."):
                new_file = new_dir+'/'+file
                files += [new_file]

## Define base class for pickling

In [3]:
# Should we double check that the products from the mapped reaction and the molecules marked as 'product' are actually the same? 
# I wonder which mapped reaction has 4 products...
# What do we do if there's any discrepency?...
# We should trust the mapped reaction more
# I'll create a new column for the mapped products, and then in the cleaning, filter out any molecules that have been marked as a product
#   but don't actually appear in the mapped reaction

In [4]:
class OrdToPickle():
    """
    Read in an ord file, check if it contains USPTO data, and then:
    1) Extract all the relevant data (raw): reactants, reagents, products, yields, temp, time
    2) Use Pura to sanitise all the molecules. Do it on a list at a time, it's faster!
    3) Canonicalise all the molecules
    """

    def __init__(self, ord_file_path):
        self.ord_file_path = ord_file_path
        self.data = message_helpers.load_message(self.ord_file_path, dataset_pb2.Dataset)
        self.filename = self.data.name

    def find_smiles(self, identifiers):
        for i in identifiers:
            if i.type == 2:
                smiles = i.value
                clean_smiles = self.sanitise_molecule(smiles)
                return clean_smiles
        for i in identifiers: #if there's no smiles, return the name
            if i.type == 6:
                name = i.value
                clean_smiles = self.sanitise_molecule(name)
                return clean_smiles
        return np.nan

    def clean_smiles(self, smiles):
        # remove mapping info and canonicalsie the smiles at the same time
        # converting to mol and back canonicalises the smiles string
        m = Chem.MolFromSmiles(smiles)
        for atom in m.GetAtoms():
            atom.SetAtomMapNum(0)
        cleaned_smiles = Chem.MolToSmiles(m)
        return cleaned_smiles

    #its probably a lot faster to sanitise the whole thing at the end

    def sanitise_molecule(self, smiles):
        # if smiles is None type or np.nan
        if smiles == None or smiles != smiles: 
            return np.nan
        else:
            try:
                # assume it's a smiles string
                cleaned_smiles = self.clean_smiles(smiles)
                return cleaned_smiles
            except AttributeError:
                # it might be a molecule name, let's resolve using Pura
                try:
                    
                    resolved = resolve_identifiers(
                        [smiles],
                        input_identifer_type=CompoundIdentifierType.NAME,
                        output_identifier_type=CompoundIdentifierType.SMILES,
                        services=[PubChem(autocomplete=True), CIR(), CAS()],
                        agreement=1,
                        silent=True,
                    )
                    
                    pura_smiles = resolved[0][1][0]
                    cleaned_smiles = self.clean_smiles(smiles)
                    return cleaned_smiles

                except AttributeError:
                    return smiles



    def build_rxn_lists(self):
        mapped_rxn_all = []
        reactants_all = []
        reagents_all = []
        marked_products_all = []
        mapped_products_all = []
        products_all = []
        solvents_all = []
        catalysts_all = []

        temperature_all = []

        rxn_times_all = []

        not_mapped_products_all = []

        for i in range(len(self.data.reactions)):
            rxn = self.data.reactions[i]
            # handle rxn inputs: reactants, reagents etc
            reactants = []
            reagents = []
            solvents = []
            catalysts = []
            marked_products = []
            mapped_products = []
            products = []
            not_mapped_products = []
            
            temperatures = []

            rxn_times = []

            #if reaction has been mapped, get reactant and product from the mapped reaction
            #Actually, we should only extract data from reactions that have been mapped
            is_mapped = self.data.reactions[i].identifiers[0].is_mapped
            if is_mapped:
                mapped_rxn_extended_smiles = self.data.reactions[i].identifiers[0].value
                mapped_rxn = mapped_rxn_extended_smiles.split(' ')[0]

                reactant, reagent, mapped_product = mapped_rxn.split('>')

                for r in reactant.split('.'):
                    if '[' in r and ']' in r and ':' in r:
                        reactants += [self.sanitise_molecule(r)]
                    else:
                        reagents += [self.sanitise_molecule(r)]

                reagents += [self.sanitise_molecule(r) for r in reagent.split('.')]

                for p in mapped_product.split('.'):
                    if '[' in p and ']' in p and ':' in p:
                        mapped_products += [self.sanitise_molecule(p)]
                        
                    else:
                        not_mapped_products += [self.sanitise_molecule(p)]


                # inputs
                for key in rxn.inputs: #these are the keys in the 'dict' style data struct
                    try:
                        components = rxn.inputs[key].components
                        for component in components:
                            rxn_role = component.reaction_role #rxn role
                            identifiers = component.identifiers
                            smiles = self.find_smiles(identifiers)
                            if rxn_role == 1: #reactant
                                continue # we already added reactants from mapped rxn
                                #reactants += [smiles]
                            elif rxn_role ==2: #reagent
                                reagents += [smiles]
                            elif rxn_role ==3: #solvent
                                solvents += [smiles]
                            elif rxn_role ==4: #catalyst
                                catalysts += [smiles]
                            elif rxn_role in [5,6,7]: #workup, internal standard, authentic standard. don't care about these
                                continue
                            # elif rxn_role ==8: #product
                            #     #products += [smiles]
                            # there are no products recorded in rxn_role == 8, they're all stored in "outcomes"
                    except IndexError:
                        #print(i, key )
                        continue

                # temperature
                try:
                    temperatures +=[rxn.conditions.temperature.control.type]
                except IndexError:
                    continue

                #rxn time
                try:
                    rxn_times = (rxn.outcomes[0].reaction_time.value, rxn.outcomes[0].reaction_time.units)
                except IndexError:
                    continue

                # products & yield
                products_obj = rxn.outcomes[0].products
                y1 = np.nan
                y2 = np.nan
                for marked_product in products_obj:
                    try:
                        identifiers = marked_product.identifiers
                        product_smiles = self.find_smiles(identifiers)
                        measurements = marked_product.measurements
                        for measurement in measurements:
                            if measurement.details =="PERCENTYIELD":
                                y1 = measurement.percentage.value
                            elif measurement.details =="CALCULATEDPERCENTYIELD":
                                y2 = measurement.percentage.value
                        marked_products += [(product_smiles, y1, y2)]
                    except IndexError:
                        continue
            
            # Finally, remove reagents that have been identified as a solvent or catalyst
            reagents_final = []
            for r in reagents:
                if r not in solvents and r not in catalysts:
                    reagents_final += [r]

            mapped_rxn_all += [mapped_rxn]
            reactants_all += [list(set(reactants))]
            reagents_all += [list(set(reagents_final))]
            solvents_all += [list(set(solvents))]
            catalysts_all += [list(set(catalysts))]
            
            temperature_all = [temperatures]

            rxn_times_all += [rxn_times]
            # marked_products_all += [list(set(marked_products))]
            
            # mapped_products_all += [list(set(mapped_products))]
            # not_mapped_products_all += [list(set(not_mapped_products))]

            # products logic
            # handle the products
            # for each row, I will trust the mapped product more
            # loop over the mapped products, and if the mapped product exists in the marked product
            # add the yields, else simply add (smiles, nan, nan)
            for mapped_p in mapped_products:
                added = False
                for marked_p in marked_products:
                    if mapped_p == marked_p[0]:
                        products += [marked_p]
                        added = True
                if not added:
                    products += [(mapped_p, np.nan, np.nan)]

            # if we have a product with yield, but it hasn't been mapped
            # print out a warning, as this shouldn't occur
            for marked_p in marked_products:
                if marked_p not in products and any([marked_p[1] == marked_p[1], marked_p[2] == marked_p[2]]):
                    print(i, marked_p)
            

            products_all += [products] 


        
        return mapped_rxn_all, reactants_all, reagents_all, solvents_all, catalysts_all, temperature_all, rxn_times_all, products_all

    # create the column headers for the df
    def create_column_headers(self, df, base_string):
        column_headers = []
        for i in range(len(df.columns)):
            column_headers += [base_string+str(i)]
        return column_headers
    
    def build_full_df(self):
        headers = ['mapped_rxn_', 'reactant_', 'reagents_',  'solvent_', 'catalyst_', 'temperature_', 'rxn_time_', 'product_']
        #data_lists = [mapped_rxn, reactants_all, reagents_all, solvents_all, catalysts_all, temperature_all, rxn_times_all, products_all]
        data_lists = self.build_rxn_lists()
        for i in range(len(headers)):
            new_df = pd.DataFrame(data_lists[i])
            df_headers = self.create_column_headers(new_df, headers[i])
            new_df.columns = df_headers
            if i ==0:
                full_df = new_df
            else:
                full_df = pd.concat([full_df, new_df], axis=1)
        return full_df

    def main(self):
        # This function doesn't return anything. Instead, it saves the requested data as a pickle file at the path you see below
        # So you need to unpickle the data to see the output
        if 'uspto' in self.filename:
            full_df = self.build_full_df()

            #save to pickle
            filename = self.data.name
            full_df.to_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")
        else:
            print(f'The following does not contain USPTO data: {self.filename}')



In [38]:
instance = OrdToPickle(files[0])
instance.main()

[19:36:08] SMILES Parse Error: syntax error while parsing: 1,5-diazbicyclo[5.4.0
[19:36:08] SMILES Parse Error: Failed parsing SMILES '1,5-diazbicyclo[5.4.0' for input: '1,5-diazbicyclo[5.4.0'
Batch 0 Progress: 100%|██████████| 1/1 [00:01<00:00,  1.41s/it]
Batch: 100%|██████████| 1/1 [00:01<00:00,  1.42s/it]
[19:36:10] SMILES Parse Error: syntax error while parsing: 1,5-diazbicyclo[5.4.0
[19:36:10] SMILES Parse Error: Failed parsing SMILES '1,5-diazbicyclo[5.4.0' for input: '1,5-diazbicyclo[5.4.0'
[19:36:10] SMILES Parse Error: syntax error while parsing: ice
[19:36:10] SMILES Parse Error: Failed parsing SMILES 'ice' for input: 'ice'
Batch 0 Progress: 100%|██████████| 1/1 [00:01<00:00,  1.16s/it]
Batch: 100%|██████████| 1/1 [00:01<00:00,  1.17s/it]
[19:36:11] SMILES Parse Error: syntax error while parsing: ice
[19:36:11] SMILES Parse Error: Failed parsing SMILES 'ice' for input: 'ice'
[19:36:11] SMILES Parse Error: syntax error while parsing: aryl
[19:36:11] SMILES Parse Error: Failed 

KeyboardInterrupt: 

In [7]:
%%time
# Resolve names to SMILES
resolved = resolve_identifiers(
    ["benzene", "toluene"]*5,
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    services=[PubChem(autocomplete=True), CIR(), CAS()],
    agreement=1,
    silent=True,
)

Batch 0 Progress: 100%|██████████| 10/10 [00:01<00:00,  5.74it/s]
Batch: 100%|██████████| 1/1 [00:01<00:00,  1.75s/it]

CPU times: user 108 ms, sys: 15.9 ms, total: 124 ms
Wall time: 1.76 s





In [18]:
#unpickle
filename = instance.filename
unpickled_df = pd.read_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")


In [19]:
unpickled_df.columns

Index(['mapped_rxn_0', 'reactant_0', 'reactant_1', 'reactant_2', 'reactant_3',
       'reactant_4', 'reagents_0', 'reagents_1', 'reagents_2', 'reagents_3',
       'reagents_4', 'reagents_5', 'reagents_6', 'reagents_7', 'reagents_8',
       'solvent_0', 'solvent_1', 'solvent_2', 'solvent_3', 'solvent_4',
       'solvent_5', 'catalyst_0', 'catalyst_1', 'catalyst_2', 'catalyst_3',
       'catalyst_4', 'catalyst_5', 'catalyst_6', 'catalyst_7', 'temperature_0',
       'rxn_time_0', 'rxn_time_1', 'product_0', 'product_1', 'product_2',
       'product_3', 'product_4', 'product_5', 'product_6', 'product_7',
       'product_8', 'product_9'],
      dtype='object')

In [21]:
p = unpickled_df.iloc[0]['product_0'][0]

In [34]:
mol = Chem.MolFromSmiles(p)
smiles = Chem.MolToSmiles(mol)
smiles = Chem.CanonSmiles('acetone')
# how to remove mapping info from mapped smiles

[13:47:59] SMILES Parse Error: syntax error while parsing: acetone
[13:47:59] SMILES Parse Error: Failed parsing SMILES 'acetone' for input: 'acetone'


ArgumentError: Python argument types in
    rdkit.Chem.rdmolfiles.MolToSmiles(NoneType, int)
did not match C++ signature:
    MolToSmiles(RDKit::ROMol mol, bool isomericSmiles=True, bool kekuleSmiles=False, int rootedAtAtom=-1, bool canonical=True, bool allBondsExplicit=False, bool allHsExplicit=False, bool doRandom=False)
    MolToSmiles(RDKit::ROMol mol, RDKit::SmilesWriteParams params)

In [32]:
smiles

In [24]:
p

'[NH2:1][S:2](=[O:3])(=[O:4])[CH2:5][c:6]1[c:7]([C:8](=[O:9])[NH:17][NH2:18])[cH:12][cH:13][cH:14][cH:15]1'

In [5]:
# pickle everything
for file in tqdm(files):
    instance = OrdToPickle(file)
    instance.main()

  1%|          | 6/515 [00:18<25:46,  3.04s/it]

The following does not contain USPTO data: 


  3%|▎         | 18/515 [01:07<50:23,  6.08s/it]

The following does not contain USPTO data: 


  4%|▍         | 21/515 [01:32<52:57,  6.43s/it]  

The following does not contain USPTO data: 


  5%|▍         | 25/515 [13:21<20:25:28, 150.06s/it]

The following does not contain USPTO data: NiCOlit


  5%|▌         | 27/515 [13:27<10:09:39, 74.96s/it] 

The following does not contain USPTO data: synthesis of islatravir by biocatalytic cascade


  7%|▋         | 34/515 [14:09<1:53:45, 14.19s/it] 

The following does not contain USPTO data: Development of an automated kinetic profiling system with online HPLC for reaction optimization


 16%|█▌        | 82/515 [50:19<32:46,  4.54s/it]    

The following does not contain USPTO data: Nano CN PhotoChemistry Informers Library


 19%|█▉        | 98/515 [51:16<21:49,  3.14s/it]

The following does not contain USPTO data: Photodehalogenation_HTE_estimated_conv_at_5hr


 26%|██▌       | 132/515 [53:35<25:33,  4.00s/it]

The following does not contain USPTO data: 


 27%|██▋       | 137/515 [53:52<18:30,  2.94s/it]

The following does not contain USPTO data: Deoxyfluorination screen


 37%|███▋      | 189/515 [59:47<25:26,  4.68s/it]  

The following does not contain USPTO data: HTE Pd-catalyzed cross-coupling screen


 38%|███▊      | 198/515 [1:00:49<43:33,  8.24s/it]  

The following does not contain USPTO data: Coupling of α-carboxyl sp3-carbons with aryl halides


 40%|████      | 206/515 [1:01:59<1:32:58, 18.05s/it]


KeyboardInterrupt: 

## Read in the pickle

In [2]:
import pandas as pd
from tqdm import tqdm
from os import listdir
from os.path import isfile, join
import pandas as pd
from tqdm import tqdm

In [3]:
#
#create one big df of all the pickled data
self.folder_path = 'data/ORD_USPTO/pickled_data/'
onlyfiles = [f for f in listdir(self.folder_path) if isfile(join(self.folder_path, f))]
full_df = pd.DataFrame()
for file in tqdm(onlyfiles):
    if file[0] != '.': #We don't want to try to unpickle .DS_Store
        filepath = self.folder_path+file 
        unpickled_df = pd.read_pickle(filepath)
        full_df = pd.concat([full_df, unpickled_df], ignore_index=True)


100%|██████████| 491/491 [05:35<00:00,  1.46it/s]


In [1]:
full_df

NameError: name 'full_df' is not defined

In [None]:
class USPTO_cleaning():
    def __init__(self, folder_path) -> None:
        self.folder_path = folder_path
    
    def create_full_df(self):
        #create one big df of all the pickled data
        #folder_path = 'data/ORD_USPTO/pickled_data/'
        onlyfiles = [f for f in listdir(self.folder_path) if isfile(join(self.folder_path, f))]
        full_df = pd.DataFrame()
        for file in tqdm(onlyfiles):
            if file[0] != '.': #We don't want to try to unpickle .DS_Store
                filepath = self.folder_path+file 
                unpickled_df = pd.read_pickle(filepath)
                full_df = pd.concat([full_df, unpickled_df], ignore_index=True)
        return full_df

    def clean_data(self, df):
        # 
        



In [None]:
folder_path = 'data/ORD_USPTO/pickled_data/'
full_USPTO_data_df = USPTO_cleaning(folder_path)


# Preprocessing of USPTO - Molecular AI

In [1]:
# Running code from:
# https://molecularai.github.io/reaction_utils/uspto.html

# From within the folder containing all the USPTO data

# First run:
# conda activate <rxnutilities>
#  python -m rxnutils.data.uspto.preparation_pipeline run --nbatches 200  --max-workers 8 --max-num-splits 200

# Then I was supposed to run:
# conda activate rxnmapper
# python -m rxnutils.data.mapping_pipeline run --data-prefix uspto --nbatches 200  --max-workers 8 --max-num-splits 200
# But that didn't work, so I just ran the first part 
# Even after I replaced the delimiter (from 	 to , it still failed ). I'll just give up lol

## Read in data cleaned by rxn utils

In [1]:
import pandas as pd

In [2]:
cleaned_USPTO = pd.read_csv('/Users/dsw46/USPTO_data/uspto_data_cleaned.csv', sep = '	')
cleaned_USPTO.shape

(3740596, 7)

In [3]:
cleaned_USPTO['ReactionSmilesClean'][0]

'OCCBr.CCS(=O)(=O)Cl.CCOCC.CCN(CC)CC>>CCS(=O)(=O)OCCBr'

In [4]:
#full_USPTO = pd.read_csv('/Users/dsw46/USPTO_data/uspto_data.csv', sep = '	')
#full_USPTO.shape