<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Read-in-USPTO-from-ORD" data-toc-modified-id="Read-in-USPTO-from-ORD-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Read in USPTO from ORD</a></span><ul class="toc-item"><li><span><a href="#Preface" data-toc-modified-id="Preface-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Preface</a></span></li><li><span><a href="#Extract-USPTO-data-from-ORD" data-toc-modified-id="Extract-USPTO-data-from-ORD-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Extract USPTO data from ORD</a></span></li><li><span><a href="#Tests:-Figure-out-how-to-access-the-info-I-need-in-the-dataset-file" data-toc-modified-id="Tests:-Figure-out-how-to-access-the-info-I-need-in-the-dataset-file-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Tests: Figure out how to access the info I need in the dataset file</a></span></li></ul></li><li><span><a href="#Preprocessing-of-USPTO---Molecular-AI" data-toc-modified-id="Preprocessing-of-USPTO---Molecular-AI-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing of USPTO - Molecular AI</a></span><ul class="toc-item"><li><span><a href="#Read-in-data-cleaned-by-rxn-utils" data-toc-modified-id="Read-in-data-cleaned-by-rxn-utils-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Read in data cleaned by rxn utils</a></span></li></ul></li></ul></div>

# Read in USPTO from ORD

## Extract USPTO data from ORD

1. All of the USPTO grants data is available here: https://github.com/open-reaction-database/ord-data
2. The data is batched by year. It is best to keep this batching, since it stops any single file from getting excessively large and makes the data easier to handle.
3. Read in the data contained in each .pb.gz file. Each entry in the "list" is a reaction. Loop over the "list" and extract the following from each reaction:
    - Reactants
    - Products
    - Solvents
    - Reagents
    - Catalyst
    - Temperature
    - Yield
    - Anything else?
4. Build a list for each of these, combine them into a df, and save it as a Parquet file.
5. Repeat this for each of the 41 years (41 datasets) of USPTO data. It will probably be easiest to convert the code in this notebook into a script and then run it automatically on each dataset.
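
Step 4 can be sketched in plain Python. Each reaction has a different number of reactants, products and so on, so the per-reaction lists need padding into equal-length columns (`reactant_0`, `reactant_1`, ...) before they can become a df. The helper name `pad_to_columns` and the toy values are assumptions made for illustration, not code from this notebook.

```python
def pad_to_columns(rows, prefix):
    """Turn ragged per-reaction lists into equal-length columns named prefix_0, prefix_1, ..."""
    # The widest reaction decides how many columns are needed
    width = max((len(row) for row in rows), default=0)
    columns = {f"{prefix}_{i}": [] for i in range(width)}
    for row in rows:
        for i in range(width):
            # Pad reactions with fewer entries with None
            columns[f"{prefix}_{i}"].append(row[i] if i < len(row) else None)
    return columns


# Toy input: two reactions with different numbers of reactants and products
reactants_all = [["O.NN", "NS(=O)(=O)CC1=C(C(=O)OC)C=CC=C1"], ["Cl"]]
products_all = [["NS(=O)(=O)CC1=C(C(=O)NN)C=CC=C1"], []]

table = {
    **pad_to_columns(reactants_all, "reactant"),
    **pad_to_columns(products_all, "product"),
}
print(table)
```

The resulting dict of columns can be passed to `pd.DataFrame(table)`, and the df can then be written with `DataFrame.to_parquet`, which needs a Parquet engine such as pyarrow to be installed.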

In [155]:
# Find the schema here
# https://github.com/open-reaction-database/ord-schema/blob/main/ord_schema/proto/reaction.proto

In [156]:
# Import modules
import ord_schema
from ord_schema import message_helpers, validations
from ord_schema.proto import dataset_pb2

import math
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import wget

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import model_selection, metrics
from glob import glob

from tqdm import tqdm

In [157]:
# Download dataset from ord-data
#url = "https://github.com/open-reaction-database/ord-data/blob/main/data/68/ord_dataset-68cb8b4b2b384e3d85b5b1efae58b203.pb.gz?raw=true"
#https://github.com/open-reaction-database/ord-data
url = "https://github.com/Open-Reaction-Database/ord-data/blob/main/data/02/ord_dataset-026684a62f91469db49c7767d16c39fb.pb.gz?raw=true"
pb = wget.download(url)

In [None]:
# Load Dataset message
pb = 'data/ORD_USPTO/ord-data/data/02/ord_dataset-02ee2261663048188cf6d85d2cc96e3f.pb.gz'
data = message_helpers.load_message(pb, dataset_pb2.Dataset)

In [None]:
valid_output = validations.validate_message(data)

[20:57:50] reactant 3 has no mapped atoms.
[20:57:50] reactant 4 has no mapped atoms.
[20:57:51] product atom-mapping number 24 found multiple times.

ValidationError: Dataset.reactions[1478].inputs["m1_m2"].components[1].identifiers[2]: RDKit 2022.03.5 could not validate InChI identifier InChI=1S/CHCl/c1-2/h1H
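
Validating the whole dataset in one call stops at the first bad reaction (here, reaction 1478). One possible workaround is to validate each reaction separately and collect the failures instead of aborting. The sketch below shows that pattern only: `split_valid` and `toy_validate` are hypothetical names, and the toy validator is a stand-in for a real per-reaction check (for example, a wrapper around `validations.validate_message`), not ord_schema's own logic.

```python
def split_valid(items, validate):
    """Run validate(item) on every item and return (valid_indices, failures)."""
    valid, failures = [], {}
    for index, item in enumerate(items):
        try:
            validate(item)
        except Exception as error:  # deliberately broad: any failure marks the item as invalid
            failures[index] = str(error)
        else:
            valid.append(index)
    return valid, failures


def toy_validate(inchi):
    # Hypothetical stand-in for a real validator: only checks the InChI prefix
    if not inchi.startswith("InChI=1S/"):
        raise ValueError(f"could not validate InChI identifier {inchi}")


valid, failures = split_valid(["InChI=1S/CH4/h1H4", "not-an-inchi"], toy_validate)
print(valid, failures)
```

The indices in `failures` can then be inspected by hand or dropped before extraction.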

In [None]:
# inputs
# REACTANT = 1;
# REAGENT = 2;
# SOLVENT = 3;
# CATALYST = 4;
# WORKUP = 5;
# INTERNAL_STANDARD = 6;
# AUTHENTIC_STANDARD = 7;
# PRODUCT = 8;

# temperature:
# UNSPECIFIED = 0;
# CUSTOM = 1;
# AMBIENT = 2;
# OIL_BATH = 3;
# WATER_BATH = 4;
# SAND_BATH = 5;
# ICE_BATH = 6;
# DRY_ALUMINUM_PLATE = 7;
# MICROWAVE = 8;
# DRY_ICE_BATH = 9;
# AIR_FAN = 10;
# LIQUID_NITROGEN = 11;

# structure
# inputs -> m1, m2, m3 ...
# conditions -> temperature, ...
# notes
# workups
# outcomes -> products, yield
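
The role numbers above can be given names so that the extraction code does not rely on bare integers. This is a minimal sketch: the enum values are copied from the comment above (the schema may define more roles, so check reaction.proto), and the `(role_number, smiles)` pairs are a simplified stand-in for the real ORD messages.

```python
from collections import defaultdict
from enum import IntEnum


class ReactionRole(IntEnum):
    # Values copied from the notes above; verify against reaction.proto
    REACTANT = 1
    REAGENT = 2
    SOLVENT = 3
    CATALYST = 4
    WORKUP = 5
    INTERNAL_STANDARD = 6
    AUTHENTIC_STANDARD = 7
    PRODUCT = 8


def role_name(role_number):
    """Map a role number to a lowercase name, with a fallback for unlisted values."""
    try:
        return ReactionRole(role_number).name.lower()
    except ValueError:
        return "other"


def group_by_role(components):
    """Group (role_number, smiles) pairs into a dict keyed by role name."""
    grouped = defaultdict(list)
    for role_number, smiles in components:
        grouped[role_name(role_number)].append(smiles)
    return dict(grouped)


# Toy components for a single reaction
components = [(1, "C(CCC)N=C=O"), (3, "ClCCl"), (4, "[Cl-].[Al+3].[Cl-].[Cl-]")]
grouped = group_by_role(components)
print(grouped)
```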

In [None]:
res = [field.name for field in data.DESCRIPTOR.fields]
res
# can do data.name to get the year, e.g. data.name returns 'uspto-grants-2016'
# can do data.dataset_id to get the ord file name, e.g. 'ord_dataset-026684a62f91469db49c7767d16c39fb'

['name', 'description', 'reactions', 'reaction_ids', 'dataset_id']

## Inspect data

In [None]:
data.name

'uspto-grants-1993_09'

In [None]:
#save to pickle
filename = data.name
full_df.to_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")

In [None]:
#unpickle
unpickled_df = pd.read_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")
unpickled_df

Unnamed: 0,reactant_0,reactant_1,reactant_2,reactant_3,reactant_4,reactant_5,reactant_6,reactant_7,reactant_8,reactant_9,...,catalyst_3,catalyst_4,temperature_0,rxn_time_0,rxn_time_1,product_0,product_1,product_2,product_3,product_4
0,C([O-])([O-])=O.[K+].[K+],CC1(OB(OC1(C)C)C=1C=C(C(=O)OC)C=CC1)C,BrC=1C(=NN(C1)C1=CC(=C(C=C1)F)F)C(=O)OCC,"2,2-dicyclohexylphosphino-2″,6″-diisopropoxybi...",,,,,,,...,,,0.0,0.0,0,(FC=1C=C(C=CC1F)N1N=C(C(=C1)C1=CC(=CC=C1)C(=O)...,,,,
1,N,Cl.CN(C1=CC=C(C=C1)N)C(C)C,OO,C(C)O,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,20-V,,,,,...,,,,24.0,1,(Cl.C(CCC)OC=1C(C=C(C(C1)NC1=CC=C(C=C1)N(C(C)C...,,,,
2,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,N,S(=O)(=O)(O)OC1=C(C=C(C=C1)N)Cl,OO,,,,,,,...,,,,5.0,1,"(ClC=1C(C=CC(C1)=NC1=C(C=C(C(=C1)OCCCC)N)N)=O,...",,,,
3,N,C(C)O,OO,Cl.C(C)N(C1=CC=C(C=C1)N)C(C)C,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,,,,,,...,,,,24.0,1,(NC=1C(C=C(C(C1)=N)OCCCC)=NC1=CC=C(C=C1)N(CCO)...,,,,
4,Cl.Cl.C(CCC)OC1=C(C=C(C=C1)N)N,C(C)N(CCO)C1=CC=C(C=C1)N=O,,,,,,,,,...,,,,3.0,1,(Cl.NC=1C(C=C(C(C1)=N)OCC)=NC1=CC=C(C=C1)N(CCO...,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93829,BrC=1C=CC=2NC3=CC=CC=C3C2C1,C([O-])([O-])=O.[Na+].[Na+],C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)B(O)O,CC1=C(C=CC=C1)P(C1=C(C=CC=C1)C)C1=C(C=CC=C1)C,,,,,,,...,,,,0.0,0,(C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3...,,,,
93830,C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3C...,BrC1=CC=C(C=C1)C=1C2=CC=CC=C2C(=C2C=CC=CC12)C1...,CC(C)([O-])C.[Na+],C(C)(C)(C)P(C(C)(C)C)C(C)(C)C,,,,,,,...,,,,0.0,0,(C1(=CC=CC=C1)C1=CC=CC=2C3=C(SC21)C(=CC=C3)C=3...,,,,
93831,BrC1=CC=C(C=C1)C1=CC2=C(C3=CC=CC=C3C(=C2C=C1)C...,CC(C)([O-])C.[Na+],C(C)(C)(C)P(C(C)(C)C)C(C)(C)C,C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2NC1=CC=C...,,,,,,,...,,,,0.0,0,(C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2N(C1=CC...,,,,
93832,CC(C)([O-])C.[Na+],BrC=1C=C(C=CC1)C1=CC2=C(C3=CC=CC=C3C(=C2C=C1)C...,C(C)(C)(C)P(C(C)(C)C)C(C)(C)C,C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2NC1=CC=C...,,,,,,,...,,,,0.0,0,(C1=CC=C(C=2SC3=C(C21)C=CC=C3)C=3C=CC=2N(C1=CC...,,,,


In [None]:
#unpickle
filename = instance.filename
unpickled_df = pd.read_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")
unpickled_df

Unnamed: 0,reactant_0,reactant_1,reactant_2,reactant_3,reactant_4,reactant_5,reactant_6,reactant_7,reactant_8,reactant_9,...,catalyst_5,catalyst_6,catalyst_7,temperature_0,rxn_time_0,rxn_time_1,product_0,product_1,product_2,product_3
0,O.NN,NS(=O)(=O)CC1=C(C(=O)OC)C=CC=C1,,,,,,,,,...,,,,0.0,0.0,0,"(NS(=O)(=O)CC1=C(C(=O)NN)C=CC=C1, nan, 58.2999...",,,
1,Cl,C1(=CC=CC=C1)OC(NC1=NC(=CC(=N1)OC)OC)=O,CSC1=NN=C(O1)C1=C(C=CC=C1)CS(=O)(=O)N,CCCCC=CCCCCC,"1,5-diazbicyclo[5.4.0",,,,,,...,,,,,2.0,1,(COC1=NC(=NC(=C1)OC)NC(=O)NS(=O)(=O)CC1=C(C=CC...,,,
2,C([O-])([O-])=O.[K+].[K+],CSC1=NN=C(O1)C1=C(C=CC=C1)CS(=O)(=O)N,C(CCC)N=C=O,,,,,,,,...,,,,,8.0,1,(C(CCC)NC(=O)NS(=O)(=O)CC1=C(C=CC=C1)C=1OC(=NN...,,,
3,ClC(C)(CC(CC)(C)Cl)C,C1(=CC=CC=C1O)C,[Cl-].[Al+3].[Cl-].[Cl-],ClCCl,,,,,,,...,,,,,2.0,1,"(CC=1C(=CC=2C(CCC(C2C1)(C)C)(C)C)O, 93.0, nan)",,,
4,ClC(C)(CCC(C)(C)Cl)C,[Cl-].[Al+3].[Cl-].[Cl-],C1(=CC=CC=C1)C,,,,,,,,...,,,,,1.0,1,"(CC1=CC=2C(CCC(C2C=C1)(C)C)(C)C, 98.0, 97.4000...",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1570,C(C)OC(CNC([C@@H](NC(C(CN(OCC1=CC=CC=C1)C=O)CC...,,,,,,,,,,...,,,,,0.0,0,(C(C)OC(CNC([C@@H](NC(C(CN(O)C=O)CC(C)C)=O)CC(...,,,
1571,C(C(C)C)C(C(=O)O)=C,C(C1=CC=CC=C1)ON,,,,,,,,,...,,,,,0.0,0,"(C(C1=CC=CC=C1)ONCC(C(=O)O)CC(C)C, 35.0, 34.40...",,,
1572,C(=O)O,C(C1=CC=CC=C1)ONCC(C(=O)O)CC(C)C,,,,,,,,,...,,,,,0.0,0,"(C(=O)N(CC(C(=O)O)CC(C)C)OCC1=CC=CC=C1, 100.0,...",,,
1573,Cl.C(C)OC([C@@H](N)C)=O,C(C)(C)(C)OC(=O)N[C@@H](CCCCNC(=O)OCC1=CC=CC=C...,C(C)N1CCOCC1,ClC(=O)OCC(C)C,,,,,,,...,,,,,4.0,1,(C(C)OC([C@@H](NC([C@@H](NC(=O)OC(C)(C)C)CCCCN...,,,


# Full ORD to pickle workflow - v2 (using mapped rxn)

In [23]:
# Import modules
import ord_schema
from ord_schema import message_helpers, validations
from ord_schema.proto import dataset_pb2

import math
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import wget
import pickle

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import model_selection, metrics
from glob import glob

from tqdm import tqdm

# Import pura
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType
from pura.services import PubChem, CIR, CAS

# Disable RDKit's verbose logging
import rdkit.rdBase as rkrb
import rdkit.RDLogger as rkl
logger = rkl.logger()
logger.setLevel(rkl.ERROR)
rkrb.DisableLog('rdApp.error')



## Get list of all file names

In [2]:
# Collect the path of every dataset file under the ord-data directory
import os

# Root directory of the local ord-data checkout
directory = "data/ORD_USPTO/ord-data/data/"

# Each subfolder (e.g. "02") holds the dataset files whose IDs start with that prefix
folders = os.listdir(directory)
files = []
for folder in folders:
    # Skip hidden entries (names starting with a .)
    if not folder.startswith("."):
        new_dir = directory+folder
        file_list = os.listdir(new_dir)
        for file in file_list:
            # Skip hidden files inside the subfolder as well
            if not file.startswith("."):
                new_file = new_dir+'/'+file
                files += [new_file]
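
The same listing can be sketched more compactly with `pathlib`. Note two assumptions that differ from the loop above: this version only keeps files ending in `.pb.gz`, and it returns the paths sorted. The function name `list_dataset_files` and the temporary directory layout are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def list_dataset_files(root):
    """Return the sorted paths of all non-hidden *.pb.gz files one level below root."""
    root = Path(root)
    return sorted(
        str(path)
        for path in root.glob("*/*.pb.gz")
        # pathlib's glob also matches dotfiles, so hidden entries are filtered explicitly
        if not path.name.startswith(".") and not path.parent.name.startswith(".")
    )


# Demonstration on a throwaway directory that mimics the ord-data layout
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["02/ord_dataset-a.pb.gz", "02/.hidden.pb.gz",
                "68/ord_dataset-b.pb.gz", ".git/x.pb.gz"]:
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = [Path(f).name for f in list_dataset_files(tmp)]

print(found)
```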

## Define base class for pickling

In [3]:
# Should we double-check that the products in the mapped reaction and the molecules marked as 'product' are actually the same?
# I wonder which mapped reaction has 4 products...
# What do we do if there is a discrepancy?
# We should trust the mapped reaction more.
# I'll create a new column for the mapped products and then, during cleaning, filter out any molecules that are marked
#   as a product but don't actually appear in the mapped reaction.

# Create a map of the unique molecule names with Pura and apply that map to the df.
#   This is much more efficient than applying Pura to every entry.

# For now, assume that we can trust the marked products, and add both the marked products and the mapped products
#   to the final df.

# The main function needs to be rewritten so that it outputs a list of names.
# list(set(names)) then gives the unique names, from which a dictionary can be built using Pura.
# Find the names that Pura couldn't resolve by checking whether the returned list is empty,
#   and resolve these manually.
# Finally, remap the full df using df.replace(my_dict).
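
The plan in the notes above (resolve each unique name once, then remap everything through a dict) can be sketched without any external service. `build_name_map` and `toy_resolve` are hypothetical names. The toy resolver is only a stand-in for a real lookup such as Pura, and its two entries are illustrative.

```python
def build_name_map(names, resolve):
    """Resolve each unique name once and return (mapping, unresolved)."""
    mapping, unresolved = {}, []
    for name in sorted(set(names)):
        smiles = resolve(name)
        if smiles:
            mapping[name] = smiles
        else:
            # Keep track of names that need manual resolution
            unresolved.append(name)
    return mapping, unresolved


def toy_resolve(name):
    # Hypothetical stand-in for a real name-to-SMILES resolver
    return {"water": "O", "ethanol": "CCO"}.get(name)


names = ["water", "ethanol", "water", "20-V"]
mapping, unresolved = build_name_map(names, toy_resolve)

# Apply the map; names without an entry are left unchanged
cleaned = [mapping.get(name, name) for name in names]
print(mapping, unresolved, cleaned)
```

With a DataFrame, the same dict can be applied in one call with `df.replace(mapping)`, as the notes propose.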

In [4]:
class OrdToPickle():
    """
    Read in an ord file, check if it contains USPTO data, and then:
    1) Extract all the relevant data (raw): reactants, reagents, products, yields, temp, time
    2) Use Pura to sanitise all the molecules. Do it on a list at a time, it's faster!
    3) Canonicalise all the molecules
    """

    def __init__(self, ord_file_path):
        self.ord_file_path = ord_file_path
        self.data = message_helpers.load_message(self.ord_file_path, dataset_pb2.Dataset)
        self.filename = self.data.name
        self.names_list = []

    def find_smiles(self, identifiers):
        # Prefer a SMILES identifier (enum value 2) if one is present
        for i in identifiers:
            if i.type == 2:
                smiles = self.clean_smiles(i.value)
                return smiles
        # If there's no SMILES, fall back to the name (enum value 6) and record it for later resolution
        for ii in identifiers:
            if ii.type == 6:
                name = ii.value
                self.names_list += [name]
                return name
        return None

    def clean_mapped_smiles(self, smiles):
        # remove atom-mapping info and canonicalise the SMILES at the same time
        # converting to mol and back canonicalises the SMILES string
        try:
            m = Chem.MolFromSmiles(smiles)
            for atom in m.GetAtoms():
                atom.SetAtomMapNum(0)
            cleaned_smiles = Chem.MolToSmiles(m)
            return cleaned_smiles
        except AttributeError:
            self.names_list += [smiles]
            return smiles

    def clean_smiles(self, smiles):
        # canonicalise the SMILES string (atom-map numbers are left untouched here)
        # anything RDKit cannot parse, e.g. a molecule name, is recorded in names_list and returned unchanged
        try:
            cleaned_smiles = Chem.CanonSmiles(smiles)
            return cleaned_smiles
        except Exception:
            self.names_list += [smiles]
            return smiles

    # It's probably a lot faster to sanitise everything at the end
    # NB: and to create a hash map/dict for the lookups



    def build_rxn_lists(self):
        mapped_rxn_all = []
        reactants_all = []
        reagents_all = []
        marked_products_all = []
        mapped_products_all = []
        products_all = []
        solvents_all = []
        catalysts_all = []

        temperature_all = []

        rxn_times_all = []

        not_mapped_products_all = []
        yields_all = []

        for i in range(len(self.data.reactions)):
            rxn = self.data.reactions[i]
            # handle rxn inputs: reactants, reagents etc
            reactants = []
            reagents = []
            solvents = []
            catalysts = []
            marked_products = []
            mapped_products = []
            products = []
            not_mapped_products = []
            
            temperatures = []

            rxn_times = []

            yields = []
            mapped_yields = []
            

            # Only extract data from reactions that have been mapped:
            # reactants and products are read from the mapped reaction SMILES
            mapped_rxn = None  # reset so an unmapped reaction never inherits the previous reaction's value
            is_mapped = self.data.reactions[i].identifiers[0].is_mapped
            if is_mapped:
                mapped_rxn_extended_smiles = self.data.reactions[i].identifiers[0].value
                mapped_rxn = mapped_rxn_extended_smiles.split(' ')[0]

                reactant, reagent, mapped_product = mapped_rxn.split('>')

                for r in reactant.split('.'):
                    if '[' in r and ']' in r and ':' in r:
                        reactants += [r]
                    else:
                        reagents += [r]

                reagents += [r for r in reagent.split('.')]

                for p in mapped_product.split('.'):
                    if '[' in p and ']' in p and ':' in p:
                        mapped_products += [p]
                        
                    else:
                        not_mapped_products += [p]


                # inputs
                for key in rxn.inputs: #these are the keys in the 'dict' style data struct
                    try:
                        components = rxn.inputs[key].components
                        for component in components:
                            rxn_role = component.reaction_role #rxn role
                            identifiers = component.identifiers
                            smiles = self.find_smiles(identifiers)
                            if smiles is None: # neither a SMILES nor a name was found
                                continue
                            if rxn_role == 1: #reactant
                                #reactants += [smiles]
                                # we already added reactants from mapped rxn
                                # So instead I'll add it to the reagents list
                                # A lot of the reagents seem to have been misclassified as reactants
                                # I just need to remember to remove reagents that already appear as reactants
                                #   when I do cleaning

                                reagents += [r for r in smiles.split('.')]
                            elif rxn_role ==2: #reagent
                                reagents += [r for r in smiles.split('.')]
                            elif rxn_role ==3: #solvent
                                solvents += [smiles]
                            elif rxn_role ==4: #catalyst
                                catalysts += [smiles]
                            elif rxn_role in [5,6,7]: #workup, internal standard, authentic standard. don't care about these
                                continue
                            # elif rxn_role ==8: #product
                            #     #products += [smiles]
                            # there are no products recorded in rxn_role == 8, they're all stored in "outcomes"
                    except IndexError:
                        #print(i, key )
                        continue

                # temperature (note: this records the temperature control type, not a setpoint value)
                try:
                    temperatures +=[rxn.conditions.temperature.control.type]
                except IndexError:
                    continue

                #rxn time
                try:
                    rxn_times = (rxn.outcomes[0].reaction_time.value, rxn.outcomes[0].reaction_time.units)
                except IndexError:
                    continue

                # products & yield
                products_obj = rxn.outcomes[0].products
                for marked_product in products_obj:
                    # reset per product so a yield never carries over from the previous product
                    y1 = np.nan
                    y2 = np.nan
                    try:
                        identifiers = marked_product.identifiers
                        product_smiles = self.find_smiles(identifiers)
                        measurements = marked_product.measurements
                        for measurement in measurements:
                            if measurement.details =="PERCENTYIELD":
                                y1 = measurement.percentage.value
                            elif measurement.details =="CALCULATEDPERCENTYIELD":
                                y2 = measurement.percentage.value
                        #marked_products += [(product_smiles, y1, y2)]
                        marked_products += [product_smiles]
                        # prefer the reported yield, fall back to the calculated one
                        if not np.isnan(y1):
                            yields += [y1]
                        elif not np.isnan(y2):
                            yields +=[y2]
                        else:
                            yields += [np.nan]
                    except IndexError:
                        continue
            
            #clean the smiles

            #remove reagents that are integers
            #reagents = [x for x in reagents if not (x.isdigit() or x[0] == '-' and x[1:].isdigit())]
            # I'm assuming there are no negative integers
            reagents = [x for x in reagents if not (x.isdigit())]

            reactants = [self.clean_mapped_smiles(smi) for smi in reactants]
            reagents = [self.clean_smiles(smi) for smi in reagents]
            solvents = [self.clean_smiles(smi) for smi in solvents]
            catalysts = [self.clean_smiles(smi) for smi in catalysts]

            # if the reagent exists in another list, remove it
            reagents_trimmed = []
            for reag in reagents:
                if reag not in reactants and reag not in solvents and reag not in catalysts:
                    reagents_trimmed += [reag]
            

            mapped_rxn_all += [mapped_rxn]
            reactants_all += [reactants]
            reagents_all += [list(set(reagents_trimmed))]
            solvents_all += [list(set(solvents))]
            catalysts_all += [list(set(catalysts))]
            
            temperature_all += [temperatures]

            rxn_times_all += [rxn_times]


            # products logic
            # handle the products
            # for each row, I will trust the mapped product more
            # loop over the mapped products, and if the mapped product exists in the marked product
            # add the yields, else simply add smiles and np.nan

            # canon and remove mapped info from products
            mapped_p_clean = [self.clean_mapped_smiles(p) for p in mapped_products]
            marked_p_clean = [self.clean_smiles(p) for p in marked_products]
            # What if there's a marked product that only has the correct name, but not the smiles?



            for mapped_p in mapped_p_clean:
                added = False
                for ii, marked_p in enumerate(marked_p_clean):
                    if mapped_p == marked_p and mapped_p not in products:
                        products+= [mapped_p]
                        mapped_yields += [yields[ii]]
                        added = True
                        break

                if not added and mapped_p not in products:
                    products+= [mapped_p]
                    mapped_yields += [np.nan]
            

            products_all += [products] 
            yields_all +=[mapped_yields]


        
        return mapped_rxn_all, reactants_all, reagents_all, solvents_all, catalysts_all, temperature_all, rxn_times_all, products_all, yields_all

    # create the column headers for the df
    def create_column_headers(self, df, base_string):
        column_headers = []
        for i in range(len(df.columns)):
            column_headers += [base_string+str(i)]
        return column_headers
    
    def build_full_df(self):
        headers = ['mapped_rxn_', 'reactant_', 'reagents_',  'solvent_', 'catalyst_', 'temperature_', 'rxn_time_', 'product_', 'yields_']
        #data_lists = [mapped_rxn, reactants_all, reagents_all, solvents_all, catalysts_all, temperature_all, rxn_times_all, products_all]
        data_lists = self.build_rxn_lists()
        for i in range(len(headers)):
            new_df = pd.DataFrame(data_lists[i])
            df_headers = self.create_column_headers(new_df, headers[i])
            new_df.columns = df_headers
            if i ==0:
                full_df = new_df
            else:
                full_df = pd.concat([full_df, new_df], axis=1)
        return full_df
    
    #def clean_df(self,df):
        # In the test case: data/ORD_USPTO/ord-data/data/59/ord_dataset-59f453c3a3d34a89bfd97b6b8b151908.pb.gz
        #   there is only 1 reaction with 9 catalysts, all other reactions have like 1 or 2
        #   Perhaps I should add a heuristic here: if a reaction has more than x unique catalysts, just filter out
        #   the whole reaction
        

    def main(self):
        # This function doesn't return anything. Instead, it saves the requested data as a pickle file at the path you see below
        # So you need to unpickle the data to see the output
        if 'uspto' in self.filename:
            full_df = self.build_full_df()
            #cleaned_df = self.clean_df(full_df)
            

            #save to pickle
            filename = self.data.name
            full_df.to_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")
            return self.names_list #list of the names used for molecules, as opposed to SMILES strings
        else:
            #print(f'The following does not contain USPTO data: {self.filename}')
            return [] #added for consistency
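
The role assignment inside `build_rxn_lists` can be sketched on its own, without the ORD or RDKit dependencies. This is a minimal sketch, not the class's exact logic: `split_mapped_reaction` and `ATOM_MAP` are hypothetical names, the example reaction is made up, and the regex test for an atom-map label is an assumed, slightly stricter stand-in for the `'[' in r and ']' in r and ':' in r` check above.

```python
import re

# An atom-map label looks like ":12]" at the end of a bracket atom
ATOM_MAP = re.compile(r":\d+\]")


def split_mapped_reaction(reaction_smiles):
    """Split 'reactants>agents>products' into mapped and unmapped species."""
    # Drop any extended-SMILES suffix after the first space
    core = reaction_smiles.split(" ")[0]
    left, agents, right = core.split(">")

    def partition(side):
        species = [s for s in side.split(".") if s]
        mapped = [s for s in species if ATOM_MAP.search(s)]
        unmapped = [s for s in species if not ATOM_MAP.search(s)]
        return mapped, unmapped

    reactants, unmapped_left = partition(left)
    products, unmapped_right = partition(right)
    # Unmapped species on the left-hand side are treated as reagents
    reagents = unmapped_left + [s for s in agents.split(".") if s]
    return reactants, reagents, products, unmapped_right


example = "[CH3:1][OH:2].[H-].[Na+]>C1CCOC1>[CH3:1][O-:2] |f:1.2|"
print(split_mapped_reaction(example))
```

Only species that carry at least one atom-map number count as reactants or mapped products; everything else on the left-hand side joins the reagents, mirroring the idea in the class.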



In [5]:
instance = OrdToPickle(files[1])
names = instance.main()

In [37]:
len(names)

739

In [231]:
files[0]

'data/ORD_USPTO/ord-data/data/59/ord_dataset-59f453c3a3d34a89bfd97b6b8b151908.pb.gz'

In [232]:
#unpickle
filename = instance.filename
unpickled_df = pd.read_pickle(f"data/ORD_USPTO/pickled_data/{filename}.pkl")


In [233]:
unpickled_df.columns

Index(['mapped_rxn_0', 'reactant_0', 'reactant_1', 'reactant_2', 'reactant_3',
       'reactant_4', 'reagents_0', 'reagents_1', 'reagents_2', 'reagents_3',
       'reagents_4', 'reagents_5', 'reagents_6', 'reagents_7', 'reagents_8',
       'solvent_0', 'solvent_1', 'solvent_2', 'solvent_3', 'solvent_4',
       'solvent_5', 'catalyst_0', 'catalyst_1', 'catalyst_2', 'catalyst_3',
       'catalyst_4', 'catalyst_5', 'catalyst_6', 'catalyst_7', 'temperature_0',
       'rxn_time_0', 'rxn_time_1', 'product_0', 'product_1', 'product_2',
       'product_3', 'yields_0', 'yields_1', 'yields_2', 'yields_3',
       'includes_name_0'],
      dtype='object')

In [237]:
unpickled_df.yields_0.dropna()

0        58.299999
1        25.700001
2        92.800003
3        93.000000
4        98.000000
           ...    
1570     98.000000
1571     35.000000
1572    100.000000
1573     77.000000
1574     46.000000
Name: yields_0, Length: 781, dtype: float64

In [234]:
df = unpickled_df
df = df.loc[df.includes_name_0, :]

In [184]:
unpickled_df.iloc[261]['mapped_rxn_0']

'[H-].[Na+].C(OC(C)(C)C)(=O)[CH2:4][C:5]([O:7]C(C)(C)C)=[O:6].[F:18][C:19]1[C:24]([C:25]([O:27][CH3:28])=[O:26])=[C:23]([F:29])[C:22]([F:30])=[C:21](F)[C:20]=1[F:32].Cl>CN(C)C=O.FC(F)(F)C(O)=O.C1(C)C=CC=CC=1.O.C(OCC)C.C(OCC)(=O)C>[C:5]([CH2:4][C:21]1[C:22]([F:30])=[C:23]([F:29])[C:24]([C:25]([O:27][CH3:28])=[O:26])=[C:19]([F:18])[C:20]=1[F:32])([OH:7])=[O:6]'

In [14]:
#pb = 'data/ORD_USPTO/ord-data/data/02/ord_dataset-02ee2261663048188cf6d85d2cc96e3f.pb.gz'
#data = message_helpers.load_message(pb, dataset_pb2.Dataset)
data = message_helpers.load_message(files[0], dataset_pb2.Dataset)

In [30]:
#data.reactions[0].outcomes[0].products[0].measurements
for component in data.reactions[9].inputs:
    print(component)

m10
m11
m12
m2
m3
m4
m5
m6
m7
m8
m9


In [34]:
data.reactions[9].inputs

{'m10': components {
  identifiers {
    type: NAME
    value: "ethanol"
  }
  identifiers {
    type: SMILES
    value: "C(C)O"
  }
  identifiers {
    type: INCHI
    value: "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
  }
  amount {
    moles {
      value: 0.0
      precision: 1.0
      units: MOLE
    }
  }
  reaction_role: REACTANT
}
, 'm11': components {
  identifiers {
    type: NAME
    value: "4-hydroxy-3-methoxyphenylethanol"
  }
  identifiers {
    type: SMILES
    value: "OC1=C(C=C(C=C1)C(C)O)OC"
  }
  identifiers {
    type: INCHI
    value: "InChI=1S/C9H12O3/c1-6(10)7-3-4-8(11)9(5-7)12-2/h3-6,10-11H,1-2H3"
  }
  amount {
    moles {
      value: 0.0
      precision: 1.0
      units: MOLE
    }
  }
  reaction_role: REACTANT
}
, 'm12': components {
  identifiers {
    type: NAME
    value: "CD2C12"
  }
  identifiers {
    type: SMILES
    value: "[2H]C(Cl)(Cl)[2H]"
  }
  identifiers {
    type: INCHI
    value: "InChI=1S/CH2Cl2/c2-1-3/h1H2/i1D2"
  }
  amount {
    moles {
      val

In [12]:
import multiprocessing
multiprocessing.cpu_count()

8

In [5]:
# pickle everything
names_list = []
for file in tqdm(files):
    instance = OrdToPickle(file)
    names = instance.main()
    names_list += names

#save the names_list to pickle file
with open('data/ORD_USPTO/molecule_names.pkl', 'wb') as f:
    pickle.dump(names_list, f)

  1%|          | 6/515 [01:17<1:37:59, 11.55s/it]

The following does not contain USPTO data: 


  3%|▎         | 18/515 [04:58<2:37:48, 19.05s/it]

The following does not contain USPTO data: 


  4%|▍         | 21/515 [06:54<3:34:31, 26.05s/it]

The following does not contain USPTO data: 


  5%|▍         | 25/515 [09:17<4:22:43, 32.17s/it]

The following does not contain USPTO data: NiCOlit


  5%|▌         | 27/515 [09:56<3:23:25, 25.01s/it]

The following does not contain USPTO data: synthesis of islatravir by biocatalytic cascade


  7%|▋         | 34/515 [13:19<4:02:05, 30.20s/it]

The following does not contain USPTO data: Development of an automated kinetic profiling system with online HPLC for reaction optimization


 10%|▉         | 49/515 [11:12:22<106:34:27, 823.32s/it] 


KeyboardInterrupt: 
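
Since the run above was interrupted after many hours, one option is to save the names per input file and skip files that are already done. The following is a sketch under assumptions: `process_pending` and the `.names.pkl` suffix are hypothetical, and `extract_names` stands for any callable that takes a file path and returns a list of names (for instance `lambda f: OrdToPickle(f).main()`).

```python
import pickle
from pathlib import Path


def process_pending(files, out_dir, extract_names):
    """Run `extract_names` on every file that has no saved result yet."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for file in files:
        target = out_dir / (Path(file).name + ".names.pkl")
        if target.exists():
            continue  # already finished in an earlier run
        names = extract_names(file)
        # Write to a temporary file first so an interrupt never leaves a half-written pickle
        tmp = target.with_name(target.name + ".tmp")
        with open(tmp, "wb") as f:
            pickle.dump(names, f)
        tmp.replace(target)
        processed.append(file)
    return processed
```

Re-running the same call after an interruption then only touches the files that are still missing.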

In [7]:
len(names_list)

69449

In [8]:
names_copy = names_list

In [1]:
len(list(set(names_copy)))

NameError: name 'names_copy' is not defined

## Using Pura: an example

In [31]:
resolved = resolve_identifiers(
    ['1-methyl-2,4-cyclohexane diisocyanates', 'benzene', 'djfaksdvbjkaf'],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    services=[PubChem(autocomplete=True), CIR(), CAS()],
    agreement=1,
    silent=True,
)

pura_smiles = resolved[0][1][0]
pura_smiles

Batch:   0%|          | 0/1 [00:00<?, ?it/s]ERROR 2022-12-29 00:36:52,999 cir.py:276: 'ascii' codec can't encode character '\u2212' in position 62: ordinal not in range(128)
ERROR 2022-12-29 00:36:52,999 resolvers.py:396: CompoundIdentifierType.NAME is not one of the valid identifier types for the CAS.
ERROR 2022-12-29 00:36:56,717 resolvers.py:396: CompoundIdentifierType.NAME is not one of the valid identifier types for the CAS.
ERROR 2022-12-29 00:36:56,840 resolvers.py:396: CompoundIdentifierType.NAME is not one of the valid identifier types for the CAS.
Batch 0 Progress: 100%|██████████| 3/3 [00:10<00:00,  3.55s/it]
Batch: 100%|██████████| 1/1 [00:10<00:00, 10.65s/it]


IndexError: list index out of range

ERROR 2022-12-29 00:37:00,644 cir.py:276: 'ascii' codec can't encode character '\u2212' in position 60: ordinal not in range(128)
ERROR 2022-12-29 00:37:00,644 resolvers.py:396: CompoundIdentifierType.NAME is not one of the valid identifier types for the CAS.
Batch 17 Progress: 100%|██████████| 100/100 [00:27<00:00,  3.67it/s]
Batch:   4%|▍         | 18/479 [15:50<3:25:41, 26.77s/it]ERROR 2022-12-29 00:37:10,096 resolvers.py:396: CompoundIdentifierType.NAME is not one of the valid identifier types for the CAS.
ERROR 2022-12-29 00:37:10,234 resolvers.py:396: CompoundIdentifierType.NAME is not one of the valid identifier types for the CAS.
ERROR 2022-12-29 00:37:10,238 resolvers.py:396: CompoundIdentifierType.NAME is not one of the valid identifier types for the CAS.
ERROR 2022-12-29 00:37:10,245 resolvers.py:396: CompoundIdentifierType.NAME is not one of the valid identifier types for the CAS.
ERROR 2022-12-29 00:37:10,249 resolvers.py:396: CompoundIdentifierType.NAME is not one of the 

In [66]:
resolved

[('benzene', ['c1ccccc1']),
 ('1-methyl-2,4-cyclohexane diisocyanates', ['CNS(=O)(=O)C1CCCCC1']),
 ('djfaksdvbjkaf', [])]

In [63]:
dict2 = {}
for name, smiles_list in resolved:
    # skip names the resolver could not match (empty list)
    if smiles_list:
        dict2[name] = smiles_list[0]

In [64]:
dict2

{'benzene': 'c1ccccc1',
 '1-methyl-2,4-cyclohexane diisocyanates': 'CNS(=O)(=O)C1CCCCC1'}

In [60]:
di = dict(resolved)

In [62]:
di['benzene']

['c1ccccc1']

## Read in molecule names to create big dict 

In [9]:
from os import listdir
from os.path import isfile, join
from tqdm import tqdm
import pandas as pd
import numpy as np

In [6]:
#create one big df of all the pickled data
folder_path = 'data/ORD_USPTO/molecule_names/'
onlyfiles = [f for f in listdir(folder_path) if isfile(join(folder_path, f))]
molecule_names = []
for file in tqdm(onlyfiles):
    if file[0] != '.': #We don't want to try to unpickle .DS_Store
        filepath = folder_path+file 
        molecule_names += pd.read_pickle(filepath)


100%|██████████| 489/489 [00:00<00:00, 1277.31it/s]


In [10]:
unique_molecule_names = list(set(molecule_names))
print(f"""
Molecule names: {len(molecule_names)}
Unique molecule names: {len(unique_molecule_names)}      
      """)


Molecule names: 894534
Unique molecule names: 47810      
      


In [19]:
unique_molecules_dict = {}
filtered_names = []
# Names that appear 4 times or fewer are not sent to the resolver; they are mapped to NaN
for molecule in tqdm(unique_molecule_names):
    if molecule_names.count(molecule) >4:
        filtered_names += [molecule]
    else:
        unique_molecules_dict[molecule] = np.nan 

100%|██████████| 47810/47810 [08:39<00:00, 92.07it/s]


In [20]:
len(filtered_names)

12483
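
Calling `list.count` once per unique name rescans the full list every time, which is why the cell above takes minutes. A single-pass alternative can be sketched with `collections.Counter`; `split_by_frequency` is a hypothetical helper, the example names are made up, and `min_count=5` corresponds to the `> 4` threshold used above.

```python
from collections import Counter


def split_by_frequency(names, min_count=5):
    """Return (frequent, rare) unique names; `frequent` occur at least `min_count` times."""
    counts = Counter(names)  # one pass over the full list
    frequent = [name for name, n in counts.items() if n >= min_count]
    rare = [name for name, n in counts.items() if n < min_count]
    return frequent, rare


names = ["THF"] * 6 + ["brine"] * 5 + ["compound 12a"] * 2
frequent, rare = split_by_frequency(names)
print(frequent, rare)
```

In the notebook, `frequent` would play the role of `filtered_names`, and the `rare` names would be the ones mapped to NaN in `unique_molecules_dict`.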

In [26]:
# actually, I should make the dict on all the unique names
resolved = resolve_identifiers(
    filtered_names,
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    services=[PubChem(autocomplete=True), CIR()],
    agreement=1,
    silent=True,
)


ERROR 2022-12-29 01:48:27,496 cir.py:276: 'ascii' codec can't encode character '\u2032' in position 60: ordinal not in range(128)
ERROR 2022-12-29 01:48:27,605 cir.py:276: 'ascii' codec can't encode character '\xae' in position 61: ordinal not in range(128)
ERROR 2022-12-29 01:48:29,718 cir.py:276: 'ascii' codec can't encode characters in position 60-62: ordinal not in range(128)
ERROR 2022-12-29 01:48:29,730 cir.py:276: 'ascii' codec can't encode character '\u2032' in position 66: ordinal not in range(128)
ERROR 2022-12-29 01:48:31,413 cir.py:276: 'ascii' codec can't encode character '\u2014' in position 60: ordinal not in range(128)
Batch 0 Progress: 100%|██████████| 100/100 [00:31<00:00,  3.18it/s]
Batch:   1%|          | 1/125 [00:31<1:04:55, 31.42s/it]ERROR 2022-12-29 01:48:41,609 cir.py:276: 'ascii' codec can't encode character '\u2022' in position 68: ordinal not in range(128)
ERROR 2022-12-29 01:48:41,627 cir.py:276: 'ascii' codec can't encode character '\u03b1' in position 77:

In [27]:
len(resolved)

12483

In [30]:
#save the names_resolved to pickle file
with open('data/ORD_USPTO/molecule_names_resolved.pkl', 'wb') as f:
    pickle.dump(resolved, f)

In [31]:
mol_names = pd.read_pickle('data/ORD_USPTO/molecule_names_resolved.pkl')

In [44]:
mol_names[12][1][0]

IndexError: list index out of range

In [45]:
for mol in mol_names:
    try:
        unique_molecules_dict[mol[0]] = mol[1][0]
    except IndexError:
        unique_molecules_dict[mol[0]] = np.nan

In [47]:
with open('data/ORD_USPTO/unique_molecules_dict.pkl', 'wb') as f:
    pickle.dump(unique_molecules_dict, f)

## Read in the pickle

In [49]:
import pandas as pd
from tqdm import tqdm
from os import listdir
from os.path import isfile, join

In [50]:
#create one big df of all the pickled data
folder_path = 'data/ORD_USPTO/pickled_data/'
onlyfiles = [f for f in listdir(folder_path) if isfile(join(folder_path, f))]
full_df = pd.DataFrame()
for file in tqdm(onlyfiles):
    if file[0] != '.': #We don't want to try to unpickle .DS_Store
        filepath = folder_path+file 
        unpickled_df = pd.read_pickle(filepath)
        full_df = pd.concat([full_df, unpickled_df], ignore_index=True)


100%|██████████| 490/490 [09:51<00:00,  1.21s/it]
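
Concatenating inside the loop copies the growing DataFrame on every iteration. A common alternative is to collect the pieces and concatenate once; the sketch below uses a hypothetical helper `concat_frames` and made-up frames.

```python
import pandas as pd


def concat_frames(frames):
    """Concatenate an iterable of DataFrames in a single call."""
    frames = list(frames)
    if not frames:
        return pd.DataFrame()
    # One concat over the whole list avoids re-copying the result on every iteration
    return pd.concat(frames, ignore_index=True)


parts = [
    pd.DataFrame({"reactant_0": ["CCO"], "yields_0": [58.3]}),
    pd.DataFrame({"reactant_0": ["c1ccccc1"], "solvent_0": ["O"]}),
]
combined = concat_frames(parts)
print(combined)
```

For the cell above, the input could be a generator such as `(pd.read_pickle(join(folder_path, f)) for f in onlyfiles if not f.startswith("."))`. Columns missing from a given file are filled with NaN, as in the loop version.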


In [51]:
# full USPTO
print(len(full_df))

1771032


## Sanitise full_df using unique molecules dict

In [76]:
unique_mol_df = pd.DataFrame.from_dict(unique_molecules_dict, orient='index', columns = ['smiles'])

In [80]:
temp = unique_mol_df.dropna()

In [83]:
temp[:50]

| name | smiles |
| --- | --- |
| Te | `[TeH2]` |
| propylene and 5-vinyl-2-norbornene | `C=CC.C=CC1CC2C=CC1C2` |
| C8 | `CC[C@H](C)[C@H](NC(=O)[C@H](Cc1c[nH]c2ccccc12)...` |
| Na+ | `[Na+]` |
| H20 | `CC(C)c1c(OC2CCCCC2)c2cc(Cl)ccc2[nH]c1=O` |
| diphenyl | `c1ccc(-c2ccccc2)cc1` |
| 15B | `NCCCN1CCN(CCCNC(=O)c2cc(O[C@H]3O[C@H](CO)[C@H]...` |
| magnetite | `O=[Fe].O=[Fe]O[Fe]=O` |
| sulphoxide | `CCCCCCCCS(=O)C(C)Cc1ccc2c(c1)OCO2` |
| polypropylene glycol | `CC(O)CO` |


In [53]:
len(unique_molecules_dict)

47810

In [54]:
unique_molecules_dict['1-methyl-2,4-cyclohexane diisocyanates']

nan

In [20]:
c = full_df.columns


In [24]:
for i in range(len(c)):
    print(c[i])

mapped_rxn_0
reactant_0
reactant_1
reactant_2
reactant_3
reactant_4
reactant_5
reagents_0
reagents_1
reagents_2
reagents_3
reagents_4
reagents_5
reagents_6
reagents_7
reagents_8
reagents_9
reagents_10
reagents_11
reagents_12
reagents_13
reagents_14
reagents_15
reagents_16
reagents_17
solvent_0
solvent_1
solvent_2
solvent_3
solvent_4
catalyst_0
catalyst_1
catalyst_2
catalyst_3
temperature_0
rxn_time_0
rxn_time_1
product_0
product_1
product_2
product_3
product_4
yields_0
yields_1
yields_2
yields_3
yields_4
includes_name_0
reagents_18
solvent_5
solvent_6
product_5
yields_5
reactant_6
reactant_7
reactant_8
reagents_19
reagents_20
solvent_7
reagents_21
product_6
product_7
yields_6
yields_7
reagents_22
reagents_23
reagents_24
reagents_25
reagents_26
reagents_27
reagents_28
reagents_29
reagents_30
reagents_31
reagents_32
reagents_33
reagents_34
reagents_35
reagents_36
catalyst_4
reagents_37
reagents_38
reagents_39
reagents_40
reagents_41
reagents_42
reagents_43
reagents_44
reagents_45
reagent

In [40]:
full_df['reagents_80'].dropna()

279113    1-methyl-2,4-cyclohexane diisocyanates
343068    1-methyl-2,4-cyclohexane diisocyanates
675646    1-methyl-2,4-cyclohexane diisocyanates
Name: reagents_80, dtype: object

In [49]:
only_row = full_df.iloc[279113]
print(only_row)

mapped_rxn_0    [C:1]1(N=C=O)[CH:6]=[CH:5][CH:4]=[C:3]([N:7]=[...
reactant_0                                   O=C=Nc1cccc(N=C=O)c1
reactant_1                                                   None
reactant_2                                                   None
reactant_3                                                   None
                                      ...                        
catalyst_5                                                   None
catalyst_6                                                   None
solvent_8                                                     NaN
catalyst_7                                                    NaN
reactant_9                                                    NaN
Name: 279113, Length: 129, dtype: object


## Try the most restrictive filtering

1. Remove all rows that have a reagent represented by a name
2. Remove all rows that have too many reagents/catalysts/solvents

Maybe: Find all reagents with a Pd, and move to catalyst column
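
The Pd idea can be sketched without any chemistry toolkit, assuming the species are SMILES strings. `move_pd_to_catalysts` and `PALLADIUM` are hypothetical names, and the example inputs are made up.

```python
import re

# Pd is not in the SMILES organic subset, so it only appears as a bracket atom,
# optionally preceded by an isotope number, e.g. [Pd], [Pd+2], [106Pd]
PALLADIUM = re.compile(r"\[\d*Pd")


def move_pd_to_catalysts(reagents, catalysts):
    """Return new (reagents, catalysts) lists with Pd-containing species moved to catalysts."""
    palladium_species = [s for s in reagents if PALLADIUM.search(s)]
    kept = [s for s in reagents if not PALLADIUM.search(s)]
    # Avoid listing the same catalyst twice
    merged = list(catalysts) + [s for s in palladium_species if s not in catalysts]
    return kept, merged


reagents, catalysts = move_pd_to_catalysts(["[Pd]", "O", "Cl[Pd]Cl"], ["[Pd]"])
print(reagents, catalysts)
```

A string test like this only catches species that contain Pd explicitly; ligands or counter-ions that were split off into separate entries would stay in the reagent list.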

### Remove all rows that have a reagent represented by a name

In [170]:
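# write functions to filter the df
def remove_reactions_with_too_many_of_component(df, component_name, number_of_columns_to_keep):
    # Drop every row that has a value in any '<component_name>i' column with
    # i >= number_of_columns_to_keep, then drop those (now empty) columns
    
    cols = list(df.columns)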
    count = 0
    for col in cols:
        if component_name in col:
            count += 1
    
    columns = []
    for i in range(count):
        if i >= number_of_columns_to_keep:
            columns += [component_name+str(i)]
            
    for col in columns:
        df = df[pd.isnull(df[col])]
        
    df = df.drop(columns, axis=1)
            
    return df
    
    

def remove_reactions_using_names(df, unique_molecule_names):
    """
    Handling molecules where the only available label is the name (as opposed to the SMILES string) turned out
    to be surprisingly difficult.
    
    For each molecule/component in USPTO I first tried to extract the SMILES string; if that didn't exist, I extracted the name instead. This resulted in about 900k names that needed to be resolved, 47k of which were unique. I then narrowed this down to 12k by filtering out any names that appeared 4 or fewer times. I fed these 12k names to Pura, and it took 238 min to resolve them (I ran it overnight). Of the 12k, it managed to resolve about 900, returning no match for the rest. However, when I inspected the SMILES for the 900 resolved names, there seemed to be quite a few false positives.
    
    So let's just remove all reactions that contain a molecule represented by a name. If this filters out too much, we can look at resolving some names manually.
    """
    # replace all nan in df with -1000 as a placeholder, drop every row that contains one of the unique names, and then restore the nan values
    df2 = df.replace(np.nan, -1000)
    
    # We only do this for reagents and solvents, since these should be convertible to SMILES. Reactants and products should already be fine (i.e. represented as SMILES), and catalysts can be weird, e.g. Pd written in a million different ways
    cols = list(df2.columns)
    cols2 = []
    for col in cols:
        if 'reagent' in col or 'solvent' in col:
            cols2 += [col]
    df3 = df2.copy()
    for col in cols2:
        df3 = df3[~df3[col].isin(unique_molecule_names)]
    
    df4 = df3.replace(-1000, np.nan)
    
    return df4
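
The same filter can be sketched with a single boolean mask. This is an alternative illustration rather than the function above: `drop_rows_with_names` is a hypothetical helper, it selects columns by prefix instead of substring, and the example frame is made up.

```python
import pandas as pd


def drop_rows_with_names(df, names, prefixes=("reagent", "solvent")):
    """Drop rows where any column starting with one of `prefixes` holds a known name."""
    cols = [c for c in df.columns if c.startswith(tuple(prefixes))]
    # isin() is False for missing values, so no NaN placeholder is needed
    has_name = df[cols].isin(list(names)).any(axis=1)
    return df[~has_name]


example = pd.DataFrame(
    {
        "reagents_0": ["CCO", "brine", None],
        "solvent_0": ["O", "O", "THF"],
        "catalyst_0": ["THF", None, None],
    }
)
print(drop_rows_with_names(example, ["brine", "THF"]))
```

Only the first row survives: the name in `catalyst_0` is ignored on purpose, matching the decision to check reagents and solvents only.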
    

In [175]:
df2 = remove_reactions_with_too_many_of_component(full_df, 'catalyst_', 1)
print('num left after removing catalysts: ', len(df2))

num left after removing catalysts:  1748985


In [181]:
df3 = remove_reactions_with_too_many_of_component(df2, 'solvent_', 2)
print('num left after removing solvents: ', len(df3))

num left after removing solvents:  1676981


In [182]:
df4 = remove_reactions_with_too_many_of_component(df3, 'reagents_', 2)
print('num left after removing reagents: ', len(df4))

num left after removing reagents:  1105355


In [188]:
df5 = remove_reactions_with_too_many_of_component(df4, 'reactant_', 4)
print('num left after removing reactants: ', len(df5))

num left after removing reactants:  1101699


In [192]:
df6 = remove_reactions_with_too_many_of_component(df5, 'product_', 4)
print('num left after removing products: ', len(df6))
# remove the yield columns
excess_yield_columns = ['yields_4', 'yields_5', 'yields_6', 'yields_7']
df6 = df6.drop(excess_yield_columns, axis=1)


num left after removing products:  1101534


In [194]:
# remove any reactions containing a molecule represented with a name
df7 = remove_reactions_using_names(df6, unique_molecule_names)
print('num left after removing molecules with only a name: ', len(df7))

num left after removing molecules with only a name:  833037


In [196]:
# remove duplicates
df8 = df7.drop_duplicates()
print('num left after removing duplicate reactions: ', len(df8))

num left after removing duplicate reactions:  543672


In [230]:
def remove_rare_molecules(df, columns: list, cutoff: int):
    # Remove reactions that include a rare molecule (i.e. one that appears `cutoff` times or fewer)
    
    if len(columns) == 1:
        # Get the count of each value
        value_counts = df[columns[0]].value_counts()
        to_remove = value_counts[value_counts <= cutoff].index
        # Keep rows where the column value is not in to_remove
        
        df2 = df[~df[columns[0]].isin(to_remove)]
        return df2
    
    elif len(columns) ==2:
        # Count each value across both columns combined
        value_counts_0 = df[columns[0]].value_counts()
        value_counts_1 = df[columns[1]].value_counts()
        value_counts_2 = value_counts_0.add(value_counts_1, fill_value=0)

        # Select the values whose combined count is at most `cutoff`
        to_remove = value_counts_2[value_counts_2 <= cutoff].index

        # Keep rows where neither column holds a value in to_remove
        df2 = df[~df[columns[0]].isin(to_remove)]
        df3 = df2[~df2[columns[1]].isin(to_remove)]
        
        return df3
        
    else:
        raise ValueError("remove_rare_molecules supports only 1 or 2 columns")
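
The one- and two-column branches can be generalised to any number of columns by pooling the columns before counting. A minimal sketch, with `remove_rare_values` as a hypothetical name and a made-up example frame:

```python
import pandas as pd


def remove_rare_values(df, columns, cutoff):
    """Drop rows containing a value that occurs `cutoff` times or fewer across `columns`."""
    columns = list(columns)
    # Pool the chosen columns into one Series; value_counts() ignores missing values
    counts = pd.concat([df[c] for c in columns], ignore_index=True).value_counts()
    rare = counts[counts <= cutoff].index
    has_rare = df[columns].isin(list(rare)).any(axis=1)
    return df[~has_rare]


solvents = pd.DataFrame(
    {
        "solvent_0": ["O", "O", "O", "THF", "X"],
        "solvent_1": ["THF", None, "THF", "O", None],
    }
)
print(remove_rare_values(solvents, ["solvent_0", "solvent_1"], cutoff=1))
```

Here `O` occurs 4 times, `THF` 3 times and `X` once across both columns, so a cutoff of 1 removes only the row containing `X`.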

In [231]:
# Remove reactions that include a rare molecule (ie it appears 3 times or fewer)
# remove rare catalysts
df9 = remove_rare_molecules(df8, ['catalyst_0'], 3)
print('num left after removing rare catalysts: ', len(df9))

num left after removing rare catalysts:  542615


In [233]:
# Remove reactions that include a rare molecule (ie it appears 3 times or fewer)
# remove rare solvents
df10 = remove_rare_molecules(df9, ['solvent_0', 'solvent_1'], 3)
print('num left after removing rare solvents: ', len(df10))

num left after removing rare solvents:  541494


In [234]:
# Remove reactions that include a rare molecule (ie it appears 3 times or fewer)
# remove rare reagents
df11 = remove_rare_molecules(df10, ['reagents_0', 'reagents_1'], 3)
print('num left after removing rare reagents: ', len(df11))

num left after removing rare reagents:  526999


In [236]:
for i in df11.columns:
    print(i)

mapped_rxn_0
reactant_0
reactant_1
reactant_2
reactant_3
reagents_0
reagents_1
solvent_0
solvent_1
catalyst_0
temperature_0
rxn_time_0
rxn_time_1
product_0
product_1
product_2
product_3
yields_0
yields_1
yields_2
yields_3


In [239]:
# pickle the final cleaned dataset
with open('data/ORD_USPTO/cleaned_data.pkl', 'wb') as f:
    pickle.dump(df11, f)
