## CommercialCompoundSearcher
An internal Sigman Lab tool for assessing the commercial availability of molecules based on the PubChem database. We highly recommend starting with a small subset (~25 molecules) to test the script before using it on a larger dataset. Once a few variables are defined in later cells (such as the vendors to ignore), the script can be run unattended by running all cells.

### Imports

In [None]:
# Built ins
import urllib.error  # import the submodule explicitly; urllib.error.HTTPError is caught below
from pathlib import Path
from pprint import pprint

# Data manipulation
import pandas as pd
import numpy as np

# Custom
from utils import canonicalize_smiles, smiles_to_inchi_key, smiles_to_inchi
from utils import remove_duplicate_inchi_keys
from utils import get_cid_from_inchi_key, get_vendor_list_from_cid
from utils import convert_str_list
from utils import remove_specific_vendors_from_dataframe
from utils import draw_molecules_to_grid_image

## Reading in the list of SMILES

The list of SMILES should be a plain-text file with __SMILES__ on the first line, followed by one SMILES string per line.

```
┌────────────────────────┐
│ SMILES                 │
│ CC(=O)OCC[N+](C)(C)C   │
│ CC(C[N+](C)(C)C)OC(=O) │
│ ...                    │
└────────────────────────┘

```


In [None]:
# Define a file
file = Path('./data/indoles.txt')

# Read in the text file
df = pd.read_table(file, header=0)

# Check that the file is formatted correctly
assert 'SMILES' in df.columns

display(df)

## Canonicalization and additional molecular identifiers

This section canonicalizes the SMILES strings and adds further molecular identifiers (InChI and InChI key) using RDKit. The output of this cell will contain warnings (and potentially errors) from RDKit. Many of these errors (such as a None mol from RDKit) are handled by removing the offending SMILES string from the dataframe and storing it in a separate file.

In [None]:
# Apply canonicalization
df['SMILES'] = df['SMILES'].apply(canonicalize_smiles)

# Add InChI column
df['INCHI'] = df['SMILES'].apply(smiles_to_inchi)

# Add InChI key column
df['INCHI_KEY'] = df['SMILES'].apply(smiles_to_inchi_key)

# Get all the smiles that failed to make InChI or InChI keys
failed = df[(df['INCHI'].isna()) | (df['INCHI_KEY'].isna())]

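# Get every successfully standardized row (both identifiers must be present,
# so this is the exact complement of the `failed` selection above)
df = df[~(df['INCHI'].isna()) & ~(df['INCHI_KEY'].isna())]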

failed.to_csv('./results/failed_canonicalization.csv', index=False)

df.to_csv('./results/canonicalized.csv', index=False)

display(failed)
display(df)

## Remove duplicate InChI keys

Because we will be using REST queries to gather vendor information, it is important to remove duplicates: each one would "waste" a REST query. This procedure removes __exact__ duplicates of the InChI key in the dataframe, even if the SMILES strings differ.
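
The helper `remove_duplicate_inchi_keys` lives in `utils` and is not shown here. As a rough illustration of the idea only, the sketch below works on plain Python records rather than the dataframe and splits them into first occurrences and later repeats of the same InChI key. The function name `split_duplicates`, the record layout, and the placeholder keys are assumptions made for this example, not the helper's actual implementation.

```python
def split_duplicates(records, key="INCHI_KEY"):
    """Split records into first occurrences and later repeats of the same key."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        if record[key] in seen:
            # Same key seen before: keep it aside instead of querying it again
            duplicates.append(record)
        else:
            seen.add(record[key])
            unique.append(record)
    return unique, duplicates


# Two different SMILES spellings that map to the same (placeholder) key
records = [
    {"SMILES": "CCO", "INCHI_KEY": "KEY-A"},
    {"SMILES": "OCC", "INCHI_KEY": "KEY-A"},
    {"SMILES": "CCN", "INCHI_KEY": "KEY-B"},
]
unique, duplicates = split_duplicates(records)
print(len(unique), len(duplicates))
```

Only the first record for each key is kept for querying; the repeats are returned separately so they can still be saved for reference.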

In [None]:
# Remove exact duplicates
df, duplicates = remove_duplicate_inchi_keys(df=df)

# For your viewing pleasure
display(df)
display(duplicates)

# Save the results for good bookkeeping
df.to_csv('./results/added_molecular_identifiers.csv', index=False)
duplicates.to_csv('./results/duplicate_molecular_identifiers.csv', index=False)

## Query PubChem for CID and Vendors

The best identifier to use for querying PubChem is the PubChem Compound ID (CID). For more information on how PubChem standardizes its archive, please see the [compounds webpage](https://pubchem.ncbi.nlm.nih.gov/docs/compounds). This section first obtains a CID for a given InChI key and then queries for a list of vendors. Each REST query takes at least 200 ms, and there are two queries per molecule, so expect at least 0.4 s per molecule.
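
The query helpers themselves (`get_cid_from_inchi_key` and `get_vendor_list_from_cid`) are defined in `utils` and not shown here. For orientation, the following is a minimal sketch of how an InChI key could be mapped to a CID through PubChem's PUG REST interface using only the standard library. The function names `cid_url`, `parse_first_cid`, and `fetch_first_cid` are hypothetical, and the 0.2 s pause is an assumption chosen to match the timing mentioned above; this is not necessarily how the `utils` helpers work.

```python
import json
import time
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_url(inchi_key):
    """Build the PUG REST URL that maps an InChI key to CIDs."""
    return f"{PUG_REST}/compound/inchikey/{inchi_key}/cids/JSON"


def parse_first_cid(payload):
    """Return the first CID in an IdentifierList response, or None if absent."""
    cids = json.loads(payload).get("IdentifierList", {}).get("CID", [])
    return cids[0] if cids else None


def fetch_first_cid(inchi_key, pause=0.2):
    """Query PubChem (network access required) and pause before returning."""
    with urllib.request.urlopen(cid_url(inchi_key)) as response:
        payload = response.read().decode("utf-8")
    # Pause so sequential calls stay at or below about 5 requests per second
    time.sleep(pause)
    return parse_first_cid(payload)


# Offline check of the parser against a response of the expected shape
example = '{"IdentifierList": {"CID": [2244]}}'
print(parse_first_cid(example))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without contacting PubChem.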

__If you stop this cell while it is running, you will lose all of your progress__
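
One way to soften this limitation would be to write each result to disk as soon as it is obtained and to skip keys that are already recorded when the cell is rerun. The sketch below illustrates that idea with the standard library; it is not part of the notebook, and the function names `load_done` and `append_result` as well as the checkpoint file layout are assumptions for this example.

```python
import csv
from pathlib import Path


def load_done(checkpoint):
    """Return the set of InChI keys already written to the checkpoint file."""
    if not Path(checkpoint).exists():
        return set()
    with open(checkpoint, newline="") as handle:
        return {row["INCHI_KEY"] for row in csv.DictReader(handle)}


def append_result(checkpoint, inchi_key, cid, vendors):
    """Append one finished query, writing the header if the file is new."""
    is_new = not Path(checkpoint).exists()
    with open(checkpoint, "a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["INCHI_KEY", "CID", "VENDORS"])
        # Store the vendor list in its string form, as the notebook does
        writer.writerow([inchi_key, cid, repr(vendors)])
```

Inside the query loop, one would skip any `inchi_key` found in `load_done(path)` and call `append_result(path, inchi_key, cid, vendors)` after each molecule, so that an interrupted run can resume where it stopped.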

In [None]:
# Get inchi keys as a list
inchi_keys = df['INCHI_KEY'].to_list()

# This assertion will fail if you have duplicate InChI keys
# If you don't care, remove the following line
assert len(set(inchi_keys)) == len(inchi_keys)

# Get the total length of InChI keys for tracking progress
total = len(inchi_keys)

# Enumerate over all inchi keys
for i, inchi_key in enumerate(inchi_keys):
    print(f'Working on {i + 1} of {total} ({round((i + 1) / total * 100, 2)}%)')

    # Set cid to nan if we can't find it
    cid = np.nan

    # Try to get the CID; if there is no CID, skip this molecule
    try:
        cid = get_cid_from_inchi_key(inchi_key)
    except urllib.error.HTTPError as e:
        print(f'Could not convert InChI key {inchi_key} to CID because {e}. Skipping.')
        continue

    # Try to get a list of vendors
    try:
        vendors = list(set(get_vendor_list_from_cid(cid)))
    except urllib.error.HTTPError as e:
        print(f'Could not get vendor list from CID {cid} because {e}.')
        vendors = []

    # Check how many instances of that INCHI_KEY are in the df
    indices = df[df['INCHI_KEY'] == inchi_key].index.values
    if len(indices) != 1:
        print(f'WARNING: Expected exactly one row for InChI key {inchi_key}, found {len(indices)}!')

    # Add the CID/VENDORS based on inchi_key
    df.loc[df['INCHI_KEY'] == inchi_key, 'CID'] = str(cid)
    df.loc[df['INCHI_KEY'] == inchi_key, 'VENDORS'] = str(vendors)

# Get the df of molecules for which there is no CID
no_cids = df[df['CID'].astype(float).isna()].copy(deep=True)

# Get the df of molecules for which there is a CID but no vendors
# We're going to convert the lists stored in 'VENDORS' to a string
# for string comparison. There may be a better way to do this
no_vendors = df[(df['VENDORS'].astype(str) == '[]') & ~(df['CID'].astype(float).isna())].copy(deep=True)

# Get the result df that has both CIDs and some vendors
df = df[~(df['VENDORS'].astype(str) == '[]') & ~(df['CID'].astype(float).isna())]

# Save the different dataframes
df.to_csv('./results/filtered_for_cid_and_vendors.csv', index=False)
no_cids.to_csv('./results/no_cid_found.csv', index=False)
no_vendors.to_csv('./results/no_vendors_found.csv', index=False)

display(df)

## Filtering Vendors

The term "commercial availability" may differ between applications. Some vendors report that a compound is purchasable but will only synthesize it upon request. Additionally, the geographic location of the vendor's warehouse may lead to extended shipping times. In this section, we can filter vendors by selecting them from a list of total vendors.

The next cells are organized into separate steps.
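
The `VENDORS` column was written as the string form of a Python list (and remains a string after a round trip through CSV), so the first step below turns it back into a list with `convert_str_list` from `utils`. That helper is not shown here. As a hedged sketch of what such a conversion could look like, the hypothetical function `parse_vendor_list` below uses `ast.literal_eval`, which safely parses Python literals without executing arbitrary code.

```python
import ast


def parse_vendor_list(text):
    """Parse the string form of a Python list back into a list."""
    value = ast.literal_eval(text)
    if not isinstance(value, list):
        raise ValueError(f"Expected a list, got {type(value).__name__}")
    return value


# Vendor names may contain commas, so splitting on ',' would not be safe
stored = str(["Sigma-Aldrich", "VWR, Part of Avantor"])
vendors = parse_vendor_list(stored)
print(vendors)
```

Parsing the literal, rather than splitting the string on commas, keeps names such as 'VWR, Part of Avantor' intact.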

In [None]:
# Optional - read in the previous dataframe so you don't have to run all cells
#df = pd.read_csv('./results/filtered_for_cid_and_vendors.csv')

In [None]:
# Convert the string representation of 
# the list into a regular python list
df['VENDORS'] = df['VENDORS'].astype(object)
df['VENDORS'] = df['VENDORS'].apply(convert_str_list)

# Print the total list of vendors
list_of_current_vendors = list(set([vendor for vendor_list in df['VENDORS'].to_list() for vendor in vendor_list]))
print(f'UNIQUE VENDORS:')
pprint(list_of_current_vendors)
print(f'\nN_UNIQUE_VENDORS: {len(list_of_current_vendors)}')

#### Select vendors to keep __OR__ vendors to remove

Two variables are declared below. Set at most one of them to a list of vendor strings and leave the other as `None` (leaving both as `None` removes no vendors). __This section relies on exact string comparison. Thus, it is important that the **exact** string is used from the block above.__ We recommend using VENDORS_TO_REMOVE to be more deliberate with vendor selection.

(experimental) We've included a list of vendors as a template for VENDORS_TO_KEEP.
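
Because the comparison is exact, a vendor name that differs only in case or punctuation silently matches nothing. A small check like the following can catch such names before filtering; it is an illustrative sketch, and the function name `split_known_unknown` and the example lists are assumptions, not part of `utils`.

```python
def split_known_unknown(requested, available):
    """Separate requested vendor names into exact matches and misses."""
    available = set(available)
    known = [name for name in requested if name in available]
    unknown = [name for name in requested if name not in available]
    return known, unknown


available = ["Sigma-Aldrich", "Ambeed", "Combi-Blocks"]
requested = ["Sigma-Aldrich", "ambeed", "Combi Blocks"]
known, unknown = split_known_unknown(requested, available)
print(known)    # ['Sigma-Aldrich']
print(unknown)  # ['ambeed', 'Combi Blocks']
```

In the notebook, `available` would be `list_of_current_vendors` and `requested` would be whichever of VENDORS_TO_KEEP or VENDORS_TO_REMOVE is defined; any name reported as unknown should be corrected to the exact string printed above.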

In [None]:
# Define only one of these as a list
VENDORS_TO_KEEP = ['TCI (Tokyo Chemical Industry)',
 'Ambeed',
 'Combi-Blocks',
 'Shanghai Tauto Biotech Co., Ltd',
 'Thermo Fisher Scientific',
 'Sigma-Aldrich',
 'VWR, Part of Avantor']

VENDORS_TO_REMOVE = None

# Get the list of current vendors (again)
list_of_current_vendors = list(set([vendor for vendor_list in df['VENDORS'].to_list() for vendor in vendor_list]))

# Illegal option: both variables are defined
if VENDORS_TO_REMOVE is not None and VENDORS_TO_KEEP is not None:
    raise ValueError('Define either VENDORS_TO_REMOVE or VENDORS_TO_KEEP as a list, not both.')

# Convert vendors to keep into a vendors_to_remove list
elif VENDORS_TO_KEEP is not None:
    VENDORS_TO_REMOVE = [x for x in list_of_current_vendors if x not in VENDORS_TO_KEEP]

# User not removing any vendors
elif VENDORS_TO_REMOVE is None:
    VENDORS_TO_REMOVE = []

# Otherwise only VENDORS_TO_REMOVE was defined by the user and is used as-is
# (previously this valid case fell through to an else branch that raised)

# Remove the unwanted vendors
df = remove_specific_vendors_from_dataframe(df, vendors=VENDORS_TO_REMOVE)

# Get the new list of vendors
list_of_current_vendors = list(set([vendor for vendor_list in df['VENDORS'].to_list() for vendor in vendor_list]))
print(f'UNIQUE VENDORS:')
pprint(list_of_current_vendors)
print(f'\nN_UNIQUE_VENDORS: {len(list_of_current_vendors)}')

# Purge empty df entries now
df = df[~(df['VENDORS'].astype(str) == '[]')]

display(df)

#### Save the curated list of molecules

In [None]:
df.to_csv('./FINAL_LIBRARY_CURATED.csv', index=False)

## Drawing molecules 🥳 !

In this section we've included some useful functions for drawing molecules in your library.

In [None]:
# Get a list of all smiles
smiles = df['SMILES'].to_list()

print(f'Number of SMILES: {len(smiles)}')

# Get the PIL image of the grid by passing the SMILES and
# the number of molecules per row to the draw function
img = draw_molecules_to_grid_image(smiles, mols_per_row=6, maxMols=163, img_resolution=600)

display(img)