# CommercialCompoundSearcher
An internal Sigman Lab tool for assessing the commercial availability of molecules based on the PubChem database. We highly recommend testing the script on a small subset (~25 molecules) before using it on a larger dataset. Once a few variables are defined in later cells (such as the vendors to ignore), the script can be run unattended by running all cells.

## Imports

In [13]:
# Built ins
import urllib.error  # the HTTPError handlers below need this submodule loaded
from pathlib import Path
from pprint import pprint

# Data manipulation
import pandas as pd
import numpy as np

# Custom
from utils import canonicalize_smiles, smiles_to_inchi_key, smiles_to_inchi
from utils import remove_duplicate_inchi_keys
from utils import get_cid_from_inchi_key, get_vendor_list_from_cid
from utils import remove_specific_vendors_from_dataframe
from utils import draw_molecules_to_grid_image
from utils import convert_str_list


## Reading in a list of SMILES

The list of SMILES should be a plaintext file with __SMILES__ on the first line, followed by one SMILES string per line. The file extension should be '.txt' (the reader below also accepts '.csv').

```
┌────────────────────────┐
│ SMILES                 │
│ CC(=O)OCC[N+](C)(C)C   │
│ CC(C[N+](C)(C)C)OC(=O) │
│ ...                    │
└────────────────────────┘

```
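As a small illustration, an input file in this format can be written and read back with pandas. The file name `example_smiles.txt` and the temporary directory are placeholders chosen for this sketch; the two SMILES strings are copied from the example above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# SMILES strings copied from the example above
smiles = ['CC(=O)OCC[N+](C)(C)C', 'CC(C[N+](C)(C)C)OC(=O)']

# Write one SMILES per line under a 'SMILES' header (placeholder file name)
example_file = Path(tempfile.mkdtemp()) / 'example_smiles.txt'
example_file.write_text('SMILES\n' + '\n'.join(smiles) + '\n')

# Read it back the same way the notebook does for '.txt' files
example_df = pd.read_table(example_file, header=0)
print(example_df)
```

Because the file has a single column, `pd.read_table` (tab-separated by default) reads each line as one value.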


In [14]:
# Define a file
file = Path('./data/small.txt')

# Read in the file
if file.suffix == '.txt':
    df = pd.read_table(file, header=0)
elif file.suffix == '.csv':
    df = pd.read_csv(file, header=0)
else:
    raise ValueError(f'{file.name} does not have a supported extension.')

# Check that the file is formatted correctly
assert 'SMILES' in df.columns

# Drop any empty rows
df.dropna(axis=0, how='any', inplace=True)

display(df)

Unnamed: 0,SMILES
0,c1c([nH]c2ccccc12)-c1ccccc1
1,Cc1ccc(cc1)-c1[c]c2[c][c][c][c]c2[nH]1
2,C1c2ccccc2-c2cc3ccccc3n12
3,Cn1c(cc2ccccc12)-c1ccccc1
4,Cc1ccc2cc([nH]c2c1)-c1ccccc1
5,Cc1ccc(cc1)-c1cc2ccccc2[nH]1
6,Cc1c([nH]c2ccccc12)-c1ccccc1
7,Cc1ccc2[nH]c(cc2c1)-c1ccccc1


## Canonicalization and additional molecular identifiers

This section canonicalizes the SMILES and adds additional molecular identifiers (InChI and InChI key) using RDKit. The output of this block will contain warnings (and potentially errors) from RDKit. Many of these failures (such as RDKit returning None for a molecule) are handled by removing the affected row and saving it to a separate file.
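The helpers in `utils` are not shown in this notebook, so the following is only a sketch of how such functions might be written with RDKit. The `_sketch` names are hypothetical, and returning `None` for unparsable input is an assumption based on the description above.

```python
from rdkit import Chem


def canonicalize_smiles_sketch(smiles):
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)


def smiles_to_inchi_key_sketch(smiles):
    """Return the InChI key for a SMILES string, or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToInchiKey(mol)


# Two different spellings of the same molecule (both appear in this notebook)
a = canonicalize_smiles_sketch('c1c([nH]c2ccccc12)-c1ccccc1')
b = canonicalize_smiles_sketch('c1ccc(-c2cc3ccccc3[nH]2)cc1')
print(a == b)

# An unparsable string (unclosed branch) yields None instead of raising
print(canonicalize_smiles_sketch('C(C'))
```

Returning `None` rather than raising lets the failures be collected afterwards with `isna()`, which is how the cell below separates failed rows.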

In [15]:
# Apply canonicalization
#TODO Understand how this affects stereoconfiguration in the SMILES/InChI/InChI key values
df['SMILES'] = df['SMILES'].apply(canonicalize_smiles)

# Add InChI column
df['INCHI'] = df['SMILES'].apply(smiles_to_inchi)

# Add InChI key column
df['INCHI_KEY'] = df['SMILES'].apply(smiles_to_inchi_key)

# Get all the smiles that failed to make InChI or InChI keys
failed = df[(df['INCHI'].isna()) | (df['INCHI_KEY'].isna())]

# Get every successfully standardized row (both identifiers must be present)
df = df[~(df['INCHI'].isna()) & ~(df['INCHI_KEY'].isna())]

# Save the failed and the canonicalized datasets to a csv file
failed.to_csv('./results/failed_canonicalization.csv', index=False)
df.to_csv('./results/canonicalized.csv', index=False)

if failed.empty:
    print('No SMILES strings failed canonicalization.')
else:
    display(failed)
display(df)

No SMILES strings failed canonicalization.





Unnamed: 0,SMILES,INCHI,INCHI_KEY
0,c1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C14H11N/c1-2-6-11(7-3-1)14-10-12-8-4-...,KLLLJCACIRKBDT-UHFFFAOYSA-N
1,Cc1ccc(-c2[c]c3[c][c][c][c]c3[nH]2)cc1,InChI=1S/C15H8N/c1-11-6-8-12(9-7-11)15-10-13-4...,RKWCTGZQLAGHRQ-UHFFFAOYSA-N
2,c1ccc2c(c1)Cn1c-2cc2ccccc21,InChI=1S/C15H11N/c1-3-7-13-12(6-1)10-16-14-8-4...,JJUQRPHMJPHFBH-UHFFFAOYSA-N
3,Cn1c(-c2ccccc2)cc2ccccc21,InChI=1S/C15H13N/c1-16-14-10-6-5-9-13(14)11-15...,SFWZZSXCWQTORH-UHFFFAOYSA-N
4,Cc1ccc2cc(-c3ccccc3)[nH]c2c1,InChI=1S/C15H13N/c1-11-7-8-13-10-15(16-14(13)9...,WHOVJSPCXWJPBL-UHFFFAOYSA-N
5,Cc1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C15H13N/c1-11-6-8-12(9-7-11)15-10-13-...,VPXGIHGJJJBJFP-UHFFFAOYSA-N
6,Cc1c(-c2ccccc2)[nH]c2ccccc12,InChI=1S/C15H13N/c1-11-13-9-5-6-10-14(13)16-15...,KYAXCYQVBBQQHB-UHFFFAOYSA-N
7,Cc1ccc2[nH]c(-c3ccccc3)cc2c1,InChI=1S/C15H13N/c1-11-7-8-14-13(9-11)10-15(16...,JPFTUUXPCFNLIX-UHFFFAOYSA-N


## Remove duplicate InChI keys

Because we will be using REST queries to gather vendor information, it is important to remove duplicates, since each duplicate "wastes" a REST query. This procedure removes __exact__ duplicates of the InChI key in the dataframe even if the SMILES string is different.
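The implementation of `remove_duplicate_inchi_keys` is not shown here. A minimal sketch of the same idea with pandas, assuming the first occurrence of each key is the one to keep, could look like this (the `_sketch` name is hypothetical):

```python
import pandas as pd


def remove_duplicate_inchi_keys_sketch(df):
    """Split df into (first occurrence of each INCHI_KEY, later repeats)."""
    is_repeat = df.duplicated(subset='INCHI_KEY', keep='first')
    return df[~is_repeat].copy(), df[is_repeat].copy()


# The first two SMILES are different spellings of the same molecule
toy = pd.DataFrame({
    'SMILES': [
        'c1c([nH]c2ccccc12)-c1ccccc1',
        'c1ccc(-c2cc3ccccc3[nH]2)cc1',
        'Cn1c(cc2ccccc12)-c1ccccc1',
    ],
    'INCHI_KEY': [
        'KLLLJCACIRKBDT-UHFFFAOYSA-N',
        'KLLLJCACIRKBDT-UHFFFAOYSA-N',
        'SFWZZSXCWQTORH-UHFFFAOYSA-N',
    ],
})

unique_rows, repeated_rows = remove_duplicate_inchi_keys_sketch(toy)
print(len(unique_rows), len(repeated_rows))
```

Comparing on the InChI key rather than on the SMILES column is what catches molecules that were written in two different ways.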

In [4]:
# Remove exact duplicates
df, duplicates = remove_duplicate_inchi_keys(df=df)

# For your viewing pleasure
display(df)

if duplicates.empty:
    print('No duplicate entries were found.')
else:
    display(duplicates)

# Save the results for good book keeping.
df.to_csv('./results/added_molecular_identifiers.csv', index=False)
duplicates.to_csv('./results/duplicate_molecular_identifiers.csv', index=False)

Unnamed: 0,SMILES,INCHI,INCHI_KEY
0,c1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C14H11N/c1-2-6-11(7-3-1)14-10-12-8-4-...,KLLLJCACIRKBDT-UHFFFAOYSA-N
1,Cc1ccc(-c2[c]c3[c][c][c][c]c3[nH]2)cc1,InChI=1S/C15H8N/c1-11-6-8-12(9-7-11)15-10-13-4...,RKWCTGZQLAGHRQ-UHFFFAOYSA-N
2,c1ccc2c(c1)Cn1c-2cc2ccccc21,InChI=1S/C15H11N/c1-3-7-13-12(6-1)10-16-14-8-4...,JJUQRPHMJPHFBH-UHFFFAOYSA-N
3,Cn1c(-c2ccccc2)cc2ccccc21,InChI=1S/C15H13N/c1-16-14-10-6-5-9-13(14)11-15...,SFWZZSXCWQTORH-UHFFFAOYSA-N
4,Cc1ccc2cc(-c3ccccc3)[nH]c2c1,InChI=1S/C15H13N/c1-11-7-8-13-10-15(16-14(13)9...,WHOVJSPCXWJPBL-UHFFFAOYSA-N
5,Cc1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C15H13N/c1-11-6-8-12(9-7-11)15-10-13-...,VPXGIHGJJJBJFP-UHFFFAOYSA-N
6,Cc1c(-c2ccccc2)[nH]c2ccccc12,InChI=1S/C15H13N/c1-11-13-9-5-6-10-14(13)16-15...,KYAXCYQVBBQQHB-UHFFFAOYSA-N
7,Cc1ccc2[nH]c(-c3ccccc3)cc2c1,InChI=1S/C15H13N/c1-11-7-8-14-13(9-11)10-15(16...,JPFTUUXPCFNLIX-UHFFFAOYSA-N


No duplicate entries were found.


## Query PubChem for CID

The best identifier to use for querying PubChem is the PubChem Compound ID (CID). For more information on how PubChem standardizes its database, please see the [compounds webpage](https://pubchem.ncbi.nlm.nih.gov/docs/compounds). This section obtains a CID for each InChI key. Each REST query takes at least 200 ms.

__If you stop this cell while it is running, you will lose all of your progress__
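For orientation, here is a hedged sketch of what an InChI key to CID lookup against PubChem's PUG REST service can look like using only the standard library. The function names are hypothetical and the real `get_cid_from_inchi_key` in `utils` may differ. PubChem asks clients to stay at or below five requests per second, which is presumably where the 200 ms figure comes from.

```python
import json
import time
import urllib.parse
import urllib.request

PUG_REST = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'


def cid_url_for_inchi_key(inchi_key):
    """Build the PUG REST URL that lists the CIDs matching an InChI key."""
    return f'{PUG_REST}/compound/inchikey/{urllib.parse.quote(inchi_key)}/cids/JSON'


def parse_cids(payload):
    """Extract the CID list from a PUG REST 'cids/JSON' response body."""
    return json.loads(payload)['IdentifierList']['CID']


def fetch_first_cid(inchi_key, delay=0.2):
    """Query PubChem and return the first matching CID (hypothetical helper).

    Raises urllib.error.HTTPError (for example 404) when there is no match.
    """
    with urllib.request.urlopen(cid_url_for_inchi_key(inchi_key)) as response:
        cids = parse_cids(response.read())
    time.sleep(delay)  # pause so consecutive calls stay near 5 requests/s
    return cids[0]


# Offline demonstration: build a URL and parse a response of the expected shape
print(cid_url_for_inchi_key('KLLLJCACIRKBDT-UHFFFAOYSA-N'))
print(parse_cids('{"IdentifierList": {"CID": [13698]}}'))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without contacting PubChem.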

In [5]:
# Get inchi keys as a list
inchi_keys = df['INCHI_KEY'].to_list()

# This assertion will fail if you have duplicate
# InChI keys. If you don't care, remove the following line
assert len(set(inchi_keys)) == df.shape[0]

# Get the total length of InChI keys for tracking progress
total = len(inchi_keys)

# Enumerate over all inchi keys and add CID values
for i, inchi_key in enumerate(inchi_keys):
    print(f'Working on {i + 1} of {total} ({round((i + 1) / total * 100, 2)}%)')

    # Set cid to nan if we can't find it
    cid = np.nan

    # Try to get the CID, if there is no CID, skip
    try:
        cid = get_cid_from_inchi_key(inchi_key)
    except urllib.error.HTTPError as e:
        print(f'Could not convert InChi Key {inchi_key} to CID because {e}. Skipping.')
        continue

    # Check how many instances of that INCHI_KEY are in the df
    if df[df['INCHI_KEY'] == inchi_key].shape[0] != 1:
        print(f'WARNING: Found more than one InChI key {inchi_key}!')

    # Add the CID to the row with this InChI key
    df.loc[df['INCHI_KEY'] == inchi_key, 'CID'] = str(cid)

# Get the df of molecules for which there is no CID, save it for good book keeping
no_cids = df[df['CID'].astype(float).isna()].copy(deep=True)
no_cids.to_csv('./results/no_cid_found.csv', index=False)

# Get the new df that has CID values for each molecule
df = df[~(df['CID'].astype(float).isna())]

display(df)

Working on 1 of 8 (12.5%)
Working on 2 of 8 (25.0%)
Could not convert InChi Key RKWCTGZQLAGHRQ-UHFFFAOYSA-N to CID because HTTP Error 404: PUGREST.NotFound. Skipping.
Working on 3 of 8 (37.5%)
Working on 4 of 8 (50.0%)
Working on 5 of 8 (62.5%)
Working on 6 of 8 (75.0%)
Working on 7 of 8 (87.5%)
Working on 8 of 8 (100.0%)


Unnamed: 0,SMILES,INCHI,INCHI_KEY,CID
0,c1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C14H11N/c1-2-6-11(7-3-1)14-10-12-8-4-...,KLLLJCACIRKBDT-UHFFFAOYSA-N,13698
2,c1ccc2c(c1)Cn1c-2cc2ccccc21,InChI=1S/C15H11N/c1-3-7-13-12(6-1)10-16-14-8-4...,JJUQRPHMJPHFBH-UHFFFAOYSA-N,13207630
3,Cn1c(-c2ccccc2)cc2ccccc21,InChI=1S/C15H13N/c1-16-14-10-6-5-9-13(14)11-15...,SFWZZSXCWQTORH-UHFFFAOYSA-N,77095
4,Cc1ccc2cc(-c3ccccc3)[nH]c2c1,InChI=1S/C15H13N/c1-11-7-8-13-10-15(16-14(13)9...,WHOVJSPCXWJPBL-UHFFFAOYSA-N,261513
5,Cc1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C15H13N/c1-11-6-8-12(9-7-11)15-10-13-...,VPXGIHGJJJBJFP-UHFFFAOYSA-N,292599
6,Cc1c(-c2ccccc2)[nH]c2ccccc12,InChI=1S/C15H13N/c1-11-13-9-5-6-10-14(13)16-15...,KYAXCYQVBBQQHB-UHFFFAOYSA-N,258555
7,Cc1ccc2[nH]c(-c3ccccc3)cc2c1,InChI=1S/C15H13N/c1-11-7-8-14-13(9-11)10-15(16...,JPFTUUXPCFNLIX-UHFFFAOYSA-N,83247


## Query PubChem for vendors

This section uses the CID values found in the previous cell to acquire a list of vendors from PubChem. Each REST query takes at least 200 ms.

__If you stop this cell while it is running, you will lose all of your progress towards acquiring vendors__
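One way to soften this limitation is to save results to disk after every query so that a restarted run skips what is already done. The sketch below is not part of the notebook: `run_with_checkpoint`, the checkpoint file name, and the stand-in lookup are all hypothetical, and it stores plain vendor names because arbitrary Python objects cannot be written to JSON.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(keys, lookup, checkpoint_path):
    """Call lookup(key) for each key, saving all results after every call."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for key in keys:
        key = str(key)  # JSON object keys are always strings
        if key in done:
            continue  # already fetched in an earlier run
        done[key] = lookup(key)
        path.write_text(json.dumps(done))
    return done


# Demonstration with a stand-in lookup instead of a real PubChem query
calls = []


def fake_lookup(cid):
    calls.append(cid)
    return [f'vendor-for-{cid}']


checkpoint = Path(tempfile.mkdtemp()) / 'vendors_checkpoint.json'
first = run_with_checkpoint([13698, 77095], fake_lookup, checkpoint)
second = run_with_checkpoint([13698, 77095, 83247], fake_lookup, checkpoint)
print(calls)
```

The second call only queries the one CID that was not in the checkpoint file, which is the behavior you would want after an interrupted run.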

In [6]:
# Get CIDs as a list
cids = df['CID'].astype(int).to_list()

# This assertion will fail if you have duplicate
# CIDs. If you don't care, remove the following line
assert len(set(cids)) == df.shape[0]

# Get the total number of CIDs for tracking progress
total = len(cids)

# Keep a list of CIDs that have no vendors
no_vendor_cids = []

# Make a CID vendor dictionary that will contain the
# PubchemVendor objects
cid_vendor_dict = {}

# Enumerate over all CIDs and look for vendors
for i, cid in enumerate(cids):
    print(f'Working on {i + 1} of {total} ({round((i + 1) / total * 100, 2)}%)')

    # Try to get the list of PubchemVendor objects
    try:
        vendors = list(set(get_vendor_list_from_cid(cid)))
    except urllib.error.HTTPError as e:
        print(f'Could not get vendor list from CID {cid}.')
        no_vendor_cids.append(int(cid))
        continue

    # Check how many instances of that CID are in the df
    if df[df['CID'].astype(int) == cid].shape[0] != 1:
        print(f'WARNING: Found more than one CID for {cid}!')

    # Add the vendor names to the row with this CID
    df.loc[df['CID'].astype(int) == cid, 'VENDORS'] = str([x.SourceName for x in vendors])

    # Add the CID:Vendor key:value pair
    cid_vendor_dict[cid] = vendors

# Get the df of molecules for which there are no vendors, save it for good book keeping
no_vendors = df[df['CID'].astype(int).isin(no_vendor_cids)].copy(deep=True)
no_vendors.to_csv('./results/no_vendors_found.csv', index=False)

# Get all the molecules that have vendors
df = df[~df['CID'].astype(int).isin(no_vendor_cids)]

display(df)

#print(cid_vendor_dict)

Working on 1 of 7 (14.29%)
Working on 2 of 7 (28.57%)
Working on 3 of 7 (42.86%)
Working on 4 of 7 (57.14%)
Working on 5 of 7 (71.43%)
Working on 6 of 7 (85.71%)
Working on 7 of 7 (100.0%)


Unnamed: 0,SMILES,INCHI,INCHI_KEY,CID,VENDORS
0,c1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C14H11N/c1-2-6-11(7-3-1)14-10-12-8-4-...,KLLLJCACIRKBDT-UHFFFAOYSA-N,13698,"['BLD Pharm', 'Smolecule', 'Aurum Pharmatech L..."
2,c1ccc2c(c1)Cn1c-2cc2ccccc21,InChI=1S/C15H11N/c1-3-7-13-12(6-1)10-16-14-8-4...,JJUQRPHMJPHFBH-UHFFFAOYSA-N,13207630,"['BLD Pharm', 'A2B Chem', 'Chemieliva Pharmace..."
3,Cn1c(-c2ccccc2)cc2ccccc21,InChI=1S/C15H13N/c1-16-14-10-6-5-9-13(14)11-15...,SFWZZSXCWQTORH-UHFFFAOYSA-N,77095,"['AAA Chemistry', 'abcr GmbH', 'AHH Chemical c..."
4,Cc1ccc2cc(-c3ccccc3)[nH]c2c1,InChI=1S/C15H13N/c1-11-7-8-13-10-15(16-14(13)9...,WHOVJSPCXWJPBL-UHFFFAOYSA-N,261513,"['J&H Chemical Co.,ltd', 'LabNetwork, a WuXi A..."
5,Cc1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C15H13N/c1-11-6-8-12(9-7-11)15-10-13-...,VPXGIHGJJJBJFP-UHFFFAOYSA-N,292599,"['TCI (Tokyo Chemical Industry)', 'LEAPCHEM', ..."
6,Cc1c(-c2ccccc2)[nH]c2ccccc12,InChI=1S/C15H13N/c1-11-13-9-5-6-10-14(13)16-15...,KYAXCYQVBBQQHB-UHFFFAOYSA-N,258555,"['Alfa Chemistry', 'ZINC', 'Starshine Chemical..."
7,Cc1ccc2[nH]c(-c3ccccc3)cc2c1,InChI=1S/C15H13N/c1-11-7-8-14-13(9-11)10-15(16...,JPFTUUXPCFNLIX-UHFFFAOYSA-N,83247,"['ChemDiv', 'labseeker', 'MolPort', 'AA BLOCKS..."


## Filtering Vendors

The term "commercial availability" may differ between applications. Some vendors report that a compound is purchasable but will only synthesize it upon request. Additionally, the geographic location of the vendor's warehouse may lead to extended shipping times. In this section, we can filter vendors by selecting them from a list of total vendors.

The next cells are organized into separate steps.
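The VENDORS column is written as the string form of a Python list, so it has to be parsed back into a list before individual vendors can be selected. The real `convert_str_list` is not shown in this notebook; a minimal sketch of the idea, assuming it also tolerates values that are already lists (the notebook applies it to the same column more than once), might be:

```python
import ast


def convert_str_list_sketch(value):
    """Parse "['A', 'B']" back into ['A', 'B']; pass real lists through unchanged."""
    if isinstance(value, list):
        return value
    return ast.literal_eval(value)


stored = str(['BLD Pharm', 'J&H Chemical Co.,ltd'])
parsed = convert_str_list_sketch(stored)
print(parsed)
```

`ast.literal_eval` only evaluates Python literals, so vendor names containing commas are handled correctly without executing arbitrary code.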

In [7]:

# Print the total list of vendors
#vendors = [x for i in cid_vendor_dict.values() for x in i]
#print(vendors)
list_of_current_vendors = list(set([vendor for vendor_list in df['VENDORS'].apply(convert_str_list) for vendor in vendor_list]))
display(f'UNIQUE VENDORS:')
pprint(list_of_current_vendors)
print(f'\nN_UNIQUE_VENDORS: {len(list_of_current_vendors)}')


'UNIQUE VENDORS:'

['Smolecule',
 'MolCore',
 'BLD Pharm',
 'AN PharmaTech',
 'AAA Chemistry',
 'Yick-Vic Chemicals & Pharmaceuticals (HK) Ltd.',
 'CymitQuimica',
 'LabNetwork, a WuXi AppTec Company',
 'Chemieliva Pharmaceutical Co., Ltd',
 'AEchem Scientific Corp., USA',
 'Hangzhou Trylead Chemical Technology',
 'A2B Chem',
 'Wolves R&D chemical',
 'Vesino Industrial Co., Ltd',
 'Chem-Space.com Database',
 'Aromalake Chemical',
 'MolPort',
 'CAPOT',
 'Amadis Chemical',
 'Mcule',
 'Aaron Chemicals LLC',
 'Combi-Blocks',
 'MolMall',
 'Santa Cruz Biotechnology, Inc.',
 'Bic Biotech',
 'Wubei-Biochem',
 'Innovapharm',
 'Phion Ltd',
 'ChemExper Chemical Directory',
 'TCI (Tokyo Chemical Industry)',
 'Specs',
 'Ambinter',
 'Aurora Fine Chemicals LLC',
 'Angene Chemical',
 'Activate Scientific',
 'ACT Chemical',
 'Life Chemicals',
 'Matrix Scientific',
 'BOC Sciences',
 'Yuhao Chemical',
 'EvitaChem',
 '001Chemical',
 'Hoffman Fine Chemicals',
 'Key Organics/BIONET',
 'DSL Chemicals',
 'ISpharm',
 'ChemBridge'

#### Select vendors to keep __OR__ vendors to remove

Two variables are declared below. Define one and only one of these variables to be a list of vendor strings. __This section relies on exact string comparison. Thus, it is important that the **exact** string is used from the block above.__ We recommend using VENDORS_TO_REMOVE to be more deliberate with vendor selection.

(experimental) We've included a list of vendors as a template for VENDORS_TO_KEEP.
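The keep-or-remove choice can be summarized as a small function. This is a sketch with a hypothetical name (`resolve_vendors_to_remove`), not the notebook's own code, and it also shows why exact strings matter:

```python
def resolve_vendors_to_remove(current_vendors, keep=None, remove=None):
    """Turn either a keep-list or a remove-list into a single remove-list."""
    if keep is not None and remove is not None:
        raise ValueError('Define either keep or remove, not both.')
    if keep is not None:
        # Everything that is not explicitly kept gets removed
        return [v for v in current_vendors if v not in keep]
    # A remove-list is used as given; no list at all removes nothing
    return list(remove) if remove is not None else []


current = ['Combi-Blocks', 'MolPort', 'TCI (Tokyo Chemical Industry)']

# Exact match: only the two other vendors are removed
exact = resolve_vendors_to_remove(current, keep=['Combi-Blocks'])
print(exact)

# Wrong capitalization: nothing matches, so every vendor is removed
mismatch = resolve_vendors_to_remove(current, keep=['combi-blocks'])
print(mismatch)
```

With a keep-list, a single typo silently discards that vendor, which is one reason to double-check names against the printed list of unique vendors.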

In [9]:
# Define only one of these as a list
VENDORS_TO_KEEP = ['TCI (Tokyo Chemical Industry)',
 'Ambeed',
 'Combi-Blocks',
 'Thermo Fisher Scientific',
 'Sigma-Aldrich',
 'VWR, Part of Avantor']

VENDORS_TO_REMOVE = None

df['VENDORS'] = df['VENDORS'].apply(convert_str_list)

# Get the list of current vendors (again)
list_of_current_vendors = list(set([vendor for vendor_list in df['VENDORS'].apply(convert_str_list) for vendor in vendor_list]))

# Convert vendors to keep into a vendors_to_remove list
if VENDORS_TO_REMOVE is None and VENDORS_TO_KEEP is not None:
    VENDORS_TO_REMOVE = [x for x in list_of_current_vendors if x not in VENDORS_TO_KEEP]

# Illegal option: both variables are defined
elif VENDORS_TO_REMOVE is not None and VENDORS_TO_KEEP is not None:
    raise ValueError('Define either VENDORS_TO_REMOVE or VENDORS_TO_KEEP as a list, not both.')

# User not removing any vendors
elif VENDORS_TO_REMOVE is None and VENDORS_TO_KEEP is None:
    VENDORS_TO_REMOVE = []

# Only VENDORS_TO_REMOVE is defined, so it must already be a list
elif not isinstance(VENDORS_TO_REMOVE, list):
    raise ValueError('VENDORS_TO_REMOVE must be a list of vendor strings or None.')

# Remove the unwanted vendors
df = remove_specific_vendors_from_dataframe(df, vendors=VENDORS_TO_REMOVE)

# Purge empty df entries now
df = df[~(df['VENDORS'].astype(str) == '[]')]

display(df)


'''
# Get the new list of vendors
list_of_current_vendors = list(set([vendor for vendor_list in df['VENDORS'].to_list() for vendor in vendor_list]))
print(f'UNIQUE VENDORS:')
pprint(list_of_current_vendors)
print(f'\nN_UNIQUE_VENDORS: {len(list_of_current_vendors)}')



display(df)

for col in df.columns:
    print(df[col].dtype)
'''

Unnamed: 0,SMILES,INCHI,INCHI_KEY,CID,VENDORS
0,c1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C14H11N/c1-2-6-11(7-3-1)14-10-12-8-4-...,KLLLJCACIRKBDT-UHFFFAOYSA-N,13698,"[Ambeed, Sigma-Aldrich, TCI (Tokyo Chemical In..."
2,c1ccc2c(c1)Cn1c-2cc2ccccc21,InChI=1S/C15H11N/c1-3-7-13-12(6-1)10-16-14-8-4...,JJUQRPHMJPHFBH-UHFFFAOYSA-N,13207630,"[Combi-Blocks, Ambeed]"
3,Cn1c(-c2ccccc2)cc2ccccc21,InChI=1S/C15H13N/c1-16-14-10-6-5-9-13(14)11-15...,SFWZZSXCWQTORH-UHFFFAOYSA-N,77095,"[Combi-Blocks, Ambeed, Sigma-Aldrich, Thermo F..."
4,Cc1ccc2cc(-c3ccccc3)[nH]c2c1,InChI=1S/C15H13N/c1-11-7-8-13-10-15(16-14(13)9...,WHOVJSPCXWJPBL-UHFFFAOYSA-N,261513,"[Ambeed, Combi-Blocks]"
5,Cc1ccc(-c2cc3ccccc3[nH]2)cc1,InChI=1S/C15H13N/c1-11-6-8-12(9-7-11)15-10-13-...,VPXGIHGJJJBJFP-UHFFFAOYSA-N,292599,"[TCI (Tokyo Chemical Industry), Combi-Blocks, ..."
7,Cc1ccc2[nH]c(-c3ccccc3)cc2c1,InChI=1S/C15H13N/c1-11-7-8-14-13(9-11)10-15(16...,JPFTUUXPCFNLIX-UHFFFAOYSA-N,83247,"[Sigma-Aldrich, Ambeed]"


#### Save the curated list of molecules

In [10]:
df.to_csv('./FINAL_LIBRARY_CURATED.csv', index=False)

## Query PubChem for CAS number

This section uses the CID values found in the previous cells to acquire a CAS number from PubChem. Each REST query takes at least 200 ms.

__If you stop this cell while it is running, you will lose all of your progress towards acquiring CAS numbers__
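How `get_CAS_from_cid` identifies a CAS number is not shown in this notebook. One common approach, sketched here purely as an assumption, is to scan a compound's list of names for strings that look like CAS Registry Numbers and validate them with the CAS check digit. The function names below are hypothetical.

```python
import re

# CAS Registry Numbers have the form NNNNNNN-NN-N (2 to 7 leading digits)
CAS_PATTERN = re.compile(r'^(\d{2,7})-(\d{2})-(\d)$')


def is_valid_cas(text):
    """Check the format and the check digit of a candidate CAS number."""
    match = CAS_PATTERN.match(text)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def first_cas(names):
    """Return the first valid CAS number in a list of names, or None."""
    return next((name for name in names if is_valid_cas(name)), None)


# 7732-18-5 is the CAS number of water
print(first_cas(['water', '7732-18-5', 'H2O']))
print(is_valid_cas('7732-18-4'))
```

Validating the check digit filters out catalog numbers and dates that happen to share the same dash-separated shape.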

In [None]:
'''
from utils import convert_str_list
from utils import get_CAS_from_cid

# Get the full list of CIDs from the library
cids = pd.read_table('./data/cids.txt', header=0)
cids = [x for x in cids['CID'].to_list() if x != '']

# Get the total number of CIDs for tracking progress
total = len(cids)

# Keep a list of CIDs that have no vendors
no_vendor_cids = []


cid_cas_pairs = {}

# Enumerate over all CIDs and look for CAS numbers
for i, cid in enumerate(cids):
    print(f'Working on {i + 1} of {total} ({round((i + 1) / total * 100, 2)}%)')

    # Try to get the CAS number for this CID
    try:
        cas = get_CAS_from_cid(cid)
    except urllib.error.HTTPError as e:
        print(f'Could not get CAS number from CID {cid}.\n{e}')
        continue

    cid_cas_pairs[cid] = cas
    # Check how many instances of that CID are in the df
    #if df[df['CID'].astype(int) == cid].shape[0] != 1:
    #    print(df[df['CID'] == cid])
    #    print(f'WARNING: Found more than one CID for {cid}!')
    ## Add the CID/VENDORS based on inchi_key
    #df.loc[df['CID'].astype(int) == cid, 'CAS'] = str(cas)

print(cid_cas_pairs)

# Write a clean two-column CSV so that the 'CAS' column can be read back by name
with open('./data/cas_cid_pairs.csv', 'w') as o:
    o.write('CID,CAS\n')
    for k, v in cid_cas_pairs.items():
        o.write(f'{k},{v}\n')
'''

In [None]:
'''
# Check if we own the diketones
# Get the inventory CAS numbers
inventory = pd.read_excel('./data/INVENTORY-01-19-2024.xlsx', header=0, sheet_name='INVENTORY')

cas_cid_pairs = pd.read_csv('./data/cas_cid_pairs.csv', header=0)
cas_cid_pairs = cas_cid_pairs.dropna(subset='CAS', axis=0, how='any')

owned_cas = inventory[inventory['CAS'].isin(cas_cid_pairs['CAS'].to_list())]
owned_cid = cas_cid_pairs[cas_cid_pairs['CAS'].isin(inventory['CAS'].to_list())]

df = pd.read_csv('./data/COMBINED_LIB.csv', header=0)
df = df[df['CID'].isin(owned_cid['CID'].to_list())]
smiles = df['SMILES'].to_list()
display(df)

display(owned_cas[['NAME', 'PACK_SIZE', 'LOCATION', 'SUB_LOCATION', 'BARCODE']])
'''


## Drawing molecules 🥳 !

In this section we've included some useful functions for drawing molecules in your library.

In [12]:
# Get a list of all smiles
smiles = df['SMILES'].to_list()

print(f'Number of SMILES: {len(smiles)}')

# Get the PIL images of the grid by passing smiles list
images = draw_molecules_to_grid_image(smiles, mols_per_row=6, img_resolution=600)

for image in images:
    display(image)

Number of SMILES: 6


NameError: name 'chunkify' is not defined
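
The NameError above indicates that `draw_molecules_to_grid_image` calls a helper named `chunkify` that is not defined in `utils`. Its intended behavior is not shown in this notebook. Since the drawing function returns several images, one plausible guess is that it splits the SMILES list into fixed-size chunks, one per image. A sketch under that assumption (the name `chunkify_sketch` and its signature are hypothetical):

```python
def chunkify_sketch(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError('chunk_size must be at least 1')
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


chunks = chunkify_sketch(list('abcdefg'), 3)
print(chunks)
```

The last chunk is simply shorter when the list length is not a multiple of the chunk size.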