# Dataset Curation for Open, Reproducible Computation of Assembly Indices

This notebook was created by Olivia M. Smith and documents the curation of three reference datasets for this manuscript: `gdb13_1201`, `gdb17_800`, and `coconut_220`. 

In [None]:
# import necessary packages
import os
from rdkit import Chem
import numpy as np
import pandas as pd

# Set random seed
np.random.seed(137)

## Conversion between molecular representation formats

The molecular representation most relevant to this algorithm is the `.mol` file. The GDB13 and GDB17 databases store molecules as SMILES strings in `.smi` files, while the COCONUT database stores molecules in `.sdf` format, which is closely related to the `.mol` format (an `.sdf` file holds one or more `.mol` records separated by `$$$$` lines). The following functions use RDKit to convert between these file types, ultimately producing `.mol` files.

Frequently in this notebook, I reference RDKit's `Mol` class, formally `rdkit.Chem.rdchem.Mol`. More information on handling this object can be found here: https://content.schrodinger.com/Docs/r2017-2/python_api/api/rdkit.Chem.rdchem.Mol-class.html
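As a minimal sketch of the conversions these functions rely on, the example below parses a SMILES string into a `Mol` object, serializes it as the text of a `.mol` file, and reads that text back. Ethanol (`CCO`) and the malformed string `C1CC` are illustrative inputs chosen here, not molecules from the datasets.

```python
from rdkit import Chem

# Parse a SMILES string into an RDKit Mol object (ethanol, an illustrative input).
ethanol = Chem.MolFromSmiles("CCO")
# Hydrogens are implicit in this Mol object and are not counted as heavy atoms.
n_heavy = ethanol.GetNumHeavyAtoms()

# Serialize the Mol object as the text that would be stored in a .mol file.
mol_block = Chem.MolToMolBlock(ethanol)

# Read the .mol text back into a Mol object and recover a SMILES string from it.
round_trip = Chem.MolFromMolBlock(mol_block)
round_trip_smiles = Chem.MolToSmiles(round_trip)

# RDKit returns None (rather than raising) when it cannot parse the input,
# here because the ring opened by "1" is never closed.
unparsable = Chem.MolFromSmiles("C1CC")
```

Because unparsable input comes back as `None`, any code that passes the result of `Chem.MolFromSmiles` to a writer should check for `None` first.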


In [None]:
def convert_SMILES_to_SDF(smiles_list, sdf_output_name):
    """
    Converts a list of SMILES molecules into .sdf format in a single file. 

    :param smiles_list: a list containing SMILES molecules
    :param sdf_output_name: a string for the desired name of the output .sdf file
    :returns: a .sdf file named with the desired name in the working directory
    """
    writer = Chem.SDWriter(sdf_output_name)
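    for smile_molecule in smiles_list:
        # read each SMILES list element as a Mol object 
        mol = Chem.MolFromSmiles(smile_molecule)
        # MolFromSmiles returns None when RDKit cannot parse the string;
        # report and skip such entries instead of passing None to the writer
        if mol is None:
            print(f"Skipping unparsable SMILES: {smile_molecule}")
            continue
        # write the molecule to the new .sdf file
        writer.write(mol)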
    writer.close()
    return 


def write_molecules_to_SDF(list_of_molecules, sdf_output_name):
    """
    Takes a list of RDKit Mol objects and writes them to a single .sdf file. 

    :param list_of_molecules: a list containing RDKit Mol objects
    :param sdf_output_name: a string for the desired name of the output .sdf file
    :returns: a .sdf file in the working directory
    """
    writer = Chem.SDWriter(sdf_output_name)
    for mol_object in list_of_molecules:
        writer.write(mol_object)
    writer.close()
    return


def split_into_mol(sdf_filename, output_directory):
    """
    Splits all molecules in a .sdf file into individual .mol files in a specified output directory. 

    :param sdf_filename: a string for the name of a .sdf file in the working directory
    :param output_directory: a string for the name of the desired output directory
    :returns: individual .mol files for each molecule contained in a specified directory
    """
    # Create the output directory if it does not already exist
    os.makedirs(output_directory, exist_ok=True)
    # read each SDF molecule representation in the file as a Mol object 
    supplier = Chem.SDMolSupplier(sdf_filename)
    for i, molecule in enumerate(supplier):
        # entries that RDKit cannot parse are returned as None and are skipped
        if molecule is not None:
            # file naming scheme starting with "0001.mol":
            # zero-pad the 1-based position in the .sdf file to four digits
            mol_filename = os.path.join(output_directory, f"{i + 1:04d}.mol")
            # convert Mol object into a .mol file with the numbered file name 
            Chem.MolToMolFile(molecule, mol_filename)
    return

## gdb13_1201

`gdb13_1201`: 1,201 small organic molecular structures sampled from GDB-13, a database of enumerated chemical structures containing carbon, hydrogen, nitrogen, oxygen, sulfur, and chlorine. These structures are constrained only by valence rules and quantum mechanics and may not be chemically stable or synthesizable (Reymond, 2015; Blum & Reymond, 2009). Our sample includes all 201 molecules in GDB-13 with 4&ndash;5 heavy atoms and 200 randomly sampled molecules for each heavy-atom count from 6&ndash;10.

**Link to download GDB13**: https://zenodo.org/record/5172018/files/gdb13.tgz?download=1&ref=gdb.unibe.ch

Once downloaded and unzipped, GDB13 is organized into `.smi` files by number of heavy atoms. For example, `4.smi` contains only molecules with four heavy atoms. Within each file, molecules are ordered by molecular complexity: molecules containing primarily carbon are listed first, and molecules containing a greater variety of atoms are listed later. Here, "even sampling" means taking a random sample from the whole file rather than only the least complex molecules at the beginning. Even sampling does not apply to `4.smi` and `5.smi` because all molecules are taken from those files. Random sampling is performed throughout this code with `numpy.random.choice`.
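To illustrate why even sampling matters for a file ordered by complexity, the sketch below contrasts taking the first few line indices with drawing indices through `numpy.random.choice`. The file length of 1,000 lines and the sample size of 5 are made-up values for illustration only.

```python
import numpy as np

# Same seed as the notebook, so the draw is repeatable.
np.random.seed(137)

num_lines = 1000          # hypothetical number of lines in a .smi file
number_of_molecules = 5   # hypothetical sample size

# Taking molecules from the top of the file only ever reaches the first,
# least complex entries.
first_indices = list(range(number_of_molecules))

# Even sampling: draw distinct 0-based line indices from the whole file.
# replace=False guarantees that no line is selected twice.
random_indices = np.random.choice(range(num_lines), number_of_molecules, replace=False)
```

The drawn indices are 0-based, so they can be used directly to index the list returned by `file.readlines()`.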

To acquire 1,201 molecules from GDB13, all molecules are collected from `4.smi` and `5.smi` and converted to `.sdf` files. Then, 200 molecules are randomly sampled from each of `6.smi` through `10.smi` and converted into `.sdf` files. These seven files are combined into a single `.sdf` file on the command line with the `cat` command. Finally, the single `.sdf` file is split into 1,201 individual `.mol` files.

In [None]:
def collect_SMILES(file_name, number_of_molecules):
    """
    Takes a specified number of SMILES molecules from a .smi file without random sampling. 

    :param file_name: a string containing the name of the .smi file in the working directory
    :param number_of_molecules: an integer of the desired number of molecules
    :returns smiles: a list of length number_of_molecules containing SMILES molecular representations
    """
    with open(file_name, 'r') as file:
        smiles = []
        for line in file:
            # stop reading once the requested number of molecules has been collected
            if len(smiles) >= number_of_molecules:
                break
            # strip the trailing newline from the line containing the SMILES molecule and add it to the list
            molecule = line.split('\n')[0]
            smiles.append(molecule)
    return smiles


def smi_molecules_even_sampling(file_name, number_of_molecules): 
    """
    Collects a random sample of molecules from a .smi file using randomly generated numbers. 

    :param file_name: a string of the name of the .smi file
    :param number_of_molecules: an integer of the desired number of molecules
    :returns random_smiles: a list of length number_of_molecules containing randomly sampled SMILES molecules
    """

    with open(file_name, 'r') as file:
        random_smiles = []
        # read every line in the file 
        lines = file.readlines()
        # count how many lines there are
        num_lines = len(lines)

        # make an array of number_of_molecules distinct random line indices between 0 and num_lines - 1
        random_numbers = np.random.choice(range(num_lines), number_of_molecules, replace=False)
        
        for i in random_numbers:
            # access the line in the file whose index matches the random number
            file_line = lines[i]
            # split the molecule from the other lines in the file
            molecule = file_line.split("\n")[0]
            # append the molecule to the list of random SMILES molecules
            random_smiles.append(molecule)
    return random_smiles

In [None]:
# collect all molecules from 4.smi
# there are 43 molecules in this file
four_heavy_atoms = collect_SMILES(file_name = "gdb13/4.smi", number_of_molecules = 43)
convert_SMILES_to_SDF(smiles_list = four_heavy_atoms, sdf_output_name = "4.sdf")


# collect all molecules from 5.smi
# there are 158 molecules in this file
five_heavy_atoms = collect_SMILES(file_name = "gdb13/5.smi", number_of_molecules = 158)
convert_SMILES_to_SDF(smiles_list = five_heavy_atoms, sdf_output_name = "5.sdf")


# collect 200 evenly sampled molecules from 6.smi through 10.smi 
six_heavy_atoms = smi_molecules_even_sampling(file_name = "gdb13/6.smi", number_of_molecules = 200)
convert_SMILES_to_SDF(smiles_list = six_heavy_atoms, sdf_output_name = "6.sdf")

seven_heavy_atoms = smi_molecules_even_sampling(file_name = "gdb13/7.smi", number_of_molecules = 200)
convert_SMILES_to_SDF(smiles_list = seven_heavy_atoms, sdf_output_name = "7.sdf")

eight_heavy_atoms = smi_molecules_even_sampling(file_name = "gdb13/8.smi", number_of_molecules = 200)
convert_SMILES_to_SDF(smiles_list = eight_heavy_atoms, sdf_output_name = "8.sdf")

nine_heavy_atoms = smi_molecules_even_sampling(file_name = "gdb13/9.smi", number_of_molecules = 200)
convert_SMILES_to_SDF(smiles_list = nine_heavy_atoms, sdf_output_name = "9.sdf")

ten_heavy_atoms = smi_molecules_even_sampling(file_name = "gdb13/10.smi", number_of_molecules = 200)
convert_SMILES_to_SDF(smiles_list = ten_heavy_atoms, sdf_output_name = "10.sdf")

After these `.sdf` files are generated, they are combined into a single `.sdf` file in the command line using the command
 
`cat 4.sdf 5.sdf 6.sdf 7.sdf 8.sdf 9.sdf 10.sdf > 1201_gdb13_molecules.sdf`.
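For readers who prefer to stay inside the notebook, the concatenation can also be done in Python. The following is a sketch of an equivalent to the `cat` command above; `concatenate_files` is a hypothetical helper name, not a function defined elsewhere in this notebook, and the demonstration uses two throwaway files rather than the real `.sdf` files.

```python
import os
import shutil
import tempfile


def concatenate_files(input_names, output_name):
    """Append the contents of each input file, in order, to one output file."""
    with open(output_name, "wb") as output_file:
        for input_name in input_names:
            with open(input_name, "rb") as input_file:
                # copy the bytes unchanged, just as `cat` would
                shutil.copyfileobj(input_file, output_file)


# Small demonstration with two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    part_a = os.path.join(tmp_dir, "a.sdf")
    part_b = os.path.join(tmp_dir, "b.sdf")
    combined = os.path.join(tmp_dir, "combined.sdf")
    with open(part_a, "w") as f:
        f.write("first record\n$$$$\n")
    with open(part_b, "w") as f:
        f.write("second record\n$$$$\n")
    concatenate_files([part_a, part_b], combined)
    with open(combined, "r") as f:
        combined_text = f.read()
```

Under these assumptions, `concatenate_files([f"{n}.sdf" for n in range(4, 11)], "1201_gdb13_molecules.sdf")` would stand in for the `cat` call.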

In [None]:
# split the .sdf file into 1201 .mol files
# split_into_mol creates the output directory if it does not already exist
split_into_mol(sdf_filename = "1201_gdb13_molecules.sdf", output_directory = "gdb13_1201")

## gdb17_800

`gdb17_800`: 800 organic molecular structures sampled from the larger GDB-17 database, which includes additional elements beyond those in GDB-13, such as the halogens fluorine and iodine (Reymond, 2015; Ruddigkeit et al., 2012). Compared to GDB-13, these molecules are typically larger and more structurally diverse. Our sample includes 200 randomly sampled molecules for each heavy-atom count from 14&ndash;17. The assembly indices (MA) of these molecules range from 5&ndash;15.

**Link to download GDB17**: https://zenodo.org/record/5172018/files/GDB17.50000000.smi.gz?download=1&ref=gdb.unibe.ch

Unlike GDB13, which is organized by heavy atom count into separate `.smi` files, GDB17 downloads as a single compressed `.smi` file that must be unzipped before use. As the filename suggests, this file contains 50 million molecules, a subset of the 166 billion enumerated in GDB-17 (Ruddigkeit et al., 2012). This is too many molecules to process in full in a standard Python kernel, so the random sample is obtained differently than outlined above. First, 100,000 molecules are randomly sampled from across the entire file. These molecules are then placed in a `pandas` dataframe and sorted by heavy atom count. Then, a list of Mol objects is created for each heavy-atom count from 14&ndash;17. Because the 100,000 molecules were drawn at random, the first 200 elements of each list form a random sample for that count, and these are written to a `.sdf` file. The four resulting files are combined into a single `.sdf` file on the command line using `cat`. Finally, the single `.sdf` file is split into 800 individual `.mol` files.
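If holding the whole file in memory is a concern, the random lines can instead be gathered in two streaming passes: one to count the lines and one to keep only the chosen lines. This is an alternative sketch, not the approach used in the cells that follow. `sample_lines_streaming` is a hypothetical helper, and it uses the standard library's `random.sample` rather than `numpy.random.choice` so that no array of all line indices is built. The toy file of 1,000 lines stands in for the real database.

```python
import os
import random
import tempfile


def sample_lines_streaming(file_name, number_of_lines, seed=137):
    """Return randomly chosen lines from a text file without storing every line."""
    # first pass: count the lines without keeping them
    with open(file_name, "r") as f:
        total_lines = sum(1 for _ in f)

    # choose distinct 0-based line indices; sampling from a range object
    # does not materialize the full list of indices
    rng = random.Random(seed)
    chosen = set(rng.sample(range(total_lines), number_of_lines))

    # second pass: keep only the chosen lines, in file order
    sampled = []
    with open(file_name, "r") as f:
        for index, line in enumerate(f):
            if index in chosen:
                sampled.append(line.rstrip("\n"))
    return sampled


# Demonstration on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.smi")
    with open(toy_path, "w") as f:
        for i in range(1000):
            f.write(f"entry_{i}\n")
    toy_lines = sample_lines_streaming(toy_path, 10)
```

The sampled lines come back in file order rather than in random order, which does not matter here because the molecules are sorted by heavy atom count afterwards.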

In [None]:
def get_n_heavyatoms(mol):
    """
    Retrieves the number of heavy atoms for a Mol object using RDKit's GetNumHeavyAtoms() function.
    
    :param mol: an RDKit Mol object
    :returns: an integer of the number of heavy atoms in the molecule
    """
    
    return mol.GetNumHeavyAtoms()


def random_gdb17_dataframe(file_name):
    """
    Creates a pandas dataframe from 100,000 randomly sampled GDB17 molecules
    
    :param file_name: a string of the name of the GDB17 .smi file
    :returns mol_dataframe: a pandas dataframe containing the random sample of molecules from GDB17
    """

    with open(file_name, "r") as f:
        all_lines = f.readlines()

    # choose 100,000 distinct random line indices between 0 and len(all_lines) - 1
    random_numbers = np.random.choice(range(len(all_lines)), int(1e5), replace=False)
    # strip the trailing newline from the lines at the randomly chosen indices
    all_random_lines = [all_lines[k].split("\n")[0] for k in random_numbers]
    # convert each line to a Mol object from SMILES format
    all_random_molecules = [Chem.MolFromSmiles(ks) for ks in all_random_lines]
    # drop any entries RDKit could not parse (MolFromSmiles returns None for those)
    all_random_molecules = [m for m in all_random_molecules if m is not None]

    # make those molecules into a pandas dataframe
    mol_dataframe = pd.DataFrame({"mol":all_random_molecules})
    # add a column in the dataframe for number of heavy atoms
    mol_dataframe["HeavyAtoms"] = mol_dataframe["mol"].map(get_n_heavyatoms)

    return mol_dataframe

In [None]:
# make GDB17 into a dataframe of 100,000 randomly sampled molecules
mol_dataframe = random_gdb17_dataframe(file_name = "GDB17.50000000.smi")

# sort the dataframe by the heavy atoms column
mol_data_sorted = mol_dataframe.sort_values(by = 'HeavyAtoms')

In [None]:
# filter the molecules between 14 and 17 heavy atoms (HA)
# and turn these categories into lists using tolist()
fourteen_heavy_atoms = mol_data_sorted[ mol_data_sorted["HeavyAtoms"] == 14]
fourteen_heavy_atoms = fourteen_heavy_atoms["mol"].tolist()

fifteen_heavy_atoms = mol_data_sorted[ mol_data_sorted["HeavyAtoms"] == 15]
fifteen_heavy_atoms = fifteen_heavy_atoms["mol"].tolist()

sixteen_heavy_atoms = mol_data_sorted[ mol_data_sorted["HeavyAtoms"] == 16]
sixteen_heavy_atoms = sixteen_heavy_atoms["mol"].tolist()

seventeen_heavy_atoms = mol_data_sorted[ mol_data_sorted["HeavyAtoms"] == 17]
seventeen_heavy_atoms = seventeen_heavy_atoms["mol"].tolist()

# keep only the first 200 elements of each list 
fourteen_heavy_atoms = fourteen_heavy_atoms[0:200]
fifteen_heavy_atoms = fifteen_heavy_atoms[0:200]
sixteen_heavy_atoms = sixteen_heavy_atoms[0:200]
seventeen_heavy_atoms = seventeen_heavy_atoms[0:200]

# convert each list into a .sdf file 
write_molecules_to_SDF(list_of_molecules = fourteen_heavy_atoms, sdf_output_name = "14.sdf")
write_molecules_to_SDF(list_of_molecules = fifteen_heavy_atoms, sdf_output_name = "15.sdf")
write_molecules_to_SDF(list_of_molecules = sixteen_heavy_atoms, sdf_output_name = "16.sdf")
write_molecules_to_SDF(list_of_molecules = seventeen_heavy_atoms, sdf_output_name = "17.sdf")

After these `.sdf` files are generated, they are combined into a single `.sdf` file in the command line using the command
 
`cat 14.sdf 15.sdf 16.sdf 17.sdf > 800_gdb17_molecules.sdf`.

In [None]:
# split the .sdf into 800 individual .mol files 
# split_into_mol creates the output directory if it does not already exist
split_into_mol(sdf_filename = "800_gdb17_molecules.sdf", output_directory = "gdb17_800")

## coconut_220

`coconut_220`: 220 natural products sampled from the COCONUT database (Sorokina et al., 2021; Chandrasekhar et al., 2025).
Natural products (or secondary metabolites) are a rich source of evolved chemical complexity, often exhibiting drug-like properties.
Subsets of this database were used by Seet et al. (2024) to benchmark recent algorithmic progress. Our sample includes 20 randomly sampled molecules for each heavy-atom count from 15&ndash;25. The MA values of these molecules range from 5&ndash;20.

We used version 3 of COCONUT as of October 2024. A new release has since been published (March 2025). The October 2024 version that we used can be downloaded from the link below.

**Link to download COCONUT**: https://zenodo.org/records/13897048/files/coconut_complete-10-2024.sdf.zip?download=1

Once unzipped, COCONUT is a single `.sdf` file containing all molecules in the database. To obtain the 220 molecules, the `.sdf` file is first filtered to keep molecules with 15&ndash;25 heavy atoms. Then, 20 molecules are randomly sampled for each heavy-atom count and written to a new `.sdf` file. Finally, the new `.sdf` file is split into 220 individual `.mol` files.
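The filter-then-sample procedure is a form of stratified sampling: group the molecules by heavy-atom count, then draw the same number from each group. The sketch below shows the idea on plain `(name, heavy_atom_count)` tuples rather than RDKit Mol objects so that it runs without the COCONUT download. `stratified_sample` and the toy records are hypothetical and only illustrate the grouping logic.

```python
from collections import defaultdict

import numpy as np


def stratified_sample(records, counts, per_group):
    """Draw per_group records at random for each heavy-atom count in counts."""
    # group the records by heavy-atom count in a single pass
    groups = defaultdict(list)
    for record in records:
        _, heavy_atoms = record
        if heavy_atoms in counts:
            groups[heavy_atoms].append(record)

    kept = []
    for count in counts:
        members = groups[count]
        # sampling without replacement needs at least per_group candidates
        if len(members) < per_group:
            raise ValueError(f"only {len(members)} records with {count} heavy atoms")
        # choose distinct 0-based positions within this group
        picks = np.random.choice(len(members), per_group, replace=False)
        kept.extend(members[k] for k in picks)
    return kept


np.random.seed(137)
# toy data: 30 made-up records for each heavy-atom count from 14 to 26
toy_records = [(f"mol_{n}_{j}", n) for n in range(14, 27) for j in range(30)]
toy_sample = stratified_sample(toy_records, counts=range(15, 26), per_group=20)
```

Checking the group size before sampling gives a clearer error than the `ValueError` that `numpy.random.choice` raises when asked for more distinct items than a group contains.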

In [None]:
def filter_sdf_by_heavy_atoms(file_name):
    """
    Takes a .sdf file and filters it by heavy atoms to keep only molecules between 15 and 25 heavy atoms. 
    
    :param file_name: a string containing the name of the .sdf file to filter
    :returns filtered_molecules: a list containing RDKit Mol objects with heavy atoms between 15 and 25
    """

    filtered_molecules = []
    # Chem.SDMolSupplier() is the RDKit function for interpreting .sdf files
    for mol in Chem.SDMolSupplier(file_name):
        # Some molecules in the .sdf may not be readable by RDKit (None)
        if mol is None:
            continue
        
        # if the number of heavy atoms in the molecule is between 15 and 25 (inclusive), append it to the list
        num_heavy_atoms = mol.GetNumHeavyAtoms() 
        if 15 <= num_heavy_atoms <= 25:
            filtered_molecules.append(mol)

    return filtered_molecules


def get_random_coconut_molecules(filtered_molecules):
    """
    Randomly samples 20 molecules from each number of heavy atoms between 15 and 25. 

    :param filtered_molecules: a list of RDKit Mol objects previously filtered to all contain between 15 and 25 heavy atoms
    :returns kept_molecules: a list of 220 randomly sampled molecules 
    """

    kept_molecules = []
    # this loop will run for each number of heavy atoms between 15 and 25
    for i in range(15, 26):
        # first, make a list of molecules containing that number of heavy atoms
        # ex: all molecules with 15 heavy atoms
        molecules_in_category = []
        for mol in filtered_molecules:
            if mol.GetNumHeavyAtoms() == i:
                molecules_in_category.append(mol)

        # choose 20 distinct random indices between 0 and len(molecules_in_category) - 1
        random_numbers = np.random.choice(range(len(molecules_in_category)), 20, replace=False)
        for k in random_numbers:
            # keep the molecules that match the indices of the random numbers 
            kept_molecules.append(molecules_in_category[k])

    return kept_molecules

In [None]:
# filter the COCONUT database for molecules between 15 and 25 heavy atoms
coconut_filtered_list = filter_sdf_by_heavy_atoms(file_name = 'coconut_complete-10-2024.sdf')

# pick 20 random molecules for each number of heavy atoms between 15 and 25
random_coconut_molecules = get_random_coconut_molecules(filtered_molecules = coconut_filtered_list)

In [None]:
# write the list of 220 random molecules to a new .sdf file
write_molecules_to_SDF(list_of_molecules = random_coconut_molecules, sdf_output_name = "220_coconut_molecules.sdf")

# split the .sdf into 220 individual .mol files
# split_into_mol creates the output directory if it does not already exist
split_into_mol(sdf_filename = "220_coconut_molecules.sdf", output_directory = "coconut_220")

## References

Reymond, J.-L. (2015). The Chemical Space Project. Accounts of Chemical Research, 48(3), 722–730. https://doi.org/10.1021/ar500432k

Blum, L. C., & Reymond, J.-L. (2009). 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Journal of the American Chemical Society, 131(25), 8732–8733. https://doi.org/10.1021/ja902302h

Ruddigkeit, L., van Deursen, R., Blum, L. C., & Reymond, J.-L. (2012). Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. Journal of Chemical Information and Modeling, 52(11), 2864–2875. https://doi.org/10.1021/ci300415d

Chandrasekhar, V., Rajan, K., Kanakam, S. R. S., Sharma, N., Weißenborn, V., Schaub, J., & Steinbeck, C. (2025). COCONUT 2.0: a comprehensive overhaul and curation of the collection of open natural products database. Nucleic Acids Research, 53(D1), D634–D643. https://doi.org/10.1093/nar/gkae1063

Seet, I., Patarroyo, K. Y., Siebert, G., Walker, S. I., & Cronin, L. (2024). Rapid Computation of the Assembly Index of Molecular Graphs.