Notebook walking through constructing a Dataset with DatasetBuilder

In [None]:
import sys
import pandas as pd
sys.path.append("..")

from chemspace.Dataset.DatasetBuilder import DatasetBuilder

The DatasetBuilder class can be instantiated in one of three ways:
1. From a Chemical Structures Records `.json.gz` file downloaded from PubChem's search function
2. From a `.csv` file containing CIDs as the indices of the file
3. From a previously constructed DataFrame that has CIDs of interest as the indices

Method 1: Instantiate a DatasetBuilder object from a Chemical Structures Records `.json.gz` file downloaded from PubChem's search function

In [None]:
# Instantiate class with the .json.gz file from PubChem
DB = DatasetBuilder(compound_file_path='../chemspace/Dataset/Data/PubChem_compound_list_records.json.gz')

# Save the CIDs as a .csv file
DB.CIDs.to_csv('../chemspace/Dataset/Data/CIDs.csv', index=False)

# Display dataset
DB.CIDs
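Before handing a `.json.gz` download to DatasetBuilder, it can help to check that the file decompresses and parses. The following is a minimal sketch using only the standard library. The record layout (a list of objects with a `cid` key) and the file name `sample_records.json.gz` are assumptions made for illustration; the real PubChem export may be structured differently, and this is not necessarily how DatasetBuilder reads the file.

```python
import gzip
import json
import os
import tempfile

# Hypothetical stand-in for a PubChem records download; the real file's
# layout may differ, so treat the "cid" key as an assumption.
records = [{"cid": 2244, "name": "aspirin"}, {"cid": 2519, "name": "caffeine"}]

path = os.path.join(tempfile.mkdtemp(), "sample_records.json.gz")

# Write gzip-compressed JSON in text mode
with gzip.open(path, "wt", encoding="utf-8") as fh:
    json.dump(records, fh)

# Read it back; gzip.open decompresses transparently
with gzip.open(path, "rt", encoding="utf-8") as fh:
    loaded = json.load(fh)

# Collect the compound identifiers from the parsed records
cid_list = [rec["cid"] for rec in loaded]
print(cid_list)
```

Pointing `path` at a real download and printing the first parsed record is a quick way to see which keys the export actually contains.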

Method 2: Instantiate a DatasetBuilder object from a `.csv` file containing CIDs as the indices of the file

In [None]:
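# Instantiate class from a previously generated dataset saved as CSV
DB = DatasetBuilder(compound_file_path='../chemspace/Dataset/Data/Dataset.csv')
# Alternative: start from the CID-only CSV saved in Method 1
#DB = DatasetBuilder(compound_file_path='../chemspace/Dataset/Data/CIDs.csv')

# Display CIDs and dataset; display() is used because a notebook cell
# only renders its last bare expression
display(DB.CIDs)
display(DB.dataset)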

Method 3: Instantiate a DatasetBuilder object from a previously constructed DataFrame that has CIDs of interest as the indices

In [None]:
# Load the CIDs saved in Method 1 into a DataFrame
df = pd.read_csv('../chemspace/Dataset/Data/CIDs.csv')

# Instantiate class with a previously generated DataFrame
DB = DatasetBuilder(compound_df=df)

# Display dataset
DB.CIDs
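Methods 2 and 3 describe CIDs as the index, so it is worth knowing how pandas treats the index on a CSV round trip. The sketch below uses plain pandas with a few illustrative CIDs and an in-memory buffer in place of `CIDs.csv`. Whether DatasetBuilder needs the CIDs as a column or as the index is not stated here, so treat the choice of `index_col` as an assumption to check against the class itself.

```python
import io
import pandas as pd

# Illustrative CIDs only (aspirin, caffeine, ethanol)
cids = pd.DataFrame({"CID": [2244, 2519, 702]})

# Written with index=False, the CSV holds only the CID column
buf = io.StringIO()
cids.to_csv(buf, index=False)
csv_text = buf.getvalue()

# Default read: CIDs come back as an ordinary column under a fresh 0..n-1 index
as_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 promotes the first column to the index instead
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(as_column.columns.tolist())
print(as_index.index.tolist())
```

The same consideration applies when saving: `to_csv(..., index=False)` drops the index, so any identifiers stored only in the index will not be written to the file.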

Add SMILES data to Dataset

In [None]:
DB.add_SMILES()
DB.dataset.to_csv('../chemspace/Dataset/Data/Dataset.csv', index=False)
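This notebook does not show how `add_SMILES()` retrieves structures. One common way to look up SMILES for a batch of CIDs is PubChem's PUG REST property endpoint, and the sketch below only builds such a request URL without sending it. The helper `smiles_request_url` is hypothetical and not part of chemspace, and the property name `CanonicalSMILES` is an assumption that should be checked against the current PUG REST documentation.

```python
from typing import Iterable

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def smiles_request_url(cids: Iterable[int], prop: str = "CanonicalSMILES") -> str:
    """Build a PUG REST property URL for a batch of CIDs (no request is sent)."""
    # Join the CIDs into the comma-separated form the endpoint expects
    cid_part = ",".join(str(int(c)) for c in cids)
    if not cid_part:
        raise ValueError("at least one CID is required")
    return f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop}/CSV"


# Illustrative CIDs only
url = smiles_request_url([2244, 2519])
print(url)
```

Batching several CIDs into one URL keeps the number of requests low, which matters for larger CID lists.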

Add PubChem Text to Dataset

In [None]:
DB.add_pubchem_text()
DB.dataset.to_csv('../chemspace/Dataset/Data/Dataset.csv', index=False)