Notebook walking through constructing a Dataset with DatasetBuilder

In [1]:
import sys
import pandas as pd
sys.path.append("..")

from chemspace.Dataset.DatasetBuilder import DatasetBuilder

The DatasetBuilder class can be instantiated in one of three ways:
1. From a Chemical Structures Records `.json.gz` file downloaded from PubChem's search function
2. From a `.csv` file containing CIDs as the indices of the file
3. From a previously constructed DataFrame that has CIDs of interest as the indices

Method 1: Instantiate a DatasetBuilder object from a Chemical Structures Records `.json.gz` file downloaded from PubChem's search function

In [None]:
# Instantiate class with the JSON file from PubChem
DB = DatasetBuilder(compound_file_path='../chemspace/Dataset/Data/PubChem_compound_list_records.json.gz')

# Save the CIDs as a .csv file
DB.CIDs.to_csv('../chemspace/Dataset/Data/CIDs.csv', index=False)

# Display dataset
DB.CIDs
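
As a rough illustration of what Method 1 involves, the sketch below reads a gzip-compressed JSON file and collects the CIDs it contains. It uses only the standard library and does not reproduce DatasetBuilder's own parsing. The record layout (a top-level JSON array of objects with a `cid` key), the helper name `read_cids`, and the sample file are assumptions made for the example; the actual PubChem export may be structured differently.

```python
import gzip
import json
import os
import tempfile


def read_cids(path):
    """Return the sorted unique CIDs found in a gzip-compressed JSON file.

    Assumes (hypothetically) a top-level JSON array of objects, each with
    an integer "cid" key; the actual PubChem export may differ.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        records = json.load(handle)
    return sorted({int(record["cid"]) for record in records})


# Build a tiny stand-in file so the sketch runs without a PubChem download.
sample_records = [{"cid": 3}, {"cid": 1}, {"cid": 3}, {"cid": 4}]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_records.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    json.dump(sample_records, handle)

cids = read_cids(sample_path)
print(cids)
```

Opening the file in text mode with `gzip.open` avoids decompressing it to disk first, which matters for large record downloads.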

Method 2: Instantiate a DatasetBuilder object from a `.csv` file containing CIDs as the indices of the file

In [None]:
# Instantiate class from a previously generated dataset saved as CSV
DB = DatasetBuilder(compound_file_path='../chemspace/Dataset/Data/Dataset.csv')
# Alternative: use the CID-only CSV saved in Method 1
# DB = DatasetBuilder(compound_file_path='../chemspace/Dataset/Data/CIDs.csv')

# Display dataset (only the last expression in a cell is rendered)
DB.CIDs
DB.dataset
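
Before handing a CSV to DatasetBuilder, it can help to confirm that the file has the expected shape. The sketch below is one possible pre-flight check, assuming a `CID` column like the one reported by `df.info()` later in this notebook. The helper `load_cids` and the sample rows are hypothetical placeholders, not part of chemspace and not real PubChem data.

```python
import csv
import os
import tempfile


def load_cids(csv_path, column="CID"):
    """Read the CID column of a CSV file and return it as a list of ints.

    Raises ValueError if the column is missing or contains duplicates.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(f"no {column!r} column in {csv_path}")
        cids = [int(row[column]) for row in reader]
    if len(cids) != len(set(cids)):
        raise ValueError("duplicate CIDs found")
    return cids


# Stand-in file with placeholder rows (not real PubChem data).
sample_path = os.path.join(tempfile.mkdtemp(), "sample_dataset.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["CID", "SMILES"])
    writer.writerows([[101, "C"], [102, "CC"], [103, "CCO"]])

cids = load_cids(sample_path)
print(cids)
```

Failing early on a missing column or duplicate CIDs is cheaper than discovering the problem after a long-running build step.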

Method 3: Instantiate a DatasetBuilder object from a previously constructed DataFrame that has CIDs of interest as the indices

In [2]:
# Load the previously saved dataset into a DataFrame
df = pd.read_csv('../chemspace/Dataset/Data/Dataset.csv')

# Instantiate class with previously generated DataFrame
DB = DatasetBuilder(compound_df=df)

# Display dataset
DB.CIDs

0                 1
1                 3
2                 4
3                 5
4                 6
            ...    
327752    168010365
327753    168010381
327754    168010385
327755    168010395
327756    168011857
Name: CID, Length: 327757, dtype: int64

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 327757 entries, 0 to 327756
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   CID     327757 non-null  int64 
 1   SMILES  327757 non-null  object
dtypes: int64(1), object(1)
memory usage: 5.0+ MB


In [4]:
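# Alternative workflow with placeholder paths:
# DB = DatasetBuilder(compound_file_path='path_to_compound_file')
# DB.add_SMILES(data_path='path_to_smiles_file')
# Add synonyms from the filtered CID-to-synonym file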
DB.add_synonyms(synonyms_data_path='../chemspace/Dataset/Data/CID-Synonym-filtered.csv')


0


TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
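
The traceback is not shown here, so the actual cause inside `add_synonyms` is unknown. In general, errors of this shape appear when an ordering comparison (for example during sorting) is applied to values of incompatible types. The standard-library sketch below reproduces the same class of error with a mixed list of integers and strings. The helper `try_sort` and the sample values are illustrative assumptions and say nothing definitive about chemspace's internals.

```python
def try_sort(values):
    """Sort values, returning the TypeError message if they cannot be ordered."""
    try:
        return sorted(values)
    except TypeError as exc:
        return str(exc)


# Values of a single type sort normally.
same_type_result = try_sort([3, 1, 2])
print(same_type_result)

# Mixing ints and strings makes '<' undefined, so sorting fails.
mixed_result = try_sort([3, "aspirin", 1])
print(mixed_result)
```

If a similar mix of types is suspected in the synonyms file, inspecting the dtypes of the columns involved is a reasonable first diagnostic step.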

In [None]:
# Add SMILES to the dataset
DB.add_SMILES()
# Save the dataset as a .csv file
DB.dataset.to_csv('../chemspace/Dataset/Data/Dataset.csv', index=False)

In [None]:
DB.dataset