
# BigSolDB2 – Reproducible Preprocessing 

This notebook documents the preprocessing workflow for the BigSolDB2 dataset.

**Dataset:** Zenodo record • <https://zenodo.org/records/15094979> (file to download: BigSolDBv2.0.csv)

Krasnov, L., Malikov, D., Kiseleva, M. et al. BigSolDB 2.0, dataset of solubility values for organic compounds in different solvents at various temperatures. Sci Data 12, 1236 (2025). https://doi.org/10.1038/s41597-025-05559-8



## Prerequisites

- Download the BigSolDB2 dataset from the Zenodo record and place it locally.
- Ensure Python and the packages imported in Step 1 (pandas, NumPy, RDKit, seaborn, matplotlib) are installed.
- Modify paths in code cells if necessary to point to your dataset location.


### Step 1: Import libraries

The preprocessing cells below rely on pandas and NumPy; RDKit, seaborn, and matplotlib are imported but are not used in the steps shown here.

In [1]:
import pandas as pd
import numpy as np
import rdkit
import seaborn as sns
from matplotlib import pyplot as plt

### Step 2: Load dataset

In [2]:
df = pd.read_csv('BigSolDBv2.0.csv') # adjust the path if needed

In [3]:
len(df)

103944
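
Before preprocessing, it can help to confirm that the loaded table contains the columns the later cells rely on. The following is a minimal sketch on a toy frame; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here and are not part of the original notebook. The column names themselves are taken from the cells in this notebook.

```python
import pandas as pd

# Column names used by the preprocessing cells in this notebook
REQUIRED_COLUMNS = ['SMILES_Solute', 'Solvent', 'Temperature_K', 'LogS(mol/L)']

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names absent from frame (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Toy frame that deliberately lacks the temperature column
toy_loaded = pd.DataFrame(columns=['SMILES_Solute', 'Solvent', 'LogS(mol/L)'])
print(missing_columns(toy_loaded))  # prints ['Temperature_K']
```

On the real data, calling `missing_columns(df)` right after loading should return an empty list if the file matches the layout assumed here.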

### Step 3: Start preprocessing

#### - Filter out rows with NaN LogS values

In [4]:
df = df[df['LogS(mol/L)'].notna()]

In [6]:
len(df)

100983
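
As a small illustration of the NaN filter above, the sketch below applies the same boolean mask to a toy frame (made-up values) and shows that `dropna(subset=...)` is an equivalent way to write it. The toy names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data with one missing LogS value (made-up numbers)
toy_nan = pd.DataFrame({
    'SMILES_Solute': ['CCO', 'CCC', 'CCN'],
    'LogS(mol/L)': [-0.5, np.nan, 0.1],
})

# Boolean-mask form, as used in the notebook cell
by_mask = toy_nan[toy_nan['LogS(mol/L)'].notna()]

# Equivalent form using dropna restricted to the LogS column
by_dropna = toy_nan.dropna(subset=['LogS(mol/L)'])

print(len(toy_nan), len(by_mask))  # prints 3 2
```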

#### - Collect the names of unique solvents

In [7]:
un_solvents = np.unique(df['Solvent'].to_numpy())
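
Note that `np.unique` returns the names sorted, which is what fixes the order of the LogS columns in the merged table later on. A minimal sketch with made-up solvent names:

```python
import numpy as np
import pandas as pd

# Toy solvent column with a repeated entry (made-up example)
toy_solvents = pd.Series(['water', 'DMF', 'acetone', 'DMF'])

# np.unique removes duplicates and sorts the names;
# uppercase letters sort before lowercase ones
toy_unique = np.unique(toy_solvents.to_numpy())
print(list(toy_unique))  # prints ['DMF', 'acetone', 'water']
```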

#### - Define a filtering function
The function does the following:
1. Keeps only the rows for the given solvent whose temperature lies in the range [temp_lower, temp_upper] (in K, both bounds inclusive).
2. Drops duplicates based on the solute SMILES, so that only the first entry remains for each unique solute molecule.
3. Drops columns not used in the meta-learning experiments and renames the remaining two columns to SMILES and LogS_<solvent name>.

In [8]:
def filter_df(master_df, solvent_name, temp_lower=290, temp_upper=300):
    # Rows measured in the requested solvent
    solvent_locations = (master_df['Solvent'] == solvent_name)
    # Rows with temperature inside [temp_lower, temp_upper] (inclusive)
    temp_values = (master_df['Temperature_K'] >= temp_lower) & (master_df['Temperature_K'] <= temp_upper)
    df = master_df[solvent_locations & temp_values].copy()
    # Keep the first entry for each unique solute
    df = df.drop_duplicates('SMILES_Solute', ignore_index=True)
    # Drop columns not used downstream
    df = df.drop(columns=['Temperature_K', 'Solvent', 'SMILES_Solvent',
       'Solubility(mole_fraction)', 'Solubility(mol/L)', 'Compound_Name', 'CAS', 'PubChem_CID', 'FDA_Approved', 'Source'])
    # Name the LogS column after the solvent and shorten the SMILES column name
    df = df.rename(columns={'LogS(mol/L)': f'LogS_{solvent_name}', 'SMILES_Solute': 'SMILES'})
    return df
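
To make the three steps concrete, here is a minimal sketch that reproduces the same logic inline on a toy frame. It uses a reduced, hypothetical set of columns and made-up values, so it selects the two columns to keep rather than dropping the full list used by `filter_df`.

```python
import pandas as pd

# Toy data (made-up values) with only the columns needed for the illustration
toy = pd.DataFrame({
    'SMILES_Solute': ['CCO', 'CCO', 'c1ccccc1', 'CCO'],
    'Solvent': ['water', 'water', 'water', 'ethanol'],
    'Temperature_K': [293.15, 298.15, 310.0, 295.0],
    'LogS(mol/L)': [-0.5, -0.4, -1.2, 0.3],
})

# Step 1: keep rows for one solvent with temperature in [290, 300] K
in_solvent = toy['Solvent'] == 'water'
in_range = toy['Temperature_K'].between(290, 300)  # inclusive on both ends
subset = toy[in_solvent & in_range]

# Step 2: keep the first entry per solute
subset = subset.drop_duplicates('SMILES_Solute', ignore_index=True)

# Step 3: keep and rename the two columns used downstream
toy_result = subset[['SMILES_Solute', 'LogS(mol/L)']].rename(
    columns={'SMILES_Solute': 'SMILES', 'LogS(mol/L)': 'LogS_water'})
print(toy_result)
```

Only one row survives: the benzene entry is outside the temperature window, and of the two in-range ethanol-in-water measurements only the first (293.15 K) is kept. Which measurement is retained therefore depends on row order in the source file, not on proximity to a target temperature.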

#### Run final preprocessing for all unique solvents

In [9]:
for i, solv in enumerate(un_solvents):
    if i%10==0:
        print(i)

    temp = filter_df(df, solv)
    if i == 0:
        preprocessed = temp.copy()
    else:
        preprocessed = pd.merge(preprocessed, temp, on='SMILES', how='outer')
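
The loop above builds a wide table by repeated outer merges on SMILES: each solvent contributes one LogS column, and solutes not measured in a given solvent get NaN in that column. A minimal sketch with two toy per-solvent tables (made-up values):

```python
import pandas as pd

# Two toy per-solvent tables, shaped like the output of filter_df
toy_water = pd.DataFrame({'SMILES': ['CCO', 'CCC'], 'LogS_water': [-0.5, -2.0]})
toy_ethanol = pd.DataFrame({'SMILES': ['CCO', 'CCN'], 'LogS_ethanol': [0.3, 0.1]})

# Outer merge keeps every solute that appears in either table
toy_wide = pd.merge(toy_water, toy_ethanol, on='SMILES', how='outer')
print(toy_wide)

# Solutes missing from one table are filled with NaN in that table's column
print(toy_wide.set_index('SMILES').isna())
```

Here `CCC` has no ethanol value and `CCN` has no water value, so the merged table has three rows with one NaN in each LogS column.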

0
10
20
30
40
50
60


#### Check that the table columns contain only SMILES and per-solvent LogS values

In [11]:
print(preprocessed.columns)

Index(['SMILES', 'LogS_1,2-dichloroethane', 'LogS_1,4-dioxane',
       'LogS_2-butanone', 'LogS_2-butoxyethanol', 'LogS_2-ethoxyethanol',
       'LogS_2-methoxyethanol', 'LogS_2-propoxyethanol', 'LogS_DMAc',
       'LogS_DMF', 'LogS_DMSO', 'LogS_MIBK', 'LogS_MTBE', 'LogS_NMP',
       'LogS_THF', 'LogS_acetic acid', 'LogS_acetone', 'LogS_acetonitrile',
       'LogS_anisole', 'LogS_benzene', 'LogS_chlorobenzene', 'LogS_chloroform',
       'LogS_cyclohexane', 'LogS_cyclohexanone', 'LogS_cyclopentanone',
       'LogS_dichloromethane', 'LogS_diethyl ether', 'LogS_dimethyl carbonate',
       'LogS_ethanol', 'LogS_ethyl acetate', 'LogS_ethyl formate',
       'LogS_ethylbenzene', 'LogS_ethylene glycol', 'LogS_formamide',
       'LogS_formic acid', 'LogS_gamma-butyrolactone', 'LogS_isobutanol',
       'LogS_isobutyl acetate', 'LogS_isopentanol', 'LogS_isopropanol',
       'LogS_isopropyl acetate', 'LogS_m-xylene', 'LogS_methanol',
       'LogS_methyl acetate', 'LogS_n-butanol', 'LogS_n-butyl ac

#### Save the preprocessed database

In [12]:
preprocessed.to_csv('BigSolDBv2_processed.csv', index=False)

### Step 4: Filter Based On Min Number of Datapoints per Task (if needed)

We can also filter the dataset so that it includes only tasks (one task per solvent column) with at least the minimum number of data points required for the meta-learning experiments.

In [13]:
nlimit = 500 # required minimum number of datapoints per task

preprocessed = pd.read_csv('BigSolDBv2_processed.csv')
# count datapoints
cols = list(preprocessed.columns[1:])
lengths = [len(preprocessed[preprocessed[col].notna()]) for col in cols]
# filter
inds = np.where(np.array(lengths) < nlimit)[0]
cols_to_drop = np.array(cols)[inds]
# print statement
print(f"There are {len(cols) - len(cols_to_drop)} solvents with at least {nlimit} datapoints")
# filter and save
preprocessed_new = preprocessed.copy()
preprocessed_new = preprocessed_new.drop(columns=cols_to_drop)
print(len(preprocessed_new.columns)-1)
preprocessed_new.to_csv(f'BigSolDBv2_processed_mindata{nlimit}.csv', index=False)
print(f"[*] Saved to BigSolDBv2_processed_mindata{nlimit}.csv")

There are 9 solvents with at least 500 datapoints
9
[*] Saved to BigSolDBv2_processed_mindata500.csv
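
The per-task counting can also be written with `notna().sum()`, which counts non-missing entries per column in one call. The following is a sketch of that alternative on a toy wide table (made-up values); the toy names and the threshold of 2 are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
import pandas as pd

# Toy wide table: one SMILES column plus one LogS column per solvent
toy_table = pd.DataFrame({
    'SMILES': ['CCO', 'CCC', 'CCN'],
    'LogS_water': [-0.5, -2.0, 0.1],
    'LogS_ethanol': [0.3, np.nan, np.nan],
})
toy_limit = 2  # hypothetical minimum number of datapoints per task

# Count non-missing values per solvent column
counts = toy_table.drop(columns='SMILES').notna().sum()

# Keep SMILES plus the columns that meet the threshold
kept = counts[counts >= toy_limit].index.tolist()
toy_filtered = toy_table[['SMILES'] + kept]
print(counts.to_dict())  # prints {'LogS_water': 3, 'LogS_ethanol': 1}
```

Rows that end up with no remaining LogS value are still present after this column filter, as they are in the notebook cell above; dropping them is a separate decision.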
