# Data Acquisition and Preprocessing
**Objective:** To construct a robust dataset of drug-like molecules for the Quantum Sequence-to-Sequence model.

**Methodology:**
1.  **Data Source:** ChEMBL Database via the official API.
2.  **Filtering:** We apply the "Rule of Five" together with molecular weight and atom count constraints, so that the molecules are relevant for drug discovery and small enough to be encoded within the limits of current quantum simulators.
3.  **Representation:** We convert SMILES (Simplified Molecular Input Line Entry System) to SELFIES (Self-Referencing Embedded Strings). Every SELFIES string decodes to a molecule that respects valence rules, so sequences produced during the generation phase are always chemically valid.

In [1]:
from typing import Tuple, List, Dict, Any
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import SaltRemover, GetFormalCharge
from sklearn.model_selection import train_test_split
import selfies as sf
from math import ceil, log2
import csv
import json
import pickle
import numpy as np
import pandas as pd
import math
from chembl_webresource_client.new_client import new_client

### Data Filtering and Property Selection

To construct a dataset that is both pharmaceutically relevant and suitable for current **Quantum Machine Learning (QML)** hardware (NISQ era), we filter the ChEMBL database using specific physicochemical constraints.

Our goal is to select **"Fragment-like"** or **"Lead-like"** molecules. These are smaller than typical drug candidates, making them ideal for quantum simulation because they require shorter sequence lengths (fewer qubits) while retaining the core structural features of valid drugs.

We apply the following filters:

* **Molecular Weight (MW $\le$ 300 Da):**
    * *Reason:* Restricts the dataset to small molecules. This minimizes the sequence length (number of SELFIES tokens), reducing the depth of the quantum circuit required for processing.

* **LogP (Partition Coefficient $\le$ 5):**
    * *Reason:* LogP measures lipophilicity. In line with **Lipinski's Rule of 5**, a LogP of at most 5 is associated with better oral absorption and membrane permeability.

* **QED (Quantitative Estimation of Drug-likeness $\ge$ 0.5):**
    * *Reason:* A composite score (0 to 1) that aggregates multiple properties (e.g., molecular weight, lipophilicity, polar surface area, structural alerts) to quantify how "drug-like" a structure is. A threshold of 0.5 biases the training set toward drug-like chemical matter.

* **Heavy Atoms ($\le$ 15):**
    * *Reason:* A hard constraint to limit molecular complexity. Fewer atoms generally translate to shorter string representations, which is critical for efficient encoding into our quantum states (Basis Embedding).

* **Rule of 5 Violations ($\le$ 1):**
    * *Reason:* Ensures general adherence to established guidelines for drug bioavailability.
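The same thresholds can be re-checked locally on a downloaded record, for example as a sanity check on the API results. The sketch below is a minimal illustration, assuming a property dictionary keyed by the ChEMBL field names `heavy_atoms`, `alogp`, `mw_freebase`, `qed_weighted`, and `num_ro5_violations`. The helper `passes_filters` and the two sample records are hypothetical and are not part of the pipeline.

```python
from typing import Any, Dict

# Thresholds taken from the filter list above.
MAX_HEAVY_ATOMS = 15
MAX_LOGP = 5.0
MAX_MW = 300.0
MIN_QED = 0.5
MAX_RO5_VIOLATIONS = 1


def passes_filters(props: Dict[str, Any]) -> bool:
    """Return True if a property record satisfies all five constraints.

    Hypothetical helper: values are cast with float() because they may be
    stored as strings; a missing or non-numeric value rejects the record.
    """
    try:
        return (
            float(props["heavy_atoms"]) <= MAX_HEAVY_ATOMS
            and float(props["alogp"]) <= MAX_LOGP
            and float(props["mw_freebase"]) <= MAX_MW
            and float(props["qed_weighted"]) >= MIN_QED
            and float(props["num_ro5_violations"]) <= MAX_RO5_VIOLATIONS
        )
    except (KeyError, TypeError, ValueError):
        return False


# Made-up records for illustration only (not real ChEMBL entries).
small_fragment = {"heavy_atoms": 12, "alogp": "1.8", "mw_freebase": "165.2",
                  "qed_weighted": "0.71", "num_ro5_violations": 0}
too_large = {"heavy_atoms": 28, "alogp": "3.9", "mw_freebase": "402.5",
             "qed_weighted": "0.62", "num_ro5_violations": 0}

print(passes_filters(small_fragment))
print(passes_filters(too_large))
```

Rejecting records with missing values is a design choice of this sketch; it keeps the check strict rather than silently passing incomplete data.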

In [2]:
# Using the ChEMBL API to get the molecules dataset
molecule = new_client.molecule

# Filter for drug-like small molecules
druglike_molecules = molecule.filter(
    molecule_properties__heavy_atoms__lte=15,           # At most 15 heavy atoms
    molecule_properties__alogp__lte=5,                  # LogP at most 5 (lipophilicity and membrane permeability)
    molecule_properties__mw_freebase__lte=300,          # Molecular weight at most 300 Da
    molecule_properties__qed_weighted__gte=0.5,         # Weighted QED at least 0.5 (drug-likeness)
    molecule_properties__num_ro5_violations__lte=1,     # At most 1 Rule of 5 violation
)

print("Training molecules set: ", len(druglike_molecules))  # Check how many molecules match the filter criteria

Training molecules set:  65778


## Molecular Filtering and Encoding
Here we process the raw API data.
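* **Salt Removal:** We strip counter-ions to keep only the parent structure.
* **Charge Filter:** We discard molecules that still carry a formal charge (including zwitterions) after salt removal.
* **Stereochemistry:** We remove stereochemical information to reduce the vocabulary size for the quantum embedding.
* **Validation:** We keep only molecules that translate to SELFIES without errors and contain no remaining mixture components (`.`).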
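Once a molecule passes these steps, its SELFIES string is split into bracketed tokens, and the tokens from all molecules are collected into a sorted vocabulary with three special tokens. The sketch below illustrates that bookkeeping with the standard library only. The regex-based `split_tokens` is a simplified, hypothetical stand-in for the `selfies` tokenizer, and the two input strings are illustrative examples.

```python
import re
from math import ceil, log2
from typing import List

TOKEN_PATTERN = re.compile(r"\[[^\[\]]*\]")


def split_tokens(selfies_string: str) -> List[str]:
    """Split a SELFIES string into bracketed tokens (simplified stand-in)."""
    return TOKEN_PATTERN.findall(selfies_string)


# Illustrative SELFIES strings (a short chain and a six-membered aromatic ring).
examples = ["[C][C][O]", "[C][=C][C][=C][C][=C][Ring1][=Branch1]"]

vocabulary = set()
max_len = 0
for s in examples:
    tokens = split_tokens(s)
    vocabulary.update(tokens)
    max_len = max(max_len, len(tokens))

# Special tokens: start of sequence, end of sequence, padding.
alphabet = ['<SOS>'] + sorted(vocabulary) + ['<EOS>', '<PAD>']
bits_per_token = ceil(log2(len(alphabet)))
max_len += 2  # room for <SOS> and <EOS>

print(alphabet)
print(len(alphabet), bits_per_token, max_len)
```

Sorting the vocabulary before assigning indices keeps the token-to-index mapping reproducible across runs, which matters because the indices are later turned into bit patterns.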

In [3]:
# --- Set up filters ---
remover = SaltRemover.SaltRemover()

N_MOLECS = 1000
molecules_subset = druglike_molecules[:N_MOLECS]

MAX_LEN = 0
alphabet = set()
valid_molecules_for_training = [] 
total_processed = 0
charged_skipped = 0
selfies_error_skipped = 0
mixture_skipped = 0

print(f"Starting with {len(molecules_subset)} molecules...")

for mol_data in molecules_subset:
    total_processed += 1
    smiles = mol_data.get('molecule_structures', {}).get('canonical_smiles')
    if not smiles:
        continue
        
    rdkit_mol = Chem.MolFromSmiles(smiles)
    if rdkit_mol is None:
        continue

    # Remove salts (counter-ions)
    neutral_mol = remover.StripMol(rdkit_mol)
    
    # Skip molecules that still carry a formal charge on any atom (includes zwitterions)
    if any(atom.GetFormalCharge() != 0 for atom in neutral_mol.GetAtoms()):
        charged_skipped += 1
        continue

    # Remove ALL stereochemistry
    cleaned_smiles = Chem.MolToSmiles(neutral_mol, isomericSmiles=False)
    
    if not cleaned_smiles:
        continue
        
    try:
        selfies = sf.encoder(cleaned_smiles)
    except sf.EncoderError:
        # Skip molecules with exotic valency (like hypervalent Iodine)
        selfies_error_skipped += 1
        continue 
    
    if selfies:
        # Final check for mixtures
        if "." in selfies:
            mixture_skipped += 1
            continue # Skip any remaining molecules with '.'
            
        tokens = list(sf.split_selfies(selfies))
        if MAX_LEN < len(tokens):
            MAX_LEN = len(tokens)
        alphabet.update(tokens)
        
        # This one is good! Store it.
        valid_molecules_for_training.append((mol_data, cleaned_smiles, selfies, tokens))

# --- Build final alphabet ---
alphabet = sorted(alphabet)
alphabet = ['<SOS>'] + alphabet + ['<EOS>', '<PAD>']

VOCABULARY_SIZE = len(alphabet)
BITS_PER_TOKEN = ceil(log2(VOCABULARY_SIZE))
MAX_LEN += 2  # For <SOS> and <EOS>

print(f"\n--- Filtering Stats ---")
print(f"Total molecules processed: {total_processed}")
print(f"Skipped (charged/zwitterion): {charged_skipped}")
print(f"Skipped (selfies valency error): {selfies_error_skipped}")
print(f"Skipped (mixture/'.'): {mixture_skipped}")
print(f"Kept for training: {len(valid_molecules_for_training)}")

print(f"\n--- Final Results ---")
print(f"Final Alphabet of SELFIES characters: {alphabet}")
print(f"Total unique characters in SELFIES: {VOCABULARY_SIZE}")
print(f"Maximum length of SELFIES in dataset: {MAX_LEN}")
print(f"Bits per token: {BITS_PER_TOKEN}")

# Create token to index mapping
token_to_index = {tok: i for i, tok in enumerate(alphabet)}

Starting with 1000 molecules...

--- Filtering Stats ---
Total molecules processed: 1000
Skipped (charged/zwitterion): 47
Skipped (selfies valency error): 0
Skipped (mixture/'.'): 0
Kept for training: 953

--- Final Results ---
Final Alphabet of SELFIES characters: ['<SOS>', '[#Branch1]', '[#Branch2]', '[#C]', '[#N]', '[=Branch1]', '[=Branch2]', '[=C]', '[=N]', '[=O]', '[=P]', '[=Ring1]', '[=S]', '[Br]', '[Branch1]', '[Branch2]', '[C]', '[Cl]', '[F]', '[I]', '[NH1]', '[N]', '[O]', '[PH1]', '[P]', '[Ring1]', '[Ring2]', '[S]', '<EOS>', '<PAD>']
Total unique characters in SELFIES: 30
Maximum length of SELFIES in dataset: 31
Bits per token: 5


In [4]:
# Token → index dictionary
token_to_index = {tok: i for i, tok in enumerate(alphabet)}

def print_token_bits(tokens, token_to_index):
    for tok in tokens:
        idx = token_to_index.get(tok, None)
        if idx is None:
            print(f"Token '{tok}' is not in the dictionary.")
            continue
        binary = format(idx, f'0{BITS_PER_TOKEN}b')
        print(f"'{tok}' → index {idx} → {binary}")

print_token_bits(alphabet, token_to_index)

'<SOS>' → index 0 → 00000
'[#Branch1]' → index 1 → 00001
'[#Branch2]' → index 2 → 00010
'[#C]' → index 3 → 00011
'[#N]' → index 4 → 00100
'[=Branch1]' → index 5 → 00101
'[=Branch2]' → index 6 → 00110
'[=C]' → index 7 → 00111
'[=N]' → index 8 → 01000
'[=O]' → index 9 → 01001
'[=P]' → index 10 → 01010
'[=Ring1]' → index 11 → 01011
'[=S]' → index 12 → 01100
'[Br]' → index 13 → 01101
'[Branch1]' → index 14 → 01110
'[Branch2]' → index 15 → 01111
'[C]' → index 16 → 10000
'[Cl]' → index 17 → 10001
'[F]' → index 18 → 10010
'[I]' → index 19 → 10011
'[NH1]' → index 20 → 10100
'[N]' → index 21 → 10101
'[O]' → index 22 → 10110
'[PH1]' → index 23 → 10111
'[P]' → index 24 → 11000
'[Ring1]' → index 25 → 11001
'[Ring2]' → index 26 → 11010
'[S]' → index 27 → 11011
'<EOS>' → index 28 → 11100
'<PAD>' → index 29 → 11101


### Bit-Basis Encoding
To process discrete tokens in a quantum circuit, we map the vocabulary index of each token to a binary string. This allows us to use **Basis Embedding** (preparing each qubit in state $|0\rangle$ or $|1\rangle$) as the input state for our variational circuit.
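As a worked illustration of this mapping, the sketch below encodes a short token sequence into fixed-length bit strings and decodes it again. It uses only the standard library and the 30-token alphabet printed by the notebook. The helpers `encode` and `decode` and the three-token example are hypothetical and only illustrate the idea.

```python
from math import ceil, log2
from typing import List

# Alphabet as printed in the notebook output (30 tokens).
ALPHABET = ['<SOS>', '[#Branch1]', '[#Branch2]', '[#C]', '[#N]', '[=Branch1]',
            '[=Branch2]', '[=C]', '[=N]', '[=O]', '[=P]', '[=Ring1]', '[=S]',
            '[Br]', '[Branch1]', '[Branch2]', '[C]', '[Cl]', '[F]', '[I]',
            '[NH1]', '[N]', '[O]', '[PH1]', '[P]', '[Ring1]', '[Ring2]', '[S]',
            '<EOS>', '<PAD>']
TOKEN_TO_INDEX = {tok: i for i, tok in enumerate(ALPHABET)}
BITS = ceil(log2(len(ALPHABET)))  # bits needed to index every token


def encode(tokens: List[str], max_len: int) -> List[str]:
    """Wrap tokens in <SOS>/<EOS>, pad to max_len, return one bit string per position."""
    seq = ['<SOS>'] + tokens + ['<EOS>']
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    seq += ['<PAD>'] * (max_len - len(seq))
    return [format(TOKEN_TO_INDEX[tok], f'0{BITS}b') for tok in seq]


def decode(bit_rows: List[str]) -> List[str]:
    """Invert encode(): map bit strings back to tokens and drop special tokens."""
    tokens = [ALPHABET[int(bits, 2)] for bits in bit_rows]
    return [t for t in tokens if t not in ('<SOS>', '<EOS>', '<PAD>')]


example = ['[C]', '[C]', '[O]']  # hypothetical short token sequence
rows = encode(example, max_len=6)
print(rows)
print(decode(rows))
```

With 30 tokens and 5 bits per token, two of the 32 possible bit patterns (`11110` and `11111`) correspond to no token. `decode` as written raises an `IndexError` on them, so a model that can emit arbitrary bit strings would need an explicit rule for these cases.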

In [5]:
basis_encoded_dataset = []
token_to_index = {tok: i for i, tok in enumerate(alphabet)}

def smiles_to_bits(tokens: list) -> np.ndarray:
    """Convert a list of SELFIES tokens (wrapped in <SOS>/<EOS>) to a 2D bit array with one row per token."""
    padded_tokens = ['<SOS>'] + tokens + ['<EOS>']
    bit_matrix = []
    for tok in padded_tokens:
        idx = token_to_index[tok]
        bits = list(f"{idx:0{BITS_PER_TOKEN}b}")  # length of the binary string depends on the number of bits required to represent the alphabet
        bit_matrix.append([int(b) for b in bits])
    return np.array(bit_matrix)

### Properties normalization
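Each property is rescaled linearly to $[0, \pi]$ so that it can later serve as a rotation angle: $\theta = \pi \,(x - x_{\min}) / (x_{\max} - x_{\min})$. The sketch below is a defensive variant of that formula using only the standard library. The name `normalize_angle` is hypothetical, and the clipping and zero-range handling are assumptions that go beyond the notebook's own function.

```python
import math


def normalize_angle(value: float, min_val: float, max_val: float,
                    target_max: float = math.pi) -> float:
    """Linearly map value from [min_val, max_val] to [0, target_max].

    Assumptions (not in the notebook's function): values outside the range
    are clipped, and a degenerate range (min_val == max_val) maps to 0.0.
    """
    if max_val == min_val:
        return 0.0
    fraction = (value - min_val) / (max_val - min_val)
    fraction = min(1.0, max(0.0, fraction))  # clip to [0, 1]
    return round(fraction * target_max, 3)


# LogP range reported by the notebook for the 1000-molecule subset: -1.53 to 4.66
print(normalize_angle(-1.53, -1.53, 4.66))  # lower bound maps to 0.0
print(normalize_angle(4.66, -1.53, 4.66))   # upper bound maps to pi, rounded to 3 decimals
print(normalize_angle(7.0, -1.53, 4.66))    # out-of-range value is clipped to the upper bound
```

Clipping matters when the minimum and maximum come from one subset but the function is later applied to molecules outside it; without clipping, such values would map to angles outside $[0, \pi]$.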

In [6]:
def normalize(value, min_val, max_val, target_max=np.pi):
    '''Normalize a value to the range [0, target_max] (default [0, pi]) so it can later be encoded as a rotation angle.'''
    norm = (value - min_val) / (max_val - min_val) * target_max
    return float(f"{norm:.3f}")
    
min_logp = float('inf')
max_logp = float('-inf')
min_qed = float('inf')
max_qed = float('-inf')
min_mw = float('inf')
max_mw = float('-inf')

# Iterate through the subset of molecules to find min/max properties to normalize them
for mol in molecules_subset:
    logP = mol.get('molecule_properties', {}).get('alogp')
    qed = mol.get('molecule_properties', {}).get('qed_weighted')
    mw = mol.get('molecule_properties', {}).get('mw_freebase')

    if logP is None or qed is None or mw is None:
        continue  # Skip if any property is missing

    logP = float(logP)
    qed = float(qed)
    mw = float(mw)

    if logP < min_logp:
        min_logp = logP
    if logP > max_logp:
        max_logp = logP

    if qed < min_qed:
        min_qed = qed
    if qed > max_qed:
        max_qed = qed

    if mw < min_mw:
        min_mw = mw
    if mw > max_mw:
        max_mw = mw

print(f"LogP range: {min_logp} to {max_logp}")
print(f"QED range: {min_qed} to {max_qed}")
print(f"MW range: {min_mw} to {max_mw}")

LogP range: -1.53 to 4.66
QED range: 0.5 to 0.92
MW range: 73.14 to 298.11


In [7]:
# --- Save metadata to a JSON file ---
metadata = {
    "vocabulary_size": VOCABULARY_SIZE,
    "bits_per_token": BITS_PER_TOKEN,
    "alphabet": alphabet,
    "max_sequence_length": MAX_LEN,
    "min_logP": min_logp,
    "max_logP": max_logp,
    "min_qed": min_qed,
    "max_qed": max_qed,
    "min_mw": min_mw,
    "max_mw": max_mw
}
METADATA_PATH = f"../data/metadata_selfies_{N_MOLECS}.json"
with open(METADATA_PATH, 'w') as f:
    json.dump(metadata, f, indent=4)
print(f"Metadata saved to {METADATA_PATH}.")

Metadata saved to ../data/metadata_selfies_1000.json.


### Train - Validate - Test

In this step the data is split once, with a fixed random seed, so that every model is trained, validated, and tested on the same partitions.

**Training Set (~70%):** Used by the optimizer to calculate gradients and update model weights ($\theta$). The goal is to fit the model to the data.

**Validation Set (~15%):** Used to detect overfitting and for hyperparameter tuning (e.g., number of layers).

**Test Set (~15%):** Used to generate the final accuracy numbers and molecular plots.
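The arithmetic behind the 70/15/15 partition can be sketched with the standard library alone. The helper `three_way_split` below is hypothetical and is not scikit-learn's `train_test_split`, so the row assignment will differ from the notebook's; it only illustrates how one shuffle followed by two cuts yields the three sets.

```python
import random
from math import ceil
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(rows: Sequence[T], holdout_frac: float = 0.3,
                    seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle rows once, then cut into train / validation / test.

    holdout_frac of the rows (rounded up) is held out and divided between
    validation and test; the remainder is the training set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_holdout = ceil(holdout_frac * len(shuffled))
    n_val = n_holdout // 2
    n_train = len(shuffled) - n_holdout
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# 953 is the number of molecules kept after filtering in this notebook.
train, val, test = three_way_split(range(953))
print(len(train), len(val), len(test))
```

For 953 rows this gives 667 training, 143 validation, and 143 test rows. Fixing the seed is what allows different models to be compared on identical partitions.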

In [8]:
all_processed_rows = []
header = ["logP", "qed", "mw"] + [f"token_{i}" for i in range(MAX_LEN)]

print(f"Processing {len(valid_molecules_for_training)} molecules...")

for mol_data, cleaned_smiles, selfies, tokens in valid_molecules_for_training:
    props = mol_data.get('molecule_properties', {})

    # Mixtures were already removed during filtering; only guard against empty sequences
    if not tokens:
        continue

    try:
        logP = float(props.get('alogp'))
        qed = float(props.get('qed_weighted'))
        mw = float(props.get('mw_freebase'))
    except (TypeError, ValueError):
        continue  # Skip molecules with missing or non-numeric properties

    # Normalize properties
    norm_logp = normalize(logP, min_logp, max_logp)
    norm_qed = normalize(qed, min_qed, max_qed)
    norm_mw = normalize(mw, min_mw, max_mw)

    # Skip sequences containing tokens outside the vocabulary
    if not all(tok in token_to_index for tok in tokens):
        continue

    # Encode the token list to bits (smiles_to_bits adds <SOS> and <EOS>)
    bit_matrix = smiles_to_bits(tokens)
    token_bits_as_strings = ["".join(map(str, row)) for row in bit_matrix]
    
    # Construct the row
    row = [norm_logp, norm_qed, norm_mw] + token_bits_as_strings

    # Padding
    while len(row) < len(header):
        pad_idx = token_to_index['<PAD>']
        pad_bits = f"{pad_idx:0{BITS_PER_TOKEN}b}"
        row.append(pad_bits)
    
    # Append to list instead of writing to file
    all_processed_rows.append(row)

print(f"Total valid rows processed: {len(all_processed_rows)}")

# 2. Perform the Train / Validation / Test Split
# Split: 70% Train, 15% Val, 15% Test
train_rows, test_val_rows = train_test_split(all_processed_rows, test_size=0.3, random_state=42)
val_rows, test_rows = train_test_split(test_val_rows, test_size=0.5, random_state=42)

print(f"Split sizes - Train: {len(train_rows)}, Val: {len(val_rows)}, Test: {len(test_rows)}")

# 3. Helper function to write CSVs
def save_split_csv(filename, data_rows, header):
    with open(filename, mode="w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(header)
        writer.writerows(data_rows)
    print(f"Saved {filename}")

# 4. Save the three files
base_path = f"../data/structured_data_selfies_{N_MOLECS}"
save_split_csv(f"{base_path}_train.csv", train_rows, header)
save_split_csv(f"{base_path}_val.csv", val_rows, header)
save_split_csv(f"{base_path}_test.csv", test_rows, header)

Processing 953 molecules...
Total valid rows processed: 953
Split sizes - Train: 667, Val: 143, Test: 143
Saved ../data/structured_data_selfies_1000_train.csv
Saved ../data/structured_data_selfies_1000_val.csv
Saved ../data/structured_data_selfies_1000_test.csv
