# T03: Representation Matters — Fingerprints, Descriptors, and Beyond

Authors:

* Aaryan Jaitly
* Atia Tul Wahab
* Sadia Chaudhry
* Youssef Mohamed Fathy
* Zenab Khan

## Aim of this talktorial

The main goal of this project is to understand how molecular structures are converted into machine-readable formats, and which options exist for representing molecular data. In short, we want to understand what we are really feeding to our machine.

### Contents in _Theory_

* What is a Molecular Representation?
* Ways of Representing Molecules
* Feature Relevance for Solubility Prediction
* Additional features beyond Mordred and MACCS
* Conclusion and Strategic Recommendation


## Theory

Building an ML model requires more than just a dataset and a label; it requires domain expertise. To successfully predict LogS, we need a large set of molecules and encodings, but we must also critically analyze the source of our data.

We cannot blindly treat all solubility values as equal. We need to know whether an experiment measured kinetic solubility (often used in high-throughput screening) or equilibrium solubility (for which the shake-flask method is the gold standard). The two approaches can yield different results for the same molecule. By understanding these experimental nuances, we can filter out noise and account for biases, leading to a more robust and scientifically accurate model.
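
The filtering idea above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the record fields (`mol_id`, `logS`, `method`), the toy values, and the 0.5 log-unit disagreement threshold are all hypothetical and would need to be adapted to the actual dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records; identifiers and values are illustrative only.
records = [
    {"mol_id": "mol_A", "logS": -2.1, "method": "shake_flask"},
    {"mol_id": "mol_A", "logS": -2.2, "method": "shake_flask"},
    {"mol_id": "mol_A", "logS": -2.4, "method": "shake_flask"},
    {"mol_id": "mol_A", "logS": -1.4, "method": "kinetic"},
    {"mol_id": "mol_B", "logS": -4.0, "method": "shake_flask"},
    {"mol_id": "mol_B", "logS": -5.2, "method": "shake_flask"},
    {"mol_id": "mol_C", "logS": -3.0, "method": "kinetic"},
]


def curate(records, keep_method="shake_flask", max_spread=0.5):
    """Keep one measurement method, then merge replicates per molecule.

    Returns (curated, flagged): curated maps mol_id to the median LogS,
    flagged lists molecules whose replicates differ by more than
    max_spread log units (an assumed threshold, not a standard).
    """
    by_molecule = defaultdict(list)
    for rec in records:
        if rec["method"] == keep_method:
            by_molecule[rec["mol_id"]].append(rec["logS"])

    curated, flagged = {}, []
    for mol_id, values in by_molecule.items():
        if max(values) - min(values) > max_spread:
            flagged.append(mol_id)  # replicates disagree; inspect manually
        else:
            curated[mol_id] = median(values)
    return curated, flagged


curated, flagged = curate(records)
print("curated:", curated)
print("flagged for inspection:", flagged)
```

Molecules that only have kinetic measurements drop out entirely here. Whether to discard them, model them separately, or add the method as an input feature is a design decision that depends on how much data remains.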


### What is a Molecular Representation?
Molecules are three-dimensional objects with complex spatial and electronic arrangements. Machine learning algorithms, however, cannot process these chemical structures directly. A molecular representation translates them into machine-readable vectors. This process transforms the rich, continuous information of molecular geometry and electronic properties into a discrete set of features (also called descriptors), which are the individual numerical or binary variables that serve as the model's input [1].

The choice of representation is fundamental: it predefines which chemical information is available to the model for learning. To predict aqueous solubility effectively, a representation has to emphasize the features that encode the physicochemical interactions governing dissolution. This report discusses the curation of features from two widely used paradigms for molecular representation: descriptor-based (Mordred) and fingerprint-based (MACCS) methods.

### Ways of Representing Molecules
Mordred Descriptors 
* Mordred is a freely available descriptor calculator that computes more than 1,800 2D and 3D descriptors from a molecular structure, which is usually supplied as a SMILES string.

MACCS Keys 
* MACCS (Molecular ACCess System) keys represent molecules as fixed 166-bit binary fingerprints, where each bit indicates the presence or absence of a predefined structural pattern [3].
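
To make the fingerprint idea concrete, the sketch below computes MACCS keys for two molecules. It assumes RDKit is installed; RDKit is our choice for illustration and is not prescribed by the text above. Note that RDKit returns a 167-bit vector: bit 0 is an unused placeholder so that bit indices 1 to 166 line up with the MACCS key numbers. The helper name `maccs_bits` is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys


def maccs_bits(smiles):
    """Return the MACCS fingerprint of a SMILES string as a list of 0/1 ints."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    fp = MACCSkeys.GenMACCSKeys(mol)
    # ToBitString() gives a string of '0'/'1' characters, one per bit.
    return [int(bit) for bit in fp.ToBitString()]


ethanol = maccs_bits("CCO")
phenol = maccs_bits("c1ccccc1O")

print("vector length:", len(phenol))  # 167 in RDKit (bit 0 is unused)
print("keys set for ethanol:", [i for i, b in enumerate(ethanol) if b])
print("keys set for phenol:", [i for i, b in enumerate(phenol) if b])
```

Because every molecule maps to a vector of the same length, the output can be stacked directly into a feature matrix. The price is that only the predefined patterns are visible to the model.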

### Feature Relevance for Solubility Prediction
Not all Mordred descriptors and MACCS keys are relevant for predicting the solubility of a compound. This section therefore focuses on the features that plausibly matter for aqueous solubility. Solubility prediction requires features that capture the fundamental physicochemical interactions between a molecule and water:

* Hydrophobic-hydrophilic balance (lipophilicity)
* Polarity and hydrogen-bonding capacity
* Molecular size and shape
* Ionizable functional groups and charge
* Aromaticity and hydrophobic π-systems
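
As a rough illustration of the five categories above, the sketch below computes one or two simple descriptors per category. It uses RDKit's built-in descriptors as stand-ins for the corresponding Mordred descriptors, and the mapping of descriptors to categories is our own illustrative choice rather than a validated feature selection. The function name `solubility_features` is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def solubility_features(smiles):
    """Compute a small, hand-picked set of solubility-related descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return {
        # Hydrophobic-hydrophilic balance (calculated Crippen log P, not measured)
        "logp": Descriptors.MolLogP(mol),
        # Polarity and hydrogen-bonding capacity
        "tpsa": Descriptors.TPSA(mol),
        "h_donors": Descriptors.NumHDonors(mol),
        "h_acceptors": Descriptors.NumHAcceptors(mol),
        # Molecular size and shape
        "mol_wt": Descriptors.MolWt(mol),
        "rotatable_bonds": Descriptors.NumRotatableBonds(mol),
        # Charge (formal charge of the drawn structure only; no pKa information)
        "formal_charge": Chem.GetFormalCharge(mol),
        # Aromaticity
        "aromatic_rings": Descriptors.NumAromaticRings(mol),
    }


for name, smiles in [("ethanol", "CCO"), ("phenol", "c1ccccc1O"), ("acetate", "CC(=O)[O-]")]:
    print(name, solubility_features(smiles))
```

The formal charge only reflects the protonation state written in the SMILES string. Capturing ionization at a given pH would require pKa values, which this sketch does not compute.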


### Additional features beyond Mordred and MACCS
Beyond the properties above, a few equally important ones describe how tightly a solid holds together and how a compound partitions between phases. Melting point and lattice energy strongly affect solubility, but they are absent from traditional descriptor sets [11]. The solid-state form adds another layer of complexity: different polymorphs, hydrates, solvates, and especially amorphous forms can change solubility dramatically, because amorphous forms possess higher free energy than their crystalline counterparts.

Experimental conditions further complicate matters. Variations in measurement technique (e.g., shake-flask, nephelometry, HPLC), temperature, agitation, and buffer composition lead to inconsistent reported values across datasets [12]. Since these external factors are not encoded by structural descriptors, they must be added as separate features for reliable models.

In short, a robust solubility model should combine structural descriptors with pKa, melting point, log P/log D, solid-state characteristics, intrinsic solubility, and experimental conditions. MACCS and Mordred alone do not supply these as measured quantities (Mordred offers only a calculated log P estimate).
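
One simple way to add such non-structural information is to append it to the structural feature vector. The sketch below is a minimal illustration in plain Python: the function name `build_feature_row`, the choice of extra features, the one-hot encoding of the measurement method, and the example numbers are all assumptions made for demonstration.

```python
# Measurement techniques named in the text, used as one-hot categories.
METHODS = ("shake_flask", "nephelometry", "hplc")


def build_feature_row(structural, melting_point_c, temperature_c, method):
    """Append experimental features to a structural descriptor vector.

    structural      : sequence of numbers (e.g. fingerprint bits or descriptors)
    melting_point_c : melting point in degrees Celsius
    temperature_c   : measurement temperature in degrees Celsius
    method          : one of METHODS, encoded as a one-hot block
    """
    if method not in METHODS:
        raise ValueError(f"Unknown measurement method: {method!r}")
    one_hot = [1.0 if method == m else 0.0 for m in METHODS]
    return list(structural) + [melting_point_c, temperature_c] + one_hot


# Hypothetical example: two structural bits plus experimental context.
row = build_feature_row([1.0, 0.0], melting_point_c=150.0,
                        temperature_c=25.0, method="nephelometry")
print(row)
```

In practice, many molecules will lack a measured melting point or a documented method, so a real pipeline also needs a strategy for missing values, which this sketch leaves out.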

### Conclusion and Strategic Recommendation
MACCS and Mordred remain useful baselines, but they oversimplify the physics that we now know to be important for solubility. A sensible next step is to combine more expressive representations with a small number of key physicochemical descriptors. Concretely, we propose training a graph neural network (GNN) on molecular graphs and augmenting its learned representation with experimentally measured or accurately predicted melting points (and, where available, simple 3D shape or charge descriptors) [13, 15]. This hybrid approach preserves the flexibility of representation learning while explicitly addressing one major limitation we observed in Task 2: the inability of our current 2D feature set to recognise when poor solubility is driven by strong crystal packing rather than by solvation alone.
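
The data flow of this hybrid idea can be sketched without a deep learning framework. The example below, which assumes RDKit is installed, turns a SMILES string into a graph, applies a single weight-free message-passing step, pools the atoms into one vector, and appends the melting point. It is only a stand-in: a real GNN would use learned weights, several layers, and richer atom and bond features, and all function names here are hypothetical.

```python
from rdkit import Chem


def mol_to_graph(smiles):
    """Return (node_features, edges) for the heavy-atom graph of a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Minimal atom features: atomic number, heavy-atom degree, aromatic flag.
    nodes = [
        [float(a.GetAtomicNum()), float(a.GetDegree()), float(a.GetIsAromatic())]
        for a in mol.GetAtoms()
    ]
    # Store each bond in both directions so messages flow both ways.
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges.extend([(i, j), (j, i)])
    return nodes, edges


def message_pass(nodes, edges):
    """One untrained step: each atom adds its neighbours' features to its own."""
    updated = [list(features) for features in nodes]
    for src, dst in edges:
        for k, value in enumerate(nodes[src]):
            updated[dst][k] += value
    return updated


def sum_readout(nodes):
    """Pool atom vectors into one molecule-level vector by summation."""
    return [sum(column) for column in zip(*nodes)]


def hybrid_vector(smiles, melting_point_c):
    """Graph-derived vector with the melting point appended as an extra feature."""
    nodes, edges = mol_to_graph(smiles)
    return sum_readout(message_pass(nodes, edges)) + [melting_point_c]


# Ethanol with its approximate literature melting point of about -114 degrees Celsius.
print(hybrid_vector("CCO", -114.1))
```

Concatenating the melting point after the readout, rather than attaching it to every atom, keeps the graph part purely structural and lets a downstream regressor weigh the solid-state information separately.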