# Physicochemical Descriptors
In this notebook, we explore various physicochemical descriptors that can be used to represent molecules.

## Setting up the environment
Below we install additional Python libraries you will need to run this notebook:

In [None]:
! pip install scipy ipywidgets

## Loading the data

In [1]:
import pandas as pd

# Load the CSV file
file_path = '../data/zinc-250k-sample.csv'
df = pd.read_csv(file_path)

# Extract each property into a different variable
smiles = df['smiles'].tolist()
logP = df['logP'].tolist()
qed = df['qed'].tolist()
SAS = df['SAS'].tolist()

# Print the extracted variables
print("SMILES:", smiles[:10])
print("logP:", logP[:10])
print("qed:", qed[:10])
print("SAS:", SAS[:10])

SMILES: ['CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1', 'C[C@@H]1CC(Nc2cncc(-c3nncn3C)c2)C[C@@H](C)C1', 'N#Cc1ccc(-c2ccc(O[C@@H](C(=O)N3CCCC3)c3ccccc3)cc2)cc1', 'CCOC(=O)[C@@H]1CCCN(C(=O)c2nc(-c3ccc(C)cc3)n3c2CCCCC3)C1', 'N#CC1=C(SCC(=O)Nc2cccc(Cl)c2)N=C([O-])[C@H](C#N)C12CCCCC2', 'CC[NH+](CC)[C@](C)(CC)[C@H](O)c1cscc1Br', 'COc1ccc(C(=O)N(C)[C@@H](C)C/C(N)=N/O)cc1O', 'O=C(Nc1nc[nH]n1)c1cccnc1Nc1cccc(F)c1', 'Cc1c(/C=N/c2cc(Br)ccn2)c(O)n2c(nc3ccccc32)c1C#N', 'C[C@@H]1CN(C(=O)c2cc(Br)cn2C)CC[C@H]1[NH3+]']
logP: [5.0506, 3.1137, 4.96778, 4.00022, 3.60956, 2.6374, 0.9978, 2.3347, 4.2813, 0.8801]
qed: [0.702012232801, 0.928975488089, 0.599681738168, 0.690944353014, 0.7890271545540001, 0.824368801371, 0.327296942917, 0.6876120611859999, 0.483078613889, 0.8350242768889999]
SAS: [2.0840945720726807, 3.4320038192747795, 2.4706326078252, 2.8227533111709646, 4.035182138302743, 5.091437549913934, 2.8523163989668188, 2.627857491802141, 3.073935310669482, 3.947933469646377]


## Computing descriptors with RDKit
Below are some of the many physicochemical descriptors that can be computed for molecules using RDKit:

In [3]:
from rdkit import Chem
from rdkit.Chem import Descriptors

# Function to compute descriptors
def compute_descriptors(smiles_list):
    descriptors = []
    for smile in smiles_list:
        mol = Chem.MolFromSmiles(smile)
        if mol is None:  # Skip SMILES strings that RDKit cannot parse
            continue
        desc = {
            'SMILES': smile,
            'MW': Descriptors.MolWt(mol),
            'LogP': Descriptors.MolLogP(mol),
            'TPSA': Descriptors.TPSA(mol),
            'HBD': Descriptors.NumHDonors(mol),
            'HBA': Descriptors.NumHAcceptors(mol),
            'RotatableBonds': Descriptors.NumRotatableBonds(mol),
            'AromaticRings': Descriptors.NumAromaticRings(mol),
            'AliphaticRings': Descriptors.NumAliphaticRings(mol),
            'HeavyAtoms': Descriptors.HeavyAtomCount(mol),
            'FractionCSP3': Descriptors.FractionCSP3(mol),
            'NumAliphaticCarbocycles': Descriptors.NumAliphaticCarbocycles(mol),
            'NumAliphaticHeterocycles': Descriptors.NumAliphaticHeterocycles(mol),
            'NumAromaticCarbocycles': Descriptors.NumAromaticCarbocycles(mol),
            'NumAromaticHeterocycles': Descriptors.NumAromaticHeterocycles(mol),
        }
        descriptors.append(desc)
    return pd.DataFrame(descriptors)

# Compute descriptors for the first two molecules
df_descriptors = compute_descriptors(smiles[:2])

# Display the DataFrame
print(df_descriptors)

                                         SMILES       MW    LogP   TPSA  HBD  \
0       CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1  325.383  5.0506  42.24    1   
1  C[C@@H]1CC(Nc2cncc(-c3nncn3C)c2)C[C@@H](C)C1  285.395  3.1137  55.63    1   

   HBA  RotatableBonds  AromaticRings  AliphaticRings  HeavyAtoms  \
0    2               3              3               0          24   
1    5               3              2               1          21   

   FractionCSP3  NumAliphaticCarbocycles  NumAliphaticHeterocycles  \
0        0.2500                        0                         0   
1        0.5625                        1                         0   

   NumAromaticCarbocycles  NumAromaticHeterocycles  
0                       2                        1  
1                       0                        2  


We can check that the partition coefficients (logP) stored in the ZINC-250k data set match the values RDKit computes:

In [4]:
print(f'logPs from the ZINC-250k data set: {logP[:2]}')
just_computed_logP = [float(f'{i:.5f}') for i in df_descriptors['LogP']]  # Round to 5 decimal places
print(f'logPs we just computed using RDKit: {just_computed_logP}')

logPs from the ZINC-250k data set: [5.0506, 3.1137]
logPs we just computed using RDKit: [5.0506, 3.1137]
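
Comparing printed values by eye works for two molecules, but a tolerance-based check scales better to longer lists. The sketch below assumes the two lists hold the logP values printed above for the first two molecules; the names `dataset_logP` and `computed_logP` and the tolerance of `1e-4` are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Values copied from the outputs above (first two molecules).
dataset_logP = [5.0506, 3.1137]    # stored in the ZINC-250k CSV
computed_logP = [5.0506, 3.1137]   # recomputed with RDKit, rounded to 5 decimals

# Assumed absolute tolerance: small enough to catch real disagreements,
# large enough to absorb rounding in the stored values.
matches = np.isclose(dataset_logP, computed_logP, rtol=0.0, atol=1e-4)
all_match = bool(matches.all())
print(all_match)
```

In the notebook itself, the same check could be applied to `logP[:2]` and `df_descriptors['LogP']` directly.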


## RDKit 2D Descriptor Fingerprints
RDKit 2D descriptor fingerprints are vector representations of molecules built from descriptors computed on their 2D structure. Unlike bit-vector fingerprints, each element holds a descriptor value, such as molecular weight or TPSA, so the vector mixes real-valued and integer features.

Although we have seen how to compute these descriptors ourselves in the cell above, we will show here how to generate a 2D descriptor fingerprint with the help of the `MolecularDescriptorCalculator` class:

In [5]:
from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator

# Define the descriptors to be calculated
descriptor_names = [
    'MolWt', 'MolLogP', 'TPSA', 'NumHDonors', 'NumHAcceptors',
    'NumRotatableBonds', 'NumAromaticRings', 'NumAliphaticRings',
    'HeavyAtomCount', 'FractionCSP3'
]

# Create a MolecularDescriptorCalculator instance
calculator = MolecularDescriptorCalculator(descriptor_names)

# Function to compute descriptor fingerprints
def compute_descriptor_fingerprints(smiles):
    fingerprints = []
    for smi in smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            descriptors = calculator.CalcDescriptors(mol)
            fingerprints.append(descriptors)
    return pd.DataFrame(fingerprints, columns=descriptor_names)

# Compute descriptor fingerprints for the first five SMILES in our data set
df_fps = compute_descriptor_fingerprints(smiles[:5])

# Display the DataFrame
print(df_fps)

     MolWt  MolLogP    TPSA  NumHDonors  NumHAcceptors  NumRotatableBonds  \
0  325.383  5.05060   42.24           1              2                  3   
1  285.395  3.11370   55.63           1              5                  3   
2  382.463  4.96778   53.33           0              3                  5   
3  409.530  4.00022   64.43           0              5                  4   
4  413.910  3.60956  112.10           1              6                  4   

   NumAromaticRings  NumAliphaticRings  HeavyAtomCount  FractionCSP3  
0                 3                  0              24      0.250000  
1                 2                  1              21      0.562500  
2                 3                  1              29      0.200000  
3                 2                  2              30      0.541667  
4                 1                  2              28      0.400000  


While these descriptors could in theory be used "as-is," that is not the recommended practice in data-driven molecular applications. Read on below for some recommended best practices on how to process this data before using it in a machine learning model.

## Tips💡

### Know what each descriptor represents
It is very easy to compute various descriptors with the help of RDKit without knowing anything about how each descriptor is calculated under the hood. Nevertheless, we recommend that before using any descriptor, you understand what it measures and how it relates to the molecular property or task you are working with.

### Choose descriptors wisely

Not every descriptor will be informative for your task. You can, for instance, use a random forest to compute feature importances and see which descriptors contribute most to your predictive accuracy. You can also perform an ablation study, in which you progressively shuffle or remove descriptors to see how much information each one contributes to the model.
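
The shuffling variant of an ablation study can be sketched with NumPy alone. Everything in this example is an assumption made for illustration: the data are synthetic stand-ins for descriptors, and an ordinary least-squares model replaces the random forest so that the example has no extra dependencies. The idea is to shuffle one descriptor at a time and measure how much the error grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 200 "molecules", 3 "descriptors".
# Only the first two columns influence the target; the third is pure noise.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

def fit_linear(X, y):
    """Fit ordinary least squares with an intercept; return the coefficients."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared error of the fitted linear model on (X, y)."""
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ coef - y) ** 2))

coef = fit_linear(X, y)
baseline = mse(coef, X, y)

# Shuffle one descriptor at a time and record how much the error grows.
importance = []
for j in range(X.shape[1]):
    X_shuffled = X.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    importance.append(mse(coef, X_shuffled, y) - baseline)

print(importance)
```

A large increase in error means the model relied on that descriptor, while an increase near zero suggests it carries little information. In practice, the error would normally be measured on held-out data rather than on the training set.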

### Normalize your descriptors
If your descriptors differ greatly in magnitude, you are unintentionally (or perhaps intentionally) telling your model that the descriptor with the larger value range is more important. This is generally something to avoid. To ensure descriptors are on comparable scales, you can normalize or scale them using methods such as min-max scaling or z-score normalization. Additionally, transforming descriptors, e.g., through log transformations, can improve their distribution and enhance model performance. These points are important to consider when designing your own 2D descriptor fingerprints.
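
The three transformations mentioned above can be sketched with pandas and NumPy. The example reuses three descriptor columns from the table printed earlier; the name `df_example` and the choice of `ddof=0` (population standard deviation) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Three descriptor columns copied from the table printed earlier.
df_example = pd.DataFrame({
    'MolWt': [325.383, 285.395, 382.463, 409.530, 413.910],
    'TPSA': [42.24, 55.63, 53.33, 64.43, 112.10],
    'FractionCSP3': [0.25, 0.5625, 0.2, 0.541667, 0.4],
})

# Min-max scaling: map each column to the range [0, 1].
df_minmax = (df_example - df_example.min()) / (df_example.max() - df_example.min())

# Z-score normalization: zero mean and unit standard deviation per column.
df_zscore = (df_example - df_example.mean()) / df_example.std(ddof=0)

# Log transform of a strictly positive column.
log_tpsa = np.log(df_example['TPSA'])

print(df_minmax.round(3))
print(df_zscore.round(3))
```

When the scaled descriptors feed a machine learning model, the scaling statistics (minimum, maximum, mean, standard deviation) should be computed on the training set only and then reused for validation and test data.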