# Predicting Protonation States and Partitioning Behavior

This notebook demonstrates how to:
1.  Use `dimorphite-dl` to predict the protonation states of molecules at various pH values.
2.  Calculate LogP (lipophilicity) for these different protonation states using RDKit.
3.  Predict whether molecules will partition into an organic solvent or an aqueous phase (1M acid or 1M base) based on their charge state and LogP.

In [None]:
# Install necessary packages
# Note: `rdkit-pypi` is the legacy PyPI name; current RDKit releases are published as `rdkit`.
!pip install rdkit-pypi dimorphite-ojmb molvs

In [None]:
# Import necessary libraries
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import Descriptors # For general descriptors
from rdkit.Chem.Crippen import MolLogP # Specific for Crippen LogP
from rdkit.Chem.Draw import IPythonConsole # For displaying molecules in Jupyter
from IPython.display import display # Used explicitly when rendering grids inside loops

# Attempt to import dimorphite_dl
try:
    import dimorphite_dl
    print("dimorphite_dl imported successfully.")
except ImportError:
    print("dimorphite_dl not found. Will try to use command line if needed.")
    # Placeholder for potential subprocess import if dimorphite_dl is not a library
    # import subprocess 

# Import molvs for standardization
try:
    from molvs import standardize_smiles
    print("molvs imported successfully.")
except ImportError:
    print("molvs not found. Standardization might be limited.")

import os # For any potential file operations if needed

## 1. Define Example Molecules

We will use a few example molecules to demonstrate the prediction of protonation states and their partitioning behavior. These include:
*   Acetic Acid (a simple carboxylic acid)
*   Methylamine (a simple amine)
*   Glycine (an amino acid with both acidic and basic groups)
*   Benzene (a neutral molecule for comparison)

In [None]:
# Define example molecules using their SMILES strings
example_molecules_smiles = {
    "Acetic Acid": "CC(=O)O",
    "Methylamine": "CN",
    "Glycine": "C(C(=O)O)N", # or NCC(=O)O
    "Benzene": "c1ccccc1"
}

# Convert SMILES to RDKit molecule objects and display them
example_molecules_rdkit = {name: Chem.MolFromSmiles(smi) for name, smi in example_molecules_smiles.items()}

Draw.MolsToGridImage(list(example_molecules_rdkit.values()), molsPerRow=4, legends=list(example_molecules_rdkit.keys()))

## 2. Predict Protonation States with Dimorphite-DL

Now, we'll use `dimorphite_dl` to predict the dominant protonation state of our example molecules at different pH values:
*   **Physiological pH:** 7.4
*   **1M Acidic solution:** pH 0 (approximating 1M HCl)
*   **1M Basic solution:** pH 14 (approximating 1M NaOH)

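`dimorphite_dl` returns a list of SMILES strings, one for each ionization state it considers plausible at the given pH. For a single pH point this is usually a short list, and we take the first entry as the representative state.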
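Why the dominant state flips between these three pH values follows from the Henderson-Hasselbalch relation. The sketch below is independent of `dimorphite_dl`: it treats each ionizable site as a simple monoprotic equilibrium and uses approximate literature pKa values (4.76 for acetic acid, about 10.6 for methylammonium). These are assumptions for illustration, not values produced by this notebook, and the function name `fraction_deprotonated` is hypothetical.

```python
def fraction_deprotonated(ph: float, pka: float) -> float:
    """Henderson-Hasselbalch: fraction of a site present in its deprotonated form."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Approximate literature pKa values, used here only for illustration.
sites = {
    "Acetic acid (COOH)": 4.76,
    "Methylammonium (NH3+)": 10.6,
}

for site, pka in sites.items():
    for ph in (0.0, 7.4, 14.0):
        frac = fraction_deprotonated(ph, pka)
        print(f"{site:24s} pH {ph:4.1f}: {frac:.4f} deprotonated")
```

At a pH equal to the pKa the fraction is exactly 0.5. Two or more pH units away from the pKa, one form accounts for more than 99% of the site, which is why a single representative SMILES per pH is a reasonable simplification for these molecules.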

In [None]:
# Define pH values and run Dimorphite-DL for protonation state prediction
ph_values = {
    "Physiological": 7.4,
    "1M Acid (pH 0)": 0.0,
    "1M Base (pH 14)": 14.0
}

protonation_results = {}

# Check if dimorphite_dl was imported successfully before trying to use it
if 'dimorphite_dl' in globals():
    print("Running Dimorphite-DL...")
    for name, smiles in example_molecules_smiles.items():
        print(f"\nProcessing molecule: {name} ({smiles})")
        protonation_results[name] = {}
        output_mols = []
        output_legends = []
        for ph_label, ph_val in ph_values.items():
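            # Run Dimorphite-DL at a single pH point (min_ph == max_ph).
            # Assumption: the installed `dimorphite-ojmb` package exposes a
            # `DimorphiteDL` class whose `protonate()` returns a list of SMILES.
            # Other Dimorphite-DL distributions use a different entry point, so
            # adjust this call to match the package you actually installed.
            try:
                protonated_smiles_list = dimorphite_dl.DimorphiteDL(
                    min_ph=ph_val,
                    max_ph=ph_val,
                    # pka_precision=1.0 # Default, can be adjusted
                ).protonate(smiles)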
                
                if protonated_smiles_list:
                    protonated_smiles = protonated_smiles_list[0] # Take the first result
                    protonation_results[name][ph_label] = protonated_smiles
                    print(f"  pH {ph_val} ({ph_label}): {protonated_smiles}")
                    mol = Chem.MolFromSmiles(protonated_smiles)
                    if mol:
                        output_mols.append(mol)
                        output_legends.append(f"{name} @ pH {ph_val}")
                    else:
                        print(f"    Could not generate molecule from SMILES: {protonated_smiles}")
                else:
                    print(f"  pH {ph_val} ({ph_label}): No result from dimorphite_dl. Using original SMILES.")
                    # Store original if no result, or handle as error
                    protonation_results[name][ph_label] = smiles 
                    mol = Chem.MolFromSmiles(smiles)
                    if mol:
                        output_mols.append(mol)
                        output_legends.append(f"{name} @ pH {ph_val} (original)")
                    else:
                        print(f"    Could not generate molecule from original SMILES: {smiles}")


            except Exception as e:
                print(f"  Error running dimorphite_dl for {name} at pH {ph_val}: {e}")
                protonation_results[name][ph_label] = smiles # Store original on error
                mol = Chem.MolFromSmiles(smiles)
                if mol:
                    output_mols.append(mol)
                    output_legends.append(f"{name} @ pH {ph_val} (error)")
                else:
                    print(f"    Could not generate molecule from original SMILES: {smiles}")


        if output_mols:
            display(Draw.MolsToGridImage(output_mols, molsPerRow=len(ph_values), legends=output_legends, subImgSize=(300,300)))
        else:
            print(f"No molecules to display for {name}")

else:
    print("dimorphite_dl module not available. Skipping protonation state prediction.")
    # Initialize protonation_results with original SMILES if dimorphite_dl is not available
    for name, smiles in example_molecules_smiles.items():
        protonation_results[name] = {}
        for ph_label, ph_val in ph_values.items():
            protonation_results[name][ph_label] = smiles


# Display the collected results (dictionary of SMILES)
print("\nProtonation SMILES Results:")
for name, ph_data in protonation_results.items():
    print(f"  {name}:")
    for ph_label, smi in ph_data.items():
        print(f"    {ph_label}: {smi}")

## 3. Calculate LogP for Protonation States

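LogP is the base-10 logarithm of the partition coefficient, the ratio of a compound's equilibrium concentration in an organic phase (conventionally 1-octanol) to its concentration in water. It measures a compound's lipophilicity (affinity for fatty/non-polar environments) versus its hydrophilicity (affinity for watery/polar environments).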
- A **positive LogP** means the compound is more lipophilic (prefers organic solvents).
- A **negative LogP** means the compound is more hydrophilic (prefers aqueous solvents).
- A **LogP around 0** means it has similar affinity for both.

The protonation state of a molecule can significantly affect its LogP, as charged species are generally more hydrophilic. We will calculate the Wildman-Crippen LogP value (often referred to as MolLogP in RDKit) for each of the protonation states predicted by `dimorphite_dl`.
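To make the sign convention concrete, a LogP value can be converted into the fraction of compound expected in the organic layer. This is a minimal sketch that assumes a single species with partition coefficient P = 10^LogP, ideal behavior, and user-supplied phase volumes. `fraction_in_organic` and its parameters are illustrative names, not RDKit functions.

```python
def fraction_in_organic(logp: float, v_org: float = 1.0, v_aq: float = 1.0) -> float:
    """Fraction of solute in the organic phase at equilibrium.

    P = 10**logp is the concentration ratio [organic]/[aqueous];
    v_org and v_aq are the phase volumes in the same (arbitrary) units.
    """
    p = 10.0 ** logp
    return p * v_org / (p * v_org + v_aq)


for logp in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"LogP {logp:+.1f}: {fraction_in_organic(logp):.1%} in organic phase")
```

With equal volumes, a LogP of 0 gives an even split, and a LogP of 1 corresponds to roughly 91% in the organic phase (10/11).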

In [None]:
# Calculate LogP for each protonated state

logP_results = {}

print("Calculating LogP values using RDKit's Crippen calculator (MolLogP):")
for name, ph_smi_map in protonation_results.items():
    logP_results[name] = {}
    print(f"\nMolecule: {name}")
    for ph_label, smiles in ph_smi_map.items():
        mol = Chem.MolFromSmiles(smiles)
        if mol:
            try:
                logp_value = MolLogP(mol)
                logP_results[name][ph_label] = {"smiles": smiles, "logP": logp_value}
                print(f"  {ph_label} (SMILES: {smiles}): LogP = {logp_value:.2f}")
            except Exception as e:
                logP_results[name][ph_label] = {"smiles": smiles, "logP": "Error"}
                print(f"  {ph_label} (SMILES: {smiles}): Error calculating LogP - {e}")
        else:
            logP_results[name][ph_label] = {"smiles": smiles, "logP": "Invalid SMILES"}
            print(f"  {ph_label} (SMILES: {smiles}): Could not create RDKit molecule, skipping LogP.")

# For clarity, print the logP_results dictionary (optional)
# import json
# print("\nLogP Results Structure:")
# print(json.dumps(logP_results, indent=2))

## 4. Predict and Explain Partitioning Behavior

Based on the calculated LogP values for the dominant protonation state of each molecule at a given pH, we can predict its likely partitioning behavior between an organic solvent and an aqueous phase.

**General Principles:**
*   **Aqueous phase at 1M Acid (e.g., pH 0):** We look at the LogP of the molecule's form dominant at pH 0.
*   **Aqueous phase at 1M Base (e.g., pH 14):** We look at the LogP of the molecule's form dominant at pH 14.
*   **Partitioning Prediction** (using cutoffs of ±0.5, a heuristic choice):
    *   If **LogP > 0.5**: The molecule is predicted to favor the **organic solvent**.
    *   If **LogP < -0.5**: The molecule is predicted to favor the **aqueous phase**.
    *   If **LogP is between -0.5 and 0.5**: The molecule may show significant solubility in both phases or partition roughly equally.

Charged species (ions) are generally much more water-soluble (lower LogP) than their neutral counterparts. `dimorphite_dl` helps us identify the correct species to consider at each pH.
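The charge state can also be used directly, alongside LogP. The sketch below reads formal charges straight from the bracket atoms of a SMILES string and applies a hypothetical rule: any explicitly charged atom sends the species to the aqueous phase, otherwise the LogP cutoffs decide. The helper names `atom_charges` and `predict_phase` and the rule itself are assumptions for illustration. In the notebook, RDKit's `Chem.GetFormalCharge(mol)` gives the net charge more robustly than this regex-based reader, and the LogP values passed in below are placeholders rather than computed results.

```python
import re
from typing import List

# Charge part at the end of a bracket atom: "-", "+", "++", "+2", ...
# optionally followed by an atom-map label such as ":1".
_CHARGE_RE = re.compile(r"(\++|-+)(\d*)(?::\d+)?$")


def atom_charges(smiles: str) -> List[int]:
    """Return the non-zero formal charges written in a SMILES string."""
    charges = []
    for bracket in re.findall(r"\[([^\]]+)\]", smiles):
        match = _CHARGE_RE.search(bracket)
        if match is None:
            continue  # bracket atom without a charge, e.g. [nH]
        signs, digits = match.groups()
        magnitude = int(digits) if digits else len(signs)
        charges.append(magnitude if signs[0] == "+" else -magnitude)
    return charges


def predict_phase(smiles: str, logp: float, cutoff: float = 0.5) -> str:
    """Hypothetical rule: any charged atom -> aqueous; otherwise use LogP cutoffs."""
    if atom_charges(smiles):
        return "aqueous"
    if logp > cutoff:
        return "organic"
    if logp < -cutoff:
        return "aqueous"
    return "mixed"


# Placeholder LogP values, for illustration only.
print(predict_phase("CC(=O)[O-]", -1.2))
print(predict_phase("c1ccccc1", 1.7))
print(sum(atom_charges("[NH3+]CC(=O)[O-]")))  # net charge of a zwitterion
```

Checking for any charged atom, rather than the net charge, keeps zwitterions such as glycine in the aqueous phase even though their charges cancel. The same rule would misclassify charge-separated neutral groups such as nitro, so it is only a starting point.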

In [None]:
# Predict partitioning behavior based on LogP values
print("Predicting Partitioning Behavior:\n")

def classify_partition(logp):
    """Map a LogP value to a qualitative partitioning label (cutoffs at +/-0.5)."""
    if logp > 0.5:
        return "favors organic solvent"
    if logp < -0.5:
        return "favors aqueous phase"
    return "has mixed/equal partitioning"

# (key used in ph_values, description used in the printed sentence)
aqueous_conditions = [
    ("1M Acid (pH 0)", "pH 0 (1M Acid)"),
    ("1M Base (pH 14)", "pH 14 (1M Base)"),
]

for name, ph_data in logP_results.items():
    print(f"--- {name} ---")
    for label, description in aqueous_conditions:
        if label not in ph_data:
            print(f"  Data for {name} at {label} not found.")
            continue
        entry = ph_data[label]
        if isinstance(entry['logP'], (int, float)):
            # Numeric LogP: report the species and its predicted phase preference
            print(f"  At {description}, {name} (as {entry['smiles']}, LogP: {entry['logP']:.2f}) {classify_partition(entry['logP'])}.")
        else:
            # LogP holds an error marker string from the previous cell
            print(f"  At {description}, LogP for {name} was '{entry['logP']}'. Partitioning cannot be determined.")
    print("") # Newline for readability

## 5. Summary and Conclusion

This notebook demonstrated a workflow for predicting the protonation states of small molecules and their subsequent partitioning behavior in different solvent conditions.

**Key Steps and Takeaways:**

1.  **Environment Setup:** We installed the necessary packages: `rdkit-pypi`, `dimorphite-ojmb` (imported as `dimorphite_dl`), and `molvs`.
2.  **Protonation State Prediction:** Using `dimorphite-dl`, we predicted the dominant ionization states of example molecules (Acetic Acid, Methylamine, Glycine, Benzene) at physiological pH (7.4), highly acidic conditions (pH 0 for 1M acid), and highly basic conditions (pH 14 for 1M base). This step is crucial as the charge of a molecule significantly impacts its properties.
3.  **LogP Calculation:** We calculated the Crippen LogP values for each relevant protonation state using RDKit. This allowed us to quantify the lipophilicity/hydrophilicity of each form. We observed that charged species generally have lower LogP values.
4.  **Partitioning Prediction:** By combining the protonation state information with the LogP values, we predicted whether each molecule would preferentially partition into an organic solvent or an aqueous solution (1M acid or 1M base).
    *   For example, a carboxylic acid like Acetic Acid is neutral and more lipophilic at low pH, but becomes charged (deprotonated) and more hydrophilic at neutral and high pH.
    *   An amine like Methylamine is charged (protonated) and more hydrophilic at low and neutral pH, but becomes neutral and more lipophilic at high pH.
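    *   Amphoteric compounds like Glycine carry charge at every pH considered here: cationic at low pH (protonated amine, neutral carboxylic acid), zwitterionic near neutral pH, and anionic at high pH. They are therefore expected to stay hydrophilic throughout.
    *   Compounds without ionizable groups, like Benzene, have the same predicted form, and therefore the same LogP, at every pH.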

This process is valuable in many areas of chemistry and drug discovery, such as understanding drug absorption, distribution, metabolism, and excretion (ADME), or designing extraction and separation protocols.