# Molecular Visualization: Computational Chemistry Basics

Learn computational chemistry fundamentals through molecular structure analysis.

## Dataset

25 common molecules, stored as SMILES strings:
- Drugs and pharmaceuticals
- Natural products
- Neurotransmitters and biological molecules

## Methods
- Molecular structure visualization
- Chemical property calculation
- Molecular descriptors
- Structural similarity
- 3D conformer generation

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import py3Dmol
import seaborn as sns
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs, Descriptors, Draw, Lipinski

warnings.filterwarnings("ignore")

plt.style.use("seaborn-v0_8-darkgrid")
%matplotlib inline

print("✓ Setup complete")

## 1. Load and Explore Molecules

In [None]:
# Load molecules from CSV
df = pd.read_csv("sample_molecules.csv")

# Convert SMILES to RDKit molecule objects
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)

# Remove any invalid molecules
df = df[df["mol"].notna()].reset_index(drop=True)

print(f"Number of molecules: {len(df)}")
print(f"\nCategories: {df['category'].value_counts().to_dict()}")
print("\nFirst 5 molecules:")
df[["name", "category", "use"]].head()

## 2. Visualize 2D Structures

In [None]:
# Display first 6 molecules
molecules_to_show = df.head(6)["mol"].tolist()
legends = df.head(6)["name"].tolist()

img = Draw.MolsToGridImage(
    molecules_to_show, molsPerRow=3, subImgSize=(300, 300), legends=legends, returnPNG=False
)
img

## 3. Calculate Molecular Properties

In [None]:
# Calculate basic properties
def calc_properties(mol):
    return {
        "molecular_weight": Descriptors.MolWt(mol),
        "logP": Descriptors.MolLogP(mol),
        "num_h_donors": Lipinski.NumHDonors(mol),
        "num_h_acceptors": Lipinski.NumHAcceptors(mol),
        "num_rotatable_bonds": Lipinski.NumRotatableBonds(mol),
        "num_aromatic_rings": Lipinski.NumAromaticRings(mol),
        "tpsa": Descriptors.TPSA(mol),
    }


# Apply to all molecules
props = df["mol"].apply(calc_properties).apply(pd.Series)
df = pd.concat([df, props], axis=1)

print("Molecular Properties:")
print(df[["name", "molecular_weight", "logP", "num_h_donors", "num_h_acceptors"]].round(2))

## 4. Property Distributions

In [None]:
# Plot property distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Molecular weight
axes[0, 0].hist(df["molecular_weight"], bins=15, color="steelblue", alpha=0.7, edgecolor="black")
axes[0, 0].set_xlabel("Molecular Weight (g/mol)", fontsize=11)
axes[0, 0].set_ylabel("Frequency", fontsize=11)
axes[0, 0].set_title("Molecular Weight Distribution", fontsize=12, fontweight="bold")
axes[0, 0].axvline(500, color="red", linestyle="--", label="Lipinski's Rule (≤500)")
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# LogP
axes[0, 1].hist(df["logP"], bins=15, color="green", alpha=0.7, edgecolor="black")
axes[0, 1].set_xlabel("LogP (Lipophilicity)", fontsize=11)
axes[0, 1].set_ylabel("Frequency", fontsize=11)
axes[0, 1].set_title("LogP Distribution", fontsize=12, fontweight="bold")
axes[0, 1].axvline(5, color="red", linestyle="--", label="Lipinski's Rule (≤5)")
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# H-bond donors
donor_counts = df["num_h_donors"].value_counts().sort_index()
axes[1, 0].bar(
    donor_counts.index, donor_counts.values, color="orange", alpha=0.7, edgecolor="black"
)
axes[1, 0].set_xlabel("Number of H-bond Donors", fontsize=11)
axes[1, 0].set_ylabel("Frequency", fontsize=11)
axes[1, 0].set_title("H-bond Donor Distribution", fontsize=12, fontweight="bold")
axes[1, 0].axvline(5, color="red", linestyle="--", label="Lipinski's Rule (≤5)")
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3, axis="y")

# H-bond acceptors
acceptor_counts = df["num_h_acceptors"].value_counts().sort_index()
axes[1, 1].bar(
    acceptor_counts.index, acceptor_counts.values, color="purple", alpha=0.7, edgecolor="black"
)
axes[1, 1].set_xlabel("Number of H-bond Acceptors", fontsize=11)
axes[1, 1].set_ylabel("Frequency", fontsize=11)
axes[1, 1].set_title("H-bond Acceptor Distribution", fontsize=12, fontweight="bold")
axes[1, 1].axvline(10, color="red", linestyle="--", label="Lipinski's Rule (≤10)")
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()

print("Red dashed lines show Lipinski's Rule of Five thresholds for drug-likeness.")

## 5. Lipinski's Rule of Five

In [None]:
# Check Lipinski's Rule of Five
def check_lipinski(row):
    violations = 0
    reasons = []

    if row["molecular_weight"] > 500:
        violations += 1
        reasons.append("MW>500")
    if row["logP"] > 5:
        violations += 1
        reasons.append("logP>5")
    if row["num_h_donors"] > 5:
        violations += 1
        reasons.append("HBD>5")
    if row["num_h_acceptors"] > 10:
        violations += 1
        reasons.append("HBA>10")

    return pd.Series(
        {
            "lipinski_violations": violations,
            "violation_reasons": ", ".join(reasons) if reasons else "None",
        }
    )


lipinski_results = df.apply(check_lipinski, axis=1)
df = pd.concat([df, lipinski_results], axis=1)

print("Lipinski's Rule of Five Analysis:")
print(f"\nMolecules passing (≤1 violation): {(df['lipinski_violations'] <= 1).sum()}/{len(df)}")
print(f"Molecules with 0 violations: {(df['lipinski_violations'] == 0).sum()}/{len(df)}")

print("\nMolecules with violations:")
violations_df = df[df["lipinski_violations"] > 0][
    ["name", "lipinski_violations", "violation_reasons"]
]
if len(violations_df) > 0:
    print(violations_df.to_string(index=False))
else:
    print("None")

## 6. 3D Conformer Generation

In [None]:
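# Generate a 3D conformer for a molecule
def generate_3d_conformer(mol):
    # AddHs returns a new molecule, so the input is left unmodified
    mol_3d = Chem.AddHs(mol)
    # EmbedMolecule returns -1 when no 3D coordinates could be generated
    if AllChem.EmbedMolecule(mol_3d, randomSeed=42) == -1:
        raise ValueError("3D embedding failed for this molecule")
    # Relax the embedded geometry with the MMFF94 force field
    AllChem.MMFFOptimizeMolecule(mol_3d)
    return mol_3d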


# Visualize a molecule in 3D
def visualize_3d(mol, name):
    mol_3d = generate_3d_conformer(mol)
    mb = Chem.MolToMolBlock(mol_3d)

    view = py3Dmol.view(width=600, height=400)
    view.addModel(mb, "mol")
    view.setStyle({"stick": {}, "sphere": {"scale": 0.3}})
    view.zoomTo()

    print(f"3D Structure: {name}")
    return view.show()


# Visualize aspirin
aspirin_idx = df[df["name"] == "Aspirin"].index[0]
visualize_3d(df.loc[aspirin_idx, "mol"], "Aspirin")

## 7. Molecular Fingerprints and Similarity

In [None]:
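# Generate Morgan (circular) fingerprints as 2048-bit vectors.
# rdFingerprintGenerator replaces the deprecated AllChem.GetMorganFingerprintAsBitVect.
from rdkit.Chem import rdFingerprintGenerator

morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
df["fingerprint"] = df["mol"].apply(morgan_gen.GetFingerprint)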

# Calculate pairwise Tanimoto similarity, one row at a time
n_mols = len(df)
fps = df["fingerprint"].tolist()
similarity_matrix = np.zeros((n_mols, n_mols))

for i in range(n_mols):
    # BulkTanimotoSimilarity compares one fingerprint against a list in a single call
    similarity_matrix[i] = DataStructs.BulkTanimotoSimilarity(fps[i], fps)

# Create DataFrame
sim_df = pd.DataFrame(similarity_matrix, index=df["name"], columns=df["name"])

# Plot heatmap (first 12 molecules for clarity)
fig, ax = plt.subplots(figsize=(14, 12))
sns.heatmap(
    sim_df.iloc[:12, :12],
    annot=True,
    fmt=".2f",
    cmap="YlOrRd",
    square=True,
    ax=ax,
    cbar_kws={"label": "Tanimoto Similarity"},
)
ax.set_title("Molecular Similarity Matrix (First 12 Molecules)", fontsize=14, fontweight="bold")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

print("Tanimoto similarity ranges from 0 (completely different) to 1 (identical).")

## 8. Find Most Similar Molecules

In [None]:
# Find most similar pairs
similar_pairs = []
for i in range(n_mols):
    for j in range(i + 1, n_mols):
        if similarity_matrix[i, j] > 0.4:  # Threshold for similarity
            similar_pairs.append(
                {
                    "Molecule 1": df.iloc[i]["name"],
                    "Molecule 2": df.iloc[j]["name"],
                    "Similarity": similarity_matrix[i, j],
                }
            )

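# Passing explicit columns keeps sort_values from failing when no pair exceeds the threshold
similar_df = pd.DataFrame(
    similar_pairs, columns=["Molecule 1", "Molecule 2", "Similarity"]
).sort_values("Similarity", ascending=False)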

print("Most Similar Molecule Pairs (Tanimoto > 0.4):\n")
print(similar_df.head(10).to_string(index=False))

if len(similar_df) > 0:
    print(
        f"\nMost similar pair: {similar_df.iloc[0]['Molecule 1']} and {similar_df.iloc[0]['Molecule 2']}"
    )
    print(f"Similarity: {similar_df.iloc[0]['Similarity']:.3f}")

## 9. Functional Group Analysis

In [None]:
# Define SMARTS patterns for common functional groups
functional_groups = {
    "Hydroxyl": "[OH]",  # also matches the O-H of carboxylic acids
    "Carbonyl": "[CX3]=[OX1]",
    "Carboxyl": "[CX3](=O)[OX2H1]",
    "Amine": "[NX3;H2,H1;!$(NC=O)]",
    "Amide": "[NX3][CX3](=[OX1])",
    "Ester": "[#6][CX3](=O)[OX2H0][#6]",
    "Ether": "[OD2]([#6])[#6]",
    "Aromatic": "c",  # counts aromatic carbon atoms, not rings
}

# Count functional groups in each molecule
for fg_name, smarts in functional_groups.items():
    pattern = Chem.MolFromSmarts(smarts)
    df[fg_name] = df["mol"].apply(lambda m: len(m.GetSubstructMatches(pattern)))

# Show results
fg_cols = list(functional_groups.keys())
print("Functional Group Counts:\n")
print(df[["name", *fg_cols]].to_string(index=False))

# Plot most common functional groups
fg_sums = df[fg_cols].sum().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(fg_sums.index, fg_sums.values, color="teal", alpha=0.7, edgecolor="black")
ax.set_xlabel("Functional Group", fontsize=12)
ax.set_ylabel("Total Count", fontsize=12)
ax.set_title("Functional Group Distribution Across All Molecules", fontsize=14, fontweight="bold")
ax.grid(True, alpha=0.3, axis="y")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

## 10. Property Correlations

In [None]:
# Analyze correlations between properties
prop_cols = [
    "molecular_weight",
    "logP",
    "num_h_donors",
    "num_h_acceptors",
    "num_rotatable_bonds",
    "num_aromatic_rings",
    "tpsa",
]

corr_matrix = df[prop_cols].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    center=0,
    square=True,
    ax=ax,
    cbar_kws={"label": "Correlation"},
)
ax.set_title("Molecular Property Correlations", fontsize=14, fontweight="bold")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

print("\nStrongest correlations:")
corr_pairs = []
for i in range(len(prop_cols)):
    for j in range(i + 1, len(prop_cols)):
        corr_pairs.append(
            {
                "Property 1": prop_cols[i],
                "Property 2": prop_cols[j],
                "Correlation": corr_matrix.iloc[i, j],
            }
        )

corr_df = pd.DataFrame(corr_pairs).sort_values("Correlation", key=abs, ascending=False)
print(corr_df.head(5).to_string(index=False))

## 11. Summary Statistics

In [None]:
summary = {
    "Total Molecules": len(df),
    "Categories": df["category"].nunique(),
    "Avg Molecular Weight": f"{df['molecular_weight'].mean():.1f} g/mol",
    "Avg LogP": f"{df['logP'].mean():.2f}",
    "Drug-like (Lipinski)": f"{(df['lipinski_violations'] <= 1).sum()}/{len(df)}",
    "Avg Aromatic Rings": f"{df['num_aromatic_rings'].mean():.1f}",
    "Most Common FG": fg_sums.index[0],
    "Avg Similarity": f"{similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)].mean():.3f}",
}

print("=" * 60)
print("MOLECULAR ANALYSIS SUMMARY")
print("=" * 60)
for key, value in summary.items():
    print(f"{key:.<40} {value}")
print("=" * 60)

print("\n✓ Analysis complete!")
print("\nSteps completed:")
print("  1. Calculated key physicochemical properties")
print("  2. Evaluated drug-likeness using Lipinski's Rule")
print("  3. Generated 3D conformers for visualization")
print("  4. Analyzed structural similarity using fingerprints")
print("  5. Identified functional group patterns")

## Key Concepts Learned

### Molecular Representation
- **SMILES**: Text-based molecular notation
- **2D structures**: Bond connectivity and atom types
- **3D conformers**: Three-dimensional geometry
- **Molecular graphs**: Atoms as nodes, bonds as edges
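
As a small illustration of the graph view, the sketch below parses ethanol (`CCO`, chosen here only as an example and not necessarily part of the dataset) with RDKit and lists its heavy atoms as nodes and its bonds as edges. Hydrogens are implicit in this representation, so they do not appear as nodes. The variable names are illustrative.

```python
from rdkit import Chem

# Ethanol is an illustrative choice, not necessarily part of the dataset
mol = Chem.MolFromSmiles("CCO")

# Nodes: one element symbol per heavy atom
nodes = [atom.GetSymbol() for atom in mol.GetAtoms()]

# Edges: pairs of atom indices joined by a bond
edges = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()) for bond in mol.GetBonds()]

# Adjacency list built from the edge list
adjacency = {i: [] for i in range(len(nodes))}
for a, b in edges:
    adjacency[a].append(b)
    adjacency[b].append(a)

print(nodes)
print(edges)
print(adjacency)
```

The central carbon ends up with two neighbors (the other carbon and the oxygen), which is exactly the connectivity the SMILES string encodes.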

### Chemical Properties
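- **Molecular weight**: Sum of the atomic masses of all atoms, in g/mol
- **LogP**: Logarithm of the octanol-water partition coefficient; higher values mean greater lipophilicity
- **H-bond donors/acceptors**: Counts of groups that can donate (for example N-H and O-H) or accept (for example N and O atoms) hydrogen bonds
- **TPSA**: Topological polar surface area, an estimate of the surface contributed by polar atoms
- **Rotatable bonds**: Number of freely rotating single bonds, a proxy for molecular flexibility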
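
To make the molecular-weight definition concrete, this sketch sums atomic masses for a molecular formula given as element counts. The `ATOMIC_MASS` table and the `molecular_weight` helper are illustrative names, not part of RDKit, and the masses are rounded standard values, so results may differ from `Descriptors.MolWt` in the last decimal places.

```python
# Rounded standard atomic masses in g/mol (only the elements needed here)
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula):
    """Sum atomic masses for a formula given as {element: count}."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())


# Aspirin, C9H8O4: 9 * 12.011 + 8 * 1.008 + 4 * 15.999
aspirin = {"C": 9, "H": 8, "O": 4}
print(f"{molecular_weight(aspirin):.3f} g/mol")
```

With these masses aspirin comes out at about 180.16 g/mol, well under the 500 g/mol Lipinski threshold.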

### Drug-Likeness
- **Lipinski's Rule of Five**: Heuristic criteria for oral drug-likeness
  - MW ≤ 500 g/mol
  - LogP ≤ 5
  - H-bond donors ≤ 5
  - H-bond acceptors ≤ 10
- **Violations**: Most orally active drugs violate at most one criterion

### Molecular Similarity
- **Fingerprints**: Fixed-length bit vectors encoding the presence of molecular features
- **Morgan fingerprints**: Encode the circular neighborhood around each atom up to a given radius
- **Tanimoto similarity**: Shared on-bits divided by the total on-bits in either fingerprint, from 0 to 1
- **Structure-activity relationships**: Similar structures often have similar properties
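
The Tanimoto coefficient can be written out directly: for two fingerprints with on-bit sets A and B, it is `|A ∩ B| / |A ∪ B|`. The sketch below uses plain Python sets of bit indices rather than RDKit bit vectors. The `tanimoto` helper and the example bit sets are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen for this sketch
    return len(bits_a & bits_b) / len(union)


# Hypothetical fingerprints, written as the indices of their on-bits
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 42, 77, 100}

print(tanimoto(fp_a, fp_b))  # 3 shared bits / 6 bits in the union = 0.5
print(tanimoto(fp_a, fp_a))  # identical fingerprints give 1.0
```

Because the denominator counts every bit set in either fingerprint, a small molecule that is a substructure of a large one can still score low.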

### Functional Groups
- **SMARTS patterns**: Substructure search queries
- **Common groups**: Hydroxyl, carbonyl, amine, etc.
- **Chemical reactivity**: Functional groups largely determine how a molecule reacts
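
A short sketch of a SMARTS search, assuming RDKit is installed: each pattern is compiled with `Chem.MolFromSmarts` and matched against aspirin, whose SMILES is written out here so the example does not depend on the CSV file. The three patterns are the same ones used in the functional group analysis; the variable names are illustrative.

```python
from rdkit import Chem

# Aspirin, written out explicitly so the example is self-contained
aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

patterns = {
    "Carbonyl": "[CX3]=[OX1]",
    "Carboxyl": "[CX3](=O)[OX2H1]",
    "Ester": "[#6][CX3](=O)[OX2H0][#6]",
}

counts = {}
for name, smarts in patterns.items():
    query = Chem.MolFromSmarts(smarts)
    # Each match is a tuple of atom indices in the molecule
    counts[name] = len(aspirin.GetSubstructMatches(query))

print(counts)
```

Aspirin has two C=O groups, one in the acetyl ester and one in the carboxylic acid, so the carbonyl pattern matches twice while the carboxyl and ester patterns each match once.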

## Next Steps

### Real Chemical Databases
- **[PubChem](https://pubchem.ncbi.nlm.nih.gov/)**: 110+ million compounds
- **[ChEMBL](https://www.ebi.ac.uk/chembl/)**: Bioactive drug-like molecules
- **[ZINC](https://zinc.docking.org/)**: Commercially available compounds
- **[DrugBank](https://go.drugbank.com/)**: Drug and drug target database

### Advanced Methods
- Quantitative structure-activity relationships (QSAR)
- Molecular docking and binding
- Machine learning for property prediction
- Quantum chemistry calculations

## Resources

- **[RDKit Documentation](https://www.rdkit.org/docs/)**
- **[Daylight SMILES](https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html)**
- **Textbook**: *Molecular Modelling: Principles and Applications* by Andrew R. Leach