# Molecular Property Analysis with AWS

**Tier 2 Project: Chemistry - Molecular Analysis**

This notebook demonstrates cloud-based molecular property analysis using AWS services:
- **S3**: Store molecular structures
- **Lambda**: Calculate molecular properties
- **DynamoDB**: Store and query results

---

## Workflow

1. **Connect to AWS** - Configure boto3 clients
2. **Query DynamoDB** - Fetch molecular properties
3. **Analyze Properties** - Statistical analysis and distributions
4. **Filter Drug-Like** - Apply Lipinski's Rule of Five
5. **Visualize** - Create plots and chemical space maps
6. **Export Results** - Save hit lists and figures

In [None]:
# Import required libraries
import json
import warnings
from decimal import Decimal

import boto3
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

print("Libraries imported successfully")

## 1. Connect to AWS Services

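Create boto3 handles for DynamoDB and S3. boto3 reads credentials from its standard credential chain (environment variables, the shared credentials file, or an attached IAM role), so none are set in the notebook itself.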

In [None]:
# AWS Configuration
AWS_REGION = "us-east-1"
BUCKET_NAME = "molecular-data-xxxx"  # Replace with your bucket name
TABLE_NAME = "MolecularProperties"

# Initialize AWS clients
dynamodb = boto3.resource("dynamodb", region_name=AWS_REGION)
s3 = boto3.client("s3", region_name=AWS_REGION)
table = dynamodb.Table(TABLE_NAME)

# boto3 handles are lazy: no request is sent until the first API call
print(f"Using DynamoDB table: {TABLE_NAME}")
print(f"Using S3 bucket: {BUCKET_NAME}")

## 2. Query Molecular Properties from DynamoDB

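Retrieve the molecular property records that the Lambda function wrote to DynamoDB. A Scan reads the table one page at a time, so the helper follows `LastEvaluatedKey` until every page has been fetched.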

In [None]:
def scan_table(table, limit=None):
    """Scan a DynamoDB table and return its items (at most `limit` if given)."""
    items = []
    scan_kwargs = {}

    while True:
        if limit:
            # Ask only for the items still needed
            scan_kwargs["Limit"] = limit - len(items)

        response = table.scan(**scan_kwargs)
        items.extend(response["Items"])

        # Stop once the limit is reached or there are no more pages
        if limit and len(items) >= limit:
            return items[:limit]
        if "LastEvaluatedKey" not in response:
            return items

        # Handle pagination: resume from where the previous page ended
        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]


# Fetch all molecular properties
print("Querying DynamoDB...")
molecules = scan_table(table)
print(f"Retrieved {len(molecules)} molecules")
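
The pagination loop above can also be written as a generator that stops as soon as enough items have been read. The following is a minimal sketch, not part of the project code: `iter_scan` and `FakeTable` are hypothetical names, and `FakeTable` is an in-memory stand-in that mimics only the `Items` / `LastEvaluatedKey` shape of a Scan response so the example runs without AWS access.

```python
from itertools import islice


def iter_scan(table, limit=None, **scan_kwargs):
    """Lazily yield items from a paginated scan, stopping after `limit` items."""

    def pages():
        kwargs = dict(scan_kwargs)
        while True:
            response = table.scan(**kwargs)
            yield from response.get("Items", [])
            last_key = response.get("LastEvaluatedKey")
            if last_key is None:
                return
            kwargs["ExclusiveStartKey"] = last_key

    # islice(..., None) yields everything; otherwise it stops after `limit` items
    return islice(pages(), limit)


class FakeTable:
    """Hypothetical stand-in for a DynamoDB table, serving 2 items per page.

    A real LastEvaluatedKey is a dict of key attributes; an integer offset
    is used here only to keep the sketch short.
    """

    def __init__(self, items, page_size=2):
        self.items = items
        self.page_size = page_size

    def scan(self, ExclusiveStartKey=None, **_):
        start = ExclusiveStartKey or 0
        end = start + self.page_size
        response = {"Items": self.items[start:end]}
        if end < len(self.items):
            response["LastEvaluatedKey"] = end
        return response


fake = FakeTable([{"name": f"mol-{i}"} for i in range(5)])
all_items = list(iter_scan(fake))
first_three = list(iter_scan(fake, limit=3))
print(len(all_items), len(first_three))  # prints: 5 3
```

With a real table, `iter_scan(table, limit=100)` would request further pages only while fewer than 100 items have been consumed.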

## 3. Convert to Pandas DataFrame

boto3 returns DynamoDB numbers as `Decimal`, so convert them to floats before loading the items into a pandas DataFrame for analysis.

In [None]:
def dynamo_to_dataframe(items):
    """Convert DynamoDB items to pandas DataFrame."""
    # Convert Decimal to float
    cleaned_items = []
    for item in items:
        cleaned = {}
        for key, value in item.items():
            if isinstance(value, Decimal):
                cleaned[key] = float(value)
            else:
                cleaned[key] = value
        cleaned_items.append(cleaned)

    df = pd.DataFrame(cleaned_items)

    # Ensure numeric columns
    numeric_cols = [
        "molecular_weight",
        "logp",
        "tpsa",
        "hbd",
        "hba",
        "rotatable_bonds",
        "aromatic_rings",
    ]
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    return df


# Convert to DataFrame
df = dynamo_to_dataframe(molecules)
print(f"DataFrame shape: {df.shape}")
df.head()
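
`dynamo_to_dataframe` converts only top-level `Decimal` values. If items ever carry nested maps or lists, a recursive converter is one option. The sketch below is an assumption about how that could look: `to_native` is a hypothetical helper, the nested `descriptors` field is invented for illustration, and unlike the function above it keeps whole numbers as `int` rather than `float`.

```python
from decimal import Decimal


def to_native(value):
    """Recursively replace Decimal values with int or float."""
    if isinstance(value, Decimal):
        # Whole numbers (e.g. H-bond counts) stay int; everything else becomes float
        if value.is_finite() and value == value.to_integral_value():
            return int(value)
        return float(value)
    if isinstance(value, dict):
        return {key: to_native(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(val) for val in value]
    return value


# Illustrative item; the nested "descriptors" map is a made-up field
item = {
    "name": "aspirin",
    "molecular_weight": Decimal("180.16"),
    "hbd": Decimal("1"),
    "descriptors": {"ring_sizes": [Decimal("6")]},
}
converted = to_native(item)
print(converted)
```

The result can be passed to `pd.DataFrame` in the same way as the `cleaned_items` list.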

## 4. Data Summary and Statistics

Calculate summary statistics for molecular properties.

In [None]:
print("=" * 70)
print("Molecular Property Summary")
print("=" * 70)

print(f"\nTotal molecules: {len(df)}")

# Lipinski compliance
if "lipinski_compliant" in df.columns:
    lipinski_count = df["lipinski_compliant"].sum()
    lipinski_pct = (lipinski_count / len(df)) * 100
    print(f"Lipinski compliant: {lipinski_count} ({lipinski_pct:.1f}%)")

# Compound classes
if "compound_class" in df.columns:
    print("\nCompound classes:")
    print(df["compound_class"].value_counts())

# Property statistics
print("\nProperty Statistics:")
stats_cols = ["molecular_weight", "logp", "tpsa", "hbd", "hba"]
stats_cols = [col for col in stats_cols if col in df.columns]
print(df[stats_cols].describe())

## 5. Property Distributions

Visualize distributions of key molecular properties.

In [None]:
# Create subplots for property distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Molecular Weight
if "molecular_weight" in df.columns:
    axes[0, 0].hist(df["molecular_weight"], bins=30, color="skyblue", edgecolor="black")
    axes[0, 0].axvline(500, color="red", linestyle="--", label="Lipinski limit (500)")
    axes[0, 0].set_xlabel("Molecular Weight (Da)")
    axes[0, 0].set_ylabel("Frequency")
    axes[0, 0].set_title("Molecular Weight Distribution")
    axes[0, 0].legend()

# LogP
if "logp" in df.columns:
    axes[0, 1].hist(df["logp"], bins=30, color="lightcoral", edgecolor="black")
    axes[0, 1].axvline(5, color="red", linestyle="--", label="Lipinski limit (5)")
    axes[0, 1].set_xlabel("LogP")
    axes[0, 1].set_ylabel("Frequency")
    axes[0, 1].set_title("LogP Distribution")
    axes[0, 1].legend()

# TPSA
if "tpsa" in df.columns:
    axes[0, 2].hist(df["tpsa"], bins=30, color="lightgreen", edgecolor="black")
    axes[0, 2].set_xlabel("TPSA (Ų)")
    axes[0, 2].set_ylabel("Frequency")
    axes[0, 2].set_title("TPSA Distribution")

# Hydrogen Bond Donors
if "hbd" in df.columns:
    hbd_counts = df["hbd"].value_counts().sort_index()
    axes[1, 0].bar(hbd_counts.index, hbd_counts.values, color="gold", edgecolor="black")
    axes[1, 0].axvline(5, color="red", linestyle="--", label="Lipinski limit (5)")
    axes[1, 0].set_xlabel("H-Bond Donors")
    axes[1, 0].set_ylabel("Frequency")
    axes[1, 0].set_title("H-Bond Donors Distribution")
    axes[1, 0].legend()

# Hydrogen Bond Acceptors
if "hba" in df.columns:
    hba_counts = df["hba"].value_counts().sort_index()
    axes[1, 1].bar(hba_counts.index, hba_counts.values, color="orange", edgecolor="black")
    axes[1, 1].axvline(10, color="red", linestyle="--", label="Lipinski limit (10)")
    axes[1, 1].set_xlabel("H-Bond Acceptors")
    axes[1, 1].set_ylabel("Frequency")
    axes[1, 1].set_title("H-Bond Acceptors Distribution")
    axes[1, 1].legend()

# Lipinski Compliance
if "lipinski_compliant" in df.columns:
    lipinski_counts = df["lipinski_compliant"].value_counts()
    axes[1, 2].pie(
        lipinski_counts.values,
        labels=["Non-compliant", "Compliant"],
        autopct="%1.1f%%",
        colors=["lightcoral", "lightgreen"],
    )
    axes[1, 2].set_title("Lipinski Compliance")

plt.tight_layout()
plt.savefig("molecular_property_distributions.png", dpi=300, bbox_inches="tight")
plt.show()

print("Figure saved: molecular_property_distributions.png")

## 6. Chemical Space Visualization

Plot LogP vs Molecular Weight (classic chemical space plot).

In [None]:
plt.figure(figsize=(12, 8))

# Color by Lipinski compliance
if "lipinski_compliant" in df.columns:
    colors = df["lipinski_compliant"].map({True: "green", False: "red"})
    plt.scatter(df["molecular_weight"], df["logp"], c=colors, alpha=0.6, s=50)
else:
    plt.scatter(df["molecular_weight"], df["logp"], alpha=0.6, s=50)

# Add Lipinski limits
plt.axvline(500, color="red", linestyle="--", label="MW limit (500)", linewidth=2)
plt.axhline(5, color="red", linestyle="--", label="LogP limit (5)", linewidth=2)

# Lipinski Rule of Five box: MW <= 500 and LogP <= 5
# (the lower LogP bound of -5 is only a display limit)
plt.fill_between([0, 500], -5, 5, alpha=0.1, color="green")

plt.xlabel("Molecular Weight (Da)", fontsize=12)
plt.ylabel("LogP", fontsize=12)
plt.title("Chemical Space: Molecular Weight vs LogP", fontsize=14, fontweight="bold")
plt.legend()
plt.grid(True, alpha=0.3)

plt.savefig("chemical_space_plot.png", dpi=300, bbox_inches="tight")
plt.show()

print("Figure saved: chemical_space_plot.png")

## 7. Filter Drug-Like Molecules

Apply Lipinski's Rule of Five to identify drug-like candidates.

In [None]:
# Filter for Lipinski-compliant molecules
if "lipinski_compliant" in df.columns:
    drug_like = df[df["lipinski_compliant"]].copy()
else:
    # Manual Lipinski filter (strict form: all four criteria must hold)
    drug_like = df[
        (df["molecular_weight"] <= 500) & (df["logp"] <= 5) & (df["hbd"] <= 5) & (df["hba"] <= 10)
    ].copy()

print(f"Total molecules: {len(df)}")
print(f"Drug-like molecules: {len(drug_like)} ({len(drug_like) / len(df) * 100:.1f}%)")

# Show top drug-like candidates
print("\nTop 10 drug-like candidates:")
cols_to_show = ["name", "molecular_weight", "logp", "tpsa", "hbd", "hba"]
cols_to_show = [col for col in cols_to_show if col in drug_like.columns]
print(drug_like[cols_to_show].head(10).to_string(index=False))
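
The manual filter above requires all four criteria to hold. The Rule of Five is often applied more leniently, tolerating at most one violation. The sketch below counts violations per molecule so either threshold can be used. The function names and the property values are hypothetical and chosen only to illustrate the counting.

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Count how many of the four Rule of Five criteria a molecule breaks."""
    return sum([mw > 500, logp > 5, hbd > 5, hba > 10])


def is_drug_like(mw, logp, hbd, hba, max_violations=1):
    """Return True if the molecule breaks at most `max_violations` criteria."""
    return lipinski_violations(mw, logp, hbd, hba) <= max_violations


# Hypothetical (mw, logp, hbd, hba) tuples, not real compounds
examples = {
    "no violations": (350.0, 2.1, 2, 5),
    "one violation (MW)": (520.0, 3.0, 2, 6),
    "four violations": (720.0, 6.2, 6, 12),
}
for label, props in examples.items():
    print(label, lipinski_violations(*props), is_drug_like(*props))
```

Passing `max_violations=0` reproduces the strict behavior of the manual filter in the cell above.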

## 8. Compare Compound Classes

Compare properties across different compound classes.

In [None]:
if "compound_class" in df.columns:
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Molecular Weight by Class
    df.boxplot(column="molecular_weight", by="compound_class", ax=axes[0])
    axes[0].set_title("Molecular Weight by Class")
    axes[0].set_xlabel("Compound Class")
    axes[0].set_ylabel("Molecular Weight (Da)")

    # LogP by Class
    df.boxplot(column="logp", by="compound_class", ax=axes[1])
    axes[1].set_title("LogP by Class")
    axes[1].set_xlabel("Compound Class")
    axes[1].set_ylabel("LogP")

    # TPSA by Class
    df.boxplot(column="tpsa", by="compound_class", ax=axes[2])
    axes[2].set_title("TPSA by Class")
    axes[2].set_xlabel("Compound Class")
    axes[2].set_ylabel("TPSA (Ų)")

    plt.suptitle("")
    plt.tight_layout()
    plt.savefig("compound_class_comparison.png", dpi=300, bbox_inches="tight")
    plt.show()

    print("Figure saved: compound_class_comparison.png")

## 9. Property Correlations

Analyze correlations between molecular properties.

In [None]:
# Select numeric columns for correlation
corr_cols = ["molecular_weight", "logp", "tpsa", "hbd", "hba", "rotatable_bonds"]
corr_cols = [col for col in corr_cols if col in df.columns]

if len(corr_cols) >= 2:
    correlation_matrix = df[corr_cols].corr()

    plt.figure(figsize=(10, 8))
    sns.heatmap(
        correlation_matrix,
        annot=True,
        fmt=".2f",
        cmap="coolwarm",
        center=0,
        square=True,
        linewidths=1,
    )
    plt.title("Molecular Property Correlations", fontsize=14, fontweight="bold")
    plt.tight_layout()
    plt.savefig("property_correlations.png", dpi=300, bbox_inches="tight")
    plt.show()

    print("Figure saved: property_correlations.png")
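
`DataFrame.corr()` computes the Pearson correlation coefficient by default. As a reminder of what each heatmap cell represents, here is a minimal pure-Python sketch of that coefficient. The `pearson` helper and the toy data are illustrative only and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: a perfectly increasing and a perfectly decreasing relationship
xs = [1, 2, 3, 4, 5]
r_up = pearson(xs, [2, 4, 6, 8, 10])
r_down = pearson(xs, [10, 8, 6, 4, 2])
print(r_up, r_down)  # prints: 1.0 -1.0
```

Values near +1 or -1 indicate a strong linear relationship between two properties, while values near 0 indicate little linear relationship.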

## 10. Export Results

Save filtered results and hit lists.

In [None]:
# Export drug-like molecules
drug_like.to_csv("drug_like_molecules.csv", index=False)
print(f"Exported {len(drug_like)} drug-like molecules to drug_like_molecules.csv")

# Export all molecules
df.to_csv("all_molecules.csv", index=False)
print(f"Exported {len(df)} molecules to all_molecules.csv")

# Create summary report
summary = {
    "total_molecules": len(df),
    "drug_like_molecules": len(drug_like),
    "lipinski_compliance_rate": f"{len(drug_like) / len(df) * 100:.1f}%",
    "property_statistics": df[corr_cols].describe().to_dict(),
}

with open("analysis_summary.json", "w") as f:
    json.dump(summary, f, indent=2)

print("Exported analysis summary to analysis_summary.json")
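
One caveat with the summary export: `describe()` can produce `NaN` (for example, the standard deviation of a single row), and `json.dump` writes it as the bare token `NaN`, which strict JSON parsers reject. The sketch below shows one possible way to replace non-finite floats with `null` before writing. `json_safe` is a hypothetical helper and the example dictionary is made up.

```python
import json
import math


def json_safe(value):
    """Recursively replace NaN and infinite floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {str(key): json_safe(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe(val) for val in value]
    return value


# Made-up summary with a NaN standard deviation
example_summary = {
    "total_molecules": 1,
    "property_statistics": {"logp": {"mean": 2.1, "std": float("nan")}},
}
# allow_nan=False makes json raise if any NaN slipped through
encoded = json.dumps(json_safe(example_summary), allow_nan=False)
print(encoded)
```

Applied to this notebook, that would mean calling `json.dump(json_safe(summary), f, indent=2)`.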

## 11. Key Findings

Summarize key findings from the analysis.

In [None]:
print("=" * 70)
print("KEY FINDINGS")
print("=" * 70)

print("\n1. Dataset Overview:")
print(f"   - Total molecules analyzed: {len(df)}")
print(f"   - Drug-like candidates: {len(drug_like)} ({len(drug_like) / len(df) * 100:.1f}%)")

if "compound_class" in df.columns:
    print("\n2. Compound Classes:")
    for class_name, count in df["compound_class"].value_counts().items():
        print(f"   - {class_name}: {count} molecules")

print("\n3. Molecular Properties (Mean ± Std):")
print(
    f"   - Molecular Weight: {df['molecular_weight'].mean():.1f} ± {df['molecular_weight'].std():.1f} Da"
)
print(f"   - LogP: {df['logp'].mean():.2f} ± {df['logp'].std():.2f}")
print(f"   - TPSA: {df['tpsa'].mean():.1f} ± {df['tpsa'].std():.1f} Ų")

print("\n4. Drug-Likeness:")
print(f"   - Molecules within MW limit (≤500): {(df['molecular_weight'] <= 500).sum()}")
print(f"   - Molecules within LogP limit (≤5): {(df['logp'] <= 5).sum()}")
print(f"   - Fully Lipinski compliant: {len(drug_like)}")

print("\n5. Recommendations:")
print(f"   - {len(drug_like)} molecules meet drug-likeness criteria")
print("   - Top candidates saved to drug_like_molecules.csv")
print("   - Visualizations saved to PNG files")

print(f"\n{'=' * 70}")

## Conclusion

This notebook demonstrated:

1. **Cloud-based molecular analysis** - Leveraging AWS for scalable processing
2. **Property calculation** - Using Lambda for serverless computation
3. **NoSQL database** - Storing and querying results in DynamoDB
4. **Drug-likeness filtering** - Applying Lipinski's Rule of Five
5. **Data visualization** - Creating publication-quality figures

### Next Steps

- **Scale up**: Process larger molecular libraries (100K-1M compounds)
- **Add features**: Calculate more descriptors (MACCS keys, Morgan fingerprints)
- **Machine learning**: Build predictive models for bioactivity
- **Production**: Move to Tier 3 with CloudFormation infrastructure

### References

- Lipinski's Rule of Five: [Lipinski et al., 2001](https://doi.org/10.1016/S0169-409X(00)00129-0)
- RDKit Documentation: https://www.rdkit.org/docs/
- AWS Lambda: https://aws.amazon.com/lambda/

---

**Research Jumpstart - Tier 2 Project**

Last updated: 2025-11-14