# OEPandas - Getting Started

This notebook demonstrates the basic usage of OEPandas for working with molecular data in Pandas DataFrames.

## Prerequisites

```bash
pip install oepandas
```

**Note:** You'll need an OpenEye Toolkits license. Free academic licenses are available at [OpenEye Scientific](https://www.eyesopen.com/academic-licensing).

## 1. Basic Setup and Importing

In [None]:
import oepandas as oepd
import pandas as pd
from openeye import oechem
import numpy as np

## 2. Reading Molecular Data

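OEPandas provides readers for common chemical file formats. The cell below instead builds a small DataFrame from SMILES strings in memory and converts the `SMILES` column to molecule objects: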

In [None]:
# Create sample data from SMILES
sample_data = [
    {"SMILES": "CC(=O)Oc1ccccc1C(=O)O", "Name": "Aspirin", "MW": 180.16},
    {"SMILES": "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "Name": "Ibuprofen", "MW": 206.28},
    {"SMILES": "CC(=O)Nc1ccc(cc1)O", "Name": "Acetaminophen", "MW": 151.16},
    {"SMILES": "Cn1cnc2c1c(=O)n(c(=O)n2C)C", "Name": "Caffeine", "MW": 194.19}
]

# Create a DataFrame
df = pd.DataFrame(sample_data)

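# Convert the SMILES strings to molecule objects.
# read_molecule_csv() expects a CSV file path rather than a DataFrame, so the
# as_molecule accessor is used instead (assumed available in your OEPandas version).
df = df.as_molecule("SMILES")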

print(f"Created DataFrame with {len(df)} molecules")
df.head()

## 3. Working with Molecules

Once molecules are in a DataFrame, standard pandas operations such as boolean filtering work on the other columns as usual:

In [None]:
# Standard pandas filtering
heavy_molecules = df[df.MW > 180]
print(f"Molecules with MW > 180: {len(heavy_molecules)}")
heavy_molecules

## 4. Molecular Calculations

Apply OpenEye functions directly to molecule columns:

In [None]:
# Calculate molecular properties
df["NumAtoms"] = df.SMILES.apply(lambda mol: mol.NumAtoms() if mol is not None else None)
df["NumBonds"] = df.SMILES.apply(lambda mol: mol.NumBonds() if mol is not None else None)
df["NumOxygens"] = df.SMILES.apply(lambda mol: oechem.OECount(mol, oechem.OEIsOxygen()) if mol is not None else None)

df[["Name", "NumAtoms", "NumBonds", "NumOxygens"]]
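
The three `apply` calls above repeat the same `None` check. One way to factor it out is a small wrapper, sketched below. `none_safe` and `FakeMol` are hypothetical names introduced for illustration (neither is part of OEPandas or OpenEye); the stand-in class lets the sketch run without an OpenEye license.

```python
import math
from functools import wraps


def none_safe(func, default=None):
    """Wrap func so that missing inputs (None or float NaN) return default."""
    @wraps(func)
    def wrapper(value):
        if value is None:
            return default
        if isinstance(value, float) and math.isnan(value):
            return default
        return func(value)
    return wrapper


class FakeMol:
    """Hypothetical stand-in for an OpenEye molecule, for illustration only."""

    def __init__(self, num_atoms):
        self._num_atoms = num_atoms

    def NumAtoms(self):
        return self._num_atoms


# Wrap a per-molecule function once, then reuse it on any input
count_atoms = none_safe(lambda mol: mol.NumAtoms())
results = [count_atoms(m) for m in (FakeMol(13), None, float("nan"))]
print(results)
```

With real molecules, the same wrapper could be passed to `apply`, for example `df.SMILES.apply(none_safe(lambda mol: mol.NumAtoms()))`.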

## 5. Molecular Accessors

OEPandas provides convenient accessors for common operations:

In [None]:
# Generate canonical SMILES
canonical = df.SMILES.to_smiles()
print("Canonical SMILES:")
for name, smiles in zip(df.Name, canonical):
    print(f"{name}: {smiles}")

## 6. Data Manipulation

Pandas operations work seamlessly with molecular data:

In [None]:
# Sorting
sorted_df = df.sort_values("MW", ascending=False)
print("\nMolecules sorted by molecular weight:")
sorted_df[["Name", "MW"]]

In [None]:
# Grouping and aggregation
# Flag molecules that contain at least one ring atom
df["HasRing"] = df.SMILES.apply(lambda mol: oechem.OECount(mol, oechem.OEAtomIsInRing()) > 0 if mol is not None else False)

ring_stats = df.groupby("HasRing").agg({
    "MW": ["mean", "min", "max"],
    "NumAtoms": "mean"
})

print("\nStatistics by ring presence:")
ring_stats
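
Passing a dictionary that mixes lists and strings to `agg`, as above, produces two-level column labels such as `("MW", "mean")`. If flat column names are easier to work with, pandas named aggregation is one alternative. The sketch below uses plain pandas with the sample MW values; the `HasRing` values are typed in by hand (all four sample molecules contain a ring) so that it runs without OpenEye, and the output column names are arbitrary choices.

```python
import pandas as pd

# Stand-in for the DataFrame built earlier; HasRing is entered by hand here
df = pd.DataFrame({
    "Name": ["Aspirin", "Ibuprofen", "Acetaminophen", "Caffeine"],
    "MW": [180.16, 206.28, 151.16, 194.19],
    "HasRing": [True, True, True, True],
})

# Named aggregation: output_column=(input_column, aggregation function)
flat_stats = df.groupby("HasRing").agg(
    mw_mean=("MW", "mean"),
    mw_min=("MW", "min"),
    mw_max=("MW", "max"),
    n_molecules=("Name", "count"),
)

print(flat_stats)
```

Because every sample molecule contains a ring, the result has a single group; a larger dataset with acyclic molecules would produce two rows.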

## 7. Copying Molecules

Create deep copies of molecular data when needed:

In [None]:
# Deep copy molecules
df["SMILES_copy"] = df.SMILES.copy_molecules()

print(f"\nOriginal and copied molecules are different objects: {df.SMILES[0] is not df.SMILES_copy[0]}")
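
The reason deep copies matter is that molecule objects are mutable: a copy of the container alone still points at the same objects, so an edit made through one shows up in the other. The sketch below illustrates the difference with the standard `copy` module and a hypothetical `FakeMol` stand-in class (not an OpenEye type), so it runs without a license.

```python
import copy


class FakeMol:
    """Hypothetical stand-in for a mutable molecule object."""

    def __init__(self, title):
        self.title = title


originals = [FakeMol("Aspirin"), FakeMol("Caffeine")]

shallow = copy.copy(originals)    # new list, same underlying objects
deep = copy.deepcopy(originals)   # new list, independent objects

# Mutate one of the original objects
originals[0].title = "Renamed"

print(shallow[0].title)  # reflects the change, since the object is shared
print(deep[0].title)     # unaffected, since the object was duplicated
```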

## 8. Handling Missing Values

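Molecule columns work with the usual pandas missing-value checks (`isna()`, `notnull()`), so rows without a molecule can be counted and filtered: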

In [None]:
# Add a row with missing molecule
df_with_nan = pd.concat([df, pd.DataFrame([{"Name": "Unknown", "MW": np.nan}])], ignore_index=True)

print(f"\nDataFrame has {df_with_nan.SMILES.isna().sum()} missing molecules")

# Filter out missing values
df_clean = df_with_nan[df_with_nan.SMILES.notnull()]
print(f"After filtering: {len(df_clean)} molecules remain")
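
As a plain-pandas point of comparison, filtering with a `notnull()` mask and calling `dropna(subset=...)` select the same rows. The sketch below uses an ordinary string column as a stand-in for the molecule column, so it runs without OpenEye; whether `dropna` behaves identically on an OEPandas molecule column is an assumption worth checking in your environment.

```python
import pandas as pd

# String SMILES stand in for molecule objects in this sketch
df = pd.DataFrame({
    "SMILES": ["CC(=O)Nc1ccc(cc1)O"],
    "Name": ["Acetaminophen"],
    "MW": [151.16],
})

# A row with no SMILES entry; concat fills the missing column with NaN
extra = pd.DataFrame([{"Name": "Unknown"}])
combined = pd.concat([df, extra], ignore_index=True)

n_missing = int(combined["SMILES"].isna().sum())

# Two equivalent ways to keep only rows that have a SMILES value
by_mask = combined[combined["SMILES"].notnull()]
by_dropna = combined.dropna(subset=["SMILES"])

print(n_missing, len(by_mask), len(by_dropna))
```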

## 9. Substructure Searching

Perform SMARTS-based substructure searches:

In [None]:
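# Find molecules containing a carboxylic acid group.
# The [OX2H1] term requires a hydroxyl oxygen, so esters do not match
# (the bare pattern "C(=O)O" would also match esters).
carboxylic_acid_pattern = "[CX3](=O)[OX2H1]"

matches = df.SMILES.subsearch(carboxylic_acid_pattern)
print("\nMolecules containing carboxylic acid group:")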
df[matches][["Name", "MW"]]

## 10. Summary

This notebook covered:
- Converting SMILES strings into molecule columns in pandas DataFrames
- Using pandas operations (filtering, sorting, grouping) with molecular data
- Calculating molecular properties
- Using OEPandas accessors
- Copying molecules
- Handling missing values
- Substructure searching

See the advanced examples notebook for more sophisticated use cases!