# OEPandas - Getting Started

This notebook demonstrates the basic usage of OEPandas for working with molecular data in Pandas DataFrames.

## Prerequisites

```bash
pip install oepandas
```

**Note:** You'll need an OpenEye Toolkits license. Free academic licenses are available at [OpenEye Scientific](https://www.eyesopen.com/academic-licensing).

In [1]:
```python
import oepandas as oepd
import pandas as pd
from openeye import oechem
import numpy as np
```

## Creating DataFrames with Molecules

There are several ways to create DataFrames with molecular data:

### Method 1: Convert SMILES strings to molecules using `.chem.as_molecule()`

This converts a column of molecule strings to a `molecule` column holding `oechem.OEMol` objects. The default format is SMILES, but you can read any molecule format supported by the OpenEye Toolkits:

```python
DataFrame.chem.as_molecule(
    columns: str | Iterable[str],
    molecule_format: str | int | None = None,
    inplace: bool = False
) -> DataFrame
```

In [2]:
```python
# Create sample data from SMILES strings
sample_data = [
    {"SMILES": "CC(=O)Oc1ccccc1C(=O)O", "Name": "Aspirin", "MW": 180.16},
    {"SMILES": "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "Name": "Ibuprofen", "MW": 206.28},
    {"SMILES": "CC(=O)Nc1ccc(cc1)O", "Name": "Acetaminophen", "MW": 151.16},
    {"SMILES": "Cn1cnc2c1c(=O)n(c(=O)n2C)C", "Name": "Caffeine", "MW": 194.19}
]

# Create a DataFrame
df = pd.DataFrame(sample_data)

# Convert the SMILES column to molecules using the .chem accessor
df = df.chem.as_molecule("SMILES")

print(f"Created DataFrame with {len(df)} molecules")
print(f"SMILES column dtype: {df.SMILES.dtype}")
df.head()
```

Created DataFrame with 4 molecules
SMILES column dtype: molecule


|   | SMILES | Name | MW |
|---|--------|------|----|
| 0 | `<oechem.OEMol; proxy of <Swig Object of type '...` | Aspirin | 180.16 |
| 1 | `<oechem.OEMol; proxy of <Swig Object of type '...` | Ibuprofen | 206.28 |
| 2 | `<oechem.OEMol; proxy of <Swig Object of type '...` | Acetaminophen | 151.16 |
| 3 | `<oechem.OEMol; proxy of <Swig Object of type '...` | Caffeine | 194.19 |


### Method 2: Read from molecule files (SDF, SMILES, OEB)

If you use a native OEPandas reader, molecules will already be objects in the DataFrame.

```python
# Read from SDF file (includes SD data as columns)
df = oepd.read_sdf("molecules.sdf", molecule_column="Mol", title_column="Name")

# Read from SMILES file
df = oepd.read_smi("molecules.smi", molecule_column="Mol", title_column="Name")

# Read from OEB (OpenEye binary) file
df = oepd.read_oeb("molecules.oeb", molecule_column="Mol", title_column="Name")
```

### Method 3: Read CSV with molecule strings

You can also read CSV files that contain molecule strings. The reader converts the named (or auto-detected) columns to molecule objects as it loads the file.

```python
# Read CSV file and convert SMILES column to molecules
df = oepd.read_molecule_csv("molecules.csv", molecule_columns="SMILES")

# You can also auto-detect molecule columns
df = oepd.read_molecule_csv("molecules.csv", molecule_columns="detect")
```
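For orientation, here is a rough sketch of what such a CSV might look like and what plain pandas makes of it before any conversion. The file contents are made up from the sample data in this notebook, and the sketch uses only `pandas.read_csv`, so it runs without OEPandas or an OpenEye license.

```python
import io

import pandas as pd

# Hypothetical file contents, built from the sample data used in this notebook
csv_text = """SMILES,Name,MW
CC(=O)Oc1ccccc1C(=O)O,Aspirin,180.16
CC(=O)Nc1ccc(cc1)O,Acetaminophen,151.16
"""

# Plain pandas reads the SMILES column as ordinary text
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw.loc[0, "SMILES"]))  # a Python str, not a molecule object
```

This text column is the starting point that `read_molecule_csv()` or `.chem.as_molecule()` turns into molecule objects.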

## Working with Molecules

Once molecules are in a DataFrame, you can use standard Pandas operations:

In [5]:
```python
# Calculate the number of heavy atoms
df["NumHeavyAtoms"] = df.SMILES.apply(lambda mol: oechem.OECount(mol, oechem.OEIsHeavy()))

# Molecules with 14+ heavy atoms
df[df.NumHeavyAtoms >= 14]
```

|   | SMILES | Name | MW | NumHeavyAtoms |
|---|--------|------|----|---------------|
| 1 | `<oechem.OEMol; proxy of <Swig Object of type '...` | Ibuprofen | 206.28 | 15 |
| 3 | `<oechem.OEMol; proxy of <Swig Object of type '...` | Caffeine | 194.19 | 14 |
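Computed columns behave like any other pandas column, so filters can be combined. The sketch below uses only plain pandas and leaves the molecule column out so that it runs without a license. The heavy-atom counts for aspirin and acetaminophen are assumed to equal the atom counts reported later in this notebook, since SMILES input carries implicit hydrogens.

```python
import pandas as pd

# Property values copied from this notebook; molecule column omitted on purpose
props = pd.DataFrame({
    "Name": ["Aspirin", "Ibuprofen", "Acetaminophen", "Caffeine"],
    "MW": [180.16, 206.28, 151.16, 194.19],
    "NumHeavyAtoms": [13, 15, 11, 14],
})

# Combine conditions with & (and), | (or), ~ (not); each condition needs parentheses
large_but_light = props[(props.NumHeavyAtoms >= 14) & (props.MW < 200)]
print(large_but_light.Name.tolist())  # ['Caffeine']
```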


## Molecular Calculations

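Because each cell holds an `oechem.OEMol`, you can call OEChem methods and functions on it through `apply`. The `None` checks below keep rows without a molecule from raising errors.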

In [6]:
```python
# Calculate molecular properties
df["NumAtoms"] = df.SMILES.apply(lambda mol: mol.NumAtoms() if mol is not None else None)
df["NumBonds"] = df.SMILES.apply(lambda mol: mol.NumBonds() if mol is not None else None)
df["NumOxygens"] = df.SMILES.apply(lambda mol: oechem.OECount(mol, oechem.OEIsOxygen()) if mol is not None else None)

df[["Name", "NumAtoms", "NumBonds", "NumOxygens"]]
```

|   | Name | NumAtoms | NumBonds | NumOxygens |
|---|------|----------|----------|------------|
| 0 | Aspirin | 13 | 13 | 4 |
| 1 | Ibuprofen | 15 | 15 | 2 |
| 2 | Acetaminophen | 11 | 11 | 2 |
| 3 | Caffeine | 14 | 15 | 2 |
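The repeated `if mol is not None else None` guard can be factored into a small wrapper. This is only a sketch: `skip_missing` is a hypothetical helper, not part of OEPandas, and plain strings stand in for molecules so the example runs without a license.

```python
import pandas as pd


def skip_missing(func):
    """Wrap func so that missing entries (None or NaN) yield None instead of raising."""
    def wrapper(value):
        if value is None or (isinstance(value, float) and pd.isna(value)):
            return None
        return func(value)
    return wrapper


# Strings stand in for molecule objects; the middle entry is missing
column = pd.Series(["CCO", None, "c1ccccc1"], dtype=object)
lengths = column.apply(skip_missing(len))
print(lengths.isna().tolist())  # [False, True, False]
```

With a real molecule column, the same wrapper could be used as `df.SMILES.apply(skip_missing(lambda mol: mol.NumAtoms()))`.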


## Using the `.chem` Accessor

OEPandas provides convenient chemistry functions via the `.chem` accessor for both Series and DataFrames.

### DataFrame Accessors

Access these methods via `df.chem.<method>()`:

**`as_molecule()`**

Convert column(s) to MoleculeDtype.

```python
df.chem.as_molecule(
    columns,
    *,
    molecule_format=None,
    inplace=False
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `columns` | str, list | required | Column name(s) to convert |
| `molecule_format` | str, int | `None` | Format for parsing (default: SMILES) |
| `inplace` | bool | `False` | Modify DataFrame in place |

**`as_design_unit()`**

Convert column(s) to DesignUnitDtype.

```python
df.chem.as_design_unit(columns, *, inplace=False)
```

**`filter_valid()`**

Filter rows to keep only those with valid molecules.

```python
df.chem.filter_valid(columns, *, inplace=False)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `columns` | str, list | required | MoleculeDtype column(s) to check |
| `inplace` | bool | `False` | Modify DataFrame in place |

**`detect_molecule_columns()`**

Auto-detect and convert molecule columns based on predominant type.

```python
df.chem.detect_molecule_columns(*, sample_size=25)
```

**`to_sdf()`**

Write DataFrame to SDF file.

```python
df.chem.to_sdf(
    fp,
    primary_molecule_column,
    *,
    title_column=None,
    columns=None,
    index=True,
    index_tag="index",
    secondary_molecules_as="smiles",
    secondary_molecule_flavor=None,
    gzip=False
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `fp` | str, Path | required | Output file path |
| `primary_molecule_column` | str | required | Column with molecules |
| `title_column` | str | `None` | Column for titles |
| `columns` | str, list | `None` | Columns to include as SD tags (None for all) |
| `index` | bool | `True` | Include index as SD tag |
| `index_tag` | str | `"index"` | Name of index SD tag |
| `secondary_molecules_as` | str, int | `"smiles"` | Encoding for other molecule columns |
| `gzip` | bool | `False` | Gzip compress output |

**`to_smi()`**

Write DataFrame to SMILES file.

```python
df.chem.to_smi(
    fp,
    primary_molecule_column,
    *,
    flavor=None,
    molecule_format=oechem.OEFormat_SMI,
    title_column=None,
    gzip=False
)
```

**`to_molecule_csv()`**

Write DataFrame to CSV with molecules as strings.

```python
df.chem.to_molecule_csv(
    fp,
    *,
    molecule_format="smiles",
    flavor=None,
    gzip=False,
    b64encode=False,
    columns=None,
    index=True,
    sep=',',
    **kwargs
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `molecule_format` | str, int | `"smiles"` | Output format for molecules |
| `b64encode` | bool | `False` | Base64 encode molecule strings |
| `**kwargs` | | | Additional arguments passed to pandas CSV writer |
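To see why `gzip` and `b64encode` are useful together, consider that compressed bytes (or multi-line formats such as SDF) cannot sit safely in a CSV cell, whereas Base64 text can. The standard-library sketch below shows the general round trip; it illustrates the idea only and is not necessarily the exact byte layout OEPandas writes.

```python
import base64
import gzip

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, from the sample data

# Compress, then Base64 encode so the result is plain ASCII text
compressed = gzip.compress(smiles.encode("utf-8"))
encoded = base64.b64encode(compressed).decode("ascii")

# Reverse the two steps to recover the original string
decoded = gzip.decompress(base64.b64decode(encoded)).decode("utf-8")
print(decoded == smiles)  # True
```

The Base64 alphabet contains no commas, quotes, or newlines, which is what makes the encoded value safe to store as a single CSV field.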

**`to_oedb()`**

Write DataFrame to OERecord database.

```python
df.chem.to_oedb(
    fp,
    primary_molecule_column=None,
    *,
    title_column=None,
    columns=None,
    index=True,
    index_label="index",
    sample_size=25,
    safe=True
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `primary_molecule_column` | str | `None` | Molecule column (None creates OERecord, not OEMolRecord) |
| `sample_size` | int | `25` | Sample size for type detection |
| `safe` | bool | `True` | Type check before writing |

### Series Accessors

Access these methods via `series.chem.<method>()`:

**Molecule Methods**

| Method | Returns | Description |
|--------|---------|-------------|
| `copy_molecules()` | `Series[MoleculeDtype]` | Deep copy all molecules |
| `is_valid()` | `Series[bool]` | Boolean mask of valid molecules |
| `as_molecule(molecule_format=None)` | `Series[MoleculeDtype]` | Convert series to molecules |
| `to_molecule(molecule_format=None)` | `Series[MoleculeDtype]` | Convert from strings to molecules |
| `to_molecule_bytes(molecule_format=OEFormat_SMI, flavor=None, gzip=False)` | `Series[bytes]` | Convert to byte strings |
| `to_molecule_strings(molecule_format="smiles", flavor=None, gzip=False, b64encode=False)` | `Series[str]` | Convert to string representations |
| `to_smiles(flavor=OESMILESFlag_ISOMERIC)` | `Series[str]` | Convert to SMILES strings |
| `subsearch(pattern, adjustH=False)` | `Series[bool]` | Substructure search with SMARTS pattern |

**Design Unit Methods**

| Method | Returns | Description |
|--------|---------|-------------|
| `copy_design_units()` | `Series[DesignUnitDtype]` | Deep copy all design units |
| `get_ligands(clear_titles=False)` | `Series[MoleculeDtype]` | Extract ligand molecules |
| `get_proteins(clear_titles=False)` | `Series[MoleculeDtype]` | Extract protein molecules |
| `get_components(mask)` | `Series[MoleculeDtype]` | Extract components by mask |
| `as_design_unit()` | `Series[DesignUnitDtype]` | Convert series to design units |


In [7]:
```python
# Generate canonical SMILES using the .chem accessor
canonical = df.SMILES.chem.to_smiles()
print("Canonical SMILES:")
for name, smiles in zip(df.Name, canonical):
    print(f"  {name}: {smiles}")
```

Canonical SMILES:
  Aspirin: CC(=O)Oc1ccccc1C(=O)O
  Ibuprofen: CC(C)Cc1ccc(cc1)C(C)C(=O)O
  Acetaminophen: CC(=O)Nc1ccc(cc1)O
  Caffeine: Cn1cnc2c1c(=O)n(c(=O)n2C)C


## Checking Molecule Validity

OEPandas provides methods to check and filter molecule validity:

In [8]:
```python
# Create a DataFrame with some invalid molecules
test_data = [
    {"SMILES": "CCO", "Name": "Ethanol"},
    {"SMILES": "invalid_smiles", "Name": "Invalid"},  # This will fail to parse
    {"SMILES": "c1ccccc1", "Name": "Benzene"},
]
df_test = pd.DataFrame(test_data)
df_test = df_test.chem.as_molecule("SMILES")

# Check which molecules are valid using .chem.is_valid()
validity = df_test.SMILES.chem.is_valid()
print("Molecule validity:")
for name, is_valid in zip(df_test.Name, validity):
    print(f"  {name}: {'Valid' if is_valid else 'Invalid'}")
```




Molecule validity:
  Ethanol: Valid
  Invalid: Invalid
  Benzene: Valid


In [9]:
```python
# Filter to keep only valid molecules using .chem.filter_valid()
df_valid = df_test.chem.filter_valid("SMILES")
print(f"\nOriginal rows: {len(df_test)}")
print(f"After filtering: {len(df_valid)} valid molecules")
df_valid
```


Original rows: 3
After filtering: 2 valid molecules


|   | SMILES | Name |
|---|--------|------|
| 0 | `<oechem.OEMol; proxy of <Swig Object of type '...` | Ethanol |
| 2 | `<oechem.OEMol; proxy of <Swig Object of type '...` | Benzene |
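Note that the filtered result keeps the original index labels (0 and 2), as with any pandas row filter. The sketch below reproduces that behavior in plain pandas, with a hand-written boolean Series standing in for the output of `.chem.is_valid()`, and shows how to renumber the rows if you need consecutive labels.

```python
import pandas as pd

names = pd.DataFrame({"Name": ["Ethanol", "Invalid", "Benzene"]})

# Stand-in for the validity mask computed above
validity = pd.Series([True, False, True])

kept = names[validity]
print(kept.index.tolist())  # [0, 2]

# Drop the old labels and renumber from zero
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```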


## Data Manipulation

Pandas operations work seamlessly with molecular data:

In [18]:
```python
# Sorting
sorted_df = df.sort_values("MW", ascending=False)
print("Molecules sorted by molecular weight:")
sorted_df[["Name", "MW"]]
```

Molecules sorted by molecular weight:


|   | Name | MW |
|---|------|----|
| 1 | Ibuprofen | 206.28 |
| 3 | Caffeine | 194.19 |
| 0 | Aspirin | 180.16 |
| 2 | Acetaminophen | 151.16 |


In [19]:
```python
# Grouping and aggregation
df["HasRing"] = df.SMILES.apply(
    lambda mol: oechem.OECount(mol, oechem.OEAtomIsInRing()) > 0 if mol is not None else False
)

ring_stats = df.groupby("HasRing").agg({
    "MW": ["mean", "min", "max"],
    "NumAtoms": "mean"
})

print("Statistics by ring presence:")
ring_stats
```

Statistics by ring presence:


| HasRing | MW (mean) | MW (min) | MW (max) | NumAtoms (mean) |
|---------|-----------|----------|----------|-----------------|
| True | 182.9475 | 151.16 | 206.28 | 13.25 |
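The aggregation returns two-level column labels such as `("MW", "mean")`. If single-level names are easier to work with, the levels can be joined. The sketch below repeats the aggregation in plain pandas using the property values from this notebook, with the molecule column left out so it runs without a license.

```python
import pandas as pd

# Property values copied from this notebook; every sample molecule has a ring
props = pd.DataFrame({
    "MW": [180.16, 206.28, 151.16, 194.19],
    "NumAtoms": [13, 15, 11, 14],
    "HasRing": [True, True, True, True],
})

ring_stats = props.groupby("HasRing").agg({
    "MW": ["mean", "min", "max"],
    "NumAtoms": "mean",
})

# Join the two header levels into names like "MW_mean"
ring_stats.columns = ["_".join(col) for col in ring_stats.columns]
print(list(ring_stats.columns))
```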


## Copying Molecules

Create deep copies of molecular data when needed:

In [12]:
```python
# Deep copy molecules using the .chem accessor
df["SMILES_copy"] = df.SMILES.chem.copy_molecules()

print(f"Original and copied molecules are different objects: {df.SMILES[0] is not df.SMILES_copy[0]}")
```

Original and copied molecules are different objects: True
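Deep copies matter because a pandas `Series.copy()` on an object column duplicates the references, not the Python objects behind them. The sketch below demonstrates this with a tiny stand-in class (`FakeMol` is hypothetical, used only so the example runs without a license) and `copy.deepcopy` in place of `copy_molecules()`.

```python
import copy

import pandas as pd


class FakeMol:
    """Hypothetical stand-in for a mutable molecule object."""
    def __init__(self, title):
        self.title = title


original = pd.Series([FakeMol("aspirin")], dtype=object)
shallow = original.copy()                 # new Series, same underlying object
deep = original.apply(copy.deepcopy)      # new Series, independent objects

# Editing through the shallow copy also changes the original
shallow.iloc[0].title = "renamed"
print(original.iloc[0].title)  # renamed
print(deep.iloc[0].title)      # aspirin
```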


## Handling Missing Values

OEPandas handles NaN/None values gracefully:

In [20]:
```python
# Add a row with missing molecule
df_with_nan = pd.concat([df, pd.DataFrame([{"Name": "Unknown", "MW": np.nan}])], ignore_index=True)

print(f"DataFrame has {df_with_nan.SMILES.isna().sum()} missing molecules")

# Filter out missing values
df_clean = df_with_nan[df_with_nan.SMILES.notnull()]
print(f"After filtering: {len(df_clean)} molecules remain")
```

DataFrame has 1 missing molecules
After filtering: 4 molecules remain
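In plain pandas, the mask-based filter above has a shorter equivalent in `dropna(subset=...)`. The sketch below shows the two giving the same result, with strings standing in for molecules so it runs without a license. Whether `dropna` behaves identically on a `MoleculeDtype` column is an assumption, although the cell above shows `isna()` and `notnull()` working on one.

```python
import pandas as pd

known = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"], "Name": ["Ethanol", "Benzene"]})
extra = pd.DataFrame([{"Name": "Unknown"}])  # no SMILES value for this row

# Columns missing from one frame are filled with NaN when concatenating
combined = pd.concat([known, extra], ignore_index=True)
print(combined.SMILES.isna().sum())  # 1

by_mask = combined[combined.SMILES.notna()]
by_dropna = combined.dropna(subset=["SMILES"])
print(by_mask.equals(by_dropna))  # True
```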


## Substructure Searching

Perform SMARTS-based substructure searches using `.chem.subsearch()`:

In [14]:
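```python
# Find molecules containing a carboxylic acid group.
# Note: this simple SMARTS also matches esters (such as the acetyl ester in aspirin);
# a stricter pattern such as "C(=O)[OH]" requires the acid hydroxyl.
carboxylic_acid_pattern = "C(=O)O"

matches = df.SMILES.chem.subsearch(carboxylic_acid_pattern)
print("Molecules containing carboxylic acid group:")
df[matches][["Name", "MW"]]
```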

Molecules containing carboxylic acid group:


|   | Name | MW |
|---|------|----|
| 0 | Aspirin | 180.16 |
| 1 | Ibuprofen | 206.28 |


## Converting Molecules to Different Formats

The `.chem` accessor provides methods to convert molecules to various string/byte formats:

In [15]:
```python
# Convert to SMILES strings
smiles_series = df.SMILES.chem.to_smiles()
print("SMILES strings:")
print(smiles_series.tolist()[:2])

# Convert to molecule strings (various formats)
sdf_strings = df.SMILES.chem.to_molecule_strings(molecule_format="sdf")
print(f"\nSDF string (first 100 chars): {sdf_strings.iloc[0][:100]}...")
```

SMILES strings:
['CC(=O)Oc1ccccc1C(=O)O', 'CC(C)Cc1ccc(cc1)C(C)C(=O)O']

SDF string (first 100 chars): -OEChem-01192617012D

 13 13  0     0  0  0  0  0  0999 V2000
    3.4641   -2.0024    0.0000 C   0  ...
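The fourth line of that SDF string is the V2000 counts line, whose first two three-character fields hold the atom and bond counts. As a quick sanity check, the sketch below parses the counts line printed above using only the standard library (`parse_counts_line` is a hypothetical helper written for this example) and recovers the 13 atoms and 13 bonds reported for aspirin earlier.

```python
def parse_counts_line(line: str) -> tuple[int, int]:
    """Return (atom_count, bond_count) from a V2000 molfile counts line."""
    # The V2000 format uses fixed-width fields: columns 1-3 atoms, 4-6 bonds
    return int(line[0:3]), int(line[3:6])


# Counts line copied from the SDF output above
counts_line = " 13 13  0     0  0  0  0  0  0999 V2000"
num_atoms, num_bonds = parse_counts_line(counts_line)
print(num_atoms, num_bonds)  # 13 13
```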
