# Ligand Expo Takeover Documentation

## Introduction

This notebook demonstrates how to use the [RCSB PDB API](https://github.com/rcsb/py-rcsb-api) to retrieve a mapping between **chemical component IDs (ligands)** and the **PDB structure IDs** in which they appear. Chemical components include branched molecules (like saccharides), non-standard polymer residues, and non-polymer ligands (like cofactors and small molecules).

By running this notebook, users will be able to generate a file listing each chemical component along with the structures it occurs in.

### Inputs and Associated Variables

* **`ALL_STRUCTURES`**: A constant from `rcsbapi.data` representing all known PDB entries.
* **`Query(...)`**: A `DataQuery` instance (imported under the alias `Query`) that specifies which fields to retrieve for each structure (e.g., chem\_comp IDs from the different entity types).

### Outputs and Associated Variables

* **`cc-to-pdb.tsv`**: A tab-separated file with one line per chemical component: the component ID, a tab, and then a space-separated list of the PDB IDs where it appears.

### Major Steps in the Coding Process

1. **Initialize a data query** to extract ligands from branched, polymer, and non-polymer entities across all PDB structures.
2. **Process the returned JSON** to collect a mapping of `chem_comp_id` → `pdb_id(s)` using Python dictionaries.
3. **Write the results** to a `.tsv` file for downstream use or analysis.
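
Steps 2 and 3 can be tried offline on a small hand-written response before querying the whole archive. The sketch below assumes the response has the same nested shape as the fields requested in the query. `sample_response`, `build_mapping`, and `to_tsv_lines` are hypothetical names used only for illustration, and every ID in the sample is a placeholder rather than real PDB data.

```python
# Hand-written stand-in for a Data API response. All IDs are placeholders.
sample_response = {
    "data": {
        "entries": [
            {
                "rcsb_id": "XXX1",
                "polymer_entities": [{"chem_comp_nstd_monomers": None}],
                "branched_entities": None,
                "nonpolymer_entities": [
                    {"nonpolymer_comp": {"chem_comp": {"id": "AAA"}}},
                    {"nonpolymer_comp": {"chem_comp": {"id": "BBB"}}},
                ],
            },
            {
                "rcsb_id": "XXX2",
                "polymer_entities": [
                    {"chem_comp_nstd_monomers": [{"chem_comp": {"id": "CCC"}}]}
                ],
                "branched_entities": [
                    {"chem_comp_monomers": [{"chem_comp": {"id": "DDD"}}]}
                ],
                "nonpolymer_entities": [
                    {"nonpolymer_comp": {"chem_comp": {"id": "AAA"}}}
                ],
            },
        ]
    }
}


def build_mapping(response):
    """Collect chem_comp_id -> set of PDB IDs from a response-shaped dict."""
    mapping = {}
    for entry in response.get("data", {}).get("entries", []):
        pdb_id = entry.get("rcsb_id")
        # `or []` / `or {}` skip fields that are present but null (None).
        for branched in entry.get("branched_entities") or []:
            for mono in branched.get("chem_comp_monomers") or []:
                chem_id = (mono.get("chem_comp") or {}).get("id")
                if chem_id:
                    mapping.setdefault(chem_id, set()).add(pdb_id)
        for polymer in entry.get("polymer_entities") or []:
            for mono in polymer.get("chem_comp_nstd_monomers") or []:
                chem_id = (mono.get("chem_comp") or {}).get("id")
                if chem_id:
                    mapping.setdefault(chem_id, set()).add(pdb_id)
        for nonpolymer in entry.get("nonpolymer_entities") or []:
            comp = nonpolymer.get("nonpolymer_comp") or {}
            chem_id = (comp.get("chem_comp") or {}).get("id")
            if chem_id:
                mapping.setdefault(chem_id, set()).add(pdb_id)
    return mapping


def to_tsv_lines(mapping):
    """Format each component as '<chem_comp_id><TAB><pdb_id1> <pdb_id2> ...'."""
    return [
        f"{ccid}\t{' '.join(sorted(pdb_ids))}"
        for ccid, pdb_ids in sorted(mapping.items())
    ]


mapping = build_mapping(sample_response)
lines = to_tsv_lines(mapping)
for line in lines:
    print(line)
```

Keeping the mapping step separate from the network call and the file write makes it possible to check the parsing logic on a handful of entries before running the full query.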


### Questions

This notebook helps answer the following:

* In which PDB structures does a given ligand appear?
* Which ligands are most commonly used in PDB entries?
* How are ligands distributed across branched, polymer, and non-polymer entities?
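
The first two questions can be answered directly from the `chem_comp_id` → PDB ID mapping once it exists. The following is a minimal sketch on a made-up mapping; `structures_containing` and `most_common_components` are hypothetical helper names, and the IDs are placeholders.

```python
from collections import Counter

# Hypothetical mapping in the shape the notebook builds
# (chem_comp_id -> set of PDB IDs). All IDs are placeholders.
chem_comp_to_pdb_map = {
    "AAA": {"XXX1", "XXX2", "XXX3"},
    "BBB": {"XXX1"},
    "CCC": {"XXX2", "XXX3"},
}


def structures_containing(mapping, chem_id):
    """Return the sorted PDB IDs in which a component appears (empty if unknown)."""
    return sorted(mapping.get(chem_id, set()))


def most_common_components(mapping, n=10):
    """Return the n components that occur in the most structures."""
    counts = Counter({ccid: len(pdb_ids) for ccid, pdb_ids in mapping.items()})
    return counts.most_common(n)


ccc_structures = structures_containing(chem_comp_to_pdb_map, "CCC")
top_two = most_common_components(chem_comp_to_pdb_map, n=2)
print(ccc_structures)
print(top_two)
```

The third question needs the entity type to be recorded alongside each component ID, which the mapping built in this notebook does not do; it would require keeping a separate mapping per entity type.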


### Learning Objectives

By using this notebook, users will learn to:

* Construct and execute a `DataQuery` using the RCSB API.
* Parse hierarchical JSON structure data to extract nested entity information.
* Map chemical component identifiers to structure identifiers.
* Write structured output to a `.tsv` file.
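
Parsing the nested response amounts to following a dotted field path through dictionaries and lists while skipping anything that is missing or null. The helper below is one possible way to express that, not part of `rcsbapi`; `iter_path` is a hypothetical name and the sample entry uses placeholder IDs.

```python
def iter_path(node, path):
    """Yield every value reached by following `path` (a list of keys) from `node`.

    Lists are expanded element by element. Missing keys and null (None)
    values yield nothing instead of raising an error.
    """
    if node is None:
        return
    if isinstance(node, list):
        for item in node:
            yield from iter_path(item, path)
        return
    if not path:
        yield node
        return
    if isinstance(node, dict):
        yield from iter_path(node.get(path[0]), path[1:])


# Placeholder entry shaped like one element of the returned "entries" list.
entry = {
    "rcsb_id": "XXX1",
    "branched_entities": [
        {
            "chem_comp_monomers": [
                {"chem_comp": {"id": "AAA"}},
                {"chem_comp": {"id": "BBB"}},
            ]
        },
        {"chem_comp_monomers": None},
    ],
    "nonpolymer_entities": None,
}

branched_ids = list(
    iter_path(entry, "branched_entities.chem_comp_monomers.chem_comp.id".split("."))
)
nonpolymer_ids = list(
    iter_path(entry, "nonpolymer_entities.nonpolymer_comp.chem_comp.id".split("."))
)
print(branched_ids)
print(nonpolymer_ids)
```

Because the path strings match the ones passed in `return_data_list`, a helper like this could replace the three sets of nested loops with one loop over the field paths, at the cost of being slightly less explicit.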


### Purpose

This notebook is designed to automate the process of mapping **ligands (chem\_comp IDs)** to **PDB entries** by extracting entity-level chemical component data using the RCSB Data API. This can be useful for structural biology research, ligand enrichment analysis, or building datasets for machine learning models involving protein–ligand interactions.

## Libraries

The library listed below must be installed and imported to complete the tasks in this notebook.

|  Library  | Abbreviation | Contents                                                                                                 | Source                                                                                 |
| :-------: | ------------ | :------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------- |
| `rcsbapi` | N/A          | Python package for accessing the [RCSB Protein Data Bank](https://www.rcsb.org) via data and search APIs | [rcsb-api on PyPI](https://pypi.org/project/rcsb-api/)                                 |


*Note:* The `rcsbapi` library is the only external dependency and can be installed using:

```bash
pip install rcsb-api
```

## Notebook Contents

The next section of the notebook contains the complete code for this example. **Experienced coders** can use it directly, either within the notebook or in their preferred development environment.

For **novice and intermediate coders**, the code is divided into sequential cells, each performing one clear step in the process. This notebook includes the following steps:

1. **Import Required Libraries**
   Set up all necessary modules and dependencies for working with the RCSB API.

2. **Define the Ligand Mapping Function**
   Create a function that queries all PDB entries, extracts ligand information from branched, polymer, and non-polymer entities, and builds a mapping of chemical component IDs to PDB IDs.

3. **Execute the Query and Save the Output**
   Run the function, collect the results, and write the mapping to a `.tsv` file for further analysis.


In [None]:
# These imports are from the RCSB API Python package.
# You can install it using:
#     pip install rcsb-api
from rcsbapi.data import DataQuery as Query
from rcsbapi.data import ALL_STRUCTURES


# This function queries all PDB structures to extract chemical component IDs (ligands)
# from branched, polymer, and nonpolymer entities, and writes a mapping of
# chem_comp_id -> associated PDB IDs to a TSV file.
def get_all_chem_comp_ids_and_write_to_file():
    # Initialize the data query to retrieve relevant chemical component data
    query = Query(
        input_type="entries",              # Query all structure entries
        input_ids=ALL_STRUCTURES,         # Constant representing all known structures
        return_data_list=[
            "rcsb_id",  # PDB ID of the structure
            "polymer_entities.chem_comp_nstd_monomers.chem_comp.id",   # Non-standard monomers in polymer chains
            "branched_entities.chem_comp_monomers.chem_comp.id",       # Monomers in branched entities
            "nonpolymer_entities.nonpolymer_comp.chem_comp.id"         # Ligands in nonpolymer entities
        ]
    )

    # Execute the query with a progress bar
    result = query.exec(progress_bar=True)

    # Extract list of returned structure entries
    entries = result.get("data", {}).get("entries", [])

    # Dictionary to collect mapping from chem_comp_id to a set of PDB IDs
    chem_comp_to_pdb_map = {}

    # Iterate over all returned entries
    for entry in entries:
        pdb_id = entry.get("rcsb_id")

        # --- 1. Branched Entities ---
        # These may include things like saccharides
        branched_entities = entry.get("branched_entities")
        if branched_entities:
            for branched in branched_entities:
                chem_comp_monomers = branched.get("chem_comp_monomers")
                if chem_comp_monomers:
                    for mono in chem_comp_monomers:
                        chem_comp = mono.get("chem_comp")
                        if chem_comp:
                            chem_id = chem_comp.get("id")
                            if chem_id:
                                # Add chem_comp_id -> pdb_id to the mapping
                                chem_comp_to_pdb_map.setdefault(chem_id, set()).add(pdb_id)

        # --- 2. Polymer Entities ---
        # Includes non-standard residues within polymer chains
        polymer_entities = entry.get("polymer_entities")
        if polymer_entities:
            for polymer in polymer_entities:
                chem_comp_nstd_monomers = polymer.get("chem_comp_nstd_monomers")
                if chem_comp_nstd_monomers:
                    for mono in chem_comp_nstd_monomers:
                        chem_comp = mono.get("chem_comp")
                        if chem_comp:
                            chem_id = chem_comp.get("id")
                            if chem_id:
                                chem_comp_to_pdb_map.setdefault(chem_id, set()).add(pdb_id)

        # --- 3. Nonpolymer Entities ---
        # Usually small molecule ligands
        nonpolymer_entities = entry.get("nonpolymer_entities")
        if nonpolymer_entities:
            for nonpolymer in nonpolymer_entities:
                nonpolymer_comp = nonpolymer.get("nonpolymer_comp")
                if nonpolymer_comp:
                    chem_comp = nonpolymer_comp.get("chem_comp")
                    if chem_comp:
                        chem_id = chem_comp.get("id")
                        if chem_id:
                            chem_comp_to_pdb_map.setdefault(chem_id, set()).add(pdb_id)

    # Write the final mapping to a TSV file
    # Format: <chem_comp_id>    <pdb_id1> <pdb_id2> ...
    output_file = "cc-to-pdb.tsv"
    with open(output_file, "w", encoding="utf-8") as f:
        for ccid, pdb_ids in chem_comp_to_pdb_map.items():
            f.write(f"{ccid}\t{' '.join(sorted(pdb_ids))}\n")

    return output_file


# Run the function and report the output file path
output_path = get_all_chem_comp_ids_and_write_to_file()
print(f"File saved at: {output_path}")

# Stepwise Code for NOVICE and INTERMEDIATE CODERS

This section breaks down the code into clear, ordered steps. Follow each step in sequence to extract ligand–PDB mappings from the RCSB Protein Data Bank using the `rcsb-api` package.

## Step 1: Installing the Required Library

Install the `rcsb-api` package if you haven't already.

In [None]:
# For Jupyter notebooks or interactive environments:
!pip install rcsb-api

## Step 2: Import Required Libraries

Import all necessary Python libraries.

In [None]:
from rcsbapi.data import DataQuery as Query
from rcsbapi.data import ALL_STRUCTURES

## Step 3: Define the Function to Query Ligand Data

This function queries all PDB entries and maps chemical component IDs to the PDB IDs in which they occur.

In [None]:
def get_all_chem_comp_ids_and_write_to_file():
    # Initialize the query to retrieve relevant chemical component data
    query = Query(
        input_type="entries",
        input_ids=ALL_STRUCTURES,
        return_data_list=[
            "rcsb_id",
            "polymer_entities.chem_comp_nstd_monomers.chem_comp.id",
            "branched_entities.chem_comp_monomers.chem_comp.id",
            "nonpolymer_entities.nonpolymer_comp.chem_comp.id"
        ]
    )

    # Execute the query with a progress bar
    result = query.exec(progress_bar=True)

    # Extract the returned entries
    entries = result.get("data", {}).get("entries", [])

    # Map chem_comp_id -> set of PDB IDs
    chem_comp_to_pdb_map = {}

    for entry in entries:
        pdb_id = entry.get("rcsb_id")

        # Note: a key that is present with a null (None) value bypasses the
        # default passed to .get(), so `or []` / `or {}` are used instead.

        # Branched entities (e.g. saccharides)
        for branched in entry.get("branched_entities") or []:
            for mono in branched.get("chem_comp_monomers") or []:
                chem_id = (mono.get("chem_comp") or {}).get("id")
                if chem_id:
                    chem_comp_to_pdb_map.setdefault(chem_id, set()).add(pdb_id)

        # Polymer entities (non-standard residues)
        for polymer in entry.get("polymer_entities") or []:
            for mono in polymer.get("chem_comp_nstd_monomers") or []:
                chem_id = (mono.get("chem_comp") or {}).get("id")
                if chem_id:
                    chem_comp_to_pdb_map.setdefault(chem_id, set()).add(pdb_id)

        # Nonpolymer entities (e.g. ligands, cofactors)
        for nonpolymer in entry.get("nonpolymer_entities") or []:
            nonpolymer_comp = nonpolymer.get("nonpolymer_comp") or {}
            chem_id = (nonpolymer_comp.get("chem_comp") or {}).get("id")
            if chem_id:
                chem_comp_to_pdb_map.setdefault(chem_id, set()).add(pdb_id)

    # Write results to file
    output_file = "cc-to-pdb.tsv"
    with open(output_file, "w", encoding="utf-8") as f:
        for ccid, pdb_ids in chem_comp_to_pdb_map.items():
            f.write(f"{ccid}\t{' '.join(sorted(pdb_ids))}\n")

    return output_file

## Step 4: Run the Function and Save the Output

Call the function and confirm the output file was written.

In [None]:
output_path = get_all_chem_comp_ids_and_write_to_file()
print(f"File saved at: {output_path}")

## Output

The script generates a file called `cc-to-pdb.tsv` containing lines like the following (the IDs shown are illustrative):

ATP 1BNA 1FAD 2GTP ...
HEM 1A6M 2HHB 3HBB ...

Each line holds a **chemical component ID**, a tab, and then a space-separated list of the **PDB structure IDs** where it is found.
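
For downstream use, the file can be read back into a dictionary with the standard library alone. The sketch below assumes the line format that the function writes (component ID, tab, space-separated PDB IDs). `read_cc_to_pdb` is a hypothetical helper name, and the demo file is written to a temporary directory with placeholder IDs so that the example does not depend on a completed query.

```python
import os
import tempfile


def read_cc_to_pdb(path):
    """Load a cc-to-pdb.tsv style file into a dict of chem_comp_id -> list of PDB IDs."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # Skip blank lines.
            # Split on the first tab: left is the component ID, right is the ID list.
            ccid, _, pdb_field = line.partition("\t")
            mapping[ccid] = pdb_field.split()
    return mapping


# Write a tiny demo file with placeholder IDs, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "cc-to-pdb.tsv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("AAA\tXXX1 XXX2\n")
        f.write("BBB\tXXX3\n")
    loaded = read_cc_to_pdb(demo_path)

print(loaded)
```

To load the real output, pass the path returned by `get_all_chem_comp_ids_and_write_to_file()` to the reader instead of the demo path.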