Open and run this in Google Colab: <a href="https://colab.research.google.com/github/rcsb/rcsb-training-resources/blob/master/example-use-cases/ligand_file_download/ligand_file_download.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ligand File Download Documentation

## Introduction

This notebook demonstrates how to download, clean, and aggregate mmCIF data for specific ligands found in PDB structures. Using a ligand-to-PDB mapping file as input, it retrieves ligand-specific mmCIF categories (like `chem_comp` and `atom_site`) from the RCSB model server, filters out irrelevant data, and exports a combined cleaned mmCIF file.

### Inputs and Associated Variables

* **`input_file_path`**: Path to a TSV file mapping chemical component IDs (ligands) to associated PDB IDs. Each line lists one ligand ID followed by one or more PDB IDs, separated by whitespace.
* **`MarshalUtil`**: Utility class from the `rcsb.utils.io` package for downloading, parsing, and writing mmCIF data.
* **`output_file_path`**: Path to the output cleaned mmCIF file.
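
The input layout can be illustrated with a short sketch. The sample rows below are placeholders rather than a curated mapping, and the helper name `read_ligand_mapping` is hypothetical. The parsing mirrors the whitespace split used later in this notebook: the first field is the ligand ID and the remaining fields are PDB IDs.

```python
import io

# Placeholder rows: ligand ID first, then one or more PDB IDs.
# The blank line and the ligand-only line ("NAG") are there to show what gets skipped.
SAMPLE_TSV = "ATP\t1abc\t2xyz\nHEM\t3def\n\nNAG\n"


def read_ligand_mapping(handle):
    """Yield (ligand_id, [pdb_ids]) for each usable line of a mapping file."""
    for line in handle:
        parts = line.strip().split()
        if len(parts) < 2:
            continue  # skip blank lines and lines without any PDB ID
        yield parts[0], parts[1:]


mapping = list(read_ligand_mapping(io.StringIO(SAMPLE_TSV)))
print(mapping)
```

Passing an open file handle instead of the `io.StringIO` object would read a real mapping file in the same way.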

### Outputs and Associated Variables

* **`all_cleaned_ligands.cif`**: A merged mmCIF file containing only relevant ligand data categories (`chem_comp` and `atom_site`) for all ligand–PDB pairs in the input.

### Major Steps in the Coding Process

1. **Download mmCIF data** for each ligand in each specified PDB entry using `MarshalUtil`.
2. **Filter data categories** to keep only those relevant to chemical components and atomic sites.
3. **Merge all cleaned ligand data** into a single mmCIF file for downstream analysis.
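
As a rough sketch of what step 1 requests, the snippet below assembles the model server URL for one ligand–entry pair. The helper name `build_ligand_url` and the example IDs are hypothetical; the base URL and query parameters are copied from the code in this notebook.

```python
from urllib.parse import urlencode

MODEL_SERVER_BASE = "https://models.rcsb.org/v1"  # base URL as used in this notebook


def build_ligand_url(entry_id: str, ligand_id: str) -> str:
    """Build the atoms-query URL for one ligand in one PDB entry (hypothetical helper)."""
    query = urlencode(
        {
            "label_comp_id": ligand_id,
            "encoding": "cif",
            "copy_all_categories": "false",
            "download": "false",
        }
    )
    return f"{MODEL_SERVER_BASE}/{entry_id}/atoms?{query}"


# Placeholder IDs, for illustration only.
print(build_ligand_url("1abc", "ATP"))
```

Using `urlencode` also escapes any characters in the ligand ID that are not URL-safe, which the f-string version in the notebook does not do.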

### Questions

This notebook helps answer:

* How can I efficiently retrieve ligand-specific mmCIF data from the RCSB model server?
* How do I clean mmCIF data to keep only relevant chemical and atomic site information?
* How can I aggregate ligand data across multiple PDB entries into a single file?

### Learning Objectives

By using this notebook, users will learn to:

* Use `MarshalUtil` to import and export mmCIF data programmatically.
* Filter mmCIF data containers to retain only specific categories.
* Automate batch processing of ligand data across multiple PDB entries.
* Post-process mmCIF files to meet formatting requirements.

### Purpose

This notebook is designed to facilitate ligand-centric mmCIF data retrieval and cleaning, supporting structural biology studies, ligand-focused data curation, or preparation of datasets for computational modeling and analysis.

## Libraries

The following libraries must be imported to complete the tasks in this notebook. Only `rcsb.utils.io` needs to be installed; `os` and `copy` are part of the Python standard library.

|     Library     | Abbreviation | Contents                                                                       | Source                                                                                 |
| :-------------: | ------------ | :----------------------------------------------------------------------------- | :------------------------------------------------------------------------------------- |
| `rcsb.utils.io` | N/A          | Utility package for working with mmCIF files and data containers from RCSB PDB | [rcsb.utils.io on PyPI](https://pypi.org/project/rcsb.utils.io/)                       |
|       `os`      | N/A          | Standard Python library for interacting with the operating system              | [os — Miscellaneous OS interfaces](https://docs.python.org/3/library/os.html)          |
|      `copy`     | N/A          | Standard Python library for shallow and deep copying of Python objects         | [copy — Shallow and deep copy operations](https://docs.python.org/3/library/copy.html) |

> **Note:**
> The external dependency `rcsb.utils.io` can be installed via:
>
> ```bash
> pip install rcsb.utils.io
> ```


## Notebook Contents

The next section of the notebook includes all of the raw code for this example. **Experienced coders** can use it directly, either within the notebook or in their preferred development environment.

For **novice and intermediate coders**, the code is divided into sequential cells, each performing one clear step in the process. This notebook includes the following steps:

1. **Install and Import Required Libraries**  
   Install and import all necessary Python libraries, including the `rcsb.utils.io` package and standard modules.

2. **Define a Function to Download and Clean mmCIF Ligand Data**  
   Create a function that retrieves and filters mmCIF data for a specific ligand in a structure, retaining only relevant categories.

3. **Define a Function to Process a Ligand-to-PDB Mapping File**  
   Create a function that processes a `.tsv` file mapping ligands to structures, and collects/cleans mmCIF data for all listed pairs.

4. **Run the Data Collection and Export the Merged CIF File**  
   Provide input/output file paths, call the functions, and export a cleaned, merged `.cif` file.

In [None]:
import os  # Standard library for interacting with the operating system (no installation needed)
from copy import deepcopy  # Standard library; deep-copies data containers so no references are shared

# MarshalUtil is part of the RCSB PDB Python utilities package.
# You can install it using:
#     pip install rcsb.utils.io
from rcsb.utils.io.MarshalUtil import MarshalUtil


# This function downloads and cleans mmCIF data for a specific ligand in a given PDB entry.
# It only retains the "chem_comp" and "atom_site" categories.
def clean_and_collect_ligand_cif(ligand_id: str, entry_id: str, marshal_util: MarshalUtil):
    # Construct URL for the mmCIF data specific to the ligand in the structure entry
    url = f"https://models.rcsb.org/v1/{entry_id}/atoms?label_comp_id={ligand_id}&encoding=cif&copy_all_categories=false&download=false"
    try:
        # Download and parse mmCIF data using the given marshal utility
        dataContainerList = marshal_util.doImport(url, fmt="mmcif")
        if not dataContainerList:
            print(f"[Warning] No data containers found for {ligand_id} in {entry_id}")
            return None

        # Extract the first data container and get all category names
        originalContainer = dataContainerList[0]
        categoryNames = originalContainer.getObjNameList()

        # Define which categories to keep
        categories_to_keep = {"chem_comp", "atom_site"}

        # Create a deep copy and remove all unwanted categories
        newContainer = deepcopy(originalContainer)
        for catName in categoryNames:
            if catName not in categories_to_keep:
                newContainer.remove(catName)

        return newContainer

    except Exception as e:
        # Print an error message if fetching or processing fails
        print(f"[Error] Failed for {ligand_id} in {entry_id}: {e}")
        return None


# This function processes a TSV input file containing ligand IDs and their associated PDB IDs.
# It collects and cleans the relevant mmCIF data for each combination, and writes it to a single output CIF file.
def process_input_file_to_single_cif(input_file_path: str, output_file: str):
    mU = MarshalUtil()
    all_containers = []

    # Open the input file and read line-by-line
    with open(input_file_path, "r", encoding="utf-8") as f:
        for line in f:
            # Each line should contain a ligand ID followed by one or more PDB entry IDs
            parts = line.strip().split()
            if len(parts) < 2:
                continue
            ligand_id = parts[0]
            pdb_ids = parts[1:]
            for entry_id in pdb_ids:
                # Fetch and clean the mmCIF data for each ligand-entry pair
                container = clean_and_collect_ligand_cif(ligand_id, entry_id, mU)
                if container:
                    # Assign a unique name to each container
                    data_name = f"{entry_id.lower()}_{ligand_id.upper()}"
                    container.setName(data_name)
                    all_containers.append(container)

    # If any cleaned containers were collected, export them to a single output CIF file
    if all_containers:
        mU.doExport(output_file, all_containers, fmt="mmcif")

        # Post-process the file to insert "#" lines after "data_" lines and remove the following line
        with open(output_file, "r", encoding="utf-8") as infile:
            lines = infile.readlines()

        with open(output_file, "w", encoding="utf-8") as outfile:
            skip_next = False
            for line in lines:
                if skip_next:
                    skip_next = False
                    continue  # skip the line that came after "data_"

                if line.strip().startswith("data_"):
                    outfile.write(line)
                    outfile.write("#\n")
                    skip_next = True  # flag the next line to be skipped
                else:
                    outfile.write(line)


# Define the input and output file paths and execute the processing function
input_file_path = "INPUT YOUR PATH"  # Replace with the path to your TSV mapping file before running
output_file_path = "all_cleaned_ligands.cif"
process_input_file_to_single_cif(input_file_path, output_file_path)

# Stepwise Code for NOVICE and INTERMEDIATE CODERS

This section walks through downloading and cleaning mmCIF data for ligands found in a set of PDB entries. It requires a TSV input file mapping chemical component IDs to structures, and it outputs a single merged mmCIF file containing only the relevant data categories.

## Step 1: Install Required Library

We use the `MarshalUtil` class from the [`rcsb.utils.io`](https://pypi.org/project/rcsb.utils.io/) package. Install it using pip:

In [None]:
!pip install rcsb.utils.io

## Step 2: Import Required Libraries

Import standard libraries and `MarshalUtil`, which handles CIF file download and parsing.

In [None]:
import os  # For interacting with the operating system
from copy import deepcopy  # For safe copying of complex data structures

from rcsb.utils.io.MarshalUtil import MarshalUtil  # For reading and writing mmCIF files

## Step 3: Define a Function to Download and Clean mmCIF Data

This function retrieves mmCIF data for a specific ligand in a given structure and retains only the "chem_comp" and "atom_site" categories.

In [None]:
def clean_and_collect_ligand_cif(ligand_id: str, entry_id: str, marshal_util: MarshalUtil):
    url = (
        f"https://models.rcsb.org/v1/{entry_id}/atoms"
        f"?label_comp_id={ligand_id}&encoding=cif&copy_all_categories=false&download=false"
    )
    try:
        dataContainerList = marshal_util.doImport(url, fmt="mmcif")
        if not dataContainerList:
            print(f"[Warning] No data containers found for {ligand_id} in {entry_id}")
            return None

        originalContainer = dataContainerList[0]
        categoryNames = originalContainer.getObjNameList()

        categories_to_keep = {"chem_comp", "atom_site"}

        newContainer = deepcopy(originalContainer)
        for catName in categoryNames:
            if catName not in categories_to_keep:
                newContainer.remove(catName)

        return newContainer

    except Exception as e:
        print(f"[Error] Failed for {ligand_id} in {entry_id}: {e}")
        return None
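
The copy-then-remove pattern used in this function can be tried without network access. In the sketch below a plain dictionary stands in for the parsed data container; this is an assumption made for illustration and is not the `MarshalUtil` or mmCIF container API. The helper name `keep_categories` and the sample rows are hypothetical.

```python
from copy import deepcopy

CATEGORIES_TO_KEEP = {"chem_comp", "atom_site"}


def keep_categories(container: dict, keep: set) -> dict:
    """Return a deep copy of `container` holding only the categories named in `keep`."""
    cleaned = deepcopy(container)
    for name in list(container):  # iterate over the original's keys while editing the copy
        if name not in keep:
            del cleaned[name]
    return cleaned


# Hypothetical stand-in for a parsed data container: category name -> rows.
original = {
    "entry": [{"id": "1abc"}],
    "chem_comp": [{"id": "ATP"}],
    "atom_site": [{"label_atom_id": "PG"}, {"label_atom_id": "O1G"}],
}
cleaned = keep_categories(original, CATEGORIES_TO_KEEP)
print(sorted(cleaned))
```

Because the copy is deep, later changes to `cleaned` do not affect `original`, which is the reason the notebook code calls `deepcopy` before removing categories.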

## Step 4: Process the Input File and Write Cleaned CIFs

This function reads a `.tsv` file of ligand-to-PDB mappings, downloads and cleans their CIF data, and writes a merged CIF output.

In [None]:
def process_input_file_to_single_cif(input_file_path: str, output_file: str):
    mU = MarshalUtil()
    all_containers = []

    with open(input_file_path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) < 2:
                continue
            ligand_id = parts[0]
            pdb_ids = parts[1:]
            for entry_id in pdb_ids:
                container = clean_and_collect_ligand_cif(ligand_id, entry_id, mU)
                if container:
                    data_name = f"{entry_id.lower()}_{ligand_id.upper()}"
                    container.setName(data_name)
                    all_containers.append(container)

    if all_containers:
        mU.doExport(output_file, all_containers, fmt="mmcif")

        # Insert "#" after "data_" and skip the next line
        with open(output_file, "r", encoding="utf-8") as infile:
            lines = infile.readlines()

        with open(output_file, "w", encoding="utf-8") as outfile:
            skip_next = False
            for line in lines:
                if skip_next:
                    skip_next = False
                    continue
                if line.strip().startswith("data_"):
                    outfile.write(line)
                    outfile.write("#\n")
                    skip_next = True
                else:
                    outfile.write(line)
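
The post-processing loop is easier to follow when isolated as a function that works on a list of lines. The sketch below reproduces the same logic in memory; the function name `insert_hash_after_data_lines` and the sample lines are hypothetical, and the sample assumes the line following each `data_` header is disposable.

```python
def insert_hash_after_data_lines(lines):
    """Write "#" after each "data_" line and drop the line that originally followed it."""
    result = []
    skip_next = False
    for line in lines:
        if skip_next:
            skip_next = False
            continue  # drop the line that came directly after a "data_" header
        result.append(line)
        if line.strip().startswith("data_"):
            result.append("#\n")
            skip_next = True
    return result


# Hypothetical exported text; the line after the "data_" header here is blank.
sample = ["data_1abc_ATP\n", "\n", "loop_\n", "_atom_site.id\n", "1\n"]
processed = insert_hash_after_data_lines(sample)
print("".join(processed), end="")
```

Note that the line after each `data_` header is dropped unconditionally, so any meaningful content on that line would be lost.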

## Step 5: Set File Paths and Run Processing

Provide your input and output file paths and run the processing.

In [None]:
input_file_path = "INPUT YOUR PATH"
output_file_path = "all_cleaned_ligands.cif"

process_input_file_to_single_cif(input_file_path, output_file_path)
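
Since the placeholder path must be replaced before running, a small guard can report a missing file with a clearer message. This is an optional sketch, not part of the original workflow; the helper name `check_input_path` is hypothetical.

```python
import os


def check_input_path(path: str) -> str:
    """Return `path` unchanged if it names an existing file, otherwise raise FileNotFoundError."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input mapping file not found: {path!r}")
    return path


try:
    check_input_path("INPUT YOUR PATH")
except FileNotFoundError as err:
    print(f"[Error] {err}")
```

Calling `check_input_path(input_file_path)` before `process_input_file_to_single_cif` would stop the run early if the path has not been set.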

## Output

The result is a cleaned, combined `.cif` file containing ligand-specific mmCIF data for all PDB entries listed in the input. The file is saved as: `all_cleaned_ligands.cif`
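
One quick way to check the result is to list the data block names in the merged file. The sketch below is a minimal example under stated assumptions: the helper name `list_data_block_names` is hypothetical, the sample text is illustrative, and block names are assumed to follow the `{entry_id}_{ligand_id}` pattern set in the processing function.

```python
import io


def list_data_block_names(lines):
    """Return the names of all data blocks ("data_<name>" lines) found in CIF text lines."""
    names = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("data_"):
            names.append(stripped[len("data_"):])
    return names


# Illustrative text with two data blocks; real output would contain full category data.
sample_cif = io.StringIO("data_1abc_ATP\n#\nloop_\n_atom_site.id\n1\ndata_2xyz_HEM\n#\n")
block_names = list_data_block_names(sample_cif)
print(block_names)
```

To inspect the real output, pass an open handle for `all_cleaned_ligands.cif` instead of the `io.StringIO` sample. The number of names returned should match the number of ligand–entry pairs that downloaded successfully.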