# PMCGrab Tutorial

**Welcome to PMCGrab!** This notebook walks you through the complete workflow:

1. Fetching scientific papers from PubMed Central (PMC)
2. Converting messy XML to clean, AI-ready JSON
3. Exploring the structured data
4. Preparing data for AI/ML workflows (RAG, vector databases, LLM training)
5. Saving results to disk

**Prerequisites:** Make sure you have PMCGrab installed:
```bash
uv add pmcgrab
# or
pip install pmcgrab
```


## What Makes PMCGrab Special?

PMCGrab transforms messy XML like this:

```xml
<sec sec-type="intro">
  <title>Introduction</title>
  <p>Machine learning <xref ref-type="bibr" rid="B1">1</xref> has revolutionized...
    <fig id="F1"><graphic xlink:href="figure1.jpg"/></fig>
  </p>
</sec>
```

Into this clean structure:

```json
{
  "body": {
    "Introduction": "Machine learning has revolutionized..."
  }
}
```
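
To make the transformation concrete, here is a minimal sketch of the general idea using only the standard library. It is not PMCGrab's actual implementation: the helper names `collect_text` and `section_to_entry` and the `SKIP_TAGS` set are assumptions made for this illustration, and the sample adds an `xmlns:xlink` declaration so that `xml.etree.ElementTree` can parse the snippet on its own.

```python
import xml.etree.ElementTree as ET

# Same snippet as above, with the xlink namespace declared so it parses standalone.
SAMPLE_XML = """<sec sec-type="intro" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Introduction</title>
  <p>Machine learning <xref ref-type="bibr" rid="B1">1</xref> has revolutionized...
    <fig id="F1"><graphic xlink:href="figure1.jpg"/></fig>
  </p>
</sec>"""

# Assumption for this sketch: citation markers and figures are dropped entirely.
SKIP_TAGS = {"xref", "fig"}


def collect_text(elem):
    """Gather text under elem, skipping SKIP_TAGS but keeping the text that follows them."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in SKIP_TAGS:
            parts.append(collect_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def section_to_entry(xml_string):
    """Return {section title: whitespace-normalized paragraph text} for one <sec>."""
    sec = ET.fromstring(xml_string)
    title = sec.findtext("title", default="Untitled")
    paragraphs = (" ".join(collect_text(p).split()) for p in sec.findall("p"))
    return {title: " ".join(paragraphs)}


print(section_to_entry(SAMPLE_XML))
```

Real PMC articles contain nested sections, tables, formulas, and many more inline elements, which is the work PMCGrab takes off your hands.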


## Setup and Imports

In [None]:
# Import required libraries
import json
import warnings
from pathlib import Path

# PMCGrab imports
from pmcgrab import Paper  # OOP interface (recommended)
from pmcgrab.application.processing import process_single_pmc  # Dict-based interface

print("All imports successful!")
print("Ready to process some scientific literature!")

## Step 1: Process Your First Paper

Let's start with a real paper from PMC. We'll use **PMC7114487** -- a freely available research paper.

The `Paper.from_pmc()` method downloads the XML from NCBI and parses it into a clean Python object. Pass `suppress_warnings=True` to silence any informational parsing warnings.


In [None]:
# Process a single paper using the Paper class (recommended API)
pmcid = "7114487"
print(f"Fetching PMC{pmcid} from PubMed Central...")
print("This might take 5-15 seconds...\n")

paper = Paper.from_pmc(pmcid, suppress_warnings=True)

if paper.has_data:
    print("Success! Paper processed.")
    print(f"  Title:    {paper.title}")
    print(f"  Abstract: {paper.abstract_as_str()[:120]}...")
    print(f"  Sections: {len(paper.body_as_dict())}")
else:
    print("Failed to process the paper.")
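
The cell above passes the bare numeric ID (`"7114487"`). If your IDs come from spreadsheets or citation exports, they may carry a `PMC` prefix or stray whitespace. The helper below is a small sketch for normalizing such input before fetching; `normalize_pmcid` is a hypothetical name and not part of PMCGrab, and this tutorial does not state whether PMCGrab accepts prefixed IDs directly.

```python
import re


def normalize_pmcid(raw):
    """Return the bare numeric part of a PMC identifier as a string.

    Accepts forms such as "PMC7114487", "pmc7114487", or the integer 7114487.
    Raises ValueError for anything that is not an optional PMC prefix plus digits.
    """
    match = re.fullmatch(r"\s*(?:PMC)?\s*(\d+)\s*", str(raw), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Not a recognisable PMC ID: {raw!r}")
    return match.group(1)


for candidate in ["PMC7114487", "pmc7114487", 7114487, " 7114487 "]:
    print(repr(candidate), "->", normalize_pmcid(candidate))
```

The normalized string can then be passed on, for example `Paper.from_pmc(normalize_pmcid("PMC7114487"), suppress_warnings=True)`.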

## Step 2: Explore the Paper Structure

Let's look at what PMCGrab extracted for us, section by section:


In [None]:
# body_as_dict() returns {section_title: clean_text}
body = paper.body_as_dict()

print("PAPER SECTIONS:")
print("=" * 50)

for section_name, content in body.items():
    word_count = len(content.split())
    char_count = len(content)

    print(f"\n{section_name}:")
    print(f"   {word_count:,} words | {char_count:,} characters")

    # Show preview
    preview = content[:200].replace("\n", " ").strip()
    print(f"   Preview: {preview}...")
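
Because `body_as_dict()` returns a plain `{section_title: text}` mapping, the same statistics can be collected into rows and sorted, which makes empty or unusually short sections easy to spot. This is a sketch that runs on a stand-in dictionary with placeholder text; `summarize_sections` is a hypothetical helper, not a PMCGrab function.

```python
def summarize_sections(body):
    """Return one summary row per section, longest section first."""
    rows = []
    for name, text in body.items():
        rows.append({"section": name, "words": len(text.split()), "chars": len(text)})
    return sorted(rows, key=lambda row: row["words"], reverse=True)


# Stand-in for paper.body_as_dict(); the text is placeholder content.
demo_body = {
    "Introduction": "Short opening paragraph.",
    "Methods": "A somewhat longer description of the methods used here.",
    "Results": "",
}

for row in summarize_sections(demo_body):
    flag = "  <- empty" if row["words"] == 0 else ""
    print(f"{row['section']:<15}{row['words']:>6} words{flag}")
```

With a real paper, call `summarize_sections(paper.body_as_dict())` instead.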

## Step 3: Batch Processing - Build a Dataset

For batch work the dict-based `process_single_pmc()` is often more convenient -- it returns a plain dictionary that is already JSON-serializable.

Key fields in the returned dict:
- `title` -- article title (string)
- `abstract_text` -- plain-text abstract (string)
- `abstract` -- structured abstract (dict of section -> text)
- `body` -- body sections (dict of section title -> text)
- `authors` -- author list (list of dicts)
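
Before building on these fields, it can help to check that a record has the shape you expect. The sketch below encodes only the five fields and types listed above; `EXPECTED_FIELDS` and `find_field_problems` are hypothetical names for this illustration, and the demo record contains placeholder values rather than real PMCGrab output.

```python
# Field names and types taken from the list above.
EXPECTED_FIELDS = {
    "title": str,
    "abstract_text": str,
    "abstract": dict,
    "body": dict,
    "authors": list,
}


def find_field_problems(record):
    """Return human-readable problems; an empty list means the record looks fine."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


# Placeholder record for demonstration only.
demo_record = {
    "title": "Placeholder title",
    "abstract_text": "Placeholder abstract.",
    "abstract": {"Background": "Placeholder abstract."},
    "body": {"Introduction": "Placeholder text."},
    "authors": [],
}

print(find_field_problems(demo_record))
print(find_field_problems({"title": "Only a title", "body": "not a dict"}))
```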


In [None]:
paper_collection = {
    "7114487": "Junin virus PI3K/Akt signalling",
    "3084273": "Nuclear domain 10 functional interaction",
    "7181753": "Single-cell transcriptomes of human skin",
}

print(f"Processing {len(paper_collection)} papers for our dataset...")
print("This will take 1-3 minutes...\n")

dataset = {}
stats = {"ok": 0, "fail": 0}

# Suppress informational parsing warnings for cleaner output
with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    for pmcid, description in paper_collection.items():
        print(f"  PMC{pmcid}: {description}")
        try:
            data = process_single_pmc(pmcid)
            if data:
                dataset[pmcid] = data
                stats["ok"] += 1
                print(f"    -> '{data['title'][:60]}...'\n")
            else:
                stats["fail"] += 1
                print("    -> FAILED\n")
        except Exception as e:
            stats["fail"] += 1
            print(f"    -> Error: {e}\n")

print(f"Done! {stats['ok']} successful, {stats['fail']} failed.")
print(f"Dataset size: {len(dataset)} papers")
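
The loop above only counts failures. For larger batches it is often useful to record which IDs failed and why, so they can be retried later. The sketch below shows that pattern with a generic `fetch` callable; `run_batch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real fetcher so the example runs without network access.

```python
def run_batch(pmcids, fetch):
    """Call fetch(pmcid) for each ID and return (results, failures).

    failures maps each failed ID to a short reason string.
    """
    results, failures = {}, {}
    for pmcid in pmcids:
        try:
            data = fetch(pmcid)
        except Exception as exc:  # keep going; remember what went wrong
            failures[pmcid] = f"{type(exc).__name__}: {exc}"
            continue
        if data:
            results[pmcid] = data
        else:
            failures[pmcid] = "empty result"
    return results, failures


def fake_fetch(pmcid):
    """Hypothetical stand-in for a real fetcher, used only for this demo."""
    if pmcid == "0":
        raise ValueError("no such article")
    if pmcid == "1":
        return None
    return {"title": f"Paper {pmcid}"}


results, failures = run_batch(["7114487", "0", "1"], fake_fetch)
print("ok:    ", sorted(results))
print("failed:", failures)
```

In the notebook you would pass the real function, for example `run_batch(paper_collection, process_single_pmc)`, and retry with `run_batch(failures, process_single_pmc)`.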

## Step 4: Prepare Data for AI/ML Workflows

### RAG (Retrieval-Augmented Generation) Preparation

Each paper dict contains `abstract_text` (plain string) and `body` (dict of section title to text). We can turn the abstract and each body section into one chunk, ready to load into a vector database such as Pinecone, Weaviate, or ChromaDB:


In [None]:
# Prepare chunks for a RAG pipeline
rag_chunks = []

# Loop variable is named paper_dict so it does not overwrite the
# `paper` object created in Step 1.
for pmcid, paper_dict in dataset.items():
    # Use abstract_text (plain string) -- NOT abstract (structured dict)
    abstract_text = paper_dict.get("abstract_text", "")

    # Add abstract as a chunk
    if abstract_text:
        rag_chunks.append(
            {
                "id": f"PMC{pmcid}_abstract",
                "source": f"PMC{pmcid}",
                "section": "Abstract",
                "content": abstract_text,
                "metadata": {
                    "title": paper_dict["title"],
                    "journal": paper_dict.get("journal_title", ""),
                    "content_type": "abstract",
                    "word_count": len(abstract_text.split()),
                },
            }
        )

    # Add each body section as a chunk
    for section_title, content in paper_dict["body"].items():
        rag_chunks.append(
            {
                "id": f"PMC{pmcid}_{section_title.lower().replace(' ', '_')}",
                "source": f"PMC{pmcid}",
                "section": section_title,
                "content": content,
                "metadata": {
                    "title": paper_dict["title"],
                    "journal": paper_dict.get("journal_title", ""),
                    "content_type": "section",
                    "word_count": len(content.split()),
                },
            }
        )

print(f"Created {len(rag_chunks)} RAG chunks from {len(dataset)} papers")
if rag_chunks:
    avg_words = sum(c["metadata"]["word_count"] for c in rag_chunks) / len(rag_chunks)
    print(f"Average chunk size: {avg_words:.0f} words")

    # Show a sample chunk
    print("\nSAMPLE RAG CHUNK:")
    print("=" * 50)
    sample = rag_chunks[0]
    print(f"ID:         {sample['id']}")
    print(f"Section:    {sample['section']}")
    print(f"Word Count: {sample['metadata']['word_count']}")
    print(f"Preview:    {sample['content'][:200]}...")
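
Whole sections can run to thousands of words, which is often more than an embedding model handles well in one piece. One common approach is to split long chunks into overlapping word windows. The sketch below assumes chunks shaped like the ones built above; `split_into_windows`, `split_chunk`, and the default sizes are illustrative choices, not PMCGrab features or recommended values.

```python
def split_into_windows(text, max_words=200, overlap=40):
    """Split text into windows of at most max_words words sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return windows


def split_chunk(chunk, max_words=200, overlap=40):
    """Expand one section-level chunk into window-level chunks, copying its metadata."""
    pieces = split_into_windows(chunk["content"], max_words, overlap)
    return [
        {
            **chunk,
            "id": f"{chunk['id']}_part{i}",
            "content": piece,
            "metadata": {**chunk["metadata"], "word_count": len(piece.split())},
        }
        for i, piece in enumerate(pieces)
    ]


# Placeholder chunk for demonstration: ten numbered words.
demo_chunk = {
    "id": "PMC0_methods",
    "source": "PMC0",
    "section": "Methods",
    "content": " ".join(f"w{i}" for i in range(10)),
    "metadata": {"title": "Placeholder", "content_type": "section", "word_count": 10},
}

for part in split_chunk(demo_chunk, max_words=4, overlap=1):
    print(part["id"], "|", part["content"])
```

To apply it to the real data, something like `small_chunks = [part for c in rag_chunks for part in split_chunk(c)]` would work.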

## Step 5: Save Your Data

Let's save all the processed data for future use:


In [None]:
# Create output directory
output_dir = Path("pmcgrab_tutorial_output")
output_dir.mkdir(exist_ok=True)

# 1. Save individual paper JSON files
papers_dir = output_dir / "papers"
papers_dir.mkdir(exist_ok=True)

for pmcid, paper_dict in dataset.items():
    dest = papers_dir / f"PMC{pmcid}.json"
    with open(dest, "w", encoding="utf-8") as f:
        json.dump(paper_dict, f, indent=2, ensure_ascii=False)

print(f"Saved {len(dataset)} paper files to {papers_dir}/")

# 2. Save RAG chunks
chunks_path = output_dir / "rag_chunks.json"
with open(chunks_path, "w", encoding="utf-8") as f:
    json.dump(rag_chunks, f, indent=2, ensure_ascii=False)

print(f"Saved {len(rag_chunks)} RAG chunks to {chunks_path}")
print(f"\nAll data saved to '{output_dir}/'.")
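
Many ingestion and training pipelines prefer JSON Lines (one JSON object per line) over a single large JSON array, because records can be streamed one at a time. The sketch below writes and re-reads such a file using only the standard library; `write_jsonl` and `read_jsonl` are hypothetical helpers, and the demo uses placeholder records and a temporary directory so it leaves no files behind.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Read a JSON Lines file back into a list of objects, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder records for demonstration only.
demo_chunks = [
    {"id": "PMC0_abstract", "content": "Placeholder abstract."},
    {"id": "PMC0_introduction", "content": "Placeholder text with a non-ASCII character: é."},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "rag_chunks.jsonl"
    written = write_jsonl(demo_chunks, demo_path)
    restored = read_jsonl(demo_path)

print(f"Wrote {written} records; round trip intact: {restored == demo_chunks}")
```

In the notebook, `write_jsonl(rag_chunks, output_dir / "rag_chunks.jsonl")` would produce the streaming-friendly variant alongside the JSON file.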

## Next Steps

You now know how to:

- Fetch and parse PMC papers using `Paper.from_pmc()` (OOP) or `process_single_pmc()` (dict)
- Explore structured sections and abstracts
- Build RAG-ready chunks from the parsed data
- Save everything as JSON

### Where to go from here

1. **Scale up** -- use `process_pmc_ids()` or the CLI (`pmcgrab --pmcids ...`) for bulk processing
2. **Local XML** -- process pre-downloaded XML with `Paper.from_local_xml()` or `process_single_local_xml()`
3. **Vector databases** -- load your RAG chunks into Pinecone, Weaviate, or ChromaDB
4. **Knowledge graphs** -- use the citation and author data to build relational graphs
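
As a rough preview of what retrieval over the chunks looks like, the sketch below ranks chunks against a query using cosine similarity over raw word counts. This is a deliberately simple stand-in: a real vector database would store embeddings from a language model instead of word counts. The function names and the demo chunks are hypothetical placeholders.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(chunks, query, top_k=3):
    """Return up to top_k (score, chunk_id) pairs with a non-zero score, best first."""
    query_counts = Counter(tokenize(query))
    scored = [(cosine(query_counts, Counter(tokenize(c["content"]))), c["id"]) for c in chunks]
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)[:top_k]


# Placeholder chunks for demonstration only.
demo_chunks = [
    {"id": "demo_methods", "content": "Cells were infected with virus and lysed after two hours."},
    {"id": "demo_results", "content": "Virus entry activated the signalling pathway in infected cells."},
    {"id": "demo_discussion", "content": "Future work should examine other hosts."},
]

for score, chunk_id in search(demo_chunks, "virus infected cells"):
    print(f"{score:.3f}  {chunk_id}")
```

The same `search` call works on the `rag_chunks` list built in Step 4, since it only relies on the `id` and `content` keys.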

### Quick reference

| Task | Code |
|------|------|
| Single paper (OOP) | `Paper.from_pmc("7181753", suppress_warnings=True)` |
| Single paper (dict) | `process_single_pmc("7181753")` |
| Local XML file | `Paper.from_local_xml("article.xml")` |
| Batch (network) | `process_pmc_ids(["7181753", "3084273"])` |
| CLI | `pmcgrab --pmcids 7181753 3084273 --output-dir ./out` |

### Resources

- [PMCGrab Documentation](https://rajdeepmondaldotcom.github.io/pmcgrab/)
- [GitHub Repository](https://github.com/rajdeepmondaldotcom/pmcgrab)
