# PMCGrab Tutorial

**Welcome to PMCGrab!** This notebook will take you through the complete process of:

1. Fetching scientific papers from PubMed Central (PMC)
2. Converting messy XML to clean, AI-ready JSON
3. Exploring the structured data
4. Preparing data for RAG workflows
5. Saving the results for later use

**Prerequisites:** Make sure you have PMCGrab installed:
```bash
uv add pmcgrab
# or
pip install pmcgrab
```


## What Makes PMCGrab Special?

PMCGrab transforms this messy XML:

```xml
<sec sec-type="intro">
  <title>Introduction</title>
  <p>Machine learning <xref ref-type="bibr" rid="B1">1</xref> has revolutionized...
    <fig id="F1"><graphic xlink:href="figure1.jpg"/></fig>
  </p>
</sec>
```

Into this clean structure:

```json
{
  "body": {
    "Introduction": "Machine learning has revolutionized..."
  }
}
```
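
To make the transformation concrete, here is a minimal sketch of the idea using only Python's standard library. It is not PMCGrab's implementation: `visible_text`, `section_to_dict`, and the sample sentence are illustrative, and real JATS XML (nested sections, tables, formulas) needs much more care. Note that the `xlink` namespace must be declared for the snippet to parse.

```python
import xml.etree.ElementTree as ET

# Illustrative sample modelled on the snippet above (xlink namespace declared so it parses)
SAMPLE = """<sec sec-type="intro" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Introduction</title>
  <p>Machine learning <xref ref-type="bibr" rid="B1">1</xref> has revolutionized biology.
    <fig id="F1"><graphic xlink:href="figure1.jpg"/></fig>
  </p>
</sec>"""

# Assumption for this sketch: citations and figures are dropped from the running text
SKIP_TAGS = {"xref", "fig"}


def visible_text(elem):
    """Collect text under elem, skipping SKIP_TAGS subtrees but keeping the text after them."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in SKIP_TAGS:
            parts.append(visible_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def section_to_dict(xml_string):
    """Turn one <sec> element into {title: whitespace-normalized paragraph text}."""
    sec = ET.fromstring(xml_string)
    title = sec.findtext("title", default="Untitled")
    paragraphs = [" ".join(visible_text(p).split()) for p in sec.findall("p")]
    return {title: " ".join(paragraphs)}


flattened = section_to_dict(SAMPLE)
print(flattened)
```

The key detail is the `tail` handling: the words following a skipped `<xref>` belong to that element's `tail`, so they must be kept even when the element itself is dropped.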


## Setup and Imports

In [1]:
# Import required libraries
import json
import pandas as pd
from pathlib import Path
from collections import Counter

# PMCGrab imports
try:
    from pmcgrab.application.processing import process_single_pmc
except ImportError:
    # Show the install hint only when the import actually fails
    print("PMCGrab is not installed. Run 'uv add pmcgrab' or 'pip install pmcgrab'.")
    raise

print("All imports successful!")
print("Ready to process some scientific literature!")

All imports successful!
Ready to process some scientific literature!


## Step 1: Process Your First Paper

Let's start with a real paper from PMC. We'll use **PMC7114487**, a freely available virology paper on Junín virus replication.


In [2]:
# Process a single paper
pmcid = "7114487"
print(f"Fetching PMC{pmcid} from PubMed Central...")
print("This might take 5-15 seconds...")

# The magic happens here!
paper_data = process_single_pmc(pmcid)

if paper_data:
    print("\nSuccess! Paper processed successfully!")
    print(f"Title: {paper_data['title']}")
    print(f"Authors: {len(paper_data['authors'])}")
    print(f"Sections: {len(paper_data['body'])}")
else:
    print("Failed to process the paper")


Fetching PMC7114487 from PubMed Central...
This might take 5-15 seconds...

Success! Paper processed successfully!
Title: Participation of the phosphatidylinositol 3-kinase/Akt pathway in Junín virus replication in vitro
Authors: 2
Sections: 11
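
Since the cell above already checks for an empty result, it can help to read the remaining fields defensively too. The following is a minimal sketch, assuming the result is a plain dict with the `title`, `authors`, and `body` keys used above; `summarize_paper` and the sample dict are hypothetical and not part of PMCGrab.

```python
def summarize_paper(paper):
    """Return headline numbers for a paper dict, tolerating missing keys."""
    if not paper:
        return None
    body = paper.get("body") or {}
    return {
        "title": paper.get("title", "(no title)"),
        "n_authors": len(paper.get("authors") or []),
        "n_sections": len(body),
        "total_words": sum(len(text.split()) for text in body.values()),
    }


# Hypothetical stand-in for a processed paper, not real PMC output
sample_paper = {
    "title": "An example paper",
    "authors": [{"First_Name": "Ada", "Last_Name": "Lovelace"}],
    "body": {"Introduction": "Short intro text.", "Methods": "We did two things."},
}

summary = summarize_paper(sample_paper)
print(summary)
```

In the notebook you would call `summarize_paper(paper_data)` instead of indexing the dict directly.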


## Step 2: Explore the Paper Structure

Let's dive deep into what PMCGrab extracted for us:


In [None]:
# Explore the paper sections (the gold for RAG!)
print("PAPER SECTIONS:")
print("=" * 50)

for section_name, content in paper_data['body'].items():
    word_count = len(content.split())
    char_count = len(content)

    print(f"\n{section_name}:")
    print(f"   {word_count:,} words | {char_count:,} characters")

    # Show preview
    preview = content[:200].replace('\n', ' ').strip()
    print(f"   Preview: {preview}...")


PAPER SECTIONS:
==================================================

Section 1:
   193 words | 1,314 characters
   Preview: Junín virus (JUNV), agent of the Argentine haemorrhagic fever, belongs to the Arenaviridae family, a group of enveloped viruses with genome composed of two negative-sense single stranded RNA segments...

Section 2:
   221 words | 1,358 characters
   Preview: In this study we examined the role of PI3K/Akt signalling pathway during JUNV in vitro infection. In first place, we tested whether JUNV infection would lead to activation of PI3K/Akt pathway by exami...

Section 3:
   294 words | 1,788 characters
   Preview: Taken into consideration that Akt activation was an early event during infection, we speculated this activation might be due to the initial virus–host cell interactions. To test this hypothesis we ino...

Section 4:
   49 words | 282 characters
   Preview: The fact that the level of phosporylation of Akt in JUNV infected cells analyzed at later times p.i. (6, 12 and 18 h p.i.) did not differ from non-infec
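
Beyond previews, a quick frequency count can hint at what each section is about. Here is a small sketch using `collections.Counter` from the standard library; it assumes `body` maps section names to plain strings, as in the loop above. `section_stats` and the sample text are illustrative only.

```python
import re
from collections import Counter


def section_stats(body, top_n=3, min_length=4):
    """Return the word count and most frequent longer terms for each section."""
    stats = {}
    for name, text in body.items():
        words = re.findall(r"\w+", text.lower())
        # Ignore short words so that "is", "the", etc. do not dominate the ranking
        frequent = Counter(w for w in words if len(w) >= min_length)
        stats[name] = {
            "word_count": len(words),
            "top_terms": frequent.most_common(top_n),
        }
    return stats


# Illustrative text, not taken from the paper
sample_body = {
    "Section 1": "Akt activation follows virus entry. Akt activation is transient.",
}
stats = section_stats(sample_body, top_n=1)
print(stats)
```

With real data, `section_stats(paper_data["body"])` gives one entry per section. A proper stop-word list would work better than the simple length filter used here.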

## Step 3: Batch Processing - Build a Dataset

Now let's process multiple papers to build a small dataset. We'll use papers from different domains:


In [None]:
# Map each PMCID to a short label used only in the progress messages
paper_collection = {
    "7114487": "Junín virus replication and the PI3K/Akt pathway",
    "3084273": "Functional interaction of Nuclear Domain 10",
    "7181753": "Single-cell transcriptomes of human skin"
}

print(f"Processing {len(paper_collection)} papers for our dataset...")
print("This will take 1-3 minutes...\n")

# Storage for our dataset
dataset = {}
processing_stats = {"successful": 0, "failed": 0}

for pmcid, description in paper_collection.items():
    print(f"Processing PMC{pmcid}: {description}")

    try:
        data = process_single_pmc(pmcid)
        if data:
            dataset[pmcid] = data
            processing_stats["successful"] += 1
            print(f"   Success! '{data['title'][:50]}...'")
        else:
            processing_stats["failed"] += 1
            print(f"   Failed to process PMC{pmcid}")
    except Exception as e:
        # Keep going so that one bad paper does not abort the whole batch
        processing_stats["failed"] += 1
        print(f"   Error: {e}")

    print()  # Empty line for readability

print("Processing complete!")
print(f"Successful: {processing_stats['successful']}")
print(f"Failed: {processing_stats['failed']}")
print(f"Dataset size: {len(dataset)} papers")


Processing 3 papers for our dataset...
This will take 1-3 minutes...

Processing PMC7114487: Junín virus replication and the PI3K/Akt pathway
   Success! 'Participation of the phosphatidylinositol 3-kinase...'

Processing PMC3084273: Functional interaction of Nuclear Domain 10
   Success! 'Functional Interaction of Nuclear Domain 10 and It...'

Processing PMC7181753: Single-cell transcriptomes of human skin
   Success! 'Single-cell transcriptomes of the human skin revea...'

Processing complete!
Successful: 3
Failed: 0
Dataset size: 3 papers
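
Network hiccups are common in larger batches, so you may want to retry a failed download before counting it as failed. Below is a minimal sketch of a generic retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, not a PMCGrab function, and the flaky fetcher only simulates failures so the example runs offline.

```python
import time


def with_retries(fetch, pmcid, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(pmcid) up to `attempts` times, waiting base_delay * 2**n between tries.

    Returns the first non-empty result. If every attempt fails, re-raises the most
    recent exception (if any attempt raised) or returns None.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            result = fetch(pmcid)
            if result:
                return result
        except Exception as exc:  # broad on purpose: failure modes are not known here
            last_error = exc
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    if last_error is not None:
        raise last_error
    return None


# Simulated fetcher that fails twice and then succeeds
calls = {"n": 0}


def flaky_fetch(pmcid):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"title": f"Paper {pmcid}"}


delays = []  # record the waits instead of actually sleeping
paper = with_retries(flaky_fetch, "7114487", sleep=delays.append)
print(paper["title"], "after", calls["n"], "calls; waits:", delays)
```

In the batch loop above, this would replace the direct call with `with_retries(process_single_pmc, pmcid)`.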


## Step 4: Prepare Data for AI/ML Workflows

### RAG (Retrieval-Augmented Generation) Preparation

Section-level chunks like the ones built below can be loaded into vector databases such as Pinecone, Weaviate, or ChromaDB:


In [None]:
# Prepare chunks for RAG pipeline
rag_chunks = []

for pmcid, paper in dataset.items():
    # Add abstract as a chunk
    rag_chunks.append({
        "id": f"PMC{pmcid}_abstract",
        "source": f"PMC{pmcid}",
        "section": "Abstract",
        "content": paper["abstract"],
        "metadata": {
            "title": paper["title"],
            "journal": paper.get("journal_title", ""),
            "authors": [f"{a.get('First_Name', '')} {a.get('Last_Name', '')}".strip()
                      for a in paper["authors"][:3]],  # First 3 authors
            "content_type": "abstract",
            "word_count": len(paper["abstract"].split())
        }
    })

    # Add each section as a chunk
    for section, content in paper["body"].items():
        rag_chunks.append({
            "id": f"PMC{pmcid}_{section.lower().replace(' ', '_')}",
            "source": f"PMC{pmcid}",
            "section": section,
            "content": content,
            "metadata": {
                "title": paper["title"],
                "journal": paper.get("journal_title", ""),
                "content_type": "section",
                "word_count": len(content.split())
            }
        })

print(f"Created {len(rag_chunks)} RAG chunks from {len(dataset)} papers")
avg_word_count = sum(chunk["metadata"]["word_count"] for chunk in rag_chunks) / len(rag_chunks)
print(f"Average chunk size: {avg_word_count:.0f} words")

# Show a sample chunk
print("\nSAMPLE RAG CHUNK:")
print("=" * 50)
sample_chunk = rag_chunks[0]
print(f"ID: {sample_chunk['id']}")
print(f"Section: {sample_chunk['section']}")
print(f"Word Count: {sample_chunk['metadata']['word_count']}")
print(f"Content Preview: {sample_chunk['content'][:200]}...")

Created 23 RAG chunks from 3 papers
Average chunk size: 595 words

SAMPLE RAG CHUNK:
==================================================
ID: PMC7114487_abstract
Section: Abstract
Word Count: 183
Content Preview: In this paper we demonstrate that infection of cell cultures with the arenavirus Junín (JUNV), agent of the argentine haemorrhagic fever, leads to the activation of PI3K/Akt signalling pathway. Phosph...
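
An average of roughly 600 words per chunk may be more than some embedding models handle well, so long sections are often split further. The sketch below shows one common approach, overlapping word windows. The window sizes are arbitrary, and `split_into_windows` and `expand_chunk` are hypothetical helpers rather than PMCGrab features.

```python
def split_into_windows(text, max_words=200, overlap=40):
    """Split text into windows of at most max_words words, each sharing
    `overlap` words with the previous window."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)] if words else []
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def expand_chunk(chunk, **window_options):
    """Turn one chunk dict into several smaller ones with suffixed IDs."""
    pieces = split_into_windows(chunk["content"], **window_options)
    return [
        {
            **chunk,
            "id": f"{chunk['id']}_part{i}",
            "content": piece,
            "metadata": {**chunk["metadata"], "word_count": len(piece.split())},
        }
        for i, piece in enumerate(pieces, start=1)
    ]


# Illustrative chunk with placeholder words instead of real paper text
demo_chunk = {
    "id": "PMC0000000_section_1",
    "content": " ".join(f"w{i}" for i in range(10)),
    "metadata": {"content_type": "section", "word_count": 10},
}
parts = expand_chunk(demo_chunk, max_words=4, overlap=1)
for part in parts:
    print(part["id"], "->", part["content"])
```

Applied to the tutorial data, `[p for c in rag_chunks for p in expand_chunk(c)]` would yield the smaller chunks. Splitting on sentences or tokens instead of words is a reasonable alternative, depending on your embedding model.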


## Step 5: Save Your Data

Let's save all the processed data for future use:


In [None]:
# Create output directory
output_dir = Path("pmcgrab_tutorial_output")
output_dir.mkdir(exist_ok=True)

print(f"Saving all data to: {output_dir}")

# 1. Save raw paper data
papers_dir = output_dir / "papers"
papers_dir.mkdir(exist_ok=True)

for pmcid, paper in dataset.items():
    with open(papers_dir / f"PMC{pmcid}.json", "w", encoding="utf-8") as f:
        json.dump(paper, f, indent=2, ensure_ascii=False)

print(f"Saved {len(dataset)} individual paper files to {papers_dir}")

# 2. Save RAG chunks
with open(output_dir / "rag_chunks.json", "w", encoding="utf-8") as f:
    json.dump(rag_chunks, f, indent=2, ensure_ascii=False)

print(f"Saved {len(rag_chunks)} RAG chunks to rag_chunks.json")

print("\nALL DATA SAVED SUCCESSFULLY!")
print(f"Check the '{output_dir}' folder for all your files")


Saving all data to: pmcgrab_tutorial_output
Saved 3 individual paper files to pmcgrab_tutorial_output/papers
Saved 23 RAG chunks to rag_chunks.json

ALL DATA SAVED SUCCESSFULLY!
Check the 'pmcgrab_tutorial_output' folder for all your files
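
To pick up where you left off in a later session, the saved files can be read back into the same dict shape. This is a minimal sketch, assuming the `PMC<id>.json` naming used above. `load_papers` is a hypothetical helper, and the demo writes to a temporary directory so it does not touch your real output folder.

```python
import json
import tempfile
from pathlib import Path


def load_papers(papers_dir):
    """Read every PMC*.json file in papers_dir into a dict keyed by the numeric PMCID."""
    papers = {}
    for path in sorted(Path(papers_dir).glob("PMC*.json")):
        with open(path, encoding="utf-8") as f:
            papers[path.stem[3:]] = json.load(f)  # drop the "PMC" prefix
    return papers


# Round-trip demo with an illustrative record (note the non-ASCII character)
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "papers"
    demo_dir.mkdir()
    original = {"title": "Junín virus example", "body": {"Introduction": "Text"}}
    with open(demo_dir / "PMC123.json", "w", encoding="utf-8") as f:
        json.dump(original, f, indent=2, ensure_ascii=False)
    reloaded = load_papers(demo_dir)

print(reloaded["123"] == original)
```

Opening files with `encoding="utf-8"` on both the write and the read side matters here, because `ensure_ascii=False` stores characters such as "í" as-is.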


## Next Steps & Advanced Usage

Congratulations! You've successfully:

- Processed multiple scientific papers from PMC
- Analyzed the structured data
- Prepared data for RAG workflows
- Saved everything in organized formats

### What You Can Do Next:

1. **Scale Up**: Process hundreds or thousands of papers using PMCGrab's batch processing
2. **Vector Database**: Load your RAG chunks into Pinecone, Weaviate, or ChromaDB
3. **Fine-tune Models**: Turn the extracted sections into training examples to fine-tune GPT, Llama, or other LLMs
4. **Build Knowledge Graphs**: Use Neo4j or similar to create interactive knowledge networks
5. **Research Analysis**: Apply NLP techniques to analyze trends, topics, and relationships

### Resources:

- **[PMCGrab Documentation](https://rajdeepmondaldotcom.github.io/pmcgrab/)**
- **[Complete Beginner Guide](../docs/getting-started/complete-beginner-guide.md)**
- **[Advanced Usage Examples](https://rajdeepmondaldotcom.github.io/pmcgrab/examples/advanced-usage/)**
- **[GitHub Repository](https://github.com/rajdeepmondaldotcom/pmcgrab)**

### Command Line Usage:

PMCGrab also works from the command line for large-scale processing:

```bash
# Process multiple papers with uv (recommended)
uv run python -m pmcgrab --pmcids 7114487 3084273 7181753 --workers 4

# Or with regular Python
python -m pmcgrab --pmcids 7114487 3084273 --output-dir ./results
```
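
If your ID list comes from a spreadsheet or a search export, it may mix formats such as `PMC7114487` and `7114487`. The sketch below normalizes such a list and assembles the command shown above as an argument list. It assumes the CLI expects bare numeric IDs, as in the examples here. `normalize_pmcid` and `build_command` are hypothetical helpers, and the command is only built, not executed.

```python
import sys


def normalize_pmcid(raw):
    """Strip whitespace and an optional 'PMC' prefix; return the numeric ID as a string."""
    cleaned = str(raw).strip().upper()
    if cleaned.startswith("PMC"):
        cleaned = cleaned[3:]
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError(f"Not a PMC ID: {raw!r}")
    return cleaned


def build_command(raw_ids, workers=None):
    """Build the `python -m pmcgrab` argument list using the flags shown above."""
    # dict.fromkeys removes duplicates while keeping the original order
    ids = list(dict.fromkeys(normalize_pmcid(r) for r in raw_ids))
    cmd = [sys.executable, "-m", "pmcgrab", "--pmcids", *ids]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    return cmd


cmd = build_command(["PMC7114487", " 3084273", "pmc7181753", "7114487"], workers=4)
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once PMCGrab is installed.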

### Pro Tips:

- Use `uv run` to execute commands inside the project's managed environment without activating it manually
- Use `--workers N` flag for parallel processing when running from command line
- PMCGrab handles rate limiting automatically, so you don't need to worry about NCBI limits
- The JSON structure is designed specifically for AI workflows - each section is cleanly separated
- For large datasets, consider streaming processing to avoid memory issues
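
As an illustration of the streaming tip, one option is to write each paper to a JSON Lines file as soon as it is processed, instead of collecting everything in a dict first. This is a sketch under the assumption that each result is a JSON-serializable dict. `stream_to_jsonl` and `iter_jsonl` are hypothetical helpers, and `fake_fetch` stands in for `process_single_pmc` so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def stream_to_jsonl(pmcids, fetch, out_path):
    """Fetch papers one at a time and write each as one JSON line; return the count written."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for pmcid in pmcids:
            paper = fetch(pmcid)
            if not paper:
                continue  # skip papers that could not be processed
            f.write(json.dumps({"pmcid": pmcid, **paper}, ensure_ascii=False) + "\n")
            written += 1
    return written


def iter_jsonl(path):
    """Yield one record at a time so the whole file never has to fit in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Stand-in fetcher: pretends that one paper is unavailable
def fake_fetch(pmcid):
    return None if pmcid == "2" else {"title": f"Paper {pmcid}"}


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "papers.jsonl"
    n_written = stream_to_jsonl(["1", "2", "3"], fake_fetch, out_file)
    titles = [record["title"] for record in iter_jsonl(out_file)]

print(n_written, titles)
```

Only one paper is held in memory at a time, and a crash midway leaves all previously written lines intact.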

**PMCGrab transforms the most time-consuming part of biomedical AI workflows into a simple API call.**

*Happy processing!*
