# PMC Multi-Agent Search & Summarization — notebook.ipynb

This document contains a runnable notebook-style guide (sections + code snippets) and a README for the Sanofi R&D case study. It's intended to be copy-pasted into a Jupyter notebook (or run cell-by-cell).

## Table of contents

- Setup & data access
- Retriever Agent
- Summarizer Agent
- (Optional) Verifier Agent
- Report generation
- README (run instructions, design choices)

## 1. Setup & data access

Goal: work with a small subset (<100) of PMC OA oa_comm full-text .txt files located in the public S3 bucket pmc-oa-opendata in us-east-1.

Requirements:

In [None]:
python -m venv pmc_agent
source pmc_agent/bin/activate
pip install --upgrade pip
pip install pandas numpy boto3 awscli sentence-transformers transformers torch scikit-learn faiss-cpu jupyterlab

## Access S3 (no AWS credentials required)

The bucket is public, so you can list and copy files without credentials: pass --no-sign-request to the AWS CLI, or create the boto3 client with an unsigned config (as in the Retriever Agent code).

In [None]:
# list top-level
aws s3 ls --no-sign-request s3://pmc-oa-opendata/oa_comm/ | head

# fetch the CSV filelist (metadata)
aws s3 cp --no-sign-request s3://pmc-oa-opendata/oa_comm/txt/metadata/csv/oa_comm.filelist.csv ./
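
Where the AWS CLI is not available, the same public objects can probably be fetched over plain HTTPS. The sketch below assumes the bucket answers anonymous GET requests at the standard virtual-hosted S3 endpoint. `public_url` and `fetch_txt` are illustrative names, not part of the original notebook.

```python
from urllib.parse import quote
from urllib.request import urlopen

BUCKET = "pmc-oa-opendata"  # bucket name used throughout this guide


def public_url(key: str) -> str:
    """Build the virtual-hosted-style HTTPS URL for an object key (assumed endpoint)."""
    return f"https://{BUCKET}.s3.amazonaws.com/{quote(key)}"


def fetch_txt(key: str, timeout: float = 30.0) -> str:
    """Download one object anonymously and decode it as UTF-8 (requires network access)."""
    with urlopen(public_url(key), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


print(public_url("oa_comm/txt/metadata/csv/oa_comm.filelist.csv"))
```

Only `public_url` runs here; calling `fetch_txt` with the same key would perform the actual download.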

## 2. Retriever Agent

Design: build sentence embeddings for the article abstracts (or short sections of the full text), index them, and rank articles by semantic similarity to a user query.

Model (CPU-friendly): sentence-transformers/all-MiniLM-L6-v2 (fast, 384-d embeddings).

Steps & code

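1. Choose a small set of articles (random or filtered by date/keyword). We'll pick <=100 accession IDs from the filelist CSV. These are PMC accession IDs (e.g. PMC10000043), not PubMed IDs, although the code keeps the variable name pmid.
2. For each accession ID, download the corresponding .txt file from S3 (path: oa_comm/txt/all/{AccessionID}.txt).
3. Extract the abstract (or the first N characters) and store a small document metadata table.
4. Encode the abstracts with SentenceTransformer and store the embeddings (numpy array).
5. Query: encode the query and rank documents by cosine similarity (dot product divided by the product of the norms).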
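
To make the ranking step concrete, here is a minimal, dependency-free sketch of cosine similarity and top-k ranking. The toy 3-d vectors stand in for the 384-d embeddings, and the helper names `cosine` and `rank` are illustrative only; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs, top_k=5):
    """Return (index, score) pairs for the top_k most similar document vectors."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# toy vectors: doc 0 matches the query exactly, doc 2 is orthogonal to it
docs_toy = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
print(rank([1.0, 0.0, 0.0], docs_toy, top_k=2))
```

Because cosine similarity divides out vector length, a long abstract is not favored over a short one merely for producing a larger embedding norm.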

In [1]:
import boto3
import numpy as np
import pandas as pd
from botocore import UNSIGNED
from botocore.client import Config
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the CSV file list
filelist = pd.read_csv('oa_comm.filelist.csv')

# Pick the first 50 accession IDs (PMC IDs such as PMC10000043)
sample_pmids = filelist['AccessionID'].head(50).tolist()
# or random sample: filelist['AccessionID'].sample(50, random_state=42).tolist()

# Anonymous (unsigned) S3 client: the bucket is public, so no credentials are needed
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED), region_name='us-east-1')

def download_txt(pmid):
    key = f'oa_comm/txt/all/{pmid}.txt'
    obj = s3.get_object(Bucket='pmc-oa-opendata', Key=key)
    return obj['Body'].read().decode('utf-8')

def extract_metadata(text):
    """Return (title, abstract) parsed from lines prefixed with 'title:' / 'abstract:'."""
    lines = text.splitlines()
    title, abstract = '', ''
    for line in lines:
        lowered = line.lower()
        if lowered.startswith('title:'):
            title = line[len('title:'):].strip()
        elif lowered.startswith('abstract:'):
            abstract = line[len('abstract:'):].strip()
    # fallback: if no abstract line was found, use the first 50 lines of the file
    if not abstract:
        abstract = ' '.join(lines[:50])
    return title, abstract

docs = []

for pmid in sample_pmids:
    text = download_txt(pmid)
    title, abstract = extract_metadata(text)
    docs.append({
        'pmid': pmid,
        'title': title,
        'abstract': abstract,
        'text': text
    })

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# `docs` is a list of dicts: {'pmid':..., 'title':..., 'abstract':..., 'text':...};
# fall back to the first 1000 characters of the full text if the abstract is empty
abstracts = [d['abstract'] or d['text'][:1000] for d in docs]
embs = model.encode(abstracts, show_progress_bar=True)

# build helper
def retrieve(query, top_k=5):
    q_emb = model.encode([query])
    sims = cosine_similarity(q_emb, embs)[0]
    idx = np.argsort(sims)[::-1][:top_k]
    return [(docs[i], float(sims[i])) for i in idx]

# example
results = retrieve('Adverse events with mRNA vaccines in pediatrics', top_k=5)
for r,score in results:
    print(r['pmid'], r['title'], score)

Batches: 100%|██████████| 2/2 [00:00<00:00,  8.57it/s]

PMC10000043  0.25557103753089905
PMC10000007  0.2406770884990692
PMC10000048  0.2237662523984909
PMC10000010  0.2000018060207367
PMC10000017  0.16839763522148132
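
The blank titles in the output above indicate that no line in these files starts with `title:`, so `extract_metadata` falls back to the first 50 lines. One possible alternative, assuming the PMC plain-text files delimit sections with marker lines such as `==== Front`, `==== Body`, and `==== Refs` (an assumption to check against a downloaded file), is to split on those markers. `split_sections` is a hypothetical helper, not part of the original notebook.

```python
import re

MARKER = re.compile(r"^==== (\w+)\s*$")  # assumed marker format, e.g. "==== Body"


def split_sections(text):
    """Map each section name (lower-cased) to the text that follows its marker line."""
    sections, current = {}, None
    for line in text.splitlines():
        match = MARKER.match(line)
        if match:
            current = match.group(1).lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


# synthetic sample text, not real PMC content
sample = "==== Front\nJournal X\nA Study Title\n==== Body\nIntroduction text.\n==== Refs\nRef 1"
parts = split_sections(sample)
print(parts["body"])
```

With such a split, `parts.get('body', text)[:1000]` could replace the first-50-lines fallback as the text to embed, which keeps journal front matter out of the embeddings.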





## 3. Summarizer Agent

Design: take abstracts (or the retrieved snippet) and produce concise 2-3 sentence summaries + keywords.

Model choices (CPU-friendly): google/flan-t5-small, t5-small, or facebook/bart-base.

Implementation approach: use transformers pipeline for summarization and RAKE or a simple TF-IDF to extract keywords.
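
To show what the TF-IDF keyword step computes, here is a small dependency-free sketch. It uses the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, which is the default in scikit-learn's `TfidfVectorizer`, but it skips L2 normalization and stop-word removal. The tokenizer and the function names are simplifications introduced for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens of 2+ letters (a simplification)."""
    return re.findall(r"[a-z]{2,}", text.lower())


def tfidf_keywords(corpus, doc_index, k=3):
    """Return the k highest-scoring terms of corpus[doc_index] under smoothed TF-IDF."""
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    scores = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in tf.items()
    }
    # sort by score (descending), then alphabetically so ties have a stable order
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:k]]


# synthetic toy corpus
corpus_toy = [
    "mastitis in bovine serum mastitis biomarkers",
    "typhoid fever epidemics in the county",
    "vaccine trial in children",
]
print(tfidf_keywords(corpus_toy, 0, k=2))
```

A term that appears in every document (here "in") gets the minimum IDF of 1, so it ranks below terms that are specific to one document even without a stop-word list.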

In [2]:

from transformers import pipeline

summarizer = pipeline('summarization', model='google/flan-t5-small', device=-1)

query = "Adverse events with mRNA vaccines in pediatrics"
results = retrieve(query, top_k=5)  # returns list of (doc, score)

def make_summary(text, max_new_tokens=120):
    # guard: short texts
    if len(text.split()) < 30:
        return text
    out = summarizer(text, max_new_tokens=max_new_tokens, truncation=True)
    return out[0]['summary_text']


# keywords (quick): top-k by TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer


vectorizer = TfidfVectorizer(stop_words='english', max_features=2000)
X = vectorizer.fit_transform(abstracts)
terms = vectorizer.get_feature_names_out()


def top_keywords(idx, k=6):
    row = X[idx].toarray()[0]
    top_idx = np.argsort(row)[::-1][:k]
    return [terms[i] for i in top_idx if row[i] > 0]


# produce summaries for retrieved docs
summaries = []
for doc,score in results:
    text = doc['abstract'] or doc['text'][:2000]
    s = make_summary(text)
    kw = top_keywords(docs.index(doc), k=6)
    summaries.append({'pmid':doc['pmid'], 'title':doc['title'], 'summary':s, 'keywords':kw, 'score':score})

for s in summaries:
    print(f"PMID: {s['pmid']}, Title: {s['title']}, Score: {s['score']:.4f}")
    print(f"Keywords: {', '.join(s['keywords'])}")
    print(f"Summary: {s['summary']}")
    print("="*80)

Device set to use cpu


PMID: PMC10000043, Title: , Score: 0.2556
Keywords: mansoura, clinical, egypt, writing, majmaah, mastitis
Summary: Animals : an Open Access Journal from MDPI 2076-2615 MDPI 10.3390/ani13050892 animals-13-00892 Article Immunological and Oxidative Biomarkers in Bovine Serum from Healthy, Clinical, and Sub-Clinical Mastitis Caused by Escherichia coli and Staphylococcus aureus Infection
PMID: PMC10000007, Title: , Score: 0.2407
Keywords: epidemics, fever, county, typhoid, community, good
Summary: ARTICLE II. A PAPER ON EPIDEMICS. By H. NANCE, M.D., Kewanee, Ill. Read to the Military Tract Medical Society. Having been selected by...
PMID: PMC10000048, Title: , Score: 0.2238
Keywords: et, ncrnas, lncrnas, al, roles, regulation
Summary: Animals: an Open Access Journal from MDPI 2076-2615 MDPI 10.3390/ani13050805 animals-13-00805 Editorial: Role of Non-Coding RNAs in Animals
PMID: PMC10000010, Title: , Score: 0.2000
Keywords: president, meeting, annual, quincy, society, dr
Summary: QUINCY MEDI

## 4. (Optional) Verifier Agent

Purpose: validate/refine summaries and map them to high-level themes: Deep Learning, Clinical Trial, Traditional Methods, etc.

Two quick CPU-friendly strategies:

A. Embedding + small classifier: compute embeddings for the summaries (with the same sentence-transformer) and train a small logistic regression on a tiny labeled set, or skip training and map keywords to themes with simple rules. Both options are fast and explainable.

B. Clustering / Theme detection: cluster the summary embeddings (KMeans with k=3..6) and manually label clusters.

In [3]:

# A: simple rule-based mapper
THEME_KEYWORDS = {
    'Deep Learning': ['transformer', 'deep learning', 'neural network', 'machine learning', 'cnn', 'rnn'],
    'Clinical Trial': ['phase', 'randomized', 'double-blind', 'trial', 'placebo', 'cohort'],
    'Traditional Methods': ['western blot', 'immunohistochemistry', 'enzyme-linked', 'assay', 'flow cytometry'],
}


def infer_theme(summary):
    s = summary.lower()
    scores = {t: sum(s.count(w) for w in kws) for t,kws in THEME_KEYWORDS.items()}
    best = max(scores, key=lambda k: scores[k])
    if scores[best] == 0:
        return 'Other'
    return best


for s in summaries:
    s['theme_rule'] = infer_theme(s['summary'])


# B: clustering
from sklearn.cluster import KMeans
sum_embs = model.encode([s['summary'] for s in summaries])
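# fixed seed and explicit n_init so cluster IDs are reproducible across runs
km = KMeans(n_clusters=min(4, len(summaries)), n_init=10, random_state=42)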
labels = km.fit_predict(sum_embs)
for s,l in zip(summaries, labels):
    s['cluster'] = int(l)

for s in summaries:
    print(f"PMID: {s['pmid']}")
    print(f"Title: {s['title']}")
    print(f"Summary: {s['summary']}")
    print(f"Keywords: {', '.join(s['keywords'])}")
    print(f"Retrieved score: {s['score']:.2f}")
    print(f"Inferred theme (rule-based): {s['theme_rule']}")
    print(f"Cluster ID: {s['cluster']}")
    print("-"*80)


PMID: PMC10000043
Title: 
Summary: Animals : an Open Access Journal from MDPI 2076-2615 MDPI 10.3390/ani13050892 animals-13-00892 Article Immunological and Oxidative Biomarkers in Bovine Serum from Healthy, Clinical, and Sub-Clinical Mastitis Caused by Escherichia coli and Staphylococcus aureus Infection
Keywords: mansoura, clinical, egypt, writing, majmaah, mastitis
Retrieved score: 0.26
Inferred theme (rule-based): Other
Cluster ID: 1
--------------------------------------------------------------------------------
PMID: PMC10000007
Title: 
Summary: ARTICLE II. A PAPER ON EPIDEMICS. By H. NANCE, M.D., Kewanee, Ill. Read to the Military Tract Medical Society. Having been selected by...
Keywords: epidemics, fever, county, typhoid, community, good
Retrieved score: 0.24
Inferred theme (rule-based): Other
Cluster ID: 3
--------------------------------------------------------------------------------
PMID: PMC10000048
Title: 
Summary: Animals: an Open Access Journal from MDPI 2076-2615 MDPI 
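
The rule-based mapper above counts raw substrings, so a keyword such as "trial" also matches inside "industrial". A possible refinement, sketched here with an abbreviated theme table and hypothetical helper names, is to count whole-word matches only.

```python
import re

# Abbreviated copy of the theme table, kept short for the example.
THEME_TERMS = {
    "Deep Learning": ["deep learning", "neural network", "cnn", "rnn"],
    "Clinical Trial": ["randomized", "trial", "placebo", "cohort"],
}


def count_whole_words(text, term):
    """Count occurrences of term that are not embedded in a longer word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text.lower()))


def infer_theme_strict(summary):
    """Pick the theme with the most whole-word hits, or 'Other' if there are none."""
    scores = {
        theme: sum(count_whole_words(summary, term) for term in terms)
        for theme, terms in THEME_TERMS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Other"


print(infer_theme_strict("An industrial process for enzyme production"))        # Other
print(infer_theme_strict("A randomized placebo-controlled trial in children"))  # Clinical Trial
```

The first example would be labeled Clinical Trial by plain substring counting. The word-boundary pattern avoids that at the cost of missing inflected forms such as "trials", so the keyword lists may need plural variants.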

## 5. Report generation

Export the results to CSV, JSON, and generate a small Markdown report that aggregates themes.

In [4]:
import pandas as pd
from collections import Counter

# Convert summaries list to DataFrame
df = pd.DataFrame(summaries)

# Export CSV
df.to_csv('pmc_summaries.csv', index=False)

# Export JSON
df.to_json('pmc_summaries.json', orient='records', indent=2)

# Generate a Markdown report
report_md = "# PMC Multi-Agent Summary Report\n\n"
report_md += f"## Query: {query}\n\n"

# Theme counts
report_md += "### Theme counts (rule-based)\n\n"
theme_counts = Counter(df['theme_rule'])
for theme, count in theme_counts.items():
    report_md += f"- {theme}: {count}\n"

report_md += "\n### Top summaries\n\n"
for _, row in df.iterrows():
    report_md += f"- **{row['title']}** (PMID {row['pmid']})\n"
    report_md += f"  - Summary: {row['summary']}\n"
    report_md += f"  - Keywords: {', '.join(row['keywords'])}\n"
    report_md += f"  - Retrieved score: {row['score']:.2f}\n"
    report_md += f"  - Theme: {row['theme_rule']}\n"
    report_md += f"  - Cluster ID: {row['cluster']}\n\n"

# Save report
with open('pmc_report.md', 'w', encoding='utf-8') as f:
    f.write(report_md)

print("CSV, JSON, and Markdown report generated successfully!")


CSV, JSON, and Markdown report generated successfully!
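
Since the extracted titles are empty in the recorded runs, the report entries render as `- **** (PMID ...)`. The sketch below shows one way to build the same Markdown from plain dicts with a placeholder for missing titles. `render_report` and the "(untitled)" placeholder are assumptions for illustration, and only a subset of the fields is rendered.

```python
from collections import Counter


def render_report(query, rows):
    """Build the Markdown report text from a list of summary dicts."""
    lines = [
        "# PMC Multi-Agent Summary Report", "",
        f"## Query: {query}", "",
        "### Theme counts (rule-based)", "",
    ]
    # most_common() lists the most frequent theme first
    for theme, count in Counter(r["theme_rule"] for r in rows).most_common():
        lines.append(f"- {theme}: {count}")
    lines += ["", "### Top summaries", ""]
    for r in rows:
        title = r["title"].strip() or "(untitled)"  # placeholder for empty titles
        lines.append(f"- **{title}** (PMID {r['pmid']})")
        lines.append(f"  - Summary: {r['summary']}")
        lines.append(f"  - Theme: {r['theme_rule']}")
    return "\n".join(lines) + "\n"


# synthetic row, not taken from the notebook output
rows_toy = [
    {"pmid": "PMC0000001", "title": "", "summary": "Toy summary.", "theme_rule": "Other"},
]
print(render_report("toy query", rows_toy))
```

Keeping the rendering in a pure function makes it easy to check the report text without writing any files.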
