# ArxivLoader

[arXiv](https://arxiv.org/) is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

## Setup

To access Arxiv document loader you'll need to install the `arxiv`, `PyMuPDF` and `langchain-community` integration packages. PyMuPDF transforms PDF files downloaded from the arxiv.org site into the text format.

In [6]:
%pip install -qU langchain-community arxiv pymupdf


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Instantiation

Now we can instantiate our model object and load documents:

In [1]:
from langchain_community.document_loaders import ArxivLoader

# Supports all arguments of `ArxivAPIWrapper`
arxiv_loader = ArxivLoader(
    query="Diabetes",
    top_k_results=300,
    load_max_docs=300,
    # doc_content_chars_max=1000,
    load_all_available_meta=True,
    # ...
)

## get summaries as docs

Directly access the arXiv PDF links via the metadata from get_summaries_as_docs().

Let's run through a basic example of how to use the `ArxivLoader` searching for papers of reasoning:

In [5]:
arxiv_docs = arxiv_loader.get_summaries_as_docs()

In [6]:
arxiv_docs[0]

Document(metadata={'Entry ID': 'http://arxiv.org/abs/2202.11216v1', 'Published': datetime.date(2022, 2, 22), 'Title': 'Early Stage Diabetes Prediction via Extreme Learning Machine', 'Authors': 'Nelly Elsayed, Zag ElSayed, Murat Ozer'}, page_content="Diabetes is one of the chronic diseases that has been discovered for decades.\nHowever, several cases are diagnosed in their late stages. Every one in eleven\nof the world's adult population has diabetes. Forty-six percent of people with\ndiabetes have not been diagnosed. Diabetes can develop several other severe\ndiseases that can lead to patient death. Developing and rural areas suffer the\nmost due to the limited medical providers and financial situations. This paper\nproposed a novel approach based on an extreme learning machine for diabetes\nprediction based on a data questionnaire that can early alert the users to seek\nmedical assistance and prevent late diagnoses and severe illness development.")

In [7]:
len(arxiv_docs)

300

In [8]:
# Extract PDF URLs from metadata
arxiv_pdf_urls = []
for doc in arxiv_docs:
    entry_id = doc.metadata.get("Entry ID")  # e.g., http://arxiv.org/abs/2402.03268v2
    Title  = doc.metadata.get("Title")
    if entry_id:
        arxiv_id = entry_id.split("/")[-1]
        pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
        arxiv_pdf_urls.append(pdf_url)

# Example usage
for url in arxiv_pdf_urls:
    print(url)

https://arxiv.org/pdf/2202.11216v1.pdf
https://arxiv.org/pdf/1005.5221v1.pdf
https://arxiv.org/pdf/1510.02196v1.pdf
https://arxiv.org/pdf/2012.15025v1.pdf
https://arxiv.org/pdf/1909.09853v1.pdf
https://arxiv.org/pdf/2105.09379v1.pdf
https://arxiv.org/pdf/2307.16417v1.pdf
https://arxiv.org/pdf/2307.16622v1.pdf
https://arxiv.org/pdf/2406.17090v1.pdf
https://arxiv.org/pdf/2406.00297v1.pdf
https://arxiv.org/pdf/2412.14736v1.pdf
https://arxiv.org/pdf/2409.13191v2.pdf
https://arxiv.org/pdf/1904.08764v1.pdf
https://arxiv.org/pdf/2008.11153v1.pdf
https://arxiv.org/pdf/2003.02261v1.pdf
https://arxiv.org/pdf/2011.02286v1.pdf
https://arxiv.org/pdf/1809.05814v1.pdf
https://arxiv.org/pdf/2101.03203v1.pdf
https://arxiv.org/pdf/2102.12984v1.pdf
https://arxiv.org/pdf/2301.10450v1.pdf
https://arxiv.org/pdf/2402.10153v2.pdf
https://arxiv.org/pdf/2410.03188v1.pdf
https://arxiv.org/pdf/2004.03408v1.pdf
https://arxiv.org/pdf/2011.08068v1.pdf
https://arxiv.org/pdf/2409.07315v1.pdf
https://arxiv.org/pdf/2007

## Read PDF from URL

In [9]:
from io import BytesIO
import requests
import fitz  # pymupdf

pdf_url = arxiv_pdf_urls[0]
response = requests.get(pdf_url)
if response.status_code != 200:
    raise ValueError(f"Failed to download PDF. Status code: {response.status_code}")

pdf_stream = BytesIO(response.content)
pdf = fitz.open(stream=pdf_stream, filetype="pdf")


# PubMed Loader

https://python.langchain.com/docs/integrations/document_loaders/pubmed/

https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/document_loaders/pubmed.py



>[PubMed®](https://pubmed.ncbi.nlm.nih.gov/) by `The National Center for Biotechnology Information, National Library of Medicine` comprises more than 35 million citations for biomedical literature from `MEDLINE`, life science journals, and online books. Citations may include links to full text content from `PubMed Central` and publisher web sites.

In [10]:
from langchain_community.document_loaders import PubMedLoader

In [13]:

pubmed_loader = PubMedLoader(
    query="Diabetes",
    #top_k_results=300,
    load_max_docs=300,
    # doc_content_chars_max=1000,
    #load_all_available_meta=True,
    # ...
)

In [15]:
pubmed_docs = pubmed_loader.load()

In [16]:
pubmed_docs[0]

Document(metadata={'uid': '40498532', 'Title': 'Caring for the Caregiver: Investigating the Relationship Between Caregiving, Gender, and Diabetes Risk in MIDUS II.', 'Published': '2025-06-11', 'Copyright Information': ''}, page_content='This study examines whether caregiving presents an equal risk for diabetes among gender. This study uses data from the second wave of the Midlife in the United States Survey, which included biological markers. We tested the relationship between caregiving and risk of diabetes across various models, controlling for demographics, confounders, and mechanisms that can explain the relationship. Cross-sectional analysis of the Homeostatic Model Assessment of Insulin Resistance (HOMO-IR) determined that (1)men had a higher risk of diabetes than women overall; (2)male caregivers demonstrated a lower risk of diabetes compared to non-caregiving men; (3)female caregivers exhibited a non-significant elevation in diabetes risk compared with non-caregiving females. F

In [17]:
len(pubmed_docs)

300

In [139]:
pubmed_docs[10].metadata

{'uid': '40498198',
 'Title': 'Protocol to Use Mouse Hepatocyte Cell Line AML12 to Study Hepatic Metabolism In Vitro.',
 'Published': '--',
 'Copyright Information': '© 2025. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.'}

In [172]:
!pip install pubmed2pdf


Collecting pubmed2pdf
  Downloading pubmed2pdf-0.0.7.tar.gz (7.4 kB)
  Preparing metadata (setup.py) ... [?25ldone
Collecting biopython (from pubmed2pdf)
  Downloading biopython-1.85-cp312-cp312-macosx_11_0_arm64.whl.metadata (13 kB)
Downloading biopython-1.85-cp312-cp312-macosx_11_0_arm64.whl (2.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.8/2.8 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hBuilding wheels for collected packages: pubmed2pdf
  Building wheel for pubmed2pdf (setup.py) ... [?25ldone
[?25h  Created wheel for pubmed2pdf: filename=pubmed2pdf-0.0.7-py3-none-any.whl size=7801 sha256=e4da29c2df272a1cda4bf653f6a3a387d1262b453cfbc3cdbba868018ffdea63
  Stored in directory: /Users/tubakaraca/Library/Caches/pip/wheels/6c/f8/50/159e46c278dd07df8d5aa665b2285220d0670a63d155d4542c
Successfully built pubmed2pdf
Installing collected packages: biopython, pubmed2pdf
Successfully installed biopython-1.85 pubmed2pdf-0.0.7

[1m[[0m[34;49m

In [189]:
!pip install pubmed_download

Collecting pubmed_download
  Downloading pubmed_download-0.2.1-py3-none-any.whl.metadata (976 bytes)
Collecting PyPDF2 (from pubmed_download)
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting bs4 (from pubmed_download)
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading pubmed_download-0.2.1-py3-none-any.whl (9.1 kB)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2, bs4, pubmed_download
Successfully installed PyPDF2-3.0.1 bs4-0.0.2 pubmed_download-0.2.1

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [178]:
pubmed_docs[4].metadata

{'uid': '40498412',
 'Title': 'miR-320-3p regulates apelin and TGF-β/SMAD3 signaling in hypobaric hypoxia exposed rats to induce skeletal muscle atrophy.',
 'Published': '2025-06-11',
 'Copyright Information': '© 2025. The Author(s) under exclusive licence to University of Navarra.'}

In [186]:
uids = ['40498412']

In [190]:
from pubmed_download import DownloadPdf

download_path = r"./pubmed_downloads"
DownloadPdf(download_path, uids).run()

[2025-06-13 21:11:12] [pubmed] [ INFO  ] [ 86 ] [start download.....]
[2025-06-13 21:11:12] [pubmed] [ INFO  ] [ 89 ] [start download 40498412, 0/1]
[2025-06-13 21:11:13] [pubmed] [ INFO  ] [122 ] [search by doi: 40498412, 10.1007/s13105-025-01100-y]
Traceback (most recent call last):
  File "/Users/tubakaraca/.pyenv/versions/3.12.0/lib/python3.12/site-packages/pubmed_download/pubmed_download.py", line 131, in search_by_doi
    content = ub.urlopen(temp_url).read().decode('utf-8')
              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/tubakaraca/.pyenv/versions/3.12.0/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tubakaraca/.pyenv/versions/3.12.0/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/tubakaraca/.pyenv/versions/3.12.0/lib/python3.12/urllib/request.py", line 630, in http_response
    r

[]

In [187]:
import subprocess

def download_pubmed_pdfs(uids, out_dir="./pubmed_downloads"):
    cmd = [
        "python3", "-m", "pubmed2pdf", "pdf",
        "--pmids", ",".join(uids),
        "--out", out_dir,
        "--save"
    ]
    subprocess.run(cmd, check=True)


In [188]:
download_pubmed_pdfs(uids)

Usage: python -m pubmed2pdf pdf [OPTIONS]
Try 'python -m pubmed2pdf pdf --help' for help.

Error: No such option: --outpath Did you mean --out?


CalledProcessError: Command '['python3', '-m', 'pubmed2pdf', 'pdf', '--pmids', '40498412', '--outpath', './pubmed_downloads', '--save']' returned non-zero exit status 2.

In [None]:
from pubmed2pdf import pubmed2pdf

for doc in pubmed_docs:
    uid = doc.metadata.get("uid")
    if uid:
        pdf_path = pubmed2pdf(
        uid=uid,
        out_dir="./downloads",  # directory to save the PDF
        save=True,              # save the PDF
        return_path=True        # return the path to the saved PDF
        )
        url = f"https://pubmed.ncbi.nlm.nih.gov/{uid}/"
        pubmed_pdf_urls.append(url)

In [18]:
pubmed_pdf_urls = []
for doc in pubmed_docs:
    uid = doc.metadata.get("uid")
    if uid:
        url = f"https://pubmed.ncbi.nlm.nih.gov/{uid}/"
        pubmed_pdf_urls.append(url)

for url in pubmed_pdf_urls:
    print(url)

https://pubmed.ncbi.nlm.nih.gov/40498532/
https://pubmed.ncbi.nlm.nih.gov/40498486/
https://pubmed.ncbi.nlm.nih.gov/40498480/
https://pubmed.ncbi.nlm.nih.gov/40498478/
https://pubmed.ncbi.nlm.nih.gov/40498412/
https://pubmed.ncbi.nlm.nih.gov/40498369/
https://pubmed.ncbi.nlm.nih.gov/40498355/
https://pubmed.ncbi.nlm.nih.gov/40498290/
https://pubmed.ncbi.nlm.nih.gov/40498287/
https://pubmed.ncbi.nlm.nih.gov/40498212/
https://pubmed.ncbi.nlm.nih.gov/40498198/
https://pubmed.ncbi.nlm.nih.gov/40498168/
https://pubmed.ncbi.nlm.nih.gov/40498128/
https://pubmed.ncbi.nlm.nih.gov/40498074/
https://pubmed.ncbi.nlm.nih.gov/40498052/
https://pubmed.ncbi.nlm.nih.gov/40498015/
https://pubmed.ncbi.nlm.nih.gov/40497986/
https://pubmed.ncbi.nlm.nih.gov/40497970/
https://pubmed.ncbi.nlm.nih.gov/40497936/
https://pubmed.ncbi.nlm.nih.gov/40497934/
https://pubmed.ncbi.nlm.nih.gov/40497928/
https://pubmed.ncbi.nlm.nih.gov/40497767/
https://pubmed.ncbi.nlm.nih.gov/40497659/
https://pubmed.ncbi.nlm.nih.gov/40

In [132]:
import requests
from bs4 import BeautifulSoup
import time

In [144]:
import requests
import xml.etree.ElementTree as ET

In [145]:
def fetch_article_metadata(uid):
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {
        "db": "pubmed",
        "id": uid,
        "retmode": "xml"
    }
    r = requests.get(url, params=params)
    if r.status_code != 200:
        print(f"Failed to fetch metadata for UID {uid}")
        return None

    root = ET.fromstring(r.text)
    
    # Find DOI
    doi = None
    for article_id in root.findall(".//ArticleId"):
        if article_id.attrib.get("IdType") == "doi":
            doi = article_id.text
    
    # Find PMCID
    pmcid = None
    for article_id in root.findall(".//ArticleId"):
        if article_id.attrib.get("IdType") == "pmc":
            pmcid = article_id.text
    
    return {"doi": doi, "pmcid": pmcid}

In [148]:
def get_pdf_url_from_metadata(metadata):
    pmcid = metadata.get("pmcid")
    if pmcid:
        return f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/"
    return None

In [130]:
def get_pmc_pdf_url(pmcid):
    """Return the direct PDF URL for a PMC article if it exists."""
    if not pmcid.startswith("PMC"):
        pmcid = "PMC" + pmcid
    pdf_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/"
    # Quick HEAD request to check if PDF exists
    r = requests.head(pdf_url)
    if r.status_code == 200:
        return pdf_url
    return None

In [140]:
import requests
from bs4 import BeautifulSoup

def get_pmcid_from_pubmed_page(uid):
    url = f"https://pubmed.ncbi.nlm.nih.gov/{uid}/"
    r = requests.get(url)
    if r.status_code != 200:
        print(f"Failed to fetch PubMed page for UID {uid}")
        return None

    soup = BeautifulSoup(r.text, "html.parser")
    # Find all anchor tags
    links = soup.find_all("a", href=True)
    for link in links:
        href = link['href']
        # Check if href contains the PMC article path
        if "/pmc/articles/PMC" in href:
            # Extract PMCID from href
            pmcid = href.strip("/").split("/")[-1]
            return pmcid
    return None


In [136]:
def get_pdf_url_from_pubmed_uid(uid):
    """Try to get direct PDF URL for a PubMed article UID."""
    pmcid = get_pmcid_from_pubmed_page(uid)
    if pmcid:
        pdf_url = get_pmc_pdf_url(pmcid)
        if pdf_url:
            return pdf_url
    print(f"No direct PDF found for PubMed UID {uid}")
    return None


In [149]:
pubmed_pmc_pdf_urls = []
for doc in pubmed_docs:
    uid = doc.metadata.get("uid")
    if uid:
        metadata= fetch_article_metadata(uid)
        print(metadata)
        pdf_url = get_pdf_url_from_metadata(metadata)
        if pdf_url:
            print(f"PDF URL for PubMed UID {uid}: {pdf_url}")
            pubmed_pmc_pdf_urls.append(pdf_url)
        else:
            print(f"No PDF URL found for PubMed UID {uid}")
        time.sleep(1)  
        

for url in pubmed_pmc_pdf_urls:
    print(url)

{'doi': '10.1177/07334648251343924', 'pmcid': None}
No PDF URL found for PubMed UID 40498532
{'doi': '10.1001/jamanetworkopen.2025.14631', 'pmcid': None}
No PDF URL found for PubMed UID 40498486
{'doi': '10.1001/jamadermatol.2025.1565', 'pmcid': None}
No PDF URL found for PubMed UID 40498480
{'doi': '10.1001/jamacardio.2025.1714', 'pmcid': None}
No PDF URL found for PubMed UID 40498478
{'doi': '10.3390/jcm11010263', 'pmcid': '8746094'}
PDF URL for PubMed UID 40498412: https://www.ncbi.nlm.nih.gov/pmc/articles/8746094/pdf/
{'doi': '10.1002/eji.202451176', 'pmcid': None}
No PDF URL found for PubMed UID 40498369
{'doi': '10.1080/07391102.2025.2507817', 'pmcid': None}
No PDF URL found for PubMed UID 40498355
{'doi': '10.1136/spcare-2024-004984', 'pmcid': '5817484'}
PDF URL for PubMed UID 40498290: https://www.ncbi.nlm.nih.gov/pmc/articles/5817484/pdf/
{'doi': '10.1016/S0168-8278(97)80288-7', 'pmcid': '1421079'}
PDF URL for PubMed UID 40498287: https://www.ncbi.nlm.nih.gov/pmc/articles/1421

In [150]:
len(pubmed_pmc_pdf_urls)

167

In [152]:
pubmed_pdf_urls[0]

'https://pubmed.ncbi.nlm.nih.gov/40498532/'

In [None]:
pubmed_pmc_pdf_urls = []
for doc in pubmed_docs:
    uid = doc.metadata.get("uid")
    if uid:
        metadata= fetch_article_metadata(uid)
        print(metadata)
        pdf_url = get_pdf_url_from_metadata(metadata)
        if pdf_url:
            print(f"PDF URL for PubMed UID {uid}: {pdf_url}")
            pubmed_pmc_pdf_urls.append(pdf_url)
        else:
            print(f"No PDF URL found for PubMed UID {uid}")
        time.sleep(1)  
        

for url in pubmed_pmc_pdf_urls:
    print(url)

## Read PDF from URL

In [160]:
pubmed_pmc_pdf_urls[0]

'https://www.ncbi.nlm.nih.gov/pmc/articles/8746094/pdf/'

In [164]:
from io import BytesIO
import requests
import fitz  # pymupdf

pdf_url = pubmed_pmc_pdf_urls[0]
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.ncbi.nlm.nih.gov/"
}
session = requests.Session()
response = session.get(url, headers=headers, allow_redirects=True)

print("Status Code:", response.status_code)
print("Content-Type:", response.headers.get("Content-Type"))
print("Content-Length:", len(response.content))
print(response.content[:200])  # Peek at start of file
if not response.headers.get("Content-Type", "").startswith("application/pdf"):
    raise ValueError("Not a valid PDF file, got HTML instead.")
if response.status_code != 200:
    raise ValueError(f"Failed to download PDF. Status code: {response.status_code}")

pdf_stream = BytesIO(response.content)
pdf = fitz.open(stream=pdf_stream, filetype="pdf")
print("✅ PDF loaded successfully. Page count:", pdf.page_count)

Status Code: 200
Content-Type: text/html; charset=utf-8
Content-Length: 1285
b'\n\n\n\n<html>\n  <head>\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>Preparing to download ...</title>\n    <style type="text/css">\n      body{font-size:1rem;line-h'


ValueError: Not a valid PDF file, got HTML instead.

In [163]:
content_type = response.headers.get("Content-Type", "")
content_length = len(response.content)
print("Content-Type:", content_type)
print("Content-Length:", content_length)


Content-Type: text/html; charset=utf-8
Content-Length: 1285


In [162]:
pdf

Document('None', <memory, doc# 322>)

In [158]:
for pdf_url in pubmed_pdf_urls:
    print(pdf_url)
    response = requests.get(pdf_url)
    if response.status_code != 200:
        raise ValueError(f"Failed to download PDF. Status code: {response.status_code}")

    pdf_stream = BytesIO(response.content)
    pdf = fitz.open(stream=pdf_stream, filetype="pdf")
    print(pdf)

https://pubmed.ncbi.nlm.nih.gov/40498532/
Document('None', <memory, doc# 285>)
https://pubmed.ncbi.nlm.nih.gov/40498486/
Document('None', <memory, doc# 286>)
https://pubmed.ncbi.nlm.nih.gov/40498480/
Document('None', <memory, doc# 287>)
https://pubmed.ncbi.nlm.nih.gov/40498478/
Document('None', <memory, doc# 288>)
https://pubmed.ncbi.nlm.nih.gov/40498412/
Document('None', <memory, doc# 289>)
https://pubmed.ncbi.nlm.nih.gov/40498369/
Document('None', <memory, doc# 290>)
https://pubmed.ncbi.nlm.nih.gov/40498355/
Document('None', <memory, doc# 291>)
https://pubmed.ncbi.nlm.nih.gov/40498290/
Document('None', <memory, doc# 292>)
https://pubmed.ncbi.nlm.nih.gov/40498287/
Document('None', <memory, doc# 293>)
https://pubmed.ncbi.nlm.nih.gov/40498212/
Document('None', <memory, doc# 294>)
https://pubmed.ncbi.nlm.nih.gov/40498198/
Document('None', <memory, doc# 295>)
https://pubmed.ncbi.nlm.nih.gov/40498168/
Document('None', <memory, doc# 296>)
https://pubmed.ncbi.nlm.nih.gov/40498128/
Document('

KeyboardInterrupt: 

In [167]:
import requests

query = "diabetes"
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pmc",
    "term": query,
    "retmax": 5,
    "retmode": "json"
}
response = requests.get(url, params=params)
uids = response.json()["esearchresult"]["idlist"]
print(uids)


['12162178', '12162169', '12162160', '12162155', '12162150']


In [168]:
def get_article_metadata(pmcid):
    summary_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    params = {
        "db": "pmc",
        "id": pmcid,
        "retmode": "json"
    }
    r = requests.get(summary_url, params=params)
    return r.json()

In [169]:
from io import BytesIO
import fitz  # PyMuPDF

def download_pmc_pdf(pmcid):
    url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmcid}/pdf/"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://www.ncbi.nlm.nih.gov/"
    }
    r = requests.get(url, headers=headers, allow_redirects=True)

    if r.status_code == 200 and r.headers.get("Content-Type", "").startswith("application/pdf"):
        pdf_stream = BytesIO(r.content)
        pdf = fitz.open(stream=pdf_stream, filetype="pdf")
        print(f"Successfully downloaded and opened PDF for PMC{pmcid}")
        return pdf
    else:
        print(f"Failed to download PDF for PMC{pmcid}. Content-Type: {r.headers.get('Content-Type')}")
        return None


In [170]:
uids = ['8746094', '8901234']
for uid in uids:
    pdf = download_pmc_pdf(uid)
    if pdf:
        print(f"Title: {pdf.metadata.get('title')}")


Failed to download PDF for PMC8746094. Content-Type: text/html; charset=utf-8
Failed to download PDF for PMC8901234. Content-Type: text/html; charset=utf-8


In [171]:
def extract_pdf_link(pmcid):
    url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmcid}/"
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    pdf_link = soup.find("a", string="PDF")
    if pdf_link:
        return "https://www.ncbi.nlm.nih.gov" + pdf_link.get("href")
    return None

# TavilySearchAPIRetriever
https://python.langchain.com/docs/integrations/retrievers/tavily/

>[Tavily's Search API](https://tavily.com) is a search engine built specifically for AI agents (LLMs), delivering real-time, accurate, and factual results at speed.

We can use this as a [retriever](/docs/how_to#retrievers). It will show functionality specific to this integration. After going through, it may be useful to explore [relevant use-case pages](/docs/how_to#qa-with-rag) to learn how to use this vectorstore as part of a larger chain.

### Integration details

import {ItemTable} from "@theme/FeatureTables";

<ItemTable category="external_retrievers" item="TavilySearchAPIRetriever" />

## Setup

If you want to get automated tracing from individual queries, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:

In [None]:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

The integration lives in the `langchain-community` package. We also need to install the `tavily-python` package itself.

In [107]:
%pip install -qU langchain-community tavily-python


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [108]:
#import getpass
import os

#os.environ["TAVILY_API_KEY"] = getpass.getpass()

In [109]:
os.environ["TAVILY_API_KEY"] = "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"

## Instantiation

Now we can instantiate our retriever:

In [110]:
from langchain_community.retrievers import TavilySearchAPIRetriever

tavily_retriever = TavilySearchAPIRetriever(k=100,
                                            include_raw_content = True,                                            
                                            search_depth = "advanced",
                                            include_domains=[
                                            "clinicaltrials.gov", "pubmed.ncbi.nlm.nih.gov", "arxiv.org",
                                            "biorxiv.org", "medrxiv.org", "diabetes.org", "cdc.gov", "who.int",
                                            "fda.gov", "ema.europa.eu", "nejm.org", "thelancet.com", "jamanetwork.com",
                                            "nature.com", "sciencedirect.com", "springer.com", "cell.com", "biomedcentral.com"
                                            ]
                                            )

In [None]:
tavily_docs = tavily_retriever.invoke("Diabetes")

In [None]:
len(tavily_docs)

38

In [None]:
tavily_docs[0].metadata

{'title': 'diabetes and hearing loss-cross sectional study.docx',
 'source': 'https://cdn.clinicaltrials.gov/large-docs/38/NCT06190938/Prot_SAP_000.pdf',
 'score': 0.6533265,
 'images': ['https://static.vecteezy.com/system/resources/previews/001/436/750/large_2x/diabetes-symptoms-information-infographic-free-vector.jpg',
  'http://thumbs.dreamstime.com/z/types-diabetes-type-type-mellitus-insulin-dependent-mellitus-non-insulin-dependent-mellitus-39731814.jpg',
  'https://thumbs.dreamstime.com/z/types-diabetes-simple-medical-vector-illustration-scheme-health-care-information-diagram-types-diabetes-simple-medical-115835865.jpg',
  'https://lirp.cdn-website.com/69c0b277/dms3rep/multi/opt/Types+of+Diabetes-1920w.jpg',
  'https://detsutah.com/wp-content/uploads/2023/05/DETS-Diabetes-Mellitus.jpg']}

In [None]:
tavily_docs[0]

Document(metadata={'title': 'diabetes and hearing loss-cross sectional study.docx', 'source': 'https://cdn.clinicaltrials.gov/large-docs/38/NCT06190938/Prot_SAP_000.pdf', 'score': 0.6533265, 'images': ['https://static.vecteezy.com/system/resources/previews/001/436/750/large_2x/diabetes-symptoms-information-infographic-free-vector.jpg', 'http://thumbs.dreamstime.com/z/types-diabetes-type-type-mellitus-insulin-dependent-mellitus-non-insulin-dependent-mellitus-39731814.jpg', 'https://thumbs.dreamstime.com/z/types-diabetes-simple-medical-vector-illustration-scheme-health-care-information-diagram-types-diabetes-simple-medical-115835865.jpg', 'https://lirp.cdn-website.com/69c0b277/dms3rep/multi/opt/Types+of+Diabetes-1920w.jpg', 'https://detsutah.com/wp-content/uploads/2023/05/DETS-Diabetes-Mellitus.jpg']}, page_content='')

In [111]:
import time

In [112]:
tavily_queries = [
    "Latest breakthroughs in diabetes treatment 2024",
    "Novel approaches to curing type 1 and type 2 diabetes",
    "Recent advances in beta cell regeneration for diabetes",
    "New drug targets for insulin resistance and diabetes cure",
    "Gene editing and CRISPR therapies for diabetes",
    "Immunotherapy approaches to treat or cure type 1 diabetes",
    "Ongoing clinical trials for diabetes reversal",
    "Use of AI and machine learning in discovering diabetes treatments",
    "Recent trials on diabetes AND site:clinicaltrials.gov",
    "Pathophysiology of insulin resistance in type 2 diabetes", #mechanistic, scientific explanations — helpful for AI models focusing on mechanistic reasoning
    "Prevalence of diabetes in Europe 2024 statistics", #public health datasets, WHO data, etc.
    "FDA-approved drugs for type 2 diabetes and their mechanisms", #Useful for drug discovery, pharmacological studies, or treatment plans,
    "Latest ADA clinical guidelines for diabetes management"#Targets specific trusted content like ADA (American Diabetes Association) guidelines.



]

all_docs = []

for query in tavily_queries:
    print(query)
    try:
        docs = tavily_retriever.invoke(query)
        all_docs.extend(docs)
    except TimeoutError:
        print(f"Timeout on query: {query}")
    time.sleep(3)  # Add delay between requests

# Optional: Deduplicate based on document content
tavily_docs = list({doc.page_content: doc for doc in all_docs}.values())




Latest breakthroughs in diabetes treatment 2024
Novel approaches to curing type 1 and type 2 diabetes
Recent advances in beta cell regeneration for diabetes
New drug targets for insulin resistance and diabetes cure
Gene editing and CRISPR therapies for diabetes
Immunotherapy approaches to treat or cure type 1 diabetes
Ongoing clinical trials for diabetes reversal
Use of AI and machine learning in discovering diabetes treatments
Recent trials on diabetes AND site:clinicaltrials.gov
Pathophysiology of insulin resistance in type 2 diabetes
Prevalence of diabetes in Europe 2024 statistics
FDA-approved drugs for type 2 diabetes and their mechanisms
Latest ADA clinical guidelines for diabetes management


In [114]:
tavily_docs[0].metadata

{'title': 'Clinical Update Conference On-Demand Registration ...',
 'source': 'https://professional.diabetes.org/clinical-update-conference/clinical-update-conference-2025-registration-information',
 'score': 0.038311254,
 'images': []}

In [115]:
tavily_docs[0]

Document(metadata={'title': 'Clinical Update Conference On-Demand Registration ...', 'source': 'https://professional.diabetes.org/clinical-update-conference/clinical-update-conference-2025-registration-information', 'score': 0.038311254, 'images': []}, page_content='')

In [116]:
len(tavily_docs)

139

In [121]:
tavily_docs[-1].metadata

{'title': 'The New England Journal of Medicine | Research & Review Articles on ...',
 'source': 'https://www.nejm.org/',
 'score': 0.05570498,
 'images': []}

In [122]:
high_score_docs = [doc for doc in tavily_docs if doc.metadata.get("score", 0) > 0.6]

# Optional: sort
high_score_docs = sorted(high_score_docs, key=lambda d: d.metadata["score"], reverse=True)

for doc in high_score_docs:
    print(f"Title: {doc.metadata.get('title')}")
    print(f"Score: {doc.metadata.get('score'):.4f}")
    print(f"Source: {doc.metadata.get('source')}")
    print()

Title: Exploring pancreatic beta-cell restoration's potential and challenges
Score: 0.8908
Source: https://pubmed.ncbi.nlm.nih.gov/39494103/

Title: The promise of CRISPR/Cas9 technology in diabetes mellitus therapy
Score: 0.8868
Source: https://www.sciencedirect.com/science/article/pii/S1056872723001228

Title: The American Diabetes Association Releases ...
Score: 0.8830
Source: https://diabetes.org/newsroom/press-releases/american-diabetes-association-releases-standards-care-diabetes-2024

Title: Leveraging artificial intelligence and machine learning to ...
Score: 0.8639
Source: https://pubmed.ncbi.nlm.nih.gov/39694914/

Title: Innovative immunotherapies and emerging treatments in type 1 ...
Score: 0.8580
Source: https://www.sciencedirect.com/science/article/pii/S2666970624000519

Title: From islet transplantation to beta-cell regeneration - PubMed
Score: 0.8519
Source: https://pubmed.ncbi.nlm.nih.gov/38693782/

Title: Pancreatic β Cell Regeneration as a Possible Therapy for Diabete

In [123]:
len(high_score_docs)

47

In [165]:
tavily_pdf_urls = [doc.metadata.get("source") for doc in high_score_docs if doc.metadata.get("source", "").endswith(".pdf")]


In [166]:
for url in tavily_pdf_urls:
    print(url)

https://diabetes.org/sites/default/files/2024-06/24.06.06-%20%20Sci%20Sessions%20curtain%20raiser_FINAL%20%281%29.pdf
https://professional.diabetes.org/sites/dpro/files/2025-02/ADA_2024_ResearchReport_final-reduced.pdf


In [128]:
from io import BytesIO
import requests
import fitz  # PyMuPDF

pdf_url = tavily_pdf_urls[0]  # or whichever URL you want

response = requests.get(pdf_url)
response.raise_for_status()  # better error handling

pdf_stream = BytesIO(response.content)
pdf = fitz.open(stream=pdf_stream, filetype="pdf")

# Now you can work with 'pdf' object as needed


In [129]:
pdf

Document('None', <memory, doc# 13>)

In [118]:
print(tavily_docs[10])

page_content='Guidelines Your visual guide to the guidelines InSIGHT Starting FDA-Approved Weight Management Medications This infographic is based on recommendations from the ADA’s Standards of Care in Diabetes—2024 Supported in part by Treat Obesity, Help Prevent Diabetes Initiative. American Diabetes Association® (ADA) Professional version Learn more at professional.diabetes.org | 1-800-DIABETES (800-342-2383) Medications Lifestyle Recommendations in Conjunction with Interventions for Weight Management MINIMIZE SIDE EFFECTS INCRETIN-BASED GLP-1 RA AND DUAL GIP/GLP-1 RA THERAPIES • Liraglutide (Approved for ages 12 and older) • Semaglutide (Approved for ages 12 and older) • Tirzepatide (Approved for ages 18 and older) ORLISTAT (Approved for ages 12 and older) NALTREXONE/ BUPROPION (Approved for ages 18 and older) PHENTERMINE/ TOPIRAMATE (Approved for ages 12 and older) Mimics gut hormone to lower hunger and slow stomach emptying.
Reduces appetite by influencing central appetite center

In [119]:
tavily_docs[10]

Document(metadata={'title': 'PDF', 'source': 'https://professional.diabetes.org/sites/dpro/files/2024-08/startingfdaapprovedweightmanagementmedications.pdf', 'score': 0.41805586, 'images': []}, page_content="Guidelines Your visual guide to the guidelines InSIGHT Starting FDA-Approved Weight Management Medications This infographic is based on recommendations from the ADA’s Standards of Care in Diabetes—2024 Supported in part by Treat Obesity, Help Prevent Diabetes Initiative. American Diabetes Association® (ADA) Professional version Learn more at professional.diabetes.org | 1-800-DIABETES (800-342-2383) Medications Lifestyle Recommendations in Conjunction with Interventions for Weight Management MINIMIZE SIDE EFFECTS INCRETIN-BASED GLP-1 RA AND DUAL GIP/GLP-1 RA THERAPIES • Liraglutide (Approved for ages 12 and older) • Semaglutide (Approved for ages 12 and older) • Tirzepatide (Approved for ages 18 and older) ORLISTAT (Approved for ages 12 and older) NALTREXONE/ BUPROPION (Approved for

NameError: name 'pdf_urls' is not defined

# Tavily Search
https://python.langchain.com/docs/integrations/tools/tavily_search/

https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/tavily_search/tool.py

[Tavily's Search API](https://tavily.com) is a search engine built specifically for AI agents (LLMs), delivering real-time, accurate, and factual results at speed.

## Overview

### Integration details
| Class                                                         | Package                                                        | Serializable | [JS support](https://js.langchain.com/docs/integrations/tools/tavily_search) |  Package latest |
|:--------------------------------------------------------------|:---------------------------------------------------------------| :---: | :---: | :---: |
| [TavilySearch](https://github.com/tavily-ai/langchain-tavily) | [langchain-tavily](https://pypi.org/project/langchain-tavily/) | ✅ | ✅  |  ![PyPI - Version](https://img.shields.io/pypi/v/langchain-tavily?style=flat-square&label=%20) |

### Tool features
| [Returns artifact](/docs/how_to/tool_artifacts/) | Native async |                       Return data                        | Pricing |
| :---: | :---: |:--------------------------------------------------------:| :---: |
| ❌ | ✅ | title, URL, content snippet, raw_content, answer, images | 1,000 free searches / month |


## Setup

The integration lives in the `langchain-tavily` package.

In [85]:
%pip install -qU langchain-tavily

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index-legacy 0.9.48.post4 requires tenacity<9.0.0,>=8.2.0, but you have tenacity 9.1.2 which is incompatible.
llama-index-core 0.11.22 requires tenacity!=8.4.0,<9.0.0,>=8.2.0, but you have tenacity 9.1.2 which is incompatible.[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Credentials

We also need to set our Tavily API key. You can get an API key by visiting [this site](https://app.tavily.com/sign-in) and creating an account.

In [86]:
import getpass
import os

#if not os.environ.get("TAVILY_API_KEY"):
#    os.environ["TAVILY_API_KEY"] = getpass.getpass("Tavily API key:\n")

In [87]:
os.environ["TAVILY_API_KEY"] = ""

## Instantiation

Here we show how to instantiate an instance of the Tavily search tool. The tool accepts various parameters to customize the search. After instantiation we invoke the tool with a simple query. This tool allows you to complete search queries using Tavily's Search API endpoint.

Instantiation
The tool accepts various parameters during instantiation:

- max_results (optional, int): Maximum number of search results to return. Default is 5.
- topic (optional, str): Category of the search. Can be "general", "news", or "finance". Default is "general".
- include_answer (optional, bool): Include an answer to original query in results. Default is False.
- include_raw_content (optional, bool): Include cleaned and parsed HTML of each search result. Default is False.
- include_images (optional, bool): Include a list of query related images in the response. Default is False.
- include_image_descriptions (optional, bool): Include descriptive text for each image. Default is False.
- search_depth (optional, str): Depth of the search, either "basic" or "advanced". Default is "basic".
- time_range (optional, str): The time range back from the current date to filter results - "day", "week", "month", or "year". Default is None.
- include_domains (optional, List[str]): List of domains to specifically include. Default is None.
- exclude_domains (optional, List[str]): List of domains to specifically exclude. Default is None.

For a comprehensive overview of the available parameters, refer to the [Tavily Search API documentation](https://docs.tavily.com/documentation/api-reference/endpoint/search)

In [97]:
from langchain_tavily import TavilySearch

tavily_tool = TavilySearch(
    k=100,
    max_results=100,
    topic="general",
    include_answer=True,
    include_raw_content=True,
    # include_images=False,
    # include_image_descriptions=False,
    search_depth="advanced",
    # time_range="day",
    include_domains=[
                    "clinicaltrials.gov", "pubmed.ncbi.nlm.nih.gov", "arxiv.org",
                    "biorxiv.org", "medrxiv.org", "diabetes.org", "cdc.gov", "who.int",
                    "fda.gov", "ema.europa.eu", "nejm.org", "thelancet.com", "jamanetwork.com",
                    "nature.com", "sciencedirect.com", "springer.com", "cell.com", "biomedcentral.com"
                    ]
    # exclude_domains=None
)

## Invocation

### [Invoke directly with args](/docs/concepts/tools)

The Tavily search tool accepts the following arguments during invocation:
- `query` (required): A natural language search query
- The following arguments can also be set during invocation : `include_images`, `search_depth` , `time_range`, `include_domains`, `exclude_domains`, `include_images`
- For reliability and performance reasons, certain parameters that affect response size cannot be modified during invocation: `include_answer` and `include_raw_content`. These limitations prevent unexpected context window issues and ensure consistent results.


NOTE: The optional arguments are available for agents to dynamically set, if you set an argument during instantiation and then invoke the tool with a different value, the tool will use the value you passed during invocation.

In [102]:
tavily_docs = tavily_tool.run({"query": "Diabetes"})

In [103]:
len(tavily_docs)

6

In [105]:
tavily_docs

{'query': 'Diabetes',
 'follow_up_questions': None,
 'answer': 'Diabetes is a chronic metabolic disorder characterized by high blood sugar levels. It has two main types: Type 1 and Type 2. Management includes lifestyle changes and medication.',
 'images': [],
 'results': [{'title': 'Diabetes Research and Clinical Practice - ScienceDirect',
   'url': 'https://www.sciencedirect.com/journal/diabetes-research-and-clinical-practice',
   'content': 'Diabetes Research and Clinical Practice is an international journal for health-care providers and clinically oriented researchers that publishes high-quality original research articles and expert reviews in diabetes and related areas.. The role Diabetes Research and Clinical Practice is to provide a venue for dissemination of knowledge and discussion of topics related to diabetes clinical',
   'score': 0.6854659,
   'raw_content': None},
  {'title': 'Diabetes - Latest research and news - Nature',
   'url': 'https://www.nature.com/subjects/diabete

In [106]:
import time

In [None]:
tavily_queries = [
    "Latest breakthroughs in diabetes treatment 2024",
    "Novel approaches to curing type 1 and type 2 diabetes",
    "Recent advances in beta cell regeneration for diabetes",
    "New drug targets for insulin resistance and diabetes cure",
    "Gene editing and CRISPR therapies for diabetes",
    "Immunotherapy approaches to treat or cure type 1 diabetes",
    "Ongoing clinical trials for diabetes reversal",
    "Use of AI and machine learning in discovering diabetes treatments",
    "Recent trials on diabetes AND site:clinicaltrials.gov",
    "Pathophysiology of insulin resistance in type 2 diabetes", #mechanistic, scientific explanations — helpful for AI models focusing on mechanistic reasoning
    "Prevalence of diabetes in Europe 2024 statistics", #public health datasets, WHO data, etc.
    "FDA-approved drugs for type 2 diabetes and their mechanisms", #Useful for drug discovery, pharmacological studies, or treatment plans,
    "Latest ADA clinical guidelines for diabetes management"#Targets specific trusted content like ADA (American Diabetes Association) guidelines.



]

all_docs = []

for query in tavily_queries:
    print(query)
    try:
        docs = tavily_tool.invoke(query)
        all_docs.extend(docs)
    except TimeoutError:
        print(f"Timeout on query: {query}")
    time.sleep(3)  # Add delay between requests

# Optional: Deduplicate based on document content
tavily_docs = list({doc.page_content: doc for doc in all_docs}.values())




# Recursive URL

https://python.langchain.com/docs/integrations/document_loaders/recursive_url/

The `RecursiveUrlLoader` lets you recursively scrape all child links from a root URL and parse them into Documents.

## Overview
### Integration details

| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/docs/integrations/document_loaders/web_loaders/recursive_url_loader/)|
| :--- | :--- | :---: | :---: |  :---: |
| [RecursiveUrlLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ✅ | ❌ | ✅ | 
### Loader features
| Source | Document Lazy Loading | Native Async Support
| :---: | :---: | :---: | 
| RecursiveUrlLoader | ✅ | ❌ | 


To enable automated tracing of your model calls, set your [LangSmith](https://docs.smith.langchain.com/) API key:

## Setup

### Credentials

No credentials are required to use the `RecursiveUrlLoader`.

### Installation

The `RecursiveUrlLoader` lives in the `langchain-community` package. There's no other required packages, though you will get richer default Document metadata if you have ``beautifulsoup4` installed as well.

In [50]:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

Install **langchain_community**.£

In [64]:
%pip install -qU langchain-community beautifulsoup4 lxml


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Instantiation

Now we can instantiate our document loader object and load Documents:

In [81]:
from langchain_community.document_loaders import RecursiveUrlLoader
from bs4 import BeautifulSoup
import re

# clean HTML content
def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()

# Load and extract diabetes-related pages
medrxiv_loader = RecursiveUrlLoader(
    'https://clinicaltrials.gov/',
     max_depth=10,
    # use_async=False,
    extractor=bs4_extractor,
    link_regex=r".*(diabetes|blood[-_]?sugar|glucose|insulin).*",  # Filter links by diabetes keywords
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    continue_on_failure=True,
    prevent_outside=True,
    # base_url=None,
    # ...
)

In [82]:
medrxiv_docs = medrxiv_loader.load()

In [83]:
len(medrxiv_docs)

1

In [84]:
medrxiv_docs

[Document(metadata={'source': 'https://clinicaltrials.gov/', 'content_type': 'text/html', 'title': 'ClinicalTrials.gov', 'description': '', 'language': 'en'}, page_content='ClinicalTrials.gov\n\nHide glossary\n\nGlossary\n\n  Study record managers: refer to the Data Element Definitions if submitting registration or results information.\n  \n\nSearch for terms')]

In [55]:
from langchain_community.document_loaders.sitemap import SitemapLoader

In [60]:
medrxiv_sitemap_loader = SitemapLoader(web_path='https://www.medrxiv.org/sitemap.xml',
                               filter_urls = [r".*(diabetes|blood[-_]?sugar|glucose|insulin).*"],
                               continue_on_failure=True,
                               )

In [61]:
medrxiv_docs = list(medrxiv_sitemap_loader.lazy_load())

Fetching pages: 100%|##########| 1/1 [00:01<00:00,  1.25s/it]
Fetching pages: 100%|##########| 1/1 [00:01<00:00,  1.13s/it]
Fetching pages: 0it [00:00, ?it/s]


In [62]:
len(medrxiv_docs)

0

In [63]:
for doc in medrxiv_docs:
    print(doc.metadata["source"])
    print(doc.page_content[:500])  # show first 500 chars of the content
    print("=" * 80)


In [49]:
from firecrawl import FirecrawlApp, ScrapeOptions
# Step 1: Initialize Firecrawl client
medrxiv_app = FirecrawlApp(api_key=api_key)

# Step 2: Crawl medRxiv homepage
medrxiv_crawl_result = medrxiv_app.crawl_url(
  'https://www.medrxiv.org/',
  limit=10, 
  scrape_options=ScrapeOptions(formats=['markdown', 'html']),
)
# Step 3: Filter results for "diabetes"
medrxiv_diabetes_docs = []
for doc in medrxiv_crawl_result.get("data", []):
    text = doc.get("text", "").lower()
    if "diabetes" in text:
        medrxiv_diabetes_docs.append(doc["url"])

# Step 4: Show matched URLs
print("Documents mentioning 'diabetes':")
for url in medrxiv_diabetes_docs:
    print(url)

AttributeError: 'CrawlStatusResponse' object has no attribute 'get'

In [44]:
# Step 1: Instantiate the loader with your starting URL
medrxiv_loader = FireCrawlLoader(
    api_key="fc-9a8592bd12ee4afd8f361c1ef69786ec", url="https://www.medrxiv.org/", mode="map"
)

In [45]:
# Step 2: Lazy load documents
medrxiv_docs_lazy = medrxiv_loader.lazy_load()

In [46]:
medrxiv_docs_lazy

<generator object FireCrawlLoader.lazy_load at 0x12a21b740>

In [47]:
# Step 3: Filter docs that mention "diabetes" and collect their URLs
medrxiv_diabetes_urls = []

for doc in medrxiv_loader.lazy_load():
    text = doc.page_content.lower()
    if "diabetes" in text:
        metadata = doc.metadata or {}
        url = metadata.get("sourceURL") or metadata.get("ogUrl")
        if url:
            medrxiv_diabetes_urls.append(url)

ValueError: Unsupported parameter(s) for map_url: params. Please refer to the API documentation for the correct parameters.

In [None]:
# Step 4: Print results
for url in medrxiv_diabetes_urls:
    print(url)

In [31]:
len(medrxiv_docs)

1

In [33]:
medrxiv_docs[0]

Document(metadata={'source': 'https://www.medrxiv.org/search', 'content_type': 'text/html; charset=utf-8', 'title': 'Advanced Search | medRxiv', 'description': 'medRxiv - The Preprint Server for Health Sciences', 'language': 'en'}, page_content='<!DOCTYPE html>\n<html lang="en" dir="ltr" \n  xmlns="http://www.w3.org/1999/xhtml"\n  xmlns:mml="http://www.w3.org/1998/Math/MathML">\n  <head prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book#" >\n    <!--[if IE]><![endif]-->\n<link rel="dns-prefetch" href="//cdn.jsdelivr.net" />\n<link rel="dns-prefetch" href="//d33xdlntwy0kbs.cloudfront.net" />\n<link rel="dns-prefetch" href="//www.googletagmanager.com" />\n<link rel="dns-prefetch" href="//rum-static.pingdom.net" />\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="shortcut icon" href="https://www.medrxiv.org/sites/default/files/images/favicon.ico" type="image/vnd.microsoft.icon" />\n<meta name="viewport" content="w