# ArxivRetriever

https://python.langchain.com/docs/integrations/retrievers/arxiv/

>[arXiv](https://arxiv.org/) is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

This notebook shows how to retrieve scientific articles from Arxiv.org into the [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) format that is used downstream.

For detailed documentation of all `ArxivRetriever` features and configurations head to the [API reference](https://python.langchain.com/api_reference/community/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html).

### Integration details

import {ItemTable} from "@theme/FeatureTables";

<ItemTable category="external_retrievers" item="ArxivRetriever" />

## Setup

If you want to get automated tracing from individual queries, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:

In [None]:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

This retriever lives in the `langchain-community` package. We will also need the [arxiv](https://pypi.org/project/arxiv/) dependency:

In [3]:
%pip install -qU langchain-community arxiv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Instantiation

`ArxivRetriever` parameters include:
- optional `load_max_docs`: default=100. Use it to limit number of downloaded documents. It takes time to download all 100 documents, so use a small number for experiments. There is a hard limit of 300 for now.
- optional `load_all_available_meta`: default=False. By default only the most important fields downloaded: `Published` (date when document was published/last updated), `Title`, `Authors`, `Summary`. If True, other fields also downloaded.
- `get_full_documents`: boolean, default False. Determines whether to fetch full text of documents.

See [API reference](https://python.langchain.com/api_reference/community/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html) for more detail.

In [12]:
from langchain_community.retrievers import ArxivRetriever

arxiv_retriever = ArxivRetriever(
    top_k_results=300,
    load_max_docs=300,
    get_ful_documents=True,
)

In [13]:
arxiv_docs = arxiv_retriever.invoke("Diabetes")

In [14]:
arxiv_docs[0].metadata  # meta-information of the Document

{'Entry ID': 'http://arxiv.org/abs/2202.11216v1',
 'Published': datetime.date(2022, 2, 22),
 'Title': 'Early Stage Diabetes Prediction via Extreme Learning Machine',
 'Authors': 'Nelly Elsayed, Zag ElSayed, Murat Ozer'}

In [15]:
len(arxiv_docs)

300

# PubMed

https://python.langchain.com/docs/integrations/retrievers/pubmed/


>[PubMed®](https://pubmed.ncbi.nlm.nih.gov/) by `The National Center for Biotechnology Information, National Library of Medicine` comprises more than 35 million citations for biomedical literature from `MEDLINE`, life science journals, and online books. Citations may include links to full text content from `PubMed Central` and publisher web sites.

This notebook goes over how to use `PubMed` as a retriever

In [3]:
!pip install xmltodict

Collecting xmltodict
  Downloading xmltodict-0.14.2-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading xmltodict-0.14.2-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict
Successfully installed xmltodict-0.14.2

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [4]:
from langchain_community.retrievers import PubMedRetriever

In [16]:
pubmed_retriever = PubMedRetriever(
    top_k_results=300,
    load_max_docs=300,
    get_ful_documents=True,
)

In [17]:
pubmed_docs = pubmed_retriever.invoke("Diabetes")

In [18]:
pubmed_docs[0].metadata

{'uid': '40493381',
 'Title': 'Predictors of insulin adherence among patients with Type 2 diabetes: a cross-sectional study.',
 'Published': '2025-06-10',
 'Copyright Information': ''}

# TavilySearchAPIRetriever
https://python.langchain.com/docs/integrations/retrievers/tavily/

>[Tavily's Search API](https://tavily.com) is a search engine built specifically for AI agents (LLMs), delivering real-time, accurate, and factual results at speed.

We can use this as a [retriever](/docs/how_to#retrievers). It will show functionality specific to this integration. After going through, it may be useful to explore [relevant use-case pages](/docs/how_to#qa-with-rag) to learn how to use this vectorstore as part of a larger chain.

### Integration details

import {ItemTable} from "@theme/FeatureTables";

<ItemTable category="external_retrievers" item="TavilySearchAPIRetriever" />

## Setup

If you want to get automated tracing from individual queries, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:

In [31]:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

The integration lives in the `langchain-community` package. We also need to install the `tavily-python` package itself.

In [32]:
%pip install -qU langchain-community tavily-python


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [5]:
#import getpass
import os

#os.environ["TAVILY_API_KEY"] = getpass.getpass()

In [6]:
os.environ["TAVILY_API_KEY"] = "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"

## Instantiation

Now we can instantiate our retriever:

In [7]:
from langchain_community.retrievers import TavilySearchAPIRetriever

tavily_retriever = TavilySearchAPIRetriever(k=100,
                                            include_raw_content = True,                                            
                                            search_depth = "advanced",
                                            include_domains=[
                                            "clinicaltrials.gov", "pubmed.ncbi.nlm.nih.gov", "arxiv.org",
                                            "biorxiv.org", "medrxiv.org", "diabetes.org", "cdc.gov", "who.int",
                                            "fda.gov", "ema.europa.eu", "nejm.org", "thelancet.com", "jamanetwork.com",
                                            "nature.com", "sciencedirect.com", "springer.com", "cell.com", "biomedcentral.com"
                                            ]
                                            )

In [8]:
tavily_docs = tavily_retriever.invoke("Diabetes")

In [9]:
len(tavily_docs)

38

In [57]:
tavily_docs[0].metadata

{'title': 'diabetes and hearing loss-cross sectional study.docx',
 'source': 'https://cdn.clinicaltrials.gov/large-docs/38/NCT06190938/Prot_SAP_000.pdf',
 'score': 0.6533265,
 'images': ['https://static.vecteezy.com/system/resources/previews/001/436/750/large_2x/diabetes-symptoms-information-infographic-free-vector.jpg',
  'http://thumbs.dreamstime.com/z/types-diabetes-type-type-mellitus-insulin-dependent-mellitus-non-insulin-dependent-mellitus-39731814.jpg',
  'https://thumbs.dreamstime.com/z/types-diabetes-simple-medical-vector-illustration-scheme-health-care-information-diagram-types-diabetes-simple-medical-115835865.jpg',
  'https://lirp.cdn-website.com/69c0b277/dms3rep/multi/opt/Types+of+Diabetes-1920w.jpg',
  'https://detsutah.com/wp-content/uploads/2023/05/DETS-Diabetes-Mellitus.jpg']}

In [58]:
tavily_docs[0]

Document(metadata={'title': 'diabetes and hearing loss-cross sectional study.docx', 'source': 'https://cdn.clinicaltrials.gov/large-docs/38/NCT06190938/Prot_SAP_000.pdf', 'score': 0.6533265, 'images': ['https://static.vecteezy.com/system/resources/previews/001/436/750/large_2x/diabetes-symptoms-information-infographic-free-vector.jpg', 'http://thumbs.dreamstime.com/z/types-diabetes-type-type-mellitus-insulin-dependent-mellitus-non-insulin-dependent-mellitus-39731814.jpg', 'https://thumbs.dreamstime.com/z/types-diabetes-simple-medical-vector-illustration-scheme-health-care-information-diagram-types-diabetes-simple-medical-115835865.jpg', 'https://lirp.cdn-website.com/69c0b277/dms3rep/multi/opt/Types+of+Diabetes-1920w.jpg', 'https://detsutah.com/wp-content/uploads/2023/05/DETS-Diabetes-Mellitus.jpg']}, page_content='')

In [49]:
import time

In [59]:
tavily_queries = [
    "Latest breakthroughs in diabetes treatment 2024",
    "Novel approaches to curing type 1 and type 2 diabetes",
    "Recent advances in beta cell regeneration for diabetes",
    "New drug targets for insulin resistance and diabetes cure",
    "Gene editing and CRISPR therapies for diabetes",
    "Immunotherapy approaches to treat or cure type 1 diabetes",
    "Ongoing clinical trials for diabetes reversal",
    "Use of AI and machine learning in discovering diabetes treatments",
    "Recent trials on diabetes AND site:clinicaltrials.gov",
    "Pathophysiology of insulin resistance in type 2 diabetes", #mechanistic, scientific explanations — helpful for AI models focusing on mechanistic reasoning
    "Prevalence of diabetes in Europe 2024 statistics", #public health datasets, WHO data, etc.
    "FDA-approved drugs for type 2 diabetes and their mechanisms", #Useful for drug discovery, pharmacological studies, or treatment plans,
    "Latest ADA clinical guidelines for diabetes management"#Targets specific trusted content like ADA (American Diabetes Association) guidelines.



]

all_docs = []

for query in tavily_queries:
    print(query)
    try:
        docs = tavily_retriever.invoke(query)
        all_docs.extend(docs)
    except TimeoutError:
        print(f"Timeout on query: {query}")
    time.sleep(3)  # Add delay between requests

# Optional: Deduplicate based on document content
tavily_docs = list({doc.page_content: doc for doc in all_docs}.values())




Latest breakthroughs in diabetes treatment 2024
Novel approaches to curing type 1 and type 2 diabetes
Recent advances in beta cell regeneration for diabetes
New drug targets for insulin resistance and diabetes cure
Gene editing and CRISPR therapies for diabetes
Immunotherapy approaches to treat or cure type 1 diabetes
Ongoing clinical trials for diabetes reversal
Use of AI and machine learning in discovering diabetes treatments
Recent trials on diabetes AND site:clinicaltrials.gov
Pathophysiology of insulin resistance in type 2 diabetes
Prevalence of diabetes in Europe 2024 statistics
FDA-approved drugs for type 2 diabetes and their mechanisms
Latest ADA clinical guidelines for diabetes management


In [60]:
tavily_docs[0].metadata

{'title': 'PDF',
 'source': 'https://cdn.clinicaltrials.gov/large-docs/08/NCT04988308/SAP_001.pdf',
 'score': 0.014718728,
 'images': ['https://diabetesonthenet.com/wp-content/uploads/Figure-2.png',
  'https://www.meded101.com/wp-content/uploads/2022/02/2022-t2dm-algorithm-1-1024x791.png',
  'https://i0.wp.com/www.diabetesasia.org/magazine/wp-content/uploads/2020/09/ADA-2021.jpg?w=1280&ssl=1',
  'http://diabetologia-journal.org/wp-content/uploads/2018/09/Fig-2-for-EASD-ADA-768x554.jpg',
  'https://www.tomwademd.net/wp-content/uploads/2020/04/fig13-e1586859367841.png']}

In [61]:
tavily_docs[0]

Document(metadata={'title': 'PDF', 'source': 'https://cdn.clinicaltrials.gov/large-docs/08/NCT04988308/SAP_001.pdf', 'score': 0.014718728, 'images': ['https://diabetesonthenet.com/wp-content/uploads/Figure-2.png', 'https://www.meded101.com/wp-content/uploads/2022/02/2022-t2dm-algorithm-1-1024x791.png', 'https://i0.wp.com/www.diabetesasia.org/magazine/wp-content/uploads/2020/09/ADA-2021.jpg?w=1280&ssl=1', 'http://diabetologia-journal.org/wp-content/uploads/2018/09/Fig-2-for-EASD-ADA-768x554.jpg', 'https://www.tomwademd.net/wp-content/uploads/2020/04/fig13-e1586859367841.png']}, page_content='')

In [72]:
len(tavily_docs)

67

In [64]:
tavily_docs[-1].metadata

{'title': 'ClinicalTrials.gov',
 'source': 'https://clinicaltrials.gov/study/NCT07007962',
 'score': 0.07407731,
 'images': ['https://diabetesonthenet.com/wp-content/uploads/Figure-2.png',
  'https://www.meded101.com/wp-content/uploads/2022/02/2022-t2dm-algorithm-1-1024x791.png',
  'https://i0.wp.com/www.diabetesasia.org/magazine/wp-content/uploads/2020/09/ADA-2021.jpg?w=1280&ssl=1',
  'http://diabetologia-journal.org/wp-content/uploads/2018/09/Fig-2-for-EASD-ADA-768x554.jpg',
  'https://www.tomwademd.net/wp-content/uploads/2020/04/fig13-e1586859367841.png']}

In [75]:
high_score_docs = [doc for doc in tavily_docs if doc.metadata.get("score", 0) > 0.6]

# Optional: sort
high_score_docs = sorted(high_score_docs, key=lambda d: d.metadata["score"], reverse=True)

for doc in high_score_docs:
    print(f"Title: {doc.metadata.get('title')}")
    print(f"Score: {doc.metadata.get('score'):.4f}")
    print(f"Source: {doc.metadata.get('source')}")
    print()

Title: ClinicalTrials.gov
Score: 0.9855
Source: https://clinicaltrials.gov/study/NCT04255433

Title: ClinicalTrials.gov
Score: 0.9846
Source: https://clinicaltrials.gov/study/NCT07006272

Title: ClinicalTrials.gov
Score: 0.9494
Source: https://clinicaltrials.gov/search?cond=Diabetes+Mellitus+Type+2

Title: PDF
Score: 0.9334
Source: https://cdn.clinicaltrials.gov/large-docs/61/NCT04074161/Prot_000.pdf

Title: ClinicalTrials.gov
Score: 0.9293
Source: https://clinicaltrials.gov/study/NCT06962280

Title: ClinicalTrials.gov
Score: 0.9288
Source: https://clinicaltrials.gov/study/NCT05232916

Title: ClinicalTrials.gov
Score: 0.9255
Source: https://clinicaltrials.gov/find-studies/how-to-search

Title: ClinicalTrials.gov
Score: 0.9204
Source: https://clinicaltrials.gov/search?cond=TYPE+1+DIABETES+MELLITUS

Title: ClinicalTrials.gov
Score: 0.9202
Source: https://clinicaltrials.gov/study/NCT05432583

Title: ClinicalTrials.gov
Score: 0.9175
Source: https://clinicaltrials.gov/study/NCT06972472

Tit

In [69]:
print(tavily_docs[10])

page_content='Cover Page GB002, Inc. Gossamer Bio, Inc. CONFIDENTIAL CLINICAL STUDY PROTOCOL A Phase 1A Single Ascending Dose and Multiple Ascending Dose Double-Blind, Placebo-Controlled, Randomized Trial of Oral Inhalation PK10571 in Healthy Adult Subjects Protocol Number: 4004002 VERSION 5.3 DATE: 31 Jul 2018 NCT NUMBER: NCT03473236 Protocol 4004002 V.5.3 Final CLINICAL STUDY PROTOCOL A Phase 1A Single Ascending Dose and Multiple Ascending Dose Double-Blind, Placebo-Controlled, Randomized Trial of Oral Inhalation PK10571 in Healthy Adult Subjects PROTOCOL NUMBER 4004002 FINAL VERSION 5.3 DATE 31 Jul 2018 CONFIDENTIALITY STATEMENT The confidential information in this document is provided to you as an Investigator or consultant for review by you, your staff, and the applicable Institutional Review Board/Independent Ethics Committee. Your acceptance of this document constitutes agreement that you will not disclose the information herein to others without written authorization from World

In [70]:
tavily_docs[10]

Document(metadata={'title': 'PDF', 'source': 'https://cdn.clinicaltrials.gov/large-docs/36/NCT03473236/Prot_000.pdf', 'score': 0.06775009, 'images': ['https://m.media-amazon.com/images/I/51+k2UNavDL._SL500_.jpg', 'http://www.chla.org/sites/default/files/thumbnails/image/Infographic+-+Type+1+Diabetes.jpg', 'https://els-jbs-prod-cdn.jbs.elsevierhealth.com/cms/attachment/86c2d468-ccb7-4f9d-bf34-b2ce1cdf1549/gr1_lrg.jpg', 'https://www.trifectanutrition.com/hs-fs/hubfs/Diagrams-05.png?width=2108&name=Diagrams-05.png', 'https://www.thelancet.com/cms/attachment/e02070af-c2e4-4796-959d-3b2a08d155a0/gr1_lrg.jpg']}, page_content="Cover Page GB002, Inc. Gossamer Bio, Inc. CONFIDENTIAL CLINICAL STUDY PROTOCOL A Phase 1A Single Ascending Dose and Multiple Ascending Dose Double-Blind, Placebo-Controlled, Randomized Trial of Oral Inhalation PK10571 in Healthy Adult Subjects Protocol Number: 4004002 VERSION 5.3 DATE: 31 Jul 2018 NCT NUMBER: NCT03473236 Protocol 4004002 V.5.3 Final CLINICAL STUDY PROTO