Understand arXiv API results

In [1]:
import urllib.request as libreq
with libreq.urlopen('http://export.arxiv.org/api/query?search_query=all:electron&start=0&max_results=1') as url:
    r = url.read()
print(r)

b'<?xml version="1.0" encoding="UTF-8"?>\n<feed xmlns="http://www.w3.org/2005/Atom">\n  <link href="http://arxiv.org/api/query?search_query%3Dall%3Aelectron%26id_list%3D%26start%3D0%26max_results%3D1" rel="self" type="application/atom+xml"/>\n  <title type="html">ArXiv Query: search_query=all:electron&amp;id_list=&amp;start=0&amp;max_results=1</title>\n  <id>http://arxiv.org/api/cHxbiOdZaP56ODnBPIenZhzg5f8</id>\n  <updated>2025-10-23T00:00:00-04:00</updated>\n  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">231429</opensearch:totalResults>\n  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>\n  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:itemsPerPage>\n  <entry>\n    <id>http://arxiv.org/abs/cond-mat/0102536v1</id>\n    <updated>2001-02-28T20:12:09Z</updated>\n    <published>2001-02-28T20:12:09Z</published>\n    <title>Impact of Electron-Electron C
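The raw bytes above are an Atom feed, which is easier to inspect once parsed. Below is a minimal sketch using only the standard library that pulls out the total hit count and a few fields per entry. The function name `summarize_feed` and the sample document are hypothetical; only the two namespaces are taken from the response above.

```python
import xml.etree.ElementTree as ET

# Namespaces that appear in the arXiv Atom response shown above.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "opensearch": "http://a9.com/-/spec/opensearch/1.1/",
}

def summarize_feed(xml_bytes):
    """Return (total hit count, list of small dicts, one per <entry>)."""
    root = ET.fromstring(xml_bytes)
    total = int(root.findtext("opensearch:totalResults", default="0", namespaces=NS))
    entries = []
    for entry in root.findall("atom:entry", NS):
        title = entry.findtext("atom:title", default="", namespaces=NS)
        entries.append({
            "id": entry.findtext("atom:id", default="", namespaces=NS),
            # Titles may be wrapped over several lines; collapse the whitespace.
            "title": " ".join(title.split()),
            "published": entry.findtext("atom:published", default="", namespaces=NS),
        })
    return total, entries

# Hypothetical, hand-written sample in the same shape as the response above.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">2</opensearch:totalResults>
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2001-01-01T00:00:00Z</published>
    <title>A placeholder
      title</title>
  </entry>
</feed>"""

total, entries = summarize_feed(sample)
print(total, entries)
```

Passing the `r` variable from the cell above to `summarize_feed` should work the same way, assuming the response keeps this structure.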

Let's see the output of the planner stage, to determine the kind of queries to pass to the arxiv tool.

In [7]:
# Initialize the LLM
from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage

load_dotenv()

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")

In [48]:
system_prompt = """
    You are an expert research agent planner.

    You have access to the following tools:
    - arxiv_search: for academic papers
    - web_search: for general sources

    Your job is to take a user query and return a structured execution plan in STRICT JSON format.
    Do not include any text outside of the JSON. Do not explain your reasoning in prose. 

    The JSON must follow this schema exactly:

    {
    "plan": [
        {
        "tool": "<tool_name>",
        "purpose": "<why this step is included>",
        "query": {
            "search_terms": ["<list of exact search terms>"],
            "additional_focus": ["<list of optional focus keywords>"]
        },
        "rationale": "<why these parameters were chosen>"
        }
    ],
    "reflection": {
        "purpose": "<why reflection is needed>",
        "analysis_focus": ["<list of aspects to check>"],
        "rationale": "<why this reflection matters>"
    }
    }

    Return only valid JSON. Do not include markdown formatting, explanations, or extra text.
    """

# user_query = "write a report on evolution of LLMs"
user_query = "comparison of different image generation models"
response = llm.invoke(system_prompt + "\n" + user_query)

In [49]:
print(response.content)

```json
{
"plan": [
    {
    "tool": "web_search",
    "purpose": "To get a general overview of the major types of image generation models, their evolution, and common criteria used for comparison. This helps in identifying key models and categories before diving into deeper research.",
    "query": {
        "search_terms": ["image generation models comparison overview", "types of generative models for images"],
        "additional_focus": ["GANs", "VAEs", "Diffusion Models", "autoregressive models", "strengths", "weaknesses", "applications"]
    },
    "rationale": "Starting with a broad web search provides a foundational understanding, identifies prominent models and model types, and common comparison points, which guides subsequent, more focused searches."
    },
    {
    "tool": "arxiv_search",
    "purpose": "To find academic papers, surveys, or review articles that specifically compare different image generation models, focusing on technical details, performance metrics, and a

Now that we have the plan output, let's take one arXiv tool call and have the LLM generate arXiv search queries from it. The goal of this step is query expansion: we give the LLM the search terms and get back exact search queries.
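The tool call in the next cell is copied by hand from the plan. As a rough sketch, the same selection could be done in code, assuming the planner reply has already been parsed into a Python dict that follows the schema from the system prompt. The helper names `steps_for_tool` and `expansion_inputs` and the shortened `plan` value are hypothetical.

```python
def steps_for_tool(plan, tool_name):
    """Return the plan steps that target the given tool."""
    return [step for step in plan.get("plan", []) if step.get("tool") == tool_name]

def expansion_inputs(step):
    """Pull the two lists that the query-expansion prompt needs from one step."""
    query = step.get("query", {})
    return query.get("search_terms", []), query.get("additional_focus", [])

# Hypothetical, shortened plan in the schema defined by the system prompt.
plan = {
    "plan": [
        {"tool": "web_search",
         "query": {"search_terms": ["overview"], "additional_focus": []}},
        {"tool": "arxiv_search",
         "query": {"search_terms": ["survey image generation models"],
                   "additional_focus": ["FID", "benchmarking"]}},
    ],
    "reflection": {},
}

for step in steps_for_tool(plan, "arxiv_search"):
    search_terms, additional_focus = expansion_inputs(step)
    print(search_terms, additional_focus)
```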

In [50]:
# tool_call = {
#     "tool": "arxiv_search",
#     "purpose": "To identify the pivotal paper introducing the Transformer architecture, which is a cornerstone for modern LLMs.",
#     "query": {
#         "search_terms": [
#             "Attention Is All You Need"
#         ],
#         "additional_focus": [
#             "transformer architecture",
#             "sequence transduction"
#         ]
#     },
#     "rationale": "The Transformer paper marked a significant paradigm shift in neural network architectures for sequence modeling, directly leading to current LLMs."
# }

tool_call = {
    "tool": "arxiv_search",
    "purpose": "To find academic papers, surveys, or review articles that specifically compare different image generation models, focusing on technical details, performance metrics, and architectural differences.",
    "query": {
        "search_terms": ["survey image generation models", "comparison generative models image synthesis", "review diffusion models GANs VAEs"],
        "additional_focus": ["performance metrics", "FID", "IS", "perceptual quality", "diversity", "computational cost", "architectures", "benchmarking"]
    },
    "rationale": "Arxiv is crucial for accessing peer-reviewed research that provides rigorous and detailed comparisons, often including quantitative evaluations and in-depth technical analyses of various models."
    }

search_terms = tool_call["query"]["search_terms"]
additional_focus = tool_call["query"]["additional_focus"]

query_expansion_prompt = f"""
You are an expert at constructing arxiv API queries. 
Given the following search terms and additional focus terms, generate efficient arXiv API queries.

Requirements:
- Always include the exact search_terms verbatim.
- Incorporate additional_focus terms.
- Use arXiv field prefixes where appropriate:
  - ti: for title
  - abs: for abstract
  - cat:cs.CL for computational linguistics
- Combine terms with AND/OR for precision.
- Return 2-3 queries max.

Search terms: {search_terms}
Additional focus: {additional_focus}

Return only a JSON list of objects. 
Each object must have:
- "search_query": a valid arXiv API query string
- "max_results": an integer (default 5)

Do not include explanations, markdown, or extra text.

"""

queries = llm.invoke(query_expansion_prompt)

In [29]:
# Slice off the Markdown code fence around the reply (7 leading and 3 trailing characters).
print(queries.content[7:-3])


[
  {
    "search_query": "ti:\"Attention Is All You Need\" AND (abs:\"transformer architecture\" OR abs:\"sequence transduction\") AND cat:cs.CL",
    "max_results": 5
  },
  {
    "search_query": "abs:\"transformer architecture\" AND abs:\"sequence transduction\" AND cat:cs.CL",
    "max_results": 5
  },
  {
    "search_query": "\"Attention Is All You Need\" AND abs:\"transformer architecture\" AND cat:cs.CL",
    "max_results": 5
  }
]
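The `[7:-3]` slice only works when the reply is wrapped in exactly a Markdown `json` fence with no surrounding whitespace. A slightly more tolerant sketch is shown below; it assumes the reply content is a plain string, and the helper name `loads_fenced` is hypothetical.

```python
import json
import re

# Build the fence marker indirectly so this snippet does not contain a literal fence.
FENCE = "`" * 3

# Optional fence with an optional language tag around the JSON payload.
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)

def loads_fenced(text):
    """Parse JSON from a reply that may or may not be wrapped in a code fence."""
    match = FENCE_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)

fenced_reply = FENCE + 'json\n[{"search_query": "ti:example", "max_results": 5}]\n' + FENCE
plain_reply = '[{"search_query": "ti:example", "max_results": 5}]'

print(loads_fenced(fenced_reply))
print(loads_fenced(plain_reply))
```

With this helper, `loads_fenced(queries.content)` would replace both the slice and the separate `json.loads` call.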



Now that we have a structured JSON list of search queries as a string, we first need to parse it into a Python object with json.loads. We then iterate over the resulting list and send each search_query as one arXiv API call.

In [51]:
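import json

# Same slice as in the previous cell: it removes the Markdown code fence around the reply.
queries_json_string = queries.content[7:-3]
# print(queries_json_string)
print(type(queries_json_string))


# json.loads returns a list of dicts here, despite the "_dict" suffix in the name.
queries_json_dict = json.loads(queries_json_string)
print(queries_json_dict)
print(type(queries_json_dict))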

<class 'str'>
[{'search_query': '((ti:"survey image generation models" OR abs:"comparison generative models image synthesis" OR ti:"review diffusion models GANs VAEs") AND ("performance metrics" OR FID OR IS OR "perceptual quality" OR diversity OR "computational cost" OR architectures OR benchmarking))', 'max_results': 5}, {'search_query': '("review diffusion models GANs VAEs" OR (abs:"diffusion models" AND (GANs OR VAEs))) AND (abs:architectures OR abs:FID OR abs:IS OR abs:"perceptual quality" OR abs:diversity OR abs:"computational cost")', 'max_results': 5}, {'search_query': '((ti:"survey image generation models" OR abs:"comparison generative models image synthesis") AND (ti:benchmarking OR abs:benchmarking)) AND ("performance metrics" OR diversity OR "computational cost")', 'max_results': 5}]
<class 'list'>
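The parsed list comes from an LLM, so its shape is not guaranteed. Below is a small sketch of a sanity check before any API call is made. The helper name `normalize_queries` and the `hard_cap` limit are hypothetical; the default of 5 mirrors the query-expansion prompt.

```python
def normalize_queries(items, default_max=5, hard_cap=50):
    """Keep only well-formed query objects and coerce max_results to a sane int."""
    cleaned = []
    for item in items:
        if not isinstance(item, dict):
            continue  # skip anything that is not a JSON object
        search_query = item.get("search_query")
        if not isinstance(search_query, str) or not search_query.strip():
            continue  # skip objects without a usable query string
        try:
            max_results = int(item.get("max_results", default_max))
        except (TypeError, ValueError):
            max_results = default_max
        # Clamp to [1, hard_cap]; hard_cap is an arbitrary illustrative limit.
        max_results = max(1, min(max_results, hard_cap))
        cleaned.append({"search_query": search_query.strip(), "max_results": max_results})
    return cleaned

# Hypothetical input mixing valid and malformed items.
raw_items = [
    {"search_query": " ti:example ", "max_results": "3"},
    {"search_query": "abs:example"},
    {"max_results": 5},
    "not a dict",
    {"search_query": "ti:big", "max_results": 1000},
]
print(normalize_queries(raw_items))
```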


For each query in queries_json_dict, we build a URL and pass it to the arXiv search API.

In [52]:
import feedparser
from urllib.parse import quote

base_url = 'http://export.arxiv.org/api/query?'
results = []
default_max_results = 5

for query in queries_json_dict:
    search_query = quote(query["search_query"])  # URL-encode
    max_results = query.get("max_results", default_max_results)  # honor the per-query limit
    url = base_url + f"search_query={search_query}&max_results={max_results}&sortBy=submittedDate&sortOrder=descending"
    print(url)
    feed = feedparser.parse(url)
    results.append(feed.entries)

print(results)

http://export.arxiv.org/api/query?search_query=%28%28ti%3A%22survey%20image%20generation%20models%22%20OR%20abs%3A%22comparison%20generative%20models%20image%20synthesis%22%20OR%20ti%3A%22review%20diffusion%20models%20GANs%20VAEs%22%29%20AND%20%28%22performance%20metrics%22%20OR%20FID%20OR%20IS%20OR%20%22perceptual%20quality%22%20OR%20diversity%20OR%20%22computational%20cost%22%20OR%20architectures%20OR%20benchmarking%29%29&max_results=5&sortBy=submittedDate&sortOrder=descending
http://export.arxiv.org/api/query?search_query=%28%22review%20diffusion%20models%20GANs%20VAEs%22%20OR%20%28abs%3A%22diffusion%20models%22%20AND%20%28GANs%20OR%20VAEs%29%29%29%20AND%20%28abs%3Aarchitectures%20OR%20abs%3AFID%20OR%20abs%3AIS%20OR%20abs%3A%22perceptual%20quality%22%20OR%20abs%3Adiversity%20OR%20abs%3A%22computational%20cost%22%29&max_results=5&sortBy=submittedDate&sortOrder=descending
http://export.arxiv.org/api/query?search_query=%28%28ti%3A%22survey%20image%20generation%20models%22%20OR%20abs%3A
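The URLs above are assembled by hand with an f-string. A sketch of the same construction with `urllib.parse.urlencode` follows, together with a pause between requests. The helper names `build_url` and `fetch_all` are hypothetical, and the 3-second default is an assumption based on arXiv's request to space out API calls; adjust it to the current terms of use.

```python
import time
from urllib.parse import urlencode, quote

BASE_URL = "http://export.arxiv.org/api/query?"

def build_url(search_query, max_results=5, start=0):
    """Build an arXiv API URL, sorted by submission date (newest first)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # quote_via=quote encodes spaces as %20, matching the URLs printed above.
    return BASE_URL + urlencode(params, quote_via=quote)

def fetch_all(queries, fetch, delay_seconds=3.0):
    """Call fetch(url) once per query, sleeping between consecutive requests."""
    results = []
    for index, query in enumerate(queries):
        if index:
            time.sleep(delay_seconds)
        results.append(fetch(build_url(query["search_query"], query.get("max_results", 5))))
    return results

url = build_url('abs:"diffusion models" AND (GANs OR VAEs)', max_results=5)
print(url)
```

Passing `feedparser.parse` as `fetch` would reproduce the loop above, with each list item being a parsed feed rather than its entries.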

Let's explore the results of the arxiv search

In [53]:
# print(len(results))
# print(results[0])
# print(results[1])
# print(results[2])

for entries in results:
    for result in entries:
        # print(result)
        print("Title:", result.title)
        print("Authors:", [author['name'] for author in result['authors']])
        print("Published:", result['published'])
        print("Summary:", result['summary'])
        print("Link:", result['link'])
        print("\n")

Title: BadGraph: A Backdoor Attack Against Latent Diffusion Model for
  Text-Guided Graph Generation
Authors: ['Liang Ye', 'Shengqin Chen', 'Jiazhu Dai']
Published: 2025-10-23T17:54:17Z
Summary: The rapid progress of graph generation has raised new security concerns,
particularly regarding backdoor vulnerabilities. While prior work has explored
backdoor attacks in image diffusion and unconditional graph generation,
conditional, especially text-guided graph generation remains largely
unexamined. This paper proposes BadGraph, a backdoor attack method targeting
latent diffusion models for text-guided graph generation. BadGraph leverages
textual triggers to poison training data, covertly implanting backdoors that
induce attacker-specified subgraphs during inference when triggers appear,
while preserving normal performance on clean inputs. Extensive experiments on
four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the
effectiveness and stealth of the attack: less than 10% 
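Because several expanded queries can return the same paper, and titles arrive wrapped across lines as in the output above, it may help to flatten the nested result lists into clean, unique records. The sketch below assumes each entry behaves like a dict with the keys printed above; the helper name `to_records` is hypothetical, and the PDF lookup assumes the feed marks that link with type `application/pdf`.

```python
def to_records(results):
    """Flatten per-query entry lists into unique records keyed by entry id."""
    seen = set()
    records = []
    for entries in results:
        for entry in entries:
            entry_id = entry.get("id") or entry.get("link")
            if entry_id in seen:
                continue  # the same paper was returned by another query
            seen.add(entry_id)
            # Assumption: the PDF link is the one typed application/pdf.
            pdf_link = next(
                (link.get("href") for link in entry.get("links", [])
                 if link.get("type") == "application/pdf"),
                None,
            )
            records.append({
                "id": entry_id,
                # Collapse line breaks and runs of spaces in wrapped text.
                "title": " ".join(entry.get("title", "").split()),
                "summary": " ".join(entry.get("summary", "").split()),
                "link": entry.get("link"),
                "pdf_link": pdf_link,
            })
    return records

# Hypothetical stand-ins for feedparser entries.
sample_results = [
    [{"id": "x1", "title": "A wrapped\n  title", "summary": "Some\ntext",
      "link": "http://example.org/abs/x1",
      "links": [{"href": "http://example.org/pdf/x1", "type": "application/pdf"}]}],
    [{"id": "x1", "title": "A wrapped\n  title", "summary": "Some\ntext",
      "link": "http://example.org/abs/x1", "links": []}],
]
records = to_records(sample_results)
print(records)
```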

Perfect! Now we have a list of papers with their summaries and links. Note that the link field points to the abstract page rather than the PDF. We can store the title, summary, and link of each paper in a vector DB and use it to retrieve the top-k relevant documents for the reflection step.
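To make the retrieval idea concrete without committing to a particular vector DB, here is a toy sketch of top-k retrieval. It swaps real embeddings for bag-of-words counts with cosine similarity, so it only illustrates the flow, not the quality a proper embedding model would give. All names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, docs, k=3):
    """Return up to k docs with a non-zero similarity to the query, best first."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(doc["title"] + " " + doc["summary"]))), doc)
        for doc in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

# Hypothetical records in the title + summary + link shape described above.
docs = [
    {"title": "Diffusion models for image synthesis",
     "summary": "A survey of diffusion models.", "link": None},
    {"title": "Graph backdoor attacks",
     "summary": "Security of graph generation.", "link": None},
]
print(top_k("image diffusion survey", docs, k=1))
```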

In [5]:
import feedparser

url = 'http://export.arxiv.org/api/query?search_query=ti:LLM&max_results=15'
feed = feedparser.parse(url)

print(feed.entries)
print(len(feed.entries))
# print(feed.entries[0].keys())

# for entry in feed.entries:
#     print("Title:", entry.title)
#     print("Authors:", [author.name for author in entry.authors])
#     print("Published:", entry.published)
#     print("Summary:", entry.summary)
#     print("Link:", entry.link)


[{'id': 'http://arxiv.org/abs/1601.06914v1', 'guidislink': True, 'link': 'http://arxiv.org/abs/1601.06914v1', 'updated': '2016-01-26T07:50:51Z', 'updated_parsed': time.struct_time(tm_year=2016, tm_mon=1, tm_mday=26, tm_hour=7, tm_min=50, tm_sec=51, tm_wday=1, tm_yday=26, tm_isdst=0), 'published': '2016-01-26T07:50:51Z', 'published_parsed': time.struct_time(tm_year=2016, tm_mon=1, tm_mday=26, tm_hour=7, tm_min=50, tm_sec=51, tm_wday=1, tm_yday=26, tm_isdst=0), 'title': 'LLM Magnons', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://export.arxiv.org/api/query?search_query=ti:LLM&max_results=15', 'value': 'LLM Magnons'}, 'summary': 'We consider excitations of LLM geometries described by coloring the LLM plane\nwith concentric black rings. Certain closed string excitations are localized at\nthe edges of these rings. The string theory predictions for the energies of\nmagnon excitations of these strings depends on the radii of the edges of the\nrings. In this article
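The first hit, "LLM Magnons", concerns LLM geometries in string theory rather than language models, which suggests that a bare `ti:LLM` query is too broad. One possible refinement, sketched below, is to combine the title term with category filters using the `cat:` prefix. The helper name `fielded_query` and the choice of `cs.CL` and `cs.AI` are assumptions made for illustration.

```python
from urllib.parse import urlencode, quote

def fielded_query(title_term, categories):
    """Combine a title term with an OR-group of arXiv category filters."""
    # Quote multi-word title terms so they are matched as a phrase.
    if " " in title_term:
        title_term = f'"{title_term}"'
    category_clause = " OR ".join(f"cat:{category}" for category in categories)
    return f"ti:{title_term} AND ({category_clause})"

search_query = fielded_query("LLM", ["cs.CL", "cs.AI"])
refined_url = "http://export.arxiv.org/api/query?" + urlencode(
    {"search_query": search_query, "max_results": 15}, quote_via=quote
)
print(search_query)
print(refined_url)
```

Parsing `refined_url` with `feedparser.parse`, as in the cell above, would show whether the category filter removes the unrelated physics results.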