Key Components of ArXivIntelliRAG
Real-World Research Data
The system retrieves papers live from the arXiv API, so users work with current scientific literature rather than a static sample.

Modern NLP Ecosystem
It integrates powerful natural language processing techniques, including semantic embeddings, BERT-based relevance scoring, and large language model summarization, to make dense research content easier to navigate and understand.

End-to-End Intelligence Pipeline
From document indexing to retrieval and summarization, the project demonstrates a complete retrieval-augmented generation (RAG) pipeline.

Practical and User-Centric
This tool is built to support researchers, students, and professionals by helping them quickly identify and comprehend the most relevant academic papers aligned with their queries.

In [23]:
# This cell is used to get research paper data from the arXiv website using their API.

# Importing necessary libraries:
# requests is used to make web requests,
# BeautifulSoup helps extract specific parts of the XML response,
# pandas is used to store and work with the extracted data in table form.
import requests
from bs4 import BeautifulSoup
import pandas as pd

# This function connects to the arXiv API and downloads papers based on a search query.
def scrape_arxiv(query, max_results=50):
    # This is the base URL for the arXiv API
    base_url = "http://export.arxiv.org/api/query"
    
    # These are the parameters we send to the API, including the search keyword and the number of results we want
    params = {
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results
    }
    
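    # Send a GET request to the API with the parameters (the timeout keeps the cell from hanging)
    response = requests.get(base_url, params=params, timeout=30)
    
    # If the response is not successful, raise an error that includes the HTTP status code
    if response.status_code != 200:
        raise Exception(f"Failed to fetch data from arXiv API (HTTP {response.status_code})")

    # Parse the Atom XML response. The "xml" feature (backed by lxml) is a true XML parser,
    # which avoids the warning BeautifulSoup emits when XML is parsed with the "lxml" HTML parser
    soup = BeautifulSoup(response.content, "xml")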
    
    # Create an empty list to store the data for each paper
    papers = []
    
    # Go through each paper entry in the XML and extract required fields
    for entry in soup.find_all("entry"):
        # Extract the title and remove extra spaces
        title = entry.find("title").text.strip()
        
        # Extract the abstract and remove line breaks
        summary = entry.find("summary").text.strip().replace('\n', ' ')
        
        # Extract the arXiv URL, which links to the paper's abstract page
        arxiv_url = entry.find("id").text.strip()
        
        # Convert the arXiv abstract page link to a direct PDF link
        # Replace only the "/abs/" path segment so any other "abs" in the URL is left untouched
        pdf_url = arxiv_url.replace("/abs/", "/pdf/", 1) + ".pdf"
        
        # Add the extracted information as a dictionary to the papers list
        papers.append({
            "title": title,
            "abstract": summary,
            "arxiv_url": arxiv_url,
            "pdf_url": pdf_url
        })
    
    # Convert the list of paper dictionaries into a pandas DataFrame and return it
    return pd.DataFrame(papers)
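
The field extraction in the function above can be checked without network access. The following is a minimal sketch, not the notebook's own method: it swaps BeautifulSoup for the standard-library `xml.etree.ElementTree`, and the feed string, the identifier `0000.00000v1`, and the helper name `parse_entries` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# The arXiv API returns an Atom feed; Atom elements live in this namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, hand-written feed with a single entry (not real arXiv data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>  A Hypothetical
      Paper Title </title>
    <summary>First line of the abstract.
Second line of the abstract.</summary>
  </entry>
</feed>"""


def parse_entries(feed_xml):
    """Extract title, abstract, and URLs from an Atom feed string."""
    root = ET.fromstring(feed_xml)
    rows = []
    for entry in root.findall("atom:entry", ATOM_NS):
        # split() + join() collapses line breaks and repeated spaces in one step
        title = " ".join(entry.findtext("atom:title", default="", namespaces=ATOM_NS).split())
        abstract = " ".join(entry.findtext("atom:summary", default="", namespaces=ATOM_NS).split())
        arxiv_url = entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip()
        # Swap only the "/abs/" path segment to build the PDF link
        pdf_url = arxiv_url.replace("/abs/", "/pdf/", 1) + ".pdf"
        rows.append({"title": title, "abstract": abstract,
                     "arxiv_url": arxiv_url, "pdf_url": pdf_url})
    return rows


rows = parse_entries(SAMPLE_FEED)
print(rows[0]["title"])
print(rows[0]["pdf_url"])
```

Running the parser on a fixed string like this makes it easy to confirm the whitespace cleanup and the PDF link construction before spending API requests.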

In [24]:
# This cell saves the scraped research papers to a CSV file.

# Define the search keyword you want to look up on arXiv
query = "AI Agents"

# Call the scrape_arxiv function with the query to get the list of papers
papers_df = scrape_arxiv(query)

# Save the list of papers (as a DataFrame) to a CSV file named 'arxiv_papers.csv'
# Setting index to False means it won't save row numbers in the CSV
papers_df.to_csv("arxiv_papers.csv", index=False)

# Print a message to confirm the file has been saved
print("Saved papers to arxiv_papers.csv")

Saved papers to arxiv_papers.csv



In [28]:
# This cell loads the saved research papers from the CSV file so we can work with them again.

# Import pandas library to handle data in table format
import pandas as pd

# Read the CSV file that was saved earlier into a DataFrame
papers_df = pd.read_csv("arxiv_papers.csv")

# Get the 'abstract' column from the DataFrame and convert it to a list
# This will help us process the text of each paper separately
abstracts = papers_df["abstract"].tolist()

# Get the 'title' column from the DataFrame and convert it to a list
# This allows us to match titles with their abstracts later
titles = papers_df["title"].tolist()


In [29]:
# This cell creates vector embeddings from research paper abstracts and stores them in FAISS for fast semantic search.

# Import FAISS to build a fast similarity search index
# Import HuggingFaceEmbeddings to convert text (abstracts) into numerical vector representations
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# Print a message to indicate the start of the embedding process
print("Generating embeddings...")

# We are using the "all-MiniLM-L6-v2" embedding model from the sentence-transformers library
# This model is a good trade-off between performance and speed
# It generates high-quality sentence embeddings and is much faster and lighter than larger transformer models

# Why use this model:
# - It’s designed specifically for generating embeddings for entire sentences or documents (like abstracts)
# - It is widely used for semantic similarity tasks, such as retrieving documents related to a search query
# - It works well out of the box for most general-purpose retrieval and clustering tasks

# Other model options include:
# - "pritamdeka/S-Biomed-Roberta-snli-multinli-stsb": Better suited for biomedical or scientific text (slightly heavier)
# - "sentence-transformers/all-mpnet-base-v2": Higher accuracy but slower than MiniLM
# - "allenai/scibert_scivocab_uncased": More scientific-domain focused, but not directly compatible with sentence-transformers

# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Convert each abstract into a vector and store it in FAISS for similarity-based retrieval
# FAISS allows us to later find abstracts that are semantically similar to any given query
vectorstore = FAISS.from_texts(abstracts, embedding_model)

Generating embeddings...
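
Conceptually, a similarity search over the vector store compares the query embedding with every stored abstract embedding and returns the closest ones. The sketch below shows that idea with a brute-force search. It assumes Euclidean (L2) distance and uses toy two-dimensional vectors in place of the 384-dimensional MiniLM embeddings; which metric the real index uses depends on its configuration and is treated here as an assumption. The function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors closest to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: l2_distance(query_vec, doc_vecs[i]))
    return order[:k]


# Toy vectors standing in for abstract embeddings (illustrative values only).
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
toy_query_vec = [1.0, 0.0]

nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(nearest)  # indices of the two closest toy documents
```

A dedicated library such as FAISS exists because this linear scan becomes slow as the number of stored vectors grows; for the 50 abstracts fetched here, either approach would be fast.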


In [30]:
# This cell sets up the LangChain pipeline to answer questions or summarize using the retrieved abstracts.

# Import RetrievalQA to create a question-answering or summarization chain
# Import Ollama to use a local or lightweight language model
from langchain.chains import RetrievalQA
from langchain.llms import Ollama

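# Initialize the language model using Ollama
# This requires a running Ollama server with the model already pulled (for example, `ollama pull llama3.2`)
# We are using the LLaMA 3.2 model, which is capable of understanding and summarizing text
# The "llama3.2" tag is assumed here to resolve to an instruction-tuned build, which helps the model follow instructions like "summarize this"
llm = Ollama(model="llama3.2")  # You can replace this with another LLM such as GPT, Gemini, or DeepSeek if needed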

# Convert the FAISS vector store into a retriever object
# This retriever will allow the system to find the most relevant abstracts when given a question
retriever = vectorstore.as_retriever()

# Combine the retriever and the language model into a RetrievalQA pipeline
# This pipeline takes a query, retrieves the best matching documents (abstracts), and uses the LLM to answer or summarize
qa_pipeline = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
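
To make the retrieve-then-generate step concrete, the sketch below shows the general "stuff" idea: the retrieved abstracts are concatenated into one prompt together with the question. This is an illustration only. The template wording and the function name `build_stuffed_prompt` are hypothetical and are not LangChain's actual prompt or API.

```python
def build_stuffed_prompt(question, retrieved_texts):
    """Concatenate retrieved passages and the question into a single prompt string."""
    context = "\n\n".join(retrieved_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Toy passages standing in for retrieved abstracts.
toy_context = [
    "Abstract A discusses agent benchmarks.",
    "Abstract B discusses tool use.",
]
prompt = build_stuffed_prompt("What do these papers evaluate?", toy_context)
print(prompt)
```

One consequence of this design is that the prompt grows with the number and length of retrieved abstracts, so the retriever's result count has to fit within the language model's context window.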


In [31]:
# This cell defines a scoring function to rank the most relevant papers using SciBERT.

# Import torch so that inference can run without tracking gradients
import torch

# Import the BERT tokenizer and model classes; the checkpoint loaded below (SciBERT) was pretrained on scientific papers
# This helps the model better handle scientific vocabulary and sentence structure
from transformers import BertTokenizer, BertForSequenceClassification

# Import softmax to convert model output into probability scores
import torch.nn.functional as F

# Define the SciBERT model name. This version is trained specifically on scientific text.
bert_model_name = "allenai/scibert_scivocab_uncased"

# Load the tokenizer, which breaks the input text into tokens that the model can understand
bert_tokenizer = BertTokenizer.from_pretrained(bert_model_name)

# Load SciBERT with a sequence classification head.
# Note: the checkpoint does not include a trained classification head, so the head is randomly
# initialized (see the warning printed after this cell). Until the model is fine-tuned on a
# relevance task, the scores it produces should not be read as meaningful relevance estimates.
bert_model = BertForSequenceClassification.from_pretrained(bert_model_name)

# This function ranks a list of documents (abstracts) based on the model's score for each (query, abstract) pair
def rank_papers(query, docs):
    # Print message to indicate that ranking is in progress
    print("Ranking using BERT relevance...")
    
    # Create a list to store the relevance scores
    scores = []

    # Go through each abstract in the list of documents
    for text in docs:
        # Encode the query and the abstract together as one input pair for the model
        # Truncation keeps the pair within the model's 512-token limit
        inputs = bert_tokenizer(query, text, return_tensors="pt", truncation=True, max_length=512)

        # Pass the input into the model to get output logits (no gradients are needed for inference)
        with torch.no_grad():
            logits = bert_model(**inputs).logits

        # Convert the logits into probabilities using softmax
        probs = F.softmax(logits, dim=1)

        # Take the probability score for the first class as the relevance score
        score = probs[0][0].item()

        # Add the score to the list
        scores.append(score)
    
    # Combine scores with their corresponding abstracts and sort them from highest to lowest score
    return sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
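
The warning above means the classification head has not been trained, so the ordering produced by `rank_papers` is not yet a reliable relevance ranking. One way to keep the rest of the pipeline usable in the meantime is to separate the ranking logic from the scorer, so that a trained scorer can be plugged in later. The sketch below is an assumption-laden illustration: `rank_with_scorer` and `token_overlap_score` are hypothetical names, and the word-overlap (Jaccard) score is a simple stand-in, not a substitute for a fine-tuned model.

```python
def token_overlap_score(query, text):
    """Jaccard overlap between the lowercase word sets of query and text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens or not text_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens | text_tokens)


def rank_with_scorer(query, docs, score_fn):
    """Score every document with score_fn and sort from highest to lowest score."""
    scored = [(score_fn(query, text), text) for text in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy documents standing in for retrieved abstracts.
toy_docs = [
    "cooking pasta at home",
    "evaluation metrics for agents",
]
ranked = rank_with_scorer("agent evaluation metrics", toy_docs, token_overlap_score)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

Because `score_fn` is just a callable taking `(query, text)`, a fine-tuned SciBERT classifier or another trained relevance model could replace the overlap score without changing the ranking code.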


In [36]:
# This cell retrieves the most relevant abstracts for a specific query and ranks them using the SciBERT model.

# Define the user’s search query.
# This is the topic or question we want to explore in the research papers.
query = "AI Agents Evaluation Metrics"

# Use the retriever (based on FAISS + embeddings) to find documents (abstracts) most similar to the query.
# This gives us a list of documents that are likely related to the topic.
retrieved_docs = retriever.get_relevant_documents(query)

# Extract just the text content (abstracts) from the retrieved documents.
# These will be passed into the ranking function.
retrieved_texts = [doc.page_content for doc in retrieved_docs]

# Use the SciBERT-based ranking function to sort the documents by how relevant they are to the query.
# This helps us prioritize which papers are most likely useful.
ranked_results = rank_papers(query, retrieved_texts)


Ranking using BERT relevance...


In [37]:
# This cell summarizes the top-ranked abstracts using the language model pipeline
# and also prints the paper title and the arXiv PDF URL.

# Print a heading to indicate that summarization is starting
print("Top Paper Summaries:")

# This is the instruction we give to the language model
# It tells the model to convert the abstract into a simpler, more understandable summary
summary_prompt = "Summarize this research abstract in simple terms:\n"

# Go through the top 3 ranked abstracts (most relevant to the query)
for i, (score, abstract) in enumerate(ranked_results[:3], 1):
    
    # Find the index of the abstract in the original list so we can get title and URLs
    idx = abstracts.index(abstract)
    
    # Extract title and PDF URL from the original DataFrame using the index
    title = papers_df["title"][idx]
    pdf_url = papers_df["pdf_url"][idx]
    
    # Print the paper number
    print(f"\n📄 Paper {i}: {title}")
    
    # Print the PDF link for the full paper
    print("Link to full paper:", pdf_url)
    
    # Print the original abstract so we know what is being summarized
    print("Abstract:", abstract)

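    # Use the LangChain QA pipeline to generate a summary using the LLM
    # We pass the abstract along with the summarization prompt; note that RetrievalQA also
    # retrieves related abstracts for this text and adds them to the LLM's context
    summary = qa_pipeline.run(f"{summary_prompt}{abstract}")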

    # Print the generated summary
    print("🔍 Summary:", summary)

Top Paper Summaries:

📄 Paper 1: AI Agents: Evolution, Architecture, and Real-World Applications
Link to full paper: http://arxiv.org/pdf/2503.12687v1.pdf
Abstract: This paper examines the evolution, architecture, and practical applications of AI agents from their early, rule-based incarnations to modern sophisticated systems that integrate large language models with dedicated modules for perception, planning, and tool use. Emphasizing both theoretical foundations and real-world deployments, the paper reviews key agent paradigms, discusses limitations of current evaluation benchmarks, and proposes a holistic evaluation framework that balances task effectiveness, efficiency, robustness, and safety. Applications across enterprise, personal assistance, and specialized domains are analyzed, with insights into future research directions for more resilient and adaptive AI agent systems.
🔍 Summary: This paper looks at how Artificial Intelligence (AI) agents have developed over time and how th
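
The cell above recovers each paper's title and link by searching the `abstracts` list for the abstract text. A dictionary keyed by abstract is one alternative that avoids a linear search per paper and makes the handling of duplicates and missing entries explicit. The sketch below is illustrative: the records, URLs, and the helper name `build_metadata_lookup` are hypothetical.

```python
def build_metadata_lookup(records):
    """Map each abstract to its metadata record, keeping the first record on duplicates."""
    lookup = {}
    for record in records:
        # setdefault leaves an existing entry in place if the abstract was already seen
        lookup.setdefault(record["abstract"], record)
    return lookup


# Toy records standing in for rows of the papers table.
toy_records = [
    {"title": "Toy Paper A", "abstract": "Abstract text A", "pdf_url": "http://example.org/a.pdf"},
    {"title": "Toy Paper B", "abstract": "Abstract text B", "pdf_url": "http://example.org/b.pdf"},
]

lookup = build_metadata_lookup(toy_records)
match = lookup.get("Abstract text B")
missing = lookup.get("Not in the corpus")  # returns None instead of raising an error

print(match["title"], match["pdf_url"])
print(missing)
```

With pandas, the input records could be produced by `papers_df.to_dict("records")`; unlike `list.index`, `dict.get` returns `None` for an unknown abstract rather than raising `ValueError`.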

In [39]:
# This cell builds top_results and saves them to a Markdown file.
# It compiles the top 3 ranked papers along with their title, abstract, PDF URL, LLM-generated summary, and relevance score.
# The final results are written in Markdown so they are easy to read.

# Create an empty list to hold all the results
top_results = []

# Loop through the top 3 most relevant abstracts based on the query
for i, (score, abstract) in enumerate(ranked_results[:3], 1):
    
    # Find the index of the abstract in the original list to get related metadata
    title_idx = abstracts.index(abstract)

    # Extract the paper's title and PDF URL from the original DataFrame
    title = papers_df["title"][title_idx]
    pdf_url = papers_df["pdf_url"][title_idx]

    # Create the summarization prompt and pass the abstract to the LLM
    summary_prompt = f"Summarize this research abstract:\n{abstract}"
    summary = qa_pipeline.run(summary_prompt)

    # Store all relevant information in a dictionary and add it to the list
    top_results.append({
        "Title": title,
        "PDF URL": pdf_url,
        "Abstract": abstract,
        "Summary": summary,
        "Relevance Score": score
    })

# Convert the list of dictionaries into a pandas DataFrame
top_df = pd.DataFrame(top_results)

# Save the results in Markdown format for human-friendly viewing
with open("top_summarized_papers.md", "w", encoding="utf-8") as f:
    for i, row in top_df.iterrows():
        f.write(f"## 📄 Paper {i+1}: {row['Title']}\n")
        f.write(f"**PDF Link:** {row['PDF URL']}\n\n")
        f.write(f"**Relevance Score:** {row['Relevance Score']:.4f}\n\n")
        f.write(f"**Abstract:**\n{row['Abstract']}\n\n")
        f.write(f"**🔍 Summary:**\n{row['Summary']}\n\n")
        f.write("---\n\n")

# Print confirmation after saving the file
print("Results saved to top_summarized_papers.md")


Results saved to top_summarized_papers.md
