## Step 1: Mounting Google Drive


In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data				LICENSE		 qa_pairs   wandb
deployment			models		 README.md
eval_predictions_baseline.json	notebooks	 results
gpt4o_judgments_baseline.json	project_plan.md  scripts


## Step 2: Importing Necessary Libraries and Functions

In [10]:
import sys
import os, re
import json
sys.path.append('./scripts')

from arxiv_scraper_rag import search_arxiv, filter_papers, download_papers

## Step 3: Define Search Queries and Relevance Keywords

### Query & Relevance Setup for RAG Corpus Construction

Before retrieving papers from arXiv, we define two critical components:

1. **Multi-query Search Space (`queries`)**  
   These are natural language search phrases used to query the arXiv API.  
   Each query represents a high-level concept or subdomain within instruction fine-tuning, parameter-efficient methods, and retrieval-augmented generation (RAG).  
   The purpose is to **maximize coverage** across recent papers that might use different terminologies for similar ideas.

2. **Keyword List for Relevance Scoring (`keywords`)**  
   After retrieving papers via the queries, we **score each paper’s title + abstract** by counting keyword hits.  
   The higher the match count, the more relevant the paper is assumed to be.  
   This step ensures we prioritize **domain-specific, high-signal documents** for downstream chunking and retrieval.

> These two lists control both **breadth of retrieval** (queries) and **precision of filtering** (keywords).

In [3]:
# Define your multi‐query search space
queries = [
    "instruction tuning",
    "transformer fine-tuning",
    "parameter-efficient fine-tuning"
    "retrieval augmented generation",
    "LoRA",
    "prompt tuning"
]

In [4]:
# Define the keywords for relevance scoring
keywords = [
    "LoRA", "QLoRA", "parameter-efficient", "supervised fine-tuning",
    "adapter", "SFT", "instruction tuning", "prompt tuning",
    "RAG", "semantic search", "vector database"
]

## Step 4: Retrieve Papers from arXiv for Each Query

For each of the predefined queries, we call the `search_arxiv()` function to retrieve up to 100 recent papers from the arXiv API.

- We restrict the search to relevant subfields using arXiv categories:  
  - `cs.LG` (Machine Learning)  
  - `cs.CL` (Computation and Language)  
  - `cs.AI` (Artificial Intelligence)

Each query returns a set of papers, which we then append to a master list (`all_papers`).  
Note that this step **may introduce duplicates** (e.g., the same paper retrieved by multiple queries), which we’ll address during the filtering phase.

> This step prioritizes **recall**—gathering as many potentially relevant documents as possible before scoring and ranking them.

In [5]:
# Aggregate raw hits across all queries
all_papers = []
for q in queries:
    papers = search_arxiv(
        query=q,
        max_results=100,               # fetch up to 100 per query
        start=0,
        categories=['cs.LG', 'cs.CL', 'cs.AI']  # narrow to ML/NLP/AI
    )
    print(f"[QUERY] '{q}' → retrieved {len(papers)} papers")
    all_papers.extend(papers)

[QUERY] 'instruction tuning' → retrieved 74 papers
[QUERY] 'transformer fine-tuning' → retrieved 100 papers
[QUERY] 'parameter-efficient fine-tuningretrieval augmented generation' → retrieved 100 papers
[QUERY] 'LoRA' → retrieved 100 papers
[QUERY] 'prompt tuning' → retrieved 100 papers


## Step 5: Filter, Deduplicate, and Rank Papers by Relevance + Recency

Once all papers have been retrieved, we pass them into the `filter_papers()` function, which performs the following steps:

1. **Date Filtering**  
   - Only include papers published in **2021 or later**, ensuring the corpus reflects **recent advancements** in the field (e.g., LoRA, QLoRA, RAG).

2. **Keyword-Based Relevance Scoring**  
   - Each paper is scored based on the number of keyword matches in its **title and abstract**.
   - This produces a `relevance_score` that captures domain-specific salience.

3. **Deduplication by arXiv ID**  
   - Ensures that the same paper (possibly retrieved by multiple queries) appears **only once** in the final set.

4. **Ranking**  
   - Papers are sorted first by `relevance_score`, then by publication year (descending), prioritizing **both relevance and recency**.

5. **Top-k Selection**  
   - Retain only the **top 75 papers**, forming a high-quality, focused RAG corpus for downstream chunking and embedding.

> This stage compresses a noisy, redundant retrieval space into a **dense, curated knowledge base**—tailored for semantic retrieval.

In [6]:
# Filter, dedupe & rank by relevance + recency
filtered = filter_papers(
    papers=all_papers,
    keywords=keywords,
    year_from=2021,    # only papers 2021+
    top_k=75           # keep top 75 most relevant
)
print(f"[FILTER] {len(filtered)} papers passed keyword+date filter")

[FILTER] 75 papers passed keyword+date filter


In [7]:
# Exclude any papers you’ve already used for fine-tuning
ft_dir = "./data/QA_corpus/"
existing_ids = {
    re.sub(r"\.pdf$", "", fname)
    for fname in os.listdir(ft_dir)
    if fname.endswith(".pdf")
}
before = len(filtered)
filtered = [p for p in filtered if p['arxiv_id'] not in existing_ids]
print(f"[EXCLUDE] removed {before-len(filtered)} fine-tuning papers, {len(filtered)} remain")

[EXCLUDE] removed 0 fine-tuning papers, 75 remain


## Step 6: Downloading the Filtered Papers


In [8]:
# Download the remaining PDFs
download_dir = "./data/rag_corpus/"
os.makedirs(download_dir, exist_ok=True)
download_papers(
    papers=filtered,
    download_dir=download_dir,
    sleep_time=1.0
)

## Step 7: Saving Metadata

After filtering and ranking, we serialize the final list of selected papers (`filtered`) into a JSON file.

- Each entry in this file includes metadata such as:
  - `arxiv_id`
  - `title`
  - `abstract` (summary)
  - `publication date`
  - `relevance score`
  - `PDF download URL`

This metadata file will serve as a **canonical reference** in the next notebook, where we:
1. Extract text from the downloaded PDFs
2. Chunk them into semantically meaningful units
3. Embed them for retrieval in our RAG system

The JSON is saved to:

./data/rag_corpus/metadata.json

> This ensures the corpus is **reproducible**, **traceable**, and ready for semantic indexing in FAISS.

In [9]:
meta_path = "./data/rag_corpus/metadata.json"
with open(meta_path, "w") as f:
    json.dump(filtered, f, indent=2)
print(f"[DONE] Saved metadata ({len(filtered)} papers) → {meta_path}")

[DONE] Saved metadata (75 papers) → ./data/rag_corpus/metadata.json
