## Step 1: Mounting Google Drive

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data  deployment  LICENSE  notebooks  project_plan.md  README.md  scripts


## Step 2: Downloading LLM Fine-Tuning Papers

This block runs the full pipeline to query, filter, and download relevant papers for our **fine-tuning QA corpus**.

### Functionality Overview:

- **Import custom utilities** from `arxiv_scraper.py`:
  - `search_arxiv`: queries arXiv's Atom API for papers
  - `filter_papers`: keeps only those whose title/summary contain key phrases
  - `download_papers`: downloads the PDFs to the target directory

- **Query string**: `"large language model OR llm OR fine-tuning"`
- **Keywords used for filtering**:
  - `["LoRA", "QLoRA", "parameter-efficient", "supervised fine-tuning", "adapter", "SFT", "instruction tuning", "continued pretraining"]`

- **Download directory**:  
  `./data/QA_corpus`  
  (i.e., inside the repo's `data/` folder)

This yields a set of relevant PDFs that will serve as the foundation for crafting our supervised QA pairs. The papers are assumed to be recent, but they are not necessarily high-impact, because arXiv does not provide citation metadata.

You may increase `max_results` or refine `keywords` to adjust the yield.
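
The source of `arxiv_scraper.py` is not shown here, so the following is only a minimal sketch of how a query against arXiv's Atom API could be built and parsed. The helper names `build_query_url` and `parse_feed`, and the `summary` and `pdf_url` dictionary keys, are assumptions made for illustration; only the `title` key is confirmed by the cells in this notebook.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by Atom feeds


def build_query_url(query, max_results=50, start=0):
    """Build an arXiv API URL that searches all fields for `query` (hypothetical helper)."""
    params = {
        "search_query": f"all:{query}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text):
    """Turn an Atom feed string into a list of paper dicts (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        # Long arXiv titles are wrapped across lines, so collapse whitespace.
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        summary = " ".join(entry.findtext(f"{ATOM}summary", default="").split())
        pdf_url = None
        for link in entry.findall(f"{ATOM}link"):
            # arXiv marks the PDF link with title="pdf".
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        papers.append({"title": title, "summary": summary, "pdf_url": pdf_url})
    return papers
```

In this sketch, the feed text would come from something like `urllib.request.urlopen(build_query_url(query)).read()`. Raising `max_results` only changes the `max_results` parameter in the URL.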

In [None]:
import sys
sys.path.append('./scripts')

from arxiv_scraper import search_arxiv, filter_papers, download_papers

In [None]:
# arXiv search string and the phrases a paper must mention to be kept
query = "large language model OR llm OR fine-tuning"
keywords = ["LoRA", "QLoRA", "parameter-efficient", "supervised fine-tuning", "adapter", "SFT", "instruction tuning", "continued pretraining"]

# Fetch candidate papers from arXiv
papers = search_arxiv(query=query, max_results=50)
print(f"Retrieved {len(papers)} papers.")

# Keep papers whose title or summary contains a keyword
filtered = filter_papers(papers, keywords)
print(f"{len(filtered)} papers matched the keywords.")

# Save the PDFs inside the repo's data/ folder
download_dir = "./data/QA_corpus"
download_papers(filtered, download_dir)

Retrieved 50 papers.
15 papers matched the keywords.
Downloading: CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting
  Mitigation
Downloading: Balancing Continuous Pre-Training and Instruction Fine-Tuning:
  Optimizing Instruction-Following in LLMs
Downloading: Revisiting Zeroth-Order Optimization for Memory-Efficient LLM
  Fine-Tuning: A Benchmark
Downloading: Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over
  Aligned Large Language Models
Downloading: DELIFT: Data Efficient Language model Instruction Fine Tuning
Downloading: Non-instructional Fine-tuning: Enabling Instruction-Following
  Capabilities in Pre-trained Language Models without Instruction-Following
  Data
Downloading: Targeted Efficient Fine-tuning: Optimizing Parameter Updates with
  Data-Driven Sample Selection
Downloading: Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study
  on Audio Question Answering
Downloading: Directed Domain Fine-Tuning: Tailoring Sepa

## Step 3: Doing a More Refined Search

This cell performs a **more exhaustive and precise sweep** of arXiv to collect highly relevant fine-tuning papers.

- **Query:** `"large language model OR llm OR fine-tuning"`
- **Max Results:** 100 (to ensure wider sampling)
- **Keywords:** Expanded list that adds `"low-rank adaptation"`, `"efficient fine-tuning"`, `"PEFT"`, and `"alignment tuning"` to the Step 2 keywords
- **Final Filter:** Only papers with "LLM" or "Large Language Model" in the title are kept (case-insensitive)

This will form the *core dataset* for QA pair generation. Each retained paper matches the fine-tuning keywords and names LLMs explicitly in its title, which keeps the corpus focused on **LLM fine-tuning**.
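
The two filtering stages described above can be sketched as a single function. This is an illustrative sketch only: `keyword_hit` and `two_stage_filter` are hypothetical names, the `summary` key is assumed, and the real `filter_papers` may match differently. The `whole_word` switch is included because plain substring matching has a trade-off: `"LoRA"` matches `"CURLoRA"`, which is wanted, but it also matches `"exploration"`.

```python
import re


def keyword_hit(text, phrase, whole_word=False):
    """Case-insensitive test for `phrase` in `text` (hypothetical helper)."""
    if not whole_word:
        return phrase.lower() in text.lower()
    # Allow an optional plural "s" so that "LLM" still matches "LLMs".
    pattern = r"(?<!\w)" + re.escape(phrase) + r"s?(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def two_stage_filter(papers, keywords,
                     title_terms=("llm", "large language model"),
                     whole_word=False):
    """Keep papers that mention a keyword in title or summary and an LLM term in the title."""
    kept = []
    for paper in papers:
        body = paper["title"] + " " + paper.get("summary", "")
        # Stage 1: at least one fine-tuning keyword anywhere in title or summary.
        if not any(keyword_hit(body, k, whole_word) for k in keywords):
            continue
        # Stage 2: the title itself must name LLMs.
        if any(keyword_hit(paper["title"], t, whole_word) for t in title_terms):
            kept.append(paper)
    return kept
```

With `whole_word=False` the behavior mirrors the substring checks used in the cell below. Switching to `whole_word=True` avoids accidental hits inside longer words, at the cost of missing compound names such as `"CURLoRA"`.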

In [None]:
query = "large language model OR llm OR fine-tuning"
keywords = [
    "LoRA", "QLoRA", "low-rank adaptation", "parameter-efficient",
    "efficient fine-tuning", "supervised fine-tuning", "adapter",
    "SFT", "instruction tuning", "continued pretraining",
    "PEFT", "alignment tuning"
]

papers = search_arxiv(query=query, max_results=100)
print(f"Retrieved {len(papers)} papers.")

filtered = filter_papers(papers, keywords)
print(f"{len(filtered)} papers matched the keywords.")

# Keep only those where title mentions LLM or Large Language Model
filtered = [
    paper for paper in filtered
    if ("llm" in paper['title'].lower()) or ("large language model" in paper['title'].lower())
]
print(f"{len(filtered)} papers have LLM in title.")

download_dir = "./data/QA_corpus"
download_papers(filtered, download_dir)

Retrieved 100 papers.
21 papers matched the keywords.
12 papers have LLM in title.
Already downloaded: ./data/QA_corpus/CURLoRA:_Stable_LLM_Continual_Fine-Tuning_and_Catastrophic_Forgetting
__Mitigation.pdf
Already downloaded: ./data/QA_corpus/Balancing_Continuous_Pre-Training_and_Instruction_Fine-Tuning:
__Optimizing_Instruction-Following_in.pdf
Already downloaded: ./data/QA_corpus/Revisiting_Zeroth-Order_Optimization_for_Memory-Efficient_LLM
__Fine-Tuning:_A_Benchmark.pdf
Already downloaded: ./data/QA_corpus/Preference-Oriented_Supervised_Fine-Tuning:_Favoring_Target_Model_Over
__Aligned_Large_Language_Mode.pdf
Already downloaded: ./data/QA_corpus/FATE-LLM:_A_Industrial_Grade_Federated_Learning_Framework_for_Large
__Language_Models.pdf
Already downloaded: ./data/QA_corpus/Exploring_Advanced_Large_Language_Models_with_LLMsuite.pdf
Already downloaded: ./data/QA_corpus/Exploring_Design_Choices_for_Building_Language-Specific_LLMs.pdf
Downloading: MEGAnno+: A Human-LLM Collaborative Annot

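## Step 4: Refining Further

This cell repeats the Step 3 sweep with a narrower query, `"llm fine-tuning"`, and `max_results=50`. The keyword list and the title filter are unchanged, and papers already present in `./data/QA_corpus` are skipped rather than downloaded again.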

In [None]:
query = "llm fine-tuning"
keywords = [
    "LoRA", "QLoRA", "low-rank adaptation", "parameter-efficient",
    "efficient fine-tuning", "supervised fine-tuning", "adapter",
    "SFT", "instruction tuning", "continued pretraining",
    "PEFT", "alignment tuning"
]

papers = search_arxiv(query=query, max_results=50)
print(f"Retrieved {len(papers)} papers.")

filtered = filter_papers(papers, keywords)
print(f"{len(filtered)} papers matched the keywords.")

# Keep only those where title mentions LLM or Large Language Model
filtered = [
    paper for paper in filtered
    if ("llm" in paper['title'].lower()) or ("large language model" in paper['title'].lower())
]
print(f"{len(filtered)} papers have LLM in title.")

download_dir = "./data/QA_corpus"
download_papers(filtered, download_dir)

Retrieved 50 papers.
14 papers matched the keywords.
4 papers have LLM in title.
Already downloaded: ./data/QA_corpus/Revisiting_Zeroth-Order_Optimization_for_Memory-Efficient_LLM
__Fine-Tuning:_A_Benchmark.pdf
Already downloaded: ./data/QA_corpus/Balancing_Continuous_Pre-Training_and_Instruction_Fine-Tuning:
__Optimizing_Instruction-Following_in.pdf
Already downloaded: ./data/QA_corpus/CURLoRA:_Stable_LLM_Continual_Fine-Tuning_and_Catastrophic_Forgetting
__Mitigation.pdf
Already downloaded: ./data/QA_corpus/Preference-Oriented_Supervised_Fine-Tuning:_Favoring_Target_Model_Over
__Aligned_Large_Language_Mode.pdf
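
The paths printed above contain colons and line breaks, which suggests the file names are derived from raw arXiv titles (arXiv wraps long titles across lines). A minimal sketch of a normalization step is shown below; `safe_pdf_name` and its `max_len` parameter are hypothetical and are not part of `arxiv_scraper.py` as far as this notebook shows.

```python
import re


def safe_pdf_name(title, max_len=100):
    """Derive a filesystem-friendly PDF name from a paper title (hypothetical helper)."""
    # Collapse runs of whitespace, including newlines, into single spaces.
    flat = " ".join(title.split())
    # Replace anything that is not a word character, dot, or hyphen with "_".
    slug = re.sub(r"[^\w.-]+", "_", flat).strip("_")
    # Truncate the stem so the full name stays within a reasonable length.
    return slug[:max_len] + ".pdf"
```

If a scheme like this were adopted, files saved under the old names would no longer be detected as already downloaded, so existing PDFs would need to be renamed or fetched again.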
