## Step 1: Mounting Google Drive

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data  deployment  LICENSE  notebooks  project_plan.md  README.md  scripts


## Step 2: Scraping Paper Metadata (arXiv API)

This cell uses a custom Python script, `arxiv_scraper.py` (stored in the repo's `scripts/` folder), to query arXiv for papers related to **LLM fine-tuning**.

### What This Code Does:
- Imports the `search_arxiv()` function from the script.
- Runs a search query against arXiv's public API, which returns results as an Atom XML feed.
- Retrieves metadata for each paper, including:
  - Title
  - Abstract
  - PDF URL
  - Published date
- Prints a list of papers with direct links to their PDFs.

This is the **first step in building our QA and retrieval-augmented generation (RAG) corpora**. Later steps will filter, download, and curate these papers for use in fine-tuning and RAG.
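The contents of `arxiv_scraper.py` are not shown in this notebook, so the following is only a minimal sketch of how a function like `search_arxiv()` could work, not the script's actual implementation. It assumes the standard arXiv query endpoint (`http://export.arxiv.org/api/query`) and uses only the Python standard library. The names `parse_arxiv_feed`, `search_arxiv_sketch`, and `SAMPLE_FEED`, and the dictionary keys `abstract` and `published`, are hypothetical; only `title` and `pdf_url` are confirmed by the cell below. The sample feed is placeholder data, not a real paper.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the standard arXiv query endpoint and the Atom namespace.
ARXIV_API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def parse_arxiv_feed(xml_data):
    """Turn an arXiv Atom feed (str or bytes) into a list of metadata dicts."""
    root = ET.fromstring(xml_data)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        # arXiv marks the PDF link of each entry with title="pdf".
        pdf_url = None
        for link in entry.findall("atom:link", ATOM_NS):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        abstract = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
        published = entry.findtext("atom:published", default="", namespaces=ATOM_NS)
        papers.append({
            # Collapse the line breaks arXiv inserts into long titles.
            "title": " ".join(title.split()),
            "abstract": abstract.strip(),
            "pdf_url": pdf_url,
            "published": published.strip(),
        })
    return papers


def search_arxiv_sketch(query, max_results=10):
    """Hypothetical stand-in for search_arxiv(): query arXiv, parse the feed."""
    params = urllib.parse.urlencode({
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
    })
    with urllib.request.urlopen(f"{ARXIV_API_URL}?{params}", timeout=30) as response:
        return parse_arxiv_feed(response.read())


# Placeholder feed (not a real paper) so the parser can be tried offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Placeholder Title
  Wrapped Across Two Lines</title>
    <summary>  Placeholder abstract.  </summary>
    <published>2024-01-01T00:00:00Z</published>
    <link href="http://arxiv.org/abs/0000.00000v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
  </entry>
</feed>"""

sample_papers = parse_arxiv_feed(SAMPLE_FEED)
```

Keeping the parsing separate from the network call makes the parser testable without hitting the API. The whitespace collapsing on titles is a choice made in this sketch: the output printed below shows titles wrapping across lines, which suggests the original script keeps the feed's line breaks.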

In [2]:
# Adding the scripts/ directory to Python’s module search path
import sys
sys.path.append('/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/scripts')

from arxiv_scraper import search_arxiv

papers = search_arxiv(query="llm fine-tuning", max_results=10)
for paper in papers:
    # Printing each paper’s title and direct PDF URL, followed by a separator
    print(paper['title'], '\n', paper['pdf_url'], '\n---\n')

Beyond Fine-Tuning: Effective Strategies for Mitigating Hallucinations
  in Large Language Models for Data Analytics 
 http://arxiv.org/pdf/2410.20024v1 
---

Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining
  for Clinical LLMs 
 http://arxiv.org/pdf/2409.14988v1 
---

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM
  Fine-Tuning: A Benchmark 
 http://arxiv.org/pdf/2402.11592v3 
---

Balancing Continuous Pre-Training and Instruction Fine-Tuning:
  Optimizing Instruction-Following in LLMs 
 http://arxiv.org/pdf/2410.10739v1 
---

AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through
  Process Feedback 
 http://arxiv.org/pdf/2402.01469v2 
---

CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting
  Mitigation 
 http://arxiv.org/pdf/2408.14572v1 
---

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt
  Templates 
 http://arxiv.org/pdf/2402.18540v2 
---

Preference-Oriented Supervised Fine-Tuning: Favorin

## Step 3: Downloading the Notebook

In [5]:
from google.colab import files

%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks
files.download("01_arxiv_scraper.ipynb")

/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks

