<a href="https://colab.research.google.com/github/srimancho9/Week-4-Retrieval-Augmented-Generation-RAG-/blob/main/Week4_RAG_Gemini_FT_HandsOn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 5588 — Enhanced RAG + Gemini + Fine-Tuning on Online Dataset
_Generated: 2025-09-14T13:53:05_

### 1) Install

In [1]:

!pip -q install -U langchain langchain-community chromadb pypdf sentence-transformers transformers datasets evaluate peft accelerate tiktoken langchain-google-genai google-genai
print("If core libraries were upgraded, consider restarting the runtime.")



### 2) Keys & Imports

In [2]:

import os, getpass, json, sys, platform, pathlib, datetime, importlib, torch
if not os.getenv("GEMINI_API_KEY"):
    os.environ["GEMINI_API_KEY"] = getpass.getpass("Enter your GEMINI_API_KEY: ")
os.environ["GOOGLE_API_KEY"] = os.environ.get("GOOGLE_API_KEY", os.environ["GEMINI_API_KEY"])

from google import genai
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms.huggingface_pipeline import HuggingFacePipeline

from datasets import load_dataset
import evaluate
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Trainer, TrainingArguments, pipeline
from peft import LoraConfig, get_peft_model, PeftModel

pathlib.Path("data").mkdir(exist_ok=True)
pathlib.Path("artifacts/ft").mkdir(parents=True, exist_ok=True)
print("Env ready.")


Enter your GEMINI_API_KEY: ··········
Env ready.


### 3) Env log → env_rag.json

In [3]:

def pv(m):
    """Return a module's __version__, 'unknown' if it defines none, or 'not installed'."""
    try:
        mod = importlib.import_module(m)
        return getattr(mod, "__version__", "unknown")
    except Exception:
        return "not installed"
env = {
  "timestamp": datetime.datetime.now().isoformat(),
  "python": sys.version, "platform": platform.platform(),
  "cuda_available": torch.cuda.is_available(),
  "packages": {m: pv(m) for m in [
    "langchain","langchain_community","chromadb","tiktoken","transformers",
    "datasets","evaluate","peft","sentence_transformers",
    "langchain_google_genai","google.genai"
  ]}
}
with open("env_rag.json", "w") as f:
    json.dump(env, f, indent=2)
print(json.dumps(env, indent=2))


{
  "timestamp": "2025-09-19T02:00:51.526201",
  "python": "3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]",
  "platform": "Linux-6.1.123+-x86_64-with-glibc2.35",
  "cuda_available": false,
  "packages": {
    "langchain": "0.3.27",
    "langchain_community": "0.3.29",
    "chromadb": "1.1.0",
    "tiktoken": "0.11.0",
    "transformers": "4.56.1",
    "datasets": "4.1.1",
    "evaluate": "0.4.6",
    "peft": "0.17.1",
    "sentence_transformers": "5.1.0",
    "langchain_google_genai": "unknown",
    "google.genai": "1.38.0"
  }
}
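
The log above reports `langchain_google_genai` as `"unknown"` because `pv` reads the module's `__version__` attribute, which not every package defines. A minimal sketch of an alternative, assuming the standard library's `importlib.metadata` is acceptable here: it looks up the installed distribution's metadata instead. `dist_version` is a hypothetical helper, and the distribution names in the loop are assumed to match the pip names used in the install cell.

```python
from importlib.metadata import PackageNotFoundError, version


def dist_version(dist_name: str) -> str:
    """Return the installed version of a pip distribution, or 'not installed'."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "not installed"


# Distribution names are the pip names (hyphenated), not the import names.
for name in ["langchain", "langchain-google-genai", "google-genai"]:
    print(name, dist_version(name))
```

This takes pip distribution names rather than import names, so `google.genai` becomes `google-genai`.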


### 4) Upload & Load project docs, Chunk, Build Chroma

In [6]:

# Upload
try:
    from google.colab import files
    up = files.upload()
    import os
    for n,c in up.items():
        open(os.path.join("data", n), "wb").write(c)
    print("Uploaded:", list(up.keys()))
except Exception as e:
    print("Colab upload UI not available.", e)

# Load
import os
def load_docs(folder="data"):
    docs=[]
    for fname in os.listdir(folder):
        p=os.path.join(folder,fname)
        if not os.path.isfile(p): continue
        ext=fname.lower().split(".")[-1]
        try:
            if ext=="pdf": loader=PyPDFLoader(p)
            elif ext in ["txt","md","markdown"]: loader=TextLoader(p, encoding="utf-8")
            else: print("Skip", fname); continue
            docs += loader.load()
        except Exception as e:
            print("Fail", fname, e)
    return docs

raw_docs=load_docs("data")
splitter=RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
splits=splitter.split_documents(raw_docs)

# Chroma
emb = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vs = Chroma.from_documents(splits, embedding=emb, persist_directory="./chroma_minilm")
# Chroma 0.4+ writes to persist_directory automatically, so no explicit vs.persist() call is needed.
retriever = vs.as_retriever(search_kwargs={"k":4})
print("Docs:", len(raw_docs), "Chunks:", len(splits))


Saving project_findings[1].txt to project_findings[1] (1).txt
Uploaded: ['project_findings[1] (1).txt']
Skip rag_sample_docs.zip
Docs: 2 Chunks: 2
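
To make the `chunk_size=500, chunk_overlap=100` settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on separators such as paragraphs and newlines); `sliding_chunks` is a hypothetical helper that only illustrates how the two parameters interact.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "x" * 1200
print([len(c) for c in sliding_chunks(sample)])  # → [500, 500, 400]
```

With these settings each chunk repeats the last 100 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.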


### 5) RAG Chains: Gemini & Local FLAN-T5 (pre-FT)

In [7]:

llm_gemini = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.2)
qa_gemini = RetrievalQA.from_chain_type(llm=llm_gemini, chain_type="stuff", retriever=retriever, return_source_documents=True)

base_model = "google/flan-t5-small"
tok = AutoTokenizer.from_pretrained(base_model)
base = AutoModelForSeq2SeqLM.from_pretrained(base_model)
pipe_base = pipeline("text2text-generation", model=base, tokenizer=tok, max_new_tokens=256, device=0 if torch.cuda.is_available() else -1)
llm_local = HuggingFacePipeline(pipeline=pipe_base)
qa_local = RetrievalQA.from_chain_type(llm=llm_local, chain_type="stuff", retriever=retriever)

def ask(chain, q):
    # .invoke() replaces the deprecated chain({...}) call style.
    r = chain.invoke({"query": q})
    print("\nQ:", q)
    print("A:", r.get("result", ""))

ask(qa_gemini, "What are two key methods in these documents?")
ask(qa_local, "Summarize one limitation discussed.")


Device set to use cpu



Q: What are two key methods in these documents?
A: Based on the documents, two key methods are:

1.  **RAG-based systems:** These systems outperform vanilla LLMs on factual tasks.
2.  **Efficient retrieval of relevant context:** This is enabled by vector databases such as FAISS or Chroma. (This is often a component of RAG-based systems or structured retrieval pipelines).

Q: Summarize one limitation discussed.
A: A limitation discussed in this article is that RAG-based systems outperform vanilla LLMs on factual tasks. A limitation discussed in this article is that RAG-based systems outperform vanilla LLMs on factual tasks. A limitation discussed in this article is that RAG-based systems outperform vanilla LLMs on factual tasks. A limitation discussed in this article is that RAG-based systems outperform vanilla LLMs on factual tasks. A limitation discussed in this article is that RAG-based systems outperform vanilla LLMs on factual tasks. A limitation discussed in this article is that 
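
Both chains use `chain_type="stuff"`, which places all retrieved chunks into a single prompt. The sketch below shows the idea under stated assumptions: `build_stuff_prompt` is a hypothetical helper, the template wording is illustrative rather than LangChain's exact prompt, and the two sample chunks are paraphrased from the answers above rather than copied from the vector store.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Illustrative chunks only (paraphrased from the model answers above).
retrieved = [
    "RAG-based systems outperform vanilla LLMs on factual tasks.",
    "Vector databases such as FAISS or Chroma enable efficient retrieval of relevant context.",
]
print(build_stuff_prompt("Summarize one limitation discussed.", retrieved))
```

Because every chunk lands in one prompt, the prompt length grows with `k` and the chunk size, which matters for small local models with short input windows.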

In [15]:
pip install wandb



### 6) Fine-Tune on online dataset (Hugging Face `squad`, sampled) with LoRA

In [17]:
dataset_name="squad"; train_n=800; eval_n=200; seed=42
ds = load_dataset(dataset_name)
ds_tr = ds["train"].shuffle(seed=seed).select(range(min(train_n, len(ds["train"]))))
ds_ev = ds["validation"].shuffle(seed=seed).select(range(min(eval_n, len(ds["validation"]))))

def preprocess(ex):
    ctx, q = ex["context"], ex["question"]
    ans = ex["answers"]["text"][0] if ex["answers"]["text"] else ""
    prompt = f"Use the context to answer concisely.\nContext: {ctx}\nQuestion: {q}\nAnswer:"
    model_in = tok(prompt, truncation=True, max_length=512)
    # text_target= tokenizes the answer as labels; it replaces the deprecated as_target_tokenizer().
    labels = tok(text_target=ans, truncation=True, max_length=64)
    model_in["labels"] = labels["input_ids"]
    model_in["id"] = ex["id"]
    return model_in

proc_tr = ds_tr.map(preprocess, remove_columns=ds_tr.column_names)
proc_ev = ds_ev.map(preprocess, remove_columns=ds_ev.column_names)
collator = DataCollatorForSeq2Seq(tokenizer=tok, model=base)

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q","k","v","o"], lora_dropout=0.05, bias="none", task_type="SEQ_2_SEQ_LM")
ft_model = get_peft_model(base, lora)
out_dir="artifacts/ft/flan_t5_small_lora"
args = TrainingArguments(output_dir=out_dir, num_train_epochs=1, per_device_train_batch_size=4,
                         per_device_eval_batch_size=4, learning_rate=5e-4, logging_steps=50,
                         eval_strategy="steps", eval_steps=200, save_steps=200,
                         save_total_limit=2,
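                         # Not every CUDA GPU supports bf16, so check before enabling it.
                         bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(), fp16=False)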

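# processing_class= replaces the deprecated tokenizer= argument in recent transformers releases.
trainer = Trainer(model=ft_model, args=args, train_dataset=proc_tr, eval_dataset=proc_ev,
                  data_collator=collator, processing_class=tok)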
trainer.train()
mets = trainer.evaluate()
print(mets)

trainer.model.save_pretrained(out_dir)
tok.save_pretrained(out_dir)

ft_cfg={"base_model":base_model,"adapter_dir":out_dir,"dataset":dataset_name,
        "train_n":train_n,"eval_n":eval_n,"seed":seed}
with open("ft_config.json", "w") as f:
    json.dump(ft_cfg, f, indent=2)
print(json.dumps(ft_cfg, indent=2))

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 200  | 0.6561        | 0.427471        |




{'eval_loss': 0.4274710416793823, 'eval_runtime': 81.6056, 'eval_samples_per_second': 2.451, 'eval_steps_per_second': 0.613, 'epoch': 1.0}
{
  "base_model": "google/flan-t5-small",
  "adapter_dir": "artifacts/ft/flan_t5_small_lora",
  "dataset": "squad",
  "train_n": 800,
  "eval_n": 200,
  "seed": 42
}
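
The `LoraConfig` above adapts the `q`, `k`, `v`, and `o` projections with rank `r=16`. Each adapted linear layer gains two small matrices, A (r × in) and B (out × r), so it adds `r * (in + out)` trainable parameters while the original weight stays frozen. The sketch below estimates the total under assumed `flan-t5-small` dimensions (model width 512, attention inner width 384, 8 encoder and 8 decoder layers). These numbers are assumptions to check against the model config; `ft_model.print_trainable_parameters()` reports the actual count.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed dimensions for google/flan-t5-small (verify against the model config).
D_MODEL, INNER = 512, 384
ENC_LAYERS, DEC_LAYERS = 8, 8
R = 16

# Each encoder layer has self-attention; each decoder layer has self- and cross-attention.
attention_blocks = ENC_LAYERS + 2 * DEC_LAYERS

# q, k, v project D_MODEL -> INNER; o projects INNER -> D_MODEL.
per_block = 3 * lora_params(D_MODEL, INNER, R) + lora_params(INNER, D_MODEL, R)
total = attention_blocks * per_block
print(f"LoRA trainable parameters (estimate): {total:,}")  # → 1,376,256
```

Under these assumptions the adapters add roughly 1.4 million trainable parameters, a small fraction of the full model.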


In [12]:
from transformers import TrainingArguments
import inspect

# Get the signature of the TrainingArguments constructor
sig = inspect.signature(TrainingArguments)

# Print the parameters
print("Available arguments for TrainingArguments:")
for param in sig.parameters.values():
    print(f"- {param.name}")

Available arguments for TrainingArguments:
- output_dir
- overwrite_output_dir
- do_train
- do_eval
- do_predict
- eval_strategy
- prediction_loss_only
- per_device_train_batch_size
- per_device_eval_batch_size
- per_gpu_train_batch_size
- per_gpu_eval_batch_size
- gradient_accumulation_steps
- eval_accumulation_steps
- eval_delay
- torch_empty_cache_steps
- learning_rate
- weight_decay
- adam_beta1
- adam_beta2
- adam_epsilon
- max_grad_norm
- num_train_epochs
- max_steps
- lr_scheduler_type
- lr_scheduler_kwargs
- warmup_ratio
- warmup_steps
- log_level
- log_level_replica
- log_on_each_node
- logging_dir
- logging_strategy
- logging_first_step
- logging_steps
- logging_nan_inf_filter
- save_strategy
- save_steps
- save_total_limit
- save_safetensors
- save_on_each_node
- save_only_model
- restore_callback_states_from_checkpoint
- no_cuda
- use_cpu
- use_mps_device
- seed
- data_seed
- jit_mode_eval
- use_ipex
- bf16
- fp16
- fp16_opt_level
- half_precision_backend
- bf16_full_eval
-

### 7) Merge & Evaluate (EM/F1 quick subset)

In [18]:

metric = evaluate.load("squad")
ft_loaded = AutoModelForSeq2SeqLM.from_pretrained(base_model)
ft_loaded = PeftModel.from_pretrained(ft_loaded, model_id="artifacts/ft/flan_t5_small_lora")
ft_loaded = ft_loaded.merge_and_unload()

pipe_ft = pipeline("text2text-generation", model=ft_loaded, tokenizer=tok,
                   max_new_tokens=64, device=0 if torch.cuda.is_available() else -1)

preds, refs = [], []
n = min(100, len(ds_ev))
for ex in ds_ev.select(range(n)):
    prompt = f"Use the context to answer concisely.\nContext: {ex['context']}\nQuestion: {ex['question']}\nAnswer:"
    pred = pipe_ft(prompt)[0]["generated_text"].strip()
    golds = ex["answers"]["text"] if ex["answers"]["text"] else [""]
    preds.append({"id": ex["id"], "prediction_text": pred})
    refs.append({"id": ex["id"], "answers": {"text": golds, "answer_start": ex["answers"]["answer_start"]}})
print(metric.compute(predictions=preds, references=refs))


Device set to use cpu


KeyboardInterrupt: 
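
For reference, the two scores that the `squad` metric reports can be sketched in plain Python. This is a simplified re-implementation of SQuAD-style exact match and token-level F1, not the code that `evaluate` runs; `normalize`, `exact_match`, `f1_score`, and `best_over_golds` are hypothetical helper names, and the scores here lie in [0, 1].

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(fn, pred: str, golds: list[str]) -> float:
    """Score a prediction against every reference answer and keep the best."""
    return max(fn(pred, g) for g in golds)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(f1_score("Eiffel Tower in Paris", "Eiffel Tower"), 3))
```

Exact match rewards only fully correct spans, while F1 gives partial credit when the prediction overlaps the reference, which is useful for a small model that tends to over-generate.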

### 8) Plug FT model into RAG and compare to Gemini

In [19]:

from langchain.llms.huggingface_pipeline import HuggingFacePipeline
llm_ft = HuggingFacePipeline(pipeline=pipe_ft)
qa_ft = RetrievalQA.from_chain_type(llm=llm_ft, chain_type="stuff", retriever=retriever)

qs = [
    "Summarize the dataset assumptions made in the uploaded materials.",
    "Identify two limitations or open problems noted in the documents."
]
print("\n=== RAG: Gemini ===")
for q in qs:
    # Reuse the chain built in section 5 instead of rebuilding it for every question.
    print("\nQ:", q, "\nA:", qa_gemini.invoke({"query": q})["result"][:800])
print("\n=== RAG: Fine-tuned FLAN-T5 ===")
for q in qs:
    print("\nQ:", q, "\nA:", qa_ft.invoke({"query": q})["result"][:800])



=== RAG: Gemini ===

Q: Summarize the dataset assumptions made in the uploaded materials. 
A: I don't know the answer. The provided materials do not mention any dataset assumptions. They focus on the performance of RAG-based systems, the role of vector databases, and the improvement in generation quality with Gemini models.

Q: Identify two limitations or open problems noted in the documents. 
A: I don't know the answer. The provided documents do not mention any limitations or open problems. They only state positive findings regarding RAG-based systems, vector databases, and Gemini models.

=== RAG: Fine-tuned FLAN-T5 ===

Q: Summarize the dataset assumptions made in the uploaded materials. 
A: The dataset assumptions made in the uploaded materials are the assumptions made in the uploaded materials.

Q: Identify two limitations or open problems noted in the documents. 
A: Identify two limitations or open problems noted in the documents. Helpful


### 9) Save/Update configs

In [20]:

# Read the fine-tuning config saved in section 6.
with open("ft_config.json") as f:
    ft_cfg_saved = json.load(f)

rag_cfg = {
  "chunk_settings_tested":[{"chunk_size":500,"chunk_overlap":100},{"chunk_size":300,"chunk_overlap":50}],
  "embedding_models_tested":["sentence-transformers/all-MiniLM-L6-v2"],
  "llm":{"providers":["google-genai","hf-local"],"models":["gemini-2.5-flash","google/flan-t5-small (LoRA)"]},
  "retriever_k":4,
  "finetune": ft_cfg_saved
}
with open("rag_run_config.json", "w") as f:
    json.dump(rag_cfg, f, indent=2)
print(json.dumps(rag_cfg, indent=2))


{
  "chunk_settings_tested": [
    {
      "chunk_size": 500,
      "chunk_overlap": 100
    },
    {
      "chunk_size": 300,
      "chunk_overlap": 50
    }
  ],
  "embedding_models_tested": [
    "sentence-transformers/all-MiniLM-L6-v2"
  ],
  "llm": {
    "providers": [
      "google-genai",
      "hf-local"
    ],
    "models": [
      "gemini-2.5-flash",
      "google/flan-t5-small (LoRA)"
    ]
  },
  "retriever_k": 4,
  "finetune": {
    "base_model": "google/flan-t5-small",
    "adapter_dir": "artifacts/ft/flan_t5_small_lora",
    "dataset": "squad",
    "train_n": 800,
    "eval_n": 200,
    "seed": 42
  }
}


### 10) Notes
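- Use a GPU runtime in Colab for fine-tuning. This run was on CPU (`cuda_available: false`), and the EM/F1 evaluation in section 7 was interrupted before it produced scores.
- Keep API keys out of version control; the notebook reads `GEMINI_API_KEY` from the environment or prompts for it with `getpass`.
- Increase the dataset size or the number of epochs for stronger results; this run used 800 training examples for 1 epoch.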