<a href="https://colab.research.google.com/github/younes-sadi/semantic-search-task2-LSP2431549/blob/main/Belkacem_Sadi_Task2_DL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Belkacem Sadi LSP_Id: 2431549 Task_2

This project implements a lightweight Retrieval-Augmented Generation (RAG) pipeline. It embeds 5,000 news articles from the AG News dataset with a MiniLM sentence transformer (all-MiniLM-L6-v2), stores the vectors in a FAISS index, retrieves the top-k most relevant articles for a user query, and generates a final answer with T5. The pipeline follows the same retrieve-then-generate pattern as models such as facebook/rag-sequence-nq while remaining small enough to run on Colab.

In [1]:
!pip install -q datasets==2.16.0
!pip install -q transformers sentence-transformers faiss-cpu


In [2]:
from datasets import load_dataset

dataset = load_dataset("ag_news", split="train[:5000]")
texts = [x["text"] for x in dataset]


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).
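
The raw AG News text carries leftover HTML-entity fragments such as `#39;` (they show up in the retrieved passages further down). The following is a minimal sketch of one way to clean them before embedding. `clean_text` is a hypothetical helper, not part of the original notebook, and it assumes the fragments are numeric entities that lost their leading `&`.

```python
import html
import re

# Hypothetical helper (not in the original notebook).
# Matches numeric entities that lost their leading "&", e.g. "#39;".
_BARE_ENTITY = re.compile(r"(?<!&)#(\d+);")

def clean_text(text: str) -> str:
    # Restore the missing "&" so that "#39;" becomes "&#39;".
    restored = _BARE_ENTITY.sub(r"&#\1;", text)
    # Decode all HTML entities (numeric and named) into plain characters.
    decoded = html.unescape(restored)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", decoded).strip()

sample = "Games #39; dirtiest sport"
print(clean_text(sample))
```

If this assumption holds for the dataset, the helper could be applied while building the list, for example `texts = [clean_text(x["text"]) for x in dataset]`.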


In [3]:

import pandas as pd

df = pd.DataFrame({"text": texts})
df.to_csv("ag_news_5000.csv", index=False)


In [4]:

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts, convert_to_numpy=True, show_progress_bar=True)

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
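
`IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances. To make that concrete, here is a small pure-Python sketch of the same computation on toy 2-D vectors. `brute_force_l2` and `toy_vectors` are illustrative names, not part of the notebook or of FAISS.

```python
from typing import List, Sequence, Tuple

def brute_force_l2(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> List[Tuple[float, int]]:
    """Return the k (squared_distance, row_index) pairs closest to `query`."""
    # Score every stored vector against the query (no approximation).
    scored = [
        (sum((a - b) ** 2 for a, b in zip(row, query)), i)
        for i, row in enumerate(vectors)
    ]
    # Smallest squared distance first.
    scored.sort()
    return scored[:k]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
print(brute_force_l2(toy_vectors, [1.0, 0.0], k=2))  # [(0.0, 1), (1.0, 0)]
```

At 5,000 vectors an exhaustive scan like this is cheap, which is presumably why a flat index is sufficient here.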


In [5]:

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the small generator model and its tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")
generator = T5ForConditionalGeneration.from_pretrained("t5-small")

# Embed the query and retrieve the 3 nearest articles from the FAISS index
query = "What are the latest innovations in AI?"
query_embedding = encoder.encode([query], convert_to_numpy=True)
D, I = index.search(query_embedding, k=3)

# Combine the top 3 results into a single context string
context = " ".join(texts[i] for i in I[0])
prompt = f"question: {query} context: {context}"

# Tokenize with an explicit limit (512 tokens, the input length T5 was trained with)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
outputs = generator.generate(**inputs)
print("=== Answer ===")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


=== Answer ===
micro flyer
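
Depending on the tokenizer's configured maximum length, a long prompt is either truncated or passed through whole, so passages at the end of the context can be cut mid-sentence without notice. One possible way to keep the prompt predictable is to add whole passages until a budget is reached. This is a sketch only: `build_prompt` and `max_words` are hypothetical names, and the word count is a rough stand-in for the real token count.

```python
from typing import Iterable

def build_prompt(query: str, passages: Iterable[str], max_words: int = 300) -> str:
    """Build a 'question: ... context: ...' prompt from whole passages only."""
    kept = []
    used = 0
    for passage in passages:
        n_words = len(passage.split())
        # Always keep the first passage; stop before one that exceeds the budget.
        if kept and used + n_words > max_words:
            break
        kept.append(passage)
        used += n_words
    context = " ".join(kept)
    return f"question: {query.strip()} context: {context}"

print(build_prompt("What is new? ", ["a b c", "d e", "f g h"], max_words=5))
```

In the notebook this could stand in for the manual join, for example `prompt = build_prompt(query, (texts[i] for i in I[0]))`.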


In [6]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Switch to the larger generator model (t5-large)
tokenizer = T5Tokenizer.from_pretrained("t5-large")
generator = T5ForConditionalGeneration.from_pretrained("t5-large")

query = "What are the latest news of sport "

# Embed the query and retrieve the 5 nearest articles
query_embedding = encoder.encode([query], convert_to_numpy=True)
D, I = index.search(query_embedding, k=5)

# Combine the top 5 results into a single context string
context = " ".join(texts[i] for i in I[0])

prompt = f"question: {query} context: {context}"

# Tokenize the prompt with an explicit limit so that truncation actually applies
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)

# Generate with beam search and a minimum length to encourage a fuller answer
outputs = generator.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=300,
    min_length=50,
    length_penalty=1.2,
    num_beams=4,
    early_stopping=True
)

# Decode the best beam into text
print("=== Answer ===")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


=== Answer ===
Weight of drugs felt Weightlifting #39;s aggressive pursuit of drug cheaters will continue even if it jeopardizes its future in the Olympics, its top official said yesterday following six more positive doping cases in what is again becoming the Games #39; dirtiest sport
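
The answer above reads like one retrieved article reproduced almost verbatim, so it can help to look at what the index actually returned before judging the generator. The sketch below formats search results for inspection. `format_hits` is a hypothetical helper, not part of the notebook, and it takes plain sequences so it can be tried without FAISS.

```python
from typing import List, Sequence

def format_hits(
    texts: Sequence[str],
    distances: Sequence[float],
    indices: Sequence[int],
    width: int = 80,
) -> List[str]:
    """Return one readable line per retrieved passage: rank, distance, row id, snippet."""
    lines = []
    for rank, (dist, idx) in enumerate(zip(distances, indices), start=1):
        # FAISS pads the result with -1 when fewer than k neighbours exist.
        if idx < 0:
            continue
        snippet = texts[idx][:width]
        lines.append(f"{rank}. [{dist:.3f}] #{idx}: {snippet}")
    return lines

demo_texts = ["alpha", "beta", "gamma"]
print("\n".join(format_hits(demo_texts, [0.5, 1.25], [2, 0])))
```

With the notebook's variables, the call would look like `print("\n".join(format_hits(texts, D[0], I[0])))`, since `index.search` returns one row of distances and indices per query.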


In [7]:
!git clone https://github.com/younes-sadi/semantic-search-task2-LSP2431549.git
%cd semantic-search-task2-LSP2431549


Cloning into 'semantic-search-task2-LSP2431549'...
remote: Enumerating objects: 57, done.
remote: Counting objects: 100% (57/57), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 57 (delta 19), reused 3 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (57/57), 104.92 KiB | 735.00 KiB/s, done.
Resolving deltas: 100% (19/19), done.


In [8]:
!pip install --upgrade ipywidgets



In [10]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
%cd /content/semantic-search-task2-LSP2431549
!git add Belkacem_Sadi_Task2_DL.ipynb
!git commit -m "Upload task2"
!git push
