### Project Goal
Build a Retrieval-Augmented Generation (RAG) system that:

- Retrieves relevant articles based on a user query.
- Generates a response or summary using the retrieved information.

### Key Steps
Prepare the Dataset:
- Load, explore, and clean the News Articles Dataset.
- Filter irrelevant or short articles.

Generate Embeddings:
- Convert articles into numerical vectors using a pre-trained Sentence Transformers model.
- Store these embeddings for efficient similarity search.

Document Retrieval System:
- Use FAISS (Facebook AI Similarity Search) to index and retrieve articles based on query embeddings.
- Retrieve the most relevant articles for a given query.

Response Generation:
- Pass retrieved articles to a text generation model (e.g., GPT-4 or T5).
- Generate summaries or answers to user queries.

Deploy the System:
- Build an API (e.g., with FastAPI) for users to interact with the RAG system.
- Input: User query.
- Output: A generated response based on retrieved articles.
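
The request flow described under "Deploy the System" can be sketched independently of any web framework. This is a minimal sketch, not the project's actual API: `RagResponse`, `answer_query`, and the stand-in `fake_retrieve` and `fake_generate` callables are hypothetical names introduced for illustration. A real deployment would plug in the FAISS retriever and the generation model and expose `answer_query` through an endpoint.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RagResponse:
    """Hypothetical response shape: the generated text plus its sources."""
    query: str
    answer: str
    sources: List[str]


def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[dict]],
    generate: Callable[[str, List[dict]], str],
    k: int = 5,
) -> RagResponse:
    """Retrieve the top-k articles, then generate a response from them."""
    articles = retrieve(query, k)
    answer = generate(query, articles)
    return RagResponse(query=query, answer=answer,
                       sources=[a["id"] for a in articles])


# Stand-in components so the sketch runs without any model or index.
def fake_retrieve(query: str, k: int) -> List[dict]:
    corpus = [{"id": f"doc-{i}", "article": f"Article {i} about {query}"}
              for i in range(10)]
    return corpus[:k]


def fake_generate(query: str, articles: List[dict]) -> str:
    return f"{len(articles)} articles found for: {query}"


result = answer_query("climate change", fake_retrieve, fake_generate, k=3)
print(result.answer)
print(result.sources)
```

Keeping retrieval and generation behind plain callables makes it possible to test the request flow before either model is loaded.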

### End Deliverable
A functional RAG system that can:

- Retrieve the top relevant news articles for a query.
- Generate a concise and meaningful response using those articles.

In [1]:
from datasets import load_dataset
import pandas as pd

In [39]:
import os
print(os.getcwd())

C:\Users\sigar


In [2]:
dataset = load_dataset("ccdv/cnn_dailymail", "3.0.0", trust_remote_code=True)

In [3]:
print(dataset['train'].column_names)

['article', 'highlights', 'id']


In [4]:
sample = dataset['validation'][0]

In [5]:
df_train = pd.DataFrame(dataset['train'])

In [38]:
df_train.head(5)

| | article | highlights | id |
|---|---|---|---|
| 0 | It's official: U.S. President Barack Obama wan... | Syrian official: Obama climbed to the top of t... | 0001d1afc246a7964130f43ae940af6bc6c57f01 |
| 1 | (CNN) -- Usain Bolt rounded off the world cham... | Usain Bolt wins third gold of world championsh... | 0002095e55fcbd3a2f366d9bf92a95433dc305ef |
| 2 | Kansas City, Missouri (CNN) -- The General Ser... | The employee in agency's Kansas City office is... | 00027e965c8264c35cc1bc55556db388da82b07f |
| 3 | Los Angeles (CNN) -- A medical doctor in Vanco... | NEW: A Canadian doctor says she was part of a ... | 0002c17436637c4fe1837c935c04de47adb18e9a |
| 4 | (CNN) -- Police arrested another teen Thursday... | Another arrest made in gang rape outside Calif... | 0003ad6ef0c37534f80b55b4235108024b407f0b |


In [7]:
filtered_dataset = dataset.filter(lambda x: len(x['article']) > 50)

In [8]:
print(len(filtered_dataset['train']), len(dataset['train']))

287112 287113
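
The length check above counts characters, so a threshold of 50 removes only one training article. A stricter filter could count words instead. The sketch below is an illustration under assumptions: `is_long_enough` and `MIN_WORDS` are hypothetical names, and the threshold of 50 words is arbitrary rather than tuned.

```python
MIN_WORDS = 50  # hypothetical threshold; tune it for your data


def is_long_enough(example: dict, min_words: int = MIN_WORDS) -> bool:
    """Return True when the article has at least `min_words` whitespace-separated words."""
    return len(example["article"].split()) >= min_words


# Plain dictionaries stand in for dataset rows here.
rows = [
    {"article": "Too short to be useful."},
    {"article": " ".join(["word"] * 120)},
]
kept = [row for row in rows if is_long_enough(row)]
print(len(kept), len(rows))
```

The same function could be passed to `dataset.filter(is_long_enough)`, since `filter` calls it with one example dictionary at a time by default.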


In [9]:
from sentence_transformers import SentenceTransformer

In [10]:
# Load the pre-trained model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

print(f"Loaded embedding model: {model_name}")

Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2


In [11]:
# Generate an embedding for each article
def embed_articles(batch):
    batch['embeddings'] = model.encode(batch['article'], show_progress_bar=True)
    return batch

embedded_datsaet = filtered_dataset.map(embed_articles, batched=True)
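
With `batched=True`, `Dataset.map` hands the function up to 1000 rows per call by default, and `model.encode` in turn splits each call into mini-batches of 32 texts by default. The resulting arithmetic can be sketched as follows. `encode_batches_per_call` is a hypothetical helper, and the two defaults are taken from the libraries' documentation rather than read from this run.

```python
import math

MAP_BATCH_SIZE = 1000    # default `batch_size` of Dataset.map when batched=True
ENCODE_BATCH_SIZE = 32   # default `batch_size` of SentenceTransformer.encode


def encode_batches_per_call(num_rows: int) -> list:
    """For each map call, return how many encode mini-batches it triggers."""
    counts = []
    for start in range(0, num_rows, MAP_BATCH_SIZE):
        rows_in_call = min(MAP_BATCH_SIZE, num_rows - start)
        counts.append(math.ceil(rows_in_call / ENCODE_BATCH_SIZE))
    return counts


train_counts = encode_batches_per_call(287112)
print(len(train_counts), train_counts[0], train_counts[-1])
```

Under these assumptions the training split needs 288 map calls, each running 32 encode mini-batches except the last, which runs 4.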


In [12]:
# path = r"C:\Users\sigar\OneDrive\Desktop\CommentLit_Project"

In [12]:
embedded_datsaet.save_to_disk("path_to_save/embedded_dataset")


In [13]:
import faiss

In [14]:
import numpy as np

In [15]:
embeddings = np.array(embedded_datsaet["train"]["embeddings"])
print(f"Embeddings shape: {embeddings.shape}")

Embeddings shape: (287112, 384)


In [16]:
dimension = embeddings.shape[1]

In [37]:
dimension

384

In [17]:
index = faiss.IndexFlatL2(dimension)

In [18]:
index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x000001A3F9AA5810> >

In [19]:
# Add embeddings to the FAISS index
index.add(embeddings)
print(f"FAISS index created with {index.ntotal} vectors.")

FAISS index created with 287112 vectors.
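
`IndexFlatL2` is an exact index: it compares the query against every stored vector and ranks by squared Euclidean distance. The idea can be sketched in plain Python. This illustrates what the index computes and is not how FAISS implements it; `squared_l2`, `flat_l2_search`, and the toy vectors are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    vectors: List[Sequence[float]], query: Sequence[float], k: int
) -> Tuple[List[float], List[int]]:
    """Exhaustively score every vector and return the k nearest (distances, indices)."""
    scored = sorted(
        (squared_l2(vec, query), idx) for idx, vec in enumerate(vectors)
    )
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, query=[1.0, 1.0], k=2)
print(toy_distances, toy_indices)
```

Exhaustive search scales linearly with the number of stored vectors, which is workable for 287,112 vectors of dimension 384 but is also why FAISS offers approximate index types for larger collections.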


In [20]:
faiss.write_index(index, 'faiss_index')

In [21]:
index = faiss.read_index("faiss_index")


In [22]:
# Encode a sample query
query = "What is the latest news on climate change?"


In [23]:
query_embedding = model.encode(query)  # Convert the query into an embedding, the vector representation FAISS uses for similarity search.

# Perform a search in the FAISS index

index.search(): Searches for the closest matches in the FAISS index.
- Input: query_embedding (reshaped into a 2D NumPy array).
- k: The number of results you want to retrieve.
- Output:
  - distances: The squared L2 distances between the query and each retrieved article.
  - indices: The indices of the top k closest articles in the dataset.

In [24]:
k = 5  # Number of nearest neighbors to retrieve
distances, indices = index.search(np.array([query_embedding]), k)

print(f"Top {k} results retrieved successfully!")

Top 5 results retrieved successfully!


In [25]:
retrieved_articles = [embedded_datsaet["train"][int(i)] for i in indices[0]]

In [26]:
# Display the retrieved articles
for idx, article in enumerate(retrieved_articles):
    print(f"\n--- Article {idx + 1} ---")
    print("Title:", article["highlights"])
    print("Content:", article["article"][:100])  # Show the first 100 characters of the content


--- Article 1 ---
Title: Scientists surer than ever humans play major role in climate change, report says .
Global warming already affecting extreme weather, and it could get worse, report says .
U.N.'s IPCC convenes every six years to put together report; it's considered benchmark on topic .
Even if emissions ended today, effects of climate change could linger for centuries .
Content: The world's getting hotter, the sea's rising and there's increasing evidence neither are naturally o

--- Article 2 ---
Title: Climate change is on a 'hiatus' and likely to return with more heatwaves, droughts, floods and rising sea levels .
Temperatures have not continued to rise since 1998 .
Sceptics say climate change is not man-made and question urgent action .
But IPCC report concludes global warming is '95 per cent' result of humans .
The IPCC report in 2007 erroneously claimed Himalayas would melt by 2035 .
Content: By . Shari Miller . Global warming has not stopped - it's just on a 'hiatus' and 

#### Step 5: Response Generation Using a Text Generation Model
We will use a pre-trained model like T5 or GPT-2 from Hugging Face for text generation. These models can generate summaries, paraphrase content, or provide answers.

In [27]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [28]:
model_name = "t5-small"  # for summarization
tokenizer = AutoTokenizer.from_pretrained(model_name)
generation_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


#### Generate a Response
Use the retrieved articles to generate a response. Concatenate the retrieved content and provide it as input to the generation model.

In [29]:
# Combine the content of retrieved articles
retrieved_content = " ".join([article["article"] for article in retrieved_articles])

# Prepare the input for the generation model
input_text = f"summarize: {retrieved_content[:1000]}"  # Limit input to 1000 characters for simplicity
inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)

# Generate a summary or response
outputs = generation_model.generate(inputs, max_length=150, num_beams=4, early_stopping=True)

# Decode and display the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nGenerated Response:")
print(response)


Generated Response:
more than 800 authors and 50 editors from dozens of countries took part in the document. the report is considered the benchmark on the topic. climate scientists are 95% confident humans are responsible for at least half of the observed increase in global average surface temperatures since the 1950s.
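
Slicing the concatenated text to 1000 characters means the summary is likely drawn from the first retrieved article only. One alternative, sketched below under assumptions, gives each article an equal character budget so that every retrieved article contributes. `build_summarization_input` and the `per_article_chars` value are hypothetical and not part of the notebook.

```python
from typing import List


def build_summarization_input(
    articles: List[dict], per_article_chars: int = 200, prefix: str = "summarize: "
) -> str:
    """Concatenate a truncated excerpt of each article behind the T5 task prefix."""
    excerpts = [a["article"][:per_article_chars].strip() for a in articles]
    return prefix + " ".join(excerpts)


sample_articles = [
    {"article": "A" * 500},
    {"article": "B" * 500},
]
text = build_summarization_input(sample_articles, per_article_chars=10)
print(text)
```

In the notebook, `build_summarization_input(retrieved_articles)` could replace the manual join and slice before tokenization.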


#### Verify the Output
Check that the generated response:

- Is coherent and relevant to the query.
- Summarizes or answers the query meaningfully.
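
A quick, rough proxy for relevance is the share of query keywords that also appear in the response. The sketch below is a heuristic under assumptions, not an evaluation metric used in this notebook; `keyword_overlap`, `tokens`, and the small stop-word list are hypothetical.

```python
import re

STOP_WORDS = {"what", "is", "the", "on", "a", "an", "of", "in", "to", "and"}


def tokens(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, response: str) -> float:
    """Fraction of non-stop-word query tokens that appear in the response."""
    query_terms = tokens(query) - STOP_WORDS
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(response)) / len(query_terms)


score = keyword_overlap(
    "What is the latest news on climate change?",
    "climate scientists are 95% confident humans are responsible",
)
print(round(score, 2))
```

A low score does not prove the response is irrelevant, since a good summary may paraphrase the query terms, so this check works best as a first filter before reading the output.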

In [30]:
# Print available keys in the first retrieved article
print("Available Keys in Retrieved Articles:", retrieved_articles[0].keys())

Available Keys in Retrieved Articles: dict_keys(['article', 'highlights', 'id', 'embeddings'])


In [31]:
# Display the retrieved articles safely
for i, article in enumerate(retrieved_articles):
    print(f"\n--- Retrieved Article {i+1} ---")
    
    # Check if "title" exists, otherwise print article ID
    if "title" in article:
        print("Title:", article["title"])
    else:
        print("Title Not Available - Using ID:", article.get("id", "Unknown ID"))

    # Print a portion of the article content
    print("Content (First 300 chars):", article["article"][:300])



--- Retrieved Article 1 ---
Title Not Available - Using ID: 687e67b161bf493827464a14a4bc6984b46be77a
Content (First 300 chars): The world's getting hotter, the sea's rising and there's increasing evidence neither are naturally occurring phenomena. So says a report from the U.N. International Panel on Climate Change, a document released every six years that is considered the benchmark on the topic. More than 800 authors and 5

--- Retrieved Article 2 ---
Title Not Available - Using ID: cd81da2ff25c07386216b6ac4dff14e47107179d
Content (First 300 chars): By . Shari Miller . Global warming has not stopped - it's just on a 'hiatus' and likely to return with ever more heatwaves, droughts, floods and rising sea levels - according to a draft report from leading scientists. The 127-page United Nations report, and a shorter summary for policymakers due for

--- Retrieved Article 3 ---
Title Not Available - Using ID: 4cda325c0b744135d681dd9c600bad3906a2bb32
Content (First 300 chars): By . Harrie

In [32]:
# Use "highlights" if "title" is missing
for i, article in enumerate(retrieved_articles):
    print(f"\n--- Retrieved Article {i+1} ---")
    
    # Use "highlights" instead of title if it's available
    print("Title (or Summary):", article.get("highlights", "No Title Available"))
    
    print("Content (First 300 chars):", article["article"][:300])



--- Retrieved Article 1 ---
Title (or Summary): Scientists surer than ever humans play major role in climate change, report says .
Global warming already affecting extreme weather, and it could get worse, report says .
U.N.'s IPCC convenes every six years to put together report; it's considered benchmark on topic .
Even if emissions ended today, effects of climate change could linger for centuries .
Content (First 300 chars): The world's getting hotter, the sea's rising and there's increasing evidence neither are naturally occurring phenomena. So says a report from the U.N. International Panel on Climate Change, a document released every six years that is considered the benchmark on the topic. More than 800 authors and 5

--- Retrieved Article 2 ---
Title (or Summary): Climate change is on a 'hiatus' and likely to return with more heatwaves, droughts, floods and rising sea levels .
Temperatures have not continued to rise since 1998 .
Sceptics say climate change is not man-made and que

Summary of Fix
- ✅ Check Available Keys: Print article.keys() to identify valid fields.
- ✅ Modify Retrieval Display: Use "highlights" or "id" if "title" is missing.
- ✅ Handle Missing Titles Gracefully: Display "No Title Available" instead of crashing.
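
The fallback logic summarized above can be collected into one helper. This is a minimal sketch: `display_title` is a hypothetical function, and using the first line of `highlights` as a title is an assumption about what reads best, not something the dataset defines.

```python
def display_title(article: dict) -> str:
    """Pick the best available label: title, then first highlight line, then id."""
    if article.get("title"):
        return article["title"]
    if article.get("highlights"):
        return article["highlights"].splitlines()[0]
    return f"Untitled ({article.get('id', 'Unknown ID')})"


examples = [
    {"title": "Explicit title", "highlights": "ignored", "id": "a1"},
    {"highlights": "First highlight .\nSecond highlight .", "id": "b2"},
    {"id": "c3"},
    {},
]
for example in examples:
    print(display_title(example))
```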

#### Fine-Tuning the Model
Fine-tuning the text generation model (e.g., T5 or GPT-2) on domain-specific data can improve its performance, making responses more accurate, relevant, and context-aware.

1. Understanding Fine-Tuning
Fine-tuning means training a pre-trained model on a smaller dataset with specific examples so it adapts to your task. In our case:

- Base Model: t5-small (or t5-large, flan-t5, GPT-2)
- Dataset: News articles (input) and highlights (summary)
- Goal: Improve the model’s ability to generate news summaries or responses.

2. Preparing the Dataset for Fine-Tuning
We need pairs of input text and output text:

- Input: Article content
- Output: Highlights (or a generated summary)

Step 2.1: Format the Dataset
- Convert the dataset into a structured format.

In [33]:
# Load dataset
df = pd.DataFrame(embedded_datsaet["train"])  # Convert Hugging Face dataset to DataFrame

# Filter required columns
df = df[["article", "highlights"]]

# Format the dataset for T5 (prefix for summarization task)
df["input_text"] = "summarize: " + df["article"]
df["target_text"] = df["highlights"]
# This pairs each news article with its summary for training.

In [34]:
from transformers import T5Tokenizer

In [35]:

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
print("T5 Tokenizer loaded successfully!")


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


T5 Tokenizer loaded successfully!


In [36]:

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Tokenize input and output texts
def tokenize_data(example):
    # Tokenize the input once and reuse both input_ids and attention_mask
    model_inputs = tokenizer(example["input_text"], padding="max_length", truncation=True, max_length=512)
    labels = tokenizer(example["target_text"], padding="max_length", truncation=True, max_length=150)
    return {
        "input_ids": model_inputs.input_ids,
        "attention_mask": model_inputs.attention_mask,
        "labels": labels.input_ids,
    }

# Apply tokenization
dataset = df.to_dict(orient="records")  # Convert DataFrame to dictionary
tokenized_dataset = list(map(tokenize_data, dataset))  # Apply tokenization

print("Tokenization complete! Sample:", tokenized_dataset[0])


Tokenization complete! Sample: {'input_ids': [21603, 10, 94, 31, 7, 2314, 10, 412, 5, 134, 5, 1661, 20653, 4534, 2746, 23419, 12, 11385, 16, 30, 823, 12, 169, 2716, 2054, 16, 11380, 5, 4534, 1622, 3, 9, 2068, 12, 8, 7701, 13, 8, 1384, 11, 7819, 30, 1856, 706, 6, 716, 227, 3, 9, 22587, 24, 3, 88, 7228, 2716, 1041, 581, 16706, 8874, 19, 8, 269, 1147, 12, 240, 147, 8, 3, 12554, 169, 13, 5368, 7749, 5, 37, 4382, 6704, 45, 4534, 987, 7, 4442, 12, 15444, 8, 169, 13, 2716, 2054, 96, 235, 20, 449, 6, 23773, 6, 1709, 11, 20, 6801, 8, 1055, 21, 647, 2284, 13, 5368, 7749, 42, 119, 7749, 13, 3294, 11203, 535, 94, 31, 7, 3, 9, 1147, 24, 19, 356, 12, 919, 46, 1038, 5362, 139, 3, 9, 19894, 4422, 1827, 3392, 5, 290, 33, 843, 746, 3, 14351, 53, 147, 8, 5054, 10, 363, 410, 412, 5, 567, 5, 7749, 17033, 7, 253, 16, 11380, 58, 363, 2906, 3, 99, 4442, 11839, 150, 58, 275, 149, 56, 8, 16706, 789, 8922, 58, 86, 3, 9, 3, 1931, 208, 3375, 1115, 45, 8, 1945, 1384, 5088, 5072, 2283, 1856, 6, 8, 2753, 243, 3, 88, 
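
Because the labels are padded to a fixed length, the padding positions would count toward the training loss unless they are masked. A common convention with Hugging Face seq2seq models is to replace padding token ids in the labels with -100, which the loss ignores. The sketch below assumes T5's padding token id of 0; `mask_label_padding` is a hypothetical helper and not part of the notebook.

```python
from typing import List

IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss


def mask_label_padding(label_ids: List[int], pad_token_id: int = 0) -> List[int]:
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


padded_labels = [21603, 10, 94, 1, 0, 0, 0]  # toy ids; 1 is T5's end-of-sequence id
masked = mask_label_padding(padded_labels)
print(masked)
```

In the notebook, the helper could be applied to the `labels` entry inside `tokenize_data`, passing `tokenizer.pad_token_id` instead of relying on the default.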

In [37]:
from datasets import Dataset

# Convert list of tokenized dictionaries to Hugging Face Dataset format
train_dataset = Dataset.from_list(tokenized_dataset)

# Verify dataset structure
print(train_dataset)
print("Sample:", train_dataset[0])


Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 287112
})
Sample: {'input_ids': [21603, 10, 94, 31, 7, 2314, 10, 412, 5, 134, 5, 1661, 20653, 4534, 2746, 23419, 12, 11385, 16, 30, 823, 12, 169, 2716, 2054, 16, 11380, 5, 4534, 1622, 3, 9, 2068, 12, 8, 7701, 13, 8, 1384, 11, 7819, 30, 1856, 706, 6, 716, 227, 3, 9, 22587, 24, 3, 88, 7228, 2716, 1041, 581, 16706, 8874, 19, 8, 269, 1147, 12, 240, 147, 8, 3, 12554, 169, 13, 5368, 7749, 5, 37, 4382, 6704, 45, 4534, 987, 7, 4442, 12, 15444, 8, 169, 13, 2716, 2054, 96, 235, 20, 449, 6, 23773, 6, 1709, 11, 20, 6801, 8, 1055, 21, 647, 2284, 13, 5368, 7749, 42, 119, 7749, 13, 3294, 11203, 535, 94, 31, 7, 3, 9, 1147, 24, 19, 356, 12, 919, 46, 1038, 5362, 139, 3, 9, 19894, 4422, 1827, 3392, 5, 290, 33, 843, 746, 3, 14351, 53, 147, 8, 5054, 10, 363, 410, 412, 5, 567, 5, 7749, 17033, 7, 253, 16, 11380, 58, 363, 2906, 3, 99, 4442, 11839, 150, 58, 275, 149, 56, 8, 16706, 789, 8922, 58, 86, 3, 9, 3, 1931, 208, 3375, 1115,