<a href="https://colab.research.google.com/github/sszobaer/RAG-Model-Simulation/blob/main/Rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **RAG SIMULATION**

In [3]:
!pip install torch transformers datasets faiss-cpu tqdm


### **Load Corpus**

In [10]:
from datasets import Dataset

# Example corpus
documents = [
    "What is machine learning?",
    "Machine learning is a subfield of artificial intelligence.",
    "It involves training models on data to make predictions.",
    "And we are using this as our research domain."
]

# Convert the corpus into a Hugging Face Dataset
dataset = Dataset.from_dict({"text": documents})

# Display the dataset
dataset

Dataset({
    features: ['text'],
    num_rows: 4
})

### **Create a FAISS Index for Retrieval**

In [11]:
from transformers import AutoTokenizer, AutoModel
import faiss
import torch

# Load a transformer model for embeddings (using Sentence-BERT model)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

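# Function to embed documents
def embed_documents(texts):
    """Embed the texts by mean-pooling the model's last hidden state."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Keep the inputs on the same device as the model (matters once the model is moved to a GPU)
    inputs = inputs.to(model.device)
    with torch.no_grad():
        embeddings = model(**inputs).last_hidden_state.mean(dim=1)
    # FAISS expects NumPy arrays that live on the CPU
    return embeddings.cpu().numpy()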

# Embed the corpus and build the FAISS index
embeddings = embed_documents(dataset["text"])
index = faiss.IndexFlatL2(embeddings.shape[1])  # L2 distance for retrieval
index.add(embeddings)  # Add embeddings to the index

# Check how many vectors the index holds
print("Number of vectors in the index:", index.ntotal)

Number of vectors in the index: 4
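
A note on the pooling step: `embed_documents` averages over every token position, so with `padding=True` the padding positions of shorter texts are included in the mean. A common alternative is to weight the average by the attention mask. The sketch below shows the idea in NumPy on toy arrays; `masked_mean_pool` and the toy values are illustrative names and data, not part of the notebook. In the notebook, the two inputs would correspond to `last_hidden_state` and `inputs["attention_mask"]`.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: shape (batch, seq_len, hidden)
    attention_mask:   shape (batch, seq_len), 1 = real token, 0 = padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float32)
    # Add a trailing axis so the mask broadcasts over the hidden dimension
    mask = np.asarray(attention_mask, dtype=np.float32)[:, :, None]
    summed = (token_embeddings * mask).sum(axis=1)   # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # guard against division by zero
    return summed / counts

# Toy batch: the second sequence has one padded position.
toy_embeddings = np.array([
    [[1.0, 1.0], [3.0, 3.0]],
    [[2.0, 4.0], [100.0, 100.0]],  # last vector stands in for a padding position
])
toy_mask = np.array([[1, 1], [1, 0]])

pooled = masked_mean_pool(toy_embeddings, toy_mask)  # padding ignored
plain = toy_embeddings.mean(axis=1)                  # padding included
print(pooled)
print(plain)
```

Switching the pooling method changes the embeddings, so the retrieval ranking recorded in this notebook may differ after such a change.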


### **Perform Retrieval**

In [18]:
def retrieve(query, top_k=3):
    """Retrieve top_k documents for a given query."""
    query_embedding = embed_documents([query])
    distances, indices = index.search(query_embedding, top_k)

    # Retrieve the documents based on the indices
    results = [dataset[i.item()]["text"] for i in indices[0]]
    return results

# Test retrieval
query = "What is models?"
retrieved_docs = retrieve(query)
retrieved_docs

['It involves training models on data to make predictions.',
 'And we are using this as our research domain.',
 'Machine learning is a subfield of artificial intelligence.']
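
To make the ranking easier to interpret: `IndexFlatL2` compares the query against every stored vector and reports squared L2 distances, so smaller values mean closer matches. The `retrieve` function above discards `distances`. The following is a minimal NumPy sketch of the same computation on toy vectors; `brute_force_l2_search` and the toy data are hypothetical and only illustrate what the search returns.

```python
import numpy as np

def brute_force_l2_search(corpus_vectors, query_vector, top_k=3):
    """Return (squared_distances, indices) of the top_k nearest corpus vectors."""
    corpus = np.asarray(corpus_vectors, dtype=np.float32)
    query = np.asarray(query_vector, dtype=np.float32)
    squared = ((corpus - query) ** 2).sum(axis=1)       # squared L2 distance to each row
    order = np.argsort(squared, kind="stable")[:top_k]  # smallest distance first
    return squared[order], order

toy_corpus = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
toy_query = np.array([1.0, 1.0])

distances, indices = brute_force_l2_search(toy_corpus, toy_query, top_k=3)
for dist, idx in zip(distances, indices):
    print(f"row {idx}: squared distance {dist:.1f}")
```

Returning the distances next to the documents can help when deciding whether a retrieved passage is actually relevant or merely the least distant of a small corpus.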

### **Generate an Answer Using a Language Model**

In [27]:
from transformers import pipeline

# Load a text generation pipeline with GPT-2
qa_pipeline = pipeline("text-generation", model="gpt2")

# Generate a response based on the retrieved documents
context = " ".join(retrieved_docs)
response = qa_pipeline(context, max_length=50, num_return_sequences=1)

response[0]["generated_text"]

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"It involves training models on data to make predictions. And we are using this as our research domain. Machine learning is a subfield of artificial intelligence.\n\nWe've also been involved in the recent research in a variety of different domains of human intelligence"

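The generation cell above passes only the retrieved passages to GPT-2, so the model continues the context and never sees the query. One possible way to include the question is to build a single prompt from both. The helper below is a sketch under that assumption; `build_prompt` and its template wording are illustrative choices, not part of the notebook.

```python
def build_prompt(question, retrieved_docs):
    """Combine retrieved passages and the user's question into one prompt string."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

example_docs = [
    "It involves training models on data to make predictions.",
    "Machine learning is a subfield of artificial intelligence.",
]
prompt = build_prompt("What is models?", example_docs)
print(prompt)
```

The resulting string could then be passed to the pipeline, for example `qa_pipeline(prompt, max_new_tokens=50)`. `max_new_tokens` limits only the generated tokens, whereas `max_length` also counts the prompt, which is why the output above stops shortly after the context. GPT-2 is not instruction-tuned, so the answer quality may still be weak.
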
### **Enable GPU in Google Colab**

In [21]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-5): 6 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)


In [25]:
inputs = tokenizer(query, padding=True, truncation=True, max_length=512, return_tensors="pt")

In [26]:
qa_pipeline = pipeline("text-generation", model="gpt2", pad_token_id=50256)

Device set to use cpu
