<a href="https://colab.research.google.com/github/ramanakurva164/genai/blob/main/RAG_S2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers==4.30.0 faiss-cpu torch # Install necessary libraries: transformers for models, faiss-cpu for efficient similarity search, and torch for tensor operations.



INGESTION


In [2]:
sample_text="The Amazon rainforest is a vast and biodiverse ecosystem located primarily in Brazil, but also spanning across several other South American countries. It's home to an incredibly diverse range of plant and animal species, many of which are found nowhere else on Earth. The rainforest plays a crucial role in regulating the global climate, absorbing large amounts of carbon dioxide and releasing oxygen. Deforestation, driven by activities such as agriculture and logging, poses a significant threat to the Amazon's biodiversity and its ability to regulate the climate. Protecting the Amazon is vital for both environmental and human well-being." # Define a sample text string about the Amazon rainforest.

EMBEDDINGS
A sentence-transformers model (all-MiniLM-L6-v2) is used to embed the text.

In [6]:
from transformers import AutoModel, AutoTokenizer # Import AutoModel and AutoTokenizer classes from the transformers library for loading pre-trained models and tokenizers.
import torch # Import the torch library for tensor operations.
import numpy as np # Import the numpy library for numerical operations, specifically for converting tensors to numpy arrays.

# Define the name of the pre-trained model to use for embeddings.
model_name="sentence-transformers/all-MiniLM-L6-v2"
# Load the tokenizer associated with the chosen model.
tokenizer=AutoTokenizer.from_pretrained(model_name)
# Load the pre-trained model.
model=AutoModel.from_pretrained(model_name)

# Define a function to generate embeddings for a given text.
def get_embeddings(text):
  # Tokenize the input into subword tokens and return PyTorch tensors. truncation=True cuts inputs that exceed the model's maximum length; padding=True pads a batch to its longest sequence (no effect for a single string).
  tokens=tokenizer(text,return_tensors="pt",truncation=True,padding=True)
  # Disable gradient tracking: this is inference only, so no backward pass is needed.
  with torch.no_grad():
    # Pass the tokenized input to the model; ** unpacks the dict into keyword arguments.
    output = model(**tokens)
  # Mean-pool the last hidden state over the token dimension, drop the batch dimension, and convert to a numpy array.
  return output.last_hidden_state.mean(dim=1).squeeze().numpy()
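
A note on the pooling step: `get_embeddings` averages over every token position, which is fine for a single string because nothing is padded. If a padded batch were passed instead, the pad positions would be averaged in as well. The sketch below shows a mask-weighted mean on toy NumPy arrays that stand in for `last_hidden_state` and the tokenizer's `attention_mask`. The helper name `masked_mean_pool` is hypothetical and is not part of transformers.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden: (batch, seq_len, dim) array standing in for last_hidden_state.
    mask:   (batch, seq_len) array of 1 (real token) / 0 (padding),
            standing in for the tokenizer's attention_mask.
    """
    mask = mask[:, :, None].astype(hidden.dtype)    # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

# Toy batch: two sequences of length 3, embedding size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
], dtype=np.float32)
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)  # padding ignored
plain = hidden.mean(axis=1)              # padding included in the average
print(pooled)
print(plain)
```

In the toy batch, the first row of `plain` is pulled toward the padding vector while the first row of `pooled` is not. The second row is identical in both because it has no padding.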



CHUNKING AND INDEXING


In [25]:
import faiss # Import the faiss library for efficient similarity search.
chunks=[sample_text] # Create a list of text chunks (in this case, just the sample text).
embeddings=[get_embeddings(chunk) for chunk in chunks] # Generate embeddings for each chunk using the get_embeddings function.
dim=len(embeddings[0]) # Determine the dimension of the embeddings.

index=faiss.IndexFlatL2(dim) # Create a flat (exhaustive-search) faiss index of the given dimension that ranks by L2 (Euclidean) distance.
index.add(np.array(embeddings,dtype="float32")) # Add the embeddings to the index as a 2-D float32 array, the dtype faiss expects.
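
The cell above indexes the whole sample text as a single chunk, so every query retrieves the same passage. A minimal sketch of splitting text into smaller chunks before embedding is shown below. The helper `chunk_by_sentences`, its `max_sentences` parameter, and the shortened `demo_text` are hypothetical, and the regex sentence split is deliberately naive (it will mis-split abbreviations such as "e.g.").

```python
import re

def chunk_by_sentences(text, max_sentences=2):
    """Split text into chunks of at most max_sentences sentences each."""
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

demo_text = (
    "The Amazon rainforest is a vast ecosystem. "
    "It is home to many species. "
    "Deforestation threatens it. "
    "Protecting the Amazon is vital. "
    "It regulates the climate."
)
demo_chunks = chunk_by_sentences(demo_text, max_sentences=2)
for i, chunk in enumerate(demo_chunks):
    print(i, chunk)
```

With chunks like these, `chunks` would hold several entries, and each one would be embedded and added to the index so that a query can select the most relevant passage.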

RETRIEVAL

In [29]:
from transformers import pipeline # Import the pipeline function from the transformers library for easy use of pre-trained models.

# Create a question answering pipeline using a pre-trained DistilBERT model and specifying PyTorch as the framework.
qa_pipeline=pipeline("question-answering",model="distilbert-base-cased-distilled-squad", framework="pt")

# Define a function to retrieve relevant text based on a query and answer the question.
def retrieve_and_answer(query,top_k=1):
      # Generate the embedding for the query and reshape it to (1, dim), the 2-D shape faiss expects.
      query_embedding=get_embeddings(query).reshape(1,-1)
      # Search the faiss index for the top_k nearest embeddings; the returned distances are not needed here.
      _,indices=index.search(query_embedding,top_k)
      # Look up the matching text chunks; faiss pads with -1 when fewer than top_k vectors are indexed, so skip those.
      retrieved_text=[chunks[i] for i in indices[0] if i!=-1]
      # Join the retrieved chunks with spaces to form the context for the question answering model.
      context=" ".join(retrieved_text)
      # Pass the question and context as a dictionary to the question answering pipeline.
      answer=qa_pipeline({'question': query, 'context': context})
      # Return the answer text from the pipeline's output.
      return answer['answer']

In [30]:
query = "What is this document about?" # Define the query string.
answer = retrieve_and_answer(query) # Call retrieve_and_answer with the query to get the answer.
print(answer) # Print the extracted answer.

Protecting the Amazon is vital for both environmental and human well-being
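
For reference, a search on an `IndexFlatL2` index is an exhaustive comparison: the query is compared against every stored vector, and the squared L2 distances come back with the nearest first. The NumPy sketch below illustrates that idea on toy 2-D vectors. The helper `flat_l2_search` is hypothetical and is not the faiss implementation; for instance, it does not pad results with -1 when `top_k` exceeds the number of stored vectors.

```python
import numpy as np

def flat_l2_search(stored, queries, top_k=1):
    """Exhaustive nearest-neighbour search by squared L2 distance.

    stored:  (n, dim) array of indexed vectors.
    queries: (m, dim) array of query vectors.
    Returns (distances, indices), each of shape (m, top_k), nearest first.
    """
    diffs = queries[:, None, :] - stored[None, :, :]  # (m, n, dim)
    sq_dists = (diffs ** 2).sum(axis=2)               # (m, n)
    order = np.argsort(sq_dists, axis=1)[:, :top_k]   # nearest first
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy data: three stored vectors and one query.
stored = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], dtype=np.float32)
query = np.array([[0.9, 1.2]], dtype=np.float32)

distances, indices = flat_l2_search(stored, query, top_k=2)
print(indices)
print(distances)
```

The row positions in `indices` play the same role as the positions used to look up entries in `chunks`: position `i` in the index corresponds to `chunks[i]`, provided chunks are added in list order.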
