# Build a Generative AI Assistant with RAG, LLMs, and Multimodal Input

## Building the Retrieval Layer with FAISS

In [1]:

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np 

# Load a pre-trained model for embeddings

model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample corpus

corpus = [
    "Artificial intelligence is transforming many industries.",
    "FAISS is a library for efficient similarity search.",
    "The weather today is sunny with light winds.",
    "Machine learning models require high‑quality training data."
]

# Generate embeddings

corpus_embeddings = model.encode(corpus, convert_to_numpy=True)

# Initialize FAISS index
embedding_dim = corpus_embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
index.add(corpus_embeddings)

# Query

query = "How do we search for similar documents?"

# Encode query

query_embedding = model.encode([query], convert_to_numpy=True)

# Search for top 2 most similar documents

distances, retrieved_indices = index.search(query_embedding, k=2)

# Print retrieved documents

for idx in retrieved_indices[0]:
    print(corpus[idx])

FAISS is a library for efficient similarity search.
Machine learning models require high‑quality training data.
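
To make the search step less of a black box, here is a minimal pure-Python sketch of what an exhaustive ("flat") L2 index does conceptually: it compares the query against every stored vector and keeps the `k` closest. The 3-dimensional vectors and the `flat_l2_search` helper are made up for illustration and are not part of FAISS. Real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and FAISS does this comparison with optimized code.

```python
import heapq

def flat_l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k stored vectors closest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    top = heapq.nsmallest(k, scored)  # smallest distance first
    return [d for d, _ in top], [i for _, i in top]

# Toy 3-d "embeddings" (hypothetical values, not produced by a real model)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.0, 0.0]

toy_distances, toy_indices = flat_l2_search(toy_vectors, toy_query, k=2)
print(toy_indices)    # positions of the two nearest vectors
print(toy_distances)  # their squared L2 distances
```

Like `IndexFlatL2`, the sketch reports squared distances, so smaller values mean closer matches. Unlike `index.search`, it handles a single query and returns flat lists rather than one row per query, which is why the notebook reads `retrieved_indices[0]`.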


## Integrating an LLM for Response Generation

### Designing Structured Outputs

Implement structured output generation so that every response follows the same format. Steps:

1. Create a structured response dictionary with the following keys:
 *  "question": the query from Task 1 (retrieval)
 *  "context": one of the retrieved documents from Task 1
 *  "response": a sample response about how LLMs work
2. Convert the dictionary to JSON with proper formatting (indentation).
3. Print the structured output.

In [9]:
import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer

# 1. Example corpus

corpus = [
    "FAISS is a library for efficient similarity search.",
    "Machine learning models require high-quality training data.",
    "Transformers are neural networks designed for sequence modeling.",
    "Vector databases store embeddings for fast retrieval."
]

# 2. Encode corpus using SBERT

embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = embedder.encode(corpus, convert_to_numpy=True)

# 3. Build FAISS index

dimension = corpus_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(corpus_embeddings)

# 4. User query + retrieval

query = "How do we search for similar documents?"
query_embedding = embedder.encode([query], convert_to_numpy=True)

k = 2
distances, retrieved_indices = index.search(query_embedding, k)

retrieved_context = "\n".join([corpus[i] for i in retrieved_indices[0]])

# 5. Load FLAN‑T5 (instruction model)

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 6. Build the RAG prompt: instruction, retrieved context, then the question

prompt = (
    "Use the context to answer the question.\n\n"
    f"Context:\n{retrieved_context}\n\n"
    f"Question: {query}\n"
    "Answer:"
)

# 7. Tokenize

inputs = tokenizer(prompt, return_tensors="pt")

# 8. Generate with beam search (5 beams), capping the output at 150 tokens

outputs = model.generate(
    **inputs,
    max_length=150,
    num_beams=5
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("=== RAG Prompt ===")
print(prompt)
print("\n=== Model Response ===")
print(response)

=== RAG Prompt ===
Use the context to answer the question.

Context:
FAISS is a library for efficient similarity search.
Vector databases store embeddings for fast retrieval.

Question: How do we search for similar documents?
Answer:

=== Model Response ===
Vector databases store embeddings for fast retrieval
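
The prompt above follows a simple instruction / context / question layout. As a sketch, the same layout can be wrapped in a small helper so that different retrieval results can be plugged in without touching the template. `build_rag_prompt` is a hypothetical name introduced here for illustration; it is not part of any library.

```python
def build_rag_prompt(question, documents):
    """Assemble the instruction, retrieved documents, and question into one prompt."""
    context = "\n".join(documents)  # one retrieved document per line
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Inputs copied from the retrieval output shown above
example_prompt = build_rag_prompt(
    "How do we search for similar documents?",
    [
        "FAISS is a library for efficient similarity search.",
        "Vector databases store embeddings for fast retrieval.",
    ],
)
print(example_prompt)
```

Ending the prompt with `Answer:` gives the model a clear cue about where its output should begin. Note that the response shown above simply repeats one of the context sentences, so the template or the model may need tuning for more useful answers.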


In [10]:
import json

# Create a structured response dictionary
response_data = {
    "question": query,
    "context": corpus[retrieved_indices[0][0]],   # take the top retrieved document
    "response": "Large Language Models (LLMs) work by predicting the next word in a sequence based on patterns learned from massive text datasets. They use transformer architectures to capture context and generate coherent, human-like text."
}

# Convert to JSON with proper formatting
structured_output = json.dumps(response_data, indent=4)

# Print the structured output
print(structured_output)

{
    "question": "How do we search for similar documents?",
    "context": "FAISS is a library for efficient similarity search.",
    "response": "Large Language Models (LLMs) work by predicting the next word in a sequence based on patterns learned from massive text datasets. They use transformer architectures to capture context and generate coherent, human-like text."
}
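
The cell above hard-codes the "response" text. As a hedged sketch of how the same structure could be reused for any question, context, and response, the hypothetical helper `make_structured_output` below builds the dictionary, serializes it, and parses it back to confirm that the JSON is valid and carries the three expected keys. The input strings are copied from the outputs shown earlier; the helper name and the `REQUIRED_KEYS` constant are assumptions made for this example.

```python
import json

REQUIRED_KEYS = ("question", "context", "response")

def make_structured_output(question, context, response):
    """Serialize one RAG result as indented JSON with a fixed set of keys."""
    payload = {"question": question, "context": context, "response": response}
    # ensure_ascii=False keeps any non-ASCII characters readable in the output
    return json.dumps(payload, indent=4, ensure_ascii=False)

json_text = make_structured_output(
    question="How do we search for similar documents?",
    context="FAISS is a library for efficient similarity search.",
    response="Vector databases store embeddings for fast retrieval",
)

# Round-trip check: the text must parse back into a dict with the expected keys
parsed = json.loads(json_text)
missing = [key for key in REQUIRED_KEYS if key not in parsed]
if missing:
    raise ValueError(f"structured output is missing keys: {missing}")
print(json_text)
```

Parsing the JSON back before using it is a cheap way to catch formatting problems early, which matters once the "response" field comes from a model rather than from a fixed string.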
