# Retrieval Augmented Generation (RAG)

## ⚠️ Adaptations for MongoDB Atlas + Groq + HuggingFace

This notebook has been adapted from the original to run on free or low-cost tools:

### 🔄 Main Changes

| Original | Adapted | Reason |
|----------|---------|--------|
| MongoDB Atlas (cloud) | **MongoDB Atlas (cloud)** | ✅ Optimized Vector Search with HNSW indexes |
| OpenAI embeddings | HuggingFace embeddings | Free, local |
| OpenAI ChatGPT | Groq (llama-3.1-8b-instant) | Free, public API |

### ☁️ MongoDB Atlas Configuration

**Set in `.env`:**
```bash
MONGO_URI=mongodb+srv://user:password@cluster.mongodb.net/?appName=Cluster0
```

### ✅ Advantages of Atlas Vector Search

**MongoDB Atlas does offer optimized Vector Search:**
- ✅ Embeddings are stored in MongoDB Atlas
- ✅ Searches use HNSW indexes (fast and scalable)
- ✅ Support for hybrid searches (semantic + filters)
- ✅ Free up to 512 MB of data (M0 tier)

### 📚 What We Learn in This Notebook

1. **Persistence in cloud NoSQL**: How to store embeddings in MongoDB Atlas
2. **Optimized vector search**: Using Atlas Vector Search with HNSW indexes
3. **Hybrid search**: Combining semantic search with metadata filters
4. **LangChain + MongoDB Atlas integration**: Using `MongoDBAtlasVectorSearch`
5. **Re-ranking**: Improving results with a CrossEncoder
6. **Visualization**: RAGxplorer for understanding the vector space

---

Importing necessary libraries and installing required packages

In [1]:
from dotenv import load_dotenv
import pandas as pd
from pathlib import Path
import json
import os 
import shutil 
from IPython.display import display, Markdown
import pprint

In [None]:
# ❌ Deprecated imports (LangChain pre-1.0)
# from langchain.document_loaders.pdf import PyPDFDirectoryLoader 
# from langchain.text_splitter import RecursiveCharacterTextSplitter 
# from langchain_openai import OpenAIEmbeddings 
# from langchain.schema import Document 
# from langchain.vectorstores.chroma import Chroma
# from langchain_openai import ChatOpenAI
# from langchain_core.vectorstores import InMemoryVectorStore
# from langchain import hub
# from langchain.schema.runnable import RunnablePassthrough
# from langchain.schema.output_parser import StrOutputParser
# from langchain.document_loaders import PyPDFLoader
# from langchain.retrievers import ContextualCompressionRetriever
# from langchain.retrievers.document_compressors import CrossEncoderReranker
# from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# ✅ Updated imports for LangChain 1.0+ and free alternatives
from langchain_community.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

# ✅ Groq (free LLM) + HuggingFace (free local embeddings)
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings

# ✅ LangSmith for the hub (replacement for the deprecated langchain.hub)
from langsmith import Client as LangSmithClient

# ✅ Sentence Transformers for manual re-ranking (ContextualCompressionRetriever deprecated)
from sentence_transformers import CrossEncoder

In [None]:
# ✅ New cell: configure local embeddings (HuggingFace)
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)

# ✅ LangSmith client used in place of hub.pull()
hub_client = LangSmithClient()

In [3]:
# %pip install pypdf langchain-huggingface sentence-transformers

In [3]:
# Load environment variables from .env file
load_dotenv()

True

## Leveraging Semantic Search (with movies)

In [None]:
# Get the same dataset as in the other notebook
# ❌ input_datapath = "../semantic-search/dataset.json"
# ✅ Corrected path (the dataset is in the same directory as the notebooks)
input_datapath = "dataset.json"

with open(input_datapath, 'r') as f:
    movie_data = json.load(f)

df = pd.DataFrame(movie_data)
print(df.shape)
df.head()

We will create one document per movie

In [5]:
import ast

documents = []
for index, row in df.iterrows():
    genres = ast.literal_eval(row['genres'])
    md_dict = {
        "language": row['original_language'], 
        "genre": genres[0], 
        "release_date": row['release_date'],
        "source": index
    }
    doc = Document(id=index, page_content=row['title']+" - "+row['overview'], metadata=md_dict)
    documents.append(doc)
print(len(documents), "documents")

10 documents


In [37]:
documents

[Document(id='0', metadata={'language': 'English', 'genre': 'Horror', 'release_date': '2023-04-05', 'source': 0}, page_content="The Pope's Exorcist - Father Gabriele Amorth, Chief Exorcist of the Vatican, investigates a young boy's terrifying possession and ends up uncovering a centuries-old conspiracy the Vatican has desperately tried to keep hidden."),
 Document(id='1', metadata={'language': 'English', 'genre': 'Action', 'release_date': '2023-02-15', 'source': 1}, page_content="Ant-Man and the Wasp: Quantumania - Super-Hero partners Scott Lang and Hope van Dyne, along with with Hope's parents Janet van Dyne and Hank Pym, and Scott's daughter Cassie Lang, find themselves exploring the Quantum Realm, interacting with strange new creatures and embarking on an adventure that will push them beyond the limits of what they thought possible."),
 Document(id='2', metadata={'language': 'English', 'genre': 'Action', 'release_date': '2023-04-18', 'source': 2}, page_content='Ghosted - Salt-of-the

We store all the movies into an in-memory vector store for simplicity (it could be any other kind of vector store)

In [None]:
# ❌ OpenAI embeddings (paid)
# inmemory_vectorstore = InMemoryVectorStore(OpenAIEmbeddings())

# ✅ Local HuggingFace embeddings (free)
inmemory_vectorstore = InMemoryVectorStore(embeddings)
_ = inmemory_vectorstore.add_documents(documents=documents)
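
To make the idea concrete, here is a minimal sketch of what a brute-force in-memory vector search does conceptually: score every stored vector against the query by cosine similarity and keep the top k. The toy 3-dimensional vectors and the names `cosine_similarity` and `top_k` are illustrative assumptions, not part of LangChain; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" keyed by document id
toy_store = {
    "exorcist": [0.9, 0.1, 0.0],
    "ant-man": [0.1, 0.9, 0.2],
    "ghosted": [0.2, 0.3, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```

This linear scan is fine for a handful of documents; an approximate index such as HNSW exists to avoid comparing the query against every vector.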

In [None]:
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# ✅ Connect to MongoDB Atlas (cloud)
# Read the full URI from .env (includes user, password, cluster, and parameters)
uri = os.getenv("MONGO_URI", "mongodb+srv://user:password@cluster.mongodb.net/?appName=Cluster0")

# MongoDB Atlas client with ServerApi
mongo_client = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    mongo_client.admin.command('ping')
    print("✅ Pinged your deployment. You successfully connected to MongoDB Atlas!")
except Exception as e:
    print(f"❌ Error connecting to MongoDB: {e}")

In [None]:
from langchain_mongodb import MongoDBAtlasVectorSearch
from uuid import uuid4

DB_NAME = "langchain_test_db"
COLLECTION_NAME = "langchain_test_vectorstores"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "langchain-test-index-vectorstores"

MONGODB_COLLECTION = mongo_client[DB_NAME][COLLECTION_NAME]

# ✅ Local HuggingFace embeddings (free) + MongoDB Atlas Vector Search
mongo_vectorstore = MongoDBAtlasVectorSearch(
    collection=MONGODB_COLLECTION,
    embedding=embeddings,  # HuggingFace embeddings (384 dimensions for all-MiniLM-L6-v2)
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    relevance_score_fn="cosine",
)

# ✅ Create the vector index in Atlas if it does not exist (384 dimensions for all-MiniLM-L6-v2)
try:
    mongo_vectorstore.create_vector_search_index(dimensions=384)
    print("✅ Vector index created successfully")
except Exception as e:
    if "already exists" in str(e).lower() or "duplicate" in str(e).lower():
        print("ℹ️ Vector index already exists, continuing...")
    else:
        print(f"⚠️ Error creating index: {e}")

# ✅ Add documents with embeddings to MongoDB Atlas
_ = mongo_vectorstore.add_documents(documents=documents)
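
For reference, an Atlas Vector Search index is described by a small JSON-like definition. The sketch below builds one as a plain Python dict. It assumes the vectors live in a field named `embedding` (believed to be the `langchain_mongodb` default, but worth verifying against your collection) and that `genre` should be filterable. The helper name `build_vector_index_definition` is hypothetical.

```python
def build_vector_index_definition(dimensions, path="embedding",
                                  similarity="cosine", filter_fields=()):
    """Build an Atlas Vector Search index definition as a plain dict."""
    # One "vector" field describes where the embeddings are and how to compare them
    fields = [{
        "type": "vector",
        "path": path,
        "numDimensions": dimensions,
        "similarity": similarity,
    }]
    # Optional "filter" fields make metadata usable as a search pre-filter
    fields.extend({"type": "filter", "path": name} for name in filter_fields)
    return {"fields": fields}


definition = build_vector_index_definition(384, filter_fields=["genre"])
print(definition)
```

The `numDimensions` value has to match the embedding model's output size (384 for all-MiniLM-L6-v2).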

In [None]:
def _filter_function(doc: Document) -> bool:
    return doc.metadata.get("genre") == 'Horror'

# Alternative ways of performing a semantic search

query = "Something about religion"
# results = vectorstore.similarity_search(query, k=2)
# results = inmemory_vectorstore.similarity_search_with_score(query, k=2)
# results = inmemory_vectorstore.similarity_search_with_score(query, k=2, filter=_filter_function)

results = mongo_vectorstore.similarity_search_with_score(query, k=2)

for r in results:
    print(r)

(Document(id='0', metadata={'_id': '0', 'language': 'English', 'genre': 'Horror', 'release_date': '2023-04-05', 'source': 0}, page_content="The Pope's Exorcist - Father Gabriele Amorth, Chief Exorcist of the Vatican, investigates a young boy's terrifying possession and ends up uncovering a centuries-old conspiracy the Vatican has desperately tried to keep hidden."), 0.8959551453590393)
(Document(id='3', metadata={'_id': '3', 'language': 'English', 'genre': 'Action', 'release_date': '2023-03-15', 'source': 3}, page_content='Shazam! Fury of the Gods - Billy Batson and his foster siblings, who transform into superheroes by saying "Shazam!", are forced to get back into action and fight the Daughters of Atlas, who they must stop from using a weapon that could destroy the world.'), 0.87743079662323)
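
The commented-out in-memory variant filters with a Python callable. On Atlas, a metadata filter is instead expressed inside the `$vectorSearch` aggregation stage. The sketch below only builds that stage as a dict and does not contact a database. The helper name, the `embedding` path, and the `numCandidates` value are assumptions, and the filtered field would need to be declared as a `filter` field in the index.

```python
def build_vector_search_stage(query_vector, index_name, k=2, path="embedding",
                              num_candidates=100, metadata_filter=None):
    """Build a $vectorSearch aggregation stage, optionally with a metadata filter."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    stage = {
        "index": index_name,
        "path": path,
        "queryVector": list(query_vector),
        "numCandidates": num_candidates,  # Candidates considered by the ANN search
        "limit": k,                       # Documents actually returned
    }
    if metadata_filter:
        stage["filter"] = metadata_filter  # MQL-style filter, e.g. {"genre": {"$eq": "Horror"}}
    return {"$vectorSearch": stage}


stage = build_vector_search_stage(
    query_vector=[0.0] * 384,  # Placeholder; a real query vector comes from the embedding model
    index_name="langchain-test-index-vectorstores",
    k=2,
    metadata_filter={"genre": {"$eq": "Horror"}},
)
print(stage["$vectorSearch"]["filter"])
```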


In LangChain, we often use a *retriever* on top of the vector store

In [56]:
# retriever = inmemory_vectorstore.as_retriever(
retriever = mongo_vectorstore.as_retriever(
    search_kwargs={
        'k': 3
    }
)

retriever.invoke(input=query)

[Document(id='0', metadata={'_id': '0', 'language': 'English', 'genre': 'Horror', 'release_date': '2023-04-05', 'source': 0}, page_content="The Pope's Exorcist - Father Gabriele Amorth, Chief Exorcist of the Vatican, investigates a young boy's terrifying possession and ends up uncovering a centuries-old conspiracy the Vatican has desperately tried to keep hidden."),
 Document(id='3', metadata={'_id': '3', 'language': 'English', 'genre': 'Action', 'release_date': '2023-03-15', 'source': 3}, page_content='Shazam! Fury of the Gods - Billy Batson and his foster siblings, who transform into superheroes by saying "Shazam!", are forced to get back into action and fight the Daughters of Atlas, who they must stop from using a weapon that could destroy the world.'),
 Document(id='8', metadata={'_id': '8', 'language': 'English', 'genre': 'Adventure', 'release_date': '2023-03-23', 'source': 8}, page_content='Dungeons & Dragons: Honor Among Thieves - A charming thief and a band of unlikely adventur

Let's create an LLM for the RAG chain

In [None]:
llm_model = os.environ["OPENAI_MODEL"]  # llama-3.1-8b-instant (set in .env)
print(llm_model)

# ❌ OpenAI ChatGPT (paid)
# llm = ChatOpenAI(model=llm_model, temperature=0.1)

# ✅ Groq (free, uses open-source models)
llm = ChatGroq(model=llm_model, temperature=0.1)

The typical RAG prompt considers the *context* and the *question*

In [None]:
# Example for a public prompt (https://smith.langchain.com/hub/rlm/rag-prompt)
# ❌ Deprecated hub
# rag_prompt = hub.pull("rlm/rag-prompt", include_model=True)

# ✅ Use the LangSmith client
rag_prompt = hub_client.pull_prompt("rlm/rag-prompt")
rag_prompt.messages[0].prompt
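
As a rough illustration of how the context and the question are combined, the sketch below fills a hand-written template. The template text is a paraphrase written for this example, not the contents of `rlm/rag-prompt`, and `FakeDoc`, `format_docs`, and `build_prompt` are hypothetical stand-ins showing how retrieved `Document` objects could be joined into one context string.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Minimal stand-in for a LangChain Document (only page_content is used)."""
    page_content: str


# Illustrative template; the wording is an assumption, not the hub prompt
TEMPLATE = (
    "Use the following context to answer the question. "
    "If the context does not contain the answer, say you don't know.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def format_docs(docs):
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question, docs):
    """Fill the template with the question and the formatted context."""
    return TEMPLATE.format(question=question, context=format_docs(docs))


docs = [FakeDoc("First passage."), FakeDoc("Second passage.")]
print(build_prompt("What do the passages say?", docs))
```

In an LCEL chain, a helper like `format_docs` could be piped after the retriever (for example `retriever | RunnableLambda(format_docs)`) so that the prompt receives plain text rather than `Document` representations.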

A basic chain that connects to the retriever

In [57]:
# The prompt is predefined, but other prompts could be used
rag_chain = (
    {"context": retriever,  "question": RunnablePassthrough()} 
    | rag_prompt 
    | llm
    | StrOutputParser()
)

query = "I want to get a movie about religion"
result = rag_chain.invoke(query)
# pprint.pprint(result)
display(Markdown(result))

You might consider watching "The Pope's Exorcist," which revolves around Father Gabriele Amorth, the Chief Exorcist of the Vatican, as he investigates a young boy's possession and uncovers a hidden conspiracy. This film explores themes of faith and the supernatural within a religious context.

## Ingestion (chunks) and RAG

In [12]:
# We consider a large PDF file
pdf_path = "./data/Understanding_Climate_Change.pdf"

loader = PyPDFLoader(pdf_path)
pdf_documents = loader.load()  # Each document actually corresponds to a page
print(len(pdf_documents), "loaded")

33 loaded


In [58]:
def replace_t_with_space(list_of_documents):
    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents


# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100)

texts = text_splitter.split_documents(pdf_documents)
cleaned_texts = replace_t_with_space(texts)
print(len(cleaned_texts), "chunks")

97 chunks
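
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-width chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 25  # 250 characters
chunks = sliding_chunks(sample, chunk_size=100, chunk_overlap=20)
print([len(c) for c in chunks])  # [100, 100, 90]
```

The overlap means the tail of one chunk is repeated at the head of the next, which reduces the chance that a sentence relevant to a query is split across two chunks with neither half retrievable on its own.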


In [None]:
# Assign a fresh UUID to every chunk (avoids shadowing the built-in `id`)
uuids = [str(uuid4()) for _ in range(len(cleaned_texts))]
for chunk_id, doc in zip(uuids, cleaned_texts):
    doc._id = chunk_id
    doc.id = chunk_id

In [None]:
# We use a vector store for the chunks with MongoDB Atlas

# ✅ Local HuggingFace embeddings + MongoDB Atlas Vector Search
mongo_vectorstore =  MongoDBAtlasVectorSearch.from_documents(
        cleaned_texts,
        collection=MONGODB_COLLECTION,
        embedding=embeddings,  # HuggingFace embeddings (all-MiniLM-L6-v2, dim=384)
        index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
        relevance_score_fn="cosine",
    )

# ✅ Create the vector index in Atlas if it does not exist (384 dimensions for all-MiniLM-L6-v2)
try:
    mongo_vectorstore.create_vector_search_index(dimensions=384)
    print("✅ Vector index created successfully")
except Exception as e:
    if "already exists" in str(e).lower() or "duplicate" in str(e).lower():
        print("ℹ️ Vector index already exists, continuing...")
    else:
        print(f"⚠️ Error creating index: {e}")

my_retriever = mongo_vectorstore.as_retriever(search_kwargs={"k": 5})

In [86]:
# Helper function for printing docs
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i + 1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

test_query = "What is the main cause of climate change?"
context_docs =  my_retriever.invoke(test_query)
pretty_print_docs(context_docs)

Document 1:

change the amount of solar energy our planet receives. During the Holocene epoch, which 
began at the end of the last ice age, human societies flourished, but the industrial era has seen 
unprecedented changes. 
Modern Observations 
Modern scientific observations indicate a rapid increase in global temperatures, sea levels, 
and extreme weather events. The Intergovernmental Panel on Climate Change (IPCC) has 
documented these changes extensively. Ice core samples, tree rings, and ocean sediments 
provide a historical record that scientists use to understand past climate conditions and 
predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly the emission of greenhouse gases. 
Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitro

In [87]:
# Then, we apply the RAG chain
rag_chain = (
    {"context": my_retriever,  "question": RunnablePassthrough()} 
    | rag_prompt 
    | llm
    | StrOutputParser()
)

# test_query = "What was the latest storm on Earth?"
result = rag_chain.invoke(test_query)
# pprint.pprint(result)
display(Markdown(result))

The main cause of climate change is the increase in greenhouse gases in the atmosphere, primarily due to human activities such as burning fossil fuels and deforestation. These gases, including carbon dioxide, methane, and nitrous oxide, trap heat from the sun, leading to a warming climate. This intensified greenhouse effect is largely a result of industrialization and increased energy consumption.

## Re-Ranking

In this example, we use a cross-encoding strategy from HuggingFace, but other strategies can be applied

In [None]:
# ✅ Re-ranking with a sentence-transformers CrossEncoder, compatible with LangChain
cross_encoder_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_documents(query: str, documents: list, top_n: int = 3):
    """Re-rank documents with the cross-encoder and return the top_n."""
    pairs = [[query, doc.page_content] for doc in documents]
    scores = cross_encoder_model.predict(pairs)
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, score in scored_docs[:top_n]]

# ✅ Create a LangChain-compatible retriever using RunnableLambda
def create_reranked_retriever(base_retriever, top_n=3):
    """Create a re-ranking retriever that can be used in LangChain chains."""
    def rerank_chain(query: str):
        # Fetch documents from the base retriever
        docs = base_retriever.invoke(query)
        # Re-rank and return the top_n
        return rerank_documents(query, docs, top_n)
    
    # Return as a RunnableLambda for LCEL compatibility
    return RunnableLambda(rerank_chain)

compression_retriever = create_reranked_retriever(my_retriever, top_n=3)
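
The retrieve-then-rerank pattern does not depend on the particular scorer. The sketch below uses a toy word-overlap score as a stand-in for the cross-encoder, purely so that it runs without downloading a model. `overlap_score`, `rerank`, and the sample passages are illustrative assumptions.

```python
def overlap_score(query, passage):
    """Toy relevance score: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(query, passages, score_fn, top_n=3):
    """Score every passage against the query and keep the top_n (passage, score) pairs."""
    scored = [(passage, score_fn(query, passage)) for passage in passages]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical candidate passages, as if returned by a first-stage retriever
candidates = [
    "Droughts affect agriculture and water supply.",
    "Warmer oceans can intensify hurricanes and storm surge.",
    "Greenhouse gases trap heat in the atmosphere.",
]
for passage, score in rerank("storm surge and hurricanes", candidates, overlap_score, top_n=2):
    print(f"{score:.2f}  {passage}")
```

Keeping the scores in the result, rather than discarding them, makes it possible to drop passages that fall below a chosen threshold.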

In [88]:
test_query = "What was the latest storm on Earth?"

# Use the compression retriever
compressed_docs = compression_retriever.invoke(test_query)
pretty_print_docs(compressed_docs)

Document 1:

Climate change is linked to an increase in the frequency and severity of extreme weather 
events, such as hurricanes, heatwaves, droughts, and heavy rainfall. These events can have 
devastating impacts on communities, economies, and ecosystems. 
Hurricanes and Typhoons 
Warmer ocean temperatures can intensify hurricanes and typhoons, leading to more 
destructive storms. Coastal regions are at heightened risk of storm surge and flooding. Early 
Droughts 
Increased temperatures and changing precipitation patterns are contributing to more frequent 
and severe droughts. This affects agriculture, water supply, and ecosystems, particularly in 
arid and semi-arid regions. Droughts can lead to food and water shortages and exacerbate 
conflicts. 
Flooding 
Heavy rainfall events are becoming more common, leading to increased flooding. Urban
----------------------------------------------------------------------------------------------------
Document 2:

Climate change is linked to an

In [89]:
rag_chain1 = (
    {"context": compression_retriever,  "question": RunnablePassthrough()} 
    | rag_prompt 
    | llm
    | StrOutputParser()
)

# Invoke the chain that uses the re-ranking retriever
result = rag_chain1.invoke(test_query)
# pprint.pprint(result)
display(Markdown(result))

I don't know.

In [90]:
mongo_client.close() # MongoDB

## Bonus: Visualization of Chunks and Query

https://github.com/gabrielchua/RAGxplorer

In [21]:
# %pip install ragxplorer nbformat

In [None]:
from ragxplorer import RAGxplorer

# ❌ OpenAI embeddings
# client_openai = RAGxplorer(embedding_model="text-embedding-3-small")

# ✅ Local HuggingFace embeddings (same model used throughout the notebook)
client_openai = RAGxplorer(embedding_model=EMBEDDING_MODEL)  # "all-MiniLM-L6-v2"

client_openai.load_pdf(
    document_path=pdf_path, 
    chunk_size=1000,
    chunk_overlap=100,
    verbose=True
)

In [None]:
# ❌ HyDE has a bug with local embeddings
# client_openai.visualize_query(
#     query=test_query, 
#     retrieval_method="HyDE", 
#     top_k=6, 
#     query_shape_size=10
# )

# ✅ Use the naive method (basic search without HyDE)
client_openai.visualize_query(
    query=test_query, 
    retrieval_method="naive",  # HyDE has a bug with local embeddings
    top_k=6, 
    query_shape_size=10
)

---