# RAG Pipeline: PDF to Pinecone

This notebook processes the rename-2024.pdf file and uploads it to Pinecone for RAG (Retrieval Augmented Generation).

## Steps:
1. Load PDF using LangChain
2. Chunk the document
3. Generate embeddings
4. Upload to Pinecone

## 1. Install Dependencies

In [None]:
pip install langchain_community

In [None]:
pip install langchain-text-splitters

In [None]:
pip install langchain-google-genai

In [None]:
pip install pinecone

In [None]:
pip install pypdf

In [1]:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from pinecone import Pinecone, ServerlessSpec
import time

# Load environment variables
load_dotenv()

# API Keys
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY") 
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

if not GOOGLE_API_KEY or not PINECONE_API_KEY:
    raise ValueError("Please set GOOGLE_API_KEY and PINECONE_API_KEY in .env file")

print("✓ Environment variables loaded successfully")

ModuleNotFoundError: No module named 'langchain_community'

## 2. Load and Parse PDF

In [14]:
# Path to PDF
PDF_PATH = "../data/rename-2024.pdf"

# Load PDF
print(f"Loading PDF from: {PDF_PATH}")
loader = PyPDFLoader(PDF_PATH)
documents = loader.load()

print(f"✓ Loaded {len(documents)} pages from PDF")
print(f"✓ Total characters: {sum(len(doc.page_content) for doc in documents):,}")

# Show first page preview
print("\n--- First Page Preview ---")
print(documents[0].page_content[:500] + "...")

Loading PDF from: ../data/rename-2024.pdf
✓ Loaded 267 pages from PDF
✓ Total characters: 242,022

--- First Page Preview ---
MINISTÉRIO DA SAÚDE
RENAME 2024
Brasília – DF
2025
RELAÇÃO
NACIONAL DE
MEDICAMENTOS
ESSENCIAIS
2ª edição...


## 3. Chunk Documents

In [15]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Characters per chunk
    chunk_overlap=200,      # Overlap between chunks for context
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)

print(f"✓ Created {len(chunks)} chunks from {len(documents)} pages")
print(f"✓ Average chunk size: {sum(len(chunk.page_content) for chunk in chunks) // len(chunks)} characters")

# Show first chunk preview
print("\n--- First Chunk Preview ---")
print(f"Content: {chunks[0].page_content[:300]}...")
print(f"\nMetadata: {chunks[0].metadata}")

✓ Created 328 chunks from 267 pages
✓ Average chunk size: 769 characters

--- First Chunk Preview ---
Content: MINISTÉRIO DA SAÚDE
RENAME 2024
Brasília – DF
2025
RELAÇÃO
NACIONAL DE
MEDICAMENTOS
ESSENCIAIS
2ª edição...

Metadata: {'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 20.5 (Windows)', 'creationdate': '2025-10-17T09:52:20-03:00', 'moddate': '2025-10-29T09:11:31-03:00', 'trapped': '/False', 'source': '../data/rename-2024.pdf', 'total_pages': 267, 'page': 0, 'page_label': 'C1'}


## 4. Initialize Pinecone

In [18]:
# IMPORTANT: Delete existing index if it has wrong dimension
pc_temp = Pinecone(api_key=PINECONE_API_KEY)
INDEX_NAME = "health-rag"

if INDEX_NAME in pc_temp.list_indexes().names():
    print(f"Deleting existing index '{INDEX_NAME}' (has wrong dimension 768)...")
    pc_temp.delete_index(INDEX_NAME)
    print("✓ Index deleted. Waiting 10 seconds...")
    time.sleep(10)
else:
    print(f"No existing index '{INDEX_NAME}' found.")

Deleting existing index 'health-rag' (has wrong dimension 768)...
✓ Index deleted. Waiting 10 seconds...


In [19]:
# Initialize Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)

# Configuration
INDEX_NAME = "health-rag"
DIMENSION = 3072  # gemini-embedding-001 dimension (NOT 768!)

# Check if index exists, if not create it
if INDEX_NAME not in pc.list_indexes().names():
    print(f"Creating new index: {INDEX_NAME} with dimension {DIMENSION}")
    pc.create_index(
        name=INDEX_NAME,
        dimension=DIMENSION,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
    # Wait for index to be ready
    while not pc.describe_index(INDEX_NAME).status['ready']:
        time.sleep(1)
    print(f"✓ Index '{INDEX_NAME}' created successfully")
else:
    print(f"✓ Using existing index: {INDEX_NAME}")

# Connect to index
index = pc.Index(INDEX_NAME)
print(f"✓ Connected to index. Stats: {index.describe_index_stats()}")

Creating new index: health-rag with dimension 3072
✓ Index 'health-rag' created successfully
✓ Connected to index. Stats: {'dimension': 3072,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}


## 5. Generate Embeddings and Upload to Pinecone

In [20]:
# Initialize Google Generative AI Embeddings
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    google_api_key=GOOGLE_API_KEY
)

print("✓ Embeddings model initialized")

# Prepare vectors for upload
batch_size = 100
vectors_to_upsert = []

print(f"\nGenerating embeddings for {len(chunks)} chunks...")

for i, chunk in enumerate(chunks):
    # Generate embedding
    embedding = embeddings.embed_query(chunk.page_content)
    
    # Prepare vector with metadata
    vector = {
        "id": f"chunk_{i}",
        "values": embedding,
        "metadata": {
            "text": chunk.page_content,
            "page": chunk.metadata.get("page", 0),
            "source": chunk.metadata.get("source", "rename-2024.pdf")
        }
    }
    
    vectors_to_upsert.append(vector)
    
    # Upload in batches
    if len(vectors_to_upsert) >= batch_size:
        index.upsert(vectors=vectors_to_upsert)
        print(f"  Uploaded batch: {i+1-batch_size} to {i+1}")
        vectors_to_upsert = []
    
    # Show progress every 50 chunks
    if (i + 1) % 50 == 0:
        print(f"  Processed {i+1}/{len(chunks)} chunks...")

# Upload remaining vectors
if vectors_to_upsert:
    index.upsert(vectors=vectors_to_upsert)
    print(f"  Uploaded final batch: {len(vectors_to_upsert)} vectors")

print(f"\n✓ Successfully uploaded {len(chunks)} vectors to Pinecone!")
print(f"✓ Index stats: {index.describe_index_stats()}")

✓ Embeddings model initialized

Generating embeddings for 328 chunks...


E0000 00:00:1762668446.069556   46511 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


  Processed 50/328 chunks...
  Uploaded batch: 0 to 100
  Processed 100/328 chunks...
  Processed 150/328 chunks...
  Uploaded batch: 100 to 200
  Processed 200/328 chunks...
  Processed 250/328 chunks...
  Uploaded batch: 200 to 300
  Processed 300/328 chunks...
  Uploaded final batch: 28 vectors

✓ Successfully uploaded 328 vectors to Pinecone!
✓ Index stats: {'dimension': 3072,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 328}},
 'total_vector_count': 328,
 'vector_type': 'dense'}


## 6. Test RAG Query

In [21]:
# Test query
test_query = "What are the main drug interactions mentioned in the document?"

print(f"Test Query: {test_query}\n")

# Generate query embedding
query_embedding = embeddings.embed_query(test_query)

# Search in Pinecone
results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

print(f"Found {len(results['matches'])} relevant chunks:\n")

for i, match in enumerate(results['matches'], 1):
    print(f"--- Result {i} (Score: {match['score']:.4f}) ---")
    print(f"Page: {match['metadata']['page']}")
    print(f"Text: {match['metadata']['text'][:300]}...")
    print()

Test Query: What are the main drug interactions mentioned in the document?

Found 3 relevant chunks:

--- Result 1 (Score: 0.7450) ---
Page: 265.0
Text: Conte-nos o que pensa sobre 
esta publicação. Clique aqui 
e responda a pesquisa....

--- Result 2 (Score: 0.7431) ---
Page: 20.0
Text: | 20
NOTAS EXPLICATIVAS
APÊNDICE A
Classificação e código ATC: 
O sistema de classificação Anatomical Therapeutic Chemical  (ATC) 
foi implementado como ferramenta para estudos de utilização de 
medicamentos na década de 1960. Com o objetivo de integrar os estudos 
internacionais de utilização de me...

--- Result 3 (Score: 0.7412) ---
Page: 3.0
Text: D: Medicamentos dermatológicos .............................. 49
G: Aparelho geniturinário e hormônios sexuais ....... 52
H: Preparações hormonais sistêmicas, excluindo 
hormônios sexuais e insulinas .................................... 55
J: Anti-infecciosos para uso sistêmico ........................



## Summary

✅ Successfully created RAG pipeline:
- Loaded PDF with PyPDFLoader
- Chunked document into manageable pieces
- Generated embeddings with `gemini-embedding-001`
- Uploaded to Pinecone vector database
- Tested retrieval with sample query

**Next Steps:**
- Integrate this RAG retrieval into MCP server
- Create tools for agents to query the knowledge base
- Add metadata filtering for specific pages/sections

## 7. Test RAG WITHOUT LangChain (Using only Pinecone SDK + Google GenAI)

Esta seção demonstra como fazer RAG usando apenas:
- SDK nativo do Pinecone
- Cliente nativo do Google Generative AI (sem LangChain)

In [3]:
# Install Google GenAI SDK (se ainda não instalado)
# pip install google-generativeai

import os
from dotenv import load_dotenv
import google.generativeai as genai
from pinecone import Pinecone

# Carregar variáveis de ambiente
load_dotenv()

# Pegar API keys do .env
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

if not GOOGLE_API_KEY or not PINECONE_API_KEY:
    raise ValueError("GOOGLE_API_KEY e PINECONE_API_KEY devem estar no .env")

# Configurar Google GenAI
genai.configure(api_key=GOOGLE_API_KEY)

# Conectar ao Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("health-rag")

print("✓ Environment variables loaded")
print("✓ Google Generative AI configured")
print("✓ Pinecone connected to index 'health-rag'")
print(f"\n✓ Available embedding models:")
for model in genai.list_models():
    if 'embedContent' in model.supported_generation_methods:
        print(f"  - {model.name}")

✓ Environment variables loaded
✓ Google Generative AI configured
✓ Pinecone connected to index 'health-rag'

✓ Available embedding models:
  - models/embedding-001
  - models/text-embedding-004
  - models/gemini-embedding-exp-03-07
  - models/gemini-embedding-exp
  - models/gemini-embedding-001


In [4]:
# Query de teste em português
test_query_pt = "Quais são as interações medicamentosas do paracetamol?"

print(f"Query: {test_query_pt}\n")

# Gerar embedding usando Google GenAI nativo (sem LangChain)
result = genai.embed_content(
    model="models/gemini-embedding-001",  # Modelo mais recente
    content=test_query_pt,
    task_type="retrieval_query"
)

query_embedding = result['embedding']

print(f"✓ Generated embedding using google.generativeai SDK")
print(f"✓ Embedding dimension: {len(query_embedding)}")
print(f"✓ First 5 values: {query_embedding[:5]}")

Query: Quais são as interações medicamentosas do paracetamol?

✓ Generated embedding using google.generativeai SDK
✓ Embedding dimension: 3072
✓ First 5 values: [0.0022615031, -0.00039042847, 0.014973011, -0.072072364, 0.03780577]


In [5]:
# Buscar no Pinecone usando SDK nativo (sem LangChain)
search_results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)

print(f"✓ Found {len(search_results['matches'])} results using Pinecone SDK\n")

for i, match in enumerate(search_results['matches'], 1):
    score = match['score']
    metadata = match['metadata']
    
    print(f"--- Result {i} (Score: {score:.4f}) ---")
    print(f"Page: {metadata.get('page', 'N/A')}")
    print(f"Source: {metadata.get('source', 'N/A')}")
    print(f"Text preview: {metadata.get('text', '')[:250]}...")
    print()

✓ Found 5 results using Pinecone SDK

--- Result 1 (Score: 0.6368) ---
Page: 151.0
Source: ../data/rename-2024.pdf
Text preview: | 151
RELAÇÃO NACIONAL DE MEDICAMENTOS ESSENCIAIS | RENAME 2024
Denominação Comum 
Brasileira (DCB) Concentração/Composição Forma Farmacêutica 
omeprazol
10 mg cápsula
20 mg cápsula
palmitato de retinol 150.000 UI/mL solução oral
paracetamol
200 mg/m...

--- Result 2 (Score: 0.6329) ---
Page: 108.0
Source: ../data/rename-2024.pdf
Text preview: | 108MINISTÉRIO DA SAÚDE
Denominação Comum 
Brasileira (DCB)
Concentração/
Composição Forma Farmacêutica 
Componente de 
Financiamento 
da Assistência 
Farmacêutica 
Código ATC
nicotina
7 mg adesivo transdérmico Estratégico N07BA01
14 mg adesivo tran...

--- Result 3 (Score: 0.6195) ---
Page: 20.0
Source: ../data/rename-2024.pdf
Text preview: | 20
NOTAS EXPLICATIVAS
APÊNDICE A
Classificação e código ATC: 
O sistema de classificação Anatomical Therapeutic Chemical  (ATC) 
foi implementado como ferramenta para estudos de 