<a href="https://colab.research.google.com/github/sualeh/introduction-to-chatgpt-api/blob/main/local-vector-database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

----------

> **How to Run This Notebook**

To get started, create an OpenAI API account, set up billing, and generate an API key at https://platform.openai.com/. If you are running the notebook locally in Visual Studio Code or another IDE, create a file called `.env`, and add a line `OPENAI_API_KEY=<your-openai-api-key>`. This key will be read by the `load_dotenv` function from the "python-dotenv" library.

Otherwise, if you are running in Google Colab, create a secret called `OPENAI_API_KEY` and set it to the value of your OpenAI API key.

Run the code below to read the key.


In [None]:
%pip install -qq python-dotenv

from os import environ as env
from dotenv import load_dotenv
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

# Load the key from an environment variable called "OPENAI_API_KEY"
# Use python-dotenv https://pypi.org/project/python-dotenv/
# to read environment variables from the .env file
load_dotenv()
try:
  # Attempt to read OPENAI_API_KEY from a Google Colab secret
  from google.colab import userdata
  env['OPENAI_API_KEY'] = env.get('OPENAI_API_KEY', userdata.get('OPENAI_API_KEY'))
except ModuleNotFoundError:
  logger.info("Not running in Google Colab")
  # No action - rely on the OPENAI_API_KEY environment variable
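
Before moving on, it can help to confirm that the key was actually picked up. The following is a minimal sketch, not part of the original notebook; the helper name `describe_api_key` is made up for this example, and it deliberately reports only the length of the key so that the secret itself is never printed.

```python
import os


def describe_api_key(variable_name: str = "OPENAI_API_KEY") -> str:
    """Return a short, non-sensitive description of whether the key is set."""
    value = os.environ.get(variable_name, "")
    if not value:
        return f"{variable_name} is not set"
    # Report only the length so the key never appears in the notebook output
    return f"{variable_name} is set ({len(value)} characters)"


print(describe_api_key())
```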



----------

## Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models by combining two key components:

1. **Retrieval**: Finding relevant information from a knowledge base
2. **Generation**: Using that information to create accurate, contextual responses

RAG helps overcome LLMs' limitations by providing external, up-to-date knowledge without having to retrain the model. In this notebook, we'll build a complete RAG pipeline step by step.
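
As a rough illustration of the two stages, the sketch below uses plain word overlap as a stand-in for retrieval and stops at building the prompt that a model would receive. The `KNOWLEDGE_BASE` contents and the function names `tokenize`, `retrieve`, and `build_prompt` are made up for this example; the rest of the notebook replaces them with embeddings, a vector database, and a chat model.

```python
import re

# Hypothetical miniature knowledge base; a real one would be loaded from files
KNOWLEDGE_BASE = [
    "The office cafeteria serves lunch from noon to 2 pm.",
    "FAISS is a library for similarity search over dense vectors.",
    "Paris is the capital of France.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 1) -> list[str]:
    """Retrieval: rank passages by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda passage: len(question_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Generation input: combine the retrieved passages with the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


question = "What is the capital of France?"
print(build_prompt(question, retrieve(question)))
```

Word overlap misses synonyms and paraphrases, which is the gap that embedding-based retrieval is meant to close.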

# Vector Databases

## Load Files

The first step in any RAG system is to ingest the knowledge that will later be retrieved. We need to:

- Import external knowledge (PDFs, text files, web pages, etc.)
- Convert this unstructured data into a structured format ("langchain" `Document` objects)
- Preserve metadata like page numbers or source information for later citation

Document loaders like `PyPDFLoader` from the "langchain" library handle the complex task of parsing different file formats and converting them into a standardized `Document` structure that the rest of our RAG pipeline can process.
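
To make the idea of a standardized structure concrete, here is a minimal sketch of what a loader produces. `SimpleDocument` is a hypothetical stand-in that mirrors the two fields of a "langchain" `Document` (`page_content` and `metadata`), and `load_pages` is an invented toy loader that treats form feed characters as page breaks; neither is part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a langchain Document: text plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def load_pages(raw_text: str, source: str) -> list[SimpleDocument]:
    """Split raw text on form feeds (page breaks) into one document per page."""
    pages = raw_text.split("\f")
    return [
        SimpleDocument(
            page_content=page.strip(),
            # Keep the origin of each page so answers can cite it later
            metadata={"source": source, "page": page_number},
        )
        for page_number, page in enumerate(pages)
    ]


documents = load_pages("First page text.\fSecond page text.", source="example.txt")
for document in documents:
    print(document.metadata, "->", document.page_content)
```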

In [None]:
%pip install -qq langchain langchain-community pypdf

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document

file_path = "./example.pdf"
loader = PyPDFLoader(file_path)
loaded_documents: list[Document] = loader.load()

Print a preview of the loaded documents.

In [None]:
def print_document_chunks(
    documents: list[Document],
    limit: int = 3,
    context: int = 100,
) -> None:
    """
    Print a preview of document chunks with their metadata.

    Args:
        documents: List of Document objects to preview.
        limit: Maximum number of chunks to display.
        context: Number of characters to show from the start and end of each chunk.
    """
    shown = min(limit, len(documents))
    print(f"Printing {shown} of {len(documents)} document chunk(s) with metadata")
    print()
    # Slice so that at most `limit` chunks are printed
    for index, chunk in enumerate(documents[:limit]):
        print(f"------ CHUNK {index+1} -------------------------------------------------")
        print(chunk.metadata)
        print()
        print(chunk.page_content[:context])
        print("... (skipping content) ...")
        print(chunk.page_content[-context:])
        print()

In [None]:
print_document_chunks(loaded_documents, limit=3)

## Text Splitting

Next, we'll split the documents into smaller chunks for better embedding and retrieval. Large documents need to be broken down into smaller pieces for several reasons:

- **Embedding Limitations**: Most embedding models have token limits (e.g., 8,191 tokens for "text-embedding-ada-002")
- **Retrieval Precision**: Smaller chunks allow for more precise retrieval of relevant information
- **Context Windows**: LLMs have limited context windows - retrieving entire documents would waste tokens
- **Semantic Focus**: Each chunk should ideally contain coherent, focused information on a specific topic

The "langchain" `RecursiveCharacterTextSplitter` is intelligent about how it splits text, trying to preserve natural boundaries like paragraphs while respecting maximum chunk sizes. The `chunk_overlap` parameter creates some redundancy between chunks to preserve context that might be split across chunk boundaries.
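
The effect of `chunk_size` and `chunk_overlap` can be seen in a much simpler splitter. The sketch below is a plain fixed-size sliding window, not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` uses; the function name `split_with_overlap` is made up for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size chunks where neighbors share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

With a chunk size of 10 and an overlap of 3, each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk if it is short enough.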

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " ", ""]
)

chunks = text_splitter.split_documents(loaded_documents)

Look at the chunks of text.

In [None]:
print_document_chunks(chunks, limit=3)

## Create a Vector Database

At the core of modern RAG systems are vector embeddings and vector databases:

**Text Embeddings** are mathematical representations of text as high-dimensional vectors (e.g., 1,536 dimensions for OpenAI's "text-embedding-ada-002" model). These vectors capture semantic relationships: texts with similar meanings cluster together in vector space. Embedding vectors are stored in a **Vector Database**, a specialized store that supports efficient similarity search in high-dimensional space.

The embedding process transforms our text chunks into vectors that preserve their semantic meaning. These vectors are then stored in a vector database, which provides efficient nearest-neighbor search capabilities critical for the retrieval component of RAG. We use **FAISS**, a vector database library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors.
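
To show what "similar meanings cluster together" means numerically, the sketch below compares a few vectors with cosine similarity. The three-dimensional vectors in `toy_embeddings` are invented for illustration and do not come from any embedding model; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: "cat" and "kitten" point in nearly the same direction
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

query_vector = toy_embeddings["cat"]
for text, vector in toy_embeddings.items():
    print(f"{text}: {cosine_similarity(query_vector, vector):.3f}")
```

A vector database performs this kind of comparison at scale, using index structures that avoid scoring every stored vector one by one.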

In [None]:
%pip install -qq faiss-cpu langchain-openai

Create a vector database to store the embeddings and perform similarity searches.

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# OpenAI embeddings model
embeddings_model = OpenAIEmbeddings()

# Create a FAISS vector store from the document chunks
vector_db = FAISS.from_documents(chunks, embeddings_model)

## Query

### The Retrieval Process

Retrieval is where the 'R' in RAG comes into play:

1. **Embed the query**: The user's question or prompt is embedded with the same embedding model used for the documents, so that both live in the same vector space.
2. **Similarity search**: The vector database finds the document chunks whose embeddings are closest to the query embedding.
3. **Ranking**: The chunks are ranked by a similarity score such as cosine similarity or Euclidean (L2) distance. The "langchain" FAISS store uses L2 distance by default, so a lower score means a closer match.
4. **Top-k selection**: The k most relevant chunks are selected to provide as context.

This similarity-based retrieval is far more powerful than simple keyword matching because it captures semantic relationships. For example, a query about "climate impacts" might retrieve documents mentioning "environmental effects" even if the exact words don't match.
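
The ranking and top-k selection can be sketched as a brute-force scan over a tiny index. The vectors below are invented for illustration (the query vector is a pretend embedding of "climate impacts"), and the function names `l2_distance` and `top_k` are made up for this example; FAISS relies on optimized index structures rather than a Python loop.

```python
import heapq
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(
    query_vector: list[float],
    index: list[tuple[str, list[float]]],
    k: int,
) -> list[tuple[float, str]]:
    """Return the k (distance, text) pairs closest to the query, nearest first."""
    scored = ((l2_distance(query_vector, vector), text) for text, vector in index)
    return heapq.nsmallest(k, scored)


# Hypothetical index of (text, embedding) pairs
toy_index = [
    ("environmental effects of warming", [0.8, 0.2, 0.1]),
    ("quarterly stock prices", [0.0, 0.1, 0.9]),
    ("rising sea levels", [0.7, 0.3, 0.0]),
]

# Pretend embedding of the query "climate impacts"
query_vector = [0.9, 0.1, 0.0]

for distance, text in top_k(query_vector, toy_index, k=2):
    print(f"{distance:.3f}  {text}")
```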

In [None]:
query = "Who is Joe?"

results = vector_db.similarity_search_with_score(query, k=2)

# Each result is a (document, score) pair; keep only the documents for printing
print_document_chunks([document for document, _score in results], context=200)

## Set Up the Chat Model

The final stage of RAG integrates retrieval with generation. We construct a combined prompt that includes both the retrieved context and the user's question. The enriched prompt is sent to the LLM, which generates an answer based on both the question and the retrieved context.

The prompt explicitly instructs the model to rely on the provided context and admit when it doesn't know, which helps prevent hallucinations (made-up information). The temperature setting (0.7) provides a balance between creative and deterministic responses.

This end-to-end pipeline combines the knowledge retrieval capabilities of vector databases with the reasoning and language generation capabilities of LLMs, creating a system that can provide accurate, contextual answers based on specific knowledge sources.
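
The prompt construction step can be sketched with plain string formatting, independent of any library. The template wording, the sample chunks, and the helper names `format_context` and `build_rag_prompt` are assumptions made for this example; numbering the chunks is one possible way to let the model refer back to its sources.

```python
PROMPT_TEMPLATE = (
    "Answer using only the context below. "
    "If the context does not contain the answer, say that you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def format_context(chunks: list[str]) -> str:
    """Number each retrieved chunk so the answer can point back to a source."""
    return "\n\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(chunks, start=1)
    )


def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks and the user's question into one prompt."""
    return PROMPT_TEMPLATE.format(context=format_context(chunks), question=question)


# Hypothetical retrieved chunks, standing in for the output of a retriever
retrieved_chunks = [
    "The warranty covers parts for two years.",
    "Returns are accepted within 30 days.",
]
print(build_rag_prompt("How long is the warranty?", retrieved_chunks))
```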

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Set up the chat model, using langchain
chat_model = ChatOpenAI(
    model="gpt-4o",
    temperature=0.7,
)
# Create the RAG prompt template
prompt_template = ChatPromptTemplate.from_template("""
    You are a helpful assistant that provides accurate information based on the given context.
    If you don't know the answer based on the context, just say that you don't know.
    Don't try to make up an answer.
    
    Context:
    {context}
    
    Question: {question}
    
    Answer:
    """)

# Create a retriever from the vector database
k = 3
retriever = vector_db.as_retriever(search_kwargs={"k": k})

# Format the retrieved documents into a single context string
# Source numbers are printed for debugging only; they are not added to the context
def format_docs(docs):
    # DEBUG: Print the retrieved documents
    print(f"Retrieved {len(docs)} documents:")
    for i, doc in enumerate(docs):
        source_info = f"Source [{i+1}]"
        print(doc.metadata)
        if 'source' in doc.metadata:
            source_info = source_info + f": {doc.metadata['source']}"
        # Flatten newlines to spaces so the preview stays readable on one line
        page_content = doc.page_content.replace("\n", " ")
        print(f"{source_info}\n\t{page_content[:50]} ... {page_content[-50:]}")

    return "\n\n".join(doc.page_content for doc in docs)

# Create the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | chat_model
    | StrOutputParser()
)

question = "Tell me about Joe"
print(f"\nQuestion: {question}\n\n")
answer = rag_chain.invoke(question)
print(f"\n\nAnswer: {answer}")