<a href="https://colab.research.google.com/github/sualeh/introduction-to-chatgpt-api/blob/main/local-vector-database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

----------

> **How to Run This Notebook**

To get started, create an OpenAI API account, set up billing, and generate an API key at https://platform.openai.com/. If you are running the notebook locally in Visual Studio Code or another IDE, create a file called `.env` and add the line `OPENAI_API_KEY=<your-openai-api-key>`. The key will be read by the `load_dotenv` function from the `python-dotenv` library.

Otherwise, if you are running in Google Colab, create a secret called `OPENAI_API_KEY` and set it to the value of your OpenAI API key.

Run the code below to read the key.


In [None]:
%pip install -qq python-dotenv

from os import environ as env
from dotenv import load_dotenv
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

# Load the key from an environment variable called "OPENAI_API_KEY"
# Use python-dotenv https://pypi.org/project/python-dotenv/
# to take environment variables from .env
load_dotenv()
try:
  # Attempt to read OPENAI_API_KEY from a Google Colab secret,
  # but only if it is not already set in the environment
  from google.colab import userdata
  if 'OPENAI_API_KEY' not in env:
    env['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
except ModuleNotFoundError:
  logger.info("Not running in Google Colab")
  # No action - rely on the OPENAI_API_KEY environment variable
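
Before going further, it can help to confirm that the key was actually picked up, without echoing the secret into the notebook output. The following is a minimal sketch; `mask_secret` is a hypothetical helper, not part of `python-dotenv` or any OpenAI library.

```python
import os
from typing import Optional


def mask_secret(value: Optional[str], visible: int = 4) -> str:
    """Return a display-safe version of a secret (hypothetical helper)."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        # Too short to reveal any part of it safely
        return "*" * len(value)
    # Keep only the last few characters so the key can be recognized
    return "*" * (len(value) - visible) + value[-visible:]


print("OPENAI_API_KEY:", mask_secret(os.environ.get("OPENAI_API_KEY")))
```

If the output is `<not set>`, check the `.env` file or the Colab secret before running the cells that call the OpenAI API.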



----------

# Vector Databases

## Define File Loading Functions

Define functions to load and process PDF and text files, and test them.

In [None]:
%pip install -qq langchain langchain-community pypdf

In [None]:
import logging
import os
from dotenv import load_dotenv

# Read DOCUMENT_PATH from the .env.params file
load_dotenv(dotenv_path=".env.params")

DOCUMENT_PATH = os.getenv("DOCUMENT_PATH")

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(None)
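
`DOCUMENT_PATH` will be `None` if `.env.params` is missing or does not define it, and the failure then only surfaces later, when the path is used. A small guard can report the problem up front. This is a sketch under assumptions: `require_env` is a hypothetical helper, and the value in the usage lines is made up.

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return a required setting, or raise a clear error if it is missing or blank."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add a line '{name}=...' to .env.params")
    return value


# Usage with a stand-in mapping (hypothetical value, for illustration only)
settings = {"DOCUMENT_PATH": "docs/sample.pdf"}
print(require_env("DOCUMENT_PATH", settings))
```

In the notebook itself, the call would be `require_env("DOCUMENT_PATH")`, which reads from the process environment populated by `load_dotenv`.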

In [None]:
# Import both loaders from langchain_community; the langchain.* paths are deprecated
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_core.documents import Document

def load_document(file_path: str) -> list[Document]:
    """
    Load a document based on its file extension.
    
    Args:
        file_path: Path to the file to be loaded.
    
    Returns:
        List of Document objects containing the content and metadata.
    """
    _, file_extension = os.path.splitext(file_path)
    
    if file_extension.lower() == '.pdf':
        loader = PyPDFLoader(file_path)
        return loader.load()
    
    elif file_extension.lower() == '.txt':
        loader = TextLoader(file_path)
        return loader.load()
    
    else:
        logger.error(f"Unsupported file format: {file_extension}")
        return []
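
`load_document` lower-cases the extension returned by `os.path.splitext` before comparing it. The short sketch below, using made-up file names, shows why: `splitext` preserves case, splits off only the last suffix, and treats a leading dot as part of the name.

```python
import os

# Hypothetical file names, for illustration only
for name in ["report.PDF", "notes.txt", "archive.tar.gz", "README", ".env"]:
    root, ext = os.path.splitext(name)
    print(f"{name!r:>18} -> root={root!r}, ext={ext!r}")
```

A file such as `README`, with no extension, therefore falls through to the "Unsupported file format" branch with an empty extension.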

In [None]:
def print_document_chunks(documents: list[Document], limit: int = 3) -> None:
    """
    Print preview of document chunks with their metadata.
    
    Args:
        documents: List of Document objects to preview.
        limit: Maximum number of chunks to display.
    """
    print()
    for index, chunk in enumerate(documents):
        if index >= limit:
            break
        print(f"------ CHUNK {index+1} -------------------------------------------------")
        print(chunk.metadata)
        print()
        print(chunk.page_content[:100])
        print("... (skipping content) ...")
        print(chunk.page_content[-100:])
        print()

Run the code to load the files.

In [None]:
documents = load_document(DOCUMENT_PATH)

print_document_chunks(documents)


## Text Splitting

Next, we'll split the documents into smaller chunks for better embedding and retrieval.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents: list[Document], chunk_size: int = 1000, chunk_overlap: int = 100) -> list[Document]:
    """
    Split documents into smaller chunks for better processing.
    
    Args:
        documents: List of Document objects to split.
        chunk_size: Maximum size of each chunk in characters.
        chunk_overlap: Number of characters of overlap between chunks.
    
    Returns:
        List of smaller Document chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    
    chunks = text_splitter.split_documents(documents)
    logger.info(f"Split into {len(chunks)} chunks.")
    return chunks
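
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch in plain Python. It is not the algorithm that `RecursiveCharacterTextSplitter` uses, which prefers to break at the listed separators; `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


for piece in sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3):
    print(piece)
```

Each chunk begins with the last three characters of the one before it. The overlap exists so that a sentence cut at a chunk boundary still appears intact in at least one chunk.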

Look at the chunks of text.

In [None]:
chunks = split_documents(documents)

print_document_chunks(chunks)


## Create a Vector Database

Now, let's create functions to build and save our vector database.

In [None]:
%pip install -qq faiss-cpu langchain-openai

In [None]:
from typing import Optional

from langchain_community.vectorstores import FAISS
from langchain_core.embeddings import Embeddings

def create_vector_db(chunks: list[Document], embeddings_model: Embeddings, save_path: Optional[str] = None) -> FAISS:
    """
    Create a vector database from document chunks.
    
    Args:
        chunks: List of Document chunks to store in the database.
        embeddings_model: Model to create vector embeddings from text.
        save_path: Optional path to save the vector database.
    
    Returns:
        FAISS vector database containing the document embeddings.
    """
    vector_db = FAISS.from_documents(chunks, embeddings_model)
    
    if save_path:
        vector_db.save_local(save_path)
        logger.info(f"Vector database saved to {save_path}")
    
    return vector_db
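
Conceptually, a vector database stores one embedding per chunk and answers a query by finding the stored vectors closest to the query's embedding. The toy sketch below does this by brute force with Euclidean (L2) distance over hand-written 3-dimensional vectors. It is an illustration only: `ToyVectorStore` is hypothetical, the vectors are made up rather than produced by an embedding model, and FAISS implements this kind of search far more efficiently.

```python
import math


class ToyVectorStore:
    """Brute-force nearest-neighbor search over in-memory vectors (illustration only)."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self._items.append((text, vector))

    def search(self, query_vector: list[float], k: int = 2) -> list[tuple[str, float]]:
        # Score every stored vector by its L2 distance to the query
        scored = [(text, math.dist(vector, query_vector)) for text, vector in self._items]
        # A smaller distance means a more similar item
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


store = ToyVectorStore()
store.add("cats", [1.0, 0.0, 0.0])
store.add("dogs", [0.9, 0.1, 0.0])
store.add("taxes", [0.0, 0.0, 1.0])

for text, distance in store.search([1.0, 0.0, 0.0], k=2):
    print(f"{text}: {distance:.3f}")
```

The query vector matches "cats" exactly and sits close to "dogs", so those two come back first while "taxes" is left out.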

Create the vector database.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path=".env.params")

VECTOR_DB_PATH = os.getenv("VECTOR_DB_PATH")

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

vector_db = create_vector_db(chunks, embeddings_model, save_path=VECTOR_DB_PATH)
print(vector_db)


## Query Function

Let's create a function to query our vector database.

In [None]:
def query_vector_db(query: str, vector_db: FAISS, k: int = 5) -> list[tuple[Document, float]]:
    """
    Query the vector database for similar documents.
    
    Args:
        query: Query string to search for.
        vector_db: FAISS vector database to search in.
        k: Number of results to return.
    
    Returns:
        List of (Document, score) tuples. With LangChain's default FAISS settings the
        score is an L2 distance, so a lower score means a closer match.
    """
    results = vector_db.similarity_search_with_score(query, k=k)
    
    return results

Query the vector database to get the most similar chunks and their scores.

In [None]:
query = "What are the kinds of organizational time?"

results = query_vector_db(query, vector_db, k=3)

# Keep only the documents; use a distinct loop variable so `results` is not shadowed
print_document_chunks([doc for doc, _ in results])
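
The scores returned alongside each document can also be used to drop weak matches. The sketch below assumes the scores are distances (lower is closer), which is worth checking against your installed LangChain version. `FakeDocument`, `filter_by_distance`, the sample scores, and the 0.4 cutoff are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document so the sketch runs without an index."""
    page_content: str


def filter_by_distance(results, max_distance: float):
    """Keep (document, score) pairs whose distance is at most max_distance."""
    return [(doc, score) for doc, score in results if score <= max_distance]


# Made-up results in the same (document, score) shape that query_vector_db returns
sample_results = [
    (FakeDocument("chunk A"), 0.21),
    (FakeDocument("chunk B"), 0.34),
    (FakeDocument("chunk C"), 0.58),
]

for doc, score in filter_by_distance(sample_results, max_distance=0.4):
    print(f"{score:.2f}  {doc.page_content}")
```

A sensible cutoff depends on the embedding model and the documents, so it is best chosen by inspecting the scores from a few real queries.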