<a href="https://colab.research.google.com/github/sualeh/introduction-to-chatgpt-api/blob/main/local-vector-database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

----------

> **How to Run This Notebook**

To get started, create an OpenAI API account, set up billing, and generate an API key at https://platform.openai.com/. If you are running the notebook locally in Visual Studio Code or another IDE, create a file called `.env` and add the line `OPENAI_API_KEY=<your-openai-api-key>`. The `load_dotenv` function from the `python-dotenv` library reads this key.

Otherwise, if you are running in Google Colab, create a secret called `OPENAI_API_KEY` and set it to the value of your OpenAI API key.

Run the code below to read the key.


In [None]:
%pip install -qq python-dotenv

from os import environ as env
from dotenv import load_dotenv
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

# Load the key from an environment variable called "OPENAI_API_KEY".
# python-dotenv (https://pypi.org/project/python-dotenv/) reads
# environment variables from the .env file.
load_dotenv()
try:
  # Attempt to read OPENAI_API_KEY from a Google Colab secret
  from google.colab import userdata
  env['OPENAI_API_KEY'] = env.get('OPENAI_API_KEY', userdata.get('OPENAI_API_KEY'))
except ModuleNotFoundError:
  logger.info("Not running in Google Colab")
  # No action - rely on the OPENAI_API_KEY environmental variable
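
If the key is missing, the failure only shows up later, when the first OpenAI call is made. As a minimal sketch (the helper names `require_env` and `mask` are hypothetical and not part of the notebook or of any library), the check below fails early and lets you log the key without exposing it.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to .env or create a Colab secret."
        )
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

After the cell above has run, `mask(require_env("OPENAI_API_KEY"))` either raises a clear error or returns a masked string that is safe to print.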



----------

# Vector Databases

## Load Files

Load Adobe PDF files or text files from a file path. This example loads a PDF, which `PyPDFLoader` reads into a list of `Document` objects, one per page.

In [None]:
%pip install -qq langchain langchain-community pypdf

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document

file_path = "./example.pdf"
loader = PyPDFLoader(file_path)
loaded_documents: list[Document] = loader.load()

Define a helper that prints a preview of each loaded document along with its metadata.

In [None]:
def print_document_chunks(
    documents: list[Document],
    limit: int = 3,
    context: int = 100,
) -> None:
    """
    Print a preview of document chunks with their metadata.

    Args:
        documents: List of Document objects to preview.
        limit: Maximum number of chunks to display.
        context: Number of characters to show from the start and end of each chunk.
    """
    shown = min(limit, len(documents))
    print(f"Printing {shown} of {len(documents)} document chunk(s) with metadata")
    print()
    # Slice so that at most `limit` chunks are printed
    for index, chunk in enumerate(documents[:limit]):
        print(f"------ CHUNK {index+1} -------------------------------------------------")
        print(chunk.metadata)
        print()
        print(chunk.page_content[:context])
        print("... (skipping content) ...")
        print(chunk.page_content[-context:])
        print()

In [None]:
print_document_chunks(loaded_documents, limit=3)

## Text Splitting

Next, we'll split the documents into smaller chunks for better embedding and retrieval.
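
To see what a chunk size and a chunk overlap mean, here is a simplified sketch that cuts a string into fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to break at the separators it is given (paragraphs, lines, sentences) and only treats the chunk size as an upper limit. The function name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the one before it. The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, as long as it is no longer than the overlap.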

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " ", ""]
)

chunks = text_splitter.split_documents(loaded_documents)

Look at the chunks of text.

In [None]:
print_document_chunks(chunks, limit=3)

## Create a Vector Database

Now, let's embed the chunks with OpenAI embeddings, build a FAISS vector database from them, and save it locally.

In [None]:
%pip install -qq faiss-cpu langchain-openai

Read the path where the vector database will be saved from the `VECTOR_DB_PATH` variable in the `.env.params` file.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path=".env.params")

VECTOR_DB_PATH = os.getenv("VECTOR_DB_PATH")

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings_model = OpenAIEmbeddings()
vector_db = FAISS.from_documents(chunks, embeddings_model)

vector_db.save_local(VECTOR_DB_PATH)
print(f"Vector database saved to {VECTOR_DB_PATH}")

## Query

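Query the vector database for the chunks most similar to a question. `similarity_search_with_score` returns `(document, score)` pairs. With the default FAISS index the score is an L2 distance, so a lower score means a closer match.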
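
To make the idea of a similarity search concrete, here is a minimal sketch of nearest-neighbor lookup by L2 (Euclidean) distance in plain Python. It assumes the default L2 metric and uses hypothetical 3-dimensional vectors and sentences in place of real embeddings. The names `l2_distance`, `nearest`, `toy_index` and `toy_query` are illustrative only. FAISS does the same kind of comparison far more efficiently.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(
    query_vector: list[float],
    indexed: list[tuple[str, list[float]]],
    k: int = 2,
) -> list[tuple[str, float]]:
    """Return the k (text, distance) pairs closest to the query, nearest first."""
    scored = [(text, l2_distance(query_vector, vector)) for text, vector in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy "embeddings"; real embedding vectors have many more dimensions
toy_index = [
    ("Joe is the project lead.", [0.9, 0.1, 0.0]),
    ("The budget was approved in May.", [0.0, 0.8, 0.6]),
    ("Joe joined the team last year.", [0.8, 0.2, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]  # stands in for the embedding of "Who is Joe?"

for text, distance in nearest(toy_query, toy_index, k=2):
    print(f"{distance:.3f}  {text}")
```

The two sentences about Joe have the smallest distances to the query vector, so they are returned first. The unrelated sentence is dropped because `k=2`.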

In [None]:
query = "Who is Joe?"

results = vector_db.similarity_search_with_score(query, k=2)

# Each result is a (document, score) pair; keep only the documents for printing
print_document_chunks([document for document, _score in results], context=200)