Vector Database: FAISS, ChromaDB, or Pinecone?

- Step 1: Load the sec1 and sec2 PDFs
- Step 2: Split into chunks and tag each with metadata (page number + Sec1/Sec2 label)
- Step 3: Combine into a single vector DB

Each query should return:
- Page number + Sec1/Sec2 source

The UI should be able to filter between Sec1 and Sec2 content.
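
The filtering requirement can be sketched without any vector store. This is a minimal, hypothetical example: `Chunk` stands in for a retrieved chunk carrying the `page` and `source` metadata from Step 2, and `filter_by_source` is an illustrative helper, not part of FAISS or LangChain. The sample results are made up. In the real pipeline the same effect may be available through the vector store's own metadata filter (LangChain's FAISS wrapper accepts a `filter` argument on `similarity_search`), but that should be checked against the installed version.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved chunk with metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_source(chunks, source=None):
    """Return chunks whose 'source' label matches; None means no filtering."""
    if source is None:
        return list(chunks)
    return [c for c in chunks if c.metadata.get("source") == source]


# Made-up retrieval results, for illustration only
results = [
    Chunk("text a", {"page": 3, "source": "Sec1"}),
    Chunk("text b", {"page": 12, "source": "Sec2"}),
    Chunk("text c", {"page": 7, "source": "Sec1"}),
]

# Show only Sec1 hits, with page number and source label
for chunk in filter_by_source(results, "Sec1"):
    print(f"Page {chunk.metadata['page']} ({chunk.metadata['source']}): {chunk.text}")
```

Filtering after retrieval can leave fewer than `k` results for the selected source, so filtering inside the search itself is probably the better fit for the UI toggle.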

In [1]:

import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Check if a specific variable is loaded
#print(os.getenv("OPENAI_API_KEY"))


True

In [2]:
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

response = client.chat.completions.create(
    model=os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"),
    messages=[
        {"role": "system", "content": "Assistant is a large language model trained by OpenAI."},
        {"role": "user", "content": "Who were the founders of Microsoft?"}
    ],
    temperature=0
)

#print(response)
print(response.model_dump_json(indent=2))
print(response.choices[0].message.content)

{
  "id": "chatcmpl-B2aZHjkewf7cBITfUK1XKqUB7W2UV",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Microsoft was founded by Bill Gates and Paul Allen. The company was established on April 4, 1975, initially to develop and sell a version of the BASIC programming language for the Altair 8800 microcomputer. Over the years, Microsoft grew to become one of the largest and most influential technology companies in the world, known for its software products, including the Windows operating system and Microsoft Office suite.",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null,
        "refusal": null
      },
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
  

In [None]:
# Current import paths; the old langchain.* locations are deprecated
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

import os

from langchain_openai import AzureOpenAIEmbeddings


print(os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT'))
print(os.getenv('AZURE_OPENAI_ENDPOINT'))
print(os.getenv('AZURE_OPENAI_API_VERSION'))
# Confirm the key is set without printing the secret itself
print(bool(os.getenv('AZURE_OPENAI_API_KEY')))


embedding_model = AzureOpenAIEmbeddings(
    model=os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT'),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    api_version=os.getenv('AZURE_OPENAI_API_VERSION'),
    api_key=os.getenv('AZURE_OPENAI_API_KEY')
    )


# Load PDFs
pdf_paths = {"Sec1": "sec1.pdf", "Sec2": "sec2.pdf"}
documents = []

for label, path in pdf_paths.items():
    loader = PyPDFLoader(path)
    pages = loader.load()
    for page in pages:
        documents.append({
            "text": page.page_content,
            "metadata": {"page": page.metadata["page"], "source": label}
        })

# Chunk the documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
)
chunks = []

for doc in documents:
    splits = text_splitter.split_text(doc["text"])
    for split in splits:
        chunks.append({
            "text": split,
            "metadata": doc["metadata"]
        })

faiss_index = FAISS.from_texts(
    [chunk["text"] for chunk in chunks], 
    embedding_model, 
    metadatas=[chunk["metadata"] for chunk in chunks]
)

# Save VectorDB
faiss_index.save_local("faiss_index_2")

print("Vector database 2 created successfully!")

In [5]:
# Current import paths; the old langchain.* locations emit deprecation warnings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda
#from langchain.output_parsers import StrOutputParser
import os

# Load PDFs
pdf_paths = {"Sec1": "sec1.pdf", "Sec2": "sec2.pdf"}
documents = []

for label, path in pdf_paths.items():
    loader = PyPDFLoader(path)
    pages = loader.load()
    for page in pages:
        documents.append({
            "text": page.page_content,
            "metadata": {"page": page.metadata["page"], "source": label}
        })

# Chunk the documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
)
chunks = []

for doc in documents:
    splits = text_splitter.split_text(doc["text"])
    for split in splits:
        chunks.append({
            "text": split,
            "metadata": doc["metadata"]
        })

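# Generate Embeddings
# Read the key from the environment (loaded from .env in the first cell)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", openai_api_key=OPENAI_API_KEY)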

faiss_index = FAISS.from_texts(
    [chunk["text"] for chunk in chunks], 
    embeddings, 
    metadatas=[chunk["metadata"] for chunk in chunks]
)

# Save VectorDB
faiss_index.save_local("faiss_index")

print("Vector database created successfully!")

Vector database created successfully!


In [7]:
# Test query
question = "Who is the founder of Singapore?"
response = answer_question_from_vector_store(vectorstore, question)

print(response['answer'])
print()
print(f"Referenced sources: {[doc.metadata['source'] for doc in response['context']]}")



Perspective 1: Sir Stamford Raffles
Summary: Some argue that Sir Stamford Raffles should be considered the founder of Singapore due to his role in signing the 1819 Treaty that allowed the British to establish a trading post in the southern part of Singapore. Raffles' contributions to the early development of Singapore are significant and well-documented.

Perspective 2: William Farquhar
Summary: Others believe that William Farquhar should be recognized as the founder of Singapore because of his efforts in building Singapore from scratch. Farquhar played a crucial role in the early development of the settlement and his contributions are often overlooked in favor of Raffles.

Perspective 3: John Crawfurd
Summary: Some may argue that John Crawfurd should be considered a founder of Singapore as he signed the 1824 Treaty of Friendship and Alliance that gave the British control over the entire island. Crawfurd's role in solidifying British control over Singapore cannot be ignored in discussi

In [8]:
print(response['answer'])
print()
print("Referenced sources:")
for doc in response['context']:
    print(f"Page {doc.metadata['page']} (Source: {doc.metadata['source']}):\n{doc.page_content}\n")


Perspective 1: Sir Stamford Raffles
Summary: Some argue that Sir Stamford Raffles should be considered the founder of Singapore due to his role in signing the 1819 Treaty that allowed the British to establish a trading post in the southern part of Singapore. Raffles' contributions to the early development of Singapore are significant and well-documented.

Perspective 2: William Farquhar
Summary: Others believe that William Farquhar should be recognized as the founder of Singapore because of his efforts in building Singapore from scratch. Farquhar played a crucial role in the early development of the settlement and his contributions are often overlooked in favor of Raffles.

Perspective 3: John Crawfurd
Summary: Some may argue that John Crawfurd should be considered a founder of Singapore as he signed the 1824 Treaty of Friendship and Alliance that gave the British control over the entire island. Crawfurd's role in solidifying British control over Singapore cannot be ignored in discussi

In [9]:
def answer_question_from_vector_store(vector_store, input_question):
    prompt = PromptTemplate.from_template(
        template="""
You are the Heritage Education Research Assistant, an AI-powered tool designed to help educators in Singapore create comprehensive and balanced lesson plans about Singapore's history and culture. Your task is to provide multiple perspectives on historical questions, using ONLY the information provided in the context below.

IMPORTANT RULES:
1. Only use information that is explicitly stated in the provided context
2. Do not make assumptions or add information from external knowledge
3. If the context doesn't provide enough information for a perspective, mention fewer perspectives rather than making up information
4. Always include direct quotes to support each perspective
5. If you cannot find relevant information in the context, state "Insufficient information in the provided sources to answer this question."

Please format your response using the following structure:

[Perspective 1]
Source: (cite the specific page number and document source)
Summary: (2-3 sentences using ONLY information from the cited source)
Direct Quote: (exact quote from the source that supports this perspective)

[Perspective 2]
Source: (cite the specific page number and document source)
Summary: (2-3 sentences using ONLY information from the cited source)
Direct Quote: (exact quote from the source that supports this perspective)

[Additional Perspectives if supported by context...]

[Discussion Questions]
(Only include questions that can be answered using the provided context)
1. (question that encourages critical thinking)
2. (question that encourages critical thinking)
3. (question that encourages critical thinking)

Context: {context}

Question: {question}
        """
    )

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)
    
    retriever = vector_store.as_retriever(search_kwargs={"k": 10})
    retrieved_docs = retriever.invoke(input_question)
    
    formatted_context = format_docs(retrieved_docs)
    
    rag_chain_from_docs = (
        RunnableLambda(lambda x: {"context": x["context"], "question": x["question"]})
        | prompt
        | ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
    )

    result = rag_chain_from_docs.invoke({"context": formatted_context, "question": input_question})
    return {"answer": result.content, "context": retrieved_docs} 

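# Load FAISS index
# allow_dangerous_deserialization unpickles the saved file, so only load an index you created yourself
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)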
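
The prompt asks the model to cite a page number and document source, but `format_docs` joins only `page_content`, so the metadata never reaches the model. One possible fix, sketched here under assumptions, is to prefix each chunk with its metadata before building the context. `RetrievedDoc` is a hypothetical stand-in for the retrieved document objects, `format_docs_with_sources` is an illustrative name, and the default `page_offset` of 1 assumes the stored page numbers are zero-based (believed to be the case for `PyPDFLoader`, but worth checking against the PDFs).

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedDoc:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_sources(docs, page_offset=1):
    """Join chunks into one context string, labelling each with source and page."""
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "Unknown")
        page = doc.metadata.get("page")
        # Assumption: stored page numbers are zero-based, so shift for display
        page_label = page + page_offset if isinstance(page, int) else "Unknown"
        blocks.append(f"[Source: {source}, Page: {page_label}]\n{doc.page_content}")
    return "\n\n".join(blocks)


# Made-up chunks, for illustration only
docs = [
    RetrievedDoc("First chunk.", {"page": 0, "source": "Sec1"}),
    RetrievedDoc("Second chunk.", {"page": 68, "source": "Sec2"}),
]
print(format_docs_with_sources(docs))
```

With labels like these in the context, the "Source:" line of each perspective can draw on the Sec1/Sec2 tag and page number instead of guessing from headings in the page text.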


In [7]:
# Test query
question = "Who is the founder of Singapore?"
response = answer_question_from_vector_store(vectorstore, question)

print(response['answer'])


[Perspective 1]
Source: FROM TEMASEK TO SINGAPORE (1299–EARLY 1800 S), Page 71
Summary: Some argue that Raffles is recognized as the founder of Singapore due to his contributions, despite differing views.
Direct Quote: "Given what you have learnt about the contributions of Raffles, Farquhar and Crawfurd to the development of Singapore, who do you think founded Singapore?"

[Perspective 2]
Source: HOW DID SINGAPORE BECOME A BRITISH TRADING POST?, Page 69
Summary: Crawfurd is not considered a founder of the colonial settlement, with Raffles and Farquhar being credited for laying the foundations.
Direct Quote: "Crawfurd cannot be considered a founder or co-founder of the colonial settlement. It was Raffles and Farquhar who laid the foundations in January–February 1819."

[Discussion Questions]
1. How do differing perspectives on who founded Singapore impact the way history is understood and taught?
2. What criteria should be used to determine who the true founder of a place is in historic

In [8]:
# Test query
question = "What is the key factor attracting traders to Singapore?"
response = answer_question_from_vector_store(vectorstore, question)

print(response['answer'])

[Perspective 1]
Source: Chapter 3, Page 10, Document Source: Unknown
Summary: Traders were attracted to Singapore due to its free port status established by the British in 1819, allowing them to trade freely without paying taxes on goods.
Direct Quote: "A key factor attracting traders to Singapore was its free port status, which the British put in place in 1819. Traders came in ships from different places and could trade freely with one another in Singapore. They were not required to pay taxes on the goods they carried."

[Perspective 2]
Source: Chapter 3, Page 11, Document Source: Unknown
Summary: Traders were also attracted to Singapore due to the availability of better job prospects and the policies of free trade and free immigration established by the British.
Direct Quote: "From the outset, Singapore flourished because of a number of reasons. Among them, the policies of free trade and free immigration established by the British helped Singapore develop considerably."

[Discussion 