# <strong>Getting Started With Vector Databases - AISoC</strong>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1kDIUIWsmJ3QbjVjl0KB7ndNYX9Zc3asf?usp=drive_link)

A RAG pipeline tutorial that demonstrates working with vector databases using LangChain and Pinecone.
## AI Summer of Code 2024
<span><i><b>Facilitator: </b></i>Ayo Kehinde Samuel</span>

***This notebook solution is divided into 3 parts, each constituting a workflow on its own:***

Part I: Load and Process Source Document

Part II: Load the Vector Database and Upsert the document

Part III: Connect a RAG chain


Setup

**Library import**

Import all the required Python libraries.

It is a good practice to organize the imported libraries by functionality, as shown below.

In [1]:
!pip install -q langchain langchain-community \
pypdf2 pypdf pinecone-client python-dotenv \
langchain-groq langchain-openai sentence_transformers \
protoc_gen_openapiv2 langchain-pinecone faiss-cpu

In [3]:
# local libraries for system runtime
import os
import getpass
import time
import numpy as np
from dotenv import load_dotenv
load_dotenv()
# load pdfreader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyPDFDirectoryLoader
# load docreader
from langchain_community.document_loaders import Docx2txtLoader
# load textreader
from langchain_community.document_loaders import TextLoader
# load document splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
# load embedding model
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_openai import OpenAIEmbeddings
# load LLM
from langchain_openai import ChatOpenAI
from langchain_groq import ChatGroq
# load vector database
import pinecone
from langchain_pinecone import PineconeVectorStore
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec, PodSpec
from langchain_community.vectorstores import FAISS
# load chain
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain

In [4]:
os.environ["GROQ_API_KEY"] = getpass.getpass("Enter your Groq API key: ")
# os.environ["PINECONE_API_KEY"] = ""
# pinecone_api_key = ""

In [5]:
os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your Pinecone API key: ")

In [17]:
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
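The key-handling cells above can be folded into one small helper that reads a key from the environment and prompts only when it is missing. This is a minimal sketch: `get_secret`, its `prompt_fn` parameter, and the `DEMO_API_KEY` name are hypothetical and exist only for illustration.

```python
import os
from getpass import getpass


def get_secret(name, prompt_fn=getpass):
    """Return a secret from the environment, prompting only if it is missing."""
    value = os.environ.get(name)
    if not value:
        # Ask interactively, then cache the value so later cells can reuse it.
        value = prompt_fn(f"Enter your {name}: ")
        os.environ[name] = value
    return value


# A stand-in prompt function keeps this example non-interactive.
os.environ.pop("DEMO_API_KEY", None)
demo_key = get_secret("DEMO_API_KEY", prompt_fn=lambda _msg: "demo-value")
print(demo_key == os.environ["DEMO_API_KEY"])
```

In the notebook, the same helper could be called as `get_secret("GROQ_API_KEY")`, which falls back to the hidden `getpass` prompt.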

**Parameter definitions**

In [6]:
# logger colour palette
yellow = "\033[0;33m"
green = "\033[0;32m"
white = "\033[0;39m"

In [7]:
chatbot_name = "AISoC RAG"

### **Part I: Load and Process Source Document**
* Load source documents
* Select embedding model

**1. Load source documents**

LangChain provides several useful methods for document splitting:

*   Recursive Character Text Splitter: This method splits documents by recursively dividing the text based on characters, ensuring each chunk is below a specified length. This is particularly useful for documents with natural paragraph or sentence breaks.
*   Token Splitter: This method splits the document using tokens. It is beneficial when working with language models with token limits, ensuring each chunk fits the model's constraints.
*   Sentence Splitter: This method splits documents at sentence boundaries. It is ideal for maintaining the contextual integrity of the text, as sentences usually represent complete thoughts.
*   Regex Splitter: This method uses regular expressions to define custom split points. It offers the highest flexibility, allowing users to split documents based on patterns specific to their use case.
*   Markdown Splitter: This method is tailored for markdown documents. It splits the text based on markdown-specific elements like headings, lists, and code blocks.
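To make the recursive idea concrete, here is a simplified plain-Python sketch. It is not LangChain's implementation: the function name `recursive_split` and the separator order are assumptions chosen for illustration, and overlap is left out to keep it short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    The coarsest separator is tried first; finer separators are used
    only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs stay whole, and only the long middle paragraph is broken further at word boundaries.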

<b><font color='red'>ATTENTION!!!</font></b><br/>
Documents should be:

* large enough to contain enough information to answer a question
* small enough to fit into the LLM prompt: check the LLM max input tokens limit
* small enough to fit into the embeddings model: check input tokens limit (Note: 1 token ~ 4 characters, ~= ¾ words).
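The rule of thumb in the note can be turned into a quick size check before embedding. This is only a rough estimate, not a real tokenizer; the helper names and the 512-token limit below are assumptions for illustration, so check the actual limit of your model.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the note above, not an exact tokenizer


def estimate_tokens(text):
    """Estimate the token count of a text from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits(text, max_tokens):
    """Return True if the estimated token count is within max_tokens."""
    return estimate_tokens(text) <= max_tokens


chunk = "x" * 500  # a chunk of 500 characters
print(estimate_tokens(chunk))       # 125
print(fits(chunk, max_tokens=512))  # True, under the assumed 512-token limit
```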

<br/>
We split the data into chunks of 500 characters, with an overlap of 20 characters between consecutive chunks. The overlap helps preserve context that would otherwise be cut off at chunk boundaries.
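The effect of overlap is easiest to see on a tiny input. The sketch below uses a plain sliding window with deliberately small numbers. `chunk_with_overlap` is a hypothetical helper, and a real splitter such as `RecursiveCharacterTextSplitter` also respects separators, which this version ignores.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears in part on both sides.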

In [9]:
# load single document as before
loader = PyPDFLoader('decind.pdf')
docs_before_split = loader.load()

Load all PDF documents in a folder

In [11]:
# Load pdf files in the local directory
# loader = PyPDFDirectoryLoader("./")
# docs_before_split = loader.load()

Load multiple types of documents

In [12]:
# # Initialize an empty list to store document contents
# docs_before_split = []

# # Iterate through all files in the 'docs' directory
# for file in os.listdir('docs'):
#     # Check if the file is a PDF
#     if file.endswith('.pdf'):
#         # Construct the full path to the PDF file
#         pdf_path = './docs/' + file
#         # Create a PDF loader
#         loader = PyPDFLoader(pdf_path)
#         # Load the PDF and extend the documents list with its contents
#         docs_before_split.extend(loader.load())
#     # Check if the file is a Word document
#     elif file.endswith('.docx') or file.endswith('.doc'):
#         # Construct the full path to the Word document
#         doc_path = './docs/' + file
#         # Create a Word document loader
#         loader = Docx2txtLoader(doc_path)
#         # Load the Word document and extend the documents list with its contents
#         docs_before_split.extend(loader.load())
#     # Check if the file is a text file
#     elif file.endswith('.txt'):
#         # Construct the full path to the text file
#         text_path = './docs/' + file
#         # Create a text file loader
#         loader = TextLoader(text_path)
#         # Load the text file and extend the documents list with its contents
#         docs_before_split.extend(loader.load())
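The branching in the commented cell above can also be written as a lookup table keyed by file extension. The sketch below maps extensions to loader names as strings so that it runs without LangChain installed; `LOADER_BY_SUFFIX` and `pick_loader` are hypothetical names, and in the notebook the values would be the loader classes themselves.

```python
from pathlib import Path

# Maps a file suffix to the name of the loader used in the cell above.
LOADER_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".doc": "Docx2txtLoader",
    ".txt": "TextLoader",
}


def pick_loader(path):
    """Return the loader name for a file, or None if the type is unsupported."""
    return LOADER_BY_SUFFIX.get(Path(path).suffix.lower())


for name in ["docs/decind.pdf", "docs/Notes.TXT", "docs/readme.md"]:
    print(name, "->", pick_loader(name))
```

Unsupported files return `None`, so the calling loop can skip them explicitly.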

In [13]:
# set the splitter parameters and instantiate it
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
# split the document into chunks
documents = text_splitter.split_documents(docs_before_split)
documents[0]

Document(metadata={'source': 'decind.pdf', 'page': 0}, page_content='Page XLV 1The delegates of the United Colonies of New Hampshire; Mas -\nsachusetts Bay; Rhode Island and Providence Plantations; Con -\nnecticut; New York; New Jersey; Pennsylvania; New Castle,  \nKent, and Sussex, in Delaware; Maryland; Virginia; North Caro -\nlina, and South Carolina, In Congress assembled at Philadelphia,  \nResolved on the 10th of May, 1776, to recommend to the respec -\ntive assemblies and conventions of the United Colonies, where no')

**2. Select embedding model**

Selecting an embedding model is one of the most important decisions in a RAG pipeline, because retrieval quality depends directly on how well the embeddings capture the meaning of your text.<br/>
Most engineers choose between OpenAI embedding models (if you are using an OpenAI LLM) and the embedding models available on Hugging Face.<br/>To find a well-performing embedding model on Hugging Face, start by exploring the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard. It gives a high-level view of how various models perform on sets of standardized benchmark tasks.

These benchmarks cover a range of tasks and datasets. Some involve sentiment analysis of sentences extracted from comment threads on discussion forums. Others compare question-and-answer pairs to identify the most correct option from a set of possible answers. For a RAG pipeline, focus on the benchmarks that center on retrieval use cases.

In [14]:
huggingface_embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",  # alternatively use "sentence-transformers/all-MiniLM-L6-v2" for a lighter and faster experience.
    model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)



In [15]:
sample_embedding = np.array(huggingface_embeddings.embed_query(documents[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

Sample embedding of a document chunk:  [ 5.39305713e-03  2.67981389e-03  2.18312442e-02 -5.74157201e-02
 -1.63130136e-03  5.66432327e-02  6.63541863e-03  4.00397228e-03
 -6.96664676e-02 -3.55090238e-02  2.63224076e-03 -2.82308576e-03
 -9.42327024e-04  6.83275685e-02 -3.90850827e-02  2.94142347e-02
 -7.69267082e-02  7.46982321e-02  1.74978692e-02  3.40323113e-02
  6.60954565e-02 -4.76830602e-02 -5.37511967e-02  4.06468175e-02
  3.65503766e-02  2.08294764e-02  2.53341757e-02 -7.39620859e-03
 -7.20104948e-02 -1.39072046e-01  1.66850686e-02  1.31300162e-03
 -4.39431816e-02  4.45690081e-02  2.85105244e-03  3.93263809e-03
  7.06762150e-02  4.50603403e-02  1.72234830e-02 -5.94627811e-03
 -2.08901260e-02  1.62895781e-03  1.34413643e-02  6.54319748e-02
  1.23374490e-02 -1.59539841e-02 -3.28045227e-02  2.07096450e-02
 -6.00214722e-03 -2.76389695e-03  6.16816618e-02 -1.67882647e-02
  8.40334967e-03 -4.35553566e-02  3.87073383e-02  4.56023812e-02
 -1.46400556e-03  2.35198662e-02  5.30543774e-02  5
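The embedding cell sets `normalize_embeddings` to `True`, and the index created in Part II uses the `dotproduct` metric. For unit-length vectors the dot product equals cosine similarity, which the plain-Python sketch below checks on two small example vectors. The helper names are illustrative only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
same = math.isclose(dot(normalize(a), normalize(b)), cosine(a, b))
print(same)
```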

### **Part II: Load the Vector Database and Upsert the document**

The text_field parameter sets the name of the metadata field that stores the raw text when you upsert records using a LangChain operation such as vectorstore.from_documents or vectorstore.add_texts. This metadata field is used as the page_content in the Document objects retrieved from query-like LangChain operations such as vectorstore.similarity_search. If you do not specify a value for text_field, it will default to "text".
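A rough picture of this mapping, as a plain-Python sketch: the record layout and the `record_to_document` helper are assumptions for illustration, not the langchain-pinecone implementation.

```python
def record_to_document(record, text_field="text"):
    """Turn a stored record into a document-like dict.

    The raw text is taken out of the metadata field named by text_field
    and becomes the page content; the rest stays as metadata.
    """
    metadata = dict(record["metadata"])  # copy so the record is not modified
    page_content = metadata.pop(text_field)
    return {"page_content": page_content, "metadata": metadata}


record = {
    "id": "chunk-0",
    "values": [0.1, 0.2],
    "metadata": {"text": "example chunk text", "source": "decind.pdf", "page": 0},
}
doc = record_to_document(record)
print(doc)
```

This matches what the search results later in the notebook show: the metadata of each returned `Document` holds `source` and `page`, while the text appears as `page_content`.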

In [18]:
# configure client
use_serverless = True
pc = Pinecone(api_key=PINECONE_API_KEY)
if use_serverless:
    spec = ServerlessSpec(cloud='aws', region='us-east-1')
else:
    # if not using a starter index, you should specify a pod_type too
    spec = PodSpec()
# check for and delete index if already exists
index_name = 'aisoc-rag'
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)
# create a new index
pc.create_index(
    index_name,
    dimension=384,  # confirm embedding model dimensionality
    metric='dotproduct',
    spec=spec
)
# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
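The wait loop above polls until the index reports that it is ready. If the index never becomes ready, that loop runs forever, so a timeout can be useful. The sketch below shows one way to add it; `wait_until` and its parameters are hypothetical, and a stand-in readiness check is used so that the example runs without a Pinecone client.

```python
import time


def wait_until(is_ready, timeout_s=60.0, poll_s=1.0):
    """Poll is_ready() until it returns True or timeout_s seconds pass."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s} seconds")
        time.sleep(poll_s)


# Stand-in check that reports ready on the third call.
calls = []


def fake_ready():
    calls.append(1)
    return len(calls) >= 3


wait_until(fake_ready, timeout_s=5.0, poll_s=0.0)
print(len(calls))

# In the notebook, the check could be:
# wait_until(lambda: pc.describe_index(index_name).status['ready'])
```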

In [19]:
# verify database is created
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 0}},
 'total_vector_count': 0}

In [21]:
# now upsert the document chunks to the index
vectordb = PineconeVectorStore.from_documents(
        documents,
        index_name=index_name,
        embedding=huggingface_embeddings
    )

In [23]:
# verify document was upserted
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 34}},
 'total_vector_count': 34}

In [24]:
query = "What did the invasion cause?"
vectordb.similarity_search(query, k=3)  # return the 3 most relevant chunks

[Document(metadata={'page': 1.0, 'source': 'decind.pdf'}, page_content='tion, have returned to the People at large for  \ntheir exercise; the State remaining in the mean  \ntime exposed to all the dangers of invasion from  \nwithout, and convulsions within.  \nHe has endeavoured to prevent the population  \nof these States; for that purpose obstructing the  \nLaws for Naturalization of Foreigners; refusing  \nto pass others to encourage their migrations  \nhither, and raising the conditions of new Appro -\npriations of Lands.  \nHe has obstructed the Administration of Jus -'),
 Document(metadata={'page': 0.0, 'source': 'decind.pdf'}, page_content='and to provide new Guards for their future secu -\nrity.—Such has been the patient sufferance of  \nthese Colonies; and such is now the necessity  \nwhich constrains them to alter their former  \nSystems of Government. The history of the  \npresent King of Great Britain is a history of re -\npeated injuries and usurpations, all having in di -
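Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors score highest against it. The brute-force sketch below shows what that result means on four tiny vectors. Pinecone uses its own index structures rather than a full scan, and `top_k` and the sample data are assumptions for illustration.

```python
import heapq


def top_k(query_vec, vectors, k=3):
    """Return the k (score, id) pairs with the highest dot-product score."""
    scored = (
        (sum(q * x for q, x in zip(query_vec, vec)), doc_id)
        for doc_id, vec in vectors.items()
    )
    return heapq.nlargest(k, scored)


vectors = {
    "a": [1.0, 0.0],
    "b": [0.6, 0.8],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
hits = top_k([1.0, 0.0], vectors, k=3)
for score, doc_id in hits:
    print(doc_id, score)
```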

In [None]:
# vectordb = FAISS.from_documents(documents, huggingface_embeddings)

In [None]:
# query = "What did the invasion cause?" # Sample question, change to other questions you are interested in.
# relevant_documents = vectordb.similarity_search(query)
# print(f'There are {len(relevant_documents)} documents retrieved which are relevant to the query. Display the first one:\n')
# print(relevant_documents[0].page_content)

### **Part III: Connect a RAG chain**

* instantiate the LLM
* define a prompt template
* start the RAG chain



In [25]:
llm = ChatGroq(
    api_key=GROQ_API_KEY,
    model="mixtral-8x7b-32768",
    temperature=0.2,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

In [None]:
# llm = ChatOpenAI()  # alternative: use an OpenAI chat model (requires OPENAI_API_KEY)

In [27]:
prompt_template = """Use the following pieces of context to answer the question at the end. Please follow the following rules:
1. If you don't know the answer, don't try to make up an answer. Just say "I can't find the final answer but you may want to check the following links".
2. If you find the answer, write the answer in a concise way with five sentences maximum.

{context}

Question: {question}

Helpful Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [28]:
retrievalQA = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
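The `"stuff"` chain type places all retrieved chunks into the `{context}` slot of the prompt in a single LLM call. The sketch below mimics that step with plain string formatting. The shortened template, the `stuff_prompt` helper, and the blank-line separator are assumptions for illustration, not LangChain internals.

```python
TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\n\nHelpful Answer:"


def stuff_prompt(chunks, question, template=TEMPLATE, separator="\n\n"):
    """Join the retrieved chunks and fill them into the prompt template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


prompt = stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What did the invasion cause?",
)
print(prompt)
```

Because every chunk lands in one prompt, `k` times the chunk size has to stay within the LLM's input limit.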

In [29]:
# Call the QA chain with our query.
result = retrievalQA.invoke({"query": query})
print(result['result'])

The passage is an excerpt from the United States Declaration of Independence, specifically the section that lists grievances against King George III. One of the grievances mentioned is the King's obstruction of laws for naturalization and encouraging migration, as well as raising conditions for new land appropriations. This has resulted in the prevention of population growth in the states, exposing them to dangers of invasion from outside and convulsions from within. The invasion referred to here is likely military or political invasion, causing harm and instability. However, the specific details of the invasion are not provided in the text.
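Because the chain was built with `return_source_documents=True`, the result also carries the retrieved chunks under the `source_documents` key. The sketch below lists where an answer came from. `format_sources` is a hypothetical helper, and simple stand-in objects replace LangChain `Document` instances so that the example runs on its own.

```python
from types import SimpleNamespace


def format_sources(source_documents):
    """List unique (source, page) pairs in retrieval order."""
    seen, lines = set(), []
    for doc in source_documents:
        key = (doc.metadata.get("source"), int(doc.metadata.get("page", 0)))
        if key not in seen:
            seen.add(key)
            lines.append(f"{key[0]}, page {key[1]}")
    return lines


# Stand-ins with the same metadata layout as the search results above.
docs = [
    SimpleNamespace(metadata={"source": "decind.pdf", "page": 1.0}),
    SimpleNamespace(metadata={"source": "decind.pdf", "page": 0.0}),
    SimpleNamespace(metadata={"source": "decind.pdf", "page": 1.0}),
]
sources = format_sources(docs)
print(sources)
```

In the notebook, the call would be `format_sources(result["source_documents"])`.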


In [30]:
print(f"{yellow}---------------------------------------------------------------------------------")
print(f'Welcome to the {chatbot_name}. You are now ready to start interacting with your documents')
print(f'---------------------------------------------------------------------------------{white}')
while True:
    user_input = input("type here: ")
    # typing "exit" or "quit" ends the chat loop without a KeyboardInterrupt
    if user_input.strip().lower() in {"exit", "quit"}:
        break
    result = retrievalQA.invoke({"query": user_input})
    print(chatbot_name, ": ", result['result'])

---------------------------------------------------------------------------------
Welcome to the AISoC RAG. You are now ready to start interacting with your documents
---------------------------------------------------------------------------------
AISoC RAG :  The Declaration of Independence, adopted by the United States Continental Congress on July 4, 1776, is a formal statement announcing the separation of the thirteen American colonies from British rule. It explains the reasons for this decision, citing numerous grievances against King George III, including his dissolution of elected representative bodies, refusal to allow new elections, and imposition of laws without the consent of the colonies. The document is a significant historical record outlining the principles of self-determination and individual rights, which have greatly influenced the development of democratic governments around the world.
AISoC RAG :  The text is a comparison between the original Declaration of I


KeyboardInterrupt



In [None]:
print(f"{yellow}---------------------------------------------------------------------------------")
print(f'Welcome to the {chatbot_name}. You are now ready to start interacting with your documents')
print(f'---------------------------------------------------------------------------------{white}')
while True:
    user_input = input("type here: ")
    # typing "exit" or "quit" ends the chat loop without a KeyboardInterrupt
    if user_input.strip().lower() in {"exit", "quit"}:
        break
    result = retrievalQA.invoke({"query": user_input})
    print(chatbot_name, ": ", result['result'])

---------------------------------------------------------------------------------
Welcome to the AISoC RAG. You are now ready to start interacting with your documents
---------------------------------------------------------------------------------
type here: Action of Second Continental Congress was dated what year
AISoC RAG :  The Action of the Second Continental Congress, which resulted in the Declaration of Independence, was dated July 4, 1776. This document marked the official break of political ties between the American colonies and Great Britain, and it set forth the principles of a fair and just government. The significance of this document is further highlighted by the fact that it is one of the two most important and enduring documents in the history of the United States, the other being the Constitution. These founding documents have withstood the test of time and have been the guiding principles of the American government since their inception.


KeyboardInterrupt: Interrupted by user

In [None]:
# cleanup: delete the index once you are done
pc.delete_index(index_name)