# <strong>Getting Started With Vector Databases - AISoC</strong>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1kDIUIWsmJ3QbjVjl0KB7ndNYX9Zc3asf?usp=drive_link)

A RAG pipeline tutorial to demonstrate working with vector databases using LangChain and Pinecone
## AI Summer of Code 2024
<span><i><b>Facilitator: </b></i>Ayo Kehinde Samuel</span>

***This notebook solution is divided into 3 sections, each constituting a workflow on its own:***

Part I: Load and Process Source Document

Part II: Load the Vector Database and Upsert the document

Part III: Connect a RAG chain


Setup

**Library import**

Import all the required Python libraries.

It is a good practice to organize the imported libraries by functionality, as shown below.

In [None]:
!pip install -q langchain langchain-community \
pypdf2 pypdf pinecone-client python-dotenv \
langchain-groq langchain-openai sentence_transformers \
protoc_gen_openapiv2 langchain-pinecone faiss-cpu

In [None]:
# local libraries for system runtime
import os,getpass,time
import numpy as np
from dotenv import load_dotenv
load_dotenv()
# load pdfreader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyPDFDirectoryLoader
# load docreader
from langchain_community.document_loaders import Docx2txtLoader
# load textreader
from langchain_community.document_loaders import TextLoader
# load document splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
# load embedding model
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_openai import OpenAIEmbeddings
# load LLM
from langchain_openai import ChatOpenAI
from langchain_groq import ChatGroq
# load vector database
import pinecone
from langchain_pinecone import PineconeVectorStore
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec, PodSpec
from langchain_community.vectorstores import FAISS
# load chain
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain

In [None]:
os.environ["GROQ_API_KEY"] = getpass.getpass("Enter your Groq API key: ")
# paste your Pinecone API key below (left blank here on purpose)
pinecone_api_key = ""
os.environ["PINECONE_API_KEY"] = pinecone_api_key

Enter your Groq API key: ··········


**Parameter definitions**

In [None]:
# logger colour palette
yellow = "\033[0;33m"
green = "\033[0;32m"
white = "\033[0;39m"

In [None]:
chatbot_name = "AISoC RAG"

### **Part I: Load and Process Source Document**
* Load source documents
* Select embedding model

**1. Load source documents**

LangChain provides several useful strategies for document splitting:

*   Recursive Character Text Splitter: This method splits documents by recursively dividing the text based on characters, ensuring each chunk is below a specified length. This is particularly useful for documents with natural paragraph or sentence breaks.
*   Token Splitter: This method splits the document using tokens. It is beneficial when working with language models with token limits, ensuring each chunk fits the model's constraints.
*   Sentence Splitter: This method splits documents at sentence boundaries. It is ideal for maintaining the contextual integrity of the text, as sentences usually represent complete thoughts.
*   Regex Splitter: This method uses regular expressions to define custom split points. It offers the highest flexibility, allowing users to split documents based on patterns specific to their use case.
*   Markdown Splitter: This method is tailored for markdown documents. It splits the text based on markdown-specific elements like headings, lists, and code blocks.
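
The recursive strategy from the list above can be illustrated with a minimal pure-Python sketch. This is not LangChain's implementation: the function name `recursive_split` and the separator order are assumptions chosen for illustration, and the sketch ignores chunk overlap. It tries the coarsest separator first and only falls back to finer ones for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (illustrative only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate            # piece still fits: keep growing the chunk
            continue
        if current:
            chunks.append(current)         # flush the chunk built so far
        if len(piece) <= chunk_size:
            current = piece                # start a new chunk with this piece
        else:
            # piece alone is too long: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph that is quite a bit longer than the first."
chunks = recursive_split(sample, chunk_size=30)
for c in chunks:
    print(repr(c))
# 'First paragraph.'
# 'Second paragraph that is quite'
# 'a bit longer than the first.'
```

The short first paragraph survives intact, while the long one is broken at word boundaries rather than mid-word.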

<b><font color='red'>ATTENTION!!!</font></b><br/>
Documents should be:

* large enough to contain enough information to answer a question
* small enough to fit into the LLM prompt: check the LLM max input tokens limit
* small enough to fit into the embedding model: check its input token limit (note: 1 token ≈ 4 characters ≈ ¾ of a word).
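
The rule of thumb in the note above (roughly 4 characters per token) can be turned into a quick pre-flight check. This is only a heuristic sketch: `estimate_tokens`, `fits_model` and the 512-token limit in the example are illustrative assumptions, and exact counts require the model's own tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the note above, not an exact figure


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of `text` from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_model(text: str, max_input_tokens: int) -> bool:
    """Return True if the estimated token count is within the given limit."""
    return estimate_tokens(text) <= max_input_tokens


chunk = "x" * 500                # stand-in for a 500-character chunk
print(estimate_tokens(chunk))    # 125
print(fits_model(chunk, 512))    # True (512 is an assumed limit for this example)
```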

<br/>
We split the data into chunks of at most 500 characters, with an overlap of 20 characters between consecutive chunks. The overlap helps preserve context that would otherwise be cut at chunk boundaries, which tends to give better retrieval results.
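
To see what overlap means in practice, here is a minimal fixed-window sketch. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers natural separators); `chunk_with_overlap` is a hypothetical helper and the tiny sizes are chosen only to keep the output readable.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # stop once the previous window has already reached the end of the text
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.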

In [None]:
# load a single PDF document
loader = PyPDFLoader('./declaration_of_independence.pdf')
docs_before_split = loader.load()

Load all PDF documents in a folder

In [None]:
# Load pdf files in the local directory
# loader = PyPDFDirectoryLoader("./")
# docs_before_split = loader.load()

Load multiple types of documents

In [None]:
# # Initialize an empty list to store document contents
# docs_before_split = []

# # Iterate through all files in the 'docs' directory
# for file in os.listdir('docs'):
#     # Check if the file is a PDF
#     if file.endswith('.pdf'):
#         # Construct the full path to the PDF file
#         pdf_path = './docs/' + file
#         # Create a PDF loader
#         loader = PyPDFLoader(pdf_path)
#         # Load the PDF and extend the documents list with its contents
#         docs_before_split.extend(loader.load())
#     # Check if the file is a Word document
#     elif file.endswith('.docx') or file.endswith('.doc'):
#         # Construct the full path to the Word document
#         doc_path = './docs/' + file
#         # Create a Word document loader
#         loader = Docx2txtLoader(doc_path)
#         # Load the Word document and extend the documents list with its contents
#         docs_before_split.extend(loader.load())
#     # Check if the file is a text file
#     elif file.endswith('.txt'):
#         # Construct the full path to the text file
#         text_path = './docs/' + file
#         # Create a text file loader
#         loader = TextLoader(text_path)
#         # Load the text file and extend the documents list with its contents
#         docs_before_split.extend(loader.load())

In [None]:
# set the splitter parameters and instantiate it
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
# split the document into chunks
documents = text_splitter.split_documents(docs_before_split)
documents[0]

Document(metadata={'source': './declaration_of_independence.pdf', 'page': 0}, page_content='MESSAGE FROM  THE DIRECTOR  \nThe Declaration of Independence and the Constitution of the United States are the two most important, \nand enduring documents in our Nation’s history. It has been sai d that “the Declaration of Independence \nwas the promise; the Constitution was the fulfillment.” More than 200 years ago, our Founding Fathers \nset out to establish a government based on individual rights and the rule of law. The Declaration of')

**2. Select embedding model**

Selecting an embedding model is one of the most crucial parts of your RAG pipeline and can make or mar the efficiency of your vector database.<br/>
Most engineers choose between OpenAI embedding models (if you are using an OpenAI LLM) and the embedding models available on Hugging Face.<br/>To find a well-performing embedding model on Hugging Face, start by exploring the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard. This is a great resource for seeing, at a high level, how various models perform on specific sets of standardized benchmark tasks.

These benchmarks cover a range of tasks and datasets. Some involve sentiment analysis of sentences extracted from comment threads on discussion forums. Others compare question-and-answer pairs, trying to identify the most correct option from a set of possible answers. For a RAG pipeline, however, the focus should be on benchmarks that center on retrieval use cases.

In [None]:
huggingface_embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",  # alternatively use "sentence-transformers/all-MiniLM-l6-v2" for a light and faster experience.
    model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)


In [None]:
sample_embedding = np.array(huggingface_embeddings.embed_query(documents[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

Sample embedding of a document chunk:  [ 2.07286458e-02  1.48984753e-02  2.97964513e-02 -3.18897404e-02
  9.28039663e-03  4.58592474e-02  2.21677925e-02  7.71791302e-03
 -2.80936658e-02 -5.89477224e-03 -1.52432118e-02  4.40493934e-02
 -1.84088033e-02 -2.64674723e-02 -1.35952933e-02  3.80787626e-02
 -5.24266846e-02  2.85085887e-02 -1.00141294e-01  5.41611798e-02
  9.54024643e-02  5.01484983e-03 -3.77718806e-02  7.09273759e-03
  8.54749903e-02  2.12258883e-02 -1.85587835e-02 -4.71161194e-02
  2.45035570e-02 -1.15124293e-01 -1.08148837e-02 -2.69518197e-02
 -2.83123795e-02 -4.17781342e-03 -4.98041185e-03 -1.13186501e-02
  5.01741581e-02  3.41664143e-02  2.32520401e-02  7.99186435e-03
 -3.54161672e-02  2.52249539e-02  2.64195283e-03 -2.85282470e-02
 -1.91519484e-02  1.34395445e-02  1.35938255e-02 -2.64761411e-02
  2.40871054e-03  2.76232744e-03  1.50006935e-02  4.46873680e-02
  2.71777268e-02  1.78575944e-02 -2.32706945e-02  6.52418435e-02
 -3.74428183e-02  1.36328898e-02  1.38329966e-02 -7
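
Because the embedding model above is configured with `normalize_embeddings` set to `True`, every vector it returns has unit length, and for unit-length vectors the dot product equals the cosine similarity. The small check below uses made-up 3-dimensional vectors (not real model output) and hypothetical helper names to illustrate the identity.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two (not necessarily normalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [1.0, 2.0, 2.0], [2.0, 0.0, 1.0]   # made-up vectors
a_n, b_n = normalize(a), normalize(b)
print(math.isclose(dot(a_n, b_n), cosine(a, b)))  # True
```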

### **Part II: Load the Vector Database and Upsert the document**

The `text_field` parameter sets the name of the metadata field that stores the raw text when you upsert records using a LangChain operation such as `vectorstore.from_documents` or `vectorstore.add_texts`. This metadata field is used as the `page_content` of the `Document` objects retrieved from query-like LangChain operations such as `vectorstore.similarity_search`. If you do not specify a value for `text_field`, it defaults to `"text"`; this notebook does not set it, so the default applies.
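
The role of the text field can be pictured with a small pure-Python sketch. It mimics the behaviour described above and does not call LangChain or Pinecone; `record_to_document` and the sample record are hypothetical, and removing the text from the returned metadata is an assumption of this sketch.

```python
def record_to_document(record, text_field="text"):
    """Turn a stored record into a document-like dict (illustrative only).

    The raw text lives in the record's metadata under `text_field`; on
    retrieval it is lifted out to become the page content.
    """
    metadata = dict(record["metadata"])        # copy so the record is untouched
    page_content = metadata.pop(text_field)    # raises KeyError if the field is missing
    return {"page_content": page_content, "metadata": metadata}


record = {
    "id": "chunk-0",
    "values": [0.1, 0.2, 0.3],                 # made-up embedding
    "metadata": {"text": "We hold these truths to be self-evident", "page": 0},
}
doc = record_to_document(record)
print(doc["page_content"])  # We hold these truths to be self-evident
print(doc["metadata"])      # {'page': 0}
```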

In [None]:
# configure client
use_serverless = True
pc = Pinecone(api_key=pinecone_api_key)
if use_serverless:
    spec = ServerlessSpec(cloud='aws', region='us-east-1')
else:
    # if not using a starter index, you should specify a pod_type too
    spec = PodSpec()
# check for and delete index if already exists
index_name = 'aisoc-rag'
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)
# create a new index
pc.create_index(
    index_name,
    dimension=384,  # confirm embedding model dimensionality
    metric='dotproduct',
    spec=spec
)
# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
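
The polling loop above waits indefinitely if the index never becomes ready. One possible variant adds a timeout. The helper below is a generic sketch (`wait_until` is a hypothetical name, not part of the Pinecone client); with the client above it could be called as `wait_until(lambda: pc.describe_index(index_name).status['ready'])`.

```python
import time


def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Poll `is_ready()` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout} seconds")
        time.sleep(interval)


# stand-in readiness check that succeeds on the third call
calls = {"n": 0}


def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3


wait_until(fake_ready, timeout=5, interval=0.01)
print(calls["n"])  # 3
```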


In [None]:
# verify database is created
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 0}},
 'total_vector_count': 0}

In [None]:
# now upsert the documents to the database
vectordb = PineconeVectorStore.from_documents(
        documents,
        index_name=index_name,
        embedding=huggingface_embeddings
    )

In [None]:
# verify document was upserted
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 15}},
 'total_vector_count': 15}

In [None]:
query = "What did the invasion cause?"
vectordb.similarity_search(query, k=3)  # return the 3 most relevant chunks

[Document(metadata={'page': 0.0, 'source': './declaration_of_independence.pdf'}, page_content='Independence,  which officially broke all political ties between the American colonies and Great Britain, \nset forth the ideas and principles behind a just and fair government, and the Constitution outlined how \nthis government would function. Our founding documents have withstood the test of time, rising to the \nchallenge each time they were called upon. Make no mistake, we have been presented with a timeless'),
 Document(metadata={'page': 0.0, 'source': './declaration_of_independence.pdf'}, page_content='Necessity which constrains them to alter their former Systems o f Government. The History of the present \nKing of Great -Britain is a History of repeated Injuries and Usurpations, all having in direct Object the \nEstablishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid \nWorl d. He has refused his Assent to Laws, the most wholesome and n

In [None]:
# vectordb = FAISS.from_documents(documents, huggingface_embeddings)

In [None]:
# query = "What did the invasion cause?" # Sample question, change to other questions you are interested in.
# relevant_documents = vectordb.similarity_search(query)
# print(f'There are {len(relevant_documents)} documents retrieved which are relevant to the query. Display the first one:\n')
# print(relevant_documents[0].page_content)

### **Part III: Connect a RAG chain**

* instantiate the LLM
* define a prompt template
* start the RAG chain



In [None]:
llm = ChatGroq(
    model="mixtral-8x7b-32768",
    temperature=0.2,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

In [None]:
# alternative: use an OpenAI chat model (requires OPENAI_API_KEY)
# llm = ChatOpenAI()

In [None]:
prompt_template = """Use the following pieces of context to answer the question at the end. Please follow the following rules:
1. If you don't know the answer, don't try to make up an answer. Just say "I can't find the final answer but you may want to check the following links".
2. If you find the answer, write the answer in a concise way with five sentences maximum.

{context}

Question: {question}

Helpful Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [None]:
retrievalQA = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
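
The `"stuff"` chain type places all retrieved chunks into the `{context}` slot of a single prompt. A rough pure-Python picture of that step follows. It is an illustration under assumptions, not LangChain's code: `build_stuff_prompt`, the shortened template and the sample chunks are made up, and joining chunks with a blank line is an assumed separator.

```python
TEMPLATE = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Helpful Answer:
"""


def build_stuff_prompt(chunks, question, template=TEMPLATE):
    """Concatenate all retrieved chunks and fill them into one prompt."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


# made-up stand-ins for the k=3 retrieved chunks
chunks = ["Chunk one text.", "Chunk two text.", "Chunk three text."]
prompt = build_stuff_prompt(chunks, "What did the invasion cause?")
print(prompt)
```

Because every chunk ends up in one prompt, the combined length of the `k` chunks plus the question has to fit within the LLM's input limit.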

In [None]:
# Call the QA chain with our query.
result = retrievalQA.invoke({"query": query})
print(result['result'])

The "invasion" referred to in the text is the British monarch's repeated interference in the affairs of the American colonies, which the text claims is an attempt to establish absolute tyranny. This "invasion" caused the dissolution of representative houses, as the colonial legislatures were disbanded when they opposed the British monarch's actions. The specific grievances mentioned include the king's refusal to approve beneficial laws and his manipulation of legislative bodies to force compliance with his measures.


In [None]:
print(f"{yellow}---------------------------------------------------------------------------------")
print(f'Welcome to the {chatbot_name}. You are now ready to start interacting with your documents')
print('---------------------------------------------------------------------------------')
while True:
  user_input = input("type here: ")
  result = retrievalQA.invoke({"query": user_input})
  print(chatbot_name,": ",result['result'])

---------------------------------------------------------------------------------
Welcome to the AISoC RAG. You are now ready to start interacting with your documents
---------------------------------------------------------------------------------
type here: Action of Second Continental Congress was dated what year
AISoC RAG :  The Action of the Second Continental Congress, which resulted in the Declaration of Independence, was dated July 4, 1776. This document marked the official break of political ties between the American colonies and Great Britain, and it set forth the principles of a fair and just government. The significance of this document is further highlighted by the fact that it is one of the two most important and enduring documents in the history of the United States, the other being the Constitution. These founding documents have withstood the test of time and have been the guiding principles of the American government since their inception.


KeyboardInterrupt: Interrupted by user
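
The loop above only ends when the cell is interrupted, which is what produces the `KeyboardInterrupt` shown. A variant with an explicit exit word is sketched below. `chat_loop`, the exit words and the `answer_fn` callback are assumptions for illustration; in this notebook `answer_fn` could be `lambda q: retrievalQA.invoke({"query": q})["result"]`.

```python
def chat_loop(answer_fn, read_input=input, name="AISoC RAG", exit_words=("exit", "quit")):
    """Answer questions until the user types an exit word; return the number of turns."""
    turns = 0
    while True:
        user_input = read_input("type here: ").strip()
        if user_input.lower() in exit_words:
            print("Goodbye!")
            return turns
        if not user_input:
            continue                      # ignore empty lines
        print(f"{name}: {answer_fn(user_input)}")
        turns += 1


# scripted demo: canned inputs and an echoing stand-in for the RAG chain
scripted = iter(["What year?", "", "quit"])
turns = chat_loop(lambda q: f"(answer to: {q})", read_input=lambda _prompt: next(scripted))
print(turns)  # 1
```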

In [None]:
#cleanup
pc.delete_index(index_name)