### Author : Rahul Bhoyar

In this tutorial, we walk through the process of transferring our knowledge base data into a vector database.

This involves converting textual data into embeddings, which are numerical representations of the underlying information.

The resulting embeddings are then stored in a vector database, so that relevant passages can be retrieved efficiently by similarity search.
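To make the idea of an embedding concrete, here is a minimal sketch that compares texts by the cosine similarity of their vectors. The three-dimensional vectors are hand-picked for illustration and are not produced by any model (real embeddings have many more dimensions); the helper `cosine_similarity` is also an assumption of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (hand-picked, not generated by a model).
toy_embeddings = {
    "hospital admissions": [0.9, 0.1, 0.0],
    "patient health records": [0.8, 0.2, 0.1],
    "election polls": [0.0, 0.2, 0.9],
}

query_vector = toy_embeddings["hospital admissions"]
for text, vector in toy_embeddings.items():
    print(f"{text}: {cosine_similarity(query_vector, vector):.3f}")
```

Texts with related meaning should end up with vectors that point in similar directions, which is what makes similarity search over a vector database possible.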

### Step 1: Installing the necessary libraries

In [1]:
!pip install langchain langchain_openai langchain_community faiss-gpu unstructured unstructured[pdf]

Collecting langchain
  Downloading langchain-0.1.8-py3-none-any.whl (816 kB)
Collecting langchain_openai
  Downloading langchain_openai-0.0.6-py3-none-any.whl (29 kB)
Collecting langchain_community
  Downloading langchain_community-0.0.21-py3-none-any.whl (1.7 MB)
Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Collecting unstructured
  Downloading unstructured-0.12.4-py3-none-any.whl (1.8 MB)
Collecting dataclasses-json<0.7,>

### Step 2: Loading the documents

Set the path to the directory where the "knowledge-base" is stored. This should be the directory that contains our documents (in this tutorial, PDF files).
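Before loading, it can help to confirm that the directory exists and actually contains PDF files. The sketch below uses only the standard library; the helper name `list_pdf_files` is an assumption of this example and is not part of LangChain.

```python
from pathlib import Path


def list_pdf_files(directory):
    """Return the PDF files found under `directory`, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Knowledge base directory not found: {root}")
    # Search recursively and match the extension case-insensitively.
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Example call, using the path from this tutorial:
# pdf_files = list_pdf_files("/content/knowledge_path_dir")
# print(f"Found {len(pdf_files)} PDF file(s)")
```

An empty result here would explain an empty `documents` list later on, so this check is a cheap way to catch a wrong path early.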

In [2]:
KNOWLEDGE_BASE_PATH = "/content/knowledge_path_dir"

In [3]:
from langchain_community.document_loaders import DirectoryLoader

In [4]:
loader = DirectoryLoader(KNOWLEDGE_BASE_PATH)

Next, we create the "documents" object, which holds all the data loaded from the directory.

Loading may take some time.

In [5]:
import time

print("Process of loading the documents started.......")
print("-"*200)
# Record start time
start_time = time.time()

# Execute the loader.load() process
documents = loader.load()

# Record end time
end_time = time.time()

# Calculate the elapsed time in seconds
elapsed_time_seconds = end_time - start_time

# Convert elapsed time to minutes
elapsed_time_minutes = elapsed_time_seconds / 60

print("-"*200)
print(f"The process took {elapsed_time_minutes} minutes.")

print("-"*200)
print("The document object created successfully.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


The process took 0.8644429167111715 minutes.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The document object created successfully.
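The same start/stop timing pattern is needed again when building the index, so it can be wrapped in a small reusable helper. This is a minimal sketch using `time.perf_counter`; the name `timed` and the stand-in workload are assumptions for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, in minutes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_minutes = (time.perf_counter() - start) / 60
        print(f"{label} took {elapsed_minutes:.2f} minutes.")


# Stand-in workload; in the notebook this would be `documents = loader.load()`.
with timed("Stand-in workload"):
    total = sum(range(1_000_000))
```

Because the elapsed time is printed in a `finally` block, it is reported even if the enclosed code raises an exception.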


Check the number of loaded documents.

In [6]:
len(documents)

1

In [7]:
type(documents)

list

In [8]:
documents[0]

Document(page_content='dataset-Nr.: 1 dataset_name: 4quant/eye-gaze title: Eye Gaze description: Eye Gaze by K Scott Mader. Simulated and real datasets of eyes looking in different directions Last updated on 27/06/2018 07:15. Size: 4GB. Usability rating: 0.79. View count: 69406, Vote count: 162. Available under CC BY-NC-SA 4.0 license. Find more at: https://www.kaggle.com/datasets/4quant/eye-gaze description_tokens: 56 tags: [arts and entertainment, earth and nature, social science, image, eyes and vision]\n\ndataset-Nr.: 2 dataset_name: 4quant/soft-tissue-sarcoma title: Segmenting Soft Tissue Sarcomas description: Segmenting Soft Tissue Sarcomas by Timo Bozsolik. A challenge to automate tumor segmentation Last updated on 14/11/2019 06:50. Size: 306MB. Usability rating: 0.76. View count: 55961, Vote count: 162. Available under Other (specified in description) license. Find more at: https://www.kaggle.com/datasets/4quant/soft-tissue-sarcoma description_tokens: 56 tags: [healthcare, biol

Splitting the documents into chunks

In [9]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=0)

texts = text_splitter.split_documents(documents)



Check the total number of chunks.

In [10]:
print("Total number of splittings :", len(texts))

Total number of splittings : 2156


In [11]:
type(texts)

list

In [12]:
texts[0]

Document(page_content='dataset-Nr.: 1 dataset_name: 4quant/eye-gaze title: Eye Gaze description: Eye Gaze by K Scott Mader. Simulated and real datasets of eyes looking in different directions Last updated on 27/06/2018 07:15. Size: 4GB. Usability rating: 0.79. View count: 69406, Vote count: 162. Available under CC BY-NC-SA 4.0 license. Find more at: https://www.kaggle.com/datasets/4quant/eye-gaze description_tokens: 56 tags: [arts and entertainment, earth and nature, social science, image, eyes and vision]', metadata={'source': '/content/knowledge_path_dir/knowlege_base_kaggle_datasets.pdf'})
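To illustrate how separator-based chunking works, here is a simplified pure-Python sketch. It is not LangChain's implementation: it assumes that the text is first split on a blank-line separator, that the pieces are then merged greedily up to `chunk_size` characters, and that a piece longer than `chunk_size` is kept whole rather than cut in the middle. The function name `split_on_separator` and the sample text are made up for this example.

```python
def split_on_separator(text, chunk_size=400, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of up to chunk_size characters."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is oversized on its own
            # and has to become a chunk by itself.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Three made-up entries of roughly 315 characters each, separated by blank lines.
sample_text = "\n\n".join(f"dataset-Nr.: {i} " + "x" * 300 for i in range(1, 4))
sample_chunks = split_on_separator(sample_text, chunk_size=400)
print("Number of chunks:", len(sample_chunks))
```

Under these assumptions, two entries never fit into one 400-character chunk, so each entry ends up as its own chunk. This would be consistent with the output above, where the first chunk contains exactly one dataset entry.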

### Step 3: Converting the documents into Vector Embeddings

Setting up the OpenAI Environment.

In [13]:
import os
from getpass import getpass

# Read the key from the environment, or prompt for it.
# Never hardcode an API key in the notebook, and never print it.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OPENAI_API_KEY: ")
print("OPENAI API key is set successfully.")

OPENAI API key is set successfully.


To create the embeddings, we will use OpenAI's embeddings.

In [14]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

For indexing, we will use FAISS.

In [15]:
import time
from langchain_community.vectorstores import FAISS

# Record the start time
start_time = time.time()

# Create the FAISS index
db = FAISS.from_documents(texts, embeddings)

# Record the end time
end_time = time.time()

# Calculate the elapsed time in seconds
elapsed_time_seconds = end_time - start_time

# Convert elapsed time to minutes
elapsed_time_minutes = elapsed_time_seconds / 60

# Print the elapsed time in minutes
print(f"Time taken for creating FAISS index: {elapsed_time_minutes:.2f} minutes")


Time taken for creating FAISS index: 0.29 minutes
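To give an intuition for what a vector index does at query time, here is a minimal pure-Python sketch of exact nearest-neighbour search by Euclidean (L2) distance. It is not FAISS and says nothing about how FAISS is implemented internally; the toy vectors, the document ids, and the helpers `l2_distance` and `top_k` are assumptions made up for illustration.

```python
import heapq
import math


def l2_distance(a, b):
    """Return the Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, indexed_vectors, k=4):
    """Return (distance, doc_id) pairs for the k stored vectors closest to the query."""
    scored = (
        (l2_distance(query_vector, vector), doc_id)
        for doc_id, vector in indexed_vectors.items()
    )
    return heapq.nsmallest(k, scored)


# Hypothetical toy index: document id -> hand-picked embedding.
toy_index = {
    "us-healthcare-data": [0.9, 0.1, 0.0],
    "health-insurance-marketplace": [0.8, 0.3, 0.1],
    "fivethirtyeight-polls": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for distance, doc_id in results:
    print(f"{doc_id}: {distance:.3f}")
```

A brute-force scan like this compares the query with every stored vector. A dedicated library such as FAISS serves the same purpose but is built to handle large collections efficiently.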


### Step 4: Saving the embeddings locally

Initialise the path where the embeddings will be stored.

In [16]:
PATH_TO_STORE_EMBEDDINGS = "vectorstore/db_faiss"

In [17]:
db.save_local(PATH_TO_STORE_EMBEDDINGS)
print("Vector database (embeddings) is created locally at directory:", PATH_TO_STORE_EMBEDDINGS)

Vector database (embeddings) is created locally at directory: vectorstore/db_faiss
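After saving, a quick check that the expected files are on disk can prevent confusing errors when loading. The sketch below assumes that the store directory contains `index.faiss` and `index.pkl`, which is what LangChain's FAISS wrapper wrote with its default index name in the versions we are aware of; verify the file names for your version. The helper `missing_index_files` is hypothetical.

```python
from pathlib import Path


def missing_index_files(store_path, index_name="index"):
    """Return the expected index files that are absent from store_path."""
    # Assumed file names; check them against your LangChain version.
    expected = [f"{index_name}.faiss", f"{index_name}.pkl"]
    return [name for name in expected if not (Path(store_path) / name).is_file()]


missing = missing_index_files("vectorstore/db_faiss")
print("Missing files:", missing or "none")
```

Since the `.pkl` file is a Python pickle, a saved index should only be loaded when it comes from a source you trust.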


### Step 5: Loading the embeddings

In [18]:
VECTOR_DATABASE_PATH = "vectorstore/db_faiss"

In [19]:
loaded_db = FAISS.load_local(VECTOR_DATABASE_PATH, embeddings)
print("Vector database loaded successfully.")
loaded_db

Vector database loaded successfully.


<langchain_community.vectorstores.faiss.FAISS at 0x7ee633b47220>

### Step 6: Testing the vector database with a query

In [23]:
QUERY = "Give me all the datasets related to healthcare."

In [31]:
relevant_documents = loaded_db.similarity_search(QUERY)
print("The relevant documents for above query are :")
print("-"*100)
print(relevant_documents)

The relevant documents for above query are :
----------------------------------------------------------------------------------------------------
[Document(page_content='dataset-Nr.: 1069 dataset_name: maheshdadhich/us-healthcare-data title: U.S. Healthcare Data description: U.S. Healthcare Data by BuryBuryZymon. Population Health, Diseases, Drugs, Nutritions, Health-plans Last updated on 22/12/2017 16:30. Size: 36MB. Usability rating: 0.76. View count: 155117, Vote count: 202. Available under CC0: Public Domain license. Find more at: https://www.kaggle.com/datasets/maheshdadhich/us-healthcare-data description_tokens: 56 tags: [united states, healthcare, diseases, nutrition, health, social science]', metadata={'source': '/content/knowledge_path_dir/knowlege_base_kaggle_datasets.pdf'}), Document(page_content='dataset-Nr.: 739 dataset_name: hhs/health-insurance-marketplace title: Health Insurance Marketplace description: Health Insurance Marketplace by Kaggle Team. Explore health and den

In [32]:
type(relevant_documents)

list

In [33]:
len(relevant_documents)

4

By default, similarity_search returns only the 4 most similar documents (k=4).

In [35]:
relevant_documents

[Document(page_content='dataset-Nr.: 1069 dataset_name: maheshdadhich/us-healthcare-data title: U.S. Healthcare Data description: U.S. Healthcare Data by BuryBuryZymon. Population Health, Diseases, Drugs, Nutritions, Health-plans Last updated on 22/12/2017 16:30. Size: 36MB. Usability rating: 0.76. View count: 155117, Vote count: 202. Available under CC0: Public Domain license. Find more at: https://www.kaggle.com/datasets/maheshdadhich/us-healthcare-data description_tokens: 56 tags: [united states, healthcare, diseases, nutrition, health, social science]', metadata={'source': '/content/knowledge_path_dir/knowlege_base_kaggle_datasets.pdf'}),
 Document(page_content='dataset-Nr.: 739 dataset_name: hhs/health-insurance-marketplace title: Health Insurance Marketplace description: Health Insurance Marketplace by Kaggle Team. Explore health and dental plans data in the US Health Insurance Marketplace Last updated on 01/05/2017 20:16. Size: 829MB. Usability rating: 0.74. View count: 339685, 

Let's create a function that shows the response to a query in a more structured way.

In [50]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [49]:
def print_structured_response(documents):
    """Print each document's text and the total word-token count."""
    total_tokens = 0
    separator = "-" * 300

    print(separator)
    for document in documents:
        print(document.page_content)
        print(separator)
        # NLTK word tokens; these only approximate the tokens an LLM would count.
        total_tokens += len(word_tokenize(document.page_content))

    print("Total number of tokens in the relevant documents:", total_tokens)


def fetching_relevant_documents(query):
    """Fetch the 22 most similar chunks for the query and print them."""
    relevant_documents = loaded_db.similarity_search(query, k=22)
    print_structured_response(relevant_documents)


fetching_relevant_documents("Give me all the datasets related to politics.")


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dataset-Nr.: 1166 dataset_name: michaelbryantds/fivethirtyeight-polls title: FiveThirtyEight Election Polls Dataset description: FiveThirtyEight Election Polls Dataset by Michael Bryant. For presidential, house, senate, and governor elections Last updated on 29/10/2022 19:09. Size: 13MB. Usability rating: 0.81. View count: 696, Vote count: 9. Available under Other (specified in description) license. Find more at: https://www.kaggle.com/datasets/michaelbryantds/fivethirtyeight-polls description_tokens: 60 tags: [politics, exploratory data analysis, classification, regression]
---------------------------------------------------------------------------------------------------------------------
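Since every entry in this knowledge base follows the same "field: value" layout, the retrieved text can also be parsed into a dictionary, which is easier to work with than printed text. This is a sketch that assumes the layout seen in the output above; the regular expression and the helper `parse_entry` are assumptions of this example, and entries that deviate from the layout are returned as `None`.

```python
import re

# Assumed layout, based on the entries shown in the output above.
ENTRY_PATTERN = re.compile(
    r"dataset-Nr\.:\s+(?P<number>\d+)\s+"
    r"dataset_name:\s+(?P<name>\S+)\s+"
    r"title:\s+(?P<title>.*?)\s+"
    r"description:\s+(?P<description>.*?)\s+"
    r"description_tokens:\s+(?P<description_tokens>\d+)\s+"
    r"tags:\s+\[(?P<tags>.*?)\]",
    re.DOTALL,
)


def parse_entry(page_content):
    """Parse one dataset entry into a dict, or return None if it does not match."""
    match = ENTRY_PATTERN.search(page_content)
    if match is None:
        return None
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    fields["description_tokens"] = int(fields["description_tokens"])
    fields["tags"] = [tag.strip() for tag in fields["tags"].split(",") if tag.strip()]
    return fields


sample_entry = (
    "dataset-Nr.: 1166 dataset_name: michaelbryantds/fivethirtyeight-polls "
    "title: FiveThirtyEight Election Polls Dataset "
    "description: FiveThirtyEight Election Polls Dataset by Michael Bryant. "
    "For presidential, house, senate, and governor elections "
    "Last updated on 29/10/2022 19:09. Size: 13MB. "
    "description_tokens: 60 "
    "tags: [politics, exploratory data analysis, classification, regression]"
)

entry = parse_entry(sample_entry)
print(entry["title"], entry["tags"])
```

In the notebook, the same function could be applied to `document.page_content` for each retrieved document, for example to filter the results by tag.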



### Conclusion:

In this tutorial, we created and tested a vector database built from our knowledge base of PDF documents. The process can be summarized as follows:

- **Objective Achievement:** A vector database was successfully built and tested.
- **Data Source:** The knowledge base consists of PDF documents.
- **Creation Process:** The PDF text was loaded, split into chunks, and converted into embeddings with OpenAI's embeddings.
- **Vector Database Setup:** The embeddings were indexed with FAISS, saved locally, and loaded back from disk.
- **Validation Procedure:** Sample similarity-search queries (healthcare and politics) were run against the loaded index to check that it returns relevant entries.

This tutorial covered the steps needed to transform textual data from PDF documents into a searchable vector database.