### Author : Rahul Bhoyar

In this tutorial, we walk through loading our knowledge base into a vector database.

This involves converting the text into embeddings, which are numerical representations of the underlying information.

The embeddings are then stored in a vector database so that relevant passages can be retrieved efficiently by similarity search.
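
To make the idea of an embedding concrete, here is a minimal sketch using made-up three-dimensional vectors (real embedding models return vectors with hundreds or thousands of dimensions). The vectors and the `cosine_similarity` helper are illustrative assumptions, not output from any model used in this tutorial.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the numbers are invented for illustration.
toy_embeddings = {
    "covid cases dataset": [0.9, 0.1, 0.0],
    "coronavirus statistics": [0.8, 0.2, 0.1],
    "algebra word problems": [0.0, 0.2, 0.9],
}

query_vector = toy_embeddings["covid cases dataset"]
for text, vector in toy_embeddings.items():
    print(f"{text!r}: {cosine_similarity(query_vector, vector):.3f}")
```

Texts with related meaning are expected to get vectors that point in similar directions, so their similarity score is higher than that of unrelated texts.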

### Step 1: Installing the necessary libraries

In [None]:
!pip install langchain langchain_openai langchain_community faiss-gpu unstructured unstructured[pdf]



### Step 2: Loading the documents

Set the path to the directory where the knowledge base is stored.

In [None]:
KNOWLEDGE_BASE_PATH = "/content/drive/MyDrive/Colab Notebooks/python_projects/KaggleRecSys-with-Rahul/kaggle-datasets/pdf"

In [None]:
from langchain_community.document_loaders import DirectoryLoader

In [None]:
loader = DirectoryLoader(KNOWLEDGE_BASE_PATH)

Next, we create the `documents` object, which holds the data loaded from every file in the directory.

Loading may take several minutes.

In [None]:
import time

# Record start time
start_time = time.time()

# Execute the loader.load() process
documents = loader.load()

# Record end time
end_time = time.time()

# Calculate the elapsed time in seconds
elapsed_time_seconds = end_time - start_time

# Convert elapsed time to minutes
elapsed_time_minutes = elapsed_time_seconds / 60

print(f"The process took {elapsed_time_minutes} minutes.")

print("-"*200)
print("The document object created successfully.")

The process took 7.785158944129944 minutes.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The document object created successfully.
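
The start/stop timing pattern used above appears more than once in this notebook. One way to avoid repeating it is a small context manager. This is a sketch only: `timed` is a hypothetical helper name, and the `time.sleep` call is a stand-in for a slow step such as `loader.load()`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the enclosed block took, in minutes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_minutes = (time.perf_counter() - start) / 60
        print(f"{label} took {elapsed_minutes:.2f} minutes.")

# Stand-in workload; in the notebook this would be `documents = loader.load()`.
with timed("Sleeping"):
    time.sleep(0.1)
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring intervals.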


Check the number of loaded documents.

In [None]:
len(documents)

2

In [None]:
type(documents)

list

In [None]:
documents[0]


In [None]:
documents


Splitting the documents into chunks

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)

texts = text_splitter.split_documents(documents)

Check the total number of splits.

In [None]:
print("Total number of splittings :", len(texts))

Total number of splittings : 6713
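
The effect of `chunk_size` can be illustrated with a simplified, pure-Python splitter. This is a sketch of the general idea (split on a separator, then merge pieces until the size limit is reached), not LangChain's actual implementation; `split_into_chunks` is a hypothetical helper, and it ignores overlap.

```python
def split_into_chunks(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept
    whole rather than cut mid-piece."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "alpha beta\n\ngamma\n\ndelta epsilon zeta\n\neta"
for i, chunk in enumerate(split_into_chunks(sample, chunk_size=20)):
    print(i, repr(chunk))
```

A larger `chunk_size` yields fewer, longer chunks, which means fewer embedding calls but coarser retrieval results.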


In [None]:
type(texts)

list

In [None]:
texts[0]

Document(page_content='Sr. No.: 1 dataset_name: saurabhshahane/diverse-algebra-word-problems title: Diverse Algebra Word Problems Dataset tags: [earth and nature, education, beginner, intermediate, advanced, text] creatorName: Saurabh Shahane creatorUrl: saurabhshahane currentVersionNumber: 1 description: nan subtitle: Solving Algebra Word Problems using NLP descriptionNullable: nan downloadCount: 127 files: [] hasCreatorName: True hasCreatorUrl: True hasCurrentVersionNumber: True hasDescription: False hasLicenseName: True hasOwnerName: True hasOwnerRef: True hasSubtitle: True hasTitle: True hasTotalBytes: True hasUrl: True hasUsabilityRating: True id: 1116400 isFeatured: False isPrivate: False kernelCount: 1 lastUpdated: 2021-01-24 18:14:49 licenseName: Other (specified in description) licenseNameNullable: Other (specified in description) ownerName: Saurabh Shahane ownerNameNullable: Saurabh Shahane ownerRef: saurabhshahane ownerRefNullable: saurabhshahane ref: saurabhshahane/diverse-

### Step 3: Converting the documents into Vector Embeddings

Setting up the OpenAI Environment.

In [None]:
import os
from getpass import getpass

# Read the key from the environment or prompt for it; never hard-code or print a secret.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OPENAI_API_KEY: ")
print("OPENAI API key is set successfully.")

OPENAI API key is set successfully.


In [None]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
import time
from langchain_community.vectorstores import FAISS

# Assuming 'texts' and 'embeddings' are already defined

# Record the start time
start_time = time.time()

# Create the FAISS index
db = FAISS.from_documents(texts, embeddings)

# Record the end time
end_time = time.time()

# Calculate the elapsed time in seconds
elapsed_time_seconds = end_time - start_time

# Convert elapsed time to minutes
elapsed_time_minutes = elapsed_time_seconds / 60

# Print the elapsed time in minutes
print(f"Time taken for creating FAISS index: {elapsed_time_minutes:.2f} minutes")


Time taken for creating FAISS index: 1.89 minutes
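
Conceptually, a vector index stores one vector per chunk and, at query time, returns the chunks whose vectors are closest to the query vector. The brute-force sketch below shows that idea with squared Euclidean distance on made-up two-dimensional vectors. `TinyIndex` is a hypothetical class for illustration and says nothing about how FAISS is implemented internally.

```python
import heapq

class TinyIndex:
    """Hypothetical brute-force vector index, for illustration only."""

    def __init__(self):
        self._vectors = []
        self._payloads = []

    def add(self, vector, payload):
        """Store a vector together with the item it represents."""
        self._vectors.append(list(vector))
        self._payloads.append(payload)

    def search(self, query, k=2):
        """Return the k (squared distance, payload) pairs closest to query."""
        scored = (
            (sum((q - v) ** 2 for q, v in zip(query, vec)), payload)
            for vec, payload in zip(self._vectors, self._payloads)
        )
        return heapq.nsmallest(k, scored, key=lambda pair: pair[0])

index = TinyIndex()
index.add([0.9, 0.1], "covid chunk")
index.add([0.8, 0.3], "pandemic chunk")
index.add([0.1, 0.9], "algebra chunk")

for distance, payload in index.search([1.0, 0.0], k=2):
    print(f"{payload}: squared distance {distance:.2f}")
```

A real index library does the same job far more efficiently, which matters once there are thousands of chunks, as in this notebook.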


### Step 4: Saving the embeddings locally

Initialize the path where the embeddings will be stored.

In [None]:
PATH_TO_STORE_EMBEDDINGS = "vectorstore/db_faiss"

In [None]:
db.save_local(PATH_TO_STORE_EMBEDDINGS)
print("Vector database is created (Embeddings) locally at directory:",PATH_TO_STORE_EMBEDDINGS)

Vector database is created (Embeddings) locally at directory: vectorstore/db_faiss
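
With the default index name, `save_local` is expected to write two files into the target directory: `index.faiss` (the FAISS index) and `index.pkl` (the pickled document store and ID mapping). The sketch below checks for them before a later load is attempted; `missing_index_files` is a hypothetical helper, and the file names are an assumption worth verifying against your installed version.

```python
from pathlib import Path

def missing_index_files(directory, index_name="index"):
    """Return the expected vector store files that are absent from directory."""
    expected = [f"{index_name}.faiss", f"{index_name}.pkl"]
    return [name for name in expected if not (Path(directory) / name).is_file()]

missing = missing_index_files("vectorstore/db_faiss")
if missing:
    print("Missing files:", missing)
else:
    print("Vector store files found.")
```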


### Step 5: Loading the embeddings

In [None]:
VECTOR_DATABASE_PATH = "vectorstore/db_faiss"

In [None]:
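# Newer langchain_community releases also require allow_dangerous_deserialization=True here,
# because the docstore is pickled; only load index files you created yourself.
loaded_db = FAISS.load_local(VECTOR_DATABASE_PATH, embeddings)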
print("Vector database loaded successfully.")
loaded_db

Vector database loaded successfully.


<langchain_community.vectorstores.faiss.FAISS at 0x7c50f6f47130>

### Step 6: Testing the vector database with a query

In [None]:
QUERY = "Give me all the datasets related to Covid."

In [None]:
# similarity_search returns the k most similar chunks (k defaults to 4)
relevant_documents = loaded_db.similarity_search(QUERY)
print("The relevant documents for the above query are:")
print("-"*100)
# Print only the retrieved chunks, not the full `documents` list
print(relevant_documents)
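
Retrieved chunks can be long, so printing a short preview per result keeps the cell output small. The sketch below assumes each result exposes `page_content` and `metadata` attributes, as LangChain `Document` objects do; `FakeDocument` and `preview` are hypothetical names, and the sample data is invented so the block runs on its own.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def preview(docs, width=80):
    """Return one short line per document: rank, source, truncated text."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        # Collapse whitespace so each preview fits on a single line.
        text = " ".join(doc.page_content.split())
        if len(text) > width:
            text = text[:width] + "..."
        source = doc.metadata.get("source", "unknown")
        lines.append(f"{rank}. [{source}] {text}")
    return lines

sample_results = [
    FakeDocument("title: Example Covid dataset " + "x" * 200, {"source": "a.pdf"}),
    FakeDocument("title: Another example", {}),
]
print("\n".join(preview(sample_results, width=40)))
```

In the notebook, `preview(relevant_documents)` could be used in place of `sample_results`.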



### Conclusion:

In this tutorial, we created and tested a vector database built from our knowledge base of PDF documents. The process can be summarized as follows:

**Objective Achievement:**

Successful creation and testing of a vector database.

**Data Source:**

The knowledge base used here consists of PDF documents.

**Creation Process:**

The documents were loaded, split into chunks, and converted into embeddings.

**Vector Database Setup:**

A FAISS vector database was created to store the numerical representations of the text, then saved locally and loaded back.

**Validation Procedure:**

A similarity search query was run against the loaded database to check that relevant chunks are retrieved.


This tutorial gives a streamlined overview of the steps taken to turn textual data from PDF documents into a structured and searchable vector database.