Process the data and save it in a vector store

# Embedding and vector store

* Data source: OBC

* OpenAI - embedding

* FAISS vector store and vector search, semantic search, or both

* LangChain framework 


## Configure OpenAI Settings

In [4]:
import os
import openai
from dotenv import load_dotenv
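# Set up OpenAI credentials (values are read from the .env file)
load_dotenv()

# Default to "" so .strip() does not fail when the variable is unset
OPENAI_API_KEY = os.getenv("SKIBOT_KEY", "").strip()
assert OPENAI_API_KEY, "ERROR: OpenAI Key is missing"
openai.api_key = OPENAI_API_KEY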

# Option 1: use an OpenAI account
openai_api_key: str = OPENAI_API_KEY
openai_api_version: str = "2023-05-15"
# model: str = "text-embedding-ada-002"
embedding_model: str = "text-embedding-3-small"


## Load PDF file


### Load a single file with the LangChain PDF loader

In [5]:
from langchain.document_loaders import PyPDFLoader

# Load the PDF file (PyPDFLoader returns one Document per page)
loader = PyPDFLoader("./data_source_obc/OBC 2020 - Part 7 - Section4 (12 pages) .pdf")
loaded_documents = loader.load()

# from langchain.document_loaders import PyPDFDirectoryLoader

# loader = PyPDFDirectoryLoader("./data_source/")

# loaded_documents = loader.load()

In [6]:
loaded_documents

[Document(page_content=' \n387 7.4.2.1.   Connections  to Sanitar y Drainage Systems  \n (1)  Every fixture  shall be directly connected to a sanitary drainage system , except that,  \n (a) drinki ng fountains may be,  \n (i) indirectly connected  to a sanitary drainage system,  or \n (ii) connected to a storm drainage system  provided t hat where the system is subject to backflow , a check valve  is \ninstalled in the fountain waste pipe , \n (b) laundry plumbing appliances  may be indirectly connected  to a sanitary drainage system , \n (c) fixtures  or plumbing appliances , other than floo r drains, e xcept as provided in Sentence 7.1.4.2.(2), that discharge only \nclear water waste  may be connected to a storm drainage syst em, \n (d) the following devices shall be indirectly connected  to a drainage system : \n (i) a device for the display, storage, preparation  or processing of food or drink,  \n (ii) a sterilizer,  \n (iii) a device that uses water as a cooling or heating medium

### (Optional) Split documents into chunks, if splitting by page causes issues

In [7]:
# from langchain.text_splitter import CharacterTextSplitter

# # Split documents into chunks
# text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=300)
# splitted_docs = text_splitter.split_documents(loaded_documents)

In [8]:
# splitted_docs
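
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is not LangChain's `CharacterTextSplitter` implementation (which splits on a separator first and then merges the pieces); the function name `sliding_window_chunks` is hypothetical and used for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the effect `chunk_overlap=300` would have at a larger scale: text near a chunk boundary stays retrievable from both neighbours.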

### For now, use one chunk per page

In [9]:
splitted_docs = loaded_documents

## Create embeddings and vector store instances

### FAISS vector store

In [10]:
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Option 1: use OpenAIEmbeddings with an OpenAI account
embeddings: OpenAIEmbeddings = OpenAIEmbeddings(
    openai_api_key=openai_api_key, 
    model=embedding_model
)


In [11]:

# Create the vector index
db = FAISS.from_documents(splitted_docs, embeddings)

In [12]:
# Query the index
query = "how big interval shall between two cleanouts on a 6” horizontal sanitary drainage pipe?"
docs = db.similarity_search(
    query,
    k=3,  # return the three closest pages
)
# Print the results
print(docs[0].page_content)
print(docs[0].metadata)

 
393  (3)  Reserved  
 (4)  Each change of direction of the piping between a cleanout  fitting and the drainage piping or vent piping  that it serv es 
shall be accomplished by using 45  bends.  
 (5)  A cleanout  shall be provided to serve vertica l drainage piping from a wall hung urinal and shall extend above the flood 
level rim  of the fixture . 
 (6)  A cleanout serving a fixture  in health care facilitie s, mortuaries, laboratories and similar occupancies , where 
contamination by body fluids is likel y, shall be located a minimum of 150 mm above the flood level rim of the fixture . 
7.4.8.   Minimum Slope and Length of Drainage Pipes  
7.4.8.1.   Minimum Slope  
 (1)  Except as provided in Sentences (2) and (3), every drainage pipe that has a size of 3 in. or less shall have a downward 
slope in the direction of flow of at least 1 in 50.  
 (2)  Sentence (1) does not apply to a force main . 
 (3)  Where it is not possible  to comply with Sentence (1), a lesser slope may be us
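
Conceptually, the search above embeds the query and returns the stored vectors closest to it. The following is a minimal brute-force sketch of that idea in plain Python, assuming Euclidean (L2) distance; it is not how FAISS is implemented, and `top_k_by_l2` and the toy vectors are hypothetical.

```python
import math


def top_k_by_l2(
    query_vec: list[float], doc_vecs: list[list[float]], k: int
) -> list[tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-dimensional "embeddings" standing in for real page embeddings.
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = top_k_by_l2([1.0, 0.5], doc_vecs, k=2)
print(nearest)  # index 1 is closest (distance 0.5), then index 0
```

A smaller distance means a closer match, so the first result is the best candidate, but nothing guarantees that it actually answers the question; the retrieved text still has to be checked against the query.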

In [13]:
# Save the index to a local folder
db.save_local("faiss_index_obc")
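
A saved index can be reloaded in a later session without re-embedding the PDF. The sketch below assumes a LangChain release in which `FAISS` is importable from `langchain_community.vectorstores` and `load_local` accepts `allow_dangerous_deserialization`; the helper name `load_obc_index` is hypothetical.

```python
def load_obc_index(api_key: str, folder: str = "faiss_index_obc"):
    """Reload the FAISS index previously written with db.save_local(folder)."""
    # Local imports: defining this helper does not require the packages to be installed.
    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings

    # Use the same embedding model that built the index.
    embeddings = OpenAIEmbeddings(openai_api_key=api_key, model="text-embedding-3-small")
    # The docstore is stored as a pickle, so loading must be explicitly allowed.
    # Only enable this for index files you created yourself.
    return FAISS.load_local(folder, embeddings, allow_dangerous_deserialization=True)
```

Calling `load_obc_index(OPENAI_API_KEY)` should return a store that supports the same `similarity_search` calls as `db`.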

In [14]:
# Inspect the documents held in the index's docstore (private attribute, for debugging only)
db.docstore._dict

{'1fa9e7dc-0fbd-4e66-a0f9-7f2b97892ef9': Document(page_content=' \n387 7.4.2.1.   Connections  to Sanitar y Drainage Systems  \n (1)  Every fixture  shall be directly connected to a sanitary drainage system , except that,  \n (a) drinki ng fountains may be,  \n (i) indirectly connected  to a sanitary drainage system,  or \n (ii) connected to a storm drainage system  provided t hat where the system is subject to backflow , a check valve  is \ninstalled in the fountain waste pipe , \n (b) laundry plumbing appliances  may be indirectly connected  to a sanitary drainage system , \n (c) fixtures  or plumbing appliances , other than floo r drains, e xcept as provided in Sentence 7.1.4.2.(2), that discharge only \nclear water waste  may be connected to a storm drainage syst em, \n (d) the following devices shall be indirectly connected  to a drainage system : \n (i) a device for the display, storage, preparation  or processing of food or drink,  \n (ii) a sterilizer,  \n (iii) a device that u

In [15]:
# Dimension of the vectors stored in the index
db.index.d

1536
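
The value 1536 matches the default output dimension of `text-embedding-3-small`. A small sanity check like the one below can catch the case where an index is queried with a different embedding model than the one that built it. The table lists the default dimensions documented by OpenAI; the helper `check_index_dim` is hypothetical.

```python
# Default output dimensions of OpenAI embedding models.
DEFAULT_DIMS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_index_dim(index_dim: int, model: str) -> None:
    """Raise if the index dimension differs from the model's default dimension."""
    expected = DEFAULT_DIMS[model]
    if index_dim != expected:
        raise ValueError(
            f"Index has dimension {index_dim}, but {model} produces {expected} by default"
        )


check_index_dim(1536, "text-embedding-3-small")  # passes silently
```

In the notebook this would be called as `check_index_dim(db.index.d, embedding_model)`.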