Process data and save it in a vector store

# Embedding and vector store

* Data source: travel itinerary PDFs (parking, flight, car rental, hotel)

* Azure OpenAI - embedding

* FAISS

* Azure AI Search (formerly Azure Cognitive Search) - vector store and vector search, semantic search, or both

* LangChain framework - Azure OpenAI, Azure AI Search


## Configure OpenAI Settings

In [1]:
import os
import openai
from dotenv import load_dotenv
# Set up Azure OpenAI
load_dotenv()

openai.api_type = "azure"

AZURE_OPENAI_API_VERSION = os.getenv("AAG_AZURE_OPENAI_API_VERSION")
openai.api_version = AZURE_OPENAI_API_VERSION

AZURE_OPENAI_API_KEY = os.getenv("AAG_AZURE_OPENAI_API_KEY", "").strip()
assert AZURE_OPENAI_API_KEY, "ERROR: Azure OpenAI Key is missing"
openai.api_key = AZURE_OPENAI_API_KEY

AZURE_OPENAI_ENDPOINT = os.getenv("AAG_AZURE_OPENAI_ENDPOINT","").strip()
assert AZURE_OPENAI_ENDPOINT, "ERROR: Azure OpenAI Endpoint is missing"
openai.api_base = AZURE_OPENAI_ENDPOINT

# Deployment for Chat
# DEPLOYMENT_NAME_CHAT = os.getenv('DEPLOYMENT_NAME_CHAT')
DEPLOYMENT_NAME_CHAT = os.getenv('AAG_DEPLOYMENT_NAME_CHAT_16K')

# Deployment for embedding
DEPLOYMENT_NAME_EMBEDDING = os.getenv("AAG_DEPLOYMENT_NAME_EMBEDDING")
model: str = DEPLOYMENT_NAME_EMBEDDING

# Azure AI Search (formerly Azure Cognitive Search) vector store
vector_store_address: str = os.getenv("AAG_AZURE_SEARCH_SERVICE_ENDPOINT")  
vector_store_password: str = os.getenv("AAG_AZURE_SEARCH_ADMIN_KEY")
# index_name: str = "langchain-vector-arxiv-physics"

# Bing Search subscription key
BING_SUBSCRIPTION_KEY = os.getenv("BING_SUBSCRIPTION_KEY")
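The cell above checks only two of its settings with `assert`. A minimal sketch of a helper that validates several settings at once is shown below. The function name `read_required_settings` and its `env` parameter are hypothetical and not part of the original notebook; the demo dictionary holds placeholder values, not real credentials.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def read_required_settings(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return stripped values for `names`; raise if any is missing or blank."""
    # Fall back to the real process environment when no mapping is passed in.
    source = os.environ if env is None else env
    values = {name: (source.get(name) or "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return values


# Placeholder values for illustration only.
demo_env = {
    "AAG_AZURE_OPENAI_API_KEY": "  abc  ",
    "AAG_AZURE_OPENAI_ENDPOINT": "https://example.invalid/",
}
settings = read_required_settings(
    ["AAG_AZURE_OPENAI_API_KEY", "AAG_AZURE_OPENAI_ENDPOINT"], env=demo_env
)
```

Collecting every missing name before raising means one run reports all configuration gaps instead of stopping at the first.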

## Load travel itinerary

* Parking, Flight, Car Rental, Hotel ...

#### Load PDF files from a directory

In [2]:
from langchain.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data_source/")

loaded_documents = loader.load()

In [3]:
loaded_documents

[Document(page_content='12/1/23, 2:22 PM Car rental in Salt Lake City | Expedia\nhttps://www.expedia.com/trips/72674734923624/details/Nzg1NzU5NjUtYzIyZi01NDU1LWJmNGYtYzA5ZWMxMjc3Mjc5O2M0NDMwODUyLWVmZjkt… 1/2Car rental in Salt Lake City\nDec 6, 2023 - Dec 12, 2023\nSign in to see all booking details and manage your trip.\nSign in\nFox\nConfirmation: #WF********\nExpedia itinerary: 7267**********\nReservation details\nPick-up\nWed, Dec 6\n2:30pmDrop-off\nTue, Dec 12\n2:15pm\nPick-up and drop-off instructions\nLocation\nManage booking\nChange, cancel, and support\nPricing and rewards\nMenu\nHelpGet the app EnglishList your propertySupportTrips More travel', metadata={'source': 'data_source\\Car rental in Salt Lake City _ Expedia.pdf', 'page': 0}),
 Document(page_content="12/1/23, 2:07 PM JetBlue - Trip review\nhttps://managetrips.jetblue.com/dx/B6DX/#/myb?carReservation=false&hotelReservation=false&locale=en-US 1/4Manage your trip\nThere has been a change to your flight. If this new fligh

In [4]:
from langchain.text_splitter import CharacterTextSplitter

# Split documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
splitted_docs = text_splitter.split_documents(loaded_documents)
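To illustrate what separator-based splitting does, here is a simplified sketch in plain Python. It is not LangChain's implementation: it assumes a `"\n\n"` separator (which, as far as I know, is the `CharacterTextSplitter` default), ignores overlap, and the name `split_on_separator` is hypothetical.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most
    `chunk_size` characters. A single piece longer than `chunk_size` is
    kept whole, so a chunk can exceed the limit."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


paragraph_chunks = split_on_separator("aaaa\n\nbbbb\n\ncccc", chunk_size=10)
single_line_chunks = split_on_separator("x" * 50, chunk_size=10)
```

The second call shows one possible reason why the split output can look the same as the loaded pages: text that never contains the separator is not divided, whatever `chunk_size` is.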


In [5]:
splitted_docs

[Document(page_content='12/1/23, 2:22 PM Car rental in Salt Lake City | Expedia\nhttps://www.expedia.com/trips/72674734923624/details/Nzg1NzU5NjUtYzIyZi01NDU1LWJmNGYtYzA5ZWMxMjc3Mjc5O2M0NDMwODUyLWVmZjkt… 1/2Car rental in Salt Lake City\nDec 6, 2023 - Dec 12, 2023\nSign in to see all booking details and manage your trip.\nSign in\nFox\nConfirmation: #WF********\nExpedia itinerary: 7267**********\nReservation details\nPick-up\nWed, Dec 6\n2:30pmDrop-off\nTue, Dec 12\n2:15pm\nPick-up and drop-off instructions\nLocation\nManage booking\nChange, cancel, and support\nPricing and rewards\nMenu\nHelpGet the app EnglishList your propertySupportTrips More travel', metadata={'source': 'data_source\\Car rental in Salt Lake City _ Expedia.pdf', 'page': 0}),
 Document(page_content="12/1/23, 2:07 PM JetBlue - Trip review\nhttps://managetrips.jetblue.com/dx/B6DX/#/myb?carReservation=false&hotelReservation=false&locale=en-US 1/4Manage your trip\nThere has been a change to your flight. If this new fligh

### Create embeddings and vector store instances

#### FAISS vector store

In [6]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Get Azure OpenAI embedding
embeddings: OpenAIEmbeddings = OpenAIEmbeddings(
    deployment=model,
    model=model,
    chunk_size=1,   # Despite its name, 'chunk_size' is the number of input texts sent per embedding request, not a word or character count.
    openai_api_base=AZURE_OPENAI_ENDPOINT,
    openai_api_type="azure",
    api_key=AZURE_OPENAI_API_KEY,
)

# Create the vector index
db = FAISS.from_documents(splitted_docs, embeddings)
# Query the index
query = "When does the travel start? Which terminal is it?"
docs = db.similarity_search(query)
# Print the results
print(docs[0].page_content)

12/1/23, 2:07 PM JetBlue - Trip review
https://managetrips.jetblue.com/dx/B6DX/#/myb?carReservation=false&hotelReservation=false&locale=en-US 1/4Manage your trip
There has been a change to your flight. If this new flight does not work for you then use the
'Change' link below.
Conﬁrmation code
IHHZZC Ticketed
  
New York-JFK Salt Lake City CONFIRMED
Dec 6, 2023
11:00 am
New York-JFKDec 6, 2023
02:15 pm
Salt Lake City5 hrs 15 mins
Nonstop
Blue Basic
B6871 A320Operated by JetBlue
Details 
Salt Lake City New York-JFK CONFIRMED
Dec 12, 2023
03:11 pm
Salt Lake CityDec 12, 2023
09:38 pm
New York-JFK4 hrs 27 mins
Nonstop
Blue Basic
B6872 A320Operated by JetBlueYour flights
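Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force search over toy 2-D vectors. It rests on assumptions: that the LangChain FAISS wrapper defaults to L2 (Euclidean) distance and returns 4 results, and that real embeddings behave like these toy vectors apart from having far more dimensions. The helper names are hypothetical, and this is not how FAISS is implemented internally.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 4
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors nearest to the query."""
    scored = [(i, l2_distance(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]


# Toy "embeddings" standing in for document chunks.
toy_doc_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
nearest = top_k([0.9, 1.2], toy_doc_vecs, k=2)
nearest_indexes = [index for index, _ in nearest]
```

In the notebook, `docs[0]` corresponds to the first entry of such a ranking: the chunk judged closest to the query.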


In [7]:
# Save the travel itinerary index to a local folder
db.save_local("faiss_index_travel")
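`save_local` writes the index into the given folder as two files, `index.faiss` and `index.pkl`, when the default index name is used. A small sketch for checking that both files are present before reloading the index later (for example with `FAISS.load_local`) is shown below. The helper name `missing_index_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def missing_index_files(folder: str, index_name: str = "index") -> List[str]:
    """Return the expected FAISS index file names that are absent from `folder`."""
    expected = [f"{index_name}.faiss", f"{index_name}.pkl"]
    return [name for name in expected if not (Path(folder) / name).is_file()]
```

For example, `missing_index_files("faiss_index_travel")` should return an empty list once the save above has completed. Note that recent LangChain releases may require an explicit opt-in flag when loading the pickled file; check the documentation for the installed version.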