Process the data and save it in a vector store

# Embedding and vector store

* Data source: SEC filing reports

* Azure OpenAI - embedding

* FAISS

* Azure AI Search (Azure Cognitive Search) - vector store and vector search, semantic search, or both

* LangChain framework - Azure OpenAI, Azure AI Search


## Import LangChain libraries and environment variables

In [27]:
# Import required libraries  
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.azuresearch import AzureSearch
from azure.search.documents.indexes.models import (
    SemanticSettings,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField
)

## Configure OpenAI Settings

In [26]:
import os
import openai
from dotenv import load_dotenv
# Set up Azure OpenAI
load_dotenv()

openai.api_type = "azure"

AZURE_OPENAI_API_VERSION = os.getenv("AAG_AZURE_OPENAI_API_VERSION")
openai.api_version = AZURE_OPENAI_API_VERSION

# Default to "" so a missing key reaches the assert below instead of failing on None.strip()
AZURE_OPENAI_API_KEY = os.getenv("AAG_AZURE_OPENAI_API_KEY", "").strip()
assert AZURE_OPENAI_API_KEY, "ERROR: Azure OpenAI Key is missing"
openai.api_key = AZURE_OPENAI_API_KEY

AZURE_OPENAI_ENDPOINT = os.getenv("AAG_AZURE_OPENAI_ENDPOINT","").strip()
assert AZURE_OPENAI_ENDPOINT, "ERROR: Azure OpenAI Endpoint is missing"
openai.api_base = AZURE_OPENAI_ENDPOINT

# Deployment for Chat
# DEPLOYMENT_NAME_CHAT = os.getenv('DEPLOYMENT_NAME_CHAT')
DEPLOYMENT_NAME_CHAT = os.getenv('AAG_DEPLOYMENT_NAME_CHAT_16K')

# Deployment for embedding
DEPLOYMENT_NAME_EMBEDDING = os.getenv("AAG_DEPLOYMENT_NAME_EMBEDDING")
model: str = DEPLOYMENT_NAME_EMBEDDING

# Azure AI Search (formerly Azure Cognitive Search) vector store
vector_store_address: str = os.getenv("AAG_AZURE_SEARCH_SERVICE_ENDPOINT")  
vector_store_password: str = os.getenv("AAG_AZURE_SEARCH_ADMIN_KEY")
# index_name: str = "langchain-vector-arxiv-physics"

# Bing Search subscription key
BING_SUBSCRIPTION_KEY = os.getenv("BING_SUBSCRIPTION_KEY")
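
A minimal sketch of one way to make the environment checks above uniform, so every required setting fails with the same clear message. `require_env` and `missing_settings` are hypothetical helper names, not part of `dotenv` or the OpenAI SDK.

```python
import os
from typing import Iterable, List


def require_env(name: str) -> str:
    """Return the stripped value of an environment variable, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is missing or empty")
    return value


def missing_settings(names: Iterable[str]) -> List[str]:
    """Return the names that are unset or blank, so they can all be reported at once."""
    return [name for name in names if not os.getenv(name, "").strip()]
```

For example, `missing_settings(["AAG_AZURE_OPENAI_API_KEY", "AAG_AZURE_OPENAI_ENDPOINT"])` would list every gap in the `.env` file in one pass rather than stopping at the first failed assert.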

## Load SEC data

* 10-K, 10-Q, 8-K

#### Load a single file

In [8]:
from langchain.document_loaders import PyPDFLoader

# Load pdf files
loader = PyPDFLoader("./data_source/zbra-20221231_10-K.pdf")
loaded_documents = loader.load()

#### Batch process: load all pdf files in a folder

* To avoid a RateLimitError from the embedding service, process only a small number of files at a time

In [35]:
loaded_documents=[]
print("Data preparation >>>")
# Ask the user to provide the folder path
folder_path = input('Enter the path to the folder: ')

# Check if the provided path exists
if not os.path.exists(folder_path):
    print(f'The folder path "{folder_path}" does not exist.')
else:
    # Loop through the files in the folder
    for filename in os.listdir(folder_path):
        # Check if the file is a PDF
        if filename.lower().endswith('.pdf'):
            # If it's a PDF, print the file name or perform any other desired action
            # print(f'Found PDF file: {filename}')
            orig_file_with_full_path = os.path.join(folder_path, filename)
            print(f'Found PDF file: {orig_file_with_full_path}')

            loader = PyPDFLoader(orig_file_with_full_path)
            loaded_documents += loader.load()
            
            # Break out of the loop after processing the first PDF file - testing only
            # break
print("Data preparation is done! <<<")

Data preparation >>>
Found PDF file: data_source/10-K\zbra-20221231_10-K.pdf
Data preparation is done! <<<
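
One possible way to handle the RateLimitError mentioned at the top of this subsection is to work in small batches and retry with a growing pause. This is only a sketch: `batched` and `call_with_backoff` are hypothetical helpers, and the delays are illustrative values, not recommendations from Azure OpenAI.

```python
import time
from typing import Callable, Iterator, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def call_with_backoff(
    func: Callable[[], R],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Call `func`, retrying on `retry_on` errors with exponentially growing pauses."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

In this notebook, each batch could wrap the embedding call, with `retry_on` narrowed to the rate-limit exception of the installed OpenAI SDK.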


#### Split documents into chunks

In [None]:
loaded_documents

In [37]:
from langchain.text_splitter import CharacterTextSplitter

# Split documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
splitted_docs = text_splitter.split_documents(loaded_documents)


In [None]:
splitted_docs
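
To make the effect of `chunk_size` more concrete, here is a simplified pure-Python illustration of separator-based splitting: the text is cut on a separator and the pieces are merged greedily until the size limit would be exceeded. It is not LangChain's implementation, it ignores `chunk_overlap`, and `split_on_separator` is a hypothetical name. A piece that is longer than `chunk_size` on its own is kept whole here, so chunks can exceed the limit when the text has few separators.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most `chunk_size` characters."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either the first piece of a new chunk (kept even if oversized) or it still fits.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccccccccc"
print(split_on_separator(sample, chunk_size=10))
```

It is worth checking the length of the entries in `splitted_docs` after the real split, since PDF pages with no blank lines may produce chunks larger than expected.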

## Create embeddings and vector store instances

### Option 1: FAISS vector store

In [39]:
# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Get Azure OpenAI embedding
embeddings: OpenAIEmbeddings = OpenAIEmbeddings(
    deployment=model,
    model=model,
    chunk_size=1,   # despite the name, 'chunk_size' here is the number of input texts sent per embedding request (batch size), not the number of words or characters in a text
    openai_api_base=AZURE_OPENAI_ENDPOINT,
    openai_api_type="azure",
    api_key=AZURE_OPENAI_API_KEY,
)

# Create the vector index
db = FAISS.from_documents(splitted_docs, embeddings)
# Query the index
# query = "What did the president say about Ketanji Brown Jackson"
# docs = db.similarity_search(query)
# # Print the results
# print(docs[0].page_content)
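
Conceptually, a similarity search over the index compares the query embedding with every stored embedding and returns the closest ones. The sketch below shows that idea with plain Euclidean (L2) distance on toy 2-D vectors. It is an illustration only, not how FAISS is implemented, and `l2_distance` and `nearest` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to the query."""
    scored = [(index, l2_distance(query, vector)) for index, vector in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]


stored = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
print(nearest([0.0, 0.0], stored, k=2))
```

With real embeddings the vectors have many more dimensions, but the ranking logic is the same: lower distance means a closer match.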

#### Save each data source into its own FAISS index

The reasoning behind this approach is credibility (audited filings) versus recency (the most recent data). The sources may be given different weights when a rating is produced later, so preparing the data this way keeps that flexibility.

In [25]:
# 8-K
db.save_local("faiss_index_8-K")

In [33]:
# 10-Q
db.save_local("faiss_index_10-Q")

In [40]:
# 10-K
db.save_local("faiss_index_10-K")
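
A rough sketch of how separate indexes could later be combined with per-source weights. The weights below are made-up placeholders, `merge_weighted` is a hypothetical helper, and the scores are assumed to be already normalized so that higher means more relevant.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: audited annual filings are trusted most in this example.
SOURCE_WEIGHTS: Dict[str, float] = {"10-K": 1.0, "10-Q": 0.75, "8-K": 0.5}


def merge_weighted(
    results_by_source: Dict[str, List[Tuple[str, float]]],
    weights: Dict[str, float],
    top_k: int = 3,
) -> List[Tuple[str, str, float]]:
    """Scale each (text, score) hit by its source weight and return the top_k hits overall."""
    merged: List[Tuple[str, str, float]] = []
    for source, hits in results_by_source.items():
        weight = weights.get(source, 0.0)  # unknown sources contribute nothing
        for text, score in hits:
            merged.append((text, source, score * weight))
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:top_k]
```

Each entry in `results_by_source` would come from searching one of the saved indexes (for example one loaded from `faiss_index_10-K`), after converting its raw scores to a higher-is-better scale.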

### Option 2: Azure AI Search (Cognitive Search)

* TODO: indexing is on hold for now and will be done later; the cost needs to be watched

In [32]:
# Get Azure OpenAI embedding
embeddings: OpenAIEmbeddings = OpenAIEmbeddings(deployment=model, model=model, 
                                                chunk_size=1, 
                                                openai_api_base = AZURE_OPENAI_ENDPOINT, 
                                                openai_api_type = "azure", 
                                                api_key = AZURE_OPENAI_API_KEY)
# Define the index name
index_name: str = "langchain-vector-zebra-10k-10q-8k"

# Create index in the vector store
azure_ai_search_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
    semantic_configuration_name='config',
    semantic_settings=SemanticSettings(
        default_configuration='config',
        configurations=[
            SemanticConfiguration(
                name='config',
                prioritized_fields=PrioritizedFields(
                    title_field=SemanticField(field_name='content'),
                    prioritized_content_fields=[SemanticField(field_name='content')],
                    prioritized_keywords_fields=[SemanticField(field_name='metadata')]
                )
            )
        ]
    )
)

#### Insert text and embeddings into vector store - need to watch the cost, so hold for now

In [None]:
# Executing the following line will start embedding ...
# azure_ai_search_vector_store.add_documents(documents=splitted_docs)