### Taking a look at embeddings

Load the input document

In [3]:
import json
with open('data.json', 'r', encoding='utf-8') as file:
    input_data = json.load(file)

The goal is to see what the embeddings look like for the corresponding input data. Along with the SentenceTransformer library, the embedding model all-MiniLM-L6-v2 has also been downloaded. \
all-MiniLM-L6-v2 maps text to a 384-dimensional vector. \
Azure OpenAI's text-embedding-ada-002 model could also be used for this purpose.\
I'm running all-MiniLM-L6-v2 locally because the model is comparatively small (about 80 MB), which helps when weighing cost, memory, and performance.


In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

embeddings = model.encode("This is a sample text")
embeddings.shape
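
Embeddings are useful because texts with similar meaning end up close together in the vector space, and cosine similarity is a common way to measure that closeness. The sketch below is only an illustration: the helper `cosine_similarity` and the three toy 3-dimensional vectors are made up for this example and stand in for real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration; real embeddings would have 384 values.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# The two "related" vectors score higher than the unrelated pair.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

With real embeddings, the same function should work on `model.encode(text).tolist()`; sentence_transformers also ships `util.cos_sim` for this.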


Store embeddings in an output file

In [5]:
for data in input_data:
    embeddings = model.encode(data['content'])
    data['contentVector'] = embeddings.tolist()

with open("output.json", "w") as f:
    json.dump(input_data, f)
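
Before using the output file elsewhere, it can help to confirm that every record really carries a vector of the expected size. The following is a minimal sketch under stated assumptions: `check_vectors` is a hypothetical helper, and the demo writes its own throwaway file with placeholder records rather than reading the real `output.json`.

```python
import json
import os
import tempfile


def check_vectors(path, field="contentVector", expected_dim=384):
    """Return the record count; raise if a vector is missing or has the wrong size."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    for i, record in enumerate(records):
        vector = record.get(field)
        if vector is None:
            raise ValueError(f"record {i} has no '{field}' field")
        if len(vector) != expected_dim:
            raise ValueError(
                f"record {i} has {len(vector)} dimensions, expected {expected_dim}"
            )
    return len(records)


# Placeholder records (not real data) shaped like the ones written above.
sample = [
    {"content": "first placeholder", "contentVector": [0.0] * 384},
    {"content": "second placeholder", "contentVector": [0.1] * 384},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "output.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    checked = check_vectors(demo_path)

print(checked)
```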

## Using an LLM to generate human-friendly responses

We can work with webpages as well, not just document files.\
In this example, we will ask questions about the Digital Personal Data Protection Bill introduced by the Government of India in 2023.\
ChatGPT's training data ends around 2021, so it cannot answer questions about this bill from its own knowledge. With the approach below, we can use an LLM against whatever data we choose.\
Let's start by importing the required modules and loading the environment variables.

In [1]:
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
import os
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import AzureChatOpenAI

from dotenv import load_dotenv  
load_dotenv()



True

Load a webpage using LangChain's WebBaseLoader, and split it into smaller chunks.

In [2]:
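loader = WebBaseLoader("https://prsindia.org/billtrack/digital-personal-data-protection-bill-2023")

# load() only fetches the page. load_and_split() would already split it with the
# default splitter, which makes the explicit split below redundant.
data = loader.load()

# Split the page into chunks using the splitter's default settings
text_splitter = RecursiveCharacterTextSplitter()
all_splits = text_splitter.split_documents(data)
len(all_splits)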

10
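
To build some intuition for what splitting into chunks means, here is a simplified sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); `split_with_overlap` is a hypothetical helper, and its default sizes are assumptions chosen for illustration.

```python
def split_with_overlap(text, chunk_size=4000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample_text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary is less likely to lose its context.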

### Generate embeddings
We can generate embeddings with either the local model (all-MiniLM-L6-v2) or Azure OpenAI's embedding model.

In [3]:
# To use the local model instead, import SentenceTransformerEmbeddings from langchain.embeddings and use:
# embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")


embeddings = OpenAIEmbeddings(
    openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"), 
    openai_api_base=os.getenv("AZURE_OPENAI_ENDPOINT"), 
    openai_api_type="azure", 
    deployment=os.getenv("AZURE_EMBEDDING_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION")
)

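After generating embeddings, we need a way to store, retrieve, and compare them.\
Vector databases do just that.\
Chroma is a popular open-source vector database. The line below generates an embedding for each chunk and stores it in a Chroma collection. As far as I can tell, the collection lives only in memory unless a `persist_directory` argument is passed, in which case it is saved to local disk.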

In [4]:
db = Chroma.from_documents(all_splits, embeddings)
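
What a vector database does can be sketched in a few lines: keep (text, vector) pairs and return the texts whose vectors are closest to a query vector. The class below is a toy, brute-force stand-in and not Chroma's implementation; `TinyVectorStore`, its methods, and the 3-dimensional vectors are all hypothetical, and Chroma's own indexing and distance metric may differ.

```python
import math


class TinyVectorStore:
    """Toy in-memory store that ranks stored texts by cosine similarity."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._items.append((text, list(vector)))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0

    def search(self, query_vector, k=2):
        """Return the k stored texts most similar to query_vector."""
        scored = [(self._cosine(query_vector, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = TinyVectorStore()
# Invented vectors for illustration; real ones would come from the embedding model.
store.add("duties of data fiduciaries", [0.9, 0.1, 0.0])
store.add("rights of data principals", [0.6, 0.4, 0.1])
store.add("penalties for breaches", [0.0, 0.2, 0.9])

top = store.search([1.0, 0.0, 0.0], k=2)
print(top)
# ['duties of data fiduciaries', 'rights of data principals']
```

A real vector database adds persistence, metadata filtering, and approximate nearest-neighbor indexes so that search stays fast on large collections.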

The retriever fetches the documents most relevant to a query from the vector store.

In [5]:
retriever = db.as_retriever()

Set up the Azure OpenAI LLM

In [6]:
llm = AzureChatOpenAI(
    deployment_name=os.getenv("AZURE_DEPLOYMENT_NAME"),
    model_name="gpt-35-turbo",
    openai_api_base=os.getenv("AZURE_OPENAI_ENDPOINT"), 
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    openai_api_type="azure"
)

RetrievalQA answers the user's query by fetching relevant documents through the retriever and passing them to the LLM.

In [7]:
query = "What are the duties of data fiducaries in the Digital Personal Data Protection Bill?"

qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)
qa_chain({"query": query})

{'query': 'What are the duties of data fiducaries in the Digital Personal Data Protection Bill?',
 'result': 'The Digital Personal Data Protection Bill, 2023 requires data fiduciaries to maintain the accuracy of data, keep data secure, and delete data once its purpose has been met. They must make reasonable efforts to ensure the accuracy and completeness of data, build reasonable security safeguards to prevent a data breach, inform the Data Protection Board of India and affected persons in the event of a breach, and erase personal data as soon as the purpose has been met and retention is not necessary for legal purposes (storage limitation). However, in case of government entities, storage limitation and the right of the data principal to erasure will not apply.'}
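
Conceptually, a retrieval QA chain of this kind places the retrieved chunks into a single prompt together with the question and sends that prompt to the LLM. The sketch below only illustrates the idea: `build_stuff_prompt`, the prompt wording, and the placeholder chunks are assumptions made for this example and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Join retrieved chunks into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for whatever the retriever returns.
retrieved = [
    "Chunk A: placeholder text about data fiduciaries.",
    "Chunk B: placeholder text about the Data Protection Board.",
]

prompt = build_stuff_prompt("What are the duties of data fiduciaries?", retrieved)
print(prompt)
```

Because every retrieved chunk goes into one prompt, the chunk size and the number of retrieved documents together have to fit within the model's context window.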