### Taking a look at embeddings

Load document

In [3]:
import json
with open('data.json', 'r', encoding='utf-8') as file:
    input_data = json.load(file)

The goal here is to see what the embeddings look like for the corresponding input data. The sentence-transformers library provides the all-MiniLM-L6-v2 embedding model, which is downloaded and cached locally. \
all-MiniLM-L6-v2 maps text to a 384-dimensional vector. \
We can also use Azure OpenAI's text-embedding-ada-002 model for this purpose. \
I'm running the model locally because it is comparatively small (about 80 MB), which helps when weighing cost, memory, and performance.


In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

embeddings = model.encode("This is a sample text")
embeddings.shape




(384,)
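
To get a feel for what these vectors are used for, here is a minimal sketch of cosine similarity, a common way to compare two embeddings. The three-dimensional vectors are made-up stand-ins for real 384-dimensional output, and `cosine_similarity` is a hypothetical helper written for illustration, not part of sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for model.encode(...) output.
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.8, 0.2, 0.1]
v_invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(v_cat, v_kitten))   # close to 1: similar direction
print(cosine_similarity(v_cat, v_invoice))  # close to 0: nearly orthogonal
```

Texts with similar meaning should end up with vectors pointing in a similar direction, which is what the retrieval steps later rely on.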

Store embeddings in an output file

In [5]:
for data in input_data:
    embeddings = model.encode(data['content'])
    data['contentVector'] = embeddings.tolist()

with open("output.json", "w") as f:
    json.dump(input_data, f)
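
A quick sanity check on the written file can catch records with missing or wrongly sized vectors. The sketch below assumes the `contentVector` key used in the cell above; `check_vectors` is a hypothetical helper, and the toy records use 3 dimensions instead of 384 so the example stays self-contained.

```python
import json
import os
import tempfile

def check_vectors(path, expected_dim):
    """Return the indexes of records whose 'contentVector' is missing or mis-sized."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    bad = []
    for i, record in enumerate(records):
        vector = record.get("contentVector")
        if not isinstance(vector, list) or len(vector) != expected_dim:
            bad.append(i)
    return bad

# Made-up records: one valid, one with the wrong size, one with no vector.
records = [
    {"content": "first", "contentVector": [0.1, 0.2, 0.3]},
    {"content": "second", "contentVector": [0.4, 0.5]},
    {"content": "third"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    bad_indexes = check_vectors(path, expected_dim=3)

print(bad_indexes)  # [1, 2]
```

For the real file, the call would be `check_vectors("output.json", expected_dim=384)`.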

### Using an LLM to generate a human-friendly response

In [1]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader




Load the document and split it into chunks. Splitting into chunks helps keep semantically related text grouped together.

In [2]:
loader = TextLoader("data.json")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter()
all_splits = text_splitter.split_documents(documents)
len(all_splits)

18
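
To illustrate what chunking does, here is a deliberately simplified splitter that cuts text into fixed-size character windows with some overlap. It is not the algorithm RecursiveCharacterTextSplitter uses (that one splits on a list of separators recursively); `split_text` is a hypothetical helper, and only the `chunk_size` and `chunk_overlap` names are borrowed from the LangChain splitter.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means each chunk repeats the tail of the previous one, so a sentence that straddles a boundary is less likely to lose its context.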

We stored embeddings in our output file above for demo purposes. However, we need a way to store, retrieve, and compare embeddings. Vector databases do just that.

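Chroma is a popular open-source vector database. The two lines of code below generate the embeddings and store them in a Chroma collection. As written, the collection is not persisted; passing persist_directory to Chroma.from_documents keeps it on local disk.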

In [3]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

db = Chroma.from_documents(all_splits, embedding_function)
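
Conceptually, a vector store keeps (vector, text) pairs and returns the texts whose vectors are closest to a query vector. The sketch below is a toy in-memory version of that idea, not how Chroma is implemented; `ToyVectorStore`, the two-dimensional vectors, and the chunk texts are all made up for illustration.

```python
import math

class ToyVectorStore:
    """In-memory list of (vector, text) pairs searched by cosine similarity."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=1):
        """Return the k (score, text) pairs most similar to query_vector."""
        scored = [(self._cosine(query_vector, v), t) for v, t in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

store = ToyVectorStore()
store.add([1.0, 0.0], "chunk about Azure Databricks")
store.add([0.0, 1.0], "chunk about Azure Data Factory")

top = store.search([0.9, 0.1], k=1)
print(top[0][1])  # chunk about Azure Databricks
```

A real vector database adds indexing, metadata, and persistence on top of this, but the store-and-compare loop is the core idea.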

Our goal is to ask questions about the content of the document. A language model is required to generate a human-like response. We'll use Azure OpenAI's gpt-35-turbo model for this.

In [4]:
from dotenv import load_dotenv  
load_dotenv()

from langchain.chains import RetrievalQA
from langchain.chat_models import AzureChatOpenAI
import os

llm = AzureChatOpenAI(deployment_name=os.getenv("AZURE_DEPLOYMENT_NAME"),
            model_name="gpt-35-turbo",
            openai_api_base=os.getenv("AZURE_OPENAI_ENDPOINT"), 
            openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
            openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
            openai_api_type="azure")
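
The constructor above reads four settings from the environment via os.getenv, which returns None for anything that is missing. A small check before building the client can surface that early. This is only a sketch: `missing_vars` is a hypothetical helper, the variable names are the ones used in the cell above, and the example dictionary holds placeholder values.

```python
import os

REQUIRED_VARS = [
    "AZURE_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_API_KEY",
]

def missing_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Explicit dict with placeholder values, so the result does not depend
# on whatever is set in the real environment.
example_env = {
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
    "AZURE_OPENAI_API_KEY": "",
}
missing = missing_vars(REQUIRED_VARS, example_env)
print(missing)
```

In the notebook itself, calling `missing_vars(REQUIRED_VARS)` right after `load_dotenv()` would report which entries the .env file still lacks.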

To return the most relevant answer, the query's embedding is compared with the embeddings stored in the vector database.
The retriever returns the most relevant documents/chunks, which we then pass to the LLM. This grounds the model in that context: it generates its response from the retrieved document/chunk.
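
Passing chunks to the model amounts to placing them in the prompt next to the question. The sketch below shows one way to do that by hand; the wording is an assumption for illustration and is not the template RetrievalQA uses internally. `build_prompt` is a hypothetical helper and the chunk texts are made up.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into one grounded prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use only the context below to answer the question. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What is Azure Data Factory used for?",
    ["Azure Data Factory is a data integration service.", "It orchestrates pipelines."],
)
print(prompt)
```

Because the model only sees the chunks placed under Context, the quality of the retrieval step largely determines the quality of the answer.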

score_threshold drops chunks whose relevance score is below the threshold, so an irrelevant query returns no documents.

In [9]:
# query = "What is the difference between Azure Databricks and Azure Data Factory?"
query = "What is the distance between moon and earth"
# query = "What is the best way to deploy to Azure?"

retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={'score_threshold': 0.3}
)

rel_docs = retriever.get_relevant_documents(query)

print(rel_docs)

if len(rel_docs) > 0:
    qa_chain = RetrievalQA.from_chain_type(llm,retriever=retriever)
    print(qa_chain({"query": query}))


[]
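
The empty list is the threshold at work: no chunk scored at or above 0.3 for the moon question, so nothing was retrieved and the chain was never invoked. Here is a minimal sketch of that filtering step. The relevance scores are made up, `filter_by_score` is a hypothetical helper rather than LangChain's internal code, and `k=4` is just an illustrative cap.

```python
def filter_by_score(scored_chunks, score_threshold, k=4):
    """Keep at most k chunks whose relevance score is >= score_threshold, best first."""
    kept = [(score, text) for score, text in scored_chunks if score >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]

# Made-up relevance scores in [0, 1]; higher means more relevant.
on_topic = [(0.62, "Databricks vs Data Factory"), (0.18, "pricing"), (0.41, "pipelines")]
off_topic = [(0.07, "Databricks vs Data Factory"), (0.02, "pricing"), (0.11, "pipelines")]

print(filter_by_score(on_topic, score_threshold=0.3))
# [(0.62, 'Databricks vs Data Factory'), (0.41, 'pipelines')]
print(filter_by_score(off_topic, score_threshold=0.3))
# []
```

Setting the threshold too low lets weakly related chunks through; setting it too high can reject legitimate questions, so the value is worth tuning per embedding model.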




### Using an Azure OpenAI embedding model

Two differences from the implementation above:
- Instead of a file, a webpage is the document
- Azure OpenAI embedding model is used instead of the one downloaded locally (all-MiniLM-L6-v2)

In [1]:
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
import os
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import AzureChatOpenAI

from dotenv import load_dotenv  
load_dotenv()

loader = WebBaseLoader("https://prsindia.org/billtrack/digital-personal-data-protection-bill-2023")


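# load() returns the raw page; the splitter below does the chunking,
# so a separate load_and_split() pass is not needed
data = loader.load()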

text_splitter = RecursiveCharacterTextSplitter()
all_splits = text_splitter.split_documents(data)
len(all_splits)

embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"), 
            openai_api_base=os.getenv("AZURE_OPENAI_ENDPOINT"), 
            openai_api_type="azure", 
            deployment=os.getenv("AZURE_EMBEDDING_DEPLOYMENT_NAME"),
            openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION")
            )

db = Chroma.from_documents(all_splits, embeddings)
llm = AzureChatOpenAI(deployment_name=os.getenv("AZURE_DEPLOYMENT_NAME"),
            model_name="gpt-35-turbo",
            openai_api_base=os.getenv("AZURE_OPENAI_ENDPOINT"), 
            openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
            openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
            openai_api_type="azure")



In [4]:
query = "What is the distance between moon and earth?"
# query = "What are the duties of data fiduciaries in the Digital Personal Data Protection Bill?"

retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={'k': 1, 'score_threshold': 0.1}
)

rel_docs = retriever.get_relevant_documents(query)

print(rel_docs) 
    
if len(rel_docs) > 0:
    qa_chain = RetrievalQA.from_chain_type(llm,retriever=retriever)
    print(qa_chain({"query": query}))

[Document(page_content='Relevant Links\n\nPRS Products\n\n\nPRS Legislative Brief\n\n\nPRS Bill Summary\n\n\n\n\nOriginal Text\n\n\nBill Text\n\n\nDigital Personal Data Protection Act, 2023\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n×\n\n\n\n\nEducation(Graduation) : \nEducation(Post Graduation) : \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFollow Us\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPRS Legislative Research is licensed under a Creative Commons Attribution 4.0 International License\nDisclaimer: This data is being furnished to you for your information. PRS makes every effort to use reliable and comprehensive information, but PRS does not represent that this information is accurate or complete. PRS is an independent, not-for-profit group. This data has been collated without regard to the objectives or opinions of those who may receive it.\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\nAbout Us\nCareers\n\n\n\n\n\n\n\nCopyright © 2023 \xa0\xa0 prsindia.org \xa0\xa0 All Rights Reserved.', metada