<a href="https://colab.research.google.com/github/raviakasapu/LLM-Training-Docs/blob/main/LLM_%26_RAG_Using_vector_stores.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation (RAG)

## and Evaluation

In [None]:
#!pip install --upgrade langchain
#!pip install python-dotenv
#!pip install openai

In [None]:
#!pip install langchain[docarray]
#!pip install tiktoken

In [None]:
#!pip install pdf2image
#!pip install unstructured
#!pip install pdfminer
#!pip install pdfminer.six

#### Import OpenAI and set your API key

In [259]:
import os
import openai

#paid key
#os.environ["OPENAI_API_KEY"]=<your api key>

In [182]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
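If the key is missing, `os.environ['OPENAI_API_KEY']` fails with a bare `KeyError`. A minimal sketch of a friendlier check follows; the helper name `require_env` is hypothetical and is not part of `openai` or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it first."
        )
    return value


# Usage (uncomment once the key is configured):
# openai.api_key = require_env("OPENAI_API_KEY")
```

This keeps the failure close to its cause instead of surfacing later as an authentication error from the API.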

### We will be using the `gpt-3.5-turbo-0301` model

In [183]:
llm_model = "gpt-3.5-turbo-0301"

In [184]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [185]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.document_loaders import UnstructuredPDFLoader
import tiktoken
import pdf2image
import pdfminer
import pdfminer.high_level

In [186]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

#### We will be using information about Microsoft Fabric, which was released in 2023. The OpenAI chat model has no information about Microsoft Fabric, so we depend on the document to answer the questions.

In [187]:
pdf_url = "https://learn.microsoft.com/pdf?url=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Ffabric%2Fget-started%2Ftoc.json"
!wget --no-cache --backups=1 {pdf_url} --output-document=microsoft_fabric.pdf

--2023-09-12 01:35:03--  https://learn.microsoft.com/pdf?url=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Ffabric%2Fget-started%2Ftoc.json
Resolving learn.microsoft.com (learn.microsoft.com)... 104.85.2.139, 2a02:26f0:b200:18c::3544, 2a02:26f0:b200:18d::3544
Connecting to learn.microsoft.com (learn.microsoft.com)|104.85.2.139|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8488195 (8.1M) [application/pdf]
Saving to: ‘microsoft_fabric.pdf’


2023-09-12 01:35:06 (113 MB/s) - ‘microsoft_fabric.pdf’ saved [8488195/8488195]



### Method 1

#### using `UnstructuredPDFLoader`

In [188]:
loader = UnstructuredPDFLoader("microsoft_fabric.pdf", mode="elements")

In [189]:
data = loader.load()

In [190]:
data[100]

Document(page_content='repositories into actionable workloads and analytics, and is an implementation of data', metadata={'source': 'microsoft_fabric.pdf', 'coordinates': {'points': ((64.0000009173332, 661.7616724113459), (64.0000009173332, 673.7616726793458), (518.5761906695336, 673.7616726793458), (518.5761906695336, 661.7616724113459)), 'system': 'PixelSpace', 'layout_width': 594.95996, 'layout_height': 841.91998}, 'filename': 'microsoft_fabric.pdf', 'last_modified': '2023-09-07T18:46:04', 'filetype': 'application/pdf', 'page_number': 5, 'category': 'NarrativeText'})
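In `mode="elements"` each `Document` is a small layout fragment (here a single line of a paragraph) tagged with a `category` in its metadata. Such short fragments may carry too little context for retrieval. A minimal sketch of filtering elements by category is shown below; `Doc` is a stand-in for LangChain's `Document` so the example runs on its own, and the first sample element is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain's Document so this sketch runs without langchain."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def keep_category(docs, category="NarrativeText"):
    """Keep only the elements whose metadata 'category' matches."""
    return [d for d in docs if d.metadata.get("category") == category]


sample = [
    # Hypothetical element, for illustration only
    Doc("Microsoft Fabric get started", {"category": "Title", "page_number": 1}),
    # Element taken from the output above
    Doc(
        "repositories into actionable workloads and analytics, "
        "and is an implementation of data",
        {"category": "NarrativeText", "page_number": 5},
    ),
]

narrative = keep_category(sample)
print(len(narrative), narrative[0].metadata["page_number"])
```

With the real loader output, the same function could be applied as `keep_category(data)`.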

In [191]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings
).from_loaders([loader])
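Conceptually, the index embeds every document and later returns the documents whose vectors are closest to the query vector. The sketch below illustrates that idea with cosine similarity over toy three-dimensional vectors. It is not `DocArrayInMemorySearch`'s actual implementation, and the texts and vectors are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """store is a list of (text, vector); return the k most similar texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "embeddings"; real ones come from the embedding model
store = [
    ("OneLake is a data lake", [0.9, 0.1, 0.0]),
    ("Power BI reports", [0.1, 0.9, 0.0]),
    ("Workspace roles", [0.0, 0.2, 0.9]),
]

print(top_k([1.0, 0.0, 0.0], store, k=1))
```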

In [192]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
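The `"stuff"` chain type places all retrieved chunks into a single prompt, joined by `document_separator`. The sketch below shows the idea only: the function names are hypothetical and the prompt wording is illustrative, not LangChain's actual template.

```python
def stuff_context(chunks, separator="<<<<>>>>>"):
    """Join the retrieved chunks into one context string."""
    return separator.join(chunks)


def build_prompt(question, chunks, separator="<<<<>>>>>"):
    """Build a single prompt holding every chunk (illustrative wording)."""
    context = stuff_context(chunks, separator)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt("Is it a data lake?", ["chunk one", "chunk two"]))
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window.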

In [193]:
query ="what are the components of microsoft fabric"

In [194]:
response2 = index.query(query)
display(Markdown(response2))

 Microsoft Fabric is a distributed systems platform that provides a set of tools and services for developing, deploying, and managing applications and services. The components of Microsoft Fabric include Service Fabric, Event Hubs, Azure Cosmos DB, Azure SQL Database, Azure Storage, and Azure Service Bus.

In [195]:
qa.run(query)



> Entering new RetrievalQA chain...

> Finished chain.


'Microsoft Fabric is a distributed systems platform that provides a set of tools and services for building scalable and reliable applications. Some of the components of Microsoft Fabric include stateful and stateless services, reliable collections, actors, and partitions. It also includes tools for managing and monitoring applications running on the platform.'

In [196]:
query2 ="what kind of connectors can Microsoft Fabric can support"

In [197]:
response2 = index.query(query2)
display(Markdown(response2))

 I don't know.

In [198]:
query3 = "Is it a data lake?"

In [199]:
response3 = index.query(query3)
display(Markdown(response3))

 No, it is a data warehouse.

### Method 2

#### using package `pypdf`

In [200]:
#!pip install pypdf
#!pip install chromadb


In [201]:
from langchain.document_loaders import PyPDFLoader # for loading the pdf
from langchain.embeddings import OpenAIEmbeddings # for creating embeddings
from langchain.vectorstores import Chroma # for the vectorization part
from langchain.chains import ChatVectorDBChain # for chatting with the pdf
from langchain.llms import OpenAI # the LLM we'll use (ChatGPT)

In [202]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
import chromadb
import chromadb.config
import warnings
warnings.filterwarnings('ignore')

In [203]:
pdf_path = "microsoft_fabric.pdf"
loader2 = PyPDFLoader(pdf_path)
pages = loader2.load_and_split()
print(pages[0].page_content)

Tell us about y our PDF experience.
Microsoft Fabric get star ted
documentation
Microsoft F abric is a unified platform that can meet your organization's data and
analytics needs. Discover the F abric shared and platform documentation from this page.
About Micr osoft Fabric
ｅOVERVIE W
What is F abric?
Fabric terminology
What's New
ｂGET STARTED
Start a F abric trial
Fabric home navigation
End-to-end tutorials
Context sensitive Help pane
Get star ted with F abric it ems
ｐCONCEPT
Find items in OneLake data hub
Promote and certify items
ｃHOW-T O GUIDE
Apply sensitivity labels
Worksp aces
ｐCONCEPT
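`load_and_split()` breaks the page text into chunks before they are embedded. The sketch below shows a much simpler fixed-window splitter with overlap, to illustrate why neighbouring chunks share some text. It is not LangChain's splitter, and the `chunk_size` and `overlap` defaults are illustrative assumptions.

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window
    return [
        text[start:start + chunk_size]
        for start in range(0, max(len(text) - overlap, 1), step)
    ]


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=1))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, which can help retrieval.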


In [204]:
embeddings = OpenAIEmbeddings()
vectordb2 = Chroma.from_documents(pages, embedding=embeddings,
                                 persist_directory=".")
vectordb2.persist()

In [205]:
pdf_qa = ChatVectorDBChain.from_llm(
    OpenAI(temperature=0,
           model_name=llm_model
           ),
    vectordb2,
    return_source_documents=True
)

result = pdf_qa({"question": query, "chat_history": ""})
print(result["answer"])

The components of Microsoft Fabric include Data Engineering, Data Factory, Data Science, Data Warehouse, Real-Time Analytics, and Power BI, all integrated into a shared SaaS foundation.


In [253]:
result = pdf_qa({"question": query2, "chat_history": ""})
print("Answer:")
print(result['answer'])

Answer:
The given context does not provide information on what kind of connectors Microsoft Fabric can support.


In [207]:
query3 = "Is it a data lake?"
result = pdf_qa({"question": query3, "chat_history": ""})
print("Answer:")
print(result["answer"])

Answer:
Yes, OneLake is a data lake.
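The calls above pass an empty `chat_history`, so each question is answered in isolation. A follow-up such as "Is it a data lake?" relies on earlier turns to resolve "it". A minimal sketch of carrying history forward as a list of `(question, answer)` tuples follows; `ask` and `fake_chain` are hypothetical names, and `fake_chain` stands in for `pdf_qa` so the example runs without an API key.

```python
def ask(chain, question, history):
    """Call a conversational chain and record the turn in `history`."""
    result = chain({"question": question, "chat_history": history})
    history.append((question, result["answer"]))
    return result["answer"]


def fake_chain(inputs):
    """Stand-in for pdf_qa: reports how many earlier turns it received."""
    return {"answer": f"{len(inputs['chat_history'])} earlier turn(s) seen"}


history = []
ask(fake_chain, "what are the components of microsoft fabric", history)
print(ask(fake_chain, "Is it a data lake?", history))
```

With the real chain, the same pattern would be `ask(pdf_qa, query3, history)`.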


# Evaluation

In [208]:
from langchain.evaluation.qa import QAGenerateChain
from langchain.evaluation.qa import QAEvalChain

In [209]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [210]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in pages[:5]]
)

In [211]:
new_examples

[{'qa_pairs': {'query': 'What is Microsoft Fabric?',
   'answer': "Microsoft Fabric is a unified platform that can meet an organization's data and analytics needs."}},
 {'qa_pairs': {'query': 'What are the four sections listed in the document?',
   'answer': 'The four sections listed in the document are Fabric workspace, Workspace roles, Get started, and Workspace access control.'}},
 {'qa_pairs': {'query': 'What is Microsoft Fabric and what services does it offer?',
   'answer': 'Microsoft Fabric is an all-in-one analytics solution that offers a comprehensive suite of services, including data lake, data engineering, and data integration. It covers everything from data movement to data science, real-time analytics, and business intelligence. Microsoft Fabric brings together new and existing components from Power BI, Azure Synapse, and Azure Data Factory into a single integrated environment. These components are then presented in various customized user experiences.'}},
 {'qa_pairs': {'

## LangChain debug

* The debug flag can be used to generate detailed output from the LLM, which can then be used to analyse the context, prompt and response. It is switched off (`False`) in the cell below; set it to `True` to see the detailed trace.

* We can further improve the answers by studying this output and providing more context to the LLM.

In [252]:
import langchain
langchain.debug = False

In [254]:
predictions = []
examples = []
for item in new_examples:
    qa_pair = item['qa_pairs']
    print(qa_pair['query'])
    # Reference question/answer pair used as ground truth by the evaluator
    examples.append({'query': qa_pair['query'], 'answer': qa_pair['answer']})
    # Ask the retrieval chain the same question, with no prior chat history
    pred = pdf_qa({"question": qa_pair['query'], "chat_history": ""})
    predictions.append({'question': pred['question'], 'result': pred['answer']})


What is Microsoft Fabric?
What are the four sections listed in the document?
What is Microsoft Fabric and what services does it offer?
What is the benefit of using Microsoft Fabric SaaS experience?
What is the purpose of Real-Time Analytics in Fabric?


In [249]:
examples[4]

{'query': 'What is the purpose of Real-Time Analytics in Fabric?',
 'answer': 'Real-Time Analytics in Fabric is the best-in-class engine for observational data analytics, which is often semi-structured in formats like JSON or Text and comes in at high volume with shifting schemas. This data is hard for traditional data warehousing platforms to work with, but Real-Time Analytics is designed specifically to handle this type of data.'}

In [250]:
predictions[4]

{'question': 'What is the purpose of Real-Time Analytics in Fabric?',
 'result': 'The purpose of Real-Time Analytics in Fabric is to provide an end-to-end analytics solution for enterprises that covers everything from data movement to data science, business intelligence, and real-time analytics. It is one of the experiences tailored to a specific persona and task in the comprehensive set of analytics experiences offered by Microsoft Fabric.'}

In [234]:
eval_chain = QAEvalChain.from_llm(llm)

In [255]:
graded_outputs = eval_chain.evaluate(examples, predictions)

## Results

#### The results are available in `graded_outputs` and can be further used to debug the responses

In [257]:
graded_outputs

[{'results': 'CORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'INCORRECT'}]
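The grades can be condensed into an accuracy figure and a list of failing indices to inspect. A minimal sketch follows, assuming the `'results'` key shown in the output above (the key name may differ in other LangChain versions); `summarize` is a hypothetical helper.

```python
# Grades copied from the output above
graded_outputs = [
    {'results': 'CORRECT'},
    {'results': 'INCORRECT'},
    {'results': 'CORRECT'},
    {'results': 'CORRECT'},
    {'results': 'INCORRECT'},
]


def summarize(grades, key="results"):
    """Return (accuracy, indices of answers not graded CORRECT)."""
    verdicts = [g[key].strip().upper() for g in grades]
    failed = [i for i, v in enumerate(verdicts) if v != "CORRECT"]
    accuracy = (len(verdicts) - len(failed)) / len(verdicts) if verdicts else 0.0
    return accuracy, failed


accuracy, failed = summarize(graded_outputs)
print(f"accuracy={accuracy:.0%}, failed indices={failed}")
# prints: accuracy=60%, failed indices=[1, 4]
```

The failing indices can then be used to compare `examples[i]` with `predictions[i]`, as done for index 4 earlier.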