# Lecture 4: Evaluating LLM Applications

- When building a complex application with an **LLM**, an important step
  is to **evaluate** the application,

- for example by checking whether it meets some **accuracy** criteria.

- If you change the implementation, say by using a different **LLM**
  or a different vector database, or by changing some **parameter** of the system,
  you need a way to know whether the application is performing **better** or **worse**.

In [None]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

## Example Application: A Q&A Application

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

In [None]:
file = '../OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [None]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

# Evaluation Strategies

## 1. Generating example question-answer pairs from documents

### 1.1 Coming up with test data points manually

- Choose some data points that we think are important

In [None]:
data[10]  # this data point describes a pullover set with side pockets

In [None]:
data[11]  # this data point describes a hooded jacket from the DownTek collection

- Create examples from the data points above

In [None]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

- Manually compare the output of the application with our examples

- Problem with this method
  1. Hand-written examples won't scale as the data grows.
  2. So we use an LLM to generate examples for us.

### 1.2 Generating examples with the help of an LLM

- The `QAGenerateChain` in `langchain` helps automate the process of example generation

- It takes in documents and creates a **question-answer pair** from each document

In [None]:
from langchain.evaluation.qa import QAGenerateChain

In [None]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [None]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:2]]
)

In [None]:
new_examples

In [None]:
new_examples[0]  # LLM-generated example

In [None]:
data[0]  # check manually how well it matches the original data

- Combine the generated examples with the ones we created manually

In [None]:
examples += new_examples
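
When examples come from different sources, it can help to check that they all share one shape before evaluating them. The sketch below is a hypothetical helper (`normalize_examples` is not part of LangChain). It assumes every example is a dict with `query` and `answer` keys, and that some LangChain versions nest the generated pair under a `qa_pairs` key; both assumptions should be checked against the installed version. It also collapses the extra whitespace that a backslash line continuation inside a string literal leaves in the query.

```python
def normalize_examples(raw_examples):
    """Return a list of {"query", "answer"} dicts with whitespace collapsed.

    Hypothetical helper, not a LangChain API. Accepts either flat dicts or
    dicts that nest the pair under a "qa_pairs" key (an assumed shape).
    """
    normalized = []
    for item in raw_examples:
        pair = item.get("qa_pairs", item)  # unwrap the nested shape if present
        missing = {"query", "answer"} - set(pair)
        if missing:
            raise ValueError(f"example is missing keys: {sorted(missing)}")
        normalized.append({
            # " ".join(s.split()) collapses runs of whitespace into one space
            "query": " ".join(str(pair["query"]).split()),
            "answer": " ".join(str(pair["answer"]).split()),
        })
    return normalized


# Illustrative inputs only; the second entry is made up for this sketch.
manual = [{"query": "Do the Cozy Comfort Pullover Set        have side pockets?",
           "answer": "Yes"}]
generated = [{"qa_pairs": {"query": "What is the product made of?",
                           "answer": "Placeholder answer"}}]

combined = normalize_examples(manual + generated)
print(combined)
```

In the notebook, the same call could be applied to `examples` after combining, so that every entry passed to the chain has the same keys.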

## 2. Comparing the generated examples with the application's responses
- What is happening inside the prompt?
- What is the actual prompt?
- Which documents are being retrieved?

### 2.1 Manually comparing the example answers with the application's responses

In [None]:
qa.run(examples[0]["query"])

- But this only shows us the final answer

- We cannot see what is happening in the intermediate steps

- `langchain` provides a handy way to see what is happening inside

- Using **debug** mode, we can inspect all intermediate results

In [None]:
import langchain

# Turn on the debug mode
langchain.debug = True

- Now when we run the chain, it prints every detail of what is happening inside

In [None]:
qa.run(examples[0]["query"])

- But this process will not scale with the data

In [None]:
# Turn off the debug mode
langchain.debug = False
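
Because debug mode is a flag that is switched on and then off again, one option is to scope it to a `with` block so it cannot be left on by accident. This is a minimal sketch, not a LangChain feature: `temporarily_set` is a hypothetical helper, and a `SimpleNamespace` stands in for the `langchain` module so the example runs without the library installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_set(obj, name, value):
    """Set obj.<name> to value inside the with-block, then restore it."""
    previous = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        setattr(obj, name, previous)  # restored even if the block raises


# Stand-in for the `langchain` module (an assumption made for this sketch).
fake_langchain = SimpleNamespace(debug=False)

with temporarily_set(fake_langchain, "debug", True):
    inside = fake_langchain.debug   # debug is on here; a chain run would go here
after = fake_langchain.debug        # debug is back to its previous value

print(inside, after)
```

In the notebook this would read `with temporarily_set(langchain, "debug", True): qa.run(examples[0]["query"])`, assuming `langchain.debug` is a plain module attribute as in the cells above.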

### 2.2 LLM assisted evaluation

- Let's predict the answers for the example queries using our Q&A application

In [None]:
predictions = qa.apply(examples)

In [None]:
predictions

# [{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
# 'answer': 'Yes',
#  'result': 'The Cozy Comfort Pullover Set, Stripe has side pockets.'},
# {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
#  'answer': 'The DownTek collection',
#  'result': 'The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'}]

- Let's compare the predicted answers with the ones we provided

In [None]:
from langchain.evaluation.qa import QAEvalChain

In [None]:
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [None]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [None]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

# Example 0:
# Question: Do the Cozy Comfort Pullover Set        have side pockets?
# Real Answer: Yes
# Predicted Answer: The Cozy Comfort Pullover Set, Stripe has side pockets.
# Predicted Grade: CORRECT
#
# Example 1:
# Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
# Real Answer: The DownTek collection
# Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
# Predicted Grade: CORRECT
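
To tell whether one version of the application performs better or worse than another, the individual grades can be reduced to a single number. The sketch below is one way to do that, under the assumption that each graded output stores its label under a `text` key and that the label is exactly `CORRECT` or `INCORRECT`, as in the printed output. `summarize_grades` is a hypothetical helper, not a LangChain function.

```python
from collections import Counter


def summarize_grades(graded_outputs, key="text"):
    """Count grade labels and compute the fraction graded CORRECT."""
    # Exact comparison on purpose: "INCORRECT" contains "CORRECT" as a substring.
    labels = [g[key].strip().upper() for g in graded_outputs]
    counts = Counter(labels)
    accuracy = counts["CORRECT"] / len(labels) if labels else 0.0
    return counts, accuracy


# Grades copied from the two examples printed above.
graded_outputs = [{"text": "CORRECT"}, {"text": "CORRECT"}]
counts, accuracy = summarize_grades(graded_outputs)
print(dict(counts), accuracy)
```

Running the same example set through two variants of the chain and comparing the two accuracy values gives a rough signal of which variant does better, keeping in mind that the grades themselves come from an LLM.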

- **Note that the evaluation process must grade the predicted answer against
  the real answer based on their semantic meaning**

- This is because the strings representing the answers are not similar, even when both answers are correct
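
The point can be checked with a naive string comparison on the two examples from the output shown earlier. This is only an illustrative sketch; `exact_match` is a hypothetical baseline grader, not something the notebook or LangChain provides.

```python
def exact_match(reference, prediction):
    """Naive grader: case-insensitive string equality after trimming."""
    return reference.strip().lower() == prediction.strip().lower()


# Reference answers and predictions taken from the output shown earlier.
pairs = [
    ("Yes", "The Cozy Comfort Pullover Set, Stripe has side pockets."),
    ("The DownTek collection",
     "The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection."),
]

matches = [exact_match(ref, pred) for ref, pred in pairs]
print(matches)  # both comparisons fail, although the LLM grader marked them CORRECT
```

A string-based grader would mark both answers as wrong, which is why the grading step is delegated to an LLM that compares meaning rather than characters.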