# LangChain: Evaluation

When building applications with LLMs, one of the more complex steps is evaluating how well the application is performing,
i.e. whether it meets some accuracy criteria.

LangChain applications are sequences of chains and prompts.
The first step is to understand the flow of inputs and the expected outputs.

- Use visualisers / debug points.
- Pass the data to an LLM and ask it to evaluate the results.


## Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation
* LangChain evaluation platform

In [1]:
from dotenv import load_dotenv , find_dotenv

_ = load_dotenv(find_dotenv())

## Create our QandA application

**Note:** ada-002 can process 10M tokens for $1, but rate limits are the bottleneck.

In [3]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch


In [6]:
file = 'qa_data.csv'
loader = CSVLoader(file_path=file)
data = loader.load()[:200] # sliced to test with less data

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])  # the document set can be large even for the embedding API,
# which can lead to a RateLimitError on tokens
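One way to stay under a requests-per-minute limit is to process the documents in small batches with a pause between them. The sketch below is only an illustration under stated assumptions: `batched`, `process_throttled`, and `handle_batch` are hypothetical names, not LangChain APIs, and `handle_batch` stands in for whatever call embeds or indexes one batch.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def process_throttled(
    items: Sequence[T],
    handle_batch: Callable[[List[T]], R],  # hypothetical: e.g. a function that embeds one batch
    batch_size: int = 50,
    pause_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Call `handle_batch` on each batch, pausing between batches."""
    results = []
    for n, batch in enumerate(batched(items, batch_size)):
        if n > 0:
            sleep(pause_seconds)  # wait before every batch except the first
        results.append(handle_batch(batch))
    return results


# Demo with a stand-in handler (len) and a no-op sleep so it runs instantly.
sizes = process_throttled(list(range(200)), len, batch_size=50, sleep=lambda s: None)
print(sizes)
```

The batch size and pause length are placeholders; suitable values depend on the limits of the account in use.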

In [9]:
import langchain

langchain.debug = True

In [10]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)  # building the chain does not call the model yet
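With `chain_type="stuff"`, the retrieved documents are placed into a single context string, and `document_separator` is the text put between them. The following is a simplified sketch of that joining step, not LangChain's actual implementation; `stuff_context` is a hypothetical helper and the two document strings are placeholders.

```python
from typing import List


def stuff_context(page_contents: List[str], separator: str = "\n\n") -> str:
    """Join the text of all retrieved documents into one context string."""
    return separator.join(page_contents)


# Placeholder documents, standing in for whatever the retriever returns.
context = stuff_context(
    ["first retrieved document", "second retrieved document"],
    separator="<<<<>>>>>",
)
print(context)
```

A distinctive separator such as `<<<<>>>>>` makes it easy to spot in the debug output where one retrieved document ends and the next begins.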

### Coming up with test datapoints

Find some data points/examples you want to evaluate on.
- Look at some of the data and come up with example questions.
- Use the example questions together with their ground-truth answers.
- However, this does not scale well.

In [11]:
data[10]

Document(page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.", metadata={'source': 'qa_data.csv', 'row': 10})

### Hard-coded examples

In [22]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]
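Note that a backslash line continuation inside a string literal keeps the indentation of the next line, so these queries contain a run of extra spaces. A minimal sketch of two ways to avoid that, using only the standard library (the variable names are illustrative):

```python
# Backslash continuation: the leading spaces of the second line end up in the string.
query = "Do the Cozy Comfort Pullover Set\
        have side pockets?"

# Option 1: collapse any run of whitespace into a single space after the fact.
normalized = " ".join(query.split())

# Option 2: use implicit concatenation of adjacent string literals instead.
clean = (
    "Do the Cozy Comfort Pullover Set "
    "have side pockets?"
)

print(normalized)
print(normalized == clean)
```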

### LLM-Generated examples

LangChain provides chains for generating test data with an LLM itself.
`langchain.evaluation.qa.QAGenerateChain` takes documents as input
and generates a question-answer pair for each document.

In [12]:
from langchain.evaluation.qa import QAGenerateChain

example_gen_chain = QAGenerateChain.from_llm(llm = llm)

In [13]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:3]]  # use only 3 documents to avoid a RateLimitError
)  # multiple LLM calls are made here, so a RateLimitError is possible



[32;1m[1;3m[chain/start][0m [1m[1:chain:QAGenerateChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[llm/start][0m [1m[1:chain:QAGenerateChain > 2:llm:ChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human: You are a teacher coming up with questions to ask on a quiz. \nGiven the following document, please generate a question and answer based on that document.\n\nExample Format:\n<Begin Document>\n...\n<End Document>\nQUESTION: question here\nANSWER: answer here\n\nThese questions should be detailed and be based explicitly on information in the document. Begin!\n\n<Begin Document>\npage_content=\": 0\\nname: Women's Campside Oxfords\\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \\n\\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \\n\\nSpecs: Approx. weight: 1 lb.1 o

In [14]:
new_examples[0]

{'qa_pairs': {'query': "What is the weight of each pair of Women's Campside Oxfords?",
  'answer': "The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz."}}
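The generation prompt asks the model to reply in a `QUESTION: ... ANSWER: ...` format, which then has to be turned into a `query`/`answer` pair. The sketch below shows one way such text could be parsed with a regular expression. It is not LangChain's own parser: `parse_qa` is a hypothetical helper, and the `raw` string is an assumed model reply reconstructed from the parsed result shown above.

```python
import re
from typing import Dict

# Capture everything between "QUESTION:" and "ANSWER:", then everything after "ANSWER:".
QA_PATTERN = re.compile(
    r"QUESTION:\s*(?P<query>.*?)\s*ANSWER:\s*(?P<answer>.*)", re.DOTALL
)


def parse_qa(text: str) -> Dict[str, str]:
    """Extract a query/answer pair from text in the QUESTION/ANSWER format."""
    match = QA_PATTERN.search(text)
    if match is None:
        raise ValueError("expected 'QUESTION: ... ANSWER: ...' in the model output")
    return {
        "query": match.group("query").strip(),
        "answer": match.group("answer").strip(),
    }


# Assumed raw model reply, for illustration only.
raw = (
    "QUESTION: What is the weight of each pair of Women's Campside Oxfords?\n"
    "ANSWER: The weight of each pair of Women's Campside Oxfords is "
    "approximately 1 lb. 1 oz."
)
pair = parse_qa(raw)
print(pair["query"])
print(pair["answer"])
```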

### Combine examples


In [24]:
examples += [pair['qa_pairs'] for pair in new_examples]

import pprint
pprint.pprint(examples)

print(len(examples))

[{'answer': 'Yes',
  'query': 'Do the Cozy Comfort Pullover Set        have side pockets?'},
 {'answer': 'The DownTek collection',
  'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded '
           'Jacket from?'},
 {'answer': "The weight of each pair of Women's Campside Oxfords is "
            'approximately 1 lb. 1 oz.',
  'query': "What is the weight of each pair of Women's Campside Oxfords?"},
 {'answer': 'The dimensions of the small size of the Recycled Waterhog Dog '
            'Mat, Chevron Weave are 18" x 28".',
  'query': 'What are the dimensions of the small size of the Recycled Waterhog '
           'Dog Mat, Chevron Weave?'},
 {'answer': "The Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece "
            'has bright colors, ruffles, and exclusive whimsical prints. It is '
            'made of four-way-stretch and chlorine-resistant fabric that keeps '
            'its shape and resists snags. The fabric is also UPF 50+ rated, '
        

## LLM-assisted evaluation


We can run the examples and evaluate the responses manually, but this does not scale either.

Here, LLM-assisted evaluation/verification can be used instead.

`langchain.evaluation.qa.QAEvalChain` is used for this.

In [25]:
predictions = qa.apply(examples)  # calls the LLM multiple times, so a RateLimitError is possible

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "answer": "Yes"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% ray

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-Ipkgjb8eV212suWndEgA3XmD on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "What are the dimensions of the small size of the Recycled Waterhog Dog Mat, Chevron Weave?",
  "context": ": 1\nname: Recycled Waterhog Dog Mat, Chevron Weave\ndescription: Protect your floors from spills and splashing with our ultradurable recycled Waterhog dog mat made right here in the USA. \n\nSpecs\nSmall - Dimensions: 18\" x 28\". \nMedium - Dimensions: 22.5\" x 34.5\".\n\nWhy We Love It\nMother nature, wet shoes and muddy paws have met their match with our Recycled Waterhog mats. Ruggedly constructed from recycled plastic materials, these ultratough mats help keep dirt and water off your floors and plastic out of landfills, trails and oceans. Now, that's a win-win for everyone.\n\nFabric & Care\


[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "What are some features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece?",
  "context": ": 2\nname: Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece\ndescription: She'll love the bright colors, ruffles and exclusive whimsical prints of this toddler's two-piece swimsuit! Our four-way-stretch and chlorine-resistant fabric keeps its shape and resists snags. The UPF 50+ rated fabric provides the highest rated sun protection possible, blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. Machine wash and line dry for best results. Imported.<<<<>>>>>: 769\nname: Girls' Aquatic Adventure Swimsuit, One-

In [26]:
from langchain.evaluation.qa import QAEvalChain

eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)
# again, multiple API calls are made here, so a RateLimitError on requests/min is possible

[32;1m[1;3m[chain/start][0m [1m[1:chain:QAEvalChain] Entering Chain run with input:
[0m{
  "input_list": [
    {
      "query": "Do the Cozy Comfort Pullover Set        have side pockets?",
      "answer": "Yes",
      "result": "Yes, the Cozy Comfort Pullover Set does have side pockets."
    },
    {
      "query": "What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?",
      "answer": "The DownTek collection",
      "result": "The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection."
    },
    {
      "query": "What is the weight of each pair of Women's Campside Oxfords?",
      "answer": "The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.",
      "result": "The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz."
    },
    {
      "query": "What are the dimensions of the small size of the Recycled Waterhog Dog Mat, Chevron Weave?",
      "answer": "The dimensions of the sma

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-gpt-3.5-turbo in organization org-Ipkgjb8eV212suWndEgA3XmD on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..

[36;1m[1;3m[llm/end][0m [1m[1:chain:QAEvalChain > 2:llm:ChatOpenAI] [46.91s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "CORRECT",
        "generation_info": {
          "finish_reason": "stop"
        },
        "message": {
          "lc": 1,
          "type": "constructor",
          "id": [
            "langchain",
            "schema",
            "messages",
            "AIMessage"
          ],
          "kwargs": {
            "content": "CORRECT",
            "additional_kwargs": {}
          }
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 181,
      "completion_tokens": 2,
      "total_tokens": 183
    },
    "model_name": "gpt-3.5-turbo"
  },
  "run": null
}
[36;1m[1;3m[llm/end][0m [1m[1:chain:QAEvalChain > 3:llm:ChatOpenAI] [46.91s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "CORRECT",
        "generation_info": {
          "finish_reason": "

In [28]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the weight of each pair of Women's Campside Oxfords?
Real Answer: The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.
Predicted Answer: The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small size of the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The dimensions of the small size of the Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28".
Predicted Answer: 
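Once every example has a grade, the grades can be tallied into a single score. The sketch below assumes each graded output is a dict whose `results` key holds `CORRECT` or `INCORRECT`, as in the loop above; `summarize_grades` is a hypothetical helper and the `sample` list is made-up data, not the output of this run.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_grades(
    graded: List[Dict[str, str]], key: str = "results"
) -> Tuple[Counter, float]:
    """Count each grade label and return the fraction graded CORRECT."""
    counts = Counter(g[key].strip().upper() for g in graded)
    total = sum(counts.values())
    accuracy = counts["CORRECT"] / total if total else 0.0
    return counts, accuracy


# Made-up grades for illustration only.
sample = [{"results": "CORRECT"}, {"results": "CORRECT"}, {"results": "INCORRECT"}]
counts, accuracy = summarize_grades(sample)
print(dict(counts), f"{accuracy:.0%}")
```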

The LLM's output and the reference answer may not match word for word, so grading them with an LLM (a semantic comparison) is easier and more reliable than standard string-matching algorithms.
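The weakness of plain string comparison can be seen on the first example, where the reference answer is `Yes` and the predicted answer is a full sentence. The sketch below uses only the standard library: exact equality and `difflib.SequenceMatcher` both treat the two strings as very different, even though the LLM grader marked the prediction as CORRECT.

```python
from difflib import SequenceMatcher

reference = "Yes"
prediction = "Yes, the Cozy Comfort Pullover Set does have side pockets."

# Exact string equality fails although the answers agree in meaning.
exact = reference == prediction

# Character-level similarity in [0, 1]; low here because the lengths differ so much.
similarity = SequenceMatcher(None, reference, prediction).ratio()

print(exact, round(similarity, 2))
```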