# Evaluation Techniques

In this notebook, we explore techniques for evaluating how well our LLMs generate API docs. We will implement suitable metrics for scoring/ranking the generated outputs.

In [1]:
import os
import json
import re
import pandas as pd
from dotenv import load_dotenv
from genai import Credentials, Client
from genai.text.generation import TextGenerationParameters
from genai.text.tokenization import (
    TextTokenizationParameters,
    TextTokenizationReturnOptions,
    TextTokenizationCreateResults,
)
import sys
# Make the project's app directory importable so that its helpers can be used
sys.path.append('../../app')
from utils import eval_using_model
from langchain.evaluation import (
    Criteria,
    load_evaluator,
    EvaluatorType
)
from langchain_community.chat_models import ChatOpenAI
from openai import OpenAI

## Setup BAM API

In [2]:
# Make sure a .env file in the root folder defines GENAI_KEY, GENAI_API and OPENAI_API_KEY
load_dotenv()
api_key = os.getenv("GENAI_KEY", None)
api_endpoint = os.getenv("GENAI_API", None)
openai_key = os.getenv("OPENAI_API_KEY", None)
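
Because `os.getenv` returns `None` for an unset variable, a missing key would only surface later as a failed API call. A minimal sketch of an early check is shown below; `missing_env_vars` is a hypothetical helper introduced here for illustration, not part of the notebook or of `python-dotenv`.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = ("GENAI_KEY", "GENAI_API", "OPENAI_API_KEY")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment
print(missing_env_vars(env={"GENAI_KEY": "abc", "GENAI_API": ""}))
```

Calling `missing_env_vars()` with no arguments right after `load_dotenv()` would check the real environment.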

## Data Collection

In [3]:
dataset_path = "../../data/raw/chunked_data.json"
with open(dataset_path, 'r', encoding="utf-8") as f:
    data = json.load(f)

In [4]:
# Let's see all the Python code files we have
data.keys()

dict_keys(['errors', 'oidc', 'sign', 'transparency', 'verify_models', 'verify_policy', 'verify_verifier'])

In [5]:
# Select a file for which we would like to generate the API doc
file = "errors"

In [6]:
# Extract the code and the actual doc for the selected file
code = data[file]["code_chunks"]
actual_doc = data[file]["markdown"]

In [7]:
print(code)

{'imports': ['import sys'], 'functions': [], 'classes': ['class Error(Exception):\n    \n\n    def diagnostics(self) -> str:\n        \n\n        return An issue occurred.\n\n    def print_and_exit(self, raise_error: bool = False) -> None:\n        \n\n        remind_verbose = (\n            "Raising original exception:"\n            if raise_error\n            else "For detailed error information, run sigstore with the `--verbose` flag."\n        )\n\n        print(f"{self.diagnostics()}\\n{remind_verbose}", file=sys.stderr)\n\n        if raise_error:\n            # don\'t want "during handling another exception"\n            self.__suppress_context__ = True\n            raise self\n\n        sys.exit(1)', 'class NetworkError(Error):\n    \n\n    def diagnostics(self) -> str:\n        \n\n        cause_ctx = (\n            f\n        Additional context:\n\n        {self.__cause__}\n        \n            if self.__cause__\n            else ""\n        )\n\n        return (\n           

In [8]:
# Let's see the different components that are present in our code
code.keys()

dict_keys(['imports', 'functions', 'classes', 'documentation', 'other', 'functions_code', 'functions_docstrings', 'classes_code', 'classes_docstrings'])
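
To get a quick sense of how much content each component holds, the sizes can be tabulated. The sketch below assumes that most components are lists (as `imports`, `functions` and `classes` are in the output above) and counts any other non-empty value as a single item; `summarize_components` is a hypothetical helper and the toy input is made up.

```python
def summarize_components(code_chunks):
    """Map each component name to the number of items it holds.

    Lists and tuples are counted by length; any other non-empty value
    (for example a single string) counts as one item.
    """
    summary = {}
    for name, value in code_chunks.items():
        if isinstance(value, (list, tuple)):
            summary[name] = len(value)
        else:
            summary[name] = 1 if value else 0
    return summary

# Toy input shaped like the `code` dict above (contents are made up)
toy = {
    "imports": ["import sys"],
    "functions": [],
    "classes_code": ["class A: ...", "class B: ..."],
}
print(summarize_components(toy))
```

In the notebook, `summarize_components(code)` would give the same kind of overview for the selected file.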

In [9]:
# Let's take a look at the code for only the classes defined in the python file
classes_code_text = code["classes_code"]

In [10]:
print(classes_code_text)

['class Error(Exception):\n    \n\n    def diagnostics(self) -> str:\n        \n\n        return An issue occurred.\n\n    def print_and_exit(self, raise_error: bool = False) -> None:\n        \n\n        remind_verbose = (\n            "Raising original exception:"\n            if raise_error\n            else "For detailed error information, run sigstore with the `--verbose` flag."\n        )\n\n        print(f"{self.diagnostics()}\\n{remind_verbose}", file=sys.stderr)\n\n        if raise_error:\n            # don\'t want "during handling another exception"\n            self.__suppress_context__ = True\n            raise self\n\n        sys.exit(1)', 'class NetworkError(Error):\n    \n\n    def diagnostics(self) -> str:\n        \n\n        cause_ctx = (\n            f\n        Additional context:\n\n        {self.__cause__}\n        \n            if self.__cause__\n            else ""\n        )\n\n        return (\n            \\\n        A network issue occurred.\n\n        Check 

In [11]:
classes_code_text_joined = "\n".join(classes_code_text)

## Generate Prompts

We will now build a prompt to generate the API doc for the classes code extracted above.

In [12]:
instruction = """
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.
"""

In [13]:
# generate the final prompt by appending the classes code
prompt = f"""{instruction}\n"""
prompt += f"""

Class code:

{classes_code_text_joined}

Class Documentation:

"""
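
The same prompt layout (instruction, labeled code, then a cue for the documentation) could be reused for function code as well. A small sketch, assuming a hypothetical helper `build_doc_prompt`; it trims surrounding whitespace, so its output differs slightly in spacing from the prompt built above.

```python
def build_doc_prompt(instruction, code_text, kind="Class"):
    """Assemble a documentation prompt: instruction, labeled code, then a cue."""
    return (
        f"{instruction.strip()}\n\n"
        f"{kind} code:\n\n"
        f"{code_text.strip()}\n\n"
        f"{kind} Documentation:\n\n"
    )

# Toy example; in the notebook the arguments would be
# `instruction` and `classes_code_text_joined`
demo = build_doc_prompt("Document this code.", "class Error(Exception):\n    pass")
print(demo)
```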

In [14]:
print(prompt)


You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.



Class code:

class Error(Exception):
    

    def diagnostics(self) -> str:
        

        return An issue occurred.

    def print_and_exit(self, raise_error: bo

## Generate the API doc

We will now choose a suitable LLM, such as the IBM granite-20b model, to generate our API doc.

In [15]:
creds = Credentials(api_key=api_key, api_endpoint=api_endpoint)

# Instantiate parameters for text generation
params = TextGenerationParameters(
    decoding_method="sample",
    max_new_tokens=1024,
    temperature=0.7,
    top_k=50,
    top_p=0.50,
)

# Instantiate a model proxy object to send your requests
client = Client(credentials=creds)
responses = list(
    client.text.generation.create(
        model_id="ibm/granite-20b-code-instruct-v1", inputs=[prompt], parameters=params
    )
)
response = responses[0].results[0]
print("The response:", response)
print("\n")
generated_patch = response.generated_text
print("The generated doc:", generated_patch)

The response: generated_text='1. Introduction: The purpose of this API is to provide a way to generate API documentation for Python code. The API documentation should include information about the functions, classes, and error handling.\n2. Functions:\n    - generate_docs: This function is used to generate API documentation for Python code. It takes in a Python script or directory as input and generates a JSON file containing the API documentation.\n    - get_docs: This function is used to retrieve the API documentation for a specific Python function or class. It takes in the name of the function or class as input and returns the API documentation as a JSON object.\n    - get_error_docs: This function is used to retrieve the API documentation for a specific error. It takes in the name of the error as input and returns the API documentation as a JSON object.\n3. Error Handling:\n    - Error: This class is the base class for all other errors in the API. It provides a way to handle errors

Let's take a look at the actual doc.

In [16]:
print(actual_doc)

[ sigstore](../sigstore.html)

## API Documentation

  * Error
    * diagnostics
    * print_and_exit
  * NetworkError
    * diagnostics
  * TUFError
    * TUFError
    * message
    * diagnostics
  * MetadataError
    * diagnostics
  * RootError
    * diagnostics

[ built with pdoc ](https://pdoc.dev "pdoc: Python API documentation
generator")

#  [sigstore](./../sigstore.html).errors

Exceptions.

View Source
    

class Error(builtins.Exception): View Source
    

Base sigstore exception type. Defines helpers for diagnostics.

def diagnostics(self) -> str: View Source
    

Returns human-friendly error information.

def print_and_exit(self, raise_error: bool = False) -> None: View Source
    

Prints all relevant error information to stderr and exits.

##### Inherited Members

builtins.Exception

    Exception

builtins.BaseException

    with_traceback
    add_note
    args

class NetworkError(Error): View Source
    

Raised when a connectivity-related issue occurs.

def diagnosti

## Evaluate the results

There are different ways to evaluate the results generated by our LLMs. Some of the methods we will explore are:
* **GenAI evaluation** - Using OpenAI GPT-3 to evaluate the generated API doc
* **LangChain evaluation** - Using LangChain to evaluate against custom criteria such as helpfulness, correctness, descriptiveness etc.

In [17]:
# Let's fetch the generated doc
result = generated_patch

### GenAI Evaluation

We will now ask GPT-3 to evaluate the generated doc on factors such as Accuracy, Relevance, Clarity, Completeness and Readability. We ask it to rate each factor on a scale of 1 to 5, where 1 is the poorest documentation and 5 is the best.

In [18]:
# Evaluate using GPT 3
score = eval_using_model(result, openai_key=openai_key)

Accuracy: 4 
Relevance: 5 
Clarity: 4
Completeness: 4 
Readability: 4
Overall Score: 4.2
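
The printed overall score is consistent with the plain average of the five factor ratings (21 / 5 = 4.2). The sketch below parses text in this `Name: number` shape and recomputes that average; it only assumes the printed format shown above, and says nothing about how `eval_using_model` computes the overall score internally. `parse_scores` is a hypothetical helper.

```python
import re

# One "Name: number" pair per line, tolerating trailing spaces
SCORE_LINE = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(\d+(?:\.\d+)?)\s*$", re.MULTILINE)

def parse_scores(text):
    """Extract 'Name: number' lines into a dict of floats."""
    return {name: float(value) for name, value in SCORE_LINE.findall(text)}

raw = """Accuracy: 4
Relevance: 5
Clarity: 4
Completeness: 4
Readability: 4
Overall Score: 4.2
"""

scores = parse_scores(raw)
# Average the individual factors, leaving out the reported overall score
factors = {k: v for k, v in scores.items() if k != "Overall Score"}
mean = sum(factors.values()) / len(factors)
print(round(mean, 2))
```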


**Interpreting the evaluation score**:

Although GPT-3 gave the generated doc an overall score of 4.2, i.e. rated it as "high/very good" documentation, the generated documentation does not actually document the code files we provided as input.

The generated output is generic API documentation and fails to document the specific classes and methods in the code provided. GPT-3 has therefore failed to evaluate the generated output reliably. To improve the evaluation, we need to refine the prompt given to GPT-3 by supplementing it with the source code that we provided as the initial input for generating the documentation.

## LangChain Evaluation

LangChain criteria evaluation assesses a model’s output using a specific rubric or criteria set. It allows you to verify if an LLM or Chain’s output complies with a defined set of criteria.

In [19]:
# Let's see all the predefined criteria provided by LangChain
list(Criteria)

[<Criteria.CONCISENESS: 'conciseness'>,
 <Criteria.RELEVANCE: 'relevance'>,
 <Criteria.CORRECTNESS: 'correctness'>,
 <Criteria.COHERENCE: 'coherence'>,
 <Criteria.HARMFULNESS: 'harmfulness'>,
 <Criteria.MALICIOUSNESS: 'maliciousness'>,
 <Criteria.HELPFULNESS: 'helpfulness'>,
 <Criteria.CONTROVERSIALITY: 'controversiality'>,
 <Criteria.MISOGYNY: 'misogyny'>,
 <Criteria.CRIMINALITY: 'criminality'>,
 <Criteria.INSENSITIVITY: 'insensitivity'>,
 <Criteria.DEPTH: 'depth'>,
 <Criteria.CREATIVITY: 'creativity'>,
 <Criteria.DETAIL: 'detail'>]

The list above shows the predefined criteria available for assessing model responses. Notably, "correctness" requires a reference (ground-truth) answer to compare against, whereas the other criteria can be assessed from the model's response alone.
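
The distinction between criteria that need a reference and those that do not can be captured in a small helper that picks the evaluator name and its arguments. This is a sketch under the assumption that only "correctness" needs a reference, as in this notebook; `evaluator_config` is a hypothetical helper and does not call LangChain itself.

```python
def evaluator_config(criterion, prediction, input_text, reference=None):
    """Return the evaluator name and keyword arguments for one criterion."""
    needs_reference = criterion == "correctness"
    if needs_reference and reference is None:
        raise ValueError("'correctness' needs a reference document")
    kwargs = {"prediction": prediction, "input": input_text}
    if needs_reference:
        kwargs["reference"] = reference
    name = "labeled_criteria" if needs_reference else "criteria"
    return name, kwargs

# Toy strings stand in for the generated doc, the prompt and the actual doc
name, kwargs = evaluator_config("correctness", "generated doc", "prompt", "actual doc")
print(name, sorted(kwargs))
```

The returned values would then be passed on as `load_evaluator(name, llm=llm, criteria=criterion).evaluate_strings(**kwargs)`.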

In [20]:
llm = ChatOpenAI(model="gpt-4", temperature=0)


### Criteria: Helpfulness
This criterion checks whether the generated documentation is "helpful", i.e. whether it provides aid or support, makes tasks easier or solves problems effectively.

In [21]:
evaluator = load_evaluator("criteria", llm=llm, criteria="helpfulness")
eval_result = evaluator.evaluate_strings(prediction=result, input=prompt)

In [22]:
eval_result

{'reasoning': 'The criterion for this task is "helpfulness". The submission is supposed to be helpful, insightful, and appropriate.\n\nLooking at the submission, it seems to be a general description of an API documentation system rather than a specific documentation for the provided Python code. The functions mentioned in the submission (generate_docs, get_docs, get_error_docs) are not present in the provided Python code. The submission does not provide a detailed description of the functions, their parameters, return values, or error handling as required by the task.\n\nTherefore, the submission is not helpful or appropriate as it does not provide accurate or complete information about the provided Python code. It is also not insightful as it does not provide any new or useful information about the Python code.\n\nN',
 'value': 'N',
 'score': 0}

**Interpreting the evaluation score**

A score of 0 indicates that the output doesn't meet the criteria defined and a score of 1 indicates that the output satisfies the criteria defined.

Our generated doc has been scored 0 for helpfulness, indicating that it is not "helpful": it documents functions that are not present in the input instead of the classes code we provided. Since the evaluator caught this mismatch, helpfulness is an effective metric for evaluating our model outputs.

### Criteria: Correctness

This criterion checks whether the generated documentation is "correct", i.e. whether the output agrees with the ground truth provided.

In [23]:
evaluator = load_evaluator("labeled_criteria", llm=llm, criteria="correctness")
eval_result = evaluator.evaluate_strings(prediction=result, input=prompt, reference=actual_doc)

In [24]:
eval_result

{'reasoning': "The criterion for this task is correctness: Is the submission correct, accurate, and factual?\n\nLet's evaluate the submission based on this criterion:\n\n1. The introduction in the submission does not accurately describe the purpose of the API. The API is not for generating API documentation for Python code, but it is a set of error classes for a Python package called sigstore.\n\n2. The functions listed in the submission (generate_docs, get_docs, get_error_docs) are not present in the provided Python code. The actual functions/methods in the code are 'diagnostics' and 'print_and_exit' which are not mentioned in the submission.\n\n3. The error handling section in the submission correctly identifies the error classes (Error, NetworkError, TUFError, MetadataError, RootError) but does not provide accurate descriptions of what these errors do or when they are raised. For example, the TUFError is described as being raised when a TUF issue occurs, but the actual code shows th

**Interpreting the evaluation score**

A score of 0 indicates that the output doesn't meet the criteria defined and a score of 1 indicates that the output satisfies the criteria defined.

Our generated doc has been scored 0 for correctness, indicating that it is not correct since it doesn't match the input Python code provided. The evaluator notes that the provided code is a set of error classes, whereas the generated output documents functions like 'generate_docs', 'get_docs' and 'get_error_docs', which are not present in the provided code.

### Criteria: Logical

We can also provide our own custom criteria based on which we would like to evaluate our generated outputs. Here, we are evaluating the output based on how "logical" it is.

In [25]:
custom_criteria = {
    "logical": "Is the output logical?"
}

In [26]:
eval_chain = load_evaluator(
    EvaluatorType.CRITERIA,
    criteria=custom_criteria,
    llm=llm
)
eval_result = eval_chain.evaluate_strings(prediction=result, input=prompt)

In [27]:
eval_result

{'reasoning': 'The criterion is to assess whether the output is logical.\n\nThe output is supposed to be an API documentation for the provided Python code. The Python code provided is a set of classes that define different types of errors. Each class has a method called diagnostics that returns a string describing the error.\n\nThe submitted output, however, does not match the provided Python code. The output describes functions like generate_docs, get_docs, and get_error_docs, which are not present in the provided Python code. The output also describes the error classes, but it does not provide the required details such as the description of each function, the parameters, and the return values.\n\nTherefore, the output is not logical as it does not accurately represent the provided Python code.\n\nN',
 'value': 'N',
 'score': 0}

**Interpreting the evaluation score**

A score of 0 indicates that the output doesn't meet the criteria defined and a score of 1 indicates that the output satisfies the criteria defined.

Our generated doc has been scored 0 for the "logical" criterion, indicating that it does not document the input Python code provided and hence is not a logical response to the prompt.
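
The per-criterion results can be collected into one summary to compare runs. The sketch below assumes each result is a dict with optional `value` and `score` keys, as in the outputs above; `summarize_results` is a hypothetical helper. The correctness entry carries only the score stated in the text, since its printed output is truncated.

```python
def summarize_results(results):
    """Turn {criterion: eval_result} into rows plus an overall pass rate."""
    rows = [
        {"criterion": name, "value": res.get("value"), "score": res.get("score")}
        for name, res in results.items()
    ]
    # Ignore results without a score when computing the pass rate
    scored = [row["score"] for row in rows if row["score"] is not None]
    pass_rate = sum(scored) / len(scored) if scored else None
    return rows, pass_rate

results = {
    "helpfulness": {"value": "N", "score": 0},
    "correctness": {"score": 0},  # score as stated in the text above
    "logical": {"value": "N", "score": 0},
}
rows, pass_rate = summarize_results(results)
print(rows)
print(pass_rate)
```

In the notebook, `pd.DataFrame(rows)` would display the same rows as a table.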