# Quantitative Evaluation

In the [evaluation notebook](https://github.com/redhat-et/api-docs-generation/blob/main/notebooks/evaluation/evaluation_metrics.ipynb), we explored different techniques for evaluating how well our LLMs generate API documentation, and implemented suitable metrics for scoring and ranking the generated outputs.

In this notebook, we try to identify the best evaluation criteria and metrics through a quantitative analysis of different prompts and examples.

In [1]:
import os
import json
import re
import pandas as pd
import sys
sys.path.append('../../app')
from utils import eval_using_model
from dotenv import load_dotenv
from ipynb.fs.defs.helper_functions import get_response, extract_scores, append_row_to_dataframe, langchain_scores

  from genai.text.generation import TextGenerationParameters
  from genai.text.tokenization import (


## Load API credentials

In [2]:
# Make sure the root folder has a .env file that defines GENAI_KEY, GENAI_API, and OPENAI_API_KEY
load_dotenv()
api_key = os.getenv("GENAI_KEY", None)
api_endpoint = os.getenv("GENAI_API", None)
openai_key = os.getenv("OPENAI_API_KEY", None)
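
Because each key falls back to `None` when it is unset, a missing variable would only surface later as a failed API call. A minimal sketch of an early check, assuming a hypothetical helper `missing_env_vars` that is not part of the project's code:

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    # Hypothetical helper for illustration; defaults to the real environment.
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Report missing credentials before any model call is made.
required = ["GENAI_KEY", "GENAI_API", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dict as `environ` makes the check easy to try without touching the real environment.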

## Input Prompt

In [3]:
instruction = """
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.
"""

## Quantitative Evaluation

To narrow down the best GenAI evaluation criteria, we construct a quantitative evaluation matrix that shows how often these scores are valid, by:

 - Looking at cases where the generated output is known to be wrong and checking how the assigned scores perform
 - Doing this over a number of outputs for each criterion
 
To do that, we keep a column for each evaluation criterion alongside a column for the human evaluation score for that same criterion.

In [117]:
data = {
    'prompt': [],
    'response': [],
    'gpt_accuracy_score': [],
    'human_accuracy_score': [],
    'gpt_relevance_score': [],
    'human_relevance_score': [],
    'gpt_clarity_score': [],
    'human_clarity_score': [],
    'gpt_completeness_score': [],
    'human_completeness_score': [],
    'gpt_readability_score': [],
    'human_readability_score': [],
    'langchain_helpfulness': [],
    'human_helpfulness': [],
    'langchain_correctness': [],
    'human_correctness': [],
    'langchain_logical': [],
    'human_logical': []
}

# DO NOT RUN CELLS WITH EXAMPLES THAT ARE ALREADY ADDED SO THEY ARE NOT OVERWRITTEN.
Scroll to the bottom and add more examples

### Example 1 - Do not Re-run

In [None]:
df = pd.DataFrame(data)

In [123]:
prompt, generated_text, actual_doc = get_response('ibm/granite-20b-code-instruct-v1', api_key, openai_key, 'oidc', instruction, functions=True, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=False, classes_doc=False)

generated_text='\nIntroduction:\n\nThis API provides functionality for detecting credentials in text.\n\nFunctions:\n\ndetect_credential(text: str) -> Optional[str]\n\nDescription:\n\nDetects credentials in the given text.\n\nParameters:\n\ntext (str): The text to detect credentials in.\n\nReturn Values:\n\nstr: The detected credential.\n\nError Handling:\n\nIdentityError: Raised if an error occurs during credential detection.\n\nMake sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.' generated_token_count=139 generated_tokens=None input_text=None input_token_count=231 input_tokens=None moderation=None seed=3748198347.0 stop_reason='eos_token' stop_sequence=None


In [124]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


Function Code:

def detect_credential() -> Optional[str]:
    
    try:
        return cast(Optional[str], id.detect_credential(_DEFAULT_AUDIENCE))
    exce

In [125]:
print("\n Generated Text \n", generated_text)


 Generated Text 
 
Introduction:

This API provides functionality for detecting credentials in text.

Functions:

detect_credential(text: str) -> Optional[str]

Description:

Detects credentials in the given text.

Parameters:

text (str): The text to detect credentials in.

Return Values:

str: The detected credential.

Error Handling:

IdentityError: Raised if an error occurs during credential detection.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


In [127]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 4 - The generated documentation accurately describes the purpose of the API and the function. It correctly mentions that the function detects credentials in the given text and that it returns the detected credential as a string. The error handling section accurately describes the possible error response.

Relevance: 5 - The generated documentation is relevant to the provided code. It accurately describes the purpose and functionality of the API function.

Clarity: 4 - The generated documentation is clear in explaining what the function does and what its parameters and return values are. The error handling section also provides a clear explanation of the possible error response. 

Completeness: 4 - The generated documentation provides a comprehensive description of the API function, including its purpose, parameters, return values, and error handling. It covers all the necessary information for a user to understand and use the function.

Readability: 5 - The generated document

In [128]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)
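
The evaluator output above follows a `Criterion: score - explanation` pattern, so the five scores can be pulled out with a regular expression. The sketch below is a hypothetical stand-in named `parse_scores`, written only to illustrate the idea; the project's `extract_scores` helper may be implemented differently and may round or truncate differently.

```python
import re

CRITERIA = ["Accuracy", "Relevance", "Clarity", "Completeness", "Readability"]

def parse_scores(evaluation_text):
    """Return one float per criterion, or None when the label is absent."""
    scores = []
    for name in CRITERIA:
        # Match e.g. "Relevance: 3.5" and capture the numeric part.
        match = re.search(rf"{name}:\s*(\d+(?:\.\d+)?)", evaluation_text)
        scores.append(float(match.group(1)) if match else None)
    return scores

# Made-up sample text in the same shape as the evaluator output.
sample = "Accuracy: 4 - fine\n\nRelevance: 3.5 - mostly relevant\n\nClarity: 4 - clear"
print(parse_scores(sample))
```

Returning `None` for an absent label keeps a cut-off evaluator response from being mistaken for a score of zero.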

In [None]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

In [129]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)

{'reasoning': 'The criterion for this task is "helpfulness". \n\nThe submission provides an introduction that describes the purpose of the API, which is to detect credentials in text. This is helpful for users to understand what the API does.\n\nThe submission also documents the function, including a description of what it does, the parameters it takes, and the return values. This is helpful for users to understand how to use the function.\n\nThe submission also describes possible error responses, which is helpful for users to understand what might go wrong and how to handle it.\n\nHowever, the submission does not accurately reflect the function code provided. The function does not take any parameters, but the submission states that it takes a text parameter. This could mislead users and cause confusion.\n\nTherefore, the submission does not meet the criterion of being helpful, as it provides incorrect information about the function\'s parameters.\n\nN', 'value': 'N', 'score': 0}
{'rea



In [130]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,,5.0,,4.0,,4.0,,5.0,,0.0,,0.0,,0.0,
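
In case `append_row_to_dataframe` is built on `DataFrame.append`, note that this method was removed in pandas 2.0. A `pd.concat`-based equivalent could look like the sketch below; `append_row` and the two-column demo frame are hypothetical and only illustrate the pattern.

```python
import pandas as pd

def append_row(df, row):
    """Return a new DataFrame with `row` (a dict) added as the last row."""
    # Wrap the dict in a one-row frame so concat can align the columns.
    new_row = pd.DataFrame([row])
    return pd.concat([df, new_row], ignore_index=True)

# Small demo frame; the real matrix has one column per criterion.
df_demo = pd.DataFrame({"prompt": ["p0"], "gpt_accuracy_score": [4.0]})
df_demo = append_row(df_demo, {"prompt": "p1", "gpt_accuracy_score": 5.0})
print(df_demo)
```

Columns that are missing from `row`, such as the human scores that are filled in later, end up as `NaN` in the new row.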


In [149]:
# Append Human Scores

# Store the scores as numbers so they can be compared with the model scores.
df.at[0, 'human_accuracy_score'] = 2.0
df.at[0, 'human_relevance_score'] = 3.0
df.at[0, 'human_clarity_score'] = 4.0
df.at[0, 'human_completeness_score'] = 4.0
df.at[0, 'human_readability_score'] = 5.0
df.at[0, 'human_helpfulness'] = 0.0
df.at[0, 'human_correctness'] = 0.0
df.at[0, 'human_logical'] = 0.0

In [134]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0,0.0,0,0.0,0


**Interpretation**: This is a good example of generated documentation that is partially incorrect (it documents a `text` parameter that the function does not take), and the langchain evaluation criteria detect the issue correctly.

### Example 2 - Do not Re-run

In [139]:
prompt, generated_text, actual_doc = get_response('ibm/granite-20b-code-instruct-v1', api_key, openai_key, 'oidc', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

generated_text='1. Introduction: This API is used to generate documentation for Python code. It provides functions for generating documentation for functions, classes, and scripts.\n\n2. Functions:\n\n- generate_function_docs: Generates documentation for a function.\n- generate_class_docs: Generates documentation for a class.\n- generate_script_docs: Generates documentation for a script.\n\n3. Error Handling:\n\n- IdentityError: An error occurred with ambient credential detection.\n- IssuerError: An error occurred with the OIDC issuer.\n- NetworkError: A network error occurred.\n\nMake sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.\n\nFunction code:\n\ndef generate_function_docs(function: Callable) -> str:\n    \n\n    doc = inspect.getdoc(function)\n    if doc is None:\n        raise ValueError(f"function {function.__name__!r} has no docstring")

In [140]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


        
Class code:

class _OpenIDConfiguration(BaseModel):
    

    authorization_endpoint: StrictStr
    token_endpoint: StrictStr
class ExpiredIdentity

In [142]:
print(actual_doc)

[ sigstore](../sigstore.html)

## API Documentation

  * DEFAULT_OAUTH_ISSUER_URL
  * STAGING_OAUTH_ISSUER_URL
  * DEFAULT_AUDIENCE
  * ExpiredIdentity
  * IdentityToken
    * IdentityToken
    * in_validity_period
    * identity
    * issuer
    * expected_certificate_subject
  * IssuerError
  * Issuer
    * Issuer
    * production
    * staging
    * identity_token
  * IdentityError
    * raise_from_id
    * diagnostics
  * detect_credential

[ built with pdoc ](https://pdoc.dev "pdoc: Python API documentation
generator")

#  [sigstore](./../sigstore.html).oidc

API for retrieving OIDC tokens.

View Source
    

DEFAULT_OAUTH_ISSUER_URL = 'https://oauth2.sigstore.dev/auth'

STAGING_OAUTH_ISSUER_URL = 'https://oauth2.sigstage.dev/auth'

DEFAULT_AUDIENCE = 'sigstore'

class ExpiredIdentity(builtins.Exception): View Source
    

An error raised when an identity token is expired.

##### Inherited Members

builtins.Exception

    Exception

builtins.BaseException

    with_traceback
    a

In [143]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 4 - The generated documentation accurately identifies the purpose and functionality of the API functions and classes. The descriptions of the functions and classes are based on the code provided and accurately represent their functionality.

Relevance: 3.5 - The generated documentation is relevant as it provides accurate descriptions of each API function and class, including their purpose, parameters, and return values. However, some of the error handling information seems to be missing or incomplete.

Clarity: 3.5 - The generated documentation is clear in most parts, providing concise descriptions of the API functions and classes. However, there are a few areas where the explanations could be clearer, especially in the error handling section.

Completeness: 3 - The generated documentation provides descriptions of each API function and class, including their purpose and parameters. However, some parts of the documentation, especially in the error handling section, are incompl

In [144]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)

In [None]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

In [145]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)

{'reasoning': 'The criteria for this task is "helpfulness". The submission is supposed to be helpful, insightful, and appropriate. \n\nLooking at the submission, it seems to be a detailed documentation of the provided Python code. It includes an introduction, function documentation, error handling, and diagnostics. It also provides links to relevant documentation for further reading. \n\nHowever, the submission seems to have misunderstood the task. The task was to generate API documentation for the provided Python code, but the submission seems to be a documentation of a hypothetical API that generates documentation for Python code. This is a significant misunderstanding of the task.\n\nTherefore, the submission is not helpful or appropriate for the task at hand. \n\nN', 'value': 'N', 'score': 0}
{'reasoning': 'The submission is supposed to provide API documentation for the provided Python code. The code provided includes several classes and methods, including the _OpenIDConfiguration 



In [146]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,,3.0,,3.0,,3.0,,4.0,,0.0,,0.0,,0.0,


In [150]:
# Append Human Scores

# Store the scores as numbers so they can be compared with the model scores.
df.at[1, 'human_accuracy_score'] = 1.0
df.at[1, 'human_relevance_score'] = 1.0
df.at[1, 'human_clarity_score'] = 1.0
df.at[1, 'human_completeness_score'] = 1.0
df.at[1, 'human_readability_score'] = 1.0
df.at[1, 'human_helpfulness'] = 0.0
df.at[1, 'human_correctness'] = 0.0
df.at[1, 'human_logical'] = 0.0

In [151]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


**Interpretation**: This is a great example where the generated output completely misunderstood the task and hallucinated content; the langchain evaluation caught the error well. Although the GPT evaluation scores were lower than for Example 1, they should have been much lower.

### Example 3 - Do not Re-run

In [153]:
prompt, generated_text, actual_doc = get_response('ibm/granite-20b-code-instruct-v1', api_key, openai_key, 'transparency', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

generated_text='1. Introduction: This class is used to represent an inclusion proof for a Merkle tree. It is used in the Verifiable Credentials (VC) API to verify the inclusion of a specific credential in a Merkle tree.\n\n2. Functions:\n\n    - Description: This function is used to create an instance of the LogInclusionProof class. It takes in a dictionary of parameters and sets them as attributes of the class.\n\n    - Parameters:\n        - checkpoint (str): The checkpoint of the Merkle tree.\n        - hashes (list): A list of hashes in the inclusion proof.\n        - log_index (int): The index of the log in the Merkle tree.\n        - root_hash (str): The root hash of the Merkle tree.\n        - tree_size (int): The size of the Merkle tree.\n\n    - Return Values:\n        - LogInclusionProof: An instance of the LogInclusionProof class.\n\n    - Error Handling:\n        - ValueError: If the log index or tree size is negative or if the log index is greater than or equal to the tree

In [154]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


        
Class code:

class LogInclusionProof(BaseModel):
    

    model_config = ConfigDict(populate_by_name=True)

    checkpoint: StrictStr = Field(...,

In [162]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 5 - The generated documentation accurately describes the purpose of the API class and function, as well as the parameters, return values, and error handling.

Relevance: 5 - The generated documentation is relevant as it provides accurate and specific information about the class and function, including their purpose, parameters, return values, and error handling.

Clarity: 5 - The generated documentation is clear and easy to understand. It provides clear descriptions of the class and function, as well as their parameters, return values, and error handling.

Completeness: 5 - The generated documentation is complete as it includes all the necessary information about the class and function, including their purpose, parameters, return values, and error handling.

Readability: 5 - The generated documentation is highly readable. It uses clear and concise language to describe the class and function, as well as their parameters, return values, and error handling. The formatting and or

In [163]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)

In [None]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

In [168]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)

{'reasoning': 'The criterion for this task is "helpfulness". The submission should be helpful, insightful, and appropriate.\n\nLooking at the submission, it provides a detailed explanation of the class and function in the provided Python code. It describes the purpose of the class and function, the parameters they take, the return values, and the errors they might raise. This information is helpful for understanding how to use the class and function.\n\nThe submission also follows the structure provided in the input, which makes it easy to follow and understand. It avoids speculative information and prioritizes accuracy and completeness, as required by the task.\n\nTherefore, the submission meets the criterion of being helpful, insightful, and appropriate.\n\nY', 'value': 'Y', 'score': 1}
{'reasoning': 'The submission is being evaluated for correctness, accuracy, and factualness. \n\n1. Correctness: The submission correctly describes the purpose of the class and function, their paramet



In [169]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,,5.0,,5.0,,5.0,,5.0,,1.0,,1.0,,1.0,


In [176]:
# Append Human Scores

# Store the scores as numbers so they can be compared with the model scores.
df.at[2, 'human_accuracy_score'] = 2.0
df.at[2, 'human_relevance_score'] = 3.0
df.at[2, 'human_clarity_score'] = 3.0
df.at[2, 'human_completeness_score'] = 2.0
df.at[2, 'human_readability_score'] = 3.0
df.at[2, 'human_helpfulness'] = 1.0
df.at[2, 'human_correctness'] = 0.0
df.at[2, 'human_logical'] = 1.0

In [177]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0


**Interpretation**

This is an interesting case: the generated output correctly captures the conditions checked in the given class, but it also hallucinates function code. The answer is still correct in places; for example, it correctly captured that

```
_log_index_positive ensures that the log_index value is non-negative.
_tree_size_positive ensures that the tree_size value is non-negative.
_log_index_within_tree_size ensures that the log_index is within the range of the tree_size.
```

but in other places it is made up and inaccurate, and the langchain evaluation was not able to detect that.
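
The mismatch on correctness can be made concrete by measuring how often the langchain verdict matches the human one. The sketch below uses the binary scores of the first three examples from the table above and a hypothetical helper `agreement_rate`; it assumes the scores are stored as numbers rather than strings.

```python
import pandas as pd

# Binary langchain and human scores for Examples 1 to 3, copied from the table.
scores = pd.DataFrame({
    "langchain_helpfulness": [0.0, 0.0, 1.0],
    "human_helpfulness": [0.0, 0.0, 1.0],
    "langchain_correctness": [0.0, 0.0, 1.0],
    "human_correctness": [0.0, 0.0, 0.0],
    "langchain_logical": [0.0, 0.0, 1.0],
    "human_logical": [0.0, 0.0, 1.0],
})

def agreement_rate(df, criterion):
    """Fraction of rows where the langchain verdict equals the human one."""
    matches = df[f"langchain_{criterion}"] == df[f"human_{criterion}"]
    return matches.mean()

for criterion in ["helpfulness", "correctness", "logical"]:
    print(criterion, round(agreement_rate(scores, criterion), 2))
```

With only three rows the rates are anecdotal, but the same function applies unchanged as more examples are added.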

### Example 4 - Do not Re-run

In [178]:
prompt, generated_text, actual_doc = get_response('ibm/granite-20b-code-instruct-v1', api_key, openai_key, 'sign', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

generated_text='1. Introduction: This API is used to sign and verify artifacts using Sigstore. It allows users to sign artifacts using their private key and verify the signature using the public key.\n2. Functions:\n    - sign: This function is used to sign an artifact using the private key. It takes an input stream as an argument and returns a SigningResult object.\n    - verify: This function is used to verify the signature of an artifact using the public key. It takes a SigningResult object as an argument and returns a boolean value indicating whether the signature is valid or not.\n3. Error Handling:\n    - ExpiredIdentity: This error is raised when the provided identity token is expired.\n    - ExpiredCertificate: This error is raised when the provided certificate is expired.\n    - InvalidCertificate: This error is raised when the provided certificate is invalid.\n    - InvalidSignature: This error is raised when the provided signature is invalid.\n    - InvalidCertificateChain: 

In [179]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


        
Class code:

class Signer:
    

    def __init__(
        self,
        identity_token: IdentityToken,
        signing_ctx: SigningContext,
      

In [181]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 4 - The generated documentation accurately describes the purpose of the API and its functions. It accurately describes the parameters and return values of the functions.
Relevance: 5 - The generated documentation is relevant as it provides information about how to use the API functions and what error handling is implemented.
Clarity: 3 - The generated documentation provides clear descriptions of the purpose of the API and its functions. However, it could be improved by providing more detailed descriptions for each function.
Completeness: 4 - The generated documentation includes the introduction, functions, and error handling sections as required. It provides information about the purpose of the API, the functions available, and possible error responses.
Readability: 5 - The generated documentation is readable and follows a clear structure. It uses clear and concise language to describe the purpose of the API and its functions. The sections are organized logically and are easy

In [182]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)

In [None]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

In [168]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)

{'reasoning': 'The criterion for this task is "helpfulness". The submission should be helpful, insightful, and appropriate.\n\nLooking at the submission, it provides a detailed explanation of the class and function in the provided Python code. It describes the purpose of the class and function, the parameters they take, the return values, and the errors they might raise. This information is helpful for understanding how to use the class and function.\n\nThe submission also follows the structure provided in the input, which makes it easy to follow and understand. It avoids speculative information and prioritizes accuracy and completeness, as required by the task.\n\nTherefore, the submission meets the criterion of being helpful, insightful, and appropriate.\n\nY', 'value': 'Y', 'score': 1}
{'reasoning': 'The submission is being evaluated for correctness, accuracy, and factualness. \n\n1. Correctness: The submission correctly describes the purpose of the class and function, their paramet



In [184]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,,5.0,,3.0,,4.0,,5.0,,0.0,,0.0,,0.0,


In [187]:
# Append Human Scores

# Use numeric values (not strings) so the score columns keep a float dtype
df.at[3, 'human_accuracy_score'] = 1.0
df.at[3, 'human_relevance_score'] = 1.0
df.at[3, 'human_clarity_score'] = 1.0
df.at[3, 'human_completeness_score'] = 1.0
df.at[3, 'human_readability_score'] = 2.0
df.at[3, 'human_helpfulness'] = 0.0
df.at[3, 'human_correctness'] = 0.0
df.at[3, 'human_logical'] = 0.0

In [190]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0


**Interpretation**

This is a good example of a poor generated output: it hallucinates classes, lists the classes incompletely, and gives incorrect explanations. GPT nevertheless scored it between 3 and 5 on every criterion, whereas the LangChain evaluation correctly caught the errors and pointed out the mistakes.
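To make the disagreement concrete, the sketch below computes the per-criterion gap between the GPT and human scores for this example. The values are copied from row 3 of the table above; the variable names are illustrative only.

```python
# Scores for row 3 of the results table (Example 4).
gpt = {"accuracy": 4.0, "relevance": 5.0, "clarity": 3.0,
       "completeness": 4.0, "readability": 5.0}
human = {"accuracy": 1.0, "relevance": 1.0, "clarity": 1.0,
         "completeness": 1.0, "readability": 2.0}

# Positive gap means GPT rated the output higher than the human reviewer did.
gaps = {criterion: gpt[criterion] - human[criterion] for criterion in gpt}
mean_gap = sum(gaps.values()) / len(gaps)

for criterion, gap in gaps.items():
    print(f"{criterion:>12}: GPT is {gap:+.1f} relative to human")
print(f"mean gap: {mean_gap:.1f}")
```

On this row GPT overrates every criterion, by three points on average on a five-point scale.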

### Example 5 - Do not Re-run

In [192]:
prompt, generated_text, actual_doc = get_response("OpenAI/gpt3.5", api_key, openai_key, 'transparency', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

**Introduction:**

The `LogInclusionProof` class represents an inclusion proof for a log entry in a Merkle tree. It is used to provide evidence that a particular log entry is included in the Merkle tree.

**Functions:**

1. `__init__()`:
    
    - **Description:** Initializes a new instance of the `LogInclusionProof` class.
    - **Parameters:**
        - None
    - **Return Value:** None


2. `__repr__()`:
    
    - **Description:** Returns a string representation of the `LogInclusionProof` class instance.
    - **Parameters:**
        - None
    - **Return Value:** String representation of the `LogInclusionProof` class instance.


3. `_log_index_positive(v: int) -> int`:
    
    - **Description:** Validates that the log index value is positive.
    - **Parameters:**
        - `v` (int): The log index value to be validated.
    - **Return Value:** The validated log index value.
    - **Raises:**
        - ValueError: If the log index value is less than 0.
      

4. `_tree_size_pos

In [193]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


        
Class code:

class LogInclusionProof(BaseModel):
    

    model_config = ConfigDict(populate_by_name=True)

    checkpoint: StrictStr = Field(...,

In [195]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 5 - The generated documentation accurately represents the code. All information from the code is correctly documented, including function descriptions, parameter descriptions, return values, and error handling.

Relevance: 5 - The generated documentation is relevant to the code. It accurately describes the purpose and use of the API class, as well as each individual function.

Clarity: 4 - The generated documentation is clear. It provides clear descriptions of each function and its purpose. However, the error handling description could be more specific about the exact scenarios in which each ValueError is raised.

Completeness: 5 - The generated documentation is complete. It covers all the functions in the class, providing descriptions, parameter information, return values, and error handling for each.

Readability: 4 - The generated documentation is readable. It uses clear language and follows a consistent structure. However, some of the descriptions could be more concise an

In [196]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)
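The implementation of `extract_scores` lives in the helper notebook and is not shown here. As a rough sketch of what such a parser could look like, the hypothetical function below pulls the number that follows each criterion name out of the evaluator's text with a regular expression. The sample text is shortened from the GPT evaluation above.

```python
import re

CRITERIA = ("Accuracy", "Relevance", "Clarity", "Completeness", "Readability")


def parse_scores_sketch(evaluation_text: str) -> tuple:
    """Hypothetical stand-in for extract_scores: one float (or None) per criterion."""
    scores = []
    for name in CRITERIA:
        # Match e.g. "Clarity: 4 - ..." and capture the numeric score.
        match = re.search(rf"{name}:\s*(\d+(?:\.\d+)?)", evaluation_text)
        scores.append(float(match.group(1)) if match else None)
    return tuple(scores)


sample = """Accuracy: 5 - The generated documentation accurately represents the code.

Relevance: 5 - The generated documentation is relevant to the code.

Clarity: 4 - The generated documentation is clear.

Completeness: 5 - The generated documentation is complete.

Readability: 4 - The generated documentation is readable.
"""

parsed = parse_scores_sketch(sample)
print(parsed)
```

Returning `None` for a missing criterion keeps a malformed evaluation from silently turning into a zero score.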

In [None]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

In [197]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)

{'reasoning': 'The criterion for this task is "helpfulness". The submission is to be evaluated based on whether it is helpful, insightful, and appropriate.\n\nLooking at the submission, it provides a detailed documentation of the `LogInclusionProof` class. It starts with an introduction that explains the purpose of the class. This is helpful for users who are not familiar with the class and its use.\n\nThe submission then documents each function in the class. For each function, it provides a description, lists and describes the parameters, and specifies the return value. This is helpful for users who want to understand how to use the functions and what to expect from them.\n\nThe submission also describes the possible error responses and their meanings. This is helpful for users who encounter errors and want to understand what they mean.\n\nOverall, the submission is helpful because it provides a comprehensive documentation of the `LogInclusionProof` class. It is insightful because it 

  df = df.append(new_row, ignore_index=True)


In [198]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,,5.0,,4.0,,5.0,,4.0,,1.0,,0.0,,0.0,


In [199]:
# Append Human Scores

# Use numeric values (not strings) so the score columns keep a float dtype
df.at[4, 'human_accuracy_score'] = 2.0
df.at[4, 'human_relevance_score'] = 2.0
df.at[4, 'human_clarity_score'] = 3.0
df.at[4, 'human_completeness_score'] = 2.0
df.at[4, 'human_readability_score'] = 4.0
df.at[4, 'human_helpfulness'] = 0.0
df.at[4, 'human_correctness'] = 0.0
df.at[4, 'human_logical'] = 1.0

In [200]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0


**Interpretation**

This is another example where the LangChain evaluation is not fully correct: it marked the output as helpful, while the human evaluation did not. Although the generated output is well structured and documents the class reasonably well, it hallucinates functions that are not part of the class, which is unacceptable.
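Hallucinated functions of this kind can be flagged mechanically. The sketch below is one possible check, not part of the notebook's pipeline: it collects the function names defined in the source with `ast`, collects the backticked names followed by `(` in the generated documentation, and reports the difference. The toy class, the toy documentation, and both helper names are hypothetical.

```python
import ast
import re


def defined_functions(source: str) -> set:
    """Names of all functions and methods defined in the given Python source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def documented_functions(doc: str) -> set:
    """Names written as `name(...)` inside backticks in the generated documentation."""
    return set(re.findall(r"`(\w+)\(", doc))


# Toy input (hypothetical): one real method, and a doc that invents a second one.
toy_source = '''
class Proof:
    def _log_index_positive(self, v: int) -> int:
        if v < 0:
            raise ValueError("negative index")
        return v
'''
toy_doc = "1. `_log_index_positive(v: int) -> int`: ...\n2. `verify()`: ..."

hallucinated = documented_functions(toy_doc) - defined_functions(toy_source)
print(sorted(hallucinated))
```

A check like this only sees names defined in the source it is given, so inherited methods would also be flagged and need a manual look.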

### Example 6 - Do not Re-run

In [201]:
prompt, generated_text, actual_doc = get_response("OpenAI/gpt3.5", api_key, openai_key, 'errors', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

1. Introduction:
The Error class is a base class for all custom error classes in the API. It provides a common interface for handling and reporting errors. The Error class is not intended to be instantiated directly.

The NetworkError class is a subclass of Error and represents an error that occurs when there is a network issue. It provides specific diagnostics and suggestions for resolving the issue.

The TUFError class is a subclass of Error and represents an error that occurs in the context of The Update Framework (TUF). It provides additional context-specific diagnostics and suggestions for reporting the issue.

The MetadataError class is a subclass of Error and represents an error that occurs when there is an issue with the metadata.

The RootError class is a subclass of Error and represents an error that occurs when the root of trust cannot be established.

2. Functions:
- Error.diagnostics():
    - Description: Returns a string with a general diagnostic message for the error.
  

In [202]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


        
Class code:

class Error(Exception):
    

    def diagnostics(self) -> str:
        

        return An issue occurred.

    def print_and_exit(se

In [204]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 4 - The generated documentation accurately describes the purpose and functionality of each class and function. The details from the code are correctly reflected in the documentation.

Relevance: 5 - The generated documentation is relevant as it provides clear and concise descriptions of each class and function, including their purpose, parameters, and return values. It also includes information on error handling.

Clarity: 4 - The generated documentation is clear and easy to understand. The descriptions for each class and function provide sufficient detail to understand their purpose and functionality.

Completeness: 5 - The generated documentation is complete and includes descriptions for all the classes and functions in the code. It also includes information on error handling and possible error responses.

Readability: 5 - The generated documentation is well-structured and formatted, making it easy to read and understand. The information is presented in a clear and concise 

In [205]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)

In [None]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

In [206]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)

{'reasoning': 'The criterion for this task is "helpfulness". The submission is to be evaluated based on whether it is helpful, insightful, and appropriate.\n\nLooking at the submission, it provides a detailed and structured documentation for the provided Python code. It follows the output structure provided in the input, which includes an introduction, function documentation, and error handling.\n\nIn the introduction, the submission provides a brief description of the purpose of each class in the API. This is helpful for users to understand the purpose and intended use of each class.\n\nIn the function documentation, the submission documents each function in the classes, including a description of what the function does, the parameters it takes, and the values it returns. This is insightful as it provides users with a clear understanding of how to use each function.\n\nIn the error handling section, the submission describes possible error responses and their meanings. This is appropri

  df = df.append(new_row, ignore_index=True)


In [207]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,,5.0,,4.0,,5.0,,5.0,,1.0,,1.0,,1.0,


In [209]:
# Append Human Scores

# Use numeric values (not strings) so the score columns keep a float dtype
df.at[5, 'human_accuracy_score'] = 5.0
df.at[5, 'human_relevance_score'] = 5.0
df.at[5, 'human_clarity_score'] = 5.0
df.at[5, 'human_completeness_score'] = 5.0
df.at[5, 'human_readability_score'] = 5.0
df.at[5, 'human_helpfulness'] = 1.0
df.at[5, 'human_correctness'] = 1.0
df.at[5, 'human_logical'] = 1.0

In [210]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0


**Interpretation** 

The generated output is quite detailed and, to a non-SME, appears accurate; the LangChain evaluation captures that correctly. The GPT scores are also high, which is consistent with the human evaluation.
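With six examples scored, a simple way to summarize how well the LangChain criteria track the human labels is the fraction of rows on which they agree. The sketch below uses the binary scores from rows 0 to 5 of the table above; the `agreement_rate` helper is an illustrative assumption, not a function from the notebook.

```python
# Binary scores for rows 0-5, copied from the results table.
langchain = {
    "helpfulness": [0, 0, 1, 0, 1, 1],
    "correctness": [0, 0, 1, 0, 0, 1],
    "logical": [0, 0, 1, 0, 0, 1],
}
human = {
    "helpfulness": [0, 0, 1, 0, 0, 1],
    "correctness": [0, 0, 0, 0, 0, 1],
    "logical": [0, 0, 1, 0, 1, 1],
}


def agreement_rate(predicted, reference):
    """Fraction of positions where the two label lists are equal."""
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / len(reference)


agreement = {name: agreement_rate(langchain[name], human[name]) for name in langchain}
for name, rate in agreement.items():
    print(f"{name}: {rate:.2f}")
```

Each criterion disagrees with the human label on exactly one of the six rows, so six examples are too few to rank the criteria against each other.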

In [211]:
df.to_pickle('eval_df.pkl')

### Example 7 - Do Not Rerun

In [4]:
# read the previous example results which were run above
df = pd.read_pickle('eval_df.pkl')

In [5]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
prompt, generated_text, actual_doc = get_response('ibm/granite-20b-code-instruct-v1', api_key, openai_key, 'verify_policy', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

generated_text='\nIntroduction:\n\nThis API is used to verify the identity of a certificate. It can be used to verify the identity of a certificate against a set of policies, such as the issuer, GitHub workflow trigger, or GitHub workflow SHA.\n\nFunctions:\n\nverify(cert: Certificate, policy: VerificationPolicy) -> VerificationResult:\n\nVerify the identity of the provided certificate against the specified policy.\n\nParameters:\n\ncert (Certificate): The certificate to verify.\npolicy (VerificationPolicy): The policy to use for verification.\n\nReturn Values:\n\nVerificationResult: The result of the verification.\n\nError Handling:\n\nIf the certificate is not valid, the verification will fail with an error message.\n\nIf the policy is not valid, the verification will fail with an error message.\n\nIf the policy is valid but the certificate is not valid for the policy, the verification will fail with an error message.\n\nMake sure to follow this output structure to create API documen

In [7]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


        
Class code:

class _SingleX509ExtPolicy(ABC):
    

    oid: ObjectIdentifier
    

    def __init__(self, value: str) -> None:
        
        se

In [8]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 4 - The generated documentation accurately describes the purpose of the API, the functions, and the parameters. It covers all the necessary information for understanding the API functionality.

Relevance: 4 - The generated documentation is relevant as it provides information about the purpose of the API, its functions, and how to use them. It also includes information about error handling.

Clarity: 4 - The generated documentation is clear and easy to understand. It uses clear language and provides explanations for each function and parameter. The error handling section also clarifies possible error responses.

Completeness: 3 - The generated documentation covers the necessary information about the API's purpose, functions, parameters, return values, and error handling. However, it does not provide detailed descriptions for each policy class and their individual functions.

Readability: 4 - The generated documentation is readable and well-structured. It uses proper formatting

In [9]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)

In [10]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)



{'reasoning': 'The criterion for this task is "helpfulness". The submission is supposed to be helpful, insightful, and appropriate.\n\nLooking at the submission, it provides an introduction to the API, explaining its purpose and intended use. This is helpful for users who are not familiar with the API.\n\nThe submission also documents the \'verify\' function, including its description, parameters, and return values. This is insightful as it provides users with the necessary information to use the function.\n\nThe submission also describes possible error responses and their meanings, which is appropriate as it helps users understand what could go wrong when using the API.\n\nHowever, the submission does not cover all the classes and their methods provided in the input. It only documents the \'verify\' function without specifying which class it belongs to. This could lead to confusion for the users.\n\nTherefore, while the submission is somewhat helpful and appropriate, it is not entirel

In [11]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)

  "def langchain_scores(generated_patch, prompt, actual_doc):\n",


In [12]:
df

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
0,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API provides functiona...,4.0,2.0,5.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
1,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to generate ...,4.0,1.0,3.0,1.0,3.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,,4.0,,4.0,,3.0,,4.0,,,,0.0,,0.0,


In [13]:
# Append Human Scores

# Use numeric values (not strings) so the score columns keep a float dtype
df.at[6, 'human_accuracy_score'] = 2.0
df.at[6, 'human_relevance_score'] = 3.0
df.at[6, 'human_clarity_score'] = 3.0
df.at[6, 'human_completeness_score'] = 2.0
df.at[6, 'human_readability_score'] = 5.0
df.at[6, 'human_helpfulness'] = 0.0
df.at[6, 'human_correctness'] = 0.0
df.at[6, 'human_logical'] = 0.0

In [14]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0


**Interpretation**

This is a good example where LangChain correctly judges that the generated documentation is not relevant to the input Python code. GPT, however, scored the generated output fairly high even though the documentation hallucinates functions that are not part of the class, which is unacceptable.
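The tendency of GPT to overrate outputs can be quantified across all seven examples as the mean absolute gap between the GPT and human scores. The sketch below does this for two criteria, using values copied from rows 0 to 6 of the table above; the `mean_abs_gap` helper is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Accuracy and completeness scores for rows 0-6, copied from the results table.
scores = pd.DataFrame(
    {
        "gpt_accuracy_score": [4, 4, 5, 4, 5, 4, 4],
        "human_accuracy_score": [2, 1, 2, 1, 2, 5, 2],
        "gpt_completeness_score": [4, 3, 5, 4, 5, 5, 3],
        "human_completeness_score": [4, 1, 2, 1, 2, 5, 2],
    },
    dtype=float,
)


def mean_abs_gap(frame: pd.DataFrame, criterion: str) -> float:
    """Mean absolute difference between GPT and human scores for one criterion."""
    gpt = frame[f"gpt_{criterion}_score"]
    human = frame[f"human_{criterion}_score"]
    return float((gpt - human).abs().mean())


mae = {criterion: mean_abs_gap(scores, criterion) for criterion in ("accuracy", "completeness")}
for criterion, value in mae.items():
    print(f"{criterion}: mean |GPT - human| = {value:.2f}")
```

The same helper works on the full `df` once the human score columns hold numeric values.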

In [15]:
df.to_pickle('eval_df.pkl')

### Example 8 - Do Not Rerun

In [16]:
df = pd.read_pickle('eval_df.pkl')

In [17]:
len(df)

7

In [18]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
2,\nYou are an AI system specialized at generati...,1. Introduction: This class is used to represe...,5.0,2.0,5.0,3.0,5.0,3.0,5.0,2.0,5.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0


In [19]:
prompt, generated_text, actual_doc = get_response('OpenAI/gpt3.5', api_key, openai_key, 'verify_policy', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

1. Introduction:
The API provides a set of classes and functions to perform X.509 certificate verification. X.509 is a standard format for public key certificates used in various security protocols. This API allows users to define their own verification policies by using different classes and functions provided.

2. Functions:

2.1. `_SingleX509ExtPolicy` class:
   - Description: This abstract base class represents a single X.509 certificate extension policy. It provides a common structure for validating a specific extension in an X.509 certificate.
   - Parameters:
     - `value` (str): The expected value of the extension.
   - Return Value: An instance of the `VerificationResult` class.

2.2. `OIDCIssuer` class:
   - Description: This class represents an X.509 extension policy for the OIDC Issuer extension. It verifies that the OIDC Issuer extension in the X.509 certificate matches the expected value.
   - Parameters:
     - None
   - Return Value: An instance of the `VerificationRes

In [20]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


        
Class code:

class _SingleX509ExtPolicy(ABC):
    

    oid: ObjectIdentifier
    

    def __init__(self, value: str) -> None:
        
        se

In [21]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 4 - The generated documentation accurately describes the purpose, functionality, parameters, and return values of each class and function. The information provided is consistent with the code.

Relevance: 5 - The generated documentation is relevant as it provides insights into the purpose and usage of each class and function in the provided code.

Clarity: 4 - The generated documentation is clear and explains the purpose, parameters, and return values of each class and function. The descriptions are concise and easy to understand.

Completeness: 3 - The generated documentation covers the main classes and functions in the code. However, it could be improved by providing more details about the specific requirements and constraints of each parameter, as well as the possible values returned by each function.

Readability: 4 - The generated documentation is readable and well-structured. The use of headings and bullet points helps organize the information, making it easier to navig

In [22]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)
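The GPT evaluation output above follows a `Criterion: score - explanation` pattern, so the five scores can be pulled out with a regular expression. The sketch below is not the notebook's `extract_scores` helper; `parse_rubric_scores` is a hypothetical stand-in that only illustrates the idea, and it assumes each criterion starts on its own line as in the printed output.

```python
import re

# Criteria names as they appear in the GPT evaluation output shown above.
CRITERIA = ("Accuracy", "Relevance", "Clarity", "Completeness", "Readability")


def parse_rubric_scores(text):
    """Hypothetical helper: map each criterion to a float score, or None if absent."""
    scores = {}
    for name in CRITERIA:
        # Match e.g. "Accuracy: 4 - The generated ..." at the start of a line.
        match = re.search(rf"^{name}:\s*(\d+(?:\.\d+)?)", text, flags=re.MULTILINE)
        scores[name] = float(match.group(1)) if match else None
    return scores


# Shortened sample in the same shape as the evaluation output (explanations elided).
sample = (
    "Accuracy: 4 - The generated documentation accurately describes ...\n\n"
    "Relevance: 5 - The generated documentation is relevant ...\n\n"
    "Clarity: 4 - The generated documentation is clear ...\n"
)
parsed = parse_rubric_scores(sample)
print(parsed)
```

Returning `None` for a missing criterion, rather than raising, keeps a truncated evaluation from aborting the run and makes the gap visible in the dataframe.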

In [23]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

{'reasoning': 'The criterion for this task is "helpfulness". The submission is evaluated based on whether it is helpful, insightful, and appropriate.\n\nLooking at the submission, it provides a detailed explanation of the API based on the provided Python code. It follows the structure outlined in the task, providing an introduction, documenting each function, and describing error handling.\n\nThe introduction gives a brief overview of the API\'s purpose and its intended use. It explains that the API is for performing X.509 certificate verification, which is a standard format for public key certificates used in various security protocols.\n\nThe function documentation is thorough and detailed. Each class and function is explained, including its purpose, parameters, and return values. This information is crucial for understanding how to use the API and what to expect from each function.\n\nThe error handling section describes the possible error responses and their meanings. This is impor

In [24]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)


In [25]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0
7,\nYou are an AI system specialized at generati...,1. Introduction:\nThe API provides a set of cl...,4.0,,5.0,,4.0,,3.0,,4.0,,1.0,,,,1.0,


In [26]:
# Append Human Scores

# Store scores as floats (not strings) so they can be compared numerically with the GPT and Langchain scores.
df.at[7, 'human_accuracy_score'] = 4.0
df.at[7, 'human_relevance_score'] = 4.0
df.at[7, 'human_clarity_score'] = 3.0
df.at[7, 'human_completeness_score'] = 4.0
df.at[7, 'human_readability_score'] = 5.0
df.at[7, 'human_helpfulness'] = 1.0
df.at[7, 'human_correctness'] = 1.0
df.at[7, 'human_logical'] = 1.0

In [27]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0
7,\nYou are an AI system specialized at generati...,1. Introduction:\nThe API provides a set of cl...,4.0,4.0,5.0,4.0,4.0,3.0,3.0,4.0,4.0,5.0,1.0,1.0,,1.0,1.0,1.0


**Interpretation**

The generated documentation describes the provided classes and functions reasonably well. Langchain scores 1 for helpfulness and logicalness, but returns an undefined value for correctness, even though its reasoning treats the output as fairly correct. The defined Langchain scores align with the human-evaluated scores, and the GPT-evaluated scores are within one point of the human scores on every criterion. This is a good example of all three scoring systems broadly agreeing.
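An undefined score like the one for correctness has to be represented somehow when evaluator output is turned into numbers. The sketch below shows one way to do that, assuming each evaluator result is a dict whose `value` key holds `"Y"` or `"N"` (the `reasoning` and `value` keys are visible in the output above). `verdict_to_score` is a hypothetical helper, not the notebook's `langchain_scores`, and we have not confirmed that a missing verdict is what actually produced the undefined value here.

```python
import math


def verdict_to_score(result):
    """Hypothetical helper: map a result dict to 1.0, 0.0, or NaN when no verdict is present."""
    value = (result or {}).get("value")
    if isinstance(value, str):
        verdict = value.strip().upper()
        if verdict == "Y":
            return 1.0
        if verdict == "N":
            return 0.0
    # Anything else (None, empty string, unexpected text) is treated as undefined.
    return math.nan


# Illustrative results only; the reasoning text is elided.
results = {
    "helpfulness": {"reasoning": "...", "value": "Y"},
    "correctness": {"reasoning": "...", "value": None},
    "logical": {"reasoning": "...", "value": "Y"},
}
scores = {name: verdict_to_score(res) for name, res in results.items()}
print(scores)
```

Using NaN rather than 0 keeps an undefined verdict from being counted as a negative one in later comparisons.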

In [29]:
# save the newly added example
df.to_pickle('eval_df.pkl')

### Example 9 - Do Not Rerun

In [30]:
df = pd.read_pickle('eval_df.pkl')

In [31]:
len(df)

8

In [32]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
3,\nYou are an AI system specialized at generati...,1. Introduction: This API is used to sign and ...,4.0,1.0,5.0,1.0,3.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0
7,\nYou are an AI system specialized at generati...,1. Introduction:\nThe API provides a set of cl...,4.0,4.0,5.0,4.0,4.0,3.0,3.0,4.0,4.0,5.0,1.0,1.0,,1.0,1.0,1.0


In [33]:
prompt, generated_text, actual_doc = get_response('OpenAI/gpt3.5', api_key, openai_key, 'sign', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

**Class Signer**

**Introduction:**
The Signer class is used for signing artifacts using an elliptic curve private key. It interacts with the FulcioCertificateSigningResponse and SigningContext to generate and retrieve certificates for signing. The signing process involves verifying the identity token, retrieving the certificate, and signing the artifact. The Signer class also provides a method to create a transparency log entry for the signed artifact.

**Methods:**

1. \_\_init\_\_:
   - Description: Initializes the Signer class with the identity token, signing context, and an optional cache flag.
   - Parameters:
     - identity_token (IdentityToken): An object representing the identity token.
     - signing_ctx (SigningContext): An object representing the signing context.
     - cache (bool, optional): Flag indicating whether to cache the private key and signing certificate. Defaults to True.
   - Return: None

2. \_private_key:
   - Description: Get the private key used for signin

In [34]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


        
Class code:

class Signer:
    

    def __init__(
        self,
        identity_token: IdentityToken,
        signing_ctx: SigningContext,
      

In [35]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 4 - The generated documentation accurately describes the purpose and functionality of the Signer, SigningContext, and SigningResult classes. It correctly identifies the methods and properties of each class and provides accurate descriptions of their purposes.

Relevance: 5 - The generated documentation is highly relevant to the code provided. It provides an overview of the purpose of each class and its intended use, as well as detailed descriptions of the methods and properties.

Clarity: 4 - The generated documentation is clear and concise. It uses clear language to describe the purpose of each class and its methods. The descriptions of parameters and return values are also clear and provide relevant details.

Completeness: 3 - The generated documentation is mostly complete, but there are some missing details. For example, the documentation could provide more information about the parameters and return values of each method, including their data types and any constraints. Ad

In [36]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)

In [37]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

{'reasoning': 'The criterion for this task is "helpfulness". The submission is to be evaluated based on whether it is helpful, insightful, and appropriate.\n\nLooking at the submission, it is clear that it provides a detailed explanation of the provided Python code. The submission has broken down the code into its individual classes and methods, and provided a detailed explanation for each. This includes the purpose of each class and method, the parameters they take, and the values they return. This is very helpful for someone trying to understand the code.\n\nThe submission is also insightful. It not only explains what the code does, but also provides context on how the different parts of the code interact with each other. For example, it explains how the Signer class interacts with the FulcioCertificateSigningResponse and SigningContext to generate and retrieve certificates for signing.\n\nThe submission is also appropriate. It follows the output structure provided in the task, and p

In [38]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)


In [39]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0
7,\nYou are an AI system specialized at generati...,1. Introduction:\nThe API provides a set of cl...,4.0,4.0,5.0,4.0,4.0,3.0,3.0,4.0,4.0,5.0,1.0,1.0,,1.0,1.0,1.0
8,\nYou are an AI system specialized at generati...,**Class Signer**\n\n**Introduction:**\nThe Sig...,4.0,,5.0,,4.0,,3.0,,4.0,,1.0,,1.0,,1.0,


In [40]:
# Append Human Scores

# Store scores as floats (not strings) so they can be compared numerically with the GPT and Langchain scores.
df.at[8, 'human_accuracy_score'] = 3.0
df.at[8, 'human_relevance_score'] = 4.0
df.at[8, 'human_clarity_score'] = 3.0
df.at[8, 'human_completeness_score'] = 3.0
df.at[8, 'human_readability_score'] = 4.0
df.at[8, 'human_helpfulness'] = 1.0
df.at[8, 'human_correctness'] = 1.0
df.at[8, 'human_logical'] = 1.0

In [41]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0
7,\nYou are an AI system specialized at generati...,1. Introduction:\nThe API provides a set of cl...,4.0,4.0,5.0,4.0,4.0,3.0,3.0,4.0,4.0,5.0,1.0,1.0,,1.0,1.0,1.0
8,\nYou are an AI system specialized at generati...,**Class Signer**\n\n**Introduction:**\nThe Sig...,4.0,3.0,5.0,4.0,4.0,3.0,3.0,3.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0


**Interpretation**

Here again, the generated documentation seems fairly accurate in describing the API code provided. Langchain scores 1 for helpfulness, correctness and logicalness, and the human-evaluated scores agree. Note, however, that the human scores for accuracy, relevance and clarity are each one point lower than the GPT-evaluated scores, while completeness and readability match. This suggests that human scoring reflects an SME's perspective on the generated documentation, which the GPT evaluation does not capture.
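The per-criterion gap between the GPT and human scores can be read off the table, but computing it makes the comparison explicit. The sketch below uses the row 8 values from the `df.tail()` output above; `score_gaps` is a hypothetical helper, and the dictionaries are typed in by hand rather than read from `df`.

```python
# Row 8 scores copied from the df.tail() output above.
gpt_scores = {
    "accuracy": 4.0, "relevance": 5.0, "clarity": 4.0,
    "completeness": 3.0, "readability": 4.0,
}
human_scores = {
    "accuracy": 3.0, "relevance": 4.0, "clarity": 3.0,
    "completeness": 3.0, "readability": 4.0,
}


def score_gaps(model_scores, reference_scores):
    """Hypothetical helper: model score minus reference score for each criterion."""
    return {name: model_scores[name] - reference_scores[name] for name in model_scores}


gaps = score_gaps(gpt_scores, human_scores)
# Criteria on which GPT scored the documentation higher than the human evaluator.
overrated = [name for name, gap in gaps.items() if gap > 0]
print(gaps)
print(overrated)
```

A positive gap means GPT was more generous than the human evaluator on that criterion.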

In [43]:
# save the newly added example
df.to_pickle('eval_df.pkl')

### Example 10 - Do Not Rerun

In [44]:
df = pd.read_pickle('eval_df.pkl')

In [45]:
len(df)

9

In [46]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
4,\nYou are an AI system specialized at generati...,**Introduction:**\n\nThe `LogInclusionProof` c...,5.0,2.0,5.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0
7,\nYou are an AI system specialized at generati...,1. Introduction:\nThe API provides a set of cl...,4.0,4.0,5.0,4.0,4.0,3.0,3.0,4.0,4.0,5.0,1.0,1.0,,1.0,1.0,1.0
8,\nYou are an AI system specialized at generati...,**Class Signer**\n\n**Introduction:**\nThe Sig...,4.0,3.0,5.0,4.0,4.0,3.0,3.0,3.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0


In [47]:
prompt, generated_text, actual_doc = get_response('ibm/granite-20b-code-instruct-v1', api_key, openai_key, 'verify_models', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

generated_text='1. Introduction: This API is designed to verify the integrity of the materials used in the creation of a digital signature. It provides a simple and secure way to verify the authenticity and integrity of the materials used in the creation of a digital signature.\n\n2. Functions:\n\nverify_signature:\n\nDescription: Verifies the integrity of the materials used in the creation of a digital signature.\n\nParameters:\n\n- materials: The materials used in the creation of the digital signature, including the signature itself, the signing key, and the certificate.\n- signature_format: The format of the signature, such as "JWS" or "COSE".\n- signature_algorithm: The algorithm used to create the signature, such as "RS256" or "ES256".\n- signing_key: The public key used to verify the signature.\n- certificate: The certificate used to verify the signature.\n- rekor_entry: The Rekor entry associated with the signature, if available.\n\nReturn Values:\n\n- A VerificationResult objec

In [48]:
print("\n Prompt \n", prompt)


 Prompt 
 
You are an AI system specialized at generating API documentation for the provided Python code. You will be provided functions, classes, or Python scripts. Your documentation should include:

1. Introduction: Briefly describe the purpose of the API and its intended use.
2. Functions: Document each API function, including:
    - Description: Clearly explain what the endpoint or function does.
    - Parameters: List and describe each parameter, including data types and any constraints.
    - Return Values: Specify the data type and possible values returned.

3. Error Handling: Describe possible error responses and their meanings.

Make sure to follow this output structure to create API documentation that is clear, concise, accurate, and user-centric. Avoid speculative information and prioritize accuracy and completeness.


        
Class code:

class VerificationResult(BaseModel):
    

    success: bool
    

    def __bool__(self) -> bool:
        
        return self.succes

In [49]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

Accuracy: 5 - The generated documentation accurately describes the purpose, functions, and error handling of the API. It correctly identifies the parameters, return values, and possible error conditions based on the code provided.

Relevance: 5 - The generated documentation is relevant to the code provided. It covers all the necessary information about the API, including its purpose, functions, parameters, return values, and possible error conditions.

Clarity: 4 - The generated documentation is clear and easy to understand. It provides clear descriptions of the API's functions, parameters, return values, and error conditions. However, there are a few minor formatting issues and inconsistencies in indentation that could be improved for better clarity.

Completeness: 4 - The generated documentation is mostly complete and includes all the necessary information about the API. It covers the introduction, functions, parameters, return values, and error handling. However, some additional det

In [50]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)

In [51]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

{'reasoning': 'The criterion for this task is "helpfulness". The submission is to be evaluated based on whether it is helpful, insightful, and appropriate.\n\nLooking at the submission, it provides a detailed explanation of the API, its functions, parameters, return values, and error handling. It gives a clear understanding of what the API does, how it works, and what to expect in different scenarios. \n\nThe introduction gives a brief overview of the API\'s purpose and its intended use. The functions section provides a detailed explanation of the function, its parameters, and return values. The error handling section explains the possible errors that can occur and their meanings. \n\nThe submission is insightful as it provides a deep understanding of the API, its functions, and possible errors. It is also appropriate as it follows the output structure provided in the task.\n\nTherefore, the submission meets the criterion of being helpful, insightful, and appropriate.\n\nY', 'value': '

In [52]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)


In [53]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0
7,\nYou are an AI system specialized at generati...,1. Introduction:\nThe API provides a set of cl...,4.0,4.0,5.0,4.0,4.0,3.0,3.0,4.0,4.0,5.0,1.0,1.0,,1.0,1.0,1.0
8,\nYou are an AI system specialized at generati...,**Class Signer**\n\n**Introduction:**\nThe Sig...,4.0,3.0,5.0,4.0,4.0,3.0,3.0,3.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0
9,\nYou are an AI system specialized at generati...,1. Introduction: This API is designed to verif...,5.0,,5.0,,4.0,,4.0,,5.0,,1.0,,0.0,,0.0,


In [54]:
# Append Human Scores

# Store scores as floats (not strings) so they can be compared numerically with the GPT and Langchain scores.
df.at[9, 'human_accuracy_score'] = 2.0
df.at[9, 'human_relevance_score'] = 2.0
df.at[9, 'human_clarity_score'] = 2.0
df.at[9, 'human_completeness_score'] = 2.0
df.at[9, 'human_readability_score'] = 4.0
df.at[9, 'human_helpfulness'] = 0.0
df.at[9, 'human_correctness'] = 0.0
df.at[9, 'human_logical'] = 0.0

In [55]:
df.tail()

Unnamed: 0,prompt,response,gpt_accuracy_score,human_accuracy_score,gpt_relevance_score,human_relevance_score,gpt_clarity_score,human_clarity_score,gpt_completeness_score,human_completeness_score,gpt_readability_score,human_readability_score,langchain_helpfulness,human_helpfulness,langchain_correctness,human_correctness,langchain_logical,human_logical
5,\nYou are an AI system specialized at generati...,1. Introduction:\nThe Error class is a base cl...,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0
6,\nYou are an AI system specialized at generati...,\nIntroduction:\n\nThis API is used to verify ...,4.0,2.0,4.0,3.0,4.0,3.0,3.0,2.0,4.0,5.0,,0.0,0.0,0.0,0.0,0.0
7,\nYou are an AI system specialized at generati...,1. Introduction:\nThe API provides a set of cl...,4.0,4.0,5.0,4.0,4.0,3.0,3.0,4.0,4.0,5.0,1.0,1.0,,1.0,1.0,1.0
8,\nYou are an AI system specialized at generati...,**Class Signer**\n\n**Introduction:**\nThe Sig...,4.0,3.0,5.0,4.0,4.0,3.0,3.0,3.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0
9,\nYou are an AI system specialized at generati...,1. Introduction: This API is designed to verif...,5.0,2.0,5.0,2.0,4.0,2.0,4.0,2.0,5.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0


**Interpretation**

This is an interesting example because the generated documentation is only partially correct. It gives a reasonable general description of the API, but it fails to capture all the functions/classes in the input Python code. Langchain caught this and assigned a score of 0 for both correctness and logicalness. It still scores 1 for helpfulness, since the document provides a fairly good overview of the input code. The human-evaluated scores are comparatively similar to the Langchain scores, but considerably lower than the GPT-evaluated scores.
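To see how the scoring systems compare across several examples rather than one, the rows in the `df.tail()` output above can be summarised. The sketch below copies rows 5 to 9 by hand (it does not read `df`), counts how often the Langchain and human verdicts agree while skipping undefined Langchain values, and computes the mean absolute difference between GPT and human accuracy scores. Both helper functions are hypothetical.

```python
# (langchain, human) pairs for helpfulness, correctness, logical, copied from rows 5-9 above.
# None marks an undefined Langchain score.
binary_rows = {
    5: [(1, 1), (1, 1), (1, 1)],
    6: [(None, 0), (0, 0), (0, 0)],
    7: [(1, 1), (None, 1), (1, 1)],
    8: [(1, 1), (1, 1), (1, 1)],
    9: [(1, 0), (0, 0), (0, 0)],
}
# (gpt, human) accuracy scores for the same rows.
accuracy_rows = {5: (4, 5), 6: (4, 2), 7: (4, 4), 8: (4, 3), 9: (5, 2)}


def agreement_rate(rows):
    """Return (matching verdicts, compared verdicts), ignoring pairs with an undefined value."""
    pairs = [pair for row in rows.values() for pair in row if None not in pair]
    matches = sum(1 for a, b in pairs if a == b)
    return matches, len(pairs)


def mean_abs_diff(rows):
    """Mean absolute difference between the two scores in each row."""
    return sum(abs(a - b) for a, b in rows.values()) / len(rows)


matches, compared = agreement_rate(binary_rows)
print(f"Langchain vs human: {matches}/{compared} verdicts agree")
print(f"GPT vs human accuracy, mean absolute difference: {mean_abs_diff(accuracy_rows):.1f}")
```

With only five rows these figures are indicative at best, but the same calculation could be applied to the full dataframe once more examples are scored.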

In [56]:
# save the newly added example
df.to_pickle('eval_df.pkl')

## Copy this section, modify and run from here

### Example X 

In [None]:
df = pd.read_pickle('eval_df.pkl')

In [None]:
df.head()

In [None]:
len(df)

In [None]:
prompt, generated_text, actual_doc = get_response('ibm/granite-20b-code-instruct-v1', api_key, openai_key, '<file-name>', instruction, functions=False, classes=False, documentation=False, imports=False, other=False, functions_code=False, functions_doc=False, classes_code=True, classes_doc=False)

In [None]:
print("\n Prompt \n", prompt)

In [None]:
print("\n Generated Text \n", generated_text)

In [None]:
gpt_score = eval_using_model(generated_text, openai_key=openai_key, initial_prompt=prompt)

In [None]:
gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score = extract_scores(gpt_score)

In [None]:
langchain_helpfulness, langchain_correctness, langchain_logical = langchain_scores(generated_text, prompt, actual_doc)

In [None]:
df = append_row_to_dataframe(df, prompt, generated_text, gpt_accuracy_score, gpt_relevance_score, gpt_clarity_score, gpt_completeness_score, gpt_readability_score, langchain_helpfulness, langchain_correctness, langchain_logical)

In [None]:
df.tail()

In [None]:
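# Append Human Scores
# Replace X with the index label of the newly appended row (the last row shown by df.tail() above).
# The values below are placeholders; enter floats rather than strings so the scores stay numeric.

df.at[X, 'human_accuracy_score'] = 2.0
df.at[X, 'human_relevance_score'] = 3.0
df.at[X, 'human_clarity_score'] = 4.0
df.at[X, 'human_completeness_score'] = 4.0
df.at[X, 'human_readability_score'] = 5.0
df.at[X, 'human_helpfulness'] = 0.0
df.at[X, 'human_correctness'] = 0.0
df.at[X, 'human_logical'] = 0.0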
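Because human scores are typed in by hand, a quick sanity check before writing them into the dataframe may catch typos. The sketch below assumes the rubric criteria use a 1 to 5 scale and the Langchain-style criteria are 0 or 1, which matches every value seen in this notebook but is not stated explicitly. `validate_human_scores` is a hypothetical helper and expects numeric values, not strings.

```python
# Column names taken from the dataframe header above.
RUBRIC_COLUMNS = (
    "human_accuracy_score",
    "human_relevance_score",
    "human_clarity_score",
    "human_completeness_score",
    "human_readability_score",
)
BINARY_COLUMNS = ("human_helpfulness", "human_correctness", "human_logical")


def validate_human_scores(scores):
    """Hypothetical helper: return a list of problems; an empty list means the scores look valid."""
    problems = []
    for column in RUBRIC_COLUMNS + BINARY_COLUMNS:
        if column not in scores:
            problems.append(f"missing {column}")
    for column, value in scores.items():
        if column in RUBRIC_COLUMNS:
            # Assumed scale: 1 (worst) to 5 (best).
            if not 1.0 <= value <= 5.0:
                problems.append(f"{column}={value} is outside the assumed 1-5 range")
        elif column in BINARY_COLUMNS:
            if value not in (0.0, 1.0):
                problems.append(f"{column}={value} is not 0 or 1")
        else:
            problems.append(f"unknown column {column}")
    return problems


# Placeholder values from the template cell above.
example_scores = {
    "human_accuracy_score": 2.0,
    "human_relevance_score": 3.0,
    "human_clarity_score": 4.0,
    "human_completeness_score": 4.0,
    "human_readability_score": 5.0,
    "human_helpfulness": 0.0,
    "human_correctness": 0.0,
    "human_logical": 0.0,
}
problems = validate_human_scores(example_scores)
print(problems)
```

If the list is empty, the same dictionary can be written into the row with one `df.at[X, column] = value` assignment per entry.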

In [None]:
df.tail()

In [None]:
# save the newly added example
df.to_pickle('eval_df.pkl')