<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Evaluating Instruction Responses Using the OpenAI API

- This notebook uses OpenAI's GPT-4 API to evaluate responses by instruction-finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:



```python
{
    "instruction": "What is the atomic number of helium?",
    "input": "",
    "output": "The atomic number of helium is 2.",               # <-- The target given in the test set
    "model 1 response": "\nThe atomic number of helium is 2.0.", # <-- Response by an LLM
    "model 2 response": "\nThe atomic number of helium is 3."    # <-- Response by a 2nd LLM
},
```

In [None]:
# pip install -r requirements-extra.txt

In [1]:
from importlib.metadata import version

pkgs = ["openai",  # OpenAI API
        "tqdm",    # Progress bar
        ]

for p in pkgs:
    print(f"{p} version: {version(p)}")

openai version: 1.61.1
tqdm version: 4.67.1


## Test the OpenAI API

- First, let's test if the OpenAI API is correctly set up.
- If you don't have an account yet, you need to create one at https://platform.openai.com/
- Note that you will also have to add funds to your account because the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)
- Running the experiments and creating about 200 evaluations using the code in this notebook costs about $0.26 (26 cents) as of this writing.

- First, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys
- Make sure not to share this key with anyone!
- Add this secret key (`"sk-..."`) to the `config.json` file in this folder, or, when running on Google Colab as in the code below, store it as a Colab secret named `OPENAI_API_KEY`.

In [2]:
import json
from openai import OpenAI

# Option 1 (local run): load the API key from a JSON file.
# Make sure to replace "sk-..." with your actual API key from https://platform.openai.com/api-keys
# with open("config.json", "r") as config_file:
#     config = json.load(config_file)
#     api_key = config["OPENAI_API_KEY"]
# client = OpenAI(api_key=api_key)

# Option 2 (Google Colab): read the API key from the Colab secrets manager.
from google.colab import userdata
client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
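
- Outside of Colab, the key could instead come from `config.json` or from an environment variable. The following is a minimal sketch under that assumption; `load_api_key` is a hypothetical helper (it is not part of the `openai` package), and the `OPENAI_API_KEY` field name simply mirrors the commented-out `config.json` code above:

```python
import json
import os
from pathlib import Path


def load_api_key(config_path="config.json", key_name="OPENAI_API_KEY"):
    """Return the API key from a JSON config file, else from the environment.

    Hypothetical helper for illustration only; the file name and key name
    are assumptions that mirror the commented-out config.json code.
    """
    path = Path(config_path)
    if path.exists():
        with open(path, "r") as config_file:
            config = json.load(config_file)
        key = config.get(key_name)
        if key:
            return key

    # Fall back to an environment variable of the same name
    key = os.environ.get(key_name)
    if key:
        return key

    raise RuntimeError(
        f"No API key found in {config_path} or in the {key_name} environment variable."
    )
```

- With this helper, the client could be created as `client = OpenAI(api_key=load_api_key())`, so the key never has to be written into the notebook itself.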

- First, let's try the API with a simple example to make sure it works as intended:

In [14]:
def run_chatgpt(prompt, client, model="gpt-4o-mini"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        seed=123,
    )
    return response.choices[0].message.content


prompt = "Respond with 'hello world' if you got this message."
run_chatgpt(prompt, client)

'hello world'

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load JSON Entries

- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:

In [5]:
json_file = "instruction-data-with-response.json"

with open(json_file, "r") as file:
    json_data = json.load(file)

print("Number of entries:", len(json_data))

Number of entries: 110


- The structure of this file is as follows, where we have the response given in the test set (`'output'`) and the response generated by the model (`'model_response'`):

In [6]:
json_data[0]

{'instruction': 'Rewrite the sentence using a simile.',
 'input': 'The car is very fast.',
 'output': 'The car is as fast as lightning.',
 'model_response': 'The car is as fast as a bullet.'}
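
- Since every scored entry costs an API call, it can be worth checking that all entries have the expected fields before starting. Below is a minimal sketch; `find_incomplete_entries` and `REQUIRED_KEYS` are hypothetical names introduced for illustration, and the key list is an assumption based on the entry shown above:

```python
# Assumed set of required fields, taken from the example entry above
REQUIRED_KEYS = ("instruction", "input", "output", "model_response")


def find_incomplete_entries(entries, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for entries lacking a required key."""
    problems = []
    for idx, entry in enumerate(entries):
        missing = [key for key in required_keys if key not in entry]
        if missing:
            problems.append((idx, missing))
    return problems


example_entries = [
    {
        "instruction": "Rewrite the sentence using a simile.",
        "input": "The car is very fast.",
        "output": "The car is as fast as lightning.",
        "model_response": "The car is as fast as a bullet.",
    },
    {
        # This entry has no model response yet
        "instruction": "What is the atomic number of helium?",
        "input": "",
        "output": "The atomic number of helium is 2.",
    },
]

print(find_incomplete_entries(example_entries))
# [(1, ['model_response'])]
```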

- Below is a small utility function that formats the input for visualization purposes later:

In [7]:
def format_input(entry):
    # Alpaca-style prompt header followed by the instruction itself
    instruction_text = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    # The input section is only added when the entry has a non-empty input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
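
- To make the formatting concrete, the sketch below repeats the same logic as `format_input` (so that the block runs on its own) and applies it to the first dataset entry shown earlier, plus the helium example from the introduction, which has an empty input:

```python
def format_input(entry):
    # Same logic as the utility function above, repeated to keep this block self-contained
    instruction_text = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


with_input = {
    "instruction": "Rewrite the sentence using a simile.",
    "input": "The car is very fast.",
}
without_input = {
    "instruction": "What is the atomic number of helium?",
    "input": "",
}

print(format_input(with_input))
print("\n=====\n")
print(format_input(without_input))  # no "### Input:" section is emitted here
```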

- Now, let's try the OpenAI API to compare the model responses (we only evaluate the first 5 responses for a visual comparison):

In [10]:
for entry in json_data[:5]:
    prompt = (f"Given the input `{format_input(entry)}` "
              f"and correct output `{entry['output']}`, "
              f"score the model response `{entry['model_response']}`"
              f" on a scale from 0 to 100, where 100 is the best score. "
              )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model_response"])
    print("\nScore:")
    print(">>", run_chatgpt(prompt, client))
    print("\n-------------------------")


Dataset response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a bullet.

Score:
>> To score the model response "The car is as fast as a bullet." based on the instruction to rewrite the sentence using a simile, we need to consider the effectiveness and appropriateness of the simile used.

1. **Accuracy of Simile**: The original sentence "The car is very fast." is effectively transformed into a simile by comparing the car's speed to something universally recognized as fast. Both "lightning" and "a bullet" are commonly used in similes to denote high speed. Thus, the model's choice of "a bullet" is accurate and maintains the meaning of the original sentence.

2. **Relevance and Common Usage**: The simile "as fast as a bullet" is a standard and widely understood comparison that conveys extreme speed effectively. It is comparable in effectiveness to "as fast as lightning," which was the given correct output.

3. **Creativity and Clarity**: The simile used b

- Note that the responses are very verbose; to quantify which model is better, we only want to return the scores:

In [11]:
from tqdm import tqdm


def generate_model_scores(json_data, json_key, client):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the number only."
        )
        score = run_chatgpt(prompt, client)
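        try:
            scores.append(int(score))
        except ValueError:
            # Skip replies that are not a plain integer, but report them
            print(f"Could not convert score: {score}")
            continue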

    return scores
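
- The `int(score)` conversion above only succeeds when the reply is a bare integer, so a reply such as `"Score: 85."` would be skipped. A more forgiving variant could pull the first integer out of the reply. This is a heuristic sketch, not part of the original code; `parse_score` is a hypothetical helper, and taking the first integer is an assumption that can pick the wrong number if the reply mentions the scale before the score:

```python
import re


def parse_score(text, low=0, high=100):
    """Return the first integer found in `text` if it lies in [low, high], else None."""
    match = re.search(r"\d+", text)
    if match is None:
        return None
    value = int(match.group())
    # Reject numbers outside the requested scoring scale
    return value if low <= value <= high else None


print(parse_score("85"))          # 85
print(parse_score("Score: 90."))  # 90
print(parse_score("n/a"))         # None
print(parse_score("250"))         # None (outside the 0 to 100 scale)
```

- Inside `generate_model_scores`, the `try`/`except` block could then be replaced by a check on whether `parse_score(score)` returns `None`.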

- Please note that the response scores may vary between runs because OpenAI's GPT models are not fully deterministic, even with `temperature=0.0` and a fixed `seed`.

- Let's now apply this evaluation to the whole dataset and compute the average score of the model's responses:

In [15]:
from pathlib import Path

#for model in ("model_response", "model 2 response"):
model = "model_response"

scores = generate_model_scores(json_data, model, client)
print(f"\n{model}")
print(f"Number of scores: {len(scores)} of {len(json_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")

# Optionally save the scores
save_path = Path("scores") / f"gpt4-{model.replace(' ', '-')}.json"
save_path.parent.mkdir(parents=True, exist_ok=True)  # create "scores/" if it does not exist yet
with open(save_path, "w") as file:
    json.dump(scores, file)

Scoring entries: 100%|██████████| 110/110 [00:54<00:00,  2.01it/s]


model_response
Number of scores: 110 of 110
Average score: 41.32

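- Based on the evaluation above, the model's responses (`model_response`) receive an average score of 41.32 out of 100; since only one model was scored in this run, a comparison with a second model would require scoring its responses in the same way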
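
- The average alone can hide how the scores are distributed, and `sum(scores)/len(scores)` raises a `ZeroDivisionError` if no reply could be converted to a number. Below is a minimal sketch of a slightly more defensive summary; `summarize_scores` is a hypothetical helper introduced here for illustration, and the example scores are made up rather than taken from the run above:

```python
from statistics import mean, median


def summarize_scores(scores, num_entries):
    """Return basic statistics for a list of integer scores.

    `num_entries` is the size of the dataset, used to report how many
    entries were skipped because their score could not be parsed.
    """
    summary = {
        "scored": len(scores),
        "skipped": num_entries - len(scores),
        "mean": None,
        "median": None,
        "min": None,
        "max": None,
    }
    if scores:  # avoid dividing by zero when nothing was scored
        summary.update(
            mean=mean(scores),
            median=median(scores),
            min=min(scores),
            max=max(scores),
        )
    return summary


# Made-up scores for illustration only
print(summarize_scores([0, 50, 100], num_entries=4))
```

- Applied to the notebook's variables, this would be called as `summarize_scores(scores, len(json_data))`.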