# Evaluating the model

In order to evaluate how good a model is we need to think in percentages: 
- "On average, the model responds approprietly to questions"

In [1]:
import myllm.gpt as gpt
import myllm.util
import myllm.data as data
import json
import torch
import tiktoken

In [4]:
# SETUP OVER MODEL


# load model
device = torch.device("cpu")
tokenizer = tiktoken.get_encoding('gpt2')

gpt_config = myllm.util.gpt_config("gpt2-medium")
gpt_config.update({'qkv_bias': True})
model = gpt.GPTModel(gpt_config)

# fine tuned model
model_state_dict = torch.load("model_instruction.pth", map_location=device)
model.load_state_dict(model_state_dict)

# load dataset 
train_data, val_data, test_data = data.split_instruction_data("data/instruction-data.json")



In [11]:
# We iterate over the entire trest set to validate answers
from tqdm import tqdm 

for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
    input_text = data.format_instruction_input(entry)

    token_ids = model.generate(
        idx=myllm.util.text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=gpt_config["context_length"],
        eos_id=data.PAD_TOKEN_ID,
    ).to(device)

    generated_text = myllm.util.token_ids_to_text(token_ids, tokenizer)

    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )

    test_data[i]["model_response"] = response_text



100%|██████████| 110/110 [1:59:29<00:00, 65.18s/it]


In [12]:

print(test_data[0])
with open("data/instruction-data.json", "w") as file:
    json.dump(test_data, file, indent=4)


{'instruction': 'Rewrite the sentence using a simile.', 'input': 'The car is very fast.', 'output': 'The car is as fast as lightning.', 'model_response': ''}


In [13]:
# save information
import re

file_name = f"{re.sub(r'[ ()]', '', '355M')}-sft.pth"
torch.save(model.state_dict(), file_name)


## Accessing Ollama 

in order to validade we need another llm to work with us

In [15]:
import psutil

def check_if_running(process_name):
    running = False
    for proc in psutil.process_iter(["name"]):
        if process_name in proc.info["name"]:
            running = True
            break
    return running

ollama_running = check_if_running("ollama")

if not ollama_running:
    raise RuntimeError(
        "Ollama not running. Launch ollama"
    )

print("Ollama? ", check_if_running("ollama"))


Ollama?  True


In [35]:
import urllib.request 

def query_model(prompt, model="llama3.2:latest", 
                url="http://localhost:11434/api/chat"):
    data = {
        "model": model, 
        "messages": [{
            "role": "user", "content": prompt
        }],
        "options": {
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        } 
    }

    payload = json.dumps(data).encode("utf-8")
    request = urllib.request.Request( 
        url, data=payload, method="POST"
    )

    request.add_header("Content-Type", "application/json")

    response_data = ""
    with urllib.request.urlopen(request) as response:
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data



In [31]:
model = "llama3.2:latest"
result = query_model("What do Llamas eat?", model)
print(result)

<urllib.request.Request object at 0x149cf4bd0>
Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and grassy weeds.
2. Hay: High-quality hay, such as timothy hay or alfalfa hay, is a staple in a llama's diet. It provides essential nutrients like fiber, protein, and vitamins.
3. Grains: Llamas may also be fed grains like oats, barley, or corn, but these should not make up more than 10% of their diet.
4. Fruits and vegetables: Fresh fruits and vegetables, such as apples, carrots, and sweet potatoes, can be given to llamas as treats or added to their hay.
5. Browse: Llamas may also eat browse, which includes leaves, twigs, and other vegetation from trees and shrubs.

It's essential to note that llamas have a unique digestive system, with a four-chambered stomach, which allows them to break down and extract nutrients from p

## Loading the entries from our fine tuned model

In [37]:
import re

for entry in test_data[:3]:
    prompt = (
        f"Given the input '{data.format_instruction_input(entry)}'"
        f" and the correct output '{entry['output']}', "
        f"score the model response '{entry['model_response']}'"
        f" on a scale from 0 to 100, where 100 is the best score"
    )
    print("Dataset response:")
    print(">> ", entry['output'])
    print("\nModel: ")
    print(">> ", entry["model_response"])
    print("\nScore: ")
    print(">> ", query_model(prompt, model))
    print("--------------------------------\n ")


Dataset response:
>>  The car is as fast as lightning.

Model: 
>>  

Score: 
>>  To complete the request, I will rewrite the input sentence using a simile.

### Input:
"The car is very fast."

### Output:
"The car is as fast as lightning."

Now, let's evaluate the model response. Since there was no actual input to score, I'll assume it's a generic example. Based on this, I would give the model response a score of 100, as it correctly uses a simile to describe the speed of the car.
--------------------------------
 
Dataset response:
>>  The type of cloud typically associated with thunderstorms is cumulonimbus.

Model: 
>>  What type of clouds?
What type of clouds?
What type of clouds?
What type of clouds?
What type of clouds?

Score: 
>>  ### Response:
The type of cloud typically associated with thunderstorms is cumulonimbus.

### Score: 100/100

This response meets all the requirements:

1. It directly answers the question.
2. The answer is accurate and relevant to the topic (cumulon

In [40]:
def generate_model_scores(json_data, json_key):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input '{data.format_instruction_input(entry)}'"
            f" and the correct output '{entry['output']}', "
            f"score the model response '{entry['model_response']}'"
            f" on a scale from 0 to 100, where 100 is the best score"
            f" Respon only with a number"
        )
        score = query_model(prompt)
        try:
            scores.append(int(score))
        except ValueError:
            print(f"could not convert {score}")
            continue
    
    return scores

In [41]:
scores = generate_model_scores(test_data, "model_response")
print(f"Total scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")

Scoring entries:   1%|          | 1/110 [00:01<02:56,  1.62s/it]

could not convert The car is like lightning. 

Score: 60


Scoring entries:   5%|▌         | 6/110 [00:03<00:44,  2.32it/s]

could not convert The rewritten sentence is: "The lecture was delivered clearly."

Score: 90


Scoring entries:  18%|█▊        | 20/110 [00:09<00:43,  2.07it/s]

could not convert I have traveled without a map. 

Score: 100


Scoring entries:  24%|██▎       | 26/110 [00:12<01:03,  1.31it/s]

could not convert I can provide a corrected response.

### Instruction:
Define the term 'hyperbole'.

Hyperbole is a figure of speech that involves an exaggeration of ideas for the sake of emphasis.

The term 'bole' on a scale from 0 to 100, where 100 is the best score. 80


Scoring entries:  29%|██▉       | 32/110 [00:15<00:40,  1.95it/s]

could not convert It's very easy. 

90


Scoring entries:  37%|███▋      | 41/110 [00:19<00:29,  2.36it/s]

could not convert The company's approach to product development has been highly innovative. 

Score: 80


Scoring entries:  41%|████      | 45/110 [00:21<00:24,  2.69it/s]

could not convert The transient nature of her visit left a lasting impression. 

Score: 90


Scoring entries:  48%|████▊     | 53/110 [00:24<00:25,  2.21it/s]

could not convert Vehicles: Bicycle
Plants: Rose
Animals: Tiger

Score: 100


Scoring entries:  53%|█████▎    | 58/110 [00:26<00:17,  3.05it/s]

could not convert He will be reading a novel inspired by his grandmother. 

Score: 100


Scoring entries:  55%|█████▍    | 60/110 [00:27<00:15,  3.24it/s]

could not convert The outcome of this situation was inevitable given the escalating tensions. 

Score: 90


Scoring entries:  56%|█████▋    | 62/110 [00:28<00:17,  2.82it/s]

could not convert I can't fulfill this request.


Scoring entries:  76%|███████▋  | 84/110 [00:38<00:12,  2.11it/s]

could not convert He remained very calm. 

90


Scoring entries:  95%|█████████▍| 104/110 [00:48<00:03,  1.70it/s]

could not convert I can't fulfill this request.


Scoring entries:  95%|█████████▌| 105/110 [00:48<00:03,  1.59it/s]

could not convert 80.


Scoring entries:  99%|█████████▉| 109/110 [00:50<00:00,  2.08it/s]

could not convert She never forgets to call. 

Score: 100


Scoring entries: 100%|██████████| 110/110 [00:52<00:00,  2.10it/s]

could not convert I can't provide a response that includes random characters or numbers in the format you requested. However, I can provide the correct model response and score:

Correct Model Response:
"50 miles per hour is approximately 80.47 kilometers per hour."

Score: 100
Total scores: 94 of 110
Average score: 45.61




