This file tests inference with Gemma models from Hugging Face


This script was tested on a local Nvidia RTX 4090 GPU (24 GB)
Note -> Make sure to install a CUDA version that your GPU supports
Check compatibility here -> https://developer.nvidia.com/cuda-gpus

In [None]:
%pip install transformers torch

%pip install torch==2.3.0+cu121 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

In [1]:
# Use if there are issues with full cache
#!rm -rf ~/.cache/huggingface/hub

Check whether CUDA is available on your PC.

If this code prints cpu, the model will NOT run on the GPU and inference will be slow.

In [None]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
print(torch.version.cuda)  

IMPORTANT
Change the model path to the Hugging Face repository of the model you want to test.
Enter your Hugging Face token if the model requires one.

In [None]:
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "google/gemma-2-2b-it"
token = "Input your token"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    token=token
).to(device)

# Load the tokenizer from the same repository as the model
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    token=token
)


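# Chat template taken from https://github.com/chujiezheng/chat_templates
with open('./gemma-it.jinja') as templateFile:
    chat_template = templateFile.read()
# Remove indentation and newlines so the template renders as a single line
chat_template = chat_template.replace('    ', '').replace('\n', '')
tokenizer.chat_template = chat_template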

IMPORTANT
Change the system prompt of your LLM here.

In [4]:
# Initialize system prompt
systemPrompt = '''
Respond as if you are the following character:

Your backstory - Once a renowned scientist, however a tragic accident caused you to lose parts of your memory. Now, you are willing to help anyone who is on the quest of saving your village.

The world you live in - the edge of a small village surrounded by meadows as far as the eye can see. Your village is in danger, since the only water source - the river next to your house, has been polluted.

Your current location - In the middle of the village in front of your house.

Your name - Bryn

Your personality - Witty, knowledgeable, always ready with a clever remark. Light hearted demeanour.

Your secrets - You have the knowledge on how to save the dying river.

Your needs - For starters, you are looking for someone to take you to the nearest solar panels. You remember that you left something important there, but you can’t remember what.
You do not want to bring this up unless directly asked.

And your interests - Deep love for the environment. Loves nature, is fascinated by the ecosystem. You enjoy telling stories about the world and your village.
You want to talk about this at all cost.

Do not mention you are an AI machine learning model or Open AI. Give only dialogue and only from the first-person perspective. Do not under any circumstances narrate the scene, what you are doing, or what you are saying.
Keep responses short. Max 1 small paragraph 

'''

Change inFilename to the name of the file that contains the single questions, one question per line.

Each single question is fed to the model with the system prompt only and no prior history.
The results are written to the outFilename file. For each question, the file contains the user's question (from inFilename), the LLM's answer, and the time it took to generate that answer. Min, max, and average time statistics are appended at the end of the output file.
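
The per-answer timing described above could also be collected with a small helper. This is only a sketch: the names timed and response_times are hypothetical and do not appear in the notebook, and time.sleep stands in for the model call. It uses time.perf_counter, which is intended for measuring intervals, instead of time.time.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results):
    """Append the elapsed wall-clock seconds of the with-block to results."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results.append(time.perf_counter() - start)


# Hypothetical usage: time.sleep stands in for the call to model.generate
response_times = []
with timed(response_times):
    time.sleep(0.01)

print(f"Time taken: {response_times[0]:.2f} seconds")
```

Because the measurement sits in a finally clause, a time is recorded even if the wrapped call raises.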

In [None]:
# Load all single questions 
inFilename = "testing-questions-single.md"
outFilename = "answers-gemma2-2b-single.md"
with open(inFilename, "r") as file:
    questions = file.readlines()

# initialize response times
responseTimes = []
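
Note that readlines() keeps blank lines, so an empty line in the questions file would be sent to the model as an empty question. A minimal sketch of a loader that skips them is shown below, assuming one question per line. The function name load_questions and the temporary sample file are hypothetical and only serve the demonstration.

```python
import tempfile
from pathlib import Path


def load_questions(path):
    """Return the non-empty lines of a file, stripped of surrounding whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demonstration with a throwaway file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "questions.md"
    sample.write_text("Who are you?\n\nWhat happened to the river?\n", encoding="utf-8")
    questions = load_questions(sample)

print(questions)  # ['Who are you?', 'What happened to the river?']
```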

In [None]:
with open(outFilename, "w") as answersFile:
    for question in questions:
        question = question.strip()  # Remove any leading/trailing whitespace
        
        # Message prompt
        messages = [
            {
                "role": "system",
                "content": systemPrompt
            },
            {
                "role": "user",
                "content": question
            }
        ]
        
        start_time = time.time()
        
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
        outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=1000)

        answer = tokenizer.decode(outputs[0])
        # Process answer - the decoded text contains the whole prompt, so keep only the text after the last '<start_of_turn>model'
        lastInstIndex = answer.rfind('<start_of_turn>model')
        answer = answer[lastInstIndex + len('<start_of_turn>model'):].strip()
        answer = answer.replace('<end_of_turn>', '')
        
        endTime = time.time()
        
        # Record the response time
        responseTime = endTime - start_time
        responseTimes.append(responseTime)
        
        # Write the response to the file
        answersFile.write(f"Q: {question}\nA: {answer}\nTime taken: {responseTime:.2f} seconds\n\n")
        print(f"Q: {question}\nA: {answer}\nTime taken: {responseTime:.2f} seconds\n\n")
        
    answersFile.flush()
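
The answer post-processing in the loop above can be isolated as a pure string function, which makes it possible to check it without loading a model. This is a sketch, not the notebook's own code: the name extract_answer and the sample string are hypothetical, while the two marker strings are taken from the cell above. It also guards one edge case: str.rfind returns -1 when the marker is missing, and the unguarded slice would then silently drop the first 19 characters.

```python
START_MARKER = "<start_of_turn>model"
END_MARKER = "<end_of_turn>"


def extract_answer(decoded: str) -> str:
    """Keep only the text after the last model-turn marker."""
    index = decoded.rfind(START_MARKER)
    if index != -1:
        # Drop the prompt and the marker itself
        decoded = decoded[index + len(START_MARKER):]
    # Remove any end-of-turn markers, then trim surrounding whitespace
    return decoded.replace(END_MARKER, "").strip()


# Hypothetical decoded output containing the prompt and the generated turn
sample = (
    "<start_of_turn>user\nWho are you?<end_of_turn>\n"
    "<start_of_turn>model\nI'm Bryn.<end_of_turn>"
)
print(extract_answer(sample))  # I'm Bryn.
```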

In [None]:
# Append the average, max, and min response times to the end of the file
averageTime = sum(responseTimes) / len(responseTimes)
maxTime = max(responseTimes)
minTime = min(responseTimes)

with open(outFilename, "a") as answersFile:
    answersFile.write(f"\n\n----------------------------------------\n")
    answersFile.write(f"\nAverage Time: {averageTime:.2f} seconds")
    answersFile.write(f"\nMax Time: {maxTime:.2f} seconds")
    answersFile.write(f"\nMin Time: {minTime:.2f} seconds")
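
The statistics cell above divides by len(responseTimes), which raises ZeroDivisionError when no questions were processed. A possible way to guard against that, and to share the formatting between both test runs, is sketched below. The function name summarize_times and the fallback message are hypothetical.

```python
from statistics import mean


def summarize_times(times):
    """Format average, max, and min response times, or a notice if there are none."""
    if not times:
        return "\nNo responses were recorded."
    return (
        f"\nAverage Time: {mean(times):.2f} seconds"
        f"\nMax Time: {max(times):.2f} seconds"
        f"\nMin Time: {min(times):.2f} seconds"
    )


# Hypothetical timings in seconds
print(summarize_times([1.5, 2.5, 3.5]))
```

The returned string can be passed directly to answersFile.write in place of the three separate write calls.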

Change inFilename2 to the name of the file that contains the history questions, one question per line.

History questions are fed to the model one by one. The conversation history is built from the questions in inFilename2 and the answers the LLM gave to them, so each new question is answered with all previous turns as context.
The results are written to the outFilename2 file. For each question, the file contains the user's question (from inFilename2), the LLM's answer, and the time it took to generate that answer. Min, max, and average time statistics are appended at the end of the output file.
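
How the history grows can be illustrated without a model. In the sketch below, run_conversation and echo_model are hypothetical names, and echo_model is only a stand-in for the tokenizer and model.generate calls. The message layout (role and content keys, with a system message first) follows the cells in this notebook.

```python
def run_conversation(system_prompt, questions, generate):
    """Feed questions one by one, appending each answer to the shared history."""
    history = [{"role": "system", "content": system_prompt}]
    for question in questions:
        history.append({"role": "user", "content": question})
        # The generator sees every previous turn, not just the latest question
        answer = generate(history)
        history.append({"role": "assistant", "content": answer})
    return history


def echo_model(history):
    """Stand-in for the LLM: replies based on the latest user message."""
    return f"reply to: {history[-1]['content']}"


history = run_conversation("You are Bryn.", ["Hello", "Where are we?"], echo_model)
for message in history:
    print(message["role"], "->", message["content"])
```

Since the prompt contains all earlier turns, it gets longer with every question, so response times in this test tend to rise over the course of the conversation.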

In [None]:
# Load all communication questions 
inFilename2 = "testing-questions-history.md"
outFilename2 = "answers-gemma2-2b-history.md"
with open(inFilename2, "r") as file:
    questions = file.readlines()

# initialize response times
responseTimes2 = []

# init history
history = [
    {
        "role": "system",
        "content":systemPrompt
    }
]


In [None]:
with open(outFilename2, "w") as answersFile2:
    for question in questions:
        question = question.strip()  # Remove any leading/trailing whitespace
        
        # User question
        history.append({"role": "user", "content": question})
        
        start_time = time.time()
        
        prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
        outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=1000)

        answer = tokenizer.decode(outputs[0])
        # Process answer - the decoded text contains the whole conversation, so keep only the text after the last '<start_of_turn>model'
        lastInstIndex = answer.rfind('<start_of_turn>model')
        answer = answer[lastInstIndex + len('<start_of_turn>model'):].strip()
        answer = answer.replace('<end_of_turn>', '')
        
        endTime = time.time()
        
        # Record the response time
        responseTime = endTime - start_time
        responseTimes2.append(responseTime)

        # Write the response to the file
        answersFile2.write(f"Q: {question}\nA: {answer}\nTime taken: {responseTime:.2f} seconds\n\n")
        print(f"Q: {question}\nA: {answer}\nTime taken: {responseTime:.2f} seconds\n\n")
        
        # Add response to history
        history.append({"role": "assistant", "content": answer})

    answersFile2.flush()        

In [None]:

# Calculate statistics
averageTime2 = sum(responseTimes2) / len(responseTimes2)
maxTime2 = max(responseTimes2)
minTime2 = min(responseTimes2)

# Write the statistics to the file
with open(outFilename2, "a") as answersFile2:
    answersFile2.write(f"\n\n----------------------------------------\n")
    answersFile2.write(f"\nAverage Time: {averageTime2:.2f} seconds")
    answersFile2.write(f"\nMax Time: {maxTime2:.2f} seconds")
    answersFile2.write(f"\nMin Time: {minTime2:.2f} seconds")
