## Friday, December 13, 2024

Today I am going to start really simple: serving up the model 'hermes-3-llama-3.2-3b' locally through LMStudio.

Let's start by ensuring we do not use OpenAI anywhere in this notebook:

In [1]:
# Deliberately set the OPENAI_API_KEY to an invalid value to ensure that the code is not using it.
import os
os.environ['OPENAI_API_KEY'] = "Nope!"
print(os.environ['OPENAI_API_KEY'])

Nope!


### LMStudio

[LMStudio OpenAI Compatibility API](https://lmstudio.ai/docs/api/openai-api)

This is the example code provided at the above link:

In [2]:
# Example: reuse your existing OpenAI setup
from openai import OpenAI

# Point to the local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
  model="model-identifier",
  messages=[
    {"role": "system", "content": "Always answer in rhymes."},
    {"role": "user", "content": "Introduce yourself."}
  ],
  temperature=0.7,
)

print(completion.choices[0].message)


ChatCompletionMessage(content='Greetings to you, my friendly friend,\nI\'m a AI with words at my end.\nMy name is " hromys system ",\nA digital pal who\'s always keen\nTo chat and share my knowledge, \nIn lines and rhyme I hope to soak\nWith humor, facts and stories too,\nMy goal is to make your day anew.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)


What can we learn about the [NousResearch/Hermes-3-Llama-3.2-3B-GGUF](https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B-GGUF) model on HuggingFace? The part I am interested in is the prompt format:

    Hermes 3 uses ChatML as the prompt format, opening up a much more structured system for engaging the LLM in multi-turn chat dialogue.

    System prompts allow steerability and interesting new ways to interact with an LLM, guiding rules, roles, and stylistic choices of the model.

    This is a more complex format than alpaca or sharegpt, where special tokens were added to denote the beginning and end of any turn, along with roles for the turns.

    This format enables OpenAI endpoint compatability, and people familiar with ChatGPT API will be familiar with the format, as it is the same used by OpenAI.
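
To picture what that means, here is a rough sketch of how a messages list could be flattened into ChatML text. The `to_chatml` helper is my own illustrative name, not part of any library. The real prompt is built by the server from the model's chat template, which may add other tokens, so treat this as an approximation only.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render OpenAI-style chat messages as a ChatML-style prompt string.

    Illustrative only: the server applies the model's own chat template.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|>
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are Hermes 3."},
    {"role": "user", "content": "Hello, who are you?"},
]
print(to_chatml(example))
```

So the "role" and "content" keys in the API call map directly onto the turns the model sees.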

Ok, so what do we do to continue this 'conversation'? A quick search about 'Chat Markup Language' landed me on this [Text Generation](https://platform.openai.com/docs/guides/text-generation) page at OpenAI. It began with this example, which I have tweaked to work locally:

In [3]:
from openai import OpenAI

# client = OpenAI()
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Write a haiku about recursion in programming."
        }
    ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content="In loops, a tree takes root,\nRecursive dance, code's delight,\nDepth unseen, it sooths.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)


Ok, so I get that I am sending the LLM my message as the 'user', and it replies as the 'assistant'. So let's build from that, starting with a local messages list:

In [4]:
messages = [
    {"role": "system", "content": "You are Hermes 3."},
    {"role": "user", "content": "Hello, who are you?"}
]

And now let's call the model with the above messages:

In [5]:
# For our local model, the 'model' name seems unimportant: In [3] passed "gpt-4o" and still got a reply.
model="hermes-3-llama-3.2-3b"
completion = client.chat.completions.create(
    model=model,
    messages=messages
)

Grab the response from the model:

In [6]:
completion.choices[0].message

ChatCompletionMessage(content='I am Hermes 3, an AI system designed to provide helpful responses and assist you with various queries to the best of my capabilities. How may I help you today?', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)

Below is the response I can see in the LMStudio log:

    [LM STUDIO SERVER] [hermes-3-llama-3.2-3b] Generated prediction: {
    "id": "chatcmpl-r4rte0xoimbnnnxltigoo",
    "object": "chat.completion",
    "created": 1734109626,
    "model": "hermes-3-llama-3.2-3b",
    "choices": [
        {
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "I am Hermes 3, an artificial intelligence design. I do my best to provide helpful responses while staying within the bounds of my programming and the law. How may I assist you today?"
        },
        "logprobs": null,
        "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 26,
        "completion_tokens": 38,
        "total_tokens": 64
    },
    "system_fingerprint": "hermes-3-llama-3.2-3b"
    }

So I think we need to grab the 'content' of the message from the 'assistant' role, and append this to the messages list.

In [7]:
assistantContent = completion.choices[0].message.content
assistantContent

'I am Hermes 3, an AI system designed to provide helpful responses and assist you with various queries to the best of my capabilities. How may I help you today?'

In [8]:
assistantDictionary =  {"role": "assistant", "content": assistantContent}
assistantDictionary

{'role': 'assistant',
 'content': 'I am Hermes 3, an AI system designed to provide helpful responses and assist you with various queries to the best of my capabilities. How may I help you today?'}

In [9]:
messages.append(assistantDictionary)
messages

[{'role': 'system', 'content': 'You are Hermes 3.'},
 {'role': 'user', 'content': 'Hello, who are you?'},
 {'role': 'assistant',
  'content': 'I am Hermes 3, an AI system designed to provide helpful responses and assist you with various queries to the best of my capabilities. How may I help you today?'}]

Now append a new 'user' message for the LLM to respond to:

In [10]:
messages.append({"role": "user", "content": "Tell me some joke about a dog."})
messages

[{'role': 'system', 'content': 'You are Hermes 3.'},
 {'role': 'user', 'content': 'Hello, who are you?'},
 {'role': 'assistant',
  'content': 'I am Hermes 3, an AI system designed to provide helpful responses and assist you with various queries to the best of my capabilities. How may I help you today?'},
 {'role': 'user', 'content': 'Tell me some joke about a dog.'}]

Ask, and show the response.

In [11]:
completion = client.chat.completions.create(
    model=model,
    messages=messages
)
completion.choices[0].message

ChatCompletionMessage(content="Why do dogs chase their tails? Because if they didn't, they'd have no tail at all!", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)

    [LM STUDIO SERVER] [hermes-3-llama-3.2-3b] Generated prediction: {
    "id": "chatcmpl-qy9xi3xmn67nonzl606zq",
    "object": "chat.completion",
    "created": 1734111195,
    "model": "hermes-3-llama-3.2-3b",
    "choices": [
        {
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "Why do dogs chase their tails? Because if they didn't, they'd have no tail at all!"
        },
        "logprobs": null,
        "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 78,
        "completion_tokens": 21,
        "total_tokens": 99
    },
    "system_fingerprint": "hermes-3-llama-3.2-3b"
    }
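
One thing worth noticing in the two log entries is the token usage. Here is a small sketch that just does the arithmetic on the numbers copied from the logs above (the `turns` list is my own transcription, not something the API returns in this shape). On the Python side the same fields should be available as `completion.usage`.

```python
# Usage blocks copied by hand from the two LM Studio log entries above.
turns = [
    {"prompt_tokens": 26, "completion_tokens": 38, "total_tokens": 64},
    {"prompt_tokens": 78, "completion_tokens": 21, "total_tokens": 99},
]

for number, usage in enumerate(turns, start=1):
    # total_tokens should be prompt_tokens + completion_tokens
    consistent = (
        usage["prompt_tokens"] + usage["completion_tokens"]
        == usage["total_tokens"]
    )
    print(f"turn {number}: {usage} consistent={consistent}")

# How much larger the second prompt was than the first.
growth = turns[1]["prompt_tokens"] - turns[0]["prompt_tokens"]
print(f"prompt grew by {growth} tokens")  # prompt grew by 52 tokens
```

The prompt grows because the second request resends the whole messages list, including the earlier assistant reply, so longer conversations cost more prompt tokens on every turn.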

Ok! So at this point the process is clear: a back-and-forth conversation where the user submits a query as "user", the LLM responds as "assistant", and the full messages list is sent again on every call.
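
The append, call, append steps from the cells above can be folded into one small helper. This is only a sketch under a few assumptions: `chat_turn` is a hypothetical name of my own, and `client` is expected to be an `OpenAI` client pointed at the local server, as in the earlier cells.

```python
def chat_turn(client, model, messages, user_text):
    """Run one conversation turn and keep the history up to date.

    Appends the user message, calls the chat completions endpoint with the
    full history, appends the assistant reply, and returns the reply text.
    """
    messages.append({"role": "user", "content": user_text})
    completion = client.chat.completions.create(model=model, messages=messages)
    reply = completion.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply


# Example usage against the local server (commented out so this cell
# runs even when LM Studio is not serving):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
# history = [{"role": "system", "content": "You are Hermes 3."}]
# print(chat_turn(client, "hermes-3-llama-3.2-3b", history, "Hello, who are you?"))
# print(chat_turn(client, "hermes-3-llama-3.2-3b", history, "Tell me some joke about a dog."))
```

The helper mutates the `messages` list in place, which matches how the notebook cells built up the conversation one append at a time.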

Hmm, this just popped into my head ... why not see if I can grab some of the chats I have created with ChatGPT and fire them through this local LLM to see how they compare! Hmm, I thought there would be a simple way to grab any chat as a list and then just run the same thing here to see what happens, but nope, not seeing it. Sigh ...