# Gai Chat Server (ExLlama2)

**Important: Select the venv Python interpreter before you start.**

This repository is designed for use with Visual Studio Code and a Docker Dev Container.

![dev-container](../../img/dev-container.png)

## 1. Setup

**Instructions:**

a) Download model

```bash
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-exl2 \
    --revision c08d657b27cf0450deaddc3e582be20beec3e62d \
    --local-dir ~/.gai/models/Llama-3.2-3B-Instruct-exl2 \
    --local-dir-use-symlinks False
```

b) Create `gai.yml` in `~/.gai`

```yaml
generators:
    ttt:
        type: ttt
        engine: exllamav2
        model: llama3.2:3b
        name: ttt
        hyperparameters:
            temperature: 0.85
            top_p: 0.8
            top_k: 50
            max_tokens: 1000
            tool_choice: auto
            max_retries: 5
            stop:
            - <|im_end|>
            - </s>
            - '[/INST]'
        extra:
            model_path: models/Llama-3.2-3B-Instruct-exl2
            max_seq_len: 8192
            prompt_format: mistral
            no_flash_attn: true
            seed: null
            decode_special_tokens: false
        module:
            name: gai.llm.server.gai_exllamav2
            class: GaiExLlamav2
        source:
            type: huggingface
            repo_id: bartowski/Llama-3.2-3B-Instruct-exl2
            local_dir: Llama-3.2-3B-Instruct-exl2
            revision: c08d657b27cf0450deaddc3e582be20beec3e62d
            file: null
```
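The `model_path` under `extra` is a relative path. The sketch below shows one plausible way it could be resolved, assuming it is joined onto the `app_dir` stored in `~/.gairc`; that rule is inferred from the paths printed later in this README, not confirmed from the library source. `resolve_model_path` is a hypothetical helper, not part of `gai`.

```python
import json
import posixpath


def resolve_model_path(gairc_text: str, generator_cfg: dict) -> str:
    """Join app_dir (from .gairc) with the generator's relative extra.model_path.

    Assumption: gai resolves model_path relative to app_dir.
    """
    app_dir = json.loads(gairc_text)["app_dir"]
    return posixpath.join(app_dir, generator_cfg["extra"]["model_path"])


# Values copied from this README (.gairc contents and the gai.yml above)
gairc_text = '{"app_dir":"/home/vscode/.gai"}'
generator_cfg = {"extra": {"model_path": "models/Llama-3.2-3B-Instruct-exl2"}}

resolved = resolve_model_path(gairc_text, generator_cfg)
print(resolved)
```

If the printed path does not match the `--local-dir` used in step a), the server will most likely fail to find the model files.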

---

## 2. Pull Model

In [1]:
# Takes around 7 minutes to run under normal conditions
from gai.llm.lib.generators_utils import download, text_progress_callback
download(name_or_config="llama3.2:8bpw:exl2", status_callback=text_progress_callback)

Fetching 12 files:   8%|▊         | 1/12 [00:00<00:08,  1.33it/s]

Download status: {'progress': 8.333333333333332, 'current': 1, 'total': 12, 'message': 'Downloading'}


Fetching 12 files:  33%|███▎      | 4/12 [00:01<00:02,  3.99it/s]

Download status: {'progress': 33.33333333333333, 'current': 4, 'total': 12, 'message': 'Downloading'}


Fetching 12 files: 100%|██████████| 12/12 [04:15<00:00, 21.27s/it]

Download status: {'progress': 75.0, 'current': 9, 'total': 12, 'message': 'Downloading'}





'/home/vscode/.gai/models/Llama-3.2-3B-Instruct-exl2'
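The status lines above suggest that `status_callback` is called with a dict holding `progress`, `current`, `total` and `message`. Under that assumption (the exact callback signature is not confirmed here), a custom callback could look like the following sketch. `format_status` and `bar_progress_callback` are hypothetical names.

```python
def format_status(status: dict) -> str:
    """Render a status dict like the ones printed above as a one-line text bar."""
    width = 20
    filled = int(width * status["progress"] / 100)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {status['current']}/{status['total']} {status['message']}"


def bar_progress_callback(status: dict) -> None:
    """Hypothetical drop-in for text_progress_callback (assumes a single dict argument)."""
    print(format_status(status))


# Sample dict copied from the output above
sample = {"progress": 33.33333333333333, "current": 4, "total": 12, "message": "Downloading"}
bar_progress_callback(sample)
```

It would then be passed as `status_callback=bar_progress_callback` in place of `text_progress_callback`.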

---

## 3. Smoke Test

In [2]:
# 1) Confirm gai initialized

import os
gairc=None
with open(os.path.expanduser("~/.gairc"),"r") as f:
    gairc = f.read()
print(".gairc = ",gairc)
assert os.path.exists(os.path.expanduser("~/.gai"))

# 2) Build generator configuration

from gai.lib.config import config_helper
yaml_config = """
# bpw: 4.25 bits, size: 3.98 GB
type: "ttt"
engine: "exllamav2"
model: "dolphin3.0_llama3.1:4.25bpw"
name: "dolphin3.0_llama3.1:4.25bpw:exl2"
extra:
    model_path: "models/Dolphin3.0-Llama3.1-8B-4_25bpw-exl2"
    max_seq_len: 8192
    prompt_format: "llama"
    no_flash_attn: true
    seed: null
    decode_special_tokens: false
hyperparameters:
    temperature: 0.85
    top_p: 0.8
    top_k: 50
    max_tokens: 1000
    tool_choice: "auto"
    max_retries: 5
    stop: ["<|im_end|>", "</s>", "[/INST]"]
module:
    name: "gai.llm.server.gai_exllamav2"
    class: "GaiExLlamav2"
source:
    type: "huggingface"
    repo_id: "bartowski/Dolphin3.0-Llama3.1-8B-exl2"
    local_dir: "Dolphin3.0-Llama3.1-8B-4_25bpw-exl2"
    revision: "896301e945342d032ef0b3a81b57f0d5a8bac6fe"
"""

import yaml
generator_config = yaml.safe_load(yaml_config)
generator_config = config_helper.get_generator_config(generator_config)

print("✅  Before loading")
from gai.lib.diagnostics import free_mem
free_mem()

from gai.lib.api import SingletonHost
from rich.console import Console
console=Console()

# 3) Acquire (or create) the singleton
host = SingletonHost.GetInstanceFromConfig(generator_config)

# 4) Load the generator into child process
host.load()
print("✅ Model loaded in subprocess")
free_mem()

# 5) Generate a response
response = host.create(
    model="ttt",
    messages=[{"role": "user", "content": "Tell me a one paragraph story about a dragon"}],
    stream=False
)
print("▶️  Response:", response)

# 6) Unload to tear down the child and free GPU memory
host.unload()
free_mem()
print("🗑️  Model unloaded, GPU memory freed")


.gairc =  {"app_dir":"/home/vscode/.gai"}

✅  Before loading


✅ Model loaded in subprocess


▶️  Response: ChatCompletion(id='chatcmpl-b5d1a348-fa25-4847-8511-c1a858353a26', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='\nA long time ago, in a land far away, there was a fearsome dragon. This dragon was not like any other dragon that had come before it. It was the most feared creature in all the land, but for a very good reason. The dragon was incredibly smart, and had an insatiable thirst for knowledge. It spent all of its days studying the ancient texts and devouring books on all manner of subjects. This dragon was unlike any other, as it sought to use its power for good.\n', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None))], created=1748236281, model='exllamav2-mistral7b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=104, prompt_tokens=14, total_tokens=118, completion_tokens_details=None, prompt_tokens_deta

🗑️  Model unloaded, GPU memory freed
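Because the model lives in a child process, an exception between `host.load()` and `host.unload()` could leave GPU memory allocated. A minimal sketch of guarding against that with a context manager follows; it relies only on the `load()` and `unload()` methods used above, and `loaded` and `FakeHost` are hypothetical names introduced so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def loaded(host):
    """Load the model on entry and always unload on exit, even if the body raises."""
    host.load()
    try:
        yield host
    finally:
        host.unload()


class FakeHost:
    """Stand-in for the singleton host; records calls instead of touching a GPU."""

    def __init__(self):
        self.events = []

    def load(self):
        self.events.append("load")

    def unload(self):
        self.events.append("unload")


fake = FakeHost()
try:
    with loaded(fake):
        raise RuntimeError("generation failed")
except RuntimeError:
    pass

print(fake.events)
```

With the real host, the body of the `with` block would contain the `host.create(...)` call.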


---

## 4. Integration Test

### Startup

In [3]:
import os
os.environ["LOG_LEVEL"] = "INFO"

# convert yaml to json
import yaml
yaml_config = """
# bpw: 4.25 bits, size: 3.98 GB
type: ttt
engine: exllamav2
model: dolphin3.0_llama3.1:4.25bpw
name: dolphin3.0_llama3.1:4.25bpw:exl2
hyperparameters:
    temperature: 0.85
    top_p: 0.8
    top_k: 50
    max_tokens: 1000
    tool_choice: auto
    max_retries: 5
    stop:
        - <|im_end|>
        - </s>
        - "[/INST]"
extra:
    model_path: models/Dolphin3.0-Llama3.1-8B-4_25bpw-exl2
    max_seq_len: 8192
    prompt_format: llama
    no_flash_attn: true
    seed: null
    decode_special_tokens: false
module:
    name: gai.llm.server.gai_exllamav2
    class: GaiExLlamav2
source:
    type: huggingface
    repo_id: bartowski/Dolphin3.0-Llama3.1-8B-exl2
    local_dir: Dolphin3.0-Llama3.1-8B-4_25bpw-exl2
    revision: 896301e945342d032ef0b3a81b57f0d5a8bac6fe
"""

import yaml
generator_config = yaml.safe_load(yaml_config)
from gai.lib.config import config_helper
generator_config = config_helper.get_generator_config(generator_config)

print("✅  Before loading")
from gai.lib.diagnostics import free_mem
free_mem()

from gai.lib.api import SingletonHost
host = SingletonHost.GetInstanceFromConfig(generator_config, verbose=False)
host.load()
print("✅ Model loaded in subprocess")
free_mem()


✅  Before loading


✅ Model loaded in subprocess


1.1096267700195312

### a) Test Streaming

In [4]:
response = host.create(
    model="ttt",
    messages=[{"role":"user","content":"Tell me a one paragraph story"},
                {"role":"assistant","content":""}],
    stream=True)
for message in response:
    if message.choices[0].delta.content:
        print(message.choices[0].delta.content, end="", flush=True)
   

Once upon a time, in a small village nestled between two great mountains, there lived a young girl named Lila. She was known throughout the village for her kindness and generosity, always ready to lend a helping hand or offer a warm smile to those in need. One day, while out gathering herbs in the nearby forest, Lila stumbled upon a wounded bird, its wing badly injured and unable to fly. Moved by compassion, Lila carefully picked up the bird and took it back to her home, where she tended to its wounds with care and love. As the days passed, the bird slowly healed under Lila's gentle care, until one day it was able to fly again. Overjoyed, Lila released the bird back into the wild, watching as it soared high into the sky. From that day on, the villagers spoke of Lila's kindness and how it had brought joy and healing to all who knew her.

### b) Test Generation

In [5]:
response = host.create(
    model="ttt",
    messages=[{"role":"user","content":"Tell me a one paragraph story"},
                {"role":"assistant","content":""}],
    stream=False,
    max_tokens=100)
print(response.choices[0].message.content)
print("finish reason:", response.choices[0].finish_reason)


Once upon a time, in a land far away, there was a young girl named Lily. She lived in a small village surrounded by a dense forest. One day, she decided to explore the forest and stumbled upon a magical tree. The tree had shimmering leaves, and when Lily touched it, she was transported to a magical world. In this world, animals could talk, and the sky was filled with colorful birds. Lily spent her days in this magical world, learning about the animals and their ways
finish reason: length


### c) Test Tool Calling

In [6]:
response = host.create(
    model="ttt",
    messages=[
        {"role":"user","content":"What is the current time in Singapore?"},
        {"role":"assistant","content":""}
    ],
    tool_choice="required",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "google",
                "description": "The 'google' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "search_query": {
                            "type": "string",
                            "description": "The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively."
                        }
                    },
                    "required": ["search_query"]
                }
            }
        }
    ],
    stream=False)
print(response)

ChatCompletion(id='chatcmpl-d3baabf6-af5c-414a-b3f3-351273f04b22', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_08e932bd-276f-4d4a-9d2f-aa4365b09ef9', function=Function(arguments='{"search_query": "current time in Singapore"}', name='google'), type='function')]))], created=1748236397, model='exllamav2-mistral7b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=23, prompt_tokens=367, total_tokens=390, completion_tokens_details=None, prompt_tokens_details=None))
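The response above only names the tool and its arguments; running the tool is left to the caller. The sketch below shows one way to dispatch such a call. The `google` function here is a hypothetical local stub, and the message object is rebuilt with `SimpleNamespace` from the printed output so the example runs without the server.

```python
import json
from types import SimpleNamespace


def google(search_query: str) -> str:
    """Hypothetical stub standing in for a real search implementation."""
    return f"(search results for: {search_query})"


# Registry mapping tool names (as declared in `tools`) to local callables
TOOLS = {"google": google}


def dispatch_tool_calls(message) -> list:
    """Run every tool call on an assistant message and collect the results."""
    results = []
    for call in message.tool_calls or []:
        fn = TOOLS[call.function.name]
        kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append(
            {"tool_call_id": call.id, "name": call.function.name, "content": fn(**kwargs)}
        )
    return results


# Shape and values copied from the ChatCompletion printed above
message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            id="call_08e932bd-276f-4d4a-9d2f-aa4365b09ef9",
            function=SimpleNamespace(
                name="google",
                arguments='{"search_query": "current time in Singapore"}',
            ),
        )
    ]
)

results = dispatch_tool_calls(message)
print(results)
```

With a live response, `response.choices[0].message` would take the place of the rebuilt `message`.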


### d) Test Structured Output

In [7]:
# Define Schema
from pydantic import BaseModel
class Book(BaseModel):
    title: str
    summary: str
    author: str
    published_year: int

text = """Foundation is a science fiction novel by American writer
Isaac Asimov. It is the first published in his Foundation Trilogy (later
expanded into the Foundation series). Foundation is a cycle of five
interrelated short stories, first published as a single book by Gnome Press
in 1951. Collectively they tell the early story of the Foundation,
an institute founded by psychohistorian Hari Seldon to preserve the best
of galactic civilization after the collapse of the Galactic Empire.
"""
response = host.create(
    model="ttt",
    messages=[{'role':'user','content':text},{'role':'assistant','content':''}], 
    json_schema=Book.model_json_schema(),  # Pydantic v2 name; Book.schema() is deprecated
    stream=False
    )
print(response)


ChatCompletion(id='chatcmpl-6e8b8155-0683-4b7e-9164-678c88bddb15', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "title": "Foundation",\n  "summary": "Foundation is a science fiction novel by American writer Isaac Asimov. It is the first published in his Foundation Trilogy (later expanded into the Foundation series). Foundation is a cycle of five interrelated short stories, first published as a single book by Gnome Press in 1951. Collectively they tell the early story of the Foundation, an institute founded by psychohistorian Hari Seldon to preserve the best of galactic civilization after the collapse of the Galactic Empire.",\n  "author": "Isaac Asimov",\n  "published_year": 1951\n}', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None))], created=1748236407, model='exllamav2-mistral7b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(co
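The structured result comes back as a JSON string in `message.content`, so it still has to be parsed and checked. The following is a minimal standard-library sketch of that step; `parse_book` is a hypothetical helper and the summary text is shortened for the example. If Pydantic v2 is installed, `Book.model_validate_json(content)` should cover the same ground.

```python
import json

# Field names and types mirror the Book schema defined above
REQUIRED_FIELDS = {"title": str, "summary": str, "author": str, "published_year": int}


def parse_book(content: str) -> dict:
    """Parse the model's JSON string and check that each field has the expected type."""
    data = json.loads(content)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} should be {expected_type.__name__}")
    return data


# Same shape as the content printed above, with the summary shortened
content = (
    '{\n  "title": "Foundation",\n'
    '  "summary": "A cycle of five interrelated short stories.",\n'
    '  "author": "Isaac Asimov",\n'
    '  "published_year": 1951\n}'
)

book = parse_book(content)
print(book["title"], book["published_year"])
```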

### Switch Model

In [7]:
# Just call using a different model name and the singleton host will switch to the new model

response = host.create(
    model="dolphin_mistral:exl2",
    messages=[{"role":"user","content":"Tell me a one paragraph story"},
                {"role":"assistant","content":""}],
    stream=True)
for message in response:
    if message.choices[0].delta.content:
        print(message.choices[0].delta.content, end="", flush=True)

Once upon a time, in a small village, there lived a brave knight named Sir Thomas. He was known far and wide for his courage and kindness. One day, a fierce dragon attacked the village, burning everything in its path. The villagers were terrified and didn't know what to do. But Sir Thomas didn't hesitate. He picked up his sword and rode towards the dragon, vowing to protect his people.

 The battle was long and fierce, but Sir Thomas was determined. With every slash of his sword, the dragon grew weaker and weaker. Finally, after what felt like an eternity, the dragon fell to the ground, defeated. The villagers cheered in joy, and Sir Thomas was hailed as a hero. From that day on, peace and prosperity returned to the village, and Sir Thomas was remembered as their savior. 

### Teardown

In [8]:
host.unload()
free_mem()

4.303363800048828

---

## 5. API Test

**Instructions**:

a) Open the **Run and Debug** view and select **Python Debugger: gai-ttt server (dolphin)**

b) Press `F5` to start the API server.

c) Wait for the server to start.


**Tests**:

Run the following cells to test the API.

### Test Pulling

In [1]:
%%bash
curl -X POST \
    http://localhost:12031/gen/v1/chat/pull \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"dolphin_mistral:exl2\"}"

null

### a) Test Generating

In [10]:
%%bash
curl -X POST \
    http://gai-llm-svr-exl2:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"ttt\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"Tell me a story.\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"max_tokens\":100,\
        \"tool_choice\": \"none\"}"
        

{"id":"chatcmpl-9787d061-1553-4775-b6d5-b9edbcba10bd","choices":[{"finish_reason":"length","index":0,"logprobs":null,"message":{"content":"Once upon a time, in a small village nestled between two great mountains, there lived a young girl named Aria. She was a curious and adventurous soul, always eager to explore the world around her. One day, while wandering through the forest, she stumbled upon a mysterious door hidden behind a waterfall.\n\nIntrigued, Aria pushed open the door and found herself in a magical realm filled with talking animals, enchanted trees, and sparkling fairy dust. As she ventured deeper into this strange new world","refusal":null,"role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null}}],"created":1748236643,"model":"exllamav2-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":100,"prompt_tokens":14,"total_tokens":114,"completion_tokens_details":null,"prompt_tokens_deta
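The hand-escaped JSON in the `curl` commands is easy to get wrong. As an alternative sketch, the same request can be built in Python with `json.dumps` and the standard-library `urllib.request`. The endpoint and payload are copied from the cell above; `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint used in this README."""
    body = json.dumps(payload).encode("utf-8")  # json.dumps handles all quoting
    return urllib.request.Request(
        f"{base_url}/gen/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "model": "ttt",
    "messages": [
        {"role": "user", "content": "Tell me a story."},
        {"role": "assistant", "content": ""},
    ],
    "max_tokens": 100,
    "tool_choice": "none",
}

request = build_chat_request("http://gai-llm-svr-exl2:12031", payload)
print(request.get_method(), request.full_url)

# Sending requires the API server to be running:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```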

### b) Test Streaming

In [None]:
import json
import httpx

json_payload = {
    "model":"ttt",
    "temperature": 0.2,
    "max_tokens": 50,
    "stream": True,  # Boolean, serialized by httpx as JSON true
    "messages": [
        {
            "role": "user",
            "content": "Tell me a one paragraph story."
        },
        {
            "role": "assistant",
            "content": ""
        }
    ]
}
async def http_post_async(json_payload):

    # Send the POST request using httpx with streaming
    async with httpx.AsyncClient(timeout=30.0) as client:
        async with client.stream("POST", "http://gai-llm-svr-exl2:12031/gen/v1/chat/completions", json=json_payload) as response:
            response.raise_for_status()
            async for chunk in response.aiter_text():  # Use aiter_text() to handle decoding
                chunk=json.loads(chunk)
                chunk=chunk["choices"][0]["delta"]["content"]
                if chunk:  # Check for non-empty chunks
                    print(chunk, end="", flush=True)

try:
    # Top-level await works in a notebook cell; the coroutine prints as it streams and returns nothing
    await http_post_async(json_payload)
except httpx.HTTPStatusError as e:
    print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")


Once upon a time, there was a young girl named Lily who loved to read. She spent most of her days curled up in her favorite chair with a book in her hands. One day, while she was reading, she heard a knock on the door
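The loop above calls `json.loads` on each text chunk, which only works while every chunk happens to contain exactly one complete JSON object. The sketch below buffers incoming text and extracts complete objects with `json.JSONDecoder.raw_decode`. It assumes the server streams bare concatenated JSON objects (no `data:` prefix); that wire format is an assumption and is not confirmed by this README. `JsonStreamBuffer` is a hypothetical helper.

```python
import json


class JsonStreamBuffer:
    """Accumulate streamed text and yield each complete JSON object once."""

    def __init__(self):
        self._buffer = ""
        self._decoder = json.JSONDecoder()

    def feed(self, text: str) -> list:
        """Add text to the buffer and return all objects that are now complete."""
        self._buffer += text
        objects = []
        while True:
            self._buffer = self._buffer.lstrip()
            if not self._buffer:
                break
            try:
                obj, end = self._decoder.raw_decode(self._buffer)
            except json.JSONDecodeError:
                break  # incomplete object so far; wait for more text
            objects.append(obj)
            self._buffer = self._buffer[end:]
        return objects


# Two network chunks that split the second object in the middle
pieces = [
    '{"choices": [{"delta": {"content": "Once"}}]}{"choi',
    'ces": [{"delta": {"content": " upon"}}]}',
]

buffer = JsonStreamBuffer()
text = ""
for piece in pieces:
    for chunk in buffer.feed(piece):
        text += chunk["choices"][0]["delta"]["content"] or ""

print(text)
```

Inside `http_post_async`, `buffer.feed(chunk)` would replace the direct `json.loads(chunk)` call. Note that malformed input stays in the buffer indefinitely, so a production version would need a size limit.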

### c) Test Tool Calling

In [None]:
%%bash
curl -X POST \
    http://gai-llm-svr-exl2:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"ttt\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"What is the current time in Singapore\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"tools\": [\
            {\
                \"type\": \"function\",\
                \"function\": {\
                    \"name\": \"google\",\
                    \"description\": \"The 'google' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.\",\
                    \"parameters\": {\
                        \"type\": \"object\",\
                        \"properties\": {\
                            \"search_query\": {\
                                \"type\": \"string\",\
                                \"description\": \"The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively.\"\
                            }\
                        },\
                        \"required\": [\"search_query\"]\
                    }\
                }\
            }\
        ],\
        \"tool_choice\": \"required\"}"

{"id":"chatcmpl-ea3fa081-c13b-492f-82c6-7c1d43c895a3","choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":null,"refusal":null,"role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":[{"id":"call_78c7e95a-3ebc-4021-b9a7-483df071ff45","function":{"arguments":"{\"search_query\": \"current time in Singapore\"}","name":"google"},"type":"function"}]}}],"created":1748187141,"model":"exllamav2-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":23,"prompt_tokens":366,"total_tokens":389,"completion_tokens_details":null,"prompt_tokens_details":null}}

### d) Test JSON Schema

In [11]:
%%bash
curl -X POST \
    http://gai-llm-svr-exl2:12031/gen/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -s \
    -N \
    -d "{\"model\":\"ttt\", \
        \"messages\": [ \
            {\"role\": \"user\",\"content\": \"Foundation is a science fiction novel by American writer \
            Isaac Asimov. It is the first published in his Foundation Trilogy (later \
            expanded into the Foundation series). Foundation is a cycle of five \
            interrelated short stories, first published as a single book by Gnome Press \
            in 1951. Collectively they tell the early story of the Foundation, \
            an institute founded by psychohistorian Hari Seldon to preserve the best \
            of galactic civilization after the collapse of the Galactic Empire.\"}, \
            {\"role\": \"assistant\",\"content\": \"\"} \
        ],\
        \"json_schema\": {\"properties\": \
            {\"title\": \
                {\"title\": \"Title\", \"type\": \"string\"}, \
                    \"summary\": {\"title\": \"Summary\", \"type\": \"string\"}, \
                    \"author\": {\"title\": \"Author\", \
                    \"type\": \"string\"\
                }, \
                \"published_year\": {\
                    \"title\": \"Published Year\", \
                    \"type\": \"integer\"}}, \
                \"required\": [\
                    \"title\", \
                    \"summary\", \
                    \"author\", \
                    \"published_year\"\
                ], \
                \"title\": \"Book\", \
                \"type\": \"object\"\
            },\
        \"stream\": false }"

{"id":"chatcmpl-d99d95ad-9418-4728-9e4d-25d2b40e0565","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"{\n  \"title\": \"Foundation\",\n  \"summary\": \"Foundation is a science fiction novel by American writer Isaac Asimov. It is the first published in his Foundation Trilogy (later expanded into the Foundation series). Foundation is a cycle of five interrelated short stories, first published as a single book by Gnome Press in 1951. Collectively they tell the early story of the Foundation, an institute founded by psychohistorian Hari Seldon to preserve the best of galactic civilization after the collapse of the Galactic Empire.\",\n  \"author\": \"Isaac Asimov\",\n  \"published_year\": 1951\n}","refusal":null,"role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null}}],"created":1748236684,"model":"exllamav2-mistral7b","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":12

### e) Shut down the API Service