8 changes: 5 additions & 3 deletions README.md
@@ -4,7 +4,7 @@ This repository contains a Helm chart for deploying Large Language Models (LLMs)

## Azimuth App

This app ~~is~~ will soon be provided as part of a standard deployment Azimuth so no specific steps are required to use this app other than access to an up to date Azimuth deployment.
This app ~~is~~ will soon be provided as part of a standard Azimuth deployment, so no specific steps are required to use this app other than access to an up-to-date Azimuth deployment.

## Manual Deployment

@@ -16,7 +16,7 @@ helm repo update
helm install <installation-name> <chosen-repo-name>/azimuth-llm --version <version>
```

where version is the full published version for the specified commit (e.g. `0.1.0-dev.0.main.125`). To see the latest published version, see [this page](https://github.com/stackhpc/azimuth-llm/tree/gh-pages).
where `<version>` is the full published version name for the specified commit (e.g. `0.1.0-dev.0.main.125`). For the latest published version, see [this page](https://github.com/stackhpc/azimuth-llm/tree/gh-pages).

### Customisation

@@ -39,8 +39,10 @@ The following is a non-exhaustive list of models which have been tested with thi
- [Llama 2 7B chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
- [AWQ Quantized Llama 2 70B](https://huggingface.co/TheBloke/Llama-2-70B-Chat-AWQ)
- [Magicoder 6.7B](https://huggingface.co/ise-uiuc/Magicoder-S-DS-6.7B)
- [Mistral 7B Instruct v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
<!-- - [AWQ Quantized Mixtral 8x7B Instruct v0.1](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ) (Not producing output properly) -->

Due to the combination of [components](##Components) used in this app, some Huggingface models may not work as expected (usually due to the way in which LangChain formats the prompt messages). Any errors when using new model will appear in the pod logs for either the web-app deployment the backend API deployment.
Due to the combination of [components](#components) used in this app, some HuggingFace models may not work as expected (usually due to the way in which LangChain formats the prompt messages). Any errors when using a new model will appear in the pod logs for either the web-app deployment or the backend API deployment.
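
To make the failure mode concrete, the following is a minimal sketch of the kind of message list the web app builds with LangChain (the import path assumes a pre-0.1 LangChain release, and the message contents are purely illustrative):

```python
from langchain.schema import AIMessage, HumanMessage, SystemMessage

# A conversation as LangChain represents it, before it is formatted
# into the serving backend's prompt template.
context = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hello!"),
    AIMessage(content="Hi! How can I help?"),
    HumanMessage(content="Summarise this repository."),
]
# Models that enforce a strict user -> ai -> user turn order (e.g. the
# Mistral family) reject the leading system message, and the resulting
# error surfaces in the web-app or backend API pod logs.
```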


## Components
2 changes: 1 addition & 1 deletion chart/templates/NOTES.txt
@@ -6,6 +6,6 @@ On deployment of a new model, the app must first download the model's weights fr
This can take a significant amount of time depending on model choice and network speeds.
Download progress can be monitored by inspecting the logs for the LLM API pod(s) via the Kubernetes Dashboard for the target cluster.
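
Besides the Kubernetes Dashboard, the same logs could be fetched programmatically; the sketch below uses the official Python Kubernetes client, with the namespace and label selector as illustrative assumptions:

```python
from kubernetes import client, config

# Connect using the local kubeconfig for the target cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

# Namespace and label selector are placeholders; adjust for your release.
pods = v1.list_namespaced_pod(
    "azimuth-llm", label_selector="app.kubernetes.io/instance=azimuth-llm"
)
for pod in pods.items:
    print(v1.read_namespaced_pod_log(pod.metadata.name, "azimuth-llm"))
```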

The app uses [vLLM](https://docs.vllm.ai/en/latest/) as a model serving backend and [gradio](https://github.com/gradio-app/gradio) + [LangChain](https://python.langchain.com/docs/get_started/introduction) to provide the web interface.
The app uses [vLLM](https://docs.vllm.ai/en/latest/) as a model serving backend and [Gradio](https://github.com/gradio-app/gradio) + [LangChain](https://python.langchain.com/docs/get_started/introduction) to provide the web interface.
The official list of HuggingFace models supported by vLLM can be found [here](https://docs.vllm.ai/en/latest/models/supported_models.html), though some of these may not be compatible with the LangChain prompt format.
See [this documentation](https://github.com/stackhpc/azimuth-llm/) for a non-exhaustive list of language models against which the app has been tested.
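
For context on how these components fit together, here is a minimal sketch of a client talking directly to the vLLM backend through its OpenAI-compatible API (the base URL and model name are illustrative assumptions):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API under /v1; the API key is not
# checked by vLLM but is required by the client library.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```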
27 changes: 24 additions & 3 deletions chart/web-app/app.py
@@ -1,5 +1,6 @@
import requests
import warnings
import re
import rich
import gradio as gr
from urllib.parse import urljoin
@@ -17,6 +18,18 @@
backend_health_endpoint = urljoin(backend_url, "/health")
backend_initialised = False

# NOTE(sd109): The Mistral family of models explicitly require a chat
# history of the form user -> ai -> user -> ... and so don't like having
# a SystemPrompt at the beginning. Since these models seem to be the
# best around right now, it makes sense to treat them as special and make
# sure the web app works correctly with them. To do so, we detect when a
# Mistral model is specified using this regex and then handle it explicitly
# when constructing the `context` list in the `inference` function below.
MISTRAL_REGEX = re.compile(r".*mi(s|x)tral.*", re.IGNORECASE)
IS_MISTRAL_MODEL = (MISTRAL_REGEX.match(settings.model_name) is not None)
if IS_MISTRAL_MODEL:
print("Detected Mistral model - will alter LangChain conversation format appropriately.")

llm = ChatOpenAI(
    base_url=urljoin(backend_url, "v1"),
    model=settings.model_name,
@@ -57,9 +70,17 @@ def inference(latest_message, history):


try:
    context = [SystemMessage(content=settings.model_instruction)]
    for human, ai in history:
        context.append(HumanMessage(content=human))
    # To handle Mistral models we have to add the model instruction to
    # the first user message since Mistral requires user -> ai -> user
    # chat format and therefore doesn't allow system prompts.
    context = []
    if not IS_MISTRAL_MODEL:
        context.append(SystemMessage(content=settings.model_instruction))
    for i, (human, ai) in enumerate(history):
        if IS_MISTRAL_MODEL and i == 0:
            context.append(HumanMessage(content=f"{settings.model_instruction}\n\n{human}"))
        else:
            context.append(HumanMessage(content=human))
        context.append(AIMessage(content=ai))
    context.append(HumanMessage(content=latest_message))
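    # For example, with history = [("Hi", "Hello!")] the resulting context is:
    #   Mistral models: [HumanMessage("<model_instruction>\n\nHi"),
    #                    AIMessage("Hello!"), HumanMessage(latest_message)]
    #   other models:   [SystemMessage(model_instruction), HumanMessage("Hi"),
    #                    AIMessage("Hello!"), HumanMessage(latest_message)]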
