# LocalAI and OpenVINO

[LocalAI](https://localai.io/) is a free, open-source alternative to OpenAI. It acts as a drop-in replacement REST API that is compatible with the OpenAI API specification for local inference. It lets you run LLMs and generate images, audio, and more, locally or on-premises on consumer-grade hardware, and it supports multiple model families and architectures. A GPU is not required. LocalAI is created and maintained by `Ettore Di Giacinto`.

In this tutorial, we show how to prepare a model configuration and launch an OpenVINO LLM with LocalAI in a Docker container.

#### Table of contents:

- [Prepare Docker](#Prepare-Docker)
- [Prepare a model](#Prepare-a-model)
- [Run the server](#Run-the-server)
- [Send a client request](#Send-a-client-request)
- [Stop the server](#Stop-the-server)

### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).




### Prepare Docker
[back to top ⬆️](#Table-of-contents:)

Install [Docker Engine](https://docs.docker.com/engine/install/), including its [post-installation](https://docs.docker.com/engine/install/linux-postinstall/) steps, on your development system. To verify the installation, run the following command. If Docker is set up correctly, it pulls a test image and prints a confirmation message.

In [1]:
!docker run hello-world


Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
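The same check can be scripted. The following is a minimal sketch, assuming the `docker` CLI is on your `PATH`; the helper name `docker_available` is hypothetical and not part of any library. It uses `docker info`, which exits with a non-zero status when the daemon cannot be reached.

```python
import shutil
import subprocess


def docker_available(timeout: float = 10.0) -> bool:
    """Return True if the docker CLI is installed and the daemon responds."""
    # No docker executable on PATH: nothing else to check.
    if shutil.which("docker") is None:
        return False
    try:
        # `docker info` fails (non-zero exit code) if the daemon is unreachable.
        result = subprocess.run(
            ["docker", "info"], capture_output=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("Docker ready:", docker_available())
```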



### Prepare a model
[back to top ⬆️](#Table-of-contents:)

LocalAI lets you use customized models. For details, see the [instructions](https://localai.io/docs/getting-started/customize-model/), which also contain the detailed documentation. We will use one of the OpenVINO-optimized LLMs from the [collection on 🤗Hugging Face](https://huggingface.co/collections/OpenVINO/llm-6687aaa2abca3bbcec71a9bd). In this example, we use [TinyLlama-1.1B-Chat-v1.0-fp16-ov](https://huggingface.co/OpenVINO/TinyLlama-1.1B-Chat-v1.0-fp16-ov). First, create a model configuration file:

```YAML
name: TinyLlama-1.1B-Chat-v1.0-fp16-ov
backend: transformers
parameters:
  model: OpenVINO/TinyLlama-1.1B-Chat-v1.0-fp16-ov
  temperature: 0.2
  top_k: 40
  top_p: 0.95
  max_new_tokens: 32
  
type: OVModelForCausalLM

template:
  chat_message: |
    <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
    {{if .Content}}{{.Content}}{{end}}<|im_end|>
  chat: |
    {{.Input}}
    <|im_start|>assistant
    
  completion: |
    {{.Input}}

stopwords:
- <|im_end|>
```
You can find the values for the `backend`, `model`, and `type` fields in the code example on the model page (we added the corresponding comments):
```python
from transformers import AutoTokenizer   # backend
from optimum.intel.openvino import OVModelForCausalLM  # type

model_id = "OpenVINO/TinyLlama-1.1B-Chat-v1.0-fp16-ov"  # parameters.model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)
```
You can choose the name yourself. Clients use this name to specify which model to use.


You can create a GitHub gist with this configuration and modify its fields as needed. This example uses [`ov.yaml`](https://gist.githubusercontent.com/aleksandr-mokrov/f007c8fa6036760a856ddc60f605a0b0/raw/9d24ceeb487f9c058a943113bd0290e8ae565b3e/ov.yaml)
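As one way to modify the fields, the sketch below renders the configuration from a template in Python and saves it to a local file whose contents can then be pasted into a gist. The names `render_config` and `save_config`, the example model name `my-tinyllama`, and the output path are hypothetical and chosen only for illustration; they are not part of LocalAI.

```python
from pathlib import Path
from string import Template

# The configuration shown above, with the fields you are most likely
# to change turned into $placeholders.
CONFIG_TEMPLATE = Template("""\
name: $name
backend: transformers
parameters:
  model: $model_id
  temperature: 0.2
  top_k: 40
  top_p: 0.95
  max_new_tokens: $max_new_tokens

type: OVModelForCausalLM

template:
  chat_message: |
    <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
    {{if .Content}}{{.Content}}{{end}}<|im_end|>
  chat: |
    {{.Input}}
    <|im_start|>assistant

  completion: |
    {{.Input}}

stopwords:
- <|im_end|>
""")


def render_config(
    name: str,
    model_id: str = "OpenVINO/TinyLlama-1.1B-Chat-v1.0-fp16-ov",
    max_new_tokens: int = 32,
) -> str:
    """Fill the placeholders and return the YAML text."""
    return CONFIG_TEMPLATE.substitute(
        name=name, model_id=model_id, max_new_tokens=max_new_tokens
    )


def save_config(path: Path, **fields) -> Path:
    """Write the rendered configuration to `path` and return the path."""
    path.write_text(render_config(**fields), encoding="utf-8")
    return path


print(render_config("my-tinyllama", max_new_tokens=64))
```

Whatever `name` you choose here is the value clients must pass as `model` in their requests.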

A description of the parameters used in the YAML configuration file can be found [here](https://localai.io/advanced/#advanced-configuration-with-yaml-files).

The most important ones are:

- `name` - the model name, used to identify the model in API calls.
- `backend` - the backend to use for computation (for example, llama-cpp, diffusers, whisper, transformers).
- `parameters.model` - the model to load, relative to the models path; in this example, the Hugging Face model ID.
- `temperature`, `top_k`, `top_p`, `max_new_tokens` - generation parameters for the model.
- `type` - the type of configuration, often related to the type of task or model architecture.
- `template` - templates for the various types of model interactions.
- `stopwords` - words or phrases that halt processing.
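To make the `template` and `stopwords` entries more concrete, the sketch below approximates in Python the prompt that the `chat_message` and `chat` templates from the configuration would produce for a list of chat messages. It is only an illustration: LocalAI renders these templates with Go's template engine, and joining the rendered messages with newlines is an assumption made here. The function names are hypothetical.

```python
# Roles that the chat_message template maps to a role name;
# any other role renders as an empty string.
KNOWN_ROLES = {"assistant", "system", "user"}


def render_chat_message(role: str, content: str) -> str:
    """Approximate the `chat_message` template for one message."""
    role_name = role if role in KNOWN_ROLES else ""
    return f"<|im_start|>{role_name}\n{content}<|im_end|>"


def render_prompt(messages: list[dict]) -> str:
    """Approximate the `chat` template: all messages, then an open assistant turn."""
    # Assumption: rendered messages are joined with newlines to form `.Input`.
    joined = "\n".join(
        render_chat_message(m["role"], m["content"]) for m in messages
    )
    return f"{joined}\n<|im_start|>assistant\n"


prompt = render_prompt(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is OpenVINO?"},
    ]
)
print(prompt)
```

The prompt ends with an open assistant turn, so the model continues from there, and the `<|im_end|>` stopword ends processing once the model closes that turn.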



### Run the server
[back to top ⬆️](#Table-of-contents:)

Everything is ready for launch. Use the `quay.io/go-skynet/local-ai:master-sycl-f16-ffmpeg` image, which contains all required dependencies. For more details, read [Run with container images](https://localai.io/basics/container/#standard-container-images).
If you want to see the server output, remove the `-d` flag and send the client request from a separate notebook.

In [None]:
!docker run -d --rm --name="localai" -p 8080:8080 quay.io/go-skynet/local-ai:master-sycl-f16-ffmpeg https://gist.githubusercontent.com/aleksandr-mokrov/f007c8fa6036760a856ddc60f605a0b0/raw/9d24ceeb487f9c058a943113bd0290e8ae565b3e/ov.yaml

c9e6ca714193e9457c461751ae0a10933675c67a3f76b2e1855d6dae149b2bce


Check whether the `localai` container is running normally:

In [3]:
!docker ps | grep localai

c9e6ca714193   quay.io/go-skynet/local-ai:master-sycl-f16-ffmpeg   "/build/entrypoint.s…"   1 second ago   Up Less than a second (health: starting)   0.0.0.0:7860->8080/tcp, [::]:7860->8080/tcp   localai
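A container that has just started reports `health: starting`, so the server may not accept requests yet. The following is a minimal polling sketch, assuming the server is published on `http://localhost:8080` and answers the OpenAI-style `GET /v1/models` route once it is up. The helper names `models_endpoint_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str = "http://localhost:8080") -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, or timed out: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout: float = 300.0, interval: float = 2.0
) -> bool:
    """Call `probe` every `interval` seconds until it succeeds or `timeout` passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


# Usage (requires the running container):
# wait_until_ready(models_endpoint_ok)
```

Passing the probe as a function keeps the waiting logic independent of the endpoint, so a different check can be swapped in without changing the loop.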


### Send a client request
[back to top ⬆️](#Table-of-contents:)

Now you can send HTTP requests using the model name `TinyLlama-1.1B-Chat-v1.0-fp16-ov`. For more details on the request format, see the [OpenAI API](https://platform.openai.com/docs/api-reference/chat) reference.

In [None]:
!curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "TinyLlama-1.1B-Chat-v1.0-fp16-ov", "prompt": "What is OpenVINO?"}'

{"created":1732756622,"object":"text_completion","id":"af66405e-1579-41a5-b18e-7ce5b4292a63","model":"TinyLlama-1.1B-Chat-v1.0-fp16-ov","choices":[{"index":0,"finish_reason":"stop","text":"\n\nOpenVINO is a toolkit for Intel(r) OpenVINO(r) Toolkit, which is a toolkit for developing and deploying deep learning models on Intel(r) architecture. OpenVINO is designed to help developers implement deep learning models efficiently on Intel(r) architecture.\n\nThe main features of OpenVINO include:\n\n1. Develop and train deep learning models: OpenVINO provides a powerful toolset for developing and training deep learning models, including data augmentation, image resizing, and preprocessing.\n\n2. Compile and run models: OpenVINO provides a compilation and runtime environment for deploying trained models on Intel(r) architecture. Users can easily execute trained models on devices with Intel(r) architecture through the OpenVINO Runtime.\n\n3. Support for diverse deep learning frameworks: OpenVIN
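The same request can be sent from Python. The sketch below mirrors the `curl` call above using only the standard library; it assumes the server is reachable at `http://localhost:8080`, and the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "TinyLlama-1.1B-Chat-v1.0-fp16-ov",
    base_url: str = "http://localhost:8080",
) -> urllib.request.Request:
    """Build the POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Usage (requires the running container):
# print(complete("What is OpenVINO?"))
```

The `model` value must match the `name` field of the configuration file.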

### Stop the server
[back to top ⬆️](#Table-of-contents:)

In [10]:
!docker stop localai

localai
