> **NOTE**: This tutorial demonstrates how to use the GPU provided by Colab to run an LLM with a local Xinference server, and how to interact with the model in different ways (OpenAI-compatible endpoints, Xinference's built-in client, and LangChain).


# Xinference

Xorbits Inference (Xinference) is an open-source platform that streamlines the operation and integration of a wide array of AI models. With Xinference, you can run inference using any open-source LLMs, embedding models, and multimodal models, either in the cloud or on your own premises, and create robust AI-driven applications.




* [Docs](https://inference.readthedocs.io/en/latest/index.html)
* [Built-in Models](https://inference.readthedocs.io/en/latest/models/builtin/index.html)
* [Custom Models](https://inference.readthedocs.io/en/latest/models/custom.html)
* [Deployment Docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html)
* [Examples and Tutorials](https://inference.readthedocs.io/en/latest/examples/index.html)


## Set up the environment

> **NOTE**: We recommend running this demo on a GPU. To change the runtime type, click **Runtime** > **Change runtime type** in the toolbar menu and select the GPU (**T4**).


### Check memory and GPU resources

In [1]:
import psutil
import torch


ram = psutil.virtual_memory()
ram_total = ram.total / (1024**3)
print('RAM: %.2f GB' % ram_total)

print('=============GPU INFO=============')
if torch.cuda.is_available():
  !/opt/bin/nvidia-smi || true
else:
  print('GPU NOT available')

RAM: 12.67 GB
Wed Jan  3 08:02:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                      

### Install Xinference and dependencies

In [2]:
%pip install -U -q typing_extensions==4.5.0 xinference[transformers] openai langchain

In [3]:
!pip show xinference

Name: xinference
Version: 0.7.4.1
Summary: Model Serving Made Easy
Home-page: https://github.com/xorbitsai/inference
Author: Qin Xuye
Author-email: qinxuye@xprobe.io
License: Apache License 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: click, fastapi, fsspec, gradio, huggingface-hub, modelscope, openai, pydantic, requests, s3fs, sse-starlette, tabulate, torch, tqdm, typing-extensions, uvicorn, xoscar
Required-by: 


## A Quick Start Demo
### Start Local Server


To start a local instance of Xinference, run `xinference-local` in the background via `nohup`:

In [4]:
!nohup xinference-local  > xinference.log 2>&1 &

Congrats! You now have Xinference running on the Colab machine. The default host and port are 127.0.0.1 and 9997 respectively.
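
Because the server is started in the background, the next cell can run before the server is ready to accept requests. The following is a minimal sketch of one way to wait for it, assuming that an open TCP port on 127.0.0.1:9997 is a good enough readiness signal. The helper name `wait_for_port` is hypothetical and not part of Xinference.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet: give up after the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (not executed here):
# ready = wait_for_port("127.0.0.1", 9997)
```

If the call returns `False`, `xinference.log` (the file the `nohup` command writes to) is the first place to look.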


Once Xinference is running, there are multiple ways to try it: via the web UI, via cURL, via the command line, or via Xinference's Python client.

The command line tool is `xinference`. You can list the commands that can be used by running:

In [5]:
!xinference --help

Usage: xinference [OPTIONS] COMMAND [ARGS]...

  Xinference command-line interface for serving and deploying models.

Options:
  -v, --version       Show the current version of the Xinference tool.
  --log-level TEXT    Set the logger level. Options listed from most log to
                      (Default level is INFO)
  -H, --host TEXT     Specify the host address for the Xinference server.
  -p, --port INTEGER  Specify the port number for the Xinference server.
  --help              Show this message and exit.

Commands:
  chat           Chat with a running LLM.
  generate       Generate text using a running LLM.
  launch         Launch a model with the Xinference framework with the...
  list           List all running models in Xinference.
  register       Registers a new model with Xinference for deployment.
  registrations  Lists all registered models in Xinference.
  terminate      Terminate a deployed model through unique identifier...
  unregister     Unregisters a model from Xi

### Run Qwen-Chat

Xinference supports a variety of LLMs. Learn more at https://inference.readthedocs.io/en/latest/models/builtin/.

Let’s start by running a built-in model: `Qwen-1_8B-Chat`.


We can specify the model's UID using the `--model-uid` or `-u` flag. If it is not specified, Xinference will generate one. The following command creates a new model instance with the unique ID `my-llm`:


In [6]:
!xinference launch -u my-llm -n qwen-chat -s 1_8 -f pytorch

Model uid: my-llm
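
The same launch can probably be done from Python instead of the CLI. The sketch below assumes that `RESTfulClient.launch_model` accepts the keyword names shown, which mirror the `-n`, `-u`, `-s`, and `-f` flags. Those keyword names should be checked against the installed Xinference version, and the wrapper `launch_my_llm` is a hypothetical helper.

```python
def launch_my_llm(client):
    """Launch qwen-chat 1.8B (PyTorch format) under the UID 'my-llm'.

    `client` is expected to be an xinference.client.RESTfulClient, or any
    object that exposes a compatible `launch_model` method.
    The keyword names below are assumptions; verify them for your version.
    """
    return client.launch_model(
        model_name="qwen-chat",        # CLI: -n qwen-chat
        model_uid="my-llm",            # CLI: -u my-llm
        model_size_in_billions="1_8",  # CLI: -s 1_8
        model_format="pytorch",        # CLI: -f pytorch
    )


# Example (not executed here):
# from xinference.client import RESTfulClient
# uid = launch_my_llm(RESTfulClient("http://127.0.0.1:9997"))
```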


When you start a model for the first time, Xinference will download the model parameters from HuggingFace, which might take a few minutes depending on the size of the model weights. We cache the model files locally, so there’s no need to redownload them for subsequent starts.
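
To get a feel for how large the weights are, a back-of-the-envelope estimate helps. The sketch below assumes 1.8 billion parameters (taken from the model name) stored as 16-bit floats, that is, 2 bytes per parameter. Both figures are assumptions, and the estimate ignores activations and the KV cache, which need additional GPU memory at inference time.

```python
def estimate_weight_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GiB."""
    return params_billions * 1e9 * bytes_per_param / 1024**3


weights_gib = estimate_weight_gib(1.8)  # Qwen-1_8B-Chat, assuming float16 weights
gpu_gib = 15360 / 1024                  # T4 memory in MiB, as reported by nvidia-smi earlier
print(f"weights: ~{weights_gib:.2f} GiB of {gpu_gib:.0f} GiB GPU memory")
```

Under these assumptions the weights use only a fraction of the T4's memory, which is why this model is a comfortable fit for the free Colab GPU.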


## Interact with the running model

Congrats! The model is now running in Xinference. Once it is running, we can try it out in several ways: through the OpenAI-compatible endpoint, via cURL, via Xinference's Python client, or through LangChain:



### 1. Use the OpenAI-compatible endpoint

Xinference provides OpenAI-compatible APIs for its supported models, so you can use Xinference as a local drop-in replacement for OpenAI APIs. For example:


In [7]:
import openai

messages=[
    {
        "role": "user",
        "content": "Who are you?"
    }
]

client = openai.Client(api_key="empty", base_url="http://127.0.0.1:9997/v1")
client.chat.completions.create(
    model="my-llm",
    messages=messages,
)

ChatCompletion(id='chat899575cc-aa0e-11ee-9dba-0242ac1c000c', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='I am an AI language model created by Alibaba Cloud. I have been trained on a vast amount of text data and can answer questions, provide suggestions, and engage in conversations with users. How may I assist you today?', role='assistant', function_call=None, tool_calls=None))], created=1704268990, model='my-llm', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=44, prompt_tokens=23, total_tokens=67))
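
For longer replies it is often nicer to receive the text incrementally. The sketch below assumes that the endpoint honors the OpenAI `stream=True` option and returns chunks in the usual OpenAI shape (`chunk.choices[0].delta.content`). The helper names `collect_stream` and `stream_reply` are hypothetical.

```python
def collect_stream(chunks) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices at all
            continue
        delta = chunk.choices[0].delta
        if delta.content:      # role-only and final chunks have no text
            parts.append(delta.content)
    return "".join(parts)


def stream_reply(base_url: str, model: str, messages: list) -> str:
    """Request a streamed completion and return the assembled reply."""
    import openai  # imported lazily so collect_stream works without the package

    client = openai.Client(api_key="empty", base_url=base_url)
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    return collect_stream(stream)


# Example (not executed here):
# stream_reply("http://127.0.0.1:9997/v1", "my-llm",
#              [{"role": "user", "content": "Who are you?"}])
```

For a non-streamed response such as the one shown, the reply text is available as `response.choices[0].message.content`.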

### 2. Send a request using cURL




In [8]:
!curl -k -X 'POST' -N \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "model": "my-llm", "messages": [ {"role": "system", "content": "You are a helpful assistant." }, {"role": "user", "content": "What is the largest animal?"} ]}'

{"id":"chat8bd9a524-aa0e-11ee-9dba-0242ac1c000c","object":"chat.completion","created":1704268994,"model":"my-llm","choices":[{"index":0,"message":{"role":"assistant","content":"It is difficult to determine which animal is the largest as there is no single animal that can be considered the \"largest\" in all senses. The size of an animal can vary greatly depending on its habitat, species, and individual characteristics.\n\nFor example, giant pandas are known for their large size, with males weighing up to 200 pounds and females weighing up to 135 pounds. Other large animals include blue whales, the world's largest animal, with estimated populations ranging from over 100,000 individuals; elephants, which can weigh over 6,000 pounds and stand up to "},"finish_reason":"length"}],"usage":{"prompt_tokens":25,"completion_tokens":127,"total_tokens":152}}
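
Note that the response above ends with `"finish_reason":"length"`, which means the reply was cut off by the token limit rather than finished by the model. The sketch below builds the same request with Python's standard library and detects that case. It assumes the endpoint accepts the standard OpenAI `max_tokens` field, and the helper names `build_chat_request` and `read_reply` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list, max_tokens=None):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": messages}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens  # assumed to be honored by the server
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )


def read_reply(response: dict):
    """Return (text, truncated) from a chat.completion response body."""
    choice = response["choices"][0]
    return choice["message"]["content"], choice["finish_reason"] == "length"


# Example (not executed here):
# request = build_chat_request("http://127.0.0.1:9997", "my-llm",
#                              [{"role": "user", "content": "What is the largest animal?"}],
#                              max_tokens=512)
# with urllib.request.urlopen(request) as resp:
#     text, truncated = read_reply(json.load(resp))
```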

### 3. Use Xinference's Python client

In [9]:
from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model = client.get_model("my-llm")
model.chat(
    prompt="hello",
    chat_history=[
    {
        "role": "user",
        "content": "What is the largest animal?"
    }]
)

{'id': 'chat8cef808c-aa0e-11ee-9dba-0242ac1c000c',
 'object': 'chat.completion',
 'created': 1704268996,
 'model': 'my-llm',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "Hello! I'm here to help answer any questions you may have. Is there something specific you would like to know about animals or anything else?"},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 31, 'completion_tokens': 29, 'total_tokens': 60}}
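
To continue the conversation, the previous turn has to be passed back through `chat_history`. The sketch below shows one way to build the next history from the response dictionary in the format printed above. It assumes the client does not already modify the list you pass in, which is worth checking for your version, and `extend_history` is a hypothetical helper.

```python
def extend_history(chat_history: list, prompt: str, response: dict) -> list:
    """Return a new history with the latest user prompt and assistant reply appended."""
    reply = response["choices"][0]["message"]
    # Build a new list so the caller's original history is left untouched.
    return chat_history + [
        {"role": "user", "content": prompt},
        {"role": reply["role"], "content": reply["content"]},
    ]


# Example (not executed here):
# history = []
# response = model.chat(prompt="hello", chat_history=history)
# history = extend_history(history, "hello", response)
# response = model.chat(prompt="What can you do?", chat_history=history)
```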

### 4. LangChain integration

In [10]:
from langchain.llms import Xinference
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = Xinference(server_url='http://127.0.0.1:9997', model_uid='my-llm')

template = 'What is the largest {kind} on the earth?'

prompt = PromptTemplate(template=template, input_variables=['kind'])

llm_chain = LLMChain(prompt=prompt, llm=llm)

generated = llm_chain.run(kind='plant')
print(generated)

 The answer to this question is subjective and can vary depending on factors such as the definition of "largest" and location. However, some estimates put the size of a particular plant at over 50 feet tall or more.

One example of such a large plant is the giant sequoia tree ( Sequoia sempervirens), which is found only in California, USA. The tree has a diameter of up to 138 feet and can grow up to 600 feet tall. It is considered one of the oldest living organisms on Earth and is estimated to be over 27 million years old.

Another
