# IPEX-LLM 

> [IPEX-LLM](https://github.com/intel-analytics/ipex-llm/) is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or discrete GPUs such as Arc, Flex and Max) with very low latency.

This example goes over how to use LlamaIndex to interact with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm/) for text generation and chat on CPU. 

For more examples and usage, refer to [Examples](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/llms/llama-index-llms-ipex-llm/examples).

Install `llama-index-llms-ipex-llm`. This will also install `ipex-llm` and its dependencies.

In [None]:
%pip install llama-index-llms-ipex-llm

In this example we'll use the [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) model for demonstration. It requires updating the `transformers` and `tokenizers` packages.

In [None]:
%pip install -U transformers==4.37.0 tokenizers==0.15.2

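Before loading the Zephyr model, you'll need to define `completion_to_prompt` and `messages_to_prompt` to format prompts. Zephyr expects each turn to be introduced by a `<|system|>`, `<|user|>` or `<|assistant|>` tag and terminated by `</s>`, so these functions wrap raw text and chat messages in that layout.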

In [None]:
# Transform a string into zephyr-specific input
def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"


# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt
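
To see what these formatters produce, the following sketch runs a compact copy of `messages_to_prompt` on a short conversation. `SimpleNamespace` is used as a stand-in for LlamaIndex chat messages (an assumption for illustration: only `.role` and `.content` are read, and roles are plain strings), so it runs without loading a model.

```python
from types import SimpleNamespace


# Compact copy of messages_to_prompt from the cell above, so this runs on its own.
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"
    # Insert a blank system block if the conversation did not start with one
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    # Leave an open assistant tag for the model to continue from
    return prompt + "<|assistant|>\n"


# Hypothetical stand-ins for chat messages: plain objects with .role and .content.
conversation = [
    SimpleNamespace(role="user", content="What is AI?"),
    SimpleNamespace(role="assistant", content="AI is the study of intelligent machines."),
    SimpleNamespace(role="user", content="Give one example."),
]

prompt = messages_to_prompt(conversation)
print(prompt)
```

Because the conversation has no system message, a blank system block is prepended, and the prompt ends with an open `<|assistant|>` tag.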

Load the Zephyr model locally using IpexLLM.

In [None]:
from llama_index.llms.ipex_llm import IpexLLM

llm = IpexLLM.from_model_id(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    context_window=512,
    max_new_tokens=128,
    # do_sample=False selects greedy decoding, so temperature has no effect here
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    completion_to_prompt=completion_to_prompt,
    messages_to_prompt=messages_to_prompt,
)
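
As a rough guide to the settings above, and assuming the common convention that `context_window` covers both the prompt and the generated tokens, the space left for the prompt can be estimated with simple arithmetic. The helper name `prompt_token_budget` is hypothetical and not part of `ipex-llm` or LlamaIndex.

```python
def prompt_token_budget(context_window: int, max_new_tokens: int) -> int:
    """Estimate how many tokens remain for the prompt (assumes a shared window)."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return budget


# Values taken from the from_model_id call above
budget = prompt_token_budget(context_window=512, max_new_tokens=128)
print(budget)  # 384
```

If your prompts are longer than this estimate, consider raising `context_window` or lowering `max_new_tokens`.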

You're all set! You can now use IpexLLM for text completion tasks and interactive chat. 

### Text Completion

In [None]:
completion_response = llm.complete("Once upon a time, ")
print(completion_response.text)

### Streaming Text Completion

In [None]:
response_iter = llm.stream_complete("Once upon a time, there's a little girl")
for response in response_iter:
    print(response.delta, end="", flush=True)
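
If you also need the full text after streaming, one option is to collect the deltas as they arrive. This is a minimal sketch: `fake_stream_complete` is a hypothetical stand-in for `llm.stream_complete(...)` that yields objects with a `.delta` attribute, so the example runs without a model.

```python
from types import SimpleNamespace


def fake_stream_complete():
    # Hypothetical stand-in for llm.stream_complete(...): yields incremental pieces
    for piece in ["Once ", "upon ", "a ", "time"]:
        yield SimpleNamespace(delta=piece)


chunks = []
for response in fake_stream_complete():
    print(response.delta, end="", flush=True)  # show text as it arrives
    chunks.append(response.delta)  # keep each piece for later use

full_text = "".join(chunks)
```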

### Chat

In [None]:
from llama_index.core.llms import ChatMessage

message = ChatMessage(role="user", content="Explain Big Bang Theory briefly")
resp = llm.chat([message])
print(resp)

### Streaming Chat

In [None]:
message = ChatMessage(role="user", content="What is AI?")
resp = llm.stream_chat([message], max_tokens=256)
for r in resp:
    print(r.delta, end="")

### Low Bit Model

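You can save the low-bit model to disk and later reload it with `from_model_id_low_bit` instead of `from_model_id`. In the example below, the tokenizer is still loaded from the original model ID.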

In [None]:
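# Path where the low-bit model will be saved
LOW_BIT_MODEL_PATH = "./zephyr-7b-alpha-low-bit"

# Save the low-bit weights of the model loaded above
llm._model.save_low_bit(LOW_BIT_MODEL_PATH)

# Reload the model from the saved low-bit weights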
llm = IpexLLM.from_model_id_low_bit(
    model_name=LOW_BIT_MODEL_PATH,
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    context_window=512,
    max_new_tokens=64,
    completion_to_prompt=completion_to_prompt,
    # do_sample=False selects greedy decoding, so temperature has no effect here
    generate_kwargs={"temperature": 0.7, "do_sample": False},
)
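
One way to use this in practice is to convert once and reload on later runs. The sketch below only decides which constructor to call, based on whether the low-bit directory already exists and is non-empty. The helper name `pick_loader` and the directory check are assumptions for illustration, not part of the `ipex-llm` API.

```python
import tempfile
from pathlib import Path


def pick_loader(low_bit_path):
    """Return the name of the IpexLLM constructor to use for this path."""
    path = Path(low_bit_path)
    # Assumption: a non-empty directory means a low-bit model was saved earlier
    if path.is_dir() and any(path.iterdir()):
        return "from_model_id_low_bit"
    return "from_model_id"


# Demonstrate with a temporary directory instead of a real saved model
with tempfile.TemporaryDirectory() as tmp:
    # Nothing saved yet: fall back to the regular loader
    first_run = pick_loader(Path(tmp) / "zephyr-7b-alpha-low-bit")

    # Simulate a saved model with a placeholder file
    saved = Path(tmp) / "saved"
    saved.mkdir()
    (saved / "placeholder.bin").write_bytes(b"")
    later_run = pick_loader(saved)

print(first_run, later_run)
```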