# IPEX-LLM 

> [IPEX-LLM](https://github.com/intel-analytics/ipex-llm/) is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or a discrete GPU such as Arc, Flex and Max) with very low latency. This PR adds ipex-llm integrations to LlamaIndex as a new integration package.

This example goes over how to use LlamaIndex to interact with IPEX-LLM for text generation on CPU.

In [None]:
%pip install llama-index-llms-ipex-llm

Install `ipex-llm` for CPU. For more details, refer to the [`ipex-llm` Install Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html).

In [None]:
%pip install --pre --upgrade ipex-llm[all]

In this example we use the [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) model for demonstration. Using this model requires updating `transformers` and `tokenizers`.

In [None]:
%pip install -U transformers==4.37.0 tokenizers==0.15.2

In [None]:
from llama_index.llms.ipex_llm import IpexLLM

llm = IpexLLM(
    model_name="/mnt/disk1/models/zephyr-7b-alpha",
    tokenizer_name="/mnt/disk1/models/zephyr-7b-alpha",
    # model_name="HuggingFaceH4/zephyr-7b-alpha",
    # tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    context_window=512,
    max_new_tokens=128,
    # With do_sample=False decoding is greedy, so temperature has no effect here.
    generate_kwargs={"temperature": 0.7, "do_sample": False},
)

Loading checkpoint shards: 100%|██████████| 8/8 [00:00<00:00, 18.55it/s]
2024-03-28 06:09:17,424 - INFO - Converting the current model to sym_int4 format......


load tokenizer: /mnt/disk1/models/zephyr-7b-alpha
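
The cell above loads the model from a local directory and keeps the Hugging Face Hub ID as a commented-out alternative. One way to choose between the two without editing the cell is sketched below. `resolve_model_name` is a hypothetical helper, not part of LlamaIndex or `ipex-llm`, and it assumes that a model ID can be passed wherever a local path is accepted, as the commented-out lines suggest.

```python
import os


def resolve_model_name(local_dir: str, hub_id: str) -> str:
    """Return local_dir if it exists on disk, otherwise fall back to hub_id."""
    return local_dir if os.path.isdir(local_dir) else hub_id


# Paths and IDs taken from the cell above; adjust them to your own setup.
model_name = resolve_model_name(
    "/mnt/disk1/models/zephyr-7b-alpha",
    "HuggingFaceH4/zephyr-7b-alpha",
)
```

The resulting `model_name` could then be passed as both `model_name` and `tokenizer_name` when constructing `IpexLLM`.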


### Text Completion

In [None]:
completion_response = llm.complete("Once upon a time, ")
print(completion_response.text)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




In a land far, far away, 

Lived a young girl named Lily, 

Who had a dream to chase every day. 

Lily loved to read, write, and draw, 

She'd spend hours in her room, 

Creating stories, characters, and worlds, 

That only she knew. 

One day, Lily's teacher said, 

"Lily, you're a talented writer, 

But you need to work on your grammar, 

And your spelling, and your


### Streaming Text Completion

In [None]:
response_iter = llm.stream_complete("Once upon a time, there's a little girl")
for response in response_iter:
    print(response.delta, end="", flush=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


named Lily. She loves to play with her toys, especially her teddy bear named Ted. One day, Lily's mommy told her that they're going to the park. Lily was so excited because she loves to play in the park. When they arrived at the park, Lily saw a big slide. She ran to it and climbed up. She was so scared to slide down, but she wanted to try it. She closed her eyes and slid down. She felt so fast and so high. She opened her eyes and saw her mommy waving at her. She smiled and waved back. She slid down again
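
The loop above only prints each chunk. If the complete text is also needed afterwards, the deltas can be accumulated while streaming. This is a minimal sketch that assumes each streamed item exposes a string `delta` attribute, as in the loop above. `collect_stream` is a hypothetical helper, and the fake stream stands in for `llm.stream_complete(...)` so the snippet runs without a model.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(response_iter: Iterable, echo: bool = True) -> str:
    """Optionally print each streamed delta and return the joined text."""
    parts = []
    for response in response_iter:
        # Treat a missing delta as empty text (defensive assumption).
        delta = response.delta or ""
        if echo:
            print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)


# Stand-in for llm.stream_complete(...): objects that carry a .delta string.
fake_stream = (SimpleNamespace(delta=d) for d in ["named ", "Lily", "."])
full_text = collect_stream(fake_stream, echo=False)
```

With a loaded model, the same helper would be called as `collect_stream(llm.stream_complete("Once upon a time, there's a little girl"))`.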

### Chat

In [None]:
from llama_index.core.llms import ChatMessage

message = ChatMessage(role="user", content="Explain Big Bang Theory briefly")
resp = llm.chat([message])
print(resp)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


assistant: <|assistant|>
The Big Bang Theory is a popular American sitcom that aired from 2007 to 2019. The show follows the lives of two brilliant but socially awkward physicists, Leonard Hofstadter and Sheldon Cooper, and their friends and colleagues, Penny, Raj, and Amy. The show is set in Pasadena, California, and revolves around the characters' work at Caltech and their personal lives. The show's humor is often based on science and pop culture references, and it explores themes such as friendship, love, and the
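
The printed reply above begins with the `<|assistant|>` marker from the model's chat template. If that marker is not wanted in downstream text, it can be stripped from the reply string. This is a minimal sketch operating on a plain string. `strip_role_marker` is a hypothetical helper, and the marker value is taken from the output shown above rather than from any library constant.

```python
def strip_role_marker(text: str, marker: str = "<|assistant|>") -> str:
    """Remove a leading role marker and surrounding whitespace, if present."""
    cleaned = text.lstrip()
    if cleaned.startswith(marker):
        cleaned = cleaned[len(marker):]
    return cleaned.lstrip()


raw_reply = "<|assistant|>\nThe Big Bang Theory is a popular American sitcom"
clean_reply = strip_role_marker(raw_reply)
```

With a real response, the input string would typically come from the reply's message content (for example `resp.message.content`) rather than from a literal.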


### Streaming Chat

In [None]:
message = ChatMessage(role="user", content="What is AI?")
resp = llm.stream_chat([message], max_tokens=1000)
for r in resp:
    print(r.delta, end="")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<|assistant|>
AI, or artificial intelligence, is a branch of computer science that focuses on creating intelligent machines that can learn, reason, and make decisions like humans do. It involves the use of algorithms and statistical models to enable computers to perform tasks that would normally require human intelligence, such as recognizing speech, understanding natural language, and making predictions based on data. AI is used in a variety of fields, including healthcare, finance, and transportation, to improve efficiency, accuracy, and safety.