# Notebook 3: Basic Application Development On Baichuan2

This notebook introduces the essential usage of `ipex-llm` and walks you through building a very basic chat application on top of `Baichuan2`.

## 3.1 Install `ipex-llm`

If you haven't installed `ipex-llm` yet, install it as shown below. This one-line command installs the latest `ipex-llm` together with all the dependencies for common LLM application development.

In [None]:
!pip install --pre --upgrade ipex-llm[all]

> **Note**
>
> * On Linux, we recommend installing with `pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu`. Please refer to https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html#quick-installation for more details.
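
Before running the rest of the notebook, it can help to confirm that the package is visible to the current kernel. The following is a minimal sketch using only the Python standard library. The helper name `installed_version` is hypothetical, and we assume the distribution is registered under the name `ipex-llm`, matching the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "ipex-llm" is assumed to be the distribution name used by pip.
version = installed_version("ipex-llm")
if version is None:
    print("ipex-llm is not installed in this environment")
else:
    print(f"ipex-llm {version} is installed")
```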

## 3.2 Load a pretrained Model

Before using an LLM, you first need to load one. Here we take [Baichuan2-7b-chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) as an example.

> **Note**
>
> * `Baichuan2-7b-chat` is an open-source large language model based on the Transformer architecture. You can find more information about this model on its [homepage](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) hosted on Hugging Face.

### 3.2.1 Load and Optimize Model
 
In general, a single call to `optimize_model` is enough to optimize any loaded PyTorch model, regardless of the library or API you are using. For more detailed usage of `optimize_model`, please refer to the [API documentation](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html).
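
As a rough sketch of that one-call pattern, the function below loads a model with the stock `transformers` API and passes it to `optimize_model`. The wrapper name `load_and_optimize` is hypothetical, and the `from_pretrained` arguments shown are one reasonable choice rather than a requirement. The imports sit inside the function so that defining it does not require the heavy dependencies.

```python
def load_and_optimize(model_path: str):
    """Load a causal LM with the stock transformers API, then optimize it."""
    # Imported lazily: both packages must be installed when the function runs.
    from transformers import AutoModelForCausalLM
    from ipex_llm import optimize_model

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        low_cpu_mem_usage=True,
        trust_remote_code=True,  # needed for models that ship custom code
    )
    # Called without extra arguments, optimize_model uses its defaults;
    # see the API documentation for the available options.
    return optimize_model(model)
```

Calling `load_and_optimize(model_path)` with the path to a downloaded checkpoint would return the optimized model, ready for the same inference code as the original.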

In addition, many popular open-source PyTorch large language models can be loaded using the Hugging Face Transformers API (such as [AutoModel](https://huggingface.co/docs/transformers/v4.33.2/en/model_doc/auto#transformers.AutoModel), [AutoModelForCausalLM](https://huggingface.co/docs/transformers/v4.33.2/en/model_doc/auto#transformers.AutoModelForCausalLM), etc.). For such models, `ipex-llm` also provides a set of APIs to support them. We will now demonstrate how to use them.

In this example, we use `ipex_llm.transformers.AutoModelForCausalLM` to load `Baichuan2-7b-chat`. This API mirrors the official `transformers.AutoModelForCausalLM`, adding only a few parameters and methods related to low-bit optimization during loading.

To enable INT4 optimization, simply set `load_in_4bit=True` in `from_pretrained`. Additionally, we configure the parameters `torch_dtype="auto"` and `low_cpu_mem_usage=True` by default, as they may improve both performance and memory efficiency. 
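
To give a feel for what 4-bit weight quantization involves, the sketch below quantizes a small block of weights symmetrically to signed 4-bit integers and then restores them. It is a generic pure-Python illustration with hypothetical helper names, not the kernel `ipex-llm` uses; the real format (block size, scale storage, rounding) may differ.

```python
from typing import List, Tuple


def quantize_sym_int4(block: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-8, 7] using one shared scale per block."""
    max_abs = max(abs(x) for x in block)
    # Divide by 7 so that +max_abs and -max_abs both stay in range.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(x / scale))) for x in block]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 4-bit integers and the scale."""
    return [q * scale for q in quantized]


# Made-up weights, used only for illustration.
weights = [0.12, -0.53, 0.07, 0.91, -0.88, 0.0, 0.33, -0.29]
q, scale = quantize_sym_int4(weights)
restored = dequantize(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"scale={scale:.4f}, worst absolute error={worst_error:.4f}")
```

Each weight is stored in 4 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale in this scheme.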

Remember to set `trust_remote_code=True` when loading the model weights and tokenizer. Baichuan2 ships its own modeling and tokenizer code in the model repository, and this flag allows `transformers` to execute that code.

In [2]:
from ipex_llm.transformers import AutoModelForCausalLM

model_path = 'D:\\LLM\\LLM_DEMO\\ModelCache\\Baichuan2-7B-chat'

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)

pip install xformers.
2024-06-08 15:37:38,443 - INFO - Converting the current model to sym_int4 format......


> **Note**
>
> * If you want to use precisions other than INT4 (e.g. FP8 or INT8), or want to know more about the arguments, please refer to the [API documentation](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html).
>
> * `baichuan-inc/Baichuan2-7B-Chat` is the **_model_id_** of the model `Baichuan2-7B-Chat` on Hugging Face. When you set the `model_path` parameter of `from_pretrained` to this **_model_id_**, `from_pretrained` automatically downloads the model from Hugging Face, caches it locally (e.g. in `~/.cache/huggingface`), and loads it. Downloading the model this way may take a long time. Alternatively, you can download the model yourself and set `model_path` to the local path of the downloaded model, as the cell above does. For more information, refer to the [`from_pretrained` documentation](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained).
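
The two notes above can be combined into a small sketch: prefer a local copy of the model when one exists, and collect the loading arguments in one place. The helper names and the local directory are hypothetical. We also assume that the installed `ipex-llm` version accepts a `load_in_low_bit` argument with a value such as `"sym_int8"`; please check the API documentation before relying on either.

```python
import os
from typing import Any, Dict


def resolve_model_source(local_dir: str, model_id: str) -> str:
    """Prefer an existing local copy; otherwise fall back to the model id."""
    return local_dir if os.path.isdir(local_dir) else model_id


def build_load_kwargs(low_bit: str = "sym_int4") -> Dict[str, Any]:
    """Collect the keyword arguments for from_pretrained in one place."""
    # load_in_low_bit is assumed here; verify it against the API documentation.
    return {"load_in_low_bit": low_bit, "trust_remote_code": True}


source = resolve_model_source(
    "./Baichuan2-7B-Chat",             # hypothetical local download location
    "baichuan-inc/Baichuan2-7B-Chat",  # model id from the note above
)
kwargs = build_load_kwargs("sym_int8")  # assumed precision name
print(source, kwargs)

# With ipex-llm installed, the model would then be loaded as:
#   from ipex_llm.transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(source, **kwargs)
```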


### 3.2.2 Save & Load Optimized Model

Models loaded through the Hugging Face Transformers API, as in the previous section, are typically stored in FP32 or FP16 precision. To save disk space and speed up loading, `ipex-llm` also provides the `save_low_bit` API for saving the model after low-bit optimization, and the `load_low_bit` API for loading the saved low-bit model.

You can call `save_low_bit` once and then call `load_low_bit` many times for inference. This approach skips loading the original FP32/FP16 model and re-running the optimization at inference time, saving both memory and time. Moreover, because the optimized model format is platform-agnostic, you can save on one machine and load on another, regardless of their operating systems. This flexibility lets you perform optimization and saving on a high-RAM server and deploy the model for inference on a PC with limited RAM.
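
A back-of-the-envelope calculation shows why the low-bit copy matters on a machine with limited RAM. The sketch below estimates the storage needed for the weights alone. The figure of 7 billion parameters is an assumption based on the "7B" in the model name, and the helper name is hypothetical.

```python
def weight_size_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 7e9  # assumption: roughly 7 billion parameters for a 7B model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT4", 4)]:
    print(f"{name}: ~{weight_size_gib(NUM_PARAMS, bits):.1f} GiB")
```

By this estimate a 4-bit copy is about a quarter of the FP16 size. Real checkpoints are somewhat larger, since quantization scales and any layers kept at higher precision also take space.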


**Save Optimized Model**

For example, you can use the `save_low_bit` function to save the optimized model as below:

In [3]:
save_directory = 'D:\\LLM\\LLM_DEMO\\ModelCache\\baichuan2-7b-chat-ipex-llm-INT4'

model.save_low_bit(save_directory)
del model  # release the in-memory model before reloading it from disk

**Load Optimized Model**

Then use `load_low_bit` to load the optimized low-bit model as follows:

In [4]:
# note that the AutoModelForCausalLM here is imported from ipex_llm.transformers
model = AutoModelForCausalLM.load_low_bit(save_directory, trust_remote_code=True)

pip install xformers.
2024-06-08 15:41:25,696 - INFO - Converting the current model to sym_int4 format......
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.12s/it]
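
The "save once, load many times" workflow can be wrapped in a small caching helper, sketched below. The helper names and `MODEL_PATH` are hypothetical placeholders; the two `ipex-llm` calls are the same ones used in the cells above. The loader functions are passed in as arguments so that the caching logic can be exercised without loading a real model.

```python
import os
from typing import Any, Callable

MODEL_PATH = "path/to/Baichuan2-7B-Chat"  # placeholder for the original checkpoint


def load_or_convert(
    low_bit_dir: str,
    convert_and_save: Callable[[str], Any],
    load_saved: Callable[[str], Any],
) -> Any:
    """Reuse a saved low-bit model if present; otherwise create and save it."""
    if os.path.isdir(low_bit_dir) and os.listdir(low_bit_dir):
        return load_saved(low_bit_dir)
    return convert_and_save(low_bit_dir)


def convert_and_save(low_bit_dir: str):
    """Convert the original checkpoint to INT4 and persist it (needs ipex-llm)."""
    from ipex_llm.transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH, load_in_4bit=True, trust_remote_code=True
    )
    model.save_low_bit(low_bit_dir)
    return model


def load_saved(low_bit_dir: str):
    """Load a previously saved low-bit model (needs ipex-llm)."""
    from ipex_llm.transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.load_low_bit(low_bit_dir, trust_remote_code=True)


# Typical use in this notebook:
#   model = load_or_convert(save_directory, convert_and_save, load_saved)
```

The first run pays the conversion cost; later runs find the saved directory and load the low-bit model directly.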


## 3.3 Building a Simple Chat Application

Now that the model is successfully loaded, we can start building our very first chat application. We will use the Hugging Face `transformers` inference API for this.

> **Note**
> 
> The code in this section is implemented solely with the Hugging Face `transformers` API. `ipex-llm` does not require any change to the inference code, so you can use any library to build your application at the inference stage.

> **Note**
> 
> Here we use a Q&A dialog prompt template so that the model answers our questions.
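
As a small sketch, the Q&A template used later in this section can be kept in one place and applied to any question. The names `QA_TEMPLATE` and `build_prompt` are hypothetical.

```python
QA_TEMPLATE = "Q: {question}\nA:"


def build_prompt(question: str) -> str:
    """Wrap a user question in the Q&A template used in this section."""
    return QA_TEMPLATE.format(question=question.strip())


print(build_prompt("What is AI?"))
```

Ending the prompt with `A:` nudges the model to continue with an answer rather than with another question.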


> **Note**
> 
> The `max_new_tokens` parameter of the `generate` function sets the maximum number of new tokens to generate.
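
To illustrate what this cap means, here is a toy decoding loop with a stand-in "model". It is not how `generate` is implemented; `greedy_generate` and `fake_next_token` are hypothetical names. It only shows that the output consists of the prompt ids followed by at most `max_new_tokens` new ids, and that generation can stop earlier when an end-of-sequence id appears.

```python
from typing import Callable, List


def greedy_generate(
    input_ids: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
    eos_token_id: int,
) -> List[int]:
    """Append tokens one at a time until EOS or the max_new_tokens cap."""
    output = list(input_ids)
    for _ in range(max_new_tokens):
        token_id = next_token(output)
        output.append(token_id)
        if token_id == eos_token_id:
            break
    return output


def fake_next_token(ids: List[int]) -> int:
    """Stand-in for a model: always predicts the last token id plus one."""
    return ids[-1] + 1


prompt_ids = [10, 11, 12]
capped = greedy_generate(prompt_ids, fake_next_token, max_new_tokens=4, eos_token_id=99)
stopped = greedy_generate(prompt_ids, fake_next_token, max_new_tokens=32, eos_token_id=14)
print(capped)   # prompt ids followed by exactly 4 new ids
print(stopped)  # stops early once the EOS id is produced
```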


In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

In [12]:
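# sample Chinese sentence: "Israel is attacking Iran"
token_ids = tokenizer.encode("以色列在攻击伊朗")
print(token_ids)
# decode each id on its own to see how the sentence was split into tokens
for token_id in token_ids:
    print(tokenizer.decode(token_id))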

[20930, 92355, 10646, 13475]
以色列
在
攻击
伊朗


In [17]:
import torch

with torch.inference_mode():
    # Q&A prompt; the question means "Where are you from?"
    prompt = 'Q: 你是哪里人?\nA:'
    
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids, max_new_tokens=32)
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

    print('-'*20, 'Output', '-'*20)
    print(output_str)

-------------------- Output --------------------
Q: 你是哪里人?
A: 我是中国人。
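
The decoded output above repeats the prompt before the answer. A minimal sketch of a multi-turn chat helper that keeps a history and strips the echoed prompt is shown below. All names are hypothetical, and `echo_model` is a stand-in so the logic can run without the real model. The sketch assumes the decoded text starts with the prompt verbatim; if the tokenizer alters the text, slicing the generated token ids is a safer choice.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


def build_chat_prompt(history: History, question: str) -> str:
    """Render earlier turns plus the new question in the Q&A format."""
    turns = [f"Q: {q}\nA: {a}" for q, a in history]
    turns.append(f"Q: {question}\nA:")
    return "\n".join(turns)


def extract_answer(output_str: str, prompt: str) -> str:
    """Drop the echoed prompt and anything after the model starts a new 'Q:'."""
    answer = output_str[len(prompt):] if output_str.startswith(prompt) else output_str
    return answer.split("\nQ:")[0].strip()


def chat_turn(history: History, question: str, generate_text: Callable[[str], str]) -> str:
    """Run one turn and record it in the history."""
    prompt = build_chat_prompt(history, question)
    answer = extract_answer(generate_text(prompt), prompt)
    history.append((question, answer))
    return answer


def echo_model(prompt: str) -> str:
    """Stand-in for the real model: echoes the prompt and adds a canned reply."""
    return prompt + " This is a canned reply.\nQ: something extra"


history: History = []
print(chat_turn(history, "Hello?", echo_model))
print(chat_turn(history, "And again?", echo_model))
```

To connect this to the model loaded earlier, `generate_text` would wrap the `tokenizer.encode`, `model.generate`, and `tokenizer.decode` calls from the cell above.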
