<a target="_blank" href="https://colab.research.google.com/github/qianniucity/llm_notebooks/blob/main/notebooks/LLaMa_2_Prompting_Guide_with_Gradio.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

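## Introduction
In this Colab notebook, we will chat with Llama-2 7B.

By the end of this tutorial, you will be able to interact with the model and have it generate conversational responses.

Whether you are interested in chatbot technology or just curious how the model responds to specific questions, this notebook walks you through the whole process.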

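## Steps
1. **Installation:** We start by setting up the environment and installing the necessary libraries.
2. **Prerequisites:** Make sure we have access to the Llama-2 7B model on Hugging Face.
3. **Load the model and tokenizer:** Fetch the model and tokenizer for our session.
4. **Create the Llama pipeline:** Prepare the model to generate responses.
5. **Interact with Llama through Gradio's ChatInterface:** Ask the model questions and see how it performs.

Let's get started!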

**First, switch the runtime to GPU** (in Colab: `Runtime -> Change runtime type`).
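A quick way to confirm that the GPU runtime is active is to ask PyTorch directly. The helper below is a minimal sketch; `describe_accelerator` is a hypothetical name introduced here for illustration, and the import is done lazily so the check also runs before PyTorch is installed.

```python
def describe_accelerator() -> str:
    """Return a short note about the accelerator visible to PyTorch."""
    try:
        import torch  # lazy import: the check still works if torch is missing
    except ImportError:
        return "PyTorch is not installed yet; run the installation cell first."
    if torch.cuda.is_available():
        # Device 0 is the first (and on Colab usually the only) GPU
        return f"GPU detected: {torch.cuda.get_device_name(0)}"
    return "No GPU detected; select a GPU runtime before loading the model."


print(describe_accelerator())
```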

You can also try Llama-2 7B in this hosted demo: https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat

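## Installation

Before continuing, we need to make sure the necessary libraries are installed:

- `Hugging Face Transformers`: provides a simple way to use pretrained models.
- `PyTorch`: the foundation for deep learning operations.
- `Accelerate`: handles device placement for PyTorch models, which the `device_map="auto"` setting used later relies on.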


In [None]:
!pip install transformers torch accelerate

Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.25.0
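To confirm what actually got installed, the standard library can report package versions without importing the packages themselves. This is a small sketch; `installed_versions` is a hypothetical helper name, not part of any of the libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


for name, found in installed_versions(["transformers", "torch", "accelerate"]).items():
    print(f"{name}: {found or 'not installed'}")
```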


To use `gr.ChatInterface()`, we need to install the latest version of Gradio.

In [None]:
!pip install --upgrade gradio

Collecting gradio
  Downloading gradio-4.10.0-py3-none-any.whl (16.6 MB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.105.0-py3-none-any.whl (93 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.1.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client==0.7.3 (from gradio)
  Downloading gradio_client-0.7.3-py3-none-any.whl (304 kB)
Collecting httpx (from gradio)
  Downloading httpx-0.25.2-py3-none-any.whl (74 kB)

If `!pip install --upgrade gradio` fails with the error `NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968`, follow these steps:

1. Uncomment the next cell.
2. Run the cell.
3. Restart the runtime: `Runtime -> Restart Runtime`

In [None]:
# import locale
# locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
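To see whether this workaround applies to your session, you can inspect the encoding Python currently prefers (`ANSI_X3.4-1968` is another name for plain ASCII). The snippet below is only a diagnostic sketch; `has_utf8_locale` is a hypothetical helper name.

```python
import locale


def has_utf8_locale() -> bool:
    """Return True if the preferred encoding of this process is UTF-8."""
    encoding = locale.getpreferredencoding(False)
    # Normalize spellings such as "UTF-8", "utf8", or "UTF_8"
    return encoding.replace("-", "").replace("_", "").lower() == "utf8"


print("Preferred encoding:", locale.getpreferredencoding(False))
print("UTF-8 locale:", has_utf8_locale())
```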

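### Prerequisites

To load the model we want, `meta-llama/Llama-2-7b-chat-hf`, we first need to authenticate with Hugging Face. This confirms that our account has been granted access to the gated model.

1. Request access to the model on **Hugging Face**: [link](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
2. Log in with **login()** and confirm your authentication status.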


In [None]:
from huggingface_hub import login
login()



In [None]:
!huggingface-cli whoami

minp
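When the notebook runs non-interactively, the login widget is not practical. One possible alternative, sketched below, is to read an access token from an environment variable and pass it to `login(token=...)`. The variable name `HF_TOKEN` and the helper `login_from_env` are assumptions made for this example.

```python
import os


def login_from_env(env=None, var_name: str = "HF_TOKEN") -> bool:
    """Log in to Hugging Face with a token taken from `env`.

    Returns False (and does nothing) when no token is present.
    """
    env = os.environ if env is None else env
    token = env.get(var_name)
    if not token:
        return False
    from huggingface_hub import login  # lazy import: only needed with a token
    login(token=token)
    return True


# With an empty mapping there is no token, so nothing is sent anywhere.
print(login_from_env(env={}))  # → False
```

In a real session you would call `login_from_env()` with no arguments so that it reads from `os.environ`.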


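### Load the Model and Tokenizer

In this step, we prepare the environment and load the Llama model together with its tokenizer.

The tokenizer converts our text prompts into a format the model can understand and process.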
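To make the idea of "a format the model can understand" concrete, here is a toy word-level tokenizer. It is only an illustration: the real Llama 2 tokenizer works on subword pieces with a much larger vocabulary, and the functions `build_vocab`, `encode`, and `decode` are hypothetical names invented for this sketch.

```python
def build_vocab(corpus):
    """Assign an integer id to every distinct lowercase word in the corpus."""
    vocab = {"<unk>": 0}  # id 0 is reserved for unknown words
    for text in corpus:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn text into a list of ids; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids, vocab):
    """Turn a list of ids back into text."""
    inverse = {idx: word for word, idx in vocab.items()}
    return " ".join(inverse[idx] for idx in ids)


vocab = build_vocab(["hi i am kris", "what is my name"])
ids = encode("What is my name", vocab)
print(ids)                 # → [5, 6, 7, 8]
print(decode(ids, vocab))  # → what is my name
```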

In [None]:
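from transformers import AutoTokenizer
import transformers
import torch

# Hugging Face repo id of the gated chat model
model = "meta-llama/Llama-2-7b-chat-hf"

# token=True reuses the credentials saved by login();
# it replaces the deprecated use_auth_token argument.
tokenizer = AutoTokenizer.from_pretrained(model, token=True)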

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json.
Your request to access model meta-llama/Llama-2-7b-chat-hf is awaiting a review from the repo authors.

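If you see the message above, don't worry: your access request is still waiting for review by the model authors. Re-run the cell once access has been granted.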

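### Create the Llama Pipeline

We will set up a text-generation pipeline.

The pipeline simplifies the process of feeding prompts to our model and receiving the generated text.

*Note*: running this cell takes 2-3 minutes.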


In [None]:
from transformers import pipeline

llama_pipeline = pipeline(
    "text-generation",  # LLM task
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]
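The two weight shards in the log above add up to 9.98 GB + 3.50 GB = 13.48 GB. That is roughly what you would expect for a 7B model stored at 2 bytes per parameter, which is also the size of `torch.float16`. The arithmetic below is a back-of-the-envelope sketch; the parameter count of about 6.74 billion is an approximate figure assumed for the example, and the estimate covers weights only, not activations or the generation cache.

```python
def weights_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


APPROX_PARAMS = 6.74e9  # assumption: approximate parameter count of Llama-2 7B

fp16 = weights_size_gb(APPROX_PARAMS, 2)  # float16: 2 bytes per parameter
fp32 = weights_size_gb(APPROX_PARAMS, 4)  # float32: 4 bytes per parameter
print(f"float16: ~{fp16:.1f} GB, float32: ~{fp32:.1f} GB")
# → float16: ~13.5 GB, float32: ~27.0 GB
```

Loading in half precision is therefore what makes the model fit on a single GPU with around 16 GB of memory.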

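## Comparing Approaches: Basic vs. Advanced


Before diving into our advanced approach to conversational interaction, let's look at a basic approach that generates responses with the `get_response()` function. We will then discuss its limitations and how the advanced approach overcomes them.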




In [None]:
def get_response(prompt: str) -> None:
    """
    Generate a response from the Llama model.

    Parameters:
        prompt (str): The user's input/question for the model.

    Returns:
        None: Prints the model's response.
    """
    sequences = llama_pipeline(
        prompt,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=256,
    )
    print("Chatbot:", sequences[0]['generated_text'])



In [None]:
get_response("Hi, I'm Kris")

Chatbot: Hi, I'm Kris. Here are some of the best ways to get your ex back:
1. Give them space: If your ex has broken up with you, it's important to give them the space they need. Respect their boundaries and don't try to contact them for a while. This will give them time to process their feelings and think about what they want.
2. Show that you've changed: If you've made mistakes in the past, it's important to show your ex that you've changed and grown as a person. This can involve working on yourself, improving your behavior, and being more mindful of their needs.
3. Be patient: Getting your ex back can take time, so it's important to be patient and not rush things. Give them the time and space they need to come around, and don't try to force them into anything.
4. Be kind and respectful: Treat your ex with kindness and respect, even if they've hurt you in the past. Avoid being confrontational or aggressive, and try to maintain a positive attitude.
5. Communicate openly and honestly: 

In [None]:
get_response("What's my name?")

Chatbot: What's my name?

Answer: Your name is Jack.


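### Drawbacks of `get_response()`


1. **No conversation history**: the basic approach ignores past interactions, so it cannot keep a conversation consistent. In the run above, the model answers "Jack" even though the user had just introduced themselves as Kris.
2. **Limited customization**: the function does not support advanced prompt formatting or system-level instructions.
3. **Poor fit for UI integration**: the basic approach does not integrate easily with user interface libraries such as Gradio.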

## Improving the Prompt

The correct Llama 2 prompt structure:

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```
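As a minimal sketch, the template above can be filled in with an f-string. The function name `build_single_turn_prompt` and the sample texts are hypothetical and only illustrate where each piece goes.

```python
def build_single_turn_prompt(system_prompt: str, user_message: str) -> str:
    """Fill the Llama 2 chat template for a single user turn."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


prompt = build_single_turn_prompt("You are a helpful bot.", "Hi, I'm Kris")
print(prompt)
```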

### Building the Prompt


Parameters:

- `message`: the message we are currently sending
- `history`: the conversation history, as a list of tuples `[(user_msg1, bot_msg1), (user_msg2, bot_msg2), ...]`
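The tuple format described above is what the prompt-building code in this notebook expects. Depending on the Gradio version and the `type` setting of `gr.ChatInterface`, the history may instead arrive as a list of `{"role": ..., "content": ...}` dictionaries. If that happens, a small adapter such as the hypothetical `pairs_from_messages` below can convert it back; treat it as a sketch under that assumption rather than part of the original notebook.

```python
def pairs_from_messages(messages):
    """Convert role/content dictionaries into (user_msg, bot_msg) tuples."""
    pairs = []
    pending_user = None
    for item in messages:
        role, content = item["role"], item["content"]
        if role == "user":
            pending_user = content
        elif role == "assistant" and pending_user is not None:
            pairs.append((pending_user, content))
            pending_user = None
    return pairs


messages = [
    {"role": "user", "content": "Hi, I'm Kris"},
    {"role": "assistant", "content": "Hello Kris!"},
    {"role": "user", "content": "What's my name?"},
    {"role": "assistant", "content": "Your name is Kris."},
]
print(pairs_from_messages(messages))
```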

In [None]:
SYSTEM_PROMPT = """<s>[INST] <<SYS>>
You are a helpful bot. Your answers are clear and concise.
<</SYS>>

"""

# Format the current message and the conversation history
def format_message(message: str, history: list, memory_limit: int = 3) -> str:
    """
    Formats the message and history for the Llama model.

    Parameters:
        message (str): Current message to send.
        history (list): Past conversation history.
        memory_limit (int): Limit on how many past interactions to consider.

    Returns:
        str: Formatted message string
    """
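    # Keep only the most recent `memory_limit` exchanges.
    # history[-0:] would return the whole list, so handle 0 explicitly.
    if memory_limit <= 0:
        history = []
    elif len(history) > memory_limit:
        history = history[-memory_limit:]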

    if len(history) == 0:
        return SYSTEM_PROMPT + f"{message} [/INST]"

    formatted_message = SYSTEM_PROMPT + f"{history[0][0]} [/INST] {history[0][1]} </s>"

    # Append the remaining turns of the conversation history
    for user_msg, model_answer in history[1:]:
        formatted_message += f"<s>[INST] {user_msg} [/INST] {model_answer} </s>"

    # Append the current message
    formatted_message += f"<s>[INST] {message} [/INST]"

    return formatted_message
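To see what a multi-turn prompt looks like, and what `memory_limit` drops, here is a worked example. So that it runs on its own, it uses a condensed restatement of the logic in `format_message()`; the names `SYSTEM` and `build_prompt` and the sample conversation are made up for illustration.

```python
SYSTEM = "<s>[INST] <<SYS>>\nYou are a helpful bot.\n<</SYS>>\n\n"


def build_prompt(message, history, memory_limit=3):
    """Condensed version of the prompt builder, for illustration only."""
    recent = history[-memory_limit:] if memory_limit > 0 else []
    parts = [SYSTEM]
    for i, (user_msg, answer) in enumerate(recent):
        # The first kept turn continues the [INST] opened by the system prompt
        prefix = "" if i == 0 else "<s>[INST] "
        parts.append(f"{prefix}{user_msg} [/INST] {answer} </s>")
    prefix = "<s>[INST] " if recent else ""
    parts.append(f"{prefix}{message} [/INST]")
    return "".join(parts)


history = [
    ("Hi, I'm Kris", "Hello Kris!"),
    ("What is 2 + 2?", "4"),
    ("Thanks", "You're welcome!"),
]
prompt = build_prompt("What's my name?", history, memory_limit=2)
print(prompt)
```

With `memory_limit=2` the oldest exchange is dropped, so the name "Kris" no longer appears anywhere in the prompt and the model has no way to answer the question. This is the trade-off of a short memory window.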

### Getting a Response

We need a function that generates the response.

In [None]:
# Generate a response from the Llama model
def get_llama_response(message: str, history: list) -> str:
    """
    Generates a conversational response from the Llama model.

    Parameters:
        message (str): User's input message.
        history (list): Past conversation history.

    Returns:
        str: Generated response from the Llama model.
    """
    query = format_message(message, history)
    response = ""

    sequences = llama_pipeline(
        query,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=1024,
    )

    generated_text = sequences[0]['generated_text']
    response = generated_text[len(query):]  # Strip the prompt from the output

    print("Chatbot:", response.strip())
    return response.strip()
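The slicing step works because the pipeline returns the prompt followed by the newly generated text. The sketch below shows this with a stand-in for the real pipeline, so it runs without a GPU; `fake_pipeline` and `extract_reply` are hypothetical names, and the canned answer is invented.

```python
def fake_pipeline(prompt: str, **generation_kwargs):
    """Stand-in for llama_pipeline: echoes the prompt plus a canned answer."""
    return [{"generated_text": prompt + " Nice to meet you, Kris!"}]


def extract_reply(prompt: str, pipeline_fn) -> str:
    """Call the pipeline and return only the text generated after the prompt."""
    sequences = pipeline_fn(prompt, max_length=1024)
    generated_text = sequences[0]["generated_text"]
    return generated_text[len(prompt):].strip()


prompt = "<s>[INST] Hi, I'm Kris [/INST]"
print(extract_reply(prompt, fake_pipeline))  # → Nice to meet you, Kris!
```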


In [None]:
import gradio as gr

gr.ChatInterface(get_llama_response).launch()


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.
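Before wiring in the real model, it can help to test the interface with a cheap stand-in that has the same `(message, history)` signature. The sketch below uses a hypothetical `echo_response` function; `build_demo` is also a name invented here, and Gradio is imported lazily so the stand-in can be checked without it.

```python
def echo_response(message: str, history: list) -> str:
    """Stand-in chat function with the same signature as get_llama_response."""
    return f"You said: {message}"


def build_demo(fn=echo_response):
    """Wrap a chat function in a Gradio ChatInterface."""
    import gradio as gr  # lazy import: only needed when the UI is built
    return gr.ChatInterface(fn)


# Check the stand-in directly, without starting a server
print(echo_response("Hi, I'm Kris", []))  # → You said: Hi, I'm Kris

# In a notebook cell you could then run, for example:
# build_demo().launch(debug=True)  # debug=True shows errors in Colab
```

As the log above notes, passing `share=True` to `launch()` creates a public link.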





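### Conclusion


With the `Hugging Face` libraries, building a chat pipeline for `Llama 2` (or any other open-source LLM) is straightforward.

If you regularly work with larger models (such as `GPT-4`), we may cover them in a follow-up.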