<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
<br>Chinese translation repository: <a href="https://github.com/GoatCsu/CN-LLMs-from-scratch.git">https://github.com/GoatCsu/CN-LLMs-from-scratch.git</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


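# Generating an Instruction Dataset with Llama 3 and Ollama

- This notebook uses the **8-billion-parameter Llama 3 model** served by **Ollama** to generate a **synthetic dataset**, following the method from the paper **"Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing"** ([https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464)).

- The generated dataset uses an **instruction dataset format** with `"instruction"` and `"output"` fields, similar to the **Alpaca dataset**: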


```python
{
    "instruction": "What is the atomic number of helium?",
    "output": "The atomic number of helium is 2.",
},
```
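
Because every later step relies on this two-field layout, a small check can catch malformed entries early. The following is a minimal sketch; the helper name `is_valid_entry` and the rule that both fields must be non-empty strings are assumptions for illustration, not part of the original notebook.

```python
REQUIRED_KEYS = {"instruction", "output"}  # field names from the example above


def is_valid_entry(entry):
    """Return True if entry is a dict with non-empty string values for both fields."""
    if not isinstance(entry, dict):
        return False
    return all(
        isinstance(entry.get(key), str) and bool(entry[key].strip())
        for key in REQUIRED_KEYS
    )


example = {
    "instruction": "What is the atomic number of helium?",
    "output": "The atomic number of helium is 2.",
}
print(is_valid_entry(example))
```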

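- **The code does not require a GPU** and runs on a **laptop** (it was tested on an **M3 MacBook Air**).

*Note that the instruction dataset generated here is for **educational purposes** only. **Users are responsible** for ensuring that their use complies with the terms of the **Meta AI Llama 3** license agreement.*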

In [1]:
from importlib.metadata import version

pkgs = [
    "tqdm",    # Progress bar
]

for p in pkgs:
    print(f"{p} version: {version(p)}")

tqdm version: 4.66.4


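## Installing Ollama and Downloading Llama 3

- **Ollama** is an application for running **LLMs** (large language models) efficiently.
- It is a wrapper around **[llama.cpp](https://github.com/ggerganov/llama.cpp)**, which **implements LLMs in pure C/C++** to **maximize inference efficiency**.
- **Note**: Ollama is a tool for **inference** only; it **does not support training or finetuning LLMs**.
- Before running the code below, visit **[https://ollama.com](https://ollama.com)** and follow the instructions there to **install Ollama** (for instance, click the **"Download"** button and download the Ollama application for your operating system).


- **On macOS and Windows**, open the **downloaded Ollama application**; if prompted to install the **command-line tools**, choose **"Yes"**.
- **Linux users** can use the **installation command provided on the Ollama website**.

- **In general**, before using the **Ollama command-line tool**, you need to either **start the Ollama application** or **run `ollama serve` in a terminal**.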

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/ch7/21.png">
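
To confirm from Python that the server is up before going further, one option is to send a plain HTTP request to it. The following is a minimal sketch, assuming the server listens at the default address `http://localhost:11434` used later in this notebook; the helper name `ollama_is_reachable` is hypothetical.

```python
import urllib.error
import urllib.request


def ollama_is_reachable(url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at `url` with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


if not ollama_is_reachable():
    print("Ollama does not appear to be running; start the app or run `ollama serve`.")
```

The check only tells you that something answers at that address; it does not verify that a particular model has been downloaded.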

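- **Once Ollama is running**, run the following command in a **separate terminal window** to try out the **8B-parameter Llama 3 model** (the model is downloaded automatically on first use and takes up **4.7 GB of storage**):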


```bash
# 8B model
ollama run llama3
```


The output looks like this:

```
$ ollama run llama3
pulling manifest 
pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB                         
pulling 4fa551d4f938... 100% ▕████████████████▏  12 KB                         
pulling 8ab4849b038c... 100% ▕████████████████▏  254 B                         
pulling 577073ffcc6c... 100% ▕████████████████▏  110 B                         
pulling 3f8eb4da87fa... 100% ▕████████████████▏  485 B                         
verifying sha256 digest 
writing manifest 
removing any unused layers 
success 
```

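- **Note**: `llama3` refers to the **instruction-finetuned 8-billion-parameter Llama 3 model**.

- **If your machine supports it**, you can replace `llama3` with **`llama3:70b`** to use the **larger 70-billion-parameter Llama 3 model**.

- **After the download completes**, you will see a **command-line prompt** where you can chat with the model.

- **Try a prompt** such as "What do llamas eat?"; the model's output should look similar to the following: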

```
>>> What do llamas eat?
Llamas are ruminant animals, which means they have a four-chambered 
stomach and eat plants that are high in fiber. In the wild, llamas 
typically feed on:
1. Grasses: They love to graze on various types of grasses, including tall 
grasses, wheat, oats, and barley.
```

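- Type `/bye` to end the session.

## Using Ollama's REST API

- Another way to interact with the model is to **call its REST API from Python**, using the function below.
- **Before running the code in this notebook**, make sure **Ollama is still running**. You can start it by either:
  - running `ollama serve` in a terminal
  - launching the **Ollama application**

- Next, run the code cells below to **query the model**.

- **First, we test the API with a simple example** to make sure it **works as intended**: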


In [2]:
import urllib.request
import json

def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat", role="user"):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "seed": 123,        # fixed seed, intended for reproducibility
        "temperature": 1.,   # sampling temperature; 1.0 leaves the token distribution unscaled
        "top_p": 1,
        "messages": [
            {"role": role, "content": prompt}
        ]
    }

    # Convert the dictionary to a JSON-formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, set the method to POST, and add the necessary header
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the streamed response line by line
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data

In [3]:
result = query_model("What do Llamas eat?")
print(result)

Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: They enjoy eating hay, such as alfalfa or timothy hay, which provides them with fiber, protein, and other essential nutrients.
3. Grains: Llamas may eat grains like oats, barley, or corn as a supplement to their diet.
4. Leaves: They will also munch on leaves from trees and shrubs, including clover, alfalfa, and various types of leaves.
5. Fruits and vegetables: In the wild, llamas might eat fruits and vegetables that grow in their natural habitat, such as apples, carrots, or potatoes.

In general, a llama's diet should consist of:

* 50% grasses and hay
* 20% grains (like oats or corn)
* 10% leaves and other plant material
* 5% fruits and vegetables (as treats)

It's essential to provide llamas with a balanced diet that meets their nutritional n
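
Ollama's API documentation lists sampling settings such as `seed`, `temperature`, and `top_p` under an `options` object in the request body. The following is a minimal sketch of a payload builder that uses that layout; the helper name `build_chat_payload` and its defaults are hypothetical and not part of the original notebook. The seed is optional here on the assumption that a fixed seed would tend to make repeated calls with the same prompt return the same text, which is undesirable when sampling many different instructions.

```python
import json


def build_chat_payload(prompt, model="llama3", role="user", seed=None,
                       temperature=1.0, top_p=1.0):
    """Build a request body for /api/chat with sampling settings under "options"."""
    options = {"temperature": temperature, "top_p": top_p}
    if seed is not None:
        options["seed"] = seed  # only pin the seed when reproducibility is wanted
    data = {
        "model": model,
        "messages": [{"role": role, "content": prompt}],
        "options": options,
    }
    # Same encoding step as in query_model: JSON string, then UTF-8 bytes
    return json.dumps(data).encode("utf-8")


payload = build_chat_payload("What do Llamas eat?", seed=123)
print(json.loads(payload)["options"])
```

The returned bytes can be passed as the `data` argument of `urllib.request.Request`, in the same way as the `payload` variable inside `query_model`.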

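## Extracting Instructions

- Now, let's use the **clever trick** proposed in the paper: we provide the **empty prompt template** `"<|begin_of_text|><|start_header_id|>user<|end_header_id|>"`,
  which causes the **instruction-finetuned Llama 3 model** to **generate an instruction on its own**.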

In [4]:
def extract_instruction(text):
    # Return the first non-empty line of the model output, stripped of whitespace
    for content in text.split("\n"):
        content = content.strip()
        if content:
            return content

In [5]:
query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"

result = query_model(query, role="assistant")
instruction = extract_instruction(result)
print(instruction)

I am trying to find a way to make my child's birthday party more special and unique. What are some creative ideas you have?


- As shown above, the model did indeed **generate an instruction**.
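
To see why this prefix invites an instruction, it helps to look at how a complete user turn is laid out. The following is a minimal sketch, assuming the Llama 3 instruct prompt format in which a user turn starts with these header tokens and ends with `<|eot_id|>`; the helper name `format_llama3_turn` is hypothetical. The pre-query template is exactly the part of the prompt that precedes the user's text, so a model asked to continue from it tends to write what a user would say.

```python
# Assumed Llama 3 instruct chat layout; the helper name is hypothetical.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"


def format_llama3_turn(user_message):
    """Format a single user turn followed by the opening of the assistant turn."""
    return (
        f"{PRE_QUERY_TEMPLATE}\n\n{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


full_prompt = format_llama3_turn("What do llamas eat?")
# The pre-query template is the prefix that comes before any user text
print(full_prompt.startswith(PRE_QUERY_TEMPLATE))
```

Note that this only illustrates the layout; in the notebook the template string is sent through Ollama's chat endpoint, which applies its own prompt formatting around the message.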

## Generating the Response

- Next, we need to **generate the corresponding response**.
- This is done by **passing the instruction to the model as input**.

In [6]:
response = query_model(instruction, role="user")
print(response)

What an exciting question! I'd be delighted to help you come up with some creative and unique ideas to make your child's birthday party truly special!

Here are a few ideas to get you started:

1. **Themed Scavenger Hunt**: Plan a scavenger hunt based on the birthday child's favorite theme (e.g., superheroes, animals, or princesses). Hide clues and challenges throughout the party area, leading up to a final surprise.
2. **DIY Crafts Station**: Set up a craft station where kids can create their own party favors, such as customized t-shirts, crowns, or jewelry. This activity encourages creativity and makes for a memorable keepsake.
3. **Mystery Box Challenge**: Fill mystery boxes with different textures, smells, and sounds. Have the kids guess what's inside each box without looking. This game promotes problem-solving and teamwork.
4. **Indoor Camping Adventure**: Set up a cozy indoor "camping" area with sleeping bags, flashlights, and s'mores-making stations. Kids can enjoy a camping exp

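## Generating the Dataset

- We can **scale this approach** to an **arbitrary number of data samples** (adding **extra filtering** is recommended, for example **using another LLM to rate data quality** or **applying length limits**).
- **The example below** generates **5 synthetic instruction-response pairs**, which takes **about 3 minutes** on an **M3 MacBook Air**.
- **To build a dataset suitable for instruction finetuning**, consider scaling up to **1k-50k examples** and **running generation on a GPU** to speed it up.

**Tip**
- Changing `model="llama3"` to `model="llama3:70b"` can yield **higher-quality responses**, but **requires more computational resources**.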
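
The length-based filtering mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's method: the helper name `filter_entries`, the character thresholds, and the added duplicate check are all assumptions chosen for the example.

```python
def filter_entries(entries, min_chars=10, max_chars=2000):
    """Keep entries within length bounds and drop repeated instructions."""
    seen = set()
    kept = []
    for entry in entries:
        instruction = (entry.get("instruction") or "").strip()
        output = (entry.get("output") or "").strip()
        # Drop entries with a very short instruction or an empty response
        if len(instruction) < min_chars or not output:
            continue
        # Drop entries whose combined length exceeds the limit
        if len(instruction) + len(output) > max_chars:
            continue
        # Drop case-insensitive duplicates of an earlier instruction
        key = instruction.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append({"instruction": instruction, "output": output})
    return kept


sample = [
    {"instruction": "What do llamas eat?", "output": "Mostly grasses and hay."},
    {"instruction": "what do llamas eat?", "output": "Grass."},
    {"instruction": "Hi", "output": "Hello!"},
]
print(len(filter_entries(sample)))
```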

In [7]:
from tqdm import tqdm

dataset_size = 5
dataset = []

for i in tqdm(range(dataset_size)):

    result = query_model(query, role="assistant")
    instruction = extract_instruction(result)
    response = query_model(instruction, role="user")
    entry = {
        "instruction": instruction,
        "output": response
    }
    dataset.append(entry)

100%|█████████████████████████████████████████████| 5/5 [02:37<00:00, 31.41s/it]
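
When scaling the loop up, some calls may return no usable instruction. The following is a minimal sketch of a more defensive variant; the function name `generate_pairs`, the retry budget of three attempts per pair, and the stand-in `fake_query` are hypothetical. The query function is passed in as an argument, so `query_model` can be supplied in practice while the sketch itself runs without a server.

```python
def generate_pairs(query_fn, pre_query, num_pairs, max_attempts=None):
    """Collect instruction-response pairs, skipping attempts with no instruction."""
    if max_attempts is None:
        max_attempts = 3 * num_pairs  # assumed retry budget
    pairs = []
    for _ in range(max_attempts):
        if len(pairs) >= num_pairs:
            break
        raw = query_fn(pre_query, role="assistant")
        lines = [line.strip() for line in (raw or "").split("\n") if line.strip()]
        if not lines:
            continue  # nothing usable came back; try again
        instruction = lines[0]  # first non-empty line, as in extract_instruction
        response = query_fn(instruction, role="user")
        pairs.append({"instruction": instruction, "output": response})
    return pairs


# Stand-in for query_model so the sketch runs without a server
replies = iter(["", "Name a primary color.", "Red is a primary color."])


def fake_query(prompt, role="user"):
    return next(replies)


print(generate_pairs(fake_query, "<pre-query>", num_pairs=1))
```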


In [8]:
with open("instruction-data-llama3-7b.json", "w") as file:
    json.dump(dataset, file, indent=4)

In [9]:
!cat instruction-data-llama3-7b.json

[
    {
        "instruction": "What is the significance of the number 7 in various cultures and religions?",
        "output": "The number 7 has been a significant and recurring theme across many cultures and religions, often imbuing it with special meaning and symbolism. Here are some examples:\n\n1. **Numerology**: In numerology, the number 7 is considered sacred and mystical, associated with spiritual awakening, introspection, and enlightenment.\n2. **Judaism**: The Torah has seven days of creation, seven weeks in the wilderness, and seven years of rest (Sabbatical year). Seven is also a symbol of completion or perfection.\n3. **Christianity**: In Christianity, there are seven deadly sins, seven virtues, and seven sacraments. Jesus was said to have spoken seven sermons, and the number 7 appears in various biblical accounts, such as the seven days of creation and the seven angels who appear before God.\n4. **Islam**: In Islamic tradition, there are seven heavens, seven earths, and s