<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
<br>Chinese translation repository: <a href="https://github.com/GoatCsu/CN-LLMs-from-scratch.git">https://github.com/GoatCsu/CN-LLMs-from-scratch.git</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


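# Create "Passive Voice" Entries for an Instruction Dataset

- In this notebook, we use GPT-4 to create "passive voice" entries for an instruction dataset, as shown in the example below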

```python
{  
   'instruction': 'Identify the verb in the following sentence',
   'input': 'The cat sleeps on the couch.',
   'output': 'The verb in the sentence is "sleeps."',
   'output_2': 'The sentence is "sleeps."'   #  <---- newly created entry
}  
```

In [1]:
# pip install -r requirements-extra.txt

In [2]:
from importlib.metadata import version

pkgs = ["openai",  # OpenAI API
        "tqdm",    # Progress bar
       ]

for p in pkgs:
    print(f"{p} version: {version(p)}")

openai version: 1.30.3
tqdm version: 4.65.0


## Test OpenAI API

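- First, let's test whether the OpenAI API is set up correctly
- If you don't have an account yet, you need to create one at https://platform.openai.com/
- Note that you will also need to add funds to your account because the GPT-4 API is not free (see https://platform.openai.com/settings/organization/billing/overview)
- Creating the ~200 passive-voice entries with the code in this notebook costs about $0.13 (13 cents)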

- First, we need to provide our OpenAI API secret key, which can be found at https://platform.openai.com/api-keys
- Make sure not to share this key with anyone
- Add this key (`"sk-..."`) to the `config.json` file in this folder
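
The loading cell in this notebook reads the key from an `OPENAI_API_KEY` field, so `config.json` is expected to be a JSON object containing that field. As a minimal sketch, the check below rejects a missing key or an unreplaced `"sk-..."` placeholder before any API call is made. The helper name `load_api_key` and the demo key are hypothetical and only serve as an illustration; the demo writes to a temporary file, not to the real `config.json`.

```python
import json
import tempfile
from pathlib import Path


def load_api_key(config_path):
    """Hypothetical helper: read OPENAI_API_KEY from a JSON config file."""
    with open(config_path, "r") as f:
        config = json.load(f)
    key = config.get("OPENAI_API_KEY", "")
    # Reject a missing key or the unreplaced placeholder
    if not key or key == "sk-...":
        raise ValueError("Replace the placeholder in config.json with a real API key")
    return key


# Demonstration with a temporary file (the key below is made up)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"

    demo_path.write_text(json.dumps({"OPENAI_API_KEY": "sk-demo-not-a-real-key"}))
    demo_key = load_api_key(demo_path)

    demo_path.write_text(json.dumps({"OPENAI_API_KEY": "sk-..."}))
    try:
        load_api_key(demo_path)
        placeholder_rejected = False
    except ValueError:
        placeholder_rejected = True
```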

In [3]:
import json
from openai import OpenAI

# Load the API key from a JSON file.
# Make sure to replace "sk-..." with your actual API key from https://platform.openai.com/api-keys
with open("config.json", "r") as config_file:
    config = json.load(config_file)
    api_key = config["OPENAI_API_KEY"]

client = OpenAI(api_key=api_key)

- First, let's try the API with a simple example to make sure it works as intended:

In [4]:
def run_chatgpt(prompt, client, model="gpt-4-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content


# Prepare input
sentence = "I ate breakfast"
prompt = f"Convert the following sentence to passive voice: '{sentence}'"
run_chatgpt(prompt, client)

'Breakfast was eaten by me.'

## Create JSON Entries

- Next, we load the JSON file whose entries we want to modify

In [5]:
import json

json_file = "instruction-examples.json"

with open(json_file, "r") as file:
    json_data = json.load(file)
    
print("Number of entries:", len(json_data))

Number of entries: 200


- Before processing the whole file, we try the API on a small sample to make sure it works as intended

In [6]:
for entry in json_data[:5]:
    text = entry["output"]
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    
    print("\nInput:")
    print(">>", text)
    print("\nOutput:")
    print(">>", run_chatgpt(prompt, client))
    print("\n-------------------------")


Input:
>> The verb in the sentence is "sleeps."

Output:
>> The sentence is "sleeps."

-------------------------

Input:
>> The plural form of "goose" is "geese."

Output:
>> The plural form of "goose" is referred to as "geese."

-------------------------

Input:
>> The three primary colors are red, blue, and yellow.

Output:
>> Red, blue, and yellow are considered the three primary colors.

-------------------------

Input:
>> They had finished the game.

Output:
>> The game had been finished by them.

-------------------------

Input:
>> The abbreviation for "Doctor of Philosophy" is Ph.D.

Output:
>> The abbreviation "Ph.D." is used for "Doctor of Philosophy".

-------------------------


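- Next, we extend the code to store each generated response in `json_data` under a new `output_2` key, and add a progress bar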

In [7]:
from tqdm import tqdm  # a progress bar tool


for i, entry in tqdm(enumerate(json_data[:5]), total=len(json_data[:5])):
    text = entry["output"]
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    json_data[i]["output_2"] = run_chatgpt(prompt, client)

100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.23it/s]


- Let's check that the new entry (`output_2`) looks reasonable

In [8]:
json_data[0]

{'instruction': 'Identify the verb in the following sentence: The cat sleeps on the couch.',
 'input': '',
 'output': 'The verb in the sentence is "sleeps."',
 'output_2': 'The sentence is "sleeps."'}
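
Inspecting single entries by eye does not scale to the full dataset, so a simple automated check can complement it. The sketch below flags entries whose `output_2` is missing, empty, or identical to `output`. The function name `find_suspicious_entries` and the sample data are hypothetical. This is only a mechanical filter: it cannot tell whether a rewrite actually preserves the meaning, so manual spot checks remain useful.

```python
def find_suspicious_entries(entries):
    """Return indices of entries whose 'output_2' is missing, empty,
    or identical to 'output' (hypothetical helper for illustration)."""
    flagged = []
    for i, entry in enumerate(entries):
        new_text = entry.get("output_2", "")
        if not isinstance(new_text, str) or not new_text.strip():
            flagged.append(i)  # missing or empty response
        elif new_text.strip() == entry["output"].strip():
            flagged.append(i)  # text was returned unchanged
    return flagged


# Made-up sample in the same shape as json_data
sample = [
    {"output": "They had finished the game.", "output_2": "The game had been finished by them."},
    {"output": "I ate breakfast.", "output_2": ""},
    {"output": "I ate breakfast.", "output_2": "I ate breakfast."},
    {"output": "I ate breakfast."},
]
flagged = find_suspicious_entries(sample)
print(flagged)
```

In the notebook, the same function could be called as `find_suspicious_entries(json_data[:5])` after the five-entry run.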

- Finally, if everything above looks fine, we run the passive-voice conversion on the entire JSON dataset

In [9]:
for i, entry in tqdm(enumerate(json_data), total=len(json_data)):
    text = entry["output"]
    prompt = f"Without adding any response or explanation, convert the following text to passive voice: {text}"
    json_data[i]["output_2"] = run_chatgpt(prompt, client)

100%|██████████████████████████████████████████████████████████████████| 200/200 [03:43<00:00,  1.12s/it]
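
The full run takes several minutes and makes 200 paid API calls, so one failed request partway through would otherwise end the loop with an exception. As a sketch under stated assumptions (this is not the notebook's own method), the variant below skips entries that already have `output_2`, so a rerun resumes where it stopped, and it gives up after a few errors. The names `add_passive_entries`, `max_failures`, and `fake_generate` are hypothetical; a stub stands in for the API so the example runs offline.

```python
def add_passive_entries(entries, generate, max_failures=3):
    """Fill in 'output_2' for entries that lack it and return the failure count.

    `generate` is any callable that maps a prompt string to a response string.
    """
    failures = 0
    for entry in entries:
        if entry.get("output_2"):  # already converted in an earlier run
            continue
        prompt = (
            "Without adding any response or explanation, "
            f"convert the following text to passive voice: {entry['output']}"
        )
        try:
            entry["output_2"] = generate(prompt)
        except Exception as err:  # broad on purpose: continue after any API error
            failures += 1
            print(f"Skipping entry after error: {err}")
            if failures >= max_failures:
                break
    return failures


# Stub generator standing in for the API (assumption for the demo)
def fake_generate(prompt):
    if "fail" in prompt:
        raise RuntimeError("simulated API error")
    return "PASSIVE: " + prompt.split(": ", 1)[1]


demo_entries = [
    {"output": "They had finished the game."},
    {"output": "This will fail."},
    {"output": "I ate breakfast.", "output_2": "Breakfast was eaten by me."},
]
num_failures = add_passive_entries(demo_entries, fake_generate)
```

With the notebook's objects, `generate` could be `lambda p: run_chatgpt(p, client)`, and the `entries` iterable could be wrapped in `tqdm` to keep the progress bar.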


- After the conversion is complete, we save the modified dataset to a new JSON file

In [10]:
new_json_file = json_file.replace(".json", "-modified.json")


with open(new_json_file, "w") as file:
    json.dump(json_data, file, indent=4)  # "indent" for pretty-printing
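
To confirm that the save step worked, the file can be read back and compared with the in-memory data. The minimal sketch below uses made-up data and a temporary directory so that it does not touch the real files. It also shows the output filename that the `replace` call derives from `instruction-examples.json`.

```python
import json
import tempfile
from pathlib import Path

# Made-up data in the same shape as json_data
demo_data = [{"output": "I ate breakfast.", "output_2": "Breakfast was eaten by me."}]

json_file = "instruction-examples.json"
new_json_file = json_file.replace(".json", "-modified.json")

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / new_json_file

    # Write the data, then read it back
    with open(out_path, "w") as f:
        json.dump(demo_data, f, indent=4)
    with open(out_path, "r") as f:
        reloaded = json.load(f)

print(new_json_file)
print(reloaded == demo_data)
```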