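This notebook demonstrates how to fine-tune an OpenAI model. We use a small dataset of internet slang, such as phrases that use "六" ("six", slang for "awesome") as praise, alongside neutral sentences. The model is trained on this data to classify the sentiment of input text as positive or neutral.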

Note that the dataset contains no examples for the negative sentiment class, so the model is not trained on that class.
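The contents of `dataset.jsonl` are not shown in this notebook. As a rough sketch, assuming the prompt-completion JSONL layout that the data preparation tool reports on (prompts ending in ` ->`, completions starting with a space), records could be built and checked as follows. The sentences, the `check_record` and `to_jsonl` helpers, and the `ALLOWED_LABELS` set are illustrative placeholders, not the actual data or part of any library.

```python
import json

# Hypothetical records: placeholders standing in for the real dataset rows.
records = [
    {"prompt": "<a sentence praising something with slang> ->", "completion": " positive"},
    {"prompt": "<a plain, neutral sentence> ->", "completion": " neutral"},
]

ALLOWED_LABELS = {"positive", "neutral"}  # no "negative" class in this dataset


def check_record(row):
    """Return a list of formatting problems found in one prompt-completion pair."""
    problems = []
    if not row["prompt"].endswith(" ->"):
        problems.append("prompt should end with ' ->'")
    if not row["completion"].startswith(" "):
        problems.append("completion should start with a space")
    if row["completion"].strip() not in ALLOWED_LABELS:
        problems.append("unknown label")
    return problems


def to_jsonl(rows):
    """Serialize rows as one JSON object per line, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)


for r in records:
    assert check_record(r) == []
print(to_jsonl(records))
```

Passing `ensure_ascii=False` keeps Chinese text human-readable in the file instead of writing `\u` escapes.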

In [1]:
%pip install openai > /dev/null


[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


1. Preprocess the dataset

In [35]:
!openai tools fine_tunes.prepare_data -f dataset.jsonl -q

Analyzing...

- Your file contains 50 prompt-completion pairs. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examples
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 1 duplicated prompt-completion sets. These are rows: [42]
- All prompts end with suffix ` ->`
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove 1 duplicate rows [Y/n]: Y
- [Recom

2. Upload the dataset

In [36]:
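import os
import openai

# Read the API key from the environment rather than hard-coding it.
openai.api_key = os.getenv("OPENAI_API_KEY")

# Upload the prepared training file; the context manager closes it afterwards.
with open("dataset_prepared_train.jsonl", "rb") as f:
    file = openai.File.create(file=f, purpose="fine-tune")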

file_id = file.id
print(file_id)

file-SLFEpoSmDfDBD0Sm3sauR70n


3. Fine-tune the model

In [37]:
fine_tune = openai.FineTune.create(training_file=file_id, model='ada')
fine_tune_id = fine_tune.id

print(fine_tune_id)

ft-uO0lbxZVELPy3kasNtMOXd05


4. Check the fine-tuning job status

In [38]:
status = openai.FineTune.retrieve(id=fine_tune_id)

print(f"ID: {status.id}")
print(f"Status: {status.status}")
print(f"Model: {status.model}")
print(f"Fine tuned model: {status.fine_tuned_model}")

ID: ft-uO0lbxZVELPy3kasNtMOXd05
Status: pending
Model: ada
Fine tuned model: None
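The job above is still `pending`, and `fine_tuned_model` stays `None` until the job finishes, so in practice the status check has to be repeated. A minimal polling sketch follows. The `wait_for_fine_tune` helper, its parameters, and the set of terminal status names are assumptions made for illustration; they are not part of the `openai` package, and the status names should be checked against the API version in use.

```python
import time

# Assumed terminal states of a fine-tune job; verify against your API version.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_fine_tune(fetch_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll until the job reaches a terminal status or max_polls is exhausted.

    fetch_status is a zero-argument callable returning an object with
    .status and .fine_tuned_model attributes, such as the retrieve call above.
    """
    for _ in range(max_polls):
        job = fetch_status()
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError("fine-tune job did not finish within the polling budget")


# Usage in the notebook (not run here):
# job = wait_for_fine_tune(lambda: openai.FineTune.retrieve(id=fine_tune_id))
# print(job.status, job.fine_tuned_model)
```

Taking the status call as a parameter keeps the loop independent of the client library, so it can be exercised with a stub before spending API calls.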


5. Query (classification prompt)

In [51]:
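response = openai.Completion.create(
  # Name of the fine-tuned model (presumably the job's fine_tuned_model once it has finished).
  model="ada:ft-personal-2023-04-24-21-32-47",
  # Roughly: "Haven't heard guitar playing this awesome in a long time".
  # The prompt ends with the same " ->" suffix as the training prompts.
  prompt="这么六的吉他弹奏许久不见 ->",
  temperature=0.7,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0,
  stop=["END"]
)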

In [52]:
print(response)

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " neutral "
    }
  ],
  "created": 1682376242,
  "id": "cmpl-78zCcTU38aYxWaE4mUsbDT7DpXPAJ",
  "model": "ada:ft-personal-2023-04-24-21-32-47",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 2,
    "prompt_tokens": 25,
    "total_tokens": 27
  }
}
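The predicted label comes back in `choices[0].text` padded with whitespace (`" neutral "` above), so it usually needs stripping before use. A small sketch of that step, using a plain dictionary trimmed from the printed response; `extract_label` is a hypothetical helper, and it assumes the response object supports dictionary-style access as the printed JSON suggests.

```python
def extract_label(response, allowed=("positive", "neutral")):
    """Return the stripped label from the first choice, or None if it is unexpected."""
    text = response["choices"][0]["text"].strip()
    return text if text in allowed else None


# Trimmed copy of the response printed above.
sample = {
    "choices": [
        {"finish_reason": "stop", "index": 0, "logprobs": None, "text": " neutral "}
    ]
}
print(extract_label(sample))  # neutral
```

Returning `None` for anything outside the known labels makes off-label completions easy to spot, which matters here because the model was trained on only 50 examples.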
