## Build the story dataset (note: the code below is slow and consumes a lot of tokens, so you may skip running it)

In [27]:
import os

import backoff
import openai
import pandas as pd

openai.api_key = os.getenv("OPENAI_API_KEY")
dynasties = ['唐', '宋', '元', '明', '清', '汉', '魏', '晋', '南北朝']
super_powers = ['隐形', '飞行', '读心术', '瞬间移动', '不死之身', '喷火']
story_types = ['轻松', '努力', '艰难']

# Retry with exponential backoff whenever the API reports a rate limit.
@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def gpt35(prompt, max_tokens=2048, temperature=0.5, top_p=1, frequency_penalty=0, presence_penalty=0):
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty)
    # Only the text of the first (and only) choice is needed.
    return response["choices"][0]["text"]

def prepare_stories(dynasties, super_powers, story_types, repeat=3, output_file="data/ultraman_stories.csv"):
    # One API call per (dynasty, super power, story type) combination, repeated `repeat` times.
    rows = []
    for dynasty in dynasties:
        for super_power in super_powers:
            for story_type in story_types:
                for _ in range(repeat):
                    prompt = f"""请你用中文写一段300字的故事，情节跌宕起伏，讲述一位{dynasty}朝时期的英雄人物，穿越到现代，拥有了{super_power}这样的超能力，通过{story_type}的战斗，帮助奥特曼一起打败了怪兽的故事。"""
                    story = gpt35(prompt)
                    rows.append({"dynasty": dynasty, "super_power": super_power, "story_type": story_type, "story": story})

    # Build the DataFrame once instead of concatenating inside the loop.
    df = pd.DataFrame(rows)
    df.to_csv(output_file)
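
The `@backoff.on_exception(backoff.expo, ...)` decorator used above retries the API call with growing waits when a rate limit is hit. As a rough illustration of the idea only (this is not how the `backoff` library is implemented; `retry_with_expo`, `FlakyError`, and the delay values are hypothetical names and numbers chosen for the sketch), a hand-rolled version might look like this:

```python
import time


class FlakyError(Exception):
    """Hypothetical stand-in for a rate-limit error."""


def retry_with_expo(func, max_tries=5, base_delay=0.01, sleep=time.sleep):
    """Call func(); on FlakyError wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_tries):
        try:
            return func()
        except FlakyError:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Usage: a function that fails twice before it succeeds.
calls = {"n": 0}


def sometimes_fails():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FlakyError("rate limited")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry_with_expo(sometimes_fails, sleep=delays.append)
print(result, delays)
```

Passing `sleep=delays.append` keeps the example instant while still showing that each wait is twice the previous one.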

In [None]:
prepare_stories(dynasties, super_powers, story_types)
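
To see why this cell is slow and token-hungry, it helps to count the requests it makes. The sketch below reuses the three lists and the default `repeat=3` from above; the token figure is only an upper bound derived from `max_tokens=2048`, not a measured cost.

```python
from itertools import product

dynasties = ['唐', '宋', '元', '明', '清', '汉', '魏', '晋', '南北朝']
super_powers = ['隐形', '飞行', '读心术', '瞬间移动', '不死之身', '喷火']
story_types = ['轻松', '努力', '艰难']
repeat = 3
max_tokens = 2048  # default completion limit of gpt35() above

# Every combination is requested `repeat` times, one API call each.
combinations = list(product(dynasties, super_powers, story_types))
total_calls = len(combinations) * repeat
max_completion_tokens = total_calls * max_tokens

print(len(combinations), total_calls, max_completion_tokens)  # 162 486 995328
```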


## Read the CSV and build the fine-tuning data

In [11]:
import os

import backoff
import openai
import pandas as pd

openai.api_key = os.getenv("OPENAI_API_KEY")

df = pd.read_csv("data/ultraman_stories.csv")
df['sub_prompt'] = df['dynasty'] + "," + df['super_power'] + "," + df['story_type']
prepared_data = df.loc[:,['sub_prompt','story']]
prepared_data.rename(columns={'sub_prompt':'prompt', 'story':'completion'}, inplace=True)
prepared_data.to_csv('data/prepared_data.csv',index=False)

In [12]:
import subprocess

subprocess.run('openai tools fine_tunes.prepare_data --file data/prepared_data.csv --quiet'.split())

Analyzing...

- Based on your file extension, your file is formatted as a CSV file
- Your file contains 464 prompt-completion pairs
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples.
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. Se

CompletedProcess(args=['openai', 'tools', 'fine_tunes.prepare_data', '--file', 'data/prepared_data.csv', '--quiet'], returncode=0)
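
The tool's suggestions above concern three things: a separator at the end of each prompt, a leading space on each completion, and a common ending on each completion. The sketch below applies them by hand to one pair. The separator ` ->\n` matches the prompts sent to the fine-tuned model later in this notebook; the `.` ending is an assumption inferred from the `stop=["."]` setting used at inference time; `to_record` is a hypothetical helper, not part of the `openai` CLI, and the sample completion is a placeholder.

```python
import json

SEPARATOR = " ->\n"  # matches the prompts used at inference time
ENDING = "."         # assumption: inferred from stop=["."] at inference time


def to_record(prompt, completion):
    """Format one prompt-completion pair as a single JSONL line."""
    return json.dumps(
        {
            "prompt": prompt + SEPARATOR,
            "completion": " " + completion.strip() + ENDING,
        },
        ensure_ascii=False,  # keep the Chinese text readable in the file
    )


# Placeholder story text, for illustration only.
line = to_record("宋,飞行,艰难", "宋朝时期，有一位英雄……")
print(line)
```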

## Fine-tune the model

In [None]:
subprocess.run('openai api fine_tunes.create --training_file data/prepared_data_prepared.jsonl --model curie --suffix "ultraman"'.split())

In [14]:
subprocess.run('openai api fine_tunes.list'.split())

{
  "data": [
    {
      "created_at": 1680576711,
      "fine_tuned_model": "curie:ft-bothub-ai:ultraman-2023-04-04-03-03-26",
      "hyperparams": {
        "batch_size": 1,
        "learning_rate_multiplier": 0.2,
        "n_epochs": 4,
        "prompt_loss_weight": 0.01
      },
      "id": "ft-3oxkr1zBVB4fJWogJDDjQbr0",
      "model": "curie",
      "object": "fine-tune",
      "organization_id": "org-yQaCtAdY0voCWSs0QNqSLfda",
      "result_files": [
        {
          "bytes": 107785,
          "created_at": 1680577408,
          "filename": "compiled_results.csv",
          "id": "file-LuSjYlkMa6fHHRB23bnr3L1z",
          "object": "file",
          "purpose": "fine-tune-results",
          "status": "processed",
          "status_details": null
        }
      ],
      "status": "succeeded",
      "training_files": [
        {
          "bytes": 446199,
          "created_at": 1680576711,
          "filename": "data/prepared_data_prepared.jsonl",
          "id": "file-yn0Bfn

CompletedProcess(args=['openai', 'api', 'fine_tunes.list'], returncode=0)
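
The listing is JSON, so the fine-tuned model name can be picked out programmatically instead of copied by hand. The sketch below works on a trimmed-down payload that reuses the field names shown above; the second entry is made up for illustration, and `succeeded_models` is a hypothetical helper.

```python
import json

# Trimmed-down payload with the field names shown above; the second entry is made up.
sample = """
{
  "data": [
    {"id": "ft-3oxkr1zBVB4fJWogJDDjQbr0", "status": "succeeded",
     "fine_tuned_model": "curie:ft-bothub-ai:ultraman-2023-04-04-03-03-26"},
    {"id": "ft-hypothetical", "status": "pending", "fine_tuned_model": null}
  ]
}
"""


def succeeded_models(payload):
    """Return the names of fine-tuned models whose job has succeeded."""
    jobs = json.loads(payload)["data"]
    return [job["fine_tuned_model"] for job in jobs if job["status"] == "succeeded"]


models = succeeded_models(sample)
print(models)
```

To feed it real output, the `subprocess.run` call would need `capture_output=True, text=True` so that the listing is available as `stdout` rather than only printed.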

## Try out the fine-tuned model

In [20]:
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

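def write_a_story(prompt):
    # The prompt must end with the same " ->\n" separator used in the training data.
    response = openai.Completion.create(
        model="curie:ft-bothub-ai:ultraman-2023-04-04-03-03-26",
        prompt=prompt,
        temperature=0.7,
        max_tokens=2000,
        top_p=1,
        # Stop sequence; presumably the ending appended to each training completion.
        stop=["."])
    return response["choices"][0]["text"]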

story = write_a_story("宋,发射激光,艰难 ->\n")
print(story)




宋朝时期，有一位英雄人物，名叫孔宣，他拥有超凡的智慧和勇气，曾经帮助宋朝抵御外敌的侵略，受到了国人的尊敬和赞誉。

一次，孔宣在抵御外敌侵略的过程中，不幸被一枚神秘的弩箭击中，瞬间就穿越到了现代，他发现自己拥有了发射激光的超能力，可以抵抗任何攻击。

孔宣把自己的超能力当做武器，参加了一场反恐战斗，他使用自己的激光击败了怪兽，为着反抗怪兽而战斗，他受到了全国人民的尊敬和赞誉。

孔宣最终回到了宋朝，他以自己的英勇事迹，激励着人们勇敢地面对抗敌的挑战，保护了宋朝的安定。


In [21]:

story = write_a_story("秦,龙卷风,辛苦 ->\n")
print(story)


曾经有一位叫苏轼的英雄人物，他曾经英勇地抵抗过许多强大的敌人，拯救了许多被危险封印的百姓。他曾经在一次战争中发挥过自己的作用，赢得了许多胜利，被尊为英雄。

然而，苏轼却在一次激烈的战斗中牺牲了，他的灵魂被封印在一个古老的石头里，隔着一层玻璃，一直沉睡了几百年。

苏轼的灵魂在穿越时空，来到了现代，他发现自己拥有了一种超能力，这就是龙卷风，他可以使自己的身体具有超强的力量，甚至可以抵抗恶魔的攻击。

苏轼在现代的世界里，发现了一种可怕的怪兽，它们正在摧毁着人类的家园，苏轼决定要拯救这个世界，于是他和奥特曼一起出发，开始了一场史诗般的战斗。

在苏轼和奥特曼的帮助下，苏轼利用自己的超能力，一次次击退怪兽的攻击，最终他们成功地打败了怪兽，拯救了人类。

苏轼的事迹在这里传唱了很久，他成为了一位永恒的英雄，他的故事也被传唱了下来，让人们永远不会忘记他的英勇事迹。
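
The two prompts above differ in how much of their input the model saw during the first round of training: the first introduces only a new super power, while the second is built entirely from values outside the original lists. A small check like the following makes that explicit (`unseen_fields` is a hypothetical helper; the lists are the ones defined at the top of the notebook):

```python
# Values used to generate the first round of training data.
TRAINED = {
    "dynasty": ['唐', '宋', '元', '明', '清', '汉', '魏', '晋', '南北朝'],
    "super_power": ['隐形', '飞行', '读心术', '瞬间移动', '不死之身', '喷火'],
    "story_type": ['轻松', '努力', '艰难'],
}


def unseen_fields(prompt):
    """Return the comma-separated prompt values that are not in the training lists."""
    values = prompt.split(" ->")[0].split(",")
    return [value for value, known in zip(values, TRAINED.values()) if value not in known]


print(unseen_fields("宋,发射激光,艰难 ->\n"))
print(unseen_fields("秦,龙卷风,辛苦 ->\n"))
```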


In [24]:
subprocess.run('openai api fine_tunes.results -i ft-3oxkr1zBVB4fJWogJDDjQbr0'.split())


step,elapsed_tokens,elapsed_examples,training_loss,training_sequence_accuracy,training_token_accuracy
1,625,1,0.8805545861742778,0.0,0.75
2,1258,2,0.8059815050491868,0.0,0.7766830870279147
3,1859,3,0.7964038042175758,0.0,0.7862068965517242
4,2548,4,0.805052303553852,0.0,0.7774436090225564
5,3197,5,0.7503930440556053,0.0,0.7808
6,3846,6,0.7992317049403261,0.0,0.7770700636942676
7,4775,7,0.6649006477473822,0.0,0.7927232635060639
8,5432,8,0.6493354803676822,0.0,0.8049921996879875
9,6265,9,0.6568901059838095,0.0,0.802937576499388
10,7122,10,0.6578856167468091,0.0,0.8100358422939068
11,7827,11,0.5687322367928961,0.0,0.8279411764705882
12,8404,12,0.6334827334911788,0.0,0.8172043010752689
13,9061,13,0.5771709139683721,0.0,0.825
14,9822,14,0.6079089517825593,0.0,0.8100407055630936
15,10399,15,0.6481047367374327,0.0,0.8154121863799283
16,11208,16,0.5528688982071029,0.0,0.8352490421455939
17,11913,17,0.6525803676480848,0.0,0.8093841642228738
18,12546,18,0.5230526420679229,0.0,0.8363047001620746


CompletedProcess(args=['openai', 'api', 'fine_tunes.results', '-i', 'ft-3oxkr1zBVB4fJWogJDDjQbr0'], returncode=0)
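
The results come back as CSV, so the training trend can be summarized with the standard library. The sketch below copies only the first three and last three rows shown above and compares their mean `training_loss`; it is a quick sanity check, not a full evaluation.

```python
import csv
import io

# First three and last three rows of the results shown above.
results_csv = """step,elapsed_tokens,elapsed_examples,training_loss,training_sequence_accuracy,training_token_accuracy
1,625,1,0.8805545861742778,0.0,0.75
2,1258,2,0.8059815050491868,0.0,0.7766830870279147
3,1859,3,0.7964038042175758,0.0,0.7862068965517242
16,11208,16,0.5528688982071029,0.0,0.8352490421455939
17,11913,17,0.6525803676480848,0.0,0.8093841642228738
18,12546,18,0.5230526420679229,0.0,0.8363047001620746
"""

rows = list(csv.DictReader(io.StringIO(results_csv)))
losses = [float(row["training_loss"]) for row in rows]

early = sum(losses[:3]) / 3   # mean loss over steps 1-3
late = sum(losses[-3:]) / 3   # mean loss over steps 16-18
print(f"early={early:.3f} late={late:.3f}")
```

On these rows the mean loss falls from about 0.83 to about 0.58, which is consistent with the rising token accuracy in the last column.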

## Build more fine-tuning data (optional to run)

In [28]:
dynasties = ['秦', '五代', '隋']
super_powers = ['龙卷风', '冰冻大海', '流星火雨']
story_types = ['轻松', '努力', '艰难', '勇敢', '辛苦']

new_stories = "data/ultraman_stories_more.csv"
prepare_stories(dynasties, super_powers, story_types, repeat=3, output_file=new_stories)


In [29]:
df = pd.read_csv(new_stories)
df['sub_prompt'] = df['dynasty'] + "," + df['super_power'] + "," + df['story_type']
prepared_data = df.loc[:,['sub_prompt','story']]
prepared_data.rename(columns={'sub_prompt':'prompt', 'story':'completion'}, inplace=True)
new_stories_prepared = 'data/prepared_data_more.csv'
prepared_data.to_csv(new_stories_prepared, index=False)

# Reuse the path variable defined above instead of repeating the file name.
subprocess.run(['openai', 'tools', 'fine_tunes.prepare_data', '--file', new_stories_prepared, '--quiet'])

Analyzing...

- Based on your file extension, your file is formatted as a CSV file
- Your file contains 135 prompt-completion pairs
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples.
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. Se

CompletedProcess(args=['openai', 'tools', 'fine_tunes.prepare_data', '--file', 'data/prepared_data_more.csv', '--quiet'], returncode=0)

## Continue fine-tuning

In [38]:
subprocess.run('openai api fine_tunes.create --training_file data/prepared_data_more_prepared.jsonl --model curie:ft-bothub-ai:ultraman-2023-04-04-03-03-26 --suffix "ultraman" --learning_rate_multiplier 2'.split())

Upload progress: 100%|██████████| 128k/128k [00:00<00:00, 67.6Mit/s]


Uploaded file from data/prepared_data_more_prepared.jsonl: file-Koi8mPKvDINOnBnI03J9iXnS
Created fine-tune: ft-CFLSbHbKvg5BEspBy3aQHSbm
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-04-05 12:57:57] Created fine-tune: ft-CFLSbHbKvg5BEspBy3aQHSbm
[2023-04-05 12:58:41] Fine-tune costs $1.05
[2023-04-05 12:58:41] Fine-tune enqueued. Queue number: 0
[2023-04-05 12:58:43] Fine-tune started



CompletedProcess(args=['openai', 'api', 'fine_tunes.create', '--training_file', 'data/prepared_data_more_prepared.jsonl', '--model', 'curie:ft-bothub-ai:ultraman-2023-04-04-03-03-26', '--suffix', '"ultraman"', '--learning_rate_multiplier', '2'], returncode=0)
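
The `args` echoed above show the suffix being passed as `'"ultraman"'`, quotes included, because `str.split()` splits on whitespace without interpreting shell-style quoting. A minimal comparison with the standard library's `shlex.split`, which does honor quotes (the command string is shortened for the example):

```python
import shlex

# Shortened version of the command above, for illustration.
command = 'openai api fine_tunes.create --suffix "ultraman" --learning_rate_multiplier 2'

naive_args = command.split()       # keeps the literal double quotes
shell_args = shlex.split(command)  # removes them, as a shell would

print(naive_args[4])  # "ultraman"
print(shell_args[4])  # ultraman
```

Building the argument list explicitly, or using `shlex.split`, avoids sending stray quote characters to the CLI.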

In [41]:
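# Note: write_a_story, as defined earlier, still targets the first fine-tuned model.
# To evaluate the second round, point it at the new model name reported by fine_tunes.list.
fine_tuned = write_a_story("五代,流星火雨,艰难 ->\n")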
print(fine_tuned)


这是一个发生在一个古老的世界，一个叫做“六代”的世界。这个世界有着一种叫做“超能力”的特性，可以让人穿越时空，穿越到现代。

一位叫做“英雄”的人物，他来自于六代，但他拥有了一种叫做“流星火雨”的超能力，他可以把自己的身体变成一个火焰，然后穿越时空，来到现代。

他来到现代，发现这个世界变得越来越危险，有一种叫做“怪兽”的存在，他们想要毁灭这个世界。英雄决定帮助奥特曼一起打败怪兽，于是他们开始了一场激烈的战斗。

英雄凭借着自己的超能力，以及奥特曼的力量，战胜了怪兽，拯救了这个世界。最后，英雄又一次穿越回六代，这次他拥有了一种叫做“流星火雨”的超能力，他可以把自己的身体变成一个火焰，然后穿越时空，拯救又一次六代。


In [None]:
subprocess.run('openai api fine_tunes.list'.split())


## Test streaming generation

In [44]:

def write_a_story_by_stream(prompt):
    response = openai.Completion.create(
        model="curie:ft-bothub-ai:ultraman-2023-04-04-03-03-26",
        prompt=prompt,
        temperature=0.7,
        max_tokens=2000,
        stream=True,
        top_p=1,
        stop=["."])
    return response

response = write_a_story_by_stream("汉,冰冻大海,艰难 ->\n")

for event in response:
    event_text = event['choices'][0]['text']
    print(event_text, end='')


一位叫李英的汉朝时期的英雄人物，穿越到了现代，拥有了一种超能力，可以把自己的身体冰冻到极限，他发现自己可以拥有超越情感的力量，可以把任何人都冻僵，他也发现自己可以控制全局，可以控制时间，可以控制物质，可以控制情景，他发现自己可以控制一切，他变得更加强大。

李英发现，地球正面临着一个叫做怪兽的强大敌人的威胁，他决定去帮助奥特曼一起打败怪兽。于是，他和奥特曼一起开始了一系列的战斗，他们一起抵抗着怪兽的攻击，最终，他们成功地消灭了怪兽，拯救了地球。

李英受到了所有人的赞赏，他也成为了一个英雄，他的事迹被传颂了几百年，他的故事也被记录在历史书中，他也成为了一个永恒的传奇。
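
The streaming loop above prints each chunk as it arrives but does not keep the full text. The sketch below shows one way to do both. It runs without an API call: `fake_stream` is a hypothetical stand-in that yields events shaped like the ones read above (`event['choices'][0]['text']`), and `collect` is likewise a name made up for this example.

```python
def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in for a streaming response: yields events chunk by chunk."""
    for start in range(0, len(text), chunk_size):
        yield {"choices": [{"text": text[start:start + chunk_size]}]}


def collect(stream):
    """Print each chunk as soon as it arrives and return the assembled text."""
    parts = []
    for event in stream:
        chunk = event["choices"][0]["text"]
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = collect(fake_stream("李英和奥特曼一起打败了怪兽"))
```

With a real response, the same `collect` function could be applied to the object returned by `write_a_story_by_stream`, assuming its events keep the shape used above.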