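# Fine-tuning a classifier with the `openai` API

We fine-tune `ada` as a classifier, then use the fine-tuned model to predict the class of a piece of text and inspect the probability of each class.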

### Install the `openai` package from the command line

In [None]:
!pip install openai # install the OpenAI Python package

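### Load the dataset

This notebook uses the 20 newsgroups dataset from sklearn, a news classification dataset with 20 categories and several hundred posts per category. We use only two of those categories, baseball and hockey. `fetch_20newsgroups` returns the posts as a list in which each element is the text of one post.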

In [2]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai

categories = ['rec.sport.baseball', 'rec.sport.hockey']
sports_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)

### Inspect the data
First look at what the data looks like. Here we print only the example at index 1.

In [4]:
print(sports_dataset['data'][1])

From: gld@cunixb.cc.columbia.edu (Gary L Dare)
Subject: Re: Flames Truly Brutal in Loss
Nntp-Posting-Host: cunixb.cc.columbia.edu
Reply-To: gld@cunixb.cc.columbia.edu (Gary L Dare)
Organization: PhDs In The Hall
Distribution: na
Lines: 13


This game would have been great as part of a double-header on ABC or
ESPN; the league would have been able to push back-to-back wins by
Le Magnifique and The Great One.  Unfortunately, the only network
that would have done that was SCA, seen in few areas and hard to
justify as a pay channel. )-;

gld
--
~~~~~~~~~~~~~~~~~~~~~~~~ Je me souviens ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gary L. Dare
> gld@columbia.EDU 			GO  Winnipeg Jets  GO!!!
> gld@cunixc.BITNET			Selanne + Domi ==> Stanley



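### Check the category
Note that this cell looks up the label of example 0, not the example at index 1 printed above.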

In [5]:
sports_dataset.target_names[sports_dataset['target'][0]]


'rec.sport.baseball'

### Print the number of examples in each category
We will use the text from these two categories for text classification.

In [7]:
len_all, len_baseball, len_hockey = len(sports_dataset.data), len([e for e in sports_dataset.target if e == 0]), len([e for e in sports_dataset.target if e == 1])
print(f"Total examples: {len_all}, Baseball examples: {len_baseball}, Hockey examples: {len_hockey}")

Total examples: 1197, Baseball examples: 597, Hockey examples: 600


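### Data preparation
Convert the dataset into a `pandas` `DataFrame` with one column for the prompt and one for the completion. The prompt holds the email from the mailing list, and the completion is the name of the sport, either hockey or baseball. The `[:300]` slice, which would keep only 300 examples for a faster demonstration, is commented out in the cell below, so all 1197 examples are used. In practice, more examples generally give better performance.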

In [8]:
import pandas as pd

labels = [sports_dataset.target_names[x].split('.')[-1] for x in sports_dataset['target']]
texts = [text.strip() for text in sports_dataset['data']]
df = pd.DataFrame(zip(texts, labels), columns = ['prompt','completion']) #[:300]
df.head()

| | prompt | completion |
|---|---|---|
| 0 | `From: dougb@comm.mot.com (Doug Bank)\nSubject:...` | baseball |
| 1 | `From: gld@cunixb.cc.columbia.edu (Gary L Dare)...` | hockey |
| 2 | `From: rudy@netcom.com (Rudy Wade)\nSubject: Re...` | baseball |
| 3 | `From: monack@helium.gas.uug.arizona.edu (david...` | hockey |
| 4 | `Subject: Let it be Known\nFrom: <ISSBTL@BYUVM....` | baseball |


Both "baseball" and "hockey" are single tokens. The data must be saved in `jsonl` format for fine-tuning.

In [9]:
df.to_json("sport2.jsonl", orient='records', lines=True)
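
To make the `jsonl` layout concrete, here is a minimal sketch using only the standard library. The two records are hypothetical stand-ins for rows of the DataFrame, not real entries from the dataset.

```python
import json

# Hypothetical records standing in for rows of the DataFrame above.
records = [
    {"prompt": "From: someone@example.com\nSubject: Re: last night's game", "completion": "hockey"},
    {"prompt": "From: other@example.com\nSubject: pitching rotation", "completion": "baseball"},
]

# JSONL: one JSON object per line. Newlines inside a value are escaped as \n,
# so each record still occupies exactly one line of the file.
jsonl_text = "\n".join(json.dumps(r) for r in records)
print(jsonl_text)

# Reading it back: every line is parsed independently.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```

Each line can be parsed on its own, which is why this format suits a dataset with one example per row.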

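### Prepare the data with the `openai` tools
We use a data preparation tool that suggests a few improvements to the dataset before fine-tuning. Before launching the tool we update the `openai` library to make sure we are using the latest version of the tool. We also pass `-q`, which automatically accepts all suggestions.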

In [10]:
!pip install --upgrade openai



In [11]:
!openai tools fine_tunes.prepare_data -f sport2.jsonl -q

Analyzing...

- Your file contains 1197 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 11 examples that are very long. These are rows: [134, 200, 281, 320, 404, 595, 704, 838, 1113, 1139, 1174]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts e

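The tool makes the following changes to the dataset.

It adds a suffix between the prompt and the completion that tells the model the input text has ended and the class should now be predicted. Because the same separator is used in every example, the model learns that it should predict baseball or hockey after the separator.

It adds a space at the start of each completion, because most word tokens carry a leading space.

It also recognizes that this is probably a classification task and suggests splitting the dataset into a training set and a validation set. This lets us easily measure the expected performance on new data.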
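
The three changes described above can be sketched in plain Python. This is an illustration, not the tool's actual implementation: the `prepare` helper, the 20% validation fraction and the seed are assumptions, and the separator is the one used later in this notebook.

```python
import random

SEPARATOR = "\n\n###\n\n"  # same separator as used at prediction time in this notebook


def prepare(records, valid_fraction=0.2, seed=42):
    """Hypothetical helper: add the separator and leading space, then split.

    valid_fraction and seed are assumptions, not values taken from the tool.
    """
    prepared = [
        {"prompt": r["prompt"] + SEPARATOR, "completion": " " + r["completion"]}
        for r in records
    ]
    rng = random.Random(seed)
    rng.shuffle(prepared)
    n_valid = max(1, int(len(prepared) * valid_fraction))
    # Return (training set, validation set).
    return prepared[n_valid:], prepared[:n_valid]


# Ten made-up examples, alternating between the two labels.
examples = [
    {"prompt": f"post {i}", "completion": "hockey" if i % 2 else "baseball"}
    for i in range(10)
]
train, valid = prepare(examples)
```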

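### Start fine-tuning the model
The tool suggests running the following command to train on the dataset. Since this is a classification task, we want to know how well the model generalizes to the provided validation set for our classification use case. The tool recommends adding `--compute_classification_metrics --classification_positive_class " baseball"` to compute classification metrics. Note the leading space in `" baseball"`, which matches the prepared completions.

We can copy the suggested command directly from the command-line tool. We specifically add `-m ada` to fine-tune the cheaper and faster ada model, whose performance on classification use cases is usually comparable to that of slower, more expensive models.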

In [12]:
!openai -k "$OPENAI_API_KEY" api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball" -m ada

Found potentially duplicated files with name 'sport2_prepared_train.jsonl', purpose 'fine-tune' and size 1519036 bytes
file-iLrr4IHweg2sCCRcYNcBbMtg
file-L3fySLiYqXhYDjWEfBfuTmmI
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: ^C



### Track fine-tuning progress
Use `fine_tunes.follow` with the fine-tune job ID to track the progress of the job.

In [None]:
!openai -k "$OPENAI_API_KEY" api fine_tunes.follow -i ft-NfED2Hc63L993EJVWE4XIXXI

### List all fine-tunes
Use `curl` to list every fine-tune job that has been run so far.

In [None]:
!curl https://api.openai.com/v1/fine-tunes -H "Authorization: Bearer $OPENAI_API_KEY"

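### Run `follow` again to find the model name
Once the job has finished, the output includes the name of the fine-tuned model: `ada:ft-pytensor-2023-05-07-08-43-31`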

In [None]:
!openai -k "$OPENAI_API_KEY" api fine_tunes.follow -i ft-NfED2Hc63L993EJVWE4XIXXI

### View the training and validation results
Download the training and validation results.

In [None]:
!openai -k "$OPENAI_API_KEY" api fine_tunes.results -i ft-NfED2Hc63L993EJVWE4XIXXI > result.csv

In [None]:
results = pd.read_csv('result.csv')
results[results['classification/accuracy'].notnull()].tail(1)

### The result is 99.6% accuracy; plot the accuracy curve

In [None]:
results[results['classification/accuracy'].notnull()]['classification/accuracy'].plot()

## Using the model
Use the validation set to see how well the model performs.

In [None]:
test = pd.read_json('sport2_prepared_valid.jsonl', lines=True)
test.head()

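### Add the separator
The prompt must be followed by the same separator that was used during fine-tuning: `\n\n###\n\n`. Since we care about classification, we want `temperature` to be as low as possible, and we need only a single completion token to determine the model's prediction.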

In [None]:
import os
openai.api_key = os.environ["OPENAI_API_KEY"]  # read the key from the environment instead of hard-coding it
ft_model = 'ada:ft-pytensor-2023-05-07-08-43-31'
res = openai.Completion.create(model=ft_model, prompt=test['prompt'][1] + '\n\n###\n\n', max_tokens=1, temperature=0)
res['choices'][0]['text']
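
The request pattern above (append the separator, read one token, strip the leading space) can be wrapped in a small helper. This is a sketch under assumptions: `classify` and `fake_complete` are hypothetical names, and `fake_complete` is a stand-in for the fine-tuned model so the example runs without an API call.

```python
SEPARATOR = "\n\n###\n\n"  # must match the separator used during fine-tuning


def classify(text, complete):
    """Return the predicted label for `text`.

    `complete` is any callable mapping a prompt string to the completion
    text; in this notebook it would wrap the completion request.
    """
    completion_text = complete(text + SEPARATOR)
    return completion_text.strip()  # drop the leading space on the token


def fake_complete(prompt):
    """Stand-in for the fine-tuned model (hypothetical, illustration only)."""
    return " hockey" if "NHL" in prompt else " baseball"


label = classify("Best fans in the NHL!", fake_complete)
print(label)
```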


### Inspect the log probabilities by setting the `logprobs` parameter on the completion request

In [None]:
res = openai.Completion.create(model=ft_model, prompt=test['prompt'][3] + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['logprobs']['top_logprobs'][0]
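
A log probability converts to a probability with `exp`. The sketch below assumes a dictionary in the same token-to-logprob shape as the value returned above; the numbers are made up for illustration and are not output from the fine-tuned model.

```python
import math

# Hypothetical values in the shape of `top_logprobs[0]` (token -> log probability).
top_logprobs = {" hockey": -0.0006, " baseball": -7.5}

# exp(logprob) gives the probability the model assigns to each token.
probabilities = {token.strip(): math.exp(lp) for token, lp in top_logprobs.items()}

# The predicted class is the token with the highest probability.
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities)
```

Because only the top two tokens are requested, the probabilities need not sum to exactly 1.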

### Use the model on new text
Besides classifying emails, the model can also classify tweets.

In [None]:
sample_hockey_tweet = """Thank you to the 
@Canes
 and all you amazing Caniacs that have been so supportive! You guys are some of the best fans in the NHL without a doubt! Really excited to start this new chapter in my career with the 
@DetroitRedWings
 !!"""
res = openai.Completion.create(model=ft_model, prompt=sample_hockey_tweet + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']

In [None]:
sample_baseball_tweet="""BREAKING: The Tampa Bay Rays are finalizing a deal to acquire slugger Nelson Cruz from the Minnesota Twins, sources tell ESPN."""
res = openai.Completion.create(model=ft_model, prompt=sample_baseball_tweet + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']