# 3. Train a fine-tuning model

This notebook will utilize the dataset created to fine tune a BERT model. The model will be trained to predict the answer to a question. 

In [1]:
from sklearn.model_selection import train_test_split

In [2]:
import pandas as pd

df = pd.read_csv('../data/olympics_qa.csv')
df.head()

Unnamed: 0,title,section,text,ntokens,context,questions,answers
0,Artificial intelligence,Summary,Artificial intelligence (AI) is intelligence d...,536,Artificial intelligence\nSummary\n\nArtificial...,1. What are the goals of AI research?\n2. What...,"1. The goals of AI research include reasoning,..."
1,Artificial intelligence,History,Artificial beings with intelligence appeared a...,1113,Artificial intelligence\nHistory\n\nArtificial...,1. What is the history of artificial intellige...,1. The history of artificial intelligence can ...
2,Artificial intelligence,Goals,The general problem of simulating (or creating...,50,Artificial intelligence\nGoals\n\nThe general ...,1. What are the sub-problems of simulating int...,1. The sub-problems of simulating intelligence...
3,Artificial intelligence,"Reasoning, problem-solving",Early researchers developed algorithms that im...,120,"Artificial intelligence\nReasoning, problem-so...","1. What is the ""combinatorial explosion""?\n2. ...","1. The ""combinatorial explosion"" is when an al..."
4,Artificial intelligence,Knowledge representation,Knowledge representation and knowledge enginee...,335,Artificial intelligence\nKnowledge representat...,1. What is an ontology?\n2. What is the differ...,"1. An ontology is a set of objects, relations,..."


Split the sections into train and test sets. The train set will be used to train the model and the test set will be used to evaluate the model.

In [3]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
len(train_df), len(test_df)

(202, 51)

## 3.1 Create the fine-tuning datasets for Q&A model

The fine-tuning dataset is created in this step. 


In [4]:
def create_fine_tuning_dataset(df, discriminator=False, n_negative=1, add_related=False):
    rows = []
    for i, row in df.iterrows():
        for q, a in zip(("1." + row.questions).split('\n'), ("1." + row.answers).split('\n')):
            if discriminator:
                rows.append({"prompt": f"{row.context}\nQuestion: {q[2:].strip()}\n Related:", "completion": f"Yes"})
            else:
                rows.append({"prompt": f"{row.context}\nQuestion: {q[2:].strip()}\n Answer:", "completion": f"{a[2:].strip()}"})

    for i, row in df.iterrows():
        for q in ("1." + row.questions).split('\n'):
            if len(q) > 10:
                for j in range(n_negative + (2 if add_related else 0)):
                    random_context = ""
                    if j == 0 and add_related:
                        # add the related contexts based on originating from the same wikipedia page
                        subset = df[(df.title == row.title) & (df.context != row.context)]

                        if len(subset) < 1:
                            continue
                        random_context = subset.sample(1).iloc[0].context
                    
                    if discriminator:
                        rows.append({"prompt":f"{random_context}\nQuestion: {q[2:].strip()}\n Related:", "completion":f" no"})
                    else:
                        rows.append({"prompt":f"{random_context}\nQuestion: {q[2:].strip()}\nAnswer:", "completion":f" No appropriate context found to answer the question."})
                        
    return pd.DataFrame(rows)

In [5]:
for name, is_disc in [('qa', False)]:
    for train_test, dt in [('train', train_df), ('test', test_df)]:
        ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)
        ft.to_json(f'{name}_{train_test}.jsonl', orient='records', lines=True)

## 3.2 Fine-tuning ChatGPT on the generated train_qa.jsonl dataset

The fine-tuning dataset is used to fine-tune the ChatGPT model. The fine-tuned model is saved in the `models` directory.

In [39]:
import openai
import os
import pandas as pd
import time

In [40]:
openai.api_key = "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"

In [42]:
# Read the fine tuning dataset
df = pd.read_json('qa_train.jsonl', orient='records', lines=True)
df.head()

Unnamed: 0,prompt,completion
0,Hallucination (artificial intelligence)\nSumma...,1. AI hallucinations are responses by AI that ...
1,Hallucination (artificial intelligence)\nSumma...,The consequences of AI hallucinations can be a...
2,Hallucination (artificial intelligence)\nSumma...,
3,History of artificial intelligence\nBirth of a...,1. The motivation behind the creation of the f...
4,History of artificial intelligence\nBirth of a...,Early artificial intelligence researchers face...


In [44]:
!openai api fine_tunes.prepare_data -f qa_train.jsonl -o qa_train_prepared.jsonl

usage: openai api [-h]
                  {engines.list,engines.get,engines.update,engines.generate,chat_completions.create,completions.create,deployments.list,deployments.get,deployments.delete,deployments.create,models.list,models.get,models.delete,files.create,files.get,files.delete,files.list,fine_tunes.list,fine_tunes.create,fine_tunes.get,fine_tunes.results,fine_tunes.events,fine_tunes.follow,fine_tunes.cancel,fine_tunes.delete,image.create,image.create_edit,image.create_variation,audio.transcribe,audio.translate}
                  ...
openai api: error: argument {engines.list,engines.get,engines.update,engines.generate,chat_completions.create,completions.create,deployments.list,deployments.get,deployments.delete,deployments.create,models.list,models.get,models.delete,files.create,files.get,files.delete,files.list,fine_tunes.list,fine_tunes.create,fine_tunes.get,fine_tunes.results,fine_tunes.events,fine_tunes.follow,fine_tunes.cancel,fine_tunes.delete,image.create,image.create_edi