# Fine-tuning GPT for speech act classification

### This notebook splits the data into training, validation and test sets and adds the prompts the model was fine-tuned with. The actual fine-tuning is done on the OpenAI platform, which loads the JSONL files generated by this notebook.

#### Import Libraries

In [None]:
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
import json
import numpy as np

#### Read in data

In [None]:
table = pd.read_csv("path/to/csv/file.csv")  # replace with the path to your CSV file
table = table[["label", "sentence"]]

## Constructing the training set

#### Separate \<dec\> rows: reserve some for the test set, because the class is so scarce that a random split might otherwise leave none there

In [None]:
dec_df = table[table["label"] == "<dec>"]
non_dec_df = table[table["label"] != "<dec>"]
# Reserve 5 <dec> rows for the test set; the rest go into the stratified split
dec_test = dec_df.sample(n=5, random_state=42)
dec_remaining = dec_df.drop(dec_test.index)

#### Stratified split on the rest (non-\<dec\> + remaining \<dec\>)

In [None]:
split_df = pd.concat([non_dec_df, dec_remaining])

# First split: Train 85%, temp 15% (for val + test)
train_df, temp_df = train_test_split(
    split_df,
    test_size=0.15,
    stratify=split_df["label"],
    random_state=42
)
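
As a sanity check on the proportions, the two-stage split can be reproduced on toy data. This is only an illustrative sketch: the `toy` DataFrame and its label counts are made up, and the second split is included to show how the 15% held out here is later divided into roughly 10% validation and 5% test.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 rows, three labels with an uneven distribution
toy = pd.DataFrame({
    "label": ["<rep>"] * 120 + ["<dir>"] * 60 + ["<exp>"] * 20,
    "sentence": [f"sentence {i}" for i in range(200)],
})

# First split: 85% train, 15% held out
train_toy, temp_toy = train_test_split(
    toy, test_size=0.15, stratify=toy["label"], random_state=42
)

# Second split: two thirds of the held-out part for validation, one third for test
val_toy, test_toy = train_test_split(
    temp_toy, test_size=1/3, stratify=temp_toy["label"], random_state=42
)

for name, part in [("train", train_toy), ("val", val_toy), ("test", test_toy)]:
    share = len(part) / len(toy)
    print(f"{name}: {len(part)} rows ({share:.0%})", part["label"].value_counts().to_dict())
```

Because both splits are stratified, each part should keep approximately the same label proportions as the full table.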

#### Option 1: Using a smaller fraction of the training set

In [None]:
# Uncomment to train on a subset; set frac to 0.25, 0.5, 0.75, etc.
# train_df = train_df.sample(frac=0.75, random_state=42).reset_index(drop=True)

# Second split: Validation (10%), Test (5%) from temp
val_df, test_df_partial = train_test_split(
    temp_df,
    test_size=1/3,  # 1/3 of 15% = 5%
    stratify=temp_df["label"],
    random_state=42
)

class_counts = train_df["label"].value_counts().to_dict()

#### Option 2: Scaling the class distribution with smallest class as anchor

In [None]:
min_class = min(class_counts.values())  # smallest class count in the training set
target_min = min_class  # the smallest class keeps its size (anchored to itself)
target_for_second = 100  # target size for the second-smallest class; adjust as needed

# Fit a power law y = a * x**p through the two anchors (x1, y1) and (x2, y2).
# Requires x1 != x2, i.e. the two smallest classes must differ in size.
sorted_counts = sorted(class_counts.values())
x1, y1 = sorted_counts[0], target_min
x2, y2 = sorted_counts[1], target_for_second

p = np.log(y2 / y1) / np.log(x2 / x1)
a = y1 / (x1 ** p)
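
The fit passes a curve of the form `y = a * x**p` through the two anchor points, which gives `p = ln(y2/y1) / ln(x2/x1)` and `a = y1 / x1**p`. A small worked example may make the effect easier to see. The `counts` dictionary below is hypothetical and does not reflect the real class distribution; the guard against equal anchor sizes is also an addition for illustration.

```python
import numpy as np

# Hypothetical class counts, for illustration only
counts = {"<rep>": 4000, "<dir>": 1500, "<exp>": 400, "<com>": 120, "<dec>": 30}

sorted_counts = sorted(counts.values())
x1, y1 = sorted_counts[0], sorted_counts[0]  # smallest class is mapped to itself
x2, y2 = sorted_counts[1], 100               # second-smallest class is mapped to 100

# log(x2 / x1) would be zero if the two smallest classes had the same size
if x1 == x2:
    raise ValueError("The two smallest classes have equal size; pick different anchors.")

p = np.log(y2 / y1) / np.log(x2 / x1)  # exponent of the power law
a = y1 / (x1 ** p)                     # scale factor so that the curve passes through (x1, y1)

# Scaled target size per class
targets = {cls: int(round(a * n ** p)) for cls, n in counts.items()}

print(f"p = {p:.3f}, a = {a:.3f}")
for cls, n in counts.items():
    print(f"{cls}: {n} -> {targets[cls]}")
```

With these anchors the exponent is below 1, so large classes are shrunk proportionally more than small ones while the ordering of class sizes is preserved.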

In [None]:
def scaled_target(n):
    return int(round(a * (n ** p)))

# Compute new targets for each class
targets = {cls: scaled_target(n) for cls, n in class_counts.items()}

# Sample accordingly
resampled_dfs = []
for cls, tgt in targets.items():
    cls_df = train_df[train_df["label"] == cls]
    # if target bigger than available, just take all
    n_sample = min(len(cls_df), tgt)
    resampled_dfs.append(cls_df.sample(n=n_sample, random_state=42))

train_df_resampled = pd.concat(resampled_dfs).reset_index(drop=True)

print("Original train counts:", class_counts)
print("Resampled train counts:", train_df_resampled["label"].value_counts().to_dict())

#### Option 3: Skip both options above and use the full training set

#### Whichever option you chose, run the following cells:

In [None]:
# Add the 5 reserved <dec> rows to the test set
test_df = pd.concat([test_df_partial, dec_test]).reset_index(drop=True)

# Shuffle the final test set
test_df = test_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
system_prompt = """You're a linguist with expert knowledge in speech acts and you are annotating a dataset of spoken sentences with Searle's speech acts.
The classes are defined as follows: Representative (rep): For Searle the purpose of the class of representatives is 'to commit the speaker (in varying degrees) to something's being the case, to the truth of the expressed proposition'. The context of an utterance is often crucial to understanding the speaker's intent. \
Directive (<dir>): Searle characterises directives as acts 'by the speaker to get the hearer to do something'. While certain verbs such as ask, order, or command obviously have canonical uses as indicators of directives, Searle relegates to a footnote the observation that 'questions are a species of directives, since they are attempts by S to get H to answer'.
Commissive (<com>): Searle understands commissives as those 'illocutionary acts whose point is to commit the speaker (again in varying degrees) to some future course of action'. Subject to the sincerity condition of having a genuine intention to carry out the action to which a commitment is made, the person who utters a commissive undertakes to fit the external world to the words which have been uttered. This definition includes both a central function of commitment and the qualification that this commitment may be expressed in varying degrees of strength. 
Expressives (<exp>): In expressives, the purpose of an utterance is, as Searle says, 'to express the psychological state specified in the sincerity condition about a state of affairs specified in the propositional content'. The key distinguishing characteristic of expressives is that 'the speaker is neither trying to get the world to match the words nor the words to match the world: the truth of the expressed proposition is presupposed'. 
Declarative (<dec>):  In Searle's taxonomy, successful performance of a declarative 'guarantees that the propositional content corresponds to the world'. This guarantee arises because some aspect of the real world changes as a result of the commission of the speech act.
Indeterminate conversationally-relevant utterance (<icu>): Indeterminate conversationally-relevant utterances denote a broad range of minimal responses, back-channel utterances, or other elements of speech which are relevant to the maintenance of discourse or discourse continuity, but which lack a discernable function as a speech act.
Not analysable at pragmatic level (<xpa>): xpa denotes that the utterance lies outside the pragmatic analysis.
Please classify this sentence as one of the eight speech act types and output the label i.e. <rep>, <dir>, <com>, <exp>, <dec>, <icu> or <xpa>."""

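# train_df_resampled only exists if Option 2 was run; otherwise fall back to train_df.
train_out = train_df_resampled if "train_df_resampled" in globals() else train_df

# Adjust the file name if you did not use Option 2
output_path = "gpt_finetune_train_smoothed_distr.jsonl"

with open(output_path, "w", encoding="utf-8") as f:
    for _, row in train_out.iterrows():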
        entry = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Annotate the following sentence with one speech act: {row['sentence']}"},
                {"role": "assistant", "content": row["label"]}
            ]
        }
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
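
Before uploading, it can be worth reading the file back and checking that every line has the expected structure. The following is a minimal sketch, not part of the original workflow: `check_chat_jsonl` and `ALLOWED_LABELS` are hypothetical names, and the demo writes a single made-up record to a temporary file so that the block runs on its own.

```python
import json
import os
import tempfile

# The seven labels listed in the system prompt
ALLOWED_LABELS = {"<rep>", "<dir>", "<com>", "<exp>", "<dec>", "<icu>", "<xpa>"}


def check_chat_jsonl(path, allowed_labels=ALLOWED_LABELS):
    """Return the number of valid records; raise ValueError on the first bad one."""
    n_valid = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            messages = json.loads(line)["messages"]
            roles = [m["role"] for m in messages]
            if roles != ["system", "user", "assistant"]:
                raise ValueError(f"line {line_no}: unexpected roles {roles}")
            label = messages[-1]["content"]
            if label not in allowed_labels:
                raise ValueError(f"line {line_no}: unknown label {label!r}")
            n_valid += 1
    return n_valid


# Demo with one made-up record written to a temporary file
demo = {
    "messages": [
        {"role": "system", "content": "(system prompt goes here)"},
        {"role": "user", "content": "Annotate the following sentence with one speech act: Close the door."},
        {"role": "assistant", "content": "<dir>"},
    ]
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(demo, ensure_ascii=False) + "\n")
    n_valid = check_chat_jsonl(demo_path)

print("valid records:", n_valid)
```

In the notebook, the same function could be called on `output_path` right after the training file is written.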

## Constructing the validation set (same for all training set options)

In [None]:
# system_prompt is identical to the one defined for the training set above and is reused here.

# Path to save JSONL file
output_path = "gpt_finetune_validate.jsonl"

with open(output_path, "w", encoding="utf-8") as f:
    for _, row in val_df.iterrows():
        entry = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Annotate the following sentence with one speech act: {row['sentence']}"},
                {"role": "assistant", "content": row["label"]}
            ]
        }
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

## Constructing the test set (same for all training set options)

In [None]:
# system_prompt is identical to the one defined for the training set above and is reused here.

# Path to save JSONL file
output_path = "gpt_finetune_test.jsonl"

with open(output_path, "w", encoding="utf-8") as f:
    for _, row in test_df.iterrows():
        entry = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Annotate the following sentence with one speech act: {row['sentence']}"},
                # No assistant message here: the gold labels are saved separately for evaluation
            ]
        }
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

#### Save test labels for evaluation

In [None]:
output_path = "gpt_finetune_test_labels.jsonl"
with open(output_path, "w", encoding="utf-8") as f:
    for _, row in test_df.iterrows():
        entry = {
            "label": row["label"]
        }
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
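
Once the fine-tuned model has produced one prediction per test sentence, the saved labels can be compared against them line by line. The sketch below assumes the predictions are available as a list of label strings in the same order as the test file; how they are obtained is not covered in this notebook, so `pred` is a hypothetical stand-in, and `read_labels` and `per_class_recall` are illustrative helper names.

```python
import json
from collections import Counter


def read_labels(lines):
    """Parse lines in the format written above: one {"label": ...} object per line."""
    return [json.loads(line)["label"] for line in lines if line.strip()]


def per_class_recall(gold, pred):
    """Share of correctly predicted items for each gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return {cls: correct[cls] / total[cls] for cls in total}


# Made-up gold labels in the file format, and hypothetical model outputs
gold = read_labels([
    '{"label": "<rep>"}',
    '{"label": "<dir>"}',
    '{"label": "<rep>"}',
    '{"label": "<dec>"}',
])
pred = ["<rep>", "<rep>", "<rep>", "<dec>"]

overall = sum(g == p for g, p in zip(gold, pred)) / len(gold)
by_class = per_class_recall(gold, pred)

print("accuracy:", overall)
print("recall per class:", by_class)
```

Reporting a per-class figure alongside overall accuracy is useful here because rare classes such as \<dec\> contribute very little to the overall number.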