# Instruction-tuning LLMs for Native Language Identification on TOEFL

This notebook contains the code for instruction-tuning open-source large language models (LLMs) on the task of Native Language Identification (NLI). The code is heavily inspired by https://github.com/unslothai/unsloth and its example notebooks on fine-tuning LLMs. We use the Unsloth library to perform 4-bit QLoRA fine-tuning on the TOEFL training set and then evaluate the fine-tuned LLM on the TOEFL test set. We recommend running this notebook on a GPU runtime in Google Colaboratory to speed up fine-tuning.



In [None]:
# Install packages
!pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
!pip install imbalanced-learn

In [None]:
# Mount Google Drive if using Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from unsloth import FastLanguageModel
from datasets import load_dataset, Dataset
from pydantic import BaseModel, ValidationError, Field
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np
import torch
import os
from typing import Literal
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, f1_score, accuracy_score
from collections import defaultdict
from trl import SFTTrainer
from transformers import TrainingArguments
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

max_seq_length = 2048
dtype = torch.bfloat16
load_in_4bit = True # we use 4bit quantization

# 4bit pre quantized models supported by Unsloth
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct"
]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [None]:
# Load model
model_name = "unsloth/mistral-7b-bnb-4bit"
run_name = "finetuned_mistral"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
  )

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 100,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.43.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
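
As a sanity check on the LoRA configuration above, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard Mistral-7B layer shapes (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers); these dimensions are not stated in this notebook, and `lora_params` is a hypothetical helper. The result agrees with the trainable parameter count that Unsloth reports in the training log.

```python
# Back-of-the-envelope count of LoRA parameters for r = 16 on all
# attention and MLP projections. Layer shapes are assumed Mistral-7B values.
HIDDEN = 4096         # model hidden size
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336  # MLP inner size
NUM_LAYERS = 32
RANK = 16             # r in get_peft_model

# (in_features, out_features) of every adapted projection in one decoder layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
# prints "1,310,720 per layer, 41,943,040 in total"
```

Only these adapter weights are updated during fine-tuning; the 4-bit base weights stay frozen.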


# Data Prep
We use the TOEFL training set to perform instruction-tuning with the prompts used in the closed-set experiments, in which we provide the set of possible L1s in the prompt.

In [None]:
toefl_train = "/content/drive/MyDrive/thesis_NLI/TOEFL11/train_preprocessed.csv"

alpaca_prompt = '''
### Instruction:
You are a forensic linguistics expert that reads English texts written by non-native authors to classify the native language of the author as one of:
"ARA": Arabic
"CHI": Chinese
"FRE": French
"GER": German
"HIN": Hindi
"ITA": Italian
"JPN": Japanese
"KOR": Korean
"SPA": Spanish
"TEL": Telugu
"TUR": Turkish
Use clues such as spelling errors, word choice, syntactic patterns, and grammatical errors to decide on the native language of the author.\n

DO NOT USE ANY OTHER CLASS.
IMPORTANT: Do not classify any input as "ENG" (English). English is an invalid choice.

Valid output formats:
Class: "ARA"
Class: "CHI"
Class: "FRE"
Class: "GER"

Classify the text below as one of ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, or TUR. Do not output any other class - do NOT choose "ENG" (English). What is the closest native language of the author of this English text from the given list?

### Input:
{}

### Response:
{}'''


EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    """Fill the prompt template with each essay and its gold label."""
    essays = examples["text"]
    gold_labels = examples["language"]
    texts = []
    for essay, gold_label in zip(essays, gold_labels):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(essay, gold_label) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

# perform random undersampling
rus = RandomUnderSampler(sampling_strategy={'ARA':33, 'CHI':33, 'FRE':33, 'GER':33, 'HIN':33, 'ITA':33, 'JPN':33, 'KOR':33, 'SPA':33, 'TEL': 33, 'TUR':33}, random_state=0)
df = pd.read_csv(toefl_train)
X = df.drop('language',axis=1)
y = df['language'].tolist()
X_resampled, y_resampled = rus.fit_resample(X, y)
print(len(y_resampled))
# print(X_resampled['text'][0])
sample_df = pd.DataFrame({'text': X_resampled['text'].tolist(), 'language': y_resampled})

dataset = Dataset.from_pandas(sample_df)
print(Counter(y_resampled))

# or uncomment the following code for the full dataset
# from datasets import load_dataset
# dataset = load_dataset("csv", data_files=toefl_train, split='train')
# print(f'Number of samples: {len(dataset)}')
# print(dataset[0])
texts = dataset['text']
labels = dataset['language']
dataset = dataset.map(formatting_prompts_func, batched = True,)
print(dataset[:5])


363
Counter({'ARA': 33, 'CHI': 33, 'FRE': 33, 'GER': 33, 'HIN': 33, 'ITA': 33, 'JPN': 33, 'KOR': 33, 'SPA': 33, 'TEL': 33, 'TUR': 33})


{'text': ['\n### Instruction:\nYou are a forensic linguistics expert that reads English texts written by non-native authors to classify the native language of the author as one of:\n"ARA": Arabic\n"CHI": Chinese\n"FRE": French\n"GER": German\n"HIN": Hindi\n"ITA": Italian\n"JPN": Japanese\n"KOR": Korean\n"SPA": Spanish\n"TEL": Telugu\n"TUR": Turkish\nUse clues such as spelling errors, word choice, syntactic patterns, and grammatical errors to decide on the native language of the author.\n\n\nDO NOT USE ANY OTHER CLASS.\nIMPORTANT: Do not classify any input as "ENG" (English). English is an invalid choice.\n\nValid output formats:\nClass: "ARA"\nClass: "CHI"\nClass: "FRE"\nClass: "GER"\n\nClassify the text below as one of ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, or TUR. Do not output any other class - do NOT choose "ENG" (English). What is the closest native language of the author of this English text from the given list?\n\n### Input:\nBreif news " An adult man have been lost s
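
The balanced sample above (33 essays for each of the 11 L1s, 363 in total) is drawn with `RandomUnderSampler`. To make the idea concrete, here is a minimal standard-library sketch of balanced random undersampling. It is not imbalanced-learn's implementation and will not reproduce the same selection; `undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def undersample(texts, labels, n_per_class, seed=0):
    """Keep n_per_class randomly chosen examples for every label."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sampled_texts, sampled_labels = [], []
    for label in sorted(by_label):
        # random.sample raises ValueError if a class has fewer than n_per_class items
        chosen = rng.sample(by_label[label], n_per_class)
        sampled_texts.extend(chosen)
        sampled_labels.extend([label] * n_per_class)
    return sampled_texts, sampled_labels

# Toy data: three hypothetical classes of unequal size
toy_labels = ["ARA"] * 5 + ["CHI"] * 8 + ["FRE"] * 6
toy_texts = [f"essay {i}" for i in range(len(toy_labels))]
texts_small, labels_small = undersample(toy_texts, toy_labels, n_per_class=3)
print(Counter(labels_small))
```

Every class ends up with the same number of examples, so the fine-tuned model sees no class prior in the training data.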

# Train LLM on TOEFL training set

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

for ind, rs in zip(range(0,1), [100]): # a single run with seed 100; extend the range and the seed list to repeat with more seeds
  model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
  )
  model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = rs,
    use_rslora = False,
    loftq_config = None,
  )
  trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 4, # The batch size per GPU for training
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs=3,
        learning_rate = 1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = rs,
        output_dir = "outputs",
    ),
  )

  trainer_stats = trainer.train()
  model.save_pretrained(f"/content/drive/MyDrive/thesis_NLI/TOEFL11/{run_name}") # Local saving
  tokenizer.save_pretrained(f"/content/drive/MyDrive/thesis_NLI/TOEFL11/{run_name}")


==((====))==  Unsloth 2024.8: Fast Mistral patching. Transformers = 4.43.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 363 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 66
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.1068
2,2.231
3,2.109
4,2.2208
5,2.0274
6,2.0068
7,1.8219
8,1.7811
9,1.6808
10,1.5158
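
The totals in the training log above (total batch size 16, 66 steps) follow from the training arguments. The sketch below reproduces the arithmetic, assuming a single GPU and that the trainer counts only complete gradient-accumulation groups per epoch, which matches the logged value; the variable names are illustrative.

```python
import math

num_examples = 363
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 3

# Micro-batches per epoch: the last one is smaller (363 = 90 * 4 + 3)
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# Optimizer updates per epoch: assumed to be floored to complete accumulation groups
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)

total_steps = math.ceil(num_train_epochs * updates_per_epoch)
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size, total_steps)  # prints "16 66"
```

With `warmup_steps = 5`, the learning rate therefore ramps up over the first 5 of these 66 updates before the linear decay starts.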


# Evaluate on TOEFL test set

After fine-tuning, we evaluate the fine-tuned model on the TOEFL test set. We prompt the model to predict the native language for L2 texts in the TOEFL test set.

## Defining the functions for running inference

In [None]:
def generate_text(prompt):
  """
  Generate a completion for the input prompt with the loaded LLM
  :param prompt: input prompt
  :type prompt: str
  :returns: decoded output, i.e. the prompt followed by at most 10 newly generated tokens
  """
    # Tokenize the prompt
  inputs = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
  outputs = model.generate(inputs,
                           max_new_tokens=10,
                           pad_token_id=tokenizer.eos_token_id,
                           # temperature=0.001 would only take effect with do_sample=True
                           )
    # Decode the response
  response = tokenizer.decode(outputs[0], skip_special_tokens=True)

  return response

def clean_output(output, eos_token, output_only=False):
  """
  This function specifically cleans up the output,
  to remove the prompt, make sure it is in the correct format, and remove any empty lines.
  :param output: model generated output
  :param eos_token: end-of-sequence token to split model output on
  :param output_only, default False: if True, extract only the newly generated output by model and remove the prompt. mostly used for debugging
  :type output: str
  :type eos_token: str
  """
  pure_output = output.split(eos_token)
  pure_output = pure_output[-1]
  pure_output = pure_output.strip()
  final_output = pure_output
  if not output_only: # normalize the raw output into a JSON string holding the predicted label
    language_class_dict = {'arabic': 'ARA',
                            'bulgarian': 'BUL',
                             'chinese': 'CHI',
                             'czech': 'CZE',
                             "french": "FRE",
                             "german": "GER",
                             "hindi": "HIN",
                             "italian": "ITA",
                             "japanese": "JPN",
                             "korean": 'KOR',
                             "spanish": "SPA",
                             "telugu": "TEL",
                             "turkish": "TUR",
                             "russian": "RUS",
                             "english": "ENG",
                            'sp': 'SPA',
                            'itl': 'ITA',
                           'deu': 'GER'
                             }
    if '}' in final_output:
      x = output.split("}")
      for piece in x:
        if 'native_lang' in piece:
          x = piece.split(":")
          label = x[-1]
          label = label.strip()
          label = label.replace('"', '')
          label = label.replace('\n', '')
          final_output = '{"native_lang":"' + label + '"}'
    if 'Class:' in final_output:
      x = output.split("Class:")
      label = x[-1]
      label = label.strip()
      label = label.replace('"', '')
      label = label.replace('\n', '')
      label = label.replace('.', '')
      final_output = label
    for lang, label in language_class_dict.items():
      if lang in final_output.lower() or label in final_output:
        final_output = '{"native_lang":"' + label + '"}'
  return final_output

def classify(texts, goldlabels, filter_token):
  '''
  :param texts: list of texts
  :param goldlabels: list of gold labels
  :param filter_token: token to get cleaned output
  :type texts: list
  :type goldlabels: list
  :returns predictions: a list of model predictions
  '''
  predictions = []
  count = 1
  sys_prompt = prompt_TOEFL
  prompt_retry = prompt_retry_TOEFL
  all_labels = all_labels_TOEFL
  NLI_prediction = NLI_prediction_TOEFL
  main_task_prompt = main_task_prompt_TOEFL
  for text, gold in zip(texts, goldlabels):
    promptcounter = 0
    # Build the initial prompt once, outside the retry loop, so that the retry prompts set below are actually used
    fullprompt = "Instruction: " + sys_prompt + '\n\n' + main_task_prompt + '\n\nInput: '+ text + "\nResponse:"
    while True:
      try:
        output = generate_text(fullprompt) # generate text per TOEFL text
        output_only = clean_output(output, filter_token, output_only=True)
        print(output_only)
        final_output = clean_output(output, filter_token)
        validated_response = NLI_prediction.model_validate_json(final_output) # use class to validate json string
        response_dict = validated_response.model_dump() # dump validated response into dict
        predicted_native_lang = response_dict['native_lang'] # get the predicted native language
        if predicted_native_lang == "ENG": # reiterate prompt if model predicts english
          fullprompt = "Instruction: " + sys_prompt + '\n\n' + main_task_prompt + '\n' + prompt_retry_eng + '\n\nInput: '+ text + "\nResponse:"
          promptcounter+=1
          if promptcounter > 4: # try 5 times to reprompt, if still unable to extract predicted label, append other
            response_dict = {'native_lang': 'other'}
            predictions.append('other')
            break
        else:
          predictions.append(predicted_native_lang) # append it to list of predictions
          break
      # print(final_output)
      except ValidationError as e: # if there is a validation error, make model retry
        fullprompt = "Instruction: " + sys_prompt + '\n\n' + main_task_prompt + '\n' + prompt_retry + '\n\nInput: '+ text + "\nResponse:"
        promptcounter +=1
        if promptcounter > 4: # try 5 times to reprompt, if still unable to extract predicted label, append other
          response_dict = {'native_lang': 'other'}
          predictions.append('other')
          break
    print(count, response_dict)
    print('F1 score:', "{:.2f}".format(f1_score(goldlabels[0:count], predictions, average="macro")))
    count +=1
  return predictions
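
The `Class: "XXX"` output format requested in the prompt can also be parsed with a stricter pattern match. The following is a minimal sketch of such an alternative, not the `clean_output` logic used in this notebook; `extract_label`, `VALID_LABELS`, and `LABEL_PATTERN` are hypothetical names. It only accepts one of the 11 TOEFL labels as a whole upper-case token, so short substrings inside ordinary words cannot trigger a match.

```python
import re

VALID_LABELS = ("ARA", "CHI", "FRE", "GER", "HIN", "ITA",
                "JPN", "KOR", "SPA", "TEL", "TUR")
# Match a label only as a whole word, case-sensitively
LABEL_PATTERN = re.compile(r"\b(" + "|".join(VALID_LABELS) + r")\b")

def extract_label(generated, marker="Response:"):
    """Return the first valid label after the last `marker`, or None."""
    completion = generated.split(marker)[-1]  # drop the echoed prompt
    match = LABEL_PATTERN.search(completion)
    return match.group(1) if match else None

print(extract_label('Instruction: ... Class: "ARA" ...\nResponse: Class: "GER"'))  # GER
print(extract_label("Response: I think it is Spanish"))  # None
```

A `None` result plays the same role as a validation error in `classify`: it signals that the model should be prompted again.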

## Defining the prompts

In [None]:
class NLI_prediction_TOEFL(BaseModel):
  native_lang: Literal['ARA', 'CHI', 'FRE', 'GER', 'HIN', 'ITA', 'JPN', 'KOR', 'SPA', 'TEL', 'TUR', 'ENG']

all_labels_TOEFL = ['ARA', 'CHI', 'FRE', 'GER', 'HIN', 'ITA', 'JPN', 'KOR', 'SPA', 'TEL', 'TUR']

results_TOEFL= "/content/drive/MyDrive/thesis_NLI/TOEFL11/toefl_results.csv"
dataset = pd.read_csv(results_TOEFL)
test_texts = dataset['text'].tolist()
test_labels = dataset['language'].tolist()
prompt_TOEFL = '''
  You are a forensic linguistics expert that reads English texts written by non-native authors to classify the native language of the author as one of:
  "ARA": Arabic
  "CHI": Chinese
  "FRE": French
  "GER": German
  "HIN": Hindi
  "ITA": Italian
  "JPN": Japanese
  "KOR": Korean
  "SPA": Spanish
  "TEL": Telugu
  "TUR": Turkish
  Use clues such as spelling errors, word choice, syntactic patterns, and grammatical errors to decide on the native language of the author.\n

  DO NOT USE ANY OTHER CLASS.
  IMPORTANT: Do not classify any input as "ENG" (English). English is an invalid choice.

  Valid output formats:
  Class: "ARA"
  Class: "CHI"
  Class: "FRE"
  Class: "GER"
  '''

main_task_prompt_TOEFL = '''Classify the text above as one of ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, or TUR. Do not output any other class - do NOT choose "ENG" (English). What is the closest native language of the author of this English text from the given list?
'''

prompt_retry_eng = '''
  You previously mistakenly predicted this text as "ENG" (English). The class is NOT English.
  Please classify the native language of the author of the text again.
  '''

prompt_retry_TOEFL = '''
  Your classification is not in the list of possible languages.
  Please try again and choose only one of the following classes:
  ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, or TUR
  '''



## NLI classification using fine-tuned LLMs

In [None]:
all_models = ['finetuned_llama2_7b', 'finetuned_llama3_8b', 'finetuned_mistral_7b', 'finetuned_gemma_7b', 'finetuned_phi3']

In [None]:
from unsloth import FastLanguageModel

accuracies = []
%cd /content/drive/MyDrive/thesis_NLI/TOEFL11
runs = 1
for count in range(runs):
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name = run_name, # YOUR MODEL YOU USED FOR TRAINING
      max_seq_length = max_seq_length,
      dtype = dtype,
      load_in_4bit = load_in_4bit,
    )
  FastLanguageModel.for_inference(model) # Enable native 2x faster inference

  eos_token = 'Response:'
  results_TOEFL = "/content/drive/MyDrive/thesis_NLI/TOEFL11/toefl_results.csv"
  predictions = classify(test_texts, test_labels, eos_token)
  accuracy = accuracy_score(test_labels, predictions)
  accuracy2 = "{:.2f}".format(accuracy*100)

  print(f'-------------Run: {count}')
  print(f'-------------Accuracy: {accuracy2}')
  accuracy=float(accuracy*100)
  accuracies.append(accuracy)
  cm = confusion_matrix(test_labels, predictions, labels=all_labels_TOEFL)
  cm_display = ConfusionMatrixDisplay(cm, display_labels=all_labels_TOEFL).plot()
  cm_display.figure_.savefig(f'/content/drive/MyDrive/thesis_NLI/TOEFL_results/finetuned_mistral_undersampled.png')
  df = pd.read_csv(results_TOEFL)
  num_columns = len(df.columns)
  df.insert(num_columns, run_name, predictions)
    # df.head()
  df.to_csv(results_TOEFL, index=False)

avg_acc=sum(accuracies)/runs
print(accuracies)
print(f"Average acc: {avg_acc}")
print(f"Standard deviation: {np.std(accuracies)}")
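
When more than one run is evaluated, note that `np.std` divides by n by default (population standard deviation), while the sample standard deviation divides by n - 1. The sketch below shows the difference with the standard library; the accuracy values are hypothetical and are not results from this notebook.

```python
import statistics

example_accuracies = [81.0, 83.0, 85.0]  # hypothetical values for three runs

mean_acc = statistics.mean(example_accuracies)
# Divides by n, like np.std(example_accuracies)
population_sd = statistics.pstdev(example_accuracies)
# Divides by n - 1, like np.std(example_accuracies, ddof=1)
sample_sd = statistics.stdev(example_accuracies)

print(f"mean={mean_acc:.2f} population_sd={population_sd:.2f} sample_sd={sample_sd:.2f}")
# prints "mean=83.00 population_sd=1.63 sample_sd=2.00"
```

With `runs = 1`, as configured above, the population standard deviation is 0 and the sample standard deviation is undefined.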