# Text Classification - QLoRA Dataset

## $\color{blue}{Sections:}$
* Preamble
* Admin - importing libraries
* Data - load the data into a Dataset and format prompts with [INST] tags

## $\color{blue}{Preamble:}$

This notebook prepares the data for fine-tuning the 7B version of Mistral Instruct. We focus on the subtask of predicting the correct book, on the assumption that if book-level prediction accuracy is very high, further classifiers can then predict the chapter within each book.

The fine-tuning uses QLoRA and relies on Hugging Face's SFTTrainer. The first problem is to get the data into a format that the trainer expects and the model accepts. That is the focus of this notebook.

## $\color{blue}{Admin:}$


In [None]:
from google.colab import drive

In [None]:
drive.mount("/content/drive")
%cd '/content/drive/MyDrive/'

Mounted at /content/drive
/content/drive/MyDrive


In [None]:
import pandas as pd
path = "class/datasets/" # modify path
df_train = pd.read_pickle(path + "df_train")
df_dev = pd.read_pickle(path + "df_dev")
df_test = pd.read_pickle(path + "df_test")

In [None]:
%%capture
!pip install dill

In [None]:
import dill
def save_data(docs, filename):
    """Save an object (here, a Hugging Face Dataset) to a file with dill."""
    with open(filename, 'wb') as f:
        dill.dump(docs, f)
    print(f"Documents saved to {filename}")

def load_data(filename):
    """Load an object previously written by save_data from a dill file."""
    with open(filename, 'rb') as f:
        docs = dill.load(f)
    print(f"Documents loaded from {filename}")
    return docs
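
A quick round trip is a cheap way to confirm that the two helpers are inverses before relying on the saved files in a later notebook. The sketch below is illustrative only: it falls back to the standard-library `pickle` (which exposes the same `dump`/`load` interface) when `dill` is not installed, and the temporary file name `Dataset_demo` and the toy record are hypothetical.

```python
import os
import tempfile

try:
    import dill as serializer  # the library used in this notebook
except ImportError:
    import pickle as serializer  # assumption: stand-in with the same dump/load interface


def save_data(obj, filename):
    """Serialize obj to filename."""
    with open(filename, "wb") as f:
        serializer.dump(obj, f)


def load_data(filename):
    """Read back an object written by save_data."""
    with open(filename, "rb") as f:
        return serializer.load(f)


# Hypothetical toy record in the same {'input', 'output'} shape used later on.
records = [{"input": "a short passage", "output": "Dracula"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "Dataset_demo")
    save_data(records, target)
    restored = load_data(target)

print(restored == records)
```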

## $\color{blue}{Data:}$

In [None]:
base_text =  """<s> The task is to make a book classification from small passage of text.
The books are Telemachia, Odyssey, and Nostros from James Joyce's Ulysses, Dubliners by James Joyce, Dracula by Bram Stoker, and Republic by Plato.

[INST]Read the Text, choose the correct classification from the list below insert the book name after the "Answer: " prompt. You will provide a single word response.

Telemachia
Odyssey
Nostros
Dubliners
Dracula
Republic

Text: """

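def generate_prompt(example, base_text, return_response=True):
  """Build one prompt string (returned inside a list) from an example.

  With return_response=True the label and the closing </s> are appended (training).
  With return_response=False the prompt stops after "Answer: " (inference).
  """
  if not return_response:
    # Drop the leading "<s> " (4 characters) for inference prompts.
    base_text = base_text[4:]
  full_prompt = base_text
  full_prompt += f"{example['input']}[/INST]"
  full_prompt += "\nAnswer: "
  if return_response:
    full_prompt += f"{example['output']}"
    full_prompt += '</s>'

  return [full_prompt]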

In [None]:
%%capture
!pip install datasets

In [None]:
df_train.columns

Index(['index', 'master', 'book_idx', 'book', 'chapter_idx', 'chapter',
       'author', 'content', 'vanilla_embedding', 'vanilla_embedding.1',
       'ft_embedding', 'ft_embedding_pal'],
      dtype='object')

In [None]:
from datasets import Dataset

def to_records(df):
  """Turn a DataFrame into a list of {'input', 'output'} dicts for Dataset.from_list."""
  # Iterate over the columns directly instead of row-by-row .loc lookups,
  # so the result does not depend on the DataFrame having a 0..n-1 index.
  return [{'input': content, 'output': str(book)}
          for content, book in zip(df['content'], df['book'])]

train_set = to_records(df_train)
dev_set = to_records(df_dev)
test_set = to_records(df_test)


In [None]:
trainDataset = Dataset.from_list(train_set)
devDataset = Dataset.from_list(dev_set)
testDataset = Dataset.from_list(test_set)
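
Because the prompt in `base_text` offers exactly six book names, it may be worth checking that every `output` value is spelled the same way as one of them; a mismatched label would train the model toward an answer the prompt never lists. The following is a minimal sketch on toy records. The helper name `find_unknown_labels` and the sample data are hypothetical; in the notebook the same function could be called on `train_set`, `dev_set`, and `test_set`.

```python
# The six labels exactly as they are listed in base_text.
ALLOWED_LABELS = {"Telemachia", "Odyssey", "Nostros", "Dubliners", "Dracula", "Republic"}


def find_unknown_labels(records, allowed=ALLOWED_LABELS):
    """Return the sorted 'output' values that are not in the allowed label set."""
    return sorted({record["output"] for record in records} - set(allowed))


# Hypothetical toy records; the third one is deliberately invalid.
records = [
    {"input": "passage one", "output": "Dubliners"},
    {"input": "passage two", "output": "Dracula"},
    {"input": "passage three", "output": "Ulysses"},
]

unknown = find_unknown_labels(records)
print(unknown)  # ['Ulysses']
```

An empty list means every label can be produced by choosing from the list in the prompt.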

In [None]:
trainDataset[0]

{'input': '“Is it John of Tuam?”   “Are you sure of that now?” asked Mr Fogarty dubiously. “I thought it was some Italian or American.”   “John of Tuam,” repeated Mr Cunningham, “was the man.”   He drank and the other gentlemen followed his lead.',
 'output': 'Dubliners'}

In [None]:
generate_prompt(trainDataset[0], base_text)

['<s> The task is to make a book classification from small passage of text.\nThe books are Telemachia, Odyssey, and Nostros from James Joyce\'s Ulysses, Dubliners by James Joyce, Dracula by Bram Stoker, and Republic by Plato.\n\n[INST]Read the Text, choose the correct classification from the list below insert the book name after the "Answer: " prompt. You will provide a single word response.\n\nTelemachia \nOdyssey\nNostros\nDubliners\nDracula\nRepublic\n\nText: “Is it John of Tuam?”   “Are you sure of that now?” asked Mr Fogarty dubiously. “I thought it was some Italian or American.”   “John of Tuam,” repeated Mr Cunningham, “was the man.”   He drank and the other gentlemen followed his lead.[/INST]\nAnswer: Dubliners</s>']

In [None]:
generate_prompt(trainDataset[0], base_text, return_response=False)

['The task is to make a book classification from small passage of text.\nThe books are Telemachia, Odyssey, and Nostros from James Joyce\'s Ulysses, Dubliners by James Joyce, Dracula by Bram Stoker, and Republic by Plato.\n\n[INST]Read the Text, choose the correct classification from the list below insert the book name after the "Answer: " prompt. You will provide a single word response.\n\nTelemachia \nOdyssey\nNostros\nDubliners\nDracula\nRepublic\n\nText: “Is it John of Tuam?”   “Are you sure of that now?” asked Mr Fogarty dubiously. “I thought it was some Italian or American.”   “John of Tuam,” repeated Mr Cunningham, “was the man.”   He drank and the other gentlemen followed his lead.[/INST]\nAnswer: ']
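
The two outputs above should differ only by the leading `<s> `, the label, and the closing `</s>`. That relationship can be checked programmatically, which guards against the training and inference prompts drifting apart if the template is edited later. The sketch below restates the prompt-building logic so that it runs on its own; the shortened `BASE_TEXT` and the helper `check_prompt_pair` are hypothetical stand-ins, not part of the notebook.

```python
# Shortened, hypothetical stand-in for the notebook's base_text.
BASE_TEXT = "<s> Classify the passage.\n\n[INST]Pick one label.\n\nText: "


def generate_prompt(example, base_text, return_response=True):
    """Same logic as the notebook's generate_prompt."""
    if not return_response:
        base_text = base_text[4:]  # drop the leading "<s> "
    full_prompt = base_text + f"{example['input']}[/INST]" + "\nAnswer: "
    if return_response:
        full_prompt += f"{example['output']}</s>"
    return [full_prompt]


def check_prompt_pair(example, base_text):
    """True if the training prompt equals '<s> ' + inference prompt + label + '</s>'."""
    train_prompt = generate_prompt(example, base_text)[0]
    infer_prompt = generate_prompt(example, base_text, return_response=False)[0]
    return train_prompt == "<s> " + infer_prompt + example["output"] + "</s>"


example = {
    "input": "He drank and the other gentlemen followed his lead.",
    "output": "Dubliners",
}
pair_ok = check_prompt_pair(example, BASE_TEXT)
print(pair_ok)
```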

In [None]:
path = "class/datasets/"
save_data(trainDataset, path + "Dataset_train")
save_data(devDataset, path + "Dataset_dev")
save_data(testDataset, path + "Dataset_test")

Documents saved to class/datasets/Dataset_train
Documents saved to class/datasets/Dataset_dev
Documents saved to class/datasets/Dataset_test
