<a href="https://colab.research.google.com/github/ryyhan/fineTunedLLM/blob/main/FineTuningConversations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setting up the libraries

In [1]:
!pip install transformers datasets evaluate torch

## Loading up the dataset

In [26]:
from datasets import load_dataset
from transformers import GPT2Tokenizer

In [27]:
import pandas as pd
import numpy as np

df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/clustered_df4.csv")

In [28]:
df_0 = df.loc[df['cluster'] == 0]
df_1 = df.loc[df['cluster'] == 1]
df_2 = df.loc[df['cluster'] == 2]
df_3 = df.loc[df['cluster'] == 3]

In [29]:
# Reassign the result instead of using inplace=True on a filtered slice,
# which is what triggers pandas' SettingWithCopyWarning.
df_0 = df_0.drop(columns=['Unnamed: 0','cluster'])
df_1 = df_1.drop(columns=['Unnamed: 0','cluster'])
df_2 = df_2.drop(columns=['Unnamed: 0','cluster'])
df_3 = df_3.drop(columns=['Unnamed: 0','cluster'])

In [30]:
df_0.columns=['conversations']
df_1.columns=['conversations']
df_2.columns=['conversations']
df_3.columns=['conversations']

In [31]:
print(f"No. of rows in Cluster 0 are {df_0.shape[0]}")
print(f"No. of rows in Cluster 1 are {df_1.shape[0]}")
print(f"No. of rows in Cluster 2 are {df_2.shape[0]}")
print(f"No. of rows in Cluster 3 are {df_3.shape[0]}")

No. of rows in Cluster 0 are 121
No. of rows in Cluster 1 are 752
No. of rows in Cluster 2 are 57
No. of rows in Cluster 3 are 70


## Fine-tuning for the first cluster

In [32]:
from datasets import Dataset

convos_dataset = Dataset.from_pandas(df_0)

print(convos_dataset)

Dataset({
    features: ['conversations', '__index_level_0__'],
    num_rows: 121
})


In [33]:
from transformers import GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Add a padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['conversations'], truncation=True, padding='max_length', max_length=512)

tokenized_dataset = convos_dataset.map(tokenize_function, batched=True)

print(tokenized_dataset)

Dataset({
    features: ['conversations', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 121
})
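
To make the `truncation=True` and `padding='max_length'` arguments concrete, here is a minimal pure-Python sketch of what they do to one sequence that has already been converted to token ids. The helper name `pad_or_truncate` is hypothetical and this is not the tokenizer's actual implementation. It assumes right-side padding (the GPT-2 tokenizer's default) and uses 50256, GPT-2's end-of-text id, as the pad id because the cell above sets `pad_token = eos_token`. The input ids in the example are arbitrary.

```python
GPT2_EOS_ID = 50256  # GPT-2's <|endoftext|> id, reused as the pad id here


def pad_or_truncate(token_ids, max_length=512, pad_id=GPT2_EOS_ID):
    """Hypothetical helper: mimic truncation plus right-padding to max_length."""
    kept = list(token_ids[:max_length])             # truncation=True
    n_pad = max_length - len(kept)                  # padding='max_length'
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padded positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}


example = pad_or_truncate([15496, 11, 995], max_length=6)
print(example["input_ids"])       # [15496, 11, 995, 50256, 50256, 50256]
print(example["attention_mask"])  # [1, 1, 1, 0, 0, 0]
```

With `max_length=512`, a short conversation therefore consists mostly of pad ids, which matters when those ids are later copied into the labels.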


In [34]:
# Set the labels for the model

def format_dataset(example):
    example['labels'] = example['input_ids'].copy()
    return example

tokenized_dataset = tokenized_dataset.map(format_dataset, batched=True)
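
Because the pad token is the end-of-text token, copying `input_ids` into `labels` means the padded positions also count toward the loss. A common alternative, sketched below, sets the label to -100 wherever `attention_mask` is 0; the cross-entropy loss used by the `transformers` language-model heads ignores that value. The function name `mask_padding_labels` is hypothetical, and this is not what the run in this notebook did.

```python
IGNORE_INDEX = -100  # label value skipped by the loss in transformers LM heads


def mask_padding_labels(batch):
    """Hypothetical batched map function: copy input_ids to labels, masking padding."""
    batch["labels"] = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch


batch = {
    "input_ids": [[10, 11, 50256, 50256], [20, 21, 22, 50256]],
    "attention_mask": [[1, 1, 0, 0], [1, 1, 1, 0]],
}
print(mask_padding_labels(batch)["labels"])
# [[10, 11, -100, -100], [20, 21, 22, -100]]
```

It could be passed to `tokenized_dataset.map(mask_padding_labels, batched=True)` in place of `format_dataset`. Losses obtained that way would not be comparable with the ones reported in this notebook.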

In [35]:
print(type(tokenized_dataset))

<class 'datasets.arrow_dataset.Dataset'>


In [36]:
# Split the dataset into train and test sets
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)

# Access the train and test sets from the split_dataset
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

print(len(train_dataset))
print(len(test_dataset))

96
25
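
The 96/25 split is not exactly 80/20 because 20% of 121 rows is not a whole number. The printed sizes are consistent with the test share being rounded up, which the short sketch below reproduces. The helper `split_sizes` is hypothetical and only mirrors the arithmetic; it is not the library's code.

```python
import math


def split_sizes(n_rows, test_size=0.2):
    """Return (n_train, n_test), rounding the test share up to a whole row."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


for n_rows in (121, 752, 57, 70):  # cluster sizes printed earlier
    print(n_rows, split_sizes(n_rows))
# 121 rows -> (96, 25), matching the lengths printed above
```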


In [37]:
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

# Load the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
)



In [38]:
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
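    # Note: this evaluates on the training split, so the logged "Validation Loss"
    # is not a held-out metric; pass test_dataset to measure generalization.
    eval_dataset=train_dataset,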
)

In [39]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.197114
2,No log,0.173954
3,No log,0.166642


TrainOutput(global_step=72, training_loss=0.4924347135755751, metrics={'train_runtime': 70.5834, 'train_samples_per_second': 4.08, 'train_steps_per_second': 1.02, 'total_flos': 75252105216000.0, 'train_loss': 0.4924347135755751, 'epoch': 3.0})
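
The `global_step` value and the "No log" entries both follow from simple arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is set in the training arguments above); `total_optimizer_steps` is a hypothetical helper name.

```python
import math


def total_optimizer_steps(n_train, batch_size, epochs):
    """Steps per epoch (a final partial batch still counts) times the epochs."""
    return math.ceil(n_train / batch_size) * epochs


steps = total_optimizer_steps(n_train=96, batch_size=4, epochs=3)
print(steps)  # 72, matching global_step in the TrainOutput above

DEFAULT_LOGGING_STEPS = 500  # TrainingArguments default for logging_steps
print(steps < DEFAULT_LOGGING_STEPS)  # True: the run ends before the first log
```

Since the training loss is only logged every `logging_steps` steps, a run of 72 steps never reaches the first logging point, hence "No log". Passing a smaller value such as `logging_steps=10` to `TrainingArguments` should populate that column.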

In [40]:
# Save the final model (no best-checkpoint selection is configured) and the tokenizer

trainer.save_model('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(0)/model')

tokenizer.save_pretrained("/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(0)/tokenizer")

('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(0)/tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(0)/tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(0)/tokenizer/vocab.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(0)/tokenizer/merges.txt',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(0)/tokenizer/added_tokens.json')

In [41]:
from transformers import pipeline

# Load the fine-tuned model
model = GPT2LMHeadModel.from_pretrained('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(0)/model')
tokenizer = GPT2Tokenizer.from_pretrained('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(0)/tokenizer')
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Generate text for evaluation
generated_texts = []
for i in range(len(test_dataset)):
    # Extract the generated text from the dictionary and append it to the list
    generated_texts.append(text_generator(test_dataset[i]['conversations'], max_length=512)[0]['generated_text'])

# Calculate BLEU score
from evaluate import load
bleu = load('bleu')
results = bleu.compute(predictions=generated_texts, references=test_dataset['conversations'])
print('BLEU score:', results['bleu'])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


BLEU score: 0.6784225010215791
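
This score should be read with care. By default the text-generation pipeline returns the prompt followed by its continuation, and the prompt here is the full reference conversation, so each prediction begins with its own reference, which likely inflates BLEU. One possible alternative, sketched below under stated assumptions, is to prompt with only the first part of each conversation and score the continuation against the held-out remainder. Both helper names are hypothetical, the whitespace split is a simplification, and the 50% cut is an arbitrary choice.

```python
def split_prompt_and_reference(text, prompt_fraction=0.5):
    """Hypothetical helper: split a conversation into a prompt and a held-out remainder."""
    words = text.split()
    cut = max(1, int(len(words) * prompt_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def strip_prompt(generated_text, prompt):
    """Drop the echoed prompt so only the model's continuation is scored."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


prompt, reference = split_prompt_and_reference("hi there how are you today")
print(prompt)     # hi there how
print(reference)  # are you today
print(strip_prompt("hi there how is it going", prompt))  # is it going
```

The stripped continuations would go in `predictions` and the held-out remainders in `references`. Passing `return_full_text=False` to the pipeline call is another way to drop the echoed prompt.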


## Fine-tuning for the second cluster

In [42]:
from datasets import Dataset

convos_dataset = Dataset.from_pandas(df_1)

print(convos_dataset)

Dataset({
    features: ['conversations', '__index_level_0__'],
    num_rows: 752
})


In [43]:
from transformers import GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Add a padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['conversations'], truncation=True, padding='max_length', max_length=512)

tokenized_dataset = convos_dataset.map(tokenize_function, batched=True)

print(tokenized_dataset)

Dataset({
    features: ['conversations', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 752
})


In [44]:
# Set the labels for the model

def format_dataset(example):
    example['labels'] = example['input_ids'].copy()
    return example

tokenized_dataset = tokenized_dataset.map(format_dataset, batched=True)

In [45]:
# Split the dataset into train and test sets
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)

# Access the train and test sets from the split_dataset
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

print(len(train_dataset))
print(len(test_dataset))

601
151


In [46]:
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

# Load the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
)



In [47]:
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
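    # Note: this evaluates on the training split, so the logged "Validation Loss"
    # is not a held-out metric; pass test_dataset to measure generalization.
    eval_dataset=train_dataset,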
)

In [48]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.146369
2,No log,0.12912
3,No log,0.122527


TrainOutput(global_step=453, training_loss=0.20546567466348475, metrics={'train_runtime': 364.9647, 'train_samples_per_second': 4.94, 'train_steps_per_second': 1.241, 'total_flos': 471109533696000.0, 'train_loss': 0.20546567466348475, 'epoch': 3.0})
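
The evaluation losses can be converted to perplexity, which is the exponential of the mean per-token cross-entropy. The sketch below applies that to the three values logged above; `perplexity` is a hypothetical helper. Two caveats from this notebook's setup apply: the labels include padded positions, and evaluation runs on the training split, so these figures are probably optimistic.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


for epoch, loss in enumerate((0.146369, 0.12912, 0.122527), start=1):
    print(f"epoch {epoch}: loss={loss:.6f} perplexity={perplexity(loss):.3f}")
```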

In [49]:
# Save the final model (no best-checkpoint selection is configured) and the tokenizer

trainer.save_model('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(1)/model')

tokenizer.save_pretrained("/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(1)/tokenizer")

('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(1)/tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(1)/tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(1)/tokenizer/vocab.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(1)/tokenizer/merges.txt',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(1)/tokenizer/added_tokens.json')

In [50]:
from transformers import pipeline

# Load the fine-tuned model
model = GPT2LMHeadModel.from_pretrained('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(1)/model')
tokenizer = GPT2Tokenizer.from_pretrained('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(1)/tokenizer')
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Generate text for evaluation
generated_texts = []
for i in range(len(test_dataset)):
    # Extract the generated text from the dictionary and append it to the list
    generated_texts.append(text_generator(test_dataset[i]['conversations'], max_length=512)[0]['generated_text'])

# Calculate BLEU score
from evaluate import load
bleu = load('bleu')
results = bleu.compute(predictions=generated_texts, references=test_dataset['conversations'])
print('BLEU score:', results['bleu'])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


BLEU score: 0.7855410763592737


## Fine-tuning for the third cluster

In [51]:
from datasets import Dataset

convos_dataset = Dataset.from_pandas(df_2)

print(convos_dataset)

Dataset({
    features: ['conversations', '__index_level_0__'],
    num_rows: 57
})


In [52]:
from transformers import GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Add a padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['conversations'], truncation=True, padding='max_length', max_length=512)

tokenized_dataset = convos_dataset.map(tokenize_function, batched=True)

print(tokenized_dataset)

Dataset({
    features: ['conversations', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 57
})


In [53]:
# Set the labels for the model

def format_dataset(example):
    example['labels'] = example['input_ids'].copy()
    return example

tokenized_dataset = tokenized_dataset.map(format_dataset, batched=True)

In [54]:
# Split the dataset into train and test sets
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)

# Access the train and test sets from the split_dataset
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

print(len(train_dataset))
print(len(test_dataset))

45
12


In [55]:
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

# Load the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
)



In [56]:
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
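    # Note: this evaluates on the training split, so the logged "Validation Loss"
    # is not a held-out metric; pass test_dataset to measure generalization.
    eval_dataset=train_dataset,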
)

In [57]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.183914
2,No log,0.150012
3,No log,0.139363


TrainOutput(global_step=36, training_loss=0.7822623782687717, metrics={'train_runtime': 37.3316, 'train_samples_per_second': 3.616, 'train_steps_per_second': 0.964, 'total_flos': 35274424320000.0, 'train_loss': 0.7822623782687717, 'epoch': 3.0})

In [58]:
# Save the final model (no best-checkpoint selection is configured) and the tokenizer

trainer.save_model('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(2)/model')

tokenizer.save_pretrained("/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(2)/tokenizer")

('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(2)/tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(2)/tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(2)/tokenizer/vocab.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(2)/tokenizer/merges.txt',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(2)/tokenizer/added_tokens.json')

In [59]:
from transformers import pipeline

# Load the fine-tuned model
model = GPT2LMHeadModel.from_pretrained('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(2)/model')
tokenizer = GPT2Tokenizer.from_pretrained('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(2)/tokenizer')
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Generate text for evaluation
generated_texts = []
for i in range(len(test_dataset)):
    # Extract the generated text from the dictionary and append it to the list
    generated_texts.append(text_generator(test_dataset[i]['conversations'], max_length=512)[0]['generated_text'])

# Calculate BLEU score
from evaluate import load
bleu = load('bleu')
results = bleu.compute(predictions=generated_texts, references=test_dataset['conversations'])
print('BLEU score:', results['bleu'])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


BLEU score: 0.6038105204230777


## Fine-tuning for the fourth cluster

In [60]:
from datasets import Dataset

convos_dataset = Dataset.from_pandas(df_3)

print(convos_dataset)

Dataset({
    features: ['conversations', '__index_level_0__'],
    num_rows: 70
})


In [61]:
from transformers import GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Add a padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['conversations'], truncation=True, padding='max_length', max_length=512)

tokenized_dataset = convos_dataset.map(tokenize_function, batched=True)

print(tokenized_dataset)

Dataset({
    features: ['conversations', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 70
})


In [62]:
# Set the labels for the model

def format_dataset(example):
    example['labels'] = example['input_ids'].copy()
    return example

tokenized_dataset = tokenized_dataset.map(format_dataset, batched=True)

In [63]:
# Split the dataset into train and test sets
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)

# Access the train and test sets from the split_dataset
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

print(len(train_dataset))
print(len(test_dataset))

56
14


In [64]:
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

# Load the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
)



In [65]:
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
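    # Note: this evaluates on the training split, so the logged "Validation Loss"
    # is not a held-out metric; pass test_dataset to measure generalization.
    eval_dataset=train_dataset,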
)

In [66]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.206938
2,No log,0.179009
3,No log,0.171559


TrainOutput(global_step=42, training_loss=0.6778829211280459, metrics={'train_runtime': 44.1667, 'train_samples_per_second': 3.804, 'train_steps_per_second': 0.951, 'total_flos': 43897061376000.0, 'train_loss': 0.6778829211280459, 'epoch': 3.0})

In [67]:
# Save the final model (no best-checkpoint selection is configured) and the tokenizer

trainer.save_model('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(3)/model')

tokenizer.save_pretrained("/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(3)/tokenizer")

('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(3)/tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(3)/tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(3)/tokenizer/vocab.json',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(3)/tokenizer/merges.txt',
 '/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(3)/tokenizer/added_tokens.json')

In [68]:
from transformers import pipeline

# Load the fine-tuned model
model = GPT2LMHeadModel.from_pretrained('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(3)/model')
tokenizer = GPT2Tokenizer.from_pretrained('/content/drive/MyDrive/Colab Notebooks/ClusteredDatasets/4Clusters(3)/tokenizer')
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Generate text for evaluation
generated_texts = []
for i in range(len(test_dataset)):
    # Extract the generated text from the dictionary and append it to the list
    generated_texts.append(text_generator(test_dataset[i]['conversations'], max_length=512)[0]['generated_text'])

# Calculate BLEU score
from evaluate import load
bleu = load('bleu')
results = bleu.compute(predictions=generated_texts, references=test_dataset['conversations'])
print('BLEU score:', results['bleu'])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


BLEU score: 0.6450277443055102
