<a href="https://colab.research.google.com/github/scorzo/train-that/blob/main/train_that.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 1: Setup Environment

In [None]:
!pip install transformers
!pip install torch


Step 2: Import Libraries

Next, import the required libraries:



In [None]:
import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import torch


Step 3: Load and Prepare the Data

Load your CSV file and prepare it for training:

In [None]:
# Load the dataset
df = pd.read_csv('sample_data/support_tickets_resolutions.csv')

# Combine the issue and resolution into a single text for each record
df['combined_text'] = df['Support Issue Description'] + " [Resolution] " + df['Support Resolution']

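# Save one example per line as plain text (TextDataset reads a raw text file).
# Writing the lines directly avoids the quotes that to_csv wraps around text containing commas.
with open('training_text.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(df['combined_text'].dropna()) + '\n')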

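To make the training format concrete, the sketch below builds a single training line in plain Python. The helper name `build_training_line` and the sample ticket are hypothetical and used only for illustration; the separator is the one used in the cell above.

```python
# Hypothetical helper: mirrors the string concatenation used for 'combined_text'.
SEPARATOR = " [Resolution] "


def build_training_line(issue: str, resolution: str) -> str:
    """Join an issue and its resolution into one training example."""
    return issue + SEPARATOR + resolution


# Made-up sample ticket, not taken from the real CSV.
line = build_training_line(
    "Cannot log in after password reset",
    "Cleared the cached session and re-sent the reset link",
)
print(line)
```

Each row of the CSV becomes one such line, so the model sees the issue first and learns to continue with the resolution after the marker.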

Step 4: Prepare the Model and Tokenizer

Load the GPT-2 model and tokenizer:



In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Additional step to add the special token [Resolution] to the tokenizer
special_tokens_dict = {'additional_special_tokens': ['[Resolution]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

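Why the resize call is needed can be shown with a toy example. The sketch below is plain Python and is not how transformers stores embeddings: the two-word vocabulary, the `resize_embeddings` helper, and the zero-filled new row are all assumptions made for illustration (the library initializes new rows differently).

```python
# Toy illustration only; the real model holds its embeddings in a tensor.
def resize_embeddings(table, new_size, dim):
    """Return a copy of `table` grown to `new_size` rows, new rows zero-filled."""
    extra = [[0.0] * dim for _ in range(new_size - len(table))]
    return [row[:] for row in table] + extra


dim = 2
vocab = ["hello", "world"]          # stand-in for the pretrained vocabulary
table = [[0.1, 0.2], [0.3, 0.4]]    # one embedding row per token id

vocab.append("[Resolution]")        # the new token takes the next free id
new_id = vocab.index("[Resolution]")
needs_resize = new_id >= len(table)  # no embedding row exists for the new id yet

table = resize_embeddings(table, len(vocab), dim)
print(new_id, needs_resize, len(table))
```

The same reasoning applies to GPT-2: the added token gets an id one past the end of the original embedding matrix, so the matrix has to grow before training.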

Step 5: Prepare the Dataset for Training

Create a dataset and data collator for training:

In [None]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="training_text.txt",
    block_size=128
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

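The effect of `block_size` can be sketched without the library. The function below is a simplified model of how TextDataset is understood to behave: it treats the file as one long token stream, cuts it into consecutive full blocks, and drops the short tail. The name `split_into_blocks` and the integer stand-ins for token ids are hypothetical.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a token sequence into consecutive full blocks; drop the short tail."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


token_ids = list(range(10))  # stand-in for real token ids
blocks = split_into_blocks(token_ids, block_size=4)
print(blocks)
```

One consequence of this scheme is that a block can start in the middle of one ticket and end in the middle of the next, and a file shorter than `block_size` tokens yields no training examples at all.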

Step 6: Training Arguments and Trainer

Set up the training arguments and the trainer:

In [None]:
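# Trainer depends on the accelerate package; restart the runtime after installing if it is still not found
!pip install accelerate -U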

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
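    save_steps=3,  # checkpoint every 3 steps for a quick demo; use a larger value such as 10_000 for real runs
    save_total_limit=2,  # keep only the two most recent checkpoints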
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

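A rough way to see what these settings imply is to work out the number of optimizer steps and checkpoints by hand. The sketch below assumes a single device, no gradient accumulation, and a hypothetical dataset of 50 blocks; `training_schedule` is an illustrative helper, not part of transformers.

```python
import math


def training_schedule(num_examples, batch_size, epochs, save_steps, save_total_limit):
    """Estimate total steps, checkpoint steps, and which checkpoints remain on disk."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    kept = checkpoints[-save_total_limit:]  # older checkpoints are deleted
    return total_steps, checkpoints, kept


total_steps, checkpoints, kept = training_schedule(
    num_examples=50,  # hypothetical number of training blocks
    batch_size=4,
    epochs=3,
    save_steps=3,
    save_total_limit=2,
)
print(total_steps, len(checkpoints), kept)
```

With a small `save_steps`, most of the run is spent writing checkpoints that `save_total_limit` then deletes, which is why a larger value is usually preferable outside of quick experiments.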

Step 7: Fine-Tune the Model

Run the training loop to fine-tune the model:

In [None]:
trainer.train()

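After training, it helps to write the final model and the tokenizer to one folder so they can be uploaded or reloaded together. The sketch below wraps the two calls involved, `Trainer.save_model` and the tokenizer's `save_pretrained`; the wrapper name `save_finetuned` is hypothetical, and the default directory is an assumption that reuses the `output_dir` from the training arguments.

```python
def save_finetuned(trainer, tokenizer, output_dir="./gpt2-finetuned"):
    """Write the model files and the tokenizer files to one directory."""
    trainer.save_model(output_dir)         # model weights and config.json
    tokenizer.save_pretrained(output_dir)  # vocabulary files, including added tokens
    return output_dir
```

In this notebook the call would be `save_finetuned(trainer, tokenizer)` once `trainer.train()` has finished. Saving the tokenizer matters here because it was extended with the `[Resolution]` token.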

Step 8: (Optional) Uploading the Model via Hugging Face Website


Create an Account and Repository:

Sign up or log in to Hugging Face.
Create a new model repository.

Prepare Your Model Locally:

Ensure your model is saved locally with all necessary files (such as config.json and the weights file, which is pytorch_model.bin or model.safetensors depending on your transformers version). Save the tokenizer files to the same folder, because the tokenizer now includes the added [Resolution] token.

Upload the Model:

Navigate to your new repository on the Hugging Face website.
Use the web interface to upload your model files directly. Typically, you can drag and drop files or use a file chooser.

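Before uploading, a quick check that the folder contains the expected files can save a failed load later. The sketch below is a minimal check, assuming the model was saved to `./gpt2-finetuned`; the helper `missing_model_files` is hypothetical, and the list of files covers only the config and weights, not the tokenizer files.

```python
from pathlib import Path

# The weights file name depends on the transformers version used for saving.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def missing_model_files(model_dir):
    """Return the names of expected files that are absent from model_dir."""
    folder = Path(model_dir)
    missing = []
    if not (folder / "config.json").is_file():
        missing.append("config.json")
    if not any((folder / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


print(missing_model_files("./gpt2-finetuned"))
```

An empty list means the config and a weights file are both present.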
Step 9: (Optional) Use Model for Inference

In [None]:
!pip install transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace 'your-username/your-model-name' with the path to your model
tokenizer = AutoTokenizer.from_pretrained('your-username/your-model-name')
model = AutoModelForCausalLM.from_pretrained('your-username/your-model-name')

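# Example inference: end the prompt with the [Resolution] marker used in training
input_ids = tokenizer.encode('Your input text here [Resolution]', return_tensors='pt')
# generate() defaults to a short total length, so set an explicit budget for new tokens
output = model.generate(input_ids, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))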
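
Because the model was trained on lines of the form "issue [Resolution] resolution", it can help to build prompts in that shape and to pull the resolution back out of the generated text. The sketch below does both with plain string handling. The helper names and the sample output are hypothetical, and the extraction assumes the text was decoded with `skip_special_tokens=False`, since the marker was registered as a special token and would otherwise be stripped.

```python
MARKER = "[Resolution]"
EOS = "<|endoftext|>"  # GPT-2's end-of-text token


def build_prompt(issue):
    """Format an issue the same way as the training lines, ending at the marker."""
    return issue + " " + MARKER


def extract_resolution(generated_text):
    """Return the text after the first marker, cut at the end-of-text token."""
    _, found, tail = generated_text.partition(MARKER)
    if not found:
        return ""
    return tail.split(EOS)[0].strip()


# Made-up model output, used only to exercise the helpers.
sample_output = "Printer offline [Resolution] Restarted the print spooler<|endoftext|>"
print(build_prompt("Printer offline"))
print(extract_resolution(sample_output))
```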
