<a href="https://colab.research.google.com/github/yanshiyou123/NLP/blob/main/HW5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

6320 NLP HW5

Author: Shiyou Yan

Use Hugging Face and PyTorch

In [None]:
!pip install accelerate -U
!pip install transformers[torch]


In [None]:
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load data
data = pd.read_csv('IT_Service.csv')
print(data.head())

# Split data into train and test
train_texts, test_texts, train_categories, test_categories = train_test_split(data['Document'], data['Topic_group'], test_size=0.2, random_state=42)

# Encode labels (categories) to numerical values
label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train_categories)
test_labels = label_encoder.transform(test_categories)

# Tokenize data
# Use the tokenizer of the same checkpoint as the model loaded below
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, return_tensors='pt')
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, return_tensors='pt')


# Define the dataset
class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[i])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TextDataset(train_encodings, train_labels)
test_dataset = TextDataset(test_encodings, test_labels)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8)

# Compute the accuracy (in percent) of a model over a data loader
def evaluate(model, test_loader):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()

    correct = 0
    total = 0
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Only the logits are needed here, so labels are not passed to the model
            outputs = model(input_ids, attention_mask=attention_mask)
            predicted = outputs.logits.argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    # Compute the accuracy once, after all batches have been processed
    return 100 * correct / total if total else 0.0

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=len(label_encoder.classes_))
pretrain_accuracy = evaluate(model, test_loader)
print(f'Pre-trained Model Accuracy: {pretrain_accuracy:.2f}%')


# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="no",
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    # compute_metrics must return a dict mapping metric names to values
    compute_metrics=lambda eval_pred: {'accuracy': accuracy_score(eval_pred.label_ids, eval_pred.predictions.argmax(-1))},
)

# Fine-tune the model
trainer.train()
fine_tune_accuracy = evaluate(model, test_loader)
print(f'Fine-tuned Model Accuracy: {fine_tune_accuracy:.2f}%')



                                            Document    Topic_group
0  connection with icon icon dear please setup ic...       Hardware
1  work experience user work experience user hi w...         Access
2  requesting for meeting requesting meeting hi p...       Hardware
3  reset passwords for external accounts re expir...         Access


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Pre-trained Model Accuracy: 14.60%



Step,Training Loss
10,2.063
20,2.0576
30,2.0472
40,2.049
50,1.9937
60,1.9344
70,1.9038
80,1.8999
90,1.8435
100,1.8693


Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1000 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1500 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-2000 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-2500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


Fine-tuned Model Accuracy: 87.26%


Analysis:
The "IT Service Ticket Classification Dataset" comprises 47,837 entries with two features: "Document," the ticket text, and "Topic_group," one of eight categories: 'Hardware,' 'HR Support,' 'Access,' 'Miscellaneous,' 'Storage,' 'Purchase,' 'Internal Project,' and 'Administrative rights.'
First, I split the dataset into training (80%) and test (20%) subsets and encoded each category as an integer label from 0 to 7. I then tokenized the ticket texts with the AutoTokenizer from the Hugging Face transformers library, wrapped the encodings in a PyTorch dataset, and created the corresponding data loaders.
Before fine-tuning, the model reached an accuracy of only 14.6%. This is expected: the classification head on top of the pretrained DistilBERT encoder is newly initialized, as the warning in the output notes, so its predictions are close to random guessing over eight classes. I therefore fine-tuned the model using TrainingArguments and the Trainer and evaluated it again. A single epoch of fine-tuning raised the accuracy to 87.26%, which shows how effective fine-tuning is for adapting a pretrained model to this task.
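To put the two accuracy figures in context, they can be compared with a uniform random-guess baseline. This is a small arithmetic sketch that assumes balanced classes; the actual class distribution is not shown in the notebook, so a majority-class baseline could be higher.

```python
# Compare the reported accuracies with a uniform random-guess baseline.
# Assumption: all eight classes are equally likely (not verified here).

NUM_CLASSES = 8
uniform_chance = 100 / NUM_CLASSES  # accuracy of guessing uniformly, in percent

pretrain_accuracy = 14.60    # reported before fine-tuning
fine_tune_accuracy = 87.26   # reported after one epoch of fine-tuning

margin_before = pretrain_accuracy - uniform_chance
gain_from_fine_tuning = fine_tune_accuracy - pretrain_accuracy

print(f"Uniform chance level:       {uniform_chance:.2f}%")
print(f"Pre-training margin:        {margin_before:.2f} points above chance")
print(f"Gain from fine-tuning:      {gain_from_fine_tuning:.2f} points")
```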
Given the large volume of data, I ran only one epoch of fine-tuning. Running three epochs would likely improve accuracy further, at roughly three times the training cost.
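The cost of extra epochs can be estimated from the settings used above. The sketch below assumes a single device, no gradient accumulation, a kept final partial batch, and that the test split is rounded up to a whole number of rows; these are assumptions about the run, not values read from the logs.

```python
import math

# Rough estimate of optimizer steps for one versus three epochs.
# Assumptions: one device, no gradient accumulation, last partial batch kept,
# and the 20% test split rounded up to a whole number of rows.

n_rows = 47_837        # dataset size stated in the analysis
test_size = 0.2        # test fraction used in the split
batch_size = 8         # per-device training batch size
warmup_steps = 500     # warmup setting from the training arguments

n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)

total_steps = {epochs: epochs * steps_per_epoch for epochs in (1, 3)}

for epochs, steps in total_steps.items():
    warmup_share = 100 * warmup_steps / steps
    print(f"{epochs} epoch(s): {steps} steps, warmup is {warmup_share:.1f}% of training")
```

Under these assumptions, one epoch is already several thousand optimizer steps, so the fixed 500-step warmup takes up a smaller share of training as the number of epochs grows.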