**Intent recognition**

`distilbert-base-uncased` is a distilled version of the BERT base model.

BERT is a transformer model pretrained on a large corpus of English data in a self-supervised fashion. It was pretrained with two objectives:

*   Masked language modeling (MLM)
*   Next sentence prediction (NSP)

DistilBERT was pretrained with three objectives:

*   Distillation loss: the model was trained to return the same probabilities as the BERT base model.
*   Masked language modeling (MLM): this is part of the original training loss of the BERT base model.
*   Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.
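
To make the three objectives concrete, here is a minimal pure-Python sketch of how such losses could be computed for a single token and combined. It is an illustration under assumptions, not DistilBERT's actual training code: the function names, the temperature of 2.0, and the equal weights are all hypothetical, and the distillation term is written as a KL divergence between temperature-softened distributions, which is one common formulation.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with an optional temperature
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions (assumed formulation)
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mlm_loss(student_logits, target_index):
    # Cross-entropy of the student's prediction against the true masked token
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(student_hidden, teacher_hidden):
    # 1 - cosine similarity: zero when the hidden states point the same way
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                weights=(1.0, 1.0, 1.0), temperature=2.0):
    # Weighted sum of the three terms; the weights are hypothetical
    w_distill, w_mlm, w_cos = weights
    return (w_distill * distillation_loss(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))

print(triple_loss([2.0, 0.5, 0.1], [2.5, 0.3, 0.2], 0, [0.9, 0.1], [1.0, 0.0]))
```

When the student matches the teacher exactly, the distillation and cosine terms vanish and only the MLM term remains.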







In [None]:
from datasets import load_dataset,DatasetDict
import kagglehub

# Download latest version
path = kagglehub.dataset_download("parthpatil256/it-support-ticket-data")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'it-support-ticket-data' dataset.
Path to dataset files: /kaggle/input/it-support-ticket-data


This dataset provides a collection of real-world IT support ticket data.

Each row in the dataset represents a single IT support ticket, with the following key attributes:
- **Body**: The verbatim, free-form text of the customer's support request, issue description, or question.
- **Department**: The department or team responsible for handling and resolving the ticket. This serves as the primary high-level classification of the issue.
- **Priority**: The urgency or criticality level assigned to the ticket.
- **Tags**: A list of keywords or labels that give more granular detail about the ticket's nature, specific topic, affected components, or sub-categories.


In [None]:
raw_dataset=load_dataset(path=path)
raw_dataset

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'Body', 'Department', 'Priority', 'Tags'],
        num_rows: 29651
    })
})

In [None]:
# Split the dataset
split_dataset = raw_dataset['train'].train_test_split(
    test_size=0.1,
    seed=42
)

# Rename "test" → "validation"
final_dataset = DatasetDict({
    "train": split_dataset["train"],
    "validation": split_dataset["test"]
})


In [None]:
final_dataset["train"] = final_dataset["train"].filter(lambda x: len(x["Tags"]) > 0)
final_dataset["validation"] = final_dataset["validation"].filter(lambda x: len(x["Tags"]) > 0)


Filter:   0%|          | 0/26685 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2966 [00:00<?, ? examples/s]

In [None]:
print(final_dataset["train"]["Tags"][:5])


["['Feature', 'Documentation', 'Feedback', 'Tech Support']", "['Returns and Exchanges', 'Technical Support', 'Product Support', 'Problem Resolution']", "['Feature', 'Product', 'Documentation', 'Tech Support']", "['Feedback', 'Sales', 'Product', 'Feature']", "['Feature', 'Feedback', 'IT', 'Tech Support']"]
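
Note that each printed value is a string that looks like a list, not a list. Iterating over such a string yields single characters, so the tags need to be parsed before they can be used as labels. A minimal sketch using the standard library follows; `parse_tags` is a hypothetical helper name, and the fallback behaviour for malformed values is an assumption about what is sensible here.

```python
import ast

def parse_tags(raw):
    """Turn a stringified list such as "['Feature', 'IT']" into a list of tag strings."""
    if isinstance(raw, (list, tuple)):
        return [str(tag) for tag in raw]
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        parsed = ast.literal_eval(raw)  # safe evaluation of Python literals only
    except (ValueError, SyntaxError):
        return [raw.strip()]  # assumption: treat unparseable text as a single tag
    if isinstance(parsed, (list, tuple)):
        return [str(tag) for tag in parsed]
    return [str(parsed)]

sample = "['Feature', 'Documentation', 'Feedback', 'Tech Support']"
print(parse_tags(sample))
```

`ast.literal_eval` is used rather than `eval` because it only accepts literals and will not execute arbitrary code found in the data.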


In [None]:
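import ast
from sklearn.preprocessing import MultiLabelBinarizer

def parse_tags(raw):
    # Tags are stored as stringified lists (e.g. "['Feature', 'IT']");
    # parse them so the binarizer sees whole tags, not single characters.
    return ast.literal_eval(raw) if isinstance(raw, str) else list(raw)

# Convert the Tags column of each split to a list of tag lists
train_tags = [parse_tags(t) for t in final_dataset["train"]["Tags"]]
val_tags = [parse_tags(t) for t in final_dataset["validation"]["Tags"]] if "validation" in final_dataset else []
test_tags = [parse_tags(t) for t in final_dataset["test"]["Tags"]] if "test" in final_dataset else []

all_tags = train_tags + val_tags + test_tags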

mlb = MultiLabelBinarizer()
mlb.fit(all_tags)
tag_classes = mlb.classes_
num_tags = len(tag_classes)
print("✅ Number of tags:", num_tags)


✅ Number of tags: 1650


**Label Preparation**

In [None]:
# The MultiLabelBinarizer was already fitted on the train and validation tags
# in the previous cell, so reuse it here rather than fitting it a second time.
tag_classes = mlb.classes_

print("✅ Classes after merging splits:", tag_classes)


✅ Classes after merging splits: ['2019' 'AES' 'AI' ... 'Zoom' 'iOS' 'macOS']
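
For intuition, the label encoding amounts to a sorted vocabulary of tags plus one 0/1 slot per tag. The sketch below reproduces that idea in plain Python on a toy example; `fit_classes` and `multi_hot` are hypothetical names for illustration, not part of scikit-learn.

```python
def fit_classes(tag_lists):
    # Sorted vocabulary of every tag seen across all examples
    return sorted({tag for tags in tag_lists for tag in tags})

def multi_hot(tags, classes):
    # One slot per class: 1.0 if the example carries that tag, else 0.0
    index = {name: i for i, name in enumerate(classes)}
    vector = [0.0] * len(classes)
    for tag in tags:
        if tag in index:  # tags outside the vocabulary are ignored
            vector[index[tag]] = 1.0
    return vector

toy_tags = [["IT", "Bug"], ["Feature", "IT"]]
classes = fit_classes(toy_tags)
print(classes)
print(multi_hot(["IT", "Bug"], classes))
```

With 1650 classes, each ticket's label vector has 1650 entries and only a handful of them are 1.0.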


In [None]:
final_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'Body', 'Department', 'Priority', 'Tags'],
        num_rows: 26685
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'Body', 'Department', 'Priority', 'Tags'],
        num_rows: 2966
    })
})

**Tokenization**

In [None]:
from transformers import AutoTokenizer, DataCollatorWithPadding
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

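import ast

def tokenize_with_labels(example):
    text = example["Body"]
    # Substitute a placeholder for missing or blank ticket bodies
    if not isinstance(text, str) or text.strip() == "":
        text = "[EMPTY]"

    tokenized = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=256
    )

    # Tags may be stored as a stringified list (e.g. "['Feature', 'IT']"), so parse first
    tags = example["Tags"]
    if isinstance(tags, str):
        tags = ast.literal_eval(tags)

    # Convert tags to a multi-hot vector with mlb
    tokenized["labels"] = mlb.transform([tags])[0].astype(float).tolist()
    return tokenized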

tokenized_datasets = final_dataset.map(
    tokenize_with_labels,
    batched=False,
    remove_columns=["Body", "Tags", "Priority", "Department", "Unnamed: 0"]
)
tokenized_datasets.set_format("torch")

# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/26685 [00:00<?, ? examples/s]

Map:   0%|          | 0/2966 [00:00<?, ? examples/s]

In [None]:
display(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 26685
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2966
    })
})

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

In [None]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 256]),
 'attention_mask': torch.Size([8, 256]),
 'labels': torch.Size([8, 1650])}
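
Every batch is 256 tokens wide because the tokenizer call pads to `max_length` and truncates longer inputs, which leaves nothing for `DataCollatorWithPadding` to pad. The sketch below shows the idea on a plain list of token ids. It is a simplification: `pad_or_truncate` is a hypothetical helper, the pad id of 0 is assumed, and unlike the real tokenizer it does not keep the closing special token when truncating.

```python
def pad_or_truncate(token_ids, max_length=256, pad_id=0):
    # Cut sequences that are too long, then right-pad the rest to max_length
    ids = list(token_ids[:max_length])
    attention_mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding  # 0 marks padding

ids, mask = pad_or_truncate([101, 7592, 102], max_length=5)
print(ids)
print(mask)
```

Padding only to the longest sequence in each batch would usually be cheaper for short tickets, at the cost of batches with varying widths.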

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=num_tags,
    problem_type="multi_label_classification"
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

10008


In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # Cast labels to float32 as BCEWithLogitsLoss expects float targets
        batch["labels"] = batch["labels"].to(torch.float32)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/10008 [00:00<?, ?it/s]
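
The loss referred to in the loop's comment, binary cross-entropy with logits, treats each of the 1650 tags as an independent yes/no decision. Below is a minimal pure-Python sketch of that loss for one example, written in the numerically stable form; `bce_with_logits` is a hypothetical name and the mean over labels is assumed as the reduction.

```python
import math

def bce_with_logits(logits, targets):
    # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))], averaged over labels
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# A confident correct logit costs almost nothing, a confident wrong one costs a lot
print(bce_with_logits([4.0, -4.0], [1.0, 0.0]))
print(bce_with_logits([4.0, -4.0], [0.0, 1.0]))
```

Because most of the 1650 targets are 0 for any given ticket, the model can reach a low average loss by predicting "absent" for rare tags, which is worth keeping in mind when reading the evaluation metrics.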

In [None]:
def evaluate_tag_model(threshold=0.5):
    model.eval()
    all_predictions = []
    all_true_labels = []
    all_probabilities = []

    eval_split = "validation" if "validation" in tokenized_datasets else "train"
    print(f"Evaluating on: {eval_split}")

    eval_dataloader = DataLoader(
        tokenized_datasets[eval_split],
        batch_size=16,
        collate_fn=data_collator
    )

    with torch.no_grad():
        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            batch = {k: v.to(device) for k, v in batch.items()}
            # Cast labels to float32 as BCEWithLogitsLoss expects float targets
            batch["labels"] = batch["labels"].to(torch.float32)
            outputs = model(**batch)

            probs = torch.sigmoid(outputs.logits)
            preds = (probs > threshold).int()

            all_probabilities.extend(probs.cpu().numpy())
            all_predictions.extend(preds.cpu().numpy())
            all_true_labels.extend(batch["labels"].cpu().numpy())

    from sklearn.metrics import f1_score

    micro_f1 = f1_score(all_true_labels, all_predictions, average="micro")
    macro_f1 = f1_score(all_true_labels, all_predictions, average="macro")

    print(f"📊 Micro F1: {micro_f1:.4f}")
    print(f"📊 Macro F1: {macro_f1:.4f}")

    # show a few samples
    for i in range(5):
        true_tags = [tag_classes[j] for j, v in enumerate(all_true_labels[i]) if v == 1]
        pred_tags = [tag_classes[j] for j, v in enumerate(all_predictions[i]) if v == 1]
        print(f"Sample {i+1}: True={true_tags}, Pred={pred_tags}")

    return {
        "micro_f1": micro_f1,
        "macro_f1": macro_f1,
        "predictions": all_predictions,
        "true_labels": all_true_labels,
        "probabilities": all_probabilities
    }

In [None]:
import json
import numpy as np

evaluation_results = evaluate_tag_model()

# Convert numpy arrays in evaluation_results to lists for JSON serialization
for key in ['predictions', 'true_labels', 'probabilities']:
    if key in evaluation_results:
        evaluation_results[key] = [arr.tolist() if isinstance(arr, np.ndarray) else arr for arr in evaluation_results[key]]



Evaluating on: validation


Evaluating:   0%|          | 0/186 [00:00<?, ?it/s]

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


📊 Micro F1: 0.7135
📊 Macro F1: 0.0190
Sample 1: True=['Documentation', 'Feature', 'Feedback', 'IT', 'Tech Support'], Pred=['Documentation', 'Feature', 'Feedback', 'IT', 'Tech Support']
Sample 2: True=['Bug', 'Hardware', 'IT', 'Performance', 'Tech Support'], Pred=['Bug', 'Hardware', 'IT', 'Performance', 'Tech Support']
Sample 3: True=['Data', 'Encryption', 'Guidance', 'Hospital', 'Patient', 'Security', 'SecurityMeasure', 'Systems'], Pred=['IT', 'Security']
Sample 4: True=['Feature', 'Feedback', 'Marketing', 'Performance', 'Product'], Pred=['Feedback', 'Performance', 'Tech Support']
Sample 5: True=['Bug', 'IT', 'Performance', 'Tech Support'], Pred=['Bug', 'IT', 'Performance', 'Tech Support']
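
The large gap between micro F1 (0.7135) and macro F1 (0.0190) is likely a consequence of the label distribution: micro F1 pools all decisions and is dominated by frequent tags, while macro F1 averages per-tag scores, so every rare tag that is never predicted contributes a zero. The toy example below illustrates the effect in plain Python; the data is invented for illustration and the helper names are hypothetical.

```python
def f1(tp, fp, fn):
    # F1 from counts; defined as 0.0 when there is nothing to score
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    num_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(num_labels)]  # [tp, fp, fn] per label
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    macro = sum(f1(*c) for c in counts) / num_labels
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1(tp, fp, fn), macro

# Label 0 is frequent and always predicted; labels 1 and 2 are rare and never predicted
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Sample 3 above shows the same pattern: the common tags are recovered, while the specific ones such as 'Hospital' and 'Encryption' are missed.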


In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/it_support/')

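# Create a directory for your model.
# Note: Drive is mounted at /content/it_support/ above, so unless Drive is also
# mounted at /content/drive, this path is on the Colab VM's local disk and will
# not persist. To save into Drive, point it under /content/it_support/MyDrive/.
model_dir = "/content/drive/MyDrive/it_support_priority_classifier"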

# Create the directory if it doesn't exist
os.makedirs(model_dir, exist_ok=True)

print(f"📁 Model will be saved to: {model_dir}")

Drive already mounted at /content/it_support/; to attempt to forcibly remount, call drive.mount("/content/it_support/", force_remount=True).
📁 Model will be saved to: /content/drive/MyDrive/it_support_priority_classifier


In [None]:
import json
import os

# Make sure model_dir exists
os.makedirs(model_dir, exist_ok=True)

# Save model and tokenizer
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)
print("💾 Model saved!")

# Save the tag mapping (MultiLabelBinarizer classes)
tag_mapping = list(tag_classes)  # tag_classes = mlb.classes_
with open(f"{model_dir}/tag_mapping.json", "w") as f:
    json.dump(tag_mapping, f, indent=2)
print("📁 Tag mapping saved!")

# Optional: save training info
training_info = {
    "num_tags": len(tag_classes),
    "model_name": "distilbert-base-uncased",
    "training_date": "2025-12-02"
}

with open(f"{model_dir}/training_info.json", "w") as f:
    json.dump(training_info, f, indent=2)
print("📊 Training info saved")


💾 Model saved!
📁 Tag mapping saved!
📊 Training info saved


In [None]:
# Install Gradio
!pip install gradio -q

import gradio as gr
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import json
import os
from google.colab import drive

# Mount Google Drive
drive.mount('/content/it_support/')

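# Define model directory (the same path the model was saved to)
model_dir = "/content/drive/MyDrive/it_support_priority_classifier"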

print(f"🔍 Looking for model at: {model_dir}")

if not os.path.exists(model_dir):
    raise FileNotFoundError(f"Model directory not found at: {model_dir}")

# Load tag mapping
tag_mapping_path = os.path.join(model_dir, "tag_mapping.json")
with open(tag_mapping_path, "r") as f:
    tag_classes = json.load(f)
num_tags = len(tag_classes)
print(f"🏷️ Number of tags: {num_tags}")

# Define classifier class for tags
class ITSupportTagClassifier:
    def __init__(self, model_path, tag_classes):
        self.tag_classes = tag_classes
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, num_labels=len(tag_classes), problem_type="multi_label_classification"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model.to(self.device)
        self.model.eval()

    def predict(self, ticket_text, top_k=3):
        # Tokenize
        inputs = self.tokenizer(
            ticket_text,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=256
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Forward pass
        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.sigmoid(outputs.logits)  # multi-label probabilities

        # Get top K tags
        topk = torch.topk(probs, k=min(top_k, probs.shape[-1]), dim=-1)
        predicted_indices = topk.indices[0].cpu().numpy()
        predicted_tags = [self.tag_classes[i] for i in predicted_indices]

        # Probabilities for top K
        prob_dict = {self.tag_classes[i]: float(probs[0, i].cpu()) for i in predicted_indices}

        return {
            "tags": predicted_tags,
            "probabilities": prob_dict
        }

# Initialize classifier
classifier = ITSupportTagClassifier(model_dir, tag_classes)
print("✅ Tag model loaded successfully!")

# Gradio interface function
def gradio_tag_interface(ticket_text):
    if not ticket_text.strip():
        return "Please enter a ticket description"

    result = classifier.predict(ticket_text, top_k=3)

    output = "### Top Predicted Tags:\n- " + "\n- ".join(result["tags"])
    output += "\n\n### Probabilities (Top 3):\n"
    for tag, prob in result["probabilities"].items():
        output += f"- **{tag}**: {prob:.3f}\n"

    return output


# Launch Gradio interface
iface = gr.Interface(
    fn=gradio_tag_interface,
    inputs=gr.Textbox(
        lines=3,
        placeholder="Enter IT support ticket description here...",
        label="IT Support Ticket"
    ),
    outputs=gr.Markdown(label="Predicted Tags"),
    title="🎯 IT Support Ticket Tag Classifier",
    description="Predict relevant tags for IT support tickets based on content",
    examples=[
        ["URGENT: Production database server crashed. All customer transactions are failing."],
        ["I need help resetting my password for the email system."],
        ["The office printer is jammed, but not urgent."]
    ]
)

print("🌐 Launching web interface...")
iface.launch(share=True)


Drive already mounted at /content/it_support/; to attempt to forcibly remount, call drive.mount("/content/it_support/", force_remount=True).
🔍 Looking for model at: /content/drive/MyDrive/it_support_priority_classifier
🏷️ Number of tags: 1650
✅ Tag model loaded successfully!
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://3b526cdfb22d3d7f7f.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


🌐 Launching web interface...
Rerunning server... use `close()` to stop if you need to change `launch()` parameters.
----
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://3b526cdfb22d3d7f7f.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
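
One difference worth noting: evaluation keeps every tag whose probability exceeds 0.5, while the interface always returns the top three tags, even when their probabilities are low. A possible way to reconcile the two is sketched below. `select_tags` is a hypothetical helper, and the fallback to the single best tag is a design assumption rather than something the notebook does.

```python
def select_tags(probabilities, threshold=0.5, top_k=3):
    # Rank tags by probability, keep those above the threshold, cap at top_k
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept = [(tag, prob) for tag, prob in ranked if prob > threshold][:top_k]
    # Assumption: if nothing clears the threshold, still show the single best guess
    return kept or ranked[:1]

print(select_tags({"IT": 0.9, "Bug": 0.6, "HR": 0.1}))
print(select_tags({"IT": 0.2, "HR": 0.1}))
```

Using the same threshold in the interface as in evaluation would make the reported F1 scores a closer description of what users actually see.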


