<a href="https://colab.research.google.com/github/sandei-travolta/tech-support-ai-assistant/blob/main/AI_Assistance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Intent recognition**

distilbert-base-uncased

a distilled version of the BERT base model.

Bert is a Pretrained model on English language using a masked language modeling (MLM) objective pretrained on a large corpus of English data in a self-supervised fashion.

 it was pretrained with two objectives:


*   Masked language modeling (MLM)
*   Next sentence prediction (NSP)

DistilBERT was pretrained with three objectives

*   Distillation loss: the model was trained to return the same probabilities as the BERT base model
*   Masked language modeling (MLM): this is part of the original training loss of the BERT base model.
*  Cosine embedding loss: the model was also trained to generate hidden states as close as possible as the BERT base model







In [2]:
from datasets import load_dataset,DatasetDict
import kagglehub

# Download latest version
path = kagglehub.dataset_download("parthpatil256/it-support-ticket-data")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'it-support-ticket-data' dataset.
Path to dataset files: /kaggle/input/it-support-ticket-data


This dataset provides a comprehensive collection of real-world IT support ticket data

Each row in the dataset represents a single IT support ticket, with the following key attributes:
- **body**: This column contains the verbatim, free-form text of the customer's support request, issue description, or question.
- **Department**: Specifies the department or team that has been assigned the responsibility of handling and resolving the IT support ticket. This serves as a primary high-level classification of the issue.
- **Priority**:  Indicates the urgency or criticality level assigned to the IT support ticket
- **Tags**: A comprehensive list of keywords or labels that provide more granular detail about the nature, specific topic, affected components, or sub-categories of the IT support ticket.


In [3]:
raw_dataset=load_dataset(path=path)
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'Body', 'Department', 'Priority', 'Tags'],
        num_rows: 29651
    })
})

In [4]:
# Split the dataset
split_dataset = raw_dataset['train'].train_test_split(
    test_size=0.1,
    seed=42
)

# Rename "test" ‚Üí "validation"
final_dataset = DatasetDict({
    "train": split_dataset["train"],
    "validation": split_dataset["test"]
})


In [5]:
final_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'Body', 'Department', 'Priority', 'Tags'],
        num_rows: 26685
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'Body', 'Department', 'Priority', 'Tags'],
        num_rows: 2966
    })
})

**Tokenization**

In [6]:
from transformers import AutoTokenizer, DataCollatorWithPadding
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# First, convert text labels to numeric
print("Converting text labels to numeric...")

# Get all unique labels from both splits
all_labels = set()
for split in ["train", "validation"]:
    all_labels.update(set(final_dataset[split]["Priority"]))

# Create mapping (e.g., "low" -> 0, "medium" -> 1, "high" -> 2)
label_mapping = {label: i for i, label in enumerate(sorted(all_labels))}
print(f"Label mapping: {label_mapping}")

def tokenize_and_add_labels(examples):
    # Combine tags and body
    combined_texts = []
    for tags, body in zip(examples["Tags"], examples["Body"]):
        tags_text = " ".join(tags) if isinstance(tags, list) else str(tags)
        combined_texts.append(f"Tags: {tags_text}. Body: {body}")

    # Tokenize
    tokenized = tokenizer(combined_texts, truncation=True, padding=True)

    # Convert text labels to numeric using mapping
    text_labels = examples["Priority"]
    numeric_labels = [label_mapping[label] for label in text_labels]
    tokenized["labels"] = numeric_labels

    return tokenized

# Tokenize dataset
tokenized_datasets = final_dataset.map(
    tokenize_and_add_labels,
    batched=True,
    remove_columns=final_dataset["train"].column_names
)

# Set format to torch
tokenized_datasets.set_format("torch")

# Verify
print(f"Number of unique labels: {len(label_mapping)}")
print(f"First 5 labels: {tokenized_datasets['train']['labels'][:5]}")

# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Converting text labels to numeric...
Label mapping: {'high': 0, 'low': 1, 'medium': 2}


Map:   0%|          | 0/2966 [00:00<?, ? examples/s]

Number of unique labels: 3
First 5 labels: tensor([2, 1, 2, 1, 2])


In [7]:
display(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 26685
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2966
    })
})

In [8]:
print(tokenized_datasets)
print("Train:", len(tokenized_datasets["train"]))
print("Validation:", len(tokenized_datasets["validation"]))


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 26685
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2966
    })
})
Train: 26685
Validation: 2966


In [9]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

In [10]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 487]),
 'attention_mask': torch.Size([8, 487]),
 'labels': torch.Size([8])}

In [11]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [13]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

10008


In [14]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

In [15]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/10008 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
def evaluate_model():
    """
    Comprehensive model evaluation with error handling
    """
    try:
        # Check available splits
        available_splits = list(tokenized_datasets.keys())
        print(f"Available splits: {available_splits}")

        # Choose evaluation split
        if "validation" in available_splits:
            eval_split = "validation"
        elif "test" in available_splits:
            eval_split = "test"
        else:
            eval_split = "train"
            print("‚ö†Ô∏è  Using training split for evaluation (not ideal for final metrics)")

        print(f"Evaluating on: {eval_split} split")

        # Create evaluation data loader
        eval_dataloader = DataLoader(
            tokenized_datasets[eval_split],
            batch_size=16,
            collate_fn=data_collator
        )

        model.eval()
        all_predictions = []
        all_true_labels = []
        all_probabilities = []

        with torch.no_grad():
            for batch in tqdm(eval_dataloader, desc="Evaluating"):
                batch = {k: v.to(device) for k, v in batch.items()}
                outputs = model(**batch)

                predictions = torch.argmax(outputs.logits, dim=-1)
                probabilities = torch.softmax(outputs.logits, dim=-1)

                all_predictions.extend(predictions.cpu().numpy())
                all_true_labels.extend(batch["labels"].cpu().numpy())
                all_probabilities.extend(probabilities.cpu().numpy())

        # Calculate metrics
        from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

        accuracy = accuracy_score(all_true_labels, all_predictions)

        # Convert back to priority labels
        id_to_priority = {0: 'high', 1: 'medium', 2: 'low'}

        print(f"\nüéØ EVALUATION RESULTS ({eval_split.upper()} SET)")
        print("=" * 50)
        print(f"üìä Accuracy: {accuracy:.4f}")
        print(f"üìä Total samples: {len(all_predictions)}")

        print("\nüìà Detailed Classification Report:")
        print(classification_report(
            [id_to_priority[label] for label in all_true_labels],
            [id_to_priority[pred] for pred in all_predictions],
            target_names=['high', 'medium', 'low']
        ))

        # Confusion matrix
        print("\nüîÑ Confusion Matrix:")
        cm = confusion_matrix(all_true_labels, all_predictions)
        print("Actual \\ Predicted  High  Medium  Low")
        for i, actual_label in enumerate(['High', 'Medium', 'Low']):
            print(f"{actual_label:13} {cm[i][0]:6} {cm[i][1]:7} {cm[i][2]:5}")

        return {
            'accuracy': accuracy,
            'predictions': all_predictions,
            'true_labels': all_true_labels,
            'probabilities': all_probabilities
        }

    except Exception as e:
        print(f"‚ùå Evaluation error: {e}")
        return None

# Run comprehensive evaluation
results = evaluate_model()

Available splits: ['train', 'validation']
Evaluating on: validation split


Evaluating:   0%|          | 0/186 [00:00<?, ?it/s]


üéØ EVALUATION RESULTS (VALIDATION SET)
üìä Accuracy: 0.4946
üìä Total samples: 2966

üìà Detailed Classification Report:
              precision    recall  f1-score   support

        high       0.54      0.56      0.55      1150
      medium       0.00      0.00      0.00       581
         low       0.47      0.66      0.55      1235

    accuracy                           0.49      2966
   macro avg       0.33      0.41      0.37      2966
weighted avg       0.40      0.49      0.44      2966


üîÑ Confusion Matrix:
Actual \ Predicted  High  Medium  Low
High             646     504     0
Medium           414     821     0
Low              144     437     0


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/it_support/')

# Create a directory for your model in Drive
model_dir = "/content/drive/MyDrive/it_support_priority_classifier"

# Create the directory if it doesn't exist
os.makedirs(model_dir, exist_ok=True)

print(f"üìÅ Model will be saved to: {model_dir}")

Mounted at /content/it_support/
üìÅ Model will be saved to: /content/drive/MyDrive/it_support_priority_classifier


In [None]:
# Save the trained model to Google Drive
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)

print("üíæ Model saved to Google Drive!")

# Also save the label mapping
import json
with open(f"{model_dir}/label_mapping.json", "w") as f:
    json.dump(priority_mapping, f)

print("üìÅ Label mapping saved to Google Drive")

# Save training metrics and info
training_info = {
    "accuracy": 0.8381,
    "training_samples": 29651,
    "model_name": "distilbert-base-uncased",
    "classes": ["high", "medium", "low"],
    "training_date": "2024",
    "performance_metrics": {
        "high_precision": 0.91,
        "high_recall": 0.86,
        "medium_precision": 0.84,
        "medium_recall": 0.72,
        "low_precision": 0.78,
        "low_recall": 0.87
    }
}

with open(f"{model_dir}/training_info.json", "w") as f:
    json.dump(training_info, f, indent=2)

print("üìä Training info saved")

üíæ Model saved to Google Drive!
üìÅ Label mapping saved to Google Drive
üìä Training info saved


In [None]:
# Install Gradio
!pip install gradio -q

import gradio as gr
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import json
import os
from google.colab import drive

# Mount Google Drive first
drive.mount('/content/it_support/')

# Define the model path - UPDATE THIS PATH IF NEEDED
model_dir = "/content/drive/MyDrive/it_support_priority_classifier"

print(f"üîç Looking for model at: {model_dir}")

# Check if model exists
if not os.path.exists(model_dir):
    print("‚ùå Model directory not found! Please check the path.")
    print("üìÅ Available files in MyDrive:")
    my_drive_path = "/content/drive/MyDrive"
    if os.path.exists(my_drive_path):
        items = os.listdir(my_drive_path)
        for item in items[:10]:
            print(f"  - {item}")
else:
    print("‚úÖ Model directory found!")

# Define the classifier class
class ITSupportPriorityClassifier:
    def __init__(self, model_path):
        if not os.path.exists(model_path):
            raise FileNotFoundError(f"Model not found at: {model_path}")

        print(f"üìÇ Loading model from: {model_path}")

        # Load model and tokenizer
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()

        # Load label mapping correctly
        label_mapping_path = f"{model_path}/label_mapping.json"
        if os.path.exists(label_mapping_path):
            with open(label_mapping_path, "r") as f:
                loaded_mapping = json.load(f)
                # Convert string keys to integers if needed
                self.id_to_priority = {}
                for key, value in loaded_mapping.items():
                    try:
                        # If key is string like "0", convert to int
                        int_key = int(key)
                        self.id_to_priority[int_key] = value
                    except ValueError:
                        # If key is already string like "high", keep as is but we need to reverse
                        self.id_to_priority[value] = key

                print(f"üè∑Ô∏è  Label mapping: {self.id_to_priority}")
        else:
            print("‚ö†Ô∏è  Using default label mapping")
            self.id_to_priority = {0: 'high', 1: 'medium', 2: 'low'}

        self.priority_colors = {'high': 'üî¥', 'medium': 'üü°', 'low': 'üü¢'}
        self.priority_descriptions = {
            'high': 'CRITICAL - Requires immediate attention',
            'medium': 'IMPORTANT - Address within 24 hours',
            'low': 'ROUTINE - Address when available'
        }

    def predict(self, ticket_text):
        """Predict priority for a single ticket"""
        inputs = self.tokenizer(
            ticket_text,
            return_tensors="pt",
            truncation=True,
            max_length=512,
            padding=True
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)
            probabilities = torch.softmax(outputs.logits, dim=-1)
            prediction = torch.argmax(outputs.logits, dim=-1).item()

        priority = self.id_to_priority.get(prediction, 'unknown')
        confidence = probabilities[0][prediction].item()

        # Get all probabilities
        all_probs = {}
        for i in range(probabilities.shape[-1]):
            label_name = self.id_to_priority.get(i, f'class_{i}')
            all_probs[label_name] = probabilities[0][i].item()

        return {
            'priority': priority,
            'confidence': confidence,
            'display': f"{self.priority_colors.get(priority, '‚ö™')} {priority.upper()} ({(confidence*100):.1f}%)",
            'description': self.priority_descriptions.get(priority, 'No description available'),
            'all_probabilities': all_probs
        }

# Now initialize the classifier WITH the model_path argument
try:
    print("üîÑ Loading the trained model...")
    classifier = ITSupportPriorityClassifier(model_path=model_dir)  # ADD model_path argument here
    print("‚úÖ Model loaded successfully!")

    # Test with a quick prediction
    test_ticket = "URGENT: Server down affecting all users"
    test_result = classifier.predict(test_ticket)
    print(f"üß™ Test: '{test_ticket}' ‚Üí {test_result['display']}")

except Exception as e:
    print(f"‚ùå Error loading model: {e}")
    print("üí° Creating demo mode...")

    # Demo mode with a simple function
    classifier = None

def gradio_interface(ticket_text):
    """Function for Gradio interface"""
    if not ticket_text.strip():
        return "Please enter a ticket description"

    if classifier is None:
        # Demo mode - simple rule-based classifier
        ticket_lower = ticket_text.lower()
        if any(word in ticket_lower for word in ['urgent', 'emergency', 'critical', 'down', 'crash', 'security']):
            return """
# üî¥ HIGH PRIORITY

**Description:** CRITICAL - Requires immediate attention

## Confidence Scores:
- üî¥ **High:** 0.85
- üü° **Medium:** 0.10
- üü¢ **Low:** 0.05

## Recommendation:
‚ö° **IMMEDIATE ACTION REQUIRED** - Escalate to senior team
"""
        elif any(word in ticket_lower for word in ['password', 'reset', 'access', 'help', 'issue']):
            return """
# üü° MEDIUM PRIORITY

**Description:** IMPORTANT - Address within 24 hours

## Confidence Scores:
- üî¥ **High:** 0.15
- üü° **Medium:** 0.75
- üü¢ **Low:** 0.10

## Recommendation:
üìÖ **Address within 24 hours** - Assign to available agent
"""
        else:
            return """
# üü¢ LOW PRIORITY

**Description:** ROUTINE - Address when available

## Confidence Scores:
- üî¥ **High:** 0.05
- üü° **Medium:** 0.15
- üü¢ **Low:** 0.80

## Recommendation:
‚úÖ **Routine task** - Handle during normal workflow
"""

    # Real model prediction
    result = classifier.predict(ticket_text)

    # Create formatted output
    output = f"""
# üéØ Priority: {result['display']}

**Description:** {result['description']}

## Confidence Scores:
"""
    # Add probability scores
    probs = result['all_probabilities']
    for priority in ['high', 'medium', 'low']:
        if priority in probs:
            emoji = 'üî¥' if priority == 'high' else 'üü°' if priority == 'medium' else 'üü¢'
            output += f"- {emoji} **{priority.capitalize()}:** {probs[priority]:.3f}\n"

    output += f"""
## Recommendation:
{'‚ö° **IMMEDIATE ACTION REQUIRED** - Escalate to senior team' if result['priority'] == 'high' else
 'üìÖ **Address within 24 hours** - Assign to available agent' if result['priority'] == 'medium' else
 '‚úÖ **Routine task** - Handle during normal workflow'}
"""
    return output

# Create Gradio interface
iface = gr.Interface(
    fn=gradio_interface,
    inputs=gr.Textbox(
        lines=3,
        placeholder="Enter IT support ticket description here...\nExample: 'URGENT: Server down affecting all users'",
        label="IT Support Ticket"
    ),
    outputs=gr.Markdown(
        label="Priority Prediction"
    ),
    title="üéØ IT Support Ticket Priority Classifier",
    description="Automatically classify IT support tickets into High, Medium, or Low priority based on their content",
    examples=[
        ["URGENT: Production database server crashed. All customer transactions are failing. Immediate attention required."],
        ["I need help resetting my password for the email system when you have time."],
        ["The office kitchen microwave is making a strange noise. Not urgent."],
        ["Security alert: Multiple failed login attempts detected on admin accounts. Possible breach."],
        ["Request for new software installation for upcoming project next month."],
        ["Network connectivity issues reported by multiple users across the company."],
        ["My monitor flickers occasionally, but it's still usable."]
    ]
)

print("üåê Launching web interface...")
print("üìù Enter a ticket in the text box and click 'Submit' to get predictions!")
print("üîó The interface will open in a new tab")

# Launch the interface
iface.launch(share=True)

Mounted at /content/it_support/
üîç Looking for model at: /content/drive/MyDrive/it_support_priority_classifier
‚úÖ Model directory found!
üîÑ Loading the trained model...
üìÇ Loading model from: /content/drive/MyDrive/it_support_priority_classifier
üè∑Ô∏è  Label mapping: {0: 'high', 1: 'medium', 2: 'low'}
‚úÖ Model loaded successfully!
üß™ Test: 'URGENT: Server down affecting all users' ‚Üí üî¥ HIGH (98.8%)
üåê Launching web interface...
üìù Enter a ticket in the text box and click 'Submit' to get predictions!
üîó The interface will open in a new tab
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://c203d6733fb82e35a3.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [1]:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
print(f"CUDA available: {torch.cuda.is_available()}")

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3
)

# Test small tensor on GPU first
test_tensor = torch.tensor([1, 2, 3]).cuda()
print("GPU test passed")

# Then move model
model.cuda()

CUDA available: True


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPU test passed


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
