![servicedesk](image.png)

CleverSupport is a company at the forefront of AI innovation, specializing in the development of AI-driven solutions to enhance customer support services. Their latest endeavor is to engineer a text classification system that can automatically categorize customer complaints. 

As data scientists, our task is to build a machine learning model that accurately assigns complaints to specific categories, such as mortgage, credit card, money transfers, and debt collection.

In [6]:
!pip install -r requirements.txt

Collecting nltk (from -r requirements.txt (line 2))
  Using cached nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Collecting pandas (from -r requirements.txt (line 3))
  Using cached pandas-2.0.3-cp38-cp38-macosx_10_9_x86_64.whl.metadata (18 kB)
Collecting scikit-learn (from -r requirements.txt (line 4))
  Using cached scikit_learn-1.3.2-cp38-cp38-macosx_10_9_x86_64.whl.metadata (11 kB)
Collecting click (from nltk->-r requirements.txt (line 2))
  Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting joblib (from nltk->-r requirements.txt (line 2))
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk->-r requirements.txt (line 2))
  Downloading regex-2024.5.15-cp38-cp38-macosx_10_9_x86_64.whl.metadata (40 kB)
Collecting tqdm (from nltk->-r requirements.txt (line 2))
  Using cached tqdm-4.66.4-py3-non

In [8]:
from collections import Counter
import nltk, json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from torchmetrics import Accuracy, Precision, Recall

In [9]:
nltk.download('punkt')

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1108)>


False

In [11]:
# Import data and labels
with open("data/words.json", 'r') as f1:
    words = json.load(f1)
with open("data/text.json", 'r') as f2:
    text = json.load(f2)
labels = np.load('data/labels.npy')

In [12]:
# Dictionaries to store the word to index mappings and vice versa
word2idx = {o:i for i, o in enumerate(words)}
idx2word = {i:o for i, o in enumerate(words)}

# Replace each word with its index; words missing from the vocabulary map to 0
for i, sentence in enumerate(text):
    text[i] = [word2idx.get(word, 0) for word in sentence]

# Truncate each sentence to seq_len tokens, or left-pad it with 0 to that fixed length
def pad_input(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len), dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    # Return only after every sentence has been processed
    return features

text = pad_input(text, 50)
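
To make the encoding and padding step concrete, here is a minimal sketch on toy data. The vocabulary `toy_words` and the sentences in `toy_text` are hypothetical stand-ins for the contents of `data/words.json` and `data/text.json`, and the sketch assumes that index 0 can serve as both the padding value and the fallback for unknown words.

```python
import numpy as np

# Hypothetical toy vocabulary; the real one is loaded from data/words.json.
# Assumption: index 0 is reserved so it can double as padding / unknown.
toy_words = ["<pad>", "my", "card", "was", "charged", "twice"]
toy_word2idx = {w: i for i, w in enumerate(toy_words)}


def pad_input(sentences, seq_len):
    """Truncate to seq_len tokens, or left-pad with zeros up to seq_len."""
    features = np.zeros((len(sentences), seq_len), dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features


# Three cases: too long, too short with an unknown word, and empty
toy_text = [
    ["my", "card", "was", "charged", "twice"],
    ["card", "refund"],
    [],
]
encoded = [[toy_word2idx.get(w, 0) for w in s] for s in toy_text]
padded = pad_input(encoded, 4)
print(padded)
```

The long sentence keeps its first four tokens, the short one is padded on the left, and the empty one stays all zeros. Because unknown words and padding share index 0 here, the model cannot tell them apart; a separate unknown-word index would be one way to avoid that.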

In [14]:
# Splitting dataset
train_text, test_text, train_label, test_label = train_test_split(text, labels, test_size=0.2, random_state=42)

train_data = TensorDataset(torch.from_numpy(train_text), torch.from_numpy(train_label).long())
test_data = TensorDataset(torch.from_numpy(test_text), torch.from_numpy(test_label).long())

In [15]:
batch_size = 400
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size)
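
A quick way to sanity-check the loaders is to look at the batch shapes they yield. The sketch below uses synthetic arrays (`fake_text`, `fake_labels`) rather than the real dataset; the sample count of 1000, the vocabulary size of 100, and the 5 classes are all assumptions chosen for illustration.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-ins for the padded text and labels (hypothetical sizes)
rng = np.random.default_rng(0)
fake_text = rng.integers(0, 100, size=(1000, 50))   # 1000 sequences of length 50
fake_labels = rng.integers(0, 5, size=1000)         # 5 hypothetical classes

fake_data = TensorDataset(
    torch.from_numpy(fake_text),
    torch.from_numpy(fake_labels).long(),
)
fake_loader = DataLoader(fake_data, shuffle=True, batch_size=400)

# Collect (inputs shape, labels shape) for every batch in one pass
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in fake_loader]
print(batch_shapes)
```

With 1000 samples and a batch size of 400, the last batch holds only 200 samples, so any code that consumes the loader should not assume a fixed batch dimension.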

In [16]:
# Define the classifier class
class TicketClassifier(nn.Module):
    
    def __init__(self, vocab_size, embed_dim, target_size):
        super(TicketClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(embed_dim, target_size)
        
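    def forward(self, text):
        # (batch, seq_len) -> (batch, seq_len, embed_dim) -> (batch, embed_dim, seq_len)
        embedded = self.embedding(text).permute(0, 2, 1)
        # Conv1d expects channels first; padding=1 keeps seq_len unchanged
        conved = F.relu(self.conv(embedded))
        # Average over the sequence dimension -> (batch, embed_dim)
        conved = conved.mean(dim=2)
        # Class logits -> (batch, target_size)
        return self.fc(conved)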
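
The tensor shapes inside `forward` can be traced layer by layer. The sketch below rebuilds the same three layers as standalone modules and pushes a random batch through them; the vocabulary size of 100, embedding dimension of 64, 5 target classes, and batch of 8 are hypothetical values, not the ones used in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes, chosen only to make the shapes easy to read
vocab_size, embed_dim, target_size = 100, 64, 5
batch, seq_len = 8, 50

embedding = nn.Embedding(vocab_size, embed_dim)
conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)
fc = nn.Linear(embed_dim, target_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))

embedded = embedding(tokens)                 # (8, 50, 64)
channels_first = embedded.permute(0, 2, 1)   # (8, 64, 50), as Conv1d expects
conved = F.relu(conv(channels_first))        # (8, 64, 50), padding=1 keeps length
pooled = conved.mean(dim=2)                  # (8, 64), average over the sequence
logits = fc(pooled)                          # (8, 5), one score per class

for name, t in [("embedded", embedded), ("channels_first", channels_first),
                ("conved", conved), ("pooled", pooled), ("logits", logits)]:
    print(name, tuple(t.shape))
```

The output is raw logits, which is the form `nn.CrossEntropyLoss` expects, so no softmax is needed before computing the loss.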
    