![servicedesk](servicedesk.png)

CleverSupport is a company at the forefront of AI innovation, specializing in the development of AI-driven solutions to enhance customer support services. Their latest endeavor is to engineer a text classification system that can automatically categorize customer complaints. 

Your role as a data scientist involves the creation of a sophisticated machine learning model that can accurately assign complaints to specific categories, such as mortgage, credit card, money transfers, debt collection, etc.

# Project Instructions
Classify service desk tickets into categories using a CNN to streamline customer services.

- Define a CNN classifier with the following layers: an embedding layer, a 1D convolution layer, and a linear layer.
- Train your classifier on `train_data` using a suitable optimizer. Run your training for only 3 epochs.
- Test your classifier on `test_data`, storing your predictions in a list called `predictions`.
- Calculate the accuracy, per-class precision, and per-class recall for your trained classifier on the `test_data`. Save the metrics as variables with the corresponding names: `accuracy`, `precision`, and `recall`, with `precision` and `recall` saved as lists.

# How to approach the project
1. Defining the classifier
2. Training the classifier
3. Testing the classifier

## Steps to complete

### 1. Defining the classifier
Define a class containing all the appropriate layers, and a method to perform the forward pass over a batch of input text.

#### Creating a class to contain the layers of the classifier
- Define a class called `TicketClassifier` that inherits from PyTorch's `nn.Module` class.

#### Adding an embedding layer
- Use PyTorch's `nn.Embedding` class to define the embedding layer.
- Create an instance of it in the `TicketClassifier` class's constructor and assign it to an instance variable such as `self.embedding`.
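
As a rough illustration of what the embedding layer does to a batch of token indices, the sketch below uses small, made-up sizes (`vocab_size = 10`, `embed_dim = 4`); these values are assumptions for the example, not the ones used in the project.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration
vocab_size, embed_dim = 10, 4
embedding = nn.Embedding(vocab_size, embed_dim)

# A batch of 2 "sentences", each 5 token indices long
token_ids = torch.tensor([[0, 0, 3, 7, 1],
                          [0, 2, 2, 9, 4]])

# Each index is replaced by its embed_dim-dimensional vector
embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([2, 5, 4])
```

The output shape is `(batch, seq_len, embed_dim)`, which matters for the convolution step that follows.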

#### Adding a convolution layer
- Use PyTorch's `nn.Conv1d` class to define the 1D convolution layer.
- Create an instance of it in the `TicketClassifier` class's constructor and assign it to an instance variable such as `self.conv`.
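
`nn.Conv1d` expects input shaped `(batch, channels, length)`, while an embedding layer returns `(batch, length, embed_dim)`. The sketch below, with made-up sizes, shows one way to reconcile the two with `permute`; the random tensor stands in for real embedding output.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration
embed_dim, seq_len = 4, 5
conv = nn.Conv1d(in_channels=embed_dim, out_channels=embed_dim,
                 kernel_size=3, stride=1, padding=1)

embedded = torch.randn(2, seq_len, embed_dim)  # (batch, seq_len, embed_dim)
channels_first = embedded.permute(0, 2, 1)     # (batch, embed_dim, seq_len)

# kernel_size=3 with padding=1 and stride=1 keeps the sequence length unchanged
conved = conv(channels_first)
print(conved.shape)  # torch.Size([2, 4, 5])
```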

#### Adding a linear layer
- Use PyTorch's `nn.Linear` class to define the linear layer.
- Create an instance of it in the `TicketClassifier` class's constructor and assign it to an instance variable such as `self.fc`.

#### Define a .forward() method
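- Finally, define a `.forward()` method that passes the input through the embedding layer and the convolution layer (permuting the embedding output to `(batch, channels, length)` first, as `nn.Conv1d` expects), applies `nn.functional.relu` to the output, pools over the sequence dimension, and then applies the linear layer before returning the output.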
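
The linear layer needs one fixed-size vector per ticket, so the convolution output has to be reduced along the sequence dimension first. The sketch below shows mean pooling as one option, using a random tensor as a stand-in for the activated convolution output and made-up sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for ReLU(conv(...)): (batch=2, channels=4, seq_len=5), sizes are hypothetical
conved = F.relu(torch.randn(2, 4, 5))

# Average over the sequence dimension to get one vector per ticket
pooled = conved.mean(dim=2)  # (batch, channels)

# Map each pooled vector to one score (logit) per category; 3 categories assumed here
fc = nn.Linear(4, 3)
logits = fc(pooled)
print(pooled.shape, logits.shape)  # torch.Size([2, 4]) torch.Size([2, 3])
```

Max pooling (`conved.max(dim=2).values`) would be another reasonable choice; which works better depends on the data.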

### 2. Training the classifier
Define a training loop that loops over the dataset, calculating the loss and propagating it backwards through the network.

#### Define a suitable loss criterion
- Use PyTorch's `nn.CrossEntropyLoss`, since this is a multi-class classification problem.
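
`nn.CrossEntropyLoss` takes raw scores (logits) and integer class indices, so no softmax is needed in the model. A small sketch with made-up numbers, checked against a manual computation:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Made-up logits for 2 tickets and 3 categories
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.1, 3.0]])
targets = torch.tensor([0, 2])  # integer class indices, dtype long

loss = criterion(logits, targets)

# Same value by hand: mean negative log-softmax at the target index
manual = -torch.log_softmax(logits, dim=1)[torch.arange(2), targets].mean()
print(torch.isclose(loss, manual))  # tensor(True)
```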

#### Define an optimizer
- Use PyTorch's `optim.Adam` optimizer.
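
A single optimization step with `torch.optim.Adam` can be sketched as follows. The tiny linear layer, the dummy loss, and the learning rate are placeholders chosen for the example, not the project's model or settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 2)  # stand-in for the real model
optimizer = torch.optim.Adam(layer.parameters(), lr=0.01)

before = layer.weight.detach().clone()

# Dummy loss, used only to produce gradients
loss = layer(torch.ones(4, 3)).pow(2).mean()

optimizer.zero_grad()  # clear any stale gradients
loss.backward()        # compute gradients
optimizer.step()       # update the parameters

# True if the weights moved after the step
changed = not torch.equal(before, layer.weight.detach())
print(changed)
```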

### 3. Testing the classifier
Use your trained model to classify the text in the test set, and calculate the appropriate metrics.

#### Predict the category of each ticket in the test data
- Invoke `model()` on your input data to pass the data through the network.
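- Use `torch.argmax()` to find the category with the highest predicted score. The model outputs logits rather than probabilities, but the highest logit is also the most probable class.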
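
Since softmax preserves ordering, taking the argmax of the raw model output gives the same class as taking it over probabilities. A small sketch with made-up logits:

```python
import torch

# Made-up logits for 2 tickets and 3 categories
logits = torch.tensor([[0.1, 2.3, -1.0],
                       [1.5, 0.2, 0.3]])
probs = torch.softmax(logits, dim=-1)

pred_from_logits = torch.argmax(logits, dim=-1)
pred_from_probs = torch.argmax(probs, dim=-1)

print(pred_from_logits.tolist())  # [1, 0]
print(torch.equal(pred_from_logits, pred_from_probs))  # True
```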

#### Calculate the accuracy
- Use `torchmetrics.Accuracy` to calculate the accuracy.

#### Calculate the precision and recall
- Use `torchmetrics.Precision` and `torchmetrics.Recall` to calculate the precision and recall.
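
To make clear what these metrics measure, the sketch below computes accuracy and per-class precision and recall by hand in plain PyTorch on a made-up set of predictions. It is an illustration of the definitions, not a replacement for `torchmetrics`.

```python
import torch

# Made-up predictions and true labels for 6 tickets and 3 categories
preds = torch.tensor([0, 1, 1, 2, 2, 2])
targets = torch.tensor([0, 1, 2, 2, 2, 0])
num_classes = 3

precision, recall = [], []
for c in range(num_classes):
    tp = ((preds == c) & (targets == c)).sum().item()  # correctly predicted as c
    predicted_c = (preds == c).sum().item()            # everything predicted as c
    actual_c = (targets == c).sum().item()             # everything truly in c
    precision.append(tp / predicted_c if predicted_c else 0.0)
    recall.append(tp / actual_c if actual_c else 0.0)

accuracy = (preds == targets).float().mean().item()

print(precision)  # per-class precision: 1.0, 0.5, 2/3
print(recall)     # per-class recall: 0.5, 1.0, 2/3
print(accuracy)   # 4 of 6 correct
```

Precision for a class answers "of the tickets predicted as this class, how many were right?", while recall answers "of the tickets truly in this class, how many were found?".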

In [32]:
!pip install torchmetrics

Defaulting to user installation because normal site-packages is not writeable


In [33]:
from collections import Counter
import nltk, json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from torchmetrics import Accuracy, Precision, Recall

In [34]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [35]:
# Import data and labels
with open("words.json", 'r') as f1:
    words = json.load(f1)
with open("text.json", 'r') as f2:
    text = json.load(f2)
labels = np.load('labels.npy')

In [36]:
# Preview the data (printing everything exceeds Jupyter's IOPub data rate limit)
print("Words:", len(words))
print(list(words)[:10])  # first few vocabulary entries

print("\nText:", len(text))
print(text[:2])  # first two tokenized tickets

print("\nLabels:", labels.shape)
print(labels[:10])  # first few class labels



In [37]:
# Dictionaries to store the word to index mappings and vice versa
word2idx = {o:i for i,o in enumerate(words)}
idx2word = {i:o for i,o in enumerate(words)}

In [38]:
# Looking up the mapping dictionary and assigning the index to the respective words
for i, sentence in enumerate(text):
    text[i] = [word2idx[word] if word in word2idx else 0 for word in sentence]
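
The lookup above sends any word missing from the vocabulary to index 0. A minimal sketch with a made-up three-word vocabulary (not the contents of `words.json`) shows the effect:

```python
# Hypothetical vocabulary, for illustration only
words = ["card", "loan", "payment"]
word2idx = {w: i for i, w in enumerate(words)}

sentence = ["loan", "refund", "payment"]  # "refund" is not in the vocabulary

# dict.get with a default is equivalent to the conditional lookup
encoded = [word2idx.get(w, 0) for w in sentence]
print(encoded)  # [1, 0, 2]
```

In this toy vocabulary, `"card"` also has index 0, so it becomes indistinguishable from unknown words and from padding. Whether that matters for the real data depends on the first entry of `words.json`, which is not shown here.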

In [39]:
# Truncate longer sentences to their first seq_len tokens, or left-pad shorter ones with 0, so every row has a fixed length
def pad_input(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len),dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features

text = pad_input(text, 50)
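
To see the padding and truncation behavior concretely, the sketch below repeats the function and applies it to three made-up sequences with `seq_len=4`:

```python
import numpy as np

def pad_input(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len), dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features

# A short sequence, a long sequence, and an empty one
demo = pad_input([[5, 6], [1, 2, 3, 4, 5, 6], []], seq_len=4)
print(demo)
# [[0 0 5 6]
#  [1 2 3 4]
#  [0 0 0 0]]
```

Short sequences are padded on the left, long ones keep only their first `seq_len` tokens, and empty ones stay all zeros.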

In [40]:
# Splitting dataset
train_text, test_text, train_label, test_label = train_test_split(text, labels, test_size=0.2, random_state=42)

train_data = TensorDataset(torch.from_numpy(train_text), torch.from_numpy(train_label).long())
test_data = TensorDataset(torch.from_numpy(test_text), torch.from_numpy(test_label).long())

In [41]:
# Start coding here
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from torchmetrics import Accuracy, Precision, Recall

In [42]:
batch_size = 400
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size)
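
A `DataLoader` yields `(inputs, labels)` batches, and the last batch may be smaller than `batch_size`. The sketch below uses a made-up dataset of 10 all-zero rows to show the batch shapes:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder data: 10 "tickets" of 50 token indices each
features = torch.zeros(10, 50, dtype=torch.long)
targets = torch.zeros(10, dtype=torch.long)

loader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=False)

batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)  # [(4, 50), (4, 50), (2, 50)]
```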

In [43]:
# Define the classifier class
class TicketClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, target_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(embed_dim, target_size)

    def forward(self, text):
        # (batch, seq_len) -> (batch, seq_len, embed_dim) -> (batch, embed_dim, seq_len) for Conv1d
        embedded = self.embedding(text).permute(0, 2, 1)
        conved = F.relu(self.conv(embedded))
        # Average over the sequence dimension: (batch, embed_dim)
        conved = conved.mean(dim=2)
        return self.fc(conved)  # (batch, target_size) logits

In [44]:
vocab_size = len(word2idx) + 1
target_size = len(np.unique(labels))
embedding_dim = 64

In [45]:
# Create an instance of the TicketClassifier class
model = TicketClassifier(vocab_size, embedding_dim, target_size)

lr = 0.05
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

epochs = 3

In [46]:
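# Train the model
model.train()
for i in range(epochs):
    running_loss, num_processed = 0, 0
    # Use `targets` so the batch labels do not shadow the global `labels` array
    for inputs, targets in train_loader:
        optimizer.zero_grad()  # clear gradients from the previous step
        output = model(inputs)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()  # loss.item() is already the mean over the batch
        num_processed += len(inputs)
    # Sum of batch-mean losses divided by the sample count, so roughly (mean loss / batch_size)
    print(f"Epoch: {i+1}, Loss: {running_loss/num_processed}")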

Epoch: 1, Loss: 0.0038546013832092287
Epoch: 2, Loss: 0.0015976080670952798
Epoch: 3, Loss: 0.0007425523810088634
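
The values printed above divide a sum of per-batch mean losses by the number of samples, so they come out roughly `batch_size` times smaller than the average loss per ticket. If a per-sample average is wanted instead, one option is to weight each batch loss by its size, as in the sketch below. The random logits and labels are placeholders, not the project data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.CrossEntropyLoss()

# Placeholder outputs and labels for 10 samples and 5 classes
logits = torch.randn(10, 5)
targets = torch.randint(0, 5, (10,))

# Uneven batches of size 4, 4, and 2
batches = [(logits[:4], targets[:4]),
           (logits[4:8], targets[4:8]),
           (logits[8:], targets[8:])]

running_loss, num_processed = 0.0, 0
for out, tgt in batches:
    loss = criterion(out, tgt)               # mean loss over this batch
    running_loss += loss.item() * len(tgt)   # convert back to a sum over samples
    num_processed += len(tgt)

epoch_loss = running_loss / num_processed
full_loss = criterion(logits, targets).item()  # mean over all samples at once
print(abs(epoch_loss - full_loss) < 1e-4)  # True
```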


In [47]:
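# target_size is the number of distinct categories in the labels (5 for this dataset)
accuracy_metric = Accuracy(task='multiclass', num_classes=target_size)
precision_metric = Precision(task='multiclass', num_classes=target_size, average=None)
recall_metric = Recall(task='multiclass', num_classes=target_size, average=None)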

In [48]:
# Evaluate model on test set
model.eval()
predictions = []

with torch.no_grad():  # gradients are not needed for evaluation
    for inputs, targets in test_loader:
        output = model(inputs)
        cat = torch.argmax(output, dim=-1)
        predictions.extend(cat.tolist())
        # Update the running state of each metric with this batch
        accuracy_metric(cat, targets)
        precision_metric(cat, targets)
        recall_metric(cat, targets)

accuracy = accuracy_metric.compute().item()
precision = precision_metric.compute().tolist()
recall = recall_metric.compute().tolist()
print('Accuracy:', accuracy)
print('Precision (per class):', precision)
print('Recall (per class):', recall)

Accuracy: 0.7979999780654907
Precision (per class): [0.6682464480400085, 0.7168949842453003, 0.9345238208770752, 0.8118279576301575, 0.8888888955116272]
Recall (per class): [0.734375, 0.8263157606124878, 0.7268518805503845, 0.7864583134651184, 0.9142857193946838]
