## Intent Classification With PyTorch
The earlier notebooks focused on obtaining labeled data for my chatbot. This notebook uses PyTorch to classify the intent of new, unseen user utterances. The approach is now supervised: the model trains on the labels produced by the unsupervised learning in the preceding notebook.

### RASA Comparison

Rasa trains its intent classification step with an SVM tuned by `GridSearchCV`, which lets it try different hyperparameter configurations ([source](https://medium.com/bhavaniravi/intent-classification-demystifying-rasanlu-part-4-685fc02f5c1d)). At deployment, the preprocessing pipeline must stay the same between training and test time.
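
One way to keep preprocessing identical at training and inference time is to route both through a single function. The sketch below is illustrative only: the `preprocess` helper and its cleaning rules (lowercasing, dropping non-letters) are assumptions inferred from the token samples printed later in this notebook, not the pipeline actually used to build `train.pkl`.

```python
import re

def preprocess(text):
    """Hypothetical shared preprocessing: lowercase, keep letters only, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation with spaces
    return text.split()

# The same function is applied to training utterances and to live queries
train_tokens = preprocess("Expected delivery date is 5th October")
query_tokens = preprocess("I want to book a flight!")
print(train_tokens)
print(query_tokens)
```

Because both paths call the same function, a query token such as `i` can match the lowercase vocabulary built from the training data.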

In [353]:
!pip3 install wandb

In [1]:
import spacy 
import wandb
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()

# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

In [2]:
# Standard 
import collections
import yaml
import re
import os

# Data science
import pandas as pd
print(f"Pandas: {pd.__version__}")
import numpy as np
print(f"Numpy: {np.__version__}")

# Machine Learning
import sklearn
print(f"Sklearn: {sklearn.__version__}")


# Deep Learning
import torch
from torch import nn
import torch.optim as optim

# Visualization 
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)

# Preprocessing and Torch
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# from torchtext.data.utils import get_tokenizer
# from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader, TensorDataset
# from torchtext.vocab import build_vocab_from_iterator
# from torchtext.data import get_tokenizer

# Reading in training data
train = pd.read_pickle('../objects/train.pkl')
print(f'Training data: {train.head()}')

Pandas: 2.2.2
Numpy: 1.26.4
Sklearn: 1.4.2
Training data:                                                track  \
0                                  [no, information]   
1  [issue, is, resolved, and, item, is, being, re...   
2  [expected, delivery, date, is, th, october, tr...   
3  [expected, delivery, date, is, th, october, tr...   
4               [no, emails, no, reason, for, delay]   

                                             support  \
0  [very, poor, feedback, very, disappointing, se...   
1  [already, done, i, am, frankly, fed, up, with,...   
2  [very, poor, feedback, very, disappointing, se...   
3  [can, see, you, have, replied, to, others, who...   
4  [my, issue, is, not, resolved, really, should,...   

                                             quality  \
0   [done, attached, is, the, proof, of, completion]   
1                 [return, pick, up, not, happening]   
2  [your, target, completed, return, policy, expi...   
3  [order, is, lost, no, one, taking, respon

In [3]:
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33msinhasagar507[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [4]:
# Configuration for training
# Change all of the following configurations as per the specifications in the original repo 
# Set a seed value 
seed_value = 12321 

# 1. Set `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED']=str(seed_value)

# 2. Set `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)

# 3. Set `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(seed_value)

# 4. Set `pytorch` pseudo-random generator at a fixed value
torch.manual_seed(seed_value)

<torch._C.Generator at 0x10793be10>

In [5]:
train = pd.melt(train)
train.columns = ["intent", "tokens"]

In [6]:
shuffled_df = train.sample(frac=1).reset_index(drop=True)
shuffled_df 

Unnamed: 0,intent,tokens
0,challenge_robot,"[who, are, you]"
1,challenge_robot,"[are, you, robot]"
2,support,"[i, tried, to, find, a, customer, services, em..."
3,speak_representative,"[let, me, talk, to, apple, support]"
4,challenge_robot,"[who, are, you]"
...,...,...
8995,support,"[really, disappointed, with, your, service, co..."
8996,account,"[does, amazon, no, longer, provide, refund, if..."
8997,account,"[i, should, not, have, to, speak, to, now, a, ..."
8998,goodbye,[thank]


In [7]:
# Print the data types of the columns
print(shuffled_df.dtypes)

# Check that every entry in the "tokens" column is a list and report any entry that is not
# TODO: log these instead of printing them
print(" ")
for index, row in shuffled_df.iterrows():
    if not isinstance(row["tokens"], list):
        print(f"Error: {row['tokens']}")

intent    object
tokens    object
dtype: object
 


In [8]:
X = shuffled_df['tokens'].tolist()
y = shuffled_df['intent'].tolist()

In [9]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/saggysimmba/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/saggysimmba/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Torchtext Preprocessing

### Torchtext tokenizer 
- Add description later 

### Plan of Action
- Prepare the dataset 

In [10]:
%pwd

'/Applications/saggydev/projects_learning/amazon_support/notebooks'

- Steps taken
    - Build a vocabulary dictionary that maps each word to an integer index
    - Convert the words of each sequence into their corresponding indices using that dictionary
    - Keep the sequence length consistent when feeding sentences into the model
    - To achieve this, pad sequences with zeros until they reach the length of the longest sequence
    - Padding ensures uniformity; shorter maximum lengths are typically preferred because longer sequences are harder to train on
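
The steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the helper names (`build_vocab`, `encode`, `pad`) are hypothetical, and it assumes a separate index for unknown words (`UNK_IDX`), whereas the cells below reuse index 0 for both padding and unknown words.

```python
from collections import Counter

PAD_IDX = 0  # assumption: 0 is reserved for padding only
UNK_IDX = 1  # assumption: a dedicated index for words not in the vocabulary

def build_vocab(tokenized_sentences):
    """Map each word to an index, starting at 2 (0 and 1 are reserved)."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {word: i + 2 for i, word in enumerate(counts)}

def encode(sentence, vocab):
    """Replace each token with its index, falling back to UNK_IDX."""
    return [vocab.get(tok, UNK_IDX) for tok in sentence]

def pad(sequences, max_len=None):
    """Right-pad (or truncate) every sequence to the same length."""
    if max_len is None:
        max_len = max(len(s) for s in sequences)
    return [s[:max_len] + [PAD_IDX] * (max_len - len(s)) for s in sequences]

train_sentences = [["who", "are", "you"], ["are", "you", "robot"], ["thank"]]
vocab = build_vocab(train_sentences)

# "hello" was never seen during vocabulary construction, so it maps to UNK_IDX
encoded = [encode(s, vocab) for s in train_sentences + [["hello", "robot"]]]
padded = pad(encoded)
print(padded)
```

Keeping padding and unknown words on different indices lets the model (and any length computation) tell "no token here" apart from "a token I have not seen".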


In [10]:
# `X` holds the token lists and `y` the intent labels built above

# The text is already tokenized, so no further tokenization is needed here

# Split the data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, 
                                                  shuffle=True, stratify=y, random_state=7)

# Label encode the target variable
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)


# Convert encoded targets to PyTorch tensors
y_train_encoded = torch.tensor(y_train_encoded, dtype=torch.long) 
y_val_encoded = torch.tensor(y_val_encoded, dtype=torch.long)

print(f'\nShape checks:\nX_train: {len(X_train)} X_val: {len(X_val)}\ny_train: {len(y_train_encoded)} y_val: {len(y_val_encoded)}')


Shape checks:
X_train: 6300 X_val: 2700
y_train: 6300 y_val: 2700


In [11]:
# Build a vocabulary that maps each word to an integer index
from collections import Counter
word_counts = Counter(token for sentence in X for token in sentence)
vocabulary = {word: i+1 for i, (word, _) in enumerate(word_counts.items())}  # indices start at 1; 0 is reserved for padding
vocab_size = len(vocabulary) + 1  # +1 for index 0, which is shared by padding and unknown words

In [12]:
len(vocabulary)

4630

In [13]:
# Encode sentences as sequences of integers
def encode_sequences(tokenized_sentences, vocab):
    sequences = []
    for sentence in tokenized_sentences:
        sequence = [vocab.get(word, 0) for word in sentence]  # 0 for unknown words
        sequences.append(sequence)
    return sequences

encoded_X_train = encode_sequences(X_train, vocabulary)
encoded_X_val = encode_sequences(X_val, vocabulary)

In [14]:
# Pad sequences to a fixed length: This is something I have just added
from torch.nn.utils.rnn import pad_sequence

# Convert encoded sequences to PyTorch tensors
encoded_X_train_tensors = [torch.tensor(seq) for seq in encoded_X_train]
encoded_X_val_tensors = [torch.tensor(seq) for seq in encoded_X_val]

# Pad sequences
# Set batch_first=True to have the batch dimension first
padded_X_train = pad_sequence(encoded_X_train_tensors, batch_first=True, padding_value=0)
padded_X_val = pad_sequence(encoded_X_val_tensors, batch_first=True, padding_value=0)

In [54]:
padded_X_train.shape

torch.Size([6300, 61])

In [15]:
# Use glove word embeddings 
embeddings_index = {}
f = open("../models/glove.twitter.27B/glove.twitter.27B.100d.txt", "r", encoding="utf-8")
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 1193514 word vectors.


In [16]:
# Create an embedding matrix
embedding_dim = 100
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in vocabulary.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # Words not found in the embedding index keep their all-zero row.
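
Since words missing from GloVe keep an all-zero row, it can be worth checking how much of the vocabulary is actually covered. The helper below is a hypothetical sketch (the name `embedding_coverage` and the toy dictionaries are made up for illustration); in this notebook it would be called with `vocabulary` and `embeddings_index`.

```python
def embedding_coverage(vocabulary, embeddings_index):
    """Return the fraction of vocabulary words with a pretrained vector, plus the missing words."""
    missing = [word for word in vocabulary if word not in embeddings_index]
    covered = len(vocabulary) - len(missing)
    ratio = covered / len(vocabulary) if vocabulary else 0.0
    return ratio, missing

# Toy data standing in for the real vocabulary and GloVe index
toy_vocabulary = {"refund": 1, "delivery": 2, "asdfgh": 3, "order": 4}
toy_embeddings = {"refund": [0.1, 0.2], "delivery": [0.3, 0.4], "order": [0.5, 0.6]}

ratio, missing = embedding_coverage(toy_vocabulary, toy_embeddings)
print(f"{ratio:.0%} covered; missing: {missing}")
```

A low coverage ratio would suggest that many tokens reach the LSTM as zero vectors, which may be a reason to unfreeze the embedding layer or revisit the preprocessing.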

In [18]:
# Just an experimental check
from torch.nn import Embedding  
# embedding_layer = Embedding(num_embeddings=embedding_matrix_tensor.size(0), 
#                             embedding_dim=embedding_matrix_tensor.size(1), 
#                             _weight=embedding_matrix_tensor)

# # Freeze the embedding layer
# embedding_layer.weight.requires_grad = False

In [17]:
# padded_X_train and padded_X_val are already tensors (returned by pad_sequence);
# make sure they are int64, as required by the embedding lookup
padded_X_train_tensor = padded_X_train.long()
padded_X_val_tensor = padded_X_val.long()

In [18]:
seq_len = padded_X_train_tensor.shape[1]

In [19]:
# Embedding layer
embedding_matrix_tensor = torch.FloatTensor(embedding_matrix)
embedding = nn.Embedding(vocab_size, embedding_dim)
embedding.weight = nn.Parameter(embedding_matrix_tensor)
embedding.weight.requires_grad = False  # To not train the embedding layer

In [22]:
# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    project="my-awesome-project",

    # track hyperparameters and run metadata
    config={
    "learning_rate": 0.01,
    "architecture": "LSTM-RNN",
    "dataset": "custom-intent-data",
    "optimizer": "Adam",
    "epochs": 20,
    "batch_size": 32, 
    "embedding_size": 100,
    "hidden_size": 128,
    "output_size": 9,
    "num_layers": 2,
    "dropout": 0.1,
    "eval_metric": "accuracy"
    }
)

In [27]:
class MODEL_EVAL_METRIC:
    accuracy = "accuracy"
    f1_score = "f1_score"
    
class Config: 
    VOCAB_SIZE = 0
    BATCH_SIZE = 32 
    EMB_SIZE = 300 
    OUT_SIZE = 9 # Corresponds to the number of intents
    NUM_FOLDS = 5 
    NUM_EPOCHS = 5
    NUM_WORKERS = 8
    
    # I want to update the pretrained embedding weights during training process 
    # I want to use a pretrained embedding
    OPTIMIZER = "Adam"
    EMB_WT_UPDATE = True
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    MODEL_EVAL_METRIC = MODEL_EVAL_METRIC.accuracy
    FAST_DEV_RUN = False 
    PATIENCE = 6 
    IS_BIDIRECTIONAL = True 
    
     
    # Model hyperparameters
    MODEL_PARAMS = {
        "hidden_size": 128,
        "num_layers": 2,
        "drop_out": 0.4258,
        "lr": 0.000366,
        "weight_decay": 0.00001
    }

In [28]:
# Just an experimental check
# from torch.nn import Embedding  
# embedding_layer = Embedding(num_embeddings=embedding_matrix_tensor.size(0), 
#                             embedding_dim=embedding_matrix_tensor.size(1), 
#                             _weight=embedding_matrix_tensor)

In [29]:
# Enhance the architecture later 
class IntentClassifier(nn.Module):
    
    def __init__(self, seq_len, embedding_dim, hidden_dim, output_dim, embedding_matrix): 
        super().__init__()

        # Embedding layer initialised from the pretrained matrix and kept frozen.
        # The table size comes from the matrix (vocab_size rows), not from seq_len.
        embedding_matrix_tensor = torch.FloatTensor(embedding_matrix)
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix_tensor, freeze=True)
        
        # LSTM layer 
        self.lstm = nn.LSTM(input_size=embedding_dim, # Embedding dim = 100 (GloVe 100d)
                            hidden_size=hidden_dim, # Hidden dim = 128
                            num_layers=wandb.config["num_layers"], # num_layers is 2
                            bidirectional=True, # It's a bidirectional LSTM
                            dropout=wandb.config["dropout"], 
                            batch_first=True)
        
        # The output of this operation should be 
        
        # Dense layers 

        self.fc1 = nn.Linear(hidden_dim*2, 600)  # *2 for bidirectional: the input is 128*2 = 256 and the output is 600
        self.fc2 = nn.Linear(600, 600) # The output of this layer has shape [batch_size, 600]
        
        # Dropout layer
        self.dropout = nn.Dropout(wandb.config["dropout"])  
        
        # Output layer 
        self.out = nn.Linear(600, output_dim) # Note to self: is the output here (batch_size, no_of_classes)? It is [batch_size, output_dim], one logit per intent
        # self.out_2 = nn.Linear(output_dim, 9)
        
    def forward(self, inputs):
        
        # text = [batch_size, embed_length]
        
        # embeddings = self.dropout(self.embedding(inputs))
        
        # embedded = [batch_size, sent_length, emb_dim]

        # if self.embedding_matrix is not None: 
        #     assert self.embeddings.shape == (inputs.shape[0], inputs.shape[1], self.embedding_dim)
         
        # pack_padded_sequence before feeding to the LSTM. This is required so PyTorch knows 
        # which elements of the sequence are padded and ignores them in the computation 
        # Accomplished only after the embedding step 
        # embeds_pack = pack_padded_sequence(embeddings, inputs_lengths, batch_first=True)
        
        # Get the dimensions of the packed sequence 
        # dimensions = embeds_pack.data.size()

        # Assert the shape of input sequence 
        # assert inputs.shape == (Config.BATCH_SIZE, 1000)

        embeddings = self.embedding(inputs)
        # print(f"Embeddings shape: {embeddings.shape}")
        _, (hidden, _) = self.lstm(embeddings)

        # hidden shape: [num_layers*num_directions, batch_size, hidden_dim]
        # print(f"Hidden shape: {hidden.shape}")
        

        
        # Since this is a classification task, we only need the final hidden state, not the per-step LSTM output
        # h_n and c_n = [num_directions * num_layers, batch_size, hidden_size]
        final_hidden_forward = hidden[-2, :, :] # [batch_size, hidden_dim]
        final_hidden_backward = hidden[-1, :, :] # [batch_size, hidden_dim]

        # print(f"Final hidden forward shape: {final_hidden_forward.shape}") # Its shape is
        # print(f"Final hidden backward shape: {final_hidden_backward.shape}")
        
        # Concat the final forward and hidden backward states 
        hidden = torch.cat((final_hidden_forward, final_hidden_backward), dim=1)
        # print(f"Hidden shape after concatenation: {hidden.shape}")
                
        # Dense Linear Layers 
        dense_outputs_1 = self.fc1(hidden)
        dense_outputs_1 = nn.ReLU()(dense_outputs_1)  
        dense_outputs_2 = self.fc2(dense_outputs_1)
        dense_outputs_2 = self.dropout(dense_outputs_2)
        dense_outputs_2 = nn.ReLU()(dense_outputs_2) 

        # Final output classification layer
        # No softmax is applied here: nn.CrossEntropyLoss expects raw logits
        final_output = self.out(dense_outputs_2)
        # print(f"Final output shape: {final_output.shape}")
    
        return final_output
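
The commented-out `pack_padded_sequence` step in `forward` could look roughly like the sketch below. It is a standalone toy example, not a drop-in change to `IntentClassifier`: the layer sizes are arbitrary, and it assumes index 0 is used only for padding (in this notebook 0 also stands for unknown words, so counting non-zero entries would undercount true lengths). Sequences of length zero are not handled.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Toy layers; sizes are arbitrary and unrelated to the notebook's config
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
lstm = nn.LSTM(input_size=4, hidden_size=3, bidirectional=True, batch_first=True)

def final_hidden(padded_batch):
    """Run a padded batch through the LSTM while skipping the padded positions."""
    # True lengths = number of non-padding tokens per row (must live on the CPU)
    lengths = (padded_batch != 0).sum(dim=1).cpu()
    packed = pack_padded_sequence(embedding(padded_batch), lengths,
                                  batch_first=True, enforce_sorted=False)
    _, (hidden, _) = lstm(packed)
    # Concatenate the last forward and backward hidden states
    return torch.cat((hidden[-2], hidden[-1]), dim=1)

# The same two sentences, padded to two different lengths
short_pad = torch.tensor([[5, 2, 7, 0], [3, 1, 0, 0]])
long_pad = torch.tensor([[5, 2, 7, 0, 0, 0, 0], [3, 1, 0, 0, 0, 0, 0]])

with torch.no_grad():
    h_short = final_hidden(short_pad)
    h_long = final_hidden(long_pad)

print(h_short.shape)
print(torch.allclose(h_short, h_long, atol=1e-6))
```

With packing, the final hidden state no longer depends on how much padding was appended, which matters most for the backward direction of a bidirectional LSTM.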

In [32]:
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class ModelTrainer:
    def __init__(self, seq_len, embedding_dim, embedding_matrix):
        self.seq_len = seq_len
        self.embedding_dim = embedding_dim
        self.embedding_matrix = embedding_matrix   
        self.hidden_dim = wandb.config["hidden_size"]
        self.output_dim = wandb.config["output_size"]
        self.n_layers = wandb.config["num_layers"]
        self.batch_size = wandb.config["batch_size"]
        self.epochs = wandb.config["epochs"]
        self.dropout = wandb.config["dropout"]
        # IntentClassifier is defined in the cell above
        # print(self.seq_len, self.embedding_dim, self.hidden_dim, self.output_dim, self.embedding_matrix)
        self.model = IntentClassifier(self.seq_len, self.embedding_dim, self.hidden_dim, self.output_dim, self.embedding_matrix)
        self.criterion = nn.CrossEntropyLoss()
        # Adam optimizer with the learning rate taken from the wandb config
        self.optimizer = optim.Adam(self.model.parameters(), lr=wandb.config["learning_rate"])
        self.epoch_lst = []

    def train(self, X_train, y_train, X_val, y_val):
        # X_train = torch.tensor(X_train, dtype=torch.float)
        # X_val = torch.tensor(X_val, dtype=torch.float)
        # y_train = torch.tensor(y_train, dtype=torch.long)
        # y_val = torch.tensor(y_val, dtype=torch.long)

        # Assuming X_train, y_train, X_val, y_val are already tensors
        # Ensure they have matching first dimensions
        assert X_train.shape[0] == y_train.shape[0], "Training feature and label count mismatch"
        assert X_val.shape[0] == y_val.shape[0], "Validation feature and label count mismatch"
        
        train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=self.batch_size, shuffle=True)
        val_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=self.batch_size)

        train_accuracies_epoch, val_accuracies_epoch = [], []
        self.valid_loss_min = np.inf  # np.Inf was removed in NumPy 2.0

        for epoch in range(self.epochs):
            train_loss, valid_loss = 0.0, 0.0
            correct, total = 0, 0

            self.model.train()
            for data, target in train_loader:
                # Log the shape of the data and target tensors
                # assert data.shape == (self.batch_size, self.embedding_dim), f"Data shape mismatch: {data.shape}"
                # assert target.shape == (self.batch_size,), f"Target shape mismatch: {target.shape}"
                self.optimizer.zero_grad()
                output = self.model(data)
                loss = self.criterion(output, target)
                loss.backward()
                self.optimizer.step()

                # print(output.shape)
                pred_labels = torch.argmax(output, 1)
                correct += (pred_labels == target).sum().item()
                total += target.size(0)
                train_loss += loss.item() * data.size(0)


            train_accuracy = 100 * correct / total
            train_accuracies_epoch.append(train_accuracy)

            # Log the training loss and accuracy
            # wandb.log({"Training Accuracy": train_accuracy, "Training Loss": train_loss})

            self.model.eval()
            correct, total = 0, 0
            with torch.no_grad():  # gradients are not needed for validation
                for data, target in val_loader:
                    output = self.model(data)
                    loss = self.criterion(output, target)

                    pred_labels = torch.argmax(output, 1)
                    correct += (pred_labels == target).sum().item()
                    total += target.size(0)
                    valid_loss += loss.item() * data.size(0)

            valid_accuracy = 100 * correct / total
            val_accuracies_epoch.append(valid_accuracy)

            # Log the validation loss and accuracy
            # print(f"Epoch: {epoch+1}/{self.epochs}.. Training Accuracy: {train_accuracy:.3f}.. Validation Accuracy: {valid_accuracy:.3f}")

            # Log epoch-wise accuracies
            wandb.log({"epoch": epoch, "Training Accuracy": train_accuracy, "Validation Accuracy": valid_accuracy, "Training Loss": train_loss, "Validation Loss": valid_loss})

            if valid_loss <= self.valid_loss_min:
                print(f"Validation loss decreased ({self.valid_loss_min:.3f} --> {valid_loss:.3f}). Saving model...")
                
                # Log the model and its parameters 
                # wandb.log_artifact(self.model)

                torch.save(self.model.state_dict(), "../models/intent_classification_model.pt")
                self.valid_loss_min = valid_loss

            self.epoch_lst.append(epoch + 1)

### Things I Need to Add
- WandB table
- Log artifact (model)
- For now, include all the basic elements (then we can improve upon this in the future)
- Ability to track across multiple hyperparameters
- Set the configuration after the run is complete
- Sweeps (...) and improvements
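
For the sweeps item, a W&B sweep is described by a configuration dictionary. The sketch below is only a starting point under stated assumptions: the search method and value grids are made up, `run_training` is a hypothetical function that would wrap `wandb.init()` and `ModelTrainer`, and the metric name is chosen to match the `"Validation Accuracy"` key logged in the trainer above.

```python
# Hypothetical sweep configuration; the value grids are illustrative, not tuned
sweep_config = {
    "method": "random",
    "metric": {"name": "Validation Accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [0.01, 0.001, 0.0003]},
        "hidden_size": {"values": [64, 128, 256]},
        "dropout": {"values": [0.1, 0.3, 0.5]},
        "batch_size": {"values": [32, 64]},
    },
}

# Launching the sweep requires a logged-in wandb session, so it is left commented out:
# sweep_id = wandb.sweep(sweep_config, project="my-awesome-project")
# wandb.agent(sweep_id, function=run_training, count=10)  # run_training is hypothetical
```

Each agent run would then read its hyperparameters from `wandb.config`, which is how the model and trainer in this notebook already access them.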

In [None]:
# Train the model
trainer = ModelTrainer(seq_len, embedding_dim, embedding_matrix)
train_features, val_features = padded_X_train, padded_X_val
trainer.train(train_features, y_train_encoded, val_features, y_val_encoded)

### Plot the data and related information 

In [46]:
wandb.config

{'learning_rate': 0.01, 'architecture': 'LSTM-RNN', 'dataset': 'custom-intent-data', 'optimizer': 'Adam', 'epochs': 20, 'batch_size': 32, 'embedding_size': 100, 'hidden_size': 128, 'output_size': 9, 'num_layers': 2, 'dropout': 0.1, 'eval_metric': 'accuracy'}

In [86]:
# Load the trained model
model = IntentClassifier(seq_len, wandb.config["embedding_size"], wandb.config["hidden_size"], wandb.config["output_size"], embedding_matrix)
model.load_state_dict(torch.load("../models/intent_classification_model.pt"))
model.eval()

def inference(text):
    """
    Preprocess the input text and run it through the trained model.

    Relies on the notebook-level `model`, `vocabulary`, and `seq_len`.

    Parameters:
    - text: The input text string.

    Returns:
    - logits: A [1, num_classes] tensor of raw class scores; take the argmax for the predicted label index.
    """
    # Preprocess the text
    tokens = text.split()
    indices = [vocabulary.get(token, 0) for token in tokens]  # Use 0 for unknown words
    padded_indices = indices[:seq_len] + [0] * max(0, seq_len - len(indices))  # Truncate or pad with zeros
    input_tensor = torch.tensor(padded_indices).unsqueeze(0)  # Add batch dimension
    print(input_tensor.shape)
    
    # Perform inference
    with torch.no_grad():
        logits = model(input_tensor)
    
    return logits


In [87]:
inference("I want to book a flight")

torch.Size([1, 61])


tensor([[-1.4125, -3.6488, -1.7868, -3.2215, -5.8661,  1.0337, -4.6215, -0.6400,
         -0.6101]])

In [76]:
original_label = label_encoder.inverse_transform([5])
original_label

array(['quality'], dtype='<U20')
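
To show how the raw scores above turn into a prediction, the sketch below applies a softmax and an argmax in plain Python. The logit values are copied from the `inference` output printed earlier; the `softmax` helper is written here for illustration only. In the notebook itself, `torch.argmax(output, dim=1)` followed by `label_encoder.inverse_transform` would do the same job.

```python
import math

# Logits copied from the inference output above
logits = [-1.4125, -3.6488, -1.7868, -3.2215, -5.8661, 1.0337, -4.6215, -0.6400, -0.6101]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
pred_index = max(range(len(probs)), key=probs.__getitem__)
print(pred_index)
```

The largest logit sits at index 5, which is the index decoded to `quality` by `label_encoder.inverse_transform([5])` above.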