# Problem Statement

You are building a customer support chatbot for a retail company that sells products online. The goal of the chatbot is to assist customers in multiple ways, including answering product-related queries, tracking orders, handling refunds, and providing general information about store policies.

Each customer query can have multiple intents, such as requesting information about a product and also asking about its availability. The chatbot should be able to classify these queries into one or more intents simultaneously. For example, the query "What are the features of the latest phone, and can I return it?" has two intents: one related to product information and the other related to returns.

Objective:
Create a model that can classify a given customer query into one or more intents from the following categories:

- Product Inquiry - Queries related to product details (e.g., features, pricing, availability).

- Order Tracking - Queries related to tracking orders (e.g., "Where is my order?").

- Refund Request - Queries related to requesting a refund (e.g., "How do I return this product?").

- Store Policy - Queries related to the store’s policies (e.g., return policies, delivery times).

The model should assign one or more of these intents to each query.
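
Because a query can carry several intents at once, each training target is a multi-hot vector rather than a single class id. The sketch below shows one way to build that vector; the names `INTENT_LABELS` and `encode_intents` are illustrative and do not appear elsewhere in this notebook.

```python
# Fixed label order; each position in the target vector maps to one intent.
INTENT_LABELS = ["Product Inquiry", "Order Tracking", "Refund Request", "Store Policy"]


def encode_intents(intents):
    """Return a multi-hot list with a 1 for every intent present in `intents`."""
    unknown = set(intents) - set(INTENT_LABELS)
    if unknown:
        raise ValueError(f"Unknown intents: {sorted(unknown)}")
    return [int(label in intents) for label in INTENT_LABELS]


# "What are the features of the latest phone, and can I return it?"
example = encode_intents(["Product Inquiry", "Refund Request"])
print(example)  # [1, 0, 1, 0]
```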

In [1]:
import re
import pandas as pd
import numpy as np

import random
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import hamming_loss, precision_score, recall_score, f1_score, accuracy_score

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, GlobalMaxPooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam

import joblib

In [2]:
intents = {
    "Product Inquiry": [
        "What are the features of this laptop?",
        "Is this phone available?",
        "What is the price of the new headphones?",
        "Do you have this product in stock?",
        "Expected avaliability date for the product",
        "what are the different color options that are avaliable for the product?",
        "Help me with the products that have discounts",
    ],
    "Order Tracking": [
        "Where is my order?",
        "How long will delivery take?",
        "Can you provide the tracking details?",
        "I want to check the status of my shipment.",
        "There is delay in the order delivery, can you please let me know the reason",
        "System shows that order is delivered but I have not received any order",
        "I've been waiting for the order long time"
    ],
    "Refund Request": [
        "How do I get a refund?",
        "Can I return my order?",
        "What is the process for a refund?",
        "Can I cancel my order and get a refund?",
        "It's been long time since I have raised the refund, but amount is not credited",
        "When I can expect the refund to be processed",
        "I don't want this product anymore",
        "Product I received is different from the one that I placed order, need help with refund",
    ],
    "Store Policy": [
        "What is your return policy?",
        "Do you offer free shipping?",
        "Can you explain your warranty terms?",
        "What are the delivery charges?",
        "What are the options for free delivery",
    ],
}

In [3]:
data = []
# Build 700 synthetic queries; each joins one to three randomly chosen intents with " and ".
for _ in range(700):
    selected_intents = random.sample(list(intents.keys()), k=random.randint(1, 3))
    combined_query = " and ".join(random.choice(intents[intent]) for intent in selected_intents)
    record = {
        "query": combined_query,
        "Product Inquiry": int("Product Inquiry" in selected_intents),
        "Order Tracking": int("Order Tracking" in selected_intents),
        "Refund Request": int("Refund Request" in selected_intents),
        "Store Policy": int("Store Policy" in selected_intents),
    }
    data.append(record)
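
The generation loop above draws from the global `random` state without a seed, so each run produces a different dataset. The following is a minimal sketch of a seeded variant, assuming reproducibility is wanted; `generate_records`, its parameters, and `toy_intents` are hypothetical names introduced only for illustration.

```python
import random


def generate_records(intents, n_records, seed=42, max_intents=3):
    """Build synthetic multi-intent records using a private, seeded RNG."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    labels = list(intents)
    records = []
    for _ in range(n_records):
        k = rng.randint(1, min(max_intents, len(labels)))
        chosen = rng.sample(labels, k=k)
        query = " and ".join(rng.choice(intents[name]) for name in chosen)
        record = {"query": query}
        record.update({name: int(name in chosen) for name in labels})
        records.append(record)
    return records


toy_intents = {  # hypothetical miniature of the intents dictionary
    "Order Tracking": ["Where is my order?"],
    "Refund Request": ["How do I get a refund?"],
    "Store Policy": ["What is your return policy?"],
}
first = generate_records(toy_intents, 5)
second = generate_records(toy_intents, 5)
print(first == second)  # True: the same seed gives the same records
```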

In [4]:
data = pd.DataFrame(data)
data.to_csv("expanded_multi_intent_data.csv", index=False)

In [5]:
data

| | query | Product Inquiry | Order Tracking | Refund Request | Store Policy |
|---|---|---|---|---|---|
| 0 | Can I cancel my order and get a refund? and I ... | 1 | 1 | 1 | 0 |
| 1 | What are the features of this laptop? and What... | 1 | 0 | 0 | 1 |
| 2 | What are the features of this laptop? and It's... | 1 | 0 | 1 | 0 |
| 3 | How do I get a refund? | 0 | 0 | 1 | 0 |
| 4 | I've been waiting for the order long time and ... | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... |
| 695 | Can you explain your warranty terms? and Is th... | 1 | 0 | 1 | 1 |
| 696 | Help me with the products that have discounts | 1 | 0 | 0 | 0 |
| 697 | It's been long time since I have raised the re... | 0 | 0 | 1 | 0 |
| 698 | Expected avaliability date for the product and... | 1 | 1 | 1 | 0 |


In [6]:
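# Requires the NLTK "punkt", "stopwords", and "wordnet" resources (install once with nltk.download;
# newer NLTK releases may ask for "punkt_tab" instead of "punkt").
lemmatizer = WordNetLemmatizer()
default_stop_words = set(stopwords.words("english"))
# Keep "and" because it is the separator between intents in the combined queries.
stop_words = default_stop_words - {"and"}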

In [7]:
def expand_contractions(text):
    # Keys are lowercase because preprocess_query lowercases the text before calling this function.
    contractions_dict = {
        "don't": "donot",
        "can't": "cannot",
        "won't": "willnot",
        "isn't": "isnot",
        "aren't": "arenot",
        "didn't": "didnot",
        "hasn't": "hasnot",
        "haven't": "havenot",
        "wasn't": "wasnot",
        "weren't": "werenot",
        "shouldn't": "shouldnot",
        "couldn't": "couldnot",
        "wouldn't": "wouldnot",
        "i've": "i have",
        "you've": "you have",
        "they've": "they have",
        "we've": "we have",
        "i'd": "i would",
        "you'd": "you would",
        "he'd": "he would",
        "she'd": "she would",
        "that'll": "that will",
    }
    # Match case-insensitively and look up the lowercased match, so mixed-case input also works.
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, contractions_dict)) + r')\b', flags=re.IGNORECASE)
    return pattern.sub(lambda match: contractions_dict[match.group().lower()], text)

In [8]:
def preprocess_query(query):
    query = query.lower()
    query = expand_contractions(query)
    tokens = word_tokenize(query)
    tokens = [word for word in tokens if word.isalnum()]
    filtered_tokens = [word for word in tokens if word not in stop_words]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    cleaned_query = " ".join(lemmatized_tokens)
    return cleaned_query
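
`preprocess_query` lowercases the text before it calls `expand_contractions`, so any contraction lookup has to match lowercase forms. The sketch below checks that ordering on a hypothetical two-entry table; `CONTRACTIONS`, `PATTERN`, and `expand` are illustrative names, not the notebook's own.

```python
import re

# Hypothetical subset of a contraction table, with lowercase keys.
CONTRACTIONS = {"i've": "i have", "don't": "donot"}
PATTERN = re.compile(r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b")


def expand(text):
    """Replace each known contraction in already-lowercased text."""
    return PATTERN.sub(lambda match: CONTRACTIONS[match.group()], text)


lowered = "I've waited and I don't want it".lower()
print(expand(lowered))  # i have waited and i donot want it
```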

In [9]:
data["cleaned_query"] = data["query"].apply(preprocess_query)

In [10]:
data

| | query | Product Inquiry | Order Tracking | Refund Request | Store Policy | cleaned_query |
|---|---|---|---|---|---|---|
| 0 | Can I cancel my order and get a refund? and I ... | 1 | 1 | 1 | 0 | cancel order and get refund and want check sta... |
| 1 | What are the features of this laptop? and What... | 1 | 0 | 0 | 1 | feature laptop and option free delivery |
| 2 | What are the features of this laptop? and It's... | 1 | 0 | 1 | 0 | feature laptop and long time since raised refu... |
| 3 | How do I get a refund? | 0 | 0 | 1 | 0 | get refund |
| 4 | I've been waiting for the order long time and ... | 0 | 1 | 0 | 1 | waiting order long time and option free delivery |
| ... | ... | ... | ... | ... | ... | ... |
| 695 | Can you explain your warranty terms? and Is th... | 1 | 0 | 1 | 1 | explain warranty term and phone available and ... |
| 696 | Help me with the products that have discounts | 1 | 0 | 0 | 0 | help product discount |
| 697 | It's been long time since I have raised the re... | 0 | 0 | 1 | 0 | long time since raised refund amount credited |
| 698 | Expected avaliability date for the product and... | 1 | 1 | 1 | 0 | expected avaliability date product and provide... |


In [11]:
data = data.drop(columns=["query"])
data = data[["cleaned_query"] + [col for col in data.columns if col != "cleaned_query"]]

In [12]:
data

| | cleaned_query | Product Inquiry | Order Tracking | Refund Request | Store Policy |
|---|---|---|---|---|---|
| 0 | cancel order and get refund and want check sta... | 1 | 1 | 1 | 0 |
| 1 | feature laptop and option free delivery | 1 | 0 | 0 | 1 |
| 2 | feature laptop and long time since raised refu... | 1 | 0 | 1 | 0 |
| 3 | get refund | 0 | 0 | 1 | 0 |
| 4 | waiting order long time and option free delivery | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... |
| 695 | explain warranty term and phone available and ... | 1 | 0 | 1 | 1 |
| 696 | help product discount | 1 | 0 | 0 | 0 |
| 697 | long time since raised refund amount credited | 0 | 0 | 1 | 0 |
| 698 | expected avaliability date product and provide... | 1 | 1 | 1 | 0 |


In [13]:
X = data["cleaned_query"]
y = data.drop(columns=["cleaned_query"])

In [14]:
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X)
X = tokenizer.texts_to_sequences(X)
X = pad_sequences(X, maxlen=100)
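
To make the tokenize-and-pad step concrete, here is a plain-Python approximation of what `texts_to_sequences` followed by `pad_sequences` produces with their default settings: words are indexed by descending frequency starting at 1, and shorter sequences are left-padded with zeros. It is a simplified sketch (whitespace splitting, no `num_words` cap), and the function names are illustrative.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer id, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda word: -counts[word])  # stable for ties
    return {word: index for index, word in enumerate(ranked, start=1)}


def to_padded_sequences(texts, word_index, maxlen):
    """Convert texts to id lists, keep the last `maxlen` ids, and left-pad with zeros."""
    rows = []
    for text in texts:
        ids = [word_index[word] for word in text.split() if word in word_index]
        ids = ids[-maxlen:]
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows


texts = ["get refund", "feature laptop and get refund"]
word_index = fit_word_index(texts)
padded = to_padded_sequences(texts, word_index, maxlen=6)
print(padded)  # [[0, 0, 0, 0, 1, 2], [0, 3, 4, 5, 1, 2]]
```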

In [15]:
joblib.dump(tokenizer, 'tokenizer.pkl')

['tokenizer.pkl']

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
input_layer = Input(shape=(100,))
embedding_layer = Embedding(input_dim=10000, output_dim=128)(input_layer)
lstm_layer = LSTM(128, return_sequences=True)(embedding_layer)
global_max_pooling_layer = GlobalMaxPooling1D()(lstm_layer)
dense_layer = Dense(64, activation="relu")(global_max_pooling_layer)
output_layer = Dense(y.shape[1], activation="sigmoid")(dense_layer)

In [18]:
model = Model(inputs=input_layer, outputs=output_layer)

In [19]:
model.summary()
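
The output of `model.summary()` is not shown here, so the following is a back-of-envelope parameter count for the layers defined above. It assumes the standard Keras LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias); the helper functions are illustrative.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, units):
    return inputs * units + units


layer_params = {
    "embedding": embedding_params(10000, 128),
    "lstm": lstm_params(128, 128),
    "global_max_pooling": 0,  # pooling has no trainable weights
    "dense": dense_params(128, 64),
    "output": dense_params(64, 4),
}
total_params = sum(layer_params.values())
print(layer_params)
print(total_params)  # 1420100
```

Almost all of the weights sit in the embedding table, which is sized for a 10,000-word vocabulary.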

In [20]:
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])
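
The training log reports a validation accuracy of 0.6000 after epoch 5, while the evaluation cell reports a subset accuracy of about 0.857 on the same test split. A per-label accuracy can never be lower than subset accuracy, so the logged figure is probably measuring something else. One plausible explanation, offered as an assumption and not a confirmed diagnosis, is that Keras resolves the string `'accuracy'` to categorical (argmax) accuracy when the output layer has more than one unit. The sketch below contrasts the two definitions in plain Python; the function names are illustrative.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of individual label slots predicted correctly."""
    hits = total = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for truth, prob in zip(true_row, prob_row):
            hits += int(truth == int(prob > threshold))
            total += 1
    return hits / total


def categorical_accuracy(y_true, y_prob):
    """Fraction of rows where the argmax of the target equals the argmax of the prediction."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_prob))
    return matches / len(y_true)


y_true = [[1, 0, 1, 0], [0, 1, 0, 1]]
y_prob = [[0.8, 0.1, 0.9, 0.2], [0.1, 0.7, 0.2, 0.9]]
print(binary_accuracy(y_true, y_prob))       # 1.0
print(categorical_accuracy(y_true, y_prob))  # 0.0
```

In this toy case every label is predicted correctly, yet the argmax-based score is zero, which is why the multi-label metrics computed later are the more informative ones.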

In [21]:
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))

Epoch 1/5
18/18 ━━━━━━━━━━━━━━━━━━━━ 4s 112ms/step - accuracy: 0.1192 - loss: 0.6897 - val_accuracy: 0.1643 - val_loss: 0.6653
Epoch 2/5
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 96ms/step - accuracy: 0.2088 - loss: 0.6462 - val_accuracy: 0.3357 - val_loss: 0.5854
Epoch 3/5
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 96ms/step - accuracy: 0.3142 - loss: 0.5772 - val_accuracy: 0.3071 - val_loss: 0.4948
Epoch 4/5
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 98ms/step - accuracy: 0.4677 - loss: 0.4504 - val_accuracy: 0.5214 - val_loss: 0.3001
Epoch 5/5
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 127ms/step - accuracy: 0.5845 - loss: 0.2547 - val_accuracy: 0.6000 - val_loss: 0.1465


<keras.src.callbacks.history.History at 0x2f33ed856a0>

In [22]:
y_pred = model.predict(X_test)

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 71ms/step


In [23]:
y_pred_binary = (y_pred > 0.5)
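
Thresholding turns each row of sigmoid outputs into a set of predicted intents. As a minimal sketch, the helper below maps one probability vector back to intent names, assuming the columns follow the label order of the DataFrame; `INTENT_LABELS` and `decode_intents` are illustrative names.

```python
# Assumed to match the column order of the label DataFrame.
INTENT_LABELS = ["Product Inquiry", "Order Tracking", "Refund Request", "Store Policy"]


def decode_intents(probabilities, threshold=0.5):
    """Return the names of all intents whose probability exceeds the threshold."""
    return [label for label, prob in zip(INTENT_LABELS, probabilities) if prob > threshold]


print(decode_intents([0.91, 0.03, 0.77, 0.12]))  # ['Product Inquiry', 'Refund Request']
print(decode_intents([0.10, 0.20, 0.30, 0.40]))  # []
```

An empty result means no intent cleared the threshold; a deployed chatbot would likely need a fallback for that case.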

In [24]:
hamming = hamming_loss(y_test, y_pred_binary)
precision = precision_score(y_test, y_pred_binary, average="micro")
recall = recall_score(y_test, y_pred_binary, average="micro")
f1 = f1_score(y_test, y_pred_binary, average="micro")
subset_accuracy = accuracy_score(y_test, y_pred_binary)

In [25]:
print(f"Hamming Loss: {hamming}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Subset Accuracy: {subset_accuracy}")

Hamming Loss: 0.0375
Precision: 0.9794520547945206
Recall: 0.9501661129568106
F1 Score: 0.9645868465430016
Subset Accuracy: 0.8571428571428571
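
To show what these numbers measure, the sketch below recomputes the same five metrics from scratch on a small hand-made example. It follows the usual definitions (micro-averaging pools true positives, false positives, and false negatives over all label slots); the function name and the toy data are illustrative, not taken from the notebook's test set.

```python
def multilabel_metrics(y_true, y_pred):
    """Compute Hamming loss, micro precision/recall/F1, and subset accuracy."""
    tp = fp = fn = wrong = exact = 0
    for true_row, pred_row in zip(y_true, y_pred):
        exact += int(list(true_row) == list(pred_row))  # every label must match
        for truth, pred in zip(true_row, pred_row):
            tp += int(truth == 1 and pred == 1)
            fp += int(truth == 0 and pred == 1)
            fn += int(truth == 1 and pred == 0)
            wrong += int(truth != pred)
    n_slots = len(y_true) * len(y_true[0])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "hamming_loss": wrong / n_slots,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "subset_accuracy": exact / len(y_true),
    }


y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

Here two of twelve label slots are wrong, so the Hamming loss is 2/12, while only one of three rows matches exactly, which is why subset accuracy is the stricter figure.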


In [26]:
embedding_weights = model.layers[1].get_weights()[0]
np.save("embeddings.npy", embedding_weights)

In [27]:
joblib.dump(model, 'multi_label_intent_model.pkl')

['multi_label_intent_model.pkl']

In [28]:
model.save('multi_label_intent_model.h5')


