# Chatbot using NLP and Deep Learning

## Importing Dependencies

In [68]:
import nltk
import json
import numpy as np
import torch
import torch.nn as nn
import random
from nltk.stem.porter import PorterStemmer
from torch.utils.data import Dataset, DataLoader

In [69]:
# Run once to download the tokenizer data:
# nltk.download('punkt_tab')

The tokenizer data (`punkt_tab`) only has to be downloaded once. Uncomment the line above the first time you run this notebook.

Our NLP preprocessing pipeline for creating the training data:

1. Tokenization
2. Lowercasing and Stemming
3. Removing Punctuation
4. Bag of Words

## Tokenization

Tokenization is the process of breaking text into smaller units, called **tokens** (words, subwords, or sentences), which are used as input for NLP models.  

For example, in the sentence:  
*"Machine Learning is amazing!"*  

**Word tokenization** → `["Machine", "Learning", "is", "amazing", "!"]`  

It helps models understand and process text efficiently.

In [70]:
def tokenize(sentence):
    return nltk.word_tokenize(sentence)

We are using the `word_tokenize` method from NLTK to split the given sentence into tokens.

In [71]:
a = "Hi! How are you doing?"
print(a)

Hi! How are you doing?


In [72]:
a=tokenize(a)
print(a)

['Hi', '!', 'How', 'are', 'you', 'doing', '?']


The list above contains the tokens found in the sentence. Note that punctuation marks become separate tokens.
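To build intuition for what word-level tokenization does, here is a rough approximation that uses only the standard library. The function name `simple_tokenize` and the regular expression are illustrative assumptions; this is not how `nltk.word_tokenize` is implemented, and the two differ on contractions (NLTK splits "That's" into `That` and `'s`, while this sketch produces three pieces).

```python
import re

def simple_tokenize(sentence):
    """Split a sentence into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one character that is neither word nor space
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenize("Hi! How are you doing?")
print(tokens)  # ['Hi', '!', 'How', 'are', 'you', 'doing', '?']
```

For simple sentences like this one the result matches the NLTK output shown above, which is why the punctuation has to be filtered out in a later step.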

## Stemming and Lowering

**Stemming** is the process of reducing words to a root form by stripping affixes with heuristic rules, without considering the actual meaning of the word. The Porter stemmer used here strips suffixes.  

For example:  
- **"running" → "run"**  
- **"flies" → "fli"** (not a real word, but typical of stemming)  
- **"better" → "better"** (unchanged, because the stemmer applies suffix rules rather than dictionary knowledge, so it does not map the word to "good")  
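The examples above can be checked directly. This is a small sketch that only assumes NLTK is installed; the Porter stemmer is rule-based and needs no extra data download. The variable names are illustrative.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
examples = ["running", "flies", "better"]

# Map each word to its stem so the pairs are easy to compare
stems = {word: stemmer.stem(word) for word in examples}
print(stems)
```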

In [73]:
stemmer = PorterStemmer()

In [74]:
def stemming(word):
    return stemmer.stem(word.lower())

In [75]:
words = ['Organized', 'Organizer', 'Organizing']
print(words)

['Organized', 'Organizer', 'Organizing']


In [76]:
stemmed_words = [stemming(w) for w in words]
print(stemmed_words)

['organ', 'organ', 'organ']


The output above shows the stemmed versions of the listed words; all three reduce to the same stem, `organ`.

## Creating train data

In [77]:
with open('../Datasets/intents.json', 'r') as f:
    train_data = json.load(f)

In [78]:
train_data

{'intents': [{'tag': 'greeting',
   'patterns': ['Hi',
    'Hey',
    'How are you',
    'Is anyone there?',
    'Hello',
    'Good day'],
   'responses': ['Hey :-)',
    'Hello, thanks for visiting',
    'Hi there, what can I do for you?',
    'Hi there, how can I help?']},
  {'tag': 'goodbye',
   'patterns': ['Bye', 'See you later', 'Goodbye'],
   'responses': ['See you later, thanks for visiting',
    'Have a nice day',
    'Bye! Come back again soon.']},
  {'tag': 'thanks',
   'patterns': ['Thanks', 'Thank you', "That's helpful", "Thank's a lot!"],
   'responses': ['Happy to help!', 'Any time!', 'My pleasure']},
  {'tag': 'items',
   'patterns': ['Which items do you have?',
    'What kinds of items are there?',
    'What do you sell?'],
   'responses': ['We sell coffee and tea', 'We have coffee and tea']},
  {'tag': 'payments',
   'patterns': ['Do you take credit cards?',
    'Do you accept Mastercard?',
    'Can I pay with Paypal?',
    'Are you cash only?'],
   'responses': ['We 

To build the bag of words, we first need to collect all the words present in the training data.

In [79]:
all_words = []    # List of all the tokens
tags = []         # List of all the tags in the training data
xy = []           # Pairs of (tokenized pattern, tag)

### Tokenizing Training Data

In [80]:
for intent in train_data['intents']:
    tag = intent['tag']
    tags.append(tag)
    for pattern in intent['patterns']:
        w = tokenize(pattern)
        all_words.extend(w)    #We use the `extend` function because we want to avoid a list of lists and instead gather 
                               #all the tokens from the pattern into a single list.
        xy.append((w, tag))

In [81]:
print(all_words)

['Hi', 'Hey', 'How', 'are', 'you', 'Is', 'anyone', 'there', '?', 'Hello', 'Good', 'day', 'Bye', 'See', 'you', 'later', 'Goodbye', 'Thanks', 'Thank', 'you', 'That', "'s", 'helpful', 'Thank', "'s", 'a', 'lot', '!', 'Which', 'items', 'do', 'you', 'have', '?', 'What', 'kinds', 'of', 'items', 'are', 'there', '?', 'What', 'do', 'you', 'sell', '?', 'Do', 'you', 'take', 'credit', 'cards', '?', 'Do', 'you', 'accept', 'Mastercard', '?', 'Can', 'I', 'pay', 'with', 'Paypal', '?', 'Are', 'you', 'cash', 'only', '?', 'How', 'long', 'does', 'delivery', 'take', '?', 'How', 'long', 'does', 'shipping', 'take', '?', 'When', 'do', 'I', 'get', 'my', 'delivery', '?', 'Tell', 'me', 'a', 'joke', '!', 'Tell', 'me', 'something', 'funny', '!', 'Do', 'you', 'know', 'a', 'joke', '?']


In [82]:
len(all_words)

103

There are 103 tokens in total in our training set

### Adding Punctuation to Ignore

In [83]:
ignore_words = ['@', '#', '$', '%', '&', '*', '(', ')', '?', '!', '.', ',']

### Stemming and Removing Punctuation

In [84]:
all_words = [stemming(word) for word in all_words if word not in ignore_words]

In [85]:
print(all_words)

['hi', 'hey', 'how', 'are', 'you', 'is', 'anyon', 'there', 'hello', 'good', 'day', 'bye', 'see', 'you', 'later', 'goodby', 'thank', 'thank', 'you', 'that', "'s", 'help', 'thank', "'s", 'a', 'lot', 'which', 'item', 'do', 'you', 'have', 'what', 'kind', 'of', 'item', 'are', 'there', 'what', 'do', 'you', 'sell', 'do', 'you', 'take', 'credit', 'card', 'do', 'you', 'accept', 'mastercard', 'can', 'i', 'pay', 'with', 'paypal', 'are', 'you', 'cash', 'onli', 'how', 'long', 'doe', 'deliveri', 'take', 'how', 'long', 'doe', 'ship', 'take', 'when', 'do', 'i', 'get', 'my', 'deliveri', 'tell', 'me', 'a', 'joke', 'tell', 'me', 'someth', 'funni', 'do', 'you', 'know', 'a', 'joke']


In [86]:
len(all_words)

88

The list shrinks from 103 to 88 tokens because the 15 punctuation tokens (`?` and `!`) are filtered out. The remaining words are lowercased and reduced to their stems.

In [87]:
all_words = sorted(set(all_words))

Stemming can produce the same root for several words, and common words appear in many patterns, so the token list contains duplicates. To handle this, we convert the tokens into a set to remove duplicates and then use the `sorted` function to get a list with a fixed, reproducible order.
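A minimal sketch of this step on a handful of stemmed tokens taken from the output above (the variable names are illustrative):

```python
stemmed_tokens = ["thank", "thank", "you", "that", "'s", "help", "thank", "'s"]

# set() drops duplicates but has no defined order; sorted() restores a stable one
vocabulary = sorted(set(stemmed_tokens))
print(vocabulary)  # ["'s", 'help', 'thank', 'that', 'you']
```

The stable order matters because each position in the vocabulary later becomes one feature of the bag-of-words vector.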

In [88]:
print(all_words)

["'s", 'a', 'accept', 'anyon', 'are', 'bye', 'can', 'card', 'cash', 'credit', 'day', 'deliveri', 'do', 'doe', 'funni', 'get', 'good', 'goodby', 'have', 'hello', 'help', 'hey', 'hi', 'how', 'i', 'is', 'item', 'joke', 'kind', 'know', 'later', 'long', 'lot', 'mastercard', 'me', 'my', 'of', 'onli', 'pay', 'paypal', 'see', 'sell', 'ship', 'someth', 'take', 'tell', 'thank', 'that', 'there', 'what', 'when', 'which', 'with', 'you']


In [89]:
len(all_words)

54

The length drops from 88 to 54 after removing duplicates.

In [90]:
tags = sorted(set(tags))

In [91]:
tags

['delivery', 'funny', 'goodbye', 'greeting', 'items', 'payments', 'thanks']

In [92]:
xy

[(['Hi'], 'greeting'),
 (['Hey'], 'greeting'),
 (['How', 'are', 'you'], 'greeting'),
 (['Is', 'anyone', 'there', '?'], 'greeting'),
 (['Hello'], 'greeting'),
 (['Good', 'day'], 'greeting'),
 (['Bye'], 'goodbye'),
 (['See', 'you', 'later'], 'goodbye'),
 (['Goodbye'], 'goodbye'),
 (['Thanks'], 'thanks'),
 (['Thank', 'you'], 'thanks'),
 (['That', "'s", 'helpful'], 'thanks'),
 (['Thank', "'s", 'a', 'lot', '!'], 'thanks'),
 (['Which', 'items', 'do', 'you', 'have', '?'], 'items'),
 (['What', 'kinds', 'of', 'items', 'are', 'there', '?'], 'items'),
 (['What', 'do', 'you', 'sell', '?'], 'items'),
 (['Do', 'you', 'take', 'credit', 'cards', '?'], 'payments'),
 (['Do', 'you', 'accept', 'Mastercard', '?'], 'payments'),
 (['Can', 'I', 'pay', 'with', 'Paypal', '?'], 'payments'),
 (['Are', 'you', 'cash', 'only', '?'], 'payments'),
 (['How', 'long', 'does', 'delivery', 'take', '?'], 'delivery'),
 (['How', 'long', 'does', 'shipping', 'take', '?'], 'delivery'),
 (['When', 'do', 'I', 'get', 'my', 'deliver

### Creating Bag of Words (BoW)

In [93]:
X_train = []
y_train = []

In [94]:
def bag_of_words(pattern_sequences, all_words):
    # Stem the incoming tokens so they match the stemmed vocabulary
    tokenized_sentence = [stemming(w) for w in pattern_sequences]
    bow = np.zeros(len(all_words), dtype=np.float32)
    # Binary encoding: 1 if the vocabulary word occurs in the sentence, else 0
    for idx, w in enumerate(all_words):
        if w in tokenized_sentence:
            bow[idx] = 1
    return bow

In [95]:
for (pattern_sentence, tag) in xy:
    bag = bag_of_words(pattern_sentence, all_words)
    X_train.append(bag)
    label = tags.index(tag)
    y_train.append(label)
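To make the encoding concrete, here is a small self-contained sketch with a five-word toy vocabulary. The function `binary_bag_of_words` is a hypothetical stand-in for `bag_of_words` that assumes its input is already stemmed, so it runs without NLTK.

```python
import numpy as np

def binary_bag_of_words(stemmed_tokens, vocabulary):
    """Return a 0/1 vector with one entry per vocabulary word."""
    present = set(stemmed_tokens)
    return np.array([1.0 if word in present else 0.0 for word in vocabulary],
                    dtype=np.float32)

vocabulary = ["bye", "hello", "how", "thank", "you"]
vector = binary_bag_of_words(["how", "are", "you", "you"], vocabulary)
print(vector)  # [0. 0. 1. 0. 1.]
```

Words outside the vocabulary ("are") are ignored, and a repeated word ("you") still contributes a single 1, so the vector records presence rather than counts.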

Here, `y_train` doesn't need to be in one-hot encoded form because we are using **CrossEntropyLoss**, which expects class index labels instead of one-hot vectors.
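The following sketch illustrates why class indices are enough. It compares `nn.CrossEntropyLoss` on integer labels with a manual computation that uses one-hot targets; the logits and labels are made-up values for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])   # raw model outputs: 2 samples, 3 classes
labels = torch.tensor([0, 2])              # class indices, in the same form as y_train

# What the notebook does: pass integer class indices directly
loss_from_indices = nn.CrossEntropyLoss()(logits, labels)

# Equivalent manual computation with one-hot targets
one_hot = F.one_hot(labels, num_classes=3).float()
loss_from_one_hot = -(one_hot * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(torch.allclose(loss_from_indices, loss_from_one_hot))  # True
```

`CrossEntropyLoss` also applies the log-softmax internally, which is why the model's final layer returns raw logits without a softmax.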

In [96]:
X_train = np.array(X_train)
y_train = np.array(y_train)

In [97]:
X_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 1.]], dtype=float32)

In [98]:
y_train

array([3, 3, 3, 3, 3, 3, 2, 2, 2, 6, 6, 6, 6, 4, 4, 4, 5, 5, 5, 5, 0, 0,
       0, 1, 1, 1])

### Creating a PyTorch Dataset using training data

In [99]:
class ChatDataSet(Dataset):
    def __init__(self):
        self.n_samples = len(X_train)
        self.X_data = X_train
        self.y_data = y_train

    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]

    def __len__(self):
        return self.n_samples

- Inherits from `Dataset`, a PyTorch class for handling custom datasets.  
- The `__init__` method initializes the dataset, storing `X_train` (features) and `y_train` (labels).  
- `self.n_samples = len(X_train)` stores the total number of samples in the dataset.  
- The `__getitem__` method takes an `index` and returns the corresponding feature (`X_train[index]`) and label (`y_train[index]`).  
- The `__len__` method returns `self.n_samples`, which helps PyTorch determine the dataset size.  
- This class is useful when used with `DataLoader` for batch processing and efficient training.
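`ChatDataSet` reads `X_train` and `y_train` from the notebook's global scope. As a possible alternative, the sketch below passes the arrays in explicitly, which makes the class easier to reuse and test. The class name `ArrayChatDataset` and the toy data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ArrayChatDataset(Dataset):
    """Hypothetical variant of ChatDataSet that receives its arrays explicitly."""

    def __init__(self, features, labels):
        assert len(features) == len(labels), "features and labels must align"
        self.features = torch.from_numpy(features)      # float32 bag-of-words vectors
        self.labels = torch.from_numpy(labels).long()   # int64 class indices

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return len(self.features)

# Toy stand-in data: 3 samples, 4 vocabulary words
toy_X = np.array([[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]], dtype=np.float32)
toy_y = np.array([0, 1, 1])
toy_dataset = ArrayChatDataset(toy_X, toy_y)

sample_features, sample_label = toy_dataset[1]
print(len(toy_dataset), sample_features, sample_label)
```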

### Defining Hyperparameters

In [100]:
batch_size = 8
learning_rate = 0.001
input_size = len(X_train[0])
hidden_size = 8
num_classes = len(tags)
num_epochs = 1000

### Creation of Dataset Object

In [101]:
dataset = ChatDataSet()
train_loader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)

- `batch_size = 8` sets the number of samples per batch during training.  
- `dataset = ChatDataSet()` creates an instance of the `ChatDataSet` class, which holds the training data.  
- `DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)` creates a `DataLoader` to handle batching, shuffling, and sampling of data.  
- `dataset=dataset` tells the `DataLoader` which dataset to use.  
- `batch_size=batch_size` ensures that each batch contains 8 samples.  
- `shuffle=True` randomizes the order of samples in each epoch to improve generalization.  
- This setup is useful for efficient training and feeding data in mini-batches to a neural network.
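To see what the batching looks like in practice, this sketch uses random stand-in tensors with the same shapes as the notebook's data (26 samples, 54 features, 7 classes) instead of the real `X_train` and `y_train`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data with the notebook's shapes
features = torch.rand(26, 54)
labels = torch.randint(0, 7, (26,))
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)

# Collect the size of every batch in one pass over the data
batch_sizes = [len(batch_labels) for _, batch_labels in loader]
print(len(loader), batch_sizes)  # 4 [8, 8, 8, 2]
```

Because 26 is not a multiple of 8, the last batch of every epoch holds only 2 samples. Passing `drop_last=True` to `DataLoader` would discard it, at the cost of skipping two samples per epoch.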

## Model Creation

In [102]:
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.l1 = nn.Linear(input_size, hidden_size)
        self.l2 = nn.Linear(hidden_size, hidden_size)
        self.l3 = nn.Linear(hidden_size, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.l1(x)
        out = self.relu(out)
        out = self.l2(out)
        out = self.relu(out)
        out = self.l3(out)
        return out   

In [103]:
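device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # must be defined before the model is moved to it
model = NeuralNet(input_size, hidden_size, num_classes).to(device)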
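A quick way to sanity-check the architecture is to push a dummy input through it and count the parameters. The sketch below rebuilds the same layer sizes with `nn.Sequential` so that it is self-contained; the sizes 54, 8, and 7 are the notebook's `input_size`, `hidden_size`, and `num_classes`.

```python
import torch
import torch.nn as nn

# Same layer sizes as NeuralNet: 54 -> 8 -> 8 -> 7
sketch_model = nn.Sequential(
    nn.Linear(54, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 7),
)

dummy_input = torch.zeros(1, 54)   # one bag-of-words vector
logits = sketch_model(dummy_input)
n_params = sum(p.numel() for p in sketch_model.parameters())
print(logits.shape, n_params)  # torch.Size([1, 7]) 575
```

The output has one raw score (logit) per tag. No softmax is applied in the model because `CrossEntropyLoss` handles it during training.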

### Optimization

In [104]:
criterion = nn.CrossEntropyLoss()

In [105]:
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

## Training the model

In [106]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [107]:
for epoch in range(num_epochs):
    for (words, labels) in train_loader:
        words = words.to(device)
        labels = labels.to(device).long()

        # Forward Pass
        output = model(words)
        loss = criterion(output, labels)
    
        # Backward Pass and Optimizer
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Note: `loss` is the loss of the last mini-batch only, so the printed values can fluctuate
    if (epoch+1) % 100 == 0:
        print(f'Epoch {epoch+1}/{num_epochs}, loss={loss.item():.4f}')

Epoch 100/1000, loss=1.0290
Epoch 200/1000, loss=0.5270
Epoch 300/1000, loss=0.0329
Epoch 400/1000, loss=0.0215
Epoch 500/1000, loss=0.4692
Epoch 600/1000, loss=0.0027
Epoch 700/1000, loss=0.0056
Epoch 800/1000, loss=0.2505
Epoch 900/1000, loss=0.0017
Epoch 1000/1000, loss=0.0020
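The printed loss is not monotonic (for example, it rises again at epochs 500 and 800) because it reflects only the final mini-batch of each epoch. One possible way to get a smoother signal is to average the loss over all samples in the epoch. The sketch below shows this on random stand-in data with a smaller stand-in model, so the numbers it prints are not comparable to the ones above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
features = torch.rand(26, 54)            # stand-in data, not the real training set
labels = torch.randint(0, 7, (26,))
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)

model = nn.Sequential(nn.Linear(54, 8), nn.ReLU(), nn.Linear(8, 7))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(3):
    total_loss, n_seen = 0.0, 0
    for batch_x, batch_y in loader:
        loss = criterion(model(batch_x), batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # criterion returns the batch mean, so weight it by the batch size
        total_loss += loss.item() * len(batch_y)
        n_seen += len(batch_y)
    epoch_loss = total_loss / n_seen
    print(f"Epoch {epoch + 1}: mean loss={epoch_loss:.4f}")
```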


## Saving the Data

In [108]:
data = {
    "model_state":model.state_dict(),
    "input_size":input_size,
    "output_size":num_classes,
    "hidden_size":hidden_size,
    "all_words":all_words,
    "tags":tags
}

In [109]:
file = "data.pth"

In [110]:
torch.save(data, file)
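The saved dictionary bundles the weights with everything needed to rebuild the model and the preprocessing. The sketch below shows the round trip on a tiny stand-in model, using an in-memory buffer instead of `data.pth`. The `weights_only=True` argument is an addition that is not in the original notebook; it restricts loading to tensors and plain Python containers, and its availability depends on the installed PyTorch version.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                  # tiny stand-in for NeuralNet
checkpoint = {
    "model_state": model.state_dict(),
    "input_size": 4,
    "output_size": 2,
    "all_words": ["hello", "bye", "thank", "you"],
    "tags": ["goodbye", "greeting"],
}

buffer = io.BytesIO()                    # in-memory file, avoids touching disk
torch.save(checkpoint, buffer)
buffer.seek(0)
restored = torch.load(buffer, weights_only=True)

# Rebuild the model from the stored sizes, then restore its weights
restored_model = nn.Linear(restored["input_size"], restored["output_size"])
restored_model.load_state_dict(restored["model_state"])
print(torch.equal(model.weight, restored_model.weight))  # True
```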

## Implementation of Chat

In [111]:
with open('../Datasets/intents.json', 'r') as f:
    intents = json.load(f)

In [112]:
data = torch.load(file)


In [115]:
input_size = data["input_size"]
hidden_size = data["hidden_size"]
output_size = data["output_size"]
all_words = data['all_words']
tags = data['tags']
model_state = data["model_state"]

model = NeuralNet(input_size, hidden_size, output_size).to(device)
model.load_state_dict(model_state)
model.eval()

bot_name = "Bot"
print("Let's chat! (type 'exit' to quit)")
while True:
    # sentence = "do you use credit cards?"
    sentence = input("You: ")
    if sentence == "exit":
        break

    sentence = tokenize(sentence)
    X = bag_of_words(sentence, all_words)
    X = X.reshape(1, X.shape[0])
    X = torch.from_numpy(X).to(device)

    with torch.no_grad():  # inference only, so skip gradient tracking
        output = model(X)
    _, predicted = torch.max(output, dim=1)

    tag = tags[predicted.item()]

    probs = torch.softmax(output, dim=1)
    prob = probs[0][predicted.item()]
    if prob.item() > 0.75:
        for intent in intents['intents']:
            if tag == intent["tag"]:
                print(f"{bot_name}: {random.choice(intent['responses'])}")
    else:
        print(f"{bot_name}: I do not understand...")

Let's chat! (type 'exit' to quit)


You:  Hello, this is vedavyas


Bot: Hi there, how can I help?


You:  I want to greet you


Bot: I do not understand...


You:  Thank you for working


Bot: My pleasure


You:  bye


Bot: Bye! Come back again soon.


You:  exit
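The "I do not understand..." reply in the transcript comes from the confidence check in the chat loop: the bot answers only when the softmax probability of the top class exceeds 0.75. The sketch below reproduces that check in plain Python; the helper names and the example logits are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]   # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def is_confident(logits, threshold=0.75):
    """Mirror the chat loop: answer only if the top probability exceeds the threshold."""
    return max(softmax(logits)) > threshold

print(is_confident([4.0, 0.5, 0.1]))  # True  (top probability is about 0.95)
print(is_confident([2.0, 0.5, 0.1]))  # False (top probability is about 0.73)
```

Raising the threshold makes the bot fall back to "I do not understand..." more often, while lowering it makes the bot answer more inputs at the risk of picking the wrong intent.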
