# 1. <a id='toc1_'></a>[**NLP and Speech Recognition Chatbot**](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- 1. [**NLP and Speech Recognition Chatbot**](#toc1_)    
  - 1.1. [**Problem Statement**](#toc1_1_)    
  - 1.2. [**Objectives**](#toc1_2_)    
  - 1.3. [**Analysis To Be Done**](#toc1_3_)    
    - 1.3.1. [**Ensure Necessary Modules are Installed**](#toc1_3_1_)    
    - 1.3.2. [**Import Modules and Set Default Environment Variables and Load Data**](#toc1_3_2_)    
    - 1.3.3. [**Preprocess Data**](#toc1_3_3_)    
    - 1.3.4. [**Create Pickle Files**](#toc1_3_4_)    
    - 1.3.5. [**Create Training And Testing Datasets**](#toc1_3_5_)    
    - 1.3.6. [**Build the Model**](#toc1_3_6_)    
    - 1.3.7. [**Predict The Responses**](#toc1_3_7_)    
      - 1.3.7.1. [**Load Required Python Modules**](#toc1_3_7_1_)    
      - 1.3.7.2. [**Establish Environment Variables and Load Data**](#toc1_3_7_2_)    
      - 1.3.7.3. [**Create Prediction Functions**](#toc1_3_7_3_)    
      - 1.3.7.4. [**Interactive loop for testing**](#toc1_3_7_4_)    

<!-- vscode-jupyter-toc-config
    numbering=true
    anchor=true
    flat=false
    minLevel=1
    maxLevel=6
    /vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

-----------------------------
## 1.1. <a id='toc1_1_'></a>[**Problem Statement**](#toc0_)
-----------------------------

A company is holding an event that has been promoted through marketing in the hope of 
attracting as large an audience as possible. It is now up to the customer support team to 
guide attendees and answer their queries, and providing high-quality support at that scale 
is the challenge. A chatbot helps here because it is available 24/7 and can reply instantly.

-----------------------------
## 1.2. <a id='toc1_2_'></a>[**Objectives**](#toc0_)
-----------------------------

Develop a real-time chatbot that uses NLP and Speech Recognition to engage with 
customers and, in turn, support the company's business growth.

**Domain:** Customer Support

-----------------------------
## 1.3. <a id='toc1_3_'></a>[**Analysis To Be Done**](#toc0_)
-----------------------------

Create a set of prebuilt commands or inputs as a dataset. Here, we use 
commands.json as the dataset; it contains the patterns we need to match and the 
responses we want to return to the user.
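
The code later in this notebook reads the dataset through the keys `intents`, `tag`, `patterns`, and `responses`, so the file is expected to follow roughly the layout sketched below. The two intents and all of their texts are hypothetical placeholders chosen for illustration; they are not the contents of the actual commands.json.

```python
import json

# Hypothetical sample that mirrors the layout the notebook reads:
# a top-level "intents" list whose items carry a tag, patterns, and responses.
sample_commands = {
    "intents": [
        {
            "tag": "greeting",
            "patterns": ["Hi", "Hello there", "Good morning"],
            "responses": ["Hello! How can I help you with the event?"],
        },
        {
            "tag": "goodbye",
            "patterns": ["Bye", "See you later"],
            "responses": ["Goodbye, enjoy the event!"],
        },
    ]
}

# Round-trip through JSON text, similar to what json.load does with the file.
loaded_commands = json.loads(json.dumps(sample_commands))

for intent in loaded_commands["intents"]:
    print(intent["tag"], "->", len(intent["patterns"]), "patterns,",
          len(intent["responses"]), "responses")
```

Each pattern becomes one training example labelled with its tag, and the responses are the replies the bot can choose from once that tag is predicted.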

-----------------------------
### 1.3.1. <a id='toc1_3_1_'></a>[**Ensure Necessary Modules are Installed**](#toc0_)
-----------------------------

In [None]:
%pip install python-dotenv
%pip install nltk 
%pip install keras 
%pip install SpeechRecognition  
%pip install tensorflow 
# pickle ships with the Python standard library, so no %pip install is needed

-----------------------------
### 1.3.2. <a id='toc1_3_2_'></a>[**Import Modules and Set Default Environment Variables and Load Data**](#toc0_)
-----------------------------

In [None]:
import json
import pickle
import random
import copy

import numpy as np
import tensorflow as tf

import nltk
from nltk.stem import WordNetLemmatizer

from tensorflow.keras.layers import Input, Activation, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD



In [None]:
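# Download necessary NLTK data if not already downloaded
nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download('punkt_tab')  # tokenizer tables that newer NLTK releases look up instead of 'punkt'
nltk.download('wordnet')    # corpus required by WordNetLemmatizer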

# Directory for dataset and dependency files
dataset_dir = '1581663590_datasetsanddependencyfiles'

# Initialize lists
words = [] 
classes = [] 
documents = [] 
ignore_words = ['?', '!']

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Load intents file
with open(f'{dataset_dir}/commands.json') as data_file:
    intents = json.load(data_file)

-----------------------------
### 1.3.3. <a id='toc1_3_3_'></a>[**Preprocess Data**](#toc0_)
-----------------------------

In [None]:
# Loop through each intent in the intents dataset
for intent in intents['intents']:
    for pattern in intent['patterns']:
        # Tokenize each word in the sentence
        word_list = nltk.word_tokenize(pattern)
        words.extend(word_list)
        
        # Add to documents in our corpus
        documents.append((word_list, intent['tag']))
        
        # Add to our classes if it's not already there
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

# Lemmatize, lower each word, and remove duplicates
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words] 
words = sorted(list(set(words))) 

# Sort classes 
classes = sorted(list(set(classes))) 

# Display basic stats
print(len(documents), "documents") 
print(len(classes), "classes", classes) 
print(len(words), "unique lemmatized words", words) 
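
To make the bookkeeping in the loop above easier to follow, here is a minimal sketch on three made-up patterns. It is a simplification under stated assumptions: `str.split` stands in for `nltk.word_tokenize`, lowercasing stands in for lemmatization, and the `demo_*` names and sample sentences are hypothetical.

```python
# Hypothetical (pattern, tag) pairs standing in for the real dataset.
demo_samples = [
    ("Hello there", "greeting"),
    ("Hi there !", "greeting"),
    ("Where is the venue ?", "location"),
]
demo_ignore = ['?', '!']

demo_words, demo_classes, demo_documents = [], [], []
for pattern, tag in demo_samples:
    tokens = pattern.split()              # stand-in for nltk.word_tokenize
    demo_words.extend(tokens)
    demo_documents.append((tokens, tag))  # keep the tokens together with their label
    if tag not in demo_classes:
        demo_classes.append(tag)

# Lowercase only (no lemmatization here), drop ignored tokens, deduplicate, sort.
demo_words = sorted({w.lower() for w in demo_words if w not in demo_ignore})
demo_classes = sorted(demo_classes)

print(len(demo_documents), "documents")
print(demo_classes)
print(demo_words)
```

The sorted vocabulary fixes the position of every word, which is what lets each sentence be encoded later as a fixed-length vector.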

### 1.3.4. <a id='toc1_3_4_'></a>[**Create Pickle Files**](#toc0_)

In [None]:
with open(f'{dataset_dir}/words.pkl', 'wb') as f:
    pickle.dump(words, f)
with open(f'{dataset_dir}/classes.pkl', 'wb') as f:
    pickle.dump(classes, f)

### 1.3.5. <a id='toc1_3_5_'></a>[**Create Training And Testing Datasets**](#toc0_)

In [None]:
# Define output_empty based on the number of classes
output_empty = [0] * len(classes)

# Create separate lists for training data inputs (X) and outputs (Y)
train_x = []
train_y = []

# Create bag of words for each sentence and corresponding output
for doc in documents:
    # Initialize our bag of words
    bag = []
    # List of tokenized words for the pattern
    pattern_words = doc[0]
    # Lemmatize each word to create the base form
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # Create the bag of words array with 1 if word match found in current pattern
    bag = [1 if w in pattern_words else 0 for w in words]
    
    # Output is '0' for each tag and '1' for current tag
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1

    # Append the bag of words and output row to their respective lists
    train_x.append(bag)
    train_y.append(output_row)

# Initial deep verification of train_x and train_y
def verify_array(array, name):
    print(f"Verifying {name}...")
    for i, row in enumerate(array):
        if not isinstance(row, np.ndarray):
            print(f"Row {i} in {name} is not a numpy array. Found type: {type(row)}")
        elif row.shape[0] != array.shape[1]:
            print(f"Row {i} in {name} has inconsistent shape. Expected {array.shape[1]}, found {row.shape[0]}")
        elif row.dtype != np.float32:
            print(f"Row {i} in {name} has incorrect dtype. Expected float32, found {row.dtype}")

# Convert train_x and train_y to numpy arrays
train_x = np.array(train_x, dtype=np.float32)
train_y = np.array(train_y, dtype=np.float32)

# Deep verification before proceeding
verify_array(train_x, "train_x")
verify_array(train_y, "train_y")

# Ensure consistency by stacking rows
try:
    train_x = np.vstack(train_x)
    train_y = np.vstack(train_y)
    print("train_x and train_y successfully stacked.")
except ValueError as e:
    print(f"Error in stacking train_x or train_y: {e}")
    raise

# Print shapes and types for debugging
print("Shape of train_x:", train_x.shape)
print("Shape of train_y:", train_y.shape)
print("Data type of train_x:", train_x.dtype)
print("Data type of train_y:", train_y.dtype)

# Double-check by printing sample values if needed
print("Sample values from train_x:\n", train_x[:5])
print("Sample values from train_y:\n", train_y[:5])

In [None]:
# Deep copy train_x and train_y to ensure there are no lingering references
train_x = copy.deepcopy(train_x)
train_y = copy.deepcopy(train_y)

# Convert train_x and train_y to strict numpy arrays and enforce dtype
train_x = np.array(train_x, dtype=np.float32)
train_y = np.array(train_y, dtype=np.float32)

# Print shapes and sample values to verify
print("Final check - Shape of train_x:", train_x.shape)
print("Final check - Shape of train_y:", train_y.shape)
print("Sample values from train_x:", train_x[:5])
print("Sample values from train_y:", train_y[:5])
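
The encoding built in this section can be illustrated on a single sentence. The sketch below uses plain Python lists and a hypothetical seven-word vocabulary with two classes; `encode`, `demo_vocab`, and `demo_labels` are illustrative names, not part of the notebook.

```python
# Hypothetical vocabulary and class list, sorted the same way as in the notebook.
demo_vocab = ["hello", "hi", "is", "the", "there", "venue", "where"]
demo_labels = ["greeting", "location"]

def encode(tokens, tag):
    """Return (bag, one_hot) for one tokenized pattern and its tag."""
    tokens = [t.lower() for t in tokens]
    # 1 where the vocabulary word occurs in the pattern, else 0.
    bag = [1 if w in tokens else 0 for w in demo_vocab]
    # One-hot row: 1 at the index of the pattern's class, 0 elsewhere.
    one_hot = [0] * len(demo_labels)
    one_hot[demo_labels.index(tag)] = 1
    return bag, one_hot

demo_bag, demo_one_hot = encode(["Where", "is", "the", "venue"], "location")
print(demo_bag)      # [0, 0, 1, 1, 0, 1, 1]
print(demo_one_hot)  # [0, 1]
```

The bag has one entry per vocabulary word and the one-hot row has one entry per class, which is why the model's input and output sizes are taken from `train_x.shape[1]` and `train_y.shape[1]`.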

### 1.3.6. <a id='toc1_3_6_'></a>[**Build the Model**](#toc0_)

In [None]:
# Define model structure
model = Sequential()
model.add(Input(shape=(train_x.shape[1],)))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5)) 
model.add(Dense(64, activation='relu')) 
model.add(Dropout(0.5)) 
model.add(Dense(train_y.shape[1], activation='softmax'))

# Compile model with updated learning rate parameter
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True) 
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy']) 

# Train the model, then save it (model.save takes the file path; the training history is not an argument)
hist = model.fit(train_x, train_y, epochs=200, batch_size=5, verbose=1)
model.save(f'{dataset_dir}/chatbot_model.h5')

print("Model created and saved")
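
The output layer uses a softmax activation and the model is trained with categorical cross-entropy. As a rough illustration of what those two pieces compute, here is a pure-Python sketch on made-up scores for three classes; the function names and the numbers are hypothetical, and Keras performs the equivalent computation on tensors internally.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs):
    """Loss for one example: minus the log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

demo_scores = [2.0, 1.0, 0.1]   # hypothetical raw scores for three classes
demo_target = [1, 0, 0]         # one-hot label: the first class is correct
demo_softmax = softmax(demo_scores)
demo_loss = categorical_cross_entropy(demo_target, demo_softmax)

print([round(p, 3) for p in demo_softmax])
print(round(demo_loss, 3))
```

The loss shrinks toward zero as the probability assigned to the correct class approaches 1, which is the quantity the SGD optimizer pushes down during training.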


### 1.3.7. <a id='toc1_3_7_'></a>[**Predict The Responses**](#toc0_)

#### 1.3.7.1. <a id='toc1_3_7_1_'></a>[**Load Required Python Modules**](#toc0_)

In [None]:
import json
import pickle
import random

import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer

from tensorflow.keras.models import load_model

#### 1.3.7.2. <a id='toc1_3_7_2_'></a>[**Establish Environment Variables and Load Data**](#toc0_)

In [None]:
# Define dataset directory
dataset_dir = '1581663590_datasetsanddependencyfiles'

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Load the trained model
model = load_model(f'{dataset_dir}/chatbot_model.h5')

# Load words and classes
with open(f'{dataset_dir}/words.pkl', 'rb') as f:
    words = pickle.load(f)
with open(f'{dataset_dir}/classes.pkl', 'rb') as f:
    classes = pickle.load(f)

# Load intents
with open(f'{dataset_dir}/commands.json', 'r') as f:
    intents = json.load(f)

#### 1.3.7.3. <a id='toc1_3_7_3_'></a>[**Create Prediction Functions**](#toc0_)

In [None]:
# Function to clean up the sentence
def clean_up_sentence(sentence):
    # Tokenize the sentence
    sentence_words = nltk.word_tokenize(sentence)
    # Lemmatize each word
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words

# Function to create a bag of words from the sentence
def bow(sentence, words, show_details=True):
    sentence_words = clean_up_sentence(sentence)
    bag = [0] * len(words)
    for s in sentence_words:
        for i, w in enumerate(words):
            if w == s:
                bag[i] = 1
                if show_details:
                    print(f'found in bag: {w}')
    return np.array(bag)

# Function to predict the class
def predict_class(sentence, model):
    p = bow(sentence, words, show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i, r] for i, r in enumerate(res) if r > ERROR_THRESHOLD]
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

# Function to get the response
def get_response(intents_list, intents_json):
    # Fallback reply, used when nothing was predicted or the tag is missing from the dataset
    result = "I don't understand, please try again."
    if intents_list:
        tag = intents_list[0]['intent']
        for i in intents_json['intents']:
            if i['tag'] == tag:
                result = random.choice(i['responses'])
                break
    return result

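# Function to chat with the model
def chatbot_response(text):
    # Use a separate name so the global `intents` dataset is not shadowed by the predictions
    predicted_intents = predict_class(text, model)
    response = get_response(predicted_intents, intents)
    return response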
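
The filtering step inside `predict_class` can be tried without a trained model. The sketch below applies the same threshold-and-sort logic to a made-up probability vector; `rank_intents`, `demo_intent_names`, and the numbers are hypothetical.

```python
ERROR_THRESHOLD = 0.25

# Hypothetical class labels and model output for one sentence.
demo_intent_names = ["goodbye", "greeting", "location"]
demo_output = [0.05, 0.65, 0.30]

def rank_intents(probabilities, labels, threshold=ERROR_THRESHOLD):
    """Keep classes above the threshold, most probable first."""
    kept = [(i, p) for i, p in enumerate(probabilities) if p > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [{"intent": labels[i], "probability": str(p)} for i, p in kept]

demo_ranked = rank_intents(demo_output, demo_intent_names)
print(demo_ranked)

# When no class clears the threshold the list is empty,
# which is the case the fallback reply in get_response handles.
print(rank_intents([0.2, 0.2, 0.2], demo_intent_names))
```

Raising the threshold makes the bot fall back to its default reply more often, while lowering it lets weaker guesses through.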

#### 1.3.7.4. <a id='toc1_3_7_4_'></a>[**Interactive loop for testing**](#toc0_)

In [None]:
print("Chatbot is ready to chat! (Type 'exit' to stop)")
while True:
    message = input("You: ")
    if message.lower() == "exit":
        print("Goodbye!")
        break
    response = chatbot_response(message)
    print("Bot:", response)