<h1 style="text-align: center;">Chatbot</h1>
<h2 style="text-align: center;">University of Denver</h2>
<h3 style="text-align: center;">Alex Liddle</h3>

A chatbot is an intelligent piece of software that is capable of communicating and performing actions similar to a human. The goal of this project is to build a model that predicts answers using predefined patterns and responses. Provided is a file named intents.json that contains these patterns. 

#### Possible chat with your bot
<code>
You: Hello, how are you? 
Bot: Hi there, how can I help?
You: what can you do?
Bot: I can guide you through Adverse drug reaction list, Blood pressure tracking, Hospitals and Pharmacies
You: thanks
Bot: My pleasure
You: see ya. got to go!
Bot: See you
</code>

#### Load the necessary libraries

In [1]:
import nltk
import string
import re
import sklearn
import urllib.request
import requests
import json
import pickle
import pandas as pd
import numpy as np
from tqdm import tqdm
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from scipy import spatial
#nltk.download('stopwords') #<---uncomment if you haven't downloaded the stopwords library
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
import gensim
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
%matplotlib notebook

First, the data, which is provided by the University of Denver, is loaded.

In [3]:
# Load pickle files
classes = pickle.load(urllib.request.urlopen("https://raw.githubusercontent.com/emmanueliarussi/DataScienceCapstone/master/3_MidtermProjects/ProjectPCB/data/classes.pkl"))
words   = pickle.load(urllib.request.urlopen("https://raw.githubusercontent.com/emmanueliarussi/DataScienceCapstone/master/3_MidtermProjects/ProjectPCB/data/words.pkl"))

In [4]:
# Load json file with answer patterns
intents = json.loads(requests.get("https://raw.githubusercontent.com/emmanueliarussi/DataScienceCapstone/master/3_MidtermProjects/ProjectPCB/data/intents.json").text)

The answer patterns in the data file are for a hypothetical medical application. Let's take a peek!

In [5]:
intents

{'intents': [{'tag': 'greeting',
   'patterns': ['Hi there',
    'How are you',
    'Is anyone there?',
    'Hey',
    'Hola',
    'Hello',
    'Good day'],
   'responses': ['Hello, thanks for asking',
    'Good to see you again',
    'Hi there, how can I help?'],
   'context': ['']},
  {'tag': 'goodbye',
   'patterns': ['Bye',
    'See you later',
    'Goodbye',
    'Nice chatting to you, bye',
    'Till next time'],
   'responses': ['See you!', 'Have a nice day', 'Bye! Come back again soon.'],
   'context': ['']},
  {'tag': 'thanks',
   'patterns': ['Thanks',
    'Thank you',
    "That's helpful",
    'Awesome, thanks',
    'Thanks for helping me'],
   'responses': ['Happy to help!', 'Any time!', 'My pleasure'],
   'context': ['']},
  {'tag': 'noanswer',
   'patterns': [],
   'responses': ["Sorry, can't understand you",
    'Please give me more info',
    'Not sure I understand'],
   'context': ['']},
  {'tag': 'options',
   'patterns': ['How you could help me?',
    'What you can do

The text patterns are grouped into 'slots', each identified by its 'tag' key. From the developer's perspective, patterns grouped in the same slot are similar. For example, 'Hey', 'Hola', and 'Hello' all warrant the same type of response. The goal is to create a meaningful numerical representation in which that similarity can be calculated using some metric. In addition to 'tag' and 'patterns', each slot has 'responses' and 'context'. 'context' adds more information to the slot and is similar to the 'tag'; it won't be used in this project. 'responses' will be used to generate the chatbot's reply: the reply is randomly selected from the list of responses belonging to whichever slot contains the phrase or phrases most similar to the user's prompt.
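As a rough illustration of the last step, here is a minimal sketch of picking a random response once a slot has been identified. The `intents_sample` dictionary is a shortened copy of the data above, and `respond_from_slot` is a hypothetical helper, not part of the project code.

```python
import random

# Shortened excerpt of the intents structure shown above (two slots only).
intents_sample = {
    "intents": [
        {"tag": "greeting",
         "patterns": ["Hi there", "Hey", "Hola", "Hello"],
         "responses": ["Hello, thanks for asking", "Good to see you again",
                       "Hi there, how can I help?"]},
        {"tag": "thanks",
         "patterns": ["Thanks", "Thank you"],
         "responses": ["Happy to help!", "Any time!", "My pleasure"]},
    ]
}

def respond_from_slot(tag, intents):
    """Return a random response from the slot with the given tag (hypothetical helper)."""
    for slot in intents["intents"]:
        if slot["tag"] == tag:
            return random.choice(slot["responses"])
    raise KeyError(f"unknown tag: {tag}")

reply = respond_from_slot("thanks", intents_sample)
print(reply)
```

The hard part, and the subject of the models below, is deciding which slot a free-text prompt belongs to.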

### Data Preparation
#### Explode patterns into individual rows

Next, a dataframe is created where each individual pattern is separated into its own row.

In [6]:
df = pd.DataFrame(intents['intents'])
# explode pattern fields into new rows
df_intents = pd.DataFrame({
              col:np.repeat(df[col].values, df['patterns'].str.len())
              for col in df.columns.difference(['patterns'])
          }).assign(**{'patterns':np.concatenate(df['patterns'].values)})[df.columns.tolist()]
# re-add rows with no pattern, since the previous step dropped them
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df_no_pattern = pd.DataFrame([{'context':d['context'], 
                               'patterns':'', 
                               'responses':d['responses'], 
                               'tag':d['tag']} 
                              for d in intents['intents'] if d['patterns']==[]])
df_intents = pd.concat([df_intents, df_no_pattern]).reset_index(drop=True)
# remove duplicate patterns (done here by keeping only the first 48 rows)
df_intents = df_intents.copy()[:48]

df_intents.head()

| | tag | patterns | responses | context |
|---|---|---|---|---|
| 0 | greeting | Hi there | [Hello, thanks for asking, Good to see you aga... | [] |
| 1 | greeting | How are you | [Hello, thanks for asking, Good to see you aga... | [] |
| 2 | greeting | Is anyone there? | [Hello, thanks for asking, Good to see you aga... | [] |
| 3 | greeting | Hey | [Hello, thanks for asking, Good to see you aga... | [] |
| 4 | greeting | Hola | [Hello, thanks for asking, Good to see you aga... | [] |
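For comparison, a more compact way to get one row per pattern is `DataFrame.explode` (available since pandas 0.25). This is a sketch on a toy frame, not a drop-in replacement for the cell above: `toy` and `toy_exploded` are illustrative names, and unlike the original code it keeps pattern-less slots in place rather than moving them to the end.

```python
import pandas as pd

# Toy frame mirroring the columns of the intents data (values shortened).
toy = pd.DataFrame({
    "tag": ["greeting", "noanswer"],
    "patterns": [["Hi there", "Hey"], []],
    "responses": [["Hello"], ["Not sure I understand"]],
})

# explode() emits one row per list element; an empty list becomes a single
# NaN row, so slots without patterns are kept rather than dropped.
toy_exploded = toy.explode("patterns").reset_index(drop=True)
toy_exploded["patterns"] = toy_exploded["patterns"].fillna("")

print(toy_exploded[["tag", "patterns"]])
```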


#### Data Preprocessing

Now it is time to preprocess the text patterns (i.e., normalize, remove punctuation, remove stopwords, lemmatize, and stem). This is one of many important steps in natural language processing. Consider the following example: suppose two users each ask a question -- 'How could you help me?' and 'How can you be helpful?'. These are essentially the same question, but the different wording could make their vector representations vastly different from one another. First, the phrases are normalized -- 'how could you help me?' and 'how can you be helpful?'. Next, punctuation is removed -- 'how could you help me' and 'how can you be helpful'. Next, stopwords are removed -- 'how help' and 'how helpful'. Finally, the phrases are lemmatized and stemmed -- 'how help' and 'how help'. The phrases are now identical, and so are their vector representations.
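The walk-through above can be reproduced with a self-contained toy version of the pipeline. Everything here is an assumption made for illustration: `TOY_STOPWORDS` is a hand-picked set rather than the NLTK stopword list, and `toy_stem` is a one-rule stand-in for the Porter stemmer used in the actual function below.

```python
import string

# Assumptions for illustration only: a tiny stopword set and a one-rule "stemmer".
TOY_STOPWORDS = {"could", "can", "you", "me", "be"}

def toy_stem(word):
    """Strip a trailing 'ful' (stand-in for a real stemmer such as Porter)."""
    if word.endswith("ful") and len(word) > 5:
        return word[:-3]
    return word

def toy_preprocess(phrase):
    # 1. normalize case
    lowered = phrase.lower()
    # 2. remove punctuation
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # 3. remove stopwords
    kept = [w for w in no_punct.split() if w not in TOY_STOPWORDS]
    # 4. stem what is left
    return " ".join(toy_stem(w) for w in kept)

a = toy_preprocess("How could you help me?")
b = toy_preprocess("How can you be helpful?")
print(a, "|", b)
```

Both phrases reduce to the same string, which is the property the real preprocessing step relies on.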

In [7]:
def preprocess_patterns(model_type='cluster'):
    '''
    Preprocesses text by normalizing, removing punctuation, removing stopwords, stemming,
    and lemmatizing
    
    Args: model_type (str, default='cluster') -- possible values are ['cluster', 'doc2vec', 'bow'].
    Returns: preprocessed (list) -- If model_type=='cluster', returns a list of tfidf vectors.
                                    Elif model_type=='doc2vec', returns a list of
                                    TaggedDocument objects.
                                    Elif model_type=='bow', returns a list of bow vectors.
                                    Else, raises ValueError
             model -- If model_type=='cluster', returns a fitted tfidf vectorizer.
                      Elif model_type=='doc2vec', returns a doc2vec model object.
                      Else, returns a word_freq dictionary.
    '''
    stop = stopwords.words()
    processed_list = []
    
    if model_type not in ['cluster', 'doc2vec', 'bow']:
        raise ValueError("'model_type' must be one of the following ['cluster', 'doc2vec', 'bow']")
    for idx, val in df_intents.patterns.items():
        # Normalize words (i.e., convert to same case); re.sub is needed because
        # str.replace treats the pattern as literal text, not as a regex
        lowered = re.sub(r"[^\w\s]", "", val).lower().split()
        # Remove punctuation
        table = str.maketrans('', '', string.punctuation)
        stripped = [w.translate(table) for w in lowered]
        # Remove stop words
        stops_removed = [item for item in stripped if (item in words) or (item not in stop)]
        # Lemmatize and stem words
        porter = nltk.PorterStemmer()
        lemmatizer = WordNetLemmatizer()
        lemmatized = [lemmatizer.lemmatize(word) for word in stops_removed]
        stemmed = ' '.join([porter.stem(word) for word in lemmatized])
        df_intents.at[idx, 'stemmed_pattern'] = ' '.join(simple_preprocess(stemmed))
        
        if stemmed == '':
            continue
        if model_type=='cluster':
            processed_list.append(stemmed)
        elif model_type=='doc2vec':
            processed_list.append(TaggedDocument(simple_preprocess(stemmed), [idx]))
        else:
            for word in simple_preprocess(stemmed):
                processed_list.append(word)
    
    if model_type=='cluster':
        tfidf = TfidfVectorizer(
            min_df = 0,
            max_df = 0.95,
            max_features = 8000,
            stop_words = 'english', 
        )

        tfidf.fit(processed_list)
        patterns_tfidf = tfidf.transform(processed_list)
        
        return patterns_tfidf, tfidf
    
    elif model_type=='doc2vec':
        doc2vec_model = gensim.models.doc2vec.Doc2Vec(vector_size=15, min_count=1, epochs=1000)
        doc2vec_model.build_vocab(processed_list)
        doc2vec_model.train(processed_list, total_examples=doc2vec_model.corpus_count, epochs=1000)
        
        return processed_list, doc2vec_model
    
    else:
        word_freq = {}
        for word in processed_list:
            if word not in word_freq.keys():
                word_freq[word] = 1
            else:
                word_freq[word] += 1

        bow_vec = []
        for sentence in df_intents.stemmed_pattern:
            sentence_tokens = sentence.split()
            sent_vec = []
            for token in word_freq:
                if token in sentence_tokens:
                    sent_vec.append(1)
                else:
                    sent_vec.append(0)
            bow_vec.append(sent_vec)
        
        return bow_vec, word_freq

### Models

#### Clustering Model
The first model will use clustering. User prompts will be vectorized and clustered with those from the dataset. Ideally, semantically similar prompts will cluster together, allowing the program to give a coherent response.

In [8]:
# Transform data for clustering
patterns_tfidf, tfidf = preprocess_patterns(model_type='cluster')

Let's inspect the vectorized preprocessed data

In [9]:
str(patterns_tfidf)

'  (0, 19)\t1.0\n  (2, 1)\t1.0\n  (3, 18)\t1.0\n  (4, 21)\t1.0\n  (5, 16)\t1.0\n  (6, 14)\t0.7071067811865475\n  (6, 10)\t0.7071067811865475\n  (7, 5)\t1.0\n  (8, 24)\t1.0\n  (9, 15)\t1.0\n  (10, 34)\t0.595985531506885\n  (10, 7)\t0.595985531506885\n  (10, 5)\t0.538147277674905\n  (11, 50)\t0.7071067811865475\n  (11, 49)\t0.7071067811865475\n  (12, 48)\t1.0\n  (13, 48)\t1.0\n  (14, 17)\t1.0\n  (15, 48)\t0.615369689973343\n  (15, 2)\t0.7882386343374141\n  (16, 48)\t0.7271369414361715\n  (16, 17)\t0.6864924387047899\n  (17, 17)\t1.0\n  (19, 40)\t0.804975269538045\n  (19, 17)\t0.5933083645391759\n  :\t:\n  (36, 4)\t0.39516747570546445\n  (37, 38)\t1.0\n  (38, 38)\t1.0\n  (39, 38)\t0.4925771983882054\n  (39, 33)\t0.6683075559684937\n  (39, 25)\t0.5574340447653383\n  (40, 38)\t0.5933083645391759\n  (40, 27)\t0.804975269538045\n  (41, 44)\t0.7271369414361715\n  (41, 38)\t0.6864924387047899\n  (42, 30)\t0.7746836041079114\n  (42, 22)\t0.6323490440621989\n  (43, 51)\t0.6151150105288973\n  (43,

#### TF-IDF Vectorization Explained
TF-IDF stands for Term Frequency-Inverse Document Frequency. TF-IDF vectorization is the process of turning text into a meaningful numerical representation that can be used in machine learning applications such as this one. The mathematical expression for TF-IDF is
<p/>
<div class="math">
\begin{equation}
  \text{TF-IDF}(t, d) = TF(t, d) \times IDF(t)
\end{equation}
</div>
<p/>
where TF(t, d) is the number of times term t appears in document d and IDF(t) is the inverse document frequency:
<p/>
<div class="math">
\begin{equation}
  IDF(t) = \log\left(\frac{1 + \text{NumberOfDocuments}}{1 + \text{DocumentFreqOfTerm}}\right) + 1
\end{equation}
</div>

(Chaudhary, M. (2021, January 28). TF-IDF vectorizer scikit-learn. Retrieved April 10, 2021, from https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a)
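To make the formula concrete, here is a small pure-Python sketch that applies it to a toy corpus. The corpus and the helper names `smooth_idf` and `tfidf_row` are illustrative assumptions. The final L2 normalization mirrors scikit-learn's default `norm='l2'`, which is why a pattern with a single remaining term shows a weight of 1.0 in the output above.

```python
import math
from collections import Counter

docs = [["hi", "there"], ["hello", "there"], ["bye"]]  # toy tokenized corpus

def smooth_idf(term, docs):
    """IDF(t) = ln((1 + N) / (1 + df(t))) + 1, matching the formula above."""
    n_docs = len(docs)
    doc_freq = sum(1 for d in docs if term in d)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(doc, docs):
    """TF x IDF weights for one document, scaled to unit Euclidean length."""
    counts = Counter(doc)
    raw = {t: counts[t] * smooth_idf(t, docs) for t in counts}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

row = tfidf_row(docs[0], docs)
print({t: round(w, 4) for t, w in row.items()})
```

In this toy corpus 'there' occurs in two of the three documents, so it receives a lower weight than 'hi', which occurs in only one.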

#### Clustering
Next, we will determine the ideal number of clusters by analyzing the inertia (i.e., the sum of squared distances of samples to their closest cluster center) as a function of k. This is known as the elbow method. The ideal number of clusters is where the "elbow" occurs in the plot of inertia versus k, i.e., the point beyond which adding clusters no longer reduces the sum of squared distances appreciably.
(Elbow method for optimal value of k in KMeans. (2021, February 09). Retrieved April 10, 2021, from https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/)
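For intuition, the quantity being plotted can be computed by hand. The sketch below is a plain-Python illustration with made-up points and centers; `inertia` is a hypothetical helper, not the scikit-learn attribute used in the next cell.

```python
def inertia(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]

one_center = [(10 / 3, 2 / 3)]            # centroid of all three points
two_centers = [(0.0, 1.0), (10.0, 0.0)]   # one center per visible group

print(inertia(points, one_center), inertia(points, two_centers))
```

Going from one center to two removes most of the inertia here; a third center could only remove the small remainder, which is the flattening that the elbow method looks for.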

In [10]:
def find_optimal_clusters(data, max_k):    
    k_list = range(2, max_k+1)
    
    sse = []
    for k in k_list:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            sse.append(MiniBatchKMeans(n_clusters=k, random_state=42).fit(data).inertia_)

    sns.set(style = "darkgrid")
    f, ax = plt.subplots(1, 1, figsize=(12,12))
    ax.plot(k_list, sse, marker='o')
    ax.set_xlabel('Cluster Centers')
    ax.set_xticks(k_list)
    ax.set_xticklabels(k_list)
    ax.set_ylabel('SSE')
    ax.set_title('SSE by Cluster Center Plot')

In [11]:
find_optimal_clusters(patterns_tfidf, patterns_tfidf.shape[0])


The "elbow" is located approximately at k = 42. Now we will create the cluster model using sklearn's MiniBatchKMeans class.

In [13]:
n_clusters = 42

model = MiniBatchKMeans(n_clusters=n_clusters, random_state=42).fit(patterns_tfidf)

pattern_clusters = model.predict(patterns_tfidf)

Let's use PCA to reduce the vectors to three components so we can visualize the numerical representation produced by TF-IDF vectorization. Points are colored according to their tag (ideally, vectors will be closest to others with identical tags).

In [14]:
# toarray() returns an ndarray; todense() returns np.matrix, which newer sklearn versions reject
reduced_data = PCA(n_components=3).fit_transform(patterns_tfidf.toarray())

x=reduced_data[:,0]
y=reduced_data[:,1]
z=reduced_data[:,2]
color=LabelEncoder().fit_transform(df_intents.tag.values[:-1])

sns.set(style = "darkgrid")
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection = '3d')

ax.set_xlabel("PCA Component 1")
ax.set_ylabel("PCA Component 2")
ax.set_zlabel("PCA Component 3")

N = len(df_intents.tag.unique())
cmap = plt.cm.jet
cmaplist = [cmap(i) for i in range(cmap.N)]
cmap = cmap.from_list('Custom cmap', cmaplist, cmap.N)

bounds = np.linspace(0,N,N+1)
norm = matplotlib.colors.BoundaryNorm(bounds, cmap.N)

scat = ax.scatter(x, y, z, c=color, s=100, cmap=cmap, norm=norm)

cb = plt.colorbar(scat, spacing='proportional',ticks=bounds)
cb.set_label('Tag')

plt.title('TFIDF Vectors')
plt.ion()

plt.show()


Now we will test the accuracy of the model, where accuracy is defined as the total number of times the model's response equals the expected response divided by the total number of prompts.

#### Model Evaluation

In [15]:
user_prompt = ''

stop = stopwords.words()
prompts = df_intents.patterns.values
responses = df_intents.responses.values

i = 0

missed = 0

print('Testing Chatbot')
print('-----------------------')

while True:
    user_prompt = prompts[i] #input("User: ")
    if user_prompt.lower()=='q':
        break
    lowered = re.sub(r"[^\w\s]", "", user_prompt).lower().split()
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in lowered]
    # Remove stop words
    stops_removed = [item for item in stripped if (item in words) or (item not in stop)]
    # Stem and lemmatize words
    porter = nltk.PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in stops_removed]
    stemmed = ' '.join([porter.stem(word) for word in lemmatized])
    
    # vectorize
    test_vector = tfidf.transform([stemmed])

    predicted_cluster = model.predict(test_vector)[0]

    response = np.random.choice(list(df_intents.loc[np.where(pattern_clusters == predicted_cluster)[0]]['responses'])[0])
                                     #df_intents[df_intents.patterns_cluster==predicted_cluster].responses)[0])

    if user_prompt == "":
        response = "Sorry, can't understand you"
        
    r = responses[i]
    if response not in r:
        r = r[0]
        missed +=1
    else:
        r = response
    
    print('\nprompt:', user_prompt)
    print('predicted response:', response)
    print('expected response:', r)
        
    i += 1
    if i==len(prompts):
        break

print('-----------------------')
print('\n\nModel Accuracy:',(len(prompts)-missed)/len(prompts))

Testing Chatbot
-----------------------

prompt: Hi there
predicted response: Hi there, how can I help?
expected response: Hi there, how can I help?

prompt: How are you
predicted response: Hello, thanks for asking
expected response: Hello, thanks for asking

prompt: Is anyone there?
predicted response: Hello, thanks for asking
expected response: Hello, thanks for asking

prompt: Hey
predicted response: Hello, thanks for asking
expected response: Hello, thanks for asking

prompt: Hola
predicted response: Hi there, how can I help?
expected response: Hi there, how can I help?

prompt: Hello
predicted response: Hello, thanks for asking
expected response: Hello, thanks for asking

prompt: Good day
predicted response: Hello, thanks for asking
expected response: Hello, thanks for asking

prompt: Bye
predicted response: Bye! Come back again soon.
expected response: Bye! Come back again soon.

prompt: See you later
predicted response: See you!
expected response: See you!

prompt: Goodbye
predi

Though this model performs fairly well, it is not perfect. In the tfidf vector plot, vectors with tags 1 and 2 were exceptionally close and, therefore, clustered together. Other models and methods for vectorizing the textual patterns must be considered.

#### Doc2Vec Model

The second model will be a Doc2Vec model, which inherits from the Word2Vec class but works on phrases instead of individual words.

Doc2Vec extends the two Word2Vec training schemes, Continuous Bag-of-Words (CBOW) and Skip-Gram, with a paragraph id vector. CBOW learns a word's vector from its context, i.e., the model predicts the word of interest from the words surrounding it. The Skip-Gram model is the opposite of CBOW: given a word, it tries to predict the context. Finally, the paragraph id vector represents a larger context, namely the document the words come from. In this notebook each pattern is tagged with its own row index (see the `TaggedDocument` call in `preprocess_patterns`), so every pattern is its own document. In the previous clustering model, patterns that should have been grouped together because they shared the same tag (e.g., 'greeting') ended up in different clusters. Because the paragraph vector encodes a phrase as a whole rather than as isolated terms, phrases with the same tag should end up closer together, which should reduce the incorrect groupings.
(Shperber, G. (2019, November 05). A gentle introduction to doc2vec. Retrieved April 13, 2021, from https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)
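The difference between the two word-level schemes is easiest to see in the training examples they generate. The sketch below only illustrates that data-preparation step, with hypothetical helper names and a made-up token list; it does not train anything, and gensim's actual implementation differs in many details. In the distributed-memory variant of Doc2Vec, the paragraph vector is additionally fed in alongside the context words.

```python
def cbow_examples(tokens, window=1):
    """CBOW view: (context words -> target word) training examples."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=1):
    """Skip-Gram view: (target word -> one context word) training pairs."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        pairs.extend((target, c) for c in context)
    return pairs

tokens = ["find", "nearby", "pharmacy"]
print(cbow_examples(tokens))
print(skipgram_examples(tokens))
```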

In [16]:
patterns_doc2vec, doc2vec_model = preprocess_patterns(model_type='doc2vec')

The built-in `most_similar()` method of the document vectors ranks the phrases the model was trained on by their cosine similarity to a given vector. The most similar phrase is the one with the highest cosine similarity (equivalently, the smallest cosine distance). Below is a visualization of the phrase vectors, colored according to their tag (ideally, vectors will be closest to others with identical tags).
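As a reminder of what is being ranked, here is a minimal pure-Python sketch of cosine similarity and of choosing the best match. The vectors and names are invented for illustration, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = u.v / (|u| |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = [1.0, 0.0, 1.0]
candidates = {"a": [2.0, 0.0, 2.0], "b": [0.0, 1.0, 0.0], "c": [1.0, 1.0, 0.0]}

scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
best = max(scores, key=scores.get)  # highest similarity wins
print(best, scores)
```

Candidate "a" points in the same direction as the query even though it is twice as long, so it scores highest; cosine similarity ignores vector length.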

In [17]:
patterns_doc2vec_vectors = []
for pattern in df_intents.stemmed_pattern.values:
    l = pattern.split()
    patterns_doc2vec_vectors.append(doc2vec_model.infer_vector(l))
    
reduced_data = PCA(n_components=3).fit_transform(patterns_doc2vec_vectors)

x=reduced_data[:,0]
y=reduced_data[:,1]
z=reduced_data[:,2]
color=LabelEncoder().fit_transform(df_intents.tag.values)

sns.set(style = "darkgrid")
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection = '3d')

ax.set_xlabel("PCA Component 1")
ax.set_ylabel("PCA Component 2")
ax.set_zlabel("PCA Component 3")

N = len(df_intents.tag.unique())
cmap = plt.cm.jet
cmaplist = [cmap(i) for i in range(cmap.N)]
cmap = cmap.from_list('Custom cmap', cmaplist, cmap.N)

bounds = np.linspace(0,N,N+1)
norm = matplotlib.colors.BoundaryNorm(bounds, cmap.N)

scat = ax.scatter(x, y, z, c=color, s=100, cmap=cmap, norm=norm)

cb = plt.colorbar(scat, spacing='proportional',ticks=bounds)
cb.set_label('Tag')

plt.title('Doc2Vec Vectors')

plt.show()


#### Model Evaluation

In [18]:
user_prompt = ''

prompts = df_intents.patterns.values
responses = df_intents.responses.values

i = 0

missed = 0

print('Testing Chatbot')
print('-----------------------')

while True:
    user_prompt = prompts[i] #input("User: ")
    if user_prompt.lower()=='q':
        break
    lowered = re.sub(r"[^\w\s]", "", user_prompt).lower().split()
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in lowered]
    # Remove stop words
    stops_removed = [item for item in stripped if (item in words) or (item not in stop)]
    # Stem and lemmatize words
    porter = nltk.PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in stops_removed]
    stemmed = ' '.join([porter.stem(word) for word in lemmatized])
    
    # vectorize
    inferred_vector = doc2vec_model.infer_vector(simple_preprocess(stemmed))
    # find similar phrases
    similar = doc2vec_model.docvecs.most_similar([inferred_vector])
    similar = ' '.join(patterns_doc2vec[similar[0][0]].words)
    response = np.random.choice(list(df_intents.loc[df_intents.stemmed_pattern==similar]['responses'])[0])
        
    if user_prompt == "":
        response = "Sorry, can't understand you"
    r = responses[i]
    if response not in r:
        r = r[0]
        missed +=1
    else:
        r = response
    
    print('\nprompt:', user_prompt)
    print('predicted response:', response)
    print('expected response:', r)
        
    i += 1
    if i==len(prompts):
        break

print('-----------------------')
print('\n\nModel Accuracy:',(len(prompts)-missed)/len(prompts))

Testing Chatbot
-----------------------

prompt: Hi there
predicted response: Good to see you again
expected response: Good to see you again

prompt: How are you
predicted response: Hello, thanks for asking
expected response: Hello, thanks for asking

prompt: Is anyone there?
predicted response: Hello, thanks for asking
expected response: Hello, thanks for asking

prompt: Hey
predicted response: Good to see you again
expected response: Good to see you again

prompt: Hola
predicted response: Hi there, how can I help?
expected response: Hi there, how can I help?

prompt: Hello
predicted response: Hi there, how can I help?
expected response: Hi there, how can I help?

prompt: Good day
predicted response: Hi there, how can I help?
expected response: Hi there, how can I help?

prompt: Bye
predicted response: See you!
expected response: See you!

prompt: See you later
predicted response: See you!
expected response: See you!

prompt: Goodbye
predicted response: See you!
expected response: See

The Doc2Vec model performs exceptionally well. With large corpora, however, training can be computationally expensive due to the complexity of the model. For an all-encompassing chatbot, one could see how a complex model would be necessary. The scope of this model, however, is limited to the medical application that it serves. Perhaps a simpler model should be evaluated.

#### Bag-of-Words Model

The final model is a bag-of-words model.

The bag-of-words model makes a frequency list of all words in the entire corpus. For example, imagine our corpus is

['data science is fun', 'i love doing data science']

The frequency list would look like

[('data', 2), ('science', 2), ('is', 1), ('fun', 1), ('i', 1), ('love', 1), ('doing', 1)]

If our corpus were exceptionally large, we might discard words with a frequency less than or equal to 1 (or whatever threshold is desired). Because our corpus is small, we will keep all of the words. Next, a vector is created where each position corresponds to a word in our frequency list. If a word in our frequency list is in a particular phrase, the value at that position in our vector will be 1. Otherwise it is 0. For example,

['data', 'science', 'is', 'fun']  --> [1, 1, 1, 1, 0, 0, 0]

['i', 'love', 'doing', 'data', 'science'] --> [1, 1, 0, 0, 1, 1, 1]

(Mujtaba, H. (2021, April 20). An introduction to bag of words in nlp using python: What is bow? Retrieved April 22, 2021, from https://www.mygreatlearning.com/blog/bag-of-words/)
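The example above can be reproduced in a few lines of plain Python. This is a minimal sketch on the two-sentence corpus from the text; `to_binary_vector` is an illustrative helper name, and the project's own implementation lives in the `bow` branch of `preprocess_patterns`.

```python
corpus = ["data science is fun", "i love doing data science"]

# Count word frequencies; dicts preserve insertion order (Python 3.7+),
# so the vocabulary order matches the order of first appearance.
word_counts = {}
for sentence in corpus:
    for word in sentence.split():
        word_counts[word] = word_counts.get(word, 0) + 1

def to_binary_vector(sentence, vocabulary):
    """1 where the vocabulary word occurs in the sentence, 0 otherwise."""
    tokens = set(sentence.split())
    return [1 if word in tokens else 0 for word in vocabulary]

vectors = [to_binary_vector(s, word_counts) for s in corpus]
print(list(word_counts.items()))
print(vectors)
```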

In [19]:
bow_vec, word_freq = preprocess_patterns(model_type='bow')

A similarity measure will be used to compare two phrases, i.e., a phrase of interest is compared to each phrase in our bag-of-words. Cosine similarity, computed as one minus the cosine distance, will be used here. Just like in the Doc2Vec model, the most similar phrase is the one with the highest cosine similarity (the smallest cosine distance). Below is a visualization of the phrase vectors, colored according to their tag (ideally, vectors will be closest to others with identical tags).

In [20]:
reduced_data = PCA(n_components=3).fit_transform(bow_vec)

x=reduced_data[:,0]
y=reduced_data[:,1]
z=reduced_data[:,2]
color=LabelEncoder().fit_transform(df_intents.tag.values)

sns.set(style = "darkgrid")
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection = '3d')

ax.set_xlabel("PCA Component 1")
ax.set_ylabel("PCA Component 2")
ax.set_zlabel("PCA Component 3")

N = len(df_intents.tag.unique())
cmap = plt.cm.jet
cmaplist = [cmap(i) for i in range(cmap.N)]
cmap = cmap.from_list('Custom cmap', cmaplist, cmap.N)

bounds = np.linspace(0,N,N+1)
norm = matplotlib.colors.BoundaryNorm(bounds, cmap.N)

scat = ax.scatter(x, y, z, c=color, s=100, cmap=cmap, norm=norm)

cb = plt.colorbar(scat, spacing='proportional',ticks=bounds)
cb.set_label('Tag')

plt.title('BOW Vectors')

plt.show()


#### Model Evaluation

In [21]:
user_prompt = ''

prompts = df_intents.patterns.values
responses = df_intents.responses.values

i = 0

missed = 0

print('Testing Chatbot')
print('-----------------------')

while True:
    user_prompt = prompts[i] #input("User: ")
    if user_prompt.lower()=='q':
        break
    lowered = re.sub(r"[^\w\s]", "", user_prompt).lower().split()
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in lowered]
    # Remove stop words
    stops_removed = [item for item in stripped if (item in words) or (item not in stop)]
    # Stem and lemmatize words
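    porter = nltk.PorterStemmer()
    # define the lemmatizer here so this cell does not depend on earlier cells
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in stops_removed]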
    stemmed = ' '.join([porter.stem(word) for word in lemmatized])
    
    # vectorize
    sent_vec = []
    sentence_tokens = simple_preprocess(stemmed)
    for token in word_freq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    
    # find similar phrases
    similar = []
    for vec in bow_vec:
        similar.append(1 - pairwise_distances([vec], [sent_vec], metric = 'cosine')[0][0])
    response = np.random.choice(list(df_intents.loc[np.argmax(similar)]['responses']))
        
    if user_prompt == "":
        response = "Sorry, can't understand you"
    r = responses[i]
    if response not in r:
        r = r[0]
        missed +=1
    else:
        r = response
    
    print('\nprompt:', user_prompt)
    print('predicted response:', response)
    print('expected response:', r)
        
    i += 1
    if i==len(prompts):
        break

print('-----------------------')
print('\n\nModel Accuracy:',(len(prompts)-missed)/len(prompts))

Testing Chatbot
-----------------------

prompt: Hi there
predicted response: Good to see you again
expected response: Good to see you again

prompt: How are you
predicted response: Good to see you again
expected response: Good to see you again

prompt: Is anyone there?
predicted response: Hi there, how can I help?
expected response: Hi there, how can I help?

prompt: Hey
predicted response: Hello, thanks for asking
expected response: Hello, thanks for asking

prompt: Hola
predicted response: Good to see you again
expected response: Good to see you again

prompt: Hello
predicted response: Hi there, how can I help?
expected response: Hi there, how can I help?

prompt: Good day
predicted response: Hello, thanks for asking
expected response: Hello, thanks for asking

prompt: Bye
predicted response: See you!
expected response: See you!

prompt: See you later
predicted response: See you!
expected response: See you!

prompt: Goodbye
predicted response: See you!
expected response: See you!

p

### Conclusion

All of the models perform well, but the Doc2Vec and Bag-of-Words models stand out in comparison to the TF-IDF clustering model. The 3D visualizations of the pattern vectors suggest that the Bag-of-Words model does a better job of representing phrases and grouping similar ones together. So the winner is the Bag-of-Words model.