# AI – Chatbot
Description:

In this project, you need to build a corpus-based conversational chatbot using
NLTK and Python. Please have a look at the tutorial provided at
https://medium.com/swlh/a-chatbot-in-python-using-nltk-938a37a9eacc for guidance
and ideas.
You need to perform the following tasks:
1. Use the dataset provided in the tutorial or develop your own dataset with
similar structure.
2. Perform text normalization: convert text to lower case, remove special
characters, and perform lemmatization. Remove any stopwords.
3. Use word embeddings such as: bag of words and TF-IDF, and compute
cosine similarity.
4. Compare the performance and results of the two methods, i.e., bag of words
and TF-IDF.
5. Customize using any of the NLP techniques we have learned.

# Import necessary dependencies 

In [1]:
import pandas as pd
import nltk
import numpy as np
import re  #regular expressions
from nltk.stem import wordnet  # for lemmatization
from sklearn.feature_extraction.text import CountVectorizer  # for bag of words (bow)
from sklearn.feature_extraction.text import TfidfVectorizer  #for tfidf
from nltk import pos_tag  # for parts of speech
from sklearn.metrics import pairwise_distances  # cosine similarity
from nltk import word_tokenize
from nltk.corpus import stopwords 
nltk.download('omw-1.4')  # Open Multilingual WordNet data; the lemmatizer called later through .apply() seemed to need it

[nltk_data] Error loading omw-1.4: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


False

# Choose dataset and read it into a data frame

In [2]:
#check what files are available after adding the database manually
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [3]:
df1= pd.read_excel('dialog_talk_agent.xlsx')  # read the database into a data frame
df1.head()  # see first 5 lines

Unnamed: 0,Context,Text Response
0,Tell me about your personality,Just think of me as the ace up your sleeve.
1,I want to know you better,I can help you work smarter instead of harder
2,Define yourself,
3,Describe yourself,
4,tell me about yourself,


Why not add another dataset?

In [4]:
df2= pd.read_json('train.json')  # read the database into a data frame
df2.head()  # see first 5 lines

Unnamed: 0,viewed_doc_titles,used_queries,annotations,nq_answer,id,nq_doc_title,question
0,[The Simpsons],[{'query': 'When did the simpsons first air on...,"[{'type': 'multipleQAs', 'qaPairs': [{'questio...","[December 17 , 1989]",-4469503464110108160,The Simpsons,When did the simpsons first air on television?
1,[John Adams (miniseries)],"[{'query': 'John adams tv', 'results': [{'titl...","[{'type': 'singleAnswer', 'answer': ['David Mo...",[David Morse],4790842463458965504,John Adams (miniseries),Who played george washington in the john adams...
2,[Marriage age in the United States],"[{'query': 'legal age of marriage in usa', 're...","[{'type': 'multipleQAs', 'qaPairs': [{'questio...","[18, Nebraska ( 19 ), Mississippi ( 21 )]",-6631915997977101312,Age of marriage in the United States,What is the legal age of marriage in usa?
3,"[Barefoot in the Park, Barefoot in the Park (f...",[{'query': 'Who starred in barefoot in the par...,"[{'type': 'multipleQAs', 'qaPairs': [{'questio...","[Robert Redford, Elizabeth Ashley]",-3098213414945179648,Barefoot in the Park,Who starred in barefoot in the park on broadway?
4,"[Timeline of the Manhattan Project, Manhattan ...",[{'query': 'When did the manhattan project beg...,"[{'type': 'multipleQAs', 'qaPairs': [{'questio...",[From 1942 to 1946],-927805218867163520,Timeline of the Manhattan Project,When did the manhattan project began and end?


Now, let's make this dataset look like our original one, that is, two columns with the same headings.

In [5]:
# delete columns other than question and nq_answer
df2 = df2.drop(columns=['viewed_doc_titles', 'used_queries', 'annotations', 'id', 'nq_doc_title'])
df2 = df2.reindex(columns=['question', 'nq_answer'])  # swap the order for questions to be first
df2 = df2.rename(columns={'question': 'Context', 'nq_answer': 'Text Response'})
df2.head()

Unnamed: 0,Context,Text Response
0,When did the simpsons first air on television?,"[December 17 , 1989]"
1,Who played george washington in the john adams...,[David Morse]
2,What is the legal age of marriage in usa?,"[18, Nebraska ( 19 ), Mississippi ( 21 )]"
3,Who starred in barefoot in the park on broadway?,"[Robert Redford, Elizabeth Ashley]"
4,When did the manhattan project began and end?,[From 1942 to 1946]


In [6]:
# there are brackets included in the Text Responses, doesn't look great
def remove_brackets(text):
    new_text = str(text).replace('[', '')  # replace left square bracket character with nothing
    new_text = str(new_text).replace(']', '')  # do the same with the right one
    return new_text

df2['Text Response'] = df2['Text Response'].apply(remove_brackets)  # remove all the brackets for the selected column
df2.head()

Unnamed: 0,Context,Text Response
0,When did the simpsons first air on television?,"'December 17 , 1989'"
1,Who played george washington in the john adams...,'David Morse'
2,What is the legal age of marriage in usa?,"'18', 'Nebraska ( 19 )', 'Mississippi ( 21 )'"
3,Who starred in barefoot in the park on broadway?,"'Robert Redford', 'Elizabeth Ashley'"
4,When did the manhattan project began and end?,'From 1942 to 1946'


In [7]:
# now we have some apostrophes at the start and end of our responses...
def remove_first_and_last_character(text):
    return str(text)[1:-1]  # slice from the second character till the last one (excluding the last one)

df2['Text Response'] = df2['Text Response'].apply(remove_first_and_last_character)  # execute it for the whole column
df2.head()

Unnamed: 0,Context,Text Response
0,When did the simpsons first air on television?,"December 17 , 1989"
1,Who played george washington in the john adams...,David Morse
2,What is the legal age of marriage in usa?,"18', 'Nebraska ( 19 )', 'Mississippi ( 21 )"
3,Who starred in barefoot in the park on broadway?,"Robert Redford', 'Elizabeth Ashley"
4,When did the manhattan project began and end?,From 1942 to 1946
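
The slicing approach above still leaves stray quote characters in rows that hold several answers (for example `18', 'Nebraska ( 19 )', 'Mississippi ( 21 )`). One possible alternative, sketched here under the assumption that each `nq_answer` cell is still a Python list when the helper runs, is to join the list items directly. The helper name `join_answers` is hypothetical and not part of the original notebook.

```python
def join_answers(answers):
    """Join a list of answers into one string; other values go through str()."""
    if isinstance(answers, (list, tuple)):
        return ", ".join(str(answer) for answer in answers)
    return str(answers)

# Toy rows shaped like the nq_answer column shown earlier
rows = [["December 17 , 1989"], ["Robert Redford", "Elizabeth Ashley"]]
cleaned = [join_answers(row) for row in rows]
print(cleaned)
```

With pandas, the same helper could be passed to `.apply()` on the answer column in place of the two string-based helpers.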


In [8]:
# compare the number of rows for each dataset
print("Number of rows in the original dataset: ", df1.shape[0])
print("Number of rows in the new dataset: ", df2.shape[0])

Number of rows in the original dataset:  1592
Number of rows in the new dataset:  10036


In [9]:
df = pd.DataFrame()  # create blank dataframe
column1 = [*df1['Context'].tolist(), *df2['Context'].tolist()]  # combine the question columns of both datasets
column2 = [*df1['Text Response'].tolist(), *df2['Text Response'].tolist()]  # combine the response columns of both datasets
df.insert(0, 'Context', column1, True)  # insert first column
df.insert(1, 'Text Response', column2, True)  # insert second one
print("Number of rows in the combined dataset: ", df.shape[0])

Number of rows in the combined dataset:  11628


In [10]:
df.head() 

Unnamed: 0,Context,Text Response
0,Tell me about your personality,Just think of me as the ace up your sleeve.
1,I want to know you better,I can help you work smarter instead of harder
2,Define yourself,
3,Describe yourself,
4,tell me about yourself,


Null values appear where several similar questions share one response: only the first question in each group carries the response, and the remaining rows are left empty. We can use `ffill()`, which replaces each null with the most recent non-null response above it, as shown below.

In [11]:
df.ffill(axis = 0, inplace = True)   # fill the null value with the previous value
df.head()  # see first 5 lines

Unnamed: 0,Context,Text Response
0,Tell me about your personality,Just think of me as the ace up your sleeve.
1,I want to know you better,I can help you work smarter instead of harder
2,Define yourself,I can help you work smarter instead of harder
3,Describe yourself,I can help you work smarter instead of harder
4,tell me about yourself,I can help you work smarter instead of harder
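
To make the behaviour of forward filling concrete, here is a minimal pure-Python sketch of the same idea. The function `forward_fill` is a hypothetical stand-in written for illustration; it is not how pandas implements `ffill()`.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled

responses = [
    "Just think of me as the ace up your sleeve.",
    "I can help you work smarter instead of harder",
    None,
    None,
    None,
]
print(forward_fill(responses))
```

Because each empty row inherits whatever sits directly above it, the result depends on row order, so the fill should happen before any shuffling or sorting.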


# Preprocess data

In [12]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [WinError 10060] A connection attempt failed because
[nltk_data]     the connected party did not properly respond after a
[nltk_data]     period of time, or established connection failed
[nltk_data]     because connected host has failed to respond>
[nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


False

In [13]:
import nltk
nltk.download('wordnet')
nltk.download()  # opens the interactive NLTK downloader

[nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Convert text to lower case and remove special characters

In [14]:
def cleaning(x):
    cleaned_array = list()
    for i in x:
        a = str(i).lower()  # convert to all lower letters
        p = re.sub(r'[^a-z0-9]', ' ', a)  # remove any special characters but keep numbers
        cleaned_array.append(p)  # add the cleaned string p to our list named cleaned_array
    return cleaned_array

# Add an extra column so the cleaned text can be compared with the original
df.insert(1, 'Cleaned Context', cleaning(df['Context']), True)
# Arguments: the position of the new column, its name, the values to fill it with,
# and a boolean that allows a duplicate column name
df.head()



Unnamed: 0,Context,Cleaned Context,Text Response
0,Tell me about your personality,tell me about your personality,Just think of me as the ace up your sleeve.
1,I want to know you better,i want to know you better,I can help you work smarter instead of harder
2,Define yourself,define yourself,I can help you work smarter instead of harder
3,Describe yourself,describe yourself,I can help you work smarter instead of harder
4,tell me about yourself,tell me about yourself,I can help you work smarter instead of harder
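
Since the regular expression replaces every unwanted character with a space, punctuation can leave repeated or trailing spaces behind. The sketch below shows the effect on a single string and adds a whitespace-collapsing step; that extra step and the name `clean_keep_digits` are assumptions for illustration, not part of the original `cleaning` function.

```python
import re

def clean_keep_digits(text):
    lowered = str(text).lower()
    spaced = re.sub(r'[^a-z0-9]', ' ', lowered)  # non-alphanumeric characters become spaces
    return " ".join(spaced.split())  # collapse repeated spaces and trim the ends

print(clean_keep_digits("What's 2+2?"))  # what s 2 2
print(clean_keep_digits("Tell me about   your personality!"))
```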


### Create function to clean our data and carry out lemmatization

In [15]:
def text_normalization(text):
    text = str(text).lower()  # convert to all lower letters
    spl_char_text = re.sub(r'[^a-z]', ' ', text)  # remove any special characters including numbers
    tokens = nltk.word_tokenize(spl_char_text)  # tokenize words
    lema = wordnet.WordNetLemmatizer()  # lemmatizer initiation
    tags_list = pos_tag(tokens, tagset = None)  # parts of speech
    lema_words = []
    for token, pos_token in tags_list:
        if pos_token.startswith('V'):  # if the tag from tags_list is a verb, assign 'v' to its pos_val
            pos_val = 'v'
        elif pos_token.startswith('J'):  # adjective
            pos_val = 'a'
        elif pos_token.startswith('R'):  # adverb
            pos_val = 'r'
        else:  # otherwise treat it as a noun
            pos_val = 'n'
        lema_token = lema.lemmatize(token, pos_val)  # performing lemmatization
        lema_words.append(lema_token)  # adding the lemmatized word to our list
    return " ".join(lema_words)  # return our list as a single sentence
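
The tag mapping inside `text_normalization` can be looked at in isolation. The sketch below pulls it into a hypothetical helper, `penn_to_wordnet`, so the fallback behaviour is easy to see; the sample tags are standard Penn Treebank tags of the kind `pos_tag` returns by default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # everything else is treated as a noun

for tag in ['VBD', 'JJR', 'RB', 'NNS', 'PRP']:
    print(tag, '->', penn_to_wordnet(tag))
```

Note that tags such as `PRP` (pronoun) fall through to `'n'`, so the noun branch is a default rather than a guarantee that the token is a noun.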

In [16]:
normalized = df['Context'].apply(text_normalization)
df.insert(2, 'Normalized Context', normalized, True)
df.head()

Unnamed: 0,Context,Cleaned Context,Normalized Context,Text Response
0,Tell me about your personality,tell me about your personality,tell me about your personality,Just think of me as the ace up your sleeve.
1,I want to know you better,i want to know you better,i want to know you good,I can help you work smarter instead of harder
2,Define yourself,define yourself,define yourself,I can help you work smarter instead of harder
3,Describe yourself,describe yourself,describe yourself,I can help you work smarter instead of harder
4,tell me about yourself,tell me about yourself,tell me about yourself,I can help you work smarter instead of harder


### Also create function to remove stop words from text

In [17]:
stop = stopwords.words('english')
def removeStopWords(text):
    kept = []
    for w in text.split():  # split the sentence into words
        if w not in stop:  # keep only the words that are not stop words
            kept.append(w)
    return " ".join(kept)  # rebuild a sentence from the remaining words

In [18]:
normalized_non_stopwords = df['Normalized Context'].apply(removeStopWords)
df.insert(3, 'Normalized and StopWords Removed', normalized_non_stopwords, True)
df.head()

Unnamed: 0,Context,Cleaned Context,Normalized Context,Normalized and StopWords Removed,Text Response
0,Tell me about your personality,tell me about your personality,tell me about your personality,tell personality,Just think of me as the ace up your sleeve.
1,I want to know you better,i want to know you better,i want to know you good,want know good,I can help you work smarter instead of harder
2,Define yourself,define yourself,define yourself,define,I can help you work smarter instead of harder
3,Describe yourself,describe yourself,describe yourself,describe,I can help you work smarter instead of harder
4,tell me about yourself,tell me about yourself,tell me about yourself,tell,I can help you work smarter instead of harder
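
One detail worth keeping in mind is that the membership test against `stop` is case-sensitive and sees punctuation, so the result depends on whether the text has already been normalized. The sketch below illustrates this with a small hand-picked stop word set; `STOP_WORDS` and `remove_stop_words` are hypothetical names, and the set is only a toy subset rather than the full NLTK list.

```python
STOP_WORDS = {"will", "you", "me", "and", "more", "about", "yourself"}  # toy subset

def remove_stop_words(text, stop_words=STOP_WORDS):
    return " ".join(word for word in text.split() if word not in stop_words)

raw = "Will you help me and tell me more about yourself?"
normalized = "will you help me and tell me more about yourself"

print(remove_stop_words(raw))         # Will help tell yourself?
print(remove_stop_words(normalized))  # help tell
```

Running stop word removal on raw input keeps `Will` and `yourself?`, while running it on normalized text removes them.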


# Bag of words
### BOW is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set and represents each document by how often each vocabulary word occurs in it.
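
As a rough illustration of the idea, the following pure-Python sketch builds a vocabulary from two toy sentences and turns each sentence into a count vector. It is a simplified stand-in, not the scikit-learn implementation: for instance, `CountVectorizer` by default ignores single-character tokens, which this sketch does not.

```python
from collections import Counter

docs = ["tell me about yourself", "tell me about your personality"]

# Vocabulary: every unique word across all documents, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]  # one count per vocabulary word

vectors = [to_count_vector(doc) for doc in docs]
print(vocabulary)  # ['about', 'me', 'personality', 'tell', 'your', 'yourself']
print(vectors)     # [[1, 1, 0, 1, 0, 1], [1, 1, 1, 1, 1, 0]]
```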

In [19]:
cv = CountVectorizer()  # initializing count vectorizer
x_bow = cv.fit_transform(df['Normalized Context']).toarray()  # roughly speaking, this converts each sentence into a vector of word counts

features_bow = cv.get_feature_names_out()  # get the vocabulary, i.e. all the normalized words
df_bow = pd.DataFrame(x_bow, columns = features_bow)  # create dataframe showing the count of each word in each sentence
df_bow.head()

Unnamed: 0,aaron,ab,abba,abbey,abbott,abbreviation,abby,abdomen,abduct,abdul,...,zombie,zone,zoo,zootopia,zorro,zulu,zumbo,zumbos,zuzu,zymase
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
def chat_bow(question):
    tidy_question = text_normalization(removeStopWords(question))  # drop stop words (case-sensitive match), then clean & lemmatize the question
    cv_ = cv.transform([tidy_question]).toarray()  # convert the question into a vector
    cos = 1 - pairwise_distances(df_bow, cv_, metric = 'cosine')  # cosine similarity = 1 - cosine distance
    index_value = cos.argmax()  # find the index of the maximum cosine similarity
    return df['Text Response'].loc[index_value]  # use index to choose the reply from the Text Response column

In [21]:
# call the chat_bow function with the question as an argument
chat_bow('Will you help me and tell me more about yourself?')

'I can help you work smarter instead of harder'
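
The retrieval step in `chat_bow` comes down to computing the cosine similarity between the question vector and every row, then picking the row with the highest score. The sketch below shows that logic in plain Python on hand-made count vectors; the vectors, the placeholder responses, and the helper `cosine_similarity` are illustrative assumptions, not data from the notebook.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction; treat as no similarity
    return dot / (norm_u * norm_v)

corpus_vectors = [[1, 1, 0, 1, 0, 1], [1, 1, 1, 1, 1, 0]]
responses = ["response A", "response B"]
query_vector = [0, 0, 1, 1, 0, 0]

scores = [cosine_similarity(vec, query_vector) for vec in corpus_vectors]
best = max(range(len(scores)), key=scores.__getitem__)  # index of the highest score
print(scores, responses[best])
```

In this sketch, a question that shares no words with the corpus scores 0 against every row, so the first row wins by default. A minimum-score threshold with a fallback reply would be one way to handle that case.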

# Term Frequency - Inverse Document Frequency
### TF-IDF is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus).
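
To show where the scores come from, here is a small sketch of the textbook formula, tf-idf = (term frequency) x log(N / document frequency), on three toy sentences. The numbers will not match `TfidfVectorizer`, which by default uses a smoothed inverse document frequency and L2-normalizes each row; the function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter

docs = ["tell me about yourself", "describe yourself", "tell me a joke"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_weights(tokens):
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = tfidf_weights(tokenized[0])
print(weights)
```

In the first sentence, `about` appears in only one document and therefore gets a higher weight than `tell`, `me`, or `yourself`, which each appear in two.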

In [22]:
tfidf = TfidfVectorizer()  # initializing tf-idf
x_tfidf = tfidf.fit_transform(df['Normalized Context']).toarray()  # convert data into array

features_tfidf = tfidf.get_feature_names_out() # use function to get all the normalized words
df_tfidf = pd.DataFrame(x_tfidf, columns = features_tfidf)  # create dataframe to show word score for each word
df_tfidf.loc[:,['you', 'yourself']].head()  # show only specific columns that are named you and yourself

Unnamed: 0,you,yourself
0,0.0,0.0
1,0.312233,0.0
2,0.0,0.698254
3,0.0,0.71098
4,0.0,0.604955


In [23]:
def chat_tfidf(question):
    tidy_question = text_normalization(removeStopWords(question))  # drop stop words (case-sensitive match), then clean & lemmatize the question
    tf = tfidf.transform([tidy_question]).toarray()  # convert the question into a vector
    cos = 1 - pairwise_distances(df_tfidf, tf, metric = 'cosine')  # cosine similarity = 1 - cosine distance
    index_value = cos.argmax()  # find the index of the maximum cosine similarity
    return df['Text Response'].loc[index_value]  # use index to choose the reply from the Text Response column

In [24]:
# call the chat_tfidf function with the question as an argument
chat_tfidf('Will you help me and tell me more about yourself?')

'I can help you work smarter instead of harder'

## Sentiment Analysis

Use ***textblob*** for quick sentiment analysis - see references

In [26]:
!pip install TextBlob

Collecting TextBlob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
     -------------------------------------- 636.8/636.8 kB 4.0 MB/s eta 0:00:00
Installing collected packages: TextBlob
Successfully installed TextBlob-0.17.1


In [27]:
from textblob import TextBlob

def senti(text):
    blob = TextBlob(text)
    return blob.polarity  # polarity ranges from -1 (negative) to 1 (positive)

print("polarity",senti("This is good"))
print("polarity",senti("This is not good"))

polarity 0.7
polarity -0.35


## Create function to create a chatbot experience
### ***While loop*** seems perfect for this task

In [28]:
def chatbot(method):
    print("Welcome to the chatbot! Type q to close it, otherwise let's keep talking :)")
    while True:
        user_input_question = input("Your question:")

        if user_input_question.lower() == 'q':  # check for exit before doing anything else
            print("Thank you for your time and see you around!")
            break

        # show a thumbs-down for negative sentiment, a thumbs-up otherwise
        if senti(user_input_question) < 0:
            print("\U0001F44E")
        else:
            print("\U0001F44D")

        if method == 'bow':
            print('Chatbot answer: ', chat_bow(user_input_question))
        elif method == 'tfidf':
            print('Chatbot answer: ', chat_tfidf(user_input_question))
            

# ChatBot
## Call the chatbot with either Bag of Words or Tf-Idf methods
### Uncomment one of the lines and run the cell when you feel ready

In [None]:
chatbot('bow')  # start the chatbot based on the bag of words method

# chatbot('tfidf')  # start the chatbot based on the tf-idf method

Welcome to the chatbot! Type q to close it, otherwise let's keep talking :)
Your question:I want to know you better
👍
Chatbot answer:  I can help you work smarter instead of harder


# Conclusion
With the available datasets, the Bag of Words and TF-IDF methods gave comparable results; for the test question shown above, both returned the same response.
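
A comparison like this could be made more systematic by running both methods over a list of test questions and measuring how often they return the same response. The sketch below is one possible way to do that; `agreement_rate` and the two toy responders are hypothetical stand-ins, and in the notebook `chat_bow` and `chat_tfidf` would be passed in instead.

```python
def agreement_rate(questions, responder_a, responder_b):
    """Fraction of questions for which both responders return the same answer."""
    if not questions:
        return 0.0
    matches = sum(responder_a(q) == responder_b(q) for q in questions)
    return matches / len(questions)

# Toy stand-ins for chat_bow and chat_tfidf
def toy_bow(question):
    return "A" if "you" in question else "B"

def toy_tfidf(question):
    return "A" if "yourself" in question else "B"

questions = ["tell me about yourself", "who are you", "what is the weather"]
rate = agreement_rate(questions, toy_bow, toy_tfidf)
print(rate)
```

Agreement alone does not show which method is more accurate; that would need a set of questions with known expected responses.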

Creating this chatbot was fun, and I was able to extend the main idea outlined in the tutorial.

Although two datasets were combined, a more complete dataset would still be greatly beneficial, and techniques such as neural networks might yield better results.