# Hotel Chatbot

This notebook is based on the Hotel Chatbot problem: how can we provide better customer service at the front desk of a hotel? One simple way is to have an FAQ chatbot that can answer simple questions about the hotel experience. A chatbot has several advantages:

1. It increases the appeal of the hotel and also increases information throughput.
2. It creates a way to gather questions about the hotel in a data table form. 

In this notebook, we will look at two different models: the cosine similarity model and the doc2vec model.


## 1. Adding a knowledge base to your chatbot

The chatbot's ability to converse is defined by the data available to it. Take a look at the questions file (`Module15-08-HotelChatbot_ques.txt`) and the answers file (`Module15-08-HotelChatbot_ans.txt`), which holds the answers to those questions. The chatbot computes the cosine similarity between the question a user poses and every question in the question bank, then returns the answer paired with the closest match.
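
To make the idea concrete before any libraries are involved, here is a minimal sketch of cosine similarity over raw word counts. It is only an illustration: the helper name `bow_cosine` and the two sample questions are assumptions made for this example, and the notebook itself uses TF-IDF weights from scikit-learn rather than raw counts.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between two sentences using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words the two sentences share
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Illustrative question bank (not read from the data files)
faq = ["what time is breakfast served", "how many floors are in the hotel"]
query = "when is breakfast served"

scores = [bow_cosine(query, q) for q in faq]
best = faq[scores.index(max(scores))]
print(best, scores)
```

The query shares three words with the first question and none with the second, so the first question scores highest and the second scores zero.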

Let us begin by importing relevant libraries.

In [1]:
import nltk # to process text data
import numpy as np # to represent corpus as arrays
import random 
import operator
import string # to process standard python strings
from sklearn.metrics.pairwise import cosine_similarity # We will use this later to decide how similar two sentences are
from sklearn.feature_extraction.text import TfidfVectorizer # Remember when you built a function to create a tfidf bag of words in Experience 2? This function does the same thing!

In [2]:
nltk.download('wordnet') # first-time use only; used by the lemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [3]:
filepath = './Module15-08-HotelChatbot_ans.txt'
with open(filepath, 'r', errors='ignore') as corpus:  # the file is closed once it has been read
    raw_data_ans = corpus.read()
print(raw_data_ans)

200$ per night is the price for a basic suite.

This establishment was constructed and inaugurated by John S. on the 23rd of September 1965.

Breakfast is served from 7 AM to 10 AM.

The breakfast menu is decided as per the head chefs decision on the night before. Kindly contact hotel staff for information abou the menu on the night before. 

The Vance hotel is a singular establishment designed and constructed by Lindsey Vance in 1949. It has hosted many dignitaries and government officials over the years.

There are 43 rooms in this hotel including one pent house suite.

We offer 4 types of rooms: basic, mid-level, premium and penthouse.

Yes, room service is available 24 hrs.

Yes we have one restaurant currently called 'Rouge'.

There are 12 floors in the hotel.

300$ per night is the standard price for a mid-level suite.

500$ per night is the price for a premium suite.

Yes, we have tuxedo services available at the reception.

Yes, we have a laundry service at the reception.

'Rou

In [4]:
filepath = './Module15-08-HotelChatbot_ques.txt'
with open(filepath, 'r', errors='ignore') as corpus:  # the file is closed once it has been read
    raw_data = corpus.read()
print(raw_data)

What is the price of one night stay in basic suite?

How old is this establishment?

What time is breakfast served?

What is the breakfast menu?

What is the history behind this hotel or establishment?

How many rooms are in this hotel?

What are the types of room or suites offered by the hotel?

What is the name of the hotel?

Is room service served 24 hours?

How do I call for room service?

Are there any restaurants in the hotel?

How many floors are in the hotel?

What is the price of one night stay at the mid-level suite?

What is the price of one night stay at the premium suite?

Do you have tuxedo services?

Do you have a laundry service?

What time do the restaurants open for dinner?

What are the near by tourist attractions?

Is there a spa in the hotel?

Is there anywhere I can get a massage?

What time is the check in?

What time is the check out?

Do you offer handicapped rooms?

Is parking available at the hotel?

Can I reserve a parking lot?

What are the reception openin

####  Conversion to lower case

We will convert all text to lower case first. Remember to inspect the result once we are done.

In [5]:
raw_data=raw_data.lower()# converts to lowercase
print (raw_data)

what is the price of one night stay in basic suite?

how old is this establishment?

what time is breakfast served?

what is the breakfast menu?

what is the history behind this hotel or establishment?

how many rooms are in this hotel?

what are the types of room or suites offered by the hotel?

what is the name of the hotel?

is room service served 24 hours?

how do i call for room service?

are there any restaurants in the hotel?

how many floors are in the hotel?

what is the price of one night stay at the mid-level suite?

what is the price of one night stay at the premium suite?

do you have tuxedo services?

do you have a laundry service?

what time do the restaurants open for dinner?

what are the near by tourist attractions?

is there a spa in the hotel?

is there anywhere i can get a massage?

what time is the check in?

what time is the check out?

do you offer handicapped rooms?

is parking available at the hotel?

can i reserve a parking lot?

what are the reception openin

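####  Segmentation, lemmatization and tokenization

Next, we split the text into sentences (segmentation) and into words (tokenization), and define a lemmatizer that reduces each word to its dictionary form. Remember to inspect the result of each step.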

In [6]:
nltk.download('punkt')
sent_tokens = nltk.sent_tokenize(raw_data)# converts documents to list of sentences 

print(sent_tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['what is the price of one night stay in basic suite?', 'how old is this establishment?', 'what time is breakfast served?', 'what is the breakfast menu?', 'what is the history behind this hotel or establishment?', 'how many rooms are in this hotel?', 'what are the types of room or suites offered by the hotel?', 'what is the name of the hotel?', 'is room service served 24 hours?', 'how do i call for room service?', 'are there any restaurants in the hotel?', 'how many floors are in the hotel?', 'what is the price of one night stay at the mid-level suite?', 'what is the price of one night stay at the premium suite?', 'do you have tuxedo services?', 'do you have a laundry service?', 'what time do the restaurants open for dinner?', 'what are the near by tourist attractions?', 'is there a spa in the hotel?', 'is there anywhere i can get a massage?', 'what time is the check in?', 'what time is the check out?', 'do you offer handicapped rooms?', 'is parking available at the hotel?', 'can i res

In [7]:
sent_tokens_ans = nltk.sent_tokenize(raw_data_ans)# converts documents to list of sentences 
print(sent_tokens_ans)

['200$ per night is the price for a basic suite.', 'This establishment was constructed and inaugurated by John S. on the 23rd of September 1965.', 'Breakfast is served from 7 AM to 10 AM.', 'The breakfast menu is decided as per the head chefs decision on the night before.', 'Kindly contact hotel staff for information abou the menu on the night before.', 'The Vance hotel is a singular establishment designed and constructed by Lindsey Vance in 1949.', 'It has hosted many dignitaries and government officials over the years.', 'There are 43 rooms in this hotel including one pent house suite.', 'We offer 4 types of rooms: basic, mid-level, premium and penthouse.', 'Yes, room service is available 24 hrs.', "Yes we have one restaurant currently called 'Rouge'.", 'There are 12 floors in the hotel.', '300$ per night is the standard price for a mid-level suite.', '500$ per night is the price for a premium suite.', 'Yes, we have tuxedo services available at the reception.', 'Yes, we have a laundr

In [8]:
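# Pair each question with the answer at the same index. This assumes the two lists line up one-to-one;
# inspect the printed dictionary, because answers that contain more than one sentence shift the pairing.
res = {sent_tokens[i]: sent_tokens_ans[i] for i in range(len(sent_tokens))} 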
print(res)

{'what is the price of one night stay in basic suite?': '200$ per night is the price for a basic suite.', 'how old is this establishment?': 'This establishment was constructed and inaugurated by John S. on the 23rd of September 1965.', 'what time is breakfast served?': 'Breakfast is served from 7 AM to 10 AM.', 'what is the breakfast menu?': 'The breakfast menu is decided as per the head chefs decision on the night before.', 'what is the history behind this hotel or establishment?': 'Kindly contact hotel staff for information abou the menu on the night before.', 'how many rooms are in this hotel?': 'The Vance hotel is a singular establishment designed and constructed by Lindsey Vance in 1949.', 'what are the types of room or suites offered by the hotel?': 'It has hosted many dignitaries and government officials over the years.', 'what is the name of the hotel?': 'There are 43 rooms in this hotel including one pent house suite.', 'is room service served 24 hours?': 'We offer 4 types of 
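
In the dictionary printed above, 'what is the history behind this hotel or establishment?' is paired with the second sentence of the breakfast-menu answer, because sentence tokenization splits a two-sentence answer into two list entries. One possible alternative, sketched below, is to pair the files line by line instead. This assumes each file holds exactly one entry per non-empty line, which is an assumption about the data files and not something the notebook verifies; the helper name `pair_by_line` and the sample strings are made up for illustration.

```python
def pair_by_line(questions_text, answers_text):
    """Pair the i-th non-empty question line with the i-th non-empty answer line."""
    questions = [ln.strip().lower() for ln in questions_text.splitlines() if ln.strip()]
    answers = [ln.strip() for ln in answers_text.splitlines() if ln.strip()]
    # Fail loudly instead of silently shifting answers onto the wrong questions
    if len(questions) != len(answers):
        raise ValueError(f"{len(questions)} questions but {len(answers)} answers")
    return dict(zip(questions, answers))

# Illustrative text (shortened; not the real file contents)
sample_q = "What is the breakfast menu?\n\nHow many rooms are in this hotel?\n"
sample_a = ("The breakfast menu is decided the night before. Kindly contact hotel staff.\n\n"
            "There are 43 rooms in this hotel.\n")

pairs = pair_by_line(sample_q, sample_a)
print(pairs["how many rooms are in this hotel?"])
```

With this pairing a multi-sentence answer stays attached to its own question, and a mismatch in the number of lines raises an error instead of producing a shifted dictionary.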

In [9]:
word_tokens = nltk.word_tokenize(raw_data)# converts documents to list of words
print (word_tokens)

['what', 'is', 'the', 'price', 'of', 'one', 'night', 'stay', 'in', 'basic', 'suite', '?', 'how', 'old', 'is', 'this', 'establishment', '?', 'what', 'time', 'is', 'breakfast', 'served', '?', 'what', 'is', 'the', 'breakfast', 'menu', '?', 'what', 'is', 'the', 'history', 'behind', 'this', 'hotel', 'or', 'establishment', '?', 'how', 'many', 'rooms', 'are', 'in', 'this', 'hotel', '?', 'what', 'are', 'the', 'types', 'of', 'room', 'or', 'suites', 'offered', 'by', 'the', 'hotel', '?', 'what', 'is', 'the', 'name', 'of', 'the', 'hotel', '?', 'is', 'room', 'service', 'served', '24', 'hours', '?', 'how', 'do', 'i', 'call', 'for', 'room', 'service', '?', 'are', 'there', 'any', 'restaurants', 'in', 'the', 'hotel', '?', 'how', 'many', 'floors', 'are', 'in', 'the', 'hotel', '?', 'what', 'is', 'the', 'price', 'of', 'one', 'night', 'stay', 'at', 'the', 'mid-level', 'suite', '?', 'what', 'is', 'the', 'price', 'of', 'one', 'night', 'stay', 'at', 'the', 'premium', 'suite', '?', 'do', 'you', 'have', 'tuxedo

In [10]:
lemmer = nltk.stem.WordNetLemmatizer() #Initiate lemmer class. WordNet is a semantically-oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

In [11]:
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict))) 
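
The punctuation table built above maps every punctuation code point to `None`, so `str.translate` deletes those characters. The sketch below shows that step in isolation. It uses a plain whitespace split as a stand-in for `nltk.word_tokenize` and skips lemmatization, so the helper `normalize` is a simplified assumption, not a replacement for `LemNormalize`.

```python
import string

# Same idea as remove_punct_dict: map each punctuation code point to None
remove_punct = {ord(ch): None for ch in string.punctuation}

def normalize(text):
    # Lowercase, delete punctuation, then split on whitespace
    # (stand-in for nltk.word_tokenize followed by the lemmatizer)
    return text.lower().translate(remove_punct).split()

tokens = normalize("What's the price of the mid-level suite?")
print(tokens)
```

Note that deleting punctuation also fuses hyphenated words and contractions: "mid-level" becomes "midlevel" and "what's" becomes "whats". That is harmless as long as the questions and the user input pass through the same normalization.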

## 2. Adding Chatbot functionality - Cosine similarity

In [12]:
GREETING_INPUTS = ["hello", "hi", "greetings", "sup", "what's up","hey", "hey there"]
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting(sentence):
    for word in sentence.split(): # Looks at each word in your sentence
        if word.lower() in GREETING_INPUTS: # checks if the word matches a GREETING_INPUT
            return random.choice(GREETING_RESPONSES) # replies with a GREETING_RESPONSE
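
Because `greeting` compares whitespace-separated words, an input such as "hello!" does not match (the word is "hello!" with the exclamation mark attached), and the two-word entries "what's up" and "hey there" can never match a single word. The following is a hypothetical variant, named `greeting_relaxed` here purely for illustration, that strips surrounding punctuation and also checks the whole sentence.

```python
import random
import string

GREETING_INPUTS = ["hello", "hi", "greetings", "sup", "what's up", "hey", "hey there"]
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting_relaxed(sentence):
    """Hypothetical variant of greeting() that tolerates punctuation and multi-word greetings."""
    # Remove punctuation and spaces from both ends of the sentence
    cleaned = sentence.lower().strip(string.punctuation + " ")
    # Remove punctuation from both ends of each word, keeping inner apostrophes
    words = [w.strip(string.punctuation) for w in cleaned.split()]
    if cleaned in GREETING_INPUTS or any(w in GREETING_INPUTS for w in words):
        return random.choice(GREETING_RESPONSES)
    return None

print(greeting_relaxed("Hello!"))
print(greeting_relaxed("Is parking available?"))
```

The second call prints `None`, since none of its words is a greeting.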

The chatbot works by running a loop that reads user input and replies to it. Take a look at the function below. Each line in the function is important, as it calls another function to perform one step of the process. The function `response` is responsible for how the chatbot behaves. Having one master function for each piece of functionality is recommended as good programming practice.


In [13]:
def response(user_response):

    robo_response = '' # initialize a variable to contain string
    sent_tokens.append(user_response) # add user response to sent_tokens (the caller removes it afterwards)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens) # one tf-idf row per sentence; the last row is the user response
    vals = cosine_similarity(tfidf[-1], tfidf) # similarity of the user response to every sentence, itself included
    idx = vals.argsort()[0][-2] # index of the second-highest score; the highest is the user response matched with itself
    flat = vals.flatten()
    flat.sort() # sort in ascending order
    req_tfidf = flat[-2] # the second-highest similarity value

    if req_tfidf == 0: # no word overlap with any question in the bank
        robo_response = robo_response + "I am sorry! I don't understand you"
    else:
        robo_response = robo_response + res[sent_tokens[idx]]
    return robo_response, vals
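
The `[-2]` indexing in `response` works because the user's sentence is appended to the list before vectorizing, so its similarity with itself always occupies the top spot. The sketch below makes that exclusion explicit on a plain Python list. The function name `best_match` and the scores are illustrative assumptions, not values produced by the notebook.

```python
def best_match(similarities):
    """similarities[i] is the query's similarity to sentence i; the last entry is the query itself."""
    candidates = similarities[:-1]  # drop the query-vs-query score
    if not candidates:
        return None, 0.0
    # Index of the highest remaining score
    idx = max(range(len(candidates)), key=candidates.__getitem__)
    return idx, candidates[idx]

# Illustrative scores; the final 1.0 is the query compared with itself
vals = [0.0, 0.62, 0.15, 1.0]
idx, score = best_match(vals)

if score == 0:
    print("I am sorry! I don't understand you")
else:
    print("best question index:", idx, "similarity:", score)
```

Dropping the last entry first, rather than taking the second-largest value, keeps the intent visible when reading the code.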

Finally, let us create the chatbot interface and create a persona around it. Let us call it 'Jane' and use cosine similarity to find FAQs similar to the question asked and answer them. 

In [14]:
flag = True
print("Jane: My name is Jane. I will answer your queries about this hotel. If you want to exit, type Bye!")
while flag:
    user_response = input().lower()
    if user_response == 'bye':
        flag = False
        print("Jane: Bye! take care..")
    elif user_response in ('thanks', 'thank you'):
        flag = False
        print("Jane: You are welcome..")
    else:
        greet = greeting(user_response)  # call once and reuse the result
        if greet is not None:
            print("Jane: " + greet)
        else:
            print("Jane: ", end="")
            answer, vals = response(user_response)
            print(answer)
            sent_tokens.remove(user_response)  # undo the append done inside response()
            scores = vals.tolist()[0]
            scores.pop()  # drop the user response's similarity with itself
            print(' (With similarity of ', max(scores), ')')

Jane: My name is Jane. I will answer your queries about this hotel. If you want to exit, type Bye!
bye
Jane: Bye! take care..


## 3. Chatbot functionality using Doc2vec

We will cover one more type of model for creating chatbots. As we saw, the cosine similarity model compares TF-IDF vectors to measure how similar two sentences are. Can we use a neural network to solve this problem instead? Let's take a look.

Doc2Vec is a neural network based model that essentially creates vectors out of documents. In order to understand Doc2vec you also need to understand word2vec. 


#### What is word2vec?
It is a model for creating word embeddings: it takes a large corpus of text as input and produces a vector space, typically of several hundred dimensions. It was introduced in two papers published in 2013 by a team of researchers at Google. The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model.

For instance: “Bank”, “money” and “accounts” are often used in similar situations, with similar surrounding words like “dollar”, “loan” or “credit”, and according to Word2Vec they will therefore share a similar vector representation. 

<img src="https://lrccd.instructure.com/files/31013933/download?download_frd=1"/>

#### What is Doc2vec?

The objective of Doc2vec is to create a numerical representation of sentences, paragraphs or documents. Unlike word2vec, which computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus. The vectors generated by Doc2vec can be used for tasks such as finding the similarity between sentences, paragraphs or documents.

<strong> We will be using this property of doc2vec to create our own similarity model.</strong>


Let us begin by importing the relevant libraries.

In [15]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

The first step is to tag our data, which the doc2vec model requires in order to use it effectively. This [starter code](https://www.kaggle.com/fmitchell259/creating-a-doc2vec-model) for doc2vec is a good learning resource, and this [introduction](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e) is also useful; reading both is recommended.

In [16]:
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(sent_tokens)]
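
To see what tagging produces without loading gensim, the sketch below uses a stand-in named tuple with the same two fields, `words` and `tags`, as `TaggedDocument`. The name `Tagged`, the whitespace split (in place of `word_tokenize`) and the two sample questions are assumptions for illustration only.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (used only to avoid the gensim import)
Tagged = namedtuple("Tagged", ["words", "tags"])

questions = ["how old is this establishment?", "what time is breakfast served?"]

# Each sentence becomes a list of words plus a tag holding its position as a string
tagged = [Tagged(words=q.split(), tags=[str(i)]) for i, q in enumerate(questions)]

# Because the tag is the position, it can be converted back into an index later
lookup = {doc.tags[0]: questions[int(doc.tags[0])] for doc in tagged}
print(tagged[1])
print(lookup["0"])
```

The tag is what the trained model returns when it is asked for similar documents, so keeping it equal to the list position makes it easy to recover the original question.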

In [17]:
max_epochs = 100
vec_size = 20 # Dimensionality of each document vector. A larger vector can capture more distinctions between documents.
alpha = 0.025


The next step is to train the model. As before we will be using a `model.train` function to run the training process. Take a look at the [documentation](https://radimrehurek.com/gensim/models/doc2vec.html) for more information about how to train the Doc2vec model. 

In [20]:

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)
  
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data, # tagged data to be used here
                total_examples=model.corpus_count,
                epochs=model.epochs)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")



iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
iteration 12
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
iteration 18
iteration 19
iteration 20
iteration 21
iteration 22
iteration 23
iteration 24
iteration 25
iteration 26
iteration 27
iteration 28
iteration 29
iteration 30
iteration 31
iteration 32
iteration 33
iteration 34
iteration 35
iteration 36
iteration 37
iteration 38
iteration 39
iteration 40
iteration 41
iteration 42
iteration 43
iteration 44
iteration 45
iteration 46
iteration 47
iteration 48
iteration 49
iteration 50
iteration 51
iteration 52
iteration 53
iteration 54
iteration 55
iteration 56
iteration 57
iteration 58
iteration 59
iteration 60
iteration 61
iteration 62
iteration 63
iteration 64
iteration 65
iteration 66
iteration 67
iteration 68
iteration 69
iteration 70
iteration 71
iteration 72
iteration 73
iteration 74
iteration 75
iteration 76
iteration
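
It can help to work out what the loop above actually does to the learning rate and to the amount of training. The arithmetic below reproduces the schedule in plain Python. The value `epochs_per_call = 10` is an assumption standing in for `model.epochs`, whose default depends on the installed gensim version, so check it before relying on the totals.

```python
max_epochs = 100
start_alpha = 0.025
step = 0.0002
epochs_per_call = 10  # assumption: the value of model.epochs; check your gensim version

# Learning rate used by each of the 100 calls to model.train
alphas = [start_alpha - step * i for i in range(max_epochs)]

# Every call to model.train makes epochs_per_call passes over the data
total_passes = max_epochs * epochs_per_call

print(f"first alpha {alphas[0]:.4f}, last alpha {alphas[-1]:.4f}, total passes {total_passes}")
```

Under that assumption the model makes 1000 passes over a few dozen sentences while the learning rate falls from 0.025 to about 0.005. The gensim documentation describes calling `train` once, with `epochs` set to the desired number of passes, as the common and recommended case, and lets the library handle the learning-rate decay between `alpha` and `min_alpha`.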

### Evaluating doc2vec model

In [21]:
from gensim.models.doc2vec import Doc2Vec
model= Doc2Vec.load("d2v.model")

In [22]:
test_data = word_tokenize("How much is the price?".lower())

We can use the `model.infer_vector` function to infer the vector associated with a new document. We can then use a function called `most_similar` to find the stored document vectors that are most similar to the one we inferred. What are the results?

In [23]:
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

V1_infer [ 0.02356609  0.02570973 -0.00366953  0.02741713 -0.02513227  0.02635313
  0.00283291  0.05130521 -0.03079578 -0.01109047  0.01240083 -0.01940247
 -0.02085616  0.01730078  0.00045929 -0.01124656  0.05331567 -0.02663193
 -0.03785448 -0.01796117]


In [24]:
similar_doc = model.docvecs.most_similar(positive = [v1], topn = 4) # returns the topn (tag, similarity) pairs closest to v1, most similar first; gensim 4+ names this attribute model.dv
print(similar_doc)

[('13', 0.6659650802612305), ('41', 0.6498533487319946), ('29', 0.6370059847831726), ('33', 0.6362162232398987)]
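
Each pair in this output is a document tag followed by its cosine similarity to the inferred vector. The ranking itself is simple enough to sketch in plain Python. The function names and the two-dimensional vectors below are made up for illustration; the real model stores one 20-dimensional vector per tagged question.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_similarity(query_vec, doc_vecs, topn=2):
    """Return (tag, similarity) pairs, highest similarity first."""
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-d document vectors keyed by tag
doc_vecs = {"0": [1.0, 0.0], "1": [0.6, 0.8], "2": [0.0, 1.0]}

ranked = rank_by_similarity([1.0, 0.1], doc_vecs)
print(ranked)
```

The query vector points almost along the first axis, so tag "0" ranks first and tag "1" second.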




In [25]:
num,_ = similar_doc[0]
num = int(num)
tagged_data[num]

TaggedDocument(words=['what', 'is', 'the', 'price', 'of', 'one', 'night', 'stay', 'at', 'the', 'premium', 'suite', '?'], tags=['13'])

The output of the previous code block shows that this model is not as effective as one might have expected. Can you guess why the model fell short where we expected it to succeed?

Click on this [link](https://stackoverflow.com/questions/58206571/doc2vec-find-the-similar-sentence) to learn more about this. The gist of the issue is this:

> Doc2Vec isn't going to give good results on toy-sized datasets, so you shouldn't expect anything meaningful until using much more data.

**After observing the performance** of both models, we can draw some clear conclusions:

1. The Doc2vec model requires a lot more data to learn the relationships between words. Even with a pre-trained model, the quality of its responses is clearly lower than that of the cosine similarity model.
2. The cosine similarity model works better for a smaller and more well-defined dataset. This means that a few simple questions can be answered effectively, but complex questions that require context won't be answered by it.