### Importing the Required Packages

In [1]:
import nltk
import re
import random
import time
import numpy as np
from nltk.tokenize import RegexpTokenizer, word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize.regexp import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer 
import gensim
from nltk.data import find
from scipy import spatial
import mysql.connector as mc

In [2]:
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
google_news_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

### Defining Intents
The following list defines a few intents, each paired with a list of keywords and phrases that identify it.

In [3]:
intents = [
    ["greetings", 
    ["Hi", "Hello", "Greetings", "Dear", "Respected", "Welcome"]],
    
    ["greetings_questions", 
    ["How are you", "How goes it", "What's up", "What's happening"]],
    
    ["thanking", 
    ["Thanks", "Thank you", "I am indebted to you", "I appreciate it", "I am grateful", "This is great", "My sincere thanks", "You've been very helpful"]],

    ["apology", 
    ["Oops sorry", "Sorry about that", "I'm sorry", "sorry", "Sorry", "My bad", "Apologies", "I apologize"]],

    ["farewell", 
    ["Bye", "Farewell", "See you later", "Talk to you later", "Goodbye", "Take care", "Nice to meet you", "Nice to talk to you"]]
    
]

### Defining Responses for each Intent
The following code block defines a list of responses for each intent. Once the intent has been identified from the user's query, a random response is returned from the list for that intent.

In [4]:
intent_responses = {
    "greetings": ["hi, how may I help you today?", 
                  "hello, what can I do for you today?",  
                  "it’s nice to meet you, how may we be of service?"],
    
    "greetings_questions": ["I'm great. How can I help?"],
    
    "thanking": ["don't mention it."],
    
    "apology": ["hey, don't worry about it.", 
                "it's ok. no need to apologize"],
    
    "farewell": ["it was a pleasure to help you. do come back. type 'quit' to exit the program.",
                 "see you later. let me know if you have any other queries. type 'quit' to exit the program.",
                 "i hope the interaction was helpful. type 'quit' to exit the program.",
                 "thank you for your time. type 'quit' to exit the program."]
}

In [5]:
default_responses = [
    "i am sorry. i am not sure about what you wanted to say.",
    "i am sorry. it is becoming a bit hard to follow along."
]

### Tokenizing Input Text
The **WhitespaceTokenizer** in the **nltk** module splits text on whitespace only. It does not separate punctuation or contractions from words, so a token such as `you?` keeps its question mark; trailing punctuation is removed in the next step.

In [6]:
def tokenize_input(user_response):
    wst = WhitespaceTokenizer()
    return wst.tokenize(user_response)
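
To illustrate what whitespace tokenization produces, the sketch below uses Python's built-in `str.split()` as a stand-in for `WhitespaceTokenizer`. Treating the two as equivalent for ordinary space-separated text is an assumption of this example, and the sample sentence and the name `whitespace_tokens` are made up. Note that punctuation stays attached to the tokens.

```python
# Stand-in for WhitespaceTokenizer().tokenize(): split on runs of whitespace.
def whitespace_tokens(text):
    return text.split()

sample = "hello,  how are you?"
tokens = whitespace_tokens(sample)
print(tokens)  # ['hello,', 'how', 'are', 'you?']
```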

### Remove Punctuation
Trailing punctuation at the end of each token, namely **full stop (.)**, **exclamation mark (!)**, **question mark (?)** and **comma (,)**, is removed. Simple pattern matching with Python's regular expressions (**re**) module is sufficient for this. Only one trailing mark is removed per token.

In [7]:
def remove_punct(user_tokens):
    punct_re = r"(.*)[?,.!]$"
    for i, word in enumerate(user_tokens):
        if re.match(punct_re, word):
            user_tokens[i] = word[:-1]
    return user_tokens
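
As a quick check of the pattern, the sketch below applies the same regular expression to a few made-up tokens. The helper name `strip_trailing_punct` is hypothetical, and returning a new list instead of modifying the input is a choice made for this example only.

```python
import re

PUNCT_RE = re.compile(r"(.*)[?,.!]$")

def strip_trailing_punct(tokens):
    # Drop a single trailing ?, comma, full stop or ! from each token.
    return [t[:-1] if PUNCT_RE.match(t) else t for t in tokens]

cleaned = strip_trailing_punct(["hello,", "how", "you?", "wow!!"])
print(cleaned)  # ['hello', 'how', 'you', 'wow!']
```

The last token shows that a doubled mark such as `!!` loses only its final character.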

### Removing Stopwords
After punctuation is removed from each token, common tokens that carry little information about the intent of the statement can be removed as well. Such tokens are known as stopwords. The English stopword list is available through the **corpus** module of the **nltk** library.

In [8]:
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    new_tokens = list()
    for w in tokens:
        if w not in stop_words: new_tokens.append(w)
    return new_tokens
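
The filtering step can be sketched without downloading the NLTK corpus by using a tiny hand-written stopword set. The set `TOY_STOPWORDS` and the helper `drop_stopwords` are hypothetical and exist only for this illustration; the notebook itself uses `stopwords.words('english')`.

```python
# Hypothetical mini stopword list, used here instead of the NLTK corpus.
TOY_STOPWORDS = {"i", "am", "to", "you", "the", "a"}

def drop_stopwords(tokens, stop_words=TOY_STOPWORDS):
    # Keep only tokens that are not in the stopword set (case-sensitive).
    return [t for t in tokens if t not in stop_words]

kept = drop_stopwords(["i", "am", "grateful", "to", "you"])
print(kept)  # ['grateful']
```

The comparison is case-sensitive, so the input should be lower-cased first, as the main loop of this notebook does.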

### Lemmatize Tokens
Once the unwanted punctuation is removed from the tokenized text, the **WordNetLemmatizer** is used to lemmatize each token, that is, to reduce it to its dictionary form (lemma). This step matters because different inflected forms of a word are mapped to a single form, which makes the text easier to analyse. By default, the lemmatizer treats every token as a noun.

In [9]:
def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]

### Build Word Vector

Each word is looked up in the pruned set of word vectors from the word2vec model. If the word, or one of its case variants (lower case or capitalized), is present, its vector is returned.

In [10]:
def build_token_vec(token):
    # Try the token as given, then its lower-case and capitalized variants.
    for variant in (token, token.lower(), token.capitalize()):
        if variant in google_news_model:
            return google_news_model[variant]

    # Out of vocabulary: return a zero vector so that phrase sums stay well-defined.
    return np.zeros(google_news_model.vector_size)
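
The case-variant lookup can be illustrated with a plain dictionary in place of the word2vec model. The vocabulary `TOY_VOCAB`, its two-dimensional vectors and the helper name are all hypothetical; this is a minimal sketch of the lookup order, not the model's own API.

```python
# Hypothetical two-word vocabulary standing in for the word2vec model.
TOY_VOCAB = {"Hello": [1.0, 0.0], "thanks": [0.0, 1.0]}

def lookup_with_case_variants(token, vocab=TOY_VOCAB):
    # Try the token as typed, then lower-cased, then capitalized.
    for variant in (token, token.lower(), token.capitalize()):
        if variant in vocab:
            return vocab[variant]
    return None  # out of vocabulary

print(lookup_with_case_variants("hello"))   # [1.0, 0.0]
print(lookup_with_case_variants("Thanks"))  # [0.0, 1.0]
print(lookup_with_case_variants("xyzzy"))   # None
```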

### Build Phrase Vector

The word vectors are added together to build a phrase vector. In this way, the combined effect of a set of words is captured using simple vector arithmetic.

In [11]:
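def build_phrase_vec(phrase):
    # Build one vector per token and skip tokens for which no vector was found.
    token_vec_list = [build_token_vec(token) for token in phrase.split()]
    token_vec_list = [vec for vec in token_vec_list if len(vec) > 0]

    # An empty phrase (or one with no known tokens) maps to the zero vector.
    if not token_vec_list:
        return np.zeros(google_news_model.vector_size)

    # Element-wise sum of the token vectors.
    return np.sum(token_vec_list, axis=0)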
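
The addition of word vectors can be worked through by hand with toy three-dimensional embeddings. The values in `TOY_VECS` and the helper `phrase_vector` are made up for this sketch; the real word2vec vectors are far longer, and treating unknown tokens as zero vectors is an assumption of the example.

```python
# Hypothetical 3-dimensional embeddings, chosen only for easy arithmetic.
TOY_VECS = {"thank": [1.0, 2.0, 0.0], "you": [0.5, 0.0, 1.0]}

def phrase_vector(phrase, vecs=TOY_VECS, dim=3):
    total = [0.0] * dim
    for token in phrase.split():
        # Unknown tokens contribute a zero vector.
        for i, value in enumerate(vecs.get(token, [0.0] * dim)):
            total[i] += value
    return total

print(phrase_vector("thank you"))  # [1.5, 2.0, 1.0]
```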

### Build Intents Vec

In this step, the words or phrases representing each intent are vectorized and stored. Later, the input word or phrase is vectorized and compared with every representative phrase of each intent using Euclidean distance.

In [12]:
def build_intents_vec(intents):
    
    intent_vec_dict = dict()
    
    for intent in intents:
        intent_name, intent_key_words = intent
        intent_vec_dict.update({intent_name: []})
        
        for phrase in intent_key_words:
            phrase_vec = build_phrase_vec(phrase)
            intent_vec_dict[intent_name].append([phrase, phrase_vec])
    return intent_vec_dict 

### Euclidean Distance

In [13]:
def get_euclid_dist(key_vec, phrase_vec):
    dist_list = list()
    for key in key_vec:
        dist_list.append(np.linalg.norm(key[1]-phrase_vec))
    
    return min(dist_list)
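
The Euclidean distance between two vectors u and v is the square root of the sum of squared component differences. The sketch below computes the smallest such distance between a query and a set of key vectors using `math.dist` from the standard library (Python 3.8 or later). The helper name `nearest_distance` and the two-dimensional sample vectors are hypothetical.

```python
import math

def nearest_distance(key_vecs, query_vec):
    # Smallest Euclidean distance between the query and any key vector.
    return min(math.dist(vec, query_vec) for vec in key_vecs)

keys = [[0.0, 0.0], [3.0, 4.0]]
print(nearest_distance(keys, [3.0, 0.0]))  # 3.0
```

The query is 3 units from the first key and 4 units from the second, so the minimum is 3.0.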

### Matching Intent
In this section, the preprocessed input text is matched against each intent defined in the **intents** list. For each intent, the intent name and the smallest **Euclidean distance** between the input and that intent's key phrases are stored in a list.

In [14]:
def match_intents(lemma_tokens, intents):
    intent_vec_dict = build_intents_vec(intents)

    phrase = " "
    phrase = phrase.join(lemma_tokens)
    phrase_vec = build_phrase_vec(phrase)
    
    intents_matched = list()
    for intent, key_vec in intent_vec_dict.items():
        intents_matched.append([intent, get_euclid_dist(key_vec, phrase_vec)])
    
    return intents_matched

### Finding the most suited Intent
The intent with the minimum **Euclidean distance** to the user's input is extracted in the following code block.

In [15]:
def min_dist_intent(intents_matched):
    min_dist = float('inf')
    user_intent = list()
    for i, intent in enumerate(intents_matched):
        if intent[1]<min_dist:
            min_dist = intent[1]
            user_intent = intents_matched[i]
    return user_intent
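
The same selection can be written more compactly with the built-in `min` and a key function. This is an alternative sketch, not the notebook's own code; the name `closest_intent` and the sample distances are made up.

```python
def closest_intent(intents_matched):
    # Each entry is [intent_name, distance]; pick the entry with the smallest distance.
    # An empty input yields an empty list, mirroring the loop-based version.
    return min(intents_matched, key=lambda pair: pair[1], default=[])

scores = [["greetings", 2.4], ["thanking", 0.9], ["farewell", 3.1]]
print(closest_intent(scores))  # ['thanking', 0.9]
print(closest_intent([]))      # []
```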

### Finding an appropriate response
Once the most appropriate intent is identified, the **intent_responses** dictionary is used to retrieve the list of corresponding responses. A random response from this list is returned to the user. If no intent was matched, a random entry from **default_responses** is used instead.

In [16]:
def responses(user_intent):
    if user_intent==list():
        print(random.choice(default_responses))
        return
    
    print(random.choice(intent_responses[user_intent[0]]))

### Defining the logic that generates the bot's response
In this function, the user's input is first preprocessed as follows:
- The input is tokenized. (**tokenize_input()**)
- Punctuation is removed from the tokenized input. (**remove_punct()**) Stopword removal (**remove_stopwords()**) is available but commented out in the code below.
- The resulting tokens are lemmatized to produce a list of keywords for intent matching. (**lemmatize_tokens()**)

After preprocessing, the keywords are matched to an intent as follows:
- The Euclidean distance between the input text and each intent is calculated. (**match_intents()**)
- The intent with the minimum Euclidean distance is returned as the user's intent. (**min_dist_intent()**)

After the intent is identified, a response is randomly selected from the list stored in the **intent_responses** dictionary. (**responses()**)

In [17]:
def bot_response(user_response, intents):
    # Input Text Preprocessing 
    tokens = tokenize_input(user_response)
    new_tokens = remove_punct(tokens)
    # new_tokens = remove_stopwords(new_tokens)
    lemma_tokens = lemmatize_tokens(new_tokens)
    
    # Intent Matching
    intents_matched = match_intents(lemma_tokens, intents)
    # print(intents_matched)
    user_intent = min_dist_intent(intents_matched)
    # print(user_intent)
    
    # Generating an appropriate response for the intent matched
    responses(user_intent)

### Main Function
The **while loop** prompts the user to enter their text until they type **quit**.

In [18]:
print("Type 'quit' to exit the conversation")
user_response = str()
while(user_response!="quit"):
    print()
    time.sleep(0.2)
    user_response = input("YOU: ")
    user_response = user_response.lower()
    if(user_response!='quit'):
        # print("BOT: "+bot_response(user_response, intents))
        print("BOT: ", end='')
        bot_response(user_response, intents)
        print()
    else:
        time.sleep(0.2)
        print("BOT: Bye! take care..")

Type 'quit' to exit the conversation

YOU: hi
BOT: it’s nice to meet you, how may we be of service?


YOU: thank you for your time
BOT: don't mention it.


YOU: bye
BOT: see you later. let me know if you have any other queries. type 'quit' to exit the program.


YOU: quit
BOT: Bye! take care..
