<a href="https://colab.research.google.com/github/vt-ai-ml/fall2019-meetings/blob/master/Chatbot_VSM_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a chatbot using a VSM

Reference: https://medium.com/analytics-vidhya/building-a-simple-chatbot-in-python-using-nltk-7c8c8215ac6e

In [0]:
# download data
import requests
url = 'https://raw.githubusercontent.com/vt-ai-ml/fall2019-meetings/master/data/chatbot_corpus.txt'
data = requests.get(url).text

In [0]:
# Split the corpus into lines and drop the empty string left after the final newline
splits = data.split('\n')
splits = splits[:-1]

In [0]:
dictionary = {}
questions_list = []

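# Lines alternate question, response: map each question to its list of responses.
# Stopping at len(splits) - 1 avoids an IndexError if the last question has no response.
for i in range(0, len(splits) - 1, 2):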
    question = splits[i].replace('-','').strip()
    response = splits[i+1].replace('-','').strip()
    
    if question in dictionary:
        dictionary[question].append(response)
    else:
        dictionary[question] = [response]
        questions_list.append(question)
        
print(questions_list[-5:])
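
The pairing logic above can be tried on a small list without downloading anything. This is a minimal sketch: `sample_lines` and `build_pairs` are hypothetical names, and the sample text is made up to mimic the alternating question/response layout that the loop assumes, not taken from the real corpus.

```python
# Hypothetical sample in the alternating question/response layout assumed above
sample_lines = [
    "- How are you?",
    "- I am fine.",
    "- How are you?",
    "- Doing well, thanks.",
    "- What is your name?",
    "- I am a chatbot.",
]

def build_pairs(lines):
    """Group alternating question/response lines into {question: [responses]}."""
    pairs = {}
    # Pair even-indexed lines with odd-indexed lines; an unpaired trailing line is ignored
    for raw_question, raw_response in zip(lines[0::2], lines[1::2]):
        question = raw_question.replace('-', '').strip()
        response = raw_response.replace('-', '').strip()
        # A repeated question collects all of its responses in one list
        pairs.setdefault(question, []).append(response)
    return pairs

toy_dictionary = build_pairs(sample_lines)
print(toy_dictionary)
```

A question that appears twice ends up with two candidate responses, which is what lets the bot pick one at random later.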

In [0]:
import string
import nltk

nltk.download('punkt')
nltk.download('wordnet')

# WordNet is a semantically oriented dictionary of English included in NLTK.
lemmer = nltk.stem.WordNetLemmatizer()

# Translation table that deletes every punctuation character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def normal_lemma_tokens(text):
    """Lowercase the text, strip punctuation, tokenize, and lemmatize each token."""
    tokens = nltk.word_tokenize(text.lower().translate(remove_punct_dict))
    return [lemmer.lemmatize(token) for token in tokens]
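
The punctuation-stripping step can be checked on its own, without NLTK. The sketch below rebuilds the same kind of translation table under a hypothetical name, `punct_table`, and applies it to a made-up sentence. Tokenizing and lemmatizing would follow this step in `normal_lemma_tokens`.

```python
import string

# Same construction as remove_punct_dict: map each punctuation code point to None,
# which tells str.translate to delete that character
punct_table = {ord(ch): None for ch in string.punctuation}

cleaned = "Hello, World! How's it going?".lower().translate(punct_table)
print(cleaned)  # hello world hows it going
```

Note that the apostrophe is deleted too, so "how's" becomes "hows" rather than two tokens.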

In [0]:
import random

def select_random_response(question):
    return random.choice(dictionary[question])

### Term Frequency-Inverse Document Frequency (TF-IDF)
>**Term Frequency**: scores how often a word appears in the current document.

$$ TF(w,d) = \frac{\text{# of times } w \text{ appears in } d}{\text{total # of words in } d} $$
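
As a quick check of the TF formula, here is a minimal pure-Python sketch. The function name `term_frequency` and the whitespace tokenization are assumptions made for illustration; the chatbot itself tokenizes with NLTK.

```python
from collections import Counter

def term_frequency(word, document_tokens):
    """TF(w, d): occurrences of `word` divided by the number of tokens in the document."""
    if not document_tokens:
        return 0.0  # assumption: an empty document gives every word a score of 0
    return Counter(document_tokens)[word] / len(document_tokens)

tokens = "what is a chat robot".split()
print(term_frequency("chat", tokens))  # 0.2 (1 occurrence out of 5 tokens)
```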

>**Inverse Document Frequency**: scores how rare the word is across all documents.
>+ A word with a *low* IDF score is a *common* word.
>+ A word with a *high* IDF score is a *rare* word.

$$ IDF(w,D) = \log\left(\frac{\text{total # of documents in } D}{\text{# of documents containing } w}\right) $$
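
The IDF formula, and the usual way of combining it with TF (multiplying the two scores), can be sketched in plain Python. The names `inverse_document_frequency` and `toy_docs` are hypothetical, and the zero-count fallback is an assumption. scikit-learn's `TfidfVectorizer`, used later in this notebook, defaults to a smoothed IDF and L2-normalized rows, so its numbers will not match this textbook version exactly.

```python
import math

def inverse_document_frequency(word, documents):
    """IDF(w, D): log of (number of documents / number of documents containing `word`)."""
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        return 0.0  # assumption: a word seen in no document carries no weight
    return math.log(len(documents) / containing)

# Made-up, pre-tokenized documents
toy_docs = [
    ["tell", "me", "a", "joke"],
    ["what", "is", "a", "chat", "robot"],
    ["what", "can", "you", "eat"],
]

idf_a = inverse_document_frequency("a", toy_docs)        # "a" is in 2 of 3 documents
idf_joke = inverse_document_frequency("joke", toy_docs)  # "joke" is in 1 of 3 documents

# TF-IDF weight of "joke" in the first document: TF times IDF
tf_joke = toy_docs[0].count("joke") / len(toy_docs[0])
tfidf_joke = tf_joke * idf_joke

print(idf_a, idf_joke, tfidf_joke)
```

The rarer word "joke" gets a higher IDF than the common word "a", so it contributes more to the document's vector.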



### Word Similarity 
>**Cosine similarity**: measures the similarity between two vectors as the cosine of the angle between them. In our case, the vectors are the TF-IDF representations of two questions.

$$ \cos(u,v) = \frac{u \cdot v}{\left\lVert u \right\rVert \left\lVert v \right\rVert} $$


<img src="https://github.com/vt-ai-ml/fall2019-meetings/raw/master/data/chatbot_cosine_image.png" align="left" style="width:250px;height:250px;">
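
The cosine formula can be written out directly for small dense vectors. This is only an illustrative sketch: the helper name `cosine` and the zero-vector fallback are assumptions, and the chatbot itself calls scikit-learn's `cosine_similarity` on sparse TF-IDF rows.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1, 0, 1], [2, 0, 2]))  # same direction: close to 1
print(cosine([1, 0, 0], [0, 1, 0]))  # no shared dimension: 0.0
```

Because TF-IDF weights are never negative, the similarity between two questions falls between 0 (no shared terms) and 1 (same direction).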

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def respond(user_question):
    # Map the known questions plus the user's question (last row) into TF-IDF vector space,
    # where each word is a dimension. Building a new list leaves questions_list untouched.
    # Relies on the global tf_idf_vector_space created in the next cell.
    tf_idf_matrix = tf_idf_vector_space.fit_transform(questions_list + [user_question])

    # Calculate the cosine similarity of the user's question with every known question
    user_question_tf_idf = tf_idf_matrix[-1]
    similarity_score = cosine_similarity(user_question_tf_idf, tf_idf_matrix[:-1])

    # Find the known question most similar to the user's question
    highest_tf_idf_idx = similarity_score.argmax()
    highest_tf_idf = similarity_score.max()

    # Find an appropriate response
    if highest_tf_idf == 0:  # no similarity
        robo_response = "I don't understand what you're saying."
    else:
        robo_response = select_random_response(questions_list[highest_tf_idf_idx])

    return robo_response

In [0]:
tf_idf_vector_space = TfidfVectorizer(tokenizer=normal_lemma_tokens, stop_words='english')

print("Chatbot is on! Type 'exit' to turn off the chatbot.")
while True:
    user_question = input().lower().translate(remove_punct_dict)

    if user_question == 'exit':
        print('Chatbot is now off!')
        break
    else:
        print("BOT: ", respond(user_question), '\n')

### Questions to ask your chatbot
* How do you do?
* Tell me a joke
* What is a chat robot?
* What can you eat?