## Creating A Chatbot

The idea behind this chatbot was to take the given data and build a model from it that could be usefully applied at a CRM firm.

The given data has many questions, each with a unique question ID and about 3-5 options. Every question has one or more answers labelled as correct.

My intuition was that a CRM firm could collect similar content from customer complaint or assistance tickets. We can then feed that data to an NLP algorithm which, when asked a similar question again, returns the old question along with its correct answer.

Such a feature could be integrated with an Answer Bot service provided by CRM firms and suggest web pages and links that help solve customer queries.

I hope you like it!

In [1]:
from nltk.chat.util import Chat, reflections
import pandas as pd
import csv
import nltk
import numpy as np
import random
import string

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Getting data from the TSV file

In [3]:
columns = ['QuestionID', 'Question', 'DocumentID', 'DocumentTitle', 'SentenceID', 'Sentence', 'Label']

with open('challenge.tsv') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    # Collect the first seven fields of every row in one pass;
    # DataFrame.append was removed in pandas 2.0 and was slow inside a loop
    rows = [row[:7] for row in reader]

# All values stay strings; the file's header line is still row 0 and is dropped in the next cell
df = pd.DataFrame(rows, columns=columns)

In [4]:
df = df[1:]

In [5]:
df.tail(15)

| | QuestionID | Question | DocumentID | DocumentTitle | SentenceID | Sentence | Label |
|---|---|---|---|---|---|---|---|
| 20333 | Q3042 | When was Apple Computer founded | D2806 | Apple Inc. | D2806-10 | , the company had 72,800 permanent full-time e... | 0 |
| 20334 | Q3042 | When was Apple Computer founded | D2806 | Apple Inc. | D2806-11 | Its worldwide annual revenue in 2012 totalled ... | 0 |
| 20335 | Q3043 | what is section eight housing | D2807 | Section 8 (housing) | D2807-0 | Section 8 housing in the South Bronx | 0 |
| 20336 | Q3043 | what is section eight housing | D2807 | Section 8 (housing) | D2807-1 | Section 8 of the Housing Act of 1937 (), often... | 1 |
| 20337 | Q3043 | what is section eight housing | D2807 | Section 8 (housing) | D2807-2 | It operates through several programs, the larg... | 1 |
| 20338 | Q3043 | what is section eight housing | D2807 | Section 8 (housing) | D2807-3 | The US Department of Housing and Urban Develop... | 0 |
| 20339 | Q3043 | what is section eight housing | D2807 | Section 8 (housing) | D2807-4 | The Housing Choice Voucher Program provides "t... | 0 |
| 20340 | Q3043 | what is section eight housing | D2807 | Section 8 (housing) | D2807-5 | It also allows individuals to apply their mont... | 0 |
| 20341 | Q3043 | what is section eight housing | D2807 | Section 8 (housing) | D2807-6 | The maximum allowed voucher is $2200 a month. | 0 |
| 20342 | Q3043 | what is section eight housing | D2807 | Section 8 (housing) | D2807-7 | Section 8 also authorizes a variety of "projec... | 0 |


##### It looks like the data has many questions, each with a unique question ID and multiple options. Every question has one or more answers labelled as correct.

In [6]:
#### Tokenizing and lemmatizing the data

In [7]:
data_dict = df[df['Label'] == '1'][['Question', 'Sentence']].set_index(['Question']).to_dict()
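
One caveat with the cell above: a dictionary keyed on the question text can hold only one sentence per question, so when a question has several sentences labelled `1` (Q3043 in the preview has two), only one of them survives. A minimal sketch of one way to keep them all, working on plain Python rows rather than the DataFrame. The `rows` sample and the `answers_by_question` helper are hypothetical and made up purely for illustration:

```python
from collections import defaultdict

def answers_by_question(rows):
    """Group every sentence labelled '1' under its question (hypothetical helper)."""
    grouped = defaultdict(list)
    for question, sentence, label in rows:
        if label == '1':  # labels are strings because they were read with csv.reader
            grouped[question].append(sentence)
    return dict(grouped)

# Made-up rows in (Question, Sentence, Label) order, for illustration only
rows = [
    ("what is section eight housing", "first correct sentence", "1"),
    ("what is section eight housing", "second correct sentence", "1"),
    ("what is section eight housing", "an unrelated sentence", "0"),
]
grouped = answers_by_question(rows)
print(grouped)
```

Whether keeping every correct sentence actually improves the bot's answers is untested here; it simply avoids silently discarding labelled data.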

In [8]:
sent_tokens = [i + " " + j for i,j in data_dict['Sentence'].items()]

In [9]:
# Join all question/answer strings into one text for word tokenization
raw = " ".join(sent_tokens)

In [10]:
word_tokens = nltk.word_tokenize(raw)

In [11]:
## Create functions to lemmatize tokens and remove punctuation

In [12]:
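lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    # Reduce every token to its WordNet lemma (by default, treating it as a noun)
    return [lemmer.lemmatize(token) for token in tokens]

# Translation table that deletes every ASCII punctuation character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # Lower-case, strip punctuation, tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))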
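
To make the punctuation step concrete, here is a small standalone sketch of what the translation table does before tokenization. It uses only the standard library and leaves out the NLTK tokenizer and lemmatizer; `strip_punct` is a hypothetical helper name, not part of the notebook:

```python
import string

# Same mapping as in the cell above: every punctuation code point maps to None,
# which tells str.translate to delete that character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def strip_punct(text):
    """Lower-case the text and delete ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(remove_punct_dict)

cleaned = strip_punct("When was Apple Computer, Inc. founded?")
print(cleaned)  # when was apple computer inc founded
```

Note that apostrophes are deleted as well, so a word such as "what's" becomes "whats" before it reaches the tokenizer.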

### Approach
##### I took a simple approach: run a TF-IDF vectorizer over the entire data together with the user query, then find the most relevant question/answer pair by computing the cosine similarity between the query and every other entry. The pair with the highest cosine score is returned as the chatbot's response.
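
The retrieval idea can be illustrated on a toy corpus without any third-party library. The sketch below is a simplified, hand-rolled TF-IDF, not the scikit-learn vectorizer used in this notebook: it splits on whitespace, applies no stop-word removal or lemmatization, and the function names (`tfidf_vectors`, `cosine`, `best_match`) are assumptions made for this example. The smoothed idf term follows the form scikit-learn documents as its default.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf, so terms present in every document keep a non-zero weight
        vectors.append({t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                        for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_match(query, corpus):
    """Return (index, score) of the corpus entry most similar to the query."""
    vectors = tfidf_vectors(corpus + [query])  # vectorize the query with the data
    query_vec = vectors[-1]
    scores = [cosine(query_vec, vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

corpus = [
    "when was apple computer founded",
    "what is section eight housing",
]
idx, score = best_match("tell me about apple computer", corpus)
print(corpus[idx], round(score, 3))
```

Here the query shares "apple" and "computer" with the first entry and nothing with the second, so the first entry wins; a score of 0 for the best match would mean no overlap at all, which is the case the bot answers with an apology.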

In [13]:
## Creating some basic greetings

In [14]:
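# Single lower-case words only: the input is split into words before matching
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def greeting(sentence):
    """Return a random greeting if the sentence contains a greeting word, else None."""
    for word in sentence.split():
        # Strip surrounding punctuation so that "Hey!" matches "hey"
        if word.lower().strip(string.punctuation) in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)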

In [15]:
## The bot uses cosine similarity between the user request and the existing data.
def response(user_response):
    # Temporarily add the query as the last document; the caller removes it afterwards
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the query (last row) to every document, including itself
    vals = cosine_similarity(tfidf[-1], tfidf).flatten()
    # The highest score is the query matched against itself, so take the second best
    idx = vals.argsort()[-2]
    if vals[idx] == 0:
        # No term overlap with any stored question/answer pair
        return "I'm sorry! I didn't get you."
    return sent_tokens[idx]

## !! Make sure to enter 'bye' to exit the chatbot; don't leave him hanging. !!

In [None]:
flag = True
print("Zen: My name is Zen. I would like you to stay that way too! If you want to exit, type Bye.")
while flag:
    # Lower-casing here means 'Bye' and 'bye' are handled alike
    user_response = input().lower()
    if user_response == 'bye':
        flag = False
        print("Zen: Bye! take care!")
    elif user_response in ('thanks', 'thank you'):
        flag = False
        print("Zen: You're welcome!")
    else:
        # Call greeting() once so the check and the reply use the same result
        reply = greeting(user_response)
        if reply is not None:
            print("Zen: " + reply)
        else:
            print("Zen: ", end="")
            print(response(user_response))
            # response() appended the query to sent_tokens; remove it again
            sent_tokens.remove(user_response)

Zen: My name is Zen. I would like you to stay that way too! If you want to exit, type Bye.
what do you know about apple computer
Zen: When was Apple Computer founded The company was founded on April 1, 1976, and incorporated as Apple Computer, Inc. on January 3, 1977.


Given more time (and data), I would love to extend this very basic bot in the following ways:

1. Combine it with a clustering method, as described in the second notebook. This would let the bot identify the type of request and further improve its responses. It could also help the bot redirect the customer to the most suitable customer care agent.

2. Add a feature that lets the customer rate the correctness of the bot's responses. These ratings could then be used to extend the question-answer data the bot already has, helping it slowly get better over time.

3. Combine the chatbot with the clustering software submitted as the second part of this task to identify the topics a question belongs to and suggest links to pages that could be useful for solving the query.
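
The rating idea in point 2 could be sketched roughly as follows. Nothing here exists in the notebook yet: `record_feedback`, `trusted_pairs`, the in-memory `store` and both thresholds are assumptions chosen for illustration, and the questions and answers are made up.

```python
def record_feedback(store, question, answer, helpful):
    """Update the vote counts for a (question, answer) pair.

    `store` maps (question, answer) -> [helpful_votes, unhelpful_votes].
    """
    votes = store.setdefault((question, answer), [0, 0])
    votes[0 if helpful else 1] += 1
    return store

def trusted_pairs(store, min_votes=3, min_ratio=0.8):
    """Return "question answer" strings whose ratings pass both (assumed) thresholds."""
    pairs = []
    for (question, answer), (up, down) in store.items():
        total = up + down
        if total >= min_votes and up / total >= min_ratio:
            # Same "question + space + answer" format as the entries in sent_tokens
            pairs.append(question + " " + answer)
    return pairs

store = {}
for helpful in (True, True, True, False):
    record_feedback(store, "made-up question", "made-up answer", helpful)
record_feedback(store, "another question", "a poor answer", False)

new_tokens = trusted_pairs(store, min_votes=3, min_ratio=0.7)
print(new_tokens)
```

Pairs that pass the thresholds could then be appended to `sent_tokens`, so that later queries are matched against them as well. A real deployment would need persistent storage and some protection against spam ratings, neither of which is covered here.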