# FAQ Matching Bot using Word Mover's Distance
### Steps
1. Tokenising and setting Stopwords
2. Loading word2vec model (for distance between words)
3. Experimenting with WMD
4. Loading the dataset and processing the data
5. Finding closest question using WmdSimilarity

## Setting stopwords
- Import the set of common English stopwords from nltk
- Add domain-related stopwords (here, `hdfc`)
- Remove question words from the stopword set, so that they are kept as tokens and help distinguish between the intents of questions

## Tokenizing
- Use `wordpunct_tokenize` to split the lowercased sentence into tokens, then drop stopwords and non-alphanumeric tokens

In [1]:
from nltk.corpus import stopwords
from nltk import wordpunct_tokenize

plus = {'hdfc'}
minus = {'what','how','where','when','why'}
stop = set(stopwords.words('english'))

stop.update(plus)
stop.difference_update(minus)

def get_tokens(sent):
    return [x for x in wordpunct_tokenize(sent.lower()) if x.isalnum() and x not in stop]
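
As a quick check of what the tokenizer keeps, the sketch below mirrors `get_tokens` using only the standard library. It assumes a tiny hand-written stopword set (`toy_stop`, hypothetical) instead of the full NLTK list, and uses the regular expression `\w+|[^\w\s]+`, which is the pattern `wordpunct_tokenize` is built on.

```python
import re

# Hypothetical miniature stopword set; the notebook uses NLTK's English list.
toy_stop = {'is', 'the', 'to', 'my', 'i', 'can', 'hdfc'}

def toy_get_tokens(sent):
    # Same pattern as NLTK's WordPunctTokenizer: runs of word characters,
    # or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", sent.lower())
    # Keep alphanumeric tokens that are not stopwords.
    return [t for t in tokens if t.isalnum() and t not in toy_stop]

print(toy_get_tokens('What is the procedure to pay my HDFC Personal Loan?'))
# ['what', 'procedure', 'pay', 'personal', 'loan']
print(toy_get_tokens('How can I repay my Personal Loan?'))
# ['how', 'repay', 'personal', 'loan']
```

The question words `what` and `how` survive because they are not in the stopword set, while the domain word `hdfc` and the punctuation are removed.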

## Loading [word2vec](https://code.google.com/archive/p/word2vec/) model
Download the pre-trained Google News vectors here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

In [2]:
# Loading the model with gensim
from gensim.models import KeyedVectors
import pandas as pd
import os

MODEL_PATH = '/Users/vishalgupta/Downloads/GoogleNews-vectors-negative300.bin'

if not os.path.exists(MODEL_PATH):
    raise ValueError("SKIP: You need to download the google news model")

# limit=200000 loads only the first 200,000 vectors to reduce memory use
model = KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True, limit=200000)

## Experimenting with WMD
### About WMD
Word Mover's Distance (WMD) is a method for measuring the distance (or similarity) between two documents, even when they have no words in common. To measure the distance between words it needs word embeddings (word2vec, in this case). WMD was presented by Matt Kusner et al. in 2015 ([Paper](http://proceedings.mlr.press/v37/kusnerb15.pdf)).

![WMD Results](http://vishalgupta.me/NLP-Notebooks/img/WMD_results.png)

### How does it work?
WMD adapts the [earth mover’s distance](https://en.wikipedia.org/wiki/Earth_mover%27s_distance) to the space of documents: the distance between two texts is the minimum total cost of moving the word “mass” of one text onto the other, where each unit of mass costs the distance between the two words it travels between. So, starting from a measure of the distance between different words, we can get a principled document-level distance.
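
Stated as an optimisation problem, following the formulation in Kusner et al. (2015): let $d_i$ and $d'_j$ be the normalised bag-of-words weights of word $i$ in the first document and word $j$ in the second, let $x_i$ be the embedding of word $i$, and let $n$ be the vocabulary size. The symbols follow the paper's notation; they are not names used in this notebook.

```latex
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \; \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, c(i, j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j,
\qquad
\text{where } c(i, j) = \lVert x_i - x_j \rVert_2 .
```

Here $T_{ij}$ is the amount of mass moved from word $i$ to word $j$: the first constraint says all of word $i$'s mass must leave it, and the second says word $j$ must receive exactly its own weight.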

<img src="http://vishalgupta.me/NLP-Notebooks/img/WM_dist.png" width = 500></img>

- Top:
    - The components of the WMD metric between a query `D0` and two sentences `D1`, `D2` (with equal BOW distance).
    - The arrows represent flow between two words and are labeled with their distance contribution.
- Bottom:
    - The flow between two sentences `D3` and `D0` with different numbers of words.
    - This mismatch forces WMD to split a word's mass across several similar words.
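
A minimal sketch of the equal-length case, assuming made-up 2-D embeddings (`toy_vectors`, hypothetical values, not word2vec) and documents with the same number of distinct words. In that special case every word carries weight 1/n and an optimal flow is a one-to-one pairing, so the distance can be found by trying every pairing. This is an illustration only; gensim's `wmdistance` solves the general transport problem.

```python
from itertools import permutations
from math import dist

# Hypothetical 2-D embeddings chosen by hand for illustration.
toy_vectors = {
    'obama': (0.0, 0.0), 'president': (0.0, 1.0),
    'speaks': (3.0, 0.0), 'greets': (3.0, 1.0),
    'media': (6.0, 0.0), 'press': (6.0, 1.0),
    'lamb': (20.0, 20.0), 'mary': (21.0, 20.0), 'little': (20.0, 21.0),
}

def toy_wmd(doc_a, doc_b):
    """Brute-force WMD for equal-length documents of distinct words (uniform weights)."""
    assert len(doc_a) == len(doc_b)
    n = len(doc_a)
    best = float('inf')
    for perm in permutations(doc_b):
        # Each word moves all of its mass (1/n) to its partner in this pairing.
        cost = sum(dist(toy_vectors[a], toy_vectors[b])
                   for a, b in zip(doc_a, perm)) / n
        best = min(best, cost)
    return best

doc_query = ['obama', 'speaks', 'media']
print(toy_wmd(doc_query, doc_query))                        # identical documents
print(toy_wmd(doc_query, ['president', 'greets', 'press']))  # related words
print(toy_wmd(doc_query, ['mary', 'little', 'lamb']))        # unrelated words
```

With these toy vectors the related sentence ends up much closer to the query than the unrelated one, even though neither shares a word with it.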

In [3]:
# Sample Sentences
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_obama_past = 'In Illinois, Obama spoke to journalists'
sentence_president = 'The president greets the press in Chicago'
sentence_arbitrary = 'Mary had a little lamb'

# Tokens of sample sentences
tokens_obama = get_tokens(sentence_obama)
tokens_president = get_tokens(sentence_president)
tokens_obama_past = get_tokens(sentence_obama_past)
tokens_arbitrary = get_tokens(sentence_arbitrary)

# WMD between sample sentences
print ("'"+sentence_obama+"' vs. '"+sentence_obama+"' \t\t:",model.wmdistance(tokens_obama, tokens_obama))
print ("'"+sentence_obama+"' vs. '"+sentence_obama_past+"' \t\t:",model.wmdistance(tokens_obama, tokens_obama_past))
print ("'"+sentence_obama+"' vs. '"+sentence_president+"' \t:",model.wmdistance(tokens_obama, tokens_president))
print ("'"+sentence_obama+"' vs. '"+sentence_arbitrary+"' \t\t\t\t:",model.wmdistance(tokens_obama, tokens_arbitrary))

("'Obama speaks to the media in Illinois' vs. 'Obama speaks to the media in Illinois' \t\t:", 0.0)
("'Obama speaks to the media in Illinois' vs. 'In Illinois, Obama spoke to journalists' \t\t:", 1.751374609270841)
("'Obama speaks to the media in Illinois' vs. 'The president greets the press in Chicago' \t:", 3.3901614766891077)
("'Obama speaks to the media in Illinois' vs. 'Mary had a little lamb' \t\t\t\t:", 4.176404432301822)


## Loading the dataset
- `hdfc.xlsx` : Collection of 1341 QnA about HDFC. (Scraped from HDFC's FAQ site)
- Dropping duplicate questions
- Dropping questions whose words are not in the word2vec model's vocabulary

In [4]:
df = pd.read_excel('data/hdfc.xlsx')
df = df.drop_duplicates('Question')
df = df.reset_index()

qlabels = df['Question'].to_dict()

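drop_list = []
for x in qlabels:
    for w in get_tokens(qlabels[x]):
        # `w in model` works across gensim versions (model.vocab was removed in gensim 4)
        if w not in model:
            drop_list.append(x)
            break  # one missing word is enough to drop the question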

reduced = df.drop(drop_list)

qlabels = reduced['Question'].to_dict()
alabels = reduced['Answer'].to_dict()
reduced.head()

| | index | Question | Answer |
|---|---|---|---|
| 1 | 1 | How can I repay my Personal Loan? | You pay the loan in equal monthly instalments ... |
| 2 | 2 | Are there any additional charges for loan repa... | The additional charges (if any) are applicable... |
| 3 | 3 | What is Guarantor? | A Guarantor is a person who guarantees to pay ... |
| 4 | 4 | What is De-pledge? | Removal of a pledge from the security to regai... |
| 5 | 5 | Whom do I contact in case of ay further querie... | You can apply online on the clicking on the be... |
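
The vocabulary filter can be sketched without the word2vec model. The example assumes a small stand-in vocabulary (`toy_vocab`, hypothetical), invented questions, and plain whitespace tokens. It keeps a question only if every token has an embedding, which is the rule the loop that builds `drop_list` applies.

```python
# Hypothetical stand-in for the word2vec vocabulary.
toy_vocab = {'what', 'is', 'guarantor', 'how', 'can', 'i', 'repay', 'my', 'loan'}

# Hypothetical questions keyed by row index.
toy_questions = {
    0: 'how can i repay my loan',
    1: 'what is guarantor',
    2: 'what is foreclosure',   # 'foreclosure' has no embedding in toy_vocab
}

def has_full_coverage(question, vocab):
    # True only when every token can be looked up in the embedding vocabulary.
    return all(token in vocab for token in question.split())

kept = {idx: q for idx, q in toy_questions.items()
        if has_full_coverage(q, toy_vocab)}
dropped = sorted(set(toy_questions) - set(kept))
print(kept)     # questions 0 and 1
print(dropped)  # [2]
```

Dropping the whole question is a conservative choice: it avoids matching on a partial set of words, at the cost of losing some FAQ entries.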


## Finding closest question using WmdSimilarity

In [5]:
from gensim.similarities import WmdSimilarity
questions = reduced['Question'].values

instance = WmdSimilarity([get_tokens(x) for x in questions], model, num_best=5)

In [7]:
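# Function to predict sentence closest to given sentence
def predict_wmd(sent):
    query = get_tokens(sent)
    # With num_best set, sims is a list of (question index, similarity) pairs
    sims = instance[query]
    scores = []
    # Iterate over the results directly: fewer than 5 matches may be returned
    for idx, score in sims:
        scores.append({"Score": '%.4f' % score, "Question": questions[idx]})
    return pd.DataFrame(scores)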
predict_wmd('What is the procedure to pay my Personal Loan?')

| | Question | Score |
|---|---|---|
| 0 | How can I repay my Personal Loan? | 0.6165 |
| 1 | What are the charges I need to pay to foreclos... | 0.6076 |
| 2 | What are the charges I need to pay to foreclos... | 0.5802 |
| 3 | Can I repay the Personal loan earlier? | 0.5799 |
| 4 | How long can I take to repay my personal loan? | 0.5783 |
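
The scores in this table are similarities, not raw distances. As far as we know, gensim's `WmdSimilarity` converts each distance `d` into `1 / (1 + d)`, so identical token lists score 1.0 and the score falls towards 0 as the distance grows. The helper names below are hypothetical, and the formula is an assumption that should be checked against the installed gensim version.

```python
def distance_to_similarity(distance):
    # Assumed conversion used by gensim's WmdSimilarity: 1 / (1 + distance).
    return 1.0 / (1.0 + distance)

def similarity_to_distance(similarity):
    # Inverse of the conversion above.
    return 1.0 / similarity - 1.0

# Distances printed by model.wmdistance in the experiment with sample sentences.
for d in (0.0, 1.751374609270841, 3.3901614766891077, 4.176404432301822):
    print('%.4f -> %.4f' % (d, distance_to_similarity(d)))

# Under this assumption, the top score of 0.6165 is a distance of about 0.62.
print('%.2f' % similarity_to_distance(0.6165))
```

Because the mapping is monotonic, ranking questions by similarity gives the same order as ranking them by increasing distance.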


# WMD Similarity vs. TF-IDF
- WMD
    - Uses word2vec to embed and compare words.
    - Hence the distance between two words (which is used to compute the distance between documents) reflects their real-life relationship
    - Computationally expensive: `O(p^3 * log p)` per pair of documents, where p is the number of unique words in the two documents
- TF-IDF
    - Doesn't reflect real-life relationships between words
    - The weight assigned to each word decreases as its frequency in the corpus increases
    - Faster to compute than WMD since the vocabulary is much smaller
    - Cannot match questions that use synonyms
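
To make the TF-IDF weighting concrete, the sketch below computes the smoothed inverse document frequency that scikit-learn's `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`, on a hypothetical four-question corpus (`toy_corpus`). The corpus is invented for illustration; it is not the HDFC dataset.

```python
from math import log

# Hypothetical mini corpus, already tokenised.
toy_corpus = [
    ['what', 'loan', 'repay'],
    ['what', 'loan', 'charges'],
    ['how', 'loan', 'apply'],
    ['what', 'cancellation', 'procedure'],
]

def smooth_idf(term, corpus):
    n = len(corpus)                               # number of documents
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    return log((1 + n) / (1 + df)) + 1

for term in ('loan', 'what', 'procedure'):
    print(term, round(smooth_idf(term, toy_corpus), 3))
```

A word that appears in only one question, such as `procedure` here, gets the largest weight, regardless of how informative it is about the user's intent.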

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import FreqDist
import numpy as np

# Training tfidf vectoriser
# Pass a list: recent scikit-learn versions validate stop_words and may reject a set
tf_vect = TfidfVectorizer(stop_words=list(stop),
                          lowercase=True,
                          use_idf=True)
all_qs_vectors = tf_vect.fit_transform(questions)

all_words = wordpunct_tokenize(' '.join(questions).lower())
fdist = FreqDist(all_words)

# Function to predict closest question using TF-IDF
def predict(query,n=5,answers=False):
    query_vector = tf_vect.transform([query])
    # TF-IDF rows are L2-normalised by default, so this dot product is the cosine similarity
    arr = all_qs_vectors.dot(query_vector.T).toarray().flatten()
    matches,results = arr.argsort(axis=0)[::-1],[]
    for i in matches[:n]:
        res = {"Question":questions[i],"Score":arr[i]}
        results.append(res)
    return pd.DataFrame(results)

In [9]:
query = 'What is the procedure to pay my Personal Loan?'
word_dist = [{"word":w,"count":fdist[w]} for w in get_tokens(query)]
pd.DataFrame(word_dist).sort_values('count')

| | count | word |
|---|---|---|
| 1 | 7 | procedure |
| 2 | 20 | pay |
| 3 | 23 | personal |
| 4 | 103 | loan |
| 0 | 206 | what |


#### Consider the question: **What is the procedure to pay my Personal Loan?**
- TF-IDF suggests '*What is the cancellation procedure?*' by matching the words **procedure** and **what**
    - This is closely followed by '*How can I repay my Personal Loan?*', which matches the words **Personal** and **Loan**
    - Thus, the word **procedure** is assigned greater importance simply because few questions contain it (7 occurrences in the corpus, against 103 for **loan**)
    - This worked to our benefit in the previous notebook, where **Amortization** was assigned greater importance, but it backfires when a word that is common in everyday language happens to be rare in the corpus
    - Hence, WMD performs a lot better than TF-IDF when we have a limited set of questions
- WMD suggests '*How can I repay my Personal Loan?*' by roughly matching
    - **What is the procedure** to **How**
    - **pay** to **repay**

In [10]:
# Predict closest question using TF-IDF similarity
predict('What is the procedure to pay my Personal Loan?')

| | Question | Score |
|---|---|---|
| 0 | What is the cancellation procedure? | 0.433564 |
| 1 | How can I repay my Personal Loan? | 0.400332 |
| 2 | What are the charges I need to pay to foreclos... | 0.352948 |
| 3 | Can I repay the Personal loan earlier? | 0.334612 |
| 4 | What is a Personal Assurance Message or Person... | 0.331042 |


In [11]:
# Predict closest question using WMD similarity
predict_wmd('What is the procedure to pay my Personal Loan?')

| | Question | Score |
|---|---|---|
| 0 | How can I repay my Personal Loan? | 0.6165 |
| 1 | What are the charges I need to pay to foreclos... | 0.6076 |
| 2 | What are the charges I need to pay to foreclos... | 0.5802 |
| 3 | Can I repay the Personal loan earlier? | 0.5799 |
| 4 | How long can I take to repay my personal loan? | 0.5783 |
