## [Word2vec](https://code.google.com/archive/p/word2vec/) model
Download the pre-trained Google News vectors here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

In [70]:
# Loading the model with gensim
from gensim.models import KeyedVectors
import pandas as pd
import os

MODEL_PATH = '/home/b/Downloads/GoogleNews-vectors-negative300.bin.gz'

if not os.path.exists(MODEL_PATH):
    raise ValueError("SKIP: You need to download the google news model")

# limit=200000 reads only the first 200,000 vectors from the file to save memory
model = KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True, limit=200000)

In [71]:
from nltk.corpus import stopwords
from nltk import wordpunct_tokenize

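plus = {'hdfc'}  # extra stop words specific to this dataset
minus = {'what','how','where','when','why'}  # question words to keep as tokens
stop = set(stopwords.words('english'))

stop.update(plus)
stop.difference_update(minus)

def get_tokens(sent):
    # Lowercase, tokenize, then keep alphanumeric tokens that are not stop words
    return [x for x in wordpunct_tokenize(sent.lower()) if x.isalnum() and x not in stop]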

In [80]:
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_obama_past = 'In Illinois, Obama spoke to journalists'
sentence_president = 'The president greets the press in Chicago'
sentence_arbitrary = 'Mary had a little lamb'

tokens_obama = get_tokens(sentence_obama)
tokens_president = get_tokens(sentence_president)
tokens_obama_past = get_tokens(sentence_obama_past)
tokens_arbitrary = get_tokens(sentence_arbitrary)

print ("'"+sentence_obama+"' vs. '"+sentence_obama+"' \t\t:",model.wmdistance(tokens_obama, tokens_obama))
print ("'"+sentence_obama+"' vs. '"+sentence_obama_past+"' \t\t:",model.wmdistance(tokens_obama, tokens_obama_past))
print ("'"+sentence_obama+"' vs. '"+sentence_president+"' \t:",model.wmdistance(tokens_obama, tokens_president))
print ("'"+sentence_obama+"' vs. '"+sentence_arbitrary+"' \t\t\t\t:",model.wmdistance(tokens_obama, tokens_arbitrary))

'Obama speaks to the media in Illinois' vs. 'Obama speaks to the media in Illinois' 		: 0.0
'Obama speaks to the media in Illinois' vs. 'In Illinois, Obama spoke to journalists' 		: 1.751374609270841
'Obama speaks to the media in Illinois' vs. 'The president greets the press in Chicago' 	: 3.3901614766891077
'Obama speaks to the media in Illinois' vs. 'Mary had a little lamb' 				: 4.176404432301822


## Loading the dataset

In [44]:
df = pd.read_excel('hdfc.xlsx')
df = df.drop_duplicates('Question')
df = df.reset_index()

qlabels = df['Question'].to_dict()

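# Drop every question that contains a token missing from the word2vec vocabulary
drop_list = []
for x in qlabels:
    for w in get_tokens(qlabels[x]):
        if w not in model:  # membership test on KeyedVectors; avoids the version-specific .vocab attribute
            drop_list.append(x)
            break  # one missing token is enough; avoids duplicate entries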

reduced = df.drop(drop_list)

qlabels = reduced['Question'].to_dict()
alabels = reduced['Answer'].to_dict()
reduced

Unnamed: 0,index,Question,Answer
1,1,How can I repay my Personal Loan?,You pay the loan in equal monthly instalments ...
2,2,Are there any additional charges for loan repa...,The additional charges (if any) are applicable...
3,3,What is Guarantor?,A Guarantor is a person who guarantees to pay ...
4,4,What is De-pledge?,Removal of a pledge from the security to regai...
5,5,Whom do I contact in case of ay further querie...,You can apply online on the clicking on the be...
6,6,What is the minimum loan value/product value?,We are funding products with loan value above ...
7,7,What are the tenures for which I can avail thi...,HDFC Bank Consumer Durable Loan is available i...
8,8,Which all locations HDFC Bank Consumer Durable...,"Mumbai,\nDelhi,\nBangalore,\nChennai,\nPune,\n..."
9,9,How can I get my address changed in my loan ac...,In order to change your address in our records...
13,13,How do I utilize this amount?,This amount is made available in your Salary A...


## Finding closest question using WmdSimilarity

In [52]:
from gensim.similarities import WmdSimilarity
questions = reduced['Question'].values

instance = WmdSimilarity([get_tokens(x) for x in questions], model, num_best=5)

In [68]:
def predict_wmd(sent):
    query = get_tokens(sent)
    sims = instance[query]  # list of (question position, similarity), best match first
    scores = []
    for idx, score in sims:  # may hold fewer than num_best results
        scores.append({"Score": '%.4f' % score, "Question": questions[idx]})
    return pd.DataFrame(scores)
predict_wmd('What is the procedure to pay my Personal Loan?')


## WMD Similarity vs. TF-IDF
- WMD
    - Uses word2vec to embed and compare words.
    - Hence the distance between two words (which is used to compute the distance between documents) reflects their real-life relationship
    - Computationally expensive: `O(p^3 * log p)`, where p is the number of unique words in the documents being compared
- TF-IDF
    - Doesn't reflect real-life relationships between words
    - The weight assigned to each word decreases as the word becomes more frequent in the corpus
    - Faster to compute than WMD since the vocabulary is much smaller
    - Cannot match questions that use synonyms
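
To make the WMD side of this comparison concrete, here is a minimal sketch of the *relaxed* word mover's distance, a cheap lower bound on the exact WMD in which every word simply travels to its nearest counterpart in the other document. It is not what gensim computes internally, and the two-dimensional vectors in `toy_vectors` are invented for illustration (real word2vec vectors have 300 dimensions). All names in the block are hypothetical.

```python
import math

# Hypothetical 2-d "embeddings", invented for illustration only.
toy_vectors = {
    "pay":   (1.0, 0.0),
    "repay": (0.9, 0.1),
    "loan":  (0.0, 1.0),
    "lamb":  (-1.0, -1.0),
}

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided_cost(src, dst):
    """Average distance from each word in src to its nearest word in dst."""
    return sum(
        min(euclidean(toy_vectors[s], toy_vectors[d]) for d in dst) for s in src
    ) / len(src)

def relaxed_wmd(doc_a, doc_b):
    """Relaxed WMD: a lower bound on the exact WMD, not the exact value."""
    return max(one_sided_cost(doc_a, doc_b), one_sided_cost(doc_b, doc_a))

query = ["pay", "loan"]
print(relaxed_wmd(query, query))              # identical documents
print(relaxed_wmd(query, ["repay", "loan"]))  # near-synonym: small distance
print(relaxed_wmd(query, ["lamb"]))           # unrelated word: large distance
```

An exact token match, which is all TF-IDF sees, would treat **pay** and **repay** as unrelated; the embedding distance is what lets them count as close.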

In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import FreqDist
import numpy as np

# Training the TF-IDF vectoriser
tf_vect = TfidfVectorizer(stop_words=list(stop),  # newer scikit-learn versions expect a list, not a set
                          lowercase=True,
                          use_idf=True)
all_qs_vectors = tf_vect.fit_transform(questions)

all_words = wordpunct_tokenize(' '.join(questions).lower())
fdist = FreqDist(all_words)

# Function to predict closest question using TF-IDF
def predict(query, n=5, answers=False):
    query_vector = tf_vect.transform([query])
    # Rows are L2-normalised by default, so this dot product is the cosine similarity
    arr = all_qs_vectors.dot(query_vector.T).toarray().flatten()
    matches, results = arr.argsort(axis=0)[::-1], []
    for i in matches[:n]:
        res = {"Question": questions[i], "Score": arr[i]}
        results.append(res)
    return pd.DataFrame(results)

In [112]:
query = 'What is the procedure to pay my Personal Loan?'
word_dist = [{"word":w,"count":fdist[w]} for w in get_tokens(query)]
pd.DataFrame(word_dist).sort_values('count')

Unnamed: 0,count,word
1,7,procedure
2,20,pay
3,23,personal
4,103,loan
0,206,what


#### Consider the question: **What is the procedure to pay my Personal Loan?**
- TF-IDF suggests '*What is the cancellation procedure?*' by matching the words **procedure** and **what**
    - This is closely followed by '*How can I repay my Personal Loan?*', which matches the words **Personal** and **Loan**
    - Thus, the word **procedure** is given greater weight simply because few questions in the corpus contain it
    - This worked to our benefit in the previous notebook, where **Amortization** was given greater weight, but it backfires when an everyday word such as **procedure** happens to be rare in the corpus
    - Hence, on this query WMD performs a lot better than TF-IDF, and the effect matters most when we have a limited set of questions
- WMD suggests '*How can I repay my Personal Loan?*' by roughly matching
    - **What is the procedure** to **How**
    - **pay** to **repay**
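
The weighting effect described above can be sketched with the counts from the frequency table. This is only an illustration under two loudly stated assumptions: the table holds total occurrences, which are used here as a stand-in for document frequencies, and the corpus size `N_DOCS` is a hypothetical value because the real number of questions is not given. The formula is the smoothed IDF that scikit-learn documents as its default.

```python
import math

# Counts copied from the frequency table above. They are total occurrences,
# used here as a stand-in for document frequencies (an assumption).
counts = {"procedure": 7, "pay": 20, "personal": 23, "loan": 103, "what": 206}

# Hypothetical corpus size: the real number of questions is not stated.
N_DOCS = 500

def smoothed_idf(df, n_docs=N_DOCS):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

weights = {word: smoothed_idf(df) for word, df in counts.items()}

# Print words from highest to lowest IDF weight
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:<10} {weight:.2f}")
```

Whatever corpus size is assumed, the ordering is the same: **procedure** receives the largest weight and **what** the smallest, which is why a single shared rare word can outrank two shared common ones.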

In [66]:
predict('What is the procedure to pay my Personal Loan?')

Unnamed: 0,Question,Score
0,What is the cancellation procedure?,0.433564
1,How can I repay my Personal Loan?,0.400332
2,What are the charges I need to pay to foreclos...,0.352948
3,Can I repay the Personal loan earlier?,0.334612
4,What is a Personal Assurance Message or Person...,0.331042


In [86]:
predict_wmd('What is the procedure to pay my Personal Loan?')

Unnamed: 0,Question,Score
0,How can I repay my Personal Loan?,0.6165
1,What are the charges I need to pay to foreclos...,0.6076
2,What are the charges I need to pay to foreclos...,0.5802
3,Can I repay the Personal loan earlier?,0.5799
4,How long can I take to repay my personal loan?,0.5783
