# Fact Retrieval Bot using TF-IDF
### Steps
- Loading and preprocessing Questions and Answers from dataset
- Setting Stopwords
- Initialising and training TF-IDF vectors
- Testing

## Imports

In [1]:
import pandas as pd                   # To load and process dataset
import numpy as np                    # For matrix operations
from nltk.corpus import stopwords     # Using NLTK to load stopwords
from nltk import wordpunct_tokenize   # Using NLTK to tokenize sentences

from sklearn.feature_extraction.text import TfidfVectorizer

pd.set_option('display.width',1000)

## Loading and preprocessing Questions and Answers from dataset
- `hdfc.xlsx` : Collection of 1341 question-answer pairs about HDFC, scraped from HDFC's FAQ site
- Dropping duplicate questions
- Stripping Questions of extra spaces

In [2]:
df = pd.read_excel('hdfc.xlsx')
df = df.drop_duplicates('Question')
df = df.reset_index()

In [3]:
limit = 1000
reduced = df[['Question','Answer']][:limit]

qlabels = reduced['Question'].to_dict()
alabels = reduced['Answer'].to_dict()

reduced

| | Question | Answer |
|---|---|---|
| 0 | What will be done with the post dated cheques ... | Post Dated Cheques(PDCs)/Security Cheques subm... |
| 1 | How can I repay my Personal Loan? | You pay the loan in equal monthly instalments ... |
| 2 | Are there any additional charges for loan repa... | The additional charges (if any) are applicable... |
| 3 | What is Guarantor? | A Guarantor is a person who guarantees to pay ... |
| 4 | What is De-pledge? | Removal of a pledge from the security to regai... |
| 5 | Whom do I contact in case of ay further querie... | You can apply online on the clicking on the be... |
| 6 | What is the minimum loan value/product value? | We are funding products with loan value above ... |
| 7 | What are the tenures for which I can avail thi... | HDFC Bank Consumer Durable Loan is available i... |
| 8 | Which all locations HDFC Bank Consumer Durable... | Mumbai,\nDelhi,\nBangalore,\nChennai,\nPune,\n... |
| 9 | How can I get my address changed in my loan ac... | In order to change your address in our records... |


## Setting stopwords
- Import set of common stopwords from nltk
- Adding domain-related stopword
- Removing question words (To distinguish between intents of questions)

In [4]:
plus = {'hdfc'}
minus = {'what','how','where','when','why'}
stop = set(stopwords.words('english'))

stop.update(plus)
stop.difference_update(minus)
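
The effect of the adjusted stopword set can be seen in a minimal sketch. It assumes a tiny hand-written base list (`base_stop`, hypothetical) in place of NLTK's English list, so that it runs without downloading the corpus, and a hypothetical helper `content_words` that splits on whitespace only.

```python
# Small stand-in for stopwords.words('english'); the real NLTK list is much longer.
base_stop = {'what', 'how', 'is', 'the', 'my', 'i', 'can', 'do'}

plus = {'hdfc'}                                   # domain word added as a stopword
minus = {'what', 'how', 'where', 'when', 'why'}   # question words kept as content

stop_demo = set(base_stop)
stop_demo.update(plus)
stop_demo.difference_update(minus)

def content_words(sentence):
    """Lowercase, split on whitespace, and drop stopwords (hypothetical helper)."""
    return [w for w in sentence.lower().split() if w not in stop_demo]

print(content_words('How can I repay my HDFC loan'))
```

The question word `how` survives the filter while `hdfc` is dropped, which is the behaviour the `plus` and `minus` sets are meant to produce.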

## Initialising and training TF-IDF vectors
- Setting stopwords to `stop`
- `tf_vect` : `TfidfVectorizer` object. Can be used to convert strings to tf-idf vectors
- `all_qs_vectors` : Matrix of TF-IDF vectors corresponding to questions in training set

In [13]:
# Recent scikit-learn versions expect stop_words as a list rather than a set
tf_vect = TfidfVectorizer(stop_words=sorted(stop),
                          lowercase=True,
                          use_idf=True)
all_qs_vectors = tf_vect.fit_transform(reduced['Question'])
print("Shape of all_qs_vectors : {}".format(all_qs_vectors.shape))
print("Number of questions :  {}".format(all_qs_vectors.shape[0]))
print("Vocabulary size :  {}".format(all_qs_vectors.shape[1]))

Shape of all_qs_vectors : (993, 1140)
Number of questions :  993
Vocabulary size :  1140
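
For intuition about what `fit_transform` computes, the sketch below builds TF-IDF vectors by hand for a toy corpus. It assumes scikit-learn's documented defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation) and ignores stopwords and tokenisation details. The corpus and the helper name `tfidf_vectors` are made up for illustration.

```python
import math

def tfidf_vectors(docs):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenised for w in toks})
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for toks in tokenised if w in toks) for w in vocab}
    # Smoothed IDF (assumed to match the TfidfVectorizer defaults).
    idf = {w: math.log((1.0 + n) / (1.0 + df[w])) + 1.0 for w in vocab}
    rows = []
    for toks in tokenised:
        raw = [toks.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

toy_docs = ['how repay loan', 'how close loan account', 'what is amortization']
vocab, rows = tfidf_vectors(toy_docs)
for word, weight in zip(vocab, rows[0]):
    if weight > 0:
        print('{} : {:.3f}'.format(word, weight))
```

In the first toy document, `repay` appears in only one document and therefore gets a larger weight than `how` or `loan`, which each appear in two.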


In [14]:
# Transforming context with tfidf
context = 'How can I repay my Personal Loan?'
context_vector = tf_vect.transform([context])
context_matrix = context_vector.todense()

In [15]:
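# Displaying TF-IDF results for each token of `context`
def tabulate_vector(context):
    vector = tf_vect.transform([context]).todense()
    values = []
    for w in wordpunct_tokenize(context.strip()):
        # None when the word is a stopword or outside the vocabulary
        ind = tf_vect.vocabulary_.get(w.lower())
        val = vector[0,ind] if ind is not None else 0
        values.append({"Word":w,"Vocabulary Index":"-" if ind is None else str(ind),"TF-IDF Value":val})
    # Returned as a DataFrame so that the notebook renders it as a table
    return pd.DataFrame(values)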


## Predicting closest question
- `predict` has the following arguments
    - `query`       : str  | Question to match
    - `n`           : int  | Number of results (from top)
    - `answers`     : bool | Return answers instead of scores
    - `ret_indices` : bool | Return only the indices of the closest matches
- Steps for prediction
    - Convert the query to a TF-IDF vector
    - Take the dot product of the query vector with each question vector to measure similarity
    - Sort array indices in descending order of array values
    - Return the top n results
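
The steps above can be sketched with plain Python lists. Because TF-IDF rows are L2-normalised by default, the dot product equals cosine similarity. The vectors below are invented for illustration, and `dot` and `top_n` are hypothetical helpers, not the notebook's `predict`.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_n(query_vec, question_vecs, n=2):
    """Return (index, score) pairs for the n highest dot products."""
    scores = [dot(query_vec, q) for q in question_vecs]
    # Sort indices by descending score, mirroring argsort()[::-1].
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:n]]

# Hypothetical unit-length vectors over a 3-word vocabulary.
question_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.6, 0.0]

for index, score in top_n(query_vec, question_vecs):
    print('{} : {:.2f}'.format(index, score))
```

Question 1 shares both non-zero dimensions with the query and ranks first. Question 2 shares none and scores zero.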

In [16]:
def predict(query,n=5,answers=False,ret_indices=False):
    # Comparing query with all questions using dot product
    query_vector = tf_vect.transform([query])
    # Sparse matrix product; np.dot is not reliable on scipy sparse matrices
    sim = all_qs_vectors.dot(query_vector.T)
    # Converting the sparse column to a 1D array with one dot product per question
    arr = sim.toarray().flatten()
    matches = arr.argsort(axis=0)[::-1]
    top_n_matches = matches[:n]
    results = []
    if ret_indices:
        return top_n_matches
    for i in top_n_matches:
        res = {"Question":qlabels[i],"Ans":alabels[i]} if answers else {"Question":qlabels[i],"Score":arr[i]}
        results.append(res)
    return pd.DataFrame(results)

In [17]:
predict('How do I pay my personal loan ?')

| | Question | Score |
|---|---|---|
| 0 | How can I repay my Personal Loan? | 0.637225 |
| 1 | How long can I take to repay my personal loan? | 0.485998 |
| 2 | How long will it take for my Personal loan to ... | 0.465213 |
| 3 | Can I repay the Personal loan earlier? | 0.449065 |
| 4 | How does a Salary Account help me get a person... | 0.421928 |


## Finding closest question by Jaccard similarity
- `tokens` is a dictionary mapping a question's index to the set of tokens in that question

In [30]:
# Generating tokens after converting to lowercase, removing stopwords and non-alphanumeric tokens
# Note : nltk.word_tokenize does not split PIN/Pattern
def get_tokens(sent):
    return set([x for x in wordpunct_tokenize(sent.lower()) if x.isalnum() and x not in stop])
    
tokens = {}
for i in qlabels:
    tokens[i] = get_tokens(qlabels[i])

In [31]:
# Ranking stored questions by Jaccard similarity to the query
def get_jaccard_similarity(words,words2):
    inter = words.intersection(words2)
    union = words.union(words2)
    # Guard against an empty union (e.g. both sentences contain only stopwords)
    sim = float(len(inter))/len(union) if union else 0.0
    return sim,len(inter)

def pred_jaccard(query,n=5):
    words = get_tokens(query)
    scores = {}
    for i in qlabels:
        sc = get_jaccard_similarity(words,tokens[i])
        scores[i] = {"question":qlabels[i],"score":sc[0],"inter":sc[1]}
    return pd.DataFrame(scores).T.sort_values('score',ascending=False)[:n]
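
As a worked check of the formula, here is a minimal sketch on hand-written token sets. The sets are what `get_tokens` would plausibly return for a query and two stored questions, assuming `does`, `it` and `is` are stopwords while `how` and `what` are kept. `jaccard` is a hypothetical standalone helper.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return float(len(a & b)) / len(union) if union else 0.0

query = {'how', 'amortization', 'work'}   # 'How does amortization work ?'
generic = {'how', 'work'}                 # 'How does it work?'
specific = {'what', 'amortization'}       # 'What is Amortization?'

print(jaccard(query, generic))    # 2 shared tokens out of 3 distinct tokens
print(jaccard(query, specific))   # 1 shared token out of 4 distinct tokens
```

Every shared token counts the same, so the generic question scores higher than the one that shares only the rare word `amortization`.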
    

## Jaccard-similarity-based matching vs. TF-IDF-based matching
- Jaccard does not depend on the rest of the corpus when computing similarity, so it treats all tokens equally
- IDF, which measures the importance of a word, gives more weight to words that occur in few questions across the corpus
<br><br>

#### Consider the question : **How does amortization work ?**
- Jaccard suggests '*How does it work?*' by matching the words **how** and **work**
- TF-IDF suggests '*What is Amortization?*' by matching the word **amortization**, since it is a less frequent term

In [32]:
pred_jaccard('How does amortization work ?',5)
# Jaccard suggests 'How does it work' by matching the words how and work

| | inter | question | score |
|---|---|---|---|
| 594 | 2 | How does it work? | 0.666667 |
| 641 | 2 | How does it BillPay work? | 0.5 |
| 302 | 2 | How does SMSBanking work? | 0.5 |
| 734 | 2 | How does the Insta IPIN facility work? | 0.333333 |
| 616 | 2 | What is IVR Password and how does it work? | 0.333333 |


In [33]:
predict('How does amortization work ?',5)
# TF-IDF suggests 'What is Amortization?' by matching the word amortization, since it is a less frequent term

| | Question | Score |
|---|---|---|
| 0 | What is Amortization? | 0.747444 |
| 1 | How does it work? | 0.626755 |
| 2 | How does it BillPay work? | 0.493706 |
| 3 | How does SMSBanking work? | 0.450852 |
| 4 | What is IVR Password and how does it work? | 0.402905 |


In [39]:
# Frequency distribution of words: number of questions that contain each token
all_ = {}
for x in tokens.values():
    for w in x:
        all_[w] = all_.get(w,0)+1
print("Word Frequency")
for i in get_tokens('How does amortization work ?'):
    print("{} : {}".format(i, all_[i]))

Word Frequency
how : 236
work : 13
amortization : 1
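
These counts help explain the TF-IDF ranking. The sketch below is a rough estimate: it assumes scikit-learn's default smoothed IDF formula and treats the token counts above as document frequencies over the 993 questions. This is only an approximation, since `TfidfVectorizer` tokenises slightly differently from `get_tokens`. The helper name `smoothed_idf` is hypothetical.

```python
import math

n_questions = 993   # number of rows in all_qs_vectors
doc_freq = {'how': 236, 'work': 13, 'amortization': 1}

def smoothed_idf(df, n):
    """idf = ln((1 + n) / (1 + df)) + 1 (assumed TfidfVectorizer default)."""
    return math.log((1.0 + n) / (1.0 + df)) + 1.0

idf = {w: smoothed_idf(df, n_questions) for w, df in doc_freq.items()}
# Print from the least to the most informative word.
for word in sorted(idf, key=idf.get):
    print('{} : {:.2f}'.format(word, idf[word]))
```

Under these assumptions `amortization` receives roughly three times the IDF weight of `how`, which is consistent with 'What is Amortization?' outranking 'How does it work?' in the TF-IDF results.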
