# Fact Retrieval Bot using TF-IDF
### Steps
- Loading and preprocessing Questions and Answers from dataset
- Setting Stopwords
- Initialising and training TF-IDF vectors
- Testing

## Imports

In [30]:
import pandas as pd                   # To load and process dataset
import numpy as np                    # For matrix operations
from nltk.corpus import stopwords     # Using NLTK to load stopwords
from nltk import word_tokenize        # Using NLTK to tokenize sentences

from sklearn.feature_extraction.text import TfidfVectorizer

pd.set_option('display.width',1000)

## Loading and preprocessing Questions and Answers from dataset
- `hdfc.pkl` : Collection of 1341 question-answer pairs about HDFC, scraped from HDFC's FAQ site
- Dropping duplicate questions
- Stripping extra whitespace from questions

In [28]:
df = pd.read_pickle('hdfc.pkl')
df = df.drop_duplicates('Question')
df = df.reset_index()
df['Question'] = df['Question'].str.strip()
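
`drop_duplicates` only removes exact matches, so questions that differ only in punctuation or spacing (for example "post dated" versus "post-dated", as seen in the test output further down) are kept as separate entries. A minimal sketch of one possible way to spot such near-duplicates is shown here. The helper `normalise_question` is hypothetical and not part of the original notebook, and the two long questions are copied from this notebook's own output.

```python
import re

def normalise_question(text):
    """Hypothetical helper: build a comparison key for spotting near-duplicates."""
    text = text.lower().strip()
    text = re.sub(r"[-/]", " ", text)           # treat hyphens and slashes as spaces
    text = re.sub(r"[^\w\s]", "", text)         # drop remaining punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

questions = [
    "What will be done with the post dated cheques if I request to change "
    "the mode of repayment/ account for my loan?",
    "What will be done with the post-dated cheques if I request to change "
    "the mode of repayment/account for my loan?",
    "How can I repay my Personal Loan?",
]

# Keep the first question seen for each normalised key
seen = set()
unique = []
for q in questions:
    key = normalise_question(q)
    if key not in seen:
        seen.add(key)
        unique.append(q)

print(len(unique))
```

With pandas, the same idea could be applied by mapping the helper over `df['Question']` into a separate key column and passing that column to `drop_duplicates`. Whether merging such entries is desirable depends on whether their answers are also identical.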

In [29]:
limit = 100
reduced = df[['Question','Answer']][:limit]

qlabels = reduced['Question'].to_dict()
alabels = reduced['Answer'].to_dict()

print(reduced.head())

                                            Question                                             Answer
0  What will be done with the post dated cheques ...  Post Dated Cheques(PDCs)/Security Cheques subm...
1                  How can I repay my Personal Loan?  You pay the loan in equal monthly instalments ...
2  Are there any additional charges for loan repa...  The additional charges (if any) are applicable...
3                                 What is Guarantor?  A Guarantor is a person who guarantees to pay ...
4                                 What is De-pledge?  Removal of a pledge from the security to regai...


## Setting stopwords
- Importing the set of common English stopwords from NLTK
- Adding the domain-specific stopword `hdfc`
- Removing question words from the stopword set, so that they are kept as features and help distinguish between the intents of questions

In [27]:
# Loading stopwords
plus = {'hdfc'}
minus = {'what','how','where','when','why'}
stop = set(stopwords.words('english'))

stop.update(plus)
stop.difference_update(minus)

## Initialising and training TF-IDF vectors
- Setting stopwords to `stop`
- `tf_vect` : `TfidfVectorizer` object. Can be used to convert strings to tf-idf vectors
- `all_qs_vectors` : Matrix of TF-IDF vectors corresponding to questions in training set
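
To make the weighting concrete, here is a minimal pure-Python sketch of the scheme that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, then scaled to unit L2 length. The function names and the toy corpus are assumptions made for illustration. The sketch skips tokenisation and stopword removal, so it is not a replacement for the vectorizer.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1.0 + n) / (1.0 + count)) + 1.0
            for term, count in df.items()}

def tfidf_vector(doc, idf):
    """Term counts times IDF, scaled to unit L2 length; unseen terms are ignored."""
    counts = Counter(t for t in doc if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy, already-tokenised corpus (hypothetical, not the HDFC data)
docs = [
    ['repay', 'personal', 'loan'],
    ['foreclose', 'loan'],
    ['loan', 'guarantor'],
]
idf = fit_idf(docs)
vec = tfidf_vector(docs[0], idf)
for term in sorted(vec, key=vec.get, reverse=True):
    print("{:<10} {:.4f}".format(term, vec[term]))
```

A term that occurs in every document, such as `loan` here, gets the smallest IDF and therefore the lowest weight. This is the same pattern as in the notebook's output, where `Loan` scores lower than `repay` and `Personal`.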

In [11]:
# Recent scikit-learn versions expect stop_words to be a list rather than a set
tf_vect = TfidfVectorizer(stop_words=list(stop),
                          lowercase=True,
                          use_idf=True)
all_qs_vectors = tf_vect.fit_transform(reduced['Question'])
print("Shape of all_qs_vectors : {}".format(all_qs_vectors.shape))
print("{} : Number of questions".format(all_qs_vectors.shape[0]))
print("{} : Vocabulary size".format(all_qs_vectors.shape[1]))

Shape of all_qs_vectors : (100, 178)
100 : Number of questions
178 : Vocabulary size


In [12]:
# Transforming context with tfidf
context = 'How can I repay my Personal Loan?'
context_vector = tf_vect.transform([context])
context_matrix = context_vector.todense()

In [13]:
# Displaying TF-IDF results for each token of the context
print("{:<10} {:<6} {}".format("WORD", "INDEX", "TFIDF_VALUE"))
for w in word_tokenize(context.strip()):
    # Tokens that are not in the vocabulary get the index "NA" and a value of 0
    ind = tf_vect.vocabulary_.get(w.lower(), "NA")
    val = context_matrix[0, ind] if ind != "NA" else 0
    print("{:<10} {:<6} {}".format(w, ind, val))

WORD       INDEX  TFIDF_VALUE
How        NA     0
can        NA     0
I          NA     0
repay      141    0.6328378766551715
my         NA     0
Personal   118    0.714811086464789
Loan       87     0.2975925612941322
?          NA     0


## Predicting closest question
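- `predict` has the following arguments
    - `query`       : str  | Question to match against the stored questions
    - `n`           : int  | Number of results (from top)
    - `answers`     : bool | Whether to return the answers along with the matched questions
    - `ret_indices` : bool | Whether to return the indices of the top matches instead of the questions
- Steps for prediction
    - Convert the query to a TF-IDF vector
    - Take the dot product of the query vector with each question vector to measure similarity (the vectors are L2-normalised by default, so this equals cosine similarity)
    - Sort the array indices in descending order of similarity
    - Return the top n results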
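
The ranking steps above can be sketched without any libraries, assuming each vector is stored as a `{term: weight}` dict already scaled to unit length. The names `dot`, `unit` and `top_n` and the toy weights are hypothetical and serve only to illustrate the idea. The notebook itself works on scipy sparse matrices.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u                      # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def unit(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()}

def top_n(query_vec, question_vecs, n=5):
    """Indices of the n stored questions most similar to the query, best first."""
    scores = [dot(query_vec, q) for q in question_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

# Toy vectors with made-up weights (not taken from the HDFC data)
question_vecs = [
    unit({'repay': 1.7, 'personal': 1.7, 'loan': 1.0}),
    unit({'foreclose': 1.7, 'loan': 1.0}),
    unit({'guarantor': 1.0}),
]
query_vec = unit({'personal': 1.7, 'loan': 1.0})

print(top_n(query_vec, question_vecs, n=2))
```

Because every vector has unit length, the dot product of a vector with itself is 1 and the scores lie between 0 and 1, so no separate division by the norms is needed when ranking.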

In [48]:
def predict(query, n=5, answers=False, ret_indices=False):
    # Comparing the query with all questions using the dot product
    query_vector = tf_vect.transform([query])
    # Sparse matrix product; np.dot does not reliably handle scipy sparse inputs
    sim = all_qs_vectors.dot(query_vector.T)
    # Converting the sparse result to a 1D array with one dot product per question
    arr = sim.toarray().flatten()
    # Question indices sorted by descending similarity
    matches = arr.argsort(axis=0)[::-1]
    top_n_matches = matches[:n]
    results = []
    if ret_indices:
        return top_n_matches
    for i in top_n_matches:
        res = {qlabels[i]:alabels[i]} if answers else qlabels[i]
        results.append(res)
    return results

In [43]:
predict('How do I pay my personal loan ?')

[u'How can I repay my Personal Loan?',
 u'Can I repay the Personal loan earlier?',
 u'What are the charges I need to pay to foreclose my loan?',
 u'How long can I take to repay my personal loan?',
 u'What are the charges I need to pay to foreclose my Loan Against Property?']

In [51]:
num_correct = 0
failed = []
for i in qlabels:
    if predict(qlabels[i],n=1,ret_indices=True)[0] == i:
        num_correct +=1
    else :
        failed.append(i)
# Share of questions whose own index is returned as the top match
print("Recall :  {} %".format(float(num_correct) / len(qlabels) * 100))

for i in failed:
    query = qlabels[i]
    print("\nQuery : " + query)
    print(predict(query, n=3))

Recall :  95.0 %
Query : What will be done with the post dated cheques if I request to change the mode of repayment/ account for my loan?
[u'What will be done with the post dated cheques if I request to change the mode of repayment/account for my loan?', u'What will be done with the post-dated cheques if I request to change the mode of repayment/account for my loan?', u'What will be done with the post-dated cheques if I request to change the mode of repayment/ account for my loan?']

Query : What are the different loan repayment modes?
[u'What are the different modes of loan repayment?', u'What are the different loan repayment modes?', u'How can I change the mode of repayment/ account for my loan?']

Query : Do you want to repay the loan earlier than the due date?
[u'What if I want to repay the loan earlier than the due date?', u'Do you want to repay the loan earlier than the due date?', u'Can I repay my loan earlier than the due date?']

Query : What will be done with the post-dated c