This notebook:
1. Loads and cleans the raw data
2. Prepares the data for the Anserini retriever
4. Preprocesses and tokenizes the cleaned data
5. Creates a vocabulary from the corpus

In [None]:
%cd ../

In [3]:
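# Counter is used below to print the most common words; import it explicitly
# rather than relying on the star import to expose it.
from collections import Counter
from statistics import mean

from src.process_data import *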

[nltk_data] Downloading package punkt to /home/bithiah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Load data**

In [4]:
# Document id and Answer text
collection = load_answers_to_df("data/raw/FiQA_train_doc_final.tsv")
# Question id and Question text
queries = load_questions_to_df("data/raw/FiQA_train_question_final.tsv")
# Question id and Answer id pair
qid_docid = load_qid_docid_to_df("data/raw/FiQA_train_question_doc_final.tsv")

In [0]:
print("Document id and Answer text")
collection.head(5)

Document id and Answer text


| | docid | doc |
|---|---|---|
| 0 | 3 | I'm not saying I don't like the idea of on-the... |
| 1 | 31 | So nothing preventing false ratings besides ad... |
| 2 | 56 | You can never use a health FSA for individual ... |
| 3 | 59 | Samsung created the LCD and other flat screen ... |
| 4 | 63 | Here are the SEC requirements: The federal sec... |


In [0]:
print("Example answer: \n")
print(collection.iloc[0]['doc'])

Example answer: 

I'm not saying I don't like the idea of on-the-job training too, but you can't expect the company to do that. Training workers is not their job - they're building software. Perhaps educational systems in the U.S. (or their students) should worry a little about getting marketable skills in exchange for their massive investment in education, rather than getting out with thousands in student debt and then complaining that they aren't qualified to do anything.


In [0]:
print("Question id and Question text")
queries.head(5)

Question id and Question text


| | qid | question |
|---|---|---|
| 0 | 0 | What is considered a business expense on a bus... |
| 1 | 1 | Claiming business expenses for a business with... |
| 2 | 2 | Transferring money from One business checking ... |
| 3 | 3 | Having a separate bank account for business/in... |
| 4 | 4 | Business Expense - Car Insurance Deductible Fo... |


In [0]:
print("Example Question: \n")
print(queries.iloc[0]['question'])

Example Question: 

What is considered a business expense on a business trip?


In [0]:
print("Question id and Answer id pair")
qid_docid.head(5)

Question id and Answer id pair


| | qid | docid |
|---|---|---|
| 0 | 0 | 18850 |
| 1 | 1 | 14255 |
| 2 | 2 | 308938 |
| 3 | 3 | 296717 |
| 4 | 3 | 100764 |


In [0]:
qid_rel = label_to_dict(qid_docid)
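
`label_to_dict` is defined in `src/process_data.py` and is not shown in this notebook. As a rough sketch of the kind of grouping it presumably performs, the hypothetical helper below (`label_to_dict_sketch` is an illustrative name, not the project's function) collects the relevant `docid`s for each `qid` from an iterable of pairs. With the DataFrame above, the pairs could be supplied as `zip(qid_docid['qid'], qid_docid['docid'])`.

```python
from collections import defaultdict

def label_to_dict_sketch(pairs):
    """Group (qid, docid) pairs into {qid: [docid, ...]}.

    Hypothetical stand-in for label_to_dict; the real helper may differ.
    """
    qid_rel = defaultdict(list)
    for qid, docid in pairs:
        qid_rel[qid].append(docid)
    return dict(qid_rel)

# The first five (qid, docid) rows printed above
toy_pairs = [(0, 18850), (1, 14255), (2, 308938), (3, 296717), (3, 100764)]
toy_qid_rel = label_to_dict_sketch(toy_pairs)

print(toy_qid_rel[3])    # question 3 has two relevant answers
print(len(toy_qid_rel))  # number of distinct questions
```

A dictionary of lists makes the per-query statistics in the next cell a one-liner, since `len(v)` is the number of relevant answers for that query.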

In [0]:
# Number of relevant passages for each query
num_rel = [len(v) for v in qid_rel.values()]

avg_num_rel = mean(num_rel)
max_num_rel = max(num_rel)
min_num_rel = min(num_rel)

print("Average number of relevant passages for each query: {}\n".format(round(avg_num_rel)))
print("Max number of relevant passages for each query: {}\n".format(max_num_rel))
print("Min number of relevant passages for each query: {}\n".format(min_num_rel))

Average number of relevant passages for each query: 3

Max number of relevant passages for each query: 23

Min number of relevant passages for each query: 1



In [0]:
print("Example QA pair: \n")
print("Question: {}".format(queries.at[3, 'question']))

# Look up the answer by docid instead of relying on a hard-coded row index
x = collection[collection['docid'] == 296717]

print()
print("Answer: {}".format(x.iloc[0]['doc']))

Example QA pair: 

Question: Having a separate bank account for business/investing, but not a “business account?”

Answer: Having a separate checking account for the business makes sense. It simplifies documenting your income/expenses. You can "explain" every dollar entering and exiting the account without having to remember that some of them were for non-business items. My credit union allowed me to have a 2nd checking account and allowed me to put whatever I wanted as the name on the check. I think this looked a little better than having my name on the check. I don't see the need for a separate checking account for investing. The money can be kept in a separate savings account that has no fees, and can even earn a little interest. Unless you are doing a lot of investment transactions a month this has worked for me. I fund IRAs and 529 plans this way. We get paychecks 4-5 times a month, but send money to each of the funds once a month. You will need a business account if the number of

In [0]:
print("Number of answers: {}".format(len(collection)))
print("Number of questions: {}".format(len(queries)))
print("Number of QA pairs: {}".format(len(qid_docid)))

Number of answers: 57638
Number of questions: 6648
Number of QA pairs: 17110


# **Clean data**

In [4]:
# Cleaning data
empty_docs, empty_id = get_empty_docs(collection)
# Remove empty answers from collection of answers
collection_cleaned = collection.drop(empty_id)
# Remove empty answers from qa pairs
qid_docid = qid_docid[~qid_docid['docid'].isin(empty_docs)]

print("Number of answers after cleaning: {}".format(len(collection_cleaned)))
print("Number of QA pairs after cleaning: {}".format(len(qid_docid)))

Number of answers after cleaning: 57600
Number of QA pairs after cleaning: 17072
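
`get_empty_docs` is not shown here. Judging only from how its return values are used above (`empty_id` is passed to `DataFrame.drop`, and `empty_docs` is matched against the `docid` column), a minimal sketch could look like the following. The name `get_empty_docs_sketch` and the definition of "empty" (missing or whitespace-only text) are assumptions.

```python
import pandas as pd

def get_empty_docs_sketch(collection):
    """Return (docids, index labels) of rows whose answer text is empty.

    Assumption: "empty" means NaN/None or a whitespace-only string.
    """
    is_empty = collection["doc"].isna() | (
        collection["doc"].astype(str).str.strip() == ""
    )
    empty_docs = collection.loc[is_empty, "docid"].tolist()
    empty_id = collection.index[is_empty].tolist()
    return empty_docs, empty_id

# Toy data: one real answer, one missing answer, one blank answer
toy = pd.DataFrame({"docid": [3, 31, 56], "doc": ["some answer", None, "   "]})
toy_empty_docs, toy_empty_id = get_empty_docs_sketch(toy)

toy_cleaned = toy.drop(toy_empty_id)  # drop by index label, as in the cell above
print(toy_empty_docs, toy_empty_id, len(toy_cleaned))
```

Returning both the docids and the index labels lets the same result clean the answer collection (by index) and the QA pairs (by `docid`).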


# **Prepare data for Anserini**

In [0]:
# Write collection df to file
save_tsv("retriever/collection_cleaned.tsv", collection_cleaned)

# Convert collection df to JSON file for Anserini's document indexer
collection_to_json("retriever/collection_json/docs.json", "retriever/collection_cleaned.tsv")
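
For context, Anserini's JSON collection format expects each document as an object with an `id` and a `contents` field. The project's `collection_to_json` is not shown, so the sketch below is only an illustration of producing that layout. The helper name `rows_to_anserini_json` and the choice to write a single JSON array are assumptions.

```python
import json
import os
import tempfile

def rows_to_anserini_json(rows, json_path):
    """Write (docid, text) rows as a JSON array of {"id", "contents"} objects.

    Hypothetical helper; the notebook's collection_to_json may differ.
    """
    docs = [{"id": str(docid), "contents": text} for docid, text in rows]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(docs, f, ensure_ascii=False)
    return len(docs)

toy_rows = [
    (3, "I'm not saying I don't like the idea of on-the-job training too"),
    (31, "So nothing preventing false ratings"),
]
toy_json_path = os.path.join(tempfile.mkdtemp(), "docs.json")
n_written = rows_to_anserini_json(toy_rows, toy_json_path)

# Read the file back to confirm the structure
with open(toy_json_path, encoding="utf-8") as f:
    toy_docs = json.load(f)
print(n_written, toy_docs[0]["id"])
```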

# **Process data**

In [0]:
processed_answers = process_answers(collection_cleaned)
processed_questions = process_questions(queries)
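
`process_answers` and `process_questions` live in `src/process_data.py`. The sketch below guesses at the normalization from the `q_processed` and `doc_processed` columns printed in the following cells: text is lowercased, apostrophes are dropped ("I'm" becomes "im"), and other punctuation appears to be replaced by a space ("on-the-job" becomes "on the job"). Keeping `$` is inferred from its presence in the vocabulary. The function names and regular expressions are assumptions, and whitespace splitting is used here instead of the NLTK tokenizer the notebook loads.

```python
import re

def normalize_sketch(text):
    """Lowercase, drop apostrophes, and turn other punctuation into spaces."""
    text = text.lower()
    text = re.sub(r"['’]", "", text)           # "don't" -> "dont"
    text = re.sub(r"[^a-z0-9$\s]", " ", text)  # keep letters, digits, "$"
    return text

def tokenize_sketch(text):
    """Split the normalized text on whitespace."""
    return normalize_sketch(text).split()

toy_question = "What is considered a business expense on a business trip?"
toy_tokens = tokenize_sketch(toy_question)
print(toy_tokens)
print(len(toy_tokens))
```

For this question the sketch yields 10 tokens, which agrees with the `q_len` reported for `qid` 0 in the table that follows, although agreement on one example does not show that the real preprocessing is identical.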

In [6]:
print("Processed and tokenized questions")
processed_questions.head(5)

Processed and tokenized questions


| | qid | question | q_processed | tokenized_q | q_len |
|---|---|---|---|---|---|
| 0 | 0 | What is considered a business expense on a bus... | what is considered a business expense on a bus... | [what, is, considered, a, business, expense, o... | 10 |
| 1 | 1 | Claiming business expenses for a business with... | claiming business expenses for a business with... | [claiming, business, expenses, for, a, busines... | 9 |
| 2 | 2 | Transferring money from One business checking ... | transferring money from one business checking ... | [transferring, money, from, one, business, che... | 10 |
| 3 | 3 | Having a separate bank account for business/in... | having a separate bank account for business in... | [having, a, separate, bank, account, for, busi... | 13 |
| 4 | 4 | Business Expense - Car Insurance Deductible Fo... | business expense   car insurance deductible fo... | [business, expense, car, insurance, deductible... | 13 |


In [8]:
print("Processed and tokenized answers")
processed_answers.head(5)

Processed and tokenized answers


| | docid | doc | doc_processed | tokenized_ans | ans_len |
|---|---|---|---|---|---|
| 0 | 3 | I'm not saying I don't like the idea of on-the... | im not saying i dont like the idea of on the j... | [im, not, saying, i, dont, like, the, idea, of... | 76 |
| 1 | 31 | So nothing preventing false ratings besides ad... | so nothing preventing false ratings besides ad... | [so, nothing, preventing, false, ratings, besi... | 78 |
| 2 | 56 | You can never use a health FSA for individual ... | you can never use a health fsa for individual ... | [you, can, never, use, a, health, fsa, for, in... | 74 |
| 3 | 59 | Samsung created the LCD and other flat screen ... | samsung created the lcd and other flat screen ... | [samsung, created, the, lcd, and, other, flat,... | 54 |
| 4 | 63 | Here are the SEC requirements: The federal sec... | here are the sec requirements  the federal sec... | [here, are, the, sec, requirements, the, feder... | 222 |


In [7]:
avg_ans_count = processed_answers['ans_len'].mean()
avg_q_count = processed_questions['q_len'].mean()

print("Average answer length: {}".format(round(avg_ans_count)))
print("Average question length: {}".format(round(avg_q_count)))

Average answer length: 136
Average question length: 11


In [21]:
print("Total answers: {}".format(len(processed_answers)))
print("Number of answers with length greater than 512: {}".format(len(processed_answers[processed_answers['ans_len'] > 512])))

Total answers: 57600
Number of answers with length greater than 512: 1233
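
To put the two counts above in proportion, the short calculation below derives the share of answers longer than 512 tokens. It uses only the printed totals; the variable names are illustrative.

```python
total_answers = 57600  # "Total answers" printed above
long_answers = 1233    # answers with ans_len > 512, printed above

share_long = long_answers / total_answers
print("Share of answers longer than 512 tokens: {:.2%}".format(share_long))
```

Roughly 2% of the answers exceed that length.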


# **Create vocabulary**

In [0]:
word2index, word2count = create_vocab(processed_answers, processed_questions)

print("Vocab size: {}".format(len(word2index)))
print("Top {} common words: {}".format(35, Counter(word2count).most_common(35)))

Vocab size: 85034
Top 35 common words: [('the', 371203), ('to', 233559), ('a', 201620), ('you', 166702), ('and', 163066), ('of', 157574), ('is', 129894), ('in', 120019), ('that', 111416), ('for', 89366), ('it', 83822), ('i', 74100), ('your', 68153), ('are', 67255), ('if', 60689), ('be', 59266), ('on', 58382), ('have', 55754), ('as', 50088), ('this', 49868), ('not', 49227), ('or', 46080), ('with', 45894), ('they', 44485), ('but', 41690), ('can', 38863), ('will', 36865), ('at', 35548), ('an', 31392), ('money', 31003), ('so', 29980), ('$', 29096), ('would', 28750), ('from', 28582), ('more', 27378)]
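
`create_vocab` returns a word-to-index map and a word-to-count map, but its implementation is not shown. One common way to build such a pair is sketched below. The name `create_vocab_sketch`, the reserved `<pad>` and `<unk>` entries, and the frequency-ordered index assignment are all assumptions rather than the project's actual behavior.

```python
from collections import Counter

def create_vocab_sketch(token_lists, reserved=("<pad>", "<unk>")):
    """Count tokens and assign each word an integer index.

    Assumptions: reserved tokens take the lowest indices, and the remaining
    words are indexed from most to least frequent.
    """
    word2count = Counter()
    for tokens in token_lists:
        word2count.update(tokens)

    word2index = {tok: i for i, tok in enumerate(reserved)}
    for word, _ in word2count.most_common():
        word2index[word] = len(word2index)
    return word2index, dict(word2count)

toy_corpus = [["the", "money", "the"], ["money", "the", "$"]]
toy_word2index, toy_word2count = create_vocab_sketch(toy_corpus)

print(toy_word2index)
print(Counter(toy_word2count).most_common(2))
```

Wrapping the counts in `Counter`, as the cell above does, is what makes `most_common` available on a plain dictionary of counts.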


In [0]:
qid_to_text, docid_to_text = id_to_text(collection, queries)
qid_to_tokenized_text, docid_to_tokenized_text = id_to_tokenized_text(processed_answers, processed_questions)

In [0]:
# Save objects to pickle
save_pickle("data/qa_lstm_tokenizer/word2index.pickle", word2index)
save_pickle("data/qa_lstm_tokenizer/word2count.pickle", word2count)

# id map to raw text
save_pickle("data/id_to_text/qid_to_text.pickle", qid_to_text)
save_pickle("data/id_to_text/docid_to_text.pickle", docid_to_text)

# id map to tokenized text
save_pickle("data/qa_lstm_tokenizer/qid_to_tokenized_text.pickle", qid_to_tokenized_text)
save_pickle("data/qa_lstm_tokenizer/docid_to_tokenized_text.pickle", docid_to_tokenized_text)
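
`save_pickle` takes a path followed by the object to store. A minimal sketch of such a helper, together with a matching loader for reading the saved objects back in later notebooks, could look like this. Both function names are hypothetical, and creating missing parent directories is an added convenience that the real helper may not provide.

```python
import os
import pickle
import tempfile

def save_pickle_sketch(path, obj):
    """Serialize obj to path, creating parent directories if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle_sketch(path):
    """Load a previously pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip a small vocabulary through a temporary directory
toy_vocab = {"the": 0, "money": 1}
toy_pickle_path = os.path.join(tempfile.mkdtemp(), "tokenizer", "word2index.pickle")
save_pickle_sketch(toy_pickle_path, toy_vocab)
restored_vocab = load_pickle_sketch(toy_pickle_path)
print(restored_vocab == toy_vocab)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.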