This notebook:
1. Loads and cleans the raw data
2. Prepares the data for the Anserini retriever
3. Splits the data into train, test, and validation sets
4. Pre-processes and tokenizes the cleaned data
5. Creates a vocabulary from the corpus

In [36]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [39]:
from utils import *
from prepare_data import *
from process_data import *

path = "drive/My Drive/Thesis/"

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# **Load data**

In [0]:
# Document id and Answer text
collection = load_answers_to_df(path + "data/raw/FiQA_train_doc_final.tsv")
# Question id and Question text
queries = load_questions_to_df(path + "data/raw/FiQA_train_question_final.tsv")
# Question id and Answer id pair
qid_docid = load_qid_docid_to_df(path + "data/raw/FiQA_train_question_doc_final.tsv")

In [41]:
print("Document id and Answer text")
collection.head(5)

Document id and Answer text


Unnamed: 0,docid,doc
0,3,I'm not saying I don't like the idea of on-the...
1,31,So nothing preventing false ratings besides ad...
2,56,You can never use a health FSA for individual ...
3,59,Samsung created the LCD and other flat screen ...
4,63,Here are the SEC requirements: The federal sec...


In [42]:
print("Question id and Question text")
queries.head(5)

Question id and Question text


Unnamed: 0,qid,question
0,0,What is considered a business expense on a bus...
1,1,Claiming business expenses for a business with...
2,2,Transferring money from One business checking ...
3,3,Having a separate bank account for business/in...
4,4,Business Expense - Car Insurance Deductible Fo...


In [43]:
print("Question id and Answer id pair")
qid_docid.head(5)

Question id and Answer id pair


Unnamed: 0,qid,docid
0,0,18850
1,1,14255
2,2,308938
3,3,296717
4,3,100764


In [44]:
print("Number of answers: {}".format(len(collection)))
print("Number of questions: {}".format(len(queries)))
print("Number of QA pairs: {}".format(len(qid_docid)))

Number of answers: 57638
Number of questions: 6648
Number of QA pairs: 17110


# **Clean data**

In [45]:
# Cleaning data
empty_docs, empty_id = get_empty_docs(collection)
# Remove empty answers from collection of answers
collection_cleaned = collection.drop(empty_id)
# Remove empty answers from qa pairs
qid_docid = qid_docid[~qid_docid['docid'].isin(empty_docs)]

print("Number of answers after cleaning: {}".format(len(collection_cleaned)))
print("Number of QA pairs after cleaning: {}".format(len(qid_docid)))

Number of answers after cleaning: 57600
Number of QA pairs after cleaning: 17072


# **Prepare data for Anserini**

In [0]:
# Write collection df to file
save_tsv(path + "data/retrieval/collection_cleaned.tsv", collection_cleaned)

# Convert collection df to JSON file for Anserini's document indexer
collection_to_json(path + "data/retrieval/collection_json/docs.json", path + "data/retrieval/collection_cleaned.tsv")
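
The conversion step above might look roughly like the sketch below. It assumes the cleaned TSV holds one `docid<TAB>doc` pair per line with no header row, and it writes each answer in the `id`/`contents` shape that Anserini's JSON collection reader expects. `tsv_to_anserini_docs` is a hypothetical stand-in, not the notebook's `collection_to_json`.

```python
import csv
import io
import json

def tsv_to_anserini_docs(tsv_file):
    """Read docid<TAB>doc rows and return a list of Anserini-style dicts."""
    # QUOTE_NONE keeps quotes and apostrophes in the answer text untouched.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    docs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        docid, text = row[0], row[1]
        docs.append({"id": docid, "contents": text})
    return docs

# Tiny in-memory sample standing in for collection_cleaned.tsv (made-up rows).
sample = io.StringIO("3\tFirst example answer text\n31\tSecond example answer text\n")
docs = tsv_to_anserini_docs(sample)
json_text = json.dumps(docs, indent=2)
print(json_text)
```

Keeping the ids as strings avoids a type mismatch when the retriever later returns document ids as text.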

# **Split data into train, test, and validation sets**

In [0]:
# Split QA pairs
train_label, test_label, valid_label = split_label(qid_docid)

# Save label
save_pickle(path + "data/retrieval/train/qid_rel_train.pickle", train_label)
save_pickle(path + "data/retrieval/test/qid_rel_test.pickle", test_label)
save_pickle(path + "data/retrieval/valid/qid_rel_valid.pickle", valid_label)

In [33]:
print("Train set label dictionary\n")
take(5, train_label.items())

Train set label dictionary



[(0, [18850]),
 (1, [14255]),
 (2, [308938]),
 (3, [296717, 100764, 314352, 146317]),
 (4, [196463])]
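
The label dictionary shown above maps each question id to the list of its relevant answer ids. A minimal sketch of how such a dictionary could be built and split by question is given below. The function names, the split fractions, and the seed are illustrative assumptions, not the actual behaviour of `split_label`.

```python
import random

def build_label_dict(pairs):
    """Group (qid, docid) pairs into {qid: [docid, ...]} in first-seen order."""
    labels = {}
    for qid, docid in pairs:
        labels.setdefault(qid, []).append(docid)
    return labels

def subset(labels, qids):
    """Keep only the entries of the label dict whose qid is in qids."""
    return {q: labels[q] for q in qids}

def split_by_question(labels, test_frac=0.05, valid_frac=0.10, seed=42):
    """Split by question id so that no question lands in two sets."""
    qids = list(labels)
    random.Random(seed).shuffle(qids)
    n_test = round(len(qids) * test_frac)
    n_valid = round(len(qids) * valid_frac)
    test_ids = qids[:n_test]
    valid_ids = qids[n_test:n_test + n_valid]
    train_ids = qids[n_test + n_valid:]
    return subset(labels, train_ids), subset(labels, test_ids), subset(labels, valid_ids)

# The first QA pairs printed earlier in the notebook.
pairs = [(0, 18850), (1, 14255), (2, 308938), (3, 296717), (3, 100764)]
labels = build_label_dict(pairs)

# Synthetic label dict with 100 questions to exercise the split.
synthetic = {q: [q * 10] for q in range(100)}
train, test, valid = split_by_question(synthetic)
print(len(train), len(test), len(valid))
```

Splitting on question ids rather than on individual QA pairs keeps all answers for a question in the same set, which avoids leakage between training and evaluation.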

In [0]:
# Split Questions
train_questions, test_questions, valid_questions = split_question(train_label, test_label, valid_label, queries)

# Save the questions dataset
save_tsv(path + "data/retrieval/train/train_questions.tsv", train_questions)
save_tsv(path + "data/retrieval/test/test_questions.tsv", test_questions)
save_tsv(path + "data/retrieval/valid/valid_questions.tsv", valid_questions)

In [34]:
print("Train set questions")
train_questions.head(5)

Train set questions


Unnamed: 0,qid,question
0,0,What is considered a business expense on a bus...
1,1,Claiming business expenses for a business with...
2,2,Transferring money from One business checking ...
3,3,Having a separate bank account for business/in...
4,4,Business Expense - Car Insurance Deductible Fo...


In [17]:
# Number of questions in each set
print("Number of questions in the training set: {}".format(len(train_questions)))
print("Number of questions in the testing set: {}".format(len(test_questions)))
print("Number of questions in the validation set: {}".format(len(valid_questions)))

Number of questions in the training set: 5681
Number of questions in the testing set: 333
Number of questions in the validation set: 632


# **Process data**

In [0]:
processed_answers = process_answers(collection_cleaned)
processed_questions = process_questions(queries)

In [59]:
print("Processed and tokenized questions")
processed_questions.head(5)

Processed and tokenized questions


Unnamed: 0,qid,question,q_processed,tokenized_q,q_len
0,0,What is considered a business expense on a bus...,what is considered a business expense on a bus...,"[what, is, considered, a, business, expense, o...",10
1,1,Claiming business expenses for a business with...,claiming business expenses for a business with...,"[claiming, business, expenses, for, a, busines...",9
2,2,Transferring money from One business checking ...,transferring money from one business checking ...,"[transferring, money, from, one, business, che...",10
3,3,Having a separate bank account for business/in...,having a separate bank account for business in...,"[having, a, separate, bank, account, for, busi...",13
4,4,Business Expense - Car Insurance Deductible Fo...,business expense car insurance deductible fo...,"[business, expense, car, insurance, deductible...",13
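
The processed columns above suggest that the text is lowercased, punctuation such as `/` and `-` is replaced by spaces, and the result is tokenized. The sketch below reproduces that behaviour under stated assumptions: `preprocess` and `tokenize` are hypothetical helpers, whitespace splitting stands in for the NLTK tokenizer the notebook downloads, and the example question is made up.

```python
import re

def preprocess(text):
    """Lowercase, replace punctuation (except '$') with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9$\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split the preprocessed text on whitespace."""
    return preprocess(text).split()

# Invented example question, not taken from the FiQA data.
question = "Is a $500 deductible tax-deductible for business/personal use?"
q_processed = preprocess(question)
tokens = tokenize(question)
q_len = len(tokens)
print(q_processed)
print(tokens, q_len)
```

Unlike this sketch, a tokenizer such as NLTK's typically separates `$` from the number that follows it, which would be consistent with `$` appearing as its own entry in the vocabulary counts.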


In [55]:
avg_ans_count = processed_answers['ans_len'].mean()
avg_q_count = processed_questions['q_len'].mean()

print("Average answer length: {}".format(round(avg_ans_count)))
print("Average question length: {}".format(round(avg_q_count)))

Average answer length: 136
Average question length: 11


# **Create vocabulary**

In [60]:
from collections import Counter  # used below for most_common()

word2index, word2count = create_vocab(processed_answers, processed_questions)

print("Vocab size: {}".format(len(word2index)))
print("Top {} common words: {}".format(35, Counter(word2count).most_common(35)))

Vocab size: 85034
Top 35 common words: [('the', 371203), ('to', 233559), ('a', 201620), ('you', 166702), ('and', 163066), ('of', 157574), ('is', 129894), ('in', 120019), ('that', 111416), ('for', 89366), ('it', 83822), ('i', 74100), ('your', 68153), ('are', 67255), ('if', 60689), ('be', 59266), ('on', 58382), ('have', 55754), ('as', 50088), ('this', 49868), ('not', 49227), ('or', 46080), ('with', 45894), ('they', 44485), ('but', 41690), ('can', 38863), ('will', 36865), ('at', 35548), ('an', 31392), ('money', 31003), ('so', 29980), ('$', 29096), ('would', 28750), ('from', 28582), ('more', 27378)]
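
For reference, a vocabulary of this kind can be built by counting tokens across answers and questions and assigning each distinct word an integer index. The sketch below is an assumption about the general approach, not the notebook's `create_vocab`: the function name `build_vocab`, the special tokens `<pad>` and `<unk>`, and the frequency-ordered indexing are all illustrative choices.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Return (word2index, word2count) for an iterable of token lists."""
    word2count = Counter()
    for tokens in token_lists:
        word2count.update(tokens)
    # Reserve the lowest indices for the special tokens.
    word2index = {tok: i for i, tok in enumerate(specials)}
    # Assign the remaining indices in order of decreasing frequency.
    for word, _ in word2count.most_common():
        word2index.setdefault(word, len(word2index))
    return word2index, dict(word2count)

# Made-up tokenized answers and questions.
answers = [["the", "money", "is", "the", "point"]]
questions = [["is", "the", "money", "safe"]]
word2index_demo, word2count_demo = build_vocab(answers + questions)
print(len(word2index_demo), word2count_demo["the"])
```

Reserving a padding and an unknown-word index up front is a common convention when the vocabulary later feeds an embedding layer.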


In [0]:
# Map ids to raw text, using the cleaned collection so empty answers are excluded
qid_to_text, docid_to_text = id_to_text(collection_cleaned, queries)
qid_to_tokenized_text, docid_to_tokenized_text = id_to_tokenized_text(processed_answers, processed_questions)

In [0]:
# Save objects to pickle (path already ends with "/")
save_pickle(path + "data/word2index.pickle", word2index)
save_pickle(path + "data/word2count.pickle", word2count)
save_pickle(path + "data/qid_to_text.pickle", qid_to_text)
save_pickle(path + "data/docid_to_text.pickle", docid_to_text)
save_pickle(path + "data/qid_to_tokenized_text.pickle", qid_to_tokenized_text)
save_pickle(path + "data/docid_to_tokenized_text.pickle", docid_to_tokenized_text)
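
A quick round trip is a cheap way to confirm that each saved file contains the object its name suggests. The sketch below assumes `save_pickle(file_path, obj)` is a thin wrapper around `pickle.dump`; `save_pickle_sketch` and `load_pickle_sketch` are hypothetical stand-ins, and the dictionary is a made-up sample written to a temporary directory.

```python
import os
import pickle
import tempfile

def save_pickle_sketch(file_path, obj):
    """Serialize obj to file_path."""
    with open(file_path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle_sketch(file_path):
    """Load and return the object stored at file_path."""
    with open(file_path, "rb") as f:
        return pickle.load(f)

# Made-up mapping from a document id to its tokens.
docid_to_tokens_demo = {3: ["example", "answer", "tokens"]}

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "docid_to_tokenized_text.pickle")
    save_pickle_sketch(target, docid_to_tokens_demo)
    restored = load_pickle_sketch(target)

print(restored == docid_to_tokens_demo)
```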