<a href="https://colab.research.google.com/github/seecode4/seeRepo1/blob/main/capstone/explore3datasets/trivia_qa_read_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

TriviaQA

TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples.

It was introduced in the ACL 2017 paper "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension."
The paper presents two baseline algorithms: a feature-based classifier and a state-of-the-art neural network that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%).

The capstone project can use this dataset to train models and try to improve on the performance reported in the paper. We can also explore how well the models generalize, for example by evaluating them on the Jeopardy game show dataset.

In [1]:
# Get the dataset tar.gz file and decompress it
url = "https://nlp.cs.washington.edu/triviaqa/data/triviaqa-unfiltered.tar.gz"
!pwd
!ls
!curl {url} | tar xz

/content
sample_data
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  603M  100  603M    0     0  12.5M      0  0:00:48  0:00:48 --:--:-- 12.3M


In [2]:
# Check directory content
!du -h /content/triviaqa-unfiltered
!df
!ls -l /content/triviaqa-unfiltered/
# !cat /content/triviaqa-unfiltered/README

2.9G	/content/triviaqa-unfiltered
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay        112947452 35227340  77703728  32% /
tmpfs              65536        0     65536   0% /dev
shm              5989376        0   5989376   0% /dev/shm
/dev/root        2019696  1180612    839084  59% /usr/sbin/docker-init
tmpfs            6645228      112   6645116   1% /var/colab
/dev/sda1       73032084 55416760  17598940  76% /kaggle/input
tmpfs            6645228        0   6645228   0% /proc/acpi
tmpfs            6645228        0   6645228   0% /proc/scsi
tmpfs            6645228        0   6645228   0% /sys/firmware
total 2938248
-rw-rw-r-- 1 1000 1000       3260 May  4  2017 README
-rw-rw-r-- 1 1000 1000  311475524 Jul 18  2017 unfiltered-web-dev.json
-rw-rw-r-- 1 1000 1000  283196950 Jul 18  2017 unfiltered-web-test-without-answers.json
-rw-rw-r-- 1 1000 1000 2414078634 Jul 18  2017 unfiltered-web-train.json


Since these files are large, explore ijson, a library for reading big JSON files as a stream instead of loading them into memory at once.

In [3]:
%%time
# Ref: https://www.kaggle.com/code/xxxxyyyy80008/how-to-read-a-big-json-file-with-python/notebook
# Number of lines in the test file. The JSON spans many text lines per record,
# so this counts lines, not question-answer samples.
test_file = "/content/triviaqa-unfiltered/unfiltered-web-test-without-answers.json"
with open(test_file, encoding='utf-8') as f:
    num_lines_test = sum(1 for _ in f)
print(f'Num. of lines in test file: {num_lines_test:,}')

Num. of lines in test file: 3,837,331
CPU times: user 2.13 s, sys: 72.9 ms, total: 2.2 s
Wall time: 2.27 s


In [4]:
%%time
# Number of lines in the train file (text lines, not question-answer samples)
train_file = "/content/triviaqa-unfiltered/unfiltered-web-train.json"
with open(train_file, encoding='utf-8') as f:
    num_lines_train = sum(1 for _ in f)
print(f'Num. of lines in train file: {num_lines_train:,}')

Num. of lines in train file: 34,184,870
CPU times: user 15.3 s, sys: 880 ms, total: 16.2 s
Wall time: 16.3 s
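
The counts above are text lines, not QA samples: each file holds a single top-level JSON object whose records are spread over many lines. A minimal sketch of the difference, using a tiny made-up payload (the `Question` and `Answer` values are illustrative and not taken from the dataset):

```python
import json

# Tiny stand-in for a TriviaQA-style file: one top-level object with a "Data" list.
payload = {
    "Data": [
        {"Question": "q1", "Answer": {"Value": "a1"}},
        {"Question": "q2", "Answer": {"Value": "a2"}},
    ]
}

# Pretty-printing spreads each record over several text lines.
text = json.dumps(payload, indent=4)

num_lines = len(text.splitlines())
num_samples = len(json.loads(text)["Data"])

print(f"lines: {num_lines}, samples: {num_samples}")
```

The number of samples is the length of the `Data` list, which for the real files is far smaller than the line count.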


In [5]:
%%time
# pd.read_json(..., lines=True, chunksize=...) expects JSON Lines (one record per line),
# but these files hold a single JSON object, so reading in chunks did not help.
# Instead, parse the data as a stream of <prefix, event, value> tuples with ijson.
!pip install ijson
import ijson
import pandas as pd
# chunksize = 100000
# chunks = pd.read_json(train_file, lines=True, chunksize=chunksize)
# print(type(chunks))

Collecting ijson
  Downloading ijson-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
Downloading ijson-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114.5/114.5 kB 2.8 MB/s eta 0:00:00
Installing collected packages: ijson
Successfully installed ijson-3.3.0
CPU times: user 588 ms, sys: 67.9 ms, total: 656 ms
Wall time: 7.19 s


Prefixes describe the location of keys or names in the object tree.
Events:
*   report value types (for example, string)
*   mark the start and end of arrays (start_array, end_array)
*   mark the start and end of objects (start_map, end_map)
*   mark keys (map_key)

In [7]:
# Navigate the data
# Ref: https://www.aylakhan.tech/?p=27
cnt = 0
with open(train_file, 'r', encoding='utf-8') as fp:
    parser = ijson.parse(fp)
    for prefix, event, value in parser:
        # Print map boundaries, the first 10 events, and every 10th event after that
        if event in ('start_map', 'end_map') or cnt % 10 == 0 or cnt < 10:
            print('prefix={}, event={}, value={}, cnt={}'.format(prefix, event, value, cnt))
        cnt += 1
        # Stop after 70 events or at the first end_map event, whichever comes first
        if cnt > 70 or event == 'end_map':
            break
print('--------')
# Same walk over the test file, with printing disabled
cnt = 0
with open(test_file, 'r', encoding='utf-8') as fp:
    parser = ijson.parse(fp)
    for prefix, event, value in parser:
        # print('prefix={}, event={}, value={}'.format(prefix, event, value))
        cnt += 1
        if cnt > 50 or event == 'end_map':
            break

prefix=, event=start_map, value=None, cnt=0
prefix=, event=map_key, value=Data, cnt=1
prefix=Data, event=start_array, value=None, cnt=2
prefix=Data.item, event=start_map, value=None, cnt=3
prefix=Data.item, event=map_key, value=Answer, cnt=4
prefix=Data.item.Answer, event=start_map, value=None, cnt=5
prefix=Data.item.Answer, event=map_key, value=Aliases, cnt=6
prefix=Data.item.Answer.Aliases, event=start_array, value=None, cnt=7
prefix=Data.item.Answer.Aliases.item, event=string, value=Presidency of Harry S. Truman, cnt=8
prefix=Data.item.Answer.Aliases.item, event=string, value=Hary truman, cnt=9
prefix=Data.item.Answer.Aliases.item, event=string, value=Harry Shipp Truman, cnt=10
prefix=Data.item.Answer.Aliases.item, event=string, value=HST (president), cnt=20
prefix=Data.item.Answer.Aliases.item, event=string, value=Harold Truman, cnt=30
prefix=Data.item.Answer.NormalizedAliases.item, event=string, value=truman administration, cnt=40
prefix=Data.item.Answer.NormalizedAliases.item, ev

Streaming with ijson is one way to approach the data.
The original TriviaQA project also appears to provide utilities for reading and interpreting the data files. These need more exploration.

In [8]:
# Ref: https://github.com/mandarjoshi90/triviaqa/blob/ca43b5820b107f3970cf4b7d67f7db7a98117b79/utils/utils.py#L15
import json

def get_file_contents(filename, encoding='utf-8'):
    with open(filename, encoding=encoding) as f:
        content = f.read()
    return content


def read_json(filename, encoding='utf-8'):
    contents = get_file_contents(filename, encoding=encoding)
    return json.loads(contents)


def get_file_contents_as_list(file_path, encoding='utf-8', ignore_blanks=True):
    contents = get_file_contents(file_path, encoding=encoding)
    lines = contents.split('\n')
    lines = [line for line in lines if line != ''] if ignore_blanks else lines
    return lines

def read_triviaqa_data(qajson):
    # data = utils.utils.read_json(qajson)
    data = read_json(qajson)
    # read only documents and questions that are a part of clean data set
    if data['VerifiedEval']:
        clean_data = []
        for datum in data['Data']:
            if datum['QuestionPartOfVerifiedEval']:
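                if data['Domain'] == 'Web':
                    # read_clean_part is not defined in this notebook; it must be
                    # copied from the referenced triviaqa repository before this runs.
                    datum = read_clean_part(datum)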
                clean_data.append(datum)
        data['Data'] = clean_data
    return data
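
To see what the filter in `read_triviaqa_data` does, here is a small sketch on a hand-made dictionary. The records are invented, the helper name `keep_verified` is hypothetical, and the per-document cleaning that `read_clean_part` performs for the Web domain is left out. It also uses `.get` where the original indexes the keys directly.

```python
def keep_verified(data):
    """Keep only questions flagged as part of the verified evaluation set."""
    if data.get("VerifiedEval"):
        data["Data"] = [
            datum for datum in data["Data"]
            if datum.get("QuestionPartOfVerifiedEval")
        ]
    return data


# Invented records; only the key names follow the code above.
toy = {
    "VerifiedEval": True,
    "Domain": "Wikipedia",
    "Data": [
        {"QuestionId": "a", "QuestionPartOfVerifiedEval": True},
        {"QuestionId": "b", "QuestionPartOfVerifiedEval": False},
    ],
}

kept = keep_verified(toy)
kept_ids = [d["QuestionId"] for d in kept["Data"]]
print(kept_ids)
```

When `VerifiedEval` is false or missing, the data passes through unchanged, which matches the structure of the original function.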