# AIMS.au Statement Dataset Parsing Demo

This notebook shows how to parse the statements dataset, which should have been created with the
scripts located in [this folder](../qut01/data/scripts). To get an already-generated copy of the
dataset, check out the links in the top-level README.

Note: the dataset is currently packaged using deeplake. On disk, it looks like a folder of
hard-to-read files because the contents are chunked and compressed in order to simplify the
data loading process for future experiments. See
[this page](https://docs.deeplake.ai/en/latest/Datasets.html) for API documentation on the
deeplake dataset format.

This notebook was last updated on 2024-07-31 for framework v0.5.2.

In [None]:
import datetime

import numpy as np
import tqdm

import qut01

qut01.utils.logging.setup_logging_for_analysis_script()

### Opening the dataset

Note: before we actually fetch the data for a particular statement, it is not guaranteed that it will
contain a specific number of annotations, even if annotators should have provided labels for it. This
is because some annotations are discarded when parsed (due to bad formatting or other labeling issues),
and others are replaced by a single "validated" annotation so that the statement becomes part of
the gold evaluation subset.
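
To make the note above concrete, here is a toy sketch of why the number of annotations obtained after parsing can differ from the number of potential annotations. The `RawAnnotation` record and the `resolve_annotations` helper are hypothetical and exist only for illustration; they are not part of the `qut01` framework, whose actual parsing logic may differ.

```python
from dataclasses import dataclass


@dataclass
class RawAnnotation:  # hypothetical record, NOT a qut01 class
    annotator: str
    is_well_formed: bool
    is_validated: bool = False


def resolve_annotations(raw):
    """Drops malformed annotations; a validated annotation supersedes all others."""
    kept = [a for a in raw if a.is_well_formed]
    validated = [a for a in kept if a.is_validated]
    return validated[:1] if validated else kept


raw = [
    RawAnnotation("worker-1", is_well_formed=True),
    RawAnnotation("worker-2", is_well_formed=False),  # discarded when parsed
    RawAnnotation("expert", is_well_formed=True, is_validated=True),
]
resolved = resolve_annotations(raw)
print(f"{len(raw)} potential annotations -> {len(resolved)} after parsing")
```

In this made-up case, three potential annotations collapse into a single validated one, which is why the counts should only be trusted once the data has actually been fetched.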

In [None]:
# change this path to something else if you are storing the dataset in the non-default location!
try:
    dataset_path = qut01.data.dataset_parser.get_default_deeplake_dataset_path()
    print(f"Will try to open the statements dataset at: {dataset_path}")
    assert dataset_path.exists()
except Exception as e:  # includes the AssertionError raised when the path does not exist
    raise RuntimeError(
        "dataset not found!"
        "\n\t...you must define a .env file with a proper DATA_ROOT variable, or hard-code the path here!"
        "\n\t   (see the README.md and .env.template files for more details on how to use it)"
    ) from e

dataset = qut01.data.dataset_parser.get_deeplake_dataset(  # this will load the deeplake dataset itself
    dataset_path=dataset_path,
    checkout_branch=qut01.data.dataset_parser.dataset_validated_branch_name,  # to load all annotations (train-valid-test)
)
data_parser = qut01.data.dataset_parser.DataParser(  # this will give us an easy-to-use parser for the dataset
    dataset_path_or_object=dataset,
    use_processed_data_cache=False,  # we will iterate over the entire dataset below, caching might exhaust memory
)

All of the dataset columns should have been printed above; if you would like to access those
directly, you can use the `dataset` object we just created almost like a pandas dataframe. Refer to
the deeplake API documentation for more information on deeplake dataset objects.

Now, let's display some high-level statistics that do not require iterating through all the data...

In [None]:
# 'potentially' annotated statements correspond to statements that should have been annotated by workers
potentially_annotated_statement_ids = data_parser.get_potentially_annotated_statement_ids()
all_annot_sids = {sid for annot_sids in potentially_annotated_statement_ids.values() for sid in annot_sids}
print(f"\ndataset contains {len(data_parser)} statements (and {len(all_annot_sids)} with annotations)")
for annot_type, annot_sids in potentially_annotated_statement_ids.items():
    print(f"\tthere are {len(annot_sids)} annotated statements for {annot_type}")
annotated_statement_flags = [sid in all_annot_sids for sid in dataset.statement_id]
# the publication dates correspond to the date when the statement was made available online
publication_dates = [
    datetime.datetime.strptime(t.item(), "%Y-%m-%d %H:%M:%S") for t in dataset.metadata.PublishedAt.numpy()
]
print(f"{min(publication_dates).isoformat()=}")
print(f"{max(publication_dates).isoformat()=}")
# the statement period end corresponds to the end of the financial year covered by the statement
period_end_dates = [datetime.datetime.strptime(t.item(), "%Y-%m-%d") for t in dataset.metadata.PeriodEnd.numpy()]
print(f"{min(period_end_dates).isoformat()=}")
print(f"{max(period_end_dates).isoformat()=}")
# we can compute aggregate statistics for the metadata fields directly using the dataset object
print(f"total page count: {dataset.metadata.PageCount.numpy().sum()}")
print(f"total word count: {dataset.metadata.WordCount.numpy().sum()}")
# using the binary flags indicating whether statements were annotated, we can compute stats for annotated statements
print(f"annotated page count: {dataset[annotated_statement_flags].metadata.PageCount.numpy().sum()}")
print(f"annotated word count: {dataset[annotated_statement_flags].metadata.WordCount.numpy().sum()}")
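
As a standalone illustration of the two steps used in the cell above (parsing timestamp strings and restricting a sum to the annotated statements), here is a small sketch that uses only the standard library. All values are made up; only the two `strptime` format strings are taken from the cell above.

```python
import datetime

# made-up values, written in the two formats parsed in the cell above
published_at = ["2021-03-31 09:15:00", "2020-12-01 17:40:12"]
period_end = ["2020-12-31", "2020-06-30"]
page_counts = [12, 30]
annotated_flags = [True, False]  # one boolean flag per statement

published = [datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in published_at]
ended = [datetime.datetime.strptime(s, "%Y-%m-%d") for s in period_end]
print(f"earliest publication: {min(published).isoformat()}")
print(f"latest period end: {max(ended).date().isoformat()}")

# keep only the page counts of statements whose flag is set, then sum them
annotated_pages = sum(count for count, flag in zip(page_counts, annotated_flags) if flag)
print(f"annotated page count: {annotated_pages}")
```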

### Reading the data from the dataset for a random statement

Note: the statement INDEX is not the same as its IDENTIFIER; the index is a 0-based integer that identifies the
statement in the dataset itself, whereas the identifier is an arbitrary integer that identifies the statement
on the modern slavery register.
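
The distinction can be sketched with a plain Python list standing in for the parser's list of identifiers. The identifiers below are made up, and `index_by_id` is a hypothetical helper name rather than a framework attribute.

```python
statement_ids = [3072, 62, 1578, 701]  # made-up identifiers, in dataset order

# build the reverse mapping once: identifier -> 0-based dataset index
index_by_id = {sid: idx for idx, sid in enumerate(statement_ids)}

idx = index_by_id[1578]
assert statement_ids[idx] == 1578  # index -> identifier round trip
print(f"statement 1578 is stored at index {idx}")
```

Building such a dictionary once avoids a linear scan of the identifier list on every lookup, which can matter when converting many identifiers back to indices.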

Below, we pick a random statement by index, and then convert that index to the statement's identifier.

In [None]:
random_statement_idx = np.random.randint(len(data_parser))
statement_data = data_parser[random_statement_idx]
statement_id = statement_data["statement_id"]
print(f"displaying data for statement ID={statement_id}:")

# another way to get the id from the index without fetching the data is to use a data parser attribute:
expected_statement_id = data_parser.statement_ids[random_statement_idx]
assert expected_statement_id == statement_id

# some of the 'raw' tensor data can be directly accessed via the loaded batch dictionary:
print(f"\tlink: {statement_data['metadata/Link'].item()}")
print(f"\tpage count: {statement_data['metadata/PageCount'].item()}")
print(f"\tpdf size: {statement_data['pdf_data'].size} bytes")

# metadata that requires a bit of processing can be queried through the data parser:
annot_counts = data_parser.get_potential_annotation_counts(statement_data)
for annot_type, annot_count in annot_counts.items():
    print(f"\tnumber of potential {annot_type} annotations: {annot_count}")
valid_flags = data_parser.get_validated_annotation_flags(statement_data)
for annot_type, valid_flag in valid_flags.items():
    if valid_flag:
        print(f"\tannotation '{annot_type}' has been validated")

# processed text and annotations are provided on-demand via a new dataclass:
statement_processed_data = data_parser.get_processed_data(random_statement_idx, statement_data)
print(f"\tprocessed sentence count: {len(statement_processed_data.sentences)}")
print(f"\tannotation count: {len(statement_processed_data.annotations)}")
if len(statement_processed_data.annotations):
    print(f"\t\t{len(statement_processed_data.annotation_chunks)} annotation chunks")
    print(f"\t\t{len(statement_processed_data.annotated_sentence_idxs)} sentences matched to at least one annotation")
    print(f"\t\t{len(statement_processed_data.unannotated_sentence_idxs)} sentences matched to no annotation")
    if statement_processed_data.annotation_chunks:
        chunk = np.random.choice(statement_processed_data.annotation_chunks)
        assert isinstance(chunk, qut01.data.annotations.chunks.AnnotationTextChunk)
        print("\trandomly picked annotation chunk:")
        print(f"\t\ttype: {chunk.annotation.name}")
        print(f"\t\tlabel: {chunk.annotation.label.name}")
        print(f"\t\tsentence match scores: {chunk.matched_sentences_scores}")
        for sentence_idx, sentence in enumerate(chunk.sentences):
            print(f"\t\tsentence {sentence_idx + 1}/{len(chunk.sentences)}:  {sentence}")

### Loading specific annotations using the framework's annotation parsing objects

(i.e. without having to read individual statements using the above data parser)

In [None]:
import qut01.data.annotations.classes

# first, we'll make sure we read a bunch of statements that possess all annotations
target_statement_ids = [62, 701, 1578, 1583, 1702, 1880, 1907, 2407, 3072]  # from 2024-01-28 sample
target_annot_types = [
    qut01.data.annotations.classes.Criterion2Structure,
    qut01.data.annotations.classes.Criterion2Operations,
    qut01.data.annotations.classes.Criterion2SupplyChains,
    qut01.data.annotations.classes.Criterion3RiskDesc,
    qut01.data.annotations.classes.Criterion4Mitigation,
    qut01.data.annotations.classes.Criterion4Remediation,
    qut01.data.annotations.classes.Criterion5Effect,
    qut01.data.annotations.classes.Criterion6Consult,
]
# we can then use this function to get ALL ANNOTATIONS across all of the targeted statements
annotations = qut01.data.annotations.classes.get_annotations(
    dataset=data_parser,
    target_statement_ids=target_statement_ids,
    target_annot_types=target_annot_types,
)

# finally, we will compute inter-annotator-agreement scores for these annotations
iaa_iou_scores = {atype: [] for atype in target_annot_types}  # annot_type-to-score-list
for target_sid in target_statement_ids:
    for target_annot_type in target_annot_types:
        target_annots = [a for a in annotations if a.statement_id == target_sid and type(a) is target_annot_type]
        if not target_annots:
            continue  # annotations were filtered out due to errors
        assert len(target_annots) == 2  # for the annotation types we selected, we should get double-annotations
        iaa_iou = qut01.metrics.iaa.compute_inter_annotator_agreement(*target_annots)
        print(f"IAA-IoU for statement_{target_sid}, {target_annot_type.name}: {iaa_iou:.02f}")
        iaa_iou_scores[target_annot_type].append(iaa_iou)
print()
for target_annot_type in target_annot_types:
    scores = iaa_iou_scores[target_annot_type]
    iaa_iou = (sum(scores) / len(scores)) if len(scores) else 0.0
    print(f"average IAA-IoU, {target_annot_type.name} ({len(scores)} statements): {iaa_iou:.02f}")
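
For intuition about the scores printed above, here is a minimal sketch of an intersection-over-union (IoU) agreement score between two annotators. It assumes that each annotation can be reduced to the set of sentence indices it marks as relevant; this is an illustrative simplification, and the framework's `compute_inter_annotator_agreement` function may compute its score differently. The `sentence_iou` helper and the convention for two empty selections are assumptions made for this sketch.

```python
def sentence_iou(selected_a, selected_b):
    """Returns the IoU between two collections of selected sentence indices."""
    a, b = set(selected_a), set(selected_b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty selections count as full agreement
    return len(a & b) / len(union)


annotator_1 = [4, 5, 6, 12]  # made-up sentence indices
annotator_2 = [5, 6, 7]
score = sentence_iou(annotator_1, annotator_2)
print(f"IoU: {score:.2f}")
```

Here the annotators agree on two sentences out of the five that at least one of them selected, hence a score of 2/5.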

### Parsing all annotated sentences across all statements

(i.e. exhaustively, and this will be slow)

In [None]:
total_sentence_counts = []
total_relevant_sentence_counts = []
for target_sid in tqdm.tqdm(all_annot_sids, desc="parsing all annotated statements"):
    target_idx = data_parser.statement_ids.index(target_sid)
    statement_processed_data = data_parser.get_processed_data(target_idx)
    total_sentence_counts.append(len(statement_processed_data.sentences))
    total_relevant_sentence_counts.append(len(statement_processed_data.annotated_sentence_idxs))

print(f"\ttotal sentence count across annotated statements: {sum(total_sentence_counts)}")
print(f"\ttotal relevant sentence count: {sum(total_relevant_sentence_counts)}")