## Overview
This tutorial explains the three types of tasks and the datasets each one requires. It covers how to load our prepared datasets, or your own datasets, using the provided loading functions.

## Table of Contents
- [The Tasks](#task)
- [Dataset Types](#type)
- [Dataset Sources](#source)
- [Loading the Datasets](#load)
    - [Preview of examples from our combined dataset](#preview)
    - [Using your own dataset](#own)
- ...


## The Tasks
<a id='task'></a>

The three tasks, as shown in [Figure 1](../assets/tasks.png), are Sequence Classification, Span Detection, and Pair Classification. By definition:
1. Sequence Classification: Given an example sequence, does it contain any causal relationships?
2. Span Detection: Given a causal sequence example, which words in the sentence correspond to the Cause and Effect arguments? The task is to identify up to three causal relations and their spans.
3. Pair Classification: Given sentences with marked argument or entity pairs, the task is to determine whether they are causally related, such that the first argument (marked as `ARG0`) causes the second argument (`ARG1`).
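
To make the Pair Classification input concrete, the sketch below wraps a Cause span and an Effect span in marker tokens. The helper name `mark_pair`, the example sentence, and the span convention (word indices, end exclusive) are assumptions for illustration and are not part of the library.

```python
from typing import List, Tuple


def mark_pair(words: List[str],
              arg0: Tuple[int, int],
              arg1: Tuple[int, int]) -> str:
    """Wrap two non-overlapping word spans in <ARG0>/<ARG1> markers.

    Spans are (start, end) word indices with end exclusive (assumed convention).
    """
    opening = {arg0[0]: "<ARG0>", arg1[0]: "<ARG1>"}
    closing = {arg0[1] - 1: "</ARG0>", arg1[1] - 1: "</ARG1>"}
    marked = []
    for i, word in enumerate(words):
        # Attach the markers directly to the first and last word of each span.
        marked.append(opening.get(i, "") + word + closing.get(i, ""))
    return " ".join(marked)


# Hypothetical example sentence, not taken from any of the corpora.
words = "The storm caused severe flooding".split()
marked_text = mark_pair(words, arg0=(0, 2), arg1=(3, 5))
print(marked_text)
# <ARG0>The storm</ARG0> caused <ARG1>severe flooding</ARG1>
```

A pair classifier would then receive `marked_text` and predict whether `ARG0` causes `ARG1`.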

## Dataset Types
<a id='type'></a>

Correspondingly, there are three types of datasets needed for training, abbreviated as the `Seq`, `Span`, and `Pair` types.
1. `Seq` type datasets contain both causal and non-causal texts, where each unique example text is labelled with a target `s`. Causal texts refer to texts that contain causal relationships.
2. `Span` type datasets contain only causal texts. Each unique example text allows up to three causal relations. To annotate the text, we converted spans into a BIO format (Begin (B), Inside (I), Outside (O)) for two types of spans (Cause (C), Effect (E)). Therefore, there were five possible labels per word: B-C, I-C, B-E, I-E and O, and the task is to predict the label for each word. For examples with multiple relations, we sorted them based on the location of the B-C, followed by B-E if tied. This means that an earlier occurring Cause span was assigned a lower index number. See the spans in Figure 1 for an example.
3. `Pair` type datasets contain both causal and non-causal texts. The special tokens (`<ARG0>`, `</ARG0>`) mark the boundaries of a Cause span, while (`<ARG1>`, `</ARG1>`) mark the boundaries of the corresponding Effect span. Each example text may contain multiple pairs of arguments, resulting in differently located argument tokens `ARG0` and `ARG1`. For a given text of length `N` with `a` arguments, the input word vector $\vec u$ has length `N+2*a`, because each argument adds an opening and a closing special token. Finally, a tokenized sequence $\vec w$ can have multiple versions of $\vec u$ due to differently located argument tokens.
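
The BIO conversion and the ordering rule for multiple relations can be sketched as follows. This is a minimal illustration and not the preprocessing code used to build the datasets: the function names, the example sentence, and the span convention (word indices, end exclusive) are assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) word indices, end exclusive (assumed convention)


def spans_to_bio(n_words: int, cause: Span, effect: Span) -> List[str]:
    """Return one BIO tag per word for a single Cause/Effect relation."""
    tags = ["O"] * n_words
    for (start, end), suffix in ((cause, "C"), (effect, "E")):
        tags[start] = f"B-{suffix}"
        for i in range(start + 1, end):
            tags[i] = f"I-{suffix}"
    return tags


def order_relations(relations: List[Tuple[Span, Span]]) -> List[Tuple[Span, Span]]:
    """Sort relations by the position of B-C, then by B-E to break ties."""
    return sorted(relations, key=lambda rel: (rel[0][0], rel[1][0]))


# Hypothetical sentence with two causal relations, listed out of order.
words = "Heavy rain caused flooding , which closed the road".split()
relations = [
    ((3, 4), (6, 9)),  # Cause "flooding"   -> Effect "closed the road"
    ((0, 2), (3, 4)),  # Cause "Heavy rain" -> Effect "flooding"
]

tag_sets = [spans_to_bio(len(words), cause, effect)
            for cause, effect in order_relations(relations)]
for tags in tag_sets:
    print(tags)
```

After sorting, the relation whose Cause starts earliest ("Heavy rain") comes first, matching the rule that an earlier Cause span receives the lower index.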

## Dataset Sources
<a id='source'></a>

We have processed and split six corpora ([AltLex](https://aclanthology.org/P16-1135/), [BECAUSE](https://aclanthology.org/W17-0812/), [CTB](https://aclanthology.org/W14-0702/), [ESL](https://aclanthology.org/W17-2711/), [PDTB](https://catalog.ldc.upenn.edu/LDC2019T05), [Sem-Eval](https://aclanthology.org/S10-1006)) into the three dataset types described above for your convenience. The statistics are shown below.<br>
<img src="../assets/temp_statistics.png" alt="Table" width="50%"/>

## Loading the Datasets
<a id='load'></a>

To load the datasets, we provide convenient interfaces. In the [training script](../run.sh), add the `--dataset_name` argument followed by the names of the datasets you want. For example, `--dataset_name altlex because` loads the [AltLex](https://aclanthology.org/P16-1135/) and [BECAUSE](https://aclanthology.org/W17-0812/) datasets and trains the model on them. The full list of provided datasets is <code>['altlex', 'because', 'ctb', 'esl', 'esl2', 'pdtb', 'semeval2010t8', 'cnc', 'causenet', 'causenetm']</code>.
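
As a rough illustration of how a multi-value flag such as `--dataset_name altlex because` becomes a Python list, here is a minimal `argparse` sketch. It is an assumption about the general mechanism only; the actual argument parser used by the training script may be defined differently.

```python
import argparse

# Dataset names copied from the list above.
AVAILABLE = ['altlex', 'because', 'ctb', 'esl', 'esl2', 'pdtb',
             'semeval2010t8', 'cnc', 'causenet', 'causenetm']

parser = argparse.ArgumentParser()
# nargs="+" collects one or more space-separated values into a list;
# choices rejects any name that is not in the provided list.
parser.add_argument("--dataset_name", nargs="+", choices=AVAILABLE, default=None)

args = parser.parse_args(["--dataset_name", "altlex", "because"])
print(args.dataset_name)
# ['altlex', 'because']
```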

If you want to load the datasets manually, you can call our `load_cre_dataset` function directly. Its signature is:
```
def load_cre_dataset(
        dataset_name: List[str],
        do_train_val: bool,
        also_add_span_sequence_into_seq: bool = False, 
        span_augment: bool = False,
        span_files: dict = {}, 
        seq_files: dict = {}, 
        do_train: bool = True) -> Tuple[DatasetDict, DatasetDict, Tuple[int, int, int, int]]:
    """
    Loads in specified dataset from pre-processed training and testing files, or user-provided span 
    and seq files. 

    Args:
        dataset_name: A list of the dataset names to load
        do_train_val: A boolean value indicating whether to load validation datasets
        also_add_span_sequence_into_seq: A boolean value indicating whether to add span texts to sequence texts
        span_augment: A boolean value indicating whether to retain only the Cause or Effect clause as a Non-causal example to augment the span dataset
        span_files: A dictionary of user provided span data files, in the format of 
{'train':path_to_training_files, 'valid': path_to_valid_files}
        seq_files: A dictionary of user provided sequence data files, in the format of 
{'train': path_to_training_files, 'valid': path_to_valid_files}
        do_train: A boolean value indicating whether to do the training process
    Returns: A ``Tuple`` of interest
    Raises:
        ValueError: Raised if an input dataset_name doesn't exist, or if provided seq files have 0 or more than 3 causal relations per text.
    """
    ...
```

See more details in the code example below.

In [None]:
import sys
sys.path.append('..')
%load_ext autoreload
%autoreload 2

In [3]:
from _datasets.unifiedcre import load_cre_dataset, available_datasets
print('List of available datasets:', available_datasets)

"""
 Example case of loading AltLex and BECAUSE dataset,
 without adding span texts to seq texts, span augmentation or user-provided datasets,
 and load both training and validation datasets.
"""
load_cre_dataset(dataset_name=['altlex','because'], do_train_val=True, data_dir='../data')

List of available datasets: ['altlex', 'because', 'ctb', 'esl', 'esl2', 'pdtb', 'semeval2010t8', 'cnc', 'causenet', 'causenetm']




(DatasetDict({
     span_validation: Dataset({
         features: ['corpus', 'index', 'text', 'label', 'ce_tags', 'ce_tags1', 'ce_tags2'],
         num_rows: 127
     })
     span_train: Dataset({
         features: ['corpus', 'index', 'text', 'label', 'ce_tags', 'ce_tags1', 'ce_tags2'],
         num_rows: 606
     })
 }),
 DatasetDict({
     seq_validation: Dataset({
         features: ['corpus', 'index', 'text', 'label'],
         num_rows: 290
     })
     pair_validation: Dataset({
         features: ['corpus', 'index', 'text', 'label'],
         num_rows: 435
     })
     seq_train: Dataset({
         features: ['corpus', 'index', 'text', 'label'],
         num_rows: 374
     })
     pair_train: Dataset({
         features: ['corpus', 'index', 'text', 'label'],
         num_rows: 1178
     })
 }),
 (5, 1, 3, 1))

You may have experience working with the [`load_dataset`](https://huggingface.co/docs/datasets/loading) function from the HuggingFace datasets library. Our method can be viewed as a wrapper around HuggingFace [`load_dataset`](https://huggingface.co/docs/datasets/loading) that loads the three types of datasets simultaneously and applies some customized loading steps, since datasets such as those of the `Span` type need to be loaded with special care to handle their labels. Note that the two functions have different signatures.
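
For instance, `load_cre_dataset` returns a tuple rather than a single dataset object, which can be unpacked as sketched below. Plain dictionaries of `range` objects stand in for the `DatasetDict` objects so that the snippet runs without the data files. The variable names are assumptions, and the four integers are kept as an opaque tuple because their meaning is not documented in this tutorial.

```python
# Stand-in for the value returned by load_cre_dataset; the row counts are
# copied from the AltLex + BECAUSE output shown above.
result = (
    {"span_validation": range(127), "span_train": range(606)},
    {"seq_validation": range(290), "pair_validation": range(435),
     "seq_train": range(374), "pair_train": range(1178)},
    (5, 1, 3, 1),
)

# First element: Span splits. Second element: Seq and Pair splits.
span_datasets, seq_pair_datasets, stats = result

split_sizes = {name: len(split)
               for group in (span_datasets, seq_pair_datasets)
               for name, split in group.items()}
print(split_sizes)
```

With the real return value, the same indexing applies: `result[0]['span_validation']` holds the Span validation split, and `result[1]['seq_validation']` holds the Seq validation split.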

#### Preview of examples from our combined dataset
<a id='preview'></a>

In [None]:
dataset_sources_to_show = ['altlex', 'because', 'ctb', 'esl', 'pdtb', 'semeval2010t8'] # esl and pdtb not available
dataset = load_cre_dataset(
    dataset_name=dataset_sources_to_show, 
    do_train_val=True, 
    data_dir='../data'
)

In [17]:
# Span examples
dataset_sources_shown = []
for i in dataset[0]['span_validation']:
    corpus = i['corpus']
    if corpus not in dataset_sources_shown:
        print(i,'\n')
        dataset_sources_shown.append(corpus)

{'corpus': 'altlex', 'index': 'altlex_altlex_dev.tsv_4_0', 'text': ['The', 'U.S.', 'Supreme', 'Court', 'refused', 'to', 'hear', 'an', 'appeal', 'of', 'the', 'decision', 'of', 'the', 'lower', 'federal', 'courts', 'in', 'October', '1993', ',', 'meaning', 'that', 'victims', 'of', 'the', 'Bhopal', 'disaster', 'could', 'not', 'seek', 'damages', 'in', 'a', 'U.S.', 'court', '.'], 'label': 1, 'ce_tags': ['B-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'I-C', 'O', 'B-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E', 'I-E'], 'ce_tags1': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'ce_tags2': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 
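
Each entry in `ce_tags` lines up with the word at the same position in `text`, so the Cause and Effect spans can be read back from the tags. The sketch below shows one way to do this for a single relation. The helper name `decode_spans` is an assumption, and the word and tag lists are a shortened, hypothetical variant of the AltLex example above.

```python
from typing import Dict, List


def decode_spans(words: List[str], tags: List[str]) -> Dict[str, str]:
    """Collect the words tagged as Cause (C) and Effect (E) from BIO tags."""
    spans: Dict[str, List[str]] = {"C": [], "E": []}
    for word, tag in zip(words, tags):
        if tag != "O":
            # "B-C" and "I-C" both map to "C"; "B-E" and "I-E" map to "E".
            spans[tag.split("-")[1]].append(word)
    return {"cause": " ".join(spans["C"]), "effect": " ".join(spans["E"])}


# Shortened, hypothetical example in the same format as a Span record.
words = ["The", "court", "refused", "the", "appeal", ",", "meaning",
         "that", "victims", "could", "not", "seek", "damages", "."]
tags = ["B-C", "I-C", "I-C", "I-C", "I-C", "I-C", "O",
        "B-E", "I-E", "I-E", "I-E", "I-E", "I-E", "I-E"]

decoded = decode_spans(words, tags)
print(decoded)
```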

In [18]:
# Seq examples
dataset_sources_shown = []
for i in dataset[1]['seq_validation']:
    corpus = i['corpus']
    if corpus not in dataset_sources_shown:
        print(i,'\n')
        dataset_sources_shown.append(corpus)

{'corpus': 'altlex', 'index': 'altlex_altlex_dev.tsv_0_0', 'text': "The Bhopal disaster , also referred to as the Bhopal gas tragedy , was a gas leak incident in India , considered the world 's worst industrial disaster .", 'label': 0} 

{'corpus': 'because', 'index': 'because_Article247_327.ann_18_0', 'text': "But the Scientific Method does not permit any tinkering with the Indis Index 's scoring procedures.", 'label': 0} 

{'corpus': 'ctb', 'index': 'ctb_APW19980213.1320.tml_0_0', 'text': 'CANBERRA, Australia ( AP ) _ Qantas will almost double its flights between Australia and India by August in the search for new markets untouched by the crippling Asian financial crisis.', 'label': 0} 

{'corpus': 'semeval2010t8', 'index': 'semeval2010t8_test.json_0_0', 'text': 'The most common audits were about waste and recycling .', 'label': 0} 



In [19]:
# Pair examples
dataset_sources_shown = []
for i in dataset[1]['pair_validation']:
    corpus = i['corpus']
    if corpus not in dataset_sources_shown:
        print(i,'\n')
        dataset_sources_shown.append(corpus)

{'corpus': 'altlex', 'index': 'altlex_altlex_dev.tsv_0_0', 'text': "<ARG1>The Bhopal disaster , also referred to</ARG1> as <ARG0>the Bhopal gas tragedy , was a gas leak incident in India , considered the world 's worst industrial disaster .</ARG0>", 'label': 0} 

{'corpus': 'because', 'index': 'because_Article247_327.ann_3_0', 'text': 'They will then score one point for <ARG1>every subsequent issue or broadcast or Internet posting</ARG1> after <ARG0>the first offense is noted by Chatterbox</ARG0> if they continue not to report said inconvenient fact--and an additional two points on days when the news organization runs a follow-up without making note of said inconvenient fact.', 'label': 0} 

{'corpus': 'ctb', 'index': 'ctb_APW19980213.1320.tml_0_0', 'text': 'CANBERRA, Australia ( AP ) _ Qantas will almost <ARG0>double</ARG0> its flights between Australia and India by <ARG1>August</ARG1> in the search for new markets untouched by the crippling Asian financial crisis.', 'label': 0} 

{'c

#### Using your own dataset
<a id='own'></a>

In certain scenarios, you may want to use your own datasets to test the power of our unified task training. Fortunately, our dataset loaders accept user-provided training and testing files. When using our training script, it is as easy as passing your training and validation file paths to the arguments `--span_train_file`, `--span_val_file`, `--seq_train_file`, and `--seq_val_file`, with each path leading to a `csv`, `json`, or `txt` file that contains the corresponding dataset.

They can be the paths to your very own datasets, or the name of one of the public datasets for the token classification task available on the hub at https://huggingface.co/datasets/. The column names of the text and labels (for `csv` or `json` files) can be set via the arguments `--text_column_name` and `--label_column_name`.

Our `run.py` script will automatically process the input paths and handle the rest of the job before model training.
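
As an illustration of the kind of path handling involved, the sketch below collects optional train and validation paths into a dictionary keyed by split name and checks the file extension. It covers only the local-file case, and the helper `collect_files` is hypothetical; it is not the logic implemented in `run.py`.

```python
from pathlib import Path
from typing import Dict, Optional

ALLOWED_SUFFIXES = {".csv", ".json", ".txt"}


def collect_files(train_file: Optional[str],
                  val_file: Optional[str]) -> Dict[str, str]:
    """Build a {'train': ..., 'validation': ...} dict from the paths given."""
    files: Dict[str, str] = {}
    for split, path in (("train", train_file), ("validation", val_file)):
        if path is None:
            continue  # this split was not provided
        if Path(path).suffix.lower() not in ALLOWED_SUFFIXES:
            raise ValueError(f"{path}: expected a csv, json, or txt file")
        files[split] = path
    return files


span_files = collect_files("data/grouped/splits/altlex_train.csv",
                           "data/grouped/splits/altlex_test.csv")
seq_files = collect_files("data/splits/altlex_train.csv", None)
print(span_files)
print(seq_files)
```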

If you wish to load your own datasets manually using our `load_cre_dataset` function, follow the steps below: 

In [None]:
# Example input arguments
dataset_name = None # required when using customized datasets
span_augment = False
do_train = do_eval = do_predict = do_train_val = True

# Using your own files
span_train_file = 'data/grouped/splits/altlex_train.csv'
span_val_file = 'data/grouped/splits/altlex_test.csv'
seq_train_file = 'data/splits/altlex_train.csv'
seq_val_file = 'data/splits/altlex_test.csv'

# [Not sure if supported] Using huggingface datasets (https://huggingface.co/datasets)
# dataset_name = ['wikitext']
# xxx

# Load file paths into dictionaries
span_files, seq_files = {}, {}
span_files["train"] = span_train_file
span_files["validation"] = span_val_file
seq_files["train"] = seq_train_file
seq_files["validation"] = seq_val_file

# Call load_cre_dataset function
load_cre_dataset(
    dataset_name, do_train_val,
    span_augment=span_augment,
    span_files=span_files, 
    seq_files=seq_files,
    do_train=do_train
)




(DatasetDict({
     span_validation: Dataset({
         features: ['corpus', 'index', 'text', 'label', 'ce_tags', 'ce_tags1', 'ce_tags2'],
         num_rows: 115
     })
     span_train: Dataset({
         features: ['corpus', 'index', 'text', 'label', 'ce_tags', 'ce_tags1', 'ce_tags2'],
         num_rows: 300
     })
 }),
 DatasetDict({
     seq_validation: Dataset({
         features: ['corpus', 'index', 'text', 'label'],
         num_rows: 687
     })
     pair_validation: Dataset({
         features: ['corpus', 'index', 'text', 'label'],
         num_rows: 832
     })
     seq_train: Dataset({
         features: ['corpus', 'index', 'text', 'label'],
         num_rows: 854
     })
     pair_train: Dataset({
         features: ['corpus', 'index', 'text', 'label'],
         num_rows: 1222
     })
 }),
 (7, 1, 4, 2))

This concludes our dataset usage tutorial. We are now ready to move on to the model loading section.