In this tutorial, we explain the three types of tasks and the datasets each one requires. We also cover how to load our prepared datasets, or your own datasets, using the functions we provide.

The three tasks, as shown in Figure [1](assets/tasks.png), are Sequence Classification, Span Detection, and Pair Classification. By definition:
1. Sequence Classification: Given an example sequence, does it contain any causal relationships?
2. Span Detection: Given a causal sequence example, which words in the sentence correspond to the Cause and Effect arguments? The task is to identify up to three causal relations and their spans.
3. Pair Classification: Given sentences with marked argument or entity pairs, the task is to figure out if they are causally related, such that the first argument (marked as `ARG0`) causes the second argument (`ARG1`).

Correspondingly, three types of datasets are needed for training, abbreviated as the `Seq` type, the `Span` type, and the `Pair` type.
1. `Seq` type datasets contain both causal and non-causal texts, where each unique example text is labelled with a target `s`. Causal texts refer to texts that contain causal relationships. 
2. `Span` type datasets contain only causal texts. Each unique example text allows up to three causal relations. To annotate the text, we converted spans into BIO format (Begin (B), Inside (I), Outside (O)) for two types of spans (Cause (C), Effect (E)). There are therefore five possible labels per word: B-C, I-C, B-E, I-E and O, and the task is to predict the label for each word. For examples with multiple relations, we sorted them by the location of the B-C, followed by the B-E if tied. This means that an earlier-occurring Cause span was assigned a lower index number. See the spans in Figure 1 for an example.
3. `Pair` type datasets contain both causal and non-causal texts. The special tokens (`<ARG0>`, `</ARG0>`) mark the boundaries of a Cause span, while (`<ARG1>`, `</ARG1>`) mark the boundaries of the corresponding Effect span. Each example text may contain multiple pairs of arguments, resulting in differently located argument tokens `ARG0` and `ARG1`. For a text of length `N` with `a` marked arguments, the input word vector $\vec u$ has length `N+2*a`, because each argument adds one opening and one closing special token. Finally, a single tokenized sequence $\vec w$ can have multiple versions of $\vec u$ due to differently located argument tokens.
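
To make the `Span` and `Pair` formats concrete, here is a minimal sketch of how word-level spans could be turned into BIO labels and into an argument-marked word sequence. The helper names (`spans_to_bio`, `sort_relations`, `mark_pair`), the `(start, end)` span convention with an exclusive end, and the example sentence are all our own assumptions for illustration; they are not functions from this repository, and the actual preprocessing may differ.

```python
from typing import List, Tuple

# Assumption: a span is (start, end) in word indices, with `end` exclusive.
Span = Tuple[int, int]
Relation = Tuple[Span, Span]  # (cause_span, effect_span)


def spans_to_bio(n_words: int, cause: Span, effect: Span) -> List[str]:
    """Label every word with B-C, I-C, B-E, I-E or O for one causal relation."""
    labels = ["O"] * n_words
    for (start, end), tag in ((cause, "C"), (effect, "E")):
        labels[start] = f"B-{tag}"
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"
    return labels


def sort_relations(relations: List[Relation]) -> List[Relation]:
    """Order relations by the position of B-C, then by B-E if tied."""
    return sorted(relations, key=lambda rel: (rel[0][0], rel[1][0]))


def mark_pair(words: List[str], arg0: Span, arg1: Span) -> List[str]:
    """Wrap two non-overlapping spans in <ARG0>/<ARG1> boundary tokens."""
    opens = {arg0[0]: "<ARG0>", arg1[0]: "<ARG1>"}
    closes = {arg0[1]: "</ARG0>", arg1[1]: "</ARG1>"}
    marked: List[str] = []
    for i in range(len(words) + 1):
        if i in closes:  # close a span before opening the next one
            marked.append(closes[i])
        if i in opens:
            marked.append(opens[i])
        if i < len(words):
            marked.append(words[i])
    return marked


words = "The heavy rain caused severe flooding".split()
cause, effect = (0, 3), (4, 6)

bio_labels = spans_to_bio(len(words), cause, effect)
marked_words = mark_pair(words, cause, effect)

print(list(zip(words, bio_labels)))
print(" ".join(marked_words))
```

In this example `N = 6` and `a = 2`, so the marked sequence has `6 + 2*2 = 10` entries, which matches the `N+2*a` length described above.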

We have processed and split six corpora ([AltLex](), [BECAUSE](), [CTB](), [ESL](), [PDTB](), [Sem-Eval]()) into the three dataset types described above for your convenience. The statistics are shown below.<br>
<img src="assets/temp_statistics.png" alt="Table" width="50%"/>

To load the datasets, we provide convenient interfaces. In the [training script](_run.sh), add the `--dataset_name` argument followed by the names of the datasets you want. For example, `--dataset_name altlex because` loads the [AltLex]() and [BECAUSE]() datasets and trains the model on them.
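
For readers unfamiliar with multi-valued command-line flags, the sketch below shows one common way a flag such as `--dataset_name` can accept several space-separated values, using Python's standard `argparse` module. This is an assumption about how such a flag might be parsed, not a copy of our training code; the function name `parse_dataset_names` is hypothetical.

```python
import argparse
from typing import List, Optional


def parse_dataset_names(argv: Optional[List[str]] = None) -> List[str]:
    """Parse a multi-valued --dataset_name flag (illustrative only)."""
    parser = argparse.ArgumentParser()
    # nargs="+" collects one or more space-separated values into a list.
    parser.add_argument(
        "--dataset_name",
        nargs="+",
        default=[],
        help="names of the datasets to load, e.g. altlex because",
    )
    args = parser.parse_args(argv)
    return args.dataset_name


names = parse_dataset_names(["--dataset_name", "altlex", "because"])
print(names)
```

Passing the values as one flag followed by several names keeps the command line short when training on many corpora at once.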

Alternatively, you can call our `load_cre_dataset` function to load the datasets manually. See the code block below for details.