# Data Preparation

In this notebook, we create the datasets used in later demonstrations.

The figure below shows how SeqAL works with an annotation tool.

![al_cycle_v2.png](../docs/images/al_cycle_v2.png)

## Research Mode and Annotation Mode

According to the workflow, we have to provide four datasets: three labeled datasets and one unlabeled dataset. We call this **annotation mode**.

- labeled datasets
    1. seed data: a dataset used to train the model
    2. validation data: a dataset used to validate model performance during training
    3. test data: a dataset used to test the performance of the best model
- unlabeled datasets
    1. unlabeled data pool: a dataset that contains unlabeled data

If we only want to simulate the active learning cycle, we provide a `labeled data pool` instead of an `unlabeled data pool`.

- labeled datasets
    1. seed data: a dataset used to train the model
    2. validation data: a dataset used to validate model performance during training
    3. test data: a dataset used to test the performance of the best model
    4. labeled data pool: a dataset that contains gold labels to simulate real annotation work

We call this **research mode**.

More details on the two modes can be found in [TUTORIAL_5_Research_and_Annotation_Mode](../docs/TUTORIAL_5_Research_and_Annotation_Mode.md).

## Download Public Dataset

If our data belongs to a specific domain with no public labeled datasets, we first have to annotate the `seed data`, `validation data`, and `test data` ourselves.

If our domain matches that of a public dataset, we do not need to prepare the labeled data ourselves and can download the dataset directly. For example, we download CoNLL-03 from its [homepage](https://www.clips.uantwerpen.be/conll2003/ner/) and put `eng.testa`, `eng.testb`, and `eng.train` in the `data/conll_03` folder.

We can load the corpus with the script below.


```python
from seqal.datasets import ColumnCorpus

columns = {0: "text", 3: "ner"}
data_folder = "../data/conll_03"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="eng.train",
    dev_file="eng.testa",
    test_file="eng.testb",
)
```

Flair also provides some [Named Entity Recognition (NER) datasets](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md#datasets-included-in-flair).

We can download one of them with the script below.

```python
import flair.datasets

corpus = flair.datasets.NER_ENGLISH_MOVIE_SIMPLE()
print(corpus)
```

```
2022-09-07 01:06:58,168 Reading data from /Users/smap/.flair/datasets/ner_english_movie_simple
2022-09-07 01:06:58,170 Train: /Users/smap/.flair/datasets/ner_english_movie_simple/engtrain.bio
2022-09-07 01:06:58,171 Dev: None
2022-09-07 01:06:58,172 Test: /Users/smap/.flair/datasets/ner_english_movie_simple/engtest.bio
Corpus: 8797 train + 978 dev + 2443 test sentences
```


The log shows where the dataset is stored. We copy `engtrain.bio` and `engtest.bio` to the `data/ner_english_movie_simple` folder.

Below is an example sentence from `engtrain.bio`. The first column is the label and the second column is the text.

```
O	what
O	movies
O	star
B-ACTOR	bruce
I-ACTOR	willis
```
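
To make the column layout concrete, here is a minimal sketch of reading this label-first format in plain Python, without Flair. The function name `parse_bio_block` is hypothetical, and the sketch assumes that blank lines separate sentences, which the excerpt above does not show.

```python
from typing import List, Tuple


def parse_bio_block(block: str) -> List[List[Tuple[str, str]]]:
    """Parse label-first lines into sentences of (token, label) pairs.

    Assumption: columns are separated by whitespace and sentences
    are separated by blank lines.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in block.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        label, token = line.split(maxsplit=1)  # the label comes first
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


sample = "O\twhat\nO\tmovies\nO\tstar\nB-ACTOR\tbruce\nI-ACTOR\twillis\n"
sentences = parse_bio_block(sample)
print(sentences[0])
```

Returning `(token, label)` pairs puts the text first, which is the order most downstream code expects.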

This dataset does not include a validation (dev) file, so Flair creates a validation set by sampling from the training data. That is why `print(corpus)` reports 8797 train and 978 dev sentences, while `engtrain.bio` itself contains 9775 sentences (8797 + 978).

We will use this dataset to create the labeled and unlabeled datasets for the demonstration.

## Create Datasets

The training file contains 9775 sentences. Seed data is usually small, so we use 10% of the training data as labeled seed data, 10% as validation data, and the remaining 80% as the data pool.
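
The arithmetic of this split can be sketched in plain Python. This is only an illustration: the helper `split_pool`, its parameters, and the fixed random seed are hypothetical and not part of SeqAL, and integers stand in for sentences.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_pool(
    items: Sequence[T],
    seed_ratio: float = 0.1,
    dev_ratio: float = 0.1,
    rng_seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of the items and cut it into seed, validation, and pool parts."""
    shuffled = list(items)
    random.Random(rng_seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    seed_end = int(len(shuffled) * seed_ratio)
    dev_end = seed_end + int(len(shuffled) * dev_ratio)
    return shuffled[:seed_end], shuffled[seed_end:dev_end], shuffled[dev_end:]


# 9775 placeholder items stand in for the 9775 training sentences
seed, dev, pool = split_pool(range(9775))
print(len(seed), len(dev), len(pool))  # prints: 977 977 7821
```

Because the cut points are rounded down, the pool receives the remainder and ends up slightly larger than exactly 80%.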



```python
from seqal.datasets import ColumnDataset

columns = {0: "ner", 1: "text"}
pool_file = "../data/ner_english_movie_simple/engtrain.bio"
data_pool = ColumnDataset(pool_file, columns)
print(len(data_pool.sentences))
```

```
9775
```


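```python
import numpy as np

# Shuffle the sentences so that each split is a random sample of the training data
sentences = data_pool.sentences
indices = np.arange(len(sentences))
np.random.shuffle(indices)
shuffled = [sentences[i] for i in indices]

seed_end = int(len(shuffled) * 0.1)
seed_data = shuffled[:seed_end]  # first 10%
validation_data = shuffled[seed_end:2 * seed_end]  # next 10%
data_pool = shuffled[2 * seed_end:]  # remaining 80%
```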

The `seed_data` and `validation_data` are labeled, so we save them in CoNLL format with the script below.

We save `data_pool` twice: in CoNLL format with labels, and as plain text without labels. The two files serve different active learning modes. Research mode has no annotation tool, so the `data_pool` must contain gold labels to simulate the annotation step. In annotation mode, we select data from the unlabeled dataset and send it to the annotation tool, so the `data_pool` should be plain text.
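
The difference between the two formats can be illustrated with a short, library-free sketch. The helpers `to_conll` and `to_plain` are hypothetical, and the text-first column order and tab separator are assumptions. The files written by `output_labeled_data` may be laid out differently.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs


def to_conll(sentence: Sentence) -> str:
    """One token per line with its label (assumed layout: token, tab, label)."""
    return "\n".join(f"{token}\t{label}" for token, label in sentence)


def to_plain(sentence: Sentence) -> str:
    """Tokens joined by spaces, with all labels dropped."""
    return " ".join(token for token, _ in sentence)


sentence = [
    ("what", "O"),
    ("movies", "O"),
    ("star", "O"),
    ("bruce", "B-ACTOR"),
    ("willis", "I-ACTOR"),
]
print(to_conll(sentence))  # labeled form, as needed in research mode
print(to_plain(sentence))  # prints: what movies star bruce willis
```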

```python
from seqal.utils import output_labeled_data

seed_data_path = "../data/ner_english_movie_simple/engtrain_seed.bio"
validation_data_path = "../data/ner_english_movie_simple/engtrain_dev.bio"
data_pool_path = "../data/ner_english_movie_simple/labeled_data_pool.bio"

output_labeled_data(seed_data, seed_data_path, file_format="conll", tag_type="ner")
output_labeled_data(validation_data, validation_data_path, file_format="conll", tag_type="ner")
output_labeled_data(data_pool, data_pool_path, file_format="conll", tag_type="ner")
```

We read the test data and write it back to the same path with the columns reordered.

```python
columns = {0: "ner", 1: "text"}
test_file = "../data/ner_english_movie_simple/engtest.bio"
test_data = ColumnDataset(test_file, columns)

test_data_path = "../data/ner_english_movie_simple/engtest.bio"
output_labeled_data(test_data, test_data_path, file_format="conll", tag_type="ner")
```

Finally, we save the `data_pool` in plain text format.

```python
def output_plain_text(sentences: list, file_path: str) -> None:
    with open(file_path, "w", encoding="utf-8") as file:
        for sentence in sentences:
            file.write(sentence.to_plain_string())
            file.write("\n")

unlabeled_data_pool_path = "../data/ner_english_movie_simple/unlabeled_data_pool.txt"
output_plain_text(data_pool, unlabeled_data_pool_path)
```

## Summary

In each mode, we provide the datasets below.

- Research mode:
  - labeled data:
      - seed data: `engtrain_seed.bio`
      - validation data: `engtrain_dev.bio`
      - test data: `engtest.bio`
      - labeled data pool: `labeled_data_pool.bio`


- Annotation mode:
  - labeled data:
      - seed data: `engtrain_seed.bio`
      - validation data: `engtrain_dev.bio`
      - test data: `engtest.bio`
  - unlabeled data:
      - unlabeled data pool: `unlabeled_data_pool.txt`
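
As an optional sanity check, the file lists above can be verified with a small script. This is a sketch rather than part of SeqAL: the `EXPECTED_FILES` mapping and the `missing_files` helper are hypothetical names, and the folder path assumes the layout used in this notebook.

```python
from pathlib import Path
from typing import Dict, List

# File names taken from the summary above, grouped by mode
EXPECTED_FILES: Dict[str, List[str]] = {
    "research": [
        "engtrain_seed.bio",
        "engtrain_dev.bio",
        "engtest.bio",
        "labeled_data_pool.bio",
    ],
    "annotation": [
        "engtrain_seed.bio",
        "engtrain_dev.bio",
        "engtest.bio",
        "unlabeled_data_pool.txt",
    ],
}


def missing_files(data_dir: str, mode: str) -> List[str]:
    """Return the expected file names that do not exist in data_dir."""
    folder = Path(data_dir)
    return [name for name in EXPECTED_FILES[mode] if not (folder / name).is_file()]


for mode in EXPECTED_FILES:
    print(mode, "missing:", missing_files("../data/ner_english_movie_simple", mode))
```

An empty list for a mode means that every dataset needed for that mode is in place.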
