# Active Learning Cycle Annotation Mode

In this notebook, we introduce annotation mode, in which SeqAL is combined with a third-party annotation tool to run the active learning cycle.

The figure below shows how SeqAL works with an annotation tool.

![al_cycle_v2.png](../docs/images/al_cycle_v2.png)


The SeqAL workflow with an annotation tool:

- Step 1: SeqAL initializes the model on the corpus.
- Step 2: The model predicts on the unlabeled data.
- Step 3: SeqAL selects informative samples from the unlabeled data according to the predictions in step 2.
- Step 4: The annotation tool receives the selected data, and annotators assign labels to it.
- Step 5: SeqAL receives the annotated data and converts its format.
- Step 6: SeqAL adds the annotated data to the labeled data.
- Step 7: SeqAL retrains the model.




We have created the datasets below for annotation mode.

- labeled data:
    - seed data: `engtrain_seed.bio`
    - validation data: `engtrain_dev.bio`
    - test data: `engtest.bio`
- unlabeled data:
    - unlabeled data pool: `unlabeled_data_pool.txt`

The creation process is described in detail in the `data_preparation.ipynb` notebook.

## Load Corpus

We load the following datasets with the script below.

- seed data: `engtrain_seed.bio`
- validation data: `engtrain_dev.bio`
- test data: `engtest.bio`


In [19]:
from flair.embeddings import WordEmbeddings

from seqal.active_learner import ActiveLearner
from seqal.datasets import ColumnCorpus
from seqal.samplers import LeastConfidenceSampler


# 1. get the corpus
columns = {0: "text", 1: "ner"}
data_folder = "../data/ner_english_movie_simple"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="engtrain_seed.bio",
    dev_file="engtrain_dev.bio",
    test_file="engtest.bio",
)


2022-09-07 01:47:04,606 Reading data from ../data/ner_english_movie_simple
2022-09-07 01:47:04,615 Train: ../data/ner_english_movie_simple/engtrain_seed.bio
2022-09-07 01:47:04,619 Dev: ../data/ner_english_movie_simple/engtrain_dev.bio
2022-09-07 01:47:04,621 Test: ../data/ner_english_movie_simple/engtest.bio


## Initialize Active Learner

In [20]:
# 2. tagger params
tagger_params = {}
tagger_params["tag_type"] = "ner"
tagger_params["hidden_size"] = 256
embeddings = WordEmbeddings("glove")
tagger_params["embeddings"] = embeddings
tagger_params["use_rnn"] = False

# 3. trainer params
trainer_params = {}
trainer_params["max_epochs"] = 1
trainer_params["mini_batch_size"] = 32
trainer_params["learning_rate"] = 0.1
trainer_params["patience"] = 5

# 4. setup active learner
sampler = LeastConfidenceSampler()
learner = ActiveLearner(corpus, sampler, tagger_params, trainer_params)

# 5. initialize active learner
learner.initialize(dir_path="output/init_train")

2022-09-07 01:48:00,100 ----------------------------------------------------------------------------------------------------
2022-09-07 01:48:00,102 Model: "SequenceTagger(
  (embeddings): WordEmbeddings(
    'glove'
    (embedding): Embedding(400001, 100)
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=100, out_features=100, bias=True)
  (linear): Linear(in_features=100, out_features=27, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2022-09-07 01:48:00,104 ----------------------------------------------------------------------------------------------------
2022-09-07 01:48:00,105 Corpus: "Corpus: 977 train + 977 dev + 2443 test sentences"
2022-09-07 01:48:00,106 ----------------------------------------------------------------------------------------------------
2022-09-07 01:48:00,107 Parameters:
2022-09-07 01:48:00,107  - learning_rate: "0.1"
2022-09-07 01:48:00,109  - mini_batch_size: "32"




2022-09-07 01:48:00,358 epoch 1 - iter 6/31 - loss 2.50851027 - samples/sec: 1261.66 - lr: 0.100000
2022-09-07 01:48:00,427 epoch 1 - iter 9/31 - loss 2.36948741 - samples/sec: 1421.70 - lr: 0.100000
2022-09-07 01:48:00,530 epoch 1 - iter 12/31 - loss 2.24910035 - samples/sec: 939.41 - lr: 0.100000
2022-09-07 01:48:00,603 epoch 1 - iter 15/31 - loss 2.17125771 - samples/sec: 1338.95 - lr: 0.100000
2022-09-07 01:48:00,704 epoch 1 - iter 18/31 - loss 2.10833743 - samples/sec: 970.74 - lr: 0.100000
2022-09-07 01:48:00,775 epoch 1 - iter 21/31 - loss 2.06120892 - samples/sec: 1360.90 - lr: 0.100000
2022-09-07 01:48:00,852 epoch 1 - iter 24/31 - loss 2.00763254 - samples/sec: 1278.06 - lr: 0.100000
2022-09-07 01:48:00,938 epoch 1 - iter 27/31 - loss 1.96025351 - samples/sec: 1133.14 - lr: 0.100000
2022-09-07 01:48:01,013 epoch 1 - iter 30/31 - loss 1.92183433 - samples/sec: 1315.54 - lr: 0.100000
2022-09-07 01:48:01,032 -----------------------------------------------------------------------

To set up an active learner, we have to provide `corpus`, `sampler`, `tagger_params`, and `trainer_params`. 

The `sampler` is the sampling method. Here we use least confidence sampling (`LeastConfidenceSampler`).
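To make the idea behind least confidence sampling concrete, here is a minimal pure-Python sketch. It is not SeqAL's implementation: the function names `least_confidence_scores` and `select_least_confident` are hypothetical, and scoring a sentence as one minus the product of its per-token label probabilities is a simplifying assumption (it treats tokens as independent).

```python
import math
from typing import List, Sequence, Tuple


def least_confidence_scores(token_probs: Sequence[Sequence[float]]) -> List[float]:
    """Return 1 - P(predicted label sequence) for each sentence.

    token_probs[i] holds the probability of the predicted label for every
    token of sentence i. Multiplying them assumes token independence.
    """
    return [1.0 - math.prod(probs) for probs in token_probs]


def select_least_confident(
    token_probs: Sequence[Sequence[float]], query_number: int
) -> Tuple[List[int], List[int]]:
    """Return (queried indices, remaining indices), least confident first."""
    scores = least_confidence_scores(token_probs)
    # A higher score means the model is less sure, so rank in descending order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    queried = ranked[:query_number]
    remaining = sorted(ranked[query_number:])
    return queried, remaining


# Hypothetical per-token probabilities for three sentences.
token_probs = [
    [0.9, 0.9, 0.9],
    [0.5, 0.6],
    [0.99, 0.98],
]
queried, remaining = select_least_confident(token_probs, query_number=1)
print(queried, remaining)  # [1] [0, 2]
```

The second sentence is selected because its predicted labels have the lowest joint probability, so an annotation for it is expected to be the most informative.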


The `tagger_params` are the model parameters. The default model is a Bi-LSTM CRF. To shorten training time, we set `tagger_params["use_rnn"] = False`, which means that we only use the CRF model. This model is fast even on a CPU.


The `trainer_params` control the training process. We set `trainer_params["max_epochs"] = 1` for demonstration, but in a real project `20` is a more suitable choice.


After the setup, we initialize the learner by calling `learner.initialize`, which trains the model from scratch. The training log and the model are saved to `dir_path`.

Related tutorial: [Active Learner Setup](../docs/TUTORIAL_3_Active_Learner_Setup.md)

## Prepare Data Pool

In [22]:
from seqal.utils import load_plain_text

# 6. prepare unlabeled data pool
file_path = "../data/ner_english_movie_simple/unlabeled_data_pool.txt"
unlabeled_sentences = load_plain_text(file_path)

Because we are in annotation mode, the data pool is an unlabeled dataset, which we load as plain text.

Related tutorial: [Prepare Data Pool](../docs/TUTORIAL_4_Prepare_Data_Pool.md)

## Query Setup


In [23]:
# 7. query setup
query_number = 100
token_based = False
iterations = 3

# initialize the tool to read annotated data
from seqal.aligner import Aligner

aligner = Aligner()

The `query_number` is the number of data samples we query in each iteration.

The `token_based` flag decides whether we query data on the sentence level or the token level. If `token_based` is `True`, we query `100` tokens in each iteration. If `token_based` is `False`, we query `100` sentences in each iteration.

The `iterations` is the number of rounds of the active learning cycle.
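The difference between a sentence budget and a token budget can be sketched in plain Python. This is only an illustration under stated assumptions: `take_by_budget` is a hypothetical helper, the input is assumed to be already ranked by informativeness, and taking whole sentences until the token budget is reached (possibly overshooting it) is one plausible reading, not a confirmed description of how SeqAL counts tokens.

```python
from typing import List, Tuple

Sentence = List[str]  # a sentence as a list of tokens


def take_by_budget(
    ranked_sentences: List[Sentence], query_number: int, token_based: bool
) -> Tuple[List[Sentence], List[Sentence]]:
    """Split ranked sentences into (queried, remaining)."""
    if not token_based:
        # Sentence level: the budget counts sentences.
        return ranked_sentences[:query_number], ranked_sentences[query_number:]

    # Token level: keep adding whole sentences until the token budget is met.
    queried: List[Sentence] = []
    token_count = 0
    for index, sentence in enumerate(ranked_sentences):
        if token_count >= query_number:
            return queried, ranked_sentences[index:]
        queried.append(sentence)
        token_count += len(sentence)
    return queried, []


ranked = [["I", "love", "Berlin"], ["Tokyo", "is", "a", "city"], ["Hello"]]
print(take_by_budget(ranked, 2, token_based=False)[0])  # first two sentences
print(take_by_budget(ranked, 4, token_based=True)[0])   # sentences covering at least 4 tokens
```

With the same `query_number`, a token budget usually selects far fewer sentences than a sentence budget, which matters when estimating annotation effort per iteration.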


Related tutorial: [Query Setup](../docs/TUTORIAL_6_Query_Setup.md)

## Query Unlabeled Data


In [24]:
# 8. iteration
for i in range(iterations):
    # 9. query unlabeled sentences
    queried_samples, unlabeled_sentences = learner.query(
        unlabeled_sentences, query_number, token_based=token_based, research_mode=False
    )


In step 9, `learner.query()` runs the query process. The parameter `research_mode` is `False`, which means that we are running a real annotation project.

The `queried_samples` contains the samples selected by the sampling method. The `unlabeled_sentences` contains the remaining data.


Related tutorial: [Research and Annotation Mode](../docs/TUTORIAL_5_Research_and_Annotation_Mode.md)

## Get Annotated Data

The code below runs inside each iteration.

```python
    # 10. convert sentence to plain text
    queried_texts = [{"text": sent.to_plain_string()} for sent in queried_samples]
    # queried_texts:
    # [
    #   {
    #     "text": "I love Berlin"
    #   },
    #   {
    #     "text": "Tokyo is a city"
    #   }
    # ]


    # 11. send queried_texts to the annotation tool
    # annotators label the queried samples
    # the 'annotate_by_human' method must be provided by the user
    annotated_data = annotate_by_human(queried_texts)
    # annotated_data:
    # [
    #     {
    #         "text": ['I', 'love', 'Berlin'],
    #         "labels": ['O', 'O', 'B-LOC']
    #     },
    #     {
    #         "text": ['Tokyo', 'is', 'a', 'city'],
    #         "labels": ['B-LOC', 'O', 'O', 'O']
    #     }
    # ]

    # 12. convert data to sentence
    queried_samples = aligner.align_spaced_language(annotated_data)
```

In step 10, we convert the queried samples to a format that the annotation tool can receive.


In step 11, the user must provide the `annotate_by_human()` method, which sends `queried_texts` to the annotation tool and returns the annotation result.
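For a dry run of the pipeline without a real annotation tool, `annotate_by_human()` could be replaced by a stand-in like the sketch below. Everything in it is an assumption for illustration: the `KNOWN_LOCATIONS` gazetteer is hypothetical, whitespace tokenization is a simplification, and a real implementation would send the texts to an annotation tool and wait for the annotators' labels. Only the input and output formats follow the comments in the code above.

```python
from typing import Dict, List

# Hypothetical gazetteer standing in for a human annotator.
KNOWN_LOCATIONS = {"Berlin", "Tokyo"}


def annotate_by_human(queried_texts: List[Dict[str, str]]) -> List[Dict[str, List[str]]]:
    """Return tokens and BIO labels for every queried text."""
    annotated_data = []
    for item in queried_texts:
        tokens = item["text"].split()  # simplification: split on whitespace
        labels = ["B-LOC" if token in KNOWN_LOCATIONS else "O" for token in tokens]
        annotated_data.append({"text": tokens, "labels": labels})
    return annotated_data


queried_texts = [{"text": "I love Berlin"}, {"text": "Tokyo is a city"}]
annotated_data = annotate_by_human(queried_texts)
print(annotated_data[0])  # {'text': ['I', 'love', 'Berlin'], 'labels': ['O', 'O', 'B-LOC']}
```

Keeping the function signature identical to the real one means the stand-in can later be swapped for a call to the annotation tool without touching the rest of the loop.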


In step 12, we use `aligner` to convert `annotated_data` to a list of `flair.data.Sentence`. Several formats of annotated data are supported. More detail is in the tutorial below.

Related tutorial: [Annotated Data](../docs/TUTORIAL_7_Annotated_Data.md)

## Retrain Model

```python
    # 13. retrain model, the queried_samples will be added to corpus.train
    learner.teach(queried_samples, dir_path=f"output/retrain_{i}")
```

Finally, `learner.teach()` adds `queried_samples` to the training dataset and retrains the model from scratch. The retraining log and the model are saved to `dir_path`.
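Steps 8 to 13 can be gathered into a single function, sketched below. The function name `run_annotation_cycle` and its `annotate` and `output_dir` parameters are hypothetical; the calls on `learner` and `aligner` are the ones used in this notebook, and both objects are passed in as arguments so that the sketch does not depend on how they were created.

```python
from typing import Any, Callable, Dict, List


def run_annotation_cycle(
    learner: Any,
    aligner: Any,
    unlabeled_sentences: List[Any],
    annotate: Callable[[List[Dict[str, str]]], List[Dict[str, Any]]],
    query_number: int = 100,
    iterations: int = 3,
    token_based: bool = False,
    output_dir: str = "output",
) -> List[Any]:
    """Run the query, annotate, teach loop and return the remaining pool."""
    for i in range(iterations):
        # Step 9: select informative samples from the unlabeled pool.
        queried_samples, unlabeled_sentences = learner.query(
            unlabeled_sentences, query_number, token_based=token_based, research_mode=False
        )
        # Step 10: convert sentences to plain text for the annotation tool.
        queried_texts = [{"text": sent.to_plain_string()} for sent in queried_samples]
        # Step 11: collect labels through the user-provided callable.
        annotated_data = annotate(queried_texts)
        # Step 12: convert the annotated data back to sentences.
        queried_samples = aligner.align_spaced_language(annotated_data)
        # Step 13: add the new samples to the training data and retrain.
        learner.teach(queried_samples, dir_path=f"{output_dir}/retrain_{i}")
    return unlabeled_sentences
```

With the objects defined earlier, a call would look like `run_annotation_cycle(learner, aligner, unlabeled_sentences, annotate_by_human)`.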

The whole script can be found in `examples/active_learning_cycle_annotation_mode.py`.