# Active Learning Cycle in Research Mode

In this notebook, we introduce research mode. In research mode, we skip the human annotation step and simulate the active learning cycle without a third-party annotation tool.

The workflow of research mode is shown below.

![al_cycle_research_mode.png](../docs/images/al_cycle_research_mode.png)



The SeqAL workflow without an annotation tool:

- Step1: SeqAL initializes the model on the corpus.
- Step2: The model predicts on the unlabeled data.
- Step3: SeqAL selects informative data from the data pool according to the predictions in Step2.
- Step4: The predicted labels are replaced with gold annotations to simulate the annotation process.
- Step5: SeqAL adds the annotated data to the labeled data.
- Step6: The model is retrained.
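
To make the simulated cycle concrete, here is a minimal, library-free sketch of the same loop. It does not use SeqAL: `train`, `uncertainty`, and `run_cycle` are hypothetical helpers, the "model" is only a lookup table of the most frequent tag per token, and the informativeness score is simply the share of unseen tokens. The point is only to show how the gold labels already stored in the pool stand in for a human annotator.

```python
from collections import Counter


def train(labeled):
    """Stand-in model: remember the most frequent gold tag for each token."""
    counts = {}
    for tokens, tags in labeled:
        for tok, tag in zip(tokens, tags):
            counts.setdefault(tok, Counter())[tag] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}


def uncertainty(model, tokens):
    """Stand-in informativeness score: share of tokens the model has never seen."""
    if not tokens:
        return 0.0
    return sum(tok not in model for tok in tokens) / len(tokens)


def run_cycle(seed, pool, query_number, iterations):
    """Simulate the cycle; every pool item is a (tokens, gold_tags) pair."""
    labeled, pool = list(seed), list(pool)
    model = train(labeled)  # initial training on the seed data
    for _ in range(iterations):
        # Score the pool with the current model and pick the most uncertain items.
        ranked = sorted(pool, key=lambda s: uncertainty(model, s[0]), reverse=True)
        queried, pool = ranked[:query_number], ranked[query_number:]
        # Simulated annotation: the gold tags are already attached to each item.
        labeled.extend(queried)
        model = train(labeled)  # retrain on the enlarged labeled set
    return model, labeled, pool
```

Each round moves `query_number` items from the pool into the labeled set, so the pool shrinks while the training data grows.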


We have created the following datasets for research mode.


- labeled data:
    - seed data: `engtrain_seed.bio`
    - validation data: `engtrain_dev.bio`
    - test data: `engtest.bio`
    - labeled data pool: `labeled_data_pool.bio`


You can find the details of the creation process in the `data_preparation.ipynb` notebook.

## Load Corpus

The following script loads these datasets:

- seed data: `engtrain_seed.bio`
- validation data: `engtrain_dev.bio`
- test data: `engtest.bio`


In [2]:
from flair.embeddings import WordEmbeddings

from seqal.active_learner import ActiveLearner
from seqal.datasets import ColumnCorpus, ColumnDataset
from seqal.samplers import LeastConfidenceSampler


# 1. get the corpus
columns = {0: "text", 1: "ner"}
data_folder = "../data/ner_english_movie_simple"
corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file="engtrain_seed.bio",
    dev_file="engtrain_dev.bio",
    test_file="engtest.bio",
)


2022-09-07 01:18:28,909 Reading data from ../data/ner_english_movie_simple
2022-09-07 01:18:28,911 Train: ../data/ner_english_movie_simple/engtrain_seed.bio
2022-09-07 01:18:28,912 Dev: ../data/ner_english_movie_simple/engtrain_dev.bio
2022-09-07 01:18:28,912 Test: ../data/ner_english_movie_simple/engtest.bio
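
The `columns` mapping tells the loader which whitespace-separated column holds the token text and which holds the NER tag, with blank lines separating sentences. The sketch below illustrates that layout with a hypothetical `read_bio` helper; it is not how `ColumnCorpus` is implemented, and the sample lines are invented for illustration rather than taken from the dataset.

```python
def read_bio(lines, columns):
    """Parse column-formatted lines into sentences of {column_name: value} dicts."""
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append({name: parts[idx] for idx, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


# Invented sample in the two-column layout (token, then BIO tag).
sample = ["what O", "movies O", "", "bruce B-ACTOR", "willis I-ACTOR"]
sentences = read_bio(sample, {0: "text", 1: "ner"})
print(len(sentences), sentences[1][0])
```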


## Initialize Active Learner

In [3]:
# 2. tagger params
tagger_params = {}
tagger_params["tag_type"] = "ner"
tagger_params["hidden_size"] = 256
embeddings = WordEmbeddings("glove")
tagger_params["embeddings"] = embeddings
tagger_params["use_rnn"] = False

# 3. trainer params
trainer_params = {}
trainer_params["max_epochs"] = 1
trainer_params["mini_batch_size"] = 32
trainer_params["learning_rate"] = 0.1
trainer_params["patience"] = 5

# 4. setup active learner
sampler = LeastConfidenceSampler()
learner = ActiveLearner(corpus, sampler, tagger_params, trainer_params)

# 5. initialize active learner
learner.initialize(dir_path="output/init_train")

2022-09-07 01:18:38,559 ----------------------------------------------------------------------------------------------------
2022-09-07 01:18:38,560 Model: "SequenceTagger(
  (embeddings): WordEmbeddings(
    'glove'
    (embedding): Embedding(400001, 100)
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=100, out_features=100, bias=True)
  (linear): Linear(in_features=100, out_features=27, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2022-09-07 01:18:38,561 ----------------------------------------------------------------------------------------------------
2022-09-07 01:18:38,562 Corpus: "Corpus: 977 train + 977 dev + 2443 test sentences"
2022-09-07 01:18:38,562 ----------------------------------------------------------------------------------------------------
2022-09-07 01:18:38,564 Parameters:
2022-09-07 01:18:38,565  - learning_rate: "0.1"
2022-09-07 01:18:38,566  - mini_batch_size: "32"


To set up an active learner, we have to provide `corpus`, `sampler`, `tagger_params`, and `trainer_params`. 

The `sampler` is the sampling method. Here we use least confidence sampling (`LeastConfidenceSampler`).
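
As a rough illustration of the idea behind least confidence sampling, the sketch below uses the textbook definition: each sentence is scored as one minus the probability of its most likely label sequence, and the highest-scoring sentences are selected. The function names and the input probabilities are assumptions made for this example; `LeastConfidenceSampler` may compute its scores differently.

```python
def least_confidence_scores(best_sequence_probs):
    """Score each sentence as 1 - P(most likely label sequence)."""
    return [1.0 - p for p in best_sequence_probs]


def select_least_confident(best_sequence_probs, query_number):
    """Return the indices of the `query_number` least confident sentences."""
    scores = least_confidence_scores(best_sequence_probs)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:query_number]


# Hypothetical model confidences for four sentences.
print(select_least_confident([0.9, 0.4, 0.7, 0.55], query_number=2))
```

With these made-up confidences, the sentences at indices 1 and 3 are selected because the model is least sure about them.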


The `tagger_params` are the model parameters. The default model is a Bi-LSTM CRF. To reduce training time, we set `tagger_params["use_rnn"] = False`, which means that only the CRF model is used. This model is fast even on a CPU.


The `trainer_params` control the training process. We set `trainer_params["max_epochs"] = 1` for demonstration; in a real use case, `20` is a reasonable choice.


After the setup, we initialize the learner by calling `learner.initialize`. This first trains the model from scratch. The training log and model are saved to `dir_path`.

Related tutorial: [Active Learner Setup](../docs/TUTORIAL_3_Active_Learner_Setup.md)

## Prepare Data Pool

In [4]:
# 6. prepare data pool
pool_file = data_folder + "/labeled_data_pool.bio"
data_pool = ColumnDataset(pool_file, columns)
labeled_sentences = data_pool.sentences

Because we are in research mode, the data pool here is a labeled dataset.

Related tutorial: [Prepare Data Pool](../docs/TUTORIAL_4_Prepare_Data_Pool.md)

## Query Setup


In [6]:
# 7. query setup
query_number = 100
token_based = False
iterations = 3

The `query_number` is the number of data samples to query in each iteration.

The `token_based` flag controls whether we query data at the sentence level or the token level. If `token_based` is `True`, we query `100` tokens in each iteration. If `token_based` is `False`, we query `100` sentences in each iteration.
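
One plausible way to picture the difference between the two budgets is sketched below. This is an assumption for illustration, not SeqAL's actual selection logic: `take_by_budget` is a hypothetical helper that walks a ranked list of sentences and stops once the budget is used up, counting either sentences or tokens.

```python
def take_by_budget(ranked_sentences, query_number, token_based):
    """Take sentences from a ranked list until the budget is used up.

    The budget counts tokens when `token_based` is True, sentences otherwise.
    Whole sentences are kept, so a token budget may be slightly exceeded.
    """
    selected, used = [], 0
    for tokens in ranked_sentences:
        if used >= query_number:
            break
        selected.append(tokens)
        used += len(tokens) if token_based else 1
    return selected


ranked = [["a", "b", "c"], ["d", "e"], ["f"], ["g", "h"]]
print(len(take_by_budget(ranked, 2, token_based=False)))
print(len(take_by_budget(ranked, 4, token_based=True)))
```

In this sketch a token budget is filled with whole sentences, which keeps each queried sample readable in context at the cost of occasionally overshooting the budget.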

The `iterations` is the number of rounds of the active learning cycle to run.


Related tutorial: [Query Setup](../docs/TUTORIAL_6_Query_Setup.md)

## Iteration

In [7]:
# 8. iteration
for i in range(iterations):
    # 9. query labeled sentences
    queried_samples, labeled_sentences = learner.query(
        labeled_sentences, query_number, token_based=token_based, research_mode=True
    )

    # 10. retrain model, the queried_samples will be added to corpus.train
    learner.teach(queried_samples, dir_path=f"output/retrain_{i}")


2022-09-07 01:35:46,274 ----------------------------------------------------------------------------------------------------
2022-09-07 01:35:46,275 Model: "SequenceTagger(
  (embeddings): WordEmbeddings(
    'glove'
    (embedding): Embedding(400001, 100)
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=100, out_features=100, bias=True)
  (linear): Linear(in_features=100, out_features=27, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2022-09-07 01:35:46,277 ----------------------------------------------------------------------------------------------------
2022-09-07 01:35:46,278 Corpus: "Corpus: 1077 train + 977 dev + 2443 test sentences"
2022-09-07 01:35:46,279 ----------------------------------------------------------------------------------------------------
2022-09-07 01:35:46,282 Parameters:
2022-09-07 01:35:46,283  - learning_rate: "0.1"
2022-09-07 01:35:46,285  - mini_batch_size: "32"

In step 9, `learner.query()` runs the query process. The parameter `research_mode` is set to `True`, which means we only simulate the active learning cycle.

The `queried_samples` contains the samples selected by the sampling method. The `labeled_sentences` contains the remaining data.


Related tutorial: [Research and Annotation Mode](../docs/TUTORIAL_5_Research_and_Annotation_Mode.md)

Finally, `learner.teach()` adds `queried_samples` to the training dataset and retrains the model from scratch. The retraining log and model are saved to `dir_path`.

The whole script can be found in `examples/active_learning_cycle_research_mode.py`.