# **Pipeline for generic tagger**

This notebook contains a pipeline for developing a token classification model (a tagger) for historical Dutch.
The starting material is a dataframe of 350 texts from the DBNL.

This notebook builds an animal tagger, but the same set-up can be used for other entity types.

In this notebook we'll do the following things:

1. Make a list of animal names (from generic datasets and word2vec);
2. Naively tag sentences with this animal list;
3. Remove the most frequent homonyms from the animal list and tag again;
4. Train a model on the naive tags;
5. Let the model select sentences with animals;
6. Manually tag these sentences in doccano;
7. Train a new model on the manually annotated dataset;
8. Repeat steps 5-7 until you are satisfied with the results.


## **Imports**

In [1]:
import glob
import os
import pickle

import evaluate
import pandas as pd
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
    pipeline,
)

## **Load dataframe with sentences**

In [3]:
f_path = "data/dbnl_dfs/total_with_sentences_100.pkl"
df = pd.read_pickle(f_path)

## **Create animal list**

In [15]:
a = pd.read_csv(
    "data/wnt_exports/animals/gtb-export2.csv", encoding="unicode_escape", delimiter=";"
)

In [38]:
folder_path = "data/wnt_exports/animals"
csv_files = glob.glob(os.path.join(folder_path, "*.csv"))

animal_lst = []
for file_path in csv_files:
    gtb_df = pd.read_csv(file_path, encoding="unicode_escape", delimiter=";")
    animal_lst += gtb_df["Trefwoord"].tolist()
    animal_lst += gtb_df["Originele spelling"].tolist()
    animal_lst += gtb_df["Betekenis"].tolist()

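# Lowercase and deduplicate; skip non-string entries (empty CSV cells are read as NaN).
animal_lst = list({entry.lower() for entry in animal_lst if isinstance(entry, str)})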

Clean this list manually and save the list.
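
Before the manual pass, an automatic pre-clean can remove the most obvious noise. The sketch below is only an assumption about what such a step could look like; the helper name `preclean_candidates` and the single-word rule are hypothetical and not part of the original notebook. It drops non-string entries, strips whitespace, lowercases, and keeps only single alphabetic tokens, on the assumption that multi-word entries (for example definitions from the `Betekenis` column) are not usable as names.

```python
def preclean_candidates(raw_entries):
    """Return a sorted list of unique, lowercased single-word candidates."""
    cleaned = set()
    for entry in raw_entries:
        # Empty CSV cells are read by pandas as float NaN, not as str.
        if not isinstance(entry, str):
            continue
        word = entry.strip().lower()
        # Keep single alphabetic tokens only; multi-word entries are dropped.
        if word.isalpha():
            cleaned.add(word)
    return sorted(cleaned)


demo_entries = ["Hond", " kat ", float("nan"), "een soort vogel", "hond"]
precleaned = preclean_candidates(demo_entries)
print(precleaned)
```

Note that `str.isalpha()` also rejects hyphenated spellings, so a rule this strict may discard valid historical forms; the manual check remains necessary.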

In [None]:
with open("data/wnt_lsts/animal_lst_from_wnt.pickle", "wb") as f:
    pickle.dump(animal_lst, f)

### **Enlarge with Word2Vec**

Download a Dutch word2vec model.

In [None]:
!wget https://github.com/coosto/dutch-word-embeddings/releases/download/v1.0/model.bin

In [None]:
from preprocess.enlarge_reference_lst import word2vec_phishing_expedition

path_to_word2vec_model = ""  # fill in the path to the downloaded word2vec model
similarity_threshold = 0.6
with open("data/wnt_lsts/animal_lst_from_wnt.pickle", "rb") as f:
    animal_lst = pickle.load(f)


additions = word2vec_phishing_expedition(
    path_to_word2vec_model, animal_lst, similarity_threshold
)
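
The implementation of `word2vec_phishing_expedition` is not shown in this notebook. As a rough illustration of what threshold-based list expansion can mean, the sketch below keeps every vocabulary word whose cosine similarity to at least one seed word reaches the threshold. The function names and the two-dimensional toy vectors are made up for the example; a real run would use the vectors of the downloaded model instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_by_similarity(vectors, seed_words, threshold):
    """Return non-seed words that are similar enough to at least one seed word."""
    seeds = [word for word in seed_words if word in vectors]
    additions = set()
    for word, vec in vectors.items():
        if word in seed_words:
            continue
        if any(cosine(vec, vectors[seed]) >= threshold for seed in seeds):
            additions.add(word)
    return sorted(additions)


# Toy 2-d embeddings, invented for illustration only.
toy_vectors = {
    "hond": [1.0, 0.1],
    "kat": [0.9, 0.2],
    "wolf": [0.8, 0.3],
    "tafel": [0.0, 1.0],
}
toy_additions = expand_by_similarity(toy_vectors, ["hond"], threshold=0.6)
print(toy_additions)
```

A lower threshold yields more candidates but also more noise, which is why the additions are checked by hand before they are merged into the list.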

Then check these additions manually, add them to the list and save.

In [None]:
enriched_animal_lst = animal_lst + additions
with open("data/wnt_lsts/animal_lst_word2vec_enlarged.pickle", "wb") as f:
    pickle.dump(enriched_animal_lst, f)

## **Tag naively for diagnostic reasons** 

Select the dataframe with sentences and take a sample from it.

In [4]:
total_df = pd.read_pickle("data/dbnl_dfs/total_with_sentences_100.pkl")
sample_df = total_df.sample(n=100000)

Open the animal list.

In [5]:
with open("data/wnt_lsts/animal_lst.pkl", "rb") as file:
    animals = pickle.load(file)

Tag the dataframe naively and examine the most frequently tagged words. Check for homonyms.

In [50]:
from preprocess.tag_sentence import *

sample_df["tagged"] = sample_df["sentence"].apply(
    lambda x: check_for_organisms(x, animals)
)

lst_sentence = sample_df["sentence"].tolist()
lst_tags = sample_df["tagged"].tolist()
sentence_and_tagged = list(zip(lst_sentence, lst_tags))

# Collect every token that received a positive (1) tag.
tagged_tokens = []
for sentence, tagged in sentence_and_tagged:
    for token, tag in zip(sentence, tagged):
        if tag == 1:
            tagged_tokens.append(token)

diagnostic_df = pd.DataFrame(tagged_tokens, columns=["tagged_entity"])
animal_counts = diagnostic_df["tagged_entity"].value_counts()
animal_counts.to_csv("animal_counts.csv")
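
The helper `check_for_organisms` lives in the project's `preprocess` package and is not shown here. The sketch below is an assumption about the general idea, not the project's code: each sentence is taken to be a list of tokens, every token found in the reference list gets a 1, and the positively tagged tokens are counted to expose frequent homonyms. The names `naive_tag`, `animals_demo` and `sentences_demo` are hypothetical.

```python
from collections import Counter


def naive_tag(tokens, reference_words):
    """Return 1 for each token found in reference_words, else 0."""
    return [1 if token.lower() in reference_words else 0 for token in tokens]


animals_demo = {"hond", "wind"}  # "wind" is one of the homonyms removed later on
sentences_demo = [
    ["De", "hond", "blaft"],
    ["De", "wind", "waait"],
    ["Een", "harde", "wind"],
]

# Count how often each token was tagged as an animal.
tag_counts = Counter(
    token.lower()
    for tokens in sentences_demo
    for token, tag in zip(tokens, naive_tag(tokens, animals_demo))
    if tag == 1
)
print(tag_counts.most_common())
```

A word that dominates the counts without usually referring to an animal in context is a good candidate for the removal list.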

Remove the homonyms that occur frequently and save the animal list again.

In [53]:
to_be_removed = [
    "und",
    "wind",
    "sterker",
    "nadere",
    "wint",
    "ridder",
    "monnik",
    "harder",
    "tuinkamer",
    "vacht",
    "volgeling",
]
animals_without_frequent_homonyms = [
    animal for animal in animals if animal not in to_be_removed
]

with open("data/wnt_lsts/animal_lst_cleaned_for_homonyms.pkl", "wb") as f:
    pickle.dump(animals_without_frequent_homonyms, f)

## **Tag naively for training**

In [6]:
p_to_f = "data/wnt_lsts/animal_lst_cleaned_for_homonyms.pkl"
with open(p_to_f, "rb") as file:
    animals = pickle.load(file)

Select the sentences that contain animals (according to the naive tagging).

In [53]:
from preprocess.tag_sentence import *

total_df = pd.read_pickle("data/dbnl_dfs/total_with_sentences_100.pkl")
sample_df = total_df.sample(n=100000)
sample_df["tagged"] = sample_df["sentence"].apply(
    lambda x: check_for_organisms(x, animals)
)
sample_df["has_ones"] = sample_df["tagged"].apply(lambda x: 1 in x)
# .copy() avoids pandas' SettingWithCopyWarning when a column is added below.
only_positives = sample_df.loc[sample_df["has_ones"]].copy()

Filter for Dutch and Afrikaans (historical Dutch is sometimes misclassified as Afrikaans).

In [54]:
only_positives["language"] = only_positives["sentence"].apply(find_language)
only_dutch_positives = only_positives[only_positives["language"].isin(["af", "nl"])]


In [64]:
df_to_save = only_dutch_positives.drop(
    columns=["text_id", "has_ones", "language"]
).reset_index(drop=True)

In [65]:
with open("data/annotated/naively/2513_sentences.pkl", "wb") as f:
    pickle.dump(df_to_save, f)

## **Train**

In [74]:
model_checkpoint = "emanjavacas/GysBERT"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [72]:
path_to_tagged_file = "data/annotated/naively/2513_sentences.pkl"

total_ds = load_dataset("pandas", data_files=path_to_tagged_file)["train"]

total_ds = total_ds.train_test_split(test_size=0.2)


In [73]:
from preprocess.tokenize_and_align import tokenize_and_align_labels

train_ds = total_ds["train"].map(tokenize_and_align_labels, batched=True)
test_ds = total_ds["test"].map(tokenize_and_align_labels, batched=True)
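
The project's `tokenize_and_align_labels` is not shown here. A common alignment scheme for sub-word tokenizers, sketched below under that assumption, gives the word label to the first sub-word of each word and masks special tokens and later sub-words with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The `word_ids` list mimics what a fast tokenizer returns from `word_ids()`; the function name is hypothetical.

```python
def align_labels_with_word_ids(word_labels, word_ids, ignore_index=-100):
    """Expand word-level labels to token-level labels.

    word_ids maps each token to the index of its source word,
    or to None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)  # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first sub-word carries the label
        else:
            aligned.append(ignore_index)  # later sub-words are masked
        previous_word = word_id
    return aligned


word_labels_demo = [0, 1, 0]  # one label per word
word_ids_demo = [None, 0, 1, 1, 2, None]  # the second word is split in two pieces
aligned_demo = align_labels_with_word_ids(word_labels_demo, word_ids_demo)
print(aligned_demo)
```

Labelling only the first sub-word keeps the number of scored positions equal to the number of words, which makes word-level evaluation straightforward.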


In [None]:
from huggingface_hub import notebook_login
from train_model.metrics import compute_metrics

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
notebook_login()
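
The project's `compute_metrics` is imported from `train_model.metrics` and is not shown here. As a hedged illustration of the kind of score involved, the sketch below computes token-level precision, recall and F1 for the positive class and skips positions labelled `-100`. It works on label sequences that have already been reduced from logits with an argmax; the function name and the toy data are hypothetical.

```python
def token_prf(predictions, references, positive=1, ignore_index=-100):
    """Token-level precision, recall and F1 for the positive class."""
    tp = fp = fn = 0
    for pred_seq, ref_seq in zip(predictions, references):
        for pred, ref in zip(pred_seq, ref_seq):
            if ref == ignore_index:
                continue  # masked position: special token or later sub-word
            if pred == positive and ref == positive:
                tp += 1
            elif pred == positive:
                fp += 1
            elif ref == positive:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


preds_demo = [[0, 1, 1, 0], [1, 0]]
refs_demo = [[-100, 1, 0, 0], [1, 1]]
scores_demo = token_prf(preds_demo, refs_demo)
print(scores_demo)
```

Because most tokens are not animals, plain accuracy would look high even for a poor tagger, so scores for the positive class are more informative.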

In [None]:
unique_tags = ("not_an_organism", "organism")
# Avoid shadowing the built-in id() by using idx as the loop variable.
tag2id = {tag: idx for idx, tag in enumerate(unique_tags)}
id2tag = {idx: tag for tag, idx in tag2id.items()}

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint, num_labels=len(unique_tags), id2label=id2tag, label2id=tag2id
)

In [None]:
name_for_model = "is_het_een_dier_v1"

training_args = TrainingArguments(
    output_dir=name_for_model,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

## **Doccano check of model predictions**

In [None]:
from active_learning.from_ner_to_doccano import hf_output_for_doccano

model_checkpoint = "emanjavacas/GysBERT"
tokenizer = AutoTokenizer.from_pretrained(
    model_checkpoint, max_length=512, truncation=True, model_max_length=512
)
classifier = pipeline(
    task="ner",
    model="ArjanvD95/is_het_een_dier_v1",
    tokenizer=tokenizer,
    aggregation_strategy="average",
)

In [8]:
from random import sample

df_with_sentences = pd.read_csv("df_with_sentences.csv")  # have a look at this
sentence_lst = sample(df_with_sentences["sentence"].tolist(), 250)

In [None]:
json_lst = hf_output_for_doccano(sentences=sentence_lst, classifier=classifier)
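
The helper `hf_output_for_doccano` is part of the project and is not shown here. The sketch below illustrates one plausible conversion, assuming that the pipeline (with an aggregation strategy) returns dictionaries with `entity_group`, `score`, `start` and `end`, and that Doccano's sequence-labelling import expects one object per sentence with a `text` field and a `label` list of `[start, end, label]` spans. The field name `label` and the function name are assumptions; check them against the Doccano version in use.

```python
import json


def to_doccano_record(text, entities, min_score=0.0):
    """Turn pipeline entities for one sentence into a Doccano-style record."""
    spans = [
        [int(ent["start"]), int(ent["end"]), ent["entity_group"]]
        for ent in entities
        if ent["score"] >= min_score
    ]
    return {"text": text, "label": spans}


text_demo = "De hond blaft."
# Hand-written stand-in for the classifier output on text_demo.
entities_demo = [
    {"entity_group": "organism", "score": 0.98, "word": "hond", "start": 3, "end": 7}
]
record_demo = to_doccano_record(text_demo, entities_demo)
print(json.dumps(record_demo, ensure_ascii=False))
```

Pre-filling the spans this way means the annotator only has to correct the model's suggestions instead of tagging from scratch.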

In [None]:
import json

f_name = "tuesday_v1"
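# Write one JSON object per line (JSONL); UTF-8 keeps historical spellings intact.
with open(f"data/for_doccano/{f_name}", "w", encoding="utf-8") as file: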
    for obj in json_lst:
        json_line = json.dumps(obj)
        file.write(json_line + "\n")
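
After the manual annotation round, the corrected spans have to be turned back into token-level tags before a new model can be trained. The notebook does not show this step, so the sketch below is purely an assumption: it supposes that the Doccano export uses the same `text` and `label` fields as the import, that tokens are split on whitespace, and that any token overlapping an annotated span gets tag 1. The function name is hypothetical.

```python
import json
import re


def spans_to_token_tags(text, spans):
    """Split text on whitespace and tag tokens that overlap an annotated span."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        overlaps = any(
            start < match.end() and end > match.start() for start, end, _ in spans
        )
        tags.append(1 if overlaps else 0)
    return tokens, tags


# One line as it might appear in an exported JSONL file.
exported_line = '{"text": "De hond blaft.", "label": [[3, 7, "organism"]]}'
exported = json.loads(exported_line)
tokens_demo, tags_demo = spans_to_token_tags(exported["text"], exported["label"])
print(list(zip(tokens_demo, tags_demo)))
```

The resulting token and tag lists have the same shape as the naively tagged data, so under these assumptions they could be fed to the same training cells.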