# Example Notebook for annotator package
## Scientific Software Center, I. S. Ulusoy, C. Delavier, Heidelberg University
*January 2022*

This notebook introduces the basic functionality of the package. We will load short texts in English and German and annotate them using the available features.

### Import the modules of the annotator package

In [None]:
import annotator.base as be
import annotator.mspacy as msp
import annotator.mstanza as sa

# General input

The input is passed to the package as a dictionary that is read from a JSON file. Later, this will be hidden behind the user interface. For now, you load the default dictionary, which has all options preset to default values, and then replace the options that you want to change for your run.
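
The pattern described here (load the defaults, then override selected options) can be sketched with the standard library alone. The inline JSON text and the helper name `with_overrides` are illustrative assumptions and not part of the annotator package; only the keys `input`, `output` and `tool` are taken from this notebook.

```python
import copy
import json

# Stand-in for the contents of input.json (an assumed subset of the real file).
DEFAULTS_JSON = """
{
    "input": "./test/test_files/example_en.txt",
    "output": "out/output_en",
    "tool": "spacy"
}
"""


def with_overrides(defaults: dict, **overrides) -> dict:
    """Return a copy of `defaults` with the given top-level keys replaced."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    run_dict = copy.deepcopy(defaults)  # leave the defaults untouched
    run_dict.update(overrides)
    return run_dict


defaults = json.loads(DEFAULTS_JSON)
run_dict = with_overrides(defaults, tool="stanza", output="output_de")
print(run_dict["tool"], defaults["tool"])
```

Rejecting unknown keys is a design choice in this sketch: a misspelled option name fails immediately instead of being silently ignored.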

### Read the default dictionary - do not change

In [None]:
# read in input.json
default_dict = be.prepare_run.load_input_dict("input")

The main input dictionary contains three important parameters:
```
    "input": "./test/test_files/example_en.txt",
    "output": "out/output_en",
    "tool": "spacy",
```
You can print these using:

In [None]:
print(default_dict["input"])
print(default_dict["output"])
print(default_dict["tool"])

These tell the program that the data we want to annotate is stored in the file `example_en.txt` in the directory `./test/test_files/`, that the output should be written to a file identified as `output_en` in the folder `out`, and that we want to use the tool `spacy` to annotate the data.

### Modify the keys for your specific run - do change

You can now modify the values to specify a different input file, output file, and tool as follows:

In [None]:
# first make a copy of the dictionary for your run
# (a deep copy, so that changes do not alter default_dict)
import copy
mydict = copy.deepcopy(default_dict)
# now set your parameters by changing the value
# on the right-hand side of the "="
mydict["input"] = "./test/test_files/example_en.txt"
mydict["output"] = "output_en"
mydict["tool"] = "spacy"  # or "stanza" - so far, only spacy and stanza are implemented

Please note the double quotes around the values: they are essential because the values are passed as strings, so do not remove them!
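
The input dictionary is nested (for example, `spacy_dict` sits inside it), so it matters how the copy for a run is made. The sketch below uses a toy dictionary, not the real default dictionary, to contrast plain assignment, a shallow copy and `copy.deepcopy`.

```python
import copy

# Toy stand-in for the nested input dictionary (key names taken from this notebook).
default_dict = {"tool": "spacy", "spacy_dict": {"lang": "en"}}

alias = default_dict                        # same object, not a copy
shallow = copy.copy(default_dict)           # new outer dict, shared inner dicts
independent = copy.deepcopy(default_dict)   # fully separate copy

alias["tool"] = "stanza"
shallow["spacy_dict"]["lang"] = "de"

print(default_dict["tool"])                 # changed through the alias
print(default_dict["spacy_dict"]["lang"])   # changed through the shallow copy
print(independent)                          # still holds the original values
```

Only the deep copy keeps the defaults intact when nested options such as the language are changed.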

### Read in the input text to be processed as raw text - do not change

In [None]:
# read in the raw text
data = be.prepare_run.get_text(mydict["input"])

You can print the text as follows:

In [None]:
print(data)

You may also directly copy and paste text here - take care that it is surrounded by double quotes again:

In [None]:
data = "This is my text. I like it better this way."

# Process the text using spaCy
More information about spaCy can be found [here](https://spacy.io/). Generally, spaCy supports [these languages](https://spacy.io/usage/models), but at the moment only English and German are available in the annotator package. We will add more languages based on your requests - so please get in touch!

## TL;DR - the essence
You need to specify the language you would like to process. It is also good practice to specify the processing options (tokenization, part-of-speech tagging, lemmatization, etc.); if none are specified, the package selects all that are available for the language. There is currently one restriction: the output that is written to file and passed to CWB can only contain one or several of these options: sentencize, tokenize, part-of-speech, lemma. All other options are processed but not yet written to the file. Here we need more feedback on the format that CWB requires.

In [None]:
# check which language has been selected
print(mydict["spacy_dict"]["lang"])

You can change the selected language as follows:

In [None]:
mydict["spacy_dict"]["lang"] = "en"
# currently, only "en" and "de" are available

In [None]:
# find out which model is being used
print(mydict["spacy_dict"]["model"])

You will be able to change the model, if another one has been downloaded. At the moment, only `en_core_web_md` and `de_core_news_md` are available. We will add more upon request, so please get in touch!

Now select the processors that you would like to use. For the default English pipeline, the available options are `tok2vec, senter, tagger, parser, attribute_ruler, lemmatizer, ner`. Of the first two, `tok2vec` supplies the token vectors shared by the trained components and `senter` splits the text into sentences. The other options are the [dependency parser](https://spacy.io/api/dependencyparser), POS-tagging via the [attribute ruler](https://spacy.io/api/attributeruler), [lemmatization](https://spacy.io/api/lemmatizer), and [named-entity recognition](https://spacy.io/api/entityrecognizer).

In [None]:
# check which processors have been selected
print(mydict["spacy_dict"]["processors"])

In [None]:
# change these if you like as so
mydict["spacy_dict"][
    "processors"
] = "tok2vec, senter, tagger, attribute_ruler, lemmatizer"

## In more detail for the interested user

There will be more tools to choose from, but for simplicity their configurations have been stripped for now. The spaCy-specific configuration is found in the `"spacy_dict"` section of the main input dictionary. It contains the parameters that tell spaCy how to annotate the data. The entries usually come with a comment explaining what the parameters do. Let's look through the ones we set up in input.json:
```
    "model": false,
```
Here we can optionally specify a model that spaCy should use to annotate the text. We leave it set to false for now.
```
    "lang": "en",
```
Here we specify that the language of the data we want to annotate is English. Since we did not specify a model, this information is needed to choose one for us.
```
    "text_type": "news",
```
We specify what kind of text we want to annotate in order to choose an appropriate model for the task. For English, only "news" is currently supported. The setup we chose here leads to the model `en_core_web_md` being used.
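
The selection logic described for `"model"`, `"lang"` and `"text_type"` could look roughly like the sketch below. The lookup table and the function name `choose_model` are hypothetical, and the German entry is an assumption; only the two model names and the English "news" case are taken from this notebook.

```python
# Hypothetical lookup table; the ("de", "news") entry is an assumption.
DEFAULT_MODELS = {
    ("en", "news"): "en_core_web_md",
    ("de", "news"): "de_core_news_md",
}


def choose_model(spacy_dict: dict) -> str:
    """Return the explicit model if set, otherwise look one up by language and text type."""
    if spacy_dict.get("model"):  # `false` in the JSON becomes False in Python
        return spacy_dict["model"]
    key = (spacy_dict["lang"], spacy_dict["text_type"])
    try:
        return DEFAULT_MODELS[key]
    except KeyError:
        raise ValueError(f"no default model for lang/text_type {key}") from None


print(choose_model({"model": False, "lang": "en", "text_type": "news"}))
```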
```
    "processors": "tok2vec, senter, tagger, parser, attribute_ruler, lemmatizer, ner",
```
Here we specify the processors for the pipeline we will apply to our data. This defines what kind of annotations we get in the end and can also affect the performance of the pipeline. Which processors are available depends on the selected model. The module checks whether all requested processors are available before trying to load them and should tell us if there is a problem. Available pipeline components for pretrained spaCy models can be found in the [spaCy models documentation](https://spacy.io/models).
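
A check of this kind can be sketched as follows. This is not the module's actual implementation: the function names are hypothetical, and the list of available components is simply the one given for the default English pipeline in this notebook.

```python
def parse_processors(spec: str) -> list[str]:
    """Split a comma-separated processor string and strip surrounding whitespace."""
    return [name.strip() for name in spec.split(",") if name.strip()]


def check_processors(requested: str, available: list[str]) -> list[str]:
    """Return the requested processors, or raise if any of them is not available."""
    names = parse_processors(requested)
    missing = [name for name in names if name not in available]
    if missing:
        raise ValueError(f"processors not available in this model: {missing}")
    return names


# Component names as listed for the default English pipeline in this notebook.
available = ["tok2vec", "senter", "tagger", "parser",
             "attribute_ruler", "lemmatizer", "ner"]
print(check_processors("tok2vec, senter, tagger, lemmatizer", available))
```

Failing before the model is loaded keeps the error message close to the configuration mistake that caused it.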

The remaining entries are not immediately important for this example and are all set to their default values. In particular, the `"config"` parameter and its contents are defined for a given pretrained model in its config.cfg file; changing this is not recommended unless you really know what you are doing.

Next we load the tool as specified by the .json. In this case we load the spaCy pipeline from the `mspacy` module. We are told which components we load for our pipeline and which model we are using. In our case we load the models `en_core_web_md` and `de_core_news_md` with all their pipeline components. The function calls below create `spacy_pipe` objects which we can then apply to data.

## Load the spaCy pipeline and process the text

In [None]:
# load the pipeline from the config
pipe = msp.spacy_pipe(mydict)

After doing this we only have to apply the pipeline to the data we read in earlier. For this we use the `apply_to` function of the `spacy_pipe` object. This generates the annotated `spacy.Doc`, which is stored in our `spacy_pipe` object.

In [None]:
# apply pipeline to data
annotated = pipe.apply_to(data)

To extract the results of the pipeline, we can write them to a .vrt file using the output name defined in the .json. This is done with the `pass_results` function built into the `spacy_pipe` object. Doing it this way also directly encodes our results for `CWB`.

In [None]:
# get the annotated .vrt and pass to cwb
annotated.pass_results()
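
For orientation, a .vrt (verticalized text) file holds one token per line, with the token's attributes separated by tabs and structural units such as sentences marked by XML-like tags. The sketch below builds such text from a hand-written annotation; the helper `to_vrt` and the example tags are illustrative, and the exact layout written by `pass_results` may differ.

```python
# Hand-written example annotation: one list of (word, pos, lemma) per sentence.
sentences = [
    [("This", "DT", "this"), ("is", "VBZ", "be"), ("text", "NN", "text"), (".", ".", ".")],
    [("I", "PRP", "I"), ("like", "VBP", "like"), ("it", "PRP", "it"), (".", ".", ".")],
]


def to_vrt(sentences) -> str:
    """Render sentences as verticalized text: one token per line, attributes tab-separated."""
    lines = []
    for sentence in sentences:
        lines.append("<s>")
        for token in sentence:
            lines.append("\t".join(token))
        lines.append("</s>")
    return "\n".join(lines) + "\n"


print(to_vrt(sentences))
```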

Loading the pipeline, applying it and passing the results can be done conveniently in one line:

In [None]:
msp.spacy_pipe(mydict).apply_to(data).pass_results()
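
Chaining calls like this works when each step returns an object that offers the next method. The toy class below illustrates the idea; it is a hypothetical stand-in that only mimics the call order and makes no claim about how `spacy_pipe` is implemented.

```python
class ToyPipe:
    """Hypothetical stand-in that mimics the call order, not the real spacy_pipe."""

    def __init__(self, config: dict):
        self.config = config
        self.doc = None

    def apply_to(self, data: str) -> "ToyPipe":
        # Whitespace splitting stands in for the real annotation step.
        self.doc = data.split()
        return self  # returning self is what makes chaining possible

    def pass_results(self) -> str:
        if self.doc is None:
            raise RuntimeError("call apply_to() before pass_results()")
        return "\n".join(self.doc)


print(ToyPipe({"tool": "spacy"}).apply_to("This is my text.").pass_results())
```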

## Access the newly annotated corpus in cwb via cwb-ccc
This needs to be adjusted on the JupyterHub, as specific directories are required.

In [None]:
import ccc

In [None]:
from ccc import Corpora

corpora = Corpora(
    cqp_bin="/usr/local/cwb-3.4.22/bin/cqp",
    registry_path="/home/jovyan/DemoCorpus-German/registry",
)
print(corpora)
corpora.show()
# note: the registry needs to contain absolute paths for this to work
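
Because the registry needs absolute paths, it can help to check a registry file before opening the corpus. The sketch below assumes that the registry file lists the data directory and info file on lines starting with `HOME` and `INFO`; the helper name and the example registry entry are made up for illustration.

```python
import tempfile
from pathlib import Path, PurePosixPath


def relative_registry_paths(registry_file: Path) -> dict:
    """Return HOME/INFO entries of a registry file whose paths are not absolute."""
    problems = {}
    for line in registry_file.read_text(encoding="utf-8").splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2 and parts[0] in ("HOME", "INFO"):
            path = parts[1].strip().strip('"')
            if not PurePosixPath(path).is_absolute():
                problems[parts[0]] = path
    return problems


# Made-up registry entry written to a temporary file for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    registry_file = Path(tmp) / "demo"
    registry_file.write_text(
        'NAME "Demo corpus"\nID demo\nHOME data/demo\nINFO /corpora/demo/.info\n',
        encoding="utf-8",
    )
    problems = relative_registry_paths(registry_file)

print(problems)
```

An empty result means that both entries are absolute paths.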

In [None]:
# Use the newly encoded corpus

# Using a different tool: stanza
In this section we will be using a different tool to annotate our data. The only other available tool at this moment is `stanza`. Looking at the default dictionary, we now set
```
    "tool": "stanza",
```
which tells the program that we want to use `stanza`. As data we will use the German example text `example_de.txt`.

In [None]:
# first make a copy of the dictionary for your run
# (a deep copy, so that changes do not alter default_dict)
import copy
mydict = copy.deepcopy(default_dict)
# now set your parameters by changing the value
# on the right-hand side of the "="
mydict["input"] = "./test/test_files/example_de.txt"
mydict["output"] = "output_de"
mydict["tool"] = "stanza"

## Stanza input options
For processing with [stanza](https://stanfordnlp.github.io/stanza/), we pass only the stanza-specific part of the dictionary to the pipeline, and we need to clean the dictionary before using it. Currently, only tokenization, POS and lemma are implemented, for the same reasons as above. German also requires `mwt`, but the multi-word expressions are not marked as such in the generated output file. For these, p-attributes will be included at a later stage.

In [None]:
# get relevant part of dict for stanza
stanza_dict = mydict["stanza_dict"]

# remove the comments
stanza_dict = be.prepare_run.update_dict(stanza_dict)
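
The internals of `update_dict` are not shown in this notebook. As a rough illustration of what removing comments from a configuration dictionary can look like, the sketch below assumes that comment entries are stored under keys containing the word `comment`; both that convention and the helper name `strip_comments` are assumptions.

```python
def strip_comments(config: dict) -> dict:
    """Return a copy of `config` without comment entries (assumed key convention)."""
    cleaned = {}
    for key, value in config.items():
        if "comment" in key.lower():
            continue  # assumed marker for a comment entry
        # Recurse so that nested sections are cleaned as well.
        cleaned[key] = strip_comments(value) if isinstance(value, dict) else value
    return cleaned


example = {
    "lang": "de",
    "lang_comment": "language of the input text",
    "processors": "tokenize,pos,mwt,lemma",
}
print(strip_comments(example))
```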

The language is selected as follows:

In [None]:
# check which option is set
print(stanza_dict["lang"])

In [None]:
# set to a different value - currently only "en" and "de"
stanza_dict["lang"] = "de"

The processors may be set as follows:

In [None]:
# check which options are selected
print(stanza_dict["processors"])

In [None]:
# set this to different values
# please note that stanza does not allow spaces between the processor names
stanza_dict["processors"] = "tokenize,pos,mwt,lemma"
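
If a processor string was typed with spaces, it can be normalized before it is handed on. The helper below is a small sketch with a hypothetical name; it only removes whitespace and empty entries and does not check whether the processor names are valid.

```python
def normalize_processors(spec: str) -> str:
    """Remove whitespace around comma-separated processor names."""
    names = [name.strip() for name in spec.split(",")]
    return ",".join(name for name in names if name)


print(normalize_processors("tokenize, pos , mwt,lemma"))
```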

To use the processors requested in the input, we have to activate them for `stanza`.

In [None]:
# activate processors
stanza_dict = be.prepare_run.activate_procs(stanza_dict, "stanza_")

In [None]:
# get the input text
data = be.prepare_run.get_text(mydict["input"])

You may again look at the raw text like this:

In [None]:
print(data)

Or you can set it manually as follows:

In [None]:
data = "Hier ist nun das zweite Beispiel. Wir können momentan nur reinen Text einlesen."

## Stanza text processing

After setting up the dictionary we can create a `mstanza_pipeline` object from the `mstanza` module. Only the English and German models are currently installed; please get in touch if you need additional ones. We will then add them to the Hub so that you do not need to worry about downloads. For a list of available languages and models, see [here](https://stanfordnlp.github.io/stanza/available_models.html).

If you want to use your own custom model, you have to change the path set in `"dir"` to point to your model.

If you want to download another model, execute the code below, replacing the language code with that of your selected language. Note that the models are several hundred MB in size.

In [None]:
from stanza import download

download("en", model_dir="/home/jovyan/shared/stanza_resources")
stanza_dict["dir"] = "/home/jovyan/shared/stanza_resources/"

Now you can initialize the text processing pipeline.

In [None]:
# initialize the pipeline with the dict
stanza_pipe = sa.mstanza_pipeline(stanza_dict)

# to get the working pipeline we have to use the inbuilt initialize function
stanza_pipe.init_pipeline()

We can then apply the pipeline to our data through the `process_text` function of the `mstanza_pipeline` object.

In [None]:
# apply pipeline to data
results = stanza_pipe.process_text(data)

To write the results to a file and export to `CWB` we can then use the `mstanza_pipeline.postprocess` function. 

In [None]:
stanza_pipe.postprocess(mydict["output"])

After this, the newly annotated corpus can be used in CWB/CQP.