# Example Notebook for annotator package
## Scientific Software Center, I. S. Ulusoy, C. Delavier, Heidelberg University
*January 2022*

This notebook introduces the basic functionality of the package. We will load short texts in English and German and annotate them using the available features.

### Import the modules of the annotator package

In [None]:
import base as be
import mspacy as msp
import mstanza as sa

# General input

The input is passed to the package as a dictionary that is read from a JSON file. Later, this will be hidden behind the user interface. For now, you load the default dictionary, which has all options preset to default values, and then replace the options that you need for your processing run.
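As a rough illustration of this load-then-override pattern, the sketch below parses a small JSON string into a dictionary and replaces one option. The JSON content is a shortened stand-in for the real `input.json`, and `load_defaults` is a hypothetical helper, not a function of the annotator package.

```python
import json

# Shortened stand-in for input.json (assumption: the real file has more keys)
DEFAULTS_JSON = """
{
    "input": "./test/test_files/example_en.txt",
    "tool": "spacy",
    "language": "en"
}
"""


def load_defaults(text):
    """Hypothetical helper: parse JSON text into a plain dictionary."""
    return json.loads(text)


run_options = load_defaults(DEFAULTS_JSON)
# replace only the options that differ from the defaults
run_options["language"] = "de"
print(run_options["tool"], run_options["language"])
```

All keys that are not overridden keep their default values, which is why only the options relevant to a run need to be set.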

### Read the default dictionary - do not change

In [None]:
# read in input.json
default_dict = be.prepare_run.load_input_dict("input")

The main input dictionary contains these important parameters:
```
    "input": "./test/test_files/example_en.txt",
    "tool": "spacy",
    "corpus_name": "test",
    "language": "en",
    "document_type": "text",
    "processing_option": "fast",
    "processing_type": "tokenize"
```
You can print these using:

In [None]:
print(default_dict["input"])
print(default_dict["tool"])

In [None]:
# first make a copy of the dictionary for your run
# (a deep copy, so that changes to mydict do not alter default_dict)
import copy
mydict = copy.deepcopy(default_dict)

The `input` and `tool` entries tell the program that the data we want to annotate is stored in the file `example_en.txt` in the directory `./test/test_files/`, and that we want to use the tool `spacy` to annotate it.

# Available tools and options
You need to specify which language you would like to process. It is also good practice to specify the processing options, such as tokenization, part-of-speech tagging, and lemmatization; if none are specified, the package selects all that are available for the language. There is currently one restriction: the output that is generated and passed to cwb can only contain one or more of these options: sentencize, tokenize, part-of-speech, lemma. All other options are processed but are not yet written to the file. Here we need more feedback on the format that cwb requires.
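The effect of this restriction can be pictured with a small sketch that separates requested options into those that reach the output file and those that are only processed. The helper `split_by_writable` and the exact option strings are assumptions for illustration; the package may name and handle them differently.

```python
# Options that can currently be written to the output (taken from the text above;
# the exact spelling used by the package is an assumption)
WRITABLE = ("sentencize", "tokenize", "part-of-speech", "lemma")


def split_by_writable(requested):
    """Hypothetical helper: separate options that reach the output file
    from those that are processed but not written."""
    written = [opt for opt in requested if opt in WRITABLE]
    skipped = [opt for opt in requested if opt not in WRITABLE]
    return written, skipped


written, skipped = split_by_writable(["tokenize", "lemma", "ner"])
print("written to file:", written)
print("processed only:", skipped)
```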

## SpaCy
More information about SpaCy can be found [here](https://spacy.io/). Generally, SpaCy supports [these languages](https://spacy.io/usage/models), but at the moment only English and German are available in the annotator package. We will add more languages based on your requests - so please get in touch!

In [None]:
# find out which model is being used
print(mydict["spacy_dict"]["model"])

In [None]:
# check which language has been selected
print(mydict["language"])

You will be able to change the model if another one has been downloaded. At the moment, only `en_core_web_md` and `de_core_news_md` are available. We will add more upon request, so please get in touch!

Now select the processors that you would like to use. For the default English pipeline, the available options are `tok2vec, senter, tagger, parser, attribute_ruler, lemmatizer, ner`. The first two are required for tokenization; the others provide the [dependency parser](https://spacy.io/api/dependencyparser), POS tagging via the [attribute ruler](https://spacy.io/api/attributeruler), [lemmatization](https://spacy.io/api/lemmatizer), and [named-entity recognition](https://spacy.io/api/entityrecognizer).

In [None]:
# check which processors have been selected
print(mydict["processing_type"])

## Stanza
The only other tool available at the moment is `stanza`. To select it, set the `tool` key of your dictionary:
```
mydict["tool"] = "stanza"
```
For processing with [stanza](https://stanfordnlp.github.io/stanza/), only tokenization, POS tagging, and lemmatization are implemented, for the same reason as above: other annotations cannot yet be written to the output for cwb. German also requires the `mwt` processor (multi-word token expansion), but the multi-word tokens are not marked as such in the generated output file. For these, p-attributes will be included at a later stage.

Please let us know if you need models in addition to the English and German ones that are currently installed. We will then add them to the Hub, so you do not need to worry about downloads. For a list of available languages and models, see [here](https://stanfordnlp.github.io/stanza/available_models.html).

### Modify the keys for your specific run - do change

You can now modify the keys to specify a different input file, output directory, and tool as follows:

In [None]:
# first make a copy of the dictionary for your run
# (a deep copy, so that the nested options in default_dict stay unchanged)
import copy
mydict = copy.deepcopy(default_dict)
# now set your parameters:
# change the value on the right-hand side of the "="
mydict["input"] = "./test/test_files/example_en.txt"
# choose the tool - so far, only "spacy" and "stanza" are implemented
mydict["tool"] = "spacy"
# specify the output directory of the vrt file
mydict["advanced_options"]["output_dir"] = "./test/test_files/"

Please note the quotation marks `""` around the keys and values: they are essential because both are passed as strings, so do not remove them!
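When copying the default dictionary, keep in mind that a plain assignment only creates a second name for the same object, and a shallow copy still shares nested dictionaries such as `advanced_options`. The sketch below uses toy data (not the real default dictionary) to show the difference; `copy.deepcopy` is from the Python standard library.

```python
import copy

# Toy stand-in for the default dictionary
defaults = {"tool": "spacy", "advanced_options": {"output_dir": "./out/"}}

alias = defaults                # same object, not a copy
shallow = dict(defaults)        # new outer dict, but the nested dict is shared
deep = copy.deepcopy(defaults)  # fully independent copy

# changing the deep copy leaves the defaults untouched
deep["advanced_options"]["output_dir"] = "./deep/"
print(defaults["advanced_options"]["output_dir"])  # still "./out/"

# changing the shallow copy's nested dict also changes the defaults
shallow["advanced_options"]["output_dir"] = "./shallow/"
print(defaults["advanced_options"]["output_dir"])  # now "./shallow/"
```

Only the deep copy keeps the default values available for a later run with different settings.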

### Validate the input - do not change
The input is then validated to make sure all options have been set correctly.

In [None]:
be.prepare_run.validate_input_dict(mydict)
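To give an idea of what such a validation step might check, here is a minimal sketch. It does not reproduce what `validate_input_dict` does internally; `check_options` and the allowed values are assumptions based on the tools and languages mentioned in this notebook.

```python
# Allowed values as described in this notebook (assumption for illustration)
ALLOWED_TOOLS = {"spacy", "stanza"}
ALLOWED_LANGUAGES = {"en", "de"}


def check_options(options):
    """Hypothetical check: return a list of problems found in the options."""
    problems = []
    if options.get("tool") not in ALLOWED_TOOLS:
        problems.append(f"unknown tool: {options.get('tool')!r}")
    if options.get("language") not in ALLOWED_LANGUAGES:
        problems.append(f"unsupported language: {options.get('language')!r}")
    if not isinstance(options.get("input"), str):
        problems.append("input must be a string path")
    return problems


print(check_options({"tool": "spacy", "language": "en", "input": "a.txt"}))
print(check_options({"tool": "nltk", "language": "en", "input": "a.txt"}))
```

An empty list means that no problems were found; each entry otherwise names one option that needs to be corrected.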

### Read in the input text to be processed as raw text - do not change

In [None]:
# read in the raw text
data = be.prepare_run.get_text(mydict["input"])

You can print the text as follows:

In [None]:
print(data)

You may also directly copy and paste text here - take care that it is surrounded by double quotes again:

In [None]:
data = "This is my text. I like it better this way."

## Load the tool pipeline and process the text - do not change

In [None]:
# get the tool-specific dictionary (here: "spacy_dict")
subdict = mydict[mydict["tool"] + "_dict"]
# load the pipeline using the selected options
pipe = msp.spacy_pipe(subdict)

After this, we only have to apply the pipeline to the data we read in earlier.

In [None]:
# apply pipeline to data
annotated = pipe.apply_to(data)

To extract the results of the pipeline, we write them to a `.vrt` file using the output name defined in the `.json` file.

In [None]:
# get the annotated .vrt and pass to cwb
annotated.pass_results("STR", mydict, ret=False)
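The `.vrt` (verticalized text) format read by cwb places one token per line, with the annotations in tab-separated columns and structures such as sentences marked by XML-like tags. The sketch below shows that layout with made-up tokens; `to_vrt` is a hypothetical helper, and the package's own writer may differ, for example in the column order or in the escaping of special characters.

```python
def to_vrt(sentences):
    """Hypothetical helper: render tokenized sentences as VRT text.

    `sentences` is a list of sentences; each sentence is a list of
    (word, pos, lemma) tuples.
    """
    lines = []
    for sentence in sentences:
        lines.append("<s>")  # sentence start as a structural tag
        for token in sentence:
            lines.append("\t".join(token))  # one token per line
        lines.append("</s>")
    return "\n".join(lines) + "\n"


example = [[("This", "DT", "this"), ("works", "VBZ", "work"), (".", ".", ".")]]
print(to_vrt(example))
```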

Loading the pipeline, applying it and passing the results can be done conveniently in one line:

In [None]:
msp.spacy_pipe(subdict).apply_to(data).pass_results("STR", mydict, ret=False)

## Access the newly annotated corpus in cwb via cwb-ccc
This needs to be adjusted on the JupyterHub, as specific directories are required.

In [None]:
import ccc

In [None]:
from ccc import Corpora

In [None]:
corpora = Corpora(
    cqp_bin="/usr/local/bin/cqp",
    registry_path="/home/jovyan/shared/registry",
)

In [None]:
print(corpora)

In [None]:
corpus = corpora.activate(corpus_name="TEST")

In [None]:
corpus.attributes_available

In [None]:
# Use the newly encoded corpus
query = r'"if"'
dump = corpus.query(query)

In [None]:
# print the query data frame
dump.df

In [None]:
dump = corpus.query(cqp_query=query, context=20, context_break="s")