# Example Notebook for annotator package
## Scientific Software Center, I. S. Ulusoy, C. Delavier, Heidelberg University
*January 2022*

This notebook introduces the basic functionality of the package. We will load short example texts in English and German and annotate them using the available features.

### Import the modules of the annotator package

In [None]:
import annotator.base as be
import annotator.pipe as pe
import annotator.mspacy as msp
import annotator.mstanza as mst

# General input

The input is passed to the package as a dictionary that is read from a JSON file. Later, this will be hidden behind the user interface. For now, you load the default dictionary, which has all options preset to default values, and then override the options you need for your own processing run.

### Read the default dictionary - do not change

In [None]:
# read in input.json
default_dict = be.prepare_run.load_input_dict("input")

The main input dictionary contains these important parameters:
```
    "input": "./test/test_files/example_en.txt",
    "tool": "spacy",
    "corpus_name": "test",
    "language": "en",
    "document_type": "text",
    "processing_option": "fast",
    "processing_type": "tokenize"
```
You can print these using:

In [None]:
print(default_dict["input"])
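To see what such a settings dictionary looks like when it comes from a JSON file, here is a minimal, self-contained sketch that uses only the standard library. It does not call the annotator package. The dictionary `example_settings` is a stand-in that contains only the parameters listed above, and the temporary file is purely illustrative.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for input.json, holding only the parameters listed above.
example_settings = {
    "input": "./test/test_files/example_en.txt",
    "tool": "spacy",
    "corpus_name": "test",
    "language": "en",
    "document_type": "text",
    "processing_option": "fast",
    "processing_type": "tokenize",
}

# Write the settings to a temporary JSON file and read them back,
# which mimics loading a settings dictionary from disk.
with tempfile.TemporaryDirectory() as tmpdir:
    json_path = Path(tmpdir) / "input.json"
    json_path.write_text(json.dumps(example_settings, indent=4), encoding="utf-8")
    with open(json_path, encoding="utf-8") as f:
        loaded = json.load(f)

# Print every parameter instead of looking up one key at a time.
for key, value in loaded.items():
    print(f"{key}: {value}")
```

The same loop works on `default_dict` if you want to inspect all of its entries at once.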

In [None]:
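# first make an independent copy of the dictionary for your run
# (a plain assignment would only create a second name for default_dict)
import copy

mydict = copy.deepcopy(default_dict)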
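Python dictionaries are passed around by reference, so how the copy is made matters once the run settings are changed. The sketch below compares a plain assignment, a shallow copy, and a deep copy on a small stand-in dictionary. `make_defaults` and its contents are hypothetical and only mirror the nested layout of the settings dictionary.

```python
import copy


def make_defaults():
    """Return a fresh stand-in for the default settings (hypothetical content)."""
    return {"language": "en", "advanced_options": {"output_dir": "./test/out/"}}


# 1. Plain assignment: both names point to the same dictionary.
defaults = make_defaults()
alias = defaults
alias["language"] = "de"
alias_changed_defaults = defaults["language"] == "de"

# 2. Shallow copy: the outer dictionary is new, nested dictionaries are shared.
defaults = make_defaults()
shallow = defaults.copy()
shallow["advanced_options"]["output_dir"] = "./elsewhere/"
shallow_changed_defaults = (
    defaults["advanced_options"]["output_dir"] == "./elsewhere/"
)

# 3. Deep copy: nested dictionaries are copied too, so the defaults stay intact.
defaults = make_defaults()
deep = copy.deepcopy(defaults)
deep["language"] = "de"
deep["advanced_options"]["output_dir"] = "./elsewhere/"
deep_changed_defaults = defaults != make_defaults()

print("assignment changed defaults:", alias_changed_defaults)
print("shallow copy changed defaults:", shallow_changed_defaults)
print("deep copy changed defaults:", deep_changed_defaults)
```

Because the settings contain nested dictionaries such as `advanced_options`, only a deep copy keeps `default_dict` unchanged when the run settings are edited.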

The parameters listed above tell the program that the data we want to annotate is stored in the file `example_en.txt` under the path `./test/test_files/`, and that we want to use the tool `spacy` to annotate it.

## Adapt the keys for your run
You can and *should* change the following options:  
```
input  
language
document_type
corpus_name
processing_option
```
You do so by changing the content below according to your needs:

In [None]:
# Tell where to find the input file
mydict["input"] = "./test/test_files/example_de.txt"

In [None]:
# Tell what to name your corpus
mydict["corpus_name"] = "test_de"

In [None]:
# Tell the language of the document - currently "en" and "de" are available
mydict["language"] = "de"

In [None]:
# Tell which document type: This will aid in the model selection for some of the tools
# (normal text: "text", historic text: "historic", scientific text: "scientific" )
mydict["document_type"] = "text"

In [None]:
# Tell which option to choose: "fast" or "accurate" or "manual"
# this will set the toolchain for the text processing
# currently, fast = spacy for all types of processing
# accurate = stanza for all types of processing
# this will be further adapted
# please don't use "manual" for now, it doesn't add anything new
mydict["processing_option"] = "fast"
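The comments in the cell above describe a simple mapping from `processing_option` to a tool. A minimal sketch of that mapping could look like this. `OPTION_TO_TOOL` and `tool_for_option` are hypothetical names for illustration and are not part of the annotator package. Treating "manual" as unsupported is an assumption made here because the text advises against using it.

```python
# Hypothetical lookup table; the two entries restate the comments above:
# fast = spacy, accurate = stanza.
OPTION_TO_TOOL = {
    "fast": "spacy",
    "accurate": "stanza",
}


def tool_for_option(processing_option):
    """Return the tool name for a processing option (illustrative helper)."""
    try:
        return OPTION_TO_TOOL[processing_option]
    except KeyError:
        raise ValueError(
            f"No toolchain defined for processing_option={processing_option!r}"
        ) from None


print(tool_for_option("fast"))
print(tool_for_option("accurate"))
```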

In [None]:
# Tell what you want to do with the text. Options are:
# tokenize - separate into tokens
# pos - part-of-speech tagging
# lemma - lemmatization
# Example (uncomment to use):
# mydict["processing_type"] = "tokenize, pos, lemma"
# currently, tokenize is mandatory; the option for pretokenized text is
# not enabled in this version
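The `processing_type` value is a single comma-separated string. The sketch below shows one way such a string could be split into individual steps, with a check that `tokenize` is present. `parse_processing_type` is a hypothetical helper, and how the package itself parses the string is not shown in this notebook.

```python
def parse_processing_type(value):
    """Split a comma-separated processing_type string into a list of steps."""
    # Strip whitespace around each entry and drop empty entries.
    steps = [step.strip() for step in value.split(",") if step.strip()]
    # The notebook states that tokenize is mandatory in this version.
    if "tokenize" not in steps:
        raise ValueError("tokenize is mandatory in this version")
    return steps


steps = parse_processing_type("tokenize, pos, lemma")
print(steps)
```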

## Settings for JupyterHub - do not change

In [None]:
mydict["advanced_options"]["output_dir"] = "./test/out/"
mydict["advanced_options"]["corpus_dir"] = "./shared/corpora/"
# registry location (the key name "registry_dir" is assumed here)
mydict["advanced_options"]["registry_dir"] = "./shared/registry/"
mydict["stanza_dict"]["dir"] = "./test/models/"

# Perform the annotation - do not change
The steps below will later be hidden behind the user interface. For now, you need to execute the cells to perform the annotation. Please do not change them.

In [None]:
# input dictionary is set above, now we need to validate
be.prepare_run.validate_input_dict(mydict)
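To illustrate what validating the settings can mean in practice, here is a small stand-alone sketch that checks a few keys against the values mentioned in this notebook. It is not the package's validation routine. `ALLOWED` and `find_problems` are hypothetical, and the real check may cover more keys and use different rules.

```python
# Allowed values as mentioned in the comments of this notebook.
ALLOWED = {
    "language": {"en", "de"},
    "document_type": {"text", "historic", "scientific"},
    "processing_option": {"fast", "accurate", "manual"},
}


def find_problems(settings):
    """Return a list of human-readable problems, empty if none are found."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key not in settings:
            problems.append(f"missing key: {key}")
        elif settings[key] not in allowed:
            problems.append(f"{key}={settings[key]!r} not in {sorted(allowed)}")
    return problems


good = {"language": "de", "document_type": "text", "processing_option": "fast"}
bad = {"language": "fr", "document_type": "text"}

print(find_problems(good))
print(find_problems(bad))
```

Collecting all problems in a list, rather than stopping at the first one, lets you fix several settings in one pass.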

In [None]:
# read in the input text to be processed as raw text
data = be.prepare_run.get_text(mydict["input"])

In [None]:
# set the tool dictionaries and options
obj = pe.SetConfig(mydict)

In [None]:
# load pipeline
# here we select by hand currently
if "spacy" in obj.tool:
    spacy_dict = obj.mydict["spacy_dict"]
    pipe = msp.spacy_pipe(spacy_dict)
    annotated = pipe.apply_to(data)
    annotated.pass_results("STR", mydict, ret=False)
elif "stanza" in obj.tool:
    stanza_dict = obj.mydict["stanza_dict"]
    stanza_pipe = mst.MyStanza(stanza_dict)
    annotated = stanza_pipe.apply_to(data)
    annotated.pass_results("STR", mydict, ret=False)
else:
    print("Did not find tool to use!")

# Access the newly annotated corpus in CWB via the command-line interface
Open a terminal and type `cqp -e`. All further processing is then done at the cqp prompt.

In [None]:
# Set the registry dir in the cqp terminal

In [None]:
# Load the corpus

In [None]:
# search for POS

In [None]:
# search for lemma

In [None]:
# ???