![neanno logo](images/neanno.png)

# neanno
**A tool to annotate texts and predict annotations.**

## Getting Started

### What is neanno?

neanno is a tool to add different types of annotations to texts in a dataset. Once you have the annotated texts, neanno has options to extract these annotations so you can build your own models with any code and framework you like.

Currently, neanno supports the following annotation types:

- **Categories**

    A *category* is a class or type of a text. For instance, you might have a dataset of support tickets and give each text one or more categories such as "inquiry", "information" or "complaint", or, as another example, the team that should resolve the ticket. neanno lets you assign multiple categories to a single text.


- **Named Entities**

    A *named entity* is an arbitrary object of a certain type. For example, we might be interested in all airlines that appear in a text. Then the type of the named entities we are looking for would be "AIRLINE", and if our extraction model is good, it would even find airlines we have never heard about before, and it would be able to distinguish between "American" as an airline and "American" as in "American Football".
    

- **Key Terms**
    
    A *key term* differs from a named entity in that it does not necessarily describe an object and has no entity type. It simply denotes an arbitrary term which is key to understanding a text. neanno assumes that whenever the text of a key term appears, it is a key term, no matter where it occurs.

Named entities and key terms can either be *standalone* or *parented*. A **parented named entity** or a **parented key term** has one or more other terms as "parents". These parents help to learn which terms actually mean the same thing. Parented terms are similar to synonyms, but they can also lead to a more general representation of a term.

For instance, you might have the text of a ticket where someone asks about a flight from Sydney to Los Angeles. Some people write "Los Angeles", others write "L.A." and still others write "LAX", but in the end it's all about Los Angeles. Hence, it would be good if an annotator could give us this information. Then, if we want to create reports or word clouds on the texts, we can report on the single term "Los Angeles" although the original texts use different representations for it.

But parent terms are more than just synonyms. Assume that in your context it matters not only that someone flies to Los Angeles but also that she flies to the U.S. An annotator can then simply add "U.S." as a parent term, and we can easily consider this in our reports too. neanno lets you add multiple parent terms if needed: just separate them with commas and neanno will treat them as multiple parent terms.
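
To illustrate how parent terms help with reporting, here is a minimal sketch in plain Python. It does not use neanno itself; the `annotations` list and the `count_parent_terms` helper are hypothetical and only mimic the idea of comma-separated parent terms described above.

```python
from collections import Counter

# Hypothetical annotations: (term as written in the text, comma-separated parent terms).
annotations = [
    ("Los Angeles", "Los Angeles, U.S."),
    ("L.A.", "Los Angeles, U.S."),
    ("LAX", "Los Angeles, U.S."),
    ("Sydney", "Sydney"),
]


def count_parent_terms(annotations):
    """Count how often each parent term occurs across all annotations."""
    counts = Counter()
    for _term, parent_terms_raw in annotations:
        # split on commas and ignore surrounding whitespace
        parents = [p.strip() for p in parent_terms_raw.split(",") if p.strip()]
        counts.update(parents)
    return counts


parent_counts = count_parent_terms(annotations)
print(parent_counts.most_common())
```

All three spellings end up under the same two parent terms, so a report can count "Los Angeles" and "U.S." regardless of how the original texts were written.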

The goal of neanno is to make annotation as comfortable and efficient as possible, so we tried to avoid tedious mouse clicks wherever we could. In addition, neanno can support you during annotation by suggesting annotations, based on a model it trains, a dataset you provide or a regex pattern you specify.

In the future, neanno will also select the texts presented for annotation a bit smarter than by iterating sequentially, so you don't waste time, and it will create and export models for you (if you like).

### Installation

neanno is provided to you as a package written in Python - Python 3 to be more precise. It can run on Windows, Linux and (probably) Mac. (Mac will be tested soon but should work.)

It might make sense to install neanno in a separate environment to ensure that there are no conflicts with other packages you have installed. (Trust me - trying out multiple things in only one single environment is not a good idea. Resolving dependency issues is extremely time-consuming...)

To create a new environment, ensure that you have installed [Anaconda](https://www.anaconda.com/distribution) or [Miniconda](https://conda.io/en/latest/miniconda.html).

Once you have Anaconda or Miniconda,
1. open the Anaconda Prompt (Windows only; on Linux/Mac, conda should be available in your regular terminal after installation) and
2. run `conda create -n <environment name such as: neanno> python=<a 3.x version greater than 3.6>`.
3. Then activate your new environment by running
    - `conda activate <your new environment name>` (Windows) or
    - `source activate <your new environment name>` if on Linux/Mac, respectively.
    
  From now on, you always have to activate this environment with the command above before you use neanno.


4. After you have activated your target environment, you are ready to install neanno. To install neanno, you can either
    - install it from pip by simply running `pip install neanno` (once it's available there ;-)) or
    - if you want to install neanno from its sources, go to the root folder of neanno's sources and run `pip install -e .`

neanno has a few dependencies on other packages. These packages should be installed automatically. If you want to see which packages neanno uses, have a look at the included requirements.txt file.
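
A quick way to check the result of the steps above from within the activated environment is sketched below. It only uses the Python standard library. The helper name `check_environment` is made up for this example, and the default minimum version `(3, 6)` is an assumption based on the version note in step 2, so adjust it if your setup needs a newer Python.

```python
import importlib.util
import sys


def check_environment(package="neanno", min_version=(3, 6)):
    """Return (python_ok, package_found) for the current interpreter."""
    python_ok = sys.version_info[:2] >= min_version
    # find_spec returns None if a top-level package cannot be found
    package_found = importlib.util.find_spec(package) is not None
    return python_ok, package_found


python_ok, package_found = check_environment()
print("Python version ok:", python_ok)
print("neanno importable:", package_found)
```

If the second line prints `False`, the most likely causes are that the environment is not activated or that the `pip install` step was run in a different environment.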


### Walkthrough - demo

Once you have installed neanno and are using it for the first time, it makes sense to check out a sample first.

The samples folder includes a demo named `airline_tickets`. The airline tickets demo has a dataset with helpdesk tickets and forum entries about travelling with airlines.

#### Using the UI

To start, go to neanno's source folder and run `python -m neanno --config samples/airline_tickets/airline_tickets.config.yaml`. This will open neanno and bring you directly to the middle of an annotation session which looks similar to this.

![airline_tickets demo](images/airline_tickets.png)

When you start neanno, it automatically determines the next text which has not been annotated yet and brings you directly to that text. As you can see, neanno uses different colors to highlight annotations. All named entities use the respective color shown on the right, key terms are marked with a grey background and a light blue font.

- To assign a category to the text or unassign it, simply click the category in the list at the top right.
- To annotate a key term, simply select the respective text and press `Alt+1` (shortcuts can be reconfigured in the config file, see below).
- To annotate a key term with parent terms, select the respective text, press `Alt+2` and enter the parent term(s).
- Annotating named entities is similar: just select the text and press the respective shortcut for the entity.
- To give a named entity parents, just hold the `Shift` key while you press the shortcut for the respective entity.
- To remove an inline annotation, just place the cursor within the span of the annotation, press `Ctrl+R` and your annotation is gone.
- Once you are done with your annotations, hit `Ctrl+Enter` (just like in Jupyter). This submits your annotations and sets the *is_annotated* flag, i.e. neanno will not ask you to annotate that text again (although you can if needed).

> Beware:
>
> - Any changes will be lost if you navigate to another text without submitting.
> - neanno has no save button. Instead it saves the dataset whenever a text is submitted.

neanno has a few more shortcuts/features you can use but I don't want to explain them all here. To see all available shortcuts, simply click the `Shortcuts` button at the bottom of the window, and it will open a dialog box that lists your options.

#### Auto suggest

Depending on what is configured, neanno will suggest annotations (see config file for more details). You can switch predictors on and off as you like. Just use the dialog behind the *Enable/disable predictors for prediction* button for that.

As you might have already realized, neanno also learns during your annotations. There are however some predictors which need more time for training. To train these predictors, you currently need to explicitly trigger their training by clicking on the *Trigger time-consuming training(s)* button. It will then show you the output of the training process(es) in the console/terminal but don't be afraid, this experience will be better some day.

(...as well as the export model button... which unfortunately has no real functionality yet...)

#### Distributions

One last thing before I continue with the Python part. The numbers on the right show the distribution of categories or named entities, respectively. Knowing the distribution of your annotations is very valuable: it indicates when it's ok to stop annotating and tells you when you do not yet have enough annotations for a certain category or entity type.
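
The idea behind these numbers can be sketched in a few lines of plain Python. The example below is not neanno's implementation: it assumes that categories are stored as one `|`-separated string per text (as in the airline_tickets sample), and the helper names, the example values and the threshold of 2 are made up for illustration.

```python
from collections import Counter


def category_distribution(category_column, separator="|"):
    """Count how often each category was assigned across all texts."""
    counts = Counter()
    for cell in category_column:
        # skip empty strings, which stand for texts without a category
        counts.update(c for c in cell.split(separator) if c)
    return counts


def underrepresented(counts, min_count):
    """Return the categories that have fewer than min_count annotations."""
    return sorted(name for name, n in counts.items() if n < min_count)


# Hypothetical category values, one string per annotated text.
categories = [
    "Trip Planning|Security",
    "Trip Planning",
    "Security|Complaint",
    "",  # text without any category
]
counts = category_distribution(categories)
print(counts.most_common())
print(underrepresented(counts, min_count=2))
```

Categories returned by `underrepresented` are the ones for which it is probably worth annotating more texts before training a model.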

#### Using neanno from Python

##### Extracting annotations

The `neanno.utils.text` module has some functions to extract annotations from annotated text. The base function for many of them is `extract_annotations_as_generator`. It walks through the specified text and yields an annotation whenever it encounters one.

Let's see it in action.

```python
# load some data from the airline_tickets sample
import pandas as pd
df = pd.read_csv('../samples/airline_tickets/texts.annotating.csv')
df = df.fillna('None')
df[["Text", "categories"]] = df[["Text", "categories"]].astype(str)
df
```

|   | Request ID | Text | categories | is_text_annotated |
|---|---|---|---|---|
| 0 | 2047 | Hi all,\r\n\r\nI have booked to fly from \`Sydn... | Service Offering/Procedure\|Technology | True |
| 1 | 1997 | If my friend and I are turning 17, but want to... | Trip Planning\|Customs/Immigration\|Legal | True |
| 2 | 1999 | Hey All,\r\n\r\nIn May, we'll be flying from \`... | Trip Planning\|Security | True |
| 3 | 2003 | Here is a little story for you \`football\`\`SK\`´... |  | True |
| 4 | 1549 | Quick question...\r\n\r\nI've just pre booked ... |  | True |
| 5 | 2010 | Can anyone give me a suggestion on how to \`exc... | Trip Planning | True |
| 6 | 1941 | One of our checked \`bag\`\`SK\`´ was opened by \`T... | Security\|Complaint | True |
| 7 | 1968 | My son will be flying home from college but ne... | Trip Planning\|Service Offering/Procedure | True |
| 8 | 1881 | I have booked a \`return flight\`\`SK\`´ from \`Rio... | Trip Planning\|Service Offering/Procedure | True |
| 9 | 1895 | will I make it to \`bus\`\`SK\`´ to \`east Midlands... | Trip Planning | True |


```python
# extract all annotations from the first text
from neanno.utils.text import extract_annotations_as_generator
first_text = df["Text"][0]
df_to_show = pd.DataFrame(extract_annotations_as_generator(first_text))
df_to_show[[
    "term",
    "type",
    "entity_code",
    "parent_terms",
    "parent_terms_raw",
    "start_net",
    "end_net",
    "start_gross",
    "end_gross"
]]
```

|   | term | type | entity_code | parent_terms | parent_terms_raw | start_net | end_net | start_gross | end_gross |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sydney | parented_named_entity | FROM | SYD | SYD | 37 | 43 | 37 | 61 |
| 1 | Los Angeles | parented_named_entity | TO | LAX | LAX | 47 | 58 | 65 | 92 |
| 2 | 747-400 | standalone_named_entity | AIRCRAFT |  |  | 71 | 78 | 105 | 129 |
| 3 | wifi | standalone_key_term |  |  |  | 142 | 146 | 193 | 204 |
| 4 | pay for | parented_key_term |  | fees | fees | 179 | 186 | 237 | 257 |
| 5 | charge | standalone_key_term |  |  |  | 218 | 224 | 289 | 302 |
| 6 | iPhone | standalone_key_term |  |  |  | 228 | 234 | 306 | 319 |
| 7 | iPad | standalone_key_term |  |  |  | 238 | 242 | 323 | 334 |
| 8 | in flight | parented_key_term |  | in-flight | in-flight | 243 | 252 | 335 | 362 |
| 9 | meals | standalone_key_term |  |  |  | 284 | 289 | 394 | 406 |


```python
# we can also extract only the annotations of a certain type, e.g. named entities
df_to_show = pd.DataFrame(extract_annotations_as_generator(
    first_text, types_to_extract=["standalone_named_entity", "parented_named_entity"]))
df_to_show[[
    "term",
    "type",
    "entity_code",
    "parent_terms",
    "parent_terms_raw",
    "start_net",
    "end_net",
    "start_gross",
    "end_gross"
]]
```

|   | term | type | entity_code | parent_terms | parent_terms_raw | start_net | end_net | start_gross | end_gross |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sydney | parented_named_entity | FROM | SYD | SYD | 37 | 43 | 37 | 61 |
| 1 | Los Angeles | parented_named_entity | TO | LAX | LAX | 47 | 58 | 65 | 92 |
| 2 | 747-400 | standalone_named_entity | AIRCRAFT |  |  | 71 | 78 | 105 | 129 |
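
To give an intuition for the *net* and *gross* positions in the output above, here is a simplified sketch written for this walkthrough. It is not neanno's implementation: it only handles standalone key terms in the form visible in the sample data (the term wrapped in backticks, followed by the `SK` code and a closing acute accent), and the function name `extract_standalone_key_terms` is made up. As far as the sample output suggests, gross positions refer to the annotated text including the markup, while net positions refer to the text with the markup removed.

```python
import re

# Matches `term``SK`´ (the standalone key term markup seen in the sample data).
KEY_TERM_PATTERN = re.compile(r"`(?P<term>[^`]+)``SK`´")


def extract_standalone_key_terms(annotated_text):
    """Yield one dict per standalone key term found in annotated_text."""
    removed_chars = 0  # markup characters seen before the current match
    for match in KEY_TERM_PATTERN.finditer(annotated_text):
        term = match.group("term")
        start_net = match.start() - removed_chars
        yield {
            "term": term,
            "start_net": start_net,
            "end_net": start_net + len(term),
            "start_gross": match.start(),
            "end_gross": match.end(),
        }
        # everything in the match except the term itself is markup
        removed_chars += (match.end() - match.start()) - len(term)


text = "My `bag``SK`´ has `wifi``SK`´."
annotations = list(extract_standalone_key_terms(text))
for annotation in annotations:
    print(annotation)
```

Because the sketch ignores all other annotation types, its net positions are only correct for texts that contain nothing but standalone key terms. For real data, use `extract_annotations_as_generator`.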


##### Computing distributions
There are also some functions to compute distributions, e.g. the distribution of the categories, named entities or terms.

```python
# compute and show the categories distribution
from neanno.utils.text import compute_categories_distribution_from_column
df_to_show = pd.DataFrame.from_dict(
    compute_categories_distribution_from_column(df["categories"]), orient="index")
df_to_show.columns = ["Frequency"]
df_to_show = df_to_show.sort_values(by=["Frequency"], ascending=False)
df_to_show
```

|   | Frequency |
|---|---|
|   | 821 |
| Service Offering/Procedure | 27 |
| Trip Planning | 24 |
| Technology | 3 |
| Customs/Immigration | 3 |
| Security | 3 |
| Complaint | 2 |
| Legal | 1 |
| Complaint/Feedback | 1 |


##### Extracting dictionaries / term distribution

```python
from neanno.utils.text import compute_term_distribution_from_column
from operator import itemgetter

# compute the term distribution over all texts
term_distribution = compute_term_distribution_from_column(
    df["Text"], include_entity_codes=False)
# sort the terms by frequency, most frequent first
df_to_print = pd.DataFrame(
    sorted(term_distribution.items(), key=itemgetter(1), reverse=True),
    columns=["Term", "Frequency"])
df_to_print
```

|   | Term | Frequency |
|---|---|---|
| 0 | to | 3350 |
| 1 | the | 2765 |
| 2 | I | 2193 |
| 3 | and | 1730 |
| 4 | a | 1593 |
| 5 | in | 1237 |
| 6 | for | 1047 |
| 7 | on | 966 |
| 8 | is | 962 |
| 9 | of | 939 |


##### Compute precision/recall

The evaluation metric computations can be found in `neanno.utils.metrics`.

```python
from neanno.utils.metrics import compute_ner_metrics_from_actual_predicted_annotated_text_columns

compute_ner_metrics_from_actual_predicted_annotated_text_columns(
    df["Text"], df["Text"], ["AIRLINE", "FROM", "TO", "VIA", "AT", "AIRCRAFT"])
```

In the recorded run, this cell failed with the following error, so the function may have a different name in your version of `neanno.utils.metrics`:

```
ImportError: cannot import name 'compute_ner_metrics_from_actual_predicted_annotated_text_columns'
```
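
For readers unfamiliar with these metrics, the following sketch shows how precision and recall are commonly computed for entity annotations. It is a generic illustration and not the code in `neanno.utils.metrics`; the `precision_recall` helper and the `(start, end, entity_code)` tuple representation are assumptions made for this example. A predicted annotation counts as correct only if it matches an actual annotation exactly.

```python
def precision_recall(actual, predicted):
    """Compute precision and recall for two collections of annotations.

    Each annotation is a hashable tuple such as (start, end, entity_code).
    """
    actual, predicted = set(actual), set(predicted)
    true_positives = len(actual & predicted)
    # guard against division by zero when one side is empty
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical annotations of one text: the second prediction has the wrong entity code.
actual = {(37, 43, "FROM"), (47, 58, "TO"), (71, 78, "AIRCRAFT")}
predicted = {(37, 43, "FROM"), (47, 58, "FROM")}
precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here one of the two predictions is correct (precision 0.5) and one of the three actual annotations was found (recall of about 0.33).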

### Walkthrough II - now again but from scratch

Ok - so you have seen the airline_tickets demo, got some basic understanding and now want to annotate your own text? No problem. Here we go!

1. First, you need to prepare your dataset. neanno currently supports only CSV files (other data source types may come later; the code is already prepared for that). So however you get there, in the end you need a CSV file with a column that contains the texts to annotate.

    If it's more convenient for you, just leave the other columns in the file. neanno will not remove or change them. However, depending on how you configure neanno (see the next step and the config file), it might add columns it needs.


2. Once you have the dataset, it's time to create/adjust the config file. Because the config file has grown quite large, the best approach is to take the config file from the airline_tickets sample and adjust it to your needs. To do that, copy `airline_tickets.config.yaml` to a folder of your choice, rename it and open it in a text editor.

    If you do not want to annotate a certain type, e.g. categories, just remove the respective item from the yaml file and neanno will not show any controls for it. The same applies to key terms and named entities, and to the predictors: if you do not want to use some or any of the mentioned predictors, just remove them.


3. Now that your config file is (hopefully) ready, you can start and use neanno. The command to start is exactly the same as for the airline_tickets sample except that you have to specify your new config file.

> It's a best practice to put the config file, the dataset and all other files into a single folder so you have everything together.
>
> Don't forget to put that folder under version control with git and push it to a service such as Azure DevOps (Repos) or GitHub. This ensures that you don't lose valuable work.
>
> As a rule of thumb, you need approx. 1 hour to annotate 100 texts, so 20,000 texts would take around 200 hours. So don't even dare to plan or ask someone to annotate
> 20,000 texts! ;-)
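
As an illustration of step 1, the sketch below writes a minimal CSV file with Python's standard `csv` module. The column names `Request ID` and `Text` mirror the airline_tickets sample; the file name `my_texts.csv` and the example rows are made up, so use whatever names you reference in your config file.

```python
import csv
import os
import tempfile

# Hypothetical texts to annotate; in practice they come from your own data source.
rows = [
    {"Request ID": 1, "Text": "Can I take my guitar as hand luggage?"},
    {"Request ID": 2, "Text": "My flight to Sydney was delayed,\nwhat are my options?"},
]

# Written to a temporary folder here; pick your project folder instead.
csv_path = os.path.join(tempfile.mkdtemp(), "my_texts.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["Request ID", "Text"])
    writer.writeheader()
    writer.writerows(rows)  # commas and line breaks in texts are quoted automatically

# Read the file back to confirm every text survived the round trip.
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    loaded = list(csv.DictReader(csv_file))
print(len(loaded), "texts written to", csv_path)
```

Using a CSV writer instead of joining strings by hand matters here because texts such as tickets or forum posts usually contain commas, quotes and line breaks.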


### Advanced

#### Adding custom predictors

Currently, adding a custom predictor requires manually adjusting the ConfigManager class in configmanager.py and subclassing the predictor class (see the existing predictors). The plan is to load predictors automatically as soon as the config file contains a hint and the respective Python class is available somewhere in the directory structure. That should make bringing your own predictors much easier.


### Known Issues

- Annotations are rendered with too much blank space on some machines (when font stretching is not supported)


### To Do List

- [ ] make bring-your-own-predictor more comfortable
- [ ] category predictors
- [ ] dialog instead of console output for long-running trainings
- [ ] integration of precision/recall evaluation into UI (code is already available in neanno.utils....)
- [ ] model export
- [ ] check Mac compatibility (should be compatible, only slight changes needed if at all)
- [ ] next best text by active learning