# Text entity annotation

A common task in natural language processing is extracting entities of interest from text. This can be as simple as separating the main content from surrounding boilerplate, or it can involve identifying specific entities such as place names or personal names.

For this, ipyannotations provides the `ipyannotations.text.TextTagger` widget, which lets you highlight words, phrases or sentences and assign a class to each.

The widget will display any string, including Markdown-formatted text.

In [3]:
import ipyannotations.text
from ipyannotations._doc_utils import recursively_remove_from_dom


widget = ipyannotations.text.TextTagger()
widget.display("This is an example sentence. Try highlighting a word.")
recursively_remove_from_dom(widget)

TextTagger(children=(TextTaggerCore(classes=['MISC', 'PER', 'LOC', 'ORG'], palette=['#8dd3c7', '#ffffb3', '#be…

The default entity types are `PER` (person), `ORG` (organisation), `LOC` (location), and `MISC` (miscellaneous). These are chosen because they are relatively standard in the Named Entity Recognition research community.

To use your own classes, pass them to the widget via the `classes` argument:

In [5]:
import ipyannotations.text

widget = ipyannotations.text.TextTagger(classes=["Insult", "Compliment"])
widget.display("You are annoying, but I like you.")
recursively_remove_from_dom(widget)

TextTagger(children=(TextTaggerCore(classes=['Insult', 'Compliment'], palette=['#8dd3c7', '#ffffb3', '#bebada'…

The widget snaps to word boundaries to make tagging faster, so you can double-click a word to tag it.

Each annotation is a three-tuple of type (int, int, str). The two integers are the start and end character offsets of the selected span, and the string is the class name. In the output below, the end offset is exclusive, so slicing the displayed text with `text[start:end]` returns exactly the tagged span.

In [6]:
widget.data

[(8, 16, 'Insult'), (22, 32, 'Compliment')]
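
To check annotations like these, it can help to map the offsets back onto the text. The following is a minimal sketch that uses only plain Python and the values shown above; the names `text`, `annotations` and `tagged` are illustrative and not part of ipyannotations.

```python
# Text and annotations copied from the example above.
text = "You are annoying, but I like you."
annotations = [(8, 16, "Insult"), (22, 32, "Compliment")]

# Slice the original string with each (start, end) pair to recover the span.
tagged = [(text[start:end], label) for start, end, label in annotations]

for span, label in tagged:
    print(f"{label}: {span!r}")
```

For this example, the loop prints `Insult: 'annoying'` and `Compliment: 'I like you'`.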