# `txtanot`. Text Annotator with Similarity Engine.

## Features
- Displays a GUI (graphical user interface) in a Jupyter notebook cell for annotating text.
- Similarity engine. While annotating a large dataset, we may find one relevant item and want more examples like it. Instead of going through all the data, we can search for the data points most similar to that item.
  - Extracts embeddings of the loaded dataset and builds an index.
  - Uses a Hugging Face model checkpoint to extract the embeddings. The checkpoint is a configurable parameter.
  - It is optional. The widget can be used without a similarity index.
- Multiple annotation classes.
- Handles data that has already been annotated. Annotated rows can optionally be filtered out of the annotation session while remaining untouched in the dataset.

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
from pprint import pprint

import pandas as pd
from datasets import Dataset

from txtanot.core.text_annotator import Annotator
from txtanot.core.similarity import SimilarityEngine

In [4]:
pd.set_option('display.max_colwidth', 100)

## Load data to be annotated

In [5]:
FILE_NAME = 'es_2020.csv'
df = pd.read_csv(FILE_NAME)
print('Shape:', df.shape)
df.head(3)

Shape: (995, 3)


| | id | text | label |
|---|---|---|---|
| 0 | 5fee3d06-0000-1000-8333-a2a2a2116fff | Muy buenas noches, Apreciada consultante: Las lipo-infiltraciones realizadas de manera arbitrari... | |
| 1 | 5fede218-0000-1000-8333-a2a2a2105fff | Puede parecer más pequeño, pero realmente el tejido que se extirpa es mínimo, solo piel para aju... | |
| 2 | 5fede15d-0000-1000-8333-a2a2a2104fff | La fotografía no es muy aclaradora, por si acaso yo te recomiendo una pomada antibiótica de ampl... | |


## Create an Annotator object

- Define the classification classes to be used.
- Filter out already annotated data; those rows stay untouched in the dataset.

In [6]:
shuffle = True
annotator = Annotator(df, filter_annotated=True, classes=["valid", "notvalid", "lowinfo"], shuffle=shuffle)

Data rows: 995


In [7]:
annotator.counts('label')

Series([], Name: count, dtype: int64)
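How `filter_annotated=True` is implemented inside `Annotator` is not shown here. As a rough illustration of the idea only, the sketch below uses plain Python dictionaries instead of a DataFrame; the records and the helper names `is_annotated` and `pending_indices` are hypothetical and not part of `txtanot`. Rows that already carry a known class are skipped when building the annotation queue, but they are never removed from the data.

```python
from typing import Optional

# Hypothetical records; in the notebook these would be the rows of `df`.
records = [
    {"id": "a", "text": "first text", "label": "valid"},
    {"id": "b", "text": "second text", "label": None},
    {"id": "c", "text": "third text", "label": ""},
    {"id": "d", "text": "fourth text", "label": "lowinfo"},
]

CLASSES = ["valid", "notvalid", "lowinfo"]


def is_annotated(label: Optional[str]) -> bool:
    """A row counts as annotated when its label is one of the known classes."""
    return label in CLASSES


def pending_indices(rows: list) -> list:
    """Positions of rows still waiting for a label; annotated rows are skipped."""
    return [i for i, row in enumerate(rows) if not is_annotated(row["label"])]


queue = pending_indices(records)
print(queue)         # [1, 2]
print(len(records))  # 4: annotated rows are still in the data
```

In the run above, `annotator.counts('label')` returns an empty series, which is consistent with a file in which no row has been labelled yet.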


### Define similarity model

Define the Hugging Face model checkpoint used to build an index of embeddings with the Faiss indexer.

In [8]:
checkpoint = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
annotator.build_index('text', checkpoint)

Some weights of the model checkpoint at PlanTL-GOB-ES/roberta-base-biomedical-clinical-es were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at PlanTL-GOB-ES/roberta-base-biomedical-clinical-es and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably

Map:   0%|          | 0/995 [00:00<?, ? examples/s]

  0%|          | 0/1 [00:00<?, ?it/s]
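To give an intuition for what an embedding index does, here is a minimal dependency-free sketch of an exhaustive ("flat") index over unit-normalized vectors. It is an assumption-laden stand-in: the vectors are made up rather than produced by the checkpoint, the `FlatIndex` class is hypothetical, and the index type that `build_index` actually creates in Faiss is not stated in this document.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class FlatIndex:
    """Exhaustive index: stores every vector and compares the query against all of them."""

    def __init__(self):
        self.vectors = []

    def add(self, vec):
        self.vectors.append(normalize(vec))

    def search(self, query, n):
        """Return (position, score) pairs for the n most similar stored vectors."""
        q = normalize(query)
        scores = [
            (i, sum(a * b for a, b in zip(q, v)))
            for i, v in enumerate(self.vectors)
        ]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:n]


# Made-up 3-dimensional "embeddings"; a real model produces far more dimensions.
index = FlatIndex()
for vec in ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    index.add(vec)

hits = index.search([0.8, 0.6, 0.0], n=2)
print([i for i, _ in hits])  # [1, 0]
```

A dedicated library such as Faiss serves the same purpose but is built to handle large collections efficiently, which a pure-Python loop like this one is not.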

## Annotation

Start the annotation widget.
- Displays the text to be annotated and its metadata. The metadata shown can be customized by inheriting from the `txtanot.core.text_annotator.Datapoint` class.
- The widget shows one button for each classification label defined, plus buttons to browse the data.

In [9]:
annotator.start()

Output(layout=Layout(height='300px', max_width='600px'))

HBox(children=(Button(description='< go back', style=ButtonStyle()), Button(description='next >', style=Button…

HBox(children=(Button(description='valid', style=ButtonStyle()), Button(description='notvalid', style=ButtonSt…
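The browsing and labelling behaviour behind such a widget can be sketched without any GUI code. The class below is a hypothetical illustration, not the `txtanot` implementation: it keeps a cursor over the items, clamps it at both ends for the "go back" and "next" buttons, and stores one label per position. Advancing automatically after a label is an assumption made for this sketch.

```python
class AnnotationSession:
    """Keeps a cursor over the items and records one label per item."""

    def __init__(self, items, classes):
        self.items = items
        self.classes = classes
        self.labels = {}  # position -> chosen class
        self.pos = 0

    def current(self):
        return self.items[self.pos]

    def next(self):
        # Clamp at the last item instead of running past the end.
        self.pos = min(self.pos + 1, len(self.items) - 1)

    def back(self):
        # Clamp at the first item.
        self.pos = max(self.pos - 1, 0)

    def label(self, cls):
        if cls not in self.classes:
            raise ValueError(f"unknown class: {cls}")
        self.labels[self.pos] = cls
        self.next()  # assumption: move on to the next item after labelling


session = AnnotationSession(
    ["text A", "text B", "text C"], ["valid", "notvalid", "lowinfo"]
)
session.label("valid")     # labels position 0
session.label("lowinfo")   # labels position 1
session.back()             # return to position 1
session.label("notvalid")  # overwrite the label at position 1
print(session.labels)      # {0: 'valid', 1: 'notvalid'}
```

In a notebook, each button's click handler would call one of these methods and then redraw the output area.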

## Similarity

Once the Annotator object has been defined and the Faiss index built, we can start looking for similar texts in the index. Given an input text, the widget presents the N most similar texts in the index to be annotated.

In [15]:
# English: "Remember that at our clinic the first visit is completely free of charge"
txt = "Recuerde que en nuestra clínica la primera visita es totalmente gratuita"
annotator.similar(txt, n=20)

Output(layout=Layout(height='300px', max_width='600px'))

HBox(children=(Button(description='< go back', style=ButtonStyle()), Button(description='next >', style=Button…

HBox(children=(Button(description='valid', style=ButtonStyle()), Button(description='notvalid', style=ButtonSt…

In [12]:
# Incorporate the annotations made on the similar texts into the dataset.
annotator.merge_similar()
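The internals of `merge_similar()` are not shown in this document. One plausible way to fold labels from a subset of rows back into the full dataset is to match rows by their `id`, as in the hypothetical sketch below. The function `merge_annotations` and the sample rows are made up for illustration.

```python
def merge_annotations(dataset, annotated):
    """Copy labels from `annotated` into `dataset`, matching rows by their `id`."""
    # Only rows that actually received a label are merged.
    labels_by_id = {row["id"]: row["label"] for row in annotated if row["label"]}
    for row in dataset:
        if row["id"] in labels_by_id:
            row["label"] = labels_by_id[row["id"]]
    return dataset


dataset = [
    {"id": "a", "text": "first text", "label": None},
    {"id": "b", "text": "second text", "label": None},
    {"id": "c", "text": "third text", "label": None},
]
# Rows shown in the similarity view; only one of them was labelled.
similar_view = [
    {"id": "c", "text": "third text", "label": "valid"},
    {"id": "a", "text": "first text", "label": None},
]

merge_annotations(dataset, similar_view)
print([row["label"] for row in dataset])  # [None, None, 'valid']
```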

## Save

In [16]:
df_ = pd.DataFrame(annotator.data)
df_.head(3)

| | id | text | label |
|---|---|---|---|
| 0 | 5feb4d89-0000-1000-8333-a2a2a20adfff | Estimada paciente todos los tratamientos inductores de colágeno tardan un tiempo en producir una... | |
| 1 | 5fbf028c-0000-1000-8333-a1a1a1a0afff | Estimada paciente, usted pregunta por la técnica de FOXY EYES, que se realiza con hilos tensores... | |
| 2 | 5fa13270-0000-1000-8333-a1a1a1568fff | Estimada paciente, \r\nlo normal es que a las 3 semanas ya haya cedido el edema postoperatorio m... | |


In [17]:
df_['label'].value_counts()

label
valid       13
notvalid     3
lowinfo      3
Name: count, dtype: int64

In [18]:
FILE_NAME

'es_2020.csv'

In [19]:
df_.to_csv(FILE_NAME, index=False)