# Using neanno from Python

neanno provides several functions for working with annotations directly from Python code, including functions for predicting annotations. This notebook shows some of them. For an up-to-date view, have a look at the `neanno.utils.text`, `neanno.utils.metrics` and `neanno.prediction.*` modules.

In [1]:
# ensure the directory where the neanno sources reside is in the path
import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)


## Load some data first

To continue, we first need to load some data.

In [2]:
import pandas as pd
df = pd.read_csv('../samples/airline_tickets/texts.annotating.csv')
df = df.fillna('None')
df[["text", "categories"]] = df[["text", "categories"]].astype(str)
df.head()

Unnamed: 0,request_id,text,categories,is_text_annotated
0,2047,"Hi all,\n\nI have booked to fly from `Sydney``...",Service Offering/Procedure|Technology,True
1,1997,"If my friend and I are turning 17, but want to...",Trip Planning|Customs/Immigration|Legal,True
2,1999,"Hey All,\n\nIn May, we'll be flying from `YYZ`...",Trip Planning|Security,True
3,2003,Here is a little story for you `football``SK`´...,,True
4,1549,Quick question...\n\nI've just pre booked my `...,,True


## Extracting annotations

Now that we loaded the data, we can have a look at what neanno provides.

neanno offers several ways to extract annotations from annotated texts. They all build on the `extract_annotations_as_generator` function, which walks through the specified text and yields each annotation it encounters.

Let's see it in action.

### All annotations from a text

In [3]:
from neanno.utils.text import extract_annotations_as_generator

# get annotations
first_text = df["text"][0]
annotations = extract_annotations_as_generator(first_text)

# show (only first few to avoid blowing up the notebook)
df_to_show = pd.DataFrame(annotations)
df_to_show[[
    "term",
    "type",
    "entity_code",
    "parent_terms",
    "parent_terms_raw",
    "start_net",
    "end_net",
    "start_gross",
    "end_gross"
]].head()

Unnamed: 0,term,type,entity_code,parent_terms,parent_terms_raw,start_net,end_net,start_gross,end_gross
0,Sydney,parented_named_entity,FROM,SYD,SYD,35,41,35,59
1,Los Angeles,parented_named_entity,TO,LAX,LAX,45,56,63,90
2,747-400,standalone_named_entity,AIRCRAFT,,,69,76,103,127
3,wifi,standalone_key_term,,,,136,140,187,198
4,pay for,parented_key_term,,fees,fees,173,180,231,251
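
The output above suggests that the net positions refer to the text with the annotation markup stripped, while the gross positions refer to the raw annotated text. The following is a minimal sketch of that bookkeeping and not neanno's implementation. It assumes only the standalone key term markup (a term followed by the `SK` marker, as in the "wifi" example used in the prediction section of this notebook), and the function name `iter_standalone_key_terms` is hypothetical.

```python
import re

# Assumption: only the standalone key term form is handled, i.e. a term
# wrapped as `term``SK`´ . Other annotation types are ignored by this sketch.
STANDALONE_KEY_TERM = re.compile(r"`(?P<term>[^`]+?)``SK`´")


def iter_standalone_key_terms(annotated_text):
    """Yield one dict per standalone key term found in annotated_text."""
    removed = 0  # number of markup characters seen before the current match
    for match in STANDALONE_KEY_TERM.finditer(annotated_text):
        term = match.group("term")
        start_gross, end_gross = match.span()
        # net positions are the gross positions minus all markup seen so far
        start_net = start_gross - removed
        yield {
            "term": term,
            "type": "standalone_key_term",
            "start_net": start_net,
            "end_net": start_net + len(term),
            "start_gross": start_gross,
            "end_gross": end_gross,
        }
        removed += (end_gross - start_gross) - len(term)


for annotation in iter_standalone_key_terms("Can I use `wifi``SK`´ during flight?"):
    print(annotation)
```

Because the function is a generator, a caller can stop early or feed the results straight into `pd.DataFrame(...)`, just like in the cell above.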


### Only the annotations of a certain type and entity code

In [4]:
# get annotations
annotations = extract_annotations_as_generator(
    first_text,
    types_to_extract=["standalone_named_entity", "parented_named_entity"],
    entity_codes_to_extract=["TO"],
)

# show (only first few to avoid blowing up the notebook)
df_to_show = pd.DataFrame(annotations)                        
df_to_show[[
    "term",
    "type",
    "entity_code",
    "parent_terms",
    "parent_terms_raw",
    "start_net",
    "end_net",
    "start_gross",
    "end_gross"
]].head()

Unnamed: 0,term,type,entity_code,parent_terms,parent_terms_raw,start_net,end_net,start_gross,end_gross
0,Los Angeles,parented_named_entity,TO,LAX,LAX,45,56,63,90
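
To make the effect of the two filters concrete, here is a small sketch that applies the same kind of filtering to plain annotation dictionaries. It is not neanno's code: the helper `filter_annotations` is hypothetical, and it assumes that an annotation must match both filters to be kept, which is consistent with the output above. The sample rows are taken from the first output table and reduced to three fields.

```python
def filter_annotations(annotations, types_to_extract=None, entity_codes_to_extract=None):
    """Yield only the annotations matching all given filters (None = no filter)."""
    for annotation in annotations:
        if types_to_extract is not None and annotation["type"] not in types_to_extract:
            continue
        if (
            entity_codes_to_extract is not None
            and annotation.get("entity_code") not in entity_codes_to_extract
        ):
            continue
        yield annotation


# rows copied from the first output table in this notebook
sample_annotations = [
    {"term": "Sydney", "type": "parented_named_entity", "entity_code": "FROM"},
    {"term": "Los Angeles", "type": "parented_named_entity", "entity_code": "TO"},
    {"term": "747-400", "type": "standalone_named_entity", "entity_code": "AIRCRAFT"},
    {"term": "wifi", "type": "standalone_key_term", "entity_code": None},
    {"term": "pay for", "type": "parented_key_term", "entity_code": None},
]

filtered = list(
    filter_annotations(
        sample_annotations,
        types_to_extract=["standalone_named_entity", "parented_named_entity"],
        entity_codes_to_extract=["TO"],
    )
)
print([annotation["term"] for annotation in filtered])
```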


## Computing distributions
There are also functions to compute distributions, e.g. the distribution of categories, named entities or terms.

### Compute and show the text categories distribution

In [5]:
from neanno.utils.text import compute_categories_distribution_from_column

# get distribution
categories_distribution = compute_categories_distribution_from_column(df["categories"])

# show
df_to_show = pd.DataFrame.from_dict(categories_distribution, orient="index")
df_to_show.columns = ["Frequency"]
df_to_show = df_to_show.sort_values(by=["Frequency"], ascending=False)
df_to_show

Unnamed: 0,Frequency
,821
Service Offering/Procedure,27
Trip Planning,24
Technology,3
Customs/Immigration,3
Security,3
Complaint,2
Legal,1
Complaint/Feedback,1


Computing the named entities distribution is similar. See the `neanno.utils.text` module for more details.
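
For intuition, a category distribution like the one above can be computed by hand with `collections.Counter`. This is only a sketch and not how neanno implements it. It assumes that categories are separated by `|`, as in the sample data, and the helper name `count_categories` is hypothetical. Unlike the output above, the sketch skips empty category strings.

```python
from collections import Counter


def count_categories(category_strings, separator="|"):
    """Count how often each category occurs in a sequence of category strings."""
    counts = Counter()
    for value in category_strings:
        # split e.g. "Trip Planning|Security" and ignore empty parts
        counts.update(part for part in value.split(separator) if part)
    return counts


# category strings copied from the first three rows of the sample data
sample_categories = [
    "Service Offering/Procedure|Technology",
    "Trip Planning|Customs/Immigration|Legal",
    "Trip Planning|Security",
]
distribution = count_categories(sample_categories)
for category, frequency in distribution.most_common():
    print(category, frequency)
```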

### Extract the dictionary / term distribution

Note: Named entity terms are treated as compound words, so each of them is extracted as a single term in the dictionary. This should give better quality than extracting single words only.

In [6]:
from neanno.utils.text import compute_term_distribution_from_column
from operator import itemgetter

# get term distribution
term_distribution = compute_term_distribution_from_column(df["text"], include_entity_codes=False)

# show (only first few to avoid blowing up the notebook)
df_to_show = pd.DataFrame(
    sorted(term_distribution.items(), key=itemgetter(1), reverse=True),
    columns=["Term", "Frequency"],
)
df_to_show.head()

Unnamed: 0,Term,Frequency
0,to,3350
1,the,2765
2,I,2193
3,and,1730
4,a,1593


## Metrics

The evaluation metric computations can be found in `neanno.utils.metrics`.

### Compute precision/recall for recognized named entities


In [7]:
from neanno.utils.metrics import compute_ner_metrics

# get metrics (using the same annotations for actual/predicted for the sake of simplicity)
ner_metrics = compute_ner_metrics(df["text"], df["text"])

# show
df_to_show = pd.DataFrame(ner_metrics).T
df_to_show

Unnamed: 0,correct,incorrect,number_predictions,possible,precision,recall
AIRCRAFT,13.0,0.0,13.0,13.0,1.0,1.0
AIRLINE,52.0,0.0,52.0,52.0,1.0,1.0
AT,7.0,0.0,7.0,7.0,1.0,1.0
FROM,44.0,0.0,44.0,44.0,1.0,1.0
TO,64.0,0.0,64.0,64.0,1.0,1.0
VIA,15.0,0.0,15.0,15.0,1.0,1.0
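
The `precision` and `recall` columns are consistent with the usual definitions: precision is `correct / number_predictions` and recall is `correct / possible`. The sketch below recomputes them from the counts. The helper `precision_recall` is hypothetical, and returning 0.0 when a denominator is zero is an assumption, not necessarily what neanno does. The example uses the `FROM` counts from the batch training output further down in this notebook.

```python
def precision_recall(correct, number_predictions, possible):
    """Return (precision, recall) computed from raw counts."""
    # assumption: fall back to 0.0 instead of raising when nothing was predicted/possible
    precision = correct / number_predictions if number_predictions else 0.0
    recall = correct / possible if possible else 0.0
    return precision, recall


# FROM row of the batch training output: correct=7, number_predictions=8, possible=12
precision, recall = precision_recall(correct=7, number_predictions=8, possible=12)
print(round(precision, 6), round(recall, 6))
```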


## Train and Predict annotations

### Predict

In [8]:
from neanno.prediction.pipeline import PredictionPipeline
from neanno.prediction.key_terms.from_dataset import FromDatasetKeyTermsPredictor

# create a prediction pipeline
prediction_pipeline = PredictionPipeline()

# create and add a predictor to the pipeline
# notes: - a pipeline can have an arbitrary number of predictors
#        - see the sample project files and/or the validation schema within the
#          predictor classes for more information about the config options
#        - predictors validate the config they are given during instantiation
key_terms_predictor = FromDatasetKeyTermsPredictor({
    "location": "csv:../samples/airline_tickets/default.key_terms.csv"
})
key_terms_predictor.load_dataset("csv:../samples/airline_tickets/default.key_terms.csv")
prediction_pipeline.add_predictor(key_terms_predictor)

# ask the pipeline to predict some annotations from a text
text_with_predicted_annotations = prediction_pipeline.predict_inline_annotations("Can I use wifi during flight?")
for annotation in extract_annotations_as_generator(text_with_predicted_annotations):
    print(annotation)

{'term': 'wifi', 'type': 'standalone_key_term', 'start_net': 10, 'end_net': 14, 'start_gross': 10, 'end_gross': 21}


### Train
#### Online Training with single cases

In [9]:
# teach new annotations
# note: when we teach a FromDatasetKeyTermsPredictor, it will write back its learnings to the dataset.
#       to avoid breaking the key terms dataset of the airline_tickets sample, we simply teach a term
#       which is already known.
prediction_pipeline.learn_from_annotated_text("Can I use `wifi``SK`´ during flight?", "en-US")

# ask the pipeline again to predict the annotations
# note: the language parameter is optional. if it is not specified, en-US will be used as default.
text_with_predicted_annotations = prediction_pipeline.predict_inline_annotations("Can I use wifi during flight?", "en-US")
for annotation in extract_annotations_as_generator(text_with_predicted_annotations):
    print(annotation)

{'term': 'wifi', 'type': 'standalone_key_term', 'start_net': 10, 'end_net': 14, 'start_gross': 10, 'end_gross': 21}


#### Batch Training

The `FromDatasetKeyTermsPredictor`, for example, is a predictor which learns from single text examples. However, there are also predictors which learn from a whole dataset in a batch, e.g. the `FromSpacyNamedEntitiesPredictor`. To teach these predictors, use the pipeline's `learn_from_annotated_dataset` method.

> Note: Predictors built for online training will not learn automatically when batch training is started (unless they support batch training as well). Use the right training method for each predictor. If the training method does not match, the respective predictor will learn nothing and no exception will be thrown!

In [10]:
from neanno.prediction.named_entities.from_spacy import FromSpacyNamedEntitiesPredictor
from neanno.configuration.definitions import NamedEntityDefinition

# create a FromSpacyNamedEntitiesPredictor and add it to the pipeline
named_entities_predictor = FromSpacyNamedEntitiesPredictor({
    "source_model": "blank:en"
})
prediction_pipeline.add_predictor(named_entities_predictor)

# disable the key terms predictor just for fun - because we can ;-)
key_terms_predictor.is_prediction_enabled = False

# show how many annotated texts we currently have
print("Using {} annotated texts for training/testing.".format(df["is_text_annotated"].sum()))

# ask the pipeline to learn from the dataset
language_column = ""
categories_column = ""
categories_to_train = []
entity_codes_to_train = ["FROM", "TO", "AIRLINE"]
prediction_pipeline.learn_from_annotated_dataset(df, "text", "is_text_annotated",
    language_column, categories_column, categories_to_train, entity_codes_to_train)

Using 46 annotated texts for training/testing.
Training NER model with predictor '32ec3b33-9213-463a-acf8-02e04f2404b6'...
Iteration: 0...
Iteration: 1...
Iteration: 2...
Iteration: 3...
Iteration: 4...
Iteration: 5...
Iteration: 6...
Iteration: 7...
Iteration: 8...
Iteration: 9...
Computing precision/recall matrix...
         correct  incorrect  number_predictions  possible  precision    recall
FROM         7.0        1.0                 8.0      12.0   0.875000  0.583333
TO           9.0        4.0                13.0      16.0   0.692308  0.562500
AIRLINE      4.0       10.0                14.0      14.0   0.285714  0.285714
=> Success
Done.


In [11]:
# ask the pipeline again to predict some inline annotations
text_with_predicted_annotations = prediction_pipeline.predict_inline_annotations("We wanna fly to Cancun next year with Contoso.")
for annotation in extract_annotations_as_generator(text_with_predicted_annotations):
    print("{} = {}".format(annotation["term"], annotation["entity_code"]))
    
print("")

text_with_predicted_annotations = prediction_pipeline.predict_inline_annotations("I went to Germany last summer with Lufthansa.")
for annotation in extract_annotations_as_generator(text_with_predicted_annotations):
    print("{} = {}".format(annotation["term"], annotation["entity_code"]))

Cancun = TO
Contoso = AIRLINE

Germany = TO
Lufthansa = AIRLINE


## Bring your own predictor

To bring your own predictor, you need to write a class which inherits from `neanno.prediction.predictor.Predictor` and either reference that new class in a neanno configuration file or use it directly in your Python code (see above).

The included predictors, e.g. the `FromRegexesKeyTermsPredictor`, are good templates for writing your own.

Depending on what your predictor should do, there are different methods to implement. The main ones are:

- `learn_from_annotated_text()`
- `learn_from_annotated_dataset()`
- `predict_inline_annotations()`
- `predict_text_categories()`

Since the base class already implements default variants of these methods, a new predictor only has to implement those where it should behave differently from the base class. To see all methods you can override, see the `Predictor` base class mentioned above.
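
The pattern can be sketched as follows. Note that `PredictorSketch` is a hypothetical stand-in written for this example, not neanno's actual `Predictor` class. Its method signatures and return values are assumptions, and a real predictor would inherit from `neanno.prediction.predictor.Predictor` and receive a configuration. The sketch also illustrates why a predictor that only implements online training silently learns nothing from batch training: the inherited default does nothing.

```python
import re


class PredictorSketch:
    """Hypothetical stand-in for a predictor base class with no-op defaults."""

    def learn_from_annotated_text(self, annotated_text, language="en-US"):
        pass  # default: learn nothing

    def learn_from_annotated_dataset(self, dataset, *args, **kwargs):
        pass  # default: learn nothing

    def predict_inline_annotations(self, text, language="en-US"):
        return text  # default: return the text unchanged

    def predict_text_categories(self, text, language="en-US"):
        return []  # default: predict no categories


class KeyTermMemoryPredictor(PredictorSketch):
    """Remembers standalone key terms from single annotated texts (online training)."""

    def __init__(self):
        self.known_terms = set()

    def learn_from_annotated_text(self, annotated_text, language="en-US"):
        # assumption: only the standalone key term markup `term``SK`´ is handled
        for match in re.finditer(r"`([^`]+?)``SK`´", annotated_text):
            self.known_terms.add(match.group(1))

    def predict_inline_annotations(self, text, language="en-US"):
        if not self.known_terms:
            return text
        # try longer terms first so they win over shorter terms they contain
        ordered = sorted(self.known_terms, key=len, reverse=True)
        pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
        return pattern.sub(lambda m: "`" + m.group(1) + "``SK`´", text)


predictor = KeyTermMemoryPredictor()
predictor.learn_from_annotated_dataset(None)  # inherited no-op, nothing is learned
print(predictor.predict_inline_annotations("Can I use wifi during flight?"))
predictor.learn_from_annotated_text("Can I use `wifi``SK`´ during flight?")
print(predictor.predict_inline_annotations("Can I use wifi during flight?"))
```

Only the two methods that differ from the defaults are overridden. `learn_from_annotated_dataset` and `predict_text_categories` fall through to the stand-in base class.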

When a predictor is created, it is passed a configuration, and the base class checks whether that configuration matches the expected schema. Predictors tell neanno which configuration they expect in addition to that of the base class by having `project_config_validation_schema_custom_part` return a validation schema. neanno uses the cerberus package for validation; see the [cerberus documentation](http://docs.python-cerberus.org/en/stable).