# PoC of FlairNLP Named Entity Recognition integration with DKPro

[DKPro](https://dkpro.github.io/dkpro-core/info/) is:

> DKPro Core addresses tasks that are commonly referred to as linguistic pre-processing, e.g. part-of-speech taggers, parsers, etc. Within DKPro Core, a steadily growing set of third-party tools for such tasks have been wrapped into interoperable and interchangeable components for the Apache UIMA framework.

[Flair](https://github.com/flairNLP/flair) is:

> A powerful NLP library. Flair allows you to apply our state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification.

DKPro Core is Java-based and integrates taggers such as StanfordNLP and OpenNLP, whereas Flair is a Python-based library that provides state-of-the-art NLP models. This PoC shows how to connect the Java-based DKPro Core with the Python-based Flair, using DKPro-Cassis as middleware.

[DKPro-Cassis](https://github.com/dkpro/dkpro-cassis) provides:
> A pure-Python implementation of the Common Analysis System (CAS) as defined by the UIMA framework. The CAS is a data structure representing an object to be enriched with annotations (the so-called Subject of Analysis, short SofA).

In [1]:
%reload_ext watermark
%watermark -v -p flair,torch

CPython 3.8.3
IPython 7.17.0

flair 0.5.1
torch 1.6.0


> To use this notebook, the libraries imported below (Flair and DKPro-Cassis) must be installed. It is advisable to create a Python virtual environment and install the required libraries into it.
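
> As a quick sanity check of the environment, the sketch below reports which of the required packages are installed. It relies only on the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are the ones published on PyPI (`dkpro-cassis` provides the `cassis` module).

```python
from importlib import metadata

# Distribution names as published on PyPI; "dkpro-cassis" provides `cassis`.
REQUIRED = ("flair", "torch", "dkpro-cassis")


def installed_versions(packages=REQUIRED):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

> Any package reported as `NOT INSTALLED` can be installed into the active virtual environment with `pip`.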

In [2]:

from flair.data import Sentence
from flair.models import SequenceTagger
from cassis import load_cas_from_xmi, load_typesystem

> The TypeSystem and the CAS object are generated by the DKPro Core Java code. The CAS file contains the output of the OpenNlpSegmenter, which records the begin and end offsets of each token. Both generated files are loaded manually in the notebook.

In [3]:
typeSystemFile = './data/TypeSystem.xml'
casFile = './data/output_OpenNlpSegmenter.xmi'

segmenter = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"

# Load Typesystem
with open(typeSystemFile, 'rb') as f:
    typesystem = load_typesystem(f)

# Load Cas
with open(casFile, 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)
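
> To give an idea of what the XMI file carries, the sketch below reads token offsets from a tiny hand-written sample with the standard library. The sample is an assumption made for illustration: real DKPro Core output contains more namespaces, features and annotation types, and this is not how cassis parses the file. The helper name `read_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified XMI sample (illustrative assumption only).
SAMPLE_XMI = """\
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:cas="http:///uima/cas.ecore"
         xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore"
         xmi:version="2.0">
  <cas:NULL xmi:id="0"/>
  <type:Token xmi:id="2" sofa="1" begin="0" end="5"/>
  <type:Token xmi:id="3" sofa="1" begin="6" end="9"/>
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
            mimeType="text/plain" sofaString="Obama was born"/>
  <cas:View sofa="1" members="2 3"/>
</xmi:XMI>
"""


def read_tokens(xmi_text):
    """Return (begin, end, covered_text) for every Token element."""
    root = ET.fromstring(xmi_text)
    # ElementTree expands prefixes to "{namespace-uri}LocalName".
    sofa = next(el for el in root if el.tag.endswith("}Sofa"))
    document = sofa.get("sofaString")
    tokens = []
    for el in root:
        if el.tag.endswith("}Token"):
            begin, end = int(el.get("begin")), int(el.get("end"))
            tokens.append((begin, end, document[begin:end]))
    return tokens


for begin, end, covered in read_tokens(SAMPLE_XMI):
    print(begin, end, covered)
```

> In the notebook itself, `load_cas_from_xmi` takes care of this parsing and also resolves the types against the loaded TypeSystem.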

> The CAS object holds the following document, which is available as the value of "sofa_string".

In [4]:
text = cas.sofa_string
print(text)

Obama was born in Honolulu, Hawaii, making him the first president not born in North America .
After graduating from Columbia University in 1983, he worked as a community organizer in Chicago .
In 1988, he enrolled in Harvard Law School, where he was the first black person to head the Harvard Law Review .
After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004 .
Turning to elective politics, he represented the 13th district from 1997 until 2004 in the Illinois Senate, when he ran for the U.S. Senate. Obama received national attention in 2004 with his March Senate primary win, his well-received July Democratic National Convention keynote address, and his landslide November election to the Senate .
In 2008, he was nominated for president a year after his presidential campaign began, and after close primary campaigns against Hillary Clinton .
Obama was elected over Republican John McCain and

> After loading the CAS object, we can display the first five tokens in the document.

In [5]:
from itertools import islice

# Print the offsets of the first five tokens
for token in islice(cas.select(segmenter), 5):
    print('Token: begin: {0} \t end: {1}'.format(token.begin, token.end))

Token: begin: 0 	 end: 5
Token: begin: 6 	 end: 9
Token: begin: 10 	 end: 14
Token: begin: 15 	 end: 17
Token: begin: 18 	 end: 26
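
> The begin and end values are character offsets into the document text, with an exclusive end. A minimal sketch, using only the first sentence and the five offsets printed above, shows how they map back to the token text:

```python
# First sentence of the document and the first five token offsets, both
# copied from the notebook output above.
document = ("Obama was born in Honolulu, Hawaii, making him the first "
            "president not born in North America .")
offsets = [(0, 5), (6, 9), (10, 14), (15, 17), (18, 26)]

# `end` is exclusive, so a plain slice recovers the token text.
covered = [document[begin:end] for begin, end in offsets]
print(covered)
```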


> The Flair named entity recognition model is loaded, and the document is passed to it to predict the NER tags.

In [6]:
model_name = 'ner'
text = Sentence(text)
nlp = SequenceTagger.load(model_name)
nlp.predict(text)

2020-08-11 19:09:09,458 loading file C:\Users\shoeb\.flair\models\en-ner-conll03-v0.4.pt


> Flair provides a built-in method, get_spans, that returns the labelled spans of the document. It lists every NER tag together with the indices of the tokens it covers.

In [7]:
for entity in text.get_spans('ner'):
    print(entity)

Span [1]: "Obama"   [− Labels: PER (0.9999)]
Span [5,6]: "Honolulu, Hawaii,"   [− Labels: LOC (0.9612)]
Span [15,16]: "North America"   [− Labels: LOC (0.9993)]
Span [20,21]: "Columbia University"   [− Labels: ORG (0.9741)]
Span [31]: "Chicago"   [− Labels: LOC (1.0)]
Span [37,38,39]: "Harvard Law School,"   [− Labels: LOC (0.8647)]
Span [50,51,52]: "Harvard Law Review"   [− Labels: ORG (0.9719)]
Span [69,70,71,72,73]: "University of Chicago Law School"   [− Labels: ORG (0.8778)]
Span [93,94]: "Illinois Senate,"   [− Labels: LOC (0.6874)]
Span [100]: "U.S."   [− Labels: LOC (0.9976)]
Span [102]: "Obama"   [− Labels: PER (0.9999)]
Span [111]: "Senate"   [− Labels: ORG (1.0)]
Span [117,118,119]: "Democratic National Convention"   [− Labels: MISC (0.8716)]
Span [129]: "Senate"   [− Labels: ORG (1.0)]
Span [150,151]: "Hillary Clinton"   [− Labels: PER (0.9871)]
Span [156]: "Republican"   [− Labels: MISC (1.0)]
Span [157,158]: "John McCain"   [− Labels: PER (0.9913)]
Span [163,164]: "Joe Bi

> The NamedEntity type from the DKPro Core TypeSystem is used to store the NER tags predicted by Flair. The Flair NER model has only a small tag set, so its tags are mapped to the corresponding DKPro Core NER values.

In [8]:
NERType = cas.typesystem.get_type("de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity")

# Map the Flair tags to the values stored in the DKPro Core NamedEntity annotations
tag_map = {
    'PER': 'person',
    'LOC': 'location',
    'ORG': 'organization',
    'MISC': 'miscellaneous',
}

for span in text.get_spans('ner'):
    # Fall back to the raw Flair tag so an unmapped tag never reuses a stale value
    val = tag_map.get(span.tag, span.tag)
    ner_annotation = NERType(begin=span.start_pos,
                             end=span.end_pos,
                             value=val)
    cas.add_annotation(ner_annotation)

> The NER tags are added to the CAS object as annotations, so it now contains the segmenter annotations as well as the NER annotations from Flair. Below is the list of all named entity annotations.

In [9]:
for token in cas.select("de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity"):
    print('Token: begin: {0} \t end: {1} \t NER-tag: {2}'.format(token.begin, token.end, token.value))

Token: begin: 0 	 end: 5 	 NER-tag: person
Token: begin: 18 	 end: 35 	 NER-tag: location
Token: begin: 79 	 end: 92 	 NER-tag: location
Token: begin: 117 	 end: 136 	 NER-tag: organization
Token: begin: 184 	 end: 191 	 NER-tag: location
Token: begin: 218 	 end: 237 	 NER-tag: location
Token: begin: 286 	 end: 304 	 NER-tag: organization
Token: begin: 411 	 end: 443 	 NER-tag: organization
Token: begin: 555 	 end: 571 	 NER-tag: location
Token: begin: 592 	 end: 596 	 NER-tag: location
Token: begin: 605 	 end: 610 	 NER-tag: person
Token: begin: 662 	 end: 668 	 NER-tag: organization
Token: begin: 705 	 end: 735 	 NER-tag: miscellaneous
Token: begin: 796 	 end: 802 	 NER-tag: organization
Token: begin: 933 	 end: 948 	 NER-tag: person
Token: begin: 974 	 end: 984 	 NER-tag: miscellaneous
Token: begin: 985 	 end: 996 	 NER-tag: person
Token: begin: 1027 	 end: 1036 	 NER-tag: person
Token: begin: 1100 	 end: 1117 	 NER-tag: miscellaneous
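
> Some of the spans above include trailing punctuation: offsets 18 to 35, for example, cover "Honolulu, Hawaii," with the final comma. If that is not wanted, the offsets could be tightened before the annotation is added. The sketch below is one possible approach and is not part of the original notebook; the helper name `trim_span` is hypothetical.

```python
import string

STRIP_CHARS = string.punctuation + string.whitespace


def trim_span(document, begin, end, strip_chars=STRIP_CHARS):
    """Shrink (begin, end) until the covered text neither starts nor ends
    with punctuation or whitespace."""
    while begin < end and document[begin] in strip_chars:
        begin += 1
    while end > begin and document[end - 1] in strip_chars:
        end -= 1
    return begin, end


document = ("Obama was born in Honolulu, Hawaii, making him the first "
            "president not born in North America .")

# Offsets copied from the output above.
for begin, end in [(0, 5), (18, 35), (79, 92)]:
    new_begin, new_end = trim_span(document, begin, end)
    print(repr(document[begin:end]), "->", repr(document[new_begin:new_end]))
```

> Note that such a rule also shortens abbreviations that legitimately end in a period, such as "U.S.", so it may need exceptions in practice.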


> The updated CAS object is written to disk as input for the second DKPro Core component.

In [10]:
cas.to_xmi('./data/output_FlairNER.xmi', pretty_print=True)

> The second DKPro Core component is the OpenNLP POS tagger. It takes the CAS object produced by the Flair NER step and outputs a new CAS object.

In [11]:
casFile = './data/output_OpenNlpPosTagger.xmi'
PosTagger = "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS"

# Load CAS
with open(casFile, 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

> The latest CAS object, produced by the OpenNLP POS tagger in DKPro Core, is loaded, and the first 10 annotations with the POS tag NNP are displayed.

In [12]:
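from itertools import islice

# Print the first 10 annotations tagged as proper nouns (NNP)
nnp_tokens = (t for t in cas.select(PosTagger) if t.PosValue == 'NNP')
for token in islice(nnp_tokens, 10):
    print('Token: begin: {0} \t end: {1} \t POS-tag: {2}'.format(token.begin, token.end, token.PosValue))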

Token: begin: 0 	 end: 5 	 POS-tag: NNP
Token: begin: 0 	 end: 5 	 POS-tag: NNP
Token: begin: 18 	 end: 26 	 POS-tag: NNP
Token: begin: 18 	 end: 35 	 POS-tag: NNP
Token: begin: 28 	 end: 34 	 POS-tag: NNP
Token: begin: 79 	 end: 84 	 POS-tag: NNP
Token: begin: 79 	 end: 92 	 POS-tag: NNP
Token: begin: 85 	 end: 92 	 POS-tag: NNP
Token: begin: 117 	 end: 125 	 POS-tag: NNP
Token: begin: 117 	 end: 136 	 POS-tag: NNP
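
> Because all annotations share the same character offsets, annotations from different components can be related to each other. The sketch below uses plain tuples with a handful of offsets copied from the outputs above and finds the POS-tagged tokens that lie inside each named entity. The helper name `covered_by` is hypothetical; on a real CAS object, cassis offers `cas.select_covered(type_name, covering_annotation)` for this kind of lookup.

```python
# Offsets copied from the notebook outputs; only a handful of rows are used.
entities = [(0, 5, "person"), (18, 35, "location"), (79, 92, "location")]
pos_tokens = [(0, 5, "NNP"), (18, 26, "NNP"), (28, 34, "NNP"),
              (79, 84, "NNP"), (85, 92, "NNP")]


def covered_by(span, annotations):
    """Return the annotations that lie completely inside span (begin, end)."""
    begin, end = span
    return [a for a in annotations if begin <= a[0] and a[1] <= end]


for begin, end, label in entities:
    tags = [tag for _, _, tag in covered_by((begin, end), pos_tokens)]
    print(f"{label:<10} {begin:>3}-{end:<3} POS: {tags}")
```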
