# Text Processing and Pipelines with spaCy

## NLP Dependency Labels

### Subjects

|                           |                                                                                                                 |
|---------------------------|-----------------------------------------------------------------------------------------------------------------|
| Nominal Subject           | A nominal subject (`nsubj`) is a non-clausal constituent in the subject position of an active verb.             |
| Nominal subject (passive) | A nominal passive subject (`nsubjpass`) is a non-clausal constituent in the subject position of a passive verb. |
| Clausal subject           | A clausal subject (`csubj`) is a clause in the subject position of an active verb.                              |
| Clausal subject (passive) | A clausal passive subject (`csubjpass`) is a clause in the subject position of a passive verb.                  |
| Agent                     | An agent (`agent`) is the complement of a passive verb that is the surface subject of its active form.          |
| Expletive                 | An expletive (`expl`) is an existential there in the subject position.                                          |

### Objects

|                  |                                                                                                                      |
|------------------|----------------------------------------------------------------------------------------------------------------------|
| Direct object    | A direct object (`dobj`) is a noun phrase that is the accusative object of a (di)transitive verb.                    |
| Dative           | A dative (`dative`) is a nominal or prepositional object of a dative-shifting verb.                                  |
| Attribute        | An attribute (`attr`) is a noun phrase that is a non-VP predicate usually following a copula verb.                   |
| Object predicate | An object predicate (`oprd`) is a non-VP predicate in a small clause that functions like the predicate of an object. |


### Complements

|                         |                                                                                                                                                  |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| Clausal complement      | A clausal complement (`ccomp`) is a clause with an internal subject that modifies the head of an `ADJP\|ADVP\|NML\|NP\|WHNP\|VP\|SINV\|SQ`. |
| Open clausal complement | An open clausal complement (`xcomp`) is a clause without an internal subject that modifies the head of an `ADJP\|ADVP\|VP\|SINV\|SQ`. |
| Adjectival complement   | An adjectival complement (`acomp`) is an adjective phrase that modifies the head of a `VP\|SINV\|SQ`, which is usually a verb. |

### Nominals

|                           |                                                                                                                                                                                                                                                                                                                                                            |
|---------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Appositional modifier     | An appositional modifier (`appos`) of an `NML\|NP` is a noun phrase that immediately follows another noun phrase and gives additional information about it. |
| Clausal modifier          | A finite or non-finite clausal modifier (`acl`) is one of: an infinitival modifier (an infinitive clause or phrase that modifies the head of a noun phrase), a participial modifier (a clause or phrase whose head is a verb in a participial form, e.g., gerund or past participle, that modifies the head of a noun phrase), or a complement. |
| Relative clause modifier  | A relative clause modifier (`relcl`) is either a relative clause or a reduced relative clause that modifies the head of an `NML\|NP\|WHNP`. |
| Determiner                | A determiner (`det`) is a word token whose POS tag is `DT\|WDT\|WP` that modifies the head of a noun phrase. |
| Pre-determiner            | A predeterminer (`predet`) is a word token whose POS tag is `PDT` that modifies the head of a noun phrase. |
| Numeric modifier          | A numeric modifier (`nummod`) is any number or quantifier phrase that modifies the head of a noun phrase. |
| Adjectival modifier       | An adjectival modifier (`amod`) is an adjective or an adjective phrase that modifies the meaning of another word, usually a noun. |
| Possession modifier       | A possession modifier (`poss`) is either a possessive determiner (`PRP$`) or an `NML\|NP\|WHNP` containing a possessive ending that modifies the head of an `ADJP\|NML\|NP\|QP\|WHNP`. |
| Modifier of nominal       | A modifier of nominal (`nmod`) is any unclassified dependent that modifies the head of a noun phrase. |


### Adverbials

|                           |                                                                                                              |
|---------------------------|--------------------------------------------------------------------------------------------------------------|
| Adverbial modifier        | An adverbial modifier (`advmod`) is an adverb or an adverb phrase that modifies the meaning of another word. |
| Adverbial clause modifier | An adverbial clause modifier (`advcl`) is a clause that acts like an adverbial modifier.                     |


### Negation modifier

A negation modifier (`neg`) is an adverb that gives negative meaning to its head.

### Noun phrase as adverbial modifier

An adverbial noun phrase modifier (`npadvmod`) is a noun phrase that acts like an adverbial modifier.

### Prepositions

|                             |                                                                                                                                                                                              |
|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Object of a preposition     | An object of a preposition (`pobj`) is a noun phrase that modifies the head of a prepositional phrase, which is usually a preposition but can be a verb in a participial form such as `VBG`. |
| Complement of a preposition | A complement of a preposition (`pcomp`) is any dependent that is not a `pobj` but modifies the head of a prepositional phrase.                                                               |


### Coordination

|                             |                                                                                                                                                          |
|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Conjunct                    | A conjunct (`conj`) is a dependent of the leftmost conjunct in coordination.                                                                             |
| Coordinating conjunction    | A coordinating conjunction (`cc`) is a dependent of the leftmost conjunct in coordination.                                                               |
| Pre-correlative conjunction | A pre-correlative conjunction (`preconj`) is the first part of a correlative conjunction that becomes a dependent of the first conjunct in coordination. |
| Prepositional modifier      | A prepositional modifier (`prep`) is any prepositional phrase that modifies the meaning of its head.                                                     |


### Auxiliaries

|                     |                                                                                                                              |
|---------------------|------------------------------------------------------------------------------------------------------------------------------|
| Auxiliary           | An auxiliary (`aux`) is an auxiliary or modal verb that gives further information about the main verb (e.g., tense, aspect). |
| Auxiliary (passive) | A passive auxiliary (`auxpass`) is an auxiliary verb, be, become, or get, that modifies a passive verb.                      |


### Compound words

|             |                                                                                                                                                                                                                                    |
|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Compound    | A compound (`compound`) is either a noun modifying the head of a noun phrase, a number modifying the head of a quantifier phrase, or a hyphenated word (or a preposition modifying the head of the prepositional phrase).          |
| Particle    | A particle (`prt`) is a preposition in a phrasal verb that forms a verb-particle construction.                                                                                                                                     |
| Case marker | A case marker (`case`) is either a possessive marker, ...                                                                                                                                                                          |
| Marker      | A marker (`mark`) is either a subordinating conjunction (e.g., although, because, while) that introduces an adverbial clause modifier, or a subordinating conjunction, if, that, or whether, that introduces a clausal complement. |


### Miscellaneous

|                        |                                                                                                                                                                                                                |
|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Unclassified dependent | An unclassified dependent (`dep`) is a dependent that does not satisfy conditions for any other dependency.                                                                                                    |
| Meta modifier          | A meta modifier (`meta`) is code (1), embedded (2), or meta (3) information that is randomly inserted in a phrase or clause.                                                                                   |
| Parenthetical modifier | A parenthetical modifier (`parataxis`) is an embedded chunk, often but not necessarily surrounded by parenthetical notations (e.g., brackets, quotes, commas), which gives side information to its head.        |
| Punctuation            | Any punctuation (`punct`) is assigned the dependency label PUNCT.                                                                                                                                              |
| Root                   | A root (`root`) is the root of a tree that does not depend on any node in the tree but the artificial root node.                                                                                               |
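
The label groups in the tables above can be turned into simple lookup sets for filtering tokens by grammatical role. The following is a minimal sketch; the names `SUBJECT_DEPS`, `OBJECT_DEPS`, and `role_of` are hypothetical, and the grouping simply mirrors the Subjects and Objects tables rather than any official spaCy categorization.

```python
# Label groups copied from the Subjects and Objects tables above.
SUBJECT_DEPS = frozenset({"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"})
OBJECT_DEPS = frozenset({"dobj", "dative", "attr", "oprd"})


def role_of(dep_label):
    """Return 'subject', 'object', or 'other' for a dependency label string."""
    if dep_label in SUBJECT_DEPS:
        return "subject"
    if dep_label in OBJECT_DEPS:
        return "object"
    return "other"


# Illustrative (text, label) pairs, not real parser output.
pairs = [("cars", "nsubj"), ("liability", "dobj"), ("toward", "prep")]
print([(text, role_of(dep)) for text, dep in pairs])
```

With a real `Doc`, the same check can be applied to `token.dep_`.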


## Experiments

In [1]:
import functools
import spacy
from spacy import displacy
import coreferee

In [2]:
nlp = spacy.load("en_core_web_trf")

In [3]:
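def print_deps(doc):
    """Print each token's text, POS tag, dependency label, and head."""
    for token in doc:
        print(f"{token.text} -- {token.pos_}")
        print(f"\tDEP: {token.dep_}")
        print(f"\tHEAD: {token.head.text}")

def render_deps(doc, compact=True, options=None):
    """Render the dependency tree inline in the notebook with displaCy."""
    options = options or {}
    displacy.render(doc, style="dep", jupyter=True, options={'compact': compact, **options})

def cprint(*objs, color=0, sep="", end='\n'):
    """Print objs in ANSI bright color 90 + color (the color is not reset afterwards)."""
    to_print = f"\033[9{color}m" + sep.join(map(str, objs))
    print(to_print, end=end)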

In [4]:
sample = "Autonomous cars shift insurance liability toward manufacturers"
doc = nlp(sample)
print(sample)
print("------")
print_deps(doc)

Autonomous cars shift insurance liability toward manufacturers
------
Autonomous -- ADJ
	DEP: amod
	HEAD: cars
cars -- NOUN
	DEP: nsubj
	HEAD: shift
shift -- VERB
	DEP: ROOT
	HEAD: shift
insurance -- NOUN
	DEP: compound
	HEAD: liability
liability -- NOUN
	DEP: dobj
	HEAD: shift
toward -- ADP
	DEP: prep
	HEAD: shift
manufacturers -- NOUN
	DEP: pobj
	HEAD: toward




Assuming the subject is **autonomous cars** and the object is **insurance liability**, we need to extract both phrases along with the relationship "shift toward manufacturers."

The parse supports this: "cars" is labeled `nsubj` with "Autonomous" as its `amod` child, "shift" is the root verb, and "liability" is the `dobj` with "insurance" as its `compound` child. The phrase "toward manufacturers" attaches to "shift" through the `prep` token "toward", which has "manufacturers" as its `pobj`.
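
The extraction described above can be sketched without spaCy by walking the head indices from the printed parse. This is only an illustration under stated assumptions: `Tok`, `PARSE`, `subtree`, `phrase`, and `extract` are hypothetical names, and the rows are transcribed by hand from the `print_deps` output.

```python
from collections import namedtuple

# One row per token: text, dependency label, and index of the head token.
Tok = namedtuple("Tok", ["text", "dep", "head"])

# Transcribed from the print_deps output above (the ROOT token is its own head).
PARSE = [
    Tok("Autonomous", "amod", 1),
    Tok("cars", "nsubj", 2),
    Tok("shift", "ROOT", 2),
    Tok("insurance", "compound", 4),
    Tok("liability", "dobj", 2),
    Tok("toward", "prep", 2),
    Tok("manufacturers", "pobj", 5),
]


def subtree(parse, i):
    """Return the sorted indices of token i and all of its descendants."""
    found = [i]
    for j, tok in enumerate(parse):
        if tok.head == i and j != i:
            found.extend(subtree(parse, j))
    return sorted(found)


def phrase(parse, i):
    """Join the words in the subtree rooted at token i, in sentence order."""
    return " ".join(parse[j].text for j in subtree(parse, i))


def extract(parse):
    """Collect the root verb and the phrases headed by its direct children."""
    root = next(i for i, tok in enumerate(parse) if tok.dep == "ROOT")
    result = {"verb": parse[root].text}
    for i, tok in enumerate(parse):
        if tok.head == root and i != root:
            result[tok.dep] = phrase(parse, i)
    return result


print(extract(PARSE))
# {'verb': 'shift', 'nsubj': 'Autonomous cars', 'dobj': 'insurance liability', 'prep': 'toward manufacturers'}
```

On a real `Doc`, `token.children` and `token.subtree` provide the same traversal directly.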

In [5]:
render_deps(doc)

## Pipeline and Processing

![image.png](attachment:eab42910-05c3-41b4-ac71-cc08ee1d3ce6.png)

In [6]:
full_text = """
The queen angelfish (Holacanthus ciliaris), also known as the blue angelfish, golden angelfish, or 
yellow angelfish, is a species of marine angelfish found in the western Atlantic Ocean. It is a 
benthic (ocean floor) warm-water species that lives in coral reefs. It is recognized by its blue 
and yellow coloration and a distinctive spot or "crown" on its forehead. This crown distinguishes 
it from the closely related and similar-looking Bermuda blue angelfish (Holacanthus bermudensis), 
with which it overlaps in range and can interbreed.Adult queen angelfish are selective feeders 
and primarily eat sponges. Their social structure consists of harems which include one male and 
up to four females. They live within a territory where the females forage separately and are 
tended to by the male. Breeding in the species occurs near a full moon. The transparent eggs 
are pelagic and float in the water, hatching after 15–20 hours. Juveniles of the species have 
different coloration than adults and act as cleaner fish. The queen angelfish is popular in 
the aquarium trade and has been a particularly common exported species from Brazil. In 2010, 
the queen angelfish was assessed as least concern by the International Union for 
Conservation of Nature as the wild population appeared to be stable.
""".replace('\n', '')
full_text

'The queen angelfish (Holacanthus ciliaris), also known as the blue angelfish, golden angelfish, or yellow angelfish, is a species of marine angelfish found in the western Atlantic Ocean. It is a benthic (ocean floor) warm-water species that lives in coral reefs. It is recognized by its blue and yellow coloration and a distinctive spot or "crown" on its forehead. This crown distinguishes it from the closely related and similar-looking Bermuda blue angelfish (Holacanthus bermudensis), with which it overlaps in range and can interbreed.Adult queen angelfish are selective feeders and primarily eat sponges. Their social structure consists of harems which include one male and up to four females. They live within a territory where the females forage separately and are tended to by the male. Breeding in the species occurs near a full moon. The transparent eggs are pelagic and float in the water, hatching after 15–20 hours. Juveniles of the species have different coloration than adults and act

### Merging phrases

In [7]:
def merge_phrases(doc):
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            attrs = {
                "tag": np.root.tag_,
                "lemma": np.root.lemma_,
                "ent_type": np.root.ent_type_,
            }
            retokenizer.merge(np, attrs=attrs)
    return doc

In [8]:
long_text2 = "Adult queen angelfish are selective feeders and primarily eat sponges."
long_doc2 = nlp(long_text2)

In [9]:
render_deps(long_doc2)

In [10]:
long_doc2_merged = merge_phrases(long_doc2)
render_deps(long_doc2_merged)

### Merging Entities

In [11]:
full_text_doc = nlp(full_text)

In [12]:
render_deps(list(full_text_doc.sents)[-1], options={'distance': 200})

In [13]:
displacy.render(list(full_text_doc.sents)[-1], style="ent", jupyter=True)

---
In the last sentence of the document, we see that the "International Union for Conservation of Nature" is an entity but is not recognized as a phrase by the dependency parser. We can merge these together.

In [14]:
def merge_entities(doc):
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    return doc
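
To make the effect of merging concrete, here is a plain-Python sketch of what collapsing a span into one token does to a token list. It is not how spaCy's retokenizer is implemented; `merge_spans` is a hypothetical helper, and the span boundary below is an assumption about where the entity starts and ends.

```python
def merge_spans(tokens, spans):
    """Merge each half-open (start, end) span of `tokens` into a single token.

    Spans must not overlap. They are applied right to left so that earlier
    indices stay valid while the list shrinks.
    """
    merged = list(tokens)
    for start, end in sorted(spans, reverse=True):
        merged[start:end] = [" ".join(merged[start:end])]
    return merged


# Tail of the last sentence; the span below is an assumed entity boundary.
tokens = ["by", "the", "International", "Union", "for", "Conservation", "of", "Nature"]
entity_spans = [(2, 8)]
print(merge_spans(tokens, entity_spans))
# ['by', 'the', 'International Union for Conservation of Nature']
```

After the merge, every token to the right of a span shifts to a lower index, which is why token indices differ between merged and unmerged versions of the same text.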

In [15]:
full_text_doc = merge_entities(merge_phrases(full_text_doc))

In [16]:
render_deps(list(full_text_doc.sents)[-1], options={'distance': 200})

### Constructing the pipeline

In [17]:
pipeline = spacy.load("en_core_web_trf")
pipeline.add_pipe("merge_entities") # built-in
pipeline.add_pipe("merge_noun_chunks")

<function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

In [18]:
full_text_doc = pipeline(full_text)

In [19]:
render_deps(list(full_text_doc.sents)[-1], options={'distance': 200})
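
Conceptually, a pipeline applies its components to the document in order, each one receiving the output of the previous one, so `merge_entities` runs before `merge_noun_chunks` here. The sketch below shows that composition with toy list-based components; `run_pipeline`, `lowercase`, and `drop_punct` are hypothetical stand-ins, not spaCy functions.

```python
from functools import reduce


def run_pipeline(components, doc):
    """Apply each component to the result of the previous one, left to right."""
    return reduce(lambda current, component: component(current), components, doc)


# Toy stand-ins for pipeline components: each takes a token list and returns one.
def lowercase(tokens):
    return [t.lower() for t in tokens]


def drop_punct(tokens):
    return [t for t in tokens if t.isalnum()]


print(run_pipeline([lowercase, drop_punct], ["Queen", "angelfish", "eat", "sponges", "."]))
# ['queen', 'angelfish', 'eat', 'sponges']
```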

### Finding the subject/object

Given a paragraph or a set of sentences, find the subjects and objects in each sentence. When a subject or object matches the topic of the article, the sentence is added to the knowledge graph with links to the other entities it contains, provided it contains any.
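
The linking rule can be sketched in plain Python as follows. This is one possible reading of the rule, not a finished design: `matches` and `link_sentence` are hypothetical names, and case-insensitive substring matching is an assumption made for the example.

```python
def matches(phrase, target):
    """Case-insensitive containment check between an extracted phrase and a target."""
    return target.lower() in phrase.lower()


def link_sentence(topic, phrases, relevant):
    """Return (topic, entity) edges for one sentence, or [] if the rule is not met.

    The sentence qualifies only if some phrase mentions the topic and at
    least one other phrase mentions a relevant entity.
    """
    if not any(matches(p, topic) for p in phrases):
        return []
    edges = []
    for p in phrases:
        if matches(p, topic):
            continue
        for entity in relevant:
            if matches(p, entity) and (topic, entity) not in edges:
                edges.append((topic, entity))
    return edges


# Hypothetical input: subject/object phrases extracted from a single sentence.
phrases = ["The queen angelfish", "the blue angelfish", "marine angelfish",
           "the western Atlantic Ocean"]
relevant = ["species", "marine angelfish", "atlantic ocean", "coral reefs"]
print(link_sentence("queen angelfish", phrases, relevant))
# [('queen angelfish', 'marine angelfish'), ('queen angelfish', 'atlantic ocean')]
```

A sentence whose subjects and objects never mention the topic yields no edges, so it is left out of the graph.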

In [20]:
# In the future, these will be found by parsing the wikitext and looking for [[]] text.
relevant_objects = ["species", "marine angelfish", "atlantic ocean", 
                    "benthic", "coral reefs", "holacanthus bermudensis", 
                    "sponges", "harems", "territory", "breeding", 
                    "full moon", "eggs", "pelagic", "cleaner fish", 
                    "aquarium trade", "brazil", "least concern", 
                    "international union for conservation of nature"]

In [21]:
def get_subjects_objects(doc):
    """Get the subjects and objects of each sentence in the doc.

    Returns one list per sentence, containing the spaCy tokens whose
    dependency label is nsubj, nsubjpass, dobj, or pobj."""
    return [[tok for tok in sent if tok.dep_ in ['nsubj', 'nsubjpass', 'dobj', 'pobj']] for sent in doc.sents]

In [22]:
for sent_subj_objs, sent in zip(get_subjects_objects(full_text_doc), full_text_doc.sents):
    for subj_obj in sent_subj_objs:
        cprint(f"{subj_obj} - type:{subj_obj.dep_} - head:{subj_obj.head} - i:{subj_obj.i - sent.start}", color=3)
    cprint("\t", sent, color=0)

The queen angelfish - type:nsubj - head:is - i:0
the blue angelfish - type:pobj - head:as - i:8
marine angelfish - type:pobj - head:of - i:18
the western Atlantic Ocean - type:pobj - head:in - i:21
	The queen angelfish (Holacanthus ciliaris), also known as the blue angelfish, golden angelfish, or yellow angelfish, is a species of marine angelfish found in the western Atlantic Ocean.
It - type:nsubj - head:is - i:0
that - type:nsubj - head:lives - i:3
coral reefs - type:pobj - head:in - i:6
	It is a benthic (ocean floor) warm-water species that lives in coral reefs.
It - type:nsubjpass - head:recognized - i:0
its blue and yellow coloration - type:pobj - head:by - i:4
its forehead - type:pobj - head:on - i:11
	It is recognized by its blue and yellow coloration and a distinctive spot or "crown" on its forehead.
This crown - type:nsubj - head:distinguishes - i:0
it - type:dobj - head:distinguishes - i:2
the clo

### Pronoun resolution

I'll just use the `coreferee` plugin to resolve pronouns. 

In [23]:
pipeline.add_pipe('coreferee')

<coreferee.manager.CorefereeBroker at 0x17f831cb040>

In [24]:
doc = pipeline("The device that reads from a magnetic drive and writes data to it is called a disk.")

In [25]:
def render_corefs(doc):
    for sent in doc.sents:
        for token in sent:
            if token.pos_ == 'NOUN':
                cprint(f"{token} ({token.i})", color=3, end=' ')
            elif token.pos_ == 'PRON':
                cprint(f"{token} ({token.i})", color=2, end=' ')
            else:
                cprint(token, end=' ')
        print('\n')
    doc._.coref_chains.print()

In [26]:
render_corefs(doc)

[93mThe device (0) [92mthat (1) [90mreads [90mfrom [93ma magnetic drive (4) [90mand [90mwrites [93mdata (7) [90mto [92mit (9) [90mis [90mcalled [93ma disk (12) [90m. 

0: a magnetic drive(4), it(9)
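
Once a chain is known, resolving a pronoun amounts to substituting the chain's first mention for each later one. The sketch below does this on plain lists; `resolve_pronouns` is a hypothetical helper that does not use Coreferee's API, and the tokens and chain are transcribed by hand from the output above.

```python
def resolve_pronouns(tokens, chains):
    """Replace every later mention in a chain with the chain's first mention.

    `tokens` is a list of token texts; `chains` is a list of lists of token
    indices, each ordered by position in the text.
    """
    resolved = list(tokens)
    for chain in chains:
        antecedent = tokens[chain[0]]
        for index in chain[1:]:
            resolved[index] = antecedent
    return resolved


# Merged tokens and the single chain, transcribed from the output above.
tokens = ["The device", "that", "reads", "from", "a magnetic drive", "and",
          "writes", "data", "to", "it", "is", "called", "a disk", "."]
chains = [[4, 9]]
print(" ".join(resolve_pronouns(tokens, chains)))
# The device that reads from a magnetic drive and writes data to a magnetic drive is called a disk .
```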


In [27]:
long_doc = '''
Germany, formally the Federal Republic of Germany, is a country in Central Europe. 
It is the second most populous country in Europe after Russia, and the most populous member state of the European Union. 
Germany is situated between the Baltic and North seas to the north, and the Alps to the south; it covers an area of 357,022 square kilometres (137,847 sq mi), with a population of over 83 million within its 16 constituent states. 
Germany borders Denmark to the north, Poland and the Czech Republic to the east, Austria and Switzerland to the south, and France, Luxembourg, Belgium, and the Netherlands to the west. 
The nation's capital and largest city is Berlin and its financial centre is Frankfurt; the largest urban area is the Ruhr. 
Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP and the fifth-largest by PPP. 
As a global leader in several industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer of goods. 
As a developed country, which ranks highly on the Human Development Index, it offers social security and a universal health care system, environmental protections, a tuition-free university education, and it is ranked as 18th most peaceful country in the world. 
Germany is a member of the United Nations, NATO, the G7, the G20 and the OECD. It has the third-greatest number of UNESCO World Heritage Sites. 
'''

In [28]:
render_corefs(pipeline(long_doc))

[90m
 [90mGermany [90m, [90mformally the Federal Republic of Germany [90m, [90mis [93ma country (6) [90min [90mCentral Europe [90m. 

[90m
 [92mIt (11) [90mis [93mthe second most populous country (13) [90min [90mEurope [90mafter [90mRussia [90m, [90mand [93mthe most populous member state (20) [90mof [90mthe European Union [90m. 

[90m
 [90mGermany [90mis [90msituated [90mbetween [93mthe Baltic and North seas (29) [90mto [93mthe north (31) [90m, [90mand [90mthe Alps [90mto [93mthe south (36) [90m; [92mit (38) [90mcovers [93man area (40) [90mof [93m357,022 square kilometres (42) [90m( [93m137,847 sq mi (44) [90m) [90m, [90mwith [93ma population (48) [90mof [90mover [90m83 [90mmillion [90mwithin [93mits 16 constituent states (54) [90m. 

[90m
 [90mGermany [90mborders [90mDenmark [90mto [93mthe north (61) [90m, [90mPoland [90mand [90mthe Czech Republic [90mto [93mthe east (67) [90m, [90mAustria [90mand [90mSwitzerland 

Here the `trf` pipeline **incorrectly** identifies "Human Development Index" as the antecedent.

In [29]:
long_doc2 = '''
The Germanic tribes are thought to date from the Nordic Bronze Age or the Pre-Roman Iron Age.
From southern Scandinavia and north Germany, they expanded south, east, and west, coming into contact with the Celtic, Iranian, Baltic, and Slavic tribes.
Under Augustus, the Roman Empire began to invade lands inhabited by the Germanic tribes, creating a short-lived Roman province of Germania between the Rhine and Elbe rivers. 
In 9 AD, three Roman legions were defeated by Arminius. By 100 AD, when Tacitus wrote Germania, Germanic tribes had settled along the Rhine and the Danube (the Limes Germanicus), occupying most of modern Germany. 
However, Baden Württemberg, southern Bavaria, southern Hesse and the western Rhineland had been incorporated into Roman provinces. 
Around 260, Germanic peoples broke into Roman-controlled lands.
After the invasion of the Huns in 375, and with the decline of Rome from 395, Germanic tribes moved farther southwest: the Franks established the Frankish Kingdom and pushed east to subjugate Saxony and Bavaria, and areas of what is today eastern Germany were inhabited by Western Slavic tribes.
'''

In [30]:
render_corefs(pipeline(long_doc2))

[90m
 [93mThe Germanic tribes (1) [90mare [90mthought [90mto [90mdate [90mfrom [90mthe Nordic Bronze Age [90mor [90mthe Pre-Roman Iron Age [90m. 

[90m
 [90mFrom [90msouthern Scandinavia [90mand [90mnorth Germany [90m, [92mthey (17) [90mexpanded [90msouth [90m, [93meast (21) [90m, [90mand [93mwest (24) [90m, [90mcoming [90minto [93mcontact (28) [90mwith [93mthe Celtic, Iranian, Baltic, and Slavic tribes (30) [90m. 

[90m
 [90mUnder [90mAugustus [90m, [90mthe Roman Empire [90mbegan [90mto [90minvade [93mlands (40) [90minhabited [90mby [93mthe Germanic tribes (43) [90m, [90mcreating [93ma short-lived Roman province (46) [90mof [90mGermania [90mbetween [93mthe Rhine and Elbe rivers (50) [90m. 

[90m
 [90mIn [93m9 AD (54) [90m, [93mthree Roman legions (56) [90mwere [90mdefeated [90mby [90mArminius [90m. 

[90mBy [93m100 AD (63) [90m, [90mwhen [90mTacitus [90mwrote [90mGermania [90m, [93mGermanic tribes (70) [90mhad [9

The resolution on this second text is fine.

#### Comparing Coreferee models

Compare the `trf` results with the `sm` and `lg` models.

In [31]:
pipeline_sm = spacy.load("en_core_web_sm")
pipeline_sm.add_pipe("merge_entities")
pipeline_sm.add_pipe("merge_noun_chunks")
pipeline_sm.add_pipe("coreferee")

pipeline_lg = spacy.load("en_core_web_lg")
pipeline_lg.add_pipe("merge_entities")
pipeline_lg.add_pipe("merge_noun_chunks")
pipeline_lg.add_pipe("coreferee")

<coreferee.manager.CorefereeBroker at 0x17faf6312d0>

In [32]:
render_corefs(pipeline_sm(long_doc))

[90m
Germany [90m, [90mformally the Federal Republic of Germany [90m, [90mis [93ma country (5) [90min [90mCentral Europe [90m. 

[90m
 [92mIt (10) [90mis [93mthe second most populous country (12) [90min [90mEurope [90mafter [90mRussia [90m, [90mand [93mthe most populous member state (19) [90mof [90mthe European Union [90m. 

[90m
 [90mGermany [90mis [90msituated [90mbetween [93mthe Baltic and North seas (28) [90mto [93mthe north (30) [90m, [90mand [90mthe [90mAlps [90mto [93mthe south (36) [90m; [92mit (38) [90mcovers [93man area (40) [90mof [93m357,022 square kilometres (42) [90m( [90m137,847 [90msq [90mmi [90m) [90m, [90mwith [93ma population (50) [90mof [90mover 83 million [90mwithin [93mits 16 constituent states (54) [90m. 

[93m
Germany borders (56) [90mDenmark [90mto [93mthe north (59) [90m, [90mPoland [90mand [90mthe Czech Republic [90mto [93mthe east (65) [90m, [90mAustria [90mand [90mSwitzerland [90mto [93

In [33]:
render_corefs(pipeline_lg(long_doc))

[90m
 [90mGermany [90m, [90mformally the Federal Republic of Germany [90m, [90mis [93ma country (6) [90min [90mCentral Europe [90m. 

[90m
 [92mIt (11) [90mis [93mthe second most populous country (13) [90min [90mEurope [90mafter [90mRussia [90m, [90mand [93mthe most populous member state (20) [90mof [90mthe European Union [90m. 

[90m
 [90mGermany [90mis [90msituated [90mbetween [93mthe Baltic and North seas (29) [90mto [93mthe north (31) [90m, [90mand [90mthe [90mAlps [90mto [93mthe south (37) [90m; [92mit (39) [90mcovers [93man area (41) [90mof [93m357,022 square kilometres (43) [90m( [90m137,847 sq mi [90m) [90m, [90mwith [93ma population (49) [90mof [90mover 83 million [90mwithin [93mits 16 constituent states (53) [90m. 

[90m
 [90mGermany [90mborders [90mDenmark [90mto [93mthe north (60) [90m, [90mPoland [90mand [90mthe Czech Republic [90mto [93mthe east (66) [90m, [90mAustria [90mand [90mSwitzerland [90mto [

Here both the small and large pipelines correctly identify Germany instead of Human Development Index.

In [34]:
render_corefs(pipeline_sm(long_doc2))

[90m
 [93mThe Germanic tribes (1) [90mare [90mthought [90mto [93mdate (5) [90mfrom [90mthe Nordic Bronze Age [90mor [90mthe Pre-Roman Iron Age [90m. 

[90m
 [90mFrom [90msouthern Scandinavia [90mand [90mnorth Germany [90m, [92mthey (17) [90mexpanded [90msouth [90m, [90meast [90m, [90mand [93mwest (24) [90m, [90mcoming [90minto [93mcontact (28) [90mwith [93mthe Celtic, Iranian, Baltic, and Slavic tribes (30) [90m. 

[90m
 [90mUnder [90mAugustus [90m, [90mthe Roman Empire [90mbegan [90mto [90minvade [93mlands (40) [90minhabited [90mby [93mthe Germanic tribes (43) [90m, [90mcreating [93ma short-lived Roman province (46) [90mof [90mGermania [90mbetween [93mthe Rhine and Elbe rivers (50) [90m. 

[90m
 [90mIn [93m9 AD (54) [90m, [93mthree Roman legions (56) [90mwere [90mdefeated [90mby [90mArminius [90m. 

[90mBy [93m100 AD (63) [90m, [90mwhen [90mTacitus [90mwrote [90mGermania [90m, [93mGermanic tribes (70) [90mhad [90

In [35]:
render_corefs(pipeline_lg(long_doc2))

The Germanic tribes (1) are thought to date (5) from the Nordic Bronze Age or the Pre-Roman Iron Age .

From southern Scandinavia and north Germany , they (15) expanded south , east (19) , and west (22) , coming into contact (26) with the Celtic, Iranian, Baltic, and Slavic tribes (28) .

Under Augustus , the Roman Empire began to invade lands (38) inhabited by the Germanic tribes (41) , creating a short-lived Roman province (44) of Germania between the Rhine and Elbe rivers (48) .

In 9 AD (52) , three Roman legions (54) were defeated by Arminius .

By 100 AD (61) , when Tacitus wrote Germania , Germanic tribes (68) had sett

The `trf` models do not include word vectors, which may be the reason for the discrepancy between the `lg` and `trf` results. Use the `lg` model.

In [36]:
render_corefs(pipeline_sm(full_text))

The queen angelfish (0) ( Holacanthus ciliaris ) , also known as the blue angelfish (8) , golden angelfish (10) , or yellow angelfish (13) , is a species (16) of marine angelfish (18) found in the western Atlantic Ocean .

It (23) is a benthic (ocean floor) warm-water species (25) that (26) lives in coral reefs (29) .

It (31) is recognized by its blue and yellow coloration (35) and a distinctive spot (37) or "crown (39) " on its forehead (42) .

This crown (44) distinguishes it (46) from the closely related and similar-looking Bermuda blue angelfish (48) ( Holacanthus bermudensis (50) ) , with which (54) it (55) overlaps in range (58) and can interbreed .

Adult quee

The output for this text is all over the place: the resolved references are wrong.

#### Issues and Limitations

* No apparent support for noun phrases acting as anaphors (?). For example: "The bat is a nocturnal animal. The species sleeps in the day." Coreferee does not resolve "the species" to "the bat".
* Completely inaccurate in certain cases.

### AllenNLP

If we only need coreference resolution, we can experiment with other models. Let's try AllenNLP.

In [37]:
import allennlp_models.tagging
from allennlp.predictors.predictor import Predictor

In [38]:
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz")

error loading _jsonnet (this is expected on Windows), treating C:\Users\tyler\AppData\Local\Temp\tmpljry6q6_\config.json as plain json
Some weights of BertModel were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [79]:
allen_result = predictor.predict(
    document="Paul Allen was born on January 21, 1953, in Seattle, Washington, to Kenneth Sam Allen and Edna Faye Allen. Allen attended Lakeside School, a private school in Seattle, where he befriended Bill Gates, two years younger, with whom he shared an enthusiasm for computers. Paul and Bill used a teletype terminal at their high school, Lakeside, to develop their programming skills on several time-sharing computer systems."
)

In [143]:
def find_token_index_in_cluster_list(token_index, clusters):
    "returns color to print if found based on cluster, None if not found. Only 6 colors available."
    for i, cluster in enumerate(clusters):
        for c in cluster:
            # c is an inclusive [start, end] span, so `in` only matches its first and last token.
            # Use `c[0] <= token_index <= c[1]` instead to highlight the whole span.
            if token_index in c:
                return i % 6 + 1
    return None

def get_antecedent_for_cluster_element(prediction, cluster_element):
    "returns the text of the predicted antecedent of a cluster span, None if it has no antecedent."
    # linear scan: locate the span in top_spans to get its row index i
    for i, (start, end) in enumerate(prediction['top_spans']):
        if start == cluster_element[0] and end == cluster_element[1]:
            antecedent_indices_index = prediction['predicted_antecedents'][i]
            if antecedent_indices_index == -1:
                return None  # no predicted link
            top_span_index = prediction['antecedent_indices'][i][antecedent_indices_index]
            ante_start_index, ante_end_index = prediction['top_spans'][top_span_index]
            # the span end index is inclusive, so add 1 when slicing
            return ' '.join(prediction["document"][ante_start_index:ante_end_index + 1])
    return None # this will never happen, but keep it just for completeness

def render_result(prediction):
    # first pass: print the document, coloring tokens that belong to a cluster
    for i, token in enumerate(prediction['document']):
        if (color := find_token_index_in_cluster_list(i, prediction['clusters'])):
            cprint(f'{token} ({i})', color=color, end=' ')
        else:
            cprint(token, end=' ')
    print('\n')
    # second pass: list every span of every cluster with its predicted antecedent
    for i, cluster in enumerate(prediction['clusters']):
        cprint(f"\nCluster {i}\n")
        color = i % 6 + 1
        for c in cluster:
            antecedent = get_antecedent_for_cluster_element(prediction, c)
            cprint(f'indices: {c}', color=color)
            cprint('\ttext: ' + ' '.join(prediction['document'][c[0]:c[1] + 1]))
            if antecedent:
                cprint(f"\tante: {antecedent}")
            else:
                # no predicted antecedent: this span is treated as the antecedent itself
                cprint("\t__ANTECEDENT__")

In [144]:
render_result(allen_result)

Paul (0) Allen (1) was born on January 21 , 1953 , in Seattle (11) , Washington (13) , to Kenneth Sam Allen and Edna Faye Allen . Allen (24) attended Lakeside (26) School , a private school in Seattle (33) , where he (36) befriended Bill (38) Gates , two years younger , with whom he (47) shared an enthusiasm for computers (52) . Paul (54) and Bill (56) used a teletype terminal at their (62) high school , Lakeside , (67) to develop their (70) programming skills on several time - sharing computer systems .

Cluster 0

indices: [0, 1]
	text: Paul Allen
	__ANTECEDENT__


#### Interpreting AllenNLP output

`predictor.predict` returns a dictionary with the following keys. The shapes describe the batched model output; for a single `predict` call the batch dimension is dropped, which is what the code above relies on.

|Key|Description|
|-|-|
|`top_spans`|A tensor of shape `(batch_size, num_spans_to_keep, 2)` representing the start and end word indices of the top spans that survived the pruning stage.|
|`antecedent_indices`|A tensor of shape `(num_spans_to_keep, max_antecedents)` representing, for each top span, the index (with respect to `top_spans`) of the possible antecedents the model considered.|
|`predicted_antecedents`|A tensor of shape `(batch_size, num_spans_to_keep)` representing, for each top span, the index (with respect to `antecedent_indices`) of the most likely antecedent. -1 means there was no predicted link.|
|`clusters`|A nested list: for each document, a list of clusters, each of which is a list of `[start, end]` (inclusive) word-index spans that corefer with each other.|

To find coreferring spans, iterate through `clusters`. Each cluster lists the start and end indices of every span that corefers with the others in that cluster. To find the antecedent of a span, first find the **index `i`** at which it is located in `top_spans`.
> `cluster 1 -> span [5, 10]` `span [5, 10] found at index 3 in top_spans` `i = 3`. 

Go to `i` in `predicted_antecedents` and get the value `j`. If `j` is -1, there is no predicted link: the span is itself the antecedent.
> `predicted_antecedents[3] = 4`

Go to row `i` in `antecedent_indices` and get the value `k = antecedent_indices[i][j]`. This value is guaranteed to exist.
> `antecedent_indices[3] = [1, 3, 4, 5, 6, 7, 8, 9, 11, 13, 14]` `antecedent_indices[3][4] = 6`

Go to `top_spans[k]` to get antecedent span.
> `top_spans[6] = [1, 3]`
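
The lookup steps above can be sketched on a hand-built prediction. The dictionary below is toy data invented for illustration (not real model output), with values chosen to match the worked example; `find_antecedent_span` is a hypothetical helper name.

```python
# Toy prediction using the same keys as the AllenNLP output.
# All values are made up so that the indices match the worked example.
toy_prediction = {
    "top_spans": [[0, 0], [2, 2], [4, 4], [5, 10], [11, 11], [12, 13], [1, 3]],
    "predicted_antecedents": [-1, -1, -1, 4, -1, -1, -1],
    "antecedent_indices": [
        [0], [0], [0],
        [1, 3, 4, 5, 6, 7, 8, 9, 11, 13, 14],
        [0], [0], [0],
    ],
}

def find_antecedent_span(prediction, span):
    """Return the [start, end] antecedent of `span`, or None if no link was predicted."""
    top_spans = [list(s) for s in prediction["top_spans"]]
    i = top_spans.index(list(span))              # position of the span in top_spans
    j = prediction["predicted_antecedents"][i]   # index into antecedent_indices[i]
    if j == -1:
        return None                              # no predicted link
    k = prediction["antecedent_indices"][i][j]   # index into top_spans
    return prediction["top_spans"][k]

print(find_antecedent_span(toy_prediction, [5, 10]))  # [1, 3]
print(find_antecedent_span(toy_prediction, [1, 3]))   # None
```

Note that `list.index` raises `ValueError` if the span is not in `top_spans`, which should not occur for spans taken from `clusters`.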


#### Sample

In [145]:
long_doc2_result = predictor.predict(document=long_doc2)

In [146]:
render_result(long_doc2_result)

The (1) Germanic tribes (3) are thought to date from the Nordic Bronze Age or the Pre - Roman Iron Age .
From southern Scandinavia and north Germany , they (29) expanded south , east , and west , coming into contact with the Celtic , Iranian , Baltic , and Slavic tribes .
Under Augustus , the Roman Empire began to invade lands inhabited by the (66) Germanic tribes (68) , creating a (71) short - lived Roman province of Germania between the Rhine and Elbe rivers (84) .
In 9 AD , three Roman legions were defea

In [147]:
full_text_result = predictor.predict(document=full_text)

In [148]:
render_result(full_text_result)

The (0) queen angelfish (2) ( Holacanthus ciliaris ) , also known as the blue angelfish , golden angelfish , or yellow angelfish , is a species of marine angelfish found in the western Atlantic Ocean . It (35) is a benthic ( ocean floor ) warm - water species that lives in coral reefs . It (53) is recognized by its (57) blue and yellow coloration and a (63) distinctive spot or " crown " on its (71) forehead (72) . This (74) crown (75) distinguishes it (77) from the closely related and similar - looking Bermuda blue angelfish ( Holacanthus

In [149]:
render_result(predictor.predict(document="While he was searching for his glasses, John stubbed his toe."))

While he (1) was searching for his (5) glasses , John (8) stubbed his toe .

Cluster 0

indices: [1, 1]
	text: he
	__ANTECEDENT__
indices: [5, 5]
	text: his
	ante: he
indices: [8, 8]
	text: John
	ante: he


#### Optimizing Antecedent Searching

Searching for antecedents by iterating through `top_spans` for every cluster span is inefficient, since each lookup is a linear scan. Strictly speaking, we do not need the antecedent of every span; we only need to find the subject of the article.
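
If antecedents are still needed, one way to avoid the repeated scan is to build a span-to-index dictionary once per prediction. The sketch below assumes the prediction layout described earlier; `build_span_index` and `antecedent_text` are hypothetical helper names, and the small prediction is toy data, not real model output.

```python
def build_span_index(prediction):
    """Map each (start, end) span in top_spans to its row index, built once per prediction."""
    return {tuple(span): i for i, span in enumerate(prediction["top_spans"])}

def antecedent_text(prediction, span_index, span):
    """Return the antecedent text of `span` using a dictionary lookup instead of a scan."""
    i = span_index.get(tuple(span))
    if i is None:
        return None                      # span is not one of the top spans
    j = prediction["predicted_antecedents"][i]
    if j == -1:
        return None                      # no predicted link
    k = prediction["antecedent_indices"][i][j]
    start, end = prediction["top_spans"][k]
    return " ".join(prediction["document"][start:end + 1])  # end index is inclusive

# Toy data for illustration only.
small_prediction = {
    "document": ["John", "lost", "his", "keys"],
    "top_spans": [[0, 0], [2, 2]],
    "predicted_antecedents": [-1, 0],
    "antecedent_indices": [[0], [0]],
}
span_index = build_span_index(small_prediction)
print(antecedent_text(small_prediction, span_index, [2, 2]))  # John
```

Building the dictionary is linear in the number of top spans, and each subsequent lookup is a constant-time hash access.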

### Parsing Wikipedia

In [1]:
sample_wiki = """
The '''queen angelfish''' (''Holacanthus ciliaris''), also known as the '''blue angelfish''', '''golden angelfish''', or '''yellow angelfish''', is a [[species]] of [[Pomacanthidae|marine angelfish]] found in the western [[Atlantic Ocean]]. It is a [[benthic zone|benthic]] (ocean floor) warm-water species that lives in [[coral reef]]s. It is recognized by its blue and yellow coloration and a distinctive spot or "crown" on its forehead. This crown distinguishes it from the closely related and similar-looking Bermuda blue angelfish (''[[Holacanthus bermudensis]]''), with which it overlaps in range and can interbreed.

Adult queen angelfish are selective feeders and primarily eat [[sponge]]s. Their social structure consists of [[harem (zoology)|harems]] which include one male and up to four females. They live within a [[territory (animal)|territory]] where the females forage separately and are tended to by the male. [[Spawn (biology)|Breeding]] in the species occurs near a [[full moon]]. The transparent [[Ichthyoplankton|eggs]] float in the water until they hatch. Juveniles of the species have different coloration than adults and act as [[cleaner fish]].

The queen angelfish is popular in the [[Fishkeeping|aquarium trade]] and has been a particularly common exported species from [[Brazil]]. In 2010, the queen angelfish was assessed as [[least concern]] by the [[International Union for Conservation of Nature]] as the wild population appeared to be stable.
"""

The input text will be raw [wikitext](https://en.wikipedia.org/wiki/Help:Wikitext). The goal is to clean up the text, removing all markup except outgoing links.
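
As a rough illustration of that goal, the sketch below strips bold and italic quote markup while leaving `[[...]]` links in place, and lists the links it finds. It uses only regular expressions and ignores nesting and other edge cases; `strip_quote_markup`, `outgoing_links`, and `LINK_RE` are hypothetical names, not part of any library.

```python
import re

# Matches [[target]] and [[target|label]]; nested links are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def strip_quote_markup(wikitext: str) -> str:
    """Remove ''italic'', '''bold''' and '''''bold italic''''' markers, keeping the text."""
    return re.sub(r"'{2,5}", "", wikitext)

def outgoing_links(wikitext: str):
    """Return (target, label) pairs; the label falls back to the target when absent."""
    return [(m.group(1), m.group(2) or m.group(1)) for m in LINK_RE.finditer(wikitext)]

snippet = ("The '''queen angelfish''' (''Holacanthus ciliaris'') is a [[species]] of "
           "[[Pomacanthidae|marine angelfish]] that lives in [[coral reef]]s.")

print(strip_quote_markup(snippet))
print(outgoing_links(snippet))
```

The trailing `s` in `[[coral reef]]s` is left outside the link here; handling such link trails is one of the cases a fuller parser would need to cover.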

#### Wikitext Conversion

**General Rules**:

* For any kind of HTML, remove the tag and keep the text.
* For templates, keep the text according to the template format. Define a custom reader for templates.
    * Key-value pairs should be handled specially for information extraction.
    * String data from templates should be extracted as is, without template syntax.
* Newlines and tabs should be replaced by plain whitespace.
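
The HTML and whitespace rules above can be sketched with two regular expressions. This is a minimal sketch that assumes simple, well-formed tags; templates are not handled, and `strip_html_tags`, `normalize_whitespace`, and `TAG_RE` are hypothetical names.

```python
import re

# An opening, closing, or self-closing tag such as <small>, </small>, <br/>.
# Requiring a letter after "<" avoids matching plain text like "a < b".
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")

def strip_html_tags(text: str) -> str:
    """Remove HTML tags and keep the text between them."""
    return TAG_RE.sub("", text)

def normalize_whitespace(text: str) -> str:
    """Replace runs of newlines, tabs, and spaces with a single space."""
    return re.sub(r"\s+", " ", text).strip()

raw = "Found in <small>the western</small>\n\tAtlantic   Ocean."
print(normalize_whitespace(strip_html_tags(raw)))  # Found in the western Atlantic Ocean.
```

Because tags are replaced with the empty string, a `<br/>` placed directly between two words would join them; substituting a space for line-break tags before normalizing is one possible adjustment.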

|Element|Format|HTML|Action|Template|
|-|-|-|-|-|
|Headings|`={1,6}...={1,6}`|h1-6|Remove|template exists, but should not be used in articles|
|Horizontal Rule|----|hr|Remove||
|Table of Contents|\_\_TOC\_\_, \_\_FORCETOC\_\_, \_\_NOTOC\_\_||Remove|template {{TOC limit}}|
|Line Breaks|`enter`|br, br/|Remove|template {{break}}, {{-}}, {{clear}}|
|Unordered list|\*, \*\*, \*\*\*, ...|ul|Retain text. Handler|template {{plainlist}}, {{unbulleted list}}|
|Ordered list|\#, \#\#, \#\#\#, ...|ol|Retain text. Handler||
|Description list|; followed by :|dl, dt, dd|Retain text. Handler|template {{glossary}}, {{term}}, {{defn}}|
|Indents|:, ::, :::, ...||Remove|template {{outdent}}, {{outdent2}}|
|Blockquote||blockquote|Retain text|template {{quote}}|
|Italics|''...''||Retain text||
|Bold|'''...'''||Retain text||
|Bold Italics|'''''...'''''||Retain text||
|Small Caps|||Retain text|template {{smallcaps}}|
|Code||code|Retain text||
|Syntax Highlight||syntaxhighlight|Retain text||
|Small text||small|Retain text||
|Big text||big|Retain text||
|Non breaking space|&nbsp;||Replace with whitespace|template {{nowrap}}|
|Extra spacing|||Replace with whitespace|template {{pad}}|
|Special Symbols|&\<spec\>;||Remove/Replace with whitespace||
|Math||math|Remove|template {{math}}|
|Pronunciation aids|||Remove|template {{IPAc-en}}, {{respell}}|