# Text Processing and Pipelines with spaCy

## NLP Dependency Labels

### Subjects

|                           |                                                                                                                 |
|---------------------------|-----------------------------------------------------------------------------------------------------------------|
| Nominal Subject           | A nominal subject (`nsubj`) is a non-clausal constituent in the subject position of an active verb.             |
| Nominal subject (passive) | A nominal passive subject (`nsubjpass`) is a non-clausal constituent in the subject position of a passive verb. |
| Clausal subject           | A clausal subject (`csubj`) is a clause in the subject position of an active verb.                              |
| Clausal subject (passive) | A clausal passive subject (`csubjpass`) is a clause in the subject position of a passive verb.                  |
| Agent                     | An agent (`agent`) is the complement of a passive verb that is the surface subject of its active form.          |
| Expletive                 | An expletive (`expl`) is an existential there in the subject position.                                          |

### Objects

|                  |                                                                                                                      |
|------------------|----------------------------------------------------------------------------------------------------------------------|
| Direct object    | A direct object (`dobj`) is a noun phrase that is the accusative object of a (di)transitive verb.                    |
| Dative           | A dative (`dative`) is a nominal or prepositional object of a dative-shifting verb.                                  |
| Attribute        | An attribute (`attr`) is a noun phrase that is a non-VP predicate usually following a copula verb.                   |
| Object predicate | An object predicate (`oprd`) is a non-VP predicate in a small clause that functions like the predicate of an object. |


### Complements

|                         |                                                                                                                                                  |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| Clausal complement      | A clausal complement (`ccomp`) is a clause with an internal subject that modifies the head of an `ADJP\|ADVP\|NML\|NP\|WHNP\|VP\|SINV\|SQ`. |
| Open clausal complement | An open clausal complement (`xcomp`) is a clause without an internal subject that modifies the head of an `ADJP\|ADVP\|VP\|SINV\|SQ`. |
| Adjectival complement   | An adjectival complement (`acomp`) is an adjective phrase that modifies the head of a `VP\|SINV\|SQ`, which is usually a verb. |

### Nominals

|                           |                                                                                                                                                                                                                                                                                                                                                            |
|---------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Appositional modifier     | An appositional modifier (`appos`) of an `NML\|NP` is a noun phrase immediately preceded by another noun phrase, which gives additional information to its preceding noun phrase. |
| Clausal modifier          | A finite or non-finite clausal modifier (`acl`) is one of: an infinitival modifier (an infinitive clause or phrase that modifies the head of a noun phrase), a participial modifier (a clause or phrase whose head is a verb in a participial form, e.g., gerund or past participle, that modifies the head of a noun phrase), or a complement. |
| Relative clause modifier  | A relative clause modifier (`relcl`) is either a relative clause or a reduced relative clause that modifies the head of an `NML\|NP\|WHNP`. |
| Determiner                | A determiner (`det`) is a word token whose POS tag is `DT\|WDT\|WP` that modifies the head of a noun phrase. |
| Predeterminer             | A predeterminer (`predet`) is a word token whose POS tag is `PDT` that modifies the head of a noun phrase. |
| Numeric modifier          | A numeric modifier (`nummod`) is any number or quantifier phrase that modifies the head of a noun phrase. |
| Adjectival modifier       | An adjectival modifier (`amod`) is an adjective or an adjective phrase that modifies the meaning of another word, usually a noun. |
| Possession modifier       | A possession modifier (`poss`) is either a possessive determiner (`PRP$`) or an `NML\|NP\|WHNP` containing a possessive ending that modifies the head of an `ADJP\|NML\|NP\|QP\|WHNP`. |
| Modifier of nominal       | A modifier of nominal (`nmod`) is any unclassified dependent that modifies the head of a noun phrase. |


### Adverbials

|                           |                                                                                                              |
|---------------------------|--------------------------------------------------------------------------------------------------------------|
| Adverbial modifier        | An adverbial modifier (`advmod`) is an adverb or an adverb phrase that modifies the meaning of another word. |
| Adverbial clause modifier | An adverbial clause modifier (`advcl`) is a clause that acts like an adverbial modifier.                     |


### Negation modifier

A negation modifier (`neg`) is an adverb that gives negative meaning to its head.

### Noun phrase as adverbial modifier

An adverbial noun phrase modifier (`npadvmod`) is a noun phrase that acts like an adverbial modifier.

### Prepositions

|                             |                                                                                                                                                                                              |
|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Object of a preposition     | An object of a preposition (`pobj`) is a noun phrase that modifies the head of a prepositional phrase, which is usually a preposition but can be a verb in a participial form such as `VBG`. |
| Complement of a preposition | A complement of a preposition (`pcomp`) is any dependent that is not a `pobj` but modifies the head of a prepositional phrase.                                                               |


### Coordination

|                             |                                                                                                                                                          |
|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Conjunct                    | A conjunct (`conj`) is a dependent of the leftmost conjunct in coordination.                                                                             |
| Coordinating conjunction    | A coordinating conjunction (`cc`) is a dependent of the leftmost conjunct in coordination.                                                               |
| Pre-correlative conjunction | A pre-correlative conjunction (`preconj`) is the first part of a correlative conjunction that becomes a dependent of the first conjunct in coordination. |
| Prepositional modifier      | A prepositional modifier (`prep`) is any prepositional phrase that modifies the meaning of its head.                                                     |


### Auxiliaries

|                     |                                                                                                                              |
|---------------------|------------------------------------------------------------------------------------------------------------------------------|
| Auxiliary           | An auxiliary (`aux`) is an auxiliary or modal verb that gives further information about the main verb (e.g., tense, aspect). |
| Auxiliary (passive) | A passive auxiliary (`auxpass`) is an auxiliary verb, be, become, or get, that modifies a passive verb.                      |


### Compound words

|             |                                                                                                                                                                                                                                    |
|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Compound    | A compound (`compound`) is either a noun modifying the head of a noun phrase, a number modifying the head of a quantifier phrase, or a hyphenated word (or a preposition modifying the head of the prepositional phrase). |
| Particle    | A particle (`prt`) is a preposition in a phrasal verb that forms a verb-particle construction.                                                                                                                                     |
| Case marker | A case marker (`case`) is either a possessive marker, ...                                                                                                                                                                          |
| Marker      | A marker (`mark`) is either a subordinating conjunction (e.g., although, because, while) that introduces an adverbial clause modifier, or a subordinating conjunction, if, that, or whether, that introduces a clausal complement. |


### Miscellaneous

|                        |                                                                                                                                                                                                                |
|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Unclassified dependent | An unclassified dependent (`dep`) is a dependent that does not satisfy conditions for any other dependency.                                                                                                    |
| Meta modifier          | A meta modifier (`meta`) is code (1), embedded (2), or meta (3) information that is randomly inserted in a phrase or clause.                                                                                   |
| Parenthetical modifier | A parenthetical modifier (`parataxis`) is an embedded chunk, often but not necessarily surrounded by parenthetical notations (e.g., brackets, quotes, commas), which gives side information to its head. |
| Punctuation            | Any punctuation token is assigned the dependency label `punct`. |
| Root                   | A root (`ROOT`) is the root of a tree that does not depend on any node in the tree but the artificial root node. |
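
The tables above can be folded into a small lookup so that later code can ask which family a label belongs to. This is only a sketch: `DEP_GROUPS` and `dep_group` are illustrative names (not part of spaCy), and only the subject, object, complement, and preposition tables are covered.

```python
# Illustrative grouping of some labels from the tables above (not exhaustive).
DEP_GROUPS = {
    "subject": {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"},
    "object": {"dobj", "dative", "attr", "oprd"},
    "complement": {"ccomp", "xcomp", "acomp"},
    "preposition": {"pobj", "pcomp"},
}

def dep_group(label):
    """Return the group name for a dependency label, or None if it is unlisted."""
    for group, labels in DEP_GROUPS.items():
        if label in labels:
            return group
    return None

print(dep_group("nsubjpass"))
print(dep_group("punct"))
```

A token's label (`token.dep_`) could be passed straight to `dep_group` when filtering a parsed document.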


## Experiments

In [1]:
import spacy
from spacy import displacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [58]:
def print_deps(doc):
    """Print each token with its POS tag, dependency label, and head."""
    for token in doc:
        print(f"{token.text} -- {token.pos_}")
        print(f"\tDEP: {token.dep_}")
        print(f"\tHEAD: {token.head.text}")

def render_deps(doc, compact=True, options=None):
    """Render the dependency tree inline in the notebook."""
    options = {"compact": compact, **(options or {})}
    displacy.render(doc, style="dep", jupyter=True, options=options)

def cprint(*objs, color=0, sep=""):
    """Print in a bright ANSI color (0-7), then reset the terminal color."""
    print(f"\033[9{color}m" + sep.join(map(str, objs)) + "\033[0m")

In [4]:
sample = "Autonomous cars shift insurance liability toward manufacturers"
doc = nlp(sample)
print(sample)
print("------")
print_deps(doc)

Autonomous cars shift insurance liability toward manufacturers
------
Autonomous -- ADJ
	DEP: amod
	HEAD: cars
cars -- NOUN
	DEP: nsubj
	HEAD: shift
shift -- VERB
	DEP: ROOT
	HEAD: shift
insurance -- NOUN
	DEP: compound
	HEAD: liability
liability -- NOUN
	DEP: dobj
	HEAD: shift
toward -- ADP
	DEP: prep
	HEAD: shift
manufacturers -- NOUN
	DEP: pobj
	HEAD: toward


Taking the subject to be **autonomous cars** and the object to be **insurance liability**, we need to extract both phrases along with the relationship "shift toward manufacturers."

The parse supports this: "cars" is labeled `nsubj` with "Autonomous" as its `amod` child, "shift" is the root verb, and "liability" is the `dobj` with "insurance" as its `compound` child. "toward" attaches to "shift" as a `prep`, with "manufacturers" as its `pobj`.
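
The extraction described above can be sketched directly on the printed parse. To keep the sketch runnable without a model, plain tuples copied from the `print_deps` output stand in for spaCy tokens; the helper names (`children`, `with_modifiers`, `extract`) are made up for illustration, and looking heads up by text only works here because every word in the sentence is unique.

```python
# Parse copied from the print_deps output above: (text, dep, head text).
PARSE = [
    ("Autonomous", "amod", "cars"),
    ("cars", "nsubj", "shift"),
    ("shift", "ROOT", "shift"),
    ("insurance", "compound", "liability"),
    ("liability", "dobj", "shift"),
    ("toward", "prep", "shift"),
    ("manufacturers", "pobj", "toward"),
]

def children(head, parse=PARSE):
    """Tokens governed by `head`, skipping the root's self-reference."""
    return [(text, dep) for text, dep, h in parse if h == head and text != head]

def with_modifiers(word, parse=PARSE):
    """Prefix a noun with its amod/compound children, in sentence order."""
    mods = [t for t, dep in children(word, parse) if dep in ("amod", "compound")]
    return " ".join(mods + [word])

def extract(parse=PARSE):
    """Return (subject phrase, relationship, object phrase) for one clause."""
    root = next(text for text, dep, _ in parse if dep == "ROOT")
    kids = {dep: text for text, dep in children(root, parse)}
    prep = kids["prep"]
    pobj = next(t for t, dep in children(prep, parse) if dep == "pobj")
    return (with_modifiers(kids["nsubj"], parse),
            f"{root} {prep} {pobj}",
            with_modifiers(kids["dobj"], parse))

print(extract())
# ('Autonomous cars', 'shift toward manufacturers', 'insurance liability')
```

On a real `Doc`, the same walk would use `token.head` and `token.children` instead of text lookups.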

In [5]:
render_deps(doc)

## Pipeline and Processing

![image.png](attachment:eab42910-05c3-41b4-ac71-cc08ee1d3ce6.png)

In [6]:
# Assume the following strings represent existing entities referenced in the text below.
# Some may have a different entity name.
objects = ["marine angelfish", "benthic", "coral reefs", "holacanthus bermudensis", "sponges", 
           "harems", "territory", "breeding", "full moon", "eggs", "pelagic", "cleaner fish", 
           "aquarium trade", "brazil", "least concern", "international union for conservation of nature"]

subject = "queen angelfish"

full_text = """
The queen angelfish (Holacanthus ciliaris), also known as the blue angelfish, golden angelfish, or 
yellow angelfish, is a species of marine angelfish found in the western Atlantic Ocean. It is a 
benthic (ocean floor) warm-water species that lives in coral reefs. It is recognized by its blue 
and yellow coloration and a distinctive spot or "crown" on its forehead. This crown distinguishes 
it from the closely related and similar-looking Bermuda blue angelfish (Holacanthus bermudensis), 
with which it overlaps in range and can interbreed.Adult queen angelfish are selective feeders 
and primarily eat sponges. Their social structure consists of harems which include one male and 
up to four females. They live within a territory where the females forage separately and are 
tended to by the male. Breeding in the species occurs near a full moon. The transparent eggs 
are pelagic and float in the water, hatching after 15–20 hours. Juveniles of the species have 
different coloration than adults and act as cleaner fish. The queen angelfish is popular in 
the aquarium trade and has been a particularly common exported species from Brazil. In 2010, 
the queen angelfish was assessed as least concern by the International Union for 
Conservation of Nature as the wild population appeared to be stable.
""".replace('\n', '')
full_text

'The queen angelfish (Holacanthus ciliaris), also known as the blue angelfish, golden angelfish, or yellow angelfish, is a species of marine angelfish found in the western Atlantic Ocean. It is a benthic (ocean floor) warm-water species that lives in coral reefs. It is recognized by its blue and yellow coloration and a distinctive spot or "crown" on its forehead. This crown distinguishes it from the closely related and similar-looking Bermuda blue angelfish (Holacanthus bermudensis), with which it overlaps in range and can interbreed.Adult queen angelfish are selective feeders and primarily eat sponges. Their social structure consists of harems which include one male and up to four females. They live within a territory where the females forage separately and are tended to by the male. Breeding in the species occurs near a full moon. The transparent eggs are pelagic and float in the water, hatching after 15–20 hours. Juveniles of the species have different coloration than adults and act

We can start by tokenizing the text into an nlp document via `spacy`. Finding the grammatical subject of each sentence should be easy. We then need a function that detects whether that subject refers to the article subject or to another entity: if it is the article subject, ignore it; otherwise, continue on to find the object.
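
One crude way to make the "is this the article subject?" decision is plain string matching. The sketch below is a heuristic under loud assumptions, not a coreference resolver: the function name, the determiner and pronoun sets, and the rule that a bare pronoun always points back to the article subject are all illustrative choices.

```python
# Assumed word lists; extend as needed.
DETERMINERS = {"the", "a", "an", "this", "these", "those"}
PRONOUNS = {"it", "they", "he", "she", "that", "which", "who"}

def normalize(text):
    """Lowercase the text and drop leading determiners."""
    words = text.lower().split()
    while words and words[0] in DETERMINERS:
        words = words[1:]
    return " ".join(words)

def refers_to_article_subject(subj_text, article_subject, aliases=()):
    """Guess whether a sentence subject names the article subject."""
    norm = normalize(subj_text)
    if norm in PRONOUNS:
        # Assumption: a bare pronoun refers back to the article subject.
        return True
    names = [normalize(article_subject)] + [normalize(a) for a in aliases]
    # Match the name itself or a phrase ending in it ("adult queen angelfish").
    return any(norm == n or norm.endswith(" " + n) for n in names)

for text in ["The queen angelfish", "Adult queen angelfish", "It", "This crown"]:
    print(text, "->", refers_to_article_subject(text, "queen angelfish"))
```

The suffix rule also produces false positives: with the alias "blue angelfish", the unrelated "Bermuda blue angelfish" would match, so a real implementation needs something stronger.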

There are **2 main types of objects** that can be detected:
* Prepositional objects
* Direct objects

**How to tokenize**:
* Merge phrases and components of text to simplify
* Split into sentences (spaCy already does this for us: the parsed document exposes sentence spans as `doc.sents`)

### Merging phrases

In [7]:
def merge_phrases(doc):
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            attrs = {
                "tag": np.root.tag_,
                "lemma": np.root.lemma_,
                "ent_type": np.root.ent_type_,
            }
            retokenizer.merge(np, attrs=attrs)
    return doc

In [8]:
long_text2 = "Adult queen angelfish are selective feeders and primarily eat sponges."
long_doc2 = nlp(long_text2)

In [9]:
render_deps(long_doc2)

In [10]:
long_doc2_merged = merge_phrases(long_doc2)
render_deps(long_doc2_merged)

Merging phrases clearly simplifies the subject/object matching process.
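
To make the effect of merging concrete without loading a model, here is a toy version that works on a plain word list. The spans are hand-picked stand-ins for `doc.noun_chunks` (an assumption, not parser output), and `merge_spans` is a hypothetical helper that only mimics what the retokenizer does to token boundaries.

```python
def merge_spans(words, spans):
    """Collapse each (start, end) slice of `words` into one space-joined token."""
    merged, i = [], 0
    for start, end in sorted(spans):
        merged.extend(words[i:start])                # untouched words before the span
        merged.append(" ".join(words[start:end]))    # the span as a single token
        i = end
    merged.extend(words[i:])                         # untouched words after the last span
    return merged

words = "Adult queen angelfish are selective feeders and primarily eat sponges .".split()
chunks = [(0, 3), (4, 6)]  # hand-picked: "Adult queen angelfish", "selective feeders"
print(merge_spans(words, chunks))
```

Eleven tokens become eight, and each multi-word noun phrase can now be compared against a subject or object name as one unit.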

### Merging Entities

In [11]:
full_text_doc = nlp(full_text)

In [12]:
render_deps(list(full_text_doc.sents)[-1], options={'distance': 200})

In [13]:
displacy.render(list(full_text_doc.sents)[-1], style="ent", jupyter=True)

---
In the last sentence of the document, we see that the "International Union for Conservation of Nature" is an entity but is not recognized as a phrase by the dependency parser. We can merge these together.

In [14]:
def merge_entities(doc):
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    return doc

In [15]:
full_text_doc = merge_entities(merge_phrases(full_text_doc))

In [16]:
render_deps(list(full_text_doc.sents)[-1], options={'distance': 200})

This is much simpler and easier to deal with.
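
With entities and noun chunks merged into single tokens, a candidate token can be compared against the `objects` list defined earlier. The following is a minimal sketch assuming exact matching after lowercasing and dropping a leading determiner; `match_known_object` is a hypothetical helper, and it does not handle the case noted earlier where an entity goes by a different name.

```python
DETERMINERS = ("the ", "a ", "an ")

# A few entries from the `objects` list defined earlier.
KNOWN_OBJECTS = ["coral reefs", "sponges", "brazil",
                 "international union for conservation of nature"]

def match_known_object(token_text, known=KNOWN_OBJECTS):
    """Return the known entity equal to the token text, or None.

    Comparison ignores case and one leading determiner."""
    text = token_text.lower().strip()
    for det in DETERMINERS:
        if text.startswith(det):
            text = text[len(det):]
            break
    return text if text in known else None

print(match_known_object("the International Union for Conservation of Nature"))
print(match_known_object("the male"))
```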

### Constructing the pipeline

Although I wrote my own functions for the merge operations above, `spacy` ships built-in components (`merge_entities` and `merge_noun_chunks`) that can be added to the nlp pipeline instead.

In [17]:
pipeline = spacy.load("en_core_web_sm")
pipeline.add_pipe("merge_entities")
pipeline.add_pipe("merge_noun_chunks")

<function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

In [18]:
full_text_doc = pipeline(full_text)

In [19]:
render_deps(list(full_text_doc.sents)[-1], options={'distance': 200})

Pipeline is pretty easy to create.

### Finding the subject

Given a paragraph, let's find the subject(s) of each sentence.

In [30]:
def get_subjects(doc):
    """Return one list per sentence in `doc`, holding that sentence's
    `nsubj` and `nsubjpass` tokens."""
    return [[tok for tok in sent if tok.dep_ in ("nsubj", "nsubjpass")] for sent in doc.sents]

In [60]:
for subj_list, sent in zip(get_subjects(full_text_doc), full_text_doc.sents):
    for subj in subj_list:
        cprint(f"{subj} - type:{subj.dep_} - head:{subj.head} - i:{subj.i - sent.start}", color=3)
    cprint("\t", sent, color=0)

The queen angelfish - type:nsubj - head:is - i:0
	The queen angelfish (Holacanthus ciliaris), also known as the blue angelfish, golden angelfish, or yellow angelfish, is a species of marine angelfish found in the western Atlantic Ocean.
It - type:nsubj - head:is - i:0
that - type:nsubj - head:lives - i:3
	It is a benthic (ocean floor) warm-water species that lives in coral reefs.
It - type:nsubjpass - head:recognized - i:0
	It is recognized by its blue and yellow coloration and a distinctive spot or "crown" on its forehead.
This crown - type:nsubj - head:distinguishes - i:0
it - type:nsubj - head:overlaps - i:11
	This crown distinguishes it from the closely related and similar-looking Bermuda blue angelfish (Holacanthus bermudensis), with which it overlaps in range and can interbreed.
Adult queen angelfish - type:nsubj - head:are - i:0
	Adult queen angelfish are selective feeders and primarily eat sponges.
Their social st

### Finding outbound relationships

If we can find the root of the subject (the head it attaches to), we can trace all dependents from that root to find the subject's relationships.
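
The tracing idea can be sketched as a walk that starts at the subject's head and descends until it reaches a direct or prepositional object. This is a rough sketch on the parse of the earlier sample sentence, re-keyed by token index so it runs without a model; `Tok` and `outbound` are illustrative names, and stopping at `dobj`/`pobj` is an assumed rule rather than a settled design.

```python
from collections import namedtuple

# head is the index of the governing token; the root points at itself.
Tok = namedtuple("Tok", "i text dep head")

# Parse of "Autonomous cars shift insurance liability toward manufacturers".
TOKENS = [
    Tok(0, "Autonomous", "amod", 1),
    Tok(1, "cars", "nsubj", 2),
    Tok(2, "shift", "ROOT", 2),
    Tok(3, "insurance", "compound", 4),
    Tok(4, "liability", "dobj", 2),
    Tok(5, "toward", "prep", 2),
    Tok(6, "manufacturers", "pobj", 5),
]

def outbound(tokens, subj_i):
    """Walk down from the subject's head and collect (path, object) pairs."""
    root = tokens[tokens[subj_i].head]
    found = []
    stack = [(root, [root.text])]
    while stack:
        node, path = stack.pop()
        for child in tokens:
            # Skip non-children, the root's self-loop, and the subject itself.
            if child.head != node.i or child.i in (node.i, subj_i):
                continue
            if child.dep in ("dobj", "pobj"):
                found.append((" ".join(path), child.text))
            else:
                stack.append((child, path + [child.text]))
    return found

print(outbound(TOKENS, 1))
# [('shift', 'liability'), ('shift toward', 'manufacturers')]
```

Each pair holds the chain of words leading from the verb to an object, which covers both object types listed earlier (direct and prepositional).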