# Relationship Mapping

This notebook tests methods and brainstorms ways of finding relationships within annotated text. Initial ideas are:
1. Do not merge phrases.
2. Find the root token of the link (the root is the token that has no incoming dependency arc from another token within the link).
3. Traverse the root's ancestors until a verb is found, **capturing the traversed text and the link's subtree text**.
4. Check whether the verb is already in the link's subtree. If so, stop traversing. Note that there are very few cases where the verb is not the root.
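
Steps 2 and 3 can be illustrated without a parser. The following is a minimal sketch, not the notebook's implementation: the parse of "John ate an apple and an orange" is hand-assigned, and `link_root`, `ancestors`, and `path_to_verb` are hypothetical helpers that mimic what spaCy's `Span.root` and `Token.ancestors` provide.

```python
from typing import Iterator, List

# Hand-assigned parse (an assumption for illustration, not parser output).
words = ["John", "ate", "an", "apple", "and", "an", "orange"]
pos = ["PROPN", "VERB", "DET", "NOUN", "CCONJ", "DET", "NOUN"]
# heads[i] is the index of token i's head; the sentence root points to itself.
heads = [1, 1, 3, 1, 3, 6, 3]

def link_root(heads: List[int], start: int, end: int) -> int:
    """Return the first token in [start, end) whose head lies outside the span."""
    for i in range(start, end):
        if heads[i] == i or not (start <= heads[i] < end):
            return i
    raise ValueError("span has no root")

def ancestors(heads: List[int], i: int) -> Iterator[int]:
    """Yield the head, the head's head, and so on up to the sentence root."""
    while heads[i] != i:
        i = heads[i]
        yield i

def path_to_verb(start: int, end: int) -> List[str]:
    """Collect ancestor words of the link root, stopping at the first verb or aux."""
    path = []
    for i in ancestors(heads, link_root(heads, start, end)):
        path.append(words[i])
        if pos[i] in ("VERB", "AUX"):
            break
    return path

# The link "an orange" covers tokens 5 and 6.
print(words[link_root(heads, 5, 7)])  # orange
print(path_to_verb(5, 7))             # ['apple', 'ate']
```

The path still contains "apple", the first conjunct, which is why conjunctions get special handling later in the notebook.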

In [4]:
import spacy
from termcolor import colored
from spacy.tokens import Doc, Span, Token
from spacy import displacy
from typing import Iterator

In [5]:
nlp = spacy.load("en_core_web_lg")

## Functions

In [6]:
def render_deps(doc: Doc):
    displacy.render(list(doc.sents), style="dep", options={
        "compact": True,
        "bg": "#fff0",
        "color": "#EEE",
        "font": "consolas"
    })

In [7]:
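def get_phrase_span(doc: Doc, phrase: str):
    # Only the first occurrence of the phrase is used; str.index raises ValueError if it is absent.
    index_found = doc.text.index(phrase)
    # char_span returns None when the character offsets do not align with token boundaries.
    return doc.char_span(index_found, len(phrase) + index_found)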
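
`get_phrase_span` looks only at the first match, but a phrase can occur more than once in a sentence (in sentence 3 below, "Illinois" also appears inside "Illinois state senator"). As a minimal sketch, the hypothetical helper `find_phrase_offsets` (not part of the notebook) lists every occurrence as character offsets. Each pair could then be passed to `doc.char_span`, which returns `None` when the offsets do not fall on token boundaries.

```python
from typing import List, Tuple

def find_phrase_offsets(text: str, phrase: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for every occurrence of phrase in text."""
    offsets = []
    start = text.find(phrase)
    while start != -1:
        offsets.append((start, start + len(phrase)))
        # Continue searching one character past the previous match.
        start = text.find(phrase, start + 1)
    return offsets

text = (
    "Obama previously served as a U.S. senator representing Illinois from 2005 to 2008, "
    "and as an Illinois state senator from 1997 to 2004."
)
for start, end in find_phrase_offsets(text, "Illinois"):
    print(start, end, text[start:end])
```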

In [8]:
def get_relationship_for_span(span: Span):
    """
    Gets the nearest governing verb for the span, plus any intermediate tokens that may qualify the verb.
    If the root of the span is itself a verb or aux, the span's subtree is returned as is.

    Otherwise, ancestor tokens are inserted at the beginning of the list until a verb or aux token is reached.
    The verb or aux token is inserted, and the loop breaks.

    If a conjunct is reached at any point (i.e. dep_ == "conj"), the phrase is part of a coordinated clause,
    and the next ancestor is just the first conjunct, so it is skipped. However, if that next ancestor is
    a verb, the verb rule still applies: the verb is inserted, and the loop breaks.

    This handles cases such as "John ate an apple and an orange.": orange -(conj)> apple -()> ate, so apple is skipped.
    Another example: "John served as kitchen staff and as hospitality."

    Returns None if no verb or aux is found among the ancestors.
    """
    tokens = list(span.subtree)
    if span.root.pos_ in ("VERB", "AUX"):
        return tokens
    is_conj = span.root.dep_ == "conj"
    for token in span.root.ancestors:
        if token.pos_ in ("VERB", "AUX"):
            tokens.insert(0, token)
            return tokens
        elif is_conj:
            # the previous token was a conjunct, so this ancestor is the first conjunct: skip it
            is_conj = False
        else:
            tokens.insert(0, token)

        if token.dep_ == "conj":
            is_conj = True
    return None

In [9]:
def print_relationship(relationship: Iterator[Token], phrase: Span, detailed: bool=False):
    for token in relationship:
        if token.pos_ == 'VERB' or token.pos_ == "AUX" or detailed:
            print(f"{colored(token.text, 'green')} (POS:{token.pos_},DEP:{token.dep_},TAG:{token.tag_},Position={token.idx})", end=' ')
        else:
            print(f"{colored(token.text, 'blue')}", end=' ')
    print()
    print(f"Root: {colored(phrase.root, 'blue')}")
    print(f"Phrase: {colored(phrase.text, 'blue')}")

In [10]:
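def render_phrase_relationship(doc: Doc, phrase: str, detailed: bool=False):
    phrase_span = get_phrase_span(doc, phrase)
    relationship = get_relationship_for_span(phrase_span)
    if relationship is None:  # no verb or aux was found among the ancestors
        print(f"No relationship found for: {phrase}")
        return
    print_relationship(relationship, phrase_span, detailed)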

In [11]:
class Extract:
    def __init__(self, text: str, links: Iterator[str]):
        self.text = text
        self.links = links

def render_multi_phrase_relationship(extracts: Iterator[Extract], expand_deps=False, detailed=False):
    docs = list(nlp.pipe(ext.text for ext in extracts))
    if expand_deps:
        for doc in docs:
            render_deps(doc)
    for (sent, doc) in zip(extracts, docs):
        print()
        print(sent.text)
        for link in sent.links:
            render_phrase_relationship(doc, link, detailed)

## Initial Sample

Barack Hussein Obama II (born August 4, 1961) is an American lawyer and politician who served as the 44th president of the United States from 2009 to 2017.

In [12]:
text1 = "Barack Hussein Obama II (born August 4, 1961) is an American lawyer and politician who served as the 44th president of the United States from 2009 to 2017."
linkText = "president of the United States"

In [13]:
doc1 = nlp(text1)
render_deps(doc1)
render_phrase_relationship(doc1, linkText)

served (POS:VERB,DEP:relcl,TAG:VBD,Position=87) as the 44th president of the United States
Root: president
Phrase: president of the United States


## Long Samples

In [14]:
article = [
    Extract(
        text="Barack Hussein Obama II (born August 4, 1961) is an American lawyer and politician who served as the 44th president of the United States from 2009 to 2017.",
        links=["president of the United States"]
    ),
    Extract(
        text="A member of the Democratic Party, he was the first African-American president in U.S. history.",
        links=["African-American", "Democratic Party"]
    ),
    Extract(
        text="Obama previously served as a U.S. senator representing Illinois from 2005 to 2008, and as an Illinois state senator from 1997 to 2004.",
        links=["U.S. senator", "Illinois", "Illinois state senator"]
    )
]

In [15]:
render_multi_phrase_relationship(article, False)


Barack Hussein Obama II (born August 4, 1961) is an American lawyer and politician who served as the 44th president of the United States from 2009 to 2017.
served (POS:VERB,DEP:relcl,TAG:VBD,Position=87) as the 44th president of the United States
Root: president
Phrase: president of the United States

A member of the Democratic Party, he was the first African-American president in U.S. history.
was (POS:AUX,DEP:ROOT,TAG:VBD,Position=37) president African - American
Root: American
Phrase: African-American
was (POS:AUX,DEP:ROOT,TAG:VBD,Position=37) member of the Democratic Party
Root: Party
Phrase: Democratic Party

Obama previously served as a U.S. senator representing Illinois from 2005 to 2008, and as an Illinois state senator from 1997 to 2004.
[32m

In [16]:
article = [
    Extract("John served as kitchen staff and hospitality", ["kitchen staff", "hospitality"])
]

render_multi_phrase_relationship(article, True, True)


John served as kitchen staff and hospitality
served (POS:VERB,DEP:ROOT,TAG:VBD,Position=5) as (POS:ADP,DEP:prep,TAG:IN,Position=12) kitchen (POS:NOUN,DEP:compound,TAG:NN,Position=15) staff (POS:NOUN,DEP:pobj,TAG:NN,Position=23) and (POS:CCONJ,DEP:cc,TAG:CC,Position=29) hospitality (POS:NOUN,DEP:conj,TAG:NN,Position=33)
Root: staff
Phrase: kitchen staff
served (POS:VERB,DEP:ROOT,TAG:VBD,Position=5) as (POS:ADP,DEP:prep,TAG:IN,Position=12) hospitality (POS:NOUN,DEP:conj,TAG:NN,Position=33)
Root: hospitality
Phrase: hospitality


## Initial Issues

~~In sentence 2, note that 'was president African - American' is not very descriptive. It should note that he was the **first** african-american president.~~

~~**Resolution**: capture the subtree of each traversed token~~

Honestly, it doesn't matter: the output only needs to note that he was African-American, which is the only relationship that matters semantically for this sentence. If the link were 'African-American president', it might need to be different.

In sentence 3, note that 'representing from as an Illinois state senator from 1997 to 2004' is not accurate. It should really just say 'representing as an Illinois state senator from 1997 to 2004'. It is capturing the 'from' from the first conjunct.

**Resolution**: handle conjunctions better so that -
1. If a conjunction is the ancestor, skip the conjunction and go to the next ancestor until a verb is found.

## Longer Samples

In [19]:
article = [
    Extract(
        text="Barack Hussein Obama II (born August 4, 1961) is an American lawyer and politician who served as the 44th president of the United States from 2009 to 2017.",
        links=["president of the United States"]
    ),
    Extract(
        text="A member of the Democratic Party, he was the first African-American president in U.S. history.",
        links=["African-American", "Democratic Party"]
    ),
    Extract(
        text="Obama previously served as a U.S. senator representing Illinois from 2005 to 2008, and as an Illinois state senator from 1997 to 2004.",
        links=["U.S. senator", "Illinois", "Illinois state senator"]
    ),
    Extract(
        text="Obama was born in Honolulu, Hawaii.",
        links=["Honolulu"]
    ),
    Extract(
        text="He graduated from Columbia University in 1983 with a Bachelor of Arts degree in political science and later worked as a community organizer in Chicago.",
        links=["Columbia University", "Bachelor of Arts", "community organizer", "Chicago"]
    ),
    Extract(
        text="In 1988, Obama enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review.",
        links=["Harvard Law School", "Harvard Law Review"]
    ),
    Extract(
        text="He became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004.",
        links=["constitutional law", "University of Chicago Law School"]
    ),
    Extract(
        text="In 1996, Obama was elected to represent the 13th district in the Illinois Senate, a position he held until 2004, when he successfully ran for the U.S. Senate.",
        links=["represent the 13th district in the Illinois Senate", "successfully ran for the U.S. Senate"]
    ),
    Extract(
        text="In the 2008 presidential election, after a close primary campaign against Hillary Clinton, he was nominated by the Democratic Party for president.",
        links=["2008 presidential election", "close primary campaign", "Hillary Clinton"]
    ),
    Extract(
        text="Obama selected Joe Biden as his running mate and defeated Republican nominee John McCain.",
        links=["Joe Biden", "John McCain"]
    )
]

In [20]:
render_multi_phrase_relationship(article)


Barack Hussein Obama II (born August 4, 1961) is an American lawyer and politician who served as the 44th president of the United States from 2009 to 2017.
served (POS:VERB,DEP:relcl,TAG:VBD,Position=87) as the 44th president of the United States
Root: president
Phrase: president of the United States

A member of the Democratic Party, he was the first African-American president in U.S. history.
was (POS:AUX,DEP:ROOT,TAG:VBD,Position=37) president African - American
Root: American
Phrase: African-American
was (POS:AUX,DEP:ROOT,TAG:VBD,Position=37) member of the Democratic Party
Root: Party
Phrase: Democratic Party

Obama previously served as a U.S. senator representing Illinois from 2005 to 2008, and as an Illinois state senator from 1997 to 2004.
[32m

In sentence 3, we can probably find a better verb matching for U.S. senator, Illinois, and Illinois state senator. Because yes, he did represent Illinois, but he did not represent an Illinois state senator, and he did not represent a U.S. senator. This is because I had added code that will just grab the phrase subtree if a verb is found in it.

## Relationship Finding State Machine

For a relationship we want to have 3 main fields:
```python
class Relationship:
    verb_action: str
    detailed_action: str
    outgoing: str
```
where `outgoing` is the phrase we are searching for, `verb_action` is the single verb that represents the action, and `detailed_action` includes details that qualify the verb and the `outgoing` link. We assume that the main entity is the article title. (The implementation in the Code section names these fields `action`, `detailed_action`, and `target`.)
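
To make the three fields concrete, here is a minimal sketch using a frozen dataclass (an assumption; the notebook itself uses a plain class). The values are the ones the notebook reports for the link "president of the United States" in the first sample sentence.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relationship:
    verb_action: str      # the single verb that represents the action
    detailed_action: str  # the verb plus the tokens that qualify it
    outgoing: str         # the phrase (link) we searched for

rel = Relationship(
    verb_action="served",
    detailed_action="served as the 44th president of the United States",
    outgoing="president of the United States",
)
print(rel)
```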

### State Machine

| state (including root) | is_conj | add_token? | next_state   |
|------------------------|---------|------------|--------------|
| pos=verb               | false   | yes        | break        |
| pos=verb               | true    | yes        | break        |
| pos=aux                | false   | yes        | break        |
| pos=aux                | true    | yes        | break        |
| dep=conj               | false   | yes        | is_conj=true |
| dep=conj               | true    | no         | is_conj=true |
| _                      | false   | yes        | next         |
| _                      | true    | no         | is_conj=false|

**Special consideration for the root (the first token)**:
* If the root is a verb or aux, the subtree of the root is returned as is.
* `is_conj` is initialized to whether the root is a conjunct (`dep_ == "conj"`).
* The root is always added.
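
The table and root rules above can be sketched as a small pure-Python function. This is a minimal sketch, not the notebook's implementation: `collect_ancestors` and the `(text, pos, dep)` tuples are hypothetical stand-ins for walking `span.root.ancestors` over spaCy tokens. The tags in the example are copied by hand from the detailed "kitchen staff and hospitality" output shown earlier.

```python
from typing import List, Optional, Tuple

Tok = Tuple[str, str, str]  # (text, pos, dep): hypothetical stand-in for a spaCy Token

def collect_ancestors(root_dep: str, ancestors: List[Tok]) -> Optional[List[str]]:
    """Return the ancestor texts kept by the state machine, verb first, or None if no verb/aux is reached."""
    kept: List[str] = []
    is_conj = root_dep == "conj"  # root rule: start in the conj state if the root is a conjunct
    for text, pos, dep in ancestors:
        if pos in ("VERB", "AUX"):
            kept.insert(0, text)  # verb/aux rows: always add, then stop
            return kept
        if is_conj:
            is_conj = False       # previous token was a conjunct: skip this one
        else:
            kept.insert(0, text)
        if dep == "conj":
            is_conj = True        # conj rows: the next ancestor will be skipped
    return None

served = ("served", "VERB", "ROOT")
as_ = ("as", "ADP", "prep")
staff = ("staff", "NOUN", "pobj")

# Root "staff" (dep=pobj): every ancestor up to the verb is kept.
print(collect_ancestors("pobj", [as_, served]))         # ['served', 'as']
# Root "hospitality" (dep=conj): the first conjunct "staff" is skipped.
print(collect_ancestors("conj", [staff, as_, served]))  # ['served', 'as']
```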

### Code

In [20]:
class Relationship:
    def __init__(self, action: str, detailed_action: str, target: str):
        self.action = action
        self.detailed_action = detailed_action
        self.target = target

def evaluate_relationships(span: Span):
    tokens = list(span.subtree)
    if span.root.pos_ == "VERB" or span.root.pos_ == "AUX":
        return Relationship(action=span.root.text, detailed_action=convert_tokens_to_text(tokens), target=span.text)
    is_conj = span.root.dep_ == "conj"
    for token in span.root.ancestors:
        if token.pos_ == "VERB" or token.pos_ == "AUX":
            tokens.insert(0, token)
            return Relationship(action=token.text, detailed_action=convert_tokens_to_text(tokens), target=span.text)
        elif not is_conj:
            tokens.insert(0, token)
        elif is_conj:
            is_conj = False

        if token.dep_  == "conj":
            is_conj = True
    return None

def convert_tokens_to_text(tokens: Iterator[Token]):
    return " ".join(token.text.strip() for token in tokens)

In [21]:
span = get_phrase_span(nlp(article[0].text), "president of the United States")
relationship = evaluate_relationships(span)
f"action={relationship.action},detailed_action={relationship.detailed_action},outgoing={relationship.target}"

'action=served,detailed_action=served as the 44th president of the United States,outgoing=president of the United States'

## Code V2

Evaluating relationships:

In [22]:
class Relationship:
    def __init__(self, action: str, detailed_action: str, target: str):
        self.action = action
        self.detailed_action = detailed_action
        self.target = target

def evaluate_relationships(span: Span):
    tokens = list(span.subtree)
    if span.root.pos_ == "VERB" or span.root.pos_ == "AUX":
        return Relationship(action=span.root.text, detailed_action=convert_tokens_to_text(tokens), target=span.text)
    is_conj = span.root.dep_ == "conj"
    for token in span.root.ancestors:
        if token.pos_ == "VERB" or token.pos_ == "AUX":
            tokens.insert(0, token)
            return Relationship(action=token.text, detailed_action=convert_tokens_to_text(tokens), target=span.text)
        elif not is_conj:
            tokens.insert(0, token)
        elif is_conj:
            is_conj = False

        if token.dep_  == "conj":
            is_conj = True
    return None

def convert_tokens_to_text(tokens: Iterator[Token]):
    return " ".join(token.text.strip() for token in tokens)

Rendering relationships for debugging:

In [23]:
def render_relationship(relationship: Relationship):
    print(f"Detail: {colored(relationship.detailed_action, 'cyan')}")
    print(f"Action: {colored(relationship.action, 'green')}")
    print(f"Target: {colored(relationship.target, 'light_magenta')}")

For processing strings:

In [24]:
def process_extracts(extracts: Iterator[Extract]):
    docs = nlp.pipe(ext.text for ext in extracts)
    for (sent, doc) in zip(extracts, docs):
        print()
        print(f"Text: {sent.text}")
        for link in sent.links:
            target_span = get_phrase_span(doc, link)
            relationship = evaluate_relationships(target_span)
            if relationship is None:  # no verb or aux was found among the ancestors
                print(f"No relationship found for: {link}")
                continue
            render_relationship(relationship)

### Examples

In [25]:
process_extracts(article)


Text: Barack Hussein Obama II (born August 4, 1961) is an American lawyer and politician who served as the 44th president of the United States from 2009 to 2017.
Detail: served as the 44th president of the United States
Action: served
Target: president of the United States

Text: A member of the Democratic Party, he was the first African-American president in U.S. history.
Detail: was president African - American
Action: was
Target: African-American
Detail: was member of the Democratic Party
Action: was
Target: Democratic Party

Text: Obama previously served as a U.S. senator representing Illinois from 2005 to 2008, and as an Illinois state senator from 1997 to 2004.
Detail: served as a U.S. senator representing Illinois from 2005 to 2008
Action: served
Target: U.S. senator
Detail: representing Illinois
Action: representing
Target: Illinois
Detail: [3

## Further Optimizations

One optimization is to add additional qualifiers to the verb. In many of the sentences above, a preposition connects the verb to the target phrase (e.g. 'teaching' and 'University of Chicago Law School' are connected by 'at'). This detail appears in the detailed action, but the preposition should be identified more concretely in a separate field.

Generally, the POS `ADP` should work well here. We don't want to rely on the `prep` dependency because it doesn't always apply: for example, the preposition may be part of a conj clause, in which case `dep_ == "conj"`.

In [None]:
class Relationship:
    def __init__(self, action: str, prep: str, detailed_action: str, target: str):
        self.action = action
        self.prep = prep  # text of the connecting preposition, or None if there is none
        self.detailed_action = detailed_action
        self.target = target

def evaluate_relationships(span: Span):
    tokens = list(span.subtree)
    if span.root.pos_ == "VERB" or span.root.pos_ == "AUX":
        return Relationship(action=span.root.text, prep=None, detailed_action=convert_tokens_to_text(tokens), target=span.text)
    is_conj = span.root.dep_ == "conj"
    # store the preposition as text so that prep is always a str or None
    prep = span.root.text if span.root.pos_ == "ADP" else None
    for token in span.root.ancestors:
        if token.pos_ == "VERB" or token.pos_ == "AUX":
            tokens.insert(0, token)
            return Relationship(action=token.text, prep=prep, detailed_action=convert_tokens_to_text(tokens), target=span.text)
        elif is_conj:
            is_conj = False
        else:
            tokens.insert(0, token)

        # the first ADP on the way up is recorded, even when the token itself was skipped
        if token.pos_ == "ADP" and prep is None:
            prep = token.text
        if token.dep_ == "conj":
            is_conj = True
    return None

In [38]:
def render_relationship(relationship: Relationship):
    print(f"Detail: {colored(relationship.detailed_action, 'cyan')}")
    print(f"Action: {colored(relationship.action, 'green')}")
    print(f"Prep: {colored(relationship.prep, 'green')}")
    print(f"Target: {colored(relationship.target, 'light_magenta')}")

In [39]:
process_extracts(article)


Text: Barack Hussein Obama II (born August 4, 1961) is an American lawyer and politician who served as the 44th president of the United States from 2009 to 2017.
Detail: served as the 44th president of the United States
Action: served
Prep: as
Target: president of the United States

Text: A member of the Democratic Party, he was the first African-American president in U.S. history.
Detail: was president African - American
Action: was
Prep: None
Target: African-American
Detail: was member of the Democratic Party
Action: was
Prep: of
Target: Democratic Party

Text: Obama previously served as a U.S. senator representing Illinois from 2005 to 2008, and as an Illinois state senator from 1997 to 2004.
Detail: served as a U.S. senator representing Illinois from 2005 to 2008
Action: served
Prep: as
Target: U.S. senator
Detail: representing Illino

## Final State Machine

| state (including root) | is_conj | add_token? | next_state   | set prep token? |
|------------------------|---------|------------|--------------|-----------------|
| pos=verb               | false   | yes        | break        | no              |
| pos=verb               | true    | yes        | break        | no              |
| pos=aux                | false   | yes        | break        | no              |
| pos=aux                | true    | yes        | break        | no              |
| dep=conj               | false   | yes        | is_conj=true | no              |
| dep=conj               | true    | no         | is_conj=true | no              |
| pos=adp                | false   | yes        | next         | yes             |
| pos=adp                | true    | no         | is_conj=false| yes             |
| _                      | false   | yes        | next         | no              |
| _                      | true    | no         | is_conj=false| no              |
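
The final table, including the prep column, can be sketched in plain Python. This is a minimal sketch under stated assumptions: `walk_with_prep` and the `(text, pos, dep)` tuples are hypothetical stand-ins for spaCy tokens, and the tags are hand-copied from the detailed "kitchen staff and hospitality" output earlier in the notebook.

```python
from typing import List, Optional, Tuple

Tok = Tuple[str, str, str]  # (text, pos, dep): hypothetical stand-in for a spaCy Token

def walk_with_prep(root: Tok, ancestors: List[Tok]) -> Optional[Tuple[List[str], Optional[str]]]:
    """Return (kept ancestor texts, prep), or None if no verb/aux is reached."""
    root_text, root_pos, root_dep = root
    if root_pos in ("VERB", "AUX"):
        return [], None  # root rule: nothing above the root is needed
    kept: List[str] = []
    prep = root_text if root_pos == "ADP" else None
    is_conj = root_dep == "conj"
    for text, pos, dep in ancestors:
        if pos in ("VERB", "AUX"):
            kept.insert(0, text)  # verb/aux rows: add the token and stop
            return kept, prep
        if is_conj:
            is_conj = False       # skip the token that follows a conjunct
        else:
            kept.insert(0, text)
        if pos == "ADP" and prep is None:
            prep = text           # adp rows: record prep even if the token was skipped
        if dep == "conj":
            is_conj = True
    return None

served = ("served", "VERB", "ROOT")
as_ = ("as", "ADP", "prep")
staff = ("staff", "NOUN", "pobj")
hospitality = ("hospitality", "NOUN", "conj")

print(walk_with_prep(staff, [as_, served]))               # (['served', 'as'], 'as')
print(walk_with_prep(hospitality, [staff, as_, served]))  # (['served', 'as'], 'as')
```

Keeping the prep lookup outside the add/skip branch is what lets the `pos=adp, is_conj=true` row record the preposition without adding the token to the detailed action.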