# Introduction

Welcome to the third text mining module. In this module, you will learn about Part-of-Speech Tagging, Named Entity Recognition, and Relation Extraction. Each section includes a short tutorial that shows you how to complete an information extraction task with a text mining tool. The goal is to give you hands-on experience in extracting key information from text. Sections 2, 3, and 4.1 have optional exercises to help you become familiar with the concepts. Task 4 is a little more complex, but we encourage you to take on the challenge: it will help you better understand how we can teach a machine to extract relations from sentences. OK, let's get started!

## How to Run the Module

Throughout this module you will encounter both text and code cells. Please run each cell in this Notebook by clicking the "Run" button in the toolbar or by pressing Shift+Enter.
<br>
![run_cell.png](Pictures/run_cell.png)

# Part of Speech Tagging (POS Tagging)

In [24]:
#set up
import warnings
warnings.filterwarnings('ignore')

from IPython.display import HTML
# Display Video
HTML('<iframe width="705" height="537" src="https://www.youtube.com/embed/bvP70Bhgbf8?list=PL6IN6GlGifEytPcv5HR_iaNBekwXYZIpR" frameborder="0" allowfullscreen></iframe>')

## Task 1: POS tags and Noun phrase extraction

In this task, we will parse a short text and complete the following:
* Tag each word with Penn Treebank and universal tags
* Extract noun phrases

You will use spaCy to complete these tasks. You already have some experience with NLTK from tokenization, lemmatization, and stemming. spaCy is another very useful NLP tool and a strong competitor to NLTK. Unlike NLTK, spaCy takes an object-oriented approach and does much of the groundwork for you, which makes text processing easier and faster. For example, it extracts noun phrases without requiring you to design a regex pattern in advance and traverse the parse tree. Instead, you simply read the noun chunks that spaCy has already computed under the hood. 

Here is a simple code example to extract the noun phrases using spaCy:

```python
doc = nlp(string_to_process)
for np in doc.noun_chunks:
    print(np.text)
```

For additional resources, see the [spaCy 101 tutorial](https://spacy.io/usage/spacy-101). 

### How to solve it?

* Import the spacy package and load the English language model
* Pass the text to the model to get a spaCy Doc object
* Read the POS tagging attributes of each word, tok.tag_ (Penn Treebank) and tok.pos_ (universal), and the noun chunks from doc.noun_chunks
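
To see how the two tag sets relate, the sketch below groups fine-grained Penn Treebank tags under their coarse universal tags. It is a plain-Python illustration, not part of the spaCy API: the triples are copied from the tagger output recorded for the sample sentence in this task, and the name `fine_by_coarse` is only an illustrative choice.

```python
from collections import defaultdict

# (word, fine-grained Penn Treebank tag, coarse universal tag), copied from the
# tagger output recorded in this notebook for the sample sentence
tagged = [
    ("Health", "NN", "NOUN"),
    ("informatics", "NNS", "NOUN"),
    ("is", "VBZ", "AUX"),
    ("applied", "VBD", "VERB"),
    ("to", "IN", "ADP"),
    ("the", "DT", "DET"),
    ("essentially", "RB", "ADV"),
    ("and", "CC", "CCONJ"),
    ("patient", "JJ", "ADJ"),
]

# Collect the fine-grained tags observed under each universal tag
fine_by_coarse = defaultdict(set)
for word, fine, coarse in tagged:
    fine_by_coarse[coarse].add(fine)

for coarse in sorted(fine_by_coarse):
    print(coarse, sorted(fine_by_coarse[coarse]))
```

Several fine-grained tags can share one universal tag (here NN and NNS both fall under NOUN), which is why the universal tags are convenient for coarse filtering.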

### Code

In [25]:
import spacy                    #import spacy module
import en_core_web_sm

# load English language model
nlp = en_core_web_sm.load()

TEXT_SAMPLE = """Health informatics is information engineering applied to the field of health care, essentially the management and use of patient health care information """

doc = nlp(TEXT_SAMPLE)

# Obtain the fine-grained Penn Treebank tags via tok.tag_
print("The pos tags of the given text:")
for tok in doc:
    print((tok.text, tok.tag_))

print("*"*20)          
print("The universal pos tags of the given text")
print("*"*20) 
for tok in doc:
    print((tok.text, tok.pos_))

print("*"*20)
print("The extracted noun phrases is: ")
print("*"*20) 
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text)

The pos tags of the given text:
('Health', 'NN')
('informatics', 'NNS')
('is', 'VBZ')
('information', 'NN')
('engineering', 'NN')
('applied', 'VBD')
('to', 'IN')
('the', 'DT')
('field', 'NN')
('of', 'IN')
('health', 'NN')
('care', 'NN')
(',', ',')
('essentially', 'RB')
('the', 'DT')
('management', 'NN')
('and', 'CC')
('use', 'NN')
('of', 'IN')
('patient', 'JJ')
('health', 'NN')
('care', 'NN')
('information', 'NN')
********************
The universal pos tags of the given text
********************
('Health', 'NOUN')
('informatics', 'NOUN')
('is', 'AUX')
('information', 'NOUN')
('engineering', 'NOUN')
('applied', 'VERB')
('to', 'ADP')
('the', 'DET')
('field', 'NOUN')
('of', 'ADP')
('health', 'NOUN')
('care', 'NOUN')
(',', 'PUNCT')
('essentially', 'ADV')
('the', 'DET')
('management', 'NOUN')
('and', 'CCONJ')
('use', 'NOUN')
('of', 'ADP')
('patient', 'ADJ')
('health', 'NOUN')
('care', 'NOUN')
('information', 'NOUN')
********************
The extracted noun phrases is: 
********************
H

### Practice Exercise (Optional)

Based on the task above, can you parse the following text and extract the noun phrases from it? You can also choose other tools and texts that interest you for this exercise. 

    "Information science is that discipline that investigates the properties and behavior of information, the forces governing the flow of information, and the means of processing information for optimum accessibility and usability."

-Borko, H. (1968). Information science: What is it? American Documentation, 19, 3. Retrieved from ASIST, [What is information science](https://www.asist.org/about/what-is-information-science/)


In [26]:
import spacy                    
import en_core_web_sm
nlp = en_core_web_sm.load()
text_1 = """Information science is the discipline that investigates the properties and behavior of information, 
            the forces governing the flow of information, and the means of processing information for 
            optimum accessibility and usability."""

# here is your code

In [27]:
import spacy                    
import en_core_web_sm
nlp = en_core_web_sm.load()
text_1 = """Information science is the discipline that investigates the properties and behavior of information, 
            the forces governing the flow of information, and the means of processing information for 
            optimum accessibility and usability."""

# here is your code

doc = nlp(text_1)
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text)

Information science
the discipline
that
the properties
behavior
information
the flow
information
information
optimum accessibility
usability


# Named Entity Recognition (NER)

In [10]:
from IPython.display import HTML
HTML('<iframe width="705" height="537" src="https://www.youtube.com/embed/d7kFYmvyZiQ?list=PL6IN6GlGifEytPcv5HR_iaNBekwXYZIpR" frameborder="0" allowfullscreen></iframe>')

##  Task 2: Extract Named Entities

In this section, we will use spaCy to extract **Organization** (ORG), **Geopolitical entity** (GPE), **Cardinal** (CARDINAL), and **Money** (MONEY) entities from the text. 
spaCy trains its entity recognition model on the OntoNotes 5 corpus, so it supports the detection of a wider variety of entities than NLTK. This corpus includes diverse data sources, e.g., telephone conversations, newswire, newsgroups, broadcast news, broadcast conversation, and weblogs. More detail can be found [here](https://catalog.ldc.upenn.edu/LDC2013T19).   

* *The NIH was founded in April 1887 and is now part of the United States Department of Health and Human Services.* 
* *The NIH is located in Maryland, U.S. and contains nearly 1,000 scientists and support staff.*
* *The NIH obtained US$39 billion from Congress in 2019.*

### How to solve it?

* Import spacy and the English model
* Use nlp() to process the text. nlp() detects the named entities and sets the attributes of each word. For more detail about what you can do with nlp(), see the [spaCy 101 guide](https://spacy.io/usage/spacy-101)
* You can either read the entities from doc.ents    
```python
doc = nlp(string_to_process)
for ent in doc.ents:
    print(ent.label_)
```
    or read each word's "ent_type_" attribute to obtain the named entities.
```python
doc = nlp(string_to_process)
for token in doc:
    print(token.ent_type_)
```

* Visualize the entities within the sentence with displacy
```python
displacy.render(doc, style="ent")
```
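
The token-level route returns one label per word, so multi-word entities have to be reassembled. The sketch below shows one way to do that in plain Python with `itertools.groupby`. The (text, label) pairs are copied from the token output recorded for the first sample sentence, and `merge_entities` is a hypothetical helper name, not a spaCy function. Joining words with a single space is a simplification.

```python
from itertools import groupby

# (token text, ent_type_) pairs as recorded in this notebook; "" means "no entity"
tagged_tokens = [
    ("The", ""), ("NIH", "ORG"), ("was", ""), ("founded", ""), ("in", ""),
    ("April", "DATE"), ("1887", "DATE"), ("and", ""), ("is", ""), ("now", ""),
    ("part", ""), ("of", ""), ("the", "ORG"), ("United", "ORG"),
    ("States", "ORG"), ("Department", "ORG"), ("of", "ORG"), ("Health", "ORG"),
    ("and", "ORG"), ("Human", "ORG"), ("Services", "ORG"), (".", ""),
]

def merge_entities(pairs):
    """Merge runs of consecutive tokens that carry the same non-empty label."""
    entities = []
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        if label:  # skip runs of unlabeled tokens
            entities.append((" ".join(text for text, _ in group), label))
    return entities

for text, label in merge_entities(tagged_tokens):
    print(label, "->", text)
```

One caveat of this simple grouping: two different entities of the same type that stand directly next to each other would be merged into one. spaCy also exposes token.ent_iob_, which marks where each entity begins and avoids that ambiguity.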

### Code

In [28]:
import spacy                    #import spacy module
from spacy import displacy         # import NER visualizer
import en_core_web_sm


TEXT_SAMPLE = """
The NIH was founded in April 1887 and is now part of the United States Department of Health and Human Services.
The NIH is located in Maryland, U.S. and contains nearly 1,000 scientists and support staff.
The NIH obtained US$39 billion from Congress in 2019.
"""

# load English language model
nlp = en_core_web_sm.load()
# pass the text to nlp

doc = nlp(TEXT_SAMPLE)

# Extract the entities from such doc objects. We will get the following attributes of
# the entity, i.e., original text, start position, end position, entity type
print("*"*20) 
print("Using doc.ent to directly get entities")
print("*"*20) 
for ent in doc.ents:
    print("{}: {}, {}, {}".format(ent.text,      # original text
                                ent.start_char,  # start position of each entity in that text
                                ent.end_char,    # end position of each entity in that text
                                ent.label_))         # entity type
    
# You can also access the attributes of each token directly. 
# Here we obtain the text and the named entity label of each token.
print("*"*20) 
print("Using doc.token to directly get entities")
print("*"*20) 
for sent in doc.sents:
    print("We are processing the text: ", sent.text)      # print the sentence
    for tok in sent:
        print("{}: {}".format(tok.text, tok.ent_type_))      # original text and entity type
                                          
# Visualize the entities displacy by specifying the visualization type as "ent"     
displacy.render(doc, style="ent")
                                                      

********************
Using doc.ent to directly get entities
********************
NIH: 5, 8, ORG
April 1887: 24, 34, DATE
the United States Department of Health and Human Services: 54, 111, ORG
NIH: 117, 120, ORG
Maryland: 135, 143, GPE
U.S.: 145, 149, GPE
1,000: 170, 175, CARDINAL
NIH: 210, 213, ORG
US$39 billion: 223, 236, MONEY
Congress: 242, 250, ORG
2019: 254, 258, DATE
********************
Using doc.token to directly get entities
********************
We are processing the text:  
The NIH was founded in April 1887 and is now part of the United States Department of Health and Human Services.


: 
The: 
NIH: ORG
was: 
founded: 
in: 
April: DATE
1887: DATE
and: 
is: 
now: 
part: 
of: 
the: ORG
United: ORG
States: ORG
Department: ORG
of: ORG
Health: ORG
and: ORG
Human: ORG
Services: ORG
.: 

: 
We are processing the text:  The NIH is located in Maryland, U.S. and contains nearly 1,000 scientists and support staff.

The: 
NIH: ORG
is: 
located: 
in: 
Maryland: GPE
,: 
U.S.: GPE
and: 
co
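
The start and end positions printed for each entity are character offsets into the original string, with the end position exclusive. As a small check, the sketch below slices the first sample sentence with three of the offsets recorded above. It uses plain Python only, and the variable names are illustrative.

```python
# First sentence of TEXT_SAMPLE, including the newline that opens the triple-quoted string
text = ("\nThe NIH was founded in April 1887 and is now part of the "
        "United States Department of Health and Human Services.\n")

# (start_char, end_char, label) as recorded in the output above
spans = [(5, 8, "ORG"), (24, 34, "DATE"), (54, 111, "ORG")]

for start, end, label in spans:
    # Slicing with the offsets recovers the entity text
    print("{}: {!r}".format(label, text[start:end]))
```

Because the offsets refer to the raw string, the leading newline counts as character 0, which is why "NIH" starts at position 5 rather than 4.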

### Practice Exercise (Optional)

Based on the task above, can you extract the PERSON and ORGANIZATION entities from the given text? You can also choose other tools and texts that interest you for this exercise. 

    "Marc Lipsitch, a Harvard professor of epidemiology and the director of the Center for Communicable Disease Dynamics, created one of the first modeling tools used in the U.S. for the COVID-19 pandemic"

In [29]:
import spacy                    #import spacy module
from spacy import displacy         # import NER visualizer
import en_core_web_sm

text = """Marc Lipsitch, a Harvard professor of epidemiology and the director of the Center for Communicable Disease Dynamics, 
            created one of the first modeling tools used in the U.S. for the COVID-19 pandemic"""
# here is your code

In [30]:
import spacy                    #import spacy module
from spacy import displacy         # import NER visualizer
import en_core_web_sm

text = """Marc Lipsitch, a Harvard professor of epidemiology and the director of the Center for Communicable Disease Dynamics, 
            created one of the first modeling tools used in the U.S. for the COVID-19 pandemic"""
# here is your code

nlp = en_core_web_sm.load()
doc = nlp(text)
for ent in doc.ents:
    if ent.label_ == 'PERSON' or ent.label_ == 'ORG': # as we are outputting PERSON and ORGANIZATION only
        print("{}: {}, {}, {}".format(ent.text,      # original text
                                    ent.start_char,  # start position of each entity in that text
                                    ent.end_char,    # end position of each entity in that text
                                    ent.label_)) 

Marc Lipsitch: 0, 13, PERSON
Harvard: 17, 24, ORG
the Center for Communicable Disease Dynamics: 71, 115, ORG
COVID-19: 195, 203, ORG


# Relation Extraction

In the previous sections, we analyzed text at the lexical level, i.e., the predefined labels of individual words. To obtain more information from the text, syntactic structure analysis comes into play and provides us with the syntactic information in the text. The syntactic structure describes the arrangement of words and phrases that creates well-formed sentences in a language. 
In this section, we will first look at the syntactic dependency structure, that is, how words group into units and how the units relate to each other. We will then introduce Universal Dependencies and end the section with the extraction of subject-predicate-object relations. 

## Syntactic Dependency Structure

In [31]:
from IPython.display import HTML
HTML('<iframe width="705" height="537" src="https://www.youtube.com/embed/X3qgiSaDYzU?list=PL6IN6GlGifEytPcv5HR_iaNBekwXYZIpR" frameborder="0" allowfullscreen></iframe>')

### Additional Resources for the Dependency Structure

#### Universal dependency structure

One thing we didn't mention in the video is the [Universal Dependencies (UD) set](https://universaldependencies.org/). This set provides an inventory of the dependency relations in human languages. In this tutorial, we focus on English and will use this set for dependency parsing and for relation extraction in the following section. See the example: 

<img src = https://d3i71xaburhd42.cloudfront.net/273f54ea6f3631a78d9dd442609bb2033cfb1ffe/3-Figure14.2-1.png style="height:400px"> Source: Jurafsky, D., & Martin, J.H. Speech and Language Processing. Dependency Parsing (p.275)

### Task 3: Analyze Dependency Structure

In this task we have three sentences with complex structure. We want to figure out the root of each sentence and the head of each word. Be sure to familiarize yourself with the Universal Dependencies tags. 

* *I remember that you have given Tom a gift* 
* *Bell makes and distributes computer products*
* *The NIH is located in Maryland, U.S. and it contains nearly 1,000 scientists and support staff.*

#### How to solve it?

* Get the dependency tag of each word and its head or dependents, if any, using spaCy. 
```python
    # tok.dep_ gives the dependency relation of the word to its head, 
    # tok.head gives the head of the current word; notice that the head of the root verb of a sentence is the verb itself
    # tok.children gives all the dependents this word has. 
    # tok.rights, tok.lefts give you the dependents on its right and left
    # children, rights and lefts are generators, so wrap them in list() to print their contents
    doc = nlp(text)
    for tok in doc:
        print(tok.dep_)
        print(tok.head)
        print(list(tok.children))
        print(list(tok.rights))
        print(list(tok.lefts))
```
* Visualize the dependency structure by setting the visualization style to "dep" (dependency)
```python
    displacy.render(doc, style="dep")
```
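
To make the notions of head, dependents, and left/right dependents concrete, here is a minimal plain-Python sketch that stores only the head of each word and derives everything else from it. The labels and heads are the ones recorded in this notebook for *Bell makes and distributes computer products*. The helper names (`children`, `lefts`, `rights`, `words`) mirror the spaCy attributes but are hypothetical stand-ins written for this illustration.

```python
# (word, dependency label, index of the head word); the root is its own head
tokens = [
    ("Bell", "nsubj", 1),
    ("makes", "ROOT", 1),
    ("and", "cc", 1),
    ("distributes", "conj", 1),
    ("computer", "compound", 5),
    ("products", "dobj", 3),
]

def children(i):
    """Indices of the words whose head is word i (its dependents)."""
    return [j for j, (_, _, head) in enumerate(tokens) if head == i and j != i]

def lefts(i):
    """Dependents that appear before word i in the sentence."""
    return [j for j in children(i) if j < i]

def rights(i):
    """Dependents that appear after word i in the sentence."""
    return [j for j in children(i) if j > i]

def words(indices):
    return [tokens[j][0] for j in indices]

# The root is the only word that is its own head
root = next(i for i, (_, _, head) in enumerate(tokens) if head == i)
print("root:", tokens[root][0])
print("dependents:", words(children(root)))
print("left dependents:", words(lefts(root)))
print("right dependents:", words(rights(root)))
```

The whole tree is determined by the head column alone, which is why a parser only needs to predict one head and one label per word.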

#### Code

In [33]:
import pandas as pd
import spacy
# !pip install benepar   # you may need to install the package if you don't have that
import en_core_web_sm
from spacy import displacy # import visualization tool
from IPython.display import display

# load English language model
nlp = en_core_web_sm.load()
# Set up the visualization options
options = {"compact": True, "bg": "#ffffff",
           "color": "black", "font": "Source Sans Pro", "distance": 100}

TEXT_SAMPLE = ["I remember that you have give Tom a gift",
               "Bell makes and distributes computer products",
               "The NIH is located in Maryland, U.S. and it contains nearly 1,000 scientists and support staff"] 

for text in TEXT_SAMPLE:
    
    # Create a dataframe for an easy-to-see output
    df = pd.DataFrame()
    # Import the text and get nlp object
    doc = nlp(text)

    # Here parse the examples  using displacy.render method
    displacy.render(doc, style="dep", options=options)
    
    # Obtain the head and dependents of each word
    for tok in doc:
        df = df._append([{
            "Word":tok.text,
            "Dependent tag": tok.dep_,
            "Head":tok.head,
            "Dependents":list(tok.children),
            "Left dependents":list(tok.rights),
            "Right dependents":list(tok.lefts)
        }])
    
    # Show table in a readable format
    display(df)

Unnamed: 0,Word,Dependent tag,Head,Dependents,Left dependents,Right dependents
0,I,nsubj,remember,[],[],[]
0,remember,ROOT,remember,"[I, give]",[give],[I]
0,that,mark,give,[],[],[]
0,you,nsubj,give,[],[],[]
0,have,aux,give,[],[],[]
0,give,ccomp,remember,"[that, you, have, Tom, gift]","[Tom, gift]","[that, you, have]"
0,Tom,dative,give,[],[],[]
0,a,det,gift,[],[],[]
0,gift,dobj,give,[a],[],[a]


Unnamed: 0,Word,Dependent tag,Head,Dependents,Left dependents,Right dependents
0,Bell,nsubj,makes,[],[],[]
0,makes,ROOT,makes,"[Bell, and, distributes]","[and, distributes]",[Bell]
0,and,cc,makes,[],[],[]
0,distributes,conj,makes,[products],[products],[]
0,computer,compound,products,[],[],[]
0,products,dobj,distributes,[computer],[],[computer]


Unnamed: 0,Word,Dependent tag,Head,Dependents,Left dependents,Right dependents
0,The,det,NIH,[],[],[]
0,NIH,nsubjpass,located,[The],[],[The]
0,is,auxpass,located,[],[],[]
0,located,ROOT,located,"[NIH, is, in, and, contains]","[in, and, contains]","[NIH, is]"
0,in,prep,located,[Maryland],[Maryland],[]
0,Maryland,pobj,in,"[,, U.S.]","[,, U.S.]",[]
0,",",punct,Maryland,[],[],[]
0,U.S.,appos,Maryland,[],[],[]
0,and,cc,located,[],[],[]
0,it,nsubj,contains,[],[],[]


#### Practice Exercise (Optional)

Based on the task above, can you parse and visualize the dependency structure of the following sentence? You can also choose other tools and texts that interest you for this exercise. 

    "Marry was looking for her bag but nothing was founded"

In [34]:
import pandas as pd
import spacy
import en_core_web_sm
from spacy import displacy

text = "Marry was looking for her bag but nothing was founded"

# here is your code

In [18]:
import pandas as pd
import spacy
import en_core_web_sm
from spacy import displacy # import visualization tool
from IPython.display import display

text = "Marry was looking for her bag but nothing was founded"

# here is your code

# load English language model
nlp = en_core_web_sm.load()
# Set up the visualization options
options = {"compact": True, "bg": "#ffffff",
           "color": "black", "font": "Source Sans Pro", "distance": 100}
    
# Create a dataframe for an easy-to-see output
df = pd.DataFrame()
# Import the text and get nlp object
doc = nlp(text)

# Here parse the examples  using displacy.render method
displacy.render(doc, style="dep", options=options)

# Obtain the head and dependents of each word
for tok in doc:
    df = df._append([{
        "Word":tok.text,
        "Dependent tag": tok.dep_,
        "Head":tok.head,
        "Dependents":list(tok.children),
        "Left dependents":list(tok.lefts),
        "Right dependents":list(tok.rights)
    }])

# Show table in a readable format
display(df)

Unnamed: 0,Word,Dependent tag,Head,Dependents,Left dependents,Right dependents
0,Marry,nsubj,looking,[],[],[]
0,was,aux,looking,[],[],[]
0,looking,ROOT,looking,"[Marry, was, for, but, founded]","[Marry, was]","[for, but, founded]"
0,for,prep,looking,[bag],[],[bag]
0,her,poss,bag,[],[],[]
0,bag,pobj,for,[her],[her],[]
0,but,cc,looking,[],[],[]
0,nothing,nsubjpass,founded,[],[],[]
0,was,auxpass,founded,[],[],[]
0,founded,conj,looking,"[nothing, was]","[nothing, was]",[]


## Relation Extraction

In the section above we discussed the dependency structure. By traversing this structure, we can use the syntactic dependencies to extract predicate-argument relations. In this part we will discuss how to leverage the dependency structure to identify the **predicates and arguments** of a sentence. 
The resulting *triple* consists of an entity pair and the semantic relation between them, i.e., (Subject, Predicate, Object). 

In [19]:
from IPython.display import HTML
HTML('<iframe width="705" height="537" src="https://www.youtube.com/embed/ODph0mGwMEg?list=PL6IN6GlGifEytPcv5HR_iaNBekwXYZIpR" frameborder="0" allowfullscreen></iframe>')

### Task 4: Extract Subject-Predicate-Object Relation (Optional)

In the following, we will use spaCy to extract such relations from the example sentences of Task 3: 

* *I remember that you have given Tom a gift.*

* *Bell makes and distributes computer products.* 

* *The NIH is located in Maryland, U.S. and it contains nearly 1,000 scientists and support staff*

The expected outcome will be:

```python
([I], remember, [given])
([you], given, [Tom, gift])
([Bell], makes, [products])
([Bell], distributes, [products])
([], located, [NIH])
([it], contains, [scientists, staff])
```
Before we dive into the details of the code, feel free to experiment by running the **Code** cell and checking its output for the sentences above.


#### How to solve it?

Because this implementation is a little complex, we will take it one step at a time. 

* Import packages and initialize variables
* Obtain all verbs of a sentence
* Recognize subjects
* Recognize objects
* Recognize the conjunction dependents of subjects and objects

The basic logic is to create a function that takes the input text and outputs the results, **getRelation(sent_string)**. We then create three helper functions that getRelation relies on, i.e., **getSubj(verb)**, **getObj(verb)** and **getConj(word)**. 

##### 1. Import packages and initialize variables 

We will use these constants to select the words with the targeted dependency labels. The constants include the typical dependency tags of subjects and objects. In addition, we need the dependency tags of conjunctions.

In [36]:
# Import library
import spacy                             #import spacy module
import en_core_web_sm                    #import language model

from spacy.util import filter_spans      #import filter_spans to avoid duplicate matches
from spacy.matcher import Matcher        #import Matcher object to perform regex matching
nlp = en_core_web_sm.load()              #load English language model

# All possible dependency tags of subject
SUBJECTS_DEP = ["nsubj",  "csubj", "expl"]

# Subjects with passive voice
PASSIVE_SUBJ_DEP = ["nsubjpass", "csubjpass"]

# All possible dependency tags of object
OBJECTS_DEP = ["dobj", "dative", "pobj", "oprd", ]

# Conjunction dependency tags
CONJ_DEP = ["cc", "conj"]

##### 2. Obtain all the verbs of a sentence 

With the initial settings in place, we construct a function **getRelation(sent_string)** to extract the relations. In this function, we identify the verbs, subjects and objects of the sentence. We start with the verbs: the main verb is normally the root of a sentence, so the verbs help us find the other sentence parts. We also want to exclude auxiliary verbs and passive auxiliaries, such as "be", "do", "have", "can". We will use the following code to extract the verbs. word.pos_ is the POS tag of each word, and word.dep_ is its dependency tag. 

```python
def getRelation(sent_string):
    
    doc = nlp(sent_string)
    verb_rel_list = [word for word in doc 
                             if word.pos_ == 'VERB' and word.dep_ not in ['aux', 'auxpass']]
```
After this extraction, we pass each verb to the functions **getSubj(verb)** and **getObj(verb)** to get its subjects and objects.

```python

    # get the subjects and objects of this verb
    tuple_list = []
    # here get all verbs  
    if verb_rel_list:
        for verb in verb_rel_list:
            subj = getSubj(verb)
            obj = getObj(verb)
            print("{}".format((subj, verb, obj)))
```
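
The verb filter can be tried out without loading a model. The sketch below applies the same condition to a hand-built token list for *Marry was looking for her bag but nothing was founded*. The dependency labels are the ones recorded earlier in this notebook. The `Tok` class and the POS values are assumptions made for illustration; in particular, the auxiliaries are deliberately tagged VERB here so that the dependency check has something to filter.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token (text, pos_, dep_)."""
    text: str
    pos: str
    dep: str

# POS values are illustrative assumptions; dependency labels follow the recorded parse
sentence = [
    Tok("Marry", "PROPN", "nsubj"),
    Tok("was", "VERB", "aux"),
    Tok("looking", "VERB", "ROOT"),
    Tok("for", "ADP", "prep"),
    Tok("her", "PRON", "poss"),
    Tok("bag", "NOUN", "pobj"),
    Tok("but", "CCONJ", "cc"),
    Tok("nothing", "PRON", "nsubjpass"),
    Tok("was", "VERB", "auxpass"),
    Tok("founded", "VERB", "conj"),
]

# Keep verbs that carry meaning of their own; drop auxiliaries and passive auxiliaries
verbs = [tok.text for tok in sentence
         if tok.pos == "VERB" and tok.dep not in ("aux", "auxpass")]
print(verbs)
```

Each verb that survives the filter becomes the predicate of one candidate triple.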

##### 3. Recognize subjects: getSubj()

* First add all words with the subject dependency labels.

```python
    subjs = []
    subjs.extend([w for w in verb.lefts if w.dep_ in SUBJECTS_DEP and w.pos_ != "DET"])
```

* However, this does not cover all cases. There are some other situations to consider: 
    1. Passive voice, where the subject and object swap positions
    2. Conjunction, where two verbs may share the same subject. This situation can be recursive, that is, more than two verbs can be chained --> *I create, implement, and revise the product*. In this case, 'create' is the head of 'implement', and 'implement' is the head of 'revise'. It is thus better to use a recursive function to climb from head to head until a subject is found. 
    3. [clausal complement (ccomp, xcomp)](https://universaldependencies.org/u/dep/ccomp.html), such as 'I remembered to give...', 'She started to cry'. Both cases contain two verbs, and the latter one ("give", "cry") is the dependent of the first one ("remembered", "started"). The former verb is the head (root) of the sentence.

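```python
def getSubj(verb, limit_time = 3):
    
    # Passive voice: if the sentence has an agent ("by ..."), collect the agent's child as subject
    subjs.extend(list(w.rights)[0] for w in verb.rights if w.dep_ == 'agent')
    if len(subjs) == 0 and limit_time>0:
        # No subject found yet: look for it at the head verb, at most limit_time times
        limit_time -= 1    
        subjs.extend(getSubj(verb.head, limit_time))
    elif len(subjs) == 0: 
        print("No subject identified: ", verb)
```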

* In some cases a subject includes a conjunction, such as *apple and orange*. The conjoined words are subjects as well. We therefore construct another function, getConj(word), which finds these subject "friends", for example, orange and peach in *apple, orange and peach*.

```python
    subjs.extend([w for subj in subjs for w in getConj(subj)])
    
    return subjs
```
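
The head-climbing idea can be sketched on a toy tree without spaCy. In *Bell makes and distributes computer products*, the verb "distributes" has no subject of its own, so the lookup moves up to its head "makes" and borrows "Bell". The `Node` class and `find_subjects` function below are hypothetical names for this illustration and cover only the direct-subject and head-climbing steps, not the passive-agent or conjunction handling.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SUBJECT_DEPS = {"nsubj", "csubj", "expl"}

@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a parsed token."""
    text: str
    dep: str
    head: Optional["Node"] = None
    lefts: List["Node"] = field(default_factory=list)

def find_subjects(verb, limit=3):
    """Return the verb's own subjects, or climb to its head at most `limit` times."""
    subjects = [w for w in verb.lefts if w.dep in SUBJECT_DEPS]
    can_climb = verb.head is not None and verb.head is not verb
    if not subjects and limit > 0 and can_climb:
        subjects = find_subjects(verb.head, limit - 1)
    return subjects

# Toy tree for "Bell makes and distributes computer products"
makes = Node("makes", "ROOT")
makes.head = makes                      # the root is its own head
makes.lefts.append(Node("Bell", "nsubj", head=makes))
distributes = Node("distributes", "conj", head=makes)

print([s.text for s in find_subjects(makes)])
print([s.text for s in find_subjects(distributes)])
```

The `limit` argument bounds how far the lookup may climb, and the check against the root (whose head is itself) stops the recursion at the top of the tree.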

##### 4. Recognize objects: getObj()

* First add all words with object dependency labels
```python
    objs.extend([w for w in verb.rights if w.dep_ in OBJECTS_DEP])
```
* However, this does not cover all cases. There are several other cases to consider:
    1. Passive voice, where the subject and object swap places
    2. [clausal complement (ccomp, xcomp)](https://universaldependencies.org/u/dep/ccomp.html), such as 'He said that', 'I remember that'
    3. Relative clauses, conjunctions and preposition phrases, where the object does not follow the verb directly. With 'relcl' (or 'acl'), the verb modifies a noun, and that noun, the verb's head, is its object, e.g., in 'I saw the book you bought', 'bought' is the relcl of 'book'. With a conjunction, the verbs share an object that follows only the last verb, e.g., 'create and implement software'. In a preposition phrase the object follows a preposition rather than the verb, such as 'look for'. 
    
```python

    # Check if passive objects in left children if so the object will be the subject
    objs.extend([w for w in verb.lefts if w.dep_ in PASSIVE_SUBJ_DEP])
    
    # Check complement clause, conjunction and preposition condition
    if len(objs) == 0:
        # If the verb is relcl to the main verb, its head is its object, e.g., "I saw the book you bought(relcl)". Another example is "the elements connected (acl) by a link"  
        
        if verb.dep_ in ["relcl", "acl"] and verb.tag_ in ["VBN"]:
            objs.append(verb.head)    
        else:
            for child in verb.rights:
                
                # Consider the clausal complement, get the action of the clause, which can be represented by the verb, 
                # such as I remember that she cried (remember, cry)
                if child.dep_ in ['ccomp', 'xcomp']:
                    objs.extend([child])
                    break
                
                # Consider verb_prep condition where prep has the obj child, such as depends on
                elif child.pos_ == 'ADP' and child.dep_ == 'prep':
                    temp = [w_child for w_child in child.rights if w_child.dep_ in OBJECTS_DEP]
                    if temp:
                        objs.extend(temp)  
                        break          
                # Get the verb's child to check dependent verb
                elif child.pos_ == 'VERB': 
                    temp = getObj(child)
                    if temp:
                        objs.extend(temp)
                        break 
```
* In some cases an object may include a conjunction part, such as apple, orange and peach. This is also the object. 

```python
    objs.extend(w for obj in objs for w in getConj(obj))
    
    return objs
```
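
Two of these fallbacks, the passive subject and the verb-preposition-object pattern, can be checked on a toy tree. The sketch below rebuilds the recorded parse of *Marry was looking for her bag but nothing was founded* by hand. `Node` and `find_objects` are hypothetical names, and the function is a simplified version that leaves out the relcl, clausal-complement and nested-verb branches.

```python
from dataclasses import dataclass, field
from typing import List

OBJECT_DEPS = {"dobj", "dative", "pobj", "oprd"}
PASSIVE_SUBJECT_DEPS = {"nsubjpass", "csubjpass"}

@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a parsed token."""
    text: str
    dep: str
    lefts: List["Node"] = field(default_factory=list)
    rights: List["Node"] = field(default_factory=list)

def find_objects(verb):
    """Direct objects first, then passive subjects, then objects of a preposition."""
    objects = [w for w in verb.rights if w.dep in OBJECT_DEPS]
    objects += [w for w in verb.lefts if w.dep in PASSIVE_SUBJECT_DEPS]
    if not objects:
        for child in verb.rights:
            if child.dep == "prep":
                objects = [w for w in child.rights if w.dep in OBJECT_DEPS]
                if objects:
                    break
    return objects

# Toy tree following the parse recorded earlier in this notebook
bag = Node("bag", "pobj", lefts=[Node("her", "poss")])
prep_for = Node("for", "prep", rights=[bag])
founded = Node("founded", "conj",
               lefts=[Node("nothing", "nsubjpass"), Node("was", "auxpass")])
looking = Node("looking", "ROOT",
               lefts=[Node("Marry", "nsubj"), Node("was", "aux")],
               rights=[prep_for, Node("but", "cc"), founded])

print([o.text for o in find_objects(looking)])
print([o.text for o in find_objects(founded)])
```

For "looking" the object is reached through the preposition "for"; for "founded" the passive subject "nothing" is reported as the object, matching the passive-voice rule.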


##### 5. Get the subject's or object's "friends": getConj()

Here we construct the getConj function to capture the conjunction dependents. For example, in "apple, orange and peach", orange and peach are the conjuncts of apple.

```python
def getConj(word):
    '''
    Return the conjunction part of a token
    '''
    return [rchild for rchild in word.rights if rchild.dep_ == 'conj']
```
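
How many conjuncts a single lookup finds depends on how the parser attaches them. The sketch below assumes a chained parse of "apple, orange and peach", in which each conjunct attaches to the one before it, the same pattern described above for *create, implement, and revise*. Under that assumption a direct lookup sees only the next item, and a recursive walk is needed to collect the rest. `Node`, `direct_conjuncts` and `all_conjuncts` are hypothetical names for this illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a parsed token."""
    text: str
    dep: str
    rights: List["Node"] = field(default_factory=list)

def direct_conjuncts(word):
    """Conjuncts attached directly to the right of `word`."""
    return [r for r in word.rights if r.dep == "conj"]

def all_conjuncts(word):
    """Follow the chain so that conjuncts of conjuncts are collected too."""
    found = []
    for conjunct in direct_conjuncts(word):
        found.append(conjunct)
        found.extend(all_conjuncts(conjunct))
    return found

# Assumed chained attachment: apple -> orange -> peach
peach = Node("peach", "conj")
orange = Node("orange", "conj", rights=[Node("and", "cc"), peach])
apple = Node("apple", "nsubj", rights=[Node(",", "punct"), orange])

print([w.text for w in direct_conjuncts(apple)])
print([w.text for w in all_conjuncts(apple)])
```

If the parser instead attaches every conjunct to the first item, the direct lookup already returns all of them and the recursive walk adds nothing.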

#### Code

In [37]:
# Import library
import spacy                             #import spacy module
import en_core_web_sm                    #import language model

from spacy.util import filter_spans      #import filter_spans to avoid duplicate matches
from spacy.matcher import Matcher        #import Matcher object to perform regex matching
nlp = en_core_web_sm.load()              #load English language model

# All possible dependency tags of subject
SUBJECTS_DEP = ["nsubj",  "csubj", "expl"]

# Subjects with passive voice
PASSIVE_SUBJ_DEP = ["nsubjpass", "csubjpass"]

# All possible dependency tags of object
OBJECTS_DEP = ["dobj", "dative", "pobj", "oprd", ]

# Conjunction dependency tags
CONJ_DEP = ["cc", "conj"]


# Obtain subjects from the verb
def getSubj(verb, limit_time = 3):
    '''
    Traverse the relation's dependency tree to collect subject
    Arg: 
        verb: a verb of the sentence
    Return: 
        A subject list           
    '''
    subjs = []
    # check if conjunction (verb as conj to main sentence)
    # check if verb's head is verb, if true, then call getSubj
    
    subjs.extend([w for w in verb.lefts if w.dep_ in SUBJECTS_DEP and w.pos_ != "DET"])
    
    # Check passive voice, if agent in the sentence, i.e., "by", collect agent's child as subject
    subjs.extend(list(w.rights)[0] for w in verb.rights if w.dep_ == 'agent')

    
    if len(subjs) == 0 and verb.i != verb.head.i and limit_time > 0:
            
        # If verb has no subject, trace back to its head verb and use the subject of the main clause
        limit_time -= 1    
        
        # recursively use this function to further find subject
        subjs.extend(getSubj(verb.head, limit_time))
        

    # Get the shared dependency subject with the ones already identified in conjunction
    # Obtain conjunct dependents of the leftmost conjunct, apple, orange and peach
    subjs.extend([w for subj in subjs for w in getConj(subj)])
    
    return subjs

def getObj(verb):
    '''
    Traverse the relation's dependency tree to collect objects
    Arg: 
        verb: a verb of the sentence
    Return: 
        A list of objects    
    '''
    
    objs = []
    
    # Collect objects
    objs.extend([w for w in verb.rights if w.dep_ in OBJECTS_DEP])
    
    # here check agent "by" 
    # Check if passive objects in left children if so the object will be the subject
    objs.extend([w for w in verb.lefts if w.dep_ in PASSIVE_SUBJ_DEP])
                
    # Check preposition and conjunction condition
    if len(objs) == 0:
        # If the verb is relcl to the main verb, its head is its object, e.g., I saw the book you bought(relcl), or elements connected (acl) by a link    
        if verb.dep_ in ["relcl", "acl"] and verb.tag_ in ["VBN"]:
            objs.append(verb.head) 
            
        else:
            for child in verb.rights:
                
                # Consider the clausal complement, get the action of the clause, which can be represented by the verb, 
                # such as I remember that she cried (remember, cry)
                if child.dep_ in ['ccomp', 'xcomp']:
                    objs.extend([child])
                    break
                
                # Consider verb_prep condition where prep has the obj child, such as depends on
                elif child.pos_ == 'ADP' and child.dep_ == 'prep':
                    temp = [w_child for w_child in child.rights if w_child.dep_ in OBJECTS_DEP]
                    if temp:
                        objs.extend(temp)  
                        break         
                
                # Get the verb's child to check the dependent verb that share the same objects, e.g., make and develop software
                elif child.pos_ == 'VERB': 
                    temp = getObj(child)
                    if temp:
                        objs.extend(temp)
                        break              
    
    # Get the shared dependency subject with the ones already identified in conjunction
    # Obtain conjunct dependents of the rightmost conjunct, apple, orange and peach
    objs.extend(w for obj in objs for w in getConj(obj))   
    
    return objs

def getConj(word):
    '''
    Return the tokens coordinated with a word through a direct conj dependency
    Arg: 
        word: a token that may head a coordination, e.g., "apple" in "apple, orange and peach"
    Return: 
        A list of the word's right children labeled conj
    '''
    return [rchild for rchild in word.rights if rchild.dep_ == 'conj']


def getRelation(sent_string):
    """
    Print an S-V-O tuple for each verb found in the string
    Arg: 
        sent_string: A string containing one or more sentences
    Output: 
        One printed line per verb, e.g., ([subjects], 'verb lemma', [objects])

    """
    doc = nlp(sent_string)

    # Collect the verbs that represent relations, skipping auxiliaries such as am, is, are, can
    verb_rel_list = [word for word in doc
                     if word.pos_ == 'VERB' and word.dep_ not in ['aux', 'auxpass']]

    # Get and print the subjects and objects of each verb
    for verb in verb_rel_list:
        subj = getSubj(verb)
        obj = getObj(verb)
        print("{}".format((subj, verb.lemma_, obj)))
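
`getConj` returns only the direct `conj` children of a token, and `getSubj` applies it once to the subjects it has found. If the parser attaches coordinated items as a chain (apple → orange → peach), which is an assumption you should check against your own model's parse, a single step would stop at "orange". The sketch below shows one way to follow the whole chain. It uses a hypothetical `FakeToken` stand-in so that it runs without spaCy, and `get_conj_chain` is an illustrative helper rather than part of this module.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token: text, dependency label, right children."""
    text: str
    dep_: str = ""
    rights: List["FakeToken"] = field(default_factory=list)


def get_conj_chain(word):
    """Collect every conjunct reachable from `word` through chained conj arcs."""
    found = []
    stack = [word]
    while stack:
        current = stack.pop()
        for child in current.rights:
            if child.dep_ == "conj":
                found.append(child)
                stack.append(child)  # keep following the chain from this conjunct
    return found


# "apple, orange and peach" under the chained-attachment assumption
peach = FakeToken("peach", "conj")
orange = FakeToken("orange", "conj", [peach])
apple = FakeToken("apple", "dobj", [orange])

direct = [t.text for t in apple.rights if t.dep_ == "conj"]
chained = [t.text for t in get_conj_chain(apple)]
print(direct)   # ['orange']
print(chained)  # ['orange', 'peach']
```

With real spaCy tokens, the built-in `Token.conjuncts` attribute may serve a similar purpose; compare its result with the parse before relying on it.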


#### Play with the code

You can experiment with the code before you dive into the details. Run the **Code** section first, then run the following cell. To try your own sentences, replace the argument in
```python
getRelation(text)
```
with the text you are interested in.

In [38]:
TEXT_SAMPLE = """
I remember that you have given Tom a gift.
Bell makes and distributes computer products.
The NIH is located in Maryland, U.S. and it contains nearly 1,000 scientists and support staff.
"""
CHALLENGE_TEXT = """
Mary was looking for her bag but nothing was found
"""

getRelation(TEXT_SAMPLE)

([I], 'remember', [given])
([you], 'give', [Tom, gift])
([Bell], 'make', [products])
([Bell], 'distribute', [products])
([], 'locate', [NIH])
([it], 'contain', [scientists, staff])
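
Each printed line groups all subjects and all objects of one verb. If you prefer one (subject, relation, object) triple per row, for example to load the results into a table, the groups can be expanded pairwise. The sketch below is one possible way to do that; `expand_svo` is a hypothetical helper, and it works on plain strings copied from the output above rather than on spaCy tokens.

```python
from itertools import product


def expand_svo(subjs, verb, objs):
    """Pair every subject with every object for one verb (hypothetical helper)."""
    return [(s, verb, o) for s, o in product(subjs, objs)]


# Rows copied from the printed output above, rewritten as plain strings
rows = [
    (["you"], "give", ["Tom", "gift"]),
    (["Bell"], "make", ["products"]),
    ([], "locate", ["NIH"]),
]

triples = [t for subjs, verb, objs in rows for t in expand_svo(subjs, verb, objs)]
for triple in triples:
    print(triple)
# ('you', 'give', 'Tom')
# ('you', 'give', 'gift')
# ('Bell', 'make', 'products')
```

Note that the `locate` row produces no triple because its subject list is empty. Whether to drop such rows or keep them with a placeholder subject is a design choice that depends on your task.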


# Reference & Appendix

## Reference

* Taylor, A., Marcus, M., & Santorini, B. (2003). The Penn Treebank: An Overview. Text, Speech and Language Technology, vol 20.
* https://www.nltk.org/book/ch05.html
* https://www.nltk.org/book/ch07.html
* https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
* https://arxiv.org/ftp/arxiv/papers/1308/1308.0661.pdf
* https://nlp.stanford.edu/software/dependencies_manual.pdf
* https://universaldependencies.org/en/dep/xcomp.html
* Wang, Y., Wang, L., Rastegar-Mojarad, M., Moon, S., Shen, F., Afzal, N., Liu, S., Zeng, Y., Mehrabi, S., Sohn, S., & Liu, H. (2018). [Clinical information extraction applications: A literature review](https://doi.org/10.1016/j.jbi.2017.11.011). Journal of Biomedical Informatics, 77, 34–49.

## Available Clinical Text Mining Tools

As we discussed in the video, there are many text mining tools you can use for clinical purposes. 

| Name | Description | Website |
|------|-------------|---------|
| cTAKES | Open-source NLP system based on UIMA framework for extraction of information from electronic health records unstructured clinical text | http://ctakes.apache.org/ |
| MetaMap | National Institutes of Health (NIH)-developed NLP tool that maps biomedical text to UMLS concepts | https://metamap.nlm.nih.gov/ |
| MedLEE | NLP system that extracts, structures, and encodes clinical information from narrative clinical notes | http://zellig.cpmc.columbia.edu/medlee/ |
| KnowledgeMap Concept Indexer (KMCI) | NLP system that identifies biomedical concepts and maps them to UMLS concepts | https://medschool.vanderbilt.edu/cpm/center-precision-medicine-blog/kmci-knowledgemap-concept-indexer |
| HITEx | Open-source NLP tool built on top of the GATE framework for various tasks such as principal diagnoses extraction and smoking status extraction | https://www.i2b2.org/software/projects/hitex/hitex_manual.html |
| MedEx | NLP tool used to recognize drug names, dose, route, and frequency from free-text clinical records | https://medschool.vanderbilt.edu/cpm/center-precision-medicine-blog/medex-tool-finding-medication-information |
| MedTagger | Open-source NLP pipeline based on UIMA framework for indexing based on dictionaries, information extraction, and machine learning–based named entity recognition from clinical text | http://ohnlp.org/index.php/MedTagger |
| ARC | Automated retrieval console (ARC) is an open-source NLP pipeline that converts unstructured text to structured data such as Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) or UMLS codes | http://blulab.chpc.utah.edu/content/arc-automated-retrieval-console |
| Medtex | Clinical NLP software that extracts meaningful information from narrative text to facilitate clinical staff in decision-making process | https://aehrc.com/research/projects/medical-free-text-retrieval-and-analytics/#medtex |
| CLAMP | NLP software system based on UIMA framework for clinical language annotation, modeling, processing and machine learning | https://sbmi.uth.edu/ccb/resources/clamp.htm |
| MedXN | A tool to extract comprehensive medication information from clinical narratives and normalize it to RxNorm | http://ohnlp.org/index.php/MedXN |
| MedTime | A tool to extract temporal information from clinical narratives and normalize it to the TIMEX3 standard | http://ohnlp.org/index.php/MedTime |

(Wang et al., 2018)