## Open Information Extraction
Open information extraction (open IE) refers to the extraction of relation tuples, typically binary relations, from plain text, such as (Mark Zuckerberg; founded; Facebook). The central difference from other information extraction is that the schema for these relations does not need to be specified in advance; typically the relation name is just the text linking two arguments.
- StanfordOpenIE (https://nlp.stanford.edu/software/openie.html): StanfordOpenIE is part of Stanford CoreNLP, which is written in Java. The Python wrapper used here relies on the ``stanfordnlp`` package, which is built on top of PyTorch 1.0.0.
 - Therefore, first install PyTorch (https://pytorch.org/get-started/locally/): select your OS, package, language, and CUDA version, then run the generated command to install it.
 - Install StanfordOpenIE using ``pip install stanford-openie`` (https://pypi.org/project/stanford-openie/)
 - Install Java SE Development Kit 8 (https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html/)
 - Install graphviz: ``conda install python-graphviz``
 - Unzip the stanford-corenlp-full-2018-10-05.zip to "C:\Users\UIC\stanfordnlp_resources"
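
Before running the cells below, it can help to confirm that the external tools from the steps above are reachable. The following is a minimal sketch using only the standard library; the helper name ``check_tools`` is hypothetical, and it assumes the Java runtime and the Graphviz layout command are exposed as ``java`` and ``dot`` on the PATH.

```python
import shutil

def check_tools(tools=("java", "dot")):
    """Return a dict mapping each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

# Report which of the assumed command names can be found
for tool, path in check_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as not found, add its ``bin`` directory to the PATH and restart the notebook kernel.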

In [1]:
import nltk
from nltk import pos_tag, word_tokenize, ne_chunk, Tree
from openie import StanfordOpenIE
from graphviz import Digraph
import numpy as np

### Triple Extraction from text using ``StanfordOpenIE``

In [None]:
# Example code for using StanfordOpenIE in python
from openie import StanfordOpenIE

with StanfordOpenIE() as client:
    text = 'Barack Obama was the 44th president of the United States'
    print('Text: %s.' % text)
    for triple in client.annotate(text):
        print('|-', triple)

For each sentence, StanfordOpenIE returns several triples with high confidence scores. To refine the results further, we can filter them with constraints such as:
- The subject and object must be named entities of some predefined types
- The relation must be a verb or verb phrase (chunk rule)

Come up with your own refinement strategy to get a better result.
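
As one possible starting point, the two constraints above can be sketched as plain predicates over pre-tagged input. All names here (``ENTITY_LABELS``, ``is_verb_phrase``, ``is_entity``, ``keep_triple``) are hypothetical, and the sketch assumes the relation has already been POS-tagged with Penn Treebank tags (as ``nltk.pos_tag`` produces) and that entity phrases have already been mapped to their named-entity labels.

```python
# Assumed set of acceptable named-entity types for subjects and objects
ENTITY_LABELS = {"PERSON", "GPE", "ORGANIZATION"}

def is_verb_phrase(tagged_relation):
    """True if the relation starts with a verb and otherwise contains only
    verbs, adverbs, particles, or prepositions (a simple chunk rule)."""
    if not tagged_relation or not tagged_relation[0][1].startswith("VB"):
        return False
    allowed = ("VB", "RB", "RP", "IN", "TO")
    return all(tag.startswith(allowed) for _word, tag in tagged_relation)

def is_entity(phrase, entity_labels):
    """True if `phrase` is a known entity of an allowed type.
    `entity_labels` maps an entity phrase to its named-entity label."""
    return entity_labels.get(phrase) in ENTITY_LABELS

def keep_triple(subj, tagged_relation, obj, entity_labels):
    """Apply both constraints to one (subject, relation, object) candidate."""
    return (is_entity(subj, entity_labels)
            and is_entity(obj, entity_labels)
            and is_verb_phrase(tagged_relation))

entity_labels = {"Barack Obama": "PERSON", "Michelle Robinson": "PERSON"}
print(keep_triple("Barack Obama", [("married", "VBD")], "Michelle Robinson", entity_labels))  # True
print(keep_triple("Barack Obama", [("was", "VBD")], "44th president", entity_labels))         # False
```

Keeping the predicates separate makes it easy to tighten or relax one constraint without touching the other.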

In [None]:
def extract_triple(text):
    """Run StanfordOpenIE on `text` and keep only triples that link named entities."""
    subject_labels = {"PERSON"}                        # allowed entity types for the subject
    object_labels = {"PERSON", "GPE", "ORGANIZATION"}  # allowed entity types for the object
    trailing_preps = (" on", " at")                    # relation endings replaced by the object's preposition

    # Map every word to its named-entity label (inside an NE chunk) or its POS tag (elsewhere).
    word_tags = {}
    for chunk in ne_chunk(pos_tag(word_tokenize(text))):
        if isinstance(chunk, Tree):
            for word, _pos in chunk.leaves():
                word_tags[word] = chunk.label()
        else:
            word_tags[chunk[0]] = chunk[1]

    def filter_triples(candidates):
        kept = []
        for subj, rel, obj in candidates:
            prep = None  # preposition (POS tag "IN") found in the object, if any
            # Skip the triple if any tagged subject word is not an allowed entity type.
            skip = any(word_tags[w] not in subject_labels
                       for w in subj.split(" ") if w in word_tags)
            for w in obj.split(" "):
                if w not in word_tags:
                    continue
                if word_tags[w] == "IN":
                    prep = w
                if word_tags[w] not in object_labels:
                    skip = True
            # If the object extends the previously kept object under the same relation,
            # keep only the new part, e.g. "X at Y" after "X" becomes "Y".
            if len(kept) > 1 and rel == kept[-1][1] and kept[-1][2] in obj:
                obj = obj.replace(kept[-1][2], "")
                if prep is not None:
                    obj = obj.replace(prep + " ", "")
                    for ending in trailing_preps:
                        if rel.endswith(ending):
                            rel = rel[:-len(ending)] + " " + prep
                kept.append([subj, rel, obj.strip()])
            elif not skip:
                kept.append([subj, rel, obj])
        return kept

    # Extract the raw triples with StanfordOpenIE.
    triples = []
    with StanfordOpenIE() as client:
        print('Text: %s.' % text)
        for triple in client.annotate(text):
            triples.append([triple['subject'], triple['relation'], triple['object']])
    # Filter out the noisy triples.
    return filter_triples(triples)

tr = extract_triple(""" Barack Obama was the 44th president of the United States, and the first African American to serve in the office.
          On October 3, 1992,  Barack Obama married  Michelle Robinson at Trinity United Church in Chicago. """)
print("Extracted Triples:",tr)

### Construct the KB from Triples
Given the knowledge triples, we need to index all the entities and relations, i.e., build the entity set and the relation set, and then represent each triple by its entity ids and relation id.
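
To make the indexing concrete, here is a small worked example on two hand-written triples. It is only a sketch: the toy triples and the helper ``index_names`` are made up for illustration, and ids are assigned in sorted order, which is one of several reasonable conventions.

```python
toy_triples = [
    ("Barack Obama", "married", "Michelle Robinson"),
    ("Barack Obama", "was president of", "United States"),
]

def index_names(names):
    """Assign consecutive ids to the sorted unique names; return (id -> name, name -> id)."""
    id_to_name = dict(enumerate(sorted(set(names))))
    name_to_id = {name: i for i, name in id_to_name.items()}
    return id_to_name, name_to_id

# Entities are all subjects and objects; relations are the middle elements
id_to_entity, entity_to_id = index_names(
    [t[0] for t in toy_triples] + [t[2] for t in toy_triples])
id_to_relation, relation_to_id = index_names([t[1] for t in toy_triples])

# Encode each triple as [subject id, relation id, object id]
encoded = [[entity_to_id[s], relation_to_id[r], entity_to_id[o]] for s, r, o in toy_triples]

print(id_to_entity)    # {0: 'Barack Obama', 1: 'Michelle Robinson', 2: 'United States'}
print(id_to_relation)  # {0: 'married', 1: 'was president of'}
print(encoded)         # [[0, 0, 1], [0, 1, 2]]
```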

In [None]:
def KB(triples):
    """Index entities and relations; encode each triple as [subject id, relation id, object id]."""
    # Sorted unique entities (subjects and objects) and relations
    entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
    relations = sorted({t[1] for t in triples})
    # id -> name dictionaries for entities and relations
    id_en = dict(enumerate(entities))
    re_en = dict(enumerate(relations))
    # name -> id lookups used to encode the triples
    en_id = {name: i for i, name in id_en.items()}
    re_id = {name: i for i, name in re_en.items()}
    triples_id = [[en_id[s], re_id[r], en_id[o]] for s, r, o in triples]
    return id_en, re_en, triples_id
kb = KB(tr)
print("Entities:", kb[0], "\nRelations:",kb[1], "\nTriples:",kb[2])


### Visualize the KB using ``graphviz``
- Rendering the graph requires a working installation of ``Graphviz`` (https://www.graphviz.org/download/).
- After installing Graphviz, make sure that its ``bin/`` subdirectory, which contains the layout commands for rendering graph descriptions (dot, circo, neato, etc.), is on your system's path: on the command line, ``dot -V`` should print the version of your Graphviz installation.
- Refer to https://graphviz.readthedocs.io/en/stable/manual.html for the graphviz user guide.

In [None]:
def visualizeKB(kb_input):
    dot = Digraph(comment='KB-Demo') 
    for i in kb_input[0].keys():
        dot.node(str(i), kb_input[0][i])  
    for i in kb_input[2]:
        dot.edge(str(i[0]),str(i[2]), label=kb_input[1][i[1]])
    return dot
dot = visualizeKB(kb)
print(dot.source)
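
For intuition about what a DOT description of the KB contains, the sketch below builds one by hand from the three indexing structures. ``kb_to_dot`` is a hypothetical helper, not part of the ``graphviz`` package, and its text will not necessarily match ``dot.source`` character for character; it also assumes that entity and relation names contain no double quotes.

```python
def kb_to_dot(id_to_entity, id_to_relation, encoded_triples):
    """Build a DOT digraph description by hand: one node per entity, one labeled edge per triple."""
    lines = ["digraph {"]
    for node_id, name in id_to_entity.items():
        lines.append(f'    {node_id} [label="{name}"]')
    for subj, rel, obj in encoded_triples:
        lines.append(f'    {subj} -> {obj} [label="{id_to_relation[rel]}"]')
    lines.append("}")
    return "\n".join(lines)

print(kb_to_dot({0: "Barack Obama", 1: "Michelle Robinson"},
                {0: "married"},
                [[0, 0, 1]]))
```

Nodes are keyed by entity id and carry the entity name as their label, mirroring how ``visualizeKB`` calls ``dot.node`` and ``dot.edge``.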

In [None]:
dot.render('kb-demo', view=True) 
dot