## SpaCy-based Knowledge Graph

#### What is a knowledge graph?

A knowledge graph is a data structure that represents the information embedded within text as a set of nodes (entities) connected by edges (the relationships between them).

The most basic knowledge graph would be (node)----edge---->(node), or more specifically (entity A)----predicate---->(entity B), where predicate is a *relationship* from entity A aka the *subject* to entity B aka the *object*.<br> 
The usage of *relationship* is important, as you could have a triple like <br>
<br>
    (Dean Pelton)----loves---->(Jeff)<br><br>
    where the predicate is a verb (this is called a data graph), but <br> <br>
    (Metaphysics)----subclass---->(Philosophy)<br> <br>shows a *relationship* not in terms of a verb, but in terms of a taxonomy!
One could begin to realize the possibilities of organizing data with this structure: <br><br>
For example....<br><br>
(Dean Pelton)----loves---->(Jeff)*----studies---->*(Metaphysics)----subclass---->(Philosophy)
<br><br>
Once a subject has multiple edges, one can begin to imagine the *power* that a knowledge graph can wield!
<br><br>
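To make the structure concrete, here is a minimal sketch that stores the example triples as plain tuples and indexes them by subject. The helper names `build_graph` and `follow` are made up for illustration and are not part of any library.

```python
from collections import defaultdict

# Triples taken from the example above: (subject, predicate, object).
triples = [
    ("Dean Pelton", "loves", "Jeff"),
    ("Jeff", "studies", "Metaphysics"),
    ("Metaphysics", "subclass", "Philosophy"),
]


def build_graph(triples):
    """Index triples by subject so outgoing edges are easy to look up."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def follow(graph, start):
    """Follow the first outgoing edge of each node and render the path."""
    path = f"({start})"
    node = start
    seen = {start}
    while graph.get(node):
        predicate, node = graph[node][0]
        path += f"----{predicate}---->({node})"
        if node in seen:  # stop if we loop back to a visited node
            break
        seen.add(node)
    return path


graph = build_graph(triples)
print(follow(graph, "Dean Pelton"))
```

Indexing by subject is just one choice; indexing by object as well would make it possible to walk edges backwards.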

With that in mind, let's develop a naive, simple method for extracting a triple (aka node---predicate--->node) from a text document.

For a given document:

1. Find the entities in the document (using spaCy)
2. Find the relationship between the entities in the document

This sounds so easy! Right? Of course not!

### 1. Find the entities in the document
How does spaCy find entities, though?
I'm not sure *exactly* how spaCy does so. According to [this Microsoft blog post](https://learn.microsoft.com/en-us/archive/blogs/machinelearning/machine-learning-and-text-analytics), one approach is to train an ML model on text whose named entities have been annotated with tags. The text could be chunked by sentence, and each word tagged with its POS tag (noun, verb, adj, etc.) to give the model more features.<br>
I kinda lied. spaCy does so with a CNN. However, I am unsure about the specifics of this process.

I think it's important to begin doing some actual coding! So let's just run with the naive approach and try to extract entities from some text with spaCy to begin.

In [139]:
import spacy
from spacy import displacy
nlp_model = spacy.load("en_core_web_sm")
# spacy.load will "load" a pretrained pipeline. spaCy pipelines can perform
# tokenization, POS tagging, dependency parsing, NER, text categorization
# (sentiment, spam), rule-based matching (like regex), entity linking, and more.
# In our case en_core_web_sm (English core pipeline, trained on web text, small)
# has tok2vec, tagger, parser, attribute_ruler, lemmatizer, ner.


our_text = "Jessica M Roberts lives in Canada"
doc = nlp_model(our_text)

Luckily spaCy makes things pretty readable, so I think the code below should speak for itself! But in case it doesn't, we are just taking a string, running it through our pretrained model for named entity recognition (NER), and mapping each entity's name to its entity type!

In [140]:
def spacy_NER(input_str):
    annotated_str = {}
    for entity in nlp_model(input_str).ents:
        annotated_str[entity.text] = entity.label_
    return annotated_str

In [141]:
spacy_NER(our_text)

{'Jessica M Roberts': 'PERSON', 'Canada': 'GPE'}

It's important to recognize that indexing into a spaCy Doc object gives us a Token object.

In [142]:
print(doc)
print(type(doc))
print(doc[0])
print(type(doc[0]))

Jessica M Roberts lives in Canada
<class 'spacy.tokens.doc.Doc'>
Jessica
<class 'spacy.tokens.token.Token'>


Each token within a Doc object has a *head* attribute. The head of token B is itself a token (token A), and it is the syntactic parent of token B.

The root of a sentence is a token which matches its head!

In [143]:
heads = {x:x.head for x in doc}
print(heads)
root = [x for x in heads if heads[x]==x]
print("The root of this sentence is: ",root)

{Jessica: Roberts, M: Roberts, Roberts: lives, lives: lives, in: lives, Canada: in}
The root of this sentence is:  [lives]
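Since every token points at its head, the heads form a tree that can be walked upwards to the root. The sketch below copies the head map from the output above, using plain strings in place of spaCy tokens so it runs without a model; `path_to_root` and `children` are hypothetical helper names.

```python
# Head map copied from the output above, with strings standing in for tokens.
heads = {
    "Jessica": "Roberts",
    "M": "Roberts",
    "Roberts": "lives",
    "lives": "lives",
    "in": "lives",
    "Canada": "in",
}


def path_to_root(token, heads):
    """Climb from a token to the root by repeatedly following its head."""
    path = [token]
    while heads[token] != token:  # the root is its own head
        token = heads[token]
        path.append(token)
    return path


def children(token, heads):
    """Tokens whose head is `token` (the root's self-reference is skipped)."""
    return [t for t, h in heads.items() if h == token and t != token]


print(path_to_root("Canada", heads))
print(children("lives", heads))
```

Here "Canada" reaches the root through "in", which hints at why the preposition ends up as part of the predicate later on.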


### Rule Based relation extraction

Now it's important to recognize how this simple sentence works in terms of a knowledge graph's triple: (Jessica Roberts)---lives in---->(Canada). It's structured as `(COMPOUND NOUN NOUN-SUBJECT) (ROOT-VERB) (PREPOSITION NOUN-OBJECT)`. Assuming that every sentence has a subject acting on an object, maybe we can extract the relationship between the two named entities with a rule.

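This rule will look like:

`(COMPOUND + subject noun)----(ROOT verb + PREPOSITION)---->(object noun)`

That is, for each token in the sentence: compounds and the subject noun go to the subject, the root verb and a preposition directly after it go to the predicate, and the object noun goes to the object.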


In [144]:
def subject_object_predicate(doc, entities):
    subject = {}
    object = {}
    predicate = {}
    for token_n, token in enumerate(doc):
        #This will lead to compounds from one entity to be in the other! But we can clean those out later
        if "compound" in token.dep_:
            subject[token.text] = token.idx
        if "subj" in token.dep_:
            subject[token.text] = token.idx
        
        #This will lead to compounds from one entity to be in the other! But we can clean those out later
        if "compound" in token.dep_:
            object[token.text] = token.idx
        if "obj" in token.dep_:
            object[token.text] = token.idx


        if "verb" in token.pos_.lower() or "root" in token.dep_.lower():
            predicate[token.text] = token.idx
            # Guard against the verb/root being the last token in the doc
            if token_n + 1 < len(doc) and "prep" in doc[token_n+1].dep_:
                predicate[doc[token_n+1].text] = doc[token_n+1].idx
    
    #clean object and subject 
    #For each entity
    for entity in entities:
        #Split entity into a list
        entity_list = entity.split(" ")
        #For the object, find which words in it match the current entity
        object_match = [x for x in list(object.keys()) if x in entity_list]
        #If it's not a perfect match, then the object has compounds in it that don't exist in the actual entity it's associated with! So we should remove those matching words
        if entity_list != object_match:
            for word in object_match:
                object.pop(word)
        #For the subject, find which words in it match the current entity
        subject_match = [x for x in list(subject.keys()) if x in entity_list]
        #If it's not a perfect match, then the subject has compounds in it that don't exist in the actual entity it's associated with! So we should remove those matching words
        if entity_list != subject_match:
            for word in subject_match:
                subject.pop(word)
    return {"subject":subject, "object":object, "predicate":predicate}
print(subject_object_predicate(nlp_model(our_text), entities = spacy_NER(our_text)))

{'subject': {'Jessica': 0, 'M': 8, 'Roberts': 10}, 'object': {'Canada': 27}, 'predicate': {'lives': 18, 'in': 24}}


Now we can build a rule-based triple from our sentence:

In [145]:
triple = subject_object_predicate(nlp_model(our_text), 
    entities = spacy_NER(our_text))
print(f"({' '.join(list(triple['subject']))})---({' '.join(list(triple['predicate']))})--->({' '.join(list(triple['object']))})")

(Jessica M Roberts)---(lives in)--->(Canada)


Now we should make this into a pipeline!

In [148]:
def spacyRule_triple_extraction_pipeline(sentence):
    doc = nlp_model(sentence)
    entities = spacy_NER(sentence)
    triple = subject_object_predicate(doc, entities)
    print(f"({' '.join(list(triple['subject']))})---({' '.join(list(triple['predicate']))})--->({' '.join(list(triple['object']))})")
    return triple
spacyRule_triple_extraction_pipeline("Jessica M Roberts is from Canada")
    

(Jessica M Roberts)---(is from)--->(Canada)


{'subject': {'Jessica': 0, 'M': 8, 'Roberts': 10},
 'object': {'Canada': 26},
 'predicate': {'is': 18, 'from': 21}}
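The returned dictionaries map each word to its character offset, which is handy for debugging but awkward to store in a graph. As a small sketch, a hypothetical helper `to_triple` (not part of the notebook's pipeline) can collapse the result into a plain `(subject, predicate, object)` tuple, ordering the words by offset rather than relying on dictionary insertion order:

```python
def to_triple(result):
    """Join each part's words, ordered by character offset, into one string."""
    def join(part):
        words = sorted(result[part], key=result[part].get)
        return " ".join(words)

    return join("subject"), join("predicate"), join("object")


# The dictionary printed above for "Jessica M Roberts is from Canada".
result = {
    "subject": {"Jessica": 0, "M": 8, "Roberts": 10},
    "object": {"Canada": 26},
    "predicate": {"is": 18, "from": 21},
}
print(to_triple(result))
```

One caveat of the word-to-offset dictionaries themselves: a word that appears twice in the same part keeps only its last offset.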

## OpenIE-based Knowledge Graph

In [107]:
from openie import StanfordOpenIE

# https://stanfordnlp.github.io/CoreNLP/openie.html#api
# Default value of openie.affinity_probability_cap was 1/3.
properties = {
    'openie.affinity_probability_cap': 2 / 3,
}

with StanfordOpenIE(properties=properties) as client:
    text = 'Barack Obama was born in Hawaii. Richard Manning wrote this sentence.'
    print('Text: %s.' % text)
    for triple in client.annotate(text):
        print('|-', triple)

    graph_image = 'graph.png'
    client.generate_graphviz_graph(text, graph_image)
    print('Graph generated: %s.' % graph_image)

    with open('corpus/pg6130.txt', encoding='utf8') as r:
        corpus = r.read().replace('\n', ' ').replace('\r', '')

    triples_corpus = client.annotate(corpus[0:5000])
    print('Corpus: %s [...].' % corpus[0:80])
    print('Found %s triples in the corpus.' % len(triples_corpus))
    for triple in triples_corpus[:3]:
        print('|-', triple)
    print('[...]')

Text: Barack Obama was born in Hawaii. Richard Manning wrote this sentence..
Starting server with command: java -Xmx8G -cp /home/monty/.stanfordnlp_resources/stanford-corenlp-4.1.0/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-0b085fd4193e49e3.props -preload openie


PermanentlyFailedException: Timed out waiting for service to come alive.

## LSTM-based NER

#### GOAL: Train a many-to-many LSTM on each word and its respective POS tag. Each word will have an output tag of 'O' or one of {'B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per', 'B-tim', 'I-art', 'I-eve', 'I-geo', 'I-gpe', 'I-nat', 'I-org', 'I-per', 'I-tim'}
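These tags appear to follow the usual BIO convention: `B-` marks the first word of an entity, `I-` a word that continues it, and `O` a word outside any entity. Under that assumption, a per-word tag sequence can be turned back into entities with a sketch like the one below. The helper `bio_to_spans` is hypothetical, and the tags for the example sentence are hand-written for illustration rather than taken from the dataset.

```python
def bio_to_spans(words, tags):
    """Group B-/I- tagged words into (entity_text, entity_type) pairs."""
    spans = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one that was open, if any.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


words = ["Jessica", "M", "Roberts", "lives", "in", "Canada"]
tags = ["B-per", "I-per", "I-per", "O", "O", "B-geo"]
print(bio_to_spans(words, tags))
```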

In [7]:
import pandas as pd
import numpy as np

In [8]:
data = pd.read_csv("ner_dataset.csv")

In [16]:
data_fillna = data.ffill(axis=0)
data_fillna

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
...,...,...,...,...
1048570,Sentence: 47959,they,PRP,O
1048571,Sentence: 47959,responded,VBD,O
1048572,Sentence: 47959,to,TO,O
1048573,Sentence: 47959,the,DT,O


In [28]:
def get_dict_map(data, token_or_tag):
    tok2idx = {}
    idx2tok = {}
    
    if token_or_tag == 'token':
        vocab = list(set(data['Word'].to_list()))
    else:
        vocab = list(set(data['Tag'].to_list()))
    
    idx2tok = {idx:tok for  idx, tok in enumerate(vocab)}
    tok2idx = {tok:idx for  idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok


token2idx, idx2token = get_dict_map(data, 'token')
tag2idx, idx2tag = get_dict_map(data, 'tag')
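One thing to keep in mind with `list(set(...))` is that set ordering is not guaranteed to be the same between runs, so the indices may change each time the notebook is restarted. A minimal alternative sketch, assuming we want reproducible indices and a reserved padding symbol at index 0 (both are assumptions, and `build_vocab` is a hypothetical name), could look like this:

```python
def build_vocab(items, reserved=("<PAD>",)):
    """Map distinct items to indices: reserved symbols first, the rest sorted."""
    vocab = list(reserved) + sorted(set(items) - set(reserved))
    tok2idx = {tok: idx for idx, tok in enumerate(vocab)}
    idx2tok = {idx: tok for tok, idx in tok2idx.items()}
    return tok2idx, idx2tok


# A few words from the first sentence of the dataset, with one repeat.
words = ["Thousands", "of", "demonstrators", "have", "marched", "of"]
tok2idx, idx2tok = build_vocab(words)
print(tok2idx)
```

Sorting makes the mapping depend only on the vocabulary itself, which matters if a trained model is saved and reloaded later.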

In [36]:
data['Word_idx'] = data['Word'].map(token2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx) 

Put each sentence into a single cell

In [40]:
data_fillna = data.ffill(axis=0)
data_group = data_fillna.groupby(['Sentence #'], as_index=False)[['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']].agg(lambda x: list(x))



In [41]:
data_group.iloc[1]

Sentence #                                         Sentence: 10
Word          [Iranian, officials, say, they, expect, to, ge...
POS           [JJ, NNS, VBP, PRP, VBP, TO, VB, NN, TO, JJ, J...
Tag           [B-gpe, O, O, O, O, O, O, O, O, O, O, O, O, O,...
Word_idx      [34060, 34845, 30973, 28080, 14747, 15710, 176...
Tag_idx       [15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 1...
Name: 1, dtype: object

In [42]:
np.unique(data["Tag"])

array(['B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per',
       'B-tim', 'I-art', 'I-eve', 'I-geo', 'I-gpe', 'I-nat', 'I-org',
       'I-per', 'I-tim', 'O'], dtype=object)

In [43]:
import numpy as np
import tensorflow
from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional

In [50]:
input_dim = len(list(set(data['Word'].to_list())))+1
input_length = max(len(x) for x in data_group["Word"])
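Because sentences have different lengths, they would need to be brought to the same length before being fed to a model in batches. Keras ships a `pad_sequences` utility for this; the pure-Python sketch below only illustrates the idea. The function name `pad_to_length` and the choice of padding value are assumptions for illustration.

```python
def pad_to_length(sequences, length, pad_value):
    """Right-pad (or truncate) every sequence to exactly `length` items."""
    return [
        list(seq[:length]) + [pad_value] * (length - len(seq))
        for seq in sequences
    ]


# Toy index sequences standing in for data_group["Word_idx"].
sequences = [[3, 1, 4], [1, 5]]
longest = max(len(seq) for seq in sequences)
print(pad_to_length(sequences, longest, pad_value=0))
```

Whichever padding value is used, it should not collide with the index of a real word, which is presumably why `input_dim` reserves one extra slot.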

In [58]:
data_group.iloc[0]["Word_idx"]

[21452,
 9016,
 34659,
 11975,
 22112,
 12369,
 15362,
 15710,
 10922,
 30879,
 34158,
 2799,
 2096,
 24391,
 21519,
 30879,
 33679,
 9016,
 5748,
 30214,
 21642,
 7434,
 11627,
 28853]

In [59]:
model = Sequential()

# Add Embedding layer
model.add(Embedding(input_dim=input_dim, output_dim=32, input_length=5))


In [60]:
model.compile(optimizer="SGD", loss="mse")

In [63]:
model.predict([1,2,3,4,5,6,10])



array([[-3.60159501e-02,  3.03565897e-02,  5.51499054e-03,
        -2.10718997e-02, -4.46258895e-02, -4.94754910e-02,
         3.57830413e-02,  2.52403729e-02,  2.11687796e-02,
        -3.88580933e-02,  2.62679197e-02, -1.38202682e-02,
        -4.95600700e-02,  4.11613248e-02, -1.14636309e-02,
         4.68579419e-02,  1.36339925e-02, -3.11755296e-02,
         2.78808735e-02,  1.63600780e-02, -1.64666548e-02,
         2.96495594e-02,  1.63764097e-02,  4.05922048e-02,
        -1.01650842e-02,  4.45184000e-02, -4.95735668e-02,
        -4.19676192e-02,  4.36299555e-02,  2.13460959e-02,
        -3.67555730e-02, -2.49840263e-02],
       [ 4.58573438e-02,  4.57807668e-02, -1.88681372e-02,
         3.49065922e-02,  2.17909105e-02,  3.73826958e-02,
         1.75218657e-03, -2.76631005e-02,  4.21754271e-03,
         1.85655467e-02,  4.39696386e-03, -4.81691025e-02,
         7.42132589e-03,  4.30054925e-02, -2.82822382e-02,
         3.03554796e-02,  4.08295654e-02,  3.11111473e-02,
         9.20

## GPT-based Knowledge Graph