## SpaCy, rule-based Knowledge Graph

#### What is a knowledge graph?

A knowledge graph is a data structure. This graph is a way to represent information embedded within text in ///write more here!!

The most basic knowledge graph would be (node)----edge---->(node), or more specifically (entity A)----predicate---->(entity B), where predicate is a *relationship* from entity A aka the *subject* to entity B aka the *object*.<br> 
The usage of *relationship* is important, as you could have a triple like <br>
<br>
    (Dean Pelton)----loves---->(Jeff)<br><br>
    where the predicate is a verb (this is called a data graph), but <br> <br>
    (Metaphysics)----subclass---->(Philosophy)<br> <br>shows a *relationship* not in terms of a verb, but in terms of a taxonomy!
One could begin to realize the possibilities of organizing data with this structure: <br><br>
For example....<br><br>
(Dean Pelton)----loves---->(Jeff)*----studies---->*(Metaphysics)----subclass---->(Philosophy)
<br><br>
Once a subject has multiple edges, one can begin to imagine the *power* that a knowledge graph can weild!
<br><br>

Based on the above writing, let's begin with developing a naive, simple methodology with respect to extracting a triple (aka node---predicate--->node) from a text document

1. For a given document:
2. Find the entities in the document(using spaCy)
3. Find the relationship between each entity in the document

This sounds so easy! Right? Of course not!

### 1. Find the entities in the document
How does spacy find entities though?
I'm not sure *exactly* how spaCy may do so, however, according to [this Microsoft blog post](https://learn.microsoft.com/en-us/archive/blogs/machinelearning/machine-learning-and-text-analytics), it may train some ML model with text which has its named entities annotated with a tag. Maybe the text could be chunked by sentence, and tagged with its POS tag(noun, verb, adj, etc) to provide more features to the model.<br>
I kinda lied. spaCy does so with a CNN. However, I am unsure about the more specific information regarding this process

I think it's important to begin doing some actual coding! So let's just run with the naive approach and try to extract entites from some text with spaCy to begin.

In [36]:
import spacy
from spacy import displacy
nlp_model = spacy.load("en_core_web_sm") 
"""spacy.load will "load" a pretrained
 model which can perform tokenization, 
 POS tagging, depency parsing, NER, text categorizer(sentiment, spam),
 rule based matcher(like regex), entity linker, more. In our case the 
 en_core_web_sm(english core pipieline features, trained on web, small) has 
 tok2vec, tagger, parser, attribute_ruler, lemmatizer, ner"""


our_text = "Jessica M Roberts lives in Canada"
doc = nlp_model(our_text)

Luckily spaCy makes things pretty readable, so I think this code below should speak for itself! But in case it doesn't, we are just taking a string, running it through our pretrained model for named entity extraction(NER), and mapping the entity's name to it's entity type!

In [34]:
def spacy_NER(input_str):
    annotated_str = {}
    for entity in nlp_model(input_str).ents:
        annotated_str[entity.text] = entity.label_
    return annotated_str

In [47]:
def vanilla_NER(input_str):
  sent=input_str
  ## chunk 1
  ent1 = ""
  ent2 = ""

  prv_tok_dep = ""    # dependency tag of previous token in the sentence
  prv_tok_text = ""   # previous token in the sentence

  prefix = ""
  modifier = ""

  #############################################################
  
  for tok in nlp_model(sent):
    ## chunk 2
    # if token is a punctuation mark then move on to the next token
    if tok.dep_ != "punct":
      # check: token is a compound word or not
      if tok.dep_ == "compound":
        prefix = tok.text
        # if the previous word was also a 'compound' then add the current word to it
        if prv_tok_dep == "compound":
          prefix = prv_tok_text + " "+ tok.text
      
      # check: token is a modifier or not
      if tok.dep_.endswith("mod") == True:
        modifier = tok.text
        # if the previous word was also a 'compound' then add the current word to it
        if prv_tok_dep == "compound":
          modifier = prv_tok_text + " "+ tok.text
      
      ## chunk 3
      if tok.dep_.find("subj") == True:
        ent1 = modifier +" "+ prefix + " "+ tok.text
        prefix = ""
        modifier = ""
        prv_tok_dep = ""
        prv_tok_text = ""      

      ## chunk 4
      if tok.dep_.find("obj") == True:
        ent2 = modifier +" "+ prefix +" "+ tok.text
        
      ## chunk 5  
      # update variables
      prv_tok_dep = tok.dep_
      prv_tok_text = tok.text
  #############################################################

  return [ent1.strip(), ent2.strip()]

In [51]:
vanilla_NER("Monty is actually cool")

['Monty', '']

Let's make a quick function to sort NEs into their categories

In [43]:
def sort_entities(entity_dict):
    categories = ['ORG', 'CARDINAL', 'DATE', 'GPE', 'PERSON', 'MONEY', 'PRODUCT', 'TIME', 'PERCENT', 'WORK_OF_ART', 'QUANTITY', 'NORP', 'LOC', 'EVENT', 'ORDINAL', 'FAC', 'LAW', 'LANGUAGE']
    categories = {x:0 for x in categories}
    for x in entity_dict:
        categories[entity_dict[x]]+=1
    return categories

In [44]:
sort_entities(spacy_NER(our_text))

{'ORG': 0,
 'CARDINAL': 0,
 'DATE': 0,
 'GPE': 1,
 'PERSON': 1,
 'MONEY': 0,
 'PRODUCT': 0,
 'TIME': 0,
 'PERCENT': 0,
 'WORK_OF_ART': 0,
 'QUANTITY': 0,
 'NORP': 0,
 'LOC': 0,
 'EVENT': 0,
 'ORDINAL': 0,
 'FAC': 0,
 'LAW': 0,
 'LANGUAGE': 0}

In [37]:
spacy_NER(our_text)

{'Jessica M Roberts': 'PERSON', 'Canada': 'GPE'}

It's important to recognize that if we search for an index with a spaCy token doc object, we will get a token object

In [8]:
print(doc)
print(type(doc))
print(doc[0])
print(type(doc[0]))

Jessica M Roberts lives in Canada
<class 'spacy.tokens.doc.Doc'>
Jessica
<class 'spacy.tokens.token.Token'>


There is a *head* attribute to each token within a doc object. This head(token A) is itself a token and acts is seen as the syntactic parent of token B. 

The root of a sentence is a token which matches its head!

In [9]:
heads = {x:x.head for x in doc}
print(heads)
root = [x for x in heads if heads[x]==x]
print("The root of this sentence is: ",root)

{Jessica: Roberts, M: Roberts, Roberts: lives, lives: lives, in: lives, Canada: in}
The root of this sentence is:  [lives]


### Rule Based relation extraction

Now it's important to recognize how this simple sentence works in terms of a knowledge graph's triple. (Jessica Roberts)---Lives in---->(Canada). So it's structured as `(COMPOUND NOUN NOUN-SUBJECT) (ROOT-VERB) (PREPOSITION NOUN-OBJECT)`. Assuming that with every sentence we have, there will be a subject acting on an object , maybe we can extract the relationship between the two named entities with a rule

This rule will look like:
for token in sentence:
    if compound
(COMPOUND-subject-noun)----VERB-ROOT and PREPOSITION-----(entity-object-noun)


In [44]:
def subject_object_predicate(doc, entities):
    subject = {}
    object = {}
    predicate = {}
    for token_n, token in enumerate(doc):
        if "verb" in token.pos_.lower() or "root" in token.dep_.lower():
            predicate[token.text] = token.idx
            if "prep" in doc[token_n+1].dep_:
                predicate[doc[token_n+1].text] = doc[token_n+1].idx
    
    #clean object and subject 
    #For each entity
    for entity in entities:
        #Split entity into a list
        entity_list = entity.split(" ")
        #For the object, find which words in it match the current entity
        object_match = [x for x in list(object.keys()) if x in entity_list]
        #If it's not a perfect match, then the object has compounds in it that don't exist in the actual entity it's associated with! So we should remove those matching words
        if entity_list != object_match:
            for word in object_match:
                try:
                    object.pop(word)
                except:
                    None
        #For the subject, find which words in it match the current entity
        subject_match = [x for x in list(subject.keys()) if x in entity_list]
        #If it's not a perfect match, then the subject has compounds in it that don't exist in the actual entity it's associated with! So we should remove those matching words
        if entity_list != subject_match:
            for word in subject_match:
                try:
                    object.pop(word)
                except:
                    None
    return {"subject":subject, "object":object, "predicate":predicate}
print(subject_object_predicate(nlp_model(our_text), entities = spacy_NER(our_text)))

{'subject': {'Jessica': 0, 'M': 8, 'Roberts': 10}, 'object': {'Canada': 27}, 'predicate': {'lives': 18, 'in': 24}}


Now we get a rule based triple from our sentence

In [45]:
triple = subject_object_predicate(nlp_model(our_text), 
    entities = spacy_NER(our_text))
print(f"({' '.join(list(triple['subject']))})---({' '.join(list(triple['predicate']))})--->({' '.join(list(triple['object']))})")

(Jessica M Roberts)---(lives in)--->(Canada)


Now we should make this into a pipeline!

In [55]:
def spacyRule_triple_extraction_pipeline(sentence):
    doc = nlp_model(sentence)
    entities = spacy_NER(sentence)
    print(entities)
    triple = subject_object_predicate(doc, entities)
    print(f"({' '.join(list(triple['subject']))})---({' '.join(list(triple['predicate']))})--->({' '.join(list(triple['object']))})")
    return subject_object_predicate(doc, entities)
spacyRule_triple_extraction_pipeline("Many assume, wrongly, that the reign of King Aegon I Targaryen began on the day he landed at the mouth of the Blackwater Rush, beneath the three hills where the city of King’s Landing would eventually stand.")    

{'the day': 'DATE', 'the Blackwater Rush': 'ORG', 'three': 'CARDINAL', 'King’s Landing': 'ORG'}
(reign King I Targaryen he Blackwater city)---(assume began on landed at stand)--->(King Aegon I mouth hills)


{'subject': {'reign': 31,
  'King': 40,
  'I': 51,
  'Targaryen': 53,
  'he': 80,
  'Blackwater': 110,
  'city': 161},
 'object': {'King': 40, 'Aegon': 45, 'I': 51, 'mouth': 97, 'hills': 145},
 'predicate': {'assume': 5,
  'began': 63,
  'on': 69,
  'landed': 83,
  'at': 90,
  'stand': 201}}

In [None]:
from spacy_triple_extraction import spacy_NER, sort_entities
from tqdm import tqdm
entities_list = []
c = 0
for x in tqdm(cleaned_paragraphs):
    c+=1
    if c > 100:
        break
    entities = spacy_NER(cleaned_paragraphs[x])
    for entity in entities:
        entities_list.append([entity,entities[entity]])
entities_list

In [None]:
characters = set([x[0] for x in entities_list if x[1]=="PERSON"])

In [None]:
from itertools import combinations
character_pairs = list(combinations(characters, 2))
sentences = []
for paragraph in cleaned_paragraphs:
    sentence_list = cleaned_paragraphs[paragraph].split(".")
    for sentence in sentence_list:
        sentences.append(sentence)
sentences_annotated = {}
for sentence in sentences:
    for pair in character_pairs:
        if pair[1] in sentence and pair[0] in sentence:
            sentences_annotated[sentence] = pair

In [49]:
import pandas as pd

for ind, sentence in pd.read_csv("test.csv").iterrows():
    spacyRule_triple_extraction_pipeline(str(sentence[0]))

{}
(brain that)---(is controls)--->(all functions)
{}
(It occipital)---(divided into including)--->(regions lobe occipital)
{}
(Each motor)---(is)--->(regions functions motor control)
{'billions': 'CARDINAL'}
(brain which)---(composed of communicate with)--->(billions neurons other synapses)
{'Neurotransmitters': 'ORG'}
(Neurotransmitters health)---(play associated with)--->(dopamine role communications health conditions)
{}
(Understanding)---(Understanding is)--->(structure brain development treatments disorders)
{'100': 'CARDINAL', 'the Spring 2018': 'EVENT', 'North American League of Legends Championship Series': 'ORG', 'the Cleveland Cavaliers': 'ORG'}
(Thieves North American Legends Championship Cleveland Cavaliers)---(fielded)--->(roster support organization)
{'100': 'CARDINAL', 'Before High School': 'ORG', 'American': 'NORP', 'Scott Fellows': 'PERSON', 'Nickelodeon': 'ORG', 'November 11, 2014 to': 'DATE', 'February 27, 2016': 'DATE'}
(Things High comedy television Scott that)---

So clearly this method doesn't work great. Two things need to be improved: entities are being extracted well, and the relationship method just doesn't work for complex sentences

# Thought process

I think that we should aim to extract a single relation between 

# LSTM-based NER

#### GOAL: Train a Many-Many LSTM with a word, and its respective POS tag. Each will have an output as O or {B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per','B-tim', 'I-art', 'I-eve', 'I-geo', 'I-gpe', 'I-nat', 'I-org','I-per', 'I-tim',}

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

2023-03-14 12:27:00.968254: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-14 12:27:01.320928: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-14 12:27:01.320949: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-03-14 12:27:01.369969: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-14 12:27:02.349953: W tensorflow/stream_executor/platform/de

In [3]:
data_path = "ner_dataset.csv"

data = pd.read_csv(data_path, encoding= 'unicode_escape')
# filling the first column that determines which sentence each word belongs to.
data.fillna(method = 'ffill', inplace = True)
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O


In [4]:
ready_dist_path = "ner.csv"
ready_data = pd.read_csv(ready_dist_path)
ready_data.head()

Unnamed: 0,Sentence #,Sentence,POS,Tag
0,Sentence: 1,Thousands of demonstrators have marched throug...,"['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...","['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', '..."
1,Sentence: 2,Families of soldiers killed in the conflict jo...,"['NNS', 'IN', 'NNS', 'VBN', 'IN', 'DT', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
2,Sentence: 3,They marched from the Houses of Parliament to ...,"['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
3,Sentence: 4,"Police put the number of marchers at 10,000 wh...","['NNS', 'VBD', 'DT', 'NN', 'IN', 'NNS', 'IN', ...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."
4,Sentence: 5,The protest comes on the eve of the annual con...,"['DT', 'NN', 'VBZ', 'IN', 'DT', 'NN', 'IN', 'D...","['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ..."


In [5]:
tags = data.Tag.unique()
tags

array(['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim',
       'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve',
       'I-eve', 'I-nat'], dtype=object)

In [6]:
def num_words_tags (tags, data):
    
    """This functions takes the tags we want to count and the datafram 
    and return a dict where the key is the tag and the value is the frequency
    of that tag"""
    
    tags_count = {}
    
    for tag in tags:
        len_tag = len(data[data['Tag'] == tag])
        tags_count[tag] = len_tag
    
    return tags_count

Get X and Y from the dataset!

In [8]:
X = list(ready_data["Sentence"])
Y = list(ready_data["Tag"])

In [9]:
from ast import literal_eval
Y_ready = []

for sen_tags in Y:
    Y_ready.append(literal_eval(sen_tags))

In [10]:
print(X[:1])
print(Y[:1])

['Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .']
["['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"]


Let's begin padding and tokenizing the data!

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [12]:
max([len(x.split(" ")) for x in X]) # Max sentence length

104

In [None]:
def add_data(X, input_add):
    X.append(input_add)
    return X

In [13]:
# cutoff reviews after 110 words
maxlen = 120

# consider the top 36000 words in the dataset
max_words = 46000

# tokenize each sentence in the dataset
tokenizer = Tokenizer(num_words=max_words)#Create tokenizer with len max_words
tokenizer.fit_on_texts(X) #Literally tokenize text #FILL WITH GOT TEXT!!
sequences = tokenizer.texts_to_sequences(X) #Make each sentence in X tokenized

In [14]:
word_index = tokenizer.word_index
print("Found {} unique tokens.".format(len(word_index)))
ind2word = dict([(value, key) for (key, value) in word_index.items()]) #Explicitly assign each token number to a token


word2id = word_index #Assign each word/token to a token number

Found 27953 unique tokens.


In [15]:
# dict. that map each identifier to its word
id2word = {}
for key, value in word2id.items():
    id2word[value] = key

In [16]:
# pad the sequences so that all sequences are of the same size
X_preprocessed = pad_sequences(sequences, maxlen=maxlen, padding='post')

In [17]:
# first example after tokenization and padding. 
X_preprocessed[0]
#Made up of tokens, then when the sentence stops, 0s for padding up to 120

array([ 260,    3,  997,   13, 1838,  245,  452,    4,  545,    1,  121,
          2,   60,    6,  595,    1,  861,    3,  184,   89,   21,   12,
         54,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
      dtype=int32)

In [18]:
# dict. that map each tag to its identifier
tags2id = {}
for i, tag in enumerate(tags):
    tags2id[tag] = i

In [19]:
# dict. that map each identifier to its tag
id2tag = {}
for key, value in tags2id.items():
    id2tag[value] = key

In [20]:
#This function will just convert Os to 0s, and actual labels to ints
def preprocess_tags(tags2id, Y_ready):
    
    Y_preprocessed = []
    maxlen = 120
    # for each target 
    for y in Y_ready:
        
        # place holder to store the new preprocessed tag list
        Y_place_holder = []
        
        # for each tag in rhe tag list 
        for tag in y:
            # append the id of the tag in the place holder list
            Y_place_holder.append(tags2id[tag])
        
        # find the lenght of the new preprocessed tag list 
        len_new_tag_list = len(Y_place_holder)
        # find the differance in length between the len of tag list and padded sentences
        num_O_to_add = maxlen - len_new_tag_list
        
        # add 'O's to padd the tag lists
        padded_tags = Y_place_holder + ([tags2id['O']] * num_O_to_add)
        Y_preprocessed.append(padded_tags)
        
    return Y_preprocessed

In [21]:
Y_preprocessed = preprocess_tags(tags2id, Y_ready)

In [22]:
X_preprocessed = np.asarray(X_preprocessed)
Y_preprocessed = np.asarray(Y_preprocessed)

In [23]:
# 70% of the datat will be used for training 
training_samples = 0.7
# 15% of the datat will be used for validation 
validation_samples = 0.15
# 15% of the datat will be used for testing 
testing_samples = 0.15

In [24]:
indices = np.arange(len(Y_preprocessed))

In [25]:
np.random.seed(seed=555)
np.random.shuffle(indices)

In [26]:
X_preprocessed = X_preprocessed[indices]
Y_preprocessed = Y_preprocessed[indices]

In [27]:
X_train = X_preprocessed[: int(0.7 * len(X_preprocessed))]
print("Number of training examples: {}".format(len(X_train)))


X_val = X_preprocessed[int(0.7 * len(X_preprocessed)) : int(0.7 * len(X_preprocessed)) + (int(0.15 * len(X_preprocessed)) + 1)]
print("Number of validation examples: {}".format(len(X_val)))


X_test = X_preprocessed[int(0.7 * len(X_preprocessed)) + (int(0.15 * len(X_preprocessed)) + 1) : ]
print("Number of testing examples: {}".format(len(X_test)))



Y_train = Y_preprocessed[: int(0.7 * len(X_preprocessed))]
Y_val = Y_preprocessed[int(0.7 * len(X_preprocessed)) : int(0.7 * len(X_preprocessed)) + (int(0.15 * len(X_preprocessed)) + 1)]
Y_test = Y_preprocessed[int(0.7 * len(X_preprocessed)) + (int(0.15 * len(X_preprocessed)) + 1) : ]

print("Total number of examples after shuffling and splitting: {}".format(len(X_train) + len(X_val) + len(X_test)))


Number of training examples: 33571
Number of validation examples: 7194
Number of testing examples: 7194
Total number of examples after shuffling and splitting: 47959


In [28]:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
val_dataset = tf.data.Dataset.from_tensor_slices((X_val, Y_val))
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, Y_test))

2023-03-14 12:27:48.724714: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-03-14 12:27:48.725043: W tensorflow/stream_executor/cuda/cuda_driver.cc:263] failed call to cuInit: UNKNOWN ERROR (303)
2023-03-14 12:27:48.725076: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (monty-XPS-15-9570): /proc/driver/nvidia/version does not exist
2023-03-14 12:27:48.727477: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [29]:
BATCH_SIZE = 132
SHUFFLE_BUFFER_SIZE = 132

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
val_dataset = val_dataset.batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

In [30]:
embedding_dim = 300
maxlen = 120
max_words = 36000
num_tags = len(tags)
def create_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(max_words, embedding_dim, input_length=maxlen),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=100, activation='tanh', return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=100, activation='tanh', return_sequences=True)),
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_tags, activation='softmax'))
    ])
    model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
    return model

In [31]:
model = create_model()
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 120, 300)          10800000  
                                                                 
 bidirectional (Bidirectiona  (None, 120, 200)         320800    
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 120, 200)         240800    
 nal)                                                            
                                                                 
 time_distributed (TimeDistr  (None, 120, 17)          3417      
 ibuted)                                                         
                                                                 
Total params: 11,365,017
Trainable params: 11,365,017
Non-trainable params: 0
____________________________________________

In [32]:
import os
early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
history = model.fit(train_dataset,
                    validation_data=val_dataset,
                    epochs=15, callbacks=[early_stopping_callback, cp_callback])

Epoch 1/15
Epoch 1: saving model to training_1/cp.ckpt
Epoch 2/15
  8/255 [..............................] - ETA: 2:39 - loss: 0.1055 - accuracy: 0.9723

KeyboardInterrupt: 

In [44]:
X_test[0]


array([[23070,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0]], dtype=int32)

In [45]:
def make_prediction(model, input_str, id2word, id2tag):
    
    #if preprocessed_sentence.shape() != (1, 110):
    preprocessed_sentence = pad_sequences(tokenizer.texts_to_sequences([input_str]), maxlen=maxlen, padding='post').reshape((1, 120))
    
     
    # return preprocessed sentence to its orginal form
    sentence = preprocessed_sentence[preprocessed_sentence > 0]
    word_list = []
    for word in list(sentence):
        word_list.append(id2word[word])
    orginal_sententce = ' '.join(word_list)
    
    len_orginal_sententce = len(word_list)
    
    # make prediction
    prediction = model.predict(preprocessed_sentence)
    prediction = np.argmax(prediction[0], axis=1)
    
    # return the prediction to its orginal form
    prediction = list(prediction)[ : len_orginal_sententce] 
    
    pred_tag_list = []
    for tag_id in prediction:
        pred_tag_list.append(id2tag[tag_id])
    
    return orginal_sententce,  pred_tag_list

In [50]:
make_prediction(model, "Aegon was a great man!", id2word, id2tag)



('was a great man', ['O', 'O', 'O', 'O'])

In [37]:
X_test[520]

array([   71,     1,  7450,    49,   439,   100,  1496,  9923,     5,
        2323,  3234,  3459,    83,   113,   329, 16028,   158,     6,
         300,   309,     6,  4464, 25464,  8890,    58,     3,     1,
        1460,     7,  5718,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0], dtype=int32)

In [114]:
orginal_sententce,  pred_tag_list = make_prediction(model=model,
                                                    preprocessed_sentence=X_test[520],
                                                    id2word=id2word,
                                                    id2tag=id2tag)



In [115]:
print(orginal_sententce)

tuesday the manhattan new york city prosecutor unsealed a multi count indictment against china based limmt economic and trade company and li fang wei one of the firm 's managers


In [116]:
print(pred_tag_list)

['B-tim', 'O', 'O', 'B-geo', 'I-geo', 'I-geo', 'I-geo', 'O', 'O', 'O', 'O', 'B-tim', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-per', 'I-per', 'I-per', 'O', 'O', 'O']


#TODO: Allow for any text to be processed into the format for LSTM
#TODO: Maybe feed entities and their sentences in GPT3?

# Transformer Based Entity Extraction

In [5]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from datasets import load_dataset
from collections import Counter
from conlleval import evaluate

In [6]:
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.ffn = keras.Sequential(
            [
                keras.layers.Dense(ff_dim, activation="relu"),
                keras.layers.Dense(embed_dim),
            ]
        )
        self.layernorm1 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = keras.layers.Dropout(rate)
        self.dropout2 = keras.layers.Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

In [7]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = keras.layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.pos_emb = keras.layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, inputs):
        maxlen = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        position_embeddings = self.pos_emb(positions)
        token_embeddings = self.token_emb(inputs)
        return token_embeddings + position_embeddings

In [8]:
class NERModel(keras.Model):
    def __init__(
        self, num_tags, vocab_size, maxlen=128, embed_dim=32, num_heads=2, ff_dim=32
    ):
        super().__init__()
        self.embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
        self.transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
        self.dropout1 = layers.Dropout(0.1)
        self.ff = layers.Dense(ff_dim, activation="relu")
        self.dropout2 = layers.Dropout(0.1)
        self.ff_final = layers.Dense(num_tags, activation="softmax")

    def call(self, inputs, training=False):
        x = self.embedding_layer(inputs)
        x = self.transformer_block(x)
        x = self.dropout1(x, training=training)
        x = self.ff(x)
        x = self.dropout2(x, training=training)
        x = self.ff_final(x)
        return x

In [9]:
conll_data = load_dataset("conll2003")

Found cached dataset conll2003 (/home/monty/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98)
100%|██████████| 3/3 [00:00<00:00, 142.95it/s]


In [10]:
def export_to_file(export_file_path, data):
    with open(export_file_path, "w") as f:
        for record in data:
            ner_tags = record["ner_tags"]
            tokens = record["tokens"]
            if len(tokens) > 0:
                f.write(
                    str(len(tokens))
                    + "\t"
                    + "\t".join(tokens)
                    + "\t"
                    + "\t".join(map(str, ner_tags))
                    + "\n"
                )



os.mkdir("data")
export_to_file("./data/conll_train.txt", conll_data["train"])
export_to_file("./data/conll_val.txt", conll_data["validation"])

In [27]:
conll_data["train"]

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
})

In [12]:
def make_tag_lookup_table():
    iob_labels = ["B", "I"]
    ner_labels = ["PER", "ORG", "LOC", "MISC"]
    all_labels = [(label1, label2) for label2 in ner_labels for label1 in iob_labels]
    all_labels = ["-".join([a, b]) for a, b in all_labels]
    all_labels = ["[PAD]", "O"] + all_labels
    return dict(zip(range(0, len(all_labels) + 1), all_labels))


mapping = make_tag_lookup_table()
print(mapping)

{0: '[PAD]', 1: 'O', 2: 'B-PER', 3: 'I-PER', 4: 'B-ORG', 5: 'I-ORG', 6: 'B-LOC', 7: 'I-LOC', 8: 'B-MISC', 9: 'I-MISC'}


In [13]:
all_tokens = sum(conll_data["train"]["tokens"], [])
all_tokens_array = np.array(list(map(str.lower, all_tokens)))

counter = Counter(all_tokens_array)
print(len(counter))

num_tags = len(mapping)
vocab_size = 20000

# We only take (vocab_size - 2) most commons words from the training data since
# the `StringLookup` class uses 2 additional tokens - one denoting an unknown
# token and another one denoting a masking token
vocabulary = [token for token, count in counter.most_common(vocab_size - 2)]

# The StringLook class will convert tokens to token IDs
lookup_layer = keras.layers.StringLookup(
    vocabulary=vocabulary
)

21009


2023-03-14 00:48:59.351951: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-03-14 00:48:59.351974: W tensorflow/stream_executor/cuda/cuda_driver.cc:263] failed call to cuInit: UNKNOWN ERROR (303)
2023-03-14 00:48:59.351991: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (monty-XPS-15-9570): /proc/driver/nvidia/version does not exist
2023-03-14 00:48:59.353090: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [14]:
train_data = tf.data.TextLineDataset("./data/conll_train.txt")
val_data = tf.data.TextLineDataset("./data/conll_val.txt")

In [22]:
type(train_data)

tensorflow.python.data.ops.readers.TextLineDatasetV2

In [15]:
print(list(train_data.take(1).as_numpy_iterator()))


[b'9\tEU\trejects\tGerman\tcall\tto\tboycott\tBritish\tlamb\t.\t3\t0\t7\t0\t0\t0\t7\t0\t0']


In [17]:
def map_record_to_training_data(record):
    record = tf.strings.split(record, sep="\t")
    length = tf.strings.to_number(record[0], out_type=tf.int32)
    tokens = record[1 : length + 1]
    tags = record[length + 1 :]
    tags = tf.strings.to_number(tags, out_type=tf.int64)
    tags += 1
    return tokens, tags


def lowercase_and_convert_to_ids(tokens):
    tokens = tf.strings.lower(tokens)
    return lookup_layer(tokens)


# We use `padded_batch` here because each record in the dataset has a
# different length.
batch_size = 32
train_dataset = (
    train_data.map(map_record_to_training_data)
    .map(lambda x, y: (lowercase_and_convert_to_ids(x), y))
    .padded_batch(batch_size)
)
val_dataset = (
    val_data.map(map_record_to_training_data)
    .map(lambda x, y: (lowercase_and_convert_to_ids(x), y))
    .padded_batch(batch_size)
)

ner_model = NERModel(num_tags, vocab_size, embed_dim=32, num_heads=4, ff_dim=64)

In [18]:
class CustomNonPaddingTokenLoss(keras.losses.Loss):
    def __init__(self, name="custom_ner_loss"):
        super().__init__(name=name)

    def call(self, y_true, y_pred):
        loss_fn = keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=keras.losses.Reduction.NONE
        )
        loss = loss_fn(y_true, y_pred)
        mask = tf.cast((y_true > 0), dtype=tf.float32)
        loss = loss * mask
        return tf.reduce_sum(loss) / tf.reduce_sum(mask)


loss = CustomNonPaddingTokenLoss()

In [28]:
ner_model.compile(optimizer="adam", loss=loss)
ner_model.fit(train_dataset, epochs=30)


def tokenize_and_convert_to_ids(text):
    tokens = text.split()
    return lowercase_and_convert_to_ids(tokens)


# Sample inference using the trained model
sample_input = tokenize_and_convert_to_ids(
    "eu rejects german call to boycott british lamb"
)
sample_input = tf.reshape(sample_input, shape=[1, -1])
print(sample_input)

output = ner_model.predict(sample_input)
prediction = np.argmax(output, axis=-1)[0]
prediction = [mapping[i] for i in prediction]

# eu -> B-ORG, german -> B-MISC, british -> B-MISC
print(prediction)

Epoch 1/30


  output, from_logits = _get_logits(


Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30

KeyboardInterrupt: 

Ok, So this code works! But its not good with Game of Thrones stuff!!

Since we were given this list by Dr. Hu, lets use it!
King Jaehaerys

Queen Alysanne

Prince Aemon

Prince Baelon

Princess Rhaenys

Corlys Velaryon

House Velaryon

Septon Barth

Grand Maester Elysar

Queen Visenya

King Maegor

Princess Rhaena

Lady Jocelyn of House Baratheon

Lord Boremund of Storm’s End

Septa Maegelle

House Stark

Eddard Stark

Lady Catelyn

Robb Stark

But we gotta make this into a pipeline...

In [29]:
# Sample inference using the trained model
input_sentence = "he maesters of the citadel who keep the histories of westeros have used aegon’s conquest as their touchstone for the past three hundred years"
sample_input = tokenize_and_convert_to_ids(input_sentence)
sample_input = tf.reshape(sample_input, shape=[1, -1])
print(sample_input)

output = ner_model.predict(sample_input)
prediction = np.argmax(output, axis=-1)[0]
prediction = [mapping[i] for i in prediction]
prediction = {word:prediction[t] for t, word in enumerate(input_sentence.split())}
print(prediction)

tf.Tensor(
[[   25     0     4     1     0    45   897     1     0     4     0    42
    747     0 10744    31    54 13467    15     1   572    84  6496   138]], shape=(1, 24), dtype=int64)
{'he': 'O', 'maesters': 'O', 'of': 'O', 'the': 'O', 'citadel': 'O', 'who': 'O', 'keep': 'O', 'histories': 'O', 'westeros': 'O', 'have': 'O', 'used': 'O', 'aegon’s': 'O', 'conquest': 'O', 'as': 'O', 'their': 'O', 'touchstone': 'O', 'for': 'O', 'past': 'O', 'three': 'O', 'hundred': 'O', 'years': 'O'}


In [98]:
print(len(sentence))

1


# GPT Based Knowledge graph

Okay let's fine tune a model to complete a sentence with it's entities inside!