# Exercise 2: Build your own knowledge graph with NLP

## Don't worry!  An NLP workflow will be included here!

There is a lot going on in this notebook.  We will walk through each step.

There are a lot of functions that get the search data into a format that can be uploaded into the graph.  As with any NLP project, there is a lot of preprocessing that needs to happen before you even worry about creating the graph.  And as we discussed, there is no proverbial "silver bullet" when it comes to NLP.  

For this exercise we will be creating the (subject, verb, object) triples.  From a broad brush-strokes perspective, this is what our NLP workflow will look like:

<img src="images/nlp_workflow.png" width="600">

## `Spacy`

To achieve the above, we will use the NLP package `spacy`.  It has a lot of great, basic functionality, especially when we are talking about detecting the parts of speech and identifying the ROOT.  But of course, you can choose to use anything you are comfortable with!


### A note about language models and configuring `spacy`

We will make light use of the word vectors that ship with `spacy`.  There are plenty of pretrained models available from `spacy`.  We are going to use the medium-sized English model trained on web text, `en_core_web_md`, which you can read about [here](https://spacy.io/models/en#en_core_web_md). The basic model is their small core model, also trained on web text: `en_core_web_sm`, which provides good, basic functionality with a small download size (< 20 MB). However, one drawback of this basic model is that it doesn't have full word vectors. Instead, it comes with context-sensitive tensors. You can still do things like text similarity with it, but if you want good word vectors from `spacy`, you should use a larger model such as `en_core_web_md` or `en_core_web_lg`, since similarity results from the small models tend to be less accurate. You can also use a variety of third-party models, but that is beyond the scope of this workshop. Again, choose the model that works best with your setup.  The model has to be downloaded and loaded before we can do any NLP.

In [1]:
import json
import re
import urllib.parse
import urllib.request
from pprint import pprint
import time
from tqdm import tqdm

from py2neo import Node, Graph, Relationship, NodeMatcher
from py2neo.bulk import merge_nodes

import numpy as np
import pandas as pd
import wikipedia
from sklearn.metrics.pairwise import cosine_similarity

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

print(spacy.__version__)

3.1.2


In [2]:
!python -m spacy download en_core_web_md

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## The Google knowledge graph API

We will be building out the SVOs in our graph by querying Google for node properties.  There are obviously several ways we can build out the node properties, but I decided to use "free" queries to the Google Knowledge Graph API (the free quota allows up to 100,000 read calls per day per project), just to give you a view into one of the many ways you could create your own knowledge graph.

### Getting an API key

We will need to create an API key to query the Google Knowledge Graph.  Visit [this link](https://developers.google.com/knowledge-graph/how-tos/authorizing) for instructions on how to do it.  You will see the following page:

<img src="images/google_kg_api1.png" width="600">

Click on "Credentials page" and then you will see

<img src="images/google_kg_api2.png" width="600">

From here you will click on "+ CREATE CREDENTIALS".  From there, copy the key into the following cell:

In [4]:
api_key = 'YOUR_API_KEY'
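
A key pasted into a notebook is easy to leak when the notebook is shared.  As one possible alternative, here is a minimal sketch that reads the key from an environment variable instead.  The variable name `GOOGLE_KG_API_KEY` and the helper `load_api_key` are assumptions made for this example and are not part of the Google API.

```python
import os


def load_api_key(var_name='GOOGLE_KG_API_KEY'):
    # var_name is a hypothetical environment variable chosen for this sketch
    key = os.environ.get(var_name, '')
    if not key:
        raise RuntimeError(f'Set the {var_name} environment variable first')
    return key
```

With the variable set in your shell before starting Jupyter, `api_key = load_api_key()` would take the place of the hard-coded assignment.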

## Let's get going with the NLP!

Here we will do a few things.  First, we are going to provide a list of word dependencies that we will attribute to subjects, verbs, and objects.  We will have `spacy` look for these for populating our SVO triples.

Next, we initialize `spacy` using our model.  We will also add a component to the normal pipeline, `merge_noun_chunks`, which is handy for keeping names of things together (example: treat the noun as "barack obama" rather than "barack" and "obama" separately).  A second pipeline without this component, `non_nc`, is loaded for the stop word removal later on.

In [8]:
SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
VERBS = ['ROOT', 'advcl']
OBJECTS = ["dobj", "dative", "attr", "oprd", 'pobj']
ENTITY_LABELS = ['PERSON', 'NORP', 'GPE', 'ORG', 'FAC', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART']

non_nc = spacy.load('en_core_web_md')

nlp = spacy.load('en_core_web_md')
nlp.add_pipe('merge_noun_chunks')

<function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

### Helper function for querying the Google Knowledge Graph

Note that GKG can return several results for a single query (example: what would happen if you queried "washington"...are you looking for a city, state, or person?).  The helper returns up to 10 results by default, but below we call it with `limit=1` and use only the first one.  You could add all of them to the graph with some minor rewrites of the NLP pipeline below.

In [33]:
def query_google(query, api_key, limit=10, indent=True, return_lists=True):
    
    text_ls = []
    node_label_ls = []
    url_ls = []
    
    params = {
        'query': query,
        'limit': limit,
        'indent': indent,
        'key': api_key,
    }   
    
    service_url = 'https://kgsearch.googleapis.com/v1/entities:search'
    url = service_url + '?' + urllib.parse.urlencode(params)
    response = json.loads(urllib.request.urlopen(url).read())
    
    if return_lists:
        for element in response['itemListElement']:

            # Fall back to '' whenever the API omits a field
            result = element.get('result', {})
            description = result.get('detailedDescription', {})

            node_label_ls.append(result.get('@type', ''))
            text_ls.append(description.get('articleBody', ''))
            url_ls.append(description.get('url', ''))
                
        return text_ls, node_label_ls, url_ls
    
    else:
        return response
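
To see what the helper does without spending an API call, here is a small offline sketch.  It builds the query string the same way `query_google` does and pulls the three fields out of a hand-written response.  The sample response values and the function name `extract_fields` are made up for illustration; only the key names (`itemListElement`, `result`, `@type`, `detailedDescription`, `articleBody`, `url`) are taken from the helper above.

```python
import json
from urllib.parse import urlencode

# Hand-written response in the shape query_google reads; the values are
# made up for illustration and were not returned by the real API.
sample_response = json.loads('''
{
  "itemListElement": [
    {"result": {"@type": ["Person", "Thing"],
                "detailedDescription": {"articleBody": "Example description.",
                                        "url": "https://example.org/a"}}},
    {"result": {"@type": ["Thing"]}}
  ]
}
''')


def extract_fields(response):
    # Missing fields fall back to '', mirroring the fallback in query_google
    text_ls, node_label_ls, url_ls = [], [], []
    for element in response.get('itemListElement', []):
        result = element.get('result', {})
        description = result.get('detailedDescription', {})
        node_label_ls.append(result.get('@type', ''))
        text_ls.append(description.get('articleBody', ''))
        url_ls.append(description.get('url', ''))
    return text_ls, node_label_ls, url_ls


# The query string is assembled with urlencode, as in query_google
query_string = urlencode({'query': 'washington', 'limit': 10,
                          'indent': True, 'key': 'YOUR_API_KEY'})
print(query_string)

text_ls, node_label_ls, url_ls = extract_fields(sample_response)
print(text_ls)
print(node_label_ls)
print(url_ls)
```

The second element has no `detailedDescription`, so its description and URL come back as empty strings while its node labels are still returned.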

## Here is the main chunk of the NLP

The important function here is the last one that drives the whole thing: `create_svo_triples`.  Here is what it does:

1. Remove special characters (`remove_special_characters`)
2. Remove stop words and punctuation (`remove_stop_words_and_punct`)
3. Remove duplicates (`remove_duplicates`)
4. Remove dates (`remove_dates`)

Duplicates crop up many times during this process and we will be battling them a lot!

Then, we can get to the heart of the matter, `create_svo_triples`, which returns a list of tuples of all SVO triples in the text, like

```
[('oh bah mə', 'be', 'american politician'),
 ('oh bah mə', 'be', '44th president'),
 ('oh bah mə', 'be', 'united states')]
 ```
 
One key thing about this function is how we are adding the verbs.  There is no right or wrong way to do this.  What is implemented here pairs each object with the verb that is closest to it in the text, measured as the difference between their character offsets (`token.idx`).  Definitely experiment with this and see if you can come up with a better way to do this!
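
To make the nearest-verb pairing concrete, here is a minimal sketch on hand-written data.  The `(text, offset)` pairs imitate what `create_svo_lists` returns, but the offsets are invented for this example rather than parsed by `spacy`, and `nearest_verb` is a hypothetical helper name.

```python
# (text, character offset) pairs in the shape create_svo_lists returns;
# the offsets are hand-written for illustration, not parsed by spaCy.
verb_ls = [('be', 13), ('serve', 60)]
object_ls = [('american politician', 20), ('44th president', 75)]


def nearest_verb(obj, verb_ls):
    # Pick the verb whose character offset is closest to the object's offset;
    # on a tie, min() keeps the verb that appears first in verb_ls
    return min(verb_ls, key=lambda v: abs(obj[1] - v[1]))[0]


for obj in object_ls:
    print(obj[0], '->', nearest_verb(obj, verb_ls))
```

Because the rule only looks at distance in the text and ignores the dependency tree, a verb from a neighboring clause can win.  That is one place to start if you want to experiment with a better pairing.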

In [34]:
def remove_special_characters(text):
    
    regex = re.compile(r'[\n\r\t]')
    clean_text = regex.sub(" ", text)
    
    return clean_text


def remove_stop_words_and_punct(text, print_text=False):
    
    result_ls = []
    rsw_doc = non_nc(text)
    
    for token in rsw_doc:
        if print_text:
            print(token, token.is_stop)
            print('--------------')
        if not token.is_stop and not token.is_punct:
            result_ls.append(str(token))
    
    result_str = ' '.join(result_ls)

    return result_str


def create_svo_lists(doc, print_lists):
    
    subject_ls = []
    verb_ls = []
    object_ls = []

    for token in doc:
        if token.dep_ in SUBJECTS:
            subject_ls.append((token.lower_, token.idx))
        elif token.dep_ in VERBS:
            verb_ls.append((token.lemma_, token.idx))
        elif token.dep_ in OBJECTS:
            object_ls.append((token.lower_, token.idx))

    if print_lists:
        print('SUBJECTS: ', subject_ls)
        print('VERBS: ', verb_ls)
        print('OBJECTS: ', object_ls)
    
    return subject_ls, verb_ls, object_ls


def remove_duplicates(tup, tup_posn):
    
    check_val = set()
    result = []
    
    for i in tup:
        if i[tup_posn] not in check_val:
            result.append(i)
            check_val.add(i[tup_posn])
            
    return result


def remove_dates(tup_ls):
    
    clean_tup_ls = []
    for entry in tup_ls:
        if not entry[2].isdigit():
            clean_tup_ls.append(entry)
    return clean_tup_ls


def create_svo_triples(text, print_lists=False):
    
    clean_text = remove_special_characters(text)
    doc = nlp(clean_text)
    subject_ls, verb_ls, object_ls = create_svo_lists(doc, print_lists=print_lists)
    
    graph_tup_ls = []
    dedup_tup_ls = []
    clean_tup_ls = []
    
    for subj in subject_ls: 
        for obj in object_ls:
            
            dist_ls = []
            
            for v in verb_ls:
                
                # Assemble a list of distances between each object and each verb
                dist_ls.append(abs(obj[1] - v[1]))
                
            # Get the index of the verb with the smallest distance to the object 
            # and return that verb
            index_min = min(range(len(dist_ls)), key=dist_ls.__getitem__)
            
            # Remove stop words from subjects and objects.  Note that we do this a bit
            # later down in the process to allow for proper sentence recognition.

            no_sw_subj = remove_stop_words_and_punct(subj[0])
            no_sw_obj = remove_stop_words_and_punct(obj[0])
            
            # Add entries to the graph iff neither subject nor object is blank
            if no_sw_subj and no_sw_obj:
                tup = (no_sw_subj, verb_ls[index_min][0], no_sw_obj)
                graph_tup_ls.append(tup)
        
        #clean_tup_ls = remove_dates(graph_tup_ls)
    
    dedup_tup_ls = remove_duplicates(graph_tup_ls, 2)
    clean_tup_ls = remove_dates(dedup_tup_ls)
    
    return clean_tup_ls
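
To see what the two cleaning helpers at the end of `create_svo_triples` do, here is a small sketch on hand-written triples.  The helpers are repeated so the block runs on its own (`remove_dates` is condensed into a list comprehension), and the triples are invented for illustration.

```python
def remove_duplicates(tup, tup_posn):
    # Keep the first tuple seen for each distinct value at position tup_posn
    check_val = set()
    result = []
    for i in tup:
        if i[tup_posn] not in check_val:
            result.append(i)
            check_val.add(i[tup_posn])
    return result


def remove_dates(tup_ls):
    # Drop triples whose object consists only of digits, such as a bare year
    return [entry for entry in tup_ls if not entry[2].isdigit()]


# Hand-written triples for illustration
triples = [
    ('obama', 'be', '44th president'),
    ('obama', 'serve', '44th president'),   # same object as the first triple
    ('obama', 'bear', '1961'),              # digit-only object
    ('obama', 'be', 'american politician'),
]

cleaned = remove_dates(remove_duplicates(triples, 2))
print(cleaned)
```

Note that deduplicating on position 2 keeps only the first triple for each object, so an object that appears with two different verbs keeps just one of them.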

## Now we are going to make this data more interesting!

We have a whole ton of objects within our data.  Each sentence might have several linked to the subject.  Let's get some properties of each object, through our `query_google` helper, to get their Google description, node labels, and any URL they might have.  Not all objects will have this information, but we can always add the ones that do as node properties in our graph.

This code block also does a few other things.  For example, `add_layer` will go out and create SVOs using the above `create_svo_triples` for each object within our starting list, which adds more nodes and edges to our graph.  We do a bit more cleaning by removing tuples where the subject and object are equal (`subj_equals_obj`...it does happen).  Lastly, we are going to take any descriptions associated with the objects and turn them into the `spacy` word vectors (`create_word_vectors`), under the idea that they might be useful down the road.  (Those without descriptions get a random 300-dimensional vector with values drawn uniformly between -1 and 1.)

In [35]:
def get_obj_properties(tup_ls):
    
    init_obj_tup_ls = []
    
    for tup in tup_ls:

        try:
            text, node_label_ls, url = query_google(tup[2], api_key, limit=1)
            new_tup = (tup[0], tup[1], tup[2], text[0], node_label_ls[0], url[0])
        except Exception:
            # Failed request or no results: keep the triple with empty properties
            new_tup = (tup[0], tup[1], tup[2], [], [], [])
        
        init_obj_tup_ls.append(new_tup)
        
    return init_obj_tup_ls


def add_layer(tup_ls):

    svo_tup_ls = []
    
    for tup in tup_ls:
        
        if tup[3]:
            svo_tup = create_svo_triples(tup[3])
            svo_tup_ls.extend(svo_tup)
        else:
            continue
    
    return get_obj_properties(svo_tup_ls)
        

def subj_equals_obj(tup_ls):
    
    new_tup_ls = []
    
    for tup in tup_ls:
        if tup[0] != tup[2]:
            new_tup_ls.append((tup[0], tup[1], tup[2], tup[3], tup[4], tup[5]))
            
    return new_tup_ls


def check_for_string_labels(tup_ls):
    # This is for an edge case where the object does not get fully populated,
    # resulting in the node labels being assigned to a string instead of a list.
    # It keeps only the tuples whose node labels (position 2) are a list and is
    # applied to the node list in node_tuple_creation.
    
    clean_tup_ls = []
    
    for el in tup_ls:
        if isinstance(el[2], list):
            clean_tup_ls.append(el)
            
    return clean_tup_ls


def create_word_vectors(tup_ls):

    new_tup_ls = []
    
    for tup in tup_ls:
        if tup[3]:
            doc = nlp(tup[3])
            new_tup = (tup[0], tup[1], tup[2], tup[3], tup[4], tup[5], doc.vector)
        else:
            new_tup = (tup[0], tup[1], tup[2], tup[3], tup[4], tup[5], np.random.uniform(low=-1.0, high=1.0, size=(300,)))
        new_tup_ls.append(new_tup)
        
    return new_tup_ls

### Just some more deduping and reformatting data into what Neo4j will accept

In [41]:
def dedup(tup_ls):
    
    visited = set()
    output_ls = []
    
    for tup in tup_ls:
        if tup[0] not in visited:
            visited.add(tup[0])
            output_ls.append((tup[0], tup[1], tup[2], tup[3], tup[4]))
            
    return output_ls


def convert_vec_to_ls(tup_ls):
    
    vec_to_ls_tup = []
    
    for el in tup_ls:
        vec_ls = [float(v) for v in el[4]]
        tup = (el[0], el[1], el[2], el[3], vec_ls)
        vec_to_ls_tup.append(tup)
        
    return vec_to_ls_tup

## Adding the tuples to the Neo4j database

These next two functions use `py2neo` to add the nodes and edges, in the form of node and edge lists, to the database.  Adding the nodes (`add_nodes`) is very straightforward and uses the [bulk data operations](https://py2neo.org/2021.0/bulk/index.html) in `py2neo` to add them efficiently.

Adding edges is a bit more complicated.  If they all had the same edge label we could use the bulk loader to upload them.  However, since there are many different verbs in the SVOs, we need to add them individually. 

In [42]:
def add_nodes(tup_ls):   

    keys = ['name', 'description', 'node_labels', 'url', 'word_vec']
    merge_nodes(graph.auto(), tup_ls, ('Node', 'name'), keys=keys)
    print('Number of nodes in graph: ', graph.nodes.match('Node').count())
    
    return


def add_edges(edge_ls):
    
    edge_dc = {} 
    
    # Group the full edge tuples by verb
    # Result: {verb1: [(sub1, verb1, obj1, ...), (sub2, verb1, obj2, ...), ...],
    #          verb2: [(sub3, verb2, obj3, ...), (sub4, verb2, obj4, ...), ...]}
    
    for tup in edge_ls: 
        # Keep the whole tuple: creating a missing target node below needs el[4] to el[6]
        edge_dc.setdefault(tup[1], []).append(tup)
    
    for edge_labels, tup_ls in tqdm(edge_dc.items()):   # edge label (verb), list of tuples
        
        tx = graph.begin()
        
        for el in tup_ls:
            source_node = nodes_matcher.match(name=el[0]).first()
            target_node = nodes_matcher.match(name=el[2]).first()
            if not source_node:
                source_node = Node('Node', name=el[0])
                tx.create(source_node)
            if not target_node:
                try:
                    # Convert the NumPy vector to a plain list of floats before storing it
                    word_vec = [float(v) for v in el[6]]
                    target_node = Node('Node', name=el[2], node_labels=el[4], url=el[5], word_vec=word_vec)
                    tx.create(target_node)
                except Exception:
                    continue
            try:
                rel = Relationship(source_node, edge_labels, target_node)
            except:
                continue
            tx.create(rel)
        tx.commit()
    
    return
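
The grouping step at the top of `add_edges` is the part that makes the per-verb upload possible.  Here is a minimal sketch of just that step, using `collections.defaultdict` and hand-written triples (the data is invented for illustration and no database is involved).

```python
from collections import defaultdict

# Hand-written edge tuples: (subject, verb, object)
edge_ls = [
    ('barack obama', 'be', 'american politician'),
    ('barack obama', 'serve', 'illinois'),
    ('michelle obama', 'be', 'american attorney'),
]

# Map each verb to the (subject, object) pairs that use it
edges_by_verb = defaultdict(list)
for subj, verb, obj in edge_ls:
    edges_by_verb[verb].append((subj, obj))

for verb, pairs in edges_by_verb.items():
    # In add_edges, each of these groups is written in its own transaction
    print(verb, pairs)
```

Each key becomes one relationship type in the graph, which is why the number of iterations in the progress bar equals the number of distinct verbs.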

## Creating our node and edge lists: where the NLP pipeline is put to work

We will be creating our node and edge lists as tuples.  What is important here (since all of the functions are described above) is the output of each function.

`edge_tuple_creation` takes the raw text from the Wikipedia query below and creates the edge list, which is a list of tuples of the format:

```
(subject, verb, object, object description, node label list, url, word vector of description)
```

`node_tuple_creation`, on the other hand, reuses some of the fields from the edge list above.  It takes just the objects (with the search subject added as the first entry) and creates the node list, a list of tuples of the format

```
(object, object description, node label list, url, word vector of description)
```

In this example, we are going to do one round of `add_layer` to make our edge list bigger, but technically you can do this as many times as you want!

In [31]:
def edge_tuple_creation(text):
    
    initial_tup_ls = create_svo_triples(text)
    init_obj_tup_ls = get_obj_properties(initial_tup_ls)
    new_layer_ls = add_layer(init_obj_tup_ls)
    starter_edge_ls = init_obj_tup_ls + new_layer_ls
    edge_ls = subj_equals_obj(starter_edge_ls)
    edges_word_vec_ls = create_word_vectors(edge_ls)
    
    return edges_word_vec_ls


def node_tuple_creation(edges_word_vec_ls):
    
    orig_node_tup_ls = [(edges_word_vec_ls[0][0], '', ['Subject'], '', np.random.uniform(low=-1.0, high=1.0, size=(300,)))]
    obj_node_tup_ls = [(tup[2], tup[3], tup[4], tup[5], tup[6]) for tup in edges_word_vec_ls]
    full_node_tup_ls = orig_node_tup_ls + obj_node_tup_ls
    cleaned_node_tup_ls = check_for_string_labels(full_node_tup_ls)
    #dedup_node_tup_ls = dedup(cleaned_node_tup_ls)
    dedup_node_tup_ls = cleaned_node_tup_ls
    node_tup_ls = convert_vec_to_ls(dedup_node_tup_ls)
    
    return node_tup_ls
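
The relationship between the two tuple formats is just a slice: a node tuple is the last five fields of an edge tuple.  Here is a minimal sketch on one hand-written edge tuple.  The values are made up, the vector is shortened to three entries, and `edge_to_node` is a hypothetical helper name that does the same thing as the list comprehension in `node_tuple_creation`.

```python
# A hand-written edge tuple in the format
# (subject, verb, object, description, node labels, url, word vector).
# The values are made up and the vector is shortened to three entries.
edge = ('barack obama', 'be', 'american politician',
        'Example description.', ['Thing'], 'https://example.org/a',
        [0.1, -0.2, 0.3])


def edge_to_node(edge_tup):
    # Drop subject and verb; the object becomes the node name
    return edge_tup[2:7]


name, description, node_labels, url, word_vec = edge_to_node(edge)
print(name, node_labels, len(word_vec))
```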

## Time to populate the graph!

Be sure you start up a "Blank Graph Data Science" [Sandbox instance](https://sandbox.neo4j.com/) and get its URL and password (the user name is always `neo4j`).  We will use `py2neo` to make the connection and then instantiate the [`NodeMatcher`](https://py2neo.org/v5/matching.html#node-matching), which is used to locate nodes efficiently within the graph.

In [47]:
url = ''
pwd = ''

graph = Graph(url, auth=('neo4j', pwd))
nodes_matcher = NodeMatcher(graph)

## Getting the text and assembling the tuple lists

For fun we are going to use two related searches.  You could do only one if you wanted, but I will tell you in class why I used these two different search terms.  (Spoiler alert: do you remember that bit about having SMEs involved in your graph???)

In [43]:
%%time
barack_text = wikipedia.summary('barack obama')
barack_edges_word_vec_ls = edge_tuple_creation(barack_text)
barack_node_tup_ls = node_tuple_creation(barack_edges_word_vec_ls)

michelle_text = wikipedia.summary('michelle obama')
michelle_edges_word_vec_ls = edge_tuple_creation(michelle_text)
michelle_node_tup_ls = node_tuple_creation(michelle_edges_word_vec_ls)

CPU times: user 1min 16s, sys: 591 ms, total: 1min 16s
Wall time: 2min 41s


In [44]:
full_node_ls = barack_node_tup_ls + michelle_node_tup_ls
full_edge_ls = barack_edges_word_vec_ls + michelle_edges_word_vec_ls
full_dedup_node_tup_ls = dedup(full_node_ls)
print(len(full_node_ls), len(full_dedup_node_tup_ls))

866 665


In [48]:
add_nodes(full_dedup_node_tup_ls)
add_edges(full_edge_ls)

Number of nodes in graph:  665


100%|███████████████████████████████████████████| 92/92 [03:25<00:00,  2.23s/it]
