# Exercise 4 Demonstration: Assembling your own graph from Wikidata

## Introduction

We are now going to explore a different approach to creating a knowledge graph.  As in the previous exercises, we start with unstructured text from Wikipedia and extract the named entities from that text.  This time, however, we then query Wikidata for the Q-values (entities) and P-values (properties) that connect them.  This is the code that was used to assemble the graph from Exercise 3.  There is also a separate Exercise 4 notebook for you to use should you wish to assemble your own graph from your own search terms.

### Warning!

Working with Wikidata can be a fragile process.  I have sometimes run the following cells and gotten errors or unexpected/incomplete results.  This is unfortunately the nature of data science sometimes.  Therefore, in this notebook I am including the outputs of each cell so you can see how it is _supposed_ to run.  If you run into this kind of behavior, clear and restart the kernel, go for a cup of coffee, and try again in a bit.  If worse comes to worst, we will always be able to use the pre-populated graph from Exercise 3 for the future exercises!

## Creating a bot for querying Wikidata

We will be using the package `pywikibot` to query Wikidata for the named entities.  Working with Wikidata requires a few things.  First, you need to create an account at [wikidata.org](https://wikidata.org).  From there, navigate to [this webpage](https://heardlibrary.github.io/digital-scholarship/host/wikidata/bot/) where you will create the bot:

<img src='images/wikidata1.png' width='600'>

You will then want to log into the [Wikidata Test Instance](https://test.wikidata.org/wiki/Wikidata:Main_Page) using the login credentials you just created.

<img src='images/wikidata2.png' width='600'>

On the main page, click the link for "Special pages."

<img src='images/wikidata3.png' width='600'>

Scroll down to "Users and Rights" and select "Bot Password".

<img src='images/wikidata4.png' width='600'>

Create a new bot password (it is customary to give it a name that ends with `_bot`).

<img src='images/wikidata5.png' width='600'>

You will now get a token that looks something like this:

<img src='images/wikidata6.png' width='600'>

For token security, save this token in a file in `notebooks/` called `.wiki_api_token`.  
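
A minimal sketch of reading the token back in, assuming the file contains nothing but the token text.  The helper name `read_token` is hypothetical (it is not part of `pywikibot`); it strips the trailing newline that most editors add when saving.

```python
from pathlib import Path


def read_token(path=".wiki_api_token"):
    """Return the saved token with surrounding whitespace removed."""
    # read_text() opens and closes the file for us
    return Path(path).read_text(encoding="utf-8").strip()
```

If the notebooks live in a git repository, adding `.wiki_api_token` to `.gitignore` keeps the token out of version control.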

In [1]:
%matplotlib inline

import json
import re
import urllib
from pprint import pprint
import time
from tqdm import tqdm

from neo4j import GraphDatabase

import numpy as np
import pandas as pd
import wikipedia

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

from pywikibot.data import api
import pywikibot

print(spacy.__version__)
print(pywikibot.__version__)
print(wikipedia.__version__)

3.1.2
6.5.0
(1, 4, 0)


## Check connection to Wikidata

If the following cell runs without error, you are good to start.

In [2]:
site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "Douglas Adams")
item = pywikibot.ItemPage.fromPage(page)

## Let's start with our usual NLP pipeline

In [3]:
non_nc = spacy.load('en_core_web_md')

nlp = spacy.load('en_core_web_md')
nlp.add_pipe('merge_noun_chunks')

print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'merge_noun_chunks']


In [4]:
text = wikipedia.summary('barack obama')
doc = nlp(text)
text

'Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, Obama was the first African-American  president of the United States. He previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004.\nObama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004, when he ran for the U.S. Senate. Obama received national attention in 2004 with 

## Let's get rid of the dates and do some simple text cleaning...

In [5]:
ent_ignore_ls = ['DATE']
ner_list = []

for el in doc.ents:
    # Skip ignored entity types (dates) and entities we have already seen
    if el.label_ not in ent_ignore_ls and el.text not in ner_list:
        ner_list.append(el.text)

ner_list[0:5]

['Barack Hussein Obama II',
 'the United States',
 'the Democratic Party',
 'Obama',
 'Illinois']

In [6]:
def remove_special_characters(text):
    
    regex = re.compile(r'[\n\r\t]')
    clean_text = regex.sub(" ", text)
    
    return clean_text


def remove_stop_words_and_punct(text, print_text=False):
    
    result_ls = []
    rsw_doc = non_nc(text)
    
    for token in rsw_doc:
        if print_text:
            print(token, token.is_stop)
            print('--------------')
        if not token.is_stop and not token.is_punct and not token.is_space:
            result_ls.append(str(token))
    
    result_str = ' '.join(result_ls)

    return result_str
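
To see what the first helper does in isolation, here is a small sketch that applies the same regular expression to a made-up string.  Note that each matched character is replaced on its own, so a Windows-style `\r\n` line ending turns into two spaces.

```python
import re

# Same pattern as in remove_special_characters
control_chars = re.compile(r'[\n\r\t]')

sample = "Illinois\nSenate\tChicago"  # made-up input
cleaned = control_chars.sub(" ", sample)
print(cleaned)  # Illinois Senate Chicago

windows_style = control_chars.sub(" ", "a\r\nb")
print(repr(windows_style))  # 'a  b' (two spaces)
```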

## Now we will assemble our starting node list, which holds the entity names we will look up as Q-values when we begin our Wikidata query

In [7]:
node_text_ls = []

for el in ner_list:
    clean_text = remove_special_characters(el)
    no_sw = remove_stop_words_and_punct(clean_text)
    if no_sw not in node_text_ls:
        node_text_ls.append(no_sw)

node_text_ls

['Barack Hussein Obama II',
 'United States',
 'Democratic Party',
 'Obama',
 'Illinois',
 'Honolulu',
 'Hawaii',
 'Columbia University',
 'Chicago',
 'Harvard Law School',
 'Harvard Law Review',
 'University Chicago Law School',
 'Illinois Senate',
 'U.S. Senate',
 'Senate',
 'Hillary Clinton',
 'Republican nominee John McCain',
 'Joe Biden',
 'Affordable Care Act',
 'ACA',
 'Obamacare',
 'Dodd',
 'American Recovery Reinvestment Act',
 'Tax Relief',
 'Great Recession',
 'Budget Control',
 'American Taxpayer Relief Acts',
 'Afghanistan',
 'Russia',
 'Iraq War',
 'Libya',
 'UN Security Council Resolution 1973',
 'Muammar Gaddafi',
 'Osama bin Laden',
 'Republican opponent Mitt Romney',
 'LGBT Americans',
 'Supreme Court',
 'Windsor',
 'Obergefell',
 'Hodges',
 'Court',
 'Iraq',
 'Syria',
 'ISIL',
 'Ukraine',
 'Joint Comprehensive Plan Action',
 'Iran',
 'Cuba',
 'Sonia Sotomayor',
 'Elena Kagan',
 'Merrick Garland',
 'Republican majority Senate',
 'Mitch McConnell',
 'Washington',
 'D.C

## Starting interactions with Wikidata

We won't use all of the functions below in this exercise, but they are here to help if you want to get into the details a bit.  They handle the requests to Wikidata that this notebook needs.

In [None]:
def getItems(site, itemtitle):
    params = { 'action' :'wbsearchentities' , 'format' : 'json' , 'language' : 'en', 'type' : 'item', 'search': itemtitle}
    request = api.Request(site=site,**params)
    return request.submit()

def getItem(site, wdItem, token):
    request = api.Request(site=site,
                          action='wbgetentities',
                          format='json',
                          ids=wdItem)    
    return request.submit()

def prettyPrint(variable):
    # `pprint` was imported as a function (from pprint import pprint), so call it directly
    pprint(variable, indent=4)

# Read the saved bot token and set up the Wikidata site objects
with open('.wiki_api_token') as f:
    token = f.read().strip()
wikidata = pywikibot.Site('wikidata', 'wikidata')
site = wikidata  # same site; both names are used in later cells

## Another check that we are still able to connect...

In [9]:
itempage = pywikibot.ItemPage(wikidata, "Q76")  # Q76 is Barack Obama
itempage

ItemPage('Q76')

## Start querying Wikidata

First, we take all of our named entities and look them up in Wikidata. This is done by matching each entity to a Wikidata Q-code, which is what Wikidata uses to index all entities. As you will see, not every entity is found, likely because some of the extracted text carries modifiers in front of the actual entity (ex: "Republican nominee John McCain"). But we will still be OK. :)
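
The lookup boils down to taking the first search hit, if there is one.  A minimal sketch of that step on made-up responses shaped like the JSON that `wbsearchentities` returns (the helper name `first_qid` is hypothetical, and no network call is made here):

```python
def first_qid(response):
    """Return the Q-code of the top search hit, or None if there is none."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None


# Made-up responses for illustration only
found = {"search": [{"id": "Q76", "label": "Barack Obama"}]}
empty = {"search": []}

print(first_qid(found))  # Q76
print(first_qid(empty))  # None
```

Returning `None` for a miss, instead of raising, makes it easy to count or log the entities that Wikidata could not match.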

In [10]:
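%time
# Note: the line magic `%time` times only the rest of its own line, which is empty here,
# so the microsecond timings reported below do not measure the loop.  `%%time` on the
# first line of the cell would time the whole cell.

item_ls = []
i = 0

for el in node_text_ls:
    wikidataEntries = getItems(site, el)
    try:
        # Keep the top search hit as (Q-code, entity text)
        tup = (wikidataEntries['search'][0]['id'], el)
        item_ls.append(tup)
    except (KeyError, IndexError):
        # No search results came back for this entity text
        i += 1
        print('Missing ', i,'th entry for ', el)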
    
dedup_item_ls = []

for item in item_ls:
    if item not in dedup_item_ls:
        dedup_item_ls.append(item)
        
dedup_item_ls

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.05 µs


Missing  1 th entry for  University Chicago Law School
Missing  2 th entry for  Republican nominee John McCain
Missing  3 th entry for  American Recovery Reinvestment Act
Missing  4 th entry for  American Taxpayer Relief Acts
Missing  5 th entry for  UN Security Council Resolution 1973
Missing  6 th entry for  Republican opponent Mitt Romney
Missing  7 th entry for  LGBT Americans
Missing  8 th entry for  Joint Comprehensive Plan Action
Missing  9 th entry for  Republican majority Senate
Missing  10 th entry for  Democratic Party Politics


[('Q76', 'Barack Hussein Obama II'),
 ('Q30', 'United States'),
 ('Q29552', 'Democratic Party'),
 ('Q76', 'Obama'),
 ('Q1204', 'Illinois'),
 ('Q18094', 'Honolulu'),
 ('Q782', 'Hawaii'),
 ('Q49088', 'Columbia University'),
 ('Q1297', 'Chicago'),
 ('Q49122', 'Harvard Law School'),
 ('Q1365125', 'Harvard Law Review'),
 ('Q1517713', 'Illinois Senate'),
 ('Q66096', 'U.S. Senate'),
 ('Q2570643', 'Senate'),
 ('Q6294', 'Hillary Clinton'),
 ('Q6279', 'Joe Biden'),
 ('Q1414593', 'Affordable Care Act'),
 ('Q1414593', 'ACA'),
 ('Q1414593', 'Obamacare'),
 ('Q16869600', 'Dodd'),
 ('Q54991062', 'Tax Relief'),
 ('Q154510', 'Great Recession'),
 ('Q4985020', 'Budget Control'),
 ('Q889', 'Afghanistan'),
 ('Q159', 'Russia'),
 ('Q545449', 'Iraq War'),
 ('Q1016', 'Libya'),
 ('Q19878', 'Muammar Gaddafi'),
 ('Q1317', 'Osama bin Laden'),
 ('Q11201', 'Supreme Court'),
 ('Q182625', 'Windsor'),
 ('Q19866992', 'Obergefell'),
 ('Q730841', 'Hodges'),
 ('Q41487', 'Court'),
 ('Q796', 'Iraq'),
 ('Q858', 'Syria'),
 ('Q2

## How do we get the verbs?

In Wikidata, these are called "claims" or "statements", and each one is indexed by a property with a P-code. There are literally thousands of different P-codes. I have gone through and identified a series that I thought might be particularly interesting for this dataset. This list should absolutely be customized to the application/graph.  One of the easiest ways to figure out what P-values you want is to go to an entity's Wikidata page and hold your mouse over the claim you are interested in.  The P-code will appear in the link's URL.
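
As a rough sketch of how a chosen set of P-codes turns claims into (source, relationship, target) triples, here is a toy version that uses plain dictionaries.  The `claims` mapping is a made-up, simplified stand-in for what Wikidata returns (the real data holds claim objects, not strings).

```python
# Hypothetical subset of P-codes we care about
wanted = {"P19": "place_of_birth", "P26": "spouse"}

# Made-up, simplified stand-in for an item's claims: P-code -> target labels
claims = {
    "P19": ["Honolulu"],
    "P26": ["Michelle Obama"],
    "P18": ["(an image file)"],  # not in `wanted`, so it is skipped
}

triples = [
    ("Barack Obama", wanted[p], target)
    for p in wanted
    if p in claims
    for target in claims[p]
]
print(triples)
# [('Barack Obama', 'place_of_birth', 'Honolulu'), ('Barack Obama', 'spouse', 'Michelle Obama')]
```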

### Note

This process can take several minutes, depending on the size of your starter list and the amount of traffic hitting Wikidata at any given time. You might even hit timeout errors. They will eventually resolve themselves. Grab a cup of coffee. For Barack Obama's entity list, this takes around 10-12 minutes.  We have the pre-populated graph and will not actually run these cells in the course in the interest of time.
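
If the timeouts become a nuisance, one option is to wrap the slow call in a small retry helper.  This is only a sketch under my own assumptions: `with_retries` is a hypothetical helper (not part of `pywikibot`), and the default attempt count and delays are arbitrary.

```python
import time


def with_retries(fn, attempts=3, delay=1.0, backoff=2.0):
    """Call fn(); on failure wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay)
            delay *= backoff  # wait longer before each new attempt
```

It could then be used as `itemdata = with_retries(lambda: itempage.get())` in place of a bare `itempage.get()`.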

In [11]:
%%time
p_dc = {'P6': 'head_of_government',
        'P17': 'country',
        'P19': 'place_of_birth',
        'P22': 'father',
        'P25': 'mother', 
        'P26': 'spouse',
        'P27': 'country_of_citizenship',
        'P30': 'continent',
        'P31': 'instance_of',
        'P35': 'head_of_state',
        'P36': 'capital',
        'P37': 'official_language',
        'P39': 'position_held',
        'P40': 'child',
        'P69': 'educated_at',
        'P101': 'field_of_work',
        'P102': 'member_of_political_party',
        'P106': 'occupation',
        'P108': 'employer',
        'P150': 'contains_administrative_territorial_entity',
        'P159': 'headquarters_location',
        'P166': 'award_received',
        'P172': 'ethnic_group',
        'P361': 'part_of',
        'P463': 'member_of',
        'P551': 'residence',
        'P607': 'conflict',
        'P793': 'significant_event',
        'P1344': 'participated_in',
        'P1813': 'short_name',
        'P1906': 'office_held_by_head_of_state',
        'P2388': 'office_held_by_head_of_the_organization',
        'P2670': 'has_parts_of_the_class'
       }

full_node_tup_ls = []

for el in tqdm(item_ls):
    itempage = pywikibot.ItemPage(wikidata, el[0])
    itemdata = itempage.get()
    source_node = itemdata['labels']['en']

    for key in p_dc.keys():
        try:
            for i in itemdata['claims'][key]:
                target = i.getTarget()
                # One row per claim: source name/Q, property P/name, target name/Q
                tup = (source_node, el[0], key, p_dc[key], target.labels['en'], target.id)
                if tup not in full_node_tup_ls:
                    full_node_tup_ls.append(tup)
        except Exception:
            # Skip properties the item lacks and targets without an English label or Q-code
            continue

100%|███████████████████████████████████████████| 47/47 [11:00<00:00, 14.06s/it]

CPU times: user 54.9 s, sys: 1.07 s, total: 55.9 s
Wall time: 11min





## Here is what the output looks like...

In [12]:
full_node_tup_ls[0:5]

[('Barack Obama',
  'Q76',
  'P19',
  'place_of_birth',
  'Kapiolani Medical Center for Women and Children',
  'Q6366688'),
 ('Barack Obama', 'Q76', 'P19', 'place_of_birth', 'Kenya', 'Q114'),
 ('Barack Obama', 'Q76', 'P19', 'place_of_birth', 'Honolulu', 'Q18094'),
 ('Barack Obama', 'Q76', 'P22', 'father', 'Barack Obama Sr.', 'Q649593'),
 ('Barack Obama', 'Q76', 'P25', 'mother', 'Ann Dunham', 'Q766106')]

In [13]:
df = pd.DataFrame(full_node_tup_ls, columns=['source_name', 'source_q', 'rel_p', 'rel_name', 'target_name', 'target_q'])
df.head()

| | source_name | source_q | rel_p | rel_name | target_name | target_q |
|---|---|---|---|---|---|---|
| 0 | Barack Obama | Q76 | P19 | place_of_birth | Kapiolani Medical Center for Women and Children | Q6366688 |
| 1 | Barack Obama | Q76 | P19 | place_of_birth | Kenya | Q114 |
| 2 | Barack Obama | Q76 | P19 | place_of_birth | Honolulu | Q18094 |
| 3 | Barack Obama | Q76 | P22 | father | Barack Obama Sr. | Q649593 |
| 4 | Barack Obama | Q76 | P25 | mother | Ann Dunham | Q766106 |


In [14]:
df.shape

(1657, 6)

## Now we will connect to Neo4j as we usually do

You will want to create a blank Graph Data Science Sandbox for this exercise.

In [15]:
class Neo4jConnection:
    
    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response

In [16]:
uri = ''
user = 'neo4j'
pwd = ''

conn = Neo4jConnection(uri=uri, user=user, pwd=pwd)
conn.query("MATCH (n) RETURN COUNT(n)")

[<Record COUNT(n)=0>]

In [17]:
conn.query('CREATE CONSTRAINT q_value IF NOT EXISTS ON (n:Node) ASSERT n.id IS UNIQUE')

[]

## Create node lists

We have a series of sources and targets.  Some of them will be duplicates.  (Can you guess some of them???)  So this takes all unique nodes and gets their names and IDs to be used as node properties.

In [18]:
source_df = df[['source_name', 'source_q']].drop_duplicates()
source_df.columns = ['name', 'id']
target_df = df[['target_name', 'target_q']].drop_duplicates()
target_df.columns = ['name', 'id']
all_nodes_df = pd.concat([source_df, target_df]).drop_duplicates()
all_nodes_df.shape

(1346, 2)

## More helper functions

`P31` is kind of a special P-code representing "instance of."  That is a cool one because we can use it to identify node types!  For example, "Barack Obama" would be an instance of a human.  So we are going to have a special function to go grab that for each of our nodes.  This can also take a bit of time depending on the size of our graph.  Go get a cup of coffee if you run this query.  

We then have our usual functions for populating the database from a pandas DataFrame.  Note that `add_nodes` gives every node the label `:Node` and stores the `P31` value in the `type` property; a separate APOC query below then adds that value as a node label.

In [19]:
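def get_p31(row):
    # P31 corresponds to "instance of"; `row` is the node's Q-code

    itempage = pywikibot.ItemPage(wikidata, row)
    itemdata = itempage.get()
    try:
        # Use the first "instance of" claim and return its English label
        target = itemdata['claims']['P31'][0].getTarget()
        target.get()
        return target.labels['en']
    except Exception:
        # No P31 claim, or no English label for its target
        return 'Unknown'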
    

def add_nodes(rows, batch_size=10000):
    # Adds nodes to the Neo4j graph as a batch job.

    query = '''UNWIND $rows AS row
               MERGE (:Node {name: row.name, id: row.id, type: row.node_label})
               RETURN count(*) as total
    '''
    return insert_data(query, rows, batch_size)


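def add_edges(rows, batch_size=50000):
    # Adds relationships to the Neo4j graph as a batch job.
    # Relies on the global variable `edge` (set by the loop that calls this function)
    # for the relationship type.  The query has no RETURN clause, so insert_data
    # reports a total of 0 for every edge type.

    query = """UNWIND $rows AS row
               MATCH (src:Node {id: row.source_q}), (tar:Node {id: row.target_q})
               CREATE (src)-[:%s]->(tar)
    """ % edge

    return insert_data(query, rows, batch_size)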


def insert_data(query, rows, batch_size = 10000):
    # Function to handle updating the Neo4j database in batch mode.

    total = 0
    batch = 0
    start = time.time()
    result = None

    while batch * batch_size < len(rows):

        res = conn.query(query, parameters={'rows': rows[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
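        try:
            total += res[0]['total']
        except (TypeError, IndexError, KeyError):
            # No result or no `total` column (a failed query, or a query without RETURN)
            total += 0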
        batch += 1
        result = {"total":total, "batches":batch, "time":time.time()-start}
        print(result)

    return result
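
The batching in `insert_data` is just index arithmetic.  A stand-alone sketch of the same slicing, using a hypothetical generator `batch_slices` that makes the start and stop of each batch explicit:

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) index pairs covering n_rows in order."""
    batch = 0
    while batch * batch_size < n_rows:
        start = batch * batch_size
        yield start, min(start + batch_size, n_rows)
        batch += 1


print(list(batch_slices(25, 10)))
# [(0, 10), (10, 20), (20, 25)]
```

With 1346 rows and a batch size of 10000 this yields a single slice, which is why the node insert in this notebook reports one batch.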

In [20]:
%%time
all_nodes_df['node_label'] = all_nodes_df['id'].map(get_p31)
all_nodes_df.head()



CPU times: user 52.4 s, sys: 554 ms, total: 53 s
Wall time: 13min 17s


| | name | id | node_label |
|---|---|---|---|
| 0 | Barack Obama | Q76 | human |
| 76 | United States of America | Q30 | sovereign state |
| 322 | Democratic Party | Q29552 | political party |
| 327 | Illinois | Q1204 | U.S. state |
| 439 | Honolulu | Q18094 | county seat |


In [22]:
add_nodes(all_nodes_df)

{'total': 1346, 'batches': 1, 'time': 2.796731948852539}


{'total': 1346, 'batches': 1, 'time': 2.796731948852539}

In [23]:
edge_ls = df['rel_name'].unique().tolist()
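
The relationship names in `edge_ls` are pasted directly into the Cypher text by `add_edges`, since relationship types generally cannot be passed as a query parameter the way `$rows` is.  Our names come from our own `p_dc` dictionary, so they are safe here, but if you build the list from other sources a quick check is cheap.  A sketch, with a hypothetical helper `check_rel_type`:

```python
import re

# Letters, digits and underscores only, not starting with a digit
REL_TYPE_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def check_rel_type(name):
    """Return name unchanged if it is safe to paste into Cypher, else raise."""
    if not REL_TYPE_PATTERN.match(name):
        raise ValueError(f"Unsafe relationship type: {name!r}")
    return name


print(check_rel_type("place_of_birth"))  # place_of_birth
```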

In [24]:
for edge in edge_ls:
    print(edge)
    y = df[df['rel_name'] == edge]
    add_edges(y)

place_of_birth
{'total': 0, 'batches': 1, 'time': 0.44391298294067383}
father
{'total': 0, 'batches': 1, 'time': 0.3990213871002197}
mother
{'total': 0, 'batches': 1, 'time': 0.2861306667327881}
spouse
{'total': 0, 'batches': 1, 'time': 0.34337902069091797}
country_of_citizenship
{'total': 0, 'batches': 1, 'time': 0.41155385971069336}
instance_of
{'total': 0, 'batches': 1, 'time': 0.489840030670166}
position_held
{'total': 0, 'batches': 1, 'time': 0.5104033946990967}
child
{'total': 0, 'batches': 1, 'time': 0.30907130241394043}
educated_at
{'total': 0, 'batches': 1, 'time': 0.40592432022094727}
field_of_work
{'total': 0, 'batches': 1, 'time': 0.3043184280395508}
member_of_political_party
{'total': 0, 'batches': 1, 'time': 0.31018638610839844}
occupation
{'total': 0, 'batches': 1, 'time': 0.30188536643981934}
employer
{'total': 0, 'batches': 1, 'time': 0.4091973304748535}
award_received
{'total': 0, 'batches': 1, 'time': 0.4785928726196289}
ethnic_group
{'total': 0, 'batches': 1, 'time'

In [25]:
y = all_nodes_df['node_label'].value_counts()
print(y[0:5])

human                 220
county of Illinois    102
Unknown                54
U.S. state             50
oblasts of Russia      45
Name: node_label, dtype: int64


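## Drop duplicates (there aren't many of them)

Note that this query deletes every node whose name is shared with another node, not just the extra copies.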

In [26]:
query = """MATCH (n:Node) 
           WITH n.name AS name, COLLECT(n) AS nodes 
           WHERE SIZE(nodes)>1 
           FOREACH (el in nodes | DETACH DELETE el)
"""

conn.query(query)

[]
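
To make the effect of the query above concrete, here is the same grouping logic in plain Python on made-up node records.  Every node in a group of two or more is removed, so no copy of a duplicated name survives.

```python
from collections import defaultdict

# Made-up node records: (id, name)
nodes = [("n1", "Springfield"), ("n2", "Springfield"), ("n3", "Chicago")]

by_name = defaultdict(list)
for node_id, name in nodes:
    by_name[name].append(node_id)

# Mirrors WHERE SIZE(nodes)>1 followed by deleting every collected node
deleted = [i for ids in by_name.values() if len(ids) > 1 for i in ids]
kept = [i for ids in by_name.values() if len(ids) == 1 for i in ids]

print(deleted)  # ['n1', 'n2']
print(kept)     # ['n3']
```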

## Update node labels

In [27]:
query = """MATCH (n:Node) 
           SET n.type_ls = apoc.convert.toStringList(n.type)
"""

conn.query(query)

[]

In [28]:
query = """MATCH (n:Node) 
           CALL apoc.create.addLabels(n, n.type_ls) 
           YIELD node RETURN node
"""

conn.query(query)

[<Record node=<Node id=0 labels=frozenset({'Node', 'human'}) properties={'name': 'Barack Obama', 'type_ls': ['human'], 'id': 'Q76', 'type': 'human'}>>,
 <Record node=<Node id=1 labels=frozenset({'Node', 'sovereign state'}) properties={'name': 'United States of America', 'type_ls': ['sovereign state'], 'id': 'Q30', 'type': 'sovereign state'}>>,
 <Record node=<Node id=2 labels=frozenset({'Node', 'political party'}) properties={'name': 'Democratic Party', 'type_ls': ['political party'], 'id': 'Q29552', 'type': 'political party'}>>,
 <Record node=<Node id=3 labels=frozenset({'Node', 'U.S. state'}) properties={'name': 'Illinois', 'type_ls': ['U.S. state'], 'id': 'Q1204', 'type': 'U.S. state'}>>,
 <Record node=<Node id=4 labels=frozenset({'Node', 'county seat'}) properties={'name': 'Honolulu', 'type_ls': ['county seat'], 'id': 'Q18094', 'type': 'county seat'}>>,
 <Record node=<Node id=5 labels=frozenset({'Node', 'U.S. state'}) properties={'name': 'Hawaii', 'type_ls': ['U.S. state'], 'id': 'Q
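
Many of the new labels contain spaces or periods (`sovereign state`, `U.S. state`), so they need to be wrapped in backticks when used in a Cypher pattern.  A small sketch of building such a query string; `quote_label` is a hypothetical helper, and the query itself is only an illustration:

```python
def quote_label(label):
    """Wrap a label in backticks, doubling any backticks it contains."""
    return "`" + label.replace("`", "``") + "`"


label = "sovereign state"
query = f"MATCH (n:{quote_label(label)}) RETURN n.name LIMIT 5"
print(query)
# MATCH (n:`sovereign state`) RETURN n.name LIMIT 5
```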

## Cool!

We should now have a graph that looks like the following (when viewed from the browser):

<img src='images/obama_wiki_graph.png'>