# Exercise 3: Using a pre-populated Wikidata graph

Wikidata is a great source for building a knowledge graph.  We are going to build on Exercise 1, where we used a graph pre-populated from the Google Knowledge Graph, to create a similar graph and then answer some questions about it.  As in that approach, this graph is created from the subjects and objects found by `spacy` in the intro paragraph of the Wikipedia page for Barack Obama.  However, once we have our starting subject, we then get the Wikidata elements for the objects.  You will see how this creates a rich graph as we get the target entities for each of the objects.

## Wikidata as a graph source

As mentioned, Wikidata uses a data model of items with unique identifiers (Q-values) and properties / claims (P-values).  This structure is shown below.

<img src='images/wikidata_data_model.png'>

We can see that the Q-value above for Douglas Adams is `Q42` (42...get it???).  Q-values provide a unique link to the nouns in our text.  The P-values, on the other hand, can be used to provide the verbs.  As of the writing of this notebook, there were _9143 different P-values_ representing a range of potential verbs we could use!  So it will be up to us to decide which ones to use, since querying all of them would be prohibitive.  "Educated at," as shown above, is `P69`.  A searchable list of all possible P-values can be found [here](https://www.wikidata.org/wiki/Wikidata:List_of_properties).  So we could assemble a list of the nouns we are interested in and get their Q-values, and then a list of the P-values.  Then we would get the target text that each P-value refers to.
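
To make the Q-value / P-value relationship concrete, here is a minimal sketch of pulling the target Q-values for one property out of an entity record.  It assumes the record follows the shape of Wikidata's entity JSON (`claims` → property ID → `mainsnak` → `datavalue`).  The helper name `claim_targets` and the hand-written `sample_entity` are hypothetical, and the target ID in the sample is illustrative only, so check it against Wikidata before relying on it.

```python
from typing import Any, Dict, List


def claim_targets(entity: Dict[str, Any], pid: str) -> List[str]:
    """Return the Q-values that property `pid` points to for one entity."""
    targets = []
    for statement in entity.get("claims", {}).get(pid, []):
        snak = statement.get("mainsnak", {})
        # Snaks without a concrete value carry no datavalue, so skip them.
        if snak.get("snaktype") != "value":
            continue
        value = snak.get("datavalue", {}).get("value")
        # Item-valued claims store their target as a dict with an "id" key.
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets


# Hand-written sample in the assumed shape; the target ID is illustrative.
sample_entity = {
    "id": "Q42",
    "claims": {
        "P69": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P69",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q691283"},
                    },
                }
            }
        ]
    },
}

print(claim_targets(sample_entity, "P69"))  # ['Q691283']
```

A property that the entity does not have simply yields an empty list, which keeps the later graph-building step free of special cases.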

## Our workflow

We will assemble a list of the nouns from our target text, get their Q-values, and then choose a list of P-values.  Then we will get the target text that each P-value refers to.  So our workflow looks as follows:

<img src='images/wiki_workflow.png'>

Once we have the data, we will populate the graph.  This results in the following graph schema:

<img src = 'images/wiki_schema.png'>
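
As a rough sketch of the "populate the graph" step, the snippet below turns (subject, P-value, object) triples into parameter rows for a single `UNWIND ... MERGE` query.  This is not the code that produced the pre-populated graph used in this exercise.  The relationship type `RELATED`, the property names `id` and `pid`, and the helper `triples_to_rows` are all assumptions made for illustration, as is the target ID in the sample triple.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical Cypher: `RELATED`, `id`, and `pid` are illustrative names,
# not necessarily the schema shown in the image above.
MERGE_TRIPLES = """
UNWIND $rows AS row
MERGE (s:Node {id: row.subject})
MERGE (o:Node {id: row.object})
MERGE (s)-[:RELATED {pid: row.predicate}]->(o)
"""


def triples_to_rows(triples: Iterable[Tuple[str, str, str]]) -> List[Dict[str, str]]:
    """Convert triples into query parameter rows, skipping exact repeats."""
    seen = set()
    rows = []
    for subject, predicate, obj in triples:
        key = (subject, predicate, obj)
        if key in seen:
            continue
        seen.add(key)
        rows.append({"subject": subject, "predicate": predicate, "object": obj})
    return rows


# Illustrative triple: Q42 --P69--> an example target item ID.
rows = triples_to_rows([
    ("Q42", "P69", "Q691283"),
    ("Q42", "P69", "Q691283"),  # repeated on purpose to show de-duplication
])
print(len(rows))  # 1
```

With a connection object like the one defined later in this notebook, the rows could be sent in one round trip with `conn.query(MERGE_TRIPLES, parameters={"rows": rows})`.  Passing the rows as a parameter avoids building Cypher by string concatenation.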

In [None]:
from neo4j import GraphDatabase

In [None]:
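class Neo4jConnection:
    """Thin wrapper around the Neo4j Python driver for running Cypher queries."""

    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None

        # Create the driver once; a failure is reported and leaves the driver unset.
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)

    def close(self):
        """Close the underlying driver, if one was created."""
        if self.__driver is not None:
            self.__driver.close()

    def query(self, query, parameters=None, db=None):
        """Run a Cypher query and return its records as a list (None on failure)."""
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None

        try:
            # Use the named database when given, otherwise the server default.
            session = self.__driver.session(database=db) if db is not None else self.__driver.session()
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally:
            # Always release the session, even when the query fails.
            if session is not None:
                session.close()
        return response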

In [None]:
uri = ''
user = 'neo4j'
pwd = ''

conn = Neo4jConnection(uri=uri, user=user, pwd=pwd)

In [None]:
conn.query("MATCH (n) RETURN COUNT(n)")

In [None]:
wiki_url = 'https://resources.oreilly.com/binderhub/introduction-to-knowledge-graphs/raw/master/data/wiki.json'

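# Pass the URL as a query parameter instead of concatenating it into the Cypher string.
query = "CALL apoc.import.json($url)"
conn.query(query, parameters={"url": wiki_url})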

conn.query("MATCH (n) RETURN COUNT(n)")

## _Exercise:_ Drop duplicates

How many are there?

In [None]:
# your code here

## _Exercise:_ Where was Barack Obama born?

Hint: Check [this](https://neo4j.com/docs/cypher-manual/current/functions/list/#functions-relationships) out.

In [None]:
# your code here

## Updating the node labels

As we have already seen, a knowledge graph is more informative when we make the labels richer.  Right now, every node has only the label `Node`.  Since we have the node types (is instance of, like "human," "sovereign state," etc.), let's use those.  After you do this, I recommend you go back into the browser and take a look to see just how much more informative it is!

_Note:_ There is a quirk in our data where those node types are not returned as a list.  So first we have to convert them to a string list and then we will add them.  APOC to the rescue!

In [None]:
query = """
    MATCH (n:Node) 
    SET n.type_ls = apoc.convert.toStringList(n.type)
"""

conn.query(query)

query = """
    MATCH (n:Node) 
    CALL apoc.create.addLabels(n, n.type_ls) 
    YIELD node RETURN node
"""

conn.query(query)
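
To see why the conversion step matters, here is a plain-Python analogue of what the first query does to each node's `type` property.  This is a sketch of my understanding of `apoc.convert.toStringList` (a single value becomes a one-element list, and a list has each element converted to a string), not APOC's actual implementation.  The function name `to_string_list` is hypothetical.

```python
from typing import Any, List, Optional


def to_string_list(value: Any) -> Optional[List[str]]:
    """Normalize a scalar-or-list value into a list of strings."""
    if value is None:
        # Assumed behavior: a missing property stays missing.
        return None
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    # A lone value is wrapped so that it can be used as a list of labels.
    return [str(value)]


print(to_string_list("human"))  # ['human']
print(to_string_list(["human", "sovereign state"]))  # ['human', 'sovereign state']
```

Once every `type` is a list of strings, the second query can hand that list straight to `apoc.create.addLabels`.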

## Coming up

We are going to do some actual machine learning on a knowledge graph in the next several exercises!  You are welcome to use this one or to create your own.  In the next exercise I will show you how to create your own graph, should you choose to do so. 