# Exercise 1: Assemble your first knowledge graph

## Introduction

For this exercise we will work with provided data that was produced entirely through NLP, by extracting subject-verb-object (SVO) records.  (You will create a graph of your own in the next exercise.)  Our graph will have the following schema:

<img src="images/svo_schema.png" width="600">

## Data files

The data in this graph is taken from the Wikipedia entry for Barack Obama, as described earlier in the course.  The file is available in the repository at `/data/svo.json`, in JSON format, and looks like this:

```
{
  "type": "node",
  "id": "3",
  "labels": [
    "Node"
  ],
  "properties": {
    "node_labels": [
      "Place",
      "Thing",
      "Country",
      "AdministrativeArea"
    ],
    "name": "united states",
    "word_vec": [
      -0.06853523850440979,
      0.20753547549247742,
      -0.012865334749221802,
      ...
    ],
    "description": "The United States of America, commonly known as the United States or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions. ",
    "url": "https://en.wikipedia.org/wiki/United_States"
  }
}
```
Note that `word_vec` is a 300-dimensional vector created with spaCy.  Also, if you explore this file, you will find that not every node has all of this information.

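Because not every node carries every property, code that reads these records should tolerate missing keys. The following is a minimal sketch and not part of the original notebook: the `summarize_node` helper and the abbreviated sample record (shortened `word_vec`, no `description`) are illustrative assumptions.

```python
import json

# Abbreviated copy of the sample node above; word_vec is shortened and
# description is left out on purpose to show the missing-key case.
sample = json.loads("""
{
  "type": "node",
  "id": "3",
  "labels": ["Node"],
  "properties": {
    "node_labels": ["Place", "Thing", "Country", "AdministrativeArea"],
    "name": "united states",
    "word_vec": [-0.0685, 0.2075, -0.0129],
    "url": "https://en.wikipedia.org/wiki/United_States"
  }
}
""")


def summarize_node(record):
    """Return a flat summary of a node record (hypothetical helper)."""
    props = record.get("properties", {})
    return {
        "id": record["id"],
        "name": props.get("name"),                    # None if absent
        "node_labels": props.get("node_labels", []),  # empty list if absent
        "has_vector": "word_vec" in props,
        "has_description": "description" in props,
    }


summary = summarize_node(sample)
print(summary)
```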
The relationships appear at the end of the same file and look like this:

```
{
  "id": "822",
  "type": "relationship",
  "label": "see",
  "start": {
    "id": "244",
    "labels": [
      "Node"
    ]
  },
  "end": {
    "id": "653",
    "labels": [
      "Node"
    ]
  }
}
```
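Each relationship record describes one directed edge: a start node id, a label (the verb), and an end node id. As a small sketch, the record above can be reduced to a triple like this. The `to_triple` helper is a hypothetical name used only for illustration.

```python
# The sample relationship record shown above, as a Python dict.
relationship = {
    "id": "822",
    "type": "relationship",
    "label": "see",
    "start": {"id": "244", "labels": ["Node"]},
    "end": {"id": "653", "labels": ["Node"]},
}


def to_triple(rel):
    """Reduce a relationship record to (start id, label, end id)."""
    return (rel["start"]["id"], rel["label"], rel["end"]["id"])


triple = to_triple(relationship)
print(triple)  # ('244', 'see', '653')
```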

## Workflow

We will use our usual workflow for this course, which means that we will create a Sandbox instance (this time a "Blank Data Science" instance) and connect to it.  This time, though, we will use the official Neo4j Python driver.  Be sure to grab the URL and password so you can connect from this notebook!

_Note:_ We do not necessarily need Jupyter for this exercise.  You could do all of this from within the Neo4j Browser.  All exercises will be shown in this notebook, but can be replicated in the browser.

In [11]:
from neo4j import GraphDatabase
import pandas as pd

## Connection class

This class is used for making a nice connection to the database with some basic error handling.

In [2]:
class Neo4jConnection:
    
    def __init__(self, uri, user, pwd):
        
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response

In [3]:
uri = ''
user = 'neo4j'
pwd = ''

conn = Neo4jConnection(uri=uri, user=user, pwd=pwd)

## Awesome Procedures on Cypher (APOC)

<img align="left" src="images/apoc.jpeg" width="200">

In addition to basic functionality within Cypher, there are also some add-on libraries that we will be using today.  It has been said that if you cannot figure out how to do a thing with Cypher, then there is probably a function in [APOC](https://neo4j.com/labs/apoc/) to do it for you.  There are a _lot_ of functions in there -- too many to cover in this course.  However, we will use a few here, such as loading our JSON data into the database.

_Note:_ At any point if you make a mistake with your graph, you can always delete all of the nodes and relationships (and then re-populate it) using:

```
MATCH (n) DETACH DELETE n
```

However, **be aware** that GitLab might limit the number of times you can run the import call below!  If that happens and you get an error that says ``Failed to invoke procedure `apoc.import.json`: Caused by: java.net.SocketException: Unexpected end of file from server``, try creating a new instance so that the API is called from a new IP address.

In [9]:
svo_url = "https://resources.oreilly.com/binderhub/introduction-to-knowledge-graphs/raw/ex_1/data/svo.json"

query = "CALL apoc.import.json('" + svo_url + "')"
conn.query(query)

conn.query("MATCH (n) RETURN COUNT(n)")

[<Record COUNT(n)=862>]

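As an alternative to building the query text by string concatenation, the URL can be passed as a Cypher parameter through the `parameters` argument of `conn.query`. This is a sketch under that assumption; `build_import_call` is a hypothetical helper, and the final call is left commented out because it needs a live connection (and re-running it would import the data a second time).

```python
def build_import_call(url):
    """Return the Cypher text and parameter map for importing a JSON file."""
    # $url is a Cypher parameter: its value travels separately from the
    # query text, so no quoting or concatenation is needed.
    return "CALL apoc.import.json($url)", {"url": url}


svo_url = "https://resources.oreilly.com/binderhub/introduction-to-knowledge-graphs/raw/ex_1/data/svo.json"
import_query, import_params = build_import_call(svo_url)
print(import_query)

# With a live connection this would be run as:
# conn.query(import_query, parameters=import_params)
```
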
## Cool!

We now have a graph!  It has 862 nodes and 823 relationships.  Let's now do something with it!

First, let's check out a few nodes and their properties using this query in the browser:

```
MATCH (n) RETURN n LIMIT 5
```

## Observation

We can see that some of our nodes list several entries in their `node_labels` property (not surprising, since we can see this in our sample JSON above).  It would be nice to have these attached to the nodes as actual labels.  And, of course, there is a function in APOC to help with that!

In [21]:
query = """MATCH (n:Node) 
           CALL apoc.create.addLabels(n, n.node_labels) 
           YIELD node 
           RETURN 'Done!'
"""

conn.query(query)

[<Record 'Done!'='Done!'>,
 <Record 'Done!'='Done!'>,
 ...
 
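The labeling query returns one `'Done!'` row per matched node, which makes for noisy output. A variant that aggregates the rows into a single count is sketched below. It has not been run against the course data here, and the names `relabel_query` and `relabeled` are illustrative.

```python
# Same MATCH and CALL as the original query; only the RETURN clause differs,
# collapsing one row per node into a single count.
relabel_query = """
MATCH (n:Node)
CALL apoc.create.addLabels(n, n.node_labels)
YIELD node
RETURN count(node) AS relabeled
"""

# With a live connection this would be run as:
# conn.query(relabel_query)
```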

## Where was Barack Obama born?

Pretty straightforward, but there are a few ways we could approach this problem.  For the sake of visualization, let's do these in the browser.

#### Method 1

```
MATCH (n:Node {name: 'oh bah mə'})-[*1..5]->(p) 
WHERE p:Country OR p:AdministrativeArea OR p:Continent OR p:Place 
RETURN n, p
```
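The pattern `-[*1..5]->` matches directed paths made of between one and five relationships. To build intuition for what that bound means, the sketch below runs a hop-limited breadth-first search over a toy adjacency list. The graph contents and the `reachable_within` helper are made up for illustration, and the helper only collects reachable nodes rather than enumerating every path the way Cypher does.

```python
from collections import deque

# Toy directed graph (hypothetical data, not taken from svo.json).
toy_graph = {
    "obama": ["honolulu", "senate"],
    "honolulu": ["hawaii"],
    "hawaii": ["united states"],
    "senate": [],
    "united states": [],
}


def reachable_within(graph, start, min_hops, max_hops):
    """Return nodes reachable from start in min_hops..max_hops directed steps."""
    found = set()
    seen = {(start, 0)}            # (node, hops) states already queued
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops >= min_hops:
            found.add(node)
        if hops == max_hops:
            continue               # do not expand past the upper bound
        for neighbor in graph.get(node, []):
            state = (neighbor, hops + 1)
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return found


print(reachable_within(toy_graph, "obama", 1, 2))
print(reachable_within(toy_graph, "obama", 1, 5))
```

With an upper bound of 2 the search stops at `hawaii`; raising it to 5 also reaches `united states`, which is why a wide hop range tends to return many loosely related places.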

## _Exercise:_ Can you make that better and narrow it down a bit?

Again, this will be easier to visualize in the browser.

In [25]:
# your code here 

## _Exercise:_ How many Obamas are in the graph?

If you have looked around the graph a bit (probably easiest done in the browser), then you might notice that there are a few different Obamas floating around in there.  This might suggest we have a bit of data cleaning to do.  But before we do that, let's try to see how many there are.

In [6]:
# your code here

## Dedup-ing

Bummer!  We clearly have a problem with duplicates!  We could attempt to solve that in Python, but for education's sake let's try to do it in Cypher.  First, we should try to find them.  

In [27]:
query = """MATCH (n:Node)
           WITH n.name AS name, n.node_labels AS labels, COLLECT(n) AS nodes
           WHERE SIZE(nodes) > 1
           RETURN [n in nodes | n.name] AS names, [n in nodes | n.node_labels] as labels, SIZE(nodes)
           ORDER BY SIZE(nodes) DESC
"""

dupes_df = pd.DataFrame([dict(_) for _ in conn.query(query)])
dupes_df.head()

| | names | labels | SIZE(nodes) |
|---|---|---|---|
| 0 | [barack hussein obama ii, barack hussein obama... | [None, None, None, None, None, None, None, Non... | 20 |
| 1 | [officially state, officially state, officiall... | [None, None, None, None, None, None, None, Non... | 12 |
| 2 | [district, district, district, district, distr... | [None, None, None, None, None, None, None, Non... | 9 |
| 3 | [john sidney mccain iii, john sidney mccain ii... | [None, None, None, None, None, None, None, None] | 8 |
| 4 | [public safety recreational firearms use prote... | [None, None, None, None, None, None, None, None] | 8 |

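The Cypher query groups nodes that share a name (and `node_labels` value) and keeps only the groups with more than one member. The same idea in plain Python, as a sketch over a made-up list of names, looks like this. The list and the `find_duplicates` helper are illustrative and are not course data.

```python
from collections import Counter

# Hypothetical node names standing in for n.name values.
names = [
    "barack obama", "district", "barack obama",
    "senate", "district", "barack obama",
]


def find_duplicates(values):
    """Return (value, count) pairs for values seen more than once, largest first."""
    counts = Counter(values)
    dupes = [(value, count) for value, count in counts.items() if count > 1]
    return sorted(dupes, key=lambda pair: pair[1], reverse=True)


duplicates = find_duplicates(names)
print(duplicates)  # [('barack obama', 3), ('district', 2)]
```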

Or maybe we could try to do this via Levenshtein distance (the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other; smaller values mean more similar strings)?

In [32]:
query = """MATCH (n1:Node {name: 'barack obama'}) 
           MATCH (n2:Node) WHERE n2.name CONTAINS 'obama' 
           RETURN DISTINCT(n2.name), apoc.text.distance(n1.name, n2.name) AS distance
           ORDER BY distance
"""

lev_df = pd.DataFrame([dict(_) for _ in conn.query(query)])
lev_df.head(10)

| | (n2.name) | distance |
|---|---|---|
| 0 | barack obama | 0 |
| 1 | obama terms | 8 |
| 2 | obamacare | 8 |
| 3 | president barack obama | 10 |
| 4 | barack hussein obama ii | 11 |
| 5 | michelle lavaughn robinson obama | 24 |

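To see where these distances come from, here is a minimal pure-Python sketch of the standard dynamic-programming algorithm for Levenshtein distance. It is an illustration only, not the APOC implementation, and the `levenshtein` function name is an assumption of this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein("barack obama", "barack obama"))             # 0
print(levenshtein("barack obama", "president barack obama"))   # 10
print(levenshtein("barack obama", "barack hussein obama ii"))  # 11
```

Note that the distance grows with every extra word, so a long but correct variant of the name can score worse than an unrelated short string that happens to contain "obama".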

## _Exercise:_ Dropping the duplicates

We can probably do better.  Again, let's try to do this in Cypher...

In [8]:
# your code here

## Node similarity

If you recall, we included the word vectors of the description text as part of our node properties.  If we were doing strict NLP, we could just take the cosine of the angle between two of these vectors and get a similarity score, where 1.0 is perfect similarity and -1.0 is perfect dissimilarity.  However, we have the same functionality within the [Graph Data Science (GDS) library](https://neo4j.com/docs/graph-data-science/current/) of Neo4j, which is the second add-on library we will use in this course.

In [33]:
query = """MATCH (n1:Node {name: 'barack obama'}) 
           MATCH (n2:Node {name: 'mitch mcconnell'}) 
           RETURN gds.alpha.similarity.cosine(n1.word_vec, n2.word_vec) AS similarity
"""

conn.query(query)

[<Record similarity=0.93807255209707>]
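
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on tiny made-up vectors. The `cosine_similarity` helper is illustrative and is not the GDS implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors (the real word_vec values have 300 dimensions).
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score 1, perpendicular vectors score 0, and opposite vectors score -1, regardless of their lengths.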