# Project Summary

## Citing SoNAR (IDH)

# Data Description


## Summary Stats

## Data Sources

## Data Preparation

# Data Access

We need a few specific libraries to work with the SoNAR (IDH) database. Let's start by installing the `neo4j` library.

If you are using the curriculum on Binder, or running it locally (with or without the Docker container), the package is already installed. If you want to interact with the SoNAR (IDH) database independently, install the package by running the following line in a new notebook cell:

```python
!pip install neo4j
```

In [31]:
import yaml

with open("../creds.yml", 'r') as ymlfile:
    cfg = yaml.safe_load(ymlfile)

uri = cfg["sonar_creds"]["uri"]
user = cfg["sonar_creds"]["user"]
password = cfg["sonar_creds"]["pass"]
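
The cell above expects `creds.yml` to contain a `sonar_creds` mapping with the keys `uri`, `user` and `pass`. As a minimal sketch, the hypothetical helper `read_sonar_creds` (not part of the curriculum) checks an already loaded configuration dictionary for these keys before a driver is created. The example values are placeholders, not real credentials.

```python
REQUIRED_KEYS = ("uri", "user", "pass")  # key names taken from the cell above


def read_sonar_creds(cfg):
    """Return (uri, user, password) from a loaded creds.yml dictionary.

    Hypothetical helper: raises KeyError with a readable message when
    the expected structure is missing.
    """
    creds = cfg.get("sonar_creds")
    if not isinstance(creds, dict):
        raise KeyError("creds.yml needs a top-level 'sonar_creds' mapping")
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError("sonar_creds is missing: " + ", ".join(missing))
    return creds["uri"], creds["user"], creds["pass"]


# Placeholder values, shaped like the result of yaml.safe_load on creds.yml
example_cfg = {
    "sonar_creds": {
        "uri": "bolt://localhost:7687",
        "user": "example-user",
        "pass": "example-password",
    }
}
example_uri, example_user, example_password = read_sonar_creds(example_cfg)
```

A missing key then fails early with a clear message instead of surfacing later as a connection error.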

In [40]:
from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=(user, password))

With the call above we create a [Neo4j driver object](https://neo4j.com/docs/api/python-driver/current/api.html#driver). The driver stores the connection details for the database. We can now use it to send requests to the database, for example to retrieve data.
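
Depending on the driver version, creating the driver may not contact the server yet, so a quick round trip helps to confirm that the credentials work. The following is a minimal sketch, assuming the `driver` object from the cell above. `check_connection` is a hypothetical helper name, and `RETURN 1 AS ok` is a trivial Cypher query that touches no data.

```python
def check_connection(driver):
    """Return True if a trivial query round-trips through the driver."""
    with driver.session() as session:
        # single() returns the only record of the result (or None)
        record = session.run("RETURN 1 AS ok").single()
    return record is not None and record["ok"] == 1


# Usage in the notebook (needs a reachable database):
# check_connection(driver)
```

When you are done with the database, `driver.close()` releases the connections the driver holds.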

# Data Exploration

Data exploration is usually the very first thing to do when working with new data. So let's start diving into the SoNAR (IDH) database by exploring it. 

Whenever we want to retrieve data from the Neo4j database of SoNAR (IDH) we can use a query language called "Cypher Query Language". Cypher provides a rather easy to comprehend syntax for requesting data from the database.

Throughout this curriculum we will use this Cypher Query Language whenever we directly retrieve data from SoNAR (IDH).

## Node Labels

Let's start with a simple query and ask the database to return all [node labels](https://neo4j.com/docs/getting-started/current/graphdb-concepts/#graphdb-labels). Node labels are basically categories the nodes can belong to. You can think of them as entity groups. The SoNAR (IDH) database distinguishes between Persons, Corporations and more. Let's ask the database itself to return all available labels.

In [28]:
with driver.session() as session:
    print(session.run("CALL db.labels()").data())

[{'label': 'CorpName'}, {'label': 'GeoName'}, {'label': 'MeetName'}, {'label': 'PerName'}, {'label': 'TopicTerm'}, {'label': 'UniTitle'}, {'label': 'ChronTerm'}, {'label': 'IsilTerm'}, {'label': 'Resource'}]


**Code Breakdown:**

*`with` [..] `as` [..] `:`*  
>The `with` statement keeps the database call concise and resource-efficient: the session only exists inside the `with` block and is closed automatically when the block ends, even if an error occurs. The `as` keyword binds the object created by the `with` expression to a name we can use inside the `with` block.
>
>The `with` statement has more advantages, but explaining them is beyond the scope of this curriculum. An in-depth explanation of the `with` statement can be found [here](https://www.python.org/dev/peps/pep-0343/).

*`driver.session()` [as] `session`*
>When we request data from the database we need to establish a connection (`session`). The `driver` object we created earlier stores the connection details. Calling the method `driver.session()` opens a new session, which is bound to the name `session` for the duration of the `with` block.

*`print(` [..] `)`*
> The print function is wrapped around the actual database request to see the result below the cell.

*`session.run(`[..]`).data()`*
> This is the part of the code that actually sends the request to the database and returns the answer. `session.run()` takes a Cypher query and sends it to the database. The method `.data()` returns the records of the result as a list of dictionaries.

*`"CALL db.labels()"`*
> This is the actual Cypher query. The `CALL` clause is used to call the `db.labels()` procedure. More details about Neo4j procedures can be found below.
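
To make the role of the `with` statement from the breakdown above more concrete, here is a sketch of the same request written twice: once with `with`, and once spelled out with `try`/`finally`. Both function names are hypothetical, and both functions expect the `driver` object created earlier.

```python
def labels_with_block(driver):
    """Fetch all labels; the session is closed automatically."""
    with driver.session() as session:
        return session.run("CALL db.labels()").data()


def labels_try_finally(driver):
    """Same request, spelled out without the `with` statement."""
    session = driver.session()
    try:
        return session.run("CALL db.labels()").data()
    finally:
        session.close()  # runs even if the query raises an error
```

The second version shows what `with` saves us from writing by hand: the session is closed in every case, whether the query succeeds or fails.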


The result of this code chunk is a list that contains one dictionary (a set of key-value pairs) per label in the database.
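
Often a plain list of label names is easier to work with than a list of dictionaries. A short sketch, using the output shown above as input (the variable names `rows` and `labels` are chosen for illustration):

```python
# Output copied from the db.labels() call above
rows = [
    {'label': 'CorpName'}, {'label': 'GeoName'}, {'label': 'MeetName'},
    {'label': 'PerName'}, {'label': 'TopicTerm'}, {'label': 'UniTitle'},
    {'label': 'ChronTerm'}, {'label': 'IsilTerm'}, {'label': 'Resource'},
]

# Pick the value stored under the key "label" from every dictionary
labels = [row["label"] for row in rows]
print(labels)
```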

Some useful built-in procedures for exploring and describing the database are listed in the table below. You can get a full list of built-in procedures with the query `CALL dbms.procedures()` (on Neo4j 5 and later, `SHOW PROCEDURES` replaces it).


|Procedure | Description |
|---------|----------|
|`db.labels()`| List all labels in the database.|
|`db.propertyKeys()`|List all property keys in the database.|
|`db.relationshipTypes()`|List all relationship types in the database. |
|`db.schema`| Show the schema of the data. |
|`db.stats.retrieve`|Retrieve statistical data about the current database. <br>Valid sections are 'GRAPH COUNTS', 'TOKENS', 'QUERIES', 'META'|
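
The procedures without arguments from the table can all be called in the same way. The sketch below wraps this pattern in a hypothetical helper, `call_procedure`. Because the procedure name is pasted into the query text, the helper only accepts dotted names made of letters, digits and underscores. This check is our own precaution, not something the driver requires.

```python
import re

# Dotted identifier such as "db.labels" or "db.propertyKeys"
PROCEDURE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def call_procedure(session, name):
    """Call an argument-free procedure and return its rows as dictionaries."""
    if not PROCEDURE_NAME.fullmatch(name):
        raise ValueError(f"not a plain procedure name: {name!r}")
    return session.run(f"CALL {name}()").data()


# Usage in the notebook (needs a reachable database):
# with driver.session() as session:
#     print(call_procedure(session, "db.propertyKeys"))
```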

## Relationship Types

Similar to node labels we can retrieve the categories of the relations inside the database. Every relation must have exactly one relationship type. This type defines the kind or category the relation belongs to. 

In [19]:
import pandas as pd

with driver.session() as session:
    result = session.run("CALL db.relationshipTypes()").data()

# Show the relationship types as a table
pd.DataFrame(result)

| |relationshipType|
|---|---|
|0|RelationToIsilTerm|
|1|RelationToChronTerm|
|2|RelationToCorpName|
|3|RelationToTopicTerm|
|4|RelationToGeoName|
|5|SocialRelation|
|6|RelationToMeetName|
|7|RelationToPerName|
|8|RelationToUniTitle|
|9|RelationToResource|


## Properties

Properties are additional pieces of information that can be assigned to nodes or relationships. They can hold meta information, e.g. geographic locations, names or gender, or anything else that is relevant for a node or relationship.

Also, properties can be used to identify a specific subset of nodes or relationships. 
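
As a sketch of how a property can select a subset of nodes, the hypothetical helper below looks up nodes of one label whose property equals a given value. The property key and value are sent as query parameters (`$key`, `$value`), while the label has to be part of the query text and is therefore checked first. Which property keys actually exist in SoNAR (IDH) is not assumed here, so `"SomeProperty"` in the usage comment is only a placeholder.

```python
def nodes_with_property(session, label, key, value, limit=10):
    """Return up to `limit` nodes of `label` where property `key` equals `value`."""
    if not label.isidentifier():
        raise ValueError(f"not a plain label name: {label!r}")
    query = f"MATCH (n:{label}) WHERE n[$key] = $value RETURN n LIMIT $limit"
    parameters = {"key": key, "value": value, "limit": limit}
    return session.run(query, parameters).data()


# Usage in the notebook (placeholder property key and value):
# with driver.session() as session:
#     nodes_with_property(session, "PerName", "SomeProperty", "some value")
```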

In [41]:
query = """
CALL db.propertyKeys() YIELD propertyKey AS prop
MATCH (n)
WHERE n[prop] IS NOT NULL
RETURN prop, count(n) AS numNodes
"""


with driver.session() as session:
    result = session.run(query).data()
    
result

ServiceUnavailable: Failed to read from defunct connection IPv4Address(('h2918680.stratoserver.net', 7687)) (IPv4Address(('85.214.119.41', 7687)))
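
The query above compares every property key with every node in the database. On a database of this size that is a lot of work, which may be the reason the request above failed (this is our assumption, the error message itself does not say so). A lighter variant is sketched below: it looks at one label only and at a limited sample of its nodes. The helper name `property_usage_query` and the default sample size are hypothetical.

```python
def property_usage_query(label, sample_size=1000):
    """Build a Cypher query that counts property keys on a sample of one label."""
    if not label.isidentifier():
        raise ValueError(f"not a plain label name: {label!r}")
    if not isinstance(sample_size, int) or sample_size < 1:
        raise ValueError("sample_size must be a positive integer")
    return (
        f"MATCH (n:{label}) "
        f"WITH n LIMIT {sample_size} "   # look at a sample, not at all nodes
        "UNWIND keys(n) AS prop "        # one row per property key of each node
        "RETURN prop, count(*) AS numNodes "
        "ORDER BY numNodes DESC"
    )


print(property_usage_query("IsilTerm", 100))

# Usage in the notebook (needs a reachable database):
# with driver.session() as session:
#     result = session.run(property_usage_query("IsilTerm", 100)).data()
```

The counts then describe the sample only, so they show which properties a label typically carries rather than exact totals.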

## Cypher Query Language

## General Database Summary

In [29]:
query = """
match (n) return 'Number of Nodes: ' + count(n) as output UNION
match ()-[]->() return 'Number of Relationships: ' + count(*) as output UNION
CALL db.labels() YIELD label RETURN 'Number of Labels: ' + count(*) AS output UNION
CALL db.relationshipTypes() YIELD relationshipType  RETURN 'Number of Relationships Types: ' + count(*) AS output
"""

with driver.session() as session:
    result = session.run(query).data()
    
pd.DataFrame(result)


| |output|
|---|---|
|0|Number of Nodes: 46831805|
|1|Number of Relationships: 191363660|
|2|Number of Labels: 9|
|3|Number of Relationships Types: 10|
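
The summary query returns each number as part of a text, which is handy for reading but not for calculating. A small sketch of how the rows shown above could be turned back into numbers (`parse_summary` is a hypothetical helper, and `summary_rows` repeats the output above):

```python
# Rows as returned by session.run(query).data() for the summary query
summary_rows = [
    {"output": "Number of Nodes: 46831805"},
    {"output": "Number of Relationships: 191363660"},
    {"output": "Number of Labels: 9"},
    {"output": "Number of Relationships Types: 10"},
]


def parse_summary(rows):
    """Split every 'name: number' text into a dictionary entry."""
    parsed = {}
    for row in rows:
        # Split at the last ": " so the number ends up in `value`
        name, _, value = row["output"].rpartition(": ")
        parsed[name] = int(value)
    return parsed


summary = parse_summary(summary_rows)
print(summary["Number of Nodes"])
```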


# Descriptive Analyses

## Summarise Node Labels

In [177]:
result = {"label": [], "count": []}

with driver.session() as session:
    labels = [row["label"] for row in session.run("CALL db.labels()")]
    for label in labels:
        query = f"MATCH (:{label}) RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["label"].append(label)
        result["count"].append(count)
        
node_labels_df = pd.DataFrame(result)
node_labels_df.sort_values(by = "count")

| |label|count|
|---|---|---|
|7|IsilTerm|611|
|4|TopicTerm|212135|
|1|GeoName|308197|
|5|UniTitle|385300|
|6|ChronTerm|537054|
|2|MeetName|814044|
|0|CorpName|1487711|
|3|PerName|5087660|
|8|Resource|37999093|
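
Absolute counts are easier to compare when they are expressed as shares of all nodes. The sketch below does this with the counts from the output above, using plain Python so it runs without a database connection (the variable names are chosen for illustration):

```python
# Counts copied from the output above
label_counts = {
    "IsilTerm": 611,
    "TopicTerm": 212135,
    "GeoName": 308197,
    "UniTitle": 385300,
    "ChronTerm": 537054,
    "MeetName": 814044,
    "CorpName": 1487711,
    "PerName": 5087660,
    "Resource": 37999093,
}

total_nodes = sum(label_counts.values())
shares = {label: count / total_nodes for label, count in label_counts.items()}

# Print the labels from largest to smallest share
for label, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<10} {share:7.2%}")
```

The sum of the label counts equals the number of nodes reported in the General Database Summary (46831805), and `Resource` nodes alone make up roughly four fifths of the database.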


## Summarise Relationship Types


In [183]:
result = {"relType": [], "count": []}

with driver.session() as session:
    rel_types = [row["relationshipType"] for row in session.run("CALL db.relationshipTypes()")]
    for rel_type in rel_types:
        query = f"MATCH ()-[:{rel_type}]->() RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["relType"].append(rel_type)
        result["count"].append(count)
        
rel_type_df = pd.DataFrame(result)
rel_type_df.sort_values(by = "count")

| |relType|count|
|---|---|---|
|8|RelationToUniTitle|128256|
|6|RelationToMeetName|422333|
|1|RelationToChronTerm|5446841|
|2|RelationToCorpName|6728127|
|4|RelationToGeoName|6861379|
|9|RelationToResource|7389423|
|7|RelationToPerName|20857782|
|3|RelationToTopicTerm|24068056|
|5|SocialRelation|40301595|
|0|RelationToIsilTerm|79159868|
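
Since every relationship has exactly one type, the counts per type should add up to the total number of relationships. A quick sanity check, sketched with the numbers from the output above and the total from the General Database Summary (the variable names are chosen for illustration):

```python
# Counts copied from the output above
rel_type_counts = {
    "RelationToUniTitle": 128256,
    "RelationToMeetName": 422333,
    "RelationToChronTerm": 5446841,
    "RelationToCorpName": 6728127,
    "RelationToGeoName": 6861379,
    "RelationToResource": 7389423,
    "RelationToPerName": 20857782,
    "RelationToTopicTerm": 24068056,
    "SocialRelation": 40301595,
    "RelationToIsilTerm": 79159868,
}

# Total taken from the General Database Summary
total_relationships = 191363660

counts_add_up = sum(rel_type_counts.values()) == total_relationships
print(counts_add_up)
```

For the numbers shown here the two values agree. A mismatch would suggest that the database changed between the two queries.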


# Complex Queries & Data Preparation