# AUEB M.Sc. in Data Science (part-time)

**Course**: Data Mining

**Semester**: Winter 2020

**2nd homework**: Neo4j Graph database

**Professor**: Y.Kotidis

**Assistant responsible for this assignment**: I.Filippidou


Team members:

- Spiros Politis (p3351814)

- Manolis Proimakis (p3351815)

## Assignment Description (General)

> All scripts were written in **Cypher**; Python was used only for presentation purposes, as an alternative to Neo4j Desktop and the Cypher console.

> The queries appear in this report, and a copy of them is also available in the `queries.cy` file.

### Dataset

You are given the DBLP citation network, which contains authors, articles, venues and citations between articles. In particular, the dataset contains 184313 articles with id, title, year and abstract (don’t use the number of citations property), 80299 authors with names, 4 venues with names and 289908 citations among papers. You can download the dataset (Citation Network Dataset) from e-class in json format.

### Property graph model

You are asked to model the data as a property graph by designing the appropriate entities and assigning the relevant labels, types and properties. For your modeling, you need to study the details of all the files that describe the citation network and represent all network attributes on nodes and edges of a graph. In your model you should include only the attributes that describe each node and edge type, without repetitions of elements (e.g. same property being displayed on both a node and an edge). Finally, nodes should not be connected when this is not required by the model.

## Part 1 - Creating the graph and querying

### Importing the dataset into Neo4j

Based on your model, **you should create a graph database on Neo4j and load the citation network elements (nodes, edges, attributes)**. You can load the dataset directly from the json files provided in e-class for this assignment by using either the neo4j browser or the neo4j import tool, or any programming language that is supported by neo4j. To speed up loading and query response times, you could also create proper indexes on your model properties.

In order to satisfy the assignment requirements, we used *Neo4J Desktop*, creating a graph named *dmh2_base_graph* (ver. 3.5.15). 

Further configuration options that had to be applied are as follows:

- Installation of the *APOC* library (ver. 3.5.0.9)

- In `neo4j.conf`, we added / altered the following parameters:

    - dbms.memory.heap.initial_size=2G

    - dbms.memory.heap.max_size=8G

    - apoc.import.file.enabled=true

    - apoc.export.file.enabled=true
    
    - dbms.security.procedures.unrestricted=*
    
    
The following plugin package needs to be downloaded and installed: https://s3-eu-west-1.amazonaws.com/com.neo4j.graphalgorithms.dist/neo4j-graph-algorithms-3.5.14.0-standalone.zip. The .jar file should be extracted into the "*plugins*" directory of the Neo4J installation. This prevents the message "*There is no procedure with the name `...` registered for this database instance. Please ensure you've spelled the procedure name correctly and that the procedure is properly deployed.*" from occurring.

In order to be able to load the JSON files into the graph, we copied the source files into the Neo4J import directory.

This part of the assignment provides an alternative way of querying *Neo4J*, through the Python interface.

### Load required libraries

In [1]:
import logging
import sys
import pandas as pd

sys.path.append("code/")

from DMH2 import Neo4jPy

In [2]:
logging.basicConfig(
    format = "%(levelname)s: %(message)s", 
    level = logging.INFO, 
    datefmt = "%I:%M:%S"
)
log = logging.getLogger()

In [3]:
import importlib
importlib.reload(Neo4jPy)

<module 'DMH2.Neo4jPy' from 'code/DMH2/Neo4jPy.py'>

### Connect

In [4]:
# Create a Neo4JPy object.
neo4jpy = Neo4jPy.Neo4jPy(
    uri = "bolt://localhost:7687/",
    user = "neo4j", 
    password = "dmh2"
)

### Drop all objects

- remove all nodes and connections
- remove all indexes and constraints

In [5]:
neo4jpy.execute_statement("CALL apoc.periodic.iterate(\"MATCH (n) RETURN n\",\"DETACH DELETE n\", {batchSize:10000, parallel:false})")
neo4jpy.execute_statement("CALL apoc.schema.assert({}, {})")

In [6]:
neo4jpy.query("MATCH (n) RETURN n")

[]

### Ingest JSON data

#### Creating unique indexes

First of all, we create the following 3 unique constraints, which help us:
- Build a consistent graph without duplicate nodes
- Load the JSON data much more quickly, since the import can use the backing indexes to find existing nodes and update them

In [7]:
# Create constraints.
neo4jpy.execute_statement("CREATE CONSTRAINT ON (author:Author) ASSERT author.name IS UNIQUE")
neo4jpy.execute_statement("CREATE CONSTRAINT ON (venue:Venue) ASSERT venue.name IS UNIQUE")
neo4jpy.execute_statement("CREATE CONSTRAINT ON (article:Article) ASSERT article.id IS UNIQUE")

#### Import Data

Then we import the JSON data. Some remarks:

- 1) While ingesting an article we also create the articles it references, by id, but these do not yet have a title, year or abstract.
- 2) Once the record of an article with such an id is processed, we update the incomplete article with the remaining information.
- 3) Because there are more referenced ids than articles in the dataset, it is common in the end to encounter articles with only an id assigned.
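The merge behaviour described in these remarks can be pictured outside Neo4j with a plain dictionary. This is only a minimal sketch under stated assumptions, not the import itself: the `merge_article` helper and the sample records are hypothetical.

```python
# Toy illustration of the MERGE-based import: articles are keyed by id,
# referenced ids create placeholder entries, and a later record with the
# same id fills in the missing properties.
articles = {}   # id -> properties
cites = set()   # (citing id, cited id)


def merge_article(record):
    """Upsert one article record and its references (hypothetical helper)."""
    node = articles.setdefault(record["id"], {})
    node.update(title=record.get("title"), year=record.get("year"))
    for ref_id in record.get("references", []):
        articles.setdefault(ref_id, {})   # placeholder with only an id
        cites.add((record["id"], ref_id))


merge_article({"id": "a1", "title": "First", "year": 2001, "references": ["a2", "a3"]})
merge_article({"id": "a2", "title": "Second", "year": 1999})

print(articles["a2"])  # filled in once its own record was processed
print(articles["a3"])  # still a placeholder: never seen as a full record
```

In this toy run `a3` keeps no properties at all, which mirrors the articles that end up with only an id.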

In [8]:
# Ingest source JSON data.
neo4jpy.execute_statement("""
    CALL apoc.periodic.iterate(
        'UNWIND ["dblp-ref-0.json", "dblp-ref-1.json", "dblp-ref-2.json", "dblp-ref-3.json"] AS file CALL apoc.load.json(file) YIELD value as a return a',
        'MERGE (article:Article {id: a.id}) 
            ON CREATE SET article.title = a.title, article.year = a.year, article.abstract = a.abstract
            ON MATCH SET article.title = a.title, article.year = a.year, article.abstract = a.abstract
        FOREACH (authorName in a.authors | 
            MERGE (author:Author {name: authorName}) 
            MERGE (author)-[:WROTE {year: article.year}]->(article)) 
        MERGE (venue:Venue {name: a.venue }) 
        MERGE (venue)-[:PRESENTED]->(article)
        FOREACH (referenceId in a.references | 
            MERGE (referenceArticle:Article {id: referenceId}) 
            MERGE (article)-[:CITES]->(referenceArticle))',
        {batchSize:100}
        )
    YIELD batches,total return batches, total
""")

The expected counts in our graph are:

- 184313 articles
- 80299 authors
- 4 venues
- 289908 citations

#### Validation

In [9]:
print("No of articles: ", neo4jpy.query("MATCH(n:Article) return count(n)")[0].value())
print("No of authors: ", neo4jpy.query("MATCH(n:Author) return count(n)")[0].value())
print("No of venues: ", neo4jpy.query("MATCH(n:Venue) return count(n)")[0].value())
print("No of citations: ", neo4jpy.query("MATCH ()-[r:CITES]->() RETURN count(*) as count")[0].value())

No of articles:  184313
No of authors:  80299
No of venues:  4
No of citations:  289908


### Showing the graph schema

In Neo4J Browser, execute *CALL db.schema()*

![title](images/schema_base.png)

Our main graph consists of 3 node types and 3 relations.

Nodes

- **Author**: The author of an article
    - **name**: unique mandatory


- **Article**: An article
    - **id**: unique mandatory
    - **title**: optional
    - **year**: optional
    - **abstract**: optional


- **Venue**: A venue, event
    - **name**: unique mandatory
    

Relations

- **WROTE**: a directed edge from an author to an article that they have written
    - **year**: mandatory, the publication year of the article
- **CITES**: a directed edge from an article to an article that it references
- **PRESENTED**: a directed edge from a venue to an article that it has presented

### Querying the database

#### 1) Which are the top 5 authors with the most citations. Return author name and number of citations.

In [10]:
neo4jpy.to_df(
    neo4jpy.query("""
        MATCH (x:Author) - [w:WROTE] -> (n:Article) <- [c:CITES] - (a:Article)
        RETURN x AS author, COUNT(c) AS citations
        ORDER BY citations DESC
        LIMIT 5
    """)
)

Unnamed: 0,author,citations
0,Roman Słowiński,251
1,Zdzisław Pawlak,240
2,Jerzy W. Grzymala-Busse,233
3,Nenad Medvidovic,216
4,Wojciech Ziarko,214


#### 2) Which are the top 5 authors with the most collaborations (with different authors). Return author name and number of collaborations.

In [11]:
neo4jpy.to_df(
    neo4jpy.query("""
        MATCH (x:Author) - [w:WROTE] -> (n:Article) <- [c:WROTE] - (b:Author)
        RETURN x AS author, COUNT(DISTINCT b) AS collaborators_count
        ORDER BY collaborators_count DESC
        LIMIT 5 
    """)
)

Unnamed: 0,author,collaborators_count
0,Barry W. Boehm,84
1,Bart Preneel,83
2,Marco Dorigo,79
3,Schahram Dustdar,75
4,Josef Kittler,70


#### 3) Which is the author who has written the most papers without collaborations. Return author name and number of papers.

In [12]:
neo4jpy.to_df(
    neo4jpy.query("""
        MATCH (author:Author) - [w:WROTE] -> (article:Article)
        OPTIONAL MATCH (collaborator:Author) - [w1:WROTE] -> (article:Article)
        WITH author, COUNT(article) AS articles_count, COUNT(DISTINCT collaborator) AS collaborators_count
        WHERE collaborators_count = 1
        RETURN author, articles_count
        ORDER BY articles_count DESC
        LIMIT 5
    """)
)

Unnamed: 0,author,articles_count
0,Neil Savage,32
1,David Roman,31
2,Leah Hoffmann,29
3,Diane Crawford,22
4,John R. Herndon,16


#### 4) Which author published the most papers in 2003? Return author name and number of papers.

In [13]:
neo4jpy.to_df(
    neo4jpy.query("""
        MATCH(article:Article {year:2003}) <- [:WROTE] - (author:Author)
        RETURN author, count(article) AS articles
        ORDER BY articles DESC
        LIMIT 5
    """)
)

Unnamed: 0,author,articles
0,Bernard De Baets,9
1,Horst Bunke,7
2,Josef Kittler,7
3,Jan Bosch,6
4,Keng Siau,5


#### 5) Which is the venue with the most papers on the Data Mining field (derived from the paper title) in 2001. Return venue and number of papers.

First, we create a full-text index on the article titles. It lets us search all titles for a key phrase much more quickly, and it matches the search terms regardless of case or word order.

In [14]:
neo4jpy.execute_statement("""
    call db.index.fulltext.createNodeIndex("articleTitle", ["Article"], ["title"])
""")

In [15]:
neo4jpy.to_df(
    neo4jpy.query("""
        CALL db.index.fulltext.queryNodes("articleTitle", "Data Mining")
        YIELD node AS article
        MATCH (venue:Venue) - [:PRESENTED] -> (article)
        WHERE article.year = 2001
        RETURN venue, COUNT(article) AS articles
        ORDER BY articles DESC
        LIMIT 1
    """)
)

Unnamed: 0,venue,articles
0,Lecture Notes in Computer Science,87


#### 6) Which are the top 5 papers with the most citations? Return paper title and number of citations.

In [16]:
neo4jpy.to_df(
    neo4jpy.query("""
        MATCH (article:Article)-[c:CITES]->(citedArticle:Article)
        RETURN citedArticle.title AS cited_article_title, COUNT(c) AS number_of_citations
        ORDER BY number_of_citations DESC
        LIMIT 5
    """)
)

Unnamed: 0,cited_article_title,number_of_citations
0,,261202
1,Rough sets,211
2,A method for obtaining digital signatures and ...,125
3,"Pastry: Scalable, Decentralized Object Locatio...",108
4,An axiomatic basis for computer programming,93


#### 7) Which were the papers that use “Neural Networks” in “speech recognition”. Return authors, title and abstract.

In [17]:
neo4jpy.execute_statement("""
    CALL db.index.fulltext.createNodeIndex("articleTitleAbstract", ["Article"], ["title", "abstract"])
""")

In [18]:
neo4jpy.to_df(
    neo4jpy.query("""
        CALL db.index.fulltext.queryNodes("articleTitleAbstract", "Neural Networks speech recognition")
        YIELD node AS article
        MATCH (author:Author)-[:WROTE]->(article)
        RETURN article.title AS article_title, article.abstract AS article_abstract, author.name AS author_name
        LIMIT 5
    """)
)

Unnamed: 0,article_title,article_abstract,author_name
0,Automatic speech recognition with deep neural ...,Automatic Speech Recognition has reached almos...,José A. R. Fonollosa
1,Automatic speech recognition with deep neural ...,Automatic Speech Recognition has reached almos...,Cristina España-Bonet
2,Implementation of Tamil Speech Recognition Sys...,This paper presents a neural network approach ...,T. V. Geetha
3,Implementation of Tamil Speech Recognition Sys...,This paper presents a neural network approach ...,S. Saraswathi
4,"Speech recognition by integrating audio, visua...",Recent researches have been focusing on fusion...,Joung Woo Ryu


## Part 2 - Link prediction

For this assignment you will also use the created neo4j graph database in order to predict future collaborations between authors. For this task you will use link prediction algorithms as discussed in lecture "Link Prediction". Neo4j implements the following algorithms that you can use and test for this task.

- Adamic Adar ( algo.linkprediction.adamicAdar )

- Common Neighbors ( algo.linkprediction.commonNeighbors )

- Preferential Attachment ( algo.linkprediction.preferentialAttachment )

- Resource Allocation ( algo.linkprediction.resourceAllocation )

- Same Community ( algo.linkprediction.sameCommunity )

- Total Neighbors ( algo.linkprediction.totalNeighbors )

Before running the algorithms, you should create links between authors that have already collaborated. In order to form a training set, you should add an undirected "CoAuthor" relationship between authors that have collaborated until the end of 2005. These relations form the train set and the rest of the collaborations (year > 2005) form the test set with which you will evaluate the accuracy of the link prediction algorithms.

After inserting the CoAuthor relations, you should run in neo4j link prediction algorithms using **only the subgraph with "Author" nodes and "Coauthor" relations**. In order to calculate scores for a given author node you should exclude all nodes that this node is already connected to and calculate scores to all other author nodes of the graph. The nodes that take positive scores are potential coauthors (after 2005) with the specific author.

For all the following Authors and for each used algorithm calculate using cypher the following:

(a) Compute for each author the number of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) predicted links.

(b) Calculate Accuracy, Precision and Recall metrics for each author. 

(c) Also calculate the adoption rate of each algorithm using the following formula:

$$\text{AdoptionRate} = \frac{\sum \text{score(TP)}}{\sum \text{score(TP)} + \sum \text{score(FP)}}$$

Where: 

TP = links with score > 0 that have been adopted after 2005

FP = links with score > 0 that have not been adopted after 2005

TN = links with score = 0 that have not been adopted after 2005

FN = links with score = 0 that have been adopted after 2005
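The definitions above can be sketched in a few lines of Python. This is an illustrative sketch rather than the Cypher used in the assignment: the function name `link_prediction_metrics` and the sample scores are assumptions made for the example.

```python
def link_prediction_metrics(predictions):
    """predictions: iterable of (score, adopted) pairs for one author.

    adopted is True when the pair collaborated after 2005.
    """
    tp = fp = tn = fn = 0
    tp_score = fp_score = 0.0
    for score, adopted in predictions:
        if score > 0 and adopted:
            tp += 1
            tp_score += score
        elif score > 0:
            fp += 1
            fp_score += score
        elif adopted:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Return None instead of dividing by zero.
        return num / den if den else None

    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "adoption_rate": ratio(tp_score, tp_score + fp_score),
    }


# Hypothetical scores for five candidate co-authors of one author.
sample = [(2.0, True), (1.0, False), (1.0, False), (0.0, True), (0.0, False)]
print(link_prediction_metrics(sample))
```

Returning `None` for an empty denominator is a design choice of this sketch; it avoids a division error for an author with no positive predictions.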

### Creating required subgraphs

To use the prediction algorithms, we need a train set from which future relations are predicted.
For this purpose, we create a `:COAUTHOR_TRAIN` relationship between authors who have co-authored an article up to and including `2005`.

In [19]:
neo4jpy.execute_statement("""
    MATCH (:Author) - [r:COAUTHOR_TRAIN] - (:Author)
    DELETE r
""")

neo4jpy.execute_statement("""
    MATCH (:Author) - [r:COAUTHOR_TEST] - (:Author)
    DELETE r
""")

In [20]:
neo4jpy.execute_statement("""
    MATCH (author) - [:WROTE] -> (article:Article) <- [:WROTE] - (coauthor:Author)
    WHERE article.year <= 2005
    MERGE (author) - [:COAUTHOR_TRAIN] - (coauthor)
""")

Moreover, we need a test set to validate our predictions. So we again create a relation, `:COAUTHOR_TEST`, between authors who co-authored an article after 2005 and are not already connected by a `:COAUTHOR_TRAIN` relation.
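The train/test split can be pictured on a handful of records in plain Python. The sketch below is only an illustration of the idea: the `papers` list and the `coauthor_pairs` helper are hypothetical and stand in for the `WROTE` edges of the graph.

```python
from itertools import combinations

# Hypothetical (year, authors) records standing in for WROTE edges.
papers = [
    (2003, ["Ann", "Bob"]),
    (2005, ["Bob", "Cid"]),
    (2007, ["Ann", "Bob"]),   # already collaborated by 2005 -> not a test edge
    (2008, ["Ann", "Cid"]),   # new collaboration -> test edge
]


def coauthor_pairs(records):
    """Return the set of unordered co-author pairs in the given records."""
    return {
        frozenset(pair)
        for _, authors in records
        for pair in combinations(authors, 2)
    }


train = coauthor_pairs(p for p in papers if p[0] <= 2005)
test = coauthor_pairs(p for p in papers if p[0] > 2005) - train

print(sorted(sorted(p) for p in train))
print(sorted(sorted(p) for p in test))
```

Using `frozenset` for each pair keeps the relation undirected, so (Ann, Bob) and (Bob, Ann) count as the same collaboration.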

In [21]:
neo4jpy.execute_statement("""
    MATCH (author) - [:WROTE] -> (article:Article) <- [:WROTE] - (coauthor:Author)
    WHERE article.year > 2005 AND NOT (author)-[:COAUTHOR_TRAIN] - (coauthor)
    MERGE (author) - [:COAUTHOR_TEST] - (coauthor)
""")

### Showing the graph schema

![title](images/schema_link_prediction.png)

### Description

The same procedure has been applied to each of the algorithms.

For each of the requested authors:

- 1) We MATCH all the `:Author` nodes that do not have a `:COAUTHOR_TRAIN` relation with the requested author
- 2) We optionally MATCH a `:COAUTHOR_TEST` relation between these nodes and the requested author

---

- 3) We compute the score, using the algorithm in question, between the requested author and all the other nodes.
- 4) We count the test relations (expected values are **0** if the nodes are not connected after 2005, **1** if they are connected after 2005)
- 5) Based on the score and the existence of the test relation we can conclude whether the prediction is TP, FP, TN or FN

---

- 6) We calculate
   - The sum of TP relations as the total TP of the algorithm (for recall, precision and accuracy)
   - The sum of FP relations as the total FP of the algorithm (for recall, precision and accuracy)
   - The sum of TN relations as the total TN of the algorithm (for recall, precision and accuracy)
   - The sum of FN relations as the total FN of the algorithm (for recall, precision and accuracy)
   
   - The sum of the scores of TP relations as the total score(TP) of the algorithm (for AdoptionRate)
   - The sum of the scores of FP relations as the total score(FP) of the algorithm (for AdoptionRate)
   
- 7) Using all the above metrics, in the final step we calculate the recall, precision, accuracy and adoption rate for each author
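Steps 1, 2 and 4 can be sketched on a toy graph as follows. This is a hypothetical illustration in plain Python, not the Cypher used below: the names `train_neighbours`, `test_pairs`, `candidates` and `labelled_candidates` are assumptions made for the example.

```python
# Hypothetical training graph: author -> set of COAUTHOR_TRAIN neighbours.
train_neighbours = {
    "Ann": {"Bob"},
    "Bob": {"Ann", "Cid"},
    "Cid": {"Bob"},
    "Dan": set(),
}
# Hypothetical COAUTHOR_TEST relations (collaborations first seen after 2005).
test_pairs = {frozenset({"Ann", "Cid"})}


def candidates(author, graph):
    """Authors that are neither the author nor already connected to them."""
    return set(graph) - {author} - graph[author]


def labelled_candidates(author, graph, test_pairs):
    """Map each candidate to 1 if a test relation exists, else 0."""
    return {
        other: int(frozenset({author, other}) in test_pairs)
        for other in sorted(candidates(author, graph))
    }


print(labelled_candidates("Ann", train_neighbours, test_pairs))
```

Each candidate would then be scored by one of the algorithms and classified as TP, FP, TN or FN from its score and its label.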
   

### Adamic Adar algorithm Scores

The Adamic Adar algorithm was introduced in 2003 by Lada Adamic and Eytan Adar to predict links in a social network. It is computed using the following formula:

$A(x, y)=\sum_{u \in N(x) \cap N(y)} \frac{1}{\log |N(u)|}$

where $N(u)$ is the set of nodes adjacent to $u$.

A value of 0 indicates that two nodes are not close, while higher values indicate nodes are closer.
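As a rough illustration of the formula, the score can be computed by hand on a small adjacency dictionary. This is a sketch under stated assumptions, not the `algo.linkprediction.adamicAdar` implementation: the toy `graph` is hypothetical and the natural logarithm is assumed.

```python
import math

# Hypothetical undirected co-author graph as adjacency sets.
graph = {
    "x": {"u", "v"},
    "y": {"u", "v"},
    "u": {"x", "y", "w"},
    "v": {"x", "y"},
    "w": {"u"},
}


def adamic_adar(graph, a, b):
    """Sum 1 / log|N(u)| over the common neighbours u of a and b."""
    return sum(1.0 / math.log(len(graph[u])) for u in graph[a] & graph[b])


print(adamic_adar(graph, "x", "y"))  # 1/ln(3) + 1/ln(2)
print(adamic_adar(graph, "x", "w"))  # u is the only common neighbour
print(adamic_adar(graph, "v", "w"))  # no common neighbour
```

A common neighbour of two distinct nodes always has at least two neighbours, so the logarithm in the denominator is never zero here.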

In [22]:
neo4jpy.to_df(
    neo4jpy.query("""
        WITH [
            "Krzysztof Kaczmarski", 
            "Hector Garcia-Molina", 
            "Christine Parent", 
            "Robert Meersman", 
            "Pier Luigi Emiliani", 
            "Hidenao Abe", 
            "Mario Lamberger", 
            "Kaoru Inoue", 
            "Gabriel Thierrin", 
            "Marcin Gogolewski"
        ] AS author_names
        UNWIND author_names AS author_name
        MATCH (other:Author)
        MATCH (author:Author {name: author_name})
        OPTIONAL MATCH (author:Author) - [r:COAUTHOR_TEST] - (other:Author)
        WHERE not (author) - [:COAUTHOR_TRAIN] - (other)
        WITH 
            author AS author, 
            other AS coauthor,
            algo.linkprediction.adamicAdar(author, other, { relationshipQuery: "COAUTHOR_TRAIN", direction: "BOTH" }) AS score,
            COUNT(r) AS in_test
        WITH 
            author,
            coauthor,
            score,
            in_test,
            CASE WHEN score > 0 AND in_test = 1 THEN 1 ELSE 0 END AS TP,
            CASE WHEN score > 0 AND in_test = 1 THEN score ELSE 0 END AS TP_score,
            CASE WHEN score > 0 AND in_test = 0 THEN 1 ELSE 0 END AS FP,
            CASE WHEN score > 0 AND in_test = 0 THEN score ELSE 0 END AS FP_score,
            CASE WHEN score = 0 AND in_test = 0 THEN 1 ELSE 0 END AS TN,
            CASE WHEN score = 0 AND in_test = 1 THEN 1 ELSE 0 END AS FN
        WITH 
            DISTINCT(author) AS author,
            SUM(TP) AS TP,
            SUM(TP_score) AS TP_score,
            SUM(FP) AS FP,
            SUM(FP_score) AS FP_score,
            SUM(TN) AS TN,
            SUM(FN) AS FN
        RETURN
            author.name AS author,
            TP, FP, TN, FN,
            TP_score, FP_score,
            toFloat(TP) / ( TP + FN ) AS Recall,
            toFloat(TP) / ( TP + FP) AS Precision,
            toFloat(TP + TN ) / (TP + FP + FN + TN) AS Accuracy,
            toFloat(TP_score) / (TP_score + FP_score) AS AdoptionRate
    """
    )
)

Unnamed: 0,author,TP,FP,TN,FN,TP_score,FP_score,Recall,Precision,Accuracy,AdoptionRate
0,Krzysztof Kaczmarski,4,10,80284,1,2.1176,7.83675,0.8,0.285714,0.999863,0.212731
1,Hector Garcia-Molina,9,29,80242,19,2.574,16.3452,0.321429,0.236842,0.999402,0.136052
2,Christine Parent,6,22,80263,8,1.78185,6.53343,0.428571,0.214286,0.999626,0.214286
3,Robert Meersman,9,54,80223,13,5.07593,46.2501,0.409091,0.142857,0.999166,0.0988958
4,Pier Luigi Emiliani,3,3,80293,0,1.67433,2.55892,1.0,0.5,0.999963,0.395519
5,Hidenao Abe,2,5,80291,1,1.0278,8.54772,0.666667,0.285714,0.999925,0.107336
6,Mario Lamberger,2,4,80288,5,1.82048,4.70587,0.285714,0.333333,0.999888,0.278943
7,Kaoru Inoue,3,7,80285,4,3.90865,9.98134,0.428571,0.3,0.999863,0.2814
8,Gabriel Thierrin,2,4,80292,1,1.11622,3.11703,0.666667,0.333333,0.999938,0.26368
9,Marcin Gogolewski,2,5,80291,1,2.27047,3.29826,0.666667,0.285714,0.999925,0.407717


### Common Neighbors algorithm Scores

The Common Neighbors score is computed using the following formula:

$C N(x, y)=|N(x) \cap N(y)|$

where $N(x)$ is the set of nodes adjacent to node $x$, and $N(y)$ is the set of nodes adjacent to node $y$.

A value of 0 indicates that two nodes are not close, while higher values indicate nodes are closer.
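On a toy graph the score is simply the size of a set intersection. The sketch below is illustrative only; the `graph` dictionary and the function name are hypothetical.

```python
# Hypothetical undirected co-author graph as adjacency sets.
graph = {
    "x": {"u", "v"},
    "y": {"u", "v", "w"},
    "z": {"w"},
    "u": {"x", "y"},
    "v": {"x", "y"},
    "w": {"y", "z"},
}


def common_neighbours(graph, a, b):
    """Number of nodes adjacent to both a and b."""
    return len(graph[a] & graph[b])


print(common_neighbours(graph, "x", "y"))  # u and v are shared
print(common_neighbours(graph, "x", "z"))  # nothing shared
```

Pairs without a shared co-author score exactly 0, which is what lets a score of 0 act as a negative prediction.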

In [23]:
neo4jpy.to_df(
    neo4jpy.query("""
        WITH [
            "Krzysztof Kaczmarski", 
            "Hector Garcia-Molina", 
            "Christine Parent", 
            "Robert Meersman", 
            "Pier Luigi Emiliani", 
            "Hidenao Abe", 
            "Mario Lamberger", 
            "Kaoru Inoue", 
            "Gabriel Thierrin", 
            "Marcin Gogolewski"
        ] AS author_names
        UNWIND author_names AS author_name
        MATCH (other:Author)
        MATCH (author:Author {name: author_name})
        OPTIONAL MATCH (author:Author) - [r:COAUTHOR_TEST] - (other:Author)
        WHERE NOT (author) - [:COAUTHOR_TRAIN] - (other)
        WITH 
            author AS author, 
            other AS coauthor,
            algo.linkprediction.commonNeighbors(author, other, { relationshipQuery: "COAUTHOR_TRAIN", direction: "BOTH" }) AS score,
            COUNT(r) AS in_test
        WITH 
            author,
            coauthor,
            score,
            in_test,
            CASE WHEN score > 0 AND in_test = 1 THEN 1 ELSE 0 END AS TP,
            CASE WHEN score > 0 AND in_test = 1 THEN score ELSE 0 END AS TP_score,
            CASE WHEN score > 0 AND in_test = 0 THEN 1 ELSE 0 END AS FP,
            CASE WHEN score > 0 AND in_test = 0 THEN score ELSE 0 END AS FP_score,
            CASE WHEN score = 0 AND in_test = 0 THEN 1 ELSE 0 END AS TN,
            CASE WHEN score = 0 AND in_test = 1 THEN 1 ELSE 0 END AS FN
        WITH 
            DISTINCT(author) AS author,
            SUM(TP) AS TP,
            SUM(TP_score) AS TP_score,
            SUM(FP) AS FP,
            SUM(FP_score) AS FP_score,
            SUM(TN) AS TN,
            SUM(FN) AS FN
        RETURN
            author.name AS author,
            TP, FP, TN, FN,
            TP_score, FP_score,
            toFloat(TP) / ( TP + FN ) AS Recall,
            toFloat(TP) / ( TP + FP) AS Precision,
            toFloat(TP + TN ) / (TP + FP + FN + TN) AS Accuracy,
            toFloat(TP_score) / (TP_score + FP_score) AS AdoptionRate
    """
    )
)

Unnamed: 0,author,TP,FP,TN,FN,TP_score,FP_score,Recall,Precision,Accuracy,AdoptionRate
0,Krzysztof Kaczmarski,4,10,80284,1,5,16,0.8,0.285714,0.999863,0.238095
1,Hector Garcia-Molina,9,29,80242,19,9,35,0.321429,0.236842,0.999402,0.204545
2,Christine Parent,6,22,80263,8,6,22,0.428571,0.214286,0.999626,0.214286
3,Robert Meersman,9,54,80223,13,16,113,0.409091,0.142857,0.999166,0.124031
4,Pier Luigi Emiliani,3,3,80293,0,3,3,1.0,0.5,0.999963,0.5
5,Hidenao Abe,2,5,80291,1,2,13,0.666667,0.285714,0.999925,0.133333
6,Mario Lamberger,2,4,80288,5,2,4,0.285714,0.333333,0.999888,0.333333
7,Kaoru Inoue,3,7,80285,4,9,21,0.428571,0.3,0.999863,0.3
8,Gabriel Thierrin,2,4,80292,1,2,4,0.666667,0.333333,0.999938,0.333333
9,Marcin Gogolewski,2,5,80291,1,4,6,0.666667,0.285714,0.999925,0.4


### Preferential Attachment algorithm Scores

Preferential attachment means that the more connected a node is, the more likely it is to receive new links.
This algorithm was popularised by Albert-László Barabási and Réka Albert through their work on scale-free networks. It is computed
using the following formula:

$P A(x, y)=|N(x)| *|N(y)|$

where $\mathrm{N}(u)$ is the set of nodes adjacent to $u$.

A value of 0 indicates that two nodes are not close, while higher values indicate that nodes are closer.
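A small hand computation shows how this score behaves. The sketch is illustrative only, with a hypothetical toy `graph`; it is not the `algo.linkprediction.preferentialAttachment` implementation.

```python
# Hypothetical undirected co-author graph as adjacency sets.
graph = {
    "x": {"u", "v"},
    "y": {"w"},
    "u": {"x"},
    "v": {"x"},
    "w": {"y"},
    "s": set(),  # an author with no co-authors in the training graph
}


def preferential_attachment(graph, a, b):
    """Product of the neighbour counts of a and b."""
    return len(graph[a]) * len(graph[b])


print(preferential_attachment(graph, "x", "y"))  # 2 * 1, although no neighbour is shared
print(preferential_attachment(graph, "x", "s"))  # 0, because s has no neighbours
```

The score is positive for any two nodes that each have at least one neighbour, whether or not they share one.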

In [24]:
neo4jpy.to_df(
    neo4jpy.query("""
        WITH [
            "Krzysztof Kaczmarski", 
            "Hector Garcia-Molina", 
            "Christine Parent", 
            "Robert Meersman", 
            "Pier Luigi Emiliani", 
            "Hidenao Abe", 
            "Mario Lamberger", 
            "Kaoru Inoue", 
            "Gabriel Thierrin", 
            "Marcin Gogolewski"
        ] AS author_names
        UNWIND author_names AS author_name
        MATCH (other:Author)
        MATCH (author:Author {name: author_name})
        OPTIONAL MATCH (author:Author) - [r:COAUTHOR_TEST] - (other:Author)
        WHERE NOT (author) - [:COAUTHOR_TRAIN] - (other)
        WITH 
            author AS author, 
            other AS coauthor,
            algo.linkprediction.preferentialAttachment(author, other, { relationshipQuery: "COAUTHOR_TRAIN", direction: "BOTH" }) AS score,
            COUNT(r) AS in_test
        WITH 
            author,
            coauthor,
            score,
            in_test,
            CASE WHEN score > 0 AND in_test = 1 THEN 1 ELSE 0 END AS TP,
            CASE WHEN score > 0 AND in_test = 1 THEN score ELSE 0 END AS TP_score,
            CASE WHEN score > 0 AND in_test = 0 THEN 1 ELSE 0 END AS FP,
            CASE WHEN score > 0 AND in_test = 0 THEN score ELSE 0 END AS FP_score,
            CASE WHEN score = 0 AND in_test = 0 THEN 1 ELSE 0 END AS TN,
            CASE WHEN score = 0 AND in_test = 1 THEN 1 ELSE 0 END AS FN
        WITH 
            DISTINCT(author) AS author,
            SUM(TP) AS TP,
            SUM(TP_score) AS TP_score,
            SUM(FP) AS FP,
            SUM(FP_score) AS FP_score,
            SUM(TN) AS TN,
            SUM(FN) AS FN
        RETURN
            author.name AS author,
            TP, FP, TN, FN,
            TP_score, FP_score,
            toFloat(TP) / ( TP + FN ) AS Recall,
            toFloat(TP) / ( TP + FP) AS Precision,
            toFloat(TP + TN ) / (TP + FP + FN + TN) AS Accuracy,
            toFloat(TP_score) / (TP_score + FP_score) AS AdoptionRate
    """
    )
)

Unnamed: 0,author,TP,FP,TN,FN,TP_score,FP_score,Recall,Precision,Accuracy,AdoptionRate
0,Krzysztof Kaczmarski,4,45014,35280,1,39,486537.0,0.8,8.88533e-05,0.439408,8.01519e-05
1,Hector Garcia-Molina,17,45001,35270,11,2432,1295100.0,0.607143,0.000377627,0.439445,0.00187432
2,Christine Parent,13,45005,35280,1,213,161979.0,0.928571,0.000288773,0.43952,0.00131326
3,Robert Meersman,15,45003,35274,7,2870,2267820.0,0.681818,0.0003332,0.43947,0.00126393
4,Pier Luigi Emiliani,3,45015,35281,0,24,324360.0,1.0,6.664e-05,0.439408,7.39864e-05
5,Hidenao Abe,3,45015,35281,0,48,648720.0,1.0,6.664e-05,0.439408,7.39864e-05
6,Mario Lamberger,4,45014,35278,3,116,648652.0,0.571429,8.88533e-05,0.439383,0.0001788
7,Kaoru Inoue,3,45015,35277,4,96,648672.0,0.428571,6.664e-05,0.439358,0.000147973
8,Gabriel Thierrin,2,45016,35280,1,12,324372.0,0.666667,4.44267e-05,0.439383,3.69932e-05
9,Marcin Gogolewski,2,45016,35280,1,12,324372.0,0.666667,4.44267e-05,0.439383,3.69932e-05


### Resource Allocation algorithm Scores

The Resource Allocation algorithm was introduced in 2009 by Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang as part of a
study to predict links in various networks. It is computed using the following formula:

$R A(x, y)=\sum_{u \in N(x) \cap N(y)} \frac{1}{|N(u)|}$

where $N(u)$ is the set of nodes adjacent to $u$.

A value of 0 indicates that two nodes are not close, while higher values indicate nodes are closer.
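The formula differs from Adamic Adar only in the weight given to each common neighbour. A minimal sketch on a hypothetical toy `graph`, not the `algo.linkprediction.resourceAllocation` implementation:

```python
# Hypothetical undirected co-author graph as adjacency sets.
graph = {
    "x": {"u", "v"},
    "y": {"u", "v"},
    "u": {"x", "y", "w"},
    "v": {"x", "y"},
    "w": {"u"},
}


def resource_allocation(graph, a, b):
    """Sum 1 / |N(u)| over the common neighbours u of a and b."""
    return sum(1.0 / len(graph[u]) for u in graph[a] & graph[b])


print(resource_allocation(graph, "x", "y"))  # 1/3 + 1/2
print(resource_allocation(graph, "v", "w"))  # no common neighbour
```

Because the weight is 1/|N(u)| rather than 1/log|N(u)|, highly connected common neighbours contribute less than they do under Adamic Adar.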

In [25]:
neo4jpy.to_df(
    neo4jpy.query("""
        WITH [
            "Krzysztof Kaczmarski", 
            "Hector Garcia-Molina", 
            "Christine Parent", 
            "Robert Meersman", 
            "Pier Luigi Emiliani", 
            "Hidenao Abe", 
            "Mario Lamberger", 
            "Kaoru Inoue", 
            "Gabriel Thierrin", 
            "Marcin Gogolewski"
        ] AS author_names
        UNWIND author_names AS author_name
        MATCH (other:Author)
        MATCH (author:Author {name: author_name})
        OPTIONAL MATCH (author:Author)-[r:COAUTHOR_TEST]-(other:Author)
        WHERE not (author) -[:COAUTHOR_TRAIN]-(other)
        WITH 
            author AS author, 
            other AS coauthor,
            algo.linkprediction.resourceAllocation(author, other, { relationshipQuery: "COAUTHOR_TRAIN", direction: "BOTH" }) AS score,
            count(r) AS in_test
        WITH 
            author,
            coauthor,
            score,
            in_test,
            CASE WHEN score > 0 AND in_test = 1 THEN 1 ELSE 0 END AS TP,
            CASE WHEN score > 0 AND in_test = 1 THEN score ELSE 0 END AS TP_score,
            CASE WHEN score > 0 AND in_test = 0 THEN 1 ELSE 0 END AS FP,
            CASE WHEN score > 0 AND in_test = 0 THEN score ELSE 0 END AS FP_score,
            CASE WHEN score = 0 AND in_test = 0 THEN 1 ELSE 0 END AS TN,
            CASE WHEN score = 0 AND in_test = 1 THEN 1 ELSE 0 END AS FN
        WITH 
            DISTINCT(author) AS author,
            SUM(TP) AS TP,
            SUM(TP_score) AS TP_score,
            SUM(FP) AS FP,
            SUM(FP_score) AS FP_score,
            SUM(TN) AS TN,
            SUM(FN) AS FN
        RETURN
            author.name AS author,
            TP, FP, TN, FN,
            TP_score, FP_score,
            toFloat(TP) / ( TP + FN ) AS Recall,
            toFloat(TP) / ( TP + FP) AS Precision,
            toFloat(TP + TN ) / (TP + FP + FN + TN) AS Accuracy,
            toFloat(TP_score) / (TP_score + FP_score) AS AdoptionRate
    """
    )
)

Unnamed: 0,author,TP,FP,TN,FN,TP_score,FP_score,Recall,Precision,Accuracy,AdoptionRate
0,Krzysztof Kaczmarski,4,10,80284,1,0.474359,2.08205,0.8,0.285714,0.999863,0.185557
1,Hector Garcia-Molina,9,29,80242,19,0.272727,4.16364,0.321429,0.236842,0.999402,0.0614754
2,Christine Parent,6,22,80263,8,0.206897,0.758621,0.428571,0.214286,0.999626,0.214286
3,Robert Meersman,9,54,80223,13,0.718313,10.1933,0.409091,0.142857,0.999166,0.0658299
4,Pier Luigi Emiliani,3,3,80293,0,0.5,0.833333,1.0,0.5,0.999963,0.375
5,Hidenao Abe,2,5,80291,1,0.285714,2.82143,0.666667,0.285714,0.999925,0.091954
6,Mario Lamberger,2,4,80288,5,0.666667,1.66667,0.285714,0.333333,0.999888,0.285714
7,Kaoru Inoue,3,7,80285,4,0.9,2.55,0.428571,0.3,0.999863,0.26087
8,Gabriel Thierrin,2,4,80292,1,0.333333,1.0,0.666667,0.333333,0.999938,0.25
9,Marcin Gogolewski,2,5,80291,1,0.685714,0.971429,0.666667,0.285714,0.999925,0.413793


### Total Neighbors algorithm Scores

Total Neighbors is computed using the following formula:

$T N(x, y)=|N(x) \cup N(y)|$

where $N(x)$ is the set of nodes adjacent to $x$, and $N(y)$ is the set of nodes adjacent to $y$.

A value of 0 indicates that two nodes are not close, while higher values indicate nodes are closer.
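On a toy graph the score is the size of a set union. The sketch below is illustrative only; the `graph` dictionary and the function name are hypothetical.

```python
# Hypothetical undirected co-author graph as adjacency sets.
graph = {
    "x": {"u", "v"},
    "y": {"v", "w"},
    "u": {"x"},
    "v": {"x", "y"},
    "w": {"y"},
    "s": set(),  # an author with no co-authors in the training graph
}


def total_neighbours(graph, a, b):
    """Number of distinct nodes adjacent to a or b."""
    return len(graph[a] | graph[b])


print(total_neighbours(graph, "x", "y"))  # u, v and w
print(total_neighbours(graph, "x", "s"))  # positive although s is isolated
```

The score is positive as soon as either node has a neighbour, so for a well-connected author practically every other node receives a positive score.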

In [26]:
neo4jpy.to_df(
    neo4jpy.query("""
        WITH [
            "Krzysztof Kaczmarski", 
            "Hector Garcia-Molina", 
            "Christine Parent", 
            "Robert Meersman", 
            "Pier Luigi Emiliani", 
            "Hidenao Abe", 
            "Mario Lamberger", 
            "Kaoru Inoue", 
            "Gabriel Thierrin", 
            "Marcin Gogolewski"
        ] AS author_names
        UNWIND author_names AS author_name
        MATCH (other:Author)
        MATCH (author:Author {name: author_name})
        OPTIONAL MATCH (author:Author)-[r:COAUTHOR_TEST]-(other:Author)
        WHERE not (author) -[:COAUTHOR_TRAIN]-(other)
        WITH 
            author AS author, 
            other AS coauthor,
            algo.linkprediction.totalNeighbors(author, other, { relationshipQuery: "COAUTHOR_TRAIN", direction: "BOTH" }) AS score,
            count(r) AS in_test
        WITH 
            author,
            coauthor,
            score,
            in_test,
            CASE WHEN score > 0 AND in_test = 1 THEN 1 ELSE 0 END AS TP,
            CASE WHEN score > 0 AND in_test = 1 THEN score ELSE 0 END AS TP_score,
            CASE WHEN score > 0 AND in_test = 0 THEN 1 ELSE 0 END AS FP,
            CASE WHEN score > 0 AND in_test = 0 THEN score ELSE 0 END AS FP_score,
            CASE WHEN score = 0 AND in_test = 0 THEN 1 ELSE 0 END AS TN,
            CASE WHEN score = 0 AND in_test = 1 THEN 1 ELSE 0 END AS FN
        WITH 
            DISTINCT(author) AS author,
            SUM(TP) AS TP,
            SUM(TP_score) AS TP_score,
            SUM(FP) AS FP,
            SUM(FP_score) AS FP_score,
            SUM(TN) AS TN,
            SUM(FN) AS FN
        RETURN
            author.name AS author,
            TP, FP, TN, FN,
            TP_score, FP_score,
            toFloat(TP) / ( TP + FN ) AS Recall,
            toFloat(TP) / ( TP + FP) AS Precision,
            toFloat(TP + TN ) / (TP + FP + FN + TN) AS Accuracy,
            toFloat(TP_score) / (TP_score + FP_score) AS AdoptionRate
    """)
)

Unnamed: 0,author,TP,FP,TN,FN,TP_score,FP_score,Recall,Precision,Accuracy,AdoptionRate
0,Krzysztof Kaczmarski,5,80294,0,0,23,403042.0,1,6.22673e-05,6.22673e-05,5.70628e-05
1,Hector Garcia-Molina,28,80271,0,0,519,804013.0,1,0.000348697,0.000348697,0.000645096
2,Christine Parent,14,80285,0,0,221,242241.0,1,0.000174348,0.000174348,0.000911483
3,Robert Meersman,22,80277,0,0,497,1285740.0,1,0.000273976,0.000273976,0.000386399
4,Pier Luigi Emiliani,3,80296,0,0,15,322767.0,1,3.73604e-05,3.73604e-05,4.6471e-05
5,Hidenao Abe,3,80296,0,0,22,483347.0,1,3.73604e-05,3.73604e-05,4.55139e-05
6,Mario Lamberger,7,80292,0,0,55,483323.0,1,8.71742e-05,8.71742e-05,0.000113783
7,Kaoru Inoue,7,80292,0,0,43,483311.0,1,8.71742e-05,8.71742e-05,8.89617e-05
8,Gabriel Thierrin,3,80296,0,0,10,322772.0,1,3.73604e-05,3.73604e-05,3.09807e-05
9,Marcin Gogolewski,3,80296,0,0,8,322770.0,1,3.73604e-05,3.73604e-05,2.47848e-05


### Conclusion

In general, we can see that **Adamic Adar**, **Common Neighbors** and **Resource Allocation** achieve higher accuracy than the other algorithms. This is because their
prediction score is exactly *0.0* when the algorithm, based on the train set, does not predict that an edge will appear.

**Preferential Attachment** and **Total Neighbors**, on the other hand, produce a positive score for most or all edges. For these algorithms to work in our case, we would need a grid search to find the best cut-off value, such that

- a **score < value** is counted as a negative prediction
- a **score > value** is counted as a positive prediction
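Such a cut-off search could look like the sketch below. It is a hypothetical illustration and was not run in this assignment: the function `best_cutoff`, the choice of F1 as the selection criterion and the sample scores are all assumptions.

```python
def best_cutoff(scored_pairs):
    """scored_pairs: list of (score, adopted). Return (cutoff, f1) maximising F1.

    A pair is predicted positive when score > cutoff.
    """
    best = (None, -1.0)
    for cutoff in sorted({score for score, _ in scored_pairs}):
        tp = sum(1 for s, a in scored_pairs if s > cutoff and a)
        fp = sum(1 for s, a in scored_pairs if s > cutoff and not a)
        fn = sum(1 for s, a in scored_pairs if s <= cutoff and a)
        if tp == 0:
            continue  # F1 is zero or undefined without true positives
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best[1]:
            best = (cutoff, f1)
    return best


# Hypothetical scores for five candidate pairs.
pairs = [(12, False), (24, False), (48, True), (96, True), (6, False)]
print(best_cutoff(pairs))
```

Only the observed scores are tried as cut-offs, since any value between two consecutive scores yields the same predictions.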

In any case, since our train and test graphs are quite sparse, and since we try to predict edges to all the **other** not already connected nodes, any naive algorithm that predicts only negatives would also achieve a very high accuracy.
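This can be checked with a short calculation. The counts below are taken from the Adamic Adar row for Hector Garcia-Molina in the results above; the all-negative baseline itself is hypothetical and was not run in this assignment.

```python
# Counts from the Adamic Adar row for Hector Garcia-Molina.
tp, fp, tn, fn = 9, 29, 80242, 19
total = tp + fp + tn + fn
positives = tp + fn  # collaborations that actually appeared after 2005

model_accuracy = (tp + tn) / total
# A baseline that predicts "no link" for every pair is wrong only on the positives.
all_negative_accuracy = (total - positives) / total

print(round(model_accuracy, 6))
print(round(all_negative_accuracy, 6))
```

For this author the all-negative baseline is slightly more accurate than the algorithm, which is why precision, recall and adoption rate are the more informative metrics here.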

---