Project SNA by Aleksandra Elena Getman (r0884498) and Vaishnav Dilip (r0872689)

![Javatpoint](https://wallpaper-mania.com/wp-content/uploads/2018/09/High_resolution_wallpaper_background_ID_77702108554-optimized.jpg) 

## About the Game of Thrones (HBO Series)

Game of Thrones is a fantasy drama television series that ran for 8 seasons and a total of 73 episodes.

The series follows three main storylines:
1. The fight for the Iron Throne of the Seven Kingdoms, in which various noble characters weave a web of political conflicts to claim the throne or to win independence from whoever is sitting on it
2. The legal descendants of the dynasty trying to reclaim their birthright to the Iron Throne
3. The Night's Watch (a military order stationed at the northern border) trying to protect all the Kingdoms from mythical creatures

(Source: https://en.wikipedia.org/wiki/Game_of_Thrones#:~:text=Game%20of%20Thrones%20is%20an,is%20A%20Game%20of%20Thrones)

# Introduction

In this notebook, we present our graph database (Neo4j) project, which uses the Game of Thrones dataset from the following GitHub repository: https://github.com/mathbeveridge/gameofthrones
This repository contains pairs of characters from the HBO series, connected by undirected edges that are weighted by the number of interactions between them.

__This dataset counts 5 types of interactions:__
1. Character A speaks directly after Character B
2. Character A speaks about Character B
3. Character C speaks about Characters A and B
4. Characters A and B are mentioned in the same stage direction
5. Characters A and B appear in a scene together

However, the data files do not state the exact type of each interaction; only the overall total per pair of characters is provided.
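
To make this data model concrete, here is a minimal sketch of how interaction counts could be summed into one undirected, weighted edge per pair of characters. The sample records are made up for illustration and are not taken from the repository, and `aggregate_undirected` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical interaction records (source, target, count); not real data.
records = [
    ("NED", "ROBERT", 3),
    ("ROBERT", "NED", 2),
    ("NED", "CATELYN", 4),
]

def aggregate_undirected(rows):
    """Sum interaction counts per unordered pair of characters."""
    weights = defaultdict(int)
    for a, b, count in rows:
        # Sorting the pair makes (A, B) and (B, A) map to the same edge.
        key = tuple(sorted((a, b)))
        weights[key] += count
    return dict(weights)

edges = aggregate_undirected(records)
```

Each key of `edges` is one undirected edge and each value is its weight, which mirrors the Source, Target and Weight columns that the loading code below reads.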


__Our project objectives__

1. Community mining
2. Link prediction by using network embeddings

![Neo4j logo](https://upload.wikimedia.org/wikipedia/commons/e/e5/Neo4j-logo_color.png) 

# Creating the Graph Database in Neo4j

In [1]:
from py2neo import Graph
import pandas as pd
import numpy as np
import os

In [2]:
graph = Graph("bolt://localhost:7687", auth=("pizza", "superman"))

In [3]:
graph.run("MATCH (n) DETACH DELETE n")

In [4]:
#Making nodes function
def create_nodes(df):
    """Function to create the nodes from the csv files.

    Returns:
        str: Acknowledgement string
    """
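    # Query parameters avoid string concatenation, so ids and labels
    # containing quotes cannot break the Cypher statement.
    cypher = "MERGE (a:Person {id: $id, label: $label}) SET a.seed = $seed"
    for seed, idx in enumerate(df.index):
        id_ = df.loc[idx, 'Id']
        # Apostrophes are still stripped so the stored labels stay unchanged.
        label = df.loc[idx, 'Label'].replace('\'', '')
        graph.run(cypher, id=id_, label=label, seed=seed)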
    return "Done creating nodes"

In [5]:
#Making edges
def create_edges(df):
    """Function to create the edges from the csv files.

    Returns:
        str: Acknowledgement string
    """
    for idx in df.index:
        src = df.loc[idx, 'Source']
        tar = df.loc[idx, 'Target']
        # Cast NumPy integers to plain ints before using them as parameters.
        weight = int(df.loc[idx, 'Weight'])
        season = int(df.loc[idx, 'Season'])
        # Relationship types cannot be parameterised, so only the season
        # number (already cast to int) is interpolated into the query.
        cypher = (
            "MATCH (src:Person {id: $src}), (tar:Person {id: $tar}) "
            f"MERGE (src)-[r:INTERACTS_{season}]-(tar) "
            "SET r.weight = $weight, r.season = $season"
        )
        graph.run(cypher, src=src, tar=tar, weight=weight, season=season)
    return "Done creating edges"
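
Sending one query per row is simple but slow for larger files. As a possible alternative, the sketch below builds a single `UNWIND` query that loads all edges of a season at once. `build_edge_batch` is a hypothetical helper, the weight in the example call is a made-up value, and the function only builds the query text and parameters so that it can be inspected without a running database.

```python
def build_edge_batch(rows, season):
    """Return (cypher, parameters) for loading one season's edges in one query.

    `rows` is an iterable of (source, target, weight) tuples.
    """
    # Relationship types cannot be passed as parameters, so the season is
    # interpolated after being forced to an int.
    rel_type = f"INTERACTS_{int(season)}"
    cypher = (
        "UNWIND $rows AS row "
        "MATCH (src:Person {id: row.source}) "
        "MATCH (tar:Person {id: row.target}) "
        f"MERGE (src)-[r:{rel_type}]-(tar) "
        "SET r.weight = row.weight, r.season = $season"
    )
    params = {
        "rows": [
            {"source": s, "target": t, "weight": int(w)} for s, t, w in rows
        ],
        "season": int(season),
    }
    return cypher, params

# Example call with a made-up weight.
cypher, params = build_edge_batch([("NED", "ROBERT", 10)], 1)
# With the py2neo connection from this notebook: graph.run(cypher, params)
```

The design choice here is to keep query construction separate from execution, which makes the Cypher text easy to print and check before it touches the database.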

In [6]:
#Creating graph
def create_graph():
    """Function to create the graph from the csv files.

    Returns:
        str: Acknowledgement string
    """
    base_url = "https://raw.githubusercontent.com/mathbeveridge/gameofthrones/master/data/"
    for season in range(1,9):
        # 'kind' is used instead of 'type' to avoid shadowing the built-in.
        for kind in ['nodes','edges']:
            final_url = f"{base_url}got-s{season}-{kind}.csv"
            df = pd.read_csv(final_url)
            if kind == 'nodes':
                create_nodes(df)
            else:
                create_edges(df)
    return "Done creating graph"

In [7]:
create_graph()

'Done creating graph'

# Community mining in each Season

In this section, we apply the Louvain modularity algorithm to find communities in each season. We analyse every season separately because Game of Thrones is a series in which characters frequently betray one another for personal gain, so alliances shift over time. It is also interesting to see how the connections between characters change from season to season.

After applying the algorithm, we use 5 centrality measures to get a broader picture of which characters are important in each season. We interpret the 5 measures in the same way as https://networkofthrones.wordpress.com/, which carried out similar research.

1. Degree centrality: how many people does the character know?
2. Weighted degree centrality: how many interactions does the character have?
3. Eigenvector centrality: how many important people does the character know?
4. PageRank: how many important interactions does the character have?
5. Betweenness: does the character help to connect the whole network?
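
The first two measures can be worked out by hand on a small example. The following is a minimal sketch on a made-up toy graph (node names and weights are illustrative assumptions, not values from the dataset); it is not meant to describe how GDS computes the scores internally.

```python
# Toy undirected weighted edges (made-up values, not from the dataset).
toy_edges = [("A", "B", 5), ("A", "C", 1), ("B", "C", 2), ("C", "D", 7)]

def degree_scores(edge_list):
    """Return (degree, weighted_degree) dictionaries for an undirected graph."""
    degree, weighted = {}, {}
    for u, v, w in edge_list:
        # Every undirected edge contributes to both of its endpoints.
        for node in (u, v):
            degree[node] = degree.get(node, 0) + 1
            weighted[node] = weighted.get(node, 0) + w
    return degree, weighted

degree, weighted = degree_scores(toy_edges)
```

In this toy graph, node C has the most neighbours, and D, despite having a single neighbour, has a high weighted degree because its one edge is heavy. This is the same distinction the notebook draws between knowing many people and having many interactions.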

## Graph for season 1

### Louvain Modularity Algorithm 

In [8]:
query = """
CALL gds.graph.project.cypher(
  'Got1',
  'MATCH (n:Person)-[r:INTERACTS_1]-(Person) RETURN id(n) AS id, n.seed AS seed',
  'MATCH (n:Person)-[r:INTERACTS_1]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight')
YIELD
    graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels

"""
graph.run(query)

graph,nodeQuery,nodes,relationshipQuery,rels
Got1,"MATCH (n:Person)-[r:INTERACTS_1]-(Person) RETURN id(n) AS id, n.seed AS seed",126,"MATCH (n:Person)-[r:INTERACTS_1]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight",1098
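
Because this projection cell is repeated for every season with only the season number changed, the query text could also be generated by a small helper. The sketch below is one possible way to do this; `projection_query` is a hypothetical helper name, not part of py2neo or GDS, and it only reproduces the query strings used in this notebook (with a shorter YIELD clause).

```python
def projection_query(season):
    """Build the Cypher projection call used in this notebook for one season."""
    season = int(season)  # only a plain integer is interpolated into the query
    rel = f"INTERACTS_{season}"
    node_query = (
        f"MATCH (n:Person)-[r:{rel}]-(Person) "
        "RETURN id(n) AS id, n.seed AS seed"
    )
    rel_query = (
        f"MATCH (n:Person)-[r:{rel}]-(p:Person) "
        "RETURN id(n) AS source, id(p) AS target, r.weight AS weight"
    )
    return (
        "CALL gds.graph.project.cypher(\n"
        f"  'Got{season}',\n"
        f"  '{node_query}',\n"
        f"  '{rel_query}')\n"
        "YIELD graphName AS graph, nodeCount AS nodes, "
        "relationshipCount AS rels"
    )

query = projection_query(1)
# With the connection from this notebook: graph.run(query)
```

Looping over `range(1, 9)` with such a helper would replace the eight near-identical projection cells.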


In [9]:
#The following will estimate the memory requirements for running the algorithm:
query = """
CALL gds.louvain.write.estimate('Got1', { writeProperty: 'community_S1' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
"""
graph.run(query)

nodeCount,relationshipCount,bytesMin,bytesMax,requiredMemory
126,1098,13369,582864,[13369 Bytes ... 569 KiB]


In [10]:
#The following will run the algorithm and stream results:
query = """
CALL gds.louvain.stream('Got1', { seedProperty: 'seed' })
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).label AS label, communityId
ORDER BY label ASC
"""
graph.run(query)

label,communityId
Addam,55
Aegon,125
Aemon,125


In [11]:
#The following will run the algorithm and return the results as summary statistics:
query = """
CALL gds.louvain.stats('Got1')
YIELD communityCount
"""
graph.run(query)

communityCount
6


In [12]:
#The following will run the algorithm and store the results in the projected in-memory graph 'Got1':
query = """
CALL gds.louvain.mutate('Got1', { mutateProperty: 'communityId' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
6,0.4500051426504889,"[0.40857860458326284, 0.4500051426504889]"


In [13]:
#The following will run the algorithm and write the results back to the database:
query = """
CALL gds.louvain.write('Got1', { writeProperty: 'community_S1' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
6,0.4500051426504889,"[0.40857860458326284, 0.4500051426504889]"


#### Discussion

The Louvain modularity algorithm is one way to identify the communities present in a network. Here, the best modularity found (about 0.45) is reached with 6 communities. This is close to what we remember from the series:
 1. the Night's Watch
 2. the Lannisters
 3. the Dothraki
 4. King's Landing
 5. the Starks
 6. Others
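
To make the modularity score more tangible, the sketch below computes the standard weighted modularity of a given partition, Q = sum over communities c of (L_c / m - (d_c / 2m)^2), where m is the total edge weight, L_c the weight of edges inside c, and d_c the summed weighted degree of the nodes in c. The toy graph and the partition are made up for illustration. Louvain itself searches greedily for a partition that increases this quantity; that search is not shown here.

```python
def modularity(edges, community):
    """Weighted modularity of a partition of an undirected graph without self-loops.

    edges: list of (u, v, weight); community: dict mapping node -> community id.
    """
    m = sum(w for _, _, w in edges)
    intra = {}       # L_c: weight of edges with both endpoints in community c
    degree_sum = {}  # d_c: summed weighted degree of the nodes in community c
    for u, v, w in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + w
        degree_sum[cv] = degree_sum.get(cv, 0) + w
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + w
    return sum(
        intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree_sum.items()
    )

# Toy graph (made up): two triangles joined by the single edge C-D.
toy = [("A", "B", 1), ("B", "C", 1), ("A", "C", 1),
       ("D", "E", 1), ("E", "F", 1), ("D", "F", 1), ("C", "D", 1)]
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(toy, two_groups)
q_single = modularity(toy, one_group)
```

Splitting the toy graph into its two triangles gives a clearly positive score, whereas putting every node into one community gives a score of zero, which is why a higher modularity is read as a more meaningful community structure.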

### 5 Basic centrality measures

In [31]:
#Degree
query = """
CALL gds.degree.stream('Got1')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
NED,57.0
TYRION,41.0
ROBERT,36.0


In [15]:
#Weighted Degree
query = """
CALL gds.degree.stream(
   'Got1',
   { relationshipWeightProperty: 'weight' }
)
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
NED,1290.0
TYRION,709.0
CATELYN,584.0


In [16]:
#Eigenvector
query = """
CALL gds.eigenvector.stream('Got1')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
NED,0.3150828603837995
ROBERT,0.2481904649529348
CERSEI,0.239263152443205
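
As an illustration of what eigenvector centrality measures, the following minimal sketch runs a plain power iteration on a made-up toy graph. It is a textbook formulation under the assumption of a connected, non-bipartite, unweighted graph, and it is not necessarily how GDS computes or scales its scores.

```python
import math

def eigenvector_centrality(adj, iterations=100):
    """Power iteration: each score becomes the sum of its neighbours' scores."""
    scores = {node: 1.0 for node in adj}
    for _ in range(iterations):
        new = {node: sum(scores[nb] for nb in adj[node]) for node in adj}
        # Rescale to unit Euclidean length so the values do not blow up.
        norm = math.sqrt(sum(v * v for v in new.values()))
        scores = {node: v / norm for node, v in new.items()}
    return scores

# Toy undirected graph (made up): a triangle A-B-C with D attached to C.
toy_adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
ev_scores = eigenvector_centrality(toy_adj)
```

Node C scores highest because it is connected to every other node, while D scores lowest even though its only neighbour is the most central node. A character's score depends on how well connected their contacts are, not only on how many contacts they have.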


In [17]:
#Pagerank
query = """
CALL gds.pageRank.stream('Got1',  {maxIterations: 20,  dampingFactor: 0.85})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS page, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

page,score
NED,5.710814289359831
TYRION,4.140412812879867
CATELYN,3.439708123022758
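
The two parameters passed to the PageRank call, `maxIterations` and `dampingFactor`, can be illustrated with a short sketch. The code below is the normalised textbook variant on a made-up, unweighted toy graph, so its scores sum to 1 and are not directly comparable to the GDS values above. It also assumes that every node has at least one neighbour.

```python
def pagerank(adj, damping=0.85, iterations=20):
    """Textbook PageRank on an undirected graph given as an adjacency dict."""
    n = len(adj)
    scores = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        # Every node receives a small base share, independent of its links.
        new = {node: (1.0 - damping) / n for node in adj}
        for node, neighbours in adj.items():
            # The damped score of a node is split evenly among its neighbours.
            share = damping * scores[node] / len(neighbours)
            for nb in neighbours:
                new[nb] += share
        scores = new
    return scores

# Toy star graph (made up): A is connected to B, C and D.
star = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
pr_scores = pagerank(star)
```

The damping factor controls how much of a node's score is passed on along its edges in each round, and the iteration count bounds how many rounds are run.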


In [18]:
#Betweenness
query = """
CALL gds.betweenness.stream('Got1')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
NED,4700.843226292089
TYRION,2526.9252462619834
CATELYN,1833.4028207687145
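
Betweenness counts how often a node lies on shortest paths between other nodes. The sketch below implements Brandes' algorithm for an unweighted, undirected toy graph (made up for illustration). Each pair of nodes is counted once, which is an assumption of this sketch and may differ from the scaling GDS uses.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)   # BFS distance from s (-1 = unseen)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes and accumulate dependencies.
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was visited from both ends, so halve the totals.
    return {node: score / 2 for node, score in bc.items()}

# Toy path graph (made up): A - B - C - D.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
bc_scores = betweenness(path)
```

On the path graph, the two inner nodes carry all shortest paths between the other nodes, while the end nodes carry none. In the same way, a character with high betweenness connects parts of the network that would otherwise be far apart.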


__Discussion__

Across all 5 centrality measures, Ned of House Stark is the main character. He became the right-hand man of Robert (the king on the Iron Throne in the first season). Because Robert is the king, it makes sense that he knows a lot of important people, yet he does not have many important interactions (based on the weighted degree and PageRank); instead, his right-hand man runs the whole Kingdom (Ned has a lot of important interactions).

The second character who not only knows a lot of people and has many interactions, but also helps to connect the network, is Tyrion, a member of the Lannister family and brother of Cersei (the queen). Throughout Game of Thrones, he feels left out by his family, who neglect him and blame him for the death of his mother, who died giving birth to him. As an act of rebellion, he tries to connect with the Starks (in particular with Ned), which makes him the node that holds the connections in place. Notice that Tyrion does not know many important people, which suggests that he is less interested in the power games than his sister Cersei (as seen in the eigenvector centrality).

Catelyn, on the other hand, plays Ned's supportive wife; because she is so close to Ned during the first season, she appears as an important character.

## Graph for season 2

### Louvain Modularity Algorithm

In [19]:
query = """
CALL gds.graph.project.cypher(
  'Got2',
  'MATCH (n:Person)-[r:INTERACTS_2]-(Person) RETURN id(n) AS id, n.seed AS seed',
  'MATCH (n:Person)-[r:INTERACTS_2]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight')
YIELD
    graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels

"""
graph.run(query)

graph,nodeQuery,nodes,relationshipQuery,rels
Got2,"MATCH (n:Person)-[r:INTERACTS_2]-(Person) RETURN id(n) AS id, n.seed AS seed",129,"MATCH (n:Person)-[r:INTERACTS_2]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight",972


In [20]:
#The following will estimate the memory requirements for running the algorithm:
query = """
CALL gds.louvain.write.estimate('Got2', { writeProperty: 'community_S2' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
"""
graph.run(query)

nodeCount,relationshipCount,bytesMin,bytesMax,requiredMemory
129,972,13569,583456,[13569 Bytes ... 569 KiB]


In [21]:
#The following will run the algorithm and stream results:
query = """
CALL gds.louvain.stream('Got2', { seedProperty: 'seed' })
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).label AS label, communityId
ORDER BY label ASC
"""
graph.run(query)

label,communityId
Aegon,123
Aerys,39
Alton,39


In [22]:
#The following will run the algorithm and return the results as summary statistics:
query = """
CALL gds.louvain.stats('Got2')
YIELD communityCount
"""
graph.run(query)

communityCount
6


In [23]:
#The following will run the algorithm and store the results in the projected in-memory graph 'Got2':
query = """
CALL gds.louvain.mutate('Got2', { mutateProperty: 'communityId' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
6,0.5675519483818524,"[0.5333155514911345, 0.5675519483818524]"


In [24]:
#The following will run the algorithm and write the results back to the database:
query = """
CALL gds.louvain.write('Got2', { writeProperty: 'community_S2' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
6,0.5675519483818524,"[0.5333155514911345, 0.5675519483818524]"


__Discussion__

Here again, 6 communities were identified:
1. the Far North
2. the Red Waste and Qarth 
3. King's Landing
4. Harrenhal
5. the Riverlands and the Stormlands
6. the Iron Islands and Winterfell

### 5 basic centrality measures 

In [26]:
#Degree
query = """
CALL gds.degree.stream('Got2')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JOFFREY,36.0
TYRION,33.0
CERSEI,31.0


In [27]:
#Weighted Degree
query = """
CALL gds.degree.stream(
   'Got2',
   { relationshipWeightProperty: 'weight' }
)
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
TYRION,960.0
CERSEI,549.0
ARYA,481.0


In [28]:
#Eigenvector
query = """
CALL gds.eigenvector.stream('Got2')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JOFFREY,0.3133074261625616
CERSEI,0.2829819369741506
TYRION,0.2773242931270157


In [29]:
#Pagerank
query = """
CALL gds.pageRank.stream('Got2',  {maxIterations: 20,  dampingFactor: 0.85})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS page, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

page,score
JOFFREY,3.613951461011889
TYRION,3.4854517741397
ARYA,3.4014533093263664


In [30]:
#Betweenness
query = """
CALL gds.betweenness.stream('Got2')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
ARYA,2455.341556287124
TYRION,2293.402566916603
ROBB,1902.715964088471


__Discussion__


## Graph for season 3

### Louvain Modularity Algorithm

In [33]:
query = """
CALL gds.graph.project.cypher(
  'Got3',
  'MATCH (n:Person)-[r:INTERACTS_3]-(Person) RETURN id(n) AS id, n.seed AS seed',
  'MATCH (n:Person)-[r:INTERACTS_3]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight')
YIELD
    graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels

"""
graph.run(query)

graph,nodeQuery,nodes,relationshipQuery,rels
Got3,"MATCH (n:Person)-[r:INTERACTS_3]-(Person) RETURN id(n) AS id, n.seed AS seed",125,"MATCH (n:Person)-[r:INTERACTS_3]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight",1014


In [34]:
#The following will estimate the memory requirements for running the algorithm:
query = """
CALL gds.louvain.write.estimate('Got3', { writeProperty: 'community_S3' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
"""
graph.run(query)

nodeCount,relationshipCount,bytesMin,bytesMax,requiredMemory
125,1014,13305,582696,[13305 Bytes ... 569 KiB]


In [35]:
#The following will run the algorithm and stream results:
query = """
CALL gds.louvain.stream('Got3', { seedProperty: 'seed' })
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).label AS label, communityId
ORDER BY label ASC
"""
graph.run(query)

label,communityId
Aegon,15
Aemon,19
Aerys,82


In [36]:
#The following will run the algorithm and return the results as summary statistics:
query = """
CALL gds.louvain.stats('Got3')
YIELD communityCount
"""
graph.run(query)

communityCount
6


In [37]:
#The following will run the algorithm and store the results in the projected in-memory graph 'Got3':
query = """
CALL gds.louvain.mutate('Got3', { mutateProperty: 'communityId' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
6,0.6260635131823116,"[0.6018171632645916, 0.6260635131823116]"


In [38]:
#The following will run the algorithm and write the results back to the database:
query = """
CALL gds.louvain.write('Got3', { writeProperty: 'community_S3' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
6,0.6260635131823116,"[0.6018171632645916, 0.6260635131823116]"


### 5 basic centrality measures 

In [39]:
#Degree
query = """
CALL gds.degree.stream('Got3')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
ROBB,31.0
TYWIN,29.0
TYRION,25.0


In [40]:
#Weighted Degree
query = """
CALL gds.degree.stream(
   'Got3',
   { relationshipWeightProperty: 'weight' }
)
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
TYRION,653.0
ROBB,375.0
TYWIN,366.0


In [41]:
#Eigenvector
query = """
CALL gds.eigenvector.stream('Got3')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
TYWIN,0.294075789383587
TYRION,0.2683662004062233
JOFFREY,0.2498390261687947


In [42]:
#Pagerank
query = """
CALL gds.pageRank.stream('Got3',  {maxIterations: 20,  dampingFactor: 0.85})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS page, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

page,score
ROBB,3.1307713054929844
TYWIN,2.874792630046823
TYRION,2.5522894149675976


In [43]:
#Betweenness
query = """
CALL gds.betweenness.stream('Got3')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
ROBB,3609.6328036171367
NED,2651.6020895783627
ROBERT,2453.2637792802307


## Graph for season 4

### Louvain Modularity Algorithm

In [44]:
query = """
CALL gds.graph.project.cypher(
  'Got4',
  'MATCH (n:Person)-[r:INTERACTS_4]-(Person) RETURN id(n) AS id, n.seed AS seed',
  'MATCH (n:Person)-[r:INTERACTS_4]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight')
YIELD
    graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels

"""
graph.run(query)

graph,nodeQuery,nodes,relationshipQuery,rels
Got4,"MATCH (n:Person)-[r:INTERACTS_4]-(Person) RETURN id(n) AS id, n.seed AS seed",177,"MATCH (n:Person)-[r:INTERACTS_4]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight",1410


In [45]:
#The following will estimate the memory requirements for running the algorithm:
query = """
CALL gds.louvain.write.estimate('Got4', { writeProperty: 'community_S4' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
"""
graph.run(query)

nodeCount,relationshipCount,bytesMin,bytesMax,requiredMemory
177,1410,16641,592480,[16 KiB ... 578 KiB]


In [46]:
#The following will run the algorithm and stream results:
query = """
CALL gds.louvain.stream('Got4', { seedProperty: 'seed' })
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).label AS label, communityId
ORDER BY label ASC
"""
graph.run(query)

label,communityId
Adrack Humble,86
Aemon,15
Aerys,147


In [47]:
#The following will run the algorithm and return the results as summary statistics:
query = """
CALL gds.louvain.stats('Got4')
YIELD communityCount
"""
graph.run(query)

communityCount
8


In [48]:
#The following will run the algorithm and store the results in the projected in-memory graph 'Got4':
query = """
CALL gds.louvain.mutate('Got4', { mutateProperty: 'communityId' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
8,0.6019938634877522,"[0.5523786529852623, 0.6019938634877522]"


In [49]:
#The following will run the algorithm and write the results back to the database:
query = """
CALL gds.louvain.write('Got4', { writeProperty: 'community_S4' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
8,0.6019938634877522,"[0.5523786529852623, 0.6019938634877522]"


### 5 basic centrality measures 

In [50]:
#Degree
query = """
CALL gds.degree.stream('Got4')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JOFFREY,41.0
TYRION,39.0
CERSEI,34.0


In [51]:
#Weighted Degree
query = """
CALL gds.degree.stream(
   'Got4',
   { relationshipWeightProperty: 'weight' }
)
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
TYRION,890.0
CERSEI,632.0
JON,612.0


In [52]:
#Eigenvector
query = """
CALL gds.eigenvector.stream('Got4')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JOFFREY,0.2874989052013614
TYRION,0.2811453727233219
CERSEI,0.2657262876716366


In [53]:
#Pagerank
query = """
CALL gds.pageRank.stream('Got4',  {maxIterations: 20,  dampingFactor: 0.85})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS page, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

page,score
JOFFREY,3.995032308108388
TYRION,3.79855353319621
ARYA,3.337615317598997


In [54]:
#Betweenness
query = """
CALL gds.betweenness.stream('Got4')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JOFFREY,6169.187204485662
NED,6077.094818619815
STANNIS,5446.0343108831285


## Graph for season 5

### Louvain Modularity Algorithm

In [55]:
query = """
CALL gds.graph.project.cypher(
  'Got5',
  'MATCH (n:Person)-[r:INTERACTS_5]-(Person) RETURN id(n) AS id, n.seed AS seed',
  'MATCH (n:Person)-[r:INTERACTS_5]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight')
YIELD
    graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels

"""
graph.run(query)

graph,nodeQuery,nodes,relationshipQuery,rels
Got5,"MATCH (n:Person)-[r:INTERACTS_5]-(Person) RETURN id(n) AS id, n.seed AS seed",124,"MATCH (n:Person)-[r:INTERACTS_5]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight",866


In [56]:
#The following will estimate the memory requirements for running the algorithm:
query = """
CALL gds.louvain.write.estimate('Got5', { writeProperty: 'community_S5' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
"""
graph.run(query)

nodeCount,relationshipCount,bytesMin,bytesMax,requiredMemory
124,866,13241,582456,[13241 Bytes ... 568 KiB]


In [57]:
#The following will run the algorithm and stream results:
query = """
CALL gds.louvain.stream('Got5', { seedProperty: 'seed' })
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).label AS label, communityId
ORDER BY label ASC
"""
graph.run(query)

label,communityId
Aegon,18
Aemon,18
Aerys,101


In [58]:
#The following will run the algorithm and return the results as summary statistics:
query = """
CALL gds.louvain.stats('Got5')
YIELD communityCount
"""
graph.run(query)

communityCount
5


In [59]:
#The following will run the algorithm and store the results in the projected in-memory graph 'Got5':
query = """
CALL gds.louvain.mutate('Got5', { mutateProperty: 'communityId' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
5,0.664294971971689,"[0.6190789859671767, 0.664294971971689]"


In [60]:
#The following will run the algorithm and write the results back to the database:
query = """
CALL gds.louvain.write('Got5', { writeProperty: 'community_S5' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
5,0.664294971971689,"[0.6190789859671767, 0.664294971971689]"


### 5 basic centrality measures 

In [61]:
#Degree
query = """
CALL gds.degree.stream('Got5')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
CERSEI,31.0
LITTLEFINGER,27.0
JON,27.0


In [62]:
#Weighted Degree
query = """
CALL gds.degree.stream(
   'Got5',
   { relationshipWeightProperty: 'weight' }
)
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
CERSEI,618.0
JON,616.0
TYRION,480.0


In [63]:
#Eigenvector
query = """
CALL gds.eigenvector.stream('Got5')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
STANNIS,0.3383527047379288
JON,0.3171606732558863
SAM,0.283804814609472


In [64]:
#Pagerank
query = """
CALL gds.pageRank.stream('Got5',  {maxIterations: 20,  dampingFactor: 0.85})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS page, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

page,score
JON,3.604967605382056
CERSEI,3.604538477369172
LITTLEFINGER,3.53852568907438


In [65]:
#Betweenness
query = """
CALL gds.betweenness.stream('Got5')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
STANNIS,5012.7852459276455
CERSEI,3964.75489181412
LITTLEFINGER,2971.9304254016747


## Graph for season 6

### Louvain Modularity Algorithm

In [66]:
query = """
CALL gds.graph.project.cypher(
  'Got6',
  'MATCH (n:Person)-[r:INTERACTS_6]-(Person) RETURN id(n) AS id, n.seed AS seed',
  'MATCH (n:Person)-[r:INTERACTS_6]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight')
YIELD
    graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels

"""
graph.run(query)

graph,nodeQuery,nodes,relationshipQuery,rels
Got6,"MATCH (n:Person)-[r:INTERACTS_6]-(Person) RETURN id(n) AS id, n.seed AS seed",148,"MATCH (n:Person)-[r:INTERACTS_6]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight",1154


In [67]:
#The following will estimate the memory requirements for running the algorithm:
query = """
CALL gds.louvain.write.estimate('Got6', { writeProperty: 'community_S6' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
"""
graph.run(query)

nodeCount,relationshipCount,bytesMin,bytesMax,requiredMemory
148,1154,14785,586976,[14785 Bytes ... 573 KiB]


In [68]:
#The following will run the algorithm and stream results:
query = """
CALL gds.louvain.stream('Got6', { seedProperty: 'seed' })
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).label AS label, communityId
ORDER BY label ASC
"""
graph.run(query)

label,communityId
Aemon,32
Aeron,104
Aerys,11


In [69]:
#The following will run the algorithm and return the results as summary statistics:
query = """
CALL gds.louvain.stats('Got6')
YIELD communityCount
"""
graph.run(query)

communityCount
9


In [70]:
#The following will run the algorithm and store the results in the projected in-memory graph 'Got6':
query = """
CALL gds.louvain.mutate('Got6', { mutateProperty: 'communityId' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
9,0.6540944165272475,"[0.6090893253516515, 0.6540944165272475]"


In [71]:
#The following will run the algorithm and write the results back to the database:
query = """
CALL gds.louvain.write('Got6', { writeProperty: 'community_S6' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
9,0.6540944165272475,"[0.6090893253516515, 0.6540944165272475]"


### 5 basic centrality measures 

In [72]:
#Degree
query = """
CALL gds.degree.stream('Got6')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
SANSA,41.0
JON,33.0
CERSEI,29.0


In [73]:
#Weighted Degree
query = """
CALL gds.degree.stream(
   'Got6',
   { relationshipWeightProperty: 'weight' }
)
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JON,624.0
SANSA,551.0
TYRION,442.0


In [74]:
#Eigenvector
query = """
CALL gds.eigenvector.stream('Got6')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
SANSA,0.3519753949236122
JON,0.2764663353383537
DAVOS,0.2340585347780204


In [75]:
#Pagerank
query = """
CALL gds.pageRank.stream('Got6',  {maxIterations: 20,  dampingFactor: 0.85})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS page, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

page,score
SANSA,3.894475250679934
JON,3.250209678427113
CERSEI,3.007200929871433


In [76]:
#Betweenness
query = """
CALL gds.betweenness.stream('Got6')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
SANSA,5791.99128590698
JON,4161.187582627684
TYRION,3080.729990101282


## Graph for season 7

### Louvain Modularity Algorithm

In [77]:
query = """
CALL gds.graph.project.cypher(
  'Got7',
  'MATCH (n:Person)-[r:INTERACTS_7]-(Person) RETURN id(n) AS id, n.seed AS seed',
  'MATCH (n:Person)-[r:INTERACTS_7]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight')
YIELD
    graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels

"""
graph.run(query)

graph,nodeQuery,nodes,relationshipQuery,rels
Got7,"MATCH (n:Person)-[r:INTERACTS_7]-(Person) RETURN id(n) AS id, n.seed AS seed",85,"MATCH (n:Person)-[r:INTERACTS_7]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight",890


In [78]:
#The following will estimate the memory requirements for running the algorithm:
query = """
CALL gds.louvain.write.estimate('Got7', { writeProperty: 'community_S7' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
"""
graph.run(query)

nodeCount,relationshipCount,bytesMin,bytesMax,requiredMemory
85,890,10745,575176,[10745 Bytes ... 561 KiB]


In [79]:
#The following will run the algorithm and stream results:
query = """
CALL gds.louvain.stream('Got7', { seedProperty: 'seed' })
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).label AS label, communityId
ORDER BY label ASC
"""
graph.run(query)

label,communityId
Aegon,51
Aerys,51
Alys,67


In [80]:
#The following will run the algorithm and return the results as summary statistics:
query = """
CALL gds.louvain.stats('Got7')
YIELD communityCount
"""
graph.run(query)

communityCount
4


In [81]:
#The following will run the algorithm and store the results in the projected in-memory graph 'Got7':
query = """
CALL gds.louvain.mutate('Got7', { mutateProperty: 'communityId' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
4,0.32572149981063,"[0.3046484029794218, 0.32572149981063003]"


In [82]:
#The following will run the algorithm and write the results back to the database:
query = """
CALL gds.louvain.write('Got7', { writeProperty: 'community_S7' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
4,0.32572149981063,"[0.3046484029794218, 0.32572149981063003]"


### 5 basic centrality measures 

In [83]:
#Degree
query = """
CALL gds.degree.stream('Got7')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JON,46.0
TYRION,35.0
DAENERYS,35.0


In [84]:
#Weighted Degree
query = """
CALL gds.degree.stream(
   'Got7',
   { relationshipWeightProperty: 'weight' }
)
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JON,982.0
DAENERYS,735.0
TYRION,703.0


In [85]:
#Eigenvector
query = """
CALL gds.eigenvector.stream('Got7')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JON,0.2855926337551967
TYRION,0.2581673486956719
DAENERYS,0.2465015323052332


In [86]:
#Pagerank
query = """
CALL gds.pageRank.stream('Got7',  {maxIterations: 20,  dampingFactor: 0.85})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS page, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

page,score
JON,3.948579070763045
DAENERYS,2.9040361981717644
TYRION,2.712399180043202


In [87]:
#Betweenness
query = """
CALL gds.betweenness.stream('Got7')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
JON,1761.086678965661
SAM,851.8978738896035
DAENERYS,763.3372373706223


## Graph for season 8

### Louvain Modularity Algorithm

In [88]:
query = """
CALL gds.graph.project.cypher(
  'Got8',
  'MATCH (n:Person)-[r:INTERACTS_8]-(Person) RETURN id(n) AS id, n.seed AS seed',
  'MATCH (n:Person)-[r:INTERACTS_8]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight')
YIELD
    graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels

"""
graph.run(query)

graph,nodeQuery,nodes,relationshipQuery,rels
Got8,"MATCH (n:Person)-[r:INTERACTS_8]-(Person) RETURN id(n) AS id, n.seed AS seed",78,"MATCH (n:Person)-[r:INTERACTS_8]-(p:Person) RETURN id(n) AS source, id(p) AS target, r.weight AS weight",1200


In [89]:
#The following will estimate the memory requirements for running the algorithm:
query = """
CALL gds.louvain.write.estimate('Got8', { writeProperty: 'community_S8' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
"""
graph.run(query)

nodeCount,relationshipCount,bytesMin,bytesMax,requiredMemory
78,1200,10297,573840,[10297 Bytes ... 560 KiB]


In [90]:
#The following will run the algorithm and stream results:
query = """
CALL gds.louvain.stream('Got8', { seedProperty: 'seed' })
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).label AS label, communityId
ORDER BY label ASC
"""
graph.run(query)

label,communityId
Aegon,52
Aerys,52
Alanna,67


In [91]:
#The following will run the algorithm and return only summary statistics:
query = """
CALL gds.louvain.stats('Got8')
YIELD communityCount
"""
graph.run(query)

communityCount
5


In [92]:
#The following will run the algorithm and store the results in the in-memory graph Got8:
query = """
CALL gds.louvain.mutate('Got8', { mutateProperty: 'communityId' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
5,0.2130722222222222,"[0.20139861111111113, 0.21307222222222225]"


In [93]:
#The following will run the algorithm and write the results back to the database:
query = """
CALL gds.louvain.write('Got8', { writeProperty: 'community_S8' })
YIELD communityCount, modularity, modularities
"""
graph.run(query)

communityCount,modularity,modularities
5,0.2130722222222222,"[0.20139861111111113, 0.21307222222222225]"


### 5 basic centrality measures 

In [94]:
#Degree
query = """
CALL gds.degree.stream('Got8')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
SAM,45.0
DAENERYS,44.0
TYRION,43.0


In [95]:
#Weighted Degree
query = """
CALL gds.degree.stream(
   'Got8',
   { relationshipWeightProperty: 'weight' }
)
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
TYRION,1074.0
JON,927.0
DAENERYS,855.0


In [96]:
#Eigenvector
query = """
CALL gds.eigenvector.stream('Got8')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
TYRION,0.214638219608149
SAM,0.2132742148090217
GREY_WORM,0.2118673852557779


In [97]:
#Pagerank
query = """
CALL gds.pageRank.stream('Got8',  {maxIterations: 20,  dampingFactor: 0.85})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS page, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

page,score
DAENERYS,2.7876658573738107
SAM,2.7025561890888463
ARYA,2.351808006058034


In [98]:
#Betweenness
query = """
CALL gds.betweenness.stream('Got8')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS name, score
ORDER BY score DESC
LIMIT 5
"""
graph.run(query)

name,score
DAENERYS,790.9203081686722
SAM,718.8623113764764
ARYA,561.7706271124054


# Link prediction by using network embeddings

Link prediction is the task of predicting whether an interaction between two characters will occur in the future, here in a particular season. In this project, we use a simple classification method: a random forest.

For this, we first split the data into training and testing sets: seasons 1 to 4 form the training data and seasons 5 to 8 form the testing data. Note that our dataset only contains true (positive) links. To build a classification model we also need false (negative) links, so that the model learns to distinguish pairs of characters that should be linked from pairs that should not. As negative examples, we take pairs of characters that are not directly linked but are connected by a path of at most two hops.
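
The negative sampling idea can be sketched in plain Python on a hypothetical toy network. The helper `two_hop_non_neighbors` and the adjacency dictionary `toy_network` are illustrative names, not part of the notebook; the actual sampling in this project is done with the Cypher queries in the cells that follow.

```python
import random


def two_hop_non_neighbors(adjacency):
    """Return ordered pairs (a, b) that share a neighbour but have no direct link."""
    pairs = set()
    for a, neighbours in adjacency.items():
        for middle in neighbours:
            for b in adjacency[middle]:
                if b != a and b not in neighbours:
                    pairs.add((a, b))
    return pairs


# Hypothetical toy network: JON interacts with four characters
toy_network = {
    "JON": {"SAM", "ARYA", "SANSA", "BRAN"},
    "SAM": {"JON"},
    "ARYA": {"JON"},
    "SANSA": {"JON"},
    "BRAN": {"JON"},
}

# Positive examples: every existing link, listed in both directions
positives = {(a, b) for a, nbrs in toy_network.items() for b in nbrs}
negatives = two_hop_non_neighbors(toy_network)

# Down-sample the negatives so that both classes have the same size
rng = random.Random(0)
k = min(len(positives), len(negatives))
balanced_negatives = rng.sample(sorted(negatives), k)

print(len(positives), len(negatives), len(balanced_negatives))  # 8 12 8
```

Balancing the two classes keeps the classifier from simply predicting the majority class.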

Once the datasets are built, we carry out feature engineering (which introduces the graph embeddings). For each pair of characters, we compute the following features from the graphs:

1. Number of common neighbors (cn)

Captures the idea that the more neighbors two strangers have in common, the more likely they are to be introduced to one another and form a new link

2. Preferential attachment (pa)

Measures how likely two characters are to connect based on their degrees: characters (nodes) with a higher degree are more likely to receive new links. The score is the product of the two characters' numbers of neighbors

3. Total neighbors (tn) 

Computes the closeness of two characters from the number of unique neighbors they have between them: the more unique neighbors, the more likely these characters are to receive new links
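
As an illustration, the three scores can be computed by hand with set operations. This is a minimal sketch on a hypothetical toy network; `link_features` and `toy_graph` are illustrative names, and in the notebook itself the scores come from the `gds.alpha.linkprediction` functions.

```python
def link_features(adjacency, a, b):
    """Common neighbours, preferential attachment and total neighbours of a pair."""
    na, nb = adjacency[a], adjacency[b]
    return {
        "cn": len(na & nb),       # neighbours shared by both nodes
        "pa": len(na) * len(nb),  # product of the two degrees
        "tn": len(na | nb),       # distinct neighbours of either node
    }


# Hypothetical toy network (undirected adjacency sets)
toy_graph = {
    "JON": {"SAM", "ARYA", "SANSA"},
    "DAENERYS": {"TYRION", "SAM"},
    "SAM": {"JON", "DAENERYS"},
    "ARYA": {"JON"},
    "SANSA": {"JON"},
    "TYRION": {"DAENERYS"},
}

features = link_features(toy_graph, "JON", "DAENERYS")
print(features)  # {'cn': 1, 'pa': 6, 'tn': 4}
```

Here JON and DAENERYS share one neighbor (SAM), have degrees 3 and 2, and together have four distinct neighbors.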


After deriving these 3 features, we feed them into a random forest to learn an appropriate model. For the random forest we specified 3 parameters:

1. random_state: fixes the random seed, so that the same model is obtained on every run
2. n_estimators=30: builds 30 decision trees
3. max_depth=10: limits each tree to a depth of 10

## Building testing and training set

### Training set

In [None]:
# Find positive examples
train_existing_links = graph.run("""
MATCH (n:Person)-[r:INTERACTS_1|INTERACTS_2|INTERACTS_3|INTERACTS_4]-(p:Person)
RETURN n.id AS node1, p.id AS node2, 1 AS label, r.season AS season
""").to_data_frame()

In [None]:
train_existing_links.info()

In [None]:
# Find negative examples
train_missing_links = graph.run("""
MATCH (n:Person)
WHERE (n:Person)-[:INTERACTS_1|INTERACTS_2|INTERACTS_3|INTERACTS_4]-()
MATCH (n:Person)-[r:INTERACTS_1|INTERACTS_2|INTERACTS_3|INTERACTS_4*1..2]-(p:Person)
WHERE not((n:Person)-[:INTERACTS_1|INTERACTS_2|INTERACTS_3|INTERACTS_4]-(p:Person))
RETURN n.id AS node1, p.id AS node2, 0 AS label
""").to_data_frame()

In [None]:
train_missing_links.info()

In [None]:
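import random

# Assign each negative example a random training season (1-4); the list is
# sized from the DataFrame instead of a hard-coded row count
randomlist = [random.randint(1, 4) for _ in range(len(train_missing_links))]
train_missing_links['season'] = randomlist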
train_missing_links.head(5)

In [None]:
# Remove duplicates
train_missing_links = train_missing_links.drop_duplicates()

In [None]:
train_missing_links.info()

In [None]:
# Down sample negative examples
train_missing_links = train_missing_links.sample(
    n=len(train_existing_links))

In [None]:
train_missing_links.info()

In [None]:
# Create DataFrame from positive and negative examples
training_df = pd.concat([train_missing_links,train_existing_links], ignore_index=True)
training_df['label'] = training_df['label'].astype('category')

In [None]:
training_df

In [None]:
training_df.info()

In [None]:
#Checking if there are indeed labels with the positive category
training_df[training_df['label'] == 1]

In [None]:
#Checking if there are indeed labels with the negative category
training_df[training_df['label'] == 0]

### Testing set

In [None]:
# Find positive examples
test_existing_links = graph.run("""
MATCH (n:Person)-[r:INTERACTS_5|INTERACTS_6|INTERACTS_7|INTERACTS_8]-(p:Person)
RETURN n.id AS node1, p.id AS node2, 1 AS label, r.season AS season
""").to_data_frame()

In [None]:
test_existing_links.info()

In [None]:
# Find negative examples
test_missing_links = graph.run("""
MATCH (n:Person)
WHERE (n:Person)-[:INTERACTS_5|INTERACTS_6|INTERACTS_7|INTERACTS_8]-()
MATCH (n:Person)-[r:INTERACTS_5|INTERACTS_6|INTERACTS_7|INTERACTS_8*1..2]-(p:Person)
WHERE not((n:Person)-[:INTERACTS_5|INTERACTS_6|INTERACTS_7|INTERACTS_8]-(p:Person))
RETURN n.id AS node1, p.id AS node2, 0 AS label
""").to_data_frame()

In [None]:
test_missing_links.info()

In [None]:
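# Assign each negative example a random testing season (5-8); the list is
# sized from the DataFrame instead of a hard-coded row count
randomlist = [random.randint(5, 8) for _ in range(len(test_missing_links))]
test_missing_links['season'] = randomlist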
test_missing_links.head(5)

In [None]:
# Remove duplicates 
test_missing_links = test_missing_links.drop_duplicates()

In [None]:
test_missing_links.info()

In [None]:
# Down sample negative examples
test_missing_links = test_missing_links.sample(n=len(test_existing_links))

In [None]:
test_missing_links.info()

In [None]:
# Create DataFrame from positive and negative examples
test_df = pd.concat([test_missing_links, test_existing_links], ignore_index=True)
test_df['label'] = test_df['label'].astype('category')

In [None]:
test_df.info()

In [None]:
#Checking if there are indeed labels with the positive category
test_df[test_df['label'] == 1]

In [None]:
#Checking if there are indeed labels with the negative category
test_df[test_df['label'] == 0]

## Choosing Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=30, max_depth=10, 
                                    random_state=0)

## Generating features

In [None]:
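def apply_graphy_features(data, rel_type):
    # Compute the cn, pa and tn link-prediction scores for every (node1, node2)
    # pair in `data`, using only relationships of type `rel_type`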
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE p1.id = pair.node1
    MATCH (p2) WHERE p2.id = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           gds.alpha.linkprediction.commonNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS cn,
           gds.alpha.linkprediction.preferentialAttachment(
               p1, p2, {relationshipQuery: $relType}) AS pa,
           gds.alpha.linkprediction.totalNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS tn
    """
    pairs = [{"node1": node1, "node2": node2}  for node1,node2 in data[["node1", "node2"]].values.tolist()]
    params = {"pairs": pairs, "relType": rel_type}
    
    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on = ["node1", "node2"])

In [None]:
train_season1 = training_df[training_df['season'] == 1]
train_season2 = training_df[training_df['season'] == 2]
train_season3 = training_df[training_df['season'] == 3]
train_season4 = training_df[training_df['season'] == 4]

In [None]:
test_season5 = test_df[test_df['season'] == 5]
test_season6 = test_df[test_df['season'] == 6]
test_season7 = test_df[test_df['season'] == 7]
test_season8 = test_df[test_df['season'] == 8]

In [None]:
train_season1_v = apply_graphy_features(train_season1, "INTERACTS_1")
train_season2_v = apply_graphy_features(train_season2, "INTERACTS_2")
train_season3_v= apply_graphy_features(train_season3, "INTERACTS_3")
train_season4_v= apply_graphy_features(train_season4, "INTERACTS_4")

In [None]:
train_season1_v.sample(5)

In [None]:
train_season2_v.sample(5)

In [None]:
train_season3_v.sample(5)

In [None]:
train_season4_v.sample(5)

In [None]:
test_season5_v = apply_graphy_features(test_season5, "INTERACTS_5")
test_season6_v = apply_graphy_features(test_season6, 'INTERACTS_6')
test_season7_v = apply_graphy_features(test_season7, "INTERACTS_7")
test_season8_v = apply_graphy_features(test_season8, "INTERACTS_8")

In [None]:
test_season5_v.sample(5)

In [None]:
test_season6_v.sample(5)

In [None]:
test_season7_v.sample(5)

In [None]:
#Combining all seasons for the training set and the testing set
frames_training = [train_season1_v, train_season2_v, train_season3_v, train_season4_v]
result_training = pd.concat(frames_training)
frames_test = [test_season5_v, test_season6_v, test_season7_v, test_season8_v]
result_test = pd.concat(frames_test)

In [None]:
result_training

In [None]:
result_test

## Train our Random forest model

In [None]:
columns = ["cn", "pa", "tn"]
X = result_training[columns]
y = result_training["label"]
classifier.fit(X, y)

## Evaluation

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score


def evaluate_model(predictions, actual):
    # Compare predicted and true labels on three standard classification metrics
    accuracy = accuracy_score(actual, predictions)
    precision = precision_score(actual, predictions)
    recall = recall_score(actual, predictions)

    metrics = ["accuracy", "precision", "recall"]
    values = [accuracy, precision, recall]
    return pd.DataFrame(data={'metric': metrics, 'value': values})


def feature_importance(columns, classifier):
    # Rank the features by their importance in the fitted classifier
    features = list(zip(columns, classifier.feature_importances_))
    sorted_features = sorted(features, key=lambda x: x[1], reverse=True)

    keys = [value[0] for value in sorted_features]
    values = [value[1] for value in sorted_features]
    return pd.DataFrame(data={'feature': keys, 'value': values})

In [None]:
predictions = classifier.predict(result_test[columns])
y_test = result_test["label"]
evaluate_model(predictions, y_test)

In [None]:
feature_importance(columns, classifier)
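
To make the three evaluation metrics concrete, the sketch below recomputes accuracy, precision and recall from the confusion counts. The label lists are made up for illustration and `binary_metrics` is a hypothetical helper; the notebook itself relies on the scikit-learn functions.

```python
def binary_metrics(actual, predicted):
    """Accuracy, precision and recall for 0/1 labels, from the confusion counts."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # links correctly predicted
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # links predicted but absent
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # links missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-links correctly predicted
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 4 true links and 4 non-links
toy_actual = [1, 1, 1, 1, 0, 0, 0, 0]
toy_predicted = [1, 1, 0, 0, 1, 0, 0, 0]

toy_metrics = binary_metrics(toy_actual, toy_predicted)
print(toy_metrics)
```

In this toy case, 2 of the 3 predicted links are real (precision 2/3) and 2 of the 4 real links are found (recall 0.5).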

Separately, we investigate whether link prediction performs better when the following features are introduced:
1. Minimum triangles (minTriangles)
2. Maximum triangles (maxTriangles)
3. Minimum clustering coefficient (minCoeff)
4. Maximum clustering coefficient (maxCoeff)

To this end, we will generate 8 in-memory graphs and compute the triangle count and clustering coefficient of all the nodes for all the 8 seasons.
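
For intuition, the two node-level quantities can be computed by hand on a small graph. The sketch below assumes the usual definitions (a triangle is a pair of a node's neighbors that are themselves linked, and the local clustering coefficient is the fraction of such pairs); `triangles_and_coefficient` and `toy_adjacency` are hypothetical names used only for illustration.

```python
from itertools import combinations


def triangles_and_coefficient(adjacency, node):
    """Triangle count and local clustering coefficient of one node."""
    neighbours = adjacency[node]
    # A triangle through `node` is a pair of its neighbours that are linked
    triangles = sum(1 for u, v in combinations(sorted(neighbours), 2)
                    if v in adjacency[u])
    degree = len(neighbours)
    possible = degree * (degree - 1) / 2  # number of neighbour pairs
    coefficient = triangles / possible if possible else 0.0
    return triangles, coefficient


# Hypothetical toy network (undirected adjacency sets)
toy_adjacency = {
    "JON": {"SAM", "ARYA", "SANSA"},
    "SAM": {"JON"},
    "ARYA": {"JON", "SANSA"},
    "SANSA": {"JON", "ARYA"},
}

jon = triangles_and_coefficient(toy_adjacency, "JON")
arya = triangles_and_coefficient(toy_adjacency, "ARYA")
sam = triangles_and_coefficient(toy_adjacency, "SAM")
print(jon, arya, sam)
```

JON sits in one triangle out of three possible neighbor pairs, ARYA's only neighbor pair is linked, and SAM has a single neighbor and therefore no triangles.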

In [None]:
# Drop each in-memory graph if it already exists; passing false for
# failIfMissing avoids an error when the graph has not been created yet
for season in range(1, 9):
    graph.run(f"CALL gds.graph.drop('myGraph{season}', false) YIELD graphName")


In [None]:
# Make the in-memory graphs (one per season) for adding triangle counts and
# clustering coefficients
for season in range(1, 9):
    graph.run(f"""
    CALL gds.graph.project(
      'myGraph{season}',
      'Person',
      {{
        INTERACTS_{season}: {{
          orientation: 'UNDIRECTED'
        }}
      }}
    )
    """)


Since we are taking the first four seasons as the training set, we compute the triangle counts on the corresponding in-memory graphs and write them back to the nodes as the properties "trianglesTrain*".

In [None]:
query1 = """ 
CALL gds.triangleCount.write('myGraph1', {
  writeProperty: 'trianglesTrain1'
})
"""

query2 = """ 
CALL gds.triangleCount.write('myGraph2', {
  writeProperty: 'trianglesTrain2'
})
"""

query3 = """ 
CALL gds.triangleCount.write('myGraph3', {
  writeProperty: 'trianglesTrain3'
})
"""

query4 = """ 
CALL gds.triangleCount.write('myGraph4', {
  writeProperty: 'trianglesTrain4'
})
"""


graph.run(query1)
graph.run(query2)
graph.run(query3)
graph.run(query4)


We proceed as before and create 'trianglesTest*' for the triangle counts of seasons 5-8.

In [None]:
query5 = """ 
CALL gds.triangleCount.write('myGraph5', {
  writeProperty: 'trianglesTest5'
})
"""

query6 = """ 
CALL gds.triangleCount.write('myGraph6', {
  writeProperty: 'trianglesTest6'
})
"""
query7 = """ 
CALL gds.triangleCount.write('myGraph7', {
  writeProperty: 'trianglesTest7'
})
"""
query8 = """ 
CALL gds.triangleCount.write('myGraph8', {
  writeProperty: 'trianglesTest8'
})
"""
graph.run(query5)
graph.run(query6)
graph.run(query7)
graph.run(query8)


We do the same for the local clustering coefficient of each node and create 'coefficientTrain*' and 'coefficientTest*'.

In [None]:
query1 = """
CALL gds.localClusteringCoefficient.write('myGraph1', {
    writeProperty: 'coefficientTrain1'
});
"""

query2 = """
CALL gds.localClusteringCoefficient.write('myGraph2', {
    writeProperty: 'coefficientTrain2'
});
"""

query3 = """
CALL gds.localClusteringCoefficient.write('myGraph3', {
    writeProperty: 'coefficientTrain3'
});
"""

query4 = """
CALL gds.localClusteringCoefficient.write('myGraph4', {
    writeProperty: 'coefficientTrain4'
});
"""


graph.run(query1)
graph.run(query2)
graph.run(query3)
graph.run(query4)


In [None]:
query5 = """
CALL gds.localClusteringCoefficient.write('myGraph5', {
    writeProperty: 'coefficientTest5'
});
"""

query6 = """
CALL gds.localClusteringCoefficient.write('myGraph6', {
    writeProperty: 'coefficientTest6'
});
"""

query7 = """
CALL gds.localClusteringCoefficient.write('myGraph7', {
    writeProperty: 'coefficientTest7'
});
"""

query8 = """
CALL gds.localClusteringCoefficient.write('myGraph8', {
    writeProperty: 'coefficientTest8'
});
"""

graph.run(query5)
graph.run(query6)
graph.run(query7)
graph.run(query8)


## Creating Features

As we need features for the edges rather than the nodes, we take the maximum and the minimum of each property over the two nodes forming the edge. Since an edge is formed by 2 nodes, this effectively adds the clustering coefficients and triangle counts of both nodes as features for prediction.
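
A small sketch of this min/max construction, using made-up node property values and a hypothetical helper `pair_min_max`. It also shows why the minimum and maximum are convenient: the result does not depend on the order in which the two nodes are given.

```python
def pair_min_max(node_props, a, b, prop):
    """Minimum and maximum of one node property over the two endpoints of an edge."""
    values = [node_props[a][prop], node_props[b][prop]]
    return min(values), max(values)


# Made-up node properties for illustration only
toy_props = {
    "JON": {"triangles": 12, "coefficient": 0.4},
    "SAM": {"triangles": 3, "coefficient": 0.9},
}

min_tri, max_tri = pair_min_max(toy_props, "JON", "SAM", "triangles")
min_coeff, max_coeff = pair_min_max(toy_props, "JON", "SAM", "coefficient")
print(min_tri, max_tri, min_coeff, max_coeff)  # 3 12 0.4 0.9
```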

In [None]:
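def apply_triangles_features(data, triangles_prop, coefficient_prop):
    # Add the min/max triangle counts and clustering coefficients of the two
    # endpoint nodes as features for every (node1, node2) pair in `data`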

    query = """
    UNWIND $pairs AS pair
    MATCH (p1:Person) WHERE p1.id = pair.node1
    MATCH (p2:Person) WHERE p2.id = pair.node2
    RETURN pair.node1 AS node1, 
    pair.node2 AS node2,
    apoc.coll.min([p1[$triangles], p2[$triangles]]) AS minTriangles,
    apoc.coll.max([p1[$triangles], p2[$triangles]]) AS maxTriangles,
    apoc.coll.min([p1[$coefficient], p2[$coefficient]]) AS minCoeff,
    apoc.coll.max([p1[$coefficient], p2[$coefficient]]) AS maxCoeff
    """

    pairs = [{"node1": str(pair[0]), "node2": str(pair[1])}
             for pair in data[["node1", "node2"]].values.tolist()]

    params = {
        "pairs": pairs,
        "triangles": triangles_prop,
        "coefficient": coefficient_prop
    }

    features = graph.run(query, params).to_data_frame()

    return pd.merge(data, features, on=["node1", "node2"])


We add these features for the training and testing datasets developed earlier.

In [None]:
train_season1_w = apply_triangles_features(
    train_season1_v, "trianglesTrain1", "coefficientTrain1")
train_season2_w = apply_triangles_features(
    train_season2_v, "trianglesTrain2", "coefficientTrain2")
train_season3_w = apply_triangles_features(
    train_season3_v, "trianglesTrain3", "coefficientTrain3")
train_season4_w = apply_triangles_features(
    train_season4_v, "trianglesTrain4", "coefficientTrain4")
# train_season5_w = apply_triangles_features(train_season5_v, "trianglesTrain5", "coefficientTrain5")

test_season5_w = apply_triangles_features(
    test_season5_v, "trianglesTest5", "coefficientTest5")
test_season6_w = apply_triangles_features(
    test_season6_v, "trianglesTest6", "coefficientTest6")
test_season7_w = apply_triangles_features(
    test_season7_v, "trianglesTest7", "coefficientTest7")
test_season8_w = apply_triangles_features(
    test_season8_v, "trianglesTest8", "coefficientTest8")


In [None]:
frames_training_w = [train_season1_w, train_season2_w,
                     train_season3_w, train_season4_w]
result_training_w = pd.concat(frames_training_w)
frames_test_w = [test_season5_w, test_season6_w,
                 test_season7_w, test_season8_w]
result_test_w = pd.concat(frames_test_w)

## Train Model

Finally, we train the model to see the added benefit of these features.

In [None]:
classifier2 = RandomForestClassifier(n_estimators=30, max_depth=10,
                                     random_state=0)


In [None]:
columns = ["cn", "pa", "tn", "minTriangles",
           "maxTriangles", "minCoeff", "maxCoeff"]
X = result_training_w[columns]
y = result_training_w["label"]
classifier2.fit(X, y)

In [None]:
predictions = classifier2.predict(result_test_w[columns])
y_test = result_test_w["label"]
evaluate_model(predictions, y_test)

In [None]:
feature_importance(columns, classifier2)
