Project SNA by Aleksandra Elena Getman (r0884498) and Vaishnav Dilip (r0872689)

<img src ="https://cdn.trendhunterstatic.com/thumbs/game-of-thrones-fan-art.jpeg?auto=webp"></img>

# Link Prediction

In this notebook, we plan to predict links between the characters in the series. To this end we divide the whole series into training, validation and test sets. Since the series is chronological, it is best to have the first 4 seasons as the training set, seasons 5&6 as the validation set and seasons 7&8 as the test set.

Neo4j has several link prediction algorithms that help determine the closeness of a pair of nodes using the topology of the graph. These computed scores can then be used to predict new relationships between them. We will add these as features to our otherwise-featureless dataset. Furthermore, we will add the triangle count and the clustering coefficient of each node as features as well. This will aid our prediction model.

For our prediction model, we use a random forest classifier. We will tune its hyperparameters using the validation set to derive the best model, and then use that model to make predictions on the relationships in the last two seasons.

In [19]:
# Importing required libraries
from py2neo import Graph
import pandas as pd
import random
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


# Connecting to Neo4j
graph = Graph("bolt://localhost:7687", auth=("pizza", "superman"))

# Building training, validation and  testing sets

To build the training, validation and test sets, we have to take into account that the network only gives us existing links (positive examples), so the negative examples must be constructed some other way. In our case, we construct them from the relationships in the network themselves: we take pairs of nodes that are connected by a path of at most two relationships (the `*1..2` pattern in the queries below) and exclude the pairs that are direct neighbours.

As mentioned before, we will create the training set from the seasons 1-4, the validation set from the seasons 5&6 and the testing set from the seasons 7&8.
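The idea behind the negative examples can be illustrated without a database. The following is a minimal sketch on a hypothetical toy graph (the `adjacency` dictionary and the helper `negative_candidates` are illustrative assumptions, not part of the notebook): it collects pairs of nodes that share a neighbour but are not directly linked.

```python
from itertools import combinations

# Toy undirected graph as an adjacency dict (hypothetical data, not the series network)
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B"},
}


def negative_candidates(adj):
    """Return unordered pairs that share a neighbour but are not directly linked."""
    pairs = set()
    for neighbours in adj.values():
        # Any two neighbours of the same node are two hops apart through that node
        for first, second in combinations(sorted(neighbours), 2):
            if second not in adj[first]:
                pairs.add((first, second))
    return sorted(pairs)


print(negative_candidates(adjacency))
```

Unlike a pattern match over both endpoints, this sketch keeps each unordered pair once and never pairs a node with itself.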

## Training dataset

In [22]:
# Find positive examples
train_existing_links = graph.run("""
MATCH (n:Person)-[r:INTERACTS_1|INTERACTS_2|INTERACTS_3|INTERACTS_4]-(p:Person)
RETURN id(n) AS node1, id(p) AS node2, 1 AS label, n.id AS Name_Node1, p.id AS Name_node2, r.season AS season
""").to_data_frame()

In [23]:
train_existing_links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4494 entries, 0 to 4493
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       4494 non-null   int64 
 1   node2       4494 non-null   int64 
 2   label       4494 non-null   int64 
 3   Name_Node1  4494 non-null   object
 4   Name_node2  4494 non-null   object
 5   season      4494 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 210.8+ KB


In [24]:
# Find negative examples
train_missing_links = graph.run("""
MATCH (n:Person)
WHERE (n:Person)-[:INTERACTS_1|INTERACTS_2|INTERACTS_3|INTERACTS_4]-()
MATCH (n:Person)-[r:INTERACTS_1|INTERACTS_2|INTERACTS_3|INTERACTS_4*1..2]-(p:Person)
WHERE not((n:Person)-[:INTERACTS_1|INTERACTS_2|INTERACTS_3|INTERACTS_4]-(p:Person))
RETURN id(n) AS node1, id(p) AS node2, 0 AS label, n.id AS Name_Node1, p.id AS Name_node2

""").to_data_frame()

In [25]:
train_missing_links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117858 entries, 0 to 117857
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   node1       117858 non-null  int64 
 1   node2       117858 non-null  int64 
 2   label       117858 non-null  int64 
 3   Name_Node1  117858 non-null  object
 4   Name_node2  117858 non-null  object
dtypes: int64(3), object(2)
memory usage: 4.5+ MB


In [26]:
# We assign the season randomly to the missing links
randomlist = [random.randint(1, 4) for _ in range(len(train_missing_links))]
train_missing_links['season'] = randomlist
train_missing_links.head(5)

Unnamed: 0,node1,node2,label,Name_Node1,Name_node2,season
0,0,114,0,ADDAM_MARBRAND,TYRION,1
1,0,38,0,ADDAM_MARBRAND,JAIME,1
2,0,75,0,ADDAM_MARBRAND,NED,3
3,0,93,0,ADDAM_MARBRAND,ROBERT,2
4,0,104,0,ADDAM_MARBRAND,SHAGGA,3


In [27]:
# Remove duplicates
train_missing_links = train_missing_links.drop_duplicates()

In [28]:
train_missing_links.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54049 entries, 0 to 117857
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       54049 non-null  int64 
 1   node2       54049 non-null  int64 
 2   label       54049 non-null  int64 
 3   Name_Node1  54049 non-null  object
 4   Name_node2  54049 non-null  object
 5   season      54049 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 2.9+ MB


Since the number of negative examples is much higher than the number of positive examples, we randomly sample an equal number of negative examples.

In [29]:
# Down sample negative examples
train_missing_links = train_missing_links.sample(
    n=len(train_existing_links))
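The down-sampling step above draws a different sample on every run. As a minimal sketch of the same balancing idea with a fixed seed, using only the standard library and hypothetical toy rows (the lists `positives` and `negatives` are assumptions for illustration):

```python
import random

# Hypothetical (node1, node2, label) rows; the real data lives in the DataFrames above
positives = [("A", "B", 1), ("B", "D", 1), ("A", "C", 1)]
negatives = [("A", "D", 0), ("B", "C", 0), ("C", "D", 0), ("A", "E", 0), ("B", "E", 0)]

rng = random.Random(42)  # fixed seed so the same negatives are drawn on every run
sampled_negatives = rng.sample(negatives, k=len(positives))  # sampling without replacement
balanced = positives + sampled_negatives

print(len(positives), len(sampled_negatives))
```

In pandas, passing `random_state` to `DataFrame.sample` makes the draw reproducible in the same way.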

In [30]:
train_missing_links.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4494 entries, 102514 to 103042
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       4494 non-null   int64 
 1   node2       4494 non-null   int64 
 2   label       4494 non-null   int64 
 3   Name_Node1  4494 non-null   object
 4   Name_node2  4494 non-null   object
 5   season      4494 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 245.8+ KB


In [31]:
# Create DataFrame from positive and negative examples
training_df = pd.concat([train_missing_links,train_existing_links], ignore_index=True)
training_df['label'] = training_df['label'].astype('category')

In [32]:
training_df

Unnamed: 0,node1,node2,label,Name_Node1,Name_node2,season
0,214,22,0,MERO,DROGO,4
1,55,227,0,LITTLEFINGER,SELWYN,4
2,86,60,0,RENLY,LYSA,1
3,208,218,0,LOCKE,OLENNA,4
4,213,27,0,MEERA,GRENN,3
...,...,...,...,...,...,...
8983,315,240,1,YOHN_ROYCE,ANYA_WAYNWOOD,4
8984,315,60,1,YOHN_ROYCE,LYSA,4
8985,315,310,1,YOHN_ROYCE,VANCE_CORBRAY,4
8986,315,75,1,YOHN_ROYCE,NED,4


In [33]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8988 entries, 0 to 8987
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   node1       8988 non-null   int64   
 1   node2       8988 non-null   int64   
 2   label       8988 non-null   category
 3   Name_Node1  8988 non-null   object  
 4   Name_node2  8988 non-null   object  
 5   season      8988 non-null   int64   
dtypes: category(1), int64(3), object(2)
memory usage: 360.1+ KB


## Validation set

As before, we create the validation set by extracting the positive examples from the graph. The negative examples are pairs of nodes connected through the specified relationship types by a path of at most two relationships, excluding direct neighbours.

In [40]:
# Find positive examples
validation_existing_links = graph.run("""
MATCH (n:Person)-[r:INTERACTS_5|INTERACTS_6]-(p:Person)
RETURN id(n) AS node1, id(p) AS node2, 1 AS label, n.id AS Name_Node1, p.id AS Name_node2, r.season AS season
""").to_data_frame()

In [41]:
validation_existing_links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2020 entries, 0 to 2019
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       2020 non-null   int64 
 1   node2       2020 non-null   int64 
 2   label       2020 non-null   int64 
 3   Name_Node1  2020 non-null   object
 4   Name_node2  2020 non-null   object
 5   season      2020 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 94.8+ KB


In [42]:
# Find negative examples
validation_missing_links = graph.run("""
MATCH (n:Person)
WHERE (n:Person)-[:INTERACTS_5|INTERACTS_6]-()
MATCH (n:Person)-[r:INTERACTS_5|INTERACTS_6*1..2]-(p:Person)
WHERE not((n:Person)-[:INTERACTS_5|INTERACTS_6]-(p:Person))
RETURN id(n) AS node1, id(p) AS node2, 0 AS label, n.id AS Name_Node1, p.id AS Name_node2
""").to_data_frame()


In [43]:
validation_missing_links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24150 entries, 0 to 24149
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       24150 non-null  int64 
 1   node2       24150 non-null  int64 
 2   label       24150 non-null  int64 
 3   Name_Node1  24150 non-null  object
 4   Name_node2  24150 non-null  object
dtypes: int64(3), object(2)
memory usage: 943.5+ KB


In [44]:
randomlist = [random.randint(5, 6) for _ in range(len(validation_missing_links))]
validation_missing_links['season'] = randomlist
validation_missing_links.head(5)

Unnamed: 0,node1,node2,label,Name_Node1,Name_node2,season
0,1,100,0,AEGON,SAM,5
1,1,148,0,AEGON,GILLY,6
2,1,3,0,AEGON,ALLISER_THORNE,5
3,1,317,0,AEGON,ALLISER_THORNE,6
4,1,239,0,AEGON,ALLISER_THORNE,5


In [45]:
validation_missing_links = validation_missing_links.drop_duplicates()

In [46]:
validation_missing_links.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12537 entries, 0 to 24145
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       12537 non-null  int64 
 1   node2       12537 non-null  int64 
 2   label       12537 non-null  int64 
 3   Name_Node1  12537 non-null  object
 4   Name_node2  12537 non-null  object
 5   season      12537 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 685.6+ KB


In [47]:
# Down sample negative examples
validation_missing_links = validation_missing_links.sample(n=len(validation_existing_links))

In [48]:
validation_missing_links.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2020 entries, 17801 to 7674
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       2020 non-null   int64 
 1   node2       2020 non-null   int64 
 2   label       2020 non-null   int64 
 3   Name_Node1  2020 non-null   object
 4   Name_node2  2020 non-null   object
 5   season      2020 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 110.5+ KB


In [49]:
# Create DataFrame from positive and negative examples
validation_df = pd.concat(
    [validation_missing_links, validation_existing_links], ignore_index=True)
validation_df['label'] = validation_df['label'].astype('category')

In [50]:
validation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4040 entries, 0 to 4039
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   node1       4040 non-null   int64   
 1   node2       4040 non-null   int64   
 2   label       4040 non-null   category
 3   Name_Node1  4040 non-null   object  
 4   Name_node2  4040 non-null   object  
 5   season      4040 non-null   int64   
dtypes: category(1), int64(3), object(2)
memory usage: 162.0+ KB


## Testing set

In [51]:
# Find positive examples
test_existing_links = graph.run("""
MATCH (n:Person)-[r:INTERACTS_7|INTERACTS_8]-(p:Person)
RETURN id(n) AS node1, id(p) AS node2, 1 AS label, n.id AS Name_Node1, p.id AS Name_node2, r.season AS season
""").to_data_frame()

In [52]:
test_existing_links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2090 entries, 0 to 2089
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       2090 non-null   int64 
 1   node2       2090 non-null   int64 
 2   label       2090 non-null   int64 
 3   Name_Node1  2090 non-null   object
 4   Name_node2  2090 non-null   object
 5   season      2090 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 98.1+ KB


In [54]:
# Find negative examples
test_missing_links = graph.run("""
MATCH (n:Person)
WHERE (n:Person)-[:INTERACTS_7|INTERACTS_8]-()
MATCH (n:Person)-[r:INTERACTS_7|INTERACTS_8*1..2]-(p:Person)
WHERE not((n:Person)-[:INTERACTS_7|INTERACTS_8]-(p:Person))
RETURN id(n) AS node1, id(p) AS node2, 0 AS label, n.id AS Name_Node1, p.id AS Name_node2
""").to_data_frame()

In [55]:
test_missing_links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34108 entries, 0 to 34107
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       34108 non-null  int64 
 1   node2       34108 non-null  int64 
 2   label       34108 non-null  int64 
 3   Name_Node1  34108 non-null  object
 4   Name_node2  34108 non-null  object
dtypes: int64(3), object(2)
memory usage: 1.3+ MB


In [56]:
randomlist = [random.randint(7, 8) for _ in range(len(test_missing_links))]
test_missing_links['season'] = randomlist
test_missing_links.head(5)

Unnamed: 0,node1,node2,label,Name_Node1,Name_node2,season
0,1,17,0,AEGON,CERSEI,8
1,1,13,0,AEGON,BRAN,8
2,1,1,0,AEGON,AEGON,8
3,1,10,0,AEGON,BERIC,7
4,1,15,0,AEGON,BRONN,8


In [57]:
# Remove duplicates 
test_missing_links = test_missing_links.drop_duplicates()

In [58]:
test_missing_links.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9643 entries, 0 to 34107
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       9643 non-null   int64 
 1   node2       9643 non-null   int64 
 2   label       9643 non-null   int64 
 3   Name_Node1  9643 non-null   object
 4   Name_node2  9643 non-null   object
 5   season      9643 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 527.4+ KB


In [59]:
# Down sample negative examples
test_missing_links = test_missing_links.sample(n=len(test_existing_links))

In [60]:
test_missing_links.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2090 entries, 27801 to 28189
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   node1       2090 non-null   int64 
 1   node2       2090 non-null   int64 
 2   label       2090 non-null   int64 
 3   Name_Node1  2090 non-null   object
 4   Name_node2  2090 non-null   object
 5   season      2090 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 114.3+ KB


In [61]:
# Create DataFrame from positive and negative examples
test_df = pd.concat([test_missing_links, test_existing_links], ignore_index=True)
test_df['label'] = test_df['label'].astype('category')

In [62]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4180 entries, 0 to 4179
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   node1       4180 non-null   int64   
 1   node2       4180 non-null   int64   
 2   label       4180 non-null   category
 3   Name_Node1  4180 non-null   object  
 4   Name_node2  4180 non-null   object  
 5   season      4180 non-null   int64   
dtypes: category(1), int64(3), object(2)
memory usage: 167.6+ KB


## Generating link prediction features

The topological link prediction algorithms in Neo4j score a pair of nodes using only the structure of the graph. Let us look at what these algorithms are:

- Adamic Adar - Adamic Adar is a measure used to compute the closeness of nodes based on their shared neighbors. $$A(x,y) = \sum_{u \in N(x) \cap N(y)} \frac{1}{\log|N(u)|}$$ where N(u) is the set of nodes adjacent to u.

- Common Neighbors - Common neighbours captures the idea that 2 strangers who have a friend in common are more likely to be introduced than those who don't have any friends in common. $$ CN(x,y) = | N(x) \cap N(y)| $$ where N(x) is the set of nodes adjacent to node x, and N(y) is the set of nodes adjacent to node y.

- Preferential Attachment - Preferential Attachment is a measure used to compute the closeness of nodes based on the number of neighbors each of them has, rather than on the neighbors they share. Preferential attachment means that the more connected a node is, the more likely it is to receive new links. $$ PA(x,y) = |N(x)| \cdot |N(y)| $$ where N(x) is the set of nodes adjacent to node x, and N(y) is the set of nodes adjacent to node y.

- Resource Allocation - Resource Allocation is a measure used to compute the closeness of nodes based on their shared neighbors. $$RA(x,y) = \sum_{u \in N(x) \cap N(y)} \frac{1}{|N(u)|}$$ where N(u) is the set of nodes adjacent to u.

- Same Community - Same Community is a way of determining whether two nodes belong to the same community. If two nodes belong to the same community, there is a greater likelihood that there will be a relationship between them in future, if there isn’t already.

- Total Neighbors - Total Neighbors computes the closeness of nodes, based on the number of unique neighbors that they have. It is based on the idea that the more connected a node is, the more likely it is to receive new links. $$ TN(x,y) = | N(x) \cup N(y)| $$ where N(x) is the set of nodes adjacent to node x, and N(y) is the set of nodes adjacent to node y.

A lot of interactions in Game of Thrones happen through shared neighbors, so it makes sense to include these measures as features. The more general algorithms, Total Neighbors and Same Community, might also work, but we expect them to be less informative than the others. We also omit Resource Allocation, as it is very similar to Adamic Adar.
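To make the formulas above concrete, here is a minimal pure-Python sketch that evaluates them on a hypothetical four-node graph. The `adjacency` dictionary and the function names are assumptions for illustration; they are not the Neo4j implementations.

```python
import math

# Hypothetical undirected toy graph: each node maps to its set of neighbours N(u)
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}


def common_neighbors(adj, x, y):
    return len(adj[x] & adj[y])


def total_neighbors(adj, x, y):
    return len(adj[x] | adj[y])


def preferential_attachment(adj, x, y):
    return len(adj[x]) * len(adj[y])


def adamic_adar(adj, x, y):
    # A common neighbour of two distinct nodes has degree >= 2, so log(...) > 0
    return sum(1.0 / math.log(len(adj[u])) for u in adj[x] & adj[y])


def resource_allocation(adj, x, y):
    return sum(1.0 / len(adj[u]) for u in adj[x] & adj[y])


# B and D are not linked, but both are adjacent to A and C
print(common_neighbors(adjacency, "B", "D"))         # 2
print(total_neighbors(adjacency, "B", "D"))          # 2
print(preferential_attachment(adjacency, "B", "D"))  # 4
print(adamic_adar(adjacency, "B", "D"))              # 2 / ln(3)
print(resource_allocation(adjacency, "B", "D"))      # 2 / 3
```

Adamic Adar and Resource Allocation differ only in whether the degree of each shared neighbour is log-scaled, which is why they tend to rank pairs similarly.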

In [63]:
def apply_graphy_features(data, rel_type):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           gds.alpha.linkprediction.commonNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS cn,
           gds.alpha.linkprediction.preferentialAttachment(
               p1, p2, {relationshipQuery: $relType}) AS pa,
           gds.alpha.linkprediction.adamicAdar(
               p1, p2, {relationshipQuery: $relType}) AS ad
    """
    pairs = [{"node1": pair[0], "node2": pair[1]}  
             for pair in data[["node1", "node2"]].values.tolist()]
    params = {"pairs": pairs, "relType": rel_type}
    
    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on = ["node1", "node2"])

We apply the above function to each season so as to not add the connections from other seasons.

In [64]:
train_season1 = training_df[training_df['season'] == 1]
train_season2 = training_df[training_df['season'] == 2]
train_season3 = training_df[training_df['season'] == 3]
train_season4 = training_df[training_df['season'] == 4]

In [68]:
validation_season5 = validation_df[validation_df['season'] == 5]
validation_season6 = validation_df[validation_df['season'] == 6]
test_season7 = test_df[test_df['season'] == 7]
test_season8 = test_df[test_df['season'] == 8]

In [69]:
train_season1_v = apply_graphy_features(train_season1, "INTERACTS_1")
train_season2_v = apply_graphy_features(train_season2, "INTERACTS_2")
train_season3_v= apply_graphy_features(train_season3, "INTERACTS_3")
train_season4_v= apply_graphy_features(train_season4, "INTERACTS_4")

In [73]:
validation_season5_v = apply_graphy_features(validation_season5, "INTERACTS_5")
validation_season6_v = apply_graphy_features(validation_season6, 'INTERACTS_6')
test_season7_v = apply_graphy_features(test_season7, "INTERACTS_7")
test_season8_v = apply_graphy_features(test_season8, "INTERACTS_8")

In [74]:
# Combining all seasons for the training, validation and test sets
frames_training = [train_season1_v, train_season2_v, train_season3_v, train_season4_v]
result_training = pd.concat(frames_training)
result_training = result_training.sample(frac=1)
frames_validation = [validation_season5_v, validation_season6_v]
result_validation = pd.concat(frames_validation)
result_validation = result_validation.sample(frac=1)
frames_test = [test_season7_v, test_season8_v]
result_test = pd.concat(frames_test)
result_test = result_test.sample(frac=1)

In [75]:
result_training.sample(5)

Unnamed: 0,node1,node2,label,Name_Node1,Name_node2,season,cn,pa,ad
2481,302,189,1,STYR,YGRITTE,4,8.0,108.0,4.084861
483,110,228,0,THEON,SELYSE,1,0.0,0.0,0.0
1201,16,13,1,CATELYN,BRAN,3,3.0,374.0,1.020701
1921,198,92,1,EDMURE,ROBB,3,13.0,465.0,5.55969
78,83,88,0,RAKHARO,RHAEGO,1,4.0,32.0,1.560236


In [76]:
result_validation.sample(5)


Unnamed: 0,node1,node2,label,Name_Node1,Name_node2,season,cn,pa,ad
1862,339,74,1,OBARA,MYRCELLA,5,5.0,72.0,2.079094
1718,217,170,1,MYRANDA,RAMSAY,6,1.0,38.0,0.721348
1570,158,358,1,MARGAERY,CLARENZO,6,7.0,216.0,2.478703
1609,170,55,1,RAMSAY,LITTLEFINGER,6,6.0,209.0,1.964621
212,210,355,0,MACE,BIANCA,6,2.0,108.0,0.64295


In [77]:
result_test.sample(5)


Unnamed: 0,node1,node2,label,Name_Node1,Name_node2,season,cn,pa,ad
1206,25,32,1,GENDRY,HOUND,7,8.0,264.0,2.738381
2168,405,13,1,IRONBORN_LORD,BRAN,8,17.0,594.0,5.144817
786,110,161,0,THEON,MELISANDRE,8,7.0,200.0,2.027849
247,1,116,0,AEGON,TYWIN,8,1.0,4.0,0.264257
956,87,32,0,RHAEGAR,HOUND,8,2.0,150.0,0.537605


Having added some features, we can now train a classifier to predict whether a link exists between two nodes. This will help us see how good the features are at predicting links. As mentioned earlier, we choose a random forest classifier, initially with an arbitrary set of parameters.

# Choosing Random Forest Classifier

In [78]:
classifier = RandomForestClassifier(n_estimators=30, max_depth=10,  
                                    random_state=0)

## Train your model

In [80]:
columns = ["cn", "pa", "ad"]
X = result_training[columns]
y = result_training["label"]
classifier.fit(X, y)

RandomForestClassifier(max_depth=10, n_estimators=30, random_state=0)

# Evaluation

We create two functions, `evaluate_model` and `feature_importance`, to evaluate the performance of the model and to find its most important features.

In [81]:
def evaluate_model(predictions, actual):
    accuracy = accuracy_score(actual, predictions)
    precision = precision_score(actual, predictions)
    recall = recall_score(actual, predictions)
    metrics = ["accuracy", "precision", "recall"]
    values = [accuracy, precision, recall]    
    return pd.DataFrame(data={'metric': metrics, 'value': values})

def feature_importance(columns, classifier):        
    features = list(zip(columns, classifier.feature_importances_))
    sorted_features = sorted(features, key = lambda x: x[1]*-1)
    keys = [value[0] for value in sorted_features]
    values = [value[1] for value in sorted_features]
    return pd.DataFrame(data={'feature': keys, 'value': values})

In [82]:
predictions = classifier.predict(result_validation[columns])
y_test = result_validation["label"]
evaluate_model(predictions, y_test)

Unnamed: 0,metric,value
0,accuracy,0.84802
1,precision,0.805918
2,recall,0.916832


In [83]:
feature_importance(columns, classifier)

Unnamed: 0,feature,value
0,ad,0.350264
1,cn,0.350153
2,pa,0.299583


As we can see, the accuracy on the validation set is about 0.85. The precision is about 0.81, which means that roughly 81% of the predicted links are correct. The recall is about 0.92, which means that roughly 92% of the actual links are predicted. The feature importance shows that Adamic Adar and common neighbors are practically tied as the most important features (about 0.35 each), followed by preferential attachment (about 0.30).
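The three metrics reported by `evaluate_model` can be reproduced by hand from the confusion counts. A minimal sketch on hypothetical labels (the lists `actual` and `predicted` are made up for illustration and are unrelated to the results above):

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical labels: 1 = link exists, 0 = no link
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)  # share of all pairs classified correctly
precision = tp / (tp + fp)          # share of predicted links that are real
recall = tp / (tp + fn)             # share of real links that were found

print(accuracy, precision, recall)
```

Here three of the five predicted links are real (precision 0.6) and three of the four real links are found (recall 0.75).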

# Introducing more features (Triangles and The Clustering Coefficient)

To improve the accuracy further, we add more graph-based features: triangle count and clustering coefficient.

- The Triangle Count algorithm counts the number of triangles for each node in the graph. A triangle is a set of three nodes where each node has a relationship to the other two. In graph theory terminology, this is sometimes referred to as a 3-clique. The Triangle Count algorithm in the GDS library only finds triangles in undirected graphs. Triangle counting has gained popularity in social network analysis, where it is used to detect communities and measure the cohesiveness of those communities. It can also be used to determine the stability of a graph. Triangle count at a particular node indicates the number of triangles with that node as a vertex of the triangle. It indicates that the vertex is a good connector of its "friends". 

- The Local Clustering Coefficient algorithm computes the local clustering coefficient for each node in the graph. The local clustering coefficient $C_n$ of a node n describes the likelihood that the neighbours of n are also connected. To compute $C_n$ we use the number of triangles $T_n$ that the node is part of and the degree $d_n$ of the node. The formula to compute the local clustering coefficient is as follows: $$ C_{n} = \frac{2T_n}{d_n(d_n - 1)}$$
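Both quantities can be checked by hand on a small example. The sketch below is a pure-Python illustration on a hypothetical toy graph, not the GDS implementation. Returning 0.0 for nodes with fewer than two neighbours is a convention chosen here, since the formula is undefined for them.

```python
from itertools import combinations

# Hypothetical undirected toy graph: A-B-C form a triangle, D hangs off A
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}


def triangle_count(adj, node):
    """Number of triangles that have `node` as one of their vertices."""
    return sum(1 for u, v in combinations(sorted(adj[node]), 2) if v in adj[u])


def local_clustering_coefficient(adj, node):
    """C_n = 2 * T_n / (d_n * (d_n - 1)); 0.0 by convention when d_n < 2."""
    degree = len(adj[node])
    if degree < 2:
        return 0.0
    return 2 * triangle_count(adj, node) / (degree * (degree - 1))


for name in sorted(adjacency):
    print(name, triangle_count(adjacency, name), local_clustering_coefficient(adjacency, name))
```

Node A has three neighbours but only one linked pair among them, so its coefficient is 1/3, while B and C each score 1.0.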



## Calculating the Triangle count and Clustering coefficient

To calculate the triangle count for the nodes, we first create the in-memory graphs and then use the triangle count function in Neo4j. We write the result back to the nodes as a property (for example, `trianglesTrain1` for season 1). We proceed similarly for the clustering coefficient.

In [84]:
# Make the in memory graphs for adding triangle counts and clustering coefficients
query1 = """
CALL gds.graph.project(
  'myGraph1',
  'Person',
  {
    INTERACTS_1: {
      orientation: 'UNDIRECTED'
    }
  }
)
"""

query2 = """
CALL gds.graph.project(
  'myGraph2',
  'Person',
  {
    INTERACTS_2: {
      orientation: 'UNDIRECTED'
    }
  }
)
"""

query3 = """
CALL gds.graph.project(
  'myGraph3',
  'Person',
  {
    INTERACTS_3: {
      orientation: 'UNDIRECTED'
    }
  }
)
"""

query4 = """
CALL gds.graph.project(
  'myGraph4',
  'Person',
  {
    INTERACTS_4: {
      orientation: 'UNDIRECTED'
    }
  }
)
"""

query5 = """
CALL gds.graph.project(
  'myGraph5',
  'Person',
  {
    INTERACTS_5: {
      orientation: 'UNDIRECTED'
    }
  }
)
"""
graph.run(query1)
graph.run(query2)
graph.run(query3)
graph.run(query4)
graph.run(query5)


nodeProjection,relationshipProjection,graphName,nodeCount,relationshipCount,projectMillis
"{Person: {label: 'Person', properties: {}}}","{INTERACTS_5: {orientation: 'UNDIRECTED', indexInverse: false, aggregation: 'DEFAULT', type: 'INTERACTS_5', properties: {}}}",myGraph5,418,866,7


In [85]:
# Make the in memory graphs for adding triangle counts and clustering coefficients
query6 = """
CALL gds.graph.project(
  'myGraph6',
  'Person',
  {
    INTERACTS_6: {
      orientation: 'UNDIRECTED'
    }
  }
)
"""

query7 = """
CALL gds.graph.project(
  'myGraph7',
  'Person',
  {
    INTERACTS_7: {
      orientation: 'UNDIRECTED'
    }
  }
)
"""

query8 = """
CALL gds.graph.project(
  'myGraph8',
  'Person',
  {
    INTERACTS_8: {
      orientation: 'UNDIRECTED'
    }
  }
)
"""

graph.run(query6)
graph.run(query7)
graph.run(query8)


nodeProjection,relationshipProjection,graphName,nodeCount,relationshipCount,projectMillis
"{Person: {label: 'Person', properties: {}}}","{INTERACTS_8: {orientation: 'UNDIRECTED', indexInverse: false, aggregation: 'DEFAULT', type: 'INTERACTS_8', properties: {}}}",myGraph8,418,1200,7
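The eight projection queries above differ only in the season number, so they could also be generated in a loop. A minimal sketch, assuming the same graph and relationship naming; `projection_query` is a hypothetical helper introduced here, not part of py2neo or GDS:

```python
def projection_query(season):
    """Build the Cypher text that projects one season's undirected graph."""
    # Doubled braces are literal braces inside an f-string
    return f"""
CALL gds.graph.project(
  'myGraph{season}',
  'Person',
  {{
    INTERACTS_{season}: {{
      orientation: 'UNDIRECTED'
    }}
  }}
)
"""


# One query per season, keyed by season number
queries = {season: projection_query(season) for season in range(1, 9)}
print(queries[3])
```

Each string could then be passed to `graph.run`, as in the cells above.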


In [86]:
query1 = """ 
CALL gds.triangleCount.write('myGraph1', {
  writeProperty: 'trianglesTrain1'
})
"""

query2 = """ 
CALL gds.triangleCount.write('myGraph2', {
  writeProperty: 'trianglesTrain2'
})
"""

query3 = """ 
CALL gds.triangleCount.write('myGraph3', {
  writeProperty: 'trianglesTrain3'
})
"""

query4 = """ 
CALL gds.triangleCount.write('myGraph4', {
  writeProperty: 'trianglesTrain4'
})
"""



graph.run(query1)
graph.run(query2)
graph.run(query3)
graph.run(query4)


writeMillis,nodePropertiesWritten,globalTriangleCount,nodeCount,postProcessingMillis,preProcessingMillis,computeMillis,configuration
26,418,1524,418,0,0,4,"{jobId: '406ea1fa-6935-4cbe-ac97-7200d93b31b3', writeConcurrency: 4, writeProperty: 'trianglesTrain4', maxDegree: 9223372036854775807, logProgress: true, nodeLabels: ['*'], sudo: false, relationshipTypes: ['*'], concurrency: 4}"


In [87]:
query5 = """ 
CALL gds.triangleCount.write('myGraph5', {
  writeProperty: 'trianglesTest5'
})
"""

query6 = """ 
CALL gds.triangleCount.write('myGraph6', {
  writeProperty: 'trianglesTest6'
})
"""
query7 = """ 
CALL gds.triangleCount.write('myGraph7', {
  writeProperty: 'trianglesTest7'
})
"""
query8 = """ 
CALL gds.triangleCount.write('myGraph8', {
  writeProperty: 'trianglesTest8'
})
"""
graph.run(query5)
graph.run(query6)
graph.run(query7)
graph.run(query8)

writeMillis,nodePropertiesWritten,globalTriangleCount,nodeCount,postProcessingMillis,preProcessingMillis,computeMillis,configuration
15,418,3351,418,0,0,4,"{jobId: 'b57b46f8-1d85-4706-b74c-b6f3cf76647c', writeConcurrency: 4, writeProperty: 'trianglesTest8', maxDegree: 9223372036854775807, logProgress: true, nodeLabels: ['*'], sudo: false, relationshipTypes: ['*'], concurrency: 4}"


In [88]:
query1 = """
CALL gds.localClusteringCoefficient.write('myGraph1', {
    writeProperty: 'coefficientTrain1'
});
"""

query2 = """
CALL gds.localClusteringCoefficient.write('myGraph2', {
    writeProperty: 'coefficientTrain2'
});
"""

query3 = """
CALL gds.localClusteringCoefficient.write('myGraph3', {
    writeProperty: 'coefficientTrain3'
});
"""

query4 = """
CALL gds.localClusteringCoefficient.write('myGraph4', {
    writeProperty: 'coefficientTrain4'
});
"""



graph.run(query1)
graph.run(query2)
graph.run(query3)
graph.run(query4)


writeMillis,nodePropertiesWritten,averageClusteringCoefficient,nodeCount,postProcessingMillis,preProcessingMillis,computeMillis,configuration
16,418,0.2821768603167667,418,0,0,11,"{jobId: 'bf638fc8-48e5-49cb-aa1a-775ed6703cbc', writeConcurrency: 4, triangleCountProperty: null, writeProperty: 'coefficientTrain4', logProgress: true, nodeLabels: ['*'], sudo: false, relationshipTypes: ['*'], concurrency: 4}"


In [89]:
query5 = """
CALL gds.localClusteringCoefficient.write('myGraph5', {
    writeProperty: 'coefficientTest5'
});
"""

query6 = """
CALL gds.localClusteringCoefficient.write('myGraph6', {
    writeProperty: 'coefficientTest6'
});
"""

query7 = """
CALL gds.localClusteringCoefficient.write('myGraph7', {
    writeProperty: 'coefficientTest7'
});
"""

query8 = """
CALL gds.localClusteringCoefficient.write('myGraph8', {
    writeProperty: 'coefficientTest8'
});
"""

graph.run(query5)
graph.run(query6)
graph.run(query7)
graph.run(query8)

writeMillis,nodePropertiesWritten,averageClusteringCoefficient,nodeCount,postProcessingMillis,preProcessingMillis,computeMillis,configuration
18,418,0.1246066499305612,418,0,0,15,"{jobId: 'c8dbd65c-94c5-46b3-a30d-5d352781eceb', writeConcurrency: 4, triangleCountProperty: null, writeProperty: 'coefficientTest8', logProgress: true, nodeLabels: ['*'], sudo: false, relationshipTypes: ['*'], concurrency: 4}"


## Adding the features

As we need features for the edges, we take the maximum and the minimum of the properties of the two nodes forming each edge. Since an edge is formed by two nodes, we are in effect adding the clustering coefficients and triangle counts of both nodes as features for prediction.
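The min/max aggregation can be sketched in plain Python. The node values and the helper `edge_features` below are hypothetical and only illustrate how two node-level properties become four edge-level features:

```python
# Hypothetical per-node properties (triangle count and clustering coefficient)
node_triangles = {"A": 5, "B": 2, "C": 0}
node_coefficients = {"A": 0.4, "B": 1.0, "C": 0.0}


def edge_features(pair, triangles, coefficients):
    """Combine the properties of both endpoints into edge-level features."""
    first, second = pair
    return {
        "minTriangles": min(triangles[first], triangles[second]),
        "maxTriangles": max(triangles[first], triangles[second]),
        "minCoeff": min(coefficients[first], coefficients[second]),
        "maxCoeff": max(coefficients[first], coefficients[second]),
    }


print(edge_features(("A", "B"), node_triangles, node_coefficients))
```

Taking the minimum and maximum, rather than the values of "node1" and "node2", keeps the features independent of the order in which the two nodes are listed.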

In [90]:
def apply_triangles_features(data, triangles_prop, coefficient_prop):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1:Person) WHERE id(p1) = pair.node1
    MATCH (p2:Person) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           apoc.coll.min([p1[$triangles], p2[$triangles]]) AS minTriangles,
           apoc.coll.max([p1[$triangles], p2[$triangles]]) AS maxTriangles,
           apoc.coll.min([p1[$coefficient], p2[$coefficient]]) AS minCoeff,
           apoc.coll.max([p1[$coefficient], p2[$coefficient]]) AS maxCoeff
    """

    # Pass the node ids as integers: id(p1) is an integer, so a string
    # parameter would never match, and the merge keys below are int64.
    pairs = [{"node1": pair[0], "node2": pair[1]}
             for pair in data[["node1", "node2"]].values.tolist()]

    params = {
        "pairs": pairs,
        "triangles": triangles_prop,
        "coefficient": coefficient_prop
    }

    features = graph.run(query, params).to_data_frame()

    return pd.merge(data, features, on=["node1", "node2"])

In [86]:
def apply_triangles_features(data,triangles_prop,coefficient_prop):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1:Person) WHERE p1.id = pair.node1
    MATCH (p2:Person) WHERE p2.id = pair.node2
    RETURN pair.node1 AS node1, 
    pair.node2 AS node2,
    apoc.coll.min([p1[$triangles], p2[$triangles]]) AS minTriangles,
    apoc.coll.max([p1[$triangles], p2[$triangles]]) AS maxTriangles,
    apoc.coll.min([p1[$coefficient], p2[$coefficient]]) AS minCoeff,
    apoc.coll.max([p1[$coefficient], p2[$coefficient]]) AS maxCoeff
    """
    

    pairs = [{"node1": str(pair[0]), "node2": str(pair[1])}  
          for pair in data[["node1", "node2"]].values.tolist()]
        
    params = {
        "pairs": pairs,
        "triangles": triangles_prop,
        "coefficient": coefficient_prop
        }
    
    features = graph.run(query,params).to_data_frame()
    
    return pd.merge(data, features, on = ["node1", "node2"])

In [91]:
train_season1_w = apply_triangles_features(train_season1_v, "trianglesTrain1", "coefficientTrain1")
train_season2_w = apply_triangles_features(train_season2_v, "trianglesTrain2", "coefficientTrain2")
train_season3_w = apply_triangles_features(train_season3_v, "trianglesTrain3", "coefficientTrain3")
train_season4_w = apply_triangles_features(train_season4_v, "trianglesTrain4", "coefficientTrain4")
# train_season5_w = apply_triangles_features(train_season5_v, "trianglesTrain5", "coefficientTrain5")

validation_season5_w = apply_triangles_features(validation_season5_v, "trianglesTest5", "coefficientTest5")
validation_season6_w = apply_triangles_features(validation_season6_v, "trianglesTest6", "coefficientTest6")
test_season7_w = apply_triangles_features(test_season7_v, "trianglesTest7", "coefficientTest7")
test_season8_w = apply_triangles_features(test_season8_v, "trianglesTest8", "coefficientTest8")

KeyError: 'node1'

We then concatenate the per-season frames, which now include the new features, into the training, validation and test sets, and shuffle the rows of each set.

In [88]:
frames_training_w = [train_season1_w, train_season2_w,
                   train_season3_w, train_season4_w]
result_training_w = pd.concat(frames_training_w)
result_training_w = result_training_w.sample(frac=1).reset_index(drop=True)
frames_validation_w = [validation_season5_w, validation_season6_w]
result_validation_w = pd.concat(frames_validation_w)
result_validation_w = result_validation_w.sample(frac=1).reset_index(drop=True)
frames_test_w = [test_season7_w, test_season8_w]
result_test_w = pd.concat(frames_test_w)
result_test_w = result_test_w.sample(frac=1).reset_index(drop=True)

In [89]:
# Drop the eight in-memory graph projections (myGraph1 ... myGraph8).
for season in range(1, 9):
    result = graph.run(f"CALL gds.graph.drop('myGraph{season}') YIELD graphName;")

# Show the result of the last drop.
result

graphName
myGraph8


# Training and Hyperparameter tuning

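We train the random forest classifier and select the best hyperparameters with `RandomizedSearchCV`, which here samples 100 parameter combinations and scores each with 3-fold cross-validation on the training set.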

In [91]:
# Candidate values for each hyperparameter.
n_estimators = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
max_depth = [2, 3, 4, 5, 6, 7, 8, 9, 10]
min_samples_split = [2, 3, 4, 5, 6, 7, 8, 9, 10]
min_samples_leaf = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Note: for classifiers 'auto' is an alias of 'sqrt'; it was deprecated in
# scikit-learn 1.1 and removed in 1.3, so drop it on newer versions.
max_features = ['auto', 'sqrt', 'log2']
bootstrap = [True, False]
criterion = ['gini', 'entropy']

param_grid = {'n_estimators': n_estimators,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'max_features': max_features,
              'bootstrap': bootstrap,
              'criterion': criterion}

# Sample 100 combinations, each scored with 3-fold cross-validation.
rf = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=param_grid,
                               n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
rf_random.fit(result_training_w[["cn", "pa", "ad", "minTriangles",
                                 "maxTriangles", "minCoeff", "maxCoeff"]],
              result_training_w['label'])

Fitting 3 folds for each of 100 candidates, totalling 300 fits


In [92]:
print(rf_random.best_params_)
print(rf_random.best_score_)
print(rf_random.best_estimator_)
print(rf_random.best_estimator_.feature_importances_)


{'n_estimators': 20, 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'criterion': 'gini', 'bootstrap': False}
0.9909082824353854
RandomForestClassifier(bootstrap=False, max_depth=10, min_samples_leaf=2,
                       min_samples_split=4, n_estimators=20)
[0.06537562 0.07329361 0.07128983 0.04820935 0.37987766 0.04029055
 0.32166337]
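The importances printed above follow the order of the columns passed to `fit` (`cn`, `pa`, `ad`, `minTriangles`, `maxTriangles`, `minCoeff`, `maxCoeff`), so the two largest values belong to `maxTriangles` and `maxCoeff`. A minimal sketch of how the array could be labelled and sorted is shown below; it uses synthetic data from `make_classification` as a stand-in for the notebook's feature table, so the numbers it prints are not the project's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

columns = ["cn", "pa", "ad", "minTriangles", "maxTriangles", "minCoeff", "maxCoeff"]

# Synthetic stand-in data with the same column names (assumption, not the real features).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=len(columns), n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=columns)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=columns).sort_values(ascending=False)
print(importances)
```

In the notebook, the same two lines at the end would apply with `rf_random.best_estimator_` in place of `model`.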


# Testing

Now we evaluate the model on the test set to see how it performs on unseen data. For this we take the classifier with the best mean cross-validated accuracy from the search (`best_estimator_`) and refit it on the full training set.

In [93]:
classifier2 = rf_random.best_estimator_

In [95]:
columns = ["cn", "pa", "ad","minTriangles", "maxTriangles", "minCoeff", "maxCoeff"]
X = result_training_w[columns]
y = result_training_w["label"]
classifier2.fit(X, y)

In [96]:
predictions = classifier2.predict(result_test_w[columns])
y_test = result_test_w["label"]
evaluate_model(predictions, y_test)


metric,value
accuracy,0.931062
precision,0.945902
recall,0.973746


The model also performs well on the test set: accuracy is about 93.1%, precision 94.6% and recall 97.4%.
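To make the three metrics concrete, the sketch below derives them from a confusion matrix on a small made-up label vector. The labels are hypothetical and unrelated to the test set, and the example is independent of the `evaluate_model` helper called above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = link exists, 0 = no link.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all pairs classified correctly
precision = tp / (tp + fp)                  # share of predicted links that are real
recall = tp / (tp + fn)                     # share of real links that were found

print(accuracy, precision, recall)

# The hand-computed values agree with scikit-learn's metric functions.
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
```

A recall higher than the precision, as in the test results above, means the model misses few real links but predicts some links that do not exist.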

# References

- https://neo4j.com/docs/
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html