Berner Fachhochschule BFH - MAS Data Science - Graph Machine Learning - Master Thesis FS/2022 Thomas Iten

# Experiment 6 - Node2Vec Exploration Tests

**References**<br />
[1] https://snap.stanford.edu/node2vec<br />
[2] https://stellargraph.readthedocs.io/en/stable/demos/link-prediction/node2vec-link-prediction.html<br />



In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import metrics
from gml.graph.data_factory import TestTrainDataFactory
from gml.graph.graph_embedding import EdgeEmbedding

## 6.1 Data Structure

### Data Collection - Employees with Organization

<img src="img/test-6.png" alt="Test Scenario 6" width="800"/>

## 6.2 Create graph and show properties

In [2]:
n = 500

graph, test_graph, test_samples, test_labels, train_graph, train_samples, train_labels \
    = TestTrainDataFactory().create_testdata(n, add_id=False, add_predict_edges=True)

graph.print_dimemsions()
test_graph.print_dimemsions(title="Test graph dimensions")
train_graph.print_dimemsions(title="Train graph dimensions")

Graph dimensions:
  order : 3002 (number of nodes)
  size  : 6000 (number of edges)
Test graph dimensions:
  order : 3002 (number of nodes)
  size  : 5750 (number of edges)
Train graph dimensions:
  order : 3002 (number of nodes)
  size  : 5625 (number of edges)
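
The printed sizes show that the test graph has 250 fewer edges than the full graph (6000 vs. 5750), and the train graph a further 125 fewer (5750 vs. 5625). A common way to build such graphs for link prediction, as in the StellarGraph demo [2], is to hold out some edges as positive samples and draw the same number of unconnected node pairs as negative samples. The sketch below illustrates that idea on a toy ring graph. The function `split_edges` and its parameters are hypothetical, and it is not the implementation of `TestTrainDataFactory`, which is not shown in this notebook.

```python
import random

def split_edges(edges, nodes, n_holdout, seed=0):
    """Hold out n_holdout edges as positives and sample as many non-edges as negatives.

    Hypothetical helper for illustration only (undirected graph assumed).
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}

    # Positive samples: existing edges that are removed from the graph
    positives = rng.sample(edges, n_holdout)
    held_out = {frozenset(e) for e in positives}
    reduced_edges = [e for e in edges if frozenset(e) not in held_out]

    # Negative samples: node pairs that are not connected in the original graph
    negatives = set()
    while len(negatives) < n_holdout:
        u, v = rng.sample(nodes, 2)
        pair = frozenset((u, v))
        if pair not in edge_set:
            negatives.add(pair)
    negatives = [tuple(sorted(p)) for p in negatives]

    samples = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return reduced_edges, samples, labels

# Toy ring graph with 10 nodes and 10 edges
nodes = list(range(10))
edges = [(i, (i + 1) % 10) for i in range(10)]
reduced_edges, samples, labels = split_edges(edges, nodes, n_holdout=3)
print(len(reduced_edges), len(samples), labels)
```

Sampling as many negatives as positives keeps the two classes balanced, which makes precision and recall easier to interpret.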


## 6.3 Create embeddings

In [3]:
# Context window and embedding size for the Node2Vec model
window = 3
dimensions = 4
embeddings = EdgeEmbedding(graph.graph, window=window, dimensions=dimensions).embeddings

# Look up one edge embedding per (source, target) sample; node ids are passed as strings
test_embeddings = [embeddings[str(x[0]), str(x[1])] for x in test_samples]
train_embeddings = [embeddings[str(x[0]), str(x[1])] for x in train_samples]

print("Embedding shape:")
print("Nodes    =", embeddings.kv.vectors.shape[0], "(number of nodes)")
print("Features =", embeddings.kv.vectors.shape[1], "(number of features per node)")


Embedding shape:
Nodes    = 3002 (number of nodes)
Features = 4 (number of features per node)
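
The shape printed above describes the node vectors, while the classifiers are trained on one vector per node pair. One common way to combine two node vectors into an edge vector is the element-wise (Hadamard) product, one of the binary operators proposed in the node2vec paper. Whether `EdgeEmbedding` uses exactly this operator is an assumption; the sketch below only illustrates the idea with made-up 4-dimensional vectors, and the helper `hadamard` is hypothetical.

```python
def hadamard(u, v):
    """Element-wise product of two equally sized node vectors."""
    if len(u) != len(v):
        raise ValueError("node vectors must have the same length")
    return [a * b for a, b in zip(u, v)]

# Made-up node vectors with 4 features each (matching dimensions = 4)
node_vectors = {
    "1": [0.5, -1.0, 2.0, 0.0],
    "2": [2.0, 3.0, 0.5, 4.0],
}

# The edge vector has the same number of features as the node vectors
edge_vector = hadamard(node_vectors["1"], node_vectors["2"])
print(edge_vector)
```

Because the product is symmetric, the pairs (1, 2) and (2, 1) get the same edge vector, which suits undirected graphs.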


## 6.4 Train and Test with RandomForestClassifier and AdaBoostClassifier

In [4]:
classifiers = [RandomForestClassifier, AdaBoostClassifier]

index  = ["Precision", "Recall", "F1-Score"]
scores = {}

print("Train and Test:")
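for classifier in classifiers:
    name = classifier.__name__
    print("- Start train", name, "...")
    # Fit an ensemble of 1000 estimators on the training edge embeddings
    model = classifier(n_estimators=1000)
    model.fit(train_embeddings, train_labels)
    print("- Start test", name, "...")
    # Predict link / no link for the test edge embeddings
    y_pred = model.predict(test_embeddings)
    # Order must match the `index` list: Precision, Recall, F1-Score
    scores[name] = [
        metrics.precision_score(test_labels, y_pred),
        metrics.recall_score(test_labels, y_pred),
        metrics.f1_score(test_labels, y_pred)
    ]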

print("\nMetrics:")
df = pd.DataFrame(scores, index=index)
print(df)

Train and Test:
- Start train RandomForestClassifier ...
- Start test RandomForestClassifier ...
- Start train AdaBoostClassifier ...
- Start test AdaBoostClassifier ...

Metrics:
           RandomForestClassifier  AdaBoostClassifier
Precision                0.857143            0.865672
Recall                   0.984000            0.928000
F1-Score                 0.916201            0.895753
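
The F1 score is the harmonic mean of precision and recall, so the third row of the table can be recomputed from the first two. The short check below does this with the values printed above; the helper name `f1_from` is an illustrative choice and not part of scikit-learn.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the metrics table
reported = {
    "RandomForestClassifier": (0.857143, 0.984000, 0.916201),
    "AdaBoostClassifier": (0.865672, 0.928000, 0.895753),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_from(precision, recall)
    print(f"{name}: reported={f1:.6f} recomputed={recomputed:.6f}")
```

In this run the random forest has slightly lower precision but clearly higher recall than AdaBoost, which results in the higher F1 score.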



---
_The end._