Berner Fachhochschule BFH - MAS Data Science - Graph Machine Learning - Master Thesis FS/2022 Thomas Iten

# Experiment 8 - Node2Vec Employee vs. Employee-X Tests

**References**<br />
[1] https://snap.stanford.edu/node2vec<br />
[2] https://stellargraph.readthedocs.io/en/stable/demos/link-prediction/node2vec-link-prediction.html<br />



In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from gml.graph.data_factory import TestTrainDataFactory, EdgeLabelFactory
from gml.graph.graph_embedding import EdgeEmbedding

## 8.1 Data Structure

### Data Collection - Employees with Organisation

<img src="img/test-6.png" alt="Test Scenario 4" width="800"/>

## 8.2 Employee Test


In [2]:
n = 2000

graph, test_graph, test_samples, test_labels, train_graph, train_samples, train_labels \
    = TestTrainDataFactory().create_testdata(n, add_id=False, add_predict_edges=True)

graph.print_dimemsions()

Graph dimensions:
  order : 12002 (number of nodes)
  size  : 24000 (number of edges)
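
The two figures reported above are the usual graph measures: the order counts distinct nodes and the size counts edges. A minimal sketch on a toy edge list follows; the node names (including `ORG`) are hypothetical and only mimic the EM/DC naming used later in this notebook, and `graph_dimensions` is not part of the `gml` package.

```python
# Toy undirected edge list; node names are illustrative only.
edges = [("EM1", "DC1"), ("EM2", "DC2"), ("EM1", "ORG"), ("EM2", "ORG")]

def graph_dimensions(edge_list):
    """Return (order, size) of an undirected simple graph given as an edge list."""
    nodes = {node for edge in edge_list for node in edge}
    # Treat (a, b) and (b, a) as the same undirected edge.
    unique_edges = {frozenset(edge) for edge in edge_list}
    return len(nodes), len(unique_edges)

order, size = graph_dimensions(edges)
print(f"order : {order} (number of nodes)")
print(f"size  : {size} (number of edges)")
```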


In [4]:
window = 8
dimensions = 64
embeddings =  EdgeEmbedding(graph.graph, window=window, dimensions=dimensions).embeddings

test_embeddings  = [embeddings[str(x[0]),str(x[1])] for x in test_samples]
train_embeddings = [embeddings[str(x[0]),str(x[1])] for x in train_samples]

classifier = RandomForestClassifier
c = classifier(n_estimators=1000)
c.fit(train_embeddings, train_labels)
y_pred = c.predict(test_embeddings)

name  = classifier.__name__
index = ["Precision", "Recall", "F1-Score"]
score = {}

score[name] = [
    metrics.precision_score(test_labels, y_pred),
    metrics.recall_score(test_labels, y_pred),
    metrics.f1_score(test_labels, y_pred)
]

df = pd.DataFrame(score, index=index)
print(df)

Computing transition probabilities:   0%|          | 0/12002 [00:00<?, ?it/s]
Generating walks (CPU: 1): 100%|██████████| 10/10 [12:48<00:00, 76.89s/it]


           RandomForestClassifier
Precision                     1.0
Recall                        1.0
F1-Score                      1.0
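
The lookup `embeddings[str(x[0]),str(x[1])]` returns one feature vector per edge rather than per node. How `EdgeEmbedding` combines the two node vectors is not shown in this notebook. A common choice with node2vec, assumed here purely for illustration, is the element-wise (Hadamard) product. The helper `hadamard_edge_embedding` and the vectors below are hypothetical.

```python
def hadamard_edge_embedding(node_vectors, source, target):
    """Combine two node vectors into one edge vector by element-wise product."""
    u = node_vectors[source]
    v = node_vectors[target]
    if len(u) != len(v):
        raise ValueError("node vectors must have the same dimension")
    return [a * b for a, b in zip(u, v)]

# Hypothetical 4-dimensional node embeddings (the notebook uses 64 dimensions).
node_vectors = {
    "EM1": [0.5, -1.0, 2.0, 0.0],
    "DC1": [2.0, 0.5, 0.5, 3.0],
}

edge_vector = hadamard_edge_embedding(node_vectors, "EM1", "DC1")
print(edge_vector)  # [1.0, -0.5, 1.0, 0.0]
```

Under this assumption the edge vector does not depend on the order of the two nodes.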


## 8.3 Test with Employee X

### Generate link predictions EM-DC (positive) and EMx-DC (negative)

In [5]:
# Filter positive edges
pos_samples = []
for i in range(len(test_labels)):
    sample = test_samples[i]
    label = test_labels[i]
    if label == 1:
        pos_samples.append(sample)

# Generate negative samples by appending an x to the employee
neg_samples = []
for sample in pos_samples:
    from_node = sample[0]
    to_node = sample[1]
    if from_node.startswith("EM"):
        from_node = from_node + "x"
    if to_node.startswith("EM"):
        to_node = to_node + "x"
    neg_samples.append((from_node, to_node))


# Generate new test data set
pos_labels = [1 for _ in range(len(pos_samples))]
neg_labels = [0 for _ in range(len(neg_samples))]

# Combine and shuffle samples (new lists, so pos_samples and pos_labels stay unchanged)
samples = pos_samples + neg_samples
labels = pos_labels + neg_labels
test_samples, test_labels = EdgeLabelFactory().shuffle(samples, labels)

# Generate new test embeddings
test_embeddings  = [embeddings[str(x[0]),str(x[1])] for x in test_samples]

print("New Testdata set with EM and EMx")
print(test_samples[:10], "...")
print(test_labels[:10], "...")

New Testdata set with EM and EMx
[('DC1581', 'EM1581'), ('DC1591', 'EM1591'), ('EM508x', 'DC508'), ('EM1490x', 'DC1490'), ('DC823', 'EM823x'), ('EM1552', 'DC1552'), ('DC519', 'EM519'), ('EM1494x', 'DC1494'), ('EM724', 'DC724'), ('EM1548', 'DC1548')] ...
[1, 1, 0, 0, 0, 1, 1, 0, 1, 1] ...
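
The cell above can be condensed into a small helper that leaves its inputs untouched. This is a sketch, not part of the `gml` package: `build_emx_test_set` is a hypothetical name, and `random.Random(seed).shuffle` stands in for `EdgeLabelFactory().shuffle`, whose implementation is not shown here.

```python
import random

def build_emx_test_set(samples, labels, seed=None):
    """Pair every positive edge with a negative twin whose employee node gets an 'x' suffix."""
    positives = [s for s, label in zip(samples, labels) if label == 1]

    def mark_employee(node):
        # Only employee nodes (prefix "EM") are redirected to their "x" variant.
        return node + "x" if node.startswith("EM") else node

    negatives = [(mark_employee(a), mark_employee(b)) for a, b in positives]

    # Build new lists instead of extending the inputs in place.
    pairs = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    random.Random(seed).shuffle(pairs)
    new_samples = [s for s, _ in pairs]
    new_labels = [label for _, label in pairs]
    return new_samples, new_labels

# Illustrative input: two positive edges and one negative edge.
demo_samples, demo_labels = build_emx_test_set(
    [("EM1", "DC1"), ("DC2", "EM2"), ("EM3", "DC9")], [1, 1, 0], seed=42
)
print(demo_samples)
print(demo_labels)
```

The result always contains exactly one EMx edge per positive edge, so the new test set is balanced.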


### Train and Test

In [6]:
classifier = RandomForestClassifier
c = classifier(n_estimators=1000)
c.fit(train_embeddings, train_labels)
y_pred = c.predict(test_embeddings)

name  = classifier.__name__
index = ["Precision", "Recall", "F1-Score"]
score = {}

score[name] = [
    metrics.precision_score(test_labels, y_pred),
    metrics.recall_score(test_labels, y_pred),
    metrics.f1_score(test_labels, y_pred)
]

df = pd.DataFrame(score, index=index)
print(df)

           RandomForestClassifier
Precision                0.566893
Recall                   1.000000
F1-Score                 0.723589
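
Because the EM/EMx test set holds exactly one negative edge per positive edge, precision and recall together determine how many negatives were accepted. Per positive sample, `TP = recall` and `FP = TP * (1 / precision - 1)`; dividing by the number of negatives per positive gives the false positive rate. A small sketch of that arithmetic, with a hypothetical function name:

```python
def implied_false_positive_rate(precision, recall, neg_per_pos=1.0):
    """False positive rate implied by precision and recall for a known class ratio."""
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    # Per positive sample: TP = recall, FP = TP * (1 / precision - 1).
    false_positives_per_pos = recall * (1.0 / precision - 1.0)
    return false_positives_per_pos / neg_per_pos

fpr = implied_false_positive_rate(precision=0.566893, recall=1.0)
print(f"{fpr:.3f}")  # 0.764
```

With the scores above, roughly three out of four EMx edges are still classified as existing links.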


### Test with Threshold 0.8

In [8]:
y_pred = (c.predict_proba(test_embeddings)[:,1] >= 0.8).astype(bool)

name  = classifier.__name__
index = ["Precision", "Recall", "F1-Score"]
score = {}

score[name] = [
    metrics.precision_score(test_labels, y_pred),
    metrics.recall_score(test_labels, y_pred),
    metrics.f1_score(test_labels, y_pred)
]

df = pd.DataFrame(score, index=index)
print(df)

           RandomForestClassifier
Precision                0.937178
Recall                   0.910000
F1-Score                 0.923389
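
The value 0.8 is one point on a precision/recall trade-off. The sketch below shows how several thresholds could be compared side by side. The scores and labels are made up for illustration; in the notebook the scores would come from `c.predict_proba(test_embeddings)[:,1]`. The helper names are hypothetical, and the metrics are computed by hand only to keep the example free of dependencies.

```python
def precision_recall_f1(labels, predictions):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def sweep_thresholds(labels, scores, thresholds):
    """Evaluate each threshold; a sample is positive when score >= threshold."""
    results = {}
    for t in thresholds:
        predictions = [1 if s >= t else 0 for s in scores]
        results[t] = precision_recall_f1(labels, predictions)
    return results

# Made-up positive-class probabilities and true labels.
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.60, 0.40, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

results = sweep_thresholds(labels, scores, thresholds=[0.5, 0.8])
for t, (p, r, f1) in results.items():
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data the higher threshold removes the false positives at the cost of one missed positive, the same direction of change as in the results above.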


## 8.4 Test with Employee X without Organisation

In [9]:
graph, test_graph, test_samples, test_labels, train_graph, train_samples, train_labels\
    = TestTrainDataFactory().create_testdata(n, add_id=False, add_predict_edges=True, add_org=False)

graph.print_dimemsions()

Graph dimensions:
  order : 12000 (number of nodes)
  size  : 16000 (number of edges)


In [10]:
window = 8
dimensions = 64
embeddings =  EdgeEmbedding(graph.graph, window=window, dimensions=dimensions).embeddings

test_embeddings  = [embeddings[str(x[0]),str(x[1])] for x in test_samples]
train_embeddings = [embeddings[str(x[0]),str(x[1])] for x in train_samples]

print("Embedding shape:")
print("Nodes    =", str(embeddings.kv.vectors.shape[0]), "(number of nodes)")
print("Features =", embeddings.kv.vectors.shape[1], "(number of features per node)")

Computing transition probabilities:   0%|          | 0/12000 [00:00<?, ?it/s]
Generating walks (CPU: 1): 100%|██████████| 10/10 [00:33<00:00,  3.36s/it]


Embedding shape:
Nodes    = 12000 (number of nodes)
Features = 64 (number of features per node)


In [11]:
# Filter positive edges
pos_samples = []
for i in range(len(test_labels)):
    sample = test_samples[i]
    label = test_labels[i]
    if label == 1:
        pos_samples.append(sample)

# Generate negative samples by appending an x to the employee
neg_samples = []
for sample in pos_samples:
    from_node = sample[0]
    to_node = sample[1]
    if from_node.startswith("EM"):
        from_node = from_node + "x"
    if to_node.startswith("EM"):
        to_node = to_node + "x"
    neg_samples.append((from_node, to_node))


# Generate new test data set
pos_labels = [1 for _ in range(len(pos_samples))]
neg_labels = [0 for _ in range(len(neg_samples))]

# Combine and shuffle samples (new lists, so pos_samples and pos_labels stay unchanged)
samples = pos_samples + neg_samples
labels = pos_labels + neg_labels
test_samples, test_labels = EdgeLabelFactory().shuffle(samples, labels)

# Generate new test embeddings
test_embeddings  = [embeddings[str(x[0]),str(x[1])] for x in test_samples]

print("New Testdata set with EM and EMx")
print(test_samples[:10], "...")
print(test_labels[:10], "...")

New Testdata set with EM and EMx
[('EM664', 'DC664'), ('DC259', 'EM259'), ('EM958', 'DC958'), ('DC359', 'EM359x'), ('EM670', 'DC670'), ('DC1011', 'EM1011'), ('EM1756x', 'DC1756'), ('DC1167', 'EM1167x'), ('EM196x', 'DC196'), ('DC905', 'EM905x')] ...
[1, 1, 1, 0, 1, 1, 0, 0, 0, 0] ...


In [12]:
classifier = RandomForestClassifier
c = classifier(n_estimators=1000)
c.fit(train_embeddings, train_labels)
y_pred = c.predict(test_embeddings)

name  = classifier.__name__
index = ["Precision", "Recall", "F1-Score"]
score = {}

score[name] = [
    metrics.precision_score(test_labels, y_pred),
    metrics.recall_score(test_labels, y_pred),
    metrics.f1_score(test_labels, y_pred)
]

df = pd.DataFrame(score, index=index)
print(df)

           RandomForestClassifier
Precision                0.500504
Recall                   0.993000
F1-Score                 0.665550


