<font color='006A58'>**Supervised Graph Learning**</font>

The feature-based method is a simple yet effective approach to supervised machine learning on graphs. The idea relies on the classic machine learning practice of handcrafted feature extraction: each graph is summarized by a fixed-length vector of graph metrics, which is then passed to a standard classifier. In this demo, we use the PROTEINS dataset, which is already integrated in StellarGraph.

In [1]:
!pip install -q stellargraph

[?25l[K     |▊                               | 10 kB 24.5 MB/s eta 0:00:01[K     |█▌                              | 20 kB 31.0 MB/s eta 0:00:01[K     |██▎                             | 30 kB 24.1 MB/s eta 0:00:01[K     |███                             | 40 kB 19.6 MB/s eta 0:00:01[K     |███▊                            | 51 kB 8.0 MB/s eta 0:00:01[K     |████▌                           | 61 kB 9.3 MB/s eta 0:00:01[K     |█████▎                          | 71 kB 9.7 MB/s eta 0:00:01[K     |██████                          | 81 kB 10.8 MB/s eta 0:00:01[K     |██████▊                         | 92 kB 8.6 MB/s eta 0:00:01[K     |███████▌                        | 102 kB 9.3 MB/s eta 0:00:01[K     |████████▎                       | 112 kB 9.3 MB/s eta 0:00:01[K     |█████████                       | 122 kB 9.3 MB/s eta 0:00:01[K     |█████████▉                      | 133 kB 9.3 MB/s eta 0:00:01[K     |██████████▌                     | 143 kB 9.3 MB/s eta 0:00:01[K

In [14]:
from stellargraph import datasets
import numpy as np
import pandas as pd
import networkx as nx
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [3]:
dataset = datasets.PROTEINS()
graphs, graph_labels = dataset.load()
dataset.description

'Each graph represents a protein and graph labels represent whether they are are enzymes or non-enzymes. The dataset includes 1113 graphs with 39 nodes and 73 edges on average for each graph. Graph nodes have 4 attributes (including a one-hot encoding of their label), and each graph is labelled as belonging to 1 of 2 classes.'

One way to compute graph metrics is to retrieve the adjacency matrix of each graph and rebuild it as a NetworkX graph. Before doing so, we check how the two classes are distributed.

In [16]:
pd.Series(graph_labels).value_counts(dropna=False)

1    663
2    450
Name: label, dtype: int64
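
The class counts above give a useful reference point: a classifier that always predicts the larger class already reaches a non-trivial accuracy. The following is a minimal sketch of that arithmetic. The counts are copied from the output above, the variable names are illustrative, and the figure refers to the full dataset rather than to any particular train/test split.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 663, 2: 450}

total = sum(class_counts.values())                        # 1113 graphs
majority_label = max(class_counts, key=class_counts.get)  # the larger class
baseline_accuracy = class_counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 3))  # prints: 1 0.596
```

Any model trained on these graphs should be compared against this majority-class baseline, not against 50%.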

In [6]:
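# convert graphs from StellarGraph format to dense NumPy adjacency matrices
# (.toarray() is equivalent to the older .A shorthand on SciPy sparse matrices)
adjs = [graph.to_adjacency_matrix().toarray() for graph in graphs]

# convert labels from pandas.Series to a NumPy array
labels = graph_labels.to_numpy(dtype=int)

metrics = []

for adj in adjs:
  # from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it
  G = nx.from_numpy_array(adj)
  # basic properties
  num_edges = G.number_of_edges()
  # clustering measures
  cc = nx.average_clustering(G)
  # measure of efficiency
  eff = nx.global_efficiency(G)

  # one feature vector per graph: [edge count, avg clustering, global efficiency]
  metrics.append([num_edges, cc, eff])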
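
To make the three features concrete, here is a dependency-free sketch that computes them by hand on a toy graph (a triangle with one extra node attached). It follows the standard definitions: average clustering is the mean of the local clustering coefficients, with nodes of degree below 2 counting as 0, and global efficiency is the mean of 1/d(u, v) over all ordered node pairs, with unreachable pairs contributing 0. The helper names and the toy matrix are illustrative assumptions, not part of the notebook or of NetworkX.

```python
from collections import deque
from itertools import combinations

def neighbors_from_adjacency(adj):
    """Turn a square 0/1 adjacency matrix into a list of neighbor sets."""
    n = len(adj)
    return [{j for j in range(n) if adj[i][j] and j != i} for i in range(n)]

def count_edges(nbrs):
    # Every undirected edge shows up in two neighbor sets.
    return sum(len(s) for s in nbrs) // 2

def average_clustering(nbrs):
    total = 0.0
    for s in nbrs:
        k = len(s)
        if k < 2:
            continue  # local coefficient is taken as 0
        links = sum(1 for u, v in combinations(s, 2) if v in nbrs[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nbrs)

def global_efficiency(nbrs):
    n = len(nbrs)
    if n < 2:
        return 0.0
    total = 0.0
    for source in range(n):
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for node, d in dist.items() if node != source)
    return total / (n * (n - 1))

# Toy graph: triangle 0-1-2 plus node 3 attached to node 2.
toy_adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
toy_nbrs = neighbors_from_adjacency(toy_adj)
toy_features = [count_edges(toy_nbrs),
                average_clustering(toy_nbrs),
                global_efficiency(toy_nbrs)]
print(toy_features)
```

For this toy graph the feature vector has 4 edges, an average clustering of 7/12, and a global efficiency of 5/6. Each protein graph is reduced to a vector of the same shape.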

In [7]:
X_train, X_test, y_train, y_test = train_test_split(metrics, labels, test_size=0.3, random_state=42)

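As is common in machine learning workflows, we standardize the features to zero mean and unit standard deviation. The scaler is fit on the training set only and then applied to both splits, so that no test-set statistics leak into training.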

In [9]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
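
The scaling step above amounts to subtracting a per-feature mean and dividing by a per-feature standard deviation, both estimated on the training rows. The sketch below reproduces that arithmetic with the standard library only. The function names and toy numbers are hypothetical, and it mirrors only the default behaviour of `StandardScaler` (population standard deviation, constant columns left unscaled), not the full class.

```python
from statistics import fmean, pstdev

def fit_standardizer(rows):
    """Estimate per-column mean and population std from the training rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Guard against constant columns: fall back to a scale of 1.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Scale rows using statistics that were estimated elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Toy data: two features on very different scales.
train = [[10.0, 0.2], [20.0, 0.4], [30.0, 0.6]]
test = [[40.0, 0.4]]

means, stds = fit_standardizer(train)                  # training rows only
train_scaled = apply_standardizer(train, means, stds)
test_scaled = apply_standardizer(test, means, stds)    # reuse training statistics

print(train_scaled)
print(test_scaled)
```

Note that the test row is not centred on zero after scaling: it is expressed relative to the training distribution, which is the intended behaviour.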

It is now time to train a classifier. We choose a support vector machine for this task.

In [11]:
clf = svm.SVC()
clf.fit(X_train_scaled, y_train)

y_pred = clf.predict(X_test_scaled)

# precision, recall and F1 use scikit-learn's default pos_label=1,
# so they describe class 1
print('Accuracy', accuracy_score(y_test, y_pred))
print('Precision', precision_score(y_test, y_pred))
print('Recall', recall_score(y_test, y_pred))
print('F1-score', f1_score(y_test, y_pred))

Accuracy 0.7455089820359282
Precision 0.7709251101321586
Recall 0.8413461538461539
F1-score 0.8045977011494253
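
With scikit-learn's default `pos_label=1`, the precision, recall and F1 values above treat label 1 as the positive class. The sketch below shows how the four scores follow from the confusion counts, using a small made-up set of predictions. The function name and the toy labels are illustrative assumptions and are unrelated to the actual test split.

```python
def binary_scores(y_true, y_pred, pos_label=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Made-up labels using the same 1/2 encoding as the dataset.
toy_true = [1, 1, 1, 2, 2, 1, 2, 1]
toy_pred = [1, 1, 2, 2, 1, 1, 2, 1]

toy_accuracy, toy_precision, toy_recall, toy_f1 = binary_scores(toy_true, toy_pred)
print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

On the toy labels there are 4 true positives, 1 false positive and 1 false negative, so accuracy is 0.75 while precision and recall are both 0.8. Passing `pos_label=2` reports the same quantities for the other class.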
