# Node classification in networks using NetworkX

This section shows how to classify the nodes of a network with NetworkX, a Python library for working with complex networks.

To demonstrate, we use the Airports dataset [1], which contains three networks `{'usa', 'brazil', 'europe'}` in which nodes represent airports and edges represent the existence of flights between airports. Each node carries a label giving the airport's activity level, one of `{0, 1, 2, 3}`. Our task is to predict the activity level of each airport in the network.

[1] "struc2vec: Learning Node Representations from Structural Identity", Ribeiro et al., https://arxiv.org/abs/1704.03165


In [4]:
import os
import networkx as nx
import pandas as pd

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from networkx.algorithms import node_classification

## Load dataset

The Airports dataset is available online [2]. If you have run the preparation script `prep.sh`, the dataset should appear in `data/airports/`; otherwise, you can download it manually. We use the `usa` network throughout this section.

[2] https://github.com/leoribeiro/struc2vec/tree/master/graph

In [2]:
# params
data_dir = 'data/airports/'
network = 'usa'

### Load edge info

In [9]:
edgelist = pd.read_csv(os.path.join(data_dir, f"{network}-airports.edgelist"), sep=' ', header=None, names=["source", "target"])
edgelist

| | source | target |
|---|---|---|
| 0 | 12343 | 12129 |
| 1 | 13277 | 11996 |
| 2 | 13796 | 13476 |
| 3 | 15061 | 14559 |
| 4 | 14314 | 12889 |
| ... | ... | ... |
| 13594 | 13303 | 10747 |
| 13595 | 13029 | 12892 |
| 13596 | 13930 | 11618 |
| 13597 | 12278 | 11423 |


In [10]:
# Build an undirected graph: the node classification algorithms in NetworkX do not support directed graphs
G = nx.from_pandas_edgelist(edgelist)
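
To make the "undirected" requirement concrete, here is a minimal standard-library sketch of what treating an edge list as undirected means: each pair is stored in both directions, and a flight listed as both `(u, v)` and `(v, u)` collapses into a single edge. The helper `undirected_adjacency` and the toy `flights` list are hypothetical and only illustrate the idea; they are not part of NetworkX.

```python
from collections import defaultdict

def undirected_adjacency(edges):
    """Build a symmetric adjacency mapping from (source, target) pairs.

    Hypothetical helper for illustration only.
    """
    adjacency = defaultdict(set)
    for source, target in edges:
        # Store the edge in both directions; sets drop duplicates.
        adjacency[source].add(target)
        adjacency[target].add(source)
    return {node: sorted(nbrs) for node, nbrs in adjacency.items()}

# Toy edge list (made up): the pair (1, 2) appears in both directions.
flights = [(1, 2), (2, 1), (2, 3)]
adj = undirected_adjacency(flights)
n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2

print(adj)      # {1: [2], 2: [1, 3], 3: [2]}
print(n_edges)  # 2
```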

### Load node labels

In [12]:
df_nodes = pd.read_csv(os.path.join(data_dir, f"labels-{network}-airports.txt"), sep=' ')
df_nodes

| | node | label |
|---|---|---|
| 0 | 10241 | 1 |
| 1 | 10243 | 2 |
| 2 | 10245 | 0 |
| 3 | 16390 | 1 |
| 4 | 10247 | 1 |
| ... | ... | ... |
| 1185 | 12278 | 0 |
| 1186 | 12280 | 0 |
| 1187 | 14332 | 3 |
| 1188 | 10237 | 2 |


## Split training and test data

We split the node labels into a training set and a test set using the `train_test_split` function from the `sklearn` library.

The training set is used to fit the node classifiers, so its labels are visible to the model during training. The test set labels are hidden during training and are used only to evaluate the classifiers.

Note that in our setting the whole network structure, including the test nodes and their edges, is visible to the model during training. Only the test labels are withheld.
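
The split described above is random, so results vary from run to run unless the random seed is fixed. The following is a minimal standard-library sketch of a seeded hold-out split over labelled nodes. The helper `split_nodes` and the toy data are hypothetical, and the code only mimics the idea of a shuffled split; it is not the `sklearn` implementation.

```python
import random

def split_nodes(node_labels, train_size=0.2, seed=0):
    """Split (node, label) pairs into a training list and a test list.

    Hypothetical helper for illustration only.
    """
    pairs = list(node_labels)
    rng = random.Random(seed)  # fixed seed -> same split on every run
    rng.shuffle(pairs)
    n_train = int(train_size * len(pairs))
    return pairs[:n_train], pairs[n_train:]

# Toy data (made up): ten nodes with activity levels 0..3.
toy = [(node, node % 4) for node in range(10)]
train, test = split_nodes(toy, train_size=0.2, seed=42)

# Only the training labels are shown to the classifier;
# the test labels are set aside for evaluation.
visible_labels = dict(train)
hidden_labels = dict(test)
print(len(visible_labels), len(hidden_labels))
```

With `train_test_split` itself, the `random_state` argument plays the role of the seed.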

In [6]:
train_size = 0.2  # fraction of labelled nodes used for training

df_train, df_test = train_test_split(df_nodes, train_size=train_size)

In [7]:
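# The first column holds the node IDs; the last column holds the labels (cast to str)
node_ids_train, node_ids_test = df_train.iloc[:, 0].values, df_test.iloc[:, 0].values
labels_train, labels_test = df_train.iloc[:, -1].values.astype(str), df_test.iloc[:, -1].values.astype(str)
# Map each training node to its label; test nodes are deliberately left out
data_train = dict(zip(node_ids_train, labels_train))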

In [8]:
nx.set_node_attributes(G, data_train, name="label")

## Node classification

Now we perform node classification using the built-in Harmonic Function method in NetworkX.

* Zhu, X., Ghahramani, Z., & Lafferty, J. (2003, August). Semi-supervised learning using gaussian fields and harmonic functions. In ICML (Vol. 3, pp. 912-919).
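
To build intuition for the method, the sketch below implements its core idea in plain Python: labelled nodes are clamped to their known class, and every unlabelled node repeatedly takes the average of its neighbours' class scores. This is a simplified illustration under stated assumptions (an adjacency dictionary as input, a fixed number of iterations), not NetworkX's implementation; the helper `harmonic_labels` is hypothetical.

```python
def harmonic_labels(adjacency, seed_labels, n_iter=30):
    """Propagate labels by repeated neighbour averaging.

    adjacency   : dict mapping node -> list of neighbours (undirected)
    seed_labels : dict mapping labelled node -> class
    Hypothetical helper for illustration only.
    """
    classes = sorted(set(seed_labels.values()))
    # score[node][c] is the current belief that `node` belongs to class c
    score = {n: {c: 0.0 for c in classes} for n in adjacency}
    for _ in range(n_iter):
        new_score = {}
        for node, nbrs in adjacency.items():
            if node in seed_labels:
                # Labelled nodes are clamped to their known class.
                new_score[node] = {c: float(c == seed_labels[node]) for c in classes}
            else:
                # Unlabelled nodes take the mean of their neighbours' scores.
                new_score[node] = {
                    c: sum(score[m][c] for m in nbrs) / max(len(nbrs), 1)
                    for c in classes
                }
        score = new_score
    # Each node is assigned the class with the highest score.
    return [max(classes, key=lambda c: score[n][c]) for n in adjacency]

# Path graph 0 - 1 - 2 - 3 with only the two end nodes labelled.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
predicted = harmonic_labels(path, {0: "A", 3: "B"})
print(predicted)  # ['A', 'A', 'B', 'B']
```

Each unlabelled node ends up with the class of the labelled node it is closest to, which is why the method relies on neighbouring nodes tending to share labels.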

In [9]:
result = node_classification.harmonic_function(G)

The result contains one label per node, in the same order as the nodes of `G`: the given label for training nodes and the predicted label for all others. For example, the labels of the first five nodes:

In [10]:
result[:5]

To measure the accuracy of the classification, we look up the predicted labels of the test nodes and compare them with their true labels.

In [11]:
dict_result = dict(zip(list(G), result))
labels_pred = [dict_result.get(node_id) for node_id in node_ids_test]

print('Accuracy (Harmonic Function): ', accuracy_score(labels_test, labels_pred))

Accuracy (Harmonic Function):  0.8107983387171205
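
The evaluation step restricts the comparison to held-out nodes, since the training labels were given to the algorithm. The following standard-library sketch shows that logic on made-up data and also counts which (true, predicted) pairs were confused, which can be informative with four classes. The node IDs, labels, and the helper `accuracy` are all hypothetical.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the two label sequences agree."""
    pairs = list(zip(true_labels, predicted_labels))
    return sum(t == p for t, p in pairs) / len(pairs)

# Toy predictions for every node in the graph (made up for the example).
all_nodes = [10, 11, 12, 13, 14, 15]
all_predictions = ["0", "1", "1", "3", "2", "0"]
prediction_of = dict(zip(all_nodes, all_predictions))

# Held-out nodes and their true labels.
test_nodes = [11, 13, 14, 15]
test_truth = ["1", "2", "2", "0"]

# Look up predictions for the test nodes only.
test_pred = [prediction_of[n] for n in test_nodes]
acc = accuracy(test_truth, test_pred)
errors = Counter((t, p) for t, p in zip(test_truth, test_pred) if t != p)

print(acc)     # 0.75
print(errors)  # Counter({('2', '3'): 1})
```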


Similarly, we can perform node classification using the built-in Local and Global Consistency method.

* Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. Advances in neural information processing systems, 16(16), 321-328.
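
The update rule from Zhou et al. is F ← αSF + (1 − α)Y, where S is the symmetrically normalized adjacency matrix and Y holds the known labels. The sketch below applies that rule in plain Python. It is a simplified illustration, not NetworkX's implementation: the helper `lgc_labels` is hypothetical, and the values `alpha=0.99`, `n_iter=30`, and the all-zero starting scores are assumptions made for the example.

```python
import math

def lgc_labels(adjacency, seed_labels, alpha=0.99, n_iter=30):
    """Label spreading with the update F <- alpha * S @ F + (1 - alpha) * Y.

    adjacency   : dict mapping node -> list of neighbours (undirected)
    seed_labels : dict mapping labelled node -> class
    Hypothetical helper for illustration only.
    """
    classes = sorted(set(seed_labels.values()))
    degree = {n: max(len(nbrs), 1) for n, nbrs in adjacency.items()}
    # Y[node][c] is 1 if `node` is labelled with class c, else 0.
    Y = {n: {c: float(seed_labels.get(n) == c) for c in classes} for n in adjacency}
    # Assumption: start from all-zero scores.
    F = {n: {c: 0.0 for c in classes} for n in adjacency}
    for _ in range(n_iter):
        F = {
            n: {
                # S[n][m] = 1 / sqrt(degree[n] * degree[m]) for each edge (n, m)
                c: alpha * sum(F[m][c] / math.sqrt(degree[n] * degree[m]) for m in nbrs)
                + (1 - alpha) * Y[n][c]
                for c in classes
            }
            for n, nbrs in adjacency.items()
        }
    # Each node is assigned the class with the highest score.
    return [max(classes, key=lambda c: F[n][c]) for n in adjacency]

# Path graph 0 - 1 - 2 - 3 with only the two end nodes labelled.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
predicted = lgc_labels(path, {0: "A", 3: "B"})
print(predicted)  # ['A', 'A', 'B', 'B']
```

Unlike the clamped averaging of the harmonic function approach, labelled nodes here are not fixed: their scores are also updated from their neighbours, with `alpha` controlling how strongly the initial labels are retained.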

In [12]:
result = node_classification.local_and_global_consistency(G)

In [13]:
dict_result = dict(zip(list(G), result))
labels_pred = [dict_result.get(node_id) for node_id in node_ids_test]

In [15]:
print('Accuracy (Local and Global Consistency): ', accuracy_score(labels_test, labels_pred))

Accuracy (Local and Global Consistency):  0.8075680664513152
