# Node classification in networks using deep learning with Graph Neural Networks

In this section we illustrate how to use deep learning methods, specifically Graph Neural Networks (GNNs), to perform node classification in networks.

GNNs are particularly effective for tasks like node classification, where the goal is to predict the labels of nodes in a network. Traditional network libraries such as NetworkX can store node attributes, but their algorithms operate mainly on the network structure and do not learn from node features. GNNs, in contrast, are designed to leverage both the network structure and the node features, enabling them to capture complex relationships and dependencies within the network more effectively.

We will use the PyTorch Geometric (PyG) library `torch_geometric`. PyG is built upon PyTorch and is specifically tailored to allow creating and training GNNs for a wide range of tasks related to structured data, such as complex networks.

To demonstrate, we will use the Twitch Gamer networks dataset [1]. Twitch is an online platform that focuses on video game live streaming [2]. 
The dataset contains user-user networks for 6 different languages. Nodes correspond to Twitch users, and links correspond to mutual friendships. Node features are derived from games liked, location and streaming habits, and all networks share the same set of node features. In addition, each node carries a binary label (0 or 1); according to the PyG documentation for this dataset, the label indicates whether the user streams mature content. Our task is binary classification of this label from the network structure and the node features.

```
langs = {"DE", "EN", "ES", "FR", "PT", "RU"}
```

[1] "Multi-Scale Attributed Node Embedding", Rozemberczki et al., https://arxiv.org/pdf/1909.13021

[2] https://www.twitch.tv/ 

## Load dataset

The Twitch dataset we use is available in the `torch_geometric` package [3], so we can load it into PyG with ease. We use the German network `DE` as an example.

[3] https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Twitch.html#torch_geometric.datasets.Twitch

In [1]:
import torch
from torch_geometric.datasets import Twitch

dataset = Twitch(root='./data/twitch', name='DE')


### Show statistics of the network

In [2]:
# number of nodes and edges
N, M = dataset[0].num_nodes, dataset[0].num_edges
print(f'Number of nodes: {N}')
print(f'Number of edges: {M}')

Number of nodes: 9498
Number of edges: 315774


In [3]:
# check the node features and labels
dataset[0].x, dataset[0].y

(tensor([[-0.2367, -0.2307, -0.1605,  ..., -0.6348, -0.2558, -0.1839],
         [-0.2367, -0.2307, -0.1605,  ..., -0.6348, -0.2558, -0.1839],
         [-0.2354, -0.2210, -0.1605,  ..., -0.6348, -0.2490, -0.1839],
         ...,
         [-0.2367, -0.2307, -0.1605,  ..., -0.6348, -0.2558, -0.1839],
         [-0.2367, -0.2307, -0.1605,  ..., -0.6348, -0.2558, -0.1839],
         [-0.2367, -0.2307, -0.1605,  ..., -0.6348, -0.2558, -0.1810]]),
 tensor([0, 1, 1,  ..., 1, 0, 0]))

In [4]:
# number of node features and node classes
print('Number of node features:', dataset.num_node_features)
print('Number of node classes:', dataset.num_classes)

Number of node features: 128
Number of node classes: 2


## Split training and test data

We split the nodes, and with them the node labels `y`, into two sets: training and test. We do this by creating two "masks", which are boolean tensors that specify which nodes are included in each set.

The labels of the training set are used to train the GNN model, so they are visible to the model during the training phase. The labels of the test set are not visible to the model during the training phase and are only used to evaluate the model.

Note that in our setting, the full network structure and the features of all nodes, including test nodes, are visible to the model during the training phase; only the test labels are held out. This is commonly called a transductive setting.

In [5]:
# param: define the size of the training dataset
train_size = .8

# Generate random permutation of node indices
perm = torch.randperm(N)

# Select train and test nodes
train_idx = perm[: int(train_size * N)]
test_idx = perm[int(train_size * N) :]

# Initialize train_mask and test_mask with False
train_mask = torch.zeros(N, dtype=torch.bool)
test_mask = torch.zeros(N, dtype=torch.bool)

# Set the selected indices to True
train_mask[train_idx] = True
test_mask[test_idx] = True
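
As a sanity check, the two masks should be disjoint and together cover every node. The sketch below wraps the splitting logic from the cell above in a hypothetical helper, `make_split_masks`, and verifies both properties on a toy graph of 10 nodes. The `seed` argument is an addition for reproducibility; the cell above does not fix a seed.

```python
import torch

def make_split_masks(num_nodes, train_size=0.8, seed=0):
    """Return disjoint boolean train/test masks over `num_nodes` nodes."""
    # `seed` is an assumption added here so the split can be reproduced
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=gen)
    n_train = int(train_size * num_nodes)

    train_mask = torch.zeros(num_nodes, dtype=torch.bool)
    test_mask = torch.zeros(num_nodes, dtype=torch.bool)
    train_mask[perm[:n_train]] = True
    test_mask[perm[n_train:]] = True
    return train_mask, test_mask

train_mask_demo, test_mask_demo = make_split_masks(10, train_size=0.8)

# no node is in both sets, and every node is in one of them
assert not (train_mask_demo & test_mask_demo).any()
assert (train_mask_demo | test_mask_demo).all()
print(int(train_mask_demo.sum()), int(test_mask_demo.sum()))  # 8 2
```

Indexing a tensor with such a mask, for example `y[train_mask]`, selects only the entries of the nodes in that set.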

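## Define a simple Graph Convolutional Network (GCN) model

Now we define a two-layer GCN model. The first graph convolutional layer maps the input node features to 32 hidden channels (filters) and is followed by a ReLU activation and a dropout layer. The second graph convolutional layer maps the 32 hidden channels to the output classes.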

Explanation:

* Graph convolution layers: Each layer updates the representation of a node by aggregating the features of its neighbours and of the node itself, followed by a learned linear transformation. This aggregation is essential for capturing the local structure and relationships within a network. In PyG it is implemented as `torch_geometric.nn.GCNConv`. Layers can be stacked: the second layer takes the already aggregated output of the first layer as its input, so each prediction depends on the two-hop neighbourhood of a node.
* ReLU activation: Applied after the first convolution layer. It introduces non-linearity, which allows the model to learn non-linear relationships in the data.
* Dropout: Applied after the ReLU activation. It randomly zeroes hidden units during training to reduce overfitting, and is disabled during evaluation.
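
To make the aggregation step concrete, here is a minimal sketch of a single graph convolution written with dense matrices, following the propagation rule of Kipf and Welling that `GCNConv` is based on: add self-loops to the adjacency matrix, normalize it symmetrically by node degree, and multiply by the features and a weight matrix. The function name `gcn_layer_dense` and the toy graph are illustrative assumptions. The real `GCNConv` works on sparse edge lists and also adds a bias term.

```python
import torch

def gcn_layer_dense(adj, x, weight):
    """One GCN propagation step on a dense adjacency matrix (no bias).

    Assumes `adj` is symmetric and contains no self-loops yet.
    """
    a_hat = adj + torch.eye(adj.size(0))   # add self-loops
    deg = a_hat.sum(dim=1)                 # node degrees including the self-loop
    d_inv_sqrt = deg.pow(-0.5)
    # symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
    norm_adj = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return norm_adj @ x @ weight

# toy path graph 0 - 1 - 2 with two made-up features per node
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
x = torch.tensor([[1., 0.],
                  [0., 1.],
                  [1., 1.]])
weight = torch.eye(2)  # identity weights, so only the aggregation is visible

h = gcn_layer_dense(adj, x, weight)
print(h)  # each row is a degree-weighted mix of a node and its neighbours
```

With identity weights, the output for node 0 mixes its own features with those of node 1 only, while node 1 receives contributions from all three nodes.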

Parameters:
```
Learning rate: 0.01
Dropout: 0.5
Weight decay: 0.0005
Filters per layer: 32 (hidden channels)
Number of epochs: 200
```

In [6]:
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

# params
lr = .01
weight_decay = 5e-4
p_dropout = .5
num_filters = 32
num_epochs = 200

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, num_filters)
        self.conv2 = GCNConv(num_filters, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=p_dropout, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

The two graph convolution layers are defined in the `__init__(self)` method.

The `forward(self, data)` method defines the forward pass of the GCN. Its input `data` contains the node features `data.x` and the network structure as captured in `data.edge_index`. The features pass through the first graph convolutional layer, the ReLU activation, the dropout layer, and the second graph convolutional layer. The output of the GCN is the logarithm of the softmax function, i.e. the log-probability of a node belonging to each class.
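
For readers new to PyG, the following sketch illustrates the format of `edge_index`: a tensor of shape `[2, num_edges]` whose first row holds source nodes and whose second row holds target nodes. By PyG convention, an undirected edge is stored once in each direction. The three-node graph is a made-up toy example, not part of the Twitch data.

```python
import torch

# undirected toy graph with edges 0-1 and 1-2;
# each undirected edge appears once per direction
edge_index = torch.tensor([[0, 1, 1, 2],   # source nodes
                           [1, 0, 2, 1]])  # target nodes
num_nodes = 3

# rebuild a dense adjacency matrix from the edge list
adj = torch.zeros(num_nodes, num_nodes)
adj[edge_index[0], edge_index[1]] = 1.0

# node degree: how often each node appears as a source
degree = torch.bincount(edge_index[0], minlength=num_nodes)

print(adj)
print(degree)  # tensor([1, 2, 1])
```

Because both directions are stored, the number of columns in `edge_index` is twice the number of undirected edges in this toy graph.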

## Train the GCN model
With the dataset prepared and the GCN classification model defined, we can now train the model.

In [7]:
# find the best device to run on
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [8]:
# move the model and data to the device
model = GCN().to(device)
data = dataset[0].to(device)
train_mask = train_mask.to(device)
test_mask = test_mask.to(device)

In each epoch, the optimizer updates the weights of our GCN model to minimize the loss function on the training nodes. In our case, the loss function is the Negative Log-Likelihood (NLL) loss, computed only over the nodes selected by `train_mask`. Applied to the `log_softmax` output of the model, it is equivalent to the cross-entropy loss commonly used for classification tasks.
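
The relationship between the model output and the loss can be checked on a small example. The sketch below uses made-up scores for three nodes and shows that `F.nll_loss` on log-softmax outputs is the mean of the negative log-probability assigned to the correct class, and that it matches `F.cross_entropy` applied to the raw scores.

```python
import torch
import torch.nn.functional as F

# made-up raw scores (logits) for 3 nodes and 2 classes
logits = torch.tensor([[2.0, 0.5],
                       [0.1, 1.2],
                       [1.0, 1.0]])
labels = torch.tensor([0, 1, 1])

log_probs = F.log_softmax(logits, dim=1)
nll = F.nll_loss(log_probs, labels)

# NLL by hand: mean of -log p(correct class) over the nodes
manual = -log_probs[torch.arange(3), labels].mean()

# cross_entropy combines log_softmax and nll_loss in one call
ce = F.cross_entropy(logits, labels)

print(nll.item(), manual.item(), ce.item())  # three equal values
```

This is why a model that ends in `log_softmax` is paired with `nll_loss`, whereas a model that returns raw scores would be paired with `cross_entropy`.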

In [9]:
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
model.train()
for epoch in range(num_epochs):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[train_mask], data.y[train_mask])

    if epoch % 10 == 0:
        print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
        
    loss.backward()
    optimizer.step()

Epoch: 000, Loss: 0.7230
Epoch: 010, Loss: 0.6047
Epoch: 020, Loss: 0.5861
Epoch: 030, Loss: 0.5754
Epoch: 040, Loss: 0.5683
Epoch: 050, Loss: 0.5644
Epoch: 060, Loss: 0.5591
Epoch: 070, Loss: 0.5548
Epoch: 080, Loss: 0.5500
Epoch: 090, Loss: 0.5473
Epoch: 100, Loss: 0.5441
Epoch: 110, Loss: 0.5407
Epoch: 120, Loss: 0.5360
Epoch: 130, Loss: 0.5350
Epoch: 140, Loss: 0.5311
Epoch: 150, Loss: 0.5309
Epoch: 160, Loss: 0.5295
Epoch: 170, Loss: 0.5284
Epoch: 180, Loss: 0.5247
Epoch: 190, Loss: 0.5212


## Evaluation

To evaluate the GCN classification model we just trained, we first use it to predict the labels of the nodes in the test set.

Recall that the model outputs log-probabilities (`log_softmax`), so we convert them either to hard class labels with `argmax` or to probabilities with `exp`, depending on our needs.
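
The two conversions can be illustrated on a toy tensor. The scores below are made up; the point is only that `argmax` yields hard labels and that exponentiating the log-softmax output recovers probabilities that sum to 1 for each node.

```python
import torch
import torch.nn.functional as F

# made-up raw scores for 2 nodes and 2 classes
logits = torch.tensor([[2.0, 0.0],
                       [0.0, 1.0]])
log_probs = F.log_softmax(logits, dim=1)

pred_class = log_probs.argmax(dim=1)  # hard class labels
probs = log_probs.exp()               # class probabilities, one row per node
p_positive = probs[:, 1]              # probability of class 1

print(pred_class)        # tensor([0, 1])
print(probs.sum(dim=1))  # each row sums to 1 (up to rounding)
```

Since the logarithm is monotonic, taking `argmax` of the log-probabilities gives the same labels as taking it of the probabilities.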

In [10]:
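model.eval()  # switch off dropout for evaluation

# run a single forward pass without tracking gradients
with torch.no_grad():
    log_probs = model(data)

pred = log_probs.argmax(dim=1)

y_true = data.y[test_mask].cpu().numpy()
y_pred = pred[test_mask].cpu().numpy()
# probability of class 1 for each test node
y_proba = log_probs.exp()[test_mask, 1].cpu().numpy()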

Check the predicted node classes and probabilities.

In [11]:
y_pred

array([0, 1, 0, ..., 1, 1, 1])

In [12]:
y_proba

array([0.18113099, 0.60741115, 0.4456026 , ..., 0.859128  , 0.59309757,
       0.7267557 ], dtype=float32)

We now compare the GCN model's predictions with the ground truth and calculate the following metrics on the result, using the `sklearn` library for all but accuracy. Precision, recall and F1-score are macro-averaged, i.e. computed per class and then averaged with equal weight.

* Accuracy: the fraction of test nodes whose label is predicted correctly.
* Precision: of the nodes predicted to be in a class, the fraction that truly belong to it.
* Recall: of the nodes that truly belong to a class, the fraction that are predicted as such.
* F1-Score: the harmonic mean of precision and recall.
* ROC AUC: measures the model's ability to rank positive nodes above negative ones across all decision thresholds.

These metrics collectively provide a detailed view of the model’s performance, helping to understand its strengths and weaknesses in predicting the labels for the nodes in the test set.
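
To show what these metrics compute, the following sketch implements them in plain Python on six made-up labels and scores. The helper names `binary_metrics` and `roc_auc` are hypothetical; they are meant to mirror the macro-averaged definitions used in this section, not to replace the `sklearn` functions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 for labels in {0, 1}."""
    def per_class(cls):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    scores = [per_class(c) for c in (0, 1)]
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # macro average: unweighted mean of the per-class scores
    precision, recall, f1 = (sum(s[i] for s in scores) / 2 for i in range(3))
    return accuracy, precision, recall, f1


def roc_auc(y_true, y_score):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# made-up labels, predictions and scores for six nodes
y_true_demo = [1, 1, 1, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1]
y_score_demo = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]

print(binary_metrics(y_true_demo, y_pred_demo))  # all four values equal 2/3
print(roc_auc(y_true_demo, y_score_demo))        # 8/9, about 0.889
```

In this toy case the ROC AUC is higher than the accuracy because it depends only on how the scores rank the nodes, not on the 0.5 decision threshold.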

In [13]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Calculate metrics
accuracy = (y_pred == y_true).sum() / test_mask.sum().item()
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')
roc_auc = roc_auc_score(y_true, y_proba)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-score: {f1:.4f}')
print(f'ROC AUC: {roc_auc:.4f}')

Accuracy: 0.6653
Precision: 0.6525
Recall: 0.6370
F1-score: 0.6386
ROC AUC: 0.7246
