# Import a network into PyG

ref: [1] https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html

In [1]:
import torch
from torch_geometric.data import Data

Define `edge_index` so that each column is a directed edge `(u, v)`, and define the feature matrix `x` so that each row holds the features of one node.

In [2]:
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

Note that for an undirected graph, every edge must be listed in both directions.
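Listing both directions by hand is easy to get wrong on larger graphs. The following is a minimal sketch in plain Python, assuming the undirected edges are available as a list of `(u, v)` pairs; the helper name `to_bidirectional` is hypothetical and not part of PyG.

```python
def to_bidirectional(undirected_edges):
    """Return [sources, targets] with each undirected edge listed both ways."""
    seen = set()
    sources, targets = [], []
    for u, v in undirected_edges:
        # Add (u, v) and (v, u); duplicates and the reverse of a self-loop are skipped.
        for a, b in ((u, v), (v, u)):
            if (a, b) not in seen:
                seen.add((a, b))
                sources.append(a)
                targets.append(b)
    return [sources, targets]


# The undirected path 0 - 1 - 2 used in this notebook.
edge_rows = to_bidirectional([(0, 1), (1, 2)])
print(edge_rows)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

Passing `edge_rows` to `torch.tensor(..., dtype=torch.long)` should give the same `edge_index` as the cell above. PyG also ships `torch_geometric.utils.to_undirected`, which may be the more convenient choice once the edges are already in a tensor.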

In [3]:
data = Data(x=x, edge_index=edge_index)
data

Data(x=[3, 1], edge_index=[2, 4])

In [4]:
data.validate(raise_on_error=True)

True

## Common mistakes

### 1. Storing each edge as a row of `edge_index`

In [5]:
edge_index_m1 = torch.tensor([[0, 1],
                              [1, 0],
                              [1, 2],
                              [2, 1]], dtype=torch.long)

In [6]:
data = Data(x=x, edge_index=edge_index_m1)
data

Data(x=[3, 1], edge_index=[4, 2])

In [7]:
try:
    data.validate(raise_on_error=True)
except ValueError as err:
    print(f'ValueError: {err}')

ValueError: 'edge_index' needs to be of shape [2, num_edges] in 'Data' (found torch.Size([4, 2]))


To fix it, transpose `edge_index_m1` and call `.contiguous()` on the result.

In [8]:
edge_index_m1.T.contiguous()

tensor([[0, 1, 1, 2],
        [1, 0, 2, 1]])

In [9]:
data = Data(x=x, edge_index=edge_index_m1.T.contiguous())
data

Data(x=[3, 1], edge_index=[2, 4])

In [10]:
data.validate(raise_on_error=True)

True
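When the edges are still a plain Python list of pairs, the same fix can be applied before any tensor is created. This is a small sketch under that assumption; `pairs_to_rows` is a hypothetical helper name, not a PyG function.

```python
def pairs_to_rows(edge_pairs):
    """Turn [(u, v), ...] into [[u, ...], [v, ...]], the [2, num_edges] layout."""
    for pair in edge_pairs:
        if len(pair) != 2:
            raise ValueError(f"expected (source, target) pairs, got {pair!r}")
    if not edge_pairs:
        return [[], []]
    # zip(*...) transposes the list of pairs into one tuple per row.
    sources, targets = zip(*edge_pairs)
    return [list(sources), list(targets)]


# Same edges as edge_index_m1, one edge per row.
pairs = [[0, 1], [1, 0], [1, 2], [2, 1]]
rows = pairs_to_rows(pairs)
print(rows)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

One caveat: a graph with exactly two edges has shape `[2, 2]` in either layout, so a shape check alone cannot tell the two apart in that case.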

### 2. Node IDs are discontinuous or do not start from 0

PyG typically expects node indices to range from 0 to `N-1`, where `N` is the number of nodes.
Thus, conversion is necessary when node IDs are discontinuous or do not start from 0. We now use the US-airports data as an example and show one way to convert the discontinuous node IDs to continuous ones.
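Before converting anything, it can help to test whether a set of IDs already satisfies this expectation. The check below is a minimal sketch in plain Python; the function name `is_zero_based_contiguous` is hypothetical.

```python
def is_zero_based_contiguous(node_ids):
    """True if the distinct IDs are exactly 0, 1, ..., N-1."""
    unique_ids = set(node_ids)
    return unique_ids == set(range(len(unique_ids)))


print(is_zero_based_contiguous([0, 1, 2, 1]))           # True
print(is_zero_based_contiguous([10005, 10006, 10011]))  # False
print(is_zero_based_contiguous([1, 2, 3]))              # False
```

The check only sees the IDs it is given, so nodes that never appear in an edge (isolated nodes) need to be included separately if the graph has any.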

In [11]:
import pandas as pd
import numpy as np
import os

data_dir = 'data/airports/'
network = 'usa'

edgelist = pd.read_csv(os.path.join(data_dir, f"{network}-airports.edgelist"), sep=' ', header=None, names=["source", "target"])
edgelist

| | source | target |
|---|---|---|
| 0 | 12343 | 12129 |
| 1 | 13277 | 11996 |
| 2 | 13796 | 13476 |
| 3 | 15061 | 14559 |
| 4 | 14314 | 12889 |
| ... | ... | ... |
| 13594 | 13303 | 10747 |
| 13595 | 13029 | 12892 |
| 13596 | 13930 | 11618 |
| 13597 | 12278 | 11423 |


The original node IDs are discontinuous and do not start from 0.

In [12]:
node_ids = np.sort(np.unique(edgelist[["source", "target"]].values))
node_ids

array([10005, 10006, 10011, ..., 16743, 16744, 16746])

We now create a mapping between the original IDs and continuous IDs.

In [13]:
node_id_mapping = { node_id : i for i, node_id in enumerate(node_ids) }

Replace the original IDs with the continuous IDs.

In [14]:
edgelist[["source", "target"]] = edgelist[["source", "target"]].replace(node_id_mapping)
edgelist

| | source | target |
|---|---|---|
| 0 | 468 | 412 |
| 1 | 642 | 391 |
| 2 | 737 | 678 |
| 3 | 987 | 871 |
| 4 | 834 | 566 |
| ... | ... | ... |
| 13594 | 649 | 137 |
| 13595 | 595 | 568 |
| 13596 | 759 | 306 |
| 13597 | 454 | 259 |
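As an alternative to a dictionary lookup, `np.unique` with `return_inverse=True` returns the sorted unique IDs and the position of every entry in that sorted list in one call. The sketch below assumes the edges are held in a NumPy array and uses a made-up three-edge toy list, not the airports data.

```python
import numpy as np

# Toy edge list (hypothetical edges; the IDs are only illustrative).
edges = np.array([[10005, 10011],
                  [10011, 10006],
                  [10006, 10005]])

# Flatten first so the inverse is 1-D regardless of NumPy version,
# then restore the original [num_edges, 2] shape.
node_ids, inverse = np.unique(edges.ravel(), return_inverse=True)
remapped = inverse.reshape(edges.shape)

print(node_ids)  # [10005 10006 10011]
print(remapped)
# [[0 2]
#  [2 1]
#  [1 0]]
```

Because `np.unique` sorts its output, the new indices follow the same ordering as the `enumerate`-based mapping built earlier.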


Now the node IDs are continuous and start from 0.

In [15]:
np.sort(np.unique(edgelist[["source", "target"]].values))

array([   0,    1,    2, ..., 1187, 1188, 1189])
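Results computed on the converted graph are expressed in the new indices, so it is often useful to keep a reverse lookup as well. A minimal sketch, assuming a mapping built with `enumerate` as above; the three IDs are a toy subset and `inverse_mapping` is a name introduced here for illustration.

```python
# Sorted unique original IDs (toy subset for illustration).
original_ids = [10005, 10006, 10011]

# Forward mapping as in the notebook, plus its inverse.
node_id_mapping = {node_id: i for i, node_id in enumerate(original_ids)}
inverse_mapping = {i: node_id for node_id, i in node_id_mapping.items()}

# Convert an edge to the new indices and back again.
new_edge = (node_id_mapping[10011], node_id_mapping[10005])
recovered = tuple(inverse_mapping[i] for i in new_edge)

print(new_edge)   # (2, 0)
print(recovered)  # (10011, 10005)
```

Since the mapping enumerates a sorted array, indexing that array directly (for example `node_ids[i]`) gives the same reverse lookup without a second dictionary.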