# Heterogeneous Graph Attention Network
This notebook demonstrates how to train a Heterogeneous Graph Attention Network (HGAT) with TigerGraph ML Workbench. We use [PyTorch Geometric](https://pytorch-geometric.readthedocs.io)'s implementation of HGAT and train the model on the IMDB dataset from [PyG datasets](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.IMDB), with TigerGraph as the data store.

The dataset contains 3 types of vertices: 4278 movies, 5257 actors, and 2081 directors; and 4 types of edges: 12828 actor-to-movie edges, 12828 movie-to-actor edges, 4278 director-to-movie edges, and 4278 movie-to-director edges. Each vertex is described by a 0/1-valued word vector that indicates the absence or presence of each keyword, taken from the plot (for movies) or from the movies they participated in (for actors and directors). Each movie is classified into one of three classes (action, comedy, or drama) according to its genre. The goal is to predict the class of each movie in the graph.
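To make the feature encoding concrete, here is a minimal sketch of a 0/1-valued keyword vector. The vocabulary and the plot keywords are hypothetical and are not taken from the IMDB dataset; the real vectors in this notebook have 3066 entries.

```python
def binary_keyword_vector(keywords, vocabulary):
    """Return a 0/1 list with one entry per vocabulary word."""
    present = set(keywords)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical vocabulary and plot keywords, for illustration only.
vocabulary = ["alien", "heist", "wedding", "war", "detective"]
plot_keywords = ["heist", "detective"]

features = binary_keyword_vector(plot_keywords, vocabulary)
print(features)
```

Each position corresponds to one vocabulary word, so two vertices can be compared position by position regardless of their type.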

The following libraries are required to run this notebook. Uncomment to install them if necessary. You need to restart the kernel after installing.

In [None]:
#!pip install torch==1.12.0 --extra-index-url https://download.pytorch.org/whl/cpu
#!pip install torch-scatter==2.0.9 torch-sparse==0.6.14 torch-cluster==1.6.0 torch-spline-conv==1.2.1 torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.12.0+cpu.html
#!pip install pyTigerGraph[gds]
#!pip install tensorboard # If you use tensorboard for visualization later

## Table of Contents
* [Data Processing](#data_processing)  
* [Train on whole graph](#train_whole)  
* [Train on neighborhood subgraphs](#train_subgraph)  
* [Inference](#inference)

## Data Processing <a name="data_processing"></a>

### Connect to TigerGraph

The `TigerGraphConnection` class represents a connection to the TigerGraph database. Under the hood, it stores the necessary information to communicate with the database. It is able to perform quite a few database tasks. Please see its [documentation](https://docs.tigergraph.com/pytigergraph/current/intro/) for details.

To connect to your database, modify the `config.json` file accompanying this notebook. Set the value of `getToken` based on whether token authentication is enabled for your database. Token authentication is always enabled for tgcloud databases.
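As a rough guide, the sketch below shows the four keys that this notebook reads from `config.json`. All values are hypothetical placeholders, and `missing_keys` is a helper introduced here for illustration; it is not part of pyTigerGraph.

```python
import json

# Hypothetical values; replace them with your own connection details.
example_config = {
    "host": "http://127.0.0.1",
    "username": "tigergraph",
    "password": "tigergraph",
    "getToken": False,
}

# The keys this notebook reads from the config file.
REQUIRED_KEYS = {"host", "username", "password", "getToken"}


def missing_keys(config):
    """Return the required keys that are absent from the config, sorted."""
    return sorted(REQUIRED_KEYS - set(config))


print(json.dumps(example_config, indent=4))
print("Missing keys:", missing_keys(example_config))
```

Running a check like this before connecting gives a clearer error than a `KeyError` raised halfway through the setup cell.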

In [1]:
from pyTigerGraph import TigerGraphConnection
import json

# Read in DB configs
with open('../../config.json', "r") as config_file:
    config = json.load(config_file)
    
conn = TigerGraphConnection(
    host=config["host"],
    username=config["username"],
    password=config["password"]
)

### Ingest Data

In [2]:
from pyTigerGraph.datasets import Datasets

dataset = Datasets("imdb")

Downloading:   0%|          | 0/441353 [00:00<?, ?it/s]

In [3]:
conn.ingestDataset(dataset, getToken=config["getToken"])

---- Checking database ----
A graph with name imdb already exists in the database. Please drop it first before ingesting.


### Visualize Schema

In [4]:
from pyTigerGraph.visualization import drawSchema

drawSchema(conn.getSchema(force=True))

CytoscapeWidget(cytoscape_layout={'name': 'circle', 'animate': True, 'padding': 1}, cytoscape_style=[{'selecto…

### Basic Statistics

In [5]:
conn.getVertexCount('*')

{'Movie': 4278, 'Actor': 5257, 'Director': 2081}

In [6]:
conn.getEdgeCount()

{'actor_movie': 12828,
 'director_movie': 4278,
 'movie_actor': 12828,
 'movie_director': 4278}

### Train/validation/test split

In [5]:
# The code in this cell is commented out because there is no need to split the vertices into 
# training/validation/test sets, as the split is already done in the original dataset. 
# See notebook 1_data_processing for examples on the split function.

#split = conn.gds.vertexSplitter(train_mask=0.8, val_mask=0.1, test_mask=0.1)
#split.run()

In [7]:
print(
    "Number of movies in training set:",
    conn.getVertexCount("Movie", where="train_mask!=0"),
)
print(
    "Number of movies in validation set:",
    conn.getVertexCount("Movie", where="val_mask!=0"),
)
print(
    "Number of movies in test set:", 
    conn.getVertexCount("Movie", where="test_mask!=0"),
)

Number of movies in training set: 400
Number of movies in validation set: 400
Number of movies in test set: 3478
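As a quick sanity check, the three counts printed above should add up to the 4278 movies in the graph. The sketch below uses those counts directly and also reports each split's share.

```python
# Counts copied from the output above.
split_sizes = {"train": 400, "validation": 400, "test": 3478}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print("{}: {} movies ({:.1%} of {})".format(name, size, shares[name], total))
```

Fewer than one movie in ten carries a training label, which is worth keeping in mind when comparing training and validation accuracy later.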


## Train on whole graph <a name="train_whole"></a>
We first train the model on the whole graph. This approach does **not** scale: it loads the entire graph into memory at once, so it fails on large graphs. For real workloads, see the section on training on neighborhood subgraphs. We include this example for illustration only. Hyperparameters for the model and the training environment are defined below.

In [8]:
# Hyperparameters
hp = {
    "num_heads": 2,
    "hidden_dim": 8,
    "num_layers": 2,
    "dropout": 0.1,
    "lr": 0.01,
    "l2_penalty": 0.0001,
}

### Construct graph loader

The `GraphLoader` can fetch the whole graph from the database all at once (`num_batches=1`). See the tutorial on dataloaders for details.

In [9]:
graph_loader = conn.gds.graphLoader(
    v_in_feats={"Movie": ["x"], "Actor": ["x"], "Director": ["x"]}, 
    v_out_labels={"Movie": ["y"]},
    v_extra_feats={"Movie": ["train_mask", "val_mask", "test_mask"]},
    num_batches=1,
    output_format="PyG",
    shuffle=False
)

Installing and optimizing queries. It might take a minute if this is the first time you use this loader.
Query installation finished.


In [10]:
# Get the whole graph from the loader
data = graph_loader.data

data

HeteroData(
  Movie={
    x=[4278, 3066],
    y=[4278],
    train_mask=[4278],
    val_mask=[4278],
    test_mask=[4278]
  },
  Actor={ x=[5257, 3066] },
  Director={ x=[2081, 3066] },
  (Movie, movie_actor, Actor)={ edge_index=[2, 12828] },
  (Movie, movie_director, Director)={ edge_index=[2, 4278] },
  (Actor, actor_movie, Movie)={ edge_index=[2, 12828] },
  (Director, director_movie, Movie)={ edge_index=[2, 4278] }
)

### Construct model and optimizer

We build a GAT model with 2 convolutional layers, and then convert it to a heterogeneous GAT model. We use the Adam optimizer with a learning rate of 0.01 to train the model.
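The layer sizes follow from the hyperparameters. The sketch below mirrors the constructor loop of the `GAT` class in this section without depending on PyTorch; `gat_layer_shapes` is a helper name introduced here for illustration. It assumes the default behavior of `GATConv`, where the outputs of all heads are concatenated, so a hidden layer emits `hidden_dim * num_heads` features.

```python
def gat_layer_shapes(num_features, num_layers, out_dim, hidden_dim, num_heads):
    """Return (in_units, out_units, heads) for each GAT layer."""
    shapes = []
    for i in range(num_layers):
        is_last = i == num_layers - 1
        # Hidden layers concatenate their heads, so the next layer's input widens.
        in_units = num_features if i == 0 else hidden_dim * num_heads
        out_units = out_dim if is_last else hidden_dim
        heads = 1 if is_last else num_heads
        shapes.append((in_units, out_units, heads))
    return shapes


shapes = gat_layer_shapes(
    num_features=3066, num_layers=2, out_dim=3, hidden_dim=8, num_heads=2
)
for i, (in_units, out_units, heads) in enumerate(shapes):
    print("Layer {}: in={}, out per head={}, heads={}".format(i, in_units, out_units, heads))
```

With the hyperparameters used here, the first layer maps 3066 input features to 2 heads of 8 features each, and the final single-head layer maps those 16 features to the 3 class scores.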

In [11]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, to_hetero

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a normal (homogeneous) GAT model
class GAT(torch.nn.Module):
    def __init__(
        self, num_features, num_layers, out_dim, dropout, hidden_dim, num_heads
    ):
        super().__init__()
        self.dropout = dropout
        self.layers = torch.nn.ModuleList()
        for i in range(num_layers):
            in_units = num_features if i == 0 else hidden_dim * num_heads
            out_units = out_dim if i == (num_layers - 1) else hidden_dim
            heads = 1 if i == (num_layers - 1) else num_heads
            self.layers.append(
                GATConv(in_units, out_units, heads=heads, dropout=dropout, add_self_loops=False)
            )

    def reset_parameters(self):
        for layer in self.layers:
            layer.reset_parameters()

    def forward(self, x, edge_index):
        x = x.float()
        for layer in self.layers[:-1]:
            x = layer(x, edge_index)
            x = F.elu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.layers[-1](x, edge_index)
        return x
    
model = GAT(
    num_features=3066,
    num_layers=hp["num_layers"],
    out_dim=3,
    dropout=hp["dropout"],
    hidden_dim=hp["hidden_dim"],
    num_heads=hp["num_heads"],
)

# Convert it to a heterogeneous model. See https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.to_hetero_transformer.to_hetero for details.
model = to_hetero(model, data.metadata(), aggr='sum').to(device)

optimizer = torch.optim.Adam(
    model.parameters(), lr=hp["lr"], weight_decay=hp["l2_penalty"]
)

### Train the model

In [13]:
from datetime import datetime
from pyTigerGraph.gds.metrics import Accumulator, Accuracy
from torch.utils.tensorboard import SummaryWriter



In [14]:
log_dir = "logs/imdb/hgat/wholegraph/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tb_log = SummaryWriter(log_dir)
logs = {}
data = data.to(device)
for epoch in range(20):
    # Train
    model.train()
    acc = Accuracy()
    # Forward pass
    out = model(data.x_dict, data.edge_index_dict)
    # Calculate loss on movie vertices in the training set only
    mask = data['Movie'].train_mask
    loss = F.cross_entropy(out["Movie"][mask], data["Movie"].y[mask])
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    # Update model
    optimizer.step()
    # Evaluate
    val_acc = Accuracy()
    with torch.no_grad():
        pred = out['Movie'].argmax(dim=1)
        acc.update(pred[mask], data['Movie'].y[mask])
        mask = data['Movie'].val_mask
        valid_loss = F.cross_entropy(out['Movie'][mask], data['Movie'].y[mask])
        val_acc.update(pred[mask], data['Movie'].y[mask])
    # Logging
    logs["loss"] = loss.item()
    logs["val_loss"] = valid_loss.item()
    logs["acc"] = acc.value
    logs["val_acc"] = val_acc.value
    print(
        "Epoch: {:02d}, Train Loss: {:.4f}, Valid Loss: {:.4f}, Train Accuracy: {:.4f}, Valid Accuracy: {:.4f}".format(
            epoch, logs["loss"], logs["val_loss"], logs["acc"], logs["val_acc"]
        )
    )
    tb_log.add_scalars(
        "Loss", {"Train": logs["loss"], "Validation": logs["val_loss"]}, epoch
    )
    tb_log.add_scalars(
        "Accuracy", {"Train": logs["acc"], "Validation": logs["val_acc"]}, epoch
    )
    tb_log.flush()

Epoch: 00, Train Loss: 1.0999, Valid Loss: 1.0952, Train Accuracy: 0.3200, Valid Accuracy: 0.3600
Epoch: 01, Train Loss: 0.9461, Valid Loss: 1.0454, Train Accuracy: 0.7500, Valid Accuracy: 0.4825
Epoch: 02, Train Loss: 0.8101, Valid Loss: 1.0224, Train Accuracy: 0.8100, Valid Accuracy: 0.4750
Epoch: 03, Train Loss: 0.6942, Valid Loss: 1.0109, Train Accuracy: 0.8475, Valid Accuracy: 0.4600
Epoch: 04, Train Loss: 0.5879, Valid Loss: 1.0056, Train Accuracy: 0.9125, Valid Accuracy: 0.4575
Epoch: 05, Train Loss: 0.4926, Valid Loss: 0.9867, Train Accuracy: 0.9525, Valid Accuracy: 0.5075
Epoch: 06, Train Loss: 0.3987, Valid Loss: 1.0070, Train Accuracy: 0.9700, Valid Accuracy: 0.4775
Epoch: 07, Train Loss: 0.3256, Valid Loss: 0.9922, Train Accuracy: 0.9900, Valid Accuracy: 0.5200
Epoch: 08, Train Loss: 0.2591, Valid Loss: 1.0082, Train Accuracy: 0.9875, Valid Accuracy: 0.5000
Epoch: 09, Train Loss: 0.2147, Valid Loss: 1.0477, Train Accuracy: 0.9950, Valid Accuracy: 0.4900
Epoch: 10, Train Los

### Test the model

In [16]:
model.eval()
acc = Accuracy()
with torch.no_grad():
    pred = model(data.x_dict, data.edge_index_dict)["Movie"].argmax(dim=1)
    mask = data["Movie"].test_mask
    acc.update(pred[mask], data["Movie"].y[mask])
print("Accuracy: {:.4f}".format(acc.value))

Accuracy: 0.5190


## Train on neighborhood subgraphs <a name="train_subgraph"></a>
Alternatively, we can train the model on neighborhood subgraphs. Each subgraph contains the 2-hop neighborhood of a batch of seed vertices. This method lets us train the model on graphs far larger than the IMDB dataset, because the whole graph is never loaded into memory at once.

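We use the same model hyperparameters as before, except that dropout is raised from 0.1 to 0.2, and we add loader settings (`batch_size`, `num_neighbors`, and `num_hops`) for the `NeighborLoader`, which loads the subgraphs. Once we finish iterating over all the subgraphs generated by the loader, every vertex in the graph is guaranteed to have been covered (except for those filtered out by a user-provided mask).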

In [17]:
# Hyperparameters
hp = {
    "num_heads": 2,
    "hidden_dim": 8,
    "num_layers": 2,
    "dropout": 0.2,
    "lr": 0.01,
    "l2_penalty": 0.0001,
    "batch_size": 128, 
    "num_neighbors": 10, 
    "num_hops": 2
}
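The batch size determines how many subgraphs each loader yields per epoch. The sketch below is a back-of-the-envelope estimate that assumes `batch_size` counts seed vertices per batch and that the last batch may be smaller; it uses the split sizes printed earlier in this notebook.

```python
import math

batch_size = 128

# Number of seed vertices in each split, from the counts printed earlier.
seed_counts = {"train": 400, "validation": 400, "test": 3478}

batches_per_epoch = {
    name: math.ceil(count / batch_size) for name, count in seed_counts.items()
}
for name, num_batches in batches_per_epoch.items():
    print("{}: {} batches per epoch".format(name, num_batches))
```

Under these assumptions the training loader yields four batches per epoch, which is consistent with the batch indices 0 through 3 in the training log of this section.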

### Construct neighborhood subgraph loader

Here we construct 3 subgraph loaders. The `train_loader` only uses vertices in the training set as seeds, the `valid_loader` only uses vertices in the validation set, and the `test_loader` only uses vertices in the test set.

In [18]:
train_loader = conn.gds.neighborLoader(
    v_in_feats={"Movie": ["x"], "Actor": ["x"], "Director": ["x"]}, 
    v_out_labels={"Movie": ["y"]},
    v_extra_feats={"Movie": ["train_mask", "val_mask", "test_mask"]},
    output_format="PyG",
    batch_size=hp["batch_size"],
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=True,
    filter_by={"Movie":"train_mask"},
)

Installing and optimizing queries. It might take a minute if this is the first time you use this loader.
Query installation finished.


In [19]:
valid_loader = conn.gds.neighborLoader(
    v_in_feats={"Movie": ["x"], "Actor": ["x"], "Director": ["x"]}, 
    v_out_labels={"Movie": ["y"]},
    v_extra_feats={"Movie": ["train_mask", "val_mask", "test_mask"]},
    output_format="PyG",
    batch_size=hp["batch_size"],
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=False,
    filter_by={"Movie":"val_mask"},
)

### Construct model and optimizer

We build a GAT model with 2 convolutional layers, and then convert it to a heterogeneous GAT model. We use the Adam optimizer with a learning rate of 0.01 to train the model.
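When a vertex type receives messages over more than one edge type, `to_hetero` combines the per-relation embeddings with the scheme given by `aggr` (here `'sum'`). In this graph, `Movie` vertices receive messages over both `actor_movie` and `director_movie`. The sketch below illustrates the summation with plain Python lists; the embedding values are hypothetical, and `aggregate_sum` is an illustrative helper, not a PyG function.

```python
def aggregate_sum(per_relation_outputs):
    """Element-wise sum of equally sized embedding vectors, one per relation."""
    return [sum(values) for values in zip(*per_relation_outputs.values())]


# Hypothetical embeddings for a single Movie vertex, one per incoming relation.
movie_outputs = {
    ("Actor", "actor_movie", "Movie"): [0.5, -1.0, 2.0],
    ("Director", "director_movie", "Movie"): [1.5, 0.25, -1.0],
}

combined = aggregate_sum(movie_outputs)
print(combined)
```

`Actor` and `Director` vertices each receive messages over a single edge type in this schema, so for them the aggregation leaves the embedding unchanged.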

In [20]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, to_hetero

In [21]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a normal (homogeneous) GAT model
class GAT(torch.nn.Module):
    def __init__(
        self, num_features, num_layers, out_dim, dropout, hidden_dim, num_heads
    ):
        super().__init__()
        self.dropout = dropout
        self.layers = torch.nn.ModuleList()
        for i in range(num_layers):
            in_units = num_features if i == 0 else hidden_dim * num_heads
            out_units = out_dim if i == (num_layers - 1) else hidden_dim
            heads = 1 if i == (num_layers - 1) else num_heads
            self.layers.append(
                GATConv(in_units, out_units, heads=heads, dropout=dropout, add_self_loops=False)
            )

    def reset_parameters(self):
        for layer in self.layers:
            layer.reset_parameters()

    def forward(self, x, edge_index):
        x = x.float()
        for layer in self.layers[:-1]:
            x = layer(x, edge_index)
            x = F.elu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.layers[-1](x, edge_index)
        return x
    
model = GAT(
    num_features=3066,
    num_layers=hp["num_layers"],
    out_dim=3,
    dropout=hp["dropout"],
    hidden_dim=hp["hidden_dim"],
    num_heads=hp["num_heads"],
)

# Convert it to a heterogeneous model. See https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.to_hetero_transformer.to_hetero for details.
metadata = (['Actor', 'Movie', 'Director'], 
            [('Actor', 'actor_movie', 'Movie'), 
             ('Movie', 'movie_actor', 'Actor'), 
             ('Movie', 'movie_director', 'Director'), 
             ('Director', 'director_movie', 'Movie')])
model = to_hetero(model, metadata, aggr='sum').to(device)

optimizer = torch.optim.Adam(
    model.parameters(), lr=hp["lr"], weight_decay=hp["l2_penalty"]
)

### Train the model

In [22]:
from datetime import datetime

from pyTigerGraph.gds.metrics import Accumulator, Accuracy
from torch.utils.tensorboard import SummaryWriter

In [23]:
log_dir = "logs/imdb/hgat/subgraph/" + datetime.now().strftime("%Y%m%d-%H%M%S")
train_log = SummaryWriter(log_dir+"/train")
valid_log = SummaryWriter(log_dir+"/valid")
global_steps = 0
logs = {}
for epoch in range(10):
    # Train
    model.train()
    epoch_train_loss = Accumulator()
    epoch_train_acc = Accuracy()
    # Iterate through the loader to get a stream of subgraphs instead of the whole graph
    for bid, batch in enumerate(train_loader):
        batchsize = batch["Movie"].x.shape[0]
        batch.to(device)
        # Forward pass
        out = model(batch.x_dict, batch.edge_index_dict)
        # Calculate loss
        mask = batch["Movie"].is_seed
        loss = F.cross_entropy(out["Movie"][mask], batch["Movie"].y[mask])
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_train_loss.update(loss.item() * batchsize, batchsize)
        # Predict on training data
        with torch.no_grad():
            pred = out["Movie"].argmax(dim=1)
            epoch_train_acc.update(pred[mask], batch["Movie"].y[mask])
        # Log training status after each batch
        logs["loss"] = epoch_train_loss.mean
        logs["acc"] = epoch_train_acc.value
        print(
            "Epoch {}, Train Batch {}, Loss {:.4f}, Accuracy {:.4f}".format(
                epoch, bid, logs["loss"], logs["acc"]
            )
        )
        train_log.add_scalar("Loss", logs["loss"], global_steps)
        train_log.add_scalar("Accuracy", logs["acc"], global_steps)
        train_log.flush()
        global_steps += 1
    # Evaluate
    model.eval()
    epoch_val_loss = Accumulator()
    epoch_val_acc = Accuracy()
    for batch in valid_loader:
        batchsize = batch["Movie"].x.shape[0]
        batch.to(device)
        with torch.no_grad():
            # Forward pass
            out = model(batch.x_dict, batch.edge_index_dict)
            # Calculate loss
            mask = batch["Movie"].is_seed
            valid_loss = F.cross_entropy(out["Movie"][mask], batch["Movie"].y[mask])
            epoch_val_loss.update(valid_loss.item() * batchsize, batchsize)
            # Prediction
            pred = out["Movie"].argmax(dim=1)
            epoch_val_acc.update(pred[mask], batch["Movie"].y[mask])
    # Log testing result after each epoch
    logs["val_loss"] = epoch_val_loss.mean
    logs["val_acc"] = epoch_val_acc.value
    print(
        "Epoch {}, Valid Loss {:.4f}, Valid Accuracy {:.4f}".format(
            epoch, logs["val_loss"], logs["val_acc"]
        )
    )
    valid_log.add_scalar("Loss", logs["val_loss"], global_steps)
    valid_log.add_scalar("Accuracy", logs["val_acc"], global_steps)
    valid_log.flush()

Epoch 0, Train Batch 0, Loss 1.1123, Accuracy 0.3086
Epoch 0, Train Batch 1, Loss 1.0735, Accuracy 0.4022
Epoch 0, Train Batch 2, Loss 1.0822, Accuracy 0.3693
Epoch 0, Train Batch 3, Loss 1.0856, Accuracy 0.3675
Epoch 0, Valid Loss 1.0530, Valid Accuracy 0.3925
Epoch 1, Train Batch 0, Loss 0.9177, Accuracy 0.5484
Epoch 1, Train Batch 1, Loss 0.9169, Accuracy 0.5459
Epoch 1, Train Batch 2, Loss 0.8784, Accuracy 0.5973
Epoch 1, Train Batch 3, Loss 0.8585, Accuracy 0.6275
Epoch 1, Valid Loss 1.0094, Valid Accuracy 0.4650
Epoch 2, Train Batch 0, Loss 0.7882, Accuracy 0.7553
Epoch 2, Train Batch 1, Loss 0.7354, Accuracy 0.7730
Epoch 2, Train Batch 2, Loss 0.7057, Accuracy 0.7973
Epoch 2, Train Batch 3, Loss 0.6940, Accuracy 0.8050
Epoch 2, Valid Loss 1.0000, Valid Accuracy 0.5225
Epoch 3, Train Batch 0, Loss 0.6163, Accuracy 0.8454
Epoch 3, Train Batch 1, Loss 0.5857, Accuracy 0.8357
Epoch 3, Train Batch 2, Loss 0.5866, Accuracy 0.8384
Epoch 3, Train Batch 3, Loss 0.5947, Accuracy 0.8200
Ep
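The training loop above reports a running loss by passing `loss * batchsize` and `batchsize` to the accumulator and then reading its `mean`. The sketch below shows the weighted running mean this pattern computes. `RunningMean` is a stand-in written for illustration and is not pyTigerGraph's `Accumulator` implementation; the batch losses and sizes are hypothetical.

```python
class RunningMean:
    """Accumulate weighted sums and report their running mean."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value_sum, n):
        # value_sum is the batch mean multiplied by the batch size n.
        self.total += value_sum
        self.count += n

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


running = RunningMean()
# Hypothetical (mean loss, batch size) pairs.
for batch_loss, batch_size in [(1.0, 100), (0.5, 300)]:
    running.update(batch_loss * batch_size, batch_size)

print(running.mean)
```

Weighting by batch size keeps a small final batch from counting as much as a full one when the per-batch means are averaged.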

### Test the model

In [25]:
test_loader = conn.gds.neighborLoader(
    v_in_feats={"Movie": ["x"], "Actor": ["x"], "Director": ["x"]}, 
    v_out_labels={"Movie": ["y"]},
    v_extra_feats={"Movie": ["train_mask", "val_mask", "test_mask"]},
    output_format="PyG",
    batch_size=hp["batch_size"],
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=False,
    filter_by={"Movie":"test_mask"},
)

In [26]:
model.eval()
acc = Accuracy()
for batch in test_loader:
    batch.to(device)
    with torch.no_grad():
        pred = model(batch.x_dict, batch.edge_index_dict)["Movie"].argmax(dim=1)
        mask = batch["Movie"].is_seed
        acc.update(pred[mask], batch["Movie"].y[mask])
print("Accuracy: {:.4f}".format(acc.value))

Accuracy: 0.4888


## Inference <a name="inference"></a>

Finally, we use the trained model for node classification. At this stage, we typically do inference/prediction for specific nodes instead of random batches, so we will create a new data loader.  

In [27]:
infer_loader = conn.gds.neighborLoader(
    v_in_feats={"Movie": ["x"], "Actor": ["x"], "Director": ["x"]}, 
    v_out_labels={"Movie": ["y"]},
    v_extra_feats={"Movie": ["train_mask", "val_mask", "test_mask"]},
    output_format="PyG",
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=False
)

In [28]:
# Fetch specific nodes by their IDs and do prediction. 
# Each node is represented by a dict with two mandatory keys: primary_id and type.
input_nodes = [{"primary_id": 7, "type": "Movie"}, 
               {"primary_id": 55, "type": "Movie"}]
data = infer_loader.fetch(input_nodes)

In [29]:
# The returned data are the neighborhood subgraphs of the input nodes.
# The original IDs of the nodes in the subgraphs are stored in the 
# `primary_id` attribute.
data

HeteroData(
  Movie={
    x=[53, 3066],
    y=[53],
    train_mask=[53],
    val_mask=[53],
    test_mask=[53],
    is_seed=[53],
    primary_id=[53]
  },
  Actor={
    x=[6, 3066],
    is_seed=[6],
    primary_id=[6]
  },
  Director={
    x=[2, 3066],
    is_seed=[2],
    primary_id=[2]
  },
  (Movie, movie_actor, Actor)={ edge_index=[2, 6] },
  (Movie, movie_director, Director)={ edge_index=[2, 2] },
  (Actor, actor_movie, Movie)={ edge_index=[2, 54] },
  (Director, director_movie, Movie)={ edge_index=[2, 11] }
)

In [30]:
# Predict. Predictions for both the input nodes and others in their 
# neighborhoods are generated.
model.eval()
data = data.to(device)  # The model lives on `device`, so the data must too
with torch.no_grad():  # Gradients are not needed for inference
    pred = model(data.x_dict, data.edge_index_dict)["Movie"].argmax(dim=1)
print("ID: Label")
for i, j in zip(data["Movie"].primary_id, pred):
    print("{}:{}".format(i, j.item()))

ID: Label
7:0
55:0
1543:0
963:0
211:2
178:0
1074:0
387:0
553:0
856:0
1327:0
2184:0
2897:0
109:0
712:0
3597:0
1153:0
3137:0
4157:0
1930:0
1863:0
1901:0
3983:0
633:0
3150:0
76:0
9:0
1899:0
2025:0
3718:0
111:0
1433:0
138:0
2077:0
22:0
40:0
15:0
2789:2
3124:0
3413:0
1346:2
464:0
89:0
1530:0
2263:0
659:0
326:0
3454:2
3174:0
70:0
520:0
1206:0
2382:0
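The output above lists every movie in the fetched subgraph, not only the requested ones. To keep just the input nodes, one option is to filter on the `is_seed` flag, the same attribute used as a mask in the test cells. The sketch below shows the idea with plain Python lists; the IDs, labels, and flags are hypothetical, and `seed_predictions` is an illustrative helper.

```python
def seed_predictions(primary_ids, predictions, is_seed):
    """Map the ID of each seed vertex to its predicted label."""
    return {
        pid: label
        for pid, label, seed in zip(primary_ids, predictions, is_seed)
        if seed
    }


# Hypothetical values, for illustration only.
primary_ids = [101, 102, 103]
predictions = [2, 0, 1]
is_seed = [True, False, True]

selected = seed_predictions(primary_ids, predictions, is_seed)
print(selected)
```

With the tensors returned by the loader, the equivalent selection is boolean indexing such as `pred[data["Movie"].is_seed]`.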
