# GNN Inference on Google Vertex AI using TigerGraph

In this notebook, we train a GNN model and deploy it to Google Vertex AI as an inference endpoint. The notebook assumes that you have a GCP account with the proper permissions. You can run it on Vertex AI Workbench or on your local machine.

## Setup

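We first create a working directory for the model files.
**Note:** the first cell below removes any existing `gat_cora` directory so that `os.mkdir` starts from a clean state. If you skip that cell, `os.mkdir` raises a `FileExistsError` when the directory already exists; you can safely ignore that error.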

In [1]:
!rm -rf ./gat_cora

In [2]:
import os

source_directory = "gat_cora"

os.mkdir("./{}".format(source_directory))

## Define The Model

We are going to define a Graph Attention Network (GAT) model, and write it to a file called `model.py`.

In [3]:
%%writefile $source_directory/model.py

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(
        self, num_features, num_layers, out_dim, dropout, hidden_dim, num_heads
    ):
        super().__init__()
        self.dropout = dropout
        self.layers = torch.nn.ModuleList()
        for i in range(num_layers):
            in_units = num_features if i == 0 else hidden_dim * num_heads
            out_units = out_dim if i == (num_layers - 1) else hidden_dim
            heads = 1 if i == (num_layers - 1) else num_heads
            self.layers.append(
                GATConv(in_units, out_units, heads=heads, dropout=dropout)
            )

    def reset_parameters(self):
        for layer in self.layers:
            layer.reset_parameters()

    def forward(self, data):
        x, edge_index = data.x.float(), data.edge_index
        for layer in self.layers[:-1]:
            x = layer(x, edge_index)
            x = F.elu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.layers[-1](x, edge_index)
        return x

Writing gat_cora/model.py


## Model Parameters

Here, we define a dictionary holding the parameters for the model, the data loaders, the optimizer, and the database connection. The connection credentials are read from a separate `config.json` file.

In [None]:
import json
with open('../../config.json', "r") as config_file:
    config = json.load(config_file)

In [None]:
parameters = {
    "model_name": "GAT",
    "model_config": {
        "num_features": 1433, # Number of features on Cora vertices 
        "out_dim": 7,         # Number of classes in Cora
        "num_heads": 8,       # Number of attention heads in GAT model
        "hidden_dim": 8,      # Number of hidden units in GAT model
        "num_layers": 2,      # Number of GAT layers in GAT model
        "dropout": 0.6        # Dropout probability in GAT model
    },
    "infer_loader_config": {
        "v_in_feats": ["x"],     # List of vertex features to be loaded
        "v_out_labels": ["y"],   # List of vertex labels to be loaded
        "v_extra_feats": ["train_mask","val_mask","test_mask"],     # Mask attributes; loaded but not used by the inference code
        "output_format": "PyG",  # Using PyTorch Geometric format
        "batch_size": 64,        # Batch size for inference
        "num_neighbors": 10,     # Number of neighbors per vertex
        "num_hops": 2,           # How deep to go in the graph
        "shuffle": False,        # Don't shuffle the data
    },
    "training_loader_config": {
        "v_in_feats": ["x"],
        "v_out_labels": ["y"],
        "v_extra_feats": ["train_mask","val_mask","test_mask"],
        "output_format": "PyG",
        "batch_size": 64, 
        "num_neighbors": 10, 
        "num_hops": 2,
        "shuffle": True,
    },
    "optimizer_config": {
        "lr": 0.01,
        "weight_decay": 5e-4,
    },
    "connection_config": {
        "host": config["host"],
        "username": config["username"],
        "password": config["password"],
        "graphname": "Cora"
    }
}

### Write Parameters to JSON File
We will write the parameters dictionary to a JSON file so that we can easily access the parameters when creating the inference container.

In [5]:
import json

# Use a context manager so the file is flushed and closed before it is read again
with open("{}/config.json".format(source_directory), "w") as f:
    json.dump(parameters, f)
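Because the inference container later reads this file back with `json.load`, it can help to confirm that the dictionary survives a round trip. The following is a minimal sketch, assuming a small stand-in dictionary (`sample_parameters`, not the notebook's real `parameters`) and a temporary directory instead of the `gat_cora` folder. Note that the real file also contains the connection block, including the password, in plain text.

```python
import json
import os
import tempfile

# Stand-in for the notebook's `parameters` dictionary (values are illustrative)
sample_parameters = {
    "model_name": "GAT",
    "model_config": {"num_features": 1433, "out_dim": 7, "dropout": 0.6},
    "optimizer_config": {"lr": 0.01, "weight_decay": 5e-4},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")
    # Write the dictionary, then read it back from disk
    with open(path, "w") as f:
        json.dump(sample_parameters, f)
    with open(path) as f:
        restored = json.load(f)

# Dicts, strings, numbers and booleans come back unchanged
round_trip_ok = restored == sample_parameters
print(round_trip_ok)
```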

## Train a GNN Model

### Load the Model
Here, we add the source directory to Python's module search path and look up the model class by name. This is equivalent to writing `from source_directory.model import ModelName`.

Since `source_directory` and `ModelName` depend on each developer's configuration, we use `sys.path` and `getattr` to import the model dynamically.

In [6]:
import sys
sys.path.append(source_directory)

import model
GAT = getattr(model, parameters["model_name"])

In [7]:
GAT

model.GAT
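The same lookup pattern can be tried in isolation. The sketch below writes a throwaway module to a temporary directory and loads a class from it by name; the module name `toy_model` and its placeholder `GAT` class are assumptions made for this example and are unrelated to the real `model.py`.

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway module so the example is self-contained
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "toy_model.py"), "w") as f:
    f.write(
        "class GAT:\n"
        "    def __init__(self, **kwargs):\n"
        "        self.kwargs = kwargs\n"
    )

sys.path.append(tmp_dir)                       # same idea as sys.path.append(source_directory)
importlib.invalidate_caches()                  # make sure the new file is visible to the importer
module = importlib.import_module("toy_model")  # same idea as `import model`
model_cls = getattr(module, "GAT")             # class looked up by its configured name

instance = model_cls(hidden_dim=8)
print(model_cls.__name__, instance.kwargs)
```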

#### Instantiate the Model Class
Here, we unpack the `model_config` entry of the parameters dictionary as keyword arguments (`**kwargs`) to the model constructor.

In [8]:
gat = GAT(**parameters["model_config"])
gat

GAT(
  (layers): ModuleList(
    (0): GATConv(1433, 8, heads=8)
    (1): GATConv(64, 7, heads=1)
  )
)
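The layer sizes in the printout above follow from the loop in `model.py`: every hidden layer concatenates its attention heads, so the next layer receives `hidden_dim * num_heads` inputs, while the final layer uses a single head that emits `out_dim` scores. The sketch below reproduces only that bookkeeping in plain Python; `gat_layer_shapes` is a hypothetical helper written for illustration and is not part of PyTorch Geometric.

```python
def gat_layer_shapes(num_features, num_layers, out_dim, hidden_dim, num_heads):
    """Return (in_units, out_units, heads) per layer, mirroring the loop in model.py."""
    shapes = []
    for i in range(num_layers):
        is_last = i == num_layers - 1
        in_units = num_features if i == 0 else hidden_dim * num_heads
        out_units = out_dim if is_last else hidden_dim
        heads = 1 if is_last else num_heads
        shapes.append((in_units, out_units, heads))
    return shapes

# Same values as the notebook's model_config
shapes = gat_layer_shapes(
    num_features=1433, num_layers=2, out_dim=7, hidden_dim=8, num_heads=8
)
print(shapes)  # [(1433, 8, 8), (64, 7, 1)]
```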

## Create Connection to the Database and Ingest Dataset

In [9]:
from pyTigerGraph import TigerGraphConnection

conn = TigerGraphConnection(**parameters["connection_config"])
conn.getToken(conn.createSecret())

('4rp8a88942kk3uufr6nad0vpmiikl5ke', 1681426506, '2023-04-13 22:55:06')

In [None]:
from pyTigerGraph.datasets import Datasets

dataset = Datasets("Cora")

conn.ingestDataset(dataset, getToken=config["getToken"])

### Create Data Loaders
Using the connection created above, we create data loaders for the training, validation, and test sets. We use the **Neighbor Sampling** technique introduced in the GraphSAGE paper to generate batches of data: each batch starts from a set of seed vertices and samples up to `num_neighbors` neighbors per vertex for `num_hops` hops.

In [10]:
train_loader = conn.gds.neighborLoader(
    **parameters["training_loader_config"],
    filter_by="train_mask"
)

In [11]:
valid_loader = conn.gds.neighborLoader(
    **parameters["training_loader_config"],
    filter_by="val_mask"
)

In [12]:
test_loader = conn.gds.neighborLoader(
    **parameters["training_loader_config"],
    filter_by="test_mask"
)
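To make the sampling idea concrete, here is a minimal sketch of neighbor sampling on a toy in-memory graph. It is an illustration only, under the assumption of a plain adjacency-list dictionary; it is not pyTigerGraph's implementation, and `sample_neighborhood` is a hypothetical helper.

```python
import random

def sample_neighborhood(adjacency, seeds, num_neighbors, num_hops, rng):
    """Collect the vertices reached by sampling up to `num_neighbors`
    neighbors per vertex for `num_hops` hops, starting from `seeds`."""
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(num_hops):
        next_frontier = []
        for vertex in frontier:
            neighbors = adjacency.get(vertex, [])
            k = min(num_neighbors, len(neighbors))
            for neighbor in rng.sample(neighbors, k):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited

# Toy undirected graph stored as an adjacency list
adjacency = {
    0: [1, 2, 3],
    1: [0, 4],
    2: [0, 5],
    3: [0],
    4: [1],
    5: [2, 6],
    6: [5],
}
rng = random.Random(0)
subgraph_vertices = sample_neighborhood(
    adjacency, seeds=[0], num_neighbors=2, num_hops=2, rng=rng
)
print(sorted(subgraph_vertices))
```

With `num_neighbors=2`, only two of the three neighbors of vertex 0 are kept, and vertex 6 can never appear because it is three hops away from the seed.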

### Set Up Optimizer
Here, we define the `Adam` optimizer and move the model to the correct device (CPU or GPU).

In [13]:
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

gat.to(device)

optimizer = torch.optim.Adam(
    gat.parameters(), **parameters["optimizer_config"]
)

### Train the Model

In [14]:
from datetime import datetime
from pyTigerGraph.gds.metrics import Accumulator, Accuracy

In [15]:
global_steps = 0
logs = {}
for epoch in range(10):
    # Train
    gat.train()
    epoch_train_loss = Accumulator()
    epoch_train_acc = Accuracy()
    for bid, batch in enumerate(train_loader):
        batchsize = batch.x.shape[0]
        batch.to(device)
        # Forward pass
        out = gat(batch)
        # Calculate loss
        loss = F.cross_entropy(out[batch.train_mask], batch.y[batch.train_mask])
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_train_loss.update(loss.item() * batchsize, batchsize)
        # Predict on training data
        with torch.no_grad():
            pred = out.argmax(dim=1)
            epoch_train_acc.update(pred[batch.train_mask], batch.y[batch.train_mask])
        # Log training status after each batch
        logs["loss"] = epoch_train_loss.mean
        logs["acc"] = epoch_train_acc.value
        print(
            "Epoch {}, Train Batch {}, Loss {:.4f}, Accuracy {:.4f}".format(
                epoch, bid, logs["loss"], logs["acc"]
            )
        )
        global_steps += 1
    # Evaluate
    gat.eval()
    epoch_val_loss = Accumulator()
    epoch_val_acc = Accuracy()
    for batch in valid_loader:
        batchsize = batch.x.shape[0]
        batch.to(device)
        with torch.no_grad():
            # Forward pass
            out = gat(batch)
            # Calculate loss
            valid_loss = F.cross_entropy(out[batch.val_mask], batch.y[batch.val_mask])
            epoch_val_loss.update(valid_loss.item() * batchsize, batchsize)
            # Prediction
            pred = out.argmax(dim=1)
            epoch_val_acc.update(pred[batch.val_mask], batch.y[batch.val_mask])
    # Log validation result after each epoch
    logs["val_loss"] = epoch_val_loss.mean
    logs["val_acc"] = epoch_val_acc.value
    print(
        "Epoch {}, Valid Loss {:.4f}, Valid Accuracy {:.4f}".format(
            epoch, logs["val_loss"], logs["val_acc"]
        )
    )

Epoch 0, Train Batch 0, Loss 2.0409, Accuracy 0.1013
Epoch 0, Train Batch 1, Loss 1.9961, Accuracy 0.1250
Epoch 0, Train Batch 2, Loss 1.9269, Accuracy 0.1923
Epoch 0, Valid Loss 1.6668, Valid Accuracy 0.4970
Epoch 1, Train Batch 0, Loss 1.4493, Accuracy 0.6190
Epoch 1, Train Batch 1, Loss 1.4238, Accuracy 0.6214
Epoch 1, Train Batch 2, Loss 1.3621, Accuracy 0.6390
Epoch 1, Valid Loss 1.4531, Valid Accuracy 0.6185
Epoch 2, Train Batch 0, Loss 1.1613, Accuracy 0.7333
Epoch 2, Train Batch 1, Loss 1.2273, Accuracy 0.6992
Epoch 2, Train Batch 2, Loss 1.1661, Accuracy 0.6847
Epoch 2, Valid Loss 1.2852, Valid Accuracy 0.6850
Epoch 3, Train Batch 0, Loss 1.1222, Accuracy 0.6753
Epoch 3, Train Batch 1, Loss 0.9907, Accuracy 0.7071
Epoch 3, Train Batch 2, Loss 0.9570, Accuracy 0.7340
Epoch 3, Valid Loss 1.1397, Valid Accuracy 0.7031
Epoch 4, Train Batch 0, Loss 1.0439, Accuracy 0.7313
Epoch 4, Train Batch 1, Loss 0.9597, Accuracy 0.7462
Epoch 4, Train Batch 2, Loss 0.9003, Accuracy 0.7600
Epoch
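The training loop multiplies each batch loss by the batch size before accumulating it, so the reported epoch loss is a size-weighted mean rather than a plain average of batch means. The sketch below illustrates that arithmetic with a hypothetical `RunningMean` class; it is a stand-in written for this example, not the `Accumulator` class from `pyTigerGraph.gds.metrics`, and the numbers are made up.

```python
class RunningMean:
    """Hypothetical stand-in: accumulates a weighted sum and a total count."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, weighted_sum, count):
        self.total += weighted_sum
        self.count += count

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

epoch_loss = RunningMean()
# (batch mean loss, batch size) pairs
for batch_loss, batch_size in [(2.0, 100), (1.0, 300)]:
    epoch_loss.update(batch_loss * batch_size, batch_size)

# The larger batch dominates: (200 + 300) / 400 = 1.25, not the unweighted 1.5
print(epoch_loss.mean)  # 1.25
```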

### Test the Model

In [16]:
gat.eval()
acc = Accuracy()
for batch in test_loader:
    batch.to(device)
    with torch.no_grad():
        pred = gat(batch).argmax(dim=1)
        acc.update(pred[batch.test_mask], batch.y[batch.test_mask])
print("Accuracy: {:.4f}".format(acc.value))

Accuracy: 0.7023
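Each batch may contain sampled neighbors that are not part of the test set, which is why the loop indexes predictions and labels with `batch.test_mask` before updating the metric. A minimal plain-Python sketch of that masking step, using a hypothetical `masked_accuracy` helper and made-up values:

```python
def masked_accuracy(predictions, labels, mask):
    """Accuracy computed only over positions where `mask` is True."""
    selected = [(p, y) for p, y, m in zip(predictions, labels, mask) if m]
    if not selected:
        return 0.0
    return sum(p == y for p, y in selected) / len(selected)

predictions = [3, 1, 2, 0, 3]
labels = [3, 0, 2, 1, 1]
test_mask = [True, True, False, True, False]

# Only positions 0, 1 and 3 count; one of those three is correct
accuracy = masked_accuracy(predictions, labels, test_mask)
print(round(accuracy, 4))  # 0.3333
```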


### Save the Trained Model Weights

In [17]:
torch.save(gat.state_dict(), "{}/model.pth".format(source_directory))

## Create Dockerfile

Google Vertex AI hosts models in Docker containers. We use the Dockerfile below to build the container image.

In [18]:
%%writefile Dockerfile

FROM ubuntu:latest

# Install some basic utilities
RUN apt-get update && apt-get install -y \
    curl \
    ca-certificates \
    sudo \
    git \
    bzip2 \
    libx11-6 \
    wget \
    pip \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /opt
# Set up the Conda environment
ENV CONDA_AUTO_UPDATE_CONDA=false \
    PATH=/opt/miniconda/bin:$PATH
COPY ./gat_cora/environment.yml /opt/environment.yml
RUN curl -sLo /opt/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh \
 && chmod +x /opt/miniconda.sh \
 && /opt/miniconda.sh -b -p /opt/miniconda \
 && rm /opt/miniconda.sh \
 && conda env update -n base -f /opt/environment.yml \
 && rm /opt/environment.yml \
 && conda clean -ya

RUN pip install --no-index torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html \
 && pip install --no-index torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html \
 && pip install --no-index torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html \
 && pip install --no-index torch-spline-conv -f https://data.pyg.org/whl/torch-1.10.0+cu113.html \
 && pip install torch-geometric \
 && pip cache purge

# install - requirements.txt
COPY ./gat_cora/requirements.txt /tmp/requirements.txt
RUN python3 -m pip install -r /tmp/requirements.txt --quiet --no-cache-dir \
  && rm -f /tmp/requirements.txt

ENV TARGET_DIR /opt/kserve-demo
WORKDIR ${TARGET_DIR}
COPY ./gat_cora/ ${TARGET_DIR}/gat_cora/

ENTRYPOINT ["python3", "./gat_cora/main.py"]

Overwriting Dockerfile


## Define main.py File

This `main.py` file will load the model and start running an HTTP server for model inference within the Docker container.

In [19]:
%%writefile $source_directory/main.py
import torch
import kserve
from google.cloud import storage
# from sklearn.externals import joblib
from kserve import Model, Storage
from kserve.model import ModelMissingError, InferenceError
from typing import Dict
import logging
import pyTigerGraph as tg
import os 
import sys
import json

logger = logging.getLogger(__name__)

class VertexClassifier(Model):
    def __init__(self, name: str, source_directory: str):
        super().__init__(name)
        self.name = name
        self.source_dir = source_directory
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
        # Load configuration JSON file
        with open(os.path.join(source_directory, "config.json")) as json_file:
            data = json.load(json_file)
            self.model_config = data["model_config"]
            connection_config = data["connection_config"]
            loader_config = data["infer_loader_config"]
            self.mdl_nm = data["model_name"]

        sys.path.append(source_directory)
        # Setup Connection to TigerGraph Database
        self.conn = tg.TigerGraphConnection(**connection_config)
        self.conn.getToken(self.conn.createSecret())
        logger.info("connection created")
        # Setup Inference Loader
        self.infer_loader = self.conn.gds.neighborLoader(**loader_config)
        logger.info("loader created")
        # Setup Model
        self.model = self.load_model()
        logger.info("model loaded")

    def load(self):
        pass
    
    def load_model(self):
        import model
        mdl = getattr(model, self.mdl_nm)(**self.model_config)
        logger.info("Instantiated Model")
        with open(os.path.join(self.source_dir, "model.pth"), 'rb') as f:
            mdl.load_state_dict(torch.load(f))
        mdl.to(self.device).eval()
        logger.info("Loaded Model")
        return mdl

    def predict(self, request: Dict) -> Dict:
        input_nodes = request["instances"]
        input_ids = set([str(node['primary_id']) for node in input_nodes])
        logger.info(input_ids)
        data = self.infer_loader.fetch(input_nodes).to(self.device)
        logger.info(f"predicting {data}")
        with torch.no_grad():
            output = self.model(data)
        returnJSON = []
        for i in range(len(input_nodes)):
            returnJSON.append({input_nodes[i]["primary_id"]: list(output[i].tolist())})
        return json.dumps({"predictions": returnJSON})

if __name__ == "__main__":
    model_name = os.environ.get('K_SERVICE', "tg-gat-gcp-demo-predictor-default")
    model_name = '-'.join(model_name.split('-')[:-2]) # removing suffix "-predictor-default"
    logger.info(f"Starting model '{model_name}'")
    model = VertexClassifier(model_name, "./gat_cora/")
    kserve.ModelServer(http_port=8080).start([model])


Writing gat_cora/main.py
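KServe serves each model under a route that includes the model's name, so the name derived in `main.py` has to agree with the prediction route configured when the model is uploaded. The sketch below isolates the suffix-stripping step in a hypothetical helper, `derive_model_name`, and applies it to the default value from `main.py`, which is used when `K_SERVICE` is not set.

```python
def derive_model_name(service_name):
    """Drop the last two dash-separated parts, e.g. '-predictor-default'."""
    return "-".join(service_name.split("-")[:-2])

default_service = "tg-gat-gcp-demo-predictor-default"
model_name = derive_model_name(default_service)
predict_route = "/v1/models/{}:predict".format(model_name)

print(model_name)     # tg-gat-gcp-demo
print(predict_route)  # /v1/models/tg-gat-gcp-demo:predict
```

The resulting route matches the `--container-predict-route` value passed to `gcloud ai models upload` in the deployment step.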


## Write requirements.txt File

In [20]:
%%writefile $source_directory/requirements.txt

# kubeflow packages
kfp==1.6.3
kfp-server-api==1.6.0
kserve==0.8

# common packages
#bokeh==2.3.2
#cloudpickle==1.6.0
#dill==0.3.4
#pandas==1.2.4

# pytorch packages
#fastai==2.4
class-resolver==0.3.9

# TigerGraph
pyTigerGraph[gds]

Writing gat_cora/requirements.txt


## Write environment.yml File

In [21]:
%%writefile $source_directory/environment.yml
name: base
dependencies:
- numpy=1.21.2
- pip=21.2.4
- python=3.9.7
- pytorch::pytorch=1.10.0=py3.9_cuda11.3_cudnn8.2.0_0
- scipy=1.7.1
- cloudpickle=2.0.0  

Writing gat_cora/environment.yml


## Build Docker Image
Using the Dockerfile defined above, we will build the Docker image for inference.

In [22]:
!gcloud builds submit --region=us-central1 --tag=us-central1-docker.pkg.dev/tigergraph-ml/gnn-inference/cora-gat-inference:latest

Creating temporary tarball archive of 14 file(s) totalling 629.8 KiB before compression.
Uploading tarball of [.] to [gs://tigergraph-ml_cloudbuild/source/1678834552.937435-4272b23ae97d4077af4249acf57493c7.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/tigergraph-ml/locations/us-central1/builds/8fc1516e-6a87-44ca-ac29-4694cddfcbd7].
Logs are available at [ https://console.cloud.google.com/cloud-build/builds;region=us-central1/8fc1516e-6a87-44ca-ac29-4694cddfcbd7?project=130162788530 ].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "8fc1516e-6a87-44ca-ac29-4694cddfcbd7"

FETCHSOURCE
Fetching storage object: gs://tigergraph-ml_cloudbuild/source/1678834552.937435-4272b23ae97d4077af4249acf57493c7.tgz#1678834553374871
Copying gs://tigergraph-ml_cloudbuild/source/1678834552.937435-4272b23ae97d4077af4249acf57493c7.tgz#1678834553374871...
/ [1 files][387.1 KiB/387.1 KiB]                                                
Operation com

## Deploy Model

In [23]:
!gcloud ai models upload \
  --region=us-central1 \
  --display-name=cora-gat \
  --container-image-uri=us-central1-docker.pkg.dev/tigergraph-ml/gnn-inference/cora-gat-inference:latest \
  --container-health-route=/v1/models \
  --container-predict-route=/v1/models/tg-gat-gcp-demo:predict


Using endpoint [https://us-central1-aiplatform.googleapis.com/]
Waiting for operation [411316098676293632]...done.                             


In [29]:
!gcloud ai models list --region=us-central1 

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
MODEL_ID             DISPLAY_NAME
4239490337308934144  cora-gat
5804563775587614720  paysim-prediction-model


## Create Endpoint

In [25]:
!gcloud ai endpoints create --region=us-central1 --display-name=coragat

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
Waiting for operation [5397926786082275328]...done.                            
Created Vertex AI endpoint: projects/130162788530/locations/us-central1/endpoints/239126186855235584.


In [28]:
!gcloud ai endpoints list --region=us-central1 --filter=display_name=coragat

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
ENDPOINT_ID          DISPLAY_NAME
239126186855235584   coragat
7345806398845878272  coragat


## Deploy Model to Endpoint

**NOTE:** Replace `YOUR_ENDPOINT_ID` and `YOUR_MODEL_ID` with the endpoint and model IDs listed in the outputs above.

In [30]:
!gcloud ai endpoints deploy-model YOUR_ENDPOINT_ID \
  --region=us-central1 \
  --model=YOUR_MODEL_ID \
  --display-name=coragat \
  --machine-type=n1-standard-2 \
  --min-replica-count=1 \
  --max-replica-count=5 \
  --traffic-split=0=100


Using endpoint [https://us-central1-aiplatform.googleapis.com/]
Waiting for operation [590897133817692160]...done.                             
Deployed a model to the endpoint 239126186855235584. Id of the deployed model: 8127163342008614912.


## Run Prediction

In [31]:
data = {"instances": [{"primary_id": 7, "type": "Paper"}, {"primary_id": 17, "type": "Paper"}, {"primary_id": 27, "type": "Paper"}, {"primary_id": 37, "type": "Paper"}]}
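The `predict` method in `main.py` reads `primary_id` from every instance and passes the instances to the loader, so each entry needs a `primary_id` and a `type`. A small sketch of building such a payload from a list of vertex IDs; `build_instances` is a hypothetical helper, and the default vertex type `"Paper"` is taken from the request above.

```python
import json

def build_instances(vertex_ids, vertex_type="Paper"):
    """Build the request body in the shape used by the prediction request."""
    return {
        "instances": [
            {"primary_id": vertex_id, "type": vertex_type}
            for vertex_id in vertex_ids
        ]
    }

payload = build_instances([7, 17, 27, 37])
print(json.dumps(payload, indent=2))
```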

**NOTE:** Replace `ENDPOINT_ID` and `PROJECT_ID` with the appropriate values below.

In [36]:
ENDPOINT_ID="ENDPOINT_ID"
PROJECT_ID="PROJECT_ID"

In [44]:
gcloud_token = !gcloud auth print-access-token
gcloud_token = gcloud_token[0]

In [46]:
import requests
url = "https://us-central1-aiplatform.googleapis.com/v1/projects/{}/locations/us-central1/endpoints/{}:predict".format(PROJECT_ID, ENDPOINT_ID)
header = {"Authorization": "Bearer " + gcloud_token}
resp = requests.post(url, json=data, headers=header)

In [47]:
resp.json()

{'predictions': [{'7': [0.4432734847068787,
    -1.978203892707825,
    -2.681934356689453,
    3.335464477539062,
    1.850573658943176,
    0.5662031769752502,
    -0.4603772461414337]},
  {'17': [-1.18100106716156,
    -1.289207339286804,
    -0.2260519564151764,
    5.754012107849121,
    1.381614923477173,
    0.5561198592185974,
    -2.365478277206421]},
  {'27': [0.5439566373825073,
    -1.011525392532349,
    -0.5042273998260498,
    2.927565097808838,
    0.539995014667511,
    0.09310251474380493,
    -1.186939597129822]},
  {'37': [2.451966524124146,
    -1.063238501548767,
    -3.177617788314819,
    -1.015842080116272,
    0.5224310159683228,
    3.596114635467529,
    -0.6407819986343384]}],
 'deployedModelId': '8127163342008614912',
 'model': 'projects/130162788530/locations/us-central1/models/4239490337308934144',
 'modelDisplayName': 'cora-gat',
 'modelVersionId': '1'}
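The seven numbers returned for each vertex are raw scores (logits) from the last GAT layer, since `forward` applies no softmax. The following is a minimal sketch of turning one row into a predicted class index and class probabilities, using the scores returned above for vertex 7 rounded to four decimals; the `softmax` helper is our own and not part of the deployed code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Scores for vertex 7 from the response above, rounded to four decimals
logits = [0.4433, -1.9782, -2.6819, 3.3355, 1.8506, 0.5662, -0.4604]

probabilities = softmax(logits)
predicted_class = max(range(len(logits)), key=lambda i: logits[i])
print(predicted_class)  # 3, the index of the largest score
```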