## Training and Creation
This notebook builds the neural network architecture and trains it. Once training is finished, we save the model
as a `.pt` file so that it can be used for inference.

In [1]:
import itertools
import os

os.environ["DGLBACKEND"] = "pytorch"

import dgl
import dgl.data
import numpy as np
import scipy.sparse as sp
import torch
import torch.nn as nn
import torch.nn.functional as F

## Code Documentation: Initializing Reproducibility Seeds

This code snippet defines a function `setup_seed` to establish seeds for random number generators, 
ensuring reproducibility in experiments involving randomness.

### Function Signature
```python
def setup_seed(seed)
```

In [2]:
import random


def setup_seed(seed):
  """
  Set seeds for reproducible results.
  """
  # Seed every random number generator used in this notebook
  torch.manual_seed(seed)
  torch.cuda.manual_seed_all(seed)
  torch.backends.cudnn.deterministic = True
  np.random.seed(seed)
  random.seed(seed)
  dgl.seed(seed)
  dgl.random.seed(seed)
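
As a quick check of what seeding provides, the sketch below uses only `random` and NumPy. The `seed_basic` helper is a hypothetical, reduced stand-in for `setup_seed` (it leaves out the torch and DGL calls), and the permutation mimics the edge shuffling used later in this notebook.

```python
import random

import numpy as np


def seed_basic(seed):
    """Hypothetical reduced version of setup_seed: seeds only random and NumPy."""
    random.seed(seed)
    np.random.seed(seed)


# First run with a fixed seed
seed_basic(42)
first_perm = np.random.permutation(10)
first_draw = random.random()

# Second run with the same seed should repeat the same random choices
seed_basic(42)
second_perm = np.random.permutation(10)
second_draw = random.random()

same_perm = bool(np.array_equal(first_perm, second_perm))
same_draw = first_draw == second_draw
print(same_perm, same_draw)
```

Seeding has to happen before the first random call (here, before the permutation is drawn), otherwise the train/test split differs between runs.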

### Loading and Preprocessing

We load the custom dataset we created into DGL. Because the data is already in the expected format,
we can use the built-in `CSVDataset` class without any modifications.

In [3]:
dataset = dgl.data.CSVDataset('../data/author_data')
g = dataset[0]
g = dgl.add_self_loop(g)
print(g)

Done loading data from cached files.
Graph(num_nodes=333, num_edges=2852,
      ndata_schemes={'feat': Scheme(shape=(224,), dtype=torch.int64)}
      edata_schemes={})


This block splits the edge set of the graph into training and test sets for link prediction.

1. **Edge Set Preparation:**
   - Extract source (`u`) and destination (`v`) nodes from the graph's edges.
   - Generate a random permutation of edge indices (`eids`) to shuffle the edge set.
   - Determine the sizes of the test and training sets.

2. **Positive Edge Splitting:**
   - Extract source and destination nodes for the test positive edges.
   - Extract source and destination nodes for the train positive edges.

3. **Negative Edge Generation:**
   - Create an adjacency matrix (`adj`) from the edge information.
   - Calculate the adjacency matrix for negative edges (`adj_neg`) by subtracting the positive adjacency matrix and the identity matrix from a matrix of ones.
   - Identify source and destination nodes for negative edges.

4. **Negative Edge Splitting:**
   - Randomly sample negative edges for the test set.
   - Assign source and destination nodes for the test negative edges.
   - Assign source and destination nodes for the train negative edges.

This code prepares positive and negative edge sets for training and testing link prediction tasks. It facilitates the creation of a comprehensive dataset for evaluating graph models.
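
The negative-edge construction can be checked by hand on a toy graph. The sketch below is an illustration with made-up edges, not the author data: it uses a dense NumPy matrix instead of `scipy.sparse`, and it selects entries with `> 0` so that the diagonal, which becomes -1 when every node already has a self-loop, is never picked as a negative pair.

```python
import numpy as np

# Toy directed graph with 4 nodes; every node also has a self-loop,
# as is the case after dgl.add_self_loop.
num_nodes = 4
u = np.array([0, 1, 2, 0, 1, 2, 3])
v = np.array([1, 2, 3, 0, 1, 2, 3])

# Dense adjacency matrix of the positive edges
adj = np.zeros((num_nodes, num_nodes))
adj[u, v] = 1

# 1 marks a candidate negative pair, 0 an existing edge, -1 a self-loop
adj_neg = 1 - adj - np.eye(num_nodes)
neg_u, neg_v = np.where(adj_neg > 0)

positives = set(zip(u.tolist(), v.tolist()))
negatives = set(zip(neg_u.tolist(), neg_v.tolist()))
print(len(negatives))  # 16 pairs - 3 real edges - 4 diagonal pairs = 9
```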


In [4]:
# Split edge set for training and testing
u, v = g.edges()

eids = np.arange(g.num_edges())
eids = np.random.permutation(eids)
test_size = int(len(eids) * 0.1)
train_size = g.num_edges() - test_size
test_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]
train_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]

# Find all negative edges and split them for training and testing
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))
adj_neg = 1 - adj.todense() - np.eye(g.num_nodes())
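# Keep strictly positive entries only: the self-loops added above make the
# diagonal negative, and `!= 0` would treat (i, i) pairs as negative edges.
neg_u, neg_v = np.where(adj_neg > 0)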

neg_eids = np.random.choice(len(neg_u), g.num_edges())
test_neg_u, test_neg_v = (
    neg_u[neg_eids[:test_size]],
    neg_v[neg_eids[:test_size]],
)
train_neg_u, train_neg_v = (
    neg_u[neg_eids[test_size:]],
    neg_v[neg_eids[test_size:]],
)

When training, you need to remove the test-set edges from the original graph so that the model does not see them. You can do this via `dgl.remove_edges`.

In [5]:
train_g = dgl.remove_edges(g, eids[:test_size])

### Creating our GraphSAGE network

1. **Import GraphSAGE Convolution:**
   - Import `SAGEConv` from DGL's neural network module.

2. **Model Definition:**
   - Define a two-layer GraphSAGE model class (`GraphSAGE`) that inherits from `nn.Module`.
   - Constructor (`__init__`):
       - Initialize the model using input feature dimensions (`in_feats`) and hidden feature dimensions (`h_feats`).
       - Create two `SAGEConv` layers: the first with input features, the second with hidden features.
   - Forward Pass (`forward`):
       - Execute the forward computation of the model.
       - Perform GraphSAGE convolution on the input graph `g` and input features `in_feat`.
       - Apply ReLU activation function to intermediate results.
       - Perform the second GraphSAGE convolution on the updated features and return the result.

This code defines a GraphSAGE model architecture for graph-based learning tasks, facilitating the propagation of node features through the graph structure.
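
To make the propagation step concrete, here is a minimal NumPy sketch of roughly what one mean-aggregator layer computes: each node combines its own features with the average of its incoming neighbours' features. The toy graph, the random placeholder weights `w_self` and `w_neigh`, and the omitted bias are all assumptions for illustration; this is not DGL's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: edges point from src to dst
src = np.array([0, 1, 2, 2])
dst = np.array([1, 2, 0, 1])
num_nodes, in_feats, h_feats = 3, 3, 2
x = rng.normal(size=(num_nodes, in_feats))

# Placeholder weights (a trained layer would learn these)
w_self = rng.normal(size=(in_feats, h_feats))
w_neigh = rng.normal(size=(in_feats, h_feats))

# Mean of the incoming neighbour features for every destination node
neigh_sum = np.zeros_like(x)
np.add.at(neigh_sum, dst, x[src])
in_degree = np.bincount(dst, minlength=num_nodes).reshape(-1, 1)
neigh_mean = neigh_sum / np.maximum(in_degree, 1)

# One layer: own features plus aggregated neighbourhood, followed by ReLU
h = np.maximum(x @ w_self + neigh_mean @ w_neigh, 0)
print(h.shape)
```

Stacking two such layers, as the `GraphSAGE` class does, lets each node's embedding depend on its two-hop neighbourhood.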


In [6]:
from dgl.nn import SAGEConv


class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, "mean")
        self.conv2 = SAGEConv(h_feats, h_feats, "mean")

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h

This is a much more complex version of the class above. We could have used it; however, the data that we have been
provided with is extremely clean and does not need a very complex model architecture for training.
```python
class ComplexGraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats, num_layers, num_hidden, num_classes, dropout):
        super(ComplexGraphSAGE, self).__init__()
        self.num_layers = num_layers
        self.convs = nn.ModuleList()
        self.linears = nn.ModuleList()

        # Input layer
        self.convs.append(SAGEConv(in_feats, h_feats, 'mean'))
        self.linears.append(nn.Linear(in_feats, h_feats))
        
        # Hidden layers
        for layer in range(num_layers - 1):
            self.convs.append(SAGEConv(h_feats, h_feats, 'mean'))
            self.linears.append(nn.Linear(h_feats, h_feats))
        
        # Output layer
        self.output_linear = nn.Linear(h_feats, num_classes)
        
        self.dropout = nn.Dropout(dropout)

    def forward(self, g, in_feat):
        h = in_feat
        
        for layer in range(self.num_layers):
            conv = self.convs[layer]
            linear = self.linears[layer]
            
            h_conv = conv(g, h)
            h_linear = linear(h)
            
            h = F.relu(h_conv + h_linear)  # Skip connection
            
            if layer < self.num_layers - 1:
                h = self.dropout(h)
        
        h = self.output_linear(h)
        return h
```

This code snippet creates graph structures for training and testing positive and negative edges.

1. **Training Positive Graph:**
   - Create a DGL graph `train_pos_g` using positive training edges `train_pos_u` and `train_pos_v`.
   - Specify the number of nodes as `g.num_nodes()`.

2. **Training Negative Graph:**
   - Create a DGL graph `train_neg_g` using negative training edges `train_neg_u` and `train_neg_v`.
   - Specify the number of nodes as `g.num_nodes()`.

3. **Testing Positive Graph:**
   - Create a DGL graph `test_pos_g` using positive testing edges `test_pos_u` and `test_pos_v`.
   - Specify the number of nodes as `g.num_nodes()`.

4. **Testing Negative Graph:**
   - Create a DGL graph `test_neg_g` using negative testing edges `test_neg_u` and `test_neg_v`.
   - Specify the number of nodes as `g.num_nodes()`.

This code generates separate graph structures for training and testing positive and negative edges, facilitating the evaluation of the GraphSAGE model.


In [7]:
train_pos_g = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.num_nodes())
train_neg_g = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.num_nodes())

test_pos_g = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.num_nodes())
test_neg_g = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.num_nodes())

We can also use the generic dot-product method that is available, and that would be recommended as it has been
optimised for large workflows. Our dataset is rather small, and I wanted to implement a custom predictor.

1. **MLP Predictor Class:**
   - Define a class `MLPPredictor` that inherits from `nn.Module`.
   - Constructor (`__init__`):
       - Initialize the model with hidden feature dimensions (`h_feats`).
       - Create two linear layers: `W1` and `W2`.
   - `apply_edges` Method:
       - Compute scalar scores for each edge in the graph.
       - Concatenate source and destination node features and pass through `W1` and `W2`.
   - `forward` Method:
       - Perform the forward computation.
       - Set node features in the graph (`g.ndata["h"]`) as input node features (`h`).
       - Apply the `apply_edges` method to calculate edge scores.
       - Return edge scores.

2. **Important Information:**
   - The `apply_edges` method computes a scalar score for each edge using node features.
   - The `forward` method applies the predictor to the graph using node features.
   - This code segment showcases how to create a custom predictor for link prediction tasks in GNNs.
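
The two scoring options discussed above can be compared on a toy example. The sketch below is plain NumPy with random placeholder embeddings and weights (all names are illustrative assumptions); it only shows the shape of the computation, not trained behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)
h_feats = 4
h = rng.normal(size=(5, h_feats))  # embeddings for 5 nodes
src = np.array([0, 1, 3])          # candidate edges src -> dst
dst = np.array([2, 4, 0])

# Dot-product predictor: one score per edge, no extra parameters
dot_scores = (h[src] * h[dst]).sum(axis=1)

# MLP predictor: concatenate the endpoint embeddings, then apply
# two linear layers with a ReLU in between
w1 = rng.normal(size=(2 * h_feats, h_feats))
b1 = np.zeros(h_feats)
w2 = rng.normal(size=(h_feats, 1))
b2 = np.zeros(1)
pair = np.concatenate([h[src], h[dst]], axis=1)
mlp_scores = (np.maximum(pair @ w1 + b1, 0) @ w2 + b2).squeeze(1)

print(dot_scores.shape, mlp_scores.shape)
```

The dot product is symmetric in its two endpoints, while the MLP on concatenated features can score `(u, v)` and `(v, u)` differently, which is one reason to prefer it on a directed graph.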



In [8]:
class MLPPredictor(nn.Module):
    def __init__(self, h_feats):
        super().__init__()
        self.W1 = nn.Linear(h_feats * 2, h_feats)
        self.W2 = nn.Linear(h_feats, 1)

    def apply_edges(self, edges):
        """
        Computes a scalar score for each edge of the given graph.

        Parameters
        ----------
        edges :
            Has three members ``src``, ``dst`` and ``data``, each of
            which is a dictionary representing the features of the
            source nodes, the destination nodes, and the edges
            themselves.

        Returns
        -------
        dict
            A dictionary of new edge features.
        """
        h = torch.cat([edges.src["h"], edges.dst["h"]], 1)
        return {"score": self.W2(F.relu(self.W1(h))).squeeze(1)}

    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            g.apply_edges(self.apply_edges)
            return g.edata["score"]

### Model Initialization

- **Model Initialization (`model`):**
   - Initialize a `GraphSAGE` model (`GraphSAGE`) with input feature dimensions and hidden feature dimensions.
   - The input feature dimensions are determined by `train_g.ndata["feat"].shape[1]`.
   - Hidden feature dimensions are set to 16.

- **Predictor Initialization (`pred`):**
   - Initialize an `MLPPredictor` model (`MLPPredictor`) with 16 hidden feature dimensions.
   - The predictor computes scalar scores for link prediction.

### Loss and AUC Functions

- **Loss Computation (`compute_loss`):**
   - Compute the binary cross-entropy loss between positive and negative scores.
   - Concatenate positive and negative scores and corresponding labels.
   - Utilize `F.binary_cross_entropy_with_logits` to compute the loss.

- **AUC Computation (`compute_auc`):**
   - Compute the Area Under the Curve (AUC) for evaluating model performance.
   - Concatenate positive and negative scores and corresponding labels.
   - Calculate AUC using the `roc_auc_score` function from `sklearn.metrics`.
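
For intuition about the loss, the following NumPy sketch computes binary cross-entropy directly on raw scores (logits) using a numerically stable form. The helper name `bce_with_logits` and the toy scores are assumptions for illustration; in the notebook the same quantity comes from `F.binary_cross_entropy_with_logits`.

```python
import numpy as np


def bce_with_logits(scores, labels):
    """Mean binary cross-entropy on raw scores, in a numerically stable form."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # max(x, 0) - x*y + log(1 + exp(-|x|)) equals
    # -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))]
    per_edge = np.maximum(scores, 0) - scores * labels + np.log1p(np.exp(-np.abs(scores)))
    return per_edge.mean()


pos_score = np.array([2.0, 0.5])   # scores for real edges (label 1)
neg_score = np.array([-1.0, 0.0])  # scores for sampled non-edges (label 0)
scores = np.concatenate([pos_score, neg_score])
labels = np.concatenate([np.ones(2), np.zeros(2)])

loss = bce_with_logits(scores, labels)
print(loss)
```

An untrained predictor that outputs a score of 0 for every edge gives a loss of `log(2)`, about 0.693, which is a useful baseline when reading the training log.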


In [9]:
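from sklearn.metrics import roc_auc_score  # needed by compute_auc below

model = GraphSAGE(train_g.ndata["feat"].shape[1], 16)
# MLPPredictor is used here; a dot-product predictor could be swapped in instead.
pred = MLPPredictor(16)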


def compute_loss(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
    )
    return F.binary_cross_entropy_with_logits(scores, labels)


def compute_auc(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score]).numpy()
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
    ).numpy()
    return roc_auc_score(labels, scores)

We change the type of `feat` from Long (int64) to float32, as that is what our model expects.

In [10]:
# h = model(train_g, train_g.ndata["feat"])
train_g.ndata['feat'] = train_g.ndata['feat'].type(torch.float32)

1. **Optimizer Initialization (`optimizer`):**
   - Initialize an Adam optimizer for both the GNN model and the predictor.
   - Combines the parameters from both using `itertools.chain`.
   - Learning rate is set to 0.01.

2. **Training Loop (`for e in range(500):`):**
   - Perform 500 training epochs.
   - Forward Pass:
       - Execute the GNN model (`model`) on the training graph using input node features.
       - Calculate scores for positive and negative edges using the predictor (`pred`).
       - Compute the loss using the `compute_loss` function.
   - Backward Pass:
       - Reset optimizer gradients.
       - Perform backpropagation to compute gradients.
       - Update model parameters using the optimizer.
   - Print loss every 5 epochs.

3. **AUC Evaluation (`with torch.no_grad():`):**
   - Compute AUC for model evaluation.
   - Calculate scores for positive and negative edges using the predictor.
   - Utilize `compute_auc` function to compute the AUC.

4. **Note:**
   - The training loop optimizes the model using the computed loss.
   - After training, the code evaluates the trained model's performance using AUC.


In [11]:
optimizer = torch.optim.Adam(
    itertools.chain(model.parameters(), pred.parameters()), lr=0.01
)

all_logits = []
for e in range(500):
    # forward
    h = model(train_g, train_g.ndata["feat"])
    pos_score = pred(train_pos_g, h)
    neg_score = pred(train_neg_g, h)
    loss = compute_loss(pos_score, neg_score)

    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if e % 5 == 0:
        print("In epoch {}, loss: {}".format(e, loss))

from sklearn.metrics import roc_auc_score

with torch.no_grad():
    pos_score = pred(test_pos_g, h)
    neg_score = pred(test_neg_g, h)
    print("AUC", compute_auc(pos_score, neg_score))

In epoch 0, loss: 0.8466184139251709
In epoch 5, loss: 0.788202702999115
In epoch 10, loss: 0.8662614226341248
In epoch 15, loss: 0.7980585098266602
In epoch 20, loss: 0.7788756489753723
In epoch 25, loss: 0.6864426732063293
In epoch 30, loss: 0.5877717137336731
In epoch 35, loss: 0.49723267555236816
In epoch 40, loss: 0.41885870695114136
In epoch 45, loss: 0.38718992471694946
In epoch 50, loss: 0.3665114641189575
In epoch 55, loss: 0.34574928879737854
In epoch 60, loss: 0.3321414887905121
In epoch 65, loss: 0.32042211294174194
In epoch 70, loss: 0.31038662791252136
In epoch 75, loss: 0.3018914759159088
In epoch 80, loss: 0.29395124316215515
In epoch 85, loss: 0.28652819991111755
In epoch 90, loss: 0.2790970206260681
In epoch 95, loss: 0.27185624837875366
In epoch 100, loss: 0.26480740308761597
In epoch 105, loss: 0.2581842839717865
In epoch 110, loss: 0.2520672082901001
In epoch 115, loss: 0.2462194859981537
In epoch 120, loss: 0.24323329329490662
In epoch 125, loss: 0.235872820019721

- **Evaluation (`with torch.no_grad():`):**
   - Compute scores for positive and negative test edges using the trained model.
   - Calculate probabilities from scores using sigmoid activation.
   - Concatenate positive and negative probabilities along with true labels.
   - Calculate the AUC-ROC score using the `roc_auc_score` function from `sklearn.metrics`.

In [12]:
with torch.no_grad():
    # Compute scores for positive and negative test edges
    test_h = model(train_g, train_g.ndata["feat"])
    test_pos_score = pred(test_pos_g, test_h)
    test_neg_score = pred(test_neg_g, test_h)

    # Convert scores to probabilities using sigmoid
    test_pos_prob = torch.sigmoid(test_pos_score)
    test_neg_prob = torch.sigmoid(test_neg_score)

    # Combine positive and negative probabilities
    all_probs = torch.cat([test_pos_prob, test_neg_prob])
    true_labels = torch.cat([torch.ones_like(test_pos_prob), torch.zeros_like(test_neg_prob)])

    # Compute AUC-ROC score
    from sklearn.metrics import roc_auc_score
    auc_roc_score = roc_auc_score(true_labels.cpu(), all_probs.cpu())

    print("Test AUC-ROC Score:", auc_roc_score)


Test AUC-ROC Score: 0.9056571252693135


We can see that our AUC-ROC is about 0.906, which is a strong result. This score measures how well the model can distinguish between positive and negative
edges in the graph: it is the probability that a randomly chosen positive edge receives a higher score than a randomly chosen negative edge, which is not the same as being correct 90% of the time. A score at this level is generally what is required
for real-life scenarios.
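
The ranking interpretation of AUC-ROC can be verified on a handful of scores. The helper below is a small, hypothetical pure-Python illustration (`pairwise_auc` is not part of scikit-learn): it counts how many positive/negative pairs are ordered correctly.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.8, 0.4]  # scores of real edges
neg = [0.7, 0.3, 0.2]  # scores of sampled non-edges
auc = pairwise_auc(pos, neg)
print(auc)  # 8 of the 9 pairs are ordered correctly
```

Only the pair `(0.4, 0.7)` is ranked the wrong way round, so the AUC is 8/9 even though no classification threshold was ever chosen.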

Finally, saving the model.

In [13]:
model_save_path = "../model/dgl_model.pt"  # Provide the desired path to save the model
torch.save(model, model_save_path)
print("Model saved at:", model_save_path)

Model saved at: ../model/dgl_model.pt
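
Note that the cell above stores only the GraphSAGE `model`; the trained predictor `pred` is also needed to score edges at inference time. One possible alternative, sketched below under assumptions, is to save the `state_dict` of both modules in a single checkpoint. The two `nn.Linear` modules are hypothetical stand-ins for `model` and `pred`, and the temporary file path is only for the example.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-ins for the trained GraphSAGE model and MLPPredictor
encoder = nn.Linear(4, 2)
predictor = nn.Linear(4, 1)

save_dir = tempfile.mkdtemp()
ckpt_path = os.path.join(save_dir, "checkpoint.pt")

# Save only the learned parameters of both modules in one file
torch.save({"model": encoder.state_dict(), "pred": predictor.state_dict()}, ckpt_path)

# At inference time: rebuild the modules with the same sizes, then load the parameters
restored_encoder = nn.Linear(4, 2)
restored_predictor = nn.Linear(4, 1)
checkpoint = torch.load(ckpt_path)
restored_encoder.load_state_dict(checkpoint["model"])
restored_predictor.load_state_dict(checkpoint["pred"])

# Switch to evaluation mode before running inference
restored_encoder.eval()
restored_predictor.eval()
```

Saving state dictionaries means the class definitions must be available when loading, but it avoids pickling whole module objects.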
