# GraphSL example notebook


## load libraries

We load all methods available in GraphSL: `GCNSI`, `IVGD`, and `SLVAE` from the GNN-based methods, and `LPSI`, `NetSleuth`, and `OJC` from the prescribed methods.
We also load GraphSL utilities for handling the dataset (e.g. downloading it and splitting it into training and test sets).


In [1]:
# load methods
from GraphSL.GNN.SLVAE.main import SLVAE
from GraphSL.GNN.IVGD.main import IVGD
from GraphSL.GNN.GCNSI.main import GCNSI
from GraphSL.Prescribed import LPSI, NetSleuth, OJC
# load utils
from GraphSL.utils import load_dataset, diffusion_generation, split_dataset, download_dataset, visualize_source_prediction
# other imports
import os
import ipywidgets as widgets

## dataset preparation

### Download dataset

In [2]:
w = widgets.Dropdown(
    options=['karate', 'dolphins', 'jazz', 'netscience', 'cora_ml', 'power_grid'],
    value='karate',
    description='Dataset:',
    disabled=False,
)
data_name = w.value

In [3]:
curr_dir = os.getcwd()
print(f"Current working directory is: {curr_dir}\n")

Current working directory is: /nfs/users/junxiang/personal/GraphSL



All datasets are downloaded to the local `data` folder inside the `curr_dir` directory.

In [4]:
download_dataset(curr_dir)

Downloaded cora_ml
Downloaded dolphins
Downloaded jazz
Downloaded karate
Downloaded netscience
Downloaded power_grid


### Select dataset

In [5]:
display(w)

Dropdown(description='Dataset:', options=('karate', 'dolphins', 'jazz', 'netscience', 'cora_ml', 'power_grid')…

In [6]:
data_name = w.value  # re-read the dropdown so the selection made above is used
graph = load_dataset(data_name, data_dir=curr_dir)
# print graph
print(graph)

{'adj_mat': <34x34 sparse matrix of type '<class 'numpy.float32'>'
	with 156 stored elements in Compressed Sparse Row format>}


### pre-processing dataset

Generate diffusion data with the Independent Cascade (IC) model, using an infection probability of `0.3`, `100` simulations, and a source (seed) ratio of `0.2`.

In [7]:
dataset = diffusion_generation(graph=graph, infect_prob=0.3, diff_type='IC', sim_num=100, seed_ratio=0.2)
display(dataset)

{'adj_mat': <34x34 sparse matrix of type '<class 'numpy.float32'>'
 	with 156 stored elements in Compressed Sparse Row format>,
 'diff_mat': tensor([[[0., 1.],
          [1., 1.],
          [1., 1.],
          ...,
          [1., 1.],
          [1., 1.],
          [1., 1.]],
 
         [[1., 1.],
          [1., 1.],
          [1., 1.],
          ...,
          [0., 1.],
          [1., 1.],
          [1., 1.]],
 
         [[1., 1.],
          [0., 1.],
          [0., 1.],
          ...,
          [1., 1.],
          [1., 1.],
          [1., 1.]],
 
         ...,
 
         [[0., 1.],
          [1., 1.],
          [1., 1.],
          ...,
          [1., 1.],
          [0., 1.],
          [1., 1.]],
 
         [[0., 1.],
          [0., 1.],
          [1., 1.],
          ...,
          [1., 1.],
          [0., 1.],
          [1., 1.]],
 
         [[1., 1.],
          [1., 1.],
          [1., 1.],
          ...,
          [1., 1.],
          [0., 1.],
          [0., 1.]]])}
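
For intuition about what the IC model simulates, here is a minimal pure-Python sketch of a single Independent Cascade run on a toy graph. It is not GraphSL's implementation: the function `independent_cascade`, the toy adjacency list `neighbors`, and the way the seeds are sampled are hypothetical, chosen only to mirror the `infect_prob` and `seed_ratio` arguments used above.

```python
import random


def independent_cascade(neighbors, seeds, infect_prob, rng):
    """Run one IC simulation and return the set of infected nodes.

    Each newly infected node gets exactly one chance to infect each of its
    still-uninfected neighbors, succeeding with probability `infect_prob`.
    """
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in neighbors[u]:
                if v not in infected and rng.random() < infect_prob:
                    infected.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return infected


# Hypothetical toy graph as an adjacency list (undirected).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

rng = random.Random(0)
seed_ratio = 0.2
num_seeds = max(1, round(seed_ratio * len(neighbors)))
seeds = rng.sample(sorted(neighbors), num_seeds)

infected = independent_cascade(neighbors, seeds, infect_prob=0.3, rng=rng)
print(f"seeds: {sorted(seeds)}, infected: {sorted(infected)}")
```

Repeating such a run many times with freshly sampled seeds would give one (sources, infected nodes) pair per simulation, which is the kind of data a source localization method is trained on.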

#### split the dataset into training and test sets

In [8]:
adj, train_dataset, test_dataset = split_dataset(dataset)
print(f"Training dataset: {train_dataset}")
print(f"Test dataset: {test_dataset}")

Training dataset: <torch.utils.data.dataset.Subset object at 0x7f68a9f6b260>
Test dataset: <torch.utils.data.dataset.Subset object at 0x7f68a9f6b3e0>
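
The `Subset` objects printed above wrap index lists into the full set of simulations. As a rough illustration of that idea, the sketch below shuffles the simulation indices and cuts them into two disjoint parts. The helper `split_indices`, the `train_ratio` of `0.6`, and the `seed` value are assumptions made for this example, not values taken from `split_dataset`.

```python
import random


def split_indices(num_samples, train_ratio, seed):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_train = round(num_samples * train_ratio)
    return indices[:num_train], indices[num_train:]


# 100 simulations, as generated above; the ratio and seed are hypothetical.
train_idx, test_idx = split_indices(num_samples=100, train_ratio=0.6, seed=0)
print(f"{len(train_idx)} training simulations, {len(test_idx)} test simulations")
```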


## Execute methods

Each method is run in two steps: **training** and **testing**. In the training step, the method's `train` function fits the model on the training set, and hyperparameters (e.g. the threshold) are tuned based on the F1-score. In the test step, the trained model is evaluated on the test set, and its performance is returned as a **Metric** object. The Metric object holds five performance metrics, accessed in the code below as `acc` (accuracy), `pr` (precision), `re` (recall), `f1` (F1-score), and `auc` (area under the ROC curve). For all five, higher values mean better performance. They are defined as follows:

**Accuracy (ACC)**: Accuracy is the ratio of correctly predicted instances to the total number of instances. It is a measure of how often the classifier is correct overall. ACC = 1 means the model is perfect, while ACC = 0 means the model is completely wrong.

Formula: Accuracy = (True Positives + True Negatives)/ Total Number of Instances, where

True Positives (TP): Instances where the model correctly predicted the positive class.

True Negatives (TN): Instances where the model correctly predicted the negative class.

Accuracy is useful when the classes are balanced, but it can be misleading if there is a class imbalance.

**Precision (PR)**: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It indicates how many of the predicted positive instances were actually positive.

Formula: Precision = True Positives/ (True Positives + False Positives), where

False Positives (FP): Instances where the model incorrectly predicted the positive class.

Precision is important in scenarios where the cost of false positives is high.

**Recall (RE)**: Recall, also known as sensitivity or true positive rate, is the ratio of correctly predicted positive observations to all actual positive observations. It measures the model's ability to detect positive instances.

Formula: Recall = True Positives/ (True Positives + False Negatives), where

False Negatives (FN): Instances where the model incorrectly predicted the negative class.

Recall is important in situations where the cost of false negatives is high.

**F1-Score (FS)**: The F1-score is the harmonic mean of precision and recall, providing a balance between the two metrics. It is useful when you need a single metric to evaluate the performance of a model with imbalanced classes.

Formula: F1-Score = 2 × Precision × Recall / (Precision + Recall)

The F1-score takes both false positives and false negatives into account.

It is best used when the class distribution is uneven or when both precision and recall are important.
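
The four formulas above can be checked on a small example. The sketch below counts TP, TN, FP, and FN for binary labels and computes the metrics directly from those counts. The helper `binary_metrics` and the toy vectors are hypothetical; this is not how GraphSL builds its Metric object.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    acc = (tp + tn) / len(y_true)
    # Guard against empty denominators (no predicted or no actual positives).
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return {"acc": acc, "pr": pr, "re": re, "f1": f1}


# Toy example: 1 marks a source node, 0 a non-source node.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(", ".join(f"{name}: {value:.3f}" for name, value in metrics.items()))
# prints: acc: 0.625, pr: 0.500, re: 0.667, f1: 0.571
```

Here TP = 2, TN = 3, FP = 2, and FN = 1, so a classifier that over-predicts sources keeps a moderate recall while its precision drops.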

**Area Under the ROC Curve (AUC)**: The AUC represents the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various threshold settings.

The ROC curve illustrates the trade-off between sensitivity (recall) and specificity (1 - false positive rate).

The AUC is a single scalar value that summarizes the overall performance of the model across all classification thresholds.

AUC values range from 0 to 1, where 1 indicates a perfect model, 0.5 suggests no discriminative power (equivalent to random guessing), and 0 means a completely wrong model.

A higher AUC indicates better model performance.
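
The AUC can equivalently be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise definition on toy data; the function `roc_auc` is a hypothetical helper written for illustration, and it is quadratic in the number of instances, so it only suits small examples.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(f"auc: {roc_auc(y_true, scores):.3f}")
# prints: auc: 0.833
```

Five of the six (positive, negative) pairs are ordered correctly in this example, which gives 5/6.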



### LPSI

#### training
Train LPSI using the training set. This returns the hyperparameter `alpha`, the optimal threshold, the area under the ROC curve, the F1-score, and the source predictions.
The source predictions can be used to adjust the `thres_list` parameter of `lpsi.train`.

In [9]:
lpsi = LPSI()

print("lpsi.train:")
print("========================")
alpha, thres, auc, f1, pred = lpsi.train(adj, train_dataset)
print("========================\n")
print("training done\n")
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

lpsi.train:
alpha = 0.001, train_auc = 0.323
alpha = 0.01, train_auc = 0.323
alpha = 0.1, train_auc = 0.323
thres = 0.090, train_f1 = 0.353
thres = 0.181, train_f1 = 0.353
thres = 0.272, train_f1 = 0.353
thres = 0.363, train_f1 = 0.353
thres = 0.454, train_f1 = 0.353
thres = 0.545, train_f1 = 0.353
thres = 0.636, train_f1 = 0.353
thres = 0.727, train_f1 = 0.353
thres = 0.818, train_f1 = 0.353
thres = 0.909, train_f1 = 0.353

training done

train auc: 0.323, train f1: 0.353
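
For background, LPSI builds on label propagation: infected nodes start with label +1 and uninfected nodes with -1, and the labels are repeatedly mixed with those of their neighbors. The sketch below iterates that update on a toy path graph. It is an illustration under stated assumptions, not GraphSL's code: the function `lpsi_scores`, the toy graph, and the value `alpha = 0.5` are hypothetical, and whether `alpha` here plays exactly the same role as the `alpha` tuned by `lpsi.train` is an assumption.

```python
import math


def lpsi_scores(neighbors, infected, alpha, num_iters=200):
    """Iterate g <- alpha * S g + (1 - alpha) * y on a small graph.

    S is the symmetrically normalized adjacency matrix, with entries
    1 / sqrt(deg(u) * deg(v)) for each edge (u, v).
    """
    nodes = sorted(neighbors)
    degree = {u: len(neighbors[u]) for u in nodes}
    y = {u: 1.0 if u in infected else -1.0 for u in nodes}
    g = dict(y)
    for _ in range(num_iters):
        g = {
            u: alpha * sum(g[v] / math.sqrt(degree[u] * degree[v])
                           for v in neighbors[u])
            + (1 - alpha) * y[u]
            for u in nodes
        }
    return g


# Hypothetical path graph 0 - 1 - 2 - 3 - 4 with nodes 1, 2, 3 infected.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = lpsi_scores(neighbors, infected={1, 2, 3}, alpha=0.5)

for node, score in scores.items():
    print(f"node {node}: score = {score:.3f}")
print(f"highest score: node {max(scores, key=scores.get)}")
```

On this toy graph the middle node ends up with the highest score, which matches the intuition that a source sits at the center of the infected region.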


#### testing

Test LPSI using the test set. This returns the Metric object (accuracy, precision, recall, F1-score, and area under the ROC curve).

In [10]:
metric = lpsi.test(adj, test_dataset, alpha, thres)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

test acc: 0.353, test pr: 0.214, test re: 1.000, test f1: 0.353, test auc: 0.320


LPSI performs poorly on this dataset: recall is 1.000, but accuracy (0.353), precision (0.214), F1-score (0.353), and AUC (0.320) are all low.

### NetSleuth

#### training

Train NetSleuth using the training set. This returns the hyperparameter `k`, the area under the ROC curve, and the F1-score.

In [11]:
netSleuth = NetSleuth()

print("netSleuth.train:")
print("========================")
k, auc, f1 = netSleuth.train(adj, train_dataset)
print("========================\n")
print("training done\n")
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

netSleuth.train:
k = 5, train_auc = 0.583
k = 10, train_auc = 0.684

training done

train auc: 0.684, train f1: 0.448


#### testing

Test NetSleuth using the test set. This returns the Metric object (accuracy, precision, recall, F1-score, and area under the ROC curve).

In [12]:
metric = netSleuth.test(adj, test_dataset, k)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

test acc: 0.741, test pr: 0.360, test re: 0.600, test f1: 0.450, test auc: 0.686


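NetSleuth outperforms LPSI on accuracy (0.741 vs. 0.353), precision (0.360 vs. 0.214), F1-score (0.450 vs. 0.353), and AUC (0.686 vs. 0.320), although its recall is lower (0.600 vs. 1.000).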

### OJC

#### training

Train OJC using the training set. This returns the hyperparameter `Y`, the area under the ROC curve, and the F1-score.

In [13]:
ojc = OJC()

print("ojc.train:")
print("========================")
Y, auc, f1 = ojc.train(adj, train_dataset)
print("========================\n")
print("training done\n")
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

ojc.train:
Y = 5, train_auc = 0.422
Y = 10, train_auc = 0.422

training done

train auc: 0.422, train f1: 0.379


#### testing 

Test OJC using the test set. This returns the Metric object (accuracy, precision, recall, F1-score, and area under the ROC curve).

In [14]:
metric = ojc.test(adj, test_dataset, Y)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

test acc: 0.715, test pr: 0.315, test re: 0.525, test f1: 0.394, test auc: 0.640


### GCNSI

#### training
Train GCNSI using the training set. This returns the GCNSI model, the optimal threshold, the area under the ROC curve, the F1-score, and the source predictions.
The source predictions can be used to adjust the `thres_list` parameter of `gcnsi.train`.

In [15]:
gcnsi = GCNSI()

print("gcnsi.train:")
print("========================")
gcnsi_model, thres, auc, f1, pred = gcnsi.train(adj, train_dataset)
print("========================\n")
print("training done\n")
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

gcnsi.train:
train GCNSI:
Epoch [0/100], loss = 9.217
Epoch [10/100], loss = 0.960
Epoch [20/100], loss = 0.871
Epoch [30/100], loss = 0.802
Epoch [40/100], loss = 0.811
Epoch [50/100], loss = 0.805
Epoch [60/100], loss = 0.778
Epoch [70/100], loss = 0.786
Epoch [80/100], loss = 0.739
Epoch [90/100], loss = 0.740
train_auc = 0.931
thres = 0.251, train_f1 = 0.308
thres = 0.326, train_f1 = 0.429
thres = 0.400, train_f1 = 0.429
thres = 0.475, train_f1 = 0.429
thres = 0.550, train_f1 = 0.545
thres = 0.625, train_f1 = 0.545
thres = 0.700, train_f1 = 0.750
thres = 0.775, train_f1 = 0.750
thres = 0.850, train_f1 = 0.656
thres = 0.925, train_f1 = 0.555

training done

train auc: 0.931, train f1: 0.750
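
The threshold sweep visible in the log above can be sketched as follows: each candidate threshold turns the predicted scores into binary source labels, and the candidate with the best F1-score is kept. The helper names, the toy scores, and the candidate list are hypothetical, and the tie-breaking rule (keep the first best threshold) is an assumption rather than GraphSL's documented behavior.

```python
def f1_score(y_true, y_pred):
    """F1 from binary labels, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, thres_list):
    """Return the (threshold, F1) pair with the highest F1-score."""
    best_thres, best_f1 = None, -1.0
    for thres in thres_list:
        y_pred = [1 if s >= thres else 0 for s in scores]
        f1 = f1_score(y_true, y_pred)
        print(f"thres = {thres:.3f}, f1 = {f1:.3f}")
        if f1 > best_f1:  # strict comparison keeps the first best threshold
            best_thres, best_f1 = thres, f1
    return best_thres, best_f1


# Toy data: true source labels and predicted source scores per node.
y_true = [1, 0, 0, 1, 0, 0]
scores = [0.82, 0.55, 0.10, 0.74, 0.30, 0.61]

thres, f1 = best_threshold(y_true, scores, thres_list=[0.25, 0.50, 0.70, 0.90])
print(f"selected thres = {thres:.2f} with f1 = {f1:.3f}")
```

Inspecting the range of the predicted scores first helps in choosing candidates: thresholds that lie entirely above or below all scores produce identical predictions and therefore identical F1-scores.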


#### visualization
Visualize the predicted sources and the labeled sources, and save the figure to the current directory.

In [16]:
pred = (pred >= thres)
visualize_source_prediction(
    adj, pred[:, 0], train_dataset[0][:, 0].numpy(),
    save_dir=curr_dir, save_name="GCNSI_source_prediction")

Figure saved to /nfs/users/junxiang/personal/GraphSL/GCNSI_source_prediction.png


#### testing

Test GCNSI using the test set. This returns the Metric object (accuracy, precision, recall, F1-score, and area under the ROC curve).

In [17]:
metric = gcnsi.test(adj, test_dataset, gcnsi_model, thres)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

test acc: 0.882, test pr: 0.600, test re: 1.000, test f1: 0.750, test auc: 0.931



### IVGD

#### training

First, train the IVGD diffusion model using the training set; this returns the diffusion model.

Then train IVGD using the training set. This returns the IVGD model, the optimal threshold, the area under the ROC curve, the F1-score, and the source predictions. The source predictions can be used to adjust the `thres_list` parameter of `ivgd.train`.

In [18]:
ivgd = IVGD()

# train diffusion model
print("ivgd.train_diffusion:")
print("========================")
diffusion_model = ivgd.train_diffusion(adj, train_dataset)
print("========================\n")
# train IVGD
print("ivgd.train:")
print("========================")
ivgd_model, thres, auc, f1, pred = ivgd.train(
    adj, train_dataset, diffusion_model)
print("========================\n")
print("training done\n")
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

ivgd.train_diffusion:
train IVGD diffusion model:
Epoch 0: Train loss = 0.6950, Train error = 0.6950, early stopping loss = 0.4830, early stopping error = 0.4830, (1.216 sec)
Epoch 10: Train loss = 0.5045, Train error = 0.5045, early stopping loss = 0.5185, early stopping error = 0.5185, (9.223 sec)
Epoch 20: Train loss = 0.5468, Train error = 0.5468, early stopping loss = 0.5261, early stopping error = 0.5261, (8.049 sec)
Epoch 30: Train loss = 0.5348, Train error = 0.5348, early stopping loss = 0.5230, early stopping error = 0.5230, (8.998 sec)
Epoch 40: Train loss = 0.4855, Train error = 0.4855, early stopping loss = 0.5803, early stopping error = 0.5803, (8.138 sec)
train mean error:0.492
early_stopping mean error:0.531
validation mean error:0.499
run time:42.985 seconds
run time per epoch:0.860 seconds

ivgd.train:
train IVGD:
Epoch [0/100], loss = 1.239
Epoch [10/100], loss = 1.005
Epoch [20/100], loss = 0.947
Epoch [30/100], loss = 0.794
Epoch [40/100], loss = 0.744
Epoch [50/10

#### visualization
Visualize the predicted sources and the labeled sources, and save the figure to the current directory.

In [19]:
pred = (pred >= thres)
visualize_source_prediction(
    adj, pred[:, 0], train_dataset[0][:, 0].numpy(),
    save_dir=curr_dir, save_name="IVGD_source_prediction")

Figure saved to /nfs/users/junxiang/personal/GraphSL/IVGD_source_prediction.png


#### testing

Test IVGD using the test set. This returns the Metric object (accuracy, precision, recall, F1-score, and area under the ROC curve).

In [20]:
metric = ivgd.test(test_dataset, diffusion_model, ivgd_model, thres)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

test acc: 0.818, test pr: 0.491, test re: 0.900, test f1: 0.635, test auc: 0.908


### SLVAE

#### training

Train SLVAE using the training set. This returns the SLVAE model, the latent representations of the training seed vector from the VAE, the optimal threshold, the area under the ROC curve, the F1-score, and the source predictions.
The source predictions can be used to adjust the `thres_list` parameter of `slvae.train`.

In [21]:
slave = SLVAE()

print("slvae.train:")
print("========================")
slvae_model, seed_vae_train, thres, auc, f1, pred = slave.train(
    adj, train_dataset)
print("========================\n")
print("training done\n")
print(f"train auc: {auc:.3f}, train f1: {f1:.3f}")

slvae.train:
train SLVAE:
Epoch [0/100], loss = 0.719
Epoch [10/100], loss = 0.294
Epoch [20/100], loss = 0.224
Epoch [30/100], loss = 0.171
Epoch [40/100], loss = 0.168
Epoch [50/100], loss = 0.346
Epoch [60/100], loss = 0.162
Epoch [70/100], loss = 0.156
Epoch [80/100], loss = 0.190
Epoch [90/100], loss = 0.150
infer seed from training set:
Epoch [0/10], obj = -3.8282
thres = 0.087, train_f1 = 0.593
thres = 0.175, train_f1 = 0.402
thres = 0.263, train_f1 = 0.281
thres = 0.351, train_f1 = 0.191
thres = 0.439, train_f1 = 0.140
thres = 0.527, train_f1 = 0.089
thres = 0.615, train_f1 = 0.065
thres = 0.703, train_f1 = 0.038
thres = 0.791, train_f1 = 0.024
thres = 0.879, train_f1 = 0.014

training done

train auc: 0.937, train f1: 0.593


#### visualization
Visualize the predicted sources and the labeled sources, and save the figure to the current directory.

In [22]:
pred = (pred >= thres)
visualize_source_prediction(
    adj, pred[:, 0], train_dataset[0][:, 0].numpy(),
    save_dir=curr_dir, save_name="SLVAE_source_prediction")

Figure saved to /nfs/users/junxiang/personal/GraphSL/SLVAE_source_prediction.png


#### testing
Test SLVAE using the test set. This returns the Metric object (accuracy, precision, recall, F1-score, and area under the ROC curve).

In [23]:
metric = slave.infer(test_dataset, slvae_model, seed_vae_train, thres)
print(f"test acc: {metric.acc:.3f}, test pr: {metric.pr:.3f}, test re: {metric.re:.3f}, test f1: {metric.f1:.3f}, test auc: {metric.auc:.3f}")

infer seed from test set:
Epoch [0/10], obj = -3.8032
Epoch [1/10], obj = -3.8023
Epoch [2/10], obj = -3.8079
Epoch [3/10], obj = -3.8034
Epoch [4/10], obj = -3.8148
Epoch [5/10], obj = -3.7953
Epoch [6/10], obj = -3.8086
Epoch [7/10], obj = -3.8015
Epoch [8/10], obj = -3.8017
Epoch [9/10], obj = -3.8036
test acc: 0.862, test pr: 0.628, test re: 0.537, test f1: 0.571, test auc: 0.932


On this dataset, the three GNN-based methods (GCNSI, IVGD, and SLVAE) perform markedly better than the three prescribed methods (LPSI, NetSleuth, and OJC): their test AUC ranges from 0.908 to 0.932, compared with 0.320 to 0.686, and their test F1-score ranges from 0.571 to 0.750, compared with 0.353 to 0.450. This may be because GNN-based methods learn rules from the graph topology and the information diffusion automatically, whereas prescribed methods rely on predefined rules, which may be less flexible.
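
To compare the methods side by side, the test metrics printed in the cells above can be collected into one table. The sketch below only copies those reported numbers and sorts them by F1-score; the dictionary layout and variable names are hypothetical and not part of GraphSL.

```python
# Test metrics copied from the test outputs above (karate dataset).
results = {
    "LPSI":      {"acc": 0.353, "pr": 0.214, "re": 1.000, "f1": 0.353, "auc": 0.320},
    "NetSleuth": {"acc": 0.741, "pr": 0.360, "re": 0.600, "f1": 0.450, "auc": 0.686},
    "OJC":       {"acc": 0.715, "pr": 0.315, "re": 0.525, "f1": 0.394, "auc": 0.640},
    "GCNSI":     {"acc": 0.882, "pr": 0.600, "re": 1.000, "f1": 0.750, "auc": 0.931},
    "IVGD":      {"acc": 0.818, "pr": 0.491, "re": 0.900, "f1": 0.635, "auc": 0.908},
    "SLVAE":     {"acc": 0.862, "pr": 0.628, "re": 0.537, "f1": 0.571, "auc": 0.932},
}
gnn_based = {"GCNSI", "IVGD", "SLVAE"}

# Rank methods from best to worst test F1-score.
ranked = sorted(results, key=lambda name: results[name]["f1"], reverse=True)

print(f"{'method':<10} {'family':<11} {'acc':>6} {'pr':>6} {'re':>6} {'f1':>6} {'auc':>6}")
for name in ranked:
    family = "GNN-based" if name in gnn_based else "prescribed"
    m = results[name]
    print(f"{name:<10} {family:<11} {m['acc']:>6.3f} {m['pr']:>6.3f} "
          f"{m['re']:>6.3f} {m['f1']:>6.3f} {m['auc']:>6.3f}")
```

Sorted this way, the three GNN-based methods occupy the top three rows. These numbers come from a single run on one small graph, so the ranking may differ on other datasets or random seeds.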