# SNoRe Evaluation (Node Classification)

This notebook shows how to create an embedding of a network and use it to evaluate SNoRe's performance on the node classification task.

First, we import the packages and methods we need.

In [1]:
import numpy as np
from collections import defaultdict
from six import iteritems
from sklearn.utils import shuffle as skshuffle
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from snore import SNoRe, from_mat_file, TopKRanker

Next, we load the network, the labels, and the multi-label binarizer (we use sklearn's MultiLabelBinarizer(range(num_classes))). Both the network and the labels should be in SciPy's sparse format. We recommend saving the network to a .mat file similar to cora.mat and loading it with our from_mat_file method.

In [2]:
network, labels, mlb = from_mat_file("../data/cora.mat")
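
If your own network is not yet stored this way, the sketch below shows one way to write a sparse adjacency matrix and a sparse label matrix to a .mat file with SciPy. The toy graph and the key names "network" and "group" are assumptions made for illustration; check which keys from_mat_file expects in your SNoRe version before relying on them.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from scipy.io import loadmat, savemat

# Toy undirected graph with 4 nodes and edges 0-1, 1-2, 2-3.
rows = np.array([0, 1, 1, 2, 2, 3])
cols = np.array([1, 0, 2, 1, 3, 2])
adjacency = sparse.csr_matrix((np.ones(6), (rows, cols)), shape=(4, 4))

# One row per node, one column per class; a 1 marks class membership.
label_matrix = sparse.csr_matrix(
    np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float))

# ASSUMPTION: the key names "network" and "group" are illustrative only;
# use whatever keys from_mat_file expects in your SNoRe version.
path = os.path.join(tempfile.mkdtemp(), "toy.mat")
savemat(path, {"network": adjacency, "group": label_matrix})

# Read the file back to confirm both matrices survived the round trip.
loaded = loadmat(path)
print(loaded["network"].shape, loaded["group"].shape)
```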

Next we need to create an embedding. The SNoRe class has 7 parameters:
* fixed_dimension: if True, a fixed number of features is used; otherwise, space equivalent to |N|*dimension is used (SNoRe SDF),
* dimension: number of features (fixed, or space equivalent to |N|*dimension),
* num_walks: number of random walks started from each node,
* max_walk_length: length of the longest random walk,
* inclusion: inclusion threshold. A node needs to appear with a frequency of at least inclusion to be included in the hash representation,
* metric: metric used for the similarity calculation. If fixed_dimension==False, 'cosine', 'HPI', and 'HDI' are valid; otherwise 'cosine', 'HPI', 'HDI', 'euclidean', 'jaccard', 'seuclidean', and 'canberra' can be used,
* num_bins: number of bins used in SNoRe SDF to digitize the values. The values are not digitized if None is chosen.
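
To give a feel for how num_walks, max_walk_length, and inclusion interact, here is a minimal sketch of frequency-thresholded random walks on a toy graph. It is not SNoRe's implementation: the function name, the adjacency-list input, and the uniform fixed-length walk scheme are all assumptions made for illustration.

```python
import random
from collections import Counter


def sampled_neighbourhood(adjacency, start, num_walks, max_walk_length,
                          inclusion, seed=0):
    """Return {node: visit frequency} for nodes visited often enough.

    ASSUMPTION: every walk takes up to max_walk_length uniform random
    steps; SNoRe's own sampling scheme may differ.
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(max_walk_length):
            neighbours = adjacency[node]
            if not neighbours:
                break  # dead end: stop this walk early
            node = rng.choice(neighbours)
            visits[node] += 1
    total = sum(visits.values())
    if total == 0:
        return {}
    # Keep only nodes whose relative visit frequency reaches the threshold.
    return {n: c / total for n, c in visits.items() if c / total >= inclusion}


# Hypothetical toy graph as an adjacency list: path 0-1-2-3 plus edge 1-3.
toy_graph = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
neighbourhood = sampled_neighbourhood(toy_graph, start=0, num_walks=1024,
                                      max_walk_length=5, inclusion=0.005)
print(neighbourhood)
```

Raising inclusion prunes rarely visited nodes, which makes each node's representation sparser.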


In [3]:
# Parameters
fixed_dimension = False
dimension = 256
num_walks = 1024
max_walk_length = 5
inclusion = 0.005
metric = "cosine"
num_bins = 256

# Embedding creation
model = SNoRe(dimension=dimension, num_walks=num_walks, max_walk_length=max_walk_length, inclusion=inclusion,
              fixed_dimension=fixed_dimension, metric=metric, num_bins=num_bins)
emb = model.embed(network)
print("Embedding with shape {} created.".format(emb.shape))

05-Nov-20 19:44:48 - Generating and hashing random walks
05-Nov-20 19:44:54 - Generating similarity matrix
05-Nov-20 19:44:59 - Embedding done
Embedding with shape (2708, 2708) created.


To evaluate the embedding we created, we will use different percentages of randomly shuffled training data together with a logistic regression learner.

You can customize this evaluation by changing the parameters num_shuffles (number of repetitions for each training percentage), all_percentages (if True, percentages from 10 to 90 are used in increments of 10; otherwise the percentages in data_perc are used), and data_perc (training set percentages used when all_percentages is False).

In [4]:
# Parameters
num_shuffles = 10
all_percentages = False
data_perc = [0.5]

# Evaluation
all_results = defaultdict(list)
shuffles = []
for x in range(num_shuffles):
    shuffles.append(skshuffle(emb, labels))
if all_percentages:
    training_percents = np.asarray(range(1, 10)) * .1
else:
    training_percents = data_perc

for train_percent in training_percents:
    for shuf in shuffles:
        X, y = shuf

        training_size = int(train_percent * X.shape[0])

        X_train = X[:training_size, :]
        y_train_ = y[:training_size]
        y_train = [list() for x in range(y_train_.shape[0])]

        cy = y_train_.tocoo()
        for i, j in zip(cy.row, cy.col):
            y_train[i].append(j)

        assert sum(len(l) for l in y_train) == y_train_.nnz

        X_test = X[training_size:, :]
        y_test_ = y[training_size:]

        y_test = [[] for _ in range(y_test_.shape[0])]

        cy = y_test_.tocoo()
        for i, j in zip(cy.row, cy.col):
            y_test[i].append(j)

        clf = TopKRanker(LogisticRegression())
        clf.fit(X_train, y_train_)

        # find out how many labels should be predicted
        top_k_list = [len(l) for l in y_test]
        preds = clf.predict(X_test, top_k_list)

        results = {}
        averages = ["micro", "macro"]
        for average in averages:
            results[average] = f1_score(mlb.fit_transform(y_test),
                                        mlb.fit_transform(preds),
                                        average=average)

        # Store one result dict per shuffle (outside the loop over averages)
        all_results[train_percent].append(results)
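
The evaluation above tells the classifier how many labels to predict for each test node (top_k_list). As a rough illustration of that idea, the sketch below picks the k most probable classes per node from a probability matrix. It assumes TopKRanker behaves like the top-k rankers commonly used in node-embedding evaluations; the function name and the probabilities are hypothetical.

```python
import numpy as np


def top_k_labels(probabilities, top_k_list):
    """For each row, return the indices of its k most probable labels."""
    predictions = []
    for row, k in zip(probabilities, top_k_list):
        # argsort is ascending, so the last k indices are the k largest.
        best = np.argsort(row)[-k:] if k > 0 else []
        predictions.append(sorted(int(i) for i in best))
    return predictions


# Hypothetical class probabilities for three nodes and four classes.
probs = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.40, 0.05, 0.45, 0.10],
    [0.25, 0.15, 0.05, 0.55],
])
print(top_k_labels(probs, top_k_list=[1, 2, 1]))  # [[1], [0, 2], [3]]
```

Because k is taken from the true label count, the scores measure how well the classifier ranks labels, not how well it guesses how many labels a node has.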

Lastly, let's print out the results we got from the evaluation.

In [5]:
print('Results of SNoRe using embeddings of dimensionality', emb.shape[1])
print('-------------------')
for train_percent in sorted(all_results.keys()):
    print('Train percent:', train_percent)
    avg_score = defaultdict(float)
    for score_dict in all_results[train_percent]:
        for metric, score in iteritems(score_dict):
            avg_score[metric] += score
    for metric in avg_score:
        avg_score[metric] /= len(all_results[train_percent])
    print('Average score:', dict(avg_score))
    print('-------------------')

Results of SNoRe using embeddings of dimensionality 2708
-------------------
Train percent: 0.5
Average score: {'micro': 0.8469719350073854, 'macro': 0.8376325651518943}
-------------------
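
The two averages reported above weight classes differently: micro F1 pools the counts over all classes, while macro F1 averages the per-class scores, so rare classes count as much as frequent ones. The sketch below computes both by hand on made-up labels (not taken from cora) to show how they can diverge; the helper names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(true_labels, predicted_labels, num_classes):
    """Compute micro and macro F1 for lists of per-node label lists."""
    counts = [[0, 0, 0] for _ in range(num_classes)]  # tp, fp, fn per class
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        for c in range(num_classes):
            if c in truth and c in pred:
                counts[c][0] += 1
            elif c in pred:
                counts[c][1] += 1
            elif c in truth:
                counts[c][2] += 1
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(*(sum(col) for col in zip(*counts)))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / num_classes
    return micro, macro


# Hypothetical labels: class 0 is frequent, class 1 is rare.
y_true = [[0], [0], [0], [0], [1]]
y_pred = [[0], [0], [0], [0], [0]]
micro, macro = micro_macro_f1(y_true, y_pred, num_classes=2)
print(round(micro, 3), round(macro, 3))  # 0.8 0.444
```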
