# Assignment

In this assignment, we will use the Cora citation network. Each node represents a paper, and each edge from node $i$ to node $j$ represents a citation from paper $i$ to paper $j$. Each paper is assigned a field code, stored in the `field` column of the node table.
We will ignore the edge directionality and apply a graph embedding to the undirected network.

In [None]:
import pandas as pd
import numpy as np
from scipy import sparse

node_table = pd.read_csv(
    "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/cora/node_table.csv"
)
edge_table = pd.read_csv(
    "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/cora/edge_table.csv",
    dtype={"src": np.int32, "trg": np.int32},
)
src, trg = tuple(edge_table[["src", "trg"]].values.T)

rows, cols = src, trg
nrows, ncols = node_table.shape[0], node_table.shape[0]
A = sparse.csr_matrix(
    (np.ones_like(rows), (rows, cols)),
    shape=(nrows, ncols),
).astype(np.float64)  # cast integer weights to float; asfptype() is deprecated in newer SciPy

# Symmetrize and binarize
A = A + A.T
A.data = A.data * 0 + 1
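
To see why the binarization step follows the symmetrization, here is a minimal sketch on a hypothetical three-node edge list (the `*_toy` names are illustrative and not part of the assignment data). When two papers cite each other, `A + A.T` stores a weight of 2 for that pair, so the stored values are reset to 1 afterwards.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy edge list: 0 -> 1 and 1 -> 0 (a reciprocal pair), plus 1 -> 2
src_toy = np.array([0, 1, 1])
trg_toy = np.array([1, 0, 2])

A_toy = sparse.csr_matrix(
    (np.ones(len(src_toy)), (src_toy, trg_toy)), shape=(3, 3)
)

# Symmetrize: the reciprocal pair (0, 1) ends up with weight 2
A_sym = A_toy + A_toy.T

# Binarize: reset every stored entry to 1
A_bin = A_sym.copy()
A_bin.data = np.ones_like(A_bin.data)

print(A_sym.toarray())
print(A_bin.toarray())
```

The second matrix is symmetric with entries in {0, 1}, which is the form the rest of the assignment assumes.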

---
**Question 1: Implement the node2vec algorithm.**

First, let's prepare a function to generate node sequences with random walks on networks. Since simulating random walks can be a considerable bottleneck, we prepare the function for you to use. Notice that the function takes `indptr` and `indices` of the CSR representation of the adjacency matrix instead of the adjacency matrix itself.

In [None]:
from numba import njit


@njit(cache=True, nogil=True)
def random_walk(indices, indptr, start_node_id, walk_length):
    """Random walk on a graph.

    Parameters
    ----------
    indices : numpy.ndarray
        CSR matrix indices.
    indptr : numpy.ndarray
        CSR matrix indptr.
    start_node_id : int
        Id of the starting node.
    walk_length : int
        Length of the walk.

    Returns
    -------
    visited_nodes : list
        List of visited nodes.
    """
    visited_nodes = [start_node_id]
    current_node_id = start_node_id

    for _ in range(walk_length):
        # Get the neighbors of the current node.
        # In CSR format, the neighbors of node i are indices[indptr[i] : indptr[i + 1]]
        neighbors = indices[indptr[current_node_id] : indptr[current_node_id + 1]]

        if len(neighbors) == 0:
            break
        current_node_id = np.random.choice(neighbors)
        visited_nodes.append(current_node_id)

    return visited_nodes
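
The CSR lookup used inside `random_walk` can be tried on its own. The sketch below builds a hypothetical four-node graph (a star centered on node 1) and reads neighbors directly from `indptr` and `indices`; the helper name `neighbors_of` is an assumption made for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy graph: undirected edges 0-1, 1-2, 1-3
edges = np.array([[0, 1], [1, 2], [1, 3]])
n_nodes = 4
A_toy = sparse.csr_matrix(
    (np.ones(len(edges)), (edges[:, 0], edges[:, 1])), shape=(n_nodes, n_nodes)
)
A_toy = sparse.csr_matrix(A_toy + A_toy.T)
A_toy.sort_indices()  # make the neighbor order deterministic


def neighbors_of(indices, indptr, node_id):
    """Return the neighbors of `node_id` from the CSR arrays."""
    return indices[indptr[node_id] : indptr[node_id + 1]]


print(neighbors_of(A_toy.indices, A_toy.indptr, 1))  # neighbors of the hub
print(np.diff(A_toy.indptr))  # number of stored neighbors per node
```

Because each row's neighbors occupy a contiguous slice of `indices`, the lookup involves no search, which is one reason the random walk function works on these two arrays rather than on the matrix object.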

Now, let's implement the node2vec algorithm. Here is a template for the class; fill in the `embed` method.

In [None]:
import gensim


class SimpleNode2Vec:
    def __init__(self, window_length=5, n_walkers=10, walk_length=40):
        """Node2Vec class

        Parameters
        ----------
        window_length : int
          Context window size, passed as the `window` parameter of word2vec
        n_walkers : int
          Number of random walkers started from each node
        walk_length : int
          Number of steps in each random walk
        """
        # Set this as the `window` parameter of word2vec in the `embed()` function
        self.window_length = window_length

        # These are the parameters for random walks
        self.n_walkers = n_walkers  # Number of walkers per node
        self.walk_length = walk_length  # Number of steps per walk

    def embed(self, A, dim):
        """Embed nodes in a graph

        Parameters:
        -----------
        A : scipy sparse matrix
          Adjacency matrix
        dim : int
          Dimension of embeddings

        Returns:
        --------
        emb : numpy array (n_nodes, dim)
          Embeddings
        """

        # Your code to generate embeddings

        return emb
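
To build intuition for what `window_length` controls, the sketch below enumerates the (center, context) node pairs that a fixed window extracts from a single walk. It is a plain-Python illustration of the skip-gram windowing idea under simplifying assumptions, not gensim's implementation, which may additionally shrink windows or subsample tokens. The function name `context_pairs` is hypothetical.

```python
def context_pairs(walk, window_length):
    """Enumerate (center, context) pairs within a fixed window of a walk."""
    pairs = []
    for pos, center in enumerate(walk):
        # Window boundaries, clipped at both ends of the walk
        start = max(0, pos - window_length)
        end = min(len(walk), pos + window_length + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                pairs.append((center, walk[ctx_pos]))
    return pairs


walk = [0, 1, 2, 3]  # a hypothetical walk visiting four nodes
pairs = context_pairs(walk, window_length=1)
print(pairs)
# → [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```

A larger window pairs each node with nodes further along the walk, so the embedding reflects longer-range proximity in the network.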

In [None]:
import networkx as nx
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


# Test
def test_node2vec():
    # Test with karate club
    G = nx.karate_club_graph()
    A = nx.adjacency_matrix(G)
    labels = np.unique([d[1]["club"] for d in G.nodes(data=True)], return_inverse=True)[
        1
    ]

    # Embedding
    n2v = SimpleNode2Vec()
    emb = n2v.embed(A, dim=64)

    clf = LinearDiscriminantAnalysis(n_components=1)
    clf.fit(emb, labels)
    assert emb.shape == (A.shape[0], 64)
    assert clf.score(emb, labels) > 0.8


test_node2vec()

**Question 2: Visualize the 64-dimensional embedding of the Cora network using UMAP.**

First, let's generate the embedding of the Cora network.

In [None]:
emb = SimpleNode2Vec().embed(A, dim=64)

Then, write code to project the embedding into two dimensions with UMAP. You can use any UMAP parameters; a suggested set is `n_components=2`, `random_state=42`, `n_neighbors=30`, `min_dist=0.8`, and `metric="cosine"`.

In [None]:
import umap
import matplotlib.pyplot as plt
import seaborn as sns

# --- Your code here ---
x = ...
y = ...
# --- End of your code ---

sns.set_style("white")
sns.set(font_scale=1.2)
sns.set_style("ticks")
fig, ax = plt.subplots(figsize=(7, 5))


labels = node_table["field"].values
ax = sns.scatterplot(x=x, y=y, hue=labels, ax=ax)

ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
ax.axis("off")

The color of each point represents the field of the paper.

---
**Question 3**


**Preparation:**
Consider the task of classifying papers into fields based on the citation network structure. Our classifier takes the graph embedding of a paper and predicts its field. You are given the field labels for 80% of the papers, and the task is to classify the remaining 20%.

First, we will reserve 80% of the data for training and the remaining 20% for evaluating the performance.

In [None]:
# Split the node table into the train and test set.
df = node_table.sample(frac=1, random_state=0)
train_node_table = df.iloc[: int(len(df) * 0.8)]
test_node_table = df.iloc[int(len(df) * 0.8) :]
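
A split is only useful for evaluation if the training and test rows are disjoint. The following sketch checks this on a hypothetical ten-row table (the column values are made up for illustration): both slices are cut at the same index, so every row lands in exactly one of the two sets.

```python
import pandas as pd

# Hypothetical toy node table with ten papers
toy = pd.DataFrame({"node_id": range(10), "field": list("aabbccddee")})

# Shuffle, then cut both slices at the same position
shuffled = toy.sample(frac=1, random_state=0)
n_train = int(len(shuffled) * 0.8)
train = shuffled.iloc[:n_train]
test = shuffled.iloc[n_train:]

# No node should appear in both sets
overlap = set(train["node_id"]) & set(test["node_id"])
print(len(train), len(test), len(overlap))
# → 8 2 0
```

If the overlap were non-empty, the classifier would be scored partly on papers it was trained on, and the reported accuracy would be inflated.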

We will evaluate the classification performance by the accuracy:

In [None]:
def eval_prediction_accuracy(y, ypred):
    """Calculate prediction accuracy.

    Parameters
    ----------
    y : numpy.ndarray
      True labels.
    ypred : numpy.ndarray
      Predicted labels.

    Returns
    -------
    acc : float
      Prediction accuracy.
    """
    return np.sum(y == ypred) / len(y)
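
As a small worked example of this metric, the sketch below computes the accuracy for four hypothetical papers, three of which are predicted correctly. The label strings and the name `accuracy_sketch` are made up for illustration; the function is a standalone equivalent of the fraction of matching labels.

```python
import numpy as np


def accuracy_sketch(y, ypred):
    """Fraction of positions where the predicted label equals the true label."""
    return np.mean(np.asarray(y) == np.asarray(ypred))


# Hypothetical labels for four papers
y_true = np.array(["ml", "db", "ml", "ai"])
y_hat = np.array(["ml", "db", "ai", "ai"])

acc = accuracy_sketch(y_true, y_hat)
print(acc)
# → 0.75
```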

We will use the Support Vector Machine implemented in scikit-learn.

In [None]:
from sklearn.svm import SVC

Here is how we test the accuracy of the model:

In [None]:
# Train
clf = SVC()
clf.fit(emb[train_node_table["node_id"].values], train_node_table["field"].values)

# Predict
ypred = clf.predict(emb[test_node_table["node_id"].values])

# Evaluation
accuracy = eval_prediction_accuracy(test_node_table["field"].values, ypred)

print(f"Accuracy: {accuracy:.3f}")

**Question: Draw a line plot of the accuracy as a function of the embedding dimension $K$. Test $K=2,4,8,16,32,64,128,256$.**

You should see that the accuracy stays relatively high even as the number of dimensions increases. This clearly differs from the behavior of the spectral embedding.