**$GEBE^p$**

This paper aims to use matrix factorization to speed up the computation of PMF (a general form of PPR).

The general formula is $H[u_i, u_j]=∑_{l=0}^τω(l)(WW^T)^l[u_i, u_j]$. Three popular choices of $ω(l)$ are the uniform distribution, used in uniform high-order proximity: $H_τ=∑_{l=0}^τ\frac{1}{τ}(WW^T)^l$; the geometric distribution, used in PPR: $H_α=∑_{l=0}^τα(1-α)^l(WW^T)^l$; and the Poisson distribution, used in HKPR: $H_λ=∑_{l=0}^τ\frac{e^{-λ}λ^l}{l!}(WW^T)^l$.
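As a small illustration of the three weightings, the sketch below evaluates the truncated sum on a toy matrix. The matrix `W`, the helper name `truncated_proximity`, and the values of `tau`, `alpha` and `lam` are made up for the example and are not taken from the paper.

```python
import math
import numpy as np

def truncated_proximity(W, weight, tau):
    """Return sum_{l=0}^{tau} weight(l) * (W W^T)^l for a dense matrix W."""
    M = W @ W.T
    H = np.zeros_like(M)
    power = np.eye(M.shape[0])  # (W W^T)^0
    for l in range(tau + 1):
        H += weight(l) * power
        power = power @ M       # advance to (W W^T)^(l+1)
    return H

# Toy normalized biadjacency matrix (3 u-nodes, 2 v-nodes); values are illustrative.
W = np.array([[0.6, 0.0],
              [0.3, 0.4],
              [0.0, 0.5]])
tau, alpha, lam = 5, 0.5, 1.0

H_uniform = truncated_proximity(W, lambda l: 1.0 / tau, tau)
H_geometric = truncated_proximity(W, lambda l: alpha * (1 - alpha) ** l, tau)
H_poisson = truncated_proximity(
    W, lambda l: math.exp(-lam) * lam ** l / math.factorial(l), tau)

print(H_poisson.round(4))
```

Only the weight function changes between the three variants; the powers of $WW^T$ are shared.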

However, in their experiments the authors found that the Poisson distribution always performed better than the other two, so they decided to focus on it and speed it up.

For any real square matrix $M$, the matrix exponential is defined as $e^M = ∑_{k=0}^∞\frac{M^k}{k!}$. Letting $M=λWW^T$, we have $\frac{e^{λWW^T}}{e^λ}=∑_{l=0}^∞\frac{e^{-λ}λ^l}{l!}(WW^T)^l=H_λ$, which is exactly the Poisson-weighted formula used in HKPR, with the truncation $τ$ taken to infinity.
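The identity can be checked numerically. The sketch below compares the truncated Poisson-weighted series with $e^{λWW^T}/e^λ$, where the exponential of the symmetric matrix $WW^T$ is computed from its eigendecomposition. The random toy matrix and the helper name `poisson_series` are assumptions made for the example.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((4, 3)) * 0.5   # toy matrix, not real data
lam = 1.0
M = W @ W.T                    # symmetric, so eigh applies

# Reference: e^{lam*M} / e^{lam} via the eigendecomposition M = Q diag(evals) Q^T.
evals, Q = np.linalg.eigh(M)
H_exact = math.exp(-lam) * (Q * np.exp(lam * evals)) @ Q.T

def poisson_series(M, lam, tau):
    """Truncated series sum_{l=0}^{tau} e^{-lam} lam^l / l! * M^l."""
    H = np.zeros_like(M)
    term = np.eye(M.shape[0]) * math.exp(-lam)  # l = 0 term
    for l in range(tau + 1):
        H += term
        term = term @ M * lam / (l + 1)         # turn term l into term l+1
    return H

for tau in (2, 5, 20):
    err = np.abs(poisson_series(M, lam, tau) - H_exact).max()
    print(tau, err)
```

The printed error shrinks as `tau` grows, which is the sense in which the truncated sum approaches the matrix exponential.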

Writing the SVD of $W$ as $W=ΦΣΨ^T$, we get $e^λH_λ=e^{λWW^T}=e^{λ(ΦΣΨ^T)(ΦΣΨ^T)^T}=e^{λΦΣ^2Φ^T}=Φe^{λΣ^2}Φ^T$, where $Φ$ and $Ψ$ are orthogonal, i.e. $Φ^TΦ=ΦΦ^T=I$ and $Ψ^TΨ=I$.

It follows that $H_λΦ[·,i]=e^{-λ}Φe^{λΣ^2}Φ^TΦ[·,i]=e^{-λ}e^{λΣ[i,i]^2}Φ[·,i]$, so $e^{-λ}e^{λΣ[i,i]^2}$ is an eigenvalue of $H_λ$ and $Φ[·,i]$ is the corresponding eigenvector.
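A quick numerical check of this eigenpair relation, assuming a small random matrix in place of real data. A full SVD is used so that $Φ$ is square and orthogonal; singular values beyond the rank are padded with zeros, which gives those directions the eigenvalue $e^{-λ}$.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((4, 3)) * 0.5        # toy matrix, not real data
lam = 1.0

# Full SVD so that Phi is square and orthogonal.
Phi, sigma, PsiT = np.linalg.svd(W, full_matrices=True)
sigma_full = np.zeros(Phi.shape[0])
sigma_full[:sigma.size] = sigma     # pad missing singular values with zeros

# H_lambda = e^{-lam} * Phi e^{lam * Sigma^2} Phi^T
eigvals = math.exp(-lam) * np.exp(lam * sigma_full ** 2)
H = (Phi * eigvals) @ Phi.T

# Each column of Phi is an eigenvector of H with the matching eigenvalue.
for i in range(Phi.shape[1]):
    assert np.allclose(H @ Phi[:, i], eigvals[i] * Phi[:, i])

# Independent check against a long truncated series from the definition.
M = W @ W.T
H_series = sum(math.exp(-lam) * lam ** l / math.factorial(l)
               * np.linalg.matrix_power(M, l) for l in range(30))
print(np.abs(H - H_series).max())
```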

Hence a single SVD and a few matrix multiplications are enough to obtain the proximity matrix $H_λ$.
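The SVD shortcut can be turned into a rank-$k$ embedding by scaling the top-$k$ left singular vectors by the square roots of the eigenvalues of $H_λ$, so that $UU^T$ approximates $H_λ$. The following is a minimal sketch under that assumption; the helper name `hkpr_embedding` and the toy matrix are hypothetical, and this is not claimed to be the authors' exact procedure.

```python
import math
import numpy as np

def hkpr_embedding(W, k, lam):
    """Hypothetical helper: rank-k factor U with U @ U.T approximating H_lambda."""
    Phi, sigma, _ = np.linalg.svd(W, full_matrices=False)
    # Square roots of the top-k eigenvalues e^{-lam} * e^{lam * sigma_i^2}.
    scale = math.exp(-lam / 2) * np.exp(lam * sigma[:k] ** 2 / 2)
    return Phi[:, :k] * scale       # scales column i by scale[i]

rng = np.random.default_rng(2)
W = rng.random((6, 4)) * 0.5        # toy matrix, not real data
lam, k = 1.0, 2
U = hkpr_embedding(W, k, lam)

# Reference H_lambda from a long truncated series.
M = W @ W.T
H = sum(math.exp(-lam) * lam ** l / math.factorial(l)
        * np.linalg.matrix_power(M, l) for l in range(30))

# The largest eigenvalue of the residual is the (k+1)-th eigenvalue of H_lambda.
residual = np.linalg.eigvalsh(H - U @ U.T).max()
sigma = np.linalg.svd(W, compute_uv=False)
expected = math.exp(-lam) * math.exp(lam * sigma[k] ** 2)
print(residual, expected)
```

Because the eigenvalues of $H_λ$ grow with the singular values of $W$, keeping the largest singular values keeps the dominant part of $H_λ$.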

In [None]:
!git clone https://github.com/clhchtcjj/BiNE.git

fatal: destination path 'BiNE' already exists and is not an empty directory.


In [None]:
import scipy.sparse as sp
import numpy as np
import torch
import math
import random
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.linear_model import LogisticRegression

In [None]:
def load_data(path):
  """Read whitespace-separated `node1 node2 weight` lines into a set of edges."""
  edges = set()
  with open(path, 'r') as f:
    for line in f:
      temp = line.strip().split()
      # (u-node id, v-node id, edge weight)
      edges.add((temp[0], temp[1], float(temp[2])))
  return edges

In [None]:
trainfile = 'BiNE/data/wiki/rating_train.dat'
testfile = 'BiNE/data/wiki/rating_test.dat'

#trainfile = 'BiNE/data/dblp/rating_train.dat'
#testfile = 'BiNE/data/dblp/rating_test.dat'

trainedges = load_data(trainfile)
testedges = load_data(testfile)

unode = {}
ucount = 0
vnode = {}
vcount = 0


posedges = set()
for edges in trainedges:
  if edges[0] not in unode:
    unode[edges[0]] = ucount
    ucount += 1
  if edges[1] not in vnode:
    vnode[edges[1]] = vcount
    vcount += 1
  posedges.add((unode[edges[0]], vnode[edges[1]]))

for edges in testedges:
  if edges[0] not in unode:
    unode[edges[0]] = ucount
    ucount += 1
  if edges[1] not in vnode:
    vnode[edges[1]] = vcount
    vcount += 1
  posedges.add((unode[edges[0]], vnode[edges[1]]))


rows = []
cols = []
weights = []
for edges in trainedges:
  rows.append(unode[edges[0]])
  cols.append(vnode[edges[1]])
  weights.append(edges[2])

rows = np.array(rows)
cols = np.array(cols)
weights = np.array(weights)

dim = 128
lamb = 1

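# Dense biadjacency matrix with L2-normalized columns.
adj = sp.csr_matrix((weights, (rows, cols)), shape=(len(unode), len(vnode))).toarray()
adj = torch.tensor(adj)
adj = torch.nn.functional.normalize(adj, dim=0)

# torch.svd returns adj = U @ diag(S) @ V.T; keep the top `dim` components.
U, S, V = torch.svd(adj)
S = torch.nn.functional.normalize(S, dim=0)
U = U[:, :dim]
S = S[:dim]
V = V[:, :dim]


# Square roots of the eigenvalues e^{-lamb} * e^{lamb * S_i^2}, one per column of U.
L = math.exp(-lamb / 2) * torch.exp(lamb * S ** 2 / 2)
Z = U

U = Z * L
V = adj.T @ U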


negedges = set()
while len(negedges) < len(testedges) + len(posedges):
  node1 = random.randint(0, len(unode) - 1)
  node2 = random.randint(0, len(vnode) - 1)
  if (node1, node2) not in posedges:
    negedges.add((node1, node2))

negedges = list(negedges)
trainnegedges = negedges[:len(posedges)]
testnegedges = negedges[len(posedges):]


trainx = []
trainy = []

testx = []
testy = []

for edge in trainedges:
  trainx.append(torch.cat([U[unode[edge[0]],:], V[vnode[edge[1]],:]], dim=0).numpy().tolist())
  trainy.append(1)

for edge in trainnegedges:
  trainx.append(torch.cat([U[edge[0],:], V[edge[1],:]], dim=0).numpy().tolist())
  trainy.append(0)

for edge in testedges:
  testx.append(torch.cat([U[unode[edge[0]],:], V[vnode[edge[1]],:]], dim=0).numpy().tolist())
  testy.append(1)

for edge in testnegedges:
  testx.append(torch.cat([U[edge[0],:], V[edge[1],:]], dim=0).numpy().tolist())
  testy.append(0)

trainx = np.array(trainx)
trainy = np.array(trainy)
testx = np.array(testx)
testy = np.array(testy)


lr = LogisticRegression(max_iter=10000)
lr.fit(trainx,trainy)
y_pred = lr.predict(testx)
auc_lr = roc_auc_score(testy,y_pred)
ap_lr = average_precision_score(testy,y_pred)
print("AUC: {}, AP: {}".format(auc_lr, ap_lr))

AUC: 0.8635660865813319, AP: 0.8483303401903686
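AUC and AP are ranking metrics, so they can also be computed from continuous scores such as `predict_proba` rather than from hard 0/1 labels. The sketch below shows both variants on synthetic data; the data, the variable names, and the split are assumptions for illustration and are unrelated to the wiki dataset or the numbers above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic two-class data standing in for concatenated [U, V] edge features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

labels = clf.predict(X_test)              # hard 0/1 predictions
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

print("AUC (labels):", roc_auc_score(y_test, labels))
print("AUC (scores):", roc_auc_score(y_test, scores))
print("AP  (scores):", average_precision_score(y_test, scores))
```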
