### Preliminaries
To normalize a vector so that it has unit L1-norm or unit L2-norm, use:

In [1]:
from __future__ import print_function, division
import numpy as np

pr = np.array([1,2,3])
print("L1-normalized {0} is {1}".format(pr, pr / np.linalg.norm(pr,1)))
print("L2-normalized {0} is {1}".format(pr, pr / np.linalg.norm(pr,2)))

L1-normalized [1 2 3] is [0.16666667 0.33333333 0.5       ]
L2-normalized [1 2 3] is [0.26726124 0.53452248 0.80178373]


# Exercise 3: Link based ranking
## Question 1 - Page Rank (Eigen-vector method)
Consider a tiny Web with three pages A, B and C with no inlinks,
and with initial PageRank = 1. Initially, none of the pages link to
any other pages and none link to them. 
Answer the following questions, and calculate the PageRank for
each question.

1. Link page A to page B.
2. Link all pages to each other.
3. Link page A to both B and C, and link pages B and C to A.
4. Use the previous links and add a link from page C to page B.

Hints: 
+ We are using the theoretical PageRank computation (without source of rank). See the slide "Transition Matrix for Random Walker" in the lecture notes. **Columns of the link matrix are from-vertices, rows of the link matrix are to-vertices**. We take the eigenvector with the largest eigenvalue.
+ We only care about the final ranking given by the probability vector, so you may normalize it however you like (or not at all).
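
The column convention in the first hint can be illustrated with a short vectorized sketch. The function name `transition_matrix` is an assumption made for this example and is not part of the exercise; it divides every column of the link matrix by its out-degree and leaves the columns of pages without outgoing links at zero.

```python
import numpy as np

def transition_matrix(L):
    """Column-normalize a link matrix (columns = from-vertex, rows = to-vertex)."""
    L = np.asarray(L, dtype=float)
    out_degree = L.sum(axis=0)           # number of outgoing links per page
    R = np.zeros_like(L)
    # Divide each column by its out-degree; columns of pages without
    # outgoing links (out-degree 0) keep their zeros.
    np.divide(L, out_degree, out=R, where=out_degree != 0)
    return R

# Links: A -> B, A -> C, B -> A, C -> A, C -> B
L = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 0, 0]])
R = transition_matrix(L)
print(R)
```

Every column of `R` that belongs to a page with at least one outgoing link sums to 1, which is what makes `R` a transition matrix for the random walker.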

In [11]:
# Implement your code here
def pagerank_eigen(L):
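    # Construct the transition probability matrix R from L.
    # X holds the column sums, i.e. the number of outgoing links of each page
    # (L must be an np.matrix so that X has shape 1 x N).
    X = np.sum(L,axis=0)
    R = np.zeros([L.shape[0],L.shape[1]])

    # R[i,j] = 1 / out-degree of page j (0 if page j has no outgoing links)
    for i in range(R.shape[0]):
        for j in range(R.shape[1]):
            R[i,j] = 1 / X[0,j] if X[0,j] != 0 else 0

    # Keep the weights only where a link actually exists
    R = np.multiply(R,L)
    # Compute the eigenvectors and eigenvalues of R
    eigenvalues, eigenvectors = np.linalg.eig(R)
    # Take the eigenvector with the largest eigenvalue
    p = eigenvectors[:,np.argmax(eigenvalues)]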
    return (R,p)

1. Link page A to page B.


In [12]:
# Test with the question, e.g.
L = np.matrix([
    [0,0,0], 
    [1,0,0], 
    [0,0,0]
])
R,p = pagerank_eigen(L)
print("L={0}\nR={1}\np={2}".format(L,R,p))

L=[[0 0 0]
 [1 0 0]
 [0 0 0]]
R=[[0. 0. 0.]
 [1. 0. 0.]
 [0. 0. 0.]]
p=[[0.]
 [1.]
 [0.]]


2. Link all pages to each other.

In [13]:
# Test with the question, e.g.
L = np.matrix([
    [0,1,1], 
    [1,0,1], 
    [1,1,0]
])
R,p = pagerank_eigen(L)
print("L={0}\nR={1}\np={2}".format(L,R,p))

L=[[0 1 1]
 [1 0 1]
 [1 1 0]]
R=[[0.  0.5 0.5]
 [0.5 0.  0.5]
 [0.5 0.5 0. ]]
p=[[0.57735027]
 [0.57735027]
 [0.57735027]]


3. Link page A to both B and C, and link pages B and C to A.

In [14]:
# Test with the question, e.g.
L = np.matrix([
    [0,1,1], 
    [1,0,0], 
    [1,0,0]
])
R,p = pagerank_eigen(L)
print("L={0}\nR={1}\np={2}".format(L,R,p))

L=[[0 1 1]
 [1 0 0]
 [1 0 0]]
R=[[0.  1.  1. ]
 [0.5 0.  0. ]
 [0.5 0.  0. ]]
p=[[0.81649658]
 [0.40824829]
 [0.40824829]]


4. Use the previous links and add a link from page C to page B.

In [16]:
# Test with the question, e.g.
L = np.matrix([
    [0,1,1], 
    [1,0,1], 
    [1,0,0]
])
R,p = pagerank_eigen(L)
print("L={0}\nR={1}\np={2}".format(L,R,p))

L=[[0 1 1]
 [1 0 1]
 [1 0 0]]
R=[[0.  1.  0.5]
 [0.5 0.  0.5]
 [0.5 0.  0. ]]
p=[[-0.74278135+0.j]
 [-0.55708601+0.j]
 [-0.37139068+0.j]]
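
The vector printed for part 4 is complex-typed and negative because `np.linalg.eig` returns eigenvectors that are only determined up to a scalar factor. The ranking is unaffected, but the vector can be rescaled into a probability vector. The helper below is a minimal sketch (the name `as_probability_vector` is an assumption for illustration), and it only makes sense when all entries of the eigenvector share the same sign.

```python
import numpy as np

def as_probability_vector(p):
    """Rescale an eigenvector so that its entries are real and sum to 1."""
    p = np.real(np.asarray(p)).ravel()   # drop the zero imaginary part
    return p / p.sum()                   # dividing by the sum also flips an all-negative vector

# Eigenvector printed above for part 4
p_raw = np.array([-0.74278135 + 0j, -0.55708601 + 0j, -0.37139068 + 0j])
p = as_probability_vector(p_raw)
print(p)
```

The result is approximately (4/9, 3/9, 2/9), so the ranking is A, then B, then C.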


## Question 2 - Page Rank (Iterative method)

The eigenvector method has some numerical issues (when computing the eigenvectors) and does not scale to large datasets.

We will apply the iterative method in the slide "Practical Computation of PageRank" of the lecture.

Dataset for practice: https://snap.stanford.edu/data/ca-GrQc.html. It is available in the same folder of this GitHub repository.

In [28]:
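def matrixR(L):
    # Construct the transition probability matrix from L by dividing each
    # column by its sum (the out-degree). All-zero columns are left as zeros
    # instead of triggering a division by zero.
    col_sums = np.sum(L,axis=0)
    inv = np.divide(1.0, col_sums, out=np.zeros_like(col_sums, dtype=float), where=col_sums!=0)
    return np.multiply(inv,L)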

def pagerank_iterative(L):
    R = matrixR(L)
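    N = R.shape[0]
    e = np.ones(shape=(N,1))
    q = 0.9            # probability of following a link (1-q: random jump)

    p = e              # initial ranking vector
    delta = 1
    epsilon = 0.001    # convergence threshold on the L1 change of p
    i = 0
    while delta > epsilon:
        p_prev = p
        # Update step: p = q * R * p_prev + (1-q)/N * e
        p = q * R.dot(p_prev)
        p = p + e*(1-q)/N
        # L1 distance between successive iterates
        delta = np.linalg.norm(p-p_prev,1)
        i += 1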

    print("Converged after {0} iterations. Ranking vector: p={1}".format(i, p[:,0]))
    return R,p
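
As a sanity check, the same update rule can be run on the three-page graph from Question 1, part 4, and compared with the exact fixed point of the iteration. This is a self-contained sketch: the name `power_iteration`, the `max_iter` guard and the tighter `epsilon` are assumptions made for this example, while `q = 0.9` is taken from the cell above.

```python
import numpy as np

def power_iteration(R, q=0.9, epsilon=1e-10, max_iter=1000):
    """Iterate p <- q*R*p + (1-q)/N * e until the L1 change is <= epsilon."""
    N = R.shape[0]
    e = np.ones(N)
    p = e / N                                   # start from the uniform vector
    for i in range(1, max_iter + 1):
        p_next = q * R.dot(p) + (1 - q) * e / N
        delta = np.linalg.norm(p_next - p, 1)   # L1 distance between iterates
        p = p_next
        if delta <= epsilon:
            break
    return p, i

# Transition matrix of Question 1, part 4 (columns = from-vertex)
R = np.array([[0.0, 1.0, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.0, 0.0]])
p, n_iter = power_iteration(R)

# Exact fixed point of the update: (I - q*R) p = (1-q)/N * e
p_exact = np.linalg.solve(np.eye(3) - 0.9 * R, 0.1 / 3 * np.ones(3))
print(n_iter, p, p_exact)
```

With q < 1 the scores differ slightly from the eigenvector solution, but the ranking of the three pages stays A, B, C.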

#### Test with the dataset


In [23]:
# Construct link matrix from file
n_nodes = 0
nodes_idx = {}
nodes = []
with open("ca-GrQc.txt") as f:
    for idx, line in enumerate(f):
        if (idx>3):  # skip the first 4 lines (file header)
            source, dest = (int(x) for x in line.split()[:2])
            # Assign consecutive indices to node ids in order of first appearance
            for node in (source, dest):
                if node not in nodes_idx:
                    nodes_idx[node] = n_nodes
                    nodes.append(node)
                    n_nodes += 1

print(n_nodes)
print(nodes[:3])

5242
[3466, 937, 5233]


In [24]:
# Construct L
L = np.zeros([n_nodes,n_nodes])
with open("ca-GrQc.txt") as f:
    for idx, line in enumerate(f):
        if (idx>3):  # skip the first 4 lines (file header)
            source = int(line.split()[0])
            dest   = int(line.split()[1])
            # Column = from-vertex, row = to-vertex
            L[nodes_idx[dest],nodes_idx[source]] = 1
print(L)

[[0. 1. 1. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 1. 1.]
 [0. 0. 0. ... 1. 0. 1.]
 [0. 0. 0. ... 1. 1. 0.]]


In [29]:
# Run PageRank
R, p = pagerank_iterative(L)
print("Ranking vector: p={0}".format(p[:,0]))

Converged after 128 iterations. Ranking vector: p=[2.91315639e-04 1.88382754e-04 8.39741651e-05 ... 1.92156702e-04
 1.92156702e-04 1.92156702e-04]
Ranking vector: p=[2.91315639e-04 1.88382754e-04 8.39741651e-05 ... 1.92156702e-04
 1.92156702e-04 1.92156702e-04]


In [30]:
print(R)

[[0.    0.2   0.5   ... 0.    0.    0.   ]
 [0.125 0.    0.    ... 0.    0.    0.   ]
 [0.125 0.    0.    ... 0.    0.    0.   ]
 ...
 [0.    0.    0.    ... 0.    0.5   0.5  ]
 [0.    0.    0.    ... 0.5   0.    0.5  ]
 [0.    0.    0.    ... 0.5   0.5   0.   ]]


In [35]:
p[-1:]

array([[0.00019216]])

In [31]:
arr = np.array(p[:,0])
k = 3
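# argsort -> indices that sort the scores in ascending order
# [-k:]   -> keep the last k indices (the k largest scores)
# [::-1]  -> reverse them so that the largest score comes first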
k_idx = arr.argsort()[-k:][::-1]
print("Top-{0} nodes: {1}".format(k, np.array(nodes)[k_idx]))
print("Their scores: {0}".format(arr[k_idx]))

Top-3 nodes: [14265 13801 13929]
Their scores: [0.00144951 0.00141553 0.00138011]


## Question 3 - Ranking Methodology (Hard)

1. Give a directed graph, as small as possible, satisfying all the properties mentioned below:

    1. There exists a path from node i to node j for all nodes i, j in the directed graph. Recall that with this property the jump to an arbitrary node in PageRank is not required, so you can set q = 1 (refer to the lecture slides).

    2. The HITS authority ranking and the PageRank ranking of the graph nodes are different.

2. Give the intuition/methodology behind how you constructed a directed graph with the properties described in item 1.

3. Are there specific graph structures with arbitrarily large instances where the PageRank ranking and the HITS authority ranking are the same?

## Question 4 - Hub and Authority

### a)

Let the adjacency matrix for a graph of four vertices ($n_1$ to $n_4$) be
as follows:

$$
A =
\begin{bmatrix}
0 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 \\
1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

Calculate the authority and hub scores for this graph using the
HITS algorithm with k = 6 iterations, and identify the best authority and
hub nodes.

### b)
Apply the HITS algorithm to the dataset: https://snap.stanford.edu/data/ca-GrQc.html

**Hint:** We follow the slide "HITS algorithm" in the lecture. **Denote $x$ as the authority vector and $y$ as the hub vector**. You can use matrix multiplication for the update steps in the slide "Convergence of HITS". Note that the rows of the adjacency matrix are from-vertices and the columns are to-vertices.

In [None]:
# You can implement your code following this template.
def hits_iterative(A, k=10):
    N = A.shape[0]

    x0, y0 = 1 / (N*N) * np.ones(N), 1 / (N*N) * np.ones(N) 

    xprev, yprev = x0, y0
    
    # For advanced exercise: define a convergence condition instead of k iterations
    for l in range(0,k):
        y = ...
        x = ...
        xprev = ...
        yprev = ...
        
    return xprev, yprev
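
For reference, one possible way to fill in such a loop is sketched below. It is not necessarily the variant on the lecture slides: the function name `hits_sketch`, the simultaneous update of both vectors from the previous iterates, and the L2 normalization after every step are assumptions made for this example. Only the convention (rows = from-vertex, columns = to-vertex, $x$ = authority, $y$ = hub) and the starting values come from the template.

```python
import numpy as np

def hits_sketch(A, k=10):
    """Run k HITS iterations on adjacency matrix A (rows = from, columns = to)."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    x = np.ones(N) / (N * N)    # authority scores, same start as the template
    y = np.ones(N) / (N * N)    # hub scores
    for _ in range(k):
        y_new = A.dot(x)        # hub: sum of authority scores of the pages it links to
        x_new = A.T.dot(y)      # authority: sum of hub scores of the pages linking to it
        # Assumed normalization: rescale both vectors to unit L2-norm
        x = x_new / np.linalg.norm(x_new, 2)
        y = y_new / np.linalg.norm(y_new, 2)
    return x, y

# Adjacency matrix from part a)
A = np.array([[0, 1, 1, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [0, 0, 0, 1]])
x, y = hits_sketch(A, k=6)
print("Authority scores:", x)
print("Hub scores:", y)
print("Best authority index:", np.argmax(x), "Best hub index:", np.argmax(y))
```

The printed indices are zero-based, so index 0 corresponds to $n_1$. Since only the ranking matters, a different normalization (for example L1) leaves the best authority and hub unchanged.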