<a href="https://colab.research.google.com/github/yexf308/AdvancedMachineLearning/blob/main/PageRank_and_HITS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pylab inline 
from IPython.display import Image
import numpy.linalg as LA

$\def\m#1{\mathbf{#1}}$
$\def\mb#1{\mathbb{#1}}$
$\def\c#1{\mathcal{#1}}$

# PageRank
PageRank (PR) is an algorithm used by Google Search to rank web pages in its search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring **the importance of web pages**. According to Google:




- PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that **more important websites are likely to receive more links from other websites**.


- The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.






## Algorithm 
Consider a **directed** weighted graph $G=(V,E,\m{W})$ whose weight matrix encodes the webpage link structure:
$$w_{ij} = \begin{cases}\#\{\text{link:}\ i\rightarrow j\} & (i,j)\in E  \\ 0 & \text{Otherwise}\end{cases} $$

Define the out-degree vector $d_i^{(o)} = \sum_{j=1}^nw_{ij}$, which counts the out-links
from $i$, the diagonal matrix $D = \text{diag}(d_i^{(o)})$, and the Markov (row-stochastic) matrix $M=D^{-1}\m{W}$. For simplicity, we assume that every node has at least one out-link, so that $D$ is invertible.
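The construction of $M$ can be sketched in a few lines of NumPy. The 4-page link matrix `W` below is a made-up toy example (not from any real data set), and the variable names are illustrative.

```python
import numpy as np

# Hypothetical 4-page graph: W[i, j] = number of links from page i to page j.
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

d_out = W.sum(axis=1)        # out-degree d_i^{(o)}: row sums of W
assert np.all(d_out > 0)     # the notes assume every node has an out-link

M = W / d_out[:, None]       # row-normalize; equivalent to inv(diag(d_out)) @ W

print(M)
print(M.sum(axis=1))         # row sums of a Markov matrix
```

Dividing each row by its sum avoids forming $D^{-1}$ explicitly, which is the usual choice for large or sparse link matrices.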

This Markov chain accounts for a random walk according to the link structure of webpages. One would expect
that stationary distributions of such random walks will disclose the importance of
webpages: **the more visits, the more important**. 

There is one problem: $M$ may not be primitive (irreducible and aperiodic), in which case the stationary distribution may fail to be unique, or the random walk may fail to converge to it.
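A small made-up example illustrates the non-uniqueness. In the hypothetical chain below, pages $\{0,1\}$ and $\{2,3\}$ only link within their own pair, so the chain is reducible and each pair carries its own stationary distribution.

```python
import numpy as np

# Hypothetical reducible chain: two disconnected 2-cycles.
M = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

pi_a = np.array([0.5, 0.5, 0.0, 0.0])   # supported on pages {0, 1}
pi_b = np.array([0.0, 0.0, 0.5, 0.5])   # supported on pages {2, 3}
pi_mix = 0.3 * pi_a + 0.7 * pi_b        # any convex combination

# Each candidate satisfies pi^T M = pi^T, so all three are stationary.
for pi in (pi_a, pi_b, pi_mix):
    print(np.allclose(pi @ M, pi))
```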

Introduce the following trick:

- Let $P_\alpha = \alpha M + (1-\alpha) E$, where $E = \frac{1}{n}\mb{1}\mb{1}^\top$ is a random surfing model, i.e., one
can jump to any webpage uniformly at random. So in the model $P_\alpha$, a surfer rolls
a die: with probability $\alpha$ they follow the link structure, and with probability $1-\alpha$ they jump to a uniformly random page.

- With $0<\alpha<1$, the random surfing term
makes $P_\alpha$ a positive matrix, so by the Perron-Frobenius theorem there is a unique stationary distribution $\pi$. Google chose $\alpha=0.85$, and in this case $\pi$ gives the PageRank scores.
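The two steps above can be sketched with power iteration, which repeatedly applies $\pi^\top \leftarrow \pi^\top P_\alpha$ until the vector stops changing. This is only one way to compute $\pi$, not necessarily the intended homework solution. The function name `pagerank`, the tolerance, and the toy matrix `W` are assumptions made for illustration, and the dense $n\times n$ matrix is only practical for small graphs.

```python
import numpy as np

def pagerank(W, alpha=0.85, tol=1e-12, max_iter=1000):
    """Illustrative sketch: stationary distribution of P_alpha by power iteration."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    d_out = W.sum(axis=1)
    if np.any(d_out == 0):
        raise ValueError("every node needs at least one out-link")
    M = W / d_out[:, None]                               # M = D^{-1} W
    P = alpha * M + (1 - alpha) * np.ones((n, n)) / n    # P_alpha
    pi = np.full(n, 1.0 / n)                             # start from uniform
    for _ in range(max_iter):
        pi_next = pi @ P                                 # row-vector update pi^T P
        if np.abs(pi_next - pi).sum() < tol:
            return pi_next
        pi = pi_next
    return pi

# Hypothetical 4-page graph: W[i, j] = number of links from page i to page j.
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

pi = pagerank(W)
print(pi)                  # PageRank scores, summing to 1
print(np.argsort(-pi))     # pages ordered from most to least important
```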

Implement it in HW. 

### Cheating the PageRank

- If there are many cross links among a small set of nodes (for
example, Wikipedia), those nodes tend to rank high in
PageRank.

- Now you can probably figure out how to cheat PageRank. This
phenomenon has actually been exploited by spam webpages, and
even in scholarly citations. Knowing how PageRank works, we
should be aware of such misbehavior.

# Kleinberg's HITS algorithm
In PageRank we consider out-degree $d^{(o)}$. How about in-degree $d^{(i)}_k=\sum_{j}w_{jk}$?

- High out-degree webpages can be regarded as hubs, as they provide
many links to others (for example, Wikipedia).

- High in-degree webpages
are regarded as authorities, as they are cited intensively by others (for example, a great research paper).

- Basically in/out-degrees can be used to rank webpages, which gives
relative ranking as authorities/hubs.
  
   - $d^{(o)}_i = \sum_{k}w_{ik}$ 

   - $d^{(i)}_j = \sum_{k}w_{kj}$
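Degree-based ranking amounts to row and column sums of $\m{W}$. The sketch below uses a made-up 4-page link matrix, and the names `hub_order` and `auth_order` are illustrative.

```python
import numpy as np

# Hypothetical 4-page graph: W[i, j] = number of links from page i to page j.
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

d_out = W.sum(axis=1)    # out-degree: row sums, a crude hub score
d_in = W.sum(axis=0)     # in-degree: column sums, a crude authority score

# Sort pages by decreasing degree; a stable sort keeps ties in index order.
hub_order = np.argsort(-d_out, kind="stable")
auth_order = np.argsort(-d_in, kind="stable")

print(d_out, hub_order)
print(d_in, auth_order)
```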

Jon Kleinberg’s HITS algorithm gives similar results to
in/out-degree ranking. It is based on the singular value decomposition (SVD) of the link matrix $\m{W}$.


### HITS-authority
We use the primary (leading) right singular vector of $\m{W}$ as authority scores to give the ranking. To understand this, define $L_a = \m{W}^\top \m{W}$.

- The primary right singular vector of $\m{W}$ is just a primary eigenvector of the
nonnegative symmetric matrix $L_a$.

- Since $L_a(i,j) = \sum_k w_{ki}w_{kj}$, it counts the number of
references that cite both $i$ and $j$, i.e., $\sum_k\# \{i\leftarrow k \rightarrow j\}$. The
higher the value of $L_a(i,j)$, the more references the pair of
nodes receives in common. Therefore the primary eigenvector tends to rank the webpages according
to authority.
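A minimal sketch of the authority scores using `np.linalg.svd`, again on a made-up 4-page link matrix. It assumes the largest singular value is simple, so that the leading singular vector is unique up to sign; the sign is then fixed so that the scores are nonnegative.

```python
import numpy as np

# Hypothetical 4-page graph: W[i, j] = number of links from page i to page j.
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(W)     # W = U @ diag(s) @ Vt, s sorted in decreasing order
authority = Vt[0]               # leading right singular vector
if authority.sum() < 0:         # singular vectors are defined up to sign
    authority = -authority

# Check that it is an eigenvector of L_a = W^T W with eigenvalue s[0]^2.
L_a = W.T @ W
print(np.allclose(L_a @ authority, s[0] ** 2 * authority))
print(np.argsort(-authority))   # pages ordered by authority score
```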


### HITS-hub
We use the primary (leading) left singular vector of $\m{W}$ as hub scores to give the ranking. To understand this, define $L_h = \m{W} \m{W}^\top$.


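- The primary left singular vector of $\m{W}$ is just a primary eigenvector of the
nonnegative symmetric matrix $L_h$.

- Since $L_h(i,j) = \sum_k w_{ik}w_{jk}$, it counts the number of links
from $i$ and $j$ that hit the same target, i.e., $\sum_k\# \{i\rightarrow k \leftarrow j\}$. Therefore the primary eigenvector tends to rank the webpages according
to hub.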
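The hub scores can be sketched the same way. As a cross-check, the block also runs plain power iteration on $L_h$, which is a swapped-in alternative to the SVD rather than something prescribed by these notes. The toy matrix and the iteration count are assumptions for illustration, and the comparison assumes the largest singular value is simple.

```python
import numpy as np

# Hypothetical 4-page graph: W[i, j] = number of links from page i to page j.
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(W)
hub = U[:, 0]                   # leading left singular vector
if hub.sum() < 0:               # fix the sign so that scores are nonnegative
    hub = -hub

# Cross-check: power iteration on L_h = W W^T from a positive start vector.
L_h = W @ W.T
h = np.ones(W.shape[0])
for _ in range(200):
    h = L_h @ h
    h /= np.linalg.norm(h)      # renormalize to unit length each step

print(np.allclose(h, hub))
print(np.argsort(-hub))         # pages ordered by hub score
```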