# 14. PageRank Algorithm

PageRank is the link-analysis algorithm that Google introduced to rank web pages. It treats the web as a directed graph in which pages are nodes and hyperlinks are edges.

**Core Idea**: A page is important if important pages link to it.

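$$ PR(u) = \frac{1-d}{N} + d \sum_{v \in B(u)} \frac{PR(v)}{L(v)} $$

Here $B(u)$ is the set of pages that link to $u$, $L(v)$ is the number of outgoing links on page $v$, $N$ is the total number of pages, and $d$ is the damping factor (0.85 by default in the code below). Dividing the teleport term by $N$ matches the implementation and keeps the scores summing to 1.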
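
To make the update rule concrete, here is a minimal sketch of a single update on a hypothetical three-page graph. The page names `A`, `B`, `C` and the helper `pagerank_step` are illustrative assumptions, not part of the notebook's data. The sketch uses the normalized teleport term $(1-d)/N$, the same one `calculate_pagerank` uses, so the scores sum to 1.

```python
# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank_step(ranks, out_links, d=0.85):
    """Apply the PageRank update once to every page (illustrative helper)."""
    n = len(ranks)
    new_ranks = {}
    for u in ranks:
        # B(u): pages v that link to u; each one passes on PR(v) / L(v)
        incoming = sum(ranks[v] / len(targets)
                       for v, targets in out_links.items() if u in targets)
        new_ranks[u] = (1 - d) / n + d * incoming
    return new_ranks

ranks = {page: 1 / 3 for page in out_links}  # start from a uniform guess
ranks = pagerank_step(ranks, out_links)
for page, score in ranks.items():
    print(f"{page}: {score:.4f}")
```

After one step, C already scores highest because it receives all of B's rank plus half of A's, while B receives only half of A's.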

In [1]:
import numpy as np
import random

# 1. Simulate a Web Graph
# We will use our 60 documents (10 original + 50 synthetic)
NUM_DOCS = 60
doc_ids = [f"doc{i:03d}" for i in range(1, NUM_DOCS+1)]

# Edge list of (source, target) index pairs
# Targets are drawn uniformly at random (no topic bias is modelled here)
links = []
for i in range(NUM_DOCS):
    # Pick 3-10 distinct targets; a self-pick is dropped below, leaving 2-10 outlinks
    num_outlinks = random.randint(3, 10)
    targets = random.sample(range(NUM_DOCS), num_outlinks)
    
    for t in targets:
        if t != i:  # No self-loops for simplicity
            links.append((i, t))
            
print(f"Generated {len(links)} links among {NUM_DOCS} documents.")

Generated 400 links among 60 documents.


## 2. Power Iteration Implementation

We can solve PageRank with the power method: start from a uniform rank vector and repeatedly multiply it by the damped transition matrix until the change between successive iterations falls below a tolerance.
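
In matrix form, the method can be sketched as follows. This is a minimal illustration on a hypothetical four-page graph, not the implementation used in this notebook. The names `build_transition_matrix` and `power_iteration` are assumptions made for the example, and plain Python lists stand in for a matrix library. It also assumes that every page has at least one outlink and that no link is repeated.

```python
def build_transition_matrix(links, n):
    """Column-stochastic matrix M: M[t][s] = 1/L(s) if page s links to page t."""
    out_degree = [0] * n
    for s, _ in links:
        out_degree[s] += 1
    M = [[0.0] * n for _ in range(n)]
    for s, t in links:
        M[t][s] = 1.0 / out_degree[s]
    return M

def power_iteration(M, d=0.85, epsilon=1e-10, max_iterations=200):
    """Repeat r <- (1-d)/n + d * M r until the L1 change drops below epsilon."""
    n = len(M)
    r = [1.0 / n] * n  # uniform starting vector
    for _ in range(max_iterations):
        new_r = [(1 - d) / n + d * sum(M[i][j] * r[j] for j in range(n))
                 for i in range(n)]
        diff = sum(abs(a - b) for a, b in zip(new_r, r))
        r = new_r
        if diff < epsilon:
            break
    return r

# Hypothetical graph: every page has at least one outlink (no sinks)
links = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
ranks = power_iteration(build_transition_matrix(links, 4))
print([round(x, 4) for x in ranks])
```

Page 3 has no incoming links, so its score settles at the teleport share $(1-d)/N$, while page 2, which three pages link to, ends up ranked highest.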

In [2]:
def calculate_pagerank(links, num_pages, damping=0.85, epsilon=1e-6, max_iterations=100):
    # 1. Initialize PageRank equally
    pagerank = np.ones(num_pages) / num_pages
    
    # 2. Build Adjacency List & Out-degree
    out_degree = [0] * num_pages
    in_links = [[] for _ in range(num_pages)]
    
    for source, target in links:
        out_degree[source] += 1
        in_links[target].append(source)
        
    # 3. Iteration
    for it in range(max_iterations):
        new_pagerank = np.zeros(num_pages)
        
        # The "teleportation" part (random surfer jumps)
        leap_prob = (1 - damping) / num_pages
        
        # Sink nodes have no outlinks, so they never appear in in_links;
        # collect their rank here and spread it equally over all pages
        sink_sum = sum(pagerank[j] for j in range(num_pages) if out_degree[j] == 0)
        
        for i in range(num_pages):
            # Start from this page's share of the sink rank
            rank_sum = sink_sum / num_pages
            # Add PR(source)/L(source) for all incoming links
            for source in in_links[i]:
                rank_sum += pagerank[source] / out_degree[source]
            
            new_pagerank[i] = leap_prob + damping * rank_sum
            
        # Check convergence (L1 norm of the change)
        diff = np.sum(np.abs(new_pagerank - pagerank))
        pagerank = new_pagerank
        
        if diff < epsilon:
            print(f"Converged in {it+1} iterations.")
            break
            
    return pagerank

pr_scores = calculate_pagerank(links, NUM_DOCS)

# Show Top 10 Pages
ranked_indices = np.argsort(pr_scores)[::-1]
print("\nTop 10 Ranked Documents:")
print("="*30)
for i in range(10):
    idx = ranked_indices[i]
    print(f"{i+1}. {doc_ids[idx]} (Score: {pr_scores[idx]:.5f})")

Converged in 13 iterations.

Top 10 Ranked Documents:
1. doc013 (Score: 0.03444)
2. doc057 (Score: 0.03064)
3. doc056 (Score: 0.02619)
4. doc001 (Score: 0.02556)
5. doc005 (Score: 0.02525)
6. doc045 (Score: 0.02480)
7. doc040 (Score: 0.02458)
8. doc026 (Score: 0.02392)
9. doc025 (Score: 0.02363)
10. doc020 (Score: 0.02200)


## 3. Visualization (Optional)
A small graph could be drawn directly. With 60 nodes and 400 links, a drawing would likely be too cluttered to read, so the Top 10 list above is the better summary.
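
For a quick look at how the scores compare without drawing the graph, one option is a plain-text bar chart. The sketch below is only a suggestion and not part of the original notebook. `text_bar_chart` is a hypothetical helper, it assumes all scores are positive, and the sample input reuses the three highest scores printed above.

```python
def text_bar_chart(labels, scores, top_k=10, width=40):
    """Return one line per item: label, a bar scaled to the top score, score."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)[:top_k]
    if not ranked:
        return []
    top_score = ranked[0][1]  # assumed positive, as PageRank scores are
    lines = []
    for label, score in ranked:
        bar = "#" * max(1, round(width * score / top_score))
        lines.append(f"{label} {bar} {score:.5f}")
    return lines

# The three highest scores from the run above, in arbitrary order
labels = ["doc057", "doc013", "doc056"]
scores = [0.03064, 0.03444, 0.02619]
for line in text_bar_chart(labels, scores, width=20):
    print(line)
```

With the full results, the same call could be made as `text_bar_chart(doc_ids, pr_scores)`.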

## Summary
We implemented PageRank by hand using power iteration. On the simulated 60-document graph, it converged in 13 iterations.