# L2b: What does the largest eigenpair represent?
In this lab, we'll explore PageRank, an algorithm that ranks web pages by importance. While a naive approach might simply count incoming links, we'll discover why this fails and how PageRank uses Markov chains and eigenvector computation to address these limitations.

> __What is PageRank?__
>
> The [PageRank algorithm, developed by Larry Page and Sergey Brin](http://ilpubs.stanford.edu:8090/422/), is used by Google Search to rank web pages in their search engine results. It infers importance purely from link structure: incoming links are votes weighted by the importance of the linking page and diluted by its number of outgoing links. At its core, PageRank models web browsing as a Markov chain, where each webpage represents a state and hyperlinks define transition probabilities. 

In this lab, we'll represent a small network of webpages as a directed graph, examine why naive in-degree counting is insufficient, construct the corresponding transition matrix (a stochastic matrix), and use the power iteration method to compute the stationary distribution (scaled dominant eigenvector) of this Markov chain, which we call PageRank scores.

> __Learning Objectives:__
>
> By the end of this lab, you will be able to:
>
> * __Understand why naive ranking fails:__ Recognize the limitations of in-degree counting and why source quality and link distribution matter.
> * __Build and interpret transition matrices:__ Construct normalized stochastic matrices from adjacency matrices that properly encode importance transfer weighted by source authority.
> * __Apply power iteration to find PageRank:__ Compute the stationary distribution of a Markov chain using power iteration and contrast it with naive in-degree ranking to understand how iterative refinement captures network-based importance.


Let's get started!
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

> The [`include(...)` command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

Let's set up our code environment:

In [None]:
include(joinpath(@__DIR__, "Include.jl")); # include the Include.jl file

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Data
Next, let's load up the dataset that we will explore. This dataset was generated with the help of generative AI and a simple randomized graph generator for teaching and demonstration purposes. 

It does not contain real hyperlinks, real traffic patterns, or data collected from any website. Any resemblance to real domains is coincidental (the domain-like labels are fabricated).

We've provided [the `MySyntheticPageRankDataset()` function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MySyntheticPageRankDataset) to load the synthetic PageRank dataset. This function takes no arguments and returns a tuple containing the edges and nodes of the synthetic web graph.

> __What's in the dataset?__
> 
> The dataset contains two data structures: `edges::Dict{Int, Tuple{String, String}}` maps edge indices to pairs of node identifiers (from_node, to_node), representing directed hyperlinks between webpages. The `nodes::Dict{String, NamedTuple}` maps node identifiers to named tuples containing metadata such as page labels, community assignments, and page types.

Let's load the dataset:

In [None]:
(edges, nodes) = MySyntheticPageRankDataset(); # load the synthetic PageRank dataset

#### Exploring the hyperlink structure

Let's start by examining the edge list, the hyperlinks in our synthetic web graph. Each edge represents a directed link from one webpage to another. To understand our ranking problem, we need to ask: **How interconnected is this network? Which pages are linked to most frequently? And most importantly, is simply counting incoming links sufficient to rank importance?**

Let's look at a sample of the edges:


In [None]:
edges

How about the `nodes::Dict{String, NamedTuple}` dictionary?

In [None]:
nodes

#### Network statistics and properties

Each node in the `nodes` dictionary carries metadata about the page: a human-readable `label`, a `community` assignment (which pages belong to topically related groups), and a `type` classification. These properties help us understand the network's structure.

Now let's compute some basic graph statistics to characterize our network before applying PageRank:

> __Why compute network statistics?__
>
> Understanding the network's key properties informs our interpretation of PageRank scores. Key metrics include:
> * __Network size and density:__ How many pages are there, and how densely are they linked? Sparse networks may produce PageRank scores with larger variance.
> * __In-degree distribution:__ How many incoming links does each page receive? If in-degree is highly skewed (many pages with few links, few pages with many links), naive in-degree ranking may be dominated by outliers.
> * __Out-degree distribution:__ How many pages does each page link to? This affects the transition matrix: pages with many outgoing links distribute their importance across many neighbors.
> * __Disconnected components:__ Are all pages reachable from each other, or are there isolated subgraphs? This affects Markov chain properties and convergence.
> * __Community structure:__ Do pages cluster into topically related groups? PageRank should respect this structure if links reflect topical relevance.

Let's compute these statistics:


In [None]:
graph_stats = let

    # initialize 
    number_of_edges = length(edges);
    in_degree_dict = Dict{String, Int}();
    out_degree_dict = Dict{String, Int}();
    community_dict = Dict{String, String}();
    
    # iterate over edges to build degree distributions
    for (idx, (from_node, to_node)) ∈ edges
        # count out-degree for from_node
        if haskey(out_degree_dict, from_node)
            out_degree_dict[from_node] += 1;
        else
            out_degree_dict[from_node] = 1;
        end
        
        # count in-degree for to_node
        if haskey(in_degree_dict, to_node)
            in_degree_dict[to_node] += 1;
        else
            in_degree_dict[to_node] = 1;
        end
    end
    
    # extract community and handle missing degrees
    for (node_id, node_data) ∈ nodes
        community_dict[node_id] = node_data.community;
        
        # ensure every node has a degree count (isolated nodes have degree 0)
        if !haskey(in_degree_dict, node_id)
            in_degree_dict[node_id] = 0;
        end
        if !haskey(out_degree_dict, node_id)
            out_degree_dict[node_id] = 0;
        end
    end
    
    (number_of_nodes = length(nodes), 
     number_of_edges = number_of_edges, 
     in_degree_dict = in_degree_dict, 
     out_degree_dict = out_degree_dict,
     community_dict = community_dict)
end;

println("Network size: $(graph_stats.number_of_nodes) pages");
println("Number of hyperlinks: $(graph_stats.number_of_edges) edges");
println("Network density: $(round(graph_stats.number_of_edges / (graph_stats.number_of_nodes * (graph_stats.number_of_nodes - 1)); digits=4))");
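
The density printed above compares the number of directed links to the largest number a graph without self-links could have. As a short worked formula, writing $m$ for the number of edges and $n$ for the number of nodes (symbols introduced here only for illustration):

$$
\begin{align*}
\text{density} = \frac{m}{n(n-1)}
\end{align*}
$$

For example, a hypothetical graph with $n = 50$ pages and $m = 245$ links would have density $245/(50\cdot 49) = 0.1$, meaning one in ten of the possible directed links is present.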

#### In-degree distribution: A preview of the ranking problem

Let's examine the in-degree distribution: how many incoming links does each page receive? This will reveal why a naive approach (ranking pages by in-degree alone) is insufficient.

> __The naive approach:__ One might initially assume that the "most important" pages are simply those with the most incoming links (highest in-degree). This is intuitive: if many pages link to you, you must be important. However, this approach has a critical flaw: **it treats all votes equally**, regardless of the importance of the voter. A link from an authority is more valuable than a link from a page that links to everything.

Let's compute in-degree statistics and visualize the distribution:


In [None]:
in_degree_values = collect(values(graph_stats.in_degree_dict));
out_degree_values = collect(values(graph_stats.out_degree_dict));

println("In-degree statistics:");
println("  Min: $(minimum(in_degree_values)), Max: $(maximum(in_degree_values)), Mean: $(round(mean(in_degree_values); digits=2))");
println("\nOut-degree statistics:");
println("  Min: $(minimum(out_degree_values)), Max: $(maximum(out_degree_values)), Mean: $(round(mean(out_degree_values); digits=2))");

# plot in-degree and out-degree distributions
let
    fig = plot(layout=(1,2), size=(1000, 400));
    histogram!(fig[1], in_degree_values, bins=10, c=:steelblue, legend=false, title="In-Degree Distribution", xlabel="In-Degree", ylabel="Frequency");
    histogram!(fig[2], out_degree_values, bins=10, c=:coral, legend=false, title="Out-Degree Distribution", xlabel="Out-Degree", ylabel="Frequency");
    
    fig
end

Finally, let's compute a few things we'll need below, in particular the list of nodes and the number of nodes.

In [None]:
number_of_nodes, list_of_nodes = let

    # initialize -
    nodeset = Set{String}();

    # loop over edges to build the set of nodes (iterate the values, so we don't assume the keys are 1:n)
    # Trick: Take advantage of the fact that sets do not allow duplicates (nice!)
    for (from_node, to_node) ∈ values(edges)
        push!(nodeset, from_node);
        push!(nodeset, to_node);
    end

    # Of course, we want a sorted array of nodes (not a set), so let's convert to an array and sort it.
    list_of_nodes = nodeset |> collect |> sort;
    number_of_nodes = length(list_of_nodes); # how many nodes are there?

    (number_of_nodes, list_of_nodes); # return 
end

### Computing in-degree scores: A baseline for comparison

Let's compute and examine in-degree based "importance" scores. This baseline will reveal exactly where naive counting falls short, and it will make the PageRank approach more intuitive once we implement it.

In [None]:
in_degree_scores = let
    
    # initialize -
    in_degree_scores = zeros(number_of_nodes);
    
    # loop over nodes and get their in-degree
    for i ∈ 1:number_of_nodes
        node_id = list_of_nodes[i];
        in_degree_scores[i] = graph_stats.in_degree_dict[node_id];
    end
    
    # normalize to sum to 1 (like PageRank scores)
    in_degree_scores .= in_degree_scores ./ sum(in_degree_scores);
    
    in_degree_scores
end;

# display the top 10 pages by in-degree
let
    
    i = sortperm(in_degree_scores; rev=true)[1:10]; # indices of top 10
    df = DataFrame();
    
    for j ∈ 1:10
        node_index = i[j];
        node_id = list_of_nodes[node_index];
        page_name = nodes[node_id].label;
        in_degree_score = in_degree_scores[node_index];
        community = nodes[node_id].community;
        type = nodes[node_id].type;
        push!(df, (Rank = j, NodeID = node_id, PageName = page_name, InDegreeScore = in_degree_score, Community = community, Type = type));
    end
    
    # Make a pretty table -
    pretty_table(
        df;
        backend = :text,
        fit_table_in_display_horizontally = false,
        table_format = TextTableFormat(borders = text_table_borders__compact)
    );
end

#### Comparing in-degree to PageRank

In Task 2, we'll compute PageRank and compare the two rankings side by side. This comparison will reveal concrete examples of where naive in-degree fails and why PageRank provides a richer picture of importance.

> __Why the comparison matters:__
>
> By examining pages that rank high in in-degree but low in PageRank (or vice versa), we can understand what each metric captures:
> * **Pages ranking high in both:** These are genuinely important: they're linked to frequently AND by important sources. Agreement between the two rankings builds confidence.
> * **Pages high in in-degree but low in PageRank:** These are linked to often, but primarily by unimportant pages. The votes are numerous but from weak sources. This reveals the difference between **popularity** (in-degree) and **authority** (PageRank).
> * **Pages low in in-degree but high in PageRank:** These receive few votes, but those votes come from authoritative pages. A single link from a top authority can confer high PageRank. This reveals how importance is transferred through the network.

We'll build the transition matrix in Task 1, compute PageRank in Task 2, and then tabulate these differences.


___

## Task 1: Build the transition matrix

In this task, we'll construct the transition matrix that forms the mathematical foundation of PageRank. We'll build this from the adjacency matrix, then validate its properties.

First, let's convert our edge list into an adjacency matrix representation.

> __What is an adjacency matrix?__
>
> An adjacency matrix $\mathbf{A}$ is an $n \times n$ matrix where $n$ is the number of nodes in the graph. The entry $A_{ij}$ is 1 if there is a directed edge from node $i$ to node $j$, and 0 otherwise. For our web graph, this captures the hyperlink structure: $A_{ij} = 1$ means page $i$ links to page $j$. The adjacency matrix is sparse (contains mostly zeros) for typical web graphs, making sparse matrix representations efficient for computation.

We'll store the adjacency matrix as a sparse matrix to save memory and speed up computations in the `A::SparseMatrixCSC{Int64, Int64}` variable.

In [None]:
A = let

    # initialize -
    A = spzeros(Int, number_of_nodes, number_of_nodes); # sparse adjacency matrix

    # ok, loop over edges to populate the adjacency matrix
    for (i, (from_node, to_node)) ∈ edges
        from_index = findfirst(isequal(from_node), list_of_nodes);
        to_index   = findfirst(isequal(to_node), list_of_nodes);
        A[from_index, to_index] = 1; # unweighted graph
    end
    
    A # return
end;
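
The loop above calls `findfirst` for each endpoint, which scans `list_of_nodes` every time. For larger graphs, one common alternative is to build a node-to-index lookup once and pass the index lists to `sparse`. The following is a minimal sketch on a toy edge list. The names `toy_edges`, `toy_nodes`, `index_of`, and `A_toy` are hypothetical and used only for illustration; they are not part of the lab's dataset or package.

```julia
using SparseArrays

# toy data (hypothetical): three pages and four directed links
toy_edges = Dict(1 => ("a", "b"), 2 => ("a", "c"), 3 => ("b", "c"), 4 => ("c", "a"));
toy_nodes = ["a", "b", "c"]; # sorted list of node identifiers

# build the lookup once: node identifier -> row/column index
index_of = Dict(node => i for (i, node) ∈ enumerate(toy_nodes));

# collect the row (source) and column (target) index of every edge
rows = [index_of[from_node] for (from_node, _) ∈ values(toy_edges)];
cols = [index_of[to_node] for (_, to_node) ∈ values(toy_edges)];

# assemble the sparse adjacency matrix; `max` keeps repeated edges at 1 instead of summing them
n = length(toy_nodes);
A_toy = sparse(rows, cols, ones(Int, length(rows)), n, n, max);
```

Each dictionary lookup takes roughly constant time, so the cost of assembling the matrix grows with the number of edges rather than with the number of edges times the number of nodes.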

The sparse adjacency matrix $\mathbf{A}$ is now constructed. Let's examine its structure:

In [None]:
A

### Computing the transition matrix

From the adjacency matrix, we can compute the transition matrix. 

> __What is a transition matrix?__
> 
> A transition matrix $\mathbf{P}\in\mathbb{R}^{n \times n}$ is a row-stochastic matrix that describes how a system transitions between states in a Markov chain. For our web graph application, each entry $P_{ij}$ represents the probability that a random surfer at page $i$ moves to page $j$ in one step. The entries of the transition matrix are computed as follows:
> $$
\begin{align*}
P_{ij} = \begin{cases}
\frac{A_{ij}}{k_{i}} & \text{if page } i \text{ has } k_{i} > 0 \text{ outgoing links} \\
\frac{1}{n} & \text{if page } i \text{ is a dangling node (no outgoing links)}
\end{cases}
\end{align*}
> $$
> where $k_{i} = \sum_{j=1}^{n}A_{ij}$ is the out-degree of page $i$ (the number of outgoing links from page $i$), and $n$ is the total number of pages (nodes) in the graph. This normalization ensures each row sums to one, making it a valid probability distribution.
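
As a small worked example of this rule, consider a hypothetical three-page graph (not taken from the lab's dataset) in which page 1 links to pages 2 and 3, page 2 links to page 3, and page 3 has no outgoing links:

$$
\begin{align*}
\mathbf{A} = \begin{bmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}
\quad\Longrightarrow\quad
\mathbf{P} = \begin{bmatrix} 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{bmatrix}
\end{align*}
$$

Row 1 is divided by its out-degree of 2, row 2 by its out-degree of 1, and the dangling row 3 is replaced by the uniform value $1/n = 1/3$, so every row sums to one.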

Let's compute the transition matrix:

In [None]:
P = let

    # initialize -
    P = spzeros(Float64, number_of_nodes, number_of_nodes); # sparse transition matrix

    # loop over rows of A to compute the transition matrix
    for i ∈ 1:number_of_nodes
        row_sum = sum(A[i, :]); # sum of the i-th row
        if row_sum != 0
            P[i, :] .= A[i, :] ./ row_sum; # normalize the row (fancy! what is .= doing here?)
        else
            P[i, :] .= 1.0 / number_of_nodes; # handle dangling nodes (no outgoing edges)
        end
    end

    P # return
end

__Check__: If this is correct, then each row of the transition matrix should sum to one. You can verify this by summing the rows of the transition matrix and checking if they equal one.

In [None]:
let

    # initialize -
    number_of_nodes = size(P, 1);

    for i ∈ 1:number_of_nodes
        row_sum = sum(P[i, :]);
        @assert isapprox(row_sum, 1.0; atol=1e-8) "Row $i does not sum to 1, sum = $row_sum";
    end
end

Let's take a look at the first row of the transition matrix.

In [None]:
P[1,:]

___

## Task 2: Compute PageRank using power iteration and compare to naive ranking
In this task, we will compute the largest eigenpair of $\mathbf{P}^{\top}$, the transpose of the transition matrix, using the power iteration method. The normalized eigenvector gives the PageRank scores.

We've implemented [the power-iteration method pseudo-code](CHEME-5820-PowerIteration-Algorithm-Spring-2026.ipynb) in the [`Eigendecomposition.jl` file in the `src` directory](src/Eigendecomposition.jl). After computing PageRank, we'll compare the results directly to the in-degree ranking computed in the setup section, revealing exactly where naive link counting fails and why the iterative, weighted approach of PageRank is superior.

### Computing the stationary distribution

> __Connection to Markov Processes:__
>
> The transition matrix $\mathbf{P}$ defines a discrete-time Markov chain on the web graph, and a central question is whether a stationary distribution exists:
>
> * __Stationary Distribution:__ The stationary distribution $\boldsymbol{\pi}$ is a probability vector that satisfies $\boldsymbol{\pi}^{\top}\mathbf{P} = \boldsymbol{\pi}^{\top}$, meaning it remains unchanged after one step of the random walk. By transposing both sides, we get the eigenvalue equation $\mathbf{P}^{\top}\boldsymbol{\pi} = \boldsymbol{\pi}$, showing that $\boldsymbol{\pi}$ is an eigenvector of $\mathbf{P}^{\top}$ with eigenvalue $\lambda = 1$.
>
> For an irreducible, aperiodic chain, the corresponding eigenvector has all positive entries and is unique up to scalar multiplication. Thus, when properly normalized, this eigenvector represents the long-run fraction of time a random surfer spends at each webpage.
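
To make the eigenvalue equation concrete, here is a minimal sketch that checks it on a toy three-state chain using a direct eigendecomposition from the `LinearAlgebra` standard library. The matrix `P_toy` and the other names are hypothetical and chosen only for illustration; the toy chain is irreducible and aperiodic, so the conditions above apply.

```julia
using LinearAlgebra

# toy transition matrix (hypothetical): each row sums to one
P_toy = [0.50 0.50 0.00;
         0.25 0.50 0.25;
         0.00 0.50 0.50];

# eigendecomposition of the transpose; pick the eigenvalue with the largest real part
F = eigen(Matrix(transpose(P_toy)));
k = argmax(real.(F.values));
λ_toy = real(F.values[k]);

# rescale the eigenvector so its entries sum to one
v = real.(F.vectors[:, k]);
π_toy = v ./ sum(v);

# stationarity check: one more step of the walk should leave π_toy unchanged
residual = norm(transpose(P_toy) * π_toy - π_toy, 1);
```

For this toy chain the stationary distribution can also be found by hand: $\boldsymbol{\pi} = (1/4, 1/2, 1/4)$, which the computed `π_toy` should match up to floating-point error.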

The algorithm iterates until the change in the eigenvector estimate (measured by the L1-norm) falls below a tolerance of $\epsilon = 10^{-8}$. 
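
The `poweriteration` function lives in `src/Eigendecomposition.jl` and is not reproduced in this notebook. The following is a minimal sketch of the idea, not the course implementation: the name `power_iteration_sketch`, its keyword arguments, and the Rayleigh-quotient estimate of the eigenvalue are assumptions made for illustration. It uses the L1-norm stopping rule described above.

```julia
using LinearAlgebra

# hypothetical helper: repeatedly apply M and renormalize until the estimate stops changing
function power_iteration_sketch(M::AbstractMatrix, v0::AbstractVector; maxiter::Int = 1000, ϵ::Float64 = 1e-8)
    v = v0 ./ norm(v0, 1); # start from a vector with unit L1-norm
    for _ ∈ 1:maxiter
        w = M * v;                     # one multiplication by the matrix
        w ./= norm(w, 1);              # renormalize to keep the entries bounded
        converged = norm(w - v, 1) < ϵ; # L1-norm of the change
        v = w;
        converged && break;
    end
    λ = dot(v, M * v) / dot(v, v); # Rayleigh-quotient estimate of the eigenvalue (assumption)
    return (value = λ, vector = v);
end

# toy transition matrix (hypothetical); we iterate with its transpose
P_toy = [0.50 0.50 0.00;
         0.25 0.50 0.25;
         0.00 0.50 0.50];
result = power_iteration_sketch(Matrix(transpose(P_toy)), [1.0, 0.0, 0.0]);
```

Because the columns of the transposed matrix sum to one and the starting vector is non-negative, the L1 normalization keeps the iterate a probability vector at every step in this toy case.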

Let's run the power iteration:

In [None]:
λ̂,v̂ = let

    # initialize -
    max_iterations = 1000;
    tolerance      = 1e-8;
    v = rand(number_of_nodes); # random initial eigenvector
    v .= v ./ norm(v, 1);      # normalize
    A = transpose(P) |> Matrix;  # we want the left eigenvector, so we work with the transpose

    # call the power iteration method
    result = poweriteration(A, v; maxiter = max_iterations, ϵ = tolerance);

    (result.value, result.vector) # return
end

What is the largest eigenvalue and corresponding eigenvector of the transition matrix $\mathbf{P}$? Theory tells us that $\hat{\lambda} = 1$ should be the largest eigenvalue of a transition matrix. Let's see if our computation agrees with this.

In [None]:
@assert isapprox(λ̂, 1.0; atol=1e-4) # adjust atol to find the max permissible error

Next, we need to normalize the eigenvector to obtain the stationary distribution (PageRank scores).

> __Why normalize the eigenvector?__
> 
> The power iteration returns an eigenvector $\hat{\mathbf{v}}$ satisfying $\mathbf{P}^{\top}\hat{\mathbf{v}} = \hat{\mathbf{v}}$, but eigenvectors are only defined up to scalar multiplication: if $\mathbf{v}$ is an eigenvector, then so is $c\mathbf{v}$ for any nonzero scalar $c$. The power iteration algorithm normalizes at each step to prevent numerical overflow, but the final vector is not necessarily a probability distribution. To interpret $\hat{\mathbf{v}}$ as a stationary distribution $\boldsymbol{\pi}$, we require that its entries sum to one: $\sum_{i=1}^{n}\pi_{i} = 1$, where $\pi_{i}$ denotes the $i$-th entry of $\boldsymbol{\pi}$ representing the fraction of time spent at node $i$.
>
> The normalization $\hat{\boldsymbol{\pi}} = \hat{\mathbf{v}}/(\mathbf{1}^{\top}\hat{\mathbf{v}})$, where $\mathbf{1}$ is a vector of ones, ensures this property while preserving the relative magnitudes that encode webpage importance.

Let's compute the stationary distribution, which we save in the `π̂::Array{Float64,1}` variable:

In [None]:
π̂ = let
    
    # initialize -
    ones_vector = ones(number_of_nodes);
    T = dot(ones_vector, v̂); # normalization factor
    π̂  = v̂ ./ T # return
end

__Check__: The entries of the stationary distribution $\hat{\pi}$ should sum to one. You can verify this by summing the entries of $\hat{\pi}$.

In [None]:
@assert isapprox(sum(π̂), 1.0; atol=1e-8) # check that the entries sum to one

Ok, but what does this stationary distribution represent in the context of our web graph? 

> __PageRank and Stationary Distribution__
>
> In the context of PageRank, the stationary distribution $\hat{\pi}$ represents the long-term behavior of a random surfer navigating the web graph. Each entry $\hat{\pi}_i$ in the stationary distribution corresponds to the probability of being at node (webpage) $i$ after a large number of steps in a random walk on the graph.
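
The random-surfer interpretation can be checked empirically. Here is a minimal sketch that simulates a long random walk on a toy three-state chain and records the fraction of steps spent in each state. The function `simulate_surfer`, the matrix `P_toy`, the seed, and the step count are all hypothetical choices made for illustration.

```julia
using Random

# toy transition matrix (hypothetical); its stationary distribution is [0.25, 0.5, 0.25]
P_toy = [0.50 0.50 0.00;
         0.25 0.50 0.25;
         0.00 0.50 0.50];

# hypothetical helper: walk for number_of_steps steps and count visits to each page
function simulate_surfer(P::AbstractMatrix, number_of_steps::Int; rng = MersenneTwister(1234))
    n = size(P, 1);
    cumulative = cumsum(P, dims = 2); # row-wise cumulative probabilities
    visits = zeros(Int, n);
    state = 1; # start at page 1
    for _ ∈ 1:number_of_steps
        r = rand(rng); # uniform draw in [0, 1)
        # next state = first column whose cumulative probability exceeds r
        state = min(searchsortedlast(view(cumulative, state, :), r) + 1, n);
        visits[state] += 1;
    end
    return visits ./ number_of_steps; # fraction of time spent at each page
end

visit_fraction = simulate_surfer(P_toy, 200_000);
```

With enough steps, `visit_fraction` should approach the stationary distribution of the toy chain, although any single run will differ slightly because the walk is random.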

Let's find the most important webpage:

In [None]:
nodeid = let
    i = argmax(π̂);
    nodeid = list_of_nodes[i];
    println("The most important webpage is: $(list_of_nodes[i]) with PageRank score $(π̂[i])");
    nodeid;
end;

That `nodeid::String` variable holds the identifier of the most important webpage according to the PageRank analysis. What does this correspond to in the original dataset? You can look it up in the `nodes` dictionary.

In [None]:
let
    println(nodes[nodeid])
end

> __Interpreting PageRank Scores:__
>
> How should we interpret the entries of the stationary distribution $\hat{\pi}$ in terms of webpage importance?
> * __PageRank:__ The PageRank score $\pi_{i}$ represents the fraction of time a random surfer spends at webpage $i$, providing a measure of structural importance rather than content quality. A page achieves high PageRank either by receiving many incoming links or by receiving links from other high-PageRank pages. 
>
> Since the algorithm distributes a page's importance equally among its outgoing links, a link from a page with few outgoing links carries more weight than one from a page with many outgoing links.
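
Writing out one component of the stationarity condition $\boldsymbol{\pi}^{\top}\mathbf{P} = \boldsymbol{\pi}^{\top}$ makes this dilution explicit. As a sketch that ignores dangling nodes, and writing $k_{i}$ for the out-degree of page $i$:

$$
\begin{align*}
\pi_{j} = \sum_{i=1}^{n}\pi_{i}P_{ij} = \sum_{i\,:\,A_{ij}=1}\frac{\pi_{i}}{k_{i}}
\end{align*}
$$

Each linking page $i$ passes the share $\pi_{i}/k_{i}$ to page $j$. For instance, with illustrative numbers, a page with $\pi_{i} = 0.2$ and $k_{i} = 2$ contributes $0.1$ through each link, whereas a page with the same score and $k_{i} = 20$ contributes only $0.01$.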

Let's look at the details of the top $n$ most important webpages [using the `PrettyTables.jl` package](https://github.com/ronisbr/PrettyTables.jl):

In [None]:
let

    # initialize
    number_of_sites_to_display = 10;
    i = sortperm(π̂; rev=true)[1:number_of_sites_to_display] # indices of the top 10 most important webpages
    df = DataFrame(); # hold the data for the table

    for j ∈ 1:number_of_sites_to_display
        node_index = i[j];
        node_id = list_of_nodes[node_index];
        page_rank_score = π̂[node_index];
        label = nodes[node_id].label;
        community = nodes[node_id].community;
        type = nodes[node_id].type;
        push!(df, (Rank = j, NodeID = node_id, PageName = label, Community = community, Type = type, PageRankScore = page_rank_score));
    end
    
    # make the table -
    pretty_table(
        df;
        backend = :text,
        fit_table_in_display_horizontally = false,
        table_format = TextTableFormat(borders = text_table_borders__compact)
    );
end

### Comparing PageRank results to in-degree ranking

Now that we've computed PageRank scores, let's compare them directly to the naive in-degree baseline computed in the setup section. This side-by-side analysis reveals why PageRank is more sophisticated.

> __What the comparison reveals:__
>
> Three categories of pages emerge from the comparison:
> * **High in both PageRank and in-degree:** Pages linked frequently AND by important sources. These represent genuine consensus about importance.
> * **High in in-degree but low in PageRank:** Pages receiving many links, but primarily from unimportant pages. These are *popular* but not *authoritative*. This reveals that **in-degree conflates volume of links with quality of sources**.
> * **Low in in-degree but high in PageRank:** Pages receiving few links, but from important pages. A single link from an authority confers significant PageRank. This shows **how importance is transferred through the network**. You gain authority not just from link quantity, but from the authority of your linkers.

Let's create a side-by-side comparison of the top ranked pages:

In [None]:
let
    
    # rank every page under each metric (1 = best); invperm turns a sort order into ranks
    pagerank_order = sortperm(π̂; rev=true);
    indegree_order = sortperm(in_degree_scores; rev=true);
    pagerank_ranks = invperm(pagerank_order);
    indegree_ranks = invperm(indegree_order);

    # show every page that is in the top 10 under either metric
    all_indices = union(pagerank_order[1:10], indegree_order[1:10]);

    # integer ranks for every row keep the column types consistent across push! calls
    df = DataFrame();
    for idx ∈ sort(all_indices) # list_of_nodes is sorted, so rows are ordered by node id
        node_id = list_of_nodes[idx];
        push!(df, (
            NodeID = node_id,
            PageName = nodes[node_id].label,
            PageRankRank = pagerank_ranks[idx],
            PageRankScore = round(π̂[idx]; digits=5),
            InDegreeRank = indegree_ranks[idx],
            InDegreeScore = round(in_degree_scores[idx]; digits=5)
        ));
    end
    
    println("\n=== Comparing PageRank vs. In-Degree Rankings ===\n");
    pretty_table(
        df;
        backend = :text,
        fit_table_in_display_horizontally = false,
        table_format = TextTableFormat(borders = text_table_borders__compact)
    );
end


#### Interpreting the differences

Look at the comparison table carefully. Notice:

* **Pages appearing in both top-10 lists:** These have earned high rankings through both metrics. They're likely genuinely important hubs or authorities in the network.

* **Pages with large rank gaps:** If a page ranks #2 in PageRank but #15+ in in-degree (or vice versa), this reveals how the two metrics diverge. Pages that jump high in PageRank despite moderate in-degree typically receive links from high-authority sources. Conversely, pages high in in-degree but lower in PageRank likely attract links from many low-authority pages.

* **Why PageRank wins:** The transition matrix approach accounts for two critical factors that in-degree ignores:
  - **Source quality:** Links from important pages carry more weight than links from obscure pages.
  - **Dilution by out-degree:** A page that links to 1000 sites distributes its importance across all of them; a page linking to only 5 sites concentrates its importance on those 5. This prevents manipulation by pages that indiscriminately link to everything.

This comparison illustrates why PageRank became important to search engines: it is harder to game than simple link counting, and it reflects a deeper notion of network importance.
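
The two factors listed above can be seen on a graph small enough to check by hand. Here is a minimal sketch on a hypothetical five-page graph (not the lab's dataset) that computes both the in-degree and the stationary distribution; for simplicity it uses a direct eigendecomposition rather than power iteration, and all names are illustrative.

```julia
using LinearAlgebra

# toy adjacency matrix (hypothetical): A_toy[i, j] = 1 means page i links to page j
A_toy = [0 1 0 0 0;
         1 0 0 1 1;
         1 0 0 0 0;
         1 0 1 0 0;
         1 0 1 0 0];

# naive score: in-degree is the column sum of the adjacency matrix
in_degree_toy = vec(sum(A_toy, dims = 1));

# transition matrix: divide each row by its out-degree (no dangling nodes in this toy graph)
P_toy = A_toy ./ sum(A_toy, dims = 2);

# stationary distribution from the eigenvector of the transpose with eigenvalue one
F = eigen(Matrix(transpose(P_toy)));
k = argmax(real.(F.values));
v = real.(F.vectors[:, k]);
pagerank_toy = v ./ sum(v);
```

In this toy graph, page 2 has a single incoming link, but it comes from page 1, the most linked-to page, which links nowhere else. Solving the stationarity equations by hand gives page 2 a score of $1/3$, while page 3, with two incoming links from low-scoring pages that split their weight, gets only $1/9$.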


___

## Summary

This lab shows why naive link counting is insufficient for ranking importance in networks, and how PageRank uses iteration and weighted links based on Markov chain theory.

> __Key Takeaways:__
>
> * **Naive in-degree ranking fails because it treats all votes equally:** Link counting ignores source quality and is vulnerable to manipulation. PageRank's weighted approach addresses this by making importance depend on the authority of your linkers, not just link quantity.
> * **Transition matrices and stationary distributions encode network importance:** Normalizing adjacency matrices by out-degree creates a stochastic matrix whose stationary distribution (eigenvalue $\lambda = 1$) represents the long-run visiting probability in a random walk, which is what PageRank computes.
> * **Power iteration makes PageRank computable at scale:** Rather than computing a full eigendecomposition, power iteration approximates the dominant eigenvector using only repeated matrix-vector multiplications, which is what makes PageRank feasible for billions of pages.

PageRank recognizes that importance in networks is both local (the number of incoming links) and global (the quality of the pages that link to you), which is why it remains useful in modern ranking algorithms.