# Chapter 11 - Web Searches with PageRank

This notebook contains code accompanying Chapter 11 Web Searches with PageRank in *Practical Discrete Mathematics* by Ryan T. White and Archana Tikayat Ray.

## Google PageRank II

### Example: Computing one PageRank update

In [1]:
# import the NumPy library
import numpy

# transition probability matrix
A = numpy.array([[0, 0.25, 0.25, 0.25, 0.25],
                 [0.5, 0, 0, 0.5, 0],
                 [0.33, 0, 0, 0.33, 0.33],
                 [1, 0, 0, 0, 0],
                 [0, 0, 0, 1, 0]])

# initialize the PageRank vector
v = numpy.array([[0.2], [0.2], [0.2], [0.2], [0.2]])

# the damping factor
d = 0.85

# the size of the "Internet"
N = 5

# compute the update matrix
U = d * A.T + (1 - d) / N

# compute the new PageRank vector
v = numpy.dot(U, v)

# print the new PageRank vector
print(v)

[[0.3411]
 [0.0725]
 [0.0725]
 [0.3836]
 [0.1286]]
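
As a check on the first entry of the output above, the update can be written out by hand. Because `(1 - d) / N` is added to every entry of `d * A.T`, and the entries of the initial vector sum to 1, the first component works out as follows. This is a worked sketch using the rounded value 0.33 from the code; the symbol $v_1$ for the first component is our own notation.

$$
v_1 = 0.85\,(0 \cdot 0.2 + 0.5 \cdot 0.2 + 0.33 \cdot 0.2 + 1 \cdot 0.2 + 0 \cdot 0.2) + \frac{0.15}{5}\,(5 \cdot 0.2) = 0.85 \cdot 0.366 + 0.03 = 0.3411
$$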


Below, we loop this calculation and do 15 PageRank updates.

In [2]:
# initialize the PageRank vector
v = numpy.array([[0.2], [0.2], [0.2], [0.2], [0.2]])

# print the initial vector
print('PageRank vector', 0, 'is', v.T)

# compute the PageRank vector for 15 iterations
for i in range(15):
    # compute the next PageRank vector
    v = numpy.dot(U, v)

    # round the PageRank vector to 3 places
    v = numpy.round(v, 3)

    # print the PageRank vector
    print('PageRank vector', i + 1, 'is', v.T)

PageRank vector 0 is [[0.2 0.2 0.2 0.2 0.2]]
PageRank vector 1 is [[0.341 0.073 0.073 0.384 0.129]]
PageRank vector 2 is [[0.408 0.102 0.102 0.264 0.123]]
PageRank vector 3 is [[0.326 0.117 0.117 0.293 0.145]]
PageRank vector 4 is [[0.362 0.099 0.099 0.305 0.132]]
PageRank vector 5 is [[0.359 0.107 0.107 0.289 0.135]]
PageRank vector 6 is [[0.351 0.106 0.106 0.296 0.136]]
PageRank vector 7 is [[0.356 0.104 0.104 0.295 0.134]]
PageRank vector 8 is [[0.354 0.105 0.105 0.293 0.135]]
PageRank vector 9 is [[0.353 0.105 0.105 0.294 0.134]]
PageRank vector 10 is [[0.354 0.105 0.105 0.293 0.134]]
PageRank vector 11 is [[0.353 0.105 0.105 0.293 0.134]]
PageRank vector 12 is [[0.353 0.105 0.105 0.293 0.134]]
PageRank vector 13 is [[0.353 0.105 0.105 0.293 0.134]]
PageRank vector 14 is [[0.353 0.105 0.105 0.293 0.134]]
PageRank vector 15 is [[0.353 0.105 0.105 0.293 0.134]]


Notice that the vectors converge to a fixed vector. This happens whenever the linking structure of the "Internet" does not change. In practice, the updates are run until they converge, and the whole computation is repeated periodically as new information is gained from the real Internet, which is changing all the time.
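
One way to see why the iterations settle down is that the limit is a vector the update matrix leaves unchanged, that is, an eigenvector of `U` with eigenvalue 1. The sketch below is an illustration rather than part of the chapter's code: it uses the exact fractions 1/3 instead of 0.33 so that each row of `A` sums to 1, and compares 100 repeated updates with the eigenvector returned by `numpy.linalg.eig`.

```python
import numpy

# the 5-page transition probability matrix, with exact fractions
# so that every row sums to 1
A = numpy.array([[0, 1/4, 1/4, 1/4, 1/4],
                 [1/2, 0, 0, 1/2, 0],
                 [1/3, 0, 0, 1/3, 1/3],
                 [1, 0, 0, 0, 0],
                 [0, 0, 0, 1, 0]])

d = 0.85
N = A.shape[0]
U = d * A.T + (1 - d) / N

# repeated PageRank updates starting from the uniform vector
v = numpy.ones(N) / N
for _ in range(100):
    v = U @ v

# eigenvector of U whose eigenvalue is closest to 1,
# rescaled so that its entries sum to 1
eigenvalues, eigenvectors = numpy.linalg.eig(U)
k = numpy.argmin(numpy.abs(eigenvalues - 1))
w = numpy.real(eigenvectors[:, k])
w = w / w.sum()

print(numpy.round(v, 3))
print(numpy.round(w, 3))
```

The two printed vectors should agree, which is the sense in which the PageRank vector is a "steady state" of the update.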

## Implementing PageRank in Python

In [3]:
# The PageRank algorithm for ranking search results
#
# INPUTS
# A - the transition probability matrix
# d - the damping factor, default = 0.85
# eps - the error threshold, default = 0.0005
# maxIterations - the maximum number of iterations to run before stopping, default = 1000
# verbose - if True, the algorithm prints the progress of PageRank, default = False
#
# OUTPUTS
# vNew - the steady state PageRank vector
# i - the number of iterations performed

def PageRank(A, d = 0.85, eps = 0.0005, maxIterations = 1000,
             verbose = False):
    # find the size of the "Internet"
    N = A.shape[0]

    # initialize the old and new PageRank vectors
    vOld = numpy.ones([N])
    vNew = numpy.ones([N])/N

    # initialize a counter
    i = 0

    # compute the update matrix
    U = d * A.T + (1 - d) / N

    while numpy.linalg.norm(vOld - vNew) >= eps:
        # if the verbose flag is true, print the progress at each iteration
        if verbose:
            print('At iteration', i, 'the error is',
                  numpy.round(numpy.linalg.norm(vOld - vNew), 3),
                  'with PageRank', numpy.round(vNew, 3))

        # save the current PageRank as the old PageRank
        vOld = vNew

        # update the PageRank vector
        vNew = numpy.dot(U, vOld)

        # increment the counter
        i += 1

        # if it runs too long before converging, stop and notify the user
        if i == maxIterations:
            print('The PageRank algorithm ran for',
                  maxIterations, 'iterations with error',
                  numpy.round(numpy.linalg.norm(vOld - vNew), 3))

            # return the PageRank vector and the iteration number
            return vNew, i
            return vNew, i

    # return the steady state PageRank vector and iteration number
    return vNew, i
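
The update matrix `U` built inside `PageRank` is dense: adding the scalar `(1 - d) / N` fills in every entry, even where `A` is zero. A possible alternative, sketched below under the assumption that only the product `U @ v` is needed, is to apply the two terms separately. The helper name `pagerank_update` is hypothetical and not part of the chapter's code.

```python
import numpy

def pagerank_update(A, v, d=0.85):
    """One PageRank update computed without forming the dense matrix U."""
    N = A.shape[0]
    # adding (1 - d) / N to every entry of d * A.T contributes
    # (1 - d) / N * sum(v) to every component of the product
    return d * (A.T @ v) + (1 - d) / N * v.sum()

# compare against the dense update on the 5-page example
A = numpy.array([[0, 1/4, 1/4, 1/4, 1/4],
                 [1/2, 0, 0, 1/2, 0],
                 [1/3, 0, 0, 1/3, 1/3],
                 [1, 0, 0, 0, 0],
                 [0, 0, 0, 1, 0]])
v = numpy.ones(5) / 5
U = 0.85 * A.T + (1 - 0.85) / 5

print(numpy.allclose(pagerank_update(A, v), U @ v))
```

Both forms give the same vector; the second simply avoids storing an extra N-by-N array, which starts to matter once N is in the thousands.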

Below, we run the PageRank algorithm with default settings.

In [4]:
# transition probability matrix
A = numpy.array([[0, 1/4, 1/4, 1/4, 1/4],
                 [1/2, 0, 0, 1/2, 0],
                 [1/3, 0, 0, 1/3, 1/3],
                 [1, 0, 0, 0, 0],
                 [0, 0, 0, 1, 0]])

# Run the PageRank algorithm with default settings
PageRank(A, verbose = True)

At iteration 0 the error is 1.789 with PageRank [0.2 0.2 0.2 0.2 0.2]
At iteration 1 the error is 0.303 with PageRank [0.342 0.073 0.073 0.384 0.129]
At iteration 2 the error is 0.144 with PageRank [0.408 0.103 0.103 0.264 0.123]
At iteration 3 the error is 0.092 with PageRank [0.327 0.117 0.117 0.294 0.146]
At iteration 4 the error is 0.047 with PageRank [0.363 0.099 0.099 0.306 0.133]
At iteration 5 the error is 0.019 with PageRank [0.361 0.107 0.107 0.29  0.135]
At iteration 6 the error is 0.011 with PageRank [0.352 0.107 0.107 0.297 0.137]
At iteration 7 the error is 0.007 with PageRank [0.358 0.105 0.105 0.297 0.135]
At iteration 8 the error is 0.003 with PageRank [0.357 0.106 0.106 0.295 0.136]
At iteration 9 the error is 0.001 with PageRank [0.356 0.106 0.106 0.296 0.136]
At iteration 10 the error is 0.001 with PageRank [0.357 0.106 0.106 0.296 0.136]


(array([0.3565286 , 0.10584025, 0.10584025, 0.29600666, 0.13578424]), 11)

After webpage W3 becomes popular, let's run PageRank again and see what changes.

In [5]:
# transition probability matrix
B = numpy.array([[0, 1/4, 1/4, 1/4, 1/4],
                 [1/3, 0, 1/3, 1/3, 0],
                 [1/3, 0, 0, 1/3, 1/3],
                 [1/2, 0, 1/2, 0, 0],
                 [0, 0, 1/2, 1/2, 0]])

# Run the PageRank algorithm with default settings
PageRank(B, verbose = True)

At iteration 0 the error is 1.789 with PageRank [0.2 0.2 0.2 0.2 0.2]
At iteration 1 the error is 0.192 with PageRank [0.228 0.073 0.299 0.271 0.129]
At iteration 2 the error is 0.06 with PageRank [0.25  0.079 0.269 0.239 0.163]
At iteration 3 the error is 0.026 with PageRank [0.23  0.083 0.276 0.251 0.159]
At iteration 4 the error is 0.01 with PageRank [0.239 0.079 0.277 0.248 0.157]
At iteration 5 the error is 0.004 with PageRank [0.236 0.081 0.275 0.248 0.159]
At iteration 6 the error is 0.001 with PageRank [0.236 0.08  0.276 0.249 0.158]
At iteration 7 the error is 0.001 with PageRank [0.237 0.08  0.276 0.249 0.159]


(array([0.2365497 , 0.08030807, 0.27603383, 0.24860661, 0.15850179]), 8)

Here, the PageRank of webpage W3 rises from about 0.11 to about 0.28, a large increase due to its increased popularity.
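
To see how the ordering of the pages changes, and not only the score of W3, the two steady state vectors can be sorted. The sketch below copies the vectors from the outputs above; the labels `W1` to `W5` assume that webpage W(k) corresponds to row k - 1 of the matrix, which matches W3 having the third entry.

```python
import numpy

# steady state vectors copied from the two PageRank runs above
before = numpy.array([0.3565286, 0.10584025, 0.10584025, 0.29600666, 0.13578424])
after = numpy.array([0.2365497, 0.08030807, 0.27603383, 0.24860661, 0.15850179])

# assumed labels: W(k) is row k - 1 of the transition probability matrix
labels = numpy.array(['W1', 'W2', 'W3', 'W4', 'W5'])

# indices from highest to lowest PageRank; the stable sort keeps
# tied pages in index order
order_before = numpy.argsort(-before, kind='stable')
order_after = numpy.argsort(-after, kind='stable')

print('before:', labels[order_before])
print('after: ', labels[order_after])
```

W3 moves from a tie for last place with W2 to first place, while W1 drops from first to third.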

## Applying the Algorithm to Real Data

We have a file `California.txt`, which has a real Internet dataset of 9664 webpages containing the word "California" with an adjacency list representing links between the webpages. Let's read that into a `pandas` dataframe.

In [6]:
# import the pandas library
import pandas

# read the txt file into a dataframe
data = pandas.read_csv("California.txt", delimiter=' ')

# display the dataframe
data

| | Type | Source | Destination |
|---|---|---|---|
| 0 | n | 0 | http://www.berkeley.edu/ |
| 1 | n | 1 | http://www.caltech.edu/ |
| 2 | n | 2 | http://www.realestatenet.com/ |
| 3 | n | 3 | http://www.ucsb.edu/ |
| 4 | n | 4 | http://www.washingtonpost.com/wp-srv/national/... |
| ... | ... | ... | ... |
| 25809 | e | 9663 | 1806 |
| 25810 | e | 9663 | 266 |
| 25811 | e | 9663 | 7905 |
| 25812 | e | 9663 | 70 |


Next, we preprocess the data to extract the adjacency list: we keep only the rows of type "e", drop the `Type` column, convert the remaining numerical portion to a `NumPy` array, and store the numbers as integers.

In [7]:
# preprocess the data

# select only the rows with type 'e'
adjacencies = data.loc[data['Type'] == 'e']

# drop the 'Type' column
adjacencies = adjacencies.drop(columns = 'Type')

# convert the adjacency list to a NumPy array
adjacencies = adjacencies.to_numpy()

# convert the adjacency list to integers
adjacencies = adjacencies.astype('int')

# print the adjacency list
print(adjacencies)

[[   0  449]
 [   0  450]
 [   0  451]
 ...
 [9663 7905]
 [9663   70]
 [9663 7907]]


Next, let’s convert the adjacency list to an adjacency matrix.

In [8]:
# convert the adjacency list to an adjacency matrix

# find the number of webpages and initialize A
N = numpy.max(adjacencies) + 1
A = numpy.zeros([N, N])

# iterate over the rows of the adjacency list
for k in range(adjacencies.shape[0]):
    # find the adjacent vertex numbers
    i, j = adjacencies[k,]

    # put 1 in the adjacency matrix
    A[i, j] = 1

Next, we need to convert $\mathbf{A}$ to the transition probability matrix by dividing each 1 corresponding to an outgoing link by the total number of outgoing links from that webpage. In other words, we divide each row by its row sum.

In [9]:
# convert A to the transition probability matrix

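# compute the row sums as a column vector
rowSums = A.sum(axis = 1)[:,None]

# divide each row of A by its row sum in place; rows that sum to zero
# (webpages with no outgoing links) are skipped and stay all zeros
A = numpy.divide(A, rowSums, out = A, where = rowSums != 0)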
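
The row normalization is easier to follow on a tiny case. The sketch below uses a hypothetical 3-page adjacency matrix in which page 2 has no outgoing links. Passing an `out` array to `numpy.divide` matters here: entries skipped by `where` keep whatever value `out` already holds, so starting from zeros guarantees that the skipped row is all zeros.

```python
import numpy

# hypothetical 3-page adjacency matrix; page 2 has no outgoing links
A = numpy.array([[0., 1., 1.],
                 [1., 0., 0.],
                 [0., 0., 0.]])

# row sums as a column vector so they broadcast across each row
rowSums = A.sum(axis=1)[:, None]

# divide each row by its sum, leaving zero-sum rows as zeros
P = numpy.divide(A, rowSums, out=numpy.zeros_like(A), where=rowSums != 0)

print(P)
print(P.sum(axis=1))
```

Rows 0 and 1 of `P` sum to 1, while the row for the page with no outgoing links sums to 0.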

Next, let’s run `PageRank` on this transition probability matrix.

In [10]:
# run PageRank
v, i = PageRank(A)

# print the steady state PageRank vector and iteration number
print(v)
print(i)

[2.79688870e-05 6.29671046e-06 2.06171425e-07 ... 9.48337601e-08
 9.48337601e-08 9.48337601e-08]
14


As we see, PageRank converges to a steady state PageRank vector in 14 iterations on this large 9664-by-9664 transition probability matrix. Next, we sort the PageRanks from highest to lowest and save the indices of the sorted list.

In [11]:
# find the indices that sort the PageRanks in ascending order
ranks = numpy.argsort(v)

# reverse the indices so the PageRanks are in descending order
ranks = numpy.flip(ranks)

In [12]:
print(ranks)

[4391 1488  997 ... 5632 5633 9663]
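
Note that `ranks` holds webpage indices, not PageRank values. A small sketch with a hypothetical 5-entry vector shows what `numpy.argsort` followed by `numpy.flip` returns, and how slicing gives the top `k` indices.

```python
import numpy

# a hypothetical PageRank vector for five webpages
v = numpy.array([0.10, 0.40, 0.05, 0.30, 0.15])

# indices from highest to lowest PageRank
ranks = numpy.flip(numpy.argsort(v))

# keep the indices of the top k webpages
k = 3
top_k = ranks[:k]

print(top_k)     # [1 3 4]
print(v[top_k])  # [0.4  0.3  0.15]
```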


Then, let’s return the top 10 webpages containing the word “California.”

In [14]:
# return the URLs of the top few webpages
rankedPages = pandas.DataFrame(columns = ['Type', 'Source', 'Destination'])

# add the top 10-ranked webpages
for i in range(10):
    row = data.loc[(data['Type'] == 'n')
                   & (data['Source'] == ranks[i])]
    rankedPages = pandas.concat([rankedPages, row], ignore_index=True)

# display the top 10
rankedPages.drop(columns = ['Type', 'Source'])

| | Destination |
|---|---|
| 0 | http://search.ucdavis.edu/ |
| 1 | http://www.ucdavis.edu/ |
| 2 | http://www.gene.com/ae/bioforum/ |
| 3 | http://www.lib.uci.edu/ |
| 4 | http://vision.berkeley.edu/VSP/index.shtml |
| 5 | http://www.uci.edu/ |
| 6 | http://www.students.ucr.edu/ |
| 7 | http://spectacle.berkeley.edu/ |
| 8 | http://www.calacademy.org/ |
| 9 | http://www.scag.org |
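
The lookup of URLs by rank can also be done without a loop by indexing the node rows on `Source`. The sketch below is one possible approach, shown on a tiny hypothetical dataframe with the same three columns; the example URLs and the `ranks` array are made up for illustration.

```python
import numpy
import pandas

# a tiny hypothetical dataset with the same columns as California.txt
data = pandas.DataFrame({
    'Type': ['n', 'n', 'n', 'e', 'e'],
    'Source': [0, 1, 2, 0, 1],
    'Destination': ['http://a.example/', 'http://b.example/',
                    'http://c.example/', '1', '2'],
})

# hypothetical webpage indices sorted from highest to lowest PageRank
ranks = numpy.array([2, 0, 1])

# keep the node rows and index them by webpage number
nodes = data.loc[data['Type'] == 'n'].set_index('Source')

# select the URLs of the top 2 webpages, in rank order
top = nodes.loc[ranks[:2], 'Destination'].reset_index(drop=True)

print(top)
```

Selecting with a list of labels through `.loc` returns the rows in the order requested, so the result stays sorted by rank.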
