# Basic exercise 6

In [None]:
from timeit import timeit
import numpy as np
from scipy import linalg

This is a basic exercise in the course Python For Scientists. 
The aim is to get you acquainted with the syntax of `scipy` and `numpy` and give you the necessary skills to tackle more serious problems later on.

Of course, these problems can be solved very easily with AI tools. However, since the goal is to teach you the basics, using AI is not recommended. Try to solve them independently instead.

## Eigenvalue problems at Google

Google's search engine uses, among other methods, network theory to order the search results by relevance.
The concept of PageRank was developed around 1998 by the founders of Google, Sergey Brin and Larry Page. You can find the original paper [here](https://www.sciencedirect.com/science/article/pii/S016975529800110X?via%3Dihub). 
PageRank assigns a number to each node in a network (in the case of Google, a web page) that indicates the relative importance of that node. The PageRank of a web page is based on the number of pages that point to it (i.e. provide a link to it). A page that is referenced by many other pages will be an important web page. In this exercise we will explore a little of what is under the hood at Google. Of course, reality is far more complex than what will be explored here, but you should get some idea of what is happening.

We will work with a toy internet that contains 5 web pages: A, B, C, D and E. 
The web pages are linked in a **directed** network as shown in the following figure: 
<p align="center">
    <img src="Network.png" alt="drawing" width="300" align="center"/>
</p>
An arrow from one page to another means that the page has a link that refers to that other page.

This network can be represented with an adjacency matrix. Element (i, j) of this matrix contains the value 1 if page i points at page j and it contains the value 0 otherwise. 
For our toy internet, we find:

\begin{equation}
\begin{bmatrix}
0&1&0&1&0\\
1&0&1&1&0\\
0&0&0&1&0\\
0&0&0&0&1\\
1&1&0&1&0\\
\end{bmatrix}
\end{equation}

The fact that element (1, 4) in this matrix is 1 tells us that page A points to page D, and the fact that element (4, 1) is 0 tells us that page D does not point to page A.
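
As a small sketch of how this matrix could be held in NumPy (the names `pages` and `adjacency` are our own choice, not part of the exercise): the text counts rows and columns from 1, while NumPy counts from 0, so element (1, 4) becomes `adjacency[0, 3]`.

```python
import numpy as np

pages = ["A", "B", "C", "D", "E"]

# Adjacency matrix of the toy internet: entry (i, j) is 1 if page i points at page j
adjacency = np.array(
    [
        [0, 1, 0, 1, 0],
        [1, 0, 1, 1, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1],
        [1, 1, 0, 1, 0],
    ]
)

a_to_d = adjacency[0, 3]  # element (1, 4) in the 1-based notation of the text
d_to_a = adjacency[3, 0]  # element (4, 1)

# With this convention, column sums count incoming links and row sums outgoing links
incoming = adjacency.sum(axis=0)
outgoing = adjacency.sum(axis=1)

for page, n_in, n_out in zip(pages, incoming, outgoing):
    print(f"{page}: {n_in} incoming, {n_out} outgoing")
```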

It can be shown that the PageRanks of the pages in a network are the entries in the eigenvector that corresponds to the largest eigenvalue of the **reduced** adjacency matrix. 
The reduced adjacency matrix is the adjacency matrix but with every column normalized so that the column sum equals 1.
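
To illustrate the normalization, here is a minimal sketch on a hypothetical 3-page network (not the toy internet of this exercise). It assumes that no column is entirely zero, since such a column cannot be rescaled to sum to 1.

```python
import numpy as np

# Hypothetical adjacency matrix of a 3-page network, for illustration only
adjacency = np.array(
    [
        [0, 1, 1],
        [1, 0, 0],
        [1, 1, 0],
    ],
    dtype=float,
)

# Sum over the rows (axis=0) to get one total per column
column_sums = adjacency.sum(axis=0)

# Broadcasting divides every column by its own sum
reduced = adjacency / column_sums

print(reduced)
print(reduced.sum(axis=0))
```

The division relies on NumPy broadcasting: `column_sums` has shape `(3,)`, so it is applied along the last axis, which is the column index.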


In order to solve some problems, Google uses a slight variation of the reduced adjacency matrix, called
the Google matrix (more info [here](https://en.wikipedia.org/wiki/Google_matrix)), but for simplicity, we
will discard this adaptation and just work with the reduced adjacency matrix.
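
For the curious, the following sketch shows one common way to write down a Google matrix, following the form given on the linked Wikipedia page: a weighted mix of the reduced adjacency matrix and a uniform matrix. The 3-page matrix and the damping value of 0.85 are assumptions made for illustration; nothing in the rest of this exercise depends on them.

```python
import numpy as np

# Hypothetical reduced adjacency matrix of a 3-page network (columns sum to 1)
reduced = np.array(
    [
        [0.0, 0.5, 1.0],
        [0.5, 0.0, 0.0],
        [0.5, 0.5, 0.0],
    ]
)

damping = 0.85  # assumed damping factor, not taken from this exercise
n = reduced.shape[0]

# Mix the link structure with a uniform jump to any page
google = damping * reduced + (1 - damping) / n * np.ones((n, n))

print(google)
print(google.sum(axis=0))  # the columns still sum to 1
```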

### Part 1
Turn the adjacency matrix of our toy internet above into its reduced form and save it as a NumPy array.
You don't have to code the procedure for reducing the adjacency matrix; it is sufficient to hard-code the matrix itself.

Write a function that returns the PageRanks as described above. Note that eigenvalues and eigenvectors are in general complex, so you will have to take the absolute value to sort the eigenvalues and to represent the PageRanks.
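
To see why the absolute value is needed, consider this small sketch with a hypothetical real matrix (unrelated to the toy internet) that has a pair of purely imaginary eigenvalues. Complex numbers have no natural ordering, so we rank them by magnitude instead.

```python
import numpy as np
from scipy import linalg

# Hypothetical real matrix with eigenvalues +1j, -1j and 0.5
M = np.array(
    [
        [0.0, -1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 0.5],
    ]
)

w, v = linalg.eig(M)

# The eigenvalues come back as a complex array, even though M is real
print(w.dtype)

# Magnitudes are real numbers, so they can be sorted and compared
magnitudes = np.abs(w)
order = np.argsort(magnitudes)
print(magnitudes[order])
```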

Calculate the PageRanks of the toy internet. What is the most important web page? Could you infer this from the figure? Do you notice something special? Can you explain this?

In [None]:
# Implement your solution here

In [None]:
A = np.array(
    [
        [0, 1 / 2, 0, 1 / 4, 0],
        [1 / 2, 0, 1, 1 / 4, 0],
        [0, 0, 0, 1 / 4, 0],
        [0, 0, 0, 0, 1],
        [1 / 2, 1 / 2, 0, 1 / 4, 0],
    ]
)


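def pagerank(A):
    """Return the PageRanks of the reduced adjacency matrix A, scaled to sum to 1."""
    # Eigenvalues w and eigenvectors v (stored as columns) are complex in general
    w, v = linalg.eig(A)
    # Select the eigenvalue with the largest absolute value
    index = np.argmax(np.abs(w))
    # Take the magnitudes of the eigenvector entries and rescale them to sum to 1
    pr = np.abs(v[:, index]) / linalg.norm(v[:, index], 1)
    return pr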


print(pagerank(A))

# The most important page is D, which we could infer from
# the graph because this is the page that has the most links pointing to it.
# However, page E has the same PageRank as D although it has only 1 page pointing to it.
# E gains in importance because it is referenced by a very important page, namely D.

### Part 2

SciPy provides a function for calculating both eigenvalues and eigenvectors (`scipy.linalg.eig`), but also a function that only computes eigenvalues (`scipy.linalg.eigvals`). 
Perform both routines on the reduced adjacency matrix from above and determine the execution time (use `timeit` and repeat the calculation 100 000 times to get good time estimates).

You will notice that calculating eigenvectors takes more time. 
However, when you dive into the source code of `eig` and `eigvals`, you notice that `eigvals` just calls the function `eig` with different parameters. Try to explain the difference in execution time by examining the source code.
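
As a hint, the sketch below uses the documented `left` and `right` parameters of `scipy.linalg.eig` on a hypothetical 3x3 matrix. When both are switched off, `eig` returns only the eigenvalues, which should match the output of `eigvals`.

```python
import numpy as np
from scipy import linalg

# Hypothetical matrix, used only to compare the two calls
M = np.array(
    [
        [0.0, 0.5, 1.0],
        [0.5, 0.0, 0.0],
        [0.5, 0.5, 0.0],
    ]
)

# Default call: eigenvalues and right eigenvectors
w_full, v_full = linalg.eig(M)

# With left=False (the default) and right=False, only the eigenvalues are returned
w_only = linalg.eig(M, right=False)

# Dedicated eigenvalue routine
w_vals = linalg.eigvals(M)

print(np.sort_complex(w_only))
print(np.sort_complex(w_vals))
```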

In [None]:
# Implement your solution here

In [None]:
result = timeit("linalg.eig(A)", globals=globals(), number=100000)
print(f"Time needed for 100000 eigenvalue and eigenvector calculations: {result} s")

result = timeit("linalg.eigvals(A)", globals=globals(), number=100000)
print(f"Time needed for 100000 eigenvalue calculations: {result} s")

# eigvals is faster because it calls eig with left=0 and right=0, so SciPy tells LAPACK
# to calculate only the eigenvalues and to skip the eigenvectors
# see https://github.com/scipy/scipy/blob/v1.9.3/scipy/linalg/_decomp.py#L115-L264

### Part 3

We now have a different network of web pages, represented in the figure below. 

<p align="center">
    <img src="Network2.png" alt="drawing" width="500" align="center"/>
</p>

Turn this network into its reduced adjacency matrix. 
Try to find out which page will be the most important by looking at the graph and confirm your answer by computing the PageRank.

SciPy provides some algorithms that are optimized for computing eigenvalues and eigenvectors of some special types of matrices. 
Which one would you use for this network? Compare the execution time of this algorithm with `scipy.linalg.eig`. Is it faster?
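
Some of these specialized routines expect the matrix in a compact storage format rather than as a full square array. As an illustration, the sketch below packs a hypothetical symmetric tridiagonal matrix (not the network above) into the lower banded form accepted by `scipy.linalg.eig_banded` and checks the eigenvalues against `scipy.linalg.eigh`, the general routine for symmetric matrices.

```python
import numpy as np
from scipy import linalg

# Hypothetical symmetric tridiagonal matrix, for illustration only
M = np.array(
    [
        [2.0, 1.0, 0.0, 0.0],
        [1.0, 2.0, 1.0, 0.0],
        [0.0, 1.0, 2.0, 1.0],
        [0.0, 0.0, 1.0, 2.0],
    ]
)
n = M.shape[0]

# Lower banded storage: row 0 holds the main diagonal,
# row 1 holds the first subdiagonal, padded with a trailing zero
band = np.zeros((2, n))
band[0] = np.diag(M)
band[1, :-1] = np.diag(M, k=-1)

w_band, v_band = linalg.eig_banded(band, lower=True)
w_full = linalg.eigh(M, eigvals_only=True)

print(w_band)
print(w_full)
```

Both routines return the eigenvalues in ascending order, so the two printed arrays can be compared entry by entry.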


In [None]:
# Implement your solution here

In [None]:
B = np.array(
    [
        [1 / 2, 1 / 2, 0, 0, 0],
        [1 / 2, 0, 1 / 2, 0, 0],
        [0, 1 / 2, 0, 1 / 2, 0],
        [0, 0, 1 / 2, 0, 1 / 2],
        [0, 0, 0, 1 / 2, 1 / 2],
    ]
)

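# B in the lower banded storage expected by eig_banded:
# row 0 holds the main diagonal, row 1 the first subdiagonal (padded with a trailing 0)
Bb = np.array([[1 / 2, 0, 0, 0, 1 / 2], [1 / 2, 1 / 2, 1 / 2, 1 / 2, 0]])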


def pagerank2(A):
    w, v = linalg.eig_banded(A, lower=True)
    index = np.argmax(np.abs(w))
    pr = np.abs(v[:, index]) / linalg.norm(v[:, index], 1)
    return pr


print(pagerank(B))
print(pagerank2(Bb))


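# B is symmetric and all its rows and columns sum to 1, so the eigenvector that belongs
# to the largest eigenvalue (1) is uniform: all five pages get the same PageRank (0.2).
# According to B, every page has exactly two incoming links (A and E each link to
# themselves), so no single page is the most important, not even the central page C.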

In [None]:
result = timeit("linalg.eig(B)", globals=globals(), number=100000)
print(f"Time needed for 100000 executions of linalg.eig: {result} s")

result = timeit("linalg.eig_banded(Bb, lower=True)", globals=globals(), number=100000)
print(f"Time needed for 100000 executions of linalg.eig_banded: {result} s")

# The reduced adjacency matrix is a real symmetric banded (tridiagonal) matrix, so we can use eig_banded
# This is faster than just using eig

In [None]:
%%timeit

linalg.eig(B)

In [None]:
%%timeit

linalg.eig_banded(Bb, lower=True)