# SD212: Graph mining
## Solution to Lab 2: PageRank

In this lab, you will learn to compute, use and interpret various [PageRank](https://en.wikipedia.org/wiki/PageRank) scores.

## Import

In [None]:
from IPython.display import SVG

In [None]:
import numpy as np
from scipy import sparse

In [None]:
from sknetwork.data import load_netset, linear_graph, miserables
from sknetwork.linalg import normalize
from sknetwork.ranking import PageRank, top_k
from sknetwork.visualization import svg_graph

## Data

We will work on the following graphs (see the [NetSet](https://netset.telecom-paris.fr/) collection for details):
* Openflights (graph)
* WikiVitals (digraph)
* Cinema (bigraph)

In [None]:
openflights = load_netset('openflights')
wikivitals = load_netset('wikivitals')
cinema = load_netset('cinema')

## 1. Graphs

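The PageRank scores correspond to the stationary distribution of a random walk with restart: at each step, the walk follows an edge with probability $\alpha$ (the damping factor) and restarts from a random node with probability $1-\alpha$. Unless otherwise specified, we take the default value $\alpha = 0.85$ and the restart probability distribution is uniform over the set of nodes.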
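
To make this definition concrete, here is a minimal sketch of the random walk with restart computed by power iteration on a dense matrix. The function name `pagerank_power_iteration` and its parameters are illustrative assumptions, not part of scikit-network, and the solvers used by `PageRank` may proceed differently. Nodes without outgoing edges are assumed to restart with probability 1.

```python
import numpy as np

def pagerank_power_iteration(adjacency, alpha=0.85, n_iter=100):
    """Hypothetical helper: PageRank with uniform restart, by power iteration."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    out_weights = adjacency.sum(axis=1)
    # row-normalize the adjacency matrix; rows of nodes without out-edges stay zero
    transition = np.divide(adjacency, out_weights[:, None],
                           out=np.zeros_like(adjacency),
                           where=out_weights[:, None] > 0)
    restart = np.ones(n) / n
    scores = restart.copy()
    for _ in range(n_iter):
        # follow an edge with probability alpha...
        scores_new = alpha * transition.T.dot(scores)
        # ...and give the remaining probability mass to the restart distribution
        scores_new += (1 - scores_new.sum()) * restart
        scores = scores_new
    return scores

# toy example: path graph 0 - 1 - 2
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(pagerank_power_iteration(path))
```

On this toy path, the middle node gets the highest score and the two end nodes get equal scores.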

## Linear graph

Consider a linear graph:

In [None]:
n = 10

In [None]:
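# pass metadata by keyword: in recent versions of scikit-network,
# the second positional argument of linear_graph is `directed`
dataset = linear_graph(n, metadata=True)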
adjacency = dataset.adjacency
position = dataset.position

In [None]:
image = svg_graph(adjacency, position, names=np.arange(n))
SVG(image)

## To do

* What are the two best ranked nodes? Try with different values of $\alpha$ and interpret the results.
* What is the exact PageRank vector when $\alpha=1$? You need to compute it!
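
For the second question, a possible approach is sketched below. With $\alpha=1$ there is no restart, so the PageRank vector is the stationary distribution of the pure random walk, which for a connected undirected graph is proportional to the degrees. The sketch rebuilds the linear graph with NumPy only (the variable names are illustrative assumptions) and checks stationarity numerically.

```python
import numpy as np

n = 10
# dense adjacency matrix of the undirected linear graph 0 - 1 - ... - (n-1)
adjacency_dense = np.zeros((n, n))
idx = np.arange(n - 1)
adjacency_dense[idx, idx + 1] = 1
adjacency_dense[idx + 1, idx] = 1

degrees = adjacency_dense.sum(axis=1)
# candidate stationary distribution: proportional to the degrees
scores_exact = degrees / degrees.sum()

# check that the candidate is invariant under one step of the random walk
transition = adjacency_dense / degrees[:, None]
print(np.allclose(transition.T.dot(scores_exact), scores_exact))
print(scores_exact)
```

The two end nodes have degree 1 and the others degree 2, for a total degree of 18, so the exact vector is $1/18$ at both ends and $1/9$ elsewhere. Since the linear graph is bipartite, the walk without restart is periodic; this vector is best read as the limit of the PageRank scores when $\alpha \to 1$.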

In [None]:
pagerank = PageRank(damping_factor=0.85, solver='lanczos')

In [None]:
scores = pagerank.fit_transform(adjacency)

In [None]:
scores

In [None]:
np.sum(scores)

In [None]:
image = svg_graph(adjacency, position, names=np.arange(n), scores=scores)
SVG(image)

## Les Misérables


In [None]:
dataset = miserables(metadata=True)

In [None]:
adjacency = dataset.adjacency
position = dataset.position
names = dataset.names

In [None]:
image = svg_graph(adjacency, position, names, scale=2)
SVG(image)

## To do

* Display the graph with PageRank scores.
* List the 10 best ranked characters.
* Compare with:
    1. the 10 nodes of highest degree (number of neighbors)
    2. the 10 nodes of highest weight (total weight of incident edges)
* Try different values of $\alpha$ and interpret the results.

In [None]:
pagerank = PageRank(damping_factor=0.95, solver='lanczos')

In [None]:
scores = pagerank.fit_transform(adjacency)

In [None]:
# a log scale makes the differences between scores easier to see
image = svg_graph(adjacency, position, names, scores=np.log(scores), scale=2)
SVG(image)

In [None]:
top_nodes = top_k(scores, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
n = adjacency.shape[0]
degrees = (adjacency > 0).dot(np.ones(n))

In [None]:
top_nodes_degrees = top_k(degrees, 10)

In [None]:
for i, node in enumerate(top_nodes_degrees):
    print(i + 1, names[node])

In [None]:
n = adjacency.shape[0]
weights = adjacency.dot(np.ones(n))

In [None]:
top_nodes_weights = top_k(weights, 10)

In [None]:
for i, node in enumerate(top_nodes_weights):
    print(i + 1, names[node])

## Openflights


In [None]:
dataset = openflights

In [None]:
adjacency = dataset.adjacency
position = dataset.position
names = dataset.names

In [None]:
image = svg_graph(adjacency, position, width=800, height=400, display_node_weight=True, display_edges=False)
SVG(image)

## To do

* Display the same world map with PageRank scores.
* List the 10 best ranked airports, and compare with the 10 airports of highest traffic.
* Display the world map with Personalized PageRank scores, starting from Tokyo international airport.
* List the corresponding 10 best ranked airports.
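
Personalized PageRank only changes the restart distribution: the walk restarts from the chosen node(s) instead of a uniformly random node. As a minimal sketch, the scores can be obtained exactly on a small dense graph by solving the linear system $\pi = \alpha P^T \pi + (1-\alpha) r$, where $r$ is the restart distribution. The function `personalized_pagerank` is an illustrative assumption (not the scikit-network API) and assumes that every node has at least one outgoing edge.

```python
import numpy as np

def personalized_pagerank(adjacency, seed, alpha=0.85):
    """Hypothetical helper: exact personalized PageRank for a small dense graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    out_weights = adjacency.sum(axis=1)
    if np.any(out_weights == 0):
        raise ValueError("this sketch assumes every node has an outgoing edge")
    transition = adjacency / out_weights[:, None]
    # restart distribution concentrated on the seed node
    restart = np.zeros(n)
    restart[seed] = 1
    # solve (I - alpha * P^T) pi = (1 - alpha) * restart
    return np.linalg.solve(np.eye(n) - alpha * transition.T, (1 - alpha) * restart)

# toy example: path graph 0 - 1 - 2 - 3 - 4, restarting from node 0
m = 5
toy = np.zeros((m, m))
i = np.arange(m - 1)
toy[i, i + 1] = 1
toy[i + 1, i] = 1
toy_scores = personalized_pagerank(toy, seed=0)
print(toy_scores)
```

The scores still sum to 1, but the mass concentrates around the seed: node 0 scores higher than node 4 although both have the same degree.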

In [None]:
pagerank = PageRank()

In [None]:
scores = pagerank.fit_transform(adjacency)

In [None]:
image = svg_graph(adjacency, position, scores=np.log(scores), node_order=np.argsort(scores), 
                  width=800, height=400, display_node_weight=True, display_edges=False)
SVG(image)

In [None]:
print(names[top_k(scores, 10)])

In [None]:
weights = adjacency.dot(np.ones(len(names)))

In [None]:
print(names[top_k(weights, 10)])

In [None]:
tokyo = {i: name for i, name in enumerate(names) if 'Tokyo' in name}

In [None]:
tokyo

In [None]:
tokyo = list(tokyo)[0]

In [None]:
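# personalized PageRank, restarting from the selected airport only
scores = pagerank.fit_transform(adjacency, weights={tokyo: 1})

In [None]:
# alternative: select the airport by its exact name with a boolean mask
scores = pagerank.fit_transform(adjacency, weights=(names=='Tokyo Haneda International Airport'))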

In [None]:
image = svg_graph(adjacency, position, scores=np.log(scores), node_order=np.argsort(scores), 
                  width=800, height=400, display_node_weight=True, display_edges=False)
SVG(image)

In [None]:
print(names[top_k(scores, 10)])

## 2. Directed graphs

## Wikipedia Vitals

In [None]:
dataset = wikivitals

In [None]:
adjacency = dataset.adjacency
names = dataset.names

## To do

* List the 10 best ranked pages of Wikipedia Vitals.
* Compare with the 10 nodes of highest out-degrees and the 10 nodes of highest in-degrees. Interpret the results.
* Which article of Wikipedia Vitals is in the top-20 of PageRank but not in the top-20 of in-degrees?

In [None]:
pagerank = PageRank()

In [None]:
scores = pagerank.fit_transform(adjacency)
print(names[top_k(scores, 10)])

In [None]:
out_degrees = adjacency.dot(np.ones(len(names)))
print(names[top_k(out_degrees, 10)])

In [None]:
in_degrees = adjacency.T.dot(np.ones(len(names)))
print(names[top_k(in_degrees, 10)])
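
The third question above asks for the articles in the top-20 of PageRank but not in the top-20 of in-degrees. One way to answer it is a set difference between the two top-k lists. The helper `top_in_first_only` below is a hypothetical name, and the toy arrays are made-up values used only to illustrate the logic.

```python
import numpy as np

def top_in_first_only(scores_a, scores_b, k=20):
    """Indices in the top-k of scores_a but not in the top-k of scores_b,
    listed in decreasing order of scores_a."""
    top_a = np.argsort(-np.asarray(scores_a))[:k]
    top_b = np.argsort(-np.asarray(scores_b))[:k]
    return top_a[~np.isin(top_a, top_b)]

# toy data (made-up values)
toy_names = np.array(['a', 'b', 'c', 'd', 'e'])
toy_pagerank = np.array([0.30, 0.25, 0.05, 0.22, 0.18])
toy_in_degrees = np.array([10, 2, 8, 9, 1])
selected = top_in_first_only(toy_pagerank, toy_in_degrees, k=3)
print(toy_names[selected])
```

In the notebook, the same helper could presumably be applied to the Wikipedia Vitals scores with `names[top_in_first_only(scores, in_degrees, 20)]`.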

## To do

* List the 20 closest pages to **Picasso** in Wikipedia Vitals. Who is the best ranked painter other than Picasso?
* List the 20 closest pages to both **Cat** and **Dog** in Wikipedia Vitals.
* In both cases, propose a method to get the 20 closest pages **specifically** related to the target(s). <br>**Hint:** Consider the difference between two PageRank scores.

In [None]:
scores_ppr = pagerank.fit_transform(adjacency, weights=names=='Pablo Picasso')

In [None]:
print(names[top_k(scores_ppr, 20)])

In [None]:
# more specific articles
print(names[top_k(scores_ppr - scores, 20)])

In [None]:
scores_ppr = pagerank.fit_transform(adjacency, weights=(names=='Cat')|(names=='Dog'))

In [None]:
print(names[top_k(scores_ppr, 20)])

In [None]:
# more specific articles
print(names[top_k(scores_ppr - scores, 20)])

## To do

* List 5 representative articles of each category.

In [None]:
labels = dataset.labels
names_labels = dataset.names_labels

In [None]:
names_labels

In [None]:
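for label in np.unique(labels):
    print(names_labels[label])
    # personalized PageRank, restarting from the articles of the category
    scores_ppr = pagerank.fit_transform(adjacency, weights=labels==label)
    # keep only the scores of the articles of the category
    scores_ppr *= labels==label
    print(names[top_k(scores_ppr, 5)])
    print('--')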

## 3. Bipartite graphs

## Cinema

In [None]:
dataset = cinema

In [None]:
biadjacency = dataset.biadjacency
movies = dataset.names_row
actors = dataset.names_col

## To do


* List the 5 closest actors and the 5 closest movies to **Catherine Deneuve**.

In [None]:
biadjacency

In [None]:
scores = pagerank.fit_transform(biadjacency, weights_col=actors=='Catherine Deneuve')

In [None]:
scores_movies = scores

In [None]:
# movies
print(movies[top_k(scores_movies, 5)])

In [None]:
scores_actors = pagerank.scores_col_

In [None]:
# actors
print(actors[top_k(scores_actors, 5)])

## 4. Directed graphs as bipartite graphs

Directed graphs can be represented as bipartite graphs by duplicating each node, one as source of edges and the other as destination of edges. The biadjacency matrix of the bipartite graph is simply the adjacency matrix of the directed graph. 

The PageRank scores obtained with the bipartite graph differ from those obtained with the directed graph: they correspond to the **forward-backward** random walk in the directed graph, edges being alternately followed in forward and backward directions.
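
The construction can be sketched on a toy directed graph with NumPy. The variable names are illustrative assumptions, and the example graph is chosen so that every node has at least one incoming and one outgoing edge. The bipartite graph has $2n$ nodes (source copies, then destination copies), and one forward step followed by one backward step brings the walk back to a source copy.

```python
import numpy as np

# toy directed graph with edges 0->1, 0->2, 1->2, 2->0
adjacency_directed = np.array([[0, 1, 1],
                               [0, 0, 1],
                               [1, 0, 0]])
n = adjacency_directed.shape[0]

# adjacency matrix of the bipartite graph: the biadjacency matrix
# is the adjacency matrix of the directed graph itself
zeros = np.zeros((n, n), dtype=int)
bipartite = np.block([[zeros, adjacency_directed],
                      [adjacency_directed.T, zeros]])

# forward step: follow an outgoing edge (source copy -> destination copy)
out_degrees = adjacency_directed.sum(axis=1)
forward = adjacency_directed / out_degrees[:, None]
# backward step: follow an incoming edge in reverse (destination copy -> source copy)
in_degrees = adjacency_directed.sum(axis=0)
backward = adjacency_directed.T / in_degrees[:, None]

# transition matrix of the forward-backward walk on the original nodes
forward_backward = forward.dot(backward)
print(forward_backward)
```

Each row of `forward_backward` sums to 1: it gives the probability of reaching a node that shares a successor with the current node, which is why the resulting ranking differs from the usual PageRank of the directed graph.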

## Wikipedia Vitals

In [None]:
dataset = wikivitals

In [None]:
adjacency = dataset.adjacency
names = dataset.names

## To do

Do the same experiments as above and compare both rankings:
* List the 10 best ranked pages of Wikipedia Vitals. 
* List the 20 closest pages to **Picasso** in Wikipedia Vitals. 
* List the 20 closest pages to both **Cat** and **Dog** in Wikipedia Vitals.

In [None]:
scores = pagerank.fit_transform(adjacency, force_bipartite=True)

In [None]:
print(names[top_k(scores, 10)])

In [None]:
scores = pagerank.fit_transform(adjacency, weights=names=='Pablo Picasso', force_bipartite=True)

In [None]:
print(names[top_k(scores, 20)])

In [None]:
scores = pagerank.fit_transform(adjacency, weights=(names=='Cat')|(names=='Dog'), force_bipartite=True)

In [None]:
print(names[top_k(scores, 20)])