# Graph Learning
## Lab 5: Heat Diffusion

In this lab, you will learn to use heat diffusion for **contrastive ranking** and **node classification**.

## Import

In [None]:
from IPython.display import SVG

In [None]:
import numpy as np
from scipy import sparse

In [None]:
from sknetwork.data import load_netset, grid, karate_club
from sknetwork.regression import Dirichlet
from sknetwork.classification import DiffusionClassifier
from sknetwork.ranking import PageRank
from sknetwork.utils import directed2undirected
from sknetwork.visualization import visualize_graph

## Data

We will work on the following graphs (see the [NetSet](https://netset.telecom-paris.fr/) collection for details):
* Openflights (graph)
* WikiVitals (directed graph + bipartite graph)

In [None]:
openflights = load_netset('openflights')
wikivitals = load_netset('wikivitals')

## 1. Graphs

## Grid

We first illustrate the notion of **contrastive ranking** on a $k\times k$ grid.

In [None]:
k = 5
dataset = grid(k, k, True)
adjacency = dataset.adjacency
position = dataset.position

In [None]:
image = visualize_graph(adjacency, position, names=np.arange(k * k), width=200, height=200)
SVG(image)

## To do

* Display the solution to the Dirichlet problem with 1 hot source and 1 cold source, located at opposite corners of the square.
* Add a hot source at the top-left corner of the square. Which free node has the highest temperature?

In [None]:
dirichlet = Dirichlet()

In [None]:
bottom_left = 0
top_right = k * k - 1

In [None]:
values = {bottom_left: 0, top_right: 1}

In [None]:
temperatures = dirichlet.fit_predict(adjacency, values=values)

In [None]:
image = visualize_graph(adjacency, position, scores=temperatures, width=200, height=200)
SVG(image)
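
As a sanity check on the Dirichlet problem, the equilibrium temperatures can also be computed from scratch: every free node must equal the mean of its neighbors, which gives a linear system in the graph Laplacian. The sketch below uses NumPy only. The helpers `grid_adjacency` and `solve_dirichlet` are hypothetical names introduced here, and the row-by-row node indexing is an assumption that may differ from the one used by `grid(k, k, True)`. It is not the implementation behind `Dirichlet`.

```python
import numpy as np


def grid_adjacency(k):
    """Dense adjacency matrix of a k x k grid, nodes indexed row by row."""
    n = k * k
    adjacency = np.zeros((n, n))
    for i in range(k):
        for j in range(k):
            u = i * k + j
            if j + 1 < k:  # horizontal edge
                adjacency[u, u + 1] = adjacency[u + 1, u] = 1
            if i + 1 < k:  # vertical edge
                adjacency[u, u + k] = adjacency[u + k, u] = 1
    return adjacency


def solve_dirichlet(adjacency, values):
    """Equilibrium temperatures: each free node is the mean of its neighbors."""
    n = adjacency.shape[0]
    seeds = np.array(sorted(values))
    free = np.setdiff1d(np.arange(n), seeds)
    temperatures = np.zeros(n)
    temperatures[seeds] = [values[int(s)] for s in seeds]
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # Harmonic condition on the free nodes: L_ff t_f = A_fs t_s
    rhs = adjacency[np.ix_(free, seeds)] @ temperatures[seeds]
    temperatures[free] = np.linalg.solve(laplacian[np.ix_(free, free)], rhs)
    return temperatures


k_toy = 5
toy_adjacency = grid_adjacency(k_toy)
toy_temperatures = solve_dirichlet(toy_adjacency, {0: 0, k_toy * k_toy - 1: 1})
print(toy_temperatures.reshape(k_toy, k_toy).round(2))
```

By symmetry, the center of the grid and the two remaining corners sit at temperature 0.5 in this exact solution. An iterative solver stopped after a few iterations will only approximate these values.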

## Karate Club


We now consider the **classification** of nodes by heat diffusion. We use the [karate club graph](https://en.wikipedia.org/wiki/Zachary%27s_karate_club) that has ground-truth labels.

In [None]:
dataset = karate_club(True)

In [None]:
adjacency = dataset.adjacency
position = dataset.position
labels_true = dataset.labels

In [None]:
n = len(labels_true)

In [None]:
image = visualize_graph(adjacency, position, labels=labels_true, names=np.arange(n))
SVG(image)

## To do

* Select one node in each ground-truth cluster, and predict the labels of the other nodes by heat diffusion.
* Display the graph with the predicted labels. 
* What is the accuracy of the classification?

In [None]:
classifier = DiffusionClassifier()

## To do

The classifier applies temperature centering after diffusion.

* Repeat the same experiment without temperature centering.
* Do the same experiments with 3 nodes in a ground-truth cluster and 1 in the other.
* Comment on the results.

In [None]:
classifier = DiffusionClassifier(centering=False)
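
To see what temperature centering can change, here is a minimal from-scratch sketch of one-vs-all classification by heat diffusion on a path of 8 nodes with unbalanced seeds (3 seeds for label 0, 1 seed for label 1). For each label, the seeds of that label are hot and all other seeds are cold. Centering is assumed here to mean subtracting the mean temperature of each label over all nodes before taking the argmax. The functions `solve_dirichlet` and `diffusion_classify` are hypothetical helpers, not the code of `DiffusionClassifier`.

```python
import numpy as np


def solve_dirichlet(adjacency, values):
    """Equilibrium temperatures: each free node is the mean of its neighbors."""
    n = adjacency.shape[0]
    seeds = np.array(sorted(values))
    free = np.setdiff1d(np.arange(n), seeds)
    temperatures = np.zeros(n)
    temperatures[seeds] = [values[int(s)] for s in seeds]
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    rhs = adjacency[np.ix_(free, seeds)] @ temperatures[seeds]
    temperatures[free] = np.linalg.solve(laplacian[np.ix_(free, free)], rhs)
    return temperatures


def diffusion_classify(adjacency, seed_labels, centering=True):
    """One diffusion per label (own seeds hot, other seeds cold), then argmax."""
    label_set = sorted(set(seed_labels.values()))
    n = adjacency.shape[0]
    temperatures = np.zeros((n, len(label_set)))
    for c, label in enumerate(label_set):
        values = {node: float(lab == label) for node, lab in seed_labels.items()}
        temperatures[:, c] = solve_dirichlet(adjacency, values)
    if centering:
        # Assumed centering: subtract the mean temperature of each label.
        temperatures = temperatures - temperatures.mean(axis=0)
    return np.array(label_set)[temperatures.argmax(axis=1)]


# Hypothetical toy graph: a path 0 - 1 - ... - 7.
n_toy = 8
path = np.zeros((n_toy, n_toy))
for u in range(n_toy - 1):
    path[u, u + 1] = path[u + 1, u] = 1

toy_seeds = {0: 0, 1: 0, 2: 0, 7: 1}
pred_raw = diffusion_classify(path, toy_seeds, centering=False)
pred_centered = diffusion_classify(path, toy_seeds, centering=True)
print(pred_raw)       # [0 0 0 0 0 1 1 1]
print(pred_centered)  # [0 0 0 0 1 1 1 1]
```

In this toy case only node 4 changes label, because the mean temperature of each label depends on how many seeds it has.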

## Openflights


We now show how to classify the nodes of a graph **without labels**. We identify some nodes of interest and use these seeds to classify the other nodes.

In [None]:
dataset = openflights

In [None]:
adjacency = dataset.adjacency
position = dataset.position
names = dataset.names

In [None]:
image = visualize_graph(adjacency, position, width=800, height=400, node_size=3, display_edges=False)
SVG(image)

## To do

* Display the same world map with the labels predicted for 3 seeds (Paris, New York, Beijing), each with its own label.
* Add Madrid with another label and comment on the results.

In [None]:
paris = 622
newyork = 1842
beijing = 1618
madrid = 572

Finally, we illustrate **contrastive ranking** on Openflights.

## To do

* List the top-10 airports that are the closest to Tokyo in terms of Personalized PageRank.
* List the top-10 airports that are both the closest to Tokyo and the furthest from Paris Charles de Gaulle, using heat diffusion (contrastive ranking).<br>What do you observe?
* Give the temperatures of these airports after heat diffusion and explain the result.

In [None]:
tokyo = 1084
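
Before working on Openflights, the mechanics of contrastive ranking can be sketched on a toy graph: one node is kept hot (temperature 1), another is kept cold (temperature 0), heat diffuses until equilibrium, and the free nodes are ranked by decreasing temperature. The graph, the variable names and the number of iterations below are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy graph: node 0 is the hot source, node 4 the cold source.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (0, 5), (5, 4)]
n_nodes = 6
toy_adjacency = np.zeros((n_nodes, n_nodes))
for u, v in edges:
    toy_adjacency[u, v] = toy_adjacency[v, u] = 1

hot, cold = 0, 4
toy_temperatures = np.full(n_nodes, 0.5)
toy_temperatures[hot], toy_temperatures[cold] = 1, 0
for _ in range(200):
    # Each node takes the mean temperature of its neighbors...
    toy_temperatures = toy_adjacency @ toy_temperatures / toy_adjacency.sum(axis=1)
    # ...except the sources, whose temperatures are reset at every step.
    toy_temperatures[hot], toy_temperatures[cold] = 1, 0

free = np.setdiff1d(np.arange(n_nodes), [hot, cold])
ranking = free[np.argsort(-toy_temperatures[free])]
print(ranking)  # [1 2 5 3]
print(toy_temperatures[ranking].round(3))
```

Node 5 is a direct neighbor of the hot source but ranks below node 2, because it is also a direct neighbor of the cold source.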

## Wikipedia Vitals

We now consider the graph of links of WikiVitals, considered as undirected. We first focus on **contrastive ranking**, then on **node classification**.

In [None]:
dataset = wikivitals

In [None]:
adjacency = dataset.adjacency
names = dataset.names
labels = dataset.labels
names_labels = dataset.names_labels

In [None]:
adjacency = directed2undirected(adjacency)

## Contrastive ranking

## To do

* List the top-10 articles that are close to **Cat** and **Dog** in terms of Personalized PageRank.
* Compare with the list of top-10 articles that are close to **Cat** and **Dog** and far from **Bear** and **Tiger** using heat diffusion (contrastive ranking).
* List the top-10 articles that are close to **Bear** and **Tiger** and far from **Cat** and **Dog** (you can reuse the previous diffusion). Interpret the results.

In [None]:
pagerank = PageRank()

In [None]:
pagerank.fit(adjacency, force_bipartite=True)
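
For reference, Personalized PageRank with several seeds can be sketched as a power iteration: a random walk follows a link with probability equal to the damping factor and otherwise restarts from a seed chosen uniformly at random. The function `personalized_pagerank`, the damping factor 0.85, the iteration count and the toy graph are assumptions made for illustration. This is not the solver used by `PageRank`.

```python
import numpy as np


def personalized_pagerank(adjacency, seeds, damping=0.85, n_iter=100):
    """Power iteration for PageRank with restarts uniformly on the seeds."""
    n = adjacency.shape[0]
    restart = np.zeros(n)
    restart[list(seeds)] = 1 / len(seeds)
    # Row-stochastic transition matrix of the walk (assumes no isolated node).
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)
    scores = restart.copy()
    for _ in range(n_iter):
        scores = damping * (transition.T @ scores) + (1 - damping) * restart
    return scores


# Hypothetical toy graph with 6 nodes; nodes 0 and 1 play the role of the seeds.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (0, 5), (5, 4)]
toy_adjacency = np.zeros((6, 6))
for u, v in edges:
    toy_adjacency[u, v] = toy_adjacency[v, u] = 1

toy_seeds = [0, 1]
toy_scores = personalized_pagerank(toy_adjacency, toy_seeds)
# Rank the nodes by decreasing score, leaving the seeds out.
others = np.setdiff1d(np.arange(6), toy_seeds)
print(others[np.argsort(-toy_scores[others])])
```

Unlike heat diffusion with hot and cold sources, this ranking has no notion of nodes to stay away from.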

## Node classification

We now use the ground-truth labels provided by the category of each article.

In [None]:
print(names_labels)

## To do

* What is the accuracy of node classification, using half of the nodes in the train set?
* Display the confusion matrix. What is the most difficult class to predict?
* Give examples of misclassification for each label (e.g., 2 false positives and 2 false negatives per label).
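
The evaluation steps above can be sketched with NumPy on toy labels: hide half of the labels, then compute the accuracy, the confusion matrix and the recall per class. The arrays are hypothetical, and marking unlabeled nodes with `-1` is an assumed convention to check against the scikit-network documentation.

```python
import numpy as np


def confusion_matrix(labels_true, labels_pred, n_labels):
    """Entry (i, j) counts nodes of true label i predicted as label j."""
    matrix = np.zeros((n_labels, n_labels), dtype=int)
    np.add.at(matrix, (labels_true, labels_pred), 1)
    return matrix


# Hypothetical toy labels; in the lab, metrics are computed on the test set only.
toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 0])

# Random split: about half of the nodes keep their label for training.
rng = np.random.default_rng(0)
train_mask = rng.random(len(toy_true)) < 0.5
labels_train = np.where(train_mask, toy_true, -1)  # assumed: -1 means unlabeled

toy_confusion = confusion_matrix(toy_true, toy_pred, n_labels=3)
accuracy = np.mean(toy_true == toy_pred)
recall = toy_confusion.diagonal() / toy_confusion.sum(axis=1)
print(toy_confusion)
print(accuracy, recall)
```

A class with low recall is often misclassified, and the off-diagonal entries of its row show which classes it is confused with.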

## To do

* Classify each article of the **People** category into one of the other categories (Mathematics, History, etc.), using all other labels.
* List the top-5 predicted people of each category using Personalized PageRank.

## 2. Bipartite graphs

Finally, we consider the bipartite graph between articles and words in WikiVitals.

In [None]:
dataset = wikivitals
biadjacency = dataset.biadjacency

In [None]:
words = dataset.names_col
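
Heat diffusion on a bipartite graph can be pictured as an alternation: words take the mean temperature of the articles that contain them, then articles take the mean temperature of their words. The sketch below runs this alternation on a tiny, hypothetical biadjacency matrix with one hot article and one cold article. It only illustrates the principle and assumes that no row or column is empty.

```python
import numpy as np

# Hypothetical toy data: 4 articles (rows) x 3 words (columns).
toy_biadjacency = np.array([[1, 1, 0],
                            [1, 0, 0],
                            [0, 1, 1],
                            [0, 0, 1]], dtype=float)
toy_seeds = {0: 1.0, 3: 0.0}  # article 0 is hot, article 3 is cold

seed_index = np.array(list(toy_seeds))
seed_values = np.array(list(toy_seeds.values()))

t_row = np.full(toy_biadjacency.shape[0], 0.5)
t_row[seed_index] = seed_values
for _ in range(200):
    # Words take the mean temperature of the articles that contain them.
    t_col = toy_biadjacency.T @ t_row / toy_biadjacency.sum(axis=0)
    # Articles take the mean temperature of their words; seeds stay fixed.
    t_row = toy_biadjacency @ t_col / toy_biadjacency.sum(axis=1)
    t_row[seed_index] = seed_values

print(t_row.round(3))  # temperatures of the articles
print(t_col.round(3))  # temperatures of the words
```

Article 1 shares its only word with the hot article and ends at temperature 1, while article 2 shares one word with each seed and ends at 0.5.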

## To do

* Repeat the experiments on node classification using the bipartite graph between articles and words.
* Which information seems to be richer, the text or the links between articles?
* Propose a solution exploiting both sources of information.