# SD212: Graph mining
## Solution to Lab 1: Graph structure

In this lab, you will check three structural properties of real graphs:
* the [power-law](https://en.wikipedia.org/wiki/Power_law#Power-law_probability_distributions) of degrees,
* the [small-world](https://en.wikipedia.org/wiki/Small-world_network) property,
* the tendency to [cluster](https://en.wikipedia.org/wiki/Clustering_coefficient).

## Import

In [None]:
import numpy as np
from scipy import sparse
import matplotlib.pyplot as plt

In [None]:
import sknetwork as skn

In [None]:
# check version (this lab was written for scikit-network 0.31.0)
if skn.__version__ != '0.31.0':
    print('Please install scikit-network 0.31.0.')

In [None]:
from sknetwork.data import load_netset, erdos_renyi
from sknetwork.path import get_distances, get_shortest_path
from sknetwork.utils import get_neighbors, bipartite2undirected
from sknetwork.topology import get_clustering_coefficient

## Data

We will work on the following graphs (see the [NetSet](https://netset.telecom-paris.fr) collection for details):
* Openflights (graph)
* WikiVitals (directed graph)
* Cinema (bipartite graph)

In [None]:
openflights = load_netset('openflights')

In [None]:
type(openflights)

In [None]:
dataset = openflights

In [None]:
dataset.meta

In [None]:
wikivitals = load_netset('wikivitals')

In [None]:
cinema = load_netset('cinema')

## 1. Power law

We first focus on the power law degree distribution.

We consider the empirical distributions of:
* degrees in the Openflights graph,
* in-degrees and out-degrees in the WikiVitals graph.

## To do

For each of these distributions:
* Plot the [tail distribution](https://en.wikipedia.org/wiki/Cumulative_distribution_function#Complementary_cumulative_distribution_function_(tail_distribution)) in log-log scale.
* Conclude: do you consider that these are power-law graphs?

In [None]:
dataset = wikivitals

In [None]:
adjacency = dataset.adjacency

In [None]:
adjacency

In [None]:
n = adjacency.shape[0]

In [None]:
out_degrees = adjacency.dot(np.ones(n))

In [None]:
in_degrees = adjacency.T.dot(np.ones(n))
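Multiplying the adjacency matrix by a vector of ones sums each row, and doing the same with the transpose sums each column. On a weighted graph these sums are total weights rather than numbers of neighbors. The toy example below is only an illustration of that distinction: the matrix and the `toy_` variable names are hypothetical, and a dense NumPy array stands in for the sparse matrix used in the lab.

```python
import numpy as np

# toy directed graph (hypothetical): edges 0->1, 0->2, 1->2, with weight 3 on edge 1->2
toy_adjacency = np.array([
    [0, 1, 1],
    [0, 0, 3],
    [0, 0, 0],
])
toy_n = toy_adjacency.shape[0]

# weighted degrees: sums of edge weights over rows (out) and columns (in)
toy_out_weights = toy_adjacency.dot(np.ones(toy_n))
toy_in_weights = toy_adjacency.T.dot(np.ones(toy_n))

# unweighted degrees: numbers of out-neighbors and in-neighbors
toy_out_degrees = (toy_adjacency > 0).sum(axis=1)
toy_in_degrees = (toy_adjacency > 0).sum(axis=0)

print(toy_out_weights, toy_in_weights)
print(toy_out_degrees, toy_in_degrees)
```

If the graph is unweighted, both computations give the same result.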

In [None]:
def plot_loglog(degrees):
    """Plot the tail distribution of degrees in log-log scale and compare with that of a random graph.
    
    Parameters
    ----------
    degrees : array
        vector of degrees of the graph.
    """
    n = len(degrees)
    # comparison with an Erdős-Rényi random graph of same size and same average degree
    adjacency_random = erdos_renyi(n, np.sum(degrees) / n / (n - 1))
    degrees_random = adjacency_random.dot(np.ones(n))
    # empirical tail distribution: for the i-th smallest degree, the fraction (n - i) / n of nodes
    plt.loglog(np.sort(degrees), 1 - np.arange(n) / n, lw=3, label='Real graph')
    plt.loglog(np.sort(degrees_random), 1 - np.arange(n) / n, lw=3, label='Random graph')
    plt.xlabel('Degree')
    plt.ylabel('Probability')
    plt.legend()
    plt.show()

In [None]:
plot_loglog(out_degrees)

In [None]:
plot_loglog(in_degrees)
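A power law shows up as a roughly straight line in these log-log plots, so one way to complement the visual check is to estimate the slope of the tail distribution. The sketch below is not part of the original lab: the helpers `tail_distribution` and `loglog_slope` are hypothetical names, the degrees are synthetic samples from NumPy's Zipf sampler, and a least-squares fit in log-log scale is only a crude heuristic, not a rigorous test of the power-law hypothesis.

```python
import numpy as np

def tail_distribution(degrees):
    """Return the distinct positive degrees d and the fraction of nodes with degree >= d."""
    degrees = np.asarray(degrees)
    degrees = degrees[degrees > 0]  # zero degrees cannot be shown in log scale
    values, counts = np.unique(degrees, return_counts=True)
    # number of nodes with degree >= d, for each distinct value d
    tail_counts = np.cumsum(counts[::-1])[::-1]
    return values, tail_counts / len(degrees)

def loglog_slope(degrees):
    """Least-squares slope of the tail distribution in log-log scale (hypothetical helper)."""
    values, tail = tail_distribution(degrees)
    slope, _ = np.polyfit(np.log(values), np.log(tail), deg=1)
    return slope

# small example that can be checked by hand
toy_values, toy_tail = tail_distribution([1, 1, 2, 3, 3, 3])
print(toy_values, toy_tail)

# synthetic heavy-tailed degrees (assumption: Zipf samples as a stand-in for real degrees)
rng = np.random.default_rng(0)
synthetic_degrees = rng.zipf(2.5, size=10000)
print(loglog_slope(synthetic_degrees))
```

The same helper could be applied to `out_degrees` and `in_degrees`; a slope that stays roughly constant over several orders of magnitude supports the power-law reading, whereas the random graph's tail drops much faster.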

## 2. Small-world

You will now run a few experiments highlighting the small-world property.

## To do

* Give a shortest path of maximum length starting from **Paris Charles de Gaulle** airport in Openflights.<br> 
List the corresponding airports.
* Give a shortest path from [Plato](https://en.wikipedia.org/wiki/Plato) to [Platini](https://en.wikipedia.org/wiki/Michel_Platini) in WikiVitals.<br>
List the corresponding articles.

In [None]:
dataset = openflights

In [None]:
adjacency = dataset.adjacency > 0

In [None]:
names = dataset.names

In [None]:
{i: name for i, name in enumerate(names) if 'de Gaulle' in name}

In [None]:
source = 622

In [None]:
distances = get_distances(adjacency, source)

In [None]:
max(distances)

In [None]:
destination = np.argmax(distances)

In [None]:
shortest_path_tree = get_shortest_path(adjacency, source)

In [None]:
names[destination]

In [None]:
node = destination
path = [destination]
while node != source:
    node = get_neighbors(shortest_path_tree, node, transpose=True)[0]
    path = [node] + path

In [None]:
# number of airports on the path (the number of flights is len(path) - 1)
len(path)

In [None]:
print(names[path])
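The loop above walks backwards from the destination to the source, one predecessor at a time. To illustrate the idea on a graph small enough to check by hand, here is a pure-Python breadth-first search. This is only a sketch: the function names and the toy adjacency lists are hypothetical, and it is not meant to describe how scikit-network implements `get_distances` or `get_shortest_path`.

```python
from collections import deque

def bfs(neighbors, source):
    """Breadth-first search: distances in hops and one predecessor per reached node."""
    distances = {source: 0}
    predecessors = {}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in neighbors[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                predecessors[neighbor] = node
                queue.append(neighbor)
    return distances, predecessors

def rebuild_path(predecessors, source, destination):
    """Walk back from the destination to the source through the predecessors."""
    path = [destination]
    while path[-1] != source:
        path.append(predecessors[path[-1]])
    return path[::-1]

# toy undirected graph given as adjacency lists (hypothetical data)
toy_neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
toy_distances, toy_predecessors = bfs(toy_neighbors, source=0)

# a node at maximum distance from the source, and one shortest path to it
toy_farthest = max(toy_distances, key=toy_distances.get)
toy_path = rebuild_path(toy_predecessors, 0, toy_farthest)
print(toy_path)
```

The number of hops equals `len(toy_path) - 1`, which matches the distance returned by the search.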

In [None]:
dataset = wikivitals

In [None]:
adjacency = dataset.adjacency
names = dataset.names

In [None]:
source = np.flatnonzero(names == 'Plato')[0]
destination = np.flatnonzero(names == 'Michel Platini')[0]

In [None]:
shortest_path_tree = get_shortest_path(adjacency, source)

In [None]:
node = destination
path = [destination]
while node != source:
    node = get_neighbors(shortest_path_tree, node, transpose=True)[0]
    path = [node] + path

In [None]:
print(names[path])

## To do

* Which fraction of the airports can be reached from **Paris Charles de Gaulle** in 1, 2, 3 flights?
* What fraction of the articles can be reached from **Paris** in 1, 2, 3 clicks on WikiVitals?

In [None]:
dataset = openflights
adjacency = dataset.adjacency
names = dataset.names

In [None]:
source = 622
names[source]

In [None]:
distances = get_distances(adjacency, source)

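In [None]:
# get_distances marks unreachable nodes with -1 (per the scikit-network documentation): exclude them
# the source itself (distance 0) is counted
np.mean((distances >= 0) & (distances <= 1))

In [None]:
np.mean((distances >= 0) & (distances <= 2))

In [None]:
np.mean((distances >= 0) & (distances <= 3))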

In [None]:
dataset = wikivitals
adjacency = dataset.adjacency
names = dataset.names

In [None]:
source = np.flatnonzero(names == 'Paris')[0]

In [None]:
names[source]

In [None]:
distances = get_distances(adjacency, source)

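In [None]:
# get_distances marks unreachable nodes with -1 (per the scikit-network documentation): exclude them
# the source itself (distance 0) is counted
np.mean((distances >= 0) & (distances <= 1))

In [None]:
np.mean((distances >= 0) & (distances <= 2))

In [None]:
np.mean((distances >= 0) & (distances <= 3))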
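The cells above repeat the same computation for each radius. A small helper can return all the fractions at once. This is a sketch under stated assumptions: `reachable_fractions` is a hypothetical name, the distance vector is toy data, and unreachable nodes are assumed to be marked with distance -1, as described in the scikit-network documentation of `get_distances`.

```python
import numpy as np

def reachable_fractions(distances, max_hops=3):
    """Fraction of nodes within k hops of the source, for k = 1, ..., max_hops.

    Assumes distances are numbers of hops, with -1 for unreachable nodes.
    The source itself (distance 0) is counted, as in the cells above.
    """
    distances = np.asarray(distances)
    return {k: float(np.mean((distances >= 0) & (distances <= k)))
            for k in range(1, max_hops + 1)}

# toy distances (hypothetical): source, 2 nodes at 1 hop, 3 at 2 hops, 1 at 3 hops, 1 unreachable
toy_hop_distances = np.array([0, 1, 1, 2, 2, 2, 3, -1])
toy_fractions = reachable_fractions(toy_hop_distances)
print(toy_fractions)
```

Without the `distances >= 0` condition, the unreachable node would be counted as reached at every radius.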

## To do

* Compute the [Bacon number](https://fr.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon) of all actors (**without** building the co-starring actor graph).
* Plot the corresponding histogram.

In [None]:
dataset = cinema

In [None]:
dataset.meta

In [None]:
biadjacency = dataset.biadjacency

In [None]:
biadjacency

In [None]:
movies = dataset.names_row
actors = dataset.names_col

In [None]:
len(movies)

In [None]:
len(actors)

In [None]:
source = np.flatnonzero(actors == 'Kevin Bacon')[0]

In [None]:
# actors are on columns
_, distances = get_distances(biadjacency, source_col=source)

In [None]:
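# distances to reachable actors are even in the bipartite graph
# (unreachable actors, if any, have distance -1 and are excluded from the check)
any(distances[distances >= 0] % 2)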

In [None]:
# actors with Bacon number 1 (have co-starred in a movie with Kevin Bacon)
actors[np.flatnonzero(distances == 2)]

In [None]:
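# each co-starring link is 2 hops (actor -> movie -> actor); unreachable actors (distance -1) keep the value -1
bacon_numbers = distances // 2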

In [None]:
bacon_numbers[actors == 'Kevin Bacon']

In [None]:
bacon_numbers[actors == 'Julia Roberts']

In [None]:
plt.hist(bacon_numbers, bins=15);
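The Bacon number is half the distance in the bipartite graph because each co-starring link takes two hops: from an actor to a movie, then from that movie to another actor. The sketch below makes this alternation explicit on toy data, without ever building the actor-actor graph. It is only an illustration: `compute_bacon_numbers` is a hypothetical helper, it uses a dense NumPy matrix that would not scale to the Cinema dataset, and it is not how scikit-network computes distances.

```python
import numpy as np

def compute_bacon_numbers(biadjacency, source):
    """Bacon numbers from a movie-actor matrix (rows = movies, columns = actors).

    Returns -1 for actors that are not connected to the source actor.
    """
    biadjacency = np.asarray(biadjacency) > 0
    n_movies, n_actors = biadjacency.shape
    numbers = -np.ones(n_actors, dtype=int)
    numbers[source] = 0
    seen_movies = np.zeros(n_movies, dtype=bool)
    frontier = np.zeros(n_actors, dtype=bool)
    frontier[source] = True
    step = 0
    while frontier.any():
        step += 1
        # movies featuring an actor of the current frontier and not visited yet
        movies = biadjacency[:, frontier].any(axis=1) & ~seen_movies
        seen_movies |= movies
        # actors of these movies who have no Bacon number yet
        frontier = biadjacency[movies].any(axis=0) & (numbers < 0)
        numbers[frontier] = step
    return numbers

# toy data (hypothetical): 3 movies, 5 actors, actor 0 plays the role of Kevin Bacon
toy_biadjacency = np.array([
    [1, 1, 0, 0, 0],  # movie 0: actors 0 and 1
    [0, 1, 1, 0, 0],  # movie 1: actors 1 and 2
    [0, 0, 0, 1, 1],  # movie 2: actors 3 and 4, disconnected from actor 0
])
toy_bacon_numbers = compute_bacon_numbers(toy_biadjacency, source=0)
print(toy_bacon_numbers)
```

Actors 3 and 4 share no chain of movies with actor 0, so they keep the value -1; such actors are worth filtering out before plotting a histogram.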

## 3. Clustering

Finally, we check the clustering property of real graphs (my friends tend to be friends).

## To do


* Compute the clustering coefficient of Openflights and WikiVitals (seen as an undirected graph).
* Compare with that of a random graph of same size.

In [None]:
dataset = openflights

In [None]:
adjacency = dataset.adjacency

In [None]:
get_clustering_coefficient(adjacency)

In [None]:
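n = adjacency.shape[0]
# random graph of same density: adjacency.nnz counts each undirected edge twice
# (summing adjacency.data would add up edge weights if the graph is weighted)
adjacency_random = erdos_renyi(n, adjacency.nnz / n / (n - 1))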

In [None]:
get_clustering_coefficient(adjacency_random)
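To make the quantity concrete, the global clustering coefficient can be computed by hand on a small graph as three times the number of triangles divided by the number of pairs of edges sharing a node. The sketch below assumes this standard definition; `clustering_coefficient` is a hypothetical helper working on a dense matrix, and it may differ in details from the implementation of `get_clustering_coefficient` in scikit-network.

```python
import numpy as np

def clustering_coefficient(adjacency):
    """Global clustering coefficient of an undirected graph without self-loops (dense matrix)."""
    adjacency = (np.asarray(adjacency) > 0).astype(int)
    degrees = adjacency.sum(axis=1)
    # each triangle is counted 6 times on the diagonal of A^3
    n_triangles = np.trace(adjacency @ adjacency @ adjacency) / 6
    # number of pairs of edges sharing a node
    n_edge_pairs = np.sum(degrees * (degrees - 1)) / 2
    if n_edge_pairs == 0:
        return 0.0
    return float(3 * n_triangles / n_edge_pairs)

# toy graph (hypothetical): a triangle 0-1-2 plus node 3 attached to node 2
toy_graph = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
toy_coefficient = clustering_coefficient(toy_graph)
print(toy_coefficient)
```

Here there is 1 triangle and 5 pairs of adjacent edges, so the coefficient is 3/5. In an Erdős-Rényi graph, two neighbors of a node are linked with the same probability as any other pair, so the coefficient is expected to be close to the edge probability, which makes the random graph a natural baseline.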