# Exercise 02 - Paths, centralities, and community structure

In this week's assignment, we put into practice some of the key network-analytic concepts introduced in lecture 02. We calculate path-based network characteristics for empirical data sets and develop a simple approach to detecting community structures based on the heuristic optimisation of modularity. Finally, we calculate and visualise the degree distribution of real networks. We will use the following five empirical data sets:

1) the collaboration network of the open source software community `kde`  
2) the collaboration network of the open source software community `gentoo`  
3) the power grid of the western states of the USA  
4) the contact network of students in a high school  
5) an information sharing network of physicians in the United States  

All of these data are available in a single SQLite database file, which you can find in Moodle.

## Task 1: Paths, diameter, and components

### 1.1 Connected components

Implement Tarjan's algorithm for the calculation of connected components for an instance of `pathpy.Network` (see e.g. [Hopcroft and Tarjan 1973](https://dl.acm.org/citation.cfm?doid=362248.362272)). Test your algorithm on the small toy example from lecture 2 (slide 14). Define a function that returns the relative size of the largest connected component (lcc) for a given network.

Compare your results with the implementation given in pathpy, using the function `reduce_to_gcc` in `pathpy.algorithms.components`.
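
As a starting point, here is a minimal sketch of component detection by iterative depth-first search. It works on a plain adjacency dictionary rather than on `pathpy.Network`, so the input format and the helper names `connected_components` and `relative_lcc_size` are assumptions made for illustration; it is not pathpy's implementation and would need to be adapted to the network class.

```python
from collections import defaultdict

def connected_components(adjacency):
    """Return the connected components (as sets of nodes) of an undirected graph.

    `adjacency` maps each node to an iterable of its neighbours (assumed format).
    """
    visited = set()
    components = []
    for start in adjacency:
        if start in visited:
            continue
        # Iterative depth-first search starting from an unvisited node
        component = set()
        stack = [start]
        visited.add(start)
        while stack:
            v = stack.pop()
            component.add(v)
            for w in adjacency.get(v, ()):
                if w not in visited:
                    visited.add(w)
                    stack.append(w)
        components.append(component)
    return components

def relative_lcc_size(adjacency):
    """Fraction of nodes that belong to the largest connected component."""
    if not adjacency:
        return 0.0
    components = connected_components(adjacency)
    return max(len(c) for c in components) / len(adjacency)

# Hypothetical example with two components: {a, b, c} and {d, e}
edges = [('a', 'b'), ('b', 'c'), ('d', 'e')]
adjacency = defaultdict(set)
for v, w in edges:
    adjacency[v].add(w)
    adjacency[w].add(v)

components = connected_components(adjacency)
print(components)
print(relative_lcc_size(adjacency))
```

Each node is pushed onto the stack at most once, so the running time is linear in the number of nodes and links.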

In [1]:
import numpy as np
import pathpy as pp

# TODO: Implement functions

pp.Network?

Init signature: pp.Network(directed=False)
Docstring:
A graph or network that can be directed, undirected, unweighted or weighted
and whose edges can contain arbitrary attributes. This is the base class for
HigherOrderNetwork

Attributes
----------

nodes : list
    A list of (string) nodes.
edges : dictionary
    A dictionary containing edges (as tuple-valued keys) and their attributes (as value)
Init docstring: Generates an empty network.
File:           ~/anaconda3/lib/python3.7/site-packages/pathpy/classes/network.py
Type:           type
Subclasses:     DAG, HigherOrderNetwork


In [2]:
pp.algorithms.components.reduce_to_gcc?

Signature: pp.algorithms.components.reduce_to_gcc(network)
Docstring:
Reduces the network to the largest connected component.
Connected components are calculated using Tarjan's algorithm.
File:      ~/anaconda3/lib/python3.7/site-packages/pathpy/algorithms/components.py
Type:      function


### 1.2 Connected components in empirical data

Read the five data sets from the SQLite database as *undirected* networks and compute the sizes of connected components. Would you say that these networks contain a *giant* connected component?

In [None]:
import sqlite3
con = sqlite3.connect('data/01_networks.db')
con.row_factory = sqlite3.Row

# TODO: Read data as introduced in exercise 01 and apply your algorithm to these data

### 1.3 Diameter and average path length

Use the functions `diameter` and `avg_path_length` in `pathpy.algorithms.shortest_paths` to compute the diameter and the average shortest path length of the (largest connected component in the) `physicians`, `highschool`, and `gentoo` data. Interpret the differences, taking the different sizes of the networks into account.
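
To make the two quantities concrete, the following sketch computes them by breadth-first search on a plain adjacency dictionary. It assumes a connected, undirected, unweighted graph with at least two nodes and averages over all unordered node pairs; the function names are hypothetical, and pathpy's own functions may differ in their conventions.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adjacency, source):
    """Shortest path lengths (in hops) from `source` to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def diameter_and_avg_path_length(adjacency):
    """Return (diameter, average shortest path length) of a connected graph."""
    all_dist = {v: bfs_distances(adjacency, v) for v in adjacency}
    # One shortest path length per unordered pair of distinct nodes
    lengths = [all_dist[v][w] for v, w in combinations(adjacency, 2)]
    return max(lengths), sum(lengths) / len(lengths)

# Hypothetical example: the path graph a - b - c - d
path_graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
diam, avg = diameter_and_avg_path_length(path_graph)
print(diam, avg)
```

Running one breadth-first search per node is quadratic or worse in network size, which is one reason to restrict this task to the smaller data sets.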

In [None]:
# TODO: Calculate diameter and average shortest path lengths using functions in pathpy
pp.algorithms.shortest_paths.diameter(n)

## Task 2: Modularity-based community detection

### 2.1 Partition quality

Implement the partition quality function $Q(n, C)$ for a given network `n` and a non-overlapping partitioning `C` of nodes into communities, as introduced in lecture 2 on slide 22. Test your method on the toy example and the partitioning depicted on slide 26, and check whether you obtain the same value for the partition quality.
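
For orientation, the sketch below assumes that the definition on slide 22 coincides with the standard modularity $Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$ for an undirected, unweighted network without self-loops or duplicate links. It operates on a plain list of links instead of a `pathpy.Network`, and the two-community partition used here is a hypothetical choice, not necessarily the one shown on slide 26.

```python
from collections import defaultdict

def modularity(edges, C):
    """Modularity of partition C (node -> community) for a list of undirected links."""
    m = len(edges)
    internal = defaultdict(int)    # number of links inside each community
    degree_sum = defaultdict(int)  # sum of node degrees in each community
    for v, w in edges:
        degree_sum[C[v]] += 1
        degree_sum[C[w]] += 1
        if C[v] == C[w]:
            internal[C[v]] += 1
    # Grouping the double sum by community gives L_c/m - (D_c/2m)^2 per community
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Links of the toy example used in this exercise
toy_edges = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('b', 'd'), ('d', 'f'),
             ('d', 'g'), ('d', 'e'), ('e', 'f'), ('f', 'g')]

# Hypothetical partition into {a, b, c} and {d, e, f, g}
toy_partition = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1, 'g': 1}
q_toy = modularity(toy_edges, toy_partition)
print(q_toy)
```

Summing per community avoids the double loop over all node pairs, so a single pass over the links suffices.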

In [None]:
def Q(network, C):
    # TODO: implement function
    # Hint: Assume that the partitioning is given as a dictionary where C[v] is the community of node v
    raise NotImplementedError


In [None]:
n = pp.Network()
n.add_edge('a', 'b')
n.add_edge('b', 'c')
n.add_edge('a', 'c')
n.add_edge('b', 'd')
n.add_edge('d', 'f')
n.add_edge('d', 'g')
n.add_edge('d', 'e')
n.add_edge('e', 'f')
n.add_edge('f', 'g')

# Apply the method to the toy example

### 2.2 Modularity optimisation

Implement a simple heuristic optimisation algorithm to approximate the optimal modularity $Q_{opt}$ across all partitions. The idea of this algorithm is as follows:

1) Start with a partitioning where you place each node in a separate community   
2) Draw two communities uniformly at random and merge them into a single community iff this merge increases partition quality  
3) Repeat the second step for a given number of iterations and output the final partitioning and partition quality  

Use your method to calculate $Q_{opt}$ for the toy example and plot the detected communities by coloring the nodes appropriately.
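
The three steps above can be sketched as follows. This is one possible reading of the heuristic, written against a plain list of links rather than `pathpy.Network`; the name `find_communities_sketch`, the `seed` parameter, and the embedded standard modularity function are assumptions made for illustration.

```python
import random
from collections import defaultdict

def modularity(edges, C):
    """Standard modularity for undirected, unweighted links (assumed definition)."""
    m = len(edges)
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for v, w in edges:
        degree_sum[C[v]] += 1
        degree_sum[C[w]] += 1
        if C[v] == C[w]:
            internal[C[v]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

def find_communities_sketch(edges, iterations=100, seed=None):
    """Randomly merge pairs of communities, keeping only merges that increase Q."""
    rng = random.Random(seed)
    nodes = sorted({v for edge in edges for v in edge})
    # Step 1: every node starts in its own community
    C = {v: i for i, v in enumerate(nodes)}
    q = modularity(edges, C)
    for _ in range(iterations):
        labels = sorted(set(C.values()))
        if len(labels) < 2:
            break
        # Step 2: draw two distinct communities uniformly at random
        x, y = rng.sample(labels, 2)
        merged = {v: (x if c == y else c) for v, c in C.items()}
        q_merged = modularity(edges, merged)
        if q_merged > q:
            C, q = merged, q_merged
    # Step 3: return the final partition and its quality
    return C, q

toy_edges = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('b', 'd'), ('d', 'f'),
             ('d', 'g'), ('d', 'e'), ('e', 'f'), ('f', 'g')]
C_found, q_found = find_communities_sketch(toy_edges, iterations=1000, seed=42)
print(C_found, q_found)
```

Because merges are never undone, the procedure can get stuck in a local optimum, so different seeds and iteration counts may yield different partitions.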

In [None]:
def find_communities(network, iterations=100):
    # TODO: implement method
    raise NotImplementedError

To make this task a bit simpler, I provide you with the following method, which generates a community-based node-color mapping that you can use to color nodes according to (a maximum of 20) communities (hint: use the `node_color` parameter of the `pathpy.visualisation.plot` function).

In [None]:
def map_colors(n, communities):
    """Map each node of network n to a color name based on its community.

    Colors are reused if there are more than 20 communities.
    """
    colors = ['red', 'green', 'blue', 'orange', 'yellow', 'cyan', 'blueviolet',
              'chocolate', 'magenta', 'navy', 'plum', 'thistle', 'wheat', 'turquoise',
              'steelblue', 'grey', 'powderblue', 'orchid', 'mintcream', 'maroon']
    node_colors = {}
    community_color_map = {}  # community label -> index into colors
    i = 0
    for v in n.nodes:
        if communities[v] not in community_color_map:
            community_color_map[communities[v]] = i % len(colors)
            i += 1
        node_colors[v] = colors[community_color_map[communities[v]]]
    return node_colors

In [None]:
# TODO: Detect communities in toy example 

### 2.3 Synthetic network generation

Create a simple *synthetic network* with a strong (and known) community structure as follows: 

1) Generate two networks $c1$ and $c2$ with $50$ nodes each and add $200$ links at random to each of the networks. For this, you can use the `numpy.random.choice` function introduced in exercise 02.  
2) Use `pathpy`'s $+$-operator to combine the two networks into a single network with two connected components.  
3) Add $5$ links that randomly interconnect nodes across the two components $c1$ and $c2$, thus generating a connected network.

Visualise the network generated by this method.
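
The construction above could be sketched as follows using plain Python sets of links, which can then be added to `pathpy.Network` instances via `add_edge`. The node names, the helper `random_links`, the use of a seeded NumPy generator, and the choice to draw distinct links without self-loops are all assumptions made for illustration; the sketch does not use pathpy's $+$-operator.

```python
import numpy as np

def random_links(nodes, num_links, rng):
    """Draw `num_links` distinct undirected links between distinct nodes."""
    links = set()
    while len(links) < num_links:
        v, w = rng.choice(nodes, size=2, replace=False)
        # Sort the endpoints so that (v, w) and (w, v) count as the same link
        links.add(tuple(sorted((str(v), str(w)))))
    return links

rng = np.random.default_rng(42)
c1_nodes = [f'a{i}' for i in range(50)]
c2_nodes = [f'b{i}' for i in range(50)]

# Step 1: 200 random links inside each of the two groups
c1_links = random_links(c1_nodes, 200, rng)
c2_links = random_links(c2_nodes, 200, rng)

# Step 3: 5 distinct random links with one endpoint in each group
cross_links = set()
while len(cross_links) < 5:
    cross_links.add((str(rng.choice(c1_nodes)), str(rng.choice(c2_nodes))))

# Step 2 (in this sketch): the combined network is the union of all links
all_links = c1_links | c2_links | cross_links

# Known community membership of every node
ground_truth = {v: 0 for v in c1_nodes}
ground_truth.update({v: 1 for v in c2_nodes})
print(len(all_links))
```

Note that randomly placed links do not strictly guarantee that each group is internally connected, so it is worth checking the components of the generated network.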

In [None]:
# TODO: Generate synthetic network

### 2.4 Heuristic optimisation vs. the global optimum

Define a community partitioning that corresponds to the "ground truth" communities in the network from 2.3 and calculate the partition quality $Q$ for this optimal partitioning. Use your method from Task 2.2 to find the partitioning with maximal modularity. Compare this value to the ground truth and visualise the corresponding community structures.

In [None]:
# TODO: Calculate global and heuristic optimum for synthetic network and visualise detected communities

### 2.5 Community assortativity coefficient

Using the definition from lecture 2, slide 29, implement a function that computes the theoretical maximum modularity $Q_{max}$ for a given network and partition. Test your function on the toy example and try to reproduce the *community assortativity coefficient* reported on slide 29. Calculate the community assortativity coefficient for the synthetic example from Task 2.4 and compare it to the modularity of this network.
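
As a hedged illustration, the sketch below assumes that slide 29 defines $Q_{max}$ as the modularity that would be obtained if every link fell inside a community, i.e. $Q_{max} = 1 - \sum_c \left(\frac{D_c}{2m}\right)^2$ with $D_c$ the total degree of community $c$, and the community assortativity coefficient as $Q / Q_{max}$. The function names and the two-community toy partition are hypothetical, and the lecture's definition may differ.

```python
from collections import defaultdict

def community_degree_sums(edges, C):
    """Total degree of each community for a list of undirected links."""
    degree_sum = defaultdict(int)
    for v, w in edges:
        degree_sum[C[v]] += 1
        degree_sum[C[w]] += 1
    return degree_sum

def modularity(edges, C):
    """Standard modularity (assumed definition)."""
    m = len(edges)
    internal = sum(1 for v, w in edges if C[v] == C[w])
    degree_sum = community_degree_sums(edges, C)
    return internal / m - sum((d / (2 * m)) ** 2 for d in degree_sum.values())

def q_max(edges, C):
    """Modularity in the ideal case where all links are inside communities."""
    m = len(edges)
    degree_sum = community_degree_sums(edges, C)
    return 1.0 - sum((d / (2 * m)) ** 2 for d in degree_sum.values())

def assortativity_coefficient(edges, C):
    """Q / Q_max; undefined if all nodes are in one community (Q_max = 0)."""
    return modularity(edges, C) / q_max(edges, C)

toy_edges = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('b', 'd'), ('d', 'f'),
             ('d', 'g'), ('d', 'e'), ('e', 'f'), ('f', 'g')]
# Hypothetical partition into {a, b, c} and {d, e, f, g}
toy_partition = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1, 'g': 1}
toy_q_max = q_max(toy_edges, toy_partition)
toy_coeff = assortativity_coefficient(toy_edges, toy_partition)
print(toy_q_max, toy_coeff)
```

Normalising by $Q_{max}$ makes values comparable across networks whose degree sequences and community sizes cap the attainable modularity at different levels.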

In [None]:
def Qmax(network, C):
    # TODO: Implement function
    raise NotImplementedError

In [None]:
# TODO: Calculate theoretical maximum for toy example and calculate community assortativity coefficient

### 2.6 Communities in empirical networks

Use your functions from Task 2.2 and 2.5 to compute the community assortativity coefficient and the number of detected communities for the `highschool` and `physicians` data. Visualise the optimal community structure found by your optimisation method. How does the number of detected communities depend on the number of iterations for your optimisation algorithm?

In [None]:
# Detect and visualise communities in empirical networks and calculate community assortativity coefficient