# P03-02: Clustering Coefficient

*May 07 2020*

In the second unit, we implement the clustering coefficient and compute it for some example networks.

In [1]:
import pathpy as pp

import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('default')
sns.set_style("whitegrid")

# Calculating the clustering coefficient

To test our implementation of the clustering coefficient, we first create two example networks:

In [2]:
n_undirected = pp.Network(directed=False)
n_undirected.add_edge('a', 'b')
n_undirected.add_edge('b', 'c')
n_undirected.add_edge('c', 'a')
n_undirected.add_edge('d', 'e')
n_undirected.add_edge('e', 'f')
n_undirected.add_edge('f', 'g')
n_undirected.add_edge('g', 'd')
n_undirected.add_edge('d', 'f')
n_undirected.add_edge('b', 'd')
n_undirected.plot()

In [3]:
n_directed = pp.Network(directed=True)
n_directed.add_edge('a', 'b')
n_directed.add_edge('b', 'c')
n_directed.add_edge('c', 'a')
n_directed.add_edge('d', 'e')
n_directed.add_edge('e', 'f')
n_directed.add_edge('f', 'g')
n_directed.add_edge('g', 'd')
n_directed.add_edge('d', 'f')
n_directed.add_edge('b', 'd')
n_directed.plot()

We start with the function `local_clustering_coefficient`, which computes the local clustering coefficient of a node, i.e. the fraction of pairs of neighbors of that node that are directly connected to each other:

In [5]:
print(pp.statistics.local_clustering_coefficient(n_undirected, 'a'))
print(pp.statistics.local_clustering_coefficient(n_undirected, 'b'))
print(pp.statistics.local_clustering_coefficient(n_undirected, 'f'))

1.0
0.3333333333333333
0.6666666666666666
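To make the definition concrete, the following is a minimal sketch in plain Python that recomputes these three values for the undirected example network. The helper names `build_neighbors` and `local_cc` are hypothetical and chosen for illustration only; this is not pathpy's implementation.

```python
from itertools import combinations

# Edge list of the undirected example network from above.
EDGES = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]


def build_neighbors(edges):
    """Map each node to the set of its neighbors (undirected). Hypothetical helper."""
    neighbors = {}
    for v, w in edges:
        neighbors.setdefault(v, set()).add(w)
        neighbors.setdefault(w, set()).add(v)
    return neighbors


def local_cc(neighbors, v):
    """Fraction of neighbor pairs of v that are directly connected."""
    k = len(neighbors[v])
    if k < 2:
        return 0.0  # a node with fewer than two neighbors has no neighbor pairs
    connected = sum(1 for x, y in combinations(neighbors[v], 2)
                    if y in neighbors[x])
    return connected / (k * (k - 1) / 2)


neighbors = build_neighbors(EDGES)
for v in ('a', 'b', 'f'):
    print(v, local_cc(neighbors, v))
```

Node `b` has three neighbors and therefore three neighbor pairs, of which only the pair `a`, `c` is connected, which gives one third.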


In the directed network, all three of these nodes have a clustering coefficient of zero:

In [6]:
print(pp.statistics.local_clustering_coefficient(n_directed, 'a'))
print(pp.statistics.local_clustering_coefficient(n_directed, 'b'))
print(pp.statistics.local_clustering_coefficient(n_directed, 'f'))

0.0
0.0
0.0


In this network, the only node with non-zero clustering coefficient is node `d`:

In [7]:
pp.statistics.local_clustering_coefficient(n_directed, 'd')

0.5

For a directed network that consists only of the three edges $(a,b)$, $(a,c)$ and $(b,c)$, the local clustering coefficient of $a$ is $0.5$, because only one of the two possible ordered pairs of its neighbors is connected: the edge $(b,c)$ exists, but $(c,b)$ does not.

In [10]:
n = pp.Network(directed=True)
n.add_edge('a','b')
n.add_edge('a','c')
n.add_edge('b','c')
n.plot()
print(pp.statistics.local_clustering_coefficient(n, 'a'))

0.5
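The directed values shown in this unit are consistent with a definition that considers only the successors (out-neighbors) of a node and counts ordered pairs among them. The sketch below works under that assumption, which is inferred from the outputs above and not taken from pathpy's source; the helper names `build_successors` and `local_cc_directed` are hypothetical.

```python
from itertools import permutations


def build_successors(edges):
    """Map each node to the set of nodes it points to. Hypothetical helper."""
    successors = {}
    for v, w in edges:
        successors.setdefault(v, set()).add(w)
        successors.setdefault(w, set())  # make sure every node has an entry
    return successors


def local_cc_directed(successors, v):
    """Fraction of ordered successor pairs (x, y) of v with an edge x -> y.

    Assumption: only out-neighbors of v are considered.
    """
    k_out = len(successors[v])
    if k_out < 2:
        return 0.0  # fewer than two successors: no pairs to check
    connected = sum(1 for x, y in permutations(successors[v], 2)
                    if y in successors[x])
    return connected / (k_out * (k_out - 1))


# The three-edge network from the cell above.
small = build_successors([('a', 'b'), ('a', 'c'), ('b', 'c')])
print(local_cc_directed(small, 'a'))

# The larger directed example network.
large = build_successors([('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'),
                          ('e', 'f'), ('f', 'g'), ('g', 'd'), ('d', 'f'),
                          ('b', 'd')])
for v in ('a', 'b', 'd', 'f'):
    print(v, local_cc_directed(large, v))
```

Under this assumption, `d` has the two successors `e` and `f`, and only the ordered pair `(e, f)` is connected, which reproduces the value 0.5 reported above.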


A global clustering coefficient can be computed as the average of all local clustering coefficients. Note that there are also other definitions, which count the total number of closed triads in the whole network and then calculate a single clustering coefficient for the network rather than averaging the local coefficients.

In [11]:
pp.statistics.avg_clustering_coefficient(n_undirected)

0.7619047619047619
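The two definitions can give quite different values. The following sketch compares the average of the local coefficients with a triad-counting variant (often called transitivity) on the undirected example network. It uses plain Python, and the helper names `build_neighbors`, `avg_cc` and `transitivity` are hypothetical, so this should be read as an illustration rather than as pathpy's implementation.

```python
from itertools import combinations

# Edge list of the undirected example network from above.
EDGES = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]


def build_neighbors(edges):
    """Map each node to the set of its neighbors (undirected)."""
    neighbors = {}
    for v, w in edges:
        neighbors.setdefault(v, set()).add(w)
        neighbors.setdefault(w, set()).add(v)
    return neighbors


def connected_pairs(neighbors, v):
    """Number of neighbor pairs of v that are directly connected."""
    return sum(1 for x, y in combinations(neighbors[v], 2)
               if y in neighbors[x])


def avg_cc(neighbors):
    """Average of the local clustering coefficients of all nodes."""
    total = 0.0
    for v, nbrs in neighbors.items():
        k = len(nbrs)
        if k >= 2:
            total += connected_pairs(neighbors, v) / (k * (k - 1) / 2)
    return total / len(neighbors)


def transitivity(neighbors):
    """Closed neighbor pairs divided by all neighbor pairs, network-wide."""
    closed = sum(connected_pairs(neighbors, v) for v in neighbors)
    triples = sum(len(nbrs) * (len(nbrs) - 1) // 2
                  for nbrs in neighbors.values())
    return closed / triples if triples else 0.0


neighbors = build_neighbors(EDGES)
print(avg_cc(neighbors))        # average of local coefficients
print(transitivity(neighbors))  # 9 closed pairs out of 16, i.e. 0.5625
```

The average is dominated by the four low-degree nodes with a local coefficient of 1, whereas the triad-counting variant weights each node by its number of neighbor pairs, so the high-degree nodes `b` and `d` pull the value down.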

If we want to find the set of closed triads around a node, we can use the function `closed_triads`. It returns all direct connections between neighbors of a given node. In our example, `f` has a degree of three and there is a direct edge between two of the three neighbor pairs:

In [12]:
for edge in pp.statistics.closed_triads(n_undirected, 'f'):
    print(edge)

Uid:		0x1fcffb60520
Type:		Edge
Source node:	Node d
Target node:	Node e
Uid:		0x1fcffb60ac0
Type:		Edge
Source node:	Node g
Target node:	Node d
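As a rough illustration of what such a function returns, the sketch below collects the edges between the neighbors of a node in plain Python. The helper names `build_neighbors` and `edges_between_neighbors` are hypothetical and are not part of pathpy.

```python
from itertools import combinations

# Edge list of the undirected example network from above.
EDGES = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]


def build_neighbors(edges):
    """Map each node to the set of its neighbors (undirected)."""
    neighbors = {}
    for v, w in edges:
        neighbors.setdefault(v, set()).add(w)
        neighbors.setdefault(w, set()).add(v)
    return neighbors


def edges_between_neighbors(neighbors, v):
    """Return every undirected edge that connects two neighbors of v."""
    return {frozenset((x, y)) for x, y in combinations(neighbors[v], 2)
            if y in neighbors[x]}


neighbors = build_neighbors(EDGES)
for edge in edges_between_neighbors(neighbors, 'f'):
    print(sorted(edge))
```

For node `f` this yields the two edges between `d` and `e` and between `d` and `g`, the same two edges listed in the output above. Dividing the size of this set by the number of neighbor pairs gives the local clustering coefficient of two thirds.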


## Clustering coefficient in empirical networks

To conclude this unit, we calculate the clustering coefficient of three empirical networks:

In [15]:
n_gentoo = pp.io.sql.read_network('networks.db', sql='SELECT source, target FROM gentoo', directed=True)
n_highschool = pp.io.sql.read_network('networks.db', sql='SELECT source, target FROM highschool', directed=False)
n_physicians = pp.io.sql.read_network('networks.db', sql='SELECT source, target FROM physicians', directed=False)

The first network is a directed communication and collaboration network, which has a small clustering coefficient.

In [17]:
print(n_gentoo.avg_clustering_coefficient())

0.0


The other two networks are undirected social networks, for which we expect a large clustering coefficient. The reason for this is that two friends of a person are likely to be friends with each other as well:

In [18]:
print(n_highschool.avg_clustering_coefficient())
print(n_physicians.avg_clustering_coefficient())

0.0
0.0


In [19]:
# n_highschool.plot()