# Exercise 01: Empirical Networks and Random Graphs


In this week's assignment, we apply network-analytic methods to empirical data. We further explore the G(n,m) random graph model defined in lecture L03 and compare its characteristics to those of real networks constructed from empirical data. We use the following five empirical data sets:

1) the collaboration network of the open-source software community `kde`  
2) the collaboration network of the open-source software community `gentoo`  
3) the `powergrid` of the western states of the USA  
4) the contact network of students in a `highschool`  
5) an information-sharing network of `physicians` in the United States  

The data are available in separate tables within a single SQLite database file.

To simplify the exercise a bit, I provide the following boilerplate code, which imports the necessary packages, enables inline plotting, connects to the database file, and loads each table as a NetworkX graph.

```python
%matplotlib inline
import matplotlib.pyplot as plt
import sqlite3
import numpy as np
import networkx as nx

# Connect to the SQLite database.
con = sqlite3.connect('data/01_networks.db')
con.row_factory = sqlite3.Row

def nx_from_sqlite(query, directed=False):
    """
    Executes a SQL query that returns rows with 'source' and 'target' columns,
    and returns a NetworkX graph built from the query result.
    If directed=True, returns a DiGraph, else an undirected Graph.
    """
    cursor = con.execute(query)
    edges = [(row['source'], row['target']) for row in cursor.fetchall()]
    if directed:
        return nx.DiGraph(edges)
    else:
        return nx.Graph(edges)

# Create NetworkX graphs from the corresponding tables.
n_highschool = nx_from_sqlite('SELECT source, target FROM highschool')
n_kde        = nx_from_sqlite('SELECT source, target FROM kde')
n_gentoo     = nx_from_sqlite('SELECT source, target FROM gentoo')
n_powergrid  = nx_from_sqlite('SELECT source, target FROM powergrid')
n_physicians = nx_from_sqlite('SELECT source, target FROM physicians')
```


## Task 1: Empirical network analysis

### 1.1 Node centralities

We next study the concept of node centrality measures in real-world social networks.

Read the `gentoo` table as a directed network and the `highschool` table as an undirected network. Compute the following centralities for both networks:

1) in- and out-degree (for directed network), degree (for undirected network)  
2) closeness centrality   
3) betweenness centrality    

Use the `networkx` functions `closeness_centrality` and `betweenness_centrality` to identify the five most central nodes according to those centrality measures.
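
The ranking step can be sketched as follows. This is a minimal illustration, not a reference solution: it uses Zachary's karate club graph, which ships with `networkx`, as a stand-in for the course data, and `top_k` is a hypothetical helper name introduced here.

```python
import networkx as nx

def top_k(centrality, k=5):
    """Return the k (node, score) pairs with the highest scores."""
    return sorted(centrality.items(), key=lambda item: item[1], reverse=True)[:k]

# Stand-in data (assumption): replace with the graphs read from the database.
g = nx.karate_club_graph()

degree = dict(g.degree())
closeness = nx.closeness_centrality(g)
betweenness = nx.betweenness_centrality(g)

for name, scores in [("degree", degree),
                     ("closeness", closeness),
                     ("betweenness", betweenness)]:
    print(name, [node for node, _ in top_k(scores)])

# In a directed graph, in- and out-degree are reported separately.
d = nx.DiGraph([(1, 2), (1, 3), (2, 3)])
in_degree = dict(d.in_degree())
out_degree = dict(d.out_degree())
```

Ties are broken by the order in which nodes appear in the graph, because Python's `sorted` is stable.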

### 1.2 Visualising centralities

We next study how node centralities can be represented visually through node sizes. Your task is to visualise the node centralities in the `highschool` network. Use the `node_size` parameter of the `nx.draw` function to scale the nodes according to their degree, closeness, and betweenness centrality, producing one plot per measure.
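
One possible way to turn centrality scores into node sizes is a linear rescaling, sketched below. The helper `node_sizes`, its size bounds, and the karate club stand-in graph are assumptions made for illustration; other scalings (for example logarithmic) may suit skewed measures such as betweenness better.

```python
import matplotlib.pyplot as plt
import networkx as nx

def node_sizes(graph, centrality, min_size=20, max_size=600):
    """Map scores linearly onto [min_size, max_size], in graph.nodes order."""
    values = [centrality[v] for v in graph.nodes]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero if all scores are equal
    return [min_size + (max_size - min_size) * (x - low) / span for x in values]

g = nx.karate_club_graph()          # stand-in data (assumption)
pos = nx.spring_layout(g, seed=42)  # one fixed layout keeps the panels comparable

measures = {
    "degree": dict(g.degree()),
    "closeness": nx.closeness_centrality(g),
    "betweenness": nx.betweenness_centrality(g),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (name, scores) in zip(axes, measures.items()):
    nx.draw(g, pos=pos, ax=ax, node_size=node_sizes(g, scores), width=0.5)
    ax.set_title(name)
```

The sizes are built in `graph.nodes` order because `nx.draw` matches a list passed as `node_size` to the nodes in that order.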

## Task 2: Random Graph Models

### 2.1 G(n,m) model

Write a `python` function that starts from an empty network with n nodes, samples m pairs of nodes `i` and `j` uniformly at random, and creates a link between each sampled pair.
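
A minimal sketch of such a sampler is shown below. It assumes that the model forbids self-loops and duplicate links, so that the result has exactly m distinct links; check this against the definition in the lecture. The function name `gnm_sample` and the `seed` parameter are introduced here for illustration.

```python
import random
import networkx as nx

def gnm_sample(n, m, seed=None):
    """Sample an undirected graph with n nodes and exactly m distinct links."""
    max_links = n * (n - 1) // 2
    if m > max_links:
        raise ValueError(f"cannot place {m} links among {n} nodes")
    rng = random.Random(seed)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    while g.number_of_edges() < m:
        i, j = rng.sample(range(n), 2)  # two different nodes, so no self-loops
        g.add_edge(i, j)                # re-adding an existing link has no effect
    return g

g = gnm_sample(200, 300, seed=1)
print(g.number_of_nodes(), g.number_of_edges())
```

Repeated draws of an already existing link are simply discarded, which is cheap for sparse graphs but becomes slow when m approaches the maximum of n(n-1)/2 links.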

### 2.2 Degree distribution of G(n,m) model

Use your function and the boilerplate code below to create five random networks with $200$ nodes and $300$ links. Visualise the networks, plot their degree distributions and calculate the mean and variance of the distributions.
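
The statistics can be computed along the following lines. The sketch uses the built-in `nx.gnm_random_graph` as a stand-in for your own sampler, and `degree_stats` is a hypothetical helper name. Since every link contributes to the degree of two nodes, the mean degree is fixed at 2m/n = 3 in every realisation; only the variance differs between samples.

```python
import numpy as np
import networkx as nx

def degree_stats(graph):
    """Return the empirical degree distribution P(k), its mean and its variance."""
    degrees = np.array([d for _, d in graph.degree()])
    counts = np.bincount(degrees)         # counts[k] = number of nodes with degree k
    distribution = counts / counts.sum()  # fraction of nodes with degree k
    return distribution, degrees.mean(), degrees.var()

n, m = 200, 300
results = []
for seed in range(5):
    g = nx.gnm_random_graph(n, m, seed=seed)  # stand-in for your own function
    distribution, mean, variance = degree_stats(g)
    results.append((distribution, mean, variance))
    print(f"seed={seed}  mean={mean:.2f}  variance={variance:.2f}")
```

The array `distribution` can be passed directly to `plt.bar(range(len(distribution)), distribution)` to plot the degree distribution.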

### 2.3 Empirical degree distributions

Plot the histogram of node degrees for the five empirical networks listed in the introduction (interpreted as undirected networks). Compare the degree distributions to those of a random graph model with the same number of nodes and links.
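
A numerical comparison can complement the histograms. The sketch below assumes the karate club graph as a stand-in for an empirical network and uses a hypothetical helper `degree_summary`. By construction both graphs have the same mean degree, so any difference shows up in the spread of the distribution.

```python
import numpy as np
import networkx as nx

def degree_summary(graph):
    """Mean, variance and maximum of the degree sequence."""
    degrees = np.array([d for _, d in graph.degree()])
    return {
        "mean": float(degrees.mean()),
        "variance": float(degrees.var()),
        "max": int(degrees.max()),
    }

empirical = nx.karate_club_graph()  # stand-in data (assumption)
n, m = empirical.number_of_nodes(), empirical.number_of_edges()
random_graph = nx.gnm_random_graph(n, m, seed=0)

summary_empirical = degree_summary(empirical)
summary_random = degree_summary(random_graph)
print("empirical:", summary_empirical)
print("random:   ", summary_random)
```

A degree variance far above that of the random graph hints at a broad degree distribution with a few highly connected nodes.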

## Task 3: Triadic closure in social networks

### 3.1 Calculating the clustering coefficient

Implement `python` functions that calculate the local and global clustering coefficients in an undirected network. Test your functions on the small toy example from lecture L03, slide "Global clustering coefficient".
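
The two coefficients could be sketched as follows. Two assumptions are made here: the global coefficient is taken to be the fraction of connected triples that are closed (also called transitivity), and nodes with fewer than two neighbours get a local coefficient of 0. The lecture may use different conventions. The toy graph is a hypothetical example, not the one from the slide.

```python
from itertools import combinations
import networkx as nx

def linked_neighbour_pairs(graph, node):
    """Return (number of linked neighbour pairs, number of neighbour pairs)."""
    neighbours = [v for v in graph[node] if v != node]  # ignore self-loops
    k = len(neighbours)
    linked = sum(1 for u, v in combinations(neighbours, 2) if graph.has_edge(u, v))
    return linked, k * (k - 1) // 2

def local_clustering(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    linked, pairs = linked_neighbour_pairs(graph, node)
    return linked / pairs if pairs else 0.0

def global_clustering(graph):
    """Closed connected triples divided by all connected triples."""
    closed = triples = 0
    for node in graph:
        linked, pairs = linked_neighbour_pairs(graph, node)
        closed += linked
        triples += pairs
    return closed / triples if triples else 0.0

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
print([local_clustering(toy, v) for v in toy], global_clustering(toy))
```

Under these conventions the results can be cross-checked against `nx.clustering` and `nx.transitivity`.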

### 3.2 Empirical vs. random clustering coefficient

Use your function to calculate the global clustering coefficient of the `highschool` data and compare it to the global clustering coefficient in random networks with the same number of nodes and links. Would you say that the level of triadic closure in this social network is significantly higher than what we would expect at random?
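
One way to make "significantly higher" concrete is to compare the empirical value with an ensemble of random graphs rather than a single one. The sketch below is an illustration under assumptions: it uses the karate club graph as stand-in data, `nx.transitivity` in place of your own function, 100 random realisations, and a z-score as the measure of deviation.

```python
import numpy as np
import networkx as nx

empirical = nx.karate_club_graph()  # stand-in data (assumption)
n, m = empirical.number_of_nodes(), empirical.number_of_edges()
c_empirical = nx.transitivity(empirical)

# Global clustering coefficient in 100 random graphs with the same n and m.
c_random = np.array([
    nx.transitivity(nx.gnm_random_graph(n, m, seed=s)) for s in range(100)
])

# Number of standard deviations by which the empirical value exceeds the random mean.
z_score = (c_empirical - c_random.mean()) / c_random.std()

print(f"empirical C = {c_empirical:.3f}")
print(f"random C    = {c_random.mean():.3f} +/- {c_random.std():.3f}")
print(f"z-score     = {z_score:.1f}")
```

In a random graph the clustering coefficient is expected to be close to the link density 2m / (n(n-1)), since any two neighbours of a node are linked with roughly that probability.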