<p style="text-align:center">
PSY 394U <b>Data Analytics with Python</b>, Spring 2018


<img style="width: 400px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2018/blob/master/images/Title_pics.png?raw=true" alt="title pics"/>

</p>

<p style="text-align:center; font-size:40px; margin-bottom: 30px;"><b> Network statistics </b></p>

<p style="text-align:center; font-size:18px; margin-bottom: 32px;"><b>March 29, 2018</b></p>

<hr style="height:5px;border:none" />

# 1. How big is a network?
<hr style="height:1px;border:none" />

## Number of nodes and edges
The size of a network can be summarized by its numbers of nodes and edges, which we call $n$ and $m$, respectively. Both can be obtained by applying the **`len()`** function to the results of the **`.nodes()`** and **`.edges()`** methods of a graph. Here are two examples.
  * **`G_karate`**: From **`karate.gml`**, Zachary's karate club network
  * **`G_netsci`**: From **`netscience.gml`**, network science co-authorship network

*Both data sets are available from [Mark Newman](http://www-personal.umich.edu/~mejn/netdata/)*

`<NetSize.py>`

In [1]:
import networkx as nx
import numpy as np

# loading network data
G_karate = nx.read_gml('karate.gml', label='id')  # Karate network
G_netsci = nx.read_gml('netscience.gml')  # network science co-authorship


# Network sizes
print('Network sizes')
print("Zachary's karate network, n:", len(G_karate.nodes()), sep='')
print("Zachary's karate network, m:", len(G_karate.edges()), sep='')

print("Network science co-authorship network, n:",
      len(G_netsci.nodes()), sep='')
print("Network science co-authorship network, m:",
      len(G_netsci.edges()), sep='')


Network sizes
Zachary's karate network, n:34
Zachary's karate network, m:78
Network science co-authorship network, n:1589
Network science co-authorship network, m:2742
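
As a side note, `networkx` graphs also provide the methods `number_of_nodes()` and `number_of_edges()`, which return the same counts as the `len()` calls above. The sketch below is only an illustration: to stay self-contained it uses `nx.karate_club_graph()`, the copy of Zachary's karate club network bundled with `networkx`, instead of the `karate.gml` file, and the variable names `G`, `n`, and `m` are chosen for this example.

```python
import networkx as nx

# Built-in copy of Zachary's karate club network (no data file needed)
G = nx.karate_club_graph()

n = G.number_of_nodes()  # same value as len(G.nodes())
m = G.number_of_edges()  # same value as len(G.edges())

print("n:", n)
print("m:", m)
```

For this network the two approaches agree with the counts printed above (34 nodes and 78 edges).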


## Giant component size

In a network data set, there is no guarantee that all nodes are connected as a single network; some nodes may be disconnected from the rest. Thus, in addition to the network size, we can also examine the size of the giant component, that is, the number of nodes in the largest connected component of the data set. `networkx` has no single function that returns the giant component size directly. Instead, we use the function **`connected_components`**, which returns a generator of the connected components of the network, each as a set of nodes. We sort the components with the **`sorted`** function according to their number of nodes (thus **`key = len`**) in descending order (**`reverse=True`**), so that the first element is the giant component. We are only interested in its number of nodes.

In [3]:
# Giant component sizes
print('Giant component sizes')
GC_karate = len(sorted(nx.connected_components(G_karate), key = len, reverse=True)[0])
GC_netsci = len(sorted(nx.connected_components(G_netsci), key = len, reverse=True)[0])
print("Zachary's karate network, GC:", GC_karate, sep='')
print("Network science co-authorship network, GC:",
      GC_netsci, sep='')

Giant component sizes
Zachary's karate network, GC:34
Network science co-authorship network, GC:379
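
Since only the largest component is needed, sorting every component is not strictly required. A minimal sketch of an alternative, assuming an undirected graph, uses `max()` with `key=len`. The toy graph `G_toy` and the names `giant_nodes`, `GC_toy`, and `G_giant` are made up for this illustration and are not part of the course data.

```python
import networkx as nx

# Toy graph with three components (hypothetical data for illustration)
G_toy = nx.Graph()
G_toy.add_edges_from([(1, 2), (2, 3), (3, 4)])  # component with 4 nodes
G_toy.add_edge(5, 6)                            # component with 2 nodes
G_toy.add_node(7)                               # isolated node

# Each connected component is a set of nodes; pick the largest one
giant_nodes = max(nx.connected_components(G_toy), key=len)
GC_toy = len(giant_nodes)

# If the giant component is needed as a graph, take the induced subgraph
G_giant = G_toy.subgraph(giant_nodes).copy()

print("GC size:", GC_toy)
print("GC edges:", G_giant.number_of_edges())
```

The `subgraph(...).copy()` step is only needed when later analyses should run on the giant component itself rather than on its node count.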



It may be easier to interpret the relative giant component size, that is, the giant component size divided by the total number of nodes.

In [4]:
# Relative giant component sizes
rGC_karate = GC_karate/len(G_karate.nodes())
rGC_netsci = GC_netsci/len(G_netsci.nodes())
print('Relative giant component sizes')
print('Zachary\'s karate network, GC: %4.2f' % rGC_karate)
print("Network science co-authorship network, GC: %4.2f" % rGC_netsci)

Relative giant component sizes
Zachary's karate network, GC: 1.00
Network science co-authorship network, GC: 0.24


As you can see, the karate network includes all nodes as part of the giant component, whereas the network science network only includes 24% of all available nodes as part of the giant component.
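
The three quantities discussed so far (nodes, edges, and relative giant component size) can be collected in one small helper. This is a sketch under stated assumptions rather than part of the original course code: the function name `network_size_summary` is hypothetical, the graph is assumed to be undirected and to have at least one node, and the demo graph is a toy example.

```python
import networkx as nx

def network_size_summary(G):
    """Return (n, m, relative giant component size) for an undirected graph G.

    Assumes G has at least one node.
    """
    n = G.number_of_nodes()
    m = G.number_of_edges()
    gc_size = len(max(nx.connected_components(G), key=len))
    return n, m, gc_size / n

# Toy example: a 6-node cycle plus 4 isolated nodes (10 nodes in total)
G_demo = nx.cycle_graph(6)
G_demo.add_nodes_from(range(6, 10))

n, m, rGC = network_size_summary(G_demo)
print("n: %d, m: %d, relative GC: %4.2f" % (n, m, rGC))
```

Calling the same helper on a graph loaded with `nx.read_gml` should give the numbers computed step by step in the cells above.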

### Exercise
1. **Network size table**. The following network data sets are available for you:
  * Les Miserables interaction network - **`lesmis.gml`**
  * NCAA college football network - **`football.gml`**
  * S&P500 stock price correlation network - **`SP500.gexf`**
  * Facebook sample network - **`facebook_combined.edgelist`**
  * Western US power grid - **`power.gml`**
  * High-res fMRI connectivity network - **`fMRI_HighRes.adjlist`**

Here is a code snippet to read these data sets:

In [5]:
import networkx as nx
import numpy as np

# loading network data
G_LesMis = nx.read_gml('lesmis.gml')  # Les Miserables
G_football = nx.read_gml('football.gml')  # Football network
G_SP500 = nx.read_gexf('SP500.gexf')  # S&P500
G_facebook = nx.read_edgelist('facebook_combined.edgelist')  # facebook
G_power = nx.read_gml('power.gml', label='id')  # power grid
G_fMRI = nx.read_adjlist('fMRI_HighRes.adjlist')  # fMRI network (variable name assumed)

Your goal is to fill in the numbers that are missing in the table below. *You can just post the numbers, not the code, on Canvas discussion*.

<table>
<tr>
<th style="text-align:left">Network</th>
<th style="text-align:center">Nodes</th>
<th style="text-align:center">Edges</th>
<th style="text-align:center">Relative GC size</th>
</tr>
<tr>
<td style="text-align:left">Les Miserables</td>
<td style="text-align:center"><b style="color:red;">(a)</b></td>
<td style="text-align:center">254</td>
<td style="text-align:center">1.00</td>
</tr>
<tr>
<td style="text-align:left">NCAA football</td>
<td style="text-align:center">115</td>
<td style="text-align:center">613</td>
<td style="text-align:center"><b style="color:red;">(b)</b></td>
</tr>
<tr>
<td style="text-align:left">S&amp;P 500</td>
<td style="text-align:center">491</td>
<td style="text-align:center"><b style="color:red;">(c)</b></td>
<td style="text-align:center">1.00</td>
</tr>
<tr>
<td style="text-align:left">Facebook</td>
<td style="text-align:center">4039</td>
<td style="text-align:center"><b style="color:red;">(d)</b></td>
<td style="text-align:center">1.00</td>
</tr>
<tr>
<td style="text-align:left">Power grid</td>
<td style="text-align:center">4941</td>
<td style="text-align:center">6594</td>
<td style="text-align:center"><b style="color:red;">(e)</b></td>
</tr>
<tr>
<td style="text-align:left">fMRI network</td>
<td style="text-align:center"><b style="color:red;">(f)</b></td>
<td style="text-align:center"><b style="color:red;">(g)</b></td>
<td style="text-align:center">1.00</td>
</tr>
</table>

* Metrics
   * Size metrics (nodes, edges)
   * Giant component
   * Connectivity metrics (degree, assortativity)
      * Scale-free
   * Distance metrics (path lengths, diameter)
   * Clustering metrics (clustering coefficient)
   * Small-world
* Random deletion vs targeted attack
