# Basic network import and representation

Here we explore some network datasets using mostly the Python standard library, plus NumPy and Matplotlib for plotting.

We analyze the dataset 'cit-HepTh' available from the SNAP repository: http://snap.stanford.edu/data/index.html

There are several other repositories of network datasets, for instance:
- http://konect.cc/networks/
- https://networks.skewed.de/
- http://networkrepository.com/
- http://cnets.indiana.edu/resources/data-repository/
- http://www.sociopatterns.org/datasets/

In [None]:
import sys, math

In [None]:
# %pylab inline makes numpy (np) and matplotlib.pyplot (plt) available and shows plots inline
%pylab inline

In [None]:
import collections as col

We use a dictionary that maps each node (the key) to the list of its neighbours.

In [None]:
links_out=col.defaultdict(list)
print(links_out)
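
As a minimal sketch of this representation, the snippet below builds the same kind of dictionary from a small, made-up edge list (the `toy_` names and node labels are hypothetical and unrelated to cit-HepTh). It also illustrates a detail of `defaultdict` worth keeping in mind: indexing a missing key with `[]` inserts it, whereas `.get()` does not.

```python
import collections as col

# Hypothetical toy edge list: (origin, destination) pairs
toy_edges = [(1, 2), (1, 3), (2, 3), (4, 1)]

toy_links_out = col.defaultdict(list)
for origin, dest in toy_edges:
    toy_links_out[origin].append(dest)

print(dict(toy_links_out))

# Looking up a missing node with .get() leaves the dictionary unchanged...
n_before = len(toy_links_out)
neighbours_of_3 = toy_links_out.get(3, [])
n_after_get = len(toy_links_out)

# ...whereas indexing with [] silently inserts an empty list for that key.
_ = toy_links_out[3]
n_after_index = len(toy_links_out)
print(n_before, n_after_get, n_after_index)
```

This matters whenever the number of keys is later used as a node count.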

### Citation network
We analyze the HepTh citation network from the SNAP repository.
We open the file containing the network and read each line.

In [None]:
filepath='./network_data/cit-HepTh.txt'

In [None]:
fh=open(filepath,'r')

In [None]:
fh

In [None]:
s=fh.readlines()

In [None]:
s[:4]

In [None]:
s[10].strip().split()

In [None]:
for line in s:
    # skip comment lines (the file header starts with '#') and blank lines
    if line.startswith('#') or not line.strip():
        continue
    # remove the trailing "\n" (.strip()) and split the line at whitespace (.split())
    t=line.strip().split()
    origin=int(t[0])
    dest=int(t[1])
    links_out[origin].append(dest)

# close the file
fh.close()
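
The same parsing step can be wrapped in a small function, which makes it easy to try on a few lines before running it on the full file. This is only a sketch: `read_out_links` and the sample lines are hypothetical names and data introduced here, laid out like a SNAP edge list.

```python
import collections as col

def read_out_links(lines):
    """Build {origin: [destinations]} from an iterable of edge-list lines."""
    out = col.defaultdict(list)
    for line in lines:
        line = line.strip()
        # skip blank lines and comment lines starting with '#'
        if not line or line.startswith('#'):
            continue
        origin, dest = line.split()[:2]
        out[int(origin)].append(int(dest))
    return out

# Made-up sample in the same layout as a SNAP edge list
sample = ["# FromNodeId\tToNodeId\n", "1\t2\n", "1\t3\n", "\n", "2\t3\n"]
sample_links = read_out_links(sample)
print(dict(sample_links))

# With a real file, a with-statement closes it automatically:
# with open(filepath, 'r') as fh:
#     links_out = read_out_links(fh)
```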

In [None]:
# .get() avoids inserting the key into the defaultdict if the node has no outgoing links
len(links_out.get(1001, []))

How many nodes have at least one outgoing link? Note that `len(links_out)` counts only the nodes that appear as the origin of a link.

In [None]:
tot_nodes=len(links_out)
print(tot_nodes)
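
Nodes that only receive links never become keys of the out-link dictionary, so the number of keys can be smaller than the number of nodes in the network. A possible way to count every node, sketched on a made-up dictionary (`all_nodes` and `toy` are hypothetical names), is to take the union of the keys and all the destinations:

```python
def all_nodes(links_out):
    """Return the set of nodes seen as origin or as destination."""
    nodes = set(links_out)            # nodes with at least one outgoing link
    for dests in links_out.values():
        nodes.update(dests)           # nodes with at least one incoming link
    return nodes

# Made-up example: node 3 is cited but never cites
toy = {1: [2, 3], 2: [3]}
n_keys = len(toy)
n_all = len(all_nodes(toy))
print(n_keys, n_all)
```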

<img src="figure/node_degree.png" width="60%">

The degree of a node is the total number of links connected to it. Let's first look at the *undirected* graph: **Node 4** has degree $k_4 = 3$.

In the directed graph, for **Node 4**, $k_4^{\text{in}}=1$ while $k_4^{\text{out}}=2$.
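
To make the three quantities concrete, here is a small sketch that counts them from an edge list. The edges are hypothetical: they were chosen so that node 4 has the same degrees as in the text, and they are not read off the figure.

```python
# Hypothetical directed edges (origin, destination); not read off the figure
edges = [(1, 2), (2, 4), (4, 1), (4, 3)]

k_in = {}
k_out = {}
for origin, dest in edges:
    k_out[origin] = k_out.get(origin, 0) + 1
    k_in[dest] = k_in.get(dest, 0) + 1

node = 4
# Ignoring direction, the degree is the sum of in- and out-degree
# (true here because no pair of nodes is linked in both directions).
k = k_in.get(node, 0) + k_out.get(node, 0)
print(k_in.get(node, 0), k_out.get(node, 0), k)
```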

We calculate the out-degree distribution of the cit-HepTh network.

The degree distribution, as the term implies, is the **probability distribution** of all node degrees over the entire network.
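
In symbols, writing $N$ for the number of nodes considered and $N_k$ for the number of those nodes with out-degree $k$ (notation introduced here for illustration), the empirical distribution is

$$P(k_{out}) = \frac{N_k}{N}, \qquad \sum_k P(k_{out}) = 1.$$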

In [None]:
degree_out={}

for i in links_out:

    deg_out=len(links_out[i])

    if deg_out in degree_out:
        degree_out[deg_out]+=1
    else:
        degree_out[deg_out]=1
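
The counting loop above can also be written with `collections.Counter`. The sketch below does so on a made-up adjacency dictionary (all `toy_` names are hypothetical) and normalises the counts so that the frequencies sum to 1.

```python
from collections import Counter

# Made-up adjacency dictionary: node -> list of out-neighbours
toy_links = {1: [2, 3], 2: [3], 4: [1], 5: [1, 2]}

# Number of nodes having each out-degree
toy_degree_count = Counter(len(neigh) for neigh in toy_links.values())

# Divide by the number of nodes to get the empirical P(k_out)
n = len(toy_links)
toy_p = {k: c / n for k, c in sorted(toy_degree_count.items())}
print(toy_p)
```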

In [None]:
print(sorted(degree_out.keys()))

In [None]:
degree_out

We export the degree distribution to an output file.

In [None]:
s_deg=sorted(degree_out.keys())

In [None]:
fout=open('./../datasets/Cit-HepTh-degout-distri.txt','w')
for d in s_deg:
    deg_freq=float(degree_out[d])/tot_nodes 
    
    fout.write(str(d)+'  '+str(deg_freq)+'\n')

fout.close()
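
A variant of the export step, sketched under the assumption that a temporary directory is an acceptable destination: the `with` statement closes the file even if an error occurs, and reading the file back is a quick check that the two-column format round-trips. The dictionary `toy_p` and the file name are made up for this example.

```python
import os
import tempfile

toy_p = {1: 0.5, 2: 0.25, 4: 0.25}   # made-up degree -> frequency

# Write "degree  frequency" pairs, one per line, sorted by degree
path = os.path.join(tempfile.mkdtemp(), 'toy-degout-distri.txt')
with open(path, 'w') as fout:
    for d in sorted(toy_p):
        fout.write(f'{d}  {toy_p[d]}\n')

# Read the file back into a dictionary
with open(path) as fin:
    read_back = {int(d): float(p) for d, p in (line.split() for line in fin)}
print(read_back)
```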

In [None]:
for i in degree_out.items():
    print(i)

In [None]:
from operator import itemgetter

In [None]:
x=[]
y=[]

for i in sorted(degree_out.items(), key=itemgetter(0)):
    x.append(i[0])
    y.append(float(i[1])/tot_nodes)

In [None]:
plt.figure(figsize=(10,7))   

plt.plot(x,y)

plt.xlabel('$k_{out}$', fontsize=24)
plt.ylabel('$P(k_{out})$', fontsize=24)
plt.xticks(fontsize=24)
plt.yticks(fontsize=24)
plt.yscale('log')
plt.xscale('log')

Let's have a look at the in-degree distribution.

In [None]:
links_in=col.defaultdict(list)

fh=open(filepath,'r')
# read all the lines of the file
for line in fh.readlines():
    # skip comment lines (the file header starts with '#') and blank lines
    if line.startswith('#') or not line.strip():
        continue
    # remove the trailing "\n" (.strip()) and split the line at whitespace (.split())
    s=line.strip().split()
    origin=int(s[0])
    dest=int(s[1])
    links_in[dest].append(origin)

fh.close()
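
Re-reading the file is not strictly necessary: the incoming links can also be derived by inverting an out-link dictionary that is already in memory. A minimal sketch, where `invert_links` and the toy data are hypothetical names introduced here:

```python
import collections as col

def invert_links(links_out):
    """Return {destination: [origins]} from {origin: [destinations]}."""
    links_in = col.defaultdict(list)
    for origin, dests in links_out.items():
        for dest in dests:
            links_in[dest].append(origin)
    return links_in

# Made-up out-link dictionary
toy_out = {1: [2, 3], 2: [3]}
toy_in = invert_links(toy_out)
print(dict(toy_in))
```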

In [None]:
degree_in=col.defaultdict(int)
for i in links_in.keys():
    deg=len(links_in[i])
    degree_in[deg]+=1

# number of nodes with at least one incoming link (nodes with no incoming links are not keys)
tot_nodes_in=len(links_in)
print(tot_nodes_in)

How does the in-degree distribution differ from an exponential distribution? We overlay the curve $e^{-0.5\,k}$ for comparison.

In [None]:
def f(t):
    return np.exp(-0.5*t)

x=[]
y=[]
for i in sorted(degree_in.items(), key=itemgetter(0)):
    x.append(i[0])
    y.append(float(i[1])/tot_nodes_in)

plt.figure(figsize=(10,7))   
    
plt.plot(np.array(x),np.array(y))
plt.plot(np.array(x), f(np.array(x)), label='Exponential')
plt.xlabel('$k_{in}$', fontsize=24)
plt.ylabel('$P(k_{in})$', fontsize=24)
plt.xticks(fontsize=24)
plt.yticks(fontsize=24)
plt.yscale('log')
plt.xscale('log')
plt.axis([1,10000,0.00001,1])
plt.legend()
plt.show()
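
One way to read a log-log plot like this: a power law appears as a straight line, while an exponential bends downward ever more steeply. The sketch below checks this numerically by computing the slope between consecutive points in log-log coordinates. The power-law exponent of 2 is an arbitrary choice for illustration, not a value fitted to the data, and `loglog_slopes` is a hypothetical helper.

```python
import numpy as np

k = np.array([1.0, 10.0, 100.0, 1000.0])

p_exp = np.exp(-0.5 * k)      # same exponential as f above
p_pow = k ** -2.0             # power law with an arbitrary exponent of 2

def loglog_slopes(x, y):
    """Slope between consecutive points in log-log coordinates."""
    return np.diff(np.log(y)) / np.diff(np.log(x))

slopes_exp = loglog_slopes(k, p_exp)
slopes_pow = loglog_slopes(k, p_pow)
print(slopes_exp)   # becomes more and more negative
print(slopes_pow)   # stays constant
```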