In [None]:
# Clustering Full Disclosure data with igraph

igraph provides a number of algorithms for clustering graphs. They are compared in a [Stack Overflow discussion](http://stackoverflow.com/a/9478989).

This preliminary attempt at clustering analysis uses the 2007 Full Disclosure graph.

In [7]:
from igraph import *

g = Graph.Read_GraphML('2007.graphml')
print('total vertices:', len(g.vs))
print('total edges:', len(g.es))

total vertices: 5562
total edges: 6489


In [9]:
from igraph import *

g = Graph.Read_GraphML('2007.graphml')
com = g.community_leading_eigenvector(weights='weight', clusters=3)


  membership, _, q = GraphBase.community_leading_eigenvector(self, clusters, **kwds)


Some of the community detection algorithms do not support directed graphs, as the warning above shows, but the direction can be removed so that the results of igraph's community functions can be compared.

In [37]:
from igraph import *

g = Graph.Read_GraphML('2007.graphml')
g.to_undirected() # eliminate the directionality

com = g.community_leading_eigenvector(clusters=3)
print('clustering attempt, leading eigenvector:')
summary(com)

com = g.community_fastgreedy()
print('\nclustering attempt, fastgreedy:')
summary(com)

com = g.community_walktrap(steps=4)
print('\nclustering attempt, walktrap:')
summary(com)

com = g.community_infomap()
print('\nclustering attempt, infomap:')
summary(com)

com = g.community_multilevel()
print('\nclustering attempt, multilevel:')
summary(com)

clustering attempt, leading eigenvector:
Clustering with 5562 elements and 460 clusters

clustering attempt, fastgreedy:
Dendrogram, 5562 elements, 5102 merges

clustering attempt, walktrap:
Dendrogram, 5562 elements, 5102 merges

clustering attempt, infomap:
Clustering with 5562 elements and 791 clusters

clustering attempt, multilevel:
Clustering with 5562 elements and 489 clusters


A more detailed look at the eigenvector clusters follows in two parts. First, a listing of the overall cluster contents:

In [34]:
from igraph import *

g = Graph.Read_GraphML('2007.graphml')
g.to_undirected() # eliminate the directionality

com = g.community_leading_eigenvector()
print(com)

Clustering with 5562 elements and 463 clusters
[  0] 0, 1570
[  1] 1, 5, 11, 13, 23, 28, 31, 33, 34, 40, 47, 48, 76, 77, 95, 99, 103, 108,
      109, 112, 113, 114, 116, 129, 131, 134, 136, 137, 138, 139, 140, 141,
      142, 145, 148, 151, 155, 156, 157, 158, 162, 163, 164, 169, 172, 173,
      175, 180, 185, 192, 194, 195, 199, 203, 205, 207, 208, 211, 218, 219,
      222, 227, 229, 236, 237, 245, 247, 249, 254, 291, 293, 294, 302, 305,
      307, 309, 324, 325, 328, 329, 332, 347, 350, 353, 361, 364, 371, 372,
      383, 384, 389, 392, 393, 394, 395, 396, 398, 413, 415, 417, 421, 422,
      429, 430, 431, 432, 448, 453, 454, 459, 464, 469, 472, 473, 474, 478,
      480, 485, 488, 491, 502, 503, 504, 512, 518, 521, 525, 531, 536, 541,
      543, 544, 554, 561, 563, 581, 601, 604, 614, 627, 631, 634, 641, 646,
      652, 653, 660, 665, 668, 669, 670, 672, 678, 679, 680, 683, 694, 697,
      700, 714, 715, 748, 760, 762, 763, 768, 772, 775, 776, 777, 778, 779,
      780, 781, 782, 796,

Second, a report of the first 20 communities' node types and labels:

In [36]:
from igraph import *

g = Graph.Read_GraphML('2007.graphml')
g.to_undirected() # eliminate the directionality

com = g.community_leading_eigenvector()

for i in range(20):
    print('COMMUNITY', i, '  ( total size:', len(com[i]), ')')
    for v in com[i]:
        print('\t', g.vs[v]['nodeType'], ':\t', g.vs[v]['label'])
    print()

COMMUNITY 0   ( total size: 2 )
	 author :	 andur matrix <andurmatrix () gmail com>
	 document :	 Re: [OOT] Thesis for master degree

COMMUNITY 1   ( total size: 1666 )
	 author :	 Denzity <denzity () gmail com>
	 author :	 Javor Ninov <drfrancky () securax org>
	 author :	 dfklsddshd <dfklsddshd () nerdshack com>
	 author :	 Blue Boar <BlueBoar () thievco com>
	 author :	 Steven McGrath <steven.mcgrath () chigeek com>
	 author :	 rPath Update Announcements <announce-noreply () rpath com>
	 author :	 Matias Soler <gnuler () gmail com>
	 author :	 Kees Cook <kees () ubuntu com>
	 author :	 Scott <geekboy () angrykeyboarder com>
	 author :	 sven.vetsch () disenchant ch
	 author :	 Cisco Systems Product Security Incident Response Team <psirt () cisco com>
	 author :	 Raphael Marichez <falco () gentoo org>
	 author :	 iDefense Labs <labs-no-reply () idefense com>
	 author :	 <corrado.liotta () alice it>
	 author :	 Sowhat <smaillist () gmail com>
	 author :	 Damir Rajnovic <gaus () cisco c

# Betweenness analysis

This preliminary attempt at betweenness analysis uses the 2007 Full Disclosure graph.

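igraph's betweenness method also accepts directed graphs, but the graph is still converted to undirected here so that the results line up with the clustering analysis. Betweenness is returned as a plain list of values, one per vertex, in vertex order.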

In [61]:
from igraph import *

g = Graph.Read_GraphML('2007.graphml')
g.to_undirected()
print(g.betweenness())

[0.0, 0.0, 78.0, 22394.619093596124, 256552.06433626046, 0.0, 4111.174330598513, 278718.86507494753, 132012.86174402796, 0.0, 1.0, 0.0, 23914.980716355964, 16850.668530499253, 18833.055489523493, 0.0, 22998.24452723478, 59213.59827074634, 7867.956449538302, 34203.45778196439, 691.1399438298088, 59.10843906348034, 0.0, 31320.0, 0.0, 32384.0, 0.0, 10.0, 501416.0, 10505.465532639366, 0.0, 0.0, 3.0, 564647.0, 0.0, 18505.484243673498, 75.25965308611259, 473026.909812853, 20757.883757101783, 7364.715831288807, 0.0, 0.0, 257016.66923210927, 12399.959993123544, 1725688.150070586, 6821.123974440932, 83464.74602294361, 138580.0, 555396.9379294378, 23289.655271616073, 0.0, 623166.1915775258, 327436.5733022523, 0.0, 0.0, 229517.05196229627, 10.0, 1.0, 3.0, 3483.9999999999995, 0.0, 0.0, 10552.482827144935, 0.0, 3943.4864929210653, 188322.01804034892, 0.0, 7376.439045386328, 58657.05384788196, 3.0, 0.0, 14059.579765241418, 86862.61467661655, 19803.942611603794, 0.0, 10568.926455887864, 585753.741771

A basic statistical summary can be produced with Python's statistics module, and the betweenness results can be matched against the graph vertices to create more detailed overviews:

In [72]:
from igraph import *
import statistics

g = Graph.Read_GraphML('2007.graphml')
g.to_undirected()
betw = g.betweenness()
print('Betweenness summary:')
print('min:', min(betw))
print('max:', max(betw))
print('mode:', statistics.mode(betw))
print('mean:', statistics.mean(betw))

print('\n\nSample data for first 100 vertices:')
for i in range(100):
    print('vertex', i, '\n  betweenness:', betw[i], '\n ', g.vs[i]['nodeType'], ': ', g.vs[i]['label'])


Betweenness summary:
min: 0.0
max: 1725688.150070586
mode: 0.0
mean: 5845.322186263935


Sample data for first 100 vertices:
vertex 0 
  betweenness: 0.0 
  author :  andur matrix <andurmatrix () gmail com>
vertex 1 
  betweenness: 0.0 
  author :  Denzity <denzity () gmail com>
vertex 2 
  betweenness: 78.0 
  author :  Matousec - Transparent security Research <research () matousec com>
vertex 3 
  betweenness: 22394.619093596124 
  author :  Geo. <geoincidents () nls net>
vertex 4 
  betweenness: 256552.06433626046 
  author :  coderman <coderman () gmail com>
vertex 5 
  betweenness: 0.0 
  author :  Javor Ninov <drfrancky () securax org>
vertex 6 
  betweenness: 4111.174330598513 
  author :  php0t <php0t () zorro hu>
vertex 7 
  betweenness: 278718.86507494753 
  author :  Simon Smith <simon () snosoft com>
vertex 8 
  betweenness: 132012.86174402796 
  author :  Juha-Matti Laurio <juha-matti.laurio () netti fi>
vertex 9 
  betweenness: 0.0 
  author :  Poof <poof () fansubber com
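
A mode of 0.0 alongside a mean in the thousands suggests a heavily skewed distribution, so the median and the share of zero values may be more informative than the mean. The sketch below illustrates this with only the first ten betweenness values from the printed list earlier in this section; it is a sample for illustration, not a summary of the full result.

```python
import statistics

# First ten betweenness values copied from the printed output above
# (a sample for illustration, not the full 5562-vertex result).
sample = [0.0, 0.0, 78.0, 22394.619093596124, 256552.06433626046,
          0.0, 4111.174330598513, 278718.86507494753,
          132012.86174402796, 0.0]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)
# Fraction of vertices that lie on no shortest path at all
zero_share = sum(1 for b in sample if b == 0.0) / len(sample)

print('mean:', mean_value)
print('median:', median_value)
print('share of zero-betweenness vertices:', zero_share)
```

In the notebook itself the same three lines could be applied to the full betw list in place of sample.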

Manual analysis of the vertices with highest betweenness will be useful for developing hypotheses about the graph.

In [71]:
from igraph import *
import numpy as np

g = Graph.Read_GraphML('2007.graphml')
g.to_undirected()
betw = g.betweenness()

i = np.argmax(betw)

print('vertex', i, '\n  betweenness:', betw[i], '\n ', g.vs[i]['nodeType'], ': ', g.vs[i]['label'])


vertex 44 
  betweenness: 1725688.150070586 
  author :  Valdis.Kletnieks () vt edu
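
Looking only at the single maximum leaves out the other high-betweenness vertices. A minimal sketch of ranking the top few follows, again using just the first ten values from the printed betweenness list as sample data; the names sample, k, and top are illustrative.

```python
# First ten betweenness values copied from the printed output earlier
# (sample data only; list positions correspond to vertex indices 0-9).
sample = [0.0, 0.0, 78.0, 22394.619093596124, 256552.06433626046,
          0.0, 4111.174330598513, 278718.86507494753,
          132012.86174402796, 0.0]

k = 3  # how many top vertices to report

# Sort vertex indices by their betweenness, highest first, and keep k
top = sorted(range(len(sample)), key=sample.__getitem__, reverse=True)[:k]

for i in top:
    print('vertex', i, 'betweenness:', sample[i])
```

Applied to the full betw list, each index in top could be used with g.vs[i]['nodeType'] and g.vs[i]['label'], as in the cell above, to print who or what the vertex represents.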
