## Example of Information Theoretical Clustering of Relational Data

In [1]:
import networkx as nx
import pandas
from utils.data_processing import *
from algorithms.information_theoretical_clustering import InformationTheoreticalClustering
from matplotlib import pyplot as plt
import matplotlib.cm as cm

Load the data and build a bipartite graph whose nodes are cells (CellId) and countries, and whose edge weights are the number of calls between a cell and a country.

In [2]:
cdr_data, antenna_mapping, most_called_countries = process_data('data/mobile-phone-activity/sms-call-internet-mi-2013-11-01.csv', truncate=2000)
itc = InformationTheoreticalClustering(cdr_data)
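
As a rough illustration of the step above, the following sketch builds a cell-by-country weighted adjacency matrix from a toy table of call records with pandas. The column names (`cell_id`, `country_code`, `calls`) and the toy values are assumptions made for this example; the actual `process_data` helper and the layout of the CSV file may differ.

```python
import pandas as pd

# Hypothetical toy records: one row per (cell, country, number of calls).
records = pd.DataFrame({
    "cell_id": [1, 1, 2, 3, 3],
    "country_code": [33, 44, 33, 86, 33],
    "calls": [5.0, 2.0, 1.0, 7.0, 3.0],
})

# Rows are cells, columns are countries, entries are summed call counts.
# Cell/country pairs with no calls get a weight of 0.
adjacency = records.pivot_table(index="cell_id", columns="country_code",
                                values="calls", aggfunc="sum", fill_value=0)

# Plain NumPy array: the biadjacency matrix of the bipartite graph.
adjacency_matrix = adjacency.values
print(adjacency)
```

Each non-zero entry of `adjacency_matrix` corresponds to one weighted edge of the bipartite graph.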

The function below plots two matrices:
1. the joint density in the blocks of a matrix (the darker, the denser)
2. the mutual information matrix (red: excess of interactions; blue: lack of interactions, compared to what is expected under independence)

The joint density P(row,col) does not account for unbalanced clusters: when clusters differ in their number of observations, the density is high for the biggest cluster even if it shows no interesting pattern. The mutual information corrects this by dividing the joint probability by the product of the marginal distributions over rows and columns:
\begin{align}
mi(row,col) = P(row,col) \log\left(\dfrac{P(row,col)}{P(row)\,P(col)}\right)
\end{align}
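
The formula can be sketched in NumPy as follows. This is only an illustration, not the implementation behind `InformationTheoreticalClustering.compute_mutual_information`: the function name `mutual_information_matrix`, the use of the natural logarithm, and the convention that empty blocks contribute 0 are assumptions.

```python
import numpy as np

def mutual_information_matrix(counts):
    """Per-block terms P(r,c) * log(P(r,c) / (P(r) * P(c))) of a count matrix."""
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()              # P(row, col)
    expected = np.outer(joint.sum(axis=1),     # P(row) * P(col), i.e. the joint
                        joint.sum(axis=0))     # probability under independence
    mi = np.zeros_like(joint)
    mask = joint > 0                           # assumption: empty blocks contribute 0
    mi[mask] = joint[mask] * np.log(joint[mask] / expected[mask])
    return mi

# Rows and columns interact mostly on the diagonal in this toy matrix.
toy_mi = mutual_information_matrix([[4, 1], [1, 4]])
print(toy_mi)
```

Positive entries mark blocks with more interactions than expected under independence (red in the plot), negative entries mark blocks with fewer (blue), and the sum of all entries is the mutual information between the row and column partitions.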

In [61]:
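def plot_matrices(adjacency_matrix, mi_matrix):
    # Left panel: joint probability matrix (grey scale, darker = denser).
    # Right panel: mutual information matrix (blue = lack, white = as expected, red = excess).
    # The label lists below are only used by the commented-out tick calls.
    labels_countries = ['Senegal', 'Ukraine', 'China', 'Egypt', 'EU / US']
    labels_antennas = ['Turquoise', 'Light purple', 'Light blue', 'Pink', 'Crimson', 'Yellow']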
    plt.clf()
    fig = plt.figure()
    ax1 = fig.add_subplot(121)
    ax1.imshow(adjacency_matrix, interpolation='nearest', cmap=cm.Greys, vmin=0, vmax=.25)
    #plt.xticks([0,1,2,3,4], labels_countries, rotation='vertical')
    #plt.yticks([0,1,2,3,4,5], labels_antennas, rotation='horizontal')
    ax2 = fig.add_subplot(122)
    ax2.imshow(mi_matrix, interpolation='nearest', cmap=cm.bwr, vmin=-0.15, vmax=0.15)
    #plt.xticks([0,1,2,3,4], labels_countries, rotation='vertical')
    #plt.yticks([0,1,2,3,4,5], labels_antennas, rotation='horizontal')
    plt.subplots_adjust(bottom=0.1, right=1.4, top=0.9)
    plt.savefig('density_mi_random.png', bbox_inches='tight')
    #plt.savefig('density_mi_cluster.png', bbox_inches='tight')
    #plt.show()

Generate random clusters and plot the cluster-level adjacency matrix to check that the density is randomly distributed over the blocks:
- The adjacency matrix is normalized to obtain the joint probability matrix (left). No underlying structure appears.
- The mutual information is plotted on the right. All the cells are white, meaning that the joint probability is close to the product of the marginals (no pattern).

In [62]:
# Partition the row (cell) and column (country) indices 0..n-1 of the adjacency matrix at random.
random_cell_clusters = InformationTheoreticalClustering.random_partition(list(range(itc.adjacency_matrix.shape[0])), k=6)
random_country_clusters = InformationTheoreticalClustering.random_partition(list(range(itc.adjacency_matrix.shape[1])), k=5)

random_adjacency_matrix = itc.build_cluster_join_probability_matrix(itc.adjacency_matrix, random_cell_clusters, dimension='cell')
random_adjacency_matrix = itc.build_cluster_join_probability_matrix(random_adjacency_matrix, random_country_clusters, dimension='country')

random_mi_matrix = InformationTheoreticalClustering.compute_mutual_information(random_adjacency_matrix)
random_probability_matrix = random_adjacency_matrix / float(random_adjacency_matrix.sum())
plot_matrices(random_probability_matrix, random_mi_matrix)
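
To make the two steps of this cell concrete, here is a minimal sketch of a random partition and of the aggregation of a matrix into cluster-level blocks. The helpers `random_partition` and `aggregate_blocks` are hypothetical stand-ins written for this example; they are not the methods of `InformationTheoreticalClustering`, whose internals are not shown in this notebook.

```python
import numpy as np

def random_partition(items, k, seed=None):
    """Shuffle the items and split them into k groups of near-equal size."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(np.asarray(list(items)))
    return [sorted(part.tolist()) for part in np.array_split(shuffled, k)]

def aggregate_blocks(matrix, row_clusters, col_clusters):
    """Sum the entries of `matrix` inside each (row cluster, column cluster) block."""
    matrix = np.asarray(matrix, dtype=float)
    # Collapse the rows first: one summed row per row cluster.
    rows = np.array([matrix[c, :].sum(axis=0) for c in row_clusters])
    # Then collapse the columns: one summed column per column cluster.
    return np.array([rows[:, c].sum(axis=1) for c in col_clusters]).T

toy = np.arange(12).reshape(4, 3)
row_clusters = random_partition(range(4), k=2, seed=0)
col_clusters = random_partition(range(3), k=2, seed=1)
blocks = aggregate_blocks(toy, row_clusters, col_clusters)
print(blocks / blocks.sum())  # cluster-level joint probability matrix
```

Aggregation preserves the total count, so dividing by `blocks.sum()` gives a joint probability matrix over the blocks.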

We apply the information theoretical clustering algorithm to the call detail records and plot the adjacency matrix of the resulting clustering:
- The adjacency matrix is normalized to obtain the joint probability matrix (left), revealing a diagonal structure.
- The mutual information is plotted on the right. The matrix confirms the diagonal structure but emphasises certain interactions that are not visible in the joint density matrix. For example, the block at row 4 and column 1 has a fairly high density of interactions, but far lower than expected given the overall interactions of clusters 1 and 4.

In [60]:
# `clusters` is the output of the co-clustering call below, which is left commented out in this cell.
#clusters = itc.coclustering(k=6, l=7)

# `clusters_2` is built in cell In [11] further down: same cell clusters, merged country clusters.
adjacency_matrix = itc.build_cluster_join_probability_matrix(itc.adjacency_matrix, clusters[0], dimension='cell')
adjacency_matrix = itc.build_cluster_join_probability_matrix(adjacency_matrix, clusters_2[1], dimension='country')

probability_matrix = adjacency_matrix / float(adjacency_matrix.sum())
mi_matrix = InformationTheoreticalClustering.compute_mutual_information(adjacency_matrix)

plot_matrices(probability_matrix, mi_matrix)

In [14]:
plot_clusters(clusters[0], antenna_mapping)

0 #8dd3c7
1 #bebada
2 #80b1d3
3 #fccde5
4 #bc80bd
5 #ffed6f


In [16]:
import operator

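# For each country cluster, list (country code, % of the total call volume), largest share first.
total = float(itc.adjacency_matrix.sum())
for clust in clusters_2[1]:
    shares = [(itc.country_index[i], round(100 * itc.adjacency_matrix[:, i].sum(axis=0) / total, 2)) for i in clust]
    print(sorted(shares, key=operator.itemgetter(1), reverse=True))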

[(221, 3.94), (503, 0.09), (225, 0.02), (254, 0.02), (963, 0.01)]
[(380, 7.39), (40, 3.47), (373, 0.9), (91, 0.84), (355, 0.81), (216, 0.55), (57, 0.13), (995, 0.12), (58, 0.09), (386, 0.08), (226, 0.05)]
[(86, 7.18), (63, 6.12), (94, 4.47), (593, 2.17), (51, 1.54), (234, 0.38), (233, 0.14), (56, 0.03), (62, 0.03), (243, 0.01)]
[(20, 26.83), (880, 12.73), (212, 1.55), (92, 0.91), (591, 0.21), (218, 0.14), (220, 0.02)]
[(41, 2.28), (44, 1.92), (33, 1.79), (7, 1.14), (34, 0.94), (1, 0.79), (49, 0.78), (55, 0.68), (46, 0.65), (30, 0.52), (48, 0.43), (230, 0.34), (420, 0.33), (31, 0.29), (359, 0.26), (32, 0.23), (43, 0.2), (81, 0.19), (90, 0.18), (381, 0.16), (36, 0.14), (421, 0.13), (358, 0.13), (54, 0.12), (371, 0.12), (375, 0.12), (45, 0.11), (61, 0.11), (961, 0.1), (972, 0.09), (370, 0.09), (385, 0.09), (47, 0.09), (372, 0.09), (852, 0.09), (213, 0.08), (53, 0.08), (98, 0.08), (351, 0.08), (962, 0.07), (356, 0.07), (966, 0.07), (971, 0.06), (52, 0.06), (65, 0.06), (66, 0.06), (353, 0.0

In [8]:
print(clusters[1])

[[47, 48, 59, 82, 91], [10, 25, 26, 38, 44, 49, 63, 71, 74, 77, 99], [19, 24, 29, 30, 36, 40, 52, 53, 56, 84], [1, 6, 7, 12, 50, 75, 81, 90, 96], [8, 14, 17, 18, 23, 37, 43, 51, 67, 68, 73, 76, 79, 85, 95], [2, 39, 42, 45, 46, 83, 87], [0, 3, 4, 5, 9, 11, 13, 15, 16, 20, 21, 22, 27, 28, 31, 32, 33, 34, 35, 41, 54, 55, 57, 58, 60, 61, 62, 64, 65, 66, 69, 70, 72, 78, 80, 86, 88, 89, 92, 93, 94, 97, 98]]


In [11]:
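# Merge country clusters 3, 4 and 6 into a single cluster, appended last
# (the 'EU / US' group in labels_countries).
new_clust = [c for i, c in enumerate(clusters[1]) if i not in (3, 4, 6)]
new_clust.append(clusters[1][3] + clusters[1][4] + clusters[1][6])
# Cell clusters are unchanged; the country partition goes from 7 to 5 clusters.
clusters_2 = (clusters[0], new_clust)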
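
The same merge can be written as a small reusable helper. The function `merge_clusters` below is a hypothetical name introduced for this sketch, not part of the notebook's modules; it leaves the input partition untouched and appends the merged cluster last, as the cell above does.

```python
def merge_clusters(clusters, to_merge):
    """Return a new partition where the clusters at positions `to_merge` are fused.

    The untouched clusters keep their relative order; the fused cluster comes last.
    """
    to_merge = set(to_merge)
    kept = [c for i, c in enumerate(clusters) if i not in to_merge]
    merged = [x for i, c in enumerate(clusters) if i in to_merge for x in c]
    return kept + [merged]

# Toy partition of 7 items into 5 clusters; fuse the clusters at positions 1 and 3.
toy_partition = [[0], [1, 2], [3], [4, 5], [6]]
print(merge_clusters(toy_partition, (1, 3)))
```

With the variables of this notebook, `merge_clusters(clusters[1], (3, 4, 6))` would be expected to produce the same five country clusters as `new_clust`.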

In [18]:
len(clusters_2[1])

5

In [20]:
print(len(clusters[1]))

7
