Louvain Parallel Extension

CHAMP can be used with partitions generated by any community detection algorithm. One of the most popular and fastest algorithms is Louvain [Blondel:2008vn]. We provide an extension to louvain_igraph [traag_louvain], the python package developed by Vincent Traag, to run Louvain in parallel while calculating the coefficients necessary for CHAMP. Currently, this extension only supports single-layer networks. The random seed is set within each parallel process so that the results differ from run to run. In general this is desirable because Louvain uses a greedy optimization scheme that finds local optima.
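For context on what those coefficients are: under single-layer modularity, each partition corresponds to a line Q(gamma) = A_hat - gamma*P_hat in the (gamma, Q) plane, where A_hat is the total weight of within-community edges and P_hat is the corresponding configuration null-model term. The following is a minimal sketch of that computation with igraph and numpy; the helper name coefficients_single_layer is ours for illustration, not part of the champ API:

import igraph as ig
import numpy as np

def coefficients_single_layer(graph, membership):
    """Return (A_hat, P_hat) such that Q(gamma) = A_hat - gamma * P_hat.

    `membership` is a partition in membership-vector form: entry i is
    the community label of node i.  (Illustrative helper only.)
    """
    membership = np.asarray(membership)
    # A_hat: total weight of edges whose endpoints share a community.
    # Each undirected edge is counted twice in the sum over ordered pairs.
    A_hat = 2.0 * sum(1 for e in graph.es
                      if membership[e.source] == membership[e.target])
    # P_hat: expected within-community weight under the configuration
    # null model, sum over communities c of (sum of degrees in c)^2 / (2m).
    degrees = np.array(graph.degree())
    two_m = 2.0 * graph.ecount()
    P_hat = sum(degrees[membership == c].sum() ** 2
                for c in np.unique(membership)) / two_m
    return A_hat, P_hat

g = ig.Graph.Erdos_Renyi(n=100, p=0.05)
part = g.community_multilevel()  # igraph's built-in Louvain
print(coefficients_single_layer(g, part.membership))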

champ.parallel_louvain

We have also created a convenient class for managing and merging groups of partitions, champ.PartitionEnsemble. This class stores the partitions in membership-vector form (i.e., a list of N community assignments) along with the coefficients for each partition. As part of its constructor, the PartitionEnsemble applies CHAMP to all of its partitions and stores the domains of dominance.
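To make membership-vector form concrete, the sketch below builds a few partitions with igraph's built-in Louvain and constructs an ensemble directly. We assume here that the constructor accepts the graph and a list of partitions; the exact argument name may differ, so consult the champ.PartitionEnsemble documentation:

import champ
import igraph as ig

g = ig.Graph.Erdos_Renyi(n=100, p=0.05)

# Membership-vector form: one community label per node, so each
# partition is just a list of length N.
parts = [g.community_multilevel().membership for _ in range(5)]
print(parts[0][:10])  # e.g. [0, 2, 1, 1, 0, ...]

# Assumed constructor call (argument name may differ in the champ API).
# CHAMP is applied to the partitions on creation.
ens = champ.PartitionEnsemble(g, listofparts=parts)
print(len(ens.ind2doms), "dominant partitions")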

Partition Ensemble Example

We use igraph to generate a random ER graph, call Louvain in parallel, and apply CHAMP to the ensemble. This example took approximately 1 minute to run on 2 cores. We can use the PartitionEnsemble object to check whether all of the coefficients are unique, and whether all of the partitions themselves are unique (it is possible for two distinct partitions to give rise to the same coefficients):

import champ
import igraph as ig
import tempfile
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
test_graph=ig.Graph.Erdos_Renyi(500,p=.05)
#Create temporary file for calling louvain
tfile=tempfile.NamedTemporaryFile('wb')
test_graph.write_graphmlz(tfile.name)

# non-parallelized wrapper
ens1=champ.run_louvain(tfile.name,nruns=5,gamma=1)

# parallelized wrapper
test_graph2=ig.Graph.Random_Bipartite(n1=100,n2=100,p=.1)
ens2=champ.parallel_louvain(test_graph2,
                                  numruns=1000,start=0,fin=4,maxpt=4,
                                  numprocesses=2,
                                  progress=True)

print ("%d of %d partitions are unique"%(len(ens2.unique_partition_indices),ens2.numparts))
print ("%d of %d partitions after application of CHAMP"%(len(ens2.ind2doms),ens2.numparts))
print ("Number of twin sets:")
print (ens2.twin_partitions)
#plot the ensemble summaries side by side
plt.close()
f,a=plt.subplots(1,3,figsize=(21,7))
a1,a2,a3=a
champ.plot_single_layer_modularity_domains(ens2.ind2doms,ax=a1,labels=True)
champ.plot_similarity_heatmap_single_layer(ens2.partitions,ens2.ind2doms,ax=a2,title=True)

#PartitionEnsemble has a method to plot a downsampled summary of all partitions
#with optimal transitions and the number of communities overlaid.

ens2.plot_modularity_mapping(ax=a3,no_tex=False)
plt.tight_layout()
plt.show()

Output:

 Run 0 at gamma = 0.000. Return time: 0.0264
 Run 200 at gamma = 0.800. Return time: 0.0621
 Run 100 at gamma = 0.400. Return time: 0.0589
 Run 300 at gamma = 1.200. Return time: 0.0660
 Run 400 at gamma = 1.600. Return time: 0.0963
 Run 500 at gamma = 2.000. Return time: 0.0970
 Run 600 at gamma = 2.400. Return time: 0.0828
 Run 700 at gamma = 2.800. Return time: 0.0769
 Run 800 at gamma = 3.200. Return time: 0.0763
 Run 900 at gamma = 3.600. Return time: 0.0818
 847 of 1000 partitions are unique
 11 of 1000 partitions after application of CHAMP
 Number of twin sets:
  []

We see that while a number of the partitions here are identical, there are no twin partitions, that is, distinct partitions with the same coefficients.
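For illustration, here is a minimal sketch of how twins could be detected outside of champ (the PartitionEnsemble computes this internally and exposes it via twin_partitions): group partitions by their coefficient pair and flag groups containing more than one distinct membership vector:

from collections import defaultdict
import numpy as np

def find_twin_sets(partitions, coefficients):
    """Group partition indices by their (rounded) coefficient pair and
    return the groups that contain more than one *distinct* partition.

    `partitions` is a sequence of membership vectors, `coefficients` a
    matching sequence of (A_hat, P_hat) pairs.  Illustrative only.
    """
    by_coef = defaultdict(list)
    for idx, coef in enumerate(coefficients):
        by_coef[tuple(np.round(coef, 10))].append(idx)
    twins = []
    for idxs in by_coef.values():
        distinct = {tuple(partitions[i]) for i in idxs}
        if len(distinct) > 1:  # same line in (gamma, Q), different partitions
            twins.append(idxs)
    return twins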

champ.louvain_ext.PartitionEnsemble.apply_CHAMP

champ.louvain_ext.PartitionEnsemble.plot_modularity_mapping

champ.louvain_ext.PartitionEnsemble.get_unique_partition_indices

champ.louvain_ext.PartitionEnsemble.twin_partitions

Creating and Managing Large Partition Sets

For large sets of partitions on larger networks, loading the entire set of partitions each time you want to access the data can take a fair amount of time, and having to load all of the partitions defeats the purpose of using CHAMP to find the optimal subset. As such, we have equipped champ.louvain_ext.PartitionEnsemble objects with the ability to write to and read from hdf5 files. Instead of loading the entire set of partitions each time, only the coefficients, the domains of dominance, and the underlying graph information are loaded, vastly reducing the loading time and the memory required for large numbers of runs. Each individual partition (or all partitions at once) can be accessed from file only when needed, as the example below illustrates:
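The underlying mechanism is h5py's lazy dataset access: slicing an open dataset reads only the requested rows from disk. Here is a minimal standalone sketch of that pattern, with a made-up file layout that only mimics the flavor of the real one (the dataset names are ours); the full PartitionEnsemble workflow follows below:

import h5py
import numpy as np

# Write a toy file with a similar layout: a large matrix of membership
# vectors plus a small array of coefficients.
with h5py.File("toy_ensemble.hdf5", "w") as f:
    f.create_dataset("partitions", data=np.random.randint(0, 5, (10000, 200)))
    f.create_dataset("coefficients", data=np.random.rand(10000, 2))

f = h5py.File("toy_ensemble.hdf5", "r")
coefs = f["coefficients"][:]        # small: load fully into memory
parts = f["partitions"]             # large: keep as an on-disk dataset
print(parts[38:40, :10])            # only these rows are read from disk
f.close()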

Saving and Loading PartitionEnsembles

import champ
import igraph as ig
import numpy as np

np.random.seed(0)
test_graph=ig.Graph.Erdos_Renyi(n=200,p=.1)

#saving ensemble
ensemble=champ.parallel_louvain(test_graph,numprocesses=2,numruns=200,start=0,fin=4,maxpt=4,progress=False)
print("Ensemble 1, Optimal subset is %d of %d partitions"%(len(ensemble.ind2doms),ensemble.numparts))
ensemble.save("ensemble1.hdf5",hdf5=True)

#opening created file
ensemble2=champ.PartitionEnsemble().open("ensemble1.hdf5")
#partitions is an internal class that handles access
print(ensemble2.partitions)
#When sliced, a numpy.array is returned
print("ensemble2.partitions[38:40,:10]:\n\t",ensemble2.partitions[38:40,:10])

#You can still add partitions as normal.  These are automatically
#added to the hdf5 file
ensemble2add=champ.parallel_louvain(test_graph,numprocesses=2,numruns=100,start=0,fin=4,maxpt=4,progress=False)
ensemble2.add_partitions(ensemble2add.get_partition_dictionary())
print("ensemble 2 has %d of %d partitions dominant"%(len(ensemble2.ind2doms),ensemble2.numparts))

#When open is called, the created PartitionEnsemble assumes all
#the state variables of the saved PartitionEnsemble, including the default save file.
#If save() is called, it will overwrite the old ensemble file
print("ensemble2 default hdf5 file: ",ensemble2.hdf5_file)

Output:

  Ensemble 1, Optimal subset is 11 of 200 partitions
  200 partitions saved on ensemble1.hdf5
  ensemble2.partitions[38:40,:10]:

      [[1 0 1 0 0 0 0 0 1 1]
       [1 1 2 0 1 0 2 1 2 1]]

  ensemble 2 has 12 of 300 partitions dominant

You can see from the above example that CHAMP is reapplied every time new partitions are added to the PartitionEnsemble instance. In addition, whole PartitionEnsemble objects can be merged together in a similar fashion: you can either generate a new PartitionEnsemble containing the contents of both inputs, or merge one ensemble into the other, in which case the larger of the two is used as the merge target. Furthermore, if either PartitionEnsemble has an associated hdf5 file, the merger will simply expand the contents of that file and keep the same PartitionEnsemble object, as demonstrated in the example below.

champ.louvain_ext.PartitionEnsemble.add_partitions

champ.louvain_ext.PartitionEnsemble.hdf5_file

champ.louvain_ext.PartitionEnsemble.save

champ.louvain_ext.PartitionEnsemble.save_graph

champ.louvain_ext.PartitionEnsemble.open

Merging PartitionEnsembles

import champ
import igraph as ig
import numpy as np

np.random.seed(0)
test_graph=ig.Graph.Erdos_Renyi(n=200,p=.1)

ensemble=champ.parallel_louvain(test_graph,numprocesses=2,numruns=200,start=0,fin=4,maxpt=4,progress=False)
print("Ensemble 1, Optimal subset is %d of %d partitions"%(len(ensemble.ind2doms),ensemble.numparts))
ensemble.save("ensemble1.hdf5",hdf5=True)

ensemble2=champ.parallel_louvain(test_graph,numprocesses=2,numruns=200,start=0,fin=4,maxpt=4,progress=False)
print("Ensemble 2, Optimal subset is %d of %d partitions"%(len(ensemble2.ind2doms),ensemble2.numparts))
ensemble2.save("ensemble2.hdf5",hdf5=True)

#Ensembles can be merged as follows:

#Create a new PartitionEnsemble from scratch
ensemble3=ensemble.merge_ensemble(ensemble2,new=True)
print("Ensemble 3, Optimal subset is %d of %d partitions"%(len(ensemble3.ind2doms),ensemble3.numparts))

#Use the larger of the two PartitionEnsembles being merged and modify it in place
ensemble4=ensemble.merge_ensemble(ensemble3,new=False)
print("Ensemble 4, Optimal subset is %d of %d partitions"%(len(ensemble4.ind2doms),ensemble4.numparts))
print("ensemble4 is ensemble3: ",ensemble4 is ensemble3)

Output:

  Ensemble 1, Optimal subset is 13 of 200 partitions
  Ensemble 2, Optimal subset is 15 of 200 partitions
  Ensemble 3, Optimal subset is 15 of 400 partitions
  Ensemble 4, Optimal subset is 15 of 600 partitions
  ensemble4 is ensemble3: True

champ.louvain_ext.PartitionEnsemble.merge_ensemble

Improvement with HDF5 saving and loading

The following example gives a sense of when it is beneficial to save as HDF5, along with general runtimes for parallelized Louvain. We also found that the overhead depends on the size of the graph, so that for larger graphs with a smaller number of partitions, the HDF5 read/write times can exceed those of the standard gzip format. Total runtime was about 2.5 hours on 10 CPUs.

import champ
import matplotlib.pyplot as plt
import seaborn as sbn
import numpy as np
import pandas as pd
import igraph as ig
from time import time

np.random.seed(0)
test_graph=ig.Graph.Erdos_Renyi(n=1000,p=.1)

times={}
run_nums=[100,1000,2000,3000,10000]

for nrun in run_nums:
    rtimes=[]
    t=time()
    ens=champ.parallel_louvain(test_graph,numprocesses=10,numruns=nrun,start=0,fin=4,maxpt=4,progress=False)
    ens.name='name'
    rtimes.append(time()-t)
    print("CHAMP running time %d runs: %.3f"%(nrun,rtimes[-1]))

    #normal save (gzip)
    t=time()
    ens.save()
    rtimes.append(time()-t)

    #normal open
    t=time()
    newens=champ.PartitionEnsemble().open("name_PartEnsemble_%d.gz"%nrun)
    rtimes.append(time()-t)

    #save with hdf5
    t=time()
    ens.save(hdf5=True)
    rtimes.append(time()-t)

    #open with hdf5 format
    t=time()
    newens=champ.PartitionEnsemble().open("name_PartEnsemble_%d.hdf5"%nrun)
    rtimes.append(time()-t)
    times[nrun]=rtimes

tab=pd.DataFrame(times,columns=run_nums,index=['runtime','save','load','hdf5save','hdf5load'])
colors=sbn.color_palette('Set1',n_colors=tab.shape[0])
plt.close()
f,(a1,a2)=plt.subplots(1,2,figsize=(14,7))
for i,ind in enumerate(tab.index.values):
    if ind=='runtime':
        a1.plot(tab.columns,tab.loc[ind,:],label=ind,color=colors[i])
    else:
        a2.plot(tab.columns,tab.loc[ind,:],label=ind,color=colors[i])

a1.set_xlabel("#partitions")
a1.set_ylabel("seconds")
a2.set_xlabel("#partitions")
a2.set_ylabel("seconds")

a2.set_title("Comparison of Save and Load Times",fontsize=16)
a1.set_title("Runtimes on random ER-graph G(N=1000,p=.1)",fontsize=16)
a1.legend(fontsize=14)
a2.legend(fontsize=14)
plt.show()

Output:

  CHAMP running time 100 runs: 53.771
  CHAMP running time 1000 runs: 498.642
  CHAMP running time 2000 runs: 988.658
  CHAMP running time 3000 runs: 1479.512
  CHAMP running time 10000 runs: 5091.727

References

[Blondel:2008vn] Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.

[traag_louvain] Traag, V. A. louvain-igraph. https://github.com/vtraag/louvain-igraph
