# Community Clustering

In this notebook, we will cluster a graph using the various cuGraph clustering algorithms and then compare the clusters produced by each algorithm.

| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware |
| --------------|------------|------------------|-----------------|----------------|
| Don Acosta    | 07/05/2022 | tested / updated | 22.08 nightly   | DGX Tesla V100 CUDA 11.5 |

Clustering is the analytic method for finding highly connected sets of vertices within a graph. It is often used to answer questions like:

* What are the communities within this graph?
* How can the graph be cut into the most cohesive partitions?
* What is the most important group of vertices within this graph?

### Test Data
We will be using the Zachary Karate club dataset:
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*


<img src="../../img/zachary_graph_comm.png" width="35%"/>

Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph will be used so that the starting vertex ID is zero for maximum efficiency. The results are then automatically unrenumbered back to the original IDs, which makes the entire renumbering process transparent to users.
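To make the renumbering idea concrete, here is a minimal pure-Python sketch of the round trip. It is not cuGraph's implementation, which runs on the GPU. The helper names `renumber` and `unrenumber` and the four sample edges are hypothetical and used only for illustration.

```python
# Illustrative stand-in for renumbering; cuGraph does this internally.
# The helper names and the sample edges are hypothetical.

def renumber(edges):
    """Map arbitrary vertex IDs to contiguous IDs starting at 0."""
    # The sorted unique external IDs define the internal numbering
    external_ids = sorted({v for edge in edges for v in edge})
    to_internal = {ext: i for i, ext in enumerate(external_ids)}
    internal_edges = [(to_internal[s], to_internal[d]) for s, d in edges]
    return internal_edges, external_ids


def unrenumber(internal_vertices, external_ids):
    """Translate internal IDs back to the original vertex IDs."""
    return [external_ids[v] for v in internal_vertices]


# A few edges with 1-based vertex IDs, in the style of the karate data
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
internal_edges, external_ids = renumber(edges)

print(internal_edges)                      # edges expressed with 0-based IDs
print(unrenumber([0, 3], external_ids))    # back to the original IDs
```

The algorithm only ever sees the contiguous 0-based IDs, while the user only ever sees the original ones.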

In [None]:
# Import the cuGraph and cuDF modules
import cugraph
import cudf

In [None]:
# Compute clusters
# the clustering calls are very straightforward with the graph being the primary argument
# we are specifying a few optional parameters for this dataset.

def compute_clusters(_graph):
    # Compute ECG (Ensemble Clustering for Graphs) clusters
    _e = cugraph.ecg(_graph)

    # Extract the k-truss subgraph for k = 3
    _k = cugraph.ktruss_subgraph(_graph, 3)

    # Compute Louvain clusters
    _l = cugraph.louvain(_graph)

    # Compute Spectral Balanced Cut clusters (4 clusters)
    _b = cugraph.spectralBalancedCutClustering(_graph, 4, num_eigen_vects=4)

    # Compute Spectral Modularity Maximization clusters (4 clusters)
    _m = cugraph.spectralModularityMaximizationClustering(_graph, 4, num_eigen_vects=4)
    return _e, _k, _l, _b, _m

## Read the data

In [None]:
# Test file
datafile = '../../data/karate-data.csv'

In [None]:
# Read the tab-delimited edge list using cuDF
gdf = cudf.read_csv(datafile, delimiter='\t', names=['src', 'dst'], dtype=['int32', 'int32'])

In [None]:
# The algorithms often also require edge weights, so assign a weight of 1.0 to every edge
gdf["data"] = 1.0

It was that easy to load the data.

## Create a Graph

In [None]:
# create a Graph - since the data does not start at '0', use the auto-renumbering feature
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst', edge_attr='data', renumber=True)

## Now do all the clustering

In [None]:
_e, _k, _l, _b, _m = compute_clusters(G)
type(_b)

In [None]:
_e.to_pandas().groupby('partition')['vertex'].apply(list)
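Once each algorithm has assigned vertices to clusters, one simple way to compare two results is the Rand index: the fraction of vertex pairs that both clusterings treat the same way (together in both, or apart in both). The sketch below is pure Python and is not part of cuGraph. The function name `rand_index` is hypothetical, and the two label dictionaries are made-up toy values, not output from the karate graph.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of vertex pairs on which two clusterings agree.

    Both arguments map vertex ID -> cluster label and must cover
    the same set of vertices.
    """
    vertices = sorted(labels_a)
    agree = 0
    total = 0
    for u, v in combinations(vertices, 2):
        same_a = labels_a[u] == labels_a[v]
        same_b = labels_b[u] == labels_b[v]
        # The pair counts as agreement if it is joined in both or split in both
        agree += same_a == same_b
        total += 1
    return agree / total


# Toy labels for six vertices (hypothetical, for illustration only)
clustering_a = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
clustering_b = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1}

score = rand_index(clustering_a, clustering_b)
print(f"Rand index: {score:.3f}")
```

A score of 1.0 means the two clusterings are identical up to relabeling. Label dictionaries like these could be built from the `vertex` and `partition` columns of a result such as the one grouped above.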