# Community Clustering

In this notebook, we will cluster the graph using various algorithms implemented in cuGraph.  We will then compare the clusters resulting from each algorithm.

| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware |
| --------------|------------|------------------|-----------------|----------------|
| Don Acosta    | 07/05/2022 | tested / updated | 22.08 nightly   | DGX Tesla V100 CUDA 11.5 |

Clustering is the analytic method for finding the highly connected sets of vertices within a graph. It is often used to answer questions like:

* What are the communities within this graph?
* How can the graph be cut into the most cohesive partitions?
* What is the most important group of vertices within this graph?

### Test Data
We will be using the Zachary Karate Club dataset:
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*


<img src="../../img/zachary_graph_clusters.png" width="35%"/>

Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph will be used so that the internal vertex IDs start at zero, which is the most efficient layout. The results are then automatically unrenumbered back to the original IDs, making the entire renumbering process transparent to users.
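To make the renumbering idea concrete, here is a minimal pure-Python sketch. It is a conceptual illustration only, not cuGraph's internal implementation; the helper names `renumber` and `unrenumber` are hypothetical and chosen for this example.

```python
# Conceptual sketch only: this is NOT how cuGraph implements renumbering.
# The helper names below are hypothetical.

def renumber(edges):
    """Map arbitrary vertex IDs to contiguous internal IDs starting at 0."""
    external_ids = sorted({v for edge in edges for v in edge})
    to_internal = {ext: i for i, ext in enumerate(external_ids)}
    internal_edges = [(to_internal[s], to_internal[d]) for s, d in edges]
    return internal_edges, external_ids


def unrenumber(internal_results, external_ids):
    """Translate {internal_id: value} results back to the original IDs."""
    return {external_ids[i]: value for i, value in internal_results.items()}


# Vertex IDs start at 1, as in the karate data
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
internal_edges, external_ids = renumber(edges)
print(internal_edges)   # [(0, 1), (0, 2), (1, 2), (2, 3)]

# Pretend an algorithm labelled the internal vertices, then map the labels back
labels = {0: 0, 1: 0, 2: 0, 3: 1}
print(unrenumber(labels, external_ids))   # {1: 0, 2: 0, 3: 0, 4: 1}
```

The caller only ever sees the original IDs, which is what "transparent" means here.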

In [1]:
#  Import the cugraph modules
import cugraph
import cudf

In [2]:
# import non cugraph modules
import numpy as np

In [3]:
# Compute clusters
# the clustering calls are very straightforward with the graph being the primary argument
# we are specifying a few optional parameters for this dataset.

def compute_clusters(_graph) :

    # Compute ECG Clusters and normalize the column names
    _e = cugraph.ecg(_graph).rename(columns={'partition': 'cluster'})
    
    # Compute Louvain Clusters 
    _l, modularity = cugraph.louvain(_graph)
    # Normalize the column names
    _l = _l.rename(columns={'partition': 'cluster'})

    # Compute Spectral Balanced Clusters
    _b = cugraph.spectralBalancedCutClustering(_graph, 4, num_eigen_vects=4)

    # Compute Spectral Modularity Maximization Clusters (4 clusters, 4 eigenvectors)
    _m = cugraph.spectralModularityMaximizationClustering(_graph, 4, num_eigen_vects=4)
    return _e, _l, _b, _m
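The Louvain call above also returns a modularity score, which this notebook does not use further. As a rough guide to what that number measures, the sketch below computes the standard Newman modularity, $Q = \sum_c \left[ \frac{L_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]$, for an unweighted, undirected edge list, where $L_c$ is the number of edges inside cluster $c$, $d_c$ is the sum of the degrees of its vertices, and $m$ is the total number of edges. The function name and the toy graph are assumptions for illustration; this is not cuGraph's implementation.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Newman modularity for an unweighted, undirected edge list (no self-loops).

    edges:      list of (u, v) tuples, each undirected edge listed once
    cluster_of: dict mapping vertex -> cluster label
    """
    m = len(edges)
    intra = defaultdict(int)        # edges with both endpoints in the cluster
    degree_sum = defaultdict(int)   # sum of vertex degrees per cluster
    for u, v in edges:
        degree_sum[cluster_of[u]] += 1
        degree_sum[cluster_of[v]] += 1
        if cluster_of[u] == cluster_of[v]:
            intra[cluster_of[u]] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph: two triangles joined by a single edge (3, 4)
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_clusters = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
one_cluster = {v: 0 for v in range(1, 7)}

q_two = modularity(toy_edges, two_clusters)   # 5/14, roughly 0.357
q_one = modularity(toy_edges, one_cluster)    # 0.0
print(q_two, q_one)
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one cluster, which is the kind of partition a modularity-based method prefers.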

In [4]:
# Check whether one algorithm's result places vertices v1 and v2 in the same cluster
def compare_values(algo, v1, v2):
    return (algo.loc[algo['vertex'] == v1]['cluster'].reset_index(drop=True)).equals((algo.loc[algo['vertex'] == v2]['cluster'].reset_index(drop=True)))

This function builds a matrix to identify which algorithms cluster pairs of vertices together.

In [5]:
def create_cluster_matrix(ecg, louvain, spec_balance, spec_mod):
    mat_size = ecg['vertex'].max()
    # Object-dtype matrix indexed directly by vertex ID; entries start as None
    clust_matrix = np.empty((mat_size + 1, mat_size + 1), dtype='object')

    for id_1 in ecg['vertex'].to_pandas():
        for id_2 in ecg['vertex'].to_pandas():
            clust_matrix[id_1][id_2] = ""
            if id_2 > id_1:
                if compare_values(ecg, id_1, id_2):
                    clust_matrix[id_1][id_2] += "e"
                if compare_values(louvain, id_1, id_2):
                    clust_matrix[id_1][id_2] += "l"
                if compare_values(spec_balance, id_1, id_2):
                    clust_matrix[id_1][id_2] += "b"
                if compare_values(spec_mod, id_1, id_2):
                    clust_matrix[id_1][id_2] += "m"

    return clust_matrix   
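The function above filters a DataFrame for every vertex pair and every algorithm. A lighter-weight alternative is to build one vertex-to-cluster lookup per algorithm up front. The sketch below uses plain Python dicts in place of the cuDF results, which is an assumption made so the example runs without a GPU; `build_cluster_matrix` and the toy labelings are hypothetical names and data.

```python
def build_cluster_matrix(labelings):
    """Build the pairwise comparison matrix from vertex -> cluster lookups.

    labelings: {tag: {vertex: cluster}}, e.g. tag "e" for ECG, "l" for Louvain.
    Returns a list-of-lists where matrix[i][j] (for i < j) is the string of
    tags whose algorithm puts vertices i and j in the same cluster.
    """
    vertices = sorted(next(iter(labelings.values())))
    size = max(vertices) + 1
    matrix = [["" for _ in range(size)] for _ in range(size)]
    for i, v1 in enumerate(vertices):
        for v2 in vertices[i + 1:]:          # upper triangle only
            matrix[v1][v2] = "".join(
                tag for tag, cluster_of in labelings.items()
                if cluster_of[v1] == cluster_of[v2]
            )
    return matrix


# Toy labelings for three vertices (made-up values, not the karate results)
toy = {
    "e": {1: 0, 2: 0, 3: 1},
    "l": {1: 0, 2: 0, 3: 0},
}
matrix = build_cluster_matrix(toy)
print(matrix[1][2])   # el
print(matrix[1][3])   # l
```

With the real results, each lookup could be built once from the pandas copy of a result, for example with `dict(zip(pdf['vertex'], pdf['cluster']))`, so that each pair comparison becomes two dictionary reads.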

Helper to look up a vertex pair in either order, since only the upper half of the symmetric matrix is calculated, for efficiency and display purposes.

In [21]:
def pair_clustering(comp_matrix, id1, id2):
    if id2 > id1:
        return comp_matrix[id1][id2]
    else:
        return comp_matrix[id2][id1]

Print the table showing which algorithms group which vertices together

In [6]:
def print_clustering_table(cluster_array):
    import pandas as pd
    from IPython.display import display_html
    df = pd.DataFrame(cluster_array)
    df_styler = df.drop(df.columns[[0]], axis=1).drop(0).style.set_table_attributes("style='display:inline'")
    display_html(df_styler._repr_html_(), raw=True)

## Read the data

In [7]:
# Test file    
datafile='../../data/karate-data.csv'

In [8]:
# read the data using cuDF
gdf = cudf.read_csv(datafile, delimiter='\t', names=['src', 'dst'], dtype=['int32', 'int32'] )

In [9]:
# The algorithms often also require edge weights.  Just use 1.0 for every edge
gdf["data"] = 1.0

It was that easy to load the data.

## Create a Graph

In [10]:
# create a Graph - since the data does not start at '0', use the auto-renumbering feature
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst', edge_attr='data', renumber=True)

## Now do all the clustering

In [11]:
_e, _l, _b, _m = compute_clusters(G)

View the clusters for a single algorithm, in this case Ensemble Clustering for Graphs (ECG).

In [12]:
_e.to_pandas().groupby('cluster')['vertex'].apply(list)

cluster
0                                     [25, 26, 29, 32]
1                                    [5, 11, 17, 6, 7]
2          [20, 10, 13, 18, 22, 1, 3, 2, 4, 14, 8, 12]
3    [15, 16, 19, 21, 23, 34, 33, 9, 24, 30, 31, 28...
Name: vertex, dtype: object

Generate the cluster comparison matrix to view the results of the clustering algorithms in one structure. Note that the first row and column (index 0) are empty: the results are unrenumbered back to the original vertex IDs, which start at 1.

In [13]:
clust_comparison = create_cluster_matrix(_e, _l, _b, _m)

Print the entire algorithm clustering comparison table.

The matrix[i][j] element includes a list of the algorithms where i and j are clustered together:
* e = Ensemble Clustering for Graphs (ECG) has placed i and j together in a cluster
* l = Louvain community detection has placed i and j together in a cluster
* b = Spectral Balanced Clustering has placed i and j together in a cluster
* m = Spectral Modularity Maximization Clustering has placed i and j together in a cluster.
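Once the matrix is built, the tag strings can be tallied to see how often the algorithms agree. The following is a minimal sketch assuming the matrix layout described above: the upper triangle holds tag strings and unused cells (such as row and column 0) hold `None`. The function name and the toy matrix are hypothetical, and the values are made up rather than taken from the karate results.

```python
from collections import Counter


def agreement_summary(matrix):
    """Count vertex pairs in the upper triangle by their tag string.

    Cells equal to None (unused rows/columns) are skipped; an empty
    string means no algorithm co-clustered the pair and is counted as "none".
    """
    counts = Counter()
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):
            cell = matrix[i][j]
            if cell is None:
                continue
            counts[cell or "none"] += 1
    return counts


# Toy 1-based matrix for three vertices (made-up values)
toy_matrix = [
    [None, None, None, None],
    [None, "", "elbm", "m"],
    [None, "", "", ""],
    [None, "", "", ""],
]
summary = agreement_summary(toy_matrix)
print(dict(summary))   # {'elbm': 1, 'm': 1, 'none': 1}
```

A large count for `elbm` relative to the other tags would indicate that all four algorithms largely agree on which vertices belong together.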

In [14]:
print_clustering_table(clust_comparison)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34
1,,elbm,elbm,elbm,m,m,,elbm,,el,,elm,elbm,elbm,,,,elbm,,elbm,,elbm,,,,,,,,,,,,
2,,,elbm,elbm,m,m,,elbm,,el,,elm,elbm,elbm,,,,elbm,,elbm,,elbm,,,,,,,,,,,,
3,,,,elbm,m,m,,elbm,,el,,elm,elbm,elbm,,,,elbm,,elbm,,elbm,,,,,,,,,,,,
4,,,,,m,m,,elbm,,el,,elm,elbm,elbm,,,,elbm,,elbm,,elbm,,,,,,,,,,,,
5,,,,,,elbm,elb,m,,,elb,bm,m,m,,,elb,m,,m,,m,,,,,,,,,,,,
6,,,,,,,elb,m,,,elb,bm,m,m,,,elb,m,,m,,m,,,,,,,,,,,,
7,,,,,,,,,,,elbm,b,,,,,elbm,,,,,,,,,,,,,,,,,
8,,,,,,,,,,el,,elm,elbm,elbm,,,,elbm,,elbm,,elbm,,,,,,,,,,,,
9,,,,,,,,,,,,,,,elb,elbm,,,elbm,,el,,elm,el,b,m,elm,el,m,el,el,m,el,elm
10,,,,,,,,,,,,el,el,el,m,,,el,,el,bm,el,b,bm,m,b,b,bm,b,bm,bm,b,bm,b


An individual entry in the matrix, `clust_comparison[5][17]`, shows that ECG, Louvain, and Spectral Balanced Clustering put vertices 5 and 17 in the same cluster, but Spectral Modularity Maximization does not.

In [22]:
print(pair_clustering(clust_comparison,17,5))

elb
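Pair-level lookups like this one can also be aggregated into a single agreement score between two algorithms. One common choice is the Rand index: the fraction of vertex pairs on which two clusterings agree, either both grouping the pair together or both separating it. The sketch below is an addition for illustration and not part of the original analysis; it uses plain dicts and made-up labels.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of vertex pairs on which two clusterings agree.

    labels_a, labels_b: dicts mapping the same vertices to cluster labels.
    A pair counts as agreement when both clusterings group it together
    or both keep it apart.
    """
    pairs = list(combinations(sorted(labels_a), 2))
    agree = sum(
        (labels_a[u] == labels_a[v]) == (labels_b[u] == labels_b[v])
        for u, v in pairs
    )
    return agree / len(pairs)


# Made-up labels for four vertices
a = {1: 0, 2: 0, 3: 1, 4: 1}
b = {1: 0, 2: 0, 3: 0, 4: 1}
print(rand_index(a, b))   # 0.5
print(rand_index(a, a))   # 1.0
```

A score of 1.0 means the two clusterings are identical up to relabeling of the clusters; lower values mean they disagree on more pairs.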


Finally, to see the full clustering of a single algorithm, do the following:

In [16]:
_e.to_pandas().groupby('cluster')['vertex'].apply(list)

cluster
0                                     [25, 26, 29, 32]
1                                    [5, 11, 17, 6, 7]
2          [20, 10, 13, 18, 22, 1, 3, 2, 4, 14, 8, 12]
3    [15, 16, 19, 21, 23, 34, 33, 9, 24, 30, 31, 28...
Name: vertex, dtype: object

___
Copyright (c) 2022, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___