# Spectral Clustering  

In this notebook, we will use cuGraph to identify the clusters in a test graph using Spectral Clustering with both (A) the Balanced Cut metric and (B) the Modularity Maximization metric.


| Author Credit              |    Date    |  Update          | cuGraph Version |  Test Hardware              |
| ---------------------------|------------|------------------|-----------------|-----------------------------|
| Brad Rees and James Wyles  | 08/01/2019 | created          | 0.14            | GV100 32G, CUDA 10.2        |
|                            | 08/16/2020 | updated          | 0.15            | GV100 32G, CUDA 10.2        |
| Don Acosta                 | 07/11/2022 | tested / updated | 22.08 nightly   | DGX Tesla V100 CUDA 11.5    |
| Ralph Liu                  | 07/26/2022 | updated          | 22.08 nightly   | DGX Tesla V100 CUDA 11.5    |

## Introduction

Spectral clustering uses the eigenvectors of a Laplacian of the input graph to find a given number of clusters which satisfy a given quality metric. Balanced Cut and Modularity Maximization are two such quality metrics. 

See:  https://en.wikipedia.org/wiki/Spectral_clustering
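
To make the idea concrete, here is a minimal pure-Python sketch of the simplest spectral split: a two-way partition based on the sign of the Fiedler vector (the eigenvector belonging to the second-smallest eigenvalue of the graph Laplacian). The toy graph, the helper name `fiedler_vector`, and the use of power iteration are all assumptions made for illustration. cuGraph uses its own GPU eigensolver followed by k-means, which is not reproduced here.

```python
# Illustrative sketch only: not cuGraph's implementation.

def fiedler_vector(edges, n, iters=200):
    # Dense combinatorial Laplacian L = D - A of an unweighted, undirected graph
    lap = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0

    # Power iteration on M = c*I - L. Every eigenvalue of L is at most
    # 2 * (max degree), so M is positive semidefinite and its dominant
    # eigenvectors correspond to the smallest eigenvalues of L.
    c = 2.0 * max(lap[i][i] for i in range(n))
    x = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iters):
        # Project out the constant vector (the eigenvalue-0 eigenvector of L)
        mean = sum(x) / n
        x = [xi - mean for xi in x]
        # Multiply by M and renormalize
        y = [c * x[i] - sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return x

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
vec = fiedler_vector(edges, 6)

# Vertices are split by the sign of their entry in the Fiedler vector
side_a = sorted(v for v in range(6) if vec[v] < 0)
side_b = sorted(v for v in range(6) if vec[v] >= 0)
print(side_a, side_b)
```

For this toy graph the two sides should come out as the two triangles, with the bridge edge 2-3 being the only edge that is cut. Finding more than two clusters generally means using several eigenvectors as coordinates and clustering the vertices in that space, which is where the k-means parameters described below come in.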

To perform spectral clustering in cuGraph, call one of the following functions, depending on the metric:

__df = cugraph.spectralBalancedCutClustering(G, num_clusters, num_eigen_vects)__
<br>or<br>
__df = cugraph.spectralModularityMaximizationClustering(G, num_clusters, num_eigen_vects)__



### Balanced Cut

    Compute a clustering/partitioning of the given graph using the spectral balanced cut method.

    Parameters
    ----------
    G : cugraph.Graph
        cuGraph graph descriptor
    num_clusters : integer
         Specifies the number of clusters to find
    num_eigen_vects : integer
         Specifies the number of eigenvectors to use. Must be less than or
         equal to num_clusters.
    evs_tolerance: float
         Specifies the tolerance to use in the eigensolver
    evs_max_iter: integer
         Specifies the maximum number of iterations for the eigensolver
    kmean_tolerance: float
         Specifies the tolerance to use in the k-means solver
    kmean_max_iter: integer
         Specifies the maximum number of iterations for the k-means solver

    Returns
    -------
    df : cudf.DataFrame
        GPU data frame containing two cudf.Series of size V: the vertex
        identifiers and the corresponding cluster assignments.

        df['vertex'] : cudf.Series
            contains the vertex identifiers
        df['cluster'] : cudf.Series
            contains the cluster assignments

            

### Modularity Maximization

    Compute a clustering/partitioning of the given graph using the spectral modularity maximization method.

    Parameters
    ----------
    G : cugraph.Graph
        cuGraph graph descriptor. This graph should have edge weights.
    num_clusters : integer
         Specifies the number of clusters to find
    num_eigen_vects : integer
         Specifies the number of eigenvectors to use. Must be less than or
         equal to num_clusters.
    evs_tolerance: float
         Specifies the tolerance to use in the eigensolver
    evs_max_iter: integer
         Specifies the maximum number of iterations for the eigensolver
    kmean_tolerance: float
         Specifies the tolerance to use in the k-means solver
    kmean_max_iter: integer
         Specifies the maximum number of iterations for the k-means solver

    Returns
    -------
    df : cudf.DataFrame
        df['vertex'] : cudf.Series
            contains the vertex identifiers
        df['cluster'] : cudf.Series
            contains the cluster assignments

#### Some notes about vertex IDs...

* cuGraph will automatically renumber graphs to an internal format consisting of a contiguous series of integers starting from 0, and convert back to the original IDs when returning data to the caller. If the vertex IDs of the data are already a contiguous series of integers starting from 0, the auto-renumbering step can be skipped for faster graph creation times.
  * To skip auto-renumbering, set the `renumber` boolean argument to `False` when calling the appropriate graph creation API (e.g., `G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)`).
  * For more advanced renumbering support, see the examples in `structure/renumber.ipynb` and `structure/renumber-2.ipynb`.
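
As a rough picture of what renumbering involves, the following pure-Python sketch maps arbitrary vertex IDs to a contiguous range starting at 0 and then translates a per-vertex result back to the original IDs. The helper name `renumber`, the sample edges, and the choice of sorted order for assigning internal IDs are assumptions for illustration. The mapping cuGraph builds internally may be ordered differently.

```python
# Illustrative sketch only: not cuGraph's renumbering code.

def renumber(edges):
    """Map arbitrary vertex IDs to 0..n-1 and return the reverse lookup."""
    original_ids = sorted({v for edge in edges for v in edge})
    to_internal = {orig: i for i, orig in enumerate(original_ids)}
    internal_edges = [(to_internal[s], to_internal[d]) for s, d in edges]
    # original_ids[i] is the external ID of internal vertex i
    return internal_edges, original_ids

# Edge list whose vertex IDs are neither contiguous nor zero-based
edges = [(10, 30), (30, 70), (70, 10)]
internal_edges, original_ids = renumber(edges)

# Pretend an algorithm produced one label per internal vertex ID ...
internal_result = {0: "a", 1: "b", 2: "a"}
# ... and convert the result back to the caller's original IDs
external_result = {original_ids[i]: label for i, label in internal_result.items()}
print(internal_edges, external_result)
```

When the input IDs already form the range 0..n-1, this mapping is the identity, which is why the step can be skipped.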


### Test Data
We will be using the Zachary Karate Club dataset:
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*


<img src="../../img/zachary_graph_clusters.png" width="35%"/>

Because the test data has vertex IDs starting at 1, the auto-renumber feature of cuGraph (mentioned above) will be used so the starting vertex ID is zero for maximum efficiency. The resulting data will then be auto-unrenumbered, making the entire renumbering process transparent to users.


Zachary used a min-cut flow model to partition the graph into two clusters, shown by the circles and squares. Zachary wanted just two clusters, based on a conflict that caused the Karate club to break into two separate clubs. Many social network clustering methods identify more than two social groups in the data.

In [1]:
# Import needed libraries
import cugraph
import cudf
import numpy as np

# Import a built-in dataset
from cugraph.experimental.datasets import karate

### Create Edgelist and Add Edge Weights

In [3]:
gdf = karate.get_edgelist(fetch=True)

# The algorithm requires that there are edge weights.  In this case all the weights are being set to 1
gdf["data"] = cudf.Series(np.ones(len(gdf), dtype=np.float32))

In [None]:
# Look at the first few data records - the output should include the 'src' and 'dst' columns plus the 'data' weight column added above
gdf.head()

In [None]:
# verify data type
gdf.dtypes

Everything looks good, so we can now create a graph.

In [None]:
# create a Graph 
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst', edge_attr='data')

----
#### Define a function that prints the vertices assigned to a given cluster

In [None]:
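def print_cluster(_df, cluster_id):
    # Select the rows assigned to the requested cluster
    # (the parameter is named cluster_id to avoid shadowing the built-in id)
    _f = _df.query('cluster == @cluster_id')

    # Copy the vertex IDs to the host in one step and print them as a list
    part = _f['vertex'].to_pandas().tolist()
    print(part)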

----
#### Using Balanced Cut

In [None]:
# Call spectralBalancedCutClustering on the graph for 3 clusters
# using 3 eigenvectors:
bc_gdf = cugraph.spectralBalancedCutClustering(G, 3, num_eigen_vects=3)

In [None]:
# Check the edge cut score for the produced clustering
score = cugraph.analyzeClustering_edge_cut(G, 3, bc_gdf, 'vertex', 'cluster')
score
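
For intuition about what an edge cut measures, the sketch below computes one by hand on a toy graph: the total weight of the edges whose endpoints fall in different clusters. The helper `edge_cut`, the toy edges, and the cluster assignments are hypothetical, and the sketch assumes the score is the plain sum of cut edge weights. Whether cuGraph applies any further normalization is not shown in this notebook.

```python
# Illustrative sketch only: a hand-computed edge cut on a toy graph.

def edge_cut(edges, cluster_of):
    """Sum the weights of edges whose endpoints are in different clusters."""
    return sum(w for s, d, w in edges if cluster_of[s] != cluster_of[d])

# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3, all weights 1.0
edges = [
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
    (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
    (2, 3, 1.0),
]

# Splitting along the bridge cuts exactly one edge
good_split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
# Moving vertex 2 to the other side cuts the edges 0-2 and 1-2 instead
worse_split = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}

good_cut = edge_cut(edges, good_split)
worse_cut = edge_cut(edges, worse_split)
print(good_cut, worse_cut)
```

Under this definition a lower value means fewer (or lighter) connections between clusters.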

In [None]:
# See which nodes are in cluster 0:
print_cluster(bc_gdf, 0)

In [None]:
# See which nodes are in cluster 1:
print_cluster(bc_gdf, 1)

In [None]:
# See which nodes are in cluster 2:
print_cluster(bc_gdf, 2)

----
#### Modularity Maximization
Let's now look at the clustering produced using the modularity maximization metric.

In [None]:
# Call spectralModularityMaximizationClustering on the graph for 3 clusters
# using 3 eigenvectors:
mm_gdf = cugraph.spectralModularityMaximizationClustering(G, 3, num_eigen_vects=3)

In [None]:
# Check the modularity score for the produced clustering
score = cugraph.analyzeClustering_modularity(G, 3, mm_gdf, 'vertex', 'cluster')
score
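
To show what a modularity score expresses, the sketch below evaluates the standard (Newman) modularity on a toy graph: for each cluster, the fraction of edge weight inside the cluster minus the fraction expected if edges were placed at random with the same vertex degrees. The helper `modularity` and the toy data are hypothetical, and the sketch assumes an undirected graph with each edge listed once. It is not taken from cuGraph.

```python
# Illustrative sketch only: standard modularity computed by hand.

def modularity(edges, cluster_of):
    """Q = sum over clusters of (internal / m) - (degree_sum / (2 * m)) ** 2."""
    total = sum(w for _, _, w in edges)  # m: total edge weight
    internal = {}  # weight of edges with both endpoints in the cluster
    degree = {}    # sum of weighted degrees of the cluster's vertices
    for s, d, w in edges:
        cs, cd = cluster_of[s], cluster_of[d]
        degree[cs] = degree.get(cs, 0.0) + w
        degree[cd] = degree.get(cd, 0.0) + w
        if cs == cd:
            internal[cs] = internal.get(cs, 0.0) + w
    return sum(
        internal.get(c, 0.0) / total - (degree[c] / (2.0 * total)) ** 2
        for c in degree
    )

# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3, all weights 1.0
edges = [
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
    (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
    (2, 3, 1.0),
]

two_triangles = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in range(6)}

q_split = modularity(edges, two_triangles)
q_single = modularity(edges, one_cluster)
print(q_split, q_single)
```

Here the two-triangle split scores 5/14 (about 0.357), while putting every vertex in a single cluster scores 0. Higher modularity indicates denser connections within clusters than would be expected by chance.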

In [None]:
# See which nodes are in cluster 0:
print_cluster(mm_gdf, 0)

In [None]:
# See which nodes are in cluster 1:
print_cluster(mm_gdf, 1)

In [None]:
# See which nodes are in cluster 2:
print_cluster(mm_gdf, 2)

Notice that the two metrics produce different results.
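
Because cluster numbers are arbitrary labels, comparing two clusterings label by label can be misleading. One possible way to quantify how much two results differ is the Rand index, which counts the fraction of vertex pairs that both clusterings treat the same way (together in both, or apart in both). This measure is not part of the original notebook. The helper `rand_index` and the toy assignments below are hypothetical and only illustrate the idea.

```python
# Illustrative sketch only: label-invariant comparison of two clusterings.
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of vertex pairs on which the two clusterings agree."""
    agree = 0
    pairs = 0
    for u, v in combinations(sorted(labels_a), 2):
        same_a = labels_a[u] == labels_a[v]
        same_b = labels_b[u] == labels_b[v]
        agree += same_a == same_b
        pairs += 1
    return agree / pairs

first = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
# Same partition as `first`, with the cluster labels swapped
relabeled = {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0}
# Vertex 2 moved to the other cluster
shifted = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}

same_score = rand_index(first, relabeled)
diff_score = rand_index(first, shifted)
print(same_score, diff_score)
```

A value of 1.0 means the two clusterings are the same partition regardless of how the clusters are numbered. To apply this to the results above, the `vertex` and `cluster` columns of each result would first need to be copied to the host (for example with `to_pandas()`) and turned into dictionaries.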

___
Copyright (c) 2019-2022, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___