# Connected Components

In this notebook, we will use cuGraph to compute weakly and strongly connected components of a graph and display some useful information about the resulting components.

Computing _weakly connected components_ (WCC) is often a necessary pre-processing step for many graph algorithms. A dataset may contain several disconnected (sub-)graphs. Running a graph algorithm on only one component of a disconnected graph can lead to bugs that are not easy to trace.

Computing _strongly connected components_ (SCC) is useful in the early stages of graph analysis to get an idea of a graph's structure.



_Notebook Credits_

| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware     |
| --------------|------------|------------------|-----------------|--------------------|
| Kumar Aatish  | 08/13/2019 | created          | 0.15            | GV100, CUDA 10.2   |
| Brad Rees     | 10/18/2021 | updated          | 21.12 nightly   | GV100, CUDA 11.4   |
| Ralph Liu     | 06/22/2022 | updated/tested   | 22.08           | TV100, CUDA 11.5   |




## Introduction

### Weakly Connected Components
To compute WCC for a graph in cuGraph we use:

**cugraph.weakly_connected_components(G)**

    Generate the weakly connected components and attach a component label to each vertex.

    Parameters
    ----------
    G : cugraph.Graph
        cuGraph graph descriptor, should contain the connectivity information
        as an edge list (edge weights are not used for this algorithm).
        Currently, the graph should be undirected where an undirected edge is
        represented by a directed edge in both directions. The adjacency list
        will be computed if not already present. The number of vertices should
        fit into a 32b int.

    Returns
    -------
    df : cudf.DataFrame
      df['labels'][i] gives the label id of the i'th vertex
      df['vertex'][i] gives the vertex id of the i'th vertex
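
To make the returned labels concrete, here is a minimal pure-Python sketch of weakly connected component labelling using union-find. It only illustrates the concept on a hypothetical toy edge list. It is not how cuGraph implements the algorithm, and the helper name `weakly_connected_labels` is made up for this example.

```python
def weakly_connected_labels(edges):
    """Label each vertex with the smallest vertex id in its weakly connected component."""
    parent = {}

    def find(v):
        # Walk up to the root, halving the path as we go.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        # Edge direction is ignored: both endpoints end up in the same set.
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Attach the larger root under the smaller one, so the root
            # of every set is always its smallest vertex id.
            parent[max(root_src, root_dst)] = min(root_src, root_dst)

    return {v: find(v) for v in sorted(parent)}


# Hypothetical toy edge list with two components: {0, 1, 2} and {3, 4}.
toy_edges = [(0, 1), (2, 1), (3, 4)]
wcc_labels = weakly_connected_labels(toy_edges)
print(wcc_labels)  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```

Every vertex in a component shares one label. The label value itself is arbitrary (here, the smallest vertex id in the component), so only label equality between vertices carries meaning.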




### Strongly Connected Components
To compute SCC for a graph in cuGraph we use:

**cugraph.strongly_connected_components(G)**


    Generate the strongly connected components and attach a component label to each vertex.

    Parameters
    ----------
    G : cugraph.Graph
      cuGraph graph descriptor, should contain the connectivity information as
      an edge list (edge weights are not used for this algorithm). The graph
      can be either directed or undirected where an undirected edge is
      represented by a directed edge in both directions.
      The adjacency list will be computed if not already present.
      The number of vertices should fit into a 32b int.

    Returns
    -------
    df : cudf.DataFrame
      df['labels'][i] gives the label id of the i'th vertex
      df['vertex'][i] gives the vertex id of the i'th vertex
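
The sketch below illustrates what a strongly connected component is, using Kosaraju's two-pass algorithm in pure Python on a hypothetical toy graph. It is a conceptual illustration only. The source does not say which algorithm cuGraph uses, and the helper name `strongly_connected_labels` is an assumption made for this example.

```python
from collections import defaultdict


def strongly_connected_labels(edges):
    """Label each vertex with the smallest vertex id in its strongly connected component."""
    forward, backward = defaultdict(list), defaultdict(list)
    vertices = set()
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
        vertices.update((src, dst))

    # Pass 1: record vertices in order of DFS completion on the forward graph.
    visited, finish_order = set(), []
    for start in sorted(vertices):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            vertex, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                # All neighbours explored: this vertex is finished.
                finish_order.append(vertex)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish time.
    labels = {}
    for root in reversed(finish_order):
        if root in labels:
            continue
        members, stack = [root], [root]
        labels[root] = root
        while stack:
            vertex = stack.pop()
            for nxt in backward[vertex]:
                if nxt not in labels:
                    labels[nxt] = root
                    members.append(nxt)
                    stack.append(nxt)
        smallest = min(members)
        for member in members:
            labels[member] = smallest
    return dict(sorted(labels.items()))


# Hypothetical directed toy graph: a 3-cycle {0, 1, 2} feeding a 2-cycle {3, 4}.
directed_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)]
scc_labels = strongly_connected_labels(directed_edges)
print(scc_labels)  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}

# Storing every edge in both directions (an undirected graph) merges them.
undirected_edges = directed_edges + [(dst, src) for src, dst in directed_edges]
undirected_labels = strongly_connected_labels(undirected_edges)
print(undirected_labels)  # {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}
```

In the directed case the edge from 2 to 3 has no return path, so the graph splits into two strongly connected components even though it is a single weakly connected component. Once every edge is present in both directions, the strongly and weakly connected components are the same.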







### Some notes about vertex IDs...
* The current version of cuGraph requires that vertex IDs be representable as 32-bit integers, meaning graphs currently can contain at most 2^32 unique vertex IDs. However, this limitation is being actively addressed and a version of cuGraph that accommodates more than 2^32 vertices will be available in the near future.
* cuGraph will automatically renumber graphs to an internal format consisting of a contiguous series of integers starting from 0, and convert back to the original IDs when returning data to the caller. If the vertex IDs of the data are already a contiguous series of integers starting from 0, the auto-renumbering step can be skipped for faster graph creation times.
  * To skip auto-renumbering, set the `renumber` boolean arg to `False` when calling the appropriate graph creation API (e.g. `G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)`).
  * For more advanced renumbering support, see the examples in `structure/renumber.ipynb` and `structure/renumber-2.ipynb`.
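
As a rough illustration of what renumbering means, the following pure-Python sketch maps arbitrary vertex IDs to a contiguous range starting at 0 and then maps them back. The helper `renumber` and the toy edge list are hypothetical, and cuGraph's actual internal numbering may assign IDs differently.

```python
def renumber(edges):
    """Map arbitrary vertex ids to contiguous integers starting at 0."""
    original_ids = sorted({v for edge in edges for v in edge})
    to_internal = {orig: idx for idx, orig in enumerate(original_ids)}
    internal_edges = [(to_internal[src], to_internal[dst]) for src, dst in edges]
    # original_ids doubles as the reverse lookup: internal id -> original id.
    return internal_edges, original_ids


# Hypothetical edge list whose vertex ids are neither contiguous nor zero-based.
raw_edges = [(100, 250), (250, 7), (42, 100)]
internal_edges, original_ids = renumber(raw_edges)
print(internal_edges)  # [(2, 3), (3, 0), (1, 2)]

# Convert back to the original ids, as is done when results are returned.
restored = [(original_ids[src], original_ids[dst]) for src, dst in internal_edges]
print(restored == raw_edges)  # True
```

If the input IDs already form the range 0..n-1, this mapping is the identity, which is why the step can be skipped.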


### Test Data
We will be using the Netscience dataset :  
*M. E. J. Newman, Finding community structure in networks using the eigenvectors of matrices, Preprint physics/0605087 (2006)*

The graph netscience contains a coauthorship network of scientists working on network theory and experiment. The version given here contains all components of the network, for a total of 1589 scientists, with the largest component of 379 scientists.

Netscience Adjacency Matrix               |NetScience Strongly Connected Components
:---------------------------------------------|------------------------------------------------------------:
![](../img/netscience.png "Credit : https://www.cise.ufl.edu/research/sparse/matrices/Newman/netscience") | ![](../img/netscience_scc.png "Credit : https://www.cise.ufl.edu/research/sparse/matrices/Newman/netscience")
  
Matrix plots above by Yifan Hu, AT&T Labs Visualization Group.

In [1]:
# Import needed libraries
import cugraph
import cudf
import numpy as np

### 1. Import a Built-In Dataset

In [2]:
from cugraph.experimental.datasets import netscience

### 2. Create a Graph from an edge list

In [3]:
G = netscience.get_graph()

### 3a. Call Weakly Connected Components

In [4]:
# Call cugraph.weakly_connected_components on the graph
df = cugraph.weakly_connected_components(G)
df.head(5)

   labels  vertex
0    1232     529
1    1233     534
2    1233     535
3    1235     544
4    1235     545


#### Get total number of weakly connected components

In [5]:
# Use groupby on the 'labels' column of the WCC output to get the counts of each connected component label
label_gby = df.groupby('labels')
label_count = label_gby.count()

print("Total number of components found : ", len(label_count))

Total number of components found :  268


#### Output the sizes of the top 10 largest weakly connected components

In [6]:
# Call nlargest on the groupby result to get the rows with the 10 largest component counts
# NOTE: this will change the value of "vertex" to be the count and "labels" to be an index
largest_component = label_count.nlargest(n = 10, columns = 'vertex')

print("Size of the top 10 largest components are: ")
print(largest_component)

Size of the top 10 largest components are: 
        vertex
labels        
1126       379
785         57
112         31
349         28
26          21
384         14
489         14
651         13
580         12
106         11
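
For readers less familiar with the groupby/count/nlargest pattern, the same logic can be sketched in plain Python with `collections.Counter`. The `rows` list is hypothetical toy data shaped like the (labels, vertex) output, not values from the Netscience graph.

```python
from collections import Counter

# Hypothetical (label, vertex) rows, shaped like the WCC output.
rows = [(0, 0), (0, 1), (0, 2), (3, 3), (3, 4), (5, 5)]

# Equivalent of groupby('labels').count(): vertices per component label.
component_sizes = Counter(label for label, _ in rows)
print("Total number of components found : ", len(component_sizes))

# Equivalent of nlargest(n=2, ...): the two largest components as (label, size).
top_two = component_sizes.most_common(2)
print(top_two)  # [(0, 3), (3, 2)]
```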


#### Output vertex ids belonging to a weakly connected component label

In [7]:
# Query the connected component output to display vertex ids that belong to a component of interest
# picking label 106 from above to reduce amount of data printed

expr = "labels == 106"
component = df.query(expr)

print("Vertex Ids that belong to component label 106 : ")
print(component)

Vertex Ids that belong to component label 106 : 
     labels  vertex
31      106    1412
114     106    1060
732     106    1061
733     106    1062
734     106    1063
735     106    1064
736     106    1065
737     106    1066
738     106    1067
739     106    1068
740     106    1069


### 3b. Call Strongly Connected Components

In [8]:
# Call cugraph.strongly_connected_components on the graph
df = cugraph.strongly_connected_components(G)
df.head(5)

   labels  vertex
0       0    1341
1     743    1345
2     676    1346
3    1411    1354
4    1411    1355


#### Get total number of strongly connected components

In [9]:
# Use groupby on the 'labels' column of the SCC output to get the counts of each connected component label
label_gby = df.groupby('labels')
label_count = label_gby.count()
print("Total number of components found : ", len(label_count))

Total number of components found :  268


#### Get the top 10 largest strongly connected components

In [10]:
# Call nlargest on the groupby result to get the rows with the 10 largest component counts
largest_component = label_count.nlargest(n = 10, columns = 'vertex')

print("Size of the top 10 largest components are: ")
print(largest_component)

Size of the top 10 largest components are: 
        vertex
labels        
0          379
4           57
8           31
30          28
5           21
87          14
167         14
55          13
254         12
66          11


#### Output vertex ids belonging to a strongly connected component label

In [11]:
# Query the connected component output to display vertex ids that belong to a component of interest
expr = "labels == 66"
component = df.query(expr)

print("Vertex Ids that belong to component label 66 : ")
print(component)

Vertex Ids that belong to component label 66 : 
     labels  vertex
15       66    1412
130      66    1060
716      66    1061
717      66    1062
718      66    1063
719      66    1064
720      66    1065
721      66    1066
722      66    1067
723      66    1068
724      66    1069
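
Note that the WCC run labelled this component 106 while the SCC run labelled the same vertices 66. Label values are arbitrary identifiers, so two results should be compared by the vertex groupings they induce rather than by label. The sketch below shows one way to do that in plain Python. The helper `partition` is hypothetical, and the rows are copied from the two printed outputs for this single component.

```python
from collections import defaultdict


def partition(rows):
    """Group vertices by label and discard the label values themselves."""
    groups = defaultdict(set)
    for label, vertex in rows:
        groups[label].add(vertex)
    return {frozenset(members) for members in groups.values()}


# Vertex ids printed for WCC label 106 and SCC label 66.
component_vertices = [1412] + list(range(1060, 1070))
wcc_rows = [(106, v) for v in component_vertices]
scc_rows = [(66, v) for v in component_vertices]

# Different label values, same grouping of vertices.
same_partition = partition(wcc_rows) == partition(scc_rows)
print(same_partition)  # True
```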


### Conclusion

Both **cugraph.weakly_connected_components(G)** and **cugraph.strongly_connected_components(G)** found 268 components, which matches the results from M. E. J. Newman, Phys. Rev. E 64, 016132 (2001).

___
Copyright (c) 2019-2020, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___