# Connected Components

In this notebook, we will use cuGraph to compute weakly and strongly connected components of a graph and display some useful information about the resulting components.

_Weakly connected components_ (WCC) is often a necessary pre-processing step for many graph algorithms. A dataset may contain several disconnected (sub-)graphs. Running a graph algorithm on only one component of a disconnected graph can lead to bugs that are not easy to trace.

_Strongly connected components_ (SCC) analysis is used in the early stages of graph analysis to get an idea of a graph's structure.




Notebook Credits
* Original Authors: Kumar Aatish
* Created:    08/13/2019
* Last Edit:  06/10/2020

RAPIDS Versions: 0.15   

Test Hardware

* GV100 32G, CUDA 10.2



## Introduction

### Weakly Connected Components
To compute WCC for a graph in cuGraph we use:

**cugraph.weakly_connected_components(G)**

   Generate the weakly connected components and attach a component label to each vertex.

    Parameters
    ----------
    G : cugraph.Graph
        cuGraph graph descriptor, should contain the connectivity information
        as an edge list (edge weights are not used for this algorithm).
        Currently, the graph should be undirected where an undirected edge is
        represented by a directed edge in both directions. The adjacency list
        will be computed if not already present. The number of vertices should
        fit into a 32b int.

    Returns
    -------
    df : cudf.DataFrame
      df['labels'][i] gives the label id of the i'th vertex
      df['vertices'][i] gives the vertex id of the i'th vertex
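
To make the labeling idea concrete, here is a minimal CPU sketch that uses a union-find structure on a tiny made-up edge list. It is not how cuGraph implements WCC on the GPU; the function name `wcc_labels` and the choice of the smallest vertex id as the label are assumptions made for illustration only.

```python
def wcc_labels(num_vertices, edges):
    """Return (vertex, label) pairs; the label is the smallest vertex id
    in the component (an illustrative choice, not cuGraph's convention)."""
    parent = list(range(num_vertices))

    def find(v):
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_s, root_d = find(src), find(dst)
        if root_s != root_d:
            # Attach the larger root under the smaller one so that every
            # root stays the minimum vertex id of its set.
            if root_s < root_d:
                parent[root_d] = root_s
            else:
                parent[root_s] = root_d

    return [(v, find(v)) for v in range(num_vertices)]


# Hypothetical toy graph: two small components plus the isolated vertex 5.
edges = [(0, 1), (1, 2), (3, 4)]
wcc_rows = wcc_labels(6, edges)
print(wcc_rows)
```

Because edge direction is ignored, each undirected edge only needs to appear once in this sketch, whereas the cuGraph call described above expects both directions to be present.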




### Strongly Connected Components
To compute SCC for a graph in cuGraph we use:

**cugraph.strongly_connected_components(G)**


    Generate the strongly connected components and attach a component label to each vertex.

    Parameters
    ----------
    G : cugraph.Graph
      cuGraph graph descriptor, should contain the connectivity information as
      an edge list (edge weights are not used for this algorithm). The graph
      can be either directed or undirected where an undirected edge is
      represented by a directed edge in both directions.
      The adjacency list will be computed if not already present.
      The number of vertices should fit into a 32b int.

    Returns
    -------
    df : cudf.DataFrame
      df['labels'][i] gives the label id of the i'th vertex
      df['vertices'][i] gives the vertex id of the i'th vertex
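
For a directed graph, two vertices share a strongly connected component only if each can reach the other. The sketch below illustrates this on the CPU with Kosaraju's two-pass algorithm. It is a swapped-in textbook technique, not cuGraph's GPU implementation; `scc_labels` and the toy edge list are hypothetical.

```python
def scc_labels(num_vertices, edges):
    """Return a list where entry v is the component label of vertex v."""
    adj = [[] for _ in range(num_vertices)]
    radj = [[] for _ in range(num_vertices)]
    for s, d in edges:
        adj[s].append(d)
        radj[d].append(s)

    # Pass 1: iterative DFS on the original graph, recording finish order.
    visited = [False] * num_vertices
    order = []
    for start in range(num_vertices):
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, 0)]
        while stack:
            v, i = stack.pop()
            if i < len(adj[v]):
                stack.append((v, i + 1))  # resume v after exploring adj[v][i]
                w = adj[v][i]
                if not visited[w]:
                    visited[w] = True
                    stack.append((w, 0))
            else:
                order.append(v)  # all neighbours of v are done

    # Pass 2: sweep the reversed graph in decreasing finish order.
    labels = [-1] * num_vertices
    next_label = 0
    for root in reversed(order):
        if labels[root] != -1:
            continue
        labels[root] = next_label
        stack = [root]
        while stack:
            v = stack.pop()
            for w in radj[v]:
                if labels[w] == -1:
                    labels[w] = next_label
                    stack.append(w)
        next_label += 1
    return labels


# Hypothetical directed graph: a 3-cycle feeding a 2-cycle, plus isolated vertex 5.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)]
labels = scc_labels(6, edges)
print(labels)
```

In this toy graph the edge 2 -> 3 joins vertices 0 to 4 into a single weakly connected component, but they form two strongly connected components because nothing leads back from {3, 4} to {0, 1, 2}.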







## cuGraph Notice 
The current version of cuGraph has some limitations:

* Vertex IDs need to be 32-bit integers.
* Vertex IDs are expected to be contiguous integers starting from 0.

cuGraph provides the renumber function to mitigate this problem. Input vertex IDs for the renumber function can be either 32-bit or 64-bit integers, can be non-contiguous, and can start from an arbitrary number. The renumber function maps the provided input vertex IDs to 32-bit contiguous integers starting from 0. cuGraph still requires the renumbered vertex IDs to be representable in 32-bit integers. These limitations are being addressed and will be fixed soon.
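
The mapping that renumbering performs can be sketched in plain Python as follows. This only illustrates the idea of mapping arbitrary ids to contiguous integers starting from 0; the helper name `renumber_edges`, its return values, and the order in which new ids are assigned are assumptions and do not describe cuGraph's actual renumber function.

```python
def renumber_edges(src_ids, dst_ids):
    """Map arbitrary vertex ids to 0..n-1 in order of first appearance."""
    id_to_index = {}
    index_to_id = []  # index_to_id[new_id] recovers the original id

    def lookup(external_id):
        if external_id not in id_to_index:
            id_to_index[external_id] = len(index_to_id)
            index_to_id.append(external_id)
        return id_to_index[external_id]

    new_src, new_dst = [], []
    for s, d in zip(src_ids, dst_ids):
        new_src.append(lookup(s))
        new_dst.append(lookup(d))
    return new_src, new_dst, index_to_id


# Made-up ids: non-contiguous, and one does not fit in 32 bits.
src = [10_000_000_000, 42, 42]
dst = [42, 7, 10_000_000_000]
new_src, new_dst, index_to_id = renumber_edges(src, dst)
print(new_src, new_dst, index_to_id)
```

Keeping `index_to_id` around makes it possible to translate component results back to the original vertex ids afterwards.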

### Test Data
We will be using the Netscience dataset:  
*M. E. J. Newman, Finding community structure in networks using the eigenvectors of matrices, Preprint physics/0605087 (2006)*

The netscience graph contains a coauthorship network of scientists working on network theory and experiment. The version given here contains all components of the network, for a total of 1589 scientists, with the largest component containing 379 scientists.

Netscience Adjacency Matrix | Netscience Strongly Connected Components
:---------------------------------------------|------------------------------------------------------------:
![](../img/netscience.png "Credit : https://www.cise.ufl.edu/research/sparse/matrices/Newman/netscience") | ![](../img/netscience_scc.png "Credit : https://www.cise.ufl.edu/research/sparse/matrices/Newman/netscience")
  
Matrix plots above by Yifan Hu, AT&T Labs Visualization Group.

In [None]:
# Import needed libraries
import cugraph
import cudf
import numpy as np

### 1. Read graph data from file

cuGraph depends on cuDF for data loading and the initial DataFrame creation on the GPU.

The data file contains an edge list, in which each row represents a connection from one vertex to another. Storing the graph as source-destination pairs is known as Coordinate Format (COO).

In this test case, the data file has three columns: source, destination, and edge weight. While edge weight is relevant in other algorithms, the cuGraph connected component calls do not use it, so that column can be dropped from the DataFrame.
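
As a small CPU illustration of the COO layout, the sketch below parses a few space-delimited rows with the standard library and keeps only the first two columns. The sample rows are made up for illustration and are not taken from the netscience file.

```python
import csv
import io

# Made-up rows in "source destination weight" form.
sample = "0 1 2.5\n0 2 1.0\n3 4 0.5\n"

src, dst = [], []
for row in csv.reader(io.StringIO(sample), delimiter=" "):
    # Keep the vertex pair and ignore the third (weight) column.
    src.append(int(row[0]))
    dst.append(int(row[1]))

# In COO form, edge i runs from src[i] to dst[i].
print(list(zip(src, dst)))
```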

In [None]:
# Test file
datafile='../data/netscience.csv'

# The data file contains three columns, but we only want to use the first two.
# We will use the "usecols" feature of read_csv to ignore the weight column.

gdf = cudf.read_csv(datafile, delimiter=' ', names=['src', 'dst', 'wgt'], dtype=['int32', 'int32', 'float32'], usecols=['src', 'dst'])
gdf.head(5)

### 2. Create a Graph from an edge list

In [None]:
# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')
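
The introduction notes that an undirected edge is represented by a directed edge in both directions. The sketch below shows that idea on the CPU by building an adjacency list from a COO edge list. It is only a conceptual illustration, not a description of how `cugraph.Graph` stores data internally; `build_undirected_adjacency` is a hypothetical helper.

```python
def build_undirected_adjacency(num_vertices, src, dst):
    """Build sorted neighbour lists, storing each edge in both directions."""
    adjacency = [set() for _ in range(num_vertices)]
    for s, d in zip(src, dst):
        adjacency[s].add(d)  # forward direction
        adjacency[d].add(s)  # reverse direction makes the edge undirected
    return [sorted(neighbours) for neighbours in adjacency]


# Made-up COO edge list with 5 vertices.
src = [0, 0, 3]
dst = [1, 2, 4]
adjacency = build_undirected_adjacency(5, src, dst)
print(adjacency)
```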

### 3a. Call Weakly Connected Components

In [None]:
# Call cugraph.weakly_connected_components on the graph G
df = cugraph.weakly_connected_components(G)
df.head(5)

#### Get total number of weakly connected components

In [None]:
# Use groupby on the 'labels' column of the WCC output to get the counts of each connected component label
label_gby = df.groupby('labels')
label_count = label_gby.count()

print("Total number of components found : ", len(label_count))

#### Get size of the largest weakly connected component

In [None]:
# Call nlargest on the groupby result to get the row where the component count is the largest
largest_component = label_count.nlargest(n = 1, columns = 'vertices')
print("Size of the largest component is found to be : ", largest_component['vertices'].iloc[0])

#### Output vertex ids belonging to a weakly connected component label

In [None]:
# Query the connected component output to display vertex ids that belong to a component of interest
expr = "labels == 1"
component = df.query(expr)
print("Vertex Ids that belong to component label 1 : ")
print(component)
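
The three post-processing steps above (counting components, finding the largest one, and listing the members of one label) can be mirrored on the CPU with the standard library. This is a sketch on made-up (vertex, label) rows, not on the real WCC output, and the label value 3 is chosen only for illustration.

```python
from collections import Counter

# Made-up (vertex, label) rows standing in for the component output.
rows = [(0, 0), (1, 0), (2, 0), (3, 3), (4, 3), (5, 5)]

# Counterpart of groupby('labels').count(): vertices per label.
sizes = Counter(label for _, label in rows)
num_components = len(sizes)

# Counterpart of nlargest(n=1, ...): the label with the most vertices.
largest_label, largest_size = sizes.most_common(1)[0]

# Counterpart of query("labels == ..."): vertices carrying one label.
members = [vertex for vertex, label in rows if label == 3]

print(num_components, largest_label, largest_size, members)
```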

### 3b. Call Strongly Connected Components

In [None]:
# Call cugraph.strongly_connected_components on the graph G
df = cugraph.strongly_connected_components(G)
df.head(5)

#### Get total number of strongly connected components

In [None]:
# Use groupby on the 'labels' column of the SCC output to get the counts of each connected component label
label_gby = df.groupby('labels')
label_count = label_gby.count()
print("Total number of components found : ", len(label_count))

#### Get size of the largest strongly connected component

In [None]:
# Call nlargest on the groupby result to get the row where the component count is the largest
largest_component = label_count.nlargest(n = 1, columns = 'vertices')
print("Size of the largest component is found to be : ", largest_component['vertices'].iloc[0])

#### Output vertex ids belonging to a strongly connected component label

In [None]:
# Query the connected component output to display vertex ids that belong to a component of interest
expr = "labels == 2"
component = df.query(expr)
print("Vertex Ids that belong to component label 2 : ")
print(component)

### Conclusion

The numbers of components found by **cugraph.weakly_connected_components(G)** and **cugraph.strongly_connected_components(G)** match the results reported in M. E. J. Newman, Phys. Rev. E 64, 016132 (2001).

___
Copyright (c) 2019-2020, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___