# Skip notebook test
(this notebook is not executed as part of the RAPIDS cuGraph CI process.)

--- 

### Timing 
NetworkX and cuGraph divide the overall workflow differently. NetworkX spends a lot of time creating the graph data structure, whereas cuGraph creates its data structure lazily, when an algorithm is first called.

To further complicate the comparison, NetworkX does not always return the answer directly. In some cases it returns a generator that must then be consumed to get the data.

This benchmark measures time from an analyst's perspective: how long does it take to create the graph and run an algorithm?

__What is not timed__: reading the data

__What is timed__: (1) creating a Graph, (2) running the algorithm, and (3) running any generators


Notes:
* Since this is generated test data, we do not need to renumber the data.
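
The three timed steps can be sketched as a single helper. This is a minimal illustration, not the code used in the cells below: `time_workflow`, `build_adjacency`, and `lazy_degrees` are hypothetical names, and the toy graph is a plain dictionary rather than a NetworkX or cuGraph object.

```python
from collections.abc import Iterator
from time import perf_counter


def time_workflow(create_graph, run_algorithm, edges):
    """Time graph creation, the algorithm call, and draining any lazy result."""
    start = perf_counter()
    graph = create_graph(edges)        # (1) create the graph
    result = run_algorithm(graph)      # (2) run the algorithm
    if isinstance(result, Iterator):
        result = list(result)          # (3) consume generators/iterators
    elapsed = perf_counter() - start
    return elapsed, result


# Toy stand-ins (hypothetical, not NetworkX or cuGraph calls)
def build_adjacency(edges):
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    return adj


def lazy_degrees(adj):
    # A generator: nothing is computed until it is iterated
    for node, nbrs in adj.items():
        yield node, len(nbrs)


elapsed, degrees = time_workflow(build_adjacency, lazy_degrees, [(0, 1), (1, 2)])
print(dict(degrees))  # {0: 1, 1: 2, 2: 1}
```

If step (3) is skipped, a function that returns a generator appears almost free, because the real work only happens on iteration.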

---

### Algorithms
|        Algorithm        |  Type         | Graph | DiGraph |   Notes
| ------------------------|---------------|------ | ------- |-------------
| Katz                    | Centrality    |   X   |         | 
| Betweenness Centrality  | Centrality    |   X   |         | Estimated, k = 100
| Louvain                 | Community     |   X   |         | Uses python-louvain for comparison
| Triangle Counting       | Community     |   X   |         |
| WCC                     | Components    |       |    X    | Nx requires directed and returns a generator  
| Core Number             | Core          |   X   |         |  
| PageRank                | Link Analysis |       |    X    |
| Jaccard                 | Similarity    |   X   |         |
| BFS                     | Traversal     |   X   |         | No depth limit 
| SSSP                    | Traversal     |   X   |         | 


### Test Data
Data is generated using a Recursive MATrix (R-MAT) graph generation algorithm.
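
To give a feel for how R-MAT picks edges, here is a small pure-Python sketch. It is an illustration only and not cuGraph's implementation: `rmat_edges` is a hypothetical helper, and in this sketch `scale` is assumed to be the number of bits per vertex ID (so there are `2**scale` vertices). The probabilities 0.57, 0.19, 0.19 and the seed 42 match the values passed to `rmat` in this notebook.

```python
import random


def rmat_edges(scale, num_edges, a=0.57, b=0.19, c=0.19, seed=42):
    """Generate (src, dst) pairs by recursively choosing adjacency-matrix quadrants."""
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        for _ in range(scale):
            # Each level appends one bit to src and one bit to dst
            src <<= 1
            dst <<= 1
            r = rng.random()
            if r < a:
                pass              # top-left quadrant: both bits stay 0
            elif r < a + b:
                dst |= 1          # top-right quadrant
            elif r < a + b + c:
                src |= 1          # bottom-left quadrant
            else:
                src |= 1          # bottom-right quadrant, probability 1 - a - b - c
                dst |= 1
        edges.append((src, dst))
    return edges


edges = rmat_edges(scale=4, num_edges=10)
print(len(edges))  # 10
```

Because the top-left quadrant is favored at every level, low-numbered vertices collect more edges, which produces the skewed degree distribution R-MAT is known for.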



### Notes
* Running Betweenness Centrality on the full graph is prohibitively slow with NetworkX. Anything over k=100 can push runtime to days.
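
The reason `k` controls runtime is that estimated betweenness runs one shortest-path pass per sampled source. The sketch below shows the idea for an unweighted graph using Brandes-style accumulation. It is an assumption-laden illustration, not the NetworkX or cuGraph code: `approx_betweenness` is a hypothetical name, the result is unnormalized, it counts ordered source-target pairs, and the final rescaling by `n / k` is a choice made for this example.

```python
import random
from collections import deque


def approx_betweenness(adj, k, seed=0):
    """Estimate betweenness from k sampled BFS sources (unweighted graph)."""
    nodes = list(adj)
    if not nodes:
        return {}
    rng = random.Random(seed)
    sources = rng.sample(nodes, min(k, len(nodes)))
    score = {v: 0.0 for v in nodes}
    for s in sources:
        # Forward pass: BFS that counts shortest paths and records predecessors
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward pass: accumulate dependencies in reverse BFS order
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Scale the sampled sum up as if every node had been a source
    factor = len(nodes) / len(sources)
    return {v: score[v] * factor for v in nodes}


path = {0: [1], 1: [0, 2], 2: [1]}
print(approx_betweenness(path, k=3))  # {0: 0.0, 1: 2.0, 2: 0.0}
```

Each source costs one traversal of the graph, so the total work grows roughly linearly with `k`.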


Notebook Credits

    
| Author        |    Date    |  Update             | cuGraph Version |  Test Hardware         |
| --------------|------------|---------------------|-----------------|------------------------|
| Don Acosta    | 10/12/2022 | Fix triangles and transposed graphs   | 23.02 nightly          | Tesla A6000, CUDA 11.5  |





## Import Modules

In [1]:
# system and other
import gc
import os
from time import perf_counter
import numpy as np
import math

# rapids
import cugraph
import cudf

# NetworkX libraries
import networkx as nx

# RMAT data generator
from cugraph.generators import rmat

In [2]:
try: 
    import community
except ModuleNotFoundError:
    os.system('pip install python-louvain')
    import community

### Define the test data

In [3]:
# Test Files
# set the data argument for full test or quick test

data_full = {
    'data_1k'  :  1000,
    'data_5k'  :  5000,
    'data_25k' :  25000,
    'data_50k' :  50000
}

# for quick testing
data_quick = {
   'data_500' : 500,
}


# TODO: Was set to quick for test
data = data_full


### Generate data
The data is generated once for each size.

In [4]:
# Data generator - edge lists are created with the cuGraph RMAT generator
def generate_data(datasize):
    edgefactor = 2  # the number of edges requested is datasize * edgefactor
    print('Generating ' + str(datasize) + '...')
    scale = datasize
    _gdf = rmat(
        scale,
        scale * edgefactor,
        0.57,
        0.19,
        0.19,
        42,
        clip_and_flip=False,
        scramble_vertex_ids=True,
        create_using=None,  # return edgelist instead of Graph instance
        mg=False
        )
    return _gdf

## Create Graph functions
There are two types of graphs created:
* Directed Graphs - calls to create_xx_digraph
* Undirected Graphs - calls to create_xx_ugraph <- fully symmetrized
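
"Fully symmetrized" means that every edge is present in both directions. A minimal sketch of that idea on a plain edge list follows; `symmetrize` is a hypothetical helper written for illustration and says nothing about how cuGraph or NetworkX store undirected graphs internally.

```python
def symmetrize(edges):
    """Return a sorted edge list that contains (v, u) for every (u, v)."""
    both = set()
    for u, v in edges:
        both.add((u, v))
        both.add((v, u))
    return sorted(both)


directed = [(0, 1), (1, 2), (2, 1)]
print(symmetrize(directed))  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

Duplicate and reciprocal edges collapse, so the symmetrized list can be shorter than twice the input.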

In [5]:
# NetworkX
def create_nx_digraph(_df):
    _gnx = nx.from_pandas_edgelist(_df,
                                   source='src',
                                   target='dst',
                                   edge_attr=None,
                                   create_using=nx.DiGraph)
    return _gnx

def create_nx_ugraph(_df):
    _gnx = nx.from_pandas_edgelist(_df,
                                   source='src',
                                   target='dst',
                                   edge_attr=None,
                                   create_using=nx.Graph)
    return _gnx


# cuGraph
def create_cu_digraph(_df, transpose=False):
    _g = cugraph.Graph(directed=True)
    _g.from_cudf_edgelist(_df,
                          source='src',
                          destination='dst',
                          renumber=False,
                          store_transposed=transpose)
    return _g

def create_cu_ugraph(_df,transpose=False):
    _g = cugraph.Graph(directed=False)
    _g.from_cudf_edgelist(_df,
                          source='src',
                          destination='dst',
                          renumber=False,
                          store_transposed=transpose)
    return _g

## Algorithm Execution

### Katz

In [6]:
def nx_katz(_df, alpha):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _ = nx.katz_centrality(_G, alpha)
    t2 = perf_counter() - t1
    return t2

def cu_katz(_df, alpha):
    t1 = perf_counter()
    _G = create_cu_ugraph(_df, transpose=True)
    _ = cugraph.katz_centrality(_G, alpha)
    t2 = perf_counter() - t1
    return t2

def cu_katz_nx(_df, alpha):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _ = cugraph.katz_centrality(_G, alpha)
    t2 = perf_counter() - t1
    return t2

### Betweenness Centrality

In [7]:
def nx_bc(_df, _k):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _ = nx.betweenness_centrality(_G, k=_k)
    t2 = perf_counter() - t1
    return t2

def cu_bc(_df, _k):
    t1 = perf_counter()
    _G = create_cu_ugraph(_df)
    _ = cugraph.betweenness_centrality(_G, k=_k)
    t2 = perf_counter() - t1
    return t2

def cu_bc_nx(_df, _k):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _ = cugraph.betweenness_centrality(_G, k=_k)
    t2 = perf_counter() - t1
    return t2

### Louvain

In [8]:
def nx_louvain(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    parts = community.best_partition(_G)
    
    # Calculating modularity scores for comparison 
    _ = community.modularity(parts, _G)  
    
    t2 = perf_counter() - t1
    return t2

def cu_louvain(_df):
    t1 = perf_counter()
    _G = create_cu_ugraph(_df)
    _,_ = cugraph.louvain(_G)
    t2 = perf_counter() - t1
    return t2

def cu_louvain_nx(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _,_ = cugraph.louvain(_G)
    t2 = perf_counter() - t1
    return t2

### Triangle Counting

In [9]:
def nx_tc(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    nx_count = nx.triangles(_G)
    
    # To get the number of triangles, we would need to loop through the array and add up each count
    count = 0
    for key, value in nx_count.items():
        count = count + value    
    
    t2 = perf_counter() - t1
    return t2

def cu_tc(_df):
    t1 = perf_counter()
    _G = create_cu_ugraph(_df)
    _ = cugraph.triangle_count(_G)
    t2 = perf_counter() - t1
    return t2

def cu_tc_nx(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _ = cugraph.triangle_count(_G)
    t2 = perf_counter() - t1
    return t2

### WCC

In [10]:
def nx_wcc(_df):
    t1 = perf_counter()
    _G = create_nx_digraph(_df)
    gen = nx.weakly_connected_components(_G)

    list_of_digraphs = []

    for subgraph in gen:
        list_of_digraphs.append(nx.subgraph(_G, subgraph))
    
    t2 = perf_counter() - t1
    return t2

def cu_wcc(_df):
    t1 = perf_counter()
    _G = create_cu_digraph(_df)    
    _ = cugraph.weakly_connected_components(_G)
    t2 = perf_counter() - t1
    return t2

def cu_wcc_nx(_df):
    t1 = perf_counter()
    _G = create_nx_digraph(_df)    
    _ = cugraph.weakly_connected_components(_G)
    t2 = perf_counter() - t1
    return t2

### Core Number

In [11]:
def nx_core_num(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _G.remove_edges_from(nx.selfloop_edges(_G))
    nx_count = nx.core_number(_G)
    
    count = 0
    for key, value in nx_count.items():
        count = count + value
    
    t2 = perf_counter() - t1
    return t2

def cu_core_num(_df):
    t1 = perf_counter()
    _G = create_cu_ugraph(_df)
    _ = cugraph.core_number(_G)
    t2 = perf_counter() - t1
    return t2

def cu_core_num_nx(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _G.remove_edges_from(nx.selfloop_edges(_G))
    _ = cugraph.core_number(_G)
    t2 = perf_counter() - t1
    return t2

### PageRank

In [12]:
def nx_pagerank(_df):
    t1 = perf_counter()
    _G = create_nx_digraph(_df)
    _ = nx.pagerank(_G)
    t2 = perf_counter() - t1
    return t2

def cu_pagerank(_df):
    t1 = perf_counter()
    _G = create_cu_digraph(_df, transpose=True)
    _ = cugraph.pagerank(_G)
    t2 = perf_counter() - t1
    return t2

def cu_pagerank_nx(_df):
    t1 = perf_counter()
    _G = create_nx_digraph(_df)
    _ = cugraph.pagerank(_G)
    t2 = perf_counter() - t1
    return t2

### Jaccard

In [13]:
def nx_jaccard(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
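    # NOTE: nx.jaccard_coefficient returns a lazy iterator that is not consumed here,
    # so the coefficients themselves are not computed inside the timed region
    nj = nx.jaccard_coefficient(_G)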
    t2 = perf_counter() - t1
    return t2

def cu_jaccard(_df):
    t1 = perf_counter()
    _G = create_cu_ugraph(_df)
    _ = cugraph.jaccard_coefficient(_G)
    t2 = perf_counter() - t1
    return t2

def cu_jaccard_nx(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _ = cugraph.jaccard_coefficient(_G)
    t2 = perf_counter() - t1
    return t2

### BFS

In [14]:
def nx_bfs(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    nb = nx.bfs_edges(_G, 1) 
    nb_list = list(nb) # gen -> list
    t2 = perf_counter() - t1
    return t2

def cu_bfs(_df):
    t1 = perf_counter()
    _G = create_cu_ugraph(_df)
    _ = cugraph.bfs(_G, 1)
    t2 = perf_counter() - t1
    return t2

def cu_bfs_nx(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _ = cugraph.bfs(_G, 1)
    t2 = perf_counter() - t1
    return t2

### SSSP

In [15]:
def nx_sssp(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)
    _ = nx.shortest_path(_G, 1)
    t2 = perf_counter() - t1
    return t2

def cu_sssp(_df):
    t1 = perf_counter()
    _G = create_cu_ugraph(_df)    
    _ = cugraph.sssp(_G, 1)
    t2 = perf_counter() - t1
    return t2

def cu_sssp_nx(_df):
    t1 = perf_counter()
    _G = create_nx_ugraph(_df)    
    _ = cugraph.sssp(_G, 1)
    t2 = perf_counter() - t1
    return t2

---

# Benchmark

In [16]:
# number of datasets
num_datasets = len(data)

In [17]:
# arrays to capture performance gains
names = []
algos = []

# Two dimension data [file, perf]
time_algo_nx = []          # NetworkX
time_algo_cu = []          # cuGraph
time_algo_cx = []          # cuGraph called with a NetworkX graph
perf = []
perf_cu_nx = []

algos.append("   ")

i = 0
for k,v in data.items():
    time_algo_nx.append([])
    time_algo_cu.append([])
    time_algo_cx.append([])
    perf.append([])
    perf_cu_nx.append([])
    
    # Save the dataset name
    names.append(k)

    # generate data
    gdf = generate_data(v)
    pdf = gdf.to_pandas()
    print(f"\tdata in gdf {len(gdf)} and data in pandas {len(pdf)}")

    # prep
    tmp_g = create_cu_ugraph(gdf)
    deg = tmp_g.degree()
    deg_max = deg['degree'].max()

    alpha = 1 / deg_max
    num_nodes = tmp_g.number_of_vertices()
    
    del tmp_g
    del deg
    
    
    #----- Algorithm order is same as defined at top ----
    
    #-- Katz 
    print("\tKatz  ", end = '')
    if i == 0: 
        algos.append("Katz")

    print("n.", end='')
    tx = nx_katz(pdf, alpha)
    print("c.", end='')
    tc = cu_katz(gdf, alpha)
    print("cx.", end='')
    tcx = cu_katz_nx(pdf, alpha)
    print("")
    
    time_algo_nx[i].append(tx)
    time_algo_cu[i].append(tc)
    time_algo_cx[i].append(tcx)
    perf[i].append(tx/tc)
    perf_cu_nx[i].append(tx/tcx)
    gc.collect()
    
    
    #-- BC
    print("\tBC k=100  ", end='')
    if i == 0:
        algos.append("BC Estimate fixed")

    # use a separate name so the loop variable k is not overwritten
    bc_k = min(100, int(num_nodes))
    print("n.", end='')
    tx = nx_bc(pdf, bc_k)
    print("c.", end='')
    tc = cu_bc(gdf, bc_k)
    print("cx.", end='')
    tcx = cu_bc_nx(pdf, bc_k)
    print(" ")
    
    time_algo_nx[i].append(tx)
    time_algo_cu[i].append(tc)
    time_algo_cx[i].append(tcx)
    perf[i].append(tx/tc)
    perf_cu_nx[i].append(tx/tcx)
    gc.collect()
    

    #-- Louvain
    print("\tLouvain  ", end='')
    if i == 0:
        algos.append("Louvain")

    print("n.", end='')
    tx = nx_louvain(pdf)
    print("c.", end='')
    tc = cu_louvain(gdf)
    print("cx.", end='')
    tcx = cu_louvain_nx(pdf)
    print(" ")
    
    time_algo_nx[i].append(tx)
    time_algo_cu[i].append(tc)
    time_algo_cx[i].append(tcx)
    perf[i].append(tx/tc)
    perf_cu_nx[i].append(tx/tcx)
    gc.collect()
    
    #-- TC
    print("\tTC  ", end='')
    if i == 0:
        algos.append("TC")

    print("n.", end='')
    tx = nx_tc(pdf)
    print("c.", end='')
    tc = cu_tc(gdf)
    print("cx.", end='')
    tcx = cu_tc_nx(pdf)
    print(" ")
    
    time_algo_nx[i].append(tx)
    time_algo_cu[i].append(tc)
    time_algo_cx[i].append(tcx)
    perf[i].append(tx/tc)
    perf_cu_nx[i].append(tx/tcx)
    gc.collect()


    #-- WCC
    # print("\tWCC  ", end='')
    # if i == 0:
    #     algos.append("WCC")

    # print("n.", end='')
    # tx = nx_wcc(pdf)
    # print("c.", end='')
    # tc = cu_wcc(gdf)
    # print("cx.", end='')
    # tcx = cu_wcc_nx(pdf)
    # print(" ")

    # time_algo_nx[i].append(tx)
    # time_algo_cu[i].append(tc)
    # time_algo_cx[i].append(tcx)
    # perf[i].append(tx/tc)
    # perf_cu_nx[i].append(tx/tcx)
    # gc.collect()
    
    #-- Core Number
    print("\tCore Number  ", end='')
    if i == 0:
        algos.append("Core Number")

    print("n.", end='')
    tx = nx_core_num(pdf)
    print("c.", end='')
    tc = cu_core_num(gdf)
    print("cx.", end='')
    tcx = cu_core_num_nx(pdf)
    print(" ")
    
    time_algo_nx[i].append(tx)
    time_algo_cu[i].append(tc)
    time_algo_cx[i].append(tcx)
    perf[i].append(tx/tc)
    perf_cu_nx[i].append(tx/tcx)
    gc.collect()

    
    #-- PageRank
    # print("\tPageRank  ", end='')
    # if i == 0:
    #     algos.append("PageRank")

    # print("n.", end='')
    # tx = nx_pagerank(pdf)
    # print("c.", end='')
    # tc = cu_pagerank(gdf)
    # print("cx.", end='')
    # tcx = cu_pagerank_nx(pdf)
    # print(" ")

    # time_algo_nx[i].append(tx)
    # time_algo_cu[i].append(tc)
    # time_algo_cx[i].append(tcx)
    # perf[i].append(tx/tc)
    # perf_cu_nx[i].append(tx/tcx)
    # gc.collect()
    
    
    #-- Jaccard
    print("\tJaccard  ", end='')
    if i == 0:
        algos.append("Jaccard")

    print("n.", end='')
    tx = nx_jaccard(pdf)
    print("c.", end='')
    tc = cu_jaccard(gdf)
    print("cx.", end='')
    tcx = cu_jaccard_nx(pdf)
    print(" ")
    
    time_algo_nx[i].append(tx)
    time_algo_cu[i].append(tc)
    time_algo_cx[i].append(tcx)
    perf[i].append(tx/tc)
    perf_cu_nx[i].append(tx/tcx)
    gc.collect()
    

    #-- BFS
    print("\tBFS  ", end='')
    if i == 0:
        algos.append("BFS")

    print("n.", end='')
    tx = nx_bfs(pdf)
    print("c.", end='')
    tc = cu_bfs(gdf)
    print("cx.", end='')
    tcx = cu_bfs_nx(pdf)
    print(" ")

    time_algo_nx[i].append(tx)
    time_algo_cu[i].append(tc)
    time_algo_cx[i].append(tcx)
    perf[i].append(tx/tc)
    perf_cu_nx[i].append(tx/tcx)
    gc.collect()
    
    
    #-- SSSP
    print("\tSSSP  ", end='')
    if i == 0:
        algos.append("SSSP")

    print("n.", end='')
    tx = nx_sssp(pdf)
    print("c.", end='')
    tc = cu_sssp(gdf)
    print("cx.", end='')
    tcx = cu_sssp_nx(pdf)
    print(" ")

    time_algo_nx[i].append(tx)
    time_algo_cu[i].append(tc)
    time_algo_cx[i].append(tcx)
    perf[i].append(tx/tc)
    perf_cu_nx[i].append(tx/tcx)
    gc.collect()

    # increment count
    
    i = i + 1



Generating 1000...
	data in gdf 2000 and data in pandas 2000
	Katz  n.c.cx.
	BC k=100  n.c.cx. 
	Louvain  n.c.cx. 
	TC  n.c.cx. 
	Core Number  n.c.cx. 
	Jaccard  n.c.cx. 
	BFS  n.c.cx. 
	SSSP  n.c.cx. 
Generating 5000...
	data in gdf 10000 and data in pandas 10000
	Katz  n.c.cx.
	BC k=100  n.c.cx. 
	Louvain  n.c.cx. 
	TC  n.c.cx. 
	Core Number  n.c.cx. 
	Jaccard  n.c.cx. 
	BFS  n.c.cx. 
	SSSP  n.c.cx. 
Generating 25000...
	data in gdf 50000 and data in pandas 50000
	Katz  n.c.cx.
	BC k=100  n.c.cx. 
	Louvain  n.c.cx. 
	TC  n.c.cx. 
	Core Number  n.c.cx. 
	Jaccard  n.c.cx. 
	BFS  n.c.cx. 
	SSSP  n.c.cx. 
Generating 50000...
	data in gdf 100000 and data in pandas 100000
	Katz  n.c.cx.
	BC k=100  n.c.cx. 
	Louvain  n.c.cx. 
	TC  n.c.cx. 
	Core Number  n.c.cx. 
	Jaccard  n.c.cx. 
	BFS  n.c.cx. 
	SSSP  n.c.cx. 


In [18]:
#Print results
print(algos)

for i in range(num_datasets):
    print(f"{names[i]}")
    print(f"{perf[i]}")

['   ', 'Katz', 'BC Estimate fixed', 'Louvain', 'TC', 'Core Number', 'Jaccard', 'BFS', 'SSP']
data_1k
[3.3641247579799063, 2.697417788830093, 2.9955826896618056, 1.134040436280359, 0.3149530889954486, 0.29919658716215486, 0.24641096693412234, 0.2589937669321417]
data_5k
[14.876681978636219, 14.711910431523451, 8.84970941730169, 6.33344652973132, 1.1278123277453351, 1.3127515897570385, 1.1726614881506985, 1.2055487496277337]
data_25k
[35.767718140529475, 86.39224703465386, 30.02716935873087, 37.58636878210236, 3.6035226660009574, 4.744649894627159, 5.235785051343804, 6.279863775072972]
data_50k
[33.02547072424225, 189.30961857561275, 45.08972480181056, 63.30822556055329, 6.792224819759355, 6.831085248900713, 14.785968116988402, 11.242773214445538]
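
Each number printed above is a ratio `tx/tc`, that is, NetworkX time divided by cuGraph time, so values above 1 mean cuGraph was faster and values below 1 mean NetworkX was faster. The short sketch below labels the `data_1k` row that way. The helper name `label_speedups` is hypothetical, and the speedups are the printed values rounded to two decimals.

```python
def label_speedups(names, speedups):
    """Map each algorithm to which library was faster, given NetworkX/cuGraph ratios."""
    return {
        name: "cuGraph faster" if ratio > 1.0 else "NetworkX faster"
        for name, ratio in zip(names, speedups)
    }


names = ["Katz", "BC", "Louvain", "TC", "Core Number", "Jaccard", "BFS", "SSSP"]
data_1k = [3.36, 2.70, 3.00, 1.13, 0.31, 0.30, 0.25, 0.26]  # rounded from the output above

for name, verdict in label_speedups(names, data_1k).items():
    print(f"{name:12s} {verdict}")
```

On the smallest dataset the last four algorithms favor NetworkX, while every ratio in the `data_5k` row and larger is above 1.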


___
Copyright (c) 2020-2023, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___