# PageRank with the cugraph nx compatibility layer

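In this notebook, we compare three ways of calling the PageRank algorithm: with the full cugraph stack, by building the graph with NetworkX and then calling the cugraph algorithm, and through the nx compatibility layer, which allows the cugraph algorithm to be called without any change other than replacing the package.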

| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware |
| --------------|------------|------------------|-----------------|----------------|
| Don Acosta    | 03/31/2022 | created          | 22.06           | V100 w 32 GB, CUDA 11.5 |




## Introduction
PageRank is a measure of the relative importance, also called centrality, of a vertex based on the relative importance of its neighbors. PageRank was developed by Google and is (was) used to rank its search results. PageRank uses the connectivity information of a graph to rank the importance of each vertex.

See [Wikipedia](https://en.wikipedia.org/wiki/PageRank) for more details on the algorithm.

An overview of generating PageRank scores for a graph:<br>

**Pagerank(G,alpha=0.85, max_iter=100, tol=1.0e-5)**
* __G__: Graph object
* __alpha__: float, the damping factor, which represents the probability of following an outgoing edge. The default is 0.85.
* __max_iter__: int, the maximum number of iterations before an answer is returned. This can be used to limit the execution time and exit early, before the solver reaches the convergence tolerance. If this value is less than or equal to 0, cuGraph will use the default value, which is 100.
* __tol__: float, the convergence tolerance of the approximation. This parameter should be a small value; the lower the tolerance, the better the approximation. If this value is 0.0, cuGraph will use the default value, which is 0.00001. Setting too small a tolerance can lead to non-convergence due to numerical roundoff. Values between 0.01 and 0.00001 are usually acceptable.

Returns:
* __rankings__: An object with two columns:
    * vertex_id: The vertex identifier for the vertex
    * pagerank: The pagerank score for the vertex
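
To make the roles of `alpha`, `max_iter`, and `tol` concrete, here is a minimal pure-Python sketch of PageRank by power iteration. It is an illustration only, not the cuGraph implementation: the function name `pagerank_sketch`, the edge-list input, and the L1 convergence test are assumptions made for this example.

```python
def pagerank_sketch(edges, alpha=0.85, max_iter=100, tol=1.0e-5):
    """Power-iteration PageRank on a list of directed (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    # Start from a uniform distribution over the vertices.
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # Rank held by vertices with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in vertices if not out_neighbors[v])
        new_rank = {v: (1.0 - alpha) / n + alpha * dangling / n for v in vertices}
        # Each vertex passes alpha * rank, split equally, along its outgoing edges.
        for v in vertices:
            for dst in out_neighbors[v]:
                new_rank[dst] += alpha * rank[v] / len(out_neighbors[v])
        # Stop early once the total change drops below the tolerance.
        # (This L1 test is an assumption; libraries may test convergence differently.)
        delta = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: a 3-cycle (0 -> 1 -> 2 -> 0) plus vertex 3 pointing at 0.
scores = pagerank_sketch([(0, 1), (1, 2), (2, 0), (3, 0)])
```

The returned dictionary maps each vertex to its score, which mirrors the `vertex_id` and `pagerank` columns described above. A smaller `tol` lets the loop run for more iterations before stopping, and `max_iter` caps the loop whether or not the tolerance is reached.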

### Test Data
We will use several data sets to show the algorithm running on graphs of different sizes.
* Zachary Karate club dataset
* preferentialAttachment
* caidaRouterLevel
* coAuthorsDBLP

In [12]:
# representative data sets to run
data_files = {
'datafile_small'         : '../data/karate.mtx',
'preferentialAttachment' : '../data/preferentialAttachment.mtx',
'caidaRouterLevel'       : '../data/caidaRouterLevel.mtx',
'coAuthorsDBLP'          : '../data/coAuthorsDBLP.mtx'
}

### There will be three methods of building graphs
* Loading one link at a time from an input set using networkX
* Loading an entire file with networkX
* Building the graph with cugraph

In [13]:
from scipy.io import mmread
import pandas as pd
import time

In [14]:
# Data reader - the file format is MTX, so we will use the reader from SciPy
def read_mtx_file(mm_file):
    M = mmread(mm_file).asfptype()
    return M

Many applications need to load data one edge at a time.

In [15]:
def load_by_edge(data_file, nx_impl):
    G = nx_impl.Graph()
    if (data_file.endswith('.mtx')):
        M = read_mtx_file(data_file)
        for u,v  in zip(*M.nonzero()):
            G.add_edge(u,v)
    else:
        raise TypeError('Unsupported file type. Only mtx files currently supported.')      
    return G

This is the cugraph call, which doesn't use NetworkX but emulates many of the NetworkX APIs for convenience.
We return only the time spent in the actual PageRank call to make a better comparison with NetworkX and
the nx compatibility layer.

In [16]:
def cugraph_call(M, max_iter, tol, alpha):

    import cugraph
    import cudf
    from cugraph.utilities import df_score_to_dictionary

    # Put the COO edge list (row = source, col = destination) in a cudf DataFrame
    gdf = cudf.DataFrame()
    gdf['src'] = M.row
    gdf['dst'] = M.col

    # Build the cugraph graph; this step is not timed
    G = cugraph.DiGraph()
    G.from_cudf_edgelist(gdf, source='src', destination='dst', renumber=False)

    # Time only the PageRank call
    t1 = time.time()
    df = cugraph.pagerank(G, alpha=alpha, max_iter=max_iter, tol=tol)
    t2 = time.time() - t1

    return df_score_to_dictionary(df,'pagerank'),t2


This outputs the run times and each package's improvement over the slowest time, so the slowest package shows an improvement of 1.

In [17]:
def print_times(elapsed_dict):
    max_value = max(elapsed_dict.values())
    # Build all rows first; DataFrame.append was removed in pandas 2.0
    rows = [
        {'Package': pack_name, 'Time': key_value, 'Improvement': max_value / key_value}
        for pack_name, key_value in elapsed_dict.items()
    ]
    df = pd.DataFrame(rows, columns=['Package', 'Time', 'Improvement'])
    print(df)
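
The improvement figure is the slowest time divided by each package's time. As a small worked example, the sketch below applies that ratio to the elapsed times reported for the preferentialAttachment run in this notebook. The variable names are illustrative, and because the inputs are the rounded values from the printed table, the ratios can differ slightly from the printed ones.

```python
# Elapsed seconds copied from the preferentialAttachment output in this notebook.
elapsed = {
    "networkx": 1.782115,
    "cugraph.experimental.compat.nx": 0.748324,
    "cugraph": 0.010036,
}

# The slowest package is the baseline, so its improvement is exactly 1.
slowest = max(elapsed.values())
improvement = {name: slowest / seconds for name, seconds in elapsed.items()}

for name, factor in improvement.items():
    print(f"{name:32s} {factor:10.2f}x")
```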

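This method checks the results of all three runs against the cugraph result. Note that `dict.keys()` views compare like sets, so the comparison confirms that every run scored the same set of vertices, not that the vertices come out in the same rank order.

In [18]:
def validate_results(results):
    pr = results['cugraph']
    # Vertices ordered by descending PageRank score for the reference run
    expected_rankings = dict(sorted(pr.items(),key=lambda x:x[1], reverse=True)).keys()
    for key in results.keys():
        actual_rankings = dict(sorted(results[key].items(),key=lambda x:x[1], reverse=True)).keys()
        # Key views compare as sets, so ordering is ignored here
        if actual_rankings != expected_rankings:
            return False
    return True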
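
An order-sensitive variant can be sketched as follows. It converts each result to a list before comparing, because `dict.keys()` views compare like sets and ignore order. The helper names `rank_order` and `same_rank_order`, the `top_k` parameter, and the toy scores are hypothetical and not part of the original notebook.

```python
def rank_order(scores):
    """Return vertex ids sorted by descending score (ties broken by vertex id)."""
    return [v for v, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


def same_rank_order(results, reference="cugraph", top_k=None):
    """Compare the leading top_k vertices of every run with the reference run."""
    expected = rank_order(results[reference])[:top_k]
    return all(rank_order(scores)[:top_k] == expected for scores in results.values())


# Hypothetical toy scores: both runs cover the same vertices but order them differently.
toy = {
    "cugraph": {0: 0.5, 1: 0.3, 2: 0.2},
    "networkx": {0: 0.5, 1: 0.2, 2: 0.3},
}
keys_match = toy["cugraph"].keys() == toy["networkx"].keys()  # set-like comparison
order_match = same_rank_order(toy)                            # full ordering
top1_match = same_rank_order(toy, top_k=1)                    # highest-ranked vertex only
```

Requiring an identical full ordering may be too strict when scores come from different floating-point implementations, since near-ties can swap places. Comparing only the top few vertices, as `top_k` allows, is one possible compromise.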

## Arguments used to run Pagerank

In [19]:
MAX_ITERATIONS = 150
TOLERANCE = 1.0e-04
ALPHA = 0.85

These are the packages we will run 

In [20]:
import networkx as nx
import cugraph.experimental.compat.nx as nxcompat
packages = [nx, nxcompat]

This is how PageRank is run on each dataset:
* Using NetworkX
* Using the compatibility layer, where the actual algorithm call is done using cugraph without any change other than replacing the package
* Using a full cugraph implementation for graph building and the algorithm

In [21]:
def run_dataset(datafile):
    elapsed_time = dict()
    results = dict()
    for package in packages:
        package_name = str(package).split()[1]
        G = load_by_edge(datafile,package)
        start_time = time.time()
        results[package_name] = package.pagerank(G, alpha=ALPHA, max_iter=MAX_ITERATIONS, tol=TOLERANCE)
        elapsed_time[package_name] = time.time() - start_time
    M = read_mtx_file(datafile)
    results['cugraph'],elapsed_time['cugraph'] = cugraph_call(M,MAX_ITERATIONS, TOLERANCE, ALPHA)
    return results, elapsed_time
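
Each package above is timed with a single `time.time()` measurement, so one-time costs may be included in the reported figure. One possible refinement, shown here only as a sketch, is to repeat the call and keep the fastest wall-clock time measured with `time.perf_counter()`. The helper `time_call` and its `repeats` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call func several times; return its last result and the fastest wall time."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload; in the notebook this could wrap a pagerank call instead.
total, seconds = time_call(sum, range(1_000_000))
```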

## Process each data set
* Run Pagerank with each method (networkx, nx compatibility and cugraph)
* Display the run times for each
* Validate the results

In [22]:
datalist = data_files
for datafile in datalist.keys():
    print(f'Results for datafile: {datafile}')
    results, elapsed_time = run_dataset(datalist.get(datafile))
    print_times(elapsed_time)
    if validate_results(results):
        print("Results are consistent")
    else:
        print("Results are inconsistent")


Results for datafile: datafile_small
                            Package      Time  Improvement
0                        'networkx'  0.002342     9.551155
1  'cugraph.experimental.compat.nx'  0.022369     1.000000
2                           cugraph  0.005085     4.399165
Results are consistent
Results for datafile: preferentialAttachment
                            Package      Time  Improvement
0                        'networkx'  1.782115     1.000000
1  'cugraph.experimental.compat.nx'  0.748324     2.381476
2                           cugraph  0.010036   177.563925
Results are consistent
Results for datafile: caidaRouterLevel
                            Package      Time  Improvement
0                        'networkx'  2.250018     1.000000
1  'cugraph.experimental.compat.nx'  0.882459     2.549714
2                           cugraph  0.014078   159.820810
Results are consistent
Results for datafile: coAuthorsDBLP
                            Package      Time  Improvement
0                        'networkx'  3.803551     1.000000
1  'cugraph.experimental.compat.nx'  1.308596     2.906589
2                           cugraph  0.013761   276.404664
Results are consistent


___
Copyright (c) 2022, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___