Multi-GPU PageRank Demo
=======================

This example runs PageRank on HiBench data using RAPIDS. Before running this notebook, start a Dask scheduler and Dask workers from the command line:

* dask-scheduler --scheduler-file path_to_scheduler_file # path_to_scheduler_file: path where the scheduler writes its access information in JSON format (e.g. ~/cluster.json)

* mpirun -np number_of_workers --machinefile path_to_machine_address_file dask-mpi --no-nanny --nthreads number_of_threads --no-scheduler --scheduler-file path_to_scheduler_info_file # Set number_of_workers, path_to_machine_address_file, number_of_threads, and path_to_scheduler_info_file to proper values. number_of_workers must equal the number of GPU devices used in this notebook (one worker per GPU). path_to_scheduler_info_file must match the path given when launching dask-scheduler. --machinefile path_to_machine_address_file can be omitted when running on a single node.
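
Before opening the notebook, it can help to confirm that the scheduler file was actually written. The following is a minimal sketch, assuming the scheduler file is a JSON object that contains an "address" entry; the helper name read_scheduler_address is hypothetical and not part of Dask.

```python
import json
import os


def read_scheduler_address(scheduler_file):
    """Return the scheduler address stored in a Dask scheduler file."""
    # Expand "~" so that paths such as ~/cluster.json work.
    path = os.path.expanduser(scheduler_file)
    with open(path) as f:
        info = json.load(f)
    # Assumption: the file holds a JSON object with an "address" entry.
    return info["address"]
```

If the call raises FileNotFoundError, the scheduler has most likely not started yet or was given a different --scheduler-file path.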

1) Import Modules
=======================


In [1]:
import os
import time

from dask.distributed import Client

import dask_cudf
import dask_cugraph as dcg

2) Set the Number of GPU Devices and File Paths
=======================

In [10]:
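number_of_devices = 2  # must equal the number of Dask workers started above
scheduler_file_path = r"/home/seunghwak/cluster.json"  # file written by dask-scheduler
input_data_path = r"/home/seunghwak/TMP/Input-small/edges"  # directory with the part-* edge files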

2) Define Utility Functions
=======================
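set_visible rotates CUDA_VISIBLE_DEVICES so that GPU i comes first in the list, which makes it the default device for the calling dask-mpi worker process. The other GPUs stay visible.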

In [3]:
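def set_visible(i, n):
    """Make GPU i the first (default) visible device out of n GPUs."""
    all_devices = list(range(n))
    # Rotate the device list so that it starts at i, e.g. i=1, n=4 -> "1,2,3,0".
    visible_devices = ",".join(map(str, all_devices[i:] + all_devices[:i]))
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices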
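
To see what set_visible assigns to each worker, the sketch below reproduces only its rotation step without touching the environment. The helper name visible_device_order is hypothetical and used here for illustration only.

```python
def visible_device_order(i, n):
    """Return the CUDA_VISIBLE_DEVICES string that set_visible(i, n) would set."""
    all_devices = list(range(n))
    # Same rotation as in set_visible: start at device i and wrap around.
    return ",".join(map(str, all_devices[i:] + all_devices[:i]))


# Show the device order for each of four workers.
for i in range(4):
    print(i, visible_device_order(i, 4))
# 0 0,1,2,3
# 1 1,2,3,0
# 2 2,3,0,1
# 3 3,0,1,2
```

Each worker sees all GPUs, but a different GPU appears first in its list.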

3) Create a Client
=======================
Connect to the Dask scheduler. scheduler_file_path (set in section 2) must point to the scheduler file written by dask-scheduler, e.g. /home/USERID/cluster.json.

In [7]:
start_time = time.time()  # start timing from here

client = Client(scheduler_file=scheduler_file_path,
                direct_to_workers=True)

4) Map One Worker to One GPU
=======================

In [11]:
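devices = list(range(number_of_devices))
device_workers = list(client.has_what().keys())  # one address per connected worker
assert len(devices) == len(device_workers), "need exactly one worker per GPU"

# Run set_visible on each worker so that worker k defaults to GPU k.
[client.submit(set_visible, device, len(devices), workers=[worker])
    for device, worker in zip(devices, device_workers)]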

[<Future: status: pending, key: set_visible-4f3af2f890d6a36d7e8217c80e27002c>,
 <Future: status: pending, key: set_visible-734ca7f9b333fe2c93c28d89ff836b50>]

5) Read Input Data
=======================
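input_data_path (set in section 2) must point to the directory that holds the edge list files (part-*), e.g. /datasets/pagerank_demo/Input-bigdata/edges. Each line holds one edge as two tab-separated integers: source vertex, then destination vertex.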

In [12]:
dgdf = dask_cudf.read_csv(input_data_path + r"/part-*",
                          delimiter='\t', names=['src', 'dst'],
                          dtype=['int32', 'int32'])
dgdf = client.persist(dgdf)
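
The read_csv call above expects tab-separated lines with two integer columns (src, dst). The following CPU-only sketch, using just the Python standard library, shows the same parsing on a tiny in-memory sample. The helper read_edges and the sample data are hypothetical and not part of the demo.

```python
import csv
import io


def read_edges(lines):
    """Parse tab-separated 'src<TAB>dst' lines into a list of integer pairs."""
    reader = csv.reader(lines, delimiter="\t")
    # Skip empty lines; every other line must hold exactly two integers.
    return [(int(row[0]), int(row[1])) for row in reader if row]


# Hypothetical sample in the same layout as the part-* files.
sample = io.StringIO("1\t2\n2\t3\n3\t1\n")
edges = read_edges(sample)
print(edges)  # [(1, 2), (2, 3), (3, 1)]
```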

6) Sort Input Data
=======================

In [13]:
dgdf = dgdf.sort_values_binned(by='dst')
dgdf = client.persist(dgdf)
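
The idea behind this step can be illustrated on the CPU: order the edges by destination, then cut the ordered list into contiguous chunks, one per partition. This is only a sketch of the concept under that assumption; it is not the algorithm dask_cudf uses for sort_values_binned, and the helper name sort_and_bin is hypothetical.

```python
def sort_and_bin(edges, n_bins):
    """Sort (src, dst) pairs by dst and split them into contiguous chunks."""
    ordered = sorted(edges, key=lambda edge: edge[1])
    # Ceiling division gives the chunk size; keep it at least 1.
    size = max(1, -(-len(ordered) // n_bins))
    # Note: edges with the same dst may still end up in two adjacent chunks.
    return [ordered[k:k + size] for k in range(0, len(ordered), size)]


bins = sort_and_bin([(1, 2), (2, 3), (3, 1), (1, 3)], 2)
print(bins)  # [[(3, 1), (1, 2)], [(2, 3), (1, 3)]]
```

After this step, each chunk covers a contiguous range of destination vertices.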

7) Run PageRank
=======================

In [None]:
pagerank = dcg.mg_pagerank(dgdf)
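
For sanity checks on small inputs, a plain-Python reference can be handy. The sketch below is a basic power-iteration PageRank on a list of (src, dst) pairs. The damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from vertices without outgoing edges are assumptions of this sketch; the defaults used by dcg.mg_pagerank are not stated here and may differ.

```python
def pagerank_reference(edges, damping=0.85, iterations=50):
    """Compute PageRank by power iteration on a list of (src, dst) pairs."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {v: 1.0 / n for v in vertices}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in vertices if out_degree[v] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in vertices}
        # Each vertex passes its rank in equal shares along its outgoing edges.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# A three-vertex cycle: every vertex should get the same rank.
cycle_ranks = pagerank_reference([(1, 2), (2, 3), (3, 1)])
print(cycle_ranks)
```

The ranks sum to 1, so results from different graph sizes can be compared as probabilities.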

8) Close the Client and Report Execution Time
=======================

In [None]:
client.close()

end_time = time.time()
print((end_time - start_time), "seconds")