PageRank Demo
=======================

This example runs PageRank on HiBench data using RAPIDS. Before running this notebook, start a Dask scheduler and Dask workers from the command line:

* dask-scheduler --scheduler-file path_to_scheduler_info_file_path # path_to_scheduler_info_file_path: path of the JSON file to which the scheduler writes its access information (e.g. ~/cluster.json to write cluster.json under your home directory)

* mpirun -np number_of_workers --machinefile path_to_node_address_file dask-mpi --no-nanny --nthreads number_of_threads --no-scheduler --scheduler-file path_to_scheduler_info_file_path # number_of_workers, path_to_node_address_file, and number_of_threads must be replaced with values that fit your cluster; path_to_scheduler_info_file_path must match the path given to dask-scheduler
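
Before launching the workers, it can help to confirm that the scheduler file was actually written. The sketch below assumes the file is JSON with an `address` field, which is what Dask schedulers commonly write; the helper name `read_scheduler_address` and the stand-in file contents are illustrative and not part of the original notebook.

```python
import json
import os
import tempfile


def read_scheduler_address(path):
    """Return the scheduler address stored in a Dask scheduler file."""
    with open(os.path.expanduser(path)) as f:
        info = json.load(f)
    # Assumption: the scheduler file stores its address under "address".
    return info["address"]


# Demonstration with a stand-in file; a real one is written by dask-scheduler.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "cluster.json")
    with open(sample_path, "w") as f:
        json.dump({"address": "tcp://10.28.133.204:8786"}, f)
    address = read_scheduler_address(sample_path)
    print(address)
```

If the file is missing or has no `address` entry, the helper raises, which is usually a sign that the scheduler has not finished starting.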

1) Import Modules
=======================


In [26]:
import os
import time
import numpy as np
from mpi4py import MPI
from dask.distributed import Client, wait
import dask
import dask_cudf

2) Define Utility Functions
=======================
TODO: update call_cudaSetDevice so that it actually selects a GPU (see the comments inside the function).

In [27]:
def parse_host_port(address):
    if '://' in address:
        address = address.rsplit('://', 1)[1]
    host, port = address.split(':')
    port = int(port)
    return host, port
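
As a quick check of what `parse_host_port` does, the sketch below repeats the function and applies it to two address strings. The first address is taken from the worker list shown later in this notebook; the second is an illustrative address without a scheme.

```python
def parse_host_port(address):
    """Split 'scheme://host:port' (scheme optional) into (host, int(port))."""
    if '://' in address:
        address = address.rsplit('://', 1)[1]
    host, port = address.split(':')
    return host, int(port)


parsed = parse_host_port('tcp://10.28.133.204:34157')
print(parsed)  # ('10.28.133.204', 34157)

bare = parse_host_port('10.28.133.204:8786')
print(bare)  # ('10.28.133.204', 8786)
```

Because the function splits on every `:`, it assumes a host without colons; an IPv6 address would need different handling.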

In [28]:
def call_cudaSetDevice(l):
    # l is a placeholder argument passed by client.submit; it is not used.
    try:
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        # TODO: call cudaSetDevice(rank % number_of_GPUs_per_node) here
        return rank  # TODO: return rank % number_of_GPUs_per_node instead
    except Exception as e:
        # Report the failure; the function then returns None.
        print(str(e))
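
The TODO above describes mapping a global MPI rank to a GPU on its node with `rank % number_of_GPUs_per_node`. A minimal sketch of just that arithmetic follows. The helper name `local_device_id` and the figure of 4 GPUs per node are assumptions for illustration, and the actual device selection (for example through Numba's `cuda.select_device`) is left out so that the snippet runs without a GPU.

```python
def local_device_id(rank, gpus_per_node):
    """Map a global MPI rank to a GPU index on its node (round-robin)."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return rank % gpus_per_node


# Hypothetical layout: 8 ranks spread over nodes with 4 GPUs each.
GPUS_PER_NODE = 4
assignments = [local_device_id(rank, GPUS_PER_NODE) for rank in range(8)]
print(assignments)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

This only yields one rank per GPU if the launcher places consecutive ranks on the same node, which depends on the machinefile and the MPI implementation.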

3) Create a Client
=======================
Connect to the Dask scheduler. Replace /home/seunghwak/cluster.json with the path to your own scheduler file.

In [29]:
start_time = time.time() # start timing from here

client = Client(scheduler_file="/home/seunghwak/cluster.json", direct_to_workers=True)
client

Client:  Scheduler: tcp://10.28.133.204:8786  Dashboard: http://10.28.133.204:8787/status
Cluster:  Workers: 4  Cores: 8  Memory: 32.85 GB


In [30]:
workers = [parse_host_port(worker) for worker in list(client.has_what().keys())]
workers

[('10.28.133.204', 34157),
 ('10.28.133.204', 34929),
 ('10.28.133.204', 40295),
 ('10.28.133.204', 43563)]

4) Map One Worker to One GPU
=======================

In [31]:
pmpi_map = [(client.submit(call_cudaSetDevice, ident, workers=[worker]), worker)
            for ident, worker in enumerate(workers)]
pmpi_map

[(<Future: status: pending, key: call_cudaSetDevice-de12bea20c5754ff703af5262c8b2021>,
  ('10.28.133.204', 34157)),
 (<Future: status: pending, key: call_cudaSetDevice-daf75b85f8a66d00168c5eb8097a0121>,
  ('10.28.133.204', 34929)),
 (<Future: status: pending, key: call_cudaSetDevice-09ca80f848b1dcb508ba7d85b020480d>,
  ('10.28.133.204', 40295)),
 (<Future: status: pending, key: call_cudaSetDevice-9af82fdbe75011af742e580a45b2b230>,
  ('10.28.133.204', 43563))]

In [32]:
device_nums = {r.result(): worker for r, worker in pmpi_map}
device_nums

{3: ('10.28.133.204', 34157),
 2: ('10.28.133.204', 34929),
 1: ('10.28.133.204', 40295),
 0: ('10.28.133.204', 43563)}
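
Building a dictionary keyed by the returned value silently drops entries if two workers report the same number. The sketch below, which reuses the values from the output above as sample data, adds a duplicate check before building the map. The helper name `build_device_map` is hypothetical.

```python
from collections import Counter

# Sample (rank, worker) pairs mirroring the output above.
results = [
    (3, ('10.28.133.204', 34157)),
    (2, ('10.28.133.204', 34929)),
    (1, ('10.28.133.204', 40295)),
    (0, ('10.28.133.204', 43563)),
]


def build_device_map(pairs):
    """Return {rank: worker}, refusing input in which a rank appears twice."""
    counts = Counter(rank for rank, _ in pairs)
    duplicates = sorted(r for r, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"ranks reported by more than one worker: {duplicates}")
    return dict(pairs)


device_nums = build_device_map(results)
print(device_nums[0])  # ('10.28.133.204', 43563)
```

Once the function returns a per-node device index rather than the global rank, workers on different nodes will legitimately share an index, so a multi-node version would probably need the host as part of the key.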

5) Read Input Data
=======================
Replace /home/seunghwak/TMP/Input-small/edges with the path to your own input directory.

In [33]:
cdf = dask_cudf.read_csv("/home/seunghwak/TMP/Input-small/edges/part-*", delimiter='\t',  names=['src', 'dst'], 
                         dtype=['int32', 'int32'])
cdf = client.persist(cdf)
cdf

                  src    dst
npartitions=32
                int32  int32
                  ...    ...
...               ...    ...
                  ...    ...
                  ...    ...


In [34]:
cdf = cdf.sort_values_binned(by='dst') # data needs to be sorted based on destination index
cdf = client.persist(cdf)
cdf

                  src    dst
npartitions=32
                int32  int32
                  ...    ...
...               ...    ...
                  ...    ...
                  ...    ...
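
The notebook does not say why the edges must be sorted by destination. One common reason is that a destination-sorted edge list converts directly into an offsets array in which each vertex owns a contiguous slice of its incoming edges. The single-process NumPy sketch below illustrates that idea on a toy edge list; it is an assumption about the motivation and does not show what `sort_values_binned` does internally.

```python
import numpy as np

# Toy edge list (src -> dst); the values are illustrative only.
src = np.array([0, 2, 1, 0, 3], dtype=np.int32)
dst = np.array([1, 0, 2, 2, 0], dtype=np.int32)
num_vertices = 4

# Sort edges by destination, keeping the original order within each destination.
order = np.argsort(dst, kind="stable")
src_sorted, dst_sorted = src[order], dst[order]

# offsets[v]:offsets[v + 1] is the slice of edges that point at vertex v.
counts = np.bincount(dst_sorted, minlength=num_vertices)
offsets = np.concatenate(([0], np.cumsum(counts)))

print(dst_sorted.tolist())  # [0, 0, 1, 2, 2]
print(src_sorted.tolist())  # [2, 3, 0, 1, 0]
print(offsets.tolist())     # [0, 2, 3, 5, 5]
```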


6) Run PageRank
=======================

In [35]:
# TODO: call PageRank, passing the sorted cdf
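
The PageRank call itself is still a TODO in this notebook. To make clear what that step is expected to compute, here is a single-process NumPy power-iteration sketch on a toy graph. It is a stand-in for illustration and not the RAPIDS multi-GPU routine; the function name `pagerank_reference`, the damping factor of 0.85, and the convergence tolerance are all assumptions.

```python
import numpy as np


def pagerank_reference(src, dst, num_vertices, damping=0.85, tol=1e-10, max_iter=100):
    """Single-process power-iteration PageRank over a directed edge list."""
    src = np.asarray(src)
    dst = np.asarray(dst)
    out_degree = np.bincount(src, minlength=num_vertices).astype(np.float64)
    dangling = out_degree == 0
    rank = np.full(num_vertices, 1.0 / num_vertices)
    for _ in range(max_iter):
        # Each edge carries rank[src] / out_degree[src] to its destination.
        contrib = rank[src] / out_degree[src]
        incoming = np.bincount(dst, weights=contrib, minlength=num_vertices)
        # Rank held by vertices without out-edges is spread uniformly.
        dangling_mass = rank[dangling].sum() / num_vertices
        new_rank = (1.0 - damping) / num_vertices + damping * (incoming + dangling_mass)
        if np.abs(new_rank - rank).sum() < tol:
            rank = new_rank
            break
        rank = new_rank
    return rank


# Toy graph: a directed 3-cycle 0 -> 1 -> 2 -> 0, so all ranks should be equal.
ranks = pagerank_reference([0, 1, 2], [1, 2, 0], num_vertices=3)
print(ranks)
```

A reference like this is mainly useful for checking the distributed result on a small input, since it holds the whole edge list in one process.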

7) Report Execution Time
=======================

In [36]:
end_time = time.time()
print(end_time - start_time, "seconds")

9.966898441314697 seconds
