# Multi-GPU Jaccard/Sorensen and Overlap

This notebook loads data into a dask_cudf DataFrame and uses it to run Jaccard, Sorensen, and Overlap similarity on multiple GPUs.



| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware        |
|---------------|------------|------------------|-----------------|-----------------------|
| Don Acosta    | 04/21/2023 | created          | 23.06 nightly   |  2xA6000 CUDA 11.7    |


cuGraph's multi-GPU features leverage Dask. RAPIDS has other projects based on Dask, such as dask-cudf and dask-cuda, and this example uses them as well. Check out [RAPIDS.ai](https://rapids.ai/) to learn more about these technologies.

### Multi-GPU Algorithms
### Basic setup

In [1]:
# Import needed libraries. We recommend using the [cugraph_dev](https://github.com/rapidsai/cugraph/tree/branch-23.02/conda/environments) env through conda
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from cugraph.dask.comms import comms as Comms
import cugraph.dask as dask_cugraph
import cugraph
import dask_cudf
import time
import urllib.request
import os

This code downloads the data file from the RAPIDS S3 bucket and decompresses it. This step will not be necessary once the Datasets API supports decompression and direct loading into a Dask edge list.

In [2]:
def get_data_file():

    data_dir = '../data/'
    if not os.path.exists(data_dir):
        print('creating data directory')
        # Create the directory from Python instead of shelling out to mkdir
        os.makedirs(data_dir, exist_ok=True)

    # download the Hollywood dataset
    base_url = 'https://data.rapids.ai/cugraph/benchmark/'
    fn = 'hollywood.csv'
    comp = '.gz'

    if not os.path.isfile(data_dir+fn):
        if not os.path.isfile(data_dir+fn+comp):
            print(f'Downloading {base_url+fn+comp} to {data_dir+fn+comp}')
            urllib.request.urlretrieve(base_url+fn+comp, data_dir+fn+comp)
        print(f'Decompressing {data_dir+fn+comp}...')
        os.system('gunzip '+data_dir+fn+comp)
        print(f'{data_dir+fn+comp} decompressed!')
    else:
        print(f'Your data file, {data_dir+fn}, already exists')

    # File path, assuming Notebook directory
    return data_dir + fn
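
The helper above shells out to `gunzip`, which assumes a Unix-like system with that tool installed. As a hedged alternative, here is a minimal sketch that decompresses with Python's standard library only. The function name `gunzip_file` and its `remove_source` parameter are hypothetical and are not part of this notebook or of cuGraph.

```python
import gzip
import os
import shutil


def gunzip_file(gz_path, remove_source=True):
    """Decompress a .gz file next to itself and return the decompressed path."""
    if not gz_path.endswith('.gz'):
        raise ValueError(f'expected a .gz file, got {gz_path}')
    out_path = gz_path[:-len('.gz')]
    # Stream the data so a large file never has to fit in memory at once
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    if remove_source:
        # Mirror gunzip's default behavior of deleting the compressed file
        os.remove(gz_path)
    return out_path
```

Under these assumptions, the `os.system('gunzip ...')` call could be swapped for `gunzip_file(data_dir+fn+comp)`.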

### Initialize multi-GPU environment
Before we get started, we need to set up a local Dask cluster of workers to execute our work and a client to coordinate and schedule work for that cluster. As we see below, we can initialize a cluster and client using only a few lines of code.

The enable_spilling function turns on cuDF's spill option, which allows the stored graph to spill to host memory if necessary.

In [3]:
def enable_spilling():
    import cudf
    cudf.set_option("spill", True)

In [4]:
enable_spilling()
cluster = LocalCUDACluster()
client = Client(cluster)
client.run(enable_spilling)
Comms.initialize(p2p=True)

2023-04-21 10:31:30,467 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-a5pjl9mq', purging
2023-04-21 10:31:30,467 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-vwhtjr3e', purging
2023-04-21 10:31:30,468 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-04-21 10:31:30,468 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-04-21 10:31:30,487 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-04-21 10:31:30,487 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize


### Read the data from disk
cuGraph depends on cuDF for data loading and the initial DataFrame creation. The CSV data file contains an edge list, in which each row represents a connection from one vertex to another. This list of source-destination pairs is known as Coordinate Format (COO). In this test case, the data has just two columns, the source and the destination.
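
To make the COO layout concrete, the following CPU-only sketch parses a tiny space-delimited edge list into parallel source and destination lists. The sample edges and the function name `read_coo` are made up for illustration; the notebook itself reads the real file with dask_cudf.

```python
import io

# Hypothetical sample in the same layout as the data file: "src dst" per line
SAMPLE = "0 1\n0 2\n1 2\n2 3\n"


def read_coo(text_stream):
    """Return parallel (src, dst) lists, the COO representation of the edges."""
    src, dst = [], []
    for line in text_stream:
        if not line.strip():
            continue  # skip blank lines
        s, d = line.split()
        src.append(int(s))
        dst.append(int(d))
    return src, dst


src, dst = read_coo(io.StringIO(SAMPLE))
print(src)  # [0, 0, 1, 2]
print(dst)  # [1, 2, 2, 3]
```

Edge i runs from `src[i]` to `dst[i]`, which is the same pairing the `src` and `dst` columns carry in the DataFrame.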

In [5]:
# Start ETL timer
#t_start = time.time()

# Download and decompress the data file if needed, then return its path
input_data_path = get_data_file()

# Helper function to set the reader chunk size to automatically get one partition per GPU
chunksize = dask_cugraph.get_chunksize(input_data_path)

# Multi-GPU CSV reader
e_list = dask_cudf.read_csv(
    input_data_path,
    chunksize=chunksize,
    delimiter=' ',
    names=['src', 'dst'],
    dtype=['int32', 'int32'],
)
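
The chunk size helper used in this cell is described as producing one partition per GPU. One plausible way to get that effect for a single file is to divide the file size by the number of workers. The sketch below shows only that idea. It is not cuGraph's implementation, and `chunksize_for` and `n_workers` are hypothetical names.

```python
import math
import os


def chunksize_for(path, n_workers):
    """Bytes per partition so a file splits into about one chunk per worker."""
    if n_workers < 1:
        raise ValueError('n_workers must be at least 1')
    # Round up so the last partition is never a tiny leftover chunk
    return math.ceil(os.path.getsize(path) / n_workers)
```

With two workers, a 10-byte file would be read in 5-byte chunks, so each worker receives one partition.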



In [6]:
e_list['src'].max().compute()

1139904

In [7]:
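# Build an undirected graph from the distributed edge list
G = cugraph.Graph(directed=False)
G.from_dask_cudf_edgelist(e_list, renumber=False, source='src', destination='dst')
# Take 10 edges from the graph to use as the vertex pairs to score
vertex_pairs = G.view_edge_list().head(10)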

### Run Multi-GPU Jaccard

In [8]:
jdf = dask_cugraph.jaccard(G, vertex_pairs)
jdf.sort_values(by='jaccard_coeff', ascending=False).compute()

|   | first   | second  | jaccard_coeff |
|---|---------|---------|---------------|
| 5 | 880757  | 880698  | 1.0           |
| 6 | 941603  | 941199  | 0.652971      |
| 3 | 773793  | 640730  | 0.292683      |
| 2 | 1012029 | 1015522 | 0.087393      |
| 0 | 485689  | 487550  | 0.082307      |
| 1 | 780139  | 684044  | 0.078031      |
| 4 | 940470  | 938973  | 0.048816      |
| 0 | 95844   | 864653  | 0.046243      |
| 1 | 203605  | 789165  | 0.043506      |
| 2 | 611414  | 600564  | 0.033784      |
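
For readers who want to see what the coefficient measures, the Jaccard similarity of two vertices is the number of neighbors they share divided by the number of distinct neighbors they have in total. The following is a minimal CPU-only sketch on made-up neighbor sets. It illustrates the definition and is not how cuGraph computes it.

```python
def jaccard(neighbors_a, neighbors_b):
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical neighbor sets for two vertex pairs
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared out of 4 distinct -> 0.5
print(jaccard({1, 2}, {1, 2}))        # identical neighborhoods -> 1.0
```

A score of 1.0, as in the first row of the output, means the two vertices have exactly the same neighbors.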


### Run Multi-GPU Sorensen

In [9]:
sdf = dask_cugraph.sorensen(G, vertex_pairs)
sdf.sort_values(by='sorensen_coeff', ascending=False).compute()

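|   | first   | second  | sorensen_coeff |
|---|---------|---------|----------------|
| 5 | 880757  | 880698  | 1.0            |
| 6 | 941603  | 941199  | 0.790057       |
| 3 | 773793  | 640730  | 0.45283        |
| 2 | 1012029 | 1015522 | 0.160738       |
| 0 | 485689  | 487550  | 0.152095       |
| 1 | 780139  | 684044  | 0.144766       |
| 4 | 940470  | 938973  | 0.093088       |
| 0 | 95844   | 864653  | 0.088398       |
| 1 | 203605  | 789165  | 0.083384       |
| 2 | 611414  | 600564  | 0.065359       |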
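
The Sorensen coefficient is twice the number of shared neighbors divided by the sum of the two neighborhood sizes, which makes it a monotonic function of Jaccard: S = 2J / (1 + J). That is why the pairs rank in the same order in both outputs. The sketch below is a CPU-only illustration with hypothetical function names. It checks the relationship against Jaccard values copied from the output earlier in this notebook.

```python
def sorensen(neighbors_a, neighbors_b):
    """Twice the intersection size over the summed set sizes; 0.0 if both empty."""
    a, b = set(neighbors_a), set(neighbors_b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 2 * len(a & b) / total


def sorensen_from_jaccard(j):
    """Convert a Jaccard coefficient to the equivalent Sorensen coefficient."""
    return 2 * j / (1 + j)


# Jaccard values taken from the first three rows of the Jaccard output
for j in (1.0, 0.652971, 0.292683):
    print(j, '->', round(sorensen_from_jaccard(j), 6))
```

The converted values agree with the first three `sorensen_coeff` rows above to within the rounding of the printed Jaccard values.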


### Run Multi-GPU Overlap

In [10]:
odf = dask_cugraph.overlap(G, vertex_pairs)
odf.sort_values(by='overlap_coeff', ascending=False).compute()

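|   | first   | second  | overlap_coeff |
|---|---------|---------|---------------|
| 1 | 203605  | 789165  | 1.0           |
| 3 | 773793  | 640730  | 1.0           |
| 5 | 880757  | 880698  | 1.0           |
| 6 | 941603  | 941199  | 1.0           |
| 2 | 1012029 | 1015522 | 0.72619       |
| 2 | 611414  | 600564  | 0.588235      |
| 1 | 780139  | 684044  | 0.580357      |
| 0 | 95844   | 864653  | 0.533333      |
| 4 | 940470  | 938973  | 0.302395      |
| 0 | 485689  | 487550  | 0.232795      |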
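
The overlap coefficient divides the number of shared neighbors by the size of the smaller neighborhood, so it reaches 1.0 whenever one vertex's neighbors are a subset of the other's. This would explain why several pairs score 1.0 here while only one pair does under Jaccard. The following CPU-only sketch uses made-up neighbor sets to illustrate the definition. It is not cuGraph's implementation.

```python
def overlap(neighbors_a, neighbors_b):
    """Intersection size over the smaller set's size; 0.0 if either set is empty."""
    a, b = set(neighbors_a), set(neighbors_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller


# Hypothetical neighbor sets
print(overlap({1, 2}, {1, 2, 3, 4}))     # subset of the other -> 1.0
print(overlap({1, 2, 3}, {3, 4, 5, 6}))  # 1 shared, smaller set has 3 -> 0.333...
```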


### Shut down the multi-GPU Environment

In [11]:
Comms.destroy()
client.close()
cluster.close()

___
Copyright (c) 2023, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___