# Setup

1. Use pynvml to confirm Colab allocated you a Tesla T4 GPU.
2. Install the most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
3. Install RAPIDS libraries
4. Copy RAPIDS .so files into current working directory, a workaround for conda/colab interactions
5. Add the ngrok binary to expose Dask's status dashboard
6. Update environment variables so Python can find and use RAPIDS artifacts

All of the above steps are automated in the next cell.

You should re-run this cell any time your instance restarts.
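Step 1 can also be checked by hand. The following is a minimal sketch, not the contents of `rapids-colab.sh`: the helper names `gpu_name` and `is_t4` are hypothetical, and the sketch assumes the `pynvml` package is installed on the instance. It returns `None` rather than failing when NVML is unavailable.

```python
def gpu_name(index=0):
    """Return the name of the GPU at `index`, or None if NVML is unavailable."""
    try:
        # Imported lazily so this sketch still loads on machines without pynvml.
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        return None
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        name = pynvml.nvmlDeviceGetName(handle)
    except Exception:
        return None
    finally:
        pynvml.nvmlShutdown()
    # Older pynvml releases return bytes, newer ones return str.
    return name.decode() if isinstance(name, bytes) else name


def is_t4(name):
    """True if the reported device name looks like a Tesla T4."""
    return name is not None and "T4" in name


if __name__ == "__main__":
    name = gpu_name()
    print("GPU:", name, "| T4:", is_t4(name))
```

If `is_t4` reports `False`, resetting the Colab runtime may allocate a different GPU.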

In [0]:

!wget https://github.com/randerzander/notebooks-extended/raw/master/utils/rapids-colab.sh
!chmod +x rapids-colab.sh
!./rapids-colab.sh

import sys, os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

import nvstrings, nvcategory, cudf, cuml, xgboost
import dask_cudf, dask_cuml, dask_xgboost
from dask.distributed import Client, LocalCluster, wait, progress

# we have one GPU, so limit Dask's workers and threads to exactly 1
cluster = LocalCluster(processes=False, threads_per_worker=1, n_workers=1)
client = Client(cluster)
client

--2019-06-06 16:52:25--  https://github.com/randerzander/notebooks-extended/raw/master/utils/rapids-colab.sh
Resolving github.com (github.com)... 52.69.186.44
Connecting to github.com (github.com)|52.69.186.44|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/randerzander/notebooks-extended/master/utils/rapids-colab.sh [following]
--2019-06-06 16:52:30--  https://raw.githubusercontent.com/randerzander/notebooks-extended/master/utils/rapids-colab.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1746 (1.7K) [text/plain]
Saving to: ‘rapids-colab.sh’


2019-06-06 16:52:31 (228 MB/s) - ‘rapids-colab.sh’ saved [1746/1746]

--2019-06-06 16:52:33--  https://github.com/randerzander/notebooks-extended/raw/mas


# Louvain Community Detection

## Introduction

The Louvain method of community detection is a greedy hierarchical clustering algorithm that seeks to optimize modularity as it progresses. Louvain starts with each vertex in its own cluster and iteratively merges groups using graph contraction.

For a detailed description of the algorithm see: [https://en.wikipedia.org/wiki/Louvain_Modularity](https://en.wikipedia.org/wiki/Louvain_Modularity).

The cuGraph implementation takes a `cugraph.Graph` object as input and returns a `cudf.DataFrame` with the ID and assigned partition of each vertex, along with the final modularity score.

To compute the Louvain clusters in cuGraph, use:

__nvLouvain(G)__
* __G__: A `cugraph.Graph` object

Returns: 
* A tuple of the Louvain dataframe and the modularity score:
* __louvain__: `cudf.DataFrame` with two named columns: 
    * `louvain["vertex"]`: The vertex id.
    * `louvain["partition"]`: The assigned partition.
* __modularity__ : the overall modularity of the graph

All vertices with the same partition ID are in the same cluster.
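
To make the modularity score concrete, here is a minimal pure-Python sketch of the standard modularity definition for an undirected, unweighted graph: for each community, the fraction of edges inside it minus the squared fraction of edge endpoints attached to it. The function name `modularity` and the toy graph are illustrative assumptions, and cuGraph's own computation (which also takes edge weights) may differ in detail.

```python
from collections import defaultdict


def modularity(edges, partition):
    """Modularity of an undirected, unweighted graph.

    edges:     iterable of (u, v) pairs, each undirected edge listed once
    partition: dict mapping every vertex to its community ID
    """
    m = 0                         # total number of edges
    internal = defaultdict(int)   # edges with both endpoints in the community
    degree = defaultdict(int)     # sum of vertex degrees per community
    for u, v in edges:
        m += 1
        degree[partition[u]] += 1
        degree[partition[v]] += 1
        if partition[u] == partition[v]:
            internal[partition[u]] += 1
    if m == 0:
        return 0.0
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in range(6)}

print(modularity(toy_edges, two_clusters))  # 5/14, roughly 0.357
print(modularity(toy_edges, one_cluster))   # a single community scores 0
```

Splitting the toy graph along the bridge scores higher than lumping everything together, which is the kind of improvement Louvain greedily searches for.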


### Test Data
We will be using the Zachary Karate Club dataset:
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*

![Karate Club](https://raw.githubusercontent.com/rapidsai/notebooks/branch-0.8/cugraph/img/zachary_black_lines.png)

### Prep

In [0]:
# Import needed libraries
import cugraph
import numpy as np
from collections import OrderedDict

## Reading data using cuDF

**At the time this notebook was created, `dask_cudf` did not work with `cugraph`, so we have to use `cudf`.** A future repo, `dask_cugraph`, will add Dask compatibility.

In [0]:
import cudf
import pandas as pd
# URL of the test file
datafile='https://raw.githubusercontent.com/rapidsai/notebooks/branch-0.8/cugraph/data/karate-data.csv'

# Read the data file
cols = ["src", "dst"]

dtypes = OrderedDict([
        ("src", "int32"), 
        ("dst", "int32")
        ])

# pandas expects a column-to-dtype mapping, so pass the dict itself
df = pd.read_csv(datafile, names=cols, delimiter='\t', dtype=dtypes)
gdf = cudf.from_pandas(df)

FileNotFoundError: ignored

In [0]:
# Save test file
!wget https://raw.githubusercontent.com/rapidsai/notebooks/branch-0.8/cugraph/data/karate-data.csv
datafile='karate-data.csv'

# Read the data file
cols = ["src", "dst"]

dtypes = OrderedDict([
        ("src", "int32"), 
        ("dst", "int32")
        ])

gdf = cudf.read_csv(datafile, names=cols, delimiter='\t', dtype=list(dtypes.values()) )

--2019-06-04 20:08:07--  https://raw.githubusercontent.com/rapidsai/notebooks/branch-0.8/cugraph/data/karate-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 814 [text/plain]
Saving to: ‘karate-data.csv’


2019-06-04 20:08:07 (196 MB/s) - ‘karate-data.csv’ saved [814/814]



In [0]:
# Louvain is dependent on vertex ID starting at zero
gdf["src_0"] = gdf["src"] - 1
gdf["dst_0"] = gdf["dst"] - 1
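
Subtracting 1 works here because the karate club IDs are already contiguous and start at 1. For data whose IDs are sparse or arbitrary, a more general approach is to renumber them. The sketch below does this on the host in plain Python; the function name `renumber` is hypothetical and this is not a cuGraph API.

```python
def renumber(edges):
    """Map arbitrary vertex IDs to contiguous IDs 0..n-1.

    Returns the renumbered edge list and the original-ID -> new-ID mapping.
    IDs are assigned in order of first appearance.
    """
    ids = {}
    renumbered = []
    for u, v in edges:
        for x in (u, v):
            if x not in ids:
                ids[x] = len(ids)
        renumbered.append((ids[u], ids[v]))
    return renumbered, ids


# Hypothetical edge list with a gap in the ID range.
new_edges, mapping = renumber([(1, 2), (1, 3), (10, 3)])
print(new_edges)  # [(0, 1), (0, 2), (3, 2)]
print(mapping)    # {1: 0, 2: 1, 3: 2, 10: 3}
```

Keeping the returned mapping makes it possible to translate partition results back to the original IDs afterwards.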

In [0]:
# The algorithm also requires edge weights. Just use 1.0 for every edge
gdf["data"] = 1.0

In [0]:
# just for fun, let's look at the data types in the dataframe
gdf.dtypes

src        int32
dst        int32
src_0      int32
dst_0      int32
data     float64
dtype: object

In [0]:
print(gdf.head(5))

   src  dst  src_0  dst_0  data
0    1    2      0      1   1.0
1    1    3      0      2   1.0
2    1    4      0      3   1.0
3    1    5      0      4   1.0
4    1    6      0      5   1.0


In [0]:
gdf

                 src    dst  src_0  dst_0     data
npartitions=1
               int32  int32  int32  int32  float64
                 ...    ...    ...    ...      ...


In [0]:
# create a Graph 
G = cugraph.Graph()
G.add_edge_list(gdf["src_0"], gdf["dst_0"], gdf["data"])

AttributeError: ignored

Exception ignored in: 'cugraph.get_gdf_column_view'
AttributeError: 'Series' object has no attribute '_column'


AttributeError: ignored

Exception ignored in: 'cugraph.get_gdf_column_view'
AttributeError: 'Series' object has no attribute '_column'


AttributeError: ignored

Exception ignored in: 'cugraph.get_gdf_column_view'
AttributeError: 'Series' object has no attribute '_column'


GDFError: ignored

In [0]:
# Call Louvain on the graph
df, mod = cugraph.nvLouvain(G)

GDFError: ignored

In [0]:
# Print the modularity score
print('Modularity was {}'.format(mod))
print()

In [0]:
df.dtypes

In [0]:
# How many partitions were found
part_ids = df["partition"].unique()

In [0]:
print(str(len(part_ids)) + " partition(s) detected")


In [0]:
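# List the members of each partition, assuming partition IDs run from 0 to len(part_ids) - 1
for p in range(len(part_ids)):
    part = []
    for i in range(len(df)):
        if (df['partition'][i] == p):
            # add 1 to convert back to the original 1-based vertex IDs
            part.append(df['vertex'][i] +1)
    print("Partition " + str(p) + ":")
    print(part)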
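
The nested loop above scans the whole result once per partition. As an alternative sketch, the same grouping can be done in a single pass once the two columns are on the host as plain Python sequences (for example via `to_pandas()`, which is an assumption about how the columns are copied back). The function name `group_by_partition` and the sample values are hypothetical.

```python
from collections import defaultdict


def group_by_partition(vertices, partitions, offset=1):
    """Group vertex IDs by partition ID in one pass.

    `offset` is added to each vertex ID to undo the zero-based shift
    applied before building the graph.
    """
    groups = defaultdict(list)
    for v, p in zip(vertices, partitions):
        groups[p].append(v + offset)
    return {p: sorted(members) for p, members in sorted(groups.items())}


# Hypothetical result columns for a four-vertex graph.
groups = group_by_partition([0, 1, 2, 3], [0, 1, 0, 1])
for p, members in groups.items():
    print("Partition " + str(p) + ":")
    print(members)
```

This version also does not assume that partition IDs are contiguous.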

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-extended

Copyright (c) 2019, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.