<a href="https://colab.research.google.com/github/taureandyernv/colabs/blob/master/rapids_colab_0_8_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup #

1. Use pynvml to confirm Colab allocated you a Tesla T4 GPU.
2. Install the most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
3. Install RAPIDS libraries
4. Copy RAPIDS .so files into current working directory, a workaround for conda/colab interactions
5. Add the ngrok binary to expose Dask's status dashboard
6. Update env variables so Python can find and use RAPIDS artifacts

All of the above steps are automated in the next cell.

Re-run this cell any time your instance restarts.

In [0]:
!wget https://github.com/randerzander/notebooks-extended/raw/master/utils/rapids-colab.sh
!chmod +x rapids-colab.sh
!./rapids-colab.sh

import sys, os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

import nvstrings, nvcategory, cudf, cuml, xgboost
import dask_cudf, dask_cuml, dask_xgboost
from dask.distributed import Client, LocalCluster, wait, progress

# we have one GPU, so limit Dask's workers and threads to exactly 1
cluster = LocalCluster(processes=False, threads_per_worker=1, n_workers=1)
client = Client(cluster)
client

--2019-06-03 15:37:55--  https://github.com/randerzander/notebooks-extended/raw/master/utils/rapids-colab.sh
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/randerzander/notebooks-extended/master/utils/rapids-colab.sh [following]
--2019-06-03 15:37:55--  https://raw.githubusercontent.com/randerzander/notebooks-extended/master/utils/rapids-colab.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1746 (1.7K) [text/plain]
Saving to: ‘rapids-colab.sh.1’


2019-06-03 15:37:55 (266 MB/s) - ‘rapids-colab.sh.1’ saved [1746/1746]

Checking for GPU type:
Traceback (most recent call last):
  File "env-check

Client: Scheduler: inproc://172.28.0.2/1431/1, Dashboard: http://localhost:8787/status
Cluster: Workers: 1, Cores: 1, Memory: 13.66 GB
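
Step 1 of the setup list checks the GPU type with pynvml. The setup script itself is not reproduced here, so the following is only a minimal sketch of what such a check could look like. The helper names `gpu_name` and `is_t4` are hypothetical, and the function returns `None` instead of raising when pynvml or the NVIDIA driver is missing.

```python
def is_t4(device_name):
    """Return True when an NVML device name refers to a Tesla T4."""
    return "T4" in device_name


def gpu_name(index=0):
    """Return the name of GPU `index`, or None when NVML is unavailable."""
    try:
        import pynvml
    except ImportError:
        return None  # pynvml is not installed
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no NVIDIA driver or no GPU visible
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        name = pynvml.nvmlDeviceGetName(handle)
    except pynvml.NVMLError:
        return None
    finally:
        pynvml.nvmlShutdown()
    # Older pynvml releases return bytes, newer ones return str.
    return name.decode() if isinstance(name, bytes) else name


name = gpu_name()
if name is None:
    print("No NVIDIA GPU visible to NVML")
elif is_t4(name):
    print("Found a Tesla T4:", name)
else:
    print("Found", name, "but expected a Tesla T4")
```

If the check reports a different GPU, resetting the Colab runtime and reconnecting may allocate a different device.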


Dask has a helpful [web interface](http://distributed.dask.org/en/latest/web.html) for monitoring the status of long-running tasks. It starts automatically when you create a LocalCluster.

However, when you run things like Dask's web interface in the background on Colab, their ports are not publicly exposed. Services like ngrok provide an easy way to access them anyway.

If you want to expose your Dask dashboard as a public URL, run the cell below and open the resulting link in a new tab. When you run computations below, you can switch to it and observe as the task DAG progresses.

In [0]:
get_ipython().system_raw('ngrok http 8787 &')

!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

https://9a172075.ngrok.io
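
The `curl` pipeline above fails with an `IndexError` if ngrok has not opened its tunnel yet. As a hedged alternative, the same lookup can be written in plain Python with the standard library. The function names `public_url` and `fetch_public_url` are hypothetical; the API address is the one used in the cell above.

```python
import json
from urllib.request import urlopen


def public_url(tunnels_payload):
    """Extract the first tunnel's public URL from ngrok's /api/tunnels JSON."""
    tunnels = tunnels_payload.get("tunnels", [])
    if not tunnels:
        return None  # ngrok is running but has not opened a tunnel yet
    return tunnels[0]["public_url"]


def fetch_public_url(api="http://localhost:4040/api/tunnels", timeout=5):
    """Query the local ngrok agent; return None if it is not reachable."""
    try:
        with urlopen(api, timeout=timeout) as response:
            return public_url(json.load(response))
    except (OSError, ValueError):
        # OSError covers connection failures, ValueError covers malformed JSON
        return None


# Illustrative payload only; a real response carries more fields per tunnel.
sample = {"tunnels": [{"public_url": "https://example.ngrok.io"}]}
print(public_url(sample))
```

Calling `fetch_public_url()` in a short retry loop gives ngrok time to start before the URL is read.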


In [0]:
# the data file has no header, so supply column names
cols = [
    "Year", "Month", "DayofMonth", "DayofWeek", "CRSDepTime",
    "CRSArrTime", "UniqueCarrier", "FlightNum", "ActualElapsedTime",
    "Origin", "Dest", "Distance", "Diverted", "ArrDelay"
]

# data from http://kt.ijs.si/elena_ikonomovska/data.html
df = dask_cudf.read_csv('gs://rapidsai/airline/data.csv', names=cols)
df.head().to_pandas()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,AA,190,247,SFO,ORD,1846,0,27
1,1987,10,1,4,5,114,EA,57,74,LAX,SFO,337,0,5
2,1987,10,1,4,5,35,HP,351,167,ICT,LAS,987,0,17
3,1987,10,1,4,5,40,DL,251,35,MCO,PBI,142,0,-2
4,1987,10,1,4,8,517,UA,500,208,LAS,ORD,1515,0,17
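
To illustrate what supplying `names` does for a header-less file, here is a small CPU-only sketch that uses the standard library's `csv.DictReader` as a stand-in for `dask_cudf.read_csv`. It is not the RAPIDS reader. The two rows are copied from the output above, and every value stays a string because `csv` does no type inference.

```python
import csv
import io

cols = [
    "Year", "Month", "DayofMonth", "DayofWeek", "CRSDepTime",
    "CRSArrTime", "UniqueCarrier", "FlightNum", "ActualElapsedTime",
    "Origin", "Dest", "Distance", "Diverted", "ArrDelay"
]

# Two header-less rows taken from the head() output above.
raw = (
    "1987,10,1,4,1,556,AA,190,247,SFO,ORD,1846,0,27\n"
    "1987,10,1,4,5,114,EA,57,74,LAX,SFO,337,0,5\n"
)

# fieldnames plays the role of `names`: the first line is treated as data.
rows = list(csv.DictReader(io.StringIO(raw), fieldnames=cols))
print(len(rows), rows[0]["Origin"], rows[0]["ArrDelay"])
```

Without the explicit names, the first data row would be consumed as the header and lost from the dataset.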


In [0]:
%%time

cat_cols = ['Origin', 'Dest', 'UniqueCarrier']
uniques = {}
for col in cat_cols:
  print('Finding uniques for ' + col)
  uniques[col] = list(df[col].unique().compute())

Finding uniques for Origin
Finding uniques for Dest
Finding uniques for UniqueCarrier
CPU times: user 1min 48s, sys: 52.2 s, total: 2min 40s
Wall time: 3min 25s


In [0]:
%%time

import nvstrings, nvcategory, cudf
from librmm_cffi import librmm
import numpy as np

# There's a WIP PR that will make this much cleaner. Keep an eye on
# https://github.com/rapidsai/cuml/pull/631
def categorize(df_part, uniques):
    # encode each string column as int32 codes against the shared key list
    for col in uniques.keys():
        keys = nvstrings.to_device(uniques[col])
        cat = nvcategory.from_strings(df_part[col].data).set_keys(keys)
        device_array = librmm.device_array(df_part[col].data.size(), dtype=np.int32)
        cat.values(devptr=device_array.device_ctypes_pointer.value)
        df_part[col] = cudf.Series(device_array)
    return df_part

df = df.map_partitions(categorize, uniques)

# Turn into binary classification problem
df["ArrDelayBinary"] = df["ArrDelay"] > 0
df.head().to_pandas()

CPU times: user 2.31 s, sys: 736 ms, total: 3.04 s
Wall time: 3.32 s


In [0]:
# About 25% of data is year >= 2004
X_train = df.query('Year < 2004')
y_train = X_train[["ArrDelayBinary"]]
X_train = X_train[X_train.columns.difference(["ArrDelay", "ArrDelayBinary"])]

X_test = df.query('Year >= 2004')
y_test = X_test[["ArrDelayBinary"]]
X_test = X_test[X_test.columns.difference(["ArrDelay", "ArrDelayBinary"])]

In [0]:
%%time
X_train = X_train.compute()
y_train = y_train.compute()
y_train['ArrDelayBinary'] = y_train['ArrDelayBinary'].astype('int32')

#X_train = X_train.persist()
#y_train = y_train.persist()
#res = wait([X_train, y_train])

RuntimeError: ignored

In [0]:
%%time
dtrain = xgboost.DMatrix(X_train, y_train)

In [0]:
%%time
params = {"max_depth": 8,
          "learning_rate": 0.1,
          "min_child_weight": 1,
          "max_leaves": 256,
          "tree_method": "gpu_hist",
          "ngpus": -1,
          "reg_lambda": 1,
          "objective": "binary:logistic",
          "scale_pos_weight": len(y_train) / len(y_train.query('ArrDelayBinary > 0'))
         }

#train the model
model = dask_xgboost.train(client, params, X_train, y_train)

ValueError: ignored
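
A note on `scale_pos_weight`: the params dict above computes it as total rows divided by positive rows, while the XGBoost documentation suggests negative rows divided by positive rows as a typical value. The sketch below compares the two on a small hypothetical label list. Both function names are illustrative.

```python
def weight_total_over_pos(labels):
    """len(y) / positives, the ratio computed in the params dict above."""
    positives = sum(labels)
    return len(labels) / positives


def weight_neg_over_pos(labels):
    """negatives / positives, the typical value suggested by the XGBoost docs."""
    positives = sum(labels)
    return (len(labels) - positives) / positives


# Hypothetical labels: 2 delayed flights (1) and 6 on-time flights (0).
labels = [1, 0, 0, 0, 1, 0, 0, 0]
print(weight_total_over_pos(labels))  # 8 / 2
print(weight_neg_over_pos(labels))    # 6 / 2
```

The two ratios always differ by exactly 1, so the gap matters most when the classes are close to balanced.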

# cuDF and cuML Examples #

Now you can run code! 

What follows are basic examples where all processing takes place on the GPU.

# [cuDF](https://github.com/rapidsai/cudf) #

Load a dataset into a GPU memory resident DataFrame and perform a basic calculation.

Everything from CSV parsing to calculating tip percentage and computing a grouped average is done on the GPU.

_Note_: You must import nvstrings and nvcategory before cudf, else you'll get errors.

In [0]:
import nvstrings, nvcategory, cudf
import io, requests

# download CSV file from GitHub
url="https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

# read CSV from memory
tips_df = cudf.read_csv(io.StringIO(content))
tips_df['tip_percentage'] = tips_df['tip']/tips_df['total_bill']*100

# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())

size
1     21.72920154872781
2     16.57191917348289
3    15.215685473711831
4    14.594900639351334
5    14.149548965142026
6    15.622920072028379
Name: tip_percentage, dtype: float64
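
For readers without a GPU at hand, the same calculation can be sketched in plain Python. This is a CPU stand-in, not cuDF, and the three rows are made-up values chosen so the arithmetic is exact. As in the cell above, the percentage is computed per row and then averaged per group.

```python
from collections import defaultdict

# Hypothetical rows standing in for tips.csv.
rows = [
    {"size": 2, "total_bill": 20.0, "tip": 5.0},   # 25.0 %
    {"size": 2, "total_bill": 16.0, "tip": 2.0},   # 12.5 %
    {"size": 4, "total_bill": 40.0, "tip": 10.0},  # 25.0 %
]

# Collect the per-row tip percentage under each party size.
groups = defaultdict(list)
for row in rows:
    groups[row["size"]].append(row["tip"] / row["total_bill"] * 100)

# Average within each group, ordered by party size.
means = {size: sum(pcts) / len(pcts) for size, pcts in sorted(groups.items())}
print(means)
```

Averaging per-row percentages is not the same as dividing total tips by total bills; the cuDF example uses the per-row form, so the sketch does too.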


# [cuML](https://github.com/rapidsai/cuml) #

This snippet loads a small dataset into a GPU DataFrame and clusters it with DBSCAN.

As above, all calculations are performed on the GPU.

In [0]:
import cuml

# Create and populate a GPU DataFrame
df_float = cudf.DataFrame()
df_float['0'] = [1.0, 2.0, 5.0]
df_float['1'] = [4.0, 2.0, 1.0]
df_float['2'] = [4.0, 2.0, 1.0]

# Setup and fit clusters
dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(df_float)

print(dbscan_float.labels_)

0    0
1    1
2    2
dtype: int32
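
Each point ends up in its own cluster because no two rows lie within `eps=1.0` of each other. With `min_samples=1` every point is a core point, so DBSCAN clusters reduce to the connected components of the graph that links points at distance `eps` or less. The following CPU-only sketch checks this for the three rows above. The `eps_components` helper is hypothetical and is not how cuML implements DBSCAN.

```python
import math

# Rows of df_float: column '0' is [1, 2, 5], columns '1' and '2' are [4, 2, 1].
points = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]


def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def eps_components(points, eps):
    """Label connected components of the graph linking points within eps."""
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and distance(points[i], points[j]) <= eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


print(distance(points[0], points[1]))  # sqrt(1 + 4 + 4)
print(eps_components(points, eps=1.0))
```

The closest pair is 3.0 apart, so any `eps` below 3.0 leaves the three rows as separate clusters.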


# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-extended