<a href="https://colab.research.google.com/github/yiruchen1993/nvidia_gtc_dli_rapids_2020/blob/section_notebooks%2Fmachine_learning/2_07_kmeans_dask.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-GPU K-Means with Dask

In this notebook, you will use GPU-accelerated K-means, scaled across multiple nodes and multiple GPUs with Dask, to identify population clusters.

## Objectives

By the time you complete this notebook, you will be able to:

- Use a distributed, GPU-accelerated K-means algorithm with Dask

## Imports

First, we import the modules needed to create the Dask cuDF cluster.

In [None]:
import subprocess

from dask.distributed import Client, wait, progress
from dask_cuda import LocalCUDACluster

import dask.dataframe as dd
import dask.array as da

from dask import compute
from dask.delayed import delayed

Next, we create the compute cluster.

In [None]:
cmd = "hostname --all-ip-addresses"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
IPADDR = str(output.decode()).split()[0]

cluster = LocalCUDACluster(ip=IPADDR)
client = Client(cluster)
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://172.19.0.3:37229 | Workers: 4 |
| Dashboard: http://172.19.0.3:8787/status | Cores: 4 |
| | Memory: 473.42 GB |
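The address lookup in the cell above can be factored into a small helper. The following is a minimal sketch: `first_ip_address` is a hypothetical name, not part of Dask or this notebook, and the sample bytes are made up for illustration. It assumes `hostname --all-ip-addresses` prints whitespace-separated addresses.

```python
def first_ip_address(raw_output: bytes) -> str:
    """Return the first address found in `hostname --all-ip-addresses` output."""
    addresses = raw_output.decode().split()
    if not addresses:
        # Fail loudly rather than passing an empty string to the cluster
        raise ValueError("no IP address found in hostname output")
    return addresses[0]


# Made-up sample output; the second address is purely illustrative
sample_output = b"172.19.0.3 10.0.0.5 \n"
ip = first_ip_address(sample_output)
print(ip)
```

Raising on empty output surfaces a misconfigured host before the address is handed to `LocalCUDACluster`.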


Finally, as we did before, we import CUDA context creators after setting up the cluster so they don't lock to a single device.

In [None]:
import cudf
import dask_cudf

import cuml
from cuml.dask.cluster import KMeans

## Load and Persist Data

We begin by loading the data, which has two grid-coordinate columns, `easting` and `northing`, both taken from the main population dataset we prepared.

In [None]:
ddf = dask_cudf.read_csv('./data/pop5x_2-07.csv', names=['northing', 'easting'], dtype=['float32', 'float32'])

In [None]:
ddf

| npartitions=27 | northing | easting |
|---|---|---|
| | float32 | float32 |
| | ... | ... |
| ... | ... | ... |
| | ... | ... |
| | ... | ... |


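Training the K-means model is very similar to both the scikit-learn version and the cuML single-GPU version. Once the client is set up and `KMeans` is imported from the `cuml.dask.cluster` module, the algorithm automatically uses the local Dask cluster we have already created.

Note that calling `.fit` triggers Dask computation.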

In [None]:
dkm = KMeans(n_clusters=20)
dkm.fit(ddf)

<cuml.dask.cluster.kmeans.KMeans at 0x7f071a2f6ef0>
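To build intuition for how K-means can be split across workers, here is a pure-Python sketch of one possible scheme: each partition computes per-cluster coordinate sums and counts, and these partial results are then combined into new centers. This is an illustration under stated assumptions, not cuML's actual implementation; the function names and toy data are hypothetical.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def partial_sums(partition, centers):
    """Per-partition work: coordinate sums and member counts for each cluster."""
    sums = [[0.0, 0.0] for _ in centers]
    counts = [0] * len(centers)
    for point in partition:
        k = nearest(point, centers)
        sums[k][0] += point[0]
        sums[k][1] += point[1]
        counts[k] += 1
    return sums, counts


def kmeans_step(partitions, centers):
    """One update step: combine the partial results from every partition."""
    # In a distributed setting each call below could run on a different worker
    results = [partial_sums(p, centers) for p in partitions]
    new_centers = []
    for k, old_center in enumerate(centers):
        n = sum(counts[k] for _, counts in results)
        if n == 0:
            new_centers.append(old_center)  # keep the center of an empty cluster
            continue
        sx = sum(sums[k][0] for sums, _ in results)
        sy = sum(sums[k][1] for sums, _ in results)
        new_centers.append((sx / n, sy / n))
    return new_centers


# Hypothetical toy data split into two partitions
partitions = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0), (2.0, 0.0)],
]
centers = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):
    centers = kmeans_step(partitions, centers)
print(centers)
```

Only the small per-cluster sums and counts need to be exchanged between partitions, never the points themselves, which is what makes this style of update attractive for data spread across several GPUs.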

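With the fitted model, we extract the cluster centers and rename the columns from their generic `0` and `1` to reflect the data the model was trained on.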

In [None]:
cluster_centers = dkm.cluster_centers_
cluster_centers.columns = ['northing', 'easting']
cluster_centers.dtypes

northing    float32
easting     float32
dtype: object

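## Exercise: Count Members of the Southernmost Cluster

Using `cluster_centers`, apply the `nsmallest` method to identify which cluster is southernmost (lowest `northing` value), then use `dkm.predict` to get labels for the data, and finally filter the labels to determine how many individuals the model estimates are in that cluster.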

In [None]:
cluster_centers.nsmallest?

Signature: cluster_centers.nsmallest(n, columns, keep='first')
Docstring:
Get the rows of the DataFrame sorted by the n smallest value of *columns*

Difference from pandas:
* Only a single column is supported in *columns*
File:      /opt/conda/envs/rapids/lib/python3.6/site-packages/cudf/core/dataframe.py
Type:      method


In [None]:
cluster_centers.nsmallest(1, 'northing')

| | northing | easting |
|---|---|---|
| 11 | -5321793.5 | 622414.5 |


In [None]:
cluster_centers.nsmallest(1, 'northing').index[0]

11

In [None]:
# %load solutions/southernmost_cluster
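# Index of the cluster center with the lowest northing value
south_idx = cluster_centers.nsmallest(1, 'northing').index[0]
# Predict a cluster label for every row of the distributed DataFrame
labels_predicted = dkm.predict(ddf)
# Keep only the labels equal to the southernmost cluster and count them
labels_predicted[labels_predicted==south_idx].compute().shape[0]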


9505217
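The three steps of the exercise (find the southernmost center, label every point, count the matching labels) can be traced by hand on a tiny example. The sketch below uses plain Python and made-up `(northing, easting)` pairs; it only mirrors the logic and does not use cuML or Dask.

```python
import math

# Hypothetical toy data as (northing, easting) pairs
centers = [(50.0, 10.0), (-20.0, 5.0), (30.0, -40.0)]
points = [(49.0, 11.0), (-19.0, 4.0), (-25.0, 6.0), (31.0, -39.0), (-18.0, 7.0)]

# Step 1: index of the center with the lowest northing,
# analogous to cluster_centers.nsmallest(1, 'northing').index[0]
south_idx = min(range(len(centers)), key=lambda k: centers[k][0])

# Step 2: label each point with its nearest center, analogous to dkm.predict
labels = [
    min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
    for p in points
]

# Step 3: count the points assigned to the southernmost cluster
south_count = sum(1 for label in labels if label == south_idx)
print(south_idx, south_count)  # prints: 1 3
```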

<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

In the next notebook, you will once again use the powerful XGBoost algorithm, this time to compute infection risk.