<a href="https://colab.research.google.com/github/yiruchen1993/nvidia_gtc_dli_rapids_2020/blob/section_notebooks%2Fdata_manipulation/1_03_cudf_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

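# Introduction to cuDF

You will begin your accelerated data science training with an introduction to [cuDF](https://github.com/rapidsai/cudf), the RAPIDS API that lets you create and manipulate GPU DataFrames. cuDF's interface closely resembles that of Pandas, so Python data scientists can adopt it with few changes. Throughout this notebook we show the Pandas counterpart of each cuDF operation, so that you can see how much faster cuDF can be even for seemingly simple operations.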

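## Objectives

By the time you complete this notebook you will be able to:

- Read data from disk and write data to disk with cuDF
- Perform basic data exploration and cleaning with cuDF

## Imports

Here we import cuDF for GPU-accelerated DataFrames, along with Pandas and NumPy, the CPU libraries that the GPU APIs are modeled on, so that we can compare performance. CuPy, the GPU counterpart of NumPy, is imported later in the notebook, where it is first used: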

In [None]:
import cudf

import pandas as pd
import numpy as np

In [None]:
cudf

<module 'cudf' from '/opt/conda/envs/rapids/lib/python3.6/site-packages/cudf/__init__.py'>

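## Reading and writing data

With [cuDF](https://github.com/rapidsai/cudf), the RAPIDS API that provides GPU-accelerated DataFrames, we can read data from [a variety of formats](https://rapidsai.github.io/projects/cudf/en/0.10.0/api.html#module-cudf.io.csv), including csv, json, parquet, feather, and orc, as well as from Pandas DataFrames.

In the first part of this workshop we will read in almost 60 million records (corresponding to the entire population of England and Wales), which were synthesized from official UK census data. Here we read this data directly from a local csv file into GPU memory: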

In [None]:
%time gdf = cudf.read_csv('./data/pop_1-03.csv')
gdf.shape

CPU times: user 4.35 s, sys: 2.39 s, total: 6.75 s
Wall time: 8.34 s


(58479894, 6)

In [None]:
gdf.dtypes

age         int64
sex        object
county     object
lat       float64
long      float64
name       object
dtype: object

For comparison, here is the time taken to read the same data into a Pandas DataFrame:

In [None]:
%time df = pd.read_csv('./data/pop_1-03.csv')
gdf.shape == df.shape

CPU times: user 25.6 s, sys: 3.35 s, total: 29 s
Wall time: 29 s


True

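Because of the sophisticated GPU memory management behind cuDF, the first data load into a fresh RAPIDS memory environment is sometimes substantially slower than subsequent loads. Rather than repeatedly allocating and deallocating memory throughout your workflow, the RAPIDS memory manager sets aside additional memory up front to accommodate the array of data science operations you may want to run on the data.

Throughout this workshop we consistently use `gdf` for a GPU DataFrame and `df` for a CPU DataFrame, which makes performance comparisons easier to follow.

### Writing to file

cuDF also provides methods for writing data to files. Here we create a new DataFrame containing only the residents of Blackpool county, write it to `blackpool.csv`, and then perform the same operations with Pandas for comparison.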

#### cuDF


In [None]:
gdf.head()

Unnamed: 0,age,sex,county,lat,long,name
0,0,m,DARLINGTON,54.533644,-1.524401,FRANCIS
1,0,m,DARLINGTON,54.426256,-1.465314,EDWARD
2,0,m,DARLINGTON,54.5552,-1.496417,TEDDY
3,0,m,DARLINGTON,54.547906,-1.572341,ANGUS
4,0,m,DARLINGTON,54.477639,-1.605995,CHARLIE


In [None]:
%time blackpool_residents = gdf.loc[gdf['county'] == 'BLACKPOOL']
print(f'{blackpool_residents.shape[0]} residents')

CPU times: user 1.57 s, sys: 819 ms, total: 2.39 s
Wall time: 3.65 s
139305 residents


In [None]:
%time blackpool_residents.to_csv('blackpool.csv')

CPU times: user 17.3 ms, sys: 112 ms, total: 129 ms
Wall time: 467 ms


#### Pandas

In [None]:
%time blackpool_residents_pd = df.loc[df['county'] == 'BLACKPOOL']

CPU times: user 2.05 s, sys: 135 ms, total: 2.19 s
Wall time: 2.18 s


In [None]:
%time blackpool_residents_pd.to_csv('blackpool_pd.csv')

CPU times: user 646 ms, sys: 8.1 ms, total: 654 ms
Wall time: 653 ms
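
The filter-then-write pattern used above can be sketched on a tiny, made-up table. This sketch uses Pandas so that it runs without a GPU; because the notebook's cells call the same `loc` and `to_csv` methods on `gdf`, the cuDF version is expected to look the same. The sample rows, the temporary file name, and the `index=False` argument are assumptions made for illustration (the notebook's own cells write the index as well).

```python
import os
import tempfile

import pandas as pd

# Tiny made-up sample with column names borrowed from the workshop data.
people = pd.DataFrame({
    "age": [0, 31, 64],
    "county": ["DARLINGTON", "BLACKPOOL", "BLACKPOOL"],
    "name": ["FRANCIS", "AMELIA", "GEORGE"],
})

# Boolean selection with loc keeps only the rows whose county matches.
blackpool = people.loc[people["county"] == "BLACKPOOL"]

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "blackpool_sample.csv")
    # index=False omits the row labels from the file.
    blackpool.to_csv(out_path, index=False)
    # Read the file back to confirm what was written.
    round_trip = pd.read_csv(out_path)

print(f"{round_trip.shape[0]} residents")
```

Reading the file back is a cheap way to check that the subset, and not the full table, reached disk.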


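## Exercise: initial data exploration

Now that we have some data loaded, let's do some initial exploration.

Use the `head` method and the `dtypes` and `columns` attributes on `gdf`, and call `value_counts` on individual `gdf` columns, to get a feel for the data. If you are interested, use the `%time` magic command to compare performance against the same operations on the Pandas `df`.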
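
As a rough guide to what these calls return, here is a minimal sketch on a five-row, made-up table. It uses Pandas so that it runs without a GPU; the notebook's own cells show the same names being used on `gdf`. The sample values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Five made-up rows with a few of the workshop's column names.
sample = pd.DataFrame({
    "age": [0, 0, 53, 53, 27],
    "sex": ["m", "f", "f", "m", "f"],
    "county": ["DARLINGTON", "KENT", "KENT", "ESSEX", "KENT"],
})

first_rows = sample.head(3)           # first three rows
column_names = list(sample.columns)   # column labels, in order
column_types = sample.dtypes          # one dtype per column

# value_counts tallies each distinct value, most frequent first.
county_counts = sample["county"].value_counts()
print(county_counts)
```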

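To create additional interactive cells, click the [`+`] button above, or press `Esc` to switch to command mode and use the keyboard shortcuts `a` (new cell above) and `b` (new cell below).

If GPU memory fills up at any point, remember that you can restart the kernel and rerun the notebook up to this point fairly quickly.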

In [None]:
# Begin your initial exploration here. Create more cells as needed.
%time gdf.head()

CPU times: user 7.3 ms, sys: 2.86 ms, total: 10.2 ms
Wall time: 9.08 ms


Unnamed: 0,age,sex,county,lat,long,name
0,0,m,DARLINGTON,54.533644,-1.524401,FRANCIS
1,0,m,DARLINGTON,54.426256,-1.465314,EDWARD
2,0,m,DARLINGTON,54.5552,-1.496417,TEDDY
3,0,m,DARLINGTON,54.547906,-1.572341,ANGUS
4,0,m,DARLINGTON,54.477639,-1.605995,CHARLIE


In [None]:
%time df.head()

CPU times: user 434 µs, sys: 105 µs, total: 539 µs
Wall time: 629 µs


Unnamed: 0,age,sex,county,lat,long,name
0,0,m,DARLINGTON,54.533644,-1.524401,FRANCIS
1,0,m,DARLINGTON,54.426256,-1.465314,EDWARD
2,0,m,DARLINGTON,54.5552,-1.496417,TEDDY
3,0,m,DARLINGTON,54.547906,-1.572341,ANGUS
4,0,m,DARLINGTON,54.477639,-1.605995,CHARLIE


In [None]:
gdf.dtypes

age         int64
sex        object
county     object
lat       float64
long      float64
name       object
dtype: object

In [None]:
%time gdf.sex.value_counts()

CPU times: user 59.8 ms, sys: 21.6 ms, total: 81.4 ms
Wall time: 80.3 ms


f    29579113
m    28900781
Name: sex, dtype: int32

In [None]:
%time df.sex.value_counts()

CPU times: user 3.23 s, sys: 47.1 ms, total: 3.28 s
Wall time: 3.26 s


f    29579113
m    28900781
Name: sex, dtype: int64

In [None]:
%time gdf.age.value_counts()

CPU times: user 68.8 ms, sys: 12.2 ms, total: 80.9 ms
Wall time: 79.8 ms


53    824404
51    821388
52    820676
54    816103
27    815488
       ...  
85    221588
86    204508
87    184122
88    161014
89    135301
Name: age, Length: 91, dtype: int32

In [None]:
%time df.age.value_counts()

CPU times: user 333 ms, sys: 120 ms, total: 453 ms
Wall time: 452 ms


53    824404
51    821388
52    820676
54    816103
27    815488
       ...  
85    221588
86    204508
87    184122
88    161014
89    135301
Name: age, Length: 91, dtype: int64

In [None]:
%time gdf.county.value_counts()

CPU times: user 34.3 ms, sys: 7.3 ms, total: 41.6 ms
Wall time: 40.5 ms


KENT               1568623
ESSEX              1477764
HAMPSHIRE          1376316
LANCASHIRE         1210053
SURREY             1189934
                    ...   
BLAENAU GWENT        69713
MERTHYR TYDFIL       60183
RUTLAND              39697
CITY OF LONDON        8706
ISLES OF SCILLY       2242
Name: county, Length: 171, dtype: int32

In [None]:
%time df.county.value_counts()

CPU times: user 3.95 s, sys: 74.5 ms, total: 4.03 s
Wall time: 4.02 s


KENT               1568623
ESSEX              1477764
HAMPSHIRE          1376316
LANCASHIRE         1210053
SURREY             1189934
                    ...   
BLAENAU GWENT        69713
MERTHYR TYDFIL       60183
RUTLAND              39697
CITY OF LONDON        8706
ISLES OF SCILLY       2242
Name: county, Length: 171, dtype: int64

In [None]:
gdf?

Type:        DataFrame
String form:
age sex      county        lat      long      name
           0           0   m  DARLINGTON  54.53 <...> JESSICA
           58479893   90   f     NEWPORT  51.578787 -2.827954  FLORENCE
           
           [58479894 rows x 6 columns]
Length:      58479894
File:        /opt/conda/envs/rapids/lib/python3.6/site-packages/cudf/core/dataframe.py
Docstring:  
A GPU Dataframe object.

Parameters
----------
data : data-type to coerce. Infers date format if to date.

Examples
--------

Build dataframe with `__setitem__`:

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df)
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0

Build DataFrame via dict of columns:

>>> import cudf
>>> import numpy as np
>>> from datetime import datetime, timedelta

>>> t0 = datetime.strptime('201

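## Basic operations with cuDF

Besides performing better on large data sets, cuDF offers a user experience very similar to Pandas. In this section we focus on a few very simple operations. When operating on data in a cuDF DataFrame, column-wise operations typically perform better than row-wise operations.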
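
To make the distinction between column-wise and row-wise work concrete, here is a minimal Pandas sketch that computes the same result both ways. It runs on the CPU and does not measure GPU performance; the column arithmetic is arbitrary and the sample values are made up for illustration.

```python
import pandas as pd

# Three made-up coordinate pairs.
points = pd.DataFrame({
    "lat": [54.53, 51.58, 53.82],
    "long": [-1.52, -2.83, -3.05],
})

# Column-wise: one vectorized expression over whole columns.
columnar = points["lat"] + points["long"]

# Row-wise: a Python function called once per row.
row_wise = points.apply(lambda row: row["lat"] + row["long"], axis=1)

# Both approaches produce the same values; only the execution style differs.
same = bool(((columnar - row_wise).abs() < 1e-12).all())
print(same)
```

The column-wise form expresses the whole computation as one operation on contiguous columns, which is the shape of work that a columnar library can hand to the hardware in bulk.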

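### Type-casting

The machine learning used later in this workshop sometimes requires integer values to be converted to floating-point values. Here we convert the `age` column from `int64` to `float32` and compare performance with the same operation in Pandas: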

In [None]:
gdf.dtypes

age         int64
sex        object
county     object
lat       float64
long      float64
name       object
dtype: object

#### cuDF

In [None]:
%time gdf['age'] = gdf['age'].astype('float32')

CPU times: user 4.57 ms, sys: 0 ns, total: 4.57 ms
Wall time: 3.66 ms


#### Pandas

In [None]:
%time df['age'] = df['age'].astype('float32')

CPU times: user 92.5 ms, sys: 108 ms, total: 201 ms
Wall time: 200 ms
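
Beyond the timing, the cast also halves the storage per value. The following is a minimal sketch using Pandas and NumPy on a made-up column of ages from 0 to 90; the variable names are assumptions for illustration. Small integers such as ages survive the cast unchanged, because `float32` represents every integer up to 2^24 exactly.

```python
import numpy as np
import pandas as pd

# Made-up ages 0..90 stored as 64-bit integers.
ages = pd.Series(np.arange(0, 91, dtype="int64"), name="age")
ages_f32 = ages.astype("float32")

bytes_before = ages.to_numpy().nbytes     # 91 values * 8 bytes each
bytes_after = ages_f32.to_numpy().nbytes  # 91 values * 4 bytes each

# Every age is small enough to be represented exactly in float32.
values_preserved = bool((ages_f32 == ages).all())
print(bytes_before, bytes_after, values_preserved)
```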


### Column-wise aggregations

Similarly, column-wise aggregations take advantage of the GPU's architecture and the RAPIDS memory format.

#### cuDF

In [None]:
%time gdf['age'].mean()

CPU times: user 1.42 ms, sys: 296 µs, total: 1.72 ms
Wall time: 1.11 ms


40.12419336806595

#### Pandas

In [None]:
%time df['age'].mean()

CPU times: user 184 ms, sys: 68.1 ms, total: 253 ms
Wall time: 252 ms


40.12419

### String operations

Although strings are not a data type traditionally associated with GPUs, cuDF supports powerful accelerated string operations.

In [None]:
gdf.name.unique()

0              A
1         A'ISHA
2        A'NIYAH
3          A-JAY
4          AABAN
          ...   
13207      ZYANA
13208       ZYLA
13209      ZYLAN
13210       ZYON
13211      ZYRAH
Name: name, Length: 13212, dtype: object

#### cuDF

In [None]:
%time gdf['name'] = gdf['name'].str.title()

CPU times: user 15 ms, sys: 43.8 ms, total: 58.8 ms
Wall time: 58 ms


In [None]:
gdf.head()

Unnamed: 0,age,sex,county,lat,long,name
0,0.0,m,DARLINGTON,54.533644,-1.524401,Francis
1,0.0,m,DARLINGTON,54.426256,-1.465314,Edward
2,0.0,m,DARLINGTON,54.5552,-1.496417,Teddy
3,0.0,m,DARLINGTON,54.547906,-1.572341,Angus
4,0.0,m,DARLINGTON,54.477639,-1.605995,Charlie


#### Pandas

In [None]:
%time df['name'] = df['name'].str.title()

CPU times: user 20.4 s, sys: 1.89 s, total: 22.3 s
Wall time: 22.3 s


In [None]:
df.head()

Unnamed: 0,age,sex,county,lat,long,name
0,0.0,m,DARLINGTON,54.533644,-1.524401,Francis
1,0.0,m,DARLINGTON,54.426256,-1.465314,Edward
2,0.0,m,DARLINGTON,54.5552,-1.496417,Teddy
3,0.0,m,DARLINGTON,54.547906,-1.572341,Angus
4,0.0,m,DARLINGTON,54.477639,-1.605995,Charlie
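
One detail worth knowing about title-casing in Pandas is how it treats punctuation inside a name. The sketch below applies `str.title` to a few names taken from the `unique()` output above; it shows Pandas behavior only, and whether cuDF treats these characters identically is not verified here.

```python
import pandas as pd

# A few upper-case names from the unique() output shown earlier.
names = pd.Series(["FRANCIS", "A'ISHA", "A-JAY", "AABAN"])

# Pandas applies Python's str.title to each element, which starts a
# new capitalized "word" after any non-letter character.
titled = list(names.str.title())
print(titled)
```

In Pandas, the letter after an apostrophe or hyphen is capitalized, so `A'ISHA` becomes `A'Isha` rather than `A'isha`.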


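## Data subsetting with `loc` and `iloc`

cuDF also supports the core data subsetting tools `loc` (label-based locator) and `iloc` (integer-based locator).

### Range selection

Our data's labels happen to be incrementing numbers, so, just as in Pandas, `loc` includes both endpoints of the range passed to it, while `iloc` selects a half-open range that excludes the final value.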

In [None]:
gdf.loc[100:105]

Unnamed: 0,age,sex,county,lat,long,name
100,0.0,m,DARLINGTON,54.519527,-1.557723,Samuel
101,0.0,m,DARLINGTON,54.530248,-1.500405,Alden
102,0.0,m,DARLINGTON,54.51597,-1.628573,Samuel
103,0.0,m,DARLINGTON,54.543373,-1.664323,Muhammad
104,0.0,m,DARLINGTON,54.554589,-1.507385,Isaac
105,0.0,m,DARLINGTON,54.487209,-1.541073,Jayden


In [None]:
gdf.iloc[100:105]

Unnamed: 0,age,sex,county,lat,long,name
100,0.0,m,DARLINGTON,54.519527,-1.557723,Samuel
101,0.0,m,DARLINGTON,54.530248,-1.500405,Alden
102,0.0,m,DARLINGTON,54.51597,-1.628573,Samuel
103,0.0,m,DARLINGTON,54.543373,-1.664323,Muhammad
104,0.0,m,DARLINGTON,54.554589,-1.507385,Isaac
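
The difference between the two locators is easier to see when labels and positions do not coincide. Here is a minimal Pandas sketch with a made-up frame whose labels start at 100; the frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Labels run 100..105 while positions run 0..5, so the two differ.
frame = pd.DataFrame({"name": list("abcdef")}, index=range(100, 106))

# loc slices by label and includes both endpoints: labels 100..103.
by_label = frame.loc[100:103]

# iloc slices by position and excludes the stop value: positions 0, 1, 2.
by_position = frame.iloc[0:3]

print(len(by_label), len(by_position))
```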


### `loc` with boolean selection

We can use `loc` with boolean selections:

#### cuDF

In [None]:
# as of version 0.10, the startswith method returns a list, so we convert it back to a Series for efficiency
# in a future version, that method and other string methods will return a Series when appropriate
%time e_names = gdf.loc[cudf.Series(gdf['name'].str.startswith('E'))]
e_names.head()

CPU times: user 1.38 s, sys: 518 ms, total: 1.9 s
Wall time: 1.96 s


Unnamed: 0,age,sex,county,lat,long,name
1,0.0,m,DARLINGTON,54.426256,-1.465314,Edward
6,0.0,m,DARLINGTON,54.501872,-1.667874,Eamonn
34,0.0,m,DARLINGTON,54.483065,-1.501312,Ethan
45,0.0,m,DARLINGTON,54.640205,-1.558986,Elvin
49,0.0,m,DARLINGTON,54.57545,-1.600592,Edward


In [None]:
cudf.Series(gdf['name'].str.startswith('E'))

0           False
1            True
2           False
3           False
4           False
            ...  
58479889    False
58479890    False
58479891    False
58479892    False
58479893    False
Length: 58479894, dtype: bool

#### Pandas

In [None]:
%time e_names_pd = df.loc[df['name'].str.startswith('E')]

CPU times: user 17.5 s, sys: 648 ms, total: 18.2 s
Wall time: 18.2 s


### Combining with NumPy methods

We can combine cuDF with NumPy methods. Here we use `np.logical_and` for element-wise boolean selection.

#### cuDF

In [None]:
%time ed_names = gdf.loc[np.logical_and(gdf['name'].str.startswith('E'), \
                                        gdf['name'].str.endswith('d'))]
ed_names.head()

CPU times: user 4.41 s, sys: 361 ms, total: 4.77 s
Wall time: 4.89 s


Unnamed: 0,age,sex,county,lat,long,name
1,0.0,m,DARLINGTON,54.426256,-1.465314,Edward
49,0.0,m,DARLINGTON,54.57545,-1.600592,Edward
106,0.0,m,DARLINGTON,54.488042,-1.640927,Edward
145,0.0,m,DARLINGTON,54.49281,-1.509049,Edward
170,0.0,m,DARLINGTON,54.57792,-1.436109,Edward


For better performance, we can use CuPy instead of NumPy, which performs the element-wise boolean `logical_and` operation on the GPU.

In [None]:
import cupy as cp

In [None]:
%time ed_names = gdf.loc[cudf.Series(cp.logical_and(cudf.Series(gdf['name'].str.startswith('E')), \
                                                    cudf.Series(gdf['name'].str.endswith('d'))))]
ed_names.head()

CPU times: user 1.96 s, sys: 296 ms, total: 2.25 s
Wall time: 2.27 s


Unnamed: 0,age,sex,county,lat,long,name
1,0.0,m,DARLINGTON,54.426256,-1.465314,Edward
49,0.0,m,DARLINGTON,54.57545,-1.600592,Edward
106,0.0,m,DARLINGTON,54.488042,-1.640927,Edward
145,0.0,m,DARLINGTON,54.49281,-1.509049,Edward
170,0.0,m,DARLINGTON,54.57792,-1.436109,Edward


#### Pandas

In [None]:
%time ed_names_pd = df.loc[np.logical_and(df['name'].str.startswith('E'), df['name'].str.endswith('d'))]

CPU times: user 28 s, sys: 2.02 s, total: 30 s
Wall time: 30 s
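
The element-wise combination of two boolean masks can be sketched on a handful of made-up names. This example uses Pandas and NumPy so that it runs without a GPU, and it also shows the `&` operator, which in Pandas gives the same mask as `np.logical_and`. The sample names are assumptions, and the sketch makes no claim about how either spelling performs on cuDF.

```python
import numpy as np
import pandas as pd

# Made-up names; two of them start with "E" and end with "d".
names = pd.Series(["Edward", "Eamonn", "Ethan", "Teddy", "Edmund"])

starts_e = names.str.startswith("E")
ends_d = names.str.endswith("d")

# Two spellings of the same element-wise AND.
with_numpy = np.logical_and(starts_e, ends_d)
with_operator = starts_e & ends_d

# Either mask can be passed to loc to keep the matching rows.
selected = names.loc[with_operator]
print(list(selected))
```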


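## Exercise: basic data cleaning

In this exercise we ask you to use several of the techniques above to perform two simple data cleaning tasks:

1. Modify the data types of several columns
2. Convert string data into the format we want to use

### 1. Modify `dtypes`

Examine the `dtypes` of `gdf` and convert any 64-bit data types to their 32-bit counterparts.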

In [None]:
gdf.dtypes

age       float32
sex        object
county     object
lat       float64
long      float64
name       object
dtype: object

In [None]:
%time gdf['lat'] = gdf['lat'].astype('float32')

CPU times: user 1.04 ms, sys: 4.14 ms, total: 5.18 ms
Wall time: 5.53 ms


In [None]:
%time gdf['long'] = gdf['long'].astype('float32')

CPU times: user 3.59 ms, sys: 0 ns, total: 3.59 ms
Wall time: 2.73 ms


#### Solution

In [None]:
%load solutions/modify_dtypes

### 2. Convert the counties to title case

At the moment, all of the counties are in upper case:

In [None]:
gdf['county'].head()

0    DARLINGTON
1    DARLINGTON
2    DARLINGTON
3    DARLINGTON
4    DARLINGTON
Name: county, dtype: object

Convert them to title case, as we did with the `name` column.

In [None]:
%time gdf['county'] = gdf['county'].str.title()

CPU times: user 32 ms, sys: 52.5 ms, total: 84.5 ms
Wall time: 84.9 ms


In [None]:
gdf['county'].head()

0    Darlington
1    Darlington
2    Darlington
3    Darlington
4    Darlington
Name: county, dtype: object

#### Solution

In [None]:
%load solutions/title_case_counties

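## Exercise: counties north of Sunderland

This exercise requires the `loc` method and several of the techniques above. Find the latitude of the northernmost resident of Sunderland county (the person with the maximum `lat` value), and then determine which counties have residents north of that person. Use the `unique` method of a cuDF `Series` to remove duplicates from the result.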

In [None]:
gdf.county.unique()

0              Barking And Dagenham
1                            Barnet
2                          Barnsley
3      Bath And North East Somerset
4                           Bedford
                   ...             
166                       Wokingham
167                   Wolverhampton
168                  Worcestershire
169                         Wrexham
170                            York
Name: county, Length: 171, dtype: object

#### Solution

In [None]:
# %load solutions/counties_north_of_sunderland
sunderland_residents = gdf.loc[gdf['county'] == 'Sunderland']
northmost_sunderland_lat = sunderland_residents['lat'].max()
counties_with_pop_north_of = gdf.loc[gdf['lat'] > northmost_sunderland_lat]['county'].unique()


In [None]:
counties_with_pop_north_of

0          County Durham
1                Cumbria
2              Gateshead
3    Newcastle Upon Tyne
4         North Tyneside
5        North Yorkshire
6         Northumberland
7         South Tyneside
Name: county, dtype: object

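## Next

In the next section you will do real data preparation work for the machine learning models used later. As part of that work, you will create custom functions with CuPy, which can serve as a GPU-accelerated drop-in replacement for NumPy and deliver a substantial performance gain.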

<br>
<div align="center"><h2>Please restart the kernel</h2></div>