<a href="https://colab.research.google.com/github/yiruchen1993/nvidia_gtc_dli_rapids_2020/blob/section_notebooks%2Fdata_manipulation/1_04_grouping_sorting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

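# Grouping and Sorting with cuDF

In this notebook, we first introduce grouping and sorting with cuDF and compare their performance against Pandas, then consolidate what you have learned with a short data analysis exercise.

## Objectives

By the time you complete this notebook, you will be able to:

- Perform GPU-accelerated grouping and sorting operations with cuDF

## Imports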

In [None]:
import cudf
import pandas as pd

## Read Data

We once again read the UK population data, then return to timed comparisons with Pandas.

In [None]:
%time gdf = cudf.read_csv('./data/pop_1-04.csv', dtype=['float32', 'str', 'str', 'float32', 'float32', 'str'])

CPU times: user 3.95 s, sys: 6.49 s, total: 10.4 s
Wall time: 11.4 s


In [None]:
%time df = pd.read_csv('./data/pop_1-04.csv')

CPU times: user 26 s, sys: 3.38 s, total: 29.4 s
Wall time: 29.4 s


In [None]:
gdf.dtypes

age       float32
sex        object
county     object
lat       float32
long      float32
name       object
dtype: object
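
The cuDF read above passes explicit column dtypes, while the Pandas read infers them. As a minimal sketch of giving Pandas the same type hints, the snippet below reads a tiny in-memory CSV. The two rows are copied from the `head()` output in this notebook, and the column-to-dtype mapping (`dtypes`) and the name `small_df` are illustrative assumptions, not part of the workshop code.

```python
import io

import pandas as pd

# Tiny stand-in for the population file (assumed layout: header row plus six columns).
csv_text = """age,sex,county,lat,long,name
0.0,m,Darlington,54.533638,-1.5244,Francis
0.0,m,Darlington,54.426254,-1.465314,Edward
"""

# Hypothetical mapping mirroring the dtype list passed to cudf.read_csv above.
dtypes = {
    "age": "float32",
    "sex": "str",
    "county": "str",
    "lat": "float32",
    "long": "float32",
    "name": "str",
}

small_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(small_df.dtypes)
```

Keying the dtypes by column name, rather than by position, keeps the mapping valid if the column order in the file changes.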

In [None]:
gdf.shape

(58479894, 6)

In [None]:
gdf.head()

   age sex      county        lat       long     name
0  0.0   m  Darlington  54.533638    -1.5244  Francis
1  0.0   m  Darlington  54.426254  -1.465314   Edward
2  0.0   m  Darlington  54.555199  -1.496417    Teddy
3  0.0   m  Darlington  54.547905  -1.572341    Angus
4  0.0   m  Darlington  54.477638  -1.605995  Charlie


## Grouping and Sorting

### Grouping Records

Grouping records with cuDF works the same way as grouping records with Pandas.

#### cuDF

In [None]:
%%time
counties = gdf[['county', 'age']].groupby(['county'])
avg_ages = counties.mean()
print(avg_ages[:5])

                                    age
county                                 
Barking And Dagenham          33.056845
Barnet                        37.629770
Barnsley                      41.201061
Bath And North East Somerset  39.822837
Bedford                       39.715300
CPU times: user 943 ms, sys: 251 ms, total: 1.19 s
Wall time: 1.32 s


In [None]:
counties.count()

                                 age
county                              
Barking And Dagenham          211998
Barnet                        392140
Barnsley                      245199
Bath And North East Somerset  192106
Bedford                       171623
...                              ...
Wokingham                     167979
Wolverhampton                 262008
Worcestershire                592057
Wrexham                       136126


#### Pandas

In [None]:
%%time
counties_pd = df[['county', 'age']].groupby(['county'])
avg_ages_pd = counties_pd.mean()
print(avg_ages_pd[:5])

                                    age
county                                 
Barking And Dagenham          33.056845
Barnet                        37.629770
Barnsley                      41.201061
Bath And North East Somerset  39.822837
Bedford                       39.715300
CPU times: user 2.82 s, sys: 719 ms, total: 3.54 s
Wall time: 3.53 s
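
The cells above call `mean()` and `count()` on the grouped data separately. As a small sketch of the same idea in one step, the snippet below computes both aggregations with a single `agg` call. It uses Pandas with made-up toy records so it runs without a GPU. Because cuDF mirrors the Pandas grouping API, the same pattern is expected to work with `cudf`, though that is an assumption worth checking against your cuDF version.

```python
import pandas as pd

# Toy records, made up for illustration; not the workshop dataset.
people = pd.DataFrame({
    "county": ["Barnet", "Barnet", "Bedford", "Bedford", "Bedford"],
    "age": [30.0, 40.0, 10.0, 20.0, 60.0],
})

# Group by county, then compute the mean age and the number of records per group.
summary = people.groupby("county")["age"].agg(["mean", "count"])
print(summary)
```

The result has one row per county and one column per aggregation, which avoids assembling a second DataFrame by hand.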


### Sorting

Sorting is also very similar to Pandas, though cuDF does not support in-place sorting.
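
Without in-place sorting, the usual pattern is to bind the result of `sort_values` to a name, leaving the original object untouched. The sketch below illustrates that pattern with a Pandas Series so it runs without a GPU. The toy names and index labels are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration; index labels mimic original row positions.
names = pd.Series(["Teddy", "Angus", "Francis"], index=[10, 11, 12])

# sort_values returns a new, sorted Series; `names` itself keeps its original order.
sorted_names = names.sort_values()

print(sorted_names)
print(names)
```

Each value keeps its original index label after sorting, which is why the sorted outputs in this notebook show non-consecutive row numbers.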

#### cuDF

In [None]:
%time gdf_names = gdf['name'].sort_values()
print(gdf_names[:5]) # yes, "A" is an infrequent but correct given name in the UK, according to census data
print(gdf_names[-5:])

CPU times: user 979 ms, sys: 879 ms, total: 1.86 s
Wall time: 2.13 s
26850     A
154537    A
165578    A
211428    A
236972    A
Name: name, dtype: object
58060377    Zyrah
58289490    Zyrah
58363665    Zyrah
58388727    Zyrah
58394184    Zyrah
Name: name, dtype: object


#### Pandas

This operation takes a while with Pandas. Feel free to start the next exercise while you wait.

In [None]:
%time df_names = df['name'].sort_values()
print(df_names[:5])
print(df_names[-5:])

CPU times: user 1min 44s, sys: 1.29 s, total: 1min 45s
Wall time: 1min 45s
10811041    A
17931460    A
5060367     A
1842288     A
24866365    A
Name: name, dtype: object
47008072    Zyrah
47953653    Zyrah
31838209    Zyrah
53669567    Zyrah
54557840    Zyrah
Name: name, dtype: object


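## Exercise: Youngest Names

For this exercise you will need to use both `groupby` and `sort_values`.

We would like to know which names have the lowest average age and how many people have those names. Using the `mean` and `count` methods on the data grouped by name, identify the three names with the lowest mean age, along with their counts.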

In [None]:
%%time
name_groups = gdf[['name', 'age']].groupby(['name'])

name_ages = name_groups['age'].mean()
name_counts = name_groups['age'].count()

CPU times: user 53.7 ms, sys: 28.2 ms, total: 81.9 ms
Wall time: 81 ms


In [None]:
ages_counts = cudf.DataFrame()
ages_counts['mean_age'] = name_ages
ages_counts['count'] = name_counts

In [None]:
ages_counts = ages_counts.sort_values('mean_age')
ages_counts.iloc[:3]

              mean_age  count
Leart        34.911197    259
Luke-Junior  35.313725    255
Nameer       35.479675    246


#### Solution

In [None]:
# %load solutions/youngest_names
name_groups = gdf[['name', 'age']].groupby('name')

name_ages = name_groups['age'].mean()
name_counts = name_groups['age'].count()

ages_counts = cudf.DataFrame()
ages_counts['mean_age'] = name_ages
ages_counts['count'] = name_counts

ages_counts = ages_counts.sort_values('mean_age')
ages_counts.iloc[:3]
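
As a more compact variation on the solution above, the following sketch computes both aggregations in one `agg` call and selects the three lowest means with `nsmallest`. It uses Pandas with made-up toy data so it runs without a GPU. The named-aggregation form (`mean_age="mean"`) is Pandas syntax, and whether it carries over unchanged to your cuDF version is an assumption to verify.

```python
import pandas as pd

# Toy data, made up for illustration; not the workshop dataset.
people = pd.DataFrame({
    "name": ["Ava", "Ava", "Ben", "Ben", "Cy", "Dee"],
    "age": [2.0, 4.0, 30.0, 50.0, 10.0, 7.0],
})

# Mean age and record count per name, then the three names with the lowest mean age.
youngest = (
    people.groupby("name")["age"]
    .agg(mean_age="mean", count="count")
    .nsmallest(3, "mean_age")
)
print(youngest)
```

Using `nsmallest` avoids sorting every group when only the first few rows are needed.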


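## Next

To serve the larger data science goal of this workshop, we will work with data that reflects the entire UK road network. In the next notebook, you will learn additional cuDF techniques for transforming columnar data into graph edge data, which we will then use to build a GPU-accelerated graph with the `cuGraph` library.

<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>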