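# Fundamentals of Big Data Analysis

In this session we cover the basic tasks of data analysis: (1) loading the packages needed for the analysis, (2) reading in the data, (3) basic data processing, and (4) visualization. We use the files [athlete_events.csv and noc_regions.csv](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results) as a running example to explain these concepts and show how they work in practice.

## Loading the required packages

Most of the data we work with is structured data, that is, data in tabular form: a table has several columns, and each record consists of one value per column. Python programs usually rely on the pandas package to read and process structured data, on matplotlib and seaborn for visualization, and on numpy for numerical computation. The first step is therefore to install and load these four packages. The Anaconda distribution we use already ships with all four, so they only need to be imported.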

In [32]:
# Load the packages needed for data processing and visualization
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

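## Reading the data

### Reading the data files

pandas can load data from many kinds of text files, binary files, and databases; common sources include txt, csv, Excel spreadsheets, and MySQL. The Olympic data obtained from Kaggle comes as csv files, so we use the read_csv() function provided by pandas to read each file into the pandas data structure called a data frame (DataFrame).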

In [4]:
# Read the data files
data = pd.read_csv('athlete_events.csv')
regions = pd.read_csv('noc_regions.csv')
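
As a minimal sketch of what read_csv() returns, the example below parses a tiny in-memory CSV instead of the Kaggle file. The rows are copied from the first records of athlete_events.csv and trimmed to a few columns; io.StringIO and the variable names csv_text and sample are assumptions made only for this illustration.

```python
import io
import pandas as pd

# A tiny stand-in for athlete_events.csv (a few columns only, for illustration).
csv_text = """ID,Name,Sex,Age,Height,Weight
1,A Dijiang,M,24,180,80
2,A Lamusi,M,23,170,60
3,Gunnar Nielsen Aaby,M,24,,
"""

# read_csv accepts a file path or any file-like object and returns a DataFrame.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)  # the class of the returned object
print(sample.shape)           # (number of rows, number of columns)
print(sample.dtypes)          # columns with empty cells are read as floats
```

Empty cells become missing values (NaN), which is why Height and Weight are stored as floating-point numbers even though the recorded values are whole numbers.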

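### Inspecting the data frame

pandas offers several ways to browse and inspect a data frame, so that you can get a rough picture of the data before processing it.

1. head(n): show the first n records; if n is omitted, the first 5 are shown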

In [6]:
data.head(10)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
5,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres",
6,5,Christine Jacoba Aaftink,F,25.0,185.0,82.0,Netherlands,NED,1992 Winter,1992,Winter,Albertville,Speed Skating,Speed Skating Women's 500 metres,
7,5,Christine Jacoba Aaftink,F,25.0,185.0,82.0,Netherlands,NED,1992 Winter,1992,Winter,Albertville,Speed Skating,"Speed Skating Women's 1,000 metres",
8,5,Christine Jacoba Aaftink,F,27.0,185.0,82.0,Netherlands,NED,1994 Winter,1994,Winter,Lillehammer,Speed Skating,Speed Skating Women's 500 metres,
9,5,Christine Jacoba Aaftink,F,27.0,185.0,82.0,Netherlands,NED,1994 Winter,1994,Winter,Lillehammer,Speed Skating,"Speed Skating Women's 1,000 metres",


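Note: Python numbers records starting from 0 (the unnamed first column of the output is this row index).

The athlete_events data has 15 columns.

- ID - a unique number for each athlete.
- Name - the athlete's name.
- Sex - sex.
- Age - age.
- Height - height (in centimetres).
- Weight - weight (in kilograms).
- Team - team name.
- NOC - National Olympic Committee (3-letter code).
- Games - the Games attended (year and season).
- Year - year of participation.
- Season - season of participation (Summer or Winter).
- City - host city.
- Sport - sport.
- Event - event.
- Medal - medal (Gold, Silver, Bronze, or NA).

2. Size of the data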

In [9]:
data.shape

(271116, 15)

The data has 271,116 records and 15 columns.

3. Data types and values

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
ID        271116 non-null int64
Name      271116 non-null object
Sex       271116 non-null object
Age       261642 non-null float64
Height    210945 non-null float64
Weight    208241 non-null float64
Team      271116 non-null object
NOC       271116 non-null object
Games     271116 non-null object
Year      271116 non-null int64
Season    271116 non-null object
City      271116 non-null object
Sport     271116 non-null object
Event     271116 non-null object
Medal     39783 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB


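Five columns are numeric: two are integers (ID, Year) and three are floating-point numbers (Age, Height, Weight). The remaining ten hold text (dtype object).

Four columns have missing values: Age, Height, Weight, and Medal. Medal has the most, with only 39,783 non-null values out of 271,116 records.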
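
The missing-value counts can also be computed directly rather than read off the info() output. The following is a minimal sketch on a three-row toy frame (the frame toy and the names missing and missing_share are assumptions for illustration); the same two calls should work unchanged on data.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as athlete_events.csv.
toy = pd.DataFrame({
    "Name": ["A Dijiang", "Gunnar Nielsen Aaby", "Edgar Lindenau Aabye"],
    "Height": [180.0, np.nan, np.nan],
    "Medal": [None, None, "Gold"],
})

# isna() marks each missing cell with True; summing counts them per column.
missing = toy.isna().sum()

# The mean of the same Boolean mask is the share of missing values per column.
missing_share = toy.isna().mean()

print(missing)
print(missing_share)
```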

4. Summary statistics for the numeric columns

In [7]:
data.describe()

Unnamed: 0,ID,Age,Height,Weight,Year
count,271116.0,261642.0,210945.0,208241.0,271116.0
mean,68248.954396,25.556898,175.33897,70.702393,1978.37848
std,39022.286345,6.393561,10.518462,14.34802,29.877632
min,1.0,10.0,127.0,25.0,1896.0
25%,34643.0,21.0,168.0,60.0,1960.0
50%,68205.0,24.0,175.0,70.0,1988.0
75%,102097.25,28.0,183.0,79.0,2002.0
max,135571.0,97.0,226.0,214.0,2016.0


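According to these statistics, the youngest athlete was 10 years old and the oldest 97. (Please add descriptions for the other columns.)

5. The first 5 records of regions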

In [48]:
regions.head()

Unnamed: 0,NOC,region,notes
0,AFG,Afghanistan,
1,AHO,Curacao,Netherlands Antilles
2,ALB,Albania,
3,ALG,Algeria,
4,AND,Andorra,


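6. Merge the athlete data with the regions data

A left join on NOC keeps every athlete record and adds the matching region and notes columns. The output below already shows Height in metres and a BMI column because this cell was run after the cells in the section on modifying or adding columns (see the execution counts).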

In [49]:
data = data.merge(regions, how="left", on=["NOC"])
data.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,BMI,region,notes
0,1,A Dijiang,M,24.0,1.8,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,,24.691358,China,
1,2,A Lamusi,M,23.0,1.7,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,,20.761246,China,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,,,Denmark,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold,,Denmark,
4,5,Christine Jacoba Aaftink,F,21.0,1.85,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,,23.959094,Netherlands,
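
To illustrate what how="left" does when a code has no match, here is a small sketch on toy frames. The athlete "Hypothetical Athlete" and the code "XXX" are made up for the example, and indicator=True is an optional merge argument that the notebook itself does not use.

```python
import pandas as pd

athletes = pd.DataFrame({
    "Name": ["A Dijiang", "Gunnar Nielsen Aaby", "Hypothetical Athlete"],
    "NOC": ["CHN", "DEN", "XXX"],  # "XXX" is a made-up code with no match
})
noc = pd.DataFrame({
    "NOC": ["CHN", "DEN"],
    "region": ["China", "Denmark"],
})

# A left join keeps every row of the left frame; unmatched rows get NaN
# in the columns that come from the right frame.
# indicator=True adds a _merge column saying where each row was found.
merged = athletes.merge(noc, how="left", on="NOC", indicator=True)

print(merged)
```

Rows marked left_only in the _merge column are athletes whose NOC code is absent from the lookup table, which is a quick way to check the quality of a join.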


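## Data processing

### Selecting records that match a condition

pandas selects the records of a data frame that satisfy a **Boolean condition**.

1. Select the gold medalists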

In [10]:
data[data.Medal=="Gold"]

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
42,17,Paavo Johannes Aaltonen,M,28.0,175.0,64.0,Finland,FIN,1948 Summer,1948,Summer,London,Gymnastics,Gymnastics Men's Team All-Around,Gold
44,17,Paavo Johannes Aaltonen,M,28.0,175.0,64.0,Finland,FIN,1948 Summer,1948,Summer,London,Gymnastics,Gymnastics Men's Horse Vault,Gold
48,17,Paavo Johannes Aaltonen,M,28.0,175.0,64.0,Finland,FIN,1948 Summer,1948,Summer,London,Gymnastics,Gymnastics Men's Pommelled Horse,Gold
60,20,Kjetil Andr Aamodt,M,20.0,176.0,85.0,Norway,NOR,1992 Winter,1992,Winter,Albertville,Alpine Skiing,Alpine Skiing Men's Super G,Gold
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270981,135503,Zurab Zviadauri,M,23.0,182.0,90.0,Georgia,GEO,2004 Summer,2004,Summer,Athina,Judo,Judo Men's Middleweight,Gold
271009,135520,Julia Zwehl,F,28.0,167.0,60.0,Germany,GER,2004 Summer,2004,Summer,Athina,Hockey,Hockey Women's Hockey,Gold
271016,135523,"Ronald Ferdinand ""Ron"" Zwerver",M,29.0,200.0,93.0,Netherlands,NED,1996 Summer,1996,Summer,Atlanta,Volleyball,Volleyball Men's Volleyball,Gold
271049,135545,Henk Jan Zwolle,M,31.0,197.0,93.0,Netherlands,NED,1996 Summer,1996,Summer,Atlanta,Rowing,Rowing Men's Coxed Eights,Gold


2. Select the athletes from Taiwan

In [51]:
data[data.region=="Taiwan"]

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,BMI,region,notes
12572,6850,Bai Hui-Yun,F,23.0,1.57,48.0,Chinese Taipei-1,TPE,1996 Summer,1996,Summer,Atlanta,Table Tennis,Table Tennis Women's Doubles,,19.473407,Taiwan,
22826,12002,Mackenzie Blackburn,M,21.0,,,Chinese Taipei,TPE,2014 Winter,2014,Winter,Sochi,Short Track Speed Skating,Short Track Speed Skating Men's 500 metres,,,Taiwan,
22827,12002,Mackenzie Blackburn,M,21.0,,,Chinese Taipei,TPE,2014 Winter,2014,Winter,Sochi,Short Track Speed Skating,"Short Track Speed Skating Men's 1,000 metres",,,Taiwan,
38281,19674,Chan Fai-Hung,M,28.0,1.71,61.0,Chinese Taipei,TPE,1960 Summer,1960,Summer,Roma,Football,Football Men's Football,,20.861120,Taiwan,
38282,19675,Chan Hao-Ching,F,22.0,1.75,60.0,Chinese Taipei,TPE,2016 Summer,2016,Summer,Rio de Janeiro,Tennis,Tennis Women's Doubles,,19.591837,Taiwan,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
266664,133400,Yuan Shu-Chi,F,19.0,1.68,62.0,Chinese Taipei,TPE,2004 Summer,2004,Summer,Athina,Archery,Archery Women's Individual,,21.967120,Taiwan,
266665,133400,Yuan Shu-Chi,F,19.0,1.68,62.0,Chinese Taipei,TPE,2004 Summer,2004,Summer,Athina,Archery,Archery Women's Team,Bronze,21.967120,Taiwan,
266666,133400,Yuan Shu-Chi,F,23.0,1.68,62.0,Chinese Taipei,TPE,2008 Summer,2008,Summer,Beijing,Archery,Archery Women's Individual,,21.967120,Taiwan,
266667,133400,Yuan Shu-Chi,F,23.0,1.68,62.0,Chinese Taipei,TPE,2008 Summer,2008,Summer,Beijing,Archery,Archery Women's Team,,21.967120,Taiwan,


3. Select the gold medalists from Taiwan

In [52]:
data[(data.region=="Taiwan") & (data.Medal=="Gold")]

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,BMI,region,notes
39384,20263,Chen Shih-Hsien,F,25.0,1.66,46.0,Chinese Taipei,TPE,2004 Summer,2004,Summer,Athina,Taekwondo,Taekwondo Women's Flyweight,Gold,16.693279,Taiwan,
41413,21359,Chu Mu-Yen,M,22.0,1.75,58.0,Chinese Taipei,TPE,2004 Summer,2004,Summer,Athina,Taekwondo,Taekwondo Men's Flyweight,Gold,18.938776,Taiwan,
99899,50545,Hsu Shu-Ching,F,25.0,1.6,53.0,Chinese Taipei,TPE,2016 Summer,2016,Summer,Rio de Janeiro,Weightlifting,Weightlifting Women's Featherweight,Gold,20.703125,Taiwan,


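& is the Boolean "and" operator and | is "or". Each condition must be wrapped in parentheses, because & and | bind more tightly than ==.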
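
The operators can be tried out on a small frame. The sketch below uses four illustrative rows (the frame toy and the mask names are assumptions) and also shows ~ for "not" and isin() for matching any value in a list, neither of which is used elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Chu Mu-Yen", "Yuan Shu-Chi", "Julia Zwehl", "A Dijiang"],
    "region": ["Taiwan", "Taiwan", "Germany", "China"],
    "Medal": ["Gold", "Bronze", "Gold", None],
})

# Each comparison produces a Boolean Series, one value per row.
is_taiwan = toy["region"] == "Taiwan"
is_gold = toy["Medal"] == "Gold"

both = toy[is_taiwan & is_gold]      # AND: Taiwanese gold medalists
either = toy[is_taiwan | is_gold]    # OR: Taiwanese athletes or gold medalists
not_taiwan = toy[~is_taiwan]         # NOT: everyone who is not from Taiwan

# isin() keeps rows whose value matches any entry in the list.
gold_or_silver = toy[toy["Medal"].isin(["Gold", "Silver"])]

print(both)
```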

### Selecting some of the columns

In pandas, the columns to select are specified with **a list of column names**.

1. Names of all athletes

In [16]:
data[["Name"]]

Unnamed: 0,Name
0,A Dijiang
1,A Lamusi
2,Gunnar Nielsen Aaby
3,Edgar Lindenau Aabye
4,Christine Jacoba Aaftink
...,...
271111,Andrzej ya
271112,Piotr ya
271113,Piotr ya
271114,Tomasz Ireneusz ya
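
One detail worth noting is that a list with a single name still returns a data frame, while a bare column name returns a Series. A minimal sketch on a two-row toy frame (variable names are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["A Dijiang", "A Lamusi"],
    "Medal": [None, None],
})

as_frame = toy[["Name"]]   # list of names -> DataFrame with one column
as_series = toy["Name"]    # single name   -> Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```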


2. Names and medals of all athletes

In [18]:
data[["Name", "Medal"]]

Unnamed: 0,Name,Medal
0,A Dijiang,
1,A Lamusi,
2,Gunnar Nielsen Aaby,
3,Edgar Lindenau Aabye,Gold
4,Christine Jacoba Aaftink,
...,...,...
271111,Andrzej ya,
271112,Piotr ya,
271113,Piotr ya,
271114,Tomasz Ireneusz ya,


3. Names and medals of the athletes from Taiwan

In [53]:
data[data.region=="Taiwan"][["Name", "Medal"]]

Unnamed: 0,Name,Medal
12572,Bai Hui-Yun,
22826,Mackenzie Blackburn,
22827,Mackenzie Blackburn,
38281,Chan Fai-Hung,
38282,Chan Hao-Ching,
...,...,...
266664,Yuan Shu-Chi,
266665,Yuan Shu-Chi,Bronze
266666,Yuan Shu-Chi,
266667,Yuan Shu-Chi,
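
Filtering rows and then selecting columns can also be written as a single .loc call, which takes the row condition and the column list together. This is an alternative to the chained form used above, sketched here on toy rows (the frame and variable names are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bai Hui-Yun", "Yuan Shu-Chi", "A Dijiang"],
    "region": ["Taiwan", "Taiwan", "China"],
    "Medal": [None, "Bronze", None],
})

# Chained form, as in the notebook: filter rows first, then pick columns.
chained = toy[toy["region"] == "Taiwan"][["Name", "Medal"]]

# Equivalent .loc form: row condition and column list in one step.
with_loc = toy.loc[toy["region"] == "Taiwan", ["Name", "Medal"]]

print(with_loc)
```

Both forms give the same result for reading data; the .loc form is usually the safer choice when the selection is later used to assign values.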


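### Ordering records by the values in some columns (sorting)

The pandas method sort_values(\["X"\]) orders the records by the values in column X.

1. Order the records by the year of participation, from earliest to latest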

In [23]:
data.sort_values(["Year"])

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
214333,107607,Fritz Richard Gustav Schuft,M,19.0,,,Germany,GER,1896 Summer,1896,Summer,Athina,Gymnastics,Gymnastics Men's Pommelled Horse,
244717,122526,Pierre Alexandre Tuffri,M,19.0,,,France,FRA,1896 Summer,1896,Summer,Athina,Athletics,Athletics Men's Triple Jump,Silver
244716,122526,Pierre Alexandre Tuffri,M,19.0,,,France,FRA,1896 Summer,1896,Summer,Athina,Athletics,Athletics Men's Long Jump,
23912,12563,Conrad Helmut Fritz Bcker,M,25.0,,,Germany,GER,1896 Summer,1896,Summer,Athina,Gymnastics,Gymnastics Men's Horse Vault,
23913,12563,Conrad Helmut Fritz Bcker,M,25.0,,,Germany,GER,1896 Summer,1896,Summer,Athina,Gymnastics,Gymnastics Men's Parallel Bars,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142355,71419,Luis Fernando Lpez Erazo,M,37.0,166.0,60.0,Colombia,COL,2016 Summer,2016,Summer,Rio de Janeiro,Athletics,Athletics Men's 20 kilometres Walk,
47729,24610,Enrico D'Aniello,M,20.0,152.0,53.0,Italy,ITA,2016 Summer,2016,Summer,Rio de Janeiro,Rowing,Rowing Men's Coxed Eights,
47728,24609,Sabrina D'Angelo,F,23.0,173.0,71.0,Canada,CAN,2016 Summer,2016,Summer,Rio de Janeiro,Football,Football Women's Football,Bronze
47746,24621,Andrea Mitchell D'Arrigo,M,21.0,194.0,85.0,Italy,ITA,2016 Summer,2016,Summer,Rio de Janeiro,Swimming,Swimming Men's 200 metres Freestyle,


2. Order the records by the year of participation, from latest to earliest

In [25]:
data.sort_values(by=["Year"], ascending=[False])

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
198703,99780,Maximilian Reinelt,M,27.0,195.0,98.0,Germany,GER,2016 Summer,2016,Summer,Rio de Janeiro,Rowing,Rowing Men's Coxed Eights,Silver
90507,45859,Sophie Elizabeth Hansson,F,18.0,186.0,74.0,Sweden,SWE,2016 Summer,2016,Summer,Rio de Janeiro,Swimming,Swimming Women's 100 metres Breaststroke,
111165,56244,Arsen Julfalakyan,M,29.0,166.0,76.0,Armenia,ARM,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Men's Middleweight, Greco-Roman",
82309,41810,Lalonde Keida Gordon,M,27.0,179.0,83.0,Trinidad and Tobago,TTO,2016 Summer,2016,Summer,Rio de Janeiro,Athletics,Athletics Men's 4 x 400 metres Relay,
82308,41810,Lalonde Keida Gordon,M,27.0,179.0,83.0,Trinidad and Tobago,TTO,2016 Summer,2016,Summer,Rio de Janeiro,Athletics,Athletics Men's 400 metres,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115559,58546,Gyula Kellner,M,24.0,,,Hungary,HUN,1896 Summer,1896,Summer,Athina,Athletics,Athletics Men's Marathon,Bronze
97183,49185,Fritz Hofmann,M,24.0,167.0,56.0,Germany,GER,1896 Summer,1896,Summer,Athina,Athletics,Athletics Men's High Jump,
258338,129362,"Desiderius ""Dezs"" Wein (Boros)",M,23.0,,,Hungary,HUN,1896 Summer,1896,Summer,Athina,Gymnastics,Gymnastics Men's Horse Vault,
258339,129362,"Desiderius ""Dezs"" Wein (Boros)",M,23.0,,,Hungary,HUN,1896 Summer,1896,Summer,Athina,Gymnastics,Gymnastics Men's Parallel Bars,


With ascending=True, values are sorted from smallest to largest.
With ascending=False, values are sorted from largest to smallest.
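
The sketch below shows the effect of ascending on a toy column that contains a missing value. The single-letter names are placeholders, and na_position is an optional sort_values argument not used in the notebook; by default missing values are placed last in either direction.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],          # placeholder names
    "Height": [1.90, np.nan, 1.96, 1.57],
})

# Descending order; rows with a missing Height are placed at the end.
desc = toy.sort_values(by="Height", ascending=False)

# na_position="first" moves the rows with missing values to the top instead.
desc_nan_first = toy.sort_values(by="Height", ascending=False, na_position="first")

print(desc)
print(desc_nan_first)
```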

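3. Order the records of Taiwanese athletes by height (tallest first)

You can first filter the records that meet the condition and then sort them.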

In [54]:
data[data.region=="Taiwan"].sort_values(by=["Height"], ascending=[False])

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,BMI,region,notes
140719,70649,Liu Wei-Ting,M,21.0,1.96,81.0,Chinese Taipei,TPE,2016 Summer,2016,Summer,Rio de Janeiro,Taekwondo,Taekwondo Men's Welterweight,,21.084965,Taiwan,
38445,19770,Chang Ming-Huang,M,29.0,1.94,130.0,Chinese Taipei,TPE,2012 Summer,2012,Summer,London,Athletics,Athletics Men's Shot Put,,34.541397,Taiwan,
38444,19770,Chang Ming-Huang,M,25.0,1.94,130.0,Chinese Taipei,TPE,2008 Summer,2008,Summer,Beijing,Athletics,Athletics Men's Shot Put,,34.541397,Taiwan,
256473,128421,Wang Chien-Ming,M,24.0,1.90,90.0,Chinese Taipei,TPE,2004 Summer,2004,Summer,Athina,Baseball,Baseball Men's Baseball,,24.930748,Taiwan,
256729,128552,Wang Shao-An,M,19.0,1.90,80.0,Chinese Taipei,TPE,2004 Summer,2004,Summer,Athina,Swimming,Swimming Men's 50 metres Freestyle,,22.160665,Taiwan,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
256584,128474,Wang Jauo-Hueyi,M,22.0,,,Chinese Taipei,TPE,1988 Winter,1988,Winter,Calgary,Bobsleigh,Bobsleigh Men's Four,,,Taiwan,
263321,131774,Wu Chun-Tsai,M,,,,Chinese Taipei,TPE,1956 Summer,1956,Summer,Melbourne,Athletics,Athletics Men's Triple Jump,,,Taiwan,
263474,131839,Wu Yet-An,M,,,,Chinese Taipei,TPE,1956 Summer,1956,Summer,Melbourne,Basketball,Basketball Men's Basketball,,,Taiwan,
265049,132633,James Yap,M,,,,Chinese Taipei,TPE,1956 Summer,1956,Summer,Melbourne,Basketball,Basketball Men's Basketball,,,Taiwan,


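4. Order the records of Taiwanese athletes by height (descending) and weight (descending)

When sorting by two columns, pass both column names to by and one flag per column to ascending.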

In [55]:
data[data.region=="Taiwan"]\
.sort_values(by=["Height", "Weight"], ascending=[False, False])

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,BMI,region,notes
140719,70649,Liu Wei-Ting,M,21.0,1.96,81.0,Chinese Taipei,TPE,2016 Summer,2016,Summer,Rio de Janeiro,Taekwondo,Taekwondo Men's Welterweight,,21.084965,Taiwan,
38444,19770,Chang Ming-Huang,M,25.0,1.94,130.0,Chinese Taipei,TPE,2008 Summer,2008,Summer,Beijing,Athletics,Athletics Men's Shot Put,,34.541397,Taiwan,
38445,19770,Chang Ming-Huang,M,29.0,1.94,130.0,Chinese Taipei,TPE,2012 Summer,2012,Summer,London,Athletics,Athletics Men's Shot Put,,34.541397,Taiwan,
39382,20262,Chen Shih-Chieh,M,22.0,1.90,150.0,Chinese Taipei,TPE,2012 Summer,2012,Summer,London,Weightlifting,Weightlifting Men's Super-Heavyweight,,41.551247,Taiwan,
39383,20262,Chen Shih-Chieh,M,26.0,1.90,150.0,Chinese Taipei,TPE,2016 Summer,2016,Summer,Rio de Janeiro,Weightlifting,Weightlifting Men's Super-Heavyweight,,41.551247,Taiwan,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
256584,128474,Wang Jauo-Hueyi,M,22.0,,,Chinese Taipei,TPE,1988 Winter,1988,Winter,Calgary,Bobsleigh,Bobsleigh Men's Four,,,Taiwan,
263321,131774,Wu Chun-Tsai,M,,,,Chinese Taipei,TPE,1956 Summer,1956,Summer,Melbourne,Athletics,Athletics Men's Triple Jump,,,Taiwan,
263474,131839,Wu Yet-An,M,,,,Chinese Taipei,TPE,1956 Summer,1956,Summer,Melbourne,Basketball,Basketball Men's Basketball,,,Taiwan,
265049,132633,James Yap,M,,,,Chinese Taipei,TPE,1956 Summer,1956,Summer,Melbourne,Basketball,Basketball Men's Basketball,,,Taiwan,
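
The flags in ascending do not have to be the same for every column. As a small sketch with made-up rows (the letters are placeholder names), the following sorts by height from tallest to shortest and breaks ties by weight from lightest to heaviest:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["p", "q", "r", "s"],           # placeholder names
    "Height": [1.90, 1.94, 1.90, 1.94],
    "Weight": [150.0, 130.0, 80.0, 95.0],
})

# Height descending first; within equal heights, Weight ascending.
mixed = toy.sort_values(by=["Height", "Weight"], ascending=[False, True])

print(mixed)
```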


### Modifying or adding columns

1. Convert athlete height from centimetres to metres

In [30]:
data[["Height"]] = data[["Height"]]/100

In [31]:
# Check whether the first five records have changed
data.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,1.8,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,1.7,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,1.85,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
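
Overwriting Height in place means that running the cell a second time divides by 100 again. One way to avoid this, sketched below on toy data, is to write the converted values to a new column; the column name Height_m is a hypothetical choice and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Height": [180.0, 170.0, np.nan]})  # centimetres

# Writing to a new column keeps the original values, so re-running the
# statement gives the same result instead of dividing a second time.
toy["Height_m"] = toy["Height"] / 100
toy["Height_m"] = toy["Height"] / 100  # running it again changes nothing

print(toy)
```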


2. Compute each athlete's BMI

The pandas assign() method adds a new column. BMI is weight in kilograms divided by the square of height in metres.

In [40]:
data = data.assign(BMI = data["Weight"]/np.power(data["Height"], 2.0))

In [41]:
# Check whether the first five records have changed
data.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,BMI
0,1,A Dijiang,M,24.0,1.8,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,,24.691358
1,2,A Lamusi,M,23.0,1.7,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,,20.761246
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold,
4,5,Christine Jacoba Aaftink,F,21.0,1.85,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,,23.959094
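
The same calculation can be checked by hand on a few rows. The sketch below uses the first three athletes from the output above with Height already in metres; passing a lambda to assign() is an alternative style, not what the notebook does, and lets the new column refer to the frame it is being added to.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name": ["A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby"],
    "Height": [1.80, 1.70, np.nan],  # metres
    "Weight": [80.0, 60.0, np.nan],  # kilograms
})

# BMI = weight (kg) / height (m) squared; the lambda receives the DataFrame.
toy = toy.assign(BMI=lambda d: d["Weight"] / d["Height"] ** 2)

print(toy)
```

Rows with a missing height or weight get a missing BMI, which matches the empty BMI cells in the notebook output.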


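### Grouping and summarizing data

The pandas groupby() method splits the records into groups according to the values in a column.
The agg() method summarizes the records in each group: number of values (count), number of distinct values (nunique), maximum (max), minimum (min), mean (mean), and sum (sum).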
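
To make the difference between count and nunique concrete, the following sketch groups a toy frame in which athlete 5 entered two events at each of two Games, mirroring the repeated ID 5 in the head() output. The frame and the name summary are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [5, 5, 5, 5, 17],
    "Games": ["1988 Winter", "1988 Winter", "1992 Winter",
              "1992 Winter", "1948 Summer"],
})

# count tallies rows in each group; nunique tallies distinct ID values.
summary = toy.groupby("Games").agg({"ID": ["count", "nunique"]})

print(summary)
```

For 1988 Winter the count is 2 (two entries) while nunique is 1 (one athlete).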

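1. Count the athlete entries at each Olympic Games

Each row is one athlete in one event, so count on ID gives the number of entries; nunique would give the number of distinct athletes.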

In [43]:
data.groupby("Games").agg({"ID": ["count"]}).reset_index()

Unnamed: 0,Games,ID count
0,1896 Summer,380
1,1900 Summer,1936
2,1904 Summer,1301
3,1906 Summer,1733
4,1908 Summer,3101
5,1912 Summer,4040
6,1920 Summer,4292
7,1924 Summer,5233
8,1924 Winter,460
9,1928 Summer,4992
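
The two-level column header in this output comes from passing a dictionary of lists to agg(). A possible alternative, sketched here on toy data, is named aggregation, which produces a single header row; the output column names entries and athletes are hypothetical choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Games": ["1896 Summer", "1896 Summer", "1896 Summer", "1900 Summer"],
})

# Named aggregation: new_column=(source_column, function).
flat = (
    toy.groupby("Games")
       .agg(entries=("ID", "count"), athletes=("ID", "nunique"))
       .reset_index()
)

print(flat)
```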


2. Count the athlete entries from Taiwan at each Summer Olympics

In [44]:
data[(data.region=="Taiwan") & (data.Season=="Summer")]\
.groupby("Games").agg({"ID": ["count"]}).reset_index()

Unnamed: 0,Games,ID count
0,1956 Summer,23
1,1960 Summer,31
2,1964 Summer,96
3,1968 Summer,93
4,1972 Summer,44
5,1984 Summer,52
6,1988 Summer,119
7,1992 Summer,38
8,1996 Summer,103
9,2000 Summer,71


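3. Count the medal records won by each country at each Summer Olympics

Each member of a medal-winning team has a separate row, so a team medal is counted once per team member.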

In [56]:
data[(data.Season=="Summer") & (data.Medal.notnull())]\
.groupby(["Games", "region"]).agg({"ID": ["count"]}).reset_index()

Unnamed: 0,Games,region,ID count
0,1896 Summer,Australia,3
1,1896 Summer,Austria,5
2,1896 Summer,Denmark,6
3,1896 Summer,France,11
4,1896 Summer,Germany,32
...,...,...,...
1259,2016 Summer,Ukraine,15
1260,2016 Summer,United Arab Emirates,1
1261,2016 Summer,Uzbekistan,13
1262,2016 Summer,Venezuela,3
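
If the goal is to count a team medal once rather than once per team member, one possible approach is to drop duplicate combinations of Games, region, Event, and Medal before grouping. The sketch below uses illustrative rows loosely modelled on the 2004 records shown earlier; it is an approximation, since it would also merge two separate medals of the same type won by one country in the same event.

```python
import pandas as pd

toy = pd.DataFrame({
    "Games":  ["2004 Summer"] * 4,
    "region": ["Taiwan"] * 4,
    "Event":  ["Archery Women's Team"] * 3 + ["Taekwondo Men's Flyweight"],
    "Medal":  ["Bronze", "Bronze", "Bronze", "Gold"],
})

# One row per medal-winning athlete: the team bronze is counted three times.
per_athlete = toy.groupby(["Games", "region"]).size()

# One row per (Games, region, Event, Medal): the team bronze is counted once.
per_event = (
    toy.drop_duplicates(subset=["Games", "region", "Event", "Medal"])
       .groupby(["Games", "region"])
       .size()
)

print(per_athlete)
print(per_event)
```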


## Data visualization