# Becoming a Data Analyst | Python and Data Science Applications

> Pandas 102: A Python Package for Working with Tabular Data

## 郭耀仁

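## Outline

- Advanced DataFrame operations

## Advanced DataFrame Operations

## Not-So-Basic DataFrame Operations

- Handling missing values
- Multi-level indexes
- Reshaping
- Merging

## Pandas has four commonly used methods for detecting, dropping, and filling missing values: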

- `.isnull()`
- `.notnull()`
- `.dropna()`
- `.fillna()`

## The .isnull() method returns a boolean Series that marks missing values as True and non-missing values as False; .notnull() returns the exact opposite

In [1]:
import pandas as pd
import numpy as np

ser = pd.Series([5, None, 6, np.nan]) # np.NaN was removed in NumPy 2.0; np.nan works in every version
print(ser.isnull())
print("\n")
print(ser.notnull())

0    False
1     True
2    False
3     True
dtype: bool


0     True
1    False
2     True
3    False
dtype: bool
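
Because True counts as 1 when summed, the boolean output can be turned into a count of missing values or used as a mask. The following is a minimal sketch that reuses the same illustrative values; the names `n_missing`, `n_present`, and `non_missing` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

ser = pd.Series([5, None, 6, np.nan])

# Summing a boolean Series counts its True entries
n_missing = ser.isnull().sum()
n_present = ser.notnull().sum()
print(n_missing, n_present)

# The boolean Series also works as a mask that keeps only non-missing values
non_missing = ser[ser.notnull()]
print(non_missing)
```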


## The .dropna() method discards the missing values and returns only the non-missing data

In [2]:
ser = pd.Series([5, None, 6, np.nan])
print(ser)
print("\n")
ser.dropna()

0    5.0
1    NaN
2    6.0
3    NaN
dtype: float64




0    5.0
2    6.0
dtype: float64

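## For a Series, .dropna() works intuitively. A DataFrame, however, cannot drop a single data point; we can only drop an entire row (observation) or an entire column (variable). Pass axis=0 to drop rows (the default) or axis=1 to drop columns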

In [3]:
df = pd.DataFrame([
    [1,      np.nan, 7.],
    [2,      5,      8.],
    [np.nan, 6,      9.]
])
df

Unnamed: 0,0,1,2
0,1.0,,7.0
1,2.0,5.0,8.0
2,,6.0,9.0


In [4]:
df.dropna() # default dropping rows with any NaN

Unnamed: 0,0,1,2
1,2.0,5.0,8.0


In [5]:
df.dropna(axis=1) # dropping columns with any NaN

Unnamed: 0,2
0,7.0
1,8.0
2,9.0
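
Dropping every row or column that contains any NaN can discard more data than intended. The sketch below shows three other `.dropna()` parameters (`how`, `subset`, and `thresh`) on a similar frame; the extra all-NaN row and the result names are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1,      np.nan, 7.],
    [2,      5,      8.],
    [np.nan, 6,      9.],
    [np.nan, np.nan, np.nan],  # hypothetical row with no data at all
])

# how="all" drops a row only when every value in it is missing
dropped_all = df.dropna(how="all")

# subset restricts the missing-value check to the listed columns
dropped_subset = df.dropna(subset=[0])

# thresh keeps rows that have at least that many non-missing values
dropped_thresh = df.dropna(thresh=3)

print(dropped_all)
print(dropped_subset)
print(dropped_thresh)
```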


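## We may prefer to fill missing values rather than drop them; Pandas provides the .fillna() method, which returns the data with NaN replaced by a specified value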

In [6]:
ser = pd.Series([5, None, 6, np.nan])
print(ser)
print("\n")
ser.fillna(5566)

0    5.0
1    NaN
2    6.0
3    NaN
dtype: float64




0       5.0
1    5566.0
2       6.0
3    5566.0
dtype: float64
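
The fill value does not have to be a constant like 5566. As a sketch, it can be a statistic computed from the data, and for a DataFrame a dict can supply a different value per column; the column names `a` and `b` and the fill values here are hypothetical.

```python
import numpy as np
import pandas as pd

ser = pd.Series([5, None, 6, np.nan])

# .mean() skips NaN by default, so this fills with the mean of 5 and 6
filled_with_mean = ser.fillna(ser.mean())
print(filled_with_mean)

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

# A dict maps each column name to its own fill value
filled_per_column = df.fillna({"a": 0.0, "b": -1.0})
print(filled_per_column)
```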

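## Besides filling with a specified value, missing values can be filled from neighboring data: .ffill() (forward fill) replaces a missing value with the previous observation, and .bfill() (backward fill) does the opposite and uses the next observation. Older code passes method='ffill' or method='bfill' to .fillna(), which has been deprecated since pandas 2.1

In [7]:
ser = pd.Series([5, None, 6, np.nan, 7])
print(ser)
print("\n")
print(ser.ffill()) # forward fill: carry the previous value forward
print("\n")
print(ser.bfill()) # backward fill: pull the next value backward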

0    5.0
1    NaN
2    6.0
3    NaN
4    7.0
dtype: float64


0    5.0
1    5.0
2    6.0
3    6.0
4    7.0
dtype: float64


0    5.0
1    6.0
2    6.0
3    7.0
4    7.0
dtype: float64
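
Forward and backward filling have edge cases worth knowing. The sketch below uses the `.ffill()` and `.bfill()` methods on a hypothetical Series that starts with a missing value, and also shows the `limit` parameter, which caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, 5, np.nan, np.nan, 7])

# A leading NaN has no earlier value to copy, so forward fill leaves it as NaN
forward = ser.ffill()

# limit=1 fills at most one consecutive NaN after each valid value
forward_limited = ser.ffill(limit=1)

# Backward fill can fill the leading NaN because a later value exists
backward = ser.bfill()

print(forward)
print(forward_limited)
print(backward)
```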


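## Applying the grouped-summary technique to a DataFrame sometimes yields a Series with a more complex index. For example, passing two or more categorical variables to .groupby() as grouping keys produces an index of the class MultiIndex, a so-called multi-level index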

In [8]:
player_profile = pd.read_csv("https://python4ds.s3-ap-northeast-1.amazonaws.com/player_profile.csv")
groupby_object = player_profile.groupby(["pos", "country"])
print(groupby_object["heightMeters"].mean()) # Average height by pos and country
print(type(groupby_object["heightMeters"].mean().index))

pos  country               
C    Austria                   2.130000
     Bahamas                   2.160000
     Bosnia and Herzegovina    2.130000
     Canada                    2.060000
     Croatia                   2.135000
                                 ...   
G-F  France                    2.006667
     Italy                     1.960000
     Japan                     2.060000
     Turkey                    2.010000
     USA                       1.993214
Name: heightMeters, Length: 88, dtype: float64
<class 'pandas.core.indexes.multi.MultiIndex'>
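
The same behavior can be reproduced without downloading the CSV file. The sketch below uses a small made-up table (the rows and heights are hypothetical, only the column names follow the example above) and also shows `.reset_index()`, which turns the index levels back into ordinary columns when a flat table is easier to work with.

```python
import pandas as pd

# Hypothetical toy data standing in for the player profile table
players = pd.DataFrame({
    "pos": ["G", "G", "C", "C", "G"],
    "country": ["USA", "USA", "USA", "Serbia", "France"],
    "heightMeters": [1.90, 1.96, 2.10, 2.11, 1.93],
})

# Grouping by two keys produces a Series indexed by a two-level MultiIndex
mean_height = players.groupby(["pos", "country"])["heightMeters"].mean()
print(type(mean_height.index))
print(mean_height.index.names)

# reset_index() moves both index levels back into regular columns
flat = mean_height.reset_index()
print(flat)
```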


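## For a Series with a multi-level index, the values are still extracted with the .values attribute. Selecting by index is more involved: as with a multi-dimensional array, use [m, n, ...] to pick the data you need. In the example above, ["G", "USA"] gives the average height of guards (G) from the USA; for swingmen (G-F, F-G) from the USA, use [["G-F", "F-G"]][:, "USA"]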

In [9]:
player_profile = pd.read_csv("https://python4ds.s3-ap-northeast-1.amazonaws.com/player_profile.csv")
groupby_object = player_profile.groupby(["pos", "country"])
ser_w_multi_index = groupby_object["heightMeters"].mean() # Average height by pos and country
print(ser_w_multi_index.values) # values attribute of a multi-index series
print(ser_w_multi_index["G", "USA"]) # average height of USA's guards
print(ser_w_multi_index[["G-F", "F-G"]][:, "USA"]) # average heights of USA's swingmen

[2.13       2.16       2.13       2.06       2.135      2.06
 2.13333333 2.11       2.11       2.13       2.13       2.13
 2.2        2.17       2.135      2.18       2.08       2.11
 2.11961538 2.16       2.08       2.08       2.08       2.08
 2.13       2.09       2.06       2.08       2.055      2.06
 2.06       2.07       2.082      1.98       2.13       2.00666667
 2.06       2.07       2.06666667 2.06       2.11       2.07
 2.03       2.07       2.06       2.045      2.07       2.06
 2.085      2.06       2.06       2.055      2.03795455 1.995
 2.08       2.13       2.11       2.13       2.21       2.09933333
 2.02       2.03       2.01       1.98636364 1.92       1.93
 1.85       1.94142857 1.96       1.97       1.94       2.03
 1.93       1.98       1.9        1.93       1.9216763  2.03
 1.98       2.045      2.06       1.98       2.01       2.00666667
 1.96       2.06       2.01       1.99321429]
1.9216763005780348
pos
F-G    1.986364
G-F    1.993214
Name: heightMeters, dtype: float64
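
Chained brackets work, but the same selections can be written more explicitly with `.loc` and `.xs()`. This is a sketch on a hand-built Series; the index entries and heights are hypothetical values chosen for illustration, not figures from the data set.

```python
import pandas as pd

# Hypothetical multi-level Series with the same level names as the example above
index = pd.MultiIndex.from_tuples(
    [("C", "USA"), ("F-G", "USA"), ("G", "France"), ("G", "USA"), ("G-F", "USA")],
    names=["pos", "country"],
)
heights = pd.Series([2.12, 1.98, 1.93, 1.92, 2.00], index=index, name="heightMeters")

# A tuple passed to .loc selects one value per level
usa_guards = heights.loc[("G", "USA")]

# .xs() takes a cross-section on a named level and drops that level
usa_by_pos = heights.xs("USA", level="country")

# The result has a single-level index, so a plain list of labels selects swingmen
swingmen = usa_by_pos.loc[["G-F", "F-G"]]

print(usa_guards)
print(usa_by_pos)
print(swingmen)
```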

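## A common reshaping task is converting between wide format and long format

## Wide format is the more familiar DataFrame layout: each row is an independent observation, and new information is added as new columns, hence the name. Long format is less familiar: one key column paired with one value column records each item and its value, and new information is added as new rows, hence the name

## Most of the data we work with is in wide format, such as the NBA player profiles, where each row is one unique player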

In [10]:
player_profile = pd.read_csv("https://python4ds.s3-ap-northeast-1.amazonaws.com/player_profile.csv")
wide_format = player_profile[["temporaryDisplayName", "heightMeters", "weightKilograms"]]
wide_format.head()

Unnamed: 0,temporaryDisplayName,heightMeters,weightKilograms
0,"Adams, Steven",2.13,120.2
1,"Adebayo, Bam",2.08,115.7
2,"Adel, Deng",2.01,90.7
3,"Aldridge, LaMarcus",2.11,117.9
4,"Alexander, Kyle",2.11,99.8


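## Converting wide format to long format means using one variable (key) to record whether a measurement is height or weight, and another variable (value) to record the measurement itself; we can use the pd.melt() function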

In [11]:
player_profile = pd.read_csv("https://python4ds.s3-ap-northeast-1.amazonaws.com/player_profile.csv")
wide_format = player_profile[["temporaryDisplayName", "heightMeters", "weightKilograms"]]
long_format = pd.melt(wide_format, id_vars="temporaryDisplayName", value_vars=["heightMeters", "weightKilograms"], var_name="key", value_name="value")
long_format.sort_values("temporaryDisplayName").head(10)

Unnamed: 0,temporaryDisplayName,key,value
0,"Adams, Steven",heightMeters,2.13
524,"Adams, Steven",weightKilograms,120.2
525,"Adebayo, Bam",weightKilograms,115.7
1,"Adebayo, Bam",heightMeters,2.08
2,"Adel, Deng",heightMeters,2.01
526,"Adel, Deng",weightKilograms,90.7
3,"Aldridge, LaMarcus",heightMeters,2.11
527,"Aldridge, LaMarcus",weightKilograms,117.9
4,"Alexander, Kyle",heightMeters,2.11
528,"Alexander, Kyle",weightKilograms,99.8



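## Converting long format back to wide format involves an operation similar to grouping: using the player name as the grouping key, pivot the values back into two variables with the DataFrame .pivot() method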

In [13]:
player_profile = pd.read_csv("https://python4ds.s3-ap-northeast-1.amazonaws.com/player_profile.csv")
wide_format = player_profile[["temporaryDisplayName", "heightMeters", "weightKilograms"]]
long_format = pd.melt(wide_format, id_vars="temporaryDisplayName", value_vars=["heightMeters", "weightKilograms"], var_name="key", value_name="value")
long_format.pivot(index="temporaryDisplayName", columns="key", values="value").head()

key,heightMeters,weightKilograms
temporaryDisplayName,,
"Adams, Steven",2.13,120.2
"Adebayo, Bam",2.08,115.7
"Adel, Deng",2.01,90.7
"Aldridge, LaMarcus",2.11,117.9
"Alexander, Kyle",2.11,99.8


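## Finally, a little cleanup: calling .reset_index() and removing the name of the column index restores the pivoted table to the same layout as the original wide table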

In [14]:
player_profile = pd.read_csv("https://python4ds.s3-ap-northeast-1.amazonaws.com/player_profile.csv")
wide_format = player_profile[["temporaryDisplayName", "heightMeters", "weightKilograms"]]
long_format = pd.melt(wide_format, id_vars="temporaryDisplayName", value_vars=["heightMeters", "weightKilograms"], var_name="key", value_name="value")
wide_format = long_format.pivot(index="temporaryDisplayName", columns="key", values="value").reset_index()
wide_format = wide_format.rename_axis(None, axis=1)
wide_format.head()

Unnamed: 0,temporaryDisplayName,heightMeters,weightKilograms
0,"Adams, Steven",2.13,120.2
1,"Adebayo, Bam",2.08,115.7
2,"Adel, Deng",2.01,90.7
3,"Aldridge, LaMarcus",2.11,117.9
4,"Alexander, Kyle",2.11,99.8
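
One detail of the round trip is that `.pivot()` returns its row labels and its columns in sorted order, so the restored table matches the original ordering only when the original was already sorted, as the player names above happen to be. A minimal sketch with a hypothetical two-row table whose rows and columns are deliberately out of order:

```python
import pandas as pd

# Hypothetical wide table with rows and value columns in non-alphabetical order
wide = pd.DataFrame({
    "name": ["Baker", "Adams"],
    "weight": [115.7, 120.2],
    "height": [2.08, 2.13],
})

# Wide to long: every non-id column becomes a key/value pair
long_format = pd.melt(wide, id_vars="name", var_name="key", value_name="value")

# Long to wide, then tidy up the index and the column-axis name
restored = (
    long_format.pivot(index="name", columns="key", values="value")
    .reset_index()
    .rename_axis(None, axis=1)
)

print(restored)  # rows sorted by name, value columns sorted alphabetically
```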


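## Pandas has four commonly used functions or methods for combining data from different sources

- `pd.concat()`
- `df.append()` (deprecated in pandas 1.4 and removed in pandas 2.0; use `pd.concat()` instead)
- `pd.merge()`
- `df.join()`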

## Simple concatenation with pd.concat()

In [15]:
upper_df = pd.DataFrame()
lower_df = pd.DataFrame()
upper_df["character"] = ["Rachel Green", "Monica Geller", "Phoebe Buffay"]
upper_df["cast"] = ["Jennifer Aniston", "Courteney Cox", "Lisa Kudrow"]
lower_df["character"] = ["Joey Tribbiani", "Chandler Bing", "Ross Geller"]
lower_df["cast"] = ["Matt LeBlanc", "Matthew Perry", "David Schwimmer"]

In [16]:
print("Upper df:")
upper_df
print("Lower df:")
lower_df
print("Concatenated vertically:")
pd.concat([upper_df, lower_df]) # axis=0 as default

Upper df:
Lower df:
Concatenated vertically:


Unnamed: 0,character,cast
0,Rachel Green,Jennifer Aniston
1,Monica Geller,Courteney Cox
2,Phoebe Buffay,Lisa Kudrow
0,Joey Tribbiani,Matt LeBlanc
1,Chandler Bing,Matthew Perry
2,Ross Geller,David Schwimmer


## The concatenated DataFrame has duplicated row index labels; to reset the row index, pass ignore_index=True to pd.concat()
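
A short sketch of the difference, using a shortened version of the same cast lists; the names `stacked_default` and `stacked_reset` are assumptions made for this example.

```python
import pandas as pd

upper_df = pd.DataFrame({
    "character": ["Rachel Green", "Monica Geller"],
    "cast": ["Jennifer Aniston", "Courteney Cox"],
})
lower_df = pd.DataFrame({
    "character": ["Joey Tribbiani", "Chandler Bing"],
    "cast": ["Matt LeBlanc", "Matthew Perry"],
})

# Default behavior keeps each frame's own row labels, so 0 and 1 repeat
stacked_default = pd.concat([upper_df, lower_df])
print(stacked_default.index.tolist())

# ignore_index=True discards the old labels and numbers the rows from 0
stacked_reset = pd.concat([upper_df, lower_df], ignore_index=True)
print(stacked_reset.index.tolist())
```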

## Passing axis=1 concatenates horizontally

In [17]:
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["character"] = ["Rachel Green", "Monica Geller", "Phoebe Buffay", "Joey Tribbiani", "Chandler Bing", "Ross Geller"]
right_df["cast"] = ["Jennifer Aniston", "Courteney Cox", "Lisa Kudrow", "Matt LeBlanc", "Matthew Perry", "David Schwimmer"]

In [18]:
print("Left df:")
left_df
print("Right df:")
right_df
print("Concatenated horizontally:")
pd.concat([left_df, right_df], axis=1)

Left df:
Right df:
Concatenated horizontally:


Unnamed: 0,character,cast
0,Rachel Green,Jennifer Aniston
1,Monica Geller,Courteney Cox
2,Phoebe Buffay,Lisa Kudrow
3,Joey Tribbiani,Matt LeBlanc
4,Chandler Bing,Matthew Perry
5,Ross Geller,David Schwimmer
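
Horizontal concatenation lines rows up by their index labels, not by position. The example above works because both frames share the labels 0 through 5. The sketch below uses hypothetical frames with only partly overlapping labels to show what happens otherwise, along with the `join="inner"` option that keeps only shared labels.

```python
import pandas as pd

# Hypothetical frames whose row labels overlap only at label 1
left_df = pd.DataFrame({"character": ["Rachel Green", "Monica Geller"]}, index=[0, 1])
right_df = pd.DataFrame({"cast": ["Courteney Cox", "Lisa Kudrow"]}, index=[1, 2])

# By default the result covers every label and fills the gaps with NaN
side_by_side = pd.concat([left_df, right_df], axis=1)
print(side_by_side)

# join="inner" keeps only the labels present in both frames
inner_only = pd.concat([left_df, right_df], axis=1, join="inner")
print(inner_only)
```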


## Vertical concatenation with df.append() (removed in pandas 2.0; use pd.concat() instead)

In [19]:
upper_df = pd.DataFrame()
lower_df = pd.DataFrame()
upper_df["character"] = ["Rachel Green", "Monica Geller", "Phoebe Buffay"]
upper_df["cast"] = ["Jennifer Aniston", "Courteney Cox", "Lisa Kudrow"]
lower_df["character"] = ["Joey Tribbiani", "Chandler Bing", "Ross Geller"]
lower_df["cast"] = ["Matt LeBlanc", "Matthew Perry", "David Schwimmer"]

In [20]:
print("Upper df:")
upper_df
print("Lower df:")
lower_df
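print("Concatenated vertically using pd.concat() in place of append:")
pd.concat([upper_df, lower_df]) # df.append() was removed in pandas 2.0; pd.concat() gives the same result

Upper df:
Lower df:
Concatenated vertically using pd.concat() in place of append: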


Unnamed: 0,character,cast
0,Rachel Green,Jennifer Aniston
1,Monica Geller,Courteney Cox
2,Phoebe Buffay,Lisa Kudrow
0,Joey Tribbiani,Matt LeBlanc
1,Chandler Bing,Matthew Perry
2,Ross Geller,David Schwimmer


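## Joining with pd.merge()

For efficient joins and merges like those on relational database tables, the main function in Pandas is pd.merge(). It follows relational algebra, the formal rules behind relational databases, and implements the four basic kinds of join:

- One-to-one join
- One-to-many join
- Many-to-one join
- Many-to-many join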

In [21]:
# One-to-one join
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War", "Avengers: Endgame"]
left_df["release_year"] = [2012, 2015, 2018, 2019]
right_df["title"] = ["Avengers: Infinity War", "Avengers: Endgame", "The Avengers", "Avengers: Age of Ultron"]
right_df["rating"] = [8.5, 8.6, 8.5, 7.3]
left_df
right_df
pd.merge(left_df, right_df)

Unnamed: 0,title,release_year,rating
0,The Avengers,2012,8.5
1,Avengers: Age of Ultron,2015,7.3
2,Avengers: Infinity War,2018,8.5
3,Avengers: Endgame,2019,8.6


In [22]:
# One-to-many join
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers"]
left_df["release_year"] = [2012]
right_df["title"] = ["The Avengers", "The Avengers", "The Avengers"]
right_df["genre"] = ["Action", "Adventure", "Sci-Fi"]
left_df
right_df
pd.merge(left_df, right_df)

Unnamed: 0,title,release_year,genre
0,The Avengers,2012,Action
1,The Avengers,2012,Adventure
2,The Avengers,2012,Sci-Fi


In [23]:
# Many-to-one join
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers", "The Avengers", "The Avengers"]
left_df["genre"] = ["Action", "Adventure", "Sci-Fi"]
right_df["title"] = ["The Avengers"]
right_df["release_year"] = [2012]
left_df
right_df
pd.merge(left_df, right_df)

Unnamed: 0,title,genre,release_year
0,The Avengers,Action,2012
1,The Avengers,Adventure,2012
2,The Avengers,Sci-Fi,2012


In [24]:
# Many-to-many join
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers", "The Avengers", "The Avengers"]
left_df["genre"] = ["Action", "Adventure", "Sci-Fi"]
right_df["title"] = ["The Avengers"]*6
right_df["avengers"] = ["Ironman", "Captain America", "The Hulk", "Thor", "Black Widow", "Hawkeye"]
left_df
right_df
pd.merge(left_df, right_df)

Unnamed: 0,title,genre,avengers
0,The Avengers,Action,Ironman
1,The Avengers,Action,Captain America
2,The Avengers,Action,The Hulk
3,The Avengers,Action,Thor
4,The Avengers,Action,Black Widow
5,The Avengers,Action,Hawkeye
6,The Avengers,Adventure,Ironman
7,The Avengers,Adventure,Captain America
8,The Avengers,Adventure,The Hulk
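9,The Avengers,Adventure,Thor
10,The Avengers,Adventure,Black Widow
11,The Avengers,Adventure,Hawkeye
12,The Avengers,Sci-Fi,Ironman
13,The Avengers,Sci-Fi,Captain America
14,The Avengers,Sci-Fi,The Hulk
15,The Avengers,Sci-Fi,Thor
16,The Avengers,Sci-Fi,Black Widow
17,The Avengers,Sci-Fi,Hawkeye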
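
`pd.merge()` infers the kind of join from the data, so an unexpected duplicate key can silently turn a one-to-one join into a one-to-many join. One way to guard against that is the `validate` parameter, sketched below on the one-to-many example; the flag name `one_to_one_ok` is an assumption made for this illustration.

```python
import pandas as pd

left_df = pd.DataFrame({"title": ["The Avengers"], "release_year": [2012]})
right_df = pd.DataFrame({
    "title": ["The Avengers"] * 3,
    "genre": ["Action", "Adventure", "Sci-Fi"],
})

# The keys are unique on the left only, so a one-to-many check passes
merged = pd.merge(left_df, right_df, validate="one_to_many")
print(merged)

# Claiming one-to-one fails because "The Avengers" repeats on the right
try:
    pd.merge(left_df, right_df, validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError as error:
    one_to_one_ok = False
    print(error)
```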


## Use the left_on and right_on parameters to specify which variables serve as the join keys

In [25]:
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War", "Avengers: Endgame"]
left_df["release_year"] = [2012, 2015, 2018, 2019]
right_df["movie_name"] = ["Avengers: Infinity War", "Avengers: Endgame", "The Avengers", "Avengers: Age of Ultron"]
right_df["rating"] = [8.5, 8.6, 8.5, 7.3]
left_df
right_df
pd.merge(left_df, right_df, left_on="title", right_on="movie_name")

Unnamed: 0,title,release_year,movie_name,rating
0,The Avengers,2012,The Avengers,8.5
1,Avengers: Age of Ultron,2015,Avengers: Age of Ultron,7.3
2,Avengers: Infinity War,2018,Avengers: Infinity War,8.5
3,Avengers: Endgame,2019,Avengers: Endgame,8.6
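
When the key columns have different names, both are kept in the result, so the same titles appear twice. A minimal sketch of tidying this up with `.drop()`, using a shortened version of the same data; the name `tidy` is an assumption for this example.

```python
import pandas as pd

left_df = pd.DataFrame({
    "title": ["The Avengers", "Avengers: Age of Ultron"],
    "release_year": [2012, 2015],
})
right_df = pd.DataFrame({
    "movie_name": ["Avengers: Age of Ultron", "The Avengers"],
    "rating": [7.3, 8.5],
})

merged = pd.merge(left_df, right_df, left_on="title", right_on="movie_name")

# After the join both key columns hold the same values, so one can be dropped
tidy = merged.drop(columns="movie_name")
print(tidy)
```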


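## The how parameter specifies whether the joined DataFrame keeps the intersection (the default), the observations present in the left DataFrame, the observations present in the right DataFrame, or the union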

In [26]:
# Intersection (default)
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War"]
left_df["release_year"] = [2012, 2015, 2018]
right_df["title"] = ["Avengers: Infinity War", "Avengers: Endgame", "Avengers: Age of Ultron"]
right_df["rating"] = [8.5, 8.6, 7.3]
left_df
right_df
print("Inner join:")
pd.merge(left_df, right_df)

Inner join:


Unnamed: 0,title,release_year,rating
0,Avengers: Age of Ultron,2015,7.3
1,Avengers: Infinity War,2018,8.5


In [27]:
# Keep the observations present in the left DataFrame
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War"]
left_df["release_year"] = [2012, 2015, 2018]
right_df["title"] = ["Avengers: Infinity War", "Avengers: Endgame", "Avengers: Age of Ultron"]
right_df["rating"] = [8.5, 8.6, 7.3]
left_df
right_df
print("Left join:")
pd.merge(left_df, right_df, how="left")

Left join:


Unnamed: 0,title,release_year,rating
0,The Avengers,2012,
1,Avengers: Age of Ultron,2015,7.3
2,Avengers: Infinity War,2018,8.5


In [28]:
# Keep the observations present in the right DataFrame
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War"]
left_df["release_year"] = [2012, 2015, 2018]
right_df["title"] = ["Avengers: Infinity War", "Avengers: Endgame", "Avengers: Age of Ultron"]
right_df["rating"] = [8.5, 8.6, 7.3]
left_df
right_df
print("Right join:")
pd.merge(left_df, right_df, how="right")

Right join:


Unnamed: 0,title,release_year,rating
0,Avengers: Age of Ultron,2015.0,7.3
1,Avengers: Infinity War,2018.0,8.5
2,Avengers: Endgame,,8.6


In [29]:
# Union
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War"]
left_df["release_year"] = [2012, 2015, 2018]
right_df["title"] = ["Avengers: Infinity War", "Avengers: Endgame", "Avengers: Age of Ultron"]
right_df["rating"] = [8.5, 8.6, 7.3]
left_df
right_df
print("Outer join:")
pd.merge(left_df, right_df, how="outer")

Outer join:


Unnamed: 0,title,release_year,rating
0,The Avengers,2012.0,
1,Avengers: Age of Ultron,2015.0,7.3
2,Avengers: Infinity War,2018.0,8.5
3,Avengers: Endgame,,8.6
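
To see which side each row of a join came from, `pd.merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below applies it to the same outer join. The dict `source_by_title` is a name introduced only for this example, and the lookup is done by title because the row order of an outer join can differ between pandas versions.

```python
import pandas as pd

left_df = pd.DataFrame({
    "title": ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War"],
    "release_year": [2012, 2015, 2018],
})
right_df = pd.DataFrame({
    "title": ["Avengers: Infinity War", "Avengers: Endgame", "Avengers: Age of Ultron"],
    "rating": [8.5, 8.6, 7.3],
})

# _merge records "left_only", "right_only", or "both" for every row
merged = pd.merge(left_df, right_df, how="outer", indicator=True)
print(merged)

# Look up the source of each title without relying on row order
source_by_title = merged.set_index("title")["_merge"].astype(str).to_dict()
print(source_by_title)
```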


## Joining on the row index with df.join()

In [30]:
left_df = pd.DataFrame()
right_df = pd.DataFrame()
left_df["title"] = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War", "Avengers: Endgame"]
left_df["release_year"] = [2012, 2015, 2018, 2019]
right_df["title"] = ["Avengers: Infinity War", "Avengers: Endgame", "The Avengers", "Avengers: Age of Ultron"]
right_df["rating"] = [8.5, 8.6, 8.5, 7.3]
left_df = left_df.set_index("title")
right_df = right_df.set_index("title")
left_df
right_df
left_df.join(right_df)

title,release_year,rating
The Avengers,2012,8.5
Avengers: Age of Ultron,2015,7.3
Avengers: Infinity War,2018,8.5
Avengers: Endgame,2019,8.6
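
`df.join()` performs a left join by default, which is easy to miss in the example above because every title exists in both frames. The sketch below uses hypothetical frames with only one shared title to show the default, the `how` parameter, and the equivalent `pd.merge()` call on the indexes.

```python
import pandas as pd

left_df = pd.DataFrame(
    {"release_year": [2012, 2015]},
    index=["The Avengers", "Avengers: Age of Ultron"],
)
right_df = pd.DataFrame(
    {"rating": [7.3, 8.5]},
    index=["Avengers: Age of Ultron", "Avengers: Infinity War"],
)

# Default how="left": every left row is kept, unmatched ratings become NaN
joined_left = left_df.join(right_df)
print(joined_left)

# how="inner" keeps only index labels found in both frames
joined_inner = left_df.join(right_df, how="inner")
print(joined_inner)

# The same left join written with pd.merge() on the two indexes
via_merge = pd.merge(left_df, right_df, left_index=True, right_index=True, how="left")
print(via_merge)
```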


## Further Reading

[pandas: powerful Python data analysis toolkit](http://pandas.pydata.org/pandas-docs/stable/)

## In-Class Exercises

[In-class exercises: US Census](https://mybinder.org/v2/gh/yaojenkuo/python-data-analysis/master?filepath=exercises%2F04-exercises.ipynb)

## Exercise Source

[Introduction to Data Science in Python](https://www.coursera.org/learn/python-data-analysis)

In [31]:
import pandas as pd

census_df = pd.read_csv('https://storage.googleapis.com/py_ml_datasets/census.csv')
census_df.shape

(3193, 100)

In [32]:
census_df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [33]:
census_df.tail()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
3188,50,4,8,56,37,Wyoming,Sweetwater County,43806,43806,43593,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.29546,-14.075283,-14.070195
3189,50,4,8,56,39,Wyoming,Teton County,21294,21294,21297,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
3190,50,4,8,56,41,Wyoming,Uinta County,21118,21118,21102,...,-17.755986,-4.91635,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.52184,-14.740608,-12.606351
3191,50,4,8,56,43,Wyoming,Washakie County,8533,8533,8545,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961
3192,50,4,8,56,45,Wyoming,Weston County,7208,7208,7181,...,-11.752361,-8.040059,12.372583,1.533635,6.935294,-12.032179,-8.040059,12.372583,1.533635,6.935294


## Exercise: Which state has the most counties?

In [34]:
def most_counties(census_df):
    """
    >>> most_counties(census_df)
    'Texas'
    """

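## Exercise: Considering only the three most populous counties in each state, sum their populations (CENSUS2010POP). Which three states have the largest totals? (Pay attention to the SUMLEV variable)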

In [35]:
def top_three_states(census_df):
    """
    >>> top_three_states(census_df)
    ['California', 'Texas', 'Illinois']
    """

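## Exercise: Which county had the largest population change during 2010-2015? (Use the six variables POPESTIMATE2010 through POPESTIMATE2015)

Hint: if the populations over the six years are 120, 80, 105, 100, 130, 120, the population change is 130-80 = 50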

In [36]:
def pop_change_most_county(census_df):
    """
    >>> pop_change_most_county(census_df)
    'Harris County'
    """

## Exercise: Select the counties that belong to REGION 1 or 2, whose names start with Washington, and whose POPESTIMATE2015 is greater than POPESTIMATE2014

In [37]:
def filter_counties(census_df):
    """
    >>> filter_counties(census_df)
             STNAME            CTYNAME
    0          Iowa  Washington County
    1     Minnesota  Washington County
    2  Pennsylvania  Washington County
    3  Rhode Island  Washington County
    4     Wisconsin  Washington County
    """

## Suggested Answers

[In-class exercises: US Census suggested answers](https://mybinder.org/v2/gh/yaojenkuo/python-data-analysis/master?filepath=suggested_answers%2F04-suggested-answers.ipynb)