# Pandas

## 郭耀仁

## Documentation

https://pandas.pydata.org/pandas-docs/stable/index.html

## Inspired by R

> Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.

Source: <https://github.com/pandas-dev/pandas>

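## Data structures provided by pandas

|Name|Description|
|---|----|
|Series|One-dimensional array with an index|
|DataFrame|Two-dimensional dataset with a row index and column labels|
|Panel|Three-dimensional dataset with a row index and column labels (deprecated, then removed in later pandas releases)|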
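
Because `Panel` is no longer available in current pandas, three-dimensional data is commonly stored in a DataFrame with a `MultiIndex`. The following is a minimal sketch of that idea; the level names, column names, and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: two outer "slices" (years), each holding a small 2-D table.
index = pd.MultiIndex.from_product(
    [[2019, 2020], ["Nami", "Usopp"]], names=["year", "name"]
)
stacked = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}, index=index)

# Selecting one outer label returns an ordinary 2-D DataFrame.
slice_2020 = stacked.loc[2020]
print(slice_2020)
```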

# Series

## Creating a Series

- Use `Series()` to create a Series
- The data argument can be:
    - an ndarray
    - a dict
    - a single value

```python
import pandas as pd

ser = pd.Series(data, index = idx)
```

## Creating a Series (2)

- data is an ndarray

In [30]:
import numpy as np
import pandas as pd

arr = np.array(("Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp", "Vinsmoke Sanji",
                "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"))
ser = pd.Series(arr) # default index
print(type(ser))
print("\n")
print(ser)

<class 'pandas.core.series.Series'>


0      Monkey D. Luffy
1         Roronoa Zoro
2                 Nami
3                Usopp
4       Vinsmoke Sanji
5    Tony Tony Chopper
6           Nico Robin
7               Franky
8                Brook
dtype: object


In [31]:
# use a custom index

crew_idx = []
for i in range(9):
    crew_idx.append("crew " + str(i + 1))
ser = pd.Series(arr, index = crew_idx)
print(ser)

crew 1      Monkey D. Luffy
crew 2         Roronoa Zoro
crew 3                 Nami
crew 4                Usopp
crew 5       Vinsmoke Sanji
crew 6    Tony Tony Chopper
crew 7           Nico Robin
crew 8               Franky
crew 9                Brook
dtype: object


## Creating a Series (3)

- data is a dict
- By default, the keys become the index

In [35]:
import pandas as pd

crew_dict = {
    "captain": "Monkey D. Luffy",
    "swordsman": "Roronoa Zoro",
    "navigator": "Nami",
    "sniper": "Usopp",
    "chef": "Vinsmoke Sanji",
    "doctor": "Tony Tony Chopper",
    "archaeologist": "Nico Robin",
    "shipwright": "Franky",
    "musician": "Brook"
}

ser = pd.Series(crew_dict) # older pandas sorts by key (as shown below); pandas 0.23+ on Python 3.6+ keeps insertion order
print(ser)

archaeologist           Nico Robin
captain            Monkey D. Luffy
chef                Vinsmoke Sanji
doctor           Tony Tony Chopper
musician                     Brook
navigator                     Nami
shipwright                  Franky
sniper                       Usopp
swordsman             Roronoa Zoro
dtype: object


In [34]:
import pandas as pd

crew_dict = {
    "captain": "Monkey D. Luffy",
    "swordsman": "Roronoa Zoro",
    "navigator": "Nami",
    "sniper": "Usopp",
    "chef": "Vinsmoke Sanji",
    "doctor": "Tony Tony Chopper",
    "archaeologist": "Nico Robin",
    "shipwright": "Franky",
    "musician": "Brook"
}

ser = pd.Series(crew_dict, index = crew_dict.keys()) # same order as the original dict
print(ser)

captain            Monkey D. Luffy
swordsman             Roronoa Zoro
navigator                     Nami
sniper                       Usopp
chef                Vinsmoke Sanji
doctor           Tony Tony Chopper
archaeologist           Nico Robin
shipwright                  Franky
musician                     Brook
dtype: object


## Creating a Series (4)

- data can be a single value

In [36]:
import pandas as pd

luffy = "Monkey D. Luffy"
ser = pd.Series(luffy, index = range(5))
print(ser)

0    Monkey D. Luffy
1    Monkey D. Luffy
2    Monkey D. Luffy
3    Monkey D. Luffy
4    Monkey D. Luffy
dtype: object


## Series operations

- Select data by position or by label
- Much like working with an ndarray

In [40]:
import pandas as pd

crew_dict = {
    "captain": "Monkey D. Luffy",
    "swordsman": "Roronoa Zoro",
    "navigator": "Nami",
    "sniper": "Usopp",
    "chef": "Vinsmoke Sanji",
    "doctor": "Tony Tony Chopper",
    "archaeologist": "Nico Robin",
    "shipwright": "Franky",
    "musician": "Brook"
}

ser = pd.Series(crew_dict, index = crew_dict.keys()) # same order as the original dict
print(ser.iloc[0]) # select by position
print(ser['captain']) # select by label
print("\n")
print(ser.iloc[[0, 3, 6]])
print(ser[['captain', 'sniper', 'archaeologist']])

Monkey D. Luffy
Monkey D. Luffy


captain          Monkey D. Luffy
sniper                     Usopp
archaeologist         Nico Robin
dtype: object
captain          Monkey D. Luffy
sniper                     Usopp
archaeologist         Nico Robin
dtype: object
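
Position and label are easy to confuse once the index itself contains integers. The sketch below, using a small made-up Series, contrasts `.iloc` (position-based) with `.loc` (label-based); the variable names are only for illustration.

```python
import pandas as pd

ser = pd.Series(
    ["Monkey D. Luffy", "Roronoa Zoro", "Nami"],
    index=["captain", "swordsman", "navigator"],
)

first_by_position = ser.iloc[0]       # position 0
first_by_label = ser.loc["captain"]   # label "captain"
print(first_by_position == first_by_label)

# With an integer index, label 2 and position 2 point at different elements.
numbered = pd.Series(["a", "b", "c"], index=[2, 1, 0])
print(numbered.loc[2], numbered.iloc[2])
```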


## Series operations (2)

- Slice quickly with `:`

In [43]:
import pandas as pd

crew_dict = {
    "captain": "Monkey D. Luffy",
    "swordsman": "Roronoa Zoro",
    "navigator": "Nami",
    "sniper": "Usopp",
    "chef": "Vinsmoke Sanji",
    "doctor": "Tony Tony Chopper",
    "archaeologist": "Nico Robin",
    "shipwright": "Franky",
    "musician": "Brook"
}

ser = pd.Series(crew_dict, index = crew_dict.keys()) # same order as the original dict
print(ser[:3])
print("\n")
print(ser['sniper':])

captain      Monkey D. Luffy
swordsman       Roronoa Zoro
navigator               Nami
dtype: object


sniper                       Usopp
chef                Vinsmoke Sanji
doctor           Tony Tony Chopper
archaeologist           Nico Robin
shipwright                  Franky
musician                     Brook
dtype: object
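
One detail worth noting about slicing: a position slice excludes its stop position, while a label slice includes both endpoint labels. A minimal sketch on a shortened crew Series:

```python
import pandas as pd

ser = pd.Series(
    ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp"],
    index=["captain", "swordsman", "navigator", "sniper"],
)

by_position = ser.iloc[0:2]                 # positions 0 and 1; stop is excluded
by_label = ser.loc["captain":"navigator"]   # both endpoint labels are included

print(len(by_position), len(by_label))
```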


## Series operations (3)

- Boolean filtering with a condition also works

In [46]:
import pandas as pd

crew_dict = {
    "captain": "Monkey D. Luffy",
    "swordsman": "Roronoa Zoro",
    "navigator": "Nami",
    "sniper": "Usopp",
    "chef": "Vinsmoke Sanji",
    "doctor": "Tony Tony Chopper",
    "archaeologist": "Nico Robin",
    "shipwright": "Franky",
    "musician": "Brook"
}

ser = pd.Series(crew_dict, index = crew_dict.keys()) # same order as the original dict
mask = ser.isin(("Nami", "Nico Robin")) # named mask to avoid shadowing the built-in filter()
print(ser[mask])

navigator              Nami
archaeologist    Nico Robin
dtype: object
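
The boolean mask does not have to come from `isin()`. As a rough sketch, the `.str` accessor can build one from a string test; the choice of `startswith("N")` here is just an example condition.

```python
import pandas as pd

ser = pd.Series(
    ["Monkey D. Luffy", "Nami", "Usopp", "Nico Robin"],
    index=["captain", "navigator", "sniper", "archaeologist"],
)

starts_with_n = ser.str.startswith("N")  # boolean Series, one value per element
selected = ser[starts_with_n]
print(selected)
```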


## Series operations (4)

- NumPy functions work on a Series as well

In [48]:
import numpy as np
import pandas as pd

crew_age = {
    "Monkey D. Luffy": 19,
    "Roronoa Zoro": 21,
    "Nami": 20,
    "Usopp": 19,
    "Vinsmoke Sanji": 21,
    "Tony Tony Chopper": 17,
    "Nico Robin": 30,
    "Franky": 36,
    "Brook": 90
}

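ser = pd.Series(crew_age)
print("Mean age of the Straw Hat Pirates: %.2f" % np.mean(ser))
print("Standard deviation of the Straw Hat Pirates' ages: %.2f" % np.std(ser))

Mean age of the Straw Hat Pirates: 30.33
Standard deviation of the Straw Hat Pirates' ages: 21.88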
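
The standard deviation above is the population value (`ddof=0`), which is what `np.std()` computes by default. The pandas method `Series.std()` defaults to the sample value (`ddof=1`), so the two can disagree on the same data. A small sketch with the same ages:

```python
import numpy as np
import pandas as pd

ages = pd.Series([19, 21, 20, 19, 21, 17, 30, 36, 90])

population_std = np.std(ages)        # NumPy default: ddof=0
sample_std = ages.std()              # pandas default: ddof=1
matching_std = ages.std(ddof=0)      # pandas, asked for the population value

print("%.2f" % population_std)
print("%.2f" % sample_std)
print("%.2f" % matching_std)
```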


## Series operations (5)

- Element-wise arithmetic works as well

In [50]:
import pandas as pd

crew_age = {
    "Monkey D. Luffy": 19,
    "Roronoa Zoro": 21,
    "Nami": 20,
    "Usopp": 19,
    "Vinsmoke Sanji": 21,
    "Tony Tony Chopper": 17,
    "Nico Robin": 30,
    "Franky": 36,
    "Brook": 90
}

ser = pd.Series(crew_age, index = crew_age.keys())
print(ser - 2)

Monkey D. Luffy      17
Roronoa Zoro         19
Nami                 18
Usopp                17
Vinsmoke Sanji       19
Tony Tony Chopper    15
Nico Robin           28
Franky               34
Brook                88
dtype: int64
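
Element-wise arithmetic between two Series is aligned by index label rather than by position. The sketch below uses a second, hypothetical Series (`years_passed`) to show that unmatched labels produce `NaN` unless a fill value is given.

```python
import pandas as pd

ages = pd.Series({"Nami": 20, "Usopp": 19, "Franky": 36})
years_passed = pd.Series({"Usopp": 2, "Nami": 2, "Brook": 2})  # hypothetical values

total = ages + years_passed                     # labels missing on either side give NaN
filled = ages.add(years_passed, fill_value=0)   # treat a missing side as 0 instead

print(total)
print(filled)
```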


# DataFrame

## Creating a DataFrame

- Use `DataFrame()` to create a DataFrame
- The data argument can be:
    - a dict
    - an ndarray

```python
import pandas as pd

df = pd.DataFrame(data)
```

## Creating a DataFrame (2)

- data is a dict

In [54]:
import pandas as pd

straw_hat_dict = {"name": ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp",
                           "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"],
                  "age": [19, 21, 20, 19, 21, 17, 30, 36, 90],
                  "is_male": [True, True, False, True, True, True, False, True, True]
}

df = pd.DataFrame(straw_hat_dict) # older pandas sorts column labels (as shown below); pandas 0.23+ on Python 3.6+ keeps insertion order
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


| |age|is_male|name|
|---|---|---|---|
|0|19|True|Monkey D. Luffy|
|1|21|True|Roronoa Zoro|
|2|20|False|Nami|
|3|19|True|Usopp|
|4|21|True|Vinsmoke Sanji|
|5|17|True|Tony Tony Chopper|
|6|30|False|Nico Robin|
|7|36|True|Franky|
|8|90|True|Brook|


In [6]:
import pandas as pd

straw_hat_dict = {"name": ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp",
                           "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"],
                  "age": [19, 21, 20, 19, 21, 17, 30, 36, 90],
                  "is_male": [True, True, False, True, True, True, False, True, True]
}

df = pd.DataFrame(straw_hat_dict, columns = ["name", "age", "is_male"]) # specify the column order
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


| |name|age|is_male|
|---|---|---|---|
|0|Monkey D. Luffy|19|True|
|1|Roronoa Zoro|21|True|
|2|Nami|20|False|
|3|Usopp|19|True|
|4|Vinsmoke Sanji|21|True|
|5|Tony Tony Chopper|17|True|
|6|Nico Robin|30|False|
|7|Franky|36|True|
|8|Brook|90|True|


## Creating a DataFrame (3)

- data is an ndarray

In [10]:
import numpy as np
import pandas as pd

arr = np.array([
    ["Monkey D. Luffy", 19, True],
    ["Roronoa Zoro", 21, True],
    ["Nami", 20, False],
    ["Usopp", 19, True],
    ["Vinsmoke Sanji", 21, True],
    ["Tony Tony Chopper", 17, True],
    ["Nico Robin", 30, False],
    ["Franky", 36, True],
    ["Brook", 90, True]
])
df = pd.DataFrame(arr, columns = ["name", "age", "is_male"])
df

| |name|age|is_male|
|---|---|---|---|
|0|Monkey D. Luffy|19|True|
|1|Roronoa Zoro|21|True|
|2|Nami|20|False|
|3|Usopp|19|True|
|4|Vinsmoke Sanji|21|True|
|5|Tony Tony Chopper|17|True|
|6|Nico Robin|30|False|
|7|Franky|36|True|
|8|Brook|90|True|


In [11]:
import numpy as np
import pandas as pd

arr = np.array([
    ["Monkey D. Luffy", 19, True],
    ["Roronoa Zoro", 21, True],
    ["Nami", 20, False],
    ["Usopp", 19, True],
    ["Vinsmoke Sanji", 21, True],
    ["Tony Tony Chopper", 17, True],
    ["Nico Robin", 30, False],
    ["Franky", 36, True],
    ["Brook", 90, True]
])
df = pd.DataFrame(arr, columns = ["name", "age", "is_male"])
print(df.dtypes)
df['age'] = df['age'].astype(int)
# The ndarray stores every value as a string, and astype(bool) turns any
# non-empty string (including "False") into True, so compare with "True" instead.
df['is_male'] = df['is_male'] == "True"
print("\n")
print(df.dtypes)
df

name       object
age        object
is_male    object
dtype: object


name       object
age         int64
is_male      bool
dtype: object


| |name|age|is_male|
|---|---|---|---|
|0|Monkey D. Luffy|19|True|
|1|Roronoa Zoro|21|True|
|2|Nami|20|False|
|3|Usopp|19|True|
|4|Vinsmoke Sanji|21|True|
|5|Tony Tony Chopper|17|True|
|6|Nico Robin|30|False|
|7|Franky|36|True|
|8|Brook|90|True|
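
An ndarray holds a single type, so mixed rows are stored as strings and must be converted back afterwards. One way to sidestep this, sketched below on three of the rows, is to pass a list of tuples so that pandas infers a type per column; this is an alternative for illustration, not the approach used in the cell above.

```python
import pandas as pd

rows = [
    ("Monkey D. Luffy", 19, True),
    ("Nami", 20, False),
    ("Nico Robin", 30, False),
]

# Each column keeps its own inferred type; no astype() calls are needed.
records_df = pd.DataFrame(rows, columns=["name", "age", "is_male"])
print(records_df.dtypes)
```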


## DataFrame operations

- A DataFrame can hold several variable types, unlike an ndarray, which holds only one

In [57]:
import pandas as pd

straw_hat_dict = {"name": ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp",
                           "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"],
                  "age": [19, 21, 20, 19, 21, 17, 30, 36, 90],
                  "is_male": [True, True, False, True, True, True, False, True, True]
}

df = pd.DataFrame(straw_hat_dict, columns = ["name", "age", "is_male"]) # specify the column order
df.dtypes

name       object
age         int64
is_male      bool
dtype: object

## DataFrame operations (2)

- A new variable can be added by direct assignment

In [59]:
import pandas as pd

straw_hat_dict = {"name": ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp",
                           "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"],
                  "age": [19, 21, 20, 19, 21, 17, 30, 36, 90],
                  "is_male": [True, True, False, True, True, True, False, True, True]
}

df = pd.DataFrame(straw_hat_dict, columns = ["name", "age", "is_male"]) # specify the column order
df['age_2_yr_ago'] = df['age'] - 2
df

| |name|age|is_male|age_2_yr_ago|
|---|---|---|---|---|
|0|Monkey D. Luffy|19|True|17|
|1|Roronoa Zoro|21|True|19|
|2|Nami|20|False|18|
|3|Usopp|19|True|17|
|4|Vinsmoke Sanji|21|True|19|
|5|Tony Tony Chopper|17|True|15|
|6|Nico Robin|30|False|28|
|7|Franky|36|True|34|
|8|Brook|90|True|88|


In [60]:
import pandas as pd

straw_hat_dict = {"name": ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp",
                           "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"],
                  "age": [19, 21, 20, 19, 21, 17, 30, 36, 90],
                  "is_male": [True, True, False, True, True, True, False, True, True]
}

df = pd.DataFrame(straw_hat_dict, columns = ["name", "age", "is_male"]) # specify the column order
df['favorite_food'] = ["Meat", "Food matches wine", "Orange", "Fish", "Food matches black tea",
                       "Sweets", "Food matches coffee", "Food matches coke", "Milk"]
df

| |name|age|is_male|favorite_food|
|---|---|---|---|---|
|0|Monkey D. Luffy|19|True|Meat|
|1|Roronoa Zoro|21|True|Food matches wine|
|2|Nami|20|False|Orange|
|3|Usopp|19|True|Fish|
|4|Vinsmoke Sanji|21|True|Food matches black tea|
|5|Tony Tony Chopper|17|True|Sweets|
|6|Nico Robin|30|False|Food matches coffee|
|7|Franky|36|True|Food matches coke|
|8|Brook|90|True|Milk|


## DataFrame operations (3)

- Use `.insert()` to choose the position of the new variable

In [66]:
import pandas as pd

straw_hat_dict = {"name": ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp",
                           "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"],
                  "age": [19, 21, 20, 19, 21, 17, 30, 36, 90],
                  "is_male": [True, True, False, True, True, True, False, True, True]
}

df = pd.DataFrame(straw_hat_dict, columns = ["name", "age", "is_male"]) # specify the column order
df.insert(1, 'favorite_food', ["Meat", "Food matches wine", "Orange", "Fish", "Food matches black tea",
                               "Sweets", "Food matches coffee", "Food matches coke", "Milk"])
df

| |name|favorite_food|age|is_male|
|---|---|---|---|---|
|0|Monkey D. Luffy|Meat|19|True|
|1|Roronoa Zoro|Food matches wine|21|True|
|2|Nami|Orange|20|False|
|3|Usopp|Fish|19|True|
|4|Vinsmoke Sanji|Food matches black tea|21|True|
|5|Tony Tony Chopper|Sweets|17|True|
|6|Nico Robin|Food matches coffee|30|False|
|7|Franky|Food matches coke|36|True|
|8|Brook|Milk|90|True|


## DataFrame operations (4)

- Use `del` to delete a variable

In [63]:
import pandas as pd

straw_hat_dict = {"name": ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp",
                           "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"],
                  "age": [19, 21, 20, 19, 21, 17, 30, 36, 90],
                  "is_male": [True, True, False, True, True, True, False, True, True]
}

df = pd.DataFrame(straw_hat_dict, columns = ["name", "age", "is_male"]) # specify the column order
del df['is_male']
df

| |name|age|
|---|---|---|
|0|Monkey D. Luffy|19|
|1|Roronoa Zoro|21|
|2|Nami|20|
|3|Usopp|19|
|4|Vinsmoke Sanji|21|
|5|Tony Tony Chopper|17|
|6|Nico Robin|30|
|7|Franky|36|
|8|Brook|90|
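
`del` modifies the DataFrame in place. When the original should stay untouched, `DataFrame.drop(columns=...)` returns a new DataFrame instead. A minimal sketch on a two-row frame:

```python
import pandas as pd

crew = pd.DataFrame(
    {"name": ["Nami", "Usopp"], "age": [20, 19], "is_male": [False, True]}
)

trimmed = crew.drop(columns=["is_male"])  # new DataFrame without the column

print(list(trimmed.columns))
print(list(crew.columns))  # the original still has all three columns
```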


## DataFrame operations (5)

- Use `.pop()` to remove a variable and assign it to a Series

In [67]:
import pandas as pd

straw_hat_dict = {"name": ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp",
                           "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"],
                  "age": [19, 21, 20, 19, 21, 17, 30, 36, 90],
                  "is_male": [True, True, False, True, True, True, False, True, True]
}

df = pd.DataFrame(straw_hat_dict, columns = ["name", "age", "is_male"]) # specify the column order
ser = df.pop('is_male')
print(type(ser))
print(ser)
df

<class 'pandas.core.series.Series'>
0     True
1     True
2    False
3     True
4     True
5     True
6    False
7     True
8     True
Name: is_male, dtype: bool


| |name|age|
|---|---|---|
|0|Monkey D. Luffy|19|
|1|Roronoa Zoro|21|
|2|Nami|20|
|3|Usopp|19|
|4|Vinsmoke Sanji|21|
|5|Tony Tony Chopper|17|
|6|Nico Robin|30|
|7|Franky|36|
|8|Brook|90|


## DataFrame features (3)

- Elements are selected with square brackets `[]`, as before
- Use the `.loc` (label-based) and `.iloc` (position-based) indexers; the older `.ix` indexer was removed in pandas 1.0

```python
import pandas as pd # import the package under the alias pd

name = ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp", "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"]
age = [19, 21, 20, 19, 21, 17, 30, 36, 90]
is_male = [True, True, False, True, True, True, False, True, True]

straw_hat_dict = {"name": name,
                  "age": age,
                  "is_male": is_male
}

straw_hat_df = pd.DataFrame(straw_hat_dict)

print(straw_hat_df.iloc[0, :]) # select observation 0 by position
print("---")
print(straw_hat_df.loc[:, "name"]) # select the name column
print("---")
print(straw_hat_df.loc[0, "name"]) # select the name of the observation labeled 0
```
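
In current pandas, label-based selection goes through `.loc` and position-based selection through `.iloc`. The difference is easiest to see when the row index is not the default 0, 1, 2. The sketch below sets `name` as the index on a three-row frame; that choice of index is an assumption made for illustration.

```python
import pandas as pd

crew = pd.DataFrame(
    {"name": ["Monkey D. Luffy", "Roronoa Zoro", "Nami"], "age": [19, 21, 20]}
).set_index("name")

nami_age = crew.loc["Nami", "age"]    # row label, column label
third_age = crew.iloc[2, 0]           # row position, column position
subset = crew.loc[["Nami", "Monkey D. Luffy"], ["age"]]  # lists keep a DataFrame

print(nami_age, third_age)
print(subset)
```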

## DataFrame features (4)

- Exercise: practice selecting rows, columns, and single elements with `.loc`, `.iloc`, and `[]`

## DataFrame features (5)

- Rows can be filtered with boolean values

```python
import pandas as pd # import the package under the alias pd

name = ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp", "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"]
age = [19, 21, 20, 19, 21, 17, 30, 36, 90]
is_male = [True, True, False, True, True, True, False, True, True]

straw_hat_dict = {"name": name,
                  "age": age,
                  "is_male": is_male
}

straw_hat_df = pd.DataFrame(straw_hat_dict)

# Select crew members aged 30 or younger
mask = straw_hat_df.loc[:, "age"] <= 30
print(straw_hat_df[mask])

# Select the female crew members
mask = ~straw_hat_df.loc[:, "is_male"]
print(straw_hat_df[mask])
```
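
Several conditions can be combined into one mask with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `<` or `>=`. A rough sketch on four of the crew members, with conditions chosen only as examples:

```python
import pandas as pd

crew = pd.DataFrame({
    "name": ["Monkey D. Luffy", "Nami", "Tony Tony Chopper", "Nico Robin"],
    "age": [19, 20, 17, 30],
    "is_male": [True, False, True, False],
})

young_women = crew[(crew["age"] < 25) & ~crew["is_male"]]      # both must hold
minor_or_woman = crew[(crew["age"] < 18) | ~crew["is_male"]]   # either may hold

print(young_women)
print(minor_or_woman)
```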

## DataFrame features (6)

- Exercise: use boolean filtering to select the mature men of the Straw Hat Pirates:
    - `age` >= 30
    - `is_male` == True

## Getting an overview of a DataFrame

```python
import pandas as pd # import the package under the alias pd

name = ["Monkey D. Luffy", "Roronoa Zoro", "Nami", "Usopp", "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook"]
age = [19, 21, 20, 19, 21, 17, 30, 36, 90]
is_male = [True, True, False, True, True, True, False, True, True]

straw_hat_dict = {"name": name,
                  "age": age,
                  "is_male": is_male
}

straw_hat_df = pd.DataFrame(straw_hat_dict)

print(straw_hat_df.shape) # number of rows and columns
print("---")
print(straw_hat_df.describe()) # descriptive statistics
print("---")
print(straw_hat_df.head(3)) # first three observations
print("---")
print(straw_hat_df.tail(3)) # last three observations
print("---")
print(straw_hat_df.columns) # column names
```

## Getting an overview of a DataFrame (2)

- Exercise: try the `shape` and `columns` attributes of a **DataFrame**
- Exercise: try the `describe()`, `head()`, and `tail()` methods of a **DataFrame**

# Panel

## Reading external data

- Use the `read_csv()` function from `pandas` to read a CSV file

```python
import pandas as pd

url = "https://storage.googleapis.com/py_ds_basic/iris.csv" # a CSV file stored in the cloud
iris_df = pd.read_csv(url)
iris_df.head()
```
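
`read_csv()` also accepts a file-like object, which makes it possible to try the function without network access. The sketch below parses a small inline sample; the column names and rows are hypothetical and are not taken from the hosted iris file.

```python
import io

import pandas as pd

# Hypothetical sample data written inline instead of downloaded.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

# The same rows as tab-separated text, parsed by passing sep="\t".
tsv_text = csv_text.replace(",", "\t")
sample_tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(sample_df.shape)
print(sample_df.dtypes)
```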

## Reading external data (2)

- Use the `read_table()` function from `pandas` to read a TSV file

```python
import pandas as pd

url = "https://storage.googleapis.com/py_ds_basic/iris.tsv" # a TSV file stored in the cloud
iris_df = pd.read_table(url, sep = "\t")
iris_df.head()
```

## Reading external data (3)

- Use the `read_excel()` function from `pandas` to read an Excel file

```python
import pandas as pd

url = "https://storage.googleapis.com/py_ds_basic/iris.xlsx" # an Excel spreadsheet stored in the cloud
iris_df = pd.read_excel(url)
iris_df.head()
```

## Reading external data (4)

- Use the `read_json()` function from `pandas` to read a JSON file

```python
import pandas as pd

url = "https://storage.googleapis.com/py_ds_basic/iris.json" # a JSON file stored in the cloud
iris_df = pd.read_json(url)
iris_df.head()
```

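## Reading external data (5)

- Exercise: read **iris**, then give an overview of the dataset:
    - how many observations and variables it has
    - the name of each variable
    - its descriptive statistics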