# Derived Data Structures in Python's numpy and pandas Packages

### Prof. Ching-Shih Tsou 鄒慶士 (Ph.D.), Professor at the Institute of Information and Decision Sciences (IDS) and Director of the Center for Applications of Data Science (CADS), NTUB (國立臺北商業大學); CARS (中華R軟體學會) and DSBA (臺灣資料科學與商業應用協會)

## Outline
### A. From Native Data Structures to Arrays
### B. numpy Data Structures
### C. Vectorization of Code
### D. pandas Data Structures

## A. From Native Data Structures to Arrays

- Arrays, also called tensors, are the basis for data modeling.
- Python provides some quite useful and flexible general data structures. In particular, list objects are a real workhorse with many convenient characteristics and application areas.
- However, scientific and financial applications generally need high-performing operations on special data structures. One of the most important of these is the array.
- In the simplest case, a one-dimensional array represents, mathematically speaking, a vector of (in general) real numbers, internally represented by float objects.
- More commonly, an array represents an i × j matrix of elements.
- This concept generalizes to i × j × k cubes of elements in three dimensions and to general n-dimensional arrays of shape i × j × k × l × ... .

### Arrays with Python Lists

- Nested lists

In [1]:
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = [v, v, v] # matrix of numbers
m

[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]

In [2]:
m[1]

[0.5, 0.75, 1.0, 1.5, 2.0]

In [3]:
m[1][0]

0.5

In [4]:
type(m)

list

- Nesting can be pushed further for even more general structures.

In [5]:
v1 = [0.5, 1.5]
v2=[1,2]
m=[v1,v2]
print(m)
c = [m, m] # cube of numbers
c

[[0.5, 1.5], [1, 2]]


[[[0.5, 1.5], [1, 2]], [[0.5, 1.5], [1, 2]]]

In [6]:
c[1][1][0] # which 1 is this: the one from the first m or from the second?

1

In [7]:
type(c)

list

- Combining objects in the way just presented only stores reference pointers to the original objects; it does not copy them. What does that mean in practice?

In [8]:
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m=[v,v,v]
m

[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]

In [9]:
v[0] = 'Python'
m # all changed

[['Python', 0.75, 1.0, 1.5, 2.0],
 ['Python', 0.75, 1.0, 1.5, 2.0],
 ['Python', 0.75, 1.0, 1.5, 2.0]]

- This can be avoided by using the deepcopy function of the copy module.

In [10]:
from copy import deepcopy
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = [deepcopy(v) for _ in range(3)] # one independent copy per row
m

[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]

In [11]:
v[0] = 'Python'
m # unchanged

[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]
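
A related caveat, sketched here with illustrative names (`aliased` and `independent` are not part of the original notebook): multiplying a list, as in `3 * [deepcopy(v)]`, repeats one reference three times, so the rows are detached from `v` but still aliased to each other. A comprehension that copies once per row keeps them independent.

```python
from copy import deepcopy

v = [0.5, 0.75, 1.0]

# List multiplication repeats the same inner list three times.
aliased = 3 * [deepcopy(v)]
aliased[0][0] = 'changed'

# The comprehension calls deepcopy once per row, so rows are independent.
independent = [deepcopy(v) for _ in range(3)]
independent[0][0] = 'changed'

print(aliased)      # every row shows 'changed'
print(independent)  # only the first row shows 'changed'
print(v)            # the original list is untouched in both cases
```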

## B. numpy Data Structures
### 1. Regular NumPy Arrays
### 2. Structured Arrays

### 1. Regular NumPy Arrays

- A specialized class is therefore really beneficial for handling array-type structures. That class is numpy.ndarray, which was built with the specific goal of handling n-dimensional arrays both conveniently and efficiently, i.e., in a highly performing manner.

In [12]:
import numpy as np
a = np.array([0, 0.5, 1.0, 1.5, 2.0])
type(a)

numpy.ndarray

In [13]:
a[:2]

array([0. , 0.5])

- A major feature of the numpy.ndarray class is the multitude of built-in methods.

In [14]:
a.sum() # sum of all elements

5.0

In [15]:
a.std() # standard deviation

0.7071067811865476

In [16]:
a.cumsum() # running cumulative sum

array([0. , 0.5, 1.5, 3. , 5. ])

- Another major feature is the (vectorized) mathematical operations defined on ndarray objects.

In [17]:
a**2

array([0.  , 0.25, 1.  , 2.25, 4.  ])

In [18]:
np.sqrt(a)

array([0.        , 0.70710678, 1.        , 1.22474487, 1.41421356])

In [19]:
b = np.array([a, a * 2]) # a matrix of 2 rows and 5 columns
b

array([[0. , 0.5, 1. , 1.5, 2. ],
       [0. , 1. , 2. , 3. , 4. ]])

In [20]:
b[0] # first row

array([0. , 0.5, 1. , 1.5, 2. ])

In [21]:
b[0, 2] # third element of first row

1.0

In [22]:
b.sum()

15.0

- The sum method of an n-dimensional NumPy array can add along a chosen axis.
- axis=0 sums down the rows, giving one total per column.

In [23]:
b.sum(axis=0) # sum along axis 0, i.e. column-wise sum

array([0. , 1.5, 3. , 4.5, 6. ])

- axis=1 sums across the columns, giving one total per row.

In [24]:
b.sum(axis=1) # sum along axis 1, i.e. row-wise sum

array([ 5., 10.])
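
The axis rule can be checked through the shapes of the results. The sketch below rebuilds `b` with literal values and adds the `keepdims=True` argument of `sum`, which the notebook does not use, to show one common follow-up: dividing each row by its own total.

```python
import numpy as np

b = np.array([[0.0, 0.5, 1.0, 1.5, 2.0],
              [0.0, 1.0, 2.0, 3.0, 4.0]])

col_sums = b.sum(axis=0)   # collapses the 2 rows, result has shape (5,)
row_sums = b.sum(axis=1)   # collapses the 5 columns, result has shape (2,)

# keepdims=True keeps the collapsed axis with length 1, so the
# result of shape (2, 1) broadcasts against b of shape (2, 5).
row_share = b / b.sum(axis=1, keepdims=True)

print(col_sums.shape, row_sums.shape)
print(row_share.sum(axis=1))   # each row of shares adds up to 1
```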

In [25]:
c = np.zeros((2, 3, 4), dtype='i', order='C') # also: np.ones()
c

array([[[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]], dtype=int32)

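- np.ones_like builds a three-dimensional array of ones with the same shape as c, (2, 3, 4). The dtype 'f16' (float128) is platform dependent and is not available on every system.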

In [26]:
d = np.ones_like(c, dtype='f16', order='C') # also: np.zeros_like()
d

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]], dtype=float128)

> Important attributes (constructor arguments)

- shape: Either an int, a sequence of ints, or a reference to another numpy.ndarray.
- dtype (optional): A numpy.dtype; these are NumPy-specific data types for numpy.ndarray objects.
- order (optional): The order in which to store elements in memory: C for C-like (i.e., row-wise) or F for Fortran-like (i.e., column-wise).

> Important properties

- The shape/length/size of the array is homogeneous across any given dimension.
- Only a single data type (numpy.dtype) is allowed for the whole array.
- dtype codes accepted by NumPy: t (bit field), b (Boolean), i (integer), u (unsigned integer), f (floating point), c (complex floating point), O (object), S or a (string), U (Unicode), V (other).
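
A small sketch of what the single-dtype rule means in practice: when the inputs are mixed, NumPy picks one common type for the whole array. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])        # the integers are upcast to float
with_text = np.array([1, 2.5, 'x'])  # everything becomes a Unicode string

# dtype.kind is the one-letter code of the type family.
print(ints.dtype.kind)
print(mixed.dtype.kind)
print(with_text.dtype.kind)
print(with_text)
```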

In [27]:
import random
I=5000
%time mat = [[random.gauss(0, 1) for j in range(I)] for i in range(I)] # a nested list comprehension

CPU times: user 14.8 s, sys: 277 ms, total: 15 s
Wall time: 15.1 s


In [28]:
%time mat = np.random.standard_normal((I, I)) # much faster!

CPU times: user 1.07 s, sys: 144 ms, total: 1.21 s
Wall time: 1.22 s


In [29]:
%time mat.sum() # nearly instantaneous!

CPU times: user 15.4 ms, sys: 1.02 ms, total: 16.4 ms
Wall time: 14.8 ms


-4855.005221624336

### 2. Structured Arrays

- NumPy provides structured arrays that allow a different NumPy data type per column.
- First define the data type of each field with np.dtype, then fill in the data.

In [30]:
dt = np.dtype([('Name', 'S10'), ('Age', 'i4'), ('Height', 'f'), ('Children/Pets', 'i4', 2)])
s = np.array([('Smith', 45, 1.83, (0, 1)), ('Jones', 53, 1.72, (2, 2))], dtype=dt)
s

array([(b'Smith', 45, 1.83, [0, 1]), (b'Jones', 53, 1.72, [2, 2])],
      dtype=[('Name', 'S10'), ('Age', '<i4'), ('Height', '<f4'), ('Children/Pets', '<i4', (2,))])

- This construction comes quite close to initializing a table in a SQL database: we have column names and column data types, with maybe some additional information (e.g., the maximum number of characters per string object). The single columns can now be easily accessed by their names:

In [31]:
s['Name']

array([b'Smith', b'Jones'], dtype='|S10')

In [32]:
s['Height'].mean()

1.7750001

- Having selected a specific row (record), the resulting object mainly behaves like a dict object, where one can retrieve values via keys.

In [33]:
s[1]['Age']

53

- In summary, structured arrays are a generalization of the regular numpy.ndarray object types in that the data type only has to be the same per column, as one is used to in the context of tables in SQL databases. One advantage of structured arrays is that a single element of a column can be another multidimensional object and does not have to conform to the basic NumPy data types.
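
Field names also work in filtering and sorting. The sketch below uses a reduced version of the dtype above (the `Children/Pets` field is dropped for brevity) and the `order` argument of `np.sort`; the names `people`, `older`, and `by_height` are illustrative.

```python
import numpy as np

dt = np.dtype([('Name', 'S10'), ('Age', 'i4'), ('Height', 'f4')])
people = np.array([('Smith', 45, 1.83), ('Jones', 53, 1.72)], dtype=dt)

# A Boolean mask built from one field selects whole records.
older = people[people['Age'] > 50]

# np.sort accepts a field name through its `order` argument.
by_height = np.sort(people, order='Height')

print(older['Name'])
print(by_height['Name'])
```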

## C. Vectorization of Code

- Vectorization of code is a strategy to get more compact code that possibly executes faster.
- The fundamental idea is to conduct an operation on, or apply a function to, a complex object ***at once*** rather than ***by iterating over its single elements***.
- In Python, the functional programming tools map, filter, and reduce provide means for vectorization. In a sense, NumPy has vectorization built in deep down in its core.
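
The three styles mentioned above can be put side by side. This is a minimal sketch with made-up values, applying the same expression with an explicit loop, with `map`, and with a NumPy array.

```python
import numpy as np

values = [0.0, 0.5, 1.0, 1.5, 2.0]

# Explicit iteration over the single elements.
looped = []
for x in values:
    looped.append(3 * x + 5)

# Functional style: map applies the function to each element.
mapped = list(map(lambda x: 3 * x + 5, values))

# NumPy: the expression is applied to the whole array at once.
vectorized = 3 * np.array(values) + 5

print(looped)
print(mapped)
print(vectorized)
```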

### Basic Vectorization

In [34]:
r = np.zeros((4, 3))
s = np.ones((4, 3))
r+s

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

- NumPy also supports what is called broadcasting, which lets us combine objects of different shapes within a single operation: the smaller object is reused across the larger one (R calls this recycling).

In [35]:
2*r+3

array([[3., 3., 3.],
       [3., 3., 3.],
       [3., 3., 3.],
       [3., 3., 3.]])

In [36]:
s = np.random.standard_normal(3)
r+s

array([[ 0.93622957, -1.18346384, -0.57768203],
       [ 0.93622957, -1.18346384, -0.57768203],
       [ 0.93622957, -1.18346384, -0.57768203],
       [ 0.93622957, -1.18346384, -0.57768203]])

In [37]:
s = np.random.standard_normal(4) # shape (4,) does not match the trailing axis of r, which has length 3
# r+s # ValueError: operands could not be broadcast together with shapes (4,3) (4,)

In [38]:
r.transpose() + s

array([[ 0.34160839, -1.85187336, -0.07259451,  0.84713868],
       [ 0.34160839, -1.85187336, -0.07259451,  0.84713868],
       [ 0.34160839, -1.85187336, -0.07259451,  0.84713868]])

In [39]:
np.shape(r.T)

(3, 4)

In [40]:
np.shape(r)

(4, 3)
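
Transposing is one way around the shape mismatch. Another, sketched here with a deterministic `s` in place of the random one, is to give `s` a trailing axis of length 1 with `np.newaxis`, so that shapes (4, 3) and (4, 1) become compatible.

```python
import numpy as np

r = np.zeros((4, 3))
s = np.arange(4.0)               # shape (4,)

# Shapes are compared from the trailing axis: (4, 3) against (4,)
# fails because 3 != 4. With shape (4, 1) the length-1 axis is stretched.
result = r + s[:, np.newaxis]

print(result.shape)
print(result[:, 0])              # the first column holds the values of s
```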

- As a general rule, custom-defined Python functions work with numpy.ndarray objects as well.

In [41]:
def f(x):
    return 3*x+5

In [42]:
f(0.5) # float object

6.5

In [43]:
r = np.random.standard_normal((4, 3))
f(r) # NumPy array

array([[ 9.46667648,  4.68654517,  6.24188977],
       [10.19926544,  5.8581783 ,  5.83569538],
       [ 6.5092427 ,  9.99166376,  9.2784389 ],
       [ 1.98820175,  8.4663775 ,  9.0397805 ]])

In [44]:
r

array([[ 1.48889216, -0.10448494,  0.41396326],
       [ 1.73308848,  0.28605943,  0.27856513],
       [ 0.5030809 ,  1.66388792,  1.4261463 ],
       [-1.00393275,  1.15545917,  1.3465935 ]])
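
The general rule holds as long as the function body only uses operations that arrays support. As a hedged illustration with hypothetical functions `g_scalar` and `g_array`, a function built on the `math` module accepts scalars only, while its NumPy counterpart accepts both.

```python
import math
import numpy as np

a = np.array([0.0, 1.0, 4.0])

def g_scalar(x):
    # math.sqrt converts its argument to a single Python float.
    return math.sqrt(x) + 1

def g_array(x):
    # np.sqrt is a universal function that works element by element.
    return np.sqrt(x) + 1

try:
    g_scalar(a)
    scalar_version_failed = False
except TypeError:
    scalar_version_failed = True

print(scalar_version_failed)
print(g_array(a))
print(g_array(4.0))
```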

## D. pandas Data Structures
### 1. Series
### 2. DataFrame
### 3. Index Objects
### 4. Data Manipulation using pandas

In [45]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

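- Series and DataFrame are the two workhorse data structures of pandas.
- They are not a universal solution for every problem, but they are relatively easy to use for most applications.

### 1. Series

- A Series is a one-dimensional array-like object with an associated array of labels for its elements, called the index.
- The simplest Series is formed from an array of values only, in which case the index defaults to consecutive integers.
- It resembles a named vector in R.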

In [46]:
obj = Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [47]:
obj.values

array([ 4,  7, -5,  3])

In [48]:
obj.index # a RangeIndex by default

RangeIndex(start=0, stop=4, step=1)

In [49]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [50]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

- Unlike a NumPy array, a Series can be indexed and sliced by the values in its index.

In [51]:
obj2['a']

-5

In [52]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

- NumPy operations (***Boolean indexing and vectorized arithmetic***) also apply to a Series and preserve the index.

In [53]:
print(obj2[obj2 > 0])
print(obj2 * 2)
np.exp(obj2)

d    4
b    7
c    3
dtype: int64
d     8
b    14
a   -10
c     6
dtype: int64


d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

- A Series can also be viewed as a fixed-length, ordered dict.
- A Python dict can be converted to a pandas Series.

In [54]:
'b' in obj2

True

In [55]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [56]:
type(obj3) # pandas.core.series.Series

pandas.core.series.Series

In [57]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

- Detecting missing values.

In [58]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [59]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

- Series also has instance methods for detecting missing values.

In [60]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

- In arithmetic operations, a Series automatically aligns data by index label (data alignment); labels present in only one operand yield NaN.

In [61]:
print(obj3)
print(obj4)
print(obj3 + obj4)

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
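
When NaN for unmatched labels is not wanted, the `add` method of Series takes a `fill_value`. The sketch below rebuilds the two Series and shows that a label missing on one side only is filled, while a label missing on both sides (California here) stays NaN.

```python
import pandas as pd
from pandas import Series

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj4 = Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Utah exists only in obj3, so obj4 contributes the fill value 0.
# California is NaN in obj4 and absent from obj3, so it remains NaN.
total = obj3.add(obj4, fill_value=0)
print(total)
```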


- Both the Series object and its index have a name attribute.

In [62]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

- The index of a Series can be replaced in place by assignment.

In [63]:
print(obj)
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)

0    4
1    7
2   -5
3    3
dtype: int64
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64


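### 2. DataFrame

- A spreadsheet-like data structure containing an ordered collection of columns, each of which can have a different data type (numeric, string, boolean, etc.).
- It is comparable to a NumPy structured array.
- It can be viewed as a dict of Series that all share the same index.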

In [64]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(data)
print(frame) # the index defaults to 0, 1, 2, 3, 4

{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


In [65]:
DataFrame(data, columns=['year', 'state', 'pop']) # the columns argument sets the column order

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


In [66]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five'])
frame2 # 'debt' is not in data, so the column is filled with NaN

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [67]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

> Selecting DataFrame columns (column slicing)

- A column can be retrieved with dict-like syntax or with attribute syntax.
- The returned object is a Series.

In [68]:
print(frame2['state'])
print(type(frame2['state']))
print(frame2.year)
print(type(frame2.year))

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
<class 'pandas.core.series.Series'>
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
<class 'pandas.core.series.Series'>


> Selecting DataFrame rows (row slicing)

- Rows can be selected by label with loc or by position with iloc (both are covered in more detail later). The older ix indexer is deprecated and was removed in pandas 1.0.

In [69]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

- Assigning values to the debt column (similar to R).

In [70]:
frame2['debt'] = np.arange(5.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0


- When a Series is assigned to a column, its values are aligned on the index; rows without a matching label become NaN.

In [71]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
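
The difference between assigning a Series and assigning a plain array can be sketched on a smaller, made-up frame: a Series is matched by label, whereas an array is matched by position and must have exactly the right length.

```python
import numpy as np
from pandas import DataFrame, Series

frame = DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# Label alignment: 'two' is matched, 'four' has no row and is ignored,
# and the rows without a value ('one', 'three') become NaN.
frame['debt'] = Series([-1.2, -1.5], index=['two', 'four'])

# Positional assignment: the array length must equal the number of rows.
frame['rank'] = np.arange(3)

print(frame)
```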


- How to add a new column (similar to R)?

In [72]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


- How to delete a column?

In [73]:
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

> Building a DataFrame from a nested dict

- The outer dict keys become the columns and the inner keys become the row indices.

In [74]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop

{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

- Then pass the nested dict to the DataFrame constructor.

In [75]:
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


- Transposing a DataFrame.

In [76]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


In [77]:
DataFrame(pop, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


- A dict of Series is handled in the same way when building a pandas DataFrame.

In [78]:
pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7


- If a DataFrame's index and columns have their name attributes set, these are displayed as well. In the example below the index is named year and the columns are named state.

In [79]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


- The values attribute returns the data as a two-dimensional ndarray, one row per record.

In [80]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

- When the columns have different types, the resulting array has dtype object, i.e., mixed modes, similar to a data frame in R.

In [81]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

### 3. Index Objects

- pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names).
- When constructing a Series, the values come first and the index is set afterwards.

In [82]:
obj = Series(range(3), index=['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [83]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [84]:
index[1:]

Index(['b', 'c'], dtype='object')

- Index objects are immutable and thus cannot be modified by the user.

In [85]:
# index[1] = 'd' # raises TypeError because an Index is immutable
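
A minimal sketch of the immutability rule: item assignment raises a TypeError, while methods such as `Index.delete` and `Index.insert` return a new Index and leave the original untouched. The name `new_index` is illustrative.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'              # item assignment is not supported
except TypeError as err:
    print('TypeError:', err)

# delete and insert each return a new Index object.
new_index = index.delete(1).insert(1, 'd')

print(new_index)
print(index)                    # the original is unchanged
```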

In [86]:
index = pd.Index(np.arange(3))
print(index)
obj2 = Series([1.5, -2.5, 0], index=index)

Int64Index([0, 1, 2], dtype='int64')


In [87]:
obj2.index is index

True

In [88]:
type(obj2.index)

pandas.core.indexes.numeric.Int64Index

> Main Index objects in pandas

- Index: The most general Index object, representing axis labels in a NumPy array of Python objects.
- Int64Index: Specialized Index for integer values (removed in pandas 2.0, which uses a plain Index with an integer dtype instead).
- MultiIndex: “Hierarchical” index object representing multiple levels of indexing on a single axis. Can be thought of as similar to an array of tuples.
- DatetimeIndex: Stores nanosecond timestamps (represented using NumPy’s datetime64 dtype).
- PeriodIndex: Specialized Index for Period data (timespans).

In [89]:
frame3

state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


In [90]:
'Ohio' in frame3.columns

True

In [91]:
2003 in frame3.index

False

### 4. Data Manipulation using pandas

> Changing the index

- Series reindex

In [92]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [93]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [94]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

In [95]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [96]:
obj3.reindex(range(6), method='ffill') # 'ffill' means forward fill

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
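
For comparison, a short sketch of the two fill directions on the same Series: `ffill` carries the last known value forward, while `bfill` pulls the next known value backward and leaves a trailing label without a successor as NaN.

```python
import pandas as pd
from pandas import Series

obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = obj3.reindex(range(6), method='ffill')
backward = obj3.reindex(range(6), method='bfill')

print(forward)
print(backward)   # label 5 has no later value to pull from
```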

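- DataFrame reindex
- reindex can change the rows, the columns, or both.
- Columns are reindexed with the columns keyword.

- The nine integers 0 to 8 are arranged in 3 rows and 3 columns, and then the row index (index) and column names (columns) are set; this 1D-to-2D reshaping resembles matrix creation in R.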

In [97]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


- Adding a row label introduces NaN values.

In [98]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


- Reindexing the columns also introduces NaN for labels that did not exist.

In [99]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


- The fill methods along axis=0 are ffill (forward fill) and bfill (backward fill).

In [100]:
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill')

Unnamed: 0,Ohio,Texas,California
a,0,1,2
b,0,1,2
c,3,4,5
d,6,7,8


- Both axes can be reindexed in one call. (The older ix indexer offered a terser form, but it was removed in pandas 1.0.)

In [101]:
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


> Deleting data
- Use the drop method to delete entries.

In [102]:
obj = Series(np.arange(5.), index=['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [103]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

- Mind the square brackets: to drop several index labels, pass them to drop as a list, just as R needs c() to combine index values into a vector.

In [104]:
obj.drop(['d','c'])

a    0.0
b    1.0
e    4.0
dtype: float64

- The sixteen integers 0 to 15 are arranged in 4 rows and 4 columns, and then the row index (index) and column names (columns) are set.

In [105]:
data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


- A DataFrame can drop entries along either axis.

In [106]:
data.drop(['Colorado','Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [107]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


> Indexing, selection, and filtering

In [108]:
obj = Series(np.arange(4.), index=['a','b','c','d'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [109]:
obj[['b','a','d']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [110]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [111]:
obj['b':'c'] # label-based slice; the end label is included

b    1.0
c    2.0
dtype: float64

- DataFrame indexing

In [112]:
data = DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'],columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [113]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


- Note that positional slices include the start and exclude the end.

In [114]:
data[:2] # rows 0 to 1

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [115]:
data[data['three'] > 5] # Logical indexing

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [116]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [117]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


- A more concise syntax: loc selects rows and columns by label in one step.

In [118]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [119]:
data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]] # row labels, with column positions converted to labels

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


- The iloc indexer selects the third row by position; the resulting Series is displayed vertically.

In [120]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [121]:
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [122]:
data.loc[data.three > 5, data.columns[:3]]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


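> Arithmetic and data alignment

- The twelve values 0 to 11 are arranged in 3 rows and 4 columns, with the column names set through columns.
- list('abcd') splits the string into individual letters.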

In [123]:
df1 = DataFrame(np.arange(12.).reshape((3,4)), columns=list('abcd'))
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


- The twenty values 0 to 19 are arranged in 4 rows and 5 columns, with the column names set through columns.

In [124]:
df2 = DataFrame(np.arange(20.).reshape((4,5)), columns=list('abcde'))
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


- With the + operator, the last row and the last column are NaN because df1 has no matching labels there.

In [125]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


- The add method of DataFrame gives the same result.

In [126]:
df1.add(df2)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


- With fill_value=0, the entries missing from df1 (its last row and last column) are treated as 0 before adding.

In [127]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


> Several ways to select data from a pandas DataFrame

In [128]:
import pandas as pd
fb = pd.read_excel("/Users/Vince/cstsouMac/Python/Examples/Basics/data/facebook_checkins_2013-08-24.xls", skiprows=1)

In [129]:
fb.dtypes

地標ID           int64
地標名稱          object
累積打卡數          int64
latitude     float64
longitude    float64
類別            object
地區            object
dtype: object

In [130]:
fb.head()

Unnamed: 0,地標ID,地標名稱,累積打卡數,latitude,longitude,類別,地區
0,164103576936475,Taiwan Taoyuan International Airport,711761,25.076389,121.223889,機場,桃園縣
1,1393970400817450,臺灣桃園國際機場第二航廈,411525,25.077949,121.232405,機場,桃園縣
2,159730407378429,Taipei Railway Station,391239,25.047769,121.517109,車站,台北市
3,162736400406103,Shilin Night Market,385886,25.088073,121.525044,觀光夜市,台北市
4,125435810861333,花園夜市,351568,23.01057,120.199785,觀光夜市,台南市


- Select an entire column by its name.

In [131]:
fb['地標名稱']

0                   Taiwan Taoyuan International Airport
1                                           臺灣桃園國際機場第二航廈
2                                 Taipei Railway Station
3                                    Shilin Night Market
4                                                   花園夜市
5                                                 信義威秀影城
6                                   美麗華影城Miramar Cinemas
7                                          西門町 Ximenting
8                                       Big City遠東巨城購物中心
9                                 淡水老街 Tamsui Old Street
10     臺灣桃園國際機場第一航廈 Taiwan Taoyuan International Airp...
11                                                羅東觀光夜市
12                          台灣高鐵左營站 THSR Zuoying Station
13                           TAIPEI 101 MALL 台北 101 購物中心
14                                               基隆 廟口夜市
15                               Taipei Songshan Airport
16                         台灣高鐵台中站 THSR Taichung Station
17                             

- Select an entire column through attribute access on the DataFrame.

In [132]:
fb.類別

0             機場
1             機場
2             車站
3           觀光夜市
4           觀光夜市
5            電影院
6            電影院
7           地方商圈
8      百貨公司/購物中心
9          人文/古蹟
10            機場
11          觀光夜市
12            車站
13     百貨公司/購物中心
14          觀光夜市
15            機場
16            車站
17          地方商圈
18          地方商圈
19         休息轉運站
20      遊樂區/休閒農場
21          觀光夜市
22     百貨公司/購物中心
23         人文/古蹟
24     百貨公司/購物中心
25           電影院
26     百貨公司/購物中心
27          觀光夜市
28     百貨公司/購物中心
29          地方商圈
         ...    
970         地方商圈
971           餐飲
972      學校/文教機構
973         量販賣場
974           餐飲
975          捷運站
976           餐飲
977          KTV
978         飯店旅館
979           餐飲
980     體育場/運動中心
981           餐飲
982           餐飲
983           餐飲
984           餐飲
985        休息轉運站
986    百貨公司/購物中心
987         宗教會所
988           餐飲
989        人文/古蹟
990           車站
991         飯店旅館
992          捷運站
993          KTV
994           餐飲
995     體育場/運動中心
996           車站
997           

- The loc indexer restricts both rows and columns, but it requires labels (label-based indexing). A label slice includes its end label.

In [133]:
fb.loc[:9, ['地區','累積打卡數']] # labels 0 through 9, end label included

Unnamed: 0,地區,累積打卡數
0,桃園縣,711761
1,桃園縣,411525
2,台北市,391239
3,台北市,385886
4,台南市,351568
5,台北市,304376
6,台北市,297655
7,台北市,290853
8,新竹縣,287132
9,新北市,278212


- The iloc indexer also restricts both rows and columns, but it requires integer positions (positional indexing). A positional slice excludes its end.

In [134]:
fb.iloc[:10, [6, 2]]

Unnamed: 0,地區,累積打卡數
0,桃園縣,711761
1,桃園縣,411525
2,台北市,391239
3,台北市,385886
4,台南市,351568
5,台北市,304376
6,台北市,297655
7,台北市,290853
8,新竹縣,287132
9,新北市,278212


- The ix indexer accepted both labels and positions, combining the behavior of loc and iloc. It is deprecated and was removed in pandas 1.0, so the examples here use loc for labels and iloc for positions.

In [135]:
fb.loc[:9, ['latitude', 'longitude']]

Unnamed: 0,latitude,longitude
0,25.076389,121.223889
1,25.077949,121.232405
2,25.047769,121.517109
3,25.088073,121.525044
4,23.01057,120.199785
5,25.035618,121.56699
6,25.083232,121.557588
7,25.043655,121.507148
8,24.809879,120.975144
9,25.168825,121.443263


In [136]:
fb.iloc[:10, [3, 4]]

Unnamed: 0,latitude,longitude
0,25.076389,121.223889
1,25.077949,121.232405
2,25.047769,121.517109
3,25.088073,121.525044
4,23.01057,120.199785
5,25.035618,121.56699
6,25.083232,121.557588
7,25.043655,121.507148
8,24.809879,120.975144
9,25.168825,121.443263


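- Whole-column selection can use the first two approaches (by name or by attribute); the indexer-based approaches are better suited to flexible selection under multiple conditions.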
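
Since the check-in file is not distributed with these notes, the sketch below uses a small hypothetical frame (`df`, with made-up columns `name`, `checkins`, and `region`) to contrast label-based and positional slicing and to show a multi-condition selection.

```python
import pandas as pd

# Hypothetical stand-in for the check-in data.
df = pd.DataFrame(
    {'name': ['A', 'B', 'C', 'D'],
     'checkins': [40, 30, 20, 10],
     'region': ['north', 'north', 'south', 'south']},
    index=[10, 11, 12, 13])

# loc slices by label and includes the end label (rows 10 and 11).
by_label = df.loc[:11, ['region', 'checkins']]

# iloc slices by position and excludes the end (positions 0 and 1).
by_position = df.iloc[:2, [2, 1]]

print(by_label.equals(by_position))

# Conditions combine with & and |; each one needs its own parentheses.
busy_north = df.loc[(df['region'] == 'north') & (df['checkins'] > 35), 'name']
print(busy_north.tolist())
```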

## Conclusions

- pandas provides additional data structures for working with data sets in Python. Its primary abstraction is the DataFrame, which offers much more functionality and better performance.
- pandas is probably the primary Python tool for cleaning, munging, manipulating, and working with data. If you are going to use Python to munge, slice, group, and manipulate data sets, pandas is an invaluable tool.

## Reference
- McKinney, W. (2013), Python for Data Analysis, O'Reilly Media.