<img width=200 src="https://camo.githubusercontent.com/903f3cc51db134b8c9faed2ba2b18ffedff67ff2aafe75259cbde477b27d9b4f/68747470733a2f2f75706c6f61642e77696b696d656469612e6f72672f77696b6970656469612f636f6d6d6f6e732f7468756d622f652f65642f50616e6461735f6c6f676f2e7376672f3132303070782d50616e6461735f6c6f676f2e7376672e706e673f7261773d74727565"></img>

# Day-10 Selecting Data from a Pandas DataFrame

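* Goals:
  * Select data correctly by column name and by index
  * Select data correctly by location coordinates
  * Select data correctly with mask operations
  * Index, manipulate, select, filter, merge, and sort Pandas data.
* Key points:
  * Filtering differs from modifying data: the filtered result is a new data set, and the original data is left untouched.
  * When merging, the merge key can span several columns. If the two sides share other column names, merge adds suffixes automatically, while join does not.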
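
The first key point can be checked with a small sketch. It reuses the two-row DataFrame from the cells below; the names `original` and `filtered` are only illustrative, and the explicit `.copy()` is an assumption made here to keep the filtered result clearly independent.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])
original = df.copy()

# Filtering returns a new DataFrame; .copy() makes the independence explicit
filtered = df[df['A'] > 2].copy()

# Modify only the filtered result
filtered.loc['b', 'A'] = 100

print(filtered)
print(df.equals(original))  # compares the source frame with its earlier snapshot
```

Changing `filtered` leaves `df` as it was, which is the behaviour the key point describes.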

## Importing packages

In [None]:
# Load the NumPy and Pandas packages
import numpy as np
import pandas as pd

# Check that both loaded correctly and print their versions
print(np)
print(np.__version__)
print(pd)
print(pd.__version__)

<module 'numpy' from '/Users/wei/.virtualenvs/py3/lib/python3.6/site-packages/numpy/__init__.py'>
1.16.1
<module 'pandas' from '/Users/wei/.virtualenvs/py3/lib/python3.6/site-packages/pandas/__init__.py'>
0.23.1


## Selecting by column name

### Selecting a single column

In [None]:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

print(df['A'])
print(df['B'])

a    1
b    4
Name: A, dtype: int64
a    2
b    5
Name: B, dtype: int64


### Selecting multiple columns

In [None]:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

print(df[['A', 'B']])
print(df[['A', 'C']])

   A  B
a  1  2
b  4  5
   A  C
a  1  3
b  4  6


## Selecting one or more rows by row position

In [None]:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

print(df[0:1])
print(df[0:2])

   A  B  C
a  1  2  3
   A  B  C
a  1  2  3
b  4  5  6
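
As a minimal sketch of what the slice form is doing: on this label-indexed frame, an integer slice inside `[]` selects rows by position, much like `.iloc`, while a single integer inside `[]` is treated as a column name. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

# An integer slice in [] picks rows by position
rows_by_bracket = df[0:1]
rows_by_iloc = df.iloc[0:1]
print(rows_by_bracket.equals(rows_by_iloc))

# A bare integer is looked up as a column name, so this fails here
raised = False
try:
    df[0]
except KeyError as err:
    raised = True
    print('KeyError:', err)
```

For a single row by position, `df.iloc[0]` or the one-element slice `df[0:1]` are the usual choices.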


### Getting columns and rows with loc, iloc, ix

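* Use `[]` and `.loc[]` for boolean selection: both return the rows where the condition is True, so you can filter on column values. `.loc[]` can also select columns in the same call, while `[]` cannot
* The two approaches above select data from either the column or the row perspective, but each filters only one dimension at a time. To make full use of the two-dimensional nature of a DataFrame, Pandas provides coordinate-based selection with `.loc[...]`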

In [None]:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

print(df.loc['a', 'A'])
print(df.loc['a', ['A', 'B']])
print(df.loc[['a', 'b'], 'A'])
print(df.loc[['a', 'b'], ['A', 'B']])

1
A    1
B    2
Name: a, dtype: int64
a    1
b    4
Name: A, dtype: int64
   A  B
a  1  2
b  4  5
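
The boolean selection mentioned in the first bullet of this section can be sketched as follows. It uses the same frame; `mask`, `rows_only` and `rows_and_cols` are illustrative names, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

# Boolean Series: True for the rows to keep
mask = df['A'] > 2

# [] filters rows only and returns every column
rows_only = df[mask]
print(rows_only)

# .loc filters rows and picks columns in one call
rows_and_cols = df.loc[mask, ['B', 'C']]
print(rows_and_cols)
```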


* `.loc` itself is a special indexer object; pair it with `[...]` to select by coordinates.

In [None]:
print(df.loc)
# <pandas.core.indexing._LocIndexer object at 0x7f236192f770>
print(type(df.loc))
# <class 'pandas.core.indexing._LocIndexer'>

* A similar accessor is `.iloc[...]`, which selects data by integer position

In [None]:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

print(df.iloc[0, 0])
print(df.iloc[0, [0, 1]])
print(df.iloc[[0, 1], 0])
print(df.iloc[[0, 1], [0, 1]])

1
A    1
B    2
Name: a, dtype: int64
a    1
b    4
Name: A, dtype: int64
   A  B
a  1  2
b  4  5
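
One difference between the two accessors shows up when slicing: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position. The sketch below assumes a hypothetical four-row frame that is not part of the original notebook.

```python
import pandas as pd

# Hypothetical frame, used only to show the slicing rules
df4 = pd.DataFrame({'A': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# Label slice: both 'a' and 'c' are included
by_label = df4.loc['a':'c']
print(by_label)

# Position slice: position 2 is excluded, as in plain Python slicing
by_position = df4.iloc[0:2]
print(by_position)
```

This is why `stock_data.loc[:5, ...]` in the examples further down writes six rows, while `stock_data.iloc[3:6]` returns three.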


* Using `.ix[...]`
  * Note: ix has been marked as deprecated (since pandas 0.20) and was removed in pandas 1.0, so it is not available in later versions

In [None]:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

print(df.ix[0, 'A'])
print(df.ix['a', [0, 1]])
print(df.ix[['a', 'b'], 0])
print(df.ix[[0, 1], ['A', 'B']])

1
A    1
B    2
Name: a, dtype: int64
a    1
b    4
Name: A, dtype: int64
   A  B
a  1  2
b  4  5


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
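
On pandas versions without `.ix`, the mixed label/position lookups above can be rewritten with `.loc` and `.iloc`. This is one possible sketch rather than the only mapping; it converts between labels and positions through `df.index` and `df.columns`.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

# df.ix[0, 'A']: row by position, column by label
v_loc = df.loc[df.index[0], 'A']
v_iloc = df.iloc[0, df.columns.get_loc('A')]
print(v_loc, v_iloc)

# df.ix['a', [0, 1]]: row by label, columns by position
row_a = df.loc['a', df.columns[[0, 1]]]
print(row_a)

# df.ix[[0, 1], ['A', 'B']]: rows by position, columns by label
block = df.iloc[[0, 1]][['A', 'B']]
print(block)
```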


### Getting data with iat, at

* If you only need a single value, you can pass the coordinates directly to iat and at

In [None]:
print(df.loc['a', 'A'])
print(df.iloc[0, 1])

print(df.at['a', 'A'])
print(df.iat[0, 1])

1
2
1
2
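
Beyond reading, `at` and `iat` can also assign a single cell. The sketch below is an addition to the original example and works on a fresh copy of the same frame.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

# Read one scalar by label
value = df.at['a', 'B']
print(value)

# Write one scalar by label, and one by position
df.at['a', 'B'] = 20
df.iat[1, 2] = 60
print(df)
```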


## Filtering data by condition (masks)

In [None]:
print(df > 2)
#        A      B     C
# a  False  False  True
# b   True   True  True

print(df[df > 2])
#      A    B  C
# a  NaN  NaN  3
# b  4.0  5.0  6


       A      B     C
a  False  False  True
b   True   True  True
     A    B  C
a  NaN  NaN  3
b  4.0  5.0  6


In [None]:
print(df['A'] > 2)
# a    False
# b     True
# Name: A, dtype: bool

print(df[df['A'] > 2])
#    A  B  C
# b  4  5  6



a    False
b     True
Name: A, dtype: bool
   A  B  C
b  4  5  6
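
Masks can be combined. A short sketch, assuming the same frame: use `&` for and, `|` for or, `~` for not, and wrap each comparison in parentheses because these operators bind more tightly than `>` and `<`.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'], columns=['A', 'B', 'C'])

# Rows where both conditions hold
both = df[(df['A'] > 2) & (df['C'] > 5)]
print(both)

# Rows where at least one condition holds
either = df[(df['A'] > 2) | (df['C'] < 4)]
print(either)

# Rows where the condition does not hold
negated = df[~(df['A'] > 2)]
print(negated)
```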


## Filtering and combining data

### Concatenation (concat)

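* pandas can combine the data of several objects into one new object. DataFrames can be concatenated (concat) along either axis; any values missing after alignment are filled with NaN
* The examples below use data for the Yuanta Taiwan 50 ETF: stock_data1 has the columns [date, open, close] and stock_data2 has the columns [date, open, high]
  * Concatenating vertically with axis=0: stock_data1 has no high column and stock_data2 has no close column, so those positions are filled with NaN
  * Concatenating horizontally with axis=1 aligns rows by their index labels
* When concat combines data, the default join type is an outer join (union). The join parameter changes it; for example, join='inner' performs an inner join (intersection)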
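
The three concat variants described above can be sketched on a two-row subset of the values shown in the examples further down. The frames are built inline here rather than read from the CSV files, which is an assumption made to keep the block self-contained.

```python
import pandas as pd

# Two-row subsets of the stock data used later in this notebook
stock_data1 = pd.DataFrame({'date': ['109/10/12', '109/10/13'],
                            'open': [106.7, 107.35],
                            'close': [107.05, 107.1]})
stock_data2 = pd.DataFrame({'date': ['109/10/13', '109/10/14'],
                            'open': [107.35, 107.05],
                            'high': [107.6, 107.2]})

# axis=0 stacks rows; columns missing on one side become NaN
stacked = pd.concat([stock_data1, stock_data2], axis=0)
print(stacked)

# axis=1 places the frames side by side, aligned on the row index
side_by_side = pd.concat([stock_data1, stock_data2], axis=1)
print(side_by_side)

# join='inner' keeps only the columns both frames share
common = pd.concat([stock_data1, stock_data2], axis=0, join='inner')
print(common)
```

The vertical results keep the original row labels, so labels repeat; passing `ignore_index=True` is one way to renumber them.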

### Merging (merge)

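* A merge combines two data sets by finding matching values in one or more columns or row indexes
* The examples below merge on the key column date with how='outer' (outer join). When the two sides share a column name other than the key date, pandas automatically appends '_x' to the duplicated column open from the left DataFrame stock_data1, and '_y' to the duplicated column open from the right DataFrame stock_data2
* `pd.merge()` defaults to an inner join; the how parameter changes the join type
  * inner: the intersection of the keys in both data sets
  * outer: the union of the keys in both data sets
  * left: use only the keys from the left data set
  * right: use only the keys from the right data set
* `.join()` joins two DataFrames on their index labels. Note that here, besides the index label date, both data sets also have a column named open. Unlike merge, join does not add suffixes to duplicated column names automatically, so you must supply them with the lsuffix and rsuffix parameters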
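
The suffix behaviour of merge and join can be sketched on a two-row subset of the same stock values. The frames are built inline as an assumption for self-containment, and names such as `merged`, `custom` and `joined` are illustrative.

```python
import pandas as pd

stock_data1 = pd.DataFrame({'date': ['109/10/12', '109/10/13'],
                            'open': [106.7, 107.35],
                            'close': [107.05, 107.1]})
stock_data2 = pd.DataFrame({'date': ['109/10/13', '109/10/14'],
                            'open': [107.35, 107.05],
                            'high': [107.6, 107.2]})

# merge adds '_x' / '_y' to the overlapping non-key column by default
merged = pd.merge(stock_data1, stock_data2, on='date', how='outer')
print(merged.columns.tolist())

# The suffixes parameter replaces the default pair
custom = pd.merge(stock_data1, stock_data2, on='date', how='inner',
                  suffixes=('_left', '_right'))
print(custom)

# join works on the index and refuses overlapping columns without suffixes
left = stock_data1.set_index('date')
right = stock_data2.set_index('date')
join_failed = False
try:
    left.join(right)
except ValueError as err:
    join_failed = True
    print('ValueError:', err)

joined = left.join(right, how='outer', lsuffix='_left', rsuffix='_right')
print(joined)
```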

## Examples

In [None]:
boston_data = pd.read_csv('boston.csv', usecols=['CRIM','ZN','key','INDUS'])
boston_data_index = boston_data.set_index('key')
boston_data_index

key,CRIM,ZN,INDUS
1,0.02731,0.0,7.07
2,0.02729,0.0,7.07
3,0.03237,0.0,2.18
4,0.06905,0.0,2.18
5,0.02985,0.0,2.18
...,...,...,...
501,0.06263,0.0,11.93
502,0.04527,0.0,11.93
503,0.06076,0.0,11.93
504,0.10959,0.0,11.93


In [None]:
boston_data_index.index

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            496, 497, 498, 499, 500, 501, 502, 503, 504, 505],
           dtype='int64', name='key', length=505)

In [None]:
boston_data_index2 = boston_data.set_index(['key','INDUS'])
boston_data_index2

key,INDUS,CRIM,ZN
1,7.07,0.02731,0.0
2,7.07,0.02729,0.0
3,2.18,0.03237,0.0
4,2.18,0.06905,0.0
5,2.18,0.02985,0.0
...,...,...,...
501,11.93,0.06263,0.0
502,11.93,0.04527,0.0
503,11.93,0.06076,0.0
504,11.93,0.10959,0.0


In [None]:
boston_data_index2.index

MultiIndex([(  1,  7.07),
            (  2,  7.07),
            (  3,  2.18),
            (  4,  2.18),
            (  5,  2.18),
            (  6,  7.87),
            (  7,  7.87),
            (  8,  7.87),
            (  9,  7.87),
            ( 10,  7.87),
            ...
            (496,  9.69),
            (497,  9.69),
            (498,  9.69),
            (499,  9.69),
            (500,  9.69),
            (501, 11.93),
            (502, 11.93),
            (503, 11.93),
            (504, 11.93),
            (505, 11.93)],
           names=['key', 'INDUS'], length=505)

In [None]:
new_boston_data = boston_data.rename(columns={'CRIM':'feature1'})
new_boston_data

,key,feature1,ZN,INDUS
0,1,0.02731,0.0,7.07
1,2,0.02729,0.0,7.07
2,3,0.03237,0.0,2.18
3,4,0.06905,0.0,2.18
4,5,0.02985,0.0,2.18
...,...,...,...,...
500,501,0.06263,0.0,11.93
501,502,0.04527,0.0,11.93
502,503,0.06076,0.0,11.93
503,504,0.10959,0.0,11.93


In [None]:
copy1 = boston_data.copy()
copy1['round_INDUS'] = round(copy1['INDUS'])
copy1

,key,CRIM,ZN,INDUS,round_INDUS
0,1,0.02731,0.0,7.07,7.0
1,2,0.02729,0.0,7.07,7.0
2,3,0.03237,0.0,2.18,2.0
3,4,0.06905,0.0,2.18,2.0
4,5,0.02985,0.0,2.18,2.0
...,...,...,...,...,...
500,501,0.06263,0.0,11.93,12.0
501,502,0.04527,0.0,11.93,12.0
502,503,0.06076,0.0,11.93,12.0
503,504,0.10959,0.0,11.93,12.0


In [None]:
copy2 = boston_data.copy()
copy2.insert(1,'round_INDUS',round(copy2['INDUS']))
copy2

,key,round_INDUS,CRIM,ZN,INDUS
0,1,7.0,0.02731,0.0,7.07
1,2,7.0,0.02729,0.0,7.07
2,3,2.0,0.03237,0.0,2.18
3,4,2.0,0.06905,0.0,2.18
4,5,2.0,0.02985,0.0,2.18
...,...,...,...,...,...
500,501,12.0,0.06263,0.0,11.93
501,502,12.0,0.04527,0.0,11.93
502,503,12.0,0.06076,0.0,11.93
503,504,12.0,0.10959,0.0,11.93


In [None]:
del copy2['round_INDUS']
copy2

,key,CRIM,ZN,INDUS
0,1,0.02731,0.0,7.07
1,2,0.02729,0.0,7.07
2,3,0.03237,0.0,2.18
3,4,0.06905,0.0,2.18
4,5,0.02985,0.0,2.18
...,...,...,...,...
500,501,0.06263,0.0,11.93
501,502,0.04527,0.0,11.93
502,503,0.06076,0.0,11.93
503,504,0.10959,0.0,11.93


In [None]:
print(copy1.pop('round_INDUS'))
copy1

0       7.0
1       7.0
2       2.0
3       2.0
4       2.0
       ... 
500    12.0
501    12.0
502    12.0
503    12.0
504    12.0
Name: round_INDUS, Length: 505, dtype: float64


,key,CRIM,ZN,INDUS
0,1,0.02731,0.0,7.07
1,2,0.02729,0.0,7.07
2,3,0.03237,0.0,2.18
3,4,0.06905,0.0,2.18
4,5,0.02985,0.0,2.18
...,...,...,...,...
500,501,0.06263,0.0,11.93
501,502,0.04527,0.0,11.93
502,503,0.06076,0.0,11.93
503,504,0.10959,0.0,11.93


In [None]:
copy3 = boston_data.copy()
copy3.drop('CRIM',axis=1)

,key,ZN,INDUS
0,1,0.0,7.07
1,2,0.0,7.07
2,3,0.0,2.18
3,4,0.0,2.18
4,5,0.0,2.18
...,...,...,...
500,501,0.0,11.93
501,502,0.0,11.93
502,503,0.0,11.93
503,504,0.0,11.93


In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
boston_data = pd.concat([boston_data, pd.DataFrame([[506,0,0,0]],columns=boston_data.columns)])
boston_data

,key,CRIM,ZN,INDUS
0,1,0.02731,0.0,7.07
1,2,0.02729,0.0,7.07
2,3,0.03237,0.0,2.18
3,4,0.06905,0.0,2.18
4,5,0.02985,0.0,2.18
...,...,...,...,...
501,502,0.04527,0.0,11.93
502,503,0.06076,0.0,11.93
503,504,0.10959,0.0,11.93
504,505,0.04741,0.0,11.93


In [None]:
boston_data = boston_data.drop(1)
boston_data

,key,CRIM,ZN,INDUS
0,1,0.02731,0.0,7.07
2,3,0.03237,0.0,2.18
3,4,0.06905,0.0,2.18
4,5,0.02985,0.0,2.18
5,6,0.08829,12.5,7.87
...,...,...,...,...
501,502,0.04527,0.0,11.93
502,503,0.06076,0.0,11.93
503,504,0.10959,0.0,11.93
504,505,0.04741,0.0,11.93


In [None]:
stock_data = pd.read_csv('STOCK_DAY_0050_202010.csv')
stock_data.loc[:5,['date','open','close']].to_csv('STOCK1.csv',index=False)
stock_data.loc[3:7,['date','open','high']].to_csv('STOCK2.csv',index=False)

In [None]:
stock_data

,date,open,high,low,close
0,109/10/05,103.45,104.05,103.0,103.05
1,109/10/06,104.0,104.35,103.85,104.25
2,109/10/07,104.0,105.0,103.5,104.8
3,109/10/08,105.45,106.35,105.3,106.2
4,109/10/12,106.7,107.7,106.7,107.05
5,109/10/13,107.35,107.6,106.2,107.1
6,109/10/14,107.05,107.2,106.45,106.7
7,109/10/15,106.5,106.5,105.1,105.7
8,109/10/16,105.7,106.3,105.1,105.25
9,109/10/19,105.65,106.6,105.6,106.6


In [None]:
stock_data.loc[stock_data.open<104]

,date,open,high,low,close
0,109/10/05,103.45,104.05,103.0,103.05
18,109/10/30,103.55,103.6,102.7,103.0


In [None]:
stock_data.loc[(stock_data.open<104)&(stock_data.close>103),['open','close']]

,open,close
0,103.45,103.05


In [None]:
stock_data.iloc[3:6]

,date,open,high,low,close
3,109/10/08,105.45,106.35,105.3,106.2
4,109/10/12,106.7,107.7,106.7,107.05
5,109/10/13,107.35,107.6,106.2,107.1


In [None]:
stock_data.iloc[3:6,:2]

,date,open
3,109/10/08,105.45
4,109/10/12,106.7
5,109/10/13,107.35


In [None]:
stock_data1=pd.read_csv('STOCK1.csv')
stock_data1

,date,open,close
0,109/10/05,103.45,103.05
1,109/10/06,104.0,104.25
2,109/10/07,104.0,104.8
3,109/10/08,105.45,106.2
4,109/10/12,106.7,107.05
5,109/10/13,107.35,107.1


In [None]:
stock_data2=pd.read_csv('STOCK2.csv')
stock_data2

,date,open,high
0,109/10/08,105.45,106.35
1,109/10/12,106.7,107.7
2,109/10/13,107.35,107.6
3,109/10/14,107.05,107.2
4,109/10/15,106.5,106.5


In [None]:
pd.concat([stock_data1,stock_data2],axis=0)

,date,open,close,high
0,109/10/05,103.45,103.05,
1,109/10/06,104.0,104.25,
2,109/10/07,104.0,104.8,
3,109/10/08,105.45,106.2,
4,109/10/12,106.7,107.05,
5,109/10/13,107.35,107.1,
0,109/10/08,105.45,,106.35
1,109/10/12,106.7,,107.7
2,109/10/13,107.35,,107.6
3,109/10/14,107.05,,107.2


In [None]:
pd.concat([stock_data1,stock_data2],axis=1)

,date,open,close,date,open,high
0,109/10/05,103.45,103.05,109/10/08,105.45,106.35
1,109/10/06,104.0,104.25,109/10/12,106.7,107.7
2,109/10/07,104.0,104.8,109/10/13,107.35,107.6
3,109/10/08,105.45,106.2,109/10/14,107.05,107.2
4,109/10/12,106.7,107.05,109/10/15,106.5,106.5
5,109/10/13,107.35,107.1,,,


In [None]:
pd.concat([stock_data1,stock_data2],axis=0,join='inner')

,date,open
0,109/10/05,103.45
1,109/10/06,104.0
2,109/10/07,104.0
3,109/10/08,105.45
4,109/10/12,106.7
5,109/10/13,107.35
0,109/10/08,105.45
1,109/10/12,106.7
2,109/10/13,107.35
3,109/10/14,107.05


In [None]:
pd.merge(stock_data1,stock_data2,on='date',how='outer')

,date,open_x,close,open_y,high
0,109/10/05,103.45,103.05,,
1,109/10/06,104.0,104.25,,
2,109/10/07,104.0,104.8,,
3,109/10/08,105.45,106.2,105.45,106.35
4,109/10/12,106.7,107.05,106.7,107.7
5,109/10/13,107.35,107.1,107.35,107.6
6,109/10/14,,,107.05,107.2
7,109/10/15,,,106.5,106.5


In [None]:
pd.merge(stock_data1,stock_data2,on='date',how='left')

,date,open_x,close,open_y,high
0,109/10/05,103.45,103.05,,
1,109/10/06,104.0,104.25,,
2,109/10/07,104.0,104.8,,
3,109/10/08,105.45,106.2,105.45,106.35
4,109/10/12,106.7,107.05,106.7,107.7
5,109/10/13,107.35,107.1,107.35,107.6


In [None]:
pd.merge(stock_data1,stock_data2,on='date',how='right')

,date,open_x,close,open_y,high
0,109/10/08,105.45,106.2,105.45,106.35
1,109/10/12,106.7,107.05,106.7,107.7
2,109/10/13,107.35,107.1,107.35,107.6
3,109/10/14,,,107.05,107.2
4,109/10/15,,,106.5,106.5


In [None]:
stock_data1_index = stock_data1.set_index('date')
stock_data2_index = stock_data2.set_index('date')

In [None]:
stock_data1_index.join(stock_data2_index,how='outer',lsuffix='_left',rsuffix='_right')

date,open_left,close,open_right,high
109/10/05,103.45,103.05,,
109/10/06,104.0,104.25,,
109/10/07,104.0,104.8,,
109/10/08,105.45,106.2,105.45,106.35
109/10/12,106.7,107.05,106.7,107.7
109/10/13,107.35,107.1,107.35,107.6
109/10/14,,,107.05,107.2
109/10/15,,,106.5,106.5
