## Data Selection

### Agenda

- Single columns and slices
- Selection by label
- Selection by position
- Conditional indexing / filtering

In [1]:
import pandas as pd
import numpy as np

In [2]:
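# 6x4 frame of standard-normal values indexed by date; no seed is set, so the values differ on every run
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('20200101', periods=6), columns=list('ABCD'))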

df

                   A         B         C         D
2020-01-01 -0.365506 -0.495573 -0.960927  1.866913
2020-01-02  1.001643  0.012393 -0.062980 -1.901772
2020-01-03  0.504490 -0.134172  1.603740 -0.181923
2020-01-04  0.316920 -1.636116 -0.152140 -1.074793
2020-01-05 -0.702104 -0.747868 -0.615878 -0.390399
2020-01-06 -0.568494 -1.780426 -0.431316 -1.040039


### Single columns and slices

Selecting a single column yields a Series

In [3]:
df['A']

2020-01-01   -0.365506
2020-01-02    1.001643
2020-01-03    0.504490
2020-01-04    0.316920
2020-01-05   -0.702104
2020-01-06   -0.568494
Freq: D, Name: A, dtype: float64
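
The difference between passing a single label and a list of labels to `[]` can be sketched as follows. The frame `demo` is a hypothetical, deterministic stand-in for the random `df` above, introduced here only so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic frame (values 0..23), not the random df from the notebook
demo = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range('20200101', periods=6),
    columns=list('ABCD'),
)

as_series = demo['A']    # single label -> Series
as_frame = demo[['A']]   # list of labels -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (6,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (6, 1)
```

Wrapping the label in a list is a convenient way to keep the two-dimensional shape when later code expects a DataFrame.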

Slicing rows with `[]`

In [4]:
df[0:3]

                   A         B         C         D
2020-01-01 -0.365506 -0.495573 -0.960927  1.866913
2020-01-02  1.001643  0.012393 -0.062980 -1.901772
2020-01-03  0.504490 -0.134172  1.603740 -0.181923


In [5]:
df['20200102':'20200104']

                   A         B         C         D
2020-01-02  1.001643  0.012393 -0.062980 -1.901772
2020-01-03  0.504490 -0.134172  1.603740 -0.181923
2020-01-04  0.316920 -1.636116 -0.152140 -1.074793
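
The two row slices above follow different endpoint rules: an integer slice excludes its end position, while a label slice includes both endpoints. A minimal sketch, using a hypothetical deterministic frame `demo` in place of the random `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic frame, not the random df from the notebook
demo = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range('20200101', periods=6),
    columns=list('ABCD'),
)

by_position = demo[1:3]                    # positions 1 and 2; position 3 is excluded
by_label = demo['20200102':'20200104']     # both date labels are included

print(len(by_position))  # 2
print(len(by_label))     # 3
```

Here `demo[1:3]` stops at 2020-01-03, whereas the label slice runs through 2020-01-04.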


### Selection by label

Slicing by label along both rows and columns

In [6]:
df.loc['20200102':'20200104', ['A', 'B']]

                   A         B
2020-01-02  1.001643  0.012393
2020-01-03  0.504490 -0.134172
2020-01-04  0.316920 -1.636116


Selecting multiple columns by label

In [7]:
df.loc[:, ['A', 'B']]

                   A         B
2020-01-01 -0.365506 -0.495573
2020-01-02  1.001643  0.012393
2020-01-03  0.504490 -0.134172
2020-01-04  0.316920 -1.636116
2020-01-05 -0.702104 -0.747868
2020-01-06 -0.568494 -1.780426


Extracting a single row by label

In [8]:
df.loc['20200102']

A    1.001643
B    0.012393
C   -0.062980
D   -1.901772
Name: 2020-01-02 00:00:00, dtype: float64
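
Label slices in `.loc` are inclusive on both axes, so a column range such as `'A':'C'` also returns column `C`. A small sketch, again on a hypothetical deterministic frame `demo` rather than the random `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic frame, not the random df from the notebook
demo = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range('20200101', periods=6),
    columns=list('ABCD'),
)

# Row labels 2020-01-02..2020-01-04 and column labels A..C, endpoints included
sub = demo.loc['20200102':'20200104', 'A':'C']

print(sub.shape)            # (3, 3)
print(list(sub.columns))    # ['A', 'B', 'C']
```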

### Selection by position

Selecting a row by integer position

In [9]:
df.iloc[3]

A    0.316920
B   -1.636116
C   -0.152140
D   -1.074793
Name: 2020-01-04 00:00:00, dtype: float64

Slicing rows and columns by integer position (the end position is excluded)

In [10]:
df.iloc[3:5, 0:2]

                   A         B
2020-01-04  0.316920 -1.636116
2020-01-05 -0.702104 -0.747868
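
Besides slices, `.iloc` also accepts lists of integer positions, which is useful when the wanted rows or columns are not contiguous. A minimal sketch on a hypothetical deterministic frame `demo` (not the random `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic frame, not the random df from the notebook
demo = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range('20200101', periods=6),
    columns=list('ABCD'),
)

# Rows at positions 0, 2, 4 and columns at positions 0, 3 (A and D)
picked = demo.iloc[[0, 2, 4], [0, 3]]

print(picked.to_numpy().tolist())  # [[0, 3], [8, 11], [16, 19]]
```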


Slicing rows explicitly, keeping all columns

In [11]:
df.iloc[1:3, :]

                   A         B         C         D
2020-01-02  1.001643  0.012393 -0.062980 -1.901772
2020-01-03  0.504490 -0.134172  1.603740 -0.181923


Slicing columns explicitly, keeping all rows

In [12]:
df.iloc[:, 1:3]

                   B         C
2020-01-01 -0.495573 -0.960927
2020-01-02  0.012393 -0.062980
2020-01-03 -0.134172  1.603740
2020-01-04 -1.636116 -0.152140
2020-01-05 -0.747868 -0.615878
2020-01-06 -1.780426 -0.431316


Extracting a single value explicitly

In [13]:
df.iloc[1, 1]

0.012392958278996535

Fast scalar access, equivalent to the method above

In [14]:
df.iat[1, 1]

0.012392958278996535
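
The equivalence of the scalar accessors can be checked directly. The sketch below also includes `.at`, the label-based counterpart of `.iat`; the frame `demo` and the variable names are hypothetical and only serve to make the result deterministic.

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic frame, not the random df from the notebook
demo = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range('20200101', periods=6),
    columns=list('ABCD'),
)

via_iloc = demo.iloc[1, 1]                           # general positional indexer
via_iat = demo.iat[1, 1]                             # scalar-only positional accessor
via_at = demo.at[pd.Timestamp('2020-01-02'), 'B']    # scalar-only label accessor

print(via_iloc, via_iat, via_at)  # 5 5 5
```

`.iat` and `.at` accept only a single row/column pair, which is why they can skip the more general indexing logic of `.iloc` and `.loc`.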

### Conditional indexing / filtering

Selecting rows with a boolean condition

In [19]:
df[df['A'] > 0]

                   A         B         C         D
2020-01-02  1.001643  0.012393 -0.062980 -1.901772
2020-01-03  0.504490 -0.134172  1.603740 -0.181923
2020-01-04  0.316920 -1.636116 -0.152140 -1.074793
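
Conditions can be combined with `&` and `|` (each wrapped in parentheses), and a whole-frame comparison can serve as a mask as well. A sketch on a hypothetical deterministic frame `demo`, with thresholds chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic frame, not the random df from the notebook
demo = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range('20200101', periods=6),
    columns=list('ABCD'),
)

# Row filter: both conditions must hold; parentheses are required around each one
both = demo[(demo['A'] > 4) & (demo['D'] < 20)]

# Whole-frame mask: the shape is preserved and non-matching cells become NaN
masked = demo[demo > 10]

print(len(both))                        # 3
print(int(masked.isna().sum().sum()))   # 11
```

A row condition drops rows, whereas a whole-frame mask keeps every row and column and only blanks out individual cells.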


Filtering with `isin()`

In [21]:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2[df2['E'].isin(['two', 'four'])]

                   A         B         C         D     E
2020-01-03  0.504490 -0.134172  1.603740 -0.181923   two
2020-01-05 -0.702104 -0.747868 -0.615878 -0.390399  four
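
To keep the rows that are *not* in the list, the `isin()` mask can be inverted with `~`. A minimal sketch, assuming a hypothetical deterministic frame `demo2` that carries the same `E` labels as above:

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic frame with the same 'E' labels as df2 above
demo2 = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=pd.date_range('20200101', periods=6),
    columns=list('ABCD'),
)
demo2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

# ~ negates the boolean mask, so rows labelled 'two' or 'four' are dropped
excluded = demo2[~demo2['E'].isin(['two', 'four'])]

print(excluded['E'].tolist())  # ['one', 'one', 'three', 'three']
```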
