# Data Selection in Series
Like a dictionary, a Series object provides a mapping from a collection of keys to a collection of values:

In [1]:
import pandas as pd

In [3]:
data = pd.Series([.25, .5, .75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

We can also use dictionary-like expressions and methods to examine the keys/indices and values:

In [4]:
'a' in data

True

In [5]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [6]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

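A Series object can even be modified with dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value: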

In [7]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
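
The dictionary-like behavior described above can be summarized in one short sketch. It assumes the same four-element Series as the cells above; the variable names has_label, has_value, and as_dict are introduced here purely for illustration. The point to notice is that membership tests consult the index labels, not the stored values.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Membership tests look at the index labels (the "keys"), not the values.
has_label = 'a' in data
has_value = 0.25 in data

# Assigning to a new label appends an entry; an existing label is overwritten.
data['e'] = 1.25
data['a'] = 0.3

# Convert to a plain dict to see the resulting key/value mapping.
as_dict = data.to_dict()
print(has_label, has_value)
print(as_dict)
```

To test whether a value (rather than a label) is present, compare against the values instead, for example with data.isin([0.25]).any().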

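### Series as a one-dimensional array
A Series builds on this dictionary-like interface and also provides array-style item selection through the same basic mechanisms as NumPy arrays: slices, masking, and fancy indexing. For example: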

In [8]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [9]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [10]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [11]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

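Of these, slicing is the most likely source of confusion. When slicing with an explicit index (data['a':'c']), the final index is included in the slice; when slicing with an implicit index (data[0:2]), the final index is excluded.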
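
The endpoint rule can be checked directly. The sketch below rebuilds the original four-element Series and uses loc and iloc so that the label-based and position-based intent is explicit; by_label and by_position are illustrative names, not part of the original notebook.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: the endpoint 'c' is included.
by_label = data.loc['a':'c']

# Position-based slice: position 2 is excluded, as in plain Python.
by_position = data.iloc[0:2]

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```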

### Indexers: loc, iloc, ix
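If your Series has an explicit integer index, an indexing operation such as data[1] uses the explicit index, while a slicing operation such as data[1:3] uses the implicit, Python-style index.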

In [14]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [15]:
# explicit index when indexing
data[1]

'a'

In [16]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

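Because integer indexes can cause this confusion, Pandas provides special indexer attributes that explicitly select a particular indexing scheme.  
These are attributes rather than methods; each one exposes a particular slicing interface to the data in the Series.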

#### The loc attribute
The loc attribute allows indexing and slicing that always refer to the explicit index:

In [21]:
data.loc[1]

'a'

In [22]:
data.loc[1:3]

1    a
3    b
dtype: object

#### The iloc attribute
The iloc attribute allows indexing and slicing that always refer to the implicit, Python-style index:

In [23]:
data.iloc[1]

'b'

In [24]:
data.iloc[1:3]

3    b
5    c
dtype: object
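
A further way to see the difference between the two indexers is to ask for the key 2, which is a valid position but not a label in this Series. The following is a minimal sketch using the same three-element Series; by_position and missing_label are names introduced for illustration.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# iloc counts positions from 0, so position 2 is the last element.
by_position = data.iloc[2]

# loc looks up labels; 2 is not a label in this index, so a KeyError is raised.
try:
    data.loc[2]
    missing_label = False
except KeyError:
    missing_label = True

print(by_position, missing_label)
```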

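#### The ix attribute
The ix attribute is a hybrid of the two; for Series objects it is equivalent to standard []-based indexing. It was deprecated in pandas 0.20 and removed in pandas 1.0, so new code should use loc or iloc instead.  
It is discussed further in the DataFrame section.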

# Data Selection in DataFrame

### DataFrame as a dictionary
A DataFrame can be viewed as a dictionary of related Series objects:

In [25]:
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
New York    141297  19651127
Texas       695662  26448193


Each Series here forms one column of the DataFrame and can be accessed with dictionary-style indexing on the column name:

In [26]:
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

Equivalently, we can use attribute-style access with the column name:

In [30]:
data.area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

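Here, attribute-style access returns exactly the same object as dictionary-style access (the identity check depends on pandas' internal column caching, so the result may differ between pandas versions):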

In [31]:
data.area is data['area']

True

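Note: attribute-style access does not work if the column name is not a string or if it conflicts with the name of a DataFrame method. For example, DataFrame has a pop() method, so data.pop refers to the method rather than to the 'pop' column: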

In [32]:
data.pop is data['pop']

False

In particular, avoid assigning to a column through an attribute; use data['pop'] = z rather than data.pop = z.
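
The pitfalls of attribute-style access can be reproduced on a small frame. The sketch below uses a hypothetical two-row DataFrame and a made-up attribute name, ratio; it shows that assigning to a new attribute stores a plain Python attribute on the object and does not create a column, whereas item assignment does.

```python
import warnings
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# 'pop' is also a DataFrame method, so attribute access returns the method.
pop_is_method = callable(data.pop)

# Assigning to a new attribute name does not create a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas emits a UserWarning for this pattern
    data.ratio = data['pop'] / data['area']
ratio_is_column = 'ratio' in data.columns

# Item assignment is the reliable way to add a column.
data['density'] = data['pop'] / data['area']
density_is_column = 'density' in data.columns

print(pop_is_method, ratio_is_column, density_is_column)
```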

This dictionary-style syntax can also be used to modify the DataFrame itself, in this case to add a new column:

In [35]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740


### DataFrame as a two-dimensional array
We can examine the underlying data array with the values attribute:

In [36]:
data.values

array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
       [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
       [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
       [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
       [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])

We can also transpose the DataFrame to swap rows and columns:

In [37]:
data.T

         California     Florida    Illinois    New York       Texas
area       423967.0    170312.0    149995.0    141297.0    695662.0
pop      38332520.0  19552860.0  12882140.0  19651130.0  26448190.0
density    90.41393    114.8061    85.88376    139.0767    38.01874


Passing a single index to the underlying array accesses a row:

In [39]:
data.values[0]

array([  4.23967000e+05,   3.83325210e+07,   9.04139261e+01])

Passing a single index to the DataFrame accesses a column:

In [40]:
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
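
The row-versus-column contrast can be placed side by side. This is a sketch on a hypothetical two-row frame; it uses to_numpy(), which the pandas documentation recommends over the values attribute, and the names first_row, area_col, and first_row_labeled are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# NumPy semantics: a single index on the array selects a row.
first_row = data.to_numpy()[0]

# DataFrame semantics: a single index selects a column.
area_col = data['area']

# To get a row from the DataFrame and keep its labels, use iloc (or loc).
first_row_labeled = data.iloc[0]

print(first_row)
print(first_row_labeled['pop'])
```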

#### iloc
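With iloc, we can index the DataFrame as if it were a plain NumPy array (using the implicit, Python-style index), while the DataFrame index and column labels are preserved in the result.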

In [41]:
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135


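#### loc
With loc, we can index in the same array-like style, but using the explicit index and column names: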

In [42]:
data.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135


#### ix
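ix allowed a hybrid of the two approaches. It was deprecated in pandas 0.20 and removed in pandas 1.0, so the following cell only runs on older versions: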

In [45]:
data.ix[:3, :'pop']

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
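
Since ix was removed in pandas 1.0, a mixed positional/label selection like this one has to be written with loc and iloc. The sketch below shows two possible spellings, assuming the same area and pop data as above; subset_chained and subset_loc are illustrative names. Note that recent pandas versions keep the dictionary's insertion order for the rows rather than sorting the labels, so the first three rows may differ from the output shown above.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                 'Florida': 19552860, 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})

# Option 1: select rows by position, then columns by label.
subset_chained = data.iloc[:3].loc[:, :'pop']

# Option 2: translate the positions into labels and call loc once.
subset_loc = data.loc[data.index[:3], :'pop']

print(subset_loc)
```

The second form is usually the better choice when the result will be assigned to, because it selects in a single indexing step.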


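Any of the familiar NumPy-style access patterns can be used within these indexers. For example, loc can combine masking and fancy indexing: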

In [46]:
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
Florida   19552860  114.806121
New York  19651127  139.076746


Any of these indexing conventions can also be used to set or modify values:

In [47]:
data.iloc[0, 2] = 90
data

              area       pop     density
California  423967  38332521   90.000000
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740
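
Assignment works the same way through loc, including with a boolean mask. The following sketch uses a hypothetical two-row frame; the cap of 100.0 is an arbitrary value chosen only to show a masked update.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 170312], 'pop': [38332521, 19552860]},
                    index=['California', 'Florida'])
data['density'] = data['pop'] / data['area']

# Label-based equivalent of data.iloc[0, 2] = 90
data.loc['California', 'density'] = 90.0

# A row mask plus a column label updates only the matching cells.
data.loc[data['density'] > 100, 'density'] = 100.0

print(data)
```

Doing the row and column selection in a single loc call, rather than chaining two [] operations, ensures the assignment is applied to the original DataFrame.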


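### Additional indexing conventions
A couple of extra conventions may seem at odds with the preceding discussion: indexing refers to columns, while slicing refers to rows: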

In [48]:
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


In [51]:
data[1:3]

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


Similarly, direct masking operations are interpreted row-wise rather than column-wise:

In [52]:
data[data.density > 100]

            area       pop     density
Florida   170312  19552860  114.806121
New York  141297  19651127  139.076746
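
These [] shortcuts have explicit loc equivalents, which can be easier to read when rows and columns are both involved. The sketch below rebuilds the frame with the row order shown in the outputs above and compares the two spellings; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 170312, 149995, 141297, 695662],
     'pop': [38332521, 19552860, 12882135, 19651127, 26448193]},
    index=['California', 'Florida', 'Illinois', 'New York', 'Texas'])
data['density'] = data['pop'] / data['area']

# Slicing with [] selects rows; loc says the same thing explicitly.
rows_by_slice = data['Florida':'Illinois']
rows_by_loc = data.loc['Florida':'Illinois']

# A boolean mask with [] also filters rows.
masked = data[data['density'] > 100]
masked_by_loc = data.loc[data['density'] > 100]

print(rows_by_slice.equals(rows_by_loc))
print(list(masked.index))
```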
