Chapter 5

# Getting Started with pandas

In [2]:
from pandas import Series,DataFrame
import pandas as pd
import numpy as np

---

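## Introduction to pandas Data Structures

### Series

A Series is a one-dimensional array-like object made up of a sequence of values (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data: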

In [2]:
obj = Series([4,5,-5,3])

In [3]:
obj

0    4
1    5
2   -5
3    3
dtype: int64

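The string representation of a Series shows the index on the left and the values on the right. Since we did not specify an index for the data, a default integer index running from 0 to N-1 (where N is the length of the data) is created. The values and index attributes of the Series return its array representation and its index object, respectively: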

In [5]:
obj.values

array([ 4,  5, -5,  3], dtype=int64)

In [6]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Often you will want to create a Series with an index that labels each data point:

In [9]:
obj2 = Series([4,7,-5,3],index=['d','b','a','c'])

In [10]:
obj2


d    4
b    7
a   -5
c    3
dtype: int64

In [11]:
obj2.index


Index(['d', 'b', 'a', 'c'], dtype='object')

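Compared with a plain NumPy array, you can use the index labels to select a single value or a set of values from a Series. Operations such as filtering with a boolean array or applying a NumPy function preserve the link between index and value: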

In [12]:
obj2['a']

-5

In [13]:
obj2[obj2>0]

d    4
b    7
c    3
dtype: int64

In [16]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

If your data is stored in a Python dict, you can create a Series by passing the dict:

In [24]:
sdata = {'Ohio':35000,'Texas':71000,'Oregon':1600,'Utah':5000}

In [25]:
obj3 = Series(sdata)

In [26]:
obj3

Ohio      35000
Texas     71000
Oregon     1600
Utah       5000
dtype: int64

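When only a dict is passed, the index of the resulting Series consists of the dict's keys. An explicit index can be passed as well, in which case only the values whose keys match a label in that index are used: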

In [30]:
states = {'Ohio','Texas','Oregon','Utah','California'}

In [31]:
obj4 = Series(sdata,index=states)

In [32]:
obj4

California        NaN
Utah           5000.0
Oregon         1600.0
Ohio          35000.0
Texas         71000.0
dtype: float64

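Note: states is written as a set here, so the order of the resulting index is arbitrary; pass a list to get a fixed order. No value was found for 'California', so it appears as NaN (not a number), which pandas uses to mark missing values. The isnull and notnull functions in pandas detect missing data: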

In [35]:
pd.isnull(obj4)

California     True
Utah          False
Oregon        False
Ohio          False
Texas         False
dtype: bool

In [36]:
pd.notnull(obj4)

California    False
Utah           True
Oregon         True
Ohio           True
Texas          True
dtype: bool

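For many applications, one of the most important features of Series is that it automatically aligns differently indexed data in arithmetic operations: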

In [37]:
obj3

Ohio      35000
Texas     71000
Oregon     1600
Utah       5000
dtype: int64

In [38]:
obj4

California        NaN
Utah           5000.0
Oregon         1600.0
Ohio          35000.0
Texas         71000.0
dtype: float64

In [39]:
obj3+obj4

California         NaN
Ohio           70000.0
Oregon          3200.0
Texas         142000.0
Utah           10000.0
dtype: float64

Both the Series object itself and its index have a name attribute, which integrates closely with other key areas of pandas functionality:

In [40]:
obj4.name = 'population'

In [44]:
obj4.index.name = 'states'

In [45]:
obj4

states
California        NaN
Utah           5000.0
Oregon         1600.0
Ohio          35000.0
Texas         71000.0
Name: population, dtype: float64

In [46]:
obj

0    4
1    5
2   -5
3    3
dtype: int64

In [47]:
obj.index=['Bob','Steve','Jeff','Ryan']

In [48]:
obj

Bob      4
Steve    5
Jeff    -5
Ryan     3
dtype: int64

### DataFrame

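A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index. Row-oriented and column-oriented operations on a DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks.

There are many ways to construct a DataFrame; one of the most common is to pass a dict of equal-length lists or NumPy arrays: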

In [62]:
data= {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
      'year':[2000,2001,2002,2001,2002],
       'pop':[1.5,1.7,3.6,2.4,2.9]
      }

In [63]:
frame=DataFrame(data)

In [64]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


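If you specify a sequence of columns, the DataFrame's columns are arranged in exactly that order. A requested column that is not contained in the data appears with missing values: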

In [65]:
DataFrame(data,columns=['year','state','pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


In [66]:
fram2=DataFrame(data,columns=['year','state','pop','debt'],
               index=['one','two','three','four','five']
               )

In [67]:
fram2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [68]:
fram2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column of a DataFrame can be retrieved as a Series:

In [69]:
fram2['year']

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [71]:
fram2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

Rows can be retrieved by label with the loc indexer:

In [78]:
fram2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by assignment. For example, the empty 'debt' column can be assigned a scalar value or an array of values:

In [79]:
fram2['debt']=16.5

In [80]:
fram2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [81]:
fram2['debt']=np.arange(5.)

In [82]:
fram2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0


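When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are aligned exactly to the DataFrame's index, and missing values are inserted wherever a label has no match: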

In [83]:
val=Series([-1.2,-1.5,-1.7],index=['two','four','five'])

In [84]:
val

two    -1.2
four   -1.5
five   -1.7
dtype: float64

In [85]:
fram2['debt']=val

In [86]:
fram2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
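The two assignment rules described above (matching length for lists and arrays, index alignment for a Series) can be checked with a small sketch. The three-row frame below is made up for illustration and is not part of the original notebook.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A plain list must supply exactly one value per row.
try:
    frame['debt'] = [1.0, 2.0]          # only 2 values for 3 rows
    length_error = False
except ValueError:
    length_error = True

# A Series is aligned on the index instead; rows without a match become NaN.
frame['debt'] = pd.Series([-1.2], index=['two'])
print(length_error)
print(frame)
```

Because a Series carries its own labels, it may be shorter than the frame; the list cannot, since pandas has no way to tell which rows the values belong to.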


Assigning to a column that does not exist creates a new column. The del keyword deletes columns:

In [87]:
fram2['eastern']=fram2.state =='Ohio'

In [88]:
fram2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


In [90]:
del fram2['eastern']

In [91]:
fram2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


Another common form of data is a nested dict of dicts:

In [99]:
pop={'Nevada':{2001: 2.4,2002: 2.9},
     'Ohio':{2000: 1.5,2001: 1.7,2002: 3.6}}


In [101]:
frame3=DataFrame(pop)

In [103]:
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


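pandas interprets the outer dict keys as the columns and the inner dict keys as the row index. The result can be transposed, and an explicit index can be passed to control which row labels are used: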

In [104]:
frame3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


In [107]:
DataFrame(pop,index=[2000,2001,2002,2003])

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6
2003,,


The index and the columns can each be given a name:

In [108]:
frame3.index.name='year'
frame3.columns.name='state'

In [109]:
frame3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [110]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [112]:
fram2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

----

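### Index Objects

pandas's Index objects are responsible for holding the axis labels and other metadata (such as the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index: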

In [113]:
obj=Series(range(3),index=['a','b','c'])

In [114]:
index=obj.index

In [116]:
index[1:]

Index(['b', 'c'], dtype='object')

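Note: Index objects are immutable, so the user cannot modify them. Immutability makes it safe to share an Index object among data structures: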
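A minimal sketch of what immutability means in practice; the labels are invented for the example. Item assignment on an Index is rejected, while methods such as append return a new Index and leave the original untouched.

```python
import pandas as pd

labels = pd.Index(['a', 'b', 'c'])

# In-place item assignment on an Index raises TypeError.
try:
    labels[1] = 'z'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# append builds a new Index; the original keeps its three labels.
extended = labels.append(pd.Index(['d']))
print(assignment_failed, list(labels), list(extended))
```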

In [117]:
index = pd.Index(np.arange(3))

In [118]:
obj2=Series([1.5,-2.5,0],index=index)

In [119]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [121]:
obj2.index is index

True

-------

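## Essential Functionality

### Reindexing

The reindex method creates a new object whose data conforms to a new index. Labels that were not present before are introduced with missing values: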

In [122]:
obj = Series([4,7,-5,3],index=['d','b','a','c'])

In [123]:
obj

d    4
b    7
a   -5
c    3
dtype: int64

In [125]:
obj2 =obj.reindex(['a','b','c','d','e'])

In [126]:
obj2

a   -5.0
b    7.0
c    3.0
d    4.0
e    NaN
dtype: float64

The fill_value option sets the value used for labels that are missing from the original index:

In [132]:
obj2 =obj.reindex(['a','b','c','d','e'],fill_value='.')

In [133]:
obj2

a    -5
b     7
c     3
d     4
e     .
dtype: object

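For ordered data such as time series, it may be desirable to interpolate or fill values when reindexing. The method option does this: ffill carries the previous value forward, and bfill carries the next value backward: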

In [134]:
obj3 = Series(['blue','purple','yellow'],index=[0,2,4])

In [136]:
obj3.reindex(range(6))

0      blue
1       NaN
2    purple
3       NaN
4    yellow
5       NaN
dtype: object

In [135]:
obj3.reindex(range(6),method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [137]:
obj3.reindex(range(6),method='bfill')

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

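With a DataFrame, reindex can alter the (row) index, the columns, or both. When passed only a sequence, it reindexes the rows; the columns can be reindexed with the columns keyword: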

In [138]:
frame = DataFrame(np.arange(9).reshape((3,3)),index=['a',
                                                'c','d'],columns=['Ohio','Texas','California'])

In [139]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [146]:
frame2 = frame.reindex(['a','b','c','d'])

In [147]:
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [151]:
state=['Ohio','California','Texas','Utah']
frame.reindex(columns=state)

Unnamed: 0,Ohio,California,Texas,Utah
a,0,2,1,
c,3,5,4,
d,6,8,7,


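Rows and columns can be reindexed in a single call. Note that interpolation (the method option) applies only along the rows (axis 0):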

In [165]:
frame.reindex(index=['a','b','c','d'],columns=state)

Unnamed: 0,Ohio,California,Texas,Utah
a,0.0,2.0,1.0,
b,,,,
c,3.0,5.0,4.0,
d,6.0,8.0,7.0,


Selecting existing rows and columns by label can be written more succinctly with the loc indexer.
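As a rough illustration of label-based selection with loc, the sketch below rebuilds the same 3x3 frame used in the reindexing examples. One difference from reindex is worth knowing: in recent pandas versions, asking loc for a list containing a label that does not exist raises KeyError instead of inserting NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Select rows and columns by label in one step.
subset = frame.loc[['a', 'd'], ['Texas', 'Ohio']]

# 'b' is not in the index: loc refuses, whereas reindex would add a NaN row.
try:
    frame.loc[['a', 'b']]
    missing_label_error = False
except KeyError:
    missing_label_error = True

print(subset)
print(missing_label_error)
```

So loc is the concise choice when every label is known to exist, and reindex remains the tool for introducing new labels.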

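### Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. Because that can require some data munging and set logic, the drop method returns a new object with the indicated values deleted from an axis: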

In [5]:
obj = Series (np.arange(5.),index = ['a','b','c','d','e'])

In [6]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [7]:
new_obj = obj.drop('c')

In [8]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [10]:
obj.drop(['b','c'])


a    0.0
d    3.0
e    4.0
dtype: float64

With a DataFrame, index values can be deleted from either axis. The original object is left unchanged:

In [12]:
data = DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','California','Texas','Utah'], columns = ['one','two','three','four'])

In [13]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
California,4,5,6,7
Texas,8,9,10,11
Utah,12,13,14,15


In [16]:
data.drop(['California','Ohio'])

Unnamed: 0,one,two,three,four
Texas,8,9,10,11
Utah,12,13,14,15


In [17]:
data.drop('two',axis = 1)

Unnamed: 0,one,three,four
Ohio,0,2,3
California,4,6,7
Texas,8,10,11
Utah,12,14,15


In [18]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
California,4,5,6,7
Texas,8,9,10,11
Utah,12,13,14,15


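### Indexing, Selection, and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except that the index values of a Series need not be integers. Here are some examples: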

In [26]:
obj = Series(np.arange(4.),index = ['a','b','c','d'])

In [27]:
obj['b']

1.0

In [28]:
obj[1]

1.0

In [29]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [30]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [33]:
obj[[1,3]]

b    1.0
d    3.0
dtype: float64

Slicing with labels behaves differently from normal Python slicing in that the endpoint is inclusive:

In [34]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64
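To make the inclusive endpoint concrete, here is a small side-by-side sketch using the same four-element Series. It uses the explicit loc (label) and iloc (position) indexers so that the intent of each slice is unambiguous.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']      # label slice: the endpoint 'c' is included
by_position = obj.iloc[1:2]      # positional slice: the stop position is excluded

print(list(by_label.index))
print(list(by_position.index))
```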

On a DataFrame, rows can be selected by slicing or with a boolean array:

In [35]:
data = DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','California','Texas','Utah'], columns = ['one','two','three','four'])

In [36]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
California,4,5,6,7


In [39]:
data[data['three']> 5 ]

Unnamed: 0,one,two,three,four
California,4,5,6,7
Texas,8,9,10,11
Utah,12,13,14,15


In [40]:
data<5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
California,True,False,False,False
Texas,False,False,False,False
Utah,False,False,False,False


In [41]:
data[data<5]=0

In [42]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
California,0,5,6,7
Texas,8,9,10,11
Utah,12,13,14,15


In [48]:
data.loc['California',['two','three']]

two      5
three    6
Name: California, dtype: int32

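### Arithmetic and Data Alignment

One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When objects are added together, if any index pairs are not the same, the index of the result is the union of the index pairs. Labels that do not overlap produce NaN: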

In [54]:
s1 = Series([7.3,-2.5,3.4,1.5],index = ['a','c','d','e']) 

In [55]:
s2 = Series([-2.1,3.6,-1.5,4,3.1],index = ['a','c','e','f','g'])

In [56]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [57]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [58]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [59]:
df1 = DataFrame(np.arange(9).reshape((3,3)),columns=list('bcd'),index = ['Ohio','Texas','Colorado'])

In [64]:
df2 = DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])

In [65]:
df2+df1

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


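#### Arithmetic Methods with Fill Values

In arithmetic operations between differently indexed objects, you might want to fill in a special value, such as 0, when an axis label is found in one object but not the other: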

In [78]:
df1.add(df2,fill_value=0)

Unnamed: 0,b,c,d,e
Colorado,6.0,7.0,8.0,
Ohio,3.0,1.0,6.0,5.0
Oregon,9.0,,10.0,11.0
Texas,9.0,4.0,12.0,8.0
Utah,0.0,,1.0,2.0


In [67]:
df1

Unnamed: 0,b,c,d
Ohio,0,1,2
Texas,3,4,5
Colorado,6,7,8


In [68]:
df2

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [76]:
df1.reindex(columns = df2.columns,fill_value=0 )

Unnamed: 0,b,d,e
Ohio,0,2,0
Texas,3,5,0
Colorado,6,8,0


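#### Operations between DataFrame and Series

As with NumPy arrays, arithmetic between a DataFrame and a Series is well defined. First, consider the difference between a two-dimensional array and one of its rows: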

In [80]:
arr = np.arange(12).reshape((3,4))

In [81]:
arr[0]

array([0, 1, 2, 3])

In [82]:
arr-arr[0]

array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

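The value of arr[0] is subtracted from every row. This is called broadcasting. Operations between a DataFrame and a Series are similar: by default, the index of the Series is matched against the DataFrame's columns and the operation is broadcast down the rows: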

In [85]:
frame = DataFrame(np.arange(12.0).reshape((4,3)),columns=list('bde'),index = ['utah','Ohio','Texas','Oregon'])

In [103]:
series = frame.loc['utah']

In [91]:
frame

Unnamed: 0,b,d,e
utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [104]:
series

b    0.0
d    1.0
e    2.0
Name: utah, dtype: float64

In [105]:
frame -series

Unnamed: 0,b,d,e
utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [107]:
series2 = Series(range(3),index=['b','e','f'])

In [108]:
series2

b    0
e    1
f    2
dtype: int64

In [109]:
frame+series2

Unnamed: 0,b,d,e,f
utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,
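The examples above broadcast over the rows, matching on the columns. If you want to match on the row index and broadcast across the columns instead, the arithmetic methods accept an axis argument. The following is a small sketch with the same 4x3 frame; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

column_d = frame['d']

# axis=0 matches the Series on the row index, so column 'd' is
# subtracted from every column.
result = frame.sub(column_d, axis=0)
print(result)
```

With the plain operator (frame - column_d) pandas would try to match the row labels of column_d against the column labels of frame, which share nothing, and the result would be all NaN.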


### Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [113]:
import numpy as np
frame = DataFrame(np.random.randn(4,3),columns=list('bde'),index = ['Utah','Ohio','Texas','Oregon'])

In [114]:
frame

Unnamed: 0,b,d,e
Utah,-0.505467,-1.666956,-0.412084
Ohio,-0.383957,-0.61746,1.041121
Texas,-0.74891,-0.190268,0.635766
Oregon,1.027304,-1.300885,0.470798


In [115]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.505467,1.666956,0.412084
Ohio,0.383957,0.61746,1.041121
Texas,0.74891,0.190268,0.635766
Oregon,1.027304,1.300885,0.470798


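Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. DataFrame's apply method does exactly this. The function does not have to return a scalar; it can also return a Series with multiple values: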

In [116]:
f=lambda x:x.max()-x.min()

In [117]:
frame.apply(f)

b    1.776215
d    1.476688
e    1.453205
dtype: float64

In [118]:
frame.apply(f,axis=1)

Utah      1.254872
Ohio      1.658581
Texas     1.384676
Oregon    2.328189
dtype: float64

In [119]:
def f(x):
    return Series([x.min(),x.max()],index = ['min','max'])

In [120]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-0.74891,-1.666956,-0.412084
max,1.027304,-0.190268,1.041121


Element-wise Python functions can be used, too. Suppose you want a formatted string for each floating-point value in frame. You can do this with applymap:

In [121]:
format = lambda x :'%.2f' % x

In [123]:
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-0.51,-1.67,-0.41
Ohio,-0.38,-0.62,1.04
Texas,-0.75,-0.19,0.64
Oregon,1.03,-1.30,0.47
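The name applymap mirrors the map method of Series, which applies an element-wise function to a single column. A minimal sketch follows; the small frame and the name fmt are made up for the example.

```python
import pandas as pd

frame = pd.DataFrame([[1.234, -5.678], [0.5, 2.0]], columns=['b', 'd'])

fmt = lambda x: '%.2f' % x

# Series.map applies the function to each element of one column.
formatted = frame['b'].map(fmt)
print(formatted.tolist())
```

The result holds strings rather than floats, which is why trailing zeros such as '0.50' are preserved.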


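### Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object: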

In [124]:
obj = Series(range(4),index=list('dabc'))

In [125]:
obj

d    0
a    1
b    2
c    3
dtype: int64

In [126]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by the index on either axis:

In [133]:
frame = DataFrame(np.arange(8).reshape((2,4)),index=['two','one'] , columns=list('dabc'))

In [134]:
frame

Unnamed: 0,d,a,b,c
two,0,1,2,3
one,4,5,6,7


In [135]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
two,0,1,2,3


In [136]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
two,1,2,3,0
one,5,6,7,4


Data is sorted in ascending order by default; pass ascending=False to sort in descending order:

In [137]:
frame.sort_index(axis=1,ascending=False)

Unnamed: 0,d,c,b,a
two,0,3,2,1
one,4,7,6,5


To sort a Series by its values, use its sort_values method:

In [138]:
obj = Series([4,5,-3,2])

In [140]:
obj

0    4
1    5
2   -3
3    2
dtype: int64

In [144]:
obj.sort_values()

2   -3
3    2
0    4
1    5
dtype: int64

With a DataFrame, you can sort by the values in one or more columns. To do so, pass one or more column names to the by option of sort_values:

In [145]:
frame = DataFrame({'b':[4,5,6,7],'a':[0,1,0,1]})

In [146]:
frame


Unnamed: 0,b,a
0,4,0
1,5,1
2,6,0
3,7,1


In [161]:
frame.sort_values(by=['a','b'])

Unnamed: 0,b,a
0,4,0
2,6,0
1,5,1
3,7,1


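#### Ranking

Ranking is closely related to sorting: it assigns ranks from 1 up to the number of valid data points in an array. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. By default, rank breaks ties by assigning each group of equal values the mean rank: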

In [162]:
obj = Series([7,-5,7,4,2,0,4])

In [163]:
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [164]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which the values appear in the data, or computed in descending order:

In [166]:
obj.rank(method = 'first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [167]:
obj.rank(ascending = False,method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
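To compare the tie-breaking rules side by side, the sketch below ranks the same Series with several values of the method option and collects the results in one table. The column layout is just a convenience for reading the output.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking rule.
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

For the two 7s, which occupy sorted positions 6 and 7, 'average' gives 6.5, 'min' gives 6, 'max' gives 7, and 'first' gives 6 and 7 in order of appearance. 'dense' works like 'min' but the rank increases by exactly 1 between groups, so the 7s get rank 5.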

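### Axis Indexes with Duplicate Values

Index labels are not required to be unique. Consider a small Series with duplicate index values; selecting a duplicated label returns a Series rather than a scalar: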

In [170]:
obj = Series(range(5),index=list('aabbc'))

In [171]:
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [172]:
obj['a']

a    0
a    1
dtype: int64
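Because the return type depends on whether a label is repeated, it can help to check the index first. The sketch below uses the is_unique property of the index on the same Series.

```python
import pandas as pd

obj = pd.Series(range(5), index=list('aabbc'))

has_unique_labels = obj.index.is_unique   # False: 'a' and 'b' repeat

repeated = obj['a']    # label occurs twice  -> Series
single = obj['c']      # label occurs once   -> scalar
print(has_unique_labels, type(repeated).__name__, single)
```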

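## Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the corresponding NumPy array methods, they are built to handle missing data. Consider a small DataFrame: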

In [173]:
df = DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index = list('abcd'),columns=['one','two'])

In [174]:
df


Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [175]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [176]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

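NA values are excluded automatically. As the output above shows, the sum of a row that is entirely NA is 0, whereas statistics such as mean return NaN when every value in the slice (here, a row or column) is NA. The exclusion can be disabled with the skipna option: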

In [177]:
df.mean(axis=1,skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

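Some methods, such as idxmin and idxmax, return indirect statistics like the index value where the minimum or maximum is attained: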

In [179]:
df.idxmax()

one    b
two    d
dtype: object

Other methods are accumulations, such as cumsum:


In [180]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


describe produces multiple summary statistics in one shot:

In [181]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


On non-numeric data, describe produces a different set of summary statistics:

In [182]:
obj =Series(['a','a','b','c']*4)

In [183]:
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [185]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

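### Correlation and Covariance

Some summary statistics, such as correlation and covariance, are computed from pairs of arguments. For DataFrames such as stock prices and volumes obtained from Yahoo! Finance, these statistics are computed with the corr and cov methods.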
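The notes do not include the stock data, so the following sketch uses a tiny made-up table of returns instead; the tickers AAA, BBB, and CCC and all numbers are invented purely to show the calls. BBB is constructed as exactly twice AAA and CCC as its negative, so the expected correlations are easy to predict.

```python
import pandas as pd

# Hypothetical daily returns, not real market data.
returns = pd.DataFrame({'AAA': [0.01, -0.02, 0.03, 0.00],
                        'BBB': [0.02, -0.04, 0.06, 0.00],
                        'CCC': [-0.01, 0.02, -0.03, 0.00]})

# Series methods: statistics for one pair of columns.
corr_ab = returns['AAA'].corr(returns['BBB'])
cov_ab = returns['AAA'].cov(returns['BBB'])

# DataFrame methods: full matrices for all pairs of columns.
corr_matrix = returns.corr()
cov_matrix = returns.cov()

print(corr_matrix)
print(cov_matrix)
```

The Series methods answer a question about one pair, while the DataFrame methods return a symmetric matrix indexed by column name on both axes.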

### Unique Values, Value Counts, and Membership

The unique method returns an array of the unique values in a Series:


In [207]:
obj = Series(list('cadaabbcc'))

In [208]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [209]:
uniques = obj.unique()

In [212]:
uniques.sort()
uniques

array(['a', 'b', 'c', 'd'], dtype=object)

value_counts computes how often each value occurs in a Series. The result is sorted by count in descending order:

In [213]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [215]:
obj.describe()

count     9
unique    4
top       c
freq      3
dtype: object

value_counts is also available as a top-level pandas function that can be used with any array or sequence:

In [217]:
pd.value_counts(obj.values,sort=False)

c    3
a    3
d    1
b    2
dtype: int64

isin performs a vectorized set-membership check and is useful for filtering a dataset down to a subset of values:

In [218]:
mask = obj.isin(['b','c'])

In [222]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

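### Handling Missing Data

pandas uses the floating-point value NaN (not a number) to represent missing data in both floating-point and non-floating-point arrays. It is just a sentinel that can be easily detected: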

In [225]:
string_data = Series(['arrdvark',
                     'artichoke',
                     np.nan,
                     'avocado'])

In [226]:
string_data

0     arrdvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [227]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

The built-in Python None value is also treated as a missing value:

In [231]:
string_data[0]=None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

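### Filtering Out Missing Data

On a Series, dropna returns a new Series containing only the non-null data and their index values: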

In [233]:
from numpy import nan as NA 

In [234]:
data = Series([1,NA,3.5,NA,7])

In [235]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [237]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

The same result can be obtained with boolean indexing:

In [238]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

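With DataFrame objects, things are a bit more complex: you may want to drop rows or columns that are all NA, or only those containing any NA. By default, dropna drops every row that contains a missing value: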

In [239]:
data = DataFrame([[1.,6.5,3.],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])

In [240]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [241]:
clean=data.dropna()

In [242]:
clean

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' drops only the rows that are entirely NA:

In [243]:
data.dropna(how = 'all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


To drop columns in the same way, pass axis=1:

In [244]:
data[4] = NA

In [245]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [246]:
data.dropna(axis=1,how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


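A related way to filter DataFrame rows tends to come up with time series data. Suppose you want to keep only the rows that contain at least a certain number of observations. The thresh argument specifies the minimum number of non-NA values a row must have in order to be kept: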

In [247]:
df = DataFrame(np.random.randn(7,3))

In [248]:
df.loc[:4,1]=NA

In [250]:
df.loc[:2,2]=NA


In [251]:
df

Unnamed: 0,0,1,2
0,0.7323,,
1,-0.744379,,
2,0.061175,,
3,-0.28272,,-1.935945
4,-1.20498,,-1.83412
5,-1.208442,-0.685508,1.36995
6,-0.217677,-1.543158,-1.226526


In [252]:
df.dropna(thresh=3)

Unnamed: 0,0,1,2
5,-1.208442,-0.685508,1.36995
6,-0.217677,-1.543158,-1.226526
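Since the frame above is filled with random numbers, here is a deterministic sketch of the same thresh behavior. The three-row frame is invented so that each row has a different number of valid observations (1, 2, and 3).

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame([[1., NA, NA],     # 1 non-NA value
                   [2., NA, 3.],     # 2 non-NA values
                   [4., 5., 6.]])    # 3 non-NA values

at_least_two = df.dropna(thresh=2)   # keeps rows 1 and 2
all_three = df.dropna(thresh=3)      # keeps only row 2

print(at_least_two)
print(all_three)
```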


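### Filling In Missing Data

Rather than filtering out missing data, you can fill the holes with the fillna method. Calling fillna with a constant replaces every missing value with that constant; calling it with a dict uses a different fill value for each column. fillna returns a new object and leaves the original unchanged unless inplace=True is passed: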

In [255]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.7323,0.0,0.0
1,-0.744379,0.0,0.0
2,0.061175,0.0,0.0
3,-0.28272,0.0,-1.935945
4,-1.20498,0.0,-1.83412
5,-1.208442,-0.685508,1.36995
6,-0.217677,-1.543158,-1.226526


In [256]:
df.fillna({1:0.5,2:-1})

Unnamed: 0,0,1,2
0,0.7323,0.5,-1.0
1,-0.744379,0.5,-1.0
2,0.061175,0.5,-1.0
3,-0.28272,0.5,-1.935945
4,-1.20498,0.5,-1.83412
5,-1.208442,-0.685508,1.36995
6,-0.217677,-1.543158,-1.226526


In [257]:
df

Unnamed: 0,0,1,2
0,0.7323,,
1,-0.744379,,
2,0.061175,,
3,-0.28272,,-1.935945
4,-1.20498,,-1.83412
5,-1.208442,-0.685508,1.36995
6,-0.217677,-1.543158,-1.226526


In [259]:
# inplace=True modifies df itself; the call then returns None
_ = df.fillna(0,inplace=True)
df

Unnamed: 0,0,1,2
0,0.7323,0.0,0.0
1,-0.744379,0.0,0.0
2,0.061175,0.0,0.0
3,-0.28272,0.0,-1.935945
4,-1.20498,0.0,-1.83412
5,-1.208442,-0.685508,1.36995
6,-0.217677,-1.543158,-1.226526


You can also fill with a computed statistic, such as the mean of the Series:

In [262]:
data= Series([1,NA,3,NA,5])

In [264]:
data.fillna(data.mean())

0    1.0
1    3.0
2    3.0
3    3.0
4    5.0
dtype: float64

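### Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that lets you have multiple index levels on an axis. Somewhat abstractly, it provides a way to work with higher-dimensional data in a lower-dimensional form. Start with a Series whose index is given as a list of lists: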

In [265]:
data = Series(np.random.randn(10),index=[list('aaabbbccdd'),list('1231231223')])

In [266]:
data

a  1    0.839161
   2    0.028645
   3   -1.142681
b  1    0.579837
   2    0.658774
   3    0.660847
c  1   -1.111364
   2   -1.096083
d  2   -2.240340
   3    0.535630
dtype: float64

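This is the formatted output of a Series with a MultiIndex as its index. The "gaps" in the index display mean "use the label directly above". Partial indexing on the outer level selects subsets concisely, and unstack and stack convert between this form and a DataFrame: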

In [267]:
data.index

MultiIndex([('a', '1'),
            ('a', '2'),
            ('a', '3'),
            ('b', '1'),
            ('b', '2'),
            ('b', '3'),
            ('c', '1'),
            ('c', '2'),
            ('d', '2'),
            ('d', '3')],
           )

In [269]:
data['b']

1    0.579837
2    0.658774
3    0.660847
dtype: float64

In [270]:
data['b':'c']

b  1    0.579837
   2    0.658774
   3    0.660847
c  1   -1.111364
   2   -1.096083
dtype: float64

In [305]:
data.loc['a':'b','1']

a  1    0.839161
b  1    0.579837
dtype: float64

In [306]:
data.unstack()

Unnamed: 0,1,2,3
a,0.839161,0.028645,-1.142681
b,0.579837,0.658774,0.660847
c,-1.111364,-1.096083,
d,,-2.24034,0.53563


In [308]:
data.unstack().stack()

a  1    0.839161
   2    0.028645
   3   -1.142681
b  1    0.579837
   2    0.658774
   3    0.660847
c  1   -1.111364
   2   -1.096083
d  2   -2.240340
   3    0.535630
dtype: float64

With a DataFrame, either axis can have a hierarchical index:

In [309]:
frame=DataFrame(np.arange(12).reshape((4,3)),
                index=[list('aabb'),list('1212')],
                columns=[['Ohio','Ohio','Colorad'],
                        ['Green','Red','Green']
                        ]
               
               )

In [310]:
frame

     Ohio     Colorad
    Green Red   Green
a 1     0   1       2
  2     3   4       5
b 1     6   7       8
  2     9  10      11


In [311]:
frame.index.names = ['key1','key2']
frame.columns.names = ['state','color']

In [312]:
frame

state      Ohio     Colorad
color     Green Red   Green
key1 key2
a    1        0   1       2
     2        3   4       5
b    1        6   7       8
     2        9  10      11


In [313]:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10


A MultiIndex can also be created on its own and then reused:

pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['green','red','green']],names=['state','color'])
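As a sketch of how a standalone MultiIndex might be reused, the example below builds one with pd.MultiIndex.from_arrays and passes it as the columns of a new DataFrame. The 2x3 data array is arbitrary and only there to make the frame printable.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'],
     ['Green', 'Red', 'Green']],
    names=['state', 'color'])

# The prebuilt MultiIndex is used directly as the column index.
frame = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

# Partial indexing on the outer level selects the matching group of columns.
ohio = frame['Ohio']
print(frame)
print(ohio)
```

Building the index once and sharing it keeps the level names consistent across every object that uses it.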

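### Reordering and Sorting Levels

At times you will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. swaplevel takes two level numbers or names and returns a new object with the levels interchanged (the data is otherwise unaltered):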

In [315]:
frame.swaplevel('key1','key2')

state      Ohio     Colorad
color     Green Red   Green
key2 key1
1    a        0   1       2
2    a        3   4       5
1    b        6   7       8
2    b        9  10      11


In [329]:
frame.swaplevel(0,1).sort_index(level=0)

state      Ohio     Colorad
color     Green Red   Green
key2 key1
1    a        0   1       2
     b        6   7       8
2    a        3   4       5
     b        9  10      11


In [322]:
frame

state      Ohio     Colorad
color     Green Red   Green
key1 key2
a    1        0   1       2
     2        3   4       5
b    1        6   7       8
     2        9  10      11


In [330]:
frame.sum(level='key2')

state  Ohio     Colorad
color Green Red   Green
key2
1         6   8      10
2        12  14      16


In [334]:
frame.sum(level='color',axis=1)

color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
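The two per-level sums above rely on the level argument of sum, which newer pandas releases no longer accept. A sketch of one equivalent formulation with groupby follows; the frame is rebuilt here so the block runs on its own, and grouping the transposed frame is just one possible way to aggregate over a column level.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[list('aabb'), list('1212')],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum rows that share the same value in the 'key2' index level.
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same 'color': transpose, group, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```

Both results match the tables shown above: the row sums are 6, 8, 10 and 12, 14, 16, and the Green column collects 2, 8, 14, 20.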


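### Indexing with a DataFrame's Columns

It is not unusual to want to use one or more columns of a DataFrame as the row index, or to move the row index into the DataFrame's columns. Here is an example DataFrame: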

In [335]:
frame= DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one',
                                                    'one','two','two','two','two'],
                 'd':[0,1,2,0,1,2,3]})

In [336]:
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


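DataFrame's set_index function creates a new DataFrame that uses one or more of its columns as the row index. By default those columns are removed from the DataFrame; passing drop=False leaves them in place: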

In [340]:
frame2 = frame.set_index(['c','d'],drop=False)

In [341]:
frame2

       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3


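reset_index does the opposite of set_index: the index levels are moved back into the columns. It is applied here to a frame whose index columns were dropped, because reinserting columns that still exist (as in frame2, built with drop=False) raises an error:

In [339]:
frame.set_index(['c','d']).reset_index()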

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
