# pandas

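+ The name pandas derives from "panel data" and from "Python data analysis".
+ pandas is built on NumPy and makes NumPy-centric applications simpler.
+ It provides many high-performance time-series tools suited to financial data; the author designed the package from the outset as a tool for financial data analysis.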

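## Contents

+ Introduction to pandas data structures
   - Series
   - DataFrame
+ Essential functionality
+ Summarizing and computing descriptive statistics
+ Handling missing data
+ Other topics

The following import conventions are typically used before working with pandas: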

In [151]:
import pandas as pd
import pandas
from pandas import Series, DataFrame

In [152]:
from myfunctions import *  

In [153]:
import numpy as np
from numpy.random import randn
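import os  # part of the standard library, so it is not installed with pip
import matplotlib.pyplot as plt
from myfunctions import *  # local helper module (not a published package); presumably provides side_by_side, used below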
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6)) 
np.set_printoptions(precision=4)

In [154]:
import os
plt.rc('figure', figsize=(10, 6)) 
np.set_printoptions(precision=4)

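## Introduction to pandas data structures

To use pandas, you first need to be familiar with its two main data structures: Series and DataFrame.

### Series

A Series is a one-dimensional array-like object made up of data and the associated labels (its index).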

In [155]:
obj = Series([100, 99, 100, 60])
obj

0    100
1     99
2    100
3     60
dtype: int64

In [156]:
obj.values

array([100,  99, 100,  60])

In [157]:
obj.index

RangeIndex(start=0, stop=4, step=1)

When constructing a Series, pass the index= argument to specify its index:

In [158]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

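Values can then be referenced by index label.

Data can be selected and computed on just as with NumPy arrays, except that every value in a Series also carries an index label.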

In [159]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

In [160]:
obj2['a']

-5

In [161]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [162]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [163]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

+ A Series can be thought of as a fixed-length, ordered dict that maps index labels to data values.

In [164]:
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [165]:
'b' in obj2

True

In [166]:
'e' in obj2

False

+ If the data is stored in a dict, a Series can be created directly from that dict.

In [167]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

+ If an index is also specified, only the dict entries whose keys appear in that index are taken, in the order given by the index.

In [168]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

California is not a key in sdata, so pandas records a missing value (NaN) for it. The isnull function detects missing values:

In [169]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

+ One of the most important features of a Series is that arithmetic operations automatically align data by index label.

In [170]:
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [171]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [172]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

The index can be changed in place by assignment.

In [173]:
obj

0    100
1     99
2    100
3     60
dtype: int64

In [174]:
obj.index = ['Tom', 'Steve', 'Jeff', 'Ryan']
obj

Tom      100
Steve     99
Jeff     100
Ryan      60
dtype: int64

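### DataFrame

+ A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can have a different type (numeric, string, boolean, and so on).
+ A DataFrame has both a row index and a column index; it can be viewed as a dict of Series that all share the same index.
+ The most common way to construct a DataFrame is to pass a dict of equal-length lists or NumPy arrays.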

In [175]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


The column order can be specified explicitly.

In [176]:
DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


If a requested column is not present in the data, it is filled with NA values:

In [177]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


Retrieving a column returns a Series. There are two ways to do so, dict-style indexing and attribute access:

In [178]:
print(frame2['state'])

print(frame2.state)

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object


In [179]:
frame2['year']

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

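+ Rows are retrieved by label with the loc indexer (the ix indexer of older pandas versions has been removed).

In [180]:
frame2.loc['three']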

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
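
The row lookup above can be written by label or by position. The following is a minimal sketch contrasting label-based `.loc` with position-based `.iloc`; the frame `demo` is an illustrative stand-in for frame2, not part of the original notebook.

```python
import pandas as pd

# Illustrative stand-in for frame2 (same row labels, fewer columns)
demo = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

by_label = demo.loc["three"]     # the row whose index label is 'three'
by_position = demo.iloc[2]       # the third row, counting from 0

print(by_label)
print(by_label.equals(by_position))
```

Both lookups return the same row as a Series whose index is the column labels.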

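+ Assigning values to a DataFrame column
  
  - A scalar assigned to a column is written to every cell; the cell below assigns a list instead, one value per row.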

In [181]:
frame2['debt'] = [1,2,3,4,4]
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,1
two,2001,Ohio,1.7,2
three,2002,Ohio,3.6,3
four,2001,Nevada,2.4,4
five,2002,Nevada,2.9,4


  - When assigning a list or an array, its length must match the length of the DataFrame.

In [182]:
frame2['debt'] = np.arange(5.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0


  - When assigning a Series, its labels are aligned exactly to the DataFrame's index, and any gaps are filled with missing values.

In [183]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)

print(frame2.values)

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
[[2000 'Ohio' 1.5 nan]
 [2001 'Ohio' 1.7 -1.2]
 [2002 'Ohio' 3.6 nan]
 [2001 'Nevada' 2.4 -1.5]
 [2002 'Nevada' 2.9 -1.7]]


  - Assigning to a column that does not exist creates a new column.
  

In [184]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


- The del keyword deletes a column.

In [185]:
del frame2['eastern']
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


+ When a nested dict is passed to DataFrame, the outer keys become the columns and the inner keys become the row index.

In [186]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
DataFrame(pop)

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [187]:
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


+ Both the rows and the columns are labeled, and the frame can be transposed.

In [188]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


The values of a DataFrame are returned as an ndarray.

In [189]:
frame3.values


array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

In [190]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

A DataFrame can also be built directly from a two-dimensional ndarray.

In [191]:
frame4=DataFrame(np.array([[0, 1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]]),index=['2001','2002','2003'],columns=['Nevada','Ohio'])
frame4

Unnamed: 0,Nevada,Ohio
2001,0.0,1.5
2002,2.4,1.7
2003,2.9,3.6


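## Essential functionality

This section covers the basic ways of working with the data in a Series or DataFrame.

### Dropping entries

+ The drop method removes the entries at the given index labels, that is, it deletes rows.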


In [193]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

+ With the option axis=1, drop removes columns instead.

In [194]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [196]:
data.drop(['Ohio', 'Colorado'])
##data.drop(['two','four'],axis=1)

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15
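
The column case is only present as a commented-out line in the cell above. A short sketch of it follows, rebuilding the same frame under the illustrative name `table`; `drop` returns a new object and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

without_cols = table.drop(["two", "four"], axis=1)   # drop by column label
same_result = table.drop(columns=["two", "four"])    # equivalent keyword form

print(without_cols)
```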


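### Indexing, selection, and filtering

+ A Series can be indexed by label, by integer position, or by a boolean mask.
+ Indexing a DataFrame with a label or a list of labels retrieves one or more columns.
+ To select rows, use an indexer: .loc (label-based) or .iloc (position-based), which replace the .ix indexer of older pandas versions.

+ Series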

In [197]:
# index with one or more labels
obj = Series(np.arange(4.0), index=['a', 'b', 'c', 'd'])
print(obj.dtype)

print(obj['b']) # index
obj[['b', 'a', 'd']]
print(obj)

float64
1.0
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64


In [64]:
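# index by integer position: a single integer, a slice, or a list of integers
# (.iloc makes the positional intent explicit; obj[1] on a labeled Series is deprecated in recent pandas)
print(obj.iloc[1])
print(obj[1: 3])
print(obj.iloc[[1,3]])
print(obj[['b','d']])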

1.0
b    1.0
c    2.0
dtype: float64
b    1.0
d    3.0
dtype: float64
b    1.0
d    3.0
dtype: float64


In [65]:
# index with a boolean mask
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

+ Indexing a DataFrame

In [66]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


+ Selecting columns

In [67]:
print(data)
data['two']

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

+ Selecting rows

In [68]:
print(data)
print(data.loc['Colorado':])
## equivalent to
print(data[1:])

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [69]:
# comparing the frame with a scalar yields a boolean DataFrame that can be used for indexing

print(data)
print(data > 5)


          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
            one    two  three   four
Ohio      False  False  False  False
Colorado  False  False   True   True
Utah       True   True   True   True
New York   True   True   True   True


+ Indexing with a boolean DataFrame

In [70]:

data[data >= 5]

Unnamed: 0,one,two,three,four
Ohio,,,,
Colorado,,5.0,6.0,7.0
Utah,8.0,9.0,10.0,11.0
New York,12.0,13.0,14.0,15.0


In [71]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


+ Selecting rows with an indexer

In [72]:
# a single integer passed to iloc selects one row by position
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

Selecting rows and columns together

In [73]:
 
#data1=data.loc[['Colorado', 'Utah'], ['two', 'three']] 
 
data2=data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]
 
data3=data.loc[data.three > 5, data.columns[:2]] 

side_by_side(data,data2,data3)

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9

Unnamed: 0,one,two
Colorado,0,5
Utah,8,9
New York,12,13


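+ To summarize:
   - obj[val] selects a single column or a set of columns
   - obj.loc[val] selects a single row or a set of rows by label (obj.iloc[val] does the same by integer position)
   - obj.loc[:, val] selects a single column or a subset of columns
   - obj.loc[val1, val2] selects rows and columns together
   - the reindex method conforms one or more axes to a new index
   - obj.iloc[:, i] and obj.iloc[i] select a single column or row by integer position and return a Series
   - obj.at[row, col] reads or sets a single value by row label and column label
   - loc, iloc, and at replace the ix, icol, irow, get_value, and set_value interfaces of older pandas versions, which have been removed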
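
The selection options in the summary can be sketched in one place as follows. The frame `table` repeats the example data under an illustrative name, and 'Texas' is a made-up label that is absent from the index, included to show what reindex does with a new label.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

row = table.loc["Utah"]                                  # one row by label
block = table.loc[["Ohio", "Utah"], ["one", "three"]]    # rows and columns by label
first_col = table.iloc[:, 0]                             # one column by integer position
cell = table.at["Utah", "two"]                           # a single value by labels
realigned = table.reindex(["Utah", "Texas"])             # 'Texas' is new, so its row is NaN

print(cell)
print(realigned)
```

reindex never modifies the original frame; labels that are missing from it appear in the result as rows of NaN.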

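### Arithmetic and data alignment

+ Aligning on index labels during computation is one of the most important features of pandas.
+ When two objects with different indexes are added, the index of the result is the union of the two; values at shared labels are added, and labels present in only one object yield missing values.
+ For DataFrames, alignment happens on both the rows and the columns.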

In [74]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1 * s2

a   -15.33
c    -9.00
d      NaN
e    -2.25
f      NaN
g      NaN
dtype: float64

In [75]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=['as','xi','ba'],
                index=['Ohio', 'Texas', 'Colorado'])

print(df1)
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df2)
print(df1 + df2)

           as   xi   ba
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
          as   b  ba   d   e  xi
Colorado NaN NaN NaN NaN NaN NaN
Ohio     NaN NaN NaN NaN NaN NaN
Oregon   NaN NaN NaN NaN NaN NaN
Texas    NaN NaN NaN NaN NaN NaN
Utah     NaN NaN NaN NaN NaN NaN
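
Every cell in the result above is NaN because df1 and df2 share no column labels. As a minimal sketch, two made-up frames (`left` and `right`, not from the original notebook) with partially overlapping labels show that a value survives only where both the row label and the column label match.

```python
import pandas as pd

left = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
                    index=["Ohio", "Texas"], columns=["b", "c"])
right = pd.DataFrame([[10.0, 20.0], [30.0, 40.0]],
                     index=["Ohio", "Utah"], columns=["b", "d"])

# The result carries the union of the row labels and of the column labels
total = left + right
print(total)
```

Only the cell at ('Ohio', 'b') exists in both operands, so it is the only one that holds a number.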


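#### Filling values in arithmetic operations

The arithmetic methods add, sub, div, and mul accept a fill_value argument: where a label is present in only one of the operands, the missing side is replaced by that value before the operation is carried out.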

In [76]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
#dfadd=df1+df2
dfaddfill0=df1.add(df2, fill_value=0)
side_by_side(df1,df2,dfaddfill0)

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


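### Operations between DataFrame and Series

+ In NumPy, subtracting a scalar from an array subtracts it from every element; in the same way, subtracting one row from a two-dimensional array subtracts it from every row.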


In [77]:
arr = np.arange(12.).reshape((3, 4))
arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

+ Subtracting a Series from a DataFrame works the same way: the operation is applied to every row.


In [79]:
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
fsubr=frame - series
side_by_side(frame,fsubr)

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


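+ Unlike plain arrays, pandas also matches the index of the Series against the column labels of the DataFrame.
+ Where the labels do not match, the result gains new columns filled with the missing value NaN. Applying the Series to every row in this way is called broadcasting.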

In [80]:
series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


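+ By default the Series index is matched against the columns and the operation is broadcast down the rows. To match on the row index and broadcast across the columns instead, use one of the arithmetic methods with axis=0; the axis passed is the axis to match on.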

In [81]:
series3 = frame['d']
fsubc=frame.sub(series3, axis=0)

side_by_side(frame,fsubc)

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


### Function application and mapping 

NumPy functions also work on pandas objects.

In [82]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
fabs=np.abs(frame)
side_by_side(frame,fabs) 


Unnamed: 0,b,d,e
Utah,-0.204708,0.478943,-0.519439
Ohio,-0.55573,1.965781,1.393406
Texas,0.092908,0.281746,0.769023
Oregon,1.246435,1.007189,-1.296221

Unnamed: 0,b,d,e
Utah,0.204708,0.478943,0.519439
Ohio,0.55573,1.965781,1.393406
Texas,0.092908,0.281746,0.769023
Oregon,1.246435,1.007189,1.296221


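+ To apply a function to each column or each row (each passed in as a one-dimensional array), use the apply method; apply-style functions are also important in R.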

In [83]:
f = lambda x: x.max() - x.min()

frame.apply(f)

b    1.802165
d    1.684034
e    2.689627
dtype: float64

In [84]:
# with axis=1 the function is applied to each row, that is, across the columns
frame.apply(f, axis=1)

Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64

+ If the function returns a single value, apply returns a Series.
+ If the function returns several values as a Series, apply returns a DataFrame.

In [85]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-0.55573,0.281746,-1.296221
max,1.246435,1.965781,1.393406


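+ Element-wise Python functions (applied to each individual value) can also be used on a DataFrame, through applymap.
+ The following example formats every element in a given way.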

In [86]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-0.2,0.48,-0.52
Ohio,-0.56,1.97,1.39
Texas,0.09,0.28,0.77
Oregon,1.25,1.01,-1.3


+ For a Series, the corresponding method is map.

In [87]:
frame['e'].map(format)

Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object
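
In pandas 2.1 and later, `DataFrame.applymap` is deprecated in favor of `DataFrame.map`. The sketch below is one way to apply an element-wise function on either side of that change; the helper name `elementwise` and the frame `small` are hypothetical and not part of pandas or of the original notebook.

```python
import pandas as pd

def elementwise(df, func):
    """Apply func to every cell, preferring DataFrame.map where it exists."""
    method = getattr(df, "map", None)   # available in pandas 2.1 and later
    if method is None:
        method = df.applymap            # fallback for older pandas
    return method(func)

small = pd.DataFrame({"b": [-0.204708, 1.246435], "e": [0.5, -1.296221]})
formatted = elementwise(small, lambda x: "%.2f" % x)
print(formatted)
```

The fallback assumes the frame has no column literally named `map`, since attribute access would otherwise return that column.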

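### Sorting and ranking

+ To sort by index label, use the sort_index method.
+ For a DataFrame, sort_index sorts the row index by default; pass axis=1 to sort the column index.
+ The option ascending=False sorts in descending order.
+ To sort a Series by its values, use sort_values; missing values are placed at the end.
+ To sort a DataFrame by its values, pass one or more column names to the by option.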

In [88]:
# sort by index label with the sort_index method
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [89]:
# for a DataFrame, the row index is sorted by default
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [90]:
# to sort the column index instead, pass axis=1
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [91]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [92]:
# to sort a Series by its values, use sort_values; missing values are placed at the end
obj = Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [93]:
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()


4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [94]:
# to sort a DataFrame by its values, pass the column name or names to the by option

frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
sortbyb=frame.sort_values(by='b')
sortbyab=frame.sort_values(by=['a', 'b'])
side_by_side(frame,sortbyb,sortbyab) 

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2

Unnamed: 0,a,b
2,0,-3
3,1,2
0,0,4
1,1,7

Unnamed: 0,a,b
2,0,-3
0,0,4
3,1,2
1,1,7


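+ The rank method gives the rank of each element of a Series; by default, tied values share the average of their ranks.
+ For a DataFrame, ranks can be computed down each column or across each row.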

In [95]:
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [96]:
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

In [97]:
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
side_by_side(frame,frame.rank())

Unnamed: 0,a,b,c
0,0,4.3,-2.0
1,1,7.0,5.0
2,0,-3.0,8.0
3,1,2.0,-2.5

Unnamed: 0,a,b,c
0,1.5,3.0,2.0
1,3.5,4.0,3.0
2,1.5,1.0,4.0
3,3.5,2.0,1.0


In [98]:
frame.rank(axis=1)

Unnamed: 0,a,b,c
0,2.0,3.0,1.0
1,1.0,3.0,2.0
2,2.0,1.0,3.0
3,2.0,3.0,1.0
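
The `method` option of rank controls how ties are broken. A small sketch with made-up scores compares three of the available choices.

```python
import pandas as pd

scores = pd.Series([7, -5, 7, 4])

average = scores.rank()                 # default: tied values share the mean rank
first = scores.rank(method="first")     # ties broken by order of appearance
dense = scores.rank(method="dense")     # tied values share a rank; no gaps between groups

print(pd.DataFrame({"average": average, "first": first, "dense": dense}))
```

The two 7s occupy sorted positions 3 and 4, so they receive 3.5 each under 'average', 3 and 4 under 'first', and 3 each under 'dense'.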


### Indexes with duplicate labels

Index labels do not have to be unique.

In [99]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [100]:
obj['a']

a    0
a    1
dtype: int64

In [101]:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

Unnamed: 0,0,1,2
a,0.274992,0.228913,1.352917
a,0.886429,-2.001637,-0.371843
b,1.669025,-0.43857,-0.539741
b,0.476985,3.248944,-1.021228


In [102]:
df.loc['b']

Unnamed: 0,0,1,2
b,1.669025,-0.43857,-0.539741
b,0.476985,3.248944,-1.021228
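
Because a duplicated label changes the type of the result, it can be worth checking an index before relying on scalar lookups. A minimal sketch, using the illustrative name `dup`:

```python
import pandas as pd

dup = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

print(dup.index.is_unique)   # False, because 'a' and 'b' each appear twice

many = dup["a"]   # a duplicated label returns a Series
one = dup["c"]    # a unique label returns a scalar
```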


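## Summarizing and computing descriptive statistics

This section introduces methods for summarizing data and computing descriptive statistics.

These include count, describe, min, max, argmin, argmax, idxmin, idxmax, quantile, sum, mean, median, mad, var, std, skew, kurt, cumsum, cummin, cummax, cumprod, diff, and pct_change.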

In [103]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [104]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [105]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [106]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

In [107]:
df.idxmax()

one    b
two    d
dtype: object

In [108]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [109]:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object
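
On numeric data, describe reports different statistics from the string case above. The sketch below rebuilds the earlier numeric frame under the illustrative name `stats` and calls a few of the listed methods; missing values are skipped by default.

```python
import numpy as np
import pandas as pd

stats = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"], columns=["one", "two"],
)

counts = stats.count()       # number of non-missing values per column
medians = stats.median()     # NaN is skipped by default
summary = stats.describe()   # count, mean, std, min, quartiles, max

print(summary)
```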

### Unique values, value counts, and membership

In [110]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [111]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [112]:
# count how often each value occurs
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

In [113]:
# test membership in a set of values
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [114]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [115]:
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
# count how often each value occurs, column by column
result = data.apply(pd.value_counts)
side_by_side(data,result)

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,,2.0,1.0
3,2.0,2.0,
4,2.0,,2.0
5,,,1.0


## Handling missing data

In [116]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [117]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

### Filtering out missing data

+ The relevant methods are data.dropna() and data.notnull().


In [118]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [119]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [120]:
# DataFrame
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [121]:
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [122]:
# other options: axis=1 with how='all' drops only the columns in which every value is missing
data.dropna(axis=1,how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
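
In the cell above no column is entirely missing, so the call removes nothing. The following sketch applies the same idea to rows and adds the `thresh` option; the frame `grid` repeats the example data under an illustrative name.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame([[1.0, 6.5, 3.0], [1.0, np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.0]])

all_missing_dropped = grid.dropna(how="all")   # drop rows that are entirely NaN
at_least_two = grid.dropna(thresh=2)           # keep rows with at least 2 non-missing values

print(all_missing_dropped)
print(at_least_two)
```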


### Filling in missing data

In [123]:
df.fillna(0)

Unnamed: 0,one,two
a,1.4,0.0
b,7.1,-4.5
c,0.0,0.0
d,0.75,-1.3


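+ Passing a dict fills different columns with different values; the dict keys are column labels. In the cell below the keys 1 and 3 match neither 'one' nor 'two', so nothing is filled.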

In [124]:
df.fillna({1: 0.5, 3: -1})

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3
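
A minimal sketch of a dict-based fill whose keys do match the column labels, using the same values as df; the names `gaps` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"], columns=["one", "two"],
)

# Each key names a column; each value is the fill for that column
filled = gaps.fillna({"one": 0.5, "two": -1})
print(filled)
```

fillna returns a new frame here, so `gaps` still contains its missing values afterwards.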


## Hierarchical indexing

In [125]:
data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

a  1   -0.577087
   2    0.124121
   3    0.302614
b  1    0.523772
   2    0.000940
   3    1.343810
c  1   -0.713544
   2   -0.831154
d  2   -2.370232
   3   -1.860761
dtype: float64

In [126]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

In [127]:
data['b']

1    0.523772
2    0.000940
3    1.343810
dtype: float64

In [128]:
data['b':'c']

b  1    0.523772
   2    0.000940
   3    1.343810
c  1   -0.713544
   2   -0.831154
dtype: float64

In [129]:
data.loc[['b', 'd']]

b  1    0.523772
   2    0.000940
   3    1.343810
d  2   -2.370232
   3   -1.860761
dtype: float64

Selecting on the inner level

In [130]:
data[:, 2]

a    0.124121
b    0.000940
c   -0.831154
d   -2.370232
dtype: float64
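
A hierarchically indexed Series can also be pivoted into a DataFrame with unstack, which moves the inner level into the columns. The sketch below uses small made-up values rather than the random data above.

```python
import pandas as pd

nested = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=[["a", "a", "b", "b", "c"], [1, 2, 1, 2, 2]],
)

inner = nested.loc[:, 2]   # every outer label, inner label 2
wide = nested.unstack()    # inner level becomes the columns

print(wide)
```

The combination ('c', 1) does not exist in `nested`, so that cell of `wide` is NaN.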

### Using DataFrame columns as the index

In [131]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [132]:
frame2 = frame.set_index(['c', 'd'])
frame2

c,d,a,b
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [133]:
frame.set_index(['c', 'd'], drop=False)

c,d,a,b,c,d
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [134]:
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


## Other pandas topics

### Panel data

In [135]:
import pandas.io.data as web

pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk))
                       for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))

The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.


In [136]:
pdata

<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 1735 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2010-01-04 00:00:00 to 2016-10-24 00:00:00
Minor_axis axis: Open to Adj Close

In [137]:
pdata = pdata.swapaxes('items', 'minor')
pdata['Adj Close']

Date,AAPL,DELL,GOOG,MSFT
2010-01-04,27.990226,14.06528,313.062468,25.884104
2010-01-05,28.038618,14.38450,311.683844,25.892466
2010-01-06,27.592626,14.10397,303.826685,25.733566
2010-01-07,27.541619,14.23940,296.753749,25.465944
2010-01-08,27.724725,14.36516,300.709808,25.641571
2010-01-11,27.480148,14.37483,300.255255,25.315406
2010-01-12,27.167562,14.56830,294.945572,25.148142
2010-01-13,27.550775,14.57797,293.252243,25.382312
2010-01-14,27.391211,14.22005,294.630868,25.892466
2010-01-15,26.933449,13.92985,289.710772,25.808835


In [138]:
pdata.ix[:, '6/1/2012', :]

Unnamed: 0,Open,High,Low,Close,Volume,Adj Close
AAPL,569.159996,572.650009,560.520012,560.989983,130246900.0,73.371509
DELL,12.15,12.3,12.045,12.07,19397600.0,11.67592
GOOG,571.790972,572.650996,568.350996,570.981,6138700.0,285.205295
MSFT,28.76,28.959999,28.440001,28.450001,56634300.0,25.262972


In [139]:
pdata.ix['Adj Close', '5/22/2012':, :].head()

Date,AAPL,DELL,GOOG,MSFT
2012-05-22,72.845742,14.58765,300.100412,26.426222
2012-05-23,74.623162,12.08221,304.426106,25.849037
2012-05-24,73.937831,12.04351,301.528978,25.813517
2012-05-25,73.541536,12.05319,295.47005,25.804637
2012-05-28,,12.05319,,


In [140]:
stacked = pdata.ix[:, '5/30/2012':, :].to_frame()
stacked.head()

Date,minor,Open,High,Low,Close,Volume,Adj Close
2012-05-30,AAPL,569.199997,579.98999,566.55999,579.169998,132357400.0,75.749262
2012-05-30,DELL,12.59,12.7,12.46,12.56,19787800.0,12.14992
2012-05-30,GOOG,588.161028,591.901014,583.530999,588.230992,3827600.0,293.821674
2012-05-30,MSFT,29.35,29.48,29.120001,29.34,41585500.0,26.053272
2012-05-31,AAPL,580.740021,581.499985,571.460022,577.730019,122918600.0,75.560928


In [141]:
stacked.to_panel()

<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1122 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2016-10-24 00:00:00
Minor_axis axis: AAPL to MSFT
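
Panel was deprecated in pandas 0.20 and removed in 0.25, and pandas.io.data no longer ships with pandas, so the cells in this subsection run only on old installations. A common substitute is a DataFrame with a two-level MultiIndex, which is the layout that to_frame produces above. The sketch below uses a few made-up prices (not real market data) purely to show the shape of that layout; all variable names are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(["2012-05-30", "2012-05-31"])
index = pd.MultiIndex.from_product([dates, ["AAPL", "MSFT"]],
                                   names=["Date", "minor"])

# Made-up numbers for illustration only
long_form = pd.DataFrame(
    {"Close": [10.0, 20.0, 11.0, 21.0], "Volume": [100, 200, 110, 210]},
    index=index,
)

close_by_ticker = long_form["Close"].unstack("minor")             # dates x tickers
one_day = long_form.xs(pd.Timestamp("2012-05-31"), level="Date")  # tickers x fields

print(close_by_ticker)
print(one_day)
```

Selecting one field and unstacking plays the role of indexing a Panel item, while xs on the date level plays the role of slicing the major axis.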

## Merging data

DataFrames can be merged on one or more key columns, which makes merging very flexible.

In [142]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'b'],
                 'data2': range(3)})
print(df1)
print(df2)

   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
   data2 key
0      0   a
1      1   b
2      2   b


In [143]:
pd.merge(df1, df2) # joins on the shared column 'key'; an inner join by default

Unnamed: 0,data1,key,data2
0,0,b,1
1,0,b,2
2,1,b,1
3,1,b,2
4,6,b,1
5,6,b,2
6,2,a,0
7,4,a,0
8,5,a,0
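
The default merge keeps only keys found in both frames. The `how` option selects other join types, as in this sketch with two small made-up frames (`left` and `right`) and an explicit `on` argument.

```python
import pandas as pd

left = pd.DataFrame({"key": ["b", "a", "c"], "data1": [0, 1, 2]})
right = pd.DataFrame({"key": ["a", "b", "d"], "data2": [10, 11, 12]})

inner = pd.merge(left, right, on="key")                   # keys present in both frames
outer = pd.merge(left, right, on="key", how="outer")      # keys present in either frame
left_join = pd.merge(left, right, on="key", how="left")   # every row of left is kept

print(outer)
```

Rows without a partner in the other frame get NaN in the columns that come from that frame.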
