# 10 Minutes to pandas

This is a short introduction to pandas, aimed at new users.

Before starting, import the following libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Object Creation

Create a **Series** by passing a list; pandas creates a default integer index:

In [2]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a **DataFrame** by passing a NumPy array, with a datetime index and labeled columns:

In [5]:
dates = pd.date_range('20160101',periods=6)
dates

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')

In [6]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

,A,B,C,D
2016-01-01,-1.010297,0.696287,1.581958,0.570403
2016-01-02,0.987048,-1.886653,0.996055,0.166742
2016-01-03,0.231736,1.205385,0.23463,0.660184
2016-01-04,-1.619384,0.698944,0.333947,-0.726975
2016-01-05,-0.108619,0.33031,0.750473,2.773949
2016-01-06,-0.043439,-0.863584,1.104039,-1.966649


Create a **DataFrame** by passing a dictionary:

In [8]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20160102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3]*4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

,A,B,C,D,E,F
0,1.0,2016-01-02,1.0,3,test,foo
1,1.0,2016-01-02,1.0,3,train,foo
2,1.0,2016-01-02,1.0,3,test,foo
3,1.0,2016-01-02,1.0,3,train,foo


Check the data type of each column:

In [10]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

## Viewing Data

View the first and last rows of the DataFrame:

In [11]:
df.head()

,A,B,C,D
2016-01-01,-1.010297,0.696287,1.581958,0.570403
2016-01-02,0.987048,-1.886653,0.996055,0.166742
2016-01-03,0.231736,1.205385,0.23463,0.660184
2016-01-04,-1.619384,0.698944,0.333947,-0.726975
2016-01-05,-0.108619,0.33031,0.750473,2.773949


In [12]:
df.tail(3)

,A,B,C,D
2016-01-04,-1.619384,0.698944,0.333947,-0.726975
2016-01-05,-0.108619,0.33031,0.750473,2.773949
2016-01-06,-0.043439,-0.863584,1.104039,-1.966649


Display the index, the columns, and the underlying NumPy data:

In [13]:
df.index

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')

In [14]:
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [15]:
df.values

array([[-1.01029681,  0.69628715,  1.58195759,  0.57040348],
       [ 0.98704841, -1.88665261,  0.99605479,  0.1667424 ],
       [ 0.23173557,  1.20538498,  0.23463005,  0.66018443],
       [-1.61938431,  0.69894436,  0.33394681, -0.72697456],
       [-0.10861947,  0.33030997,  0.75047258,  2.77394869],
       [-0.04343913, -0.86358449,  1.10403867, -1.96664921]])

The describe() method gives a quick statistical summary of the data:

In [16]:
df.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.260493,0.030115,0.833517,0.246276
std,0.924945,1.169204,0.504955,1.582156
min,-1.619384,-1.886653,0.23463,-1.966649
25%,-0.784877,-0.565111,0.438078,-0.503545
50%,-0.076029,0.513299,0.873264,0.368573
75%,0.162942,0.69828,1.077043,0.637739
max,0.987048,1.205385,1.581958,2.773949


Transpose the data:

In [17]:
df.T

,2016-01-01 00:00:00,2016-01-02 00:00:00,2016-01-03 00:00:00,2016-01-04 00:00:00,2016-01-05 00:00:00,2016-01-06 00:00:00
A,-1.010297,0.987048,0.231736,-1.619384,-0.108619,-0.043439
B,0.696287,-1.886653,1.205385,0.698944,0.33031,-0.863584
C,1.581958,0.996055,0.23463,0.333947,0.750473,1.104039
D,0.570403,0.166742,0.660184,-0.726975,2.773949,-1.966649


Sort along an axis:

In [18]:
df.sort_index(axis=1,ascending=False)

,D,C,B,A
2016-01-01,0.570403,1.581958,0.696287,-1.010297
2016-01-02,0.166742,0.996055,-1.886653,0.987048
2016-01-03,0.660184,0.23463,1.205385,0.231736
2016-01-04,-0.726975,0.333947,0.698944,-1.619384
2016-01-05,2.773949,0.750473,0.33031,-0.108619
2016-01-06,-1.966649,1.104039,-0.863584,-0.043439


Sort by values:

In [19]:
df.sort_values(by='B')

,A,B,C,D
2016-01-02,0.987048,-1.886653,0.996055,0.166742
2016-01-06,-0.043439,-0.863584,1.104039,-1.966649
2016-01-05,-0.108619,0.33031,0.750473,2.773949
2016-01-01,-1.010297,0.696287,1.581958,0.570403
2016-01-04,-1.619384,0.698944,0.333947,-0.726975
2016-01-03,0.231736,1.205385,0.23463,0.660184


## Selection

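Note: while standard Python/NumPy expressions work for selecting and setting, we recommend the optimized pandas data access methods: .at, .iat, .loc, and .iloc. The .ix indexer listed in older tutorials was removed in pandas 1.0.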

### Getting

Select a single column, which returns a Series; this is equivalent to df.A:

In [16]:
df['A']

2013-01-01    0.702633
2013-01-02   -0.662246
2013-01-03   -0.341566
2013-01-04    0.486240
2013-01-05    0.173729
2013-01-06   -0.972961
Freq: D, Name: A, dtype: float64

Use [] to slice the rows:

In [24]:
df[0:3] # positional slice is half-open: start included, stop excluded

,A,B,C,D
2016-01-01,-1.010297,0.696287,1.581958,0.570403
2016-01-02,0.987048,-1.886653,0.996055,0.166742
2016-01-03,0.231736,1.205385,0.23463,0.660184


In [25]:
df['20160102':'20160104'] # label slice is closed: both endpoints included

,A,B,C,D
2016-01-02,0.987048,-1.886653,0.996055,0.166742
2016-01-03,0.231736,1.205385,0.23463,0.660184
2016-01-04,-1.619384,0.698944,0.333947,-0.726975
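
The difference between the two slicing rules can be checked on a small frame. The following is a minimal sketch; the frame `demo` and its integer values are made up for illustration, and `.loc` is used for the label slice to keep the intent explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame with a daily DatetimeIndex (not the tutorial's df)
dates = pd.date_range("20160101", periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Positional slice: rows at positions 1 and 2 only (stop position 3 is excluded)
by_position = demo[1:3]

# Label slice between the labels at positions 1 and 3: both endpoints are included
by_label = demo.loc["20160102":"20160104"]

print(len(by_position), len(by_label))
```

The same pair of positions therefore yields one more row when expressed as labels.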


### Selection by Label

Get a cross section using a label:

In [19]:
df.loc[dates[0]]

A    0.702633
B   -0.248319
C   -0.677385
D   -0.307803
Name: 2013-01-01 00:00:00, dtype: float64

Select on multiple axes by label:

In [20]:
df.loc[:,['A','B']]

,A,B
2013-01-01,0.702633,-0.248319
2013-01-02,-0.662246,-0.02907
2013-01-03,-0.341566,-0.726115
2013-01-04,0.48624,1.190614
2013-01-05,0.173729,0.130515
2013-01-06,-0.972961,0.743743


Label slicing includes both endpoints:

In [21]:
df.loc['20130102':'20130104',['A','B']]

,A,B
2013-01-02,-0.662246,-0.02907
2013-01-03,-0.341566,-0.726115
2013-01-04,0.48624,1.190614


Reduce the dimensions of the returned object:

In [22]:
df.loc['20130102',['A','B']]

A   -0.662246
B   -0.029070
Name: 2013-01-02 00:00:00, dtype: float64

Get a scalar value:

In [23]:
df.loc[dates[0],'A']

0.70263274076420246

Fast access to a scalar (equivalent to the method above):

In [24]:
df.at[dates[0],'A']

0.70263274076420246

### Selection by Position

Select by passing an integer position:

In [25]:
df.iloc[3]

A    0.486240
B    1.190614
C    0.757092
D   -0.736161
Name: 2013-01-04 00:00:00, dtype: float64

Select by integer position slices, which behave like Python/NumPy slices:

In [26]:
df.iloc[3:5,0:2]

,A,B
2013-01-04,0.48624,1.190614
2013-01-05,0.173729,0.130515


Slice rows only:

In [27]:
df.iloc[1:3,:]

,A,B,C,D
2013-01-02,-0.662246,-0.02907,0.428587,1.209935
2013-01-03,-0.341566,-0.726115,0.494828,-0.313599


Slice columns only:

In [28]:
df.iloc[:,1:3]

,B,C
2013-01-01,-0.248319,-0.677385
2013-01-02,-0.02907,0.428587
2013-01-03,-0.726115,0.494828
2013-01-04,1.190614,0.757092
2013-01-05,0.130515,0.057717
2013-01-06,0.743743,-0.640023


Get a single value:

In [29]:
df.iloc[1,1]

-0.029070112494956581

Fast access to a single value (equivalent to the method above):

In [30]:
df.iat[1,1]

-0.029070112494956581

### Boolean Indexing

Use the values of a single column to select data:

In [31]:
df[df.A > 0]

,A,B,C,D
2013-01-01,0.702633,-0.248319,-0.677385,-0.307803
2013-01-04,0.48624,1.190614,0.757092,-0.736161
2013-01-05,0.173729,0.130515,0.057717,0.082774


Select data with a where operation:

In [32]:
df[df > 0]

,A,B,C,D
2013-01-01,0.702633,,,
2013-01-02,,,0.428587,1.209935
2013-01-03,,,0.494828,
2013-01-04,0.48624,1.190614,0.757092,
2013-01-05,0.173729,0.130515,0.057717,0.082774
2013-01-06,,0.743743,,0.600814
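
Indexing a DataFrame with a boolean DataFrame keeps the shape and masks the cells where the condition is false. A minimal sketch, using a made-up frame `demo`, shows that this matches an explicit `DataFrame.where()` call, which also accepts a replacement value through `other`:

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frame used only for this illustration
demo = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

masked = demo[demo > 0]                    # non-positive cells become NaN
explicit = demo.where(demo > 0)            # same result, spelled out
filled = demo.where(demo > 0, other=0.0)   # replace masked cells with 0.0 instead of NaN

print(masked)
print(filled)
```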


Use the isin() method for filtering:

In [35]:
df2 = df.copy()

In [36]:
df2['E'] = ['one','one','two','three','four','three']
df2

,A,B,C,D,E
2013-01-01,0.702633,-0.248319,-0.677385,-0.307803,one
2013-01-02,-0.662246,-0.02907,0.428587,1.209935,one
2013-01-03,-0.341566,-0.726115,0.494828,-0.313599,two
2013-01-04,0.48624,1.190614,0.757092,-0.736161,three
2013-01-05,0.173729,0.130515,0.057717,0.082774,four
2013-01-06,-0.972961,0.743743,-0.640023,0.600814,three


In [37]:
df2[df2['E'].isin(['two','four'])]

,A,B,C,D,E
2013-01-03,-0.341566,-0.726115,0.494828,-0.313599,two
2013-01-05,0.173729,0.130515,0.057717,0.082774,four


In [38]:
df2[df2['E'].isin(['one','one','two'])]

,A,B,C,D,E
2013-01-01,0.702633,-0.248319,-0.677385,-0.307803,one
2013-01-02,-0.662246,-0.02907,0.428587,1.209935,one
2013-01-03,-0.341566,-0.726115,0.494828,-0.313599,two


### Setting

Setting a new column automatically aligns the data by index:

In [39]:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))

In [40]:
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [41]:
df['F'] = s1
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [42]:
df

,A,B,C,D,F
2013-01-01,0.702633,-0.248319,-0.677385,-0.307803,
2013-01-02,-0.662246,-0.02907,0.428587,1.209935,1.0
2013-01-03,-0.341566,-0.726115,0.494828,-0.313599,2.0
2013-01-04,0.48624,1.190614,0.757092,-0.736161,3.0
2013-01-05,0.173729,0.130515,0.057717,0.082774,4.0
2013-01-06,-0.972961,0.743743,-0.640023,0.600814,5.0
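
Index alignment on assignment explains why the first row of `F` is empty and why the 2013-01-07 value disappears. The sketch below reproduces the effect on a smaller, made-up frame (`demo` and `shifted` are illustrative names):

```python
import pandas as pd

# Hypothetical frame indexed by three consecutive days
idx = pd.date_range("20130101", periods=3)
demo = pd.DataFrame({"A": [10, 20, 30]}, index=idx)

# A Series whose index starts one day later than the frame's index
shifted = pd.Series([1, 2, 3], index=pd.date_range("20130102", periods=3))

# Assignment matches on index labels, not on position:
# 2013-01-01 has no partner (NaN), and 2013-01-04 is silently dropped
demo["F"] = shifted
print(demo)
```

Because a missing value appears, the new column is stored as float rather than integer.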


Set values by label:

In [43]:
df.at[dates[0],'A'] = 0
df

,A,B,C,D,F
2013-01-01,0.0,-0.248319,-0.677385,-0.307803,
2013-01-02,-0.662246,-0.02907,0.428587,1.209935,1.0
2013-01-03,-0.341566,-0.726115,0.494828,-0.313599,2.0
2013-01-04,0.48624,1.190614,0.757092,-0.736161,3.0
2013-01-05,0.173729,0.130515,0.057717,0.082774,4.0
2013-01-06,-0.972961,0.743743,-0.640023,0.600814,5.0


Set values by position:

In [44]:
df.iat[0,1] = 0
df

,A,B,C,D,F
2013-01-01,0.0,0.0,-0.677385,-0.307803,
2013-01-02,-0.662246,-0.02907,0.428587,1.209935,1.0
2013-01-03,-0.341566,-0.726115,0.494828,-0.313599,2.0
2013-01-04,0.48624,1.190614,0.757092,-0.736161,3.0
2013-01-05,0.173729,0.130515,0.057717,0.082774,4.0
2013-01-06,-0.972961,0.743743,-0.640023,0.600814,5.0


Set values by assigning a NumPy array:

In [46]:
df.loc[:,'D'] = np.array([5] * len(df))

In [47]:
df

,A,B,C,D,F
2013-01-01,0.0,0.0,-0.677385,5,
2013-01-02,-0.662246,-0.02907,0.428587,5,1.0
2013-01-03,-0.341566,-0.726115,0.494828,5,2.0
2013-01-04,0.48624,1.190614,0.757092,5,3.0
2013-01-05,0.173729,0.130515,0.057717,5,4.0
2013-01-06,-0.972961,0.743743,-0.640023,5,5.0


In [48]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2

,A,B,C,D,F
2013-01-01,0.0,0.0,-0.677385,-5,
2013-01-02,-0.662246,-0.02907,-0.428587,-5,-1.0
2013-01-03,-0.341566,-0.726115,-0.494828,-5,-2.0
2013-01-04,-0.48624,-1.190614,-0.757092,-5,-3.0
2013-01-05,-0.173729,-0.130515,-0.057717,-5,-4.0
2013-01-06,-0.972961,-0.743743,-0.640023,-5,-5.0


## Missing Data

pandas uses np.nan to represent missing data. By default, these values are excluded from computations.

reindex() lets you change, add, or delete the index on a specified axis, and returns a **copy of the data**.

In [50]:
df1 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.677385,5,,1.0
2013-01-02,-0.662246,-0.02907,0.428587,5,1.0,1.0
2013-01-03,-0.341566,-0.726115,0.494828,5,2.0,
2013-01-04,0.48624,1.190614,0.757092,5,3.0,


Drop every row that contains missing data:

In [51]:
df1.dropna(how='any')

,A,B,C,D,F,E
2013-01-02,-0.662246,-0.02907,0.428587,5,1,1


Fill missing data:

In [52]:
df1.fillna(value=5)

,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.677385,5,5,1
2013-01-02,-0.662246,-0.02907,0.428587,5,1,1
2013-01-03,-0.341566,-0.726115,0.494828,5,2,5
2013-01-04,0.48624,1.190614,0.757092,5,3,5


Get a boolean mask marking where values are NaN:

In [53]:
pd.isnull(df1)

,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True
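
The missing-data helpers above all return new objects and leave the original untouched. A compact sketch on a made-up frame (`demo` is an illustrative name) also shows the difference between `how="any"` and `how="all"`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN in each column
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, 6.0]})

complete_rows = demo.dropna(how="any")   # keep rows with no NaN at all
not_empty_rows = demo.dropna(how="all")  # drop only rows where every value is NaN
filled = demo.fillna(value=0.0)          # replace NaN with a constant
mask = demo.isna()                       # boolean frame, True where a value is missing

# Reductions skip NaN by default: the mean of x is computed from 1.0 and 3.0
x_mean = demo["x"].mean()
print(complete_rows, filled, mask, x_mean, sep="\n\n")
```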


## Operations

### Stats

Operations generally exclude missing data.

Compute a descriptive statistic:

In [54]:
df

,A,B,C,D,F
2013-01-01,0.0,0.0,-0.677385,5,
2013-01-02,-0.662246,-0.02907,0.428587,5,1.0
2013-01-03,-0.341566,-0.726115,0.494828,5,2.0
2013-01-04,0.48624,1.190614,0.757092,5,3.0
2013-01-05,0.173729,0.130515,0.057717,5,4.0
2013-01-06,-0.972961,0.743743,-0.640023,5,5.0


In [55]:
df.mean()

A   -0.219467
B    0.218281
C    0.070136
D    5.000000
F    3.000000
dtype: float64

Apply the same operation along the other axis:

In [56]:
df.mean(1)

2013-01-01    1.080654
2013-01-02    1.147454
2013-01-03    1.285429
2013-01-04    2.086789
2013-01-05    1.872392
2013-01-06    1.826152
Freq: D, dtype: float64

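Operating on objects with different dimensionality requires alignment. In addition, pandas automatically broadcasts along the specified dimension.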

In [58]:
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
s

2013-01-01   NaN
2013-01-02   NaN
2013-01-03     1
2013-01-04     3
2013-01-05     5
2013-01-06   NaN
Freq: D, dtype: float64

In [59]:
df.sub(s,axis='index')

,A,B,C,D,F
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,-1.341566,-1.726115,-0.505172,4.0,1.0
2013-01-04,-2.51376,-1.809386,-2.242908,2.0,0.0
2013-01-05,-4.826271,-4.869485,-4.942283,0.0,-1.0
2013-01-06,,,,,
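
With `axis='index'`, the Series is matched against the row labels and then subtracted from every column. The following sketch uses a made-up frame and Series (`demo`, `offsets`) to make the broadcasting and the NaN propagation visible:

```python
import numpy as np
import pandas as pd

# Hypothetical frame and a Series that share the same row labels
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]}, index=list("xyz"))
offsets = pd.Series([1.0, np.nan, 3.0], index=list("xyz"))

# Each row label is aligned with the Series; the matching value is
# subtracted from every column in that row. A NaN offset blanks the row.
result = demo.sub(offsets, axis="index")
print(result)
```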


### Apply

Apply functions to the data with apply():

In [60]:
df.apply(np.cumsum)

,A,B,C,D,F
2013-01-01,0.0,0.0,-0.677385,5,
2013-01-02,-0.662246,-0.02907,-0.248797,10,1.0
2013-01-03,-1.003813,-0.755185,0.246031,15,3.0
2013-01-04,-0.517573,0.435429,1.003123,20,6.0
2013-01-05,-0.343843,0.565944,1.06084,25,10.0
2013-01-06,-1.316805,1.309686,0.420818,30,15.0


In [61]:
df.apply(lambda x:x.max()-x.min())

A    1.459201
B    1.916729
C    1.434477
D    0.000000
F    4.000000
dtype: float64
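
By default `apply()` hands each column to the function; passing `axis=1` hands it each row instead. A small sketch on a made-up frame (`demo` is an illustrative name):

```python
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({"A": [1.0, 4.0], "B": [2.0, 8.0]})

# Default axis=0: the lambda receives one column (a Series) at a time
col_range = demo.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives one row (a Series indexed by column name) at a time
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```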

### Frequency Counts

In [64]:
s = pd.Series(np.random.randint(0,7,size=10))
s

0    0
1    1
2    2
3    6
4    6
5    5
6    4
7    5
8    5
9    3
dtype: int64

In [65]:
s.value_counts()

5    3
6    2
4    1
3    1
2    1
1    1
0    1
dtype: int64

### String Methods

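A Series has a set of string-processing methods under its str attribute that make it easy to operate on each element of the array, as in the code below. Note that pattern matching in str uses regular expressions by default.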

In [66]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
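
The regular-expression default matters as soon as a pattern contains a metacharacter. The sketch below, on a made-up Series, contrasts the default behavior of `str.contains()` with a literal match requested through `regex=False`:

```python
import pandas as pd

# Hypothetical strings: only the first one contains a literal "a.b"
s = pd.Series(["a.b", "axb", "ab"])

# Default: the pattern is a regular expression, so "." matches any character
as_regex = s.str.contains("a.b")

# regex=False: the pattern is treated as a plain substring
as_literal = s.str.contains("a.b", regex=False)

print(as_regex.tolist(), as_literal.tolist())
```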

## Merge

### Concat

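pandas provides various facilities for easily combining Series and DataFrame objects, with different kinds of set logic, in join/merge-type operations. (The Panel type mentioned in older tutorials has since been removed from pandas.)

Concatenate pandas objects with concat():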

In [67]:
df = pd.DataFrame(np.random.randn(10,4))
df

,0,1,2,3
0,-1.161027,-1.231622,0.822933,-0.878371
1,0.714175,-0.588277,1.727522,-0.705645
2,1.561076,-0.124905,-0.62596,-0.248218
3,0.223258,0.697115,0.821513,-0.620105
4,0.832664,1.2994,1.100731,1.244994
5,-1.463792,1.709988,1.271488,-1.394849
6,0.64579,-1.030419,-0.218581,0.204045
7,-0.892794,1.370087,0.445159,0.015854
8,0.810359,1.111946,-0.26968,0.461707
9,-2.962658,-0.170383,0.447636,0.231339


In [68]:
#break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces

[          0         1         2         3
 0 -1.161027 -1.231622  0.822933 -0.878371
 1  0.714175 -0.588277  1.727522 -0.705645
 2  1.561076 -0.124905 -0.625960 -0.248218,
           0         1         2         3
 3  0.223258  0.697115  0.821513 -0.620105
 4  0.832664  1.299400  1.100731  1.244994
 5 -1.463792  1.709988  1.271488 -1.394849
 6  0.645790 -1.030419 -0.218581  0.204045,
           0         1         2         3
 7 -0.892794  1.370087  0.445159  0.015854
 8  0.810359  1.111946 -0.269680  0.461707
 9 -2.962658 -0.170383  0.447636  0.231339]

In [69]:
pd.concat(pieces)

,0,1,2,3
0,-1.161027,-1.231622,0.822933,-0.878371
1,0.714175,-0.588277,1.727522,-0.705645
2,1.561076,-0.124905,-0.62596,-0.248218
3,0.223258,0.697115,0.821513,-0.620105
4,0.832664,1.2994,1.100731,1.244994
5,-1.463792,1.709988,1.271488,-1.394849
6,0.64579,-1.030419,-0.218581,0.204045
7,-0.892794,1.370087,0.445159,0.015854
8,0.810359,1.111946,-0.26968,0.461707
9,-2.962658,-0.170383,0.447636,0.231339


### Join

SQL-style merges:

In [70]:
left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
left

,key,lval
0,foo,1
1,foo,2


In [71]:
right = pd.DataFrame({'key':['foo','foo'],'lval':[4,5]})
right

,key,lval
0,foo,4
1,foo,5


In [72]:
pd.merge(left,right,on='key')

,key,lval_x,lval_y
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5
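
The four-row result above comes from the duplicated key: every `foo` row on the left pairs with every `foo` row on the right, and the clashing `lval` columns receive `_x`/`_y` suffixes. The following sketch contrasts that with unique keys; the `bar` key and the `rval` column name are assumptions made for illustration.

```python
import pandas as pd

# Unique keys and distinct value columns: one output row per key, no suffixes
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
unique_keys = pd.merge(left, right, on="key")

# Duplicated keys and a shared column name: 2 x 2 = 4 rows, suffixed columns
dup_left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
dup_right = pd.DataFrame({"key": ["foo", "foo"], "lval": [4, 5]})
dup_keys = pd.merge(dup_left, dup_right, on="key")

print(unique_keys)
print(dup_keys)
```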


### Append

Append rows to a DataFrame:

In [73]:
df =pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])
df

,A,B,C,D
0,1.069736,0.057451,1.644067,-0.494798
1,0.107777,-0.146868,1.161056,0.501471
2,0.434794,-1.209243,-1.795692,-0.155802
3,-1.172873,0.171433,-0.010713,-1.213308
4,1.086785,-0.205287,0.956344,-0.487511
5,-0.776083,0.961408,0.289241,-1.167467
6,-0.907893,-0.494137,-1.967909,1.799217
7,-0.041598,-0.79336,0.700046,0.812201


In [74]:
s = df.iloc[3]
s

A   -1.172873
B    0.171433
C   -0.010713
D   -1.213308
Name: 3, dtype: float64

In [75]:
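# DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result
pd.concat([df, s.to_frame().T], ignore_index=True)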

,A,B,C,D
0,1.069736,0.057451,1.644067,-0.494798
1,0.107777,-0.146868,1.161056,0.501471
2,0.434794,-1.209243,-1.795692,-0.155802
3,-1.172873,0.171433,-0.010713,-1.213308
4,1.086785,-0.205287,0.956344,-0.487511
5,-0.776083,0.961408,0.289241,-1.167467
6,-0.907893,-0.494137,-1.967909,1.799217
7,-0.041598,-0.79336,0.700046,0.812201
8,-1.172873,0.171433,-0.010713,-1.213308


## Grouping

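"Group by" refers to a process involving one or more of the following steps:

* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure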

In [76]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three', 
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

,A,B,C,D
0,foo,one,-0.443665,0.238972
1,bar,one,2.848939,-0.167037
2,foo,two,0.106889,-0.805219
3,bar,three,0.432271,1.02709
4,foo,two,0.45552,1.006666
5,bar,two,-0.312365,-0.947471
6,foo,one,0.1998,0.165791
7,bar,three,-0.551134,1.478997


Group and then apply the sum function to each group:

In [77]:
df.groupby('A').sum()

A,C,D
bar,2.417712,1.391579
foo,0.318544,0.606211


Grouping by multiple columns forms a hierarchical index, after which the function is applied:

In [78]:
df.groupby(['A','B']).sum()

A,B,C,D
bar,one,2.848939,-0.167037
bar,three,-0.118863,2.506087
bar,two,-0.312365,-0.947471
foo,one,-0.243865,0.404764
foo,two,0.562409,0.201447
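
The three grouping steps can be spelled out by hand to see what `groupby(...).sum()` does. This is only a sketch on a made-up frame (`demo`, `pieces`, and `manual` are illustrative names), not how pandas implements grouping internally:

```python
import pandas as pd

# Hypothetical frame with one grouping column and one numeric column
demo = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]})

# Split: iterating a groupby yields (group name, sub-frame) pairs
# Apply: sum column C within each sub-frame
pieces = {name: group["C"].sum() for name, group in demo.groupby("A")}

# Combine: collect the per-group results into one Series
manual = pd.Series(pieces, name="C")

# The built-in aggregation performs all three steps in one call
builtin = demo.groupby("A")["C"].sum()

print(manual)
print(builtin)
```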


## Reshaping

### Stack

In [79]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples,names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df2 = df[:4]
df2

first,second,A,B
bar,one,-2.457795,0.063298
bar,two,-0.156439,-0.519229
baz,one,-0.966348,-0.600774
baz,two,0.760764,0.750158


The **stack()** method "compresses" a level in the DataFrame's columns:

In [80]:
stacked = df2.stack()
stacked

first  second   
bar    one     A   -2.457795
               B    0.063298
       two     A   -0.156439
               B   -0.519229
baz    one     A   -0.966348
               B   -0.600774
       two     A    0.760764
               B    0.750158
dtype: float64

For a "stacked" DataFrame or Series (one with a MultiIndex as its index), the inverse operation of **stack()** is **unstack()**, which by default unstacks the last level:

In [81]:
stacked.unstack()

first,second,A,B
bar,one,-2.457795,0.063298
bar,two,-0.156439,-0.519229
baz,one,-0.966348,-0.600774
baz,two,0.760764,0.750158


In [82]:
stacked.unstack(1)

,second,one,two
first,,,
bar,A,-2.457795,-0.156439
bar,B,0.063298,-0.519229
baz,A,-0.966348,0.760764
baz,B,-0.600774,0.750158


In [83]:
stacked.unstack(0)

,first,bar,baz
second,,,
one,A,-2.457795,-0.966348
one,B,0.063298,-0.600774
two,A,-0.156439,0.760764
two,B,-0.519229,0.750158
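
A deterministic round trip makes the relationship between `stack()` and `unstack()` easier to verify than with random data. The following sketch uses a made-up frame (`demo`) and selects the level to unstack by name instead of by number:

```python
import pandas as pd

# Hypothetical frame with a two-level row index
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
demo = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 6.0, 7.0, 8.0]}, index=index
)

stacked = demo.stack()                    # column labels become the innermost index level
restored = stacked.unstack()              # default: move the last level back to the columns
by_second = stacked.unstack("second")     # move the level named "second" to the columns

print(stacked)
print(by_second)
```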


### Pivot Tables

In [84]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

,A,B,C,D,E
0,one,A,foo,0.4725,-0.023828
1,one,B,foo,1.395087,1.796497
2,two,C,foo,-0.018254,0.103754
3,three,A,bar,-0.961576,0.812154
4,one,B,bar,1.239921,1.058762
5,one,C,bar,-0.162519,-1.145215
6,two,A,foo,0.783263,0.602081
7,three,B,foo,-1.131829,-0.958198
8,one,C,foo,-1.323945,0.70111
9,one,A,bar,-0.985977,0.679032


In [85]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

,C,bar,foo
A,B,,
one,A,-0.985977,0.4725
one,B,1.239921,1.395087
one,C,-0.162519,-1.323945
three,A,-0.961576,
three,B,,-1.131829
three,C,-0.688053,
two,A,,0.783263
two,B,0.559153,
two,C,,-0.018254


## Time Series

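pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting data sampled every second into 5-minute data). This is very common in financial applications, but not limited to them.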

In [86]:
rng = pd.date_range('1/1/2012',periods=100,freq='S')
rng

<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01 00:00:00, ..., 2012-01-01 00:01:39]
Length: 100, Freq: S, Timezone: None

In [87]:
ts = pd.Series(np.random.randint(0,500,len(rng)),index=rng)
ts

2012-01-01 00:00:00    114
2012-01-01 00:00:01     62
2012-01-01 00:00:02    370
2012-01-01 00:00:03    379
2012-01-01 00:00:04    360
2012-01-01 00:00:05    234
2012-01-01 00:00:06    198
2012-01-01 00:00:07     55
2012-01-01 00:00:08    458
2012-01-01 00:00:09    241
2012-01-01 00:00:10    117
2012-01-01 00:00:11    110
2012-01-01 00:00:12     57
2012-01-01 00:00:13    293
2012-01-01 00:00:14    150
...
2012-01-01 00:01:25    346
2012-01-01 00:01:26    355
2012-01-01 00:01:27     27
2012-01-01 00:01:28     36
2012-01-01 00:01:29     47
2012-01-01 00:01:30    394
2012-01-01 00:01:31    295
2012-01-01 00:01:32    329
2012-01-01 00:01:33     48
2012-01-01 00:01:34    296
2012-01-01 00:01:35     18
2012-01-01 00:01:36    373
2012-01-01 00:01:37    328
2012-01-01 00:01:38    494
2012-01-01 00:01:39    457
Freq: S, Length: 100

In [88]:
ts.resample('5Min').sum()

2012-01-01    24374
Freq: 5T, dtype: int64
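
Resampling groups the timestamps into fixed-width bins and then aggregates each bin. The sketch below uses a constant series of ones, so the per-bin sums are easy to predict; the one-minute bin width and the names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# 100 one-second samples starting at midnight, each with the value 1
rng = pd.date_range("2012-01-01", periods=100, freq=pd.Timedelta(seconds=1))
ts = pd.Series(np.ones(len(rng), dtype="int64"), index=rng)

# One-minute bins: the first bin holds 60 samples, the second the remaining 40
per_minute = ts.resample("1min").sum()
print(per_minute)
```

The aggregation is chosen by the method called on the resampler (`sum()`, `mean()`, and so on) rather than by a keyword argument.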

Time zone representation:

In [89]:
rng = pd.date_range('3/6/2012',periods=5,freq='D')
rng

<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-06, ..., 2012-03-10]
Length: 5, Freq: D, Timezone: None

In [90]:
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts

2012-03-06   -1.062647
2012-03-07   -0.988973
2012-03-08    0.018998
2012-03-09    0.882671
2012-03-10   -0.566935
Freq: D, dtype: float64

In [91]:
ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-06 00:00:00+00:00   -1.062647
2012-03-07 00:00:00+00:00   -0.988973
2012-03-08 00:00:00+00:00    0.018998
2012-03-09 00:00:00+00:00    0.882671
2012-03-10 00:00:00+00:00   -0.566935
Freq: D, dtype: float64

Convert to another time zone:

In [92]:
ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00   -1.062647
2012-03-06 19:00:00-05:00   -0.988973
2012-03-07 19:00:00-05:00    0.018998
2012-03-08 19:00:00-05:00    0.882671
2012-03-09 19:00:00-05:00   -0.566935
Freq: D, dtype: float64
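
`tz_localize()` attaches a time zone to naive timestamps without moving them, while `tz_convert()` re-expresses the same instants in another zone. A minimal sketch with made-up values:

```python
import pandas as pd

# Naive (time-zone-free) daily timestamps with illustrative values
rng = pd.date_range("2012-03-06", periods=2, freq="D")
naive = pd.Series([1.0, 2.0], index=rng)

utc = naive.tz_localize("UTC")            # same wall-clock times, now labeled as UTC
eastern = utc.tz_convert("US/Eastern")    # same instants, shown in Eastern time

print(utc.index[0], eastern.index[0])
```

Only the labels change: comparing the two indexes element by element shows that they refer to the same moments.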

Convert between time span representations:

In [93]:
rng = pd.date_range('1/1/2012', periods=5, freq='M')
rng

<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-31, ..., 2012-05-31]
Length: 5, Freq: M, Timezone: None

In [94]:
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-01-31   -0.632847
2012-02-29   -0.983099
2012-03-31    1.646944
2012-04-30   -2.136451
2012-05-31   -0.048257
Freq: M, dtype: float64

In [96]:
ps = ts.to_period()
ps

2012-01   -0.632847
2012-02   -0.983099
2012-03    1.646944
2012-04   -2.136451
2012-05   -0.048257
Freq: M, dtype: float64

In [97]:
ps.to_timestamp()


2012-01-01   -0.632847
2012-02-01   -0.983099
2012-03-01    1.646944
2012-04-01   -2.136451
2012-05-01   -0.048257
Freq: MS, dtype: float64

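Converting between periods and timestamps makes some convenient arithmetic functions available. In the example below, quarterly periods whose fiscal year ends in November are converted to 9 a.m. on the first day of the final month of each quarter: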

In [98]:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng

<class 'pandas.tseries.period.PeriodIndex'>
[1990Q1, ..., 2000Q4]
Length: 44, Freq: Q-NOV

In [99]:
ts = pd.Series(np.random.randn(len(prng)), index = prng)
ts

1990Q1   -0.529455
1990Q2    0.570384
1990Q3   -1.544862
1990Q4   -0.297993
1991Q1   -0.817410
1991Q2   -0.409854
1991Q3   -2.555214
1991Q4    1.032389
1992Q1   -0.378980
1992Q2   -0.373298
1992Q3   -0.290478
1992Q4   -1.106722
1993Q1   -1.118070
1993Q2    1.683016
1993Q3    2.278607
1993Q4    1.070247
1994Q1    0.911208
1994Q2    0.632467
1994Q3   -0.234640
1994Q4   -1.545572
1995Q1    0.707218
1995Q2   -0.386264
1995Q3    1.440627
1995Q4    0.751621
1996Q1   -1.640999
1996Q2    0.125822
1996Q3   -0.781528
1996Q4    1.900030
1997Q1    0.031440
1997Q2   -1.521395
1997Q3   -0.350913
1997Q4   -1.335841
1998Q1   -0.506609
1998Q2    0.609221
1998Q3   -0.092327
1998Q4    0.742343
1999Q1   -0.490511
1999Q2    1.018211
1999Q3    1.817720
1999Q4    0.560937
2000Q1   -1.012795
2000Q2   -1.070557
2000Q3    0.704798
2000Q4    0.353916
Freq: Q-NOV, dtype: float64

In [100]:
ts.index = prng.asfreq('M', 'end').asfreq('H', 'start') + 9
ts

1990-02-01 09:00   -0.529455
1990-05-01 09:00    0.570384
1990-08-01 09:00   -1.544862
1990-11-01 09:00   -0.297993
1991-02-01 09:00   -0.817410
1991-05-01 09:00   -0.409854
1991-08-01 09:00   -2.555214
1991-11-01 09:00    1.032389
1992-02-01 09:00   -0.378980
1992-05-01 09:00   -0.373298
1992-08-01 09:00   -0.290478
1992-11-01 09:00   -1.106722
1993-02-01 09:00   -1.118070
1993-05-01 09:00    1.683016
1993-08-01 09:00    2.278607
1993-11-01 09:00    1.070247
1994-02-01 09:00    0.911208
1994-05-01 09:00    0.632467
1994-08-01 09:00   -0.234640
1994-11-01 09:00   -1.545572
1995-02-01 09:00    0.707218
1995-05-01 09:00   -0.386264
1995-08-01 09:00    1.440627
1995-11-01 09:00    0.751621
1996-02-01 09:00   -1.640999
1996-05-01 09:00    0.125822
1996-08-01 09:00   -0.781528
1996-11-01 09:00    1.900030
1997-02-01 09:00    0.031440
1997-05-01 09:00   -1.521395
1997-08-01 09:00   -0.350913
1997-11-01 09:00   -1.335841
1998-02-01 09:00   -0.506609
1998-05-01 09:00    0.609221
1998-08-01 09:

## Categoricals

Since version 0.15, pandas can include categorical data in a DataFrame.

In [101]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'e', 'e']})
df

,id,raw_grade
0,1,a
1,2,b
2,3,b
3,4,a
4,5,e
5,6,e


Convert raw_grade to a categorical data type:

In [102]:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    e
5    e
Name: grade, dtype: category
Categories (3, object): [a < b < e]

Rename the categories to more meaningful names:

In [103]:
df["grade"] = df["grade"].cat.rename_categories(["very good","good","very bad"])

Reorder the categories and add the missing categories at the same time:

In [108]:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4     very bad
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad < bad < medium < good < very good]

Sorting follows the order of the categories, not lexical order:

In [109]:
df.sort_values(by="grade")

,id,raw_grade,grade
4,5,e,very bad
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good


Grouping by a categorical column also shows empty categories:

In [110]:
df.groupby("grade").size()

grade
very bad      2
bad         NaN
medium      NaN
good          2
very good     2
dtype: float64
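
The effect of category order on sorting, and the presence of unused categories in counts, can be checked on a standalone Series. A minimal sketch with made-up grades:

```python
import pandas as pd

# Hypothetical grades; "bad" is declared as a category but never observed
grades = pd.Series(["good", "very bad", "very good"]).astype("category")
grades = grades.cat.set_categories(
    ["very bad", "bad", "good", "very good"], ordered=True
)

sorted_grades = grades.sort_values()        # follows the declared category order
lexical = sorted(grades.astype(str))        # plain string sort, for comparison
counts = grades.value_counts(sort=False)    # includes the unobserved category

print(sorted_grades.tolist())
print(lexical)
print(counts)
```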

## Plotting

In [114]:
ts =pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
ts = ts.cumsum()
ts.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x10792f590>

On a DataFrame, **plot()** is a convenient way to plot all of the columns with their labels:

In [115]:
df = pd.DataFrame(np.random.randn(1000,4),index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x10bd255d0>

## Getting Data In/Out
### CSV

Write to a CSV file:

In [118]:
df.to_csv('foo.csv')

Read from a CSV file:

In [119]:
pd.read_csv('foo.csv')

,Unnamed: 0,A,B,C,D
0,2000-01-01,-0.425461,-1.403269,-0.044436,0.219559
1,2000-01-02,0.308576,-2.057619,0.634288,-0.452682
2,2000-01-03,0.619000,-2.869349,1.237154,-0.308545
3,2000-01-04,-1.165187,-2.040167,1.047757,0.847036
4,2000-01-05,-0.214774,-4.301137,0.571458,2.713687
5,2000-01-06,1.639842,-3.369141,0.753631,2.160821
6,2000-01-07,1.910771,-4.025835,-0.773456,1.821734
7,2000-01-08,1.563340,-4.742109,2.518628,0.588485
8,2000-01-09,0.132218,-4.341313,1.862238,-0.697081
9,2000-01-10,0.373116,-4.632546,1.397640,-0.845373
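
The `Unnamed: 0` column in the output above is the original date index, which `to_csv()` wrote as an unlabeled first column. The sketch below shows one way to restore it on read; it uses an in-memory buffer instead of `foo.csv` so that it runs without touching the file system, and the frame `demo` is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical frame with a date index, written to CSV text in memory
demo = pd.DataFrame({"A": [1.5, 2.5]}, index=pd.date_range("2000-01-01", periods=2))
csv_text = demo.to_csv()

# Plain read: the index comes back as an ordinary column named "Unnamed: 0"
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index; parse_dates=True parses it as dates
restored = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(plain.columns.tolist())
print(restored.index)
```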


### HDF5

Reading and writing HDFStores.

Write to an HDF5 store:

In [120]:
df.to_hdf('foo.h5', key='df')

Read from an HDF5 store:

In [121]:
pd.read_hdf('foo.h5','df')

,A,B,C,D
2000-01-01,-0.425461,-1.403269,-0.044436,0.219559
2000-01-02,0.308576,-2.057619,0.634288,-0.452682
2000-01-03,0.619000,-2.869349,1.237154,-0.308545
2000-01-04,-1.165187,-2.040167,1.047757,0.847036
2000-01-05,-0.214774,-4.301137,0.571458,2.713687
2000-01-06,1.639842,-3.369141,0.753631,2.160821
2000-01-07,1.910771,-4.025835,-0.773456,1.821734
2000-01-08,1.563340,-4.742109,2.518628,0.588485
2000-01-09,0.132218,-4.341313,1.862238,-0.697081
2000-01-10,0.373116,-4.632546,1.397640,-0.845373


### Excel

Reading and writing Excel files.

Write to an Excel file:

In [122]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')

Read from an Excel file:

In [123]:
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

,A,B,C,D
2000-01-01,-0.425461,-1.403269,-0.044436,0.219559
2000-01-02,0.308576,-2.057619,0.634288,-0.452682
2000-01-03,0.619000,-2.869349,1.237154,-0.308545
2000-01-04,-1.165187,-2.040167,1.047757,0.847036
2000-01-05,-0.214774,-4.301137,0.571458,2.713687
2000-01-06,1.639842,-3.369141,0.753631,2.160821
2000-01-07,1.910771,-4.025835,-0.773456,1.821734
2000-01-08,1.563340,-4.742109,2.518628,0.588485
2000-01-09,0.132218,-4.341313,1.862238,-0.697081
2000-01-10,0.373116,-4.632546,1.397640,-0.845373
