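# Getting Started with pandas
pandas is a tool built on top of NumPy, created for data analysis tasks. It brings together a number of libraries and standard data models and provides the tools needed to work efficiently with large datasets. Its many functions and methods make data handling fast and convenient, and it is one of the main reasons Python is a powerful and productive environment for data analysis.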

In [3]:
import numpy as np
import pandas as pd

## Introduction to pandas Data Structures
The main pandas data structures are Series and DataFrame.

### Series
A one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (the index).

In [2]:
obj = pd.Series([4, 7, -5, 3])

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

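+ In the example above, the left column is the index and the right column is the data. Since no index was specified, an integer index from 0 to $N-1$ (where $N$ is the length of the data) is created automatically.
+ The array representation and the index object are available through the **values** and **index** attributes.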

In [4]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

An index can be specified at creation time through the **index** argument.

In [6]:
obj2 = pd.Series([4, 7 , -5, 3], index=['d', 'b', 'a', 'c'])

In [7]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [8]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

As with NumPy arrays, a single value or a set of values in a Series can be selected by index:

In [9]:
obj2['a']
obj2

-5

d    4
b    7
a   -5
c    3
dtype: int64

In [10]:
obj2['d'] = 6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [11]:
obj2.iloc[1]  # select by position; obj2[1] on a label-indexed Series is deprecated in recent pandas

7

In [12]:
obj2[['c','a','d']]

c    3
a   -5
d    6
dtype: int64

Operations on a pandas Series (filtering, scalar multiplication, applying math functions, and so on) preserve the link between index and values.

In [13]:
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [14]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [15]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [16]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Series can be thought of as a fixed-length, ordered dict, so it can be used in many functions that expect a dict:
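As a rough illustration of this dict-like behavior, the sketch below uses a small Series (the name `prices` and its values are made up for this example) with membership tests, `get` with a default, iteration over `items()`, and conversion back to a plain dict.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
prices = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests look at the index labels, like dict keys
has_b = 'b' in prices
has_e = 'e' in prices

# get() returns a default instead of raising KeyError
missing = prices.get('e', 0)

# items() yields (label, value) pairs in index order
pairs = [(label, int(value)) for label, value in prices.items()]

# to_dict() converts the Series back into a plain dict
as_dict = prices.to_dict()
print(has_b, has_e, missing, pairs, as_dict)
```

Note that, unlike a dict, the labels of a Series are not required to be unique.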

In [17]:
'b' in obj2

True

In [18]:
'e' in obj2

False

In [19]:
sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':1600, 'Utah':5000}

In [20]:
obj3 = pd.Series(sdata)

In [21]:
obj3

Ohio      35000
Texas     71000
Oregon     1600
Utah       5000
dtype: int64

If only a dict is passed, the index of the resulting Series is the dict's keys, in order (the output above keeps the dict's key order).

In [22]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [23]:
obj4 = pd.Series(sdata, index=states)

In [24]:
obj4

California        NaN
Ohio          35000.0
Oregon         1600.0
Texas         71000.0
dtype: float64

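+ When a dict is passed together with an explicit index, the values whose keys match the index are found and placed in the corresponding positions; index labels with no matching value get NaN (not a number).
+ The pandas functions `isnull` and `notnull` detect missing data; they are also available as Series methods.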

In [25]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [26]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [27]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]:
obj4.notnull()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series objects automatically align data by index label in arithmetic operations.

In [29]:
obj3

Ohio      35000
Texas     71000
Oregon     1600
Utah       5000
dtype: int64

In [30]:
obj4

California        NaN
Ohio          35000.0
Oregon         1600.0
Texas         71000.0
dtype: float64

In [31]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon          3200.0
Texas         142000.0
Utah               NaN
dtype: float64

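Both the Series object and its index have a *name* attribute. When the Series is displayed, the index's name appears as a header above the index labels.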

In [32]:
obj4.name = 'population'

In [33]:
obj4.index.name = 'state'

In [34]:
obj4

state
California        NaN
Ohio          35000.0
Oregon         1600.0
Texas         71000.0
Name: population, dtype: float64

A Series' index can be modified in place by assignment:

In [35]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [36]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [37]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

### DataFrame
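A tabular data structure with both a row index and a column index. It can also be viewed as a dict of Series that share one index.

+ Creating a DataFrame


1. Pass a dict of equal-length lists or NumPy arrays. The DataFrame gets an index automatically, and the columns are placed in order (the dict's key order in the output below).
2. If a sequence of columns is specified, the DataFrame's columns are arranged in that order.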

In [38]:
data = {'state':['Ohio','Ohio', 'Ohio','Nevada' ,'Nevada'],
       'year':[2000, 2001, 2002, 2001, 2002],
       'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
df1 = pd.DataFrame(data)

In [39]:
df1

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [40]:
# Specify the column order
pd.DataFrame(data, columns=['year', 'state', 'pop'])  

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


If a requested column is not found in the data, it is filled with NaN values.

In [41]:
df2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], 
                   index=['one', 'two', 'three', 'four', 'five'])

In [42]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [43]:
df2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

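+ **Selecting columns**. A column of a DataFrame can be retrieved as a Series with **dict-like notation** or as an **attribute**. The returned Series has the same index as the DataFrame, and its `name` attribute is set accordingly.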

In [44]:
df2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [45]:
df2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

+ **Selecting rows**. Use `.loc` for label-based indexing and `.iloc` for position-based indexing:

In [46]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [47]:
df2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [48]:
df2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

+ **Assignment**. A column can be modified by assigning a scalar, a list, or an array to it.


In [49]:
df2['debt'] = 16.5

In [50]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [51]:
df2['debt'] = np.arange(5)

In [52]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


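When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, its labels are aligned exactly to the DataFrame's index, and any gaps are filled with missing values.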

In [53]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [54]:
df2['debt'] = val

In [55]:
df2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


Assigning to a column that does not exist creates a new column. The `del` keyword deletes a column:

In [56]:
df2['eastern'] = df2.state == 'Ohio'

In [57]:
df2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


In [58]:
del df2['eastern']

In [59]:
df2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

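A column returned by indexing is a view on the underlying data, not a copy, so in-place changes to it may show up in the DataFrame (the exact behavior depends on the pandas version and its copy-on-write setting). Use the Series' `copy` method when an independent copy is needed.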
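A minimal sketch of taking an explicit copy, assuming a small made-up DataFrame named `frame` (not one of the objects defined in this notebook). It only demonstrates the `copy` route, which behaves the same regardless of how a given pandas version treats views.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'pop': [1.5, 1.7, 3.6]})

# An explicit copy is independent of the DataFrame
pop_copy = frame['pop'].copy()
pop_copy.iloc[0] = 99.0

# The original column is untouched
original_first = frame['pop'].iloc[0]
print(original_first, pop_copy.iloc[0])
```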

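+ Nested dicts. The outer keys become the columns and the inner keys become the rows; the result can of course be transposed. The inner keys are unioned to form the final index (older pandas versions also sorted them; the output below keeps the order of first appearance). This does not happen if an index is given explicitly.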

In [60]:
pop = {'Nevada':{2001:2.4, 2002:2.9}, 
       'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}

In [61]:
df3 = pd.DataFrame(pop)

In [62]:
df3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [63]:
df3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


In [64]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


+ Building a DataFrame from Series.

In [65]:
df3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [66]:
pdata = {'Ohio':df3['Ohio'][:-1],
         'Nevada':df3['Nevada'][:2]}

In [67]:
pdata

{'Ohio': 2001    1.7
 2002    3.6
 Name: Ohio, dtype: float64,
 'Nevada': 2001    2.4
 2002    2.9
 Name: Nevada, dtype: float64}

In [68]:
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


In [69]:
list('abcd')

['a', 'b', 'c', 'd']

In [70]:
df=pd.DataFrame(np.arange(12).reshape((4,3)), 
                index=list('abcd'), 
                columns=['one', 'two', 'three'])
df

Unnamed: 0,one,two,three
a,0,1,2
b,3,4,5
c,6,7,8
d,9,10,11


In [71]:
df.index.name = 'alpha'
df.columns.name = 'number'
df

number,one,two,three
alpha,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,0,1,2
b,3,4,5
c,6,7,8
d,9,10,11


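Data that can be passed to the DataFrame constructor

|Type|Description|
|:----------------------------------------|:--------------------------------------------------|
|2D ndarray|A matrix of data; row and column labels can also be passed.|
|dict of arrays, lists, or tuples|Each sequence becomes a column of the DataFrame; all columns must have the same length.|
|NumPy structured/record array|Treated like a dict of arrays.|
|dict of Series|Each Series becomes a column. If no index is given explicitly, the indexes of the Series are unioned to form the row index of the result.|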
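The four input types in the table can be sketched as follows. All variable names and values here are made up for illustration; the calls themselves are ordinary `pd.DataFrame` constructor calls.

```python
import numpy as np
import pandas as pd

# 1. 2D ndarray, with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape((2, 3)),
                          index=['r1', 'r2'], columns=['a', 'b', 'c'])

# 2. dict of equal-length lists: each list becomes a column
from_lists = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

# 3. NumPy structured array: field names become column names
records = np.array([(1, 2.5), (2, 3.5)],
                   dtype=[('id', 'i4'), ('score', 'f8')])
from_records = pd.DataFrame(records)

# 4. dict of Series: the indexes are unioned into the row index,
#    and labels missing from one Series are filled with NaN
from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['a', 'b']),
                            'y': pd.Series([3, 4], index=['b', 'c'])})
print(from_series)
```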

**Tips**

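+ If the `name` attributes of a DataFrame's `index` and `columns` are set, they are displayed as well.
+ The `values` attribute returns the DataFrame's data as a two-dimensional ndarray.
+ If the columns have different dtypes, the dtype of the resulting array is chosen to accommodate all of them.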

In [72]:
df3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [73]:
df3.index.name = 'year'
df3.columns.name = 'state'
df3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [74]:
df3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [75]:
df2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

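### Index Objects
pandas Index objects hold the axis labels and other metadata (such as axis names). Any array or other sequence of labels used when building a Series or DataFrame is converted to an `Index`: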

In [76]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [77]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [78]:
index[1:]

Index(['b', 'c'], dtype='object')

**Index objects are immutable.**

In [79]:
# Running the following statement raises an error
index[2] = 'd'

TypeError: Index does not support mutable operations

In [157]:
index = pd.Index(np.arange(3))

In [158]:
obj2 = pd.Series([1.5, -2.5, 0], index=index)

In [159]:
obj2.index is index

True

+ Besides looking like an array, an `Index` also behaves like a fixed-size set:

In [160]:
df3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [161]:
'Ohio' in df3.columns

True

In [162]:
2003 in df3.index

False

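Methods and properties of `Index`:

| Method | Description |
|:---|:---|
|append|Concatenate another `Index` object, producing a new `Index`|
|difference|Compute the set difference as an `Index`|
|intersection|Compute the set intersection|
|union|Compute the set union|
|isin|Compute a boolean array indicating whether each value is contained in the passed collection|
|delete|Delete the element at position *i*, producing a new `Index`|
|drop|Delete the passed values, producing a new `Index`|
|insert|Insert an element at position *i*, producing a new `Index`|
|is_monotonic_increasing|Return `True` if each element is greater than or equal to the previous element|
|is_unique|Return `True` if the `Index` has no duplicate values|
|unique|Compute the array of unique values in the `Index`|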
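The set-like and editing operations listed above can be tried on two small indexes. This is only a sketch: `left` and `right` are made-up names, and the set difference is written with `Index.difference`, the method name current pandas versions use for it.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['c', 'd'])

appended = left.append(right)        # concatenation, duplicates kept
diff = left.difference(right)        # labels in left but not in right
common = left.intersection(right)    # labels present in both
both = left.union(right)             # labels present in either
mask = left.isin(['a', 'c'])         # boolean NumPy array
deleted = left.delete(0)             # remove the element at position 0
dropped = left.drop(['b'])           # remove by value
inserted = left.insert(1, 'z')       # insert 'z' at position 1

print(list(appended), appended.is_unique)
print(list(both), left.is_monotonic_increasing)
```

Every call returns a new object; `left` and `right` themselves are never modified, which is consistent with Index objects being immutable.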

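## Essential Functionality of Series and DataFrame

### Reindexing
1. An important method on pandas objects is `reindex`, which **creates a new object conformed to a new index**.
2. If an index value is not currently present, a missing value is introduced. The `fill_value` argument sets the value to use instead.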

In [163]:
obj1 = pd.Series([4.5, 7.2, -5.3, 3.6], 
                 index=['d', 'b', 'a', 'c'])
obj1

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [164]:
# Assign a new Index; its length must match the data
obj1.index = pd.Index(list("abcd"))
obj1

a    4.5
b    7.2
c   -5.3
d    3.6
dtype: float64

In [165]:
obj1 = pd.Series([4.5, 7.2, -5.3, 3.6], 
                 index=['d', 'b', 'a', 'c'])
obj1
# The new index may differ in length from the old one; values are aligned by label
obj2 = obj1.reindex(['a', 'b', 'c', 'd', 'e'])

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [166]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [167]:
obj1.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

In [168]:
obj1

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

- For ordered data such as time series, reindexing may need some interpolation or filling. The `method` option does this; for example, `ffill` fills forward.

In [169]:
obj3 = pd.Series(['blue', 'purple', 'yellow'],
                 index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [170]:
obj3.reindex(range(6))

0      blue
1       NaN
2    purple
3       NaN
4    yellow
5       NaN
dtype: object

In [171]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

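`method` options (filling) for reindex

Argument|Description
----|----
ffill or pad|Fill forward (carry the last valid value forward)
bfill or backfill|Fill backward (carry the next valid value backward)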

In [172]:
date_index = pd.date_range('1/8/2018', 
                           periods=6, freq='D')

In [173]:
df4 = pd.DataFrame({"Price":[100, 101, np.nan, 100, 98, 88]}, 
                   index=date_index)

In [174]:
df4

Unnamed: 0,Price
2018-01-08,100.0
2018-01-09,101.0
2018-01-10,
2018-01-11,100.0
2018-01-12,98.0
2018-01-13,88.0


In [175]:
date_index2 = pd.date_range('1/3/2018', periods=10, freq='D')

In [176]:
df4.reindex(date_index2)

Unnamed: 0,Price
2018-01-03,
2018-01-04,
2018-01-05,
2018-01-06,
2018-01-07,
2018-01-08,100.0
2018-01-09,101.0
2018-01-10,
2018-01-11,100.0
2018-01-12,98.0


In [177]:
# For missing values
df4.reindex(date_index2, method='bfill')

Unnamed: 0,Price
2018-01-03,100.0
2018-01-04,100.0
2018-01-05,100.0
2018-01-06,100.0
2018-01-07,100.0
2018-01-08,100.0
2018-01-09,101.0
2018-01-10,
2018-01-11,100.0
2018-01-12,98.0


In [178]:
df4.reindex(date_index2, method='ffill')

Unnamed: 0,Price
2018-01-03,
2018-01-04,
2018-01-05,
2018-01-06,
2018-01-07,
2018-01-08,100.0
2018-01-09,101.0
2018-01-10,
2018-01-11,100.0
2018-01-12,98.0


For a DataFrame, `reindex` can change the (row) index, the columns, or both. When passed just one sequence, it reindexes the rows:

In [179]:
df5= pd.DataFrame(np.arange(9).reshape((3,3)) ,
                  index=['a', 'c', 'd'], 
                  columns=['Ohio', 'Texas', 'California'] )

In [180]:
df5

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [181]:
df6 = df5.reindex(['a','b', 'c','d'])
df6

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [182]:
states = ['Texas', 'Utah', 'California']
states

['Texas', 'Utah', 'California']

In [183]:
df5

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [184]:
# Reindex the columns
df5.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [185]:
# Reindex the rows
df5.reindex(index=['a', 'b', 'c', 'd'])

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [186]:
df5

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [187]:
# states.sort()
states

['Texas', 'Utah', 'California']

In [188]:
# Reindex rows and columns at the same time
df7=df5.reindex(index=['a', 'b', 'c', 'd'],columns=states)
df7

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


In [189]:
df7=df5.reindex(index=['a', 'b', 'c', 'd'],columns=['Texas', 'Utah', 'California'])
df7

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


In [190]:
df7.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,1.0,,2.0
c,4.0,,5.0
d,7.0,,8.0


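### Dropping Entries from an Axis
To drop one or more entries from an axis, all you need is an index array or list. **The `drop` method returns a new object with the given values removed from the specified axis.**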

In [191]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [192]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [193]:
obj11 = obj.drop('c')    # returns a new object; the original is unchanged
obj11

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [194]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [195]:
new_obj = obj.drop('c')

In [196]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [197]:
obj.drop(['a', 'd'])

b    1.0
c    2.0
e    4.0
dtype: float64

+ For a DataFrame, index values can be dropped from either axis.

In [198]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), 
                    index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                    columns=['one', 'two', 'three', 'four'])

In [199]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [200]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [201]:
data.drop('two', axis=1)   # axis=1 is required to drop a column; otherwise an error is raised

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [202]:
data.drop(['one', 'four'], axis=1)

Unnamed: 0,two,three
Ohio,1,2
Colorado,5,6
Utah,9,10
New York,13,14


### Indexing, Selection, and Filtering

Indexing a **Series** works much like NumPy array indexing, except that the index values need not be integers.

In [203]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [204]:
obj['b']

1.0

In [205]:
obj.iloc[2]  # select by position

2.0

In [206]:
obj[['a', 'd', 'c']]

a    0.0
d    3.0
c    2.0
dtype: float64

In [207]:
obj.iloc[[1, 3]]  # select by position

b    1.0
d    3.0
dtype: float64

In [208]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

In [209]:
obj[2:4]


c    2.0
d    3.0
dtype: float64

***Slicing with labels differs from ordinary Python slicing: the endpoint is inclusive.***

In [210]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [211]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

In [212]:
obj['b':'c'] = 5

In [213]:
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

**Indexing a DataFrame** with `[]` retrieves one or more columns.

In [214]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), 
                    index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                    columns=['one', 'two', 'three', 'four'])

In [215]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [216]:
data.two

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [217]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [218]:
data[['one','two']]

Unnamed: 0,one,two
Ohio,0,1
Colorado,4,5
Utah,8,9
New York,12,13


In [219]:
data[['two']]

Unnamed: 0,two
Ohio,1
Colorado,5
Utah,9
New York,13


In [220]:
data[['three', 'two']]

Unnamed: 0,three,two
Ohio,2,1
Colorado,6,5
Utah,10,9
New York,14,13


In [221]:
data[:2]   # positional slice of rows

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [222]:
data[data['three'] > 5]   # boolean indexing

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [223]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [224]:
data[data < 5] = 0

In [225]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


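+ For **label-based indexing** on the rows of a DataFrame, use the special `loc` indexer, which selects subsets of rows and columns with NumPy-style notation and axis labels.
+ For indexing by integer position, use the `iloc` indexer.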

In [226]:
row_slice = data.loc['Colorado', ['two', 'three']]  # returns a Series; named row_slice to avoid shadowing the built-in slice

In [227]:
row_slice

two      5
three    6
Name: Colorado, dtype: int32

In [228]:
data.loc[['Colorado', 'Utah'], ['four','one', 'two']]  # labels only; integer positions cannot be mixed in

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [229]:
data.iloc[[1,2],[3,0,1]]  # integer positions

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [230]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [231]:
data.loc[:'Utah','two']  # label-based slice; the end label is included

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [233]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [232]:
data[data['three'] > 5].iloc[:,:3]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


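Indexing options for DataFrame

|Type|Description|
|----|-----|
|`df[val]`|Select a single column or a set of columns from the DataFrame|
|`df.loc[val,val]`|Select rows, columns, or both by label|
|`df.iloc[val,val]`|Select rows, columns, or both by integer position|
|`df.reindex` method|Conform one or more axes to a new index|
|`df.at[val, val]`|Select a single value by row label and column label|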

In [349]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), 
                    index=['Ohio', 'Colorado', 'Utah', 'New York'], 
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [235]:
data.at['New York', 'three']

14

In [236]:
data.at['Utah','four'] = 15

In [237]:
data.at['two','Utah'] = 0
data

Unnamed: 0,one,two,three,four,Utah
Ohio,0.0,1.0,2.0,3.0,
Colorado,4.0,5.0,6.0,7.0,
Utah,8.0,9.0,10.0,15.0,
New York,12.0,13.0,14.0,15.0,
two,,,,,0.0


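### Arithmetic and Data Alignment

+ pandas can perform arithmetic on objects with different indexes. When objects are added, if any index pairs differ, the index of the result is the union of the index pairs.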

In [238]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [239]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], 
               index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [240]:
s1  + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

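+ Automatic data alignment introduces NA values at labels that do not overlap, and missing values propagate through arithmetic.
+ For a DataFrame, alignment happens on both the rows and the columns.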

In [241]:
df1 = pd.DataFrame(np.arange(9.0).reshape((3, 3)), 
                   columns=list('bcd'), 
                   index=['Ohio', 'Texas', 'Colorado'])
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [242]:
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)), 
                   columns=list('bde'), 
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df2

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [243]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


#### Filling Values in Arithmetic Methods

In [244]:
df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), 
                   columns=list('abcd'))
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [245]:
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), 
                   columns=list('abcde'))
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [246]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [247]:
df1.add(df2,fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [248]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


Arithmetic methods
1. add: addition (+)
2. sub: subtraction (-)
3. div: division (/)
4. mul: multiplication (*)
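The sketch below contrasts a plain operator with the method forms listed above when the two operands do not share all columns. The frames `small` and `large` are made up for this example; `fill_value` is the same argument used with `add` earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: `large` has an extra column 'c'
small = pd.DataFrame(np.arange(4.0).reshape((2, 2)), columns=list('ab'))
large = pd.DataFrame(np.arange(1.0, 7.0).reshape((2, 3)), columns=list('abc'))

# Plain operator: column 'c' has no partner in `small`, so it becomes NaN
plain = small - large

# Method form: entries missing from one operand are treated as 0 first
filled = small.sub(large, fill_value=0)

# mul and div follow the same pattern; 1 is the neutral fill for them
product = small.mul(large, fill_value=1)
ratio = small.div(large, fill_value=1)

print(plain)
print(filled)
```

If a position is missing in both operands, the result at that position is still missing, whatever `fill_value` is.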

#### Operations Between DataFrame and Series

In [249]:
arr = np.arange(12).reshape((3, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [250]:
arr[0]

array([0, 1, 2, 3])

In [251]:
arr - arr[0]

array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

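+ The pattern above is called **broadcasting**; operations between a DataFrame and a Series work in much the same way.
+ By default, arithmetic between a DataFrame and a Series matches the Series' index against the DataFrame's columns and broadcasts down the rows.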

In [252]:
df3 = pd.DataFrame(np.arange(12.0).reshape((4, 3)), 
                   columns=list('bde'), 
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df3

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [253]:
series = df3.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [254]:
df3 - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [255]:
df3.sub(series, axis=1)

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


+ If an index value is not found in either the DataFrame's columns or the Series' index, the two objects are **reindexed to form the union**.

In [256]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2

b    0
e    1
f    2
dtype: int64

In [257]:
df3

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [258]:
df3 + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


***To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods.***

In [259]:
df3

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [260]:
series3 = df3['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [261]:
df3.sub(series3, axis=0) # the axis argument is the axis to match on

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


In [262]:
df5 = pd.DataFrame(np.arange(12).reshape(4,3),
                   columns=np.arange(3))
df5

Unnamed: 0,0,1,2
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11


In [264]:
series5 = pd.Series(np.arange(3))
series5

0    0
1    1
2    2
dtype: int32

In [265]:
df5 - series5

Unnamed: 0,0,1,2
0,0,0,0
1,3,3,3
2,6,6,6
3,9,9,9


In [266]:
df5.sub(series5)

Unnamed: 0,0,1,2
0,0,0,0
1,3,3,3
2,6,6,6
3,9,9,9


In [267]:
df5.sub(series5, axis=0)

Unnamed: 0,0,1,2
0,0.0,1.0,2.0
1,2.0,3.0,4.0
2,4.0,5.0,6.0
3,,,


In [350]:
series5[3] = 10
series5

0     0
1     1
2     2
3    10
dtype: int64

In [351]:
df5.sub(series5, axis=0)

Unnamed: 0,0,1,2
0,0,1,2
1,2,3,4
2,4,5,6
3,-1,0,1


### Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work on pandas objects.

In [352]:
df2 = pd.DataFrame(np.random.randn(4, 3), 
                   columns=list('bde'), 
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df2

Unnamed: 0,b,d,e
Utah,0.810599,-0.296787,-0.998881
Ohio,0.357382,-1.388603,-0.232873
Texas,0.032159,-0.611531,-0.273876
Oregon,0.377326,-0.2719,0.422662


In [353]:
np.abs(df2)

Unnamed: 0,b,d,e
Utah,0.810599,0.296787,0.998881
Ohio,0.357382,1.388603,0.232873
Texas,0.032159,0.611531,0.273876
Oregon,0.377326,0.2719,0.422662


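+ DataFrame's `apply` method applies a function to the one-dimensional arrays formed by each column or row. The *axis* argument names the axis along which the function is applied: `axis=0` passes each column to the function, `axis=1` passes each row.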

In [354]:
f = lambda x:x.max() - x.min()

In [355]:
df2.apply(f, axis=0)  # axis=0 is the default: f is applied to each column, giving per-column max - min

b    0.778441
d    1.116703
e    1.421543
dtype: float64

In [356]:
df2.apply(f, axis=1)

Utah      1.809480
Ohio      1.745985
Texas     0.643689
Oregon    0.694562
dtype: float64

In [357]:
def f(x):
    return pd.Series([x.min(), x.max(), x.max()- x.min()], 
                     index=['min', 'max', 'gap'])

In [358]:
df2.apply(f,axis=0)

Unnamed: 0,b,d,e
min,0.032159,-1.388603,-0.998881
max,0.810599,-0.2719,0.422662
gap,0.778441,1.116703,1.421543


In [359]:
df2.apply(f, axis=1)

Unnamed: 0,min,max,gap
Utah,-0.998881,0.810599,1.80948
Ohio,-1.388603,0.357382,1.745985
Texas,-0.611531,0.032159,0.643689
Oregon,-0.2719,0.422662,0.694562


In [360]:
df2.sum(axis=0)

b    1.577467
d   -2.568820
e   -1.082967
dtype: float64

In [361]:
df2.mean(axis=1)

Utah     -0.161689
Ohio     -0.421364
Texas    -0.284416
Oregon    0.176030
dtype: float64

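+ Many of the most common array statistics, such as `sum` and `mean`, are already implemented as DataFrame methods, so `apply` is not needed for them.
+ Element-wise Python functions can also be used, via `applymap` (recent pandas versions offer `DataFrame.map` as its replacement).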

In [362]:
format = lambda x:'%.2f' % x

In [363]:
df2.applymap(format)

Unnamed: 0,b,d,e
Utah,0.81,-0.3,-1.0
Ohio,0.36,-1.39,-0.23
Texas,0.03,-0.61,-0.27
Oregon,0.38,-0.27,0.42


In [364]:
df2['e'].map(format)

Utah      -1.00
Ohio      -0.23
Texas     -0.27
Oregon     0.42
Name: e, dtype: object

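### Sorting and Ranking
Sorting a dataset by some criterion is a common operation. To sort lexicographically by row or column index, use the `sort_index` method, which returns a new, sorted object.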

In [365]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [366]:
obj

d    0
a    1
b    2
c    3
dtype: int64

In [367]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

+ For a DataFrame, you can sort by the index on either axis.

In [368]:
df6 = pd.DataFrame(np.arange(8).reshape((2, 4)),
                   index=['three', 'one'], 
                   columns=['d', 'a', 'b', 'c'])
df6

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [369]:
df6.sort_index(axis=0)

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [370]:
df6.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


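+ Data is sorted in ascending order by default; pass `ascending=False` to sort in descending order.
+ To sort a Series by its values, use the `sort_values()` method. Missing values are placed at the end of the Series.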
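Where missing values end up can be changed with the `na_position` argument of `sort_values`. A small sketch, using a made-up Series named `values`:

```python
import numpy as np
import pandas as pd

values = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Default: NaN entries go to the end
nan_last = values.sort_values()

# na_position='first' moves them to the start instead
nan_first = values.sort_values(na_position='first')

print(nan_last.index.tolist())
print(nan_first.index.tolist())
```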

In [371]:
df6.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [372]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

In [373]:
obj

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [374]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [375]:
obj.sort_values(ascending=False) # descending order

2    7.0
0    4.0
5    2.0
4   -3.0
1    NaN
3    NaN
dtype: float64

+ For a DataFrame, you can sort by the values in one or more columns by passing one or more column names to the `by` option.

In [376]:
df7 = pd.DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})
df7

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [377]:
df7.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [378]:
df7.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


In [379]:
df7.sort_values(by=['a', 'b'], ascending=False)

Unnamed: 0,b,a
1,7,1
3,2,1
0,4,0
2,-3,0


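+ **Ranking** is closely related to sorting: it assigns each value a rank, from 1 up to the number of valid data points.
+ By default, `rank` breaks ties by assigning each tie group its average rank.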

In [380]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [381]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [382]:
obj.rank(method='first')   # among ties, the value that appears first in the data gets the lower rank

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [383]:
obj.rank(ascending=False, method='max')  # each tie group gets the largest rank in the group

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

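Tie-breaking `method` options for rank

| method | Description |
|-------|-------|
|average|Default: assign the average rank to each value in a tie group|
|min|Use the smallest rank of the tie group|
|max|Use the largest rank of the tie group|
|first|Assign ranks in the order the values appear in the data|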
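To compare the four options side by side, the sketch below ranks one Series with each `method` and collects the results as columns of a DataFrame. The names `scores` and `ranks` are made up for this example.

```python
import pandas as pd

scores = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method
ranks = pd.DataFrame({m: scores.rank(method=m)
                      for m in ['average', 'min', 'max', 'first']})
print(ranks)
```

The two tie groups here are the pair of 7s (which occupy ranks 6 and 7) and the pair of 4s (ranks 4 and 5); the columns differ only in those rows.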

In [4]:
df4 = pd.DataFrame({'b':[4.3, 7, -3, 2], 
                    'a':[0, 1, 0, 1], 
                    'c':[-2, 5, 8, -2.5]})
df4

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [385]:
df4.rank(axis=1)

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


In [386]:
df4.rank(axis=0)

Unnamed: 0,b,a,c
0,3.0,1.5,2.0
1,4.0,3.5,3.0
2,1.0,1.5,4.0
3,2.0,3.5,1.0


In [387]:
df4.rank(method='min', ascending=False)

Unnamed: 0,b,a,c
0,2.0,3.0,3.0
1,1.0,1.0,2.0
2,4.0,3.0,1.0
3,3.0,1.0,4.0


In [5]:
df4

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


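### Axis Indexes with Duplicate Labels
+ Many pandas functions, such as `reindex`, require labels to be unique, but an index itself is not required to be unique.
+ The `is_unique` property of an index tells you whether its values are unique.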
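A short sketch of what the uniqueness requirement means in practice, assuming a made-up Series `dup` with repeated labels: `reindex` raises a `ValueError`, and one possible workaround (keeping the first entry per label, which may or may not suit a given dataset) makes it work again.

```python
import pandas as pd

dup = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# reindex refuses to work on an index with duplicate labels
try:
    dup.reindex(['a', 'b', 'c', 'd'])
    reindex_failed = False
except ValueError:
    reindex_failed = True

# One way to proceed: keep only the first entry for each label
deduped = dup[~dup.index.duplicated(keep='first')]
reindexed = deduped.reindex(['a', 'b', 'c', 'd'])

print(reindex_failed, deduped.index.is_unique)
print(reindexed)
```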

In [388]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [389]:
obj.index.is_unique

False

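+ Indexing a label that maps to several entries returns a Series; a label that maps to a single entry returns a scalar.
+ The same holds when indexing the rows of a DataFrame.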

In [390]:
obj['a']

a    0
a    1
dtype: int64

In [391]:
obj['c']

4

In [392]:
import numpy as np
df = pd.DataFrame(np.random.randn(4, 3), 
                  index=list('aabb'),columns=[1,2,3])
df

Unnamed: 0,1,2,3
a,1.439134,-2.60077,-1.506248
a,1.121155,-1.348865,1.083463
b,-0.797856,1.08189,-0.517823
b,0.542973,0.062357,0.21255


In [393]:
df.iloc[:2,2:]

Unnamed: 0,3
a,-1.506248
a,1.083463


In [394]:
df.loc['a']

Unnamed: 0,1,2,3
a,1.439134,-2.60077,-1.506248
a,1.121155,-1.348865,1.083463


In [395]:
df.loc['a',2]

a   -2.600770
a   -1.348865
Name: 2, dtype: float64

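## 3 Summarizing and Computing Descriptive Statistics
pandas objects come with a set of common mathematical and statistical methods; most of them are reductions or summary statistics.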

In [396]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], 
                   [np.nan, np.nan], [0.75, -1.4]], 
                  index=list('abcd'), 
                  columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.4


In [397]:
df.sum(axis=0)  # returns a Series of column totals

one    9.25
two   -5.90
dtype: float64

In [398]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.65
dtype: float64

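+ NA values are skipped automatically. `sum` treats an all-NA slice (row or column) as 0, as in row `c` above, while `mean` returns NaN for it. Pass `skipna=False` to stop skipping NA values.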
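The different treatments of missing values can be compared directly. The sketch below rebuilds the same small frame under the made-up name `frame` and also uses the `min_count` argument of `sum`, which is not covered in the text above: it sets how many valid values are required before a sum is reported.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                      [np.nan, np.nan], [0.75, -1.4]],
                     index=list('abcd'), columns=['one', 'two'])

default_sum = frame.sum(axis=1)               # all-NA row 'c' sums to 0.0
strict_sum = frame.sum(axis=1, min_count=1)   # need at least one valid value
no_skip = frame.sum(axis=1, skipna=False)     # any NA in the row gives NaN
row_mean = frame.mean(axis=1)                 # mean of an all-NA row is NaN

print(pd.DataFrame({'default': default_sum, 'min_count=1': strict_sum,
                    'skipna=False': no_skip, 'mean': row_mean}))
```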

In [399]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.325
dtype: float64

In [400]:
df.idxmax() # index label of the maximum in each column

one    b
two    d
dtype: object

In [401]:
df.idxmin(axis=1) # for each row, the column label of the minimum

a    one
b    two
c    NaN
d    two
dtype: object

In [402]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.4


In [403]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.9


In [404]:
df
df.describe()

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.4


Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.95
std,3.493685,2.192031
min,0.75,-4.5
25%,1.075,-3.725
50%,1.4,-2.95
75%,4.25,-2.175
max,7.1,-1.4


In [405]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

In [406]:
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [407]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

### 3.1 Correlation and Covariance

In [408]:
stock = pd.read_csv('./examples/stock.dat', sep='\t', index_col=0)

In [1]:
!cat './examples/stock.dat'

Date	AAPL	MSFT	GSPC
2000-01-03	3.625643	39.334630	1455.219971
2000-01-04	3.319964	38.005900	1399.420044
2000-01-05	3.368548	38.406628	1402.109985
2000-01-06	3.077039	37.120080	1403.449951
2000-01-07	3.222794	37.605172	1441.469971
2000-01-10	3.166112	37.879354	1457.599976
2000-01-11	3.004162	36.909170	1438.560059


In [410]:
stock
stock.columns

Unnamed: 0_level_0,AAPL,MSFT,GSPC
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000-01-03,3.625643,39.33463,1455.219971
2000-01-04,3.319964,38.0059,1399.420044
2000-01-05,3.368548,38.406628,1402.109985
2000-01-06,3.077039,37.12008,1403.449951
2000-01-07,3.222794,37.605172,1441.469971
2000-01-10,3.166112,37.879354,1457.599976
2000-01-11,3.004162,36.90917,1438.560059


Index(['AAPL', 'MSFT', 'GSPC'], dtype='object')

In [411]:
stock["AAPL"]

Date
2000-01-03    3.625643
2000-01-04    3.319964
2000-01-05    3.368548
2000-01-06    3.077039
2000-01-07    3.222794
2000-01-10    3.166112
2000-01-11    3.004162
Name: AAPL, dtype: float64

In [412]:
stock.AAPL.corr(stock.MSFT)

0.9788313126479431

In [413]:
stock.AAPL.cov(stock.MSFT)

0.16580021717290455

In [414]:
stock.corr()

Unnamed: 0,AAPL,MSFT,GSPC
AAPL,1.0,0.978831,0.122897
MSFT,0.978831,1.0,0.213535
GSPC,0.122897,0.213535,1.0


In [415]:
stock.cov()

Unnamed: 0,AAPL,MSFT,GSPC
AAPL,0.043003,0.1658,0.657975
MSFT,0.1658,0.6672,4.503164
GSPC,0.657975,4.503164,666.562274


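+ The `corr` method of a Series computes the correlation of the overlapping, non-NA, index-aligned values in two Series. Similarly, `cov` computes the covariance.
+ The `corr` and `cov` methods of a DataFrame return the full correlation or covariance matrix as a DataFrame.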
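
The alignment behaviour described above can be checked on a small case. The following is a minimal sketch using two made-up Series (`s1` and `s2` are illustrative and not part of the stock data): it compares `corr` and `cov` with a manual computation on the labels that both Series share and that have no missing value.

```python
import numpy as np
import pandas as pd

# Two hypothetical Series whose indexes only partly overlap.
s1 = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan], index=list("abcde"))
s2 = pd.Series([2.0, 4.0, 6.5, 8.0, 10.0], index=list("bcdef"))

# corr/cov align on index labels first, then ignore pairs with a missing value.
# Shared labels are b, c, d, e; 'e' is NaN in s1, so only b, c, d are used.
r = s1.corr(s2)
c = s1.cov(s2)

# The same computation done by hand on the aligned, complete pairs.
aligned = pd.concat([s1, s2], axis=1, keys=["s1", "s2"]).dropna()
manual_r = np.corrcoef(aligned["s1"], aligned["s2"])[0, 1]
manual_c = np.cov(aligned["s1"], aligned["s2"])[0, 1]  # ddof=1, as in pandas

print(list(aligned.index))
print(np.isclose(r, manual_r), np.isclose(c, manual_c))
```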

In [416]:
stock.corrwith(stock.AAPL)

AAPL    1.000000
MSFT    0.978831
GSPC    0.122897
dtype: float64

### 3.2 Unique Values, Value Counts, and Membership

In [417]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [418]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

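+ `unique` returns an array of the unique values, in order of first appearance rather than sorted. If needed, the result can be sorted afterwards (`uniques.sort()`).
+ `value_counts` computes how often each value occurs in a Series.
+ `isin` performs a vectorized set-membership test and can be used to select a subset of the data in a Series or a DataFrame column.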
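
As a compact illustration of the three methods, here is a sketch under a few assumptions: the DataFrame `frame` and its column names are made up for this example, and `normalize=True` is used to show relative frequencies instead of raw counts.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# unique() keeps the order of first appearance; sorted() gives a sorted list.
uniques_sorted = sorted(obj.unique())

# normalize=True reports relative frequencies instead of counts.
freq = obj.value_counts(normalize=True)

# isin builds a boolean mask that can also filter the rows of a DataFrame.
frame = pd.DataFrame({'key': obj, 'n': range(len(obj))})  # hypothetical frame
subset = frame[frame['key'].isin(['b', 'c'])]

print(uniques_sorted)
print(freq['a'])
print(subset['n'].tolist())
```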

In [419]:
obj.value_counts()  # sorted by count in descending order by default

c    3
a    3
b    2
d    1
dtype: int64

In [420]:
# value_counts is also a top-level pandas function that works on any array or sequence
pd.value_counts(obj.values, sort=False)

b    2
a    3
c    3
d    1
dtype: int64

In [421]:
mask = obj.isin(['b', 'c'])

In [422]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [423]:
obj[mask]  # boolean indexing

0    c
5    b
6    b
7    c
8    c
dtype: object

+ Applying `value_counts` to a DataFrame

In [424]:
data = pd.DataFrame({'Qu1':[1, 3, 4, 3, 4],
                     'Qu2':[2, 2, 1, 2, 3],
                     'Qu3':[1, 5, 2, 4, 4]})

In [425]:
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,2,5
2,4,1,2
3,3,2,4
4,4,3,4


In [426]:
result = data.apply(pd.value_counts, axis=1)
result

Unnamed: 0,1,2,3,4,5
0,2.0,1.0,,,
1,,1.0,1.0,,1.0
2,1.0,1.0,,1.0,
3,,1.0,1.0,1.0,
4,,,1.0,2.0,


In [427]:
result.fillna(0)

Unnamed: 0,1,2,3,4,5
0,2.0,1.0,0.0,0.0,0.0
1,0.0,1.0,1.0,0.0,1.0
2,1.0,1.0,0.0,1.0,0.0
3,0.0,1.0,1.0,1.0,0.0
4,0.0,0.0,1.0,2.0,0.0
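
The cells above count values within each row (`axis=1`). For comparison, the sketch below counts values within each column instead, which is the default direction of `apply`. The lambda around `Series.value_counts` is a choice made for this example, not something taken from the cells above.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 2, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# apply works column by column by default (axis=0), so each column gets its
# own value counts; values that never occur in a column become NaN, then 0.
per_column = data.apply(lambda col: col.value_counts()).fillna(0).astype(int)
print(per_column)
```

In the result, the row labels are the distinct values found anywhere in the frame, and each cell says how many times that value occurs in the given column.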


## 4 Handling Missing Data

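+ pandas uses NaN to represent missing values in both floating-point and non-floating-point arrays. Python's built-in *None* is also treated as NA.
+ The methods for handling NA values include:
    1. `dropna`: filters out axis labels depending on whether their values contain missing data
    2. `fillna`: fills in missing data with a specified value or with a fill method (such as `ffill` or `bfill`)
    3. `isnull`: returns an object of boolean values indicating which entries are missing
    4. `notnull`: the negation of `isnull`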
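
The four methods can be seen side by side on one small Series. This is a minimal sketch with a made-up Series `s`; the expected results are given in the comments.

```python
import numpy as np
import pandas as pd

# None is stored as NaN when the other values are floats.
s = pd.Series([1.0, np.nan, 3.0, None])

print(s.isnull().tolist())    # [False, True, False, True]
print(s.notnull().tolist())   # [True, False, True, False]
print(s.dropna().tolist())    # [1.0, 3.0]
print(s.fillna(0).tolist())   # [1.0, 0.0, 3.0, 0.0]
print(s.ffill().tolist())     # [1.0, 1.0, 3.0, 3.0]
```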

In [428]:
string_data = pd.Series(['aard', 'arti', np.nan, 'avocdo'])
string_data

0      aard
1      arti
2       NaN
3    avocdo
dtype: object

In [429]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [430]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

### 4.1 Filtering Out Missing Data

In [431]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])

In [432]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [433]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

+ For a DataFrame, `dropna` by default drops every row that contains at least one NA value.

In [434]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], 
                     [np.nan, 6.5, 3.0]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [435]:
cleaned = data.dropna(axis=0)

In [436]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [437]:
data.dropna(how='all')  # drop only the rows in which every value is NA

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [438]:
data.dropna(axis=1,how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [439]:
data.dropna(axis=0, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


+ To keep only the rows that contain at least a given number of observations, use the `thresh` parameter.

In [6]:
df = pd.DataFrame(np.random.randn(7, 3))
df

Unnamed: 0,0,1,2
0,2.149402,1.275334,0.554641
1,0.300925,0.287586,1.066443
2,-0.327141,1.290258,-0.247399
3,0.127946,-0.154292,0.21349
4,0.135564,-0.189261,-0.850779
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305


In [7]:
df.loc[:4, 1] = np.nan
df

Unnamed: 0,0,1,2
0,2.149402,,0.554641
1,0.300925,,1.066443
2,-0.327141,,-0.247399
3,0.127946,,0.21349
4,0.135564,,-0.850779
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305


In [8]:
df.loc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,2.149402,,
1,0.300925,,
2,-0.327141,,
3,0.127946,,0.21349
4,0.135564,,-0.850779
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305


In [443]:
df.dropna(thresh=3)   # keep rows with at least three non-NA values

Unnamed: 0,0,1,2
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305


In [444]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
3,0.127946,,0.21349
4,0.135564,,-0.850779
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305


In [445]:
df.dropna(thresh=4, axis=1)

Unnamed: 0,0,2
0,2.149402,
1,0.300925,
2,-0.327141,
3,0.127946,0.21349
4,0.135564,-0.850779
5,-0.283492,-0.128026
6,0.774722,1.338305
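
Because the frame above is built from random numbers, the effect of `thresh` is easier to verify on fixed data. The following sketch uses a small hypothetical frame `df_t` in which the number of non-NA values per row is known in advance.

```python
import numpy as np
import pandas as pd

df_t = pd.DataFrame({'x': [1.0, 2.0, np.nan, np.nan],
                     'y': [1.0, np.nan, np.nan, 4.0],
                     'z': [1.0, 2.0, np.nan, np.nan]})

# Non-NA values per row: row 0 has 3, row 1 has 2, row 2 has 0, row 3 has 1.
non_na = df_t.notnull().sum(axis=1)
print(non_na.tolist())

kept_3 = df_t.dropna(thresh=3)   # only row 0 survives
kept_2 = df_t.dropna(thresh=2)   # rows 0 and 1 survive
kept_1 = df_t.dropna(thresh=1)   # rows 0, 1, 3: same rows as dropna(how='all')

print(kept_3.index.tolist(), kept_2.index.tolist(), kept_1.index.tolist())
```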


### 4.2 Filling In Missing Data

In [446]:
df.fillna(0)

Unnamed: 0,0,1,2
0,2.149402,0.0,0.0
1,0.300925,0.0,0.0
2,-0.327141,0.0,0.0
3,0.127946,0.0,0.21349
4,0.135564,0.0,-0.850779
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305


In [447]:
df.fillna({1: 0.5, 2: -1})  # replace NA with 0.5 in column 1 and with -1 in column 2

Unnamed: 0,0,1,2
0,2.149402,0.5,-1.0
1,0.300925,0.5,-1.0
2,-0.327141,0.5,-1.0
3,0.127946,0.5,0.21349
4,0.135564,0.5,-0.850779
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305


In [448]:
newdf = df.fillna({1:5,2:10})
newdf

Unnamed: 0,0,1,2
0,2.149402,5.0,10.0
1,0.300925,5.0,10.0
2,-0.327141,5.0,10.0
3,0.127946,5.0,0.21349
4,0.135564,5.0,-0.850779
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305


In [449]:
df

Unnamed: 0,0,1,2
0,2.149402,,
1,0.300925,,
2,-0.327141,,
3,0.127946,,0.21349
4,0.135564,,-0.850779
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305


**fillna** returns a new object by default, but it can also modify the existing object in place.

In [9]:
_ = df.fillna(0, inplace=True)

In [10]:
df

Unnamed: 0,0,1,2
0,2.149402,0.0,0.0
1,0.300925,0.0,0.0
2,-0.327141,0.0,0.0
3,0.127946,0.0,0.21349
4,0.135564,0.0,-0.850779
5,-0.283492,0.631517,-0.128026
6,0.774722,-2.077274,1.338305
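
The difference between the two modes can be made explicit with a short sketch. The frame `frame` is made up for this example; the point is that the default call leaves the original untouched, while `inplace=True` changes it.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})

# Default: a new object is returned and the original keeps its NaN values.
filled = frame.fillna(0)
missing_before = int(frame.isnull().sum().sum())   # still 2
missing_in_copy = int(filled.isnull().sum().sum()) # 0

# inplace=True: the original object itself is modified.
frame.fillna(0, inplace=True)
missing_after = int(frame.isnull().sum().sum())    # now 0

print(missing_before, missing_in_copy, missing_after)
```

Assigning the result of `fillna` back to a name, as in `frame = frame.fillna(0)`, achieves the same effect and is often easier to read in longer pipelines.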


In [11]:
df8 = pd.DataFrame(np.random.randn(6, 3))

In [12]:
df8.loc[2:4, 1] = np.nan

In [13]:
df8.loc[3:4,2] = np.nan

In [14]:
df8

Unnamed: 0,0,1,2
0,-1.488673,-0.281203,-1.721148
1,-1.100588,2.227557,0.277704
2,1.671886,,-0.878848
3,0.455578,,
4,0.311792,,
5,-0.945308,0.672001,-0.173958


In [15]:
df8.bfill()  # backward fill: each NaN takes the next valid value below it

Unnamed: 0,0,1,2
0,-1.488673,-0.281203,-1.721148
1,-1.100588,2.227557,0.277704
2,1.671886,0.672001,-0.878848
3,0.455578,0.672001,-0.173958
4,0.311792,0.672001,-0.173958
5,-0.945308,0.672001,-0.173958


In [16]:
df8.ffill(limit=2)  # forward fill, filling at most two consecutive NaN values

Unnamed: 0,0,1,2
0,-1.488673,-0.281203,-1.721148
1,-1.100588,2.227557,0.277704
2,1.671886,2.227557,-0.878848
3,0.455578,2.227557,-0.878848
4,0.311792,,-0.878848
5,-0.945308,0.672001,-0.173958


In [17]:
data = pd.Series([1., np.nan, 3.5, np.nan, 7])
data.fillna(data.mean())  # fill with the mean of the non-NA values

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
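
The same idea extends to a DataFrame: passing the Series of column means to `fillna` fills each column with its own mean. This is a sketch on a hypothetical frame `df_m`, not on the data used above.

```python
import numpy as np
import pandas as pd

df_m = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, 4.0, 8.0]})

# mean() skips NaN by default: a -> 2.0, b -> 6.0.
col_means = df_m.mean()

# fillna matches the index of col_means against the column labels,
# so every column is filled with its own mean.
filled_m = df_m.fillna(col_means)
print(filled_m)
```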

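## 5 Hierarchical Indexing
Hierarchical indexing lets you have multiple (two or more) index levels on a single axis, which makes it possible to work with higher-dimensional data in a lower-dimensional form.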

In [18]:
data = pd.Series(np.random.randn(10), 
                 index=[['a','a','a','b','b','b','c','c','d','d'],
                        [1,2,3,1,2,3,1,2,2,3]])
data

a  1    0.363552
   2    0.368089
   3    0.013358
b  1    0.875727
   2   -0.472777
   3   -2.133858
c  1   -0.661769
   2   -0.410220
d  2   -0.654665
   3    0.715389
dtype: float64

In [19]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [20]:
data['b']

1    0.875727
2   -0.472777
3   -2.133858
dtype: float64

In [21]:
data['b':'c']

b  1    0.875727
   2   -0.472777
   3   -2.133858
c  1   -0.661769
   2   -0.410220
dtype: float64

In [22]:
data.loc[['b', 'd']]

b  1    0.875727
   2   -0.472777
   3   -2.133858
d  2   -0.654665
   3    0.715389
dtype: float64

In [23]:
data
data[:, 2]  # select by inner-level label

a  1    0.363552
   2    0.368089
   3    0.013358
b  1    0.875727
   2   -0.472777
   3   -2.133858
c  1   -0.661769
   2   -0.410220
d  2   -0.654665
   3    0.715389
dtype: float64

a    0.368089
b   -0.472777
c   -0.410220
d   -0.654665
dtype: float64
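
The selections above rely on partial indexing. The sketch below repeats them on a small deterministic Series (the names `s`, `outer`, and `inner` are made up for this example) and also shows `xs`, which selects by level name and is an alternative way to pick an inner-level label.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(6.0),
              index=pd.MultiIndex.from_arrays(
                  [['a', 'a', 'b', 'b', 'c', 'c'],
                   [1, 2, 1, 2, 1, 2]],
                  names=['outer', 'inner']))

outer_b = s.loc['b']                  # all rows whose outer label is 'b'
inner_2 = s.loc[:, 2]                 # all rows whose inner label is 2
inner_2_xs = s.xs(2, level='inner')   # the same selection, by level name

print(outer_b.tolist())               # [2.0, 3.0]
print(inner_2.tolist())               # [1.0, 3.0, 5.0]
print(inner_2_xs.index.tolist())      # ['a', 'b', 'c']
```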

In [24]:
data.unstack()

Unnamed: 0,1,2,3
a,0.363552,0.368089,0.013358
b,0.875727,-0.472777,-2.133858
c,-0.661769,-0.41022,
d,,-0.654665,0.715389


In [25]:
data.unstack().stack()

a  1    0.363552
   2    0.368089
   3    0.013358
b  1    0.875727
   2   -0.472777
   3   -2.133858
c  1   -0.661769
   2   -0.410220
d  2   -0.654665
   3    0.715389
dtype: float64

In a DataFrame, either axis can have a hierarchical index.

In [26]:
df = pd.DataFrame(np.arange(12.).reshape((4, 3)), 
                  index=[['a', 'a', 'b', 'b'], 
                         [1, 2, 1, 2]], 
                  columns=[['Ohio', 'Ohio', 'Colorado'], 
                           ['Green', 'Red', 'Green']])

In [27]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0.0,1.0,2.0
a,2,3.0,4.0,5.0
b,1,6.0,7.0,8.0
b,2,9.0,10.0,11.0


In [28]:
# each level of the index can be given a name
df.index.names = ['key1', 'key2']
df.columns.names=['state', 'color']

In [29]:
df

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0.0,1.0,2.0
a,2,3.0,4.0,5.0
b,1,6.0,7.0,8.0
b,2,9.0,10.0,11.0


In [30]:
df['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0.0,1.0
a,2,3.0,4.0
b,1,6.0,7.0
b,2,9.0,10.0


In [31]:
df.loc['a']

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,0.0,1.0,2.0
2,3.0,4.0,5.0


In [32]:
df.loc['a', 'Ohio']

color,Green,Red
key2,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.0,1.0
2,3.0,4.0


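### 5.1 Reordering and Sorting Levels
`swaplevel` takes two level numbers or names and returns a new object with those levels interchanged; the data itself is unchanged.

`sort_index(level=...)` sorts the data by the values in a single level.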

In [33]:
df.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0.0,1.0,2.0
2,a,3.0,4.0,5.0
1,b,6.0,7.0,8.0
2,b,9.0,10.0,11.0


In [34]:
df.sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0.0,1.0,2.0
a,2,3.0,4.0,5.0
b,1,6.0,7.0,8.0
b,2,9.0,10.0,11.0


In [35]:
df.swaplevel(0, 1).sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0.0,1.0,2.0
1,b,6.0,7.0,8.0
2,a,3.0,4.0,5.0
2,b,9.0,10.0,11.0


In [36]:
df.sort_index(axis=1, level=1)

Unnamed: 0_level_0,state,Colorado,Ohio,Ohio
Unnamed: 0_level_1,color,Green,Green,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,2.0,0.0,1.0
a,2,5.0,3.0,4.0
b,1,8.0,6.0,7.0
b,2,11.0,9.0,10.0
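
One detail worth checking is that `swaplevel` only relabels the index and does not reorder the rows, so a sort is usually needed afterwards. The sketch below uses a small hypothetical frame `df_s` to show this.

```python
import pandas as pd

df_s = pd.DataFrame({'v': [10, 20, 30, 40]},
                    index=pd.MultiIndex.from_arrays(
                        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                        names=['key1', 'key2']))

# Swapping levels keeps the row order: (1, a), (2, a), (1, b), (2, b).
swapped = df_s.swaplevel('key1', 'key2')
print(swapped.index.is_monotonic_increasing)     # False: not in sorted order

# Sorting on the new outer level (ties broken by the remaining level).
sorted_df = swapped.sort_index(level=0)
print(sorted_df.index.tolist())
print(sorted_df['v'].tolist())
```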


### 5.2 Summary Statistics by Level

In [37]:
df

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0.0,1.0,2.0
a,2,3.0,4.0,5.0
b,1,6.0,7.0,8.0
b,2,9.0,10.0,11.0


In [38]:
df.groupby(level='key2').sum()  # sum the rows that share the same key2 label

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6.0,8.0,10.0
2,12.0,14.0,16.0


In [39]:
df.T.groupby(level='color').sum().T  # sum the columns that share the same color label

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2.0,1.0
a,2,8.0,4.0
b,1,14.0,7.0
b,2,20.0,10.0


### 5.3 Indexing with a DataFrame's Columns

In [40]:
df11 = pd.DataFrame({'a':range(7), 'b':range(7, 0, -1), 
                     'c':['one','one','one','two','two','two','two'], 
                     'd':[0, 1, 2, 0, 1, 2, 3]})
df11

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


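+ `set_index` converts one or more columns into the row index and returns a new DataFrame.
+ `reset_index` does the opposite: the levels of the hierarchical index are moved back into the columns.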

In [41]:
df11.set_index('a')

Unnamed: 0_level_0,b,c,d
a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,7,one,0
1,6,one,1
2,5,one,2
3,4,two,0
4,3,two,1
5,2,two,2
6,1,two,3


In [42]:
df12 = df11.set_index(['c','d'])

In [43]:
df12

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [44]:
df12.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
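
By default `set_index` removes the chosen columns from the frame. The sketch below, on a small hypothetical frame, shows the `drop=False` option that keeps them as columns as well, and checks that `reset_index` restores the original data.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(3),
                      'c': ['one', 'one', 'two'],
                      'd': [0, 1, 0]})

indexed = frame.set_index(['c', 'd'])             # c and d move into the index
kept = frame.set_index(['c', 'd'], drop=False)    # c and d stay as columns too
restored = indexed.reset_index()                  # index levels become columns

print(indexed.columns.tolist())    # ['a']
print(kept.columns.tolist())       # ['a', 'c', 'd']
print(restored.columns.tolist())   # ['c', 'd', 'a']
```

After `reset_index` the former index levels come first, so the column order differs from the original frame even though the data is the same.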


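## Panel Data
pandas also used to provide a third data structure, the **Panel**, a three-dimensional counterpart of the DataFrame. It was deprecated in pandas 0.20 and removed in 0.25; a DataFrame with a hierarchical index covers the same use cases. See the pandas documentation for details.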