# Chapter 5: Getting Started with pandas

pandas is typically used together with other tools:
* Numerical computing: NumPy and SciPy
* Analysis libraries: statsmodels and scikit-learn
* Data visualization: matplotlib

pandas vs. NumPy:
* pandas is designed for tabular and heterogeneous data
* NumPy is best suited to homogeneous numerical array data
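
A minimal sketch of the difference, using made-up values (the `city`, `visits`, and `score` column names are illustrative and not from the notes): a NumPy array has a single dtype, while a DataFrame tracks one dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array stores one dtype for every element, so mixing
# integers and floats upcasts the whole array to float64.
arr = np.array([[1, 2.5], [3, 4.5]])
print(arr.dtype)

# A DataFrame keeps a separate dtype per column, which is what
# makes it suitable for tabular, mixed-type data.
mixed = pd.DataFrame({
    "city": ["Oslo", "Lima"],   # strings (illustrative values)
    "visits": [3, 7],           # integers
    "score": [0.5, 1.5],        # floats
})
print(mixed.dtypes)
```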

Unless stated otherwise, DataFrame operations act on columns.

In [25]:
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell 
InteractiveShell.ast_node_interactivity = 'all'  # display every expression in a cell; the default is 'last_expr'

## 5.1 Introduction to pandas Data Structures

### Series

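* A Series is a one-dimensional, array-like object that holds a sequence of values (of any NumPy data type) together with an associated array of data labels, called its index.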

In [26]:
# the simplest Series is built from just a list of data
obj = pd.Series([4,7,5,3])
obj
# get the array representation and the index object
obj.values
obj.index

0    4
1    7
2    5
3    3
dtype: int64

array([4, 7, 5, 3], dtype=int64)

RangeIndex(start=0, stop=4, step=1)

#### Creating a Series with a label for each data point

In [27]:
obj2 = pd.Series([4,7,-5,3],index=['d','b','a','c'])
obj2
obj2.index

d    4
b    7
a   -5
c    3
dtype: int64

Index(['d', 'b', 'a', 'c'], dtype='object')

#### Selecting a single value or a set of values by index label

In [28]:
obj2['a']
obj2['d']
obj2[['c','a','d']]

-5

4

c    3
a   -5
d    4
dtype: int64

#### Using NumPy functions or NumPy-like operations

In [29]:
obj2[obj2>0]
obj2*2
np.exp(obj2)

d    4
b    7
c    3
dtype: int64

d     8
b    14
a   -10
c     6
dtype: int64

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
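
The operations above all preserve the link between each value and its label. As a small sketch of the same idea, a Series can also be treated like a fixed-length, ordered dict: the `in` operator checks the index labels, not the values.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Membership tests look at the index labels, not the values.
has_b = "b" in obj2
has_e = "e" in obj2
print(has_b, has_e)

# Filtering keeps each surviving value attached to its label.
positive = obj2[obj2 > 0]
print(list(positive.index))
```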

#### Creating a Series from a dict

In [30]:
stata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3 = pd.Series(stata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

#### The Series index comes from the dict keys; pass an explicit list of keys as `index` to control the order

In [31]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(stata,index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

#### `isnull` and `notnull` detect missing data

In [32]:
pd.isnull(obj4)
pd.notnull(obj4)
# also available as instance methods
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

#### Arithmetic operations align data by index label automatically, similar to a database join

In [33]:
obj3
obj4
obj3+obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
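
A short sketch of how to inspect the result of such an aligned addition. The variable names `total`, `missing`, and `matched` are introduced here for illustration: labels present in only one operand come out as NaN, and `dropna` keeps the labels both sides share.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

total = obj3 + obj4

# Labels that had no usable value on at least one side.
missing = total[total.isna()].index.tolist()
# Labels where both operands supplied a value.
matched = total.dropna()
print(missing)
print(matched)
```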

#### The `name` attribute of a Series and of its index

In [34]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

#### The index of a Series can be changed in place by assignment

In [35]:
obj
obj.index = ['Bob',"Steve",'Jeff','Ryan']
obj

0    4
1    7
2    5
3    3
dtype: int64

Bob      4
Steve    7
Jeff     5
Ryan     3
dtype: int64

### DataFrame

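A DataFrame is a tabular data structure that contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on).
Although a DataFrame stores its data in a two-dimensional structure, it can still represent higher-dimensional data easily, for example through hierarchical indexing.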

#### Creating a DataFrame

In [36]:
# pass a dict of equal-length lists or NumPy arrays
data = {'state':["Ohio",'Ohio','Ohio','Nevada','Nevada','Necerda'],
       'year':[2000,2001,2002,2001,2002,2003],
       'pop':[1.5,1.7,3.6,2.4,2.9,3.2]}
frame = pd.DataFrame(data)
frame
# select the first five rows
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Necerda,2003,3.2


Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


#### Specifying the column order

In [37]:
pd.DataFrame(data,columns=['year','state','pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Necerda,3.2


#### A requested column that is not found in the data is filled with missing values

In [38]:
frame2 = pd.DataFrame(data,columns=['year','state','pop','dept'],index=['one','two','three','four','five','six'])
frame2
frame2.columns

Unnamed: 0,year,state,pop,dept
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Necerda,3.2,


Index(['year', 'state', 'pop', 'dept'], dtype='object')

#### Retrieving a DataFrame column as a Series

In [39]:
frame2['state']
frame2.state
frame2.year
frame2.columns

one         Ohio
two         Ohio
three       Ohio
four      Nevada
five      Nevada
six      Necerda
Name: state, dtype: object

one         Ohio
two         Ohio
three       Ohio
four      Nevada
five      Nevada
six      Necerda
Name: state, dtype: object

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Index(['year', 'state', 'pop', 'dept'], dtype='object')

#### DataFrame rows can also be retrieved by position or by label

In [40]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
dept      NaN
Name: three, dtype: object
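
The cell above shows only the label-based form. As a sketch on a shortened version of the same frame (three rows instead of six, an assumption made to keep the example small), `iloc` retrieves the same row by integer position:

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002],
     "state": ["Ohio", "Ohio", "Ohio"],
     "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

by_label = frame2.loc["three"]    # row whose index label is 'three'
by_position = frame2.iloc[2]      # third row, counting from zero
print(by_label.equals(by_position))
```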

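#### Columns can be modified by assignment
* Assign a scalar
* Assign an array such as `np.arange()`
* A list or array assigned to a column must have the same length as the DataFrame. A Series is aligned on the index instead, so it need not be the same length; labels it does not cover are filled with missing values.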

In [41]:
frame2.dept = 16.5
frame2

Unnamed: 0,year,state,pop,dept
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Necerda,3.2,16.5


In [42]:
frame2['dept']=np.arange(6.)
frame2

Unnamed: 0,year,state,pop,dept
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Necerda,3.2,5.0


In [43]:
val = pd.Series([-1.2,-1.5,1.7],index=['one','three','five'])
frame2.dept = val
frame2

Unnamed: 0,year,state,pop,dept
one,2000,Ohio,1.5,-1.2
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,-1.5
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,1.7
six,2003,Necerda,3.2,
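
A minimal sketch of the length rule, on a small hypothetical frame (the `dept` column name follows the notes above): a list of the wrong length is rejected, while a shorter Series is accepted because it is aligned by label.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]},
                     index=["one", "two", "three"])

# A plain list must supply exactly one value per row.
try:
    frame["dept"] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)

# A Series is aligned on index labels, so a shorter one is fine;
# rows it does not mention become NaN.
frame["dept"] = pd.Series([-1.2], index=["two"])
print(frame)
```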


#### `del` removes a column

In [44]:
frame2['eastern'] = frame2.state=='Ohio'
frame2

Unnamed: 0,year,state,pop,dept,eastern
one,2000,Ohio,1.5,-1.2,True
two,2001,Ohio,1.7,,True
three,2002,Ohio,3.6,-1.5,True
four,2001,Nevada,2.4,,False
five,2002,Nevada,2.9,1.7,False
six,2003,Necerda,3.2,,False


In [45]:
del frame2.eastern  # attribute syntax cannot delete a column; use del frame2['eastern']
frame2.columns

AttributeError: eastern

In [46]:
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'dept'], dtype='object')

#### Nested dict to DataFrame: outer keys become the columns, inner keys become the index

In [47]:
pop ={'Nevada':{2001:2.4,2002:2.9},'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


#### Transposing a DataFrame

In [48]:
frame3.T
frame3

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [49]:
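# With an explicit index, rows follow that order; otherwise the inner dict keys are combined and sorted to form the index
# Note: passing a plain list raised the AttributeError shown below in the pandas version used here; other releases accept it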
pd.DataFrame(pop, index=[2001, 2002, 2003])

AttributeError: 'list' object has no attribute 'astype'
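
One way to sidestep the error above, sketched under the assumption that the goal is simply to get rows 2001, 2002, and 2003 in that order: build the frame first and then call `reindex`, which does not go through the failing constructor path.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Build from the nested dict, then conform to the wanted row labels.
frame3 = pd.DataFrame(pop).reindex([2001, 2002, 2003])
print(frame3)
```

The label 2003 appears in neither inner dict, so its row is entirely NaN.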

#### The `name` attributes of a DataFrame's index and columns

In [50]:
frame3.index.name = 'year'
frame3.columns.name='state'
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


#### The `values` attribute returns the data as a two-dimensional ndarray

In [51]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

In [52]:
# if the columns have different dtypes, the array's dtype is chosen to accommodate all of them
frame2.values

array([[2000, 'Ohio', 1.5, -1.2],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, -1.5],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, 1.7],
       [2003, 'Necerda', 3.2, nan]], dtype=object)

### Index Objects

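pandas Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted internally to an Index.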

In [53]:
obj = pd.Series(range(3),index=["a",'b','c'])
obj.index
index = obj.index
index[1]
# Index objects are immutable, so this assignment fails
index[1]='data'

Index(['a', 'b', 'c'], dtype='object')

'b'

TypeError: Index does not support mutable operations

#### Immutability makes it safe to share Index objects among data structures:

In [54]:
labels = pd.Index(np.arange(3))
labels
obj2 = pd.Series([1.5,-2.5,0],index=labels)
obj2.index is labels


Int64Index([0, 1, 2], dtype='int64')

True

#### An Index also behaves like a fixed-size set

In [56]:
frame3
frame3.columns

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


Index(['Nevada', 'Ohio'], dtype='object', name='state')
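
A brief sketch of the set-like behavior, rebuilding `frame3` from the nested dict used earlier. The extra labels `'Utah'` and `2005` are illustrative values, not from the notes.

```python
import pandas as pd

frame3 = pd.DataFrame({"Nevada": {2001: 2.4, 2002: 2.9},
                       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}})

# Membership tests work on both axes.
has_ohio = "Ohio" in frame3.columns
has_2003 = 2003 in frame3.index
print(has_ohio, has_2003)

# Set-style methods return new Index objects; the originals are unchanged.
wider = frame3.columns.union(pd.Index(["Utah"]))
common = frame3.index.intersection(pd.Index([2001, 2005]))
print(wider)
print(common)
```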

#### Unlike a Python set, a pandas Index can contain duplicate labels; selecting a duplicated label returns every occurrence

In [57]:
dup_labels = pd.Index(['foo','foo','bar','bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

## 5.2 Essential Functionality

### Reindexing

In [59]:
obj = pd.Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj
obj2 = obj.reindex(['a','b','c','d','e'])
obj
obj2

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [67]:
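# ordered data such as time series may need filling when reindexed: method='ffill' forward-fills, method='bfill' backward-fills
obj3 = pd.Series(['blue','purple','yellow'],index=[0,2,4])
obj3
obj3.reindex(range(10),method='ffill')  # forward fill extends past the last original label
obj3.reindex(range(5),method='bfill')  # backward fill only reaches the largest original label; anything beyond it would be NaN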

0      blue
2    purple
4    yellow
dtype: object

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
6    yellow
7    yellow
8    yellow
9    yellow
dtype: object

0      blue
1    purple
2    purple
3    yellow
4    yellow
dtype: object

#### `reindex` can alter the row index, the columns, or both

In [70]:
frame = pd.DataFrame(np.arange(9).reshape(3,3),index=['a','c','d'],columns=['Ohio','Texas','California'])
frame
frame2 = frame.reindex(['a','b','c','d'])
frame2
states = ['Texas','Utah','California']
frame.reindex(columns=states)
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8
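
The cell above reindexes rows and columns in separate calls. As a sketch, both can be done in a single call with the `index` and `columns` keywords; the variable name `both` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# Conform rows and columns at once; new labels are filled with NaN.
both = frame.reindex(index=["a", "b", "c", "d"],
                     columns=["Texas", "Utah", "California"])
print(both)
print(frame.shape)  # the original frame is left untouched
```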


### Dropping Entries from an Axis

In [72]:
# Series.drop returns a new object; the original is unchanged
obj = pd.Series(np.arange(5.),index=['a','b','c','d','e'])
obj
new_obj = obj.drop('c')
new_obj
obj.drop(['c','d'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

a    0.0
b    1.0
e    4.0
dtype: float64

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [76]:
# a DataFrame can drop labels from either axis; a new object is returned
data = pd.DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data
data.drop(['Colorado', 'Ohio'])  # drop rows (axis 0 is the default)
data.drop('two',axis=1)
data.drop(['two','four'],axis='columns')
data
# with inplace=True the object itself is modified and no new object is returned
data.drop(['two','four'],axis='columns',inplace = True)
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
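
A small sketch contrasting the two modes of `drop`. It uses the `columns=` and `index=` keywords, which are equivalent to passing `axis`; the names `trimmed` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Default: a new frame comes back and `data` keeps all four columns.
trimmed = data.drop(columns=["two", "four"])
print(trimmed.columns.tolist(), data.columns.tolist())

# inplace=True: `data` itself changes and the call returns None.
result = data.drop(index="Ohio", inplace=True)
print(result, data.index.tolist())
```

Because `inplace=True` destroys the dropped data, reassigning the returned frame is often the easier pattern to reason about.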


### Indexing, Selection, and Filtering

#### Series indexing works like NumPy array indexing, except that the index values need not be integers

In [83]:

obj = pd.Series(np.arange(4.),index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [80]:
obj['b']
obj[1]  # positional fallback on a label index; newer pandas deprecates this in favor of obj.iloc[1]

1.0

1.0

In [81]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [84]:
obj[['b','a','d']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [85]:
obj[[1,3]]

b    1.0
d    3.0
dtype: float64

In [86]:
obj[obj<2]

a    0.0
b    1.0
dtype: float64

#### Slicing with labels differs from normal Python slicing: the endpoint is inclusive

In [88]:
obj['b':'c']
obj['b':'c'] = 65
obj

b    1.0
c    2.0
dtype: float64

a     0.0
b    65.0
c    65.0
d     3.0
dtype: float64
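
To make the contrast concrete, here is a sketch that slices the same Series once by label and once by position. The slices are written with `loc` and `iloc` so the intent is explicit.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

by_label = obj.loc["b":"c"]    # label slice: 'c' is included
by_position = obj.iloc[1:2]    # positional slice: position 2 is excluded
print(by_label.index.tolist())
print(by_position.index.tolist())
```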

#### Indexing a DataFrame with a single value or a sequence retrieves one or more columns:

In [89]:
data = pd.DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [91]:
# select columns
data['two']
data[['two','three']]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

Unnamed: 0,two,three
Ohio,1,2
Colorado,5,6
Utah,9,10
New York,13,14


In [93]:
data[:2]  # a slice selects rows
# rows cannot be selected by label this way: data['Ohio'] raises KeyError
data[data['three']>5]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [96]:
data < 5
data[data<5]=0
data

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### Selection with loc and iloc

* For indexing DataFrame rows, pandas provides the special indexing operators `loc` and `iloc`: `loc` selects by axis label, `iloc` by integer position.

In [97]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [98]:
data.loc['Colorado',['two','three']]

two      5
three    6
Name: Colorado, dtype: int32

In [99]:
data.iloc[2,[3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [100]:
data.iloc[2]
data.iloc[[1,2],[3,0,2]]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

Unnamed: 0,four,one,three
Colorado,7,0,6
Utah,11,8,10


#### Both indexers also accept slices, in addition to single labels and lists of labels:

In [101]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [102]:
data.loc[:'Utah','two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [104]:
data.iloc[:,:3][data.three>5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14
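
The chained expression above can also be written as a single `loc` call that takes a boolean row mask and a label slice for the columns. This is a sketch on a freshly built frame; the names `chained` and `single` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Two steps: take the first three columns, then filter the rows.
chained = data.iloc[:, :3][data.three > 5]

# One step: boolean mask for rows, inclusive label slice for columns.
single = data.loc[data["three"] > 5, :"three"]
print(single)
```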


### Integer Indexes

#### For consistency, when an axis index contains integers, selection is always label-oriented. To be explicit, use loc (labels) or iloc (integer positions)

In [110]:
ser = pd.Series(np.arange(3))
ser
ser[-1]  # fails: with an integer index, -1 is looked up as a label, not a position

0    0
1    1
2    2
dtype: int32

KeyError: -1

In [111]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2
ser2[-1]

a    0.0
b    1.0
c    2.0
dtype: float64

2.0

In [115]:
ser[:1]
ser.loc[:1]
ser.iloc[:1]

0    0
dtype: int32

0    0
1    1
dtype: int32

0    0
dtype: int32
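
A sketch of the unambiguous alternatives on the same integer-indexed Series: `iloc` always means position, so a negative index works, and `loc` always means label, so its slice endpoint is inclusive.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3))

last = ser.iloc[-1]          # position-based: the final element
label_slice = ser.loc[:1]    # label-based: labels 0 and 1
pos_slice = ser.iloc[:1]     # position-based: only position 0
print(last, len(label_slice), len(pos_slice))
```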

### Arithmetic and Data Alignment

#### Adding Series with different indexes behaves like an outer join on the index labels

In [116]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a', 'c', 'e', 'f', 'g'])
s2
s1+s2

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

#### For DataFrames, alignment happens on both the rows and the columns

In [117]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),index=['Ohio', 'Texas', 'Colorado'])
df1
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),index=['Utah', 'Ohio', 'Texas','Oregon'])
df2
df1+df2

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


### Arithmetic Methods with Fill Values

In [121]:
df1 = pd.DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd')) 
df1
df2 = pd.DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde')) 
df2
df2.loc[1,'b']=np.nan
df2

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [122]:
df1+df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


#### Use the `add` method of df1, passing df2 and a `fill_value` argument

In [124]:
df1
df2
df1.add(df2,fill_value=0)

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
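
Two related tools, sketched here on a frame that starts at 1 rather than 0 (an assumption made to avoid dividing by zero): each arithmetic method has a reversed counterpart prefixed with `r`, and `reindex` also accepts `fill_value`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1., 13.).reshape(3, 4), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape(4, 5), columns=list("abcde"))

# rdiv flips the operands, so these two lines compute the same thing.
recip_operator = 1 / df1
recip_method = df1.rdiv(1)
print(recip_operator.equals(recip_method))

# Conform df1 to df2's columns, filling the new column with 0 instead of NaN.
widened = df1.reindex(columns=df2.columns, fill_value=0)
print(widened)
```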


### Operations between DataFrame and Series

#### A motivating example

In [126]:
arr = np.arange(12.).reshape(3,4)
arr
arr[0]
arr - arr[0]  # arr[0] is subtracted from every row (broadcasting)

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

array([0., 1., 2., 3.])

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

#### Broadcasting down the rows (the default: the Series index is matched against the columns)

In [127]:
frame = pd.DataFrame(np.arange(12.).reshape(4,3),columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
series

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [128]:
frame-series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [130]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2
frame
frame+series2

b    0
e    1
f    2
dtype: int64

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


#### Broadcasting over the columns: pass axis='index' to match on the row labels

In [131]:
series3 = frame['d']
series3
frame
frame.sub(series3,axis='index')


Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
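
For comparison with the NumPy example earlier in this section, the sketch below checks that `sub(..., axis='index')` matches subtracting a column vector by plain broadcasting. The names `by_column` and `with_numpy` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# pandas: match the Series on the row labels, broadcast across columns.
by_column = frame.sub(frame["d"], axis="index")

# NumPy: reshape the column to (4, 1) so it broadcasts across columns.
with_numpy = frame.values - frame["d"].values[:, np.newaxis]
print(np.array_equal(by_column.values, with_numpy))
```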


### Function Application and Mapping

In [133]:
frame = pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
np.abs(frame)

Unnamed: 0,b,d,e
Utah,-0.712223,-0.673283,-1.109558
Ohio,-0.463025,-0.222069,-1.462721
Texas,-0.340665,-0.300101,0.16065
Oregon,0.21151,-0.126145,-0.853598


Unnamed: 0,b,d,e
Utah,0.712223,0.673283,1.109558
Ohio,0.463025,0.222069,1.462721
Texas,0.340665,0.300101,0.16065
Oregon,0.21151,0.126145,0.853598


#### The DataFrame `apply` method

#### Applied to each column

In [134]:
f = lambda x:x.max()-x.min()
frame.apply(f)

b    0.923734
d    0.547138
e    1.623371
dtype: float64

#### Applied to each row

In [136]:
frame.apply(f,axis='columns')

Utah      0.436274
Ohio      1.240652
Texas     0.501315
Oregon    1.065108
dtype: float64

#### The function passed to `apply` need not return a scalar; it can also return a Series with multiple values:

In [137]:
def f(x):
    return pd.Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-0.712223,-0.673283,-1.462721
max,0.21151,-0.126145,0.16065
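
Two follow-ups, sketched on seeded random data so the run is repeatable (the seed and the variable names are assumptions for illustration): many common reductions are already DataFrame methods, so `apply` is often unnecessary, and element-wise functions can be applied to a single column with `Series.map`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
frame = pd.DataFrame(rng.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# The range of each column, first with apply and then with built-in reductions.
spread_apply = frame.apply(lambda x: x.max() - x.min())
spread_direct = frame.max() - frame.min()
print(np.allclose(spread_apply, spread_direct))

# Element-wise formatting of one column.
formatted = frame["e"].map(lambda x: f"{x:.2f}")
print(formatted)
```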


### Sorting and Ranking: sort_index

In [138]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj
obj.sort_index()
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),index=['three', 'one'],columns=['d', 'a', 'b', 'c'])
frame
frame.sort_index()

d    0
a    1
b    2
c    3
dtype: int64

a    1
b    2
c    3
d    0
dtype: int64

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3
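
The cell above sorts only by row labels. As a sketch of the other common cases, the snippet below sorts by column labels, sorts a Series by value, and computes ranks. The small Series are made-up examples.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])

# Sort by column labels, in descending order.
by_columns = frame.sort_index(axis=1, ascending=False)
print(by_columns.columns.tolist())

# Sort a Series by its values rather than its labels.
by_value = pd.Series([4, 7, -3, 2]).sort_values()
print(by_value.tolist())

# rank() assigns the mean rank to tied values by default.
ranks = pd.Series([7, -5, 7, 4]).rank()
print(ranks.tolist())
```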


### Axis Indexes with Duplicate Labels

#### Series

In [140]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
obj.index.is_unique
obj['a']
obj['c']

a    0
a    1
b    2
b    3
c    4
dtype: int64

False

a    0
a    1
dtype: int64

4

#### DataFrame

In [141]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
df.loc['b']

Unnamed: 0,0,1,2
a,-0.096586,-0.239953,0.07144
a,-1.411557,0.224477,2.090662
b,1.360529,-1.346187,1.906501
b,-0.485359,1.451741,1.212506


Unnamed: 0,0,1,2
b,1.360529,-1.346187,1.906501
b,-0.485359,1.451741,1.212506


## 5.3 Summarizing and Computing Descriptive Statistics

In [142]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]],index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [145]:
df.sum(axis='columns')
df.sum(axis=0)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

one    9.25
two   -5.80
dtype: float64

In [146]:
df.idxmax()

one    b
two    d
dtype: object
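
The reductions above skip missing values by default, which is why row `c` sums to 0. A short sketch of the `skipna` option and of `describe`, which produces several summary statistics at once:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# With skipna=False, any NaN in a row makes that row's sum NaN.
strict = df.sum(axis="columns", skipna=False)
print(strict)

# describe() reports count, mean, std, min, quartiles and max per column.
summary = df.describe()
print(summary)
```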

### Correlation and Covariance
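
The notes give no example for this section. The following is only a minimal sketch on made-up data (`x` and `y` are hypothetical, with `y` exactly twice `x`), using the `corr` and `cov` methods of Series and DataFrame.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 6.0, 8.0])  # y = 2 * x, so perfectly correlated

r = x.corr(y)   # Pearson correlation coefficient
c = x.cov(y)    # sample covariance (divides by n - 1)
print(r, c)

# On a DataFrame, corr() returns the full pairwise correlation matrix.
matrix = pd.DataFrame({"x": x, "y": y}).corr()
print(matrix)
```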

### Unique Values, Value Counts, and Membership

#### unique

In [153]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
obj.value_counts()
pd.value_counts(obj.values,sort=False)

array(['c', 'a', 'd', 'b'], dtype=object)

a    3
c    3
b    2
d    1
dtype: int64

d    1
c    3
a    3
b    2
dtype: int64

In [155]:
obj
mask = obj.isin(['b','c'])
mask
obj[mask]

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

0    c
5    b
6    b
7    c
8    c
dtype: object

## 5.4 Conclusion