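### 1. Reindexing

For a Series, reindex creates a new object whose data conforms to the new index: the values are rearranged to follow the new index, and any index value that was not previously present introduces a missing value (NaN).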

In [3]:
import pandas as pd
import numpy as np

In [2]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [5]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
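As a minimal sketch of the behaviour described above, the following checks that reindex returns a new object and leaves the original untouched. The fill_value argument is not used in the original notes; it is shown here only as one way to substitute a default for newly introduced labels.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# reindex returns a new Series; obj itself is not modified
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2['e'])          # missing, because 'e' was not in the original index
print(list(obj.index))    # the original order is unchanged

# fill_value supplies a default for labels that did not exist before
obj_filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)
print(obj_filled['e'])
```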

The method parameter accepts options such as ffill to fill in missing values while reindexing; ffill propagates the last known value forward.

In [6]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index = [0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [7]:
obj3.reindex(range(6), method = 'ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
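For contrast, here is a small sketch comparing ffill with bfill (backward fill). bfill does not appear in the original notes; it is included only to show that the fill direction determines which labels can remain missing.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill: each new label takes the last known value before it
forward = obj3.reindex(range(6), method='ffill')

# bfill: each new label takes the next known value after it;
# label 5 has nothing after it, so it stays missing
backward = obj3.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```

Both fill methods require the original index to be sorted (monotonic), which is the case here.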

For a DataFrame, reindex can alter the row index, the column index, or both. When only a sequence is passed, the rows are reindexed.

In [13]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index = ['a', 'c', 'd'], columns = ['Ohi', 'Texas', 'California'])
frame

Unnamed: 0,Ohi,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [16]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

Unnamed: 0,Ohi,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


Columns can be reindexed with the columns keyword.

In [19]:
states = ['Texas', 'Utah', 'California']
frame3 = frame.reindex(columns = states)
frame3

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8
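The notes mention that rows and columns can be reindexed together but only show them separately. A minimal sketch of doing both in one call, using the index and columns keywords on a frame like the one above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex rows and columns in a single call; the new row 'b' and the
# new column 'Utah' did not exist before, so they are filled with NaN
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'])
print(both)
```

Because NaN is a floating-point value, integer columns that receive missing entries are converted to float64.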


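Label-based indexing with loc is a more concise alternative, but unlike reindex it requires every requested label to exist (recent pandas versions raise a KeyError for a missing label such as 'Utah').

In [25]:
frame.loc[['a', 'c', 'd'], ['Texas', 'California']]

Unnamed: 0,Texas,California
a,1,2
c,4,5
d,7,8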


### 2. Dropping data

Use the drop method to remove entries.

In [26]:
obj = pd.Series(np.arange(5.0), ['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [27]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [28]:
obj.drop(['c', 'd'])

a    0.0
b    1.0
e    4.0
dtype: float64

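For a DataFrame, drop can remove rows or columns. By default it removes values by row label; pass axis = 1 or axis = 'columns' to remove values from the columns instead.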

In [4]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index = ['Ohio', 'Colorado', 'Utah', 'New York'], columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [30]:
data.drop('Ohio')

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [5]:
data.drop(['one', 'two'], axis = 1)

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7
Utah,10,11
New York,14,15


Setting drop's inplace parameter to True modifies the current object directly instead of producing a new one.

In [35]:
data.drop('one', axis = 1, inplace = True)
data

Unnamed: 0,two,three,four
Ohio,1,2,3
Colorado,5,6,7
Utah,9,10,11
New York,13,14,15
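The difference between the two modes can be sketched as follows. The columns keyword used here is not in the original notes; it is an equivalent spelling of axis = 1 that drop also accepts.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Without inplace, drop returns a new DataFrame and data keeps all columns
trimmed = data.drop(columns=['one', 'two'])   # same as drop([...], axis=1)
print(list(data.columns))

# With inplace=True, data itself is modified and the call returns None
returned = data.drop('one', axis=1, inplace=True)
print(returned)
print(list(data.columns))
```

Since the in-place call returns None, assigning its result back to data would discard the frame, which is a common mistake.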


### 3. Selecting data with loc and iloc

loc: select data by label

In [43]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index = ['Ohio', 'Colorado', 'Utah', 'New York'], columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [41]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

iloc: select data by integer position

In [42]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [44]:
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,4,5
Utah,11,8,9


Besides single labels and lists of labels, both indexers also accept slices.

In [45]:
data.loc[:'Utah', 'two']

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

In [46]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,4,5,6
Utah,8,9,10
New York,12,13,14
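One detail worth spelling out about the slices above: a label slice with loc includes its end point, while a positional slice with iloc excludes it. The sketch below also rewrites the chained boolean filter as a single loc call, which is an alternative formulation rather than something taken from the notes.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc slices by label and includes the end label 'Utah'
by_label = data.loc[:'Utah', 'two']

# iloc slices by position and excludes the end position 2
by_position = data.iloc[:2, 1]

# A boolean mask and a column list can be combined in one loc call
mask = data['three'] > 5
filtered = data.loc[mask, ['one', 'two', 'three']]

print(list(by_label.index))
print(list(by_position.index))
print(filtered)
```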


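### 4. Arithmetic and data alignment

Arithmetic between objects with different indexes returns a result whose index is the union of the two indexes. Where the labels overlap, the operation is applied; where they do not, the result is a missing value (NaN).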
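A minimal sketch of this alignment with two Series; the values and labels are made up for illustration and do not come from the notes.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4.0, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# The result index is the union of both indexes; labels present in
# only one operand ('d', 'f', 'g') become NaN
total = s1 + s2
print(total)
```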


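Arithmetic methods such as add accept a fill_value argument, which is substituted for values that are missing in one of the operands before the operation is applied.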

In [47]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns = list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns = list('abcde'))

In [48]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [49]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [50]:
df1.add(df2, fill_value = 0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


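Arithmetic between a DataFrame and a Series matches the Series index against the DataFrame's columns and broadcasts down the rows. Labels that appear in only one of the two are added by reindexing and produce missing values.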

In [51]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), columns = list("bde"), index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [52]:
series = pd.Series(range(3), index = ['b', 'e', 'f'])
frame + series

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


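To broadcast over the columns instead, matching on the row index, use an arithmetic method and pass axis = 'index'.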

In [54]:
series2 = frame['d']
series2

Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: int32

In [56]:
frame.sub(series2, axis = 'index')

Unnamed: 0,b,d,e
Utah,-1,0,1
Ohio,-1,0,1
Texas,-1,0,1
Oregon,-1,0,1
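The two broadcasting directions can be compared side by side. This is a sketch on the same frame as above; subtracting the first row is an illustrative choice, not part of the original notes.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index is matched against the columns, so the
# first row is subtracted from every row
row = frame.iloc[0]
by_row = frame - row

# axis='index': the Series index is matched against the row labels,
# so column 'd' is subtracted from every column
col = frame['d']
by_col = frame.sub(col, axis='index')

print(by_row)
print(by_col)
```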


### 5. Function application and mapping

NumPy's universal functions (ufuncs) also work on pandas objects.

In [2]:
frame = pd.DataFrame(np.random.randn(4, 3), columns = list('bde'), index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-0.051434,-0.884571,-0.722161
Ohio,-0.992692,0.460437,-0.014335
Texas,1.509635,0.431142,-1.371297
Oregon,1.550647,-1.820741,-0.426747


In [3]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.051434,0.884571,0.722161
Ohio,0.992692,0.460437,0.014335
Texas,1.509635,0.431142,1.371297
Oregon,1.550647,1.820741,0.426747


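The apply method applies a function to the one-dimensional array of each column or row. By default the function is called once per column.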

In [4]:
f = lambda x : x.max() - x.min()
frame.apply(f)

b    2.543339
d    2.281178
e    1.356962
dtype: float64

With axis = 'columns', the function is called once per row instead.

In [5]:
frame.apply(f, axis = 'columns')

Utah      0.833137
Ohio      1.453129
Texas     2.880932
Oregon    3.371387
dtype: float64

The function passed to apply can also return a Series with multiple values.

In [9]:
def f(x):
    return pd.Series([x.min(), x.max()], index = ['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-0.992692,-1.820741,-1.371297
max,1.550647,0.460437,-0.014335


To operate on every element, use the applymap method. Here it formats each floating-point number to two decimal places.

In [14]:
f = lambda x: '%.2f' % x
frame.applymap(f)

Unnamed: 0,b,d,e
Utah,-0.05,-0.88,-0.72
Ohio,-0.99,0.46,-0.01
Texas,1.51,0.43,-1.37
Oregon,1.55,-1.82,-0.43
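In pandas 2.1 and later, DataFrame.applymap is deprecated in favour of DataFrame.map. The sketch below picks whichever is available so that it runs on either side of that change; the sample values are made up, and the hasattr check is only an illustrative compatibility shim.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.234, 2.0], 'd': [-0.5, 3.14159]})

def fmt(x):
    """Format a number with two decimal places."""
    return f'{x:.2f}'

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(fmt)

# A Series has had an element-wise map method all along
formatted_col = frame['d'].map(fmt)

print(formatted)
print(formatted_col.tolist())
```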


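### 6. Sorting and ranking

**To sort by the row or column index, use the sort_index method. It returns a new, sorted object and sorts in ascending order by default.**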

In [15]:
obj = pd.Series(range(4), index = ['d', 'a', 'b', 'c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [16]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

A DataFrame can be sorted by index along either axis.

In [18]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index = ['three', 'one'], columns = ['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [19]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [21]:
# Set axis = 1 to sort by the column index
frame.sort_index(axis = 1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [23]:
# Set ascending = False to sort in descending order
frame.sort_index(axis = 1, ascending = False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


**Use the sort_values method to sort a Series by its values**

In [26]:
# Missing values are placed at the end by default
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
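Two variations on the call above, sketched with the same Series. The ascending and na_position arguments of sort_values are not used in the original notes at this point; they are shown to illustrate how the position of missing values is controlled.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Descending order: NaN still goes to the end by default
descending = obj.sort_values(ascending=False)

# na_position='first' moves the missing values to the front instead
nan_first = obj.sort_values(na_position='first')

print(descending)
print(nan_first)
```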

For a DataFrame, pass one or more column names to the by parameter to use them as sort keys.

In [28]:
frame = pd.DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [29]:
frame.sort_values(by = 'b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [30]:
frame.sort_values(by = ['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
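When several sort keys are given, each can have its own direction. A minimal sketch on the same frame, assuming a list is passed for ascending (this variation is not in the original notes):

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then break ties by 'b' descending
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```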


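**Ranking assigns ranks to the data from smallest to largest. Use the rank method; by default it breaks ties by assigning each tied group the average of the ranks it spans.**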

In [32]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [33]:
# Set method = 'first' to rank tied values in the order they appear in the data
obj.rank(method = 'first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [38]:
# Rank in descending order; method = 'max' gives each tied group its highest rank
obj.rank(ascending = False, method = 'max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
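For comparison, here is a sketch of two further tie-breaking options, 'min' and 'dense', on the same Series. These methods are not covered in the original notes.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min': every member of a tied group gets the group's lowest rank
rank_min = obj.rank(method='min')

# 'dense': like 'min', but the rank increases by exactly 1 between
# groups, so no rank numbers are skipped
rank_dense = obj.rank(method='dense')

print(rank_min.tolist())
print(rank_dense.tolist())
```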

For a DataFrame, ranks can be computed over the rows or over the columns.

In [39]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a' : [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [41]:
frame.rank(axis = 'columns')

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0
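The call above ranks the values within each row. As a sketch of the other direction, calling rank with no axis argument ranks the values within each column, using the same frame:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4.3, 7, -3, 2],
                      'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})

# Default axis=0: each column is ranked independently, top to bottom;
# the tied values in 'a' share the average of the ranks they span
by_column = frame.rank()
print(by_column)
```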
