# Essential Functionality
Basic operations for working with the data in a Series or DataFrame

---
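## Reindexing a Series    `obj.reindex(list)`
`reindex` creates a new object whose data are rearranged to match a new index.  
Calling `reindex` on a Series reorders its values according to the new index; any index value that is not already present gets a missing value.  
For ordered data such as time series, you may want to interpolate or fill values when reindexing. The `method` option does this; for example, `"ffill"` forward-fills the values: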

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

In [2]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [3]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])

In [4]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [5]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

In [6]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [7]:
obj3.reindex(np.arange(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
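
As a complement to the forward fill above, here is a minimal sketch of backward filling with `method="bfill"`. The variable names `colors` and `backfilled` are made up for this illustration; the data mirror `obj3`.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Backward fill: each new label takes the value of the next existing label.
backfilled = colors.reindex(np.arange(6), method="bfill")
print(backfilled)
```

Label 5 has no later label to borrow from, so it stays missing.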

---
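## Reindexing a DataFrame: `reindex` can alter the (row) index, the columns, or both. When passed only a sequence, it reindexes the rows of the result; the columns can be reindexed with the `columns` keyword.
## `frame.reindex(list)`, `frame.reindex(index=list)`, `frame.reindex(columns=list)`, `frame.reindex(list, axis='columns')`, `frame.loc[list, list]`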

In [8]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),index=["a", "c", "d"],columns=["Ohio", "Texas", "California"])

In [9]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [10]:
frame.reindex(["a", "b", "c", "d", 'e'])

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0
e,,,


In [11]:
frame2 = frame.reindex(index=["a", "b", "c", "d"])

In [12]:
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [13]:
states = ["Texas", "Utah", "California"]

In [14]:
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


### Another way to reindex a DataFrame is to pass the new axis labels as a positional argument and then name the axis to reindex with the `axis` keyword    `frame.reindex(list, axis="columns")`

In [15]:
frame.reindex(states, axis="columns")

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


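### You can also reindex with the `loc` operator, which many users prefer. This works only if all of the new index labels already exist in the DataFrame (whereas `reindex` inserts missing data for new labels)    `frame.loc[list, list]`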

In [16]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [17]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]

Unnamed: 0,California,Texas
a,2,1
d,8,7
c,5,4
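
The difference between the two approaches shows up when a requested label is absent. The sketch below rebuilds the same `frame` and asks for the nonexistent row `"b"`; `reindexed` and `loc_failed` are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# reindex tolerates an unknown label and fills its row with NaN.
reindexed = frame.reindex(["a", "b"])

# loc insists that every requested label already exists.
try:
    frame.loc[["a", "b"]]
    loc_failed = False
except KeyError:
    loc_failed = True

print(reindexed)
print("loc raised KeyError:", loc_failed)
```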


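## `reindex` function arguments

| Argument | Description |
|-------|-------|
| labels | New sequence to use as an index. Can be an Index instance or any other sequence-like Python data structure. An Index is used exactly as is, without any copying |
| index | Use the passed sequence as the new index labels |
| columns | Use the passed sequence as the new column labels |
| axis | The axis to reindex, either `"index"` (rows) or `"columns"`. The default is `"index"`. Alternatively, use `reindex(index=new_labels)` or `reindex(columns=new_labels)` |
| method | Interpolation (fill) method: `"ffill"` fills forward, `"bfill"` fills backward |
| fill_value | Substitute value to use for missing data introduced by reindexing. By default, absent labels get null values in the result |
| limit | When forward filling or backfilling, the maximum size gap (in number of elements) to fill |
| tolerance | When forward filling or backfilling, the maximum size gap (in absolute numeric distance) to fill for inexact matches |
| level | Match a simple Index on the given level of a MultiIndex; otherwise select a subset |
| copy | If `True`, always copy the underlying data even if the new index is equivalent to the old index; if `False`, do not copy the data when the indexes are equivalent |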

In [18]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [19]:
labels = ['a','b','c','d','e']

In [20]:
frame.reindex(labels=labels)

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0
e,,,


In [21]:
frame.reindex(labels=labels, method='ffill')

Unnamed: 0,Ohio,Texas,California
a,0,1,2
b,0,1,2
c,3,4,5
d,6,7,8
e,6,7,8


In [22]:
frame.reindex(labels=labels, fill_value=9.9)

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,9.9,9.9,9.9
c,3.0,4.0,5.0
d,6.0,7.0,8.0
e,9.9,9.9,9.9
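
The `limit` and `tolerance` arguments from the table are not exercised by the cells above. The following is a small sketch of how they behave on an integer index; the Series `readings` and its values are invented for this example.

```python
import pandas as pd

readings = pd.Series([10.0, 20.0, 30.0], index=[0, 5, 10])

# limit=2: forward-fill at most two consecutive new labels after each known one.
limited = readings.reindex(list(range(11)), method="ffill", limit=2)

# tolerance=1: accept the nearest existing label only if it is at most 1 away.
nearby = readings.reindex([1, 4, 7], method="nearest", tolerance=1)

print(limited)
print(nearby)
```

Labels 3 and 4 fall outside the two-element fill limit, and label 7 is two units from its nearest neighbor, so those entries remain missing.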


---
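## Dropping entries from an axis    `obj.drop(label / list of labels)`
Dropping one or more entries from an axis is simple if you already have an index array or list without those entries, since you can use the `reindex` method or `.loc`-based indexing.  
The `drop` method is more direct: it returns a new object with the indicated values deleted from an axis.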

In [23]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])

In [24]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [25]:
new_obj = obj.drop("c")

In [26]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [27]:
obj.drop(["d", "c"])

a    0.0
b    1.0
e    4.0
dtype: float64

## With a DataFrame, index values can be deleted from either axis. Passing the `columns` keyword drops column labels.    `frame.drop(index=list of labels)`, `frame.drop(columns=list of labels)`, `frame.drop(list of labels, axis="columns")`

In [28]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=["Ohio", "Colorado", "Utah", "New York"],
columns=["one", "two", "three", "four"])

In [29]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [30]:
data.drop("Ohio")

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [31]:
data.drop(index=["Colorado", "Ohio"])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [32]:
data.drop(["Colorado", "Ohio"], axis=0)

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


### You can also drop values from the columns by passing `axis=1` (as in NumPy) or `axis="columns"`.    `frame.drop(list of labels, axis="columns")`

In [33]:
data.drop(columns=["two"])

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [34]:
data.drop("two", axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [35]:
data.drop(["two", "four"], axis="columns")

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
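
Two properties of `drop` are worth checking directly: it leaves the original object untouched, and by default it raises a `KeyError` for a label that does not exist. The sketch below uses the `errors="ignore"` option to skip unknown labels; the label `"Nevada"` is deliberately absent and is only an illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# By default, dropping a missing label raises KeyError.
try:
    data.drop(index=["Ohio", "Nevada"])
    default_raised = False
except KeyError:
    default_raised = True

# errors="ignore" drops the labels that exist and silently skips the rest.
trimmed = data.drop(index=["Ohio", "Nevada"], errors="ignore")

print(trimmed)
print("original still has Ohio:", "Ohio" in data.index)
```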


---
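## Indexing, Selection, and Filtering
---
### Indexing, selection, and filtering on a Series    Indexing: `obj[label / list of labels]`, `obj.iloc[position]`, `obj.loc[label / list of labels]`    Slicing: `obj[integer:integer]`, `obj.loc[label:label]`  
### `obj[Boolean array]`
* Series indexing (`obj[...]`) works analogously to NumPy array indexing, except that you can use the Series's index values instead of only integers.  
* In a future version of pandas, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `Series.iloc[pos]`.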

In [36]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

In [37]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [38]:
obj["b"]

np.float64(1.0)

In [39]:
obj[["a", "b"]]

a    0.0
b    1.0
dtype: float64

In [40]:
# obj[1]    # deprecated: positional access with []; use obj.iloc[1]

In [41]:
obj.iloc[1]

np.float64(1.0)

In [42]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [43]:
obj[:4]

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [44]:
obj[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

In [45]:
# obj[[1, 3]]    # deprecated: positional access with []; use obj.iloc[[1, 3]]

In [46]:
obj.iloc[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [47]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

### Selecting index values with the special `loc` operator    `obj.loc[label / list of labels]`

In [48]:
obj.loc[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

In [49]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])

In [50]:
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [51]:
obj1

2    1
0    2
1    3
dtype: int64

In [52]:
obj2

a    1
b    2
c    3
dtype: int64

In [53]:
obj1[[0, 1, 2]]

0    2
1    3
2    1
dtype: int64

In [54]:
# obj2[[0, 1, 2]]    # deprecated positional fallback; not recommended

In [55]:
obj1.loc[[0, 1, 2]]

0    2
1    3
2    1
dtype: int64

In [56]:
# obj2.loc[[0, 1, 2]]    # fails with a KeyError because the index contains no integer labels

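### The `loc` operator indexes exclusively with labels, and the `iloc` operator exclusively with integers. `iloc` works whether or not the index contains integers    `obj.iloc[position / list of positions]`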

In [57]:
obj1.iloc[[0, 1, 2]]

2    1
0    2
1    3
dtype: int64

In [58]:
obj2.iloc[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

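### You can also slice with labels, but unlike normal Python slicing, a `loc` slice includes the endpoint. Assigning through such a slice modifies the corresponding section of the Series.    `obj.loc[label:label] = value`  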

In [59]:
obj2

a    1
b    2
c    3
dtype: int64

In [60]:
obj2.loc["b":"c"]

b    2
c    3
dtype: int64

In [61]:
obj2.loc["b":"c"] = 5

In [62]:
obj2

a    1
b    5
c    5
dtype: int64
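
To make the endpoint rule concrete, this short sketch compares a label slice with a positional slice on a Series like `obj2`. The names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

obj = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = obj.loc["b":"c"]     # label slice: the endpoint "c" is included
by_position = obj.iloc[1:2]     # position slice: stops before position 2

print(list(by_label.index))
print(list(by_position.index))
```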

---
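### Indexing, selection, and filtering on a DataFrame
### Indexing into a DataFrame with a single value or a sequence retrieves one or more columns    `frame[column label / list of column labels]`  
### Data can also be selected with a slice or a Boolean array    `frame[row position:row position]`, `frame[Boolean array]` 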

In [63]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=["Ohio", "Colorado", "Utah", "New York"],
columns=["one", "two", "three", "four"])

In [64]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [65]:
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [66]:
data[["three", "one"]]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [67]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [68]:
data["three"] > 5

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [69]:
data[data["three"] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [70]:
data[data < 5] = 0

In [71]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
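
Assigning through a Boolean DataFrame changes `data` in place. When the original should stay intact, one possible alternative is `DataFrame.mask`, which returns a new object; the sketch below assumes the same starting data, and `cleaned` is a name chosen here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# mask replaces values where the condition is True and returns a new DataFrame.
cleaned = data.mask(data < 5, 0)

print(cleaned)
print(data.loc["Ohio", "two"])   # the original is unchanged
```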


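## Selection on a DataFrame with `loc` and `iloc`    `frame.loc[row label / list of row labels]`, `frame.loc[row label / list of row labels, column label / list of column labels]`  
## `frame.iloc[row position / list of row positions]`, `frame.iloc[row position / list of row positions, column position / list of column positions]`
Like Series, DataFrame has the special attributes `loc` and `iloc` for label-based and integer-based indexing, respectively. Since a DataFrame is two-dimensional, you can select a subset of its rows and columns with NumPy-like notation, using either axis labels (`loc`) or integers (`iloc`).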

In [72]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [73]:
data.loc["Colorado"]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [74]:
data.loc[["Colorado", "New York"]]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
New York,12,13,14,15


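### To select rows and columns at the same time with `loc` or `iloc`, separate the two selections with a comma    `frame.loc[row label / list of row labels, column label / list of column labels]`, `frame.iloc[row position / list of row positions, column position / list of column positions]`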

In [75]:
data.loc["Colorado", ["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int64

In [76]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [77]:
data.iloc[[2, 1]]

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
Colorado,0,5,6,7


In [78]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [79]:
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


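### Besides single labels and lists of labels, both indexing functions also work with slices. `loc` accepts a Boolean Series, whereas `iloc` accepts only a plain Boolean array, not a Boolean Series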

In [80]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [81]:
data.loc[:"Utah", "two"]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [82]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


In [83]:
data.iloc[-1]

one      12
two      13
three    14
four     15
Name: New York, dtype: int64

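## Indexing options with DataFrame

| Type | Notes |
|-------|-------|
| df[column] | Select a single column or a sequence of columns from the DataFrame; special-case conveniences: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame (set values based on some criterion) |
| df.loc[rows] | Select a single row or a subset of rows by label |
| df.loc[:, cols] | Select a single column or a subset of columns by label |
| df.loc[rows, cols] | Select both rows and columns by label |
| df.iloc[rows] | Select a single row or a subset of rows by integer position |
| df.iloc[:, cols] | Select a single column or a subset of columns by integer position |
| df.iloc[rows, cols] | Select both rows and columns by integer position |
| df.at[row, col] | Select a single scalar value by row and column label |
| df.iat[row, col] | Select a single scalar value by row and column position (integers) |
| reindex | Select either rows or columns by labels |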
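
The scalar accessors `at` and `iat` from the table do not appear in the cells of this section. Here is a minimal sketch on the same 4x4 data; `by_label` and `by_position` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

by_label = data.at["Utah", "two"]   # scalar lookup by row and column label
by_position = data.iat[2, 1]        # the same cell by integer position

# Both accessors also support assignment to a single cell.
data.at["Utah", "two"] = 90

print(by_label, by_position, data.iat[2, 1])
```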

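### Working with pandas objects indexed by integers often trips up newcomers, because they behave differently from built-in Python data structures such as lists and tuples.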

In [84]:
ser = pd.Series(np.arange(3.))

In [85]:
# ser[-1]     # wrong: raises a KeyError because -1 is not an index label; use ser.iloc[-1]

In [86]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [87]:
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])

In [88]:
# ser2[-1]    # not recommended: positional access through [] is deprecated; use ser2.iloc[-1]

In [89]:
ser.iloc[-1]

np.float64(2.0)

### Pitfalls with chained indexing

In [90]:
ser[:2]

0    0.0
1    1.0
dtype: float64

In [91]:
data.loc[:, "one"] = 1

In [92]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,1,9,10,11
New York,1,13,14,15


In [93]:
data.iloc[2] = 5

In [94]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,5,5,5,5
New York,1,13,14,15


In [95]:
data.loc[data["four"] > 5] = 3

In [96]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,5,5
New York,3,3,3,3


In [97]:
data.loc[data.three == 5]

Unnamed: 0,one,two,three,four
Utah,5,5,5,5


In [98]:
# data.loc[data.three == 5]["three"] = 6    # wrong: chained assignment modifies a temporary copy; the next cell shows the correct form.

In [99]:
data.loc[data.three == 5, "three"] = 6

In [100]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,6,5
New York,3,3,3,3
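
The following sketch shows why the chained form is a pitfall: the first indexing step returns a temporary copy, so the assignment never reaches `data`. The small frame and the names `after_chained` and `after_loc` are made up for this illustration, and the warning pandas emits for this pattern is suppressed only to keep the output quiet.

```python
import warnings
import pandas as pd

data = pd.DataFrame({"three": [0, 3, 5, 3], "four": [0, 3, 5, 3]},
                    index=["Ohio", "Colorado", "Utah", "New York"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")            # pandas warns about this pattern
    data.loc[data.three == 5]["three"] = 6     # writes to a temporary copy
after_chained = data.loc["Utah", "three"]

# A single loc call with both row and column selectors modifies data itself.
data.loc[data.three == 5, "three"] = 6
after_loc = data.loc["Utah", "three"]

print(after_chained, after_loc)
```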


---
# Arithmetic and Data Alignment

## With Series

In [101]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])

In [102]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=["a", "c", "e", "f", "g"])

In [103]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [104]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [105]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

### With a DataFrame, alignment is performed on both rows and columns

In [106]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),index=["Ohio", "Texas", "Colorado"])

In [107]:
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),index=["Utah", "Ohio", "Texas", "Oregon"])

In [108]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [109]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [110]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [111]:
df1 = pd.DataFrame({"A": [1, 2]})

In [112]:
df2 = pd.DataFrame({"B": [3, 4]})

In [113]:
df1

Unnamed: 0,A
0,1
1,2


In [114]:
df2

Unnamed: 0,B
0,3
1,4


In [115]:
df1 + df2

Unnamed: 0,A,B
0,,
1,,


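## Arithmetic methods with fill values    `df1.add(df2, fill_value=0)`  Note: the fill value is substituted into `df1` and `df2` before the operation is computed, wherever a value is missing in only one of them.
In the example, one value of `df2` is set to NA by assigning `np.nan`.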

In [116]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),columns=list("abcd"))

In [117]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),columns=list("abcde"))

In [118]:
df2.loc[1, "b"] = np.nan

In [119]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [120]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [121]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [122]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
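
One detail of `fill_value` is easy to miss: a position that is missing in both operands stays missing. The sketch below uses two small, invented Series (`left` and `right`) to show the three cases.

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, np.nan], index=["a", "b"])
right = pd.Series([10.0, np.nan, 30.0], index=["a", "b", "c"])

total = left.add(right, fill_value=0)
print(total)
# "a": both present, ordinary addition
# "b": missing on both sides, so the result stays NaN
# "c": absent from left, treated as 0 before adding
```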


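## The table below lists the arithmetic methods of Series and DataFrame. Each has a counterpart starting with the letter r that has its arguments reversed, so the following two statements are equivalent (with the r method, the dividend and divisor swap places)  
`1 / df1`, `df1.rdiv(1)`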

In [123]:
1 / df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [124]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [125]:
df1.div(1)

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


| Method | Description |
|-------|-------|
| add, radd | Addition (+) |
| sub, rsub | Subtraction (-) |
| div, rdiv | Division (/) |
| floordiv, rfloordiv | Floor division (//) |
| mul, rmul | Multiplication (*) |
| pow, rpow | Exponentiation (**) |

In [126]:
df1.rmul(10)

Unnamed: 0,a,b,c,d
0,0.0,10.0,20.0,30.0
1,40.0,50.0,60.0,70.0
2,80.0,90.0,100.0,110.0


In [127]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [128]:
df1.sub(3)

Unnamed: 0,a,b,c,d
0,-3.0,-2.0,-1.0,0.0
1,1.0,2.0,3.0,4.0
2,5.0,6.0,7.0,8.0


In [129]:
df1.rsub(3)

Unnamed: 0,a,b,c,d
0,3.0,2.0,1.0,0.0
1,-1.0,-2.0,-3.0,-4.0
2,-5.0,-6.0,-7.0,-8.0


In [130]:
df1.radd(7)

Unnamed: 0,a,b,c,d
0,7.0,8.0,9.0,10.0
1,11.0,12.0,13.0,14.0
2,15.0,16.0,17.0,18.0
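
As a quick check of the reversed methods, this sketch compares each r method with the operator expression that has the scalar on the left. The frame `df` and the dictionary names are chosen for this example only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1.0, 7.0).reshape((2, 3)), columns=list("abc"))

# Each pair holds (operator form, reversed-method form).
pairs = {
    "rsub": (3 - df, df.rsub(3)),
    "rdiv": (1 / df, df.rdiv(1)),
    "rfloordiv": (7 // df, df.rfloordiv(7)),
    "rpow": (2 ** df, df.rpow(2)),
}
all_equal = {name: a.equals(b) for name, (a, b) in pairs.items()}
print(all_equal)
```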


### When reindexing a Series or DataFrame, you can also specify a different fill value

In [131]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


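## Operations between DataFrame and Series
As with NumPy arrays of different dimensions, arithmetic between a DataFrame and a Series follows well-defined rules.  
When we subtract `arr[0]` from `arr`, the subtraction is performed once for each row. This is called broadcasting, a mechanism that comes from NumPy arrays; operations between a DataFrame and a Series work similarly. By default, the index of the Series is matched against the columns of the DataFrame, and the operation is broadcast down the rows.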

In [132]:
arr = np.arange(12.).reshape((3, 4))

In [133]:
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [134]:
arr[0]

array([0., 1., 2., 3.])

In [135]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [136]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list("bde"),
index=["Utah", "Ohio", "Texas", "Oregon"])

In [137]:
series = frame.iloc[0]

In [138]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [139]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [140]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


## If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union

In [141]:
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])

In [142]:
series2

b    0
e    1
f    2
dtype: int64

In [143]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [144]:
series3 = frame["d"]

In [145]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [146]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

## Broadcasting over the columns, matching on the rows    `frame.sub(series, axis="index")`

In [147]:
frame.sub(series3, axis="index")

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


In [148]:
frame.sub(series3, axis=1)

Unnamed: 0,Ohio,Oregon,Texas,Utah,b,d,e
Utah,,,,,,,
Ohio,,,,,,,
Texas,,,,,,,
Oregon,,,,,,,
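
A common use of the two broadcasting directions is centering data. The sketch below, on the same 4x3 frame, subtracts column means with the default alignment and row means with `axis="index"`; the names `col_centered` and `row_centered` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Default: the Series index (b, d, e) is matched against the columns.
col_centered = frame - frame.mean()

# axis="index": the Series index (Utah, Ohio, ...) is matched against the rows.
row_centered = frame.sub(frame.mean(axis="columns"), axis="index")

print(col_centered)
print(row_centered)
```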


# Function Application and Mapping
### NumPy ufuncs (element-wise array methods) also work with pandas objects

In [149]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),columns=list("bde"),
index=["Utah", "Ohio", "Texas", "Oregon"])

In [150]:
frame

Unnamed: 0,b,d,e
Utah,-0.181471,1.576872,1.42595
Ohio,-2.915544,-1.082412,0.108079
Texas,0.438118,-0.211749,-1.075959
Oregon,-1.271822,-0.386296,-0.054459


In [151]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.181471,1.576872,1.42595
Ohio,2.915544,1.082412,0.108079
Texas,0.438118,0.211749,1.075959
Oregon,1.271822,0.386296,0.054459


### Applying a function to the one-dimensional arrays formed by each column or row    `frame.apply(function)`

In [152]:
def f1(x):
    return x.max() - x.min()

In [153]:
frame.apply(f1)

b    3.353663
d    2.659285
e    2.501908
dtype: float64

### If you pass `axis="columns"` to `apply`, the function is invoked once per row instead; a helpful way to think of this is as "apply across the columns"

In [154]:
frame.apply(f1, axis="columns")

Utah      1.758344
Ohio      3.023624
Texas     1.514077
Oregon    1.217362
dtype: float64

### Many of the most common array statistics (such as `sum` and `mean`) are DataFrame methods, so using `apply` is not necessary.

In [155]:
frame

Unnamed: 0,b,d,e
Utah,-0.181471,1.576872,1.42595
Ohio,-2.915544,-1.082412,0.108079
Texas,0.438118,-0.211749,-1.075959
Oregon,-1.271822,-0.386296,-0.054459


In [156]:
frame.sum()

b   -3.930719
d   -0.103585
e    0.403611
dtype: float64

In [157]:
frame.mean()

b   -0.982680
d   -0.025896
e    0.100903
dtype: float64
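
The same point applies to the `f1` example: the per-column range can be computed from built-in methods without `apply`. This is a small sketch with a seeded random frame (the seed and the names `spread_apply` and `spread_direct` are assumptions made for reproducibility).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# apply calls the function once per column ...
spread_apply = frame.apply(lambda x: x.max() - x.min())

# ... while the built-in reductions give the same result directly.
spread_direct = frame.max() - frame.min()

print(spread_direct)
```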

### The function passed to `apply` need not return a scalar value; it can also return a Series with multiple values

In [158]:
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

In [159]:
frame.apply(f2)

Unnamed: 0,b,d,e
min,-2.915544,-1.082412,-1.075959
max,0.438118,1.576872,1.42595


### Element-wise Python functions can be used, too.    `frame.map(Python function)`

In [160]:
def my_format(x):
    return f"{x:.2f}"

In [161]:
# frame.applymap(my_format)    # applymap is deprecated; use map instead.

In [162]:
frame.map(my_format)

Unnamed: 0,b,d,e
Utah,-0.18,1.58,1.43
Ohio,-2.92,-1.08,0.11
Texas,0.44,-0.21,-1.08
Oregon,-1.27,-0.39,-0.05


In [163]:
frame["e"].map(my_format)

Utah       1.43
Ohio       0.11
Texas     -1.08
Oregon    -0.05
Name: e, dtype: object
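
`DataFrame.map` was added in pandas 2.1, so code that must also run on older versions needs a fallback. One possible way to handle this is sketched below; the small frame and the name `elementwise` are invented for this example.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.234, -5.678], "b": [0.5, 2.0]},
                     index=["r1", "r2"])

def my_format(x):
    return f"{x:.2f}"

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
formatted = elementwise(my_format)

print(formatted)
```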

# Sorting and Ranking
## `obj.sort_index()` returns a new, sorted object. Data are sorted in ascending order by default; pass `ascending=False` to sort in descending order

### Sorting a Series

In [164]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])

In [165]:
obj

d    0
a    1
b    2
c    3
dtype: int64

In [166]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [167]:
obj.sort_index(ascending=False)

d    0
c    3
b    2
a    1
dtype: int64

### Sorting a DataFrame by its row labels or column labels

In [168]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=["three", "one"],
columns=["d", "a", "b", "c"])

In [169]:
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [170]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [171]:
frame.sort_index(axis="columns")

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [172]:
frame.sort_index(axis="columns", ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


### Sorting a Series by its values with `obj.sort_values()`. Missing values are sorted to the end of the Series by default; the `na_position="first"` option sorts them to the start instead.

In [173]:
obj = pd.Series([4, 7, -3, 2])

In [174]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [175]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

In [176]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [177]:
obj.sort_values(na_position="first")

1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64

### Sorting a DataFrame by its values

In [178]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

In [179]:
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [180]:
frame.sort_values("b")

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [181]:
frame.sort_values(["a", "b"])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
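
When sorting by several columns, each column can have its own direction by passing a list to `ascending`. A brief sketch on the same data, with `mixed` as an illustrative name:

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" ascending, then by "b" descending within each group of "a".
mixed = frame.sort_values(["a", "b"], ascending=[True, False])
print(mixed)
```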


### Computing the average rank of a Series with `obj.rank()`, and the average rank in descending order with `obj.rank(ascending=False)`

In [182]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [183]:
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [184]:
obj.sort_values()

1   -5
5    0
4    2
3    4
6    4
0    7
2    7
dtype: int64

In [185]:
obj.sort_values(ascending=False)

0    7
2    7
3    4
6    4
4    2
5    0
1   -5
dtype: int64

In [186]:
obj.rank()    # the two 7s at positions 0 and 2 occupy ranks 6 and 7, so each gets (6+7)/2 = 6.5

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [187]:
obj.rank(method="first")    # ties are ranked in order of appearance: the 7s at positions 0 and 2 get ranks 6 and 7

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [188]:
obj.rank(ascending=False)     # in descending order the two 7s occupy ranks 1 and 2, so each gets (1+2)/2 = 1.5

0    1.5
1    7.0
2    1.5
3    3.5
4    5.0
5    6.0
6    3.5
dtype: float64

### Computing ranks over the rows or the columns of a DataFrame

In [189]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],"c": [-2, 5, 8, -2.5]})

In [190]:
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


#### Ranking within each row with `frame.rank(axis="columns")`

In [191]:
frame.rank(axis="columns")

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


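# Tie-breaking methods with rank

| Method | Description |
|-------|-------|
| 'average' | Default: assign the average rank to each entry in the equal group |
| 'min' | Use the minimum rank for the whole group |
| 'max' | Use the maximum rank for the whole group |
| 'first' | Assign ranks in the order the values appear in the data |
| 'dense' | Like `method='min'`, but ranks always increase by 1 between groups rather than by the number of equal elements in a group |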

In [192]:
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [193]:
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [196]:
obj.rank(method='min')    # the two 4s occupy ranks 4 and 5; both get the minimum, 4

0    6.0
1    1.0
2    6.0
3    4.0
4    3.0
5    2.0
6    4.0
dtype: float64

In [197]:
obj.rank(method='max')    # the two 4s occupy ranks 4 and 5; both get the maximum, 5

0    7.0
1    1.0
2    7.0
3    5.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [198]:
frame.rank(method='min')

Unnamed: 0,b,a,c
0,3.0,1.0,2.0
1,4.0,3.0,3.0
2,1.0,1.0,4.0
3,2.0,3.0,1.0


In [199]:
obj.rank(method='first') 

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [201]:
obj.rank(method='dense')    # ranks are never skipped: even after ties, the next rank follows directly.

0    5.0
1    1.0
2    5.0
3    4.0
4    3.0
5    2.0
6    4.0
dtype: float64

In [202]:
frame.rank(method='dense') 

Unnamed: 0,b,a,c
0,3.0,1.0,2.0
1,4.0,2.0,3.0
2,1.0,1.0,4.0
3,2.0,2.0,1.0
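
To compare the tie-breaking methods at a glance, the sketch below collects the rank of the same Series under each method into one DataFrame. The names `methods` and `ranks` are chosen for this illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ["average", "min", "max", "first", "dense"]

# One column per tie-breaking method, one row per original element.
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

Rows 0 and 2 (the two 7s) and rows 3 and 6 (the two 4s) are where the columns differ.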


# Axis Indexes with Duplicate Labels

In [203]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

In [204]:
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [205]:
obj.index.is_unique

False

In [206]:
obj["a"]

a    0
a    1
dtype: int64

In [207]:
obj["c"]

np.int64(4)

In [208]:
df = pd.DataFrame(np.random.standard_normal((5, 3)),index=["a", "a", "b", "b", "c"])

In [209]:
df

Unnamed: 0,0,1,2
a,1.90762,-0.388657,-1.264979
a,-0.519326,0.217581,0.014612
b,-1.241226,-0.312095,-0.518894
b,-0.157472,0.189208,-0.20606
c,-1.608687,0.047605,0.482271


In [210]:
df.loc["b"]

Unnamed: 0,0,1,2
b,-1.241226,-0.312095,-0.518894
b,-0.157472,0.189208,-0.20606


In [211]:
df.loc["c"]

0   -1.608687
1    0.047605
2    0.482271
Name: c, dtype: float64
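
Because the return type depends on whether a label is repeated, it can help to detect and, if appropriate, remove duplicates before indexing. The following is a minimal sketch using `Index.duplicated`; keeping the first occurrence is an assumption made for this example, not a general recommendation.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# A repeated label yields a Series, a unique label yields a scalar.
repeated_is_series = isinstance(obj["a"], pd.Series)
unique_is_series = isinstance(obj["c"], pd.Series)

# Mark every occurrence of a label after its first one.
dup_mask = obj.index.duplicated(keep="first")

# Keep only the first entry for each label.
deduped = obj[~dup_mask]

print(dup_mask)
print(deduped)
```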