# pandas

pandas has two main data structures: Series and DataFrame.

## Series
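A Series is a one-dimensional, array-like object made up of a sequence of values and an associated sequence of labels (its index). Create one with the pd.Series constructor.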

In [55]:
import numpy as np
import pandas as pd
obj = pd.Series([2,3,4,5,7])
obj

0    2
1    3
2    4
3    5
4    7
dtype: int64

In [56]:
obj.values

array([2, 3, 4, 5, 7], dtype=int64)

In [57]:
obj.index

RangeIndex(start=0, stop=5, step=1)

In [58]:
obj2 = pd.Series([2,3,4,1,5], index=['a', 'b', 'c', 'd', 'e'])
obj2

a    2
b    3
c    4
d    1
e    5
dtype: int64

In [59]:
obj2.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [60]:
obj2['a']

2

In [61]:
obj2[['a', 'c', 'e']] # select several values at once

a    2
c    4
e    5
dtype: int64

In [62]:
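# A Series also supports NumPy-style array operations (boolean filtering, scalar arithmetic, element-wise functions, and so on), and the result keeps each value attached to its index label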
obj2[obj2 > 2]

b    3
c    4
e    5
dtype: int64

In [63]:
obj2*3

a     6
b     9
c    12
d     3
e    15
dtype: int64

In [64]:
np.exp(obj2)

a      7.389056
b     20.085537
c     54.598150
d      2.718282
e    148.413159
dtype: float64

A Series can also be thought of as a fixed-length dict, because it maps index labels to values. Many idioms that work on a Python dict work on a Series too.

In [65]:
'c' in obj2

True

A Series can also be created directly from a Python dict; the dict keys become the index labels.

In [66]:
dic_data = {'name': 'john', 'sex': 'man', 'phone': '123456'}
obj3 = pd.Series(dic_data)
obj3

name       john
sex         man
phone    123456
dtype: object
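
When a dict is passed together with an explicit index, pandas looks each label up in the dict. The sketch below uses made-up data (the `population` dict and its values are hypothetical) to show that a label with no matching key ends up as NaN.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
population = {'beijing': 2154, 'shanghai': 2424, 'shenzhen': 1303}

# Labels listed in `index` are looked up in the dict. 'chengdu' is not a key,
# so its value becomes NaN and the dtype is upcast to float64.
s = pd.Series(population, index=['shanghai', 'beijing', 'chengdu'])

print(s)
print(s.isnull())
```

Keys that are not listed in the index ('shenzhen' here) are simply left out of the result.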

One of the most important features of a Series is that **it automatically aligns data by index label in arithmetic operations**.

In [67]:
obj3 = pd.Series([3,4,-1], index=['b','d','g'])
obj3

b    3
d    4
g   -1
dtype: int64

In [68]:
obj2 + obj3

a    NaN
b    6.0
c    NaN
d    5.0
e    NaN
g    NaN
dtype: float64
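
If NaN is not the result you want for labels that appear on only one side, the add method accepts a fill_value. A minimal sketch that rebuilds the two Series from the cells above:

```python
import pandas as pd

obj2 = pd.Series([2, 3, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])
obj3 = pd.Series([3, 4, -1], index=['b', 'd', 'g'])

# Treat a label that is missing from one side as 0 instead of producing NaN.
total = obj2.add(obj3, fill_value=0)
print(total)
```

The result still covers the union of both indexes, but every label now has a value.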

In [69]:
obj3.index=['1','2','3'] # assigning to index replaces the labels in place

In [70]:
obj3

1    3
2    4
3   -1
dtype: int64

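## DataFrame

A DataFrame is a **tabular data structure**. It holds an ordered collection of columns, and each column can have a different value type (boolean, integer, string, and so on). A DataFrame has both a row index and a column index, and it can be viewed as a dict of Series in which every column is a Series and all columns share the same index.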

There are many ways to create a DataFrame. The most common is to pass a dict of equal-length lists or NumPy arrays.

In [71]:
data= {'name': ['john', 'jack', 'lily', 'tom', 'lucy'],
      'age': [12,20,48,23,19],
       'sex': ['male', 'male', 'female', 'male', 'female']}
frame = pd.DataFrame(data)
frame

Unnamed: 0,name,age,sex
0,john,12,male
1,jack,20,male
2,lily,48,female
3,tom,23,male
4,lucy,19,female


In [72]:
frame1 = pd.DataFrame(data, columns=['age', 'name', 'sex'])
frame1

Unnamed: 0,age,name,sex
0,12,john,male
1,20,jack,male
2,48,lily,female
3,23,tom,male
4,19,lucy,female


In [73]:
frame2 = pd.DataFrame(data, columns=['age', 'sex', 'name', 'address'], index=['a','b','c','d','e'])
frame2 # 'address' is not a key in data, so that column is filled with NaN

Unnamed: 0,age,sex,name,address
a,12,male,john,
b,20,male,jack,
c,48,female,lily,
d,23,male,tom,
e,19,female,lucy,


In [74]:
type(frame2['age'])

pandas.core.series.Series

In [75]:
frame2['age'] = 30
frame2

Unnamed: 0,age,sex,name,address
a,30,male,john,
b,30,male,jack,
c,30,female,lily,
d,30,male,tom,
e,30,female,lucy,


In [76]:
frame2['age'] = range(10, 15)

In [77]:
frame2

Unnamed: 0,age,sex,name,address
a,10,male,john,
b,11,male,jack,
c,12,female,lily,
d,13,male,tom,
e,14,female,lucy,


In [78]:
val = pd.Series([12, 34, 2], index=['c','a','d'])
frame2.age = val

In [79]:
frame2

Unnamed: 0,age,sex,name,address
a,34.0,male,john,
b,,male,jack,
c,12.0,female,lily,
d,2.0,male,tom,
e,,female,lucy,


In [80]:
del frame2['address']

In [81]:
frame2

Unnamed: 0,age,sex,name
a,34.0,male,john
b,,male,jack
c,12.0,female,lily
d,2.0,male,tom
e,,female,lucy


A DataFrame can also be built from a two-dimensional list or a NumPy ndarray; row labels and column names can be passed at construction time.

In [82]:
frame3 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'], index = [1, 2])
frame3

Unnamed: 0,a,b,c
1,1,2,3
2,4,5,6


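### Summary of DataFrame construction methods

- 2D ndarray: a matrix of data; row labels and column names can be passed as well
- Dict of arrays, lists, or tuples: each sequence becomes a column, and all sequences must have the same length
- Nested dict: the outer keys become column names, the inner keys become row labels
- Another DataFrame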
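
The nested-dict form is the only one in the list that is not demonstrated in the cells above. A small sketch with made-up numbers (the `nested` dict is hypothetical):

```python
import pandas as pd

# Hypothetical yearly figures, used only for illustration.
nested = {
    'beijing': {2019: 1.0, 2020: 1.2},
    'shanghai': {2018: 0.8, 2019: 0.9, 2020: 1.1},
}

# Outer keys become columns, inner keys become row labels.
# The rows are the union of all inner keys; cells with no data are NaN.
frame = pd.DataFrame(nested)
print(frame)
```

Here 'beijing' has no entry for 2018, so that cell is NaN.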

As with a Series, the values attribute returns the data in a DataFrame as an ndarray.

In [83]:
frame2.values

array([[34.0, 'male', 'john'],
       [nan, 'male', 'jack'],
       [12.0, 'female', 'lily'],
       [2.0, 'male', 'tom'],
       [nan, 'female', 'lucy']], dtype=object)

In [84]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [85]:
obj.index

Index(['a', 'b', 'c'], dtype='object')

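Individual labels of a Series or DataFrame index cannot be modified in place, because Index objects are immutable. To change the labels, assign a whole new index.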

In [86]:
obj.index=['f', 'e', 'd']

In [87]:
obj

f    0
e    1
d    2
dtype: int64

### Basic functionality

#### Reindexing

In [88]:
obj=pd.Series(['2.3', '1.2', '1.3', '4.6'], index = ['a','b','c','d'])

In [89]:
obj

a    2.3
b    1.2
c    1.3
d    4.6
dtype: object

In [90]:
obj2 = obj.reindex(['a','b','c','d','e'])

In [91]:
obj2

a    2.3
b    1.2
c    1.3
d    4.6
e    NaN
dtype: object

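With reindex, any label that is not present in the original index gets NaN; a replacement value can be supplied with fill_value.
For ordered data such as time series, you may want to fill or interpolate values when reindexing.
The method option of reindex does this.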

In [92]:
obj.reindex(['a','b','c','d','e'], fill_value=0)

a    2.3
b    1.2
c    1.3
d    4.6
e      0
dtype: object

In [93]:
obj = pd.Series(['john', 'mike', 'jim'], index=[0,2,4])
obj

0    john
2    mike
4     jim
dtype: object

In [94]:
obj.reindex(range(8), method='ffill') # ffill = forward fill, bfill = backward fill

0    john
1    john
2    mike
3    mike
4     jim
5     jim
6     jim
7     jim
dtype: object
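
The same idea applies to the time-series case mentioned earlier. The sketch below uses hypothetical readings taken every other day and reindexes them to a daily range, once with ffill and once with bfill:

```python
import pandas as pd

# Hypothetical readings taken every other day.
dates = pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-05'])
ts = pd.Series([10.0, 12.0, 11.0], index=dates)

# Reindex to a daily range. With ffill each new day takes the last known
# value; with bfill it takes the next known value.
daily = pd.date_range('2021-01-01', '2021-01-06', freq='D')
filled_forward = ts.reindex(daily, method='ffill')
filled_backward = ts.reindex(daily, method='bfill')

print(filled_forward)
print(filled_backward)
```

With bfill the last day stays NaN, because there is no later observation to copy from.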

In [95]:
df = pd.DataFrame(np.arange(1, 10).reshape((3,3)), index=['d', 'a', 'c'], columns=['john', 'mike', 'jim'])
df

Unnamed: 0,john,mike,jim
d,1,2,3
a,4,5,6
c,7,8,9


In [96]:
df.reindex(index=['a','b','c','d'], columns=['mike','john','jim'])

Unnamed: 0,mike,john,jim
a,5.0,4.0,6.0
b,,,
c,8.0,7.0,9.0
d,2.0,1.0,3.0


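#### Dropping entries from an axis

Dropping one or more entries from an axis is easy: all you need is an array or list of index labels.

Use the drop method to discard the entries.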

In [97]:
obj = pd.Series(np.arange(5), index=list('abcde'))
obj

a    0
b    1
c    2
d    3
e    4
dtype: int32

In [98]:
obj.drop('c')

a    0
b    1
d    3
e    4
dtype: int32

Note: drop returns a new object; the original data is not modified.

In [99]:
obj

a    0
b    1
c    2
d    3
e    4
dtype: int32

For a DataFrame, labels can be dropped from either axis.

In [100]:
obj = pd.DataFrame(np.arange(25).reshape((5,5)), index=list('12345'), columns=list('abcde'))
obj

Unnamed: 0,a,b,c,d,e
1,0,1,2,3,4
2,5,6,7,8,9
3,10,11,12,13,14
4,15,16,17,18,19
5,20,21,22,23,24


In [101]:
obj.drop(['2', '4'])

Unnamed: 0,a,b,c,d,e
1,0,1,2,3,4
3,10,11,12,13,14
5,20,21,22,23,24


In [102]:
obj.drop(['a', 'c'], axis=1) # drop removes row labels by default (axis=0); pass axis=1 to drop columns

Unnamed: 0,b,d,e
1,1,3,4
2,6,8,9
3,11,13,14
4,16,18,19
5,21,23,24
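
As an alternative to the axis argument, drop also accepts index= and columns= keywords, which name the axis explicitly. A short sketch that rebuilds the same 5x5 frame under the name `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)),
                  index=list('12345'), columns=list('abcde'))

# `index=` and `columns=` name the axis explicitly, so no `axis` argument
# is needed and rows and columns can be dropped in one call.
trimmed = df.drop(index=['2', '4'], columns=['a', 'c'])
print(trimmed)

# The original is untouched because drop returns a new object.
print(df.shape)
```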


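## Indexing, selection, and filtering

Indexing a Series or DataFrame works much like indexing a NumPy ndarray, except that the index values need not be integers: you can also select by row label and by column name.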

In [103]:
obj

Unnamed: 0,a,b,c,d,e
1,0,1,2,3,4
2,5,6,7,8,9
3,10,11,12,13,14
4,15,16,17,18,19
5,20,21,22,23,24


In [104]:
obj[['a', 'c']]

Unnamed: 0,a,c
1,0,2
2,5,7
3,10,12
4,15,17
5,20,22


In [105]:
obj[1:4]

Unnamed: 0,a,b,c,d,e
2,5,6,7,8,9
3,10,11,12,13,14
4,15,16,17,18,19


In [106]:
obj[obj['b']>10]

Unnamed: 0,a,b,c,d,e
3,10,11,12,13,14
4,15,16,17,18,19
5,20,21,22,23,24
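
The cells above all use plain square brackets. When rows and columns need to be selected together, the loc (label-based) and iloc (position-based) indexers are usually clearer. A minimal sketch on the same 5x5 frame, rebuilt here as `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)),
                  index=list('12345'), columns=list('abcde'))

# Label-based: rows '2' through '4' (end label included), columns 'a' and 'c'.
by_label = df.loc['2':'4', ['a', 'c']]

# Position-based: rows 1..3 (end position excluded), columns 0 and 2.
by_position = df.iloc[1:4, [0, 2]]

# A boolean mask combined with a column selection.
filtered = df.loc[df['b'] > 10, ['b', 'e']]

print(by_label)
print(filtered)
```

Note the difference in slicing: a label slice with loc includes its end point, while a positional slice with iloc does not.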


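### Arithmetic and data alignment

An important pandas feature is arithmetic between objects with different indexes. When two such objects are added, the index of the result is the union of the two indexes, and any label that does not appear in both objects gets NaN.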

In [107]:
df1 = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('abc'), index=[1,2,3])
df1

Unnamed: 0,a,b,c
1,0,1,2
2,3,4,5
3,6,7,8


In [108]:
df2 = pd.DataFrame(np.arange(16).reshape((4,4)), columns=list('bcde'), index=[2,3,4,5])
df2

Unnamed: 0,b,c,d,e
2,0,1,2,3
3,4,5,6,7
4,8,9,10,11
5,12,13,14,15


In [109]:
df1+df2

Unnamed: 0,a,b,c,d,e
1,,,,,
2,,4.0,6.0,,
3,,11.0,13.0,,
4,,,,,
5,,,,,


In [110]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
1,0.0,1.0,2.0,,
2,3.0,4.0,6.0,2.0,3.0
3,6.0,11.0,13.0,6.0,7.0
4,,8.0,9.0,10.0,11.0
5,,12.0,13.0,14.0,15.0


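Note: with fill_value, a cell that exists in only one of the two DataFrames is treated as fill_value in the other before the operation is applied. A cell that is missing from both DataFrames is still NaN in the result.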

A DataFrame supports addition, subtraction, multiplication, and division, either with the + - * / operators or with the corresponding add, sub, mul, and div methods.
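
The sketch below checks both points on the two frames used above: the operator and method forms agree, and fill_value only helps where at least one side has a value.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=list('abc'), index=[1, 2, 3])
df2 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                   columns=list('bcde'), index=[2, 3, 4, 5])

# Operator and method forms agree when no fill_value is given.
same = (df1 - df2).equals(df1.sub(df2))

# With fill_value, a cell present in only one frame is combined with 1...
product = df1.mul(df2, fill_value=1)

# ...but a cell missing from both frames (row 1, column 'd') stays NaN.
both_missing = product.loc[1, 'd']

print(same)
print(product)
```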

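### Operations between DataFrame and Series

Arithmetic between a DataFrame and a Series relies on **broadcasting**: by default the Series index is matched against the DataFrame's columns, and the operation is repeated for every row.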

In [111]:
df1 = pd.DataFrame(np.arange(12).reshape((4,3)), columns=list('abc'))
df1

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11


In [112]:
series1 = pd.Series([3,4,5], index=['a','b','c'])
series1

a    3
b    4
c    5
dtype: int64

In [113]:
df1 - series1

Unnamed: 0,a,b,c
0,-3,-3,-3
1,0,0,0
2,3,3,3
3,6,6,6


In [114]:
series2 = df1['b']
series2

0     1
1     4
2     7
3    10
Name: b, dtype: int32

In [115]:
df1.sub(series2, axis=0)

Unnamed: 0,a,b,c
0,-1,0,1
1,-1,0,1
2,-1,0,1
3,-1,0,1
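
A common use of this broadcasting is centering data. The sketch below, on a frame built like df1 but named `df` here, subtracts column means with the default alignment and row means with axis=0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('abc'))

# Default: the Series index is matched against the columns and the
# subtraction is repeated for every row, so each column is centered.
col_centered = df - df.mean()

# axis=0 matches the Series index against the row index instead,
# so each row is centered on its own mean.
row_centered = df.sub(df.mean(axis=1), axis=0)

print(col_centered)
print(row_centered)
```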


### Function application and mapping

NumPy element-wise functions (ufuncs) can be applied directly to a DataFrame.

In [116]:
np.square(df1)

Unnamed: 0,a,b,c
0,0,1,4
1,9,16,25
2,36,49,64
3,81,100,121


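**Important**: another common DataFrame operation is applying a function to each row or each column with apply. The result is a new Series or DataFrame, depending on what the function returns.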

In [117]:
def fun(x):
    return x.max() - x.min()

In [118]:
df1.apply(fun, axis=1)

0    2
1    2
2    2
3    2
dtype: int64

In [119]:
df1.max(axis=1)-df1.min(axis=1)

0    2
1    2
2    2
3    2
dtype: int32

Min-max normalization rescales each value as x = (x - min) / (max - min):

In [121]:
def fun(x):
    return (x - x.min()) / (x.max() - x.min())
df1.apply(fun, axis=1)

Unnamed: 0,a,b,c
0,0.0,0.5,1.0
1,0.0,0.5,1.0
2,0.0,0.5,1.0
3,0.0,0.5,1.0
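
The cell above normalizes each row. Normalizing each column is usually what is wanted for feature scaling; the sketch below does it with axis=0 and compares the result with an equivalent vectorized expression. The names `min_max` and `df` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('abc'))

def min_max(x):
    # With axis=0, x is one column of the frame (a Series).
    return (x - x.min()) / (x.max() - x.min())

by_column = df.apply(min_max, axis=0)

# Equivalent vectorized form: df.min() and df.max() are broadcast
# across the rows, column by column.
vectorized = (df - df.min()) / (df.max() - df.min())

print(by_column)
```

The vectorized form avoids calling a Python function per column, which tends to matter on wide frames.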


### Sorting

Sorting a dataset is a common need, and pandas provides functions for it.

The two main ones are sort_index() and sort_values(): the former sorts by row or column labels, the latter sorts by values.

In [122]:
df1 = pd.DataFrame(np.random.randn(4,4), columns=list('bcad'), index=[2,4,3,1])
df1

Unnamed: 0,b,c,a,d
2,-0.772197,-1.271328,0.43888,-0.480357
4,0.211071,0.531291,0.57384,-0.782769
3,-0.950552,-0.323276,-1.769132,-1.65602
1,0.404893,0.578318,2.255399,0.771602


In [123]:
df1.sort_index()

Unnamed: 0,b,c,a,d
1,0.404893,0.578318,2.255399,0.771602
2,-0.772197,-1.271328,0.43888,-0.480357
3,-0.950552,-0.323276,-1.769132,-1.65602
4,0.211071,0.531291,0.57384,-0.782769


In [124]:
df1.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
2,-0.480357,-1.271328,-0.772197,0.43888
4,-0.782769,0.531291,0.211071,0.57384
3,-1.65602,-0.323276,-0.950552,-1.769132
1,0.771602,0.578318,0.404893,2.255399


In [126]:
df1.sort_values(by='b')

Unnamed: 0,b,c,a,d
3,-0.950552,-0.323276,-1.769132,-1.65602
2,-0.772197,-1.271328,0.43888,-0.480357
4,0.211071,0.531291,0.57384,-0.782769
1,0.404893,0.578318,2.255399,0.771602


In [127]:
df1.sort_values(by=['b', 'a']) # ties in b are broken by a

Unnamed: 0,b,c,a,d
3,-0.950552,-0.323276,-1.769132,-1.65602
2,-0.772197,-1.271328,0.43888,-0.480357
4,0.211071,0.531291,0.57384,-0.782769
1,0.404893,0.578318,2.255399,0.771602
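
With random floats there are no ties in b, so the second key has no visible effect in the output above. The sketch below uses a small hypothetical frame with repeated values in b to make the tie-break visible, and also sorts the two keys in different directions:

```python
import pandas as pd

# Hypothetical data with repeated values in 'b' so the tie-break is visible.
df = pd.DataFrame({'b': [2, 1, 2, 1], 'a': [4, 3, 1, 9]},
                  index=list('wxyz'))

# Sort by 'b' ascending, then by 'a' descending within equal 'b'.
ordered = df.sort_values(by=['b', 'a'], ascending=[True, False])
print(ordered)
```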


## Statistics

In [128]:
df1.describe()

Unnamed: 0,b,c,a,d
count,4.0,4.0,4.0,4.0
mean,-0.276696,-0.121249,0.374747,-0.536886
std,0.683638,0.871531,1.650941,1.004698
min,-0.950552,-1.271328,-1.769132,-1.65602
25%,-0.816786,-0.560289,-0.113123,-1.001082
50%,-0.280563,0.104007,0.50636,-0.631563
75%,0.259526,0.543047,0.99423,-0.167367
max,0.404893,0.578318,2.255399,0.771602
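
describe() reports several statistics at once; each one is also available as its own method. A brief sketch on hypothetical values that are easy to check by hand:

```python
import pandas as pd

# Hypothetical values, chosen so the results are easy to verify.
df = pd.DataFrame({'a': [1.0, 4.0, 7.0], 'b': [8.0, 5.0, 2.0]},
                  index=['x', 'y', 'z'])

col_sum = df.sum()            # one value per column
row_mean = df.mean(axis=1)    # one value per row
top_label = df.idxmax()       # row label of each column's maximum

print(col_sum)
print(row_mean)
print(top_label)
```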


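### Handling missing data

- dropna: drop entries that contain NaN
- fillna: fill missing values with a given value
- isnull: return a boolean object marking which values are NaN
- notnull: the negation of isnull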
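
The four methods listed above can be sketched on a small frame with gaps. The data and the fill values below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
df = pd.DataFrame({'age': [34.0, np.nan, 12.0],
                   'name': ['john', 'jack', None]},
                  index=['a', 'b', 'c'])

mask = df.isnull()        # True where a value is missing
complete = df.dropna()    # keep only rows without any missing value
filled = df.fillna({'age': 0, 'name': 'unknown'})  # per-column fill values

print(mask)
print(complete)
print(filled)
```

Like drop, these methods return new objects and leave df unchanged.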
