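<a id=menu><center><h1>Table of Contents</h1></center></a>

1. [Introduction to pandas Data Structures](#pandas)
        1.1 Series  
        1.2 DataFrame  
        1.3 Index Objects
2. [Essential pandas Functionality](#basic)
        2.1 Reindexing (reindex)
        2.2 Dropping Entries from an Axis (drop)
        2.3 Indexing, Selection, and Filtering
        2.4 Integer Indexes
        2.5 Arithmetic and Data Alignment
        2.6 Function Application and Mapping
        2.7 Sorting and Ranking
        2.8 Axis Indexes with Duplicate Labels
3. [Summarizing and Computing Descriptive Statistics](#statistics)
        3.1 Correlation and Covariance
        3.2 Unique Values, Value Counts, and Membership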

In [3]:
import pandas as pd
import numpy as np

In [4]:
np.__version__

'1.21.5'

In [5]:
pd.__version__

'1.4.3'

<a id="pandas"></a>
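# 1. Introduction to pandas Data Structures
- Series: a one-dimensional labeled array (a single column or row), similar to a one-dimensional NumPy array. Both resemble Python's built-in list, with one difference: a list can mix elements of different types freely, whereas a Series is normally stored with a single dtype, which uses memory more efficiently and makes computation faster. (A Series can still hold mixed values, but it then falls back to the generic `object` dtype and loses most of that benefit.)
- Time series: a Series indexed by timestamps.
- DataFrame: a two-dimensional tabular data structure (similar to an Excel sheet). Much of its functionality resembles R's data.frame.
- Panel: a three-dimensional array (**removed in pandas 0.25.0**)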
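
The dtype behaviour described in the list above is easy to check. The following is a minimal sketch with made-up values; the variable names are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# A Python list can hold mixed types without complaint.
mixed_list = [1, 2.5, 'three']

# Homogeneous integers are stored with a compact NumPy integer dtype.
int_series = pd.Series([1, 2, 3])
print(int_series.dtype)

# Mixing ints and floats upcasts every element to float64.
float_series = pd.Series([1, 2.5, 3])
print(float_series.dtype)

# Mixed types are still accepted, but are stored as generic Python objects.
mixed_series = pd.Series(mixed_list)
print(mixed_series.dtype)
```

Vectorized arithmetic on the first two Series runs on NumPy arrays, whereas the object-dtype Series has to handle each element as a separate Python object.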

## 1.1 Series
[Back to contents](#menu)


In [8]:
# == Creating a Series
# Create a Series with the default integer index
obj = pd.Series([4, 7, -5, 3])
print(obj, type(obj))

# Create a Series with an explicit index
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Create a Series from a dict
dic = {'Ohio': 35000, 'Taxas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(dic)
print(obj3)

# Pass an index along with the dict: labels missing from the dict get NaN
states = ['California', 'Ohio', 'Oregon', 'Taxas']
obj4 = pd.Series(dic, index=states, name='pop')
obj4

0    4
1    7
2   -5
3    3
dtype: int64 <class 'pandas.core.series.Series'>
Ohio      35000
Taxas     71000
Oregon    16000
Utah       5000
dtype: int64


California        NaN
Ohio          35000.0
Oregon        16000.0
Taxas         71000.0
Name: pop, dtype: float64

In [9]:
# == Series attributes
# # values
# print(obj.values)

# # index
# print(obj.index)

# name
obj4.name = 'popultion'

# index.name
obj4.index.name = 'states'
print(obj4)

# A Series can be thought of as a fixed-length, ordered dict
'b' in obj2  

states
California        NaN
Ohio          35000.0
Oregon        16000.0
Taxas         71000.0
Name: popultion, dtype: float64


True

In [10]:
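# == Selecting, updating, adding, and deleting Series elements
# Select
# Index by label
# obj2['a']  # select one element

# obj2[['c', 'a', 'b']]  # select several elements

# obj2[obj2 > 0]  # index with a boolean Series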
obj4[obj4.isnull()]

states
California   NaN
Name: popultion, dtype: float64

In [11]:
obj4.isnull()  # flag the NA values in the Series

states
California     True
Ohio          False
Oregon        False
Taxas         False
Name: popultion, dtype: bool

In [12]:
# isnull and notnull test for NA values and return a boolean Series
# obj4.isnull()
obj.notnull()

0    True
1    True
2    True
3    True
dtype: bool

In [13]:
# Update
# obj2['b'] = 6
# obj2

# obj2[obj2 < 0] = np.nan
# obj2

obj2[['c', 'a', 'b']] = 2
obj2

d    4
b    2
a    2
c    2
dtype: int64

In [14]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [15]:
# Change an attribute
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
# obj.index
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

In [16]:
# Arithmetic on a Series
print(obj2 * 2)
print(obj2)

# np.exp(obj2)

obj3+obj4  # the indexes of different Series are aligned automatically (important!)

d    8
b    4
a    4
c    4
dtype: int64
d    4
b    2
a    2
c    2
dtype: int64


California         NaN
Ohio           70000.0
Oregon         32000.0
Taxas         142000.0
Utah               NaN
dtype: float64

In [17]:
# Delete
# del obj['Bob']  # delete an element with del
# obj

# obj.drop(['Jeff', 'Ryan'])  # drop returns a new object by default
# obj
obj.drop('Jeff', inplace=True)  # inplace=True modifies the original object
obj

Bob      4
Steve    7
Ryan     3
dtype: int64

In [18]:
# Add
# obj['Bob'] = 10
# obj

obj = pd.concat([obj, obj2])  # concatenate the Series (Series.append is deprecated since pandas 1.4 and removed in 2.0)
obj

Bob      4
Steve    7
Ryan     3
d        4
b        2
a        2
c        2
dtype: int64

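## 1.2 DataFrame
[Back to contents](#menu)

A DataFrame is a two-dimensional data structure that is essentially a container of Series. Each Series holds a single data type, while different Series may hold different types. In a DataFrame, therefore, every value within a column shares one data type, but different columns can have different types. In database terms, each row of a DataFrame is a record identified by one element of the Index, and each column is a field, that is, an attribute of the record.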
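
A small sketch of the "container of Series" idea, using a hypothetical two-row frame (`frame_demo` is illustrative and not part of the notebook's running example):

```python
import pandas as pd

frame_demo = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                           'year': [2000, 2001],
                           'pop': [1.5, 2.4]})

# Each column is a Series with its own dtype.
print(frame_demo.dtypes)

# Selecting a column returns a Series that keeps the column's dtype.
col = frame_demo['year']
print(type(col), col.dtype)

# Selecting a row also returns a Series, but its dtype is object
# because the row mixes strings, integers, and floats.
row = frame_demo.loc[0]
print(type(row), row.dtype)
```

This is one reason column-wise operations tend to be the natural and efficient direction in pandas: a column is homogeneous, a row usually is not.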

In [10]:
# == Creating a DataFrame
# Build a DataFrame from a dict of equal-length lists or NumPy arrays
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2020, 2001, 2002, 2003],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2020,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [11]:
frame.tail(3)

Unnamed: 0,state,year,pop
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [12]:
# Specify the column order
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2020,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [13]:
# A requested column that is missing from the dict is filled with missing values
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2020,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [14]:
# Build a DataFrame from a nested dict of dicts: outer keys become columns, inner keys become the row index
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
      'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [15]:
pd.DataFrame(pop, index=[2000, 2001, 2002])

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [16]:
frame3['Ohio'][:-1]

2001    1.7
2002    3.6
Name: Ohio, dtype: float64

In [17]:
frame3['Nevada'][1:]

2002    2.9
2000    NaN
Name: Nevada, dtype: float64

In [18]:
# Build a DataFrame from a dict of Series
pdata = {'Ohio': frame3['Ohio'][:-1],
        'Nevada': frame3['Nevada'][1:]}
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2000,,
2001,1.7,
2002,3.6,2.9


In [19]:
# == DataFrame attributes
# columns
print(frame2.columns)

# index
print(frame2.index, type(frame2.index))

# values
print(frame2.values)

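# Retrieve a column by attribute access
print(frame2.year)  # not recommended: fails for names that are not valid identifiers or that clash with DataFrame methods

# Transpose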
print(frame3.T)

# index.name
frame3.index.name = 'year'

# columns.name
frame3.columns.name = 'state'

frame3

Index(['year', 'state', 'pop', 'debt'], dtype='object')
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object') <class 'pandas.core.indexes.base.Index'>
[[2000 'Ohio' 1.5 nan]
 [2001 'Ohio' 1.7 nan]
 [2020 'Ohio' 3.6 nan]
 [2001 'Nevada' 2.4 nan]
 [2002 'Nevada' 2.9 nan]
 [2003 'Nevada' 3.2 nan]]
one      2000
two      2001
three    2020
four     2001
five     2002
six      2003
Name: year, dtype: int64
        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


state,Nevada,Ohio
year,,
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [32]:
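# == Selecting, updating, adding, and deleting DataFrame data
# Select
# frame2['state']  # one column

# frame2[['state', 'year']]  # several columns

# frame2.loc['three']  # one row

# frame2.loc[:'three']  # a contiguous range of rows (the end label is included)

# frame2.loc[['one', 'three']]  # several non-contiguous rows

# frame2.loc['three', 'year']  # a single element: df.loc[row label, column label]

# frame2.loc[:'three', ['state', 'year']]  # a sub-block

frame2.loc[(frame2['pop']>3) | (frame2['pop']<2), :]  # boolean indexing (filter by condition)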

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2020,Ohio,3.6,
six,2003,Nevada,3.2,


In [55]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val

two    -1.2
four   -1.5
five   -1.7
dtype: float64

In [56]:
# Update
# frame2['debt'] = 16.5
# frame2

# frame2['debt'] = np.arange(6.)  # an array assigned to a column must match the DataFrame's length
# frame2

frame2['debt'] = val  # assign a Series to a column
frame2  # the Series is realigned to the DataFrame's index, and missing values fill the gaps

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2020,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


In [57]:
frame2['state'] == 'Ohio'

one       True
two       True
three     True
four     False
five     False
six      False
Name: state, dtype: bool

In [60]:
# Add
frame2['eastern'] = frame2['state'] == 'Ohio'  # assign a boolean Series as a new column

# Add a column 'location' and fill 'E' only where state is Ohio (partial fill)
frame2.loc[frame2['state'] == 'Ohio', 'location'] = 'E'  
frame2

Unnamed: 0,year,state,pop,debt,eastern,location
one,2000,Ohio,1.5,,True,E
two,2001,Ohio,1.7,-1.2,True,E
three,2020,Ohio,3.6,,True,E
four,2001,Nevada,2.4,-1.5,False,
five,2002,Nevada,2.9,-1.7,False,
six,2003,Nevada,3.2,,False,


In [64]:
# Delete
del frame2['location']  # del removes a column in place

frame2.drop(columns=['eastern'], inplace=True)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2020,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


In [85]:
frame2.drop(index=['one'])

Unnamed: 0,year,state,pop,debt
two,2001,Ohio,1.7,-1.2
three,2020,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


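## 1.3 Index Objects
[Back to contents](#menu)

pandas Index objects hold the axis labels and other metadata (such as the axis name or names).  
Any array or other sequence of labels passed when constructing a Series or DataFrame is converted internally to an Index.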
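
Because an Index behaves much like a set of labels, it supports set-style operations. The sketch below uses two hypothetical indexes (`left` and `right` are illustrative names) to show the most common ones:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right))         # labels found in either index
print(left.intersection(right))  # labels found in both indexes
print(left.difference(right))    # labels found only in left
print(left.isin(['a', 'c']))     # boolean array marking membership
```

These are the same operations pandas relies on internally when it aligns two objects by label.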

In [87]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [89]:
obj.index

Index(['a', 'b', 'c'], dtype='object')

In [91]:
# Index objects are immutable
index = obj.index
# index[:2]
index[1] = 'd'  # TypeError

TypeError: Index does not support mutable operations

In [92]:
np.arange(3)

array([0, 1, 2])

In [95]:
labels = pd.Index(np.arange(3))  # create an Index object
labels

Int64Index([0, 1, 2], dtype='int64')

In [97]:
# obj2 = pd.Series([1.5, -2.5, 0], index=np.arange(3))

obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [99]:
# An Index also behaves like a fixed-size set
print(obj2.index is labels)
print('Ohio' in frame3.columns)
print(2003 in frame3.index)

True
True
False


In [100]:
# Unlike a Python set, a pandas Index can contain duplicate labels
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'], name='idx')
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object', name='idx')

In [101]:
# Index methods and properties
print(dup_labels.is_monotonic_increasing)  # whether the labels are in non-decreasing order
print(dup_labels.is_unique)  # whether every label is unique
print(dup_labels.unique())  # the distinct labels

False
False
Index(['foo', 'bar'], dtype='object', name='idx')


<a id="basic"></a>
# 2. Essential pandas Functionality
The fundamental mechanics of interacting with the data in a Series or DataFrame
## 2.1 Reindexing (`reindex`)
[Back to contents](#menu)

In [2]:
import pandas as pd
import numpy as np

In [3]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'e'])
obj

d    4.5
b    7.2
a   -5.3
e    3.6
dtype: float64

In [4]:
# reindex is typically used to fill in missing index labels and to reorder them
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    NaN
d    4.5
e    3.6
dtype: float64

In [4]:
obj.reindex?

In [5]:
# For ordered data such as time series, reindexing may call for interpolation or filling
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [9]:
# The optional method argument selects how values are filled in while reindexing
obj3.reindex(range(6), method='ffill')  # ffill propagates the last valid value forward

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
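
As a complement to `ffill`, the sketch below shows backward filling and the `limit` argument of `reindex`. It rebuilds the same three-colour Series under the illustrative name `colors`, so it does not depend on earlier cells.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# bfill fills each new label from the next valid value instead of the previous one;
# label 5 has no later value, so it stays NaN.
back = colors.reindex(range(6), method='bfill')
print(back)

# limit caps how many consecutive new labels may be filled from one value.
capped = colors.reindex(range(8), method='ffill', limit=1)
print(capped)
```

Both filling methods require the original index to be sorted (monotonic); otherwise `reindex` raises an error.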

In [10]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    index=['a', 'c', 'd'],
                    columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [11]:
# The columns of a DataFrame can be reindexed as well
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [14]:
frame.mean().mean()

4.0

In [16]:
# fill_value: the value substituted for missing data introduced by reindexing
frame.reindex(index=list('abcd'), fill_value=frame.mean().mean())

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,4.0,4.0,4.0
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [17]:
frame.reindex(index=list('abcd'), method='ffill')

Unnamed: 0,Ohio,Texas,California
a,0,1,2
b,0,1,2
c,3,4,5
d,6,7,8


## 2.2 Dropping Entries from an Axis (`drop`)
[Back to contents](#menu)

In [18]:
# == Series.drop
obj = pd.Series(np.arange(6.), index=list('abcdef'))
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    5.0
dtype: float64

In [19]:
# Drop one element
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
f    5.0
dtype: float64

In [21]:
list('ca')

['c', 'a']

In [22]:
# Drop several elements
obj.drop(list('ca'), inplace=True)  # Question: is obj itself modified?

In [23]:
obj

b    1.0
d    3.0
e    4.0
f    5.0
dtype: float64

In [26]:
np.arange(16).reshape((4, 4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [27]:
# == DataFrame.drop
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                   index=['Ohio', 'Colorado', 'Utah', 'New York'],
                   columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [28]:
# Drop rows by label (rows are also the default axis when labels are passed positionally)
data.drop(index=['Colorado', 'Utah'])

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
New York,12,13,14,15


In [32]:
# Drop columns
# data.drop(['one', 'two'], axis='columns')  # pass axis=1 or axis='columns'

data.drop(columns=['one', 'two'])  # or use the columns keyword

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7
Utah,10,11
New York,14,15


In [33]:
data.drop('Ohio', inplace=True)  # rows are dropped by default
data

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
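
One detail the cells above do not cover is what happens when a label to drop does not exist. The following sketch, on a small hypothetical frame (`data_demo` is an illustrative name), shows the default `KeyError` and the `errors='ignore'` option:

```python
import numpy as np
import pandas as pd

data_demo = pd.DataFrame(np.arange(6).reshape((3, 2)),
                         index=['Ohio', 'Utah', 'Texas'],
                         columns=['one', 'two'])

# Dropping a label that is not present raises KeyError by default.
try:
    data_demo.drop(index=['Nevada'])
    raised = False
except KeyError:
    raised = True
print(raised)

# errors='ignore' silently skips missing labels and drops the ones that exist.
safe = data_demo.drop(index=['Nevada', 'Utah'], errors='ignore')
print(safe)

# Without inplace=True the original object is left untouched.
print(data_demo.shape)
```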


## 2.3 Indexing, Selection, and Filtering
[Back to contents](#menu)

In [35]:
# == Indexing a Series
obj = pd.Series(np.arange(4.), index=list('abcd'))
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [40]:
# Index by position
# obj[1]  # one element (obj.iloc[1] is the explicit, unambiguous form)

# obj[[1,3]]  # several elements

obj[:2]  # positional slices exclude the end point

a    0.0
b    1.0
dtype: float64

In [44]:
# Index by label
# obj['b']  # one element

# obj[['b', 'a', 'd']]  # several elements

obj['b':'c']  # label slices on a Series include the end point

b    1.0
c    2.0
dtype: float64

In [45]:
obj<3

a     True
b     True
c     True
d    False
dtype: bool

In [46]:
# Index with a boolean Series
obj[obj<3]

a    0.0
b    1.0
c    2.0
dtype: float64

In [47]:
# Assign to the selected part of the Series
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [48]:
# == Indexing a DataFrame
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                   index=['Ohio', 'Colorado', 'Utah', 'New York'],
                   columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [51]:
# = Column selection
# Select one column by its label
data['one']

Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int64

In [54]:
# Select several columns
data[['one', 'three']]

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


In [55]:
# = Row selection
# Slicing
data[:2]  # positional slice

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [56]:
data['three'] > 5

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [57]:
# Boolean indexing
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [58]:
# = Element-wise selection
# Boolean indexing with a boolean DataFrame
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [59]:
data[data < 5] 

Unnamed: 0,one,two,three,four
Ohio,0.0,1.0,2.0,3.0
Colorado,4.0,,,
Utah,,,,
New York,,,,


In [60]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [61]:
# = Selecting with loc
# Index by axis labels
# df.loc[row labels, column labels]
data.loc['Colorado', ['two', 'one']]

two    5
one    0
Name: Colorado, dtype: int64

In [46]:
# Combined with slicing
data.loc[:'Utah', 'two':]

Unnamed: 0,two,three,four
Ohio,0,0,0
Colorado,5,6,7
Utah,9,10,11


In [62]:
data['one']<3

Ohio         True
Colorado     True
Utah        False
New York    False
Name: one, dtype: bool

In [64]:
# Combined with boolean indexing
data.loc[data['one']<3, :]

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7


In [72]:
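# Swap two columns; .values bypasses label alignment, which would otherwise undo the swap.
# Running the cell a second time swaps the columns back, which is consistent with the unchanged output shown here.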
data.loc[:, ['two', 'one']] = data[['one', 'two']].values
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [80]:
data.loc[data['one']>0, 'seven'] = 0
data

Unnamed: 0,one,two,three,four,seven
Ohio,0,0,0,0,
Colorado,0,5,6,7,
Utah,8,9,10,11,0.0
New York,12,13,14,15,0.0


In [74]:
# = Selecting with iloc
# Index by integer position
data.iloc[2, [3, 0, 1]]

# data.iloc[2]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [75]:
# Combined with slicing; positional slices exclude the end point
data.iloc[:2, [3, 0, 1]]

Unnamed: 0,four,one,two
Ohio,0,0,0
Colorado,7,0,5


In [83]:
# = Mixing the approaches
data.iloc[:, :3][data['three'] > 5]['two']

Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
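
Chained selections such as the one above work for reading, but a single `loc` call usually expresses the same thing more directly. The sketch below rebuilds the 4x4 frame under the illustrative name `data_demo` (with its original values, not the zeroed ones) and checks that both forms select the same data:

```python
import numpy as np
import pandas as pd

data_demo = pd.DataFrame(np.arange(16).reshape((4, 4)),
                         index=['Ohio', 'Colorado', 'Utah', 'New York'],
                         columns=['one', 'two', 'three', 'four'])

# Chained form: position slice, then boolean filter, then column selection.
chained = data_demo.iloc[:, :3][data_demo['three'] > 5]['two']

# Single-step form: row condition and column label in one loc call.
single = data_demo.loc[data_demo['three'] > 5, 'two']

print(single)
print(chained.equals(single))
```

The single-step form matters most for assignment: assigning through a chain may modify a temporary copy instead of the original frame, whereas `data_demo.loc[mask, 'two'] = value` always targets the frame itself.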

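## 2.4 Integer Indexes
[Back to contents](#menu)

Indexing pandas objects with integers often trips up new users, because it does not behave quite like indexing built-in Python data structures such as lists and tuples. When the index itself holds integers, pandas cannot tell whether an integer key means a label or a position, so plain `[]` indexing treats it as a label.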

In [85]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [87]:
ser[-1]  # KeyError

KeyError: -1

In [88]:
ser2 = pd.Series(np.arange(3.), index=list('abc'))
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [89]:
ser2[-1]

2.0

In [90]:
# Prefer loc (labels) and iloc (positions) when indexing
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [91]:
ser.iloc[:1]

0    0.0
dtype: float64
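
The ambiguity is easiest to see when the integer labels are not in positional order. A minimal sketch, using a hypothetical Series `ser_demo` whose index is deliberately shuffled:

```python
import numpy as np
import pandas as pd

# Values 0.0, 1.0, 2.0 carry the integer labels 2, 0, 1.
ser_demo = pd.Series(np.arange(3.), index=[2, 0, 1])

by_label = ser_demo.loc[0]      # label 0 is the second element
by_position = ser_demo.iloc[0]  # position 0 is the first element
last = ser_demo.iloc[-1]        # negative positions are fine with iloc

print(by_label, by_position, last)
```

Using `loc` and `iloc` states the intent explicitly, so the result does not depend on what kind of index the object happens to have.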

## 2.5 Arithmetic and Data Alignment
[Back to contents](#menu)

In [92]:
# == Series arithmetic
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=list('acde'))
print(s1)
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
              index=list('acefg'))
print(s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


In [93]:
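# When objects are added and their indexes differ, the result's index is the union of the two; labels present on only one side yield NaN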
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
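
One way to picture the alignment is as an explicit reindex of both operands onto the union of their indexes before the element-wise addition. The sketch below reproduces `s1 + s2` that way; `union` and `manual` are illustrative names, and this is a mental model rather than a description of the internal implementation.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=list('acde'))
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=list('acefg'))

# Reindex both operands onto the union of labels, then add position by position.
union = s1.index.union(s2.index)
manual = s1.reindex(union) + s2.reindex(union)
print(manual.equals(s1 + s2))

# With the add method, fill_value treats a label missing on one side as 0.
filled = s1.add(s2, fill_value=0)
print(filled)
```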

In [94]:
# == DataFrame arithmetic
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                  columns=list('bcd'))
print(df1)
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                  columns=list('abcde'))
print(df2)

     b    c    d
0  0.0  1.0  2.0
1  3.0  4.0  5.0
2  6.0  7.0  8.0
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [96]:
df2.loc[1, 'b'] = np.nan
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [97]:
# Adding directly yields NA wherever the two objects do not overlap
df1+df2

Unnamed: 0,a,b,c,d,e
0,,1.0,3.0,5.0,
1,,,11.0,13.0,
2,,17.0,19.0,21.0,
3,,,,,


In [98]:
# The fill_value argument of add substitutes a value where one of the two objects lacks an entry
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,3.0,5.0,4.0
1,5.0,3.0,11.0,13.0,9.0
2,10.0,17.0,19.0,21.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [100]:
# Each pandas arithmetic method has a counterpart starting with r that flips the arguments
df1.rdiv(1)  # equivalent to 1/df1

Unnamed: 0,b,c,d
0,inf,1.0,0.5
1,0.333333,0.25,0.2
2,0.166667,0.142857,0.125


In [103]:
df2.columns

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [102]:
df1.loc[1, 'b'] = np.nan
df1

Unnamed: 0,b,c,d
0,0.0,1.0,2.0
1,,4.0,5.0
2,6.0,7.0,8.0


In [104]:
# fill_value only fills the NA values introduced by reindexing; NaN values that already existed are kept
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0,0.0,1.0,2.0,0
1,0,,4.0,5.0,0
2,0,6.0,7.0,8.0,0


In [105]:
# Broadcasting in computations
arr = np.arange(12.).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [106]:
arr[0]

array([0., 1., 2., 3.])

In [108]:
arr - arr[0]  # with arrays, the subtraction is applied to every row

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [109]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                    columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0, :]
print(frame)
print(series)

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64


In [110]:
# By default, the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [112]:
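# If a label is found in only one of the DataFrame's columns and the Series index, the objects are reindexed to the union of both, and that label yields NaN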
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
print(series2)
frame + series2

b    0
e    1
f    2
dtype: int64


Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [113]:
# To broadcast over the columns instead, match on the row index
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [114]:
frame - series3

Unnamed: 0,Ohio,Oregon,Texas,Utah,b,d,e
Utah,,,,,,,
Ohio,,,,,,,
Texas,,,,,,,
Oregon,,,,,,,


In [115]:
frame.sub(series3, axis='index')  # or axis=0

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
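
A common use of the two broadcasting directions is centering data. The sketch below, on a hypothetical frame `frame_demo` with the same shape as the one above, subtracts column means with the default matching and row means with `axis='index'`:

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                          columns=list('bde'),
                          index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# frame_demo.mean() is indexed by column label, so the default matching
# on columns subtracts each column's own mean.
col_centered = frame_demo - frame_demo.mean()

# frame_demo.mean(axis=1) is indexed by row label, so it must be matched
# on the row index explicitly.
row_centered = frame_demo.sub(frame_demo.mean(axis=1), axis='index')

print(col_centered)
print(row_centered)
```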


## 2.6 Function Application and Mapping
[Back to contents](#menu)

- `pd.DataFrame.apply`
- `pd.DataFrame.applymap`
- `pd.Series.map`
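
The three methods differ in what the function receives: `apply` passes a whole column or row, while `applymap` and `Series.map` pass one element at a time. A compact sketch on a hypothetical 2x2 frame (`frame_demo` and `elementwise` are illustrative names); because `DataFrame.applymap` was renamed to `DataFrame.map` in pandas 2.1, the code picks whichever is available:

```python
import pandas as pd

frame_demo = pd.DataFrame({'b': [1.0, 4.0], 'd': [2.5, 0.5]},
                          index=['Utah', 'Ohio'])

# apply: the function receives one column (default) or one row at a time.
col_range = frame_demo.apply(lambda col: col.max() - col.min())
row_range = frame_demo.apply(lambda row: row.max() - row.min(), axis='columns')

# Element-wise on a DataFrame: map on pandas >= 2.1, applymap on older versions.
elementwise = frame_demo.map if hasattr(frame_demo, 'map') else frame_demo.applymap
formatted = elementwise(lambda x: f'{x:.2f}')

# Element-wise on a single Series.
b_formatted = frame_demo['b'].map(lambda x: f'{x:.2f}')

print(col_range, row_range, formatted, b_formatted, sep='\n')
```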

In [117]:
frame = pd.DataFrame(np.random.randn(4, 3), 
                    columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-0.870403,-1.038524,0.017245
Ohio,-1.161272,-1.022613,0.397036
Texas,-1.255421,-1.34417,-0.195129
Oregon,-0.31894,-0.234183,0.838645


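In [118]:
# apply calls a function on each one-dimensional column (or row) of the DataFrame
f = lambda x: x.max() - x.min()
frame.apply(f)  # range (max minus min) of each column

b    0.936481
d    1.109987
e    1.033774
dtype: float64

In [121]:
# The same function can be called directly on a single column
x = frame['e']
f(x)

1.0337738358186972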

In [122]:
frame.apply?

In [96]:
# With axis='columns', the function is called once per row
frame.apply(f, axis='columns')

Utah      1.961248
Ohio      1.452199
Texas     2.388326
Oregon    1.363279
dtype: float64

In [125]:
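# A function passed to apply may also return a Series with several values
def func(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(func, axis='columns')  # one row of min/max per original row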

Unnamed: 0,min,max
Utah,-1.038524,0.017245
Ohio,-1.161272,0.397036
Texas,-1.34417,-0.195129
Oregon,-0.31894,0.838645


In [130]:
# applymap applies a function element-wise (renamed to DataFrame.map in pandas 2.1)
format_2 = lambda x: '%.2f' % x
new_frame = frame.applymap(format_2)
new_frame

Unnamed: 0,b,d,e
Utah,-0.87,-1.04,0.02
Ohio,-1.16,-1.02,0.40
Texas,-1.26,-1.34,-0.20
Oregon,-0.32,-0.23,0.84


In [131]:
frame['e']

Utah      0.017245
Ohio      0.397036
Texas    -0.195129
Oregon    0.838645
Name: e, dtype: float64

In [132]:
# Series.map is the Series counterpart of the element-wise DataFrame.applymap
frame['e'].map(format_2)

Utah       0.02
Ohio       0.40
Texas     -0.20
Oregon     0.84
Name: e, dtype: object

## 2.7 Sorting and Ranking
[Back to contents](#menu)

In [133]:
# == sort_index: sort by index labels
obj = pd.Series(range(4), index=list('dabc'))
obj

d    0
a    1
b    2
c    3
dtype: int64

In [134]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [135]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                    index=['three', 'one'],
                    columns=list('dabc'))
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [136]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [137]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [138]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [140]:
frame.sort_index(axis=1, ascending=False, inplace=True)
frame

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [141]:
# == sort_values: sort by values (missing values are placed at the end by default)
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [143]:
obj.sort_values(ascending=False)

2    7.0
0    4.0
5    2.0
4   -3.0
1    NaN
3    NaN
dtype: float64

In [144]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [145]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [146]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


In [148]:
frame.sort_values(by=0, axis=1)

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


In [149]:
# rank returns the rank of each value
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [150]:
obj.rank()  # ascending by default; tied values share the average of their ranks

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [151]:
obj.rank(method='first')  # ties are broken by order of appearance in the data

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [152]:
# Rank in descending order and give each tied group its maximum rank
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
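
The tie-breaking methods are easiest to compare side by side. The sketch below ranks the same values with each `method` and collects the results in one table; `values` and `ranks` are illustrative names.

```python
import pandas as pd

values = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, all ranking in ascending order.
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: values.rank(method=m) for m in methods})
ranks.insert(0, 'value', values)

print(ranks)
```

`dense` behaves like `min` except that ranks always increase by 1 between groups, so no rank numbers are skipped after a tie.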

In [153]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 
                      'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [154]:
frame.rank()

Unnamed: 0,b,a,c
0,3.0,1.5,2.0
1,4.0,3.5,3.0
2,1.0,1.5,4.0
3,2.0,3.5,1.0


In [155]:
frame.rank(axis='columns')  # rank the values within each row

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


## 2.8 Axis Indexes with Duplicate Labels
[Back to contents](#menu)

In [156]:
obj = pd.Series(range(5), index=list('aabbc'))
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [159]:
obj.index.is_unique

False

In [157]:
obj['a']  # a duplicated label returns a Series

a    0
a    1
dtype: int64

In [158]:
obj['c']  # a unique label returns a scalar

4

In [160]:
df = pd.DataFrame(np.random.randn(4, 3), index=list('aabb'))
df

Unnamed: 0,0,1,2
a,0.329278,-0.075197,0.768938
a,-0.402381,-2.1231,1.229427
b,-0.358803,1.305602,-1.449987
b,0.613444,-1.492832,-1.741693


In [161]:
df.loc['b', :]

Unnamed: 0,0,1,2
b,-0.358803,1.305602,-1.449987
b,0.613444,-1.492832,-1.741693
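
Because the return type depends on whether a label is duplicated, it is often useful to detect or resolve duplicates first. A minimal sketch of two common approaches, using an illustrative Series `obj_dup`:

```python
import pandas as pd

obj_dup = pd.Series(range(5), index=list('aabbc'))

# Mark every repeated occurrence of a label after its first appearance.
dup_mask = obj_dup.index.duplicated(keep='first')
print(dup_mask)

# Option 1: keep only the first entry for each label.
first_only = obj_dup[~dup_mask]
print(first_only)

# Option 2: aggregate the entries that share a label.
summed = obj_dup.groupby(level=0).sum()
print(summed)
```

Which option is appropriate depends on whether the repeated labels are errors or genuinely separate observations.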


<a id="statistics"></a>
# 3. Summarizing and Computing Descriptive Statistics
[Back to contents](#menu)

In [162]:
import pandas as pd
import numpy as np

In [15]:
arr = np.array([[1.4, np.nan], [7.1, -4.5], 
                [np.nan, np.nan], [.75, -1.3]])
arr.shape  # check the number of rows and columns of the array

(4, 2)

In [163]:
lst = [[1.4, np.nan], [7.1, -4.5], 
       [np.nan, np.nan], [.75, -1.3]]
df = pd.DataFrame(lst, index=list('abcd'),
                  columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [164]:
df.sum()  # column sums by default (adds down the rows); NaN is skipped

one    9.25
two   -5.80
dtype: float64

In [165]:
df.sum(axis='columns')  # or axis=1: row sums (adds across the columns)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [166]:
df.mean(axis=1, skipna=True)  # skipna controls whether missing values are excluded (True is the default)

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
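
The effect of `skipna` is clearer when it is switched off. The sketch below rebuilds the same small frame under the illustrative name `df_demo` and also shows `min_count`, which explains why the all-NaN row sums to 0 above:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                        [np.nan, np.nan], [0.75, -1.3]],
                       index=list('abcd'), columns=['one', 'two'])

# With skipna=False, any NaN in a row makes that row's result NaN.
strict_mean = df_demo.mean(axis=1, skipna=False)
print(strict_mean)

# sum skips NaN by default, so a row with no valid values sums to 0.
row_sum = df_demo.sum(axis=1)
print(row_sum)

# min_count=1 requires at least one valid value, otherwise the result is NaN.
row_sum_min1 = df_demo.sum(axis=1, min_count=1)
print(row_sum_min1)
```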

In [167]:
df.idxmax(axis=1)  # label at which the maximum occurs

a    one
b    one
c    NaN
d    one
dtype: object

In [169]:
df.cumsum()  # cumulative sum

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [170]:
df.describe()  # summary statistics

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [171]:
obj = pd.Series(['a', 'b', 'c', 'a']*4)
obj.describe()  # summary statistics for non-numeric data

count     16
unique     3
top        a
freq       8
dtype: object

## 3.1 Correlation and Covariance
[Back to contents](#menu)

In [24]:
# import pandas_datareader.data as web
# all_data = {ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
# price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})
# price.to_csv('stock_price.csv')
# volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})
# volume.to_csv('stock_volume.csv')

In [172]:
# Read the CSV file, using the first column as the index
price = pd.read_csv('../datas/stock_price.csv', index_col=0)
price.head()

Date,AAPL,IBM,MSFT,GOOG
2015-03-11,112.492523,125.923889,37.852718,549.670898
2015-03-12,114.526291,126.871559,36.987106,553.989014
2015-03-13,113.734856,123.900124,37.311714,545.821472
2015-03-16,114.986427,126.14875,37.47401,552.99176
2015-03-17,116.90976,126.052376,37.600246,549.331787


In [173]:
volume = pd.read_csv('../datas/stock_volume.csv', index_col=0)
volume.head()

Date,AAPL,IBM,MSFT,GOOG
2015-03-11,68939000.0,5709900.0,32215300.0,1820700.0
2015-03-12,48362700.0,4567300.0,59992500.0,1389600.0
2015-03-13,51827300.0,6064100.0,58007700.0,1703500.0
2015-03-16,35874300.0,3749600.0,35273500.0,1640900.0
2015-03-17,51023100.0,3311700.0,31673400.0,1805500.0


In [174]:
returns = price.pct_change()
returns.head()

Date,AAPL,IBM,MSFT,GOOG
2015-03-11,,,,
2015-03-12,0.018079,0.007526,-0.022868,0.007856
2015-03-13,-0.006911,-0.023421,0.008776,-0.014743
2015-03-16,0.011004,0.018149,0.00435,0.013137
2015-03-17,0.016727,-0.000764,0.003369,-0.006618


In [175]:
# Correlation between two Series
returns['MSFT'].corr(returns['IBM'])

0.49708933043814085

In [176]:
# Covariance between two Series
returns['MSFT'].cov(returns['IBM'])

0.0001007987573938364
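
The two quantities are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on synthetic data, since the stock CSV files are not included here; `x`, `y`, and the seed are arbitrary, illustrative choices.

```python
import numpy as np
import pandas as pd

# Synthetic, partially correlated data (not the stock returns used above).
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=250))
y = 0.5 * x + pd.Series(rng.normal(size=250))

# corr(x, y) = cov(x, y) / (std(x) * std(y)), using sample statistics.
cov_xy = x.cov(y)
corr_manual = cov_xy / (x.std() * y.std())

print(corr_manual)
print(np.isclose(corr_manual, x.corr(y)))
```

Covariance depends on the scale of the data, which is why the covariance values above are tiny, while correlation is always between -1 and 1.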

In [177]:
returns.corr()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.428714,0.616049,0.562864
IBM,0.428714,1.0,0.497089,0.438148
MSFT,0.616049,0.497089,1.0,0.691464
GOOG,0.562864,0.438148,0.691464,1.0


In [178]:
returns.cov()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,0.000262,9.4e-05,0.00015,0.000141
IBM,9.4e-05,0.000182,0.000101,9.1e-05
MSFT,0.00015,0.000101,0.000226,0.000161
GOOG,0.000141,9.1e-05,0.000161,0.000239


In [179]:
# Correlation of every column with one given Series
returns.corrwith(returns['IBM'])

AAPL    0.428714
IBM     1.000000
MSFT    0.497089
GOOG    0.438148
dtype: float64

In [36]:
# Passing a DataFrame computes the correlations between columns with matching names
returns.corrwith(volume.pct_change())

AAPL   -0.143781
IBM    -0.027068
MSFT   -0.087620
GOOG   -0.038283
dtype: float64

In [180]:
volume.pct_change()

Date,AAPL,IBM,MSFT,GOOG
2015-03-11,,,,
2015-03-12,-0.298471,-0.200109,0.862236,-0.236777
2015-03-13,0.071638,0.327721,-0.033084,0.225892
2015-03-16,-0.307811,-0.381672,-0.391917,-0.036748
2015-03-17,0.422274,-0.116786,-0.102062,0.100311
...,...,...,...,...
2020-03-03,-0.064211,-0.074466,0.009097,-0.012009
2020-03-04,-0.313943,-0.367232,-0.305016,-0.203555
2020-03-05,-0.144200,0.090025,-0.040091,0.338682
2020-03-06,0.205808,0.504407,0.522903,0.038769


## 3.2 Unique Values, Value Counts, and Membership
[Back to contents](#menu)

In [181]:
obj = pd.Series(list('cadaabbcc'))
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [182]:
# Unique values
unique = obj.unique()
unique

array(['c', 'a', 'd', 'b'], dtype=object)

In [187]:
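# Value counts
# obj.value_counts()
pd.value_counts(obj, sort=True)  # value_counts is also available as a top-level pandas function that works on any array or sequence (deprecated since pandas 2.1 in favor of the method)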

c    3
a    3
b    2
d    1
dtype: int64

In [188]:
# isin performs a vectorized set-membership check
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [189]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
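
Related to `isin` is `Index.get_indexer`, which returns the position of each value within a set of distinct values instead of a True/False flag. A minimal sketch with illustrative names (`to_match`, `unique_vals`):

```python
import pandas as pd

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])

# For each element of to_match, the integer position of that value in
# unique_vals; a value that is absent would be reported as -1.
positions = pd.Index(unique_vals).get_indexer(to_match)
print(positions)
```

This kind of lookup is useful for turning labels into integer codes, for example before a join or when building categorical data.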

In [190]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2,3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [46]:
# Count how many times each value occurs in every column of the DataFrame
result = data.apply(pd.value_counts).fillna(0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
