##### Statistical methods help in understanding and analyzing the behavior of data

In [1]:
import pandas as pd
import numpy as np

#### pct_change()
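This function compares each element with the one before it and computes the relative change. The result is a fraction, so 1.0 means a 100% increase. The first element has no predecessor, so its result is NaN.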

In [2]:
s = pd.Series([1,2,3,4,5,4])
print(s)
print(s.pct_change())

0    1
1    2
2    3
3    4
4    5
5    4
dtype: int64
0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
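
To make the calculation concrete, the sketch below rebuilds the same result by hand with shift. The names previous, manual and percent are only illustrative, and the comparison uses a tolerance because the two formulas can differ in the last floating-point digit.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

# Manual equivalent: (current - previous) / previous
previous = s.shift(1)
manual = (s - previous) / previous

builtin = s.pct_change()
same = np.allclose(manual, builtin, equal_nan=True)
print(same)

# Multiply by 100 to read the result as a percentage
percent = (builtin * 100).round(1)
print(percent)
```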


In [3]:
df = pd.DataFrame(np.random.randn(5,2))
print(df)
print(df.pct_change())

          0         1
0  1.047755 -0.902037
1  0.574110  1.068281
2 -1.570391 -1.684186
3 -0.827252 -1.561630
4 -0.924718 -0.911742
          0         1
0       NaN       NaN
1 -0.452057 -2.184298
2 -3.735347 -2.576538
3 -0.473219 -0.072769
4  0.117818 -0.416160
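
Random input makes the output above hard to verify by eye. The following sketch uses a small hypothetical table (prices, not part of the original notebook) and also shows the periods parameter, which compares each row with the row that many steps back.

```python
import pandas as pd

# Hypothetical data, chosen so the results are easy to check by hand
prices = pd.DataFrame({"x": [10.0, 11.0, 12.1], "y": [20.0, 10.0, 15.0]})

one_back = prices.pct_change()           # compare with the previous row
two_back = prices.pct_change(periods=2)  # compare with the row two steps back
print(one_back)
print(two_back)
```

With periods=2 the first two rows are NaN, since neither has a row two steps before it.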


#### Covariance
A Series has a cov method that computes the covariance between two Series. NA values are excluded automatically.

In [7]:
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print(s1)
print(s2)

0   -0.955514
1    0.315436
2   -0.167226
3   -0.704854
4    0.766544
5   -1.584474
6    0.267582
7   -0.617931
8    1.693813
9    0.699864
dtype: float64
0   -0.803392
1    0.041531
2   -0.928118
3   -1.330413
4   -0.625608
5   -0.515786
6   -1.001059
7   -2.569825
8    0.609901
9   -2.089757
dtype: float64


In [8]:
print(s1.cov(s2))

0.31531588800713495
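
As a rough check on what cov computes, the sketch below reproduces the sample covariance (dividing by n - 1) by hand on a small made-up pair of Series, one of which contains a NaN. The data and the variable names are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
y = pd.Series([2.0, 4.0, 5.0, 9.0, 7.0])

# Keep only the positions where both values are present
valid = x.notna() & y.notna()
xv, yv = x[valid], y[valid]

# Sample covariance: sum of products of deviations, divided by n - 1
manual = ((xv - xv.mean()) * (yv - yv.mean())).sum() / (len(xv) - 1)

print(manual)
print(x.cov(y))  # the pair containing NaN is dropped before computing
```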


##### When applied to a DataFrame, cov computes the pairwise covariance between all columns.

In [11]:
frame = pd.DataFrame(np.random.randn(10,5), columns=['a', 'b', 'c', 'd', 'e'])
print(frame)
print(frame['a'].cov(frame['b']))
print(frame.cov())

          a         b         c         d         e
0  1.699501  0.614042 -0.354822  0.828747 -0.292708
1 -1.613300 -0.210075 -0.046507  0.457616  0.386191
2 -0.921647 -1.124952 -0.145588 -0.196851 -1.345305
3 -0.365916  1.139972 -0.650955 -1.333227  0.443523
4  1.434953 -0.033024  2.462441 -0.409308  0.016077
5 -0.149650 -1.039498 -2.174251 -0.211683  1.558303
6  0.782164 -2.050544 -0.927806  0.876850 -0.038256
7  2.327593  1.937128  0.139896  0.600984  1.005344
8 -0.221105 -0.172807 -0.808411 -0.655769  0.437150
9 -0.435836 -0.749290  1.306850 -1.003377  0.284053
0.645395130176195
          a         b         c         d         e
a  1.575628  0.645395  0.357394  0.412904  0.134890
b  0.645395  1.370120  0.280429 -0.066384  0.254429
c  0.357394  0.280429  1.606886 -0.178055 -0.318582
d  0.412904 -0.066384 -0.178055  0.598028 -0.052566
e  0.134890  0.254429 -0.318582 -0.052566  0.594613
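
A few properties of the covariance matrix are worth checking: the diagonal holds each column's variance, the matrix is symmetric, and it agrees with numpy's np.cov when columns are treated as variables. The sketch below verifies these on a seeded random frame; the seed and the name df are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
df = pd.DataFrame(rng.standard_normal((10, 3)), columns=["a", "b", "c"])

cov = df.cov()

# Diagonal entries are the per-column variances
diag_is_var = np.allclose(np.diag(cov), df.var())
# cov(a, b) == cov(b, a)
symmetric = np.allclose(cov, cov.T)
# Same numbers as numpy when each column is one variable
matches_numpy = np.allclose(cov, np.cov(df.to_numpy(), rowvar=False))

print(diag_is_var, symmetric, matches_numpy)
```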


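#### Correlation
Correlation measures the strength of the relationship between two sets of values (Series). The corr method supports three methods: pearson (the default, which measures linear correlation), spearman and kendall (both rank-based).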

In [12]:
print(frame)
print(frame['a'].corr(frame['b']))
print(frame.corr())

          a         b         c         d         e
0  1.699501  0.614042 -0.354822  0.828747 -0.292708
1 -1.613300 -0.210075 -0.046507  0.457616  0.386191
2 -0.921647 -1.124952 -0.145588 -0.196851 -1.345305
3 -0.365916  1.139972 -0.650955 -1.333227  0.443523
4  1.434953 -0.033024  2.462441 -0.409308  0.016077
5 -0.149650 -1.039498 -2.174251 -0.211683  1.558303
6  0.782164 -2.050544 -0.927806  0.876850 -0.038256
7  2.327593  1.937128  0.139896  0.600984  1.005344
8 -0.221105 -0.172807 -0.808411 -0.655769  0.437150
9 -0.435836 -0.749290  1.306850 -1.003377  0.284053
0.4392578854608845
          a         b         c         d         e
a  1.000000  0.439258  0.224609  0.425365  0.139359
b  0.439258  1.000000  0.188996 -0.073337  0.281884
c  0.224609  0.188996  1.000000 -0.181635 -0.325920
d  0.425365 -0.073337 -0.181635  1.000000 -0.088151
e  0.139359  0.281884 -0.325920 -0.088151  1.000000
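
The difference between pearson and spearman shows up on data that is monotonic but not linear. The sketch below uses a made-up example (y is the cube of x) and also checks that the Pearson coefficient equals the covariance divided by the product of the standard deviations. Only DataFrame.corr is used for spearman here; kendall is left out of the sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["y"] = df["x"] ** 3  # monotonic, but not a straight line

pearson = df["x"].corr(df["y"])
# Pearson by hand: cov(x, y) / (std(x) * std(y))
manual = df["x"].cov(df["y"]) / (df["x"].std() * df["y"].std())

spearman = df.corr(method="spearman").loc["x", "y"]
# Spearman is the Pearson correlation of the ranks
rank_pearson = df["x"].rank().corr(df["y"].rank())

print(pearson, manual)
print(spearman, rank_pearson)
```

Pearson stays below 1 because the relationship is not linear, while Spearman reaches 1 because the ordering of the two columns is identical.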


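#### Data ranking
Ranking assigns a rank to each element of an array. When values are tied, each tied element receives the average of their ranks by default.

rank supports several tie-breaking methods, selected with the method parameter:

- average - the average rank of the tied group
- min - the lowest rank in the group
- max - the highest rank in the group
- first - ranks assigned in the order the values appear in the array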

In [16]:
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b']  # create a tie between 'b' and 'd'
s

a    0.367734
b   -0.562750
c    2.570850
d   -0.562750
e   -0.854715
dtype: float64

In [17]:
print(s.rank())

a    4.0
b    2.5
c    5.0
d    2.5
e    1.0
dtype: float64
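
The cell above only shows the default average method. The sketch below compares all four tie-breaking methods side by side on a fixed Series whose values are rounded stand-ins for the random ones above, so 'b' and 'd' are tied again.

```python
import pandas as pd

# Fixed values with a tie between 'b' and 'd'
s = pd.Series([0.4, -0.6, 2.6, -0.6, -0.9], index=list("abcde"))

# One column per tie-breaking method
ranks = pd.DataFrame(
    {m: s.rank(method=m) for m in ["average", "min", "max", "first"]}
)
print(ranks)

# ascending=False ranks the largest value first
descending = s.rank(ascending=False)
print(descending)
```

Only the rows for 'b' and 'd' differ between the columns; the untied elements get the same rank under every method.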
