##### Statistical methods help in understanding and analyzing the behavior of data

In [2]:
import pandas as pd
import numpy as np

#### pct_change()
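This function compares each element with the one before it and computes the relative change. The result is a fraction rather than a value multiplied by 100, so `1.0` means a 100% increase. The first element has no predecessor, so its result is `NaN`.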

In [3]:
s = pd.Series([1,2,3,4,5,4])
print(s)
print(s.pct_change())

0    1
1    2
2    3
3    4
4    5
5    4
dtype: int64
0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
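As a minimal sketch of what `pct_change()` computes, the result can be reproduced by hand by dividing each element by the previous one (obtained with `shift`) and subtracting 1. The variable names `manual`, `builtin`, and `two_back` are illustrative only; `periods=2` compares each element with the one two positions earlier.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

# Manual equivalent: current value / previous value - 1
manual = s / s.shift(1) - 1
builtin = s.pct_change()
print(manual)
print(builtin)

# Compare each element with the one two positions earlier
two_back = s.pct_change(periods=2)
print(two_back)
```

With `periods=2` the first two entries are `NaN`, because neither has an element two positions before it.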


In [4]:
df = pd.DataFrame(np.random.randn(5,2))
print(df)
print(df.pct_change())

          0         1
0 -0.747259 -0.887362
1  0.112792 -2.075690
2  1.099312 -1.502084
3 -0.832515 -0.956793
4 -0.439081  0.815584
          0         1
0       NaN       NaN
1 -1.150941  1.339169
2  8.746370 -0.276345
3 -1.757306 -0.363023
4 -0.472585 -1.852414


#### Covariance
A Series has a `cov` method that computes the covariance between two Series. NA values are excluded automatically.

In [5]:
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print(s1)
print(s2)

0   -1.451090
1   -0.306418
2    0.683553
3    0.526187
4   -1.477250
5   -0.665632
6   -0.621975
7    1.497105
8    0.694618
9    0.799649
dtype: float64
0    1.024354
1    0.754664
2   -0.791322
3    0.475300
4    0.535453
5   -0.100175
6   -0.042820
7   -1.614471
8    1.352680
9    0.456267
dtype: float64


In [6]:
print(s1.cov(s2))

-0.41698902970580976


In [7]:
help(pd.DataFrame.cov)  # help() prints the text itself and returns None, so no print() is needed

Help on function cov in module pandas.core.frame:

cov(self, min_periods=None)
    Compute pairwise covariance of columns, excluding NA/null values.
    
    Compute the pairwise covariance among the series of a DataFrame.
    The returned data frame is the `covariance matrix
    <https://en.wikipedia.org/wiki/Covariance_matrix>`__ of the columns
    of the DataFrame.
    
    Both NA and null values are automatically excluded from the
    calculation. (See the note below about bias from missing values.)
    A threshold can be set for the minimum number of
    observations for each value created. Comparisons with observations
    below this threshold will be returned as ``NaN``.
    
    This method is generally used for the analysis of time series data to
    understand the relationship between different measures
    across time.
    
    Parameters
    ----------
    min_periods : int, optional
        Minimum number of observations required per pair of columns
        to have a vali

##### When applied to a DataFrame, the `cov` method computes the covariance between all pairs of columns.

In [8]:
frame = pd.DataFrame(np.random.randn(10,5), columns=['a', 'b', 'c', 'd', 'e'])
print(frame)
print(frame['a'].cov(frame['b']))
print(frame.cov())

          a         b         c         d         e
0 -0.703660 -1.514744 -0.885523 -2.222260 -0.070883
1  0.697156  0.430342 -0.567606  0.688849 -0.787842
2 -0.368070 -0.261166  1.458766  0.782605 -0.824378
3  0.626566  0.835859 -0.064046 -0.505192 -0.820484
4 -0.797495 -0.337524  0.055754 -1.758250 -0.186476
5  0.573183 -0.125357 -0.453431  0.816166 -0.334878
6 -0.529514  1.302971 -0.709794  0.281086  0.007070
7  0.525901  0.406519 -0.473657  1.287054 -0.400794
8  0.173768  0.405239  0.950309  0.905681 -2.566420
9  0.338142 -1.093717  0.161111  2.275042  0.955446
0.1560985052768006
          a         b         c         d         e
a  0.349614  0.156099 -0.031039  0.514149 -0.091233
b  0.156099  0.733260 -0.022817  0.253661 -0.312403
c -0.031039 -0.022817  0.564050  0.315061 -0.318265
d  0.514149  0.253661  0.315061  1.902128  0.006347
e -0.091233 -0.312403 -0.318265  0.006347  0.806198


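#### Correlation
Correlation measures the strength of the relationship between two sets of values (Series). The `method` parameter of `corr` selects how it is computed: `pearson` (the default, which measures linear correlation), `spearman`, or `kendall` (both rank-based).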
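The default Pearson correlation is the covariance divided by the product of the two standard deviations. As a check against the covariance matrix printed earlier, 0.156099 / sqrt(0.349614 × 0.733260) is approximately 0.3083. The sketch below verifies the same relationship on freshly generated data; the seed and the names `cov_ab` and `manual` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
data = pd.DataFrame(rng.standard_normal((10, 2)), columns=['a', 'b'])

cov_ab = data['a'].cov(data['b'])

# Pearson correlation = covariance / (std of a * std of b)
manual = cov_ab / (data['a'].std() * data['b'].std())

print(manual)
print(data['a'].corr(data['b']))
```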

In [9]:
print(frame)
print(frame['a'].corr(frame['b']))
print(frame.corr())

          a         b         c         d         e
0 -0.703660 -1.514744 -0.885523 -2.222260 -0.070883
1  0.697156  0.430342 -0.567606  0.688849 -0.787842
2 -0.368070 -0.261166  1.458766  0.782605 -0.824378
3  0.626566  0.835859 -0.064046 -0.505192 -0.820484
4 -0.797495 -0.337524  0.055754 -1.758250 -0.186476
5  0.573183 -0.125357 -0.453431  0.816166 -0.334878
6 -0.529514  1.302971 -0.709794  0.281086  0.007070
7  0.525901  0.406519 -0.473657  1.287054 -0.400794
8  0.173768  0.405239  0.950309  0.905681 -2.566420
9  0.338142 -1.093717  0.161111  2.275042  0.955446
0.30830118644704313
          a         b         c         d         e
a  1.000000  0.308301 -0.069897  0.630485 -0.171844
b  0.308301  1.000000 -0.035480  0.214785 -0.406317
c -0.069897 -0.035480  1.000000  0.304169 -0.471965
d  0.630485  0.214785  0.304169  1.000000  0.005126
e -0.171844 -0.406317 -0.471965  0.005126  1.000000


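#### Ranking data
Ranking assigns a rank to each element of a Series. When values are tied, each tied element receives the average rank of its group by default.

`rank` supports several tie-breaking methods, selected with the `method` parameter:

- average - average rank of the tied group (the default)
- min - lowest rank in the group
- max - highest rank in the group
- first - ranks assigned in the order the values appear in the array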

In [10]:
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b']
s

a   -0.168469
b    1.251046
c   -0.907354
d    1.251046
e    0.922703
dtype: float64

In [11]:
print(s.rank())

a    2.0
b    4.5
c    1.0
d    4.5
e    3.0
dtype: float64
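The tie-breaking methods listed above can be compared side by side. This sketch uses fixed values (rounded from the random Series above, so they are illustrative rather than the original data) in which `b` and `d` are tied for the highest value; the name `ranks` is arbitrary.

```python
import pandas as pd

s = pd.Series([-0.17, 1.25, -0.91, 1.25, 0.92], index=list('abcde'))

# One column per tie-breaking method
ranks = pd.DataFrame({
    m: s.rank(method=m) for m in ['average', 'min', 'max', 'first']
})
print(ranks)
```

The tied elements `b` and `d` occupy ranks 4 and 5: `average` gives both 4.5, `min` gives both 4, `max` gives both 5, and `first` gives `b` rank 4 and `d` rank 5 because `b` appears earlier.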
