In [1]:
import numpy as np
from matplotlib import pyplot

Line format:
```
 0     1     2     3     4     5    6      7
Code, Date, null, open, high, low, close, Amount
```

In [2]:
!cat BHP.csv | head -n5

BHP,11-02-2011, ,93.11,94.26,92.9,93.72,1741900
BHP,14-02-2011, ,94.57,96.23,94.39,95.64,2620800
BHP,15-02-2011, ,94.45,95.47,93.91,94.56,2461300
BHP,16-02-2011, ,92.67,93.58,92.56,93.3,3270900
BHP,17-02-2011, ,92.65,93.98,92.58,93.93,2650200


In [3]:
!cat VALE.csv | head -n5

VALE,11-02-2011, ,33.88,34.54,33.63,34.37,18433500
VALE,14-02-2011, ,34.53,35.29,34.52,35.13,20780700
VALE,15-02-2011, ,34.89,35.31,34.82,35.14,17756700
VALE,16-02-2011, ,35.16,35.4,34.81,35.31,16792800
VALE,17-02-2011, ,35.18,35.6,35.04,35.57,24088300


In [10]:
bhp = np.loadtxt('BHP.csv', delimiter=',', usecols=(6, ), unpack=True)
bhp_returns = np.diff(bhp) / bhp[ :-1]
print("bhp_returns", bhp_returns)

vale = np.loadtxt('VALE.csv', delimiter=',', usecols=(6, ), unpack=True)
vale_returns = np.diff(vale) / vale[ :-1]
print('vale_returns', vale_returns)

covariance = np.cov(bhp_returns, vale_returns)
print('Covariance', covariance)

bhp_returns [ 0.02048656 -0.01129235 -0.01332487  0.00675241 -0.01639519 -0.00303063
  0.00271415 -0.00649632  0.02343069  0.00734746 -0.0140592   0.01243701
  0.01683787 -0.00270777 -0.01347118 -0.0013761  -0.02247191 -0.04239861
  0.01449439 -0.00636232 -0.0232532  -0.02380679  0.02945335  0.01350423
  0.01163053 -0.00982252  0.01476722  0.01377472 -0.00646504]
vale_returns [ 0.02211231  0.00028466  0.00483779  0.00736335 -0.01518133 -0.04538967
  0.01495215  0.00795522  0.00175387 -0.0011672  -0.01373065  0.01658768
  0.01602564 -0.01061084 -0.03681159  0.0018056  -0.01231601 -0.02950122
  0.00814792  0.00839291 -0.01633785 -0.02726418  0.01514175  0.01999365
  0.00871189 -0.00524368  0.01395349 -0.01039755 -0.00061805]
Covariance [[ 0.00028179  0.00019766]
 [ 0.00019766  0.00030123]]
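The simple-return formula used above, `np.diff(x) / x[:-1]`, can be checked by hand on a few values. The following is a minimal sketch that assumes only the first five BHP closing prices shown in the `head` output; the names `close` and `manual` are introduced here for illustration.

```python
import numpy as np

# First five BHP closing prices (column 6), copied from the sample rows above.
close = np.array([93.72, 95.64, 94.56, 93.3, 93.93])

# Vectorized simple returns: (close[t+1] - close[t]) / close[t]
returns = np.diff(close) / close[:-1]

# The same computation as an explicit loop, for comparison.
manual = np.array([(close[i + 1] - close[i]) / close[i]
                   for i in range(len(close) - 1)])

print(returns)
print(np.allclose(returns, manual))
```

Five prices give four returns, which should agree with the first four entries of `bhp_returns` printed above.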


About the covariance matrix:


```
covariance = | cov(x1, x1)  cov(x1, x2) |
             | cov(x2, x1)  cov(x2, x2) |
```

Here x1 is bhp_returns and x2 is vale_returns.

The elements on the diagonal are the variances of the individual variables (each variable's covariance with itself).
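As a rough illustration of this layout, the matrix can be rebuilt from deviations about the mean. This sketch assumes only the first four values of each return series printed earlier; `d1`, `d2`, and `manual` are illustrative names, and the N-1 divisor matches the default behaviour of `np.cov`.

```python
import numpy as np

# First four returns of each series, copied from the printed output above.
x1 = np.array([0.02048656, -0.01129235, -0.01332487, 0.00675241])  # BHP
x2 = np.array([0.02211231, 0.00028466, 0.00483779, 0.00736335])    # VALE

n = len(x1)
d1 = x1 - x1.mean()  # deviations of x1 from its mean
d2 = x2 - x2.mean()  # deviations of x2 from its mean

# Each entry is a sum of products of deviations divided by N - 1.
manual = np.array([[d1 @ d1, d1 @ d2],
                   [d2 @ d1, d2 @ d2]]) / (n - 1)

cov = np.cov(x1, x2)
print(np.allclose(manual, cov))
# The diagonal holds the variances computed with ddof=1.
print(np.allclose(cov.diagonal(), [x1.var(ddof=1), x2.var(ddof=1)]))
```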

In [13]:
print("Covariance diagonal", covariance.diagonal())
print("Covariance trace", covariance.trace())

print(covariance[0,0] + covariance[1,1])

Covariance diagonal [ 0.00028179  0.00030123]
Covariance trace 0.00058302354992
0.00058302354992


Dividing the standard deviations of the two variables out of the covariance matrix gives the correlation coefficient matrix.

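When NumPy computes a covariance with np.cov, the degrees-of-freedom correction defaults to 1 (the denominator is N-1 rather than N).

When it computes a standard deviation with np.std, the ddof parameter defaults to 0, so it divides by N and returns the uncorrected standard deviation of the sample rather than the N-1 corrected estimate of the population standard deviation. For that reason, dividing the covariance by bhp_returns.std() * vale_returns.std() does not reproduce the correlation coefficient matrix directly; the standard deviations have to be computed with ddof=1 to match.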
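The effect of the mismatched divisors can be seen in a small sketch. It assumes the first four values of each return series printed earlier; the variable names are illustrative, and `np.corrcoef` is used only as a reference value.

```python
import numpy as np

# First four returns of each series, copied from the printed output above.
x1 = np.array([0.02048656, -0.01129235, -0.01332487, 0.00675241])  # BHP
x2 = np.array([0.02211231, 0.00028466, 0.00483779, 0.00736335])    # VALE
n = len(x1)

cov = np.cov(x1, x2)  # uses N - 1 by default

# Matching divisors: standard deviations with ddof=1.
r_matching = cov[0, 1] / (x1.std(ddof=1) * x2.std(ddof=1))

# Mismatched divisors: np.std defaults to ddof=0 (divides by N).
r_mismatched = cov[0, 1] / (x1.std() * x2.std())

r_reference = np.corrcoef(x1, x2)[0, 1]
print(r_matching, r_mismatched, r_reference)

# Full matrix: divide each entry by the product of the two standard deviations.
stds = np.sqrt(cov.diagonal())
corr = cov / np.outer(stds, stds)
print(corr)
```

With mismatched divisors the result is too large by a factor of N / (N - 1), which is noticeable for short series and shrinks as N grows.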

In [16]:
print("Correlation coefficient", np.corrcoef(bhp_returns, vale_returns))

Correlation coefficient [[ 1.          0.67841747]
 [ 0.67841747  1.        ]]


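np.corrcoef - [Pearson product-moment correlation coefficient](https://zh.wikipedia.org/wiki/%E7%9A%AE%E5%B0%94%E9%80%8A%E7%A7%AF%E7%9F%A9%E7%9B%B8%E5%85%B3%E7%B3%BB%E6%95%B0), defined as the covariance of two variables divided by the product of their standard deviations.

Its value lies in the range [-1, 1].

The call above returns the matrix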


```
| r(x1, x1)  r(x1, x2) |
| r(x2, x1)  r(x2, x2) |
```

In [23]:
difference = bhp - vale
print('difference', difference)

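avg = np.mean(difference)  # mean of the price spread between the two stocks
dev = np.std(difference)  # standard deviation of the spread (ddof=0)
print('Last-day spread deviates from the mean by more than 2 standard deviations: ', np.abs(difference[-1] - avg) > 2 * dev)

difference [ 59.35  60.51  59.42  57.99  58.36  57.36  58.67  58.42  57.55  59.64
  60.37  59.51  60.11  61.15  61.26  61.24  61.05  59.34  56.4   57.42
  56.58  55.04  53.84  55.87  56.42  57.17  56.46  57.32  58.9   58.33]
Last-day spread deviates from the mean by more than 2 standard deviations:  False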
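The check above can be wrapped in a small helper so the threshold is adjustable. This is only a sketch: `spread_is_unusual` and its parameter `k` are hypothetical names, not part of NumPy, and the data is copied from the printed `difference` output.

```python
import numpy as np

# Spread values copied from the printed `difference` output above.
difference = np.array([
    59.35, 60.51, 59.42, 57.99, 58.36, 57.36, 58.67, 58.42, 57.55, 59.64,
    60.37, 59.51, 60.11, 61.15, 61.26, 61.24, 61.05, 59.34, 56.4, 57.42,
    56.58, 55.04, 53.84, 55.87, 56.42, 57.17, 56.46, 57.32, 58.9, 58.33,
])

def spread_is_unusual(spread, k=2.0):
    """Hypothetical helper: True if the last value lies more than k
    standard deviations (ddof=0, the np.std default) from the mean."""
    z = (spread[-1] - np.mean(spread)) / np.std(spread)
    return bool(abs(z) > k)

print(spread_is_unusual(difference))
```

For this data the last spread sits close to the mean, so the helper agrees with the `False` result printed above.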
