# Detecting Stock Market Subgroups with the Affinity Propagation Model

Affinity Propagation

This algorithm clusters data without requiring the number of clusters to be specified in advance.
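
As a small illustration of that property, the sketch below runs scikit-learn's `AffinityPropagation` on a toy dataset (not the stock data). The six points and `random_state=5` are assumptions chosen for the illustration; note that no cluster count is passed anywhere.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy data (an assumption for illustration): two groups, near x=1 and x=4.
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# No n_clusters argument: the number of clusters is inferred from the data.
model = AffinityPropagation(random_state=5).fit(X)

labels = model.labels_
centers = model.cluster_centers_
print("number of clusters:", len(model.cluster_centers_indices_))
print("labels:", labels)
print("exemplars:", centers)
```

The exemplars reported in `cluster_centers_` are actual rows of `X`, not averaged centroids.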

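- Cluster representatives chosen from the unlabeled input data
    - exemplar

- Messages exchanged between data points to score candidate exemplars
    - responsibility: how well suited point k is to serve as the exemplar for point i
    - availability: how appropriate it would be for point i to choose point k as its exemplar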
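
The two message types can be sketched in NumPy as follows. This is a minimal version of the commonly described damped update rules, not scikit-learn's actual implementation; the function name `update_messages`, the toy points, and the median preference are assumptions made for the illustration.

```python
import numpy as np

def update_messages(S, R, A, damping=0.5):
    """One damped round of responsibility and availability updates."""
    n = S.shape[0]
    rows = np.arange(n)

    # Responsibility: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
    AS = A + S
    best = AS.argmax(axis=1)
    best_val = AS[rows, best]
    AS[rows, best] = -np.inf
    second_val = AS.max(axis=1)
    R_new = S - best_val[:, None]
    R_new[rows, best] = S[rows, best] - second_val
    R = damping * R + (1 - damping) * R_new

    # Availability: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
    #               a(k,k) = sum_{i' != k} max(0, r(i',k))
    Rp = np.maximum(R, 0)
    Rp[rows, rows] = R[rows, rows]
    A_new = Rp.sum(axis=0)[None, :] - Rp
    diag = A_new[rows, rows].copy()
    A_new = np.minimum(A_new, 0)
    A_new[rows, rows] = diag
    A = damping * A + (1 - damping) * A_new
    return R, A

# Toy 1-D points (hypothetical); similarity is the negative squared distance.
points = np.array([0.0, 0.3, 1.0, 5.0, 5.2, 6.1])
S = -(points[:, None] - points[None, :]) ** 2
off_diag = S[~np.eye(len(points), dtype=bool)]
np.fill_diagonal(S, np.median(off_diag))  # diagonal holds the "preference"

R = np.zeros_like(S)
A = np.zeros_like(S)
for _ in range(200):
    R, A = update_messages(S, R, A)

# A point k is treated as an exemplar when r(k,k) + a(k,k) > 0.
exemplars = np.flatnonzero(np.diag(R) + np.diag(A) > 0)
print(exemplars)
```

The diagonal of `S` (the preference) controls how readily a point becomes an exemplar: larger values tend to produce more clusters.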

`pandas_datareader` must be installed beforehand:

```
$ pip install pandas_datareader
```

In [3]:
import datetime
import json
import numpy as np
from sklearn import covariance, cluster
import pandas as pd
import pandas_datareader.data as pdd
from dotenv import load_dotenv
import os

load_dotenv()

input_file = './data/company_symbol_mapping.json'
with open(input_file, 'r') as f:
    company_symbol_map = json.load(f)

# Read only the keys (ticker symbols) of the JSON file
symbols = list(company_symbol_map.keys())
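
The mapping file itself is not shown in this notebook. Judging from the log output further down (for example `Loading XOM Exxon`), it presumably maps ticker symbols to company names. The snippet below uses a hypothetical three-entry excerpt to show how the keys and values are used; the real file contents are an assumption.

```python
import json

# Hypothetical excerpt of the mapping file: ticker symbol -> company name.
sample = '{"TOT": "Total", "XOM": "Exxon", "CVX": "Chevron"}'
company_symbol_map = json.loads(sample)

# Keys are the ticker symbols requested from the data source.
symbols = list(company_symbol_map.keys())
print(symbols)

# Values are the display names used when printing clusters.
print(company_symbol_map["XOM"])
```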

## Fetching the stock price data

- The data is loaded from [Quandl](https://www.quandl.com/), so you need to register as a user and obtain an API key.
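
The notebook reads the key from the `QUANDL_API_KEY` environment variable (loaded from a `.env` file via `load_dotenv()`). A small helper like the one below, whose name `get_quandl_api_key` is hypothetical, can make a missing key fail with a readable message instead of a bare `KeyError`.

```python
import os

def get_quandl_api_key(env=os.environ):
    """Return the API key, or raise a readable error if it is not set."""
    key = env.get("QUANDL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "QUANDL_API_KEY is not set; add it to .env or export it in the shell."
        )
    return key

# Usage with an explicit mapping (a stand-in for os.environ in this sketch).
print(get_quandl_api_key({"QUANDL_API_KEY": "your-key-here"}))
```

Keeping the key in the environment also avoids hard-coding it in the notebook.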


In [4]:
start_date = datetime.datetime(2003, 7, 3)
end_date = datetime.datetime(2007, 5, 4)

quotes = []
names = []

for symbol in symbols:
    try:
        print('Loading', symbol, company_symbol_map[symbol], end='...')
        d = pdd.DataReader('WIKI/' + symbol, 'quandl', start_date, end_date, api_key=os.environ['QUANDL_API_KEY'])
        print('done')
        quotes.append(d)
        names.append(company_symbol_map[symbol])
    except Exception as e:
        print(e)
        print('not found')

names = np.array(names)


Loading TOT Total...Unable to read URL: https://www.quandl.com/api/v3/datasets/WIKI/TOT.csv?start_date=2003-07-03&end_date=2007-05-04&order=asc&api_key=<REDACTED>
Response Text:
b'code,message\nQECx02,You have submitted an incorrect Quandl code. Please check your Quandl codes and try again.\n'
not found
Loading XOM Exxon...done
Loading CVX Chevron...done
Loading COP ConocoPhillips...done
Loading VLO Valero Energy...done
Loading MSFT Microsoft...done
Loading IBM IBM...done
Loading TWX Time Warner...done
Loading CMCSA Comcast...done
Loading YHOO Yahoo...done
Loading HPQ HP...done
Loading AMZN Amazon...done
Loading TM Toyota...Unable to read URL: https://www.quandl.com/api/v3/datasets/WIKI/TM.csv?start_date=2003-07-03&end_date=2007-05-04&order=asc&api_key=<REDACTED>
Response Text:
b'code,message\nQECx02,You have submitted an incorrect Quandl code. Please check your Quandl codes and try again.\n'
not found
Loading CAJ Canon...Unable to read URL: https://www.quandl.com/a

In [23]:
print(names)


['Exxon' 'Chevron' 'ConocoPhillips' 'Valero Energy' 'Microsoft' 'IBM'
 'Time Warner' 'Comcast' 'Yahoo' 'HP' 'Amazon' 'Ford' 'Navistar'
 'Northrop Grumman' 'Boeing' 'Coca Cola' '3M' 'Mc Donalds' 'Pepsi'
 'Kraft Foods' 'Kellogg' 'Marriott' 'Procter Gamble' 'Colgate-Palmolive'
 'General Electrics' 'Wells Fargo' 'JPMorgan Chase' 'AIG'
 'American express' 'Bank of America' 'Goldman Sachs' 'Apple' 'Cisco'
 'Texas instruments' 'Xerox' 'Wal-Mart' 'Walgreen' 'Home Depot' 'Pfizer'
 'Kimberly-Clark' 'Ryder' 'General Dynamics' 'Raytheon' 'CVS'
 'Caterpillar']
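
The next cell stacks one price series per company into a single NumPy array, which only yields a regular 2-D array when every series covers the same dates. If some symbols were missing trading days, one possible way to align them first is sketched below. The two small frames are hypothetical stand-ins for the downloaded data.

```python
import numpy as np
import pandas as pd

# Two hypothetical per-symbol frames; the second one is missing 2003-07-07.
a = pd.DataFrame(
    {"Open": [10.0, 10.5, 11.0], "Close": [10.2, 10.4, 11.3]},
    index=pd.to_datetime(["2003-07-03", "2003-07-07", "2003-07-08"]),
)
b = pd.DataFrame(
    {"Open": [20.0, 21.0], "Close": [20.5, 20.8]},
    index=pd.to_datetime(["2003-07-03", "2003-07-08"]),
)
quotes = [a, b]

# Keep only the dates present in every frame.
common = quotes[0].index
for q in quotes[1:]:
    common = common.intersection(q.index)
aligned = [q.loc[common] for q in quotes]

# One row per company, one column per shared trading day.
diff = np.array([(q["Close"] - q["Open"]).to_numpy() for q in aligned])
print(diff.shape)
```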


In [17]:
# Compute the difference between each stock's closing and opening prices
opening_quotes = np.array([quote['Open'] for quote in quotes]).astype(float)
closing_quotes = np.array([quote['Close'] for quote in quotes]).astype(float)
quotes_diff = closing_quotes - opening_quotes
print(quotes_diff)
# Transpose so that rows are days and columns are stocks, then divide each column by its standard deviation to normalize
X = quotes_diff.copy().T
X /= X.std(axis=0)

edge_model = covariance.GraphicalLassoCV(cv=3)

with np.errstate(invalid='ignore'):
    edge_model.fit(X)

_, labels = cluster.affinity_propagation(edge_model.covariance_)
num_labels = labels.max()

for i in range(num_labels + 1):
    print("Cluster", i+1, "==>", ', '.join(names[labels==i]))


[[-0.45  0.85 -0.01 ... -0.2  -0.05 -0.04]
 [-0.41  1.03  0.16 ...  0.42  1.05 -0.57]
 [ 0.05  0.84  0.21 ... -0.52 -0.77 -0.2 ]
 ...
 [-0.11  0.45 -0.05 ...  0.09 -0.15 -0.3 ]
 [-0.81 -0.24  0.38 ...  0.22  0.19  0.07]
 [-0.24  0.05  0.2  ...  0.4   0.16  0.17]]
Cluster 1 ==> Exxon, Chevron, ConocoPhillips, Valero Energy
Cluster 2 ==> Yahoo, Amazon, Apple
Cluster 3 ==> Ford, Navistar, Caterpillar
Cluster 4 ==> Kraft Foods
Cluster 5 ==> Coca Cola, Pepsi, Kellogg, Procter Gamble, Colgate-Palmolive, Kimberly-Clark
Cluster 6 ==> Comcast, Mc Donalds, Marriott, Wells Fargo, JPMorgan Chase, AIG, American express, Bank of America, Goldman Sachs, Xerox, Wal-Mart, Home Depot, Pfizer, Ryder
Cluster 7 ==> Microsoft, IBM, Time Warner, HP, 3M, General Electrics, Cisco, Texas instruments
Cluster 8 ==> Walgreen, CVS
Cluster 9 ==> Northrop Grumman, Boeing, General Dynamics, Raytheon
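
Because the download step depends on an API key, the same pipeline can be tried on synthetic data. The sketch below is an assumption-laden stand-in: the two "sectors", the series names `A1`..`B3`, the noise level, and `random_state=0` are all made up for the illustration, and only the preprocessing and the two scikit-learn calls mirror the cell above.

```python
import numpy as np
from sklearn import cluster, covariance

rng = np.random.default_rng(0)
n_days = 300

# Two hypothetical "sectors": series in a sector share a common daily factor.
factor_a = rng.normal(size=n_days)
factor_b = rng.normal(size=n_days)
quotes_diff = np.array(
    [factor_a + 0.3 * rng.normal(size=n_days) for _ in range(3)]
    + [factor_b + 0.3 * rng.normal(size=n_days) for _ in range(3)]
)
names = np.array(["A1", "A2", "A3", "B1", "B2", "B3"])

# Same preprocessing as above: days as rows, unit standard deviation per series.
X = quotes_diff.T.copy()
X /= X.std(axis=0)

# Estimate a sparse covariance structure, then cluster on the covariance matrix.
edge_model = covariance.GraphicalLassoCV(cv=3)
edge_model.fit(X)

_, labels = cluster.affinity_propagation(edge_model.covariance_, random_state=0)
for i in np.unique(labels):
    print("Cluster", i + 1, "==>", ", ".join(names[labels == i]))
```

With this setup one would expect the A and B series to end up in separate clusters, though the exact grouping is not guaranteed and may vary with the library version.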
