# DSAI-HW1

## Description

I use a policy-gradient method (actor-critic) to model the state of a stock.
Based on observation, I assume that the `10-day moving average difference` and the `5-20-day moving average` crossover are related to whether the stock rises or falls.
There are 5 stock states: `great down`, `down`, `steady`, `up`, `great up`.
With this in mind, I use these two rules as the policy for training my model with the `Actor-Critic` NNs.

Once we have the predicted next-day stock state, we can simply act on it:

```
Buying shares (or returning shares in shorting) in `up` or `great up`
Selling shares (or shorting) in `great down`
Holding shares in other states
```

## Usage

### Install the dependencies

- In your virtual environment (for example pyenv, pipenv, or virtualenv), install the requirements:

```sh
$ pip install -r requirements.txt
```

- Execute the script (the optional arguments specify the input and output files):

```sh
$ python3 trader.py [--training=training_data.csv] [--testing=testing_data.csv] [--output=output.csv]
```
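
`trader.py` reads both data files with `pd.read_csv(..., names=['open', 'high', 'low', 'close'])`, which suggests headerless CSV files with four numeric columns per trading day. A minimal sketch of that layout using only the standard library; the price values and the helper name `parse_prices` are made up for illustration:

```python
import csv
import io

# Illustrative rows only: one trading day per line, no header row.
SAMPLE = """209.89,210.87,208.05,209.03
208.45,211.10,207.94,210.62
"""

COLUMNS = ["open", "high", "low", "close"]


def parse_prices(text):
    """Parse headerless CSV text into a list of {column: float} rows."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(COLUMNS, map(float, row))) for row in reader if row]


rows = parse_prices(SAMPLE)
print(rows[0]["open"], rows[1]["close"])
```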

## AUTHORS

[rapirent](https://github.com/raprient)


## LICENSE
MIT@2018


## Detail

### ac.py
The Actor-Critic algorithm is implemented in `ac.py`, which declares two classes, `Actor` and `Critic`, built with TensorFlow.

import numpy as np
import tensorflow as tf

np.random.seed(2)
tf.set_random_seed(2)  # reproducible

GAMMA = 0.9     # reward discount in TD error


class Actor(object):
    def __init__(self, sess, n_features, n_actions, lr=0.001):
        self.sess = sess

        self.s = tf.placeholder(tf.float32, [1, n_features], "state")
        self.a = tf.placeholder(tf.int32, None, "act")
        self.td_error = tf.placeholder(tf.float32, None, "td_error")  # TD_error

        with tf.variable_scope('Actor'):
            l1 = tf.layers.dense(
                inputs=self.s,
                units=20,    # number of hidden units
                activation=tf.nn.relu,
                kernel_initializer=tf.random_normal_initializer(0., .1),    # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.acts_prob = tf.layers.dense(
                inputs=l1,
                units=n_actions,    # output units
                activation=tf.nn.softmax,   # get action probabilities
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='acts_prob'
            )

        with tf.variable_scope('exp_v'):
            log_prob = tf.log(self.acts_prob[0, self.a])
            self.exp_v = tf.reduce_mean(log_prob * self.td_error)  # advantage (TD_error) guided loss

        with tf.variable_scope('train'):
            self.train_op = tf.train.AdamOptimizer(lr).minimize(-self.exp_v)  # minimize(-exp_v) = maximize(exp_v)

    def learn(self, s, a, td):
        s = s[np.newaxis, :]
        feed_dict = {self.s: s, self.a: a, self.td_error: td}
        _, exp_v = self.sess.run([self.train_op, self.exp_v], feed_dict)
        return exp_v

    def choose_action(self, s):
        s = s[np.newaxis, :]
        probs = self.sess.run(self.acts_prob, {self.s: s})   # get probabilities for all actions
        return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())   

class Critic(object):
    def __init__(self, sess, n_features, lr=0.01):
        self.sess = sess

        self.s = tf.placeholder(tf.float32, [1, n_features], "state")
        self.v_ = tf.placeholder(tf.float32, [1, 1], "v_next")
        self.r = tf.placeholder(tf.float32, None, 'r')

        with tf.variable_scope('Critic'):
            l1 = tf.layers.dense(
                inputs=self.s,
                units=20,  # number of hidden units
                activation=tf.nn.relu,  # alternative: None (linear)
                # A linear layer would help guarantee the convergence of the actor,
                # but a linear approximator seems to hardly learn the correct Q.
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.v = tf.layers.dense(
                inputs=l1,
                units=1,  # output units
                activation=None,
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='V'
            )

        with tf.variable_scope('squared_TD_error'):
            self.td_error = self.r + GAMMA * self.v_ - self.v
            self.loss = tf.square(self.td_error)    # TD_error = (r+gamma*V_next) - V_eval
        with tf.variable_scope('train'):
            self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss)

    def learn(self, s, r, s_):
        s, s_ = s[np.newaxis, :], s_[np.newaxis, :]

        v_ = self.sess.run(self.v, {self.s: s_})
        td_error, _ = self.sess.run([self.td_error, self.train_op],
                                          {self.s: s, self.v_: v_, self.r: r})
        return td_error
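
To make the update concrete, here is a small worked example of the quantities the two classes compute, written in plain Python rather than TensorFlow. The reward, value estimates, and action probability are made-up numbers, and the helper names are hypothetical; only `GAMMA = 0.9` and the two formulas come from the code above.

```python
import math

GAMMA = 0.9  # same discount factor as in ac.py


def td_error(reward, v_next, v_now):
    """TD error used by the critic: (r + gamma * V(s')) - V(s)."""
    return reward + GAMMA * v_next - v_now


def actor_objective(prob_of_taken_action, td):
    """Actor objective exp_v = log(pi(a|s)) * td_error, which is maximized."""
    return math.log(prob_of_taken_action) * td


# Hypothetical numbers for a single step.
td = td_error(reward=20.0, v_next=10.0, v_now=25.0)  # 20 + 0.9 * 10 - 25
critic_loss = td ** 2                                 # squared TD error
exp_v = actor_objective(prob_of_taken_action=0.5, td=td)

print(td, critic_loss, exp_v)
```

A positive TD error means the step turned out better than the critic expected, so maximizing `exp_v` pushes up the probability of the action that was taken; a negative TD error pushes it down.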

### env.py
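My assumption is that a policy-gradient-based NN (Actor-Critic) can predict the next day's change in the stock market, and that the corresponding action (buy, sell, or do nothing) follows from the predicted change.

- To this end, I defined two strategies:
    1. 5-20 moving-average strategy: compute the 5-day and 20-day moving averages and, when they cross, check whether the 5-day average was above or below the 20-day average before the crossover.
        - If the 5-day average was the lower one, the price is expected to rise, so the next-day prediction should be great up (4).
        - Otherwise, the prediction should be great down (0).
    2. 10-day moving-average change strategy: compute the change in the 10-day moving average between yesterday and today, and compute the standard deviation of that daily change over the training data.
        - If the change is greater than `(standard deviation / 4) - mean daily change`, the state is up (3). Note that `get_env` as listed compares against `avg_diff_quantile[0]`, which is the full standard deviation.
        - If the change is less than `-(standard deviation / 4) + mean daily change`, the state is down (1).
        - Otherwise, the state is steady (2).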
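
The two rules can be sketched as a single labelling function. This is a simplified illustration, not the code in `env.py`: the names `moving_average` and `label_trend` are hypothetical, the thresholds are passed in instead of being derived from training data, and the 10-day average is not offset by the current opening price as it is in `get_env`.

```python
def moving_average(prices, day, period):
    """Mean of the last `period` prices up to `day`, clamping at index 0."""
    window = [prices[max(day - i, 0)] for i in range(period)]
    return sum(window) / period


def label_trend(closes, opens, day, up_threshold, down_threshold):
    """Return a label 0..4 (great down .. great up) for `day` (day >= 1)."""
    # Rule 1: 5-day vs 20-day crossover on closing prices.
    cross_now = moving_average(closes, day, 20) - moving_average(closes, day, 5)
    cross_prev = (moving_average(closes, day - 1, 20)
                  - moving_average(closes, day - 1, 5))
    if cross_now * cross_prev < 0:          # sign flip: the averages crossed
        return 4 if cross_now <= 0 else 0   # short average now on top -> 4

    # Rule 2: day-over-day change of the 10-day average of opening prices.
    change = moving_average(opens, day, 10) - moving_average(opens, day - 1, 10)
    if change > up_threshold:
        return 3
    if change < down_threshold:
        return 1
    return 2
```

For example, a flat price series yields label 2 (steady) on every day, because neither average moves and no crossover occurs.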

import pandas as pd
import numpy as np


plus = lambda value: value if value > 0 else 0

class Env:

    def __init__(self, data=None, period=10):
        self._DATA = data
        self._PERIOD = period
        self._SHORT_PERIOD = 5
        self._LONG_PERIOD = 20
        self.train_mean = None
        self.train_quantile = None
        # XXX
        self.error_count = [0,0,0]

    def load_data(self, data):
        self._DATA = data

    def data_len(self):
        return len(self._DATA['close'])

    def training_preprocess(self):
        all_moving_average = []
        all_moving_average_diff = []
        for index in range(self.data_len()):
            all_moving_average.append(np.mean([ self._DATA['open'][plus(index - i)] for i in range(self._PERIOD) ] ) - self._DATA['open'][0])
        for index in range(1,self.data_len()):
            all_moving_average_diff.append(all_moving_average[index] - all_moving_average[index - 1])
        print(all_moving_average_diff)
        self.train_quantile =  (np.percentile(all_moving_average_diff, 30), np.percentile(all_moving_average_diff, 40),
                            np.percentile(all_moving_average_diff, 60), np.percentile(all_moving_average_diff, 70))
        self.train_mean = np.mean(all_moving_average)
        self.avg_diff = np.mean(all_moving_average_diff)
        error = []
        for index in all_moving_average_diff:
            error.append(index- self.avg_diff)
        squaredError = []
        for index in error:
            squaredError.append(index ** 2)

        self.avg_diff_mse = np.mean(squaredError) ** 0.5
        self.avg_diff_std = np.std(all_moving_average_diff)
        self.avg_diff_quantile = (self.avg_diff_std,
                                  self.avg_diff_std/4 - self.avg_diff,
                                  (-self.avg_diff_std/4) + self.avg_diff,
                                  -self.avg_diff_std)

    def reset(self):
        long_moving_average = np.mean([ self._DATA['close'][plus(0 - i)] for i in range(self._LONG_PERIOD) ])
        short_moving_average = np.mean([ self._DATA['close'][plus(0 - i)] for i in range(self._SHORT_PERIOD) ])
        moving_average = np.mean([ self._DATA['open'][plus(0 - i)] for i in range(self._PERIOD) ]) - self._DATA['open'][0]
        cross_diff = long_moving_average - short_moving_average
        state = np.array([0, 0, 0, cross_diff, cross_diff])
        return (long_moving_average, short_moving_average, state)

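- Accordingly, each state consists of `10-day average`, `10-day average change`, `current-day label`, `today's 5-20 moving-average difference`, `yesterday's 5-20 moving-average difference`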

    def get_env(self, today, last_average, last_long_avg, last_short_avg):

        moving_average = np.mean( [self._DATA['open'][plus(today - i)] for i in range(self._PERIOD)] ) - self._DATA['open'][today]
        long_moving_average = np.mean([ self._DATA['close'][plus(today - i)] for i in range(self._LONG_PERIOD) ])
        short_moving_average = np.mean([ self._DATA['close'][plus(today - i)] for i in range(self._SHORT_PERIOD) ])
        moving_average_diff = moving_average - last_average
        current_cross_diff = long_moving_average - short_moving_average
        last_cross_diff = last_long_avg - last_short_avg

        if current_cross_diff * last_cross_diff < 0:
            if short_moving_average >= long_moving_average:
                action_real = 4
            else:
                action_real = 0
        else:
            if moving_average_diff > self.avg_diff_quantile[0]:
                action_real = 3
            # elif moving_average_diff < self.avg_diff_quantile[3]:
            #     action_real = 0
            elif moving_average_diff < self.avg_diff_quantile[2]:
                action_real = 1
            else:
                action_real = 2
        last_long_avg = long_moving_average
        last_short_avg = short_moving_average
        return (last_long_avg, last_short_avg, np.array([moving_average, moving_average_diff, action_real, current_cross_diff, last_cross_diff]))

#### reward

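- As for the reward: since we are "simulating" the stock-market environment, the predicted state is compared with the actual state computed by `get_env`. A mismatch is penalized (negative reward) and a match is rewarded (positive reward).
    - If the two states are the same, the reward is 20.
    - If the predicted and actual states differ by more than 1, the reward is -100.
    - If they differ by exactly 1, the reward is -50.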
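
As implemented in `step`, the reward depends only on the distance between the predicted and the actual label. A minimal sketch of that rule (the helper name `reward_for` is hypothetical):

```python
def reward_for(predicted, actual):
    """Reward for predicting label `predicted` when the label was `actual`."""
    distance = abs(actual - predicted)
    if distance == 0:
        return 20     # exact match
    if distance > 1:
        return -100   # off by two or more levels
    return -50        # off by exactly one level


# Rewards for every possible prediction when the actual state is steady (2).
print([reward_for(p, 2) for p in range(5)])  # [-100, -50, 20, -50, -100]
```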

    def step(self, today, action, action_real, last_average, last_long_avg, last_short_avg):

        last_long_avg_, last_short_avg_, n_state = self.get_env(today, last_average, last_long_avg, last_short_avg)
        if action_real == action:
            reward = 20
            self.error_count[1] += 1
        elif action_real - action > 1 or action_real - action < -1:
            reward = -100
            self.error_count[0] += 1
        else:
            reward = -50
            self.error_count[2] += 1

        return (last_long_avg_, last_short_avg_, reward, n_state)

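### stocktrader.py
- For modularity and ease of calculation, I wrote the `StockTrader` class in the `stocktrader.py` script. To make the meaning of each value explicit, I defined the `ACTION_LIST` dictionary, and I also wrote a `StockValueError` exception that is raised when an illegal action is attempted.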

ACTION_LIST = {'BUY': 1, 'IDLE': 0, 'SELL': -1}

class StockValueError(Exception):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        # value is an int from ACTION_LIST, so convert it before concatenating
        return repr('WRONG AT ' + str(self.value))


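- The `StockTrader` class has two notable methods: `predict_action()` and `reaction()`

    - `predict_action()` chooses an action based on the predicted state value (0~4):
        - 1: buy
        - 0: no action
        - -1: sell
    - After an action is chosen, `reaction()` computes the gain or loss it brings and updates the instance variable `self.current_state`. This state value is not the same thing as the action value, although both are integers between -1 and 1.
        - `current_state = 0` means no shares are held
        - 1 means shares are held
        - -1 means a short position (borrowed shares)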
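
The rules enforced by `reaction()` amount to a small state machine: the position moves by the action value and must stay within -1..1. A compact sketch of that observation, using the hypothetical helpers `next_position` and `cash_change` and the built-in `ValueError` in place of `StockValueError`:

```python
# Position codes follow the text: 1 = holding shares, 0 = none, -1 = short.
BUY, IDLE, SELL = 1, 0, -1


def next_position(position, action):
    """Return the position after `action`, or raise ValueError if illegal."""
    new_position = position + action
    if new_position not in (-1, 0, 1):
        raise ValueError(
            "illegal action {} while position is {}".format(action, position))
    return new_position


def cash_change(action, price):
    """Buying costs the price, selling earns it, idling changes nothing."""
    return -action * price


print(next_position(0, BUY), next_position(1, SELL), next_position(-1, BUY))
```

Buying while already holding shares, or selling while already short, would push the position outside -1..1, which matches the two cases where `reaction()` raises an error.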

class StockTrader():

    def __init__(self, init_state = 0, data=None):
        self.current_state = init_state
        self._state = [-1, 0, 1]
        self.money = 0
        self._DATA = data

    def load_data(self, data):
        self._DATA = data

    def data_len(self):
        return len(self._DATA['open'])

    def reset(self):
        self.record = 0
        self.current_state = 0

    def set_state(self, state):
        self.current_state = state

    def predict_action(self, trend, i):
        if self.current_state == 0:
            if trend > 2:
                # up
                # BUY
                action = ACTION_LIST['BUY']
            elif trend < 2:
                # large down
                action = ACTION_LIST['SELL']
            else:
                action = ACTION_LIST['IDLE']
        elif self.current_state == 1:
            if trend < 2:
                # large down
                # SOLD!
                action = ACTION_LIST['SELL']
            else:
                action = ACTION_LIST['IDLE']
        else:
            if trend > 2:
                # up
                action = ACTION_LIST['BUY']
            else:
                # maintain the
                action = ACTION_LIST['IDLE']
        return action

    def reaction(self, today, action):
        today_price = self._DATA['open'][today]
        if self.current_state == 0:
            if action == ACTION_LIST['BUY']:
                # Buy
                print('----\nTAKE ACTION {} in day{} \n----'.format(action, today))
                self.money -= today_price
                self.current_state = 1
            elif action == ACTION_LIST['IDLE']:
                pass
            else:
                # short
                print('----\nTAKE ACTION {} in day{} \n----'.format(action, today))
                self.money += today_price
                self.current_state = -1
        elif self.current_state == 1:
            if action == ACTION_LIST['SELL']:
                print('----\nTAKE ACTION {} in day{} \n----'.format(action, today))
                self.money += today_price
                self.current_state = 0
            elif action == ACTION_LIST['IDLE']:
                pass
            else:
                raise StockValueError(ACTION_LIST['BUY'])
        else:
            if action == ACTION_LIST['BUY']:
                print('----\nTAKE ACTION {} in day{} \n----'.format(action, today))
                self.money -= today_price
                self.current_state = 0
            elif action == ACTION_LIST['IDLE']:
                pass
            else:
                raise StockValueError(ACTION_LIST['SELL'])

    def get_money(self):
        return self.money

    def get_today_price(self, today):
        return self._DATA['open'][today]


### trader.py

`trader.py` imports the modules described above: `ac`, `stocktrader`, and `env`.

from env import Env
from stocktrader import (StockTrader, ACTION_LIST)
from ac import (Actor, Critic)
import pandas as pd
import numpy as np
import tensorflow as tf

import argparse

parser = argparse.ArgumentParser()

np.random.seed(2)
tf.set_random_seed(2)  # reproducible

MAX_EPISODE = 25
LR_A = 0.001    # learning rate for actor
LR_C = 0.01     # learning rate for critic
N_F = 5 # 5 features as state
N_A = 5 # 5 actions (the predicted stock states)

- Then perform some initialization

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('--training',
                       default='training_data.csv',
                       help='input training data file name')
    parser.add_argument('--testing',
                        default='testing_data.csv',
                        help='input testing data file name')
    parser.add_argument('--output',
                        default='output.csv',
                        help='output file name')
    args = parser.parse_args()
    training_data = pd.read_csv(args.training, names=['open', 'high', 'low', 'close'])
    env = Env()
    env.load_data(training_data)
    env.training_preprocess() # compute the training data quantile

    sess = tf.Session()

    actor = Actor(sess, n_features=N_F, n_actions=N_A, lr=LR_A)
    critic = Critic(sess, n_features=N_F, lr=LR_C)     # we need a good teacher, so the teacher should learn faster than the actor

    sess.run(tf.global_variables_initializer())


- Run learning and training

    for i_episode in range(MAX_EPISODE):
        (last_long_avg, last_short_avg, state) = env.reset()
        # reset training_env

        track_r = []
        total_action = [0,0,0,0,0]
        error_count = [0, 0, 0]
        env.error_count = [0,0,0]
        error_index = [0,0,0,0,0]
        error_index_2 = [0,0,0,0,0]
        error_index_cumulator = []
        for i in range(env.data_len()):

            action = actor.choose_action(state)
            (last_long_avg_, last_short_avg_, reward, state_) = env.step(i, action, state[2], state[0], last_long_avg, last_short_avg)

            track_r.append(reward)

            # gradient = grad[r + gamma * V(s_) - V(s)]
            td_error = critic.learn(state, reward, state_)
            # true_gradient = grad[logPi(s,a) * td_error]
            actor.learn(state, action, td_error)
            # XXX
            if (state[2] - action > 1) or (state[2] - action < -1):
                error_count[2] += 1
                error_index[action] += 1
                error_index_cumulator.append(action)
            elif state[2] != action:
                error_count[1] += 1
                error_index_2[action] += 1
            else:
                error_count[0] += 1
            total_action[action] +=1
            #
            state = state_
            last_long_avg = last_long_avg_
            last_short_avg = last_short_avg_


        ep_rs_sum = sum(track_r)
        print('the actions are \n 0: {0},1: {1}, 2: {2}, 3: {3}, 4: {4} '.format(total_action[0],total_action[1],total_action[2], total_action[3], total_action[4]))
        print('---\n0: {}, +-<1: {}, +->1: {}\n---'.format(error_count[0], error_count[1], error_count[2]))
        print('reward list:\n -100: {}, 20: {}, -50: {}'.format(env.error_count[0], env.error_count[1], env.error_count[2]))
        print('most error is {}'.format(error_index))
        print('less error is {}'.format(error_index_2))
        print('cumulator of most error: ', error_index_cumulator)
        print('---')
        print("episode:", i_episode, "  reward:", ep_rs_sum)


- Predict using the testing data

    # Testing
    testing_data = pd.read_csv(args.testing, names=['open', 'high', 'low', 'close'])
    output_file = open(args.output, 'w')
    trader = StockTrader()
    trader.load_data(testing_data)
    env.load_data(testing_data)
    (last_long_avg, last_short_avg, state) = env.reset()
    error_count = [0,0,0]
    error_index = [0,0,0,0,0]
    error_index_2 = [0,0,0,0,0]
    error_index_cumulator = []
    for i in range(trader.data_len()):
        if i > 0:
            ## act on the action predicted on the previous day
            ## e.g. if i == data_len() - 1 (269), the action predicted on day 268 is executed
            trader.reaction(i, predict_action)
            output_file.write(str(predict_action) + '\n')
            print('Day {}: your money is {} | open: {} | action : {}'.format(i, trader.get_money(), trader.get_today_price(i), predict_action))

        trend = actor.choose_action(state)
        predict_action = trader.predict_action(trend, i)
        (last_long_avg_, last_short_avg_, state_) = env.get_env(i, state[0], last_long_avg, last_short_avg)
        state = state_
        last_long_avg = last_long_avg_
        last_short_avg = last_short_avg_

        # XXX
        if (state[2] - trend > 1) or (state[2] - trend < -1):
            error_count[2] += 1
            error_index[trend] += 1
            error_index_cumulator.append(trend)
        elif state[2] - trend != 0:
            error_count[1] += 1
            error_index_2[trend] += 1
        else:
            error_count[0] += 1
    print('0: {}, +-<1: {}, +->1: {}'.format(error_count[0], error_count[1], error_count[2]))
    print('most error is {}'.format(error_index))
    print('less error is {}'.format(error_index_2))
    print('cumulator of most error: ', error_index_cumulator)

    output_file.close()

    if trader.current_state == -1:
        final_line = 'your final money is {}, the last close price is {}'.format(
            trader.get_money() - testing_data['close'][trader.data_len() - 1],
            testing_data['close'][trader.data_len() - 1])
    elif trader.current_state == 1:
        final_line = 'your final money is {}, the last close price is {}'.format(
            trader.get_money() + testing_data['close'][trader.data_len() - 1],
            testing_data['close'][trader.data_len() - 1])
    else:
        final_line = 'your final money is {}, the last close price is {}'.format(
            trader.get_money(),
            testing_data['close'][trader.data_len() - 1])
    print(final_line)
    print('the origin buy-and-hold is {}'.format((testing_data['close'][trader.data_len() - 1] - testing_data['open'][1])))
