<h1 style="text-align:center;">Stocks Trading Using RL</h1>

<br>

Rather than learning new methods to solve toy reinforcement learning (RL) problems in this chapter, we will try to utilize our deep Q-network (DQN) knowledge to deal with the much more practical problem of financial trading. I can't promise that the code will make you super rich on the stock market or Forex, because my goal is much less ambitious: to demonstrate how to go beyond the Atari games and apply RL to a different practical domain.

In this chapter, we will:

- Implement our own OpenAI Gym environment to simulate the stock market
- Apply the DQN method that you learned in Chapter 6, Deep Q-Networks, and Chapter 8, DQN Extensions, to train an agent to trade stocks to maximize profit

<br>

# 01. Trading

---

There are a lot of financial instruments traded on markets every day: goods, stocks, and currencies. Even weather forecasts can be bought or sold using so-called "weather derivatives," which is just a consequence of the complexity of the modern world and financial markets. If your income depends on future weather conditions, like a business growing crops, then you might want to hedge the risks by buying weather derivatives. All these different items have a price that changes over time. Trading is the activity of buying and selling financial instruments with different goals, like making a profit (investment), gaining protection from future price movement (hedging), or just getting what you need (like buying steel or exchanging USD for JPY to pay a contract).

Since the first financial market was established, people have been trying to predict future price movements, as this promises many benefits, like "profit from nowhere" or protecting capital from sudden market movements.

This problem is known to be complex, and there are a lot of financial consultants, investment funds, banks, and individual traders trying to predict the market and find the best moments to buy and sell to maximize profit.

The question is: can we look at the problem from the RL angle? Let's say that we have some observation of the market, and we want to make a decision: buy, sell, or wait. If we buy before the price goes up, our profit will be positive; otherwise, we will get a negative reward. What we're trying to do is get as much profit as possible. The connections between market trading and RL are quite obvious.

<br>

# 02. Data

---

In our example, we will use the Russian stock market prices from the period of 2015-2016, which are placed in Chapter08/data/ch08-small-quotes.tgz and have to be unpacked before model training.

Inside the archive, we have CSV files with M1 bars, which means that every row in each CSV file corresponds to a single minute in time, and price movement during that minute is captured with four prices: open, high, low, and close. Here, an open price is the price at the beginning of the minute, high is the maximum price during the interval, low is the minimum price, and the close price is the last price of the minute time interval. Every minute interval is called a bar and allows us to have an idea of price movement within the interval. For example, in the YNDX_160101_161231.csv file (which has Yandex company stocks for 2016), we have 130k lines in this form:

```
<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>
20160104,100100,1148.9,1148.9,1148.9,1148.9,0
20160104,100200,1148.9,1148.9,1148.9,1148.9,50
20160104,100300,1149.0,1149.0,1149.0,1149.0,33
20160104,100400,1149.0,1149.0,1149.0,1149.0,4
20160104,100500,1153.0,1153.0,1153.0,1153.0,0
20160104,100600,1156.9,1157.9,1153.0,1153.0,43
20160104,100700,1150.6,1150.6,1150.4,1150.4,5
20160104,100800,1150.2,1150.2,1150.2,1150.2,4
...
```

The first two columns are the date and time for the minute; the next four columns are open, high, low, and close prices; and the last value represents the number of buy and sell orders performed during the bar. The exact interpretation of this number is stock- and market-dependent, but usually, volumes give you an idea about how active the market was.
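
To make the row format concrete, here is a minimal sketch that parses such rows with Python's standard csv module. The function name `read_bars` and the two-row inline sample are our own illustration; the chapter's code loads the files through its own data module, which may work differently.

```python
import csv
import io

# Two rows copied from the excerpt above, used as an inline sample
SAMPLE = """<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>
20160104,100100,1148.9,1148.9,1148.9,1148.9,0
20160104,100600,1156.9,1157.9,1153.0,1153.0,43
"""


def read_bars(text_stream):
    """Parse M1 bars from a text stream into a list of dicts (hypothetical helper)."""
    reader = csv.reader(text_stream)
    # Header fields look like "<OPEN>"; strip the brackets and lowercase them
    header = [name.strip("<>").lower() for name in next(reader)]
    bars = []
    for row in reader:
        rec = dict(zip(header, row))
        bars.append({
            "date": rec["date"],
            "time": rec["time"],
            "open": float(rec["open"]),
            "high": float(rec["high"]),
            "low": float(rec["low"]),
            "close": float(rec["close"]),
            "vol": float(rec["vol"]),
        })
    return bars


bars = read_bars(io.StringIO(SAMPLE))
```

To read a real file, you would pass an opened file object instead of the `io.StringIO` wrapper.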

The typical way to represent those prices is called a candlestick chart, where every bar is shown as a candle. Part of Yandex's quotes for one day in February 2016 is shown in the following chart. The archive contains two files with M1 data for 2016 and 2015. We will use data from 2016 for model training and data from 2015 for validation.

<img width="700" src="assets/fig10.1.png">

<br>

# 03. Problem statements and key decisions

---

The finance domain is large and complex, so you can easily spend several years learning something new every day. In our example, we will just scratch the surface a bit with our RL tools, and our problem will be formulated as simply as possible, using price as an observation. We will investigate whether it will be possible for our agent to learn when the best time is to buy one single share and then close the position to maximize the profit. The purpose of this example is to show how flexible the RL model can be and what the first steps are that you usually need to take to apply RL to a real-life use case.

As you already know, to formulate RL problems, three things are needed: observation of the environment, possible actions, and a reward system. In previous chapters, all three were already given to us, and the internal machinery of the environment was hidden. Now we're in a different situation, so we need to decide ourselves what our agent will see and what set of actions it can take. The reward system is also not given as a strict set of rules; rather, it will be guided by our feelings and knowledge of the domain, which gives us lots of flexibility.

Flexibility, in this case, is good and bad at the same time. It's good that we have the freedom to pass some information to the agent that we feel will be important to learn efficiently. For example, you can pass to the trading agent not only prices, but also news or important statistics (which are known to influence financial markets a lot). The bad part is that this flexibility usually means that to find a good agent, you need to try a lot of variants of data representation, and it's not always obvious which will work better. In our case, we will implement the basic trading agent in its simplest form. The observation will include the following information:
- N past bars, where each has open, high, low, and close prices
- An indication that the share was bought some time ago (only one share at a time will be possible)
- Profit or loss that we currently have from our current position (the share bought)


At every step, after every minute's bar, the agent can take one of the following actions:
- __Do nothing:__ skip the bar without taking an action
- __Buy a share:__ if the agent has already got the share, nothing will be bought; otherwise, we will pay the commission, which is usually some small percentage of the current price
- __Close the position:__ if we do not have a previously purchased share, nothing will happen; otherwise, we will pay the commission for the trade
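
The action rules above can be sketched as a tiny state transition. This is only an illustration under our own naming: the `Action` enum and `apply_action` function are hypothetical, and the 0.1 default mirrors the commission percentage used later in the chapter.

```python
import enum


class Action(enum.Enum):
    SKIP = 0
    BUY = 1
    CLOSE = 2


def apply_action(have_position, action, commission_perc=0.1):
    """Return (new_have_position, commission_paid_in_percent)."""
    # Buying only has an effect when we hold no share yet
    if action is Action.BUY and not have_position:
        return True, commission_perc
    # Closing only has an effect when we hold a share
    if action is Action.CLOSE and have_position:
        return False, commission_perc
    # Everything else leaves the position unchanged and costs nothing
    return have_position, 0.0
```

Note that a repeated buy or a close without a position is silently ignored rather than treated as an error, which keeps the action space the same in every state.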


The reward that the agent receives can be expressed in various ways. On the one hand, we can split the reward into multiple steps during our ownership of the share. In that case, the reward on every step will be equal to the last bar's movement. On the other hand, the agent can receive the reward only after the close action, getting the full reward at once. At first sight, both variants should have the same final result, but maybe with different convergence speeds. However, in practice, the difference could be dramatic. We will implement both variants to compare them.
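
As a small numeric sketch of the two schemes, assume a share bought at 100.0 and held for three bars, with commission ignored. The function names are ours and the prices are made up for illustration.

```python
def reward_on_close(open_price, closes):
    """Single reward, in percent, paid when the position is closed."""
    return 100.0 * (closes[-1] / open_price - 1.0)


def rewards_per_bar(open_price, closes):
    """One reward per bar, in percent, equal to that bar's close-to-close move."""
    rewards = []
    prev = open_price
    for close in closes:
        rewards.append(100.0 * (close / prev - 1.0))
        prev = close
    return rewards


# Hypothetical close prices of the three bars after buying at 100.0
closes = [101.0, 102.0, 103.0]
total_at_close = reward_on_close(100.0, closes)
total_per_bar = sum(rewards_per_bar(100.0, closes))
```

Here the single reward is 3% while the per-bar rewards sum to slightly less (about 2.97%), because summed per-bar percentages do not compound. The totals are close, but the agent sees the signal at very different moments in the episode.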

One last decision to make is how to represent the prices in our environment observation. Ideally, we would like our agent to be independent of actual price values and take into account relative movement, such as "the stock has grown 1% during the last bar" or "the stock has lost 5%." This makes sense, as different stocks' prices can vary, but they can have similar movement patterns. In finance, there is a branch of analytics called technical analysis that studies such patterns to help to make predictions from them. We would like our system to be able to discover the patterns (if they exist). To achieve this, we will convert every bar's open, high, low, and close prices to three numbers showing high, low, and close prices represented as a percentage of the open price.

This representation has its own drawbacks, as we're potentially losing the information about key price levels. For example, it's known that markets have a tendency to bounce from round price numbers (like $8,000 per bitcoin) and levels that were turning points in the past. However, as already stated, we're just playing with the data here and checking the concept. Representation in the form of relative price movement will help the system to find repeating patterns in the price level (if they exist, of course), regardless of the absolute price position. Potentially, the NN could learn this on its own (it's just the mean price that needs to be subtracted from the absolute price values), but relative representation simplifies the NN's task.
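
The conversion to relative prices can be sketched as follows. This is a minimal illustration with hypothetical function names, not the chapter's loader. It stores the values as fractions of the open price (multiply by 100 to get percent), which is consistent with how the environment code later recovers the absolute close as open * (1 + relative close).

```python
def bar_to_relative(open_price, high, low, close):
    """Return (high, low, close) as fractional offsets from the open price."""
    return (
        (high - open_price) / open_price,
        (low - open_price) / open_price,
        (close - open_price) / open_price,
    )


def relative_to_absolute(open_price, rel_value):
    """Invert the transformation for a single relative value."""
    return open_price * (1.0 + rel_value)


# The 10:06 bar from the data excerpt: open, high, low, close
rel_high, rel_low, rel_close = bar_to_relative(1156.9, 1157.9, 1153.0, 1153.0)
```

For this bar, the high is roughly 0.09% above the open and the low and close are roughly 0.34% below it, whatever the absolute price level of the stock.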

<br>

# 04. The trading environment

---

As we have a lot of code (methods, utility classes in PTAN, and so on) that is supposed to work with OpenAI Gym, we will implement the trading functionality following Gym's Env class API, which should be familiar to you. Our environment is implemented in the StocksEnv class in the Chapter10/lib/environ.py module. It uses several internal classes to keep its state and encode observations. 

In [1]:
# Import the libraries
import gym
import gym.spaces
from gym.utils import seeding
from gym.envs.registration import EnvSpec
import enum
import numpy as np

from lib import data

In [2]:
# Bar count
DEFAULT_BARS_COUNT = 10

In [3]:
# Commission percentage (0.1 means 0.1% of the price per trade)
DEFAULT_COMMISSION_PERC = 0.1

In [4]:
# Class for possible actions
class Actions(enum.Enum):
    
    # Doing nothing
    Skip = 0
    
    # Buying 
    Buy = 1
    
    # Selling 
    Close = 2

In [5]:
# Class for states
class State:
    
    
    # Constructor
    def __init__(self, bars_count, commission_perc, reset_on_close, reward_on_close = True, volumes = True):
        
        # Testing
        assert isinstance(bars_count, int)
        assert bars_count > 0
        assert isinstance(commission_perc, float)
        assert commission_perc >= 0.0
        assert isinstance(reset_on_close, bool)
        assert isinstance(reward_on_close, bool)
        
        # Initialize the bar count
        self.bars_count = bars_count
        
        # Initialize the commission percentage
        self.commission_perc = commission_perc
        
        # Initialize the reset on close
        self.reset_on_close = reset_on_close
        
        # Initialize the reward on close
        self.reward_on_close = reward_on_close
        
        # Initialize the volumes
        self.volumes = volumes

        
    # Function for resetting
    def reset(self, prices, offset):
        """
        In the beginning, we don't have any shares bought, so our state has have_position=False and open_price=0.0.
        """
        
        # Testing
        assert isinstance(prices, data.Prices)
        assert offset >= self.bars_count - 1
        
        # Initialize have_position to false
        self.have_position = False
        
        # Initialize the open price to zero
        self.open_price = 0.0
        
        # Initialize the price to given price value
        self._prices = prices
        
        # Initialize the offset to offset value
        self._offset = offset


    # Expose the shape of the encoded state as a read-only property
    @property
    def shape(self):
        """
        This function returns the shape of the state representation in a NumPy array. The State class is encoded 
        into a single vector, which includes prices with optional volumes and two numbers indicating the presence 
        of a bought share and position profit.
        """
        
        # [h, l, c] * bars + position_flag + rel_profit
        
        # If volume is defined
        if self.volumes:
            
            return 4 * self.bars_count + 1 + 1,
        
        # If volume is NOT defined
        else:
            
            return 3 * self.bars_count + 1 + 1,

        
    # Function for encoding the prices to get the final observations
    def encode(self):
        """
        This method encodes prices at the current offset into a NumPy array, which will be the 
        observation of the agent.
        """
        
        # Instantiate a n-dimensional array
        res = np.ndarray(shape = self.shape, dtype = np.float32)
        
        # Initialize the shift with zero
        shift = 0
        
        # Loop over the bar index
        for bar_idx in range(-self.bars_count+1, 1):
            
            # Get the offset
            ofs = self._offset + bar_idx
            
            # Add high price to res
            res[shift] = self._prices.high[ofs]
            
            # Increment the shift
            shift += 1
            
            # Add low price to res
            res[shift] = self._prices.low[ofs]
            
            # Increment the shift
            shift += 1
            
            # Add close price to res
            res[shift] = self._prices.close[ofs]
            
            # Increment the shift
            shift += 1
            
            # If volume is defined
            if self.volumes:
                
                # Add volume to res
                res[shift] = self._prices.volume[ofs]
                
                # Increment the shift
                shift += 1
                
        # Add have_position to res
        res[shift] = float(self.have_position)
        
        # Increment the shift
        shift += 1
        
        # If no position is held
        if not self.have_position:
            
            # The relative profit is zero
            res[shift] = 0.0
            
        # If a position is held
        else:
            
            # Relative profit of the position: "(current close / open price) - 1"
            res[shift] = self._cur_close() / self.open_price - 1.0
            
        return res

    
    # Function for calculating the current bar's close price
    def _cur_close(self):
        """
        Calculate real close price for the current bar
        """
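        # Get the bar's absolute open price (named open_price to avoid shadowing the built-in open())
        open_price = self._prices.open[self._offset]
        
        # Get the bar's close price, stored relative to the open price
        rel_close = self._prices.close[self._offset]
        
        # Convert back to an absolute price: "open price * (1 + relative close)"
        return open_price * (1.0 + rel_close)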

    
    # Function for performing one step
    def step(self, action):
        """
        Perform one step in our price, adjust offset, check for the end of prices and handle position change.
        Said differently, this function is responsible for performing one step in our environment. On exit, it has 
        to return the reward in a percentage and an indication of the episode ending.
        
        ARGUMENTS
        ==========================
            - action
            
        RETURNS
        ==========================
            - reward
            - done
        """
        
        # Testing
        assert isinstance(action, Actions)
        
        # Initialize the reward with zero
        reward = 0.0
        
        # Initialize the done with False
        done = False
        
        # Get the current close price
        close = self._cur_close()
        
        # If the agent has decided to buy a share
        if (action == Actions.Buy) and (not self.have_position):
            
            # Change the state by setting have_position to true
            self.have_position = True
            
            # Record the entry price. We assume instant order execution at the current bar's close price,
            # which is a simplification on our side; normally, an order can be executed at a different
            # price, which is called price slippage.
            self.open_price = close
            
            # Pay the commission
            reward -= self.commission_perc
            
        # If the agent has decided to sell a share
        elif (action == Actions.Close) and (self.have_position):
            
            # Pay the commission
            reward -= self.commission_perc
            
            # Change the done flag by applying inplace bitwise OR operation to reset_on_close
            done |= self.reset_on_close
            
            # If reward on close is defined
            if self.reward_on_close:
                
                # Update reward
                reward += 100.0 * (close / self.open_price - 1.0)
                
            # Set have_position to false
            self.have_position = False
            
            # Set open price to zero
            self.open_price = 0.0

        # Increment offset
        self._offset += 1
        
        # Remember the close price of the bar we are leaving
        prev_close = close
        
        # Get the close price of the new current bar
        close = self._cur_close()
        
        # End the episode when we reach the last bar of the price data
        done |= self._offset >= self._prices.close.shape[0]-1

        # If have_position is true AND reward_on_close is false
        if (self.have_position) and (not self.reward_on_close):
            
            # Update reward
            reward += 100.0 * (close / prev_close - 1.0)

        return reward, done

In [6]:
# Class for changing the shape of state so it's suitable for 1D convolution
class State1D(State):
    """
    The shape of this representation is different, as our prices are encoded as a 2D matrix suitable for a 1D 
    convolution operator.
    """
    
    # Expose the shape of the encoded state as a read-only property
    @property
    def shape(self):
        
        # If volumes exists
        if self.volumes:
            
            # Return the shape
            return (6, self.bars_count)
        
        # If volumes does NOT exist
        else:
            
            # Return the shape
            return (5, self.bars_count)

        
    # Encoding the prices
    def encode(self):
        """
        This method encodes the prices in our matrix, depending on the current offset, whether we need volumes, 
        and whether we have stock.
        """
        
        # Initialize observations with zeros
        res = np.zeros(shape = self.shape, dtype = np.float32)
        
        # Get the start
        start = self._offset - (self.bars_count - 1)
        
        # Get the stop
        stop = self._offset + 1
        
        # Add the high price to res
        res[0] = self._prices.high[start:stop]
        
        # Add the low price to res
        res[1] = self._prices.low[start:stop]
        
        # Add the close price to res
        res[2] = self._prices.close[start:stop]
        
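        # If volumes are enabled
        if self.volumes:
            
            # Add the volume to res
            res[3] = self._prices.volume[start:stop]
            
            # The position rows start after the volume row
            dst = 4
            
        # If volumes are disabled
        else:
            
            # The position rows start right after the price rows
            dst = 3
            
        # If a position is held (otherwise both position rows stay zero)
        if self.have_position:
            
            # Fill the position-flag row with ones
            res[dst] = 1.0
            
            # Fill the next row with the relative profit of the position
            res[dst + 1] = self._cur_close() / self.open_price - 1.0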
            
        return res

In [7]:
# Class for stock market environment
class StocksEnv(gym.Env):
    
    # Set the meta data
    metadata = {'render.modes': ['human']}
    
    # Specification for a particular instance of the environment. Used to register the parameters for official evaluations.
    spec = EnvSpec("StocksEnv-v0")

    # Constructor
    def __init__(self, prices, bars_count = DEFAULT_BARS_COUNT, commission = DEFAULT_COMMISSION_PERC,
                 reset_on_close = True, state_1d = False, random_ofs_on_reset = True, reward_on_close = False,
                 volumes = False):
        """
        ARGUMENTS
        ===============================
            - prices: Contains one or more stock prices for one or more instruments as a dict, where keys are 
                      the instrument's name and the value is a container object data.Prices, which holds price 
                      data arrays.
                      
            - bars_count: The count of bars that we pass in the observation. By default, this is 10 bars.
            
            - commission: The percentage of the stock price that we have to pay to the broker on buying and selling 
                          the stock. By default, it's 0.1%.
                          
            - reset_on_close: If this parameter is set to True, which it is by default, every time the agent asks 
                              us to close the existing position (in other words, sell a share), we stop the episode. 
                              Otherwise, the episode will continue until the end of our time series, which is one 
                              year of data.
                              
            - state_1d: This Boolean argument switches between different representations of price data in the 
                        observation passed to the agent. If it is set to True, observations have a 2D shape, with 
                        different price components for subsequent bars organized in rows. For example, high prices 
                        (max price for the bar) are placed on the first row, low prices on the second, and close 
                        prices on the third. This representation is suitable for doing 1D convolution on time series, 
                        where every row in the data has the same meaning as different color planes (red, green, or 
                        blue) in Atari 2D images. If we set this option to False, we have one single array of data 
                        with every bar's components placed together. This organization is convenient for a fully 
                        connected network architecture. Both representations are illustrated in Figure 10.2.
                       
            - random_ofs_on_reset: If the parameter is True (by default), on every reset of the environment, the 
                                   random offset in the time series will be chosen. Otherwise, we will start from 
                                   the beginning of the data.
                                   
            - reward_on_close: This Boolean parameter switches between the two reward schemes discussed previously. 
                               If it is set to True, the agent will receive a reward only on the "close" action issue.
                               Otherwise, we will give a small reward every bar, corresponding to price movement 
                               during that bar.
                               
            - volumes: This argument switches on volumes in observations and is disabled by default.
        """
        
        # Testing
        assert isinstance(prices, dict)
        
        # Set the prices
        self._prices = prices
        
        # If state_1d is true
        if state_1d:
            
            # Set the 1D state
            self._state = State1D(bars_count, commission, reset_on_close, reward_on_close = reward_on_close, 
                                  volumes = volumes)
            
        # If state_1d is false
        else:
            
            # Set the state
            self._state = State(bars_count, commission, reset_on_close, reward_on_close = reward_on_close, 
                                volumes = volumes)
            
        # Set the action space
        self.action_space = gym.spaces.Discrete(n = len(Actions))
        
        # Set the observation space
        self.observation_space = gym.spaces.Box(low = -np.inf, high = np.inf, shape = self._state.shape, 
                                                dtype = np.float32)
        
        # Remember whether every reset should start at a random offset in the time series
        self.random_ofs_on_reset = random_ofs_on_reset
        
        # Set the seed
        self.seed()

        
    # Function for resetting
    def reset(self):
        
        # Select the instrument and its offset, then reset the state
        self._instrument = self.np_random.choice(list(self._prices.keys()))
        
        # Set the initial prices
        prices = self._prices[self._instrument]
        
        # Set the initial bars
        bars = self._state.bars_count
        
        # If random_ofs_on_reset is enabled
        if self.random_ofs_on_reset:
            
            # Pick a random offset, leaving room for the bars of the first observation
            offset = self.np_random.choice(prices.high.shape[0]-bars*10) + bars
            
        # If random_ofs_on_reset is disabled
        else:
            
            # Start from the beginning of the data
            offset = bars
            
        # Reset the states
        self._state.reset(prices, offset)
        
        return self._state.encode()

    
    # Function for handling the action chosen by the agent and return the next observation, reward, and done flag
    def step(self, action_idx):
        
        # Action
        action = Actions(action_idx)
        
        # Reward and done flag
        reward, done = self._state.step(action)
        
        # Observations
        obs = self._state.encode()
        
        # Extra info
        info = {"instrument": self._instrument, "offset": self._state._offset}
        
        return obs, reward, done, info

    
    # Function for rendering the current state
    def render(self, mode = 'human', close = False):
        """
        Render the current state in human or machine-readable format. For example, the market environment could 
        render current prices as a chart to visualize what the agent sees at that moment. Our environment doesn't 
        support rendering, so this method does nothing.
        """
        pass

    
    # Function for calling on the environment's destruction to free the allocated resources
    def close(self):
        pass

    
    # Function for setting the seeds
    def seed(self, seed = None):
        
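        # Create the random number generator and get the seed that was actually used
        self.np_random, seed1 = seeding.np_random(seed)
        
        # Derive a second seed from the first one
        seed2 = seeding.hash_seed(seed1 + 1) % 2 ** 31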
        
        return [seed1, seed2]

    
    # Alternative constructor: build the environment from a directory of price files
    @classmethod
    def from_dir(cls, data_dir, **kwargs):
        
        # Get the price dictionary (open, high, low, close, and volume) using load_relative() in data.py
        prices = {file: data.load_relative(file) for file in data.price_files(data_dir)}
        
        # Use cls so that subclasses get instances of their own type
        return cls(prices, **kwargs)
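
The two observation layouts described in the StocksEnv docstring can be sketched in plain Python, without volumes. The function names are hypothetical and the lists stand in for the NumPy arrays used by the State and State1D classes.

```python
def encode_flat(bars, have_position, rel_profit):
    """Flat layout: [h, l, c] per bar, then position flag and relative profit."""
    flat = []
    for high, low, close in bars:
        flat.extend([high, low, close])
    flat.append(1.0 if have_position else 0.0)
    flat.append(rel_profit if have_position else 0.0)
    return flat


def encode_rows(bars, have_position, rel_profit):
    """2D layout: one row per component, one column per bar."""
    n = len(bars)
    return [
        [bar[0] for bar in bars],                       # high prices
        [bar[1] for bar in bars],                       # low prices
        [bar[2] for bar in bars],                       # close prices
        [1.0 if have_position else 0.0] * n,            # position flag
        [rel_profit if have_position else 0.0] * n,     # relative profit
    ]


# Two made-up bars of relative (high, low, close) values, oldest first
bars = [(0.01, -0.01, 0.005), (0.02, 0.0, 0.01)]
flat = encode_flat(bars, True, 0.015)   # length 3 * 2 + 2
rows = encode_rows(bars, True, 0.015)   # 5 rows of 2 columns
```

In the 2D layout the position flag and profit are repeated across the whole row, so the convolution sees them at every time step.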

<br>

# 05. Models

---
In this example, two architectures of DQN are used: a simple feed-forward network with three layers and a network with 1D convolution as a feature extractor, followed by two fully connected layers to output Q-values. Both of them use the dueling architecture described in Chapter 8, DQN Extensions. Double DQN and two-step Bellman unrolling have also been used. The rest of the process is the same as in a classical DQN (from Chapter 6, Deep Q-Networks).

Both models are in Chapter10/lib/models.py and are very simple.
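
As a reminder of what the dueling heads compute, here is a minimal pure-Python sketch of how a state value and per-action advantages are combined into Q-values. The function name and the numbers are ours; the models do the same arithmetic on batched tensors.

```python
def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


# Made-up value and advantages for the three actions (skip, buy, close)
q_values = dueling_q_values(10.0, [1.0, 2.0, 3.0])
```

Subtracting the mean advantage makes the Q-values average to the state value, so the two heads cannot trade a constant offset between them.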

In [1]:
# Import the libraries
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

<br>

### 5.1. Model 1 - Simple Feed-Forward Network 

In [None]:
# Class for simple feed-forward network
class SimpleFFDQN(nn.Module):
    
    
    # Constructor
    def __init__(self, obs_len, actions_n):
        
        # Call the parent's constructor
        super(SimpleFFDQN, self).__init__()

        # Sequential layers estimating the state value V(s)
        self.fc_val = nn.Sequential(
            nn.Linear(obs_len, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        # Sequential layers estimating the advantage of each action
        self.fc_adv = nn.Sequential(
            nn.Linear(obs_len, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, actions_n)
        )

    
    # Function for doing feedforward
    def forward(self, x):
        
        # Feed forward and get the state value
        val = self.fc_val(x)
        
        # Feed forward and get the advantage values
        adv = self.fc_adv(x)
        
        # Combine into Q-values: Q(s, a) = V(s) + A(s, a) - mean(A)
        return val + (adv - adv.mean(dim = 1, keepdim = True))

<br>

### 5.2. Model 2 - CNN and ANN

The convolutional model has a common feature extraction layer with the 1D convolution operations and two fully connected heads to output the value of the state and advantages for actions.
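
The size of the flattened feature vector that feeds the two heads depends on the number of bars. The following sketch reproduces that arithmetic by hand, assuming the layer settings of the DQNConv1D and DQNConv1DLarge classes in this section (stride 1 and no padding for the convolutions, kernel 3 and stride 2 for the pooling). The helper names are ours; the models themselves obtain the size by pushing a zero tensor through the layers.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a 1D convolution with stride 1 and no padding."""
    out = length - kernel_size + 1
    if out < 1:
        raise ValueError(f"length {length} is too short for kernel {kernel_size}")
    return out


def maxpool1d_out_len(length, kernel_size, stride):
    """Output length of 1D max pooling with no padding."""
    if length < kernel_size:
        raise ValueError(f"length {length} is too short for kernel {kernel_size}")
    return (length - kernel_size) // stride + 1


def dqn_conv1d_features(bars_count):
    """Two convolutions with kernel 5 and 128 output channels."""
    length = conv1d_out_len(bars_count, 5)
    length = conv1d_out_len(length, 5)
    return 128 * length


def dqn_conv1d_large_features(bars_count):
    """Four conv(3) + pool(3, 2) pairs, then two more conv(3), 32 channels."""
    length = bars_count
    for _ in range(4):
        length = conv1d_out_len(length, 3)
        length = maxpool1d_out_len(length, 3, 2)
    for _ in range(2):
        length = conv1d_out_len(length, 3)
    return 32 * length


small = dqn_conv1d_features(10)          # 10 -> 6 -> 2 time steps
large = dqn_conv1d_large_features(125)   # shrinks to a single time step
```

By this arithmetic, the default of 10 bars is enough for the smaller model, while the larger one needs at least 125 bars before the last convolution still has input to work on.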

In [2]:
# Convolution and feedforward Network
class DQNConv1D(nn.Module):
    
    # Constructor
    def __init__(self, shape, actions_n):
        
        # Call the parent's constructor
        super(DQNConv1D, self).__init__()

        # CNN layers
        self.conv = nn.Sequential(
            nn.Conv1d(shape[0], 128, 5),
            nn.ReLU(),
            nn.Conv1d(128, 128, 5),
            nn.ReLU(),
        )

        # Output size of CNN
        out_size = self._get_conv_out(shape)

        # Fully connected layers estimating the state value V(s)
        self.fc_val = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        # ANN layers for getting advantage values
        self.fc_adv = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, actions_n)
        )

        
    # Function for getting the output size of CNN
    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    
    # Function for feedforward
    def forward(self, x):
        
        # Feed data into CNN + Change the output shape
        conv_out = self.conv(x).view(x.size()[0], -1)
        
        # Feed CNN's output into FCL and get the state value
        val = self.fc_val(conv_out)
        
        # Feed CNN's output into FCL and get the advantage values
        adv = self.fc_adv(conv_out)
        
        # Combine into Q-values: Q(s, a) = V(s) + A(s, a) - mean(A)
        return val + (adv - adv.mean(dim = 1, keepdim = True))

In [3]:
# Deep CNN + ANN Network
class DQNConv1DLarge(nn.Module):
    
    # Constructor
    def __init__(self, shape, actions_n):
        
        # Call the parent's constructor
        super(DQNConv1DLarge, self).__init__()

        # CNN layers
        self.conv = nn.Sequential(
            nn.Conv1d(shape[0], 32, 3),
            nn.MaxPool1d(3, 2),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.MaxPool1d(3, 2),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.MaxPool1d(3, 2),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.MaxPool1d(3, 2),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.ReLU(),
        )

        # Get the output size of CNN layer
        out_size = self._get_conv_out(shape)

        # Fully connected layers estimating the state value V(s)
        self.fc_val = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        # ANN layers for getting advantage values
        self.fc_adv = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, actions_n)
        )

        
    # Function for getting the output size of CNN layer
    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    
    # Function for doing feedforward
    def forward(self, x):
        
        # Run the convolutional layers and flatten the result per sample
        conv_out = self.conv(x).view(x.size()[0], -1)
        
        # Value head: scalar state value V(s)
        val = self.fc_val(conv_out)
        
        # Advantage head: one advantage A(s, a) per action
        adv = self.fc_adv(conv_out)
        
        # Dueling aggregation: Q(s, a) = V(s) + (A(s, a) - mean of A over actions)
        return val + (adv - adv.mean(dim=1, keepdim=True))

<br>
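
The _get_conv_out method finds the flattened size by pushing a zero tensor through the convolutions. The same length can be estimated by hand, which helps when choosing the number of bars. The sketch below assumes that every layer uses the PyTorch defaults visible in the class definition (no padding, dilation 1) and that the last dimension of shape is the number of bars. The helper conv_out_length and the LARGE list are illustrative names, not part of the chapter's code.

```python
def conv_out_length(length, layers):
    """Return the sequence length after a stack of (kernel, stride) layers.

    Uses L_out = (L_in - kernel) // stride + 1, which holds for Conv1d and
    MaxPool1d with no padding and dilation 1. Returns 0 as soon as the
    input becomes shorter than a kernel.
    """
    for kernel, stride in layers:
        if length < kernel:
            return 0
        length = (length - kernel) // stride + 1
    return length


# (kernel, stride) pairs read off the DQNConv1DLarge definition:
# four Conv1d(3) + MaxPool1d(3, 2) pairs, then two more Conv1d(3)
LARGE = [(3, 1), (3, 2)] * 4 + [(3, 1), (3, 1)]

print(conv_out_length(10, LARGE))   # 0
print(conv_out_length(125, LARGE))  # 1
```

Under these assumptions, a 10-bar window is too short for DQNConv1DLarge. By this arithmetic, the stack needs at least 125 bars to produce a non-empty output.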

# 06. Training code

---

We have two very similar training modules in this example: one for the feed-forward model and one for 1D convolutions. Neither adds anything new compared to our examples from Chapter 8, DQN Extensions:
- Both use epsilon-greedy action selection to perform exploration. Epsilon decays linearly from 1.0 to 0.1 over the first 1M steps.
- A simple experience replay buffer of size 100k is used, which is initially populated with 10k transitions.
- Every 1,000 steps, we sync the target network and calculate the mean value for a fixed set of states to check the dynamics of the Q-values during training.
- Every 10k steps (the validation handler in the code is attached to the ITERS_10000_COMPLETED event), we perform validation: 100 episodes are played on the training data and on previously unseen quotes. Characteristics of orders are recorded in TensorBoard, such as the mean profit, the mean count of bars, and the share held. This step allows us to check for overfitting.
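
The linear epsilon decay from the list above can be sketched in a few lines. This is an illustration of the schedule only, not the code of ptan's EpsilonTracker. The function name linear_epsilon is an assumption made for this example, while the constants match the hyperparameters used in the training modules.

```python
EPS_START = 1.0
EPS_FINAL = 0.1
EPS_STEPS = 1_000_000


def linear_epsilon(step, start=EPS_START, final=EPS_FINAL, steps=EPS_STEPS):
    """Linearly interpolate epsilon from `start` to `final` over `steps` steps."""
    # Clamp the progress to [0, 1] so epsilon stays at `final` after the decay
    fraction = min(max(step / steps, 0.0), 1.0)
    return start + (final - start) * fraction


for step in (0, 500_000, 1_000_000, 2_000_000):
    print(step, round(linear_epsilon(step), 3))
```

After the first 1M steps, the agent keeps acting randomly 10% of the time, so some exploration continues for the rest of the training.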

The training modules are in Chapter10/train_model.py (feed-forward model) and Chapter10/train_model_conv.py (with a 1D convolutional layer). Both versions accept the same command-line options.

To start the training, pass the training data with the --data option, which can be an individual CSV file or a whole directory of files. By default, the training module uses Yandex quotes for 2016 (file data/YNDX_160101_161231.csv). The validation data is set with the --val option, which takes Yandex 2015 quotes by default. The -r option is required: it passes the name of the run. This name is used in the TensorBoard run name and to create the directories with saved models.
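
As a rough illustration of the options described above, here is how such a command line could be declared with argparse. This is a sketch under assumptions, not the parser from the training modules: only the flag names --data, --val, and -r and the default file names come from this chapter, while the help strings, the dest name run, and the build_parser function are invented for the example.

```python
import argparse

DEFAULT_STOCKS = "data/YNDX_160101_161231.csv"
DEFAULT_VAL_STOCKS = "data/YNDX_150101_151231.csv"


def build_parser():
    """Build a parser with the three options described in the text."""
    parser = argparse.ArgumentParser(description="Train a DQN trading agent")
    parser.add_argument("--data", default=DEFAULT_STOCKS,
                        help="CSV file or directory with training quotes")
    parser.add_argument("--val", default=DEFAULT_VAL_STOCKS,
                        help="CSV file with validation quotes")
    # dest="run" is an illustrative choice, not taken from the training modules
    parser.add_argument("-r", dest="run", required=True,
                        help="run name for TensorBoard and the saves directory")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["-r", "test-run"])
print(args.data, args.val, args.run)
```

In the notebook cells that follow, the same three values are set directly through the data, val, and run_name variables instead of being read from the command line.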

<br>

### 6.1. Training Feed-Forward Model 

In [1]:
# Import the libraries
import ptan
import pathlib
import gym.wrappers
import numpy as np
import torch
import torch.optim as optim
from ignite.engine import Engine
from ignite.contrib.handlers import tensorboard_logger as tb_logger

from lib import environ, data_loader, models, common, validation

In [2]:
# Path for saving
SAVES_DIR = pathlib.Path("saves")

In [3]:
# Training stock data path
STOCKS = "data/YNDX_160101_161231.csv"

# Validation stock data path
VAL_STOCKS = "data/YNDX_150101_151231.csv"

In [4]:
# Hyperparameters
BATCH_SIZE = 32
BARS_COUNT = 10

EPS_START = 1.0
EPS_FINAL = 0.1
EPS_STEPS = 1000000

GAMMA = 0.99

REPLAY_SIZE = 100000
REPLAY_INITIAL = 10000
REWARD_STEPS = 2
LEARNING_RATE = 0.0001
STATES_TO_EVALUATE = 1000

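With REWARD_STEPS = 2, the training code passes steps_count = REWARD_STEPS to the experience source and then calls the loss function with gamma = GAMMA ** REWARD_STEPS. The reason is the n-step target: the rewards of the intermediate steps are already discounted and summed, so the bootstrapped value of the state reached n steps later has to be discounted by gamma to the power n. A minimal sketch of that target follows. The function n_step_target is a hypothetical helper for illustration and is not part of lib/common.py.

```python
GAMMA = 0.99
REWARD_STEPS = 2


def n_step_target(rewards, next_state_value, gamma=GAMMA):
    """Discounted sum of n rewards plus the bootstrapped value n steps ahead."""
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** n * next_state_value


# Two rewards collected in a row, then the estimated value of the state reached
print(n_step_target([1.0, 2.0], next_state_value=4.0, gamma=0.5))  # 3.0
```
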
In [None]:
# Start the program
if __name__ == "__main__":
    
    # Stock path
    data = STOCKS
    
    # Year to train on
    year = None
    
    # Validation path
    val = VAL_STOCKS
    
    # Run name
    run_name = "run_name"
    
    # Get the device type
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Path for saving
    saves_path = SAVES_DIR / f"simple-{run_name}"
    
    # Create the directory for saved models (if it doesn't exist)
    saves_path.mkdir(parents = True, exist_ok = True)

    # Training path
    data_path = pathlib.Path(data)
    
    # Validation path
    val_path = pathlib.Path(val)

    # If a year is given OR data_path points to a single file
    if (year is not None) or (data_path.is_file()):
        
        # If year is defined
        if year is not None:
            
            # Load the stock data
            stock_data = data_loader.load_year_data(year)
            
        # If year is NOT defined
        else:
            
            # Load the stock data
            stock_data = {"YNDX": data_loader.load_relative(data_path)}
            
        # Instantiate the environment for training
        env = environ.StocksEnv(stock_data, bars_count = BARS_COUNT)
        
        # Instantiate the environment for testing
        env_tst = environ.StocksEnv(stock_data, bars_count = BARS_COUNT)
        
    # If data_path is a directory
    elif data_path.is_dir():
        
        # Instantiate the environment for training
        env = environ.StocksEnv.from_dir(data_path, bars_count = BARS_COUNT)
        
        # Instantiate the environment for testing
        env_tst = environ.StocksEnv.from_dir(data_path, bars_count = BARS_COUNT)
        
    # Otherwise there is nothing to train on
    else:
        
        # Raise error
        raise RuntimeError("No data to train on")

    # Wrap the environment with time limit
    env = gym.wrappers.TimeLimit(env, max_episode_steps = 1000)
    
    # Get the validation data
    val_data = {"YNDX": data_loader.load_relative(val_path)}
    
    # Instantiate the environment for validation
    env_val = environ.StocksEnv(val_data, bars_count = BARS_COUNT)

    # Instantiate the source network
    net = models.SimpleFFDQN(env.observation_space.shape[0], env.action_space.n).to(device)
    
    # Instantiate the target network
    tgt_net = ptan.agent.TargetNet(net)

    # Instantiate the action selector (epsilon-greedy)
    selector = ptan.actions.EpsilonGreedyActionSelector(EPS_START)
    
    # Track epsilon
    eps_tracker = ptan.actions.EpsilonTracker(selector, EPS_START, EPS_FINAL, EPS_STEPS)
    
    # Instantiate the DQN agent
    agent = ptan.agent.DQNAgent(net, selector, device=device)
    
    # Instantiate the experience source
    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, GAMMA, steps_count = REWARD_STEPS)
    
    # Instantiate the experience replay buffer
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, REPLAY_SIZE)
    
    # Instantiate the adam optimizer
    optimizer = optim.Adam(net.parameters(), lr = LEARNING_RATE)

    
    # Function for processing each batch
    def process_batch(engine, batch):
        
        # Reset the gradients accumulated by the optimizer
        optimizer.zero_grad()
        
        # Calculate the loss
        loss_v = common.calc_loss(batch, net, tgt_net.target_model, gamma = GAMMA ** REWARD_STEPS, device = device)
        
        # Backpropagation
        loss_v.backward()
        
        # Do optimization
        optimizer.step()
        
        # Update epsilon according to the current iteration
        eps_tracker.frame(engine.state.iteration)

        # On the first call, fix a set of states for tracking the mean Q-value
        if getattr(engine.state, "eval_states", None) is None:
            
            # Sample from buffer
            eval_states = buffer.sample(STATES_TO_EVALUATE)
            
            # Convert each state into a numpy array (np.asarray avoids a copy when possible)
            eval_states = [np.asarray(transition.state) for transition in eval_states]
            
            # Store the evaluation states in the engine state
            engine.state.eval_states = np.asarray(eval_states)

        return {"loss": loss_v.item(), "epsilon": selector.epsilon,}

    # Instantiate the engine
    engine = Engine(process_batch)
    
    # Set up the Ignite handlers and get the TensorBoard logger
    tb = common.setup_ignite(engine, exp_source, f"simple-{run_name}", extra_metrics=('values_mean',))

    # Every 1,000 iterations: sync the target network and track the mean Q-value
    @engine.on(ptan.ignite.PeriodEvents.ITERS_1000_COMPLETED)
    def sync_eval(engine: Engine):
        
        # Copy the source network's weights into the target network
        tgt_net.sync()

        # Calculate the mean of values
        mean_val = common.calc_values_of_states(engine.state.eval_states, net, device = device)
        
        # Update the metrics in engine
        engine.state.metrics["values_mean"] = mean_val
        
        # On the first call, initialize the best mean value
        if getattr(engine.state, "best_mean_val", None) is None:
            
            # Update best mean value in engine
            engine.state.best_mean_val = mean_val
            
        # If the mean value is larger than the best mean value so far
        if engine.state.best_mean_val < mean_val:
            
            # Report
            print("%d: Best mean value updated %.3f -> %.3f" % (engine.state.iteration, engine.state.best_mean_val, mean_val))
            
            # Get the path
            path = saves_path / ("mean_value-%.3f.data" % mean_val)
            
            # Save the weights
            torch.save(net.state_dict(), path)
            
            # Update the best mean value in engine
            engine.state.best_mean_val = mean_val

    # Every 10,000 iterations: run validation on training and held-out data
    @engine.on(ptan.ignite.PeriodEvents.ITERS_10000_COMPLETED)
    def validate(engine: Engine):
        
        # Evaluate the model on env_tst, which is built from the training data
        res = validation.validation_run(env_tst, net, device = device)
        
        # Report
        print("%d: tst: %s" % (engine.state.iteration, res))
        
        # Loop over keys and items in res
        for key, val in res.items():
            
            # Update the metrics
            engine.state.metrics[key + "_tst"] = val
            
        # Test the model on validation set
        res = validation.validation_run(env_val, net, device=device)
        
        # Report
        print("%d: val: %s" % (engine.state.iteration, res))
        
        # Loop over keys and items in res
        for key, val in res.items():
            
            # Update the metrics
            engine.state.metrics[key + "_val"] = val
            
        # Get the validation reward
        val_reward = res['episode_reward']
        
        # On the first call, initialize the best validation reward
        if getattr(engine.state, "best_val_reward", None) is None:
            
            # Assign val_reward to best_val_reward
            engine.state.best_val_reward = val_reward
            
        # If val_reward is larger than best_val_reward
        if engine.state.best_val_reward < val_reward:
            
            # Report
            print("Best validation reward updated: %.3f -> %.3f, model saved" % (engine.state.best_val_reward, 
                                                                                 val_reward))
            
            # Assign val_reward to best_val_reward
            engine.state.best_val_reward = val_reward
            
            # Get the path
            path = saves_path / ("val_reward-%.3f.data" % val_reward)
            
            # Save the weight
            torch.save(net.state_dict(), path)

    # Event on which the test and validation metrics are logged (every 10,000 iterations)
    event = ptan.ignite.PeriodEvents.ITERS_10000_COMPLETED
    
    # Get the test metrics
    tst_metrics = [m + "_tst" for m in validation.METRICS]
    
    # Test handler
    tst_handler = tb_logger.OutputHandler(tag = "test", metric_names = tst_metrics)
    
    # Attach the test handler to the TensorBoard logger
    tb.attach(engine, log_handler = tst_handler, event_name = event)

    # Get the validation metrics
    val_metrics = [m + "_val" for m in validation.METRICS]
    
    # Validation handler
    val_handler = tb_logger.OutputHandler(tag = "validation", metric_names = val_metrics)
    
    # Attach the validation handler to the TensorBoard logger
    tb.attach(engine, log_handler = val_handler, event_name = event)

    # Run the engine
    engine.run(common.batch_generator(buffer, REPLAY_INITIAL, BATCH_SIZE))

Reading data/YNDX_160101_161231.csv
Read done, got 131542 rows, 99752 filtered, 0 open prices adjusted
Reading data/YNDX_150101_151231.csv
Read done, got 130566 rows, 104412 filtered, 0 open prices adjusted
Episode 100: reward=-0, steps=9, speed=0.0 f/s, elapsed=0:00:07
Episode 200: reward=0, steps=6, speed=0.0 f/s, elapsed=0:00:07
Episode 300: reward=-0, steps=11, speed=0.0 f/s, elapsed=0:00:08
Episode 400: reward=0, steps=7, speed=0.0 f/s, elapsed=0:00:08
Episode 500: reward=-0, steps=5, speed=0.0 f/s, elapsed=0:00:08
Episode 600: reward=-2, steps=2, speed=0.0 f/s, elapsed=0:00:08
Episode 700: reward=-0, steps=5, speed=0.0 f/s, elapsed=0:00:08
Episode 800: reward=-0, steps=2, speed=0.0 f/s, elapsed=0:00:08
Episode 900: reward=-0, steps=8, speed=0.0 f/s, elapsed=0:00:08
Episode 1000: reward=-0, steps=13, speed=0.0 f/s, elapsed=0:00:08
Episode 1100: reward=-0, steps=2, speed=0.0 f/s, elapsed=0:00:08
Episode 1200: reward=0, steps=4, speed=0.0 f/s, elapsed=0:00:08
Episode 1300: reward=0,

Episode 10100: reward=-0, steps=4, speed=54.6 f/s, elapsed=0:17:24
Episode 10200: reward=-1, steps=3, speed=54.5 f/s, elapsed=0:17:35
Episode 10300: reward=-0, steps=2, speed=54.4 f/s, elapsed=0:17:47
Episode 10400: reward=-0, steps=6, speed=54.4 f/s, elapsed=0:18:00
Episode 10500: reward=-0, steps=7, speed=54.3 f/s, elapsed=0:18:12
Episode 10600: reward=-1, steps=4, speed=54.2 f/s, elapsed=0:18:25
Episode 10700: reward=-0, steps=4, speed=54.2 f/s, elapsed=0:18:37
Episode 10800: reward=0, steps=8, speed=54.1 f/s, elapsed=0:18:50
Episode 10900: reward=-1, steps=4, speed=54.0 f/s, elapsed=0:19:03
Episode 11000: reward=-1, steps=5, speed=53.9 f/s, elapsed=0:19:14
Episode 11100: reward=1, steps=15, speed=53.9 f/s, elapsed=0:19:25
Episode 11200: reward=-0, steps=3, speed=53.8 f/s, elapsed=0:19:37
Episode 11300: reward=-0, steps=4, speed=53.7 f/s, elapsed=0:19:50
Episode 11400: reward=-0, steps=3, speed=53.6 f/s, elapsed=0:20:03
60000: tst: {'episode_reward': -0.11823248812868149, 'episode_s

Episode 19900: reward=0, steps=8, speed=41.3 f/s, elapsed=0:41:45
Episode 20000: reward=-0, steps=2, speed=41.1 f/s, elapsed=0:42:01
Episode 20100: reward=0, steps=9, speed=41.0 f/s, elapsed=0:42:18
Episode 20200: reward=-0, steps=3, speed=41.0 f/s, elapsed=0:42:33
Episode 20300: reward=-0, steps=3, speed=40.8 f/s, elapsed=0:42:51
Episode 20400: reward=-0, steps=2, speed=40.7 f/s, elapsed=0:43:08
Episode 20500: reward=-0, steps=8, speed=40.7 f/s, elapsed=0:43:26
Episode 20600: reward=-0, steps=4, speed=40.6 f/s, elapsed=0:43:41
Episode 20700: reward=0, steps=10, speed=40.6 f/s, elapsed=0:43:56
Episode 20800: reward=-0, steps=5, speed=40.5 f/s, elapsed=0:44:14
Episode 20900: reward=-0, steps=5, speed=40.3 f/s, elapsed=0:44:32
Episode 21000: reward=-0, steps=6, speed=40.3 f/s, elapsed=0:44:48
Episode 21100: reward=-0, steps=8, speed=40.2 f/s, elapsed=0:45:04
Episode 21200: reward=-0, steps=6, speed=40.2 f/s, elapsed=0:45:21
Episode 21300: reward=-1, steps=3, speed=40.1 f/s, elapsed=0:45:

Episode 29700: reward=0, steps=18, speed=38.1 f/s, elapsed=1:09:21
Episode 29800: reward=0, steps=3, speed=38.1 f/s, elapsed=1:09:38
Episode 29900: reward=-0, steps=3, speed=38.0 f/s, elapsed=1:09:56
Episode 30000: reward=-0, steps=4, speed=37.9 f/s, elapsed=1:10:16
Episode 30100: reward=-0, steps=2, speed=37.9 f/s, elapsed=1:10:32
Episode 30200: reward=-7, steps=10, speed=37.9 f/s, elapsed=1:10:49
Episode 30300: reward=-0, steps=4, speed=37.8 f/s, elapsed=1:11:05
Episode 30400: reward=-0, steps=3, speed=37.8 f/s, elapsed=1:11:23
Episode 30500: reward=-0, steps=3, speed=37.8 f/s, elapsed=1:11:39
Episode 30600: reward=-0, steps=4, speed=37.8 f/s, elapsed=1:11:53
Episode 30700: reward=-0, steps=4, speed=37.8 f/s, elapsed=1:12:10
Episode 30800: reward=-1, steps=2, speed=37.7 f/s, elapsed=1:12:31
Episode 30900: reward=1, steps=7, speed=37.7 f/s, elapsed=1:12:48
180000: tst: {'episode_reward': -0.009242630591963655, 'episode_steps': 15.82, 'order_profits': -0.009493421970382707, 'order_step

Episode 39500: reward=-0, steps=2, speed=36.6 f/s, elapsed=1:55:59
Episode 39600: reward=0, steps=4, speed=36.5 f/s, elapsed=1:56:20
Episode 39700: reward=-0, steps=12, speed=36.4 f/s, elapsed=1:56:39
Episode 39800: reward=-0, steps=10, speed=36.3 f/s, elapsed=1:56:57
Episode 39900: reward=-0, steps=4, speed=36.3 f/s, elapsed=1:57:14
Episode 40000: reward=0, steps=4, speed=36.3 f/s, elapsed=1:57:31
240000: tst: {'episode_reward': -0.11372361795930877, 'episode_steps': 7.31, 'order_profits': -0.11364899715566795, 'order_steps': 5.24}
240000: val: {'episode_reward': -0.11036891272248134, 'episode_steps': 6.17, 'order_profits': -0.11128866508097313, 'order_steps': 4.04}
Episode 40100: reward=0, steps=10, speed=36.2 f/s, elapsed=1:57:50
Episode 40200: reward=0, steps=2, speed=36.2 f/s, elapsed=1:58:07
Episode 40300: reward=0, steps=14, speed=36.2 f/s, elapsed=1:58:27
Episode 40400: reward=-1, steps=7, speed=36.2 f/s, elapsed=1:58:43
Episode 40500: reward=0, steps=4, speed=36.2 f/s, elapsed

<br>
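
The tst and val lines in the output above are plain Python dict literals prefixed by the iteration number, so they can be pulled out of a saved log to compare training and validation rewards. The sketch below assumes exactly the line format shown in the output. The names parse_validation_lines and SAMPLE_LOG are made up for this example, and lines that were cut off in the log are skipped.

```python
import ast
import re

# Matches lines such as "240000: tst: {...}" or "240000: val: {...}"
LINE_RE = re.compile(r"^(\d+): (tst|val): (\{.*\})\s*$")

SAMPLE_LOG = """\
Episode 40000: reward=0, steps=4, speed=36.3 f/s, elapsed=1:57:31
60000: tst: {'episode_reward': -0.11823248812868149, 'episode_s
240000: tst: {'episode_reward': -0.11372361795930877, 'episode_steps': 7.31, 'order_profits': -0.11364899715566795, 'order_steps': 5.24}
240000: val: {'episode_reward': -0.11036891272248134, 'episode_steps': 6.17, 'order_profits': -0.11128866508097313, 'order_steps': 4.04}
"""


def parse_validation_lines(lines):
    """Return {iteration: {"tst": metrics, "val": metrics}} for complete lines."""
    results = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # episode progress line or truncated output
        iteration, split, payload = int(match.group(1)), match.group(2), match.group(3)
        try:
            metrics = ast.literal_eval(payload)
        except (ValueError, SyntaxError):
            continue  # malformed dict literal
        results.setdefault(iteration, {})[split] = metrics
    return results


parsed = parse_validation_lines(SAMPLE_LOG.splitlines())
for iteration, splits in sorted(parsed.items()):
    if "tst" in splits and "val" in splits:
        gap = splits["tst"]["episode_reward"] - splits["val"]["episode_reward"]
        print(iteration, round(gap, 4))
```

A training reward that keeps rising while the validation reward falls would point to overfitting, which is what the periodic validation is meant to reveal.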

### 6.2. Training 1D Convolutions Model 

In [1]:
# Import the libraries
import ptan
import pathlib
import gym.wrappers
import numpy as np
import torch
import torch.optim as optim
from ignite.engine import Engine
from ignite.contrib.handlers import tensorboard_logger as tb_logger

from lib import environ, data_loader, models, common, validation

In [2]:
# Saving directory
SAVES_DIR = pathlib.Path("saves")

In [3]:
# Training data path
STOCKS = "data/YNDX_160101_161231.csv"

In [4]:
# Validation data path
VAL_STOCKS = "data/YNDX_150101_151231.csv"

In [5]:
# Hyperparameters
BATCH_SIZE = 32
BARS_COUNT = 10

EPS_START = 1.0
EPS_FINAL = 0.1
EPS_STEPS = 1000000

GAMMA = 0.99

REPLAY_SIZE = 100000
REPLAY_INITIAL = 10000
REWARD_STEPS = 2
LEARNING_RATE = 0.0001
STATES_TO_EVALUATE = 1000

In [None]:
# Start the program
if __name__ == "__main__":
    
    # Stock path
    data = STOCKS
    
    # Year to train on
    year = None
    
    # Validation path
    val = VAL_STOCKS
    
    # Run name
    run_name = "run_name"
    
    # Get the device type
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Saving path
    saves_path = SAVES_DIR / f"conv-{run_name}"
    
    # Make a directory for saving
    saves_path.mkdir(parents = True, exist_ok = True)

    # Training data path
    data_path = pathlib.Path(data)
    
    # Validation data path
    val_path = pathlib.Path(val)

    # If a year is given OR data_path points to a single file
    if (year is not None) or data_path.is_file():
        
        # If year is defined
        if (year is not None):
            
            # Load the training data
            stock_data = data_loader.load_year_data(year)
            
        # If year is NOT defined
        else:
            
            # Load the training data
            stock_data = {"YNDX": data_loader.load_relative(data_path)}
            
        # Instantiate the training environment
        env = environ.StocksEnv(stock_data, bars_count=BARS_COUNT, state_1d=True)
        
        # Instantiate the testing environment
        env_tst = environ.StocksEnv(stock_data, bars_count=BARS_COUNT, state_1d=True)
        
    # If data_path is a directory
    elif data_path.is_dir():
        
        # Instantiate the training environment
        env = environ.StocksEnv.from_dir(data_path, bars_count = BARS_COUNT, state_1d = True)
        
        # Instantiate the testing environment
        env_tst = environ.StocksEnv.from_dir(data_path, bars_count = BARS_COUNT, state_1d = True)
        
    # Otherwise there is nothing to train on
    else:
        
        # Raise error
        raise RuntimeError("No data to train on")

    # Wrap the environment with time limit
    env = gym.wrappers.TimeLimit(env, max_episode_steps = 1000)
    
    # Load the validation data
    val_data = {"YNDX": data_loader.load_relative(val_path)}
    
    # Instantiate the validation environment
    env_val = environ.StocksEnv(val_data, bars_count = BARS_COUNT, state_1d = True)

    # Instantiate the source network
    net = models.DQNConv1D(env.observation_space.shape, env.action_space.n).to(device)
    
    # Instantiate the target network
    tgt_net = ptan.agent.TargetNet(net)

    # Instantiate the action selector (epsilon-greedy)
    selector = ptan.actions.EpsilonGreedyActionSelector(EPS_START)
    
    # Instantiate the epsilon tracker
    eps_tracker = ptan.actions.EpsilonTracker(selector, EPS_START, EPS_FINAL, EPS_STEPS)
    
    # Instantiate the DQN agent
    agent = ptan.agent.DQNAgent(net, selector, device=device)
    
    # Instantiate the experience source (first-last)
    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, GAMMA, steps_count=REWARD_STEPS)
    
    # Instantiate the experience replay buffer
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, REPLAY_SIZE)
    
    # Instantiate the Adam optimizer
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

    # Function for processing each batch
    def process_batch(engine, batch):
        
        # Reset the gradients accumulated by the optimizer
        optimizer.zero_grad()
        
        # Calculate the loss
        loss_v = common.calc_loss(batch, net, tgt_net.target_model, gamma = GAMMA ** REWARD_STEPS, device = device)
        
        # Backpropagation
        loss_v.backward()
        
        # Do optimization
        optimizer.step()
        
        # Update epsilon according to the current iteration
        eps_tracker.frame(engine.state.iteration)

        # On the first call, fix a set of states for tracking the mean Q-value
        if getattr(engine.state, "eval_states", None) is None:
            
            # Sample from buffer
            eval_states = buffer.sample(STATES_TO_EVALUATE)
            
            # Convert each state into a numpy array (np.asarray avoids a copy when possible)
            eval_states = [np.asarray(transition.state) for transition in eval_states]
            
            # Stack all states into one numpy array and store it in the engine state
            engine.state.eval_states = np.asarray(eval_states)

        return {"loss": loss_v.item(), "epsilon": selector.epsilon,}

    # Instantiate the engine
    engine = Engine(process_batch)
    
    # Set up the Ignite handlers and get the TensorBoard logger
    tb = common.setup_ignite(engine, exp_source, f"conv-{run_name}", extra_metrics = ('values_mean',))

    # Every 1,000 iterations: sync the target network and track the mean Q-value
    @engine.on(ptan.ignite.PeriodEvents.ITERS_1000_COMPLETED)
    def sync_eval(engine: Engine):
        
        # Copy the source network's weights into the target network
        tgt_net.sync()

        # Get the mean of values
        mean_val = common.calc_values_of_states(engine.state.eval_states, net, device = device)
        
        # Update the metrics
        engine.state.metrics["values_mean"] = mean_val
        
        # On the first call, initialize the best mean value
        if getattr(engine.state, "best_mean_val", None) is None:
            
            # Assign mean_val to best_mean_val
            engine.state.best_mean_val = mean_val
            
        # If mean_val is larger than best_mean_val
        if engine.state.best_mean_val < mean_val:
            
            # Report
            print("%d: Best mean value updated %.3f -> %.3f" % (engine.state.iteration, 
                                                                engine.state.best_mean_val, 
                                                                mean_val))
            
            # Get the path
            path = saves_path / ("mean_value-%.3f.data" % mean_val)
            
            # Save the weights
            torch.save(net.state_dict(), path)
            
            # Update the best_mean_val in engine
            engine.state.best_mean_val = mean_val

    # Every 10,000 iterations: run validation on training and held-out data
    @engine.on(ptan.ignite.PeriodEvents.ITERS_10000_COMPLETED)
    def validate(engine: Engine):
        
        # Evaluate the model on env_tst, which is built from the training data
        res = validation.validation_run(env_tst, net, device=device)
        
        # Report
        print("%d: tst: %s" % (engine.state.iteration, res))
        
        # Loop over keys and values
        for key, val in res.items():
            
            # Update the metrics
            engine.state.metrics[key + "_tst"] = val
            
        # Test the model on validation set
        res = validation.validation_run(env_val, net, device=device)
        
        # Report
        print("%d: val: %s" % (engine.state.iteration, res))
        
        # Loop over keys and values
        for key, val in res.items():
            
            # Update the metrics
            engine.state.metrics[key + "_val"] = val
            
        # Get validation reward
        val_reward = res['episode_reward']
        
        # On the first call, initialize the best validation reward
        if getattr(engine.state, "best_val_reward", None) is None:
            
            # Assign val_reward to best_val_reward
            engine.state.best_val_reward = val_reward
            
        # If val_reward is greater than best_val_reward
        if engine.state.best_val_reward < val_reward:
            
            # Report
            print("Best validation reward updated: %.3f -> %.3f, model saved" % (engine.state.best_val_reward, 
                                                                                 val_reward))
            
            # Assign val_reward to best_val_reward
            engine.state.best_val_reward = val_reward
            
            # Saving path
            path = saves_path / ("val_reward-%.3f.data" % val_reward)
            
            # Save the weights
            torch.save(net.state_dict(), path)

    # Event on which the test and validation metrics are logged (every 10,000 iterations)
    event = ptan.ignite.PeriodEvents.ITERS_10000_COMPLETED
    
    # Get all the test metrics
    tst_metrics = [m + "_tst" for m in validation.METRICS]
    
    # Test handler
    tst_handler = tb_logger.OutputHandler(tag = "test", metric_names = tst_metrics)
    
    # Attach the test handler to the TensorBoard logger
    tb.attach(engine, log_handler=tst_handler, event_name=event)

    # Get all the validation metrics
    val_metrics = [m + "_val" for m in validation.METRICS]
    
    # Validation handler
    val_handler = tb_logger.OutputHandler(tag = "validation", metric_names = val_metrics)
    
    # Attach the validation handler to the TensorBoard logger
    tb.attach(engine, log_handler=val_handler, event_name=event)

    # Run the engine
    engine.run(common.batch_generator(buffer, REPLAY_INITIAL, BATCH_SIZE))

Reading data/YNDX_160101_161231.csv
Read done, got 131542 rows, 99752 filtered, 0 open prices adjusted
Reading data/YNDX_150101_151231.csv
Read done, got 130566 rows, 104412 filtered, 0 open prices adjusted
Episode 100: reward=-0, steps=11, speed=0.0 f/s, elapsed=0:00:13
Episode 200: reward=-0, steps=6, speed=0.0 f/s, elapsed=0:00:13
Episode 300: reward=-0, steps=2, speed=0.0 f/s, elapsed=0:00:13
Episode 400: reward=-0, steps=2, speed=0.0 f/s, elapsed=0:00:13
Episode 500: reward=-0, steps=4, speed=0.0 f/s, elapsed=0:00:13
Episode 600: reward=-0, steps=3, speed=0.0 f/s, elapsed=0:00:13
Episode 700: reward=0, steps=4, speed=0.0 f/s, elapsed=0:00:13
Episode 800: reward=-0, steps=3, speed=0.0 f/s, elapsed=0:00:13
Episode 900: reward=-0, steps=16, speed=0.0 f/s, elapsed=0:00:13
Episode 1000: reward=-0, steps=15, speed=0.0 f/s, elapsed=0:00:13
Episode 1100: reward=-0, steps=5, speed=0.0 f/s, elapsed=0:00:13
Episode 1200: reward=-0, steps=3, speed=0.0 f/s, elapsed=0:00:13
Episode 1300: reward

Episode 10100: reward=-0, steps=9, speed=48.6 f/s, elapsed=0:22:36
Episode 10200: reward=-1, steps=4, speed=48.6 f/s, elapsed=0:22:48
Episode 10300: reward=0, steps=9, speed=48.6 f/s, elapsed=0:22:59
Episode 10400: reward=-0, steps=12, speed=48.6 f/s, elapsed=0:23:11
Episode 10500: reward=-0, steps=4, speed=48.6 f/s, elapsed=0:23:24
Episode 10600: reward=0, steps=11, speed=48.6 f/s, elapsed=0:23:38
Episode 10700: reward=-0, steps=2, speed=48.6 f/s, elapsed=0:23:51
Episode 10800: reward=-0, steps=11, speed=48.6 f/s, elapsed=0:24:05
Episode 10900: reward=-0, steps=8, speed=48.6 f/s, elapsed=0:24:17
Episode 11000: reward=-0, steps=2, speed=48.6 f/s, elapsed=0:24:30
Episode 11100: reward=-0, steps=4, speed=48.5 f/s, elapsed=0:24:42
Episode 11200: reward=-0, steps=4, speed=48.5 f/s, elapsed=0:24:56
Episode 11300: reward=-0, steps=9, speed=48.5 f/s, elapsed=0:25:08
60000: tst: {'episode_reward': 0.1481011370630059, 'episode_steps': 285.25, 'order_profits': 0.13052907300045086, 'order_steps':

Episode 19800: reward=-0, steps=2, speed=38.9 f/s, elapsed=0:55:04
Episode 19900: reward=-0, steps=10, speed=38.9 f/s, elapsed=0:55:23
Episode 20000: reward=0, steps=7, speed=38.9 f/s, elapsed=0:55:41
Episode 20100: reward=1, steps=9, speed=38.9 f/s, elapsed=0:55:59
Episode 20200: reward=-0, steps=8, speed=38.9 f/s, elapsed=0:56:15
Episode 20300: reward=0, steps=4, speed=38.8 f/s, elapsed=0:56:32
Episode 20400: reward=1, steps=6, speed=38.8 f/s, elapsed=0:56:50
120000: tst: {'episode_reward': 0.0967929133922643, 'episode_steps': 314.98, 'order_profits': 0.07257138652029868, 'order_steps': 156.63}
120000: val: {'episode_reward': 0.13342067575363592, 'episode_steps': 315.02, 'order_profits': 0.10165642952697351, 'order_steps': 153.7070707070707}
Episode 20500: reward=-1, steps=25, speed=38.2 f/s, elapsed=0:58:17
Episode 20600: reward=0, steps=3, speed=38.2 f/s, elapsed=0:58:34
Episode 20700: reward=-0, steps=9, speed=38.3 f/s, elapsed=0:58:50
Episode 20800: reward=-0, steps=7, speed=38.3

180000: tst: {'episode_reward': -0.09321567300533565, 'episode_steps': 38.84, 'order_profits': -0.09572511615879524, 'order_steps': 32.57}
180000: val: {'episode_reward': 0.09456477651961395, 'episode_steps': 22.66, 'order_profits': 0.0905417472755056, 'order_steps': 18.96}
Episode 29600: reward=-1, steps=5, speed=36.4 f/s, elapsed=1:27:12
Episode 29700: reward=-0, steps=7, speed=36.3 f/s, elapsed=1:27:32
Episode 29800: reward=-0, steps=10, speed=36.3 f/s, elapsed=1:27:49
Episode 29900: reward=-1, steps=16, speed=36.3 f/s, elapsed=1:28:08
Episode 30000: reward=-0, steps=12, speed=36.3 f/s, elapsed=1:28:25
Episode 30100: reward=-0, steps=2, speed=36.2 f/s, elapsed=1:28:45
Episode 30200: reward=-0, steps=5, speed=36.1 f/s, elapsed=1:29:05
Episode 30300: reward=-0, steps=7, speed=36.1 f/s, elapsed=1:29:27
Episode 30400: reward=-0, steps=4, speed=36.1 f/s, elapsed=1:29:45
Episode 30500: reward=-1, steps=5, speed=36.1 f/s, elapsed=1:30:03
Episode 30600: reward=-0, steps=11, speed=36.0 f/s, 

Episode 39000: reward=-0, steps=4, speed=35.3 f/s, elapsed=1:57:08
Episode 39100: reward=-0, steps=9, speed=35.3 f/s, elapsed=1:57:26
Episode 39200: reward=-0, steps=6, speed=35.3 f/s, elapsed=1:57:45
Episode 39300: reward=0, steps=3, speed=35.3 f/s, elapsed=1:58:04
Episode 39400: reward=-0, steps=7, speed=35.2 f/s, elapsed=1:58:22
Episode 39500: reward=-0, steps=4, speed=35.0 f/s, elapsed=1:58:48
Episode 39600: reward=-0, steps=2, speed=35.0 f/s, elapsed=1:59:06
Episode 39700: reward=-0, steps=2, speed=35.0 f/s, elapsed=1:59:24
Episode 39800: reward=0, steps=3, speed=35.0 f/s, elapsed=1:59:40
Episode 39900: reward=-0, steps=6, speed=35.0 f/s, elapsed=1:59:58
Episode 40000: reward=-0, steps=8, speed=35.1 f/s, elapsed=2:00:16
Episode 40100: reward=0, steps=11, speed=35.1 f/s, elapsed=2:00:35
Episode 40200: reward=-0, steps=4, speed=35.2 f/s, elapsed=2:00:53
250000: tst: {'episode_reward': -0.07425977188820414, 'episode_steps': 12.21, 'order_profits': -0.07460277689730918, 'order_steps':

Episode 48800: reward=-0, steps=5, speed=36.8 f/s, elapsed=2:33:32
Episode 48900: reward=-0, steps=9, speed=36.8 f/s, elapsed=2:33:51
Episode 49000: reward=-0, steps=9, speed=36.8 f/s, elapsed=2:34:08
Episode 49100: reward=0, steps=12, speed=36.8 f/s, elapsed=2:34:27
Episode 49200: reward=-1, steps=4, speed=36.8 f/s, elapsed=2:34:48
310000: tst: {'episode_reward': 0.03842843471559472, 'episode_steps': 28.78, 'order_profits': 0.03542893470790201, 'order_steps': 21.82}
310000: val: {'episode_reward': -0.056307926657165164, 'episode_steps': 17.69, 'order_profits': -0.05828946025275457, 'order_steps': 13.5}
Episode 49300: reward=-0, steps=4, speed=36.6 f/s, elapsed=2:35:14
Episode 49400: reward=0, steps=6, speed=36.6 f/s, elapsed=2:35:35
Episode 49500: reward=-0, steps=2, speed=36.6 f/s, elapsed=2:35:55
Episode 49600: reward=0, steps=11, speed=36.5 f/s, elapsed=2:36:15
Episode 49700: reward=-1, steps=4, speed=36.5 f/s, elapsed=2:36:35
Episode 49800: reward=-0, steps=3, speed=36.5 f/s, elap

Episode 58200: reward=-0, steps=4, speed=34.2 f/s, elapsed=3:06:06
Episode 58300: reward=-0, steps=5, speed=34.2 f/s, elapsed=3:06:23
Episode 58400: reward=-0, steps=2, speed=34.3 f/s, elapsed=3:06:42
Episode 58500: reward=-0, steps=4, speed=34.3 f/s, elapsed=3:07:00
Episode 58600: reward=2, steps=9, speed=34.3 f/s, elapsed=3:07:22
Episode 58700: reward=-0, steps=4, speed=34.3 f/s, elapsed=3:07:42
Episode 58800: reward=-0, steps=7, speed=34.3 f/s, elapsed=3:08:02
Episode 58900: reward=0, steps=9, speed=34.4 f/s, elapsed=3:08:22
Episode 59000: reward=-0, steps=6, speed=34.4 f/s, elapsed=3:08:44
Episode 59100: reward=-1, steps=7, speed=34.4 f/s, elapsed=3:09:05
Episode 59200: reward=0, steps=4, speed=34.4 f/s, elapsed=3:09:27
Episode 59300: reward=-0, steps=2, speed=34.4 f/s, elapsed=3:09:49
380000: tst: {'episode_reward': -0.11199564744554008, 'episode_steps': 19.03, 'order_profits': -0.11347955399122694, 'order_steps': 11.23}
380000: val: {'episode_reward': -0.13742018577272447, 'episo

Episode 67600: reward=0, steps=12, speed=35.2 f/s, elapsed=3:39:13
Episode 67700: reward=0, steps=2, speed=35.2 f/s, elapsed=3:39:33
Episode 67800: reward=0, steps=5, speed=35.3 f/s, elapsed=3:39:55
Episode 67900: reward=-0, steps=3, speed=35.3 f/s, elapsed=3:40:17
Episode 68000: reward=-1, steps=17, speed=35.3 f/s, elapsed=3:40:41
Episode 68100: reward=0, steps=5, speed=35.2 f/s, elapsed=3:41:05
Episode 68200: reward=0, steps=13, speed=35.2 f/s, elapsed=3:41:25
Episode 68300: reward=0, steps=8, speed=35.2 f/s, elapsed=3:41:45
Episode 68400: reward=-0, steps=14, speed=35.3 f/s, elapsed=3:42:07
Episode 68500: reward=-0, steps=7, speed=35.3 f/s, elapsed=3:42:28
Episode 68600: reward=-0, steps=12, speed=35.3 f/s, elapsed=3:42:51
Episode 68700: reward=-0, steps=3, speed=35.3 f/s, elapsed=3:43:11
450000: tst: {'episode_reward': -0.013165386357452964, 'episode_steps': 14.1, 'order_profits': -0.013612858998874457, 'order_steps': 5.63}
450000: val: {'episode_reward': -0.12452477584810558, 'epi

Episode 77000: reward=-1, steps=9, speed=36.1 f/s, elapsed=4:13:36
Episode 77100: reward=-0, steps=7, speed=36.1 f/s, elapsed=4:13:58
Episode 77200: reward=-1, steps=12, speed=36.1 f/s, elapsed=4:14:24
Episode 77300: reward=-0, steps=5, speed=36.1 f/s, elapsed=4:14:46
Episode 77400: reward=-0, steps=3, speed=36.2 f/s, elapsed=4:15:02
520000: tst: {'episode_reward': -0.04956092626996725, 'episode_steps': 11.16, 'order_profits': -0.0501558336843849, 'order_steps': 5.01}
520000: val: {'episode_reward': 0.0037194583262562665, 'episode_steps': 7.44, 'order_profits': 0.0026525529439120656, 'order_steps': 3.4}
Episode 77500: reward=-1, steps=10, speed=36.2 f/s, elapsed=4:15:23
Episode 77600: reward=-0, steps=10, speed=36.3 f/s, elapsed=4:15:42
Episode 77700: reward=0, steps=5, speed=36.3 f/s, elapsed=4:16:05
Episode 77800: reward=-0, steps=9, speed=36.3 f/s, elapsed=4:16:32
Episode 77900: reward=-0, steps=8, speed=36.3 f/s, elapsed=4:16:54
Episode 78000: reward=-0, steps=5, speed=36.3 f/s, el

Episode 85700: reward=-1, steps=19, speed=36.2 f/s, elapsed=5:26:22
Episode 85800: reward=-0, steps=7, speed=36.3 f/s, elapsed=5:26:46
Episode 85900: reward=-1, steps=10, speed=36.3 f/s, elapsed=5:27:12
600000: tst: {'episode_reward': 0.002313408359763427, 'episode_steps': 57.1, 'order_profits': -0.0030654171059745837, 'order_steps': 51.99}
600000: val: {'episode_reward': 0.08294072644404606, 'episode_steps': 35.07, 'order_profits': 0.07638875560914027, 'order_steps': 31.22}
Episode 86000: reward=-0, steps=8, speed=36.2 f/s, elapsed=5:27:46
Episode 86100: reward=-0, steps=7, speed=36.2 f/s, elapsed=5:28:17
Episode 86200: reward=-0, steps=8, speed=36.2 f/s, elapsed=5:28:43
Episode 86300: reward=0, steps=7, speed=36.2 f/s, elapsed=5:29:15
Episode 86400: reward=-0, steps=6, speed=36.2 f/s, elapsed=5:29:41
Episode 86500: reward=-0, steps=11, speed=36.3 f/s, elapsed=5:30:06
Episode 86600: reward=-1, steps=4, speed=36.3 f/s, elapsed=5:30:36
Episode 86700: reward=0, steps=45, speed=36.3 f/s, 

690000: val: {'episode_reward': -0.007669903562203915, 'episode_steps': 19.99, 'order_profits': -0.007904272724344132, 'order_steps': 6.32}
Episode 93900: reward=0, steps=5, speed=35.6 f/s, elapsed=6:10:11
Episode 94000: reward=-1, steps=15, speed=35.6 f/s, elapsed=6:10:48
Episode 94100: reward=-1, steps=10, speed=35.6 f/s, elapsed=6:11:22
Episode 94200: reward=-0, steps=2, speed=35.6 f/s, elapsed=6:11:57
Episode 94300: reward=0, steps=20, speed=35.6 f/s, elapsed=6:12:31
Episode 94400: reward=-0, steps=7, speed=35.6 f/s, elapsed=6:13:01
Episode 94500: reward=-0, steps=4, speed=35.6 f/s, elapsed=6:13:33
Episode 94600: reward=-1, steps=29, speed=35.7 f/s, elapsed=6:14:07
700000: tst: {'episode_reward': -0.09356010700276145, 'episode_steps': 19.07, 'order_profits': -0.09438654645209638, 'order_steps': 10.16}
700000: val: {'episode_reward': 0.0519911441467967, 'episode_steps': 10.85, 'order_profits': 0.050527127862113816, 'order_steps': 4.25}
Episode 94700: reward=-0, steps=19, speed=35.6 

Episode 101400: reward=-0, steps=28, speed=34.7 f/s, elapsed=7:03:55
Episode 101500: reward=0, steps=9, speed=34.7 f/s, elapsed=7:04:45
Episode 101600: reward=-0, steps=2, speed=34.7 f/s, elapsed=7:05:38
Episode 101700: reward=0, steps=7, speed=34.7 f/s, elapsed=7:06:20
Episode 101800: reward=0, steps=15, speed=34.7 f/s, elapsed=7:07:10
Episode 101900: reward=-0, steps=15, speed=34.6 f/s, elapsed=7:46:10
810000: tst: {'episode_reward': -0.06847994401602567, 'episode_steps': 51.26, 'order_profits': -0.07258846788413703, 'order_steps': 40.45}
810000: val: {'episode_reward': -0.2057063352002053, 'episode_steps': 27.26, 'order_profits': -0.20817638319151166, 'order_steps': 19.08}
Episode 102000: reward=0, steps=8, speed=34.5 f/s, elapsed=7:47:19
Episode 102100: reward=1, steps=41, speed=34.5 f/s, elapsed=7:48:09
Episode 102200: reward=0, steps=9, speed=34.5 f/s, elapsed=7:48:57
Episode 102300: reward=-2, steps=29, speed=34.4 f/s, elapsed=7:49:42
Episode 102400: reward=1, steps=33, speed=34

Episode 107500: reward=0, steps=14, speed=32.6 f/s, elapsed=8:57:33
Episode 107600: reward=-1, steps=37, speed=32.7 f/s, elapsed=8:59:11
950000: tst: {'episode_reward': 0.05924275910151502, 'episode_steps': 62.02, 'order_profits': 0.05488487475532985, 'order_steps': 37.22222222222222}
950000: val: {'episode_reward': -0.12495185550290261, 'episode_steps': 37.96, 'order_profits': -0.1270155651748269, 'order_steps': 22.49}
Episode 107700: reward=0, steps=24, speed=32.8 f/s, elapsed=9:00:51
Episode 107800: reward=-0, steps=12, speed=32.9 f/s, elapsed=9:02:25
Episode 107900: reward=-1, steps=26, speed=33.0 f/s, elapsed=9:38:14
960000: tst: {'episode_reward': 0.0540840258067015, 'episode_steps': 73.28, 'order_profits': 0.050892932867146384, 'order_steps': 34.98}
960000: val: {'episode_reward': 0.10679937172385104, 'episode_steps': 40.69, 'order_profits': 0.1000055311244093, 'order_steps': 24.85}
Episode 108000: reward=-0, steps=24, speed=33.0 f/s, elapsed=10:26:58
Episode 108100: reward=1, s

Episode 112000: reward=0, steps=64, speed=30.8 f/s, elapsed=23:45:55
Episode 112100: reward=-1, steps=49, speed=30.8 f/s, elapsed=23:48:16
Episode 112200: reward=1, steps=33, speed=30.7 f/s, elapsed=23:50:42
1120000: tst: {'episode_reward': -0.368650213776004, 'episode_steps': 94.54, 'order_profits': -0.371705089821604, 'order_steps': 77.88}
1120000: val: {'episode_reward': 0.12844150243537483, 'episode_steps': 61.56, 'order_profits': 0.11730285141377741, 'order_steps': 52.72}
Episode 112300: reward=0, steps=91, speed=30.6 f/s, elapsed=23:53:54
Episode 112400: reward=1, steps=28, speed=30.6 f/s, elapsed=23:55:59
1130000: tst: {'episode_reward': 0.08322463770177403, 'episode_steps': 79.18, 'order_profits': 0.07696247940701074, 'order_steps': 52.98}
1130000: val: {'episode_reward': 0.05272128229088062, 'episode_steps': 41.12, 'order_profits': 0.046558346594485156, 'order_steps': 29.15}
Episode 112500: reward=0, steps=20, speed=30.5 f/s, elapsed=23:58:12
Episode 112600: reward=-0, steps=1

Episode 116500: reward=-0, steps=10, speed=31.6 f/s, elapsed=1 day, 1:21:31
Episode 116600: reward=-0, steps=13, speed=31.7 f/s, elapsed=1 day, 1:23:12
1290000: tst: {'episode_reward': 0.2740342231037586, 'episode_steps': 78.1, 'order_profits': 0.26405030406869257, 'order_steps': 71.59}
1290000: val: {'episode_reward': -0.07880559211274445, 'episode_steps': 64.36, 'order_profits': -0.0898939157330029, 'order_steps': 58.31}
Episode 116700: reward=-1, steps=19, speed=31.7 f/s, elapsed=1 day, 1:25:13
Episode 116800: reward=-0, steps=36, speed=31.7 f/s, elapsed=1 day, 1:26:51
Episode 116900: reward=0, steps=25, speed=31.8 f/s, elapsed=1 day, 1:28:35
1300000: tst: {'episode_reward': 0.23823301440808817, 'episode_steps': 84.97, 'order_profits': 0.22765580703972138, 'order_steps': 72.72}
1300000: val: {'episode_reward': -0.11208853220365865, 'episode_steps': 74.98, 'order_profits': -0.11165291232348885, 'order_steps': 65.3}
Episode 117000: reward=0, steps=5, speed=31.8 f/s, elapsed=1 day, 1:3

<br>

# 07. Testing

---

In [1]:
# Import the libraries
import numpy as np
import torch
import matplotlib as mpl

# Select the non-interactive backend before pyplot is imported
mpl.use("Agg")

import matplotlib.pyplot as plt
from lib import environ, data_loader, models

In [2]:
# Hyperparameter
EPSILON = 0.02

In [5]:
# Training data path
STOCKS = "data/YNDX_160101_161231.csv"

# Validation data path
VAL_STOCKS = "data/YNDX_150101_151231.csv"

In [None]:
# Start the program
if __name__ == "__main__":
    
    # CSV file with quotes to run the model
    data = STOCKS
    
    # Model file to load: set this to the path of a saved checkpoint
    # (torch.load below fails while this is None)
    model = None
    
    # Count of bars to feed into the model
    bars = 50
    
    # Name to use in output images
    name = "output_image"
    
    # Commission size in percent
    commission = 0.1
    
    # Use convolution model instead of FF
    conv = True

    # Load the dataset
    prices = data_loader.load_relative(data)
    
    # Instantiate the environment
    env = environ.StocksEnv({"TEST": prices}, 
                            bars_count = bars, 
                            reset_on_close = False, 
                            commission = commission,
                            state_1d = conv, 
                            random_ofs_on_reset = False, 
                            reward_on_close = False, 
                            volumes = False)
    
    # If "conv" is true
    if conv:
        
        # Instantiate the CNN network
        net = models.DQNConv1D(env.observation_space.shape, env.action_space.n)
        
    # If "conv" is false
    else:
        
        # Instantiate the ANN network
        net = models.SimpleFFDQN(env.observation_space.shape[0], env.action_space.n)

    # Load the weights
    net.load_state_dict(torch.load(model, map_location = lambda storage, loc: storage))

    # Reset the environment and get the observations
    obs = env.reset()
    
    # Get the start prices
    start_price = env._state._cur_close()

    # Initialize the total reward with zero
    total_reward = 0.0
    
    # Initialize the step index with zero
    step_idx = 0
    
    # Initialize the history of cumulative rewards with an empty list
    rewards = []

    # Infinite loop
    while True:
        
        # Increment the step index
        step_idx += 1
        
        # Convert the observation to a torch tensor with a batch dimension
        obs_v = torch.tensor(np.array([obs]))
        
        # Feedforward
        out_v = net(obs_v)
        
        # Get the action index
        action_idx = out_v.max(dim = 1)[1].item()
        
        # With probability EPSILON, explore
        if np.random.random() < EPSILON:
            
            # Set the action index randomly
            action_idx = env.action_space.sample()
            
        # Get the actual action to take
        action = environ.Actions(action_idx)

        # Take the action and get the next observation, the reward, and the done flag
        obs, reward, done, _ = env.step(action_idx)
        
        # Add reward to total reward
        total_reward += reward
        
        # Append the running total reward to the history
        rewards.append(total_reward)
        
        # Every 100 steps
        if step_idx % 100 == 0:
            
            # Report progress
            print("%d: reward=%.3f" % (step_idx, total_reward))
            
        # If terminal state
        if done:
            
            # Break the loop
            break

    # Visualization
    plt.clf()
    plt.plot(rewards)
    plt.title("Total reward, data=%s" % name)
    plt.ylabel("Reward, %")
    plt.savefig("rewards-%s.png" % name)

Reading data/YNDX_160101_161231.csv
Read done, got 131542 rows, 99752 filtered, 0 open prices adjusted
100: reward=-0.255
200: reward=-0.536
300: reward=-1.143
400: reward=-1.729
500: reward=-2.138
600: reward=-2.138
700: reward=-2.138
800: reward=-2.529
900: reward=-2.529
1000: reward=-2.529
1100: reward=-2.529
1200: reward=-2.811
1300: reward=-2.811
1400: reward=-3.367
1500: reward=-3.362
1600: reward=-3.362
1700: reward=-3.362
1800: reward=-3.362
1900: reward=-3.362
2000: reward=-3.383
2100: reward=-3.383
2200: reward=-3.383
2300: reward=-3.612
2400: reward=-4.106
2500: reward=-4.106
2600: reward=-4.106
2700: reward=-4.106
2800: reward=-4.106
2900: reward=-4.106
3000: reward=-4.106
3100: reward=-4.106
3200: reward=-4.008
3300: reward=-4.874
3400: reward=-5.322
3500: reward=-5.807
3600: reward=-5.818
3700: reward=-6.167
3800: reward=-6.167
3900: reward=-6.123
4000: reward=-6.470
4100: reward=-6.470
4200: reward=-6.470
4300: reward=-6.470
4400: reward=-6.670
4500: reward=-6.670
4600: 

<br>

# 08. Results

---

Let's now take a look at the results.

<br>

### 08.1. The feed-forward model

---

Convergence on one year of Yandex data requires about 10M training steps, which can take a while (a GTX 1080 Ti trains at 230-250 steps per second). During the training, we have several charts in TensorBoard showing us what's going on.

<img width="700" src="assets/fig10.3.png">

<img width="700" src="assets/fig10.4.png">

The two preceding charts show the reward for episodes played during the training and the reward obtained from testing (which is done on the same quotes, but with epsilon=0). From them, we see that our agent is learning how to increase the profit from its actions over time.
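The only difference between the training episodes and the test runs mentioned above is the exploration rate. The following is a minimal sketch of epsilon-greedy selection over a plain list of Q-values; `select_action` is a hypothetical helper written for illustration, not a function from the chapter's code.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value
    return max(range(len(q_values)), key=lambda i: q_values[i])


q = [0.1, 0.7, 0.2]

# With epsilon=0 the choice is always greedy, as in the test runs
print(select_action(q, epsilon=0.0))
```

With `epsilon=0.0` the random branch is never taken, so the call always returns the index of the largest Q-value. A small positive value, like the 0.02 used in the testing cell, occasionally replaces it with a random action.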

<img width="700" src="assets/fig10.5.6.png">

The lengths of episodes also increased after 1M training iterations. The number of values predicted by the network is growing.

<img width="700" src="assets/fig10.7.png">

Figure 10.7 shows quite important information: the amount of reward obtained during the training on the validation set (which is quotes from 2015 by default). This reward doesn't have as obvious a trend as the reward on the training data. This might be an indication of overfitting of the agent, which starts after 3M training iterations. Still, the reward stays above the -0.2% line (which is the broker commission in our environment), which means that our agent is better than a random "buying and selling monkey."
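One way to read the 0.2% reference level, assuming the commission is charged once when a position is opened and once when it is closed: with the 0.1% default, a trade that sees no price change loses about 0.2%. The sketch below is a simplified approximation, and the function name is illustrative rather than taken from the environment code.

```python
def round_trip_return_pct(open_price, close_price, commission_pct):
    """Percentage return of one buy-then-sell trade, commission paid on both orders."""
    price_change_pct = 100.0 * (close_price - open_price) / open_price
    # Assumption: the same percentage commission applies to the buy and to the sell
    return price_change_pct - 2 * commission_pct


# Flat price: the trade only pays the two commissions
print(round_trip_return_pct(100.0, 100.0, 0.1))  # -0.2

# A 0.5% price rise leaves a smaller profit after commissions
print(round_trip_return_pct(100.0, 100.5, 0.1))
```

Under this assumption, an agent that opens and closes positions at random would lose roughly the round-trip commission on each trade on average.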

During the training, our code saves models for later experiments. It does this every time the mean Q-value on our set of held-out states reaches a new maximum or the reward on the validation set beats the previous record. There is a tool that loads the model, trades on the prices that you provide with a command-line option, and draws plots of the profit change over time. The tool is called Chapter10/run_model.py and it can be used as shown here:
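The saving rule described above can be sketched as two independent "best so far" trackers. This is a minimal illustration under stated assumptions: `BestValueTracker`, the metric values, and the `saved` list (which stands in for writing model files) are all hypothetical and are not the chapter's training code.

```python
class BestValueTracker:
    """Remember the best value seen so far and report when it is beaten."""

    def __init__(self):
        self.best = None

    def update(self, value):
        """Return True if `value` sets a new record (the first value always does)."""
        if self.best is None or value > self.best:
            self.best = value
            return True
        return False


best_mean_q = BestValueTracker()
best_val_reward = BestValueTracker()
saved = []  # stands in for writing model files to disk

# Made-up (step, mean Q-value, validation reward) measurements
history = [(1, 0.10, -0.30), (2, 0.08, -0.10), (3, 0.15, -0.20)]

for step, mean_q, val_reward in history:
    if best_mean_q.update(mean_q):
        saved.append(("mean_q", step))
    if best_val_reward.update(val_reward):
        saved.append(("val_reward", step))

print(saved)
```

Keeping the two criteria separate means that a model can be saved for a record Q-value even when its validation reward is not a record, and the other way around.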

```
$ ./run_model.py -d data/YNDX_160101_161231.csv -m saves/ff-YNDX16/mean_val-0.332.data -b 10 -n test
```

The options that the tool accepts are as follows:
- `-d`: This is the path to the quotes to use. In the preceding example, we apply the model to the data that it was trained on.
- `-m`: This is the path to the model file. By default, the training code saves it in the saves dir.
- `-b`: This sets how many bars to pass to the model in the context. It has to match the count of bars used in training, which is 10 by default and can be changed in the training code.
- `-n`: This is the suffix to be appended to the names of the images produced.
- `--commission`: This allows you to redefine the broker's commission, which has a default of 0.1%.
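The options above could be declared with `argparse` roughly as follows. This is a sketch, not the actual contents of run_model.py: the long option names (`--data`, `--model`, `--bars`, `--name`), the `required` flags, and the `-b` default are assumptions made for illustration; only the short flags and `--commission` appear in the text.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options described in the text."""
    parser = argparse.ArgumentParser(description="Run a saved model on a quotes file")
    # Long names and required/default settings below are assumptions
    parser.add_argument("-d", "--data", required=True, help="CSV file with quotes")
    parser.add_argument("-m", "--model", required=True, help="Model file to load")
    parser.add_argument("-b", "--bars", type=int, default=10,
                        help="Count of bars fed to the model")
    parser.add_argument("-n", "--name", required=True,
                        help="Suffix for the names of the output images")
    parser.add_argument("--commission", type=float, default=0.1,
                        help="Commission size in percent")
    return parser


# Parse the example command line from the text instead of sys.argv
args = build_parser().parse_args([
    "-d", "data/YNDX_160101_161231.csv",
    "-m", "saves/ff-YNDX16/mean_val-0.332.data",
    "-b", "10",
    "-n", "test",
])
print(args.bars, args.name, args.commission)
```

Passing an explicit argument list to `parse_args` makes the sketch runnable inside a notebook cell, where `sys.argv` belongs to the kernel.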

At the end, the tool creates a chart of the total profit dynamics (in percentages). The following is the reward chart on Yandex 2016 quotes (used for training).

<img width="600" src="assets/fig10.8.png">

The result looks amazing: more than 200% profit in a year. However, let's look at what will happen with the 2015 data:

<img width="600" src="assets/fig10.9.png">

This result is much worse, as we've seen from the validation plots in TensorBoard. To check that our system is profitable with zero commission, we must rerun on the same data with the `--commission 0.0` option.

<img width="600" src="assets/fig10.10.png">

We have some bad days with drawdown, but the overall results are good: without commission, our agent can be profitable. Of course, the commission is not the only issue. Our order simulation is very primitive and doesn't take into account real-life situations, such as the price spread and slippage in order execution.

If we take the model with the best reward on the validation set, the reward dynamics are a bit better. Profitability is lower, but the drawdown on unseen quotes is much less.

<img width="600" src="assets/fig10.11.12.png">

<br>

### 08.2. The convolution model

---

The second model implemented in this example uses 1D convolution filters to extract features from the price data. This allows us to increase the number of bars in the context window that our agent sees on every step without a significant increase in the network size. By default, the convolution model example uses 50 bars of context. The training code is in Chapter10/train_model_conv.py, and it accepts the same set of command-line parameters as the feed-forward version.
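A rough parameter count shows why a longer context window is cheaper for a convolution layer than for a dense input layer. All sizes below (features per bar, hidden width, number of filters, kernel size) are hypothetical values chosen for illustration and are not the sizes used in the chapter's models.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 1D convolution layer."""
    return out_channels * in_channels * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


FEATURES_PER_BAR = 3  # hypothetical: e.g. relative high, low, and close
HIDDEN = 512          # hypothetical width of the first dense layer
CHANNELS = 128        # hypothetical number of convolution filters
KERNEL = 5            # hypothetical kernel size

for bars in (10, 50):
    dense = linear_params(bars * FEATURES_PER_BAR, HIDDEN)
    conv = conv1d_params(FEATURES_PER_BAR, CHANNELS, KERNEL)
    print(f"bars={bars}: first dense layer={dense}, first conv layer={conv}")
```

The first dense layer grows linearly with the number of bars, while the convolution layer's size depends only on the channel counts and the kernel size. Any dense layer placed after the flattened convolution output still depends on the window length, so the total size does grow somewhat.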

Training dynamics are almost identical, but the reward obtained on the validation set is slightly higher and starts to overfit later.

<img width="600" src="assets/fig10.13.png">

<img width="600" src="assets/fig10.14.png">

<br>

# 08. Things to try

---

As already mentioned, financial markets are large and complicated. The methods that we've tried are just the very beginning. Using RL to create a complete and profitable trading strategy is a large project, which can take several months of dedicated labor. However, there are things that we can try to get a better understanding of the topic:
- Our data representation is definitely not perfect. We don't take into account significant price levels (support and resistance), round price values, and others. Incorporating them into the observation could be a challenging problem.
- Market prices are usually analyzed at different timeframes. Low-level data like one-minute bars are noisy (as they include lots of small price movements caused by individual trades), and it is like looking at the market using a microscope. At larger scales, such as one-hour or one-day bars, you can see large, long trends in data movement, which could be extremely important for price prediction.
- More training data is needed. One year of data for one stock is just 130k bars, which might not be enough to capture all market situations. Ideally, a real-life agent should be trained on a much larger dataset, such as the prices for hundreds of stocks for the past 10 years.
- Experiment with the network architecture. The convolution model has shown much faster convergence than the feed-forward model, but there are a lot of things to optimize: the count of layers, kernel size, residual architecture, attention mechanism, and so on.
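As a starting point for the timeframe idea in the list above, one-minute bars can be merged into coarser bars. The sketch below assumes each bar is an `(open, high, low, close, volume)` tuple and groups a fixed number of consecutive bars; `resample_bars` is a hypothetical helper, and real data would also need alignment to clock boundaries and handling of gaps.

```python
def resample_bars(bars, group_size):
    """Merge consecutive (open, high, low, close, volume) bars into larger bars."""
    merged = []
    for start in range(0, len(bars), group_size):
        chunk = bars[start:start + group_size]
        merged.append((
            chunk[0][0],                # open of the first bar
            max(b[1] for b in chunk),   # highest high
            min(b[2] for b in chunk),   # lowest low
            chunk[-1][3],               # close of the last bar
            sum(b[4] for b in chunk),   # total volume
        ))
    return merged


# Made-up one-minute bars
minute_bars = [
    (100.0, 101.0, 99.5, 100.5, 10),
    (100.5, 102.0, 100.0, 101.5, 20),
    (101.5, 101.8, 100.8, 101.0, 15),
    (101.0, 101.2, 99.0, 99.5, 30),
]

print(resample_bars(minute_bars, 2))
```

A trailing group with fewer than `group_size` bars is kept as a shorter bar. Feeding both the original and the merged series into the observation is one possible way to give the agent a multi-timeframe view.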


<br>

# 09. Summary

---
In this chapter, we saw a practical example of RL and implemented the trading agent and custom Gym environment. We tried two different architectures: a feed-forward network with price history on input and a 1D convolution network. Both architectures used the DQN method, with some extensions described in Chapter 8, DQN Extensions.

This is the last chapter in part two of this book. In part three, we will talk about a different family of RL methods: policy gradients. We've touched on this approach a bit, but in the upcoming chapters, we will go much deeper into the subject, covering the REINFORCE method and the best method in the family: A3C.

# The End!