# Task 1.1

**Goal**

The network is trained to follow the correct sequence of actions on the exchange: wait to buy -> buy -> wait to sell -> sell. Profit is not taken into account.
 
**Observation**
 - trade state (open / closed)

**Reward/Penalty**

The agent receives a positive reward when a trade is closed.
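The task can be pictured as a two-state machine. The sketch below is a toy stand-in, not the `TradeEnv` used later: the class `ToySequenceEnv`, the `Action` ids and the helper `run_episode` are hypothetical names, and the reward values (0 for waiting, opening and holding, +1 for closing, -1 for an out-of-sequence action) are assumed from the scales in the core config.

```python
from enum import IntEnum


class Action(IntEnum):
    # Hypothetical action ids; the real TradeEnv mapping is not shown in this notebook.
    WAIT = 0   # no open trade, keep waiting for a purchase
    OPEN = 1   # buy
    HOLD = 2   # trade is open, keep waiting for a sale
    CLOSE = 3  # sell


class ToySequenceEnv:
    """Toy environment: the only observation is whether a trade is open."""

    def __init__(self, n_steps=100, penalty=-1.0, close_reward=1.0):
        self.n_steps = n_steps
        self.penalty = penalty
        self.close_reward = close_reward
        self.reset()

    def reset(self):
        self.is_open = False
        self.t = 0
        return [float(self.is_open)]

    def step(self, action):
        action = Action(action)
        allowed = (Action.HOLD, Action.CLOSE) if self.is_open else (Action.WAIT, Action.OPEN)
        if action not in allowed:
            reward = self.penalty        # out-of-sequence action, state unchanged
        elif action == Action.CLOSE:
            self.is_open = False
            reward = self.close_reward   # the only positive reward
        else:
            if action == Action.OPEN:
                self.is_open = True
            reward = 0.0                 # waiting, opening and holding are neutral
        self.t += 1
        done = self.t >= self.n_steps
        return [float(self.is_open)], reward, done


def run_episode(env, policy):
    """Play one episode with policy(obs) -> Action and return the total reward."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total
```

For example, `run_episode(ToySequenceEnv(), lambda obs: Action.CLOSE if obs[0] else Action.OPEN)` plays the open-then-close policy for one episode.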


# Imports

In [1]:
# System libraries
import os
import sys
import yaml
import random
import warnings
import ipynbname
import logging.config

warnings.filterwarnings('ignore')

# for local development
RT_LIBS_PATH = "/Users/alex/Dev_projects/MyOwnRepo/rt_libs/src"
BA_LIBS_PATH = "/Users/alex/Dev_projects/MyOwnRepo/basic_application/src"
sys.path.append(RT_LIBS_PATH)
sys.path.append(BA_LIBS_PATH)

# read config
with open('config.yaml', "r") as stream:
    config = yaml.safe_load(stream)
    
# set logging config
log_config = config.get("log", None)
logging.config.dictConfig(log_config)

# set notebook alias
ALIAS = ipynbname.name()
print(ALIAS)

gen12.1-Abstract-01-SequenceTraining


In [2]:
# DS frameworks
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

%matplotlib notebook

In [3]:
# NN Frameworks
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, LSTM, Dropout, Concatenate, BatchNormalization
from tensorflow.keras.layers import Conv1D, MaxPool1D, AveragePooling1D, Flatten
from tensorflow.keras.optimizers import Adam, RMSprop, SGD
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.models import load_model, clone_model

devices = tf.config.list_physical_devices()
print(devices)

2023-09-23 19:45:35.875688: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


In [4]:
# RT packages
from rl import DQNAgent
from env import TradeEnv

from core_v2 import Constructor, Player
from core_v2.data_point import DataPointFactory
from core_v2.observation_builder.precompute import PrecomputeOrderbookDiffFeature

from train_tools import plot_and_go
from train_tools import TrainManager, TrainPlot4

In [5]:
seed_value = 0
#os.environ['PYTHONHASHSEED']=str(seed_value)
random.seed(seed_value)
#np.random.seed(seed_value)
#tf.random.set_seed(seed_value)

# Dataset

In [7]:
n_steps = 100
data = np.concatenate([np.ones(n_steps).reshape(-1,1), np.ones(n_steps).reshape(-1,1)*2], axis=1)
data_train = pd.DataFrame(data, columns=["lowest_ask", "highest_bid"], dtype=np.float32)
display(data_train.shape)
display(data_train.head(3))

(100, 2)

|   | lowest_ask | highest_bid |
|---|------------|-------------|
| 0 | 1.0        | 2.0         |
| 1 | 1.0        | 2.0         |
| 2 | 1.0        | 2.0         |


# Init components

## Core

In [15]:
core_config = {
    "action_controller":{"class": "BasicTrainController", "params":{ 
            "penalty": -1, 
            "wait_scale": 0, 
            "open_scale": 0, 
            "hold_scale": 0, 
            "close_scale": 1, 
            "last_points_mean": 0
        },},


    "observation_builder":{
        "class": "ObservationBuilder",
        "inputs": [
            {"class": "Input1D", "features": [{"class": "RawContextFeature", "params": {"name":"is_open"}}]},
    ]
    }
}
# = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
core_constructor = Constructor()
env_core = core_constructor.get_core(ALIAS, core_config)

## Datapoint factory

In [16]:
observation_len = 1

dp_factory_config = {
    "observation_len": observation_len,             # Observation length
    "offset": observation_len,                      # Data history length (observation + some tail of data for custom calculations)
    "future_points": 0,                             # Points from the future for the trend indicator feature
    "step_size": 1,                                 # Dataset step size
}

dpf_train = DataPointFactory(dataset=data_train, **dp_factory_config)
dpf_test = DataPointFactory(dataset=data_train, **dp_factory_config)

## Model

In [17]:
env = TradeEnv(env_core, dpf_train, alias=ALIAS)


ACTIVATION = 'tanh'
def create_q_model(env):
    num_actions = env.action_space
    #----------------------------------------------
    
    inp_static = Input(shape=env.observation_space[0])
    classif = Dense(8, activation=ACTIVATION)(inp_static)
    classif = Dense(8, activation=ACTIVATION)(classif)
    output = Dense(num_actions, activation='softmax')(classif)

    model = Model(inputs=inp_static, outputs=output)
    return model

model = create_q_model(env)
model_target = create_q_model(env)

print(model.summary())

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 1)]               0         
                                                                 
 dense_12 (Dense)            (None, 8)                 16        
                                                                 
 dense_13 (Dense)            (None, 8)                 72        
                                                                 
 dense_14 (Dense)            (None, 4)                 36        
                                                                 
Total params: 124
Trainable params: 124
Non-trainable params: 0
_________________________________________________________________
None
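The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per unit. The helper below (`dense_param_counts` is a name introduced here for illustration) reproduces the numbers for the 1 -> 8 -> 8 -> 4 stack.

```python
def dense_param_counts(input_dim, layer_units):
    """Return the parameter count of each layer in a stack of dense layers."""
    counts = []
    fan_in = input_dim
    for units in layer_units:
        counts.append((fan_in + 1) * units)  # weights plus one bias per unit
        fan_in = units
    return counts


counts = dense_param_counts(input_dim=1, layer_units=[8, 8, 4])
print(counts, sum(counts))  # [16, 72, 36] 124
```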


# Train

In [18]:
random.seed(seed_value)

core_train = core_constructor.get_core("train", core_config)
core_test = core_constructor.get_core("test", core_config)
env = TradeEnv(core_train, dpf_train, alias=ALIAS, log=True, log_obs=True)

model = create_q_model(env)
model_target = create_q_model(env)

agent = DQNAgent(env, model, model_target)

agent.epsilon_greedy_frames = 2000
agent.epsilon_random_frames = int(0.05 * agent.epsilon_greedy_frames)
agent.max_memory_length = int(1.0 * agent.epsilon_greedy_frames)

agent.max_steps_per_episode = 50000

agent.gamma = 0.95
agent.epsilon_min = 0.01
agent.batch_size = 32
agent.update_after_actions = 4
agent.update_target_network = 1000
agent.loss_function = tf.keras.losses.Huber() #tf.keras.losses.MeanSquaredError()

learning_rate = 0.0005
agent.optimizer = Adam(learning_rate=learning_rate, clipnorm=0.001)    #Adam(learning_rate=learning_rate) RMSprop(learning_rate=learning_rate) SGD(learning_rate=learning_rate)


tp = TrainPlot4()
tm = TrainManager(agent, core_test, dpf_test, tp, alias=ALIAS)
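The internals of `DQNAgent` are not shown here. A minimal sketch of how the exploration settings above could interact, assuming a linear epsilon decay from 1.0 to `epsilon_min` over `epsilon_greedy_frames` and purely random actions during the first `epsilon_random_frames` frames (the function names and the starting value 1.0 are assumptions):

```python
import random


def epsilon_at(frame, greedy_frames=2000, eps_max=1.0, eps_min=0.01):
    """Linearly anneal epsilon from eps_max to eps_min over greedy_frames."""
    fraction = min(frame / greedy_frames, 1.0)
    return eps_max - (eps_max - eps_min) * fraction


def should_explore(frame, rng, random_frames=100, greedy_frames=2000, eps_min=0.01):
    """True if the agent should take a random action at this frame."""
    if frame < random_frames:
        return True  # warm-up phase: always random
    return rng.random() < epsilon_at(frame, greedy_frames, eps_min=eps_min)


rng = random.Random(0)
for frame in (0, 1000, 2000, 3000):
    print(frame, round(epsilon_at(frame), 3), should_explore(frame, rng))
```

Under these assumptions epsilon is about 0.5 halfway through the decay and stays at 0.01 from frame 2000 onward.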

In [19]:
tp.init_plot(width=1000, height=800)
tp.update_plot(tm.history)

[FigureWidget output: line plots with 'Train' and 'Test' traces; repr truncated]

In [20]:
tm.go(max_frames=3000, test_every=100, snapshot_every=3000, update_plot_every=100, save_since=0.06)

19:48:59 Running reward: -18.10   at episode 11   | frame 1000   | eps: 0.50 | Running loss: 0.25281
19:49:09 Running reward: 3.95     at episode 21   | frame 2000   | eps: 0.01 | Running loss: 0.18777
19:49:18 Running reward: 18.67    at episode 31   | frame 3000   | eps: 0.01 | Running loss: 0.16864
done
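The running loss in the log is presumably the Huber loss between the predicted Q-value of the chosen action and a bootstrapped target. The sketch below shows that computation in plain Python, assuming `DQNAgent` follows the standard DQN recipe with a target network; the function names and the batch layout are hypothetical, and `delta=1.0` is the default of `tf.keras.losses.Huber`.

```python
def huber(error, delta=1.0):
    """Huber loss for a single error value."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error * error
    return delta * (abs_err - 0.5 * delta)


def td_target(reward, next_q_values, done, gamma=0.95):
    """Reward plus the discounted best Q-value of the next state (0 if terminal)."""
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + bootstrap


def dqn_loss(batch, gamma=0.95):
    """Mean Huber loss over a batch.

    batch: list of (q_taken, reward, next_q_values, done), where next_q_values
    would come from the target network.
    """
    losses = [huber(td_target(r, nq, d, gamma) - q) for q, r, nq, d in batch]
    return sum(losses) / len(losses)


batch = [
    (0.2, 1.0, [0.1, 0.3, 0.0, 0.4], False),  # closing a trade, episode continues
    (0.5, -1.0, [0.0, 0.0, 0.0, 0.0], True),  # penalised action on the last step
]
print(dqn_loss(batch))
```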


# Results

The simplest task of all.

The network has learned an effective strategy: it opens a trade and closes it immediately, without waiting. The more trades it closes, the higher the total reward.
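A small piece of arithmetic shows why waiting does not pay off here. Assuming every action takes exactly one step and only a close is rewarded, one full cycle costs `wait_steps + 1 + hold_steps + 1` steps, so the number of closed trades in an episode is the episode length divided by the cycle length (`closed_trades` is a helper introduced for this illustration):

```python
def closed_trades(n_steps, wait_steps=0, hold_steps=0):
    """Number of complete wait -> open -> hold -> close cycles that fit in n_steps."""
    cycle = wait_steps + 1 + hold_steps + 1  # one step each for open and close
    return n_steps // cycle


for wait, hold in [(0, 0), (1, 1), (3, 0)]:
    print(f"wait={wait}, hold={hold}: {closed_trades(100, wait, hold)} closed trades")
```

With 100 steps, opening and closing back to back fits 50 trades, while one waiting step and one holding step per cycle halves that to 25.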

Different learning rates were tested. With a small value (0.00025) training takes longer; as the rate grows (0.001 and above) the optimal result is reached faster. At a very high value (0.5) the variance of the results increased significantly.

Increasing the depth of the network compensates for the long training time at a small learning rate: each added layer speeds up training. The effect is not linear, though: from a certain depth onward the training time starts to grow again (vanishing gradients?).

Increasing the width of the network (more parameters, not more layers) has the same effect: it learns faster and more stably. Here the effect is linear: even with two orders of magnitude more neurons, training stays stable. The drop in training speed should only become noticeable on a large network or dataset.

----


With a single layer of 4 neurons, the network did not reach the optimal result by the end of training. After increasing the number of neurons or the depth, the algorithm began to converge.


With lr=0.00025 the network reached the optimum after 5-7 thousand frames. With a higher lr (for example, 0.005) the optimum is reached faster.

- 4A
    - 0.00025
        - 1200 and 3200
        - 7100
        - 4100, penalties gone from 1000
        - 3500, penalties gone from 1400
        - 2500, penalties gone from 2600
        - 2400, penalties gone from 2500
    - 0.0005
        - 400, penalties gone from 400
        - 900, penalties gone from 900
        - 2500, penalties gone from 900
        - 900, penalties gone from 200
        - 1300, penalties gone from 1300
    - 0.001
        - 1800, penalties gone from 500
        - 300, penalties gone from 300
        - 1600, penalties gone from 700
        - 800, penalties gone from 800
        - 1500, penalties gone from 800
    - 0.002
        - 1200, penalties gone from 300
        - 200, penalties gone from 100
        - 300, penalties gone from 300
        - 300, penalties gone from 300
        - 400, penalties gone from 400
    - 0.005
        - 300, penalties gone from 100
        - 100, penalties gone from 100
        - 300 and 1000, penalties gone from 300
        - 200, penalties gone from 200
        - 1000, penalties gone from 400
    - 0.5 - unstable
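The runs listed above can be summarised per learning rate. The sketch below takes only the second number of each run, read here as the frame from which penalties no longer occurred (an interpretation of the notes, not something the notebook states), and prints the median for each rate. Runs without that number are left out.

```python
from statistics import median

# Frame from which penalties disappeared, copied from the run notes above.
# For lr=0.00025 only four of the six runs report this number.
penalty_free_from = {
    0.00025: [1000, 1400, 2600, 2500],
    0.0005: [400, 900, 900, 200, 1300],
    0.001: [500, 300, 700, 800, 800],
    0.002: [300, 100, 300, 300, 400],
    0.005: [100, 100, 300, 200, 400],
}

medians = {lr: median(frames) for lr, frames in penalty_free_from.items()}
for lr, value in medians.items():
    print(f"lr={lr}: median penalty-free frame {value}")
```

The medians fall steadily as the learning rate grows, which matches the observation that higher rates reach the optimum faster.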
    