# Problem statement

## Task #1

**Goal**
Train the network to follow the correct sequence of actions on an exchange: wait to buy -> buy -> wait to sell -> sell.

**Observation**
 - Trade (position) state

**Actions**
 1. Wait for the moment to buy
 2. Buy (open a position)
 3. Wait for the moment to sell
 4. Sell (close the position)

**Reward/penalty**
Actions 1 and 2 are allowed only when there is no open position. Actions 3 and 4 are allowed only when a position is open. The network is penalized for violating this rule.

The network receives a reward when it closes a position.
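
The rules above can be sketched as a small state machine. This is only an illustration: the `Action` enum, its zero-based numbering, the `step` helper, and the choice to leave the state unchanged after an invalid action are assumptions made for the example, not the API of the environment used in this notebook. The penalty of -1 matches the `penalty` value in the environment config; the close reward of 1.0 is a placeholder.

```python
from enum import IntEnum
from typing import Tuple


class Action(IntEnum):
    # Hypothetical zero-based numbering of the four actions listed above
    WAIT_BUY = 0
    BUY = 1
    WAIT_SELL = 2
    SELL = 3


PENALTY = -1.0       # charged for an action that is invalid in the current state
CLOSE_REWARD = 1.0   # placeholder reward paid when a position is closed


def step(action: Action, is_open: bool) -> Tuple[float, bool]:
    """Return (reward, new_is_open) for one action under the rules above."""
    allowed = {Action.WAIT_SELL, Action.SELL} if is_open else {Action.WAIT_BUY, Action.BUY}
    if action not in allowed:
        return PENALTY, is_open          # invalid action: penalize, keep the state (assumption)
    if action == Action.BUY:
        return 0.0, True                 # open the position, no reward yet
    if action == Action.SELL:
        return CLOSE_REWARD, False       # close the position, collect the reward
    return 0.0, is_open                  # waiting is valid and neutral


# Walk through the correct sequence: wait -> buy -> wait -> sell
is_open = False
total = 0.0
for a in (Action.WAIT_BUY, Action.BUY, Action.WAIT_SELL, Action.SELL):
    reward, is_open = step(a, is_open)
    total += reward
print(total, is_open)
```

Any deviation from the sequence, such as selling with no open position, returns the penalty instead.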

# Imports

In [1]:
# System imports and settings
import os
import sys
import yaml
import random
import warnings
import ipynbname
import logging.config

warnings.filterwarnings('ignore')

# for local development
RT_LIBS_PATH = "/Users/alex/Dev_projects/MyOwnRepo/rt_libs/src"
BA_LIBS_PATH = "/Users/alex/Dev_projects/MyOwnRepo/basic_application/src"
sys.path.append(RT_LIBS_PATH)
sys.path.append(BA_LIBS_PATH)

# read config
with open('config.yaml', "r") as stream:
    config = yaml.safe_load(stream)
    
# set logging config (skipped if the config file has no "log" section)
log_config = config.get("log")
if log_config:
    logging.config.dictConfig(log_config)

# set notebook alias
ALIAS = ipynbname.name()
print(ALIAS)

gen12.1-Abstract-03-OpenSignal


In [2]:
# DS frameworks
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

%matplotlib notebook

In [3]:
# NN Frameworks
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, LSTM, Dropout, Concatenate, BatchNormalization
from tensorflow.keras.layers import Conv1D, MaxPool1D, AveragePooling1D, Flatten
from tensorflow.keras.optimizers import Adam, RMSprop, SGD
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.models import load_model, clone_model

devices = tf.config.list_physical_devices()
print(devices)

2023-09-23 11:37:16.813095: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


In [4]:
# RT packages
from rl import DQNAgent
from env import TradeEnv
from core_v2 import Constructor, Player


from core_v2.data_point import DataPointFactory

from core_v2.observation_builder.precompute import PrecomputeOrderbookDiffFeature

from train_tools import plot_and_go
from train_tools.train_plot import TrainPlot4
from train_tools.train_manager import TrainManager

In [5]:
seed_value = 0
#os.environ['PYTHONHASHSEED']=str(seed_value)
random.seed(seed_value)
#np.random.seed(seed_value)
#tf.random.set_seed(seed_value)
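
Only Python's `random` module is seeded in this cell; the TensorFlow seed is commented out, so network weight initialization is not fixed by it and runs may still differ. A minimal standard-library sketch of what seeding does guarantee (the `draw` helper is hypothetical and exists only for this illustration):

```python
import random


def draw(seed: int, n: int = 5) -> list:
    """Seed the stdlib RNG and return n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(0)
second = draw(0)
other = draw(1)
print(first == second)   # same seed, same sequence
print(first == other)    # different seed, different sequence
```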

# Config

In [6]:
observation_len = 1

# Observation point parameters
observation_config = {
    "observation_len": observation_len,             # Number of observation points per sample
    "offset": observation_len,                      # Offset (here equal to observation_len)
    "future_points": 0,                             # Number of future points used for trend prediction (temporary solution)
    "step_size": 1,                                 # Step size over the dataset
}

# Dataset

In [7]:
np.random.seed(2)

n_steps = 100
sample_num = 10
sample_size = n_steps//sample_num
safe_interval_size = 3

open_signal = np.empty(0)

# Each sample holds exactly one open signal, followed by a signal-free safe interval
sample_size_ = sample_size - safe_interval_size
for i in range(sample_num):
    safe_interval = np.zeros(safe_interval_size)
    total = 0
    # Redraw until the sample contains exactly one signal
    while total != 1:
        sample = np.random.uniform(size=sample_size_) > 0.9
        total = sample.sum()

    open_signal = np.concatenate([open_signal, sample, safe_interval])

open_signal = open_signal.reshape(-1, 1)
# The ask drops below the bid (0.5 < 1.0) only on signal steps; otherwise it is 1.5
lowest_ask = 1.5 - open_signal
highest_bid = np.ones_like(open_signal)

dataset = np.concatenate([lowest_ask, highest_bid, open_signal], axis=1)

data_train = pd.DataFrame(dataset, columns=["lowest_ask", "highest_bid", "open_signal"], dtype=np.float32)

#data_train['close_signal'] = data_train['close_signal'].replace(0, -1)
#data_train['close_signal'] = data_train['close_signal'].replace(1, 0.5)
#data_train['close_signal']  = data_train['close_signal']  - 0.5

print(data_train["open_signal"].sum())
print(data_train["open_signal"].count())
display(data_train.head(15))

10.0
100


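| | lowest_ask | highest_bid | open_signal |
|---|---|---|---|
| 0 | 1.5 | 1.0 | 0.0 |
| 1 | 1.5 | 1.0 | 0.0 |
| 2 | 1.5 | 1.0 | 0.0 |
| 3 | 1.5 | 1.0 | 0.0 |
| 4 | 1.5 | 1.0 | 0.0 |
| 5 | 0.5 | 1.0 | 1.0 |
| 6 | 1.5 | 1.0 | 0.0 |
| 7 | 1.5 | 1.0 | 0.0 |
| 8 | 1.5 | 1.0 | 0.0 |
| 9 | 1.5 | 1.0 | 0.0 |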
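
The generator is meant to produce 10 blocks of 10 steps, each with exactly one open signal in its first 7 steps and a 3-step signal-free tail. Below is a hedged sketch of that check on a freshly generated signal. `make_open_signal` is a hypothetical helper that repackages the loop from the cell above and uses `np.random.default_rng` instead of the global seed, so the exact signal positions differ from the table.

```python
import numpy as np


def make_open_signal(n_steps=100, sample_num=10, safe_interval_size=3, seed=2):
    """Hypothetical helper: one signal per block, followed by a safe interval."""
    rng = np.random.default_rng(seed)
    block = n_steps // sample_num
    active = block - safe_interval_size
    parts = []
    for _ in range(sample_num):
        sample = np.zeros(active)
        # Redraw until exactly one step in the active part carries a signal
        while sample.sum() != 1:
            sample = (rng.uniform(size=active) > 0.9).astype(float)
        parts += [sample, np.zeros(safe_interval_size)]
    return np.concatenate(parts)


signal = make_open_signal()
blocks = signal.reshape(10, 10)

# Exactly one signal per block, none inside the trailing safe interval
print(blocks.sum(axis=1))
print(blocks[:, -3:].sum())

# Prices derived as in the notebook: the ask dips below the bid only on signal steps
lowest_ask = 1.5 - signal
highest_bid = np.ones_like(signal)
print(np.array_equal(lowest_ask < highest_bid, signal == 1))
```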


# Component initialization

## Datapoint factory

In [8]:
dpf_train = DataPointFactory(dataset=data_train, **observation_config)
dpf_test = DataPointFactory(dataset=data_train, **observation_config)

## Env

In [9]:
core_config = {
    "action_controller": {
        "class": "AbstractTrainController",
        "params": {
            "penalty": -1,
            "wait_scale": 0,
            "open_scale": 0,
            "hold_scale": 0,
            "close_scale": 1,
            "last_points_mean": 0,
        },
    },
    "observation_builder": {
        "class": "ObservationBuilder",
        "inputs": [
            {"class": "Input1D", "features": [
                {"class": "RawContextFeature", "params": {"name": "is_open"}},
                {"class": "RawValueFeature", "params": {"name": "open_signal"}},
                {"class": "RawContextFeature", "params": {"name": "open_signal"}},
            ]},
        ],
    },
}
# = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
core_constructor = Constructor()
env_core = core_constructor.get_core(ALIAS, core_config)

# train environment
env = TradeEnv(env_core, dpf_train, alias=ALIAS, log=True, log_obs=True)

In [10]:
dp = dpf_train.reset()
obs = env_core.get_observation(dp)
obs

array([0., 0., 0.], dtype=float32)

In [17]:
dp, done = dpf_train.get_next_step()
obs = env_core.get_observation(dp)
obs

array([1., 0., 1.], dtype=float32)

In [16]:
reward, act = env_core.apply_action(1)
reward

-0.0

In [63]:
env_core.context.params

{'data_point': <core_v2.data_point.data_point.DataPoint at 0x7fd6b670f9d0>,
 'ts': 35,
 'lowest_ask': 1.5,
 'highest_bid': 1.0,
 'profit': 1.0,
 'market_fee': 0,
 'is_open': True,
 'trade': <core_v2.actions.correct_action.TradeAction at 0x7fd6b670f790>,
 'trade_opposite': <core_v2.actions.correct_action.TradeAction at 0x7fd6b6707a90>,
 'open_signal': 0,
 'observation': array([1., 0., 0.], dtype=float32),
 'action': 1,
 'reward': -0.0}

# Neural network

In [18]:
ACTIVATION = 'relu'
def create_q_model(env):
    num_actions = env.action_space
    #----------------------------------------------
    
    inp_static = Input(shape=env.observation_space[0])
    classif = Dense(64, activation=ACTIVATION)(inp_static)
    classif = Dense(64, activation=ACTIVATION)(classif)
    classif = Dense(64, activation=ACTIVATION)(classif)
    classif = Dense(64, activation=ACTIVATION)(classif)
    output = Dense(num_actions, activation='softmax')(classif)

    model = Model(inputs=inp_static, outputs=output)
    return model

model = create_q_model(env)
model_target = create_q_model(env)

model.summary()  # summary() prints the table itself and returns None

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 3)]               0         
                                                                 
 dense (Dense)               (None, 64)                256       
                                                                 
 dense_1 (Dense)             (None, 64)                4160      
                                                                 
 dense_2 (Dense)             (None, 64)                4160      
                                                                 
 dense_3 (Dense)             (None, 64)                4160      
                                                                 
 dense_4 (Dense)             (None, 4)                 260       
                                                                 
Total params: 12,996
Trainable params: 12,996
Non-trainable p
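
One detail of the head is worth checking against the reward scale. The last layer uses `softmax`, so every output lies in (0, 1) and the outputs sum to 1, while the environment can return a penalty of -1. The pure-Python sketch below shows that a softmax output cannot match a negative target whereas an unconstrained (linear) output can. It illustrates the constraint only and makes no claim about how `DQNAgent` consumes the outputs; the logits are made-up values. A linear last layer is the more common choice for Q-value heads, but that variant was not tested in this notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pre-activation values for the four actions
logits = [-5.0, 0.0, 2.0, -1.0]

soft = softmax(logits)   # what a softmax head would output
linear = logits          # what a linear head would output

print(soft, sum(soft))
print(min(soft) > 0.0)     # a softmax output can never reach a target of -1
print(min(linear) < 0.0)   # a linear output can be negative
```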


# Training

In [25]:
random.seed(seed_value)

core_train = core_constructor.get_core("train", core_config)
env = TradeEnv(core_train, dpf_train, alias=ALIAS, log=True, log_obs=True)

model = create_q_model(env)
model_target = create_q_model(env)
agent = DQNAgent(env, model, model_target)

agent.epsilon_random_frames = 500
agent.epsilon_greedy_frames = 4000
agent.max_memory_length     = 4000
agent.max_steps_per_episode = 50000
agent.gamma = 0.95
agent.epsilon_min = 0.01
agent.batch_size = 64
agent.update_after_actions = 4
agent.update_target_network = 250
agent.loss_function = tf.keras.losses.Huber() #tf.keras.losses.MeanSquaredError()
agent.optimizer = Adam(learning_rate=0.001, clipnorm=0.001)    #Adam(learning_rate=learning_rate) RMSprop(learning_rate=learning_rate) SGD(learning_rate=learning_rate)


tp = TrainPlot4()
core_test = core_constructor.get_core("test", core_config)
tm = TrainManager(agent, core_test, dpf_test, tp, alias=ALIAS)
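
The `gamma` and `loss_function` settings map onto the usual DQN update. The sketch below assumes `DQNAgent` follows the standard recipe (a one-step TD target computed from the target network, Huber loss on the difference); that is an assumption about `rl.DQNAgent`, whose source is not shown here. `td_target` and `huber` are illustrative helpers, the transition values are made up, and `delta=1.0` is the default of `tf.keras.losses.Huber`.

```python
GAMMA = 0.95  # discount factor, as in agent.gamma


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step target: r + gamma * max_a' Q_target(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def huber(error, delta=1.0):
    """Huber loss for a single error: quadratic near zero, linear in the tails."""
    a = abs(error)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


# Hypothetical transition: closing a position pays 1.0, and the target network
# values the four actions in the next state as listed below
target = td_target(reward=1.0, next_q_values=[0.2, 0.5, -1.0, -1.0], done=False)
prediction = 0.9   # hypothetical Q(s, a) from the online network
loss = huber(target - prediction)
print(target, loss)
```

The linear tails of the Huber loss limit the influence of large TD errors, which is a common reason to prefer it over mean squared error in DQN.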

In [26]:
tp.init_plot(width=1000, height=800)
tp.update_plot(tm.history)

[Plot: FigureWidget with Train and Test line traces]

In [27]:
tm.go(max_frames=7000, test_every=100, snapshot_every=500000, update_plot_every=100, save_since=0.06)

13:24:19 Running reward: -52.00   at episode 3    | frame 250    | eps: 0.94 | Running loss: 0.25444
13:24:22 Running reward: -52.00   at episode 6    | frame 500    | eps: 0.88 | Running loss: 0.17382
13:24:26 Running reward: -50.67   at episode 8    | frame 750    | eps: 0.81 | Running loss: 0.13603
13:24:30 Running reward: -47.40   at episode 11   | frame 1000   | eps: 0.75 | Running loss: 0.10882
13:24:34 Running reward: -46.69   at episode 13   | frame 1250   | eps: 0.69 | Running loss: 0.09938
13:24:38 Running reward: -44.49   at episode 16   | frame 1500   | eps: 0.63 | Running loss: 0.09133
13:24:42 Running reward: -43.41   at episode 18   | frame 1750   | eps: 0.57 | Running loss: 0.08817
13:24:45 Running reward: -40.58   at episode 21   | frame 2000   | eps: 0.51 | Running loss: 0.08279
13:24:50 Running reward: -39.27   at episode 23   | frame 2250   | eps: 0.44 | Running loss: 0.07892
13:24:54 Running reward: -37.83   at episode 26   | frame 2500   | eps: 0.38 | Running loss
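
The `eps` column in the log above is consistent with a linear decay from 1.0 to `epsilon_min` over `epsilon_greedy_frames` frames, starting at frame 0. The sketch below reproduces the logged values under that assumption; the starting value of 1.0 and the formula itself are inferred from the log, not taken from the `DQNAgent` source.

```python
EPSILON_START = 1.0            # assumed starting value (not shown in the notebook)
EPSILON_MIN = 0.01             # agent.epsilon_min
EPSILON_GREEDY_FRAMES = 4000   # agent.epsilon_greedy_frames


def epsilon_at(frame: int) -> float:
    """Linearly decayed exploration rate, floored at EPSILON_MIN."""
    slope = (EPSILON_START - EPSILON_MIN) / EPSILON_GREEDY_FRAMES
    return max(EPSILON_MIN, EPSILON_START - slope * frame)


for frame in (250, 500, 1000, 2500, 5000):
    print(frame, f"{epsilon_at(frame):.2f}")
```

Under this schedule exploration reaches its floor at frame 4000, so everything after that is driven almost entirely by the learned policy.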

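# Results

After a short round of research, I picked parameters that give the fastest and most stable training. The training run above uses `batch_size = 64` and `learning_rate = 0.001`; the research runs below compare batch sizes 16, 32, 64 and 128 at `learning_rate = 0.00015`.
On more complex tasks the config will have to be explored and adjusted again.

# Research

## Config 1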

In [191]:
random.seed(seed_value)

core_train_1 = core_constructor.get_core("train", core_config)
env_1 = TradeEnv(core_train_1, dpf_train, alias=ALIAS, log=False, log_obs=False)

model_1 = create_q_model(env_1)
model_target_1 = create_q_model(env_1)
agent_1 = DQNAgent(env_1, model_1, model_target_1)

agent_1.epsilon_random_frames = 500
agent_1.epsilon_greedy_frames = 4000
agent_1.max_memory_length     = 4000
agent_1.max_steps_per_episode = 50000
agent_1.gamma = 0.95
agent_1.epsilon_min = 0.01
agent_1.batch_size = 16
agent_1.update_after_actions = 4
agent_1.update_target_network = 250
agent_1.loss_function = tf.keras.losses.Huber() #tf.keras.losses.MeanSquaredError()
agent_1.optimizer = Adam(learning_rate=0.00015, clipnorm=0.001)    #Adam(learning_rate=learning_rate) RMSprop(learning_rate=learning_rate) SGD(learning_rate=learning_rate)


tp_1 = TrainPlot4()
core_test_1 = core_constructor.get_core("test", core_config)
tm_1 = TrainManager(agent_1, core_test_1, dpf_test, tp_1, alias=ALIAS)

tp_1.init_plot(width=1000, height=800)
tp_1.update_plot(tm_1.history)

[Plot: FigureWidget with Train and Test line traces]

In [192]:
tm_1.go(max_frames=80000, test_every=300, snapshot_every=500000, update_plot_every=100, save_since=0.06)

12:07:20 Running reward: -114.50  at episode 3    | frame 250    | eps: 0.94 | Running loss: 0.88952
12:07:22 Running reward: -108.60  at episode 6    | frame 500    | eps: 0.88 | Running loss: 0.85609
12:07:24 Running reward: -105.00  at episode 8    | frame 750    | eps: 0.81 | Running loss: 0.82732
12:07:27 Running reward: -101.50  at episode 11   | frame 1000   | eps: 0.75 | Running loss: 0.78823
12:07:29 Running reward: -97.50   at episode 13   | frame 1250   | eps: 0.69 | Running loss: 0.75456
12:07:31 Running reward: -93.07   at episode 16   | frame 1500   | eps: 0.63 | Running loss: 0.70873
12:07:33 Running reward: -90.53   at episode 18   | frame 1750   | eps: 0.57 | Running loss: 0.67820
12:07:35 Running reward: -85.05   at episode 21   | frame 2000   | eps: 0.51 | Running loss: 0.63369
12:07:38 Running reward: -81.95   at episode 23   | frame 2250   | eps: 0.44 | Running loss: 0.60262
12:07:40 Running reward: -78.12   at episode 26   | frame 2500   | eps: 0.38 | Running loss

12:11:00 Running reward: -3.57    at episode 212  | frame 20750  | eps: 0.01 | Running loss: 0.04377
12:11:02 Running reward: -5.63    at episode 215  | frame 21000  | eps: 0.01 | Running loss: 0.03810
12:11:05 Running reward: -6.50    at episode 217  | frame 21250  | eps: 0.01 | Running loss: 0.03627
12:11:08 Running reward: -7.27    at episode 220  | frame 21500  | eps: 0.01 | Running loss: 0.03592
12:11:11 Running reward: -7.33    at episode 222  | frame 21750  | eps: 0.01 | Running loss: 0.03600
12:11:13 Running reward: -8.03    at episode 225  | frame 22000  | eps: 0.01 | Running loss: 0.03743
12:11:16 Running reward: -7.93    at episode 228  | frame 22250  | eps: 0.01 | Running loss: 0.03842
12:11:18 Running reward: -7.97    at episode 230  | frame 22500  | eps: 0.01 | Running loss: 0.03861
12:11:21 Running reward: -7.73    at episode 233  | frame 22750  | eps: 0.01 | Running loss: 0.03890
12:11:24 Running reward: -7.70    at episode 235  | frame 23000  | eps: 0.01 | Running loss

12:14:54 Running reward: -6.20    at episode 421  | frame 41250  | eps: 0.01 | Running loss: 0.06652
12:14:57 Running reward: -2.13    at episode 424  | frame 41500  | eps: 0.01 | Running loss: 0.06817
12:15:00 Running reward: 4.27     at episode 427  | frame 41750  | eps: 0.01 | Running loss: 0.06892
12:15:03 Running reward: 4.77     at episode 429  | frame 42000  | eps: 0.01 | Running loss: 0.06611
12:15:05 Running reward: 5.33     at episode 432  | frame 42250  | eps: 0.01 | Running loss: 0.06607
12:15:08 Running reward: 5.73     at episode 434  | frame 42500  | eps: 0.01 | Running loss: 0.06411
12:15:11 Running reward: 5.83     at episode 437  | frame 42750  | eps: 0.01 | Running loss: 0.05984
12:15:14 Running reward: 6.07     at episode 439  | frame 43000  | eps: 0.01 | Running loss: 0.05635
12:15:17 Running reward: 6.23     at episode 442  | frame 43250  | eps: 0.01 | Running loss: 0.05355
12:15:19 Running reward: 6.63     at episode 444  | frame 43500  | eps: 0.01 | Running loss

12:18:44 Running reward: 9.00     at episode 631  | frame 61750  | eps: 0.01 | Running loss: 0.03511
12:18:47 Running reward: 9.00     at episode 633  | frame 62000  | eps: 0.01 | Running loss: 0.03411
12:18:50 Running reward: 8.73     at episode 636  | frame 62250  | eps: 0.01 | Running loss: 0.03343
12:18:52 Running reward: 8.73     at episode 638  | frame 62500  | eps: 0.01 | Running loss: 0.03332
12:18:55 Running reward: 8.67     at episode 641  | frame 62750  | eps: 0.01 | Running loss: 0.03340
12:18:58 Running reward: 8.60     at episode 643  | frame 63000  | eps: 0.01 | Running loss: 0.03315
12:19:01 Running reward: 8.87     at episode 646  | frame 63250  | eps: 0.01 | Running loss: 0.03346
12:19:04 Running reward: 8.83     at episode 648  | frame 63500  | eps: 0.01 | Running loss: 0.03364
12:19:07 Running reward: 8.63     at episode 651  | frame 63750  | eps: 0.01 | Running loss: 0.03416
12:19:10 Running reward: 8.33     at episode 654  | frame 64000  | eps: 0.01 | Running loss

## Config 2

In [193]:
random.seed(seed_value)

core_train_2 = core_constructor.get_core("train", core_config)
env_2 = TradeEnv(core_train_2, dpf_train, alias=ALIAS, log=False, log_obs=False)

model_2 = create_q_model(env_2)
model_target_2 = create_q_model(env_2)
agent_2 = DQNAgent(env_2, model_2, model_target_2)

agent_2.epsilon_random_frames = 500
agent_2.epsilon_greedy_frames = 4000
agent_2.max_memory_length     = 4000
agent_2.max_steps_per_episode = 50000
agent_2.gamma = 0.95
agent_2.epsilon_min = 0.01
agent_2.batch_size = 32
agent_2.update_after_actions = 4
agent_2.update_target_network = 250
agent_2.loss_function = tf.keras.losses.Huber() #tf.keras.losses.MeanSquaredError()
agent_2.optimizer = Adam(learning_rate=0.00015, clipnorm=0.001)    #Adam(learning_rate=learning_rate) RMSprop(learning_rate=learning_rate) SGD(learning_rate=learning_rate)


tp_2 = TrainPlot4()
core_test_2 = core_constructor.get_core("test", core_config)
tm_2 = TrainManager(agent_2, core_test_2, dpf_test, tp_2, alias=ALIAS)

tp_2.init_plot(width=1000, height=800)
tp_2.update_plot(tm_2.history)

[Plot: FigureWidget with Train and Test line traces]

In [194]:
tm_2.go(max_frames=80000, test_every=300, snapshot_every=500000, update_plot_every=100, save_since=0.06)

12:22:10 Running reward: -116.00  at episode 3    | frame 250    | eps: 0.94 | Running loss: 0.81109
12:22:12 Running reward: -112.60  at episode 6    | frame 500    | eps: 0.88 | Running loss: 0.85063
12:22:14 Running reward: -107.86  at episode 8    | frame 750    | eps: 0.81 | Running loss: 0.82948
12:22:17 Running reward: -99.10   at episode 11   | frame 1000   | eps: 0.75 | Running loss: 0.78819
12:22:19 Running reward: -96.25   at episode 13   | frame 1250   | eps: 0.69 | Running loss: 0.75266
12:22:21 Running reward: -91.07   at episode 16   | frame 1500   | eps: 0.63 | Running loss: 0.69924
12:22:23 Running reward: -89.06   at episode 18   | frame 1750   | eps: 0.57 | Running loss: 0.66691
12:22:26 Running reward: -83.85   at episode 21   | frame 2000   | eps: 0.51 | Running loss: 0.62010
12:22:28 Running reward: -80.68   at episode 23   | frame 2250   | eps: 0.44 | Running loss: 0.59265
12:22:31 Running reward: -77.12   at episode 26   | frame 2500   | eps: 0.38 | Running loss

12:25:54 Running reward: -2.47    at episode 212  | frame 20750  | eps: 0.01 | Running loss: 0.07916
12:25:57 Running reward: -2.33    at episode 215  | frame 21000  | eps: 0.01 | Running loss: 0.07768
12:26:00 Running reward: -2.60    at episode 217  | frame 21250  | eps: 0.01 | Running loss: 0.07557
12:26:02 Running reward: -2.53    at episode 220  | frame 21500  | eps: 0.01 | Running loss: 0.07267
12:26:05 Running reward: -1.83    at episode 222  | frame 21750  | eps: 0.01 | Running loss: 0.07134
12:26:08 Running reward: -1.67    at episode 225  | frame 22000  | eps: 0.01 | Running loss: 0.06600
12:26:11 Running reward: -1.30    at episode 228  | frame 22250  | eps: 0.01 | Running loss: 0.06111
12:26:14 Running reward: -0.80    at episode 230  | frame 22500  | eps: 0.01 | Running loss: 0.05822
12:26:17 Running reward: -1.90    at episode 233  | frame 22750  | eps: 0.01 | Running loss: 0.05435
12:26:20 Running reward: -9.27    at episode 235  | frame 23000  | eps: 0.01 | Running loss

12:29:52 Running reward: 8.10     at episode 421  | frame 41250  | eps: 0.01 | Running loss: 0.03532
12:29:55 Running reward: 8.33     at episode 424  | frame 41500  | eps: 0.01 | Running loss: 0.03520
12:29:57 Running reward: 8.43     at episode 427  | frame 41750  | eps: 0.01 | Running loss: 0.03575
12:30:00 Running reward: 8.57     at episode 429  | frame 42000  | eps: 0.01 | Running loss: 0.03521
12:30:03 Running reward: 8.67     at episode 432  | frame 42250  | eps: 0.01 | Running loss: 0.03568
12:30:06 Running reward: 9.20     at episode 434  | frame 42500  | eps: 0.01 | Running loss: 0.03582
12:30:09 Running reward: 9.03     at episode 437  | frame 42750  | eps: 0.01 | Running loss: 0.03539
12:30:12 Running reward: 8.87     at episode 439  | frame 43000  | eps: 0.01 | Running loss: 0.03530
12:30:15 Running reward: 8.90     at episode 442  | frame 43250  | eps: 0.01 | Running loss: 0.03494
12:30:18 Running reward: 8.90     at episode 444  | frame 43500  | eps: 0.01 | Running loss

12:33:53 Running reward: 8.83     at episode 631  | frame 61750  | eps: 0.01 | Running loss: 0.03642
12:33:56 Running reward: 8.90     at episode 633  | frame 62000  | eps: 0.01 | Running loss: 0.03668
12:33:59 Running reward: 8.67     at episode 636  | frame 62250  | eps: 0.01 | Running loss: 0.03685
12:34:02 Running reward: 8.93     at episode 638  | frame 62500  | eps: 0.01 | Running loss: 0.03632
12:34:05 Running reward: 9.00     at episode 641  | frame 62750  | eps: 0.01 | Running loss: 0.03655
12:34:08 Running reward: 8.90     at episode 643  | frame 63000  | eps: 0.01 | Running loss: 0.03621
12:34:11 Running reward: 9.07     at episode 646  | frame 63250  | eps: 0.01 | Running loss: 0.03661
12:34:14 Running reward: 9.10     at episode 648  | frame 63500  | eps: 0.01 | Running loss: 0.03676
12:34:17 Running reward: 8.93     at episode 651  | frame 63750  | eps: 0.01 | Running loss: 0.03629
12:34:20 Running reward: 8.67     at episode 654  | frame 64000  | eps: 0.01 | Running loss

## Config 3

In [195]:
random.seed(seed_value)

core_train_3 = core_constructor.get_core("train", core_config)
env_3 = TradeEnv(core_train_3, dpf_train, alias=ALIAS, log=False, log_obs=False)

model_3 = create_q_model(env_3)
model_target_3 = create_q_model(env_3)
agent_3 = DQNAgent(env_3, model_3, model_target_3)

agent_3.epsilon_random_frames = 500
agent_3.epsilon_greedy_frames = 4000
agent_3.max_memory_length     = 4000
agent_3.max_steps_per_episode = 50000
agent_3.gamma = 0.95
agent_3.epsilon_min = 0.01
agent_3.batch_size = 64
agent_3.update_after_actions = 4
agent_3.update_target_network = 250
agent_3.loss_function = tf.keras.losses.Huber() #tf.keras.losses.MeanSquaredError()
agent_3.optimizer = Adam(learning_rate=0.00015, clipnorm=0.001)    #Adam(learning_rate=learning_rate) RMSprop(learning_rate=learning_rate) SGD(learning_rate=learning_rate)


tp_3 = TrainPlot4()
core_test_3 = core_constructor.get_core("test", core_config)
tm_3 = TrainManager(agent_3, core_test_3, dpf_test, tp_3, alias=ALIAS)

tp_3.init_plot(width=1000, height=800)
tp_3.update_plot(tm_3.history)

[Plot: FigureWidget with Train and Test line traces]

In [196]:
tm_3.go(max_frames=80000, test_every=300, snapshot_every=500000, update_plot_every=100, save_since=0.06)

12:37:31 Running reward: -108.50  at episode 3    | frame 250    | eps: 0.94 | Running loss: 0.82755
12:37:33 Running reward: -108.80  at episode 6    | frame 500    | eps: 0.88 | Running loss: 0.79970
12:37:35 Running reward: -103.14  at episode 8    | frame 750    | eps: 0.81 | Running loss: 0.78052
12:37:38 Running reward: -97.20   at episode 11   | frame 1000   | eps: 0.75 | Running loss: 0.73018
12:37:40 Running reward: -95.25   at episode 13   | frame 1250   | eps: 0.69 | Running loss: 0.69495
12:37:43 Running reward: -91.13   at episode 16   | frame 1500   | eps: 0.63 | Running loss: 0.64380
12:37:45 Running reward: -88.76   at episode 18   | frame 1750   | eps: 0.57 | Running loss: 0.60786
12:37:48 Running reward: -84.45   at episode 21   | frame 2000   | eps: 0.51 | Running loss: 0.56364
12:37:50 Running reward: -81.77   at episode 23   | frame 2250   | eps: 0.44 | Running loss: 0.53853
12:37:53 Running reward: -78.88   at episode 26   | frame 2500   | eps: 0.38 | Running loss

12:41:23 Running reward: 8.63     at episode 212  | frame 20750  | eps: 0.01 | Running loss: 0.03761
12:41:25 Running reward: 8.57     at episode 215  | frame 21000  | eps: 0.01 | Running loss: 0.03541
12:41:28 Running reward: 8.13     at episode 217  | frame 21250  | eps: 0.01 | Running loss: 0.03483
12:41:31 Running reward: 8.20     at episode 220  | frame 21500  | eps: 0.01 | Running loss: 0.03445
12:41:34 Running reward: 8.33     at episode 222  | frame 21750  | eps: 0.01 | Running loss: 0.03487
12:41:37 Running reward: 8.40     at episode 225  | frame 22000  | eps: 0.01 | Running loss: 0.03547
12:41:40 Running reward: 8.60     at episode 228  | frame 22250  | eps: 0.01 | Running loss: 0.03610
12:41:44 Running reward: 8.47     at episode 230  | frame 22500  | eps: 0.01 | Running loss: 0.03639
12:41:47 Running reward: 8.40     at episode 233  | frame 22750  | eps: 0.01 | Running loss: 0.03729
12:41:50 Running reward: 8.37     at episode 235  | frame 23000  | eps: 0.01 | Running loss

12:45:27 Running reward: 8.53     at episode 421  | frame 41250  | eps: 0.01 | Running loss: 0.03456
12:45:30 Running reward: 9.03     at episode 424  | frame 41500  | eps: 0.01 | Running loss: 0.03481
12:45:33 Running reward: 8.73     at episode 427  | frame 41750  | eps: 0.01 | Running loss: 0.03484
12:45:35 Running reward: 8.80     at episode 429  | frame 42000  | eps: 0.01 | Running loss: 0.03468
12:45:38 Running reward: 8.83     at episode 432  | frame 42250  | eps: 0.01 | Running loss: 0.03474
12:45:41 Running reward: 8.90     at episode 434  | frame 42500  | eps: 0.01 | Running loss: 0.03503
12:45:44 Running reward: 8.70     at episode 437  | frame 42750  | eps: 0.01 | Running loss: 0.03489
12:45:47 Running reward: 8.77     at episode 439  | frame 43000  | eps: 0.01 | Running loss: 0.03497
12:45:51 Running reward: 8.70     at episode 442  | frame 43250  | eps: 0.01 | Running loss: 0.03532
12:45:53 Running reward: 8.83     at episode 444  | frame 43500  | eps: 0.01 | Running loss

12:49:33 Running reward: 8.77     at episode 631  | frame 61750  | eps: 0.01 | Running loss: 0.03651
12:49:36 Running reward: 8.73     at episode 633  | frame 62000  | eps: 0.01 | Running loss: 0.03623
12:49:39 Running reward: 8.63     at episode 636  | frame 62250  | eps: 0.01 | Running loss: 0.03640
12:49:42 Running reward: 8.90     at episode 638  | frame 62500  | eps: 0.01 | Running loss: 0.03641
12:49:45 Running reward: 8.77     at episode 641  | frame 62750  | eps: 0.01 | Running loss: 0.03691
12:49:48 Running reward: 8.67     at episode 643  | frame 63000  | eps: 0.01 | Running loss: 0.03655
12:49:51 Running reward: 8.67     at episode 646  | frame 63250  | eps: 0.01 | Running loss: 0.03631
12:49:54 Running reward: 8.60     at episode 648  | frame 63500  | eps: 0.01 | Running loss: 0.03590
12:49:57 Running reward: 8.60     at episode 651  | frame 63750  | eps: 0.01 | Running loss: 0.03544
12:50:01 Running reward: 8.17     at episode 654  | frame 64000  | eps: 0.01 | Running loss

## Config 4

In [197]:
random.seed(seed_value)

core_train_4 = core_constructor.get_core("train", core_config)
env_4 = TradeEnv(core_train_4, dpf_train, alias=ALIAS, log=False, log_obs=False)

model_4 = create_q_model(env_4)
model_target_4 = create_q_model(env_4)
agent_4 = DQNAgent(env_4, model_4, model_target_4)

agent_4.epsilon_random_frames = 500
agent_4.epsilon_greedy_frames = 4000
agent_4.max_memory_length     = 4000
agent_4.max_steps_per_episode = 50000
agent_4.gamma = 0.95
agent_4.epsilon_min = 0.01
agent_4.batch_size = 128
agent_4.update_after_actions = 4
agent_4.update_target_network = 250
agent_4.loss_function = tf.keras.losses.Huber() #tf.keras.losses.MeanSquaredError()
agent_4.optimizer = Adam(learning_rate=0.00015, clipnorm=0.001)    #Adam(learning_rate=learning_rate) RMSprop(learning_rate=learning_rate) SGD(learning_rate=learning_rate)


tp_4 = TrainPlot4()
core_test_4 = core_constructor.get_core("test", core_config)
tm_4 = TrainManager(agent_4, core_test_4, dpf_test, tp_4, alias=ALIAS)

tp_4.init_plot(width=1000, height=800)
tp_4.update_plot(tm_4.history)

[Plot: FigureWidget with Train and Test line traces]

In [198]:
tm_4.go(max_frames=80000, test_every=300, snapshot_every=500000, update_plot_every=100, save_since=0.06)

12:53:16 Running reward: -107.50  at episode 3    | frame 250    | eps: 0.94 | Running loss: nan
12:53:18 Running reward: -111.20  at episode 6    | frame 500    | eps: 0.88 | Running loss: nan
12:53:20 Running reward: -106.86  at episode 8    | frame 750    | eps: 0.81 | Running loss: nan
12:53:23 Running reward: -99.10   at episode 11   | frame 1000   | eps: 0.75 | Running loss: nan
12:53:25 Running reward: -98.17   at episode 13   | frame 1250   | eps: 0.69 | Running loss: nan
12:53:28 Running reward: -94.73   at episode 16   | frame 1500   | eps: 0.63 | Running loss: nan
12:53:30 Running reward: -91.59   at episode 18   | frame 1750   | eps: 0.57 | Running loss: nan
12:53:33 Running reward: -85.60   at episode 21   | frame 2000   | eps: 0.51 | Running loss: nan
12:53:36 Running reward: -82.64   at episode 23   | frame 2250   | eps: 0.44 | Running loss: nan
12:53:38 Running reward: -78.24   at episode 26   | frame 2500   | eps: 0.38 | Running loss: nan
12:53:41 Running reward: -74.5

12:57:13 Running reward: 8.70     at episode 212  | frame 20750  | eps: 0.01 | Running loss: 0.02753
12:57:15 Running reward: 8.57     at episode 215  | frame 21000  | eps: 0.01 | Running loss: 0.02832
12:57:19 Running reward: 8.23     at episode 217  | frame 21250  | eps: 0.01 | Running loss: 0.02878
12:57:22 Running reward: 8.30     at episode 220  | frame 21500  | eps: 0.01 | Running loss: 0.03013
12:57:25 Running reward: 8.40     at episode 222  | frame 21750  | eps: 0.01 | Running loss: 0.03057
12:57:28 Running reward: 8.40     at episode 225  | frame 22000  | eps: 0.01 | Running loss: 0.03143
12:57:31 Running reward: 8.47     at episode 228  | frame 22250  | eps: 0.01 | Running loss: 0.03214
12:57:33 Running reward: 8.33     at episode 230  | frame 22500  | eps: 0.01 | Running loss: 0.03266
12:57:36 Running reward: 8.37     at episode 233  | frame 22750  | eps: 0.01 | Running loss: 0.03356
12:57:40 Running reward: 8.30     at episode 235  | frame 23000  | eps: 0.01 | Running loss

13:01:21 Running reward: 8.53     at episode 421  | frame 41250  | eps: 0.01 | Running loss: 0.03572
13:01:24 Running reward: 8.97     at episode 424  | frame 41500  | eps: 0.01 | Running loss: 0.03529
13:01:27 Running reward: 8.83     at episode 427  | frame 41750  | eps: 0.01 | Running loss: 0.03539
13:01:30 Running reward: 8.97     at episode 429  | frame 42000  | eps: 0.01 | Running loss: 0.03533
13:01:33 Running reward: 9.07     at episode 432  | frame 42250  | eps: 0.01 | Running loss: 0.03547
13:01:36 Running reward: 9.13     at episode 434  | frame 42500  | eps: 0.01 | Running loss: 0.03570
13:01:39 Running reward: 8.80     at episode 437  | frame 42750  | eps: 0.01 | Running loss: 0.03571
13:01:42 Running reward: 8.50     at episode 439  | frame 43000  | eps: 0.01 | Running loss: 0.03567
13:01:45 Running reward: 8.27     at episode 442  | frame 43250  | eps: 0.01 | Running loss: 0.03569
13:01:48 Running reward: 8.33     at episode 444  | frame 43500  | eps: 0.01 | Running loss

13:05:30 Running reward: 9.23     at episode 631  | frame 61750  | eps: 0.01 | Running loss: 0.03534
13:05:33 Running reward: 9.03     at episode 633  | frame 62000  | eps: 0.01 | Running loss: 0.03526
13:05:36 Running reward: 8.97     at episode 636  | frame 62250  | eps: 0.01 | Running loss: 0.03521
13:05:40 Running reward: 8.97     at episode 638  | frame 62500  | eps: 0.01 | Running loss: 0.03547
13:05:43 Running reward: 9.00     at episode 641  | frame 62750  | eps: 0.01 | Running loss: 0.03554
13:05:46 Running reward: 8.77     at episode 643  | frame 63000  | eps: 0.01 | Running loss: 0.03554
13:05:49 Running reward: 8.67     at episode 646  | frame 63250  | eps: 0.01 | Running loss: 0.03536
13:05:52 Running reward: 8.50     at episode 648  | frame 63500  | eps: 0.01 | Running loss: 0.03532
13:05:55 Running reward: 8.30     at episode 651  | frame 63750  | eps: 0.01 | Running loss: 0.03525
13:05:58 Running reward: 8.10     at episode 654  | frame 64000  | eps: 0.01 | Running loss
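
The `eps` column in the training log above is consistent with a linear decay from 1.0 down to `epsilon_min` over `epsilon_greedy_frames` frames. The `DQNAgent` implementation is not shown in this notebook, so the sketch below is an assumption: the function name `epsilon_at` and the starting value `epsilon_max=1.0` are hypothetical, and the decay is assumed to start at frame 0.

```python
def epsilon_at(frame, epsilon_greedy_frames=4000, epsilon_max=1.0, epsilon_min=0.01):
    """Exploration rate after `frame` frames, assuming a linear decay (hypothetical helper)."""
    # Amount subtracted from epsilon on every frame.
    step = (epsilon_max - epsilon_min) / epsilon_greedy_frames
    # Never go below the configured floor.
    return max(epsilon_min, epsilon_max - step * frame)


# Compare against the eps values printed in the log for Config 4.
for frame in (250, 500, 1000, 2500, 20750):
    print(f"frame {frame:<6} eps: {epsilon_at(frame):.2f}")
```

With `epsilon_greedy_frames = 4000` the schedule reaches the floor of 0.01 at frame 4000, so everything after that point in the log runs with almost purely greedy actions.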

# Parameter study
- **update_target_network** = 250. With 100, training took slightly longer; 500 and 1000 were also worse. This value apparently has to be tuned to the complexity of the task.

- **max_memory_length** = 4000. With 2000 the result was about the same, slightly worse. With 8k and 16k, training took 1.5-2 times longer.

- **epsilon_greedy_frames** = 4000: the network trained by 23k frames. With 2000, training finished at 33k frames. With 8k and 16k, training finished at 31k frames. So even though large epsilon_greedy_frames values leave fewer cycles of pure training, training was faster and more stable than with small epsilon_greedy_frames values.

- **batch_size** = 32. On a simple task the effect is quite straightforward: a larger batch means more data and faster training. With 16 the network trained after 36k frames, with 32 after 23k, and with 64 and 128 after roughly 16k frames. There is no difference in speed between 64 and 128.

- **learning_rate**: in the current experiment the effect is not obvious. With 0.0001, 0.00025 and 0.0005 the network trained in about 21-24k frames. With lr=0.001 training finished at 11k frames. This needs a retest on harder tasks.

- **loss_function**: with MSE training completed by 32k frames, with Huber by 23k. The curves swung wildly in both cases.

- **optimizer**: Adam works fine. RMSProp did not cope. SGD did not cope either, and it produced a square-wave (meander) pattern.

- **update_after_actions**: the general rule is that the less frequent the updates, the longer training takes.
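
To put `update_after_actions` and `update_target_network` in perspective, the sketch below counts how many gradient steps and target-network syncs fit into a run of a given length. It assumes a training loop shaped like the standard Keras DQN example (one gradient step every `update_after_actions` frames once the replay buffer holds more than `batch_size` transitions, one target sync every `update_target_network` frames). The actual `DQNAgent` loop is not shown in this notebook, and `count_updates` is a hypothetical helper.

```python
def count_updates(total_frames, update_after_actions=4,
                  update_target_network=250, batch_size=128):
    """Count gradient steps and target-network syncs (hypothetical helper).

    Assumes one transition is stored per frame, so after `frame` frames
    the replay buffer holds `frame` transitions (before any capping).
    """
    gradient_steps = 0
    target_syncs = 0
    for frame in range(1, total_frames + 1):
        # Train only on every N-th frame, once a full batch can be sampled.
        if frame % update_after_actions == 0 and frame > batch_size:
            gradient_steps += 1
        # Copy the online weights into the target network on a fixed cadence.
        if frame % update_target_network == 0:
            target_syncs += 1
    return gradient_steps, target_syncs


# Config 4 values over the 80000 frames passed to tm_4.go.
print(count_updates(80000))
# Doubling update_after_actions roughly halves the number of gradient steps.
print(count_updates(80000, update_after_actions=8))
```

Under these assumptions, a larger `update_after_actions` gives the network fewer gradient steps for the same number of frames, which matches the observation that less frequent updates slow training down.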