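## What reinforcement learning is, its environment, and an implementation
The goal of reinforcement learning is to learn a policy: a rule that, for every state, decides which action the agent (the AI) takes.

It works like training a puppy. At first its actions are random; when it does the right thing it gets a treat, and when it does the right thing again it gets another treat. Repeating this is the whole training process.

Agent (the operator, i.e. the algorithm) ---- action ---- environment ---- observation, reward (fed back to the agent)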
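The loop in the diagram above can be sketched without any library. `CountdownEnv`, `run_episode` and the two policies are hypothetical names made up for this illustration; only the `reset()` / `step()` shape is borrowed from the gym code used later in this notebook.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment: an episode lasts `horizon` steps.

    Action 1 earns a reward of 1.0, action 0 earns nothing.
    """

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is simply the step counter

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return self.t, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    observation = env.reset()
    done = False
    total = 0.0
    while not done:
        action = policy(observation)                      # agent chooses an action
        observation, reward, done, _ = env.step(action)   # environment answers
        total += reward
    return total


def random_policy(observation):
    return random.choice([0, 1])


def always_one(observation):
    return 1


print(run_episode(CountdownEnv(), always_one))  # 5.0
```

A learning algorithm would sit where `policy` is and adjust it from the rewards it sees; here the policies are fixed so that the loop itself stays visible.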

In [4]:
import tensorflow as tf
tf.__version__

'2.5.0'

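### Reinforcement learning environment
    1. Uninstall your existing keras and tensorflow
    2. Reinstall keras and tensorflow
        pip install tensorflow==1.14
        pip install keras==2.2
    3. Install the gym environment
        pip install gym
    Note: the cells in this notebook were run on TensorFlow 2.5.0 (see the version output above), not on the 1.14 pin; for that setup see the keras-rl2 note in the training code section.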
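Before running the later cells it can help to confirm which versions actually ended up installed. The helper below is a small sketch using only the standard library (`importlib.metadata`, Python 3.8+); `installed_version` is a name introduced here, and the package names are the pip names from the steps above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("tensorflow", "keras", "gym"):
    print(name, installed_version(name))
```

A `None` in the output means the package is missing from the active environment, which is the usual cause of a `No module named ...` error.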
        

In [1]:
import gym
import random

In [2]:
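env = gym.make('CartPole-v0')  # create the game environment
for episode in range(1, 10):   # play 9 episodes
    state = env.reset()        # reset the game to its initial state
    done = False               # the episode is still in progress
    score = 0                  # accumulated reward
    while not done:            # main game loop
        env.render()           # draw the game window
        action = random.choice([0, 1])  # random action: 0 = push left, 1 = push right
        observation, reward, done, info = env.step(action)  # apply the action (4-value return of the older gym API)
        score += reward        # add this step's reward
        if done:               # the episode has ended
            print("game over")
    print('Episode:{} Score:{}'.format(episode, score))  # report the episode number and its score
env.close()                    # close the render window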

game over
Episode:1 Score:42.0
game over
Episode:2 Score:28.0
game over
Episode:3 Score:24.0
game over
Episode:4 Score:15.0
game over
Episode:5 Score:25.0
game over
Episode:6 Score:31.0
game over
Episode:7 Score:12.0
game over
Episode:8 Score:13.0
game over
Episode:9 Score:24.0


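## Training code
    1. The current environment is TensorFlow 2.5+.
       `from rl.agents import DQNAgent` fails with `No module named 'rl'`
       until the package that provides `rl` is installed.
    2. Install keras-rl2 (`pip install keras-rl2`), not keras-rl; keras-rl targets the older standalone Keras.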

In [2]:
import gym
# keras-rl2 builds on tensorflow.keras, so take the model classes from there as well
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

In [3]:
env = gym.make('CartPole-v0')
states = env.observation_space.shape[0]
actions = env.action_space.n

In [8]:
def build_model(states, actions):
    model = Sequential()
    model.add(Flatten(input_shape=(1,states)))
    model.add(Dense(24,activation='relu'))
    model.add(Dense(24,activation='relu'))
    model.add(Dense(actions,activation='linear'))
    return model
model = build_model(states, actions)
# model.summary()
    

In [6]:
# Build the agent (the player). It needs three parts: a policy, a memory, and the agent itself.
def build_agent(model, actions):
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model,memory=memory,policy=policy,
                   nb_actions=actions,nb_steps_warmup=10)
    return dqn
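`BoltzmannQPolicy` chooses actions at random, with probabilities given by a softmax over the network's Q-values, so better-looking actions are picked more often but the others are still explored. The sketch below shows the idea in plain Python; it is not keras-rl's own code (which uses NumPy and clips the exponent), and `boltzmann_probs` / `sample_action` are names introduced here. Subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the probabilities.

```python
import math
import random


def boltzmann_probs(q_values, tau=1.0):
    """Softmax over Q-values with temperature tau (higher tau = more random)."""
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, tau=1.0):
    """Draw an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, tau)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]


print(boltzmann_probs([3.0, 3.0]))  # [0.5, 0.5]
print(sample_action([1.0, 2.0]))
```

With equal Q-values both actions are equally likely; as the gap between them grows, the higher-valued action dominates.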

In [6]:

dqn = build_agent(model, actions)
dqn.compile(optimizer='adam',metrics=['mae'])
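dqn.fit(env=env, nb_steps=1000, visualize=True, verbose=1)  # nb_steps sets the number of training steps; visualize=True renders each step
dqn.save_weights('../output/rl/model_10000.h5')  # note: the file name says 10000, but only 1000 steps were trained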

Training for 1000 steps ...
Interval 1 (0 steps performed)
  999/10000 [=>............................] - ETA: 7:14 - reward: 1.0000
done, took 48.640 seconds


In [10]:
dqn = build_agent(model, actions)
dqn.compile(optimizer='adam',metrics=['mae'])
dqn.load_weights('../output/rl/model_10000.h5')
test_history = dqn.test(env, nb_episodes=100, visualize=True)  # run 100 evaluation episodes with the loaded weights

Testing for 100 episodes ...
Episode 1: reward: 9.000, steps: 9
Episode 2: reward: 8.000, steps: 8
Episode 3: reward: 9.000, steps: 9
Episode 4: reward: 9.000, steps: 9
Episode 5: reward: 8.000, steps: 8
Episode 6: reward: 10.000, steps: 10
Episode 7: reward: 10.000, steps: 10
Episode 8: reward: 10.000, steps: 10
Episode 9: reward: 10.000, steps: 10
Episode 10: reward: 10.000, steps: 10
Episode 11: reward: 10.000, steps: 10
Episode 12: reward: 11.000, steps: 11
Episode 13: reward: 10.000, steps: 10
Episode 14: reward: 10.000, steps: 10
Episode 15: reward: 9.000, steps: 9
Episode 16: reward: 9.000, steps: 9
Episode 17: reward: 10.000, steps: 10
Episode 18: reward: 9.000, steps: 9
Episode 19: reward: 10.000, steps: 10
Episode 20: reward: 9.000, steps: 9
Episode 21: reward: 10.000, steps: 10
Episode 22: reward: 9.000, steps: 9
Episode 23: reward: 9.000, steps: 9
Episode 24: reward: 9.000, steps: 9
Episode 25: reward: 9.000, steps: 9
Episode 26: reward: 10.000, steps: 10
Episode 27: reward: 9.000, steps: 9
Episode 28:
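To put these test rewards in context, they can be compared with the scores of the random-action loop from the first part of this notebook. The two lists below are copied from the printed outputs (the 9 random episodes and the 27 complete test episodes); nothing else is assumed.

```python
from statistics import mean

# Scores printed by the random-action loop earlier in this notebook.
random_scores = [42, 28, 24, 15, 25, 31, 12, 13, 24]

# Rewards of the 27 complete test episodes printed above.
dqn_rewards = [9, 8, 9, 9, 8, 10, 10, 10, 10, 10, 10, 11, 10, 10,
               9, 9, 10, 9, 10, 9, 10, 9, 9, 9, 9, 10, 9]

print(round(mean(random_scores), 2))  # 23.78
print(round(mean(dqn_rewards), 2))    # 9.44
```

The agent trained for 1000 steps scores below the random baseline here. A plausible reading is that 1000 steps is far too few for DQN on CartPole, which fits the much longer 50000-step run in the next section.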

## Code found on Stack Overflow

In [None]:
import gym
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory


def build_model(states, actions):
    model = Sequential()
    # model.add(Input(shape=(1,states)))
    model.add(Flatten(input_shape=(1,states)))
    model.add(Dense(24,activation='relu'))
    model.add(Dense(24,activation='relu'))
    model.add(Dense(actions,activation='linear'))
    return model

def build_agent(model, actions):
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, memory=memory, policy=policy, nb_actions=actions,
                   nb_steps_warmup=10, target_model_update=1e-2)
    return dqn

env = gym.make('CartPole-v0')
states = env.observation_space.shape[0]
actions = env.action_space.n

model = build_model(states, actions)
model.summary()

dqn = build_agent(model, actions)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])  # `lr` is a deprecated alias of `learning_rate` in TF 2.x
dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 4)                 0         
_________________________________________________________________
dense (Dense)                (None, 24)                120       
_________________________________________________________________
dense_1 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 50        
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
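The parameter counts in the summary above can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit, and `Flatten` has none. A short sketch (`dense_params` is a helper name introduced here):

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


states, actions = 4, 2                      # CartPole: 4 observation values, 2 actions
layer_sizes = [states, 24, 24, actions]     # Flatten output, two hidden layers, Q-value output
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts, sum(counts))  # [120, 600, 50] 770
```

The three values match the `Param #` column and the total of 770.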




Training for 50000 steps ...
Interval 1 (0 steps performed)
    1/10000 [..............................] - ETA: 3:22:54 - reward: 1.0000
105 episodes - episode_reward: 93.771 [9.000, 200.000] - loss: 2.474 - mae: 19.024 - mean_q: 38.572

Interval 2 (10000 steps performed)
 1496/10000 [===>..........................] - ETA: 4:34 - reward: 1.0000
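One difference from the earlier training cell is `target_model_update=1e-2`. In keras-rl a value below 1 is treated as a soft-update rate: after each training step the target network's weights move a small fraction of the way toward the online network's weights. The sketch below shows that rule on plain lists of numbers; `soft_update` is a name introduced here, and real weights are tensors rather than Python floats.

```python
def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau toward the matching online weight."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]


target = [0.0, 1.0]
online = [1.0, 1.0]

# Apply the update repeatedly: the target drifts slowly toward the online weights.
for _ in range(500):
    target = soft_update(target, online)

print(target)
```

A slowly moving target keeps the Q-learning targets from shifting at every step, which is one common way to stabilise DQN training.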