# Reinforcement Learning

Reinforcement learning is a field of machine learning that focuses on sequential decision-making problems. The goal is to train an agent to choose actions that optimize an objective defined by a reward signal.
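As a small illustration of an objective built from a reward signal, the sketch below computes a discounted sum of rewards for one episode. This is a common choice of objective rather than something this notebook states explicitly; the function name `discounted_return` is hypothetical, and `gamma` plays the same role as the `gamma` parameter used in the training cells later on.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps with reward 1 each and gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```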


[SAS VDMML RL Programming Guide](https://go.documentation.sas.com/doc/en/pgmcdc/8.11/casrlpg/titlepage.htm)

[RL CAS Action Doc Page](https://go.documentation.sas.com/doc/en/pgmcdc/8.11/casrlpg/cas-reinforcementlearn-TblOfActions.htm)

# SAS RL Built-in Cart Pole Example

In [1]:
import swat

In [2]:
conn = swat.CAS("server", 30571, "student", "Metadata0")

In [3]:
conn.loadactionset('tkrl')

NOTE: Added action set 'tkrl'.


# Deep Q-Network

The Deep Q-Network (DQN) algorithm is a model-free, off-policy, online reinforcement learning (RL) method. Like other neural-network-based RL algorithms, DQN uses a neural network to approximate a system’s state-action value function (commonly called a Q-function). The Q-function measures the quality of an action when it is performed in a given state, so choosing the action that maximizes Q in each state yields the agent’s optimal policy. In practice, the true Q-function is generally unknown and only approximations are available. Because DQN is an online method, an agent trained with DQN learns its approximation by interacting directly with an environment.

[Deep Q-Network SAS Documentation](https://go.documentation.sas.com/doc/en/pgmcdc/8.11/casrlpg/n0eq0hv40n0bbgn1tivnxo53xow1.htm)
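To make the role of the Q-function concrete, here is a minimal sketch of the two calculations that a textbook DQN relies on: picking the action with the largest Q-value, and forming the one-step target that the network is trained toward. It follows the standard formulation and is not taken from the SAS implementation; the function names and the example Q-values are illustrative assumptions.

```python
def greedy_action(q_values):
    """Index of the action with the largest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r if the episode ended, else r + gamma * max Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Made-up Q-values for a two-action problem such as Cart Pole
print(greedy_action([0.2, 1.5]))                          # 1
print(td_target(1.0, [0.5, 2.0], done=False, gamma=0.5))  # 2.0
print(td_target(1.0, [0.5, 2.0], done=True, gamma=0.5))   # 1.0
```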

In [4]:
conn.tkrl.rlTrainDqn(
    environment=dict(type='builtin', name='CartPole-v0'),    
    seed = 1234,
    optimizer=dict(method='ADAM', miniBatchSize=128),
    numEpisodes = 5,
    gamma = 0.99,
    testInterval = 25,
    numTestEpisodes = 1,
    targetUpdateInterval = 100,
    minReplayMemory = 10,
    maxReplayMemory = 1000,
    syncInterval = 100,
    modelOut=dict(name='dqnWeights', replace=True),
    finalTargetCopy=True,
    QModel=[32,32]
)

         Episode=        0 AvgQValue=0.1147 AvgTarget=1.1896 AvgLoss=1.2136 TestReward=     9
         Episode=        5 AvgQValue=2.1991 AvgTarget=2.1359 AvgLoss=0.1216 TestReward=     9
NOTE: Reinforcement learning rlTrainDqn action complete.


| | Description | Value |
|---|---|---|
| 0 | Average QValue | 2.199053 |
| 1 | Average Target Value | 2.135932 |
| 2 | Test Reward | 9.0 |

| | Property | Value |
|---|---|---|
| 0 | Number of State Variables | 4 |
| 1 | Number of Actions | 2 |
| 2 | Algorithm | DQN |
| 3 | Optimizer | ADAM |

| | Iteration | AvgQValue | AvgTarget | AvgLoss | Test Reward |
|---|---|---|---|---|---|
| 0 | 0 | 0.114691 | 1.18959 | 1.213632 | 9.0 |
| 1 | 5 | 2.199053 | 2.135932 | 0.121585 | 9.0 |


In [5]:
conn.tkrl.rlScore(model='dqnWeights',
    environment=dict(type='builtin', name='CartPole-v0'),  
    numEpisodes=2,
    logFreq=1,
    casout=dict(name='scoreTable', replace=True)
)

         Episode=        1 Step=        1 LastReward=     0 AverageReward=     1
         Episode=        1 Step=        2 LastReward=     0 AverageReward=     2
         Episode=        1 Step=        3 LastReward=     0 AverageReward=     3
         Episode=        1 Step=        4 LastReward=     0 AverageReward=     4
         Episode=        1 Step=        5 LastReward=     0 AverageReward=     5
         Episode=        1 Step=        6 LastReward=     0 AverageReward=     6
         Episode=        1 Step=        7 LastReward=     0 AverageReward=     7
         Episode=        1 Step=        8 LastReward=     0 AverageReward=     8
         Episode=        2 Step=        9 LastReward=     8 AverageReward=   4.5
         Episode=        2 Step=       10 LastReward=     8 AverageReward=     5
         Episode=        2 Step=       11 LastReward=     8 AverageReward=   5.5
         Episode=        2 Step=       12 LastReward=     8 AverageReward=     6
         Episode=        2 S

| | Property | Value |
|---|---|---|
| 0 | Number of State Variables | 4 |
| 1 | Number of Actions | 2 |
| 2 | Algorithm | DQN |
| 3 | Optimizer | ADAM |


In [6]:
conn.table.fetch('scoreTable')

| | _Step_ | _Episode_ | _State_0 | _State_1 | _State_2 | _State_3 | _Action_ | _Reward_ | _Done_ |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | -0.044548 | -0.030393 | -0.037056 | -0.046217 | 1.0 | 1.0 | 0.0 |
| 1 | 1.0 | 1.0 | -0.045155 | 0.16524 | -0.03798 | -0.350357 | 1.0 | 1.0 | 0.0 |
| 2 | 2.0 | 1.0 | -0.041851 | 0.360881 | -0.044987 | -0.65477 | 1.0 | 1.0 | 0.0 |
| 3 | 3.0 | 1.0 | -0.034633 | 0.5566 | -0.058083 | -0.961273 | 1.0 | 1.0 | 0.0 |
| 4 | 4.0 | 1.0 | -0.023501 | 0.752452 | -0.077308 | -1.271623 | 1.0 | 1.0 | 0.0 |
| 5 | 5.0 | 1.0 | -0.008452 | 0.948471 | -0.102741 | -1.587479 | 1.0 | 1.0 | 0.0 |
| 6 | 6.0 | 1.0 | 0.010517 | 1.144653 | -0.13449 | -1.910353 | 1.0 | 1.0 | 0.0 |
| 7 | 7.0 | 1.0 | 0.03341 | 1.340945 | -0.172697 | -2.241553 | 1.0 | 1.0 | 1.0 |
| 8 | 0.0 | 2.0 | -0.019654 | -0.01548 | -0.035218 | 0.016788 | 1.0 | 1.0 | 0.0 |
| 9 | 1.0 | 2.0 | -0.019964 | 0.180129 | -0.034882 | -0.286795 | 1.0 | 1.0 | 0.0 |
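The score table holds one row per step, so the reward for an episode is the sum of `_Reward_` over that episode's rows. The sketch below does this for the ten rows displayed above, with the `(_Episode_, _Reward_)` pairs copied by hand. Episode 2 is only partly visible in that output, so its total here covers just the two rows shown. The helper name `reward_per_episode` is hypothetical.

```python
from collections import defaultdict

# (_Episode_, _Reward_) pairs copied from the ten rows displayed above:
# eight rows for episode 1 and the first two rows of episode 2
rows = [(1.0, 1.0)] * 8 + [(2.0, 1.0)] * 2


def reward_per_episode(rows):
    """Sum the per-step rewards for each episode number."""
    totals = defaultdict(float)
    for episode, reward in rows:
        totals[int(episode)] += reward
    return dict(totals)


print(reward_per_episode(rows))  # {1: 8.0, 2: 2.0}
```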


In [7]:
conn.session.endSession()


# Using Environments

To train or score an agent with an environment, you must specify the environment parameter in a CAS action that supports environments. You can specify a built-in environment or a custom environment that can be hosted remotely. To access a custom environment, the action requires only the URL and the port number at which the environment can be accessed.
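The two forms of the environment parameter used in this notebook can be sketched with a small helper. `make_env_spec` is a hypothetical convenience function, not part of SWAT or the action set, and `localhost:10200` is a placeholder address. The dictionary keys (`type`, `name`, `url`, `render`) are the ones that appear in the training and scoring cells here.

```python
def make_env_spec(name, url=None, **options):
    """Hypothetical helper: build the dict passed as the `environment` parameter."""
    if url is None:
        # Built-in environment: only the type and the name are needed
        spec = dict(type='builtin', name=name)
    else:
        # Remote environment: the action also needs the host:port address
        spec = dict(type='remote', url=url, name=name)
    spec.update(options)
    return spec


builtin_env = make_env_spec('CartPole-v0')
remote_env = make_env_spec('CartPole-v0', url='localhost:10200', render=False)
print(builtin_env)
print(remote_env)
```

Either dictionary could then be passed as `environment=...` to an action such as `rlTrainDqn` or `rlScore`.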

# Hosting a Custom Environment

SAS supports custom environments through open-source packages. You can define environments using your own packages, but the recommended method is to use the sasrl-env package, which can be installed from the Python Package Index (PyPI). For details about installing sasrl-env, see sasrl-env on PyPI.

Regardless of how you create environments, the reinforcement learning action set assumes that your environment includes the following functions.

- make - creates an instance of an environment.

- reset - sets the environment to an initial state.

- step - advances the environment by one iteration based on the agent’s action. The step function is responsible for returning the reward and the next state of the environment.

- seed - sets the seed of the random number generator.

- close - stops the environment instance and performs any cleanup steps.

In addition to the preceding required functions, you can define the following functions to access additional features from the CAS environment.

- render - generates and displays a visual representation of the environment.

- sample - generates a single random sample from the action space.
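The required and optional functions listed above can be illustrated with a toy environment. This is a sketch only: the class `LineWalkEnv`, its reward scheme, and the Gym-like return value of `step` are assumptions made for illustration, and the exact signatures that sasrl-env expects are not shown in this notebook.

```python
import random


class LineWalkEnv:
    """Toy environment: walk on positions 0..4; reaching position 4 ends the episode."""

    def __init__(self):
        self.rng = random.Random()
        self.position = 0

    def seed(self, seed=None):
        # Seed the environment's own random number generator
        self.rng.seed(seed)

    def reset(self):
        # Return to the initial state and report it
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left; stay on the track
        move = 1 if action == 1 else -1
        self.position = min(4, max(0, self.position + move))
        done = self.position == 4
        reward = 1.0 if done else 0.0
        return self.position, reward, done

    def sample(self):
        # One random action from the action space {0, 1}
        return self.rng.choice([0, 1])

    def render(self):
        # Text rendering: 'A' marks the agent's position
        track = ['.'] * 5
        track[self.position] = 'A'
        print(''.join(track))

    def close(self):
        # Nothing to clean up in this toy example
        pass


def make():
    """Create an instance of the environment."""
    return LineWalkEnv()


# Random rollout, capped at 1000 steps so the loop always ends
env = make()
env.seed(0)
state = env.reset()
done = False
steps = 0
while not done and steps < 1000:
    state, reward, done = env.step(env.sample())
    steps += 1
env.render()
env.close()
```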

# Gym and SAS Environments

In [8]:
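# Open Source Env Names
# Note: registry.all() is available in older gym releases; newer releases
# may expose the registry differently.
from gym import envs
envids = [spec.id for spec in envs.registry.all()]
for envid in sorted(envids):
    print(envid)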

Acrobot-v1
Adventure-ram-v0
Adventure-ram-v4
Adventure-ramDeterministic-v0
Adventure-ramDeterministic-v4
Adventure-ramNoFrameskip-v0
Adventure-ramNoFrameskip-v4
Adventure-v0
Adventure-v4
AdventureDeterministic-v0
AdventureDeterministic-v4
AdventureNoFrameskip-v0
AdventureNoFrameskip-v4
AirRaid-ram-v0
AirRaid-ram-v4
AirRaid-ramDeterministic-v0
AirRaid-ramDeterministic-v4
AirRaid-ramNoFrameskip-v0
AirRaid-ramNoFrameskip-v4
AirRaid-v0
AirRaid-v4
AirRaidDeterministic-v0
AirRaidDeterministic-v4
AirRaidNoFrameskip-v0
AirRaidNoFrameskip-v4
Alien-ram-v0
Alien-ram-v4
Alien-ramDeterministic-v0
Alien-ramDeterministic-v4
Alien-ramNoFrameskip-v0
Alien-ramNoFrameskip-v4
Alien-v0
Alien-v4
AlienDeterministic-v0
AlienDeterministic-v4
AlienNoFrameskip-v0
AlienNoFrameskip-v4
Amidar-ram-v0
Amidar-ram-v4
Amidar-ramDeterministic-v0
Amidar-ramDeterministic-v4
Amidar-ramNoFrameskip-v0
Amidar-ramNoFrameskip-v4
Amidar-v0
Amidar-v4
AmidarDeterministic-v0
AmidarDeterministic-v4
AmidarNoFrameskip-v0
AmidarNoFrames

[gym openai environments](https://gym.openai.com/envs/#classic_control)

[Cart Pole Environment Code](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

[SAS Env Names](https://github.com/sassoftware/sasrlenv/blob/main/sasrl_env/client.py)

# Using the SASRL Environment

[sasrl-env github page](https://github.com/sassoftware/sasrlenv)

# Install the sasrl-env Package

- pip install sasrl-env

# Starting an Environment Server (from Anaconda Prompt)

- from sasrl_env import runServer
- runServer.start(PORT_NUMBER)  # replace PORT_NUMBER with the port to listen on, for example 10200

# Create URL to SAS RL Environment

In [None]:
computer_id = "*** ENTER IP ADDRESS HERE ***"
sasrl_env_port = "10200"
sasrl_env_url = computer_id + ":" + sasrl_env_port

In [None]:
import swat

In [None]:
conn = swat.CAS("server", 30571, "student", "Metadata0")

In [None]:
conn.loadactionset('tkrl')

# Cart Pole Example - Using a Remote Environment

In [None]:
conn.tkrl.rlTrainDqn(
    environment=dict(type='remote', url=sasrl_env_url, name='CartPole-v0', 
                     render=True, renderFreq=10, renderSleep=0.01, seed=54),    
    seed = 1234,
    optimizer=dict(method='ADAM', miniBatchSize=128),
    exploration = dict(type="linear", initialEpsilon=1.0, minEpsilon=0.05),
    numEpisodes = 50,
    gamma = 0.99,
    testInterval = 25,
    numTestEpisodes = 1,
    targetUpdateInterval = 100,
    minReplayMemory = 10,
    maxReplayMemory = 1000,
    syncInterval = 100,
    modelOut=dict(name='dqnWeights', replace=True),
    finalTargetCopy=True,
    QModel=[{'n':32, 'act':'RELU', 'type':'FC'},
            {'n':32, 'act':'RELU', 'type':'FC'}]
)
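The `exploration` parameter in the cell above asks for a linear schedule between `initialEpsilon` and `minEpsilon`. How the action spaces that decay over training is not shown in this notebook, so the sketch below is a generic linear epsilon-greedy schedule, not the SAS implementation. The function names and the `total_steps` argument are assumptions.

```python
import random


def linear_epsilon(step, total_steps, initial_epsilon=1.0, min_epsilon=0.05):
    """Interpolate linearly from initial_epsilon to min_epsilon, then hold."""
    fraction = min(1.0, step / total_steps)
    return initial_epsilon + fraction * (min_epsilon - initial_epsilon)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Epsilon at the start, halfway through, and at the end of a 100-step schedule
for step in (0, 50, 100):
    print(step, round(linear_epsilon(step, 100), 3))

# With epsilon = 0 the choice is always greedy
print(epsilon_greedy([0.2, 1.5], epsilon=0.0))  # 1
```

Early in training, a high epsilon favors exploration. As epsilon approaches its minimum, the agent mostly follows its current Q-value estimates.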

In [None]:
conn.tkrl.rlScore(model='dqnWeights',
    environment=dict(type='remote', url=sasrl_env_url, name='CartPole-v0', 
                     render=True, renderFreq=1, renderSleep=0.01),  
    numEpisodes=1,
    logFreq=1,
    writeQValues = True,
    casout=dict(name='scoreTable', replace=True)
)

In [None]:
display(conn.table.recordcount('scoreTable'))
conn.table.fetch('scoreTable')

In [None]:
conn.session.endSession()

# Custom Environment

In [None]:
import swat

In [None]:
conn = swat.CAS("server", 30571, "student", "Metadata0")

In [None]:
conn.loadactionset('tkrl')

In [None]:
conn.tkrl.rlTrainDqn(
    environment=dict(type='remote', url=sasrl_env_url, name='berrypatch-v0', 
                     render=False, seed=54),    
    seed = 1234,
    optimizer=dict(method='ADAM', miniBatchSize=128),
    exploration = dict(type="linear", initialEpsilon=1.0, minEpsilon=0.05),
    numEpisodes = 50,
    gamma = 0.99,
    testInterval = 25,
    numTestEpisodes = 1,
    targetUpdateInterval = 100,
    minReplayMemory = 10,
    maxReplayMemory = 1000,
    syncInterval = 100,
    modelOut=dict(name='dqnWeights', replace=True),
    finalTargetCopy=True,
    QModel=[{'n':32, 'act':'RELU', 'type':'FC'},
            {'n':32, 'act':'RELU', 'type':'FC'}]
)

In [None]:
conn.tkrl.rlScore(model='dqnWeights',
    environment=dict(type='remote', url=sasrl_env_url, name='berrypatch-v0', 
                     render=False),  
    numEpisodes=1,
    logFreq=1,
    writeQValues = True,
    casout=dict(name='scoreTable', replace=True)
)

In [None]:
display(conn.table.recordcount('scoreTable'))
conn.table.fetch('scoreTable')

# End the Session

In [None]:
conn.session.endSession()