<!-- PROJECT SHIELDS -->
<!-- [![LinkedIn][linkedin-shield]][linkedin-url] -->

<img src="images/robot_fail.gif" alt="Logo" width="400" height="300">

# Reinforcement Learning in 3D Simulated Environments

<!-- PROJECT LOGO -->
<br />
<!-- <p align="center">
  <a href="https://github.com/rasbot/Reinforcement_Learning_in_Unity">
    <img src="images/robot_fail.gif" alt="Logo" width="600" height="400">
  </a>

  <h3 align="center">Reinforcement Learning in 3D Simulated Environments</h3> -->

<!--   <p align="center">
    Using Unity to simulate environments to train agents using reinforcement learning.
    <br />
    <a href="https://github.com/rasbot/Reinforcement_Learning_in_Unity"><strong>Explore the docs »</strong></a>
    <br />
    <br />
    <a href="https://github.com/rasbot/Reinforcement_Learning_in_Unity">View Demo</a>
    ·
    <a href="https://github.com/rasbot/Reinforcement_Learning_in_Unity/issues">Report Bug</a>
    ·
    <a href="https://github.com/rasbot/Reinforcement_Learning_in_Unity/issues">Request Feature</a>
  </p>
</p> -->



<!-- TABLE OF CONTENTS -->
## Table of Contents

* [Introduction](#introduction)
* [Reinforcement Learning Algorithms](#reinforcement-learning-algorithms)
  * [PPO](#ppo)
  * [SAC](#sac)
* [Walking Trainer](#walking-trainer)
  * [Untrained Walker](#untrained-walker)
  * [PPO Training](#ppo-training)
  * [SAC Training](#sac-training)
* [Contact](#contact)
* [Acknowledgements](#acknowledgements)



<!-- INTRODUCTION -->
## Introduction
---
Reinforcement learning has many useful applications such as robotics and ... Researchers in artificial intelligence (AI) are constantly improving learning algorithms, and they benefit from being able to train and test their models in different environments. Training and testing within a simulated environment lets algorithms be improved more rapidly. An ideal simulated environment provides both a physics engine and graphics rendering.

The choice of machine learning (ML) model depends on the type of learning or prediction involved.

<img src="images/ML_map.PNG" width="400" height="200"/>

If the model will make a prediction of a specific target or label from a set of features, supervised learning can be used. If the data is unstructured or unlabeled, and the target feature is not known, unsupervised learning can be used to find relations between features. If the goal is to have a model learn from the environment through interaction, reinforcement learning is used.

Reinforcement learning involves an __agent__, which could be a robot, a self-driving car, or a video game character, that interacts with and learns from the __environment__. The agent observes the environment, takes __actions__, and receives __rewards__ based on those actions. A reward can be positive or negative: for example, a robot that moves closer to its target would get a positive reward, and one that moves away from its target would get a negative reward. The goal of the model is to maximize the reward in order to perform a specific task. An agent might never receive a positive reward and only be penalized, such as an agent navigating a maze. The penalty might be a negative reward for each step, and the model trains the agent to minimize the total penalty. The algorithm the agent uses to determine its actions is called the __policy__. This will be discussed more later.
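
The observe, act, and reward cycle described above can be sketched in plain Python. The `LineWorld` environment, its reward values, and the two fixed policies are hypothetical stand-ins invented for this illustration; they are not part of Unity or the ML-Agents toolkit.

```python
class LineWorld:
    """Hypothetical 1-D environment: the agent starts at 0 and must reach `target`."""

    def __init__(self, target=5):
        self.target = target
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the observation is simply the agent's position

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (observation, reward, done)."""
        old_distance = abs(self.target - self.position)
        self.position += action
        new_distance = abs(self.target - self.position)
        # Positive reward for moving closer to the target, negative for moving away.
        reward = 1.0 if new_distance < old_distance else -1.0
        done = self.position == self.target
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode with the given policy and return the total reward."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(observation):
    return 1


def always_left(observation):
    return -1


print(run_episode(LineWorld(target=5), always_right))  # 5.0
```

A policy that always moves toward the target collects one positive reward per step, while `always_left` is penalized on every step until the step limit ends the episode.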

<img src="images/agent_env.PNG" width="500" height="167"/>

The Unity game engine provides an editor that allows ML algorithms to interact with and learn from a variety of simulated environments. Recently the Unity team released version 1.0 of their [ML-Agents toolkit](https://github.com/Unity-Technologies/ml-agents). This toolkit, along with the Unity editor, gives a user the ability to create 3D simulated environments, a Python API to control and integrate ML algorithms into the environments, and a learning pipeline using C# to collect data from the environment, implement agent actions, and collect rewards. Agents can be trained and tested in the Unity engine using the ML-Agents toolkit.

<img src="images/Unity_pipeline.PNG" width="600" height="450"/>

<p style="text-align: center;">
The Unity Learning Environment
</p>

The Python API interfaces with the Unity game engine to create and control ML algorithms in the simulated environment. Python code initializes the environment variables, and feeds agent interactions to the model during training. The trained model is then fed back into the agent during testing.

<!-- ALGORITHMS -->
## Reinforcement Learning Algorithms
---
As mentioned above, the policy is the algorithm the agent uses to choose its actions. The simplest example, which doesn't involve any learning, is a stochastic policy where an agent moves randomly around an environment. If the goal is to do something like collect all the coins in an environment, a random walk will eventually get the job done. If the policy uses feedback from the environment, the agent can behave more intelligently.
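
As a rough illustration of the random-walk case, the sketch below moves an agent uniformly at random on a small grid until every coin has been picked up. The grid size, coin positions, and function name are assumptions made up for this example.

```python
import random


def random_walk_collect(grid_size=4, coins=((0, 3), (2, 1), (3, 3)), seed=0,
                        max_steps=100_000):
    """Move uniformly at random on a grid until every coin is collected.

    Returns the number of steps taken, or None if max_steps is reached first.
    """
    rng = random.Random(seed)
    remaining = set(coins)
    x, y = 0, 0  # the agent starts in the corner
    remaining.discard((x, y))
    if not remaining:
        return 0
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    for step in range(1, max_steps + 1):
        dx, dy = rng.choice(moves)
        # Clamp to the grid so the agent bumps into walls instead of leaving.
        x = min(max(x + dx, 0), grid_size - 1)
        y = min(max(y + dy, 0), grid_size - 1)
        remaining.discard((x, y))
        if not remaining:
            return step
    return None


print(random_walk_collect())
```

The random policy does finish the task, but it typically needs many more steps than the shortest route, which is the motivation for policies that learn from feedback.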


<!-- PPO -->
### PPO





<!-- SAC -->
### SAC




<!-- WALKING TRAINER -->
## Walking Trainer



<!-- UNTRAINED WALKER -->
### Untrained Walker
<img src="images/UNTRAINED.gif" width="600" height="450"/>
<!-- PPO WALKER -->
### PPO Training

<!-- SAC WALKER -->
### SAC Training



<img src="images/LANKY.gif" width="600" height="450"/>

<!-- CONTACT -->
## Contact

Nathan Rasmussen - nathan.f.rasmussen@gmail.com

Project Link: [https://github.com/rasbot/Reinforcement_Learning_in_Unity](https://github.com/rasbot/Reinforcement_Learning_in_Unity)



<!-- ACKNOWLEDGEMENTS -->
## Acknowledgements






<!-- MARKDOWN LINKS & IMAGES -->
<!-- https://www.markdownguide.org/basic-syntax/#reference-style-links -->

<!-- [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=flat-square&logo=linkedin&colorB=555
[linkedin-url]: https://linkedin.com/in/nathanfrasmussen -->

In [5]:
pwd


'D:\\gal\\Reinforcement_Learning_in_Unity'

In [1]:
import tensorflow as tf

In [2]:
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

Default GPU Device: /device:GPU:0


In [4]:
from mlagents_envs.environment import UnityEnvironment
# This is a non-blocking call that only loads the environment.
env = UnityEnvironment(file_name="ml-agents-release_2/Builds/Walker", seed=1, side_channels=[])

UnityEnvironmentException: Couldn't launch the /Builds/Walker environment. Provided filename does not match any environments.

## TO DO:

* Work on README:
    * Intro to RL
    
    * Discuss Simulated environments
    
    * Describe types of RL:
        
        * Deep Q-Networks
        
        * CNN vs RNN
        
        * PPO and SAC
    
    * Show untrained model
    
    * Discuss hyperparameters
    
    * Show plots
    
    * Show trained model
    
    * Future work
---

* Get data for SAC runs

* Get videos / gifs of untrained model, trained model, comparisons (race them)

* Read book / unity paper to fill in details

* Jupyter notebook with code to run training on exe file

    * plots of everything
    
    * ???
    

## Types of Reinforcement Learning

* __Policy Gradients__

* __Deep Q Networks (DQN)__

* __Markov Decision Processes (MDP)__

### Terminology

The AI (player) is the __agent__ which makes __observations__ within an __environment__, takes __actions__, and receives __rewards__.

__policy__: The algorithm an agent uses to determine its actions. This can be a neural network, for example.

* Stochastic policy: a random algorithm, such as the one a robot vacuum uses.

* Policy search: search combinations of parameters and find the ones that maximize performance.

    * Brute force the search, checking all combinations.

    * Genetic policy algorithm: create a random set of parameters, keep the 20% that perform the best, generate new sets from those, and evolve the policy until it performs well.

    * Evaluate the gradients of the rewards with regard to the policy parameters (called policy gradients).
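
The genetic approach can be sketched as follows. This is a minimal illustration, not the method used in this project: `evaluate` is a hypothetical stand-in for running episodes and measuring reward, and the population size, mutation noise, and two-parameter policy are arbitrary assumptions.

```python
import random


def evaluate(params):
    """Hypothetical fitness: higher is better, with the best score at (1.0, -2.0)."""
    return -((params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2)


def genetic_search(population_size=50, generations=40, keep_fraction=0.2,
                   noise=0.3, seed=0):
    """Evolve parameter sets by keeping the top performers and mutating them."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5), rng.uniform(-5, 5)]
                  for _ in range(population_size)]
    n_keep = max(1, int(population_size * keep_fraction))
    for _ in range(generations):
        # Keep the best 20% of the current population.
        population.sort(key=evaluate, reverse=True)
        survivors = population[:n_keep]
        # Refill the population with randomly mutated copies of the survivors.
        children = []
        while len(survivors) + len(children) < population_size:
            parent = rng.choice(survivors)
            children.append([p + rng.gauss(0, noise) for p in parent])
        population = survivors + children
    return max(population, key=evaluate)


best = genetic_search()
print(best, evaluate(best))
```

Because the survivors are carried over unchanged, the best score never gets worse from one generation to the next.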

## Policy Search

<img src="images/agent_env.PNG" width="800" height="400"/>

### Policy Gradients

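> REINFORCE algorithm

> * Have the AI play the game several times. At each step, compute the gradients that would make the chosen action more likely, but do not apply them yet

> * Compute each action's advantage

> * If the advantage is positive (the action was probably good), apply the gradients to make the action more likely. If it is negative (the action was probably bad), apply the opposite gradients to make the action less likely. In practice, multiply each gradient vector by the corresponding action's advantage

> * Compute the mean of all the resulting gradient vectors and use it to perform a Gradient Descent step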
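
The advantage and averaging steps above can be sketched in plain Python. This assumes one common choice of advantage, namely discounted returns normalized across all episodes; the function names and the example numbers are illustrative only, and the gradient vectors are plain lists rather than real network gradients.

```python
from statistics import mean, pstdev


def discounted_returns(rewards, gamma):
    """Sum of future rewards at each step, discounted by gamma per step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def action_advantages(episode_rewards, gamma):
    """Normalize discounted returns over all episodes (mean 0, standard deviation 1)."""
    all_returns = [discounted_returns(r, gamma) for r in episode_rewards]
    flat = [g for episode in all_returns for g in episode]
    mu = mean(flat)
    sigma = pstdev(flat) or 1.0  # avoid dividing by zero when all returns are equal
    return [[(g - mu) / sigma for g in episode] for episode in all_returns]


def mean_weighted_gradient(episode_gradients, advantages):
    """Scale each step's gradient vector by its advantage, then average over all steps."""
    scaled = [
        [adv * g for g in grad]
        for grads, advs in zip(episode_gradients, advantages)
        for grad, adv in zip(grads, advs)
    ]
    return [mean(column) for column in zip(*scaled)]


print(discounted_returns([10, 0, -50], gamma=0.8))
```

A negative advantage flips the sign of that step's gradient, which is what makes a poorly rewarded action less likely after the update.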

Using RASCOM - Anaconda prompt with the (base) environment,

* Navigate to `D:\gal\ml-agents`

* Launch Unity Hub

  * Launch project `Project` using Unity `2018.4.17f1`
 
* Navigate to `ML-Agents/Examples/<pick an example>/Scenes/<name of scene>`

## Walker training runs


__Walker_T1__:
behaviors:
  Walker:
    trainer: ppo
    batch_size: 2048
    beta: 0.005
    buffer_size: 20480
    epsilon: 0.2
    hidden_units: 512
    lambd: 0.95
    learning_rate: 0.0003
    learning_rate_schedule: linear
    max_steps: 2e7
    memory_size: 128
    normalize: true
    num_epoch: 3
    num_layers: `3 -> 2`
    time_horizon: 1000
    sequence_length: 64
    summary_freq: `30000 -> 3000`
    use_recurrent: false
    vis_encode_type: simple
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.995
        
__Walker_T2__:
behaviors:
  Walker:
    trainer: ppo
    batch_size: 2048
    beta: 0.005
    buffer_size: 20480
    epsilon: 0.2
    hidden_units: 512
    lambd: 0.95
    learning_rate: `0.0003 -> 0.003`
    learning_rate_schedule: linear
    max_steps: `2e7 -> 6e3`
    memory_size: 128
    normalize: true
    num_epoch: 3
    num_layers: `3 -> 2`
    time_horizon: 1000
    sequence_length: 64
    summary_freq: `30000 -> 3000`
    use_recurrent: false
    vis_encode_type: simple
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.995
        
__Walker_T3__:
behaviors:
  Walker:
    trainer: ppo
    batch_size: 2048
    beta: 0.005
    buffer_size: 20480
    epsilon: 0.2
    hidden_units: 512
    lambd: 0.95
    learning_rate: `0.0003 -> 0.003`
    learning_rate_schedule: linear
    max_steps: `2e7 -> 6e3`
    memory_size: 128
    normalize: true
    num_epoch: 3
    num_layers: `3 -> 2`
    time_horizon: 1000
    sequence_length: 64
    summary_freq: `30000 -> 3000`
    use_recurrent: false
    vis_encode_type: simple
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.995

__NOTES__:

> T1: Trained for 6ish hours

> T2: Test to reduce max_steps, basically a dud

> T3: First test of shorter runs
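
The `learning_rate_schedule: linear` setting in the runs above lowers the learning rate as training progresses. Below is a rough sketch of that behavior, assuming the rate falls linearly from its initial value to zero at `max_steps`. The function name is made up for this example, and the actual ML-Agents implementation may differ, for example by enforcing a small minimum rate.

```python
def linear_learning_rate(step, initial_rate=0.0003, max_steps=2e7):
    """Learning rate after `step` training steps under a linear decay to zero."""
    fraction_left = max(0.0, 1.0 - step / max_steps)
    return initial_rate * fraction_left


# Defaults are taken from the Walker_T1 settings (learning_rate and max_steps).
for step in (0, 5e6, 1e7, 2e7):
    print(int(step), linear_learning_rate(step))
```

With a much smaller `max_steps`, as in the T2 and T3 runs, the same schedule reaches zero after far fewer steps.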

Supervised learning uses a static training set, whereas reinforcement learning generates its own training data, which constantly changes based on the agent's interaction with the environment. The data distribution of the observations and rewards is dynamic.

Proximal Policy Optimization (PPO)
Deep Q-networks store past experiences in a replay buffer and update the policy from them. PPO instead learns from the agent's direct interactions with the environment, using data collected by the current policy.
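
PPO limits how far a single update can move the policy by clipping the probability ratio between the new and old policies. A minimal sketch of the clipped surrogate objective follows, using the same `epsilon: 0.2` as the Walker runs. The function name and the list-based inputs are assumptions for illustration; a real trainer computes this on tensors and maximizes it with gradient ascent.

```python
def ppo_clipped_objective(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - epsilon, 1 + epsilon) * A) over sampled steps.

    ratios: new-policy probability divided by old-policy probability for each action.
    advantages: advantage estimate for each action.
    """
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum removes any incentive to push the ratio outside the clip range.
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


print(ppo_clipped_objective([1.5, 0.5], [1.0, -1.0]))
```

For a good action (positive advantage) the objective stops growing once the ratio exceeds `1 + epsilon`, and for a bad action it stops improving once the ratio drops below `1 - epsilon`.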


Soft Actor-Critic (SAC)

In [None]:
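# create unity env
# NOTE: `UnityEnv` is assumed here because it is the gym_unity wrapper that accepts these
# keyword arguments in older ML-Agents releases; newer releases use `UnityToGymWrapper` instead.
from gym_unity.envs import UnityEnv
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# Raw string so the backslashes in the Windows path are not treated as escape sequences.
env_id = r"Project\Assets\ML-Agents\Builds\UnityEnvironment.exe"
unity_env = UnityEnv(env_id, worker_id=2, use_visual=False, no_graphics=False)

# run stable baselines
env = DummyVecEnv([lambda: unity_env])  # The algorithms require a vectorized environment to run
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)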



In [None]:
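# MyChannel is a placeholder for a user-defined SideChannel subclass and must be defined before this cell runs.
channel = MyChannel()
env = UnityEnvironment(side_channels=[channel])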

In [None]:
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

channel = EngineConfigurationChannel()

env = UnityEnvironment(side_channels=[channel])

channel.set_configuration_parameters(time_scale = 2.0)

# reset() returns None in this version of the API, so its result is not assigned.
env.reset()