# Learning Algorithm

I used the Deep Deterministic Policy Gradient (DDPG) algorithm.

DDPG is a deterministic policy gradient method designed for continuous action spaces.

The agent collects experiences online and stores them in a replay buffer. At each step, the agent draws a mini-batch uniformly at random from the replay buffer, then uses it to compute bootstrapped TD targets and learn the Q function.
In DDPG, the optimal action in the next state is approximated directly by the policy function.
DDPG uses an actor-critic approach based on the DPG algorithm.
A DDPG agent learns a deterministic policy, so it does not explore on-policy. Instead, it explores off-policy by adding external noise to its actions.
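
The replay buffer and the TD target described above can be sketched in plain Python. This is a minimal illustration, not the code used in this project: the names `Experience`, `ReplayBuffer`, and `td_target` are assumptions, and `next_q` stands for the target critic's value of the next state under the target policy's action.

```python
import random
from collections import deque, namedtuple

# Hypothetical names for illustration only; not this project's actual code.
Experience = namedtuple("Experience", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-size buffer that stores experiences and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest samples are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling, without replacement inside a single mini-batch.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q, done, gamma):
    """Bootstrapped TD target: r + gamma * Q'(s', mu'(s')), with no bootstrap at terminal states."""
    return reward + gamma * next_q * (1.0 - float(done))
```

In a full agent, `next_q` would come from the target critic evaluated at the target actor's action, and the critic would be trained to reduce the squared difference between its prediction and this target.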


![ddpg_algorithm.png](attachment:ddpg_algorithm.png)

Reference: Timothy P. Lillicrap, Jonathan J. Hunt, et al. Continuous control with deep reinforcement learning. ICLR (2016).


# Environment and Implementation

I modified the DDPG code from the DRLND GitHub repository.
Note that the environment calls each racket an agent, while the Python program also has an Agent class, so the term is overloaded and can be confusing.

I created two independent instances of the Agent class and trained each one independently. They did not share replay buffers or model weights.
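
As a rough sketch of this setup, the snippet below routes each racket's transition to its own Agent instance. The `Agent` class here is only a stand-in with assumed internals (a private buffer and a placeholder weight list), and `store_transition` is a hypothetical helper, not a function from the project code.

```python
from collections import deque


class Agent:
    """Stand-in for the project's Agent class; internals are assumed for illustration."""

    def __init__(self, buffer_size=1000):
        self.memory = deque(maxlen=buffer_size)  # private replay buffer
        self.weights = [0.0, 0.0]                # placeholder for actor/critic weights

    def step(self, state, action, reward, next_state, done):
        # Each agent stores only its own experience.
        self.memory.append((state, action, reward, next_state, done))


def store_transition(agents, states, actions, rewards, next_states, dones):
    """Hypothetical helper: give the i-th racket's transition to the i-th Agent."""
    for i, agent in enumerate(agents):
        agent.step(states[i], actions[i], rewards[i], next_states[i], dones[i])


agents = [Agent(), Agent()]  # no shared buffer, no shared weights
store_transition(
    agents,
    states=["s_a", "s_b"],
    actions=[0.1, -0.2],
    rewards=[0.0, 0.1],
    next_states=["s_a2", "s_b2"],
    dones=[False, False],
)
```

Because nothing is shared, the two policies are free to diverge, which is the point of the design.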

I adopted this approach because, in addition to solutions in which both agents behave the same way, there may also be solutions in which they behave differently.


# Model Architecture

The neural network architectures are implemented in the classes Actor and Critic.
Each class has three fully connected (linear) layers:
```
state_size inputs and 512 units
512 inputs and 256 units
256 inputs and action_size(Actor) / 1(Critic) units
```
Each class includes a forward method that implements the forward pass through the network.
In this method, the input tensor x is passed through each of the layers defined above.
A ReLU activation follows each fully connected layer.
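
To make the layer sizes concrete, the following sketch lists the (inputs, units) pairs and counts the trainable parameters of each network. The helper names are hypothetical, and `state_size=24` and `action_size=2` are assumed example values, not numbers taken from this report.

```python
def layer_sizes(state_size, out_size, hidden=(512, 256)):
    """Return (inputs, units) pairs for the three fully connected layers."""
    dims = [state_size, *hidden, out_size]
    return list(zip(dims[:-1], dims[1:]))


def count_parameters(layers):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in layers)


# Assumed example sizes; substitute the environment's real values.
state_size, action_size = 24, 2

actor_layers = layer_sizes(state_size, out_size=action_size)
critic_layers = layer_sizes(state_size, out_size=1)

print("Actor :", actor_layers, count_parameters(actor_layers))
print("Critic:", critic_layers, count_parameters(critic_layers))
```

Almost all of the parameters sit in the 512-to-256 layer, so changing the hidden sizes is the main lever on model capacity.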

# Hyperparameters

```
BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 256        # minibatch size
GAMMA = 0.95            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-4        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay

random_seed = 35        # random seed when initializing OUNoise class
```
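
Two of these settings are easier to read next to the operations they control. The sketch below shows a soft update driven by `TAU` and a seeded Ornstein-Uhlenbeck noise process of the kind usually used for DDPG exploration. It works on plain lists of floats for illustration, whereas real code would operate on network parameter tensors. The `theta` and `sigma` values are assumed defaults, not settings reported here.

```python
import random

TAU = 1e-3  # same value as in the hyperparameter list


def soft_update(local_params, target_params, tau):
    """Blend targets toward local values: tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


class OUNoise:
    """Ornstein-Uhlenbeck process; theta and sigma are assumed defaults."""

    def __init__(self, size, seed, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)  # seeding makes the noise reproducible
        self.reset()

    def reset(self):
        """Reset the internal state to the mean."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new state as noise."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = OUNoise(size=2, seed=35)
noisy_action = [a + n for a, n in zip([0.0, 0.0], noise.sample())]
new_target = soft_update([1.0, 1.0], [0.0, 0.0], TAU)
```

With `TAU = 1e-3`, each update moves the target networks only 0.1% of the way toward the local networks, which keeps the TD targets slowly varying.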

# Plot of Rewards

![scores.png](attachment:scores.png)

The agents needed 2777 episodes to solve the environment.
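
A "solved" episode count is usually read off a moving average of the episode scores. The sketch below assumes a 100-episode window and a threshold of 0.5. Both values are assumptions, since this report does not state the criterion, and `first_solved_episode` is a hypothetical helper.

```python
from collections import deque


def first_solved_episode(scores, window=100, threshold=0.5):
    """Return the first 1-indexed episode at which the mean of the last
    `window` scores reaches `threshold`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None


# Toy example with a short window so the behaviour is easy to follow.
toy_scores = [0.0] * 5 + [1.0] * 5
print(first_solved_episode(toy_scores, window=4, threshold=0.5))
```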


# Ideas for Future Work

Implement PPO, using the book Grokking Deep Reinforcement Learning as a reference.

Implement Prioritized Experience Replay.
