# Deep Q-Network (DQN)
---
In this notebook, you will implement a DQN agent with OpenAI Gym's LunarLander-v2 environment.

### 1. Import the Necessary Packages

In [12]:
import gym
import random
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

### 2. Instantiate the Environment and Agent

Initialize the environment in the code cell below.

In [18]:
env = gym.make('LunarLander-v2')
env.seed(0)
print('State shape: ', env.observation_space.shape)

State shape:  (8,)


In [20]:
print('Number of actions: ', env.action_space.n)

Number of actions:  4


Please refer to the instructions in `Deep_Q_Network.ipynb` if you would like to write your own DQN agent.  Otherwise, run the code cell below to load the solution files.

In [21]:
from dqn_agent import Agent

agent = Agent(state_size=8, action_size=4, seed=0)

# watch random actions in the environment (the untrained agent is not queried here)
for i in range(10):
    state = env.reset()
    score = 0
    while True:
        action = env.action_space.sample()  # uniformly random action
        env.render()
        state, reward, done, _ = env.step(action)
        score += reward
        if done:
            print(score)
            break
        
env.close()

-135.58804185135858
-86.54432171057229
-72.78626923061778
-198.7339423827267
-121.43908619100935
-210.73540314401555
-103.04037296111133
-116.00013591586416
-151.54106885337308
-170.35394902211215


### 3. Train the Agent with DQN

Run the code cell below to train the agent from scratch. You are welcome to change the default parameter values in the function to see whether you can get better performance!

Alternatively, you can skip to the next step below (**4. Watch a Smart Agent!**), to load the saved model weights from a pre-trained agent.
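Before changing `eps_start`, `eps_end`, or `eps_decay`, it can help to see how quickly the multiplicative schedule shrinks epsilon. The sketch below reproduces only the update `eps = max(eps_end, eps_decay*eps)` from the training loop. The helper names `epsilon_schedule` and `episodes_to_floor` are illustrative and are not part of the notebook or of `dqn_agent`.

```python
import math


def epsilon_schedule(n_episodes=500, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Return the epsilon used in each episode under a per-episode multiplicative decay."""
    schedule = []
    eps = eps_start
    for _ in range(n_episodes):
        schedule.append(eps)                  # value that would be passed to agent.act
        eps = max(eps_end, eps_decay * eps)   # same update rule as the training loop
    return schedule


def episodes_to_floor(eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Smallest number of decay steps n with eps_start * eps_decay**n <= eps_end."""
    return math.ceil(math.log(eps_end / eps_start) / math.log(eps_decay))


schedule = epsilon_schedule()
print('Epsilon in the last of 500 episodes: {:.4f}'.format(schedule[-1]))
print('Decay steps needed to reach eps_end: {}'.format(episodes_to_floor()))
```

With these defaults, epsilon is still roughly 0.08 when a 500-episode run ends, and reaching `eps_end=0.01` takes about 919 decay steps, so the floor only matters for longer runs or a smaller `eps_decay`.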

In [17]:
def dqn(n_episodes=500, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning.
    
    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
    """
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    for i_episode in range(1, n_episodes+1):
        state = env.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break 
        scores_window.append(score)       # add score to the rolling 100-episode window
        scores.append(score)              # add score to the full training history
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
        if np.mean(scores_window)>=200.0:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.qnetwork_local.state_dict(), 'checkpoint.pth')
            break
    return scores

scores = dqn()

# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(np.arange(len(scores)), scores)
ax.set_ylabel('Score')
ax.set_xlabel('Episode #')
plt.show()

IndexError: invalid index to scalar variable.
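
The stopping rule in `dqn` treats the environment as solved once the mean of the last 100 episode scores reaches 200. That check can be isolated so it works on any saved list of scores. The following is a minimal sketch: `first_solved_episode` is a hypothetical helper, and it differs from the notebook's loop in one deliberate way, namely that it waits for a full window before testing the mean.

```python
from collections import deque


def first_solved_episode(scores, window=100, target=200.0):
    """Return the 1-based episode at which the rolling mean first reaches target, else None.

    Assumption: unlike the check inside dqn(), this waits until the window
    is full before testing the mean.
    """
    recent = deque(maxlen=window)
    for i_episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return i_episode
    return None


# Synthetic scores for illustration only (not results from this notebook).
toy_scores = [0.0] * 50 + [250.0] * 150
print(first_solved_episode(toy_scores))
```

The notebook's message reports `i_episode-100`, the episode at which the qualifying window started; subtract `window` from the returned value to match that convention.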

### 4. Watch a Smart Agent!

In the next code cell, you will load the trained weights from file to watch a smart agent!

In [22]:
# load the weights from file
agent.qnetwork_local.load_state_dict(torch.load('checkpoint.pth'))

for i in range(100):
    state = env.reset()
    score = 0
    while True:
        action = agent.act(state)
        env.render()
        state, reward, done, _ = env.step(action)
        score += reward
        if done:
            print(score)
            break
            
env.close()

255.99381607266002
226.61080651426573
282.58898171199974
259.8534161803889
228.3168566284317
260.21060754539747
281.3071917871947
167.36565398302895
241.4566634286163
282.9761419691806
159.33329602627555
221.2857193629863
230.8217865160099
198.49594306350127
272.9984684474264
234.81127191151668
239.54560287485316
264.96463251789845
252.63479188826284
227.92611075204852
230.5965663049383
250.5077579546411
220.29714237409928
266.4194140017921
241.38565668693457
265.7548009293922
239.9381635424852
241.51114857256488
225.17947467336657
260.179166563512
214.94142604958
130.85709823134223
161.44084130673755
218.90533824159613
241.79581434303256
244.14998695024352
231.08081340784756
234.14822824664915
249.0188294631208
123.33931851587344
256.0148555006591
257.44467906829846
138.28448123096942
214.05063656485186
235.91972510946974
209.21987408525231
265.6289142825
250.3293354934932
263.814213360243
139.89549978812315
230.4530596422017
244.2485105308852
249.71951775313465
231.1530735683525
207.
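
Per-episode scores are hard to judge when read line by line, so a short summary may help. In the sketch below, `summarize_scores` is an illustrative name, the 200-point threshold is borrowed from the training loop's stopping rule, and the sample values are the first five scores printed above, rounded to two decimals.

```python
def summarize_scores(scores, threshold=200.0):
    """Return mean, min, max, and the fraction of episodes scoring at least threshold."""
    n = len(scores)
    if n == 0:
        raise ValueError('scores must not be empty')
    return {
        'mean': sum(scores) / n,
        'min': min(scores),
        'max': max(scores),
        'frac_at_or_above': sum(s >= threshold for s in scores) / n,
    }


# First five evaluation scores from the output above, rounded.
sample = [255.99, 226.61, 282.59, 259.85, 228.32]
for name, value in summarize_scores(sample).items():
    print('{}: {:.2f}'.format(name, value))
```

To use it on a full run, collect each episode's `score` in a list inside the evaluation loop instead of only printing it.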

### 5. Explore

In this exercise, you have implemented a DQN agent and demonstrated how to use it to solve an OpenAI Gym environment.  To continue your learning, you are encouraged to complete any (or all!) of the following tasks:
- Adjust the hyperparameters and network architecture to see if you can get your agent to solve the environment faster.  Once you build intuition for the hyperparameters that work well with this environment, try solving a different OpenAI Gym task with discrete actions!
- Implement improvements such as prioritized experience replay, Double DQN, or Dueling DQN!
- Write a blog post explaining the intuition behind the DQN algorithm and demonstrating how to use it to solve an RL environment of your choosing.
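
As a starting point for the Double DQN suggestion, the sketch below contrasts the standard DQN target with the Double DQN target. It assumes the agent keeps a separate target network alongside `qnetwork_local`, as is usual for DQN; the contents of `dqn_agent.py` are not shown here, so this is an assumption. Plain Python lists stand in for tensors, and the function names and Q-values are hypothetical.

```python
def dqn_targets(rewards, dones, q_target_next, gamma=0.99):
    """Standard DQN target: r + gamma * max_a Q_target(s', a), zeroed at terminal states."""
    return [
        r + gamma * max(q_t) * (1.0 - float(d))
        for r, d, q_t in zip(rewards, dones, q_target_next)
    ]


def double_dqn_targets(rewards, dones, q_local_next, q_target_next, gamma=0.99):
    """Double DQN target: the local network picks the action, the target network scores it."""
    targets = []
    for r, d, q_l, q_t in zip(rewards, dones, q_local_next, q_target_next):
        best_action = max(range(len(q_l)), key=lambda a: q_l[a])  # argmax under local net
        targets.append(r + gamma * q_t[best_action] * (1.0 - float(d)))
    return targets


# Hypothetical next-state Q-values for two transitions with 4 actions each.
rewards = [1.0, -100.0]
dones = [False, True]
q_local_next = [[0.1, 0.9, 0.3, 0.2], [0.0, 0.0, 0.0, 0.0]]
q_target_next = [[2.0, 1.0, 0.5, 0.0], [5.0, 5.0, 5.0, 5.0]]
print(dqn_targets(rewards, dones, q_target_next, gamma=0.5))
print(double_dqn_targets(rewards, dones, q_local_next, q_target_next, gamma=0.5))
```

Decoupling action selection from action evaluation is what reduces the overestimation that comes from taking a max over noisy estimates; in the first transition the two targets differ because the local network prefers a different action than the target network's maximum.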