### List of what you need to do:
- ~~Define the Queue (generic)~~
- ~~Define the Q-Table (just empty table with all possible states)~~
- Start Q-Learning Loop

# Generic Queue Simulator
We want to define a generic queue simulator that has its own arrival rates/mean delay requirements so that we can use it for PQ1, PQ2 and Best-Effort, as well as the other queues such as FIFO, RR, etc.

In [2]:
# All Imports and Constants
import sys
!{sys.executable} -m pip install gym
import numpy as np
import gym
from gym import spaces

LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 500

# parameters for epsilon decay policy
EPSILON = 1 # not a constant, going to be decayed
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = EPSILON / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

# for testing
N_TEST_RUNS = 100
TEST_INTERVAL = 50

MAX_TIMESLOTS = 100
MAX_WAIT_STATE = 50 #used as upper limit in q-table

Collecting gym
  Downloading gym-0.18.3.tar.gz (1.6 MB)
[K     |████████████████████████████████| 1.6 MB 20.0 MB/s eta 0:00:01
Collecting pyglet<=1.5.15,>=1.4.0
  Downloading pyglet-1.5.15-py3-none-any.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 68.4 MB/s eta 0:00:01
Building wheels for collected packages: gym
  Building wheel for gym (setup.py) ... [?25ldone
[?25h  Created wheel for gym: filename=gym-0.18.3-py3-none-any.whl size=1657518 sha256=2d2fe19cb7d4705a3a1e969545e9d7a0a096d0f19c225d6895a6fd9879f3638a
  Stored in directory: /home/ec2-user/.cache/pip/wheels/1f/63/7a/4368e4c3aedd396d4ab8e9b7922af06433994ebe739853ae4a
Successfully built gym
Installing collected packages: pyglet, gym
Successfully installed gym-0.18.3 pyglet-1.5.15


In [17]:
class QueueSimulator(gym.Env):
    def __init__(self, arrival_rates, mean_delay_requirements, queues_arrival_times):
        super(QueueSimulator, self).__init__()
        self.arrival_rates = arrival_rates
        self.mean_delay_requirements = mean_delay_requirements
        self.current_timeslot = 0
        self.queues = queues_arrival_times # Basic case of PQ1, PQ2 and Best-Effort
        
        # Action Space is 3, because we have 3 queues to choose from
        self.action_space = spaces.Discrete(3)
        
        # We know the observation space are the range of possible and observable values. This is wait times,
        # so wait times can be 0 or infinity technically.
        self.observation_space = spaces.Box(low=np.array([0, 0, 0]), high=np.array([np.inf, np.inf, np.inf]), dtype=np.dtype(int))
        
    def step(self):
        return self
        # TODO Impl
        
    def reset(self):
        return self
        # TODO Impl
        
    def render(self):
        return self
        # TODO Impl

# Pre-populating Queues
If given the arrival_rates and mean_delay_requirements, you could calculate what timeslots a packet will arrive for any number of timeslots. Hence, we believe that you should just 'pre-populate' your queues with the times that packets arrive at, since this simplifies all of the packet arrival/transmission, and you can later use this to measure wait times and give this information to your model to determine an action to take every step of the way.

In [4]:
# Create the basic env and put logic for the actions
arrival_rates = [0.3, 0.25, 0.4]
mean_delay_requirements = [6, 4, np.inf]
# Keep track of current packets by using another array, which has index corresponding to the arrival_rates
queues_packet_status = [0, 0, 0]
# See all timeslots where a queue finished transmitting a packet
queues_finished_timeslots = [[], [], []]

# At each time interval, increment each queue's current packet status by the arrival rate amount
# if packet status >= 1, get the extra amt above 1 and change packet status to just that
for timeslot in range(MAX_TIMESLOTS):
    for current_queue in range (len(arrival_rates)):
        queues_packet_status[current_queue] += arrival_rates[current_queue]
        
        if queues_packet_status[current_queue] >= 1.0:
            queues_finished_timeslots[current_queue].append(timeslot)
            queues_packet_status[current_queue] -= 1.0

# Result of queues_finished_timeslots
print('{0} -> Length = {1}'.format(queues_finished_timeslots[0], len(queues_finished_timeslots[0])))
print('{0} -> Length = {1}'.format(queues_finished_timeslots[1], len(queues_finished_timeslots[1])))
print('{0} -> Length = {1}'.format(queues_finished_timeslots[2], len(queues_finished_timeslots[2])))


[3, 6, 10, 13, 16, 20, 23, 26, 30, 33, 36, 40, 43, 46, 50, 53, 56, 60, 63, 66, 70, 73, 76, 80, 83, 86, 90, 93, 96] -> Length = 29
[3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, 71, 75, 79, 83, 87, 91, 95, 99] -> Length = 25
[2, 4, 7, 9, 12, 14, 17, 19, 22, 24, 27, 29, 32, 34, 37, 39, 42, 44, 47, 49, 52, 54, 57, 59, 62, 64, 67, 69, 72, 74, 77, 79, 82, 84, 87, 89, 92, 94, 97, 99] -> Length = 40


'queues_finished_timeslots' is essentially our environment. We want to take this environment and apply it to a Q-Table, where the Q-Table represents all possible states (states being how long a queue has waited), then some reward for an action given a specific state. Below is what the Q-Table could look like, with a state e.g. (1, 0, 3) saying that the first packet in queue 1 has waited 1 timeslots, queue 2 has waited 0 timeslots, and queue 3 has waited 3 timeslots.

|                       | 0           | 1           | 2           |
|-----------------------|:-----------:|-------------|-------------|
| State (0, 0, 0)       | some-reward | some-reward | some-reward |
| State (0, 0, 1)       | some-reward | some-reward | some-reward |
| State (0, 1, 1)       | "         " | "         " | "         " |
| State (  ...  )       |             |             |             |
| State (Inf, Inf, Inf) | some-reward | some-reward | some-reward |

# Setting Up Q-Learning
First, we need to make a Q-Table with empty values. Access to an entry is given with the state (x, y, z).
Also, the upper state cannot be infinity so we can choose an arbitrary value (but still one that is somewhat realistic).

In [15]:
def create_q_table():
    q_table = {}
    for q1 in range (MAX_WAIT_STATE):
        for q2 in range (MAX_WAIT_STATE):
            for q3 in range (MAX_WAIT_STATE):
                q_table[(q1, q2, q3)] = np.zeros(3)

# Start Q-Learning Loop
1. For every episode, do another 'until done' loop
2. While not done:
    - Get a random chance, and either get value from Q-Table (exploit) or do random action (explore)
    - Do next step() for env
    - Update Q-Table and any other variables
3. Update epsilon
4. Can do some update per episode, but if we're doing minimum 500 episodes more likely to do some performance check every X interval e.g. every 50 episodes get some check in.

In [20]:
env = QueueSimulator(arrival_rates, mean_delay_requirements, queues_finished_timeslots)
q_table = create_q_table()
def q_learning(env, q_table):
    for episode in range(EPISODES):
        # Reset all variables per episode
        done = False
        state = env.reset()
        epsilon = EPSILON
        steps = 0
        
        # Either do action from QTable or random action
        while not done:
            if np.random.random() < 1 - epsilon:
                action = np.argmax(q_table[state])
            else:
                action = env.action_space.sample()
                
            # Get the next state, reward, new done value, and info (not sure what this is)
            state_new, reward, done, info = env.step(action)
            # No need to discretize state_new

            # Update QTable and calc reward. Not sure why current_q in example is 'discretState+(action,)'
            new_max_q = np.max(q_table[state_new])
            current_q = q_table[state_new][action]
            q_table[state_new][action] = (1 - LEARNING_RATE)*current_q + LEARNING_RATE*(reward + DISCOUNT*new_max_q)

            state = state_new
            steps += 1
        
        # Finished done loop, update epsilon
        if END_EPSILON_DECAYING >= episode and episode >= START_EPSILON_DECAYING:
            epsilon -= epsilon_decay_value
            
        # Print progress every X episodes
        if episode % TEST_INTERVAL == 0:
            success_run_ = list()
            steps_ = list()
            for i in range(N_TEST_RUNS):
                success_run, steps = test_model(QTable)
                success_run_.append(success_run)
                steps_.append(steps)
                
            print('Testing at Episode {}:'.format(episode))
            print('\t Successful Runs: {}/{}'.format(np.sum(success_run_), N_TEST_RUNS) )
            print('\t Average Steps: {}'.format(np.mean(steps_)))
        
    print('Finished all episodes')
    env.close()
    return q_table
    
result_q_table = q_learning(env, q_table)

TypeError: step() takes 1 positional argument but 2 were given