# Infrastructure

Purpose of this project: set up a custom gym environment from scratch, in several versions, train a model on it with simple Q-learning, and see how to make the environment compatible with Stable Baselines.
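As a rough illustration of what building an environment "from scratch" involves, the sketch below mirrors the classic gym interface used later in this notebook (`reset()` returning an observation, `step(action)` returning a 4-tuple). It is a dependency-free stand-in, not the AirControl environment: the class name `ToyControlEnv`, the target position `3.0`, and the reward are all hypothetical. A real environment would typically subclass `gym.Env` and declare `observation_space` and `action_space` as well.

```python
class ToyControlEnv:
    """Hypothetical stand-in that mirrors the classic gym reset/step interface."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.steps = 0
        self.position = 0.0

    def reset(self):
        # Start a new episode and return the initial observation
        self.steps = 0
        self.position = 0.0
        return [self.position]

    def step(self, action):
        # Clip the action to [-1, 1], as a Box(-1.0, 1.0) space would require
        action = max(-1.0, min(1.0, float(action)))
        self.position += action
        self.steps += 1
        # Assumed reward: the closer to the target at 3.0, the better
        reward = -abs(self.position - 3.0)
        done = self.steps >= self.max_steps
        info = {}
        return [self.position], reward, done, info


toy_env = ToyControlEnv(max_steps=3)
obs = toy_env.reset()
obs, reward, done, info = toy_env.step(1.0)
```

Because it exposes the same `reset`/`step` signatures, a rollout loop written against it should carry over to a registered gym environment with few changes.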

In [8]:
import numpy as np
import gym
import random
# 158595

Creating environments.

In [9]:
# method 1 - build from gym package
env = gym.make("airctrl:AirControl150")

Loading environment from /home/supatel/Games/AirControl_2021/Build/FORSIMULATION/Linux/vFORSIMULATION-AirControl.x86_64 at port 8899 client ip 127.0.1.1 client port 8899
['/home/supatel/Games/AirControl_2021/Build/FORSIMULATION/Linux/vFORSIMULATION-AirControl.x86_64', '--serverPort', '8899', '--clientIP', '127.0.1.1', '--clientPort', '8899']
Sleeping for 5 seconds to allow environment load
[UnityMemory] Configuration Parameters - Can be set up in boot.config
    "memorysetup-bucket-allocator-granularity=16"
    "memorysetup-bucket-allocator-bucket-count=8"
    "memorysetup-bucket-allocator-block-size=4194304"
    "memorysetup-bucket-allocator-block-count=1"
    "memorysetup-main-allocator-block-size=16777216"
    "memorysetup-thread-allocator-block-size=16777216"
    "memorysetup-gfx-main-allocator-block-size=16777216"
    "memorysetup-gfx-thread-allocator-block-size=16777216"
    "memorysetup-cache-allocator-block-size=4194304"
    "memorysetup-typetree-allocator-b

In [10]:
env.reset()

array([ 4.7221241e-04, -1.6739967e-06,  5.3206945e-06,  0.0000000e+00,
        0.0000000e+00,  0.0000000e+00,  3.2662457e-01,  1.7784832e-03,
        1.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        8.6829074e-05,  1.4393036e-04, -6.6494058e-06, -1.7101765e-07,
       -3.3069577e-05, -9.6493559e-06,  9.4704086e-01,  9.9994034e-01,
        3.0028954e-04,  9.4704086e-01,  9.9994034e-01,  3.0028986e-04,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00,
      

In [11]:
env.render()

Already rendered at pid :  166363


# Q-Learning

Source: https://deeplizard.com/learn/video/HGeI30uATws

I copied the code and tested it with the custom environment instead of the built-in Frozen Lake environment. 

In [12]:
env.action_space

Box(-1.0, 1.0, (4,), float32)
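The action space is a continuous `Box` with four dimensions, whereas the Frozen Lake tutorial assumes a finite set of actions that can index a Q-table. One possible workaround, sketched here under the assumption that a coarse grid is acceptable, is to enumerate a few levels per dimension and treat each combination as one discrete action. The three levels and the helper `to_continuous` are hypothetical choices, not part of the AirControl package. The observations are continuous too and would need similar treatment.

```python
from itertools import product

# Assumed grid: three levels per action dimension (hypothetical choice)
levels = (-1.0, 0.0, 1.0)
n_dims = 4  # matches the shape (4,) reported by env.action_space

# Every combination of levels becomes one discrete action
discrete_actions = list(product(levels, repeat=n_dims))


def to_continuous(action_index):
    """Map a discrete action index back to a 4-dimensional action vector."""
    return list(discrete_actions[action_index])


print(len(discrete_actions))  # 3 ** 4 = 81 discrete actions
```

The table grows exponentially with the number of levels, so a finer grid quickly becomes impractical for tabular methods.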

In [13]:
num_episodes = 10
max_steps_per_episode = 10  # in practice, episodes here end after 1 step

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01

exploration_decay_rate = 0.01  # lower values decay exploration more slowly, so learning takes longer
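To see what these settings imply, the sketch below evaluates the exponential decay formula used in the training loop for a few episode indices. The function name `exploration_rate_at` is introduced here for illustration only. With the values above, the rate is still above 0.9 after 10 episodes, so a run of `num_episodes = 10` stays almost entirely exploratory.

```python
import math


def exploration_rate_at(episode, min_rate=0.01, max_rate=1.0, decay_rate=0.01):
    """Exponentially decay the exploration rate from max_rate towards min_rate."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)


# Print the schedule at a few episode indices
for episode in (0, 10, 100, 1000):
    print(episode, round(exploration_rate_at(episode), 4))
```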

In [14]:
rewards_all_episodes = []

# Q-Learning algorithm
for episode in range(num_episodes):
    state = env.reset()
    
    done = False
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode):
        
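        # Exploration-exploitation trade-off: the threshold is drawn but not used yet,
        # so every action is sampled at random (pure exploration)
        exploration_rate_threshold = random.uniform(0, 1)
        action = env.action_space.sample()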
            
        new_state, reward, done, info = env.step(action)
        
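        # Q-table update for Q(s,a) is not implemented yet: the Box action space is
        # continuous, so a tabular update would first need a discretization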
            
        state = new_state
        rewards_current_episode += reward
        
        if done:
            break
            
    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    
    rewards_all_episodes.append(rewards_current_episode)
    
# Calculate and print the average reward over all episodes
reward_average = np.mean(rewards_all_episodes)
print("Average  reward : ",reward_average)
        

Average  reward :  1.2700604481679836
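The loop above samples every action at random and leaves the Q-table update as a comment, so the reported average is that of a random policy. For reference, here is a minimal sketch of the two missing pieces (epsilon-greedy action selection and the tabular Q-learning update) on a hypothetical five-state chain rather than on AirControl. The environment `toy_step`, the episode count, and the step limit are assumptions made for illustration. The learning, discount, and exploration settings are copied from this notebook.

```python
import math
import random
from collections import defaultdict

random.seed(0)

N_STATES = 5      # hypothetical chain of states 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def toy_step(state, action):
    """Hypothetical deterministic transition: reward 1.0 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


q_table = defaultdict(lambda: [0.0] * len(ACTIONS))
learning_rate, discount_rate = 0.1, 0.99
exploration_rate, min_rate, max_rate, decay_rate = 1.0, 0.01, 1.0, 0.01

for episode in range(500):
    state = 0
    for _ in range(100):
        # Epsilon-greedy: exploit when the draw exceeds the exploration rate, else explore
        if random.uniform(0, 1) > exploration_rate:
            action = max(ACTIONS, key=lambda a: q_table[state][a])
        else:
            action = random.choice(ACTIONS)

        next_state, reward, done = toy_step(state, action)

        # Q(s,a) <- (1 - lr) * Q(s,a) + lr * (r + gamma * max_a' Q(s',a'))
        future = 0.0 if done else max(q_table[next_state])
        target = reward + discount_rate * future
        q_table[state][action] = (
            (1 - learning_rate) * q_table[state][action] + learning_rate * target
        )

        state = next_state
        if done:
            break

    # Same exponential decay as in the loop above
    exploration_rate = min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)

print({s: [round(q, 3) for q in q_table[s]] for s in range(N_STATES - 1)})
```

Applying the same update to AirControl would require mapping its continuous observations and actions to table indices first, or switching to a function-approximation method.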
