# Trialling Probabilistic Inference of Option Abstraction
This is a very simple environment for testing whether our approach has any likelihood of discovering structure in an environment.

## What is ABC?

ABC is a very simple procedurally generated environment. 

A and C can only be followed by B. B can be followed by either A or C.

An environment is generated by specifying the length of these paths. For instance, one possible path of length 7 is A,B,C,B,A,B,C. When the environment is called with length 7, ALL possible paths of length 7 are created, but only one of them carries a reward (chosen at random).

This gives a very simple environment for testing macros, macro length, and how macros generalize and affect search in larger unseen tasks.

To put this clearly - if the environment is called with length 2 then the possible routes are: 
- AB
- CB
- BA
- BC

If it is called with length 3, then:
- ABC
- ABA
- BAB
- BCB
- CBA
- CBC

and 4:
- ABCB
- ABAB
- BABA
- BABC
- BCBA
- BCBC
- CBAB
- CBCB
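
The route lists above can be reproduced with a few lines of code. This is a minimal sketch that applies only the two transition rules stated earlier; `valid_paths` and `followers` are illustrative names, not part of the environment class defined later.

```python
def valid_paths(length):
    """Enumerate every path of the given length under the ABC rules."""
    # A and C may only be followed by B; B may be followed by A or C.
    followers = {"A": "B", "C": "B", "B": "AC"}
    paths = ["A", "B", "C"]  # every path starts from a single symbol
    for _ in range(length - 1):
        paths = [p + nxt for p in paths for nxt in followers[p[-1]]]
    return sorted(paths)


for n in (2, 3, 4):
    routes = valid_paths(n)
    print(n, len(routes), routes)
```

The number of routes grows with the number of positions where B leaves a free choice, which is why even short lengths already give a useful search space.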

What should be clear is that every longer combination can be built from the pieces AB, CB, BA, BC, A, B, and C. Longer building blocks could be used in the same way.
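
One way to see that these pieces are enough is to cover a path greedily from left to right. The sketch below is only an illustration of that idea; `segment` is a hypothetical helper and the greedy strategy is an assumption, not the method used later in the notebook.

```python
def segment(path, macros=("AB", "CB", "BA", "BC")):
    """Cover `path` left to right with two-step macros, falling back to single symbols."""
    pieces = []
    i = 0
    while i < len(path):
        pair = path[i:i + 2]
        if pair in macros:
            pieces.append(pair)     # a two-step macro fits here
            i += 2
        else:
            pieces.append(path[i])  # only a single primitive is left
            i += 1
    return pieces


print(segment("ABCBA"))
print(segment("BCBC"))
```

Odd-length paths end with a single primitive, which is why A, B and C have to stay in the set of building blocks.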

Further, the environment can be biased to avoid certain combinations of A, B and C, which adds more structure to the environment. The environment can also be built hierarchically: for instance, "AB" could become a new building block "D" and "BAB" a new building block "E", and these new building blocks can have new rules that allow them to connect directly. This should emulate the hierarchical structure of the real world.
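
Both ideas, biasing and hierarchical building blocks, can be sketched on plain strings. The helper names `drop_pattern` and `compress` are hypothetical, and applying the substitutions in dictionary order is an assumption made for this example.

```python
def drop_pattern(paths, pattern):
    """Bias the environment by removing every path that contains `pattern`."""
    return [p for p in paths if pattern not in p]


def compress(path, blocks):
    """Rewrite `path` with higher-level symbols.

    `blocks` maps a substring to its new symbol and is applied in insertion order.
    """
    for sub, symbol in blocks.items():
        path = path.replace(sub, symbol)
    return path


paths = ["ABABC", "ABABA", "ABCBA"]
print(drop_pattern(paths, "ABABC"))
print(compress("BABAB", {"BAB": "E", "AB": "D"}))
```

The order of the substitutions matters: replacing "AB" before "BAB" would give a different compressed string for the same path.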

### What this environment and project aim to explore

This biasing of ABC structures can be created by hand in the simplest of terms, which lets us test a simplified example of probabilistic inference of option fractures.
    

In [1]:
import copy
import os
import random
import pandas as pd
import gym
import gym.spaces as spaces
import stable_baselines3
import numpy as np
import pickle
from plotting_results import plot_graphs
import string

# Creating the environment

1. The environment consists of all possible combinations within the length limit.
2. There is a reward of +100 for reaching the correct combination.
3. There is a reward of -1 for choosing an invalid move.
4. There is no punishment for taking actions.
5. If an agent reaches the end of the sequence length without finding the reward, it receives a reward of -10 and is reset to the beginning.
6. The final path from start to reward will be recorded for later use in macro development (of course, this is just the rewarded path, so for inference does it really need to be "trained"?).
7. The total number of decisions made will also be recorded (this is a measure of search efficiency).


Note that when learning structure in the environment, the location of the reward can be biased towards certain structures, for instance towards paths containing the structure BCBA.

NOTE2: there is no reason you can't modify an existing environment to increase the action_space. This may actually work better than what I'm doing, so maybe we should try it first, but then how do we track the measure of structure? Carry on for now with what we have.

# TODO below

Must turn the state space into a discrete state space ... but this doesn't make sense 
So I need to work on how to make this make sense ... I don't want to one hot encode 
Maybe I should do a box method ...

box of length "length" and width 1?
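
A minimal sketch of the box idea: a fixed-length integer vector with one slot per step, where 0 means "no action yet" and the symbols are shifted up by one. `SYMBOL_TO_INT` and `encode` are illustrative names, and the mapping A=1, B=2, C=3 is an assumption chosen to match the shifted encoding described in the class comments that follow.

```python
import numpy as np

# 0 is reserved for "no action taken at this position yet".
SYMBOL_TO_INT = {"A": 1, "B": 2, "C": 3}


def encode(partial_path, length):
    """Encode a partial path as a zero-padded integer vector of shape (length,)."""
    obs = np.zeros(length, dtype=np.int32)
    for i, symbol in enumerate(partial_path):
        obs[i] = SYMBOL_TO_INT[symbol]
    return obs


print(encode("AB", 5))
```

This avoids one-hot encoding while still giving the policy a fixed-size observation.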

In [2]:
# The environment follows the custom-environment guide at:
# https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html

class ABC(gym.Env):  # subclass gym.Env, as the custom-environment guide requires
    metadata = {'render.modes': ['human']}
    
    def __init__(self, length, actions, bias=False):
        # not sure if I need the actions to be numbers or no - yes self.action_space needs
        # to be in the form 0,1,2 ... hmmm need to encode what each means
        super(ABC, self).__init__()
        
        self.actions = actions
        self.action_space = spaces.Discrete(len(actions))
        self.create_cypher()
        self.length = length
        self.possible_combinations = None
        self.generate_all()
        self.choose_reward(bias)
        self.state = [0]*self.length
        self.state[0] = 0 # start from 0
        self.observation_shape = (length,)  # one slot per step; the trailing comma makes this a tuple
        self.observation_space = spaces.Box(low = np.zeros(self.observation_shape),
                                            high = 3*np.ones(self.observation_shape),
                                            dtype = np.int32) # 0,1,2,3 (hopefully)
        self.info = {}
        self.info["episodes"] = 0
        self.decisions_made = 0 # init
        self.done = False
        
        # init our trackers
        self.decisions_list = []
        self.decisions_made = 0
        self.rewards_list = []
        self.rewards = 0
        self.length_list = []
        self.steps = 0
        self.create_option_cyther()
        
    def create_option_cyther(self):
        self.options_cyther = {}
        self.options_cyther["A"] = "A"
        self.options_cyther["B"] = "B"
        self.options_cyther["C"] = "C"
        # add extra alphabets and correspond to the best fractions.
        self.options_decyther = copy.copy(self.options_cyther)
        
    def create_cypher(self):
        
        # The actions will be 0,1,2,3 ... which will correspond to A, B , C etc 
        # However, the state space is shifted one to the right because 0 means no 
        # action has been taken. So the reverse cypher must place A = 1, B = 2 etc
        # but the cypher must map 0 = A,
        
        self.cypher = {}
        self.cypher[9] = 'Z'
        self.reverse_cypher = {}
        self.reverse_cypher['Z'] = 9
        for i in range(len(self.actions)):
            self.cypher[i] = self.actions[i] # this just relates index+1 to action...
            # +1 as 0 will mean no action yet and then 1 = A, 2 = B, 3 = C
            self.reverse_cypher[self.actions[i]] = i+1
            
    def generate_all(self):
        self.possible_combinations = [["A"], ["B"], ["C"]] # this is just starting states
        
        i = 0
        while i < self.length-1:
            new_choices = []
            for choice in self.possible_combinations:
                if choice[-1] == "A":
                    choice.append("B")
                elif choice[-1] == "C":
                    choice.append("B")
                else:
                    split = copy.copy(choice)
                    choice.append("A")
                    split.append("C")
                    new_choices.append(split)
            i += 1
            for n_c in new_choices:
                self.possible_combinations.append(n_c)
        
        return self.possible_combinations
    
    def choose_reward(self, bias):
        # If you want to discriminate or bias certain structures then 
        # here is where you would insert code to do this
        # it makes the most sense to just remove certain combinations
        # from the possible_combinations because then this can be analysed
        # by the structure checker.
        
        # Remove all with a certain structure in this case ABABC which will then favour  
        # ABABA or ABCBC or ABCBA which is still plenty of options
        if bias:
            drop_list = []
            for combi in self.possible_combinations:
                combi_str = "".join(str(i) for i in combi) 
                if "ABABC" in combi_str:
                    drop_list.append(combi)
            self.possible_combinations = [combi for combi in self.possible_combinations if combi not in drop_list]
                    
        # Randomly choose from what it left
        self.reward_combination = copy.copy(random.choice(self.possible_combinations))
        
        for i in range(len(self.reward_combination)):
            self.reward_combination[i] = self.reverse_cypher[self.reward_combination[i]]
        
    def step(self, action_in):
        # find our current position in the combination
        
        self.decisions_made += 1 # decision is before the outcome
        
        cyphered = self.cypher[action_in] # turn our action of 0, 1, 2, 3 ... into A, B, ...
#         print("state ", self.state)
#         print("action in ", action_in)

        done = False
        reward = 0
        for action in cyphered:
#             print("action in ", action_in)
#             print("cyphered ", cyphered)
#             print("action", action)
            self.position = 0
            for i in range(len(self.state)):
                if self.state[i] != 0: # 0 corresponds to a location without action.
                    self.position = i+1 

            # have we reached the end?
            if (self.position != self.length):
                if self.reward_combination[self.position] != 9:
                    self.state[self.position] = self.reverse_cypher[action]
                    self.rewards += -1
                    self.steps += 1
                else:
                    if self.state[:self.position] == self.reward_combination[:self.position]:
                        reward += 100
                        self.rewards += reward
                        # print("decisions made to success was ", self.decisions_made)
                        done = True
                        self.done = True
                    else:
                        reward = -50
#                         for i in range(len(self.state[:self.position])):
#                             if self.state[i] == self.reward_combination[i]:
#                                 reward += 1
                        self.rewards += reward
                        done = True
                        self.done = True
                        # self.rego() # reset the agent so it can try a new combination

                    if done == True:
                        # if we have this here then we can track our learning. 
                        self.info["episodes"] += 1
                        self.decisions_list.append(self.decisions_made)
                        self.rewards_list.append(self.rewards)
                        self.length_list.append(self.steps)
                        self.rewards = 0
                        self.reset()
            
            # Check if we have reached the end and are we correct
            if (self.position == self.length):
                if self.state == self.reward_combination:
                    reward += 100
                    self.rewards += reward
#                     print("decisions made to success was ", self.decisions_made)
                    done = True
                    self.done = True
                else:
                    reward = -50
#                     for i in range(len(self.state)):
#                         if self.state[i] == self.reward_combination[i]:
#                             reward += 1
                    self.rewards += reward
                    done = True
                    self.done = True
                    # self.rego() # reset the agent so it can try a new combination
                # if we have this here then we can track our learning. 
                if done == True:
                    self.info["episodes"] += 1
                    self.decisions_list.append(self.decisions_made)
                    self.rewards_list.append(self.rewards)
                    self.length_list.append(self.steps)
                    self.rewards = 0
                    self.reset()
            
        
#         print(self.position)
#         print(self.state)
#         print(action)
                
        return self.state, reward, done, self.info # none for info
    
    def reset(self):
        # reset the environment to the beginning of the current reward combination
        self.state = [0]*self.length
        self.state[0] = self.reward_combination[0] # starting state
        self.decisions_made = 0 # reset decisions in the episode
        self.rewards = 0
        self.steps = 0
        self.done = False
        
#         print("reset ---------------- ")
        
        return self.state # must just return the state
    
    def rego(self):
#         print("regoing")
        self.state = [0]*self.length
        self.state[0] = 0 # starting state
        # self.decisions_made = 0 # reset decisions in the episode
        self.done = False
    
    def render(self, mode="human"):
        render_list = []
        for i in range(len(self.state)):
            if self.state[i] != 0:
                render_list.append(self.cypher[self.state[i]-1]) # -1 as been shifted
            else:
                render_list.append(0)
        print(render_list)
        
    def close(self):
        pass
    
    def add_option(self, best_frac):
        option_choices = list(string.ascii_uppercase)
        for option in option_choices:
            print(option)
            if option not in self.options_cyther.keys():
                print(option)
                print(best_frac)
                self.options_cyther[option] = best_frac
                self.options_decyther[best_frac] = option
                self.actions.append(option)
                break
    # refactor our optimal possible combinations
    def refactor(self, best_frac):
        new_combis = []
        #print("best frac ", best_frac)
        for combi in self.possible_combinations:
            #print("combi : ", combi)
            str_combi = ""
            for action in combi:
                str_combi += str(action)

            str_best_frac = ""
            for a in best_frac:
                str_best_frac += a

            if str_best_frac in str_combi:
                # Do we need a cyther for options-actions?
                str_combi = str_combi.replace(str_best_frac, self.options_decyther[best_frac])
            new_combi = list(str_combi)
            while len(new_combi) < self.length:
                new_combi.append('Z')
            new_combis.append(new_combi)
            #print("new_combi : ", new_combi)
        self.possible_combinations = new_combis
        
    def count_fracs(self, length):
        master_dict = {}
        for k in range(1,length): # this will loop over the length of macros
            # print("k is --------------------", k)
            for combi in self.possible_combinations: # loops over the potential solutions
                for i in range(0, length-(k-1)): # loops over the individual combination
                    macro = []
                    #  print("i is ", i)
                    for j in range(k): # j is a look ahead to create the combination
                        try:
                            if combi[i+j] != "Z":
                                macro.append(combi[i+j]) # because our macros get shorter
                        except IndexError:  # the look-ahead window ran past the end of the path
                            macro = []
                    if len(macro) != 0:
                        macro = tuple(macro)
                        if macro not in master_dict.keys():
                            master_dict[macro] = 1
                        else:
                            master_dict[macro] += 1

        master_task_count_dict = {}
        for combi in self.possible_combinations:
            for macro in master_dict.keys():
                macro_str = ""
                combi_str = ""
                for m in macro:
                    if m != 'Z':
                        macro_str += m
                for c in combi:
                    if c != 'Z':
                        combi_str += c
                if macro_str in combi_str:
                    if macro not in master_task_count_dict.keys():
                        master_task_count_dict[macro] = 1
                    else:
                        master_task_count_dict[macro] += 1

        # print(master_dict)
        frac_counter_df = pd.DataFrame(master_dict, index=[0])
        task_counter_df = pd.DataFrame(master_task_count_dict, index=[0])

        frac_counter_dict = master_dict
        task_counter_dict = master_task_count_dict

        return frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict
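
The two counts that `count_fracs` returns, how often a fragment occurs overall and how many paths contain it, can also be sketched with `collections.Counter`. This is a simplified illustration rather than a drop-in replacement: `count_ngrams` is a hypothetical name, it ignores the 'Z' padding, and it takes the maximum fragment length directly.

```python
from collections import Counter


def count_ngrams(paths, max_len):
    """Count fragments of up to `max_len` symbols across all paths."""
    occurrences = Counter()  # total number of times each fragment appears
    tasks = Counter()        # number of paths that contain the fragment at least once
    for path in paths:
        seen = set()
        for k in range(1, max_len + 1):          # fragment length
            for i in range(len(path) - k + 1):   # start position of the window
                frag = tuple(path[i:i + k])
                occurrences[frag] += 1
                seen.add(frag)
        tasks.update(seen)
    return occurrences, tasks


occ, tasks = count_ngrams(["ABA", "CBA"], max_len=2)
print(occ[("B", "A")], tasks[("A", "B")])
```

Keeping the fragments as tuples matches the keys of `frac_counter_dict` and `task_counter_dict`.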

In [3]:
def PI_score(frac, frac_counter_dict, task_counter_dict, env):
    # compute the scores
    gamma = 1
    total_length_of_library = 0
    for action in env.actions:
        total_length_of_library += len(action)
    prior = -total_length_of_library # log(e**-L) simplifies to -L; don't include the new action length
    sym_dir_prior = np.log(1/len(env.actions))
    # possible combinations only works because this is considered optimal. 
    # in the future just use the trajectories 
    task_dist = task_counter_dict[frac] / len(env.possible_combinations)
    # frac could be called at any location which isn't the end 
    frac_dist = frac_counter_dict[frac] / (len(env.possible_combinations)*(len(env.possible_combinations[0]) - len(frac)))
    
    # So because our model over parameters is perfect (optimal) we can set AIC
    # to 0.
    AIC = 0
    
    ret = prior + task_dist*frac_dist + sym_dir_prior + AIC
    
    return ret
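
Written out, the score computed by `PI_score` is the following. The symbols are labels introduced here for readability and do not appear in the code: $\mathcal{A}$ is `env.actions`, $L$ is the summed length of the action names, $N$ is the number of possible combinations, $\ell$ is the length of one combination, $T(f)$ is `task_counter_dict[frac]` and $C(f)$ is `frac_counter_dict[frac]`.

```latex
\begin{align*}
p_{\text{task}}(f) &= \frac{T(f)}{N}, \qquad
p_{\text{frac}}(f) = \frac{C(f)}{N\,(\ell - |f|)} \\
\mathrm{PI}(f) &= \ln e^{-L} + p_{\text{task}}(f)\,p_{\text{frac}}(f)
                 + \ln\frac{1}{|\mathcal{A}|} + \mathrm{AIC} \\
               &= -L - \ln|\mathcal{A}| + p_{\text{task}}(f)\,p_{\text{frac}}(f),
                 \qquad \mathrm{AIC} = 0
\end{align*}
```

Within a single call the terms $-L$ and $-\ln|\mathcal{A}|$ are the same for every candidate fragment, so the ranking between candidates is decided by the product $p_{\text{task}}(f)\,p_{\text{frac}}(f)$.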

In [4]:
def find_best_frac(master_dict, frac_counter_dict, task_counter_dict, ABC1):
    beam_length = 2
    all_fracs = []
    for frac in master_dict.keys():
        if len(frac) == beam_length:
            all_fracs.append(frac)
    
    best_frac = None
    best_score = -10000
    for frac in all_fracs:
        PIS = PI_score(frac, frac_counter_dict, task_counter_dict, ABC1)
        if PIS > best_score:
            best_score = PIS
            best_frac = frac

    return best_frac

# Let's check whether there is already some structure in the ABC world.

Create all possible combinations at different length scales and then count the most frequent fractures.

In [5]:
# We should set up the environment and keep it the same as we train.
pd.set_option("display.max_columns", None)
length = 10
ABC1 = ABC(length, ["A", "B", "C"], bias=True) # create an environment with a set combination length
print(len(ABC1.possible_combinations))

40


In [6]:
frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = ABC1.count_fracs(length)

In [7]:
print(frac_counter_df)

    A    B    C   A   B    C    B   A   B   C   B   A   C   A   B   C   B   A  \
  NaN  NaN  NaN   B   A    B    C   B   A   B   C   B   B   B   A   B   C   B   
  NaN  NaN  NaN NaN NaN  NaN  NaN   A   B   A   B   C   C   A   B   A   B   C   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN   B   A   B   A   B   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
0  88  200  112  76  80  104  100  28  68  44  92  36  52  21  27  39  39  33   

    C   B       A   B   C   B   A   C   B       C   A   C   A   B   C   B   A  \
    B   A   C   B   A   B   C   B   B   A   C   B   B   B   B   A   B   C   B   
    C   B   B   A   B   A  

In [8]:
print(task_counter_df)

    A   B   A   B   A   B   A   B   A   B   A   B   A   B   A   B  A  B   C  \
  NaN NaN   B   A   B   A   B   A   B   A   B   A   B   A   B   A  B  A NaN   
  NaN NaN NaN NaN   A   B   A   B   A   B   A   B   A   B   A   B  A  B NaN   
  NaN NaN NaN NaN NaN NaN   B   A   B   A   B   A   B   A   B   A  B  A NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN   A   B   A   B   A   B   A   B  A  B NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   B   A   B   A   B   A  B  A NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   A   B   A   B  A  B NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   B   A  B  A NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN  A  B NaN   
0  38  40  37  37  14  36  11  14   8  11   6   8   4   6   3   4  2  3  38   

                                   B                              A          \
    B                              C                              B           
  NaN   A                        NaN   B           

In [9]:
task_counter_dict

{('A',): 38,
 ('B',): 40,
 ('A', 'B'): 37,
 ('B', 'A'): 37,
 ('A', 'B', 'A'): 14,
 ('B', 'A', 'B'): 36,
 ('A', 'B', 'A', 'B'): 11,
 ('B', 'A', 'B', 'A'): 14,
 ('A', 'B', 'A', 'B', 'A'): 8,
 ('B', 'A', 'B', 'A', 'B'): 11,
 ('A', 'B', 'A', 'B', 'A', 'B'): 6,
 ('B', 'A', 'B', 'A', 'B', 'A'): 8,
 ('A', 'B', 'A', 'B', 'A', 'B', 'A'): 4,
 ('B', 'A', 'B', 'A', 'B', 'A', 'B'): 6,
 ('A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'): 3,
 ('B', 'A', 'B', 'A', 'B', 'A', 'B', 'A'): 4,
 ('A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A'): 2,
 ('B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'): 3,
 ('C',): 38,
 ('C', 'B'): 38,
 ('C', 'B', 'A'): 34,
 ('C', 'B', 'A', 'B'): 32,
 ('C', 'B', 'A', 'B', 'A'): 12,
 ('C', 'B', 'A', 'B', 'A', 'B'): 9,
 ('C', 'B', 'A', 'B', 'A', 'B', 'A'): 6,
 ('C', 'B', 'A', 'B', 'A', 'B', 'A', 'B'): 4,
 ('C', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A'): 2,
 ('B', 'C'): 37,
 ('B', 'C', 'B'): 37,
 ('B', 'C', 'B', 'A'): 32,
 ('B', 'C', 'B', 'A', 'B'): 30,
 ('B', 'C', 'B', 'A', 'B', 'A'): 11,
 ('B', 'C',

## So we don't actually need to collect the trajectories: if we assume they are optimal and we train over all of them, or at least a fair fraction of them, then we should be able to discover the macros.

In [10]:
# This will loop over and collect "hierarchically" the macros
total_compressions = 8
for compression in range(total_compressions):
    print("compression : ", compression)
    frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = ABC1.count_fracs(length)
    #print("frac counter")
    print(frac_counter_df)
    #print("task counter")
    #print(task_counter_df)    
    best_frac = find_best_frac(frac_counter_dict, frac_counter_dict, task_counter_dict, ABC1)
    print("best frac ")
    print("")
    print(best_frac)
    print("")
    ABC1.add_option(best_frac) # only run once else will keep adding the same option
    ABC1.refactor(best_frac)

compression :  0
    A    B    C   A   B    C    B   A   B   C   B   A   C   A   B   C   B   A  \
  NaN  NaN  NaN   B   A    B    C   B   A   B   C   B   B   B   A   B   C   B   
  NaN  NaN  NaN NaN NaN  NaN  NaN   A   B   A   B   C   C   A   B   A   B   C   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN   B   A   B   A   B   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN  NaN  NaN NaN NaN  NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
0  88  200  112  76  80  104  100  28  68  44  92  36  52  21  27  39  39  33   

    C   B       A   B   C   B   A   C   B       C   A   C   A   B   C   B   A  \
    B   A   C   B   A   B   C   B   B   A   C   B   B   B   B   A   B   C   B   
    C   B 

    E   B   A   F    D   C   E   B   E   F   B   E   D   B   F   E   D   F  \
  NaN NaN NaN NaN  NaN NaN   E   E   A   E   F   F   F   D   F   D   D   A   
  NaN NaN NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN NaN NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN NaN NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN NaN NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
0  79  20  70  72  112  47  35   8  23  28   5   6  36   7  20   8  58  19   

        D   F   D   E   B   E   F   B   E   D   B   F   B   F   E   D   B   F  \
    D   A   C   C   E   E   E   E   F   F   F   E   E   D   F   D   D   F   F   
  NaN NaN NaN NaN   E   E   A   E   E   E   E   F   A   F   E   F   F   F   A   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Na

    I   E   B   A   F   H   G   J   D   C   I       B   I   F   B   E   H   B  \
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   I   E   I   A   I   F   F   I   E   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
0  23  56  20  82  45  28  43  20  28  55   2  16   1  14  10   5   4   8   7   

    F   E   B   H   F   E   G   F   H   B   H   F   E   G   E   J   H   D   F  \
    E   A   H   E   F   H   F   A   A   G   F   H   G   H   J   A   D   A   G   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN   
0  17  13   3   9  11   4  11  14   7   4   8   8   4   8   2   7  10  13  11   

    G           F   G   H 

In [11]:
print(ABC1.options_cyther)

{'A': 'A', 'B': 'B', 'C': 'C', 'D': ('C', 'B'), 'E': ('A', 'B'), 'F': ('D', 'E'), 'G': ('D', 'D'), 'H': ('D', 'F'), 'I': ('E', 'E'), 'J': ('F', 'D'), 'K': ('G', 'D')}
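
Because later options are defined in terms of earlier ones, it helps to expand them back into primitives. The sketch below copies the dictionary printed above; `expand` is a hypothetical helper and not a method of the environment.

```python
options = {
    "A": "A", "B": "B", "C": "C",
    "D": ("C", "B"), "E": ("A", "B"), "F": ("D", "E"),
    "G": ("D", "D"), "H": ("D", "F"), "I": ("E", "E"),
    "J": ("F", "D"), "K": ("G", "D"),
}


def expand(symbol, options):
    """Recursively expand an option into its string of primitive actions."""
    body = options[symbol]
    if body == symbol:  # primitives map to themselves
        return symbol
    return "".join(expand(part, options) for part in body)


for name in "DEFGHIJK":
    print(name, expand(name, options))
```

Each compression step roughly doubles the reach of the newest options: D and E cover two primitive steps, while H, J and K already cover six.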


In [12]:
ABC1.create_cypher()
print(ABC1.cypher)

{9: 'J', 0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 10: 'K'}


In [13]:
ABC1.choose_reward(bias=True)
ABC1.reward_combination

[2, 7, 6, 1, 9, 9, 9, 9, 9, 9]
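
The reward combination above can be decoded back into primitives as a sanity check. This sketch assumes that values are shifted by one relative to the action index (as in `reverse_cypher`) and that 9 is always the 'Z' padding; with more than eight actions the option 'I' is also assigned 9 in `reverse_cypher`, so that second assumption only holds when 'I' is not part of the path. `decode` and `expand` are hypothetical helpers.

```python
PAD = 9  # value assigned to the 'Z' padding symbol

actions = list("ABCDEFGHIJK")
options = {
    "A": "A", "B": "B", "C": "C",
    "D": ("C", "B"), "E": ("A", "B"), "F": ("D", "E"),
    "G": ("D", "D"), "H": ("D", "F"), "I": ("E", "E"),
    "J": ("F", "D"), "K": ("G", "D"),
}


def expand(symbol):
    """Recursively expand an option into primitive actions."""
    body = options[symbol]
    return symbol if body == symbol else "".join(expand(part) for part in body)


def decode(reward_combination):
    """Return the option symbols and the primitive path they stand for."""
    symbols = [actions[v - 1] for v in reward_combination if v != PAD]
    return symbols, "".join(expand(s) for s in symbols)


symbols, primitives = decode([2, 7, 6, 1, 9, 9, 9, 9, 9, 9])
print(symbols, primitives, len(primitives))
```

Under these assumptions the four chosen symbols expand to a path of ten primitives, which matches the environment length of 10.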

# Creating the agent 

The agent will have different possible action choices (based on macros) and different possible learning methods; these will be used to compare the effects of the macros, so they should be passed as arguments to the agent.

I will use Stable Baselines, as this is the easiest option.

Each agent with a different set of actions (as macros) needs to be defined in the environment, as it is in gym. This means the agent and the environment need to interact for this to make sense, which I believe is something for the future.

# What's next?
1. Use PPO with different macros / actions which make sense.
2. Analyse their effects on total decisions made and rewards.
3. Change the PPO method to include a generalising bias on macros?

1. Run 100 trials without any macros and save the learning curves. Use the tensorboard callback to track these and access them through bash commands, for instance:

"python3 -m tensorboard.main --logdir=~/my/training/dir"

This may not be that useful on its own, as I will want to join lots of these runs together and average them, so leave it in, but we will also log the results ourselves.
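
Joining the logged runs and averaging them could look like the sketch below. It assumes each run is a plain list of per-episode values (like the lists pickled further down) and truncates all runs to the shortest one, which is just one possible way of aligning them; `average_runs` is a hypothetical helper.

```python
import numpy as np


def average_runs(runs):
    """Average several per-episode curves after truncating them to a common length."""
    shortest = min(len(run) for run in runs)
    stacked = np.array([run[:shortest] for run in runs], dtype=float)
    return stacked.mean(axis=0), stacked.std(axis=0)


mean, std = average_runs([[1, 2, 3], [3, 4]])
print(mean, std)
```

Truncation throws away the tail of longer runs; padding or interpolating would be alternatives if the later episodes matter.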

In [14]:
from stable_baselines3.common import results_plotter
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.results_plotter import plot_results
from stable_baselines3.common.noise import NormalActionNoise
from stable_baselines3.common.callbacks import BaseCallback

class record_decisions(BaseCallback):
    """
    Callback that logs the number of decisions made whenever an episode ends.

    It reads the module-level ``env`` created in the training cells.

    :param check_freq: Unused; kept in the signature.
    :param log_dir: Path to the logging folder.
    :param verbose: Verbosity level.
    """
    def __init__(self, check_freq: int, log_dir: str, verbose: int = 1):
        super(record_decisions, self).__init__(verbose)
        self.check_freq = check_freq
        self.log_dir = log_dir

    def _on_step(self) -> bool:
        if env.done:
            self.logger.record("decisions_made", env.decisions_made)
        # Returning False from a callback aborts training, so always continue.
        return True


In [15]:
log_dir = "./PPO_ABC_prim"
os.makedirs(log_dir, exist_ok = True)

In [16]:
recording_folder = "./macro_adder"
os.makedirs(recording_folder, exist_ok=True)

In [None]:
tag = "prim" # choose prim or mac for now
runs = 10
for run in range(runs):
    print("starting run ", run)
    env = ABC(7, actions=["A", "B", "C"], bias=True)
    env.decisions_list = []
    env.rewards_list = []
    env.length_list = []
    model = stable_baselines3.PPO('MlpPolicy', env, tensorboard_log=log_dir)
    model.learn(total_timesteps=40000, callback=record_decisions)
    
    pickle.dump(env.decisions_list, open("{}/{}_{}_decisions".format(recording_folder, 
                                                                     tag, run), "wb"))
    pickle.dump(env.rewards_list, open("{}/{}_{}_reward".format(recording_folder,
                                                                tag, run), "wb"))
    pickle.dump(env.length_list, open("{}/{}_{}_length".format(recording_folder,
                                                               tag, run), "wb"))

starting run  0
starting run  1
starting run  2
starting run  3
starting run  4


In [None]:
env = ABC(7, actions=["A", "B", "C"], bias=True)
env.decisions_list = []
env.rewards_list = []
env.length_list = []

In [None]:
# This is how we change our environment with our additional macro
frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = env.count_fracs(env.length)  # use this env's own length, not the global `length`
best_frac = find_best_frac(frac_counter_dict, frac_counter_dict, task_counter_dict, env)
env.add_option(best_frac) # only run once else will keep adding the same option
env.refactor(best_frac)
env.create_cypher()
env.action_space = spaces.Discrete(len(env.actions))

In [None]:
tag = "first" # tag for the runs with the first added macro
runs = 10
for run in range(runs):
    env.decisions_list = []
    env.rewards_list = []
    env.length_list = []
    print("starting run ", run)
    env.choose_reward(bias=True)
    model = stable_baselines3.PPO('MlpPolicy', env, tensorboard_log=log_dir)
    model.learn(total_timesteps=40000, callback=record_decisions)
    
    pickle.dump(env.decisions_list, open("{}/{}_{}_decisions".format(recording_folder, 
                                                                     tag, run), "wb"))
    pickle.dump(env.rewards_list, open("{}/{}_{}_reward".format(recording_folder,
                                                                tag, run), "wb"))
    pickle.dump(env.length_list, open("{}/{}_{}_length".format(recording_folder,
                                                               tag, run), "wb"))

In [None]:
env.reward_combination

In [None]:
env.length

In [None]:
# This is how we change our environment with our additional macro
frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = env.count_fracs(env.length)  # use this env's own length, not the global `length`
best_frac = find_best_frac(frac_counter_dict, frac_counter_dict, task_counter_dict, env)
env.add_option(best_frac) # only run once else will keep adding the same option
env.refactor(best_frac)
env.create_cypher()
env.action_space = spaces.Discrete(len(env.actions))

In [None]:
env.reverse_cypher

In [None]:
tag = "second" # tag for the runs with the second added macro
runs = 10
for run in range(runs):
    env.decisions_list = []
    env.rewards_list = []
    env.length_list = []
    print("starting run ", run)
    env.choose_reward(bias=True)
    model = stable_baselines3.PPO('MlpPolicy', env, tensorboard_log=log_dir)
    model.learn(total_timesteps=40000, callback=record_decisions)
    
    pickle.dump(env.decisions_list, open("{}/{}_{}_decisions".format(recording_folder, 
                                                                     tag, run), "wb"))
    pickle.dump(env.rewards_list, open("{}/{}_{}_reward".format(recording_folder,
                                                                tag, run), "wb"))
    pickle.dump(env.length_list, open("{}/{}_{}_length".format(recording_folder,
                                                               tag, run), "wb"))

In [None]:
# This is how we change our environment with our additional macro
frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = env.count_fracs(length)  
best_frac = find_best_frac(frac_counter_dict, frac_counter_dict, task_counter_dict, env)
env.add_option(best_frac) # only run once else will keep adding the same option
env.refactor(best_frac)
env.create_cypher()
env.action_space = spaces.Discrete(len(env.actions))

In [None]:
tag = "third"  # label used in the output file names for this round of runs
runs = 10
for run in range(runs):
    env.decisions_list = []
    env.rewards_list = []
    env.length_list = []
    print("starting run ", run)
    env.choose_reward(bias=True)
    model = stable_baselines3.PPO('MlpPolicy', env, tensorboard_log=log_dir)
    model.learn(total_timesteps=40000, callback=record_decisions)

    # Save this run's records; the context manager closes each file after writing.
    for name, record in (("decisions", env.decisions_list),
                         ("reward", env.rewards_list),
                         ("length", env.length_list)):
        with open("{}/{}_{}_{}".format(recording_folder, tag, run, name), "wb") as f:
            pickle.dump(record, f)

In [None]:
# This is how we change our environment with our additional macro
frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = env.count_fracs(length)  
best_frac = find_best_frac(frac_counter_dict, frac_counter_dict, task_counter_dict, env)
env.add_option(best_frac) # only run once else will keep adding the same option
env.refactor(best_frac)
env.create_cypher()
env.action_space = spaces.Discrete(len(env.actions))

In [None]:
tag = "fourth"  # label used in the output file names for this round of runs
runs = 10
for run in range(runs):
    env.decisions_list = []
    env.rewards_list = []
    env.length_list = []
    print("starting run ", run)
    env.choose_reward(bias=True)
    model = stable_baselines3.PPO('MlpPolicy', env, tensorboard_log=log_dir)
    model.learn(total_timesteps=40000, callback=record_decisions)

    # Save this run's records; the context manager closes each file after writing.
    for name, record in (("decisions", env.decisions_list),
                         ("reward", env.rewards_list),
                         ("length", env.length_list)):
        with open("{}/{}_{}_{}".format(recording_folder, tag, run, name), "wb") as f:
            pickle.dump(record, f)

In [None]:
# This is how we change our environment with our additional macro
frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = env.count_fracs(length)  
best_frac = find_best_frac(frac_counter_dict, frac_counter_dict, task_counter_dict, env)
env.add_option(best_frac) # only run once else will keep adding the same option
env.refactor(best_frac)
env.create_cypher()
env.action_space = spaces.Discrete(len(env.actions))

In [None]:
tag = "fifth"  # label used in the output file names for this round of runs
runs = 10
for run in range(runs):
    env.decisions_list = []
    env.rewards_list = []
    env.length_list = []
    print("starting run ", run)
    env.choose_reward(bias=True)
    model = stable_baselines3.PPO('MlpPolicy', env, tensorboard_log=log_dir)
    model.learn(total_timesteps=40000, callback=record_decisions)

    # Save this run's records; the context manager closes each file after writing.
    for name, record in (("decisions", env.decisions_list),
                         ("reward", env.rewards_list),
                         ("length", env.length_list)):
        with open("{}/{}_{}_{}".format(recording_folder, tag, run, name), "wb") as f:
            pickle.dump(record, f)

In [None]:
# This is how we change our environment with our additional macro
frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = env.count_fracs(length)  
best_frac = find_best_frac(frac_counter_dict, frac_counter_dict, task_counter_dict, env)
env.add_option(best_frac) # only run once else will keep adding the same option
env.refactor(best_frac)
env.create_cypher()
env.action_space = spaces.Discrete(len(env.actions))

In [None]:
tag = "sixth"  # label used in the output file names for this round of runs
runs = 10
for run in range(runs):
    env.decisions_list = []
    env.rewards_list = []
    env.length_list = []
    print("starting run ", run)
    env.choose_reward(bias=True)
    model = stable_baselines3.PPO('MlpPolicy', env, tensorboard_log=log_dir)
    model.learn(total_timesteps=40000, callback=record_decisions)

    # Save this run's records; the context manager closes each file after writing.
    for name, record in (("decisions", env.decisions_list),
                         ("reward", env.rewards_list),
                         ("length", env.length_list)):
        with open("{}/{}_{}_{}".format(recording_folder, tag, run, name), "wb") as f:
            pickle.dump(record, f)

In [None]:
# This is how we change our environment with our additional macro
frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = env.count_fracs(length)  
best_frac = find_best_frac(frac_counter_dict, frac_counter_dict, task_counter_dict, env)
env.add_option(best_frac) # only run once else will keep adding the same option
env.refactor(best_frac)
env.create_cypher()
env.action_space = spaces.Discrete(len(env.actions))

In [None]:
tag = "seventh"  # label used in the output file names for this round of runs
runs = 10
for run in range(runs):
    env.decisions_list = []
    env.rewards_list = []
    env.length_list = []
    print("starting run ", run)
    env.choose_reward(bias=True)
    model = stable_baselines3.PPO('MlpPolicy', env, tensorboard_log=log_dir)
    model.learn(total_timesteps=40000, callback=record_decisions)

    # Save this run's records; the context manager closes each file after writing.
    for name, record in (("decisions", env.decisions_list),
                         ("reward", env.rewards_list),
                         ("length", env.length_list)):
        with open("{}/{}_{}_{}".format(recording_folder, tag, run, name), "wb") as f:
            pickle.dump(record, f)

In [None]:
# This is how we change our environment with our additional macro
frac_counter_df, task_counter_df, frac_counter_dict, task_counter_dict = env.count_fracs(length)  
best_frac = find_best_frac(frac_counter_dict, frac_counter_dict, task_counter_dict, env)
env.add_option(best_frac) # only run once else will keep adding the same option
env.refactor(best_frac)
env.create_cypher()
env.action_space = spaces.Discrete(len(env.actions))

In [None]:
tag = "eighth"  # label used in the output file names for this round of runs
runs = 10
for run in range(runs):
    env.decisions_list = []
    env.rewards_list = []
    env.length_list = []
    print("starting run ", run)
    env.choose_reward(bias=True)
    model = stable_baselines3.PPO('MlpPolicy', env, tensorboard_log=log_dir)
    model.learn(total_timesteps=40000, callback=record_decisions)

    # Save this run's records; the context manager closes each file after writing.
    for name, record in (("decisions", env.decisions_list),
                         ("reward", env.rewards_list),
                         ("length", env.length_list)):
        with open("{}/{}_{}_{}".format(recording_folder, tag, run, name), "wb") as f:
            pickle.dump(record, f)

In [None]:
env.state

In [None]:
env.reward_combination

In [None]:
import matplotlib.pyplot as plt
from plotting_results import plot_graphs
plt.rcParams["figure.figsize"] = (20,10)

In [None]:
plot_graphs("./macro_adder", 50)
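
To inspect the raw records behind the plots, the pickled files written by the training cells can be read back directly. This is a minimal sketch under assumptions: `load_run_records` is a hypothetical helper, not part of `plotting_results`, and it only assumes the `<tag>_<run>_<name>` file naming used when the records were saved.

```python
import os
import pickle


def load_run_records(recording_folder, tag, runs,
                     names=("decisions", "reward", "length")):
    """Return {name: [record of run 0, record of run 1, ...]} for one tag.

    Hypothetical helper: expects files named <tag>_<run>_<name>
    inside recording_folder, as written by the training cells.
    """
    loaded = {name: [] for name in names}
    for run in range(runs):
        for name in names:
            path = os.path.join(recording_folder, "{}_{}_{}".format(tag, run, name))
            with open(path, "rb") as f:
                loaded[name].append(pickle.load(f))
    return loaded
```

For example, `load_run_records(recording_folder, "second", 10)["reward"]` would give one reward list per run for the `"second"` round.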
