In [1]:
import torch
import numpy as np
import json

The naming in the code is a little unusual: conceptually a walk is made up of steps, but the code calls each individual step a `walk` (singular).

There are two types of walks: the "default" and the "shiny".

## Default Walks

Each walk (that is, each step) is an entry in a list.
An individual walk is itself a three-element list.
Each one has the following elements:

1. a `new_location`, which is the location object from the env file corresponding to the position being sampled
2. a `new_observation`, which is a one-hot vector for the observation in the chosen location
3. a `new_action`, obtained by turning the action probabilities of the location into a "policy", forming its CDF, then sampling that CDF at random

**How to choose a location**

The first location is chosen at random.
All other locations are chosen based on the action from the previous step/walk (the authors call each step a walk).
The code is not very readable, but what it does is take the `transition` array of the chosen action, which marks with a `1` the index of the location the action takes you to. It then generates a random number and compares every entry against it after a `cumsum()` operation.
This is convoluted: since all entries other than the next location are 0, the `cumsum()` is 0 before the position of the `1` and 1 from that position onwards.
It then returns the index of the first entry in the `cumsum()` array that exceeds the random number, which is the position of the `1`.
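
The location lookup described above can be tried in isolation. This is a minimal sketch rather than the repository's code: `sample_next_location` is a hypothetical helper name, and the `transition` row is made up to mimic the one-hot shape found in the env file.

```python
import numpy as np

def sample_next_location(transition, rng):
    """Index of the first cumulative sum that exceeds a uniform random draw."""
    draw = rng.random()                      # uniform in [0, 1)
    cdf = np.cumsum(transition)              # 0 before the 1, then 1 from there on
    return int(np.flatnonzero(cdf > draw)[0])

rng = np.random.default_rng(0)               # seeded only to make the sketch repeatable

# Hypothetical one-hot transition row: the action always leads to location 16.
transition = np.zeros(25)
transition[16] = 1

samples = {sample_next_location(transition, rng) for _ in range(1000)}
print(samples)                               # set of every location that was returned
print(int(np.argmax(transition)))            # position of the 1, found without randomness
```

For a one-hot row the random draw cannot change the result, so `np.argmax` gives the same index. The `cumsum()` form would only behave differently if a transition row spread its probability over several locations.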

**Observations** are stored in the walk as one-hot-encoded vectors.


**POSSIBLE BUG** there is this convoluted piece of code
```python
def get_action(self, new_location, walk, repeat_bias_factor=2):
    # Build policy from action probability of each action of provided location dictionary
    policy = np.array([action['probability'] for action in new_location['actions']])        
    # Add a bias for repeating previous action to walk in straight lines, only if (this is not the first step) and (the previous action was a move)
    policy[[] if len(walk) == 0 or new_location['id'] == walk[-1][0]['id'] else walk[-1][2]] *= repeat_bias_factor
```
in [github.com/jbakermans/torch_tem/world.py#get_action()](https://github.com/jbakermans/torch_tem/blob/bf103fb32b5fdc7541ebbd95ba77a2d35d049d7c/world.py#L235-L239).

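Note that the repeat bias does nothing on the first step of the walk, or when the new location ID is the same as the previous location ID (that is, when the agent did not move).
Otherwise it scales the probability of the previous action, `walk[-1][2]`, by `repeat_bias_factor`.
So it only encourages a straight line after an actual move; a previous "stay" action is never boosted.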
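
The three situations can be checked separately. The sketch below spells out the condition from the quoted line with explicit names; `biased_policy` is a hypothetical helper, and the walk entries are stripped-down stand-ins that carry only the fields the condition reads.

```python
import numpy as np

def biased_policy(probabilities, walk, new_location_id, repeat_bias_factor=2):
    """Apply the repeat bias as the quoted line does, then renormalise."""
    policy = np.array(probabilities, dtype=float)
    first_step = len(walk) == 0
    stayed = (not first_step) and new_location_id == walk[-1][0]['id']
    if not first_step and not stayed:
        policy[walk[-1][2]] *= repeat_bias_factor   # boost the previous action only
    total = policy.sum()
    return policy / total if total > 0 else policy

uniform = [0.2] * 5

# Case 1: first step, there is no previous action to repeat.
print(biased_policy(uniform, [], 17))
# Case 2: previous action 0 left the agent at location 17, so no bias is applied.
print(biased_policy(uniform, [[{'id': 17}, None, 0]], 17))
# Case 3: previous action 4 moved the agent from 17 to 16, so action 4 is boosted.
print(biased_policy(uniform, [[{'id': 17}, None, 4]], 16))
```

Only the third case changes the policy, and the change is confined to the index of the previous action.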


**POSSIBLE BUG**

```python
# Clean up walk a bit by only keep essential location dictionary entries
for step in new_walk[:-1]:
    step[0] = {'id': step[0]['id'], 'shiny': step[0]['shiny']}
```

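This code looks wrong at first: if `new_walk` were a single step, `new_walk[:-1]` would make `step` be a location and then the one-hot-encoded vector, and `step[0]` would index a dictionary with `0`.
It only works when `new_walk` is the whole list of steps, so that each `step` is a `[location, observation, action]` list and `step[0]` is the location dictionary.
Even then, the `[:-1]` slice leaves the last step of the walk untrimmed.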
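
Whether the loop works depends on what `new_walk` holds, which a small stand-in makes visible. The walk below is hypothetical: observations are placeholder strings instead of tensors, and the `'shiny'` key is assumed to exist on each location because the quoted loop reads it.

```python
# Hypothetical miniature walk: each step is [location, observation, action].
walk = [
    [{'id': 17, 'shiny': None, 'x': 0.5, 'y': 0.7}, 'obs-40', 4],
    [{'id': 16, 'shiny': None, 'x': 0.3, 'y': 0.7}, 'obs-16', 3],
]

# Iterating over the walk: every `step` is a three-element list,
# so step[0] is the location dictionary and the reassignment works.
for step in walk[:-1]:
    step[0] = {'id': step[0]['id'], 'shiny': step[0]['shiny']}
print(walk[0][0])        # trimmed location
print(walk[1][0])        # the last step is skipped by the [:-1] slice

# Iterating over a single step: `item` is first the location dictionary,
# and indexing a dictionary with 0 raises KeyError.
single_step = walk[-1]
try:
    for item in single_step[:-1]:
        item[0]
except KeyError as err:
    print(f'KeyError: {err}')
```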

**Style issue**

`walk_default()` modifies the `walk` variable that's passed as an argument by `generate_walks()`.
This is not C.
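
The difference between the two styles can be seen with two toy functions. Both names are hypothetical and neither is taken from the repository; they only contrast mutating the argument with returning a fresh list.

```python
def extend_in_place(walk, step):
    """Mutates the caller's list, the style described above."""
    walk.append(step)
    return walk

def extend_copy(walk, step):
    """Returns a new list and leaves the caller's list alone."""
    return walk + [step]

original = []
returned = extend_in_place(original, 'step-0')
print(returned is original, original)    # same object, caller's list has changed

original = []
returned = extend_copy(original, 'step-0')
print(returned is original, original)    # different object, caller's list is unchanged
```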

In [2]:
with open('./envs/5x5.json', 'r') as f:
    env = json.load(f)

In [3]:
repeat_bias_factor=2

walks = []

In [4]:
new_walk = []

In [5]:
walk = new_walk

In [6]:
len(walk)

0

In [7]:
# The first location is chosen at random.
new_location_pos = np.random.randint(env['n_locations'])
new_location_pos

17

## Grid

```
(0.1,0.1) - (0.3,0.1) - (0.5,0.1) - (0.7,0.1) - (0.9,0.1)
    |           |           |           |           |
(0.1,0.3) - (0.3,0.3) - (0.5,0.3) - (0.7,0.3) - (0.9,0.3)
    |           |           |           |           |
(0.1,0.5) - (0.3,0.5) - (0.5,0.5) - (0.7,0.5) - (0.9,0.5)
    |           |           |           |           |
(0.1,0.7) - (0.3,0.7) - (0.5,0.7) - (0.7,0.7) - (0.9,0.7)
    |           |           |           |           |
(0.1,0.9) - (0.3,0.9) - (0.5,0.9) - (0.7,0.9) - (0.9,0.9)
```

## Action Types by ID

- 0: Stay in place (no movement)
- 1: Move up (north)
- 2: Move right (east)
- 3: Move down (south)
- 4: Move left (west)

A move that is impossible from a given location typically has probability 0.
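
The grid and the action IDs can be tied together with a little index arithmetic. This is a sketch under two assumptions that the env file does not state explicitly: location IDs are row-major, and rows grow downward as in the drawing. `location_xy`, `neighbour`, and `ACTION_OFFSETS` are hypothetical names.

```python
WIDTH = HEIGHT = 5

# (row change, column change) per action ID; rows grow downward as in the grid drawing.
ACTION_OFFSETS = {0: (0, 0), 1: (-1, 0), 2: (0, 1), 3: (1, 0), 4: (0, -1)}

def location_xy(location_id):
    """Centre coordinates of a cell, assuming row-major IDs."""
    row, col = divmod(location_id, WIDTH)
    return round(0.1 + 0.2 * col, 1), round(0.1 + 0.2 * row, 1)

def neighbour(location_id, action_id):
    """Target location ID, or None when the move would leave the grid."""
    row, col = divmod(location_id, WIDTH)
    d_row, d_col = ACTION_OFFSETS[action_id]
    row, col = row + d_row, col + d_col
    if 0 <= row < HEIGHT and 0 <= col < WIDTH:
        return row * WIDTH + col
    return None

print(location_xy(17))                         # coordinates of location 17
print([neighbour(17, a) for a in range(5)])    # targets of actions 0..4 from the interior
print([neighbour(0, a) for a in range(5)])     # a corner cell has two impossible moves
```

For location 17 this reproduces the `x`, `y`, and `out_locations` values printed for that location in this notebook, which is what supports the row-major assumption.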

In [8]:
new_location = env['locations'][new_location_pos]

for k, v in new_location.items():
    if k != 'actions':
        print(f'{k:<13}: {v}')
    else:
        print('actions:')
        for action in v: # The value is the array of actions.
            print(f'\t{action["id"]}: {action["probability"]}')

id           : 17
observation  : 40
x            : 0.5
y            : 0.7
in_locations : [12, 16, 17, 18, 22]
in_degree    : 5
out_locations: [12, 16, 17, 18, 22]
out_degree   : 5
actions:
	0: 0.2
	1: 0.2
	2: 0.2
	3: 0.2
	4: 0.2


In [9]:
# Find sensory observation for new state, and store it as one-hot vector.
new_observation = np.eye(env['n_observations'])[new_location['observation']]

# Create a new observation by converting the new observation to a torch tensor.
new_observation = torch.tensor(new_observation, dtype=torch.float).view((new_observation.shape[0]))
 
new_observation, torch.nonzero( new_observation ), new_observation.shape

(tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 1., 0., 0., 0., 0.]),
 tensor([[40]]),
 torch.Size([45]))
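
`np.eye(n)[i]` builds a full `n` by `n` identity matrix just to read one row of it. A possible alternative, sketched here with the hypothetical helper `one_hot`, allocates only the row; the numbers 40 and 45 are the ones from the cell above.

```python
import numpy as np

def one_hot(index, size):
    """Length-`size` float vector with a single 1 at `index`."""
    vector = np.zeros(size)
    vector[index] = 1.0
    return vector

row = one_hot(40, 45)
print(np.array_equal(row, np.eye(45)[40]))   # compares against the np.eye version
print(np.flatnonzero(row))                   # position of the single nonzero entry
```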

In [10]:
# Build policy from action probability of each action of provided location dictionary.
policy = np.array( [action['probability'] for action in new_location['actions']] )
policy

array([0.2, 0.2, 0.2, 0.2, 0.2])

In [11]:
policy[ [] ]

array([], dtype=float64)

In [12]:
policy[ [] ] *= repeat_bias_factor
policy

array([0.2, 0.2, 0.2, 0.2, 0.2])

In [13]:
# Add a bias for repeating previous action to walk in straight lines,
# only if (this is not the first step) and (the previous action was a move).
policy[[] if len(walk) == 0 or new_location['id'] == walk[-1][0]['id'] else walk[-1][2]] *= repeat_bias_factor

# And renormalise policy (note that for unavailable actions, the policy was 0 and remains 0,
# so in that case no renormalisation needed).
policy = policy / sum(policy) if sum(policy) > 0 else policy
policy

array([0.2, 0.2, 0.2, 0.2, 0.2])

In [14]:
_some = np.random.rand()

np.cumsum(policy), _some, np.cumsum(policy) > _some

(array([0.2, 0.4, 0.6, 0.8, 1. ]),
 0.9283203881638409,
 array([False, False, False, False,  True]))

In [15]:
np.flatnonzero( np.cumsum(policy) > _some )

array([4])

In [16]:
# Select action in new state
new_action = int(np.flatnonzero(np.cumsum(policy)>_some)[0])
new_action

4
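
The `cumsum()` comparison is a form of inverse-CDF sampling, so over many draws the chosen actions should follow the policy. The sketch below checks that empirically on a made-up policy; `sample_action` is a hypothetical name, and the seed is there only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)                      # seeded only for repeatability
policy = np.array([1, 1, 1, 1, 2]) / 6              # uniform policy with action 4 doubled

def sample_action(policy, rng):
    """First index whose cumulative probability exceeds a uniform draw."""
    return int(np.flatnonzero(np.cumsum(policy) > rng.random())[0])

n_samples = 20_000
counts = np.bincount([sample_action(policy, rng) for _ in range(n_samples)], minlength=5)
print(counts / n_samples)                           # empirical frequencies per action

# NumPy's Generator.choice draws from the same distribution in one call.
print(rng.choice(len(policy), p=policy))
```

The frequencies should land close to `policy`. `Generator.choice` is shown only as a shorter way to draw from the same distribution, not as what the repository does.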

In [17]:
# Append location, observation, and action to the walk.
# new_location is the actual location from the env file.
# new_observation is the one-hot-vector for the given observation.
# new_action is the first action ID whose cumulative probability exceeds a random number.
walk.append([new_location, new_observation, new_action])

# Next step in the walk (next walk actually)

In [18]:
walk[-1][2]

4

In [19]:
prev_location = walk[-1][0]
prev_action_chosen = walk[-1][2]

In [20]:
prev_location['actions'][prev_action_chosen]

{'id': 4,
 'transition': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'probability': 0.2}

In [21]:
_some = np.random.rand()

np.cumsum(prev_location['actions'][prev_action_chosen]['transition']) > _some

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

In [22]:
np.flatnonzero(
    np.cumsum(
        prev_location['actions'][prev_action_chosen]['transition'],
    )>_some,
)

array([16, 17, 18, 19, 20, 21, 22, 23, 24])

In [23]:
# TODO: for a one-hot transition this is a very awkward way of writing
# np.flatnonzero(transition)[0]; the cumsum only matters if a transition is stochastic.
new_location = int(
    np.flatnonzero(
        np.cumsum(
            prev_location['actions'][prev_action_chosen]['transition'],
        )>np.random.rand(),
    )[0]
)
print(f'location chosen: {new_location}')
new_location = env['locations'][new_location]
for k, v in new_location.items():
    if k != 'actions':
        print(f'{k:<13}: {v}')
    else:
        print('actions:')
        for action in v: # The value is the array of actions.
            print(f'\t{action["id"]}: {action["probability"]}')

location chosen: 16
id           : 16
observation  : 16
x            : 0.3
y            : 0.7
in_locations : [11, 15, 16, 17, 21]
in_degree    : 5
out_locations: [11, 15, 16, 17, 21]
out_degree   : 5
actions:
	0: 0.2
	1: 0.2
	2: 0.2
	3: 0.2
	4: 0.2


In [24]:
def get_observation(env, new_location):
    # Find sensory observation for new state, and store it as one-hot vector
    new_observation = np.eye(env['n_observations'])[new_location['observation']]
    # Create a new observation by converting the new observation to a torch tensor
    new_observation = torch.tensor(new_observation, dtype=torch.float).view((new_observation.shape[0]))
    # Return the new observation
    return new_observation

In [25]:
new_observation = get_observation(env, new_location)
new_observation, torch.nonzero( new_observation ), new_observation.shape

(tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 tensor([[16]]),
 torch.Size([45]))

In [26]:
def get_action(env, new_location, walk, repeat_bias_factor=2):
    # Build policy from action probability of each action of provided location dictionary
    policy = np.array([action['probability'] for action in new_location['actions']])  
    print(f'init policy: {policy}')
    
    # Add a bias for repeating previous action to walk in straight lines, only if
    # (this is not the first step) and (the previous action was a move)
    policy[
        [] if len(walk) == 0 or new_location['id'] == walk[-1][0]['id'] else walk[-1][2]
    ] *= repeat_bias_factor
    if walk:  # Debug output only makes sense once there is a previous step.
        print(f"{new_location['id']=} {walk[-1][0]['id']=}")
        print(f"{walk[-1][2]=}")
    print(f'policy after bias: {policy}')
    
    # And renormalise policy (note that for unavailable actions, the policy was 0 and remains 0,
    # so in that case no renormalisation needed)
    policy = policy / sum(policy) if sum(policy) > 0 else policy
    print(f'normalized policy after bias: {policy}')
    
    # Select action in new state.
    _some = np.random.rand()
    new_action = int(np.flatnonzero(np.cumsum(policy)>_some)[0])
    print(f'rand number chosen: {_some}')
    # Return the new action
    return new_action

In [27]:
new_action = get_action(env, new_location, walk)
new_action

init policy: [0.2 0.2 0.2 0.2 0.2]
new_location['id']=16 walk[-1][0]['id']=17
walk[-1][2]=4
policy after bias: [0.2 0.2 0.2 0.2 0.4]
normalized policy after bias: [0.16666667 0.16666667 0.16666667 0.16666667 0.33333333]
rand number chosen: 0.6593621991698245


3
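
The numbers in this output can be reproduced by hand. The sketch below redoes the arithmetic with exact fractions, using only the values printed above (uniform policy, previous action 4, bias factor 2, and the random number shown).

```python
from fractions import Fraction
from itertools import accumulate

policy = [Fraction(1, 5)] * 5
policy[4] *= 2                                   # 1/5 becomes 2/5
total = sum(policy)                              # 6/5
policy = [p / total for p in policy]             # 1/6, 1/6, 1/6, 1/6, 1/3
cdf = list(accumulate(policy))                   # 1/6, 1/3, 1/2, 2/3, 1

draw = 0.6593621991698245                        # the random number printed above
new_action = next(i for i, c in enumerate(cdf) if c > draw)

print([str(p) for p in policy])
print([str(c) for c in cdf])
print(new_action)
```

The draw falls between 1/2 and 2/3, so the first cumulative value that exceeds it sits at index 3, which matches the action returned above.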

In [28]:
# Append location, observation, and action to the walk.
# new_location is the actual location from the env file.
# new_observation is the one-hot-vector for the given observation.
# new_action is the first action ID whose cumulative probability exceeds a random number.
walk.append([new_location, new_observation, new_action])

In [29]:
len(walk)

2

In [30]:
walk

[[{'id': 17,
   'observation': 40,
   'x': 0.5,
   'y': 0.7,
   'in_locations': [12, 16, 17, 18, 22],
   'in_degree': 5,
   'out_locations': [12, 16, 17, 18, 22],
   'out_degree': 5,
   'actions': [{'id': 0,
     'transition': [0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      1,
      0,
      0,
      0,
      0,
      0,
      0,
      0],
     'probability': 0.2},
    {'id': 1,
     'transition': [0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      1,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0],
     'probability': 0.2},
    {'id': 2,
     'transition': [0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      1,
      0,
      0,
      0,
      0,
      0,
  

In [31]:
_new_walk = walk[-1]

In [32]:
_new_walk

[{'id': 16,
  'observation': 16,
  'x': 0.3,
  'y': 0.7,
  'in_locations': [11, 15, 16, 17, 21],
  'in_degree': 5,
  'out_locations': [11, 15, 16, 17, 21],
  'out_degree': 5,
  'actions': [{'id': 0,
    'transition': [0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     1,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0],
    'probability': 0.2},
   {'id': 1,
    'transition': [0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     1,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0],
    'probability': 0.2},
   {'id': 2,
    'transition': [0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     1,
     0,
     0,
     0,
     0,
     0,
     0,
     0],
    'probability': 0.2},
   {'id': 3,
    'transition': [0,
     0,
     0,
    

In [36]:
for step in _new_walk[:-1]:
    print(step[0])
    print()

KeyError: 0

In [34]:
_new_walk[:-1]

[{'id': 16,
  'observation': 16,
  'x': 0.3,
  'y': 0.7,
  'in_locations': [11, 15, 16, 17, 21],
  'in_degree': 5,
  'out_locations': [11, 15, 16, 17, 21],
  'out_degree': 5,
  'actions': [{'id': 0,
    'transition': [0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     1,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0],
    'probability': 0.2},
   {'id': 1,
    'transition': [0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     1,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0],
    'probability': 0.2},
   {'id': 2,
    'transition': [0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     1,
     0,
     0,
     0,
     0,
     0,
     0,
     0],
    'probability': 0.2},
   {'id': 3,
    'transition': [0,
     0,
     0,
    