# Shielding with Deep-ProbLog


### Aims of this notebook
- Implement shielding by inserting differentiable constraints with DeepProbLog
- Answer the question: can we implement shielding using DeepProbLog?

### Shielding

#### What is shielding?

An agent aims to learn the optimal (stochastic) policy $\pi: S \rightarrow dist(A)$.
A teacher monitors the learning agent and prevents it from taking dangerous actions.
The teacher knows the safety requirements in the form of constraints.

The teacher is a function $$F:C \times S \times \pi(S) \rightarrow dist(A)$$

We call the input policy $\pi(S)$ the **base policy**, and the output of $F$ the **shielded policy**.


The goal of shielding is to help the agent to learn the optimal policy safely (e.g. the agent dies fewer times). 


We call $F$ a **hard shielding** method if
- $F$ sets the probability of every dangerous action to 0
- $F$ therefore provides absolute safety: the agent can only take safe actions.

We call $F$ a **soft shielding** method if
- $F$ lowers the probability of a dangerous action (but not to 0)
- $F$ raises the probability of taking a safe action

Hard shielding is stricter than soft shielding. 

We start from hard shielding. 

<!-- We can make shielding soft by the following methods:
- Applying hard shielding only at intersections (domain-specific). This makes the entire episode "soft" but strictly speaking this is still hard shielding. 
- Stochastic transition function
- Partially observable environment (like we do, we do it via object detection) -->


### Implement (hard) shielding using deep problog

#### 

- Baseline: in an RL setting, explore the environment and learn the optimal policy from the traces.
- DPL: in an RL setting, given a constraint, explore the environment under that constraint and learn an optimal policy.



#### Advantages of our method
- not domain-specific
- can be applied to any policy-based reinforcement learning method (it acts on the policy network)
- [?] can potentially learn object detection (no pretraining needed)


#### Questions
 1. Does DPL behave safely?
 2. Does DPL learn the optimal policy (if the constraint is in line with the reward)?
 3. Does DPL learn object detection?
 4. Does the base policy in DPL behave safely if, after training, we take away the shield?
 5. Can we scale? What factors do we consider to scale? 
 
 

#### Environment
- One pacman, one ghost and one food on a grid world
- The task for the pacman is to eat the only food on the map
- The ghost does not move
- The agent either finishes the task, or crashes into a ghost

#### ProbLog model
We use the basic policy gradient algorithm as the baseline network.
Its output is a probability distribution over actions, which enters the program as an annotated disjunction:

```prolog
action(0)::action(stay); 
action(1)::action(up); 
action(2)::action(down); 
action(3)::action(left); 
action(4)::action(right).
```

We add two networks to predict pacman and ghost locations (object detection).
We can choose not to learn the locations by replacing these two networks with ground truth locations. 

```prolog
ghost(0)::ghost(0,1).
ghost(1)::ghost(1,1).
ghost(2)::ghost(0,0).
ghost(3)::ghost(1,0).

pacman(0)::pacman(0,1).
pacman(1)::pacman(1,1).
pacman(2)::pacman(0,0).
pacman(3)::pacman(1,0).
```

Then we define the transition function.

```prolog
transition(X,Y,stay,X,Y).
transition(X,Y,left,X1,Y) :- X1 is X - 1.
transition(X,Y,right,X1,Y) :- X1 is X + 1.
transition(X,Y,up,X,Y1) :- Y1 is Y + 1.
transition(X,Y,down,X,Y1) :- Y1 is Y - 1.

transition_with_wall(X,Y,A,X,Y) :- transition(X,Y,A,X1,Y1),wall(X1,Y1).
transition_with_wall(X,Y,A,X1,Y1) :-transition(X,Y,A,X1,Y1),\+ wall(X1,Y1).
```

Then we define the safety requirement

```prolog
unsafe :- pacman(X,Y), action(A), transition_with_wall(X,Y,A,X1,Y1), ghost(X1,Y1).
safe :- \+unsafe.
```

Then we compute the conditional probability $P(a|safe)$ by adding `evidence(safe)`:

```prolog
evidence(safe).
query(action(_)).
```
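
To make the conditioning concrete, here is a minimal Python sketch of what the program above computes when the pacman and ghost locations are given as ground truth (no detection networks). It mirrors `transition_with_wall`, marks each action as safe or unsafe, and renormalizes the base policy over the safe actions, which is $P(a|safe)$. The function names and the example numbers are illustrative assumptions, not the notebook's actual implementation; the wall coordinates are the ones listed for the 2x2 grid.

```python
# Sketch only: names and example values are assumptions, not the notebook's code.
ACTIONS = ["stay", "up", "down", "left", "right"]
MOVES = {"stay": (0, 0), "up": (0, 1), "down": (0, -1),
         "left": (-1, 0), "right": (1, 0)}
# Walls surrounding a 2x2 grid with cells (0,0), (1,0), (0,1), (1,1)
WALLS = {(-1, 0), (-1, 1), (0, -1), (0, 2), (1, -1), (1, 2), (2, 0), (2, 1)}


def transition_with_wall(pos, action):
    """Next position; moving into a wall leaves the position unchanged."""
    dx, dy = MOVES[action]
    nxt = (pos[0] + dx, pos[1] + dy)
    return pos if nxt in WALLS else nxt


def hard_shield(base_policy, pacman, ghost):
    """Return P(a | safe) for known pacman and ghost positions."""
    # P(a & safe): keep the base probability only if the action avoids the ghost
    joint = [p if transition_with_wall(pacman, a) != ghost else 0.0
             for a, p in zip(ACTIONS, base_policy)]
    p_safe = sum(joint)
    if p_safe == 0.0:
        raise ValueError("no safe action has non-zero probability")
    return [p / p_safe for p in joint]


base = [0.1, 0.4, 0.1, 0.1, 0.3]
shielded = hard_shield(base, pacman=(0, 0), ghost=(0, 1))
print(dict(zip(ACTIONS, shielded)))
```

In this example the ghost sits directly above the pacman, so `up` receives probability 0 and the remaining mass is redistributed proportionally over the other actions.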


####  We compare these three settings 
- Policy gradient (baseline)
- Policy gradient + hard shielding 
- Policy gradient + shielding + object detection 

### Getting started

In [1]:
import altair as alt
import os
import pandas as pd

Making a chart from dataframes

In [2]:
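def make_chart2(data_series, title, keys, window_speed=1000):
    """Plot several logged runs in one interactive Altair chart.

    data_series: list of DataFrames with columns 'feature', 'n_steps', 'value'.
    keys: one display name per DataFrame (stored in the 'setting' column).
    Note: selection_multi/add_selection are Altair 4 names; Altair 5
    deprecates them in favour of selection_point/add_params.
    """
    assert len(data_series) == len(keys)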
    combined_data_series = pd.concat(data_series, keys=keys, names=['setting'])
    combined_data_series = combined_data_series.reset_index(0)
    combined_data_series['series'] = combined_data_series['setting'] + " " + combined_data_series['feature']

    # Data is prepared, now make a chart
    selection_exp = alt.selection_multi(fields=['setting', 'feature'], empty='none')
    color_exp = alt.condition(selection_exp,
                      alt.Color('series:N', legend=None, scale=alt.Scale(scheme='dark2')),
                      alt.value('lightgray'))

    timeseries = alt.Chart(combined_data_series).properties(
        title=title,
        width=500,
        height=250
    ).mark_line(
        opacity=0.5
    ).encode(
        x=alt.X('n_steps:Q',
                axis=alt.Axis(title=f'Steps (x{window_speed})', tickMinStep=1)),
        y=alt.Y('value:Q',
                sort="ascending",
                axis=alt.Axis(title='value', tickMinStep=0.1),
                ),
        color=color_exp
    ).add_selection(
        selection_exp
    ).interactive()

    legend = alt.Chart(combined_data_series).mark_rect().encode(
        x=alt.X('setting:N', axis=alt.Axis(orient='bottom')),
        y='feature',
        color=color_exp
    ).add_selection(
        selection_exp
    )

    chart = timeseries | legend

    return chart

### Env 1

In [3]:
folderpath = "data/env4/"
env_file = folderpath + "env_spec.txt"  # folderpath already ends with "/"
with open(env_file, "r") as f: print(f.read())

Layout:           grid2x2
Learning rate:    0.001
Reward goal:      10
Reward crash:     -10
Reward food:      0
Reward time:      -1

window_size = 2000
window_speed = 1000
AD constraints:
    - exactly-one action

problog program:
    """
    wall(-1, 0).
    wall(-1, 1).
    wall(0, -1).
    wall(0, 2).
    wall(1, -1).
    wall(1, 2).
    wall(2, 0).
    wall(2, 1).


    action(0)::action(stay);
    action(1)::action(up);
    action(2)::action(down);
    action(3)::action(left);
    action(4)::action(right).


    ghost(0)::ghost(0,1).
    ghost(1)::ghost(1,1).
    ghost(2)::ghost(0,0).
    ghost(3)::ghost(1,0).

    pacman(0)::pacman(0,1).
    pacman(1)::pacman(1,1).
    pacman(2)::pacman(0,0).
    pacman(3)::pacman(1,0).

    food(0)::food(0,1).
    food(1)::food(1,1).
    food(2)::food(0,0).
    food(3)::food(1,0).


    % Transition
    transition(X,Y,stay,X,Y).
    transition(X,Y,left,X1,Y) :- X1 is X - 1.
    transition(X,Y,right,X1,Y) :- X1 is X + 1.
    transition(X,Y,up,X

In [4]:
file_pg            = "pg_grid2x2_raw"
file_pg_dpl        = "pg_dpl_nodetect_grid2x2_raw"
file_pg_dpl_detect = "pg_dpl_detect_grid2x2_raw"

logger_files = [file_pg,file_pg_dpl,file_pg_dpl_detect]

# load dataframes
data_frames = []
for logger_file in logger_files:
    pkl_path = f"{folderpath}{logger_file}.pkl"
    d = pd.read_pickle(pkl_path)
    data_frames.append(d)
    
chart = make_chart2(data_frames, title='Features over training time',
                        keys=['PG', 'PG+DPL', 'PG+DPL detection'], window_speed=1000)
chart

### Shift-click on the legend to select multiple data series
- **length**: average episode length. It indicates how long the agent explores before crashing or reaching the goal.
- **reward**: average accumulated reward per episode. The higher, the better.
- **safety**: average reward of the last step of an episode. It indicates how often the agent finishes the task rather than crashing. The higher, the better.

A negative safety value means the agent crashes (reward = -10) more often than it finishes the task.
A high safety value means the agent finishes the task (reward = 10) more often.
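
As a rough illustration of how the three features relate to raw episode data, the sketch below computes them from a list of episodes, where each episode is a list of per-step rewards. The helper name `episode_features` and the example rewards are hypothetical. The env spec lists `window_size` and `window_speed`, which suggests that the logger also averages over a sliding window; that part is omitted here.

```python
def episode_features(episodes):
    """episodes: list of per-step reward lists, one list per episode."""
    n = len(episodes)
    return {
        "length": sum(len(ep) for ep in episodes) / n,   # steps per episode
        "reward": sum(sum(ep) for ep in episodes) / n,   # accumulated reward
        "safety": sum(ep[-1] for ep in episodes) / n,    # last-step reward
    }


# Reward time = -1 per step, goal = +10, crash = -10 (values from the env spec).
# Assumption: the time penalty also applies on the final step.
episodes = [
    [-1, -1, 9],   # reaches the food on the third step
    [-1, -11],     # crashes into the ghost on the second step
]
features = episode_features(episodes)
print(features)
```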


### 1. Does DPL behave safely?
- "PG+DPL" performs hard shielding. The agent cannot crash; it can only finish the task eventually, so the safety score is constant and equal to the goal reward.
- "PG+DPL detection" performs soft shielding. The figure shows it is safer than pure PG (baseline).

### 2. Does DPL learn the optimal policy (if the constraint is in line with the reward)? 
- Yes, the reward feature shows that all methods converge to the best reward.
- "PG+DPL" converges faster than the other two methods.

### Now we look at object detection

In [5]:
folderpath = "data/env5/"
# load dataframes
data_frames = []
for logger_file in [file_pg_dpl, file_pg_dpl_detect]:
    pkl_path = f"{folderpath}{logger_file}_prob_err.pkl"
    d = pd.read_pickle(pkl_path)
    data_frames.append(d)
    
chart = make_chart2(data_frames, title='Features over training time',
                        keys=['PG+DPL', 'PG+DPL detection'], window_speed=1000)
chart

### 3. Does DPL learn object detection?
**ghost detection error**: difference between the learned ghost location and the ground truth. The lower the value is, the more accurate the detection is.

**pacman detection error**: difference between the learned pacman location and the ground truth. The lower the value is, the more accurate the detection is.

"PG+DPL" is given the ground truth values of the ghost and the pacman locations, so its error values are zero in the figure.

"PG+DPL detection" does not learn to detect the ghost and the pacman locations, because the detection error does not converge to zero. 

What does "PG+DPL detection" learn to detect?

### Why does DPL not learn object detection?
- It learns some concept such that the combination of the network and the shield avoids bad actions, but not the concept we expect it to learn. 

- The learning signal is too weak to learn object detection from.

**Real reason**
The last layer is not right: it should be a sigmoid.


### 4. Does the base policy in DPL behave safely if, after training, we take away the shield?

**base policy safety diff**: difference between the shielded policy and the base policy. It indicates how much the shield changes the base policy, and therefore how safe the base policy is on its own. The lower the value is, the safer the base policy is.

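We have $n$ actions, so every policy is a vector of $n$ values.

Base policy:  $P_b = (p_b^1, \dots, p_b^n)$

Shielded policy:  $P_s = (p_s^1, \dots, p_s^n)$

Safety diff: $E(P_b, P_s) = \sum_{i=1}^{n} |p_s^i - p_b^i|$. The minimum value of $E$ is 0 (the shield changes nothing). Since the entries of each vector are non-negative and sum to at most 1, $E$ is at most 2, independent of $n$.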
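
A small sketch of this measure, written as the sum of absolute differences between the two probability vectors. The function name and the example policies are illustrative assumptions.

```python
def safety_diff(base_policy, shielded_policy):
    """Sum of absolute differences between base and shielded action probabilities."""
    if len(base_policy) != len(shielded_policy):
        raise ValueError("policies must have the same number of actions")
    return sum(abs(s - b) for b, s in zip(base_policy, shielded_policy))


base = [0.1, 0.4, 0.1, 0.1, 0.3]
unchanged = safety_diff(base, base)                        # shield changes nothing
changed = safety_diff(base, [1/6, 0.0, 1/6, 1/6, 0.5])     # shield removes one action
print(unchanged, changed)
```

A value near 0 means the base policy already avoids the actions that the shield would remove.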

"PG+DPL": the base policy and the shielded policy converge. Hence, the shield can be removed after training.
"PG+DPL detection": the base policy and the shielded policy **do not converge**. This is a result from not being able to learn object detection.



### 5. Can we scale? What factors do we consider to scale? 
For details see https://docs.google.com/document/d/1M3LekU7oMiwriMir0xlj3hmUnS2-ph9TUyU2T16u4CU/edit?usp=sharing
We can improve the runtime by relaxing the annotated disjunctions (ADs).

We compare 4 settings:
0. pure PG (baseline)
1. PG+DPL detect, exactly-one pacman and exactly-one ghost: the most strictly constrained and the slowest.
2. PG+DPL detect, exactly-one pacman
3. PG+DPL detect, no constraints: the fastest.

Runtime
1. ~ 31 mins
2. ~ 23 mins
3. ~ 19 mins

In [6]:
# load dataframes
logger_files = [
    'data/env5/pg_grid2x2_raw.pkl',
    'data/env2/pg_dpl_detect_grid2x2_raw.pkl',
    'data/env5/pg_dpl_detect_grid2x2_raw.pkl',
    'data/env4/pg_dpl_detect_grid2x2_raw.pkl'
]


data_frames = []
for logger_file in logger_files:
    d = pd.read_pickle(logger_file)
    data_frames.append(d)

chart = make_chart2(data_frames, title='Features over training time',
                    keys=['PG',
                          '1 PG+DPL',
                          '2 PG+DPL',
                          '3 PG+DPL'], window_speed=1000)
chart

Let's look at **reward**. The stricter the constraints are, the higher the reward is. But settings 1, 2 and 3 are all better than pure PG.
I think it makes sense to go for setting 2 because we know there is one pacman and one action.



### We change the setting to use relative locations. 
The setting is as follows. We now have a "local shield".

In [7]:
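def load_dfs(logger_files):
    """Load '<name>.pkl' for every run, plus '<name>_prob_err.pkl' for DPL runs.

    Paths are resolved against the global `folderpath`.
    """
    dfs_features = []
    dfs_prob_err = []
    for logger_file in logger_files:
        # Only runs with a shield or detection network have a *_prob_err log
        dpl = "dpl" in logger_file or "detect" in logger_file or "shield" in logger_file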
        pkl_path = folderpath + f"{logger_file}.pkl"
        pkl_err_path = folderpath + f"{logger_file}_prob_err.pkl"
        df_features = pd.read_pickle(pkl_path)
        dfs_features.append(df_features)
        if dpl:
            df_policy_err = pd.read_pickle(pkl_err_path)
            dfs_prob_err.append(df_policy_err)
    return dfs_features, dfs_prob_err

In [8]:
folderpath = "data/relative_location/"
env_file = folderpath + "env_spec.txt"
with open(env_file, "r") as f: print(f.read())

Layout:           grid2x2
Learning rate:    0.001
Reward goal:      10
Reward crash:     -10
Reward food:      0
Reward time:      -1

window_size = 2000
window_speed = 1000
AD constraints:
    - exactly-one action

problog program:
    """
    action(0)::action(stay);
    action(1)::action(up);
    action(2)::action(down);
    action(3)::action(left);
    action(4)::action(right).

    ghost(0)::ghost(up).
    ghost(1)::ghost(down).
    ghost(2)::ghost(left).
    ghost(3)::ghost(right).

    wall(0)::wall(up).
    wall(1)::wall(down).
    wall(2)::wall(left).
    wall(3)::wall(right).


    % transition(Action, NextPos)
    transition(stay,here).
    transition(left,left).
    transition(right,right).
    transition(up,up).
    transition(down,down).


    transition_with_wall(A,here) :-
        transition(A,NextPos), wall(NextPos).
    transition_with_wall(A,NextPos) :-
        transition(A,NextPos), \+ wall(NextPos).


    unsafe_next :- action(A), transition_with_wall(A,NextPos), g

In [9]:
file_pg = "pg_grid2x2_raw"
file_pg_shield = "pg_shield_grid2x2_raw"
file_pg_detect_shield_wall_ghost = "pg_detect_shield_grid2x2_raw_wall_ghost"
file_pg_detect_shield_ghost = "pg_detect_shield_grid2x2_raw_ghost"
file_pg_detect_shield_wall = "pg_detect_shield_grid2x2_raw_wall"
logger_files = [
    file_pg,
    file_pg_shield,
    file_pg_detect_shield_wall_ghost,
    file_pg_detect_shield_ghost,
    file_pg_detect_shield_wall
]
    
dfs_features, dfs_prob_err = load_dfs(logger_files)

In [10]:
feature_chart = make_chart2(dfs_features, title='Features over training time',
                                keys=[
                                      'PG',
                                      'PG shield',
                                      'PG shield detect wall ghost',
                                      'PG shield detect ghost',
                                      'PG shield detect wall',
                                ], window_speed=1000)
feature_chart

We expect the performance to rank as follows (best first):
- hard shielding, i.e. "PG shield"
- detect one location, i.e. "PG shield detect ghost", "PG shield detect wall"
- detect two locations, i.e. "PG shield detect wall ghost"
- pure PG

But that is not the case: "PG shield detect wall ghost" performs even better than hard shielding.

### We check if the network learns to detect objects.

In [11]:
prob_err_chart = make_chart2(dfs_prob_err, title='DPL features',
                                   keys=[
                                        'PG shield',
                                        'PG shield detect wall ghost',
                                        'PG shield detect ghost',
                                        'PG shield detect wall',
                                   ], window_speed=1000)
prob_err_chart

If an "error" or "diff" curve changes, something is being learned; otherwise nothing is learned. If "base policy safety diff" converges to zero for a setting, the learning is done by the base policy and the shield is not learning.

- "PG shield": no need to learn ghost and wall. The base policy learns safety.
- "PG shield detect wall": does not learn to detect walls. But the wall location is not safety-critical. Since the ghost location is given, the performance (see previous chart) does not drop much compared to hard shielding.
- "PG shield detect ghost": learns pacman is usually zero. The base policy learns safety.
- "PG shield detect wall ghost": the most interesting case. It does not learn to detect walls. It learns something about the ghost location, but not the location we want it to learn. Instead of learning "there is a ghost", it learns "there is NOT a ghost". It makes sense to be cautious. It learns this because we allow the network to learn it. For example, this is one of the outputs:

```
---------  Step 39997  ---------------
Shielded probs: tensor([[1.2737e-06,1.8081e-05,2.1972e-02,1.8601e-07,6.3787e-14]])
Base probs:     tensor([[1.2737e-06,1.6713e-01,8.3065e-01,2.2024e-03,1.1373e-05]])
Ghost probs:    tensor([[1.0000,0.9735,0.9999,1.0000]]) (up, down, left, right)
Ghost truth:    tensor([[0.,0.,0.,1.]])
Wall probs:     tensor([[1.0592e-04,3.6507e-08,6.6150e-06,5.6085e-09]])
Wall truth:     tensor([[1.,0.,1.,0.]])
Safe next:      tensor([[0.0220]])
Reward:         9.0
Done:           True
```
We can see that the ghost probabilities are all close to 1, meaning DPL tries to be cautious.
ghost(down) = 0.9735 is the lowest value, i.e. a ghost is judged less likely below the pacman than in any other direction, so going down is considered safer than the other actions.
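
The logged numbers can be reproduced by hand. The printed program is cut off in the middle of the `unsafe_next` rule, so the sketch below rests on an assumption: that the rule continues as `ghost(NextPos)`, i.e. an action is safe if it bumps into a wall (the agent stays put) or if the target cell holds no ghost, and staying is always safe. Under that assumption, and treating the ghost and wall facts as independent, the shielded probability of moving in direction $d$ is $P(a_d) \cdot (w_d + (1 - w_d)(1 - g_d))$. The function name is hypothetical; the numbers are copied from the log of step 39997.

```python
def shielded_probs(base, ghost, wall):
    """P(action & safe_next) per action, in the order stay, up, down, left, right.

    Assumed semantics: staying is always safe; moving in direction d is safe
    if there is a wall at d (the agent stays put) or there is no ghost at d.
    """
    out = [base[0]]  # stay
    for p_a, g, w in zip(base[1:], ghost, wall):
        out.append(p_a * (w + (1.0 - w) * (1.0 - g)))
    return out


# Values copied from the log of step 39997
base = [1.2737e-06, 1.6713e-01, 8.3065e-01, 2.2024e-03, 1.1373e-05]
ghost = [1.0000, 0.9735, 0.9999, 1.0000]   # up, down, left, right
wall = [1.0592e-04, 3.6507e-08, 6.6150e-06, 5.6085e-09]

probs = shielded_probs(base, ghost, wall)
print("shielded: ", probs)
print("safe next:", sum(probs))
```

Up to the rounding of the printed ghost probabilities, this matches the logged shielded probability of about 0.022 for `down` and the logged `Safe next` value.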

#### What if we change the program encoding?
If we want to force the shield to learn "there is a ghost" instead of "there is not a ghost", we can make use of the information that there is at most one ghost. 
The encoding goes from 
```prolog
    ghost(0)::ghost(up).
    ghost(1)::ghost(down).
    ghost(2)::ghost(left).
    ghost(3)::ghost(right).
```
to 
```prolog
    ghost(0)::ghost(no);
    ghost(1)::ghost(up);
    ghost(2)::ghost(down);
    ghost(3)::ghost(left);
    ghost(4)::ghost(right).
```

And we also need a softmax output layer for the new encoding (instead of sigmoid).
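
The difference between the two output layers can be illustrated without any deep learning library. With a sigmoid per direction, the four ghost probabilities are independent and can all be close to 1 at once; with a softmax over the five outcomes `no, up, down, left, right`, they must sum to 1. The logits below are made up for illustration.

```python
import math


def sigmoid(logits):
    """Independent probability per output (old encoding: four separate facts)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """One distribution over all outputs (new encoding: a single AD)."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


old = sigmoid([6.0, 3.5, 6.0, 6.0])           # up, down, left, right
new = softmax([0.0, 6.0, 3.5, 6.0, 6.0])      # no, up, down, left, right
print("sigmoid:", old, "sum =", sum(old))
print("softmax:", new, "sum =", sum(new))
```

With the softmax, the network can no longer claim "a ghost is probably everywhere": raising one outcome necessarily lowers the others.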

In [12]:
folderpath = "data/no_ghost_encoding/"
env_file = folderpath + "env_spec.txt"
with open(env_file, "r") as f: print(f.read())

Layout:           grid2x2
Learning rate:    0.001
Reward goal:      10
Reward crash:     -10
Reward food:      0
Reward time:      -1

window_size = 2000
window_speed = 1000
AD constraints:
    - exactly-one action

OLD problog program:
    """
    action(0)::action(stay);
    action(1)::action(up);
    action(2)::action(down);
    action(3)::action(left);
    action(4)::action(right).

    ghost(0)::ghost(up).
    ghost(1)::ghost(down).
    ghost(2)::ghost(left).
    ghost(3)::ghost(right).

    wall(0)::wall(up).
    wall(1)::wall(down).
    wall(2)::wall(left).
    wall(3)::wall(right).


    % transition(Action, NextPos)
    transition(stay,here).
    transition(left,left).
    transition(right,right).
    transition(up,up).
    transition(down,down).


    transition_with_wall(A,here) :-
        transition(A,NextPos), wall(NextPos).
    transition_with_wall(A,NextPos) :-
        transition(A,NextPos), \+ wall(NextPos).


    unsafe_next :- action(A), transition_with_wall(A,NextPos

In [13]:
file_pg = "pg_grid2x2_raw"
file_pg_detect_shield_wall_ghost_old = "pg_detect_shield_grid2x2_raw_old"
file_pg_detect_shield_wall_ghost_new = "pg_detect_shield_grid2x2_raw_new"
logger_files = [
    file_pg,
    file_pg_detect_shield_wall_ghost_old,
    file_pg_detect_shield_wall_ghost_new
]
    
dfs_features, dfs_prob_err = load_dfs(logger_files)

In [14]:
feature_chart = make_chart2(dfs_features, title='Features over training time',
                                keys=[
                                      'PG',
                                      'PG shield detect wall ghost old',
                                    'PG shield detect wall ghost new',
                                ], window_speed=1000)
feature_chart

We expect that this constraint pushes DPL to learn the concept "there is a ghost" instead of "there is not a ghost". We also expect this concept to be harder for DPL to learn; otherwise it would have learned it in the first place rather than "there is no ghost".
The chart above shows that the agent indeed learns, and that it learns more slowly. This is as expected.
Now we look at what concept the new encoding learns.

In [15]:
prob_err_chart = make_chart2(dfs_prob_err, title='DPL features',
                                   keys=[
                                        'PG shield detect wall ghost old',
                                        'PG shield detect wall ghost new',
                                   ], window_speed=1000)
prob_err_chart

The most important feature is "ghost detection error". The old encoding learns something about it, but the new encoding (because of the constraint) does not really learn anything, as its curve is constant.

It seems that "there is no ghost" really is the simpler concept for DPL to learn, so we revert the setting by removing the constraint "there is at most one ghost".

#### What if there is no ghost?

We showed that if there is a ghost, DPL is cautious and learns the concept "there is no ghost". But what if there is no ghost in the environment?

Will DPL still be cautious? Or can it learn that the probability of a ghost is actually 0?
We compare with pure PG.

In [16]:
folderpath = "data/no_ghost_in_env/"
env_file = folderpath + "env_spec.txt"
with open(env_file, "r") as f: 
    info = f.readlines()[:13]
    print("".join(info))

Layout:           grid2x2
Learning rate:    0.001
Reward goal:      10
Reward crash:     -10
Reward food:      0
Reward time:      -1

window_size = 2000
window_speed = 1000
AD constraints:
    - exactly-one action

There is NO GHOST in the environment.



In [17]:
file_pg = "pg_grid2x2_raw"
file_pg_detect_shield =  "pg_detect_shield_grid2x2_raw"
logger_files = [
    file_pg,
    file_pg_detect_shield
]
    
dfs_features, dfs_prob_err = load_dfs(logger_files)

In [18]:
feature_chart = make_chart2(dfs_features, title='Features over training time',
                                keys=[
                                      'PG',
                                      'PG shield detect',
                                ], window_speed=1000)
feature_chart

We see that in an environment without ghosts, shielding is still better than pure PG. Why is shielding still better than pure PG in a safe environment? This indicates that DPL learns concepts that are not only related to safety but also improve learning!
Let's see what it learns.

In [19]:
prob_err_chart = make_chart2(dfs_prob_err, title='DPL features',
                                   keys=[
                                      'PG shield detect',
                                   ], window_speed=1000)
prob_err_chart

We see that DPL is learning something related to the ghost location because the curve is not constant. We also see that DPL does not learn the true ghost locations (here: no ghost anywhere) because the error does not converge to 0. Let's look at some raw output.
```
---------  Step 39999  ---------------
Shielded probs: tensor([[2.8037e-05,4.9716e-06,8.2543e-06,2.6479e-01,5.4969e-06]])
Base probs:     tensor([[2.8037e-05,7.3166e-03,1.9698e-01,7.9146e-01,4.2177e-03]])
Ghost probs:    tensor([[1.0000,1.0000,0.6654,1.0000]])
Ghost truth:    tensor([[0.,0.,0.,0.]])
Wall probs:     tensor([[6.7950e-04,6.1948e-08,9.1109e-09,1.3025e-03]])
Wall truth:     tensor([[1.,0.,0.,1.]])
Safe next:      tensor([[0.2648]])
Reward:         -1.0
Done:           False
```
DPL learns that "going left is best because there is no ghost", but the truth is "going left is good because the food is on the left" (DPL does not see the food, so it explains the reward in its own way).

DPL is still cautious and imagines that there are ghosts. This is because it learns a concept that is related not to safety but to optimality.

The message here is that shielding with differentiable constraints helps learning even in a safe environment.

#### What if we add evidence?
Another question is whether or not to put

```prolog
evidence(safe_next).
```
in the program.
Without the evidence, we compute
P(action&safe), P(ghost), P(wall)

With the evidence, we compute
P(action|safe), P(ghost|safe), P(wall|safe)

This difference does not affect the shield, because P(action|safe) is just P(action&safe) normalized by P(safe). However, it does affect object detection.

What is the difference between P(ghost) and P(ghost|safe)?
- P(ghost): the probability of the ghost locations
- P(ghost|safe): the probability of the ghost locations, given that the next state is safe

For ghost detection, we want to look ahead and judge whether the next state is actually safe, which requires P(ghost). We should not use P(ghost|safe) because it answers the wrong question: "Where should the ghosts be if the next state has to be safe?" The ghost location is not under our control.

So the conclusion is that we should NOT put evidence(safe_next) in the program.
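
The effect of the evidence on ghost detection can be checked by brute-force enumeration on a toy version of the program. The sketch below ignores walls, treats the four ghost facts as independent, and assumes an action is unsafe exactly when it moves onto a ghost. All names and numbers are illustrative assumptions.

```python
from itertools import product

DIRECTIONS = ["up", "down", "left", "right"]


def ghost_posteriors(policy, ghost):
    """Return (P(safe), [P(ghost_d | safe) for each direction d]).

    policy: P(stay), P(up), P(down), P(left), P(right), summing to 1.
    ghost:  independent prior P(ghost_d) for up, down, left, right.
    """
    p_safe = 0.0
    joint = [0.0] * 4                       # P(ghost_d & safe)
    for world in product([False, True], repeat=4):
        p_world = 1.0
        for present, g in zip(world, ghost):
            p_world *= g if present else 1.0 - g
        # Staying is always safe; moving in direction d is safe if no ghost at d
        p_safe_given_world = policy[0] + sum(
            p for p, present in zip(policy[1:], world) if not present)
        p_safe += p_world * p_safe_given_world
        for d, present in enumerate(world):
            if present:
                joint[d] += p_world * p_safe_given_world
    return p_safe, [j / p_safe for j in joint]


policy = [0.0, 0.0, 0.0, 1.0, 0.0]          # the agent always goes left
prior = [0.5, 0.5, 0.5, 0.5]
p_safe, posterior = ghost_posteriors(policy, prior)
print("P(safe)         =", p_safe)
print("P(ghost)        =", dict(zip(DIRECTIONS, prior)))
print("P(ghost | safe) =", dict(zip(DIRECTIONS, posterior)))
```

Conditioning on safety drives the probability of a ghost in the chosen direction to 0, regardless of where the ghost really is, while the other directions keep their prior.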

We show the difference with and without evidence.


In [20]:
folderpath = "data/evidence_or_not/"
env_file = folderpath + "env_spec.txt"
with open(env_file, "r") as f: print(f.read())

Layout:           grid2x2
Learning rate:    0.001
Reward goal:      10
Reward crash:     -10
Reward food:      0
Reward time:      -1

window_size = 2000
window_speed = 1000
AD constraints:
    - exactly-one action

WITHOUT EVIDENCE problog program:
    """
    action(0)::action(stay);
    action(1)::action(up);
    action(2)::action(down);
    action(3)::action(left);
    action(4)::action(right).

    ghost(0)::ghost(up).
    ghost(1)::ghost(down).
    ghost(2)::ghost(left).
    ghost(3)::ghost(right).

    wall(0)::wall(up).
    wall(1)::wall(down).
    wall(2)::wall(left).
    wall(3)::wall(right).


    % transition(Action, NextPos)
    transition(stay,here).
    transition(left,left).
    transition(right,right).
    transition(up,up).
    transition(down,down).


    transition_with_wall(A,here) :-
        transition(A,NextPos), wall(NextPos).
    transition_with_wall(A,NextPos) :-
        transition(A,NextPos), \+ wall(NextPos).


    unsafe_next :- action(A), transition_with_w

In [21]:
file_pg = "pg_grid2x2_raw"
file_pg_detect_shield_without_ev = "pg_detect_shield_grid2x2_raw_without_ev"
file_pg_detect_shield_with_ev = "pg_detect_shield_grid2x2_raw_with_ev"
logger_files = [
    file_pg,
    file_pg_detect_shield_without_ev,
    file_pg_detect_shield_with_ev
]
dfs_features, dfs_prob_err = load_dfs(logger_files)

In [22]:
feature_chart = make_chart2(dfs_features, title='Features over training time',
                                keys=[
                                      'PG',
                                      'PG shield detect wall ghost without ev',
                                      'PG shield detect wall ghost with ev',
                                ], window_speed=1000)
feature_chart

In [23]:
prob_err_chart = make_chart2(dfs_prob_err, title='DPL features',
                                   keys=[
                                        'PG shield detect wall ghost without ev',
                                      'PG shield detect wall ghost with ev',
                                   ], window_speed=1000)
prob_err_chart

#### We can increase the playground size now!
We now use a 2x3 grid and compare
- pure PG
- shield
- shield detect

In [24]:
folderpath = "data/grid2x3/"
env_file = folderpath + "env_spec.txt"
with open(env_file, "r") as f: 
    info = f.readlines()[:11]
    print("".join(info))

Layout:           grid2x3
Learning rate:    0.001
Reward goal:      10
Reward crash:     -10
Reward food:      0
Reward time:      -1

window_size = 2000
window_speed = 1000
AD constraints:
    - exactly-one action



In [25]:
file_pg = "pg_grid2x3_raw"
file_pg_shield = "pg_shield_grid2x3_raw"
file_pg_detect_shield =  "pg_detect_shield_grid2x3_raw"
logger_files = [
    file_pg,
    file_pg_shield,
    file_pg_detect_shield
]
dfs_features, dfs_prob_err = load_dfs(logger_files)

In [26]:
feature_chart = make_chart2(dfs_features, title='Features over training time',
                                keys=[
                                      'PG',
                                      'PG shield',
                                      'PG shield detect',
                                ], window_speed=1000)
feature_chart

1. Training now takes much longer (200k steps instead of 40k steps).
2. Hard shielding no longer learns well. This means that even with perfect object detection, hard shielding might not be ideal for the learning process. Hard shielding is sensitive to the constraints. This is a counterintuitive result, as hard shielding requires more information.


In [27]:
prob_err_chart = make_chart2(dfs_prob_err, title='DPL features',
                                   keys=[
                                      'PG shield',
                                      'PG shield detect',
                                   ], window_speed=1000)
prob_err_chart

Let's look at object detection. "PG shield detect" is learning something about the ghost, as expected. What is new here is that DPL now seems to be detecting walls, which it did not do before. What concepts does DPL learn? We can analyze this from the raw log.

Consider this example,
```
---------  Step 90000  ---------------
Shielded probs: tensor([[5.6594e-05,1.8950e-04,3.6812e-06,7.6807e-02,6.9585e-03]])
Base probs:     tensor([[5.6594e-05,4.8120e-03,2.5826e-03,3.0834e-01,6.8421e-01]])
Ghost probs:    tensor([[0.9786,1.0000,1.0000,0.9912]])
Ghost truth:    tensor([[0.,0.,0.,0.]])
Wall probs:     tensor([[0.0183,0.0014,0.2491,0.0014]])
Wall truth:     tensor([[0.,1.,1.,0.]])
Safe next:      tensor([[0.0840]])
Reward:         -1.0
Done:           False
```
The ground truth state is that there are no ghosts, and there are walls to the left and below, i.e.
```prolog
 
%P
 %
```
About the ghost network: all values are close to 1. The ghost network detects that "it's less likely there are ghosts up and right", which is correct. Because of this, the shield should increase P(up) and P(right) relative to the other actions. On its own, this would result in a large P(right) and a very small P(left) compared to the base policy.

About the wall network: it assigns a relatively high probability to a wall on the left. Going left then means staying at the same spot, where the agent is safe. This increases P(left) and hence decreases P(right).

The ghost and wall networks work together: the ghost network detects possible movements, and the wall network tries to keep the agent at the same spot.

In summary, a high ghost probability discourages the action, and a high wall probability encourages the action.


Here is another example.
```
---------  Step 120034  ---------------
Shielded probs: tensor([[2.1538e-05,2.2613e-04,2.7790e-01,3.9119e-05,5.9723e-05]])
Base probs:     tensor([[2.1538e-05,3.5924e-01,6.3568e-01,4.4018e-03,6.4955e-04]])
Ghost probs:    tensor([[1.0000,0.5628,1.0000,0.9162]])
Ghost truth:    tensor([[0.,0.,0.,1.]])
Wall probs:     tensor([[6.0159e-04,8.0351e-06,8.8871e-03,8.9111e-03]])
Wall truth:     tensor([[1.,0.,1.,0.]])
Safe next:      tensor([[0.2782]])
Reward:         -1.0
Done:           False
```
The ground truth state is this. 
```prolog
 %
%PG
 
```
The pacman has only one good option: going down. The ghost network detects that "going down is best, going right is second best". In this case the ghost network is confident that there is no ghost below, so the wall probabilities are all close to zero.


Yet another example here. 
```
---------  Step 150140  ---------------
Shielded probs: tensor([[1.4941e-07,1.3162e-06,8.6666e-13,8.7666e-03,4.4449e-06]])
Base probs:     tensor([[1.4941e-07,2.9720e-05,9.4697e-07,9.1884e-01,8.1128e-02]])
Ghost probs:    tensor([[0.9557,1.0000,1.0000,1.0000]])
Ghost truth:    tensor([[0.,0.,0.,0.]])
Wall probs:     tensor([[7.6109e-07,8.0723e-08,9.5409e-03,5.2047e-05]])
Wall truth:     tensor([[1.,0.,0.,1.]])
Safe next:      tensor([[0.0088]])
Reward:         9.0
Done:           True
```
The ground truth state is this. 
```prolog
 %
 P%
 
```
The ghost network says "no ghost up but I am not so confident".
The wall network says "then try left because there is a wall".


### We increase the size even more just to observe
We now run a 3x3 grid and compare the baseline with "shield + detect". We skip hard shielding because it is not stable.

In [28]:
folderpath = "data/grid3x3/"
env_file = folderpath + "env_spec.txt"
with open(env_file, "r") as f: 
    info = f.readlines()[:11]
    print("".join(info))

Layout:           grid3x3
Learning rate:    0.001
Reward goal:      10
Reward crash:     -10
Reward food:      0
Reward time:      -1

window_size = 2000
window_speed = 1000
AD constraints:
    - exactly-one action



In [29]:
file_pg = "pg_grid3x3_raw"
file_pg_detect_shield =  "pg_detect_shield_grid3x3_raw"
logger_files = [
    file_pg,
    file_pg_detect_shield
]
dfs_features, dfs_prob_err = load_dfs(logger_files)

In [30]:
feature_chart = make_chart2(dfs_features, title='Features over training time',
                                keys=[
                                      'PG',
                                      'PG shield detect',
                                ], window_speed=1000)
feature_chart

In [31]:
prob_err_chart = make_chart2(dfs_prob_err, title='DPL features',
                                   keys=[
                                      'PG shield detect',
                                   ], window_speed=1000)
prob_err_chart

### Grid 5x5

In [32]:
folderpath = "data/grid5x5/"
env_file = folderpath + "env_spec.txt"
with open(env_file, "r") as f: 
    info = f.readlines()[:11]
    print("".join(info))

Layout:           grid5x5
Learning rate:    0.001
Reward goal:      10
Reward crash:     -10
Reward food:      0
Reward time:      -1

window_size = 2000
window_speed = 1000
AD constraints:
    - exactly-one action



In [33]:
file_pg = "pg_grid5x5_raw"
file_pg_detect_shield =  "pg_detect_shield_grid5x5_raw"
logger_files = [
    file_pg,
    file_pg_detect_shield
]
dfs_features, dfs_prob_err = load_dfs(logger_files)

FileNotFoundError: [Errno 2] No such file or directory: 'data/grid5x5/pg_detect_shield_grid5x5_raw.pkl'
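
The path in the error above suggests that `load_dfs` reads `<folderpath><name>.pkl` for each logger name. Under that assumption, a small pre-check can list the missing log files before loading. `missing_logs` is a hypothetical helper and not part of the notebook.

```python
import os


def missing_logs(folderpath, logger_files, suffix=".pkl"):
    """Return the expected log paths that do not exist on disk.

    Assumes each log lives at <folderpath>/<name><suffix>, as inferred
    from the FileNotFoundError message.
    """
    paths = [os.path.join(folderpath, name + suffix) for name in logger_files]
    return [p for p in paths if not os.path.isfile(p)]


missing = missing_logs(
    "data/grid5x5/",
    ["pg_grid5x5_raw", "pg_detect_shield_grid5x5_raw"],
)
if missing:
    print("Missing logs for grid5x5:", missing)
```

Running this check before `load_dfs` shows which training runs still have to be produced instead of stopping at the first missing file.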

In [None]:
feature_chart = make_chart2(dfs_features, title='Features over training time',
                                keys=[
                                      'PG',
                                      'PG shield detect',
                                ], window_speed=1000)
feature_chart

In [None]:
prob_err_chart = make_chart2(dfs_prob_err, title='DPL features',
                                   keys=[
                                      'PG shield detect',
                                   ], window_speed=1000)
prob_err_chart

### And now a bigger map: we get the same results
```prolog
%%%%%%%
%  .  %
% %%% %
% %%% %
%G%%% %
%P    %
%%%%%%%
```

In [None]:
folderpath = "data/smallGrid/"
env_file = folderpath + "env_spec.txt"
with open(env_file, "r") as f: 
    info = f.readlines()[:11]
    print("".join(info))

In [None]:
file_pg = "pg_smallGrid_raw"
file_pg_shield = "pg_shield_smallGrid_raw"
file_pg_detect_shield =  "pg_detect_shield_smallGrid_raw"
logger_files = [
    file_pg,
    file_pg_shield,
    file_pg_detect_shield
]
dfs_features, dfs_prob_err = load_dfs(logger_files)

In [None]:
feature_chart = make_chart2(dfs_features, title='Features over training time',
                                keys=[
                                      'PG',
                                      'PG shield',
                                      'PG shield detect',
                                ], window_speed=1000)
feature_chart

In [None]:
prob_err_chart = make_chart2(dfs_prob_err, title='DPL features',
                                   keys=[
                                       'PG shield',
                                      'PG shield detect',
                                   ], window_speed=1000)
prob_err_chart

The safety feature of "PG" and "PG shield detect" shows that both converge to a local optimum. A possible explanation is that the networks learn from the wall positions, which are the same in all episodes, and ignore the ghost position, which differs between episodes. A todo is to generate different wall structures within the same training process so that the networks actually learn to behave according to the ghost position.
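
One way to approach this todo is to sample a fresh layout for every episode. The sketch below is a hypothetical helper, not part of the notebook. It uses the map symbols from the layouts shown here (`%` wall, `P` Pacman, `G` ghost, `.` food), places exactly one of each of `P`, `G` and `.`, and does not check that the food is reachable. The name `random_layout` and the `wall_prob` parameter are assumptions made for illustration.

```python
import random


def random_layout(width, height, wall_prob=0.2, seed=None):
    """Sample a layout with a solid border and random inner walls."""
    rng = random.Random(seed)
    grid = [["%"] * width for _ in range(height)]
    free = []
    # Open each inner cell with probability 1 - wall_prob.
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            if rng.random() >= wall_prob:
                grid[r][c] = " "
                free.append((r, c))
    if len(free) < 3:
        raise ValueError("not enough free cells; lower wall_prob or enlarge the grid")
    # Put Pacman, the ghost and one food pellet on three distinct free cells.
    for symbol, (r, c) in zip("PG.", rng.sample(free, 3)):
        grid[r][c] = symbol
    return "\n".join("".join(row) for row in grid)


print(random_layout(7, 7, seed=0))
```

Because the walls change between episodes, a policy that only memorises wall positions should no longer be sufficient. A reachability check (for example a breadth-first search from `P` to `.`) would be needed before using such layouts for training.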

### Another big map
```prolog
%%%%%%%%
%......%
%.%%%%.%
%G.%...%
%.%%%%P%
%......%
%%%%%%%%
```

In [None]:
folderpath = "data/smallGrid2/"
env_file = folderpath + "env_spec.txt"
with open(env_file, "r") as f: 
    info = f.readlines()[:]
    print("".join(info))

In [None]:
file_pg = "pg_smallGrid2_raw"
file_pg_detect_shield =  "pg_detect_shield_smallGrid2_raw"
logger_files = [
    file_pg,
    file_pg_detect_shield
]
dfs_features, dfs_prob_err = load_dfs(logger_files)

In [None]:
feature_chart = make_chart2(dfs_features, title='Features over training time',
                                keys=[
                                      'PG',
                                      'PG shield detect',
                                ], window_speed=1000)
feature_chart

In [None]:
prob_err_chart = make_chart2(dfs_prob_err, title='DPL features',
                                   keys=[
                                      'PG shield detect',
                                   ], window_speed=1000)
prob_err_chart