# Deeper Understanding

In [3]:
from arcprize.notebooks import evaluate_run
from arclib.core import PromptStep

I went into the last notebook with the hypothesis that writing code would help.  But that hypothesis carried an implicit assumption: that the agent *knew what to do* and was just messing up the *details of execution*.

Once I dove deep into each of the 20 sessions, it turned out that the biggest problem wasn't that it knew what to do and had trouble doing it.  Its biggest problem seems to be that it doesn't deeply and clearly understand the "gist", the basic thrust, of each puzzle.

This problem breaks down into two parts:
1. Figuring out the algorithm, or basic thrust of the problems.
2. Executing the algorithm once it understands the basic approach.

In this notebook we'll take a shot at seeing if it can even do #2 once it understands the basic thrust.  I may be able to help it with #1 later by giving it more opportunity to iterate, try different hypotheses, etc.  But if it still can't execute, that won't help.

## Baseline

Here's the last run that we did:

In [2]:
evaluate_run()

## Run Results
Success rate: **20.0%**

| Case     | Session   | Result   | Title                                          | Link                                                |
|:---------|:----------|:---------|:-----------------------------------------------|:----------------------------------------------------|
| 007bbfb7 | uzb9tns0  | fail     | 3x3 expand to 9x9, replicate when set          | [007bbfb7](https://arcprize.org/play?task=007bbfb7) |
| 00d62c1b | 2ztrm935  | fail     | Flood fill                                     | [00d62c1b](https://arcprize.org/play?task=00d62c1b) |
| 017c7c7b | xahtv7qm  | PASS     | Continue pattern to height                     | [017c7c7b](https://arcprize.org/play?task=017c7c7b) |
| 025d127b | y9a3c01p  | fail     | Square up the bottom                           | [025d127b](https://arcprize.org/play?task=025d127b) |
| 045e512c | eb93cpow  | fail     | Directional color replicator                   | [045e512c](https://arcprize.org/play?task=045e512c) |
| 0520fde7 | mw52a3l8  | fail     | Boolean AND on two 3x3s                        | [0520fde7](https://arcprize.org/play?task=0520fde7) |
| 05269061 | 9n8b04dk  | fail     | Diagonal pattern expander                      | [05269061](https://arcprize.org/play?task=05269061) |
| 05f2a901 | 7ra1msyg  | fail     | Square sucks in the shape                      | [05f2a901](https://arcprize.org/play?task=05f2a901) |
| 06df4c85 | ek1oat2b  | fail     | Connect same colors in the grid                | [06df4c85](https://arcprize.org/play?task=06df4c85) |
| 08ed6ac7 | 4v3asyqe  | fail     | Size 1-4 histogram classifier                  | [08ed6ac7](https://arcprize.org/play?task=08ed6ac7) |
| 09629e4f | 9od8s24n  | fail     | Find, copy and expand the section without blue | [09629e4f](https://arcprize.org/play?task=09629e4f) |
| 0962bcdd | 7mary24f  | fail     | Grow crystals                                  | [0962bcdd](https://arcprize.org/play?task=0962bcdd) |
| 0a938d79 | c03bi8ez  | PASS     | Expand into stripes and continue               | [0a938d79](https://arcprize.org/play?task=0a938d79) |
| 0b148d64 | i4w6sb08  | fail     | Find the unique color cluster and output it    | [0b148d64](https://arcprize.org/play?task=0b148d64) |
| 0ca9ddb6 | j7kcpm51  | fail     | Grow crystals 2, don't grow blue               | [0ca9ddb6](https://arcprize.org/play?task=0ca9ddb6) |
| 0d3d703e | 8zvklf3w  | PASS     | Simple color transform                         | [0d3d703e](https://arcprize.org/play?task=0d3d703e) |
| 0dfd9992 | 0tvs9dnc  | PASS     | Ginormous pattern fill                         | [0dfd9992](https://arcprize.org/play?task=0dfd9992) |
| 0e206a2e | b1p9li7u  | fail     | Rotate, translate ships to indicated positions | [0e206a2e](https://arcprize.org/play?task=0e206a2e) |
| 10fcaaa3 | igmzxsob  | fail     | Grow blue on diagonals and double the pattern  | [10fcaaa3](https://arcprize.org/play?task=10fcaaa3) |
| 11852cab | yv7c6901  | fail     | Expand to a square with given color            | [11852cab](https://arcprize.org/play?task=11852cab) |

I just noticed that we failed on `025d127b` "Square up the bottom" during our last run.  But the system has succeeded with that one in the past, so in principle if I did multiple runs and counted every case where it succeeded at least once, I could get our score up from 4/20 = 20% to 5/20 = 25%.  I put that on the backlog.

## Injecting Strategies

We're going to do a quick experiment where we inject the basic idea of some of these cases and see if the agent is able to run with it.  I realize this is data leakage, absolutely.  But what I'm trying to understand here is whether the agent can execute the basic algorithm once it understands it.

We'll explain the strategies of the first 5, and see if that helps it do those first 5.  But what if that also helps it think outside the box on the others?  To find out, we'll run this on the first 10 and see what it does to the results.  Actually I'm going to skip the explanation of number 4 "square up the bottom" because frankly I never understood that one myself well enough to explain it.  So we'll explain 5 of the first 10; it will just be 1,2,3,5,6 instead of 1,2,3,4,5.

So we're going to try injecting this prompt step:

In [5]:
class StrategiesStep(PromptStep):
    """Here are some common strategies that are sometimes used in these tasks.  We are telling
    you about these to give you a feel for the wide variety of things that can happen in these
    tasks.  The sky is the limit on what these puzzles might do, so hopefully this list will
    give you ideas on how to approach your problem.

    Here are strategies or explanations of the transforms needed to solve some puzzles we have
    seen:
    1. The input is a small pattern which is expanded in a 3x3 matrix out to the 9x9 matrix
        by replicating it into the corresponding block of the big 9x9 matrix when the cell
        is set within the 3x3 matrix.
    2. We're doing a flood fill, finding fully enclosed spaces in the input and filling it in
        with a color (number).
    3. The output is a fixed height, and we replicate the pattern seen in the input, continuing
        it to a fixed height.
    4. There is a pattern implied in the input in the foreground (non-zero inputs).  Continue that
        pattern into a same-sized output.  The position of that pattern must match the input
        such that the output rows arrive in the same spots of the input, if you were to visualize
        the input and output matrices as being superimposed on each other.
    5. The input can be considered two 3x3 matrices separated by a barrier which is ignored.
        Envision the two input matrices on top of each other, and do a boolean AND operation
        to decide on the output in the output's 3x3 matrix.
    6. There is a fixed square object which will be in the same position in the output, and the other objects
        should be considered mobile objects.  The mobile objects will translate and move up adjacent
        to the fixed object, sticking up against it.

    We mention these strategies so you can get an idea of what a successful strategy explanation
    sounds like, and to show you how widely varied and fiendish these puzzles can be.  In every case
    there's always a definite way to predict what the output will be.  Often
    it'll seem like the solution is elusive, or that you need to guess.  If you're feeling like
    you need to guess, then that means you haven't figured out how the puzzles work yet.

    Use this opportunity to consider whether you're satisfied with your understanding of how this case
    works, and to refine your strategy before we take a real graded test question.
    """

In [6]:
evaluate_run()

## Run Results
Success rate: **10.0%**

| Case     | Session   | Result   | Title                                 | Link                                                |
|:---------|:----------|:---------|:--------------------------------------|:----------------------------------------------------|
| 007bbfb7 | y149d2af  | fail     | 3x3 expand to 9x9, replicate when set | [007bbfb7](https://arcprize.org/play?task=007bbfb7) |
| 00d62c1b | w1ic56op  | fail     | Flood fill                            | [00d62c1b](https://arcprize.org/play?task=00d62c1b) |
| 017c7c7b | 6pua31s4  | PASS     | Continue pattern to height            | [017c7c7b](https://arcprize.org/play?task=017c7c7b) |
| 025d127b | wanfh9kr  | fail     | Square up the bottom                  | [025d127b](https://arcprize.org/play?task=025d127b) |
| 045e512c | ynw4mzbr  | fail     | Directional color replicator          | [045e512c](https://arcprize.org/play?task=045e512c) |
| 0520fde7 | kb61lgax  | fail     | Boolean AND on two 3x3s               | [0520fde7](https://arcprize.org/play?task=0520fde7) |
| 05269061 | k6h5cv04  | fail     | Diagonal pattern expander             | [05269061](https://arcprize.org/play?task=05269061) |
| 05f2a901 | d6tlng4k  | fail     | Square sucks in the shape             | [05f2a901](https://arcprize.org/play?task=05f2a901) |
| 06df4c85 | gk6v4u0q  | fail     | Connect same colors in the grid       | [06df4c85](https://arcprize.org/play?task=06df4c85) |
| 08ed6ac7 | k97c6dzt  | fail     | Size 1-4 histogram classifier         | [08ed6ac7](https://arcprize.org/play?task=08ed6ac7) |

Ouch, that's super painful.  We just basically *handed it the answers* to 6/10 of those, and it did no better than when we weren't trying to help: the baseline also passed only 1 of these first 10 cases (the 20% above is over all 20).
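To make it concrete how mechanical these transforms are once they've been stated, here's a rough sketch of strategies 1 and 5 from the prompt above in plain Python.  The function names, the `on_value` parameter, and the assumption that the barrier is the single middle column are all mine; this is an illustration, not something the agent wrote.

```python
def expand_3x3(grid):
    """Strategy 1: copy the grid into block (br, bc) of the big output
    whenever cell (br, bc) of the grid itself is set (non-zero)."""
    n = len(grid)
    out = [[0] * (n * n) for _ in range(n * n)]
    for br in range(n):
        for bc in range(n):
            if grid[br][bc] == 0:
                continue  # unset cell: leave that block empty
            for r in range(n):
                for c in range(n):
                    out[br * n + r][bc * n + c] = grid[r][c]
    return out


def and_halves(grid, on_value=1):
    """Strategy 5: boolean AND of the square left of the barrier column
    with the square right of it.  on_value is a hypothetical parameter
    for the color written where both halves are set."""
    mid = len(grid[0]) // 2  # assumed: barrier is the middle column
    return [
        [on_value if row[c] != 0 and row[mid + 1 + c] != 0 else 0
         for c in range(mid)]
        for row in grid
    ]


small = [[0, 7, 7],
         [7, 7, 7],
         [0, 7, 7]]
big = expand_3x3(small)

two_halves = [[1, 0, 0, 5, 0, 1, 0],
              [0, 1, 0, 5, 1, 1, 1],
              [1, 0, 0, 5, 0, 0, 0]]
anded = and_halves(two_halves, on_value=2)
```

Each one is a handful of lines, which is what makes the failures on these cases sting.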

Let's try another experiment on this.  Here is the list of steps we used above:
```python
all_arc_task_classes = [
    ArcSystemPrompt,
    IntroduceProblem,
    RowCount,
    InputOutputRows,
    ProposeSolution1,
    CheckAnswer1,
    StrategiesStep, # We added this step, replacing the one below.
    #StrategyClarity,
    ProposeTestAnswer,
    ScoringStep,
]
```

Maybe it's just getting tripped up because the ongoing discussion is getting too large.  Let's try trimming this down radically and see what happens.  We'll try to strip the discussion bare, like this:
```python
all_arc_task_classes = [
    #ArcSystemPrompt, # (experiment 3)
    IntroduceProblem,
    #RowCount, # (experiment 3)
    #InputOutputRows, # (experiment 3)
    ProposeSolution1,
    CheckAnswer1,
    StrategiesStep, # (experiment 3)
    #StrategyClarity, # Which replaces this step (experiment 3)
    ProposeTestAnswer,
    ScoringStep,
]

```
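Commenting lines in and out works, but it gets hard to track which experiment used which steps.  Here's a minimal sketch of one way to express the variants as data instead.  Everything in it is a stand-in so it runs on its own: the empty classes only mirror the step names above, and `make_variant` is a hypothetical helper of mine, not part of `arclib`.

```python
# Hypothetical stand-ins; the real step classes are defined elsewhere.
class PromptStep: pass
class ArcSystemPrompt(PromptStep): pass
class IntroduceProblem(PromptStep): pass
class RowCount(PromptStep): pass
class InputOutputRows(PromptStep): pass
class ProposeSolution1(PromptStep): pass
class CheckAnswer1(PromptStep): pass
class StrategyClarity(PromptStep): pass
class StrategiesStep(PromptStep): pass
class ProposeTestAnswer(PromptStep): pass
class ScoringStep(PromptStep): pass

BASELINE = [
    ArcSystemPrompt, IntroduceProblem, RowCount, InputOutputRows,
    ProposeSolution1, CheckAnswer1, StrategyClarity,
    ProposeTestAnswer, ScoringStep,
]


def make_variant(steps, drop=(), swap=None):
    """Return a new step list with some steps dropped and others swapped."""
    swap = swap or {}
    return [swap.get(step, step) for step in steps if step not in drop]


# Experiment 2: StrategiesStep replaces StrategyClarity.
experiment_2 = make_variant(BASELINE, swap={StrategyClarity: StrategiesStep})

# Experiment 3: same as 2, with the early steps stripped out.
experiment_3 = make_variant(
    experiment_2, drop={ArcSystemPrompt, RowCount, InputOutputRows}
)

print([step.__name__ for step in experiment_3])
```

The point is just that each experiment becomes a named list that can be diffed against the baseline.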

In [7]:
evaluate_run()

## Run Results
Success rate: **20.0%**

| Case     | Session   | Result   | Title                                 | Link                                                |
|:---------|:----------|:---------|:--------------------------------------|:----------------------------------------------------|
| 007bbfb7 | h250gwtl  | fail     | 3x3 expand to 9x9, replicate when set | [007bbfb7](https://arcprize.org/play?task=007bbfb7) |
| 00d62c1b | p4dlnzow  | fail     | Flood fill                            | [00d62c1b](https://arcprize.org/play?task=00d62c1b) |
| 017c7c7b | x4zcg18m  | PASS     | Continue pattern to height            | [017c7c7b](https://arcprize.org/play?task=017c7c7b) |
| 025d127b | gjqyukt4  | fail     | Square up the bottom                  | [025d127b](https://arcprize.org/play?task=025d127b) |
| 045e512c | iho8l0y5  | fail     | Directional color replicator          | [045e512c](https://arcprize.org/play?task=045e512c) |
| 0520fde7 | h91j5c6l  | fail     | Boolean AND on two 3x3s               | [0520fde7](https://arcprize.org/play?task=0520fde7) |
| 05269061 | k4rfx19v  | fail     | Diagonal pattern expander             | [05269061](https://arcprize.org/play?task=05269061) |
| 05f2a901 | xmh9k2ae  | fail     | Square sucks in the shape             | [05f2a901](https://arcprize.org/play?task=05f2a901) |
| 06df4c85 | s3nwxty2  | fail     | Connect same colors in the grid       | [06df4c85](https://arcprize.org/play?task=06df4c85) |
| 08ed6ac7 | t5fo0934  | PASS     | Size 1-4 histogram classifier         | [08ed6ac7](https://arcprize.org/play?task=08ed6ac7) |

So that's interesting: this is the first time the system has figured out the 1-4 histogram classifier, and that is not one of the ones that I explained.  If we count any case the system has ever gotten right, that brings our success rate up to something like 30%.
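Here's a small sketch of where that 30% comes from: take the union of every case that has passed in any run.  The case IDs are copied from the tables above, plus the earlier `025d127b` success that isn't shown in this notebook.  Keep in mind the two later runs only covered the first 10 cases, so this is a loose "ever solved" tally, not a fair per-run score.

```python
TOTAL_CASES = 20

# Cases marked PASS in each run (copied from the result tables).
passes_by_run = {
    "baseline (20 cases)": {"017c7c7b", "0a938d79", "0d3d703e", "0dfd9992"},
    "strategies (first 10)": {"017c7c7b"},
    "trimmed (first 10)": {"017c7c7b", "08ed6ac7"},
    "earlier run (not shown)": {"025d127b"},
}


def ever_passed(runs):
    """Union of all cases that passed in at least one run."""
    solved = set()
    for passed in runs.values():
        solved |= passed
    return solved


solved = ever_passed(passes_by_run)
rate = len(solved) / TOTAL_CASES
print(f"{len(solved)}/{TOTAL_CASES} = {rate:.0%}")  # prints 6/20 = 30%
```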

Let's take a look at what insights the agent was able to glean in that one:

> The variety of strategies outlined provides a valuable framework for tackling abstract pattern recognition tasks. Reflecting on these strategies, it's clear that successful problem-solving often requires thinking beyond traditional or linear approaches. Here's how I can refine my understanding and approach for the future...
> **Logical Operations on Matrix Sections**: In line with strategy #5, explore dividing the matrix into smaller logical units and applying operations within these sections.

While it did get this problem right, I never saw it state succinctly and clearly what the transform is.  If I were to write such a thing, it'd say "The input is a histogram; the output wants the histogram bars ranked from largest to smallest with colors [4,3,2,1]" or something along those lines.
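For comparison, here's roughly what that one-sentence description looks like as code.  This is a sketch of my phrasing of the transform, not a verified solution to the task: the color order is just the `[4,3,2,1]` guess from above, exposed as a parameter, and it assumes each bar is a single column and there are no more bars than colors.

```python
def rank_bars(grid, colors=(4, 3, 2, 1)):
    """Recolor each vertical bar by height rank: the tallest bar gets
    colors[0], the next tallest colors[1], and so on.  The color order
    is an assumption; ties go to the leftmost bar."""
    rows, cols = len(grid), len(grid[0])
    # Height of a bar = number of non-zero cells in its column.
    heights = {c: sum(1 for r in range(rows) if grid[r][c] != 0)
               for c in range(cols)}
    bars = sorted((c for c in heights if heights[c] > 0),
                  key=lambda c: -heights[c])
    color_of = {c: colors[i] for i, c in enumerate(bars)}
    return [[color_of[c] if grid[r][c] != 0 else 0 for c in range(cols)]
            for r in range(rows)]


histogram = [[0, 5, 0],
             [5, 5, 0],
             [5, 5, 5]]
ranked = rank_bars(histogram)
```

That's the level of clarity I'd like to see in the agent's own explanation before it answers.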

## Run all 20 again

It's unclear to me how much variance there is among runs.  Maybe it's my prompts messing it up, maybe it just doesn't always succeed.  I'm going to do a fuller run of the first 20 now, leaving the prompts the same as the last run.

This will give me some clues as to whether the differences on the first 10 are just the stochastic nature of things, and we'll also see if we get any different results on the second 10.
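As a rough gauge of how much noise to expect, here's a sketch that puts a 95% Wilson score interval around the two 10-case results above (1/10 and 2/10).  It treats every case as an independent trial with the same success probability, which isn't really true since the puzzles vary a lot in difficulty, so take it as a ballpark only.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate of passes/n (z=1.96 ~ 95%)."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low_1, high_1 = wilson_interval(1, 10)  # strategies run: 1/10
low_2, high_2 = wilson_interval(2, 10)  # trimmed run: 2/10

print(f"1/10: {low_1:.0%} to {high_1:.0%}")
print(f"2/10: {low_2:.0%} to {high_2:.0%}")
print("intervals overlap:", low_2 < high_1)
```

The two intervals overlap heavily, so with only 10 cases a one-case swing between runs doesn't tell us much on its own.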