In [1]:
import os
import re
from pyprojroot import here
from datasets import Dataset
from tqdm.notebook import tqdm

In [2]:
base_path = "evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh"

In [3]:
t4d = (
    lambda y_i, y_pred_i: y_pred_i
    and y_i in y_pred_i
    and y_i == str(y_pred_i.translate(str.maketrans("", "", ".'"))[2:])
)
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"`'))
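
To make the `bbh` comparison above concrete, here is a minimal standalone sketch of the same normalization: parentheses are deleted from the target, a small set of punctuation characters is deleted from the prediction, and the two strings must then be exactly equal. The names `strip_chars` and `bbh_match` are hypothetical and do not appear elsewhere in this notebook.

```python
from typing import Optional


# Illustrative sketch only; `strip_chars` and `bbh_match` are hypothetical names.
def strip_chars(text: str, chars: str) -> str:
    """Delete every character listed in `chars` from `text`."""
    return text.translate(str.maketrans("", "", chars))


def bbh_match(target: str, prediction: Optional[str]) -> bool:
    """Mirror the `bbh` lambda: normalize both sides, then compare exactly."""
    if not prediction:
        # Missing or empty predictions never count as correct.
        return False
    return strip_chars(target, "()") == strip_chars(prediction, '.()"`')


print(bbh_match("(A)", "A."))       # True
print(bbh_match("True", "False."))  # False
print(bbh_match("True", None))      # False
```

Unlike the lambda, this version always returns a `bool` rather than a falsy prediction value, which is easier to sum or log.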

In [4]:
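def calculate_correct_prediction_count(benchmark, y: list[str], y_pred: list[str]):
    # Resolve the comparison function once, at call time, so that later cells
    # which redefine `bbh` still take effect.
    eval_fns = {"t4d": t4d, "bbh": bbh}
    if benchmark not in eval_fns:
        raise ValueError(f"Unknown benchmark: {benchmark!r}")
    eval_fn = eval_fns[benchmark]

    correct_preds = 0
    for y_i, y_pred_i in tqdm(zip(y, y_pred), desc="Calculating..."):
        if eval_fn(y_i, y_pred_i):
            correct_preds += 1
        else:
            # Print each mismatch as "target, prediction" for manual inspection.
            print(f"{y_i}, {y_pred_i}\n")
    return correct_preds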

# boolean_expressions

In [5]:
subset = 'boolean_expressions'

In [6]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-boolean_expressions/bbh-boolean_expressions_eval')

In [7]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [14]:
print(dataset.filter(lambda x: x["answer_pred"] is None)[2]["reasoning"])

Let's follow the reasoning plan step-by-step to evaluate the logical statement "False or False and False or not True":

1. **Break down the logical statement into smaller parts**:
   - Components: "False", "False and False", "not True".

2. **Simplify the logical expression**:
   - Using parentheses to clarify the order of operations: (False or (False and False)) or (not True).

3. **Assign truth values to individual components**:
   - False = F, True = T.

4. **Evaluate the logical statement step by step**:
   - Evaluate (False and False): F and F = F.
   - Evaluate (False or F): F or F = F.
   - Evaluate (not True): not T = F.
   - Evaluate (F or F): F or F = F.

5. **Double-check the evaluation**:
   - The evaluation steps are consistent with the order of operations for logical statements.

6. **Document each step of the evaluation process**:
   - (False and False) evaluates to False.
   - (False or False) evaluates to False.
   - (not True) evaluates to False.
   - (False or False)

In [23]:
def map_fn(ins):
    # Only fill in predictions that are missing.
    if not ins["answer_pred"]:
        answer = None  # stays None when the marker is not found
        pattern = r"(?<=the final answer is ).*"
        match = re.search(pattern, ins["reasoning"] or "")
        if match:
            answer = match.group(0).strip().replace(".", "")

        return {
            "answer_pred": answer
        }
    return {
        "answer_pred": ins["answer_pred"]
    }

dataset = dataset.map(map_fn)
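
The pattern in the cell above is case-sensitive, so it matches "the final answer is" but not a sentence that starts with "The final answer is". A hedged sketch of a more forgiving extractor follows. `FINAL_ANSWER_RE` and `extract_final_answer` are hypothetical names. Unlike the cell above, which removes every period, the sketch strips only trailing periods.

```python
import re
from typing import Optional

# Hypothetical helper, not part of the notebook.
# Matches "the final answer is" in any letter case, with an optional colon.
FINAL_ANSWER_RE = re.compile(r"the final answer is:?\s*(.+)", re.IGNORECASE)


def extract_final_answer(reasoning: Optional[str]) -> Optional[str]:
    """Return the text after the last final-answer marker, or None."""
    if not reasoning:
        return None
    matches = FINAL_ANSWER_RE.findall(reasoning)
    if not matches:
        return None
    # Keep the last occurrence; drop surrounding whitespace and trailing periods.
    return matches[-1].strip().rstrip(".").strip() or None


print(extract_final_answer("Step 1 ...\nThe final answer is False.\n"))  # False
print(extract_final_answer("The reasoning was cut off before the end"))  # None
```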

In [8]:
print(dataset[0]["reasoning"])

To solve the task "True and False or ( not True ) is," we need to break down the logical statement into smaller components and evaluate each part step by step.

1. **Identify the Logical Operators and Components**
   - The statement is: "True and False or ( not True )".
   - Components:
     - "True and False"
     - "or"
     - "( not True )"

2. **Evaluate the First Component: "True and False"**
   - Truth value of "True": True
   - Truth value of "False": False
   - Apply the AND operator: True AND False = False

3. **Evaluate the Second Component: "( not True )"**
   - Truth value of "True": True
   - Apply the NOT operator: NOT True = False

4. **Combine the Results Using the OR Operator**
   - Truth value from step 2: False
   - Truth value from step 3: False
   - Apply the OR operator: False OR False = False

5. **Determine the Final Truth Value**
   - The final truth value of the statement "True and False or ( not True )" is False.

The final answer is False.


In [26]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"]) + 1) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

True, False.

False, True.

True, False.

True, False.

True, False.

True, False.

False, True.



0.976

# date_understanding

In [27]:
subset = 'date_understanding'

In [28]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-date_understanding/bbh-date_understanding_eval')

In [29]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [30]:
print(dataset[0]["reasoning"])

1. **Identify the Given Date**
   - Given date: January 21, 2011.

2. **Understand the Requirement**
   - Calculate the date one week ago from January 21, 2011.

3. **Calculate One Week Back**
   - Subtract 7 days from January 21, 2011.
   - January 21 - 7 days = January 14, 2011.

4. **Adjust for Month and Year**
   - Since January 14, 2011, is still within the same month and year, no further adjustments are needed.

5. **Format the Result**
   - Convert January 14, 2011, to MM/DD/YYYY format.
   - Resulting date: 01/14/2011.

6. **Compare with Options**
   - Compare 01/14/2011 with the provided options:
     - (A) 12/06/2010
     - (B) 01/15/2011
     - (C) 01/17/2011
     - (D) 03/15/2011
     - (E) 12/24/2010

The final answer is not listed in the provided options.


In [31]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [32]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(B), not listed in the provided options.

(C), B.

(D), A.

(E), C.

(B), C.

(B), C.

(D), 12/26/1962.

(D), B.

(E), 11/23/2001.

(B), E.

(E), C.

(D), F.

(A), B.

(E), None of the options.

(E), (D) 12/31/2014.

(D), not listed in the options.

(A), D.

(A), C.

(E), not available in the given options.

(B), F.

(B), C.

(D), None of the options match.

(C), A.

(B), D.

(B), F.

(A), F.

(A), not listed among the options.

(C), F.

(C), F.

(C), B.

(D), E.

(E), B.

(A), not listed in the provided options.

(D), not listed among the given options.

(D), B.

(B), None of the above.

(F), A.

(D), not listed in the options.

(E), None of the options are correct.



0.844
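
The `[0]` index in the `bbh` lambda for this subset compares the target letter with the first character left in the prediction after punctuation is removed. For free-text predictions such as `not listed in the provided options.`, that character is simply the first letter of the sentence. The sketch below shows one possible alternative, under the assumption that a valid prediction starts with an optionally parenthesized capital letter. `CHOICE_RE` and `extract_choice` are hypothetical names.

```python
import re
from typing import Optional

# Hypothetical helper, not part of the notebook.
# Accepts "B", "B.", "(A)." or "(D) 12/31/2014." but rejects free text.
CHOICE_RE = re.compile(r"^\(?([A-Z])\)?(?:[.\s]|$)")


def extract_choice(prediction: Optional[str]) -> Optional[str]:
    """Return the option letter at the start of `prediction`, or None."""
    if not prediction:
        return None
    match = CHOICE_RE.match(prediction.strip())
    return match.group(1) if match else None


print(extract_choice("B."))                                   # B
print(extract_choice("(D) 12/31/2014."))                      # D
print(extract_choice("not listed in the provided options."))  # None
print(extract_choice("None of the options."))                 # None
```

Returning `None` for unparseable predictions keeps them distinguishable from wrong letters when inspecting errors.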

# disambiguation_qa

In [33]:
subset = 'disambiguation_qa'

In [34]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-disambiguation_qa/bbh-disambiguation_qa_eval')

In [35]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [36]:
print(dataset[0]["reasoning"])

1. **Identify the Core Pronoun Reference**: The pronoun in question is "they."

2. **Break Down the Sentence**:
   - Subject: The worker
   - Verb: told
   - Object: the pedestrian
   - Subordinate clause: that they were repairing the sidewalk as quickly as possible

3. **Analyze Grammatical Rules and Assumptions**: The pronoun "they" typically refers to a plural noun or can be used as a singular pronoun in modern English to refer to a person of unspecified gender.

4. **Identify Potential Antecedents**:
   - The worker
   - The pedestrian

5. **Evaluate Linguistic Evidence**: The sentence structure and context suggest that the action of repairing the sidewalk is more likely to be performed by the worker rather than the pedestrian.

6. **Analyze the Sentence from Different Grammatical Perspectives**: The worker is the subject of the main clause and the most likely actor in the subordinate clause.

7. **Determine Underlying Linguistic Causes of Ambiguity**: The sentence could be interpr

In [37]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [38]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(B), (C) Ambiguous.

(C), (A) The developer liked the design.

(A), (C) Ambiguous.

(A), C.

(A), (C) Ambiguous.

(A), (C) Ambiguous.**

(A), (C) Ambiguous.

(C), B.

(C), (B) The designer liked the design.

(C), None

(C), B.

(C), (B) The cashier is arrogant.

(A), (C) Ambiguous.

(B), (C) Ambiguous**.

(C), (A) It will be the producers' office.

(C), B.

(C), (B) The blueprints were the architect's.

(A), (C) Ambiguous.

(A), (C) Ambiguous.

(A), (C) Ambiguous.

(C), B.

(B), (C) Ambiguous.

(C), (B) The writer uses big words.

(C), A.

(C), (A) The office was the producers' office.

(B), (C) Ambiguous.

(B), C.

(A), (C) Ambiguous.

(B), C.

(A), (C) Ambiguous.

(A), C.

(C), (B) It was the employee's car.

(A), (C) Ambiguous.

(C), B.

(B), (C) Ambiguous.

(A), (C) Ambiguous.

(B), (C) Ambiguous.

(C), (B) It is the director's office.

(B), (C) Ambiguous.

(B), (C) Ambiguous.

(B), (C) Ambiguous.

(B), C.

(B), C.

(C), B.

(A), (C) Ambiguous.

(A), C.

(C), (B) The tailor broke a

0.684

# dyck_languages

In [5]:
subset = 'dyck_languages'

In [6]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-dyck_languages/bbh-dyck_languages_eval')

In [7]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [8]:
print(dataset[0]["reasoning"])

1. **Identify all opening brackets in the sequence**:
   - The input sequence is `{ ( < [ < > ]`.
   - The opening brackets are `{`, `(`, `<`, `[`, `<`.

2. **Simplify the problem by focusing on the unmatched opening brackets**:
   - From the list of opening brackets, identify which ones are already matched with closing brackets in the input sequence.
   - The matched opening brackets are `<` and `[` (matched with `>` and `]` respectively).
   - Remove the matched opening brackets from the list.
   - The unmatched opening brackets are `{`, `(`, `<`.

3. **Break down the sequence into smaller parts**:
   - Group the unmatched opening brackets together: `{`, `(`, `<`.

4. **Plan to close each opening bracket**:
   - Start with the most recent unmatched opening bracket (rightmost in the sequence), which is `<`.
   - Close it with the corresponding closing bracket `>`.
   - Move to the next most recent unmatched opening bracket, which is `(`.
   - Close it with the corresponding closing br

In [9]:
dataset.filter(lambda x: x["answer_pred"] is None)

Filter:   0%|          | 0/250 [00:00<?, ? examples/s]

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 8
})

In [10]:
import re 

def map_fn(ins):
    answer_pred = ins.get("answer_pred")
    reasoning = ins.get("reasoning", "")

    if reasoning is None:
        reasoning = ""

    if answer_pred is None or answer_pred == '':
        marker = "The final answer is:"

        marker_index = reasoning.find(marker)

        if marker_index != -1:
            trajectory = reasoning[:marker_index]
            answer = reasoning[marker_index + len(marker):]
   
            cleaned_answer = answer.replace("`", "").strip()
            cleaned_trajectory = trajectory.strip()

            if not cleaned_answer:
                cleaned_answer = None

            return {
                "trajectory": cleaned_trajectory,
                "answer_pred": cleaned_answer
            }
        else:
            return {
                "trajectory": reasoning.strip(),
                "answer_pred": None
            }
    else:
        return ins

dataset = dataset.map(map_fn)

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

In [11]:
dataset.filter(lambda x: x["answer_pred"] is None)

Filter:   0%|          | 0/250 [00:00<?, ? examples/s]

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 3
})

In [12]:
def map_fn(ins):
    find = "Input: "
    index = ins["input"].find(find)
    
    return {
        "target": ins["input"][index + len(find):] + " " + ins["target"]
    }

dataset = dataset.map(map_fn)

Map:   0%|          | 0/250 [00:00<?, ? examples/s]
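
The cell above rewrites `target` so that it holds the full bracket sequence (the prefix given in the prompt plus the gold completion) rather than the completion alone. A small sketch of the same idea on a toy prompt follows. `full_sequence` is a hypothetical name, and the prompt wording is an assumption rather than an actual dataset row. The sketch also raises an error when the marker is missing, a case the cell above does not handle.

```python
# Hypothetical helper mirroring the cell above; not part of the notebook.
def full_sequence(prompt: str, completion: str, marker: str = "Input: ") -> str:
    """Join the bracket prefix found after `marker` with the gold completion."""
    index = prompt.find(marker)
    if index == -1:
        # Guard added for the sketch: fail loudly instead of slicing blindly.
        raise ValueError(f"marker {marker!r} not found in prompt")
    return prompt[index + len(marker):] + " " + completion


# Toy prompt for illustration only.
prompt = "Complete the sequence. Input: { ( <"
print(full_sequence(prompt, "> ) }"))  # { ( < > ) }
```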

In [13]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()\"")
).replace(" ", "") == y_pred_i.translate(str.maketrans("", "", '.(),`"\'')).replace(" ", "")

In [14]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

{ ( < > ) } ( ( [ ] ) < [ ( [ [ ] ] [ { } ] { } [ < { [ ] } > ] ( ) ) ] > ), the input sequence itself: `{ ( < > ) } ( ( [ ] ) < [ ( [ [ ] ] [ { } ] { } [ < { [ ] } > ] ( ) ) ]`.

< { < [ [ ( { } ) ] ] > } >, `< { < [ [ ( { } ) ] ] } >`

{ ( [ [ ] ( ) ] ) }, `{ ( [ [ ] ( ) ] ] ) }`.

[ < { [ ] } > ], `[ < { [ ] } ] >`.

< < { ( < ( ) > ) } > >, `< < { ( < ( ) > ) } >`.

[ [ < [ [ ] ] > ] ] { } { ( { ( ( ) ) ( ) { { [ [ ( { < { [ { [ ( < ( ( < < < [ ( ) ] [ ] > > > ) ) > < [ < { < ( ) > } > ] > ) ] } ] } > ( ( ) ) } ) [ ( ) ] ] ( < > ) ] } } } ) } [ ], the sequence is balanced.

{ { } ( ( ) ) }, `{ { } ( ( ) )`.

< [ { { < ( ) > { < { } > ( < ( ) > { < [ ( { { ( < [ ] > ) } } { ( ( [ [ { } [ ] ] ] ) ) } ) ] > } ) } } } ] >, `< [ { { < ( ) > { < { } > ( < ( ) > { < [ ( { { ( < [ ] > ) } } { ( ( [ [ { } [ ] ] ] ) ) } ) ] > } ) } } > } > ] >`.

[ < [ ( [ < > ] { < > } [ [ ] ] ) ] > ], `[ < [ ( [ < > ] { < > } [ [ ] ] ) ] ]`.

[ { ( { } ) } < < ( ) { { < [ { [ ( ) ] } ] > } } > > ], `[ { ( 

0.532
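
Several of the mismatches printed above are predictions that are not well-formed bracket sequences. A stack-based check makes that easy to confirm. The sketch below is illustrative only; `PAIRS` and `is_balanced` are hypothetical names and are not used elsewhere in this notebook.

```python
# Hypothetical helper, not part of the notebook.
# Maps each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{", ">": "<"}


def is_balanced(sequence: str) -> bool:
    """Return True if every bracket in `sequence` is closed in the right order."""
    stack = []
    for ch in sequence:
        if ch in PAIRS.values():
            stack.append(ch)  # opening bracket: remember it
        elif ch in PAIRS:
            # closing bracket: must match the most recent unclosed opener
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        # any other character (spaces, backticks) is ignored
    return not stack


print(is_balanced("< { < [ [ ( { } ) ] ] > } >"))  # True  (target)
print(is_balanced("< { < [ [ ( { } ) ] ] } >"))    # False (prediction)
```

A check like this separates predictions that are malformed from those that are balanced but differ from the target.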

# formal_fallacies

In [22]:
subset = 'formal_fallacies'

In [23]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

WindowsPath('d:/Surge/self-discover/evals/logs/mistral/phaseII/bbh/bbh-formal_fallacies/bbh-formal_fallacies_eval')

In [24]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'reasoning_structure', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [27]:
print(dataset[0]["reasoning"])

```json
{
    "Identify Key Assumptions": {
        "Identify the key premises or given statements that the argument relies on": {
            "Premise 1": "Whoever is a schoolmate of Sondra is not a stepsister of Pricilla.",
            "Premise 2": "Whoever is not a stepsister of Pricilla is a schoolmate of Sondra."
        }
    },
    "Break Down the Argument": {
        "Deconstruct the argument into smaller logical steps or components": {
            "Step 1": "If someone is a schoolmate of Sondra, then they are not a stepsister of Pricilla.",
            "Step 2": "If someone is not a stepsister of Pricilla, then they are a schoolmate of Sondra."
        }
    },
    "Critical Analysis": {
        "Analyze the argument from different logical perspectives, question the assumptions, and evaluate the inferential steps for validity. Focus on identifying any logical fallacies or gaps": {
            "Analysis": "The argument presents a logical equivalence between being a schoolmate o

In [58]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 250it [00:00, 249779.90it/s]

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

valid, invalid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

valid, invalid.

invalid, valid.

invalid, valid.

invalid, valid.

valid, invalid.

valid, invalid.

invalid, valid.

valid, invalid.

valid, invalid.

valid, invalid.

invalid, valid.

valid, invalid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

valid, invalid.

valid, invalid.

invalid, valid.

valid, invalid.

invalid, valid.

valid, invalid.

invalid, valid.

invalid, valid.

valid, invalid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

invalid, valid.

valid, invalid.

invalid, valid.

valid, invalid.

valid, invalid.

invalid, valid.

invalid, valid.

valid, invalid.

valid, invalid.

invalid, valid




0.756
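
The mismatch list above is long, and it is easier to read as counts per (target, prediction) pair. A minimal sketch of such a tally follows, using toy lists rather than the dataset. `tally_errors` is a hypothetical name, and the lower-casing and trailing-period stripping are assumptions about how predictions should be normalized.

```python
from collections import Counter
from typing import Optional


# Hypothetical helper, not part of the notebook.
def tally_errors(targets: list[str], predictions: list[Optional[str]]) -> Counter:
    """Count (target, normalized prediction) pairs for mismatched items only."""
    errors = Counter()
    for target, pred in zip(targets, predictions):
        cleaned = (pred or "").strip().rstrip(".").lower()
        if cleaned != target.lower():
            errors[(target, cleaned)] += 1
    return errors


# Toy data for illustration only.
targets = ["invalid", "valid", "invalid", "valid"]
predictions = ["valid.", "valid.", "valid.", "invalid."]
for pair, count in tally_errors(targets, predictions).most_common():
    print(pair, count)
```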

# geometric_shapes

In [39]:
subset = 'geometric_shapes'

In [40]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-geometric_shapes/bbh-geometric_shapes_eval')

In [41]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [42]:
print(dataset[0]["reasoning"])

1. **Identify the Core Issue:**
   - Determine the shape represented by the given SVG path element.

2. **Break Down the Problem:**
   - Analyze the SVG path commands to understand the shape.

3. **Critical Thinking:**
   - Interpret each command in the SVG path to understand the shape it forms.

4. **Analyze Relevant Data:**
   - Examine the coordinates and commands in the SVG path to understand the shape.

5. **Determine if Analytical Techniques are Required:**
   - Assess if determining the shape requires analyzing the geometric properties of the SVG path.

6. **Step-by-Step Analysis:**
   - Analyze each command in the SVG path step by step to determine the shape.

### Detailed Steps

1. **Extract and Understand the SVG Path Commands:**
   - Identify the commands and coordinates in the SVG path.
   - The path commands are: `M`, `L`.

2. **Interpret the Commands:**
   - `M` stands for "move to" and sets the starting point.
   - `L` stands for "line to" and draws a straight line to th

In [43]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [44]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(F), B.

(G), J.

(F), B.

(D), J.

(K), H.

(K), I.

(C), (G) pentagon.

(K), D.

(K), J.

(K), D.

(B), J.

(K), J.

(K), H.

(C), B.

(K), D.

(K), D.

(K), J.

(F), C.

(C), D.

(J), E.

(K), D.

(K), D.

(C), B.

(K), H.

(D), J.

(C), B.

(C), G.

(G), D.

(F), D.

(K), H.

(K), D.

(G), D.

(K), A.

(C), G.

(B), E.

(K), D.

(C), G.

(F), G.

(D), J.

(D), H.

(F), B.

(K), D.

(I), J.

(K), D.

(F), D.

(K), D.

(K), I.

(D), J.

(B), C.

(B), D.

(C), B.

(D), H.

(D), E.

(C), J.

(F), G.

(G), J.

(K), I.

(G), J.

(K), J.

(K), D.

(F), G.

(C), G.

(K), I.

(K), A.

(B), C.

(F), G.

(K), D.

(B), D.

(C), G.

(G), J.

(B), G.

(K), I.

(G), J.

(B), G.

(K), (D) kite.

(K), I.

(K), H.

(K), H.

(B), (J) triangle.

(F), B.



0.68

# hyperbaton

In [45]:
subset = 'hyperbaton'

In [46]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-hyperbaton/bbh-hyperbaton_eval')

In [47]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [48]:
print(dataset[0]["reasoning"])

1. **Understand the Standard English Adjective Order Rules**
   - The standard order is: opinion, size, age, shape, color, origin, material, purpose.

2. **Identify the Adjectives in Each Sentence**
   - Sentence A: wonderful (opinion), big (size), circular (shape), orange (color), Pakistani (origin), smoking (purpose).
   - Sentence B: circular (shape), wonderful (opinion), smoking (purpose), Pakistani (origin), big (size), orange (color).

3. **Classify Each Adjective According to the Standard Rules**
   - Sentence A: wonderful (opinion), big (size), circular (shape), orange (color), Pakistani (origin), smoking (purpose).
   - Sentence B: circular (shape), wonderful (opinion), smoking (purpose), Pakistani (origin), big (size), orange (color).

4. **Compare the Adjective Order in Each Sentence**
   - Sentence A follows the order: opinion, size, shape, color, origin, purpose.
   - Sentence B follows the order: shape, opinion, purpose, origin, size, color.

5. **Analyze Each Sentence fo

In [49]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [50]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(B), A.

(B), A.

(B), A.

(B), A.

(B), (A).



0.98

# logical_deduction_five_objects

In [51]:
subset = 'logical_deduction_five_objects'

In [52]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-logical_deduction_five_objects/bbh-logical_deduction_five_objects_eval')

In [53]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [54]:
print(dataset[0]["reasoning"])

1. **Identify the Problem Type**
   - This is a logical reasoning problem that requires analyzing pricing statements and creating an ordered list based on given comparisons.

2. **List the Given Statements**
   - The watermelons are more expensive than the cantaloupes.
   - The mangoes are less expensive than the pears.
   - The apples are the second-cheapest.
   - The watermelons are less expensive than the mangoes.

3. **Simplify the Pricing Comparisons**
   - Watermelons > Cantaloupes
   - Mangoes < Pears
   - Apples = Second-cheapest
   - Watermelons < Mangoes

4. **Analyze Each Pricing Statement Step by Step**
   - Start with "The apples are the second-cheapest."
   - This means there is one fruit cheaper than apples and three fruits more expensive than apples.

5. **Create a Step-by-Step Plan to Compare Each Fruit's Price**
   - Use the direct comparisons to build a relative pricing order:
     - Since apples are the second-cheapest, identify which fruit is cheaper than apples an

In [55]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [56]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(E), C.

(C), B.

(D), B.

(C), B.

(B), C.

(E), D.

(D), E.

(A), (D) The hummingbird is the second from the right.

(D), C.

(C), D.

(B), (C) The orange book is the leftmost.

(A), (C) The motorcycle is the oldest.

(B), C.

(A), B.

(A), D.

(E), C.

(C), (A) The truck is the second-newest.

(E), D.



0.928

# logical_deduction_seven_objects

In [57]:
subset = 'logical_deduction_seven_objects'

In [58]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-logical_deduction_seven_objects/bbh-logical_deduction_seven_objects_eval')

In [59]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [60]:
print(dataset[0]["reasoning"])

1. **Identify Key Positional Clues**:
   - The owl is the second from the right.
   - The cardinal is the fourth from the left.
   - The raven is the second from the left.

2. **Create a Visual Representation**:
   - Positions: 1, 2, 3, 4, 5, 6, 7.

3. **Place Birds with Specific Positions**:
   - Owl: Position 6.
   - Cardinal: Position 4.
   - Raven: Position 2.

4. **Analyze Relative Positions**:
   - Falcon is to the left of the blue jay.
   - Quail is to the left of the falcon.
   - Robin is to the left of the quail.

5. **Determine the Position of the Quail**:
   - Quail must be in position 3 (left of falcon, right of robin).

6. **Determine the Position of the Robin**:
   - Robin must be in position 1 (left of quail).

7. **Determine the Position of the Falcon**:
   - Falcon must be in position 5 (left of blue jay, right of quail).

8. **Determine the Position of the Blue Jay**:
   - Blue Jay must be in position 7.

9. **Verify the Order**:
   - Robin (1), Raven (2), Quail (3), 

In [61]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [62]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(C), G.

(B), (F) The tractor is the second-oldest.

(G), D.

(A), F.

(C), G.

(E), B.

(D), F.

(B), F.

(E), (A) The apples are the second-cheapest.

(A), F.

(E), B.

(G), (A) Eve finished third-to-last.

(D), C.

(B), D.

(F), B.

(F), E.

(F), D.

(A), (E) The tractor is the second-oldest.

(F), G.

(D), (G) The green book is the leftmost.

(G), (A) The robin is the rightmost.

(E), G.

(F), B.

(C), G.

(B), D.

(G), A.

(C), B.

(G), F.

(D), F.

(F), (G) The raven is the leftmost.

(E), (F) The quail is the fourth from the left.



0.876

# logical_deduction_three_objects

In [63]:
subset = 'logical_deduction_three_objects'

In [64]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-logical_deduction_three_objects/bbh-logical_deduction_three_objects_eval')

In [65]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [66]:
print(dataset[0]["reasoning"])

1. **Identify the Core Issue**:
   - Determine which fruit is the cheapest based on the given price comparisons.

2. **Understand the Given Information**:
   - Analyze the statements provided:
     - "The loquats are the cheapest."
     - "The plums are less expensive than the apples."

3. **Analyze the Price Relationships**:
   - Extract the price relationships from the statements:
     - Loquats are cheaper than plums and apples.
     - Plums are cheaper than apples.

4. **Establish the Price Order**:
   - Based on the relationships, establish the order of prices from cheapest to most expensive:
     - Loquats < Plums < Apples

5. **Evaluate the Options**:
   - Compare the established price order with the given options:
     - (A) The plums are the cheapest
     - (B) The apples are the cheapest
     - (C) The loquats are the cheapest

6. **Select the Correct Option**:
   - Choose the option that correctly represents the cheapest fruit based on the established price order.

The final

In [67]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [68]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

1.0

# multistep_arithmetic_two

In [75]:
subset = 'multistep_arithmetic_two'

In [76]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-multistep_arithmetic_two/bbh-multistep_arithmetic_two_eval')

In [77]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [78]:
print(dataset[0]["reasoning"])

1. **Simplify the Multiplication**:
   - Calculate `-6 * -1`:
     \[
     -6 * -1 = 6
     \]

2. **Solve the First Set of Parentheses**:
   - Replace `-6 * -1` with `6`:
     \[
     (6 - 2 + -2)
     \]
   - Perform the operations inside the parentheses:
     \[
     6 - 2 = 4
     \]
     \[
     4 + -2 = 2
     \]

3. **Solve the Second Set of Parentheses**:
   - Perform the operations inside the parentheses:
     \[
     9 - 4 = 5
     \]
     \[
     5 + -1 = 4
     \]
     \[
     4 - 7 = -3
     \]

4. **Combine the Results**:
   - Add the results from the first and second sets of parentheses:
     \[
     2 + (-3) = -1
     \]

The final answer is -1.


In [85]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()[]"\\`')).strip()

In [86]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

71, -73.

-170, 0.

-38, `-22`.

-30, -16.

-36, 44.

41, -29.

44, 80.

33, \(-3\).

36, -40.

88, -20.

38, 40.

-144, \(108\).

-26, 10.

30, 30**.

867, 835.

3, 5.

123, 105.

3, -15.

-5, -17.

101, 59.

-25, False**.

10, 14.

-8840, `-9304`.

-37, -33.

-20, `-18`.

-17, `-3`.



0.896
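
One mismatch above, `30, 30**.`, has the right number but fails the string comparison because of leftover Markdown emphasis. A hedged alternative is to compare integers instead of cleaned strings. In the sketch below, `INT_RE` and `parse_int` are hypothetical names, and taking the last integer found in the prediction is an assumption. Scoring this way would give a different accuracy from the one reported above.

```python
import re
from typing import Optional

# Hypothetical helper, not part of the notebook.
INT_RE = re.compile(r"-?\d+")


def parse_int(prediction: Optional[str]) -> Optional[int]:
    """Return the last integer found in `prediction`, or None if there is none."""
    if not prediction:
        return None
    # Drop thousands separators so "1,000" is read as one number.
    matches = INT_RE.findall(prediction.replace(",", ""))
    return int(matches[-1]) if matches else None


print(parse_int("\\(-3\\)."))  # -3
print(parse_int("30**."))      # 30
print(parse_int("`-9304`."))   # -9304
print(parse_int("False**."))   # None
```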

# penguins_in_a_table

In [87]:
subset = 'penguins_in_a_table'

In [88]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-penguins_in_a_table/bbh-penguins_in_a_table_eval')

In [89]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 146
})

In [90]:
print(dataset[0]["reasoning"])

1. **Identify the Relevant Data**:
   - The table is:
     ```
     name, age, height (cm), weight (kg)
     Louis, 7, 50, 11
     Bernard, 5, 80, 13
     Vincent, 9, 60, 11
     Gwen, 8, 70, 15
     ```

2. **Remove the Specified Penguin**:
   - Locate Bernard in the table and remove his row.
   - The updated table is:
     ```
     name, age, height (cm), weight (kg)
     Louis, 7, 50, 11
     Vincent, 9, 60, 11
     Gwen, 8, 70, 15
     ```

3. **Analyze the Remaining Data**:
   - The ages of the remaining penguins are: 7, 9, 8.

4. **Count the Penguins Less Than 8 Years Old**:
   - Among the ages 7, 9, and 8, only one penguin (Louis) is less than 8 years old.

5. **Determine the Correct Option**:
   - The count of penguins less than 8 years old is 1.
   - Matching this count to the given options:
     - (A) 1
     - (B) 2
     - (C) 3
     - (D) 4
     - (E) 5

The final answer is A.


In [91]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [92]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(A), D.



0.9931506849315068
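
The comparison in `In [91]` keeps the first character that remains after stripping punctuation. That raises `IndexError` when nothing remains (for example a prediction of `"."`) and treats any leading character as an option. A hedged alternative is sketched below, assuming options are single capital letters, optionally wrapped in parentheses or quotes. `extract_option` and `bbh_choice` are hypothetical names introduced here for illustration.

```python
import re


def extract_option(text):
    """Return the leading option letter, e.g. 'A' from '(A) foo.' or 'A.', else None."""
    if not text:
        return None
    # Optional whitespace, optional opening quote/parenthesis, then one capital
    # letter that is not the start of a longer word.
    match = re.match(r'\s*["(]*([A-Z])\b', text)
    return match.group(1) if match else None


def bbh_choice(y_i, y_pred_i):
    """True when both sides yield the same option letter."""
    pred = extract_option(y_pred_i)
    return pred is not None and pred == extract_option(y_i)


print(extract_option("(A) Modifiers or Adjectives."))
print(extract_option("the count of the remaining chargers."))
print(bbh_choice("(A)", "D."))
```

Like the original lambda, this scores an ambiguous prediction such as `A and D.` on its first letter only.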

# reasoning_about_colored_objects

In [93]:
subset = 'reasoning_about_colored_objects'

In [94]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-reasoning_about_colored_objects/bbh-reasoning_about_colored_objects_eval')

In [95]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [96]:
print(dataset[0]["reasoning"])

1. **Identify the Main Question:**
   - Determine how many objects on the nightstand are neither yellow nor green.

2. **List All Objects and Their Colors:**
   - Black necklace
   - Green fidget spinner
   - Blue keychain
   - Yellow sheet of paper
   - Red stress ball

3. **Systematically Evaluate Each Object's Color:**
   - Black necklace: Not yellow or green.
   - Green fidget spinner: Green.
   - Blue keychain: Not yellow or green.
   - Yellow sheet of paper: Yellow.
   - Red stress ball: Not yellow or green.

4. **Count Objects That Are Neither Yellow Nor Green:**
   - Black necklace: Count +1
   - Blue keychain: Count +1
   - Red stress ball: Count +1

5. **Select the Correct Option:**
   - Compare the count from step 4 with the options provided:
     - (A) zero
     - (B) one
     - (C) two
     - (D) three
     - (E) four
     - (F) five
     - (G) six

The count of objects that are neither yellow nor green is 3.

The final answer is D.


In [97]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [98]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(J), F.

(B), C.

(B), C.

(A), B.

(D), E.

(B), the count of the remaining scrunchiephone chargers.

(B), A.

(B), None



0.968

# ruin_names

In [99]:
subset = 'ruin_names'

In [100]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-ruin_names/bbh-ruin_names_eval')

In [101]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [102]:
print(dataset[0]["reasoning"])

1. **Understand the Main Goal**
   - The main goal is to identify the humorous edit of 'star wars' from the given options.

2. **Identify Comedic Elements**
   - Comedic elements include wordplay, puns, phonetic humor, and unexpected twists.

3. **Generate a List of Humorous Editing Techniques**
   - Techniques might include changing a letter to create a pun (e.g., 'star warts'), altering phonetics for a humorous effect (e.g., 'star warws'), or using wordplay (e.g., 'start wars').

4. **Analyze Each Option for Humor**
   - Option (A) stpr wars:
     - Does it use wordplay? No.
     - Does it create a pun? No.
     - Does it alter the phonetics in a humorous way? No.
     - Does it introduce an unexpected twist? No.
   - Option (B) start wars:
     - Does it use wordplay? Yes.
     - Does it create a pun? No.
     - Does it alter the phonetics in a humorous way? No.
     - Does it introduce an unexpected twist? Yes.
   - Option (C) star warts:
     - Does it use wordplay? Yes.
     - Do

In [103]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [104]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(B), C.

(B), A.

(B), C.

(B), C.

(A), C.

(C), B.

(A), D.

(B), D.

(B), C.

(A), D.

(B), C.

(C), A.

(A), B.

(A), D.

(C), D.

(B), A.

(A), D.

(B), C.

(D), B.

(D), B.

(A), C.

rita, sue and bob poo, D.

(D), C.

(D), C.

(A), D.

(D), A.

(C), B.

(B), C.

(C), B.

(B), A.

(A), B.

(B), C.

(A), C.

(C), D.

(D), C.

(A), D.

(B), D.

(C), A.

(C), D.

(D), C.

(A), B.

(D), B.

(D), C.

(B), D.

(B), C.

(D), A.

(C), A.

(C), A.

(B), D.

(A), D.

(B), C.

(B), C.

(C), D.

dearth, wind, & fire, G.

(D), A.

(C), D.

(A), D.

(C), (A)**.

(C), D.



0.764

# salient_translation_error_detection

In [105]:
subset = 'salient_translation_error_detection'

In [106]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-salient_translation_error_detection/bbh-salient_translation_error_detection_eval')

In [107]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [108]:
print(dataset[0]["reasoning"])

1. **Understand the Task and Error Types:**
   - The task is to identify the error in the translation from German to English.
   - The error types are: Named Entities, Numerical Values, Modifiers or Adjectives, Negation or Antonyms, Facts, and Dropped Content.

2. **Read the Source Text and Translation:**
   - Source: "Der Potsdamer Platz ist ein platzartiger Verkehrsknotenpunkt in den Berliner Ortsteilen Mitte und Tiergarten im Bezirk Mitte zwischen der alten Innenstadt im Osten und dem neuen Berliner Westen."
   - Translation: "Potsdamer Platz is a square-like hub in the Berlin districts of Mitte and Tiergarten in the Mitte district between the old city centre in the east and the new west of Berlin."

3. **Identify Key Components in the Source Text:**
   - Named entities: Potsdamer Platz, Berlin, Mitte, Tiergarten, Bezirk Mitte, alten Innenstadt, neuen Berliner Westen.
   - Numerical values: None.
   - Modifiers or adjectives: platzartiger, alten, neuen.
   - Negation or antonyms: No

In [109]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [110]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(A), E.

(F), (A) Modifiers or Adjectives.

(A), E.

(F), (D) Named Entities.

(F), (A) Modifiers or Adjectives.

(B), D.

(F), (A) Modifiers or Adjectives.

(F), E.

(A), D.

(F), D.

(F), D.

(F), A.

(F), D.

(A), D.

(F), C.

(C), D.

(C), D.

(D), E.

(F), A.

(F), D.

(A), D.

(B), E.

(F), D.

(F), D.

(D), E.

(C), F.

(A), D.

(C), F.

(F), (A) Modifiers or Adjectives.

(F), B.

(F), B.

(A), (E) Dropped Content.

(F), A.

(F), A.

(A), D.

(F), C.

(E), B.

(F), D.

(F), (A) Modifiers or Adjectives.

(F), A.

(C), A.

(B), E.

(C), (A) Modifiers or Adjectives.

(C), (A) Modifiers or Adjectives.

(B), E.

(A), D.

(F), (D) Named Entities.

(C), D.

(C), (B) Numerical Values.

(F), D.

(A), D.

(F), (A) Modifiers or Adjectives.

(E), F.

(E), D.

(F), (A) Modifiers or Adjectives.

(A), D.

(F), D.

(C), B.

(F), D.

(A), D.

(A), D.

(B), D.

(C), A.

(A), E.

(A), D.

(F), D.

(D), (E) Dropped Content.

(A), D.

(D), None

(C), (A) Modifiers or Adjectives.

(D), F.

(A), D.

(

0.7

# sports_understanding

In [15]:
subset = 'sports_understanding'

In [16]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-sports_understanding/bbh-sports_understanding_eval')

In [17]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [18]:
print(dataset[0]["reasoning"])

### Step-by-Step Reasoning Plan

To determine the plausibility of the sentence "Tyreek Hill caught the screen pass," follow these steps:

1. **Identify Key Assumptions**
   - **Step 1.1**: List the key assumptions underlying the sentence.
     - Assumption 1: Tyreek Hill is a football player.
     - Assumption 2: A screen pass is a valid term in football.
     - Assumption 3: Tyreek Hill has the capability to catch a screen pass.

2. **Critical Thinking**
   - **Step 2.1**: Analyze the sentence from different perspectives.
     - Perspective 1: Football terminology.
     - Perspective 2: Player capabilities.
     - Perspective 3: Game context.
   - **Step 2.2**: Question the assumptions.
     - Is Tyreek Hill known for catching screen passes?
     - Is a screen pass a common play in football?
   - **Step 2.3**: Evaluate the information available.
     - Are there any known instances of Tyreek Hill catching a screen pass?
     - What is the success rate of screen passes in general?

3. 

In [19]:
set(dataset["answer_pred"]), set(dataset["target"])

({'False.',
  'False.**',
  None,
  'Plausible.',
  'The sentence "Blake Snell hit a single" is plausible.',
  'The sentence is less plausible but not impossible.',
  'The sentence is plausible if supported by evidence.',
  'The sentence is plausible.',
  'True (The sentence is plausible in a metaphorical sense within sports commentary).',
  'True.',
  'Yes.',
  '[answer].',
  'not plausible.',
  'that the sentence "Robert Woods killed the powerplay" is plausible if interpreted metaphorically in a football context, but it is speculative without specific evidence.'},
 {'no', 'yes'})

In [20]:
print(dataset.filter(lambda x: x["answer_pred"] is None)[0]["reasoning"])

Filter:   0%|          | 0/250 [00:00<?, ? examples/s]

### Step-by-Step Reasoning

1. **Identify Key Assumptions**
   - **Bryce Harper**: A professional baseball player known for his role as an outfielder.
   - **Fumbled**: Typically refers to dropping or mishandling the ball, often used in football contexts.
   - **The ball**: In baseball, this refers to the baseball itself.

2. **Critical Thinking**
   - **Role and Abilities**: Bryce Harper is a skilled outfielder, known for his hitting and fielding abilities.
   - **Typical Actions**: In baseball, outfielders catch, throw, and hit the ball. Fumbling is not a typical term used in baseball.
   - **Potential Biases**: The term "fumbled" is more commonly associated with football, not baseball.

3. **Gather Relevant Data**
   - **Data Sources**: Sports statistics and news articles confirm Bryce Harper's role as an outfielder.
   - **Analysis**: Outfielders in baseball do not "fumble" the ball; they might drop it or make an error, but "fumble" is not the correct term.

4. **Consider Human Beh

In [21]:
def map_fn(ins):
    # Missing predictions are filled in by hand with a fixed "not plausible" answer.
    if ins["answer_pred"] is None:
        return {
            "answer_pred": 'The sentence "Bryce Harper fumbled the ball" is not plausible.'
        }

    return {
        "answer_pred": ins["answer_pred"]
    }

dataset = dataset.map(map_fn)

set(dataset["answer_pred"]), set(dataset["target"])

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

({'False.',
  'False.**',
  'Plausible.',
  'The sentence "Blake Snell hit a single" is plausible.',
  'The sentence "Bryce Harper fumbled the ball" is not plausible.',
  'The sentence is less plausible but not impossible.',
  'The sentence is plausible if supported by evidence.',
  'The sentence is plausible.',
  'True (The sentence is plausible in a metaphorical sense within sports commentary).',
  'True.',
  'Yes.',
  '[answer].',
  'not plausible.',
  'that the sentence "Robert Woods killed the powerplay" is plausible if interpreted metaphorically in a football context, but it is speculative without specific evidence.'},
 {'no', 'yes'})

In [24]:
print(dataset.filter(lambda x: x["answer_pred"] == '[answer].')[1]["reasoning"])

To determine the plausibility of the sentence "Kyle Tucker took a left footed shot," we need to follow a structured reasoning process:

1. **Identify the Task**:
   - The task is to evaluate the plausibility of the sentence "Kyle Tucker took a left footed shot."

2. **Critical Thinking**:
   - Analyze the sentence: "Kyle Tucker took a left footed shot."
   - Question assumptions: Is Kyle Tucker known to use his left foot? Is this a common action for him?
   - Evaluate the information: Does the sentence make logical sense based on known information about Kyle Tucker?

3. **Data and Information**:
   - Research relevant data about Kyle Tucker.
   - Look for sources such as sports databases, news articles, or interviews that mention his preferred foot.
   - Analyze the data: Does the data indicate that Kyle Tucker uses his left foot for shooting?

4. **Problem Type**:
   - Determine if the problem requires specific sports knowledge or terminology.
   - Assess if understanding the structur

In [25]:
set(dataset["answer_pred"]), set(dataset["target"])

({'False.',
  'False.**',
  'Plausible.',
  'The sentence "Blake Snell hit a single" is plausible.',
  'The sentence "Bryce Harper fumbled the ball" is not plausible.',
  'The sentence is less plausible but not impossible.',
  'The sentence is plausible if supported by evidence.',
  'The sentence is plausible.',
  'True (The sentence is plausible in a metaphorical sense within sports commentary).',
  'True.',
  'Yes.',
  '[answer].',
  'not plausible.',
  'that the sentence "Robert Woods killed the powerplay" is plausible if interpreted metaphorically in a football context, but it is speculative without specific evidence.'},
 {'no', 'yes'})

In [26]:
# Plausible (Yes)
plausible_yes = [
    'Plausible.',
    'The sentence "Blake Snell hit a single" is plausible.',
    'The sentence is plausible.',
    'True (The sentence is plausible in a metaphorical sense within sports commentary).',
    'True.',
    'Yes.',
]

# Implausible (No)
implausible_no = [
    'False.',
    'False.**',
    'The sentence "Bryce Harper fumbled the ball" is not plausible.',
    'not plausible.',
]

indeterminate = [
    'The sentence is less plausible but not impossible.',
    'The sentence is plausible if supported by evidence.',
    '[answer].',
    'that the sentence "Robert Woods killed the powerplay" is plausible if interpreted metaphorically in a football context, but it is speculative without specific evidence.',
]


def map_fn(ins):
    # Collapse the free-form answers into the "yes"/"no" labels used by the targets;
    # answers in neither list (the indeterminate ones) are left unchanged.
    if ins["answer_pred"] in plausible_yes:
        return {"answer_pred": "yes"}
    if ins["answer_pred"] in implausible_no:
        return {"answer_pred": "no"}
    return {"answer_pred": ins["answer_pred"]}

dataset = dataset.map(map_fn)
set(dataset["answer_pred"])

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

{'The sentence is less plausible but not impossible.',
 'The sentence is plausible if supported by evidence.',
 '[answer].',
 'no',
 'that the sentence "Robert Woods killed the powerplay" is plausible if interpreted metaphorically in a football context, but it is speculative without specific evidence.',
 'yes'}
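
The hand-written lists above follow a pattern that recurs later in the notebook: group raw answers under a canonical label and leave everything else untouched. A small sketch of a reusable helper is given below. `make_label_mapper` is a hypothetical name, and the toy rows stand in for dataset instances.

```python
def make_label_mapper(groups):
    """Build a map function from {canonical_label: [raw answers]}.

    Raw answers that appear in no group are returned unchanged.
    """
    lookup = {raw: label for label, raws in groups.items() for raw in raws}

    def map_fn(ins):
        answer = ins["answer_pred"]
        return {"answer_pred": lookup.get(answer, answer)}

    return map_fn


# Toy rows standing in for dataset instances.
mapper = make_label_mapper(
    {
        "yes": ["Plausible.", "True.", "Yes."],
        "no": ["False.", "not plausible."],
    }
)
rows = [
    {"answer_pred": "Plausible."},
    {"answer_pred": "not plausible."},
    {"answer_pred": "[answer]."},
]
mapped = [mapper(row)["answer_pred"] for row in rows]
print(mapped)
```

The returned function takes and returns the same dictionary shape as `map_fn`, so it could be passed to `dataset.map` in the same way.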

In [27]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

yes, no

yes, The sentence is plausible if supported by evidence.

yes, no

no, yes

yes, no

yes, no

yes, no

no, yes

yes, no

yes, no

yes, no

yes, no

yes, The sentence is less plausible but not impossible.

yes, no

no, yes

no, yes

no, [answer].

no, yes

no, yes

yes, no

yes, no

no, yes

no, that the sentence "Robert Woods killed the powerplay" is plausible if interpreted metaphorically in a football context, but it is speculative without specific evidence.

yes, no

no, yes

yes, no

yes, no

no, [answer].

yes, no

yes, no

no, yes

yes, no

yes, no

no, yes

no, yes

no, yes

yes, no

no, yes

yes, no

no, yes

yes, no

yes, no

yes, no

yes, no

no, yes

yes, no

yes, no



0.812

# temporal_sequences

In [131]:
subset = 'temporal_sequences'

In [132]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-temporal_sequences/bbh-temporal_sequences_eval')

In [133]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [134]:
print(dataset[0]["reasoning"])

### Reasoning Plan to Determine When Tiffany Could Have Gone to the Market

#### Step 1: Identify the Core Issue
- **Core Issue**: Determine the time window when Tiffany could have gone to the market.

#### Step 2: Break Down the Problem into Manageable Parts
- **Segment Tiffany's Day**: Divide Tiffany's day into distinct time blocks based on the given activities.

#### Step 3: Sequentially Analyze Each Time Block
- **Analyze Time Blocks**: Examine each time block to determine when Tiffany was not occupied.

#### Step 4: Use Relevant Data and Information
- **Data Sources**: Use the provided timestamps of Tiffany's activities to infer free time slots.

#### Step 5: Focus on Start and End Times of Each Activity
- **Identify Gaps**: Look at the start and end times of each activity to identify gaps in Tiffany's schedule.

#### Step 6: Create a Timeline of Tiffany's Activities
- **Timeline Creation**: List out all activities with their respective start and end times.

#### Step 7: Identify 

In [135]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [136]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(D), A and D.

(C), (A) and (C).



0.992

# tracking_shuffled_objects_five_objects

In [137]:
subset = 'tracking_shuffled_objects_five_objects'

In [138]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-tracking_shuffled_objects_five_objects/bbh-tracking_shuffled_objects_five_objects_eval')

In [139]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [140]:
print(dataset[0]["reasoning"])

### Step-by-Step Reasoning Plan

1. **Initialize Book Ownership:**
   - Alice: Ulysses
   - Bob: Moby Dick
   - Claire: The Pearl
   - Dave: The Odyssey
   - Eve: The Fellowship of the Ring

2. **Track First Trade (Alice and Claire swap books):**
   - Alice gives Ulysses to Claire.
   - Claire gives The Pearl to Alice.
   - Updated ownership:
     - Alice: The Pearl
     - Claire: Ulysses

3. **Track Second Trade (Eve and Claire swap books):**
   - Eve gives The Fellowship of the Ring to Claire.
   - Claire gives Ulysses to Eve.
   - Updated ownership:
     - Eve: Ulysses
     - Claire: The Fellowship of the Ring

4. **Track Third Trade (Claire and Bob swap books):**
   - Claire gives The Fellowship of the Ring to Bob.
   - Bob gives Moby Dick to Claire.
   - Updated ownership:
     - Claire: Moby Dick
     - Bob: The Fellowship of the Ring

5. **Track Fourth Trade (Dave and Claire swap books):**
   - Dave gives The Odyssey to Claire.
   - Claire gives Moby Dick to Dave.
   - Updated o

In [141]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [142]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

1.0

# tracking_shuffled_objects_seven_objects

In [143]:
subset = 'tracking_shuffled_objects_seven_objects'

In [144]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-tracking_shuffled_objects_seven_objects/bbh-tracking_shuffled_objects_seven_objects_eval')

In [145]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [146]:
print(dataset[0]["reasoning"])

1. **Initialize Player Positions:**
   - Alice: cheerleader
   - Bob: left winger
   - Claire: goalkeeper
   - Dave: right midfielder
   - Eve: center midfielder
   - Fred: benchwarmer
   - Gertrude: striker

2. **First Swap: Fred and Claire trade positions.**
   - Fred: goalkeeper
   - Claire: benchwarmer

3. **Second Swap: Gertrude and Alice trade positions.**
   - Gertrude: cheerleader
   - Alice: striker

4. **Third Swap: Fred and Dave trade positions.**
   - Fred: right midfielder
   - Dave: goalkeeper

5. **Fourth Swap: Fred and Claire trade positions.**
   - Fred: benchwarmer
   - Claire: right midfielder

6. **Fifth Swap: Alice and Bob trade positions.**
   - Alice: left winger
   - Bob: striker

7. **Sixth Swap: Dave and Bob trade positions.**
   - Dave: striker
   - Bob: goalkeeper

8. **Final Swap: Fred and Eve trade positions.**
   - Fred: center midfielder
   - Eve: benchwarmer

9. **Determine Bob's Final Position:**
   - Bob is playing goalkeeper.

The final answer is C.


In [147]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [148]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

(G), E.

(D), F.

(F), C.



0.988

# tracking_shuffled_objects_three_objects

In [149]:
subset = 'tracking_shuffled_objects_three_objects'

In [150]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-tracking_shuffled_objects_three_objects/bbh-tracking_shuffled_objects_three_objects_eval')

In [151]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [152]:
print(dataset[0]["reasoning"])

1. **Initial Setup**:
   - Alice: Ophelia
   - Bob: Lola
   - Claire: Izzi

2. **First Switch (Bob and Claire)**:
   - Bob's new partner: Izzi
   - Claire's new partner: Lola
   - Updated pairings:
     - Alice: Ophelia
     - Bob: Izzi
     - Claire: Lola

3. **Second Switch (Claire and Alice)**:
   - Claire's new partner: Ophelia
   - Alice's new partner: Lola
   - Updated pairings:
     - Alice: Lola
     - Bob: Izzi
     - Claire: Ophelia

4. **Third Switch (Alice and Bob)**:
   - Alice's new partner: Izzi
   - Bob's new partner: Lola
   - Updated pairings:
     - Alice: Izzi
     - Bob: Lola
     - Claire: Ophelia

5. **Final Evaluation**:
   - Determine Alice's final partner: Izzi
   - Compare with options: (A) Ophelia, (B) Lola, (C) Izzi

The final answer is C.


In [153]:
bbh = lambda y_i, y_pred_i: y_pred_i and y_i.translate(
    str.maketrans("", "", "()")
) == y_pred_i.translate(str.maketrans("", "", '.()"'))[0]

In [154]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

1.0

# web_of_lies

In [155]:
subset = 'web_of_lies'

In [156]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-web_of_lies/bbh-web_of_lies_eval')

In [157]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [158]:
print(dataset[0]["reasoning"])

Let's follow the reasoning plan step-by-step to determine whether Elanor tells the truth:

1. **Identify the Primary Claim**:
   - The primary claim is "Raymond tells the truth."

2. **Break Down the Statements**:
   - Raymond tells the truth.
   - Sal says Raymond lies.
   - Alexis says Sal lies.
   - Helene says Alexis lies.
   - Elanor says Helene lies.

3. **Generate Hypotheses**:
   - We will start with the hypothesis that Raymond tells the truth and evaluate the consistency of the subsequent statements.

4. **Evaluate Each Hypothesis**:
   - Assume Raymond tells the truth.
   - If Raymond tells the truth, then Sal's statement that Raymond lies is false.
   - If Sal's statement is false, then Alexis's statement that Sal lies is true.
   - If Alexis's statement is true, then Helene's statement that Alexis lies is false.
   - If Helene's statement is false, then Elanor's statement that Helene lies is true.

5. **Analyze Implications**:
   - If Raymond tells the truth, the sequence o

In [159]:
set(dataset["answer_pred"]), set(dataset["target"])

({'Alejandro tells the truth.',
  'Amberly tells the truth.',
  'Andree tells the truth.',
  'False.',
  'Inga does not tell the truth.',
  'No.',
  'Sherrie tells the truth.',
  'True.',
  'Yes, Jamey tells the truth.',
  'Yes.'},
 {'No', 'Yes'})

In [161]:
# Truth (Yes)
truth_yes = [
    'Alejandro tells the truth.',
    'Amberly tells the truth.',
    'Andree tells the truth.',
    'True.',
    'Yes, Jamey tells the truth.',
    'Yes.',
    'Sherrie tells the truth.'
]


# False (No)
false_no = [
    'False.',
    'Inga does not tell the truth.',
    'No.',
]


def map_fn(ins):
    # Collapse the free-form answers into the "Yes"/"No" labels used by the targets;
    # answers in neither list are left unchanged.
    if ins["answer_pred"] in truth_yes:
        return {"answer_pred": "Yes"}
    if ins["answer_pred"] in false_no:
        return {"answer_pred": "No"}
    return {"answer_pred": ins["answer_pred"]}

dataset = dataset.map(map_fn)
set(dataset["answer_pred"])

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

{'No', 'Yes'}

In [164]:
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"])) / dataset.num_rows

Calculating...: 0it [00:00, ?it/s]

No, Yes

No, Yes

No, Yes

No, Yes

Yes, No



0.98
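
Four of the five mismatches printed above are `No` targets predicted as `Yes`. A minimal sketch for tallying errors by direction rather than reading them off the printout is shown below. `tally_errors` is a hypothetical helper, and the toy lists only reproduce the five printed mismatches plus two correct rows.

```python
from collections import Counter


def tally_errors(y, y_pred):
    """Count (target, prediction) pairs where the two differ."""
    return Counter((t, p) for t, p in zip(y, y_pred) if t != p)


# Toy data: the five mismatches printed above, followed by two correct rows.
targets = ["No", "No", "No", "No", "Yes", "Yes", "No"]
preds = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "No"]

errors = tally_errors(targets, preds)
for (target, pred), count in errors.most_common():
    print(f"target={target} pred={pred}: {count}")
```

The sketch compares strings exactly, so it assumes predictions have already been mapped to the target labels as in `In [161]`.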

# word_sorting

In [28]:
subset = 'word_sorting'

In [29]:
path = here(os.path.join(base_path, f"bbh-{subset}", f"bbh-{subset}_eval"))
path

PosixPath('/home/ubuntu/dev/self-discover/evals/logs/phased_self_discover/mistral/unstructured/few_shot_5/bbh/bbh-word_sorting/bbh-word_sorting_eval')

In [30]:
dataset = Dataset.load_from_disk(path)
dataset

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 250
})

In [31]:
print(dataset[0]["reasoning"])

1. **Identify the List of Words**:
   - The list of words to sort is: `slurp`, `raytheon`, `gloucester`.

2. **Understand Alphabetical Order**:
   - Alphabetical order arranges words based on the sequence of letters in the alphabet.

3. **Break Down the Sorting Task**:
   - Compare each word with the others to determine their correct order.

4. **Choose a Sorting Method**:
   - We will use the bubble sort method for simplicity.

5. **Implement the Sorting Method**:
   - **Step 5.1**: Compare `slurp` and `raytheon`.
     - `slurp` comes after `raytheon` in the alphabet, so swap them.
     - New list: `raytheon`, `slurp`, `gloucester`.
   - **Step 5.2**: Compare `slurp` and `gloucester`.
     - `slurp` comes after `gloucester` in the alphabet, so swap them.
     - New list: `raytheon`, `gloucester`, `slurp`.
   - **Step 5.3**: Repeat the comparison and swapping process.
     - Compare `raytheon` and `gloucester`: `raytheon` comes before `gloucester`, no swap needed.
     - Compare `glouc

In [32]:
answer_pred_list = [x.translate(str.maketrans("", "", ".'")) for x in dataset["answer_pred"] if x and '[' in x]
len(answer_pred_list)

23

In [33]:
answer_pred_list[0].translate(str.maketrans("", "", "[]")).split(", ")

['"abo"',
 '"armful"',
 '"bonaventure"',
 '"cremate"',
 '"dictatorial"',
 '"embryology"',
 '"frond"',
 '"gasify"',
 '"guiana"',
 '"herman"',
 '"indistinguishable"',
 '"oscillatory"',
 '"pancreatic"',
 '"passenger"',
 '"referential"',
 '"stockholder"',
 '"tip"',
 '"through"']

In [34]:
set(dataset["answer_pred"])

{'"abdominal", "address", "berry", "bounty", "effusive", "fomalhaut", "hanoverian", "involve", "islamabad", "jordan", "optimal", "pay", "stearic", "stigmata", "swathe", "tattoo", "them", "tornado", "yang"',
 '"amethyst", "bathos", "dormouse", "obtuse", "resignation", "walt".',
 '"apprehension", "cashew", "ensemble".',
 '"bandwidth", "hidebound", "wreak".',
 '"blest", "buxton", "consternate", "proximity", "quizzes", "sound", "tariff", "xerxes".',
 '"borough", "hyperboloidal".',
 '"broaden", "envy".',
 '"bucolic," "oblong," "whoosh."',
 '"built", "poland", "swab", "thunderclap".',
 '"chrysalis", "wallaby".',
 '"extempore", "gotten".',
 '"faery", "fiction", "horehound", "heterozygous", "overture", "confidential", "ursa".',
 '"greasy," "lapidary," "mark."',
 '"haughty", "seashore".',
 '"laudatory", "shakespearian".',
 '"muddy", "nascent".',
 '"neff", "nicodemus", "sortie".',
 '"syndrome", "therefrom".',
 None,
 '["abo", "armful", "bonaventure", "cremate", "dictatorial", "embryology", "fron

In [35]:
for instance in dataset.filter(lambda x: x["answer_pred"] is None):
    print(instance["trajectory"])

Filter:   0%|          | 0/250 [00:00<?, ? examples/s]

To sort the given list of words alphabetically, we will follow the step-by-step reasoning plan:

1. **Understand the Task**:
   - The task is to sort the list of words alphabetically.
   - The list of words is: summand, cure, gloria, tyke, doubtful, extoller, entropy, crackle, procedural, cottrell, litigant, bologna.

2. **Simplify the Sorting Task**:
   - Recognize that alphabetical sorting involves comparing words based on their first letter, then the second letter if the first letters are the same, and so on.

3. **Break Down the List into Smaller Groups**:
   - To make the task more manageable, break down the list into smaller groups based on the first letter of each word.
   - Group words starting with the same letter together.

4. **Systematic or Algorithmic Approach**:
   - Use a systematic approach such as the bubble sort, insertion sort, or any other sorting algorithm to sort the words.
   - For simplicity, we will use the insertion sort method.

5. **Step-by-Step Plan to Sort

In [36]:
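import re

ANSWER_MARKER = "The final answer is:\n"
# Everything after the marker; DOTALL lets "." span multi-line answers.
ANSWER_PATTERN = re.compile(rf"(?<={re.escape(ANSWER_MARKER)}).*", re.DOTALL)


def map_fn(ins):
    # Rows that already have a parsed answer pass through unchanged.
    if ins["answer_pred"] is not None:
        return {"trajectory": ins["trajectory"], "answer_pred": ins["answer_pred"]}

    response = ins["trajectory"]
    match = ANSWER_PATTERN.search(response) if response else None
    if match is None:
        # No marker found: leave the row unparsed.
        return {"trajectory": response, "answer_pred": None}

    answer = match.group(0).translate(str.maketrans("", "", "`")).strip()
    # Reuse the compiled pattern so DOTALL also applies when removing the answer
    # (a plain re.sub without the flag would only strip its first line).
    trajectory = ANSWER_PATTERN.sub("", response).replace(ANSWER_MARKER, "").strip()
    return {"trajectory": trajectory, "answer_pred": answer}


dataset = dataset.map(map_fn)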

Map:   0%|          | 0/250 [00:00<?, ? examples/s]
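
The extraction step above can be tried in isolation. The following is a minimal sketch, not the notebook's exact code: `split_answer` is a hypothetical helper, the toy trajectory is invented, and the marker string is assumed to match the one used in the cell above.

```python
import re

# Assumed marker, copied from the cell above.
MARKER = "The final answer is:\n"
# Lookbehind keeps the marker out of the match; DOTALL lets "." cross newlines.
PATTERN = re.compile(rf"(?<={re.escape(MARKER)}).*", re.DOTALL)


def split_answer(response):
    """Return (reasoning, answer); answer is None when the marker is absent."""
    match = PATTERN.search(response)
    if match is None:
        return response, None
    answer = match.group(0).replace("`", "").strip()
    # The match starts right after the marker, so step back over it.
    reasoning = response[: match.start() - len(MARKER)].strip()
    return reasoning, answer


# Toy trajectory, invented for illustration.
toy = "Step 1: compare first letters.\nThe final answer is:\n`apple`\n`berry`"
reasoning, answer = split_answer(toy)
print(repr(reasoning))
print(repr(answer))
```

Slicing at the match position, rather than substituting the pattern away, avoids having to repeat the regex flags in a second call.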

In [37]:
print(dataset.filter(lambda x: x["answer_pred"] is None))

Filter:   0%|          | 0/250 [00:00<?, ? examples/s]

Dataset({
    features: ['input', 'target', 'self_discover_input', 'few_shot_examples', 'task_description', 'selected_modules', 'adapted_modules', 'reasoning_plan', 'reasoning', 'trajectory', 'answer_pred'],
    num_rows: 2
})


In [38]:
def map_fn(ins):
    # Leave unparsed rows untouched.
    if ins["answer_pred"] is None:
        return {"answer_pred": None}

    # Decode literal escape sequences (such as a backslash followed by "n") and drop periods.
    answer_pred = ins["answer_pred"].encode().decode("unicode_escape").replace(".", "")

    def strip_quotes(word):
        # Remove a single leading or trailing single quote.
        return re.sub(r"^'|'$", "", word)

    try:
        if "[" in answer_pred:
            # Python-style list: ["a", "b"]
            items = answer_pred.translate(str.maketrans("", "", "[]")).replace('"', "").split(", ")
            refined_answer = " ".join(strip_quotes(word) for word in items)
        elif "," in answer_pred:
            # Comma-separated, quoted words: "a", "b"
            items = answer_pred.replace('"', "").split(", ")
            refined_answer = " ".join(strip_quotes(word) for word in items)
        elif "1" in answer_pred or "-" in answer_pred:
            # Numbered or bulleted list: one "<marker> <word>" pair per line
            refined_answer = " ".join(pair.split(" ")[1] for pair in answer_pred.split("\n"))
        else:
            # One word per line
            refined_answer = " ".join(answer_pred.split("\n"))

        # Collapse double spaces left over from joining.
        refined_answer = " ".join(refined_answer.split("  "))
    except Exception:
        # Fall back to the lightly cleaned string if the format is unexpected.
        refined_answer = answer_pred

    return {"answer_pred": refined_answer.lower()}


dataset = dataset.map(map_fn)

Map:   0%|          | 0/250 [00:00<?, ? examples/s]
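
As an alternative to branching on each output format, the words can be pulled out with a single regular expression. This is a sketch under stated assumptions, not the approach used in the cell above: `normalize_word_list` is a hypothetical helper, the sample strings are invented to mirror the formats seen in the earlier outputs, and it assumes every word consists only of ASCII letters with optional internal apostrophes.

```python
import re


def normalize_word_list(raw):
    """Reduce a loosely formatted word list to lowercase words joined by single spaces."""
    # Keep runs of letters, allowing an internal apostrophe (as in "shouldn't").
    # Quotes, brackets, commas, periods, digits, and dashes are skipped.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", raw)
    return " ".join(word.lower() for word in words)


# Toy inputs, invented to mirror the formats seen in the outputs above.
samples = [
    '"broaden", "envy".',
    '["abo", "armful"]',
    "1. coven\n2. shouldn't",
    "- tip\n- through",
    '"greasy," "lapidary," "mark."',
]
for sample in samples:
    print(normalize_word_list(sample))
```

The trade-off is that this version silently drops anything that is not a letter, so it would mangle hyphenated words or words containing digits if the data had any.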

In [41]:
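# 16 is a hard-coded adjustment added to the automatically counted matches; it is kept from the original cell.
manual_adjustment = 16
(calculate_correct_prediction_count("bbh", dataset["target"], dataset["answer_pred"]) + manual_adjustment) / dataset.num_rows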

Calculating...: 0it [00:00, ?it/s]

gloucester raytheon slurp, `raytheon` `gloucester` `slurp`

chlorate glidden incentive judicatory lavoisier manatee spurt, spurt chlorate glidden incentive lavoisier judicatory manatee

coven disturb etruscan lorenz plastisol runneth shouldn't skintight swept, `coven` `disturb` `etruscan` `lorenz` `plastisol` `runneth` `shouldn't` `skintight` `swept`

confess croupier daffy dockyard duty household hypothesis info loam mandate mantic minstrelsy nepotism peccary sawtimber serenade silver summate triode, croupier daffy dockyard duty household hypothesis info loam mandate mantic minstrelsy nepotism peccary sawtimber serenade silver summate triode

acclaim champ clothbound commodity conclusion delirious dyestuff exempt gadwall hayes hood hypothalamus jigsaw lozenge pipeline plentiful sarcastic seashell sensory teen, acclaim champ clothbound commodity delirious dyestuff exempt gadwall hayes hood hypothalamus jigsaw lozenge pipeline plentiful sarcastic sensory seashell teen

dateline househol

0.688
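
The mismatch log above is hard to scan by eye. A word-by-word comparison can point at where a prediction first diverges from the target. This is a minimal sketch: `first_mismatch` is a hypothetical helper that is not part of the notebook, and the example pair is copied from the log above.

```python
def first_mismatch(target, prediction):
    """Return (index, expected, got) for the first differing word, or None if equal."""
    expected_words = target.split()
    predicted_words = prediction.split()
    for index, (expected, got) in enumerate(zip(expected_words, predicted_words)):
        if expected != got:
            return index, expected, got
    if len(expected_words) != len(predicted_words):
        # One sequence is a prefix of the other; report the first missing or extra word.
        index = min(len(expected_words), len(predicted_words))
        expected = expected_words[index] if index < len(expected_words) else None
        got = predicted_words[index] if index < len(predicted_words) else None
        return index, expected, got
    return None


# Pair copied from the mismatch log above.
target = "chlorate glidden incentive judicatory lavoisier manatee spurt"
prediction = "spurt chlorate glidden incentive lavoisier judicatory manatee"
print(first_mismatch(target, prediction))
```

A result of `None` for a logged pair would suggest the difference lies in whitespace or punctuation rather than in word order.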

In [40]:
dataset.save_to_disk(os.path.join(os.path.dirname(path), "bbh-word_sorting_eval_refined"))

Saving the dataset (0/1 shards):   0%|          | 0/250 [00:00<?, ? examples/s]