# Prepare derivative training data for np v_dat_p2 np np recipient pp modification

Now added to the pull request at https://github.com/willy-b/stanford-xcs224u-final-project/pull/8 (previously in https://github.com/willy-b/stanford-xcs224u-final-project/pull/7, before the branch was renamed).
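
To make the goal concrete, here is a minimal sketch of the sentence-level change this notebook prepares: the prepositional phrase is moved from the theme (third np) to the recipient (second np). The name `move_pp_from_theme_to_recipient` is hypothetical, and the sketch assumes the sentence ends in `<det> <theme> <prep> <det> <noun> .`; it mirrors the slicing used by the notebook's own helper but is not a replacement for it.

```python
def move_pp_from_theme_to_recipient(sentence):
    """Move a trailing 3-token PP (prep det noun) from the theme NP to the recipient NP.

    Assumption: the sentence ends with <det> <theme> <prep> <det> <noun> .
    """
    tokens = sentence.replace(" .", "").strip().split(" ")
    head = tokens[:-5]     # agent, verb, and recipient NP
    theme = tokens[-5:-3]  # determiner + theme noun
    pp = tokens[-3:]       # preposition + determiner + noun
    return " ".join(head + pp + theme) + " ."

original = "Emma gave a landlord the box in a house ."
moved = move_pp_from_theme_to_recipient(original)
print(moved)
```

The example sentence appears in the sample output further down. Sentences whose recipient is a proper noun cannot take the moved pp under the COGS grammar, which is why the notebook re-parses each reordered sentence and skips the ones that fail.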

In [None]:
# Objective
# In the ReCOGS paper (Wu et al 2023; ReCOGS: How Incidental Details of a Logical Form Overshadow an Evaluation of Semantic Interpretation; https://arxiv.org/abs/2303.13716), they write:
#
# "The hardest COGS split based on published numbers seems to be the structural generalization task
# that involves interpreting novel combinations of modified phrases and grammatical role – e.g.,
# interpreting subject noun phrases with PP modifiers when the train set includes only object noun
# phrases with such modifiers (Noah ate the cake on the plate. → The cake on the plate burned).
# To the best of our knowledge, all prior seq2seq models have completely failed to get traction here"
#
# Possible interpretation of such prior work (Wu et al 2023, ReCOGS, and the COGS paper's object-pp-to-subject-pp modification generalization split): if the model that translates English sentences to a semantic representation (logical form) has not learned to modify the subject with a prepositional phrase, then it cannot modify the subject.
#
# It follows that if we can get the model to consistently modify the subject accurately without training it to do so, then this claim is false (regardless of whether the subject is on the left or right of the verb, which is typically the word whose relationships with it break when LF translation fails).
#
# Alternative interpretation of prior work: a general dependency parsing issue. If the model has not learned to modify nouns with prepositional phrases when they are on the left of a parent (a verb or np they have a dependency relationship with; e.g. for a verb, the noun could be its agent, theme, or recipient), then it cannot modify such a noun with a prepositional phrase.
#
# Consider, in the limit, the difference between np v_dat_p2 np np style sentences where the recipient is modified vs. the theme, with some long filler content:
# "Emma gave a friend a cookie" ->
# "Emma gave a friend a cookie (filler filler filler)" (the model is not going to miss that a cookie was given as more filler is added)
# vs
# "Emma gave a friend (filler filler filler) a cookie" (as the filler gets longer and has more distractors, the model may miss that a cookie was given to the friend, because the distance between friend and cookie increases and distractors, including nouns, are added in between)
#
# Because modifiers are appended to the right of the noun they modify, in left-to-right order, there is an asymmetry between modifying a dependent on the left of the parent vs. a dependent on the right of the parent.
#
# Distinguishing observations:
#
#     Observe that something that is definitively not the subject (not just a theme in a passive vs. an agent in an active) fails to generalize (maybe np v_dat_p2 np np!).
#
#     Check that, regardless of whether the role is agent, theme, or recipient, if the noun is on the left we do not observe generalization from the right, as such generalization would reject the 2nd hypothesis (we can also confirm that after training with it on the left, the model can generalize to the right).
#
#     Since English has SVO order, I am not sure it is easy to show the subject being modified when it is not on the left, and swapping the agent/theme side (as with a passive verb) may swap the subject role. That is why np v_dat_p2 np np is appealing: we do pp modification on the recipient np instead of the theme np, a situation that does not even involve a verb relationship. It would demonstrate the same mechanism without getting confused with subject/object or anything specific to verbs.
#

# Here's a left/right example which does not even involve a verb and does not involve the subject of the sentence.
#
# Take existing train examples of np v_dat_p2 np np form (agent, recipient, theme); we can even take those which already have a prepositional phrase on the right np.
#
# Just transfer that pp to the middle np, and we should observe that the model gets the roles wrong (recipient or theme) if it hasn't trained on that.
#
# In this particular pattern modifying the theme should be safe (in other patterns, where the theme is left of the verb, it should not be), but modifying the recipient with a prepositional phrase here should break something, probably the theme.
#
# We will use ideas learned from working with the hand-written Restricted Access Sequence Processing (RASP) model and test them on Transformers (the default model from Wu et al 2023). So this is custom data, but the model is the other authors', which we will train from scratch here based on ideas from RASP, which is checked out in another notebook.

# Here I am getting data ready for use by other notebooks. We use the Lark parser here to check the form of the different sentences, as we want to filter on syntax, not semantics.

import os
import subprocess
import pandas as pd

print("Load official Wu et al 2023 ReCOGS training examples\n(sample from https://raw.githubusercontent.com/frankaging/ReCOGS/refs/heads/main/recogs_positional_index/train.tsv , associated with https://arxiv.org/abs/2303.13716 )")
if not os.path.exists("train_no_sprinkles_or_cp.tsv"):
    # one of the official authors' datasets for the ReCOGS paper
    subprocess.run("wget https://raw.githubusercontent.com/frankaging/ReCOGS/refs/heads/main/recogs_positional_index/train.tsv", shell=True)
    subprocess.run("echo 'COGS Sentence	ReCOGS Logical Form	Distribution' > train_no_sprinkles_or_cp.tsv", shell=True)
    subprocess.run("cat train.tsv | grep 'in_distribution' | grep -v 'preposing' | grep -v 'sprinkle' | grep -v 'that' >> train_no_sprinkles_or_cp.tsv", shell=True)

recogs_train_examples = pd.read_csv("train_no_sprinkles_or_cp.tsv", delimiter="	")

# We do NOT use Lark with the Restricted Access Sequence Processing (RASP) model.
# Here we prepare training data for a separate experiment: testing, on a Wu et al 2023 Transformer that learns from examples,
# something we learned from the RASP model (programmed by hand in a language that expresses Transformer capabilities).
# We use Lark to filter for examples with certain syntax and to help modify them.
subprocess.run("pip install lark --upgrade", shell=True)

from lark import Lark

# COGS grammar in Lark format per IBM CPG project (we don't use any CPG code just their description of COGS grammar from their utilities here for data preparation)
# https://github.com/IBM/cpg/blob/c3626b4e03bfc681be2c2a5b23da0b48abe6f570/src/model/cogs_data.py#L523
grammar = '''
start: s1 | s2 | s3 | s4 | vp_internal
    s1: np vp_external
    s2: np vp_passive
    s3: np vp_passive_dat
    s4: np vp_external4
    vp_external: v_unerg | v_trans_omissible_p1 | vp_external1 | vp_external2 | vp_external3 | vp_external5 | vp_external6 | vp_external7
    vp_external1: v_unacc_p1 np
    vp_external2: v_trans_omissible_p2 np
    vp_external3: v_trans_not_omissible np
    vp_external4: v_inf_taking to v_inf
    vp_external5: v_cp_taking that start
    vp_external6: v_dat_p1 np pp_iobj
    vp_external7: v_dat_p2 np np
    vp_internal: np v_unacc_p2
    vp_passive: vp_passive1 | vp_passive2 | vp_passive3 | vp_passive4 | vp_passive5 | vp_passive6 | vp_passive7 | vp_passive8
    vp_passive1: was v_trans_not_omissible_pp_p1
    vp_passive2: was v_trans_not_omissible_pp_p2 by np
    vp_passive3: was v_trans_omissible_pp_p1
    vp_passive4: was v_trans_omissible_pp_p2 by np
    vp_passive5: was v_unacc_pp_p1
    vp_passive6: was v_unacc_pp_p2 by np
    vp_passive7: was v_dat_pp_p1 pp_iobj
    vp_passive8: was v_dat_pp_p2 pp_iobj by np
    vp_passive_dat: vp_passive_dat1 | vp_passive_dat2
    vp_passive_dat1: was v_dat_pp_p3 np
    vp_passive_dat2: was v_dat_pp_p4 np by np
    np: np_prop | np_det | np_pp
    np_prop: proper_noun
    np_det: det common_noun
    np_pp: np_det pp np
    pp_iobj: to np
    det: "the" | "a"
    pp: "on" | "in" | "beside"
    was: "was"
    by: "by"
    to: "to"
    that: "that"
    common_noun: "girl" | "boy" | "cat" | "dog" | "baby" | "child" | "teacher" | "frog" | "chicken" | "mouse" | "lion" | "monkey" | "bear" | "giraffe" | "horse" | "bird" | "duck" | "bunny" | "butterfly" | "penguin" | "student" | "professor" | "monster" | "hero" | "sailor" | "lawyer" | "customer" | "scientist" | "princess" | "president" | "cow" | "crocodile" | "goose" | "hen" | "deer" | "donkey" | "bee" | "fly" | "kitty" | "tiger" | "wolf" | "zebra" | "mother" | "father" | "patient" | "manager" | "director" | "king" | "queen" | "kid" | "fish" | "moose" | "pig" | "pony" | "puppy" | "sheep" | "squirrel" | "lamb" | "turkey" | "turtle" | "doctor" | "pupil" | "prince" | "driver" | "consumer" | "writer" | "farmer" | "friend" | "judge" | "visitor" | "guest" | "servant" | "chief" | "citizen" | "champion" | "prisoner" | "captain" | "soldier" | "passenger" | "tenant" | "politician" | "resident" | "buyer" | "spokesman" | "governor" | "guard" | "creature" | "coach" | "producer" | "researcher" | "guy" | "dealer" | "duke" | "tourist" | "landlord" | "human" | "host" | "priest" | "journalist" | "poet" | "hedgehog" | "shark" | "cockroach" | "cobra" | "hippo" | "cake" | "donut" | "cookie" | "box" | "rose" | "drink" | "raisin" | "melon" | "sandwich" | "strawberry" | "ball" | "balloon" | "bat" | "block" | "book" | "crayon" | "chalk" | "doll" | "game" | "glue" | "lollipop" | "hamburger" | "banana" | "biscuit" | "muffin" | "pancake" | "pizza" | "potato" | "pretzel" | "pumpkin" | "sweetcorn" | "yogurt" | "pickle" | "jigsaw" | "pen" | "pencil" | "present" | "toy" | "cracker" | "brush" | "radio" | "cloud" | "mandarin" | "hat" | "basket" | "plant" | "flower" | "chair" | "spoon" | "pillow" | "gumball" | "scarf" | "shoe" | "jacket" | "hammer" | "bucket" | "knife" | "cup" | "plate" | "towel" | "bottle" | "bowl" | "can" | "clock" | "jar" | "penny" | "purse" | "soap" | "toothbrush" | "watch" | "newspaper" | "fig" | "bag" | "wine" | "key" | "weapon" | "brain" | "tool" | "crown" | "ring" | "leaf" | "fruit" | "mirror" | "beer" | "shirt" | "guitar" | "chemical" | "seed" | "shell" | "brick" | "bell" | "coin" | "button" | "needle" | "molecule" | "crystal" | "flag" | "nail" | "bean" | "liver" | "table" | "stage" | "bed" | "chair" | "stool" | "road" | "tree" | "box" | "surface" | "seat" | "speaker" | "computer" | "rock" | "boat" | "cabinet" | "tv" | "plate" | "desk" | "bowl" | "bench" | "shelf" | "cloth" | "piano" | "bible" | "leaflet" | "sheet" | "cupboard" | "truck" | "tray" | "notebook" | "blanket" | "deck" | "coffin" | "log" | "ladder" | "barrel" | "rug" | "canvas" | "tiger" | "towel" | "throne" | "booklet" | "sock" | "corpse" | "sofa" | "keyboard" | "book" | "pillow" | "pad" | "train" | "couch" | "bike" | "pedestal" | "platter" | "paper" | "rack" | "board" | "panel" | "tripod" | "branch" | "machine" | "floor" | "napkin" | "cookie" | "block" | "cot" | "device" | "yacht" | "dog" | "mattress" | "ball" | "stand" | "stack" | "windowsill" | "counter" | "cushion" | "hanger" | "trampoline" | "gravel" | "cake" | "carpet" | "plaque" | "boulder" | "leaf" | "mound" | "bun" | "dish" | "cat" | "podium" | "tabletop" | "beach" | "bag" | "glacier" | "brick" | "crack" | "vessel" | "futon" | "turntable" | "rag" | "chessboard" | "house" | "room" | "car" | "garden" | "box" | "cup" | "glass" | "bag" | "vehicle" | "hole" | "cabinet" | "bottle" | "shoe" | "storage" | "cot" | "vessel" | "pot" | "pit" | "tin" | "can" | "cupboard" | "envelope" | "nest" | "bush" | "coffin" | "drawer" | "container" | "basin" | "tent" | "soup" | "well" | "barrel" | "bucket" | "cage" | "sink" | "cylinder" | "parcel" | "cart" | "sack" | "trunk" | "wardrobe" | "basket" | "bin" | "fridge" | "mug" | "jar" | "corner" | "pool" | "blender" | "closet" | "pile" | "van" | "trailer" | "saucepan" | "truck" | "taxi" | "haystack" | "dumpster" | "puddle" | "bathtub" | "pod" | "tub" | "trap" | "bun" | "microwave" | "bookstore" | "package" | "cafe" | "train" | "castle" | "bunker" | "vase" | "backpack" | "tube" | "hammock" | "stadium" | "backyard" | "swamp" | "monastery" | "refrigerator" | "palace" | "cubicle" | "crib" | "condo" | "tower" | "crate" | "dungeon" | "teapot" | "tomb" | "casket" | "jeep" | "shoebox" | "wagon" | "bakery" | "fishbowl" | "kennel" | "china" | "spaceship" | "penthouse" | "pyramid" | "table" | "stage" | "bed" | "chair" | "book" | "road" | "tree" | "machine" | "house" | "seat" | "speaker" | "computer" | "rock" | "car" | "box" | "cup" | "glass" | "bag" | "flower" | "boat" | "vehicle" | "key" | "painting" | "cabinet" | "tv" | "bottle" | "cat" | "desk" | "shoe" | "mirror" | "clock" | "bench" | "bike" | "lamp" | "lion" | "piano" | "crystal" | "toy" | "duck" | "sword" | "sculpture" | "rod" | "truck" | "basket" | "bear" | "nest" | "sphere" | "bush" | "surgeon" | "poster" | "throne" | "giant" | "trophy" | "hedge" | "log" | "tent" | "ladder" | "helicopter" | "barrel" | "yacht" | "statue" | "bucket" | "skull" | "beast" | "lemon" | "whale" | "cage" | "gardner" | "fox" | "sink" | "trainee" | "dragon" | "cylinder" | "monk" | "bat" | "headmaster" | "philosopher" | "foreigner" | "worm" | "chemist" | "corpse" | "wolf" | "torch" | "sailor" | "valve" | "hammer" | "doll" | "genius" | "baron" | "murderer" | "bicycle" | "keyboard" | "stool" | "pepper" | "warrior" | "pillar" | "monkey" | "cassette" | "broker" | "bin"
    proper_noun: "emma" | "liam" | "olivia" | "noah" | "ava" | "william" | "isabella" | "james" | "sophia" | "oliver" | "charlotte" | "benjamin" | "mia" | "elijah" | "amelia" | "lucas" | "harper" | "mason" | "evelyn" | "logan" | "abigail" | "alexander" | "emily" | "ethan" | "elizabeth" | "jacob" | "mila" | "michael" | "ella" | "daniel" | "avery" | "henry" | "sofia" | "jackson" | "camila" | "sebastian" | "aria" | "aiden" | "scarlett" | "matthew" | "victoria" | "samuel" | "madison" | "david" | "luna" | "joseph" | "grace" | "carter" | "chloe" | "owen" | "penelope" | "wyatt" | "layla" | "john" | "riley" | "jack" | "zoey" | "luke" | "nora" | "jayden" | "lily" | "dylan" | "eleanor" | "grayson" | "hannah" | "levi" | "lillian" | "isaac" | "addison" | "gabriel" | "aubrey" | "julian" | "ellie" | "mateo" | "stella" | "anthony" | "natalie" | "jaxon" | "zoe" | "lincoln" | "leah" | "joshua" | "hazel" | "christopher" | "violet" | "andrew" | "aurora" | "theodore" | "savannah" | "caleb" | "audrey" | "ryan" | "brooklyn" | "asher" | "bella" | "nathan" | "claire" | "thomas" | "skylar" | "leo" | "lina" | "paula" | "charlie"
    v_trans_omissible_p1: "ate" | "painted" | "drew" | "cleaned" | "cooked" | "dusted" | "hunted" | "nursed" | "sketched" | "juggled" | "called" | "heard" | "packed" | "saw" | "noticed" | "studied" | "examined" | "observed" | "knew" | "investigated" | "baked"
    v_trans_omissible_p2: "ate" | "painted" | "drew" | "cleaned" | "cooked" | "dusted" | "hunted" | "nursed" | "sketched" | "juggled" | "called" | "heard" | "packed" | "saw" | "noticed" | "studied" | "examined" | "observed" | "knew" | "investigated" | "baked"
    v_trans_omissible_pp_p1: "eaten" | "painted" | "drawn" | "cleaned" | "cooked" | "dusted" | "hunted" | "nursed" | "sketched" | "juggled" | "called" | "heard" | "packed" | "seen" | "noticed" | "studied" | "examined" | "observed" | "known" | "investigated"
    v_trans_omissible_pp_p2: "eaten" | "painted" | "drawn" | "cleaned" | "cooked" | "dusted" | "hunted" | "nursed" | "sketched" | "juggled" | "called" | "heard" | "packed" | "seen" | "noticed" | "studied" | "examined" | "observed" | "known" | "investigated"
    v_trans_not_omissible: "liked" | "helped" | "found" | "loved" | "poked" | "admired" | "adored" | "appreciated" | "missed" | "respected" | "threw" | "tolerated" | "valued" | "worshipped" | "discovered" | "held" | "stabbed" | "touched" | "pierced" | "tossed"
    v_trans_not_omissible_pp_p1: "liked" | "helped" | "found" | "loved" | "poked" | "admired" | "adored" | "appreciated" | "missed" | "respected" | "thrown" | "tolerated" | "valued" | "worshipped" | "discovered" | "held" | "stabbed" | "touched" | "pierced" | "tossed"
    v_trans_not_omissible_pp_p2: "liked" | "helped" | "found" | "loved" | "poked" | "admired" | "adored" | "appreciated" | "missed" | "respected" | "thrown" | "tolerated" | "valued" | "worshipped" | "discovered" | "held" | "stabbed" | "touched" | "pierced" | "tossed"
    v_cp_taking: "liked" | "hoped" | "said" | "noticed" | "believed" | "confessed" | "declared" | "proved" | "thought" | "admired" | "appreciated" | "respected" | "supported" | "tolerated" | "valued" | "wished" | "dreamed" | "expected" | "imagined" | "meant"
    v_inf_taking: "wanted" | "preferred" | "needed" | "intended" | "tried" | "attempted" | "planned" | "expected" | "hoped" | "wished" | "craved" | "liked" | "hated" | "loved" | "enjoyed" | "dreamed" | "meant" | "longed" | "yearned" | "itched"
    v_unacc_p1: "rolled" | "froze" | "burned" | "shortened" | "floated" | "grew" | "slid" | "broke" | "crumpled" | "split" | "changed" | "snapped" | "disintegrated" | "collapsed" | "decomposed" | "doubled" | "improved" | "inflated" | "enlarged" | "reddened" | "shattered" | "blessed" | "squeezed"
    v_unacc_p2: "rolled" | "froze" | "burned" | "shortened" | "floated" | "grew" | "slid" | "broke" | "crumpled" | "split" | "changed" | "snapped" | "disintegrated" | "collapsed" | "decomposed" | "doubled" | "improved" | "inflated" | "enlarged" | "reddened" | "shattered" | "blessed" | "squeezed"
    v_unacc_pp_p1: "rolled" | "frozen" | "burned" | "shortened" | "floated" | "grown" | "slid" | "broken" | "crumpled" | "split" | "changed" | "snapped" | "disintegrated" | "collapsed" | "decomposed" | "doubled" | "improved" | "inflated" | "enlarged" | "reddened" | "shattered" | "blessed" | "squeezed"
    v_unacc_pp_p2: "rolled" | "frozen" | "burned" | "shortened" | "floated" | "grown" | "slid" | "broken" | "crumpled" | "split" | "changed" | "snapped" | "disintegrated" | "collapsed" | "decomposed" | "doubled" | "improved" | "inflated" | "enlarged" | "reddened" | "shattered" | "blessed" | "squeezed"
    v_unerg: "slept" | "smiled" | "laughed" | "sneezed" | "cried" | "talked" | "danced" | "jogged" | "walked" | "ran" | "napped" | "snoozed" | "screamed" | "stuttered" | "frowned" | "giggled" | "scoffed" | "snored" | "smirked" | "gasped"
    v_inf: "walk" | "run" | "sleep" | "sneeze" | "nap" | "eat" | "read" | "cook" | "hunt" | "paint" | "talk" | "dance" | "giggle" | "jog" | "smirk" | "call" | "sketch" | "dust" | "clean" | "investigate" | "crawl"
    v_dat_p1: "gave" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired" | "teleported" | "shipped"
    v_dat_p2: "gave" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired" | "teleported" | "shipped"
    v_dat_pp_p1: "given" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired"
    v_dat_pp_p2: "given" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired"
    v_dat_pp_p3: "given" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired"
    v_dat_pp_p4: "given" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired"
    %import common.WS
    %ignore WS
    '''
parser = Lark(grammar, start='start')

def get_verbs(lark_tree_root):
  nodes = [lark_tree_root]
  verbs = []
  while len(nodes) > 0:
    node = nodes[-1]
    nodes = nodes[:-1]
    node_type = node.data[:]
    if node_type[:2] == 'v_':
      verbs.append(node_type)
    for child in node.children:
      # it is a tree, no need to check for revisits
      nodes.append(child)
  return verbs

# quick example of what can be done with this
samples_100 = recogs_train_examples[:100].copy() # a slice cannot have columns added
sentences = recogs_train_examples["COGS Sentence"]
verbs_for_sentence_entries = []
for idx in range(len(sentences[:100])):
  sentence = sentences[idx]
  tree_for_sentence = parser.parse(sentence.lower().replace(" .", "").strip())
  verbs = get_verbs(tree_for_sentence)
  verbs_for_sentence_entries.append(",".join(verbs))
samples_100["Verb Type"] = verbs_for_sentence_entries

theme_right_of_verb_verb_type_set = set([
    "v_unacc_p1",
    "v_trans_omissible_p2",
    "v_trans_not_omissible",
]) # fill this out

theme_left_of_verb_verb_type_set = set(
    ["v_trans_omissible_pp_p1",
     "v_unacc_p2", # verb node name only (from `vp_internal: np v_unacc_p2`); get_verbs never returns an "np " prefix
     "v_unacc_pp_p1",
     "v_unacc_pp_p2",
     "v_trans_omissible_pp_p2",
     "v_trans_not_omissible_pp_p1",
     "v_trans_not_omissible_pp_p2",
     "v_dat_pp_p1",
     "v_dat_pp_p2"
     ]) # fill this out
theme_middle_of_dative_verb_type_set = set(["v_dat_pp_p4", "v_dat_p1"]) # fill this out

theme_right_of_verb_idx_set = set() # this gets populated in loop
theme_left_of_verb_idx_set = set() # this gets populated in loop
theme_middle_of_dative_idx_set = set() # this gets populated in loop

theme_side = []
for idx in range(len(samples_100)):
  verb_type = samples_100["Verb Type"][idx]
  if verb_type in theme_right_of_verb_verb_type_set:
    theme_right_of_verb_idx_set.add(idx)
    theme_side.append("right")
  elif verb_type in theme_left_of_verb_verb_type_set:
    theme_left_of_verb_idx_set.add(idx)
    theme_side.append("left")
  elif verb_type in theme_middle_of_dative_verb_type_set:
    theme_middle_of_dative_idx_set.add(idx)
    theme_side.append("middle")
  else:
    theme_side.append(None)
samples_100["Theme Side"] = theme_side

def get_flat_tree_dfs_left_to_right_order(tree_root):
  nodes = [tree_root]
  output = []
  while len(nodes) > 0:
    node = nodes[-1]
    nodes = nodes[:-1]
    if len(node.children) == 0:
      # output terminals only
      output.append(node.data[:])
    reversed_children = node.children.copy()
    reversed_children.reverse()
    for child in reversed_children:
      nodes.append(child)
  return output

def get_pp_position_relative_to_verb(tree_root):
  # use the argument rather than a global, so the function works on whatever tree is passed in
  flat_tree_dfs_left_to_right_order = get_flat_tree_dfs_left_to_right_order(tree_root)
  first_pp_index = None
  last_pp_index = None
  v_index = None
  for idx in range(len(flat_tree_dfs_left_to_right_order)):
    pos = flat_tree_dfs_left_to_right_order[idx]
    if pos[:2] == "v_" and v_index is None:
      v_index = idx
    if pos == "pp":
      if first_pp_index is None:
        first_pp_index = idx
      last_pp_index = idx
  if first_pp_index is None:
    return None
  if v_index is None:
    return None
  if first_pp_index < v_index:
    if last_pp_index > v_index:
      return "left and right"
    return "left"
  return "right"

pp_pos_list = []
for idx in range(len(samples_100)):
  sentence = samples_100["COGS Sentence"][idx]
  tree = parser.parse(sentence.replace(" .", "").strip().lower())
  pp_position_rel_to_verb = get_pp_position_relative_to_verb(tree)
  pp_pos_list.append(pp_position_rel_to_verb)
pp_pos_list

samples_100["PP position"] = pp_pos_list

print(f"Example information we can add to 100 train.tsv samples:\n{samples_100}")

print("Now let us process the whole dataset to focus on the goal, which is setting up an experiment testing generalization on v_dat_p2 recipient pp modification after only seeing v_dat_p2 theme pp modification (nps are in agent, recipient, theme order in v_dat_p2's `np v_dat_p2 np np` pattern)")

verbs_for_sentence_entries_all = []
for idx in range(len(sentences)):
  sentence = sentences[idx]
  tree_for_sentence = parser.parse(sentence.lower().replace(" .", "").strip())
  verbs = get_verbs(tree_for_sentence)
  verbs_for_sentence_entries_all.append(",".join(verbs))
recogs_train_examples["Verb Type"] = verbs_for_sentence_entries_all
recogs_train_examples[recogs_train_examples['Verb Type'] == 'v_dat_p2']

def get_pp_count(tree_root):
  flat_tree_dfs_left_to_right_order = get_flat_tree_dfs_left_to_right_order(tree_root)
  #print(flat_tree_dfs_left_to_right_order)
  num_pps = 0
  for idx in range(len(flat_tree_dfs_left_to_right_order)):
    if flat_tree_dfs_left_to_right_order[idx] == "pp":
      num_pps += 1
  return num_pps

def reorder_v_dat_p2_with_1_pp_on_theme(sentence):
  parts = sentence.replace(" .", "").strip().split(" ")
  p1 = parts[:-5]
  p2 = parts[-3:]
  p3 = parts[-5:-3]
  return " ".join(p1 + p2 + p3)

# Verb lemmatization. This is just a memorized map that is a fact of the COGS dataset.
# I am using the version collected by IBM ( https://github.com/IBM/cpg/blob/c3626b4e03bfc681be2c2a5b23da0b48abe6f570/src/model/cogs_data.py#L485 )

# https://github.com/IBM/cpg/blob/c3626b4e03bfc681be2c2a5b23da0b48abe6f570/src/model/cogs_data.py#L523
verbs_lemmas = {
    'ate': 'eat', 'painted': 'paint', 'drew': 'draw', 'cleaned': 'clean',
    'cooked': 'cook', 'dusted': 'dust', 'hunted': 'hunt', 'nursed': 'nurse',
    'sketched': 'sketch', 'washed': 'wash', 'juggled': 'juggle', 'called': 'call',
    'eaten': 'eat', 'drawn': 'draw', 'baked': 'bake', 'liked': 'like', 'knew': 'know',
    'helped': 'help', 'saw': 'see', 'found': 'find', 'heard': 'hear', 'noticed': 'notice',
    'loved': 'love', 'admired': 'admire', 'adored': 'adore', 'appreciated': 'appreciate',
    'missed': 'miss', 'respected': 'respect', 'tolerated': 'tolerate', 'valued': 'value',
    'worshipped': 'worship', 'observed': 'observe', 'discovered': 'discover', 'held': 'hold',
    'stabbed': 'stab', 'touched': 'touch', 'pierced': 'pierce', 'poked': 'poke',
    'known': 'know', 'seen': 'see', 'hit': 'hit', 'hoped': 'hope', 'said': 'say',
    'believed': 'believe', 'confessed': 'confess', 'declared': 'declare', 'proved': 'prove',
    'thought': 'think', 'supported': 'support', 'wished': 'wish', 'dreamed': 'dream',
    'expected': 'expect', 'imagined': 'imagine', 'envied': 'envy', 'wanted': 'want',
    'preferred': 'prefer', 'needed': 'need', 'intended': 'intend', 'tried': 'try',
    'attempted': 'attempt', 'planned': 'plan', 'craved': 'crave', 'hated': 'hate', 'loved': 'love',
    'enjoyed': 'enjoy', 'rolled': 'roll', 'froze': 'freeze', 'burned': 'burn', 'shortened': 'shorten',
    'floated': 'float', 'grew': 'grow', 'slid': 'slide', 'broke': 'break', 'crumpled': 'crumple',
    'split': 'split', 'changed': 'change', 'snapped': 'snap', 'tore': 'tear', 'collapsed': 'collapse',
    'decomposed': 'decompose', 'doubled': 'double', 'improved': 'improve', 'inflated': 'inflate',
    'enlarged': 'enlarge', 'reddened': 'redden', 'popped': 'pop', 'disintegrated': 'disintegrate',
    'expanded': 'expand', 'cooled': 'cool', 'soaked': 'soak', 'frozen': 'freeze', 'grown': 'grow',
    'broken': 'break', 'torn': 'tear', 'slept': 'sleep', 'smiled': 'smile', 'laughed': 'laugh',
    'sneezed': 'sneeze', 'cried': 'cry', 'talked': 'talk', 'danced': 'dance', 'jogged': 'jog',
    'walked': 'walk', 'ran': 'run', 'napped': 'nap', 'snoozed': 'snooze', 'screamed': 'scream',
    'stuttered': 'stutter', 'frowned': 'frown', 'giggled': 'giggle', 'scoffed': 'scoff',
    'snored': 'snore', 'snorted': 'snort', 'smirked': 'smirk', 'gasped': 'gasp',
    'gave': 'give', 'lended': 'lend', 'sold': 'sell', 'offered': 'offer', 'fed': 'feed',
    'passed': 'pass', 'rented': 'rent', 'served': 'serve', 'awarded': 'award', 'promised': 'promise',
    'brought': 'bring', 'sent': 'send', 'handed': 'hand', 'forwarded': 'forward', 'mailed': 'mail',
    'posted': 'post', 'given': 'give', 'shipped': 'ship', 'packed': 'pack', 'studied': 'study',
    'examined': 'examine', 'investigated': 'investigate', 'thrown': 'throw', 'threw': 'throw',
    'tossed': 'toss', 'meant': 'mean', 'longed': 'long', 'yearned': 'yearn', 'itched': 'itch',
    'loaned': 'loan', 'returned': 'return', 'slipped': 'slip', 'wired': 'wire', 'crawled': 'crawl',
    'shattered': 'shatter', 'bought': 'buy', 'squeezed': 'squeeze', 'teleported': 'teleport',
    'melted': 'melt', 'blessed': 'bless'
}

def transform_verbs(word_list):
  for idx in range(len(word_list)):
    if word_list[idx] in verbs_lemmas:
      word_list[idx] = verbs_lemmas[word_list[idx]]
  return word_list

def transform_lf_for_v_dat_p2_with_1_3rdnp_pp(original_sentence, original_lf):
  transformed_sentence = reorder_v_dat_p2_with_1_pp_on_theme(original_sentence)
  index_change_map = {}
  noun_to_index_map = {}
  split_original_sentence = transform_verbs(original_sentence.strip().replace(" .", "").split(" "))
  split_transformed_sentence = transform_verbs(transformed_sentence.strip().replace(" .", "").split(" "))
  split_original_lf = original_lf.replace("* ", "").replace(" ; ", " AND ").replace("nmod . ", "").split(" AND ")
  idx = 0
  recipient_idx = None
  # agent recipient theme
  for part in split_original_lf:
    word = part.strip().split(" ")[0]
    if word == "agent" or word == "recipient" or word == "theme":
      continue
    index_change_map[f"{split_original_sentence.index(word)}"] = f"{split_transformed_sentence.index(word)}"
    if word == "on" or word == "in" or word == "beside":
      continue
    noun_to_index_map[word] = f"{split_transformed_sentence.index(word)}"
    if idx == 1:
      recipient_idx = f"{split_transformed_sentence.index(word)}"
    idx += 1
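  # renumber the token positions in the LF to match the reordered sentence
  original_lf_space_split = original_lf.split(" ")
  for idx in range(len(original_lf_space_split)):
    part = original_lf_space_split[idx]
    if part in index_change_map:
      original_lf_space_split[idx] = index_change_map[part]
  transformed_lf = " ".join(original_lf_space_split)
  # the LF is expected to end with the nmod relation "( X , Y )"; replace X with the recipient index
  # (this slicing assumes both X and Y are single digits)
  transformed_lf = transformed_lf[:-7] + recipient_idx + transformed_lf[-6:]
  # swap the 3rd and 4th ";"-separated entries so they stay in sentence order (the pp noun now precedes the theme)
  transformed_lf_parts = transformed_lf.split(";")
  transformed_lf = ";".join(transformed_lf_parts[0:2] + [transformed_lf_parts[3]] + [transformed_lf_parts[2]] + transformed_lf_parts[4:])
  return transformed_lf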

pp_counts_all = []
for idx in range(len(recogs_train_examples)):
  tree = parser.parse(recogs_train_examples["COGS Sentence"].values[idx].replace(" .", "").strip().lower())
  pp_counts_all.append(get_pp_count(tree))
recogs_train_examples["PP counts"] = pp_counts_all

recogs_train_examples2 = recogs_train_examples.copy()
keep_list = []
for idx in range(len(recogs_train_examples2)):
  if recogs_train_examples2["Verb Type"].values[idx] == "v_dat_p2":
    if recogs_train_examples2["PP counts"].values[idx] > 1:
      keep_list.append(False)
      continue
    elif recogs_train_examples2["PP counts"].values[idx] == 1:
      original_sentence = recogs_train_examples2["COGS Sentence"].values[idx]
      original_lf = recogs_train_examples2["ReCOGS Logical Form"].values[idx]
      reordered_sentence = reorder_v_dat_p2_with_1_pp_on_theme(original_sentence)
      try:
        new_tree = parser.parse(reordered_sentence.lower())
      except Exception:
        print(f"Skipping '{original_sentence}' which has a proper noun recipient or other disqualifying element for moving the prepositional phrase to the recipient")
        keep_list.append(False)
        continue
      reordered_sentence = reordered_sentence[0].upper() + reordered_sentence[1:] + " ."
      reordered_lf = transform_lf_for_v_dat_p2_with_1_3rdnp_pp(original_sentence, original_lf)
      print(f"Replacing '{original_sentence}' with {reordered_sentence} ; replacing '{original_lf}' with '{reordered_lf}'")
      # write through .at instead of .values, which can be a read-only view in newer pandas versions
      row_label = recogs_train_examples2.index[idx]
      recogs_train_examples2.at[row_label, "COGS Sentence"] = reordered_sentence
      recogs_train_examples2.at[row_label, "ReCOGS Logical Form"] = reordered_lf
  keep_list.append(True)

recogs_train_examples2["keep"] = keep_list

recogs_train_examples2 = recogs_train_examples2[recogs_train_examples2["keep"] == True]

t = recogs_train_examples2[recogs_train_examples2["PP counts"] == 1]
hold_out_train_to_check_v_dat_p2_gen = t[t["Verb Type"] == "v_dat_p2"]

print(f"hold_out_train_to_check_v_dat_p2_gen (sample n=10): {hold_out_train_to_check_v_dat_p2_gen[:10]}")

hold_out_train_to_check_v_dat_p2_gen.filter(["COGS Sentence", "ReCOGS Logical Form", "Distribution"]).to_csv("modified_train_set_examples_v_dat_p2_pp_moved_to_recipient.tsv", sep="	", index=False)

recogs_train_examples2.filter(["COGS Sentence", "ReCOGS Logical Form", "Distribution"]).to_csv("train_data_no_sprinkles_no_preposing_no_cp_v_dat_pp_p2_pp_moved_to_recipient.tsv", sep="	", index=False)


Load official Wu et al 2023 ReCOGS training examples
(sample from https://raw.githubusercontent.com/frankaging/ReCOGS/refs/heads/main/recogs_positional_index/train.tsv , associated with https://arxiv.org/abs/2303.13716 )
Example information we can add to 100 train.tsv samples:
                                     COGS Sentence  \
0                     A rose was helped by a dog .   
1                        The sailor dusted a boy .   
2                          Emma rolled a teacher .   
3                         Evelyn rolled the girl .   
4      A cake was forwarded to Levi by Charlotte .   
..                                             ...   
95  The girl changed a sandwich beside the table .   
96       Emma gave a landlord the box in a house .   
97              The balloon was painted by a boy .   
98                       A girl liked the raisin .   
99   The pancake was passed to Charlotte by Noah .   

                                  ReCOGS Logical Form     Distribution  \

In [None]:
subprocess.run("cat train.tsv | grep -v 'in_distribution' | grep 'primitive' >> train_data_no_sprinkles_no_preposing_no_cp_v_dat_pp_p2_pp_moved_to_recipient.tsv", shell=True)

CompletedProcess(args="cat train.tsv | grep -v 'in_distribution' | grep 'primitive' >> train_data_no_sprinkles_no_preposing_no_cp_v_dat_pp_p2_pp_moved_to_recipient.tsv", returncode=0)

`hold_out_train_to_check_v_dat_p2_gen` will be used for two different things.

First, to check the performance of a baseline Wu et al 2023 Transformer on it.

Second, to see whether mixing it into the training data improves generalization on other cases where a "pp np" needs to be skipped when attaching a dependency, e.g. the obj to subj pp generalization.
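
To make the "pp np" that must be skipped concrete, here is a minimal sketch of the reordering behind the held-out examples. It uses the same negative-index slicing as `reorder_v_dat_p2_with_1_pp_on_theme`, but the helper name `move_pp_to_recipient` is hypothetical and not part of this notebook. It assumes a single-PP `np v_dat_p2 np np pp np` sentence whose theme and PP object are both determiner plus common noun.

```python
def move_pp_to_recipient(sentence):
    # Hypothetical helper (not from the notebook): move the trailing
    # "pp det noun" from the theme np to the recipient np.
    tokens = sentence.replace(" .", "").strip().split(" ")
    head = tokens[:-5]      # agent, verb, recipient np
    theme = tokens[-5:-3]   # determiner + theme noun
    pp = tokens[-3:]        # preposition + determiner + noun
    return " ".join(head + pp + theme) + " ."

moved = move_pp_to_recipient("Emma gave a landlord the box in a house .")
print(moved)
```

After the move, "in a house" sits between the verb and the theme "the box", so the theme dependency has to skip over an intervening noun phrase.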

In [None]:
!cat modified_train_set_examples_v_dat_p2_pp_moved_to_recipient.tsv

COGS Sentence	ReCOGS Logical Form	Distribution
Liam gave the monkey in the container a chalk .	Liam ( 0 ) ; * monkey ( 3 ) ; * container ( 6 ) ; chalk ( 8 ) ; give ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . in ( 3 , 6 )	in_distribution
Emma gave a landlord in a house the box .	Emma ( 0 ) ; landlord ( 3 ) ; house ( 6 ) ; * box ( 8 ) ; give ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . in ( 3 , 6 )	in_distribution
Emma awarded a bird on the stool the drink .	Emma ( 0 ) ; bird ( 3 ) ; * stool ( 6 ) ; * drink ( 8 ) ; award ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . on ( 3 , 6 )	in_distribution
Emma offered a girl on the table a drink .	Emma ( 0 ) ; girl ( 3 ) ; * table ( 6 ) ; drink ( 8 ) ; offer ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . on ( 3 , 6 )	in_distribution
Emma offered a teacher beside a bed the scarf .	Emma ( 0 ) ; teacher ( 3 ) ; 

# Updated version to also augment by copying and converting `np v_dat_p2 np np pp np pp np` double preposition case to `np v_dat_p2 np pp np pp np np` (not in use)
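
As a rough illustration of the double preposition conversion named in the heading, the sketch below reorders the tokens and records how each noun's positional index shifts, in the spirit of the `index_change_map` used when rewriting the logical form. The function name `reorder_double_pp` and the example sentence are made up for illustration; the sentence is not taken from the training data, and nouns are assumed to be unique within it.

```python
def reorder_double_pp(tokens):
    # Hypothetical sketch: move the last six tokens (two "pp det noun" groups)
    # in front of the two-token theme np that precedes them.
    return tokens[:-8] + tokens[-6:] + tokens[-8:-6]

# Made-up example of the form `np v_dat_p2 np np pp np pp np`.
original = "Emma gave a landlord the box in a house on the table".split(" ")
reordered = reorder_double_pp(original)

# Old token position -> new token position for each noun
# (assumes each noun occurs only once in the sentence).
nouns = ["landlord", "box", "house", "table"]
index_change = {original.index(n): reordered.index(n) for n in nouns}

print(" ".join(reordered))
print(index_change)
```

The recipient keeps its index, while the theme moves to the end of the sentence and both PP objects shift left by two positions.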

In [None]:
# Objective
# In the ReCOGS paper (Wu et al 2023; ReCOGS: How Incidental Details of a Logical Form Overshadow an Evaluation of Semantic Interpretation ; https://arxiv.org/abs/2303.13716 ), they write
#
# The hardest COGS split based on published numbers seems to be the structural generalization task that involves interpreting novel combinations of modified phrases and grammatical role – e.g., interpreting subject noun phrases with PP modifiers when the train set includes only object noun phrases with such modifiers (Noah ate the cake on the plate. → The cake on the plate burned). To the best of our knowledge, all prior seq2seq models have completely failed to get traction here
#
# Possible interpretation of such prior work (Wu et al 2023, ReCOGS, and the COGS paper's object pp to subject pp modification generalization split): if the model that translates English sentences to a semantic representation (logical form) has not learned to modify the subject with a prepositional phrase, then it cannot modify the subject.
#
# Then it follows that if we can get the model to consistently modify the subject accurately without learning to modify it then this claim is not true (regardless of whether the subject is on the left/right of the verb which is what it typically has relationships with that break when LF translation fails).
#
# Alternative interpretation of this prior work: a general difficulty with ignoring prepositional-phrase-noun-phrases which are inserted in between two parts of speech with a relationship when the modified part-of-speech in the pair is on the left-side; e.g. for verb could be the agent or theme or recipient of that verb.
#
# Consider, in the limit, the difference between np v_dat_p2 np np style sentences where the recipient is modified vs the theme, with some long filler content:
# "Emma gave a friend a cookie" ->
# "Emma gave a friend a cookie (filler filler filler)" (not going to miss that a cookie was given as more filler is added)
# vs
# "Emma gave a friend (filler filler filler) a cookie" (as filler gets larger and has more distractors, may miss that a cookie was given to the friend due to the distance between friend and cookie increasing and distractors including nouns being added in between)
#
# Because modifications are added to the right, in a left to right fashion, modifying the part-of-speech on the left side of a related pair is not symmetric with modifying the one on the right: modifying the left one introduces a distractor noun phrase ("pp np") between them.
#
# distinguishing observations:
#
#     observe that something that is definitively not the subject (not just a theme in a passive vs agent in active) fails to generalize (maybe np v_dat_p2 np np!)
#
#     check that regardless of whether agent, theme, or recipient role, if it is on the left, we do not observe generalization from right as that would reject the 2nd hypothesis (we can also confirm that training with it on the left it can generalize to right)
#
#     since English has SVO order, I am not sure it is easy to show the subject being modified when it is not on the left, and swapping the agent/theme side (as with a passive verb) may swap its subject role. That is why np v_dat_p2 np np, where we do pp modification on the recipient instead of the theme np, which is not even a verb relationship situation, is appealing. It would demonstrate the same mechanism without getting confused with subject/object or anything specific to verbs.
#

# Here's a left/right example which involves neither a verb nor the subject of the sentence.
#
# Take existing train examples of np v_dat_p2 np np form (agent, recipient, theme), we can even take those which already have a prepositional phrase on the right np.
#
# just transfer that pp to the middle np and we should observe that it gets the roles wrong (recipient or theme), if it hasn't trained on that.
#
# In this particular pattern modifying the theme should be safe (in other patterns where it is left of the verb it should not), but modifying the recipient with a prepositional phrase here should break something, probably the theme.
#
# Will be using ideas learned from working with the hand-written Restricted Access Sequence Processing (RASP) model, testing them on Transformers (the default model from Wu et al 2023). So this is custom data, but the model is another author's, which we will train from scratch here based on ideas from RASP, which is checked out in another notebook.

# Here I am getting data ready for use by other notebooks. We use the Lark parser here to check the form of the different sentences, as we want to filter on syntax, not semantics.

import os
import pty
import subprocess
import pandas as pd
import numpy as np
import argparse

print("Load official Wu et al 2023 ReCOGS training examples\n(sample from https://raw.githubusercontent.com/frankaging/ReCOGS/refs/heads/main/recogs_positional_index/train.tsv , associated with https://arxiv.org/abs/2303.13716 )")
if not os.path.exists("train_no_sprinkles_or_cp.tsv"):
    # one of the official authors' datasets for the ReCOGS paper
    subprocess.run("wget https://raw.githubusercontent.com/frankaging/ReCOGS/refs/heads/main/recogs_positional_index/train.tsv", shell=True)
    subprocess.run("echo 'COGS Sentence	ReCOGS Logical Form	Distribution' > train_no_sprinkles_or_cp.tsv", shell=True)
    subprocess.run("cat train.tsv | grep 'in_distribution' | grep -v 'preposing' | grep -v 'sprinkle' | grep -v 'that' >> train_no_sprinkles_or_cp.tsv", shell=True)

recogs_train_examples = pd.read_csv("train_no_sprinkles_or_cp.tsv", delimiter="	")

# We do NOT use Lark with the Restricted Access Sequence Processing (RASP) model.
# Lark is used only to prepare training data for a separate experiment, which uses Wu et al 2023's model to test something we learned from the RASP model (programmed by hand in the language that expresses Transformer capabilities)
# on a Wu et al 2023 Transformer that learns from examples. We use Lark to filter for examples with certain syntax and to help modify them.
subprocess.run("pip install lark --upgrade", shell=True)

from lark import Lark, tree

# COGS grammar in Lark format per IBM CPG project (we don't use any CPG code just their description of COGS grammar from their utilities here for data preparation)
# https://github.com/IBM/cpg/blob/c3626b4e03bfc681be2c2a5b23da0b48abe6f570/src/model/cogs_data.py#L523
grammar = '''
start: s1 | s2 | s3 | s4 | vp_internal
    s1: np vp_external
    s2: np vp_passive
    s3: np vp_passive_dat
    s4: np vp_external4
    vp_external: v_unerg | v_trans_omissible_p1 | vp_external1 | vp_external2 | vp_external3 | vp_external5 | vp_external6 | vp_external7
    vp_external1: v_unacc_p1 np
    vp_external2: v_trans_omissible_p2 np
    vp_external3: v_trans_not_omissible np
    vp_external4: v_inf_taking to v_inf
    vp_external5: v_cp_taking that start
    vp_external6: v_dat_p1 np pp_iobj
    vp_external7: v_dat_p2 np np
    vp_internal: np v_unacc_p2
    vp_passive: vp_passive1 | vp_passive2 | vp_passive3 | vp_passive4 | vp_passive5 | vp_passive6 | vp_passive7 | vp_passive8
    vp_passive1: was v_trans_not_omissible_pp_p1
    vp_passive2: was v_trans_not_omissible_pp_p2 by np
    vp_passive3: was v_trans_omissible_pp_p1
    vp_passive4: was v_trans_omissible_pp_p2 by np
    vp_passive5: was v_unacc_pp_p1
    vp_passive6: was v_unacc_pp_p2 by np
    vp_passive7: was v_dat_pp_p1 pp_iobj
    vp_passive8: was v_dat_pp_p2 pp_iobj by np
    vp_passive_dat: vp_passive_dat1 | vp_passive_dat2
    vp_passive_dat1: was v_dat_pp_p3 np
    vp_passive_dat2: was v_dat_pp_p4 np by np
    np: np_prop | np_det | np_pp
    np_prop: proper_noun
    np_det: det common_noun
    np_pp: np_det pp np
    pp_iobj: to np
    det: "the" | "a"
    pp: "on" | "in" | "beside"
    was: "was"
    by: "by"
    to: "to"
    that: "that"
    common_noun: "girl" | "boy" | "cat" | "dog" | "baby" | "child" | "teacher" | "frog" | "chicken" | "mouse" | "lion" | "monkey" | "bear" | "giraffe" | "horse" | "bird" | "duck" | "bunny" | "butterfly" | "penguin" | "student" | "professor" | "monster" | "hero" | "sailor" | "lawyer" | "customer" | "scientist" | "princess" | "president" | "cow" | "crocodile" | "goose" | "hen" | "deer" | "donkey" | "bee" | "fly" | "kitty" | "tiger" | "wolf" | "zebra" | "mother" | "father" | "patient" | "manager" | "director" | "king" | "queen" | "kid" | "fish" | "moose" | "pig" | "pony" | "puppy" | "sheep" | "squirrel" | "lamb" | "turkey" | "turtle" | "doctor" | "pupil" | "prince" | "driver" | "consumer" | "writer" | "farmer" | "friend" | "judge" | "visitor" | "guest" | "servant" | "chief" | "citizen" | "champion" | "prisoner" | "captain" | "soldier" | "passenger" | "tenant" | "politician" | "resident" | "buyer" | "spokesman" | "governor" | "guard" | "creature" | "coach" | "producer" | "researcher" | "guy" | "dealer" | "duke" | "tourist" | "landlord" | "human" | "host" | "priest" | "journalist" | "poet" | "hedgehog" | "shark" | "cockroach" | "cobra" | "hippo" | "cake" | "donut" | "cookie" | "box" | "rose" | "drink" | "raisin" | "melon" | "sandwich" | "strawberry" | "ball" | "balloon" | "bat" | "block" | "book" | "crayon" | "chalk" | "doll" | "game" | "glue" | "lollipop" | "hamburger" | "banana" | "biscuit" | "muffin" | "pancake" | "pizza" | "potato" | "pretzel" | "pumpkin" | "sweetcorn" | "yogurt" | "pickle" | "jigsaw" | "pen" | "pencil" | "present" | "toy" | "cracker" | "brush" | "radio" | "cloud" | "mandarin" | "hat" | "basket" | "plant" | "flower" | "chair" | "spoon" | "pillow" | "gumball" | "scarf" | "shoe" | "jacket" | "hammer" | "bucket" | "knife" | "cup" | "plate" | "towel" | "bottle" | "bowl" | "can" | "clock" | "jar" | "penny" | "purse" | "soap" | "toothbrush" | "watch" | "newspaper" | "fig" | "bag" | "wine" | "key" | "weapon" | "brain" | "tool" | "crown" | "ring" | "leaf" | 
"fruit" | "mirror" | "beer" | "shirt" | "guitar" | "chemical" | "seed" | "shell" | "brick" | "bell" | "coin" | "button" | "needle" | "molecule" | "crystal" | "flag" | "nail" | "bean" | "liver" | "table" | "stage" | "bed" | "chair" | "stool" | "road" | "tree" | "box" | "surface" | "seat" | "speaker" | "computer" | "rock" | "boat" | "cabinet" | "tv" | "plate" | "desk" | "bowl" | "bench" | "shelf" | "cloth" | "piano" | "bible" | "leaflet" | "sheet" | "cupboard" | "truck" | "tray" | "notebook" | "blanket" | "deck" | "coffin" | "log" | "ladder" | "barrel" | "rug" | "canvas" | "tiger" | "towel" | "throne" | "booklet" | "sock" | "corpse" | "sofa" | "keyboard" | "book" | "pillow" | "pad" | "train" | "couch" | "bike" | "pedestal" | "platter" | "paper" | "rack" | "board" | "panel" | "tripod" | "branch" | "machine" | "floor" | "napkin" | "cookie" | "block" | "cot" | "device" | "yacht" | "dog" | "mattress" | "ball" | "stand" | "stack" | "windowsill" | "counter" | "cushion" | "hanger" | "trampoline" | "gravel" | "cake" | "carpet" | "plaque" | "boulder" | "leaf" | "mound" | "bun" | "dish" | "cat" | "podium" | "tabletop" | "beach" | "bag" | "glacier" | "brick" | "crack" | "vessel" | "futon" | "turntable" | "rag" | "chessboard" | "house" | "room" | "car" | "garden" | "box" | "cup" | "glass" | "bag" | "vehicle" | "hole" | "cabinet" | "bottle" | "shoe" | "storage" | "cot" | "vessel" | "pot" | "pit" | "tin" | "can" | "cupboard" | "envelope" | "nest" | "bush" | "coffin" | "drawer" | "container" | "basin" | "tent" | "soup" | "well" | "barrel" | "bucket" | "cage" | "sink" | "cylinder" | "parcel" | "cart" | "sack" | "trunk" | "wardrobe" | "basket" | "bin" | "fridge" | "mug" | "jar" | "corner" | "pool" | "blender" | "closet" | "pile" | "van" | "trailer" | "saucepan" | "truck" | "taxi" | "haystack" | "dumpster" | "puddle" | "bathtub" | "pod" | "tub" | "trap" | "bun" | "microwave" | "bookstore" | "package" | "cafe" | "train" | "castle" | "bunker" | "vase" | "backpack" | "tube" | "hammock" | 
"stadium" | "backyard" | "swamp" | "monastery" | "refrigerator" | "palace" | "cubicle" | "crib" | "condo" | "tower" | "crate" | "dungeon" | "teapot" | "tomb" | "casket" | "jeep" | "shoebox" | "wagon" | "bakery" | "fishbowl" | "kennel" | "china" | "spaceship" | "penthouse" | "pyramid" | "table" | "stage" | "bed" | "chair" | "book" | "road" | "tree" | "machine" | "house" | "seat" | "speaker" | "computer" | "rock" | "car" | "box" | "cup" | "glass" | "bag" | "flower" | "boat" | "vehicle" | "key" | "painting" | "cabinet" | "tv" | "bottle" | "cat" | "desk" | "shoe" | "mirror" | "clock" | "bench" | "bike" | "lamp" | "lion" | "piano" | "crystal" | "toy" | "duck" | "sword" | "sculpture" | "rod" | "truck" | "basket" | "bear" | "nest" | "sphere" | "bush" | "surgeon" | "poster" | "throne" | "giant" | "trophy" | "hedge" | "log" | "tent" | "ladder" | "helicopter" | "barrel" | "yacht" | "statue" | "bucket" | "skull" | "beast" | "lemon" | "whale" | "cage" | "gardner" | "fox" | "sink" | "trainee" | "dragon" | "cylinder" | "monk" | "bat" | "headmaster" | "philosopher" | "foreigner" | "worm" | "chemist" | "corpse" | "wolf" | "torch" | "sailor" | "valve" | "hammer" | "doll" | "genius" | "baron" | "murderer" | "bicycle" | "keyboard" | "stool" | "pepper" | "warrior" | "pillar" | "monkey" | "cassette" | "broker" | "bin"
    proper_noun: "emma" | "liam" | "olivia" | "noah" | "ava" | "william" | "isabella" | "james" | "sophia" | "oliver" | "charlotte" | "benjamin" | "mia" | "elijah" | "amelia" | "lucas" | "harper" | "mason" | "evelyn" | "logan" | "abigail" | "alexander" | "emily" | "ethan" | "elizabeth" | "jacob" | "mila" | "michael" | "ella" | "daniel" | "avery" | "henry" | "sofia" | "jackson" | "camila" | "sebastian" | "aria" | "aiden" | "scarlett" | "matthew" | "victoria" | "samuel" | "madison" | "david" | "luna" | "joseph" | "grace" | "carter" | "chloe" | "owen" | "penelope" | "wyatt" | "layla" | "john" | "riley" | "jack" | "zoey" | "luke" | "nora" | "jayden" | "lily" | "dylan" | "eleanor" | "grayson" | "hannah" | "levi" | "lillian" | "isaac" | "addison" | "gabriel" | "aubrey" | "julian" | "ellie" | "mateo" | "stella" | "anthony" | "natalie" | "jaxon" | "zoe" | "lincoln" | "leah" | "joshua" | "hazel" | "christopher" | "violet" | "andrew" | "aurora" | "theodore" | "savannah" | "caleb" | "audrey" | "ryan" | "brooklyn" | "asher" | "bella" | "nathan" | "claire" | "thomas" | "skylar" | "leo" | "lina" | "paula" | "charlie"
    v_trans_omissible_p1: "ate" | "painted" | "drew" | "cleaned" | "cooked" | "dusted" | "hunted" | "nursed" | "sketched" | "juggled" | "called" | "heard" | "packed" | "saw" | "noticed" | "studied" | "examined" | "observed" | "knew" | "investigated" | "baked"
    v_trans_omissible_p2: "ate" | "painted" | "drew" | "cleaned" | "cooked" | "dusted" | "hunted" | "nursed" | "sketched" | "juggled" | "called" | "heard" | "packed" | "saw" | "noticed" | "studied" | "examined" | "observed" | "knew" | "investigated" | "baked"
    v_trans_omissible_pp_p1: "eaten" | "painted" | "drawn" | "cleaned" | "cooked" | "dusted" | "hunted" | "nursed" | "sketched" | "juggled" | "called" | "heard" | "packed" | "seen" | "noticed" | "studied" | "examined" | "observed" | "known" | "investigated"
    v_trans_omissible_pp_p2: "eaten" | "painted" | "drawn" | "cleaned" | "cooked" | "dusted" | "hunted" | "nursed" | "sketched" | "juggled" | "called" | "heard" | "packed" | "seen" | "noticed" | "studied" | "examined" | "observed" | "known" | "investigated"
    v_trans_not_omissible: "liked" | "helped" | "found" | "loved" | "poked" | "admired" | "adored" | "appreciated" | "missed" | "respected" | "threw" | "tolerated" | "valued" | "worshipped" | "discovered" | "held" | "stabbed" | "touched" | "pierced" | "tossed"
    v_trans_not_omissible_pp_p1: "liked" | "helped" | "found" | "loved" | "poked" | "admired" | "adored" | "appreciated" | "missed" | "respected" | "thrown" | "tolerated" | "valued" | "worshipped" | "discovered" | "held" | "stabbed" | "touched" | "pierced" | "tossed"
    v_trans_not_omissible_pp_p2: "liked" | "helped" | "found" | "loved" | "poked" | "admired" | "adored" | "appreciated" | "missed" | "respected" | "thrown" | "tolerated" | "valued" | "worshipped" | "discovered" | "held" | "stabbed" | "touched" | "pierced" | "tossed"
    v_cp_taking: "liked" | "hoped" | "said" | "noticed" | "believed" | "confessed" | "declared" | "proved" | "thought" | "admired" | "appreciated" | "respected" | "supported" | "tolerated" | "valued" | "wished" | "dreamed" | "expected" | "imagined" | "meant"
    v_inf_taking: "wanted" | "preferred" | "needed" | "intended" | "tried" | "attempted" | "planned" | "expected" | "hoped" | "wished" | "craved" | "liked" | "hated" | "loved" | "enjoyed" | "dreamed" | "meant" | "longed" | "yearned" | "itched"
    v_unacc_p1: "rolled" | "froze" | "burned" | "shortened" | "floated" | "grew" | "slid" | "broke" | "crumpled" | "split" | "changed" | "snapped" | "disintegrated" | "collapsed" | "decomposed" | "doubled" | "improved" | "inflated" | "enlarged" | "reddened" | "shattered" | "blessed" | "squeezed"
    v_unacc_p2: "rolled" | "froze" | "burned" | "shortened" | "floated" | "grew" | "slid" | "broke" | "crumpled" | "split" | "changed" | "snapped" | "disintegrated" | "collapsed" | "decomposed" | "doubled" | "improved" | "inflated" | "enlarged" | "reddened" | "shattered" | "blessed" | "squeezed"
    v_unacc_pp_p1: "rolled" | "frozen" | "burned" | "shortened" | "floated" | "grown" | "slid" | "broken" | "crumpled" | "split" | "changed" | "snapped" | "disintegrated" | "collapsed" | "decomposed" | "doubled" | "improved" | "inflated" | "enlarged" | "reddened" | "shattered" | "blessed" | "squeezed"
    v_unacc_pp_p2: "rolled" | "frozen" | "burned" | "shortened" | "floated" | "grown" | "slid" | "broken" | "crumpled" | "split" | "changed" | "snapped" | "disintegrated" | "collapsed" | "decomposed" | "doubled" | "improved" | "inflated" | "enlarged" | "reddened" | "shattered" | "blessed" | "squeezed"
    v_unerg: "slept" | "smiled" | "laughed" | "sneezed" | "cried" | "talked" | "danced" | "jogged" | "walked" | "ran" | "napped" | "snoozed" | "screamed" | "stuttered" | "frowned" | "giggled" | "scoffed" | "snored" | "smirked" | "gasped"
    v_inf: "walk" | "run" | "sleep" | "sneeze" | "nap" | "eat" | "read" | "cook" | "hunt" | "paint" | "talk" | "dance" | "giggle" | "jog" | "smirk" | "call" | "sketch" | "dust" | "clean" | "investigate" | "crawl"
    v_dat_p1: "gave" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired" | "teleported" | "shipped"
    v_dat_p2: "gave" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired" | "teleported" | "shipped"
    v_dat_pp_p1: "given" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired"
    v_dat_pp_p2: "given" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired"
    v_dat_pp_p3: "given" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired"
    v_dat_pp_p4: "given" | "lended" | "sold" | "offered" | "fed" | "passed" | "sent" | "rented" | "served" | "awarded" | "brought" | "handed" | "forwarded" | "promised" | "mailed" | "loaned" | "posted" | "returned" | "slipped" | "wired"
    %import common.WS
    %ignore WS
    '''
parser = Lark(grammar, start='start')

def get_verbs(lark_tree_root):
  nodes = [lark_tree_root]
  verbs = []
  while len(nodes) > 0:
    node = nodes[-1]
    nodes = nodes[:-1]
    node_type = node.data[:]
    if node_type[:2] == 'v_':
      verbs.append(node_type)
    for child in node.children:
      # it is a tree, no need to check for revisits
      nodes.append(child)
  return verbs

# quick example of what can be done with this
samples_100 = recogs_train_examples[:100].copy() # a slice cannot have columns added
sentences = recogs_train_examples["COGS Sentence"]
verbs_for_sentence_entries = []
for idx in range(len(sentences[:100])):
  sentence = sentences[idx]
  tree_for_sentence = parser.parse(sentence.lower().replace(" .", "").strip())
  verbs = get_verbs(tree_for_sentence)
  verbs_for_sentence_entries.append(",".join(verbs))
samples_100["Verb Type"] = verbs_for_sentence_entries

theme_right_of_verb_verb_type_set = set([
    "v_unacc_p1",
    "v_trans_omissible_p2",
    "v_trans_not_omissible",
]) # fill this out

theme_left_of_verb_verb_type_set = set(
    ["v_trans_omissible_pp_p1",
     "v_unacc_p2", # get_verbs only returns node names starting with "v_", so no "np " prefix here
     "v_unacc_pp_p1",
     "v_unacc_pp_p2",
     "v_trans_omissible_pp_p2",
     "v_trans_not_omissible_pp_p1",
     "v_trans_not_omissible_pp_p2",
     "v_dat_pp_p1",
     "v_dat_pp_p2"
     ]) # fill this out
theme_middle_of_dative_verb_type_set = set(["v_dat_pp_p4", "v_dat_p1"]) # fill this out

theme_right_of_verb_idx_set = set() # this gets populated in loop
theme_left_of_verb_idx_set = set() # this gets populated in loop
theme_middle_of_dative_idx_set = set() # this gets populated in loop

theme_side = []
for idx in range(len(samples_100)):
  verb_type = samples_100["Verb Type"][idx]
  if verb_type in theme_right_of_verb_verb_type_set:
    theme_right_of_verb_idx_set.add(idx)
    theme_side.append("right")
  elif verb_type in theme_left_of_verb_verb_type_set:
    theme_left_of_verb_idx_set.add(idx)
    theme_side.append("left")
  elif verb_type in theme_middle_of_dative_verb_type_set:
    theme_middle_of_dative_idx_set.add(idx)
    theme_side.append("middle")
  else:
    theme_side.append(None)
samples_100["Theme Side"] = theme_side

def get_flat_tree_dfs_left_to_right_order(tree_root):
  nodes = [tree_root]
  output = []
  while len(nodes) > 0:
    node = nodes[-1]
    nodes = nodes[:-1]
    if len(node.children) == 0:
      # output terminals only
      output.append(node.data[:])
    reversed_children = node.children.copy()
    reversed_children.reverse()
    for child in reversed_children:
      nodes.append(child)
  return output

def get_pp_position_relative_to_verb(tree_root):
  flat_tree_dfs_left_to_right_order = get_flat_tree_dfs_left_to_right_order(tree_root)
  first_pp_index = None
  last_pp_index = None
  v_index = None
  for idx in range(len(flat_tree_dfs_left_to_right_order)):
    pos = flat_tree_dfs_left_to_right_order[idx]
    if pos[:2] == "v_" and v_index is None:
      v_index = idx
    if pos == "pp":
      if first_pp_index is None:
        first_pp_index = idx
      last_pp_index = idx
  if first_pp_index is None:
    return None
  if v_index is None:
    return None
  if first_pp_index < v_index:
    if last_pp_index > v_index:
      return "left and right"
    return "left"
  return "right"

pp_pos_list = []
for idx in range(len(samples_100)):
  sentence = samples_100["COGS Sentence"][idx]
  tree = parser.parse(sentence.replace(" .", "").strip().lower())
  pp_position_rel_to_verb = get_pp_position_relative_to_verb(tree)
  pp_pos_list.append(pp_position_rel_to_verb)
pp_pos_list

samples_100["PP position"] = pp_pos_list

print(f"Example information we can add to 100 train.tsv samples:\n{samples_100}")

print("Now let us process the whole dataset to focus on the goal, which is setting up an experiment testing generalization on v_dat_p2 recipient pp modification after only seeing v_dat_p2 theme pp modification (nps are in agent, recipient, theme order in v_dat_p2's `np v_dat_p2 np np` pattern)")

verbs_for_sentence_entries_all = []
for idx in range(len(sentences)):
  sentence = sentences[idx]
  tree_for_sentence = parser.parse(sentence.lower().replace(" .", "").strip())
  verbs = get_verbs(tree_for_sentence)
  verbs_for_sentence_entries_all.append(",".join(verbs))
recogs_train_examples["Verb Type"] = verbs_for_sentence_entries_all
recogs_train_examples[recogs_train_examples['Verb Type'] == 'v_dat_p2']

def get_pp_count(tree_root):
  flat_tree_dfs_left_to_right_order = get_flat_tree_dfs_left_to_right_order(tree_root)
  #print(flat_tree_dfs_left_to_right_order)
  num_pps = 0
  for idx in range(len(flat_tree_dfs_left_to_right_order)):
    if flat_tree_dfs_left_to_right_order[idx] == "pp":
      num_pps += 1
  return num_pps

# only works for COGS grammar sentences, which allow only determiner-common-nouns in this position when there are 2 prepositions
def reorder_v_dat_p2_with_2_pp_on_theme(sentence):
  parts = sentence.replace(" .", "").strip().split(" ")
  if len(parts) != 12 and len(parts) != 13:
      print(f"sentence with unexpected length: {sentence}")
  # allow a common noun or proper noun as the agent
  # note that with negative indices for all steps below, it makes no difference whether a determiner ("the" or "a") is added to the beginning
  assert len(parts) == 12 or len(parts) == 13
  p1 = parts[:-8]
  p2 = parts[-6:]
  p3 = parts[-8:-6]
  return " ".join(p1 + p2 + p3)

# only works for COGS grammar sentences, which allow only determiner-common-nouns in this position when there is a preposition
def reorder_v_dat_p2_with_1_pp_on_theme(sentence):
  parts = sentence.replace(" .", "").strip().split(" ")
  if len(parts) != 9 and len(parts) != 10:
      print(f"sentence with unexpected length: {sentence}")
  # allow a common noun or proper noun as the agent
  # note that with negative indices for all steps below, it makes no difference whether a determiner ("the" or "a") is added to the beginning
  assert len(parts) == 9 or len(parts) == 10
  p1 = parts[:-5]
  p2 = parts[-3:]
  p3 = parts[-5:-3]
  return " ".join(p1 + p2 + p3)

# verb stemming. This is just a memorized map that is a fact of the COGS dataset.
# I am using the version collected by IBM ( https://github.com/IBM/cpg/blob/c3626b4e03bfc681be2c2a5b23da0b48abe6f570/src/model/cogs_data.py#L485 )

# https://github.com/IBM/cpg/blob/c3626b4e03bfc681be2c2a5b23da0b48abe6f570/src/model/cogs_data.py#L523
verbs_lemmas = {
    'ate': 'eat', 'painted': 'paint', 'drew': 'draw', 'cleaned': 'clean',
    'cooked': 'cook', 'dusted': 'dust', 'hunted': 'hunt', 'nursed': 'nurse',
    'sketched': 'sketch', 'washed': 'wash', 'juggled': 'juggle', 'called': 'call',
    'eaten': 'eat', 'drawn': 'draw', 'baked': 'bake', 'liked': 'like', 'knew': 'know',
    'helped': 'help', 'saw': 'see', 'found': 'find', 'heard': 'hear', 'noticed': 'notice',
    'loved': 'love', 'admired': 'admire', 'adored': 'adore', 'appreciated': 'appreciate',
    'missed': 'miss', 'respected': 'respect', 'tolerated': 'tolerate', 'valued': 'value',
    'worshipped': 'worship', 'observed': 'observe', 'discovered': 'discover', 'held': 'hold',
    'stabbed': 'stab', 'touched': 'touch', 'pierced': 'pierce', 'poked': 'poke',
    'known': 'know', 'seen': 'see', 'hit': 'hit', 'hoped': 'hope', 'said': 'say',
    'believed': 'believe', 'confessed': 'confess', 'declared': 'declare', 'proved': 'prove',
    'thought': 'think', 'supported': 'support', 'wished': 'wish', 'dreamed': 'dream',
    'expected': 'expect', 'imagined': 'imagine', 'envied': 'envy', 'wanted': 'want',
    'preferred': 'prefer', 'needed': 'need', 'intended': 'intend', 'tried': 'try',
    'attempted': 'attempt', 'planned': 'plan', 'craved': 'crave', 'hated': 'hate', 'loved': 'love',
    'enjoyed': 'enjoy', 'rolled': 'roll', 'froze': 'freeze', 'burned': 'burn', 'shortened': 'shorten',
    'floated': 'float', 'grew': 'grow', 'slid': 'slide', 'broke': 'break', 'crumpled': 'crumple',
    'split': 'split', 'changed': 'change', 'snapped': 'snap', 'tore': 'tear', 'collapsed': 'collapse',
    'decomposed': 'decompose', 'doubled': 'double', 'improved': 'improve', 'inflated': 'inflate',
    'enlarged': 'enlarge', 'reddened': 'redden', 'popped': 'pop', 'disintegrated': 'disintegrate',
    'expanded': 'expand', 'cooled': 'cool', 'soaked': 'soak', 'frozen': 'freeze', 'grown': 'grow',
    'broken': 'break', 'torn': 'tear', 'slept': 'sleep', 'smiled': 'smile', 'laughed': 'laugh',
    'sneezed': 'sneeze', 'cried': 'cry', 'talked': 'talk', 'danced': 'dance', 'jogged': 'jog',
    'walked': 'walk', 'ran': 'run', 'napped': 'nap', 'snoozed': 'snooze', 'screamed': 'scream',
    'stuttered': 'stutter', 'frowned': 'frown', 'giggled': 'giggle', 'scoffed': 'scoff',
    'snored': 'snore', 'snorted': 'snort', 'smirked': 'smirk', 'gasped': 'gasp',
    'gave': 'give', 'lended': 'lend', 'sold': 'sell', 'offered': 'offer', 'fed': 'feed',
    'passed': 'pass', 'rented': 'rent', 'served': 'serve', 'awarded': 'award', 'promised': 'promise',
    'brought': 'bring', 'sent': 'send', 'handed': 'hand', 'forwarded': 'forward', 'mailed': 'mail',
    'posted': 'post', 'given': 'give', 'shipped': 'ship', 'packed': 'pack', 'studied': 'study',
    'examined': 'examine', 'investigated': 'investigate', 'thrown': 'throw', 'threw': 'throw',
    'tossed': 'toss', 'meant': 'mean', 'longed': 'long', 'yearned': 'yearn', 'itched': 'itch',
    'loaned': 'loan', 'returned': 'return', 'slipped': 'slip', 'wired': 'wire', 'crawled': 'crawl',
    'shattered': 'shatter', 'bought': 'buy', 'squeezed': 'squeeze', 'teleported': 'teleport',
    'melted': 'melt', 'blessed': 'bless'
}

def transform_verbs(word_list):
  # Replace inflected verb forms with their lemmas, in place; all other tokens are left untouched.
  for idx, word in enumerate(word_list):
    if word in verbs_lemmas:
      word_list[idx] = verbs_lemmas[word]
  return word_list

# Note: this is specific to the way Wu et al.'s data is ordered (agent, recipient, theme) and should not be used on data in general.
# list.index() returns the first match, so a word repeated within a sentence would be mapped to the wrong position.
def transform_lf_for_v_dat_p2_with_1_3rdnp_pp(original_sentence, original_lf):
  transformed_sentence = reorder_v_dat_p2_with_1_pp_on_theme(original_sentence)
  index_change_map = {}
  split_original_sentence = transform_verbs(original_sentence.strip().replace(" .", "").split(" "))
  split_transformed_sentence = transform_verbs(transformed_sentence.strip().replace(" .", "").split(" "))
  split_original_lf = original_lf.replace("* ", "").replace(" ; ", " AND ").replace("nmod . ", "").split(" AND ")
  idx = 0
  recipient_idx = None
  # entity terms come first, in sentence order (agent, recipient, theme), so the second one (idx == 1) is the recipient
  for part in split_original_lf:
    word = part.strip().split(" ")[0]
    if word in ("agent", "recipient", "theme"):
      continue
    # map the word's token position in the original sentence to its position after reordering
    index_change_map[f"{split_original_sentence.index(word)}"] = f"{split_transformed_sentence.index(word)}"
    if word in ("on", "in", "beside"):
      continue
    if idx == 1:
      recipient_idx = f"{split_transformed_sentence.index(word)}"
    idx += 1
  # rewrite every index token of the LF in a single pass, so replacements cannot chain
  original_lf_space_split = original_lf.split(" ")
  for idx in range(len(original_lf_space_split)):
    part = original_lf_space_split[idx]
    if part in index_change_map:
      original_lf_space_split[idx] = index_change_map[part]
  transformed_lf = " ".join(original_lf_space_split)
  # re-attach the nmod to the recipient by overwriting its first argument;
  # assumes 1 digit recipient index which is true in this dataset per the grammar
  # will not be true with two prepositions though in next function
  transformed_lf = transformed_lf[:-7] + recipient_idx + transformed_lf[-6:]
  # move the PP noun term ahead of the theme term so the entity terms stay in sentence order
  transformed_lf_parts = transformed_lf.split(";")
  transformed_lf = ";".join(transformed_lf_parts[0:2] + [transformed_lf_parts[3]] + [transformed_lf_parts[2]] + transformed_lf_parts[4:])
  return transformed_lf

# 2 PP variant: two prepositional phrases are moved, and token indices can have 2 digits here
def transform_lf_for_v_dat_p2_with_2_3rdnp_pp(original_sentence, original_lf):
  transformed_sentence = reorder_v_dat_p2_with_2_pp_on_theme(original_sentence)
  index_change_map = {}
  split_original_sentence = transform_verbs(original_sentence.strip().replace(" .", "").split(" "))
  split_transformed_sentence = transform_verbs(transformed_sentence.strip().replace(" .", "").split(" "))
  split_original_lf = original_lf.replace("* ", "").replace(" ; ", " AND ").replace("nmod . ", "").split(" AND ")
  idx = 0
  recipient_idx = None
  # entity terms come first, in sentence order (agent, recipient, theme), so the second one (idx == 1) is the recipient
  for part in split_original_lf:
    word = part.strip().split(" ")[0]
    if word in ("agent", "recipient", "theme"):
      continue
    # map the word's token position in the original sentence to its position after reordering
    index_change_map[f"{split_original_sentence.index(word)}"] = f"{split_transformed_sentence.index(word)}"
    if word in ("on", "in", "beside"):
      continue
    if idx == 1:
      recipient_idx = f"{split_transformed_sentence.index(word)}"
    idx += 1
  # rewrite every index token of the LF in a single pass, so replacements cannot chain
  original_lf_space_split = original_lf.split(" ")
  for idx in range(len(original_lf_space_split)):
    part = original_lf_space_split[idx]
    if part in index_change_map:
      original_lf_space_split[idx] = index_change_map[part]
  # the first argument of the first nmod must become the recipient; it is now 2 preps back
  # (13 tokens from the end), and assigning a whole token also handles a 2 digit idx
  original_lf_space_split[-13] = recipient_idx
  transformed_lf = " ".join(original_lf_space_split)
  # move both PP noun terms ahead of the theme term so the entity terms stay in sentence order
  transformed_lf_parts = transformed_lf.split(";")
  transformed_lf = ";".join(transformed_lf_parts[0:2] + [transformed_lf_parts[3]] + [transformed_lf_parts[4]] + [transformed_lf_parts[2]] + transformed_lf_parts[5:])
  return transformed_lf

pp_counts_all = []
for idx in range(len(recogs_train_examples)):
  tree = parser.parse(recogs_train_examples["COGS Sentence"].values[idx].replace(" .", "").strip().lower())
  pp_counts_all.append(get_pp_count(tree))
recogs_train_examples["PP counts"] = pp_counts_all

recogs_train_examples2 = recogs_train_examples.copy()
keep_list = []
for idx in range(len(recogs_train_examples2)):
  if recogs_train_examples2["Verb Type"].values[idx] == "v_dat_p2":
    if recogs_train_examples2["PP counts"].values[idx] > 2:
      keep_list.append(False)
      continue
    elif recogs_train_examples2["PP counts"].values[idx] == 2:
      original_sentence = recogs_train_examples2["COGS Sentence"].values[idx]
      # support proper noun and determiner common noun cases
      # (already ok in downstream function in original as it used negative indexing relative to end, so adding "A" or "the" doesn't break anything)
      sentence_len = len(original_sentence.strip().replace(" .", "").split(" "))
      if sentence_len not in (12, 13):
        print(f"skipping proper noun recipient or other unsupported format sentence: {original_sentence}")
        keep_list.append(False)
        continue
      original_lf = recogs_train_examples2["ReCOGS Logical Form"].values[idx]
      reordered_sentence = reorder_v_dat_p2_with_2_pp_on_theme(original_sentence)
      try:
        # parse only to check that the reordered sentence is still accepted by the parser
        parser.parse(reordered_sentence.lower())
      except Exception:
        print(f"Skipping '{original_sentence}' which has a proper noun recipient or other disqualifying element for moving the 2 prepositional phrases to the recipient")
        keep_list.append(False)
        continue
      reordered_sentence = reordered_sentence[0].upper() + reordered_sentence[1:] + " ."
      reordered_lf = transform_lf_for_v_dat_p2_with_2_3rdnp_pp(original_sentence, original_lf)
      print(f"Replacing '{original_sentence}' with '{reordered_sentence}' ; replacing '{original_lf}' with '{reordered_lf}'")
      recogs_train_examples2["COGS Sentence"].values[idx] = reordered_sentence
      recogs_train_examples2["ReCOGS Logical Form"].values[idx] = reordered_lf
    elif recogs_train_examples2["PP counts"].values[idx] == 1:
      original_sentence = recogs_train_examples2["COGS Sentence"].values[idx]
      # support proper noun and determiner common noun cases
      # (already ok in downstream function in original as it used negative indexing relative to end, so adding "A" or "the" doesn't break anything)
      sentence_len = len(original_sentence.strip().replace(" .", "").split(" "))
      if sentence_len not in (9, 10):
        print(f"skipping proper noun recipient or other unsupported format sentence: {original_sentence}")
        keep_list.append(False)
        continue
      original_lf = recogs_train_examples2["ReCOGS Logical Form"].values[idx]
      reordered_sentence = reorder_v_dat_p2_with_1_pp_on_theme(original_sentence)
      try:
        # parse only to check that the reordered sentence is still accepted by the parser
        parser.parse(reordered_sentence.lower())
      except Exception:
        print(f"Skipping '{original_sentence}' which has a proper noun recipient or other disqualifying element for moving the prepositional phrase to the recipient")
        keep_list.append(False)
        continue
      reordered_sentence = reordered_sentence[0].upper() + reordered_sentence[1:] + " ."
      reordered_lf = transform_lf_for_v_dat_p2_with_1_3rdnp_pp(original_sentence, original_lf)
      print(f"Replacing '{original_sentence}' with '{reordered_sentence}' ; replacing '{original_lf}' with '{reordered_lf}'")
      recogs_train_examples2["COGS Sentence"].values[idx] = reordered_sentence
      recogs_train_examples2["ReCOGS Logical Form"].values[idx] = reordered_lf
  keep_list.append(True)

recogs_train_examples2["keep"] = keep_list

recogs_train_examples2 = recogs_train_examples2[recogs_train_examples2["keep"]]

t = recogs_train_examples2[recogs_train_examples2["PP counts"] <= 2]
t = t[t["PP counts"] > 0]
modified_train_v_dat_p2_examples_pp_moved_to_recipient_np = t[t["Verb Type"] == "v_dat_p2"]

print(f"modified_train_v_dat_p2_examples_pp_moved_to_recipient_np (sample n=10): {modified_train_v_dat_p2_examples_pp_moved_to_recipient_np[:10]}")

# create one tsv with pp counts up to 2
modified_train_v_dat_p2_examples_pp_moved_to_recipient_np.filter(["COGS Sentence", "ReCOGS Logical Form", "Distribution"]).to_csv("modified_train_set_examples_v_dat_p2_pp_moved_to_recipient_ppdepthle2.tsv", sep="\t", index=False)

# create one tsv with only a pp count of 1 partly for backwards compatibility
modified_train_v_dat_p2_examples_pp_moved_to_recipient_np[modified_train_v_dat_p2_examples_pp_moved_to_recipient_np["PP counts"] == 1].filter(["COGS Sentence", "ReCOGS Logical Form", "Distribution"]).to_csv("modified_train_set_examples_v_dat_p2_pp_moved_to_recipient.tsv", sep="\t", index=False)

# recombining here with other training examples is not useful, we just concatenate to train.tsv for the data augmentation experiment in https://colab.research.google.com/drive/1rvVNQYH7NUrLmsCfdcyzwMos-HMkCNTM#scrollTo=f9bC-CPBtIil
#recogs_train_examples2.filter(["COGS Sentence", "ReCOGS Logical Form", "Distribution"]).to_csv("train_data_no_sprinkles_no_preposing_no_cp_v_dat_p2_pp_moved_to_recipient.tsv", sep="	", index=False)
#subprocess.run("cat train.tsv | grep -v 'in_distribution' | grep 'primitive' >> train_data_no_sprinkles_no_preposing_no_cp_v_dat_p2_pp_moved_to_recipient.tsv", shell=True)


Load official Wu et al 2023 ReCOGS training examples
(sample from https://raw.githubusercontent.com/frankaging/ReCOGS/refs/heads/main/recogs_positional_index/train.tsv , associated with https://arxiv.org/abs/2303.13716 )
Example information we can add to 100 train.tsv samples:
                                     COGS Sentence  \
0                     A rose was helped by a dog .   
1                        The sailor dusted a boy .   
2                          Emma rolled a teacher .   
3                         Evelyn rolled the girl .   
4      A cake was forwarded to Levi by Charlotte .   
..                                             ...   
95  The girl changed a sandwich beside the table .   
96       Emma gave a landlord the box in a house .   
97              The balloon was painted by a boy .   
98                       A girl liked the raisin .   
99   The pancake was passed to Charlotte by Noah .   

                                  ReCOGS Logical Form     Distribution  \
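
As a worked illustration of the index bookkeeping in `transform_lf_for_v_dat_p2_with_1_3rdnp_pp`, here is a minimal standalone sketch for the single-PP case. It is not the notebook's implementation: `move_pp_to_recipient` is a hypothetical helper that uses regular expressions instead of negative slicing, it takes the reordered sentence as an argument instead of calling `reorder_v_dat_p2_with_1_pp_on_theme`, and the original logical form below was reconstructed by hand from the ReCOGS format rather than copied from train.tsv. It assumes every token other than a determiner occurs only once in the sentence.

```python
import re

DETERMINERS = {"a", "the"}  # determiners are never indexed in the logical forms shown here

def move_pp_to_recipient(original_sentence, moved_sentence, lf):
    """Hypothetical helper: rewrite `lf` so it matches `moved_sentence` (1 PP case)."""
    old_tokens = original_sentence.split()
    new_tokens = moved_sentence.split()
    # Old token position -> new token position (assumes non-determiner tokens are unique).
    position_map = {
        str(i): str(new_tokens.index(tok))
        for i, tok in enumerate(old_tokens)
        if tok.lower() not in DETERMINERS
    }
    # One pass over the original string, so replacements cannot chain.
    lf = re.sub(r"\b\d+\b", lambda m: position_map.get(m.group(0), m.group(0)), lf)
    # Re-attach the PP: its head becomes the recipient instead of the theme.
    recipient = re.search(r"recipient \( \d+ , (\d+) \)", lf).group(1)
    lf = re.sub(r"(nmod \. \w+ \( )\d+", lambda m: m.group(1) + recipient, lf, count=1)
    # Keep the entity terms in sentence order, as in the notebook's output.
    *entities, events = lf.split(" ; ")
    entities.sort(key=lambda term: int(re.search(r"\( (\d+) \)", term).group(1)))
    return " ; ".join(entities + [events])

# Hand-reconstructed logical form for the original word order (an assumption).
original_lf = (
    "Emma ( 0 ) ; landlord ( 3 ) ; * box ( 5 ) ; house ( 8 ) ; give ( 1 ) "
    "AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 5 ) AND nmod . in ( 5 , 8 )"
)
result = move_pp_to_recipient(
    "Emma gave a landlord the box in a house",
    "Emma gave a landlord in a house the box",
    original_lf,
)
print(result)
```

The theme index moves from 5 to 8, the PP noun from 8 to 6, and the `nmod` head is redirected from the theme to the recipient at position 3.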

In [None]:
!cat modified_train_set_examples_v_dat_p2_pp_moved_to_recipient_ppdepthle2.tsv | wc -l

369
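
The `wc -l` count includes the header row, so 369 lines would correspond to 368 examples, assuming no field contains an embedded newline. To see how the rows split between one and two prepositional phrases, one option is to count the `nmod` terms in each logical form. The sketch below uses only the standard library; `pp_depth_histogram` is a hypothetical helper, and it runs on two rows copied from the generated TSV rather than on the file itself.

```python
import csv
import io
from collections import Counter

def pp_depth_histogram(tsv_text):
    """Count examples by the number of `nmod` terms in their logical form."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["ReCOGS Logical Form"].count("nmod . ") for row in reader)

rows = [
    ("COGS Sentence", "ReCOGS Logical Form", "Distribution"),
    (
        "Emma gave a landlord in a house the box .",
        "Emma ( 0 ) ; landlord ( 3 ) ; house ( 6 ) ; * box ( 8 ) ; give ( 1 ) "
        "AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . in ( 3 , 6 )",
        "in_distribution",
    ),
    (
        "Olivia lended a politician beside a cup in a room a game .",
        "Olivia ( 0 ) ; politician ( 3 ) ; cup ( 6 ) ; room ( 9 ) ; game ( 11 ) ; lend ( 1 ) "
        "AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 11 ) "
        "AND nmod . beside ( 3 , 6 ) AND nmod . in ( 6 , 9 )",
        "in_distribution",
    ),
]
sample_tsv = "\n".join("\t".join(row) for row in rows) + "\n"
histogram = pp_depth_histogram(sample_tsv)
print(histogram)
```

On the real file, the same function could be given the text of `modified_train_set_examples_v_dat_p2_pp_moved_to_recipient_ppdepthle2.tsv`.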


In [None]:
!cat modified_train_set_examples_v_dat_p2_pp_moved_to_recipient_ppdepthle2.tsv

COGS Sentence	ReCOGS Logical Form	Distribution
Liam gave the monkey in the container a chalk .	Liam ( 0 ) ; * monkey ( 3 ) ; * container ( 6 ) ; chalk ( 8 ) ; give ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . in ( 3 , 6 )	in_distribution
Emma gave a landlord in a house the box .	Emma ( 0 ) ; landlord ( 3 ) ; house ( 6 ) ; * box ( 8 ) ; give ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . in ( 3 , 6 )	in_distribution
Emma awarded a bird on the stool the drink .	Emma ( 0 ) ; bird ( 3 ) ; * stool ( 6 ) ; * drink ( 8 ) ; award ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . on ( 3 , 6 )	in_distribution
Olivia lended a politician beside a cup in a room a game .	Olivia ( 0 ) ; politician ( 3 ) ; cup ( 6 ) ; room ( 9 ) ; game ( 11 ) ; lend ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 11 ) AND nmod . beside ( 3 , 6 ) AND nmod . in ( 6 , 9 )	in_distribution
Emma offered a g
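
Because the transformation rewrites token indices by position, a cheap consistency check on the generated rows is that every entity term `word ( i )` names the token actually found at position `i` of the sentence. The sketch below is illustrative and not part of the original notebook: `check_entity_indices` is a hypothetical name, and the verb term is skipped because checking it would need the `verbs_lemmas` table.

```python
import re

# An entity term is an optional "* " definiteness marker, a word, and one index.
ENTITY_TERM = re.compile(r"(?:\* )?(\S+) \( (\d+) \)")

def check_entity_indices(sentence, lf):
    """Return a list of problems; an empty list means all entity indices line up."""
    tokens = sentence.split()
    # Everything before the last " ; " is an entity term; the rest is the verb and its roles.
    *entities, _events = lf.split(" ; ")
    problems = []
    for term in entities:
        match = ENTITY_TERM.fullmatch(term.strip())
        if match is None:
            problems.append(f"unparseable term: {term!r}")
            continue
        word, idx = match.group(1), int(match.group(2))
        if idx >= len(tokens) or tokens[idx] != word:
            problems.append(f"{word!r} is not token {idx}")
    return problems

sentence = "Olivia lended a politician beside a cup in a room a game ."
lf = (
    "Olivia ( 0 ) ; politician ( 3 ) ; cup ( 6 ) ; room ( 9 ) ; game ( 11 ) ; lend ( 1 ) "
    "AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 11 ) "
    "AND nmod . beside ( 3 , 6 ) AND nmod . in ( 6 , 9 )"
)
ok = check_entity_indices(sentence, lf)
# Deliberately corrupt one index to show what a failure looks like.
bad = check_entity_indices(sentence, lf.replace("game ( 11 )", "game ( 10 )"))
print(ok, bad)
```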

In [None]:
!cat /content/modified_train_set_examples_v_dat_p2_pp_moved_to_recipient.tsv

COGS Sentence	ReCOGS Logical Form	Distribution
Liam gave the monkey in the container a chalk .	Liam ( 0 ) ; * monkey ( 3 ) ; * container ( 6 ) ; chalk ( 8 ) ; give ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . in ( 3 , 6 )	in_distribution
Emma gave a landlord in a house the box .	Emma ( 0 ) ; landlord ( 3 ) ; house ( 6 ) ; * box ( 8 ) ; give ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . in ( 3 , 6 )	in_distribution
Emma awarded a bird on the stool the drink .	Emma ( 0 ) ; bird ( 3 ) ; * stool ( 6 ) ; * drink ( 8 ) ; award ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . on ( 3 , 6 )	in_distribution
Emma offered a girl on the table a drink .	Emma ( 0 ) ; girl ( 3 ) ; * table ( 6 ) ; drink ( 8 ) ; offer ( 1 ) AND agent ( 1 , 0 ) AND recipient ( 1 , 3 ) AND theme ( 1 , 8 ) AND nmod . on ( 3 , 6 )	in_distribution
Emma offered a teacher beside a bed the scarf .	Emma ( 0 ) ; teacher ( 3 ) ; 