This document: https://colab.research.google.com/drive/12mXX5L1I4rpwl1Jk8hCm-xyAkqiKJEo7?usp=sharing (public link)

Copy of the original at https://colab.research.google.com/drive/10BXfzIlkPHrd_zFqIxTc8fm8soa8rYs1 (request access if needed). The only edits are to text cells and manual computation cells, to add stats for runs that finished after the original deadline. At the original deadline I had only n=8 for the 3-layer Wu et al 2023 baseline variant on ReCOGS_pos and n=5 for the 4-layer variant, instead of the full n=10 for each (n counts separate Transformer training runs, not the size of the evaluation set). The full set of runs did complete in the notebook but was not reported in the original draft. Here I just update the text cells to add the n=2 and n=5 additional runs, respectively, to the reported means, standard deviations, and confidence intervals.

**Note: this notebook tests the baseline model, not the RASP model, on ReCOGS_pos (the same dataset used for the RASP model). The baseline is run exactly as in Wu et al 2023 (using their Python scripts), and also with 3- and 4-layer variants (instead of their original 2 layers).**

**Note on random seeds used**:
Wu et al 2023 use the following 5 seeds in their default script: 42, 66, 77, 88, 99.
When I run n > 5 for a condition, each subsequent group of 5 runs executed by their script uses the previous group's seeds incremented by 1, e.g. 43, 67, 78, 89, 100 for the second group, and so on. There was no cherry-picking of seeds.
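
The seed scheme described above can be sketched as follows. This is a minimal illustration, not code from the Wu et al 2023 scripts: `BASE_SEEDS` and `seed_group` are hypothetical names, and the only facts taken from this notebook are the five base seeds and the `;`-separated format passed to `--seeds`.

```python
# Base seeds from the default script of Wu et al 2023, as listed above.
BASE_SEEDS = [42, 66, 77, 88, 99]


def seed_group(offset):
    """Return the --seeds argument for the group of 5 runs at `offset`.

    offset=0 gives the paper's seeds; each later group adds 1 to every seed.
    (Hypothetical helper, for illustration only.)
    """
    return ";".join(str(seed + offset) for seed in BASE_SEEDS)


# The four groups that make up the n=20 two-layer baseline runs.
for offset in range(4):
    print(seed_group(offset))
```

Because each group is derived mechanically from the previous one, the set of seeds is fixed before any results are seen.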

# Wu et al 2023 baseline with ReCOGS LF output format, and fixed positional indices

ReCOGS commit used 1b6eca8ff4dca5fd2fb284a7d470998af5083beb

In [None]:
%cd /content/
!rm -rf ReCOGS
!git clone https://github.com/frankaging/ReCOGS.git

/content
Cloning into 'ReCOGS'...
remote: Enumerating objects: 436, done.
remote: Counting objects: 100% (124/124), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 436 (delta 96), reused 92 (delta 73), pack-reused 312 (from 1)
Receiving objects: 100% (436/436), 84.71 MiB | 35.84 MiB/s, done.
Resolving deltas: 100% (303/303), done.
Updating files: 100% (137/137), done.


In [None]:
%cd ReCOGS

/content/ReCOGS


In [None]:
!cat README.md

<!-- PROJECT LOGO -->
<br />
<div align="center">
  <h3 align="center">ReCOGS: How Incidental Details of a Logical Form Overshadow an Evaluation of Semantic Interpretation</h3>
  <p align="center">
    Zhengxuan Wu, Christopher D. Manning, Christopher Potts
    <br />
    <a href="https://arxiv.org/abs/2303.13716"><strong>Read our preprint »</strong></a>
    <br />
    <br />
    <a href="https://github.com/frankaging/ReCOGS/issues">Report Bug</a>
    ·
    <a href="https://nlp.stanford.edu/~wuzhengx/">Contact Us</a>
  </p>
</div>

## Introduction

Compositional generalization benchmarks seek to assess whether models can accurately compute **meanings** for novel sentences, but operationalize this in terms of **logical form** (LF) prediction. This raises the concern that semantically irrelevant details of the chosen LFs could shape model performance. We argue that this concern is realized for [the COGS benchmark](https://aclanthology.org/2020.emnlp-main.731.pdf).

## Citation
If you use

In [None]:
!cat /content/ReCOGS/model/encoder_config.json

{
  "architectures": [
    "Bert"
  ],
  "model_type": "bert",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 300,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "num_attention_heads": 4,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 762,
  "pad_token_id": 0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "unk_token_id": 3,
  "mask_token_id": 4,
  "cls_token_id": 5,
  "sum_token_id": 6,
  "nsp_token_id": 7,
  "position_embedding_type": "absolute",
  "position_embedding_init": "random"
}

In [None]:
!cat /content/ReCOGS/model/decoder_config.json

{
  "architectures": [
    "Bert"
  ],
  "model_type": "bert",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 300,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "num_attention_heads": 4,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 729,
  "pad_token_id": 0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "unk_token_id": 3,
  "mask_token_id": 4,
  "decoder_start_token_id": 1,
  "position_embedding_type": "absolute",
  "position_embedding_init": "random"
}

In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "42;66;77;88;99" # paper's seeds, not sure how they chose them

INFO:root:Baselining the Transformer Encoder-Decoder Model
INFO:root:__Number CUDA Devices: 1
INFO:root:Number of model params: 4344077
INFO:root:OUTPUT DIR: ./results_recogs_positional_index/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.42
Epoch: 0:   0% 0/213 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Epoch: 0: 100% 213/213 [00:11<00:00, 18.71it/s, loss=5.89]
Epoch: 1: 100% 213/213 [00:10<00:00, 20.35it/s, loss=4.54]
Epoch: 2: 100% 213/213 [00:10<00:00, 20.35it/s, loss=3.54]
Epoch: 3: 100% 213/213 [00:10<00:00, 20.36it/s, loss=2.49]
Epoch: 4: 100% 213/213 [00:10<00:00, 20.37it/s, loss=1.91]
Epoch: 5: 100% 213/213 [00:10<00:00, 20.30it/s, loss=1.58]
Epoch: 6: 100% 213/213 [00:10<00:00, 20.37it/s, loss=1.3]
Epoch: 7: 100% 213/213 [00:10<00:00, 20.37it/s, loss=1.09]
Epoch: 8: 100% 213/213 [00:10<00:0

more seeds

In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "43;67;78;89;100" # paper's seeds +1

INFO:root:Baselining the Transformer Encoder-Decoder Model
INFO:root:__Number CUDA Devices: 1
INFO:root:Number of model params: 4344077
INFO:root:OUTPUT DIR: ./results_recogs_positional_index/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.43
Epoch: 0:   0% 0/213 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Epoch: 0: 100% 213/213 [00:12<00:00, 17.33it/s, loss=6]
Epoch: 1: 100% 213/213 [00:10<00:00, 20.07it/s, loss=4.59]
Epoch: 2: 100% 213/213 [00:10<00:00, 20.08it/s, loss=3.54]
Epoch: 3: 100% 213/213 [00:10<00:00, 20.10it/s, loss=2.48]
Epoch: 4: 100% 213/213 [00:10<00:00, 20.07it/s, loss=1.92]
Epoch: 5: 100% 213/213 [00:10<00:00, 20.03it/s, loss=1.55]
Epoch: 6: 100% 213/213 [00:10<00:00, 20.01it/s, loss=1.26]
Epoch: 7: 100% 213/213 [00:10<00:00, 20.05it/s, loss=1.08]
Epoch: 8: 100% 213/213 [00:10<00:00,

Ran out of money partway through this group of seeds, so pick up from the next seed (78).

In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "78;89;100" # paper's seeds +1

INFO:root:Baselining the Transformer Encoder-Decoder Model
INFO:root:__Number CUDA Devices: 1
INFO:root:Number of model params: 4344077
INFO:root:OUTPUT DIR: ./results_recogs_positional_index/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.78
Epoch: 0:   0% 0/213 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Epoch: 0: 100% 213/213 [00:12<00:00, 17.60it/s, loss=5.96]
Epoch: 1: 100% 213/213 [00:10<00:00, 20.08it/s, loss=4.65]
Epoch: 2: 100% 213/213 [00:10<00:00, 20.15it/s, loss=3.57]
Epoch: 3: 100% 213/213 [00:10<00:00, 20.09it/s, loss=2.5]
Epoch: 4: 100% 213/213 [00:10<00:00, 20.25it/s, loss=1.94]
Epoch: 5: 100% 213/213 [00:10<00:00, 20.22it/s, loss=1.61]
Epoch: 6: 100% 213/213 [00:10<00:00, 20.25it/s, loss=1.35]
Epoch: 7: 100% 213/213 [00:10<00:00, 20.23it/s, loss=1.17]
Epoch: 8: 100% 213/213 [00:10<00:0

more seeds

In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "44;68;79;90;101" # paper's seeds + 2, not sure how they chose them

INFO:root:Baselining the Transformer Encoder-Decoder Model
INFO:root:__Number CUDA Devices: 1
INFO:root:Number of model params: 4344077
INFO:root:OUTPUT DIR: ./results_recogs_positional_index/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.44
Epoch: 0:   0% 0/213 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Epoch: 0: 100% 213/213 [00:12<00:00, 17.61it/s, loss=5.93]
Epoch: 1: 100% 213/213 [00:10<00:00, 20.06it/s, loss=4.56]
Epoch: 2: 100% 213/213 [00:10<00:00, 20.07it/s, loss=3.55]
Epoch: 3: 100% 213/213 [00:10<00:00, 19.99it/s, loss=2.52]
Epoch: 4: 100% 213/213 [00:10<00:00, 20.06it/s, loss=1.92]
Epoch: 5: 100% 213/213 [00:10<00:00, 20.05it/s, loss=1.58]
Epoch: 6: 100% 213/213 [00:10<00:00, 20.08it/s, loss=1.33]
Epoch: 7: 100% 213/213 [00:10<00:00, 20.04it/s, loss=1.15]
Epoch: 8: 100% 213/213 [00:10<00:

In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "45;69;80;91;102" # paper's seeds + 3, not sure how they chose them

INFO:root:Baselining the Transformer Encoder-Decoder Model
INFO:root:__Number CUDA Devices: 1
INFO:root:Number of model params: 4344077
INFO:root:OUTPUT DIR: ./results_recogs_positional_index/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.45
Epoch: 0:   0% 0/213 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Epoch: 0: 100% 213/213 [00:11<00:00, 18.49it/s, loss=5.99]
Epoch: 1: 100% 213/213 [00:10<00:00, 20.01it/s, loss=4.65]
Epoch: 2: 100% 213/213 [00:10<00:00, 19.98it/s, loss=3.61]
Epoch: 3: 100% 213/213 [00:10<00:00, 19.93it/s, loss=2.52]
Epoch: 4: 100% 213/213 [00:10<00:00, 20.02it/s, loss=1.95]
Epoch: 5: 100% 213/213 [00:10<00:00, 19.98it/s, loss=1.59]
Epoch: 6: 100% 213/213 [00:10<00:00, 19.96it/s, loss=1.32]
Epoch: 7: 100% 213/213 [00:10<00:00, 19.93it/s, loss=1.13]
Epoch: 8: 100% 213/213 [00:10<00:

seed 42: OVERALL: 87.36190476190477

seed 43: OVERALL: 90.46666666666667

seed 44: OVERALL: 85.20476190476191

seed 45: OVERALL: 90.67619047619047

seed 66: OVERALL: 89.84761904761905

seed 67: OVERALL: 87.83333333333333

seed 68: OVERALL: 90.32857142857142

seed 69: OVERALL: 89.20476190476191

seed 77: OVERALL: 90.14285714285715

seed 78: OVERALL: 87.5047619047619

seed 79: OVERALL: 90.86666666666666

seed 80: OVERALL: 88.82857142857142

seed 88: OVERALL: 85.04761904761905

seed 89: OVERALL: 89.66190476190476

seed 90: OVERALL: 90.10000000000001

seed 91: OVERALL: 85.41904761904762

seed 99: OVERALL: 85.77142857142857

seed 100: OVERALL: 88.41904761904762

seed 101: OVERALL: 89.4

seed 102: OVERALL: 88.85714285714286





In [None]:
import numpy as np
wu_et_al_2023_baseline_overall_sem_accuracy_recogs_pos = np.array([87.36190476190477, 90.46666666666667, 85.20476190476191, 90.67619047619047,
                                                                   89.84761904761905, 87.83333333333333, 90.32857142857142, 89.20476190476191,  90.14285714285715, 87.5047619047619,
                                                                   90.86666666666666, 88.82857142857142, 85.04761904761905, 89.66190476190476, 90.10000000000001 ,
                                                                   85.41904761904762, 85.77142857142857, 88.41904761904762, 89.4, 88.85714285714286])

In [None]:
wu_et_al_2023_baseline_overall_sem_accuracy_recogs_pos.mean()

88.54714285714286

In [None]:
wu_et_al_2023_baseline_overall_sem_accuracy_recogs_pos.std()

1.8697974837894837

In [None]:
import math
wu_et_al_2023_baseline_overall_sem_accuracy_recogs_pos.std()/math.sqrt(len(wu_et_al_2023_baseline_overall_sem_accuracy_recogs_pos))

0.4180994277911346

In [None]:
sem_accuracy_overall_stderr_1p96 = wu_et_al_2023_baseline_overall_sem_accuracy_recogs_pos.std()/math.sqrt(len(wu_et_al_2023_baseline_overall_sem_accuracy_recogs_pos))*1.96

In [None]:
(wu_et_al_2023_baseline_overall_sem_accuracy_recogs_pos.mean() - sem_accuracy_overall_stderr_1p96, wu_et_al_2023_baseline_overall_sem_accuracy_recogs_pos.mean() + sem_accuracy_overall_stderr_1p96)

(87.72766797867223, 89.36661773561349)

(Note: these runs are on GPU, so re-running with the same seeds will not reproduce exactly the same results, but the results should come from the same statistical distribution.)

seed 42: obj_pp_to_subj_pp: 14.8

seed 66: obj_pp_to_subj_pp: 19.7

seed 77: obj_pp_to_subj_pp: 31.0

seed 88: obj_pp_to_subj_pp: 13.5

seed 99: obj_pp_to_subj_pp: 17.8

seed 43: obj_pp_to_subj_pp: 20.2

seed 44: obj_pp_to_subj_pp: 15.1

seed 67: obj_pp_to_subj_pp: 17.5

seed 68: obj_pp_to_subj_pp: 16.4

seed 78: obj_pp_to_subj_pp: 20.0

seed 79: obj_pp_to_subj_pp: 23.0

seed 89: obj_pp_to_subj_pp: 20.1

seed 90: obj_pp_to_subj_pp: 12.7

seed 100: obj_pp_to_subj_pp: 15.8

seed 101: obj_pp_to_subj_pp: 12.4

seed 45: obj_pp_to_subj_pp: 31.4

seed 69: obj_pp_to_subj_pp: 35.3

seed 80: obj_pp_to_subj_pp: 21.7

seed 91: obj_pp_to_subj_pp: 18.9

seed 102: obj_pp_to_subj_pp: 16.4


In [None]:
import numpy as np
wu_et_al_2023_baseline_recogs_pos_subjpp = np.array([14.8, 19.7, 31.0, 13.5, 17.8, 20.2, 15.1, 17.5, 16.4, 20.0, 23.0, 20.1, 12.7, 15.8, 12.4, 31.4, 35.3, 21.7, 18.9, 16.4])

In [None]:
wu_et_al_2023_baseline_recogs_pos_subjpp.mean()

19.684999999999995

In [None]:
len(wu_et_al_2023_baseline_recogs_pos_subjpp)

20

In [None]:
wu_et_al_2023_baseline_recogs_pos_subjpp.std()

6.145345799871639

In [None]:
import math
stderr_x_1p96 = wu_et_al_2023_baseline_recogs_pos_subjpp.std()/math.sqrt(20)*1.96

In [None]:
(wu_et_al_2023_baseline_recogs_pos_subjpp.mean() - stderr_x_1p96, wu_et_al_2023_baseline_recogs_pos_subjpp.mean() + stderr_x_1p96)

(16.991683453063857, 22.378316546936134)

# Wu et al 2023 baseline Encoder-Decoder Transformer with 3 layers, 4 layers (note our claim is that these do not do better because, like our RASP program, the Transformer has learned a flat solution, not a tree-like, recursive one)

See also some early explorations of this in the notebooks below (spread over 3 notebooks so they could run in parallel). They have a higher experiment count but use plain ReCOGS, not the ReCOGS *positional* dataset, so they are not directly comparable to the RASP model runs. The RASP model runs were done on ReCOGS_pos so that String Exact Match (not just Semantic Exact Match) could also be reported, since the RASP model can do it.

https://colab.research.google.com/drive/19_M-KC98vK5_2ZiQj0UR8CkiVwuB-lvO

https://colab.research.google.com/drive/1WvMyX-fngMj5MKm10NP4Jct7H63hXzSQ

https://colab.research.google.com/drive/1X2rRBR8WfBr4zCvDJuuaRsJw5UrIsWOu

These links are not public, since those runs are not reported in the paper; their results are consistent with the findings here.

## Wu et al 2023 baseline Encoder-Decoder - 3 layers - not controlling for parameter count

ReCOGS commit used 1b6eca8ff4dca5fd2fb284a7d470998af5083beb

In [None]:
%cd /content/
!rm -rf ReCOGS
!git clone https://github.com/frankaging/ReCOGS.git
%cd ReCOGS

!echo '{\
  "architectures": [\
    "Bert"\
  ],\
  "model_type": "bert",\
  "attention_probs_dropout_prob": 0.1,\
  "hidden_act": "gelu",\
  "hidden_dropout_prob": 0.1,\
  "hidden_size": 300,\
  "initializer_range": 0.02,\
  "intermediate_size": 512,\
  "num_attention_heads": 4,\
  "num_hidden_layers": 3,\
  "type_vocab_size": 2,\
  "vocab_size": 762,\
  "pad_token_id": 0,\
  "bos_token_id": 1,\
  "eos_token_id": 2,\
  "unk_token_id": 3,\
  "mask_token_id": 4,\
  "cls_token_id": 5,\
  "sum_token_id": 6,\
  "nsp_token_id": 7,\
  "position_embedding_type": "absolute",\
  "position_embedding_init": "random"\
}' > /content/ReCOGS/model/encoder_config.json

!echo '{\
  "architectures": [\
    "Bert"\
  ],\
  "model_type": "bert",\
  "attention_probs_dropout_prob": 0.1,\
  "hidden_act": "gelu",\
  "hidden_dropout_prob": 0.1,\
  "hidden_size": 300,\
  "initializer_range": 0.02,\
  "intermediate_size": 512,\
  "num_attention_heads": 4,\
  "num_hidden_layers": 3,\
  "type_vocab_size": 2,\
  "vocab_size": 729,\
  "pad_token_id": 0,\
  "bos_token_id": 1,\
  "eos_token_id": 2,\
  "unk_token_id": 3,\
  "mask_token_id": 4,\
  "decoder_start_token_id": 1,\
  "position_embedding_type": "absolute",\
  "position_embedding_init": "random"\
}' > /content/ReCOGS/model/decoder_config.json

/content
Cloning into 'ReCOGS'...
remote: Enumerating objects: 436, done.
remote: Counting objects: 100% (124/124), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 436 (delta 96), reused 92 (delta 73), pack-reused 312 (from 1)
Receiving objects: 100% (436/436), 84.71 MiB | 33.61 MiB/s, done.
Resolving deltas: 100% (303/303), done.
Updating files: 100% (137/137), done.
/content/ReCOGS
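
The cell above rewrites both config files in full just to change `num_hidden_layers`. As an alternative, a minimal sketch that edits only that key with the standard `json` module is shown here. `set_num_layers` is a hypothetical helper, not part of the ReCOGS repository; the file names and the `/content/ReCOGS/model` directory are the ones used in this notebook.

```python
import json
from pathlib import Path


def set_num_layers(model_dir, num_layers):
    """Set num_hidden_layers in both config files, leaving other keys untouched.

    Hypothetical helper; assumes encoder_config.json and decoder_config.json
    exist in `model_dir`, as they do after cloning the repository.
    """
    for name in ("encoder_config.json", "decoder_config.json"):
        path = Path(model_dir) / name
        config = json.loads(path.read_text())
        config["num_hidden_layers"] = num_layers
        path.write_text(json.dumps(config, indent=2) + "\n")


# Usage in this notebook would be, after the clone step:
# set_num_layers("/content/ReCOGS/model", 3)
```

Editing a single key keeps the 2-, 3-, and 4-layer configs identical in every other field, including the differing encoder and decoder `vocab_size` values.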


In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "42;66;77;88;99" # paper's seeds, not sure how they chose them

INFO:root:Baselining the Transformer Encoder-Decoder Model
INFO:root:__Number CUDA Devices: 1
INFO:root:Number of model params: 6046701
INFO:root:OUTPUT DIR: ./results_recogs_positional_index/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.42
Epoch: 0:   0% 0/213 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Epoch: 0: 100% 213/213 [00:16<00:00, 12.78it/s, loss=5.85]
Epoch: 1: 100% 213/213 [00:15<00:00, 14.18it/s, loss=4.53]
Epoch: 2: 100% 213/213 [00:14<00:00, 14.26it/s, loss=3.42]
Epoch: 3: 100% 213/213 [00:14<00:00, 14.27it/s, loss=2.38]
Epoch: 4: 100% 213/213 [00:14<00:00, 14.28it/s, loss=1.82]
Epoch: 5: 100% 213/213 [00:14<00:00, 14.27it/s, loss=1.47]
Epoch: 6: 100% 213/213 [00:14<00:00, 14.29it/s, loss=1.22]
Epoch: 7: 100% 213/213 [00:14<00:00, 14.30it/s, loss=1.04]
Epoch: 8: 100% 213/213 [00:14<00:
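
Since these variants do not control for parameter count, it helps to see how much each extra layer adds. The logs in this notebook report 4344077 parameters for 2 layers, 6046701 for 3 layers, and 7749325 for 4 layers. The sketch below checks that difference by hand, assuming a standard BERT-style layer layout (self-attention with query/key/value/output projections plus LayerNorm, a two-layer feed-forward block plus LayerNorm, and an additional cross-attention block in each decoder layer). That layout is an assumption about the underlying model code, not something read from it.

```python
HIDDEN = 300        # hidden_size from the config files
INTERMEDIATE = 512  # intermediate_size from the config files


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


LAYER_NORM = 2 * HIDDEN  # scale and shift vectors

# Query, key, value and output projections, followed by a LayerNorm.
ATTENTION = 4 * linear(HIDDEN, HIDDEN) + LAYER_NORM
# Up-projection, down-projection, followed by a LayerNorm.
FEED_FORWARD = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + LAYER_NORM

encoder_layer = ATTENTION + FEED_FORWARD
decoder_layer = 2 * ATTENTION + FEED_FORWARD  # self-attention + cross-attention

# Adding one layer adds one encoder layer and one decoder layer.
per_extra_layer = encoder_layer + decoder_layer

# Parameter counts as logged by run_cogs.py in this notebook.
logged = {2: 4344077, 3: 6046701, 4: 7749325}

print(per_extra_layer)
print(logged[3] - logged[2], logged[4] - logged[3])
```

Under this assumed layout the per-layer estimate equals the difference between consecutive logged counts, so the 4-layer variant has roughly 1.8 times the parameters of the 2-layer baseline.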

In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "43;67;78;89;100" # paper's seeds + 1

INFO:root:Baselining the Transformer Encoder-Decoder Model
INFO:root:__Number CUDA Devices: 1
INFO:root:Number of model params: 6046701
INFO:root:OUTPUT DIR: ./results_recogs_positional_index/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.43
Epoch: 0:   0% 0/213 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Epoch: 0: 100% 213/213 [00:15<00:00, 13.42it/s, loss=5.87]
Epoch: 1: 100% 213/213 [00:14<00:00, 14.20it/s, loss=4.58]
Epoch: 2: 100% 213/213 [00:14<00:00, 14.26it/s, loss=3.49]
Epoch: 3: 100% 213/213 [00:14<00:00, 14.26it/s, loss=2.4]
Epoch: 4: 100% 213/213 [00:14<00:00, 14.24it/s, loss=1.83]
Epoch: 5: 100% 213/213 [00:14<00:00, 14.22it/s, loss=1.47]
Epoch: 6: 100% 213/213 [00:14<00:00, 14.25it/s, loss=1.18]
Epoch: 7: 100% 213/213 [00:14<00:00, 14.24it/s, loss=1.01]
Epoch: 8: 100% 213/213 [00:14<00:0

colab interrupted this at seed 67

In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "67;78;89;100" # paper's seeds + 1, resuming from seed 67

INFO:root:Baselining the Transformer Encoder-Decoder Model
INFO:root:__Number CUDA Devices: 1
INFO:root:Number of model params: 6046701
INFO:root:OUTPUT DIR: ./results_recogs_positional_index/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.67
Epoch: 0:   0% 0/213 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Epoch: 0: 100% 213/213 [00:17<00:00, 12.47it/s, loss=5.76]
Epoch: 1: 100% 213/213 [00:15<00:00, 13.99it/s, loss=4.49]
Epoch: 2: 100% 213/213 [00:15<00:00, 14.07it/s, loss=3.44]
Epoch: 3: 100% 213/213 [00:15<00:00, 14.05it/s, loss=2.45]
Epoch: 4: 100% 213/213 [00:15<00:00, 14.07it/s, loss=1.89]
Epoch: 5: 100% 213/213 [00:15<00:00, 14.07it/s, loss=1.54]
Epoch: 6: 100% 213/213 [00:15<00:00, 14.09it/s, loss=1.28]
Epoch: 7: 100% 213/213 [00:15<00:00, 14.07it/s, loss=1.1]
Epoch: 8: 100% 213/213 [00:15<00:0

seed 42: 16.6

seed 43: 16.8

seed 66: 13.0

seed 67: 18.1

seed 77: 17.0

seed 78: 17.5

seed 88: 16.2

seed 89: n/a (not available in time for the original draft, since half of these seeds were in a separate run that had not completed; reported below and included in the recomputed statistics)

seed 99: 13.7

seed 100: n/a


16.1125 +/- 1.6914767955842616 (sample mean +/- std)

95% CI
14.940366381870223 to 17.284633618129778


After the original reporting deadline, seed 89 and seed 100 of the original run completed (see the prior Transformer train/eval cell):
```
obj_pp_to_subj_pp, seed 89: 21.5%
obj_pp_to_subj_pp, seed 100: 11.8%
```

```
>>> wu_et_al_baseline_2023_3_layers_obj_pp_to_subj_pp = np.array([16.6,16.8,13.0,18.1,17.0,17.5,16.2,21.5,13.7,11.8])
>>> wu_et_al_baseline_2023_3_layers_obj_pp_to_subj_pp.mean()
16.22
>>> wu_et_al_baseline_2023_3_layers_obj_pp_to_subj_pp.std()
2.653224453377437
>>> len(wu_et_al_baseline_2023_3_layers_obj_pp_to_subj_pp)
10
>>> wu_et_al_baseline_2023_3_layers_obj_pp_to_subj_pp.std()/math.sqrt(10)
0.8390232416327928
>>> (wu_et_al_baseline_2023_3_layers_obj_pp_to_subj_pp.mean() - wu_et_al_baseline_2023_3_layers_obj_pp_to_subj_pp.std()/math.sqrt(10)*1.96,  wu_et_al_baseline_2023_3_layers_obj_pp_to_subj_pp.mean() + wu_et_al_baseline_2023_3_layers_obj_pp_to_subj_pp.std()/math.sqrt(10) * 1.96)
(14.575514446399724, 17.86448555360027)
```

So for the Wu et al 2023 baseline Transformer Encoder-Decoder with 3 layers we update from n=8:
```
16.1125 +/- 1.6914767955842616 (sample mean +/- std)
95% CI 14.940366381870223 to 17.284633618129778
```

to n=10:

```
16.22 +/- 2.653224453377437 (sample mean +/- std)
95% CI 14.575514446399724 to 17.86448555360027
```
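
The intervals above use the normal critical value 1.96 together with NumPy's default `std()`, which divides by n. With only 8 to 10 runs, a more conservative alternative is a Student's t interval with the sample standard deviation (dividing by n - 1). The sketch below is offered as a robustness check under that assumption, not as the method used for the reported numbers. `t_interval` and `T_975` are hypothetical names, and the table holds the standard 0.975 quantiles of the t distribution for the degrees of freedom that occur in this notebook.

```python
import math
import statistics

# 0.975 quantiles of Student's t, keyed by degrees of freedom (n - 1).
T_975 = {4: 2.776, 7: 2.365, 9: 2.262, 19: 2.093}


def t_interval(values):
    """Two-sided 95% t interval for the mean (hypothetical helper)."""
    n = len(values)
    mean = statistics.fmean(values)
    # statistics.stdev divides by n - 1, unlike NumPy's default std().
    stderr = statistics.stdev(values) / math.sqrt(n)
    half_width = T_975[n - 1] * stderr
    return mean - half_width, mean + half_width


# 3-layer obj_pp_to_subj_pp results for all 10 seeds, as listed above.
three_layer = [16.6, 16.8, 13.0, 18.1, 17.0, 17.5, 16.2, 21.5, 13.7, 11.8]
low, high = t_interval(three_layer)
print(f"{low:.1f} to {high:.1f}")
```

The t interval is wider than the normal-approximation interval, which only makes it harder to claim a difference between layer counts.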

## Wu et al 2023 baseline Encoder-Decoder - 4 layers - not controlling for parameter count

ReCOGS commit used 1b6eca8ff4dca5fd2fb284a7d470998af5083beb

In [None]:
%cd /content/
!rm -rf ReCOGS
!git clone https://github.com/frankaging/ReCOGS.git
%cd ReCOGS

!echo '{\
  "architectures": [\
    "Bert"\
  ],\
  "model_type": "bert",\
  "attention_probs_dropout_prob": 0.1,\
  "hidden_act": "gelu",\
  "hidden_dropout_prob": 0.1,\
  "hidden_size": 300,\
  "initializer_range": 0.02,\
  "intermediate_size": 512,\
  "num_attention_heads": 4,\
  "num_hidden_layers": 4,\
  "type_vocab_size": 2,\
  "vocab_size": 762,\
  "pad_token_id": 0,\
  "bos_token_id": 1,\
  "eos_token_id": 2,\
  "unk_token_id": 3,\
  "mask_token_id": 4,\
  "cls_token_id": 5,\
  "sum_token_id": 6,\
  "nsp_token_id": 7,\
  "position_embedding_type": "absolute",\
  "position_embedding_init": "random"\
}' > /content/ReCOGS/model/encoder_config.json

!echo '{\
  "architectures": [\
    "Bert"\
  ],\
  "model_type": "bert",\
  "attention_probs_dropout_prob": 0.1,\
  "hidden_act": "gelu",\
  "hidden_dropout_prob": 0.1,\
  "hidden_size": 300,\
  "initializer_range": 0.02,\
  "intermediate_size": 512,\
  "num_attention_heads": 4,\
  "num_hidden_layers": 4,\
  "type_vocab_size": 2,\
  "vocab_size": 729,\
  "pad_token_id": 0,\
  "bos_token_id": 1,\
  "eos_token_id": 2,\
  "unk_token_id": 3,\
  "mask_token_id": 4,\
  "decoder_start_token_id": 1,\
  "position_embedding_type": "absolute",\
  "position_embedding_init": "random"\
}' > /content/ReCOGS/model/decoder_config.json

seed 42: obj_pp_to_subj_pp: 23.8

seed 66: obj_pp_to_subj_pp: 20.6

seed 77: obj_pp_to_subj_pp: 23.2

seed 88: obj_pp_to_subj_pp: 16.4

seed 99: obj_pp_to_subj_pp: 12.1



19.22 +/- 4.4128902093752576 (sample mean +/- std)

95% confidence interval 15.4% to 23.1%

In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "42;66;77;88;99" # paper's seeds, not sure how they chose them

INFO:root:Baselining the Transformer Encoder-Decoder Model
INFO:root:__Number CUDA Devices: 1
INFO:root:Number of model params: 7749325
INFO:root:OUTPUT DIR: ./results_recogs_positional_index/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.42
Epoch: 0:   0% 0/213 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Epoch: 0: 100% 213/213 [00:20<00:00, 10.52it/s, loss=5.66]
Epoch: 1: 100% 213/213 [00:19<00:00, 11.11it/s, loss=4.46]
Epoch: 2: 100% 213/213 [00:19<00:00, 11.12it/s, loss=3.33]
Epoch: 3: 100% 213/213 [00:19<00:00, 11.11it/s, loss=2.32]
Epoch: 4: 100% 213/213 [00:19<00:00, 11.06it/s, loss=1.8]
Epoch: 5: 100% 213/213 [00:19<00:00, 11.10it/s, loss=1.43]
Epoch: 6: 100% 213/213 [00:19<00:00, 11.10it/s, loss=1.17]
Epoch: 7: 100% 213/213 [00:19<00:00, 11.07it/s, loss=1]
Epoch: 8: 100% 213/213 [00:19<00:00, 

moved this last one to new notebook to run in parallel, https://colab.research.google.com/drive/13FRQeAjyPOhBtTdrpW8caL25rNryLn5-?authuser=0#scrollTo=VxRXS4jinmeD

In [None]:
!python run_cogs.py --model_name ende_transformer --gpu 1 --train_batch_size 128 --eval_batch_size 128 --lr 0.0001 --data_path ./recogs_positional_index --output_dir ./results_recogs_positional_index --lfs cogs --do_train --do_test --do_gen --max_seq_len 512 --output_json --epochs 300 --seeds "43;67;78;89;100" # paper's seeds + 1

paper's seeds + 1, for runs 6-10 (inclusive) out of 10

results from other notebook ( https://colab.research.google.com/drive/13FRQeAjyPOhBtTdrpW8caL25rNryLn5-#scrollTo=VxRXS4jinmeD ) were:

seed 43: obj_pp_to_subj_pp: 16.1%

seed 67: obj_pp_to_subj_pp: 22.3%

seed 78: obj_pp_to_subj_pp: 23.7%

seed 89: obj_pp_to_subj_pp: 20.9%

seed 100: obj_pp_to_subj_pp: 14.0%

combine with earlier seeds in this notebook ( https://colab.research.google.com/drive/12mXX5L1I4rpwl1Jk8hCm-xyAkqiKJEo7 ):


seed 42: obj_pp_to_subj_pp: 23.8%

seed 66: obj_pp_to_subj_pp: 20.6%

seed 77: obj_pp_to_subj_pp: 23.2%

seed 88: obj_pp_to_subj_pp: 16.4%

seed 99: obj_pp_to_subj_pp: 12.1%

wu et al 2023 baseline with 4 layers (instead of 2) (not expected to be better):
```
>>> import numpy as np
>>> # not testing my own model, this is the wu et al 2023 baseline
>>> wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp = np.array([16.1,22.3,23.7,20.9,14.0,23.8,20.6,23.2,16.4,12.1])
>>> wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp.mean()
19.31
>>> wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp.std()
4.082266527310533
>>> len(wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp)
10
>>> import math
>>> wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp_stderr = wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp.std()/math.sqrt(10)
>>> (wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp.mean() - wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp_stderr*1.96, wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp.mean() + wu_et_al_baseline_2023_4_layers_obj_pp_to_subj_pp_stderr*1.96)
(16.77978499253522, 21.840215007464778)
```

current draft reports (on n=5 instead of n=10):


19.22 +/- 4.4128902093752576 (sample mean +/- std)

95% confidence interval 15.4% to 23.1%

we update at n=10 to:

19.31 +/- 4.082266527310533 (sample mean +/- std)

95% confidence interval 16.8% to 21.8%
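
To compare the three depths side by side, the per-seed obj_pp_to_subj_pp values recorded in this notebook can be collected in one place. The sketch below recomputes the mean and the normal-approximation 95% interval for each depth the same way as the cells above (population standard deviation, 1.96 standard errors), using only the standard library. `OBJ_PP_TO_SUBJ_PP` and `ci95` are hypothetical names introduced for this illustration.

```python
import math
import statistics

# Per-seed obj_pp_to_subj_pp accuracy (%) by number of layers, from this notebook.
OBJ_PP_TO_SUBJ_PP = {
    2: [14.8, 19.7, 31.0, 13.5, 17.8, 20.2, 15.1, 17.5, 16.4, 20.0,
        23.0, 20.1, 12.7, 15.8, 12.4, 31.4, 35.3, 21.7, 18.9, 16.4],
    3: [16.6, 16.8, 13.0, 18.1, 17.0, 17.5, 16.2, 21.5, 13.7, 11.8],
    4: [16.1, 22.3, 23.7, 20.9, 14.0, 23.8, 20.6, 23.2, 16.4, 12.1],
}


def ci95(values):
    """Return (mean, low, high) using population std and z = 1.96."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.pstdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width


summary = {layers: ci95(values) for layers, values in OBJ_PP_TO_SUBJ_PP.items()}
for layers, (mean, low, high) in summary.items():
    n = len(OBJ_PP_TO_SUBJ_PP[layers])
    print(f"{layers} layers: n={n}, mean={mean:.2f}, 95% CI {low:.1f} to {high:.1f}")
```

Neither the 3-layer nor the 4-layer interval lies above the 2-layer interval, which is consistent with the claim that adding layers does not improve this generalization split.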