Notebook for fine-tuning encoder-based LMs on the linguistic acceptability (LA) task.  
**Running this code requires a GPU.**  
The T4 available in the free Colab tier is sufficient to fine-tune encoder LMs on the datasets in `./data/`.  
Average fine-tuning time per epoch: ~2 minutes.  
Storing the model weights, all predictions, and the uncompressed attention matrices requires around 1 GB of free disk space.

In [3]:
## When running in colab
# from google.colab import drive
# drive.mount('/content/gdrive')
# %cd /content/gdrive/My Drive

In [None]:
!git clone https://github.com/upunaprosk/la-tda.git

In [None]:
%cd la-tda
!unzip data/data.zip -d ./data

In [6]:
!pip install -r requirements.txt -q


In [7]:
from src.grab_attentions import *

# Fine-tuning a pretrained model

In [8]:
!nvidia-smi

Fri Oct 13 10:15:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [9]:
epoch = 1      # number of training epochs
lr = 3e-5      # learning rate
decay = 1e-2   # weight decay
batch = 32     # per-device training batch size
model_save_dir = "./"
run_name = f"bert-base-cased-en-cola_{batch}_{lr}_lr_{decay}_decay_balanced"
output_dir = model_save_dir + run_name
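
The f-string above formats the floats with Python's default conversion, so `3e-5` shows up as `3e-05` and `1e-2` as `0.01` in the directory name. A minimal check that reuses the same values:

```python
# Same hyperparameters as in the cell above.
lr = 3e-5
decay = 1e-2
batch = 32

run_name = f"bert-base-cased-en-cola_{batch}_{lr}_lr_{decay}_decay_balanced"
output_dir = "./" + run_name
print(output_dir)
# ./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced
```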

## Training arguments

Training arguments include:  


* [TrainingArguments](https://github.com/huggingface/transformers/blob/d92e22d1f28324f513f3080e5c47c071a3916721/src/transformers/training_args.py#L121) parameters, as consumed by the Hugging Face `Trainer`;  
* Model type arguments;  
```
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to pretrained model or model identifier from
                        huggingface.co/models (default: None)
  --config_name CONFIG_NAME
                        Pretrained config name or path if not the same as
                        model_name (default: None)
  --tokenizer_name TOKENIZER_NAME
                        Pretrained tokenizer name or path if not the same as
                        model_name (default: None)
```
* Data training arguments;
```
  --task_name TASK_NAME
                        The name of the task to train on: cola, mnli, mrpc,
                        qnli, qqp, rte, sst2, stsb, wnli (default: None)
  --dataset_name DATASET_NAME
                        The name of the dataset to use (via the datasets
                        library). (default: None)
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum total input sequence length after
                        tokenization. Sequences longer than this will be
                        truncated, sequences shorter will be padded. (default:
                        128)
  --train_file TRAIN_FILE
                        A csv or a json file containing the training data.
                        (default: None)
  --validation_file VALIDATION_FILE
                        A csv or a json file containing the validation data.
                        (default: None)
  --test_file TEST_FILE
                        A csv or a json file containing the test data.
                        (default: None)
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and
                        checkpoints will be written. (default: None)
  --overwrite_output_dir [OVERWRITE_OUTPUT_DIR]
                        Overwrite the content of the output directory. Use
                        this to continue training if output_dir points to a
                        checkpoint directory. (default: False)
  --do_train [DO_TRAIN]
                        Whether to run training. (default: False)
  --do_eval [DO_EVAL]   Whether to run eval on the dev set. (default: False)
  --do_predict [DO_PREDICT]
                        Whether to run predictions on the test set. (default:
                        False)
  --evaluation_strategy {no,steps,epoch}
                        The evaluation strategy to use. (default: no)
```


* Class-balanced loss;
```
  --balance_loss        Whether to use class-balanced loss. (default: False)
```
* Layer weight freezing.  
```
  --freeze              Whether to use pre-trained model without fine-tuning.
                        (default: False)
```
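
To give a feel for what `--balance_loss` may do, here is a minimal sketch of inverse-frequency class weights, one common way to build a class-balanced loss. Whether `src/train.py` uses this exact weighting is an assumption; the helper name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count)}, i.e. inverse-frequency weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels, made up for illustration: 1 = acceptable, 0 = unacceptable.
toy_labels = [1, 1, 1, 0]
weights = balanced_class_weights(toy_labels)
print(weights)  # the rarer class 0 gets the larger weight
```

Weights of this kind are typically passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that errors on the minority class contribute more to the loss.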



In [10]:
# Append --overwrite_output_dir to overwrite the content of an existing output directory.
!python src/train.py \
        --model_name_or_path bert-base-cased \
        --train_file data/en-cola/train.csv \
        --validation_file data/en-cola/dev.csv \
        --test_file data/en-cola/test.csv \
        --do_train \
        --do_eval \
        --do_predict \
        --num_train_epochs $epoch \
        --learning_rate $lr \
        --weight_decay $decay \
        --max_seq_length 64 \
        --per_device_train_batch_size $batch \
        --output_dir $output_dir \
        --balance_loss

2023-10-13 10:15:17.563073: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  ACCURACY = load_metric("accuracy", keep_in_memory=True)
Downloading builder script: 4.21kB [00:00, 3.37MB/s]       
Downloading builder script: 4.47kB [00:00, 3.46MB/s]       
10/13/2023 10:15:21 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=True,
do_train=True,
eval_accumul
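
Outside a notebook, the same invocation can be assembled in Python instead of relying on `$variable` interpolation. The sketch below only builds and prints the command; `build_train_command` and the `./my_run` directory are hypothetical names, and the flags are the ones used in the cell above.

```python
import shlex


def build_train_command(data_dir, output_dir, epochs, lr, decay, batch, balance_loss=True):
    """Assemble the argv list for src/train.py (hypothetical helper)."""
    cmd = [
        "python", "src/train.py",
        "--model_name_or_path", "bert-base-cased",
        "--train_file", f"{data_dir}/train.csv",
        "--validation_file", f"{data_dir}/dev.csv",
        "--test_file", f"{data_dir}/test.csv",
        "--do_train", "--do_eval", "--do_predict",
        "--num_train_epochs", str(epochs),
        "--learning_rate", str(lr),
        "--weight_decay", str(decay),
        "--max_seq_length", "64",
        "--per_device_train_batch_size", str(batch),
        "--output_dir", output_dir,
    ]
    if balance_loss:
        cmd.append("--balance_loss")
    return cmd


cmd = build_train_command("data/en-cola", "./my_run", epochs=1, lr=3e-5, decay=1e-2, batch=32)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root to launch training.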

# Attention weights extraction

In [11]:
# Choose a subset for which to extract attention matrices; they are used later for feature calculation.
# Available subsets:
!find data/ -type f -exec ls -lhS {} \;

-rw-r--r-- 1 root root 184K Sep  7  2022 data/en-cola/phenomena_minor.tsv
-rw-r--r-- 1 root root 26K Sep  7  2022 data/en-cola/dev.csv
-rw-r--r-- 1 root root 421K Sep  7  2022 data/en-cola/train.csv
-rw-r--r-- 1 root root 28K Sep  7  2022 data/en-cola/test.csv
-rw-r--r-- 1 root root 86K Oct  9  2022 data/en-cola/phenomena.tsv
-rw-r--r-- 1 root root 118K Sep 11  2022 data/ru-cola/dev.csv
-rw-r--r-- 1 root root 964K Sep 11  2022 data/ru-cola/train.csv
-rw-r--r-- 1 root root 296K Sep 11  2022 data/ru-cola/test.csv
-rw-r--r-- 1 root root 420K Oct  9  2022 data/ru-cola/phenomena.csv
-rw-r--r-- 1 root root 929K Oct 13 10:13 data/data.zip
-rw-r--r-- 1 root root 58K Sep  7  2022 data/ita-cola/dev.csv
-rw-r--r-- 1 root root 491K Sep  7  2022 data/ita-cola/train.csv
-rw-r--r-- 1 root root 62K Sep  7  2022 data/ita-cola/test.csv
-rw-r--r-- 1 root root 159K Oct  9  2022 data/ita-cola/phenomena.tsv


In [12]:
d_dir = "./data/en-cola/dev.csv"

In [13]:
# `python -m` already puts the current directory on sys.path, so no PYTHONPATH override is needed.
!python -m src.grab_attentions --model_dir $output_dir --data_file $d_dir

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
Moving 0 files to the new cache system
0it [00:00, ?it/s]0it [00:00, ?it/s]
[I] Loading csv dataset from path: data/en-cola/dev.csv...
Downloading data files: 100% 1/1 [00:00<00:00, 8756.38it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 977.24it/s]
Generating dev split: 527 examples [00:00, 54978.19 examples/s]
[I] CUDA is available : True
[W] Using cuda
[I] CUDA version : 11.8
[I] PyTorch version : 2.0.1+cu118
[I] Loading model from ./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced...
[I] Loading tokenizer from ./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced...
Weights Extraction: 100% 53/53 [00:05<00:00, 10.05it/s]
[I] Saving weights to: bert-base-cased-en-cola_32_3

In [14]:
# Example of calling the extraction function directly from Python
grab_attention_weights_inference(output_dir, d_dir)

[I] Loading csv dataset from path: data/en-cola/dev.csv...
[I] CUDA is available : True
[W] Using cuda:0
[I] CUDA version : 11.8
[I] PyTorch version : 2.0.1+cu118
[I] Loading model from ./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced...
[I] Loading tokenizer from ./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced...


Weights Extraction: 100%|██████████| 53/53 [00:04<00:00, 12.23it/s]


[I] Saving weights to: bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced/attentions/dev_part1of1.npy
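
The saved `.npy` file can be inspected with NumPy. The sketch below opens a file with memory mapping so that a large array is not read into RAM all at once. It writes a tiny stand-in file so that it runs anywhere; the stand-in's shape and dtype are made up, and the layout of the real file (assumed here to be examples × layers × heads × tokens × tokens) is not confirmed by the notebook. `load_attentions` is a hypothetical helper name.

```python
import os
import tempfile

import numpy as np


def load_attentions(path):
    """Open a saved .npy file lazily; data is read from disk only when indexed."""
    return np.load(path, mmap_mode="r")


# Stand-in file so the snippet runs anywhere; replace demo_path with
# os.path.join(output_dir, "attentions", "dev_part1of1.npy") for the real data.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_part1of1.npy")
np.save(demo_path, np.zeros((4, 2, 2, 8, 8), dtype=np.float16))

attn = load_attentions(demo_path)
print(attn.shape, attn.dtype)
```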


In [15]:
!du -c -h $output_dir

12K	./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced/runs/Oct13_10-15-21_536f80077e44/1697192134.4824107
28K	./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced/runs/Oct13_10-15-21_536f80077e44
32K	./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced/runs
593M	./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced/attentions
1008M	./bert-base-cased-en-cola_32_3e-05_lr_0.01_decay_balanced
1008M	total
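
The 593M reported for the `attentions` directory is consistent with one half-precision value per (example, layer, head, token, token) entry. The back-of-the-envelope check below assumes that sequences are padded to 64 tokens (the `--max_seq_length` used for training) and that values are stored as float16; neither is stated in the notebook.

```python
n_examples = 527      # dev split size reported in the extraction log
n_layers = 12         # bert-base architecture
n_heads = 12          # bert-base architecture
seq_len = 64          # assumed: padded length equals --max_seq_length
bytes_per_value = 2   # assumed: float16 storage

size_bytes = n_examples * n_layers * n_heads * seq_len * seq_len * bytes_per_value
size_mib = size_bytes / 2**20
print(f"{size_mib:.1f} MiB")  # 592.9 MiB, which du rounds up to 593M
```

The estimate scales linearly with the number of examples, so extracting attentions for the larger `train.csv` splits will need proportionally more disk space.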
