# __3.2. Evaluate model using the PICKLE corpus__

Goal:
- Mask entities from the PICKLE corpus to evaluate model performance

Log:
- 11/30/23:
  - The PICKLE dataset is derived from IOB-format data from:
    - `hpc.msu.edu:/mnt/research/ShiuLab/serena_kg/PICKLE_250_abstracts_entities_and_relations_FINAL_05Jul2023`
  - The derived data is copied from:
    - `hpc.msu.edu:/mnt/research/compbiol_shiu/kg/1_data_proc`
  - Will eventually move `kg:/1_data_proc/script_1_1_parse_brat.ipynb` to be `3_1` in this repo.
    - Moved.
  

## ___Setup___

In [2]:
import json, pickle, spacy
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from datasets import load_dataset
from transformers import BertTokenizerFast, BertForMaskedLM, pipeline

In [3]:
proj_dir   = Path.home() / "projects/plantbert"
work_dir   = proj_dir / "3_eval_with_pickle"
pickle_dir = work_dir / "pickle"

# Vanilla model
dir1       = proj_dir / "1_vanilla_bert" 
model1_dir = dir1 / "models/"
ckpt1_dir  = model1_dir / "checkpoint-35500"

# Filtered model

## ___Load dataset___

### Get PICKLE data

The data were already processed into spaCy format and were obtained with:

```bash
scp shius@hpc.msu.edu:/mnt/research/compbiol_shiu/kg/1_data_proc/*.spacy ./
```

Info on [saving and loading spaCy data](https://spacy.io/usage/saving-loading):
- Specifically, the [from_disk](https://spacy.io/api/language#from_disk) function

In [12]:
nlp = spacy.load("en_core_web_lg")

train_data = nlp.from_disk(pickle_dir / "train.spacy")

NotADirectoryError: [Errno 20] Not a directory: '/home/shius/projects/plantbert/1_vanilla_bert/pickle/train.spacy/tokenizer'
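
The `NotADirectoryError` above is consistent with `nlp.from_disk` expecting a saved pipeline directory rather than a single file. Files with the `.spacy` extension are usually serialized `DocBin` objects, so, assuming `train.spacy` is one (not verified here), a round trip might look like the sketch below. The temporary file name and the `Hormone` label are made up for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

# A blank English pipeline is enough for tokenizing; no trained model needed.
nlp = spacy.blank("en")

# Build a toy annotated Doc (the "Hormone" label is hypothetical).
doc = nlp("Cytokinins are plant hormones.")
doc.ents = [Span(doc, 0, 1, label="Hormone")]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.spacy"
    DocBin(docs=[doc]).to_disk(path)

    # Read the file back with DocBin, then rebuild Docs against a vocab.
    doc_bin = DocBin().from_disk(path)
    docs = list(doc_bin.get_docs(nlp.vocab))

ents = [(ent.text, ent.label_) for ent in docs[0].ents]
print(ents)
```

If the assumption holds, the analogous call for the real data would be `DocBin().from_disk(pickle_dir / "train.spacy")` followed by `get_docs(nlp.vocab)`.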

In [3]:
from datasets import load_dataset
dataset = load_dataset("slotreck/pickle")

Failed to read file '/home/shius/.cache/huggingface/datasets/downloads/c7c914282be21967b0b29d51d56fb3f44e978f2da0a4c33f9928b27ba225af66' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column(/ner/[]/[]/[]) changed from number to string in row 0


DatasetGenerationError: An error occurred while generating the dataset
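
The `ArrowInvalid` message suggests that each innermost `ner` entry mixes numbers with a string, which Arrow cannot store in a single typed column. One possible workaround, assuming the files are JSON Lines with `[start, end, label]` entity triples (a guess based on the error message, not confirmed here), is to parse them with the standard `json` module. Only the `ner` field name comes from the error; the other field names, the record, and the labels below are invented for illustration.

```python
import json

# Hypothetical record: everything except the `ner` key is an assumption.
sample_lines = [
    json.dumps({
        "doc_key": "toy_doc",
        "sentences": [["Cytokinins", "promote", "cell", "division", "."]],
        "ner": [[[0, 0, "Hormone"], [2, 3, "Process"]]],
    })
]

def read_entities(lines):
    """Yield (doc_key, entity_text, label) for each annotated span."""
    for line in lines:
        rec = json.loads(line)
        # Flatten sentences so document-level token offsets can index into them.
        tokens = [tok for sent in rec["sentences"] for tok in sent]
        for sent_ents in rec["ner"]:
            for start, end, label in sent_ents:
                # Assumes `end` is an inclusive token index.
                yield rec["doc_key"], " ".join(tokens[start:end + 1]), label

entities = list(read_entities(sample_lines))
print(entities)
```

For real files, `sample_lines` would be replaced by an open file handle, since iterating over a text file also yields one line at a time.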

## ___Set up pipeline___

### Load model, tokenizer

In [None]:
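# Weights are loaded from the checkpoint, the tokenizer from the parent models
# directory (both defined in Setup as ckpt1_dir and model1_dir).
model     = BertForMaskedLM.from_pretrained(ckpt1_dir)
tokenizer = BertTokenizerFast.from_pretrained(model1_dir)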

In [None]:
example = "Cytokinins are plant hormones that promote cell division, or " +\
          "cytokinesis, in plant roots and shoots."

input_ids = tokenizer(example)["input_ids"]
for idx, input_id in enumerate(input_ids):
  print(idx, tokenizer.convert_ids_to_tokens(input_id))

0 [CLS]
1 cytokinins
2 are
3 plant
4 hormones
5 that
6 promote
7 cell
8 division
9 ,
10 or
11 cytokinesis
12 ,
13 in
14 plant
15 roots
16 and
17 shoots
18 .
19 [SEP]


In [None]:
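# Index into test_list, which excludes [CLS], so 3 corresponds to "hormones".
test_mask_id = 3
test_list = tokenizer.convert_ids_to_tokens(input_ids)[1:-1]
test_list[test_mask_id] = tokenizer.mask_token
test_str = " ".join(test_list)
test_str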

'cytokinins are plant [MASK] that promote cell division , or cytokinesis , in plant roots and shoots .'

In [None]:
# Even though extra spaces are added before "," and ".", the number of
# tokens remains the same.
len(tokenizer(test_str)["input_ids"])

20

In [None]:
# Positions in input_ids (with [CLS] at 0) of the content words to mask.
to_mask = [1, 3, 4, 6, 7, 8, 11, 14, 15, 17]
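
The `to_mask` indices are hard-coded for this one example. For the corpus evaluation, mask positions could instead come from entity annotations. The sketch below assumes token-level spans with inclusive ends and one word piece per token, neither of which is established in this notebook; `mask_entity` is a hypothetical helper, not part of any library.

```python
def mask_entity(tokens, start, end, mask_token="[MASK]"):
    """Return a copy of `tokens` with positions start..end (inclusive) masked."""
    if not 0 <= start <= end < len(tokens):
        raise ValueError(f"span ({start}, {end}) is outside the token list")
    masked = list(tokens)
    # One mask token per entity token keeps the sequence length unchanged.
    masked[start:end + 1] = [mask_token] * (end - start + 1)
    return masked

tokens = "cytokinins are plant hormones that promote cell division .".split()
single = " ".join(mask_entity(tokens, 0, 0))  # single-token entity
multi = " ".join(mask_entity(tokens, 6, 7))   # two-token entity
print(single)
print(multi)
```

Entities that the tokenizer splits into several word pieces would need their spans mapped to word-piece positions first, which this sketch does not attempt.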

### Set and test fill mask pipeline

In [None]:
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

In [None]:
for mask_idx in to_mask:
  input_ids_tmp = input_ids.copy()
  input_ids_tmp[mask_idx] = tokenizer.mask_token_id
  txt = " ".join(tokenizer.convert_ids_to_tokens(input_ids_tmp)[1:-1])
  print(txt)
  for pred in fill_mask(txt):
    print(f"  {pred['token_str']}, score:{pred['score']:.4f}")

[MASK] are plant hormones that promote cell division , or cytokinesis , in plant roots and shoots .
  there, score:0.6489
  cytokinins, score:0.0486
  what, score:0.0454
  brassinosteroids, score:0.0294
  they, score:0.0230
cytokinins are [MASK] hormones that promote cell division , or cytokinesis , in plant roots and shoots .
  plant, score:0.7351
  the, score:0.0778
  endogenous, score:0.0133
  common, score:0.0105
  important, score:0.0077
cytokinins are plant [MASK] that promote cell division , or cytokinesis , in plant roots and shoots .
  hormones, score:0.8892
  cytokinins, score:0.0284
  regulators, score:0.0175
  factors, score:0.0079
  phytohormones, score:0.0068
cytokinins are plant hormones that [MASK] cell division , or cytokinesis , in plant roots and shoots .
  inhibit, score:0.1997
  regulate, score:0.1724
  mediate, score:0.1196
  control, score:0.1189
  promote, score:0.0946
cytokinins are plant hormones that promote [MASK] division , or cytokinesis , in plant roots a