# __Step 3.1 Parse PICKLE annotation files__

___NOTE: The following needs to be updated___

- I moved this code from the `kg` repo, and the directory structure has changed since.
- The `0_example` directory is now in:
  - `~/github/plantbert/_dev/kg`

Goal:
- Split train/dev/test
- Separate NER and RE annotations
- Combine annotation and text files for different docs into one JSON, one for NER, one for RE

Notes:
- 9/22/23
  - Ran into an issue with tokenization of hyphenated words
    - E.g., S-adenosylmethionine synthetase is tokenized into "S", "adenosylmethionine" and "synthetase", which creates problems with the indexing.
      - See [this](https://stackoverflow.com/questions/58105967/spacy-tokenization-of-hyphenated-words)
  - Another tokenization error comes from PMID19413897_abstract.ann
    - "S-adenosylmethionine synthetase, glycine-rich RNA binding protein"
    - This annotation spans a comma, which does not make sense as a single entity, so I ___removed___ it from the original file in dev.
- 9/14/23
  - The PICKLE corpus and annotation files are from:
    - `/mnt/research/ShiuLab/serena_kg/PICKLE_250_abstracts_entities_and_relations_FINAL_05Jul2023`
  - Tried to load the PICKLE dataset from Hugging Face
    - The dataset won't load.
    - The downloaded train.jsonl does not contain PMID info.
    - The dev and test splits are of different sizes.
    - So we do our own split.
  - brat to JSON conversion with [brat-standoff-to-json](https://github.com/astutic/brat-standoff-to-json) works
    - But the resulting JSON cannot be converted to spacy format properly.
    - Try to go from brat to IOB, then JSON, then spacy

## ___Setup___

### Environment

Environment: `torch_spacy`, see:
- 0_example/tutorial-ner_bert_spacy.ipynb
- 0_example/environment.yml
- ERROR: No matching distribution found for en-core-web-trf==3.6.1
  - I moved this offending line to the end of the yml.

Environment setup
```bash
cd /mnt/home/shius/github/kg/0_example
module load CUDA/11.8.0
conda env create -f environment.yml
conda activate torch_spacy
python -m spacy download en_core_web_trf
```

Additional requirements
```bash
pip install ipywidgets

```

### Import

In [92]:
import os, random, json
from pathlib import Path
from shutil import copyfile
from spacy.lang.en import English
from spacy.tokens import DocBin, Doc
from spacy.vocab import Vocab
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import \
      ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

### Settings

In [93]:
# setup directories
proj_dir   = Path("/mnt/research/compbiol_shiu/kg")
work_dir   = proj_dir / "1_data_proc"
pickle_dir = work_dir / "_pickle"

dev_dir    = pickle_dir / "dev"
test_dir   = pickle_dir / "test"
train_dir  = pickle_dir / "train"

dev_dir.mkdir(exist_ok=True, parents=True)
test_dir.mkdir(exist_ok=True, parents=True)
train_dir.mkdir(exist_ok=True, parents=True)

# brat related
brat_config = pickle_dir / "annotation.conf"
brat2json   = Path.home() / "bin/bratstandoff-to-json.0.1.0.linux.amd64"

# parameters
rand_seed = 20180519

[train_r, dev_r, test_r] = [0.6, 0.2, 0.2]


## ___Split train/dev/test___

### Get splits

- [Shuffle with seed](https://stackoverflow.com/questions/19306976/python-shuffling-with-a-parameter-to-get-the-same-result)

In [94]:
def copy_files(file_list, dest_dir):
  '''Copy .ann and .txt based on a basename'''
  for f in file_list:
    f_basename = f.name[:f.name.rfind(".")]   # file basename
    s_basename = str(f)[:str(f).rfind(".")] # source dir + basename

    copyfile(s_basename+".ann" , dest_dir / (f_basename+".ann"))
    copyfile(s_basename+".txt" , dest_dir / (f_basename+".txt"))
  

In [95]:
def split_train_dev_test(pickle_dir, train_r, dev_r, test_r):

  # Get list of PMIDs
  ann_files = [f for f in pickle_dir.glob("*.ann")]
  pmid_list = [x.stem.split("_")[0] for x in ann_files]
  print("# of PMIDs:", len(pmid_list))

  # Shuffle with seed
  random.Random(rand_seed).shuffle(ann_files)

  train_files = ann_files[:int(len(ann_files)*train_r)]
  dev_files   = ann_files[int(len(ann_files)*train_r):\
                          int(len(ann_files)*(train_r+dev_r))]
  test_files  = ann_files[int(len(ann_files)*(train_r+dev_r)):]

  print("train, dev, test=", len(train_files), len(dev_files), len(test_files))

  copy_files(train_files, train_dir)
  copy_files(dev_files, dev_dir)
  copy_files(test_files, test_dir)

In [96]:
split_train_dev_test(pickle_dir, train_r, dev_r, test_r)

# of PMIDs: 250
train, dev, test= 150 50 50
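
The seeded shuffle and ratio-based slicing can be checked in isolation. The following is a minimal sketch using only the standard library, not the notebook's own function: `split_by_ratio` is a hypothetical helper and the file names are made up. It copies the list before shuffling so the caller's order is untouched, and rerunning with the same seed gives the same split.

```python
import random

def split_by_ratio(items, train_r, dev_r, seed):
    """Shuffle a copy of items with a fixed seed, then slice by ratio."""
    shuffled = list(items)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # same seed -> same order

    n       = len(shuffled)
    n_train = int(n * train_r)             # end index of the train slice
    n_dev   = int(n * (train_r + dev_r))   # end index of the dev slice
    return shuffled[:n_train], shuffled[n_train:n_dev], shuffled[n_dev:]

# Hypothetical file names standing in for the 250 annotation files
names = [f"PMID{i:04d}_abstract.ann" for i in range(250)]
train, dev, test = split_by_ratio(names, 0.6, 0.2, seed=20180519)
print(len(train), len(dev), len(test))
```

Whatever remains after the train and dev slices goes to test, so the three parts always add up to the full list even when the ratios do not divide the file count evenly.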


### Check

In [97]:
def check_file_numbers(dir):
  num_ann = len(list(dir.glob("*.ann")))
  num_txt = len(list(dir.glob("*.txt")))
  print(dir)
  print(f" num ann:{num_ann}, num_txt:{num_txt}")

In [98]:
check_file_numbers(dev_dir)
check_file_numbers(test_dir)
check_file_numbers(train_dir)

/mnt/research/compbiol_shiu/kg/1_data_proc/_pickle/dev
 num ann:50, num_txt:50
/mnt/research/compbiol_shiu/kg/1_data_proc/_pickle/test
 num ann:50, num_txt:50
/mnt/research/compbiol_shiu/kg/1_data_proc/_pickle/train
 num ann:150, num_txt:150
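
Equal counts do not guarantee that every `.ann` has its own `.txt`. A small sketch of a stricter check, comparing file stems as sets, is shown below. `find_unpaired` is a hypothetical helper, and the demo runs on a throwaway directory with made-up file names.

```python
import tempfile
from pathlib import Path

def find_unpaired(directory):
    """Return stems that have an .ann but no .txt, and vice versa."""
    ann_stems = {p.stem for p in directory.glob("*.ann")}
    txt_stems = {p.stem for p in directory.glob("*.txt")}
    return sorted(ann_stems - txt_stems), sorted(txt_stems - ann_stems)

# Demonstrate on a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name in ["PMID1_abstract.ann", "PMID1_abstract.txt", "PMID2_abstract.ann"]:
        (tmp / name).touch()
    missing_txt, missing_ann = find_unpaired(tmp)

print(missing_txt, missing_ann)
```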


## ___Create JSON files from BRAT annotations___

### Example doc

```python
document = "Gene A is in pathway 1."
```

BRAT standoff format (character offsets, end exclusive):
```txt
T1\tGENE 0 6\tGene A
T2\tPATHWAY 13 22\tpathway 1
R2\tIS_IN Arg1:T1 Arg2:T2
```

JSON
```json
[{"document": "Gene A is in pathway 1.",
  "tokens": [{"text": "Gene A",
              "start": 0,
              "end": 6,
              "token_start": 0,
              "token_end": 1,
              "entityLabel": "GENE"},
             {"text": "pathway 1",
              "start": 13,
              "end": 22,
              "token_start": 4,
              "token_end": 5,
              "entityLabel": "PATHWAY"}],
  "relations": [
     {"child": 0, "head": 4, "relationLabel": "IS_IN"}]
 }
]
```
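
How the character offsets relate to `token_start` and `token_end` can be sketched without spacy. The regex tokenizer below is only a stand-in (an assumption, not the tokenizer used in this notebook), and both helper names are hypothetical. Like `doc.char_span`, the mapping returns `None` when a span does not start and end on token boundaries.

```python
import re

def tokenize(text):
    """Return (token_text, char_start, char_end) tuples; end is exclusive."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_to_token_span(tokens, start, end):
    """Map a character span to inclusive token indices, or None if the
    span does not begin and end exactly on token boundaries."""
    starts = {t[1]: i for i, t in enumerate(tokens)}
    ends   = {t[2]: i for i, t in enumerate(tokens)}
    if start not in starts or end not in ends:
        return None
    return starts[start], ends[end]

document = "Gene A is in pathway 1."
tokens   = tokenize(document)

gene_span    = char_to_token_span(tokens, 0, 6)    # "Gene A"
pathway_span = char_to_token_span(tokens, 13, 22)  # "pathway 1"
bad_span     = char_to_token_span(tokens, 13, 18)  # cuts "pathway" in half
print(gene_span, pathway_span, bad_span)
```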

Help from:
- https://stackoverflow.com/questions/55770365/retrieve-the-span-of-an-entity-from-one-of-its-tokens-in-spacy

### Custom tokenizer

This deals with the "-" issue by leaving out the hyphen infix rule so that hyphenated words stay as single tokens. The solution is from [this post](https://stackoverflow.com/questions/58105967/spacy-tokenization-of-hyphenated-words).

In [99]:
def custom_tokenizer(nlp):
  infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
  )

  infix_re = compile_infix_regex(infixes)

  return Tokenizer(nlp.vocab, 
                   prefix_search=nlp.tokenizer.prefix_search,
                   suffix_search=nlp.tokenizer.suffix_search,
                   infix_finditer=infix_re.finditer,
                   token_match=nlp.tokenizer.token_match,
                   rules=nlp.Defaults.tokenizer_exceptions)


In [100]:
nlp = English()
nlp.tokenizer = custom_tokenizer(nlp)

### brat_to_json function

In [101]:
def brat_to_json(ann_file, debug=0):
  '''Convert brat ann file to a dictionary
  Args:
    ann_file (Path): brat annotation file
    debug (int): print out debug info (1) or not (0)
  Returns:
    jdict (dict): dictionary of the brat annotation in json format
    rdict (dict): dictionary of relations, {relationLabel: count}
    err (dict): dictionary of errors
  '''

  # Read text file
  dir      = ann_file.parent
  txt_file = dir / f"{ann_file.stem}.txt"
  txt_list = open(txt_file).readlines()
  if len(txt_list) > 1:
    print("ERR: >1 line in txt file, may have issue with token indexing")
  
  document = txt_list[0]
  jdict    = {"document": document, "tokens": [], "relations": []}

  doc = nlp(document)
  
  tdict = {} # token dictionary: {Tx: token_start_index}
  rdict = {} # {relationLabel: count}
  edict = {} # error dictionary

  # Read annotations
  with open(ann_file, "r") as f:
    lines = f.readlines()
    if debug: print(lines)

    # Go through lines to deal with tokens first, because tdict needs to be 
    # built first otherwise cannot figure out token id in relations
    for line in lines:
      # Read entities
      if line[0] == "T":
        if debug: print("entity line:", [line])
        # T_index, entity_info, text
        [Tx, entity, text] = line.strip().split("\t")

        # label, start, end
        [entityLabel, start, end] = entity.split(" ")

        # Convert to integers
        start = int(start)
        end   = int(end)

        # Get token start/end index
        span    = doc.char_span(start, end)

        # There are situations where span is None, an example is PMID24519835
        # where typo was corrected for tokenization but the text file was not.
        # These were ignored.
        try:
          start_i = span.start
          end_i   = span.end - 1 # Note that in IOB this is inclusive
        except AttributeError:
          if debug: 
            print("ERR: span.start is None")
            print("", ann_file)
          edict = {"span.start_IS_NONE": [ann_file, Tx, text, start, end]}
          break

        # Only add to jdict if there is no error
        if edict == {}:
          tdict[Tx] = start_i
          jdict["tokens"].append({"text": text, 
                                  "start": start, 
                                  "end": end,
                                  "token_start": start_i,
                                  "token_end": end_i, 
                                  "entityLabel": entityLabel})
          
    # Now deal with relations, only if there is no error from parsing tokens
    if edict == {}:
      for line in lines:
        # Read relations
        if line[0] == "R":
          if debug: print("relation line:", [line])

          # relation label, bits about child, bits about head
          [relationLabel, cbit, hbit] = line.strip().split("\t")[1].split(" ")

          if relationLabel not in rdict:
            rdict[relationLabel] = 1
          else:
            rdict[relationLabel]+= 1

          # Tx indices
          cTx = cbit.split(":")[1]
          hTx = hbit.split(":")[1]

          # Get child and head token indices
          child = tdict[cTx]
          head  = tdict[hTx]

          jdict["relations"].append({"child": child, 
                                    "head": head, 
                                    "relationLabel": relationLabel})
          
    # Return empty dictionary if there is error
    if edict != {}:
      jdict = {}

  return jdict, rdict, edict

In [102]:
# Testing brat_to_json
#ann = pickle_dir / "PMID24519835_abstract.ann" # this one has error
ann = pickle_dir / "PMID29187153_abstract.ann"
j_dict, rdict, edict = brat_to_json(ann)
j_dict, rdict, edict

({'document': 'BACKGROUND: Plants respond to various stress stimuli by activating an enhanced broad-spectrum defensive ability. The development of novel resistance inducers represents an attractive, alternative crop protection strategy. In this regard, hexanoic acid (Hxa, a chemical elicitor) and azelaic acid (Aza, a natural signaling compound) have been proposed as inducers of plant defense, by means of a priming mechanism. Here, we investigated both the mode of action and the complementarity of Aza and Hxa as priming agents in Nicotiana tabacum cells in support of enhanced defense. RESULTS: Metabolomic analyses identified signatory biomarkers involved in the establishment of a pre-conditioned state following Aza and Hxa treatment. Both inducers affected the metabolomes in a similar manner and generated common biomarkers: caffeoylputrescine glycoside, cis-5-caffeoylquinic acid, feruloylglycoside, feruloyl-3-methoxytyramine glycoside and feruloyl-3-methoxytyramine conjugate. Subsequent
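
The reason for reading entities before relations is easier to see on a tiny input. Below is a minimal standard-library sketch of the two-pass idea. `parse_ann_lines` is a hypothetical helper that skips the spacy token lookup and assumes contiguous entity spans (a single `start end` pair per entity).

```python
def parse_ann_lines(lines):
    """Collect entities first, then resolve relation arguments against them."""
    entities, relations = {}, []

    # Pass 1: entity lines look like "T1<TAB>LABEL start end<TAB>text"
    for line in lines:
        if line.startswith("T"):
            tx, info, text = line.rstrip("\n").split("\t")
            label, start, end = info.split(" ")
            entities[tx] = {"label": label, "start": int(start),
                            "end": int(end), "text": text}

    # Pass 2: relation lines look like "R1<TAB>LABEL Arg1:T1 Arg2:T2"
    for line in lines:
        if line.startswith("R"):
            label, arg1, arg2 = line.rstrip("\n").split("\t")[1].split(" ")
            relations.append({"label": label,
                              "arg1": entities[arg1.split(":")[1]],
                              "arg2": entities[arg2.split(":")[1]]})
    return entities, relations

# The relation is listed before its entities on purpose
ann_lines = ["R1\tIS_IN Arg1:T1 Arg2:T2\n",
             "T1\tGENE 0 6\tGene A\n",
             "T2\tPATHWAY 13 22\tpathway 1\n"]
entities, relations = parse_ann_lines(ann_lines)
print(relations[0]["arg1"]["text"], relations[0]["label"], relations[0]["arg2"]["text"])
```

With a single pass, the relation on the first line would fail to resolve because `T1` and `T2` have not been seen yet.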

### Convert all brat files to json

In [103]:
def convert_all_brat_to_json(dir, out_path):
  '''Convert all brat files in a directory to json
  Args:
    dir (Path): directory of brat files
    out_path (Path): output json file
  Returns:
    relations (dict): {relationLabel: count}
    f_list (list): annotation files that were converted without error
  '''
  
  j_list    = [] # list of dictionaries
  relations = {} # relation counts
  c_err     = 0  # count errors

  f_list    = [] # list of qualified files

  print(f"{dir.stem}, errors:")
  for ann_file in dir.glob("*.ann"):
    jdict, rdict, edict = brat_to_json(ann_file)

    # only add to list if there is no error
    if edict == {}:
      j_list.append(jdict)
      f_list.append(ann_file)
    else:
      c_err += 1

    for e in edict:
      print("", edict[e][0].stem)

    # populate relations
    for r in rdict:
      if r not in relations:
        relations[r] = rdict[r]
      else:
        relations[r] += rdict[r]

  # generate output
  with open(out_path, "w") as f:
    json.dump(j_list, f)

  print(f" number errors:{c_err}")

  return relations, f_list

In [104]:
json_dev = work_dir / "dev.json"
json_test = work_dir / "test.json"
json_train = work_dir / "train.json"

rel_dev, flist_dev     = convert_all_brat_to_json(dev_dir, json_dev)
rel_test, flist_test   = convert_all_brat_to_json(test_dir, json_test)
rel_train, flist_train = convert_all_brat_to_json(train_dir, json_train)

dev, errors:
 PMID24519835_abstract
 PMID23297052_abstract
 number errors:2
test, errors:
 PMID27927228_abstract
 PMID18450451_abstract
 PMID27540390_abstract
 PMID29062306_abstract
 PMID16664167_abstract
 PMID16169957_abstract
 PMID30272908_abstract
 number errors:7
train, errors:
 PMID24515663_abstract
 PMID24997625_abstract
 PMID33357541_abstract
 PMID17644730_abstract
 PMID30727640_abstract
 PMID24720904_abstract
 PMID20339157_abstract
 PMID18992204_abstract
 number errors:8


In [105]:
rel_dev, rel_test, rel_train

({'interacts': 122,
  'is-in': 81,
  'activates': 83,
  'inhibits': 56,
  'produces': 28},
 {'activates': 74,
  'is-in': 80,
  'interacts': 114,
  'inhibits': 60,
  'produces': 28},
 {'interacts': 389,
  'inhibits': 234,
  'activates': 316,
  'is-in': 316,
  'produces': 67})
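
The per-file relation counts are merged with an explicit `if`/`else` above. As an alternative sketch (not what the notebook uses), `collections.Counter` does the same accumulation. The counts below are made up for illustration.

```python
from collections import Counter

# Made-up per-document relation counts, for illustration only
per_doc_counts = [
    {"interacts": 2, "is-in": 1},
    {"interacts": 1, "activates": 3},
    {"inhibits": 1},
]

totals = Counter()
for counts in per_doc_counts:
    totals.update(counts)   # adds the counts key by key

print(dict(totals))
```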

## ___Convert JSON to spacy___

### Function

Modified from [this code](https://github.com/walidamamou/relation_extraction_transformer/blob/main/binary_converter.py), which is originally from [here](https://github.com/explosion/projects/blob/v3/tutorials/rel_component/scripts/parse_data.py/).
- The code from walidamamou has unnecessary parts.
  - The original code uses the last digit of the article ID to decide whether a doc is for dev (ends with 4), test (ends with 3), or train (everything else). But walidamamou's code is just about train.
  - The `ids` and `vocab` variables are not useful.

In [106]:
def json_to_spacy(json_loc, spacy_loc, MAP_LABELS, flist):
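  """Create a spacy corpus from the brat-derived JSON annotations
  Args:
    json_loc (Path): Location of the JSON file
    spacy_loc (Path): Location of the spacy output file
    MAP_LABELS (dict): relation labels, {label: label}
    flist (list): annotation files in the same order as the JSON entries,
      used for error reporting
  Returns:
    docs (list): spacy Doc objects with entities and doc._.rel set
  """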
  
  Doc.set_extension("rel", default={},force=True)
  vocab = Vocab()
  docs  = []

  with open(json_loc, encoding="utf8") as jsonfile:
    jlist = json.load(jsonfile) # list of dictionaries
    for idx, jdict in enumerate(jlist):
      span_starts = set()
      neg = 0
      pos = 0
      err = 0
      
      # Parse the tokens
      tokens = nlp(jdict["document"])  
      spaces = []
      spaces = [True if tok.whitespace_ else False for tok in tokens]
      words  = [t.text for t in tokens]
      doc    = Doc(nlp.vocab, words=words, spaces=spaces)

      # Parse entities
      spans    = jdict["tokens"]
      entities = []
      span_end_to_start = {}
      for span in spans:
        entity = doc.char_span(span["start"], span["end"], 
                               label=span["entityLabel"])
        span_end_to_start[span["token_start"]] = span["token_start"]
        entities.append(entity)
        span_starts.add(span["token_start"])

      try:
        doc.ents = entities
      except ValueError:
        err = 1
        # This error was originally found in PMID19413897_abstract.ann where
        # there is an entity spanning "," which does not make sense. This was
        # removed from the annotation.
        print("ERR: ValueError at doc index=", idx)
        print("", flist[idx])
        print("", entities)
        print("", tokens[335 ])
        print("", tokens[335-5:335+5])
        break
      

      # Parse the relations
      # Create a dict with all possible relations
      rels = {}
      for x1 in span_starts:
        for x2 in span_starts:
          rels[(x1, x2)] = {}

      relations = jdict["relations"]
      for relation in relations:
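        # The script this was adapted from mapped end tokens to start tokens
        # here. In our JSON, 'head' and 'child' are already first-token indices
        # (see brat_to_json), so span_end_to_start is an identity lookup.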
        start = span_end_to_start[relation["head"]]
        end = span_end_to_start[relation["child"]]
        label = relation["relationLabel"]

        if label not in rels[(start, end)]:
          rels[(start, end)][label] = 1.0
          pos += 1

      # fill in zero's where the data is missing
      for x1 in span_starts:
        for x2 in span_starts:
          for label in MAP_LABELS.values():
            if label not in rels[(x1, x2)]:
              neg += 1
              rels[(x1, x2)][label] = 0.0

              #print(rels[(x1, x2)])
      doc._.rel = rels
      
      docs.append(doc)

  # Save docs to a spacy format file
  docbin = DocBin(docs=docs, store_user_data=True)
  docbin.to_disk(spacy_loc)

  return docs

In [107]:
# Create a blank Tokenizer with just the English vocab
#nlp = spacy.blank("en")
# commented out since nlp is created earlier with English().

spacy_train = work_dir / "train.spacy"
spacy_dev   = work_dir / "dev.spacy"
spacy_test  = work_dir / "test.spacy"

MAP_LABELS = {label:label for label in rel_train}

train_docs = json_to_spacy(json_train, spacy_train, MAP_LABELS, flist_train)
dev_docs   = json_to_spacy(json_dev, spacy_dev, MAP_LABELS, flist_dev)
test_docs  = json_to_spacy(json_test, spacy_test, MAP_LABELS, flist_test)

In [108]:
len(train_docs), len(dev_docs), len(test_docs)

(142, 48, 43)
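
The relation table stored in `doc._.rel` scores every ordered pair of entity start tokens for every label, which is why negatives far outnumber positives. A compact sketch of that structure follows. `build_rel_table` is a hypothetical helper and the token indices are made up.

```python
from itertools import product

def build_rel_table(entity_starts, gold_relations, labels):
    """Score every ordered entity pair for every label: 1.0 if annotated, else 0.0."""
    rels = {pair: {label: 0.0 for label in labels}
            for pair in product(entity_starts, repeat=2)}
    for head, child, label in gold_relations:
        rels[(head, child)][label] = 1.0
    return rels

# Hypothetical document: three entities starting at tokens 0, 4 and 9,
# with one annotated relation (head=4, child=0)
labels = ["interacts", "is-in"]
rels   = build_rel_table([0, 4, 9], [(4, 0, "is-in")], labels)

n_pos = sum(v == 1.0 for scores in rels.values() for v in scores.values())
n_neg = sum(v == 0.0 for scores in rels.values() for v in scores.values())
print(len(rels), n_pos, n_neg)
```

As in the function above, self-pairs such as `(0, 0)` are included, so a document with n entities and k labels yields n * n * k scores.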

In [None]:
os.chdir(work_dir)
!chmod 771 -R *

## ___Testing___

### Get pickle dataset from Huggingface

To get train/dev/test info from [Huggingface](https://huggingface.co/datasets/slotreck/pickle)

In [None]:
import wget
from datasets import load_dataset

pk_hf_dir  = work_dir / "pickle_huggingface"
pk_hf_dir.mkdir(exist_ok=True, parents=True)

In [None]:
# This leads to JSONDecodeError
datasets = load_dataset("slotreck/pickle", split="train")


In [None]:
# Get the files directly
# Use resolve/ (raw file) rather than blob/ (HTML viewer page)
url_hf     = "https://huggingface.co/datasets/slotreck/pickle/resolve/main/"
file_dev   = "dev.jsonl"
file_test  = "test.jsonl"
file_train = "train.jsonl"

wget.download(url_hf + file_dev,   str(pk_hf_dir / file_dev))
wget.download(url_hf + file_test,  str(pk_hf_dir / file_test))
wget.download(url_hf + file_train, str(pk_hf_dir / file_train))

In [None]:
def get_ids(jsonl_file):
  '''Get PMID IDs from a jsonl file'''

  # JSON file lines
  jlines = open(jsonl_file).readlines()

  tag   = "PMID"   # target tag
  pmids = []       # PMID ID list
  for jline in jlines:
    if jline.find(tag) != -1:
      jline = jline[jline.find(tag):]
      pmid  = jline.split("_abstract")[0]
      pmids.append(pmid)

  print(f" # PMID IDs: {len(pmids)}")

  return pmids
      

In [None]:
pmid_dev   = get_ids(pk_hf_dir / file_dev)
pmid_test  = get_ids(pk_hf_dir / file_test)
pmid_train = get_ids(pk_hf_dir / file_train)
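
Once the IDs are extracted, comparing them with a local split is a set operation. The sketch below uses a regular expression instead of string slicing. `extract_pmids` is a hypothetical helper, and the example lines (including the `doc_key` field name) are made up rather than taken from the real jsonl files.

```python
import re

PMID_RE = re.compile(r"PMID\d+")

def extract_pmids(lines):
    """Return the first PMID-like tag found on each line, skipping lines without one."""
    found = []
    for line in lines:
        match = PMID_RE.search(line)
        if match:
            found.append(match.group())
    return found

# Made-up lines; the field name "doc_key" is an assumption, not the verified schema
lines = ['{"doc_key": "PMID11111111_abstract", "sentences": []}',
         '{"doc_key": "PMID22222222_abstract", "sentences": []}',
         '{"sentences": []}']
hf_ids    = set(extract_pmids(lines))
local_ids = {"PMID11111111", "PMID33333333"}

shared     = sorted(hf_ids & local_ids)   # in both splits
only_local = sorted(local_ids - hf_ids)   # in ours but not in theirs
print(shared, only_local)
```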

### Separate NER and RE annotations

In the brat annotation files, entity and relation annotations are in one file with different prefixes: T (entity) and R (relation). The code below separates them into different files before generating JSON for spacy.
- This is not necessary as spacy can choose which part to use.
- Deprecated.

In [None]:
def sep_ner_re_ann(dir):
  ann_list = dir.glob("*.ann")

  ner_out = dir / "NER"
  re_out  = dir / "RE"
  ner_out.mkdir(exist_ok=True, parents=True)
  re_out.mkdir(exist_ok=True, parents=True)
  
  for ann in ann_list:
    ann_name   = ann.name
    with open(ann, "r") as f:
      ann_lines = f.readlines()
    ner_lines = [x for x in ann_lines if x[0]=="T"]
    re_lines  = [x for x in ann_lines if x[0]=="R"]
    with open(ner_out / (f"{ann_name}"), "w") as f:
      f.writelines(ner_lines)
    with open(re_out / (f"{ann_name}"), "w") as f:
      f.writelines(re_lines)

In [None]:
sep_ner_re_ann(dev_dir)
sep_ner_re_ann(test_dir)
sep_ner_re_ann(train_dir)

### Convert brat to json

Make use of [brat-standoff-to-json](https://github.com/astutic/brat-standoff-to-json)
- This creates a JSON format that Acharya can read. BUT!!!
- This does not lead to JSON that can be used by spacy.

#### Generate shell scripts

```bash
brat_standoff-to-json \
 -a "ann1,ann2,..." \
 -t "txt1,txt2,..." \
 -c config_file \
 -o output_file
```

In [None]:
def generate_shell_script(dir, subset, task):
  '''Generate shell script to run task (NER or RE)'''

  # dir with annotation files for the task
  task_out = dir / task

  txt_list = [t for t in dir.glob("*.txt")]
  ann_list = []
  for t in txt_list:
    t_stem   = t.stem
    ann_list.append(f"{str(task_out)}/{t.stem}.ann")

  ann_str  = ",".join([str(x) for x in ann_list])
  txt_str  = ",".join([str(x) for x in txt_list])

  script_file = dir / f"run_{task}.sh"
  json_out    = dir / f"{subset}_{task}.json"
  with open(script_file, "w") as f:
    f.write(f"{brat2json} -c {brat_config} -o {json_out} ")
    f.write(f' -a {ann_str} -t {txt_str}')

In [None]:
generate_shell_script(dev_dir, "dev", "NER")
generate_shell_script(dev_dir, "dev", "RE")
generate_shell_script(test_dir, "test", "NER")
generate_shell_script(test_dir, "test", "RE")
generate_shell_script(train_dir, "train", "NER")
generate_shell_script(train_dir, "train", "RE")
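
An alternative to writing `.sh` files is to assemble the same call as an argument list, which avoids shell quoting problems with long comma-separated file lists. This is only a sketch: `build_brat2json_cmd` is a hypothetical helper, the paths are made up, and nothing is executed here. The `-c`, `-o`, `-a` and `-t` flags are the ones used in the script generator above.

```python
import shlex
from pathlib import Path

def build_brat2json_cmd(binary, config, out_json, ann_files, txt_files):
    """Assemble the converter call as an argument list (nothing is executed)."""
    return [str(binary),
            "-c", str(config),
            "-o", str(out_json),
            "-a", ",".join(str(a) for a in ann_files),
            "-t", ",".join(str(t) for t in txt_files)]

# Hypothetical paths for illustration
cmd = build_brat2json_cmd(Path("bratstandoff-to-json"),
                          Path("annotation.conf"),
                          Path("dev_NER.json"),
                          [Path("NER/a.ann"), Path("NER/b.ann")],
                          [Path("a.txt"), Path("b.txt")])
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` in place of the `!./run_NER.sh` cells.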

#### Run shell scripts

In [None]:
os.chdir(dev_dir)
!chmod 771 -R *.sh
!./run_NER.sh
!./run_RE.sh

In [None]:
os.chdir(train_dir)
!chmod 771 -R *.sh
!./run_NER.sh
!./run_RE.sh

In [None]:
os.chdir(test_dir)
!chmod 771 -R *.sh
!./run_NER.sh
!./run_RE.sh

### Convert brat to IOB

The original plan to go from brat to JSON to spacy does not work.
- Try brat to IOB to JSON to spacy
- See [this repo](https://github.com/PL97/Brat2BIO)

Install Stanford CoreNLP
```bash
cd ~/bin
wget https://nlp.stanford.edu/software/stanford-corenlp-4.5.5.zip
unzip stanford-corenlp-4.5.5.zip
cd stanford-corenlp-4.5.5
```

Clone the repo and run the sample conversion
```bash
cd ~/github
git clone https://github.com/PL97/Brat2BIO.git
cd Brat2BIO
chmod +x convert.sh
./convert.sh sample output
```

Include env variables in .bashrc

This thing is a beast and, due to an env var issue, it did not run properly.

### Test spacy tokenizer

In [None]:
txt = "Reducing phenylpropanoid biosynthesis"
doc = nlp(txt)
len(txt), len(doc), [idx for idx, chr in enumerate(txt) if chr==" "]

In [None]:
for token in doc:
  print(token.text, token.i)

In [None]:
s = doc.char_span(9, 37)
s.text, s.start, s.end