# Project Status and Work Log

## Initial Approach and Thoughts
* My initial approach focused on training a fast word-based tagging model. A full round of feature extraction and hyperparameter tuning was conducted
  * However, the results were unsatisfactory (see the non-RNN results), so I switched to an RNN model
  * Also note that the mongo collections have been renamed to differentiate this fix - metrics_coref_word_tagger holds the original word-based tagging model work, which is valid
* Switching to the RNN model, I made some initial mistakes:
 1. I didn't have it choose the Anaphora tag; instead it chose the most common tag (which isn't always the anaphora tag)
 2. There seem to be some issues with the way the initial RNNs were trained, e.g. compare the numbers under Anaphora.data_points. These are correct in the word tagging model and the later RNN tagger (see "_fixed") but not in the initial RNN tagger work
* To remedy this I switched to newer code, copied from different notebooks, and the results of that are seen under the "coref_new_fixed" collection
* I also moved to having the models spit out tagged data points, and then re-computing the metrics directly from that data. This then allows us to interrogate those predictions at a later date as needed, not just the raw numbers
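
A minimal sketch of re-computing metrics from stored tagged data points, rather than trusting only the stored numbers. The input format (one pair of actual and predicted tag sets per word) and the function name `metrics_from_predictions` are assumptions for illustration, not the format the notebooks actually write.

```python
from typing import Dict, Iterable, Set, Tuple


def metrics_from_predictions(
    pairs: Iterable[Tuple[Set[str], Set[str]]], tag: str = "Anaphora"
) -> Dict[str, float]:
    """Recompute counts and scores for one tag from (actual, predicted) tag sets."""
    tp = fp = fn = tn = 0
    for actual, predicted in pairs:
        in_actual, in_predicted = tag in actual, tag in predicted
        if in_actual and in_predicted:
            tp += 1
        elif in_predicted:
            fp += 1
        elif in_actual:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "data_points": tp + fp + fn + tn,  # should equal the number of tagged words
        "precision": precision, "recall": recall, "f1": f1,
    }


# Made-up example: one (actual, predicted) pair per word
example = [
    ({"Anaphora"}, {"Anaphora"}),  # true positive
    ({"Anaphora"}, set()),         # false negative
    (set(), {"Anaphora"}),         # false positive
    (set(), set()),                # true negative
]
result = metrics_from_predictions(example)
```

Checking `data_points` against the number of words in the dataset is a cheap way to catch the kind of data point mismatch seen in the initial RNN work.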

## Word Tagger Work (Notes)
- Initially I trained a word tagging model as it's much faster, and had similar accuracy to the RNN (but not quite as good). This allowed faster iteration, however the results weren't great.
- I did this in two phases:
 1. Did initial feature selection and hyperparameter tuning to determine the optimal features (window size, etc.) and parameters for training a word tagging model to tag anaphora tags
   - mongo - metrics_coref_word_tagger
   - py scripts
     - windowbasedtagger_most_common_tag_multiclass_feat_seln.py
     - windowbasedtagger_most_common_tag_multiclass_hyper_param_tuning.py
 2. Using results from 1, I then did a sort of feature selection on which co-ref tags to use to replace the concept codes
   - **<span style='color:red'>I think this is invalid as BrattEssay was not updated yet - re-use code though?</span>**
   - mongo - metrics_coref_word_tagger_coref_feats
   - py script - windowbasedtagger_most_common_tag_multiclass_hyper_param_tuning.py

## RNN CC Tagger Work (Notes)

- Noticed that the CC tag predictions were not based on those from the optimal model
 - See "CC Tagger Multiclass..." Notebooks - logic re-ran to fix this
- **NOTE:** These predictions are now stored in the **metrics_codes** mongo coll with the prefix **'STORE\_RESULTS\_'**

## RNN Ana Tagger Work (Notes)

- Initially made some mistakes (invalid logic - had wrong number of data points), and trained multi class RNN on most common tag
 - Stored in mongo collection called **metrics_coref_broken**, now deleted (see below)
- Then I figured out the issue with the predicted tagged Concept Codes not being from the best model, and corrected it via logic in "CB - CC Tagger MULTICLASS - Train Save CV Word Predictions - NO EXPLICIT.ipynb"
- With the fixed anaphora training logic, 
 - the data is now stored in **metrics_coref_rnn** 
 - This contains two types of collection based on two steps of work:
   1. Ran the initial RNN tagger model without hyperparameter tuning, and stored the predictions so the metrics can be calculated directly from them
     - NB "CB - Anaphora Tagger BINARY - FIXED - Train Save CV Word Predictions -NO EXPLICIT.ipynb"
     - Mongo - metrics_coref_rnn.CB_TAGGING_TD_RNN_BINARY_FIXED and similar
     - Predictions - stored in Bi-LSTM-4-Anaphora_Tags-Binary-Fixed folder - metrics match mongo 100%
   2. Decided I needed to do hyper parameter tuning on the RNN model:
     - NB - "CB - Anaphora Tagger BINARY - FIXED - Hyper Parameter Tuning.ipynb"
     - Mongo - metrics_coref_rnn.CB_TAGGING_TD_RNN_BINARY_HYPERPARAM_TUNING and similar
  - The predictions are stored in "Predictions/Bi-LSTM-4-Anaphora_Tags-Binary-Fixed/"
    - The predictions in "Predictions/Bi-LSTM-4-Anaphora_Tags-Binary/" are worse and represent the broken predictions

## Mongo Collection Naming / History 

### RNN Work
-  9/15/2018 Deleted - metrics_coref_broken
  - Initial RNN tagger work with non-optimal model / broken code (num data points incorrect, did multi-class) 
- 9/15/2018 - Consolidated - metrics_coref_new_fixed and metrics_coref_rnn_fixed into metrics_coref_rnn
 - within this, the CB_TAGGING_TD_RNN_BINARY_FIXED and similar colls reflect the initial output from the CB - Anaphora Tagger BINARY - FIXED - Train Save CV Word Predictions -NO EXPLICIT.ipynb notebook (does not hyper param tune, does dump prediction files)
  - I then decided to do hyper parameter tuning (but not to persist preds to disk...) - coll named CB_TAGGING_TD_RNN_BINARY_HYPERPARAM_TUNING and similar, refers to work done under CB - Anaphora Tagger BINARY - FIXED - Hyper Parameter Tuning.ipynb
   
   
### Word Tagging Work
- metrics_coref_new renamed to metrics_coref_old
- metrics_coref_old (originally metrics_coref_new) then renamed to metrics_coref_word_tagger_coref_feats to better reflect what the mongo coll is used for


## Adjusting the Bratt Essay Parsing Logic to Resolve Anaphora Tags
* Next I adjusted the BrattEssay file to resolve the anaphora tags with their antecedents when provided
* These get resolved as Anaphora:[{code}] where {code} is one of the 13 or 9 concept codes, e.g. Anaphora:[50]
* Analysis of how anaphora tags are initially tagged -  see 'Examine at How Anaphora Tags are Tagged to Inform Essay Parser Changes' 
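
As a small illustration of the resolved tag format above, here is one way such a tag could be split back into its concept code. The regular expression and the helper name `resolved_code` are assumptions for illustration; the actual BrattEssay parsing logic may differ.

```python
import re
from typing import Optional

# Matches resolved tags of the form Anaphora:[{code}], e.g. Anaphora:[50]
ANAPHORA_TAG = re.compile(r"^Anaphora:\[(?P<code>[^\]]+)\]$")


def resolved_code(tag: str) -> Optional[str]:
    """Return the concept code inside a resolved anaphora tag, else None."""
    match = ANAPHORA_TAG.match(tag)
    return match.group("code") if match else None


code = resolved_code("Anaphora:[50]")
```

A plain concept code or an unresolved `Anaphora` tag (one with no antecedent) returns `None` here, so the two cases can be handled separately.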

## Hyper Parameter Tuning

- The code to do this was re-written in order to persist the predictions to disk

## <span style="color:red">For some reason the Skin Cancer Metrics in Mongo do not match those coming from the database. CB matches OK</span>
- Subsequently I decided to re-run the SC train and test runs

## Merging CoRef Files with Annotated Essays

### Notes on CoRef Datastructure
- Dictionary of essays, keyed by name
- Each essay is a list of sentences
- Each sentence is a list of words
- Words are mapped to a tag dict
  - tag dict - contains
    - NER tag (most are O - none)
    - POS tag
    - If the word is a co-reference such as an anaphor (mostly pronouns):
      - COREF_PHRASE - phrase referred to by coref
      - COREF_REF - Id of referenced phrase
    - else if it is a phrase that is referenced:
      - COREF_ID - id of the co-reference, referenced in the COREF_REF tag
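
A minimal sketch of what this structure might look like in Python. The `COREF_PHRASE`, `COREF_REF` and `COREF_ID` keys come from the notes above; the `NER` and `POS` key names, the `(word, tag dict)` pairing, the value types and the essay name are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

TagDict = Dict[str, Any]
Sentence = List[Tuple[str, TagDict]]  # assumed: each word paired with its tag dict
Essays = Dict[str, List[Sentence]]    # essay name -> list of sentences

essays: Essays = {
    "essay_1": [  # hypothetical essay name
        [
            ("Corals", {"NER": "O", "POS": "NNS", "COREF_ID": 1}),
            ("bleach", {"NER": "O", "POS": "VBP"}),
        ],
        [
            ("They", {"NER": "O", "POS": "PRP",
                      "COREF_REF": 1, "COREF_PHRASE": "Corals"}),
            ("die", {"NER": "O", "POS": "VBP"}),
        ],
    ]
}


def find_anaphors(data: Essays) -> List[Tuple[str, int, str, Optional[str]]]:
    """List (essay, sentence index, word, referenced phrase) for each coref word."""
    found = []
    for name, sentences in data.items():
        for sent_ix, sentence in enumerate(sentences):
            for word, tags in sentence:
                if "COREF_REF" in tags:
                    found.append((name, sent_ix, word, tags.get("COREF_PHRASE")))
    return found


anaphors = find_anaphors(essays)
```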
      
### Notes on CoRef Output from Stanford
- Co-references can be in either order - the canonical reference can be **before** or **after** the mention, so it's really just a grouping of phrases that mean the same thing.
 - e.g. essay EBA1415_SEAL_34_CB_ES-04796 in '/Users/simon.hughes/Google Drive/PhD/Data/CoralBleaching/Thesis_Dataset/CoReference/Training'
 - Mention COREF_REF = 5 comes before the coreference COREF_ID = 5, whereas for COREF_ID = 4, the coreference (id) comes before the mention  (coref)
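
Because the order is not fixed, it may be simplest to group words by id first and only then ask which side came first. The sketch below assumes the same `(word, tag dict)` sentence layout as the notes above and single integer ids; the function names and the example sentences are made up.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Sentence = List[Tuple[str, Dict[str, Any]]]


def group_by_coref_id(sentences: List[Sentence]) -> Dict[int, Dict[str, list]]:
    """Group referenced words (COREF_ID) and mentions (COREF_REF) by shared id."""
    chains = defaultdict(lambda: {"referenced": [], "mentions": []})
    position = 0  # running word offset across the whole essay
    for sentence in sentences:
        for word, tags in sentence:
            if "COREF_ID" in tags:
                chains[tags["COREF_ID"]]["referenced"].append((position, word))
            if "COREF_REF" in tags:
                chains[tags["COREF_REF"]]["mentions"].append((position, word))
            position += 1
    return dict(chains)


def mention_comes_first(chain: Dict[str, list]) -> bool:
    """True when the earliest mention precedes the earliest referenced word."""
    if not chain["referenced"] or not chain["mentions"]:
        return False
    return min(chain["mentions"])[0] < min(chain["referenced"])[0]


# Made-up essay: id 5 has the mention first, id 4 has the referenced word first
sentences = [
    [("It", {"COREF_REF": 5}), ("hurts", {}),
     ("the", {"COREF_ID": 5}), ("coral", {"COREF_ID": 5})],
    [("Algae", {"COREF_ID": 4}), ("leave", {}),
     ("when", {}), ("they", {"COREF_REF": 4})],
]
chains = group_by_coref_id(sentences)
```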

## Notes on Berkeley Co-Reference System
- Reading through the introduction to the proposal, I was quite clear in stating the need to compare the Berkeley Co-Ref system to the Stanford System, because the qualities of such a system can vary greatly
- I sent Peter an email on 11/18/2018 asking him to confirm I also needed to include the Berkeley System's results
- I should be able to re-use all the code provided I output the same format of files that I output for the Stanford system. I will probably have to look for small differences in how it handles certain characters and words
- Need to use the 'Berkeley Entity Resolution System' and not their older coreference resolution system - https://github.com/gregdurrett/berkeley-entity

## Questions 
- Are the Co-references only present for concept codes? 
 - No (and see ans to next qu)
 - Also, some anaphora tags have no antecedent
- Are they only present for concept codes that form part of a causal relation?
 - No, not all anaphora tags have causal relations

##  Measure of Success - Considerations
  * Accuracy at detecting anaphora tags
  * Accuracy at detecting anaphora tags and correctly resolving the associated concept(s)
      1. Using the ML model's predictions to filter
      2. Using the Stanford output alone
  * Impact on other metrics when incorporated into a single solution?

## Filters to Apply to CoRef Words or Their Chains

Note - by coref word I mean the identified co-reference, and by chain I mean the entire coreference chain it references
- Filter by co-ref phrase length (shorter referenced words seem more likely to be references)
- Filter by POS type - the coref words are typically pronouns or determiners
- Filter by POS type in the chain - the coreferences typically refer to noun phrases
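
The three filters could be combined along these lines. This is only a sketch: the function name, the default length threshold and the exact tag sets are assumptions, using Penn Treebank POS tags (`PRP`, `PRP$`, `DT`, `NN*`).

```python
from typing import List

# Penn Treebank tags for personal/possessive pronouns and determiners
PRONOUN_OR_DETERMINER = {"PRP", "PRP$", "DT"}


def passes_filters(
    coref_words: List[str],
    coref_pos: List[str],
    chain_pos: List[str],
    max_len: int = 2,  # hypothetical threshold, to be tuned
) -> bool:
    """Apply the length, coref POS and chain POS filters to one co-reference."""
    short_enough = len(coref_words) <= max_len
    coref_pos_ok = any(tag in PRONOUN_OR_DETERMINER for tag in coref_pos)
    chain_has_noun = any(tag.startswith("NN") for tag in chain_pos)
    return short_enough and coref_pos_ok and chain_has_noun


keep = passes_filters(["they"], ["PRP"], ["DT", "NN"])
```

Keeping each filter as a separate boolean makes it easy to switch them on and off individually when trying different variants.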

## Comparing Performance with and without the Co-Reference Resolution Parsing
This is tricky because I didn't store the original predictions from RQ1 (Chapter 4), and re-running the RNN yields slightly different results in terms of accuracy each time due to the random weight initialization. So I can't directly compare the numbers to the numbers from chapter 4. Plus they are different anyway, as we technically have more of each concept code tag.
HOWEVER - I think we can back into these numbers as we have the individual tp, fp, tn and fn counts overall and for each code. As the anaphora tags were completely different tags (and excluded), I believe I can take the differences in these numbers and back into the adjusted metrics
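
One possible reading of "backing into" the numbers, sketched below: subtract the counts attributable to the resolved anaphora tags from the counts obtained with resolution, then recompute the scores from what remains. All counts here are made up, and whether a plain subtraction is the right adjustment is an assumption that would need checking against the real data.

```python
from collections import Counter
from typing import Dict


def scores(counts: Counter) -> Dict[str, float]:
    """Precision, recall and F1 from tp/fp/fn counts."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up counts for a single concept code
with_resolution = Counter(tp=90, fp=20, fn=30, tn=860)
from_anaphora_only = Counter(tp=10, fp=5, fn=2, tn=0)

# Remove the anaphora contribution to approximate the counts without resolution
without_resolution = Counter(with_resolution)
without_resolution.subtract(from_anaphora_only)

adjusted = scores(without_resolution)
```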

## Handling of 'Causer:50' and 'Result:1' Type Tags

When working on RQ2, I discovered there are quite a few examples where a word is tagged with a 'Causer:3' or 'Result:5' sort of tag, but NOT with the underlying concept code, which is often missing from the entire sentence in that case. To address this, I fixed those tags for the causal relation detection work but NOT the concept code work, as that would have required re-doing all of that chapter's work and correcting the write-up.

This chapter combines results for Anaphora detection on word tags and on causal relations. To make sure the results are comparable to the accuracy reported in the previous chapters, the word tagging model predictions (used in co-reference resolution) were obtained without adding in the missing codes, but the CREL parser model was trained using the exact same concept code predictions used in the shift-reduce parser work, WITH the added codes. So while this is slightly inconsistent across the two sections of this chapter, it makes sure the comparisons with the results from chapters 4 and 5 are still valid.

## Remaining Tasks / Project Plan - (Chapter 6)
- Compute Word Tagging Accuracy Metrics
 - Anaphora cross-referenced labels (using predicted anaphora tags)
 - **Done - Stanford corefs**
- Repeat Co-Reference Work using the Berkeley Co-Reference Parser
 -  Use the entity resolution parser instead of their older coref parser
     - https://github.com/gregdurrett/berkeley-entity
 - **Done - Berkeley corefs**
- Compute Causal Relation Tagging Accuracy
 - **Done**
- Compute Accuracy of using co-reference detection directly
 - **Done**
- Try Different variants of word length filters, POS tag filters, and controlling how we look back (or fwd) in the coref chain
 - **Done**
- Write up results
 - **2-5 days** initial
 - **2 days** with revisions
- TARGET DATE - **End of Dec**