# whisp/core: coref resolution with AllenNLP + spaCy neuralcoref using ensemble methods

## worklog 🧑🏽‍💻
collapse me to get to the code!!

### ~~Initial Server Setup~~
> - Launched AWS EC2 `t3.xlarge` instance
> - Connection via `coref.pem` file
    - Stored in S3 at [s3://whisp-research-keys/ensemble-coref/](https://whisp-research-keys.s3.eu-west-2.amazonaws.com/ensemble-coref/coref.pem)
> - `sudo apt-get update` and `sudo apt-get upgrade`
> - Install Anaconda
    - Make new env: `conda create --name coref python=3.6`
    - Updates `conda update conda --all` and `conda update anaconda`
    - Activate env `conda activate coref`
        - Install GitHub CLI `conda install gh --channel conda-forge`
            - Run `gh auth login` to authenticate and `gh auth setup-git`
            
### ~~Environment for Coref~~
> - Clone GH repo [NeuroSYS-pl/coreference-resolution](https://github.com/NeuroSYS-pl/coreference-resolution)
> - `conda activate coref`
> - install `gcc` and `make` with `sudo apt-get install make gcc`
    - also `sudo apt-get install python3-dev`

**Installation instructions for dependencies (stolen from OG repo)**
```
pip install spacy==2.1
python -m spacy download en_core_web_sm
pip install neuralcoref --no-binary neuralcoref
pip install allennlp
pip install --pre allennlp-models
```

@lucafrost: install of spaCy is failing at the C / Cython build step
- retried after running `conda install -c conda-forge gcc`; the build still fails because `cc1plus` cannot be executed
    - as per [StackOverflow](https://stackoverflow.com/questions/69485181/how-to-install-g-on-conda-under-linux), retrying with `conda install -c conda-forge gxx` and `conda install -c conda-forge cxx-compiler`
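
A quick way to see which build tools the active environment can actually reach is sketched below. It is a minimal check, not a fix: `find_build_tools` and `cc1plus_path` are hypothetical helper names, and the only external behaviour relied on is `gcc -print-prog-name=cc1plus`, which prints the path gcc would use for its C++ front end (or just the bare name when it cannot find one).

```python
import os
import shutil
import subprocess


def find_build_tools(tools=("gcc", "g++", "make")):
    """Map each tool name to its resolved path on PATH, or None if absent."""
    return {tool: shutil.which(tool) for tool in tools}


def cc1plus_path():
    """Ask gcc where its C++ front end lives; None if gcc or cc1plus is missing."""
    gcc = shutil.which("gcc")
    if gcc is None:
        return None
    result = subprocess.run(
        [gcc, "-print-prog-name=cc1plus"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,  # Python 3.6-compatible spelling of text=True
    )
    candidate = result.stdout.strip()
    # gcc echoes the bare program name back when it cannot locate it.
    if os.path.isabs(candidate) and os.path.exists(candidate):
        return candidate
    return None


for tool, path in find_build_tools().items():
    print(f"{tool:8} {path or 'NOT FOUND'}")
print(f"{'cc1plus':8} {cc1plus_path() or 'NOT FOUND'}")
```

Running this inside the activated `coref` env (rather than the base shell) matters, since conda-installed compilers only appear on `PATH` once the env is active.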
    
~ 17/10/22

---
### Restarting efforts in Jupyter
- The EC2 instance still fails to build spaCy, specifically its `preshed` dependency. The cause appears to be a missing or broken C++ compiler, which is needed to build the package's Cython code.
    - I have tried every fix I could find, including installing `gxx`, `cxx-compiler`, `python-dev`, etc.
    - While lazy, a managed environment makes the most sense: AWS will not want its data science clients running into C++/ObjC build errors.
- Instantiated a `Python 3 (PyTorch 1.10 Python 3.8 CPU Optimized)` kernel image in AWS SageMaker Studio.
- Clone git repository [NeuroSYS-pl/coreference-resolution](https://github.com/NeuroSYS-pl/coreference-resolution)
- Successfully installed dependencies as below...
```console
pip install spacy==2.1
python -m spacy download en_core_web_sm
pip install neuralcoref --no-binary neuralcoref
pip install allennlp
pip install --pre allennlp-models
```
\* *I ran these in-notebook with `!pip`, but I think an Image Terminal will also suffice.*

right, now to get to the actual work...

~ 18/10/22 :: 13:44 GST

---

### updates
- coref resolution with spaCy neuralcoref is up and running
- ran into an issue with the `Predictor` class in AllenNLP: missing package `ipywidgets`
    - fix with `pip install ipywidgets` & restart kernel
- ran into issue with kernel death upon calling `predictor = Predictor.from_path(model_url)`
    - silly me, the instance only had 4GB of memory, upgrading to `ml.g4dn.xlarge`
- all done with both spaCy neuralcoref and AllenNLP pretrained SpanBERT. have used the intersection strategies implemented by @mmaslankowska-neurosys.
    - anecdotally, the `FuzzyIntersectionStrategy` appears to be the most effective.
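
Since the kernel death above came down to instance memory, a small pre-flight check before calling `Predictor.from_path` may save a restart. The sketch below reads total physical RAM through POSIX `sysconf`, so it assumes a Linux (or macOS) host. The helper names and the `REQUIRED_GIB` threshold are placeholders, not a measured requirement for the SpanBERT model, and total RAM is only a rough proxy for what is actually free.

```python
import os


def total_memory_gib():
    """Total physical memory in GiB, read via POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3


def has_enough_memory(required_gib):
    """True if the machine has at least `required_gib` GiB of physical RAM."""
    return total_memory_gib() >= required_gib


REQUIRED_GIB = 8  # assumption: placeholder threshold, tune for your model
if not has_enough_memory(REQUIRED_GIB):
    print(f"only {total_memory_gib():.1f} GiB RAM; loading the model may kill the kernel")
```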

## installation and dependencies

In [4]:
import spacy
import neuralcoref
from wasabi import msg
from allennlp.predictors.predictor import Predictor



## creating neuralcoref function

In [5]:
def neural(text):
    # TODO: move outside function & make class-based
    nlp = spacy.load('en_core_web_sm')
    neuralcoref.add_to_pipe(nlp)
    doc = nlp(text)
    out = {
        "resolved": doc._.coref_resolved,
        "clusters": doc._.coref_clusters,
        "token_data": [[token.text, token.pos_, token.tag_] for token in doc]
    }
    return out
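
One possible shape for the TODO above is a small callable class that builds the pipeline once and reuses it, instead of reloading `en_core_web_sm` on every call. This is only a sketch: the `NeuralCoref` class name and the optional `nlp` argument (for passing in a pre-built pipeline) are my own additions, not part of neuralcoref.

```python
class NeuralCoref:
    """Callable wrapper that builds the spaCy + neuralcoref pipeline once."""

    def __init__(self, model_name="en_core_web_sm", nlp=None):
        self.model_name = model_name
        self._nlp = nlp  # optional pre-built pipeline (hypothetical convenience)

    @property
    def nlp(self):
        # Lazy load: the expensive model load happens on first use only.
        if self._nlp is None:
            import spacy
            import neuralcoref

            self._nlp = spacy.load(self.model_name)
            neuralcoref.add_to_pipe(self._nlp)
        return self._nlp

    def __call__(self, text):
        doc = self.nlp(text)
        return {
            "resolved": doc._.coref_resolved,
            "clusters": doc._.coref_clusters,
            "token_data": [[t.text, t.pos_, t.tag_] for t in doc],
        }
```

Usage would then be `neural = NeuralCoref()` once, followed by `neural(text)` for each document, returning the same dict as the function above.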

In [19]:
text = 'Luca sat at his desk, before Lily interrupted him.'
data = neural(text)
print(data['clusters'])
# not ideal performance...

[Luca: [Luca, his, him]]


## creating AllenNLP function

**N.B.** the `coref-spanbert-large` model must be downloaded to the notebook directory as below.
```console
sagemaker-user@studio$ conda activate base
(base) sagemaker-user@studio$ wget https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz
```

In [7]:
model_url = 'coref-spanbert-large-2020.02.27.tar.gz'
predictor = Predictor.from_path(model_url)

Some weights of BertModel were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
text = 'Luca sat at his desk, before Lily interrupted him.'
prediction = predictor.predict(document=text)
' '.join(prediction['document'])
prediction['clusters']

[[[0, 0], [3, 3]]]
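
The cluster output is a list of `[start, end]` token spans (end inclusive) that index into `prediction['document']`. To make it readable, the spans can be mapped back to text. In the sketch below, `spans_to_text` is a hypothetical helper and the token list is typed out by hand to approximate the tokenizer's output, so the block runs without the model.

```python
def spans_to_text(document, clusters):
    """Turn [start, end] token spans (end inclusive) into mention strings."""
    return [
        [" ".join(document[start:end + 1]) for start, end in cluster]
        for cluster in clusters
    ]


# Assumption: hand-written tokens standing in for prediction['document'].
document = ["Luca", "sat", "at", "his", "desk", ",",
            "before", "Lily", "interrupted", "him", "."]
clusters = [[[0, 0], [3, 3]]]  # prediction['clusters'] from the cell above

print(spans_to_text(document, clusters))  # [['Luca', 'his']]
```

Read this way, the model links `Luca` and `his` but leaves out `him` (token 9).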

### intersection strategies

In [27]:
from utils import load_models, print_clusters
from utils import IntersectionStrategy, StrictIntersectionStrategy, PartialIntersectionStrategy, FuzzyIntersectionStrategy

In [28]:
predictor, nlp = load_models()

INFO:allennlp.common.plugins:Plugin allennlp_models available
INFO:cached_path:cache of https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz is up-to-date
INFO:allennlp.models.archival:loading archive file https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz from cache at /root/.allennlp/cache/0f6b052811b20b13280e609a96efe71ebc636b9c823a5c906ba24459e6e68af9.c1dab61d84cc7c3f7d6751c260040607cb7023a002778ba8f9b9d196b6539174
INFO:allennlp.models.archival:extracting archive file /root/.allennlp/cache/0f6b052811b20b13280e609a96efe71ebc636b9c823a5c906ba24459e6e68af9.c1dab61d84cc7c3f7d6751c260040607cb7023a002778ba8f9b9d196b6539174 to temp dir /tmp/tmpgbvtxgl1
INFO:allennlp.common.params:dataset_reader.type = coref
INFO:allennlp.common.params:dataset_reader.max_instances = None
INFO:allennlp.common.params:dataset_reader.manual_distributed_sharding = False
INFO:allennlp.common.params:dataset_reader.manual_multiproces

In [29]:
text = "Austin Jermaine Wiley (born January 8, 1999) is an American basketball player. He currently plays for the Auburn Tigers in the Southeastern Conference. Wiley attended Spain Park High School in Hoover, Alabama, where he averaged 27.1 points, 12.7 rebounds and 2.9 blocked shots as a junior in 2015-16, before moving to Florida, where he went to Calusa Preparatory School in Miami, Florida, while playing basketball at The Conrad Academy in Orlando."

clusters = predictor.predict(text)['clusters']
doc = nlp(text)

In [30]:
print("~~~ AllenNLP clusters ~~~")
print_clusters(doc, clusters)
print("\n~~~ Huggingface clusters ~~~")
for cluster in doc._.coref_clusters:
    print(cluster)

~~~ AllenNLP clusters ~~~
Austin Jermaine Wiley - [Austin Jermaine Wiley; He; Wiley; he; he]
Florida - [Florida; Florida]

~~~ Huggingface clusters ~~~
Wiley: [Austin Jermaine Wiley (born January 8, 1999), He, Wiley, he, he]
Florida: [Florida, Florida]
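
To compare the two outputs programmatically rather than by eye, both need the same span representation. neuralcoref clusters expose a `mentions` list of spaCy `Span` objects whose `end` is exclusive, while AllenNLP uses inclusive `[start, end]` pairs. A minimal conversion might look like the sketch below; `neuralcoref_to_spans` is a hypothetical helper, and the stand-in objects only mimic the attributes used, so the block runs without spaCy.

```python
from types import SimpleNamespace


def neuralcoref_to_spans(coref_clusters):
    """Convert neuralcoref clusters to [start, end] token spans, end inclusive."""
    return [
        [[mention.start, mention.end - 1] for mention in cluster.mentions]
        for cluster in coref_clusters
    ]


# Stand-ins for `doc._.coref_clusters`; token indices are illustrative.
florida = SimpleNamespace(mentions=[
    SimpleNamespace(start=62, end=63),
    SimpleNamespace(start=74, end=75),
])
print(neuralcoref_to_spans([florida]))  # [[[62, 62], [74, 74]]]
```

This only lines up with the AllenNLP indices if both models tokenize the text identically, which is worth verifying on real input.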


In [31]:
strict = StrictIntersectionStrategy(predictor, nlp)
partial = PartialIntersectionStrategy(predictor, nlp)
fuzzy = FuzzyIntersectionStrategy(predictor, nlp)

In [32]:
for intersection_strategy in [strict, partial, fuzzy]:
    print(f'\n~~~ {intersection_strategy.__class__.__name__} clusters ~~~')
    print_clusters(doc, intersection_strategy.clusters(text))


~~~ StrictIntersectionStrategy clusters ~~~
Florida - [Florida; Florida]

~~~ PartialIntersectionStrategy clusters ~~~
Wiley - [He; Wiley; he; he]
Florida - [Florida; Florida]

~~~ FuzzyIntersectionStrategy clusters ~~~
Austin Jermaine Wiley - [Austin Jermaine Wiley; He; Wiley; he; he]
Florida - [Florida; Florida]
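
The three results above are consistent with the following reading: strict keeps only clusters that are identical in both models, partial keeps the spans both models agree on exactly, and fuzzy also accepts spans that merely overlap. The sketch below is my own simplified illustration of that idea, not the code in `utils`. The function names are hypothetical, and the neuralcoref span `[0, 9]` for "Austin Jermaine Wiley (born January 8, 1999)" is derived by hand from the token counts.

```python
def _as_set(cluster):
    return {tuple(span) for span in cluster}


def _overlaps(s, t):
    """True if inclusive spans s and t share at least one token."""
    return s[0] <= t[1] and t[0] <= s[1]


def strict_intersection(a, b):
    """Clusters of `a` that appear, span for span, in `b`."""
    b_sets = [_as_set(c) for c in b]
    return [c for c in a if _as_set(c) in b_sets]


def _pairwise(a, b, match):
    out = []
    for ca in a:
        for cb in b:
            common = [s for s in ca if any(match(s, t) for t in cb)]
            if len(common) > 1:  # a cluster needs at least two mentions
                out.append(common)
    return out


def partial_intersection(a, b):
    """Per cluster pair, keep spans that match exactly."""
    return _pairwise(a, b, lambda s, t: s == t)


def fuzzy_intersection(a, b):
    """Per cluster pair, keep spans of `a` that overlap some span of `b`."""
    return _pairwise(a, b, _overlaps)


allen = [[[0, 2], [16, 16], [28, 28], [40, 40], [65, 65]], [[62, 62], [74, 74]]]
neural = [[[0, 9], [16, 16], [28, 28], [40, 40], [65, 65]], [[62, 62], [74, 74]]]

print(strict_intersection(allen, neural))   # [[[62, 62], [74, 74]]]
print(partial_intersection(allen, neural))  # Wiley cluster without [0, 2], plus Florida
print(fuzzy_intersection(allen, neural) == allen)  # True
```

On this example the sketch reproduces the pattern above: strict drops the Wiley cluster entirely, partial loses only the first mention, and fuzzy recovers it because `[0, 2]` overlaps `[0, 9]`.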


In [40]:
print_clusters(nlp(text), fuzzy.clusters(text))

Austin Jermaine Wiley - [Austin Jermaine Wiley; He; Wiley; he; he]
Florida - [Florida; Florida]


In [55]:
fuzzy = FuzzyIntersectionStrategy(predictor, nlp)

In [56]:
fuzzy.resolve_coreferences(text)

'Austin Jermaine Wiley (born January 8, 1999) is an American basketball player. Austin Jermaine Wiley currently plays for the Auburn Tigers in the Southeastern Conference. Austin Jermaine Wiley attended Spain Park High School in Hoover, Alabama, where Austin Jermaine Wiley averaged 27.1 points, 12.7 rebounds and 2.9 blocked shots as a junior in 2015-16, before moving to Florida, where Austin Jermaine Wiley went to Calusa Preparatory School in Miami, Florida, while playing basketball at The Conrad Academy in Orlando.'
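
The resolved text reads as if every later mention in a cluster were replaced by the cluster's first mention. A simplified version of that substitution, working on a token list and inclusive `[start, end]` spans, might look like the sketch below. `resolve` is a hypothetical helper, not the `resolve_coreferences` implementation, which may well handle cases this sketch ignores (nested spans, possessives, detokenization).

```python
def resolve(tokens, clusters):
    """Replace each non-first mention with the tokens of its cluster's first mention."""
    replacements = {}
    for cluster in clusters:
        head_start, head_end = cluster[0]
        head = tokens[head_start:head_end + 1]
        for start, end in cluster[1:]:
            replacements[start] = (end, head)

    out, i = [], 0
    while i < len(tokens):
        if i in replacements:
            end, head = replacements[i]
            out.extend(head)
            i = end + 1  # skip past the replaced mention
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = ["Luca", "sat", "at", "his", "desk"]
print(resolve(tokens, [[[0, 0], [3, 3]]]))
# ['Luca', 'sat', 'at', 'Luca', 'desk']
```

Note that the naive swap turns `his` into `Luca` rather than `Luca's`, which is one reason to inspect resolved output before feeding it downstream.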

In [54]:
print(fuzzy.clusters(text))

[[[0, 2], [16, 16], [28, 28], [40, 40], [65, 65]], [[62, 62], [74, 74]]]
