This notebook explores coreference resolution in spaCy using the `coreferee` library.

In [None]:
!pip install spacy==3.2
!python3 -m spacy download en_core_web_sm

In [None]:
!python3 -m pip install coreferee

In [None]:
!python3 -m coreferee install en

In [None]:
import coreferee, spacy
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('coreferee')

In [None]:
doc1 = nlp("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")

In [None]:
doc1._.coref_chains.print()

Coreference clusters can be found in the `_.coref_chains` attribute of a `Doc`. `_.coref_chains` is a list of mention *clusters* (chains): each *mention* is a span of tokens in the text, stored as token indexes, and a cluster is the set of mentions that refer to the same unique *entity*.

The head of a mention can be approximated by its `root_index` attribute, the index of the token at the syntactic root of the mention. The dependency relation of that root token best captures how the entire mention relates to the rest of the sentence.
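To make the cluster and mention structure concrete without loading a model, here is a minimal sketch that uses plain Python lists. The sentence, the whitespace tokenization, and the `chains` value are hand-written for illustration and are not actual `coreferee` output; `mention_text` is a hypothetical helper. It assumes that a mention can hold more than one token index (for example, a coordinated phrase), as the `mention[0]` and `mention[-1]` indexing in this notebook suggests.

```python
# Hand-written stand-in for a parsed document: a whitespace-tokenized sentence.
tokens = "Peter and his wife said they were tired".split()

# Hypothetical chains written by hand (NOT actual coreferee output).
# Each chain is a list of mentions; each mention is a list of token indexes.
chains = [
    [[0], [2]],      # "Peter" ... "his"
    [[0, 3], [5]],   # "Peter" + "wife" (a coordinated mention) ... "they"
]


def mention_text(tokens, mention):
    """Join the tokens from the first to the last index of a mention, inclusive."""
    return " ".join(tokens[mention[0]:mention[-1] + 1])


for idx, chain in enumerate(chains):
    print(f"Cluster {idx}:", [mention_text(tokens, m) for m in chain])
```

Taking the slice from the first to the last index turns the coordinated mention `[0, 3]` into the full phrase "Peter and his wife", which is the same trick the printing function in this notebook applies to a real `Doc`.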

In [None]:
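def print_coref_chains(doc):
    """Print every mention of every coreference chain found in `doc`."""
    for idx, chain in enumerate(doc._.coref_chains):
        print(f"Cluster {idx}")
        for mention in chain:
            # A mention is indexed by token position: take its first and last token.
            start, end = mention[0], mention[-1]
            text = doc[start:end + 1]
            # mention.root_index = the index of the spaCy Token object that is the
            # syntactic head of the mention (in the dependency tree)
            root = doc[mention.root_index]

            # Columns: mention text, first index, last index, root token,
            # dependency label of the root, head of the root
            print(text, start, end, root, root.dep_, root.head)
        print()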

In [None]:
print_coref_chains(doc1)

Now test the limits of coreference resolution with `coreferee`. How does it fare on:

- Winograd schema challenge questions?
- long documents?
- near-identity?

Importantly, note that `coreferee` only reports coreference chains that involve **two or more** mentions. Singleton chains (involving only one mention) won't appear at all.
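Because singletons are never reported, it can help to list the candidate tokens that ended up in no chain. The sketch below is one way to do that, again with plain lists standing in for `doc._.coref_chains`. The function name `tokens_outside_chains`, the hand-picked candidate indexes, and the example chain are all assumptions made for illustration, not output from `coreferee`.

```python
def tokens_outside_chains(candidates, chains):
    """Return the candidate token indexes that are not part of any mention."""
    covered = {i for chain in chains for mention in chain for i in mention}
    return [i for i in candidates if i not in covered]


# Token positions in "The trophy would not fit in the brown suitcase because it was too big":
# trophy = 1, suitcase = 8, it = 10 (hand-picked nouns and pronouns).
candidates = [1, 8, 10]

# Hypothetical chain linking "trophy" and "it" (NOT actual coreferee output).
example_chains = [[[1], [10]]]

print(tokens_outside_chains(candidates, example_chains))
```

With a real `Doc`, the candidates could instead be collected from part-of-speech tags, and the function should work unchanged as long as chains iterate over mentions and mentions iterate over token indexes.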

In [None]:
doc2 = nlp("The trophy would not fit in the brown suitcase because it was too big")
print_coref_chains(doc2)

In [None]:
doc3 = nlp("The town councilors refused to give the man a permit because they feared violence.")
print_coref_chains(doc3)

In [None]:
doc4 = nlp("The town councilors refused to give the man a permit because they advocated violence.")
print_coref_chains(doc4)
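The two councilor sentences form a Winograd pair: they differ in a single verb, and the intended antecedent of "they" flips from the councilors to the man. A resolver that ignores the verb gives the same answer for both and can get at most one of them right. The sketch below scores such a pair. The `predicted` chains are hypothetical placeholders, not results from `coreferee`, and `resolved_antecedent` is an illustrative helper; replace the placeholders with the chains printed above to score the real output.

```python
def resolved_antecedent(pronoun_index, chains):
    """Return the earliest other mention in the chain containing the pronoun, or None."""
    for chain in chains:
        mentions = [list(m) for m in chain]
        if any(pronoun_index in m for m in mentions):
            others = [m for m in mentions if pronoun_index not in m]
            return min(others, key=lambda m: m[0]) if others else None
    return None


# Token positions shared by both sentences: councilors = 2, man = 7, they = 11.
PRONOUN = 11
gold = {"feared": [2], "advocated": [7]}

# Hypothetical predictions (NOT actual coreferee output): "they" is linked to
# "councilors" in both sentences.
predicted = {
    "feared": [[[2], [11]]],
    "advocated": [[[2], [11]]],
}

correct = sum(resolved_antecedent(PRONOUN, predicted[k]) == gold[k] for k in gold)
print(f"{correct}/{len(gold)} correct")
```

Scoring the pair together, rather than each sentence alone, is what separates genuine use of the verb's meaning from a fixed preference such as "pick the subject".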