# Coreference resolution in spaCy
The NeuralCoref coreference resolution module can be used as a component of a spaCy pipeline.

In [None]:
# First install compatible versions of spaCy and neuralcoref
! pip install spacy==2.1.0 neuralcoref

# then restart the kernel

In [None]:
# Download the small spaCy model for English if not done already
! python -m spacy download en_core_web_sm

In [None]:
import spacy
from spacy import displacy 

nlp = spacy.load('en_core_web_sm')

In [None]:
doc = nlp('My sister has a dog. She loves him. But he does not love her.')
displacy.render(doc, style='dep', jupyter=True)

Importing the neuralcoref package and adding coreference resolution to the spaCy pipeline

In [None]:
import neuralcoref

# Load a separate pipeline for coreference resolution,
# because displacy does not work with docs parsed with the neuralcoref component
nlp_with_coref = spacy.load('en_core_web_sm')
# Add NeuralCoref to spaCy's pipeline
neuralcoref.add_to_pipe(nlp_with_coref)
# Done. NeuralCoref annotations can now be accessed like any other spaCy document annotations.

In [None]:
# A naive way to apply the result of coreference resolution:
# resolve the text first, then simply re-parse the resolved text
doc_resolved = nlp_with_coref('My sister has a dog. She loves him. But he does not love her.')
print(doc_resolved._.coref_clusters)
print(doc_resolved._.coref_resolved)

In [None]:
doc_resolved_reparsed = nlp(doc_resolved._.coref_resolved)
displacy.render(doc_resolved_reparsed, style='dep', jupyter=True)
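
To make the idea behind the resolved text more concrete, here is a minimal pure-Python sketch that replaces every non-main mention with the main mention of its cluster. It is a simplified illustration and not NeuralCoref's actual implementation: the function name `resolve_mentions`, the dictionary layout, and the hand-written clusters are all assumptions made for this example.

```python
def resolve_mentions(tokens, clusters):
    """Replace each non-main mention with the main mention of its cluster.

    tokens   -- list of token strings
    clusters -- list of dicts with "main": (start, end) and
                "mentions": list of (start, end), all half-open token ranges
    """
    # Map the start index of every mention to replace
    # to (end index of that mention, tokens of the main mention).
    replacements = {}
    for cluster in clusters:
        main_start, main_end = cluster["main"]
        main_tokens = tokens[main_start:main_end]
        for start, end in cluster["mentions"]:
            if (start, end) != (main_start, main_end):
                replacements[start] = (end, main_tokens)

    resolved = []
    i = 0
    while i < len(tokens):
        if i in replacements:
            end, main_tokens = replacements[i]
            resolved.extend(main_tokens)
            i = end  # skip the tokens of the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved


tokens = ["My", "sister", "has", "a", "dog", ".", "She", "loves", "him", "."]
# Hand-written clusters for illustration (not actual NeuralCoref output).
clusters = [
    {"main": (0, 2), "mentions": [(0, 2), (6, 7)]},  # "My sister" <- "She"
    {"main": (3, 5), "mentions": [(3, 5), (8, 9)]},  # "a dog" <- "him"
]
print(" ".join(resolve_mentions(tokens, clusters)))
# My sister has a dog . My sister loves a dog .
```

The sketch works on token lists and joins them with spaces, so it ignores the original whitespace; it also assumes that mentions do not overlap.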

In [None]:
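def doc_resolve_render(text, nlp_with_coref, nlp):
    """Resolve coreferences in a raw text, re-parse the resolved text and render its dependency parse."""
    print("ORIG DOC:", text)
    doc_resolved = nlp_with_coref(text)
    print("RESOLVED DOC:", doc_resolved._.coref_resolved)
    # Re-parse with the plain pipeline, because displacy does not work with neuralcoref docs
    doc_resolved_reparsed = nlp(doc_resolved._.coref_resolved)
    displacy.render(doc_resolved_reparsed, style='dep', jupyter=True)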

In [None]:
doc2 = "The Marzilibahn is 105 meters long. It is the shortest funicular in Europe."
doc_resolve_render(doc2, nlp_with_coref, nlp)

In [None]:
doc3 = '''Bern is the capital city of Switzerland. It is also the capital of the canton of Bern. As of early 2006, 127.000 people live in the city'''
doc_resolve_render(doc3,nlp_with_coref, nlp)

In [None]:
doc4 = '''The other university is the University of Zürich. It is the largest university in Switzerland.'''
doc_resolve_render(doc4,nlp_with_coref, nlp)

## Adding coreference information to the JSON representation
Unfortunately, spaCy's to_json() method cannot be told directly to include the coreference resolution information in the parse. However, we can define a function that returns a JSON representation with additional token-level information that can be used for information extraction.

Therefore, we introduce three additional attributes on the token level of spaCy:
  - coref_cluster_id: Index of the coreference cluster the token belongs to
  - main_coref_start: Index of the first token of the main mention of this coreferential expression
  - main_coref_end: Index of the token following the last token of the main mention of this coreferential expression

main_coref_start and main_coref_end follow the normal spaCy span conventions for describing a slice of tokens in a document: doc[main_coref_start:main_coref_end] returns the span of tokens belonging to the main mention of the coreferential expression.
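
The half-open slice convention can be illustrated without spaCy, using a plain list of token strings as a stand-in for a document. The index values below are picked by hand for illustration and are not produced by NeuralCoref.

```python
# Plain-Python stand-in for a spaCy document: a list of token strings.
tokens = ["My", "sister", "has", "a", "dog", ".", "She", "loves", "him", "."]

# Hand-picked illustrative values for an expression covering "a dog".
main_coref_start = 3  # index of the first token of the expression
main_coref_end = 5    # index of the token AFTER the last token of the expression

# The end index is exclusive, exactly as in ordinary Python slicing.
main_mention = tokens[main_coref_start:main_coref_end]
print(main_mention)                       # ['a', 'dog']
print(main_coref_end - main_coref_start)  # 2 (number of tokens in the expression)
```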

In [None]:
def doc_to_json_with_coref(doc):
    """Serialize a spaCy document into a JSON-serializable dict, enriched with coreference information at the token level."""

    json_doc = doc.to_json()
    json_tokens = json_doc["tokens"]
    for i, t in enumerate(doc):
        if t._.in_coref:
            # Note: if a token belongs to several clusters, the last cluster overwrites the earlier ones
            for c in t._.coref_clusters:
                json_tokens[i]["coref_cluster_id"] = c.i
                json_tokens[i]["main_coref_start"] = c.main.start
                json_tokens[i]["main_coref_end"] = c.main.end
    return json_doc

In [None]:
doc_resolved = nlp_with_coref('My sister has a dog. She loves him. But he does not love her.')
doc_to_json_with_coref(doc_resolved)
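
As a rough sketch of how the enriched representation might be consumed for information extraction, the following pure-Python example groups tokens by cluster and looks up the main mention of a token. The dictionary is a hand-written, reduced stand-in for the function's result (only the `text` field and the token fields `id`, `start`, `end` plus the three coreference attributes are kept) and not actual NeuralCoref output. The helper names `mentions_by_cluster` and `main_mention_text` are hypothetical.

```python
from collections import defaultdict

COREF_0 = {"coref_cluster_id": 0, "main_coref_start": 0, "main_coref_end": 2}
COREF_1 = {"coref_cluster_id": 1, "main_coref_start": 3, "main_coref_end": 5}

# Hand-written stand-in; "start" and "end" are character offsets into "text".
json_doc = {
    "text": "My sister has a dog. She loves him.",
    "tokens": [
        {"id": 0, "start": 0, "end": 2, **COREF_0},    # My
        {"id": 1, "start": 3, "end": 9, **COREF_0},    # sister
        {"id": 2, "start": 10, "end": 13},             # has
        {"id": 3, "start": 14, "end": 15, **COREF_1},  # a
        {"id": 4, "start": 16, "end": 19, **COREF_1},  # dog
        {"id": 5, "start": 19, "end": 20},             # .
        {"id": 6, "start": 21, "end": 24, **COREF_0},  # She
        {"id": 7, "start": 25, "end": 30},             # loves
        {"id": 8, "start": 31, "end": 34, **COREF_1},  # him
        {"id": 9, "start": 34, "end": 35},             # .
    ],
}


def mentions_by_cluster(json_doc):
    """Group the texts of all coreferential tokens by their cluster id."""
    text = json_doc["text"]
    grouped = defaultdict(list)
    for tok in json_doc["tokens"]:
        if "coref_cluster_id" in tok:
            grouped[tok["coref_cluster_id"]].append(text[tok["start"]:tok["end"]])
    return dict(grouped)


def main_mention_text(json_doc, token):
    """Return the text of the main mention that a coreferential token points to."""
    tokens = json_doc["tokens"]
    first = tokens[token["main_coref_start"]]
    last = tokens[token["main_coref_end"] - 1]  # the end index is exclusive
    return json_doc["text"][first["start"]:last["end"]]


print(mentions_by_cluster(json_doc))
# {0: ['My', 'sister', 'She'], 1: ['a', 'dog', 'him']}
print(main_mention_text(json_doc, json_doc["tokens"][8]))
# a dog
```

Because the coreference attributes are stored on every token of a mention, a consumer never has to walk the cluster objects themselves; the token indices are enough to recover the main mention.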

Accessing coreference attributes (see https://github.com/huggingface/neuralcoref):
 - Coreferences are represented as clusters, i.e. sets of coreferential spans (mentions) in the document

In [None]:
doc = nlp_with_coref('My sister has a dog. She loves him')

# Example: doc._.coref_clusters[1].mentions[-1].start gives the index of the first token of the last mention of the second coreference cluster in the document.

In [None]:
doc._.coref_clusters[1].mentions[-1].start

In [None]:
doc[8]

In [None]:
doc._.coref_clusters[1]

In [None]:
doc._.coref_clusters[1].mentions[0].start, doc._.coref_clusters[1].mentions[0].end

In [None]:
doc[3:5]