# Coreference using "neuralcoref"

NeuralCoref is a pipeline extension for spaCy 2.1+ which annotates and resolves coreference clusters using a neural network. NeuralCoref is production-ready, integrated in spaCy's NLP pipeline and extensible to new training datasets.

<h4>Setup</h4>

In [1]:
#Setup/Prerequisites:

#Setup VEnv.
    #C:\Users\Jim>python -m venv c:\mypvenv
    #C:\Users\Jim>C:\mypvenv\Scripts\activate.bat

#Install Required Packages:
    #neuralcoref (req.spacy<3.0.0,>=2.1.0)
        #(mypvenv) C:\Users>git clone https://github.com/huggingface/neuralcoref.git
        #(mypvenv) C:\Users>cd neuralcoref
        
        #install requirements
        #(mypvenv) C:\Users\neuralcoref>pip install -r requirements.txt
        
        #Compile and install neuralcoref
        #(mypvenv) C:\Users\neuralcoref>pip install -e .

##################################
    #Cython>=0.25
        #pip install Cython
    #Spacy
        #pip install spacy==2.3.9


    # Trained Pipelines: 
        #Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French, German,
        #Greek, Italian, Japanese, Korean, Lithuanian, Macedonian, Multi-language, Norwegian Bokmål, Polish, 
        #Portuguese, Romanian, Russian, Spanish, Swedish, Ukrainian
    
    # Download SpaCy English model pipeline: en_core_web_sm , en_core_web_md , en_core_web_lg , en_core_web_trf
        #python -m spacy download en_core_web_sm

#Setup VEnv in vscode settings.
    #ctrl+shift+p: Python: Select Interpreter
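
The version constraint noted above (`spacy<3.0.0,>=2.1.0`) can be checked before installing. The following is a minimal sketch, assuming plain `X.Y.Z` version strings; the helper names `parse_version` and `is_supported_spacy` are hypothetical and not part of spaCy or NeuralCoref.

```python
def parse_version(version: str) -> tuple:
    """Turn a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_supported_spacy(version: str) -> bool:
    """Return True if `version` satisfies spacy>=2.1.0,<3.0.0 (hypothetical helper)."""
    return (2, 1, 0) <= parse_version(version) < (3, 0, 0)


print(is_supported_spacy("2.3.9"))  # the version pinned in the setup notes
print(is_supported_spacy("3.0.0"))  # spaCy 3.x falls outside the stated range
```

Pre-release strings such as `2.3.0.dev1` are not handled by this sketch.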

#Library Imports

In [2]:
#Imports
import spacy
import neuralcoref

In [3]:
# Load SpaCy model (one of SpaCy English model pipelines)
nlp_eng_pipeline = spacy.load('en_core_web_sm')


In [4]:
# Adding NeuralCoref to the pipe of an English SpaCy Language
neuralcoref.add_to_pipe(nlp_eng_pipeline)

<spacy.lang.en.English at 0x234d52e41f0>

---

<h4>Using NeuralCoref</h4>

<h4>List of the annotations:</h4>
<table>
<tr> <th> S.No </th> <th> Attribute </th> <th> Type </th> <th> Description </th> </tr>
<tr> <td> 01 </td> <td> doc._.has_coref </td> <td> boolean</td> <td> Whether any coreference has been resolved in the Doc </td> </tr>
<tr> <td> 02 </td> <td> doc._.coref_clusters </td> <td> list of Cluster</td> <td> All the clusters of coreferring mentions in the doc </td> </tr>
<tr> <td> 03 </td> <td> doc._.coref_resolved </td> <td> unicode</td> <td> Unicode representation of the doc where each coreferring mention is replaced by the main mention in the associated cluster.</td> </tr>
<tr> <td> 04 </td> <td> doc._.coref_scores </td> <td> Dict of Dict</td> <td> Scores of the coreference resolution between mentions. </td> </tr>
<tr> <td> 05 </td> <td> span._.is_coref </td> <td> boolean</td> <td> Whether the span has at least one coreferring mention </td> </tr>
<tr> <td> 06 </td> <td> span._.coref_cluster </td> <td> Cluster</td> <td> Cluster of mentions that corefer with the span </td> </tr>
<tr> <td> 07 </td> <td> span._.coref_scores </td> <td> Dict</td> <td> Scores of the coreference resolution of a span with other mentions (if applicable). </td> </tr>
<tr> <td> 08 </td> <td> token._.in_coref </td> <td> boolean</td> <td> Whether the token is inside at least one coreferring mention </td> </tr>
<tr> <td> 09 </td> <td> token._.coref_clusters </td> <td> list of Cluster</td> <td> All the clusters of coreferring mentions that contain the token </td> </tr>


</table>

In [5]:
#Example.
doc_eng = nlp_eng_pipeline('Jane Jackson has a dog. She loves him.')


In [6]:
#Print all the clusters of coreferring mentions in the doc.
if doc_eng._.has_coref: #Whether any coreference has been resolved in the Doc
    print(doc_eng._.coref_clusters) #All the clusters of coreferring mentions in the doc

[Jane Jackson: [Jane Jackson, She, him]]


In [7]:
#Each coreferring mention is replaced by the main mention in the associated cluster.
doc_eng._.coref_resolved

'Jane Jackson has a dog. Jane Jackson loves Jane Jackson.'
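
To make the substitution concrete, here is a minimal pure-Python sketch of how a resolved string like the one above could be rebuilt from one cluster. It is an illustration under stated assumptions, not NeuralCoref's implementation: tokens are modelled as `(text, trailing_whitespace)` pairs, mentions as `(start, end)` token ranges, and the names `span_text` and `resolve` are hypothetical.

```python
# Tokens as (text, trailing_whitespace) pairs for the example sentence.
tokens = [
    ("Jane", " "), ("Jackson", " "), ("has", " "), ("a", " "), ("dog", ""),
    (".", " "), ("She", " "), ("loves", " "), ("him", ""), (".", ""),
]

# One cluster: the main mention and all mentions as (start, end) token ranges.
main = (0, 2)
mentions = [(0, 2), (6, 7), (8, 9)]


def span_text(tokens, start, end):
    """Text of tokens[start:end], without the last token's trailing whitespace."""
    head = "".join(text + ws for text, ws in tokens[start:end - 1])
    return head + tokens[end - 1][0]


def resolve(tokens, main, mentions):
    """Replace every non-main mention with the text of the main mention."""
    main_text = span_text(tokens, *main)
    replace_at = {start: end for start, end in mentions if (start, end) != main}
    pieces, i = [], 0
    while i < len(tokens):
        if i in replace_at:
            end = replace_at[i]
            # Keep the whitespace that followed the replaced mention.
            pieces.append(main_text + tokens[end - 1][1])
            i = end
        else:
            pieces.append(tokens[i][0] + tokens[i][1])
            i += 1
    return "".join(pieces)


print(resolve(tokens, main, mentions))
# Jane Jackson has a dog. Jane Jackson loves Jane Jackson.
```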

In [8]:
#Scores of the coreference resolution between mentions.
doc_eng._.coref_scores

{Jane Jackson: {Jane Jackson: 1.5842124223709106},
 a dog: {a dog: 1.757738709449768, Jane Jackson: -2.28167462348938},
 She: {She: -0.10834205150604248,
  Jane Jackson: 7.760324478149414,
  a dog: -1.3071657419204712},
 him: {him: -1.870743989944458,
  Jane Jackson: 2.8315646648406982,
  a dog: 2.6815736293792725,
  She: -3.1379528045654297}}
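
One way to read these scores is to pick, for each mention, the candidate with the highest score. The sketch below copies the values printed above into a plain dict keyed by strings (the real keys are spaCy Span objects) and assumes that a mention scoring highest against itself means "no antecedent". That interpretation and the helper name `best_antecedents` are assumptions made for illustration.

```python
# Scores copied from the output above, with strings standing in for Span keys.
scores = {
    "Jane Jackson": {"Jane Jackson": 1.5842124223709106},
    "a dog": {"a dog": 1.757738709449768, "Jane Jackson": -2.28167462348938},
    "She": {
        "She": -0.10834205150604248,
        "Jane Jackson": 7.760324478149414,
        "a dog": -1.3071657419204712,
    },
    "him": {
        "him": -1.870743989944458,
        "Jane Jackson": 2.8315646648406982,
        "a dog": 2.6815736293792725,
        "She": -3.1379528045654297,
    },
}


def best_antecedents(scores):
    """Map each mention to its highest-scoring candidate, or None if that is itself."""
    links = {}
    for mention, candidates in scores.items():
        best = max(candidates, key=candidates.get)
        links[mention] = None if best == mention else best
    return links


for mention, antecedent in best_antecedents(scores).items():
    print(mention, "->", antecedent)
```

Under this reading, "him" links to "Jane Jackson" (2.83) only narrowly ahead of "a dog" (2.68), which is consistent with the cluster `[Jane Jackson, She, him]` shown earlier.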

In [9]:
span = doc_eng[0:2]
print(span)
span._.is_coref #Whether the span has at least one coreferring mention

Jane Jackson


True

In [10]:
print(span._.coref_cluster) #Cluster of mentions that corefer with the span

Jane Jackson: [Jane Jackson, She, him]


In [11]:
span._.coref_scores #Scores of the coreference resolution of a span with other mentions.

{Jane Jackson: 1.5842124223709106}

In [12]:
token = doc_eng[1]
print(token)
print(token._.in_coref) #Whether the token is inside at least one coreferring mention
token._.coref_clusters #All the clusters of coreferring mentions that contain the token

Jackson
True


[Jane Jackson: [Jane Jackson, She, him]]

---

<h4>Cluster</h4>
A Cluster is a group of coreferring mentions. It has three attributes and a few methods that simplify navigation inside the cluster.

<table>
    <tr>
        <td>Attribute or method </td>
        <td>Type / Return type </td>
        <td>Description</td>
    </tr>
    <tr>
        <td>i </td>
        <td>int </td>
        <td>Index of the cluster in the Doc</td>
    </tr>
    <tr>
        <td>main </td>
        <td>Span </td>
        <td>Span of the most representative mention in the cluster</td>
    </tr>
    <tr>
        <td>mentions </td>
        <td>list of Span </td>
        <td>List of all the mentions in the cluster</td>
    </tr>
    <tr>
        <td>__getitem__ </td>
        <td>return Span </td>
        <td>Access a mention in the cluster</td>
    </tr>
    <tr>
        <td>__iter__ </td>
        <td>yields Span </td>
        <td>Iterate over mentions in the cluster</td>
    </tr>
    <tr>
        <td>__len__ </td>
        <td>return int </td>
        <td>Number of mentions in the cluster</td>
    </tr>
</table>

In [13]:
doc = nlp_eng_pipeline(u'Jane Jackson has a dog. She loves him.')

print(doc._.coref_clusters)
print(doc._.coref_clusters[0].mentions)
print(doc._.coref_clusters[0].mentions[-1])
print(doc._.coref_clusters[0].mentions[-1]._.coref_cluster.main)

print()
#mentions are spaCy Span objects which means you can access all the usual Span attributes like 
# span.start (index of the first token of the span in the document), 
# span.end (index of the first token after the span in the document)
print(doc._.coref_clusters[0].mentions[-1].start)
print(doc._.coref_clusters[0].mentions[-1].end)
print()

span = doc[0:2]
print(span)
span._.is_coref
print(span._.coref_cluster.main)
print(span._.coref_cluster.main._.coref_cluster)


[Jane Jackson: [Jane Jackson, She, him]]
[Jane Jackson, She, him]
him
Jane Jackson

8
9

Jane Jackson
Jane Jackson
Jane Jackson: [Jane Jackson, She, him]
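
The Cluster interface described in the table can be mimicked with a small stand-in class, which may help when prototyping code without the model loaded. This is a hedged sketch only: `ToyCluster` is a hypothetical class, and mentions are `(start, end)` tuples standing in for spaCy Span objects.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

Mention = Tuple[int, int]  # (start, end) token indices, standing in for a Span


@dataclass
class ToyCluster:
    """Hypothetical stand-in exposing the same attributes and methods as the table."""
    i: int                   # index of the cluster in the doc
    main: Mention            # most representative mention
    mentions: List[Mention]  # all mentions in the cluster

    def __getitem__(self, pos: int) -> Mention:
        return self.mentions[pos]

    def __iter__(self) -> Iterator[Mention]:
        return iter(self.mentions)

    def __len__(self) -> int:
        return len(self.mentions)


cluster = ToyCluster(i=0, main=(0, 2), mentions=[(0, 2), (6, 7), (8, 9)])
print(len(cluster))                                  # number of mentions
print(cluster[-1])                                   # last mention: (8, 9)
print([end - start for start, end in cluster])       # token length of each mention
```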


---

<h4>Parameters</h4>
You can pass several additional parameters to neuralcoref.add_to_pipe or NeuralCoref() to control the behavior of NeuralCoref.

<h4>Full list of parameters:</h4>
<table>
<tr> <th> S.No </th> <th> Parameter </th> <th> Type </th> <th> Description </th> </tr>
<tr> <td> 01 </td> <td> greedyness </td> <td> float</td> <td> A number between 0 and 1 determining how greedy the model is about making coreference decisions (more greedy means more coreference links). The default value is 0.5.  </td> </tr>
<tr> <td> 02 </td> <td> max_dist </td> <td> int </td> <td> How many mentions back to look when considering possible antecedents of the current mention. Decreasing the value will cause the system to run faster but less accurately. The default value is 50.  </td> </tr>
<tr> <td> 03 </td> <td> max_dist_match </td> <td> int</td> <td> The system will consider linking the current mention to a preceding one further than max_dist away if they share a noun or proper noun. In this case, it looks max_dist_match away instead. The default value is 500.  </td> </tr>
<tr> <td> 04 </td> <td> blacklist </td> <td> boolean</td> <td> Should the system resolve coreferences for pronouns in the following list: ["i", "me", "my", "you", "your"]. The default value is True (coreference resolved). </td> </tr>
<tr> <td> 05 </td> <td> store_scores </td> <td> boolean</td> <td> Should the system store the scores for the coreferences in annotations. The default value is True.  </td> </tr>
<tr> <td> 06 </td> <td> conv_dict </td> <td> dict(str, list(str))</td> <td> A conversion dictionary that you can use to replace the embeddings of rare words (keys) by an average of the embeddings of a list of common words (values). Ex: conv_dict={"Angela": ["woman", "girl"]} will help resolve coreferences for Angela by using the embeddings for the more common woman and girl instead of the embedding of Angela. This currently only works for single words (not for word groups).  </td> </tr>
</table>

In [14]:
#Three ways to change the parameters

#1. Remove the component and add it back with neuralcoref.add_to_pipe
nlp_eng_pipeline.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp_eng_pipeline, greedyness=0.5, max_dist=50,
                        max_dist_match=500, blacklist=True, store_scores=True,
                        conv_dict={'Angela': ['woman'], 'James': ['him']})

#2. Remove the component and add a manually built NeuralCoref instance
#nlp_eng_pipeline.remove_pipe("neuralcoref")  # Removes the current neuralcoref instance from the spaCy pipe
#coref = neuralcoref.NeuralCoref(nlp_eng_pipeline.vocab, greedyness=0.60)
#nlp_eng_pipeline.add_pipe(coref, name='neuralcoref')

#3. Modify the NeuralCoref component that is already in spaCy's pipe
#nlp_eng_pipeline.get_pipe('neuralcoref').set_conv_dict({'Angela': ['woman'], 'James': ['him']})

<spacy.lang.en.English at 0x234d52e41f0>

#1: Greedyness parameter: <br>
A number between 0 and 1 determining how greedy the model is about making coreference decisions (more greedy means more coreference links). The default value is 0.5.

In [15]:
#Changing the greedyness parameter
nlp_eng_pipeline.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp_eng_pipeline, greedyness=0.5) #Try:0.90, 0.30

Test_Sentence='Jane has a dog. She loves him.'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))


[Jane: [Jane, She], a dog: [a dog, him]]
Jane has a dog. Jane loves a dog.


#2: max_dist: <br>
How many mentions back to look when considering possible antecedents of the current mention. Decreasing the value will cause the system to run faster but less accurately. The default value is 50.

In [16]:
nlp_eng_pipeline.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp_eng_pipeline, max_dist=2) #Try: 2 , 3

Test_Sentence='Jane has a dog. She loves him, but the dog does not like her.'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))

[Jane: [Jane, She], a dog: [a dog, him]]
Jane has a dog. Jane loves a dog, but the dog does not like her.


#3. max_dist_match <br>
The system will consider linking the current mention to a preceding one further than max_dist away if they share a noun or proper noun. In this case, it looks max_dist_match away instead. The default value is 500.

In [17]:
nlp_eng_pipeline.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp_eng_pipeline, max_dist_match=10) 

Test_Sentence='Marry and her sister, Jane has a dog. Jane loves him, but the dog does not like her, he likes her sister.'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))

[Jane: [Jane, Jane, her, her], a dog: [a dog, him, the dog, he]]
Marry and her sister, Jane has a dog. Jane loves a dog, but a dog does not like Jane, a dog likes Jane sister.
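
The interplay of `max_dist` and `max_dist_match` can be pictured as two look-back windows over the list of mentions. The sketch below is an illustration of the description above, not the library's code: the function name `candidate_antecedents`, the mention dictionaries, and the exact distance boundaries (inclusive here) are assumptions.

```python
def candidate_antecedents(mentions, current, max_dist=50, max_dist_match=500):
    """Indices of earlier mentions considered as antecedents of mentions[current].

    Every mention within `max_dist` is a candidate. Mentions further back, up to
    `max_dist_match`, are candidates only if they share a noun or proper noun.
    """
    current_nouns = mentions[current]["nouns"]
    candidates = []
    for j in range(current - 1, -1, -1):
        distance = current - j
        if distance <= max_dist:
            candidates.append(j)
        elif distance <= max_dist_match and current_nouns & mentions[j]["nouns"]:
            candidates.append(j)
    return sorted(candidates)


# Hypothetical mention list; "nouns" holds the lowercased nouns of each mention.
mentions = [
    {"text": "Jane", "nouns": {"jane"}},
    {"text": "a dog", "nouns": {"dog"}},
    {"text": "She", "nouns": set()},
    {"text": "him", "nouns": set()},
    {"text": "the dog", "nouns": {"dog"}},
]

# "the dog" reaches back to "a dog" only through the shared noun "dog".
print(candidate_antecedents(mentions, 4, max_dist=2, max_dist_match=10))  # [1, 2, 3]
print(candidate_antecedents(mentions, 4, max_dist=2, max_dist_match=2))   # [2, 3]
```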


#4. Blacklist <br>
Should the system resolve coreferences for pronouns in the following list: ["i", "me", "my", "you", "your"]. The default value is True (coreference resolved).

In [18]:
nlp_eng_pipeline.remove_pipe("neuralcoref") # Remove the current neuralcoref instance from SpaCy pipe
neuralcoref.add_to_pipe(nlp_eng_pipeline, blacklist=True) #Try:False

Test_Sentence='John is an engineer. He loves his job, and he is very dedicated.'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))
print()

Test_Sentence='I am John and I am an engineer. My friend Mike and I work in Microsoft'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))

[John: [John, He, his, he]]
John is an engineer. John loves John job, and John is very dedicated.

[]
I am John and I am an engineer. My friend Mike and I work in Microsoft


#5. store_scores <br>
Should the system store the scores for the coreferences in annotations. The default value is True.

In [19]:
nlp_eng_pipeline.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp_eng_pipeline, store_scores=True) 

<spacy.lang.en.English at 0x234d52e41f0>

#6: Conversion dictionary parameter to help resolve rare words <br>

A conversion dictionary that you can use to replace the embeddings of rare words (keys) by an average of the embeddings of a list of common words (values). Ex: conv_dict={"Angela": ["woman", "girl"]} will help resolve coreferences for Angela by using the embeddings for the more common woman and girl instead of the embedding of Angela. This currently only works for single words (not for word groups).

In [20]:
Test_Sentence='Deepika has a dog. She loves him. The movie star has always been fond of animals'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))

[Deepika: [Deepika, She, him, The movie star]]
Deepika has a dog. Deepika loves Deepika. Deepika has always been fond of animals


In [21]:
nlp_eng_pipeline.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp_eng_pipeline, conv_dict={'Deepika': ['her'],'dog': ['him']}) #try: girl, her, actress

Test_Sentence='Deepika has a dog. She loves him. The movie star has always been fond of animals'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))

[Deepika: [Deepika, She, The movie star], a dog: [a dog, him]]
Deepika has a dog. Deepika loves a dog. Deepika has always been fond of animals
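
The averaging that `conv_dict` describes can be sketched with toy vectors. This is a simplified illustration, assuming embeddings stored as plain lists in a dict; the vectors are made up, and `apply_conv_dict` is a hypothetical helper rather than a NeuralCoref function.

```python
def apply_conv_dict(embeddings, conv_dict):
    """Return a copy of `embeddings` where each rare word (key of `conv_dict`) gets
    the element-wise mean of the embeddings of its listed common words."""
    updated = dict(embeddings)
    for rare_word, common_words in conv_dict.items():
        vectors = [embeddings[word] for word in common_words]
        updated[rare_word] = [sum(values) / len(vectors) for values in zip(*vectors)]
    return updated


# Made-up 2-dimensional embeddings, for illustration only.
embeddings = {
    "Angela": [2.0, -1.0],
    "woman": [0.25, 0.5],
    "girl": [0.75, 1.0],
}

updated = apply_conv_dict(embeddings, {"Angela": ["woman", "girl"]})
print(updated["Angela"])     # [0.5, 0.75], the mean of "woman" and "girl"
print(embeddings["Angela"])  # the original dict is left untouched
```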


---

<h3>Comparison</h3>

In [22]:
Test_Sentence='Although he was very busy with his work, Peter had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))

[he: [he, his], he was very busy with his work: [he was very busy with his work, it], Peter: [Peter, He, his], He and his wife: [He and his wife, they, They, they]]
Although he was very busy with he work, Peter had enough of he was very busy with his work. Peter and Peter wife decided He and his wife needed a holiday. He and his wife travelled to Spain because He and his wife loved the country very much.


In [23]:
Test_Sentence='The woman looked down and saw Lesley. She stood up and greeted her.'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))

[The woman: [The woman, She, her]]
The woman looked down and saw Lesley. The woman stood up and greeted The woman.


In [24]:
Test_Sentence='The woman looked down and saw Lesley. She stood up and greeted him.'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))

[The woman: [The woman, She]]
The woman looked down and saw Lesley. The woman stood up and greeted him.


In [25]:
Test_Sentence='David advised Jim that he must study'
print(str(nlp_eng_pipeline(Test_Sentence)._.coref_clusters) +"\n"+str(nlp_eng_pipeline(Test_Sentence)._.coref_resolved))

[David: [David, he]]
David advised Jim that David must study
