# Resolving coreference with neuralcoref

There are a few out-of-the-box libraries that support, or are specifically built for, coreference resolution. The most widely known are [CoreNLP](https://stanfordnlp.github.io/CoreNLP/coref.html), [Apache OpenNLP](https://opennlp.apache.org/) and [neuralcoref](https://github.com/huggingface/neuralcoref). In this short notebook, we will explore neuralcoref 3.0, a coreference resolution library by Hugging Face.

First, let's install neuralcoref 3.0. To do this, we need to downgrade spaCy slightly, because neuralcoref is not compatible with the newer cymem version used by current spaCy releases.

In [None]:
MODEL_URL = "https://github.com/huggingface/neuralcoref-models/releases/" \
            "download/en_coref_md-3.0.0/en_coref_md-3.0.0.tar.gz"

In [None]:
!pip install spacy==2.0.12

In [None]:
!pip install {MODEL_URL}

In [None]:
!python -m spacy download en_core_web_md

## A small neuralcoref tutorial

How does this lib work? Let's find out!

First, we need to load the model:

In [None]:
import en_coref_md

nlp = en_coref_md.load()

In [None]:
test_sent = '''
What are the main breeds of goat? Tell me about boer goats. What breed is good for meat? Are angora goats good for it? What about boer goats? What are pygmies used for? What is the best for fiber production? How long do Angora goats live? Can you milk them? How many can you have per acre? Are Angora goats profitable?
'''

In [None]:
test_list = test_sent.split(" ")
print(test_list)

Using neuralcoref is much like using plain spaCy: we call `nlp` on a string and get a `Doc` back.

In [None]:
doc = nlp(test_sent)

In [None]:
import json
import re
from dataclasses import dataclass
from IPython.core.display import display, HTML
import pandas as pd


@dataclass
class Question:
    questionId: int = 0
    title: str = ""
    questionText: str = ""


# Each input line is expected to look like "<title>:<question text>".
with open('../input/questions/only_questions.txt') as f:
    lines = [line.rstrip() for line in f]

questionList = []
headList = []
myDict = {}
for line in lines:
    # Split on the first colon only, so colons inside the question text survive.
    title, text = line.split(":", 1)
    ques = Question(title=title, questionText=text)  # one object per line
    questionList.append(ques)
    headList.append(ques.title)

    # Use a separate name so the `doc` built from test_sent above is kept.
    q_doc = nlp(ques.questionText)
    tokens = ques.questionText.split(" ")

    if q_doc._.has_coref:
        for cluster in q_doc._.coref_clusters:
            main_text = str(cluster.main)
            for mention in cluster.mentions:
                mention_text = str(mention)
                # Replace whitespace-delimited tokens equal to the mention text,
                # at most len(cluster.mentions) times per mention.
                # Multi-word mentions never match a single token.
                replaced = 0
                for n, token in enumerate(tokens):
                    if token == mention_text and replaced < len(cluster.mentions):
                        tokens[n] = main_text
                        replaced += 1

    # Split the resolved text back into individual questions.
    res = re.split(r'\?', ' '.join(tokens))
    myDict[ques.title] = [res]

print(myDict)

# Save the resolved questions as JSON.
with open('file2.txt', 'w') as file:
    file.write(json.dumps(myDict))

To check whether any coreference was detected, use the `has_coref` attribute, which neuralcoref exposes through spaCy's extension namespace (`_`):

In [None]:
doc._.has_coref

Great! We found something. Let's see what exactly.

You can get the entity and the coreferring pronouns from these clusters by simple indexing. The objects returned are in fact ordinary spaCy `Span`s.

In [None]:
doc._.coref_clusters
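
Because the mentions are spaCy spans, they carry character offsets (`start_char` and `end_char`), so the main mention can be substituted for every mention in a cluster, including multi-word ones. Below is a minimal sketch of that idea. It assumes the clusters have already been converted to plain tuples and that mentions do not overlap; the `Cluster` alias and `replace_mentions` are hypothetical names for illustration, not part of neuralcoref.

```python
from typing import List, Tuple

# Hypothetical plain-data stand-in for a neuralcoref cluster:
# (main mention text, [(start_char, end_char), ...] for every mention).
Cluster = Tuple[str, List[Tuple[int, int]]]


def replace_mentions(text: str, clusters: List[Cluster]) -> str:
    """Replace every mention span in `text` with its cluster's main mention.

    Assumes the spans do not overlap.
    """
    # Flatten to (start, end, replacement) triples.
    edits = [(start, end, main) for main, spans in clusters for start, end in spans]
    # Apply edits from right to left so earlier offsets stay valid.
    for start, end, main in sorted(edits, reverse=True):
        text = text[:start] + main + text[end:]
    return text


text = "How long do Angora goats live? Can you milk them?"
main_start = text.index("Angora goats")
pron_start = text.index("them")
clusters = [(
    "Angora goats",
    [(main_start, main_start + len("Angora goats")), (pron_start, pron_start + len("them"))],
)]
resolved = replace_mentions(text, clusters)
print(resolved)
```

With real neuralcoref output, the tuples could be built from each mention's `start_char` and `end_char` and from the text of the cluster's `main` span.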

## Deciding which entity the pronoun refers to

In the competition data, the positions of the entities and the pronoun are given as offsets from the beginning of the text. Let's write a small function that resolves coreference in a string and decides whether any of the detected coreferring entities correspond to the given offsets.
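
Before wiring this up to the model, the offset-matching step can be sketched on its own. The snippet below is an illustration under two assumptions: the offsets are character offsets, and each cluster has been reduced to a list of `(start_char, end_char)` pairs. The names `find_cluster` and `corefers`, and the example sentence, are made up for this sketch.

```python
from typing import List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def find_cluster(clusters: Sequence[Sequence[Span]], offset: int) -> Optional[int]:
    """Return the index of the first cluster with a mention covering `offset`."""
    for idx, mentions in enumerate(clusters):
        if any(start <= offset < end for start, end in mentions):
            return idx
    return None


def corefers(
    clusters: Sequence[Sequence[Span]],
    pronoun_offset: int,
    entity_offsets: Sequence[int],
) -> List[bool]:
    """For each entity offset, report whether it shares a cluster with the pronoun."""
    pronoun_cluster = find_cluster(clusters, pronoun_offset)
    if pronoun_cluster is None:
        # The pronoun was not resolved, so no entity can match it.
        return [False] * len(entity_offsets)
    return [find_cluster(clusters, off) == pronoun_cluster for off in entity_offsets]


text = "Alice told Beth that she had won."
alice, beth, she = text.index("Alice"), text.index("Beth"), text.index("she")
# Pretend the model linked "she" to "Alice".
clusters = [[(alice, alice + len("Alice")), (she, she + len("she"))]]
result = corefers(clusters, she, [alice, beth])
print(result)
```

Checking whether an offset falls inside a mention, rather than requiring an exact start match, tolerates cases where the detected span is wider than the annotated entity.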
