# Co-Reference Resolution
This notebook contains experiments on co-reference resolution with BERT.

The idea is, given a text, to perform the following tasks:
 - Find all the pronouns (to this list we could add words like "this", "that"...)
 - Find all proper nouns
 - For each pronoun, substitute it with a [MASK] token and let BERT try to predict it, having as options the proper nouns found before
 - Ideally, BERT should predict the right noun in place of the [MASK]
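
The masking step in the list above can be sketched without any model. This is only a minimal illustration: the helper name `mask_each_pronoun` and the hard-coded `PRONOUNS` set are assumptions standing in for a real part-of-speech tagger, and the input is assumed to be already tokenized.

```python
# Hypothetical stand-in for a POS tagger: a small, fixed set of pronouns.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def mask_each_pronoun(tokens, mask_token="[MASK]"):
    """Yield (position, masked_text) pairs with exactly one pronoun masked at a time."""
    for i, token in enumerate(tokens):
        if token.lower() in PRONOUNS:
            masked = tokens[:i] + [mask_token] + tokens[i + 1:]
            yield i, " ".join(masked)

tokens = "Ada wrote the first program . She published it in 1843 .".split()
masked_variants = list(mask_each_pronoun(tokens))
for position, masked_text in masked_variants:
    print(position, masked_text)
```

Each masked variant would then be passed to the language model separately, since only one token is predicted at a time.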

We use [SpaCy](https://spacy.io/) to find proper nouns and pronouns through part-of-speech tagging, and the [HappyTransformer](https://github.com/EricFillion/happy-transformer) library on top of [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) for the masked token prediction.

This approach is similar to the one used in [this paper](http://web.stanford.edu/class/cs224n/reports/custom/15735157.pdf) by Arthi Suresh.

## Named Entity Recognition - SpaCy

In [None]:
!pip install spacy



In [None]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
def recognize_nouns(sentence):
  doc = nlp(sentence)
  proper_nouns = []
  pronouns = []
  for token in doc:
    if token.pos_ == 'PROPN':
      proper_nouns.append(token.text)
    if token.pos_ == 'PRON':
      pronouns.append(token.text)
  return proper_nouns, pronouns

proper_nouns, pronouns = recognize_nouns("President Trump is the worst president in history. He handled the pandemic horribly.")
print("Proper nouns: ", proper_nouns)
print("\nPronouns: ", pronouns)

Proper nouns:  ['President', 'Trump']

Pronouns:  ['He']


## Masked word prediction
We will be using RoBERTa, a BERT variant pretrained on masked word prediction, which makes it a natural fit for filling in the [MASK] token.
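
Restricting the prediction to a list of options can be thought of as reading off the model's probabilities for those words only. The sketch below illustrates the idea with made-up logits over a four-word vocabulary. It assumes the scores come from a softmax over the whole vocabulary, and it is not a description of how HappyTransformer implements `predict_mask` internally; `rank_options` and `toy_logits` are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a dict of word -> logit."""
    peak = max(logits.values())
    exps = {word: math.exp(value - peak) for word, value in logits.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}

def rank_options(logits, options):
    """Return the options sorted by their full-vocabulary softmax probability."""
    probs = softmax(logits)
    scored = [{"word": w, "softmax": probs[w]} for w in options if w in probs]
    return sorted(scored, key=lambda item: item["softmax"], reverse=True)

# Made-up logits for the masked position; a real vocabulary is far larger.
toy_logits = {"He": 5.0, "Trump": 3.0, "President": -1.0, "She": 0.5}
ranking = rank_options(toy_logits, options=["President", "Trump"])
print(ranking)
```

Under this assumption the scores of the options do not sum to 1, because words outside the option list (here the pronoun itself) keep most of the probability mass.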

In [None]:
!pip install happytransformer

Collecting happytransformer
  Downloading https://files.pythonhosted.org/packages/e0/df/644b9878de6f2477813723d44ee6a32b9d25d88b3c80ddcad1492c9268c4/happytransformer-1.1.2-py3-none-any.whl
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/2c/4e/4f1ede0fd7a36278844a277f8d53c21f88f37f3754abf76a5d6224f76d4a/transformers-3.4.0-py3-none-any.whl (1.3MB)
Collecting tokenizers==0.9.2
  Downloading https://files.pythonhosted.org/packages/7c/a5/78be1a55b2ac8d6a956f0a211d372726e2b1dd2666bb537fea9b03abd62c/tokenizers-0.9.2-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/e5/2d/6d4ca4bef9a67070fa1cac508606328329152b1df10bdf31fb6e4e727894/sentencepiece-0.1.94-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)

In [None]:
from happytransformer import HappyROBERTA

happy_roberta = HappyROBERTA("roberta-large")

11/09/2020 15:33:05 - INFO - happytransformer.happy_transformer -   Using model: cpu

In [None]:
text = 'President Trump is the worst president in history. [MASK] handled the pandemic horribly.'
options = ["President", "Trump"]
results = happy_roberta.predict_mask(text, options=options, num_results=2)


Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
results

[{'softmax': 0.08047983795404434, 'word': 'Trump'},
 {'softmax': 0.00018230581190437078, 'word': 'President'}]
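
Both scores above are small in absolute terms, so it can help to compare the candidates against each other rather than against the whole vocabulary. The following is a minimal sketch of that rescaling; `renormalize` is a hypothetical helper, not part of HappyTransformer, and the scores are copied from the output above.

```python
def renormalize(results):
    """Rescale 'softmax' scores so that they sum to 1 across the candidates."""
    total = sum(item["softmax"] for item in results)
    if total == 0:
        return [dict(item, relative=0.0) for item in results]
    return [dict(item, relative=item["softmax"] / total) for item in results]

# Scores copied from the prediction output above.
results = [
    {"softmax": 0.08047983795404434, "word": "Trump"},
    {"softmax": 0.00018230581190437078, "word": "President"},
]
relative = renormalize(results)
for item in relative:
    print(item["word"], round(item["relative"], 4))
```

The relative score says how strongly the model prefers one candidate over the others, while the raw score says how plausible the candidate is compared with every other word, which is what a confidence threshold on `softmax` measures.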

## Wrap-up

In [None]:
import numpy as np

In [None]:
happy_roberta = HappyROBERTA("roberta-large")

#Returns:
# - list of tokens forming the sentence
# - list of proper nouns found
# - list of token positions where there are pronouns
def recognize_nouns(sentence):
  doc = nlp(sentence)
  proper_nouns = []
  pronouns = []
  tokens = []
  for i, token in enumerate(doc):
    if token.pos_ == 'PROPN':
      proper_nouns.append(token.text)
    if token.pos_ == 'PRON':
      pronouns.append(i)
    tokens.append(token.text)
  return tokens, proper_nouns, np.array(pronouns)


#Receives:
# - text_: the entire text
# - sentence: the sentence you want to focus on (just make it equal to text_ if you don't want to focus on a particular sentence)
# - threshold: minimum confidence required in a prediction (if the model's best prediction falls below it, the pronoun is left unchanged)
def link_entities(text_, sentence, threshold=0.0001):

  #We will get the tokens and proper nouns of the text before the sentence, of the sentence itself, and of the text after it
  words_0, proper_nouns_0, _ = recognize_nouns(text_.split(sentence)[0])
  words_1, proper_nouns_1, pronouns = recognize_nouns(sentence)
  words_2, proper_nouns_2, _ = recognize_nouns(text_.split(sentence)[1])
  words = words_0 + words_1 + words_2
  proper_nouns = proper_nouns_0 + proper_nouns_1 + proper_nouns_2
  pronouns += len(words_0)
  
  #If there is no pronoun to resolve or no proper noun to choose from, just return the text as it is
  if len(pronouns) == 0 or len(proper_nouns) == 0:
    return text_

  #The mask token prediction can only be made for one token at a time
  for pronoun in pronouns:

    #Insert [MASK] token in place of the pronoun (the slice after the pronoun is simply empty when it is the last token)
    text = " ".join(words[:pronoun] + ['[MASK]'] + words[pronoun+1:])

    #Predict the [MASK] token, using the proper nouns found in the text as options
    results = happy_roberta.predict_mask(text, options=proper_nouns, num_results=len(proper_nouns))
    if results[0]['softmax'] >= threshold:
      words_1[pronoun - len(words_0)] = results[0]['word']

  return " ".join(words_1)

text = "President Trump is the worst president in history. He handled the pandemic horribly."
sentence = "He handled the pandemic horribly"
print(link_entities(text, sentence))
  


11/09/2020 15:33:53 - INFO - happytransformer.happy_transformer -   Using model: cpu
Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-large and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trump handled the pandemic horribly
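
Because `link_entities` rebuilds the sentence with `" ".join(words_1)`, every token is separated by a space, so punctuation ends up detached from the word before it. A possible way around this, sketched below, is to keep each token's trailing whitespace and join on that instead. The pairs are handwritten here; with SpaCy they could be taken from `token.text` and `token.whitespace_`. The helper name `rebuild_text` is hypothetical.

```python
def rebuild_text(pieces, replacements):
    """Join (text, trailing_whitespace) pairs, swapping tokens by position."""
    out = []
    for i, (text, trailing_ws) in enumerate(pieces):
        out.append(replacements.get(i, text) + trailing_ws)
    return "".join(out)

# Handwritten pairs; with SpaCy they could come from
# [(t.text, t.whitespace_) for t in doc].
pieces = [("He", " "), ("handled", " "), ("the", " "), ("pandemic", " "),
          ("horribly", ""), (".", "")]
resolved = rebuild_text(pieces, {0: "Trump"})
print(resolved)
```

With an empty `replacements` dictionary the function reproduces the original string exactly, which makes it easy to check that nothing but the pronoun changed.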


## Testing on GAP dataset

In [None]:
!wget https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-test.tsv

--2020-11-09 15:34:11--  https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-test.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1075889 (1.0M) [text/plain]
Saving to: ‘gap-test.tsv’


2020-11-09 15:34:12 (7.45 MB/s) - ‘gap-test.tsv’ saved [1075889/1075889]



In [None]:
df = pd.read_csv('gap-test.tsv', sep='\t')

In [None]:
df.head()

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
0,test-1,Upon their acceptance into the Kontinental Hoc...,His,383,Bob Suter,352,False,Dehner,366,True,http://en.wikipedia.org/wiki/Jeremy_Dehner
1,test-2,"Between the years 1979-1981, River won four lo...",him,430,Alonso,353,True,Alfredo Di St*fano,390,False,http://en.wikipedia.org/wiki/Norberto_Alonso
2,test-3,Though his emigration from the country has aff...,He,312,Ali Aladhadh,256,True,Saddam,295,False,http://en.wikipedia.org/wiki/Aladhadh
3,test-4,"At the trial, Pisciotta said: ``Those who have...",his,526,Alliata,377,False,Pisciotta,536,True,http://en.wikipedia.org/wiki/Gaspare_Pisciotta
4,test-5,It is about a pair of United States Navy shore...,his,406,Eddie,421,True,Rock Reilly,559,False,http://en.wikipedia.org/wiki/Chasers


In [None]:
df['Text'][:1].tolist()

["Though his emigration from the country has affected his leadership status, Kamel is still a respected elder of the clan. After the fall of Hussien's regime, many considered Dr. Ali Aladhadh a candidate to lead the clan. A contributor to Iraq's liberation, Ali Aladhadh and a long time oppose to Saddam's regime. He was ambushed with his pregnant wife on his way to the hospital in 2006 by Iraqi insurgents."]

In [None]:
first_text = df['Text'].iloc[0]
link_entities(first_text, first_text)

"Upon their acceptance into the Kontinental Hockey League , Dehner left Finland to sign a contract in Germany with EHC M*nchen of the DEL on June 18 , 2014 . After capturing the German championship with the M*nchen team in 2016 , he left the club and was picked up by fellow DEL side EHC Wolfsburg in July 2016 . Former NHLer Gary Suter and Olympic - medalist Bob Suter are Dehner 's uncles . His cousin is Minnesota Wild 's alternate captain Ryan Suter ."

In [None]:
df = df[df['Pronoun'].isin(['He', 'She'])]

In [None]:
len(df)

286

In [None]:
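def test_performances(row):

  #Replace the pronoun with [MASK], using its actual length ('He' has 2 characters, 'She' has 3)
  text = row['Text']
  start = row['Pronoun-offset']
  end = start + len(row['Pronoun'])
  text = text[:start] + "[MASK]" + text[end:]

  #Predict the [MASK] token, choosing between the two candidate names
  candidates = [row['A'], row['B']]
  results = happy_roberta.predict_mask(text, options=candidates, num_results=len(candidates))
  if results[0]['softmax'] >= 1e-10:
    return results[0]['word']

  return 'None'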

  

In [None]:
df['result'] = df.apply(test_performances, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [None]:
df['is_A'] = df['result'] == df['A']



In [None]:
df_pred = df[df['result'] != 'None']

In [None]:
from sklearn.metrics import classification_report

print(classification_report(df_pred['A-coref'], df_pred['is_A']))

              precision    recall  f1-score   support

       False       0.66      0.30      0.41        84
        True       0.76      0.94      0.84       202

    accuracy                           0.75       286
   macro avg       0.71      0.62      0.62       286
weighted avg       0.73      0.75      0.71       286
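
To put the accuracy in context, it can be compared with a trivial baseline that always predicts the more frequent label. The sketch below only reuses the support counts and the accuracy printed in the report above; the variable names are illustrative.

```python
# Support counts and accuracy copied from the classification report above.
support = {"False": 84, "True": 202}
reported_accuracy = 0.75

total = sum(support.values())
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

print(f"always predicting {majority_label}: {baseline_accuracy:.3f}")
print(f"gain over baseline: {reported_accuracy - baseline_accuracy:.3f}")
```

Since 202 of the 286 rows have `A-coref` set to True, always answering True already reaches an accuracy of about 0.71, so the reported 0.75 is a modest improvement. The low recall on the False class (0.30) points the same way.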



In [None]:
len(df[df['result'] == 'None'])

0