#**Coreference Resolution**




In this lab, we are going to build a coreference system based on the mention-ranking algorithm proposed by Lee et al. (2017). You will get part of the code required to build the system, and you are required to fill in three code blocks. Hints will be provided to guide you through.

The first part of the notebook will show how to apply coreference resolution to English using a few examples. Then you have to apply that to a real dataset.  

In total, you will be given two Python files (*.py), three JSON files (*.jsonl) and one embedding file (*.txt):

*   **metrics.py** is used to compute the CoNLL scores; you don’t need to change it.
*   **[train/test/dev].jsonl** are the training, testing and development sets used for training and evaluating the model; they are ready to use.
*   **word_embeddings.filtered.txt** contains pre-trained 300-dimensional GloVe word embeddings. The original file is large, so we’ve removed all the words that do not appear in the datasets to make it much smaller.

These files are contained in the folder coreference_lab_files provided with the lab.



##**1. Download the datasets, word embeddings and coreference metrics**

In [73]:
!wget 'https://github.com/juntaoy/ECS7001_LAB_DATASETS/raw/refs/heads/main/Coref_data.zip'
!unzip 'Coref_data.zip' -x __MACOSX/*

--2025-04-21 20:32:16--  https://github.com/juntaoy/ECS7001_LAB_DATASETS/raw/refs/heads/main/Coref_data.zip
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/juntaoy/ECS7001_LAB_DATASETS/refs/heads/main/Coref_data.zip [following]
--2025-04-21 20:32:16--  https://raw.githubusercontent.com/juntaoy/ECS7001_LAB_DATASETS/refs/heads/main/Coref_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19740565 (19M) [application/zip]
Saving to: ‘Coref_data.zip.3’


2025-04-21 20:32:17 (448 MB/s) - ‘Coref_data.zip.3’ saved [19740565/19740565]

Archive:  Coref_data.zip
replace dev.jsonl? [y]es, [n]o, [A

Now we specify the paths to our dev/test/train files and our filtered embeddings.


In [106]:
DEV_PATH =  'dev.jsonl'
TEST_PATH = 'test.jsonl'
TRAIN_PATH =  'train.jsonl'

EMBEDDING_PATH = 'word_embeddings.filtered.txt'


##**2. Import files**

We can now import metrics.py along with the other Python modules.

In [107]:
%%capture

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import numpy as np
import tensorflow as tf
import random
import json,time,collections,random, metrics

#seed everything
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)

## **3. Creating an embedding dictionary**

Using the embedding file, we will create an embedding dictionary for easy access while preparing our data.

In [108]:
# the dimension of the pretrained embeddings
EMBEDDING_SIZE = 300

In [109]:
def load_embeddings(embedding_path=EMBEDDING_PATH, embedding_size=EMBEDDING_SIZE):
    print("Loading word embeddings from {}...".format(embedding_path))
    embeddings = collections.defaultdict(lambda: np.zeros(embedding_size))
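    # open the file in a context manager so it is closed after reading;
    # UTF-8 is assumed so that non-ASCII (e.g. Arabic) tokens decode consistently
    with open(embedding_path, encoding='utf-8') as f:
        for line in f:
            # the first space separates the word from its vector
            splitter = line.find(' ')
            emb = np.fromstring(line[splitter + 1:], np.float32, sep=' ')
            assert len(emb) == embedding_size
            embeddings[line[:splitter]] = emb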
    print("Finished loading word embeddings")
    print("Number of words: " + str(len(embeddings)))
    return embeddings

In [110]:
EMBEDDING_DICT = load_embeddings()

Loading word embeddings from word_embeddings.filtered.txt...
Finished loading word embeddings
Number of words: 19231
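Because `load_embeddings` returns a `defaultdict`, looking up a word that is missing from the embedding file does not raise a `KeyError`; it yields an all-zero vector and, as a side effect, adds that word to the dictionary. The following is a minimal sketch of that behaviour on a toy 4-dimensional table. The names `TOY_SIZE` and `toy_embeddings` are made up for this illustration and are not part of the lab code.

```python
import collections
import numpy as np

TOY_SIZE = 4  # toy dimension for illustration only; the lab uses 300

# same construction as in load_embeddings: missing keys fall back to a zero vector
toy_embeddings = collections.defaultdict(lambda: np.zeros(TOY_SIZE))
toy_embeddings['cat'] = np.ones(TOY_SIZE, dtype=np.float32)

known = toy_embeddings['cat']
unknown = toy_embeddings['zebra']  # not in the table, so a zero vector is created

print(known)
print(unknown)
print(len(toy_embeddings))  # the lookup of 'zebra' has inserted it into the table
```

This is why the word count printed by `load_embeddings` can grow later on, once documents containing out-of-vocabulary tokens have been embedded.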


##**4. Preparing Documents for Coreference**

In this section, we will show how to prepare the dataset for coreference resolution using a few examples in English. Then you will have to prepare the Arabic dataset in the JSON files.
<br>

Each line in a given JSON file contains the information for a single document. The “doc_key” element stores the name of the document; the “sentences” element holds the tokenized sentences of the document; the “clusters” element stores the coreference clusters. Each cluster contains a number of mentions, and each mention is encoded by a start and an end index, denoting the positions of its first and last tokens within the document. Both indices are inclusive and count tokens from the start of the document, not from the start of the sentence.
As an illustration, consider the dummy dataset in the block of code below containing one document. Run the code cell to see the clusters of coreferent mentions.


In [111]:
dummy_dataset = [{'doc_key': 'large_cat',
                  'sentences':[['The', 'large', 'cat', 'yawned', '.'],
                               ['He','was', 'very', 'hungry', 'as', 'he', 'had', 'not', 'eaten', 'since', 'breakfast','.'],
                               ['An', 'unfortunate', 'rat', 'came', 'along', '.'],
                               ['The', 'cat', 'gobbled', 'him', 'up', '.']],
                  'clusters': [[[0, 2], [5, 5], [10, 10], [23, 24]], [[17, 19], [26, 26]]]
                }]


sents = [w for sent in dummy_dataset[0]['sentences'] for w in sent]
print('These are the clusters in %s' %dummy_dataset[0]['doc_key'])
for cl_idx, cl in enumerate(dummy_dataset[0]['clusters']):
    print('Cluster ' + str(cl_idx) + ':', [' '.join(sents[s: e+1])  for s, e in cl])

These are the clusters in large_cat
Cluster 0: ['The large cat', 'He', 'he', 'The cat']
Cluster 1: ['An unfortunate rat', 'him']


To prepare each dataset for the coreference resolution model, we will need to create the following variables from each document:

1.   Embedded Sentences: A 1 X num_sents X num_words X embedding_size array for each document.
2.   Mention Pairs: A 1 X num_pairs X 4 array whose rows have the form [anaphor_start, anaphor_end, antecedent_start, antecedent_end].
3.   Mention Pair Labels: A 1 X num_pairs array containing the corresponding label for each mention pair (i.e. 1 if the pair of mentions is coreferent, 0 otherwise).

The functions in the subsections below contain code for extracting these variables. Study them and test their functionality using the dummy dataset.
In section 4.4, you'll use these functions to create the dev, test and train datasets.

###**4.1 Getting the mentions from the clusters**

The following block of code gets the mentions from the clusters of a document.

In [112]:
def get_mentions(clusters):

    # get a list of mentions (as tuples) sorted by start indices.
    gold_mentions = sorted([tuple(m) for cl in clusters for m in cl])

    # number of mentions
    num_mentions = len(gold_mentions)

    # assign unique indices to each mention in the mention list based on its position in the list
    gold_mention_map = {m: i for i, m in enumerate(gold_mentions)}

    # assign a cluster id to each mention, in mention order. E.g. cluster_ids = [4, 11, 5, 4, ...] => mention 0 is
    # in cluster 4, along with mention 3.
    cluster_ids = [0]*num_mentions
    for cid, cluster in enumerate(clusters):
        for mention in cluster:
            cluster_ids[gold_mention_map[tuple(mention)]] = cid

    return gold_mentions, gold_mention_map, cluster_ids, num_mentions


In [113]:
dmentions, dment_map, dcluster_ids, dnum_mentions = get_mentions(dummy_dataset[0]['clusters'])
print('These are all the coreferent mentions in the sample document ', dmentions)
print('These are the mentions mapped to unique ids denoting their order in the document ', dment_map)
print('These are the cluster ids of the ordered mentions', dcluster_ids)
print('There are %d mentions in the document titled \'%s\'' %(dnum_mentions, dummy_dataset[0]['doc_key']))

These are all the coreferent mentions in the sample document  [(0, 2), (5, 5), (10, 10), (17, 19), (23, 24), (26, 26)]
These are the mentions mapped to unique ids denoting their order in the document  {(0, 2): 0, (5, 5): 1, (10, 10): 2, (17, 19): 3, (23, 24): 4, (26, 26): 5}
These are the cluster ids of the ordered mentions [0, 0, 0, 1, 0, 1]
There are 6 mentions in the document titled 'large_cat'


###**4.2 Turning the sentences into embeddings and the mention indices into vectors**

Using the next block of code, you can generate the padded document embeddings and a copy of the mention start and end indices adjusted for padding.

In [114]:


def tensorize_doc_sentences(sentences, mentions):
    starts, ends = [],[]
    sent_lengths = [len(sent) for sent in sentences]  # the actual, unpadded length of each sentence
    max_sent_length = max(sent_lengths)

    # by padding each sentence to the maximum length, the embedded document gains a new dimension
    embedded_sentences = np.zeros([1, len(sentences), max_sent_length, EMBEDDING_SIZE])

    # in this block, we adjust the mention indices to reflect the added padding.
    sent_start = 0  # index of the first token of the current sentence in the unpadded document
    offset = 0      # total number of padding positions added before the current sentence
    for i, sent in enumerate(sentences):
        for m_start, m_end in mentions:
            # keep only the mentions that lie entirely within the current sentence
            if sent_start <= m_start and m_end < sent_start + len(sent):
                starts.append(m_start + offset)
                ends.append(m_end + offset)
        sent_start += len(sent)
        offset += max_sent_length - len(sent)

        # Populate the embedding tensor with the appropriate word embeddings.
        for j, word in enumerate(sent):
            embedded_sentences[0, i, j] = EMBEDDING_DICT[word]


    return embedded_sentences, starts, ends

In [115]:
dsents_embedded, dstarts, dends = tensorize_doc_sentences(dummy_dataset[0]['sentences'], dmentions)

In [116]:
print('%d document with %d sentences, each with a maximum of %d words, encoded as %d dimensional vectors' %(dsents_embedded.shape[0], dsents_embedded.shape[1], dsents_embedded.shape[2], dsents_embedded.shape[3]))
print('Mention starts: ', dstarts)
print('Mention ends: ', dends)

1 document with 4 sentences, each with a maximum of 12 words, encoded as 300 dimensional vectors
Mention starts:  [0, 12, 17, 24, 36, 39]
Mention ends:  [2, 12, 17, 26, 37, 39]
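The adjusted indices follow from simple arithmetic: every sentence before the one containing the token occupies `max_sent_length` positions in the padded document, so a token at position `p` of sentence `k` ends up at `k * max_sent_length + p`. The sketch below recomputes the values printed above for the dummy document. The helper `to_padded_index` is a hypothetical name introduced only for this illustration; it is not used elsewhere in the lab.

```python
def to_padded_index(raw_index, sent_lengths):
    """Map a document-level token index to its index in the padded document."""
    max_len = max(sent_lengths)
    sent_start = 0
    for sent_id, length in enumerate(sent_lengths):
        if raw_index < sent_start + length:
            # each earlier sentence occupies max_len positions once padded
            return sent_id * max_len + (raw_index - sent_start)
        sent_start += length
    raise IndexError("token index %d is outside the document" % raw_index)

# sentence lengths and gold mentions of the 'large_cat' dummy document
sent_lengths = [5, 12, 6, 6]
raw_mentions = [(0, 2), (5, 5), (10, 10), (17, 19), (23, 24), (26, 26)]

padded_starts = [to_padded_index(s, sent_lengths) for s, _ in raw_mentions]
padded_ends = [to_padded_index(e, sent_lengths) for _, e in raw_mentions]
print(padded_starts)  # [0, 12, 17, 24, 36, 39]
print(padded_ends)    # [2, 12, 17, 26, 37, 39]
```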


###**4.3. Generating Mention Pairs**

This next function generates the example pairs for training or evaluation. For each mention (anaphor), the candidate antecedents are the mentions preceding it.

<br>

Here, during training we choose up to 250 antecedents (i.e. MAX_ANT = 250) and maintain a 2:1 negative-to-positive example ratio (i.e. NEG_RATIO = 2). Choosing this ratio can be challenging: you want ample examples to learn from, but you do not want the positive examples to be overshadowed by the negative ones.

<br>

At test time, we generate up to MAX_ANT examples without paying attention to the example ratio. We also do not generate training labels for the pairs.
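The ratio is enforced with a running budget rather than by sampling: each anaphor starts with room for one negative pair, and every positive pair adds room for `neg_ratio` more. The sketch below isolates that bookkeeping for a single anaphor; `keep_with_budget` is a hypothetical helper written for this illustration, mirroring the training branch of `generate_pairs` under the assumption that candidates are visited in document order.

```python
def keep_with_budget(labels, neg_ratio):
    """Return the indices of the candidate pairs that would be kept for one anaphor."""
    kept = []
    budget = 1  # one negative is allowed before any positive has been seen
    for idx, is_coreferent in enumerate(labels):
        if is_coreferent:
            budget += neg_ratio  # every positive buys neg_ratio further negatives
            kept.append(idx)
        elif budget > 0:
            budget -= 1
            kept.append(idx)
    return kept

# an anaphor with no coreferent antecedent keeps only its first negative
print(keep_with_budget([0, 0, 0], neg_ratio=2))        # [0]
# one positive at index 1 pays for two of the negatives that follow it
print(keep_with_budget([0, 1, 0, 0, 0], neg_ratio=2))  # [0, 1, 2, 3]
```

One consequence of this design is that which negatives are kept depends on their position: negatives that come before the first positive are dropped once the initial budget is used up.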

In [181]:
# the maximum number of candidate antecedents we will give to each of the candidate mentions.
MAX_ANT = 250

# the ratio of negative to positive examples
NEG_RATIO = 2



Study the function below and see the sample outputs.

In [182]:
def generate_pairs(num_mentions, cluster_ids, starts, ends, raw_starts, raw_ends, is_training, neg_ratio=NEG_RATIO, max_ant=MAX_ANT):
    mention_pairs = [[]]
    mention_pair_labels = [[]]
    raw_mention_pairs = []

    # for the training set, we want labels. We also want to pay heed to the positive:negative example ratio
    if is_training:
        for ana in range(num_mentions):
            pos = 1
            # each anaphor must not have more than MAX_ANT candidate antecedents
            s = 0 if ana < max_ant else (ana - max_ant)
            for ant in range(s, ana):
                # two mentions are coreferent if they are in the same cluster
                l = cluster_ids[ana] == cluster_ids[ant]
                # if it's a positive example, add it
                if l:
                    pos += neg_ratio
                    mention_pairs[0].append([starts[ana],ends[ana],starts[ant],ends[ant]])
                    mention_pair_labels[0].append(1)
                # if it's a negative example, check that the budget of neg_ratio negatives
                # per positive example has not been used up before adding it
                elif pos > 0:
                    pos -=1
                    mention_pairs[0].append([starts[ana],ends[ana],starts[ant],ends[ant]])
                    mention_pair_labels[0].append(0)

    # for the test set, add the pairs without balancing or labels
    else:
        for ana in range(num_mentions):
            s = 0 if ana < max_ant else (ana - max_ant)
            for ant in range(s,ana):
                mention_pairs[0].append([starts[ana], ends[ana], starts[ant], ends[ant]])
                # here we also add the original mention indices for unpadded evaluation.
                raw_mention_pairs.append([(raw_starts[ana], raw_ends[ana]), (raw_starts[ant], raw_ends[ant])])


    return mention_pairs, mention_pair_labels, raw_mention_pairs

In [183]:
# A sample for training: a maximum of 4 antecedents per mention and a 1:1 negative:positive example ratio (neg_ratio=1). No need to save the raw starts/ends
dmpairs, dpair_labels, draw_pairs = generate_pairs(dnum_mentions, dcluster_ids, dstarts, dends, None, None, True, 1, 4)

from tabulate import tabulate
print(tabulate(zip(dmpairs[0], dpair_labels[0]), headers=['Ana_Ant pair', 'Pair label (padded)', ]))

Ana_Ant pair        Pair label (padded)
----------------  ---------------------
[12, 12, 0, 2]                        1
[17, 17, 0, 2]                        1
[17, 17, 12, 12]                      1
[24, 26, 0, 2]                        0
[36, 37, 0, 2]                        1
[36, 37, 12, 12]                      1
[36, 37, 17, 17]                      1
[36, 37, 24, 26]                      0
[39, 39, 12, 12]                      0
[39, 39, 24, 26]                      1
[39, 39, 36, 37]                      0


In [184]:
# A sample for evaluation. No labels necessary. Here we pair each mention with all its antecedents
draw_starts, draw_ends = zip(*dmentions)
dmpairs, dpair_labels, draw_pairs = generate_pairs(dnum_mentions, dcluster_ids, dstarts, dends, draw_starts, draw_ends, False)

from tabulate import tabulate
print(tabulate(zip(draw_pairs, dmpairs[0]), headers=['Ana_Ant pair (unpadded)', 'Ana_Ant pair (padded)', ]))

Ana_Ant pair (unpadded)    Ana_Ant pair (padded)
-------------------------  -----------------------
[(5, 5), (0, 2)]           [12, 12, 0, 2]
[(10, 10), (0, 2)]         [17, 17, 0, 2]
[(10, 10), (5, 5)]         [17, 17, 12, 12]
[(17, 19), (0, 2)]         [24, 26, 0, 2]
[(17, 19), (5, 5)]         [24, 26, 12, 12]
[(17, 19), (10, 10)]       [24, 26, 17, 17]
[(23, 24), (0, 2)]         [36, 37, 0, 2]
[(23, 24), (5, 5)]         [36, 37, 12, 12]
[(23, 24), (10, 10)]       [36, 37, 17, 17]
[(23, 24), (17, 19)]       [36, 37, 24, 26]
[(26, 26), (0, 2)]         [39, 39, 0, 2]
[(26, 26), (5, 5)]         [39, 39, 12, 12]
[(26, 26), (10, 10)]       [39, 39, 17, 17]
[(26, 26), (17, 19)]       [39, 39, 24, 26]
[(26, 26), (23, 24)]       [39, 39, 36, 37]


### **4.4. Preprocessing and loading the dataset**

Now, you will prepare the dataset for the coreference resolution model. Preprocessing is an important step and depends on the target language. For Arabic, removing diacritics (accent-like marks written above or below certain letters) may improve the overall performance.
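As a small worked example of what diacritic removal does, the sketch below strips the marks from one fully vocalised Arabic word. The character ranges are the ones used in this lab; the sample word and the names `DIACRITICS`, `vocalised` and `plain` are chosen only for this illustration.

```python
import re

# the character ranges used in this lab for Arabic diacritic marks
DIACRITICS = re.compile(r'[\u0617-\u061A\u064B-\u0652]')

# the name "Muhammad" written with its diacritics: 4 letters plus 4 marks
vocalised = '\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F'
plain = DIACRITICS.sub('', vocalised)

print(len(vocalised), len(plain))           # 8 4
print(plain == '\u0645\u062D\u0645\u062F')  # True: only the 4 base letters remain
```

Since pre-trained embeddings are keyed by surface form, mapping vocalised and unvocalised spellings to the same string can reduce the number of tokens that fall back to the zero vector.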

In [185]:
import re, json

def preprocess_arabic_text(text):
  # diacritic code points are matched with a regular expression
  diacritics_unicode = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
  # the diacritics are then removed
  text = re.sub(diacritics_unicode, "", text)
  return text

def get_data(json_file, is_training, preprocess_text):
    processed_docs = []

    for line in open(json_file):

      # read the document in
      doc = json.loads(line)

      clusters = doc['clusters']
      sentences = doc['sentences']

      # skip documents that contain no coreferent mentions
      if len(clusters) == 0:
          continue

      # optionally strip the diacritics from every token
      if preprocess_text:
          doc['sentences'] = [[preprocess_arabic_text(t) for t in sent] for sent in sentences]

      """
      Task 1

      Begin
      """

      #  get the mentions and their cluster information.
      gold_mentions, gold_mention_map, cluster_ids, num_mentions = get_mentions(clusters)

      # splits the mentions into two arrays, one representing the start indices,
      # and the other for the end indices
      raw_starts, raw_ends = zip(*gold_mentions)

      # pad sentences, create glove sentence embeddings, create mention starts and ends for padded document
      word_emb, starts, ends = tensorize_doc_sentences(doc['sentences'], gold_mentions)

      # generate (anaphor, antecedent) pairs and their labels
      mention_pairs, mention_pair_labels, raw_mention_pairs = generate_pairs(num_mentions, cluster_ids, starts, ends, raw_starts, raw_ends, is_training)
      mention_pairs, mention_pair_labels = np.array(mention_pairs),np.array(mention_pair_labels)

      # add the processed document to the list
      processed_docs.append((word_emb[0], mention_pairs[0], mention_pair_labels[0], clusters, raw_mention_pairs))
      """
      End Task 1
      """
    return processed_docs

In [186]:
DEV_DATA = get_data(DEV_PATH, False, True)
TEST_DATA = get_data(TEST_PATH, False, True)
TRAIN_DATA = get_data(TRAIN_PATH, True, True)

In [187]:
print(TEST_DATA[3][4])

[[(3, 4), (0, 2)], [(6, 8), (0, 2)], [(6, 8), (3, 4)], [(9, 10), (0, 2)], [(9, 10), (3, 4)], [(9, 10), (6, 8)], [(11, 12), (0, 2)], [(11, 12), (3, 4)], [(11, 12), (6, 8)], [(11, 12), (9, 10)], [(13, 18), (0, 2)], [(13, 18), (3, 4)], [(13, 18), (6, 8)], [(13, 18), (9, 10)], [(13, 18), (11, 12)], [(18, 18), (0, 2)], [(18, 18), (3, 4)], [(18, 18), (6, 8)], [(18, 18), (9, 10)], [(18, 18), (11, 12)], [(18, 18), (13, 18)], [(19, 22), (0, 2)], [(19, 22), (3, 4)], [(19, 22), (6, 8)], [(19, 22), (9, 10)], [(19, 22), (11, 12)], [(19, 22), (13, 18)], [(19, 22), (18, 18)], [(21, 22), (0, 2)], [(21, 22), (3, 4)], [(21, 22), (6, 8)], [(21, 22), (9, 10)], [(21, 22), (11, 12)], [(21, 22), (13, 18)], [(21, 22), (18, 18)], [(21, 22), (19, 22)], [(23, 26), (0, 2)], [(23, 26), (3, 4)], [(23, 26), (6, 8)], [(23, 26), (9, 10)], [(23, 26), (11, 12)], [(23, 26), (13, 18)], [(23, 26), (18, 18)], [(23, 26), (19, 22)], [(23, 26), (21, 22)], [(25, 26), (0, 2)], [(25, 26), (3, 4)], [(25, 26), (6, 8)], [(25, 26), (

##**5. Building the Coreference Model**

In this section, we will build the coreference resolution model. There are many ways to learn coreference; in this lab, we will be building a simplified version of a mention-pair classification model.

<br>

Given a pair of mentions, (anaphor, antecedent), a mention pair classifier produces a single score between 0 and 1, representing the probability that the given pair is coreferent. We will use PyTorch to take in the processed data we prepared in section 4 and produce mention pair scores for the given pairs.

###**5.1 First, we will initialize model parameters.**

In [188]:
# the dimension of the pretrained embeddings
EMBEDDING_SIZE = 300

# dropout rate for word embeddings
EMBEDDING_DROPOUT_RATE = 0.5

# the size of the hidden layers, used for both the LSTM and the feedforward NN
HIDDEN_SIZE = 50

# the number of hidden layers used for the feedforward NN
NUM_FFN_LAYER = 2

# the dropout rate for the hidden layers of LSTM and feedforward NN
HIDDEN_DROPOUT_RATE = 0.2


###**5.2 Building the model**

In the next cell block, you will complete the `__init__` and `forward` functions by completing the steps that follow.

1. **Initialize the model inputs**

(a.) The `forward` function takes two inputs, `word_embeddings` and `mention_pairs`.

**Hints:**
*  The dimension of `word_embeddings` is `(batch_size, num_sents, num_words, embedding_size)`, where batch_size is the number of inputs at each time step. Our **batch size is 1 document**.
*  The dimension of `mention_pairs` is `(batch_size, num_mention_pairs, 4)`.

    A line of code has been written for you to reshape the word_embeddings and remove the document dimension, as the LSTMs only take 3-dimensional inputs:

    `word_embeddings = word_embeddings.view(-1, word_embeddings.size(2), word_embeddings.size(3))`

    <br>

(b.) Apply dropout with rate `EMBEDDING_DROPOUT_RATE` to these word embeddings, which no longer have a batch dimension.

<br>

2. **Encode the document using Bidirectional LSTMs**


For this task we will be working in both the `__init__` and `forward` methods: the layers are defined in `__init__` and used in `forward`. The task is to create a bidirectional LSTM to encode the sentences from both directions, which provides context information to the coreference system.

**Hints:**
* You'll need:
    * The dropout rate of the hidden layers: `HIDDEN_DROPOUT_RATE`
    * The size of the LSTM hidden layers: `HIDDEN_SIZE`

* You need to create a two-layer bidirectional LSTM (BiLSTM) by stacking two LSTM() layers. The BiLSTM needs to return the output for all the tokens in the sentences, not just the final one. The output of the BiLSTM should be called word_output. Here is the documentation on how to create a BiLSTM using PyTorch: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM.

* `nn.LSTM` returns the output for all the tokens in the sentences; because the layer is bidirectional, this is a `(num_sents, num_words, 2 * HIDDEN_SIZE)` tensor. You will need to set the `batch_first` attribute to True, so that the input and output tensors are provided as (batch, seq, feature) instead of (seq, batch, feature).

<br>

We then flatten the output of the LSTMs to get a `(num_sents*num_words, 2 * HIDDEN_SIZE)` tensor using the view() function. This will help us gather the right indices for the mention pairs. We also apply further dropout.

`flatten_word_output = word_output.contiguous().view(-1, 2 * HIDDEN_SIZE)`

`flatten_word_output = self.flatten_dropout(flatten_word_output)`


<br>

Then, we get the mention pair representations by first flattening the mention pair indices using the view() function, as the index_select() function used to gather the embeddings only takes a one-dimensional index. We can then bring the mention pair embeddings into their desired dimensions by using the view() function again.

<br>

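The output concatenates the embeddings such that each mention pair is represented by an `8 * HIDDEN_SIZE` tensor: four token positions (anaphor start and end, antecedent start and end), each contributing a `2 * HIDDEN_SIZE` BiLSTM output.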


3. **Create a multilayer feed-forward neural network to compute the mention-pair scores.**

Then you are required to create a FFNN that contains 2 hidden layers and an output layer. The outputs of the FFNN are the mention_pair_scores. Here are some requirements:

* The hidden layers need to have a size of `HIDDEN_SIZE`.
* You need to apply dropout after each of the hidden layers (but not the output layer).

**Hint:**

Each hidden layer of the FFNN is a simple Linear layer with a ReLU activation function. Layers are simply stacked together: the output of the previous layer is the input for the next layer. To apply the dropout you can simply use the Dropout layer. The output layer is slightly different, since it has an output size of 1. Also, in order to compute the binary cross-entropy loss, we need to give this final layer a sigmoid activation function.

After computing the mention_pair_scores you will need to remove the last dimension of it, since the last dimension is always 1.
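The flatten-and-gather step described above is mostly shape bookkeeping, so it can help to trace it on tiny arrays first. The sketch below is a NumPy analogue, used only to keep the example light: `reshape` plays the role of `view()` and integer-array indexing plays the role of `index_select()` along dimension 0. The toy sizes, the made-up pair indices and the names `HIDDEN`, `flat_output` and `pair_emb` are assumptions for this illustration.

```python
import numpy as np

HIDDEN = 2                    # toy hidden size; the lab uses HIDDEN_SIZE = 50
num_sents, num_words = 2, 3

# stand-in for the BiLSTM output: one 2*HIDDEN vector per (padded) token
word_output = np.arange(num_sents * num_words * 2 * HIDDEN, dtype=np.float32)
word_output = word_output.reshape(num_sents, num_words, 2 * HIDDEN)

# flatten the sentence dimension so that padded token indices address rows directly
flat_output = word_output.reshape(-1, 2 * HIDDEN)          # (6, 4)

# two pairs: [anaphor_start, anaphor_end, antecedent_start, antecedent_end]
mention_pairs = np.array([[[3, 4, 0, 1], [5, 5, 3, 4]]])   # (1, 2, 4)

flat_indices = mention_pairs.reshape(-1)                   # (8,)
gathered = flat_output[flat_indices]                       # (8, 4)
pair_emb = gathered.reshape(-1, 8 * HIDDEN)                # (2, 16)

print(flat_output.shape, gathered.shape, pair_emb.shape)
```

Each row of `pair_emb` is the concatenation of the four boundary-token vectors of one pair, which is why the first feed-forward layer needs an input size of `8 * HIDDEN_SIZE`.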


In [189]:
import torch.nn.functional as F

# Model Definition
class CoreferenceModel(nn.Module):

    """
    Task 2

    Begin
    """
    def __init__(self):
        super(CoreferenceModel, self).__init__()

        # Your code goes here

        # Dropout layers
        self.embedding_dropout = nn.Dropout(EMBEDDING_DROPOUT_RATE)
        self.flatten_dropout = nn.Dropout(HIDDEN_DROPOUT_RATE)
        self.ffnn_dropout = nn.Dropout(HIDDEN_DROPOUT_RATE)

        # Bidirectional LSTM layers. nn.LSTM's own `dropout` argument only acts between the layers of a
        # multi-layer LSTM, so it is omitted for these single-layer modules.
        self.bi_lstm1 = nn.LSTM(EMBEDDING_SIZE, HIDDEN_SIZE, batch_first=True, bidirectional=True)
        self.bi_lstm2 = nn.LSTM(2 * HIDDEN_SIZE, HIDDEN_SIZE, batch_first=True, bidirectional=True)

        self.linear_layer = nn.Linear(8 * HIDDEN_SIZE, HIDDEN_SIZE)
        # Fully Connected Layers
        self.ffnn_layers = nn.ModuleList()
        for _ in range(NUM_FFN_LAYER):

          self.ffnn_layers.append(nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE))

        # Output layer
        self.output_layer = nn.Linear(HIDDEN_SIZE, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, word_embeddings, mention_pairs):
        """
        word_embeddings: Tensor of shape (batch_size, num_sents, num_words, embedding_size)
        mention_pairs: Tensor of shape (batch_size, num_pairs, 4)
        """
        # Reshape for LSTM: merge first two dimensions if necessary
        # origin shape is (batch_size, num_sents, num_words, embedding_size)
        # We'll treat it as (batch_size*num_sents, num_words, embedding_size)
        word_embeddings = word_embeddings.view(-1, word_embeddings.size(2), word_embeddings.size(3))

        # Your code goes here
        # Apply dropout
        word_embeddings = self.embedding_dropout(word_embeddings)

        # Apply Bidirectional LSTM
        word_output, _ = self.bi_lstm1(word_embeddings)
        word_output, _ = self.bi_lstm2(word_output)

        # Flatten word_output: reshape to (batch_size*num_sents*num_words, 2*hidden_size)
        flatten_word_output = word_output.contiguous().view(-1, 2 * HIDDEN_SIZE)  # (batch_size*num_sents*num_words, 2*hidden_size)
        flatten_word_output = self.flatten_dropout(flatten_word_output) # (batch_size*num_sents*num_words, 2*hidden_size)

        # Gather mention pair embeddings
        flatten_mention_pairs = mention_pairs.contiguous().view(-1) # Shape: batch_size*num_pairs*4
        flatten_mention_pair_emb = torch.index_select(flatten_word_output,0,flatten_mention_pairs) # Shape: (batch_size*num_pairs*4, 2*hidden_size)
        mention_pair_emb = flatten_mention_pair_emb.contiguous().view(-1, 8 * HIDDEN_SIZE) # Shape: (batch_size*num_pairs, 8*hidden_size)

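        # Apply FFNN layers
        # first hidden layer: projects the 8*hidden_size pair representation to hidden_size (no dropout here)
        x = self.linear_layer(mention_pair_emb)
        x = F.relu(x)
        # remaining NUM_FFN_LAYER hidden layers, each followed by ReLU and dropout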
        for layer in self.ffnn_layers:
            x = layer(x)
            x = F.relu(x)
            x = self.ffnn_dropout(x)

        # Output layer
        mention_pair_scores = self.output_layer(x)  # (batch_size * num_pairs, 1)
        mention_pair_scores = self.sigmoid(mention_pair_scores)  # (batch_size * num_pairs, 1)

        # Squeeze the last dimension
        mention_pair_scores = mention_pair_scores.squeeze(1)  # (batch_size * num_pairs)

        return mention_pair_scores
    """
    End Task 2
    """

In [190]:
model = CoreferenceModel()

##**6. Coreference Resolution Evaluation**

Coreference resolution models are not evaluated using regular accuracy or F1 as one would evaluate a text classification model. Rather, using the pairwise scores produced by the system, we build coreference clusters. These clusters are then evaluated using the CoNLL score (https://www.aclweb.org/anthology/W12-4501/).
In this section, we write functions to build such clusters.

###**6.1 Getting the Predicted clusters**

First, we will write a function that takes a list of predicted mention pairs and produces two variables:

1. `predicted_clusters`: a list of tuples. Each tuple is a cluster, i.e. the elements of each tuple are the mentions predicted to belong to that cluster, where each mention is a (start_index, end_index) tuple.

2. `mention_to_predicted`: a dictionary whose keys are mentions and whose values are the predicted clusters for the given mentions.

The input to the function, `mention_pairs`, is the list of predicted mention pairs
[[(anaphor_start, anaphor_end), (antecedent_start, antecedent_end)], ...], similar to draw_pairs in section 4.

In [191]:
def get_predicted_clusters(mention_pairs):
    mention_to_predicted = {}
    predicted_clusters = []

    # for each mention and its predicted antecedent
    for anaphora, predicted_antecedent in mention_pairs:
        anaphora = tuple(anaphora)
        predicted_antecedent = tuple(predicted_antecedent)
        # if the predicted antecedent has been processed before as an anaphor
        if predicted_antecedent in mention_to_predicted:
            # then the predicted cluster for the anaphor is the same as the one for its predicted antecedent
            predicted_cluster = mention_to_predicted[predicted_antecedent]
        # otherwise,
        else:
            # create a new cluster, with the antecedent as the first mention in that cluster
            predicted_cluster = len(predicted_clusters) # the cluster number (its order in the list of clusters)
            predicted_clusters.append([predicted_antecedent])
            mention_to_predicted[predicted_antecedent] = predicted_cluster

        # now we know the right cluster for the anaphor, add it to that cluster
        predicted_clusters[predicted_cluster].append(anaphora)
        mention_to_predicted[anaphora] = predicted_cluster

    # turn each cluster list into a tuple. Unlike lists, tuples are immutable and hashable, so they can be used as dictionary keys.
    predicted_clusters = [tuple(pc) for pc in predicted_clusters]
    # get the {mention: complete cluster} map for the predictions.
    mention_to_predicted = {m: predicted_clusters[i] for m, i in mention_to_predicted.items()}

    return predicted_clusters, mention_to_predicted
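To see the linking behaviour on a concrete input, the sketch below restates the same greedy idea in condensed form and runs it on three made-up (anaphor, antecedent) decisions for the dummy document. `link_pairs` is a hypothetical helper written for this illustration; it assumes, as in this lab, that pairs arrive in document order so that an antecedent has already been placed in a cluster by the time a later mention points to it.

```python
def link_pairs(mention_pairs):
    """Condensed restatement of the greedy linking idea (illustrative helper)."""
    cluster_of = {}  # mention -> index of its cluster
    clusters = []    # list of clusters, each a list of mentions
    for anaphor, antecedent in mention_pairs:
        if antecedent not in cluster_of:
            # first time we see this antecedent: it starts a new cluster
            cluster_of[antecedent] = len(clusters)
            clusters.append([antecedent])
        cid = cluster_of[antecedent]
        clusters[cid].append(anaphor)
        cluster_of[anaphor] = cid
    return [tuple(c) for c in clusters]

# 'He' -> 'The large cat', 'he' -> 'He', 'him' -> 'An unfortunate rat'
pairs = [[(5, 5), (0, 2)], [(10, 10), (5, 5)], [(26, 26), (17, 19)]]
print(link_pairs(pairs))
# [((0, 2), (5, 5), (10, 10)), ((17, 19), (26, 26))]
```

Note that 'he' joins the first cluster through 'He' rather than through a direct link to 'The large cat': clusters are the transitive closure of the chosen links.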

###**6.2 Coreference evaluation for a given document**

In this subsection you will complete the `evaluate_coref()` function for coref evaluation on a single document.

<br>

The `evaluate_coref()` function takes 3 parameters:
* `predicted_mention_pairs`: the list of predicted mention pairs
* `gold_clusters`: the gold clusters from the original document
* `evaluator`: a reference to an instance of metrics.CorefEvaluator()

You will use the first 2 parameters to create the variables needed to run the `evaluator.update()` method.

<br>

The `evaluator.update()` method takes 4 parameters:
* `predicted_clusters`: from 6.1
* `gold_clusters`: the gold clusters from the original document, each cluster transformed from a list to a tuple.
* `mention_to_predicted`: from 6.1
* `mention_to_gold`: the gold equivalent of `mention_to_predicted`

<br>

Some of the code has been written for you. Complete the code below to generate the rest of it.
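As a point of reference for the expected data structures, here is a small sketch of what the gold-side arguments look like for the dummy document from section 4. It only illustrates the shapes involved; the variable name `gold_as_tuples` is introduced for this example.

```python
# gold clusters of the 'large_cat' dummy document, as stored in the JSON file
gold_clusters = [[[0, 2], [5, 5], [10, 10], [23, 24]], [[17, 19], [26, 26]]]

# lists are unhashable, so clusters and mentions are converted to tuples first
gold_as_tuples = [tuple(tuple(m) for m in cluster) for cluster in gold_clusters]

# every mention points to the full cluster it belongs to (itself included)
mention_to_gold = {m: cluster for cluster in gold_as_tuples for m in cluster}

print(mention_to_gold[(5, 5)])    # ((0, 2), (5, 5), (10, 10), (23, 24))
print(mention_to_gold[(26, 26)])  # ((17, 19), (26, 26))
```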

In [192]:
def evaluate_coref(predicted_mention_pairs, gold_clusters, evaluator):
    """
    Task 3

    Begin
    """
    gold_clusters = [tuple([tuple(m) for m in cluster]) for cluster in gold_clusters]


    # mention to gold is a {mention: cluster of mentions it belongs, including the present mention} map
    mention_to_gold = {}
    for cluster in gold_clusters:
        for mention in cluster:
            mention_to_gold[mention] = cluster


    # get the predicted clusters and the map of mention to predicted cluster
    predicted_clusters, mention_to_predicted = get_predicted_clusters(predicted_mention_pairs)
    evaluator.update(predicted_clusters, gold_clusters, mention_to_predicted, mention_to_gold)

    """
    End Task 3
    """

###**6.3 Evaluating the model on all the data**

In [193]:
def eval(model, eval_docs, device):
    coref_evaluator = metrics.CorefEvaluator()
    model.eval()
    for word_embeddings, mention_pairs, _, gold_clusters, raw_mention_pairs in eval_docs:
        word_embeddings = torch.Tensor([word_embeddings]).to(torch.float).to(device)
        mention_pairs = torch.Tensor([mention_pairs]).to(torch.int32).to(device)
        # get the mention pair scores from the model
        mention_pair_scores = model(word_embeddings, mention_pairs).detach().cpu()
        predicted_antecedents = {}
        best_antecedent_scores = {}
        # for each (anaphor, antecedent) candidate pair and its score
        for (ana, ant), score in zip(raw_mention_pairs, mention_pair_scores):
            # only candidate antecedents whose (ana, ant) score is at least 0.5 count as valid system proposals
            if score >= 0.5 and score > best_antecedent_scores.get(ana, 0):
                # among these, keep the highest-scoring one as the predicted antecedent for that anaphor
                predicted_antecedents[ana] = ant
                best_antecedent_scores[ana] = score

        # getting the [anaphor, antecedent] pairs.
        predicted_mention_pairs = [[k,v] for k,v in predicted_antecedents.items()]

        # evaluate the predicted mention pairs
        evaluate_coref(predicted_mention_pairs, gold_clusters, coref_evaluator)

    # after evaluating each document, get the CoNLL precision, recall and F1
    p, r, f = coref_evaluator.get_prf()
    print("Average F1 (py): {:.2f}%".format(f * 100))
    print("Average precision (py): {:.2f}%".format(p * 100))
    print("Average recall (py): {:.2f}%".format(r * 100))
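
The antecedent selection inside the loop above can be tried in isolation. The sketch below uses a hypothetical helper, `select_antecedents`, with plain floats in place of model outputs; the spans and scores are made up for illustration.

```python
def select_antecedents(raw_mention_pairs, scores, threshold=0.5):
    """Keep, for each anaphor, the highest-scoring antecedent at or above the threshold."""
    predicted_antecedents = {}
    best_antecedent_scores = {}
    for (ana, ant), score in zip(raw_mention_pairs, scores):
        if score >= threshold and score > best_antecedent_scores.get(ana, 0):
            predicted_antecedents[ana] = ant
            best_antecedent_scores[ana] = score
    return [[ana, ant] for ana, ant in predicted_antecedents.items()]


# Toy (anaphor, antecedent) candidates with invented scores.
pairs = [((5, 5), (0, 1)), ((5, 5), (3, 3)), ((9, 10), (5, 5)), ((9, 10), (0, 1))]
scores = [0.7, 0.9, 0.4, 0.3]

print(select_antecedents(pairs, scores))  # [[(5, 5), (3, 3)]]
```

The anaphor `(9, 10)` gets no antecedent because none of its candidates reaches the threshold, so it does not appear in the predicted pairs at all.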

##**7. Training and Evaluating the Coreference Model**

In [194]:
def time_used(start_time):
    curr_time = time.time()
    used_time = curr_time - start_time
    m = used_time // 60
    s = used_time - 60 * m
    return "%d m %d s" % (m, s)

In [195]:
def train(model, train_data, dev_data, epochs, start_eval, lr=1e-3, device='cpu'):
    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss().to(device)
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0
        # time each epoch separately
        start_time = time.time()
        for word_embeddings, mention_pairs, mention_pair_labels, _, _ in DataLoader(train_data, batch_size=1, shuffle=True):
            word_embeddings = word_embeddings.to(torch.float).to(device)
            mention_pairs = mention_pairs.to(torch.int32).to(device)
            mention_pair_labels = mention_pair_labels.to(torch.float).to(device)
            optimizer.zero_grad()
            outputs = model(word_embeddings,mention_pairs).unsqueeze(0)
            loss = loss_fn(outputs, mention_pair_labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        print(f"Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss:.4f}")
        if epoch >= start_eval:
            eval(model, dev_data, device)
        print(f"Time used: {time_used(start_time)}")
        model.train()
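
The loop above trains with `nn.BCELoss`, which with its default settings averages the binary cross-entropy over all mention pairs in the document. As a rough pure-Python sketch of that quantity, assuming every probability lies strictly between 0 and 1 (the helper name, probabilities and labels are made up for illustration):

```python
import math


def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy; assumes each probability is strictly in (0, 1)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


# Invented mention-pair probabilities and gold labels (1 = coreferent, 0 = not).
probs = [0.9, 0.2, 0.6]
labels = [1.0, 0.0, 0.0]
loss = binary_cross_entropy(probs, labels)
print(f"{loss:.4f}")
```

A confident wrong score (such as 0.6 for a negative pair) contributes more to the loss than a confident correct one, which is what pushes the model's scores toward the labels.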

In [196]:
import warnings
warnings.filterwarnings('ignore')

# train the model for 10 epochs
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train(model, TRAIN_DATA, DEV_DATA, epochs=10, start_eval=0, lr=1e-3, device=device)

Epoch 1/10, Loss: 228.8539
Average F1 (py): 33.02%
Average precision (py): 43.55%
Average recall (py): 48.08%
Time used: 0 m 8 s
Epoch 2/10, Loss: 200.1703
Average F1 (py): 33.86%
Average precision (py): 43.48%
Average recall (py): 52.87%
Time used: 0 m 9 s
Epoch 3/10, Loss: 186.2838
Average F1 (py): 34.60%
Average precision (py): 43.86%
Average recall (py): 52.25%
Time used: 0 m 8 s
Epoch 4/10, Loss: 172.0021
Average F1 (py): 36.78%
Average precision (py): 49.23%
Average recall (py): 51.04%
Time used: 0 m 9 s
Epoch 5/10, Loss: 163.5617
Average F1 (py): 32.10%
Average precision (py): 36.14%
Average recall (py): 58.95%
Time used: 0 m 8 s
Epoch 6/10, Loss: 152.1251
Average F1 (py): 36.31%
Average precision (py): 41.83%
Average recall (py): 53.94%
Time used: 0 m 9 s
Epoch 7/10, Loss: 147.2065
Average F1 (py): 34.36%
Average precision (py): 39.87%
Average recall (py): 57.74%
Time used: 0 m 8 s
Epoch 8/10, Loss: 143.1316
Average F1 (py): 33.30%
Average precision (py): 38.07%
Average recall 

##**8. Questions**



*   Would the performance decrease if we did not preprocess the text? Why or why not?
*   Experiment with different values for max antecedent (`MAX_ANT`) and negative ratio (`NEG_RATIO`). What do you observe?
*   How would you improve the accuracy?


