### Anaphora resolution

1) Download the pretrained FastText model from https://fasttext.cc/docs/en/english-vectors.html

2) In PyTorch, develop a feed-forward neural network with three layers: an input layer of size 600, a first layer of size 300, and a second layer of size 80, followed by an output layer with two units. All layers use regularization and dropout, and the activation function on all layers is ReLU.

![title](scheme.jpg)

In [1]:
import pandas as pd

In [2]:
df_dev = pd.read_csv('https://raw.githubusercontent.com/rauan-assabayev/NLP/master/lab5/gap-development.tsv',sep='\t')

The task is to identify the referent of a pronoun within a text passage. The source text is taken from Wikipedia articles. Each example in the dataset labels the pronoun and two candidate names to which it could refer. An algorithm should decide whether the pronoun refers to name A, name B, or neither.  
The following columns are available for analysis:
* ID - Unique identifier for an example (matches Id in the output file format);
* Text - Text containing the ambiguous pronoun and two candidate names (about a paragraph in length);
* Pronoun - The target pronoun (text);
* Pronoun-offset - The character offset of Pronoun in Text;
* A - The first name candidate (text);
* A-offset - The character offset of name A in Text;
* A-coref - Whether name A corefers with the pronoun (boolean);
* B - The second name candidate;
* B-offset - The character offset of name B in Text;
* B-coref - Whether name B corefers with the pronoun (boolean);
* URL - The URL of the source Wikipedia page for the example;
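The offset columns are zero-based character positions into Text. As a quick sanity check, the sketch below uses the opening of the first development example (`development-1`, shown further down in the notebook) with its offsets 274, 191 and 207. The helper name `mention_at` is hypothetical and only serves this illustration.

```python
# Opening of the Text field of development-1, truncated after the pronoun.
text = ("Zoe Telford -- played the police officer girlfriend of Simon, Maggie. "
        "Dumped by Simon in the final episode of series 1, after he slept with "
        "Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, "
        "Pauline's friend and also a year 11 pupil in Simon's class. "
        "Dumped her boyfriend")


def mention_at(text, offset, mention):
    """Return True if `mention` starts at character `offset` of `text`."""
    return text[offset:offset + len(mention)] == mention


checks = {
    "Pronoun": mention_at(text, 274, "her"),
    "A": mention_at(text, 191, "Cheryl Cassidy"),
    "B": mention_at(text, 207, "Pauline"),
}
print(checks)
```

If any check came back `False`, the offsets would most likely be misaligned by an encoding or whitespace change in Text.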


In [3]:
df_dev

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
0,development-1,Zoe Telford -- played the police officer girlf...,her,274,Cheryl Cassidy,191,True,Pauline,207,False,http://en.wikipedia.org/wiki/List_of_Teachers_...
1,development-2,"He grew up in Evanston, Illinois the second ol...",His,284,MacKenzie,228,True,Bernard Leach,251,False,http://en.wikipedia.org/wiki/Warren_MacKenzie
2,development-3,"He had been reelected to Congress, but resigne...",his,265,Angeloz,173,False,De la Sota,246,True,http://en.wikipedia.org/wiki/Jos%C3%A9_Manuel_...
3,development-4,The current members of Crime have also perform...,his,321,Hell,174,False,Henry Rosenthal,336,True,http://en.wikipedia.org/wiki/Crime_(band)
4,development-5,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,Kitty Oppenheimer,219,False,Rivera,294,True,http://en.wikipedia.org/wiki/Jessica_Rivera
...,...,...,...,...,...,...,...,...,...,...,...
1995,development-1996,"Faye's third husband, Paul Resnick, reported t...",her,433,Nicole,255,False,Faye,328,True,http://en.wikipedia.org/wiki/Faye_Resnick
1996,development-1997,The plot of the film focuses on the life of a ...,her,246,Doris Chu,111,False,Mei,215,True,http://en.wikipedia.org/wiki/Two_Lies
1997,development-1998,Grant played the part in Trevor Nunn's movie a...,she,348,Maria,259,True,Imelda Staunton,266,False,http://en.wikipedia.org/wiki/Sir_Andrew_Aguecheek
1998,development-1999,The fashion house specialised in hand-printed ...,She,284,Helen,145,True,Suzanne Bartsch,208,False,http://en.wikipedia.org/wiki/Helen_David


In [4]:
df_dev.iloc[0]['Text']

"Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline."

In [5]:
df_val = pd.read_csv('https://raw.githubusercontent.com/rauan-assabayev/NLP/master/lab5/gap-validation.tsv',sep='\t')

In [6]:
df_val

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
0,validation-1,He admitted making four trips to China and pla...,him,256,Jose de Venecia Jr,208,False,Abalos,241,False,http://en.wikipedia.org/wiki/Commission_on_Ele...
1,validation-2,"Kathleen Nott was born in Camberwell, London. ...",She,185,Ellen,110,False,Kathleen,150,True,http://en.wikipedia.org/wiki/Kathleen_Nott
2,validation-3,"When she returns to her hotel room, a Liberian...",his,435,Jason Scott Lee,383,False,Danny,406,True,http://en.wikipedia.org/wiki/Hawaii_Five-0_(20...
3,validation-4,"On 19 March 2007, during a campaign appearance...",he,333,Reucassel,300,True,Debnam,325,False,http://en.wikipedia.org/wiki/Craig_Reucassel
4,validation-5,"By this time, Karen Blixen had separated from ...",she,427,Finch Hatton,290,False,Beryl Markham,328,True,http://en.wikipedia.org/wiki/Denys_Finch_Hatton
...,...,...,...,...,...,...,...,...,...,...,...
449,validation-450,"He then agrees to name the gargoyle Goldie, af...",He,305,Lucien,252,False,Abel,264,False,http://en.wikipedia.org/wiki/Goldie_(DC_Comics)
450,validation-451,"Disgusted with the family's ``mendacity'', Bri...",she,365,Maggie,242,False,Mae,257,False,http://en.wikipedia.org/wiki/Cat_on_a_Hot_Tin_...
451,validation-452,She manipulates Michael into giving her custod...,she,306,Scarlett,255,False,Alice,291,True,http://en.wikipedia.org/wiki/Michael_Moon_(Eas...
452,validation-453,"On April 4, 1986, Donal Henahan wrote in the N...",her,330,Aida,250,False,Miss Millo,294,True,http://en.wikipedia.org/wiki/Aprile_Millo


In [6]:
import gensim
# load_facebook_model supersedes the deprecated FastText.load_fasttext_format
wiki_bin = gensim.models.fasttext.load_facebook_model('wiki.en.bin')
# Look up one FastText vector per candidate name and per pronoun
A_tensor = [wiki_bin.wv[name] for name in df_dev['A']]
B_tensor = [wiki_bin.wv[name] for name in df_dev['B']]
p_tensor = [wiki_bin.wv[pronoun] for pronoun in df_dev['Pronoun']]
df_tensor = pd.DataFrame()
df_tensor['A'] = A_tensor
df_tensor['B'] = B_tensor
df_tensor['Pronoun'] = p_tensor
# Encode the boolean coreference labels as 1.0 / 0.0
df_tensor['A-coref'] = df_dev['A-coref'].astype(float)
df_tensor['B-coref'] = df_dev['B-coref'].astype(float)
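The lookup above works even for multi-word names such as "Cheryl Cassidy", which are unlikely to be vocabulary entries, because a FastText `.bin` model can compose a vector from character n-grams. The sketch below only illustrates how such n-grams are formed, assuming boundary markers `<` and `>` and n-gram lengths 3 to 6 (the commonly cited FastText defaults). `char_ngrams` is a hypothetical helper, not a gensim function, and the real model additionally hashes n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the boundary-marked character n-grams of `word`."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


print(char_ngrams("her"))
# ['<he', 'her', 'er>', '<her', 'her>', '<her>']
```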

  


In [7]:
from torch import nn
import numpy as np
import torch
inp = []
for i in range(len(A_tensor)):
    input1 = np.append(A_tensor[i],B_tensor[i])
    input2 = np.append(input1, p_tensor[i])
    inp.append(input2)
X = torch.tensor(inp).float()
print('Input size: ', len(X[0]))
print('Length of train dataset: ',len(X))
print('The first three rows: \n',X[:3])

Input size:  900
Length of train dataset:  2000
The first three rows: 
 tensor([[ 0.2072,  0.0463,  0.1232,  ...,  0.4101,  0.0578,  0.2908],
        [ 0.0793,  0.0274, -0.0651,  ...,  0.0558,  0.4029, -0.8033],
        [-0.4349, -0.2292,  0.0815,  ...,  0.4401, -0.0977,  0.2298]])
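Each input row is the concatenation of three 300-dimensional vectors (A, B, pronoun), which gives the 900 features reported above. A vectorized way to build the same layout is sketched below with random stand-in data. The names `A_demo`, `B_demo` and `P_demo` are assumptions for illustration; in the notebook, `np.stack(A_tensor)` and its counterparts would presumably play their role.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, dim = 4, 300  # 300 = size of one FastText vector

# Random stand-ins for the stacked A, B and pronoun embeddings.
A_demo = rng.normal(size=(n_examples, dim)).astype(np.float32)
B_demo = rng.normal(size=(n_examples, dim)).astype(np.float32)
P_demo = rng.normal(size=(n_examples, dim)).astype(np.float32)

# Concatenate along the feature axis: columns 0-299 are A, 300-599 B, 600-899 pronoun.
X_demo = np.concatenate([A_demo, B_demo, P_demo], axis=1)
print(X_demo.shape)
```

Building one array first and converting it once with `torch.from_numpy` also avoids creating a tensor from a Python list of arrays, which tends to be slow.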


In [8]:
df_tensor.head(3)

Unnamed: 0,A,B,Pronoun,A-coref,B-coref
0,"[0.20722698, 0.046324596, 0.123187125, 0.02980...","[0.050630115, 0.079322815, -0.20837106, 0.4176...","[0.23167726, 0.048626047, -0.2783578, 0.310555...",1.0,0.0
1,"[0.07929457, 0.027403368, -0.06506348, -0.0076...","[-0.18771006, -0.15013029, 0.004081186, 0.0881...","[0.013744404, -0.3271882, 0.52612346, 0.560924...",1.0,0.0
2,"[-0.43492344, -0.22918437, 0.08146006, 0.02927...","[-0.10447918, -0.23179889, 0.057727065, -0.066...","[0.104024425, -0.28820887, -0.1581829, 0.20985...",0.0,1.0


In [12]:
out = []
y1 = df_tensor['A-coref'].values
y2 = df_tensor['B-coref'].values
for i in range(len(X)):
    output1 = np.append(y1[i],y2[i])
    out.append(output1)
y = torch.tensor(out).float()
print('output size: ', len(y[0]))
print('len of train dataset: ',len(y))
print('The first three rows: ')
print(y[:3])

output size:  2
len of train dataset:  2000
The first three rows: 
tensor([[1., 0.],
        [1., 0.],
        [0., 1.]])


In [13]:
model = nn.Sequential(
                      nn.Linear(len(X[0]), 600), # input (900 features) -> first hidden layer (600 units)
                      nn.ReLU(),
                      nn.Dropout(p = 0.3),
                      nn.Linear(600,300),
                      nn.ReLU(),
                      nn.Dropout(p = 0.3),
                      nn.Linear(300,80),
                      nn.ReLU(),
                      nn.Dropout(p = 0.3),
                      nn.Linear(80,2)) # output size 2
print(model)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
loss = 1
loss_arr = []
for i in range(5000):
            y_pred = model(X)
            loss = loss_fn(y_pred, y)
            loss_arr.append(loss.item())  # store a plain float, not the graph-attached tensor
            if (i%100 == 0):
                print(i, loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
print(i, loss.item())

Sequential(
  (0): Linear(in_features=900, out_features=600, bias=True)
  (1): ReLU()
  (2): Dropout(p=0.3, inplace=False)
  (3): Linear(in_features=600, out_features=300, bias=True)
  (4): ReLU()
  (5): Dropout(p=0.3, inplace=False)
  (6): Linear(in_features=300, out_features=80, bias=True)
  (7): ReLU()
  (8): Dropout(p=0.3, inplace=False)
  (9): Linear(in_features=80, out_features=2, bias=True)
)
0 0.4935310184955597
100 0.22300280630588531
200 0.07687117159366608
300 0.0364527627825737
400 0.02607397548854351
500 0.019800391048192978
600 0.016994966194033623
700 0.015866562724113464
800 0.015423678793013096
900 0.014192449860274792
1000 0.012613671831786633
1100 0.01247994415462017
1200 0.011141539551317692
1300 0.011742139235138893
1400 0.01100938767194748
1500 0.010170618072152138
1600 0.010180140845477581
1700 0.008968353271484375
1800 0.009323697537183762
1900 0.009810077026486397
2000 0.009284543804824352
2100 0.009062062948942184
2200 0.008565351366996765
2300 0.0085761621594
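The task statement asks for regularization on all layers, while the training cell above relies on dropout only. One common option, shown here as a hedged sketch rather than the notebook's method, is an L2 penalty through the `weight_decay` argument of `torch.optim.Adam`. The value `1e-5`, the small layer sizes and the `demo_*` names are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small stand-in network; the notebook's model would be used in practice.
demo_model = nn.Sequential(
    nn.Linear(6, 4), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(4, 2)
)
# weight_decay adds an L2 penalty on all parameters (1e-5 is an assumed value).
demo_optimizer = torch.optim.Adam(demo_model.parameters(), lr=1e-4, weight_decay=1e-5)
loss_fn = nn.MSELoss()

demo_X = torch.randn(8, 6)
demo_y = torch.rand(8, 2).round()  # random 0/1 targets

loss_history = []
for step in range(3):
    loss = loss_fn(demo_model(demo_X), demo_y)
    loss_history.append(loss.item())  # keep floats only
    demo_optimizer.zero_grad()
    loss.backward()
    demo_optimizer.step()
```

A stronger penalty usually trades training loss for a smaller gap between training and validation performance, so the value would need tuning on the validation split.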

In [14]:
print(y_pred)

tensor([[ 1.0703,  0.0096],
        [ 1.0249,  0.0152],
        [-0.0095,  0.9135],
        ...,
        [ 0.8005,  0.0455],
        [ 0.9553,  0.0062],
        [-0.0045,  0.9909]], grad_fn=<AddmmBackward>)
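The `grad_fn` in the printed tensor shows that these predictions came from a forward pass in training mode, where `nn.Dropout` randomly zeroes activations. For inference, PyTorch models are normally switched to evaluation mode so that dropout becomes a no-op. A minimal sketch with a small stand-in model follows (the sizes and `demo_*` names are illustrative assumptions):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small stand-in for the notebook's network.
demo_model = nn.Sequential(
    nn.Linear(6, 4), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(4, 2)
)
demo_X = torch.randn(5, 6)

demo_model.eval()          # dropout is disabled in eval mode
with torch.no_grad():      # no autograd graph is built
    pred_a = demo_model(demo_X)
    pred_b = demo_model(demo_X)

deterministic = torch.equal(pred_a, pred_b)
print(deterministic, pred_a.requires_grad)

demo_model.train()         # switch back before any further training
```

With dropout left active, repeated predictions on the same input generally differ, which makes evaluation numbers noisy.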


In [15]:
A_tensor_test = []
B_tensor_test = []
p_tensor_test = []
for i in range(len(df_val['A'])):
    A_tensor_test.append(wiki_bin.wv[df_val['A'][i]])
    B_tensor_test.append(wiki_bin.wv[df_val['B'][i]])
    p_tensor_test.append(wiki_bin.wv[df_val['Pronoun'][i]])

In [16]:
import torch
inp = []
for i in range(len(A_tensor_test)):
    input1 = np.append(A_tensor_test[i],B_tensor_test[i])
    input2 = np.append(input1, p_tensor_test[i])
    inp.append(input2)
X_test = torch.tensor(inp).float()
print('FOR TESTING DATA')
print('Input size: ', len(X_test[0]))
print('Length of test dataset: ',len(X_test))
print('The first three rows: \n', X_test[:3])

FOR TESTING DATA
Input size:  900
Length of test dataset:  454
The first three rows: 
 tensor([[-0.1394,  0.0182,  0.0389,  ...,  0.3287,  0.1988,  0.2779],
        [ 0.1304, -0.0368, -0.0333,  ..., -0.2414, -0.0914, -0.1415],
        [-0.1029, -0.0926, -0.0633,  ...,  0.4401, -0.0977,  0.2298]])


In [17]:
y1 = (df_val['A-coref'].replace(True, 1).values)
y2 = (df_val['B-coref'].replace(True, 1).values)
out = []
for i in range(len(X_test)):
    output1 = np.append(y1[i],y2[i])
    out.append(output1)
y_test = torch.tensor(out).float()
print('Output size: ', len(y_test[0]))
print('Length of test dataset: ',len(y_test))
print('The first three rows: \n',y_test[:3])

Output size:  2
Length of test dataset:  454
The first three rows: 
 tensor([[0., 0.],
        [0., 1.],
        [0., 1.]])


In [18]:
y_test_pred = model(X_test)
y_test_pred_np = (np.round((y_test_pred).detach().numpy() ))**2
y_test_class = []
for i in range (len(y_test)):
    y_test2 = y_test[i].detach().numpy()
    if (y_test2[0] == 0):
        if (y_test2[1] == 0):
            y_test_class.append('class1')
    if (y_test2[0] == 0):
        if (y_test2[1] >= 1):
            y_test_class.append('class2')
    if (y_test2[0]== 1):
        if (y_test2[1]== 0):
            y_test_class.append('class3')
    if (y_test2[0]>= 1):
        if (y_test2[1] >= 1):
            y_test_class.append('class4')

y_test_pred_class = []
for i in range (len(y_test_pred_np)):
    y_test2 = y_test_pred_np[i]
    if (y_test2[0] == 0):
        if (y_test2[1] == 0):
            y_test_pred_class.append('class1')
    if (y_test2[0] ==0):
        if (y_test2[1] == 1):
            y_test_pred_class.append('class2')
    if (y_test2[0] == 1):
        if (y_test2[1] == 0):
            y_test_pred_class.append('class3')
    if (y_test2[0] == 1):
        if (y_test2[1] == 1):
            y_test_pred_class.append('class4')
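The nested `if` chains above append nothing when a rounded, squared prediction falls outside {0, 1} (for example, an output that rounds to 2 becomes 4), which would leave the two label lists with different lengths. A more compact mapping with an explicit fallback could look like the sketch below. The function name `pair_to_class` and the fallback label `'other'` are hypothetical additions, not part of the notebook.

```python
CLASS_NAMES = {
    (0, 0): "class1",  # neither A nor B
    (0, 1): "class2",  # B only
    (1, 0): "class3",  # A only
    (1, 1): "class4",  # both A and B
}


def pair_to_class(a, b):
    """Map an (A, B) output pair to a class name, or 'other' if out of range."""
    return CLASS_NAMES.get((int(a), int(b)), "other")


demo_pairs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
demo_classes = [pair_to_class(a, b) for a, b in demo_pairs]
print(demo_classes)
```

Because every pair yields exactly one label, the true and predicted lists stay aligned row by row.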

In [19]:
from sklearn.metrics import classification_report
clr = classification_report(y_test_class, y_test_pred_class)
print('Classification report for both classes')
print('First, the classes are defined as: both false - class1, false/true - class2, true/false - class3, true/true - class4')
print(clr)

Classification report for both classes
First, the classes are defined as: both false - class1, false/true - class2, true/false - class3, true/true - class4
              precision    recall  f1-score   support

      class1       0.15      0.15      0.15        65
      class2       0.53      0.50      0.51       205
      class3       0.49      0.51      0.50       184
      class4       0.00      0.00      0.00         0

    accuracy                           0.45       454
   macro avg       0.29      0.29      0.29       454
weighted avg       0.46      0.45      0.46       454





In [20]:
from sklearn.metrics import classification_report
y_test_np = np.array(y_test)
clr1 = classification_report(y_test_np.T[0], y_test_pred_np.T[0])
print('Classification report for the A output')
print(clr1)

Classification report for the A output
              precision    recall  f1-score   support

         0.0       0.65      0.63      0.64       270
         1.0       0.48      0.51      0.49       184

    accuracy                           0.58       454
   macro avg       0.57      0.57      0.57       454
weighted avg       0.58      0.58      0.58       454



In [21]:
model_was_right = 0
for i in range(len(y_test_np)):
    if (y_test_np[i][0] == y_test_pred_np[i][0]):
        if (y_test_np[i][1] == y_test_pred_np[i][1]):
            model_was_right = model_was_right + 1
accuracy = 100*model_was_right/(len(y_test_np))
print('Model was right in', model_was_right, 'of', len(y_test_np), 'observations. Accuracy is', round(accuracy, 2), '%')

Model was right in 206 of 454 observations. Accuracy is 45.37 %
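To put this accuracy in context, it may help to compare it with a trivial baseline that always predicts the most frequent class. The sketch below uses only the support counts from the classification report above (65, 205, 184) and the 206 correct predictions; the variable names are illustrative.

```python
# Support per class, taken from the classification report above.
support = {"class1": 65, "class2": 205, "class3": 184}
total = sum(support.values())

# Accuracy of always predicting the most frequent class (class2).
majority_baseline = 100 * max(support.values()) / total
# Exact-match accuracy reported by the notebook.
model_accuracy = 100 * 206 / total

print(round(majority_baseline, 2), round(model_accuracy, 2))
# 45.15 45.37
```

On these numbers the model is only about 0.2 percentage points above the majority-class baseline, which suggests that name and pronoun embeddings alone, without sentence context, carry little signal for this task.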
