# Lab 7: Word Sequences
## Introduction to Human Language Technologies
### Victor Badenas Crespo

***

## Statement:

- Read all pairs of sentences of the trial set within the evaluation framework of the project.
- Compute their similarities by considering the following approach:
    - words plus NEs, compared with the Jaccard coefficient, e.g. word_and_NEs=\['John Smith', 'is', 'working'\]
- Show the results.
- Do you think it could be relevant to use NEs to compute the similarity between two sentences? Justify the answer.



***

## Solution

In [1]:
# core imports
from pprint import pprint
from pathlib import Path
from collections import Counter

# scipy imports
from scipy.stats import pearsonr

# nltk imports
import nltk
from nltk.metrics import jaccard_distance
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# nltk downloads
nltk.download('punkt')
nltk.download('words')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('conll2000')

# constants definition
DATA_FOLDER = Path('./test-gold')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/victorbadenas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /home/victorbadenas/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/victorbadenas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/victorbadenas/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/victorbadenas/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     /home/victorbadenas/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!


First, the functions for reading and structuring the data are declared. The input files are then read; each line holds a tab-separated sentence pair \[sentence1, sentence2\]. The gold standard scores are read as well. Finally, every line of inputText is converted into a dict with the following format for readability:

```json
{
    "sent1": <sentence_string>,
    "sent2": <sentence_string>
}
```

In [2]:
def readFile(filePath):
    """
    reads the file and returns a list of lists: one inner list per line,
    holding the fields of that line split by tab characters
    """
    with open(filePath, 'r') as fileHandler:
        data = fileHandler.readlines()
    
    # split every line by tabs
    data = list(map(lambda x: x.strip().split('\t'), data))
    return data

def toDict(line):
    """
    creates a dict with fields sent1 and sent2 from the values in line
    """
    keys = ("sent1", "sent2")
    return dict(zip(keys, line))

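# read file data; both listings are sorted so that input files and gold
# standard files are paired in the same order (glob order is not guaranteed)
txtPaths = sorted(DATA_FOLDER.glob('STS.input.*.txt'))
# [!ALL] is a character class: it skips names that continue with 'A' or 'L'
# after 'STS.gs.', which leaves out STS.gs.ALL.txt
gsPaths = sorted(DATA_FOLDER.glob('STS.gs.[!ALL]*.txt'))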

inputText = []
for path in txtPaths:
    inputText += readFile(path)

gsText = []
for path in gsPaths:
    gsText += readFile(path)

# convert to previously defined dict structure
inputText = list(map(toDict, inputText))
pprint(inputText[2])

{'sent1': '"It\'s a huge black eye," said publisher Arthur Ochs Sulzberger '
          'Jr., whose family has controlled the paper since 1896.',
 'sent2': '"It\'s a huge black eye," Arthur Sulzberger, the newspaper\'s '
          'publisher, said of the scandal.'}


In [3]:
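def processCoNLL(conll):
    """
    returns the set of tokens in a CoNLL-formatted string, where every named
    entity is replaced by its label (e.g. PERSON)
    """
    words = []
    for line in conll.split('\n'):
        # each line is "token POS-tag IOB-tag"
        token, _, NEtag = line.split()
        if NEtag.startswith("B-"):
            # first token of an entity: keep only the entity label
            words.append(NEtag.replace("B-", ""))
        elif NEtag.startswith("I-"):
            # continuation of an entity already represented by its label
            pass
        else:
            words.append(token)
    return set(words)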
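
To make the behaviour of `processCoNLL` concrete, the following sketch runs it on a small hand-written CoNLL string. The sample lines are an assumption, written in the "token POS-tag IOB-tag" layout that `nltk.chunk.tree2conllstr` produces; they are not taken from the data set. The function is restated so that the snippet runs on its own.

```python
def processCoNLL(conll):
    """Collapse every named entity to its label and return the token set."""
    words = []
    for line in conll.split('\n'):
        token, _, NEtag = line.split()
        if NEtag.startswith("B-"):
            # first token of an entity: keep only the label
            words.append(NEtag.replace("B-", ""))
        elif NEtag.startswith("I-"):
            # later tokens of the same entity are dropped
            pass
        else:
            words.append(token)
    return set(words)


# hand-written sample (assumption), one "token POS IOB" triple per line
sample = "\n".join([
    "Arthur NNP B-PERSON",
    "Sulzberger NNP I-PERSON",
    "said VBD O",
    "of IN O",
    "the DT O",
    "scandal NN O",
])

tokens = processCoNLL(sample)
print(sorted(tokens))
```

Both tokens of the name collapse into the single element `PERSON`, so any two sentences that mention a person share that element regardless of who is mentioned.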

In [4]:
for sentence in inputText:
    sent1, sent2 = sentence["sent1"], sentence["sent2"]
    sent1 = nltk.word_tokenize(sent1)
    sent2 = nltk.word_tokenize(sent2)
    t_POS_sent1 = nltk.pos_tag(sent1)
    t_POS_sent2 = nltk.pos_tag(sent2)
    ne_chunk1 = nltk.ne_chunk(t_POS_sent1)
    ne_chunk2 = nltk.ne_chunk(t_POS_sent2)
    conll1 = nltk.chunk.tree2conllstr(ne_chunk1)
    conll2 = nltk.chunk.tree2conllstr(ne_chunk2)
    sentence["conll1"] = processCoNLL(conll1)
    sentence["conll2"] = processCoNLL(conll2)

In [8]:
pprint(inputText[2])

{'conll1': {"''",
            "'s",
            ',',
            '.',
            '1896',
            'It',
            'PERSON',
            '``',
            'a',
            'black',
            'controlled',
            'eye',
            'family',
            'has',
            'huge',
            'paper',
            'publisher',
            'said',
            'since',
            'the',
            'whose'},
 'conll2': {"''",
            "'s",
            ',',
            '.',
            'It',
            'PERSON',
            '``',
            'a',
            'black',
            'eye',
            'huge',
            'newspaper',
            'of',
            'publisher',
            'said',
            'scandal',
            'the'},
 'sent1': '"It\'s a huge black eye," said publisher Arthur Ochs Sulzberger '
          'Jr., whose family has controlled the paper since 1896.',
 'sent2': '"It\'s a huge black eye," Arthur Sulzberger, the newspaper\'s '
          'publisher, sa

In [6]:
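def computeSimilarity(sentenceDict):
    """
    returns the Jaccard similarity (1 - Jaccard distance) between the two
    token sets of a sentence pair
    """
    context1 = set(sentenceDict["conll1"])
    context2 = set(sentenceDict["conll2"])
    return 1-jaccard_distance(context1, context2)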
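
The value returned above is the Jaccard coefficient, i.e. the size of the intersection of the two sets divided by the size of their union. A minimal plain-Python sketch of that formula follows; the helper name `jaccard_coefficient`, the sample token lists, and the convention of returning 1.0 for two empty sets are assumptions made for illustration only.

```python
def jaccard_coefficient(a, b):
    """|A & B| / |A | B| for two collections of tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        # assumed convention: two empty sets are treated as identical
        return 1.0
    return len(a & b) / len(a | b)


# illustrative token lists in the style of the statement's example
sent_a = ["John Smith", "is", "working"]
sent_b = ["John Smith", "was", "working"]

score = jaccard_coefficient(sent_a, sent_b)
print(score)  # 2 shared items out of 4 distinct items -> 0.5
```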

In [7]:
testDistances = list(map(computeSimilarity, inputText))
refDistances = [float(value[0]) for value in gsText]
pcorr = pearsonr(refDistances, testDistances)[0]

# formatting for purely demonstrative purposes
print(f"pearsonr({list(map(lambda x:float('%.2f' % x), refDistances))}, {list(map(lambda x:float('%.2f' % x), testDistances))}) = {pcorr}")

.75, 4.5, 5.0, 5.0, 4.5, 3.0, 3.4, 4.5, 3.8, 2.83, 4.75, 4.75, 4.5, 5.0, 4.25, 5.0, 3.6, 0.5, 2.2, 5.0, 5.0, 4.67, 4.75, 4.25, 5.0, 5.0, 4.75, 4.5, 4.75, 4.5, 2.5, 1.75, 2.75, 2.25, 5.0, 3.75, 4.0, 4.0, 4.75, 5.0, 5.0, 4.5, 4.25, 5.0, 5.0, 4.75, 4.5, 5.0, 5.0, 4.8, 5.0, 3.4, 5.0, 5.0, 5.0, 5.0, 3.75, 5.0, 4.6, 4.75, 4.8, 4.0, 3.75, 4.71, 5.0, 4.6, 3.2, 4.5, 4.0, 3.0, 4.6, 2.25, 4.0, 4.4, 2.83, 2.75, 4.75, 5.0, 5.0, 5.0, 3.75, 4.25, 4.75, 5.0, 4.5, 4.4, 5.0, 4.75, 4.6, 3.75, 4.75, 4.75, 5.0, 2.25, 4.5, 4.75, 4.75, 4.8, 4.25, 2.5, 5.0, 5.0, 5.0, 3.5, 4.25, 4.25, 4.0, 0.25, 4.25, 5.0, 3.75, 3.75, 5.0, 2.75, 4.75, 1.75, 4.5, 3.5, 4.5, 4.75, 5.0, 5.0, 4.75, 4.0, 5.0, 5.0, 4.5, 5.0, 4.75, 5.0, 5.0, 4.75, 4.25, 4.25, 4.75, 5.0, 4.5, 3.5, 2.75, 3.75, 3.5, 5.0, 4.5, 4.75, 5.0, 4.4, 4.25, 4.5, 4.75, 4.25, 3.8, 4.5, 4.5, 4.25, 5.0, 4.25, 4.5, 4.5, 4.75, 4.75, 4.75, 2.5, 4.5, 3.5, 4.4, 4.25, 3.75, 4.75, 4.75, 4.75, 4.25, 5.0, 2.83, 5.0, 5.0, 4.75, 4.75, 3.5, 5.0, 4.75, 4.0, 5.0, 4.25, 4.8, 5.0, 4.
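
The gold standard scores lie on a 0 to 5 scale while the Jaccard similarity lies between 0 and 1. This does not affect the comparison, because the Pearson coefficient is invariant to a positive linear rescaling of either variable. The sketch below illustrates this with a plain-Python version of the coefficient; the function name `pearson` and the numbers are made up for the example and are not part of the data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# made-up values: gold scores in [0, 5], similarities in [0, 1]
gold = [1.0, 2.0, 3.0, 4.0]
similarities = [0.2, 0.4, 0.6, 0.8]

r = pearson(gold, similarities)
r_rescaled = pearson(gold, [5 * s for s in similarities])
print(round(r, 6), round(r_rescaled, 6))
```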

***

## Conclusion

In this session, the test set was evaluated using NE tagging. In this implementation, every named entity in a sentence is replaced by the label it was detected as, e.g. "Victor Badenas" is replaced by PERSON. The Jaccard similarity between the resulting token sets was then computed, and a Pearson correlation of 0.35 with the gold standard was obtained. This value is surprisingly low, considering that this processing removes the dependency on the variability of places, names and dates. However, due to time constraints, I have not been able to develop it further.
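
One direction for further work would be to follow the example in the statement more closely and keep each entity as a single joined item ('John Smith') instead of its label, so that different people no longer collapse into the same PERSON token. The sketch below is only an illustration of that variant under stated assumptions: the function name `joinEntities` is hypothetical, the sample string is hand-written in the "token POS-tag IOB-tag" layout, and the approach has not been evaluated on the data.

```python
def joinEntities(conll):
    """Return the token set, with each named entity joined into one item."""
    items = []
    current = []  # tokens of the entity currently being read
    for line in conll.splitlines():
        if not line.strip():
            continue
        token, _pos, ne_tag = line.split()
        if ne_tag.startswith("B-"):
            # a new entity starts: flush the previous one, if any
            if current:
                items.append(" ".join(current))
            current = [token]
        elif ne_tag.startswith("I-") and current:
            current.append(token)
        else:
            # ordinary token: flush any pending entity first
            if current:
                items.append(" ".join(current))
                current = []
            items.append(token)
    if current:
        items.append(" ".join(current))
    return set(items)


# hand-written sample (assumption)
sample = "John NNP B-PERSON\nSmith NNP I-PERSON\nis VBZ O\nworking VBG O"
word_and_NEs = joinEntities(sample)
print(sorted(word_and_NEs))
```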

***

### End of P4