# Lab.5: Lexical semantics
## Introduction to Human Language Technologies
### Victor Badenas Crespo

***

### Statement

- Read all pairs of sentences of the trial set within the evaluation framework of the project.
- Apply Lesk’s algorithm to the words in the sentences.
- Compute their similarities by considering senses and the Jaccard coefficient.
- Compare the results with those in sessions 2 (document) and 3 (morphology), in which words and lemmas were considered.
- Compare the results with the gold standard by giving the Pearson correlation between them.

***

## Solution

In [1]:
# core imports
from pprint import pprint
from pathlib import Path
from collections import Counter

# scipy imports
from scipy.stats import pearsonr

# nltk imports
import nltk
from nltk.metrics import jaccard_distance
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# nltk downloads
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# constants definition
DATA_FOLDER = Path('./trial')

[nltk_data] Downloading package punkt to /Users/victor/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/victor/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/victor/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


First, functions for reading and structuring the data are declared. The input data is then read; each of its lines contains \[id, sentence1, sentence2\]. The gold standard is read as well. Finally, each line of `inputText` is converted into a dict object with the following format for readability:

```json
{
    "id": <id_string>,
    "sent1": <sentence_string>,
    "sent2": <sentence_string>
}
```

In [2]:
def readFile(filePath):
    """
    reads and returns a list of lists containing the text split by line 
    jumps and by tab characters
    """
    with open(filePath, 'r') as fileHandler:
        data = fileHandler.readlines()
    
    # split every line by tabs
    data = list(map(lambda x: x.strip().split('\t'), data))
    return data

def toDict(line):
    """
    creates a dict with fields id sent1 sent2 from the values in line
    """
    keys = ("id", "sent1", "sent2")
    return dict(zip(keys, line))

# read file data
inputText = readFile(DATA_FOLDER / 'STS.input.txt')
gsText = readFile(DATA_FOLDER / 'STS.gs.txt')

# convert to previously defined dict structure
inputText = list(map(toDict, inputText))
pprint(inputText)

[{'id': 'id1',
  'sent1': 'The bird is bathing in the sink.',
  'sent2': 'Birdie is washing itself in the water basin.'},
 {'id': 'id2',
  'sent1': 'In May 2010, the troops attempted to invade Kabul.',
  'sent2': 'The US army invaded Kabul on May 7th last year, 2010.'},
 {'id': 'id3',
  'sent1': 'John said he is considered a witness but not a suspect.',
  'sent2': '"He is not a suspect anymore." John said.'},
 {'id': 'id4',
  'sent1': 'They flew out of the nest in groups.',
  'sent2': 'They flew into the nest together.'},
 {'id': 'id5',
  'sent1': 'The woman is playing the violin.',
  'sent2': 'The young lady enjoys listening to the guitar.'},
 {'id': 'id6',
  'sent1': 'John went horse back riding at dawn with a whole group of friends.',
  'sent2': 'Sunrise at dawn is a magnificent view to take in if you wake up '
           'early enough for it.'}]


In [3]:
for sentence in inputText:
    # tokenize both sentences and tag every token with its Penn Treebank POS
    sent1 = nltk.word_tokenize(sentence["sent1"])
    sent2 = nltk.word_tokenize(sentence["sent2"])
    t_POS_sent1 = nltk.pos_tag(sent1)
    t_POS_sent2 = nltk.pos_tag(sent2)

    # disambiguate every token with Lesk, using the whole sentence as context
    # and the lowercased first letter of the tag as the WordNet POS
    synsets1 = [lesk(sent1, word, pos=tag[0].lower()) for word, tag in t_POS_sent1]
    synsets2 = [lesk(sent2, word, pos=tag[0].lower()) for word, tag in t_POS_sent2]

    # drop the tokens for which Lesk found no sense
    synsets1 = [synset for synset in synsets1 if synset is not None]
    synsets2 = [synset for synset in synsets2 if synset is not None]

    sentence["synsets1"] = synsets1
    sentence["synsets2"] = synsets2
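
The cell above passes the lowercased first letter of each Penn Treebank tag to `lesk` as `pos`. That works for nouns (`NN*` gives `n`), verbs (`VB*` gives `v`) and adverbs (`RB*` gives `r`), but adjectives are tagged `JJ*` while WordNet labels them `a`, so with this shortcut they should never match a synset and end up filtered out. A minimal sketch of an explicit mapping follows. `penn_to_wordnet` is a hypothetical helper, not part of this notebook or of NLTK, and using it would change the synsets and the correlation reported further down.

```python
# Hypothetical helper (not in the original notebook): map a Penn Treebank
# tag to the single-letter POS codes used by WordNet.
# WordNet also marks satellite adjectives with 's'; this sketch ignores them.
def penn_to_wordnet(tag):
    """Return 'n', 'v', 'a' or 'r' for open-class tags, else None."""
    if tag.startswith('NN'):
        return 'n'
    if tag.startswith('VB'):
        return 'v'
    if tag.startswith('JJ'):
        return 'a'
    if tag.startswith('RB'):
        return 'r'
    return None  # determiners, prepositions, punctuation, ...

# example tags only, not taken from the trial data
tags = ['DT', 'NN', 'VBZ', 'JJ', 'RB', 'IN']
print([penn_to_wordnet(t) for t in tags])
```

With such a helper, tokens mapped to `None` could be skipped before calling `lesk`, instead of relying on `lesk` returning `None` for them.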

In [4]:
pprint(inputText)

[{'id': 'id1',
  'sent1': 'The bird is bathing in the sink.',
  'sent2': 'Birdie is washing itself in the water basin.',
  'synsets1': [Synset('bird.n.02'),
               Synset('be.v.12'),
               Synset('bathe.v.01'),
               Synset('sinkhole.n.01')],
  'synsets2': [Synset('shuttlecock.n.01'),
               Synset('be.v.12'),
               Synset('wash.v.09'),
               Synset('body_of_water.n.01'),
               Synset('washbasin.n.01')]},
 {'id': 'id2',
  'sent1': 'In May 2010, the troops attempted to invade Kabul.',
  'sent2': 'The US army invaded Kabul on May 7th last year, 2010.',
  'synsets1': [Synset('whitethorn.n.01'),
               Synset('troop.n.02'),
               Synset('undertake.v.01'),
               Synset('invade.v.01'),
               Synset('kabul.n.01')],
  'synsets2': [Synset('uranium.n.01'),
               Synset('united_states_army.n.01'),
               Synset('invade.v.03'),
               Synset('kabul.n.01'),
               Synset(

In [5]:
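def computeSimilarity(sentenceDict):
    """
    returns the Jaccard distance (1 - Jaccard coefficient) between the
    synset sets of both sentences: 0 for identical sets, 1 for disjoint sets
    """
    context1 = set(sentenceDict["synsets1"])
    context2 = set(sentenceDict["synsets2"])
    return jaccard_distance(context1, context2)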
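
Despite its name, `computeSimilarity` returns a distance: `jaccard_distance(A, B)` is 1 - |A ∩ B| / |A ∪ B|, so 0 means identical sets and 1 means no shared element. The following pure-Python sketch applies the same formula to made-up string labels standing in for `Synset` objects. The helper name `jaccard_dist` and the labels are assumptions for illustration only.

```python
def jaccard_dist(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # convention chosen in this sketch for two empty sets
    return (len(union) - len(a & b)) / len(union)

# illustrative labels, not real Lesk output
s1 = {'bird', 'be', 'bathe', 'sink'}
s2 = {'bird', 'be', 'wash', 'water', 'basin'}

d = jaccard_dist(s1, s2)   # 2 shared labels out of 7 distinct ones
print(d, 1 - d)            # distance and the corresponding similarity
```

The Jaccard coefficient asked for in the statement is the complement `1 - d`.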

In [6]:
testDistances = list(map(computeSimilarity, inputText))

In [7]:
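# scale the gold standard scores to [0, 1]; len(gsText) - 1 equals 5 here only
# because the trial set has 6 pairs and the scores range from 0 to 5
refDistances = [float(value)/(len(gsText)-1) for _, value in gsText]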

pcorr = pearsonr(refDistances, testDistances)[0]

# formatting for purely demonstrative purposes
print(f"pearsonr({list(map(lambda x:float('%.2f' % x), refDistances))}, {list(map(lambda x:float('%.2f' % x), testDistances))}) = {pcorr}")

pearsonr([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], [0.88, 0.78, 0.38, 1.0, 1.0, 1.0]) = 0.4195840591519665
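
One point worth checking is the orientation of the two vectors. `testDistances` holds Jaccard distances, and Pearson's r changes sign when one vector is replaced by its complement `1 - d`. The sketch below recomputes the coefficient in pure Python from the rounded values printed above, so it only approximates the figure obtained with `pearsonr`. It assumes the gold standard values are similarity scores, which this notebook does not state; the `pearson` helper is written here for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# rounded values copied from the printed output above
gold = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
dist = [0.88, 0.78, 0.38, 1.0, 1.0, 1.0]
sim = [1 - d for d in dist]   # Jaccard coefficient instead of distance

r_dist = pearson(gold, dist)
r_sim = pearson(gold, sim)
print(r_dist, r_sim)          # same magnitude, opposite sign
```

If the gold standard does encode similarity, the value comparable with it would be the one computed from `sim`, not from `dist`.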


*** 

## Conclusion

The correlation value (about 0.42) is almost the same as in the previous sessions, where words and lemmas were compared. The slightly higher value may suggest that comparing the synsets chosen by Lesk with `jaccard_distance` detects more similarities between the sentences, though perhaps too many. The distance values also show that the method tends to overestimate the distance between sentences: three of the six pairs receive the maximum distance of 1.0.

***

### End of P4