# Lab.3: Morphology
## Introduction to Human Language Technologies
### Victor Badenas Crespo

***

### Statement:

- Read all pairs of sentences of the trial set within the evaluation framework of the project.

- Compute their similarities by considering lemmas and Jaccard distance.

- Compare the results with those in session 2 (document structure) in which words were considered.

- Compare the results with the gold standard by giving the Pearson correlation between them.

- Questions (justify the answers):

    - Which is better: words or lemmas?

    - Do you think that one of them could perform better for any pair of texts?
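
As a rough illustration of why lemmas might change the Jaccard distance, the sketch below compares two token sets before and after mapping inflected forms to a base form. The `TOY_LEMMAS` table and the `toy_lemmatize` and `jaccard` helpers are hypothetical names introduced here; the table only stands in for a real lemmatizer, and the lowercase whitespace tokenization is a simplification of the trial sentences.

```python
# Hypothetical lookup table standing in for a real lemmatizer (assumption).
TOY_LEMMAS = {"invaded": "invade", "attempted": "attempt", "troops": "troop"}


def jaccard(a, b):
    """Jaccard distance between two non-empty sets: 1 - |A & B| / |A | B|."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)


def toy_lemmatize(tokens):
    """Map each token to its toy lemma, leaving unknown tokens unchanged."""
    return {TOY_LEMMAS.get(token, token) for token in tokens}


# Simplified, lowercased fragments inspired by the second trial pair.
words1 = set("the troops attempted to invade kabul".split())
words2 = set("the army invaded kabul".split())

word_distance = jaccard(words1, words2)
lemma_distance = jaccard(toy_lemmatize(words1), toy_lemmatize(words2))

print(f"words:  {word_distance:.3f}")
print(f"lemmas: {lemma_distance:.3f}")
```

In this toy case `invaded` and `invade` collapse to the same lemma, so the sets share one more element and the distance drops. Whether the same happens on the real trial pairs depends on the lemmatizer and the sentences.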


In [1]:
import nltk
from pprint import pprint
from scipy.stats import pearsonr
from nltk.metrics import jaccard_distance
from pathlib import Path
nltk.download('punkt')

DATA_FOLDER = Path('./trial')

[nltk_data] Downloading package punkt to /Users/victor/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
def readFile(filePath):
    """
    Read a file and return a list of lists: one entry per line,
    each line split on tab characters.
    """
    with open(filePath, 'r') as fileHandler:
        data = fileHandler.readlines()
    
    # split every line by tabs
    data = list(map(lambda x: x.strip().split('\t'), data))
    return data

def toDict(line):
    """
    creates a dict with fields id sent1 sent2 from the values in line
    """
    keys = ("id", "sent1", "sent2")
    return dict(zip(keys, line))

# read file data
inputText = readFile(DATA_FOLDER / 'STS.input.txt')
gsText = readFile(DATA_FOLDER / 'STS.gs.txt')

# convert to previously defined dict structure
inputText = list(map(toDict, inputText))
pprint(inputText)

[{'id': 'id1',
  'sent1': 'The bird is bathing in the sink.',
  'sent2': 'Birdie is washing itself in the water basin.'},
 {'id': 'id2',
  'sent1': 'In May 2010, the troops attempted to invade Kabul.',
  'sent2': 'The US army invaded Kabul on May 7th last year, 2010.'},
 {'id': 'id3',
  'sent1': 'John said he is considered a witness but not a suspect.',
  'sent2': '"He is not a suspect anymore." John said.'},
 {'id': 'id4',
  'sent1': 'They flew out of the nest in groups.',
  'sent2': 'They flew into the nest together.'},
 {'id': 'id5',
  'sent1': 'The woman is playing the violin.',
  'sent2': 'The young lady enjoys listening to the guitar.'},
 {'id': 'id6',
  'sent1': 'John went horse back riding at dawn with a whole group of friends.',
  'sent2': 'Sunrise at dawn is a magnificent view to take in if you wake up '
           'early enough for it.'}]


In [3]:
def computeSimilarity(sentencePair):
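    """
    Compute the Jaccard distance between the two sentences of a pair by:
    - tokenizing each sentence into words
    - converting the tokens to sets
    - applying jaccard_distance to the two sets
    Note that the returned value is a distance (0 means identical sets).
    """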
    sent1 = set(nltk.word_tokenize(sentencePair['sent1'], language='english'))
    sent2 = set(nltk.word_tokenize(sentencePair['sent2'], language='english'))
    return jaccard_distance(sent1, sent2)
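
To make the set arithmetic concrete, here is a minimal sketch that works through the fourth trial pair by hand. The regular-expression tokenizer is an assumption used only to keep the example free of downloads; it is not NLTK's `word_tokenize`, although for these two short sentences it should split them the same way.

```python
import re


def simple_tokenize(text):
    """Split into word tokens and single punctuation marks (assumed stand-in)."""
    return re.findall(r"\w+|[^\w\s]", text)


sent1 = set(simple_tokenize("They flew out of the nest in groups."))
sent2 = set(simple_tokenize("They flew into the nest together."))

shared = sent1 & sent2   # tokens present in both sentences
union = sent1 | sent2    # tokens present in either sentence

# Jaccard distance: fraction of the union that is NOT shared.
distance = (len(union) - len(shared)) / len(union)

print(sorted(shared))
print(f"{len(shared)} shared tokens out of {len(union)} -> distance {distance:.3f}")
```

Five of the eleven distinct tokens are shared, which gives a distance of 6/11, about 0.545. The corresponding similarity would be `1 - distance`.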

In [4]:
testDistances = list(map(computeSimilarity, inputText))
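# gold-standard scores are divided by len(gsText) - 1 (= 5 for the six trial pairs);
# the Pearson correlation is unaffected by this rescaling
refDistances = [float(value)/(len(gsText)-1) for _, value in gsText]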

pcorr = pearsonr(refDistances, testDistances)[0]

# round to two decimals for display only
roundedRef = [round(x, 2) for x in refDistances]
roundedTest = [round(x, 2) for x in testDistances]
print(f"pearsonr({roundedRef}, {roundedTest}) = {pcorr}")

pearsonr([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], [0.69, 0.74, 0.53, 0.55, 0.77, 0.86]) = 0.3962389776119232
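
For readers who want to see what the coefficient measures, the following sketch recomputes it in plain Python from the rounded values printed above. The `pearson` helper is a hypothetical name written for this illustration, not SciPy's implementation, and because the inputs are rounded the result only approximates the value reported by `pearsonr`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Rounded values copied from the printed output above.
gold = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
distances = [0.69, 0.74, 0.53, 0.55, 0.77, 0.86]

r = pearson(gold, distances)

# Rescaling one variable by a positive constant leaves r unchanged.
r_rescaled = pearson([5 * g for g in gold], distances)

# Turning each distance d into a similarity 1 - d flips the sign of r.
r_similarity = pearson(gold, [1 - d for d in distances])

print(f"r = {r:.3f}, rescaled = {r_rescaled:.3f}, similarity = {r_similarity:.3f}")
```

Two properties are worth keeping in mind when reading the result: the normalization of the gold-standard scores has no effect on the coefficient, and the sign depends on whether a distance or a similarity is being correlated.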
