# Lab.3: Morphology
## Introduction to Human Language Technologies
### Victor Badenas Crespo

***

### Statement:

- Read all pairs of sentences of the trial set within the evaluation framework of the project.

- Compute their similarities by considering lemmas and Jaccard distance.

- Compare the results with those in session 2 (document structure) in which words were considered.

- Compare the results with the gold standard by giving the Pearson correlation between them.

- Questions (justify the answers):

    - Which is better: words or lemmas?

    - Do you think that could perform better for any pair of texts?


### Solution

Import the necessary packages and declare environment variables.

In [1]:
# core imports
from pprint import pprint
from pathlib import Path
from collections import Counter

# scipy imports
from scipy.stats import pearsonr

# nltk imports
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.metrics import jaccard_distance
from nltk.corpus import wordnet as wn

# nltk downloads
nltk.download('punkt')
nltk.download('wordnet')

# constants definition
DATA_FOLDER = Path('./trial')

[nltk_data] Downloading package punkt to /Users/victor/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/victor/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


First, functions for reading and structuring the data are declared. The input data is then read; each of its lines contains \[id, sentence1, sentence2\]. The gold standard data is also read. Finally, each line of `inputText` is converted into a dict object with the following format for readability:

```json
{
    "id": <id_string>,
    "sent1": <sentence_string>,
    "sent2": <sentence_string>
}
```

In [2]:
def readFile(filePath):
    """
    reads the file and returns a list of lists containing the text
    split by line breaks and by tab characters
    """
    with open(filePath, 'r') as fileHandler:
        data = fileHandler.readlines()
    
    # split every line by tabs
    data = list(map(lambda x: x.strip().split('\t'), data))
    return data

def toDict(line):
    """
    creates a dict with fields id sent1 sent2 from the values in line
    """
    keys = ("id", "sent1", "sent2")
    return dict(zip(keys, line))

# read file data
inputText = readFile(DATA_FOLDER / 'STS.input.txt')
gsText = readFile(DATA_FOLDER / 'STS.gs.txt')

# convert to previously defined dict structure
inputText = list(map(toDict, inputText))
pprint(inputText)

[{'id': 'id1',
  'sent1': 'The bird is bathing in the sink.',
  'sent2': 'Birdie is washing itself in the water basin.'},
 {'id': 'id2',
  'sent1': 'In May 2010, the troops attempted to invade Kabul.',
  'sent2': 'The US army invaded Kabul on May 7th last year, 2010.'},
 {'id': 'id3',
  'sent1': 'John said he is considered a witness but not a suspect.',
  'sent2': '"He is not a suspect anymore." John said.'},
 {'id': 'id4',
  'sent1': 'They flew out of the nest in groups.',
  'sent2': 'They flew into the nest together.'},
 {'id': 'id5',
  'sent1': 'The woman is playing the violin.',
  'sent2': 'The young lady enjoys listening to the guitar.'},
 {'id': 'id6',
  'sent1': 'John went horse back riding at dawn with a whole group of friends.',
  'sent2': 'Sunrise at dawn is a magnificent view to take in if you wake up '
           'early enough for it.'}]


In addition to the `computeSimilarity` function, which is similar to the one in the `S2` deliverable, a new word lemmatizer function is defined that wraps the `lemmatize` method of NLTK's `WordNetLemmatizer`. The tokens of each sentence are first tagged with `nltk.pos_tag`, and the first letter of each tag decides how the word is handled: nouns (`N`) and verbs (`V`) are lemmatized with the corresponding part of speech passed as the `pos` kwarg, and every other word is returned unchanged.

In [3]:
lemmatizer = WordNetLemmatizer()
def lemmatizeWord(word, lemmatizer=lemmatizer):
    """
    Lemmatizes a (token, POS tag) pair:
    - if the tag starts with 'N' (noun) or 'V' (verb), the token is
      lemmatized with the matching WordNet part of speech
    - otherwise the token is returned unchanged
    """
    token, tag = word
    if tag[0] in {'N','V'}:
        return lemmatizer.lemmatize(token, pos=tag[0].lower())
    return token

def computeSimilarity(sentencePair):
    """
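    function responsible for:
    - tokenizing the words in each sentence and converting them to sets
    - POS tagging, lowercasing and lemmatizing the tokens
    - computing the jaccard_distance metric (lower means more similar)
    """
    sent1 = set(nltk.word_tokenize(sentencePair['sent1'], language='english'))
    sent2 = set(nltk.word_tokenize(sentencePair['sent2'], language='english'))
    # note: tagging happens after the conversion to a set, so the tagger
    # does not see the tokens in their original sentence order
    t_POS_sent1 = nltk.pos_tag(sent1)
    t_POS_sent2 = nltk.pos_tag(sent2)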
    t_POS_sent1 = [(word.lower(), pos) for word, pos in t_POS_sent1]
    t_POS_sent2 = [(word.lower(), pos) for word, pos in t_POS_sent2]
    lemmatizedSent1 = set(map(lemmatizeWord, t_POS_sent1))
    lemmatizedSent2 = set(map(lemmatizeWord, t_POS_sent2))
    return jaccard_distance(lemmatizedSent1, lemmatizedSent2)
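
The tag check in `lemmatizeWord` amounts to a small mapping from the Penn Treebank tags returned by `nltk.pos_tag` to the single-letter part-of-speech codes used by WordNet. The sketch below makes that mapping explicit without calling NLTK. The helper name `toWordnetPos` is hypothetical, and the adjective and adverb entries are an assumption about a possible extension; the notebook itself only lemmatizes nouns and verbs.

```python
# Hypothetical helper, not part of the notebook: maps the first letter of a
# Penn Treebank tag to the single-letter POS code used by WordNet.
# 'N' and 'V' mirror lemmatizeWord; 'J' and 'R' are an assumed extension.
TAG_TO_WORDNET_POS = {
    'N': 'n',  # nouns (NN, NNS, NNP, ...)
    'V': 'v',  # verbs (VB, VBD, VBG, ...)
    'J': 'a',  # adjectives (JJ, JJR, JJS)
    'R': 'r',  # adverbs (RB, RBR, RBS); the particle tag RP also starts with 'R'
}

def toWordnetPos(tag):
    """Returns the WordNet POS code for a Penn Treebank tag, or None."""
    return TAG_TO_WORDNET_POS.get(tag[:1])

for tag in ('NNS', 'VBD', 'JJ', 'RB', 'IN'):
    print(tag, '->', toWordnetPos(tag))
```

Tags without an entry, such as `IN` for prepositions, return `None`, which corresponds to the branch of `lemmatizeWord` that leaves the word unchanged.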

The distances are computed using both functions defined previously.

In [4]:
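testDistances = list(map(computeSimilarity, inputText))
# len(gsText)-1 equals 5 here, so the gold standard values are rescaled
# to [0, 1]; the Pearson correlation is not affected by this scaling
refDistances = [float(value)/(len(gsText)-1) for _, value in gsText]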

pcorr = pearsonr(refDistances, testDistances)[0]

# formatting for purely demonstrative purposes
print(f"pearsonr({list(map(lambda x:float('%.2f' % x), refDistances))}, {list(map(lambda x:float('%.2f' % x), testDistances))}) = {pcorr}")

pearsonr([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], [0.67, 0.59, 0.43, 0.55, 0.83, 0.86]) = 0.5790860088205632
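
The Pearson coefficient is invariant to positive rescaling of either input, so dividing the gold standard values by `len(gsText)-1` does not affect the result. The following sketch, which uses only the standard library, recomputes the coefficient from the rounded values printed above and checks that invariance. The function name `pearson` is illustrative, and because the inputs are rounded to two decimals the value only approximates the one reported by `pearsonr`.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    meanX = sum(xs) / n
    meanY = sum(ys) / n
    cov = sum((x - meanX) * (y - meanY) for x, y in zip(xs, ys))
    varX = sum((x - meanX) ** 2 for x in xs)
    varY = sum((y - meanY) ** 2 for y in ys)
    return cov / sqrt(varX * varY)

# rounded values copied from the printed output
ref = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
test = [0.67, 0.59, 0.43, 0.55, 0.83, 0.86]

r = pearson(ref, test)
# multiply by 5 to undo the division applied to the gold standard values
rRaw = pearson([5 * v for v in ref], test)
print(r, rRaw)
```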


***

## Conclusion


The correlation has increased by 0.19 points compared to the correlation obtained without lemmatization. This shows that the lemma-based metric is a better representation of the scores in the gold standard. However, it is apparent that the metric is not yet accurate enough. As an example, we'll take the `id2` sentence pair:

In [14]:
sampleSentencePair = inputText[1]
print(f"\"{sampleSentencePair['sent1']}\" / \"{sampleSentencePair['sent2']}\"")
sent1 = set(nltk.word_tokenize(sampleSentencePair['sent1'], language='english'))
sent2 = set(nltk.word_tokenize(sampleSentencePair['sent2'], language='english'))
t_POS_sent1 = nltk.pos_tag(sent1)
t_POS_sent2 = nltk.pos_tag(sent2)
t_POS_sent1 = [(word.lower(), pos) for word, pos in t_POS_sent1]
t_POS_sent2 = [(word.lower(), pos) for word, pos in t_POS_sent2]
lemmatizedSent1 = set(map(lemmatizeWord, t_POS_sent1))
lemmatizedSent2 = set(map(lemmatizeWord, t_POS_sent2))
prevIntersection = sent1.intersection(sent2)
newIntersection = lemmatizedSent1.intersection(lemmatizedSent2)
print("Raw intersection:", prevIntersection)
print("Lowercase and lemmatization intersection:", newIntersection)
print("Changed words:", prevIntersection.union(newIntersection) - prevIntersection.intersection(newIntersection))
print(f"{jaccard_distance(sent1, sent2)} -> {jaccard_distance(lemmatizedSent1, lemmatizedSent2)}")


"In May 2010, the troops attempted to invade Kabul." / "The US army invaded Kabul on May 7th last year, 2010."
Raw intersection: {',', 'May', 'Kabul', '2010', '.'}
Lowercase and lemmatization intersection: {'may', ',', '2010', '.', 'the', 'invade', 'kabul'}
Changed words: {'may', 'May', 'Kabul', 'the', 'invade', 'kabul'}
0.7368421052631579 -> 0.5882352941176471


As can be observed above, the words that have changed are `may`, `kabul`, `the` and `invade`:
- `may`: effect of the lowercase conversion. As it was already shared by both sentences before the conversion, it does not change the distance metric.
- `kabul`: same reasoning as `may`.
- `the`: this word was present in both sentences with different capitalization, so the lowercase conversion adds it to the intersection, even though it is a stopword.
- `invade`: the only change that carries meaning. Lemmatization removes the past-tense ending of `invaded`, so both sentences now share `invade` and the distance diminishes.

Because of that, the similarity between the sentences increases and the distance diminishes by 0.15 points.
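
The change in distance can be reproduced by hand with plain Python sets. The sketch below hard-codes the tokens of the `id2` pair instead of calling NLTK, and applies only the one lemma that affects the overlap (`invaded` to `invade`); other lemmas, such as `troops` to `troop`, are left out on the assumption that they do not change the set sizes. The helper `jaccardDistance` is an illustrative re-implementation of the formula, not the NLTK function.

```python
def jaccardDistance(a, b):
    """Jaccard distance: 1 - |intersection| / |union|."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)

sent1 = {'In', 'May', '2010', ',', 'the', 'troops', 'attempted', 'to',
         'invade', 'Kabul', '.'}
sent2 = {'The', 'US', 'army', 'invaded', 'Kabul', 'on', 'May', '7th',
         'last', 'year', ',', '2010', '.'}

# step 1: raw tokens
raw = jaccardDistance(sent1, sent2)

# step 2: lowercase only, which adds 'the' to the intersection
lower1 = {w.lower() for w in sent1}
lower2 = {w.lower() for w in sent2}
lowered = jaccardDistance(lower1, lower2)

# step 3: hand-applied lemma (assumption: only 'invaded' matters here)
lemma2 = {'invade' if w == 'invaded' else w for w in lower2}
lemmatized = jaccardDistance(lower1, lemma2)

print(raw, lowered, lemmatized)
```

With these sets the three distances are 14/19, 12/18 and 10/17, so roughly half of the improvement comes from lowercasing `the` and the rest from lemmatizing `invaded`.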

It can be argued that lemmas work better than words in most situations, as lemmatization eliminates the inflectional differences between forms of the same noun or verb, which should give a better similarity estimate.

***

### End of P3