# Lab.2: Document Structure
## Introduction to Human Language Technologies
### Victor Badenas Crespo

***

Import the necessary packages and declare the environment variables.

In [1]:
import nltk
from pprint import pprint
from scipy.stats import pearsonr
from nltk.metrics import jaccard_distance
from pathlib import Path
nltk.download('punkt')

DATA_FOLDER = Path('./trial')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\victo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


First, the functions for reading and structuring the data are declared. The input data is then read; each line contains \[id, sentence1, sentence2\] separated by tabs. The gold standard data is read as well. Finally, each line of inputText is converted into a dict with the following format for readability:

```json
{
    "id": <id_string>,
    "sent1": <sentence_string>,
    "sent2": <sentence_string>
}
```

In [2]:
def readFile(filePath):
    """
    reads and returns a list of lists containing the text split by line 
    jumps and by tab characters
    """
    with open(filePath, 'r') as fileHandler:
        data = fileHandler.readlines()
    
    # split every line by tabs
    data = list(map(lambda x: x.strip().split('\t'), data))
    return data

def toDict(line):
    """
    creates a dict with fields id sent1 sent2 from the values in line
    """
    keys = ("id", "sent1", "sent2")
    return dict(zip(keys, line))

# read file data
inputText = readFile(DATA_FOLDER / 'STS.input.txt')
gsText = readFile(DATA_FOLDER / 'STS.gs.txt')

# convert to previously defined dict structure
inputText = list(map(toDict, inputText))
pprint(inputText)

[{'id': 'id1',
  'sent1': 'The bird is bathing in the sink.',
  'sent2': 'Birdie is washing itself in the water basin.'},
 {'id': 'id2',
  'sent1': 'In May 2010, the troops attempted to invade Kabul.',
  'sent2': 'The US army invaded Kabul on May 7th last year, 2010.'},
 {'id': 'id3',
  'sent1': 'John said he is considered a witness but not a suspect.',
  'sent2': '"He is not a suspect anymore." John said.'},
 {'id': 'id4',
  'sent1': 'They flew out of the nest in groups.',
  'sent2': 'They flew into the nest together.'},
 {'id': 'id5',
  'sent1': 'The woman is playing the violin.',
  'sent2': 'The young lady enjoys listening to the guitar.'},
 {'id': 'id6',
  'sent1': 'John went horse back riding at dawn with a whole group of friends.',
  'sent2': 'Sunrise at dawn is a magnificent view to take in if you wake up '
           'early enough for it.'}]


A function is defined that tokenizes both sentences, converts the tokens to sets, and computes the Jaccard distance between the two sets. Despite its name, `computeSimilarity` returns a distance, not a similarity.

In [3]:
def computeSimilarity(sentencePair):
    """
    function responsible for:
    - tokenizing the words in the sentence
    - converting to set
    - computing the jaccard_distance metric
    """
    sent1 = set(nltk.word_tokenize(sentencePair['sent1'], language='english'))
    sent2 = set(nltk.word_tokenize(sentencePair['sent2'], language='english'))
    return jaccard_distance(sent1, sent2)

The previous function is used to compute the distances for every pair in inputText. The reference distances are also extracted from the gold standard data.

In [4]:
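testDistances = list(map(computeSimilarity, inputText))
# scale the gold standard scores to [0, 1]; len(gsText)-1 equals 5 here,
# which is also the largest score in this six-pair trial file
refDistances = [float(value)/(len(gsText)-1) for _, value in gsText]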

Both sets of distances are then compared using the Pearson correlation.

In [5]:
pcorr = pearsonr(refDistances, testDistances)[0]

# formatting for purely demonstrative purposes
print(f"pearsonr({list(map(lambda x:float('%.2f' % x), refDistances))}, {list(map(lambda x:float('%.2f' % x), testDistances))}) = {pcorr}")

pearsonr([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], [0.69, 0.74, 0.53, 0.55, 0.77, 0.86]) = 0.3962389776119232
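
As a cross-check on what `pearsonr` computes, the coefficient can be sketched in plain Python. This is a minimal illustration, not the SciPy implementation; the helper name `pearson` is hypothetical, and the inputs are the two-decimal values printed above rather than the full-precision distances.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Rounded values copied from the printed output above
ref = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
test = [0.69, 0.74, 0.53, 0.55, 0.77, 0.86]

r = pearson(ref, test)
# Rescaling the reference (here back to a 0-5 range) leaves r unchanged
r_rescaled = pearson([5 * x for x in ref], test)
print(r, r_rescaled)
```

The result is close to, but not exactly, the value reported above because the inputs are rounded. The second call illustrates that the coefficient does not depend on how the reference scores are scaled.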


***

## Conclusion

The gold standard provides a measure of distance between the two sentences of each pair, identified by its id. The Jaccard similarity, defined by the following equation, measures the overlap between the two sets of words it is given; jaccard_distance is its complement, $1 - S_{jaccard}$.

\begin{equation}
    S_{jaccard}(X,Y)=\frac{\vert X \cap Y\vert}{\vert X \cup Y\vert}
\end{equation}
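
The equation and its complement can be sketched in a few lines of plain Python. The function names `jaccard_similarity` and `jaccard_dist` and the toy sets are illustrative assumptions, not part of NLTK; returning 1.0 for two empty sets is also a convention chosen here.

```python
def jaccard_similarity(x, y):
    """|X ∩ Y| / |X ∪ Y| for two sets (assumed 1.0 when both are empty)."""
    union = x | y
    if not union:
        return 1.0
    return len(x & y) / len(union)

def jaccard_dist(x, y):
    """Complement of the similarity: 1 - S(X, Y)."""
    return 1.0 - jaccard_similarity(x, y)

# Toy example: 2 shared tokens out of 4 distinct tokens
a = {"the", "cat", "sat"}
b = {"the", "cat", "ran"}
print(jaccard_similarity(a, b))
print(jaccard_dist(a, b))
```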

However, the words of a sentence alone do not capture the semantic meaning of the whole sentence. Because of that, the correlation between the gold standard and the jaccard_distance metric is only about 0.4.

For instance, the first pair of sentences (the most similar pair, whose distance should be 0) produces the following results:


In [6]:
sentenceTestPair = inputText[0]
sent1 = sentenceTestPair['sent1']
sent2 = sentenceTestPair['sent2']
print(f"\"{sent1}\", \"{sent2}\"")

"The bird is bathing in the sink.", "Birdie is washing itself in the water basin."


Considering the context or overall meaning of the sentences, it is clear that they mean the same thing. However, they are phrased differently, using related word forms such as `bird` and `Birdie`. Such differences count as mismatches and increase the computed distance.

Looking at a more step-by-step implementation of the jaccard_distance, we can compute and then analyse the intersection and union operands that make up the similarity metric:

In [7]:
sent1 = set(nltk.word_tokenize(sent1, language='english'))
sent2 = set(nltk.word_tokenize(sent2, language='english'))
print(sorted(sent1))
print(sorted(sent2))

['.', 'The', 'bathing', 'bird', 'in', 'is', 'sink', 'the']
['.', 'Birdie', 'basin', 'in', 'is', 'itself', 'the', 'washing', 'water']


In [8]:
intersection = sent1.intersection(sent2)
union = sent1.union(sent2)
print(intersection)
print(union)

{'is', 'in', 'the', '.'}
{'sink', 'washing', 'water', 'bathing', 'in', 'The', 'Birdie', 'bird', 'the', 'basin', 'itself', 'is', '.'}


From that computation, the only tokens present in both sentences are {'is', 'the', '.', 'in'}, which are very common and carry no real semantic information. Because the intersection is so small, the similarity metric (as shown in the next cell) is small and, as a consequence, the distance is large.

In [9]:
print("len(intersection):", len(intersection))
print("len(union):", len(union))
print("computed jaccard similarity:", len(intersection)/len(union))
print("computed jaccard_distance:", 1-(len(intersection)/len(union)))

len(intersection): 4
len(union): 13
computed jaccard similarity: 0.3076923076923077
computed jaccard_distance: 0.6923076923076923


The computed Jaccard distance matches the value returned by the nltk package's function. Because the two sentences share far fewer words than their meaning would suggest, we can conclude that the Jaccard distance alone is not a good enough metric to represent the distance between sentences. Preprocessing that filters stopwords and then reduces derived words to their base form (stemming or lemmatization) might be beneficial before computing the distance.
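
The stopword-filtering part of that idea can be sketched as follows. This is only an illustration under stated assumptions: `STOPWORDS` is a tiny hand-written list (a real list would be much larger), `preprocess` and `jaccard_distance_sets` are hypothetical helpers, and a simple regular expression stands in for the NLTK tokenizer. No stemming is applied.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {"the", "is", "in", "a", "an", "of", "to", "itself"}

def preprocess(sentence):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {tok for tok in tokens if tok not in STOPWORDS}

def jaccard_distance_sets(a, b):
    """1 - |A ∩ B| / |A ∪ B| (assumed 0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

s1 = preprocess("The bird is bathing in the sink.")
s2 = preprocess("Birdie is washing itself in the water basin.")
print(sorted(s1))
print(sorted(s2))
print(jaccard_distance_sets(s1, s2))
```

For this pair, all shared tokens were stopwords or punctuation, so after filtering nothing overlaps and the distance rises to 1.0. Matching `bird` with `birdie` would need an additional normalization step that is not shown here, and whether the preprocessing helps overall would have to be checked against the gold standard.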

***

## Optional: extract the correlation from all gold standards from test-gold

In this case, the gold standard is a similarity score, and it is converted to a distance using:

$$D_{jaccard}=1-S_{jaccard}$$

In [10]:
TEST_PATH = Path('./test-gold')

inputTestPaths = sorted(TEST_PATH.glob("STS.input.*.txt"))

for inputTestPath in inputTestPaths:
    print('-'*60)
    gsTestPath = inputTestPath.parent / (inputTestPath.stem.replace('input', 'gs') + '.txt')

    inputText = readFile(inputTestPath)
    gsText = readFile(gsTestPath)

    gsDistances = [1-float(value) for line in gsText for value in line]

    distanceScores = list()
    for sentencePair in inputText:
        sentencePair[0] = set(nltk.word_tokenize(sentencePair[0], language='english'))
        sentencePair[1] = set(nltk.word_tokenize(sentencePair[1], language='english'))
        distanceScores.append(jaccard_distance(*sentencePair))

    correlation = pearsonr(distanceScores, gsDistances)[0]
    print(f"correlation for file {inputTestPath}: {correlation:.2f}")
print('-'*60)


------------------------------------------------------------
correlation for file test-gold\STS.input.MSRpar.txt: 0.51
------------------------------------------------------------
correlation for file test-gold\STS.input.MSRvid.txt: 0.36
------------------------------------------------------------
correlation for file test-gold\STS.input.SMTeuroparl.txt: 0.45
------------------------------------------------------------
correlation for file test-gold\STS.input.surprise.OnWN.txt: 0.64
------------------------------------------------------------
correlation for file test-gold\STS.input.surprise.SMTnews.txt: 0.36
------------------------------------------------------------


***

### End of P2