# Lab.2: Document Structure
## Introduction to Human Language Technologies
### Victor Badenas Crespo

***

Import the necessary packages and declare the environment variables.

In [1]:
import nltk
from pprint import pprint
from scipy.stats import pearsonr
from nltk.metrics import jaccard_distance
from pathlib import Path

DATA_FOLDER = Path('./trial')

First, functions for reading and structuring the data are declared. The input data is then read; each of its lines contains \[id, sentence1, sentence2\] separated by tabs. The gold standard data is read as well. Finally, each line of inputText is converted into a dict with the following format for readability:

```json
{
    "id": <id_string>,
    "sent1": <sentence_string>,
    "sent2": <sentence_string>
}
```

In [2]:
def readFile(filePath):
    """
    reads and returns a list of lists containing the text split by line 
    jumps and by tab characters
    """
    with open(filePath, 'r') as fileHandler:
        data = fileHandler.readlines()
    
    # split every line by tabs
    data = list(map(lambda x: x.strip().split('\t'), data))
    return data

def toDict(line):
    """
    creates a dict with fields id sent1 sent2 from the values in line
    """
    keys = ("id", "sent1", "sent2")
    return dict(zip(keys, line))

# read file data
inputText = readFile(DATA_FOLDER / 'STS.input.txt')
gsText = readFile(DATA_FOLDER / 'STS.gs.txt')

# convert to previously defined dict structure
inputText = list(map(toDict, inputText))
pprint(inputText)

[{'id': 'id1',
  'sent1': 'The bird is bathing in the sink.',
  'sent2': 'Birdie is washing itself in the water basin.'},
 {'id': 'id2',
  'sent1': 'In May 2010, the troops attempted to invade Kabul.',
  'sent2': 'The US army invaded Kabul on May 7th last year, 2010.'},
 {'id': 'id3',
  'sent1': 'John said he is considered a witness but not a suspect.',
  'sent2': '"He is not a suspect anymore." John said.'},
 {'id': 'id4',
  'sent1': 'They flew out of the nest in groups.',
  'sent2': 'They flew into the nest together.'},
 {'id': 'id5',
  'sent1': 'The woman is playing the violin.',
  'sent2': 'The young lady enjoys listening to the guitar.'},
 {'id': 'id6',
  'sent1': 'John went horse back riding at dawn with a 
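
As a complement to the reading step, the sketch below shows one alternative way to parse the same tab-separated layout with the standard `csv` module. It is not the approach used in this lab: the `read_pairs` helper and the two-line `sample` string are hypothetical stand-ins for `STS.input.txt`. Quoting is disabled because some sentences (such as the second sentence of id3) start with a literal double quote, which the default CSV dialect would otherwise treat as a field delimiter.

```python
import csv
import io

# Hypothetical two-line sample standing in for STS.input.txt (tab-separated).
sample = (
    'id1\tThe bird is bathing in the sink.\t'
    'Birdie is washing itself in the water basin.\n'
    'id3\tJohn said he is considered a witness but not a suspect.\t'
    '"He is not a suspect anymore." John said.\n'
)

def read_pairs(handle):
    """Yield one dict per tab-separated line with keys id, sent1, sent2."""
    reader = csv.DictReader(
        handle,
        fieldnames=("id", "sent1", "sent2"),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # keep literal double quotes inside sentences
    )
    for row in reader:
        yield dict(row)

pairs = list(read_pairs(io.StringIO(sample)))
print(pairs[1]["sent2"])
```

With quoting disabled, the double quotes around the reported speech stay part of the sentence text, matching what a plain `split('\t')` produces.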

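A function is defined that tokenizes both sentences, converts the tokens to sets and computes the Jaccard distance between the two sets. Note that, despite its name, the function returns a distance: 0 for identical token sets and 1 for disjoint ones. This function is then applied to the inputText list.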

In [3]:
def computeSimilarity(sentencePair):
    """
    function responsible for:
    - tokenizing the words in the sentence
    - converting to set
    - computing the jaccard_distance metric
    """
    sent1 = set(nltk.word_tokenize(sentencePair['sent1'], language='english'))
    sent2 = set(nltk.word_tokenize(sentencePair['sent2'], language='english'))
    return jaccard_distance(sent1, sent2)
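
To make the metric concrete, here is a minimal sketch of the Jaccard distance written in plain Python and applied to the id4 sentence pair. The helper name `jaccard_distance_sets` is hypothetical, and a whitespace split over pre-separated tokens is used as a stand-in for `nltk.word_tokenize`, so that the example runs without NLTK data.

```python
def jaccard_distance_sets(a, b):
    """Jaccard distance: (|a union b| - |a intersect b|) / |a union b|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return (len(union) - len(a & b)) / len(union)

# Tokens are pre-separated by spaces (the final period is its own token),
# standing in for the output of nltk.word_tokenize.
s1 = set("They flew out of the nest in groups .".split())
s2 = set("They flew into the nest together .".split())

# 5 shared tokens, 11 tokens in the union: distance = 6 / 11
d = jaccard_distance_sets(s1, s2)
print(round(d, 2))
```

The two sets share 5 of 11 distinct tokens, so the distance is 6/11, roughly 0.55. Because the metric only looks at token overlap, "out of" versus "into" barely affects the score even though the meaning is reversed.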

The previous function is used to compute the distance for each sentence pair in inputText. The reference values are also extracted from the gold standard data.

In [4]:
testDistances = list(map(computeSimilarity, inputText))
refDistances = [float(value) for _, value in gsText]

Both lists are then compared using the Pearson correlation coefficient.

In [5]:
pcorr = pearsonr(refDistances, testDistances)[0]

# formatting for purely demonstrative purposes
print(f"pearsonr({refDistances}, {list(map(lambda x:float('%.2f' % x), testDistances))}) = {pcorr}")

pearsonr([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], [0.69, 0.74, 0.53, 0.55, 0.77, 0.86]) = 0.3962389776119232
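
For reference, the sketch below computes the Pearson coefficient by hand from its definition (covariance divided by the product of the standard deviations). The `pearson` helper is hypothetical and only illustrates what `pearsonr(...)[0]` measures. The distance values are the two-decimal figures from the printed output, so the result is close to, but not exactly, the value reported above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
dist = [0.69, 0.74, 0.53, 0.55, 0.77, 0.86]  # rounded values from the output

r_dist = pearson(gold, dist)
# Converting each distance to a similarity (1 - d) only flips the sign.
r_sim = pearson(gold, [1 - d for d in dist])
print(r_dist, r_sim)
```

Since the correlation is invariant to shifting and only changes sign under negation, correlating the gold values with `1 - distance` gives the same magnitude with the opposite sign. This matters when deciding whether the measure should be read as a distance or as a similarity.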


***

## Conclusion



***

### End of P2