# Lab.5: Lexical semantics
## Introduction to Human Language Technologies
### Victor Badenas Crespo

***

### Statement:

Given the following (lemma, category) pairs:
```python
('the', 'DT'), ('man', 'NN'), ('swim', 'VB'), ('with', 'PR'), ('a', 'DT'),
('girl', 'NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')
```

- For each pair, when possible, print their most frequent WordNet synset, their corresponding least common subsumer (LCS) and their similarity value, using the following functions:

    - Path Similarity

    - Leacock-Chodorow Similarity

    - Wu-Palmer Similarity

    - Lin Similarity

- Normalize similarity values when necessary. Which similarity seems better?

*** 

## Solution

Import the necessary packages and declare the environment variables.

In [1]:
import nltk
import numpy as np
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
nltk.download('wordnet_ic')

DATA = [
    ('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'),
    ('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
    ('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')
]

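# Information content estimated from the Brown corpus; required by Lin similarity.
brownIc = wordnet_ic.ic('ic-brown.dat')

# Keep only nouns and verbs, the categories for which WordNet has a hypernym taxonomy.
FilteredData = list(filter(lambda x: x[1].lower()[0] in ('v', 'n'), DATA))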

[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /Users/victor/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


In [2]:
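# Keep the first synset of each (lemma, POS) pair; WordNet orders senses roughly by frequency.
synsets = []
for word, posTag in FilteredData:
    posTag = posTag.lower()[0]  # 'NN' -> 'n', 'VB' -> 'v'
    synset = wn.synsets(word, posTag)[0]
    synsets.append(synset)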
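
The statement asks for a synset "when possible", and indexing `[0]` raises an `IndexError` if a lemma has no synset for the given category. The sketch below shows one way to guard against that. The helper `most_frequent_synset` and the stand-in `fake_lookup` are hypothetical names introduced only for illustration; in the notebook the lookup would be `wn.synsets`, which is assumed here to return an empty list for an unknown lemma.

```python
def most_frequent_synset(lookup, lemma, pos):
    """Return the first candidate for (lemma, pos), or None when there is none."""
    candidates = lookup(lemma, pos)
    return candidates[0] if candidates else None


def fake_lookup(lemma, pos):
    """Stand-in for wn.synsets so the example runs without the WordNet data."""
    table = {("man", "n"): ["man.n.01", "serviceman.n.01"]}
    return table.get((lemma, pos), [])


found = most_frequent_synset(fake_lookup, "man", "n")
missing = most_frequent_synset(fake_lookup, "whilst", "n")
print(found, missing)
```

With NLTK available, the same helper would be called as `most_frequent_synset(wn.synsets, word, posTag)`, and pairs that return `None` would simply be skipped.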

In [3]:
PathSimilarities = np.full((len(synsets), len(synsets)), np.nan)
LCSimilarities = np.full((len(synsets), len(synsets)), np.nan)
WuPalmerSimilarities = np.full((len(synsets), len(synsets)), np.nan)
LinSimilarities = np.full((len(synsets), len(synsets)), np.nan)
LCS = [[None for j in range(len(synsets))] for i in range(len(synsets))]

for i, sourceSynset in enumerate(synsets):
    for j, targetSynset in enumerate(synsets):
        LCS[i][j] = sourceSynset.lowest_common_hypernyms(targetSynset)
        # Only compare synsets of the same POS that share at least one hypernym;
        # every other cell keeps its NaN placeholder.
        if len(LCS[i][j]) > 0 and (FilteredData[i][1].lower()[0] == FilteredData[j][1].lower()[0]):
            PathSimilarities[i, j] = sourceSynset.path_similarity(targetSynset)
            LCSimilarities[i, j] = sourceSynset.lch_similarity(targetSynset)
            WuPalmerSimilarities[i, j] = sourceSynset.wup_similarity(targetSynset)
            LinSimilarities[i, j] = sourceSynset.lin_similarity(targetSynset, brownIc)

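# Leacock-Chodorow is not bounded by 1, so rescale it by the largest observed value;
# the other three measures already lie in [0, 1].
LCSimilarities = LCSimilarities / np.nanmax(LCSimilarities)
PathSimilarities = np.round(PathSimilarities, 2)
LCSimilarities = np.round(LCSimilarities, 2)
WuPalmerSimilarities = np.round(WuPalmerSimilarities, 2)
LinSimilarities = np.round(LinSimilarities, 2)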

In [4]:
print("PathSimilarities:")
print(PathSimilarities)
print("LCSimilarities:")
print(LCSimilarities)
print("WuPalmerSimilarities:")
print(WuPalmerSimilarities)
print("LinSimilarities:")
print(LinSimilarities)

PathSimilarities:
[[1.    nan 0.25 0.33 0.33  nan]
 [ nan 1.    nan  nan  nan 0.33]
 [0.25  nan 1.   0.17 0.5   nan]
 [0.33  nan 0.17 1.   0.2   nan]
 [0.33  nan 0.5  0.2  1.    nan]
 [ nan 0.33  nan  nan  nan 1.  ]]
LCSimilarities:
[[1.    nan 0.62 0.7  0.7   nan]
 [ nan 0.9   nan  nan  nan 0.59]
 [0.62  nan 1.   0.51 0.81  nan]
 [0.7   nan 0.51 1.   0.56  nan]
 [0.7   nan 0.81 0.56 1.    nan]
 [ nan 0.59  nan  nan  nan 0.9 ]]
WuPalmerSimilarities:
[[1.    nan 0.63 0.67 0.67  nan]
 [ nan 1.    nan  nan  nan 0.33]
 [0.63  nan 1.   0.63 0.63  nan]
 [0.67  nan 0.63 1.   0.67  nan]
 [0.67  nan 0.95 0.67 1.    nan]
 [ nan 0.33  nan  nan  nan 1.  ]]
LinSimilarities:
[[1.    nan 0.71 0.73 0.79  nan]
 [ nan 1.    nan  nan  nan 0.49]
 [0.71  nan 1.   0.29 0.91  nan]
 [0.73  nan 0.29 1.   0.32  nan]
 [0.79  nan 0.91 0.32 1.    nan]
 [ nan 0.49  nan  nan  nan 1.  ]]
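
In the Leacock-Chodorow matrix above, the verb diagonal is 0.9 rather than 1. A plausible explanation is that the measure depends on the depth of each part-of-speech taxonomy, while the normalization divides everything by a single global maximum, which comes from the nouns. The sketch below reproduces the effect in plain Python. The depths `NOUN_DEPTH = 19` and `VERB_DEPTH = 13` are assumptions, chosen because they are consistent with the printed values; they are not read from NLTK here.

```python
import math

# Assumed taxonomy depths (not read from NLTK); they reproduce the printed values.
NOUN_DEPTH = 19
VERB_DEPTH = 13


def lch(path_edges, taxonomy_depth):
    """Leacock-Chodorow similarity for a shortest path of `path_edges` edges."""
    return -math.log((path_edges + 1) / (2.0 * taxonomy_depth))


noun_max = lch(0, NOUN_DEPTH)  # self-similarity of any noun synset
verb_max = lch(0, VERB_DEPTH)  # self-similarity of any verb synset

# swim/walk has path similarity 0.33 = 1 / (2 + 1), i.e. a path of two edges.
swim_walk = lch(2, VERB_DEPTH)

verb_self_global = round(verb_max / noun_max, 2)   # 0.9, as on the verb diagonal
swim_walk_global = round(swim_walk / noun_max, 2)  # 0.59, as in the matrix
swim_walk_per_pos = round(swim_walk / verb_max, 2) # 0.66 with a per-POS maximum

print(verb_self_global, swim_walk_global, swim_walk_per_pos)
```

Under these assumptions, dividing verb scores by the verb maximum would put the verb diagonal back at 1 and make the verb and noun values directly comparable.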


In [5]:
def printSynsetDistanceMetric(synsets, LCSMatrix, distanceMatrix):
    for i, sourceSynset in enumerate(synsets):
        for j, targetSynset in enumerate(synsets):
            if len(LCSMatrix[i][j]) > 0 and sourceSynset != targetSynset:
                print(f"{sourceSynset}, {targetSynset}:\t LCS={LCSMatrix[i][j][0]}, distance={distanceMatrix[i,j]}")

In [6]:
print("PathSimilarities:")
printSynsetDistanceMetric(synsets, LCS, PathSimilarities)

PathSimilarities:
Synset('man.n.01'), Synset('girl.n.01'):	 LCS=Synset('adult.n.01'), distance=0.25
Synset('man.n.01'), Synset('male_child.n.01'):	 LCS=Synset('male.n.02'), distance=0.33
Synset('man.n.01'), Synset('woman.n.01'):	 LCS=Synset('adult.n.01'), distance=0.33
Synset('swim.v.01'), Synset('walk.v.01'):	 LCS=Synset('travel.v.01'), distance=0.33
Synset('girl.n.01'), Synset('man.n.01'):	 LCS=Synset('adult.n.01'), distance=0.25
Synset('girl.n.01'), Synset('male_child.n.01'):	 LCS=Synset('person.n.01'), distance=0.17
Synset('girl.n.01'), Synset('woman.n.01'):	 LCS=Synset('woman.n.01'), distance=0.5
Synset('male_child.n.01'), Synset('man.n.01'):	 LCS=Synset('male.n.02'), distance=0.33
Synset('male_child.n.01'), Synset('girl.n.01'):	 LCS=Synset('person.n.01'), distance=0.17
Synset('male_child.n.01'), Synset('woman.n.01'):	 LCS=Synset('person.n.01'), distance=0.2
Synset('woman.n.01'), Synset('man.n.01'):	 LCS=Synset('adult.n.01'), distance=0.33
Synset('woman.n.01'), Synset('girl.n.01')

In [7]:
print("LCSimilarities:")
printSynsetDistanceMetric(synsets, LCS, LCSimilarities)

LCSimilarities:
Synset('man.n.01'), Synset('girl.n.01'):	 LCS=Synset('adult.n.01'), distance=0.62
Synset('man.n.01'), Synset('male_child.n.01'):	 LCS=Synset('male.n.02'), distance=0.7
Synset('man.n.01'), Synset('woman.n.01'):	 LCS=Synset('adult.n.01'), distance=0.7
Synset('swim.v.01'), Synset('walk.v.01'):	 LCS=Synset('travel.v.01'), distance=0.59
Synset('girl.n.01'), Synset('man.n.01'):	 LCS=Synset('adult.n.01'), distance=0.62
Synset('girl.n.01'), Synset('male_child.n.01'):	 LCS=Synset('person.n.01'), distance=0.51
Synset('girl.n.01'), Synset('woman.n.01'):	 LCS=Synset('woman.n.01'), distance=0.81
Synset('male_child.n.01'), Synset('man.n.01'):	 LCS=Synset('male.n.02'), distance=0.7
Synset('male_child.n.01'), Synset('girl.n.01'):	 LCS=Synset('person.n.01'), distance=0.51
Synset('male_child.n.01'), Synset('woman.n.01'):	 LCS=Synset('person.n.01'), distance=0.56
Synset('woman.n.01'), Synset('man.n.01'):	 LCS=Synset('adult.n.01'), distance=0.7
Synset('woman.n.01'), Synset('girl.n.01'):	 LC

In [8]:
print("WuPalmerSimilarities:")
printSynsetDistanceMetric(synsets, LCS, WuPalmerSimilarities)

WuPalmerSimilarities:
Synset('man.n.01'), Synset('girl.n.01'):	 LCS=Synset('adult.n.01'), distance=0.63
Synset('man.n.01'), Synset('male_child.n.01'):	 LCS=Synset('male.n.02'), distance=0.67
Synset('man.n.01'), Synset('woman.n.01'):	 LCS=Synset('adult.n.01'), distance=0.67
Synset('swim.v.01'), Synset('walk.v.01'):	 LCS=Synset('travel.v.01'), distance=0.33
Synset('girl.n.01'), Synset('man.n.01'):	 LCS=Synset('adult.n.01'), distance=0.63
Synset('girl.n.01'), Synset('male_child.n.01'):	 LCS=Synset('person.n.01'), distance=0.63
Synset('girl.n.01'), Synset('woman.n.01'):	 LCS=Synset('woman.n.01'), distance=0.63
Synset('male_child.n.01'), Synset('man.n.01'):	 LCS=Synset('male.n.02'), distance=0.67
Synset('male_child.n.01'), Synset('girl.n.01'):	 LCS=Synset('person.n.01'), distance=0.63
Synset('male_child.n.01'), Synset('woman.n.01'):	 LCS=Synset('person.n.01'), distance=0.67
Synset('woman.n.01'), Synset('man.n.01'):	 LCS=Synset('adult.n.01'), distance=0.67
Synset('woman.n.01'), Synset('girl.

In [9]:
print("LinSimilarities:")
printSynsetDistanceMetric(synsets, LCS, LinSimilarities)

LinSimilarities:
Synset('man.n.01'), Synset('girl.n.01'):	 LCS=Synset('adult.n.01'), distance=0.71
Synset('man.n.01'), Synset('male_child.n.01'):	 LCS=Synset('male.n.02'), distance=0.73
Synset('man.n.01'), Synset('woman.n.01'):	 LCS=Synset('adult.n.01'), distance=0.79
Synset('swim.v.01'), Synset('walk.v.01'):	 LCS=Synset('travel.v.01'), distance=0.49
Synset('girl.n.01'), Synset('man.n.01'):	 LCS=Synset('adult.n.01'), distance=0.71
Synset('girl.n.01'), Synset('male_child.n.01'):	 LCS=Synset('person.n.01'), distance=0.29
Synset('girl.n.01'), Synset('woman.n.01'):	 LCS=Synset('woman.n.01'), distance=0.91
Synset('male_child.n.01'), Synset('man.n.01'):	 LCS=Synset('male.n.02'), distance=0.73
Synset('male_child.n.01'), Synset('girl.n.01'):	 LCS=Synset('person.n.01'), distance=0.29
Synset('male_child.n.01'), Synset('woman.n.01'):	 LCS=Synset('person.n.01'), distance=0.32
Synset('woman.n.01'), Synset('man.n.01'):	 LCS=Synset('adult.n.01'), distance=0.79
Synset('woman.n.01'), Synset('girl.n.01'

***

## Conclusions

Lin similarity seems to be the best metric overall for the `DATA` provided. In the comparisons across gender and age in particular, it discriminates between the noun pairs much more clearly than the other three measures.

The tables for the nouns are as follows:

| PATH       	| man  	| girl 	| male_child 	| woman 	|
|------------	|------	|------	|------------	|-------	|
| man        	| 1    	| 0.25 	| 0.33       	| 0.33  	|
| girl       	| 0.25 	| 1    	| 0.17       	| 0.5   	|
| male_child 	| 0.33 	| 0.17 	| 1          	| 0.2   	|
| woman      	| 0.33 	| 0.5  	| 0.2        	| 1     	|


| LCH        	| man  	| girl 	| male_child 	| woman 	|
|------------	|------	|------	|------------	|-------	|
| man        	| 1    	| 0.62 	| 0.7        	| 0.7   	|
| girl       	| 0.62 	| 1    	| 0.51       	| 0.81  	|
| male_child 	| 0.7  	| 0.51 	| 1          	| 0.56  	|
| woman      	| 0.7  	| 0.81 	| 0.56       	| 1     	|

| WuPalmer     	| man  	| girl 	| male_child 	| woman 	|
|------------	|------	|------	|------------	|-------	|
| man        	| 1    	| 0.63 	| 0.67       	| 0.67  	|
| girl       	| 0.63 	| 1    	| 0.63       	| 0.63  	|
| male_child 	| 0.67 	| 0.63 	| 1          	| 0.67  	|
| woman      	| 0.67 	| 0.95 	| 0.67       	| 1     	|

| LIN       	| man  	| girl 	| male_child 	| woman 	|
|------------	|------	|------	|------------	|-------	|
| man        	| 1    	| 0.71 	| 0.73       	| 0.79  	|
| girl       	| 0.71 	| 1    	| 0.29       	| 0.91  	|
| male_child 	| 0.73 	| 0.29 	| 1          	| 0.32  	|
| woman      	| 0.79 	| 0.91 	| 0.32       	| 1     	|
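
One way to quantify how much each measure spreads the noun pairs is to take the difference between the largest and smallest off-diagonal value of each table. The sketch below does this on the rounded values copied from the four tables above; the helper name `off_diagonal_spread` is introduced only for illustration.

```python
# Rounded noun similarities copied from the tables (order: man, girl, male_child, woman).
tables = {
    "path": [[1, 0.25, 0.33, 0.33], [0.25, 1, 0.17, 0.5],
             [0.33, 0.17, 1, 0.2], [0.33, 0.5, 0.2, 1]],
    "lch": [[1, 0.62, 0.7, 0.7], [0.62, 1, 0.51, 0.81],
            [0.7, 0.51, 1, 0.56], [0.7, 0.81, 0.56, 1]],
    "wup": [[1, 0.63, 0.67, 0.67], [0.63, 1, 0.63, 0.63],
            [0.67, 0.63, 1, 0.67], [0.67, 0.95, 0.67, 1]],
    "lin": [[1, 0.71, 0.73, 0.79], [0.71, 1, 0.29, 0.91],
            [0.73, 0.29, 1, 0.32], [0.79, 0.91, 0.32, 1]],
}


def off_diagonal_spread(matrix):
    """Difference between the largest and smallest off-diagonal entries."""
    values = [v for i, row in enumerate(matrix) for j, v in enumerate(row) if i != j]
    return round(max(values) - min(values), 2)


spreads = {name: off_diagonal_spread(matrix) for name, matrix in tables.items()}
print(spreads)  # {'path': 0.33, 'lch': 0.3, 'wup': 0.32, 'lin': 0.62}
```

The spread is a crude summary, since it ignores how the intermediate pairs are ordered, but it gives a single number per measure to compare.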

From the previous tables, Lin comes out ahead because, compared to Wu-Palmer, it separates gender and age information much better. Take the woman row: `male_child < man < girl` in terms of similarity to woman, which makes sense. A woman is a grown girl, and `woman.n.01` is itself the LCS of the two, so they should be the most similar. Man is the opposite gender, which makes it less similar: the two only meet at `adult`. Finally, the most different one is `male_child`, whose only common subsumer with woman is `person`, so the path has to climb all the way up to `person` in the lexical tree and then come down the male branch.
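
The tree distances described above can be recovered from the Path table, since path similarity is defined as `1 / (1 + edges)` along the shortest path. The sketch below inverts that relation for the woman row; the helper name `edges_from_path_similarity` is hypothetical, and the inputs are the rounded values from the table, so the inversion relies on rounding to the nearest integer.

```python
def edges_from_path_similarity(similarity):
    """Invert path similarity = 1 / (1 + edges), tolerating two-decimal rounding."""
    return round(1 / similarity) - 1


# Woman row of the Path table (rounded values).
woman_row = {"girl": 0.5, "man": 0.33, "male_child": 0.2}

edges = {name: edges_from_path_similarity(sim) for name, sim in woman_row.items()}
print(edges)  # {'girl': 1, 'man': 2, 'male_child': 4}
```

One edge to girl, two to man (through `adult`) and four to `male_child` (through `person`) matches the ordering discussed in the paragraph above.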

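Path and Leacock-Chodorow produce the same ordering, and Wu-Palmer ties `man` and `male_child` at 0.67, but Lin offers the widest range for telling the pairs apart: its off-diagonal noun values span 0.29 to 0.91, against 0.17 to 0.5 for Path, 0.51 to 0.81 for Leacock-Chodorow and 0.63 to 0.95 for Wu-Palmer. The Wu-Palmer table is also the only one that is not symmetric: (woman, girl) is 0.95 while (girl, woman) is 0.63.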
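
For reference, the four measures are usually defined as shown below. These are the textbook definitions, not a transcription of NLTK's code, and implementation details such as how depth is counted may differ. Here $\mathrm{len}(s_1, s_2)$ is the shortest path length in edges, $D$ is the maximum depth of the taxonomy, $\mathrm{lcs}$ is the least common subsumer, and $\mathrm{IC}(s) = -\log p(s)$ is the information content, estimated in this lab from the Brown corpus.

$$
\begin{aligned}
\mathrm{sim}_{\mathrm{path}}(s_1, s_2) &= \frac{1}{1 + \mathrm{len}(s_1, s_2)} \\
\mathrm{sim}_{\mathrm{lch}}(s_1, s_2) &= -\log \frac{\mathrm{len}(s_1, s_2) + 1}{2D} \\
\mathrm{sim}_{\mathrm{wup}}(s_1, s_2) &= \frac{2\,\mathrm{depth}(\mathrm{lcs}(s_1, s_2))}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)} \\
\mathrm{sim}_{\mathrm{lin}}(s_1, s_2) &= \frac{2\,\mathrm{IC}(\mathrm{lcs}(s_1, s_2))}{\mathrm{IC}(s_1) + \mathrm{IC}(s_2)}
\end{aligned}
$$

Lin is the only one of the four that uses corpus statistics. A very general concept such as `person` is frequent and therefore has low information content, which shrinks the numerator for every pair whose LCS is `person`. This is consistent with the low Lin values for the (girl, male_child) and (male_child, woman) pairs.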

***

### End of P4