# Semantic Similarity

# WordNet

- Semantic dictionary of (mostly) English words, interlinked by semantic relations
- Includes rich linguistic information
  - part of speech, word senses, synonyms, hypernyms/hyponyms, meronyms, derivationally related forms, ...
- Machine-readable, freely available

<br>

## Semantic similarity using WordNet

- WordNet organizes information in a hierarchy
- Many similarity measures use the hierarchy in some way
- Verbs, nouns, and adjectives each have their own separate hierarchy

<br>

In [9]:
import nltk
from nltk.corpus import wordnet as wn

# Look up specific noun senses by name (requires the WordNet corpus: nltk.download('wordnet'))
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')

#### Path Similarity

- Find the shortest path between the two concepts
- Similarity measure inversely related to path distance

In [12]:
print(deer.path_similarity(elk))
print(deer.path_similarity(horse))

0.5
0.14285714285714285
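
The two scores above are consistent with a similarity of `1 / (1 + d)`, where `d` is the number of edges on the shortest path between the concepts (0.5 for `d = 1`, 1/7 for `d = 6`). A minimal sketch of that idea follows. It assumes a small hand-made hierarchy (`TOY_HYPERNYMS` is hypothetical and much shallower than WordNet) and hypothetical helper names, so its numbers will not match the WordNet output.

```python
from collections import deque

# Hypothetical toy hypernym links (child -> parent). This is NOT WordNet's
# actual structure; it only illustrates the shortest-path idea.
TOY_HYPERNYMS = {
    "elk": "deer",
    "deer": "ruminant",
    "ruminant": "ungulate",
    "horse": "equine",
    "equine": "ungulate",
    "ungulate": "mammal",
}


def build_graph(hypernyms):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the edge count, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Similarity that shrinks as the shortest path grows: 1 / (1 + d)."""
    dist = shortest_path_length(graph, a, b)
    return None if dist is None else 1 / (1 + dist)


graph = build_graph(TOY_HYPERNYMS)
print(path_similarity(graph, "deer", "elk"))    # one edge apart
print(path_similarity(graph, "deer", "horse"))  # four edges apart in this toy graph
```

Because the score depends only on edge counts, every link is treated as the same semantic distance, regardless of how specific the concepts are.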


<br>

#### Lin Similarity 

- Lowest common subsumer (LCS)
    - Find the closest ancestor to both concepts
- Lin similarity is based on the information content of the LCS, relative to the information content of the two concepts themselves

In [18]:
from nltk.corpus import wordnet_ic
# Information content estimated from the Brown corpus (requires nltk.download('wordnet_ic'))
brown_ic = wordnet_ic.ic('ic-brown.dat')

print(deer.lin_similarity(elk, brown_ic))
print(deer.lin_similarity(horse, brown_ic))

0.8623778273893673
0.7726998936065773
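
Lin similarity is usually written as `2 * IC(LCS) / (IC(a) + IC(b))`, where `IC(c) = -log P(c)` is the information content of a concept. The sketch below illustrates that formula on a toy hierarchy. The hierarchy, the counts in `TOY_COUNTS`, and all helper names are invented for illustration, so the scores are not comparable to the Brown-corpus values above.

```python
import math

# Hypothetical toy hierarchy (child -> parent); not WordNet's real structure.
TOY_HYPERNYMS = {
    "elk": "deer",
    "deer": "ruminant",
    "ruminant": "ungulate",
    "horse": "equine",
    "equine": "ungulate",
    "ungulate": "mammal",
}

# Invented corpus counts. Each count includes the concept's descendants,
# so counts grow toward the root.
TOY_COUNTS = {
    "mammal": 1000,
    "ungulate": 200,
    "ruminant": 80,
    "deer": 40,
    "elk": 10,
    "equine": 60,
    "horse": 50,
}
TOTAL = TOY_COUNTS["mammal"]  # count at the root


def ancestors(concept):
    """Return the concept followed by its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in TOY_HYPERNYMS:
        chain.append(TOY_HYPERNYMS[chain[-1]])
    return chain


def lowest_common_subsumer(a, b):
    """Closest node that is an ancestor of (or equal to) both concepts."""
    b_chain = set(ancestors(b))
    for node in ancestors(a):
        if node in b_chain:
            return node
    return None


def information_content(concept):
    """IC(c) = -log P(c): rarer (more specific) concepts carry more information."""
    return -math.log(TOY_COUNTS[concept] / TOTAL)


def lin_similarity(a, b):
    """2 * IC(LCS) / (IC(a) + IC(b)), or 0.0 when it is undefined."""
    lcs = lowest_common_subsumer(a, b)
    denominator = information_content(a) + information_content(b)
    if lcs is None or denominator == 0:
        return 0.0
    return 2 * information_content(lcs) / denominator


print(lowest_common_subsumer("deer", "elk"), lin_similarity("deer", "elk"))
print(lowest_common_subsumer("deer", "horse"), lin_similarity("deer", "horse"))
```

In this toy setup the LCS of deer and elk is the specific concept deer, while the LCS of deer and horse is the more general ungulate, which carries less information and therefore yields a lower score.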


## Collocations and Distributional similarity

- Two words that frequently appear in similar contexts are more likely to be semantically related
- Words before, after, within a small window
- Parts of speech of words before, after, in a small window
- Specific syntactic relation to the target word
- Words in the same sentence, same document, ...
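
One way to make the "words within a small window" notion of context concrete is to count each word's neighbours and compare the resulting count vectors. The sketch below assumes a made-up token list, a window of two words on each side, and cosine similarity as the comparison. The function names are hypothetical, and this is only one of many possible setups.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up example text, whitespace-tokenized for simplicity.
tokens = ("the deer ate the grass while the elk ate the leaves "
          "and the car hit the wall").split()
vectors = context_vectors(tokens, window=2)

print(cosine(vectors["deer"], vectors["elk"]))  # both occur near "ate"
print(cosine(vectors["deer"], vectors["car"]))  # share only "the"
```

On real text, very frequent words such as "the" dominate raw counts, so the counts are often reweighted before comparison.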

In [19]:
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

In [21]:
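bigram_measures = nltk.collocations.BigramAssocMeasures()
# from_words expects a sequence of tokens; a plain string such as 'ola' is
# iterated character by character, so the "bigrams" here are letter pairs
finder = BigramCollocationFinder.from_words('ola')
# Top 10 bigrams ranked by pointwise mutual information (PMI)
finder.nbest(bigram_measures.pmi, 10)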

[('l', 'a'), ('o', 'l')]
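
To show what the PMI ranking measures, here is a minimal pure-Python sketch of pointwise mutual information for adjacent word pairs, `log2(P(x, y) / (P(x) * P(y)))`. It assumes probabilities are estimated by dividing raw counts by the number of tokens, which is a simplification. The function name and the example sentence are made up for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(x, y): PMI} for every adjacent pair in `tokens`.

    Probabilities are estimated as count / len(tokens), so
    PMI = log2(n_xy * n / (n_x * n_y)).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {
        (x, y): math.log2(n_xy * n / (unigrams[x] * unigrams[y]))
        for (x, y), n_xy in bigrams.items()
    }


# Made-up example sentence, whitespace-tokenized.
tokens = "new york is big and new york is old".split()
scores = bigram_pmi(tokens)

# Print pairs from highest to lowest PMI.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

With this estimate, both letter pairs in `'ola'` get the same score, which matches the tie in the output above. Pairs made of words that occur only once receive the highest PMI, which is why collocation finders are often combined with a minimum-frequency filter.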