# 1. Semantic Similarity

Which pair of words is most similar?
- deer, elk
- deer, giraffe
- deer, horse
- deer, mouse

**How can we quantify this?**

#### Applications of semantic similarity
- Grouping similar words into semantic concepts
- As a building block in natural language understanding tasks
    - Textual entailment
    - Paraphrasing
    
#### WordNet
- Semantic dictionary of (mostly) English words, interlinked by semantic relations
- Includes rich linguistic information
    - Part of speech (POS), word senses, synonyms, ...
- Machine-readable, freely available

### 1. Semantic similarity using WordNet
- WordNet organizes information in a hierarchy
- Many similarity measures use the hierarchy in some way
- Verbs, nouns, adjectives all have separate hierarchies

#### Calculating Similarity

##### 1) Path Similarity
- Find the shortest path between the two concepts
- Similarity measure inversely related to path distance
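
As a rough illustration of the idea, the sketch below computes a path-based score on a tiny hand-made taxonomy. The `PARENT` table and both helper functions are hypothetical and exist only for this example; the hierarchy is far shallower than WordNet's, so the scores will not match NLTK's output. The formula `1 / (1 + distance)` follows NLTK's definition of `path_similarity`.

```python
# Hypothetical toy taxonomy (child -> parent); NOT WordNet's real hierarchy.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "ungulate",
    "horse": "ungulate",
    "ungulate": "mammal",
    "mouse": "mammal",
}


def ancestors(node):
    """Map each ancestor of `node` (including itself) to its distance in edges."""
    dist = {}
    steps = 0
    while node is not None:
        dist[node] = steps
        node = PARENT.get(node)
        steps += 1
    return dist


def path_similarity(u, v):
    """Return 1 / (1 + shortest path length between u and v), or None if unrelated."""
    up, vp = ancestors(u), ancestors(v)
    common = set(up) & set(vp)
    if not common:
        return None
    shortest = min(up[c] + vp[c] for c in common)
    return 1.0 / (1 + shortest)


for other in ["elk", "giraffe", "horse", "mouse"]:
    print("deer", other, path_similarity("deer", other))
```

In this toy tree the ranking comes out as elk, giraffe, horse, mouse, because each pair is separated by one more edge than the previous one.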

##### 2) Lowest Common Subsumer (LCS)
- Find the closest common ancestor of the two concepts in the hierarchy

##### 3) Lin Similarity
- Based on the information contained in the LCS of the two concepts
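- $\mathrm{LinSim}(u, v) = \dfrac{2 \log P(\mathrm{LCS}(u, v))}{\log P(u) + \log P(v)}$
- $P(u)$ is the probability of concept $u$, estimated from corpus frequencies; the information content of $u$ is $-\log P(u)$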
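
To make the LCS and Lin similarity concrete, here is a minimal sketch on a made-up taxonomy. The `PARENT` hierarchy and the concept probabilities in `P` are hypothetical values chosen so the arithmetic is easy to follow; in practice the probabilities would come from an information-content file computed over a corpus.

```python
import math

# Hypothetical toy taxonomy (child -> parent); NOT WordNet's real hierarchy.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "ungulate",
    "horse": "ungulate",
    "ungulate": "mammal",
    "mouse": "mammal",
}

# Hypothetical concept probabilities; a parent is at least as probable as its children.
P = {
    "mammal": 1.0,
    "ungulate": 0.5,
    "mouse": 0.25,
    "ruminant": 0.25,
    "horse": 0.125,
    "deer": 0.125,
    "giraffe": 0.0625,
    "elk": 0.0625,
}


def ancestors(node):
    """Return the chain from `node` up to the root, starting with `node` itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)
    return chain


def lcs(u, v):
    """Lowest common subsumer: the first ancestor of u that is also an ancestor of v."""
    v_ancestors = set(ancestors(v))
    for node in ancestors(u):
        if node in v_ancestors:
            return node
    return None


def lin_similarity(u, v):
    """Lin similarity: 2 * log P(LCS(u, v)) / (log P(u) + log P(v))."""
    if u == v:
        return 1.0  # also avoids 0/0 when u is the root
    return 2 * math.log(P[lcs(u, v)]) / (math.log(P[u]) + math.log(P[v]))


print(lcs("deer", "horse"), lin_similarity("deer", "horse"))
print(lcs("deer", "elk"), lin_similarity("deer", "elk"))
```

With these toy numbers, a pair whose LCS is the root scores 0, because the root has probability 1 and therefore carries no information.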

##### 4) Collocation and distributional similarity
- "you know a word by the company it keeps"
- Two words that frequently appear in similar contexts are more likely to be semantically related
    - The friends met at a cafe.
    - Shyam met Ray at a pizzeria.
    - Let's meet up near the coffee shop.
    - The secret meeting at the restaurant.
- Strength of association between words
    - How frequently do they occur together?
    - Pointwise Mutual Information: $\mathrm{PMI}(w, c) = \log \dfrac{P(w, c)}{P(w)\,P(c)}$
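
The PMI formula can be tried out on the four example sentences above. The sketch below makes two simplifying assumptions that are not part of the notes: a "context" is a whole sentence, and probabilities are estimated as the fraction of sentences containing a word or a word pair. The helper names (`tokenize`, `prob`, `pmi`) are illustrative, and the logarithm is taken in base 2.

```python
import math
import re

SENTENCES = [
    "The friends met at a cafe.",
    "Shyam met Ray at a pizzeria.",
    "Let's meet up near the coffee shop.",
    "The secret meeting at the restaurant.",
]


def tokenize(sentence):
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


# Assumption: each sentence is one context window.
CONTEXTS = [tokenize(s) for s in SENTENCES]


def prob(*words):
    """Fraction of contexts that contain all of the given words."""
    hits = sum(1 for ctx in CONTEXTS if all(w in ctx for w in words))
    return hits / len(CONTEXTS)


def pmi(w, c):
    """PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ); -inf if the pair never co-occurs."""
    joint = prob(w, c)
    if joint == 0:
        return float("-inf")
    return math.log2(joint / (prob(w) * prob(c)))


print(pmi("met", "cafe"))  # positive: the pair co-occurs more often than chance predicts
print(pmi("the", "at"))    # negative: the pair co-occurs slightly less often than chance predicts
```

On a corpus this small, rare words get inflated PMI scores from a single co-occurrence, which is one reason a frequency filter is usually applied before ranking by PMI.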

### 2. Application in Python

WordNet is easily imported into Python through NLTK.

In [3]:
import nltk
from nltk.corpus import wordnet as wn

deer = wn.synset('deer.n.01')  # the first noun sense (synset) of 'deer'
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')

In [4]:
# Find 'path similarity'

print(deer.path_similarity(elk))
print(deer.path_similarity(horse))

0.5
0.14285714285714285


In [6]:
# Find 'Lin similarity', using information content computed from the Brown corpus

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')

print(deer.lin_similarity(elk, brown_ic))
print(deer.lin_similarity(horse, brown_ic))

0.8623778273893673
0.7726998936065773


In [None]:
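# NLTK collocations and association measures

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# `text` is assumed to be a list of word tokens, e.g. from nltk.word_tokenize()
finder = BigramCollocationFinder.from_words(text)

# Other useful functions exist, such as a frequency filter.
# Here: ignore bigrams seen fewer than 10 times (apply before ranking).
finder.apply_freq_filter(10)

# The 10 bigrams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)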