# Semantic Text Similarity
## Applications of semantic similarity
* Grouping similar words into semantic concepts
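* As a building block in natural language understanding tasks
    * Textual entailment: recognizing textual entailment (RTE) is the task of inferring entailment between texts. The relationship between two given texts is expressed with one of several entailment labels.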
    * Paraphrasing

## WordNet
   * WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
   * Includes rich linguistic information
        * part of speech, word senses, synonyms, hypernyms/hyponyms, meronyms, derivationally related forms, ...
   * Machine readable, freely available
   
## Semantic similarity using WordNet
* WordNet organizes information in a hierarchy

![](images/semantic1.png)

## Path similarity
* Find the shortest path between the two concepts
* Similarity measure inversely related to path distance
    * PathSim(deer, elk)= 0.5
    * PathSim(deer, giraffe)= 0.33
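
The two values above are consistent with the common definition of path similarity as 1 / (distance + 1), where distance is the number of edges on the shortest path. The sketch below assumes that definition; the function name `path_similarity_from_distance` is illustrative and not part of any library.

```python
def path_similarity_from_distance(distance):
    """Similarity from the number of edges on the shortest path between two concepts."""
    return 1 / (distance + 1)

# deer -> elk is one edge apart; deer -> giraffe is two edges apart (via a shared parent)
print(path_similarity_from_distance(1))            # 0.5
print(round(path_similarity_from_distance(2), 2))  # 0.33
```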

## Lowest common subsumer (LCS)
* Find the closest ancestor to both concepts


![](images/semantic2.png)
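
As a rough illustration of the idea, the sketch below finds the lowest common subsumer in a small hand-written hierarchy. The `PARENT` map is a hypothetical toy fragment, not the real WordNet graph, and the helper names are made up for this example.

```python
# Toy child -> parent map; a hypothetical fragment, not the real WordNet graph.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "ungulate",
    "horse": "equine",
    "equine": "ungulate",
}

def ancestors(node):
    """Return the chain from node up to the root, starting with node itself."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_subsumer(u, v):
    """Return the first node on u's upward chain that also subsumes v."""
    v_ancestors = set(ancestors(v))
    for node in ancestors(u):
        if node in v_ancestors:
            return node
    return None  # the two concepts share no ancestor

print(lowest_common_subsumer("deer", "giraffe"))  # ruminant
print(lowest_common_subsumer("elk", "horse"))     # ungulate
```

Because each node in this toy tree has a single parent, walking up from `u` and stopping at the first shared node is enough. WordNet allows multiple hypernyms per synset, so a real implementation has to consider several upward paths.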

## Lin Similarity
* Similarity measure based on the information contained in the LCS of the two concepts
* $LinSim(u,v) = \frac{2 \log P(LCS(u,v))}{\log P(u) + \log P(v)}$
* $P(u)$ is the probability of concept $u$, given by the information content learned over a large corpus.
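
A minimal sketch of the arithmetic, assuming the standard form of Lin similarity, 2 log P(LCS(u,v)) / (log P(u) + log P(v)). The probabilities used below are made-up numbers for illustration, not values estimated from a corpus.

```python
import math

def lin_similarity_from_probs(p_u, p_v, p_lcs):
    """Lin similarity from the probabilities of two concepts and of their LCS."""
    return 2 * math.log(p_lcs) / (math.log(p_u) + math.log(p_v))

# Hypothetical probabilities: the LCS is more general, so it is more probable.
score = lin_similarity_from_probs(p_u=0.001, p_v=0.001, p_lcs=0.01)
print(round(score, 3))  # 0.667
```

The score approaches 1 as the LCS becomes as specific as the concepts themselves, and it drops toward 0 when the only shared ancestor is very general (probability close to 1).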

## How do you do it in Python?
* WordNet is easily imported into Python through NLTK

In [1]:
import nltk
from nltk.corpus import wordnet as wn

### Words
* Look up a word using `synsets()`. This function has an optional `pos` argument that lets you constrain the part of speech of the word.

* Find appropriate sense of the words

In [30]:
 wn.synsets('dog') 

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

In [20]:
 print(wn.synset('dog.n.01').definition())

a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds


In [22]:
deer = wn.synset('deer.n.01') # 'deer.n.01' selects the first noun sense of "deer"
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')

* Synset: a set of synonyms that share a common meaning.

In [28]:
print('hypernyms', deer.hypernyms(), 'hyponyms', deer.hyponyms(), 'member holonyms', deer.member_holonyms())

hypernyms [Synset('ruminant.n.01')] hyponyms [Synset('brocket.n.01'), Synset('caribou.n.01'), Synset('elk.n.01'), Synset('fallow_deer.n.01'), Synset('fawn.n.02'), Synset('japanese_deer.n.01'), Synset('mule_deer.n.01'), Synset('muntjac.n.01'), Synset('musk_deer.n.01'), Synset('pere_david's_deer.n.01'), Synset('pricket.n.02'), Synset('red_deer.n.01'), Synset('roe_deer.n.01'), Synset('sambar.n.01'), Synset('virginia_deer.n.01'), Synset('wapiti.n.01')] member holonyms [Synset('cervidae.n.01')]


* Find path similarity

In [4]:
deer.path_similarity(elk)

0.5

In [7]:
deer.path_similarity(horse)

0.14285714285714285

* Use information content to compute Lin similarity

In [9]:
nltk.download('wordnet_ic') # only needed once; can be commented out afterwards

[nltk_data] Downloading package wordnet_ic to
[nltk_data]     C:\Users\Sam\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet_ic.zip.


True

In [12]:
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
print(deer.lin_similarity(elk,brown_ic),
deer.lin_similarity(horse,brown_ic))

0.8623778273893673 0.7726998936065773


## Collocations and Distributional similarity
* "You know a word by the company it keeps" [Firth, 1957]
* Two words that frequently appear in similar contexts are more likely to be semantically related

![](images/semantic3.png)

* Words before, after, within a small window
* Parts of speech of words before, after, in a small window
* Specific syntactic relation to the target word
* Words in the same sentence, same document, ...
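
The first kind of context, words within a small window, can be sketched in plain Python. The helper `context_window` and its `size` parameter are illustrative names, not part of NLTK.

```python
def context_window(tokens, index, size=2):
    """Return the words within `size` positions before and after tokens[index]."""
    before = tokens[max(0, index - size):index]
    after = tokens[index + 1:index + 1 + size]
    return before, after

tokens = "you know a word by the company it keeps".split()
before, after = context_window(tokens, tokens.index("word"), size=2)
print(before, after)  # ['know', 'a'] ['by', 'the']
```

Collecting these windows for every occurrence of a word gives the context profile that distributional similarity compares across words.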

## Strength of association between words
* How frequently do the two words occur together?
    * Two words are not strongly associated if they rarely occur together
    * The frequency of each individual word also matters
        * 'the' is very frequent, so it is likely to co-occur often with every other word
    * Pointwise Mutual Information: $PMI(w,c) = \log\frac{P(w,c)}{P(w)P(c)}$
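
To make the PMI formula concrete, here is a small sketch that scores adjacent word pairs using simple relative-frequency estimates. The base-2 logarithm and the function name `pmi_scores` are assumptions made for this example; the notes do not fix a log base.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return {(w, c): PMI} for adjacent word pairs, using relative frequencies."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w, c), count in bigrams.items():
        p_wc = count / n_bi          # P(w, c)
        p_w = unigrams[w] / n_uni    # P(w)
        p_c = unigrams[c] / n_uni    # P(c)
        scores[(w, c)] = math.log2(p_wc / (p_w * p_c))
    return scores

tokens = "new york is big and new york is old".split()
scores = pmi_scores(tokens)
for pair, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(value, 2))
```

On a corpus this small, pairs made of words that each appear only once can outscore a repeated pair such as ('new', 'york'), because PMI rewards rare words that happen to co-occur.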
        

In [None]:
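from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# `text` is assumed to be a list of word tokens (e.g. from nltk.word_tokenize)
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)
finder.nbest(bigram_measures.pmi, 10)  # top 10 bigrams ranked by PMI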

* finder also has other useful functions, such as a frequency filter

`finder.apply_freq_filter(10)`

Suppose you want to keep only bigrams that occur at least 10 times in your corpus. Calling `finder.apply_freq_filter(10)` removes every pair that occurs fewer than 10 times, so the association measures are computed only over the remaining bigrams.
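
The effect of a frequency filter can be sketched without NLTK. This is only an illustration of the idea, not NLTK's implementation, and `filter_by_frequency` is a hypothetical helper name.

```python
from collections import Counter

def filter_by_frequency(bigram_counts, min_freq):
    """Keep only bigrams seen at least min_freq times."""
    return Counter({pair: n for pair, n in bigram_counts.items() if n >= min_freq})

tokens = "new york is big and new york is old".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
frequent = filter_by_frequency(bigram_counts, min_freq=2)
print(sorted(frequent))  # [('new', 'york'), ('york', 'is')]
```

Filtering before ranking by PMI keeps one-off pairs from dominating the top of the list.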