# Text Similarity
(by Tevfik Aytekin)

In [240]:
from nltk.corpus import wordnet as wn
import nltk
from nltk.corpus import gutenberg
import numpy as np
from scipy.sparse import csr_matrix
from nltk.tokenize import RegexpTokenizer
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize 
from sklearn.feature_extraction.text import CountVectorizer

# WordNet

### Lexical Matrix
Table taken from [https://wordnetcode.princeton.edu/5papers.pdf](https://wordnetcode.princeton.edu/5papers.pdf)

<img src="images/lexical_matrix.png" style="width: 400px;"/>


The word meaning $M_1$ in the table above can be represented by the set of word forms that can be used to express it: $\{F_1, F_2, \ldots\}$. These sets are called synonym sets (or simply synsets).
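
As a rough illustration, the sketch below stores a tiny lexical matrix as a mapping from meanings to the word forms that express them. The meaning labels, the word forms, and the helper names `synset` and `meanings` are invented for the example and are not taken from WordNet.

```python
# Toy lexical matrix: each meaning (row) maps to the word forms that express it.
# All entries are made up for illustration.
lexical_matrix = {
    "M1 (motor vehicle)": {"car", "auto", "automobile"},
    "M2 (railway carriage)": {"car", "railcar"},
    "M3 (flat piece of wood)": {"board", "plank"},
}

def synset(meaning):
    """Word forms in one row of the matrix, i.e. the synset of a meaning."""
    return lexical_matrix[meaning]

def meanings(word_form):
    """Meanings in one column of the matrix, i.e. the senses of a word form."""
    return sorted(m for m, forms in lexical_matrix.items() if word_form in forms)

print(sorted(synset("M1 (motor vehicle)")))  # ['auto', 'automobile', 'car']
print(meanings("car"))  # ['M1 (motor vehicle)', 'M2 (railway carriage)']
```

Several forms in the same row are synonyms, while a form that appears in more than one row is polysemous.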

### WordNet Hierarchy
Below is a simplified illustration of the WordNet hierarchy.

<img src="images/wordnet_hierarchy.png" style="width: 400px;"/>


### A word is a set of meanings
`wn.synsets('word')` returns these meanings, one synset per meaning.

In [241]:
wn.synsets("car")

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]

### Synsets

A synset is a set of synonyms (word forms). Each synset corresponds to a concept/meaning. The nodes in the WordNet hierarchy correspond to synsets. A synset is identified by a 3-part name of the form word.pos.nn. For example, 'car.n.01' means the first meaning of 'car' used as a noun.
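
To make the naming scheme concrete, here is a minimal sketch that splits a synset name into its three parts. `parse_synset_name` is a hypothetical helper written for this example, not an NLTK function.

```python
def parse_synset_name(name):
    """Split 'word.pos.nn' into (word, pos, sense_number)."""
    # Split from the right so that a word containing dots stays intact.
    word, pos, number = name.rsplit(".", 2)
    return word, pos, int(number)

print(parse_synset_name("car.n.01"))        # ('car', 'n', 1)
print(parse_synset_name("cable_car.n.01"))  # ('cable_car', 'n', 1)
```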

In [242]:
print(wn.synset('car.n.01').definition())

a motor vehicle with four wheels; usually propelled by an internal combustion engine


Lemmas correspond to word forms.

In [243]:
wn.synset('car.n.01').lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

In [244]:
wn.synset('car.n.01').lemmas()

[Lemma('car.n.01.car'),
 Lemma('car.n.01.auto'),
 Lemma('car.n.01.automobile'),
 Lemma('car.n.01.machine'),
 Lemma('car.n.01.motorcar')]

Hypernyms and hyponyms: a hypernym is a more general concept, a hyponym a more specific one.

In [245]:
wn.synset('car.n.01').hypernyms()

[Synset('motor_vehicle.n.01')]

In [246]:
wn.synset('car.n.01').hyponyms()

[Synset('ambulance.n.01'),
 Synset('beach_wagon.n.01'),
 Synset('bus.n.04'),
 Synset('cab.n.03'),
 Synset('compact.n.03'),
 Synset('convertible.n.01'),
 Synset('coupe.n.01'),
 Synset('cruiser.n.01'),
 Synset('electric.n.01'),
 Synset('gas_guzzler.n.01'),
 Synset('hardtop.n.01'),
 Synset('hatchback.n.01'),
 Synset('horseless_carriage.n.01'),
 Synset('hot_rod.n.01'),
 Synset('jeep.n.01'),
 Synset('limousine.n.01'),
 Synset('loaner.n.02'),
 Synset('minicar.n.01'),
 Synset('minivan.n.01'),
 Synset('model_t.n.01'),
 Synset('pace_car.n.01'),
 Synset('racer.n.02'),
 Synset('roadster.n.01'),
 Synset('sedan.n.01'),
 Synset('sport_utility.n.01'),
 Synset('sports_car.n.01'),
 Synset('stanley_steamer.n.01'),
 Synset('stock_car.n.01'),
 Synset('subcompact.n.01'),
 Synset('touring_car.n.01'),
 Synset('used-car.n.01')]

In [247]:
wn.synset('car.n.01').root_hypernyms()

[Synset('entity.n.01')]

## Synonymy

### Path Similarity
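`path_similarity` assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy: the score is 1 / (d + 1), where d is the number of edges on that path. When no path can be found, current NLTK versions return `None` (older versions returned -1).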

In [248]:
right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')

In [249]:
print(right.path_similarity(minke))
print(right.path_similarity(orca))
print(right.path_similarity(tortoise))
print(right.path_similarity(novel))


0.25
0.16666666666666666
0.07692307692307693
0.043478260869565216


<img src="images/wordnet_hierarchy.png" style="width: 400px;"/>

In [250]:
motorcar = wn.synset('car.n.01')
compact = wn.synset('compact.n.03')
hatchback = wn.synset('hatchback.n.01')
print(motorcar.path_similarity(compact))

0.5


In [251]:
print(hatchback.path_similarity(compact))

0.3333333333333333


In [252]:
print(hatchback.path_similarity(hatchback))

1.0
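
The three scores above are consistent with the rule 1 / (d + 1), where d is the number of edges on the shortest path. The sketch below reproduces them on a tiny hand-made hypernym graph with a breadth-first search. The graph is an invented fragment, and the functions are illustrative stand-ins, not NLTK's implementation.

```python
from collections import deque

# Toy hypernym links (child -> parents); an invented fragment,
# not the real WordNet graph.
hypernyms = {
    "compact": ["car"],
    "hatchback": ["car"],
    "car": ["motor_vehicle"],
    "truck": ["motor_vehicle"],
    "motor_vehicle": [],
}

def neighbours(node):
    """Nodes one edge away, following hypernym links in either direction."""
    ups = hypernyms[node]
    downs = [child for child, parents in hypernyms.items() if node in parents]
    return ups + downs

def shortest_path_length(a, b):
    """Breadth-first search for the number of edges between a and b."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path

def path_similarity(a, b):
    """Toy version of the score: 1 / (shortest path length + 1)."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)

print(path_similarity("car", "compact"))          # 0.5
print(path_similarity("hatchback", "compact"))    # 0.3333333333333333
print(path_similarity("hatchback", "hatchback"))  # 1.0
```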


# Automated ways for finding synonyms

WordNet is constructed manually by linguists. There is also a computational approach to semantics. Below we look at one such approach for finding synonyms. It relies on the following fundamental hypothesis:
<br><br>
<center><b>Distributional Hypothesis: similar words appear in similar contexts.</b></center>
<br><br>
We first need to build the vocabulary of a corpus (the `build_corpus` function below) and a co-occurrence matrix.

In [254]:
def build_corpus(text):
    """Return the sorted stemmed vocabulary of text plus word<->index lookups."""
    tokenizer = RegexpTokenizer(r'\w+')
    porter = nltk.PorterStemmer()

    # Lowercase every token, sentence by sentence
    words = []
    for sentence in sent_tokenize(text):
        for token in tokenizer.tokenize(sentence):
            words.append(token.lower())

    # Stem the distinct tokens, then deduplicate again because
    # different tokens can share the same stem
    words = np.unique([porter.stem(t) for t in np.unique(words)])

    # Position in the sorted vocabulary serves as the word's index
    word_to_index = {w: i for i, w in enumerate(words)}
    index_to_word = {i: w for i, w in enumerate(words)}
    return words, word_to_index, index_to_word

In [255]:
text = "This is data mining course cmp3004. It is about data mining. I like it so much."
words, word_to_index, index_to_word = build_corpus(text)

In [257]:
corpus_size = len(words)
print(words)
print(word_to_index)

['about' 'cmp3004' 'cours' 'data' 'i' 'is' 'it' 'like' 'mine' 'much' 'so'
 'thi']
{'about': 0, 'cmp3004': 1, 'cours': 2, 'data': 3, 'i': 4, 'is': 5, 'it': 6, 'like': 7, 'mine': 8, 'much': 9, 'so': 10, 'thi': 11}


In [258]:
text = gutenberg.raw('austen-emma.txt')
words, word_to_index, index_to_word = build_corpus(text)
corpus_size = len(words)
print(corpus_size)
words[:1000]

4643


array(['000', '10', '1816', '23rd', '24th', '26th', '28th', '7th', '8th',
       '_', '_______', '_a_', '_accepted_', '_adair_', '_addition_',
       '_all_', '_almost_', '_alone_', '_amor_', '_and_', '_answer_',
       '_any_', '_appropriation_', '_as_', '_assistance_', '_at_',
       '_bath_', '_be_', '_been_', '_blunder_', '_boiled_', '_both_',
       '_bride_', '_broke_', '_caro_', '_cause_', '_chaperon_',
       '_compassion_', '_compliments_', '_court_', '_courtship_', '_did_',
       '_dissolved_', '_dixon_', '_dixons_', '_doubts_', '_each_',
       '_eighteen_', '_elton_', '_engagement_', '_evening_', '_felt_',
       '_first_', '_gentleman_', '_great_', '_greater_', '_had_',
       '_half_', '_happily_', '_has_', '_have_', '_he_', '_her_',
       '_here_', '_him_', '_his_', '_home_', '_housebreaking_', '_i_',
       '_introduction_', '_invite_', '_is_', '_it_', '_joint_', '_just_',
       '_lady_', '_letting_', '_little_', '_man_', '_married_', '_marry_',
       '_may_', '_me_

In [None]:
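def build_co_matrix2(text, words, word_to_index, window=1):
    """Dense co-occurrence counts over the stemmed vocabulary from build_corpus."""
    porter = nltk.PorterStemmer()
    tokenizer = RegexpTokenizer(r'\w+')
    corpus_size = len(words)
    co_matrix = np.zeros((corpus_size, corpus_size), dtype=int)
    for s in sent_tokenize(text):
        # Lowercase and stem the sentence so tokens match the vocabulary keys
        sent = [porter.stem(w.lower()) for w in tokenizer.tokenize(s)]
        for i, w in enumerate(sent):
            # The window includes position i itself, so the diagonal
            # also counts each word's own occurrences
            for j in range(max(i - window, 0), min(i + window + 1, len(sent))):
                co_matrix[word_to_index[w], word_to_index[sent[j]]] += 1
    return co_matrix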

In [None]:
def build_co_matrix(text, window=1):
    """Co-occurrence counts as a DataFrame labelled by the lowercased tokens."""
    tokenizer = RegexpTokenizer(r'\w+')
    co_matrix = pd.DataFrame()
    for s in sent_tokenize(text):
        sent = [w.lower() for w in tokenizer.tokenize(s)]
        for i, w in enumerate(sent):
            for j in range(max(i - window, 0), min(i + window + 1, len(sent))):
                if w == sent[j]:  # skip the word itself
                    co_matrix.loc[w, sent[j]] = 0
                elif (w in co_matrix.index and sent[j] in co_matrix.columns) and not np.isnan(co_matrix.loc[w, sent[j]]):
                    # the cell already holds a count
                    co_matrix.loc[w, sent[j]] += 1
                else:
                    # first co-occurrence of this pair; .loc adds the row/column if needed
                    co_matrix.loc[w, sent[j]] = 1
    co_matrix.fillna(0, inplace=True)
    return co_matrix
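
Growing a DataFrame one cell at a time can become slow on a long text. As a possible alternative, the sketch below keeps the same window logic but stores the counts in a `collections.Counter` keyed by (word, neighbour) pairs. The function name `count_cooccurrences` and the naive sentence split on `.`, `!` and `?` are assumptions made to keep the example free of NLTK; `sent_tokenize` handles sentence boundaries more carefully.

```python
import re
from collections import Counter

def count_cooccurrences(text, window=1):
    """Count ordered (word, neighbour) pairs within `window` tokens of each other."""
    counts = Counter()
    # Assumption: a sentence ends at '.', '!' or '?'.
    for sentence in re.split(r"[.!?]+", text):
        sent = [t.lower() for t in re.findall(r"\w+", sentence)]
        for i, w in enumerate(sent):
            lo, hi = max(i - window, 0), min(i + window + 1, len(sent))
            for j in range(lo, hi):
                if sent[j] != w:  # skip the word itself
                    counts[(w, sent[j])] += 1
    return counts

text = "This is data mining course cmp3004. It is about data mining. I like it so much."
counts = count_cooccurrences(text, window=2)
print(counts[("data", "mining")])  # 2
print(counts[("it", "much")])      # 1
print(counts[("this", "mining")])  # 0
```

Only pairs that actually occur are stored, so memory grows with the number of distinct pairs rather than with the square of the vocabulary size.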

How tokenization with regex works

In [259]:
text = "This is data mining course cmp3004. It is about data mining. I like it so much."
tokenizer = RegexpTokenizer(r'\w+')
for w in tokenizer.tokenize(text):  
    print(w)

This
is
data
mining
course
cmp3004
It
is
about
data
mining
I
like
it
so
much
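
One detail worth noting: `\w` matches letters, digits and the underscore. This would explain why the Emma vocabulary printed earlier contains entries such as `'_all_'`, presumably emphasis markup in the source text. The sketch below shows the same behaviour with the standard `re` module on a made-up sentence; the underscore clean-up at the end is one possible fix, not part of the original pipeline.

```python
import re

# \w matches letters, digits and '_', so underscores stay inside tokens
# while other punctuation acts as a separator.
sample = "She was _very_ well-bred; see cmp3004."
tokens = re.findall(r"\w+", sample)
print(tokens)  # ['She', 'was', '_very_', 'well', 'bred', 'see', 'cmp3004']

# Possible clean-up (an assumption): strip leading and trailing underscores
# and drop tokens that become empty.
cleaned = [t.strip("_") for t in tokens if t.strip("_")]
print(cleaned)  # ['She', 'was', 'very', 'well', 'bred', 'see', 'cmp3004']
```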


In [261]:
matrix = build_co_matrix(text, 2)
matrix

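| | this | is | data | mining | course | cmp3004 | it | about | i | like | so | much |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| this | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| is | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| data | 1.0 | 2.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mining | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| course | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| cmp3004 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| it | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| about | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| i | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| like | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |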


In [262]:
words, word_to_index, index_to_word = build_corpus(text)
matrix = build_co_matrix2(text, words, word_to_index, 5)
matrix

array([[1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
       [0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
       [1, 1, 1, 2, 0, 2, 1, 0, 2, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0],
       [1, 1, 1, 2, 0, 2, 1, 0, 2, 0, 0, 1],
       [1, 0, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0],
       [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0],
       [1, 1, 1, 2, 0, 2, 1, 0, 2, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0],
       [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1]])

In [263]:
text = gutenberg.raw('austen-emma.txt')
words, word_to_index, index_to_word = build_corpus(text)
co_matrix = build_co_matrix2(text, words, word_to_index, 5)
print(co_matrix.shape)
print(co_matrix)

(4643, 4643)
[[ 2  2  0 ...  0  0  0]
 [ 2  2  0 ...  0  0  0]
 [ 0  0  1 ...  0  0  0]
 ...
 [ 0  0  0 ... 15  0  0]
 [ 0  0  0 ...  0  4  0]
 [ 0  0  0 ...  0  0  1]]


### Cosine Similarity
Intuition: the dot product grows when corresponding components of the two vectors have the same sign and shrinks when they have opposite signs. This is similar to correlation; in fact, Pearson correlation is just the cosine similarity of the mean-centered vectors. Dividing by the norms is necessary so that vectors with large values are not favored simply because of their magnitude.

In [264]:
# Finds cosine similarity between two vectors a and b
def cosine(a, b):
    dot = np.dot(a, b)
    norma = np.linalg.norm(a)
    normb = np.linalg.norm(b)
    return dot / (norma * normb)
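
The two properties behind this measure, invariance to rescaling and the link to Pearson correlation, can be checked numerically. The vectors `a` and `b` below are arbitrary values chosen for the illustration, and `cosine` is repeated so that the snippet runs on its own.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0])

# Rescaling a vector changes the dot product but not the cosine.
print(np.dot(a, b), np.dot(10 * a, b))              # 28.0 280.0
print(np.isclose(cosine(a, b), cosine(10 * a, b)))  # True

# Pearson correlation equals the cosine of the mean-centered vectors.
pearson = np.corrcoef(a, b)[0, 1]
print(np.isclose(pearson, cosine(a - a.mean(), b - b.mean())))  # True
```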

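Find the words most similar to the target word using cosine similarity. The target's own norm is the same for every candidate word, so dividing the dot products by the candidates' norms alone yields the same ranking as the full cosine.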

In [265]:
target = word_to_index['brother']
target

669

In [266]:
word_vector = co_matrix[target,:]
word_vector.shape

(4643,)

In [267]:
word_vector = np.reshape(word_vector,(word_vector.size,1 ))
word_vector.shape

(4643, 1)

In [269]:
sims = np.dot(word_vector.T,co_matrix)
sims = sims[0,:]
sims

array([112, 117,   7, ..., 838, 257,  34])

In [270]:
sims.argsort()[::-1][:10]

array([ 345, 4117, 4064, 2862, 2046,  176, 4453,  521, 2186, 2356])

In [271]:
index_to_word[345]

'and'

In [272]:
norms = np.linalg.norm(co_matrix, axis=0)
norms

array([ 5.47722558,  5.47722558,  3.74165739, ..., 24.93992783,
        9.38083152,  2.82842712])

In [273]:
word_to_index["and"]

345

In [274]:
norms[345]

6947.785978281138

In [275]:
norm_sims = np.divide(sims,norms)

In [276]:
norm_sims.argsort()[::-1][:10]

array([ 669, 3716,  345, 2059,  356, 2717, 1262, 2046, 3585, 4117])

In [277]:
index_to_word[669]

'brother'

In [278]:
index_to_word[3716]


'sister'

In [279]:
word_vector = co_matrix[word_to_index['artist'],:]
word_vector = np.reshape(word_vector,(1,word_vector.size))
sims = np.dot(word_vector,co_matrix)
sims = sims[0,:]
norm_sims = np.divide(sims,norms)
norm_sims.argsort()[::-1][:10]

array([ 417, 3727, 2744, 4064, 2901, 1569, 1423, 2862, 1339, 3648])

In [280]:
index_to_word[3727]

'skill'
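
The steps above can be wrapped into a single function. This is a minimal sketch rather than the notebook's own code: `most_similar` is a hypothetical helper, it divides by both norms (the full cosine), it leaves the target word out of the result, and the small matrix is invented for the demonstration.

```python
import numpy as np

def most_similar(word, co_matrix, word_to_index, index_to_word, k=10):
    """Return the k words whose co-occurrence rows have the highest cosine with `word`."""
    m = np.asarray(co_matrix, dtype=float)
    norms = np.linalg.norm(m, axis=1)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    t = word_to_index[word]
    sims = (m @ m[t]) / (norms * norms[t])
    # Sort by decreasing similarity and leave out the target word itself
    ranked = [int(i) for i in np.argsort(sims)[::-1] if i != t]
    return [(index_to_word[i], float(sims[i])) for i in ranked[:k]]

# Toy symmetric co-occurrence counts, invented for illustration.
vocab = ["brother", "sister", "and", "piano", "play"]
toy = np.array([
    [0, 1, 5, 0, 0],
    [1, 0, 4, 0, 0],
    [5, 4, 0, 1, 1],
    [0, 0, 1, 0, 6],
    [0, 0, 1, 6, 0],
])
w2i = {w: i for i, w in enumerate(vocab)}
i2w = {i: w for i, w in enumerate(vocab)}
result = most_similar("brother", toy, w2i, i2w, k=2)
print(result)
```

In this toy matrix "sister" ranks first for "brother" because both words co-occur mostly with "and". With the notebook's variables the call would look like `most_similar('brother', co_matrix, word_to_index, index_to_word)`.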