## Import NLTK and download WordNet

In [1]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/saraabesova/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/saraabesova/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [33]:
from nltk.corpus import wordnet as wn

def create_synset(word):
    return wn.synsets(word)
    
words = ["cup", "plate", "apple", "shoe"]
synsets = []

# create synonym sets for each word
for word in words:
    synsets.append(create_synset(word))
    
cup = words[0]
plate = words[1]
apple = words[2]
shoe = words[3]

## **Thematically Related Objects: Cup and Plate**

In [34]:
# cup
print("Synsets for cup:")
for synset in synsets[0]:
    print(synset, synset.definition())
    
# plate 
print("\nSynsets for plate:")
for synset in synsets[1]:
    print(synset, synset.definition())

Synsets for cup:
Synset('cup.n.01') a small open container usually used for drinking; usually has a handle
Synset('cup.n.02') the quantity a cup will hold
Synset('cup.n.03') any cup-shaped concavity
Synset('cup.n.04') a United States liquid unit equal to 8 fluid ounces
Synset('cup.n.05') cup-shaped plant organ
Synset('cup.n.06') a punch served in a pitcher instead of a punch bowl
Synset('cup.n.07') the hole (or metal container in the hole) on a golf green
Synset('cup.n.08') a large metal vessel with two handles that is awarded as a trophy to the winner of a competition
Synset('cup.v.01') form into the shape of a cup
Synset('cup.v.02') put into a cup
Synset('cup.v.03') treat by applying evacuated cups to the patient's skin

Synsets for plate:
Synset('home_plate.n.01') (baseball) base consisting of a rubber slab where the batter stands; it must be touched by a base runner in order to score
Synset('plate.n.02') a sheet of metal or wood or glass or plastic
Synset('plate.n.03') a full-page 

### Calculate relatedness using WordNet

------------------------------------------------------------------------------------------------------------------------

**Path similarity**: A similarity measure based on the length of the shortest path that connects two synsets in the hypernym/hyponym (is-a) taxonomy. The score is 1 / (1 + d), where d is the number of edges on that path.

*The score can be 0 < score <= 1* -- Higher values indicate closer or more similar concepts. A value of 1 indicates that the synsets are identical, while a value closer to 0 indicates less similarity.
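
As a minimal sketch of the idea, the snippet below computes 1 / (1 + d) on a tiny hand-made taxonomy. The `HYPERNYM` map and both helper functions are hypothetical and only illustrate the formula; they are not WordNet data or the NLTK implementation.

```python
from collections import deque

# Hypothetical mini-taxonomy (child -> parent); NOT real WordNet data.
HYPERNYM = {
    "container": "artifact",
    "tableware": "artifact",
    "cup": "container",
    "mug": "cup",
    "plate": "tableware",
}

def shortest_path_length(a, b):
    """Breadth-first search over is-a links, treated as undirected edges."""
    neighbours = {}
    for child, parent in HYPERNYM.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path

def path_similarity(a, b):
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (1 + dist)

print(path_similarity("cup", "cup"))    # identical nodes: d = 0
print(path_similarity("cup", "plate"))  # cup -> container -> artifact -> tableware -> plate: d = 4
```

Because the score is the reciprocal of the path length plus one, it falls off quickly: in this toy tree a single extra edge (mug instead of cup) already lowers the score from 1/5 to 1/6.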

------------------------------------------------------------------------------------------------------------------------

**Wu-Palmer similarity**: It calculates relatedness from the depths of the two synsets in the WordNet taxonomy, along with the depth of their LCS (Least Common Subsumer): score = 2 × depth(LCS) / (depth(synset_1) + depth(synset_2)).

Synset Depth: The depth of a synset is the number of edges from the root of the hypernym tree to the synset.

Least Common Subsumer (LCS): The LCS of two synsets is the deepest synset that is a hypernym (superclass) of both synsets.

*The score can be 0 < score <= 1* -- Higher values indicate more similar concepts. A value of 1 indicates identical synsets, and a lower value indicates less similarity.
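
The following sketch applies the Wu-Palmer ratio, 2 × depth(LCS) / (depth(a) + depth(b)), to a toy tree. The `PARENT` map and the helper names are hypothetical. Depth is counted in nodes here (the root has depth 1) rather than in edges; that is an assumption made so the score stays above 0 even when the LCS is the root.

```python
# Hypothetical mini-taxonomy (child -> parent); NOT real WordNet data.
PARENT = {
    "artifact": "entity",
    "container": "artifact",
    "tableware": "artifact",
    "cup": "container",
    "plate": "tableware",
    "fruit": "entity",
    "apple": "fruit",
}

def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(node):
    """Depth counted in nodes, so the root has depth 1 (assumption)."""
    return len(ancestors(node))

def wup_similarity(a, b):
    chain_b = set(ancestors(b))
    # Walking up from a, the first node shared with b's chain is the deepest
    # common ancestor, i.e. the Least Common Subsumer.
    lcs = next(n for n in ancestors(a) if n in chain_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup_similarity("cup", "plate"))  # LCS = artifact: 2 * 2 / (4 + 4)
print(wup_similarity("cup", "apple"))  # LCS = entity:   2 * 1 / (4 + 3)
```

A pair whose common ancestor sits deep in the tree scores higher than a pair that only meets at the root, even when the individual depths are similar.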

------------------------------------------------------------------------------------------------------------------------

**Leacock-Chodorow similarity**: An extension of path-based similarity that also incorporates the depth of the taxonomy. It is the negative log of the shortest path length (spath) between two concepts (synset_1 and synset_2) divided by twice the total depth of the taxonomy (D): score = -log(spath / (2 × D)).

*The score depends on the taxonomy depth, typically around 0 to 3.6 in WordNet* -- Higher values indicate more similar concepts. Unlike path similarity and Wu-Palmer similarity, this measure is not bounded by 1, because it takes the logarithm of the path length scaled by the taxonomy depth.
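
As a rough check of the formula, the sketch below derives a Leacock-Chodorow score from a path similarity score. It rests on two assumptions that are not stated in this notebook: that D is 19 for the WordNet noun taxonomy, and that the path length is counted in nodes (edges + 1). The function name is hypothetical. Under those assumptions the upper end of the range is log(38) ≈ 3.64, which agrees with the "around 3.6" mentioned above.

```python
import math

NOUN_TAXONOMY_DEPTH = 19  # assumption: value of D for WordNet nouns

def lch_from_path_similarity(path_sim, depth=NOUN_TAXONOMY_DEPTH):
    """Leacock-Chodorow score recovered from a path similarity score."""
    # path_sim = 1 / (1 + edges), so 1 / path_sim is the path length in nodes.
    nodes_on_path = round(1 / path_sim)
    return -math.log(nodes_on_path / (2 * depth))

print(lch_from_path_similarity(1.0))  # identical synsets: log(38)
print(lch_from_path_similarity(0.2))  # path of 4 edges: -log(5 / 38)
```

Feeding in the path similarity scores reported in this notebook (0.2 for cup and plate, 0.0833 for apple and shoe) gives approximately 2.028 and 1.153, matching the Leacock-Chodorow values printed in the results.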

------------------------------------------------------------------------------------------------------------------------

In [35]:
# synset is chosen based on definition of word
synset_cup = synsets[0][0] 
synset_plate = synsets[1][3]

# Path similarity
path_similarity1 = synset_cup.path_similarity(synset_plate)

# Wu-Palmer similarity
wup_similarity1 = synset_cup.wup_similarity(synset_plate)

# Leacock-Chodorow similarity
lch_similarity1 = synset_cup.lch_similarity(synset_plate)


print(f"Path similarity between {cup} and {plate}: {path_similarity1}")
print(f"Wu-Palmer similarity between {cup} and {plate}: {wup_similarity1}")
print(f"Leacock-Chodorow similarity between {cup} and {plate}: {lch_similarity1}")


Path similarity between cup and plate: 0.2
Wu-Palmer similarity between cup and plate: 0.8
Leacock-Chodorow similarity between cup and plate: 2.0281482472922856


### Calculate the word relatedness using word vectors from spaCy

**Cosine similarity**: The cosine of the angle between two vectors in a multi-dimensional space, computed as the dot product of the vectors divided by the product of their norms. Applied to word embeddings, it serves as a measure of semantic similarity.

*Score is between -1 and 1* -- Higher values indicate more similar vectors. A value of 1 indicates identical direction (maximum similarity), 0 indicates orthogonality (no similarity), and -1 indicates opposite direction (maximum dissimilarity).
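
To make the three reference points concrete, here is a small sketch that spells out the arithmetic in plain Python on made-up two-dimensional vectors. The vectors are illustrative only and are not word embeddings.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))  # opposite direction
```

Note that the second vector in the first call is twice as long as the first, yet the score is still the maximum: cosine similarity depends only on direction, not on magnitude.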

In [5]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [36]:
import spacy

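# small English pipeline; it ships without static word vectors,
# so .vector is derived from context-sensitive tensors instead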
nlp = spacy.load("en_core_web_sm")

vector1 = nlp(cup).vector
vector2 = nlp(plate).vector

# cosine similarity

from numpy import dot
from numpy.linalg import norm

cosine_similarity1 = dot(vector1, vector2) / (norm(vector1) * norm(vector2))

print(f"Cosine similarity between {cup} and {plate} is: {cosine_similarity1}")

Cosine similarity between cup and plate is: 0.7203201055526733


## **Thematically Unrelated Objects: Apple and Shoe**

In [37]:
# apple
print("Synsets for apple:")
for synset in synsets[2]:
    print(synset, synset.definition())
    
# shoe 
print("\nSynsets for shoe:")
for synset in synsets[3]:
    print(synset, synset.definition())

Synsets for apple:
Synset('apple.n.01') fruit with red or yellow or green skin and sweet to tart crisp whitish flesh
Synset('apple.n.02') native Eurasian tree widely cultivated in many varieties for its firm rounded edible fruits

Synsets for shoe:
Synset('shoe.n.01') footwear shaped to fit the foot (below the ankle) with a flexible upper of leather or plastic and a sole and heel of heavier material
Synset('shoe.n.02') (card games) a case from which playing cards are dealt one at a time
Synset('horseshoe.n.02') U-shaped plate nailed to underside of horse's hoof
Synset('brake_shoe.n.01') a restraint provided when the brake linings are moved hydraulically against the brake drum to retard the wheel's rotation
Synset('shoe.v.01') furnish with shoes


In [38]:
# synset is chosen based on definition of word
synset_apple = synsets[2][0] 
synset_shoe = synsets[3][0]

# Path similarity
path_similarity2 = synset_apple.path_similarity(synset_shoe)

# Wu-Palmer similarity
wup_similarity2 = synset_apple.wup_similarity(synset_shoe)

# Leacock-Chodorow similarity
lch_similarity2 = synset_apple.lch_similarity(synset_shoe)


print(f"Path similarity between {apple} and {shoe}: {path_similarity2}")
print(f"Wu-Palmer similarity between {apple} and {shoe}: {wup_similarity2}")
print(f"Leacock-Chodorow similarity between {apple} and {shoe}: {lch_similarity2}")

Path similarity between apple and shoe: 0.08333333333333333
Wu-Palmer similarity between apple and shoe: 0.42105263157894735
Leacock-Chodorow similarity between apple and shoe: 1.1526795099383855


In [39]:
vector3 = nlp(apple).vector
vector4 = nlp(shoe).vector

# cosine similarity

from numpy import dot
from numpy.linalg import norm

cosine_similarity2 = dot(vector3, vector4) / (norm(vector3) * norm(vector4))

print(f"Cosine similarity between {apple} and {shoe} is: {cosine_similarity2}")

Cosine similarity between apple and shoe is: 0.7026094794273376


## Combined Results

In [40]:
print("-- Cup and Plate --")
print(f"Path similarity: {path_similarity1}")
print(f"Wu-Palmer similarity: {wup_similarity1}")
print(f"Leacock-Chodorow similarity: {lch_similarity1}")
print(f"Cosine similarity: {cosine_similarity1}\n")

print("-- Apple and Shoe --")
print(f"Path similarity: {path_similarity2}")
print(f"Wu-Palmer similarity: {wup_similarity2}")
print(f"Leacock-Chodorow similarity: {lch_similarity2}")
print(f"Cosine similarity: {cosine_similarity2}")

-- Cup and Plate --
Path similarity: 0.2
Wu-Palmer similarity: 0.8
Leacock-Chodorow similarity: 2.0281482472922856
Cosine similarity: 0.7203201055526733

-- Apple and Shoe --
Path similarity: 0.08333333333333333
Wu-Palmer similarity: 0.42105263157894735
Leacock-Chodorow similarity: 1.1526795099383855
Cosine similarity: 0.7026094794273376


## Working with larger spaCy models

This section compares the cosine similarity scores obtained with the larger spaCy pipelines, en_core_web_md and en_core_web_lg.

In [11]:
!python3 -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [41]:
# medium English pipeline (includes static word vectors)
nlp = spacy.load("en_core_web_md")

# calculate word vectors
vector1 = nlp(cup).vector
vector2 = nlp(plate).vector

# cosine similarity
cosine_similarity1 = dot(vector1, vector2) / (norm(vector1) * norm(vector2))
print(f"Cosine similarity between {cup} and {plate}: {cosine_similarity1}")

# calculate word vectors
vector3 = nlp(apple).vector
vector4 = nlp(shoe).vector

# cosine similarity
cosine_similarity2 = dot(vector3, vector4) / (norm(vector3) * norm(vector4))
print(f"Cosine similarity between {apple} and {shoe}: {cosine_similarity2}")

Cosine similarity between cup and plate: 0.3228256404399872
Cosine similarity between apple and shoe: 0.22793497145175934


In [13]:
!python3 -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [43]:
# large English pipeline (includes static word vectors)
nlp = spacy.load("en_core_web_lg")

# calculate word vectors
vector1 = nlp(cup).vector
vector2 = nlp(plate).vector

# cosine similarity
cosine_similarity1 = dot(vector1, vector2) / (norm(vector1) * norm(vector2))
print(f"Cosine similarity between {cup} and {plate}: {cosine_similarity1}")

# calculate word vectors
vector3 = nlp(apple).vector
vector4 = nlp(shoe).vector

# cosine similarity
cosine_similarity2 = dot(vector3, vector4) / (norm(vector3) * norm(vector4))
print(f"Cosine similarity between {apple} and {shoe}: {cosine_similarity2}")

Cosine similarity between cup and plate: 0.3228256404399872
Cosine similarity between apple and shoe: 0.22793497145175934
