# Exercise Sheet 7 - Semantics: WSD

## Learning Objectives

In this lab we are going to:

- Explore word sense disambiguation (WSD) with WordNet/NLTK
- Perform WSD using the Lesk algorithm


-------
## WSD using WordNet

WSD is the task of identifying which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings.

Princeton WordNet (WN) is one of the most important resources for natural language processing. It is a manually created resource that has been used in many different tasks and applications across linguistics and natural language processing. WordNet’s hierarchical structure makes it a useful tool for many semantic applications and it also plays a vital role in modern deep learning based NLP systems.

In [None]:
#setting the stage, as usual with colab ;)
import nltk
nltk.download('all')

In [3]:
# importing WN
from nltk.corpus import wordnet as wn

A **synset** (synonym set) is a group of synonymous words that express the same concept. NLTK exposes synsets through a simple interface for looking up words in WordNet.

In [4]:
# Look up a word using synsets()
wn.synsets('car')

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]

Notice the part of speech, which is represented by "n" (noun) in this example. You can also get the POS of a synset using ".pos()".

**Exercise 0** 

Play around with this interface and try a word that can be both a verb and a noun. What comes to my mind is "break". Can you come up with another one and check it?

In [5]:
wn.synsets('break')

[Synset('interruption.n.02'),
 Synset('break.n.02'),
 Synset('fault.n.04'),
 Synset('rupture.n.02'),
 Synset('respite.n.02'),
 Synset('breakage.n.03'),
 Synset('pause.n.01'),
 Synset('fracture.n.01'),
 Synset('break.n.09'),
 Synset('break.n.10'),
 Synset('break.n.11'),
 Synset('break.n.12'),
 Synset('break.n.13'),
 Synset('break.n.14'),
 Synset('open_frame.n.01'),
 Synset('break.n.16'),
 Synset('interrupt.v.04'),
 Synset('break.v.02'),
 Synset('break.v.03'),
 Synset('break.v.04'),
 Synset('break.v.05'),
 Synset('transgress.v.01'),
 Synset('break.v.07'),
 Synset('break.v.08'),
 Synset('break.v.09'),
 Synset('break.v.10'),
 Synset('break_in.v.01'),
 Synset('break_in.v.06'),
 Synset('violate.v.01'),
 Synset('better.v.01'),
 Synset('unwrap.v.02'),
 Synset('break.v.16'),
 Synset('fail.v.04'),
 Synset('break.v.18'),
 Synset('break.v.19'),
 Synset('break.v.20'),
 Synset('dampen.v.07'),
 Synset('break.v.22'),
 Synset('break.v.23'),
 Synset('break.v.24'),
 Synset('break.v.25'),
 Synset('break.v

**Exercise 1**

For each entry in the "car" synsets print out the following:
- name of the synset - synset.name()
- definition of the synset
- examples of the synset

In [6]:
# your code goes here

car_syns=wn.synsets('car')

for syn in car_syns:
   print(syn.name())
   print(syn.definition())
   print(syn.examples())
   print('\n')

car.n.01
a motor vehicle with four wheels; usually propelled by an internal combustion engine
['he needs a car to get to work']


car.n.02
a wheeled vehicle adapted to the rails of railroad
['three cars had jumped the rails']


car.n.03
the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
[]


car.n.04
where passengers ride up and down
['the car was on the top floor']


cable_car.n.01
a conveyance for passengers or freight on a cable railway
['they took a cable car to the top of the mountain']




### Lemmas, Synonyms and Antonyms
Each synset groups the words that share one meaning. To see which words belong to a given sense of "car", we can use lemma_names(), which returns the members of the "synonym set", i.e. the collection of synonymous words (or "lemmas").

In [7]:
wn.synset('car.n.01').lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

You can get all the synonyms of a given word as follows:

In [8]:
for synset in wn.synsets("nice"):
    print(synset)
    # collect the lemma names of this synset only
    synonyms = {lemma.name() for lemma in synset.lemmas()}
    print(synonyms)

Synset('nice.n.01')
{'Nice'}
Synset('nice.a.01')
{'nice'}
Synset('decent.s.01')
{'decent', 'nice'}
Synset('nice.s.03')
{'skillful', 'nice'}
Synset('dainty.s.04')
{'squeamish', 'overnice', 'dainty', 'prissy', 'nice'}
Synset('courteous.s.01')
{'courteous', 'gracious', 'nice'}


**Exercise 2**

Extend the previous code to obtain the antonyms of each synset.

In [28]:
for syn in wn.synsets('nice'): # try 'good' instead of 'nice'
  print(syn)
  antonyms = []  # reset for each synset so earlier results do not carry over
  for lemma in syn.lemmas():
    if lemma.antonyms():
      antonyms.append(lemma.antonyms()[0].name())

  print(antonyms)

# note: you can turn this into a function that takes an input word and returns a list of antonyms

Synset('nice.n.01')
[]
Synset('nice.a.01')
['nasty']
Synset('decent.s.01')
[]
Synset('nice.s.03')
[]
Synset('dainty.s.04')
[]
Synset('courteous.s.01')
[]


### Hyponyms and hypernyms

Now that we have seen how to get synonyms, we might also be interested in the hypernyms and hyponyms of a given word.

In [29]:
wn.synsets('printer')

[Synset('printer.n.01'), Synset('printer.n.02'), Synset('printer.n.03')]

In [30]:
wn.synset('printer.n.03').lemmas()

[Lemma('printer.n.03.printer'), Lemma('printer.n.03.printing_machine')]

In [31]:
machine_that_prints = wn.synset('printer.n.03')
for synset in machine_that_prints.hyponyms():
    for lemma in synset.lemmas():
        print(lemma.name())

addressing_machine
Addressograph
character_printer
character-at-a-time_printer
serial_printer
electrostatic_printer
impact_printer
line_printer
line-at-a-time_printer
page_printer
page-at-a-time_printer
printer
thermal_printer
typesetting_machine


**Exercise 3**

Can you turn the previous example into a one-liner?

In [33]:
# your code goes here
[lemma.name() for synset in wn.synset('printer.n.03').hyponyms() for lemma in synset.lemmas()]

['addressing_machine',
 'Addressograph',
 'character_printer',
 'character-at-a-time_printer',
 'serial_printer',
 'electrostatic_printer',
 'impact_printer',
 'line_printer',
 'line-at-a-time_printer',
 'page_printer',
 'page-at-a-time_printer',
 'printer',
 'thermal_printer',
 'typesetting_machine']

Similarly, we can get the hypernyms, again with a one-liner:

In [34]:
[lemma.name() for synset in machine_that_prints.hypernyms() for lemma in synset.lemmas()]

['machine']

In this case, the more general term for printer.n.03 ("printing_machine") is simply "machine".


We can also get the lowest common hypernym of two word senses. For example, for "truck.n.01" and "limousine.n.01":

In [35]:
truck = wn.synset('truck.n.01')
limousine = wn.synset('limousine.n.01')
truck.lowest_common_hypernyms(limousine)

[Synset('motor_vehicle.n.01')]
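
To make the idea concrete, here is a minimal pure-Python sketch of how a lowest common hypernym can be found. It does not use WordNet: `TOY_HYPERNYM` is a hand-written, simplified taxonomy in which every node has a single parent (real WordNet synsets can have several), and `path_to_root` and `lowest_common_hypernym` are illustrative helper names, not NLTK functions.

```python
# Toy hypernym links (child -> parent), hand-written for illustration only.
# Assumption: each node has exactly one parent, unlike real WordNet.
TOY_HYPERNYM = {
    "truck": "motor_vehicle",
    "limousine": "car",
    "car": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "vehicle": "entity",
}


def path_to_root(node, parent=TOY_HYPERNYM):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lowest_common_hypernym(a, b, parent=TOY_HYPERNYM):
    """Walk upward from `a` and return the first node that is also above (or equal to) `b`."""
    ancestors_of_b = set(path_to_root(b, parent))
    for node in path_to_root(a, parent):
        if node in ancestors_of_b:
            return node
    return None  # the two nodes share no ancestor in this taxonomy


print(lowest_common_hypernym("truck", "limousine"))
```

Walking upward from one node and stopping at the first shared ancestor returns the most specific common ancestor, which is why the answer above is "motor_vehicle" rather than a more general node.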

### WN hierarchy
The hypernym path of a certain entity can be traced using the hypernym_paths() method which returns a list of lists. Each list starts at the root hypernym and ends with the original Synset. 

In [36]:
for synset in machine_that_prints.hypernym_paths()[0]:
    print(synset.name())

entity.n.01
physical_entity.n.01
object.n.01
whole.n.02
artifact.n.01
instrumentality.n.03
device.n.01
machine.n.01
printer.n.03


### Meronymy

We might be interested in looking at the parts of something. To achieve that with WN and NLTK, we can take advantage of two methods:
- part_meronyms() - obtains parts
- substance_meronyms() - obtains substances

Let us check the parts of a "car"

In [37]:
car = wn.synset('car.n.01')

for meronym in car.part_meronyms():
    print(meronym.name())

accelerator.n.01
air_bag.n.01
auto_accessory.n.01
automobile_engine.n.01
automobile_horn.n.01
buffer.n.06
bumper.n.02
car_door.n.01
car_mirror.n.01
car_seat.n.01
car_window.n.01
fender.n.01
first_gear.n.01
floorboard.n.02
gasoline_engine.n.01
glove_compartment.n.01
grille.n.02
high_gear.n.01
hood.n.09
luggage_compartment.n.01
rear_window.n.01
reverse.n.02
roof.n.02
running_board.n.01
stabilizer_bar.n.01
sunroof.n.01
tail_fin.n.02
third_gear.n.01
window.n.02


**Exercise 4** 

Obtain the substances of "tree.n.02".

In [40]:
#your code goes here
tree = wn.synset('tree.n.02')
print(tree.substance_meronyms())

[]


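### Other lexical relations
There are other relations, such as:
- Holonym: the whole that something is a part, member or substance of (the inverse of a meronym)
- Entailment: an action that a verb necessarily implies (e.g. eating entails chewing and swallowing)

They can be obtained as follows: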

In [41]:
wn.synset('atom.n.01').part_holonyms()

[Synset('chemical_element.n.01'), Synset('molecule.n.01')]

In [42]:
wn.synset('hydrogen.n.01').substance_holonyms()

[Synset('water.n.01')]

In [43]:
wn.synset('eat.v.01').entailments()

[Synset('chew.v.01'), Synset('swallow.v.01')]

### Similarity

We can measure the similarity between two word senses based on the shortest path that connects them via hypernym/hyponym relations.

For example, let us check the similarity between "truck" and "limousine". We can employ "path_similarity", which scores two synsets by the shortest path between them and returns a value between 0 and 1: the closer to 0, the less similar the senses, and 1 means the two synsets are identical. Concretely, the score is 1 / (d + 1), where d is the number of edges on the shortest path.

In [44]:
truck = wn.synset('truck.n.01')
limousine = wn.synset('limousine.n.01')

In [45]:
truck.path_similarity(limousine) # 0.25

0.25
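
The 0.25 above can be reproduced by hand. The sketch below is a pure-Python illustration, not NLTK code: `TOY_EDGES` is a small hand-copied fragment of the hypernym links around the two senses (an assumption for illustration), and `shortest_distance` and `toy_path_similarity` are hypothetical helper names. It counts the edges on the shortest path and applies 1 / (distance + 1).

```python
from collections import deque

# Hand-copied fragment of hypernym links (child, parent); an illustrative assumption.
TOY_EDGES = [
    ("truck", "motor_vehicle"),
    ("car", "motor_vehicle"),
    ("limousine", "car"),
]


def shortest_distance(start, goal, edges=TOY_EDGES):
    """Breadth-first search over the links, treated as undirected edges."""
    neighbours = {}
    for child, parent in edges:
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path between the two nodes


def toy_path_similarity(a, b):
    """1 / (number of edges on the shortest path + 1), or None if unconnected."""
    dist = shortest_distance(a, b)
    return None if dist is None else 1 / (dist + 1)


print(toy_path_similarity("truck", "limousine"))
```

Here the path truck, motor_vehicle, car, limousine has three edges, so the score is 1 / (3 + 1) = 0.25.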

**Exercise 5** 

Get the similarity (using the first sense of each word) between the words "train", "car", "vehicle", "horse", "animal" and "atom". Which pair has the highest similarity?

In [47]:
#your code goes here
train = wn.synset('train.n.01')
car = wn.synset('car.n.01')
vehicle = wn.synset('vehicle.n.01')
horse = wn.synset('horse.n.01')
animal = wn.synset('animal.n.01')
atom = wn.synset('atom.n.01')

print("Car => Train: {}".format(car.path_similarity(train)))
print("Car => vehicle: {}".format(car.path_similarity(vehicle)))
print("Car => horse: {}".format(car.path_similarity(horse)))
# do that for all pairs

Car => Train: 0.125
Car => vehicle: 0.2
Car => horse: 0.05263157894736842
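
To cover all pairs without writing one print per pair, one option is itertools.combinations. The sketch below is self-contained and does not call WordNet: `rank_pairs` is a hypothetical helper, and `toy_similarity` is a stand-in that only knows the three scores printed above and returns None for every other pair.

```python
from itertools import combinations


def rank_pairs(items, similarity):
    """Score every unordered pair once and sort from most to least similar."""
    scored = []
    for a, b in combinations(items, 2):
        score = similarity(a, b)
        if score is not None:  # skip pairs the measure cannot score
            scored.append((score, a, b))
    return sorted(scored, key=lambda t: t[0], reverse=True)


# Stand-in scores: only the three values shown in the output above.
KNOWN = {
    frozenset(["car", "train"]): 0.125,
    frozenset(["car", "vehicle"]): 0.2,
    frozenset(["car", "horse"]): 0.05263157894736842,
}


def toy_similarity(a, b):
    """Look up a known score regardless of argument order; None if unknown."""
    return KNOWN.get(frozenset([a, b]))


for score, a, b in rank_pairs(["train", "car", "vehicle", "horse"], toy_similarity):
    print(f"{a} <=> {b}: {score:.3f}")
```

In the notebook, the same helper could be given the synset objects and `lambda a, b: a.path_similarity(b)` as the similarity function; the first entry of the result is then the most similar pair.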


-----------------
## WSD using the Lesk Algorithm

The Lesk algorithm is based on the assumption that words in a given "neighborhood" (section of text) will tend to share a common topic. A simplified version of the Lesk algorithm compares the dictionary definition of each sense of an ambiguous word with the terms contained in its neighborhood and picks the sense with the largest overlap (source: Wikipedia).
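
The overlap idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not NLTK's implementation: `TOY_SENSES` is a made-up two-sense inventory whose labels and glosses are invented for this example (they are not WordNet entries), and `simplified_lesk` is a hypothetical helper name.

```python
# Made-up sense inventory for "bank"; labels and glosses are illustrative only.
TOY_SENSES = {
    "bank.financial": "an institution that accepts deposits of money and makes loans",
    "bank.river": "sloping land beside a body of water such as a river",
}


def simplified_lesk(context_tokens, senses):
    """Return the sense whose gloss shares the most words with the context."""
    context = {tok.lower() for tok in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:  # on ties, the first sense seen is kept
            best_sense, best_overlap = sense, overlap
    return best_sense


tokens = "I went to the bank to deposit my money".split()
print(simplified_lesk(tokens, TOY_SENSES))
```

Because the score is a raw word overlap, function words such as "of" or "the" count as much as content words; filtering stop words is a common refinement that this sketch leaves out.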

In [48]:
from nltk.wsd import lesk
sent = 'I went to the bank to deposit my money'
ambiguous = 'money'
lesk(sent, ambiguous)

Synset('money.n.03')

In [49]:
lesk(sent, ambiguous).definition()

'the official currency issued by a government or national bank'
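
One detail worth checking: as far as I can tell from the NLTK source, lesk builds its context with set(context_sentence), so it expects a list of tokens. When a plain string is passed, the context becomes a set of single characters. The pure-Python sketch below (no NLTK needed; `overlap` is a hypothetical helper) compares both contexts against the definition shown above.

```python
sent = "I went to the bank to deposit my money"
gloss = "the official currency issued by a government or national bank"


def overlap(context, definition):
    """Words of the definition that also occur in the context set."""
    return context & set(definition.split())


char_context = set(sent)           # iterating over a str yields characters
token_context = set(sent.split())  # whitespace tokens

print(overlap(char_context, gloss))   # only one-letter words can match
print(overlap(token_context, gloss))  # whole words can match
```

With a string, only one-letter words in a definition (such as "a") can ever overlap, so passing sent.split() or the output of a tokenizer is likely closer to what the algorithm intends. The selected synsets may then differ from the outputs recorded in this sheet.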

**Exercise 6**

Write a function that takes a sentence, a word to be disambiguated and a POS tag, and returns the disambiguated synset and its definition.

In [54]:
# your code goes here
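def disambig(sent, word, pos):
  """Return the synset chosen by lesk for `word` and its definition."""
  lesk_syn = lesk(sent, word, pos=pos)
  # lesk returns None when WordNet has no synset for this word/POS
  if lesk_syn is None:
    return None, None

  return lesk_syn, lesk_syn.definition()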

print(disambig("Put this mug on the brown table.", "table", 'n'))
print(disambig("It is time to take a short break after this long meeting", "break", 'n'))


(Synset('table.n.05'), 'a company of people assembled at a table for a meal or game')
(Synset('rupture.n.02'), 'a personal or social separation (as between opposing factions)')
