<a href="https://colab.research.google.com/github/wadimiusz/hse_nlp_homeworks/blob/master/WSD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import random
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
stemmer = SnowballStemmer(language='english')

stop_words = list(stopwords.words('english'))

In [0]:
with open('corpus_eng.txt', encoding='utf-8') as f:
    corpus = f.read()

In [0]:
sentences = sent_tokenize(corpus)
sentences = [word_tokenize(sentence.strip()) for sentence in sentences]

In [41]:
sentences_with_break = [sentence for sentence in sentences if 'break' in sentence]
print("Size of pool:", len(sentences_with_break))
# Keep 10 sentences to disambiguate; the rest form a pool of extra contexts used later.
samples, other_samples = train_test_split(sentences_with_break, train_size=10, shuffle=True)
print("Samples:")
for sample in samples:
  print(*sample)

Size of pool: 422
Samples:
“ This week will be a much-deserved break for all of us , ” Frederick said .
“ Saving money and investing … More » Tokyo shares up by break on lower yen , US rate view AFP - 3 hours ago Tokyo stocks rallied Friday morning as the yen sank against the dollar after Federal Reserve chief Janet Yellen indicated the bank would hike US interest rates next month .
Hong Kong shares also rose , as the Thanksgiving break in the United States helped slow a relentless surge in the U.S. dollar that has sucked capital out of most emerging markets .
The process of smoking causes collagen to break down into simple sugars making the meat sweet and tender .
The news will come as a relief to Chelsea fans , with Courtois having kept five clean sheets in a row prior to the international break to help Antonio Conte 's side climb into second place in the Premier League .
The visitors , however , with 211 runs needed from 34 overs in the final session and with nine wickets in hand , 



Let's try the usual Lesk first:

In [0]:
def lesk(word, sentence):
  """Simplified Lesk: pick the sense whose definition shares the most words with the sentence."""
  def keep(token):
    # Drop stop words, the target word itself, short tokens and non-alphabetic tokens.
    return token not in stop_words and token != word and len(token) >= 3 and token.isalpha()

  best_sense = None
  max_overlap = 0
  # The sentence does not depend on the synset, so filter it once, outside the loop.
  sentence = {word_ for word_ in sentence if keep(word_)}
  for synset in wn.synsets(word):
    definition = {word_ for word_ in word_tokenize(synset.definition()) if keep(word_)}
    overlap = len(definition & sentence)
    if overlap > max_overlap:
      max_overlap = overlap
      best_sense = synset.definition()

  return best_sense
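
The overlap count at the heart of the function above is easier to follow on a toy example. The sketch below is only an illustration: the glosses are hand-written (not read from WordNet), and `TOY_STOP_WORDS`, `TOY_GLOSSES`, `content_words` and `toy_lesk` are hypothetical names that do not appear elsewhere in this notebook.

```python
# Toy illustration of the simplified Lesk overlap (hand-written glosses, not WordNet).
TOY_STOP_WORDS = {"a", "an", "the", "of", "from", "that", "for", "as", "in", "to"}

# Hypothetical mini sense inventory for the word "break".
TOY_GLOSSES = {
    "pause": "a pause from doing something as work",
    "fracture": "breaking of hard tissue such as bone",
    "escape": "an escape from jail",
}

def content_words(tokens, target):
    """Keep alphabetic tokens of length >= 3 that are neither stop words nor the target."""
    return {t for t in tokens
            if t.isalpha() and len(t) >= 3 and t not in TOY_STOP_WORDS and t != target}

def toy_lesk(target, sentence_tokens):
    """Return the sense label whose gloss shares the most content words with the sentence."""
    context = content_words(sentence_tokens, target)
    best_sense, max_overlap = None, 0
    for sense, gloss in TOY_GLOSSES.items():
        overlap = len(content_words(gloss.split(), target) & context)
        if overlap > max_overlap:
            best_sense, max_overlap = sense, overlap
    return best_sense

with_overlap = toy_lesk("break", "he took a short break from work".split())
without_overlap = toy_lesk("break", "a much-deserved break for all of us".split())
print(with_overlap)     # shares "work" with the "pause" gloss
print(without_overlap)  # no shared content word at all
```

The second call returns `None` because the sentence shares no content word with any gloss, so the function has nothing to base a decision on.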

In [0]:
print("Possible meanings of the word break:")
for num, synset in enumerate(wn.synsets('break')):
  print(num+1, synset.definition())

Possible meanings of the word break:
1 some abrupt occurrence that interrupts an ongoing activity
2 an unexpected piece of good luck
3 (geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
4 a personal or social separation (as between opposing factions)
5 a pause from doing something (as work)
6 the act of breaking something
7 a time interval during which there is a temporary cessation of something
8 breaking of hard tissue such as bone
9 the occurrence of breaking
10 an abrupt change in the tone or register of the voice (as at puberty or due to emotion)
11 the opening shot that scatters the balls in billiards or pool
12 (tennis) a score consisting of winning a game when your opponent was serving
13 an act of delaying or interrupting the continuity
14 a sudden dash
15 any frame in which a bowler fails to make a strike or spare
16 an escape from jail
17 terminate
18 become separated into pieces or fragments
19 render inoperable or i

In [0]:
for num, sample in enumerate(samples):
  print(num+1)
  print("Sentence:", *sample)
  print("Predicted meaning:", lesk("break", sample))
  print("")

1
Sentence: I think it would break down the connections that we have with immigrant communities. ” Sanctuary cities that refuse to cooperate could lose billions of dollars in federal funding .
Predicted meaning: None

2
Sentence: VIDEO : 10 Times Bella Hadid Rocked the Runway 5135677455001 As for what caused the break up ?
Predicted meaning: None

3
Sentence: The visitors , however , with 211 runs needed from 34 overs in the final session and with nine wickets in hand , changed their approach after the break and decided to attack .
Predicted meaning: None

4
Sentence: Photo : Silver Screen Collection , Getty Images Image 25 of 78 Feb. 10 , 1977 : After nearly seven years of marriage to Gus Trikonis , Goldie jumped ship two months later and married musician Bill Hudson , here taking a smoke break with new wife at the People 's Choice Awards in Hollywood .
Predicted meaning: None

5
Sentence: The Cowboys , leading 28-3 at the break , quickly got another score when Jermaine Antoine interc

The main problem appears to be that the definitions contain too few words. Some sentences share no words with any definition, and some share only irrelevant vocabulary. Some words also fail to match because they differ in inflection.
Why don't we try analysing the example contexts from WordNet as well, instead of just the definitions?
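
The inflection problem can be shown in a few lines. This is a minimal sketch, assuming a deliberately crude suffix stripper (`crude_stem`, a hypothetical helper) as a stand-in for a real stemmer such as Snowball; the sentence and the small stop-word set are made up for the illustration.

```python
def crude_stem(token):
    """Strip a few common English suffixes; a rough stand-in for a real stemmer."""
    token = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

toy_stop_words = {"an", "of", "or", "the", "and"}

gloss = "an act of delaying or interrupting the continuity".split()
sentence = "rain interrupted play and delayed the restart".split()

gloss_words = {t for t in gloss if t not in toy_stop_words}
sentence_words = {t for t in sentence if t not in toy_stop_words}

# Exact matching finds nothing: "delaying" != "delayed", "interrupting" != "interrupted".
exact = gloss_words & sentence_words
# After stemming, both pairs collapse to the same stem and the overlap appears.
stemmed = {crude_stem(t) for t in gloss_words} & {crude_stem(t) for t in sentence_words}

print("exact overlap:", exact)
print("stemmed overlap:", stemmed)
```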

In [0]:
def lesk(word, sentence):
  """Lesk with stemming; each sense is described by its definition plus its WordNet examples."""
  def keep(token):
    # Drop stop words, the target word itself, short tokens and non-alphabetic tokens.
    return token not in stop_words and token != word and len(token) >= 3 and token.isalpha()

  best_sense = None
  max_overlap = 0
  sentence = {stemmer.stem(word_) for word_ in sentence if keep(word_)}
  for synset in wn.synsets(word):
    # Tokenise the examples: iterating over an example string directly yields characters.
    signature = word_tokenize(synset.definition())
    for example in synset.examples():
      signature += word_tokenize(example)
    signature = {stemmer.stem(word_) for word_ in signature if keep(word_)}

    # The examples extend the description of the sense, so they go on the definition side.
    overlap = len(signature & sentence)
    if overlap > max_overlap:
      max_overlap = overlap
      best_sense = synset.definition()

  return best_sense

In [28]:
for num, sample in enumerate(samples):
  print(num+1)
  print("Sentence:", *sample)
  print("Predicted meaning:", lesk("break", sample))
  print("")

1
Sentence: `` Unfortunately , this is the latest in a series of incidents that highlight the need for greater investment in our city facilities , '' Turner said , pointing to a September water line break at the city 's main administrative building at 611 Walker , among other episodes .
some abrupt occurrence that interrupts an ongoing activity 1111
a pause from doing something (as work) 1111
render inoperable or ineffective 1111
Predicted meaning: render inoperable or ineffective

2
Sentence: Liverpool have won seven of their last eight league games and go into the international break with growing belief they can win the club 's first English league title since 1990 .
some abrupt occurrence that interrupts an ongoing activity 1111
a time interval during which there is a temporary cessation of something 1111
any frame in which a bowler fails to make a strike or spare 1111
Predicted meaning: any frame in which a bowler fails to make a strike or spare

3
Sentence: Assuming the vice presi

Better but still meh.
What if we use n-grams and TF-IDF to look for similar sentences, and add those sentences to the definitions and contexts?
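
To make the idea concrete before the actual cells, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity on three made-up token lists. It uses a plain, unsmoothed idf, so the numbers will not match scikit-learn's `TfidfVectorizer`; `tfidf_vectors` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenised document to a {term: tf * idf} dict (plain, unsmoothed idf)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "stocks fall sharply as yen sinks".split(),            # the query
    "yen sinks and stocks rally".split(),                  # same topic
    "players rest during the half time interval".split(),  # unrelated topic
]
query, *candidates = tfidf_vectors(docs)
scores = [cosine(query, candidate) for candidate in candidates]
print(scores)  # the finance sentence scores above zero, the sports one exactly zero
```

Ranking a pool of sentences by this score and keeping the top few is the "look for similar sentences" step.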

In [0]:

def preprocess(samples):
  return [[stemmer.stem(word_) for word_ in sample if word_.lower() not in stop_words and len(word_) > 2 and word_.isalpha()] for sample in samples]

other_samples_preprocessed = preprocess(other_samples)
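# NOTE: scikit-learn applies ngram_range only when analyzer is not a callable,
# so with this lambda the features are single stems (unigrams) only.
tfidf = TfidfVectorizer(analyzer=lambda x: x, ngram_range=(1, 3)).fit(other_samples_preprocessed)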

def get_similar(sentence, sentences, num=10):
  sentence_ = tfidf.transform(preprocess([sentence]))
  sentences_ = tfidf.transform(preprocess(sentences))
  similarity = cosine_similarity(sentence_, sentences_)[0]
  idx = similarity.argsort()[-num:][::-1]
  return preprocess([sentences[idx_] for idx_ in idx])

def get_score(sentence1, sentence2):
  sentence1 = tfidf.transform(preprocess([sentence1]))
  sentence2 = tfidf.transform(preprocess([sentence2]))
  similarity = cosine_similarity(sentence1, sentence2)[0][0]
  return similarity
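
One caveat about the vectorizer above: scikit-learn uses `ngram_range` only when `analyzer` is not a callable, so the identity lambda yields unigram features. If real 1- to 3-grams over the pre-stemmed token lists are wanted, the analyzer has to build them itself. A minimal sketch, assuming a hypothetical helper called `ngram_analyzer`:

```python
def ngram_analyzer(tokens, max_n=3):
    """Return all contiguous 1..max_n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

features = ngram_analyzer(["intern", "break", "week"])
print(features)
# ['intern', 'break', 'week', 'intern break', 'break week', 'intern break week']
```

Passing `analyzer=ngram_analyzer` in place of the lambda should then give bigram and trigram features as well. This has not been run against the corpus here, so the scores printed below still come from the unigram version.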

In [0]:
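def lesk(word, sentence):
  """Lesk via TF-IDF: score the sentence against the enriched context of each sense."""
  best_sense = None
  max_score = 0
  common_words = list()
  for synset in wn.synsets(word):
    # Start from the definition and the WordNet usage examples of this sense...
    context = word_tokenize(synset.definition())
    context += sum([word_tokenize(example) for example in synset.examples()], [])
    # ...then add the 60 corpus sentences most similar to that description.
    context += sum(get_similar(context, other_samples, 60), [])
    context = preprocess([context])[0]
    context = [word_ for word_ in context if word_ != word]
    score = get_score(sentence, context)

    if score > max_score:
      best_sense = synset.definition()
      max_score = score
      # Raw sentence tokens vs. stemmed context: only words equal to their stem show up here.
      common_words = list(set(sentence).intersection(set(context)))

  print("Common words:", *common_words)
  print("Score", max_score)
  return best_sense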

In [49]:
for num, sample in enumerate(samples):
  print(num+1)
  print("Sentence:", *sample)
  print("Predicted meaning:", lesk("break", sample))
  print("")

1
Sentence: “ This week will be a much-deserved break for all of us , ” Frederick said .
Common words: week said
Score 0.18210893497156472
Predicted meaning: diminish or discontinue abruptly

2
Sentence: “ Saving money and investing … More » Tokyo shares up by break on lower yen , US rate view AFP - 3 hours ago Tokyo stocks rallied Friday morning as the yen sank against the dollar after Federal Reserve chief Janet Yellen indicated the bank would hike US interest rates next month .
Common words: would lower rate bank yen money next dollar ago sank month
Score 0.17267608349616478
Predicted meaning: fall sharply

3
Sentence: Hong Kong shares also rose , as the Thanksgiving break in the United States helped slow a relentless surge in the U.S. dollar that has sucked capital out of most emerging markets .
Common words: slow also rose dollar
Score 0.15612591358578534
Predicted meaning: reduce to bankruptcy

4
Sentence: The process of smoking causes collagen to break down into simple sugars ma

# Comments
1. It seems that this meaning of the word "break" is the one Lesk's algorithm has a hard time with. A break, or having a break, can be mentioned pretty much anywhere without any words from the dictionary entry.
2. This one is pretty impressive. The modified Lesk algorithm managed to 1) find other sentences where "break" means "fall sharply", 2) identify their common topic, economics and finance, and 3) recognise that this is the topic here too. Well done.
3. That meaning of the word "break" again. Still, Lesk's algorithm knows that the sentence is about finance, hence the interpretation.
4. A phrasal verb. Those are bad for our algorithm. Also, the sentence shares just one word with the definition and its similar sentences.
5. Pretty much the only available meaning in the list that is related to sport. Still, it's not about tennis.
6. Maybe it's about "change"? Honestly, I don't know.
7. Probably the word "interrupt" created confusion between this definition and political news.
8. Not a typical text, so it's hard to analyse, I guess.
9. Well, this is pretty much a junk sample.
10. "To break the drought" is an idiom, and it's more about ending something that had gone on for a long time. Still, the definition is kinda relevant: the fans mentioned _did_ have good luck!