A. Study of the WordNet dictionary with methods such as synsets, definitions,
examples, and antonyms

In [1]:
'''WordNet groups words into synsets: sets of synonymous word senses whose
members are called "lemmas"'''
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [2]:
print(wordnet.synsets("computer"))

[Synset('computer.n.01'), Synset('calculator.n.01')]


In [3]:
# definition of the first sense of the word 'computer'
print(wordnet.synset("computer.n.01").definition())

a machine for performing calculations automatically


In [4]:
#examples
print("Examples:", wordnet.synset("computer.n.01").examples())

Examples: []


In [5]:
#get Antonyms
print(wordnet.lemma('buy.v.01.buy').antonyms())

[Lemma('sell.v.01.sell')]





B. Study lemmas, hyponyms, hypernyms.

In [6]:
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')
print(wordnet.synsets("computer"))
print(wordnet.synset("computer.n.01").lemma_names())

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[Synset('computer.n.01'), Synset('calculator.n.01')]
['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system']


In [7]:
#all lemmas for each synset.
for e in wordnet.synsets("computer"):
      print(f'{e} --> {e.lemma_names()}')

Synset('computer.n.01') --> ['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system']
Synset('calculator.n.01') --> ['calculator', 'reckoner', 'figurer', 'estimator', 'computer']


In [8]:
#print all lemmas for a given synset
print(wordnet.synset('computer.n.01').lemmas())

[Lemma('computer.n.01.computer'), Lemma('computer.n.01.computing_machine'), Lemma('computer.n.01.computing_device'), Lemma('computer.n.01.data_processor'), Lemma('computer.n.01.electronic_computer'), Lemma('computer.n.01.information_processing_system')]


In [9]:
#get the synset corresponding to lemma
print(wordnet.lemma('computer.n.01.computing_device').synset())

Synset('computer.n.01')


In [10]:
#Get the name of the lemma
print(wordnet.lemma('computer.n.01.computing_device').name())

computing_device


In [11]:
# Hyponyms are the more specific concepts that fall under a synset.
# List the lemma names of every hyponym of 'computer.n.01'.
syn = wordnet.synset('computer.n.01')
print([lemma.name() for synset in syn.hyponyms() for lemma in synset.lemmas()])

['analog_computer', 'analogue_computer', 'digital_computer', 'home_computer', 'node', 'client', 'guest', 'number_cruncher', 'pari-mutuel_machine', 'totalizer', 'totaliser', 'totalizator', 'totalisator', 'predictor', 'server', 'host', 'Turing_machine', 'web_site', 'website', 'internet_site', 'site']


In [12]:
# lowest common hypernym: the most specific ancestor shared by two synsets
vehicle = wordnet.synset('vehicle.n.01')
car = wordnet.synset('car.n.01')
print(car.lowest_common_hypernyms(vehicle))

[Synset('vehicle.n.01')]
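
The idea behind the lowest common hypernym can be sketched without WordNet. The taxonomy below is hypothetical and simplified (each concept has a single parent, whereas real synsets may have several), and the names `PARENT`, `ancestors` and `lowest_common_hypernym` are illustrative, not part of NLTK.

```python
# Hypothetical, simplified is-a taxonomy: each concept maps to its single parent.
# Real WordNet synsets can have several hypernyms; this toy tree has at most one.
PARENT = {
    "car": "motor_vehicle",
    "motor_vehicle": "wheeled_vehicle",
    "bicycle": "wheeled_vehicle",
    "wheeled_vehicle": "vehicle",
    "boat": "vehicle",
    "vehicle": None,
}

def ancestors(concept):
    """Return the concept followed by its hypernyms, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def lowest_common_hypernym(a, b):
    """Return the most specific concept that subsumes both a and b, or None."""
    b_chain = set(ancestors(b))
    for candidate in ancestors(a):  # walk upward from a, nearest first
        if candidate in b_chain:
            return candidate
    return None

print(lowest_common_hypernym("car", "vehicle"))  # vehicle
print(lowest_common_hypernym("car", "bicycle"))  # wheeled_vehicle
```

As in the cell above, a concept that is itself an ancestor of the other is returned as the answer, which is why comparing a car with a vehicle yields the vehicle concept.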


C. Write a Python program to find synonyms and antonyms of the word "active"
using WordNet.

In [13]:
from nltk.corpus import wordnet
print(wordnet.synsets("active"))
print(wordnet.lemma('active.a.01.active').antonyms())

[Synset('active_agent.n.01'), Synset('active_voice.n.01'), Synset('active.n.03'), Synset('active.a.01'), Synset('active.s.02'), Synset('active.a.03'), Synset('active.s.04'), Synset('active.a.05'), Synset('active.a.06'), Synset('active.a.07'), Synset('active.s.08'), Synset('active.a.09'), Synset('active.a.10'), Synset('active.a.11'), Synset('active.a.12'), Synset('active.a.13'), Synset('active.a.14')]
[Lemma('inactive.a.02.inactive')]
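
The cell above lists the synsets and the antonym of a single lemma. One possible way to gather synonyms and antonyms across every sense is sketched below. The helper name `synonyms_and_antonyms` is an assumption; it relies only on the `lemmas()`, `name()` and `antonyms()` methods already used in this worksheet.

```python
def synonyms_and_antonyms(synsets):
    """Collect lemma names (synonyms) and antonym names across all given synsets."""
    synonyms, antonyms = set(), set()
    for synset in synsets:
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return sorted(synonyms), sorted(antonyms)

# Intended usage with NLTK (requires the 'wordnet' corpus to be downloaded):
#   from nltk.corpus import wordnet
#   syns, ants = synonyms_and_antonyms(wordnet.synsets("active"))
#   print(syns)
#   print(ants)
```

Using sets removes the duplicates that appear when the same lemma name occurs in several synsets.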





D. Compare two nouns

In [14]:
import nltk
from nltk.corpus import wordnet
syn1 = wordnet.synsets('football')
syn2 = wordnet.synsets('soccer')

In [15]:
# A word may have several synsets, so compare every synset of word 1 with every synset of word 2
for s1 in syn1:
    for s2 in syn2:
        print("Path similarity of: ")
        print(s1, '(', s1.pos(), ')', '[', s1.definition(), ']')
        print(s2, '(', s2.pos(), ')', '[', s2.definition(), ']')
        print(" is", s1.path_similarity(s2))
        print()

Path similarity of: 
Synset('football.n.01') ( n ) [ any of various games played with a ball (round or oval) in which two teams try to kick or carry or propel the ball into each other's goal ]
Synset('soccer.n.01') ( n ) [ a football game in which two teams of 11 players try to kick or head a ball into the opponents' goal ]
 is 0.5

Path similarity of: 
Synset('football.n.02') ( n ) [ the inflated oblong ball used in playing American football ]
Synset('soccer.n.01') ( n ) [ a football game in which two teams of 11 players try to kick or head a ball into the opponents' goal ]
 is 0.05
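
To help read these scores: as NLTK describes it, path similarity is based on the shortest path between two senses in the is-a taxonomy, scored as 1 / (1 + path length). Under that formula 0.5 corresponds to a single link and 0.05 to nineteen links. The sketch below reproduces the calculation on a small hypothetical graph; the edges and function names are assumptions, not WordNet data.

```python
from collections import deque

# Hypothetical is-a edges (child, parent); not actual WordNet data.
EDGES = [
    ("soccer", "football"),
    ("rugby", "football"),
    ("football", "field_game"),
    ("hockey", "field_game"),
]

def shortest_path_length(a, b):
    """Breadth-first search over is-a links, followed in either direction."""
    neighbours = {}
    for child, parent in EDGES:
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path

def path_similarity(a, b):
    """Score 1 / (1 + shortest path length); None when unconnected."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (1 + dist)

print(path_similarity("soccer", "football"))  # 0.5
print(path_similarity("soccer", "hockey"))    # 0.25
```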






E. Handling stop words:

i. Using NLTK: adding or removing stop words in NLTK's default stop word
list

In [16]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [18]:
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
english_stopwords = set(stopwords.words('english'))  # load the English list once
tokens_without_sw = [word for word in text_tokens if word not in english_stopwords]
print(tokens_without_sw)

['Yashesh', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']


In [19]:
#add the word play to the NLTK stop word collection
all_stopwords = stopwords.words('english')
all_stopwords.append('play')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)

['Yashesh', 'likes', 'football', ',', 'however', 'fond', 'tennis', '.']


In [20]:
#remove ‘not’ from stop word collection
all_stopwords.remove('not')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)

['Yashesh', 'likes', 'football', ',', 'however', 'not', 'fond', 'tennis', '.']
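
The outputs above still contain punctuation tokens, and the membership test is case-sensitive. A minimal sketch of one way to handle both follows. It assumes a tiny hypothetical stop list and a simple regex tokenizer in place of NLTK's, so its tokenization may differ from `word_tokenize` on other inputs.

```python
import re
import string

# Hypothetical mini stop list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"to", "he", "is", "not", "too", "of"}

def remove_stop_words(text, stop_words, keep_punctuation=False):
    """Tokenize with a simple regex and drop stop words (case-insensitively)."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        if not keep_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

text = "Yashesh likes to play football, however he is not too fond of tennis."
print(remove_stop_words(text, STOP_WORDS))
# ['Yashesh', 'likes', 'play', 'football', 'however', 'fond', 'tennis']

# Add 'play' and remove 'not' with set operations instead of list mutation.
custom = (STOP_WORDS | {"play"}) - {"not"}
print(remove_stop_words(text, custom))
# ['Yashesh', 'likes', 'football', 'however', 'not', 'fond', 'tennis']
```

Keeping the stop words in a set makes each lookup constant time, and the set operators leave the original collection unchanged.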


ii. Using Gensim: adding and removing stop words in Gensim's default stop
word list

In [21]:
#pip install gensim
import gensim
from gensim.parsing.preprocessing import remove_stopwords
text = "Yashesh likes to play football, however he is not too fond of tennis."
filtered_sentence = remove_stopwords(text)
print(filtered_sentence)

Yashesh likes play football, fond tennis.


In [22]:
all_stopwords = gensim.parsing.preprocessing.STOPWORDS
print(all_stopwords)

frozenset({'beyond', 'somehow', 'thru', 'interest', 'thus', 'would', 'so', 'another', 'name', 'myself', 'side', 'without', 'those', 'in', 'against', 'inc', 'con', 'across', 'very', 'per', 'then', 'due', 'six', 'thereupon', 'last', 'keep', 'serious', 'that', 'them', 'where', 'amount', 'becomes', 'whereas', 'are', 'didn', 'before', 'detail', 'put', 'becoming', 'also', 'still', 'off', 'about', 'done', 'up', 'thereafter', 'beforehand', 'does', 'during', 'ltd', 'bottom', 'you', 'however', 'thin', 'afterwards', 'ours', 'nowhere', 'on', 'several', 'hereafter', 'kg', 'below', 'third', 'ourselves', 'whole', 'doing', 'fill', 'although', 'well', 'other', 'either', 'nothing', 'himself', 'describe', 'together', 'neither', 'yours', 'whence', 'via', 'than', 'less', 'its', 'towards', 'fifteen', 'therefore', 'between', 'under', 'toward', 'because', 'a', 'not', 'meanwhile', 'always', 'amoungst', 'anything', 'some', 'doesn', 'nine', 'most', 'whenever', 'once', 'nevertheless', 'show', 'down', 'with', 'abo

In [23]:
'''The following script adds likes and play to the list of stop words in Gensim:'''
from gensim.parsing.preprocessing import STOPWORDS
all_stopwords_gensim = STOPWORDS.union(set(['likes', 'play']))
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords_gensim]
print(tokens_without_sw)


['Yashesh', 'football', ',', 'fond', 'tennis', '.']


In [24]:
'''The following script removes the word "not" from the set of stop words in
Gensim:'''
from gensim.parsing.preprocessing import STOPWORDS
sw_list = {"not"}
# STOPWORDS is a frozenset, so difference() returns a new set instead of mutating it
all_stopwords_gensim = STOPWORDS.difference(sw_list)
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords_gensim]
print(tokens_without_sw)

['Yashesh', 'likes', 'play', 'football', ',', 'not', 'fond', 'tennis', '.']





iii. Using spaCy: adding and removing stop words in spaCy's default stop word
list

In [25]:
# pip install spacy
# python -m spacy download en_core_web_sm
import spacy
import nltk
from nltk.tokenize import word_tokenize
sp = spacy.load('en_core_web_sm')

In [26]:
#add the word play to the spaCy stop word collection
all_stopwords = sp.Defaults.stop_words
all_stopwords.add("play")
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)

['Yashesh', 'likes', 'football', ',', 'fond', 'tennis', '.']


In [27]:
#remove 'not' from stop word collection
all_stopwords.remove('not')
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)

['Yashesh', 'likes', 'football', ',', 'not', 'fond', 'tennis', '.']
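
Note that `all_stopwords` in the two cells above is the same set object as `sp.Defaults.stop_words`, so `add` and `remove` change the shared defaults for the rest of the session, and `remove` raises `KeyError` if the word is absent. A hedged sketch of a non-mutating alternative follows. It uses a hypothetical stand-in set rather than spaCy itself, and the helper name `customised_stop_words` is an assumption.

```python
# Stand-in for a library's shared default stop-word set (hypothetical contents).
DEFAULT_STOP_WORDS = {"to", "he", "is", "not", "too", "of", "however"}

def customised_stop_words(defaults, add=(), remove=()):
    """Return a new set; the shared defaults are left unmodified."""
    custom = set(defaults)            # copy, so edits stay local
    custom.update(add)
    custom.difference_update(remove)  # no error if a word is absent
    return custom

custom = customised_stop_words(
    DEFAULT_STOP_WORDS, add={"play"}, remove={"not", "absent"}
)
tokens = ["Yashesh", "likes", "to", "play", "football", ",", "however", "he",
          "is", "not", "too", "fond", "of", "tennis", "."]
print([t for t in tokens if t not in custom])
# ['Yashesh', 'likes', 'football', ',', 'not', 'fond', 'tennis', '.']
print("play" in DEFAULT_STOP_WORDS)  # False
```

With spaCy the same helper could presumably be called as `customised_stop_words(sp.Defaults.stop_words, add={"play"}, remove={"not"})`, leaving the library defaults as they were.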
