# Some core operations using `NLTK`

![](NLTK_logo.png)

`NLTK` is one of the oldest NLP libraries. Though more modern libraries like `spaCy` are better choices for applications using language models and machine learning, `NLTK` still provides many functions that are not available elsewhere. It is also exceptionally well integrated with a variety of corpora that other libraries do not offer.

In [None]:
!pip install networkx

import networkx as nx
import nltk
import pandas as pd
import seaborn as sns
from nltk.corpus import brown
from nltk.corpus import wordnet as wn
from nltk.metrics import edit_distance, jaccard_distance

# Download the WordNet data; 'omw-1.4' is the Open Multilingual WordNet package
#nltk.download('brown')
nltk.download('wordnet')
nltk.download('omw-1.4')

sns.set()  # apply seaborn's default plot theme

## Categorising words using `WordNet`

`WordNet` is a manually annotated lexical database in which professional linguists have grouped words into categories. It is accessible through `NLTK`, and is very valuable for exploring the semantic relationships between words. These include:

* Hypernyms: A word is a hypernym of another word when it denotes a more general category to which the other word belongs. `Colour` is a hypernym of `blue`.
* Hyponyms: A word is a hyponym of another word when it denotes a more specific member of the category that the other word names. `Spoon` is a hyponym of `cutlery`.
* Antonyms: Two words are antonyms when they have opposite meanings. `Good` is an antonym of `evil`.
* Meronyms: A word is a meronym of another word when it denotes a part of the whole that the other word names. `Wheel` is a meronym of `car`.
* Synsets: Two words belong to the same synset when they share the same meaning. `Car` and `automobile` belong to the same synset, whereas `puppy` is a hyponym of `dog` rather than a synonym.

### Synsets

In [None]:
# Get the synsets of the word "bank"
bank = wn.synsets('bank')
bank

In [None]:
# Get some synonyms for the second synset of bank: bank as a financial institution
bank[1].lemma_names()

In [None]:
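# Get a list of all the synonyms in WordNet for all the synsets of bank:

# One list of lemma names per synset
terms = [synset.lemma_names() for synset in bank]

# Flatten the lists, drop duplicates with a set, and sort for a stable order
words = sorted({word for sublist in terms for word in sublist})

words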

### Antonyms

In [None]:
# Get the antonyms of the last synset of bank (i.e. to bank on someone, meaning to trust them).
# Antonyms are defined on lemmas rather than synsets, and most words have none

bank[-1].lemmas()[0].antonyms()

### Hypernyms

In [None]:
# Get the hypernyms of the second synset of bank (i.e. commercial enterprise that deals with money)
bank[1].hypernyms()

In [None]:
# Get all the hypernyms of all the synsets of the word school

hyp = [synset.hypernyms() for synset in wn.synsets('school')]

hyp

In [None]:
# Graphing hypernym relationships. Pass the synset to the function

def closure_graph(synset):
    """Draw the graph of all hypernyms reachable from `synset`."""
    seen = set()
    graph = nx.DiGraph()

    def recurse(s):
        # Visit each synset once, even if several paths lead to it
        if s not in seen:
            seen.add(s)
            graph.add_node(s.name())
            for s1 in s.hypernyms():
                graph.add_node(s1.name())
                # Edges point from a synset to its more general hypernym
                graph.add_edge(s.name(), s1.name())
                recurse(s1)

    recurse(synset)
    nx.draw(graph, with_labels=True)

In [None]:
school = wn.synsets('school')
closure_graph(school[0])

### Path similarity

Path similarity measures how close two synsets are in the hypernym hierarchy. Specifically, it is based on the length of the shortest path between them: the number of steps required to go from one synset up to a common hypernym and down again to the other synset. It is defined by the following formula, which confines it to values greater than 0 and at most 1.

$$\text{Path Similarity} = \frac{1}{\text{Shortest Path Length}+1}$$

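A value of $1$ means that two synsets are identical; values close to $0$ mean that they are far apart in the hierarchy. The score never reaches $0$ exactly, because the path length is always finite. In `NLTK`, `path_similarity` returns `None` when no connecting path exists.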
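
To make the formula concrete, here is a minimal sketch that applies it to a small hand-made hypernym tree. The `PARENT` dictionary and both helper functions are hypothetical and written only for this illustration. They are not taken from WordNet or from `NLTK`'s implementation, and they assume each word has at most one hypernym.

```python
# Toy hypernym tree (hypothetical): each word maps to its single hypernym
PARENT = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}

def ancestors(node):
    """Map the node and each of its ancestors to its distance from the node."""
    distances = {}
    steps = 0
    while node is not None:
        distances[node] = steps
        node = PARENT.get(node)  # None once the root has been passed
        steps += 1
    return distances

def path_similarity(a, b):
    """Return 1 / (shortest path length + 1), or None if no path exists."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    if not common:
        return None
    # Shortest path: up from a to a common ancestor, then down to b
    shortest = min(up_a[n] + up_b[n] for n in common)
    return 1 / (shortest + 1)

print(path_similarity("dog", "dog"))   # path length 0
print(path_similarity("dog", "wolf"))  # dog -> canine -> wolf: path length 2
print(path_similarity("dog", "cat"))   # meets at carnivore: path length 4
```

In this toy tree, an identical pair scores $1$, the siblings `dog` and `wolf` score $1/3$, and `dog` and `cat` score $1/5$.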

In [None]:
# Let's create some examples

words = ['wolf', 'lollipop', 'dinosaur', 'dog', 'knife', 'tiger', 'cat', 'shovel', 'kitten', 'disease', 'hammer', 'fork']

all_synsets = [wn.synsets(word) for word in words]

# Compare the words using only the first synset listed for each one
first_synsets = [synsets[0] for synsets in all_synsets]

# Pairwise path similarity: rows and columns follow the order of `words`
path_s = [[a.path_similarity(b) for b in first_synsets] for a in first_synsets]

paths_df = pd.DataFrame(path_s, columns=words, index=words)

In [None]:
paths_df

In [None]:
sns.heatmap(paths_df)