#### Import Libraries

In [None]:
# import sys
# !{sys.executable} -m pip install -U nltk matplotlib

In [None]:
import nltk
nltk.__version__

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def showImage(x):    
    plt.figure(figsize=(20,12))
    plt.axis('off')
    img = mpimg.imread('./' + str(x) + '.PNG')
    plt.imshow(img)

In [None]:
import nltk
# nltk.download()
# C:\Users\user\AppData\Roaming\nltk_data

## Tokenizing Text and WordNet Basics

1 Tokenizing text into sentences  
2 Tokenizing sentences into words  
3 Tokenizing sentences using regular expressions  
4 Training a sentence tokenizer  
5 Filtering stopwords in a tokenized sentence  
6 Looking up Synsets for a word in WordNet  
7 Looking up lemmas and synonyms in WordNet  
8 Calculating WordNet Synset similarity  
9 Discovering word collocations  

#### Sentence Tokenize

The sent_tokenize() function uses an instance of PunktSentenceTokenizer from the 
nltk.tokenize.punkt module. This instance has already been trained and works well for 
many European languages, so it knows which punctuation and characters mark the end of a 
sentence and the beginning of a new sentence.

In [None]:
para = "Hello World. It's good to see you. Thanks for buying this book."

In [None]:
# nltk.download('punkt')

from nltk.tokenize import sent_tokenize
sent_tokenize(para)

The instance used in sent_tokenize() is actually loaded on demand from a pickle 
file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the 
PunktSentenceTokenizer class once, and call its tokenize() method instead:

In [None]:
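# ERROR: loading the tokenizer from the absolute Windows path failed in the original run.
# nltk.data.load() normally takes a resource path relative to the nltk_data directory.
# Whether this pickle is available depends on the installed NLTK version, so the cell stays commented out.

# import nltk.data
# tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# tokenizer.tokenize(para)
# ['Hello World.', "It's good to see you.", 'Thanks for buying this book.']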

In [None]:
# nltk.download('webtext')

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('overheard.txt')
sent_tokenizer = PunktSentenceTokenizer(text)

In [None]:
sents1 = sent_tokenizer.tokenize(text)
sents1[0]

In [None]:
from nltk.tokenize import sent_tokenize
sents2 = sent_tokenize(text)
sents2[0]

In [None]:
# Difference

print(sents1[678])
print(sents2[678])

The default tokenizer includes the next line of dialog in the same sentence, while our custom 
tokenizer correctly treats the next line as a separate sentence. This difference is a good 
demonstration of why it can be useful to train your own sentence tokenizer, especially when 
your text isn't in the typical paragraph-sentence structure.

In [None]:
# How to use on your own corpus

with open('/usr/share/nltk_data/corpora/webtext/overheard.txt',
          encoding='ISO-8859-2') as f:
    text = f.read()

# Train on the raw text, then split the same text into sentences
sent_tokenizer = PunktSentenceTokenizer(text)
sents = sent_tokenizer.tokenize(text)

#### Word Tokenize

In [None]:
from nltk.tokenize import word_tokenize
word_tokenize('Hello World.')

The word_tokenize() function is a wrapper function that calls tokenize() on an 
instance of a Treebank-style word tokenizer.   
In the NLTK version this recipe was written for, it is equivalent to the following code:

In [None]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('Hello World.')

In [None]:
showImage(1)

In [None]:
word_tokenize("Can't is a contraction.")

In [None]:
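# PunktWordTokenizer is no longer available in recent NLTK 3 releases, so this
# import fails there. The code and its output are kept for reference only.
# from nltk.tokenize import PunktWordTokenizer
# tokenizer = PunktWordTokenizer()
# tokenizer.tokenize("Can't is a contraction.")

# Output: ['Can', "'t", 'is', 'a', 'contraction.']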

In [None]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Can't is a contraction.")

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string avoids an invalid-escape warning
tokenizer.tokenize("Can't is a contraction.")

In [None]:
from nltk.tokenize import regexp_tokenize
regexp_tokenize("Can't is a contraction.", r"[\w']+")

In [None]:
# Whitespace Tokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=True)  # the pattern matches the gaps, not the tokens
tokenizer.tokenize("Can't is a contraction.")
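
To make the two modes of the regular-expression tokenizers above concrete, here is a minimal sketch that uses only Python's standard `re` module. It is not NLTK's code: it assumes that the default mode behaves like `re.findall` on the token pattern, and that `gaps=True` behaves like splitting on the pattern and discarding empty strings.

```python
import re

text = "Can't is a contraction."

# The pattern describes the tokens themselves: runs of word characters or apostrophes.
tokens_by_match = re.findall(r"[\w']+", text)

# The pattern describes the gaps between tokens: split on whitespace, drop empties.
tokens_by_gap = [t for t in re.split(r"\s+", text) if t]

print(tokens_by_match)  # ["Can't", 'is', 'a', 'contraction']
print(tokens_by_gap)    # ["Can't", 'is', 'a', 'contraction.']
```

Note that the whitespace split keeps the final period attached to the last word, while the token pattern drops it because a period is neither a word character nor an apostrophe.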

#### Stop Words

In [None]:
# nltk.download('stopwords')

from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = ["Can't", 'is', 'a', 'contraction']
[word for word in words if word not in english_stops]
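
The membership test in the list comprehension above is case-sensitive. The sketch below illustrates the effect with a small hand-written stop set; `toy_stops` is a hypothetical stand-in, not NLTK's list, and it assumes lowercase entries, as in NLTK's English stopwords.

```python
# Hypothetical stop set for illustration; NLTK's English list is much longer.
toy_stops = {"is", "a", "the", "this"}

words = ["This", "is", "a", "contraction"]

# Case-sensitive lookup: "This" survives because only "this" is in the set.
case_sensitive = [w for w in words if w not in toy_stops]

# Lowercasing each word before the lookup removes it as well.
case_insensitive = [w for w in words if w.lower() not in toy_stops]

print(case_sensitive)    # ['This', 'contraction']
print(case_insensitive)  # ['contraction']
```

Whether to lowercase before filtering depends on the task; for named entities the original casing may carry information worth keeping.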

#### Looking up Synsets for a word in WordNet

WordNet is a lexical database for the English language. In other words, it's a dictionary 
designed specifically for natural language processing.  


NLTK comes with a simple interface to look up words in WordNet. What you get is a list of 
Synset instances, which are groupings of synonymous words that express the same concept. 
Many words have only one Synset, but some have several. In this recipe, we'll explore a single 
Synset, and in the next recipe, we'll look at several in more detail.

In [None]:
# nltk.download('wordnet')

from nltk.corpus import wordnet
syn = wordnet.synsets('cookbook')[0]
print(syn.name())
print(syn.definition())


In [None]:
wordnet.synsets("Blood")

In [None]:
wordnet.synsets('cooking')[0].examples()

In [None]:
wordnet.synsets('Blood')[0].examples()

#### Working with hypernyms
Synsets are organized in a structure similar to that of an inheritance tree. More abstract terms 
are known as hypernyms and more specific terms are hyponyms. This tree can be traced all 
the way up to a root hypernym.  

Hypernyms provide a way to categorize and group words based on their similarity to each 
other. The Calculating WordNet Synset similarity recipe details the functions used to calculate 
the similarity based on the distance between two words in the hypernym tree:

In [None]:
syn.hypernyms()

In [None]:
syn.hypernyms()[0].hyponyms()

In [None]:
syn.root_hypernyms()

As you can see, reference_book is a hypernym of cookbook, but cookbook is only one of 
the many hyponyms of reference_book. All these types of books have the same root 
hypernym, entity, which is one of the most abstract terms in the English language. You can 
trace the entire path from entity down to cookbook using the hypernym_paths() method, 
as follows:

In [None]:
syn.hypernym_paths()
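
The tree walk behind hypernyms, hyponyms, and hypernym paths can be sketched in plain Python. The taxonomy below is hypothetical and heavily simplified: the real WordNet chain from entity to cookbook has more intermediate levels, and a Synset can have more than one hypernym, which is why hypernym_paths() returns a list of paths. Here every term is assumed to have exactly one parent.

```python
# Hypothetical, simplified taxonomy: child -> parent (its hypernym).
PARENT = {
    "object": "entity",
    "publication": "object",
    "book": "publication",
    "reference_book": "book",
    "cookbook": "reference_book",
    "instruction_book": "reference_book",
}

def hypernym_path(term):
    """Return the chain of terms from the root down to term."""
    path = [term]
    while path[-1] in PARENT:          # climb until a term has no parent
        path.append(PARENT[path[-1]])
    return path[::-1]                  # reverse so the root comes first

def hyponyms(term):
    """Return the direct children of term, sorted by name."""
    return sorted(child for child, parent in PARENT.items() if parent == term)

print(hypernym_path("cookbook"))
# ['entity', 'object', 'publication', 'book', 'reference_book', 'cookbook']
print(hyponyms("reference_book"))
# ['cookbook', 'instruction_book']
```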

| Part of speech | Tag |
|----------------|-----|
| Noun           | n   |
| Adjective      | a   |
| Adverb         | r   |
| Verb           | v   |

In [None]:
syn.pos()

In [None]:
print(wordnet.synsets('great'))
len(wordnet.synsets('great'))

In [None]:
wordnet.synsets('great', pos='n')

In [None]:
len(wordnet.synsets('great', pos='n'))

#### Looking up lemmas and synonyms in WordNet

Building on the previous recipe, we can also look up lemmas in WordNet to find synonyms 
of a word. A lemma (in linguistics) is the canonical form, or morphological form, of a word.

In [None]:
from nltk.corpus import wordnet
syn = wordnet.synsets('cookbook')[0]
lemmas = syn.lemmas()
len(lemmas)

In [None]:
lemmas[0].name()

In [None]:
lemmas[1].name()

In [None]:
lemmas[0].synset() == lemmas[1].synset()

In [None]:
[lemma.name() for lemma in syn.lemmas()]

In [None]:
synonyms = []
for syn in wordnet.synsets('book'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
len(synonyms)

In [None]:
synonyms

In [None]:
len(set(synonyms))

#### Antonyms
Some lemmas also have antonyms. The word good, for example, has 27 Synsets, five 
of which have lemmas with antonyms, as shown in the following code:

In [None]:
gn2 = wordnet.synset('good.n.02')
gn2.definition()

In [None]:
evil = gn2.lemmas()[0].antonyms()[0]
evil.name()

In [None]:
evil.synset().definition()

In [None]:
ga1 = wordnet.synset('good.a.01')
ga1.definition()

In [None]:
bad = ga1.lemmas()[0].antonyms()[0]
bad.name()

In [None]:
bad.synset().definition()

The antonyms() method returns a list of lemmas. In the first case, as we can see in the 
previous code, the second Synset for good as a noun is defined as moral excellence, 
and its first antonym is evil, defined as morally wrong. In the second case, when good is 
used as an adjective to describe positive qualities, the first antonym is bad, which describes 
negative qualities.

#### Calculating WordNet Synset similarity
Synsets are organized in a hypernym tree. This tree can be used for reasoning about 
the similarity between the Synsets it contains. The closer two Synsets are in the tree, 
the more similar they are.

If you were to look at all the hyponyms of reference_book (which is the hypernym of 
cookbook), you'd see that one of them is instruction_book. This seems intuitively very 
similar to a cookbook, so let's see what WordNet similarity has to say about it with the help 
of the following code:

In [None]:
from nltk.corpus import wordnet
cb = wordnet.synset('cookbook.n.01')
ib = wordnet.synset('instruction_book.n.01')
cb.wup_similarity(ib)

The wup_similarity() method is short for Wu-Palmer Similarity, which is a scoring method 
based on how similar the word senses are and where the Synsets occur relative to each other 
in the hypernym tree. One of the core metrics used to calculate similarity is the shortest path 
distance between the two Synsets and their common hypernym:

In [None]:
ref = cb.hypernyms()[0]
cb.shortest_path_distance(ref)

In [None]:
ib.shortest_path_distance(ref)

In [None]:
cb.shortest_path_distance(ib)

So cookbook and instruction_book must be very similar, because they are only one step 
away from the same reference_book hypernym, and, therefore, only two steps away from 
each other.
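
The following sketch shows how a shortest path distance and a Wu-Palmer style score can be derived from a hypernym tree. It rests on assumptions: the taxonomy is a hypothetical, simplified one with a single parent per term, depth is counted so that the root has depth 1, and the score uses the commonly stated formula 2 * depth(common) / (depth(a) + depth(b)). Because the toy tree is much shallower than WordNet, the numbers differ from what NLTK returns.

```python
# Hypothetical, simplified taxonomy: child -> parent (its hypernym).
PARENT = {
    "object": "entity",
    "publication": "object",
    "book": "publication",
    "reference_book": "book",
    "cookbook": "reference_book",
    "instruction_book": "reference_book",
    "animal": "object",
    "dog": "animal",
}

def ancestors(term):
    """Return term followed by its hypernyms, ending at the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(term):
    """Count the nodes from the root down to term, so the root has depth 1."""
    return len(ancestors(term))

def lowest_common_hypernym(a, b):
    """Return the deepest term that is a or b itself or an ancestor of both."""
    seen = set(ancestors(a))
    return next(t for t in ancestors(b) if t in seen)

def shortest_path_distance(a, b):
    """Count the steps from a up to the common hypernym and down to b."""
    common = lowest_common_hypernym(a, b)
    return (depth(a) - depth(common)) + (depth(b) - depth(common))

def wup_similarity(a, b):
    """Wu-Palmer style score on the toy tree."""
    common = lowest_common_hypernym(a, b)
    return 2 * depth(common) / (depth(a) + depth(b))

print(shortest_path_distance("cookbook", "reference_book"))    # 1
print(shortest_path_distance("cookbook", "instruction_book"))  # 2
print(wup_similarity("cookbook", "instruction_book"))
print(wup_similarity("dog", "cookbook"))
```

The score rewards a deep common hypernym: cookbook and instruction_book meet at reference_book, far from the root, whereas dog and cookbook only meet at object, near the top of the tree.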

In [None]:
dog = wordnet.synsets('dog')[0]
dog.wup_similarity(cb)

Wow, dog and cookbook are apparently 38% similar! This is because they share common 
hypernyms further up the tree:

In [None]:
sorted(dog.common_hypernyms(cb))

#### Comparing verbs
The previous comparisons were all between nouns, but the same can be done for verbs 
as well:

In [None]:
cook = wordnet.synset('cook.v.01')
bake = wordnet.synset('bake.v.02')
cook.wup_similarity(bake)

The previous Synsets were obviously handpicked for demonstration, and the reason is that 
the hypernym tree for verbs has a lot more breadth and a lot less depth. While most nouns 
can be traced up to the hypernym object, thereby providing a basis for similarity, many 
verbs do not share common hypernyms, making WordNet unable to calculate the similarity. 
For example, if you were to use the Synset for bake.v.01 in the previous code, instead of 
bake.v.02, the return value would be None. This is because the root hypernyms of both the 
Synsets are different, with no overlapping paths. For this reason, you also cannot calculate 
the similarity between words with different parts of speech.

#### Path and Leacock-Chodorow (LCH) similarity
Two other similarity comparisons are the path similarity and the LCH similarity, as shown in 
the following code:

In [None]:
print(cb.path_similarity(ib))

print(cb.path_similarity(dog))

print(cb.lch_similarity(ib))

print(cb.lch_similarity(dog))
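
Both scores are functions of the shortest path distance between two Synsets. The sketch below uses the formulas as they are commonly stated, path = 1 / (distance + 1) and LCH = -log((distance + 1) / (2 * max_depth)). The function names mirror the NLTK methods for readability, but these are standalone helpers, and `max_depth` is an assumed parameter standing for the depth of the taxonomy for the relevant part of speech.

```python
import math

def path_similarity(distance):
    """Score in (0, 1]: 1 for identical nodes, shrinking as the path grows."""
    return 1.0 / (distance + 1)

def lch_similarity(distance, max_depth):
    """Leacock-Chodorow score: larger is more similar, scaled by taxonomy depth."""
    return -math.log((distance + 1) / (2.0 * max_depth))

# cookbook and instruction_book are two steps apart (see shortest_path_distance above).
print(path_similarity(2))
# max_depth=19 is an assumed taxonomy depth, chosen here for illustration.
print(lch_similarity(2, max_depth=19))
```

Unlike path similarity, the LCH score is not bounded by 1, so the two are not directly comparable; each is only meaningful relative to other scores of the same kind.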

#### Discovering word collocations
Collocations are two or more words that tend to appear frequently together, such as United 
States. Of course, there are many other words that can come after United, such as United 
Kingdom and United Airlines. As with many aspects of natural language processing, context 
is very important. And for collocations, context is everything!

In the case of collocations, the context will be a document in the form of a list of words. 
Discovering collocations in this list of words means that we'll find common phrases that 
occur frequently throughout the text. For fun, we'll start with the script for Monty Python 
and the Holy Grail.

We're going to create a list of all lowercased words in the text, and then create a 
BigramCollocationFinder, which we can use to find bigrams, which are pairs of words. 
These bigrams are scored using association measurement functions in the nltk.metrics 
package, as follows:

In [None]:
from nltk.corpus import webtext
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

words = [w.lower() for w in webtext.words('grail.txt')]
bcf = BigramCollocationFinder.from_words(words)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

Well, that's not very useful! Let's refine it a bit by adding a word filter to remove punctuation 
and stopwords:

In [None]:
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))

filter_stops = lambda w: len(w) < 3 or w in stopset
bcf.apply_word_filter(filter_stops)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

BigramCollocationFinder constructs two frequency distributions: one for each word, 
and another for bigrams. A frequency distribution, or FreqDist in NLTK, is basically an 
enhanced Python dictionary where the keys are what's being counted, and the values are 
the counts. Any filtering functions that are applied reduce the size of these two FreqDists 
by eliminating any words that don't pass the filter. By using a filtering function to eliminate all 
words that are one or two characters long, and all English stopwords, we can get a much cleaner 
result. After filtering, the collocation finder is ready to accept a generic scoring function for 
finding collocations.
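
The counting and filtering described above can be illustrated with `collections.Counter` from the standard library. This is a sketch under stated assumptions, not NLTK's implementation: the word list and stop set are hypothetical toy data, only the bigram counts are filtered, and bigrams are ranked by raw count as a stand-in for likelihood_ratio, which is not reimplemented here.

```python
from collections import Counter

# Hypothetical toy document; the recipe uses webtext.words('grail.txt') instead.
words = "the holy grail . the holy hand grenade . a holy grail".split()
# Hypothetical stop set; the recipe uses NLTK's English stopword list.
stopset = {"the", "a"}

word_fd = Counter(words)                    # one count per word
bigram_fd = Counter(zip(words, words[1:]))  # one count per adjacent pair

def filter_stops(w):
    """Return True for words that should be dropped."""
    return len(w) < 3 or w in stopset

# Keep only the bigrams in which neither word is filtered out.
kept = Counter({bigram: count for bigram, count in bigram_fd.items()
                if not any(filter_stops(w) for w in bigram)})

# Rank by raw count instead of an association measure.
print(kept.most_common(1))  # [(('holy', 'grail'), 2)]
```

Raw counts favor pairs of individually frequent words, which is why the recipe scores with an association measure rather than frequency alone.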

In [None]:
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures


words = [w.lower() for w in webtext.words('singles.txt')]
tcf = TrigramCollocationFinder.from_words(words)
tcf.apply_word_filter(filter_stops)
tcf.apply_freq_filter(3)
tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 4)

In [None]:
showImage(2)