## Natural Language Processing Tasks

### 1. Tokenizing text into sentences

In [2]:
from nltk.tokenize import sent_tokenize

# paragraph of text
text = "Hello World. It's good to see you. Thanks for buying this book."

sentences = sent_tokenize(text)
sentences

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

----------------------------------------------------------------------------------------------------------------------
The sent_tokenize function uses an instance of PunktSentenceTokenizer from the
nltk.tokenize.punkt module. This instance has already been trained and works well for
many European languages. So it knows what punctuation and characters mark the end of a
sentence and the beginning of a new sentence.

The instance used in sent_tokenize() is actually loaded on demand from a pickle
file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the
PunktSentenceTokenizer instance once, and call its tokenize() method instead.

----------------------------------------------------------------------------------------------------------------------

In [4]:
import nltk.data

# paragraph of text
text = "Hello World. It's good to see you. Thanks for buying this book."

tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(text)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

----------------------------------------------------------------------------------------------------------------------
Note:<br>
    Tokenizers for other languages can be loaded the same way, for example Spanish:<br>
       >>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')<br>
       >>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')<br>
           ['Hola amigo.', 'Estoy bien.']<br>

### 2. Tokenizing Sentences into Words

In [5]:
from nltk.tokenize import word_tokenize

word_tokenize('Hi how are you!!!')

['Hi', 'how', 'are', 'you', '!', '!', '!']

----------------------------------------------------------------------------------------------------------------------
The word_tokenize() function is a wrapper function that calls tokenize() on a
Treebank-style word tokenizer (recent NLTK releases also split the text into sentences first).
For a simple sentence like this one, it gives the same result as the following code:

----------------------------------------------------------------------------------------------------------------------

In [6]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('Hi how are you!!!')

['Hi', 'how', 'are', 'you', '!', '!', '!']
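
To show what the Treebank conventions mean in practice, here is a minimal sketch that tokenizes a sentence containing a contraction. The example sentence is an assumption chosen for illustration; the tokenizer class is the one imported above.

```python
from nltk.tokenize import TreebankWordTokenizer

# Treebank conventions split contractions into separate tokens,
# so the negation "n't" becomes a token of its own.
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("I can't do this.")
print(tokens)
```

If splitting contractions is not what you want, a regular-expression tokenizer can be used to keep them intact.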

### 3. Tokenizing Sentences Using Regular Expressions

Regular expressions can be used if you want complete control over how to tokenize text.<br>

First you need to decide how you want to tokenize a piece of text as this will determine how<br>
you construct your regular expression. The choices are:<br>
1> Match on the tokens<br>
2> Match on the separators or gaps<br>

In [7]:
from nltk.tokenize import regexp_tokenize

regexp_tokenize("Can't be done this.", r"[\w']+")

["Can't", 'be', 'done', 'this']

In [9]:
regexp_tokenize("Can't be done this.", r"\s+", gaps=True)

["Can't", 'be', 'done', 'this.']
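
The two strategies can also be sketched with Python's standard re module alone. This is only an illustration of the idea (matching tokens versus matching gaps), not a description of how NLTK implements regexp_tokenize internally.

```python
import re

text = "Can't be done this."

# Strategy 1: match the tokens themselves
# (runs of word characters or apostrophes).
tokens_by_match = re.findall(r"[\w']+", text)

# Strategy 2: match the gaps (whitespace) and keep what lies between them,
# dropping any empty strings produced by leading or trailing gaps.
tokens_by_gap = [t for t in re.split(r"\s+", text) if t]

print(tokens_by_match)  # ["Can't", 'be', 'done', 'this']
print(tokens_by_gap)    # ["Can't", 'be', 'done', 'this.']
```

Matching on tokens drops the final period because it is not part of the token pattern, while matching on gaps keeps it attached to the last word.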

### 4. Training a Sentence Tokenizer

NLTK's default sentence tokenizer is general purpose, and usually works quite well. But
sometimes it is not the best choice for your text. Perhaps your text uses nonstandard
punctuation, or is formatted in a unique way. In such cases, training your own sentence
tokenizer can result in much more accurate sentence tokenization.

In [12]:
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('overheard.txt')
sent_tokenizer = PunktSentenceTokenizer(text)

In [18]:
sents1 = sent_tokenizer.tokenize(text)
print(sents1[0])
print('\n')
print(sents1[678])

White guy: So, do you have any plans for this evening?


Girl: But you already have a Big Mac...


In [19]:
# Let's compare the results to the default sentence tokenizer
from nltk.tokenize import sent_tokenize
sents2 = sent_tokenize(text)
print(sents2[0])
print('\n')
print(sents2[678])

White guy: So, do you have any plans for this evening?


Girl: But you already have a Big Mac...
Hobo: Oh, this is all theatrical.


----------------------------------------------------------------------------------------------------------------------
While the first sentence is the same, the tokenizers disagree on how to
tokenize sentence 679, at index 678 (this is the first sentence where the tokenizers diverge). The default
tokenizer includes the next line of dialog, while our custom tokenizer correctly treats
the next line as a separate sentence. This difference is a good demonstration of why it can
be useful to train your own sentence tokenizer, especially when your text isn't in the typical
paragraph-sentence structure.

----------------------------------------------------------------------------------------------------------------------
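
One way to locate the first point of disagreement between two tokenizations is a small helper like the sketch below. The function name first_divergence and the sample lists are hypothetical and only for illustration; in the notebook you would pass sents1 and sents2 instead.

```python
from itertools import zip_longest


def first_divergence(sents_a, sents_b):
    """Return the index of the first position where the lists differ, or None."""
    for index, (a, b) in enumerate(zip_longest(sents_a, sents_b)):
        if a != b:
            return index
    return None


# Hypothetical stand-in data that mimics the dialog format of the corpus.
custom = ["A: Hi.", "B: Hello...", "C: Bye."]
default = ["A: Hi.", "B: Hello...\nC: Bye."]

print(first_divergence(custom, default))  # 1
```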

### 5. Filtering Stopwords in a Tokenized Sentence

Stopwords are common words that generally do not contribute to the meaning of a sentence,
at least for the purposes of information retrieval and natural language processing. These are
words such as "the" and "a". Most search engines will filter out stopwords from search queries
and documents in order to save space in their index.<br>

NLTK comes with a stopwords corpus that contains word lists for many languages.

In [21]:
# Create a set of all English stopwords, then use it to filter stopwords from a sentence
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = ["Can't", 'be', 'done', 'this']
[word for word in words if word not in stop_words]

["Can't", 'done']
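
Note that the membership test above is case-sensitive. The sketch below shows the difference that lowercasing makes, using a small hypothetical stopword set so that it runs without any corpus download; with NLTK you would use set(stopwords.words('english')) instead.

```python
# Hypothetical mini stopword set, for illustration only.
stop_words = {"the", "a", "is", "be", "this"}

words = ["The", "book", "is", "a", "Classic"]

# Case-sensitive check: "The" is kept because only "the" is in the set.
case_sensitive = [w for w in words if w not in stop_words]

# Case-insensitive check: lowercase each word before the lookup.
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(case_sensitive)    # ['The', 'book', 'Classic']
print(case_insensitive)  # ['book', 'Classic']
```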

In [22]:
# There are also stopword lists for many other languages
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [24]:
# First five Dutch stopwords
stopwords.words('dutch')[0:5]

['de', 'en', 'van', 'ik', 'te']

### 6. Looking Up Synsets (Synonym Sets) for a Word in WordNet

WordNet is a lexical database for the English language. In other words, it's a dictionary
designed specifically for natural language processing.<br>

NLTK comes with a simple interface to look up words in WordNet. What you get is a list of
Synset instances, which are groupings of synonymous words that express the same concept.

In [37]:
from nltk.corpus import wordnet
syn = wordnet.synsets('book')[0]
print(syn.name())
print('\n')
print('Definition :\n',syn.definition())
print('\n')
print('Example : \n',syn.examples())
print('\n')
print('parts of speech :\n', syn.pos())

book.n.01


Definition :
 a written work or composition that has been published (printed on pages bound together)


Example : 
 ['I am reading a good book on economics']


parts of speech :
 n


### 7. Looking up Lemmas and Synonyms in WordNet

A lemma, in linguistics, is the canonical (dictionary) form of a word.<br>

We can also look up lemmas in WordNet to find synonyms of a word.

In [43]:
from nltk.corpus import wordnet

syn = wordnet.synsets('cookbook')[0]
lemmas = syn.lemmas()
print(lemmas)
print('\n')
print(len(lemmas))

[Lemma('cookbook.n.01.cookbook'), Lemma('cookbook.n.01.cookery_book')]


2


In [44]:
print(lemmas[0].name())
print(lemmas[1].name())

cookbook
cookery_book


In [45]:
lemmas[0].synset() == lemmas[1].synset()

True

### 8. Stemming Words

Stemming is a technique for removing affixes from a word, leaving only the stem. For
example, the stem of "cooking" is "cook", and a good stemming algorithm knows that the "ing"
suffix can be removed.<br>

Stemming is most commonly used by search engines for indexing
words. Instead of storing all forms of a word, a search engine can store only the stems, greatly
reducing the size of the index while increasing retrieval accuracy.<br>

One of the most common stemming algorithms is the Porter stemming algorithm. It is designed to remove and replace well-known suffixes of English words.

In [47]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('cooking'))
print(stemmer.stem('cookery'))

cook
cookeri
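
The indexing idea described above can be sketched as a tiny inverted index keyed by stem. The word list and document ids are hypothetical; only PorterStemmer comes from NLTK.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical word occurrences as (word, document id) pairs.
occurrences = [("cooking", 1), ("cooks", 2), ("cooked", 3), ("book", 1)]

# Map each stem to the set of documents that contain any form of the word.
index = defaultdict(set)
for word, doc_id in occurrences:
    index[stemmer.stem(word)].add(doc_id)

print(dict(index))
```

Three different surface forms collapse into the single key "cook", so a query for any of them can be answered from one index entry.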


### 9. Lemmatizing Words with WordNet

Lemmatization is very similar to stemming, but is more akin to synonym replacement.
A lemma is a root word, as opposed to the root stem. So unlike stemming, you are always
left with a valid word that means the same thing.<br>

In [48]:
# We will use the WordNetLemmatizer class to find lemmas
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cooking'))
print(lemmatizer.lemmatize('cooking', pos='v'))
print(lemmatizer.lemmatize('cookbooks'))

cooking
cook
cookbook


### 10. Replacing Words Matching Regular Expressions

First, we define a number of replacement patterns. This will be a list of tuple
pairs, where the first element is the pattern to match and the second element is
the replacement.<br>

Next, we will create a RegexReplacer class that compiles the patterns and provides
a replace() method to substitute all matched patterns with their replacements.

In [56]:
import re

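# Specific contractions come first so the generic rules below do not mangle them.
# Raw strings keep backslashes such as \w and \g<1> from being read as string escapes.
replacement_patterns = [
    (r"won't", r"will not"),
    (r"can't", r"cannot"),
    (r"i'm", r"i am"),
    (r"ain't", r"is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
]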

In [57]:
class RegexReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile each pattern once so replace() can reuse it
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply the patterns in order; earlier substitutions feed later ones
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s

In [60]:
replacer = RegexReplacer()
replacer.replace("can't is a contraction")

'cannot is a contraction'

In [61]:
replacer.replace("I should've done that thing I didn't do")

'I should have done that thing I did not do'
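
The order of the patterns matters, because each substitution runs on the output of the previous one. The sketch below illustrates this with a hypothetical helper named expand and two of the patterns from the list above.

```python
import re


def expand(text, patterns):
    """Apply each (regex, replacement) pair in order."""
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text


# The specific rule for "won't" placed before the generic "n't" rule...
specific_first = [(r"won't", "will not"), (r"(\w+)n't", r"\g<1> not")]
# ...and the same two rules in the opposite order.
generic_first = list(reversed(specific_first))

print(expand("I won't go", specific_first))  # I will not go
print(expand("I won't go", generic_first))   # I wo not go
```

This is why irregular contractions such as "won't" and "can't" are listed ahead of the generic suffix rules.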
