# Exploring NLTK
### Rahul Das, CS 4395

In [45]:
# In order to access text samples
from nltk.book import *


## Tokens
In NLTK's Text object, the data of a text is stored as **tokens**. Each token is a unit that the text has been broken into, typically a word or a punctuation mark. The built-in *tokens* attribute returns the tokens of a specific Text object as a list.

Here we save and display the first 20 tokens of the returned token list for text1.

In [46]:
toks = text1.tokens[:20]
print(toks)

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar']
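
Because the token list is an ordinary Python list, the usual list tools apply to it. The following is a minimal sketch that copies the 20 tokens printed above by hand, so it runs without the NLTK corpora; the names `distinct` and `words` are my own and are only for illustration.

```python
# The first 20 tokens of text1, copied from the output above so the
# example runs without downloading the NLTK corpora.
toks = ['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']',
        'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late',
        'Consumptive', 'Usher', 'to', 'a', 'Grammar']

# A token list supports ordinary list and set operations.
distinct = set(toks)                       # unique tokens
words = [t for t in toks if t.isalpha()]   # drop punctuation and numbers

print(len(toks), len(distinct), len(words))
```

The same operations should work on the full `text1.tokens` list, since it is the same kind of object.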


## Concordance
NLTK offers a helpful way of viewing a text called a concordance. A concordance for a word shows every occurrence of that word in a Text object, along with some of the surrounding context.

Here we view a concordance for 5 occurrences of the word 'sea' in text1.

In [47]:
# concordance() prints its results directly and returns None, so no print() is needed
text1.concordance('sea', lines=5)

Displaying 5 of 455 matches:
 shall slay the dragon that is in the sea ." -- ISAIAH " And what thing soever 
 S PLUTARCH ' S MORALS . " The Indian Sea breedeth the most and the biggest fis
cely had we proceeded two days on the sea , when about sunrise a great many Wha
many Whales and other monsters of the sea , appeared . Among the former , one w
 waves on all sides , and beating the sea before him into a foam ." -- TOOKE ' 
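
To make the idea of a concordance concrete, here is a rough sketch of how one could be built from a plain token list. This is not NLTK's implementation; `simple_concordance`, its `width` parameter, and the sample tokens are all hypothetical names made up for this example. It matches case-insensitively, which is consistent with the output above, where both 'sea' and 'Sea' appear.

```python
def simple_concordance(tokens, word, width=2):
    """Return each occurrence of `word` with up to `width` tokens of context per side.

    Hypothetical helper for illustration only; matching ignores case.
    """
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]      # context before the match
            right = tokens[i + 1:i + 1 + width]     # context after the match
            lines.append(' '.join(left + [tok] + right))
    return lines

# Made-up sample tokens, not taken from Moby Dick.
sample = ['The', 'sea', 'was', 'calm', '.', 'We', 'crossed',
          'the', 'Sea', 'at', 'dawn', '.']
for line in simple_concordance(sample, 'sea'):
    print(line)
```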


## Count
The NLTK Text object has its own count method, which differs slightly from Python's built-in count methods. NLTK's count returns the number of times a word occurs as a token in a Text object, whereas Python's built-in count returns the number of occurrences of an element in a list or of a substring in a string.

In [48]:
# NLTK count
print('Occurrences of \'sea\' in Moby Dick: ', text1.count('sea'))
# count is case sensitive
print('Occurrences of \'Sea\' in Moby Dick: ', text1.count('Sea'))

# Python default count
# String - lowercase count should be 2, uppercase should be 1
sentence = "The quick brown fox jumps over the lazy dog twice."
print('Occurrences of \'t\' in the sentence: ', sentence.count('t'))
print('Occurrences of \'T\' in the sentence: ', sentence.count('T'))
# List - 'a': 3, 15: 2
data = ['a', 'b', 'a', 'a', 'c', 5, 10, 10, 15, 10, 10, 15]
print('Occurrences of \'a\' in the list: ', data.count('a'))
# works for any element of a list regardless of data type
print('Occurrences of 15 in the list: ', data.count(15))

Occurrences of 'sea' in Moby Dick:  433
Occurrences of 'Sea' in Moby Dick:  22
Occurrences of 't' in the sentence:  2
Occurrences of 'T' in the sentence:  1
Occurrences of 'a' in the list:  3
Occurrences of 15 in the list:  2
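
The practical difference shows up when a word also appears inside longer words. The sketch below uses a small made-up token list rather than a real Text object, on the assumption that counting over a list of tokens behaves like the NLTK count shown above: only whole tokens match, while the string count also matches substrings.

```python
# Made-up tokens for illustration; 'sea' appears once as a whole token.
tokens = ['the', 'seas', 'and', 'the', 'sea', ',', 'seaside', 'Sea']
text = ' '.join(tokens)

print(tokens.count('sea'))   # whole-token matches only
print(text.count('sea'))     # substring matches, including 'seas' and 'seaside'
```

Both counts are case sensitive, so the capitalized 'Sea' is ignored in each case.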


## Tokenizer
While NLTK does provide a few pre-packaged text samples, it is important to be able to process our own texts. Before a text can be stored in a Text object, it must first be turned into a list of tokens. The word_tokenize function does this by splitting raw text into word and punctuation tokens. Alternatively, the sent_tokenize function splits text into sentences.

To demonstrate the tokenization process, I am going to tokenize an excerpt from The Last Olympian by Rick Riordan.

In [49]:
# In order to use word_tokenize and sent_tokenize
from nltk.tokenize import word_tokenize, sent_tokenize
# Defines text sample to be used by tokenizers
raw_text = 'The main courtyard was filled with warriors - mermen with fish tails from the waist down and human bodies from the waist up, except their skin was blue, which I\'d never known before. Some were tending to the wounded. Some were sharpening spears and swords. One passed us, swimming in a hurry. His eyes were bright green, like that stuff they put in glo-sticks, and his teeth were shark teeth. They don\'t show you stuff like that in The Little Mermaid.'

In [50]:
wordTokens = word_tokenize(raw_text)
# only displays first 44 tokens of the text in order to prove the words were tokenized
print(wordTokens[:44])

['The', 'main', 'courtyard', 'was', 'filled', 'with', 'warriors', '-', 'mermen', 'with', 'fish', 'tails', 'from', 'the', 'waist', 'down', 'and', 'human', 'bodies', 'from', 'the', 'waist', 'up', ',', 'except', 'their', 'skin', 'was', 'blue', ',', 'which', 'I', "'d", 'never', 'known', 'before', '.', 'Some', 'were', 'tending', 'to', 'the', 'wounded', '.']


In [51]:
sentTokens = sent_tokenize(raw_text)
# prints all tokenized sentences
print(sentTokens)

["The main courtyard was filled with warriors - mermen with fish tails from the waist down and human bodies from the waist up, except their skin was blue, which I'd never known before.", 'Some were tending to the wounded.', 'Some were sharpening spears and swords.', 'One passed us, swimming in a hurry.', 'His eyes were bright green, like that stuff they put in glo-sticks, and his teeth were shark teeth.', "They don't show you stuff like that in The Little Mermaid."]
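
To see why a dedicated tokenizer is useful, it helps to compare a plain `split()` with a slightly smarter approach. The sketch below is a simplified stand-in written with Python's `re` module, not the algorithm NLTK uses; `rough_word_tokenize` is a hypothetical name for this example.

```python
import re

def rough_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical, simplified stand-in for word_tokenize; it does not handle
    contractions or hyphenated words the way NLTK does.
    """
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Some were tending to the wounded."
print(sentence.split())               # the period stays attached to 'wounded.'
print(rough_word_tokenize(sentence))  # the period becomes its own token
```

On a contraction such as "don't", this sketch produces 'don', "'", and 't', whereas the word_tokenize output above shows 'do' and "n't". That kind of language-aware handling is what the NLTK tokenizer adds.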


## Stems and Lemmas
A stem is a form of a word with its affixes removed; the Porter stemmer used below strips suffixes. The result is often not an actual word. Stems are not always useful, but in some cases they strip away parts of a word that are not needed to convey tone or meaning.

Lemmas, on the other hand, are the base forms of words. Lemmatization reverts a word to its dictionary form, which avoids the problem of stems that are incoherent or useless. Because the base form is looked up in a dictionary, the result is an actual word and its meaning is preserved. The drawback is the extra time needed to search a detailed dictionary.

In [52]:
# In order to use stemmer and lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [53]:
# Creates a stemmer that uses the Porter algorithm for stemming
stemmer = PorterStemmer()
# List comprehension that goes through and applies stemmer to every element of the word tokens list
stems = [stemmer.stem(currTok) for currTok in wordTokens]
print(stems)

['the', 'main', 'courtyard', 'wa', 'fill', 'with', 'warrior', '-', 'mermen', 'with', 'fish', 'tail', 'from', 'the', 'waist', 'down', 'and', 'human', 'bodi', 'from', 'the', 'waist', 'up', ',', 'except', 'their', 'skin', 'wa', 'blue', ',', 'which', 'i', "'d", 'never', 'known', 'befor', '.', 'some', 'were', 'tend', 'to', 'the', 'wound', '.', 'some', 'were', 'sharpen', 'spear', 'and', 'sword', '.', 'one', 'pass', 'us', ',', 'swim', 'in', 'a', 'hurri', '.', 'hi', 'eye', 'were', 'bright', 'green', ',', 'like', 'that', 'stuff', 'they', 'put', 'in', 'glo-stick', ',', 'and', 'hi', 'teeth', 'were', 'shark', 'teeth', '.', 'they', 'do', "n't", 'show', 'you', 'stuff', 'like', 'that', 'in', 'the', 'littl', 'mermaid', '.']


In [54]:
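# Creates a lemmatizer using the WordNet approach to lemmatization
# Note: lemmatize() treats each word as a noun unless a pos argument is passed, which is why 'was' comes out as 'wa'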
lemmatizer = WordNetLemmatizer()
# List comprehension that goes through and applies lemmatizer to every element of the word tokens list
lemma = [lemmatizer.lemmatize(currTok) for currTok in wordTokens]
print(lemma)

['The', 'main', 'courtyard', 'wa', 'filled', 'with', 'warrior', '-', 'merman', 'with', 'fish', 'tail', 'from', 'the', 'waist', 'down', 'and', 'human', 'body', 'from', 'the', 'waist', 'up', ',', 'except', 'their', 'skin', 'wa', 'blue', ',', 'which', 'I', "'d", 'never', 'known', 'before', '.', 'Some', 'were', 'tending', 'to', 'the', 'wounded', '.', 'Some', 'were', 'sharpening', 'spear', 'and', 'sword', '.', 'One', 'passed', 'u', ',', 'swimming', 'in', 'a', 'hurry', '.', 'His', 'eye', 'were', 'bright', 'green', ',', 'like', 'that', 'stuff', 'they', 'put', 'in', 'glo-sticks', ',', 'and', 'his', 'teeth', 'were', 'shark', 'teeth', '.', 'They', 'do', "n't", 'show', 'you', 'stuff', 'like', 'that', 'in', 'The', 'Little', 'Mermaid', '.']


### Some differences between the stems and lemmas
*(In the form stem - lemma)*
- fill - filled
- mermen - merman
- bodi - body
- tend - tending
- hurri - hurry
- sharpen - sharpening
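
The contrast between the two approaches can be sketched in a few lines of plain Python. This is a toy illustration only: the suffix rules are not the Porter algorithm, and the lookup table is a tiny stand-in for a real dictionary such as WordNet. The names `toy_stem`, `toy_lemmatize`, `SUFFIXES`, and `LEMMA_TABLE` are all hypothetical.

```python
# Toy suffix rules, tried in order; NOT the Porter algorithm.
SUFFIXES = ['ing', 'ed', 's']

# Toy lookup table standing in for a real dictionary such as WordNet.
LEMMA_TABLE = {'mermen': 'merman', 'bodies': 'body'}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word itself."""
    return LEMMA_TABLE.get(word.lower(), word)

for w in ['filled', 'tending', 'bodies', 'mermen']:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based stemmer turns 'bodies' into the non-word 'bodie' and cannot do anything with the irregular plural 'mermen', while the lookup handles both but only for words that are in its table.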

## Conclusions About NLTK
### Functionality
I think NLTK is a powerful and very useful tool for language processing. People tend to underestimate the complexity and nuance of human language, so processing language may seem easy at first but can become a daunting task for a programmer. The NLTK library accounts for this complexity and offers several powerful tools to help programmers begin processing natural language texts. I look forward to using it more as I explore natural language processing.
### Code Quality
The NLTK documentation and syntax are some of the cleanest I've seen among the libraries I've used over the years. Many powerful libraries fall into the trap of poorly maintaining their documentation or overcomplicating their tools. I believe NLTK has done a very good job of avoiding both. Its usability and code quality are high, its interface is intuitive, and its documentation is clean and informative. I am looking forward to writing code using NLTK because of its high quality.
### Potential Future Uses
NLTK provides many of the tools that allow computers to work with language. Combined with machine learning, these tools can support powerful programs such as spam classifiers, simple chatbots, search algorithms, and other tasks that require both processing and understanding language.