[NLTK :: Natural Language Toolkit](#NLTK-::-Natural-Language-Toolkit)

1. [Introducing `nltk`](#Introducing-`nltk`)
2. [Tokenizing Strings](#Tokenizing-Strings)
3. [Stop Words](#Stop-Words)
4. [A Warning](#A-Warning)
5. [N-grams](#N-grams)



# NLTK :: Natural Language Toolkit

## Introducing `nltk`
Another life saver for prepping your NLP data is the `nltk` package. `nltk` stands for Natural Language Toolkit, and the corresponding documentation can be found at https://www.nltk.org/.

We'll be using this package a decent amount in the program so be sure to get familiar with it.

In this notebook we'll see how useful it is for breaking strings into individual substrings (think words or sentences) called tokens. We'll also learn about stop words and n-grams.

`nltk` can be used for more than these three purposes, but we won't introduce those unless we need them later in the course.

In [None]:
import pandas as pd
df = pd.read_csv('Food_Review.csv')

In [None]:
df = df[['Summary','Text']]
df.head()

In [None]:
# We lowercase all strings
df['Text_clean'] = df['Text'].str.lower()

## Tokenizing Strings
Recall that the end goal of Preprocessing - Cleaning Data was to clean our data and make it easier to work with. As a part of that process we had to:

* turn all the words to lowercase (not always necessary, but we've already done that)
* remove punctuation (not always necessary)
* remove numbers
* strip white space (also generally part of tokenization)

You can also write a `split` statement to separate the cleaned string into words.
That process is called word tokenization: breaking strings down into smaller units (in this case words).
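
The cleaning steps listed above can be sketched with nothing but the standard library. This is only an illustration, not the course's official preprocessing: the helper name `simple_tokenize` and the sample sentence are made up for this example.

```python
import string

def simple_tokenize(text):
    """Lowercase, drop punctuation and digits, then split on white space."""
    text = text.lower()
    # Delete every ASCII punctuation character and digit in one pass
    drop_table = str.maketrans("", "", string.punctuation + string.digits)
    text = text.translate(drop_table)
    # split() with no arguments also strips and collapses extra white space
    return text.split()

print(simple_tokenize("I bought 2 bags, and they were GREAT!!"))
# ['i', 'bought', 'bags', 'and', 'they', 'were', 'great']
```

Note how blunt this is: a word like "ez-pz" would come out as "ezpz" and "don't" as "dont". Handling cases like these more carefully is part of what a dedicated tokenizer is for.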

`nltk` has a number of built-in tokenizer objects that can make this process as simple as a single line of code. Let's check out an example.

In [None]:
# Run this code chunk to ensure that you have
# nltk installed properly
import nltk

# Note your version may be different than mine
# That's fine
# At the time of writing this notebook I have version 3.4.5
print(nltk.__version__)

Note: You may get an error or two running the code below if you've never run `nltk` before. That's probably because some of the data it needs isn't downloaded automatically with the `nltk` package. I believe the error messages do a pretty good job of telling you how to fix it (usually a call to `nltk.download` with the name of the missing resource).

In [None]:
## Our first tokenizer will be
## word_tokenize
df['tokenized_text'] = df['Text_clean'].apply(nltk.word_tokenize) 

In [None]:
## Practice
## Run the tokenizer on this #FakeTweet
fake_tweet = "tokenizing is really ez-pz :P :D #nlp"

nltk.word_tokenize(fake_tweet)

In [None]:
# This one we have to import from the tokenize subpackage
from nltk.tokenize import TweetTokenizer

# Now we make a Tokenizer object
tweet_tokenizer = TweetTokenizer()

In [None]:
# We call the 'tokenize' method of the tokenizer
tweet_tokenizer.tokenize(fake_tweet)

So the right way to tokenize a string really depends on your use case. You can learn more about the existing tokenizer objects, and even how to write your own, at https://www.nltk.org/api/nltk.tokenize.html.

For the most part we'll stick to `word_tokenize`; I'll point out when we depart from that norm.
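
As one illustration of tailoring a tokenizer to a use case, here is a small sketch built on `nltk`'s `RegexpTokenizer`, which splits text according to a regular expression you supply. The pattern and the name `hashtag_tokenizer` are assumptions chosen for this example (keep hashtags and hyphenated words together), not something prescribed by `nltk` or by this course.

```python
from nltk.tokenize import RegexpTokenizer

# Assumed pattern: an optional leading '#', then word characters,
# optionally joined by hyphens (so "ez-pz" stays one token)
hashtag_tokenizer = RegexpTokenizer(r"#?\w+(?:-\w+)*")

sample = "tokenizing is really ez-pz :P :D #nlp"
print(hashtag_tokenizer.tokenize(sample))
# ['tokenizing', 'is', 'really', 'ez-pz', 'P', 'D', '#nlp']
```

This pattern keeps `#nlp` intact but throws away the colons in the emoticons, which is exactly the kind of trade-off you have to weigh against your own data.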

### Practice
1. Make a list of the tokens in the Food Review data. Don't bother cleaning out punctuation.

In [None]:
## Code here

In [None]:
## Code here

2. Google how to tokenize sentences using `nltk` (one starting point: https://www.guru99.com/tokenize-words-sentences-nltk.html). Once you've figured out how to do that:

    A. Create a pandas DataFrame containing the unique sentences from the Goblet of Fire excerpt.
    B. Create a column in your DataFrame that contains the tokenized version of each sentence.

In [None]:
## Code here

In [None]:
## Code here

In [None]:
import pandas as pd

## Stop Words
Think of the words you use most often. There are many words in the world that are necessary to form coherent sentences. However, most of those words are not necessary to convey the meaning behind your sentences.

That's essentially the idea behind _stop words_. These are frequently used words that can be thought of as "noise" for the sake of data analysis. As such it may be useful to remove them prior to analysis. `nltk` stores a corpus (collection of texts) of stop words in a variety of languages for easy out-of-the-box use.

In [None]:
# stopwords are stored in the corpus subpackage
from nltk.corpus import stopwords

In [None]:
# words can be accessed like so
print(stopwords.words('english'))

### Practice
1. Remove the stop words from your tokens for the Food Review excerpt.

In [None]:
## Code here

In [None]:
## Code here

2. Create a column in your DataFrame that contains the tokenized sentences with the stop words removed.

In [None]:
## Code here

In [None]:
## Code here


## A Warning
When working on your NLP projects be wary about removing the stopwords. If you've seen The Office, you know our friend Kevin had to go back to using all the words in order for people to understand him.

One concern I have with `nltk`'s stop words is that words like "no", "not" and "nor" are on the list. The absence of those words can greatly alter the meaning of a sentence.

In [None]:
sentence = "I do not like ice cream."

tokens = nltk.word_tokenize(sentence)

In [None]:
print("With stop words.")
print(tokens)

In [None]:
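print("Without stop words.")
# Build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
print([token for token in tokens if token not in stop_words])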
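
One possible way to act on this warning is to take the negation words out of the stop list before filtering. The sketch below uses a tiny hand-written stop list so that it runs without any downloads; that list, the helper `remove_stop_words`, and the variable names are all made up for illustration. In practice you would start from `stopwords.words('english')` instead.

```python
# Hypothetical mini stop list for illustration only
base_stop_words = {"i", "do", "not", "no", "nor", "the", "a", "is"}

# Words we want to keep because dropping them flips the meaning
negations = {"no", "not", "nor"}
custom_stop_words = base_stop_words - negations

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

example_tokens = ["I", "do", "not", "like", "ice", "cream", "."]

print(remove_stop_words(example_tokens, base_stop_words))
# ['like', 'ice', 'cream', '.']
print(remove_stop_words(example_tokens, custom_stop_words))
# ['not', 'like', 'ice', 'cream', '.']
```

With the full list the sentence reads as if the author likes ice cream. With the negations kept, the original meaning survives.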

## N-grams
One final data cleaning step we'll discuss here is the creation of n-grams.

As you might imagine, just looking at the words used in a piece of text is not always enough to create useful applications. This is because breaking a text up into the unique words is essentially assuming that every piece of text (from here on referred to as a document) is created by randomly pulling words from a bag, which is why this technique is called __Bag of Words__ (more on this next week).
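
To make the "words pulled from a bag" picture concrete, here is a minimal sketch using `collections.Counter` from the standard library. The two example documents are invented for this illustration; they contain exactly the same words in a different order.

```python
from collections import Counter

# Two made-up documents with the same words but opposite meanings
doc_a = "the dog bit the man".split()
doc_b = "the man bit the dog".split()

# A bag of words only records how often each word occurs
bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a == bag_b)  # True: the bags cannot tell the documents apart
print(doc_a == doc_b)  # False: the token sequences are different
```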

Under this assumption you lose the information carried by the author's word order. A step up from simple bag of words is to look at the unique sequences of n consecutive words (otherwise known as n-grams).

For example, the bigrams (2-grams) for this sentence:

`"I do not like ice cream"`

are

`[("I", "do"), ("do", "not"), ("not", "like"), ("like", "ice"), ("ice", "cream")]`.

`nltk` also offers functions that take in a list of tokens (which must be in the order they appeared in the text) and output an iterator object of n-grams.

In [None]:
# nltk.bigrams makes the bigrams
# it returns an iterator object
nltk.bigrams(nltk.word_tokenize(sentence))

In [None]:
# You can turn that into a list like so
print(list(nltk.bigrams(nltk.word_tokenize(sentence))))

In [None]:
# nltk.ngrams can make any kind of ngram
nltk.ngrams(nltk.word_tokenize(sentence),3)

In [None]:
print(list(nltk.ngrams(nltk.word_tokenize(sentence),3)))
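
For intuition about what these helpers compute, here is a plain-Python sketch that slides a window of length n over the tokens. The function name `make_ngrams` is made up for this example, and this is not how `nltk` implements it internally; it only aims to produce the same kind of tuples.

```python
def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list, in order."""
    # tokens[i:] shifts the sequence by i positions; zipping the n shifted
    # copies lines up each token with the n - 1 tokens that follow it
    return list(zip(*(tokens[i:] for i in range(n))))

example_tokens = ["I", "do", "not", "like", "ice", "cream"]

print(make_ngrams(example_tokens, 2))
# [('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'ice'), ('ice', 'cream')]
print(len(make_ngrams(example_tokens, 3)))
# 4
```

A list of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.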

### Practice
1. Produce a list of the 4-grams of the Goblet of Fire excerpt.

In [None]:
## Code here


2. Create a column in your DataFrame that contains the bigrams for each sentence of the Goblet of Fire excerpt.

In [None]:
## Code here


In [None]:
## Code here


# Great!
We now know enough to move on to our first NLP projects!