# Lecture 25: Natural Language Toolkit - NLTK
- Download data for practice analysis from the NLTK repository
- Explore word usage with NLTK’s __concordance__, __similar__, and __dispersion_plot__ functions
- Calculate a crude measure of a text's lexical diversity: the ratio of distinct word __types__ to total word __tokens__
- Calculate the frequency of each word type in a text, and other word metrics, using the __FreqDist__ class

__Reading material:__
- Read the introduction to Chapter 1 of the [NLTK Book](http://www.nltk.org/book/ch01.html). 

- Follow Chapter 1, sections 1.2, 1.3, and 1.4
- Follow Chapter 1, section 3 (all)
- Skim Chapter 1, section 5. This will give you a good overview of the issues in natural language.
- Skim Chapter 3 on processing raw text.


In [None]:
import nltk
%matplotlib inline

In [None]:
nltk.download()
# If the downloader UI fails to open, run one of these instead:
# nltk.download_shell()
# nltk.download("book")

In [None]:
from nltk.book import *

In [None]:
print(text1)
print(type(text1))

In [None]:
# Search the text: show each occurrence of "food" with its surrounding context
text1.concordance("food")

In [None]:
# To find other words that appear in a similar range of contexts
text1.similar("food")

In [None]:
text1.dispersion_plot(["food", "time", "live", "here"])
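
The lexical diversity measure from the lecture goals (distinct word types divided by total tokens) can be sketched as below. The function name `lexical_diversity` and the toy token list `sample` are illustrative assumptions, not part of NLTK's API or its corpora.

```python
def lexical_diversity(tokens):
    """Return the ratio of distinct word types to total tokens."""
    tokens = list(tokens)
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return len(set(tokens)) / len(tokens)


# Toy token list (hypothetical data, not drawn from the NLTK corpora)
sample = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

print(len(sample))                # total number of tokens
print(len(set(sample)))           # number of distinct types
print(lexical_diversity(sample))  # types divided by tokens
```

Because an NLTK `Text` can be iterated like a list of tokens, `lexical_diversity(text1)` should work the same way once the book data is loaded. The comparison is case-sensitive, so "The" and "the" count as separate types unless the tokens are lowercased first.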

In [None]:
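# Build a frequency distribution over every token in the text
fdist1 = FreqDist(text1)
print(fdist1.most_common(5))  # the five most frequent tokens with their counts
fdist1.plot(30, cumulative=False)  # plot counts for the 30 most frequent tokens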
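
To make the word metrics concrete, here is a small sketch on a toy token list. It uses `collections.Counter` from the standard library, which `FreqDist` subclasses, so `most_common` behaves the same way on `fdist1`. The variable names and the toy data are assumptions for illustration.

```python
from collections import Counter

# Toy token list (hypothetical data, not drawn from the NLTK corpora)
tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]
counts = Counter(tokens)

# The two most frequent types with their counts
top_two = counts.most_common(2)

# Relative frequency of one type: its count divided by the total token count
total = sum(counts.values())
rel_freq_the = counts["the"] / total

# Hapaxes: types that occur exactly once
hapaxes = sorted(w for w, c in counts.items() if c == 1)

print(top_two)
print(rel_freq_the)
print(hapaxes)
```

On a real `FreqDist`, the last two metrics are available directly as `fdist1.freq("whale")` and `fdist1.hapaxes()`.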

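The most frequent tokens in a text are usually punctuation and stop words (very common function words such as "the" and "of"), which say little about its content. Let's see if we can do better by removing them.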

In [None]:
# Download the "stopwords" corpus
nltk.download("stopwords")

In [None]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

In [None]:
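# Use a set so each membership test is fast
stopwds = set(stopwords.words("english"))
# Lowercase each token; keep it only if it is alphanumeric and not a stop word
filtered_text1 = [w.lower() for w in text1 if w.lower() not in stopwds and w.isalnum()]
filtered_text1 = nltk.Text(filtered_text1)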

In [None]:
# Plot counts for the 30 most frequent tokens that remain after filtering
filtered_text1.plot(30)
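
To see what the filtering step above keeps and drops, here is a sketch on a toy token list. The set `toy_stopwords` is a hypothetical stand-in for `stopwords.words("english")`, and `raw_tokens` is made-up data.

```python
# Hypothetical stand-in for the real English stop-word list
toy_stopwords = {"the", "and", "of"}

# Toy tokens mixing capitalized words, punctuation, and a number
raw_tokens = ["The", "Whale", ",", "and", "the", "Sea", "of", "1851", "!"]

# Same pattern as the notebook cell: lowercase, drop stop words,
# and drop tokens that are not purely alphanumeric (e.g. punctuation)
filtered = [
    w.lower()
    for w in raw_tokens
    if w.lower() not in toy_stopwords and w.isalnum()
]

print(filtered)
```

Note that `isalnum()` keeps numeric tokens such as "1851" and rejects hyphenated tokens such as "well-known"; `isalpha()` would be the choice if only purely alphabetic tokens are wanted.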