# <center> NLP Lab 2 </center>

# Introduction to NLP with NLTK

Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world: unstructured data that by its very nature holds latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that with Python and the Natural Language Toolkit (NLTK).

NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications. 

## Quick Overview of NLTK
NLTK stands for the Natural Language Toolkit and was written by two eminent computational linguists, Steven Bird (Senior Research Associate of the LDC and professor at the University of Melbourne) and Ewan Klein (Professor of Linguistics at Edinburgh University). NLTK combines natural language corpora, lexical resources, and example grammars with language processing algorithms, methodologies, and demonstrations for a very Pythonic "batteries included" view of Natural Language Processing.

As such, NLTK is well suited to research-driven (hypothesis-driven) workflows for agile data science. Its suite of libraries includes:

- tokenization, stemming, and tagging
- chunking and parsing
- language modeling
- classification and clustering
- logical semantics

## Installing NLTK

This notebook has a few dependencies, most of which can be installed via the Python package manager, `pip`.

1. Python 3.6 or later (Anaconda is OK)
2. NLTK
3. The NLTK corpora 


Once you have Python and pip installed you can install NLTK as follows:

    ~$ pip install nltk

In [1]:
import nltk
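
Before running the rest of the notebook, it can help to confirm that the install succeeded without triggering an `ImportError`. The following is a minimal sketch using only the standard library; the helper name `is_installed` is a hypothetical name chosen for illustration and is not part of NLTK.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report on NLTK without raising an exception if it is missing.
print("nltk installed:", is_installed("nltk"))
```

If this prints `False`, re-run the `pip install nltk` command in the same environment that the notebook kernel uses.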

## Working with Example Corpora

NLTK provides a variety of corpora; let's use a few of them to do some work. Project Gutenberg hosts a collection of about 25,000 books, and 18 of those texts are available within NLTK as `nltk.corpus.gutenberg`. Download the corpus as follows:

In [2]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /usr/lib/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [3]:
from nltk.corpus import gutenberg

## Getting Names of all 18 books


In [4]:
file_names = gutenberg.fileids()
print(file_names)

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


## Create a dictionary of term frequencies for ALL FILES

In [5]:
# Count how often each token occurs across all books.
# Tokens are case-sensitive and include punctuation.
term_frequency = {}
for book_name in file_names:
    for word in gutenberg.words(book_name):
        term_frequency[word] = term_frequency.get(word, 0) + 1

In [6]:
term_frequency

{'[': 115,
 'Emma': 866,
 'by': 8012,
 'Jane': 303,
 'Austen': 3,
 '1816': 1,
 ']': 105,
 'VOLUME': 3,
 'I': 30221,
 'CHAPTER': 291,
 'Woodhouse': 313,
 ',': 186091,
 'handsome': 130,
 'clever': 74,
 'and': 78846,
 'rich': 231,
 'with': 16827,
 'a': 32504,
 'comfortable': 108,
 'home': 677,
 'happy': 537,
 'disposition': 73,
 'seemed': 1083,
 'to': 46443,
 'unite': 17,
 'some': 2560,
 'of': 70078,
 'the': 125748,
 'best': 574,
 'blessings': 24,
 'existence': 47,
 ';': 27329,
 'had': 10177,
 'lived': 260,
 'nearly': 137,
 'twenty': 459,
 '-': 8850,
 'one': 5755,
 'years': 1011,
 'in': 31959,
 'world': 1222,
 'very': 3852,
 'little': 2825,
 'distress': 111,
 'or': 5901,
 'vex': 24,
 'her': 11153,
 '.': 73746,
 'She': 1612,
 'was': 18558,
 'youngest': 42,
 'two': 2268,
 'daughters': 319,
 'most': 1457,
 'affectionate': 56,
 'indulgent': 11,
 'father': 1673,
 'consequence': 132,
 'sister': 592,
 "'": 19873,
 's': 9792,
 'marriage': 142,
 'been': 3408,
 'mistress': 137,
 'his': 20585,
 'hou
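
The counting loop can also be expressed with `collections.Counter` from the standard library. The sketch below is illustrative only: `toy_books` is a hypothetical stand-in for the token lists returned by `gutenberg.words(book_name)`, so it runs without downloading the corpus. It also shows one possible normalization (lowercasing and dropping non-alphabetic tokens), which is an assumption about what a later analysis might want, since the raw counts above treat `'She'` and `'she'` as different terms and include punctuation.

```python
from collections import Counter

# Hypothetical stand-in for gutenberg.words(book_name): two tiny token lists.
toy_books = {
    "book-a.txt": ["Emma", "was", "happy", ",", "and", "Emma", "was", "rich", "."],
    "book-b.txt": ["She", "was", "clever", ",", "and", "she", "was", "kind", "."],
}

# Raw term frequencies across all documents (case-sensitive, with punctuation).
term_frequency = Counter()
for tokens in toy_books.values():
    term_frequency.update(tokens)

# Optional normalization: lowercase and keep alphabetic tokens only.
normalized_frequency = Counter(
    token.lower()
    for tokens in toy_books.values()
    for token in tokens
    if token.isalpha()
)

print(term_frequency.most_common(3))
print(normalized_frequency["she"])
```

With the real corpus, the same pattern would apply by passing `gutenberg.words(book_name)` to `update` for each entry in `file_names`.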

## Create an inverted index containing per-document frequencies

In [7]:
# Map each token to a posting list: {book_name: count of the token in that book}.
inverted_index = {}
for book_name in file_names:
    for word in gutenberg.words(book_name):
        posting_list = inverted_index.setdefault(word, {})
        posting_list[book_name] = posting_list.get(book_name, 0) + 1

In [8]:
inverted_index

{'[': {'austen-emma.txt': 2,
  'austen-persuasion.txt': 1,
  'austen-sense.txt': 3,
  'bible-kjv.txt': 1,
  'blake-poems.txt': 1,
  'bryant-stories.txt': 4,
  'burgess-busterbrown.txt': 6,
  'carroll-alice.txt': 3,
  'chesterton-ball.txt': 11,
  'chesterton-brown.txt': 1,
  'chesterton-thursday.txt': 1,
  'edgeworth-parents.txt': 6,
  'melville-moby_dick.txt': 3,
  'milton-paradise.txt': 2,
  'shakespeare-caesar.txt': 3,
  'shakespeare-hamlet.txt': 6,
  'shakespeare-macbeth.txt': 4,
  'whitman-leaves.txt': 57},
 'Emma': {'austen-emma.txt': 865, 'austen-persuasion.txt': 1},
 'by': {'austen-emma.txt': 558,
  'austen-persuasion.txt': 411,
  'austen-sense.txt': 737,
  'bible-kjv.txt': 2540,
  'blake-poems.txt': 21,
  'bryant-stories.txt': 91,
  'burgess-busterbrown.txt': 23,
  'carroll-alice.txt': 55,
  'chesterton-ball.txt': 279,
  'chesterton-brown.txt': 259,
  'chesterton-thursday.txt': 172,
  'edgeworth-parents.txt': 703,
  'melville-moby_dick.txt': 1137,
  'milton-paradise.txt': 380,
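
To make the structure of the inverted index concrete, here is a small sketch that builds the same kind of mapping (term to per-document counts) and then queries it. It assumes toy data: `toy_books` is a hypothetical stand-in for the Gutenberg token lists, and the helpers `posting_list` and `document_frequency` are names introduced here for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for gutenberg.words(book_name).
toy_books = {
    "book-a.txt": ["Emma", "was", "happy", ",", "and", "Emma", "was", "rich", "."],
    "book-b.txt": ["She", "was", "clever", ",", "and", "she", "was", "kind", "."],
}

# term -> Counter({book_name: occurrences of the term in that book})
inverted_index = defaultdict(Counter)
for book_name, tokens in toy_books.items():
    for token in tokens:
        inverted_index[token][book_name] += 1


def posting_list(term):
    """Return {book_name: count} for a term, or an empty dict if unseen."""
    # .get avoids creating an empty entry in the defaultdict on lookup.
    return dict(inverted_index.get(term, {}))


def document_frequency(term):
    """Return the number of documents that contain the term."""
    return len(posting_list(term))


print(posting_list("was"))
print(document_frequency("Emma"))
```

The length of a posting list is the number of documents containing the term, which is why an inverted index is a convenient starting point for document-level statistics.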


## Maximum term frequency for every document

In [9]:
# For each book, track the largest count of any single token.
max_frequency = {book_name: 0 for book_name in file_names}

for posting_list in inverted_index.values():
    for book_name, count in posting_list.items():
        max_frequency[book_name] = max(max_frequency[book_name], count)

In [10]:
max_frequency

{'austen-emma.txt': 11454,
 'austen-persuasion.txt': 6750,
 'austen-sense.txt': 9397,
 'bible-kjv.txt': 70509,
 'blake-poems.txt': 680,
 'bryant-stories.txt': 3481,
 'burgess-busterbrown.txt': 823,
 'carroll-alice.txt': 1993,
 'chesterton-ball.txt': 4547,
 'chesterton-brown.txt': 4321,
 'chesterton-thursday.txt': 3488,
 'edgeworth-parents.txt': 15219,
 'melville-moby_dick.txt': 18713,
 'milton-paradise.txt': 10198,
 'shakespeare-caesar.txt': 2204,
 'shakespeare-hamlet.txt': 2892,
 'shakespeare-macbeth.txt': 1962,
 'whitman-leaves.txt': 17713}
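
The `max_frequency` values do not record which token produced each maximum, and because the token lists include punctuation, the most frequent token in a book may well be a punctuation mark. The sketch below shows one way to keep the term alongside its count. It assumes a tiny hand-written index, `toy_index`, with the same shape as `inverted_index`; the counts in it are made up for illustration.

```python
# Hypothetical index with the same shape as inverted_index:
# term -> {book_name: count}. The counts are invented for this example.
toy_index = {
    ",": {"book-a.txt": 3, "book-b.txt": 1},
    "was": {"book-a.txt": 2, "book-b.txt": 2},
    "Emma": {"book-a.txt": 2},
}

# book_name -> (term, count) for the most frequent term seen so far.
most_frequent = {}
for term, postings in toy_index.items():
    for book_name, count in postings.items():
        # Replace the stored pair only when a strictly larger count appears,
        # so ties keep the term that was encountered first.
        if book_name not in most_frequent or count > most_frequent[book_name][1]:
            most_frequent[book_name] = (term, count)

for book_name, (term, count) in most_frequent.items():
    print(book_name, repr(term), count)
```

Running the same loop over the real `inverted_index` would show which token accounts for each value in `max_frequency`.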