# Natural Language Toolkit for Python (NLTK) for NLP

This notebook covers a handful of the NLP building blocks provided by the Natural Language Toolkit (NLTK) for Python: tokenization, stop word removal, lemmatization, sentiment analysis, and named entity recognition.

### Install NLTK

In [1]:
# !pip install nltk

### Import NLTK and download resources

In [2]:
import nltk

The first time you use NLTK, download the additional resources that are not distributed with the NLTK package itself. Running the nltk.download() command below opens the NLTK Downloader window. In the Collections tab, select "all" and click Download. This may take several minutes depending on your network connection speed, but you only need to run it once.

In [3]:
# nltk.download()
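
If the graphical downloader is not available (for example on a remote server), or the full "all" collection is more than you need, individual resources can be fetched by id instead. The following is a minimal sketch and not part of the original notebook: RESOURCES and download_resources are illustrative names, and the ids listed are the ones the cells below are assumed to need. Resource ids have changed between NLTK releases, so if a later cell raises a LookupError, its message names the id to download.

```python
import nltk

# Assumed resource ids for the cells in this notebook; adjust to your NLTK version.
RESOURCES = [
    "punkt",                       # tokenizer models for word_tokenize
    "stopwords",                   # stop word lists
    "wordnet",                     # lexical database for WordNetLemmatizer
    "averaged_perceptron_tagger",  # model behind nltk.pos_tag
    "maxent_ne_chunker",           # model behind ne_chunk
    "words",                       # word list used by the NE chunker
    "vader_lexicon",               # lexicon for SentimentIntensityAnalyzer
]


def download_resources(names=RESOURCES):
    """Download each resource by id and return the ids that failed."""
    failed = []
    for name in names:
        # nltk.download returns True on success and False otherwise.
        if not nltk.download(name, quiet=True):
            failed.append(name)
    return failed


# Run once per environment (requires network access):
# failed = download_resources()
# print("Failed downloads:", failed)
```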

### Tokenization
Tokenization is the process of breaking text into individual words or tokens. NLTK provides a word_tokenize function that performs this task.

In [4]:
from nltk.tokenize import word_tokenize

text = ("Tesla shareholders have backed a record-breaking pay package for boss Elon Musk and approved a plan to move "
        "the firm's legal headquarters to Texas.")
tokens = word_tokenize(text)
print(tokens)

['Tesla', 'shareholders', 'have', 'backed', 'a', 'record-breaking', 'pay', 'package', 'for', 'boss', 'Elon', 'Musk', 'and', 'approved', 'a', 'plan', 'to', 'move', 'the', 'firm', "'s", 'legal', 'headquarters', 'to', 'Texas', '.']
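
To see what word_tokenize adds over plain whitespace splitting, the sketch below compares str.split() with the token list printed above. It is illustrative only: the sentence is retyped as a single string, and the nltk_tokens list is copied from the output above rather than recomputed.

```python
text = ("Tesla shareholders have backed a record-breaking pay package for boss "
        "Elon Musk and approved a plan to move the firm's legal headquarters to Texas.")

# Naive tokenization: split on whitespace only.
naive_tokens = text.split()

# Token list copied from the word_tokenize output above.
nltk_tokens = ['Tesla', 'shareholders', 'have', 'backed', 'a', 'record-breaking', 'pay',
               'package', 'for', 'boss', 'Elon', 'Musk', 'and', 'approved', 'a', 'plan',
               'to', 'move', 'the', 'firm', "'s", 'legal', 'headquarters', 'to', 'Texas', '.']

# Tokens produced by one approach but not the other.
only_naive = [t for t in naive_tokens if t not in nltk_tokens]
only_nltk = [t for t in nltk_tokens if t not in naive_tokens]

print(only_naive)  # ["firm's", 'Texas.']
print(only_nltk)   # ['firm', "'s", 'Texas', '.']
```

Splitting the possessive 's and the final period into their own tokens is what lets later steps treat firm and Texas as plain words.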


### Stopword Removal

Stop words are common words such as "a", "the", and "to" that carry little meaning on their own. NLTK provides a stopwords corpus that contains lists of stop words for various languages. We can use it to remove stop words from our tokens.

In [5]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)

['Tesla', 'shareholders', 'backed', 'record-breaking', 'pay', 'package', 'boss', 'Elon', 'Musk', 'approved', 'plan', 'move', 'firm', "'s", 'legal', 'headquarters', 'Texas', '.']
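
The filtered list above still contains 's and the final period, because the stop word list holds words, not punctuation. One possible follow-up step, sketched here as an assumption rather than as part of the original pipeline, is to drop tokens that do not start with a letter or digit. The helper name drop_punctuation is hypothetical, and the input list is copied from the output above.

```python
def drop_punctuation(tokens):
    """Keep only tokens whose first character is a letter or digit."""
    return [tok for tok in tokens if tok and tok[0].isalnum()]


# Token list copied from the stop word removal output above.
filtered_tokens = ['Tesla', 'shareholders', 'backed', 'record-breaking', 'pay', 'package',
                   'boss', 'Elon', 'Musk', 'approved', 'plan', 'move', 'firm', "'s",
                   'legal', 'headquarters', 'Texas', '.']

clean_tokens = drop_punctuation(filtered_tokens)
print(clean_tokens)
```

Checking only the first character keeps hyphenated words such as record-breaking, which a stricter str.isalpha() test would discard.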


### Lemmatization

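Lemmatization reduces a word to its dictionary form (its lemma), for example "shareholders" to "shareholder". NLTK provides a WordNetLemmatizer class that performs lemmatization.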

In [6]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

print(lemmatized_tokens)

['Tesla', 'shareholder', 'backed', 'record-breaking', 'pay', 'package', 'bos', 'Elon', 'Musk', 'approved', 'plan', 'move', 'firm', "'s", 'legal', 'headquarters', 'Texas', '.']
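
In the output above, backed and approved are unchanged because lemmatize treats every word as a noun unless a part of speech is passed through its pos argument. A common workaround is to tag the tokens first and translate each Penn Treebank tag into the single-letter codes that WordNet uses. The sketch below illustrates that mapping; penn_to_wordnet and lemmatize_tagged are hypothetical helper names, and the lemmatizer argument is assumed to be a WordNetLemmatizer instance (any object with a compatible lemmatize(word, pos=...) method works).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs, passing each word's part of speech along."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

With the objects from this notebook, the call would look like lemmatize_tagged(nltk.pos_tag(tokens), lemmatizer).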


### Sentiment Analysis

NLTK provides a SentimentIntensityAnalyzer class, an implementation of the VADER sentiment model, that scores text for negative, neutral, and positive sentiment and also reports a combined compound score.


In [7]:
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

sentiment_scores = analyzer.polarity_scores(text)
print(sentiment_scores)

{'neg': 0.056, 'neu': 0.726, 'pos': 0.218, 'compound': 0.4588}
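
The compound value is a single normalized score between -1 and 1 that summarizes the other three. A common way to turn it into a label uses cutoffs of ±0.05, a convention suggested by the VADER authors rather than something defined in this notebook. The helper name label_sentiment and its threshold parameter are illustrative assumptions.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a polarity_scores() dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Scores copied from the output above.
sentiment_scores = {'neg': 0.056, 'neu': 0.726, 'pos': 0.218, 'compound': 0.4588}
print(label_sentiment(sentiment_scores))  # positive
```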


### Named Entity Recognition

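NLTK provides an ne_chunk function that performs named entity recognition (NER) on part-of-speech-tagged tokens. The tagger and chunker were trained on complete sentences, so running them on stop-word-filtered, lemmatized tokens can degrade the results; in the output below, Tesla is labeled GPE and Texas is labeled PERSON.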

In [8]:
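from nltk import ne_chunk

# Rebuild the preprocessing pipeline from the earlier cells
tokens = word_tokenize(text)

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

# ne_chunk expects (token, POS tag) pairs, so tag the tokens first
pos_tags = nltk.pos_tag(lemmatized_tokens)

# Group the tagged tokens into named-entity subtrees (PERSON, GPE, ...)
ner_tags = ne_chunk(pos_tags)
print(ner_tags)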

(S
  (GPE Tesla/NNP)
  shareholder/NN
  backed/VBD
  record-breaking/JJ
  pay/NN
  package/NN
  bos/NN
  (PERSON Elon/NNP Musk/NNP)
  approved/VBD
  plan/NN
  move/VBP
  firm/NN
  's/POS
  legal/JJ
  headquarters/NN
  (PERSON Texas/NNP)
  ./.)
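
ne_chunk returns an nltk.Tree in which each recognized entity is a labeled subtree and every other token is a plain (word, tag) pair. The sketch below shows one way to pull the entities out as (label, text) pairs. To stay runnable without the tagger and chunker models, it rebuilds a shortened version of the tree above by hand; extract_entities is a hypothetical helper name.

```python
from nltk.tree import Tree


def extract_entities(tree):
    """Return (label, text) pairs for the entity subtrees directly under the root."""
    entities = []
    for node in tree:
        # Entities are Tree nodes; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built, shortened copy of the tree printed above.
ner_tags = Tree("S", [
    Tree("GPE", [("Tesla", "NNP")]),
    ("shareholder", "NN"),
    ("backed", "VBD"),
    Tree("PERSON", [("Elon", "NNP"), ("Musk", "NNP")]),
    ("approved", "VBD"),
    Tree("PERSON", [("Texas", "NNP")]),
])

print(extract_entities(ner_tags))
# [('GPE', 'Tesla'), ('PERSON', 'Elon Musk'), ('PERSON', 'Texas')]
```

In the notebook, the same function can be called directly on the ner_tags value returned by ne_chunk.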


For more details on NLTK, see the NLTK book: https://www.nltk.org/book/