# NLTK Basics — A Beginner-Friendly Notebook

**Goals:**
- Learn how to set up NLTK and required data
- Practice tokenization, stopword filtering, stemming, lemmatizing, POS tagging
- Try chunking, chinking and basic Named Entity Recognition (NER)
- See simple visualizations (frequency plots)

**How to use this notebook:** run each cell in order. Cells that install packages are provided for convenience — if your environment already has the packages, you can skip those cells.

---


## 1) Install & Setup

Run this cell if you need to install packages. The `pip` commands are commented out by default; uncomment them only if your environment allows installation.

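**Note:** this notebook was written for Python 3.9+ and NLTK 3.5. Newer NLTK versions should also work, but recent releases (around 3.9 and later) look up some resources under new names such as `punkt_tab` and `averaged_perceptron_tagger_eng`, so you may need to add those to the download list in section 2.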


In [13]:
# Install (uncomment to run if you need it)
# !python -m pip install --upgrade pip
# !python -m pip install nltk==3.5 numpy matplotlib

print('Install step: run only if needed.')

Install step: run only if needed.


## 2) Download NLTK data

NLTK requires some datasets/models (tokenizers, taggers, corpora). Run the cell below once to download them to your environment.


In [14]:
import nltk

# These downloads are generally required for the examples in this notebook
nltk_data = ['punkt', 'stopwords', 'averaged_perceptron_tagger', 'wordnet', 'omw-1.4',
             'maxent_ne_chunker', 'words']
for pkg in nltk_data:
    print('Downloading', pkg)
    nltk.download(pkg)

print('\nAll required NLTK data attempted for download.')

Downloading punkt
Downloading stopwords
Downloading averaged_perceptron_tagger
Downloading wordnet
Downloading omw-1.4
Downloading maxent_ne_chunker
Downloading words

All required NLTK data attempted for download.


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ROUCHI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ROUCHI\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ROUCHI\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ROUCHI\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ROUCHI\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\ROUCHI\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is a
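
To avoid re-running every download, one option is to check which resources are already present. The sketch below is a minimal illustration, not part of NLTK: the `RESOURCE_PATHS` mapping and the `missing_packages` helper are assumptions made for this example, and the paths reflect the usual layout of an `nltk_data` directory. `nltk.data.find` raises `LookupError` when a resource cannot be located.

```python
import nltk

# Assumed locations of each package inside an nltk_data directory.
# The mapping is illustrative; adjust it if your NLTK version uses other names.
RESOURCE_PATHS = {
    'punkt': 'tokenizers/punkt',
    'stopwords': 'corpora/stopwords',
    'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
    'wordnet': 'corpora/wordnet',
    'omw-1.4': 'corpora/omw-1.4',
    'maxent_ne_chunker': 'chunkers/maxent_ne_chunker',
    'words': 'corpora/words',
}

def missing_packages(resource_paths):
    """Return the package names whose resource path cannot be found locally."""
    missing = []
    for pkg, path in resource_paths.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(pkg)
    return missing

to_fetch = missing_packages(RESOURCE_PATHS)
print('Missing packages:', to_fetch)

# Uncomment to download only what is missing:
# for pkg in to_fetch:
#     nltk.download(pkg)
```

If a package is reported as missing by mistake (for example because it is stored in an unexpected form), the only cost is a redundant download.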

## 3) Common imports

This cell imports the functions we will use repeatedly. Run it once near the top of the notebook.


In [15]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.probability import FreqDist
import matplotlib.pyplot as plt
from collections import Counter
print('Imports ready')

Imports ready


## 4) Tokenization — sentences & words

Tokenization splits text into sentences or words. This is almost always the first preprocessing step.


In [16]:
example_string = """Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult."""

print('--- Sentence tokenization ---')
sentences = sent_tokenize(example_string)
for i, s in enumerate(sentences, 1):
    print(i, s)

print('\n--- Word tokenization ---')
words = word_tokenize(example_string)
print(words)

# Simple frequency on alphabetic tokens
alpha_words = [w.lower() for w in words if w.isalpha()]
print('\nTop words:', Counter(alpha_words).most_common(8))

--- Sentence tokenization ---
1 Muad'Dib learned rapidly because his first training was in how to learn.
2 And the first lesson of all was the basic trust that he could learn.
3 It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult.

--- Word tokenization ---
["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training', 'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson', 'of', 'all', 'was', 'the', 'basic', 'trust', 'that', 'he', 'could', 'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many', 'people', 'do', 'not', 'believe', 'they', 'can', 'learn', ',', 'and', 'how', 'many', 'more', 'believe', 'learning', 'to', 'be', 'difficult', '.']

Top words: [('how', 3), ('to', 3), ('learn', 3), ('first', 2), ('was', 2), ('and', 2), ('the', 2), ('many', 2)]
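
The count above uses `Counter`; NLTK's `FreqDist` (imported in section 3) offers the same counting plus a few text-oriented helpers. The following is a small sketch under two assumptions: it uses only the first two sentences of the example, and it tokenizes with `wordpunct_tokenize`, a regex-based tokenizer that needs no downloaded model and therefore splits differently from `word_tokenize`.

```python
from nltk.probability import FreqDist
from nltk.tokenize import wordpunct_tokenize

sample = ("Muad'Dib learned rapidly because his first training was in how to learn. "
          "And the first lesson of all was the basic trust that he could learn.")

# wordpunct_tokenize splits on a regular expression and needs no downloaded model;
# unlike word_tokenize it breaks "Muad'Dib" into three tokens.
tokens = wordpunct_tokenize(sample)

# Keep alphabetic tokens only and normalise case before counting
fdist = FreqDist(t.lower() for t in tokens if t.isalpha())

print('Distinct words:', fdist.B())
print('Total words   :', fdist.N())
print("Count of 'learn':", fdist['learn'])
print('Top 3:', fdist.most_common(3))

# In an environment with a display or inline plotting, a frequency plot
# can be drawn with:
# fdist.plot(10, title='Most frequent words')
```

Because `FreqDist` is a subclass of `Counter`, methods such as `most_common` behave the same way in both.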


## 5) Stop words — filtering

Stop words are common words that you may want to remove for many tasks (but *not* always — negation words like 'not' may be important!).


In [17]:
worf_quote = "Sir, I protest. I am not a merry man!"
words_in_quote = word_tokenize(worf_quote)
stop_words = set(stopwords.words('english'))

print('Original tokens:', words_in_quote)
filtered = [w for w in words_in_quote if w.casefold() not in stop_words]
print('\nFiltered (default NLTK stopwords removed):', filtered)

# Example: keep 'not' and 'i' because they may be important
custom_stop = stop_words - {'not', 'i'}
filtered_keep_not = [w for w in words_in_quote if w.casefold() not in custom_stop]
print('\nFiltered with custom stoplist (kept "not", "i"):', filtered_keep_not)

Original tokens: ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

Filtered (default NLTK stopwords removed): ['Sir', ',', 'protest', '.', 'merry', 'man', '!']

Filtered with custom stoplist (kept "not", "i"): ['Sir', ',', 'I', 'protest', '.', 'I', 'not', 'merry', 'man', '!']
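
The filtered lists above still contain punctuation tokens. A possible way to drop punctuation and stop words in one pass while protecting words such as "not" is sketched below. The `content_tokens` helper and the tiny `demo_stop_words` set are hypothetical names made up for this example; in the notebook you would pass `set(stopwords.words('english'))` instead.

```python
def content_tokens(tokens, stop_words, keep=()):
    """Drop punctuation and stop words, except stop words listed in `keep`."""
    keep = {k.casefold() for k in keep}
    result = []
    for tok in tokens:
        folded = tok.casefold()
        if not tok.isalpha():
            continue          # punctuation, numbers, mixed tokens such as "'s"
        if folded in stop_words and folded not in keep:
            continue          # ordinary stop word
        result.append(tok)
    return result

# Tokens copied from the cell output above; the stop list here is a tiny
# stand-in so the sketch runs without the stopwords corpus.
tokens = ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']
demo_stop_words = {'i', 'am', 'not', 'a'}

print(content_tokens(tokens, demo_stop_words))
print(content_tokens(tokens, demo_stop_words, keep={'not'}))
```

Note that `str.isalpha` also discards tokens like `'s` or `n't`, which may or may not be what a given task needs.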


## 6) Stemming — crude root forms

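Stemmers strip suffixes by rule to reduce words to a crude root form, so the results are often not real words (for example `discoveri`). The cell below runs the Porter and Snowball stemmers side by side; on this sentence they happen to produce the same stems.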


In [18]:
porter = PorterStemmer()
snow = SnowballStemmer('english')
string_for_stemming = "The crew of the USS Discovery discovered many discoveries. Discovering is what explorers do."
tokens = word_tokenize(string_for_stemming)
print('Tokens:', tokens)
print('\nPorter stems:')
print([porter.stem(t) for t in tokens])
print('\nSnowball stems:')
print([snow.stem(t) for t in tokens])

Tokens: ['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']

Porter stems:
['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']

Snowball stems:
['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']
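
Since both stemmers agree on the sentence above, it can help to print them side by side on other words and flag any differences. This is a rough sketch: `compare_stemmers` is a helper invented for this example, and the candidate words are only guesses at where the two algorithms may diverge, so run it to see what your NLTK version produces.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) for each input word."""
    porter = PorterStemmer()
    snowball = SnowballStemmer('english')
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]

# Candidate words are illustrative; adverbs ending in -ly are a commonly cited
# case where the two algorithms diverge.
candidates = ['running', 'fairly', 'generously', 'discoveries']

for word, p_stem, s_stem in compare_stemmers(candidates):
    marker = 'same' if p_stem == s_stem else 'DIFFERENT'
    print(f'{word:12} porter={p_stem:12} snowball={s_stem:12} {marker}')
```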


## 7) Part-of-Speech (POS) tagging

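POS tagging labels each token with a part-of-speech tag (noun, verb, adjective, ...). The tags feed downstream tasks such as lemmatization and chunking. The tagger is statistical, so occasional errors are normal: in the output below, `first` is tagged `VB` (verb) although it is used as an adverb.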


In [19]:
sagan_quote = "If you wish to make an apple pie from scratch, you must first invent the universe."
tokens = word_tokenize(sagan_quote)
pos_tags = nltk.pos_tag(tokens)
print('Tokens and POS tags:')
print(pos_tags)

# Useful: inspect a small subset of the tagset
print('\nSample tag meanings:')
print('NN → noun, VB → verb, JJ → adjective, RB → adverb, PRP → pronoun')

Tokens and POS tags:
[('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'), ('an', 'DT'), ('apple', 'NN'), ('pie', 'NN'), ('from', 'IN'), ('scratch', 'NN'), (',', ','), ('you', 'PRP'), ('must', 'MD'), ('first', 'VB'), ('invent', 'VB'), ('the', 'DT'), ('universe', 'NN'), ('.', '.')]

Sample tag meanings:
NN → noun, VB → verb, JJ → adjective, RB → adverb, PRP → pronoun


## 8) Lemmatization — dictionary root forms (better than stemming when done with POS)

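Lemmatizers map words to dictionary forms, so the results are real words. `WordNetLemmatizer` treats every word as a noun unless told otherwise, so supplying the POS tag improves results (for example, `are` becomes `be` only when it is tagged as a verb).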


In [20]:
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Nouns and every other tag fall back to NOUN, the lemmatizer's default
        return wordnet.NOUN

text = "The striped bats are hanging on their feet for best"
tokens = word_tokenize(text)
tags = nltk.pos_tag(tokens)
lemmas = [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in tags]
print('Word | POS | Lemma')
for (w, t), l in zip(tags, lemmas):
    print(w, '|', t, '|', l)

Word | POS | Lemma
The | DT | The
striped | JJ | striped
bats | NNS | bat
are | VBP | be
hanging | VBG | hang
on | IN | on
their | PRP$ | their
feet | NNS | foot
for | IN | for
best | JJS | best


## 9) Chunking and Chinking

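Chunking groups tagged tokens into phrases (e.g., noun phrases) using patterns over POS tags. Chinking does the reverse: it removes tokens matching a pattern from an existing chunk, splitting the chunk where needed.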

Note: GUI tree drawing (`tree.draw()`) may not work in headless environments, so this section prints the trees as text instead.


In [21]:
lotr_quote = "It's a dangerous business, Frodo, going out your door."
tokens = word_tokenize(lotr_quote)
pos = nltk.pos_tag(tokens)
print('Tokens with POS:')
print(pos)

# Define a simple noun phrase grammar: optional determiner, any number of adjectives, then a noun
grammar = 'NP: {<DT>?<JJ>*<NN>}'
cp = nltk.RegexpParser(grammar)
tree = cp.parse(pos)
print('\nChunk parse tree (text):')
print(tree)

print('\nExtracted noun phrases:')
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    # Use `tag` rather than `pos` to avoid confusion with the tagged list above
    print(' '.join(word for word, tag in subtree.leaves()))

# Chinking example: chunk every token first, then chink (remove) adjectives (JJ)
grammar2 = r"""Chunk: {<.*>+}
       }<JJ>{
"""
cp2 = nltk.RegexpParser(grammar2)
tree2 = cp2.parse(pos)
print('\nChinked tree (text):')
print(tree2)

Tokens with POS:
[('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'), ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.')]

Chunk parse tree (text):
(S
  It/PRP
  's/VBZ
  (NP a/DT dangerous/JJ business/NN)
  ,/,
  Frodo/NNP
  ,/,
  going/VBG
  out/RP
  your/PRP$
  (NP door/NN)
  ./.)

Extracted noun phrases:
a dangerous business
door

Chinked tree (text):
(S
  (Chunk It/PRP 's/VBZ a/DT)
  dangerous/JJ
  (Chunk
    business/NN
    ,/,
    Frodo/NNP
    ,/,
    going/VBG
    out/RP
    your/PRP$
    door/NN
    ./.))
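
The noun phrase grammar above matches only the tag `NN`, so the proper noun `Frodo/NNP` is left outside any chunk. One possible variation, sketched here, widens the noun slot with `<NN.*>+`. The tagged tokens are copied from the output above so the example runs without a tagger model; the names `broad_grammar`, `broad_parser`, `broad_tree` and `noun_phrases` are introduced only for this illustration.

```python
import nltk

# Tagged tokens copied from the output above, so no tagger model is needed here
tagged = [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'),
          ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','),
          ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'),
          ('.', '.')]

# <NN.*>+ matches one or more nouns of any kind: NN, NNS, NNP, NNPS
broad_grammar = 'NP: {<DT>?<JJ>*<NN.*>+}'
broad_parser = nltk.RegexpParser(broad_grammar)
broad_tree = broad_parser.parse(tagged)

# Collect the words of every NP subtree
noun_phrases = [' '.join(word for word, tag in st.leaves())
                for st in broad_tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(noun_phrases)
```

With this grammar the parse should pick up `Frodo` as a noun phrase in addition to the two phrases found earlier.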


---
## Exercises (try these)
1. Create a function that accepts a text and returns the top 5 lemmatized content words (no stopwords).
2. Compare stemming vs lemmatization on a small news paragraph — what differences do you see?
3. Build a small named-entity extractor that also returns the entity label (PERSON, GPE, ORGANIZATION). Try it on a news blurb.
4. Try chunking to extract verb phrases instead of noun phrases.

## Further reading
- NLTK Book: *Natural Language Processing with Python* (Bird, Klein, Loper)
- NLTK documentation: https://www.nltk.org

Good luck — experiment and have fun!
