# NLP with NLTK and spaCy: Tokenization, Stemming, and Lemmatization

This notebook:
1) Loads a movie review and preprocesses it with **NLTK** (tokenization, stemming, lemmatization) and **spaCy** (tokenization, lemmatization, POS tagging; spaCy does not ship a stemmer),
2) Compares the outputs of both libraries side by side.

## Setup
Run the cell below to install/download prerequisites.

In [1]:
import platform
import subprocess
import sys

def run(cmd):
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=False)

print("Python:", sys.version)
print("Executable:", sys.executable)
print("Platform:", platform.platform())

run([sys.executable, "-m", "pip", "install", "-U", "pip", "setuptools", "wheel"])

candidates = [
    "spacy==3.7.5",
    "spacy==3.6.1",
    "spacy==3.5.4",
]
for pkg in candidates:
    print("\nTrying:", pkg)
    run([sys.executable, "-m", "pip", "install", "--prefer-binary", "--no-cache-dir", pkg])
    try:
        import spacy
        print("Imported spaCy:", spacy.__version__)
        break
    except Exception as e:
        print("Import failed, will try next candidate...", e)

try:
    import spacy
    run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    import importlib
    importlib.invalidate_caches()
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model loaded OK.")
except Exception as e:
    print("Failed to load model:", e)


Python: 3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
Executable: E:\Anaconda_new\python.exe
Platform: Windows-10-10.0.26100-SP0
>> E:\Anaconda_new\python.exe -m pip install -U pip setuptools wheel

Trying: spacy==3.7.5
>> E:\Anaconda_new\python.exe -m pip install --prefer-binary --no-cache-dir spacy==3.7.5
Imported spaCy: 3.7.5
>> E:\Anaconda_new\python.exe -m spacy download en_core_web_sm
spaCy model loaded OK.
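
The setup cell above runs `pip install` on every execution. A minimal sketch of checking what is already importable first, using only the standard library; the helper name `missing_packages` is hypothetical and not part of the original notebook:

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Module names used by this notebook.
needed = ["spacy", "nltk", "pandas"]
to_install = missing_packages(needed)
print("Missing:", to_install or "nothing")
```

Only the names in `to_install` would then need to be passed to `pip install`. Note that this checks whether a module can be found, not which version is installed.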


In [3]:
import nltk
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [2]:
import spacy
spacy.__version__

'3.7.5'

In [4]:
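# NLTK data downloads
def safe_nltk_download(resource):
    """Download an NLTK resource only if it is not already available locally."""
    try:
        nltk.data.find(resource)
    except LookupError:
        # nltk.download expects the bare package id, e.g. 'punkt'
        nltk.download(resource.split('/')[-1], quiet=True)

for res in [
    'tokenizers/punkt',        # tokenizer
    'tokenizers/punkt_tab',    # tokenizer data under its newer name (assumed to be needed on recent NLTK builds)
    'taggers/averaged_perceptron_tagger',  # POS tagger
    'taggers/averaged_perceptron_tagger_eng', # new name in some builds
    'corpora/wordnet',
    'corpora/omw-1.4'
]:
    try:
        safe_nltk_download(res)
    except Exception:
        # Best effort: a resource name unknown to the installed NLTK build is skipped
        pass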

import spacy
try:
    nlp = spacy.load("en_core_web_sm")
except Exception:
    print("Downloading spaCy model 'en_core_web_sm'...")
    subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"], check=False)
    nlp = spacy.load("en_core_web_sm")

import pandas as pd
print("Setup complete.")

Setup complete.


## Input Text (Movie Review)

In [5]:
review_text = (
    """Three years after the massive, billion-dollar success of \"Jurassic World: Dominion,\" which had brought with it the return of the original film's trio of stars, it's hardly a surprise to see yet another entry come about so quickly. However, while this latest sequel doesn't feature Sam Neill, Laura Dern, or Jeff Goldblum, director Gareth Edwards & his team have still gone the route of adding massive star power in hopes of reinvigorating the series and keeping it fresh enough so as not to feel like just another perfunctory dinosaur outing. As we head into the seventh entry in the franchise overall, will the likes of two-time Oscar nominee Scarlett Johansson and two-time Oscar winner Mahershala Ali be enough to accomplish this extremely difficult goal?


Starting in 2008, we witness disaster at an InGen testing site, where genetic experiments are being conducted on dinosaurs. After one of them escapes containment, it goes on a rampage and forces the rest of the personnel to abandon the site. In the present day, we learn that the remaining dinosaurs all live in areas around the equator due to it being the only climate that they're able to survive in, areas that are off-limits to everyone. Despite this, Martin Krebs (Rupert Friend), an executive for a pharmaceutical company, recruits mercenary Zora Bennett (Scarlett Johansson) and paleontologist Dr. Henry Loomis (Jonathan Bailey) for a mission to collect DNA samples from three dinosaurs in order to develop a treatment for heart disease, a very lucrative prospect. 


Before they depart, Zora recruits her old friend Duncaid Kincaid (Mahershala Ali) to lead the mission, one that immediately hits a snag when he decides to rescue Reuben Delgado (Manuel Garcia-Rulfo), his daughters Teresa (Luna Blaise) & Isabella (Audrina Miranda), and Teresa's boyfriend Xavier (David Iacono), whose boat was overturned by a dinosaur. Further trouble causes the team & those they rescued to be separated and stranded on the island where the former hopes to complete their mission, but, as one can certainly expect at this point, it turns out to be far more difficult & dangerous than originally thought.


As one can imagine, 32 years and now seven films into this franchise, it's rather difficult for even the most talented of writers to come up with excuses for anyone to interact with the remaining dinosaurs within this universe, especially since such interactions always lead to multiple deaths. This has led screenwriter David Koepp, who adapted the first two \"Jurassic Park\" films, to fall back on one of the oldest excuses of all: simple & unbridled greed. In this case, it's a treatment for heart disease that would be worth millions, if not billions of dollars. Throw in a promised hefty payday, and it ends up being more than enough to lure a team of mercenaries into an insanely dangerous mission to procure the genetic material.


As far as overall plots go, this one is a lot more straightforward than what we got last time. There's no potentially world-ending event or conspiracies, just a pharmaceutical company looking to make a ton of money. On that score, it may be a little more interesting, primarily because the main mission doesn't get as bogged down with a lot of superfluous material, but it still ends up being rather ho-hum because it is rather basic as far as stories go. What hurts it even more though is the completely unnecessary addition of the rescued characters from the boat, who can't help but feel superfluous because they're only there to put more people in danger while the team goes about their mission. It only becomes more apparent when you realize that they could've been eliminated entirely and it wouldn't have changed much of anything about the film.


Again, you might be able to say that the narrative is a slight improvement, but it still doesn't do much to reinvigorate the franchise. We still have people stupidly risking their lives by going anywhere near these dinosaurs, and several people still getting killed as a result, causing it to feel like more of the same thing that we've already seen several times before. Scarlett Johansson & Mahershala Ali are certainly nice additions to the cast and do the best they can with the material they're given, but there's only so much they can do with the film's somewhat simple set-up. In the end, if you've been enjoying these films up until now, then there's a fair chance that you'll enjoy this one, but if you've come to find them a little tiresome, then this latest entry will likely do little to change your mind."""
)
len(review_text.split()), review_text[:300] + '...'


(767,
 'Three years after the massive, billion-dollar success of "Jurassic World: Dominion," which had brought with it the return of the original film\'s trio of stars, it\'s hardly a surprise to see yet another entry come about so quickly. However, while this latest sequel doesn\'t feature Sam Neill, Laura De...')
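
The count of 767 above comes from `str.split()`, which splits on whitespace only, so punctuation stays attached to the neighbouring word. The sketch below contrasts that with a naive regex tokenizer on a fragment of the review. The regex is an assumption chosen for illustration and is not the rule set used by NLTK or spaCy.

```python
import re

snippet = 'the massive, billion-dollar success of "Jurassic World: Dominion,"'

# Whitespace splitting keeps 'massive,' and '"Jurassic' as single items.
whitespace_tokens = snippet.split()

# Naive pattern (illustrative assumption): hyphenated words stay together,
# every other punctuation character becomes its own token.
regex_tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", snippet)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

The same effect explains why the tokenizers in the following sections return more tokens than the whitespace word count.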

## Tokenization, Stemming, and Lemmatization with NLTK

In [6]:
from typing import List, Tuple

# Tokenization
nltk_tokens: List[str] = word_tokenize(review_text)

# Stemming
porter = PorterStemmer()
snowball = SnowballStemmer("english")
nltk_porter_stems = [porter.stem(tok) for tok in nltk_tokens]
nltk_snowball_stems = [snowball.stem(tok) for tok in nltk_tokens]

# Lemmatization 
lemmatizer = WordNetLemmatizer()

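def nltk_pos_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (as returned by pos_tag) to a WordNet POS constant."""
    t = tag[:1].upper()  # slicing avoids an IndexError on an empty tag
    if t == 'J':
        return wordnet.ADJ
    elif t == 'V':
        return wordnet.VERB
    elif t == 'R':
        return wordnet.ADV
    # Nouns and every other tag fall back to NOUN, the lemmatizer's default POS
    return wordnet.NOUN

try:
    pos_tags = pos_tag(nltk_tokens)
except LookupError:
    # Some NLTK builds look for the tagger under this resource name
    nltk.download('averaged_perceptron_tagger_eng', quiet=True)
    pos_tags = pos_tag(nltk_tokens)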

nltk_lemmas = [lemmatizer.lemmatize(tok, nltk_pos_to_wordnet(tag)) for tok, tag in pos_tags]

print("NLTK tokens (first 40):", nltk_tokens[:40])
print("\nNLTK Porter stems (first 40):", nltk_porter_stems[:40])
print("\nNLTK Snowball stems (first 40):", nltk_snowball_stems[:40])
print("\nNLTK lemmas (first 40):", nltk_lemmas[:40])

NLTK tokens (first 40): ['Three', 'years', 'after', 'the', 'massive', ',', 'billion-dollar', 'success', 'of', '``', 'Jurassic', 'World', ':', 'Dominion', ',', "''", 'which', 'had', 'brought', 'with', 'it', 'the', 'return', 'of', 'the', 'original', 'film', "'s", 'trio', 'of', 'stars', ',', 'it', "'s", 'hardly', 'a', 'surprise', 'to', 'see', 'yet']

NLTK Porter stems (first 40): ['three', 'year', 'after', 'the', 'massiv', ',', 'billion-dollar', 'success', 'of', '``', 'jurass', 'world', ':', 'dominion', ',', "''", 'which', 'had', 'brought', 'with', 'it', 'the', 'return', 'of', 'the', 'origin', 'film', "'s", 'trio', 'of', 'star', ',', 'it', "'s", 'hardli', 'a', 'surpris', 'to', 'see', 'yet']

NLTK Snowball stems (first 40): ['three', 'year', 'after', 'the', 'massiv', ',', 'billion-dollar', 'success', 'of', '``', 'jurass', 'world', ':', 'dominion', ',', "''", 'which', 'had', 'brought', 'with', 'it', 'the', 'return', 'of', 'the', 'origin', 'film', "'s", 'trio', 'of', 'star', ',', 'it', "'s",

## Tokenization, Lemmatization, and POS Tagging with spaCy

In [7]:
doc = nlp(review_text)
spacy_tokens = [t.text for t in doc]

spacy_lemmas = [t.lemma_ for t in doc]
spacy_pos = [t.pos_ for t in doc]

print("spaCy tokens (first 40):", spacy_tokens[:40])
print("\nspaCy lemmas (first 40):", spacy_lemmas[:40])
print("\nspaCy coarse POS (first 40):", spacy_pos[:40])

spaCy tokens (first 40): ['Three', 'years', 'after', 'the', 'massive', ',', 'billion', '-', 'dollar', 'success', 'of', '"', 'Jurassic', 'World', ':', 'Dominion', ',', '"', 'which', 'had', 'brought', 'with', 'it', 'the', 'return', 'of', 'the', 'original', 'film', "'s", 'trio', 'of', 'stars', ',', 'it', "'s", 'hardly', 'a', 'surprise', 'to']

spaCy lemmas (first 40): ['three', 'year', 'after', 'the', 'massive', ',', 'billion', '-', 'dollar', 'success', 'of', '"', 'Jurassic', 'World', ':', 'dominion', ',', '"', 'which', 'have', 'bring', 'with', 'it', 'the', 'return', 'of', 'the', 'original', 'film', "'s", 'trio', 'of', 'star', ',', 'it', 'be', 'hardly', 'a', 'surprise', 'to']

spaCy coarse POS (first 40): ['NUM', 'NOUN', 'ADP', 'DET', 'ADJ', 'PUNCT', 'NUM', 'PUNCT', 'NOUN', 'NOUN', 'ADP', 'PUNCT', 'PROPN', 'PROPN', 'PUNCT', 'NOUN', 'PUNCT', 'PUNCT', 'PRON', 'AUX', 'VERB', 'ADP', 'PRON', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'PART', 'NOUN', 'ADP', 'NOUN', 'PUNCT', 'PRON', 'AUX', 'ADV
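
To make the difference between stemming and lemmatization concrete, the sketch below lists tokens for which the Porter stem and the spaCy lemma disagree. The values are copied by hand from the printed outputs in this notebook and are not recomputed, so the block runs without NLTK or spaCy installed.

```python
# (token, Porter stem, spaCy lemma), copied from the outputs printed above.
observed = [
    ("years", "year", "year"),
    ("massive", "massiv", "massive"),
    ("had", "had", "have"),
    ("brought", "brought", "bring"),
    ("original", "origin", "original"),
    ("stars", "star", "star"),
    ("hardly", "hardli", "hardly"),
    ("surprise", "surpris", "surprise"),
]

# Keep only the rows where the two normalizations differ.
disagreements = [row for row in observed if row[1] != row[2]]

for token, stem, lemma in disagreements:
    print(f"{token:10s} stem={stem:10s} lemma={lemma}")
```

The stemmer trims suffixes and can produce non-words (`massiv`, `hardli`), while the lemmatizer returns dictionary forms and handles irregular verbs (`had` to `have`, `brought` to `bring`).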

## Side-by-side Comparison
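Below we pair the first 100 **non-space** tokens from each library by position and compare:
- the original token text,
- NLTK stems (Porter and Snowball),
- NLTK lemma (with POS-aware lemmatization), and
- spaCy lemma and coarse POS tag.

The pairing is purely positional. The two tokenizers split text differently (NLTK keeps `billion-dollar` as one token, while spaCy emits `billion`, `-`, `dollar`), so after the first such difference the NLTK and spaCy columns of a row may describe different words.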

In [8]:
import pandas as pd

spacy_not_space = [t for t in doc if not t.is_space]
limit = 100
rows = []
# Rows are paired by position only; zip stops at the shorter of the two slices.
for i, (ntok, stok) in enumerate(zip(nltk_tokens[:limit], spacy_not_space[:limit])):
    rows.append({
        "idx": i,
        "nltk_token": ntok,
        "nltk_porter": nltk_porter_stems[i],      # reuse the stems computed earlier
        "nltk_snowball": nltk_snowball_stems[i],
        "nltk_lemma": nltk_lemmas[i],
        "spacy_token": stok.text,
        "spacy_lemma": stok.lemma_,
        "spacy_pos": stok.pos_
    })

cmp_df = pd.DataFrame(rows)
cmp_df.head(20)

idx  nltk_token      nltk_porter     nltk_snowball   nltk_lemma      spacy_token  spacy_lemma  spacy_pos
0    Three           three           three           Three           Three        three        NUM
1    years           year            year            year            years        year         NOUN
2    after           after           after           after           after        after        ADP
3    the             the             the             the             the          the          DET
4    massive         massiv          massiv          massive         massive      massive      ADJ
5    ,               ,               ,               ,               ,            ,            PUNCT
6    billion-dollar  billion-dollar  billion-dollar  billion-dollar  billion      billion      NUM
7    success         success         success         success         -            -            PUNCT
8    of              of              of              of              dollar       dollar       NOUN
9    ``              ``              ``              ``              success      success      NOUN
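
In the table above the rows drift out of step from index 6 onward: NLTK's `billion-dollar` sits next to spaCy's `billion`, and `success` sits next to `-`. One possible way to realign two tokenizations is the standard library's `difflib.SequenceMatcher`, sketched below on short hand-written token lists taken from the outputs above. The helper name `align_tokens` is hypothetical, and this is only one approach under the assumption that most tokens are identical in both lists.

```python
from difflib import SequenceMatcher

def align_tokens(a, b):
    """Pair tokens from two tokenizations of the same text.

    Identical tokens are paired one to one; stretches where the
    tokenizers disagree are grouped into a single row.
    """
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            rows.extend(([x], [y]) for x, y in zip(a[i1:i2], b[j1:j2]))
        else:
            # 'replace', 'delete' or 'insert': keep the differing spans together
            rows.append((a[i1:i2], b[j1:j2]))
    return rows

nltk_like = ["the", "massive", ",", "billion-dollar", "success", "of", "``", "Jurassic"]
spacy_like = ["the", "massive", ",", "billion", "-", "dollar", "success", "of", '"', "Jurassic"]

aligned = align_tokens(nltk_like, spacy_like)
for left, right in aligned:
    print(left, "<->", right)
```

With this grouping, `billion-dollar` is paired with the three spaCy tokens it corresponds to, and `success` lines up with `success` again in the following row.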
