# CLTK Tokenization Comparison with NLTK
This notebook illustrates one of several reasons why CLTK can be a better fit than NLTK for classical-language text.
Many NLP users who do not work with classical languages may wonder why a separate library such as CLTK is needed when modern NLP methods aim to be generalizable and libraries honed on the problems of modern languages, such as NLTK, are readily available.

Here we compare the NLTK and CLTK tokenizers and demonstrate how CLTK functionality can reduce the problem space and help build better, more robust models by coalescing orthographic and morphological variants where appropriate.
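To make "coalescing variations" concrete, here is a minimal sketch using only the standard library. The `normalize_jv` helper and the sample tokens are hypothetical and deliberately simplified; this is not CLTK's implementation, only an illustration of how mapping `j` to `i` and `v` to `u` merges spelling variants of the same word into one form.

```python
from collections import Counter

# Hypothetical, simplified stand-in for orthographic normalization;
# this is NOT CLTK's implementation.
JV_TABLE = str.maketrans({"j": "i", "J": "I", "v": "u", "V": "U"})


def normalize_jv(token: str) -> str:
    """Map j -> i and v -> u so spelling variants share one form."""
    return token.translate(JV_TABLE)


# Illustrative spelling variants of two words
tokens = ["iuvenis", "juvenis", "iuuenis", "uirum", "virum"]

raw_types = Counter(tokens)
normalized_types = Counter(normalize_jv(t) for t in tokens)

print(len(raw_types), "distinct forms before normalization")
print(len(normalized_types), "distinct forms after normalization")
```

Fewer distinct forms means a smaller vocabulary for any downstream model, which is the effect measured on the full corpus below.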

We will examine the Tesserae Latin corpus (https://github.com/cltk/lat_text_tesserae), process it with CLTK and NLTK functions, and compare the results. We chose the Tesserae corpus because it is very clean: it does not suffer from OCR errors, sloppy font conversions, and the like.

In [1]:
import os
import re
from glob import glob
from collections import Counter

from nltk import word_tokenize
from tqdm import tqdm
from cltk import NLP
from cltk.alphabet.lat import normalize_lat
from cltk.sentence.lat import LatinPunktSentenceTokenizer
from cltk.tokenizers.lat.lat import LatinWordTokenizer
from cltk.tokenizers import LatinTokenizationProcess

In [2]:
nltk_tokens = Counter()
cltk_tokens = Counter()
cltk_jvrtokens = Counter()

In [3]:
# The tesserae file format has each line starting with metadata like:
# <apul.met. 1.1> At ego tibi sermone isto Milesio varias fabulas ...
# So we'll remove this:

ANY_ANGLE = re.compile(r"<.[^>]+>")  # used to remove tesserae metadata

def swallow(text, pattern_matcher):
    """Return the text with every match of the pattern removed and outer whitespace stripped."""
    return pattern_matcher.sub("", text).strip()
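
As a quick check of the pattern, the sketch below applies the same regular expression to the sample line quoted in the comment above. It calls `re.sub` directly rather than `swallow` so that it runs on its own; the variable names `sample` and `cleaned` are illustrative.

```python
import re

ANY_ANGLE = re.compile(r"<.[^>]+>")  # same pattern as in the cell above

# Sample line from the comment above, including its trailing ellipsis
sample = "<apul.met. 1.1> At ego tibi sermone isto Milesio varias fabulas ..."

# Drop the angle-bracketed metadata, then trim the leftover whitespace
cleaned = ANY_ANGLE.sub("", sample).strip()
print(cleaned)
```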

In [4]:
tesserae = glob(os.path.expanduser('~/cltk_data/latin/text/latin_text_tesserae/texts/*.tess'))
print(f"{len(tesserae)} tesserae corpus files")

762 tesserae corpus files


In [5]:
latin_tokenizer = LatinWordTokenizer()
sent_toker = LatinPunktSentenceTokenizer()

In [6]:
for file in tqdm(tesserae, total=len(tesserae)):
    with open(file, 'rt', encoding='utf-8') as fin:
        text = fin.read()
    text = swallow(text, ANY_ANGLE)
    # Counter.update() with an iterable adds 1 per occurrence of each token
    nltk_tokens.update(word_tokenize(text))
    cltk_tokens.update(latin_tokenizer.tokenize(text))
    # Normalize the orthography, then tokenize again with CLTK
    text = normalize_lat(text, drop_accents=True,
                         drop_macrons=True,
                         jv_replacement=True,
                         ligature_replacement=True)
    cltk_jvrtokens.update(latin_tokenizer.tokenize(text))

100%|██████████| 762/762 [05:18<00:00,  2.39it/s]


In [7]:
print("Using Latin Tesserae corpus:")
print(f"NLTK.word_tokenize: {len(nltk_tokens):,} distinct tokens, {sum(nltk_tokens.values()):,} total tokens")
print(f"CLTK.latin_tokenizer.tokenize: {len(cltk_tokens):,} distinct tokens, {sum(cltk_tokens.values()):,} total tokens")
print(f"CLTK.latin_tokenizer w/normalization, JV replacement, demacronization, ligature replacement: \n{len(cltk_jvrtokens):,} distinct tokens, {sum(cltk_jvrtokens.values()):,} total tokens")
print(f"CLTK reduces Tesserae word/token dimensionality: {100* (1 - (len(cltk_jvrtokens) / len(nltk_tokens))):.3f}%") 

Using Latin Tesserae corpus:
NLTK.word_tokenize: 371,334 distinct tokens, 8,083,360 total tokens
CLTK.latin_tokenizer.tokenize: 344,635 distinct tokens, 8,218,743 total tokens
CLTK.latin_tokenizer w/normalization, JV replacement, demacronization, ligature replacement: 
332,764 distinct tokens, 8,218,778 total tokens
CLTK reduces Tesserae word/token dimensionality: 10.387%
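
The reduction figure can be reproduced from the distinct-token counts in the output above. The sketch below is only a worked restatement of the formula used in the print statement; the `reduction_pct` helper is a hypothetical name introduced here, and the counts are copied from the run shown above.

```python
def reduction_pct(baseline: int, reduced: int) -> float:
    """Percentage by which `reduced` is smaller than `baseline`."""
    return 100 * (1 - reduced / baseline)


# Distinct-token counts copied from the output above
nltk_distinct = 371_334
cltk_distinct = 344_635
cltk_normalized_distinct = 332_764

print(f"CLTK tokenizer alone: {reduction_pct(nltk_distinct, cltk_distinct):.3f}%")
print(f"CLTK tokenizer with normalization: "
      f"{reduction_pct(nltk_distinct, cltk_normalized_distinct):.3f}%")
```

Separating the two figures shows how much of the reduction comes from tokenization alone and how much from the normalization step.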


## Note: the dimensionality reduction above does not include the further reduction available from ignoring case, which is an application-level choice

In [8]:
nltk_tok_lower = {tok.lower() for tok in nltk_tokens}
cltk_jvrtokens_lower = {tok.lower() for tok in cltk_jvrtokens}
print(f"Lower casing NLTK tokens: {len(nltk_tok_lower):,} vs. {len(nltk_tokens):,}")
print(f"A {100* (1 - (len(nltk_tok_lower) / len(nltk_tokens))):.3f}% reduction")
print(f"Lower casing CLTK tokenization w/normalization, JV replacement, demacronization, ligature replacement: \n{len(cltk_jvrtokens_lower):,} vs. {len(cltk_jvrtokens):,}")
print(f"A {100* (1 - (len(cltk_jvrtokens_lower) / len(cltk_jvrtokens))):.3f}% reduction") 

Lower casing NLTK tokens: 337,007 vs. 371,334
A 9.244% reduction
Lower casing CLTK tokenization w/normalization, JV replacement, demacronization, ligature replacement: 
299,512 vs. 332,764
A 9.993% reduction
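
The cell above lowercases into sets, which is enough for counting distinct forms but discards the frequencies. If an application needs the case-folded frequencies as well, one possible approach is to merge the counts, as in this sketch. The `fold_case` helper and the sample counts are hypothetical and not part of CLTK or NLTK.

```python
from collections import Counter


def fold_case(counts: Counter) -> Counter:
    """Merge counts of tokens that differ only by case."""
    folded = Counter()
    for token, n in counts.items():
        folded[token.lower()] += n
    return folded


# Illustrative counts, not taken from the corpus
sample = Counter({"Arma": 2, "arma": 3, "virum": 1})
folded = fold_case(sample)

print(f"{len(sample)} distinct tokens before, {len(folded)} after case folding")
print(f"total tokens unchanged: {sum(sample.values()) == sum(folded.values())}")
```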
