<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Text Pre-processing
Text is messy, and it usually needs substantial pre-processing before it is useful for modeling.  A text pre-processing pipeline generally includes at least the following steps:  
- Tokenize the text by splitting it into words and punctuation  
- Remove stop words and punctuation  
- Convert words to root words using lemmatization or stemming  

This notebook walks through a basic example of how to perform those steps using two common NLP libraries: [NLTK](https://www.nltk.org) and [spaCy](https://spacy.io).


In [1]:
import string

# Import spaCy; uncomment the next line to download the English model if needed
import spacy
#!python -m spacy download en_core_web_sm

# Import NLTK and download the Open Multilingual WordNet data
import nltk
nltk.download('omw-1.4')

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package omw-1.4 to /home/shaunak/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/shaunak/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/shaunak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/shaunak/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
example_doc = '''I saw some geese near the pond. Then they took off flying.'''

example_doc2 = '''"Let's go to N.Y.!""'''

## NLTK
Let's first use NLTK to pre-process our text.  We'll start by tokenizing our sentence, then remove stop words and punctuation, and finally lemmatize the remaining tokens to get the root words.

### Tokenization

In [4]:
# Convert to tokens
tokens = nltk.word_tokenize(example_doc)

tokens2 = nltk.word_tokenize(example_doc2)


print(tokens)
print(tokens2)

['I', 'saw', 'some', 'geese', 'near', 'the', 'pond', '.', 'Then', 'they', 'took', 'off', 'flying', '.']
['``', 'Let', "'s", 'go', 'to', 'N.Y.', '!', "''", "''"]
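To see what a trained tokenizer adds, it can help to compare the NLTK output above with a naive approach. The sketch below uses a single regular expression from the standard library; `simple_tokenize` is a hypothetical helper written for this comparison, not part of NLTK.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Same two example strings as above
example_doc = "I saw some geese near the pond. Then they took off flying."
example_doc2 = '"Let\'s go to N.Y.!""'

simple_tokens = simple_tokenize(example_doc)
simple_tokens2 = simple_tokenize(example_doc2)
print(simple_tokens)
print(simple_tokens2)
```

On the first sentence the regex gives the same tokens as `nltk.word_tokenize`. On the second it breaks `N.Y.` into four pieces and separates the apostrophe from `s`, whereas NLTK keeps the abbreviation whole and treats `'s` as a single token.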


### Remove stop words & punctuation

In [5]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'you', 'am', 'with', 'had', 'isn', 'why', 'some', 'those', 'was', 'or', 'now', "aren't", 'm', 'her', 'shouldn', 's', 'ma', 'haven', "should've", 'just', 'ain', 'once', 've', 'are', 'other', 'any', 'further', "shouldn't", 'as', 'the', 'them', 'being', 'll', "haven't", 'they', 'what', 'do', "you're", 'it', 'below', 'don', 'then', 'no', 'my', 'against', "won't", 'ourselves', "wouldn't", 'himself', 'his', 'aren', "doesn't", "wasn't", 'out', 'doesn', 'and', 'too', 'shan', 'myself', 'their', 'after', 'our', 'only', 'ours', 'between', 'before', 'above', 'than', 'few', 'can', 'been', 'from', 'y', "it's", 'both', "mustn't", 'but', 'mightn', "you've", "shan't", 'this', 'such', 'because', 'didn', 'into', 'same', "you'll", 'hers', 'has', 'an', 'not', 'wasn', 'hasn', "weren't", 'off', "don't", "didn't", 'did', "needn't", 'wouldn', 'were', 'if', 'needn', 'each', 'should', 'i', 'its', 'to', "she's", 'o', "hasn't", 'does', 'theirs', 'when', 'down', "mightn't", 'have', 'your', 'whom', 'is', 'during', 

In [6]:
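# Use a set of punctuation characters so each token is tested for membership,
# rather than as a substring of the string.punctuation string
punctuations = set(string.punctuation)

# Filter out stop words and punctuation
tokens = [w for w in tokens if w.lower() not in stop_words and w not in punctuations]

print(tokens)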

['saw', 'geese', 'near', 'pond', 'took', 'flying']
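Checking tokens against `string.punctuation` works for this sentence, but multi-character punctuation tokens such as the `` and '' that NLTK produced for `example_doc2` would pass through. One possible stricter filter, sketched here with a hypothetical helper `is_punctuation`, drops any token made up entirely of punctuation characters.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and contains only punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Tokens copied from the NLTK output for example_doc2 above
tokens2 = ['``', 'Let', "'s", 'go', 'to', 'N.Y.', '!', "''", "''"]

# Single-character check: only '!' is removed
single_char_filter = [t for t in tokens2 if t not in PUNCT_CHARS]
# Whole-token check: the quote tokens are removed as well
all_punct_filter = [t for t in tokens2 if not is_punctuation(t)]

print(single_char_filter)
print(all_punct_filter)
```

Tokens that mix letters and punctuation, such as `'s` and `N.Y.`, are kept by both filters.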


### Lemmatize

In [7]:
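from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless a `pos` argument is given,
# so verb forms such as 'took' and 'flying' are returned unchanged
tokens = [wordnet_lemmatizer.lemmatize(word).lower().strip() for word in tokens]
print(tokens)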


['saw', 'goose', 'near', 'pond', 'took', 'flying']
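In the output above only `geese` changed. `WordNetLemmatizer.lemmatize()` defaults to the noun part of speech, so the verbs are left alone. The toy sketch below illustrates why the part of speech matters and how lemmatization differs from stemming. The suffix list and the lookup table are hypothetical stand-ins, far smaller than a real stemmer's rules or the WordNet lexicon.

```python
# Toy contrast between stemming and lemmatization (illustrative only)
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule set

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical (word, part of speech) -> lemma table standing in for a lexicon
LEMMAS = {
    ("geese", "n"): "goose",
    ("saw", "v"): "see",
    ("took", "v"): "take",
    ("flying", "v"): "fly",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

words = ["saw", "geese", "took", "flying"]
stems = [toy_stem(w) for w in words]
noun_lemmas = [toy_lemmatize(w) for w in words]
verb_lemmas = [toy_lemmatize(w, pos="v") for w in words]

print(stems)
print(noun_lemmas)
print(verb_lemmas)
```

Suffix stripping handles `flying` but cannot recover irregular forms such as `geese` or `took`, while the lookup only finds the verb lemmas when it is told the words are verbs.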


In [8]:
# Combine the filtered lemmas back into a string
doc_processed = " ".join(tokens)

print('Original:')
print(example_doc)
print('Processed:')
print(doc_processed)

Original:
I saw some geese near the pond. Then they took off flying.
Processed:
saw goose near pond took flying


## spaCy
Let's now walk through the same example using spaCy.  We'll first tokenize, as we did with NLTK.  spaCy's tokens differ from NLTK's, though: NLTK produces plain string tokens, while each spaCy token carries additional information about the word, such as its part of speech and root form.  We will therefore use the spaCy tokens to extract the lemmas first, and then remove stop words and punctuation from the resulting list of lemma strings.

### Tokenization

In [18]:
# Run the spaCy pipeline on the sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_doc)
# Collect the Token objects
tokens = list(doc)

print(tokens)

[I, saw, some, geese, near, the, pond, ., Then, they, took, off, flying, .]


### Lemmatization

In [19]:
# Extract the lemmas for each token
lemmatized_tokens = [token.lemma_.lower().strip() for token in tokens]
print(lemmatized_tokens)

['i', 'see', 'some', 'geese', 'near', 'the', 'pond', '.', 'then', 'they', 'take', 'off', 'fly', '.']


### Remove stop words and punctuation

In [20]:
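from spacy.lang.en.stop_words import STOP_WORDS

# Use a distinct name so the NLTK `stopwords` corpus imported earlier is not shadowed
spacy_stop_words = set(STOP_WORDS)
punctuations = set(string.punctuation)

# Filter out stop words and punctuation from the lemma strings
tokens = [token for token in lemmatized_tokens if token.lower() not in spacy_stop_words and token not in punctuations]
print(tokens)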

['geese', 'near', 'pond', 'fly']
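The lemmas `see` and `take` are missing from the result above, which means both are on spaCy's stop list and were dropped after lemmatization. The sketch below shows how the order of the two steps changes the outcome. The word pairs are copied from the outputs above (lowercased), and `stop_subset` is a hypothetical hand-picked subset of stop words chosen only to reproduce that result.

```python
import string

# (surface form, lemma) pairs copied from the spaCy output above, lowercased
pairs = [("i", "i"), ("saw", "see"), ("some", "some"), ("geese", "geese"),
         ("near", "near"), ("the", "the"), ("pond", "pond"), (".", "."),
         ("then", "then"), ("they", "they"), ("took", "take"), ("off", "off"),
         ("flying", "fly"), (".", ".")]

# Hypothetical stop list: a small subset, not spaCy's full STOP_WORDS
stop_subset = {"i", "some", "the", "then", "they", "off", "see", "take"}
punct = set(string.punctuation)

def keep(word):
    """Keep a word if it is neither a stop word nor a punctuation character."""
    return word not in stop_subset and word not in punct

# Order used in this section: lemmatize first, then filter on the lemma
lemmatize_then_filter = [lemma for _, lemma in pairs if keep(lemma)]
# Alternative order: filter on the surface form, then take the lemma
filter_then_lemmatize = [lemma for surface, lemma in pairs if keep(surface)]

print(lemmatize_then_filter)
print(filter_then_lemmatize)
```

Which order is preferable depends on the task; filtering on the surface form keeps the verbs from this sentence, at the cost of a longer output.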


In [21]:
# Combine the filtered lemmas back into a string
doc_processed = " ".join(tokens)

print('Original:')
print(example_doc)
print('Processed:')
print(doc_processed)

Original:
I saw some geese near the pond. Then they took off flying.
Processed:
geese near pond fly


## Byte-Pair Encoding (BPE)
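BPE training starts from a table of word frequencies like the one this section builds, splits every word into single characters, and then repeatedly merges the most frequent adjacent pair of symbols into a new symbol. The sketch below runs three merge steps in plain Python. The words and counts in `toy_word_freqs` are hypothetical and are not taken from the corpus in this notebook, and the helper names are illustrative rather than part of any library.

```python
from collections import Counter

# Toy word frequencies (hypothetical values)
toy_word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Start with every word split into single characters
splits = {word: list(word) for word in toy_word_freqs}

def count_pairs(splits, word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts

def merge_pair(pair, splits):
    """Replace every occurrence of `pair` with a single merged symbol."""
    a, b = pair
    merged = {}
    for word, symbols in splits.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[word] = out
    return merged

merges = []
for _ in range(3):
    pair_counts = count_pairs(splits, toy_word_freqs)
    best = max(pair_counts, key=pair_counts.get)  # most frequent pair
    merges.append(best)
    splits = merge_pair(best, splits)

print(merges)
print(splits)
```

After three merges the learned rules are `u+g`, `u+n`, and `h+ug`, so `hugs` is represented as `['hug', 's']`. A real tokenizer continues until the vocabulary reaches a target size.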

In [22]:
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

In [23]:
from transformers import AutoTokenizer

In [24]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")


In [25]:
from collections import defaultdict

# Count how often each pre-tokenized word appears in the corpus
word_freqs = defaultdict(int)

for text in corpus:
    # GPT-2's pre-tokenizer splits on whitespace and punctuation; 'Ġ' marks a leading space
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    print(words_with_offsets)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)

[('This', (0, 4)), ('Ġis', (4, 7)), ('Ġthe', (7, 11)), ('ĠHugging', (11, 19)), ('ĠFace', (19, 24)), ('ĠCourse', (24, 31)), ('.', (31, 32))]
[('This', (0, 4)), ('Ġchapter', (4, 12)), ('Ġis', (12, 15)), ('Ġabout', (15, 21)), ('Ġtokenization', (21, 34)), ('.', (34, 35))]
[('This', (0, 4)), ('Ġsection', (4, 12)), ('Ġshows', (12, 18)), ('Ġseveral', (18, 26)), ('Ġtokenizer', (26, 36)), ('Ġalgorithms', (36, 47)), ('.', (47, 48))]
[('Hopefully', (0, 9)), (',', (9, 10)), ('Ġyou', (10, 14)), ('Ġwill', (14, 19)), ('Ġbe', (19, 22)), ('Ġable', (22, 27)), ('Ġto', (27, 30)), ('Ġunderstand', (30, 41)), ('Ġhow', (41, 45)), ('Ġthey', (45, 50)), ('Ġare', (50, 54)), ('Ġtrained', (54, 62)), ('Ġand', (62, 66)), ('Ġgenerate', (66, 75)), ('Ġtokens', (75, 82)), ('.', (82, 83))]
defaultdict(<class 'int'>, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'ĠHugging': 1, 'ĠFace': 1, 'ĠCourse': 1, '.': 4, 'Ġchapter': 1, 'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, '