# Introduction to preprocessing for text

**Word frequency analysis**

PyBooks is developing a book recommendation system and wants to find patterns and trends in text to improve its recommendations.

To begin, you'll want to understand the frequency of words in a given text and remove any rare words.

* Import get_tokenizer from torchtext and the FreqDist class from the nltk library.
* Initialize the tokenizer for English and tokenize the given text.
* Calculate the frequency distribution of the tokens and remove rare words using a list comprehension.

In [6]:
# Import the necessary functions
from torchtext.data.utils import get_tokenizer
from nltk.probability import FreqDist

text = """In the city of Dataville, a data analyst named Alex explores hidden insights within
        vast data. With determination, Alex uncovers patterns, cleanses the data, and unlocks 
        innovation. Join this adventure to unleash the power of data-driven decisions."""

# Initialize the tokenizer and tokenize the text
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(text)
print("Tokens:\n", tokens)

# Count how often each token occurs
freq_dist = FreqDist(tokens)
print("Freq_dist:\n", freq_dist)

# Keep only the tokens that occur more than `threshold` times
threshold = 1
common_tokens = [token for token in tokens if freq_dist[token] > threshold]
print("Common_tokens:\n", common_tokens)


Tokens:
 ['in', 'the', 'city', 'of', 'dataville', ',', 'a', 'data', 'analyst', 'named', 'alex', 'explores', 'hidden', 'insights', 'within', 'vast', 'data', '.', 'with', 'determination', ',', 'alex', 'uncovers', 'patterns', ',', 'cleanses', 'the', 'data', ',', 'and', 'unlocks', 'innovation', '.', 'join', 'this', 'adventure', 'to', 'unleash', 'the', 'power', 'of', 'data-driven', 'decisions', '.']
Freq_dist:
 <FreqDist with 33 samples and 44 outcomes>
Common_tokens:
 ['the', 'of', ',', 'data', 'alex', 'data', '.', ',', 'alex', ',', 'the', 'data', ',', '.', 'the', 'of', '.']


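You have removed the rare words from your text. Among the tokens that remain, data and alex stand out as the most common content words; the rest are punctuation marks and function words such as the and of. In practice, you'll work with larger texts, where frequency filtering surfaces more meaningful words.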
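
To see which of the frequent tokens are actual words, the sketch below re-counts the token list printed earlier using Python's collections.Counter (nltk's FreqDist is a subclass of Counter, so the counting logic is the same). The is_word helper and its "at least one letter or digit" rule are assumptions made for this illustration, not part of the exercise.

```python
from collections import Counter

# Token list copied from the tokenizer output shown earlier.
tokens = ['in', 'the', 'city', 'of', 'dataville', ',', 'a', 'data', 'analyst', 'named',
          'alex', 'explores', 'hidden', 'insights', 'within', 'vast', 'data', '.', 'with',
          'determination', ',', 'alex', 'uncovers', 'patterns', ',', 'cleanses', 'the',
          'data', ',', 'and', 'unlocks', 'innovation', '.', 'join', 'this', 'adventure',
          'to', 'unleash', 'the', 'power', 'of', 'data-driven', 'decisions', '.']

counts = Counter(tokens)


def is_word(token):
    # Assumption for this sketch: a "word" contains at least one letter or digit.
    return any(ch.isalnum() for ch in token)


threshold = 1
# Keep tokens that are frequent enough and are not pure punctuation.
common_words = {tok: n for tok, n in counts.items() if n > threshold and is_word(tok)}
print(common_words)  # {'the': 3, 'of': 2, 'data': 3, 'alex': 2}
```

Once punctuation is dropped, only the, of, data, and alex pass the threshold, and a stop word filter would then leave data and alex.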

**Preprocessing text**

Building a recommendation system, or any model, requires text to be preprocessed first.

A block of text from Sherlock Holmes is loaded here. Preprocess this text with tokenization, stop word removal, and stemming to prepare it for further analysis.


In [7]:
import nltk 
from nltk.stem import PorterStemmer

text1 = """To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. 
In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. 
All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. 
He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. 
He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer—excellent for drawing the veil from men’s motives and actions. 
But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. 
Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. 
And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory."""


* Initialize the tokenizer with "basic_english".
* Tokenize the text using the tokenizer.

In [8]:
# Initialize and tokenize the text
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(text1)
print(tokens)

['to', 'sherlock', 'holmes', 'she', 'is', 'always', 'the', 'woman', '.', 'i', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.', 'in', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', '.', 'it', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'irene', 'adler', '.', 'all', 'emotions', ',', 'and', 'that', 'one', 'particularly', ',', 'were', 'abhorrent', 'to', 'his', 'cold', ',', 'precise', 'but', 'admirably', 'balanced', 'mind', '.', 'he', 'was', ',', 'i', 'take', 'it', ',', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen', ',', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position', '.', 'he', 'never', 'spoke', 'of', 'the', 'softer', 'passions', ',', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer', '.', 'they', 'were', 'admirable', 'things', 'for', 'the', 'observer—
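
The output shows that the tokenizer lowercases the text and turns periods and commas into separate tokens. A rough standard-library approximation of that behavior is sketched below. simple_tokenize is a hypothetical helper written for this illustration, not torchtext's implementation, and it only handles a handful of ASCII punctuation marks.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split a few ASCII punctuation marks into their own tokens."""
    text = text.lower()
    # Surround each listed punctuation mark with spaces so split() isolates it.
    text = re.sub(r"([.,!?()])", r" \1 ", text)
    return text.split()


sample = "To Sherlock Holmes she is always the woman."
print(simple_tokenize(sample))
# ['to', 'sherlock', 'holmes', 'she', 'is', 'always', 'the', 'woman', '.']
```

Like the tokens printed in this exercise, the sketch leaves typographic characters such as curly apostrophes and long dashes attached to the neighboring words.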

* Create a set of English stopwords and use a list comprehension to filter them out of the tokens, making sure to ignore capitalization.

In [11]:
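from nltk.corpus import stopwords

# Download the stop word list if it is not already available locally
nltk.download("stopwords", quiet=True)

# Remove any stopwords, comparing in lowercase so capitalization is ignored
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)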

['sherlock', 'holmes', 'always', 'woman', '.', 'seldom', 'heard', 'mention', 'name', '.', 'eyes', 'eclipses', 'predominates', 'whole', 'sex', '.', 'felt', 'emotion', 'akin', 'love', 'irene', 'adler', '.', 'emotions', ',', 'one', 'particularly', ',', 'abhorrent', 'cold', ',', 'precise', 'admirably', 'balanced', 'mind', '.', ',', 'take', ',', 'perfect', 'reasoning', 'observing', 'machine', 'world', 'seen', ',', 'lover', 'would', 'placed', 'false', 'position', '.', 'never', 'spoke', 'softer', 'passions', ',', 'save', 'gibe', 'sneer', '.', 'admirable', 'things', 'observer—excellent', 'drawing', 'veil', 'men’s', 'motives', 'actions', '.', 'trained', 'reasoner', 'admit', 'intrusions', 'delicate', 'finely', 'adjusted', 'temperament', 'introduce', 'distracting', 'factor', 'might', 'throw', 'doubt', 'upon', 'mental', 'results', '.', 'grit', 'sensitive', 'instrument', ',', 'crack', 'one', 'high-power', 'lenses', ',', 'would', 'disturbing', 'strong', 'emotion', 'nature', '.', 'yet', 'one', 'woman
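
The filter works because the stop words are stored in lowercase and each token is lowercased before the membership test; a set keeps that test fast. The sketch below shows the idea on mixed-case input and also drops pure punctuation tokens, which the output above still contains. mini_stop_words is a hypothetical hand-written list standing in for nltk's English stop words, and the punctuation check is an extra step assumed for this illustration.

```python
# Hypothetical mini stop word list, standing in for nltk's English list.
mini_stop_words = {"to", "she", "is", "the", "i", "have", "him", "her", "any", "other"}

# Mixed-case tokens to show that the comparison ignores capitalization.
sample_tokens = ["To", "Sherlock", "Holmes", "she", "is", "always", "the", "woman", "."]

filtered = [
    token
    for token in sample_tokens
    # Compare in lowercase so "To" matches the stop word "to".
    if token.lower() not in mini_stop_words
    # Extra step for this sketch: drop tokens with no letters or digits.
    and any(ch.isalnum() for ch in token)
]
print(filtered)  # ['Sherlock', 'Holmes', 'always', 'woman']
```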

* Perform stemming on the filtered_tokens using nltk's PorterStemmer.

In [12]:
# Perform stemming on the filtered tokens
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)

['sherlock', 'holm', 'alway', 'woman', '.', 'seldom', 'heard', 'mention', 'name', '.', 'eye', 'eclips', 'predomin', 'whole', 'sex', '.', 'felt', 'emot', 'akin', 'love', 'iren', 'adler', '.', 'emot', ',', 'one', 'particularli', ',', 'abhorr', 'cold', ',', 'precis', 'admir', 'balanc', 'mind', '.', ',', 'take', ',', 'perfect', 'reason', 'observ', 'machin', 'world', 'seen', ',', 'lover', 'would', 'place', 'fals', 'posit', '.', 'never', 'spoke', 'softer', 'passion', ',', 'save', 'gibe', 'sneer', '.', 'admir', 'thing', 'observer—excel', 'draw', 'veil', 'men’', 'motiv', 'action', '.', 'train', 'reason', 'admit', 'intrus', 'delic', 'fine', 'adjust', 'tempera', 'introduc', 'distract', 'factor', 'might', 'throw', 'doubt', 'upon', 'mental', 'result', '.', 'grit', 'sensit', 'instrument', ',', 'crack', 'one', 'high-pow', 'lens', ',', 'would', 'disturb', 'strong', 'emot', 'natur', '.', 'yet', 'one', 'woman', ',', 'woman', 'late', 'iren', 'adler', ',', 'dubiou', 'question', 'memori', '.']
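
Stemming is useful because it maps related word forms to one shared stem, so they are counted together in later analysis. The sketch below groups a few words by their stem to make that visible. The word and stem pairs are copied from the outputs above rather than recomputed, and the grouping code is an illustration added here, not part of the exercise.

```python
from collections import defaultdict

# (filtered token, stemmed token) pairs copied from the outputs above.
filtered_sample = ["emotion", "emotions", "admirably", "admirable",
                   "reasoning", "reasoner", "woman"]
stemmed_sample = ["emot", "emot", "admir", "admir", "reason", "reason", "woman"]

# Collect the distinct original words that share each stem.
words_by_stem = defaultdict(set)
for original, stem in zip(filtered_sample, stemmed_sample):
    words_by_stem[stem].add(original)

# Keep only the stems that merged more than one distinct word.
merged = {stem: sorted(words) for stem, words in words_by_stem.items() if len(words) > 1}
print(merged)
# {'emot': ['emotion', 'emotions'], 'admir': ['admirable', 'admirably'], 'reason': ['reasoner', 'reasoning']}
```

The stems are not always dictionary words (for example emot or admir); they only need to be consistent across the forms they replace.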


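You've cracked the case of the convoluted text. You now have a lowercased, stemmed list of tokens with the stop words removed, ready for further analysis. Note that punctuation tokens such as . and , are still in the list and would need a separate filtering step. You're officially a master in text preprocessing!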