### Objective
- Reduce features
- Cleaner, more representative datasets

### Tools
- torchtext (PyTorch)
- NLTK

### Techniques
- Tokenization
- Stop word removal
- Stemming
- Rare word removal

### Extract tokens from text

In [2]:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("The biggest thing that I've seen, which is absolutely takes me to my core, is actually not so much about how humanlike ada is, but how robotic we are. The algorithms that run our systems are extremely able to be analyzed, understood algorithms will know is better than ourselves.")
print(tokens)

['the', 'biggest', 'thing', 'that', 'i', "'", 've', 'seen', ',', 'which', 'is', 'absolutely', 'takes', 'me', 'to', 'my', 'core', ',', 'is', 'actually', 'not', 'so', 'much', 'about', 'how', 'humanlike', 'ada', 'is', ',', 'but', 'how', 'robotic', 'we', 'are', '.', 'the', 'algorithms', 'that', 'run', 'our', 'systems', 'are', 'extremely', 'able', 'to', 'be', 'analyzed', ',', 'understood', 'algorithms', 'will', 'know', 'is', 'better', 'than', 'ourselves', '.']
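
For intuition, the behavior visible in the output above (lowercasing, and punctuation such as `,`, `.`, and `'` split into separate tokens) can be approximated with a regular expression. This is a simplified sketch, not torchtext's actual implementation; `simple_tokenize` is a hypothetical helper that handles only a handful of punctuation marks.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for a basic English tokenizer (illustrative only)."""
    # Lowercase, then pad selected punctuation with spaces so that
    # str.split() returns each mark as its own token.
    padded = re.sub(r"([.,!?()'])", r" \1 ", text.lower())
    return padded.split()

print(simple_tokenize("The biggest thing that I've seen, is robotic."))
```

As in the real output, `I've` comes back as three tokens: `i`, `'`, and `ve`.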




### Stop word removal
- Eliminate common words that don't contribute to the meaning
- Stop words: a, the, and, or...

In [3]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)

['biggest', 'thing', "'", 'seen', ',', 'absolutely', 'takes', 'core', ',', 'actually', 'much', 'humanlike', 'ada', ',', 'robotic', '.', 'algorithms', 'run', 'systems', 'extremely', 'able', 'analyzed', ',', 'understood', 'algorithms', 'know', 'better', '.']
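
Note that punctuation tokens such as `,` and `'` survive the filter above because they are not in the stop-word list. A minimal sketch of dropping them in the same pass follows. It uses only the standard library, and the small `stop_words` set is a hand-picked assumption standing in for NLTK's full English list; `is_punctuation` is a hypothetical helper.

```python
import string

# Assumption: a tiny illustrative stop-word set, not NLTK's full list.
stop_words = {"the", "that", "i", "is", "which", "ve"}
tokens = ["the", "biggest", "thing", "that", "i", "'", "ve", "seen", ",", "which", "is"]

def is_punctuation(token):
    # True when every character of the token is an ASCII punctuation mark.
    return all(ch in string.punctuation for ch in token)

filtered_tokens = [
    token for token in tokens
    if token.lower() not in stop_words and not is_punctuation(token)
]
print(filtered_tokens)
```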


### Stemming
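- Reduce words to a common stem by trimming suffixes; the stem need not be a dictionary word (e.g. `absolut`, `extrem` in the output below)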

In [5]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)

['biggest', 'thing', "'", 'seen', ',', 'absolut', 'take', 'core', ',', 'actual', 'much', 'humanlik', 'ada', ',', 'robot', '.', 'algorithm', 'run', 'system', 'extrem', 'abl', 'analyz', ',', 'understood', 'algorithm', 'know', 'better', '.']
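
To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; `naive_stem` and its suffix list are hypothetical and chosen only for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "ly", "s"), min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["algorithms", "analyzed", "systems", "is"]
print([naive_stem(w) for w in words])
```

The sketch agrees with `PorterStemmer` on these simple cases but diverges quickly: it would turn `absolutely` into `absolute`, whereas the Porter output above is `absolut`.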


### Rare word removal
- Remove infrequent words that don't add value

In [6]:
from nltk.probability import FreqDist
freq_dist = FreqDist(stemmed_tokens)
thresh = 1  # tokens must occur more than this many times to be kept
common_tokens = [token for token in stemmed_tokens if freq_dist[token] > thresh]
print(common_tokens)

[',', ',', ',', '.', 'algorithm', ',', 'algorithm', '.']
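
The result above is dominated by punctuation, which is frequent but carries little meaning. One option, sketched here with the standard library's `collections.Counter` rather than `FreqDist`, is to drop punctuation tokens before counting. The input is the stemmed token list printed earlier; `remove_rare` is a hypothetical helper.

```python
import string
from collections import Counter

stemmed_tokens = ['biggest', 'thing', "'", 'seen', ',', 'absolut', 'take', 'core', ',',
                  'actual', 'much', 'humanlik', 'ada', ',', 'robot', '.', 'algorithm',
                  'run', 'system', 'extrem', 'abl', 'analyz', ',', 'understood',
                  'algorithm', 'know', 'better', '.']

def remove_rare(tokens, thresh=1):
    """Keep tokens that occur more than `thresh` times, ignoring punctuation."""
    words = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
    counts = Counter(words)
    return [t for t in words if counts[t] > thresh]

print(remove_rare(stemmed_tokens))
```

With a text this short nearly every word is rare, so only `algorithm` survives; the threshold becomes more meaningful on a larger corpus.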


### Practice
Word frequency analysis

Congratulations! You've just joined PyBooks. PyBooks is developing a book recommendation system and they want to find patterns and trends in text to improve their recommendations.

To begin, you'll want to understand the frequency of words in a given text and remove any rare words.

Note that typical real-world datasets will be larger than this example.
Instructions

- Import the tokenization function from torchtext and frequency distribution function from the nltk library.
- Initialize the tokenizer for English and tokenize the given text.
- Calculate the frequency distribution of the tokens and remove rare words using list comprehension.


In [7]:
# Import the necessary functions
from torchtext.data.utils import get_tokenizer
from nltk.probability import FreqDist

text = "In the city of Dataville, a data analyst named Alex explores hidden insights within vast data. With determination, Alex uncovers patterns, cleanses the data, and unlocks innovation. Join this adventure to unleash the power of data-driven decisions."

# Initialize the tokenizer and tokenize the text
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(text)

threshold = 1
# Remove rare words and print common tokens
freq_dist = FreqDist(tokens)
common_tokens = [token for token in tokens if freq_dist[token] > threshold]
print(common_tokens)

['the', 'of', ',', 'data', 'alex', 'data', '.', ',', 'alex', ',', 'the', 'data', ',', '.', 'the', 'of', '.']
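
To see why these tokens pass the threshold, it can help to inspect the counts directly. The sketch below uses `collections.Counter` from the standard library on the token list printed above; it is an illustration, not part of the exercise solution.

```python
from collections import Counter

common_tokens = ['the', 'of', ',', 'data', 'alex', 'data', '.', ',', 'alex', ',',
                 'the', 'data', ',', '.', 'the', 'of', '.']

counts = Counter(common_tokens)
# Each entry is (token, count), ordered from most to least frequent.
for token, count in counts.most_common():
    print(f"{token!r}: {count}")
```

Every surviving token occurs at least twice, which is what `freq_dist[token] > threshold` with `threshold = 1` requires.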


Preprocessing text

Building a recommendation system, or any model, requires text to be preprocessed first.

A block of text from Sherlock Holmes is loaded here. Preprocess this text using the various techniques presented in the video to prepare it for further analysis.

The text variable is an excerpt from The Hound of the Baskervilles by Arthur Conan Doyle.

The following packages and functions have been loaded for you: nltk, torch, get_tokenizer, PorterStemmer, stopwords.
Instructions

- Initialize the tokenizer with "basic_english".
- Tokenize the text using the tokenizer.
- Create a set of English stopwords and use list comprehension to filter these stop_words out of the text, making sure to ignore capitalization.
- Perform stemming on the filtered_tokens using the appropriate nltk function.

In [8]:
# Initialize and tokenize the text
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(text)

# Remove any stopwords
stop_words = set(stopwords.words("english"))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

# Perform stemming on the filtered tokens
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)

['citi', 'datavil', ',', 'data', 'analyst', 'name', 'alex', 'explor', 'hidden', 'insight', 'within', 'vast', 'data', '.', 'determin', ',', 'alex', 'uncov', 'pattern', ',', 'cleans', 'data', ',', 'unlock', 'innov', '.', 'join', 'adventur', 'unleash', 'power', 'data-driven', 'decis', '.']
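
The three steps above can be wrapped in one reusable function. This is a sketch under assumptions: `preprocess` is a hypothetical helper that takes the tokenizer, stop-word set, and stemming function as arguments. The crude stand-ins below (`str.split`, a one-word stop set, and `str.lower` in place of a stemmer) can be swapped for `get_tokenizer("basic_english")`, `set(stopwords.words("english"))`, and `PorterStemmer().stem`.

```python
def preprocess(text, tokenize, stop_words, stem):
    """Tokenize, drop stop words (case-insensitively), then stem each token."""
    tokens = tokenize(text)
    filtered = [token for token in tokens if token.lower() not in stop_words]
    return [stem(token) for token in filtered]

# Stand-ins for illustration only; they are much cruder than the real tools.
result = preprocess(
    "The analyst explores the data",
    tokenize=str.split,
    stop_words={"the"},
    stem=str.lower,
)
print(result)
```

Passing the components in as arguments keeps the function independent of any one library and makes each step easy to replace or test in isolation.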
