## Tokenization and Data Cleaning

**Tokenization** - converting a text string into individual tokens (words, terms, symbols or other meaningful elements)


In [1]:
import os
import string
import nltk

with open('data.txt', 'rt') as f:
    raw_text = f.read()

tokens_with_punctuation = nltk.word_tokenize(raw_text)
print(f'Token list: {tokens_with_punctuation[:40]}\n')
print(f'Total tokens: {len(tokens_with_punctuation)}')


Token list: ['You', "'ll", 'learn', 'NLP', '.', 'NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.', 'It', 'provides', 'easy-to-use', 'interfaces', 'to', 'over', '50', 'corpora', 'and', 'lexical', 'resources', 'such', 'as', 'WordNet', ',', 'along', 'with', 'a', 'suite']

Total tokens: 217
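
To make the splitting rule visible, here is a minimal regex-based sketch using only the standard library. The pattern and the `simple_tokenize` helper are assumptions made for illustration; this is not how `nltk.word_tokenize` is implemented, it only mimics its behaviour on simple input such as contractions and hyphenated words.

```python
import re

# Hypothetical pattern, tried left to right at each position:
#   '\w+           clitics such as 'll
#   \w+(?:-\w+)*   words and numbers, optionally hyphenated (easy-to-use)
#   [^\w\s]        any single remaining symbol, e.g. punctuation
TOKEN_PATTERN = re.compile(r"'\w+|\w+(?:-\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, clitic and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = "You'll learn NLP. It provides easy-to-use interfaces."
tokens = simple_tokenize(sample)
print(tokens)
# ['You', "'ll", 'learn', 'NLP', '.', 'It', 'provides', 'easy-to-use', 'interfaces', '.']
```

A plain `sample.split()` would instead keep `"You'll"` and `"NLP."` as single tokens, which is why a dedicated tokenizer is used.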


_Text data cleaning:_

- Formatting and standardization (e.g. dates, language)
- Remove punctuation
- Remove abbreviations or expand them to their full form
- Case conversion
- Remove elements like hashtags
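
Several of the steps listed above can be sketched on a raw string with the standard library alone. The abbreviation table, the assumed DD/MM/YYYY date format and the `clean_text` helper are all hypothetical; a real pipeline would adapt them to its data. Punctuation is left in place here because it is easier to remove after tokenization.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"e.g.": "for example", "approx.": "approximately"}


def clean_text(text):
    """Apply a few basic cleaning steps to a raw string."""
    # Expand known abbreviations to their full form.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Remove hashtags such as "#NLP".
    text = re.sub(r"#\w+", "", text)
    # Standardize dates, assuming DD/MM/YYYY input, to ISO YYYY-MM-DD.
    text = re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", text)
    # Case conversion, then collapse the whitespace left behind.
    return re.sub(r"\s+", " ", text.lower()).strip()


cleaned = clean_text("Released 05/03/2021, approx. 50 corpora #NLP #python")
print(cleaned)
# released 2021-03-05, approximately 50 corpora
```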


### Remove punctuation, convert to lowercase


In [2]:
# drop punctuation-only tokens while preserving contractions ('ll) and hyphenated compounds (easy-to-use)
tokens = [token.lower()
          for token in tokens_with_punctuation if nltk.tokenize.punkt.PunktToken(token).is_non_punct]

print(f'Token list: {tokens[:40]}\n')
print(f'Total tokens: {len(tokens)}')


Token list: ['you', "'ll", 'learn', 'nlp', 'nltk', 'is', 'a', 'leading', 'platform', 'for', 'building', 'python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', 'it', 'provides', 'easy-to-use', 'interfaces', 'to', 'over', '50', 'corpora', 'and', 'lexical', 'resources', 'such', 'as', 'wordnet', 'along', 'with', 'a', 'suite', 'of', 'text', 'processing']

Total tokens: 177
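
The reason for filtering whole tokens rather than deleting punctuation characters can be shown with a small standard-library comparison. The `has_word_char` helper is an assumption that only approximates the effect of the filter used above; it is not NLTK's implementation.

```python
import string


def has_word_char(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


raw = ['You', "'ll", 'learn', 'NLP', '.', 'easy-to-use', ',', '50', '--']

# Token-level filter: punctuation-only tokens are dropped, others kept intact.
kept = [t.lower() for t in raw if has_word_char(t)]
print(kept)
# ['you', "'ll", 'learn', 'nlp', 'easy-to-use', '50']

# Character-level stripping, for contrast: contractions and compounds get mangled.
table = str.maketrans('', '', string.punctuation)
naive = [t.translate(table).lower() for t in raw if t.translate(table)]
print(naive)
# ['you', 'll', 'learn', 'nlp', 'easytouse', '50']
```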


### Remove stopwords

**Stopwords** - a group of words that carry little meaning by themselves (e.g. 'in', 'and', 'the', 'which')  
Stopwords are usually not required for analytics, so they are removed to reduce noise


In [3]:
# nltk.download('stopwords')

from nltk.corpus import stopwords

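# build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]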
print(f'Token list: {tokens[:40]}\n')
print(f'Total tokens: {len(tokens)}')


Token list: ["'ll", 'learn', 'nlp', 'nltk', 'leading', 'platform', 'building', 'python', 'programs', 'work', 'human', 'language', 'data', 'provides', 'easy-to-use', 'interfaces', '50', 'corpora', 'lexical', 'resources', 'wordnet', 'along', 'suite', 'text', 'processing', 'libraries', 'classification', 'tokenization', 'stemming', 'tagging', 'parsing', 'semantic', 'reasoning', 'wrappers', 'industrial-strength', 'nlp', 'libraries', 'active', 'discussion', 'forum']

Total tokens: 119
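
The filtering step itself is a membership test against a fixed collection of words. The sketch below uses a tiny hand-written stopword set as a stand-in (an assumption; NLTK's English list is much longer) and a hypothetical `remove_stopwords` helper.

```python
# Hypothetical mini stopword list, for illustration only.
STOPWORDS = frozenset({'a', 'and', 'for', 'in', 'is', 'it', 'the',
                       'to', 'which', 'with', 'you'})


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


sample_tokens = ['you', "'ll", 'learn', 'nlp', 'nltk', 'is', 'a',
                 'leading', 'platform']
filtered = remove_stopwords(sample_tokens)
print(filtered)
# ["'ll", 'learn', 'nlp', 'nltk', 'leading', 'platform']
```

Tokens must already be lowercased for the comparison to work, which is why case conversion comes before this step.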


## Stemming/Lemmatization

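Stemming strips suffixes by rule, so the result is not always a real word.  
Lemmatization uses a dictionary to map each word to its base form (lemma); it is a more expensive operation than stemming.  
By default `WordNetLemmatizer.lemmatize` treats every word as a noun (`pos='n'`), which is why verb forms such as 'leading' and 'provides' come back unchanged below.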


In [4]:
# nltk.download('wordnet')

wn = nltk.WordNetLemmatizer()
ps = nltk.PorterStemmer()

print(f"Stemming: {ps.stem('meanness')}, {ps.stem('meaning')}")
print(f"Lemmatization: {wn.lemmatize('meaning')}, {wn.lemmatize('meanness')}")


Stemming: mean, mean
Lemmatization: meaning, meanness
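
The difference between the two approaches can be illustrated with a toy version of each. The suffix rules and the lexicon below are hypothetical and far cruder than the Porter algorithm or WordNet; the sketch only shows rule-based stripping versus dictionary lookup.

```python
# Hypothetical suffix rules and lexicon, for illustration only.
SUFFIXES = ('ness', 'ing', 'es', 's')
LEXICON = {'corpora': 'corpus', 'libraries': 'library',
           'interfaces': 'interface'}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    return LEXICON.get(word, word)


for w in ('meanness', 'meaning', 'libraries', 'corpora'):
    print(f'{w}: stem={toy_stem(w)}, lemma={toy_lemmatize(w)}')
# meanness: stem=mean, lemma=meanness
# meaning: stem=mean, lemma=meaning
# libraries: stem=librari, lemma=library
# corpora: stem=corpora, lemma=corpus
```

The stemmer needs no data but can produce non-words such as 'librari'; the lemmatizer returns real words but only for entries its dictionary knows.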


In [5]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stem_tokens = [stemmer.stem(token) for token in tokens]
print(f'Token list: {stem_tokens[:40]}\n')
print(f'Total tokens: {len(stem_tokens)}')


Token list: ["'ll", 'learn', 'nlp', 'nltk', 'lead', 'platform', 'build', 'python', 'program', 'work', 'human', 'languag', 'data', 'provid', 'easy-to-us', 'interfac', '50', 'corpora', 'lexic', 'resourc', 'wordnet', 'along', 'suit', 'text', 'process', 'librari', 'classif', 'token', 'stem', 'tag', 'pars', 'semant', 'reason', 'wrapper', 'industrial-strength', 'nlp', 'librari', 'activ', 'discuss', 'forum']

Total tokens: 119


In [6]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lem_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print(f'Token list: {lem_tokens[:40]}\n')
print(f'Total tokens: {len(lem_tokens)}')


Token list: ["'ll", 'learn', 'nlp', 'nltk', 'leading', 'platform', 'building', 'python', 'program', 'work', 'human', 'language', 'data', 'provides', 'easy-to-use', 'interface', '50', 'corpus', 'lexical', 'resource', 'wordnet', 'along', 'suite', 'text', 'processing', 'library', 'classification', 'tokenization', 'stemming', 'tagging', 'parsing', 'semantic', 'reasoning', 'wrapper', 'industrial-strength', 'nlp', 'library', 'active', 'discussion', 'forum']

Total tokens: 119


## N-Grams

An n-gram is a contiguous sequence of n items (here, tokens) in a text sample; bigrams have n = 2 and trigrams n = 3


In [7]:
# nltk.download('punkt')  # needed by nltk.word_tokenize in cell [1]; ngrams itself needs no download

from nltk.util import ngrams
from collections import Counter

bigrams = ngrams(lem_tokens, 2)
print(f"Most common bigrams: {Counter(bigrams).most_common(5)}")

trigrams = ngrams(lem_tokens, 3)
print(f"Most common trigrams: {Counter(trigrams).most_common(5)}")


Most common bigrams: [(('python', 'program'), 2), (('computational', 'linguistics'), 2), (('language', 'processing'), 2), (("'ll", 'learn'), 1), (('learn', 'nlp'), 1)]
Most common trigrams: [(("'ll", 'learn', 'nlp'), 1), (('learn', 'nlp', 'nltk'), 1), (('nlp', 'nltk', 'leading'), 1), (('nltk', 'leading', 'platform'), 1), (('leading', 'platform', 'building'), 1)]
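
For intuition, n-grams can also be built without NLTK by zipping staggered views of the token list. The `make_ngrams` helper and the sample tokens are assumptions made for this sketch; for well-formed input it is expected to produce the same tuples as `nltk.util.ngrams`.

```python
from collections import Counter


def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as a list of tuples."""
    # Zip n staggered views of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


sample_tokens = ['natural', 'language', 'processing', 'with',
                 'natural', 'language', 'toolkit']

sample_bigrams = make_ngrams(sample_tokens, 2)
print(Counter(sample_bigrams).most_common(1))
# [(('natural', 'language'), 2)]

sample_trigrams = make_ngrams(sample_tokens, 3)
print(sample_trigrams[:2])
# [('natural', 'language', 'processing'), ('language', 'processing', 'with')]
```

A list of k tokens yields k - n + 1 n-grams, so longer n-grams repeat less often, as the trigram counts above also show.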


## Parts-of-Speech (POS) Tagging

- POS tagging involves identifying the part of speech for each word in a corpus
- Used for entity recognition, filtering and sentiment analysis

| Word   | POS | Description                           |
| ------ | --- | ------------------------------------- |
| Man    | NN  | Noun, singular or mass                |
| Engage | VBP | Verb, non-3rd-person singular present |
| Top    | JJ  | Adjective                             |


In [8]:
# nltk.download('averaged_perceptron_tagger')

# tag all tokens and show the first 10 (token, tag) pairs
nltk.pos_tag(lem_tokens)[:10]

[("'ll", 'MD'),
 ('learn', 'VB'),
 ('nlp', 'JJ'),
 ('nltk', 'JJ'),
 ('leading', 'VBG'),
 ('platform', 'NN'),
 ('building', 'NN'),
 ('python', 'NN'),
 ('program', 'NN'),
 ('work', 'NN')]
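
One use of the tags is filtering tokens by part of speech. The sketch below works on the ten (token, tag) pairs shown above, copied in as literal data so that no tagger model is needed; the `filter_by_tag_prefix` helper is hypothetical.

```python
# (token, tag) pairs copied from the tagger output above.
tagged = [("'ll", 'MD'), ('learn', 'VB'), ('nlp', 'JJ'), ('nltk', 'JJ'),
          ('leading', 'VBG'), ('platform', 'NN'), ('building', 'NN'),
          ('python', 'NN'), ('program', 'NN'), ('work', 'NN')]


def filter_by_tag_prefix(pairs, prefix):
    """Return tokens whose tag starts with prefix (e.g. 'NN' for nouns)."""
    return [token for token, tag in pairs if tag.startswith(prefix)]


nouns = filter_by_tag_prefix(tagged, 'NN')
verbs = filter_by_tag_prefix(tagged, 'VB')
print(nouns)
# ['platform', 'building', 'python', 'program', 'work']
print(verbs)
# ['learn', 'leading']
```

Matching on a prefix groups related tags together, so 'VB' covers both VB and VBG. Note that 'nlp' and 'nltk' were tagged JJ above; tagging lowercased tokens with stopwords removed gives the tagger less context than the original sentences would.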