<a href="https://colab.research.google.com/github/rahiakela/text-analytics-with-python/blob/3-processing-and-understanding-text/text_preprocessing_and_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Preprocessing and Wrangling

Text wrangling (also called preprocessing or normalization) is a series of steps that clean and standardize textual data into a form that can be consumed by other NLP and intelligent systems powered by machine learning and deep learning.

Common preprocessing techniques include:
* cleaning text,
* tokenizing text,
* removing special characters,
* case conversion,
* correcting spellings,
* removing stopwords and other unnecessary terms,
* stemming,
* and lemmatization.

The key idea is to remove unnecessary content from one or more text documents in a corpus (or corpora) and get clean text documents.

## Removing HTML Tags

Often, unstructured text contains a lot of noise, especially if you use techniques
like web scraping or screen scraping to retrieve data from web pages, blogs, and
online repositories. HTML tags, JavaScript, and Iframe tags typically don’t add much
value to understanding and analyzing text. Our main intent is to extract meaningful
textual content from the data extracted from the web.

Let’s look at a section of a web page showing the King James version of the Bible.
<img src='https://github.com/rahiakela/img-repo/blob/master/text-analytics-with-python/bible.PNG?raw=1' width='800'/>

We will now leverage requests and retrieve the contents of this web page in Python.
This is known as web scraping and the following code helps us achieve this.

In [1]:
import requests

data = requests.get('http://www.gutenberg.org/cache/epub/8001/pg8001.html')
content = data.content
content[1163:2200]

b'content="Ebookmaker 0.4.0a5 by Marcello Perathoner &lt;webmaster@gutenberg.org&gt;" name="generator"/>\r\n</head>\r\n  <body><p id="id00000">Project Gutenberg EBook The Bible, King James, Book 1: Genesis</p>\r\n\r\n<p id="id00001">Copyright laws are changing all over the world. Be sure to check the\r\ncopyright laws for your country before downloading or redistributing\r\nthis or any other Project Gutenberg eBook.</p>\r\n\r\n<p id="id00002">This header should be the first thing seen when viewing this Project\r\nGutenberg file.  Please do not remove it.  Do not change or edit the\r\nheader without written permission.</p>\r\n\r\n<p id="id00003">Please read the "legal small print," and other information about the\r\neBook and Project Gutenberg at the bottom of this file.  Included is\r\nimportant information about your specific rights and restrictions in\r\nhow the file may be used.  You can also find out about how to make a\r\ndonation to Project Gutenberg, and how to get involved.</p>

We can clearly see from the preceding output that it is extremely difficult to decipher the actual textual content in the web page, due to all the unnecessary HTML tags. We need to remove those tags. 

The BeautifulSoup library provides us with some handy
functions that help us remove these unnecessary tags with ease.

In [0]:
import re
from bs4 import BeautifulSoup

In [0]:
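def strip_html_tags(text):
  soup = BeautifulSoup(text, 'html.parser')

  # remove iframe and script tags along with their contents
  for tag in soup(['iframe', 'script']):
    tag.extract()
  stripped_text = soup.get_text()
  # collapse runs of carriage returns and newlines into a single newline
  # (a character class needs no '|' separators; including them would also match literal pipes)
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)

  return stripped_text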

In [4]:
clean_content = strip_html_tags(content)
clean_content[1163:2045]

"*** START OF THE PROJECT GUTENBERG EBOOK, THE BIBLE, KING JAMES, BOOK 1***\nThis eBook was produced by David Widger\nwith the help of Derek Andrew's text from January 1992\nand the work of Bryan Taylor in November 2002.\nBook 01        Genesis\n01:001:001 In the beginning God created the heaven and the earth.\n01:001:002 And the earth was without form, and void; and darkness was\n           upon the face of the deep. And the Spirit of God moved upon\n           the face of the waters.\n01:001:003 And God said, Let there be light: and there was light.\n01:001:004 And God saw the light, that it was good: and God divided the\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0light from the darkness.\n01:001:005 And God called the light Day, and the darkness he called\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Night. And the evening and the morning were the first day.\n01:001:006 And God said, Let there be a firmament in the midst of the\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0waters,"

You can compare this output with the raw web page content and see that we have
successfully removed the unnecessary HTML tags. We now have a clean body of text
that’s easier to interpret and understand.
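To make the tag-stripping idea concrete without a network call, here is a minimal sketch that applies the same approach to a small inline HTML string. It deliberately swaps BeautifulSoup for Python's standard-library html.parser module, so it is an alternative illustration rather than the notebook's method. The names TextExtractor and strip_html_tags_stdlib and the sample markup are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content while skipping everything inside script/iframe tags."""

    SKIP_TAGS = {'iframe', 'script'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while we are inside a skipped tag
        self._chunks = []      # text fragments seen outside skipped tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        return ''.join(self._chunks)


def strip_html_tags_stdlib(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # collapse runs of carriage returns and newlines, as in strip_html_tags
    return re.sub(r'[\r\n]+', '\n', parser.get_text())


# hypothetical sample markup, not taken from the Gutenberg page
sample_html = ('<html><body><p>In the beginning</p>\r\n'
               '<script>var x = 1;</script><p>God created</p></body></html>')
print(repr(strip_html_tags_stdlib(sample_html)))
```

The script body never reaches the output because the parser ignores data while it is inside a skipped tag. For real pages BeautifulSoup is usually the more forgiving choice, since it copes better with malformed markup.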

## Text Tokenization

The most popular tokenization techniques include sentence and word tokenization, which are used to break down a text document (or corpus) into sentences and each sentence into words. Thus, tokenization can be defined as the process of breaking down or splitting textual data into smaller and more meaningful components called tokens.

### Sentence Tokenization

Sentence tokenization is the process of splitting a text corpus into sentences that act as the first level of tokens the corpus is comprised of. This is also known as sentence segmentation, since we try to segment the text into meaningful sentences.

There are various ways to perform sentence tokenization. Basic techniques include looking for specific delimiters between sentences, such as:
* a period (.),
* a newline character (\n),
* and sometimes even a semicolon (;).
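The delimiter-based approach can be sketched with a plain regular expression. This is a naive illustration under simple assumptions (the function name naive_sent_tokenize and the sample sentence are made up for this example), and it shows why delimiters alone are not enough: abbreviations such as "Dr." are wrongly treated as sentence ends.

```python
import re


def naive_sent_tokenize(text):
    # Split on whitespace that follows '.', '?', '!' or ';', or on newlines.
    parts = re.split(r'(?<=[.?!;])\s+|\n+', text)
    return [part.strip() for part in parts if part and part.strip()]


text = "Dr. Smith arrived late. He apologized; nobody minded.\nThe meeting began."
for sentence in naive_sent_tokenize(text):
    print(repr(sentence))
```

The first "sentence" returned is just 'Dr.', which is the kind of error that trained sentence tokenizers are designed to avoid.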

We will use the NLTK framework, which provides various interfaces for performing sentence tokenization. We primarily focus on the
following sentence tokenizers:
* sent_tokenize
* Pretrained sentence tokenization models
* PunktSentenceTokenizer
* RegexpTokenizer

In [5]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint
import numpy as np
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [0]:
# loading text corpora
alice = gutenberg.raw(fileids='carroll-alice.txt')

In [7]:
sample_text = '''US unveils world's most powerful supercomputer, beats China. \
 The US has unveiled the world's most powerful supercomputer called 'Summit', \
 beating the previous record-holder China's Sunway TaihuLight. With a peak performance \
 of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, \
 which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, \
 which reportedly take up the size of two tennis courts.'''
sample_text

"US unveils world's most powerful supercomputer, beats China.  The US has unveiled the world's most powerful supercomputer called 'Summit',  beating the previous record-holder China's Sunway TaihuLight. With a peak performance  of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,  which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers,  which reportedly take up the size of two tennis courts."

In [8]:
# Total characters in Alice in Wonderland
len(alice)

144395

In [9]:
# First 100 characters in the corpus
alice[:100]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was"

#### Default Sentence Tokenizer

The nltk.sent_tokenize(...) function is the default sentence tokenization function
that NLTK recommends, and it uses an instance of the PunktSentenceTokenizer class
internally. However, this is not just a freshly constructed instance of that class. It is loaded
from a pretrained model, and NLTK ships such pretrained models for several popular languages
besides English.

In [10]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [11]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)
print(f'Total sentences in sample_text: {str(len(sample_sentences))}')
print(f'Sample text sentences :-\n{str(np.array(sample_sentences))}')

Total sentences in sample_text: 4
Sample text sentences :-
["US unveils world's most powerful supercomputer, beats China."
 "The US has unveiled the world's most powerful supercomputer called 'Summit',  beating the previous record-holder China's Sunway TaihuLight."
 'With a peak performance  of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,  which is capable of 93,000 trillion calculations per second.'
 'Summit has 4,608 servers,  which reportedly take up the size of two tennis courts.']


In [12]:
print(f'\nTotal sentences in alice: {str(len(alice_sentences))}')
print(f'First 5 sentences in alice:- \n {str(np.array(alice_sentences[:5]))}')


Total sentences in alice: 1625
First 5 sentences in alice:- 
 ["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I."
 "Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'"
 'So she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.'
 "There was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!"
 'Oh dear!']


Now, as you can see, the tokenizer is quite intelligent. It doesn’t just use periods to delimit sentences, but also considers other punctuation and capitalization of words.

#### Pretrained Sentence Tokenizer Models

Suppose we were dealing with German text. We can use sent_tokenize, which
is already trained, or load a pretrained tokenization model on German text into a PunktSentenceTokenizer instance and perform the same operation. The following
snippet shows this. We start by loading a German text corpus and inspecting it.

In [13]:
nltk.download('europarl_raw')

[nltk_data] Downloading package europarl_raw to /root/nltk_data...
[nltk_data]   Unzipping corpora/europarl_raw.zip.


True

In [14]:
from nltk.corpus import europarl_raw

german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# Total characters in the corpus
print(len(german_text))
# First 100 characters in the corpus
german_text[:100]

157171


' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit'

Next, we tokenize the text corpus into sentences using the default sent_tokenize(...)
tokenizer and a pretrained German language tokenizer by loading it
from the NLTK resources.

In [0]:
# default sentence tokenizer
german_sentences_def = default_st(text=german_text, language='german')

# loading german text tokenizer into a PunktSentenceTokenizer instance
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

We can now verify the type of our German tokenizer and check if the results
obtained by using the two tokenizers match!

In [16]:
# verify the type of german_tokenizer, should be PunktSentenceTokenizer
type(german_tokenizer)

nltk.tokenize.punkt.PunktSentenceTokenizer

In [17]:
# check if results of both tokenizers match , should be True
(german_sentences_def == german_sentences)

True

Thus we see that indeed the german_tokenizer is an instance of
PunktSentenceTokenizer, which specializes in dealing with the German language. We
also checked if the sentences obtained from the default tokenizer are the same as the
sentences obtained by this pretrained tokenizer. As expected, they are the same (true).

In [18]:
# print first 5 sentences of the corpus
np.array(german_sentences[:5])

array([' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .',
       'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .',
       'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .',
       'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .',
       'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .'],
      dtype='<U259')

Thus we see that our assumption was indeed correct and you can tokenize sentences
belonging to different languages in two different ways.

#### PunktSentenceTokenizer

In [19]:
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
np.array(sample_sentences)

array(["US unveils world's most powerful supercomputer, beats China.",
       "The US has unveiled the world's most powerful supercomputer called 'Summit',  beating the previous record-holder China's Sunway TaihuLight.",
       'With a peak performance  of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,  which is capable of 93,000 trillion calculations per second.',
       'Summit has 4,608 servers,  which reportedly take up the size of two tennis courts.'],
      dtype='<U178')

#### RegexpTokenizer

The last sentence tokenizer we cover is the RegexpTokenizer class, which uses specific
regular expression-based patterns to segment text into sentences.

In [20]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
np.array(sample_sentences)

array(["US unveils world's most powerful supercomputer, beats China.",
       " The US has unveiled the world's most powerful supercomputer called 'Summit',  beating the previous record-holder China's Sunway TaihuLight.",
       'With a peak performance  of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,  which is capable of 93,000 trillion calculations per second.',
       'Summit has 4,608 servers,  which reportedly take up the size of two tennis courts.'],
      dtype='<U178')

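This output shows that we obtained essentially the same sentences as we had obtained using
the other tokenizers. The only difference is that the second sentence keeps a leading space,
because the pattern splits on a single whitespace character and the sample text has two spaces
at that point. This gives us an idea of tokenizing text into sentences using
different NLTK interfaces.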
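The lookbehind pattern used above can be explored with the standard-library re module alone. The sketch below uses re.split as a stand-in for RegexpTokenizer with gaps=True, and the sample sentence and variable names are assumptions for illustration. It shows how the negative lookbehinds protect abbreviations like "Mr." and "U.S.", and how ending the pattern with \s+ instead of \s consumes a run of spaces so that no sentence keeps a leading space.

```python
import re

# Lookbehinds copied from SENTENCE_TOKENS_PATTERN: do not split after
# things like "U.S." or "Mr." or "J.", but do split after '.', '?' or '!'.
LOOKBEHINDS = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)'

# hypothetical sample text with a double space after "office."
text = "Mr. Smith visited the U.S. office.  He left at noon. Was it fun? Yes!"

# splitting on exactly one whitespace character leaves the extra space behind
single_ws = [s for s in re.split(LOOKBEHINDS + r'\s', text) if s]

# splitting on one or more whitespace characters removes it
multi_ws = [s for s in re.split(LOOKBEHINDS + r'\s+', text) if s]

print(single_ws)
print(multi_ws)
```

Passing the \s+ variant to nltk.tokenize.RegexpTokenizer should behave the same way, although calling strip() on each sentence afterwards is an equally reasonable fix.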

### Word Tokenization

Word tokenization is the process of splitting or segmenting sentences into their
constituent words. A sentence is a collection of words and with tokenization we
essentially split a sentence into a list of words that can be used to reconstruct the
sentence. Word tokenization is really important in many processes, especially in
cleaning and normalizing text where operations like stemming and lemmatization work
on each individual word based on its respective stems and lemma. Similar to sentence
tokenization, NLTK provides various useful interfaces for word tokenization.

* word_tokenize
* TreebankWordTokenizer
* ToktokTokenizer
* RegexpTokenizer
* Inherited tokenizers from RegexpTokenizer

#### Default Word Tokenizer

The nltk.word_tokenize(...) function is the default and recommended word
tokenizer, as specified by NLTK. Internally, it first splits the text into sentences and then
applies an instance of the TreebankWordTokenizer class (or a close variant of it, depending on
the NLTK version) to each sentence, acting as a wrapper around that core class.

In [21]:
default_wt = nltk.word_tokenize
words = default_wt(sample_text)
np.array(words)

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the',
       'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight',
       '.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

#### TreebankWordTokenizer

The TreebankWordTokenizer is based on the Penn Treebank and uses various regular
expressions to tokenize the text. Of course, one primary assumption here is that we have already performed sentence tokenization beforehand. Some of the main features of this tokenizer are mentioned here:

* Splits and separates out periods that appear at the end of a sentence
* Splits and separates commas and single quotes when followed by whitespace
* Most punctuation characters are split and separated into independent tokens
* Splits words with standard contractions, such as don’t to do and n’t

In [22]:
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sample_text)
np.array(words)

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the',
       'previous', 'record-holder', 'China', "'s", 'Sunway',
       'TaihuLight.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

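As expected, the output is similar to word_tokenize(), since they use the same
tokenizing mechanism. The main difference is that periods in the middle of the text stay
attached to the preceding word (China., TaihuLight., second.), because word_tokenize() splits
the text into sentences first, while here we passed in the whole text at once.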

#### ToktokTokenizer

ToktokTokenizer is one of the newer tokenizers introduced by NLTK and lives in the
nltk.tokenize.toktok module. The tok-tok tokenizer is a general-purpose tokenizer
that assumes the input has one sentence per line. Hence, only the final period
is tokenized. However, as needed, we can remove the other periods from the words
using regular expressions.

In [23]:
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
words = tokenizer.tokenize(sample_text)
np.array(words)

array(['US', 'unveils', 'world', "'", 's', 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'", 's', 'most', 'powerful',
       'supercomputer', 'called', "'", 'Summit', "'", ',', 'beating',
       'the', 'previous', 'record-holder', 'China', "'", 's', 'Sunway',
       'TaihuLight.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

#### RegexpTokenizer

There are two main parameters that
are useful in tokenization: the regex pattern for building the tokenizer and the gaps
parameter. If gaps is set to True, the pattern is used to find the gaps between the tokens. Otherwise, it
is used to find the tokens themselves.
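Conceptually, the gaps flag switches between two familiar regular expression operations: matching the tokens directly (similar to re.findall) and splitting on the separators (similar to re.split with empty strings dropped). The following standard-library sketch illustrates that idea on a short, made-up sentence. It is an analogy for the behaviour, not a copy of NLTK's code.

```python
import re

# hypothetical sample sentence
text = "It's 3,000 km away."

# gaps=False style: the pattern describes the tokens themselves
tokens_by_match = re.findall(r'\w+', text)

# gaps=True style: the pattern describes what separates the tokens
tokens_by_gap = [tok for tok in re.split(r'\s+', text) if tok]

print(tokens_by_match)
print(tokens_by_gap)
```

The two approaches give different tokens here: matching \w+ drops the apostrophe, comma, and period, while splitting on whitespace keeps them attached to the neighbouring words.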

In [24]:
# pattern to identify tokens themselves
TOKEN_PATTERN = r'\w+'
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=False)

words = regex_wt.tokenize(sample_text)
np.array(words)

array(['US', 'unveils', 'world', 's', 'most', 'powerful', 'supercomputer',
       'beats', 'China', 'The', 'US', 'has', 'unveiled', 'the', 'world',
       's', 'most', 'powerful', 'supercomputer', 'called', 'Summit',
       'beating', 'the', 'previous', 'record', 'holder', 'China', 's',
       'Sunway', 'TaihuLight', 'With', 'a', 'peak', 'performance', 'of',
       '200', '000', 'trillion', 'calculations', 'per', 'second', 'it',
       'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight',
       'which', 'is', 'capable', 'of', '93', '000', 'trillion',
       'calculations', 'per', 'second', 'Summit', 'has', '4', '608',
       'servers', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts'], dtype='<U13')

In [25]:
# pattern to identify tokens by using gaps between tokens
GAP_PATTERN = r'\s+'
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN, gaps=True)

words = regex_wt.tokenize(sample_text)
np.array(words)

array(['US', 'unveils', "world's", 'most', 'powerful', 'supercomputer,',
       'beats', 'China.', 'The', 'US', 'has', 'unveiled', 'the',
       "world's", 'most', 'powerful', 'supercomputer', 'called',
       "'Summit',", 'beating', 'the', 'previous', 'record-holder',
       "China's", 'Sunway', 'TaihuLight.', 'With', 'a', 'peak',
       'performance', 'of', '200,000', 'trillion', 'calculations', 'per',
       'second,', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as',
       'Sunway', 'TaihuLight,', 'which', 'is', 'capable', 'of', '93,000',
       'trillion', 'calculations', 'per', 'second.', 'Summit', 'has',
       '4,608', 'servers,', 'which', 'reportedly', 'take', 'up', 'the',
       'size', 'of', 'two', 'tennis', 'courts.'], dtype='<U14')

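Thus, you can see that RegexpTokenizer can work either from a pattern for the tokens
themselves or from a pattern for the gaps between them. Note that the two patterns used here do not produce identical tokens: \w+ drops punctuation, while splitting on whitespace keeps punctuation attached to words. Let's see how to obtain the token boundaries for each token during the tokenize operation.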

In [26]:
word_indices = list(regex_wt.span_tokenize(sample_text))
print(word_indices)
print(np.array([sample_text[start:end] for start, end in word_indices]))

[(0, 2), (3, 10), (11, 18), (19, 23), (24, 32), (33, 47), (48, 53), (54, 60), (62, 65), (66, 68), (69, 72), (73, 81), (82, 85), (86, 93), (94, 98), (99, 107), (108, 121), (122, 128), (129, 138), (140, 147), (148, 151), (152, 160), (161, 174), (175, 182), (183, 189), (190, 201), (202, 206), (207, 208), (209, 213), (214, 225), (227, 229), (230, 237), (238, 246), (247, 259), (260, 263), (264, 271), (272, 274), (275, 277), (278, 282), (283, 288), (289, 291), (292, 296), (297, 299), (300, 306), (307, 318), (320, 325), (326, 328), (329, 336), (337, 339), (340, 346), (347, 355), (356, 368), (369, 372), (373, 380), (381, 387), (388, 391), (392, 397), (398, 406), (408, 413), (414, 424), (425, 429), (430, 432), (433, 436), (437, 441), (442, 444), (445, 448), (449, 455), (456, 463)]
['US' 'unveils' "world's" 'most' 'powerful' 'supercomputer,' 'beats'
 'China.' 'The' 'US' 'has' 'unveiled' 'the' "world's" 'most' 'powerful'
 'supercomputer' 'called' "'Summit'," 'beating' 'the' 'previous'
 'record-ho
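Token boundaries of this kind can also be computed with the standard library, which helps clarify what the (start, end) pairs mean. The sketch below uses re.finditer on runs of non-whitespace characters as a stand-in for span_tokenize with a whitespace gap pattern. The function name and the short snippet are assumptions for illustration.

```python
import re


def whitespace_spans(text):
    # each match of \S+ is one token; span() gives its (start, end) offsets
    return [match.span() for match in re.finditer(r'\S+', text)]


snippet = "US unveils world's most"
spans = whitespace_spans(snippet)
tokens = [snippet[start:end] for start, end in spans]

print(spans)
print(tokens)
```

The end offset is exclusive, so slicing the original string with each pair recovers the token exactly. This is useful when annotations need to be mapped back onto the untouched source text.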

#### Inherited Tokenizers from RegexpTokenizer

Besides the base RegexpTokenizer class, there are several derived classes that
perform different types of word tokenization. The WordPunctTokenizer uses the pattern
r'\w+|[^\w\s]+' to tokenize sentences into independent alphabetic and
non-alphabetic tokens.

In [27]:
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sample_text)
np.array(words)

array(['US', 'unveils', 'world', "'", 's', 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'", 's', 'most', 'powerful',
       'supercomputer', 'called', "'", 'Summit', "',", 'beating', 'the',
       'previous', 'record', '-', 'holder', 'China', "'", 's', 'Sunway',
       'TaihuLight', '.', 'With', 'a', 'peak', 'performance', 'of', '200',
       ',', '000', 'trillion', 'calculations', 'per', 'second', ',', 'it',
       'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight',
       ',', 'which', 'is', 'capable', 'of', '93', ',', '000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4', ',',
       '608', 'servers', ',', 'which', 'reportedly', 'take', 'up', 'the',
       'size', 'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')
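The effect of the r'\w+|[^\w\s]+' pattern is easy to check in isolation with re.findall. This small sketch runs it on a fragment chosen for illustration: runs of word characters become one token, and runs of consecutive punctuation characters become another.

```python
import re

WORD_PUNCT_PATTERN = r'\w+|[^\w\s]+'

# hypothetical fragment built from pieces of the sample text
fragment = "record-holder China's 200,000 'Summit',"
tokens = re.findall(WORD_PUNCT_PATTERN, fragment)
print(tokens)
```

Note that adjacent punctuation characters are grouped, which is why the closing quote and the comma after Summit come out as the single token "',".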

The WhitespaceTokenizer tokenizes sentences into words based on whitespace, like
tabs, newlines, and spaces.

In [28]:
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sample_text)
np.array(words)

array(['US', 'unveils', "world's", 'most', 'powerful', 'supercomputer,',
       'beats', 'China.', 'The', 'US', 'has', 'unveiled', 'the',
       "world's", 'most', 'powerful', 'supercomputer', 'called',
       "'Summit',", 'beating', 'the', 'previous', 'record-holder',
       "China's", 'Sunway', 'TaihuLight.', 'With', 'a', 'peak',
       'performance', 'of', '200,000', 'trillion', 'calculations', 'per',
       'second,', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as',
       'Sunway', 'TaihuLight,', 'which', 'is', 'capable', 'of', '93,000',
       'trillion', 'calculations', 'per', 'second.', 'Summit', 'has',
       '4,608', 'servers,', 'which', 'reportedly', 'take', 'up', 'the',
       'size', 'of', 'two', 'tennis', 'courts.'], dtype='<U14')

### Building Robust Tokenizers with NLTK and spaCy

For a typical NLP pipeline, I recommend leveraging state-of-the-art libraries like NLTK
and spaCy and using some of their robust utilities to build a custom function to perform
both sentence- and word-level tokenization.

In [0]:
def tokenize_text(text):
  sentences = nltk.sent_tokenize(text)
  word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]

  return word_tokens

In [30]:
sents = tokenize_text(sample_text)
np.array(sents)

array([list(['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China', '.']),
       list(['The', 'US', 'has', 'unveiled', 'the', 'world', "'s", 'most', 'powerful', 'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the', 'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight', '.']),
       list(['With', 'a', 'peak', 'performance', 'of', '200,000', 'trillion', 'calculations', 'per', 'second', ',', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',', 'which', 'is', 'capable', 'of', '93,000', 'trillion', 'calculations', 'per', 'second', '.']),
       list(['Summit', 'has', '4,608', 'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts', '.'])],
      dtype=object)

We can also flatten this nested structure into a single list of word tokens by
leveraging a list comprehension.

In [31]:
words = [word for sentence in sents for word in sentence]
np.array(words)

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has',
       'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the',
       'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight',
       '.', 'With', 'a', 'peak', 'performance', 'of', '200,000',
       'trillion', 'calculations', 'per', 'second', ',', 'it', 'is',
       'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',',
       'which', 'is', 'capable', 'of', '93,000', 'trillion',
       'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608',
       'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size',
       'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

In a similar way, we can leverage spaCy to perform sentence- and word-level
tokenizations really quickly.

In [32]:
# Initially I download two en packages
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_sm

# establish link to packages
!python -m spacy download en

Collecting en_core_web_lg==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.1.0/en_core_web_lg-2.1.0.tar.gz (826.9MB)
     |████████████████████████████████| 826.9MB 3.7MB/s 
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.1.0-cp36-none-any.whl size=828255076 sha256=68d2def0a7cee8a1d6872ff5c9bab4bc8195db9242be27cb1b5c0e2e0f0eb42e
  Stored in directory: /tmp/pip-ephem-wheel-cache-9ea46ttj/wheels/b4/d7/70/426d313a459f82ed5e06cc36a50e2bb2f0ec5cb31d8e0bdf09
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web

In [33]:
import spacy

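# load the small English model by its package name; the 'en' shortcut link and the
# parse/tag/entity keyword arguments are not accepted by spaCy 3 and later
nlp = spacy.load('en_core_web_sm')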
text_spacy = nlp(sample_text)

sents = np.array(list(text_spacy.sents))
sents

array([US unveils world's most powerful supercomputer, beats China.  ,
       The US has unveiled the world's most powerful supercomputer called 'Summit',  beating the previous record-holder China's Sunway TaihuLight.,
       With a peak performance  of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,  which is capable of 93,000 trillion calculations per second.,
       Summit has 4,608 servers,  which reportedly take up the size of two tennis courts.],
      dtype=object)

In [34]:
sent_words = [[word.text for word in sent] for sent in sents]
np.array(sent_words)

array([list(['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China', '.', ' ']),
       list(['The', 'US', 'has', 'unveiled', 'the', 'world', "'s", 'most', 'powerful', 'supercomputer', 'called', "'", 'Summit', "'", ',', ' ', 'beating', 'the', 'previous', 'record', '-', 'holder', 'China', "'s", 'Sunway', 'TaihuLight', '.']),
       list(['With', 'a', 'peak', 'performance', ' ', 'of', '200,000', 'trillion', 'calculations', 'per', 'second', ',', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',', ' ', 'which', 'is', 'capable', 'of', '93,000', 'trillion', 'calculations', 'per', 'second', '.']),
       list(['Summit', 'has', '4,608', 'servers', ',', ' ', 'which', 'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts', '.'])],
      dtype=object)

In [35]:
words = [word.text for word in text_spacy]
np.array(words)

array(['US', 'unveils', 'world', "'s", 'most', 'powerful',
       'supercomputer', ',', 'beats', 'China', '.', ' ', 'The', 'US',
       'has', 'unveiled', 'the', 'world', "'s", 'most', 'powerful',
       'supercomputer', 'called', "'", 'Summit', "'", ',', ' ', 'beating',
       'the', 'previous', 'record', '-', 'holder', 'China', "'s",
       'Sunway', 'TaihuLight', '.', 'With', 'a', 'peak', 'performance',
       ' ', 'of', '200,000', 'trillion', 'calculations', 'per', 'second',
       ',', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway',
       'TaihuLight', ',', ' ', 'which', 'is', 'capable', 'of', '93,000',
       'trillion', 'calculations', 'per', 'second', '.', 'Summit', 'has',
       '4,608', 'servers', ',', ' ', 'which', 'reportedly', 'take', 'up',
       'the', 'size', 'of', 'two', 'tennis', 'courts', '.'], dtype='<U13')

## Removing Accented Characters

Text corpora often contain accented characters/letters, which can get in the way
if you only want to analyze the English language. Hence, we need to make sure that these
characters are converted and standardized into ASCII characters. A simple
example is converting é to e.

In [0]:
import unicodedata

def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text

In [37]:
remove_accented_chars('Sómě Áccěntěd těxt')

'Some Accented text'
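
To see why this works, it can help to look at what NFKD normalization does before the ASCII encoding step. The following is only an illustrative sketch (the helper name `to_ascii` is an assumption, not part of the notebook). It shows an accented letter being decomposed into a base letter plus a combining mark, and it also shows that a letter with no such decomposition, such as ß, is simply dropped rather than transliterated.

```python
import unicodedata

# NFKD splits a precomposed character into its base letter and combining mark(s).
decomposed = unicodedata.normalize('NFKD', '\u00e9')  # '\u00e9' is é
names = [unicodedata.name(ch) for ch in decomposed]
print(names)

def to_ascii(text):
  # hypothetical helper that mirrors the approach used above
  return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

# ß ('\u00df') has no decomposition, so it disappears instead of becoming "ss".
print(to_ascii('Stra\u00dfe'))
```

If such characters matter for your corpus, you would likely need an explicit replacement table before this step.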

## Expanding Contractions

Contractions are shortened versions of words or syllables. These exist in written and
spoken forms. Shortened versions of existing words are created by removing specific
letters and sounds. In the case of English contractions, they are often created by
removing one of the vowels from the word. Examples include “is not” to “isn’t” and
“will not” to “won’t”, where you can notice the apostrophe being used to denote the
contraction and some of the vowels and other letters being removed.

Ideally, you can have a proper mapping for contractions and their corresponding
expansions and then use that to expand all the contractions in your text.

In [0]:
CONTRACTION_MAP = {
  "ain't": "is not",
  "aren't": "are not",
  "can't": "can not",
  "can't've": "cannot have",
  "'cause": "because",
  "could've": "could have",
  "couldn't": "could not",
  "couldn't've": "could not have",
  "didn't": "did not",
  "doesn't": "does not",
  "don't": "do not",
  "hadn't": "had not",
  "hadn't've": "had not have",
  "hasn't": "has not",
  "haven't": "have not",
  "he'd": "he would",
  "he'd've": "he would have",
  "he'll": "he will",
  "he'll've": "he will have",
  "he's": "he is",
  "how'd": "how did",
  "how'd'y": "how do you",
  "how'll": "how will",
  "how's": "how is",
  "I'd": "I would",
  "I'd've": "I would have",
  "I'll": "I will",
  "I'll've": "I will have",
  "I'm": "I am",
  "I've": "I have",
  "i'd": "i would",
  "i'd've": "i would have",
  "i'll": "i will",
  "i'll've": "i will have",
  "i'm": "i am",
  "i've": "i have",
  "isn't": "is not",
  "it'd": "it would",
  "it'd've": "it would have",
  "it'll": "it will",
  "it'll've": "it will have",
  "it's": "it is",
  "let's": "let us",
  "ma'am": "madam",
  "mayn't": "may not",
  "might've": "might have",
  "mightn't": "might not",
  "mightn't've": "might not have",
  "must've": "must have",
  "mustn't": "must not",
  "mustn't've": "must not have",
  "needn't": "need not",
  "needn't've": "need not have",
  "o'clock": "of the clock",
  "oughtn't": "ought not",
  "oughtn't've": "ought not have",
  "shan't": "shall not",
  "sha'n't": "shall not",
  "shan't've": "shall not have",
  "she'd": "she would",
  "she'd've": "she would have",
  "she'll": "she will",
  "she'll've": "she will have",
  "she's": "she is",
  "should've": "should have",
  "shouldn't": "should not",
  "shouldn't've": "should not have",
  "so've": "so have",
  "so's": "so as",
  "that'd": "that would",
  "that'd've": "that would have",
  "that's": "that is",
  "there'd": "there would",
  "there'd've": "there would have",
  "there's": "there is",
  "they'd": "they would",
  "they'd've": "they would have",
  "they'll": "they will",
  "they'll've": "they will have",
  "they're": "they are",
  "they've": "they have",
  "to've": "to have",
  "wasn't": "was not",
  "we'd": "we would",
  "we'd've": "we would have",
  "we'll": "we will",
  "we'll've": "we will have",
  "we're": "we are",
  "we've": "we have",
  "weren't": "were not",
  "what'll": "what will",
  "what'll've": "what will have",
  "what're": "what are",
  "what's": "what is",
  "what've": "what have",
  "when's": "when is",
  "when've": "when have",
  "where'd": "where did",
  "where's": "where is",
  "where've": "where have",
  "who'll": "who will",
  "who'll've": "who will have",
  "who's": "who is",
  "who've": "who have",
  "why's": "why is",
  "why've": "why have",
  "will've": "will have",
  "won't": "will not",
  "won't've": "will not have",
  "would've": "would have",
  "wouldn't": "would not",
  "wouldn't've": "would not have",
  "y'all": "you all",
  "y'all'd": "you all would",
  "y'all'd've": "you all would have",
  "y'all're": "you all are",
  "y'all've": "you all have",
  "you'd": "you would",
  "you'd've": "you would have",
  "you'll": "you will",
  "you'll've": "you will have",
  "you're": "you are",
  "you've": "you have"
}

In [0]:
import re

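def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
  # try longer contractions first so that "can't've" is not matched as "can't"
  keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
  contractions_pattern = re.compile('({})'.format('|'.join(re.escape(key) for key in keys)), flags=re.IGNORECASE | re.DOTALL)

  def expand_match(contraction):
    match = contraction.group(0)
    # look up the exact match first, then fall back to its lowercase form
    expanded_contraction = contraction_mapping.get(match) or contraction_mapping.get(match.lower())
    # keep the capitalization of the original contraction (e.g. "Y'all" -> "You all")
    if match[0].isupper():
      expanded_contraction = match[0] + expanded_contraction[1:]

    return expanded_contraction

  expanded_text = contractions_pattern.sub(expand_match, text)
  # drop any apostrophes left over after expansion
  expanded_text = re.sub("'", "", expanded_text)

  return expanded_text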

The expand_match function, defined inside the main expand_contractions function,
is called for each contraction that matches the regex pattern we create out of all
the contractions in our CONTRACTION_MAP dictionary. On matching any contraction,
we substitute it with its corresponding expanded version and retain the
correct case of the word.

In [40]:
expand_contractions("Y'all can't expand contractions I'd think")

'You all can not expand contractions I would think'
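
One detail to keep in mind when a regex is built by joining dictionary keys with `|`: Python's `re` module tries the alternatives from left to right and takes the first one that matches, so a short key such as can't can shadow a longer key such as can't've unless the longer keys come first. Here is a minimal sketch of the effect, using a hypothetical two-entry map and a hypothetical `build_pattern` helper (neither is part of the notebook):

```python
import re

# Hypothetical two-entry map, used only for illustration.
small_map = {"can't": "can not", "can't've": "cannot have"}

def build_pattern(keys):
  # join the escaped keys into one case-insensitive alternation
  return re.compile('|'.join(re.escape(key) for key in keys), flags=re.IGNORECASE)

# keys in dictionary order: the shorter "can't" is tried first
naive = build_pattern(small_map)
# keys sorted by length, longest first
longest_first = build_pattern(sorted(small_map, key=len, reverse=True))

text = "I can't've known"
naive_match = naive.search(text).group(0)
longest_match = longest_first.search(text).group(0)
print(naive_match)    # can't
print(longest_match)  # can't've
```

Sorting the keys by length is one simple way to make the longest contraction win; other orderings that place longer keys before their prefixes would work as well.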

## Removing Special Characters

Special characters and symbols are usually non-alphanumeric characters, or
occasionally even numeric characters (depending on the problem), which add extra
noise to unstructured text. Usually, simple regular expressions (regexes) can be used to
remove them.

In [0]:
def remove_special_characters(text, remove_digits=False):
  pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
  text = re.sub(pattern, '', text)

  return text

In [42]:
remove_special_characters('Well this was fun! What do you think? 123#@!', remove_digits=True)

'Well this was fun What do you think '

In [43]:
remove_special_characters('Well this was fun! What do you think? 123#@!', remove_digits=False)

'Well this was fun What do you think 123'
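
Because the pattern only keeps ASCII letters, digits, and whitespace, it also deletes accented letters outright. This suggests running accent removal before special character removal. The sketch below illustrates that ordering with two hypothetical helpers, `strip_accents` and `strip_special`, which mirror the functions defined earlier but are renamed here so the example stays self-contained.

```python
import re
import unicodedata

def strip_accents(text):
  # hypothetical stand-in for remove_accented_chars
  return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def strip_special(text):
  # hypothetical stand-in for remove_special_characters with remove_digits=False
  return re.sub(r'[^a-zA-Z0-9\s]', '', text)

sample = 'Caf\u00e9 d\u00e9j\u00e0 vu!'  # 'Café déjà vu!' written with escapes

special_only = strip_special(sample)
accents_first = strip_special(strip_accents(sample))
print(special_only)   # Caf dj vu
print(accents_first)  # Cafe deja vu
```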

## Text Correction

One of the main challenges in text wrangling is the presence of incorrect words
in the text. The definition of incorrect here covers words that have spelling mistakes as
well as words with several repeated letters that do not contribute much to their overall
significance.

The main objective here is to standardize different forms of these words to the correct form
so that we do not end up losing vital information from different tokens in the text. We
cover dealing with repeated characters as well as correcting spellings in this section.

### Correcting Repeating Characters

We just mentioned words that often contain several repeating characters that could be
due to incorrect spellings, slang language, or even people wanting to express strong
emotions.

The first step in our algorithm is to identify repeated characters in a word using
a regex pattern and then use a substitution to remove the characters one by one.

In [44]:
old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1

while True:
  # remove one repeated character
  new_word = repeat_pattern.sub(match_substitution, old_word)

  if new_word != old_word:
    print(f'Step: {step} Word: {new_word}')
    step += 1
    # update old word to last substituted state
    old_word = new_word
    continue
  else:
    print(f'Final word: {new_word}')
    break

Step: 1 Word: finalllyy
Step: 2 Word: finallly
Step: 3 Word: finally
Step: 4 Word: finaly
Final word: finaly


The loop stops at “finaly”, which is incorrect; the correct word is “finally”,
which we obtained in Step 3. We will now utilize the WordNet corpus to check for
a valid word at each stage and terminate the loop as soon as one is found.

In [46]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [48]:
from nltk.corpus import wordnet

old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1

while True:
  # check for semantically correct word
  if wordnet.synsets(old_word):
    print(f'Final correct word: {old_word}')
    break

  # remove one repeated character
  new_word = repeat_pattern.sub(match_substitution, old_word)
  if new_word != old_word:
    print(f'Step: {step} Word: {new_word}')
    step += 1
    # update old word to last substituted state
    old_word = new_word
    continue
  else:
    print(f'Final word: {new_word}')
    break

Step: 1 Word: finalllyy
Step: 2 Word: finallly
Step: 3 Word: finally
Final correct word: finally


We can build a better version of this code by wrapping the logic in a function, as
depicted here, so that it is generic enough to correct each token in a list of tokens.

In [0]:
def remove_repeated_characters(tokens):
  repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
  match_substitution = r'\1\2\3'

  def replace(old_word):
    if wordnet.synsets(old_word):
      return old_word
    new_word = repeat_pattern.sub(match_substitution, old_word)
    return replace(new_word) if new_word != old_word else new_word

  correct_tokens = [replace(word) for word in tokens]
  return correct_tokens

In [50]:
sample_sentence = 'My schooool is realllllyyy amaaazingggg'
correct_tokens = remove_repeated_characters(nltk.word_tokenize(sample_sentence))
' '.join(correct_tokens)

'My school is really amazing'

We can see from this output that our function performs as intended and replaces the repeating characters in each token, giving us correct tokens as desired.
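
The same idea does not depend on WordNet specifically; any test for "is this a valid word" can end the loop. As a minimal sketch, assuming a small hypothetical vocabulary set in place of WordNet and a hypothetical helper named `collapse_repeats`, the logic can also be written iteratively:

```python
import re

REPEAT_PATTERN = re.compile(r'(\w*)(\w)\2(\w*)')

def collapse_repeats(word, is_valid):
  # remove one repeated character at a time until the word is valid
  # or there is nothing left to remove
  while not is_valid(word):
    shorter = REPEAT_PATTERN.sub(r'\1\2\3', word)
    if shorter == word:
      break
    word = shorter
  return word

# Hypothetical vocabulary standing in for the WordNet lookup.
vocab = {'my', 'school', 'is', 'really', 'amazing'}
tokens = 'My schooool is realllllyyy amaaazingggg'.split()
result = [collapse_repeats(token, lambda w: w.lower() in vocab) for token in tokens]
print(' '.join(result))
```

Note that the quality of the result depends entirely on the validity check: with a vocabulary that lacks the intended word, the loop keeps removing characters until no repeats remain.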

### Correcting Spellings

The second problem we face with words is misspellings that occur due to
human error or even machine-based errors, which you might have seen with features
like auto-correcting text.

