
# Text Preprocessing and Wrangling

Text wrangling (also called preprocessing or normalization) is a series of steps that clean and standardize textual data into a form that can be consumed by downstream NLP and other intelligent systems powered by machine learning and deep learning.

Common techniques for preprocessing include:
* cleaning text,
* tokenizing text,
* removing special characters,
* case conversion,
* correcting spellings,
* removing stopwords and other unnecessary terms,
* stemming,
* and lemmatization.

The key idea is to remove unnecessary content from one or more text documents in a corpus (or corpora) and get clean text documents.
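
As a rough illustration of how a few of these steps chain together, here is a minimal sketch using only the standard library. The basic_clean function and its tiny STOPWORDS set are hypothetical names made up for this example; a real pipeline would use a proper stopword list and tokenizer.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"the", "a", "an", "is", "of", "and"}

def basic_clean(text):
    """Lowercase, drop special characters, split on whitespace, remove stopwords."""
    text = text.lower()                        # case conversion
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace special characters with spaces
    tokens = text.split()                      # naive whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]

print(basic_clean("The Quick, brown fox is an EXAMPLE of noise!!"))
```

Each step here is intentionally crude; it only shows the order in which such operations are commonly applied.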

## Removing HTML Tags

Often, unstructured text contains a lot of noise, especially if you use techniques
like web scraping or screen scraping to retrieve data from web pages, blogs, and
online repositories. HTML tags, JavaScript, and iframe tags typically don’t add much
value to understanding and analyzing text. Our main intent is to extract meaningful
textual content from the data extracted from the web.

Let’s look at a section of a web page showing the King James version of the Bible.
<img src='https://github.com/rahiakela/img-repo/blob/master/text-analytics-with-python/bible.PNG?raw=1' width='800'/>

We will now leverage requests and retrieve the contents of this web page in Python.
This is known as web scraping and the following code helps us achieve this.

In [3]:
import requests

data = requests.get('http://www.gutenberg.org/cache/epub/8001/pg8001.html')
content = data.content
content[1163:2200]

b'content="Ebookmaker 0.4.0a5 by Marcello Perathoner &lt;webmaster@gutenberg.org&gt;" name="generator"/>\r\n</head>\r\n  <body><p id="id00000">Project Gutenberg EBook The Bible, King James, Book 1: Genesis</p>\r\n\r\n<p id="id00001">Copyright laws are changing all over the world. Be sure to check the\r\ncopyright laws for your country before downloading or redistributing\r\nthis or any other Project Gutenberg eBook.</p>\r\n\r\n<p id="id00002">This header should be the first thing seen when viewing this Project\r\nGutenberg file.  Please do not remove it.  Do not change or edit the\r\nheader without written permission.</p>\r\n\r\n<p id="id00003">Please read the "legal small print," and other information about the\r\neBook and Project Gutenberg at the bottom of this file.  Included is\r\nimportant information about your specific rights and restrictions in\r\nhow the file may be used.  You can also find out about how to make a\r\ndonation to Project Gutenberg, and how to get involved.</p>

We can clearly see from the preceding output that it is extremely difficult to decipher the actual textual content in the web page, due to all the unnecessary HTML tags. We need to remove those tags. 

The BeautifulSoup library provides us with some handy
functions that help us remove these unnecessary tags with ease.

In [0]:
import re
from bs4 import BeautifulSoup

In [0]:
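def strip_html_tags(text):
  soup = BeautifulSoup(text, 'html.parser')

  # remove iframe and script elements along with their contents
  for tag in soup(['iframe', 'script']):
    tag.decompose()
  stripped_text = soup.get_text()
  # collapse runs of carriage returns and newlines into a single newline
  # (the character class must not contain '|', or literal pipes get replaced too)
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)

  return stripped_text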

In [9]:
clean_content = strip_html_tags(content)
clean_content[1163:2045]

"*** START OF THE PROJECT GUTENBERG EBOOK, THE BIBLE, KING JAMES, BOOK 1***\nThis eBook was produced by David Widger\nwith the help of Derek Andrew's text from January 1992\nand the work of Bryan Taylor in November 2002.\nBook 01        Genesis\n01:001:001 In the beginning God created the heaven and the earth.\n01:001:002 And the earth was without form, and void; and darkness was\n           upon the face of the deep. And the Spirit of God moved upon\n           the face of the waters.\n01:001:003 And God said, Let there be light: and there was light.\n01:001:004 And God saw the light, that it was good: and God divided the\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0light from the darkness.\n01:001:005 And God called the light Day, and the darkness he called\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0Night. And the evening and the morning were the first day.\n01:001:006 And God said, Let there be a firmament in the midst of the\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0waters,"

You can compare this output with the raw web page content and see that we have
successfully removed the unnecessary HTML tags. We now have a clean body of text
that’s easier to interpret and understand.
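
To see the same idea on a small input without any network access, the following sketch uses the standard library's html.parser module rather than BeautifulSoup. The TextExtractor class, the strip_tags_stdlib function, and the sample_html string are hypothetical names invented for this illustration, and the sketch is far less robust than BeautifulSoup on malformed markup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <iframe>."""
    SKIP = {"script", "iframe"}

    def __init__(self):
        super().__init__()
        self.parts = []       # text fragments found outside skipped elements
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def strip_tags_stdlib(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

# Hypothetical snippet loosely modeled on the page structure shown earlier.
sample_html = ('<body><p id="id00000">Genesis</p>'
               '<script>var x = 1;</script>'
               '<p>In the beginning</p></body>')
print(strip_tags_stdlib(sample_html))
```

The script body is dropped and only the two paragraph texts remain, which mirrors what strip_html_tags does with BeautifulSoup.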

## Text Tokenization

Tokenization is the process of breaking down or splitting textual data into smaller, more meaningful components called tokens. The most popular techniques are sentence and word tokenization, which break a text document (or corpus) into sentences and each sentence into words.

### Sentence Tokenization

Sentence tokenization is the process of splitting a text corpus into sentences, which act as the first level of tokens in the corpus. This is also known as sentence segmentation, since we try to segment the text into meaningful sentences.

There are various ways to perform sentence tokenization. Basic techniques look for specific delimiters between sentences, such as:
* a period (.),
* a newline character (\n),
* and sometimes even a semicolon (;).
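
A delimiter-based splitter of this kind can be sketched with the standard re module. The naive_sentence_split function below is a hypothetical helper written for this illustration, not part of NLTK; it splits after a period, question mark, or exclamation mark followed by whitespace, and on newlines.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '?' or '!' followed by whitespace, or on newlines."""
    parts = re.split(r"(?<=[.?!])\s+|\n+", text)
    # drop empty fragments and surrounding whitespace
    return [p.strip() for p in parts if p.strip()]

print(naive_sentence_split("Dr. Smith arrived late. Was the talk over? No!"))
```

Note that this approach splits after the abbreviation "Dr." as well, which is exactly the kind of mistake that motivates the trained tokenizers covered next.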

We will use the NLTK framework, which provides various interfaces for performing sentence tokenization. We primarily focus on the
following sentence tokenizers:
* sent_tokenize
* Pretrained sentence tokenization models
* PunktSentenceTokenizer
* RegexpTokenizer

In [13]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint
import numpy as np
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [0]:
# loading text corpora
alice = gutenberg.raw(fileids='carroll-alice.txt')

In [16]:
sample_text = '''US unveils world's most powerful supercomputer, beats China. \
 The US has unveiled the world's most powerful supercomputer called 'Summit', \
 beating the previous record-holder China's Sunway TaihuLight. With a peak performance \
 of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, \
 which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, \
 which reportedly take up the size of two tennis courts.'''
sample_text

"US unveils world's most powerful supercomputer, beats China.  The US has unveiled the world's most powerful supercomputer called 'Summit',  beating the previous record-holder China's Sunway TaihuLight. With a peak performance  of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,  which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers,  which reportedly take up the size of two tennis courts."

In [17]:
# Total characters in Alice in Wonderland
len(alice)

144395

In [18]:
# First 100 characters in the corpus
alice[:100]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was"

#### Default Sentence Tokenizer

The nltk.sent_tokenize(...) function is the default sentence tokenization function
that NLTK recommends, and it uses an instance of the PunktSentenceTokenizer class
internally. However, this is not just a freshly constructed instance of that class: it is
loaded from a pretrained model, and NLTK ships such pretrained models for several
popular languages besides English.

In [20]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [22]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)
print(f'Total sentences in sample_text: {len(sample_sentences)}')
print(f'Sample text sentences :-\n{np.array(sample_sentences)}')

Total sentences in sample_text: 4
Sample text sentences :-
["US unveils world's most powerful supercomputer, beats China."
 "The US has unveiled the world's most powerful supercomputer called 'Summit',  beating the previous record-holder China's Sunway TaihuLight."
 'With a peak performance  of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,  which is capable of 93,000 trillion calculations per second.'
 'Summit has 4,608 servers,  which reportedly take up the size of two tennis courts.']


In [23]:
print(f'\nTotal sentences in alice: {len(alice_sentences)}')
print(f'First 5 sentences in alice:- \n {np.array(alice_sentences[:5])}')


Total sentences in alice: 1625
First 5 sentences in alice:- 
 ["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I."
 "Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'"
 'So she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.'
 "There was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!"
 'Oh dear!']


As you can see, the tokenizer does not simply split on periods; it also takes other punctuation and the capitalization of words into account. It is not infallible, though: in the output above, the chapter heading "CHAPTER I." is treated as the end of a sentence.

#### Pretrained Sentence Tokenizer Models

Suppose we were dealing with German text. We can use sent_tokenize, which
is already trained, or load a pretrained tokenization model on German text into a PunktSentenceTokenizer instance and perform the same operation. The following
snippet shows this. We start by loading a German text corpus and inspecting it.

In [25]:
nltk.download('europarl_raw')

[nltk_data] Downloading package europarl_raw to /root/nltk_data...
[nltk_data]   Unzipping corpora/europarl_raw.zip.


True

In [27]:
from nltk.corpus import europarl_raw

german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# Total characters in the corpus
print(len(german_text))
# First 100 characters in the corpus
german_text[:100]

157171


' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit'

Next, we tokenize the text corpus into sentences using the default
sent_tokenize(...) tokenizer and a pretrained German language tokenizer by loading it
from the NLTK resources.

In [0]:
# default sentence tokenizer
german_sentences_def = default_st(text=german_text, language='german')

# loading german text tokenizer into a PunktSentenceTokenizer instance
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

We can now verify the type of our German tokenizer and check whether the results
obtained by using the two tokenizers match.

In [30]:
# verify the type of german_tokenizer, should be PunktSentenceTokenizer
type(german_tokenizer)

nltk.tokenize.punkt.PunktSentenceTokenizer

In [32]:
# check if results of both tokenizers match , should be True
(german_sentences_def == german_sentences)

True

Thus we see that german_tokenizer is indeed an instance of
PunktSentenceTokenizer, loaded from a model pretrained on German text. We
also checked whether the sentences obtained from the default tokenizer are the same as the
sentences obtained by this pretrained tokenizer. As expected, they are the same (True).

In [33]:
# print first 5 sentences of the corpus
np.array(german_sentences[:5])

array([' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .',
       'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .',
       'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .',
       'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .',
       'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .'],
      dtype='<U259')

Thus we see that our assumption was indeed correct and you can tokenize sentences
belonging to different languages in two different ways.

#### PunktSentenceTokenizer

In [34]:
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
np.array(sample_sentences)

array(["US unveils world's most powerful supercomputer, beats China.",
       "The US has unveiled the world's most powerful supercomputer called 'Summit',  beating the previous record-holder China's Sunway TaihuLight.",
       'With a peak performance  of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,  which is capable of 93,000 trillion calculations per second.',
       'Summit has 4,608 servers,  which reportedly take up the size of two tennis courts.'],
      dtype='<U178')

#### RegexpTokenizer

The last sentence tokenizer we cover is the RegexpTokenizer class, which
tokenizes text into sentences using specific regular expression-based patterns.

In [36]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
np.array(sample_sentences)

array(["US unveils world's most powerful supercomputer, beats China.",
       " The US has unveiled the world's most powerful supercomputer called 'Summit',  beating the previous record-holder China's Sunway TaihuLight.",
       'With a peak performance  of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight,  which is capable of 93,000 trillion calculations per second.',
       'Summit has 4,608 servers,  which reportedly take up the size of two tennis courts.'],
      dtype='<U178')

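This output shows that we obtained essentially the same sentences as with
the other tokenizers. The one difference is that the second sentence keeps a leading
space, because the pattern consumes only a single whitespace character and sample_text
has two spaces at that position. This gives us an idea of tokenizing text into sentences
using different NLTK interfaces.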
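
To experiment with the pattern outside NLTK, the sketch below applies the same regular expression with re.split from the standard library. This assumes that RegexpTokenizer with gaps=True treats matches as separators in much the same way re.split does. The text string is a made-up example, and stripping the fragments afterwards is one possible way to normalize stray whitespace.

```python
import re

# Same pattern as in the RegexpTokenizer cell.
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

# Hypothetical sample with a double space and an abbreviation.
text = "Summit is fast.  It has 4,608 servers. Mr. Smith agrees."

raw = re.split(SENTENCE_TOKENS_PATTERN, text)
clean = [s.strip() for s in raw if s.strip()]  # remove leftover whitespace

print(raw)
print(clean)
```

The lookbehind for a capital letter, a lowercase letter, and a period keeps "Mr." from ending a sentence, while the double space leaves a leading blank in the raw result until it is stripped.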

### Word Tokenization