Our goal is to start from a chunk of text (not to be confused with text chunking), meaning a single long, unprocessed string, and end up with a list (or several lists) of cleaned tokens that are useful for further text mining and natural language processing tasks.

- NLTK - The Natural Language Toolkit is one of the best-known and most-used NLP libraries in the Python ecosystem, useful for all sorts of tasks, from tokenization to stemming and beyond.

- BeautifulSoup - BeautifulSoup is a useful library for extracting data from HTML and XML documents.

In [1]:
# Import necessary libraries.
import re, string, unicodedata
import nltk                                   # Natural Language Toolkit

!pip install contractions
import contractions

# One-time downloads of the NLTK data used below (newer NLTK releases may also ask for 'punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from bs4 import BeautifulSoup                 # HTML/XML parsing library that can use different parsers.
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet    # Stopword and WordNet corpora
from nltk.stem import LancasterStemmer, WordNetLemmatizer



We need some sample text. We'll start with something very small and artificial in order to easily see the results of what we are doing step by step.

In [0]:
text = """<h1>This is the title</h1>
            <b>This is bold text</b>
            <i>This is italicized Text</i>
            <img src="another html tag"/>
            <a href="Apart from the others"> This is also here!</a>
            “Love all, trust a few, do wrong to none.” 
            ― William Shakespeare, All's Well That Ends Well

            “All the world's a stage,
            And all the men and women merely players;
            They have their exits and their entrances;
            And one man in his time plays many parts,
            His acts being seven ages.” 
            ― William Shakespeare, As You Like It

            "How old are you," asked Jem, "four-and-a-half?"

            "Goin' on seven."

            "Shoot no wonder, then," said Jem, jerking his thumb at me. "Scout yonder's been readin' ever since she was born, 
            and she ain't even started to school yet. You look right puny for goin' on seven."

            "I'm little but I'm old," he said.
            - To Kill a Mockingbird

            Le dîner, Clémence, Anaïs, Raphaël, Voilà !

            something... is! not right() with.,; this :: line.
            
            &nbsp;&nbsp;
            
            11    42   1024   2048
            {{There are double curly braces.}}
            {Here are single curly braces.}
            </body>
            </html>"""

# Noise Removal

Let's define noise removal as text-specific normalization tasks which often take place prior to tokenization. 
- While the other 2 major steps of the preprocessing framework (tokenization and normalization) are basically task-independent, noise removal is much more task-specific.

Noise removal tasks could include:

- Removing text file headers, footers
- Removing HTML, XML, etc. markup and metadata
- Extracting valuable data from other formats, such as CSV.

In [3]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")    # Removing HTML tags
    return soup.get_text()

def denoise_text(text):
    text = strip_html(text)
    # Other steps can be added here as needed, e.g., code to remove strings inside curly braces.
    return text

text = denoise_text(text)
print(text)

This is the title
This is bold text
This is italicized Text

 This is also here!
            “Love all, trust a few, do wrong to none.” 
            ― William Shakespeare, All's Well That Ends Well

            “All the world's a stage,
            And all the men and women merely players;
            They have their exits and their entrances;
            And one man in his time plays many parts,
            His acts being seven ages.” 
            ― William Shakespeare, As You Like It

            "How old are you," asked Jem, "four-and-a-half?"

            "Goin' on seven."

            "Shoot no wonder, then," said Jem, jerking his thumb at me. "Scout yonder's been readin' ever since she was born, 
            and she ain't even started to school yet. You look right puny for goin' on seven."

            "I'm little but I'm old," he said.
            - To Kill a Mockingbird

            Le dîner, Clémence, Anaïs, Raphaël, Voilà !

            something... is! not right() with.,; th
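The comment in `denoise_text` mentions removing strings inside curly braces as a possible extra step. A minimal sketch of that step is shown here, assuming the braces are not nested beyond the `{{...}}` form in the sample text. The helper name `remove_curly_brace_content` is hypothetical, and the notebook itself does not apply this step.

```python
import re

def remove_curly_brace_content(text):
    """Drop any {{...}} or {...} span, including the braces (hypothetical helper)."""
    # Remove double-brace spans first so that no stray braces are left behind.
    text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)
    # Then remove the remaining single-brace spans.
    return re.sub(r"\{.*?\}", "", text, flags=re.DOTALL)

sample = "keep this {{There are double curly braces.}} and this {Here are single curly braces.} too"
print(remove_curly_brace_content(sample))
```

If used, it could be called inside `denoise_text` right after `strip_html`. The surrounding whitespace is left in place, which is harmless because the tokenizer ignores it.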

Expanding contractions is not mandatory before tokenization, but it helps:
- Replacing contractions with their expansions is useful at this point, because our word tokenizer splits words like "didn't" into "did" and "n't".
- This can be remedied after tokenization, but doing it beforehand is easier and more straightforward.

In [4]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

text = replace_contractions(text)
print(text)

This is the title
This is bold text
This is italicized Text

 This is also here!
            “Love all, trust a few, do wrong to none.” 
            ― William Shakespeare, All's Well That Ends Well

            “All the world's a stage,
            And all the men and women merely players;
            They have their exits and their entrances;
            And one man in his time plays many parts,
            His acts being seven ages.” 
            ― William Shakespeare, As You Like It

            "How old are you," asked Jem, "four-and-a-half?"

            "going on seven."

            "Shoot no wonder, then," said Jem, jerking his thumb at me. "Scout yonder's been readin' ever since she was born, 
            and she are not even started to school yet. You look right puny for going on seven."

            "I am little but I am old," he said.
            - To Kill a Mockingbird

            Le dîner, Clémence, Anaïs, Raphaël, Voilà !

            something... is! not right() with.,
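To make the idea behind contraction expansion concrete, here is a minimal lookup-based sketch using only the standard library. It is not how the `contractions` package is implemented. The mapping `CONTRACTION_MAP` and the function `expand_contractions` are hypothetical names, and the mapping covers only a handful of forms.

```python
import re

# Tiny, hypothetical mapping for illustration; a real library covers far more cases.
CONTRACTION_MAP = {
    "i'm": "I am",
    "didn't": "did not",
    "can't": "cannot",
    "it's": "it is",
}

# One alternation of all known contractions, matched on word boundaries, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expansion (original casing is not preserved)."""
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_contractions("I'm little but I'm old, and I didn't know."))
```

A fixed lookup like this cannot resolve ambiguous forms from context. The library output above shows the same limitation: "she ain't" was expanded to "she are not".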

# Tokenization
 
- Tokenization is a step which splits longer strings of text into smaller pieces, or tokens. 
- Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. 
- Further processing is generally performed after a piece of text has been appropriately tokenized. 
- Tokenization is also referred to as text segmentation or lexical analysis.
- Sometimes segmentation is used to refer to the breakdown of a large chunk of text into pieces larger than words (e.g. paragraphs or sentences), while tokenization is reserved for the breakdown process which results exclusively in words.
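The two levels described above, splitting text into sentences and then sentences into words, can be sketched with plain regular expressions. This is a deliberately naive stand-in and not NLTK's algorithm. The function names `naive_sent_tokenize` and `naive_word_tokenize` are hypothetical, and the sentence splitter will break on abbreviations such as "Dr.".

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive stand-in)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(sentence):
    """Return words (allowing internal apostrophes or hyphens) and single punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

paragraph = "They have their exits and their entrances. One man plays many parts!"
sentences = naive_sent_tokenize(paragraph)
tokens = [naive_word_tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

Unlike this sketch, NLTK's `word_tokenize` splits contractions into separate tokens, which is why the contractions were expanded before tokenizing.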

### For our task, we will tokenize our sample text into a list of words. This is done using NLTK's word_tokenize() function.

In [5]:
words = nltk.word_tokenize(text)     # list of words.
print(words)
print('Number of words is: ', len(words))

['This', 'is', 'the', 'title', 'This', 'is', 'bold', 'text', 'This', 'is', 'italicized', 'Text', 'This', 'is', 'also', 'here', '!', '“', 'Love', 'all', ',', 'trust', 'a', 'few', ',', 'do', 'wrong', 'to', 'none.', '”', '―', 'William', 'Shakespeare', ',', 'All', "'s", 'Well', 'That', 'Ends', 'Well', '“', 'All', 'the', 'world', "'s", 'a', 'stage', ',', 'And', 'all', 'the', 'men', 'and', 'women', 'merely', 'players', ';', 'They', 'have', 'their', 'exits', 'and', 'their', 'entrances', ';', 'And', 'one', 'man', 'in', 'his', 'time', 'plays', 'many', 'parts', ',', 'His', 'acts', 'being', 'seven', 'ages.', '”', '―', 'William', 'Shakespeare', ',', 'As', 'You', 'Like', 'It', '``', 'How', 'old', 'are', 'you', ',', "''", 'asked', 'Jem', ',', '``', 'four-and-a-half', '?', "''", '``', 'going', 'on', 'seven', '.', "''", '``', 'Shoot', 'no', 'wonder', ',', 'then', ',', "''", 'said', 'Jem', ',', 'jerking', 'his', 'thumb', 'at', 'me', '.', '``', 'Scout', 'yonder', "'s", 'been', 'readin', "'", 'ever', 'si

# Normalization

- Normalization means putting all text on a consistent footing: converting all text to the same case (upper or lower), removing punctuation, and so on.

- Steps:
  - Removal of non-ASCII characters.
  - Conversion of all characters to lowercase.
  - Removal of punctuation.
  - Stop word removal.
  - Stemming / lemmatization.

- After tokenization, we are no longer working at a text level, but now at a word level. Our normalization functions, shown below, reflect this. Function names and comments should provide the necessary insight into what each does.
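The first step in the list, removal of non-ASCII characters, relies on Unicode decomposition. The following sketch isolates that one step on single tokens: NFKD normalization splits an accented letter into a base letter plus a combining mark, and encoding to ASCII with `'ignore'` then drops the mark. The helper name `strip_accents_to_ascii` is hypothetical.

```python
import unicodedata

def strip_accents_to_ascii(word):
    """Decompose accented characters, then drop whatever has no ASCII equivalent."""
    decomposed = unicodedata.normalize("NFKD", word)  # 'î' becomes 'i' + combining circumflex
    return decomposed.encode("ascii", "ignore").decode("ascii")

for token in ["dîner", "Clémence", "Voilà", "“"]:
    print(repr(token), "->", repr(strip_accents_to_ascii(token)))
```

Characters with no ASCII decomposition, such as the curly quote in the last token, are removed entirely and leave an empty string behind. A later filtering step should discard such empty tokens.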

We convert all words to lowercase and remove punctuation.

**Stemming:** Converting words into their base or stem form (Ex - related forms such as tastefully and tasty are meant to collapse toward a shared stem such as 'tasti'; the exact stems depend on the stemmer used and need not be dictionary words). This reduces the vector dimension because similar words are no longer counted as separate features.

**Stopwords:** Stopwords are very common words that can be removed without changing the sentiment of the sentence.

Ex - **This pasta is so tasty** ==> **pasta tasty** (This, is, and so are stopwords, so they are removed)
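The pasta example can be reproduced with a few lines of plain Python. This is only a sketch: `STOP_WORDS` is a small hypothetical set chosen for the example, whereas NLTK's English stop word list is much longer, and `drop_stop_words` is a hypothetical helper name.

```python
# Small, hypothetical stop word set for illustration only.
STOP_WORDS = {"this", "is", "so", "the", "a", "an"}

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "This pasta is so tasty".split()
print(drop_stop_words(tokens))  # ['pasta', 'tasty']
```

Comparing on the lowercase form matters here: stop word lists are typically lowercase, so "This" would survive a case-sensitive check.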

Hint:

- Use regular expressions to remove punctuation.
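As a small illustration of the hint, the sketch below applies the character class `[^\w\s]` to a few tokens to show what it does and does not remove. The helper name `strip_punctuation` is hypothetical.

```python
import re

def strip_punctuation(tokens):
    """Delete characters that are neither word characters nor whitespace; drop emptied tokens."""
    cleaned = (re.sub(r"[^\w\s]", "", t) for t in tokens)
    return [t for t in cleaned if t]

print(strip_punctuation(["four-and-a-half", "?", "none.", "dîner", "snake_case"]))
```

Two details are worth noting. In Python 3, `\w` matches Unicode letters, so accented words survive this step. It also matches the underscore, which is therefore not treated as punctuation. Hyphens are removed, so hyphenated words are fused into a single token.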

To see all the steps, run the cell below.

In [6]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)        # Append processed words to new list.
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = word.lower()           # Converting to lowercase
        new_words.append(new_word)        # Append processed words to new list.
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)    # Append processed words to new list.
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    stop_words = set(stopwords.words('english'))   # Load the stop word list once, as a set for fast lookup.
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        if word not in stop_words:
            new_words.append(word)        # Append processed words to new list.
    return new_words

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []                            # Create empty list to store pre-processed words.
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)                # Append processed words to new list.
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []                           # Create empty list to store pre-processed words.
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)              # Append processed words to new list.
    return lemmas

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    return words

words = normalize(words)
print(words)
print('Number of words is: ', len(words))

['title', 'bold', 'text', 'italicized', 'text', 'also', 'love', 'trust', 'wrong', 'none', 'william', 'shakespeare', 'well', 'ends', 'well', 'world', 'stage', 'men', 'women', 'merely', 'players', 'exits', 'entrances', 'one', 'man', 'time', 'plays', 'many', 'parts', 'acts', 'seven', 'ages', 'william', 'shakespeare', 'like', 'old', 'asked', 'jem', 'fourandahalf', 'going', 'seven', 'shoot', 'wonder', 'said', 'jem', 'jerking', 'thumb', 'scout', 'yonder', 'readin', 'ever', 'since', 'born', 'even', 'started', 'school', 'yet', 'look', 'right', 'puny', 'going', 'seven', 'little', 'old', 'said', 'kill', 'mockingbird', 'le', 'diner', 'clemence', 'anais', 'raphael', 'voila', 'something', 'right', 'line', '11', '42', '1024', '2048', 'double', 'curly', 'braces', 'single', 'curly', 'braces']
Number of words is:  86


In [7]:
def stem_and_lemmatize(words):
    stems = stem_words(words)
    lemmas = lemmatize_verbs(words)
    return stems, lemmas

stems, lemmas = stem_and_lemmatize(words)
print('Stemmed:\n', stems)
print('\nLemmatized:\n', lemmas)

Stemmed:
 ['titl', 'bold', 'text', 'it', 'text', 'also', 'lov', 'trust', 'wrong', 'non', 'william', 'shakespear', 'wel', 'end', 'wel', 'world', 'stag', 'men', 'wom', 'mer', 'play', 'exit', 'ent', 'on', 'man', 'tim', 'play', 'many', 'part', 'act', 'sev', 'ag', 'william', 'shakespear', 'lik', 'old', 'ask', 'jem', 'fourandahalf', 'going', 'sev', 'shoot', 'wond', 'said', 'jem', 'jerk', 'thumb', 'scout', 'yond', 'readin', 'ev', 'sint', 'born', 'ev', 'start', 'school', 'yet', 'look', 'right', 'puny', 'going', 'sev', 'littl', 'old', 'said', 'kil', 'mockingbird', 'le', 'din', 'cle', 'ana', 'raphael', 'voil', 'someth', 'right', 'lin', '11', '42', '1024', '2048', 'doubl', 'cur', 'brac', 'singl', 'cur', 'brac']

Lemmatized:
 ['title', 'bold', 'text', 'italicize', 'text', 'also', 'love', 'trust', 'wrong', 'none', 'william', 'shakespeare', 'well', 'end', 'well', 'world', 'stage', 'men', 'women', 'merely', 'players', 'exit', 'entrance', 'one', 'man', 'time', 'play', 'many', 'part', 'act', 'seven', '

This returns two new lists: one of stemmed tokens and one of tokens lemmatized as verbs. As the output shows, stems such as 'titl' and 'lov' need not be real words, whereas lemmas such as 'italicize' are dictionary forms. Depending on your upcoming NLP task or preference, one of these may be more appropriate than the other.