Our goal is to start from what we will call a chunk of text (not to be confused with text chunking), meaning a lengthy, unprocessed single string, and end up with a list (or several lists) of cleaned tokens that are useful for further text mining and/or natural language processing tasks.

We will use the following libraries:

- NLTK - The Natural Language Toolkit is one of the best-known and most-used NLP libraries in the Python ecosystem, useful for tasks ranging from tokenization to stemming and beyond.

- BeautifulSoup - A library for extracting data from HTML and XML documents.

In [0]:
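# Install the contractions package first (notebook shell command).
!pip install contractions

# Import necessary libraries.
import re
import string
import unicodedata

import contractions                           # Expands contractions such as "I'm" -> "I am"
import nltk                                   # Natural Language Toolkit
from bs4 import BeautifulSoup                 # HTML/XML parsing library that can use different parsers
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet    # Stopwords and WordNet corpora
from nltk.stem import LancasterStemmer, WordNetLemmatizer

# NLTK data used by the tokenizer, the stopword list, and the lemmatizer
# (newer NLTK releases may also need "punkt_tab").
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource)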

We need some sample text. We'll start with something very small and artificial in order to easily see the results of what we are doing step by step.

In [0]:
text = """<h1>This is the title</h1>
            <b>This is bold text</b>
            <i>This is italicized Text</i>
            <img src="another html tag"/>
            <a href="Apart from the others"> This is also here!</a>
            “Love all, trust a few, do wrong to none.” 
            ― William Shakespeare, All's Well That Ends Well

            “All the world's a stage,
            And all the men and women merely players;
            They have their exits and their entrances;
            And one man in his time plays many parts,
            His acts being seven ages.” 
            ― William Shakespeare, As You Like It

            "How old are you," asked Jem, "four-and-a-half?"

            "Goin' on seven."

            "Shoot no wonder, then," said Jem, jerking his thumb at me. "Scout yonder's been readin' ever since she was born, 
            and she ain't even started to school yet. You look right puny for goin' on seven."

            "I'm little but I'm old," he said.
            - To Kill a Mockingbird

            Le dîner, Clémence, Anaïs, Raphaël, Voilà !

            something... is! not right() with.,; this :: line.
            
            &nbsp;&nbsp;
            
            11    42   1024   2048
            {{There are double curly braces.}}
            {Here are single curly braces.}
            </body>
            </html>"""

# Noise Removal

Let's define noise removal as the text-specific normalization tasks that often take place prior to tokenization.
- While the other two major steps of the preprocessing framework (tokenization and normalization) are largely task-independent, noise removal is much more task-specific.

Noise removal tasks could include:

- Removing text file headers and footers
- Removing HTML, XML, and similar markup and metadata
- Extracting valuable data from other formats, such as CSV

In [0]:
# Write the code to remove all the HTML tags from the text string, and print the processed text.
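One possible way to approach this step, as a minimal sketch: it assumes BeautifulSoup's built-in `html.parser` backend is adequate for the sample text. The helper name `strip_html` and the short `sample` string are illustrative and not part of the notebook.

```python
from bs4 import BeautifulSoup


def strip_html(raw_text):
    """Return the text content of raw_text with HTML tags removed.

    strip_html is a hypothetical helper name used only for this sketch.
    """
    soup = BeautifulSoup(raw_text, "html.parser")
    return soup.get_text()


# Small stand-in for the notebook's `text` variable.
sample = '<h1>This is the title</h1>\n<b>This is bold text</b>\n<img src="another html tag"/>'
cleaned = strip_html(sample)
print(cleaned)
```

Text stored inside tag attributes (such as the `src` value above) disappears together with the tag. Entities such as `&nbsp;` are decoded to their characters rather than removed, so later normalization steps still have work to do.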


Expanding contractions is not mandatory at this stage, but doing it before tokenization has advantages:
- Replacing contractions with their expansions is beneficial at this point, since our word tokenizer will split a word like "didn't" into "did" and "n't".
- It is possible to remedy this after tokenization, but doing so beforehand is easier and more straightforward.

In [0]:
# Write the code to expand all the contractions (I'm ==> I am, and so on). [Hint: use the contractions library.]


# Tokenization
 
- Tokenization is a step which splits longer strings of text into smaller pieces, or tokens. 
- Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. 
- Further processing is generally performed after a piece of text has been appropriately tokenized. 
- Tokenization is also referred to as text segmentation or lexical analysis.
- Sometimes segmentation is used to refer to the breakdown of a large chunk of text into pieces larger than words (e.g. paragraphs or sentences), while tokenization is reserved for the breakdown process which results exclusively in words.

### For our task, we will tokenize our sample text into a list of words. This is done using NLTK's word_tokenize() function.

In [0]:
# Tokenize the text and print.
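As a rough illustration of word tokenization, the sketch below runs `word_tokenize` on a single made-up sentence. Passing `preserve_line=True` is a choice made here so the example skips sentence splitting and runs without the Punkt model; in the notebook you would more likely call `word_tokenize(text)` on the cleaned text.

```python
from nltk import word_tokenize

# Illustrative sentence containing a contraction.
sample = "I didn't know she was old."

# preserve_line=True skips the sentence-splitting pass (and its Punkt data).
tokens = word_tokenize(sample, preserve_line=True)
print(tokens)
# prints ['I', 'did', "n't", 'know', 'she', 'was', 'old', '.']
```

Note how "didn't" is split into "did" and "n't", which is why expanding contractions beforehand gives cleaner tokens.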


# Normalization

- Normalization covers a series of related tasks: converting all text to the same case (upper or lower), removing punctuation, and so on.

- Steps:
  - Removal of non-ASCII characters.
  - Conversion of all characters to lowercase.
  - Removal of punctuation.
  - Stop word removal.
  - Stemming / lemmatization.

- After tokenization, we are no longer working at a text level, but now at a word level. Our normalization functions, shown below, reflect this. Function names and comments should provide the necessary insight into what each does.

**Lowercasing and punctuation removal:** Converting all words to lowercase and removing punctuation.

**Stemming:** Converting words into their base or stem form (for example, related words such as "tastefully" and "tasty" may be reduced to a shared stem such as "tasti"; the exact result depends on the stemmer). This reduces the vector dimension because similar words are mapped to one stem instead of being treated as separate words.

**Stopwords:** Stopwords are very common words that can generally be removed without changing the sentiment of the sentence.

Example: **This pasta is so tasty** ==> **pasta tasty** ("This", "is", and "so" are stopwords, so they are removed).

Hint:

- Use regular expressions to remove punctuation.

To see all the steps, run the cell below.

In [0]:
# Write function to remove non-ASCII characters from the list of tokenized words.
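One common way to handle this step, sketched with the standard library only: Unicode NFKD normalization followed by an ASCII encode. The function name `remove_non_ascii` is illustrative. Note that this transliterates accented letters (é becomes e) instead of deleting them, which is an assumption about what "removal" should mean here.

```python
import unicodedata


def remove_non_ascii(words):
    """Transliterate accented characters to ASCII and drop the rest."""
    new_words = []
    for word in words:
        # NFKD splits "é" into "e" plus a combining accent; the ASCII
        # encode step then discards the accent and any other non-ASCII char.
        ascii_word = (
            unicodedata.normalize("NFKD", word)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
        new_words.append(ascii_word)
    return new_words


print(remove_non_ascii(["dîner", "Clémence", "Anaïs", "Voilà"]))
# prints ['diner', 'Clemence', 'Anais', 'Voila']
```

Characters with no ASCII decomposition, such as curly quotation marks, are dropped entirely, so some tokens may become empty strings.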


# Write function to convert all the characters to lowercase in the list of tokenized words.


# Write function to remove punctuation from the list of tokenized words.
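Following the regular-expression hint, a possible sketch: the pattern `[^\w\s]` removes every character that is neither a word character nor whitespace. The function name `remove_punctuation` and the token list are illustrative.

```python
import re


def remove_punctuation(words):
    """Strip punctuation characters and drop tokens that become empty."""
    new_words = []
    for word in words:
        # [^\w\s] matches anything that is not a word character or whitespace.
        stripped = re.sub(r"[^\w\s]", "", word)
        if stripped:
            new_words.append(stripped)
    return new_words


tokens = ["something", "...", "is", "!", "not", "right", "(", ")", "with", ".", ",", ";"]
print(remove_punctuation(tokens))
# prints ['something', 'is', 'not', 'right', 'with']
```

Because `\w` includes the underscore, underscores survive this step, and a token such as "n't" becomes "nt".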


# Write function to remove stopwords from the list of tokenized words.
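A sketch of stop word filtering. To stay self-contained, it uses a tiny hand-written set, `demo_stop_words`, which is purely illustrative; in the notebook you would presumably pass `set(stopwords.words("english"))` from NLTK instead.

```python
def remove_stopwords(words, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [word for word in words if word.lower() not in stop_words]


# Tiny illustrative stop word set, not the NLTK list.
demo_stop_words = {"this", "is", "so"}

print(remove_stopwords(["This", "pasta", "is", "so", "tasty"], demo_stop_words))
# prints ['pasta', 'tasty']
```

Using a set rather than a list keeps each membership check fast when the token list is long.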


# Write function to stem the words in the list of tokenized words.
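A minimal stemming sketch using the `LancasterStemmer` imported at the top of the notebook. The function name `stem_words` and the word list are illustrative. No expected output is shown because stems vary between stemming algorithms.

```python
from nltk.stem import LancasterStemmer


def stem_words(words):
    """Return the Lancaster stem of each token."""
    stemmer = LancasterStemmer()
    return [stemmer.stem(word) for word in words]


words = ["tastefully", "tasty", "players", "entrances"]
stems = stem_words(words)
for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Stems are not guaranteed to be dictionary words; lemmatization is the alternative when readable base forms matter.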


# Write function to lemmatize the words from the list of tokenized words.


# Write a function to perform all the above steps.


# Write the code to execute the function that combines all the above steps.
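The steps can be chained into one function. The sketch below is one possible arrangement using only the standard library. It covers non-ASCII handling, lowercasing, punctuation removal, and stop word removal; stemming or lemmatization would be a further step applied to the result. The name `normalize` and the small `demo_stop_words` set are illustrative assumptions.

```python
import re
import unicodedata


def normalize(words, stop_words):
    """Apply the normalization steps in order to a list of tokens."""
    result = []
    for word in words:
        # 1. Transliterate or drop non-ASCII characters.
        word = unicodedata.normalize("NFKD", word).encode("ascii", "ignore").decode("ascii")
        # 2. Convert to lowercase.
        word = word.lower()
        # 3. Remove punctuation.
        word = re.sub(r"[^\w\s]", "", word)
        # 4. Skip tokens that are now empty or are stop words.
        if word and word not in stop_words:
            result.append(word)
    return result


# Tiny illustrative stop word set, not the NLTK list.
demo_stop_words = {"this", "is", "so"}
tokens = ["Le", "dîner", ",", "This", "pasta", "is", "so", "TASTY", "!"]
print(normalize(tokens, demo_stop_words))
# prints ['le', 'diner', 'pasta', 'tasty']
```

The order matters: lowercasing happens before the stop word check so that "This" matches "this" in the set.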
