<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 1. Introduction
*Text Preprocessing*

----
Text preprocessing is an approach for cleaning and preparing text data for use in a specific context. Developers use it in almost all natural language processing (NLP) pipelines, including voice recognition software, search engine lookup, and machine learning model training. It is an essential step because text data can vary. From its format (website, text message, voice recognition) to the people who create the text (language, dialect), there are plenty of things that can introduce noise into your data.

<br/>The ultimate goal of cleaning and preparing text data is to reduce the text to only the words that you need for your NLP goals.

<br/>In this lesson, you will learn strategies for preparing text data. While this list is not exhaustive, we will cover a few common approaches for cleaning and processing text data. They include:
- Using the regex (`re`) and NLTK libraries
- Noise removal – removing unnecessary characters and formatting
- Tokenization – breaking multi-word strings into smaller components
- Normalization – a catch-all term for further processing; this includes stemming and lemmatization

<br/>In the gif below, you can see an example of using noise removal, tokenization, and lemmatization to change the string `"Who was partying?"` into a list with the words `"who"`, `"be"`, and `"party"`. In this lesson, you will learn how to use built-in and NLTK functions to apply these same text preprocessing approaches to your own strings.
<img src="Images/text-preprocessing-introduction.gif" width="50%" height="50%">

<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 2. Noise Removal
*Text Preprocessing*

----
Text cleaning is a technique that developers use in a variety of domains. Depending on the goal of your project and where you get your data from, you may want to remove unwanted information, such as:
- Punctuation and accents
- Special characters
- Numeric digits
- Leading, trailing, and vertical whitespace
- HTML formatting

<br/>The type of noise that you need to remove from text usually depends on its source. For example, you could get data from the Twitter API, by scraping a webpage, or from voice recognition software. Fortunately, you can use the `.sub()` method in Python’s regular expression (`re`) library for most of your noise removal needs.

<br/>The `.sub()` method has three required arguments:
- `pattern` – the regular expression to search for in the input string. Writing it as a raw string (prefixed with `r`) is not mandatory but is strongly recommended, because a raw string treats backslashes as literal characters.
- `replacement_text` – the text that replaces every match in the input string (named `repl` in the `re` documentation)
- `input` – the input string that `.sub()` searches (named `string` in the `re` documentation)
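
The effect of the `r` prefix can be sketched with a word-boundary pattern. This is a minimal illustration rather than part of the lesson's exercises: the sample string and variable names are made up, and `count` is an optional argument of `re.sub()` that limits how many matches are replaced.

```python
import re

text = "cat concatenate cat"

# Raw string: \b reaches the regex engine as a word boundary.
raw_result = re.sub(r'\bcat\b', 'dog', text)

# Plain string: Python converts '\b' to a backspace character first,
# so the pattern never matches and the text comes back unchanged.
plain_result = re.sub('\bcat\b', 'dog', text)

# Optional count argument: replace only the first match.
first_only = re.sub(r'cat', 'dog', text, count=1)

print(raw_result)    # dog concatenate dog
print(plain_result)  # cat concatenate cat
print(first_only)    # dog concatenate cat
```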

<br/>The method returns a new string with every match of the pattern replaced by `replacement_text`; the input string itself is not modified. Let’s see a few examples of using this method to remove and replace text from a string.

<br/>*Example:*
1. First, let’s consider how to remove HTML `<p>` tags from a string:

In [34]:
import re 

text = "<p>    This is a paragraph</p>" 
result = re.sub(r'</?p>', '', text)  # </?p> matches both <p> and </p>
print(result) 

    This is a paragraph


Notice, we replace the tags with an empty string `''`. This is a common approach for removing text.

2. Next, let’s remove the whitespace from the beginning of the text. The whitespace consists of four spaces.

In [35]:
text = "    This is a paragraph" 
result = re.sub(r'^\s{4}', '', text)  # ^ restricts the match to the start of the string
print(result) 

This is a paragraph
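
When the amount of whitespace is not known in advance, a quantifier such as `+` can replace the fixed `{4}`. The following is a small sketch under that assumption; the sample string is invented for illustration, and the comparison with `str.strip()` is only there to show that the built-in method gives the same result for this case.

```python
import re

text = "    This is a paragraph   \n"

# ^ anchors the match to the start of the string;
# \s+ matches one or more whitespace characters.
leading_removed = re.sub(r'^\s+', '', text)

# The alternation handles leading and trailing whitespace in one call.
both_removed = re.sub(r'^\s+|\s+$', '', text)

print(repr(leading_removed))         # 'This is a paragraph   \n'
print(repr(both_removed))            # 'This is a paragraph'
print(both_removed == text.strip())  # True
```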


3. Remove the opening and closing `h1` tags from `headline_one`. Save the value to `headline_no_tag`.

In [36]:
headline_one = '<h1>Nation\'s Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini</h1>'
headline_no_tag = re.sub(r'</?h1>', '', headline_one)  # matches <h1> and </h1>
print(headline_no_tag)

Nation's Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini


4. We also saved a Tweet to the variable `tweet`. Remove all `@` characters. Save the result to `tweet_no_at`.

In [37]:
tweet = '@fat_meats, veggies are better than you think.'
tweet_no_at = re.sub(r'@', '', tweet)
print(tweet_no_at)

fat_meats, veggies are better than you think.
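
The kinds of noise listed at the start of this section can be handled together by chaining several `re.sub()` calls. The helper below is a hypothetical sketch, not a function from `re` or NLTK: the name `remove_noise`, the sample string, and the choice to replace matches with a space (so that neighboring words do not get glued together) are all assumptions made for illustration. Accent stripping uses the standard `unicodedata` module, which this lesson does not otherwise cover.

```python
import re
import unicodedata

def remove_noise(text):
    """Hypothetical helper: strip HTML tags, accents, digits, punctuation, extra whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)        # HTML tags
    # Decompose accented letters, then drop the combining accent marks.
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r'\d+', ' ', text)            # numeric digits
    text = re.sub(r'[^\w\s]', ' ', text)        # punctuation and special characters
    text = re.sub(r'\s+', ' ', text).strip()    # leading, trailing, and repeated whitespace
    return text

sample = "<p>  Café prices rose 12% in 2020!\n\n</p>"
cleaned = remove_noise(sample)
print(cleaned)  # Cafe prices rose in
```

Whether each of these steps is appropriate depends on the project; for example, digits or accents may carry meaning in some datasets.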


<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 3. Tokenization
*Text Preprocessing*

----
For many natural language processing tasks, we need access to each word in a string. To access each word, we first have to break the text into smaller components. The method for breaking text into smaller components is called *tokenization* and the individual components are called *tokens.* In other words, tokenization is when a string is broken into a list of substrings. 

<br/>A few common operations that require tokenization include:
- Finding how many words or sentences appear in text
- Determining how many times a specific word or phrase exists
- Accounting for which terms are likely to co-occur

<br/>While tokens are usually individual words or terms, they can also be sentences or pieces of text of other sizes.

<br/>To tokenize individual words, we can use `nltk`’s `word_tokenize()` function. The function accepts a string and returns a list of words:

In [38]:
from nltk.tokenize import word_tokenize

text = "Tokenize this text"
tokenized = word_tokenize(text)
print(tokenized)

['Tokenize', 'this', 'text']


To tokenize at the sentence level, we can use `sent_tokenize()` from the same module.

In [39]:
from nltk.tokenize import sent_tokenize

text = "Tokenize this sentence. Also, tokenize this sentence."
tokenized = sent_tokenize(text)
print(tokenized)

['Tokenize this sentence.', 'Also, tokenize this sentence.']
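
Once text is tokenized, the operations listed earlier in this section (counting words or sentences, counting occurrences of a term, and finding co-occurring terms) become short one-liners. To keep the sketch free of dependencies, it uses simple regular expressions as a stand-in for `word_tokenize()` and `sent_tokenize()`; these stand-ins are an assumption for illustration and do not handle punctuation or abbreviations the way NLTK does.

```python
import re
from collections import Counter

text = "Tokenize this sentence. Also, tokenize this sentence."

# Naive stand-ins for NLTK's tokenizers (illustration only).
tokens = re.findall(r"[a-z']+", text.lower())
sentences = re.split(r'(?<=[.!?])\s+', text)

word_count = len(tokens)                    # how many words appear
sentence_count = len(sentences)             # how many sentences appear
frequencies = Counter(tokens)               # how often each word occurs
bigrams = Counter(zip(tokens, tokens[1:]))  # which adjacent words co-occur

print(word_count)                     # 7
print(sentence_count)                 # 2
print(frequencies['tokenize'])        # 2
print(bigrams[('this', 'sentence')])  # 2
```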


*Example:*
1. Import the `word_tokenize()` and `sent_tokenize()` functions from Python’s NLTK package. Tokenize `ecg_text` by word and save the result to `tokenized_by_word`.

In [40]:
from nltk.tokenize import sent_tokenize, word_tokenize

ecg_text = 'An electrocardiogram is used to record the electrical conduction through a person\'s heart. The readings can be used to diagnose cardiac arrhythmias.'
tokenized_by_word = word_tokenize(ecg_text)
print(tokenized_by_word)

['An', 'electrocardiogram', 'is', 'used', 'to', 'record', 'the', 'electrical', 'conduction', 'through', 'a', 'person', "'s", 'heart', '.', 'The', 'readings', 'can', 'be', 'used', 'to', 'diagnose', 'cardiac', 'arrhythmias', '.']


2. Tokenize `ecg_text` by sentence and save the result to `tokenized_by_sentence`.

In [41]:
ecg_text = 'An electrocardiogram is used to record the electrical conduction through a person\'s heart. The readings can be used to diagnose cardiac arrhythmias.'
tokenized_by_sentence = sent_tokenize(ecg_text)
print(tokenized_by_sentence)

["An electrocardiogram is used to record the electrical conduction through a person's heart.", 'The readings can be used to diagnose cardiac arrhythmias.']


<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 4. Normalization
*Text Preprocessing*

----
Tokenization and noise removal are staples of almost all text preprocessing pipelines. However, some data may require further processing through text normalization. Text *normalization* is a catch-all term for various text preprocessing tasks. In the next few exercises, we’ll cover a few of them:
- Upper or lowercasing
- Stopword removal
- Stemming – bluntly removing prefixes and suffixes from a word
- Lemmatization – replacing a single-word token with its root

<br/>The simplest of these approaches is to change the case of a string. We can use Python’s built-in string methods to make a string all uppercase or lowercase:

In [42]:
my_string = 'tHiS HaS a MiX oF cAsEs'
print("Uppercase: ", my_string.upper())
print("Lower case: ", my_string.lower())

Uppercase:  THIS HAS A MIX OF CASES
Lower case:  this has a mix of cases


*Example:*
1. Make all the characters in `brands` lowercase and save the results to `brands_lower`.

In [43]:
brands = 'Salvation Army, YMCA, Boys & Girls Club of America'
brands_lower = brands.lower()
print(brands_lower)

salvation army, ymca, boys & girls club of america


2. Make all the letters in `brands` uppercase and save the results to `brands_upper`.

In [44]:
brands_upper = brands.upper()
print(brands_upper)

SALVATION ARMY, YMCA, BOYS & GIRLS CLUB OF AMERICA
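
One reason to normalize case is that Python compares strings case-sensitively, so the same word in different cases is counted as several distinct tokens. A minimal sketch, using a made-up word list:

```python
words = ['The', 'the', 'THE', 'cat']

# Without normalization, each casing counts as a separate word.
distinct_raw = len(set(words))

# After lowercasing, the three spellings of "the" collapse into one.
distinct_lower = len({word.lower() for word in words})

print(distinct_raw)    # 4
print(distinct_lower)  # 2
```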


<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 5. Stopword Removal
*Text Preprocessing*

----
Stopwords are words that we remove during preprocessing when we don’t care about sentence structure. They are usually the most common words in a language and don’t provide any information about the tone of a statement. They include words such as “a”, “an”, and “the”.

<br/>NLTK provides a built-in corpus of these words. You can load them using the following statements:

In [45]:
from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english')) 

We store the stop words in a set because membership checks on a set are fast, which matters when we test every token against it below.

<br/>Now that we have the words saved to `stop_words`, we can use tokenization and a list comprehension to remove them from a sentence:

In [46]:
nbc_statement = "NBC was founded in 1926 making it the oldest major broadcast network in the USA"
word_tokens = word_tokenize(nbc_statement) # Tokenize nbc_statement 
statement_no_stop = [word for word in word_tokens if word not in stop_words] # Remove stop words
print(statement_no_stop)

['NBC', 'founded', '1926', 'making', 'oldest', 'major', 'broadcast', 'network', 'USA']


*Example:*
1. At the top of your script, import stopwords from NLTK. Save all English stopwords, as a set, to a variable called `stop_words`. Then, tokenize the text in `survey_text` and save the result to `tokenized_survey`.

In [47]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

survey_text = 'A YouGov study found that American\'s like Italian food more than any other country\'s cuisine.'
stop_words = set(stopwords.words('english')) 
tokenized_survey = word_tokenize(survey_text)
print(tokenized_survey)

['A', 'YouGov', 'study', 'found', 'that', 'American', "'s", 'like', 'Italian', 'food', 'more', 'than', 'any', 'other', 'country', "'s", 'cuisine', '.']


2. Remove stop words from `tokenized_survey` and save the result to `text_no_stops`.

In [48]:
text_no_stops = [word for word in tokenized_survey if word not in stop_words]
print(text_no_stops)

['A', 'YouGov', 'study', 'found', 'American', "'s", 'like', 'Italian', 'food', 'country', "'s", 'cuisine', '.']
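
Notice that `'A'` survived the filter above. NLTK's English stop words are stored in lowercase, so a case-sensitive check does not catch capitalized tokens. One possible fix, sketched below, is to compare the lowercased token instead. To keep the example self-contained, `stop_words` here is a tiny hand-picked subset chosen for illustration, not the full NLTK list.

```python
# Tiny illustrative subset; the real list comes from stopwords.words('english').
stop_words = {'a', 'that', 'more', 'than', 'any', 'other'}

tokenized_survey = ['A', 'YouGov', 'study', 'found', 'that', 'American', "'s",
                    'like', 'Italian', 'food', 'more', 'than', 'any', 'other',
                    'country', "'s", 'cuisine', '.']

# Case-sensitive check: 'A' is kept because only 'a' is in the set.
case_sensitive = [word for word in tokenized_survey if word not in stop_words]

# Case-insensitive check: compare the lowercased token, keep the original spelling.
case_insensitive = [word for word in tokenized_survey if word.lower() not in stop_words]

print(case_sensitive[0])    # A
print(case_insensitive[0])  # YouGov
```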


<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 5. Stemming
*Text Preprocessing*

----
In natural language processing, *stemming* is the text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes). For example, stemming would cast the word “going” to “go”. This is a common method used by search engines to improve matching between user input and website hits.

<br/>NLTK has a built-in stemmer called PorterStemmer. You can use it with a list comprehension to stem each word in a tokenized list of words.

<br/>First, you must import and initialize the stemmer:

In [49]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

Now that we have our stemmer, we can apply it to each word in a list using a list comprehension:

In [50]:
tokenized = ['NBC', 'was', 'founded', 'in', '1926', '.', 'This', 'makes', 'NBC', 'the', 'oldest', 'major', 'broadcast', 'network', '.']
stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)

['nbc', 'wa', 'found', 'in', '1926', '.', 'thi', 'make', 'nbc', 'the', 'oldest', 'major', 'broadcast', 'network', '.']


Notice that words like ‘was’ and ‘founded’ became ‘wa’ and ‘found’, respectively, and that every token was lowercased. Reducing related word forms to a shared stem is useful for many language processing applications. However, you need to be careful when stemming strings, because the result is often not a real word, as with ‘wa’ and ‘thi’ above.
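
To see why blunt affix removal produces results like ‘wa’, consider the toy suffix stripper below. It is a deliberately simplified sketch and not the Porter algorithm: the function name `naive_stem`, the three suffixes, and the minimum stem length of two characters are all assumptions chosen for illustration.

```python
def naive_stem(word, suffixes=('ing', 'ed', 's')):
    """Toy stemmer: lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip if at least two characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

words = ['going', 'founded', 'was', 'makes', 'NBC']
naive_stems = [naive_stem(word) for word in words]
print(naive_stems)  # ['go', 'found', 'wa', 'make', 'nbc']
```

The rule that turns ‘makes’ into ‘make’ is the same rule that turns ‘was’ into ‘wa’; the stemmer has no notion of whether a word is actually a plural or an inflected verb.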

<br/>*Example:*
1. At the top of your script, import `PorterStemmer`, then initialize an instance of it and save the object to a variable called `stemmer`. Then, tokenize `populated_island` and save the result to `island_tokenized`.

In [51]:
from nltk.stem import PorterStemmer

populated_island = 'Java is an Indonesian island in the Pacific Ocean. It is the most populated island in the world, with over 140 million people.'
stemmer = PorterStemmer()
island_tokenized = word_tokenize(populated_island)

2. Use a list comprehension to stem each word in `island_tokenized`. Save the result to a variable called `stemmed`.

In [52]:
stemmed = [stemmer.stem(token) for token in island_tokenized]
print(stemmed)

['java', 'is', 'an', 'indonesian', 'island', 'in', 'the', 'pacif', 'ocean', '.', 'it', 'is', 'the', 'most', 'popul', 'island', 'in', 'the', 'world', ',', 'with', 'over', '140', 'million', 'peopl', '.']


<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 6. Lemmatization
*Text Preprocessing*

----
*Lemmatization* is a method for casting words to their root forms. This is a more involved process than stemming, because it requires the method to know the part of speech for each word. Since lemmatization requires the part of speech, it is a less efficient approach than stemming.

<br/>In the next exercise, we will consider how to tag each word with a part of speech. In the meantime, let’s see how to use NLTK’s lemmatize operation.

<br/>We can use NLTK’s `WordNetLemmatizer` to lemmatize text:

In [53]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

Once we have the `lemmatizer` initialized, we can use a list comprehension to apply the lemmatize operation to each word in a list:

In [54]:
tokenized = ["NBC", "was", "founded", "in", "1926"]
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]
print(lemmatized)

['NBC', 'wa', 'founded', 'in', '1926']


The result saved to `lemmatized` contains `'wa'`, while the rest of the words remain the same. Not too useful. This happened because `lemmatize()` treats every word as a noun. To take advantage of the power of lemmatization, we need to tag each word in our text with the most likely part of speech. We’ll do that in the next exercise.
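
The dependence on part of speech can be illustrated with a toy lookup table. This is a hypothetical sketch, not how `WordNetLemmatizer` is implemented: the table `TOY_LEMMAS`, the function `toy_lemmatize`, and its entries are invented for the example, and unlike NLTK it leaves unknown words completely unchanged.

```python
# Hypothetical lookup keyed by (word, part of speech); a real lemmatizer consults WordNet.
TOY_LEMMAS = {
    ('was', 'v'): 'be',
    ('founded', 'v'): 'found',
    ('populated', 'v'): 'populate',
}

def toy_lemmatize(word, pos='n'):
    """Return the lemma for (word, pos), or the word unchanged if the pair is unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

tokens = ['was', 'founded', 'in']

# Treating every token as a noun finds nothing to reduce.
as_nouns = [toy_lemmatize(token) for token in tokens]

# Supplying the verb tag lets the lookup find the root form.
as_verbs = [toy_lemmatize(token, 'v') for token in tokens]

print(as_nouns)  # ['was', 'founded', 'in']
print(as_verbs)  # ['be', 'found', 'in']
```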

<br/>*Example:*
1. At the top of the script, import `WordNetLemmatizer`, then initialize an instance of it and save the result to `lemmatizer`. Then, tokenize the string saved to `populated_island`. Save the result to `tokenized_string`.

In [55]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

populated_island = 'Indonesia was founded in 1945. It contains the most populated island in the world, Java, with over 140 million people.'
tokenized_string = word_tokenize(populated_island)
print(tokenized_string)

['Indonesia', 'was', 'founded', 'in', '1945', '.', 'It', 'contains', 'the', 'most', 'populated', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']


2. Use a list comprehension to lemmatize every word in `tokenized_string`. Save the result to the variable `lemmatized_words`.

In [56]:
lemmatized_words = [lemmatizer.lemmatize(token) for token in tokenized_string]
print(lemmatized_words)

['Indonesia', 'wa', 'founded', 'in', '1945', '.', 'It', 'contains', 'the', 'most', 'populated', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']


<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 7. Part-of-Speech Tagging
*Text Preprocessing*

----
To improve the performance of lemmatization, we need to find the part of speech for each word in our string. In the script below, we created a part-of-speech tagging function. The function accepts a word, then returns the most common part of speech for that word. Let’s break down the steps:

In [68]:
import nltk
from nltk.corpus import wordnet
from collections import Counter

def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)
    pos_counts = Counter()

    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos()=="n"])
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos()=="v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos()=="a"])
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos()=="r"])

    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech

<br/>*A. Import wordnet and Counter:*

In [58]:
from nltk.corpus import wordnet
from collections import Counter

- `wordnet` is a database that we use for contextualizing words
- `Counter` is a dictionary subclass that stores elements as keys and their counts as values

<br/>*B. Get synonyms:*
<br/>Inside of our function, we use the `wordnet.synsets()` function to get a set of synonyms for the word:

In [59]:
def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)

The returned synonyms come with their part of speech.

<br/>*C. Use synonyms to determine the most likely part of speech:*
<br/>Next, we create a `Counter()` object and set each value to the number of synonyms that fall into each part of speech:

In [60]:
# pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos()=="n"])

This line counts the number of nouns in the synonym set.

<br/>*D. Return the most common part of speech:*
<br/>Now that we have a count for each part of speech, we can use the `.most_common()` counter method to find and return the most likely part of speech:

In [61]:
# most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
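
The indexing in `most_common(1)[0][0]` is easier to follow with concrete numbers. The counts below are invented for illustration and do not come from WordNet. The second half shows the fallback that this approach appears to give when a word has no synonyms at all: every count is zero, and the tie is resolved in favor of the first key inserted, which is `"n"`.

```python
from collections import Counter

# Made-up counts for a word that is mostly used as a verb.
pos_counts = Counter()
pos_counts["n"] = 1
pos_counts["v"] = 13
pos_counts["a"] = 0
pos_counts["r"] = 0

top = pos_counts.most_common(1)  # list with one (key, count) tuple
most_likely = top[0][0]          # first tuple, then its key

print(top)          # [('v', 13)]
print(most_likely)  # v

# A word with no synonyms: all counts are zero, so the first inserted key wins.
empty_counts = Counter({"n": 0, "v": 0, "a": 0, "r": 0})
fallback = empty_counts.most_common(1)[0][0]
print(fallback)     # n
```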

Now that we can find the most probable part of speech for a given word, we can pass this into our lemmatizer when we find the root for each word. Let’s take a look at how we would do this for a tokenized string:

In [65]:
tokenized = ["How", "old", "is", "the", "country", "Indonesia"]
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]
print(lemmatized)

['How', 'old', 'be', 'the', 'country', 'Indonesia']


*Example:*
<br/>Use `get_part_of_speech()` to improve your lemmatizer. Under the line where `tokenized_string` is defined, use the `get_part_of_speech()` function in a list comprehension to lemmatize all the words in `tokenized_string`. Save the result to a new variable called `lemmatized_pos`.

In [66]:
populated_island = 'Indonesia was founded in 1945. It contains the most populated island in the world, Java, with over 140 million people.'
tokenized_string = word_tokenize(populated_island)
lemmatized_pos = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized_string]
print(lemmatized_pos)

['Indonesia', 'be', 'found', 'in', '1945', '.', 'It', 'contain', 'the', 'most', 'populate', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']


<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 8. Review
*Text Preprocessing*

----
Congratulations! The goal of this unit was to introduce regular expressions (regex) and several methods to prepare your text data for most NLP tasks.

<br/>This lesson is not an exhaustive introduction to text preprocessing. However, it does show a few of the most common tricks for cleaning your data. Before building a text preprocessing pipeline, it’s most important to have an idea of how you want your data formatted and why you want it formatted that way. Once you know what you want, you can use these tools to get you there.

<br/>Having completed this unit, you are now able to:
- Implement basic regular expressions.
- Recognize several techniques to prepare text for various NLP tasks.
- Use Python and regex to remove unnecessary formatting from your text.
- Split text into tokens using NLTK.
- Normalize text with Python, regex, and NLTK by removing affixes, changing case, and removing common words.

<br/>Let’s review what we covered in this lesson:
- Text preprocessing is all about cleaning and prepping text data so that it’s ready for other NLP tasks.
- Noise removal is a text preprocessing step concerned with removing unnecessary formatting from our text.
- Tokenization is a text preprocessing step devoted to breaking up text into smaller units (usually words or discrete terms).
- Normalization is the name we give most other text preprocessing tasks, including stemming, lemmatization, upper and lowercasing, and stopword removal.
- Stemming is the normalization preprocessing task focused on removing word affixes.
- Lemmatization is the normalization preprocessing task that more carefully brings words down to their root forms.

<br/>If you are interested in learning more about these topics, here are some additional resources:
- Book: [Speech and Language Processing, Chapter 2, Daniel Jurafsky & James H. Martin](https://web.stanford.edu/~jurafsky/slp3/2.pdf)
- Video Playlist: [NLTK with Python 3 for Natural Language Processing](https://www.youtube.com/playlist?reload=9&list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL)

<br/>*Example:*
<br/>Below is a sample of HTML text. See if you can use only the skills you’ve learned in this lesson to:
- Select only the string within the `<p>` tags.
- Remove all periods.
- Make all of the words lowercase.
- Tokenize the string.
- Lemmatize the string.

In [69]:
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
oprah_wiki = '<p>Working in local media, she was both the youngest news anchor and the first black female news anchor at Nashville\'s WLAC-TV. </p>'

oprah_no_html = re.sub(r'</?p>', '', oprah_wiki)  # matches <p> and </p>
oprah_text = re.sub(r'\.', '', oprah_no_html)
oprah_text = oprah_text.lower()
oprah_tokenized = word_tokenize(oprah_text)
oprah_lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in oprah_tokenized]

print(oprah_lemmatized)

['work', 'in', 'local', 'medium', ',', 'she', 'be', 'both', 'the', 'young', 'news', 'anchor', 'and', 'the', 'first', 'black', 'female', 'news', 'anchor', 'at', 'nashville', "'s", 'wlac-tv']


<img src="Images\atom.png" alt="Atom" style="width:60px" align="left" vertical-align="middle">

## 9. NLTK with Python 3 for Natural Language Processing
*Text Preprocessing*

----
We recommend watching the following videos for helpful tutorials on using NLTK for text preprocessing:

<br/>[Natural Language Processing With Python and NLTK p.1 Tokenizing words and Sentences](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/)

<br/>In this video, you will get a tutorial on tokenizing text using NLTK. This is helpful if you want to see a demo of how to break text data up into words or sentences.

<br/>[Stop Words - Natural Language Processing With Python and NLTK p.2](https://pythonprogramming.net/stop-words-nltk-tutorial/)

<br/>In this video, you will get a tutorial on removing stopwords from text data using NLTK, which is helpful if you’re preparing text for sentiment analysis, topic modeling, or other tasks where common words are not helpful.

<br/>[Stemming - Natural Language Processing With Python and NLTK p.3](https://pythonprogramming.net/stemming-nltk-tutorial/)

<br/>[Lemmatizing - Natural Language Processing With Python and NLTK p.8](https://pythonprogramming.net/lemmatizing-nltk-tutorial/)

<br/>In these videos, you will get tutorials on stemming and lemmatization using NLTK, further explaining the differences between these two techniques.