## Data Preprocessing
- Data preprocessing is an essential step in building a machine learning model.
- The final result from the trained machine learning model depends on how well the data has been preprocessed.

## Text Preprocessing in NLP
- Text preprocessing is the first step in building a machine learning model for NLP.
- After preprocessing, the text is represented as a vector in a multi-dimensional space.
- Common text preprocessing steps are:
    - Tokenization
    - Lower casing
    - Stop words removal
    - Stemming 
    - Lemmatization
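
To make the vector representation mentioned above concrete, here is a minimal bag-of-words sketch in plain Python. The two example sentences and the helper name `bag_of_words` are made up for illustration; real pipelines typically rely on a library vectorizer instead.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors): one count vector per document."""
    # Naive whitespace tokenization plus lower casing, for illustration only.
    tokenized = [doc.lower().split() for doc in documents]
    # The sorted vocabulary fixes which dimension each word maps to.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Each document becomes a list of word counts, with one dimension per vocabulary word.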

In [None]:
# Uncomment the line below to install nltk

# !pip install nltk

In [7]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/fm-pc-
[nltk_data]     lt-125/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## NLTK Basics
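- NLTK stands for `Natural Language Toolkit`.
- We will use the NLTK library to preprocess textual data.
- To work with NLTK, we first need to install it and download the data packages it relies on (for example `wordnet`; the tokenizers, the stopword list, and the POS tagger also rely on downloadable data).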

In [3]:
# define test string

PHRASE = """Data preprocessing is the essential steps in building a machine learnig model. Final result from the trained machine learning model
    will depend on how well the data has been preprocessed. NLTK is the library that helps us to preprocess the textual data and various other
    textual analysis.
    """

### 1. Tokenization
- Tokenization is the process of splitting larger chunks of text into smaller chunks.
- There are three main types of tokenization:
    - `Character Tokenization`
        - Character tokenization splits text into individual characters.
    - `Word Tokenization`
        - Word tokenization splits sentences into words.
        - Words, numbers, and punctuation marks are all called tokens in NLP.
    - `Sentence Tokenization`
        - Sentence tokenization splits phrases or paragraphs into sentences.
        - A sentence usually ends with a full stop, so splitting a string on the character '.' is a simple form of sentence tokenization.
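
As a rough sketch of why splitting on '.' is only an approximation of sentence tokenization, the snippet below compares it with a slightly more careful regular expression. The sample text and both helper names are illustrative assumptions. Neither approach handles abbreviations such as "Dr." correctly, which is the kind of case a trained sentence tokenizer is meant to handle.

```python
import re

text = "Dr. Smith builds models. They need clean data."

def split_on_period(text):
    """Naive sentence tokenization: split on every '.' character."""
    return [part.strip() for part in text.split(".") if part.strip()]

def split_on_sentence_end(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text) if part]

# Both helpers wrongly treat the abbreviation "Dr." as a sentence boundary.
print(split_on_period(text))
print(split_on_sentence_end(text))
```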

In [4]:
# character tokenization
characters = list(PHRASE)

print(characters)

['D', 'a', 't', 'a', ' ', 'p', 'r', 'e', 'p', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'e', 's', 's', 'e', 'n', 't', 'i', 'a', 'l', ' ', 's', 't', 'e', 'p', 's', ' ', 'i', 'n', ' ', 'b', 'u', 'i', 'l', 'd', 'i', 'n', 'g', ' ', 'a', ' ', 'm', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'g', ' ', 'm', 'o', 'd', 'e', 'l', '.', ' ', 'F', 'i', 'n', 'a', 'l', ' ', 'r', 'e', 's', 'u', 'l', 't', ' ', 'f', 'r', 'o', 'm', ' ', 't', 'h', 'e', ' ', 't', 'r', 'a', 'i', 'n', 'e', 'd', ' ', 'm', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'm', 'o', 'd', 'e', 'l', '\n', ' ', ' ', ' ', ' ', 'w', 'i', 'l', 'l', ' ', 'd', 'e', 'p', 'e', 'n', 'd', ' ', 'o', 'n', ' ', 'h', 'o', 'w', ' ', 'w', 'e', 'l', 'l', ' ', 't', 'h', 'e', ' ', 'd', 'a', 't', 'a', ' ', 'h', 'a', 's', ' ', 'b', 'e', 'e', 'n', ' ', 'p', 'r', 'e', 'p', 'r', 'o', 'c', 'e', 's', 's', 'e', 'd', '.', ' ', 'N', 'L', 'T', 'K', ' ', 'i', 's', ' '

In [6]:
from nltk.tokenize import word_tokenize

# word tokenization
words = word_tokenize(PHRASE)

print(words)

['Data', 'preprocessing', 'is', 'the', 'essential', 'steps', 'in', 'building', 'a', 'machine', 'learnig', 'model', '.', 'Final', 'result', 'from', 'the', 'trained', 'machine', 'learning', 'model', 'will', 'depend', 'on', 'how', 'well', 'the', 'data', 'has', 'been', 'preprocessed', '.', 'NLTK', 'is', 'the', 'library', 'that', 'helps', 'us', 'to', 'preprocess', 'the', 'textual', 'data', 'and', 'various', 'other', 'textual', 'analysis', '.']


In [9]:
from nltk.tokenize import sent_tokenize

# sentence tokenization
sentences = sent_tokenize(PHRASE)

print(sentences)

['Data preprocessing is the essential steps in building a machine learnig model.', 'Final result from the trained machine learning model\n    will depend on how well the data has been preprocessed.', 'NLTK is the library that helps us to preprocess the textual data and various other\n    textual analysis.']


### 2. Lower Casing
- Convert all words to lower case.
- **Why?**
    - Words like `Lower` and `lower` mean the same thing.
    - Without lower casing, such variants are treated as different words.
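
A small sketch of the effect, using a made-up token list: without lower casing, each casing variant gets its own entry in the word counts.

```python
from collections import Counter

tokens = ["Lower", "lower", "LOWER", "case"]

# Without normalization, each casing variant is counted as a separate word.
raw_counts = Counter(tokens)

# After lower casing, the variants collapse into a single vocabulary entry.
lower_counts = Counter(token.lower() for token in tokens)

print(raw_counts)
print(lower_counts)
```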

In [10]:
# lower casing
lower_phrase = PHRASE.lower()

print(lower_phrase)

data preprocessing is the essential steps in building a machine learnig model. final result from the trained machine learning model
    will depend on how well the data has been preprocessed. nltk is the library that helps us to preprocess the textual data and various other
    textual analysis.
    


### 3. Stopwords Removal
- In NLP, very common words that carry little information are referred to as `stopwords`.
- Stopwords are words that occur very frequently in documents, such as `a, an, the, is, are`.
- These kinds of words carry little importance because they do not help in distinguishing one document from another.
- NLTK in Python provides stopword lists for 16 or more languages, depending on the version of the stopwords corpus installed.
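
To illustrate why stopwords do little to distinguish documents, the sketch below compares two made-up sentences. The tiny `STOPWORDS` set is a hand-written stand-in for illustration, not NLTK's list.

```python
# Hand-written stand-in for a real stopword list (assumption for illustration).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in"}

doc_a = "the price of the stock is rising".split()
doc_b = "the weather in the city is mild".split()

# In this example, the words shared by both documents are all stopwords.
shared = set(doc_a) & set(doc_b)

# After removal, only the content words that set the documents apart remain.
content_a = [word for word in doc_a if word not in STOPWORDS]
content_b = [word for word in doc_b if word not in STOPWORDS]

print(shared)
print(content_a)
print(content_b)
```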


In [11]:
from nltk.corpus import stopwords

# stopwords removal
stopwords_list = set(stopwords.words('english'))

no_stop_words_list = [word for word in words if word.lower() not in stopwords_list]

print(no_stop_words_list)

['Data', 'preprocessing', 'essential', 'steps', 'building', 'machine', 'learnig', 'model', '.', 'Final', 'result', 'trained', 'machine', 'learning', 'model', 'depend', 'well', 'data', 'preprocessed', '.', 'NLTK', 'library', 'helps', 'us', 'preprocess', 'textual', 'data', 'various', 'textual', 'analysis', '.']


### 4. Stemming
- Stemming is the process of reducing a word to its root form (stem), typically by stripping suffixes.
- Ideally, stemming reduces the words "chocolates", "chocolatey", and "choco" to the root word "chocolate"; real stemmers only approximate this.
- Stemming is an important step in many Natural Language Processing pipelines.
- Stemming is useful when you do not care much about contextual information, since the stems obtained may or may not be actual dictionary words.
- **Example** (illustrative; the exact stems depend on the stemming algorithm):
    - likes, liked, likely, liking --> like
    - history, historical --> histori
    - finally, final, finalized --> final
    - going, goes --> go
- **Errors in Stemming:**
    1. `Over Stemming`
        - Too much of a word is chopped off, so two or more words with different meanings are incorrectly reduced to the same stem when they should have been reduced to different stems.
        - Example: "university" and "universe" are both reduced to "univers".
    2. `Under Stemming`
        - Too little of a word is chopped off, so two or more words that should share a stem are reduced to different stems.
        - Example: "data" and "datum" are reduced to "dat" and "datu" respectively.
- **Cons:**
    - The stemmed words may not be actual dictionary words.
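
The error cases above can be checked directly. The sketch below uses NLTK's `PorterStemmer`, which needs no extra data download. The word lists are chosen for illustration, and the exact stems may differ from the ones listed above, since other stemmers (for example Lancaster or Snowball) follow different rules.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same word should share one stem.
like_stems = {word: stemmer.stem(word) for word in ["likes", "liked", "liking"]}

# Over stemming: two words with different meanings collapse to the same stem.
over = {word: stemmer.stem(word) for word in ["university", "universe"]}

# Under stemming: two related words keep different stems.
under = {word: stemmer.stem(word) for word in ["data", "datum"]}

print(like_stems)
print(over)
print(under)
```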

In [12]:
## stemming words
from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_phrase = [ps.stem(word) for word in no_stop_words_list]

print(stemmed_phrase)

['data', 'preprocess', 'essenti', 'step', 'build', 'machin', 'learnig', 'model', '.', 'final', 'result', 'train', 'machin', 'learn', 'model', 'depend', 'well', 'data', 'preprocess', '.', 'nltk', 'librari', 'help', 'us', 'preprocess', 'textual', 'data', 'variou', 'textual', 'analysi', '.']


### 5. Lemmatization
- Unlike stemming, lemmatization reduces a word to its dictionary form (lemma), which is normally an actual word.
- Libraries such as NLTK implement both stemmers and lemmatizers.

In [13]:
# Lemmatization: WordNetLemmatizer
from nltk.stem import WordNetLemmatizer


# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize single word
print("Lemmatization of word **bats** is:", lemmatizer.lemmatize('bats'))
print("Lemmatization of word **are** is:", lemmatizer.lemmatize('are'))
print("Lemmatization of word **feet** is:", lemmatizer.lemmatize('feet'))

Lemmatization of word **bats** is: bat
Lemmatization of word **are** is: are
Lemmatization of word **feet** is: foot


In [14]:
# lemmatize sentence

lemmatized_phrase = [lemmatizer.lemmatize(word) for word in no_stop_words_list]
print(lemmatized_phrase)

['Data', 'preprocessing', 'essential', 'step', 'building', 'machine', 'learnig', 'model', '.', 'Final', 'result', 'trained', 'machine', 'learning', 'model', 'depend', 'well', 'data', 'preprocessed', '.', 'NLTK', 'library', 'help', 'u', 'preprocess', 'textual', 'data', 'various', 'textual', 'analysis', '.']


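For better accuracy, we need to pass the part-of-speech (POS) tag of each word to the lemmatizer, because the correct lemma of a word depends on the role it plays in its context. By default, `lemmatize()` treats every word as a noun, which is why `are` was left unchanged above.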

In [16]:
from nltk import pos_tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    
    tag = pos_tag([word])[0][1][0].upper()  # e.g. [('Data', 'NNS')] --> 'N'
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


# get lemmas
lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in no_stop_words_list]
print(lemmas)

['Data', 'preprocessing', 'essential', 'step', 'building', 'machine', 'learnig', 'model', '.', 'Final', 'result', 'train', 'machine', 'learn', 'model', 'depend', 'well', 'data', 'preprocessed', '.', 'NLTK', 'library', 'help', 'u', 'preprocess', 'textual', 'data', 'various', 'textual', 'analysis', '.']
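
The helper above tags each word in isolation, so the tagger never sees the surrounding words. A possible refinement is to tag the whole token list once and then map each Penn Treebank tag to a WordNet POS code. The sketch below shows only the mapping step in plain Python: the `tagged` pairs are hand-written stand-ins for what `pos_tag(tokens)` might return, and `penn_to_wordnet` is a hypothetical helper name. The one-letter codes match the values of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV`.

```python
# WordNet POS codes, written as literals so the sketch runs without NLTK data.
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(penn_tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code."""
    return WORDNET_POS.get(penn_tag[:1].upper(), default)

# Hand-written (word, tag) pairs standing in for tagger output on a sentence.
tagged = [
    ("models", "NNS"),
    ("trained", "VBN"),
    ("well", "RB"),
    ("essential", "JJ"),
    (".", "."),
]

# Unknown tags such as punctuation fall back to the noun default.
word_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(word_pos)
```

Each resulting pair could then be passed to `lemmatizer.lemmatize(word, pos)`.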


### Difference between Stemming and Lemmatization

| Aspect                 | Stemming                                                          | Lemmatization                                                 |
|------------------------|-------------------------------------------------------------------|---------------------------------------------------------------|
| Context consideration  | Does not consider context; strips suffixes by rule                | Can consider context (e.g. the POS tag) to find the base form |
| Meaningful base forms  | May not produce dictionary words                                  | Usually produces dictionary words (lemmas)                    |
| Usage                  | Widely used and implemented for many languages                    | Less commonly used because of its complexity                  |
| Ease of implementation | Relatively easy to implement; custom stemmers are simple to build | More complex; requires linguistic knowledge and a lexicon     |
| Example                | "Caring" -> "Car" (naive suffix stripping)                        | "Caring" -> "Care"                                            |
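
The last two rows can be illustrated with a toy comparison. Both helpers below are hypothetical and deliberately simplistic: `naive_stem` strips a few suffixes with no linguistic knowledge, while `toy_lemmatize` looks words up in a small hand-written table that stands in for a lexicon such as WordNet.

```python
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix; no dictionary or context is used."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written (word, POS) -> lemma table standing in for a real lexicon.
LEMMAS = {("caring", "v"): "care", ("feet", "n"): "foot", ("are", "v"): "be"}

def toy_lemmatize(word, pos):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(naive_stem("Caring"), toy_lemmatize("Caring", "v"))
print(naive_stem("feet"), toy_lemmatize("feet", "n"))
```

The stemmer needs only a few lines of rules, while the lemmatizer is only as good as the lexicon behind it.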
