### Text Preprocessing

Text preprocessing cleans and normalizes raw text so that it can later be converted into the numerical form that machine learning algorithms work with.

### Text Preprocessing Steps for NLP
Text preprocessing is crucial for preparing raw text data for NLP tasks. Below are common steps in a typical preprocessing pipeline:

1. Lowercasing:
Convert all text to lowercase so that words such as "Hello" and "hello" are treated as the same token.
Example:

- Input: "Hello World!"
- Output: "hello world!"

2. Tokenization:
Split the text into individual words or sentences.
Example:

- Input: "NLP is amazing!"
- Output: ["NLP", "is", "amazing", "!"]

3. Removing Punctuation:
Remove special characters and punctuation to clean the text.
Example:

- Input: "Hello, NLP World!"
- Output: "Hello NLP World"

4. Stopword Removal:
Remove common words (like "the" and "is") that add little meaningful information.
Example:

- Input: "This is a great book!"
- Output: ["great", "book"]

5. Stemming:
Reduce words to a root form by stripping suffixes (the result may not be a valid word).
Example:

- Input: "playing, played, plays"
- Output: "play"

6. Lemmatization:
Get the dictionary base form (lemma) of a word using vocabulary and part-of-speech information (usually more accurate than stemming, but slower).
Example:

- Input: "running, ran, runs"
- Output: "run"

7. Removing Numerical Values (Optional):
Remove numbers if they are not useful for the task.

8. Handling Contractions:
Expand common contractions (e.g., "can't" → "cannot").

9. Removing Extra Whitespaces:
Clean up multiple spaces.
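
As a rough illustration of how several of these steps chain together, here is a minimal sketch that uses only the standard library. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical stand-ins chosen for this example; they are not part of NLTK, and a real pipeline would use a fuller stopword list and a proper tokenizer.

```python
import string

# Hypothetical, deliberately tiny stopword set used only for this sketch.
STOP_WORDS = {"a", "an", "the", "is", "this", "of", "and", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # split() also discards extra whitespace
    return [token for token in tokens if token not in STOP_WORDS]

print(preprocess("This is a   great book!"))  # ['great', 'book']
```

The order matters here: lowercasing first lets "This" match the lowercase stopword list, and removing punctuation before splitting keeps "book!" from surviving as a separate token.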

In [18]:
import nltk
import string
import re

### Lowercasing: 
Convert all text to lowercase so that words such as "Hello" and "hello" are treated as the same token.

In [21]:
def lowercase_text(text):
    return text.lower()

In [27]:
text = "The World is Round"   # avoid the name `string`, which would shadow the imported module
lowercase_text(text)

'the world is round'
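
For text that is not purely English, `str.lower()` may leave some characters distinct that should compare equal. A small sketch, assuming case-insensitive matching is the goal, using Python's built-in `str.casefold()`; the helper name `normalize_case` is made up for this example.

```python
def normalize_case(text):
    """Aggressive lowercasing intended for case-insensitive comparison."""
    return text.casefold()

print("Straße".lower())            # straße
print(normalize_case("Straße"))    # strasse
```

For plain ASCII input both methods give the same result, so `lower()` remains a reasonable default for English text.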

### Tokenization:
Split the text into individual words or sentences.

In [48]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sharm\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [50]:
from nltk.tokenize import word_tokenize

def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

In [52]:
text = " Today the sky is cloudy and winds are strong. There are chances of heavy rain and storm."
tokenize_text(text)

['Today',
 'the',
 'sky',
 'is',
 'cloudy',
 'and',
 'winds',
 'are',
 'strong',
 '.',
 'There',
 'are',
 'chances',
 'of',
 'heavy',
 'rain',
 'and',
 'storm',
 '.']
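
To see roughly what a word tokenizer does, the following sketch splits text into runs of word characters and single punctuation marks with a regular expression. It is only an approximation for illustration: `simple_tokenize` is a hypothetical helper, and NLTK's `word_tokenize` handles cases such as contractions and abbreviations differently.

```python
import re

def simple_tokenize(text):
    """Return word runs and individual punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("NLP is amazing!"))  # ['NLP', 'is', 'amazing', '!']
```

A regex like this needs no downloaded models, which can be handy for quick experiments, but it splits "Isn't" into three pieces, so a trained tokenizer is usually the better choice for real data.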

### Removing Punctuation: 
Remove special characters and punctuation to clean the text.

In [84]:
import string

def rem_punc(text):
    translator = str.maketrans('','', string.punctuation)
    return text.translate(translator)

In [88]:
string_data = "A parakeet!! is any one of many small to medium-sized species of parrot, in multiple genera, that generally has long tail~ feathers. Isn't this fact exciting???"  
rem_punc(string_data)

'A parakeet is any one of many small to mediumsized species of parrot in multiple genera that generally has long tail feathers Isnt this fact exciting'
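
Note that deleting punctuation outright fuses "medium-sized" into "mediumsized" in the output above. If that is unwanted, one possible variation is to replace each punctuation mark with a space and then collapse the whitespace. The helper name `punct_to_space` is an assumption made for this sketch.

```python
import re
import string

# Character class matching any single ASCII punctuation mark.
_PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")

def punct_to_space(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(_PUNCT.sub(" ", text).split())

print(punct_to_space("small to medium-sized parrot!!"))
# small to medium sized parrot
```

The trade-off is that apostrophes also become spaces, so "Isn't" turns into "Isn t"; expanding contractions before this step avoids that.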

### Stopword Removal:
Remove common words (like "the" and "is") that add little meaningful information.

In [116]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the stopword corpus: nltk.download('stopwords')
def rem_stopwords(text):
    stop_words = set(stopwords.words("english"))
    words_token = word_tokenize(text)
    # Drop punctuation tokens first, then filter stopwords from the remaining tokens
    filtered_text = [word for word in words_token if word not in string.punctuation]
    # Compare in lowercase because the NLTK stopword list is lowercase
    filtered_text = [word for word in filtered_text if word.lower() not in stop_words]
    return filtered_text

In [118]:
string_text = " AI might be the last and most intelligent invention of humanity. "
rem_stopwords(string_text)

['AI', 'might', 'last', 'intelligent', 'invention', 'humanity']

### Stemming:
Reduce words to a root form by stripping suffixes (the result may not be a valid word).

In [120]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
root = PorterStemmer()

def root_words(text):
    tokens = word_tokenize(text)
    stems = [root.stem(word) for word in tokens]
    return stems

In [122]:
string_text = " Some people believe that artificial intelligence will become so advanced that it will surpass human intelligence and effectively take control of the planet"
root_words(string_text)

['some',
 'peopl',
 'believ',
 'that',
 'artifici',
 'intellig',
 'will',
 'becom',
 'so',
 'advanc',
 'that',
 'it',
 'will',
 'surpass',
 'human',
 'intellig',
 'and',
 'effect',
 'take',
 'control',
 'of',
 'the',
 'planet']
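
Stems such as 'peopl' and 'artifici' appear because a stemmer strips suffixes by rule rather than looking words up in a dictionary. The toy function below illustrates that idea only; it is not the Porter algorithm, and `toy_stem` with its short suffix list is a hypothetical example.

```python
# Suffixes are checked in order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip one common suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["playing", "played", "plays"]])  # ['play', 'play', 'play']
print(toy_stem("running"))  # runn
```

The "runn" result shows why rule-based stems are not guaranteed to be real words, which is the limitation lemmatization tries to address.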

### Lemmatization: 
Get the dictionary base form (lemma) of a word using vocabulary and part-of-speech information (usually more accurate than stemming, but slower).

In [128]:
>>> import nltk
>>> nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sharm\AppData\Roaming\nltk_data...


True

In [130]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemma = WordNetLemmatizer()

def lem_words(text):
    tokens = word_tokenize(text)
    # pos='v' treats every token as a verb, so non-verbs are mostly left unchanged
    lem_text = [lemma.lemmatize(word, pos='v') for word in tokens]
    return lem_text

In [132]:
string_text = " Some people believe that artificial intelligence will become so advanced that it will surpass human intelligence and effectively take control of the planet"
lem_words(string_text)

['Some',
 'people',
 'believe',
 'that',
 'artificial',
 'intelligence',
 'will',
 'become',
 'so',
 'advance',
 'that',
 'it',
 'will',
 'surpass',
 'human',
 'intelligence',
 'and',
 'effectively',
 'take',
 'control',
 'of',
 'the',
 'planet']

### POS Tagging
Part-of-Speech (POS) tagging is the process of labeling each word in a sentence with its grammatical category, such as noun, verb, or adjective. The `pos` argument passed to the lemmatizer refers to this category.

![image.png](attachment:d7f04c2b-4824-4bfa-afbf-171711bf5d76.png)
![image.png](attachment:603f8eb7-2417-41ac-9d5e-187a18aee01b.png)

In [7]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_words(text):
    tokens = word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    return pos_tags

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sharm\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sharm\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [9]:
string1 = " Some people believe that artificial intelligence will become so advanced that it will surpass human intelligence and effectively take control of the planet"
pos_words(string1)

[('Some', 'DT'),
 ('people', 'NNS'),
 ('believe', 'VBP'),
 ('that', 'IN'),
 ('artificial', 'JJ'),
 ('intelligence', 'NN'),
 ('will', 'MD'),
 ('become', 'VB'),
 ('so', 'RB'),
 ('advanced', 'JJ'),
 ('that', 'IN'),
 ('it', 'PRP'),
 ('will', 'MD'),
 ('surpass', 'VB'),
 ('human', 'JJ'),
 ('intelligence', 'NN'),
 ('and', 'CC'),
 ('effectively', 'RB'),
 ('take', 'VB'),
 ('control', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('planet', 'NN')]
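
The tags above follow the Penn Treebank convention, while the WordNet lemmatizer expects one of the single-letter codes 'n', 'v', 'a', or 'r' for its `pos` argument. A minimal sketch of a mapping between the two is shown below; `penn_to_wordnet` is a hypothetical helper, and defaulting to noun for all other tags is an assumption of this example.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and fallback for every other tag

tagged = [("people", "NNS"), ("believe", "VBP"), ("advanced", "JJ"), ("effectively", "RB")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('people', 'n'), ('believe', 'v'), ('advanced', 'a'), ('effectively', 'r')]
```

Each resulting code could then be passed per token as `lemma.lemmatize(word, pos=code)` instead of using a fixed `pos='v'` for the whole sentence.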

### Removing Numerical Values (Optional):
Remove numbers if they are not useful for the task. If the numbers carry meaning, convert them to words instead with a module such as inflect or num2words.

1. Syntax (num2words):

- num2words(number)

2. Syntax (inflect):

- p = inflect.engine()
- word_representation = p.number_to_words(number)

In [145]:
import inflect

# Initialize the engine
q = inflect.engine()

In [146]:
def convert_num(text):
    words = text.split()
    new_words = []                         # Converted tokens are collected here

    for word in words:
        # Only standalone digit tokens match ("5" yes, "5kg" or "100," no)
        if word.isdigit():
            temp = q.number_to_words(word) # Convert the digits to their word form
            new_words.append(temp)
        else:
            new_words.append(word)         # Keep non-numeric tokens unchanged

    return " ".join(new_words)             # Rebuild the sentence

In [151]:
string_text = "I need to buy 1 dozen mangos, 5 kg potatoes, 100 gm chilli powder from the grocery store."
convert_num(string_text)

'I need to buy one dozen mangos, five kg potatoes, one hundred gm chilli powder from the grocery store.'
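
When the numbers should be dropped rather than spelled out, a regular expression is usually enough. The sketch below assumes that only runs of digits need to go; `remove_numbers` is a name chosen for this example.

```python
import re

def remove_numbers(text):
    """Delete digit runs, then collapse the whitespace left behind."""
    no_digits = re.sub(r"\d+", "", text)
    return " ".join(no_digits.split())

print(remove_numbers("I need to buy 1 dozen mangos, 5 kg potatoes."))
# I need to buy dozen mangos, kg potatoes.
```

Unlike the `isdigit()` check, this also strips digits attached to other characters, for example the "5" in "5kg", which may or may not be desirable depending on the task.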

### Handling Contractions:
Expand common contractions (e.g., "can't" → "cannot").

In [158]:
import contractions

def expand_contractions(text):
    expanded_words = [contractions.fix(word) for word in text.split()]
    return " ".join(expanded_words)

In [160]:
string_text = "I can't go there, it's too late."
expand_contractions(string_text)

'I cannot go there, it is too late.'
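
If installing the `contractions` package is not an option, the same idea can be sketched with a lookup table and a regular expression. The `CONTRACTION_MAP` below is a hypothetical, intentionally small mapping; it is not the list used by the `contractions` library, and it does not preserve the capitalization of the original word.

```python
import re

# Hypothetical mini-dictionary; extend as needed for real data.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "isn't": "is not",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_with_map(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_with_map("I can't go there, it's too late."))
# I cannot go there, it is too late.
```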

### Removing Extra Whitespaces: 
Clean up multiple spaces.

In [162]:
def remove_extra_whitespaces(text):
    return " ".join(text.split())

In [168]:
text = "    This   is    an   example    with   extra spaces."
remove_extra_whitespaces(text)

'This is an example with extra spaces.'
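
Because `text.split()` splits on every kind of whitespace, the function above also removes line breaks. Where line structure should be kept, one possible variation is to clean each line separately. The helper name `collapse_spaces_keep_lines` is an assumption made for this sketch.

```python
import re

def collapse_spaces_keep_lines(text):
    """Collapse spaces and tabs within each line, keeping line breaks."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)

print(collapse_spaces_keep_lines("  first   line  \n second\tline "))
# first line
# second line
```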