<a href="https://colab.research.google.com/github/shartazkhan/nlp_fundamentals/blob/main/NLP_Text_Preprocessing_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
# from google.colab import drive
# drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [38]:
df = pd.read_csv('/content/drive/MyDrive/Datasets/IMDB Dataset.csv')

In [40]:
df.sample(5)

Unnamed: 0,review,sentiment
521,one of the best ensemble acted films I've ever...,positive
17390,After just finishing the book the same day I w...,negative
45040,"The biggest National Lampoon hit remains ""Anim...",negative
29979,Wow... what would you do with $33m? Let me giv...,negative
14319,"Believe it or not, this was at one time the wo...",negative


In [41]:
df['review'][49]

"Average (and surprisingly tame) Fulci giallo which means it's still quite bad by normal standards, but redeemed by its solid build-up and some nice touches such as a neat time twist on the issues of visions and clairvoyance.<br /><br />The genre's well-known weaknesses are in full gear: banal dialogue, wooden acting, illogical plot points. And the finale goes on much too long, while the denouement proves to be a rather lame or shall I say: limp affair.<br /><br />Fulci's ironic handling of giallo norms is amusing, though. Yellow clues wherever you look.<br /><br />3 out of 10 limping killers"

## 1. Convert the text to lowercase

In [42]:
df['review'] = df['review'].str.lower()

df.sample(5)

Unnamed: 0,review,sentiment
3562,recap: since the warrior queen gedren raised a...,negative
32280,i wandered into this movie after watching the ...,negative
12932,"this book-based movie is truly awful, and a bi...",negative
22128,"paul telfer, who plays hercules in this tv fil...",negative
35833,to me this film is just a very very lame teen ...,negative


## 2. Remove unnecessary things


### HTML Tags

In [43]:
import re

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return clean.sub(r'', text)

df['review'] = df['review'].apply(remove_html_tags)

df.sample(5)

Unnamed: 0,review,sentiment
26845,"as a big dostoyevsky fan, i had always been di...",positive
45124,"fragile carne, just before his great period. a...",positive
27858,this episode from the first season slightly ed...,positive
3452,the sentinel is a movie that was recommended t...,positive
45971,overall an extremely disappointing picture. ve...,negative


In [44]:
df['review'][49]

"average (and surprisingly tame) fulci giallo which means it's still quite bad by normal standards, but redeemed by its solid build-up and some nice touches such as a neat time twist on the issues of visions and clairvoyance.the genre's well-known weaknesses are in full gear: banal dialogue, wooden acting, illogical plot points. and the finale goes on much too long, while the denouement proves to be a rather lame or shall i say: limp affair.fulci's ironic handling of giallo norms is amusing, though. yellow clues wherever you look.3 out of 10 limping killers"

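Note that the output above glues sentences together (`clairvoyance.the genre's`), because each `<br />` is replaced with an empty string. A minimal sketch of one possible variant, assuming it is acceptable to replace every tag with a space and then collapse repeated whitespace (the helper name `remove_html_tags_keep_spacing` is hypothetical and not part of the original notebook):

```python
import re

TAG_RE = re.compile(r'<.*?>')        # same non-greedy tag pattern as above
WHITESPACE_RE = re.compile(r'\s+')   # runs of whitespace

def remove_html_tags_keep_spacing(text):
    # Replace each tag with a space so neighbouring words stay separated,
    # then collapse the resulting runs of whitespace into single spaces.
    without_tags = TAG_RE.sub(' ', text)
    return WHITESPACE_RE.sub(' ', without_tags).strip()

sample = "clairvoyance.<br /><br />the genre's weaknesses"
print(remove_html_tags_keep_spacing(sample))
# clairvoyance. the genre's weaknesses
```

It could be applied the same way as the original helper, for example with `df['review'].apply(remove_html_tags_keep_spacing)`.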
### URLs

In [35]:
def remove_url(text):
    # Raw string so that \S and \. reach the regex engine unchanged.
    clean = re.compile(r'https?://\S+|www\.\S+')
    return clean.sub(r'', text)

df['review'] = df['review'].apply(remove_url)

df.sample(5)

Unnamed: 0,review,sentiment
49936,i watched this mini in the early eighties. sam...,positive
40469,"without a doubt, 12 monkeys is one of the best...",positive
14779,"yes, it's not a great cinematic achievement, b...",positive
33476,"although, this episode was offensive to the to...",positive
16512,i first saw a poster advertising this film on ...,negative


### Punctuation

In [46]:
import string

def remove_punctuation(text):
    # str.maketrans('', '', chars) builds a table that maps every character
    # in `chars` to None, so translate() deletes all punctuation.
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

df['review'] = df['review'].apply(remove_punctuation)

df.sample(5)

Unnamed: 0,review,sentiment
39641,hilarious and lowbudget comedy at its best thi...,positive
48591,too bad chuck norris has gone to tv he made so...,negative
45242,what a time we live in when someone like this ...,negative
42862,uninspired direction leaves a decent cast stra...,negative
14105,i had high hopes for it when i heard that it w...,negative


`str.maketrans()` is used to create a mapping of characters to be replaced. In this case, the first two arguments are empty strings, meaning no characters are being mapped to other characters. The third argument, string.punctuation, provides a string containing all standard punctuation characters. This tells maketrans to map each of these punctuation characters to None, effectively marking them for deletion.
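To make the mapping concrete, the short sketch below inspects the table that `str.maketrans` builds. The sample string is made up for illustration.

```python
import string

translator = str.maketrans('', '', string.punctuation)

# The table maps each punctuation code point to None (meaning "delete").
print(translator[ord('!')])
# None
print(len(translator) == len(string.punctuation))
# True

print("well-known, isn't it?".translate(translator))
# wellknown isnt it
```

Because hyphens and apostrophes are punctuation too, `well-known` becomes `wellknown` and `isn't` becomes `isnt`, which matches what happens to the review text.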

In [47]:
df['review'][49]

'average and surprisingly tame fulci giallo which means its still quite bad by normal standards but redeemed by its solid buildup and some nice touches such as a neat time twist on the issues of visions and clairvoyancethe genres wellknown weaknesses are in full gear banal dialogue wooden acting illogical plot points and the finale goes on much too long while the denouement proves to be a rather lame or shall i say limp affairfulcis ironic handling of giallo norms is amusing though yellow clues wherever you look3 out of 10 limping killers'

## 3. Chat Word Treatment

In [45]:
file_path = '/content/drive/MyDrive/Datasets/chat_words_dictionary.txt'

try:
    with open(file_path, 'r') as f:
        file_content = f.read()

        chat_words_dict = {}
        for line in file_content.strip().split('\n'):
            if '=' in line:
                abbr, full_form = line.split('=', 1)
                chat_words_dict[abbr.strip()] = full_form.strip()

        print("\nChat word dictionary created:")
        display(chat_words_dict)

except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")


Chat word dictionary created:


{'afaik': 'as far as i know',
 'afk': 'away from keyboard',
 'asap': 'as soon as possible',
 'atk': 'at the keyboard',
 'atm': 'at the moment',
 'a3': 'anytime anywhere anyplace',
 'bak': 'back at keyboard',
 'bbl': 'be back later',
 'bbs': 'be back soon',
 'bfn': 'bye for now',
 'brb': 'be right back',
 'brt': 'be right there',
 'btw': 'by the way',
 'b4': 'before',
 'cu': 'see you',
 'cul8r': 'see you later',
 'faq': 'frequently asked questions',
 'fc': 'fingers crossed',
 'fwiw': 'for what its worth',
 'fyi': 'for your information',
 'gal': 'get a life',
 'gg': 'good game',
 'gn': 'good night',
 'gday': 'good day',
 'gmta': 'great minds think alike',
 'gr8': 'great',
 'g9': 'genius',
 'iykyk': 'if you know you know',
 'ic': 'i see',
 'icq': 'i seek you also a chat program',
 'ilu': 'ilu i love you',
 'imho': 'in my honesthumble opinion',
 'imo': 'in my opinion',
 'iow': 'in other words',
 'irl': 'in real life',
 'kiss': 'keep it simple stupid',
 'ldr': 'long distance relationship',


**We can apply this dictionary to a relevant dataset to convert slang and chat words to standard language.**

In [46]:
def chat_conversion(text):
  new_text = []
  for w in text.split():
    if w.lower() in chat_words_dict:
      new_text.append(chat_words_dict[w.lower()])
    else:
      new_text.append(w)
  return " ".join(new_text)

In [47]:
chat_conversion('gday man, sry u had to w8 for so long.')

'good day man, sorry you had to wait for so long.'
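Since `chat_conversion` splits on whitespace, an abbreviation followed directly by punctuation (for example `brb.`) would not match a dictionary key. A minimal sketch of one way around that, assuming a regex substitution over word characters is acceptable; `sample_chat_words` is a hypothetical three-entry stand-in for `chat_words_dict`, and the function name is illustrative:

```python
import re

# Hypothetical stand-in for chat_words_dict.
sample_chat_words = {'brb': 'be right back', 'gr8': 'great', 'btw': 'by the way'}

def chat_conversion_with_punctuation(text, mapping=sample_chat_words):
    # \w+ matches runs of letters, digits and underscores,
    # so surrounding punctuation stays where it is.
    def expand(match):
        word = match.group(0)
        return mapping.get(word.lower(), word)
    return re.sub(r'\w+', expand, text)

print(chat_conversion_with_punctuation("BTW, that movie was gr8! brb."))
# by the way, that movie was great! be right back.
```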

## 4. Spell checker

In [48]:
from textblob import TextBlob

In [63]:
incorrect_text = 'ploiec esacrh for car involved in ftala hit run'

textBlb = TextBlob(incorrect_text)

correct_text = textBlb.correct().string

print(f"\nOriginal sentence: {incorrect_text}")
print(f"Corrected sentence: {correct_text}")


Original sentence: ploiec esacrh for car involved in ftala hit run
Corrected sentence: police each for car involved in talk hit run


In [64]:
#!pip install pyspellchecker

In [65]:
from spellchecker import SpellChecker

spell = SpellChecker()

words = incorrect_text.split()
corrected_sentence = []
for word in words:
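    # Call correction() once per word and fall back to the original word
    # when it returns None.
    suggestion = spell.correction(word)
    corrected_sentence.append(suggestion if suggestion is not None else word)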

print(f"\nOriginal sentence: {incorrect_text}")
print(f"Corrected sentence: {' '.join(corrected_sentence)}")


Original sentence: ploiec esacrh for car involved in ftala hit run
Corrected sentence: police each for car involved in tala hit run


## 5. Stopwords

**What are Stopwords?**
Stopwords are common words in a language (like "the", "a", "is", "in") that are often removed from text during preprocessing because they don't usually carry significant meaning and can add noise to analysis.

**When to Remove Them:**
Remove stopwords for tasks where the focus is on the important keywords and their frequencies, such as:
- Text classification
- Information retrieval
- Topic modeling

**When to Keep Them:**
Keep stopwords when their presence is important for understanding the context or structure of the text, such as:
- Sentiment analysis (words like "not" can change the meaning)
- Language translation
- Text generation
- When analyzing sentence structure or syntax.

In [66]:
from nltk.corpus import stopwords

In [69]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [70]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [71]:
def remove_stopwords(text):
    new_text = []

    for word in text.split():
        if word in stopwords.words('english'):
            new_text.append('')
        else:
            new_text.append(word)
    return " ".join(new_text)

In [72]:
remove_stopwords('this is a sample sentence')

'   sample sentence'
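The function above rebuilds the stopword list for every word and appends empty strings, which is why the result starts with extra spaces. A minimal sketch of an alternative, assuming a small hard-coded set as a stand-in for `stopwords.words('english')` (the names `SAMPLE_STOPWORDS`, `remove_stopwords_fast` and the `keep` parameter are illustrative, not from the notebook):

```python
# Stand-in for set(stopwords.words('english')); hard-coded so the
# sketch runs without downloading NLTK data.
SAMPLE_STOPWORDS = {'this', 'is', 'a', 'not', 'the'}

def remove_stopwords_fast(text, stop_words=SAMPLE_STOPWORDS, keep=()):
    # Build the lookup set once per call; `keep` lists words to protect
    # (for example negations in sentiment analysis).
    active = set(stop_words) - set(keep)
    return " ".join(word for word in text.split() if word not in active)

print(remove_stopwords_fast('this is a sample sentence'))
# sample sentence
print(remove_stopwords_fast('this is not a good movie', keep={'not'}))
# not good movie
```

With the real list, `stop_words=set(stopwords.words('english'))` could be passed in; set membership is a constant-time lookup, whereas the list version scans every entry.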

## 6. Handling Emojis




Emojis are increasingly common in text and can carry significant sentiment and meaning. When processing text for NLP tasks, you might need to handle emojis. Common approaches include:

*   **Removal:** Simply remove emojis from the text if they are not relevant to your analysis.
*   **Replacement with text:** Replace emojis with their textual descriptions (e.g., converting 😊 to "smiling face with smiling eyes"). This can be useful for sentiment analysis. Libraries like `emoji` can help with this.
*   **Treating as separate tokens:** Keep emojis as they are but treat them as individual tokens during tokenization. This preserves their presence and can be useful in some models.

In [77]:
import re

def remove_emoji(text):
  # Each \U escape needs exactly eight hex digits; shorter forms such as
  # \U0002702 fail with "truncated \UXXXXXXXX escape".
  emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # enclosed characters (a very broad range)
                               "]+", flags=re.UNICODE)
  return emoji_pattern.sub(r'', text)

In [78]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 590.6/590.6 kB 9.1 MB/s eta 0:00:00
Installing collected packages: emoji
Successfully installed emoji-2.14.1


In [83]:
import emoji

def emoji_to_text(text):
    return emoji.demojize(text)

# Example usage
text_with_emojis = "I feel great! 😁"
converted_text = emoji_to_text(text_with_emojis)
print("Converted text:", converted_text)

Converted text: I feel great! :beaming_face_with_smiling_eyes:


In [85]:
def detect_emojis(text):
    return [char for char in text if char in emoji.EMOJI_DATA]

text = "I feel great! 😁 ❤️👍"
emojis_found = detect_emojis(text)
print("Emojis detected:", emojis_found)

Emojis detected: ['😁', '❤', '👍']


In [86]:
def remove_emojis(text):
    return ''.join(char for char in text if char not in emoji.EMOJI_DATA)

cleaned_text = remove_emojis(text)
print("Text after emoji removal:", cleaned_text)

Text after emoji removal: I feel great!  ️
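The stray character left at the end of the output above is most likely the invisible variation selector (U+FE0F) that follows the heart in `❤️`: the character-by-character check removes the heart itself but not its modifier. A minimal sketch of a cleanup, assuming a tiny hard-coded set as a stand-in for `emoji.EMOJI_DATA` so it runs without the package (all names below are illustrative):

```python
# Hypothetical stand-in for emoji.EMOJI_DATA: three emoji code points.
SAMPLE_EMOJIS = {"\U0001F601", "\u2764", "\U0001F44D"}

# Invisible code points that often accompany emojis.
VARIATION_SELECTOR_16 = "\ufe0f"
ZERO_WIDTH_JOINER = "\u200d"

def remove_emojis_and_modifiers(text, emojis=SAMPLE_EMOJIS):
    # Drop the emojis plus the invisible modifiers, then normalise spacing.
    skip = set(emojis) | {VARIATION_SELECTOR_16, ZERO_WIDTH_JOINER}
    cleaned = "".join(ch for ch in text if ch not in skip)
    return " ".join(cleaned.split())

# Same sample text as above, written with escapes: grinning face,
# heart + variation selector, thumbs up.
text = "I feel great! \U0001F601 \u2764\ufe0f\U0001F44D"
print(remove_emojis_and_modifiers(text))
# I feel great!
```

The `emoji` package also appears to offer `replace_emoji` for whole-sequence removal; check the documentation of the installed version before relying on it.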


## 7. Tokenization

**What is Tokenization?**
Tokenization is the process of breaking down a sequence of text into smaller units called tokens. These tokens can be words, subwords, or even individual characters, depending on the granularity required for the NLP task.

**When to Keep/Remove Tokens:**

*   **Keeping Word Tokens:** This is the most common approach. Keeping words as tokens is essential for most NLP tasks like text classification, sentiment analysis, and topic modeling, as words carry the primary meaning.
*   **Removing Punctuation Tokens:** Often, punctuation is removed as it might not contribute to the meaning and can increase the vocabulary size unnecessarily. However, in some cases like sentiment analysis, punctuation like "!" or "?" can be important.
*   **Removing Stopword Tokens:** As discussed earlier, stopwords (common words like "the", "a", "is") are often removed when their presence doesn't add significant value to the analysis.
*   **Keeping Subword Tokens:** For languages with complex morphology or when dealing with out-of-vocabulary words, tokenizing into subwords (like using WordPiece or BPE) can be beneficial. This is often used in transformer models.
*   **Keeping Character Tokens:** In some specific tasks like character-level language modeling or sequence-to-sequence tasks, tokenizing into individual characters might be necessary.
*   **Handling Special Tokens:** Depending on the model or task, special tokens (like `[CLS]`, `[SEP]`, `[PAD]`) are added and should be kept for the model to function correctly.

### Word-level splitting

In [88]:
sentence = "I am from Barishal"
sentence.split()

['I', 'am', 'from', 'Barishal']

### Sentence-level splitting

In [89]:
sentence = "I am going to Dhaka. It is a big city. So many people live there."
sentence.split('.')

['I am going to Dhaka', ' It is a big city', ' So many people live there', '']

*Limitation: There is an empty string at the end of the list. It should not be there!*

In [92]:
sentence = "I am going to Dhaka! It is a big city. So many people live there"
sentence.split()

['I',
 'am',
 'going',
 'to',
 'Dhaka!',
 'It',
 'is',
 'a',
 'big',
 'city.',
 'So',
 'many',
 'people',
 'live',
 'there']

*Limitation: 'Dhaka' and '!' should be two different tokens.*

In [93]:
sentence = "How sad! Are you okay?"
sentence.split('.')

['How sad! Are you okay?']

*Limitation: 'How sad!' and 'Are you okay?' should be two different tokens.*

> **One solution is to use regular expressions (regex).**

In [97]:
import re

sentence = "How sad! Are you okay?"
tokens = re.findall(r"[\w']+", sentence)
tokens

['How', 'sad', 'Are', 'you', 'okay']

**Works great! But what if we need the exclamation mark?**

In [101]:
sentence = "I can't believe it's already Wednesday! Are you sure about that meeting time tomorrow? The weather forecast looks terrible; it's going to pour!"

tokens = re.compile("[.!?;] ").split(sentence)
tokens

["I can't believe it's already Wednesday",
 'Are you sure about that meeting time tomorrow',
 'The weather forecast looks terrible',
 "it's going to pour!"]

**Not bad, but it can be better!**
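Before reaching for a library, plain regex can get a little further. The sketch below is one possible approach, not the notebook's method: a lookbehind keeps the sentence-ending punctuation attached when splitting, and an alternation returns punctuation marks as separate word-level tokens. The sample sentence is made up for illustration.

```python
import re

sentence = "How sad! Are you okay? It's going to pour; bring a coat."

# Split on whitespace that follows ., ! or ?, keeping the punctuation.
sentences = re.split(r"(?<=[.!?])\s+", sentence)
print(sentences)
# ['How sad!', 'Are you okay?', "It's going to pour; bring a coat."]

# A word (optionally with an apostrophe part) or a single punctuation mark.
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", "How sad! Are you okay?")
print(tokens)
# ['How', 'sad', '!', 'Are', 'you', 'okay', '?']
```

Patterns like these still break on abbreviations such as "Dr." or on email addresses, which is where dedicated tokenizers help.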

In [108]:
from nltk import word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [109]:
sentence = "I am from Barishal!"
word_tokenize(sentence)

['I', 'am', 'from', 'Barishal', '!']

In [111]:
from nltk import word_tokenize, sent_tokenize

sentence = "I can't believe it's already Wednesday! Are you sure about that meeting time tomorrow? The weather forecast looks terrible; it's going to pour!"

sent_tokenize(sentence)

["I can't believe it's already Wednesday!",
 'Are you sure about that meeting time tomorrow?',
 "The weather forecast looks terrible; it's going to pour!"]

In [112]:
sentence = "We're here for you. Please email us at email@email.com"

word_tokenize(sentence)

['We',
 "'re",
 'here',
 'for',
 'you',
 '.',
 'Please',
 'email',
 'us',
 'at',
 'email',
 '@',
 'email.com']

Not perfect, huh! The email address is split into three tokens.

In [113]:
import spacy

In [114]:
nlp = spacy.load('en_core_web_sm')

document = nlp(sentence)

for token in document:
  print(token)

We
're
here
for
you
.
Please
email
us
at
email@email.com


Hmm... not bad!



> No library is perfect. You have to select one depending on your data type.



## 8. Stemming

**What is Stemming?**
Stemming is a process in NLP that reduces words to their root or base form, often called the "stem." This is done by removing suffixes from words. The stem may not be a linguistically correct word.

**When to Keep/Remove Stemmed Words:**

*   **When to Stem:** Stemming is useful when you need to group words with similar meanings based on their root, even if the stem itself isn't a real word. This can be beneficial for tasks like:
    *   Information Retrieval (searching for documents containing variations of a word)
    *   Spelling Correction
    *   Reducing vocabulary size

*   **When Not to Stem:** Stemming can sometimes result in stems that are not actual words or can conflate words with different meanings (e.g., "organize" and "organ" might be stemmed to "organ"). Avoid stemming when:
    *   The exact meaning and form of a word are important (e.g., in text generation or machine translation).
    *   You need linguistically correct roots (use lemmatization instead).
    *   The nuances of word variations are crucial for your analysis.

In [116]:
from nltk.stem.porter import PorterStemmer

In [117]:
port_stem = PorterStemmer()

def stemming_words(text):
  return " ".join([port_stem.stem(word) for word in text.split()])

In [121]:
sample = "walk walks walking walked"

stemming_words(sample)

'walk walk walk walk'

In [127]:
sample_para = """The quick brown fox jumps over the lazy dog. This sentence is a pangram, which means it contains every letter of the alphabet at least once. Pangrams are often used to test typewriters and fonts, and they serve as a fun way to practice handwriting."""

stemming_words(sample_para)

'the quick brown fox jump over the lazi dog. thi sentenc is a pangram, which mean it contain everi letter of the alphabet at least once. pangram are often use to test typewrit and fonts, and they serv as a fun way to practic handwriting.'

**What language is that, mate?!**

## Stemming vs. Lemmatization in NLP

Both stemming and lemmatization are techniques used to reduce words to their root form, but they differ in their approach and the resulting root.

*   **Stemming:**
    *   A more basic, rule-based process that chops off suffixes from words.
    *   The resulting "stem" may not be a real word (e.g., "lazy" -> "lazi").
    *   Generally faster and simpler to implement.
    *   Useful for tasks where you need to group related words quickly, even if the root isn't linguistically perfect (like information retrieval).

*   **Lemmatization:**
    *   A more complex process that considers the word's context and uses a vocabulary and morphological analysis to return the base or dictionary form of a word (the lemma).
    *   The resulting "lemma" is always a real word (e.g., "running" -> "run").
    *   Generally slower than stemming due to the need for vocabulary lookup and analysis.
    *   Useful for tasks where you need accurate, linguistically correct roots for better understanding of meaning (like text analysis or machine translation).

In short, stemming is a cruder, faster method that might produce non-words, while lemmatization is a more accurate, slower method that produces real words. The choice between them depends on the specific NLP task and the required level of accuracy.
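The contrast can be sketched with a deliberately naive toy. This is not the Porter algorithm or the WordNet lemmatizer; the suffix list and the lookup table below are hypothetical and exist only to show rule-based chopping next to dictionary lookup.

```python
# Toy illustration only: real stemmers and lemmatizers use far richer
# rules and vocabularies.
SUFFIXES = ('ing', 'ies', 'es', 'ed', 's')                         # hypothetical rule list
LEMMA_LOOKUP = {'studies': 'study', 'ponies': 'pony', 'was': 'be'}  # hypothetical mini-dictionary

def toy_stem(word):
    # Chop the first matching suffix, whether or not a real word remains.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself when it is unknown.
    return LEMMA_LOOKUP.get(word, word)

for w in ['studies', 'ponies', 'was']:
    print(w, toy_stem(w), toy_lemmatize(w))
# studies stud study
# ponies pon pony
# was was be
```

The stemmer column contains non-words such as `stud` and `pon`, while the lookup column returns dictionary forms but only for words it already knows.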

In [128]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')

wordnet_lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [130]:
sample_para = """The quick brown fox jumps over the lazy dog. This sentence is a pangram, which means it contains every letter of the alphabet at least once. Pangrams are often used to test typewriters and fonts, and they serve as a fun way to practice handwriting."""

puncs = set("?:!.,;")
# Filter with a list comprehension; removing items from a list while
# iterating over it can skip elements.
para_words = [word for word in nltk.word_tokenize(sample_para) if word not in puncs]

print("{0:20}{1:20}".format("Word","Lemma"))
for word in para_words:
    print("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word)))

Word                Lemma               
The                 The                 
quick               quick               
brown               brown               
fox                 fox                 
jumps               jump                
over                over                
the                 the                 
lazy                lazy                
dog                 dog                 
This                This                
sentence            sentence            
is                  is                  
a                   a                   
pangram             pangram             
which               which               
means               mean                
it                  it                  
contains            contains            
every               every               
letter              letter              
of                  of                  
the                 the                 
alphabet            alphabet            
at              