# Text Preprocessing in NLP

## Text Preprocessing Methods

Importing the required libraries and data.

In [1]:
import pandas as pd #importing pandas
import numpy as np #importing numpy
import seaborn as sns #importing seaborn
import matplotlib.pyplot as plt #importing matplotlib

In [2]:
movies = pd.read_csv(r"K:\DATA SCIENCE\DataSets\IMDB Dataset.csv")  # raw string so backslashes are not treated as escapes

In [3]:
movies.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### 1) Lower Casing

* What is Lower Casing?
  Lower casing means converting all text to lowercase, so "What" becomes "what".

* Why is it done?
  * Reduces Vocabulary Size: Without lower casing, an NLP model treats "The" and "the" as different words. Every word that appears in more than one capitalization then takes up several vocabulary entries, which makes the model more complex and spreads the evidence for one word across multiple entries. Lower casing maps "The" and "the" to the same token, reducing the vocabulary size and simplifying the model's task.
  * Improves Model Consistency: Capitalization varies with context: a word may be capitalized at the start of a sentence, in a title, or written in all caps for emphasis ("Great", "GREAT", "great"). Lower casing removes these variations so the model can focus on the word itself. Note that it can also erase useful distinctions, for example "US" (the country) becomes "us" (the pronoun).
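The effect on vocabulary size can be checked on a small scale. The sketch below uses a made-up three-review corpus (not the IMDB data) and a hypothetical helper `vocabulary` that simply splits on whitespace.

```python
# Toy corpus, invented for illustration (not taken from the IMDB dataset)
reviews = ["The movie was great", "the MOVIE was long", "Great cast"]

def vocabulary(texts):
    """Return the set of distinct whitespace-separated tokens."""
    return {token for text in texts for token in text.split()}

raw_vocab = vocabulary(reviews)
lower_vocab = vocabulary(text.lower() for text in reviews)

print(len(raw_vocab), sorted(raw_vocab))      # 9 distinct tokens
print(len(lower_vocab), sorted(lower_vocab))  # 6 distinct tokens
```

Here "The"/"the", "movie"/"MOVIE" and "great"/"Great" collapse into single entries after lower casing.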

In [4]:
movies['review'] = movies['review'].str.lower()

In [5]:
movies.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


### 2) Removing HTML tags

* In Natural Language Processing (NLP), removing HTML tags refers to the process of cleaning text data by eliminating the markup used for formatting and presentation on web pages.

* This is done for a few key reasons:

    * Focus on Content, Not Presentation: NLP tasks like sentiment analysis or topic modeling aim to understand the meaning conveyed by the text. HTML tags don't contribute to the core meaning and can even introduce noise. Removing them allows the NLP model to concentrate on the actual content of the text.
    * Simplify Text Processing: HTML tags add complexity to the text data. For instance, a tag such as `<br />` would otherwise be split into tokens like "br" that have nothing to do with the review. Removing the tags leaves plain text, which streamlines the processing pipeline.
    * Standardization: Websites use different HTML tags and formatting styles. Removing this markup gives the text data a uniform plain-text format, which makes it easier to train and apply NLP models across different datasets.

In [6]:
import re

In [7]:
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)
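One thing to watch with the function above: replacing a tag with an empty string glues together words that were separated only by the tag. The variant below is a sketch, not part of the original notebook. The name `strip_html` and the entity-decoding step are additions. It replaces tags with a space, decodes entities such as `&amp;`, and collapses the leftover whitespace.

```python
import re
from html import unescape

TAG_RE = re.compile(r'<[^>]+>')

def strip_html(text):
    # Replace each tag with a space so neighbouring words stay separated
    no_tags = TAG_RE.sub(' ', text)
    # Decode HTML entities such as &amp; (extra step, not in the notebook)
    decoded = unescape(no_tags)
    # Collapse the runs of whitespace left behind by the removed tags
    return ' '.join(decoded.split())

sample = "a wonderful little production.<br /><br />the filming technique &amp; cast"
print(strip_html(sample))
```

For heavily nested or malformed HTML, a real parser is usually a safer choice than a regular expression.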

In [8]:
movies['review'] = movies['review'].apply(remove_html_tags)

In [9]:
movies.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


### 3) Removing URLs

* In NLP, removing URLs refers to the process of eliminating website addresses and hyperlinks from text data.

* This is done for a couple of reasons:
    * Focus on Content, Not References: NLP tasks often aim to understand the meaning and sentiment of text. URLs themselves don't contribute much to this meaning. By removing them, the NLP model can concentrate on the core content of the text, like the surrounding words and their relationships.
    * Reduce Noise and Improve Efficiency: URLs can introduce unnecessary noise into the data. They vary greatly in length and format, and almost every URL is a unique token, so the model cannot learn useful patterns from them. Removing them simplifies the data and can improve the efficiency and accuracy of the NLP process.

In [10]:
movies['review'].iloc[742]

'mario lewis of the competitive enterprise institute has written a definitive 120-page point-by-point, line-by-line refutation of this mendacious film, which should be titled a convenient lie. the website address where his debunking report, which is titled "a skeptic\'s guide to an inconvenient truth" can be found at is :www.cei.org. a shorter 10-page version can be found at: www.cei.org/pdf/5539.pdf once you read those demolitions, you\'ll realize that alleged "global warming" is no more real or dangerous than the y2k scare of 1999, which gore also endorsed, as he did the pseudo-scientific film the day after tomorrow, which was based on a book written by alleged ufo abductee whitley strieber. as james "the amazing" randi does to psychics, and philip klass does to ufos, and gerald posner does to jfk conspir-idiocy theories, so does mario lewis does to al gore\'s movie and the whole "global warming" scam.'

In [11]:
import re
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)

In [12]:
movies['review'] = movies['review'].apply(remove_url)

In [13]:
movies['review'].iloc[742]

'mario lewis of the competitive enterprise institute has written a definitive 120-page point-by-point, line-by-line refutation of this mendacious film, which should be titled a convenient lie. the website address where his debunking report, which is titled "a skeptic\'s guide to an inconvenient truth" can be found at is : a shorter 10-page version can be found at:  once you read those demolitions, you\'ll realize that alleged "global warming" is no more real or dangerous than the y2k scare of 1999, which gore also endorsed, as he did the pseudo-scientific film the day after tomorrow, which was based on a book written by alleged ufo abductee whitley strieber. as james "the amazing" randi does to psychics, and philip klass does to ufos, and gerald posner does to jfk conspir-idiocy theories, so does mario lewis does to al gore\'s movie and the whole "global warming" scam.'
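In the output above, the full stop after "www.cei.org" disappeared together with the URL, because `\S+` matches everything up to the next whitespace. If sentence boundaries matter later on, the trailing punctuation can be kept. The following is a minimal sketch with a hypothetical function name `remove_url_keep_punct`, and the set of trailing characters it preserves is an assumption.

```python
import re

# Same idea as remove_url: match http(s):// or www. followed by non-space characters
URL_RE = re.compile(r'(?:https?://|www\.)\S+')

def remove_url_keep_punct(text):
    def keep_trailing(match):
        url = match.group(0)
        # Punctuation at the very end most likely belongs to the sentence, not the URL
        stripped = url.rstrip('.,;:!?)')
        return url[len(stripped):]
    return URL_RE.sub(keep_trailing, text)

sample = "the report is at www.cei.org. a shorter version is at www.cei.org/pdf/5539.pdf today"
print(remove_url_keep_punct(sample))
```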

### 4) Removing Punctuation

* In NLP, removing punctuation refers to the process of eliminating punctuation marks from text data.


* This is a common text pre-processing step, done for a few reasons:

    *  Focusing on Word Meaning: Punctuation primarily conveys emphasis, tone, and grammatical structure. In many NLP tasks, we're more interested in the core meaning conveyed by the words themselves. Removing punctuation allows the model to concentrate on the semantic content of the text, treating words like "data" and "data!" identically.
    *  Reducing Vocabulary Size: Similar to lowercasing, removing punctuation helps shrink the vocabulary size an NLP model needs to handle. Punctuation variations like commas, periods, and exclamation marks wouldn't be considered separate "words" by the model, streamlining the processing.
    *  Standardizing Text: Punctuation usage can vary depending on writing style or origin. Removing it ensures consistency in the text data presented to the model. This can be particularly helpful when dealing with large datasets from diverse sources.

* However, removing punctuation isn't always advisable. Here's when it might not be the best approach:

    * Sentiment Analysis: Punctuation can be crucial for understanding sentiment. An exclamation mark can noticeably change the intensity of a sentence compared to a period.
    * Sarcasm Detection: Sarcasm is often signaled by punctuation such as scare quotes, ellipses, or "?!", and that signal is lost once the marks are removed.
    * Emoticons: Text emoticons such as ":)" or ":-(" are built entirely from punctuation characters. Removing punctuation deletes them and the emotion they convey, which matters especially in informal communication.

In [14]:
import string
exclude = string.punctuation

def remove_punc(text):
    return text.translate(str.maketrans('','',exclude))
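When some marks carry signal for the task, one option is to remove everything except a chosen set. The sketch below assumes that "!" and "?" are worth keeping. That choice, and the names `KEEP` and `remove_punc_keep`, are illustrative and not part of the notebook.

```python
import string

KEEP = "!?"  # assumption: these marks are treated as sentiment cues
TO_REMOVE = ''.join(ch for ch in string.punctuation if ch not in KEEP)
TABLE = str.maketrans('', '', TO_REMOVE)  # built once, reused for every review

def remove_punc_keep(text):
    return text.translate(TABLE)

print(remove_punc_keep("what a film!!! really, a 10/10?"))
# what a film!!! really a 1010?
```

The example also shows a side effect of deleting punctuation outright: "10/10" turns into "1010".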

In [15]:
movies['review'] = movies['review'].apply(remove_punc)

In [16]:
movies.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


### 5) Spelling Corrections

* Spelling correction in NLP is the process of identifying and fixing errors in written text. This is crucial for several reasons:

    * Improved Accuracy: Many NLP tasks rely on understanding the meaning of text. Misspelled words can confuse the model and lead to inaccurate results. By correcting spelling errors, NLP applications can function with greater accuracy.
    *  Handling Informal Text: Informal communication, like social media posts or emails, often contains typos and slang. Spelling correction helps NLP models understand these informal styles of writing and extract meaning from them.
    *  Data Cleaning: Large amounts of text data used in NLP tasks may contain typos due to various reasons like optical character recognition (OCR) errors or user mistakes. Spelling correction helps clean this data and prepare it for better NLP processing.

In [17]:
from textblob import TextBlob


In [18]:
incorrect_text = ' Hell how are you fiine'

In [19]:
textblb = TextBlob(incorrect_text)

In [20]:
textblb.correct().string

' Well how are you fine'
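TextBlob's documentation describes its spelling correction as based on Peter Norvig's "How to Write a Spelling Corrector". The toy version below sketches that general idea and is not TextBlob's code: it generates every string one edit away from the input and picks the most frequent known word. The word counts in `WORD_COUNTS` are invented for illustration.

```python
import string

# Hypothetical word frequencies; a real corrector derives these from a large corpus
WORD_COUNTS = {"hello": 50, "how": 120, "are": 200, "you": 300, "fine": 40, "film": 80}

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    if word in WORD_COUNTS:  # known words are left alone
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    # Choose the most frequent candidate, or give up and return the input
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print([correct(w) for w in "helo how are you fiine".split()])
```

Because the choice depends only on word frequency and edit distance, a corrector of this kind can replace a valid but less common word with a more frequent one.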

### 6) Removing Stop Words

* In Natural Language Processing (NLP), removing stop words is a technique used during text pre-processing. Stop words are very common words that carry little meaning on their own. Examples include "the", "a", "is", "in", "of", etc.

* Here's why removing stop words is commonly done:

    * Focus on Content: Stop words don't contribute much to the actual content or meaning of a sentence. By removing them, NLP models can focus on the more important words that convey the core ideas. This can be particularly beneficial for tasks like topic modeling. For sentiment analysis, check the list first: NLTK's English list contains negations such as "not" and "no", and dropping them can reverse the meaning of a review.
    * Reduce Noise: Stop words can be seen as noise in the data. Removing them reduces the overall data size and makes it easier for NLP models to learn patterns and relationships between the important words. This can lead to improved model performance.
    * Improve Efficiency: Since there are fewer words to process, removing stop words can make NLP tasks more efficient, especially when dealing with large datasets.

In [21]:
from nltk.corpus import stopwords

In [22]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [23]:
# Slower but easier to understand method
def remove_stopWords(text):
    new_text = []
    for word in text.split():
        # Keep the word only if it is not a stop word
        # (the list is rebuilt on every check, which is what makes this slow)
        if word not in stopwords.words('english'):
            new_text.append(word)
    # Join the remaining words with single spaces
    return ' '.join(new_text)

In [24]:
# Faster Method
stop_words = set(stopwords.words('english'))  # Build the set once; set lookup is much faster than list lookup

def remove_stopWords1(text):
    return ' '.join(word for word in text.split() if word not in stop_words)
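The order of the steps matters here. This notebook strips punctuation before removing stop words, so contractions reach this step as "youll" or "theres", while the stop word list contains "you'll". That is why "theres" and "youll" survive in the output. The sketch below reproduces the effect with a tiny stand-in stop word set (an assumption, so the example runs without NLTK data) and shows one possible fix: applying the same punctuation stripping to the stop words.

```python
import string

PUNCT_TABLE = str.maketrans('', '', string.punctuation)

# Tiny stand-in for NLTK's English list (illustrative only)
STOP_WORDS = {"the", "is", "a", "you'll", "there's"}

def strip_punct(text):
    return text.translate(PUNCT_TABLE)

def remove_stop_words(text, stop_words):
    return ' '.join(w for w in text.split() if w not in stop_words)

text = "there's a boy you'll like"

# Order used in the notebook: punctuation first, then stop words
as_in_notebook = remove_stop_words(strip_punct(text), STOP_WORDS)

# Possible fix: normalise the stop words the same way as the text
normalised = {strip_punct(w) for w in STOP_WORDS}
with_fix = remove_stop_words(strip_punct(text), normalised)

print(as_in_notebook)  # theres boy youll like
print(with_fix)        # boy like
```

Running stop word removal before punctuation removal is another way to get the same result.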


In [25]:
movies['review'] = movies['review'].apply(remove_stopWords1)

In [26]:
movies.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive


### 7) Handling Emoji

* Removing emojis in NLP refers to the process of filtering out emoji characters from text data during preprocessing. Here's a breakdown of why this is done:

    * Focus on Core Meaning: Emojis are often visual representations of emotions or ideas. In tasks like sentiment analysis or topic modeling, the focus is on the underlying meaning conveyed by words. Removing emojis helps the NLP model concentrate on the core textual content.
    * Data Consistency: Emojis can have subjective interpretations. A happy face emoji might indicate joy for one person and sarcasm for another. Removing them ensures consistency in the data the model analyzes.
    * Reduced Complexity: Emojis can add complexity, especially for simpler models. By removing them, the model deals with a smaller set of characters, potentially improving processing efficiency.

In [27]:
import re

def remove_emoji(text):
    emoji_pattern = re.compile("["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
        "\U00002702-\U000027B0"  # dingbats
        "\U000024C2-\U0001F251"  # enclosed characters; this wide range also covers CJK and other non-emoji scripts
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


In [28]:
movies['review'] = movies['review'].apply(remove_emoji)

Our data did not contain any emojis.
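Removal is not the only way to handle emojis. When they carry sentiment, they can be replaced with a textual description instead. The sketch below uses the standard library's `unicodedata.name` for this. The character ranges in `EMOJI_RE` are a rough assumption, not a complete emoji definition, and `describe_emoji` is a hypothetical helper.

```python
import re
import unicodedata

# Approximate emoji ranges (assumption): pictographs/emoticons plus misc symbols and dingbats
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def describe_emoji(text):
    def to_name(match):
        # Look up the official Unicode name, e.g. "GRINNING FACE"
        name = unicodedata.name(match.group(0), "")
        return f" {name.lower()} " if name else ""
    # Replace each emoji, then collapse any extra whitespace
    return ' '.join(EMOJI_RE.sub(to_name, text).split())

print(describe_emoji("loved this movie \U0001F600"))
```

Dedicated packages offer more complete emoji tables. This version only relies on names that ship with Python.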

### 8) Tokenization

* Tokenization in NLP refers to the process of breaking down a piece of text into smaller units called tokens. These tokens can be individual words, characters, or even phrases, depending on the specific task and chosen approach. It is the foundation of most NLP pipelines, preparing the text data for further analysis.

* There are a couple of key reasons why tokenization is crucial in NLP:

    * Makes Text Manageable for Machines: Raw text is a continuous stream of characters for computers. Tokenization chops it up into discrete units that machines can understand and process more easily. Imagine trying to analyze the meaning of a whole paragraph at once – it's overwhelming! Tokens provide bite-sized pieces for efficient analysis.

    * Enables Further NLP Tasks: Tokenization acts like a springboard for various NLP applications. Once the text is separated into tokens, you can perform tasks like:

        * Identifying Parts of Speech: Recognizing nouns, verbs, adjectives, etc. in the tokens helps understand the sentence structure and meaning.
        * Building Vocabularies: Creating a list of unique tokens from a text corpus helps NLP models understand the language and identify patterns.
        * Performing Text Cleaning: Removing punctuation, stop words (common words like "the" or "and" with little meaning) and other irrelevant elements often happens after tokenization.

In essence, tokenization transforms unstructured text data into a structured format that computers can work with, making it the essential first step in unlocking the power of NLP.

In [29]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [31]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\siddh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

a) Word Tokenizer

In [34]:
def tokenize_words(text):
  return nltk.word_tokenize(text)


b) Sentence Tokenizer

In [35]:
def tokenize_sentences(text):
  return nltk.sent_tokenize(text)


In [36]:
movies['review_word_token'] = movies['review'].apply(tokenize_words)


In [37]:
movies['review_sent_token'] = movies['review'].apply(tokenize_sentences)

In [38]:
movies.head()

Unnamed: 0,review,sentiment,review_word_token,review_sent_token
0,one reviewers mentioned watching 1 oz episode ...,positive,"[one, reviewers, mentioned, watching, 1, oz, e...",[one reviewers mentioned watching 1 oz episode...
1,wonderful little production filming technique ...,positive,"[wonderful, little, production, filming, techn...",[wonderful little production filming technique...
2,thought wonderful way spend time hot summer we...,positive,"[thought, wonderful, way, spend, time, hot, su...",[thought wonderful way spend time hot summer w...
3,basically theres family little boy jake thinks...,negative,"[basically, theres, family, little, boy, jake,...",[basically theres family little boy jake think...
4,petter matteis love time money visually stunni...,positive,"[petter, matteis, love, time, money, visually,...",[petter matteis love time money visually stunn...
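In the table above, every `review_sent_token` entry is a list with a single string. Sentence tokenizers rely on punctuation, and the reviews had their punctuation removed in step 4, so no sentence boundaries are left to find. The regex-based functions below are rough stand-ins for NLTK's tokenizers (an assumption made so the example runs without downloading data) and illustrate the difference between raw and cleaned text.

```python
import re

def simple_word_tokenize(text):
    # A token is a run of word characters/apostrophes, or a single punctuation mark
    return re.findall(r"[\w']+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

raw = "a wonderful little production. the filming technique is unassuming!"
cleaned = "wonderful little production filming technique unassuming"

print(simple_word_tokenize(raw))
print(simple_sent_tokenize(raw))      # two sentences
print(simple_sent_tokenize(cleaned))  # one "sentence": no boundaries left
```

If sentence-level tokens are needed, sentence tokenization should run on the text before punctuation is stripped.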


### 9) Stemming & Lemmatization

* Stemming in NLP refers to the process of reducing words to their base or root form. This is done by chopping affixes (in practice mostly suffixes) off words with simple rules, aiming to get to a more generalizable form.

* Here's why stemming is done in NLP:

    * Reduces Redundancy: Words with different endings (like "playing", "played", "plays") often convey the same core meaning. Stemming reduces these variations to a single "play", streamlining the text for processing.

    * Improves Efficiency: By reducing word variants, stemming helps manage the vocabulary size an NLP system needs to handle. This makes the system more efficient, especially when dealing with large amounts of text data.

    * Enhances Text Matching: Stemming allows for better matching between words in a query and words in a document. Even if the exact word form isn't present, stemming can connect them based on their root meaning.

* However, it's important to remember that stemming can be a bit rough. Unlike lemmatization (which maps each word to its dictionary form), stemming sometimes produces strings that are not real words: "play" from "playing" is a valid word, but "littl" from "little" or "famili" from "family" is not.

* Here's a quick comparison:
    * Stemming: Simpler, faster, may create non-words
    * Lemmatization: More accurate, slower, preserves actual words
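The contrast can be seen in a deliberately simplified form. The stemmer below is a toy with three suffix rules, far cruder than the Porter algorithm, and the lemmatizer is a small hand-written dictionary standing in for WordNet. Both are assumptions made purely for illustration.

```python
SUFFIXES = ("ing", "ed", "s")  # toy rule set, not Porter's rules

def toy_stem(word):
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least three characters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical lemma dictionary standing in for a resource like WordNet
LEMMAS = {"playing": "play", "played": "play", "plays": "play", "mice": "mouse"}

def toy_lemmatize(word):
    # Unknown words are returned unchanged
    return LEMMAS.get(word, word)

for w in ["playing", "played", "plays", "mice", "during"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based stemmer needs no dictionary and handles the "play" family, but it turns "during" into the non-word "dur" and leaves "mice" untouched. The lookup gets "mice" right and never invents a word, but only covers what the dictionary contains.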

a) PorterStemmer

In [39]:
from nltk.stem.porter import PorterStemmer


In [40]:
ps = PorterStemmer()

In [41]:
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [42]:
movies['review_stemming'] = movies['review'].apply(stem_words) # you can apply stemming on tokenized data also.

In [45]:
movies.head()

Unnamed: 0,review,sentiment,review_word_token,review_sent_token,review_stemming
0,one reviewers mentioned watching 1 oz episode ...,positive,"[one, reviewers, mentioned, watching, 1, oz, e...",[one reviewers mentioned watching 1 oz episode...,one review mention watch 1 oz episod youll hoo...
1,wonderful little production filming technique ...,positive,"[wonderful, little, production, filming, techn...",[wonderful little production filming technique...,wonder littl product film techniqu unassum old...
2,thought wonderful way spend time hot summer we...,positive,"[thought, wonderful, way, spend, time, hot, su...",[thought wonderful way spend time hot summer w...,thought wonder way spend time hot summer weeke...
3,basically theres family little boy jake thinks...,negative,"[basically, theres, family, little, boy, jake,...",[basically theres family little boy jake think...,basic there famili littl boy jake think there ...
4,petter matteis love time money visually stunni...,positive,"[petter, matteis, love, time, money, visually,...",[petter matteis love time money visually stunn...,petter mattei love time money visual stun film...


b) Lemmatization

In [70]:
from nltk.stem import WordNetLemmatizer

In [71]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\siddh\AppData\Roaming\nltk_data...


True

In [75]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\siddh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [83]:
import nltk

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

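def lemmatize_text(text):
    # Without a part-of-speech argument, lemmatize() treats every word as a noun,
    # so verb forms such as "mentioned" and "watching" are left unchanged.
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
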
In [84]:
movies['review_Lemmatization'] = movies['review'].apply(lemmatize_text)

In [85]:
movies.head()

Unnamed: 0,review,sentiment,review_word_token,review_sent_token,review_stemming,review_Lemmatization
0,one reviewers mentioned watching 1 oz episode ...,positive,"[one, reviewers, mentioned, watching, 1, oz, e...",[one reviewers mentioned watching 1 oz episode...,one review mention watch 1 oz episod youll hoo...,"[one, reviewer, mentioned, watching, 1, oz, ep..."
1,wonderful little production filming technique ...,positive,"[wonderful, little, production, filming, techn...",[wonderful little production filming technique...,wonder littl product film techniqu unassum old...,"[wonderful, little, production, filming, techn..."
2,thought wonderful way spend time hot summer we...,positive,"[thought, wonderful, way, spend, time, hot, su...",[thought wonderful way spend time hot summer w...,thought wonder way spend time hot summer weeke...,"[thought, wonderful, way, spend, time, hot, su..."
3,basically theres family little boy jake thinks...,negative,"[basically, theres, family, little, boy, jake,...",[basically theres family little boy jake think...,basic there famili littl boy jake think there ...,"[basically, there, family, little, boy, jake, ..."
4,petter matteis love time money visually stunni...,positive,"[petter, matteis, love, time, money, visually,...",[petter matteis love time money visually stunn...,petter mattei love time money visual stun film...,"[petter, matteis, love, time, money, visually,..."
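The cleaning steps from this notebook can also be chained into a single function. The following is a minimal sketch using only the standard library. The function name `preprocess`, the sample sentence, and the five-word stop word set are illustrative assumptions (the notebook itself uses NLTK's full list).

```python
import re
import string

HTML_RE = re.compile(r'<.*?>')
URL_RE = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
STOP_WORDS = {"a", "the", "is", "of", "at"}  # tiny stand-in for NLTK's list

def preprocess(text):
    text = text.lower()                  # 1) lower casing
    text = HTML_RE.sub(' ', text)        # 2) remove HTML tags
    text = URL_RE.sub(' ', text)         # 3) remove URLs
    text = text.translate(PUNCT_TABLE)   # 4) remove punctuation
    # 6) + 8) whitespace tokenization and stop word removal
    return [w for w in text.split() if w not in STOP_WORDS]

print(preprocess("A wonderful little production.<br /><br />See www.example.com for the trailer!"))
```

With pandas, a function like this could be applied to a column in one pass via `.apply`, instead of re-assigning the column after every step.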
