# Steps to preprocess your dataset
---

The steps below walk through basic text preprocessing of a dataset. The example dataset is the `IMDB Dataset of 50K Movie Reviews`.


---

## 📑 Contents

1. Lower Casing
2. Remove HTML tags
3. Remove URLs
4. Remove Punctuation
5. Chat word treatment
6. Spelling Correction
7. Removing Stop words
8. Handling Emojis
9. Tokenization
10. Stemming
11. Lemmatization

In [17]:
import pandas as pd

df = pd.read_csv('IMDB_dataset.csv')
df.sample(5)

Unnamed: 0,review,sentiment
32161,This is the best television series for childre...,positive
13453,"Very sadly, I can relate to this movie, as I'm...",positive
32621,I watched this movie for the first time around...,positive
39433,"Here's the good news first. ""Spirit"" is the mo...",negative
36606,"To be honest, I've never been to the Congo or ...",negative


# 1. Lower Casing

In [35]:
df['review'] = df['review'].str.lower()
df.sample(5)

Unnamed: 0,review,sentiment
41670,i really thought that this movie was superb. n...,positive
18231,this movie is a cinematic collage of gangster ...,negative
48885,this picture came out in 1975 and it was the s...,positive
26698,extremely pinching vision of a war situation w...,positive
28170,if you can imagine mickey mouse as a new york ...,negative


# 2. Remove HTML tags

In [20]:
import re

def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [37]:
df['review'] = df['review'].apply(remove_html_tags)
df.sample(5)

Unnamed: 0,review,sentiment
38403,"being a d.b. sweeney fan, i've been on the loo...",positive
30033,"this movie's heart was in the right place, no ...",negative
317,"having been a marine, i can tell you that the ...",positive
4934,"don't get me wrong, dan jansen was a great spe...",negative
31344,i'm one of those people who usually watch prog...,positive
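One side effect worth checking: replacing a tag with an empty string joins the words on either side of it, which matters if the reviews contain `<br />` line breaks between sentences. The sketch below is a hypothetical variant (the name `remove_html_tags_spaced` is not part of the notebook) that substitutes a space and then collapses repeated whitespace.

```python
import re

# Same non-greedy tag pattern as remove_html_tags above.
TAG = re.compile(r"<.*?>")

def remove_html_tags_spaced(text):
    """Replace each tag with a space, then collapse runs of whitespace."""
    text = TAG.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "great movie.<br /><br />loved it"
print(TAG.sub("", sample))              # tags removed, neighbouring words joined
print(remove_html_tags_spaced(sample))  # tags removed, words kept apart
```

Whether the extra space matters depends on the later steps; it mainly affects tokenization, since "movie.loved" would otherwise be treated as one token.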


# 3. Remove URLs

In [40]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

df['review'] = df['review'].apply(remove_url)
df.sample(5)

Unnamed: 0,review,sentiment
46065,"dull acting, weak script...worst spanish movie...",negative
11904,"this movie was 100% boring, i swear i almost d...",negative
18933,this is absolutely the most stupidest movie ev...,negative
6267,i don't know why critics cal it bizarre and ma...,positive
19037,what a class bit of british cinema! it's about...,positive


# 4. Remove Punctuation

In [45]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [46]:
exclude = string.punctuation

def remove_punctuation(text):
    return text.translate(str.maketrans('','',exclude))

In [47]:
df['review'] = df['review'].apply(remove_punctuation)
df.sample(5)

Unnamed: 0,review,sentiment
32331,tom hanks returns as dan browns symbologist ro...,positive
6465,this musical has a deep meaning which is appre...,positive
10464,the polar express director robert zemeckis i l...,negative
13910,i didnt know willem dafoe was so hard up for b...,negative
22809,this is the second movie based on the life and...,positive
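Steps 1 to 4 can also be chained into a single function. The sketch below is one possible arrangement, not the notebook's own code; `basic_clean` is a hypothetical name. It keeps the same order as above, which matters: if punctuation were stripped first, `https://example.com` would become `httpsexamplecom` and the URL pattern would no longer match it.

```python
import re
import string

# Patterns and translation table compiled once, reused for every review.
HTML_TAG = re.compile(r"<.*?>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def basic_clean(text):
    """Apply steps 1-4 in order: lower-case, strip tags, strip URLs, strip punctuation."""
    text = text.lower()
    text = HTML_TAG.sub("", text)
    text = URL.sub("", text)
    text = text.translate(PUNCT_TABLE)
    return text

print(basic_clean("Great film! <br />See https://example.com for more."))
```

With a DataFrame this would be applied as `df['review'].apply(basic_clean)`, in place of the four separate `apply` calls.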


# 5. Chat word treatment

In [58]:
# Initialize empty dictionary
chat_words = {}

# Read file and process each line
with open("slang.txt", "r", encoding="utf-8") as file:
    for line in file:
        # Skip blank lines and any line without a key=value pair
        if "=" not in line.strip():
            continue

        # Split by '=' and strip spaces
        key, value = line.strip().split("=", 1)

        # Convert to lowercase and strip
        key = key.strip().lower()
        value = value.strip().lower()

        # Store in dictionary
        chat_words[key] = value

# Print the dictionary
print(chat_words)


{'afaik': 'as far as i know', 'afk': 'away from keyboard', 'asap': 'as soon as possible', 'atk': 'at the keyboard', 'atm': 'at the moment', 'a3': 'anytime, anywhere, anyplace', 'bak': 'back at keyboard', 'bbl': 'be back later', 'bbs': 'be back soon', 'bfn': 'bye for now', 'b4n': 'bye for now', 'brb': 'be right back', 'brt': 'be right there', 'btw': 'by the way', 'b4': 'before', 'cu': 'see you', 'cul8r': 'see you later', 'cya': 'see you', 'faq': 'frequently asked questions', 'fc': 'fingers crossed', 'fwiw': "for what it's worth", 'fyi': 'for your information', 'gal': 'get a life', 'gg': 'good game', 'gn': 'good night', 'gmta': 'great minds think alike', 'gr8': 'great!', 'g9': 'genius', 'ic': 'i see', 'icq': 'i seek you (also a chat program)', 'ilu': 'ilu: i love you', 'imho': 'in my honest/humble opinion', 'imo': 'in my opinion', 'iow': 'in other words', 'irl': 'in real life', 'kiss': 'keep it simple, stupid', 'ldr': 'long distance relationship', 'lmao': 'laugh my a.. off', 'lol': 'laug

In [62]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w in chat_words:
            new_text.append(chat_words[w])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [72]:
txt = 'afaik i am good'
chat_conversion(txt)

'as far as i know i am good'
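`chat_conversion` only matches tokens that are exactly equal to a dictionary key, so it depends on the text already being lower-cased and free of punctuation. If that cannot be guaranteed, a regex with word boundaries is one alternative. The sketch below is an assumption-laden illustration: `chat_conversion_regex` is a hypothetical name, and the dictionary is a two-entry subset standing in for the full `chat_words` mapping.

```python
import re

# Hypothetical two-entry subset of the slang dictionary, for illustration only.
chat_words = {"afaik": "as far as i know", "brb": "be right back"}

# One alternation of all keys, anchored on word boundaries and case-insensitive.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, chat_words)) + r")\b",
    flags=re.IGNORECASE,
)

def chat_conversion_regex(text):
    """Expand chat abbreviations even when they are capitalised or next to punctuation."""
    return pattern.sub(lambda m: chat_words[m.group(0).lower()], text)

print(chat_conversion_regex("AFAIK, he said brb!"))
```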

In [68]:
df['review'] = df['review'].apply(chat_conversion)
df.sample(5)


Unnamed: 0,review,sentiment
3754,a truly accurate and unglamourous look into mo...,positive
14364,a film written and directed by neil young gree...,negative
2249,i am always wary of taking too instant a disli...,negative
11560,im a fan of jeff bridges so i snapped this up ...,positive
18450,this series could very well be the best britco...,positive


# 6. Spelling Correction

In [71]:
from textblob import TextBlob

In [76]:
def spell_correct(text):
    textBlb = TextBlob(text)
    return textBlb.correct().string

incorrect_text = 'ceertain conditionas duriing seveal ggenerationss aree modiffied in the saame maner.'
spell_correct(incorrect_text)

'certain conditions during several generations are modified in the same manner.'

In [None]:
df['review'] = df['review'].apply(spell_correct)
df.sample(5)


# 7. Removing Stop words

Stop words are common words that carry little meaning on their own, such as "the", "is", and "and". This step removes them from the text.

In [84]:
import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # uncomment on first run if the corpus is not installed yet
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [86]:
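# Build the stop word set once; the original reloaded the full list for every word
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)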


In [87]:
remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times')

'probably all-time favorite movie, story selflessness, sacrifice dedication noble cause, preachy boring. never gets old, despite seen 15 times'
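Note that the output above has lost the word "not": the NLTK list includes negations such as "no", "nor", and "not", and dropping them can flip the apparent sentiment of a review. A possible workaround is to exclude negations from the stop word set. The sketch below uses a small hand-written stop word list (an assumption, so that it runs without the NLTK corpus), and `remove_stopwords_keep_negation` is a hypothetical name.

```python
# Hand-picked stop words for illustration only; in the notebook the set would
# come from stopwords.words('english').
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "was", "not", "no", "nor"}
NEGATIONS = {"not", "no", "nor"}

def remove_stopwords_keep_negation(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    """Drop stop words but keep the negation words listed in `keep`."""
    effective = stop_words - keep
    return " ".join(w for w in text.split() if w.lower() not in effective)

print(remove_stopwords_keep_negation("this movie was not good"))
```

Whether keeping negations helps depends on the downstream model, so it is worth comparing both variants.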

In [None]:
df['review'] = df['review'].apply(remove_stopwords)
df.sample(5)

# 8. Handling Emojis


In [89]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
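                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; caution: this wide range also covers CJK and many other non-Latin scripts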
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [90]:
remove_emoji("Loved the movie. It was 😘😘")

'Loved the movie. It was '

## 8.1 Replace emoji with context

In [94]:
import emoji
print(emoji.demojize('Python is 🔥'))

Python is :fire:


In [95]:
print(emoji.demojize('Loved the movie. It was 😘'))

Loved the movie. It was :face_blowing_a_kiss:


In [None]:
def replace_emoji_with_context(text):
    return emoji.demojize(text)

In [None]:
df['review'] = df['review'].apply(replace_emoji_with_context)
df.sample(5)

# 9. Tokenization

### 9.1 `split()` function

In [103]:
# word tokenization
sent1 = 'I am going to Korea'
sent1.split()

['I', 'am', 'going', 'to', 'Korea']

In [104]:
# sentence tokenization
sent2 = 'I am going to Korea. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to Korea',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [105]:
# Problems with split function
sent3 = 'I am going to Korea!'
sent3.split()

['I', 'am', 'going', 'to', 'Korea!']

In [106]:
sent4 = 'Where do think I should go? I have 3 day holiday'
sent4.split('.')

['Where do think I should go? I have 3 day holiday']
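Before reaching for a library, both problems can be partly addressed with regular expressions. The following is a minimal sketch under simple assumptions (English text, sentences ending in `.`, `!` or `?` followed by whitespace); the function names are hypothetical.

```python
import re

def regex_word_tokenize(text):
    """Return runs of word characters, and each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def regex_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(regex_word_tokenize("I am going to Korea!"))
print(regex_sent_tokenize("Where do think I should go? I have 3 day holiday"))
```

This still breaks on abbreviations, e-mail addresses, and prices (for example, "Ph.D" is split into three tokens), which is where the library tokenizers in the next sections do better.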

### 9.2.1 Libraries (NLTK)

In [108]:
from nltk.tokenize import word_tokenize, sent_tokenize

sent1 = 'I am going to visit Korea!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'Korea', '!']

In [109]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [110]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at abc@gmail.com"
sent7 = 'A 5km ride cost $10.50'

word_tokenize(sent5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

In [111]:
word_tokenize(sent6)

['We',
 "'re",
 'here',
 'to',
 'help',
 '!',
 'mail',
 'us',
 'at',
 'abc',
 '@',
 'gmail.com']

In [112]:
word_tokenize(sent7)

['A', '5km', 'ride', 'cost', '$', '10.50']

### 9.2.2 Libraries (spaCy)

In [118]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [119]:
doc1 = nlp(sent1)
doc2 = nlp(sent5)
doc3 = nlp(sent6)
doc4 = nlp(sent7)

In [123]:
for token in doc3:
    print(token)

We
're
here
to
help
!
mail
us
at
abc@gmail.com


# 10. Stemming

Stemming reduces inflected words to a common base form, the “stem,” by stripping word endings. The stem does not have to be a valid word in the language: the Porter stemmer used below maps “movie” to “movi,” for example. The idea is to take away the different endings of related words to find their shared core. For instance, “swimmer,” “swimming,” and “swims” all share the root “swim,” and a stemmer tries to map such forms to one stem, although a given algorithm may not reduce every form. This shrinks the vocabulary and lets NLP algorithms treat related word forms as the same term.
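As a rough illustration of the suffix-stripping idea, here is a deliberately naive stemmer. It is not the Porter algorithm used in the cells below; the suffix list and the minimum stem length of 3 are arbitrary assumptions chosen to keep the example short.

```python
# Suffixes are tried in this order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, as long as at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["walks", "walking", "walked", "movies", "swimming"]])
```

The result for "movies" is not a dictionary word, and "swimming" keeps its doubled consonant. Real stemmers add further rules to handle cases like the latter.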

In [124]:
from nltk.stem.porter import PorterStemmer

In [125]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [126]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [127]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
print(text)

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


In [128]:
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

# 11. Lemmatization

Lemmatization is a text preprocessing technique used in natural language processing to reduce a word to its dictionary base form, called the lemma, based on the word's meaning. Unlike a stem, the lemma is always a valid word in the language.

- Lemmatization vs Stemming

Lemmatization and stemming are both text normalization techniques used in natural language processing (NLP) to reduce words to a base form, but they differ in approach and accuracy. Stemming is the simpler, faster process: it chops off word endings. Lemmatization considers the context of the word and uses a dictionary to find its base form (lemma).
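The difference can be sketched with a toy example. The lemma table below is hand-written and hypothetical (a real lemmatizer such as WordNet's uses a full dictionary), and `chop_suffix` is a stand-in for a stemmer, not the Porter algorithm.

```python
# Hypothetical lemma table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("was", "v"): "be",
    ("has", "v"): "have",
    ("running", "v"): "run",
    ("hours", "n"): "hour",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a (word, part of speech) pair; fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

def chop_suffix(word):
    """Stemming stand-in: strip 'ing' or 's' if at least 3 characters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(f"{'Word':10}{'Stem':10}{'Lemma':10}")
for word, pos in [("was", "v"), ("running", "v"), ("hours", "n"), ("hours", "v"), ("better", "a")]:
    print(f"{word:10}{chop_suffix(word):10}{toy_lemmatize(word, pos):10}")
```

The lookup handles irregular forms such as "was" and "better" that suffix stripping cannot, and its result depends on the part of speech: "hours" is only reduced when it is tagged as a noun.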

In [129]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
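punctuations = set("?:!.,;")
sentence_words = nltk.word_tokenize(sentence)
# Filter into a new list; removing items from a list while iterating over it can skip elements
sentence_words = [word for word in sentence_words if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    # pos='v' treats every token as a verb, so nouns such as "hours" are left unchanged
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))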

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 
