<span style="color: green; font-size: 55px; font-weight: bold;">Text Preprocessing</span>


In [31]:
import pandas as pd

In [32]:
df = pd.read_csv("IMDB Dataset.csv")

In [33]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [34]:
df.shape

(50000, 2)

In [35]:
df = df.head(100).copy()  # independent copy of the first 100 rows, so later column assignments do not warn

In [36]:
df.shape

(100, 2)

<span style="color: red; font-size: 40px; font-weight: bold;">Lowercasing</span>


In [37]:
df['review'] = df['review'].str.lower()

In [38]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
95,daniel day-lewis is the most versatile actor a...,positive
96,my guess would be this was originally going to...,negative
97,"well, i like to watch bad horror b-movies, cau...",negative
98,"this is the worst movie i have ever seen, as w...",negative


<span style="color: red; font-size: 40px; font-weight: bold;">Remove HTML Tags</span>


In [39]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

### Explanation

#### Importing the `re` Module:
The `re` module is imported to use regular expressions in Python. Regular expressions are a powerful tool for pattern matching and text manipulation.

#### Defining the Function:
The function `remove_html_tags` takes one argument, `text`, which is the input string containing HTML tags.

#### Compiling the Regular Expression Pattern:
`pattern = re.compile('<.*?>')`:
- `re.compile` is used to compile a regular expression pattern into a regex object.
- The pattern `<.*?>` is designed to match any HTML tag:
  - `<` matches the opening angle bracket of an HTML tag.
  - `.*?` matches any run of characters (except a newline) between the angle brackets, as few as possible.
  - `>` matches the closing angle bracket of an HTML tag.
- The `?` after `.*` makes the match non-greedy, so each tag is matched on its own. A greedy `.*` would run from the first `<` to the last `>` on the line and delete the text between tags as well.

#### Substituting HTML Tags with an Empty String:
`pattern.sub(r'', text)`:
- The `sub` method of the compiled regex object is used to replace all occurrences of the pattern in the input `text` with an empty string `r''`.
- This effectively removes all HTML tags from the input text.

#### Returning the Cleaned Text:
The function returns the modified text with all HTML tags removed.
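The difference between the non-greedy and greedy forms is easiest to see side by side. The snippet below is a minimal sketch; the `sample` string is made up for illustration (it is not a row from the dataset) and the variable names are assumptions.

```python
import re

# Hypothetical sample text containing several HTML tags.
sample = "a wonderful little production. <br /><br />the filming <b>technique</b> is unassuming"

# Non-greedy: each tag is matched separately, so the text between tags survives.
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: a single match runs from the first '<' to the last '>',
# so the text between the tags is deleted too.
greedy = re.sub(r'<.*>', '', sample)

print(non_greedy)
print(greedy)
```

Note that both forms replace a tag with an empty string, so sentences separated only by `<br />` tags end up joined without a space.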

In [40]:
df['review'] = df['review'].apply(remove_html_tags)

In [41]:
df['review'][0]

"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows wo

<span style="color: red; font-size: 40px; font-weight: bold;">Remove URLs</span>


In [42]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)


### Explanation

#### Defining the Function:
The function `remove_url` takes one argument, `text`, which is the input string containing URLs.

#### Compiling the Regular Expression Pattern:
`pattern = re.compile(r'https?://\S+|www\.\S+')`:
- `re.compile` is used to compile a regular expression pattern into a regex object.
- The pattern `https?://\S+|www\.\S+` is designed to match URLs:
  - `https?://` matches `http://` or `https://`:
    - `http` matches the literal string "http".
    - `s?` makes the "s" optional, so it matches both "http" and "https".
    - `://` matches the literal string "://".
  - `\S+` matches one or more non-whitespace characters (the domain and path of the URL).
  - `|` is the OR operator, allowing the pattern to match either `https?://\S+` or `www\.\S+`.
  - `www\.\S+` matches URLs starting with "www.":
    - `www\.` matches the literal string "www.".
    - `\S+` matches one or more non-whitespace characters (the domain and path of the URL).

#### Substituting URLs with an Empty String:
`pattern.sub(r'', text)`:
- The `sub` method of the compiled regex object is used to replace all occurrences of the pattern in the input `text` with an empty string `r''`.
- This effectively removes all URLs from the input text.

#### Returning the Cleaned Text:
The function returns the modified text with all URLs removed.
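As a quick check of the two branches of the pattern, the sketch below applies it to three strings. The first two are the sample texts used in this notebook; the third (`http://example.com, ...`) is an assumed example added to show a side effect of `\S+`.

```python
import re

url_pattern = re.compile(r'https?://\S+|www\.\S+')

samples = [
    'Google search here www.google.com',       # handled by the www\.\S+ branch
    'For data click https://www.kaggle.com/',  # handled by the https?://\S+ branch
    'see http://example.com, then continue',   # hypothetical: the comma is part of \S+
]

# Remove every URL match from each sample string.
cleaned = [url_pattern.sub('', s) for s in samples]
for c in cleaned:
    print(repr(c))
```

Because `\S+` consumes every non-whitespace character, punctuation that directly follows a URL is removed along with it.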

In [43]:

text1 = 'Google search here www.google.com'
text2 = 'For data click https://www.kaggle.com/'

In [44]:
remove_url(text2)

'For data click '

<span style="color: red; font-size: 40px; font-weight: bold;">Punctuation Handling</span>


In [45]:
import string
import time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [46]:
exclude = string.punctuation
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [47]:
def remove_punc(text):
    for char in exclude:
        text = text.replace(char,'')
    return text

In [48]:
text = 'string. With. Punctuation?'

In [49]:
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1*50000)

string With Punctuation
0.0


In [50]:
def remove_punc1(text):
    return text.translate(str.maketrans('', '', exclude))

In [51]:
start = time.time()
remove_punc1(text)
time2 = time.time() - start
print(time2*50000)

0.0


In [54]:
time1/(time2+0.000001)

0.0
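The timings above print `0.0`, most likely because a single call finishes faster than the resolution of `time.time()`, so the ratio says nothing about the two approaches. A more reliable comparison repeats each call many times with the standard `timeit` module. The following is a sketch under that assumption; the repeat count `n` is arbitrary, and unlike `remove_punc1` above, the translation table is built once outside the function.

```python
import string
import timeit

exclude = string.punctuation
text = 'string. With. Punctuation?'

def remove_punc(text):
    # One str.replace pass per punctuation character.
    for char in exclude:
        text = text.replace(char, '')
    return text

# Build the translation table once instead of on every call.
table = str.maketrans('', '', exclude)

def remove_punc1(text):
    # Single pass over the string using the precomputed table.
    return text.translate(table)

n = 10_000  # arbitrary number of repetitions
t_replace = timeit.timeit(lambda: remove_punc(text), number=n)
t_translate = timeit.timeit(lambda: remove_punc1(text), number=n)
print(f"replace loop: {t_replace:.4f} s for {n} calls")
print(f"translate:    {t_translate:.4f} s for {n} calls")
```

The actual numbers depend on the machine and Python version, so none are shown here.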

In [55]:
df['review'][0]

"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows wo

In [56]:
remove_punc1(df['review'][0])

'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

<span style="color: red; font-size: 40px; font-weight: bold;">Chat Word Conversion</span>


In [57]:
chat_words = {
    'AFAIK':'As Far As I Know',
    'AFK':'Away From Keyboard',
    'ASAP':'As Soon As Possible'
}


{
    "FYI": "For Your Information",
    "ASAP": "As Soon As Possible",
    "BRB": "Be Right Back",
    "BTW": "By The Way",
    "OMG": "Oh My God",
    "IMO": "In My Opinion",
    "LOL": "Laugh Out Loud",
    "TTYL": "Talk To You Later",
    "GTG": "Got To Go",
    "TTYT": "Talk To You Tomorrow",
    "IDK": "I Don't Know",
    "TMI": "Too Much Information",
    "IMHO": "In My Humble Opinion",
    "ICYMI": "In Case You Missed It",
    "AFAIK": "As Far As I Know",
    "FAQ": "Frequently Asked Questions",
    "TGIF": "Thank God It's Friday",
    "FYA": "For Your Action",
}

{'FYI': 'For Your Information',
 'ASAP': 'As Soon As Possible',
 'BRB': 'Be Right Back',
 'BTW': 'By The Way',
 'OMG': 'Oh My God',
 'IMO': 'In My Opinion',
 'LOL': 'Laugh Out Loud',
 'TTYL': 'Talk To You Later',
 'GTG': 'Got To Go',
 'TTYT': 'Talk To You Tomorrow',
 'IDK': "I Don't Know",
 'TMI': 'Too Much Information',
 'IMHO': 'In My Humble Opinion',
 'ICYMI': 'In Case You Missed It',
 'AFAIK': 'As Far As I Know',
 'FAQ': 'Frequently Asked Questions',
 'TGIF': "Thank God It's Friday",
 'FYA': 'For Your Action'}

In [58]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [59]:
chat_conversion('Do this work ASAP')


'Do this work As Soon As Possible'
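`chat_conversion` splits on whitespace, so an abbreviation with punctuation attached (for example `ASAP!`) is not found in the dictionary. One possible variant, sketched below, matches alphabetic runs with a regular expression and leaves everything else untouched. The function name `chat_conversion_punct` and the helper `expand` are hypothetical, not part of the original notebook.

```python
import re

chat_words = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
}

def chat_conversion_punct(text):
    # Hypothetical variant: expand abbreviations even when punctuation is attached.
    def expand(match):
        word = match.group(0)
        # Look the word up case-insensitively; keep it unchanged if unknown.
        return chat_words.get(word.upper(), word)
    # Only runs of letters are candidates; punctuation and spacing are preserved.
    return re.sub(r'[A-Za-z]+', expand, text)

print(chat_conversion_punct('Do this work ASAP!'))
print(chat_conversion_punct('afk, back soon'))
```

Like the original, this matches case-insensitively, so an ordinary lowercase word that happens to equal an abbreviation would also be expanded.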

<span style="color: red; font-size: 30px; font-weight: bold; background-color: yellow;">
Incorrect Text Handling
</span>


In [60]:
from textblob import TextBlob

In [61]:
incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'

textBlb = TextBlob(incorrect_text)

textBlb.correct().string

'certain conditions during several generations are modified in the same manner.'

<span style="color: red; font-size: 30px; font-weight: bold; background-color: yellow;">
Stopwords
</span>


In [62]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [63]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [64]:
len(stopwords.words('english'))

198

In [65]:
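# Build the stopword set once; calling stopwords.words('english') for every word is slow.
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Each stopword is replaced by an empty string, so the joined output
    # keeps extra spaces where the stopwords used to be.
    return " ".join('' if word in stop_words else word for word in text.split())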

In [66]:
remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times')


'probably  all-time favorite movie,  story  selflessness, sacrifice  dedication   noble cause,    preachy  boring.   never gets old, despite   seen   15   times'
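The output above contains runs of spaces because each stopword is replaced by an empty string before joining. If that is not wanted, the stopwords can simply be skipped. The sketch below uses a small hand-written stopword set so that it runs without NLTK; that set and the name `remove_stopwords_compact` are assumptions for illustration. In the notebook, `set(stopwords.words('english'))` would take its place.

```python
# Small illustrative stopword set (an assumption); replace with
# set(stopwords.words('english')) when NLTK is available.
stop_words = {'my', 'a', 'of', 'and', 'to', 'but', 'not', 'or', 'it', 'just'}

def remove_stopwords_compact(text):
    # Keep only the words that are not stopwords, so no empty slots are joined.
    return " ".join(word for word in text.split() if word not in stop_words)

result = remove_stopwords_compact('probably my all-time favorite movie, a story of selflessness')
print(result)
```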

In [67]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


In [68]:
df['review'].apply(remove_stopwords)

0     one    reviewers  mentioned   watching  1 oz e...
1      wonderful little production.  filming techniq...
2      thought    wonderful way  spend time    hot s...
3     basically there's  family   little boy (jake) ...
4     petter mattei's "love   time  money"   visuall...
                            ...                        
95    daniel day-lewis    versatile actor alive. eng...
96     guess would    originally going    least two ...
97    well,  like  watch bad horror b-movies, cause ...
98       worst movie   ever seen,  well as,  worst  ...
99        mario fan   long    remember,    fond memo...
Name: review, Length: 100, dtype: object

<span style="color: red; font-size: 30px; font-weight: bold; background-color: yellow;">
Emoji Handling
</span>


In [69]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; this wide range also covers CJK and other non-emoji text
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [70]:
remove_emoji("Loved the movie. It was 😘😘")

'Loved the movie. It was '

In [71]:
remove_emoji("Lmao 😂😂")

'Lmao '

In [72]:
!pip install emoji



In [73]:
import emoji
print(emoji.demojize('Python is 🔥'))

Python is :fire:


In [74]:
print(emoji.demojize('Loved the movie. It was 😘'))

Loved the movie. It was :face_blowing_a_kiss:


<span style="color: red; font-size: 50px; font-weight: bold; background-color: yellow;">
Tokenization
</span>


<span style="color: red; font-size: 30px; font-weight: bold; background-color: yellow;">
1. Using the Split Function
</span>


In [75]:
# word tokenization
sent1 = 'I am going to delhi'
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [76]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [77]:
# Problems with split function
sent3 = 'I am going to delhi!'
sent3.split()

['I', 'am', 'going', 'to', 'delhi!']

In [78]:

sent4 = 'Where do think I should go? I have 3 day holiday'
sent4.split('.')

['Where do think I should go? I have 3 day holiday']

<span style="color: red; font-size: 30px; font-weight: bold; background-color: yellow;">
2. Regular Expression
</span>


In [79]:

import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", sent3)  # raw string avoids an invalid-escape warning for \w
tokens

['I', 'am', 'going', 'to', 'delhi']

In [80]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

["Lorem Ipsum is simply dummy text of the printing and typesetting industry?\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]
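The text comes back as a single element because the pattern `'[.!?] '` requires a literal space after the punctuation, while here the `?` is followed by a newline. One possible fix, sketched below, splits on any whitespace that follows sentence-ending punctuation; the lookbehind keeps the punctuation attached to its sentence. This is an illustration, not a general-purpose sentence splitter.

```python
import re

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

# Split on whitespace (space, tab or newline) preceded by '.', '!' or '?'.
sentences = re.split(r'(?<=[.!?])\s+', text)

for s in sentences:
    print(repr(s))
```

Abbreviations such as `Ph.D. in` would still be split incorrectly by this rule.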

<span style="color: red; font-size: 30px; font-weight: bold; background-color: yellow;">
3. NLTK
</span>


In [81]:
from nltk.tokenize import word_tokenize, sent_tokenize 
import nltk 
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [82]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [83]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)


['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [84]:

sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

word_tokenize(sent5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

In [85]:

word_tokenize(sent6)

['We',
 "'re",
 'here',
 'to',
 'help',
 '!',
 'mail',
 'us',
 'at',
 'nks',
 '@',
 'gmail.com']

In [86]:
word_tokenize(sent7)

['A', '5km', 'ride', 'cost', '$', '10.50']

<span style="color: red; font-size: 30px; font-weight: bold; background-color: yellow;">
4. spaCy (good)
</span>


In [89]:
# Requires spaCy and its small English model:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')

In [57]:
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)

In [58]:
doc4 = nlp(sent1)
doc4

I am going to visit delhi!

In [59]:
for token in doc4:
    print(token)

I
am
going
to
visit
delhi
!


In [60]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


<span style="color: red; font-size:50px; font-weight: bold; background-color: yellow;">
Stemmer
</span>


In [61]:
from nltk.stem.porter import PorterStemmer

In [62]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [63]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [65]:

text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
print(text)

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


In [66]:
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

<span style="color: red; font-size: 50px; font-weight: bold; background-color: yellow;">
Lemmatization
</span>


In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"
# Build a new list instead of removing items while iterating, which can skip elements.
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    # pos='v' lemmatizes every token as a verb
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))