
# NLP Text Preprocessing: From Raw Text to Refined Data



Natural Language Processing (NLP) thrives on the nuances of human language, but before we can tap into its potential, we must prepare and refine the raw text through a process known as "Text Preprocessing." This crucial step involves a series of operations that cleanse, structure, and transform the text into a format that is more amenable to analysis and machine learning. Let's delve into the essential stages of NLP text preprocessing:

1. Lowercasing
2. Remove HTML Tags
3. Remove URLs
4. Removing Punctuation
5. Chat word treatment
6. Spelling correction
7. Removing Stop words
8. Handling Emojis
9. Tokenization
10. Stemming and Lemmatization

# 1.Lowercasing

 Converting all text to lowercase ensures uniformity and eliminates inconsistencies caused by variations in letter case.

In [1]:
#importing pandas library and data
import pandas as pd
imdb_df = pd.read_csv("https://raw.githubusercontent.com/SK7here/Movie-Review-Sentiment-Analysis/master/IMDB-Dataset.csv")

In [2]:
#copying the df
df = imdb_df.copy()
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
# lower casing a specific sentence
print(df['review'][0])
print('-'*100)

print("After lowercasing")

print(df['review'][0].lower())

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

In [6]:
#another way
import string
raw_docs = [df['review'][0]]
print(raw_docs)
print("After Lowercasing")
raw_docs = [doc.lower() for doc in raw_docs]
print(raw_docs)

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the f

In [7]:
df['review']

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. <br /><br />The...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

In [8]:
#Lowercasing entire column
df['review'].str.lower()

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. <br /><br />the...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object
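
For text that is not purely English, `str.lower()` may leave some case variants distinct. The sketch below compares it with `str.casefold()`, which applies more aggressive Unicode case folding. The word list is an illustrative assumption and is not taken from the IMDB data.

```python
# Illustrative words (not from the dataset): three spellings of the same German word
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps the sharp s, so two distinct forms remain
lowered = {w.lower() for w in words}

# casefold() maps the sharp s to "ss", so all three spellings collapse into one form
folded = {w.casefold() for w in words}

print(lowered)
print(folded)
```

For English-only reviews the two methods behave the same, so `lower()` as used above is sufficient there.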

# 2.Remove HTML Tags

When dealing with text data obtained from websites or other digital sources, we often encounter HTML tags that need to be stripped away to reveal the actual textual content.

In [None]:
df1 = imdb_df.copy()

In [None]:
df1.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
df1['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [None]:
# importing regex
import re
def remove_html_tags(text):
  pattern = re.compile('<.*?>')
  return pattern.sub(r'', text)

In [None]:
remove_html_tags(df1['review'][1])

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'

In [None]:
df1.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
#removing HTML tags from entire column

df1['review'].apply(remove_html_tags)

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object
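
The regex above works well for simple tags such as `<br />`, but it leaves HTML entities like `&amp;` untouched. As an alternative sketch, the standard library's `html.parser` can collect only the text content and decode entities along the way. The class and function names here (`_TextExtractor`, `strip_html`) are hypothetical and not part of the original notebook.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp; are decoded
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for every run of text that is not part of a tag
        self.parts.append(data)


def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_html("A wonderful little production. <br /><br />The filming &amp; editing"))
```

It can be applied to the whole column in the same way as before, for example with `df1['review'].apply(strip_html)`.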

# 3.Remove URLs

Eliminating URLs, email addresses, and specific patterns that are irrelevant to the analysis further cleanses the text.

In [None]:
text1 = 'Check out my notebook https://www.kagg1e.com/campusx/notebook8223fc1abb'

text2 = 'Check out my notebook http://www.kagg1e.com/campusx/notebook8223fc1abb'

text3 = 'Google search here www.google.com'

text4 = 'For notebook click https://mvw.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'


In [None]:
def remove_url(text):
	pattern = re.compile(r'https?://\S+|www\.\S+')
	return pattern.sub(r'', text)

In [None]:
remove_url(text1)

'Check out my notebook '

In [None]:
remove_url(text2)

'Check out my notebook '

In [None]:
remove_url(text3)

'Google search here '
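
The description above also mentions email addresses, and `text4` contains two URLs in one string. A minimal sketch that handles both cases and tidies the leftover whitespace could look like this. The email pattern is a deliberately simple assumption and will not cover every valid address, and `remove_urls_and_emails` is a hypothetical name.

```python
import re

# Same URL pattern as remove_url above
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
# Simplified email pattern (assumption): name@domain.tld
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')


def remove_urls_and_emails(text):
    text = EMAIL_PATTERN.sub('', text)
    text = URL_PATTERN.sub('', text)
    # collapse the gaps left behind by the removed spans
    return re.sub(r'\s+', ' ', text).strip()


text4 = 'For notebook click https://mvw.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'
print(remove_urls_and_emails(text4))
print(remove_urls_and_emails("We're here to help! mail us at nks@gmail.com"))
```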

# 4.Removing Punctuation


Punctuation, symbols, and other non-alphanumeric characters can often be removed without affecting the underlying meaning of the text.

In [None]:
import string, time
string.punctuation # contains the basic ASCII punctuation characters

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
exclude = string.punctuation

In [None]:
#function to remove punctuation
def remove_punctuation(text):
	for char in exclude:
		text = text.replace(char,'')
	return text

In [None]:
punc_text = 'this . is, the, ? string! with@ punctuation&'

In [None]:
start = time.time()
print(punc_text)
print(remove_punctuation(punc_text))
end = time.time() - start
print("Time taken for execution=", end)

this . is, the, ? string! with@ punctuation&
this  is the  string with punctuation
Time taken for execution= 0.0018172264099121094


In [None]:
#standard method of removing punctuation

def remove_Punctuation(text):
	return text.translate(str.maketrans('', '', exclude))

In [None]:
start = time.time()
print(punc_text)
print(remove_Punctuation(punc_text))
End = time.time() - start
print("Time taken for execution=", End)

this . is, the, ? string! with@ punctuation&
this  is the  string with punctuation
Time taken for execution= 0.00027251243591308594


In [None]:
# ratio of translate() time to loop time: in this run translate() took about 14% of the loop's time, i.e. roughly 7 times faster
End/end

0.14227035100821508
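
A single `time.time()` measurement includes the cost of `print` and is noisy for such short operations. A more reliable comparison can be sketched with the standard `timeit` module, which repeats each call many times. The function names and the repetition count below are assumptions chosen for illustration, and the measured times will differ from machine to machine.

```python
import string
import timeit

exclude = string.punctuation
# build the translation table once instead of on every call
table = str.maketrans('', '', exclude)


def remove_punct_loop(text):
    # one replace() pass per punctuation character
    for char in exclude:
        text = text.replace(char, '')
    return text


def remove_punct_translate(text):
    # single pass over the text using the prebuilt table
    return text.translate(table)


sample = 'this . is, the, ? string! with@ punctuation&'
loop_time = timeit.timeit(lambda: remove_punct_loop(sample), number=10_000)
translate_time = timeit.timeit(lambda: remove_punct_translate(sample), number=10_000)
print(f"loop: {loop_time:.4f}s  translate: {translate_time:.4f}s")
```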

In [None]:
#loading dataset

dft= pd.read_csv("https://raw.githubusercontent.com/sachink382/Twitter-Sentiment-Analysis---Analytics-Vidhya/master/train.csv")

In [None]:
dft.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [None]:
dft['TWEET'] = dft['tweet'].apply(remove_Punctuation)

In [None]:
# punctuation removed tweets in TWEET
dft.head()

Unnamed: 0,id,label,tweet,TWEET
0,1,0,@user when a father is dysfunctional and is s...,user when a father is dysfunctional and is so...
1,2,0,@user @user thanks for #lyft credit i can't us...,user user thanks for lyft credit i cant use ca...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,model i love u take with u all the time in u...
4,5,0,factsguide: society now #motivation,factsguide society now motivation


# 5. Chat word treatment

Dealing with chat words and abbreviations is important, especially when working with informal text data such as social media posts, chats, and online discussions. These informal text sources often contain abbreviations, acronyms, and chat-specific words that can pose challenges for traditional language models.

In [None]:
chat_abbreviations = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'ATK': 'At The Keyboard',
    'ATM': 'At The Moment',
    'A3': 'Anytime, Anywhere, Anyplace',
    'BAK': 'Back At Keyboard',
    'BBL': 'Be Back Later',
    'BBS': 'Be Back Soon',
    'BFN': 'Bye For Now',
    'B4N': 'Bye For Now',
    'BRB': 'Be Right Back',
    'BRT': 'Be Right There',
    'BTW': 'By The Way',
    'B4': 'Before',
    'CU': 'See You',
    'CUL8R': 'See You Later',
    'CYA': 'See You',
    'FAQ': 'Frequently Asked Questions',
    'FC': 'Fingers Crossed',
    'FWIW': 'For What Its Worth',
    'FYI': 'For Your Information',
    'GAL': 'Get A Life',
    'GG': 'Good Game',
    'GN': 'Good Night',
    'GMTA': 'Great Minds Think Alike',
    'GR8': 'Great!',
    'G9': 'Genius',
    'IC': 'I See',
    'ICQ': 'I Seek you (also a chat program)',
    'ILU': 'I Love You',
    'IMHO': 'In My Honest/Humble Opinion',
    'IMO': 'In My Opinion',
    'IOW': 'In Other Words',
    'IRL': 'In Real Life',
    'KISS': 'Keep It Simple, Stupid',
    'LDR': 'Long Distance Relationship',
    'LMAO': 'Laugh My A.. Off',
    'LOL': 'Laughing Out Loud',
    'LTNS': 'Long Time No See',
    'L8R': 'Later',
    'MTE': 'My Thoughts Exactly',
    'M8': 'Mate',
    'NRN': 'No Reply Necessary',
    'OIC': 'Oh I See',
    'PITA': 'Pain In The A..',
    'PRT': 'Party',
    'PRW': 'Parents Are Watching',
    'QPSA?': 'Que Pasa?',
    'ROFL': 'Rolling On The Floor Laughing',
    'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
    'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
    'SK8': 'Skate',
    'STATS': 'Your sex and age',
    'ASL': 'Age, Sex, Location',
    'THX': 'Thank You',
    'TTFN': 'Ta-Ta For Now!',
    'TTYL': 'Talk To You Later',
    'U': 'You',
    'UR':'Your',
    'U2': 'You Too',
    'U4E': 'Yours For Ever',
    'WB': 'Welcome Back',
    'WTF': 'What The F...',
    'WTG': 'Way To Go!',
    'WUF': 'Where Are You From?',
    'W8': 'Wait...',
    '7K': 'Sick:-D Laughter',
    'TFW': 'That feeling when. TFW internet slang often goes in a caption to an image.',
    'MFW': 'My face when',
    'MRW': 'My reaction when',
    'IFYP': 'I feel your pain',
    'TNTL': 'Trying not to laugh',
    'JK': 'Just kidding',
    'IDC': 'I don’t care',
    'ILY': 'I love you',
    'IMU': 'I miss you',
    'ADIH': 'Another day in hell',
    'ZZZ': 'Sleeping, bored, tired',
    'WYWH': 'Wish you were here',
    'TIME': 'Tears in my eyes',
    'BAE': 'Before anyone else',
    'FIMH': 'Forever in my heart',
    'BSAAW': 'Big smile and a wink',
    'BWL': 'Bursting with laughter',
    'LMAO': 'Laughing my ass off',
    'BFF': 'Best friends forever',
    'CSL': 'Can’t stop laughing'
}


In [None]:
# function to expand chat words into their full forms
def chat_conversation(text):
	new_text = []
	for w in text.split():
		if w.upper() in chat_abbreviations:
			new_text.append(chat_abbreviations[w.upper()])
		else:
			new_text.append(w)
	return " ".join(new_text)

In [None]:
chat_conversation('btw gal b4 this')

'By The Way Get A Life Before this'

In [None]:
chat_conversation('btw what is ur asl')

'By The Way what is Your Age, Sex, Location'
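
Because the function above splits on whitespace, a chat word followed directly by punctuation (for example `btw,`) is not found in the dictionary. One possible workaround, sketched here with a small hypothetical subset of the mapping, is to match word characters with a regex and look up each match. Keys that themselves contain punctuation, such as `QPSA?`, would not be matched by this approach.

```python
import re

# Small hypothetical subset of the full chat_abbreviations dictionary
chat_words = {'BTW': 'By The Way', 'BRB': 'Be Right Back', 'IMO': 'In My Opinion'}

WORD_PATTERN = re.compile(r"\b\w+\b")


def expand_chat_words(text, mapping=chat_words):
    def replace(match):
        word = match.group(0)
        # fall back to the original word when it is not a known abbreviation
        return mapping.get(word.upper(), word)

    # punctuation and spacing outside the matches are left untouched
    return WORD_PATTERN.sub(replace, text)


print(expand_chat_words('btw, brb!'))
```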

# 6.Spelling correction

Correcting typos and misspellings improves the quality of the text, making it more conducive to analysis.

In [None]:
from textblob import TextBlob

In [None]:
incorrect_text = "India hass reaiched mooon throughhh moon misssion"

textBlb = TextBlob(incorrect_text)

textBlb.correct().string

'India has reached moon through moon mission'
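
To give an intuition for what a spelling corrector does, the sketch below replaces each unknown word with its closest match from a known vocabulary using the standard library's `difflib`. This is a simplified stand-in, not how TextBlob works internally, and the vocabulary and cutoff are assumptions chosen for this one sentence.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real corrector would use a large word list
VOCABULARY = ["india", "has", "reached", "moon", "through", "mission"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    if word.lower() in vocabulary:
        return word  # already a known word, keep the original casing
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    # keep the word unchanged when nothing is similar enough
    return matches[0] if matches else word


def correct_text(text):
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("India hass reaiched mooon throughhh moon misssion"))
```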

# 7.Removing Stop words

Stopwords like "the," "and," "is," etc., carry little meaningful information and can be excluded to reduce noise in the data.

In [None]:
# Stop words such as "a", "is", and "on" are usually removed during text preprocessing.
# Keep them for tasks like part-of-speech (POS) tagging, where they carry grammatical information.

from nltk.corpus import stopwords

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
#stop words in english

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
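# build the stop-word set once; a set lookup is much faster than scanning the list for every word
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
	# each stop word is replaced with an empty string, which leaves extra spaces in the result
	new_text = ['' if word in stop_words else word for word in text.split()]
	return " ".join(new_text)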

In [None]:
sentence = "While the Alpha Particle Spectrometer is aiming at deriving the chemical composition of the lunar surface, the Laser-Induced Breakdown Spectroscope  will attempt to determine the elemental composition of lunar soil and rocks around the lunar landing site."

print(sentence)
print(remove_stopwords(sentence))

While the Alpha Particle Spectrometer is aiming at deriving the chemical composition of the lunar surface, the Laser-Induced Breakdown Spectroscope  will attempt to determine the elemental composition of lunar soil and rocks around the lunar landing site.
While  Alpha Particle Spectrometer  aiming  deriving  chemical composition   lunar surface,  Laser-Induced Breakdown Spectroscope  attempt  determine  elemental composition  lunar soil  rocks around  lunar landing site.


In [None]:
dft['tweet']

0         @user when a father is dysfunctional and is s...
1        @user @user thanks for #lyft credit i can't us...
2                                      bihday your majesty
3        #model   i love u take with u all the time in ...
4                   factsguide: society now    #motivation
                               ...                        
31957    ate @user isz that youuu?ðððððð...
31958      to see nina turner on the airwaves trying to...
31959    listening to sad songs on a monday morning otw...
31960    @user #sikh #temple vandalised in in #calgary,...
31961                     thank you @user for you follow  
Name: tweet, Length: 31962, dtype: object

In [None]:
dft['tweet'].apply(remove_stopwords)

0        @user   father  dysfunctional    selfish  drag...
1        @user @user thanks  #lyft credit  can't use ca...
2                                          bihday  majesty
3        #model  love u take  u   time  urð±!!! ðð...
4                         factsguide: society  #motivation
                               ...                        
31957    ate @user isz  youuu?ððððððð...
31958     see nina turner   airwaves trying  wrap    ma...
31959    listening  sad songs   monday morning otw  wor...
31960    @user #sikh #temple vandalised   #calgary, #ws...
31961                                thank  @user   follow
Name: tweet, Length: 31962, dtype: object
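
In the outputs above, capitalised stop words such as "While" survive because the NLTK list is lowercase, and every removed word leaves an extra space. A variant that addresses both points might look like the sketch below. To keep it self-contained it uses a small hypothetical stop-word set; in the notebook, `set(stopwords.words('english'))` could be passed instead.

```python
# Hypothetical subset of stop words, used only to keep this sketch self-contained
STOP_WORDS = {"the", "is", "at", "of", "and", "will", "to", "while"}


def remove_stopwords_clean(text, stop_words=STOP_WORDS):
    # compare in lowercase and drop stop words entirely instead of inserting ''
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stopwords_clean(
    "While the Alpha Particle Spectrometer is aiming at deriving the chemical composition"
))
```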

# 8.Handling Emojis

Emojis are pictorial symbols often used in text to convey emotions, feelings, or reactions. In NLP, emojis can either be stripped out as noise or converted to text so that the sentiment they carry is preserved. Two common approaches:

1. Emoji removal:
Delete emoji characters with a regular expression over their Unicode ranges.
2. Emoji-to-Text Mapping:
Replace each emoji with a text description. Existing libraries, such as the "emoji" library in Python, provide such mappings.

In [None]:
import re

def remove_emoji(text):
	emoji_pattern = re.compile("["
				u"\U0001F600-\U0001F64F" #emoticons
				u"\U0001F300-\U0001F5FF" #symbols and pictographs
				u"\U0001F680-\U0001F6FF" #transport & map symbols
				u"\U0001F1E0-\U0001F1FF" #flags
				u"\U00002702-\U000027B0" #dingbats
				u"\U000024C2-\U0001F251" #enclosed characters; a very broad range that also matches non-emoji text such as CJK characters
				"]+", flags = re.UNICODE)
	return emoji_pattern.sub(r'',text)

In [None]:
print(remove_emoji('I ❤️ my country'))
print(remove_emoji('India is best in 🎾'))
print(remove_emoji('be happy 😀'))

I  my country
India is best in 
be happy 


In [None]:
pip install emoji

Collecting emoji
  Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
Installing collected packages: emoji
Successfully installed emoji-2.8.0


In [None]:
# emoji library
import emoji
print(emoji.demojize('I ❤️ my country'))
print(emoji.demojize('be happy and 😁'))

I :red_heart: my country
be happy and :beaming_face_with_smiling_eyes:
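
If installing an extra package is not an option, a rough emoji-to-text mapping can be sketched with the standard `unicodedata` module, which exposes the official Unicode character names. This is an assumption-laden approximation: the names differ from those used by the `emoji` library (for example `heavy_black_heart` instead of `red_heart`), and multi-character emoji sequences are described piece by piece. The function name `describe_symbols` is hypothetical.

```python
import unicodedata


def describe_symbols(text):
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # "So" (Symbol, other) covers most emoji and pictographs
            name = unicodedata.name(ch, "unknown symbol")
            out.append(":" + name.lower().replace(" ", "_") + ":")
        elif ch == "\ufe0f":
            # skip the invisible variation selector that often follows an emoji
            continue
        else:
            out.append(ch)
    return "".join(out)


print(describe_symbols('be happy 😀'))
print(describe_symbols('I ❤️ my country'))
```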


# 9.Tokenization

Tokenization is the process of breaking down a sequence of text into smaller units, known as tokens.

Word tokenization:

Example :  "Tokenization is an important NLP concept."<br>
Tokens  :  ["Tokenization", "is", "an", "important", "NLP", "concept", "."]

Sentence tokenization:

Example : "Tokenization is breaking text into tokens. Tokenization is important in NLP."<br>
Tokens  : ["Tokenization is breaking text into tokens.", "Tokenization is important in NLP."]


## 9.1 Using split function

In [None]:
sentence1 =  "Tokenization is an important NLP concept."
sentence1.split()

['Tokenization', 'is', 'an', 'important', 'NLP', 'concept.']

In [None]:
sentence2 = " Tokenization is essential step in various NLP tasks, including text classification, language modeling, machine translation, and more."

#splitting based on the delimiter ,

sentence2.split(',')

[' Tokenization is essential step in various NLP tasks',
 ' including text classification',
 ' language modeling',
 ' machine translation',
 ' and more.']

## 9.2 Using Regular Expression

Regular expressions (regex) are powerful tools for pattern matching and manipulation of text. They are widely used in text preprocessing for tasks such as cleaning, tokenization, extraction, and more.

In [None]:
import re
sentence1 = "Tokenization is an important NLP concept."
tokens = re.findall(r"[\w']+", sentence1)  # raw string so that \w is passed to the regex engine unchanged

tokens

['Tokenization', 'is', 'an', 'important', 'NLP', 'concept']
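
The pattern above drops the final period. If punctuation should be kept as separate tokens, as in the word tokenization example at the start of this section, the pattern can be extended with a second alternative. The sketch below is one possible pattern, not a standard one, and `regex_tokenize` is a hypothetical name.

```python
import re

# words (optionally with an internal apostrophe) OR any single non-space, non-word character
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)


print(regex_tokenize("Tokenization is an important NLP concept."))
print(regex_tokenize("We're here to help!"))
```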

## 9.3 NLTK

Natural Language Toolkit (NLTK) is a popular Python library for working with human language data, that is, for natural language processing (NLP) tasks. It provides various tools and resources for tasks such as tokenization, stemming, tagging, parsing, and more.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
#word tokenization

word_tokenize(sentence1)

['Tokenization', 'is', 'an', 'important', 'NLP', 'concept', '.']

In [None]:
#Sentence tokenization

from nltk.tokenize import sent_tokenize

In [None]:
sentence3 = ["In the context of embeddings", "we will find that if we donot remove these inconsistencies", "the vectors will not be properly placed."]

sent_token = [sent_tokenize(doc) for doc in sentence3]
print(sent_token)

[['In the context of embeddings'], ['we will find that if we donot remove these inconsistencies'], ['the vectors will not be properly placed.']]
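
To see why a trained sentence tokenizer is useful, it helps to compare it with a naive rule. The sketch below splits on whitespace that follows `.`, `!`, or `?`. It is an illustrative baseline, not what `sent_tokenize` does, and it breaks on abbreviations, as the second call shows. The function name `naive_sent_split` is hypothetical.

```python
import re


def naive_sent_split(text):
    # split at whitespace that is preceded by sentence-ending punctuation
    return re.split(r'(?<=[.!?])\s+', text.strip())


print(naive_sent_split("Tokenization is breaking text into tokens. Tokenization is important in NLP."))
# the abbreviation "Dr." is wrongly treated as the end of a sentence
print(naive_sent_split("Dr. Smith arrived."))
```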


In [None]:

sentence4 = 'I have a Ph.D in A.I'
sentence5 = "We're here to help! mail us at nks@gmail.com"
sentence6 = "A 5km ride will cost $10.50"

In [None]:
print(word_tokenize(sentence4))

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']


In [None]:
print(word_tokenize(sentence5))

['We', "'re", 'here', 'to', 'help', '!', 'mail', 'us', 'at', 'nks', '@', 'gmail.com']


In [None]:
print(word_tokenize(sentence6))

['A', '5km', 'ride', 'will', 'cost', '$', '10.50']


## 9.4 spaCy

spaCy is a popular open-source library for natural language processing (NLP) in Python. It's designed for production use and is known for its speed and efficiency. spaCy provides various tools for processing text, performing linguistic analyses, and extracting useful information from text data.

In [None]:
import spacy
nlp_spacy= spacy.load('en_core_web_sm')

In [None]:
doc6 = nlp_spacy(sentence6)

In [None]:
for token in doc6:
  print(token)

A
5
km
ride
will
cost
$
10.50


In [None]:
doc5 = nlp_spacy(sentence5)

for token in doc5:
  print(token)

We
're
here
to
help
!
mail
us
at
nks@gmail.com


# 10. Stemming and Lemmatization


"In grammar, inflection is the modification of a word to
express different grammatical categories such as tense,
case, voice, aspect, person, number, gender, and mood."

Example: <br>
actual word/stem word: walk <br>
inflection words: walking, walks, walked <br>


"**Stemming** is the process of reducing inflection in words
to their root forms such as mapping a group of words to
the same stem even if the stem itself is not a valid word in
the Language."

Stemming is mostly used in information retrieval systems.

* Stemming is generally faster than lemmatization.

In [None]:
# Stemming
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def stem_words(text):
	return " ".join([porter.stem(word) for word in text.split()])

In [None]:
sample = ' walk walks walking walked'

stem_words(sample)

'walk walk walk walk'

In [None]:
overview = "An adrenaline seeking snowboarder gets lost in a massive winter storm in the back country of the High Sierras where he is pushed to the limits of human endurance and forced to battle his own personal demons as he fights for survival"

stem_words(overview)

'an adrenalin seek snowboard get lost in a massiv winter storm in the back countri of the high sierra where he is push to the limit of human endur and forc to battl hi own person demon as he fight for surviv'
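
The idea behind stemming can be illustrated with a toy suffix stripper. This is not the Porter algorithm, which applies a much larger set of ordered rules. The suffix list and the minimum stem length below are assumptions made for the sketch.

```python
# Deliberately tiny, hypothetical rule set
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    for suffix in SUFFIXES:
        # only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_stem_text(text):
    return " ".join(naive_stem(w) for w in text.lower().split())


print(naive_stem_text(' walk walks walking walked'))
# the result need not be a valid word: "running" becomes "runn"
print(naive_stem("running"))
```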

### Lemmatization

* Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language.
* In lemmatization, the root word is called the lemma.
* A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
* Lemmatization is a bit slower than stemming because the lemmatizer searches for the lemma in WordNet.

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

# Initialize the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Input sentence
sentence = "He was running and eating at the same time. He has a bad habit of swimming after playing long hours in the Sun."

# Define punctuation characters
punctuations = "?:!.,;"

# Tokenize the sentence
sentence_words = nltk.word_tokenize(sentence)

# Remove punctuations from the list of words
cleaned_words = [word for word in sentence_words if word not in punctuations]

# Print the header for the output
print("{0:20}{1:20}".format("Word", "Lemma"))

# Iterate through each word and print the original word and its lemma
for word in cleaned_words:
    lemma = wordnet_lemmatizer.lemmatize(word, pos='v')  # pos='v' treats every word as a verb when looking up its lemma
    print("{0:20}{1:20}".format(word, lemma))


Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
the                 the                 
same                same                
time                time                
He                  He                  
has                 have                
a                   a                   
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


## Conclusion:

Text preprocessing plays a crucial role in preparing raw text data for natural language processing (NLP) tasks. It involves a series of steps and techniques to clean, transform, and structure the text data into a format suitable for analysis and machine learning.


By performing effective text preprocessing, we create a solid foundation for successful NLP tasks like sentiment analysis, topic modeling, text classification, and more. The choice of preprocessing techniques depends on the nature of the text data and the specific goals of the analysis or machine learning model. Through proper preprocessing, we can harness the power of text data and derive meaningful insights from it.
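
As a rough illustration of how several of the steps above can be chained, here is a minimal pipeline sketch using only the standard library. The step order, the function name `preprocess`, and the small stop-word set are assumptions. A real pipeline would pick its steps, and the NLTK or spaCy components shown earlier, according to the task.

```python
import re
import string

HTML_TAG = re.compile(r'<.*?>')
URL = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
# Hypothetical stop-word subset, used only to keep the sketch self-contained
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def preprocess(text, stop_words=STOP_WORDS):
    text = text.lower()                    # 1. lowercasing
    text = HTML_TAG.sub(' ', text)         # 2. remove HTML tags
    text = URL.sub(' ', text)              # 3. remove URLs
    text = text.translate(PUNCT_TABLE)     # 4. remove punctuation
    # 5. whitespace tokenization and stop-word removal
    return [word for word in text.split() if word not in stop_words]


print(preprocess("A wonderful little production. <br /><br />The filming is at https://example.com/x !"))
```

HTML tags and URLs are removed before punctuation here because stripping punctuation first would destroy the `<`, `>`, `:` and `/` characters that those patterns rely on.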

## Thank you for reading till the end

### - Raviteja

https://www.linkedin.com/in/raviteja-padala/