<a href="https://www.kaggle.com/code/sharanharsoor/lexical-processing-nlp-zomato-review?scriptVersionId=122397213" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 0. Introduction

This notebook covers the basics of NLP text pre-processing. We load Zomato customer reviews from a data file and apply the following pre-processing steps:
* Removing unnecessary elements like HTML tags, URLs and emojis
* Text encoding
* Removing special characters and symbols
* Converting the text to lower case
* Removing stopwords
* Stemming and lemmatisation

# 1. Importing the libraries

In [1]:
#Load the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import nltk

import os
import warnings
warnings.filterwarnings('ignore')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# 2. Reading the input data

In [2]:
zomato = pd.read_csv('/kaggle/input/zomato-reviews-ratings/zomato_reviews.csv',index_col='Unnamed: 0')
print(zomato.shape)
zomato.head(5)

(5479, 2)


Unnamed: 0,rating,review
0,5,nice
1,5,"best biryani , so supportive staff of outlet ,..."
2,4,delivery boy was very decent and supportive.👌👍
3,1,"worst biryani i have tasted in my life, half o..."
4,5,all food is good and tasty . will order again ...


# 3. Basic Pre-processing reviews text

In [3]:
text = u'<div><h1><Title>The apple π was [*][AMAZING][*] and YuMmY too\U0001f602! You can Checkout the entire Menu in https://www.zomato.com/chennai/top-restaurantshello </div></h1></Title>'
print(text)

<div><h1><Title>The apple π was [*][AMAZING][*] and YuMmY too😂! You can Checkout the entire Menu in https://www.zomato.com/chennai/top-restaurantshello </div></h1></Title>


## 3.1 Removing HTML tags
If any review contains HTML tags (for example, `<html>`), strip them out and keep only the text.

In [4]:
from bs4 import BeautifulSoup

def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

In [5]:
text = strip_html(text)
print(text)

The apple π was [*][AMAZING][*] and YuMmY too😂! You can Checkout the entire Menu in https://www.zomato.com/chennai/top-restaurantshello 


## 3.2 Removing URLs
After stripping the HTML tags, we remove any hyperlinks (for example, URLs starting with "https://") that the review text may contain. We use a regular expression to do so.

In [6]:
import re

text = re.sub(r"http\S+", "", text)
print(text)

The apple π was [*][AMAZING][*] and YuMmY too😂! You can Checkout the entire Menu in  
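
Note that the pattern `http\S+` only matches links that begin with `http`; a link written as `www.example.com` would be left in place. The following is a minimal sketch of a slightly broader pattern. The names `URL_PATTERN` and `remove_urls` are illustrative and not part of the original notebook.

```python
import re

# Hypothetical helper: matches http(s):// links and bare www. links
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Remove URL-like tokens, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

sample = "Menu at https://www.zomato.com/chennai and www.example.com today"
print(remove_urls(sample))  # Menu at and today
```

Collapsing whitespace is optional; it simply avoids the double spaces that removing a token leaves behind.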


## 3.3 Removing Emojis
Next, we define a function that removes emojis in several Unicode ranges: emoticons, symbols and pictographs, transport and map symbols, and flags.

In [7]:
def deEmojify(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)

In [8]:
text = deEmojify(text)
print(text)

The apple π was [*][AMAZING][*] and YuMmY too! You can Checkout the entire Menu in  
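
The four ranges above cover the emojis in this sample, but they are not exhaustive. For example, 🤣 (U+1F923) sits in the Supplemental Symbols and Pictographs block and would pass through unchanged. The sketch below extends the pattern with two more blocks; the name `EXTENDED_EMOJI` and the choice of extra ranges are assumptions, and the result is still not a complete emoji list.

```python
import re

# Ranges from deEmojify plus two extra blocks (an assumption, not exhaustive)
EXTENDED_EMOJI = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

def remove_emojis(text):
    """Delete every run of characters that falls in the ranges above."""
    return EXTENDED_EMOJI.sub("", text)

# U+1F923 (rolling on the floor laughing), U+2764 (heart), U+1F642 (smile)
print(remove_emojis("too good \U0001F923 \u2764 thanks \U0001F642"))
```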


## 3.4 Text encoding
Let us look at how the same text appears under different encodings: UTF-8 bytes, ASCII bytes and an ordinary Python string.

Notice the 'b' in front of the output; it indicates that the data is a bytes object.

UTF-8 can represent every Unicode character, whereas ASCII covers only 128 basic characters. Encoding to ASCII while ignoring errors is therefore a simple way to drop non-English characters.

In this case we will use ASCII encoding to remove the "pi" symbol and then decode the result back to a regular string.

In [9]:
text.encode('utf-8', 'ignore')

b'The apple \xcf\x80 was [*][AMAZING][*] and YuMmY too! You can Checkout the entire Menu in  '

In this example, the UTF-8 encoded output shows the "pi" symbol as the two-byte sequence \xcf\x80.

In [10]:
#We want to remove the 'pi' symbol from the review, so we encode to ASCII and ignore characters that cannot be encoded.
text = text.encode('ascii', 'ignore')
print(text)

b'The apple  was [*][AMAZING][*] and YuMmY too! You can Checkout the entire Menu in  '
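
One side effect worth knowing: `encode('ascii', 'ignore')` drops every non-ASCII character outright, so accented letters disappear along with symbols like π. If keeping the base letter matters (for example "café" becoming "cafe"), one option is to normalise the text with the standard `unicodedata` module first. This is a sketch under that assumption; `to_ascii` is a hypothetical helper, not part of the notebook.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

sample = "The cr\u00e8me br\u00fbl\u00e9e \u03c0 was great"
# UTF-8 bytes: non-ASCII characters become multi-byte sequences
print(sample.encode("utf-8"))
# Plain ASCII with errors ignored: accented letters are lost entirely
print(sample.encode("ascii", "ignore").decode("ascii"))
# Normalise first: base letters survive, the pi symbol is still dropped
print(to_ascii(sample))
```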


In [11]:
def to_unicode(text):
    if isinstance(text, float):
        text = str(text)
    if isinstance(text, int):
        text = str(text)
    if not isinstance(text, str):
        text = text.decode('utf-8', 'ignore')
    return text

In [12]:
#The 'b' prefix above shows that the text is a bytes object after ASCII encoding,
#so let us decode it back to a regular string
text = to_unicode(text)
print(text)

The apple  was [*][AMAZING][*] and YuMmY too! You can Checkout the entire Menu in  


## 3.5 Removing symbols


In [13]:
#Removing square brackets and anything enclosed in them
import re

def remove_between_square_brackets(text):
    #Raw string avoids invalid-escape warnings for \[ and \]
    return re.sub(r'\[[^]]*\]', '', text)

In [14]:
text = remove_between_square_brackets(text)
print(text)

The apple  was  and YuMmY too! You can Checkout the entire Menu in  


## 3.6 Removing special characters

In [15]:
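#Function for removing special characters
def remove_special_characters(text, remove_digits=False):
    #Keep letters and whitespace; digits are kept too unless remove_digits is set
    #(A-Z, not A-z: the range A-z would also let [ \ ] ^ _ ` through)
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)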

In [16]:
text = remove_special_characters(text)
print(text)

The apple  was  and YuMmY too You can Checkout the entire Menu in  
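
A note on character ranges in patterns of this kind: the range `A-z` spans ASCII codes 65 to 122, which includes the six punctuation characters between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick), so those characters survive the filter. The quick check below uses only the standard `re` module; the variable names are illustrative.

```python
import re

wide_range = re.compile(r"[^a-zA-z0-9\s]")    # A-z also admits [ \ ] ^ _ `
letters_only = re.compile(r"[^a-zA-Z0-9\s]")  # letters, digits, whitespace

sample = "tasty_food ^^ [5/5]!"
print(wide_range.sub("", sample))    # tasty_food ^^ [55]
print(letters_only.sub("", sample))  # tastyfood  55
```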


## 3.7 Lowercase conversion


In [17]:
#Converting text to lowercase for standardisation
text = text.lower()
print(text)

the apple  was  and yummy too you can checkout the entire menu in  


## 3.8 Function to denoise

In [18]:
# Define a function that applies all the pre-processing steps to a single review

def denoise_text(text):
    text = to_unicode(text)
    text = strip_html(text)
    text = re.sub(r"http\S+", "", text)
    text = deEmojify(text)
    text = text.encode('ascii', 'ignore')
    text = to_unicode(text)
    text = remove_between_square_brackets(text)
    text = remove_special_characters(text)
    text = text.lower()
    return text


In [19]:
# checking a random review 
zomato['review'][556]

'Thank you for delivering 🙂🙂'

In [20]:
# IMP : Apply function on review column
zomato['review']=zomato['review'].apply(denoise_text)
zomato['review'].head()

0                                                 nice
1    best biryani  so supportive staff of outlet  p...
2          delivery boy was very decent and supportive
3    worst biryani i have tasted in my life half of...
4    all food is good and tasty  will order again a...
Name: review, dtype: object

In [21]:
# Processed version of the same review: the emojis have been removed.
zomato['review'][556]

'thank you for delivering '

# 4. Removing stopwords


In [22]:
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer

In [23]:
#Tokenization of text
tokenizer=ToktokTokenizer() 

In [24]:
#Setting English stopwords
stopword_list=nltk.corpus.stopwords.words('english')

In [25]:
#Removing standard English stopwords such as articles, pronouns and prepositions
from nltk.tokenize import word_tokenize,sent_tokenize

stop=set(stopwords.words('english'))
print(stop)

#Removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text


{'you', 'before', 'a', 'against', "you'll", 'doesn', "doesn't", 'isn', 'few', 'after', 'your', 'for', 'm', 'are', 'our', 'only', 'any', 'so', 'not', 'should', 'until', 'wouldn', "isn't", "that'll", 'theirs', 'an', 'into', "aren't", 'ours', 'himself', 'ourselves', 'who', 'herself', 'ain', 'more', 'll', "hadn't", 'hadn', 'had', 'having', 'i', 'above', 'were', 'here', "she's", 'down', 'all', 'about', 'haven', 'his', 'or', 'wasn', 'if', "won't", 'she', 'both', 'hasn', 'between', "couldn't", 'been', "hasn't", 'mightn', 'did', 'during', 's', 'out', 'those', 'why', "it's", "needn't", 'myself', 'o', 'they', 'can', 'him', 'them', 'most', 'with', 'what', 'of', 'd', 'than', 'through', 't', 'no', 'themselves', "should've", 'while', 'how', 'her', 'now', 'mustn', 'same', "shouldn't", "weren't", 'couldn', 'when', 'the', 'by', 'won', 'is', 'on', 'y', 'doing', "wasn't", "wouldn't", 'there', 'being', 'whom', "haven't", 'shan', 'nor', 'yourself', "you'd", 'has', 'these', 'does', 'their', 'it', 'will', 'f

In [26]:
#The selected review before stopword removal (already denoised)
zomato['review'][5475]

'it took 1 hour to assign valvet and thn prepare food like 30 mins to deliver also 4 valvet was near by cant do delivery and ur help chat talking to 34 ppl cant  still get my delivery on timei didnt got support from support chat system also restaurant was non responsive to cook order on time and this valet system assigning late and any mistake happens ur chat support system credit d voucher seriously its was waste on time to order from ds app n restaurant'

In [27]:
#Apply function on review column
zomato['review']=zomato['review'].apply(remove_stopwords)

In [28]:
#Processed example of randomly selected review text
zomato['review'][5475]

'took 1 hour assign valvet thn prepare food like 30 mins deliver also 4 valvet near cant delivery ur help chat talking 34 ppl cant still get delivery timei didnt got support support chat system also restaurant non responsive cook order time valet system assigning late mistake happens ur chat support system credit voucher seriously waste time order ds app n restaurant'
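
The printed stopword set above includes negations such as 'not', 'no' and 'nor'. For sentiment-style analysis, dropping them can change the meaning of a review ("not good" becomes "good"). The sketch below shows one way to protect selected words. To stay runnable without any downloads it uses a tiny hand-written stopword set in place of NLTK's list, and the names `stopword_subset`, `negations` and `remove_stopwords_keep_negation` are hypothetical.

```python
# A tiny stand-in for NLTK's English stopword list (assumption: the real
# list is much longer; this subset keeps the example download-free)
stopword_subset = {"the", "was", "and", "is", "not", "no", "it", "a"}

# Words to protect from removal (an illustrative choice)
negations = {"not", "no", "nor"}

def remove_stopwords_keep_negation(text, stopwords=stopword_subset, keep=negations):
    """Drop stopwords from whitespace-separated text, except protected words."""
    tokens = text.split()
    kept = [t for t in tokens if t.lower() in keep or t.lower() not in stopwords]
    return " ".join(kept)

print(remove_stopwords_keep_negation("the food was not good and it was cold"))
# food not good cold
```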

# 5. Stemming and Lemmatization


In [29]:
from nltk.stem import WordNetLemmatizer,SnowballStemmer
from nltk.stem.porter import PorterStemmer
nltk.download('wordnet')

def simple_stemmer(text):
    ps=SnowballStemmer(language='english')
    return ' '.join([ps.stem(word) for word in tokenizer.tokenize(text)])

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [30]:
# random text to verify stemming
zomato['review'][5475]

'took 1 hour assign valvet thn prepare food like 30 mins deliver also 4 valvet near cant delivery ur help chat talking 34 ppl cant still get delivery timei didnt got support support chat system also restaurant non responsive cook order time valet system assigning late mistake happens ur chat support system credit voucher seriously waste time order ds app n restaurant'

In [31]:
# After stemming, some words change and some stems are not real words, because stemming is a rule-based approach.
simple_stemmer(zomato['review'][5475])

'took 1 hour assign valvet thn prepar food like 30 min deliv also 4 valvet near cant deliveri ur help chat talk 34 ppl cant still get deliveri timei didnt got support support chat system also restaur non respons cook order time valet system assign late mistak happen ur chat support system credit voucher serious wast time order ds app n restaur'
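
Stems such as 'deliveri' and 'restaur' appear because a stemmer applies suffix rules without consulting a dictionary. The toy function below illustrates that idea only. It is not the Snowball algorithm, and its rules and the name `toy_stem` are made up for this example.

```python
# Toy suffix rules, checked in order; the first match wins (illustrative only)
SUFFIX_RULES = [("ing", ""), ("ies", "i"), ("es", ""), ("s", ""), ("y", "i")]

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    return word

print([toy_stem(w) for w in ["talking", "happens", "delivery", "mins", "food"]])
# ['talk', 'happen', 'deliveri', 'min', 'food']
```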

In [32]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
#Lemmatizer example
def lemmatize_all(sentence):
    wnl = WordNetLemmatizer()
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith("NN"):
            yield wnl.lemmatize(word, pos='n')
        elif tag.startswith('VB'):
            yield wnl.lemmatize(word, pos='v')
        elif tag.startswith('JJ'):
            yield wnl.lemmatize(word, pos='a')
        else:
            yield word
            
def lemmatize_text(text):
    return ' '.join(lemmatize_all(text))

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [33]:
zomato['review'][5475]

'took 1 hour assign valvet thn prepare food like 30 mins deliver also 4 valvet near cant delivery ur help chat talking 34 ppl cant still get delivery timei didnt got support support chat system also restaurant non responsive cook order time valet system assigning late mistake happens ur chat support system credit voucher seriously waste time order ds app n restaurant'

In [34]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


True

In [35]:
# Lemmatization returns the dictionary form (lemma) of a word, e.g. playing -> play, gone -> go, am -> be
lemmatize_text(zomato['review'][5475])

'take 1 hour assign valvet thn prepare food like 30 min deliver also 4 valvet near cant delivery ur help chat talk 34 ppl cant still get delivery timei didnt get support support chat system also restaurant non responsive cook order time valet system assign late mistake happen ur chat support system credit voucher seriously waste time order d app n restaurant'
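
The lemmatizer above passes a part-of-speech hint to WordNet for nouns, verbs and adjectives, while adverbs (Penn Treebank tags starting with RB) fall through unchanged. The tag mapping can also be written as a small lookup, sketched here with WordNet's single-letter codes 'n', 'v', 'a' and 'r'. The helper name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    prefixes = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}
    return prefixes.get(tag[:2])

for tag in ["NNS", "VBG", "JJR", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

A caller would lemmatize with the returned letter when it is not None and keep the word as-is otherwise.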

In [36]:
#The same review before lemmatization (stopwords already removed)
zomato['review'][5475]

'took 1 hour assign valvet thn prepare food like 30 mins deliver also 4 valvet near cant delivery ur help chat talking 34 ppl cant still get delivery timei didnt got support support chat system also restaurant non responsive cook order time valet system assigning late mistake happens ur chat support system credit voucher seriously waste time order ds app n restaurant'

In [37]:
zomato['review'] = zomato['review'].apply(lemmatize_text)

In [38]:
# Processed example of randomly selected review text 
# In the text below, lemmatization has changed talking -> talk, assigning -> assign, happens -> happen, etc.
zomato['review'][5475]

'take 1 hour assign valvet thn prepare food like 30 min deliver also 4 valvet near cant delivery ur help chat talk 34 ppl cant still get delivery timei didnt get support support chat system also restaurant non responsive cook order time valet system assign late mistake happen ur chat support system credit voucher seriously waste time order d app n restaurant'

# 6. Conclusion
As the output above shows, the processed review no longer makes complete grammatical sense. Even so, the value of text preprocessing is clear: unwanted characters and uninformative words have been removed, and the remaining text is suitable for a decent amount of downstream text analysis.