# 1. Pre-processing "reviewText" column


## 1. Import necessary packages and modules

In [2]:
import pandas as pd
import numpy as np
import json
import re
import nltk
import spacy
import string
pd.options.mode.chained_assignment = None

## 2. Import the dataset

In [3]:
data = pd.read_json('C:\\Users\\Roma\\Downloads\\4.Text Mining\\Amazon_reviews_final_exam.json',lines = True)

## 3. Exploring the data

In [4]:
data.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,image
0,5,9.0,False,"11 8, 2001",AH2IFH762VY5U,B00005N7P0,ted sedlmayr,"for computer enthusiast, MaxPC is a welcome si...","AVID READER SINCE ""boot"" WAS THE NAME",1005177600,,
1,5,9.0,False,"10 31, 2001",AOSFI0JEYU4XM,B00005N7P0,Amazon Customer,Thank god this is not a Ziff Davis publication...,The straight scoop,1004486400,,
2,3,14.0,False,"03 24, 2007",A3JPFWKS83R49V,B00005N7OJ,Bryan Carey,Antiques Magazine is a publication made for an...,"Antiques Magazine is Good, but not for Everyone",1174694400,{'Format:': ' Print Magazine'},
3,5,13.0,False,"11 10, 2006",A19FKU6JZQ2ECJ,B00005N7OJ,Patricia L. Porada,This beautiful magazine is in itself a work of...,THE DISCERNING READER,1163116800,{'Format:': ' Print Magazine'},
4,5,,True,"07 14, 2014",A25MDGOMZ2GALN,B00005N7P0,Alvey,A great read every issue.,Five Stars,1405296000,,


In [5]:
print(data.columns.values)

['overall' 'vote' 'verified' 'reviewTime' 'reviewerID' 'asin'
 'reviewerName' 'reviewText' 'summary' 'unixReviewTime' 'style' 'image']


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   overall         5000 non-null   int64  
 1   vote            1127 non-null   float64
 2   verified        5000 non-null   bool   
 3   reviewTime      5000 non-null   object 
 4   reviewerID      5000 non-null   object 
 5   asin            5000 non-null   object 
 6   reviewerName    5000 non-null   object 
 7   reviewText      4999 non-null   object 
 8   summary         4997 non-null   object 
 9   unixReviewTime  5000 non-null   int64  
 10  style           4404 non-null   object 
 11  image           1 non-null      object 
dtypes: bool(1), float64(1), int64(2), object(8)
memory usage: 434.7+ KB


In [7]:
#Checking for null values
data.isnull().sum()

overall              0
vote              3873
verified             0
reviewTime           0
reviewerID           0
asin                 0
reviewerName         0
reviewText           1
summary              3
unixReviewTime       0
style              596
image             4999
dtype: int64

The dataset has 12 columns and 5000 records. 

### Of all the columns, only "reviewText" is pre-processed for our analysis. To preserve the original, the column is duplicated under the name "processed_reviewText", and all pre-processing is applied to that copy.

In [8]:
# Creating column "processed_reviewText"

data['processed_reviewText'] = data['reviewText'].copy()

data.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,image,processed_reviewText
0,5,9.0,False,"11 8, 2001",AH2IFH762VY5U,B00005N7P0,ted sedlmayr,"for computer enthusiast, MaxPC is a welcome si...","AVID READER SINCE ""boot"" WAS THE NAME",1005177600,,,"for computer enthusiast, MaxPC is a welcome si..."
1,5,9.0,False,"10 31, 2001",AOSFI0JEYU4XM,B00005N7P0,Amazon Customer,Thank god this is not a Ziff Davis publication...,The straight scoop,1004486400,,,Thank god this is not a Ziff Davis publication...
2,3,14.0,False,"03 24, 2007",A3JPFWKS83R49V,B00005N7OJ,Bryan Carey,Antiques Magazine is a publication made for an...,"Antiques Magazine is Good, but not for Everyone",1174694400,{'Format:': ' Print Magazine'},,Antiques Magazine is a publication made for an...
3,5,13.0,False,"11 10, 2006",A19FKU6JZQ2ECJ,B00005N7OJ,Patricia L. Porada,This beautiful magazine is in itself a work of...,THE DISCERNING READER,1163116800,{'Format:': ' Print Magazine'},,This beautiful magazine is in itself a work of...
4,5,,True,"07 14, 2014",A25MDGOMZ2GALN,B00005N7P0,Alvey,A great read every issue.,Five Stars,1405296000,,,A great read every issue.


# 4. Data Pre-processing

Pre-processing removes noise from the text so that downstream algorithms can process it more easily.
## Pre-processing steps followed:
### i. Conversion of the characters into lower case
### ii. Removal of Punctuation
### iii. Removal of Stopwords
### iv. Removal of frequently occurring words
### v. Removal of rare words
### vi. Stemming
### vii. Lemmatization
### viii. Removal of emojis

# i. Conversion of the characters into Lowercase

Lowercasing is a common text pre-processing technique. Converting all characters to the same case ensures that words like 'review', 'Review', and 'REVIEW' are treated identically.

It also helps the feature extraction techniques used at later stages. In techniques like TF-IDF, it merges case variants of the same word, which reduces duplication and yields correct counts and TF-IDF values.
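As a small illustration of the effect on counts, the sketch below tallies tokens before and after lowercasing. The sample sentence is made up for this example and is not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample text, not from the dataset
sample = "Review the review because a REVIEW helps"

# Without lowercasing, each case variant is counted as a separate token
raw_counts = Counter(sample.split())

# After lowercasing, 'Review' and 'REVIEW' merge into 'review'
lower_counts = Counter(sample.lower().split())

print(raw_counts["review"], lower_counts["review"])
```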


In [9]:
data.processed_reviewText = data.processed_reviewText.str.lower()

In [10]:
data['processed_reviewText']

0       for computer enthusiast, maxpc is a welcome si...
1       thank god this is not a ziff davis publication...
2       antiques magazine is a publication made for an...
3       this beautiful magazine is in itself a work of...
4                               a great read every issue.
                              ...                        
4995    this is a trashbag magazine for losers with no...
4996              love i can read this right on my kindle
4997                       love it  my favorite us weekly
4998                                                   ok
4999                                     great magazine!!
Name: processed_reviewText, Length: 5000, dtype: object

# ii. Removal of Punctuation, URLs and HTML Tags

The second pre-processing step is to remove punctuation, a text standardization process that lets 'hurray' and 'hurray!' be treated in the same way.

For example, string.punctuation in Python contains the following punctuation symbols:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
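A minimal sketch of one way to strip these symbols, using str.translate with a table built from string.punctuation. The helper name remove_punctuation and the sample strings are assumptions for illustration; this is not a step taken from the notebook's own cells.

```python
import string

# Translation table that deletes every character listed in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Hypothetical helper: delete all punctuation characters from the text."""
    return str(text).translate(PUNCT_TABLE)

print(remove_punctuation("hurray!"))
print(remove_punctuation("great magazine!!"))
```

With this applied, 'hurray' and 'hurray!' become the same token.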

#### Removal of URL
Reviewers sometimes include links to other products they find useful. URLs add noise during processing, so they are also removed in this step.

e.g.: re.sub('https?://\S+|www\.\S+', '', text)
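A short, self-contained check of the URL pattern. The sample sentence and its links are invented for illustration.

```python
import re

# Same pattern as in the text: http(s) links, or anything starting with 'www.'
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Hypothetical review text containing two links
sample = "see https://example.com/item?id=1 or www.example.org for more"

cleaned = URL_PATTERN.sub("", sample)
print(repr(cleaned))
```

The whitespace around each removed link stays behind, so the result contains double spaces; these disappear later whenever the text is split on whitespace and re-joined.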

#### Removal of HTML Tags

Another common pre-processing technique that comes in handy in many places is the removal of HTML tags. This is especially useful when the data is scraped from websites, since HTML strings can end up as part of the text.

e.g.: re.sub('<.*?>+', '', text)
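The sketch below applies the tag pattern to an invented HTML snippet. The second example is a caveat worth keeping in mind: the pattern is purely textual, so a '<' followed later by a '>' in ordinary prose is removed as well.

```python
import re

# Same pattern as in the text: '<', as few characters as possible, then '>'
TAG_PATTERN = re.compile(r"<.*?>+")

# Hypothetical scraped fragment
html_sample = "<p>great <b>magazine</b></p><br/>"
print(TAG_PATTERN.sub("", html_sample))

# Caveat: comparison signs in plain text are matched like a tag
plain_sample = "rating < 3 or > 4"
print(repr(TAG_PATTERN.sub("", plain_sample)))
```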

In [11]:
import re

def review_cleaning(text):
    '''Lowercase the text and remove text in square brackets, links, HTML tags,
    newlines and words containing digits. Punctuation itself is left in place.'''
    # Note: str() turns a missing value (NaN) into the literal string 'nan'
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                 # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub(r'\n', '', text)                      # newlines
    text = re.sub(r'\w*\d\w*', '', text)                # words containing digits
    return text

In [12]:
data['processed_reviewText']=data['processed_reviewText'].apply(lambda x:review_cleaning(x))
data['processed_reviewText']

0       for computer enthusiast, maxpc is a welcome si...
1       thank god this is not a ziff davis publication...
2       antiques magazine is a publication made for an...
3       this beautiful magazine is in itself a work of...
4                               a great read every issue.
                              ...                        
4995    this is a trashbag magazine for losers with no...
4996              love i can read this right on my kindle
4997                       love it  my favorite us weekly
4998                                                   ok
4999                                     great magazine!!
Name: processed_reviewText, Length: 5000, dtype: object
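One detail worth checking: data.info() above reports one missing reviewText value, and calling str() on a missing value (a float NaN) produces the literal string 'nan' rather than an empty text. The sketch below shows the effect and one possible guard. The helper name to_clean_string is hypothetical and not part of the notebook.

```python
import math

def to_clean_string(value):
    """Hypothetical guard: map missing values (None or float NaN) to ''."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

missing = float("nan")  # stand-in for a missing text value

print(repr(str(missing).lower()))      # what plain str() produces
print(repr(to_clean_string(missing)))  # what the guard produces
```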

# iii. Removal of Stopwords

Stopwords are commonly occurring words in a language, such as 'the' and 'a'. As they add little information for downstream analysis, they can be removed from the text. In tasks like part-of-speech tagging, however, they should be kept, since they provide valuable information about the POS.

Stopword lists are already compiled for different languages and can be used directly. For example, the English stopword list from the nltk package is shown below.

In [13]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [14]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

data["processed_reviewText"] = data["processed_reviewText"].apply(lambda text: remove_stopwords(text))
data['processed_reviewText']

0       computer enthusiast, maxpc welcome sight mailb...
1       thank god ziff davis publication. maxpc actual...
2       antiques magazine publication made antique lov...
3       beautiful magazine work art. quality every pag...
4                                 great read every issue.
                              ...                        
4995    trashbag magazine losers life think kartrashia...
4996                               love read right kindle
4997                              love favorite us weekly
4998                                                   ok
4999                                     great magazine!!
Name: processed_reviewText, Length: 5000, dtype: object

# iv. Removal of Frequently occurring words

In the previous pre-processing step, we removed stopwords based on language information. In a domain-specific corpus, however, there may also be frequent words that are of little importance to us.

So this step removes the most frequent words in the given corpus.


#### Count of the frequent words

In [15]:
from collections import Counter

cnt = Counter()
for text in data["processed_reviewText"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('magazine', 2403),
 ('great', 1243),
 ('good', 908),
 ('like', 877),
 ('articles', 762),
 ('love', 740),
 ('one', 669),
 ('magazine.', 660),
 ('read', 646),
 ('subscription', 601)]
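Note that 'magazine' and 'magazine.' appear as separate entries in the counts above: tokens are split on whitespace, so trailing punctuation stays attached to the word. A minimal sketch of how stripping punctuation from each token before counting would merge them. The three sample reviews are made up for illustration.

```python
from collections import Counter
import string

# Hypothetical mini-corpus, not from the dataset
reviews = ["great magazine.", "love this magazine", "magazine!! great"]

raw = Counter()
stripped = Counter()
for text in reviews:
    for word in text.split():
        raw[word] += 1
        # strip leading/trailing punctuation only; inner characters are kept
        token = word.strip(string.punctuation)
        if token:
            stripped[token] += 1

print(raw["magazine"], stripped["magazine"])
```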

#### Removal of frequent words

In [16]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])

def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

data["processed_reviewText"] = data["processed_reviewText"].apply(lambda text: remove_freqwords(text))
data['processed_reviewText']

0       computer enthusiast, maxpc welcome sight mailb...
1       thank god ziff davis publication. maxpc actual...
2       antiques publication made antique lovers histo...
3       beautiful work art. quality every page bits in...
4                                            every issue.
                              ...                        
4995     trashbag losers life think kartrashians awesome.
4996                                         right kindle
4997                                   favorite us weekly
4998                                                   ok
4999                                           magazine!!
Name: processed_reviewText, Length: 5000, dtype: object

# v. Removal of Rare words

Words that occur very rarely in the texts also add little meaning, so they can be removed as well.

In [17]:
n_rare_words = 10
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

data["processed_reviewText"] = data["processed_reviewText"].apply(lambda text: remove_rarewords(text))
data['processed_reviewText']

0       computer enthusiast, maxpc welcome sight mailb...
1       thank god ziff davis publication. maxpc actual...
2       antiques publication made antique lovers histo...
3       beautiful work art. quality every page bits in...
4                                            every issue.
                              ...                        
4995                     life think kartrashians awesome.
4996                                         right kindle
4997                                   favorite us weekly
4998                                                   ok
4999                                           magazine!!
Name: processed_reviewText, Length: 5000, dtype: object
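The slice cnt.most_common()[:-n_rare_words-1:-1] used above walks the frequency-sorted list backwards from its end and takes n_rare_words items. A quick check on a toy counter (the words and counts are invented for this example):

```python
from collections import Counter

# Hypothetical counts for illustration
toy = Counter({"magazine": 5, "great": 4, "read": 3, "kindle": 2, "trashbag": 1})
n_rare_words = 2

# most_common() sorts from most to least frequent;
# [:-n-1:-1] takes the last n entries, rarest first
rare = toy.most_common()[:-n_rare_words - 1:-1]
print(rare)

# Alternative sketch: select by a count threshold instead of a fixed number
rare_by_count = {word for word, count in toy.items() if count <= 1}
print(rare_by_count)
```

When many words share the lowest count, a fixed number of rare words picks only some of them, depending on the order in which they were first seen. A count threshold, as in the second part of the sketch, may be a more stable choice in that case.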

# vi. Stemming

Stemming is the process of reducing a word to its word stem by stripping affixes such as suffixes and prefixes. Unlike a lemma, the resulting stem need not be a valid word.

For example, if the corpus contains the two words walks and walking, stemming strips the suffixes and gives walk for both. But for the two words console and consoling, the stemmer strips the suffix and gives consol, which is not a proper English word.

There are several types of stemming algorithms available; one of the best known and most widely used is the Porter stemmer. We can use the nltk package for this.
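To make the suffix-stripping idea concrete, here is a deliberately naive sketch. It is not the Porter algorithm, which applies a much larger, ordered set of rules; the function naive_stem and its three rules are assumptions chosen only to reproduce the examples above.

```python
def naive_stem(word):
    """Toy suffix stripper for illustration only (NOT the Porter stemmer)."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]                      # walking -> walk, consoling -> consol
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]                      # walks -> walk
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]                      # console -> consol
    return word

for w in ["walks", "walking", "console", "consoling"]:
    print(w, "->", naive_stem(w))
```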

In [18]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

data["processed_reviewText_stemmed"] = data["processed_reviewText"].apply(lambda text: stem_words(text))
data['processed_reviewText_stemmed']

0       comput enthusiast, maxpc welcom sight mailbox....
1       thank god ziff davi publication. maxpc actual ...
2       antiqu public made antiqu lover histori buff p...
3       beauti work art. qualiti everi page bit inform...
4                                            everi issue.
                              ...                        
4995                      life think kartrashian awesome.
4996                                          right kindl
4997                                    favorit us weekli
4998                                                   ok
4999                                           magazine!!
Name: processed_reviewText_stemmed, Length: 5000, dtype: object

#### We can see that words like 'welcome' and 'antique' have the 'e' at the end chopped off by stemming, which is not really intended.

#### Lemmatization is a better fit for this.


In [19]:
data.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,image,processed_reviewText,processed_reviewText_stemmed
0,5,9.0,False,"11 8, 2001",AH2IFH762VY5U,B00005N7P0,ted sedlmayr,"for computer enthusiast, MaxPC is a welcome si...","AVID READER SINCE ""boot"" WAS THE NAME",1005177600,,,"computer enthusiast, maxpc welcome sight mailb...","comput enthusiast, maxpc welcom sight mailbox...."
1,5,9.0,False,"10 31, 2001",AOSFI0JEYU4XM,B00005N7P0,Amazon Customer,Thank god this is not a Ziff Davis publication...,The straight scoop,1004486400,,,thank god ziff davis publication. maxpc actual...,thank god ziff davi publication. maxpc actual ...
2,3,14.0,False,"03 24, 2007",A3JPFWKS83R49V,B00005N7OJ,Bryan Carey,Antiques Magazine is a publication made for an...,"Antiques Magazine is Good, but not for Everyone",1174694400,{'Format:': ' Print Magazine'},,antiques publication made antique lovers histo...,antiqu public made antiqu lover histori buff p...
3,5,13.0,False,"11 10, 2006",A19FKU6JZQ2ECJ,B00005N7OJ,Patricia L. Porada,This beautiful magazine is in itself a work of...,THE DISCERNING READER,1163116800,{'Format:': ' Print Magazine'},,beautiful work art. quality every page bits in...,beauti work art. qualiti everi page bit inform...
4,5,,True,"07 14, 2014",A25MDGOMZ2GALN,B00005N7P0,Alvey,A great read every issue.,Five Stars,1405296000,,,every issue.,everi issue.


# vii. Lemmatization

Lemmatization is similar to stemming in that it reduces inflected words to a base form, but it differs in making sure that the resulting root word (also called the lemma) is a valid word of the language.

We have used the WordNetLemmatizer in nltk for the process.

In [21]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

data["processed_reviewText_lemmatized"] = data["processed_reviewText"].apply(lambda text: lemmatize_words(text))
data['processed_reviewText_lemmatized']

0       computer enthusiast, maxpc welcome sight mailb...
1       thank god ziff davis publication. maxpc actual...
2       antique publication made antique lover history...
3       beautiful work art. quality every page bit inf...
4                                            every issue.
                              ...                        
4995                     life think kartrashians awesome.
4996                                         right kindle
4997                                    favorite u weekly
4998                                                   ok
4999                                           magazine!!
Name: processed_reviewText_lemmatized, Length: 5000, dtype: object

#### We can see that the 'e' in words like 'welcome' and 'antique' is retained this time.

#### The nltk lemmatizer also accepts the POS tag of the word along with the word itself; when no tag is given, it treats the word as a noun. Depending on the POS, the lemmatizer may return different results.

### Part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
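By default, nltk.pos_tag returns Penn Treebank tags such as 'NN', 'VBN' or 'JJ', while the WordNet lemmatizer expects one of four coarse POS codes. A minimal sketch of the mapping idea using plain strings, so that it runs without nltk: the codes 'n', 'v', 'a' and 'r' are the values of wordnet.NOUN, wordnet.VERB, wordnet.ADJ and wordnet.ADV, while the helper name to_wordnet_pos and the tagged sample are hypothetical.

```python
# Plain-string stand-ins for wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"
TAG_MAP = {"N": NOUN, "V": VERB, "J": ADJ, "R": ADV}

def to_wordnet_pos(penn_tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS code,
    falling back to noun for every other tag."""
    return TAG_MAP.get(penn_tag[:1], NOUN)

# Hand-written (word, tag) pairs in the shape that nltk.pos_tag returns
tagged = [("publication", "NN"), ("made", "VBN"), ("beautiful", "JJ"),
          ("really", "RB"), ("every", "DT")]

for word, tag in tagged:
    print(word, tag, to_wordnet_pos(tag))
```

Only the first letter of the tag is used, so all verb tags (VB, VBD, VBN, ...) map to the verb code, and any tag outside the four groups falls back to noun.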

In [23]:
######   LEMMATIZATION BY USING PART-OF-SPEECH TAGGING

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}
def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])


data["processed_reviewText"] = data["processed_reviewText_lemmatized"].apply(lambda text: lemmatize_words(text))
data['processed_reviewText']

0       computer enthusiast, maxpc welcome sight mailb...
1       thank god ziff davis publication. maxpc actual...
2       antique publication make antique lover history...
3       beautiful work art. quality every page bit inf...
4                                            every issue.
                              ...                        
4995                     life think kartrashians awesome.
4996                                         right kindle
4997                                    favorite u weekly
4998                                                   ok
4999                                           magazine!!
Name: processed_reviewText, Length: 5000, dtype: object

# viii. Removal of emojis

Emojis are very common nowadays, and customers often use them in reviews to express themselves. For the purpose of this analysis, they are removed from the texts.


In [24]:
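def remove_emoji(text):
    """Remove emojis and related symbols from the text."""
    # The parameter is named 'text' so that it does not shadow the imported 'string' module
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # broad range: also covers CJK and other scripts
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)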

data["processed_reviewText"] = data["processed_reviewText"].apply(lambda string: remove_emoji(string))
data['processed_reviewText']

0       computer enthusiast, maxpc welcome sight mailb...
1       thank god ziff davis publication. maxpc actual...
2       antique publication make antique lover history...
3       beautiful work art. quality every page bit inf...
4                                            every issue.
                              ...                        
4995                     life think kartrashians awesome.
4996                                         right kindle
4997                                    favorite u weekly
4998                                                   ok
4999                                           magazine!!
Name: processed_reviewText, Length: 5000, dtype: object
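A quick self-contained check of what the pattern above removes. The sample strings are invented; the last one illustrates that the broad range U+24C2 to U+1F251 also covers non-emoji characters such as CJK ideographs, which may or may not be desired.

```python
import re

# Same character ranges as in remove_emoji above
emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags
                           "\U00002702-\U000027B0"
                           "\U000024C2-\U0001F251"
                           "]+")

with_emoji = "great magazine \U0001F600\U0001F44D"  # grinning face, thumbs up
plain = "ok"
with_cjk = "\u65E5\u672C magazine"                   # two CJK ideographs

print(repr(emoji_pattern.sub("", with_emoji)))
print(repr(emoji_pattern.sub("", plain)))
print(repr(emoji_pattern.sub("", with_cjk)))
```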

### Overview of the data

In [25]:
data.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,style,image,processed_reviewText,processed_reviewText_stemmed,processed_reviewText_lemmatized
0,5,9.0,False,"11 8, 2001",AH2IFH762VY5U,B00005N7P0,ted sedlmayr,"for computer enthusiast, MaxPC is a welcome si...","AVID READER SINCE ""boot"" WAS THE NAME",1005177600,,,"computer enthusiast, maxpc welcome sight mailb...","comput enthusiast, maxpc welcom sight mailbox....","computer enthusiast, maxpc welcome sight mailb..."
1,5,9.0,False,"10 31, 2001",AOSFI0JEYU4XM,B00005N7P0,Amazon Customer,Thank god this is not a Ziff Davis publication...,The straight scoop,1004486400,,,thank god ziff davis publication. maxpc actual...,thank god ziff davi publication. maxpc actual ...,thank god ziff davis publication. maxpc actual...
2,3,14.0,False,"03 24, 2007",A3JPFWKS83R49V,B00005N7OJ,Bryan Carey,Antiques Magazine is a publication made for an...,"Antiques Magazine is Good, but not for Everyone",1174694400,{'Format:': ' Print Magazine'},,antique publication make antique lover history...,antiqu public made antiqu lover histori buff p...,antique publication made antique lover history...
3,5,13.0,False,"11 10, 2006",A19FKU6JZQ2ECJ,B00005N7OJ,Patricia L. Porada,This beautiful magazine is in itself a work of...,THE DISCERNING READER,1163116800,{'Format:': ' Print Magazine'},,beautiful work art. quality every page bit inf...,beauti work art. qualiti everi page bit inform...,beautiful work art. quality every page bit inf...
4,5,,True,"07 14, 2014",A25MDGOMZ2GALN,B00005N7P0,Alvey,A great read every issue.,Five Stars,1405296000,,,every issue.,everi issue.,every issue.


As seen above, the data now contains several redundant columns, so we remove them.

In [26]:
# Drop the redundant columns 

data.drop(["processed_reviewText_stemmed", "processed_reviewText_lemmatized", "style","image"], axis=1, inplace=True)


In [27]:
data.head()

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,processed_reviewText
0,5,9.0,False,"11 8, 2001",AH2IFH762VY5U,B00005N7P0,ted sedlmayr,"for computer enthusiast, MaxPC is a welcome si...","AVID READER SINCE ""boot"" WAS THE NAME",1005177600,"computer enthusiast, maxpc welcome sight mailb..."
1,5,9.0,False,"10 31, 2001",AOSFI0JEYU4XM,B00005N7P0,Amazon Customer,Thank god this is not a Ziff Davis publication...,The straight scoop,1004486400,thank god ziff davis publication. maxpc actual...
2,3,14.0,False,"03 24, 2007",A3JPFWKS83R49V,B00005N7OJ,Bryan Carey,Antiques Magazine is a publication made for an...,"Antiques Magazine is Good, but not for Everyone",1174694400,antique publication make antique lover history...
3,5,13.0,False,"11 10, 2006",A19FKU6JZQ2ECJ,B00005N7OJ,Patricia L. Porada,This beautiful magazine is in itself a work of...,THE DISCERNING READER,1163116800,beautiful work art. quality every page bit inf...
4,5,,True,"07 14, 2014",A25MDGOMZ2GALN,B00005N7P0,Alvey,A great read every issue.,Five Stars,1405296000,every issue.


# 5. Exporting the file in Excel format

In [28]:
pip install xlsxwriter

Note: you may need to restart the kernel to use updated packages.




In [29]:
data.to_excel("C:\\Users\\Roma\\Downloads\\4.Text Mining\\preprocessed_reviewText.xlsx", engine='xlsxwriter')