## Bring in the Bazar Voice Data

In [49]:
# load libraries
import pandas as pd
import nltk # https://www.nltk.org/ ; nltk helps with tokenization, stopwords dictionary, lemmatization
import re # https://docs.python.org/3/library/re.html ; re is used for regular expressions
import numpy as np

bazar = pd.read_csv("BazarVoice_edit2.csv", header = 0, engine='python')
bazar.head()



Unnamed: 0,Product,Product_Collection,ReviewID,Date,Review_Title,Review,WordCloud,Rating,Location,Product_website
0,"Kids Anticavity Fluoride Mouthwash, Berry Spla...",LISTERINEÂ® SMART RINSEÂ® Kids Mouthwash Colle...,221839752,6/27/2020,Great product,We bought this product and absolutely love it!...,BOUGHT,5,USA,https://www.listerine.com/mouthwash/for-kids/l...
1,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221832493,6/27/2020,Oral Tissue Sloughing-- A Serious Concern,I figured I would weigh in here regarding the ...,FIGURED,1,Philadelphia,https://www.listerine.com/mouthwash/anticavity...
2,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221784924,6/26/2020,*** Mouth Sloughing !!! ***,"If you want to LIVE an actual NIGHTMARE, use t...",WANT,1,Syracuse,https://www.listerine.com/mouthwash/anticavity...
3,Chewable Tablets Soft Mint,LISTERINEÂ® On-The-Go Oral Care Products,221768439,6/26/2020,It works,Love this product it does what it's supposed t...,LOVE,5,Columbus,https://www.listerine.com/on-the-go-oral-healt...
4,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221771164,6/26/2020,Zero option no good at all,"I got the original, the green one the blue min...",GOT,1,Hialeah,https://www.listerine.com/mouthwash/anticavity...
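
The `Â®` in the `Product_Collection` values above looks like the usual signature of UTF-8 text that was decoded with a single-byte codec such as Latin-1. The sketch below reproduces the effect on a sample string and reverses it. Whether the CSV itself is UTF-8 is an assumption, and `repair_mojibake` is a hypothetical helper, not part of the notebook.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were decoded as Latin-1; return the text unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Reproduce the garbling: encode as UTF-8, then decode with the wrong codec.
garbled = "LISTERINE\u00ae ZERO".encode("utf-8").decode("latin-1")
print(garbled)
print(repair_mojibake(garbled))
```

If the assumption holds, passing `encoding="utf-8"` to `pd.read_csv` might avoid the problem at load time, but the later cells that match the garbled strings would then need adjusting.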


In [50]:
# check the structure of the data: number of rows, columns, type of each column, number of non-null values in each column
bazar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8591 entries, 0 to 8590
Data columns (total 10 columns):
Product               8591 non-null object
Product_Collection    8591 non-null object
ReviewID              8591 non-null int64
Date                  8591 non-null object
Review_Title          8591 non-null object
Review                8591 non-null object
WordCloud             8579 non-null object
Rating                8591 non-null int64
Location              6332 non-null object
Product_website       8591 non-null object
dtypes: int64(2), object(8)
memory usage: 671.3+ KB


## Text Preprocessing

##### On text, we will apply the following:
- lower case
- remove punctuation, numbers
- remove stopwords
- lemmatization

#### lower case

In [53]:
# lower case the "Review" column and save it under a column named "full_text"
bazar["full_text"] = bazar["Review"].str.lower()
bazar.head(2)

Unnamed: 0,Product,Product_Collection,ReviewID,Date,Review_Title,Review,WordCloud,Rating,Location,Product_website,full_text,no_char
0,"Kids Anticavity Fluoride Mouthwash, Berry Spla...",LISTERINEÂ® SMART RINSEÂ® Kids Mouthwash Colle...,221839752,6/27/2020,Great product,We bought this product and absolutely love it!...,BOUGHT,5,USA,https://www.listerine.com/mouthwash/for-kids/l...,we bought this product and absolutely love it!...,we bought this product and absolutely love it ...
1,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221832493,6/27/2020,Oral Tissue Sloughing-- A Serious Concern,I figured I would weigh in here regarding the ...,FIGURED,1,Philadelphia,https://www.listerine.com/mouthwash/anticavity...,i figured i would weigh in here regarding the ...,i figured i would weigh in here regarding the ...


In [52]:
# remove all characters except for letters
bazar["no_char"] = bazar["full_text"].apply(lambda x: re.sub('[^a-zA-Z]',' ',x))
bazar.head(2)

Unnamed: 0,Product,Product_Collection,ReviewID,Date,Review_Title,Review,WordCloud,Rating,Location,Product_website,full_text,no_char
0,"Kids Anticavity Fluoride Mouthwash, Berry Spla...",LISTERINEÂ® SMART RINSEÂ® Kids Mouthwash Colle...,221839752,6/27/2020,Great product,We bought this product and absolutely love it!...,BOUGHT,5,USA,https://www.listerine.com/mouthwash/for-kids/l...,we bought this product and absolutely love it!...,we bought this product and absolutely love it ...
1,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221832493,6/27/2020,Oral Tissue Sloughing-- A Serious Concern,I figured I would weigh in here regarding the ...,FIGURED,1,Philadelphia,https://www.listerine.com/mouthwash/anticavity...,i figured i would weigh in here regarding the ...,i figured i would weigh in here regarding the ...
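
Replacing every non-letter with a space splits contractions and leaves runs of spaces, as the `no_char` values above suggest (`it's` becomes `it s`). Here is a minimal sketch of the same substitution on a sample string, with an optional whitespace collapse added. The `keep_letters` helper name is hypothetical.

```python
import re

def keep_letters(text: str) -> str:
    """Replace non-letters with spaces, then collapse repeated whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(letters_only.split())

print(keep_letters("love it!!! it's 100% great"))
```

The extra spaces are harmless for the stopword step, which splits on any whitespace, so the collapse is mainly a readability choice.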


#### remove stopwords

We will be using the nltk stopwords dictionary. Let's download the stopwords dictionary and take a look at the words that are in it.

In [44]:
# download the stopwords dictionary
from nltk.corpus import stopwords
nltk.download('stopwords')

# save the list of stopwords in stop_words
stop_words = set(stopwords.words("english"))
stop_words

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DeTriumph's\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [54]:
# remove the words that exist in the stopwords dictionary from the reviews
bazar["no_stop_words"] = bazar["no_char"].apply(lambda words: ' '.join(word.lower() for word in words.split() if word not in stop_words))
bazar.head(2)

Unnamed: 0,Product,Product_Collection,ReviewID,Date,Review_Title,Review,WordCloud,Rating,Location,Product_website,full_text,no_char,no_stop_words
0,"Kids Anticavity Fluoride Mouthwash, Berry Spla...",LISTERINEÂ® SMART RINSEÂ® Kids Mouthwash Colle...,221839752,6/27/2020,Great product,We bought this product and absolutely love it!...,BOUGHT,5,USA,https://www.listerine.com/mouthwash/for-kids/l...,we bought this product and absolutely love it!...,we bought this product and absolutely love it ...,bought product absolutely love burn leave mout...
1,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221832493,6/27/2020,Oral Tissue Sloughing-- A Serious Concern,I figured I would weigh in here regarding the ...,FIGURED,1,Philadelphia,https://www.listerine.com/mouthwash/anticavity...,i figured i would weigh in here regarding the ...,i figured i would weigh in here regarding the ...,figured would weigh regarding complaints tissu...
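
The stopword list printed above includes the negations `no`, `nor`, and `not`, so a phrase like "not good" is reduced to "good" before any sentiment modeling. If that matters for the analysis, one option is to keep the negations. The sketch below uses a small hand-written subset in place of the full nltk list; the subset and the `remove_stopwords` helper are illustrative assumptions.

```python
# Small illustrative subset; in the notebook this would be set(stopwords.words("english")).
stop_words = {"i", "the", "is", "at", "all", "no", "nor", "not", "this"}
negations = {"no", "nor", "not"}

def remove_stopwords(text: str, stops: set) -> str:
    """Drop every whitespace-separated token that appears in `stops`."""
    return " ".join(word for word in text.split() if word not in stops)

review = "this is not good at all"
print(remove_stopwords(review, stop_words))              # negation removed
print(remove_stopwords(review, stop_words - negations))  # negation kept
```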


#### lemmatization

We will use the nltk lemmatization dictionary called "wordnet".

In [8]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DeTriumph's\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [46]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize text, we need to know the part of speech of each word. If we pass 'v' to the lemmatize() function, the words are treated as verbs and nltk lemmatizes them on that assumption. Not all the words will get lemmatized at this stage, so we apply the lemmatization function again with 'n', which treats every word as a noun. This approach does not account for the context of the words or their actual part of speech.

To improve the lemmatizer, we could identify the part of speech of each word and then pass that to the lemmatizer.
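
One way to do that is to tag each token and translate the tagger's Penn Treebank tags into the single-letter codes that `lemmatize()` accepts (`'n'`, `'v'`, `'a'`, `'r'`). The sketch below covers only the tag translation. `penn_to_wordnet` is a hypothetical helper, and falling back to noun for every other tag is an assumption.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank POS tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and fallback for everything else

# Hand-written (token, tag) pairs standing in for a tagger's output.
tagged = [("bought", "VBD"), ("products", "NNS"), ("quickly", "RB"), ("fresh", "JJ")]
print([(token, penn_to_wordnet(tag)) for token, tag in tagged])
```

With nltk, the pairs would come from `nltk.pos_tag(tokens)`, which needs a tagger model downloaded first, and each token would then be lemmatized with `wordnet_lemmatizer.lemmatize(token, penn_to_wordnet(tag))`.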

In [55]:
# first, split the sentences into tokens
bazar['tokenized']=bazar["no_stop_words"].apply(lambda x : filter(None,x.split(" ")))

# then lemmatize the words
bazar["lemma_v1"] = bazar["tokenized"].apply(lambda x: [wordnet_lemmatizer.lemmatize(y, "v") for y in x])
bazar["lemma_v1"] = bazar["lemma_v1"].apply(lambda x: [wordnet_lemmatizer.lemmatize(y, "n") for y in x])

# put the words back into a sentence
bazar["lemma"]=bazar["lemma_v1"].apply(lambda x : " ".join(x))

bazar

Unnamed: 0,Product,Product_Collection,ReviewID,Date,Review_Title,Review,WordCloud,Rating,Location,Product_website,full_text,no_char,no_stop_words,tokenized,lemma_v1,lemma
0,"Kids Anticavity Fluoride Mouthwash, Berry Spla...",LISTERINEÂ® SMART RINSEÂ® Kids Mouthwash Colle...,221839752,6/27/2020,Great product,We bought this product and absolutely love it!...,BOUGHT,5,USA,https://www.listerine.com/mouthwash/for-kids/l...,we bought this product and absolutely love it!...,we bought this product and absolutely love it ...,bought product absolutely love burn leave mout...,<filter object at 0x000001D94A689A48>,"[buy, product, absolutely, love, burn, leave, ...",buy product absolutely love burn leave mouth t...
1,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221832493,6/27/2020,Oral Tissue Sloughing-- A Serious Concern,I figured I would weigh in here regarding the ...,FIGURED,1,Philadelphia,https://www.listerine.com/mouthwash/anticavity...,i figured i would weigh in here regarding the ...,i figured i would weigh in here regarding the ...,figured would weigh regarding complaints tissu...,<filter object at 0x000001D94B010D08>,"[figure, would, weigh, regard, complaint, tiss...",figure would weigh regard complaint tissue slo...
2,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221784924,6/26/2020,*** Mouth Sloughing !!! ***,"If you want to LIVE an actual NIGHTMARE, use t...",WANT,1,Syracuse,https://www.listerine.com/mouthwash/anticavity...,"if you want to live an actual nightmare, use t...",if you want to live an actual nightmare use t...,want live actual nightmare use product woke ni...,<filter object at 0x000001D94B051A48>,"[want, live, actual, nightmare, use, product, ...",want live actual nightmare use product wake ni...
3,Chewable Tablets Soft Mint,LISTERINEÂ® On-The-Go Oral Care Products,221768439,6/26/2020,It works,Love this product it does what it's supposed t...,LOVE,5,Columbus,https://www.listerine.com/on-the-go-oral-healt...,love this product it does what it's supposed t...,love this product it does what it s supposed t...,love product supposed quick easy use ate stron...,<filter object at 0x000001D94B06FB88>,"[love, product, suppose, quick, easy, use, eat...",love product suppose quick easy use eat strong...
4,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221771164,6/26/2020,Zero option no good at all,"I got the original, the green one the blue min...",GOT,1,Hialeah,https://www.listerine.com/mouthwash/anticavity...,"i got the original, the green one the blue min...",i got the original the green one the blue min...,got original green one blue mint zero one good...,<filter object at 0x000001D94AFF7248>,"[get, original, green, one, blue, mint, zero, ...",get original green one blue mint zero one good...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8586,TOTAL CARE Mouthwash,LISTERINEÂ® TOTAL CARE Mouthwash Collection,194955169,1/4/2018,New taste,I am so disappointed in the change to the form...,DISAPPOINTED,1,Columbia,https://www.listerine.com/mouthwash/anticavity...,i am so disappointed in the change to the form...,i am so disappointed in the change to the form...,disappointed change formula used love flavor p...,<filter object at 0x000001D94C73F388>,"[disappoint, change, formula, use, love, flavo...",disappoint change formula use love flavor powe...
8587,Flosser with Ergonomic Handle,LISTERINEÂ® Floss Products,194920964,2/1/2018,NEED MORE COLORS,"I use this every day. I have one in the car, a...",EVERY,4,Houston,https://www.listerine.com/toothpaste-floss/lis...,"i use this every day. i have one in the car, a...",i use this every day i have one in the car a...,use every day one car desk work course home st...,<filter object at 0x000001D94C73FB88>,"[use, every, day, one, car, desk, work, course...",use every day one car desk work course home st...
8588,Oral Care Breath Strips,LISTERINEÂ® On-The-Go Oral Care Products,194903921,1/29/2018,Helps keep my breath fresh,"I love caring the stripes with me , for the fa...",LOVE,5,Corpus Christi,https://www.listerine.com/products/listerine-g...,"i love caring the stripes with me , for the fa...",i love caring the stripes with me for the fa...,love caring stripes fact around people convers...,<filter object at 0x000001D94C740208>,"[love, care, strip, fact, around, people, conv...",love care strip fact around people conversate ...
8589,Oral Care Breath Strips,LISTERINEÂ® On-The-Go Oral Care Products,194917345,1/3/2018,Miss the cinnamon!,This is a great product but wish they had not ...,GREAT,4,,https://www.listerine.com/products/listerine-g...,this is a great product but wish they had not ...,this is a great product but wish they had not ...,great product wish discontinued flavors like c...,<filter object at 0x000001D94C740488>,"[great, product, wish, discontinue, flavor, li...",great product wish discontinue flavor like cin...
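
The `tokenized` column above displays as `<filter object ...>` because `filter()` returns a lazy iterator, which can be consumed only once; the first lemmatization pass uses it up. Storing a list avoids that. A sketch on a plain string (in the notebook, the same expression would go inside the `apply`):

```python
text = "buy product  absolutely love"  # note the double space

lazy_tokens = filter(None, text.split(" "))
first_pass = list(lazy_tokens)   # consumes the iterator
second_pass = list(lazy_tokens)  # nothing left to read

# str.split() with no argument already drops empty strings and returns a list.
tokens = text.split()

print(first_pass, second_pass, tokens)
```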


## Clean up to match Brandwatch

In [59]:
# create columns to match Brandwatch
bazar['Brand'] = "Listerine"
bazar['Sentiment'] = bazar.Rating.apply(lambda x: 1 if x >= 3 else 0) # binarize review, >=3 is pos/"1", and <3 is neg/"0"
bazar['page_type'] = "listerine.com"
bazar.head(2)



Unnamed: 0,Product,Product_Collection,ReviewID,Date,Review_Title,Review,WordCloud,Rating,Location,Product_website,full_text,no_char,no_stop_words,tokenized,lemma_v1,lemma,Brand,Sentiment,page_type
0,"Kids Anticavity Fluoride Mouthwash, Berry Spla...",LISTERINEÂ® SMART RINSEÂ® Kids Mouthwash Colle...,221839752,6/27/2020,Great product,We bought this product and absolutely love it!...,BOUGHT,5,USA,https://www.listerine.com/mouthwash/for-kids/l...,we bought this product and absolutely love it!...,we bought this product and absolutely love it ...,bought product absolutely love burn leave mout...,<filter object at 0x000001D94A689A48>,"[buy, product, absolutely, love, burn, leave, ...",buy product absolutely love burn leave mouth t...,Listerine,1,listerine.com
1,LISTERINE TOTAL CARE ZERO FRESH MINT ANTICAVIT...,LISTERINEÂ® ZERO alcohol-free Mouthwash Collec...,221832493,6/27/2020,Oral Tissue Sloughing-- A Serious Concern,I figured I would weigh in here regarding the ...,FIGURED,1,Philadelphia,https://www.listerine.com/mouthwash/anticavity...,i figured i would weigh in here regarding the ...,i figured i would weigh in here regarding the ...,figured would weigh regarding complaints tissu...,<filter object at 0x000001D94B010D08>,"[figure, would, weigh, regard, complaint, tiss...",figure would weigh regard complaint tissue slo...,Listerine,0,listerine.com
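
The row-wise `apply` can also be written as a vectorized comparison. A sketch on a small made-up frame (the ratings are invented for illustration); it keeps the notebook's rule that ratings of 3 and above count as positive.

```python
import pandas as pd

ratings = pd.DataFrame({"Rating": [5, 1, 3, 2, 4]})  # invented sample ratings

# The comparison yields True/False, which astype(int) turns into 1/0.
ratings["Sentiment"] = (ratings["Rating"] >= 3).astype(int)
print(ratings)
```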


In [78]:
# Clean Product Collection names
new = bazar["Product_Collection"].str.split("® ", n=1, expand = True)
new[1].value_counts()

On-The-Go Oral Care Products                             3727
SENSITIVITY                                              1069
NIGHTLY RESET                                             897
Floss Products                                            749
ZERO alcohol-free Mouthwash Collection                    712
Antiseptic Mouthwash Collection                           626
ULTRACLEANÂ® Tartar Control Mouthwash Collection          343
SMART RINSEÂ® Kids Mouthwash Collection                   246
TOTAL CARE Mouthwash Collection                           120
Fluoride Toothpaste Collection                             40
NATURALS Mouthwash Collection                              34
HEALTHY WHITEâ?¢ Teeth Whitening Mouthwash Collection      26
FLUORIDE DEFENSEâ?¢                                         2
Name: 1, dtype: int64

In [79]:
# Clean Product Collection names
new = bazar["Product_Collection"].str.split("® ", n=1, expand = True)
new[1] = new[1].str.replace('SMART RINSEÂ® Kids Mouthwash Collection', 'Kids Mouthwash Collection')
new[1] = new[1].str.replace('ULTRACLEANÂ® Tartar Control Mouthwash Collection', 'Ultraclean Tartar Control Mouthwash')
new[1] = new[1].str.replace('HEALTHY WHITEâ?¢ Teeth Whitening Mouthwash Collection', 'Teeth Whitening Mouthwash')
new[1] = new[1].str.replace('FLUORIDE DEFENSEâ?¢', 'fluoride defense')

new[1].value_counts()


On-The-Go Oral Care Products                             3727
SENSITIVITY                                              1069
NIGHTLY RESET                                             897
Floss Products                                            749
ZERO alcohol-free Mouthwash Collection                    712
Antiseptic Mouthwash Collection                           626
Ultraclean Tartar Control Mouthwash                       343
Kids Mouthwash Collection                                 246
TOTAL CARE Mouthwash Collection                           120
Fluoride Toothpaste Collection                             40
NATURALS Mouthwash Collection                              34
HEALTHY WHITEâ?¢ Teeth Whitening Mouthwash Collection      26
FLUORIDE DEFENSEâ?¢                                         2
Name: 1, dtype: int64
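
The last two labels are unchanged in the output above. A likely cause, assuming the pandas version in use treats the `str.replace` pattern as a regular expression by default, is the `?` in `â?¢`: as a regex it makes the preceding `â` optional instead of matching a literal question mark. The sketch below shows the difference with the standard `re` module on one label.

```python
import re

label = "HEALTHY WHITEâ?¢ Teeth Whitening Mouthwash Collection"
pattern = "HEALTHY WHITEâ?¢ Teeth Whitening Mouthwash Collection"

# As a regex, "â?" means "optional â", so the literal "?" in the label is never matched.
as_regex = re.sub(pattern, "Teeth Whitening Mouthwash", label)

# Escaping the pattern (or using a plain string replace) matches the text literally.
escaped = re.sub(re.escape(pattern), "Teeth Whitening Mouthwash", label)
literal = label.replace(pattern, "Teeth Whitening Mouthwash")

print(as_regex)
print(escaped)
print(literal)
```

In pandas, passing `regex=False` to `Series.str.replace` asks for the same literal matching.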

In [80]:
# bring into df and lower
bazar["Product_Collection_New"] = new[1]
bazar["Product_category"] = bazar["Product_Collection_New"].str.lower()

In [81]:
# only retain the columns we need (.copy() gives an independent frame for the later edits)
bazar_clean = bazar.iloc[:,[3,5,15,16,17,18,20]].copy()
bazar_clean.head(2)


Unnamed: 0,Date,Review,lemma,Brand,Sentiment,page_type,Product_category
0,6/27/2020,We bought this product and absolutely love it!...,buy product absolutely love burn leave mouth t...,Listerine,1,listerine.com,kids mouthwash collection
1,6/27/2020,I figured I would weigh in here regarding the ...,figure would weigh regard complaint tissue slo...,Listerine,0,listerine.com,zero alcohol-free mouthwash collection
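
Positional indices such as `[3,5,15,...]` silently select the wrong columns if an earlier cell adds or drops a column. Selecting by name is a sturdier alternative. A sketch on a small stand-in frame (the rows are invented; the column names follow the notebook):

```python
import pandas as pd

# Stand-in for `bazar` with a few of its columns; the values are invented.
bazar_demo = pd.DataFrame({
    "ReviewID": [1, 2],
    "Date": ["6/27/2020", "6/26/2020"],
    "Review": ["Great product", "Too strong"],
    "lemma": ["great product", "strong"],
    "Brand": ["Listerine", "Listerine"],
})

keep = ["Date", "Review", "lemma", "Brand"]
# .copy() makes the result an independent frame rather than a view-like slice.
bazar_demo_clean = bazar_demo[keep].copy()
print(bazar_demo_clean.columns.tolist())
```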


In [82]:

# change column names to match Brandwatch (only three columns actually change)
column_mapping = {'Review': 'Review_original', 'lemma': 'Review_clean', 'page_type': 'Source'}
bazar_clean = bazar_clean.rename(columns=column_mapping)
bazar_clean.head(2)

Unnamed: 0,Date,Review_original,Review_clean,Brand,Sentiment,Source,Product_category
0,6/27/2020,We bought this product and absolutely love it!...,buy product absolutely love burn leave mouth t...,Listerine,1,listerine.com,kids mouthwash collection
1,6/27/2020,I figured I would weigh in here regarding the ...,figure would weigh regard complaint tissue slo...,Listerine,0,listerine.com,zero alcohol-free mouthwash collection


In [83]:
# list the current columns before reordering
cols = list(bazar_clean.columns.values)
cols

['Date',
 'Review_original',
 'Review_clean',
 'Brand',
 'Sentiment',
 'Source',
 'Product_category']

In [84]:
# reorder columns
bazar_clean = bazar_clean[['Date','Brand', 'Product_category', 'Source', 'Review_original', 'Review_clean', 'Sentiment']]
bazar_clean.head(2)

Unnamed: 0,Date,Brand,Product_category,Source,Review_original,Review_clean,Sentiment
0,6/27/2020,Listerine,kids mouthwash collection,listerine.com,We bought this product and absolutely love it!...,buy product absolutely love burn leave mouth t...,1
1,6/27/2020,Listerine,zero alcohol-free mouthwash collection,listerine.com,I figured I would weigh in here regarding the ...,figure would weigh regard complaint tissue slo...,0


In [86]:
# export to csv
bazar_clean.to_csv('Bazar_Clean.csv')
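
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If the index carries no meaning, `index=False` leaves it out. A sketch using an in-memory buffer in place of a file (the sample rows are invented):

```python
import io
import pandas as pd

sample = pd.DataFrame({"Brand": ["Listerine", "Listerine"], "Sentiment": [1, 0]})

with_index = io.StringIO()
sample.to_csv(with_index)                  # default: the index is written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)  # the index is left out

# Rewind the buffers and read them back to compare the resulting columns.
with_index.seek(0)
without_index.seek(0)
print(pd.read_csv(with_index).columns.tolist())
print(pd.read_csv(without_index).columns.tolist())
```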