<h1><center>Data Wrangling - Reviews File</center></h1>

In [1]:
import pandas as pd
import numpy as np

Read the product reviews file into a dataframe

In [2]:
reviews = pd.read_csv('reviews_beauty.csv')
reviews.head(2)

Unnamed: 0.1,Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,0,A39HTATAQ9V7YF,205616461,cheryl roberts,"[0, 0]",i do love this moisturizer and would recommend...,5.0,bio-active anti-aging serum,1369699200,"05 28, 2013"
1,1,A3JM6GV9MNOF9X,558925278,Patty,"[0, 1]",I received this product before the deadline.I ...,3.0,"This product is ok, I'm use Baby Kabuki in moment",1355443200,"12 14, 2012"


In [3]:
reviews.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2023070 entries, 0 to 2023069
Data columns (total 10 columns):
Unnamed: 0        2023070 non-null int64
reviewerID        2023070 non-null object
asin              2023070 non-null object
reviewerName      2010822 non-null object
helpful           2023070 non-null object
reviewText        2022815 non-null object
overall           2023070 non-null float64
summary           2023056 non-null object
unixReviewTime    2023070 non-null int64
reviewTime        2023070 non-null object
dtypes: float64(1), int64(2), object(7)
memory usage: 154.3+ MB


In [4]:
print(reviews.shape)

print(reviews.dtypes)

(2023070, 10)
Unnamed: 0          int64
reviewerID         object
asin               object
reviewerName       object
helpful            object
reviewText         object
overall           float64
summary            object
unixReviewTime      int64
reviewTime         object
dtype: object


In [5]:
reviews.isnull().sum()

Unnamed: 0            0
reviewerID            0
asin                  0
reviewerName      12248
helpful               0
reviewText          255
overall               0
summary              14
unixReviewTime        0
reviewTime            0
dtype: int64

The reviewerName column is missing for 12,248 observations. Since reviewerID already identifies each reviewer, the name adds little and can be dropped. unixReviewTime carries the same date as reviewTime, so it can be dropped as well.

Merge the summary and reviewText columns into a single review column; the individual columns are dropped later. Missing values are replaced with an empty string before the two columns are joined with a space.

In [6]:
reviews['review'] = reviews['summary'].fillna('') + ' ' + reviews['reviewText'].fillna('')
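A minimal sketch of the same merge on a toy frame (the rows are made up for illustration). It also strips the stray space that is left over when one of the two columns is missing, which the cell above does not do.

```python
import pandas as pd

# Toy frame standing in for the reviews data (values are invented).
toy = pd.DataFrame({
    "summary": ["Great serum", None, "Too oily"],
    "reviewText": ["I love it.", "Arrived late.", None],
})

# Replace missing values with empty strings so the concatenation never yields NaN.
toy["review"] = toy["summary"].fillna("") + " " + toy["reviewText"].fillna("")

# Optional extra step: remove the leading/trailing space left by an empty side.
toy["review"] = toy["review"].str.strip()

print(toy["review"].tolist())
```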

Convert "reviewTime" to datetime format.

In [7]:
reviews['reviewTime'] = pd.to_datetime(reviews['reviewTime'])
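The raw values look like "05 28, 2013", so the conversion can also be written with an explicit format string. This is only a sketch on two sample values taken from the head of the file; an explicit format avoids per-value format guessing and may be faster on a large column.

```python
import pandas as pd

# Two raw reviewTime strings in the "MM DD, YYYY" layout seen in the data.
raw_dates = pd.Series(["05 28, 2013", "12 14, 2012"])

# An explicit format tells pandas exactly how to read each string.
parsed = pd.to_datetime(raw_dates, format="%m %d, %Y")

print(parsed)
```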

In [8]:
reviews.head(2)

Unnamed: 0.1,Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,review
0,0,A39HTATAQ9V7YF,205616461,cheryl roberts,"[0, 0]",i do love this moisturizer and would recommend...,5.0,bio-active anti-aging serum,1369699200,2013-05-28,bio-active anti-aging serum i do love this moi...
1,1,A3JM6GV9MNOF9X,558925278,Patty,"[0, 1]",I received this product before the deadline.I ...,3.0,"This product is ok, I'm use Baby Kabuki in moment",1355443200,2012-12-14,"This product is ok, I'm use Baby Kabuki in mom..."


In [9]:
df = reviews.groupby('reviewerID')['reviewerName'].nunique()
df.sort_values(ascending=False).head(5)

reviewerID
A2DG63DN704LOI    3
A3X6BLPGK2ANW     2
A2E8GMHH04T9JI    2
A1RTSVWEXMKAR1    2
A10YO33BWWWMFK    2
Name: reviewerName, dtype: int64

Some reviewerID values are associated with more than one reviewerName, which is another reason to rely on the ID rather than the name. The irrelevant columns are dropped further below.
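A small sketch of the same check on toy data (IDs and names are invented). It counts distinct names per ID and keeps only the IDs with more than one name; note that nunique ignores missing names.

```python
import pandas as pd

# Toy data: reviewer A1 appears under two different names.
toy = pd.DataFrame({
    "reviewerID": ["A1", "A1", "A2", "A2", "A3"],
    "reviewerName": ["Pat", "Patty", "Sam", "Sam", None],
})

# Number of distinct (non-missing) names seen for each reviewer ID.
names_per_id = toy.groupby("reviewerID")["reviewerName"].nunique()

# IDs that map to more than one name.
multi_name_ids = names_per_id[names_per_id > 1]

print(multi_name_ids)
```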

In [10]:
print(reviews[reviews['review'].isnull()])

Empty DataFrame
Columns: [Unnamed: 0, reviewerID, asin, reviewerName, helpful, reviewText, overall, summary, unixReviewTime, reviewTime, review]
Index: []


In [11]:
reviews.overall.unique()

array([5., 3., 4., 1., 2.])

Convert the overall column from float to int

In [12]:
reviews['overall'] = reviews['overall'].astype(int)  # no missing ratings, so a direct cast is safe

Split the helpful column into upvotes and downvotes

In [13]:
reviews['votes'] = reviews['helpful'].str.strip('[]')

reviews['upvotes'] = reviews.votes.str.split(',').str[0]
reviews['downvotes'] = reviews.votes.str.split(',').str[1]

Drop the irrelevant columns from the dataframe

In [14]:
columns = ['Unnamed: 0', 'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'helpful', 'votes']
reviews.drop(columns, inplace=True, axis=1)

In [15]:
reviews['upvotes'].min(),reviews['upvotes'].max(),reviews['downvotes'].min(),reviews['downvotes'].max()

('0', '99', ' 0', ' 99')
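The values above are strings (the second column even keeps a leading space), so min and max are compared character by character rather than numerically. One possible way to get integer columns is sketched below on toy data; the regular expression and the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy "helpful" values in the same "[a, b]" text layout as the data.
helpful = pd.Series(["[9, 9]", "[120, 150]", "[0, 1]"])

# String route, as in the cells above: the values stay text.
as_text = helpful.str.strip("[]").str.split(",").str[0]
text_max = as_text.max()  # lexicographic: "9" sorts after "120"

# Numeric route: capture both numbers and cast them to integers.
votes = helpful.str.extract(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]").astype(int)
votes.columns = ["upvotes", "downvotes"]
numeric_max = votes["upvotes"].max()

print(text_max, numeric_max)
```

With integer columns, min and max reflect the actual vote counts.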

In [16]:
reviews.head(2)

Unnamed: 0,reviewerID,asin,overall,reviewTime,review,upvotes,downvotes
0,A39HTATAQ9V7YF,205616461,5,2013-05-28,bio-active anti-aging serum i do love this moi...,0,0
1,A3JM6GV9MNOF9X,558925278,3,2012-12-14,"This product is ok, I'm use Baby Kabuki in mom...",0,1


In [17]:
import re
import nltk

In [18]:
reviews['word_count'] = reviews['review'].apply(lambda x: len(str(x).split(" ")))
reviews[['review','word_count']].head()

Unnamed: 0,review,word_count
0,bio-active anti-aging serum i do love this moi...,34
1,"This product is ok, I'm use Baby Kabuki in mom...",44
2,I love this set I love this set. Great buy for...,31
3,"Nice Moisturizer A nice moisturizer, all natur...",35
4,Fake MAC Please research the MAC Hello Kitty c...,45
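Splitting on a single space counts empty strings whenever a review contains consecutive or trailing spaces. The sketch below, on made-up strings, compares that approach with splitting on any run of whitespace.

```python
import pandas as pd

# Toy reviews: one with a double space, one with a trailing space.
toy = pd.Series(["love  this set", "great buy "])

# split(" ") keeps empty strings between consecutive spaces.
count_single_space = toy.apply(lambda text: len(text.split(" ")))

# split() with no argument splits on any whitespace run and drops empties.
count_whitespace = toy.str.split().str.len()

print(count_single_space.tolist(), count_whitespace.tolist())
```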


Drop any rows without a review, then convert the review text to lowercase so that the same word is not counted under different capitalizations.

In [19]:
reviews_new = reviews.dropna(subset=['review']).copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
reviews_new.shape

(2023070, 8)

In [20]:
reviews_new['review'] = reviews_new['review'].apply(lambda text: " ".join(word.lower() for word in text.split()))
##reviews_new.head()

In [21]:
reviews_new.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2023070 entries, 0 to 2023069
Data columns (total 8 columns):
reviewerID    2023070 non-null object
asin          2023070 non-null object
overall       2023070 non-null int64
reviewTime    2023070 non-null datetime64[ns]
review        2023070 non-null object
upvotes       2023070 non-null object
downvotes     2023070 non-null object
word_count    2023070 non-null int64
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 138.9+ MB


Remove punctuation, since it adds little information for this kind of text processing.

In [22]:
reviews_new['review'] = reviews_new['review'].str.replace(r'[^\w\s]', '', regex=True)
reviews_new['review'].head()

0    bioactive antiaging serum i do love this moist...
1    this product is ok im use baby kabuki in momen...
2    i love this set i love this set great buy for ...
3    nice moisturizer a nice moisturizer all natura...
4    fake mac please research the mac hello kitty c...
Name: review, dtype: object
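A short sketch of the punctuation removal on toy strings. The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace; passing `regex=True` makes sure the pattern is treated as a regular expression rather than as literal text.

```python
import pandas as pd

# Toy reviews with hyphens, an apostrophe, and end punctuation.
toy = pd.Series(["bio-active anti-aging serum!", "ok, i'm fine."])

# Treated as a regex: every non-word, non-space character is removed.
no_punct = toy.str.replace(r"[^\w\s]", "", regex=True)

# Treated as a literal string: nothing matches, so the text is unchanged.
unchanged = toy.str.replace(r"[^\w\s]", "", regex=False)

print(no_punct.tolist())
```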

Remove the stopwords

In [23]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes each membership check much faster than a list
reviews_new['review'] = reviews_new['review'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))
reviews_new['review'].head()

0    bioactive antiaging serum love moisturizer wou...
1    product ok im use baby kabuki moment received ...
2    love set love set great buy price dont wear ma...
3    nice moisturizer nice moisturizer natural ingr...
4    fake mac please research mac hello kitty colle...
Name: review, dtype: object
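The filtering step can be sketched without NLTK by using a tiny hand-written stopword set. The set below is hypothetical and far smaller than NLTK's English list; it only illustrates the mechanics.

```python
import pandas as pd

# Hypothetical mini stopword set (the real list comes from nltk.corpus.stopwords).
stop_words = {"i", "do", "this", "and", "is", "the"}

toy = pd.Series(["i do love this moisturizer", "this product is ok"])

def remove_stopwords(text):
    """Keep only the words that are not in the stopword set."""
    return " ".join(word for word in text.split() if word not in stop_words)

filtered = toy.apply(remove_stopwords)

print(filtered.tolist())
```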

Remove the 10 most common words (this step was skipped; the cells below are left commented out)

In [None]:
##freq = pd.Series(' '.join(reviews_new['review']).split()).value_counts()[:10]
##freq

In [None]:
##freq = list(freq.index)
##reviews_new['review'] = reviews_new['review'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
##reviews_new['review'].head()

Remove the 50 rarest words

In [24]:
freq = pd.Series(' '.join(reviews_new['review']).split()).value_counts()[-50:]
freq

it626262                  1
supplise                  1
warrantyrepairand         1
measureverify             1
wwaaaaaaaaoo              1
biodegradableconsnot      1
addictivecheck            1
162024                    1
fingernailsconsthere      1
mllightly                 1
dispenserlast             1
thicknessadds             1
34poreminimizing34        1
everythingeverythingto    1
honeyquat                 1
revitalashs               1
toneusing                 1
lighmedium                1
4syringe                  1
tornando                  1
fightfade                 1
sanitaryhygiene           1
reviewers1                1
phenoxyethnolive          1
quicklycolors             1
dosthe                    1
quicker2                  1
thingsproslight           1
2014omagazee              1
shapingsmoothing          1
receoived                 1
7tube                     1
bottlenormally            1
sexymen                   1
60dont                    1
familyman                 1

In [25]:
freq = set(freq.index)  # a set makes each membership check fast
reviews_new['review'] = reviews_new['review'].apply(lambda text: " ".join(word for word in text.split() if word not in freq))
reviews_new['review'].head()

0    bioactive antiaging serum love moisturizer wou...
1    product ok im use baby kabuki moment received ...
2    love set love set great buy price dont wear ma...
3    nice moisturizer nice moisturizer natural ingr...
4    fake mac please research mac hello kitty colle...
Name: review, dtype: object
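Every word in the tail shown above occurs exactly once, so which 50 words end up in the tail depends on how ties are ordered. A possible alternative, sketched here on toy data, is to remove every word below a frequency threshold instead. The threshold of 2 and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy reviews containing a few one-off tokens.
toy = pd.Series([
    "love set love set",
    "love supplise set",
    "great set wwaaaaaaaaoo",
])

# Count how often each word occurs across all reviews.
word_counts = pd.Series(" ".join(toy).split()).value_counts()

# Hypothetical rule: treat words seen fewer than 2 times as rare.
min_count = 2
rare_words = set(word_counts[word_counts < min_count].index)

cleaned = toy.apply(
    lambda text: " ".join(word for word in text.split() if word not in rare_words)
)

print(sorted(rare_words))
print(cleaned.tolist())
```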

In [26]:
from textblob import TextBlob

In [27]:
reviews_new['review'][:10].apply(lambda x: str(TextBlob(x).correct()))

0    inactive antiaging serum love moisturizer woul...
1    product ok in use baby kabuki moment received ...
2    love set love set great buy price dont wear ma...
3    nice moisturizer nice moisturizer natural ingr...
4    face mac please research mac hello kitty colle...
5    cut girl compact mirror single sided mirror co...
6    id say one best lip pencil give tried id say o...
7    real product bought mac store mac lip care lip...
8    benefit automatic eyeliner pen far easiest eye...
9    really like stuff dark circles eyes runs famil...
Name: review, dtype: object

In [28]:
sample_reviews = reviews_new[['overall', 'review']].sample(10000)
def detect_polarity(text):
    return TextBlob(text).sentiment.polarity
sample_reviews['polarity'] = sample_reviews.review.apply(detect_polarity)
sample_reviews.head()

Unnamed: 0,overall,review,polarity
780090,4,lather lather lather seems shaving creams work...,0.383333
949552,1,dont waste money old production perfumes fendi...,0.15
723,5,nice love scent perfume old granny scent clean...,0.346667
1615251,5,natural much better great moisturizing lasts d...,0.466667
1470465,4,decent price problem found brushes packaging c...,0.0


In [29]:
print(reviews.iloc[381639,3])

2008-08-22 00:00:00


In [30]:
reviews_new['polarity'] = reviews_new.review.apply(detect_polarity)
reviews_new.head()

Unnamed: 0,reviewerID,asin,overall,reviewTime,review,upvotes,downvotes,word_count,polarity
0,A39HTATAQ9V7YF,205616461,5,2013-05-28,bioactive antiaging serum love moisturizer wou...,0,0,34,0.283333
1,A3JM6GV9MNOF9X,558925278,3,2012-12-14,product ok im use baby kabuki moment received ...,0,1,44,0.52
2,A1Z513UWSAAO0F,558925278,5,2014-07-07,love set love set great buy price dont wear ma...,0,0,31,0.575
3,A1WMRR494NWEWV,733001998,4,2013-10-24,nice moisturizer nice moisturizer natural ingr...,0,0,35,0.375
4,A3IAAVS479H7M7,737104473,1,2010-05-19,fake mac please research mac hello kitty colle...,2,2,45,-0.125


In [31]:
reviews_new.to_csv('cleaned_reviews.csv', index=False)  # skip the index so reloading does not add an 'Unnamed: 0' column
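The input file loaded at the top of this notebook had an `Unnamed: 0` column, which is what pandas produces when a frame is saved together with its index and then read back. The sketch below reproduces that round trip in memory on a toy frame (the values are invented) and compares it with saving without the index.

```python
import io

import pandas as pd

toy = pd.DataFrame({"asin": ["205616461", "558925278"], "overall": [5, 3]})

# Save with the index (the default) and read the result back.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# Save without the index and read the result back.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```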