<h1><center>Data Wrangling - Reviews File</center></h1>

In [1]:
import pandas as pd
import numpy as np

Read the product reviews file into a dataframe

In [2]:
reviews = pd.read_csv('reviews_beauty.csv',index_col=0)
reviews.head(2)


Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A39HTATAQ9V7YF,205616461,cheryl roberts,"[0, 0]",i do love this moisturizer and would recommend...,5.0,bio-active anti-aging serum,1369699200,"05 28, 2013"
1,A3JM6GV9MNOF9X,558925278,Patty,"[0, 1]",I received this product before the deadline.I ...,3.0,"This product is ok, I'm use Baby Kabuki in moment",1355443200,"12 14, 2012"


In [3]:
reviews.info(show_counts=True)  # show_counts replaces null_counts, which was removed in pandas 2.0

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2023070 entries, 0 to 2023069
Data columns (total 9 columns):
reviewerID        2023070 non-null object
asin              2023070 non-null object
reviewerName      2010822 non-null object
helpful           2023070 non-null object
reviewText        2022815 non-null object
overall           2023070 non-null float64
summary           2023056 non-null object
unixReviewTime    2023070 non-null int64
reviewTime        2023070 non-null object
dtypes: float64(1), int64(1), object(7)
memory usage: 154.3+ MB


In [4]:
print(reviews.shape)

print(reviews.dtypes)

(2023070, 9)
reviewerID         object
asin               object
reviewerName       object
helpful            object
reviewText         object
overall           float64
summary            object
unixReviewTime      int64
reviewTime         object
dtype: object


In [5]:
reviews.isnull().sum()

reviewerID            0
asin                  0
reviewerName      12248
helpful               0
reviewText          255
overall               0
summary              14
unixReviewTime        0
reviewTime            0
dtype: int64

12,248 observations are missing a reviewerName value. Since reviewerID already identifies each reviewer, the name adds little. The review time columns are not needed for this analysis either.

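Merge the summary and reviewText columns into a single review column; the individual columns are dropped in a later cell. Fill missing values with an empty string before merging, so that a missing summary or reviewText does not turn the whole merged value into NaN.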

In [6]:
reviews['review'] = reviews['summary'].fillna('') + ' ' + reviews['reviewText'].fillna('')
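
To illustrate why the fillna('') calls matter, here is a minimal sketch on a made-up three-row frame (the toy data and the names toy, naive and merged are hypothetical, not part of the reviews file). It also shows an optional .str.strip() that removes the stray separator left when one side is empty; the cell above does not do this.

```python
import pandas as pd

# Toy frame (hypothetical data): one missing summary, one missing reviewText.
toy = pd.DataFrame({
    "summary": ["Great", None, "Okay"],
    "reviewText": ["Love it", "Works fine", None],
})

# Without fillna, any row with a missing part becomes NaN after concatenation.
naive = toy["summary"] + " " + toy["reviewText"]

# With fillna(''), every row yields a string; .str.strip() drops the
# leading or trailing space left behind when one side was empty.
merged = (toy["summary"].fillna("") + " " + toy["reviewText"].fillna("")).str.strip()

print(naive.isna().sum())
print(merged.tolist())
```

Because every row ends up as a string, a later null check on the merged column is expected to come back empty.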

In [12]:
##beauty.groupby(['reviewerID','reviewerName']).count()

df = reviews.groupby('reviewerID')['reviewerName'].nunique()
df.sort_values(ascending=False).head(5)

reviewerID
A2DG63DN704LOI    3
A3X6BLPGK2ANW     2
A2E8GMHH04T9JI    2
A1RTSVWEXMKAR1    2
A10YO33BWWWMFK    2
Name: reviewerName, dtype: int64

Some reviewerID values have more than one reviewerName associated with them. Since reviewerID already identifies the reviewer, the name is not needed, and the review time columns are not used either, so these columns are dropped further down. First, confirm that the merged review column has no missing values.

In [13]:
print(reviews[reviews['review'].isnull()])

Empty DataFrame
Columns: [reviewerID, asin, reviewerName, helpful, reviewText, overall, summary, unixReviewTime, reviewTime, review]
Index: []


In [14]:
reviews.overall.unique()

array([5., 3., 4., 1., 2.])

Convert the overall column from float to int

In [15]:
reviews['overall'] = reviews['overall'].astype(int)  # overall has no missing values (see the null counts above), so a direct cast works

Split the helpful column into upvotes and downvotes

In [16]:
reviews['votes'] = reviews['helpful'].str.strip('[]')

reviews['upvotes'] = reviews.votes.str.split(',').str[0]
reviews['downvotes'] = reviews.votes.str.split(',').str[1]

Drop the irrelevant columns from the dataframe

In [17]:
columns = ['reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'reviewTime', 'helpful', 'votes']
reviews.drop(columns, inplace=True, axis=1)

In [18]:
reviews['upvotes'].min(),reviews['upvotes'].max(),reviews['downvotes'].min(),reviews['downvotes'].max()

('0', '99', ' 0', ' 99')
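
The values above are strings (note the leading space in ' 0' and ' 99'), so min and max compare them character by character rather than numerically. A minimal sketch of one way to get integer columns instead, assuming every helpful value has the form "[a, b]"; the sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of the 'helpful' column in its "[a, b]" string form.
helpful = pd.Series(["[0, 0]", "[2, 10]", "[99, 120]", "[7, 7]"])

# Pull out both integers with one regular expression and cast them to int.
votes = helpful.str.extract(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]").astype(int)
votes.columns = ["upvotes", "downvotes"]

# The string-based approach: comparison is lexicographic, so ' 7' sorts above ' 120'.
string_max = helpful.str.strip("[]").str.split(",").str[1].max()

# The integer column gives the true maximum.
numeric_max = votes["downvotes"].max()

print(repr(string_max), numeric_max)
```

With integer columns, min, max, sums and comparisons behave as expected, and the leading-space issue disappears.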

In [19]:
reviews.head(2)

Unnamed: 0,reviewerID,asin,overall,review,upvotes,downvotes
0,A39HTATAQ9V7YF,205616461,5,bio-active anti-aging serum i do love this moi...,0,0
1,A3JM6GV9MNOF9X,558925278,3,"This product is ok, I'm use Baby Kabuki in mom...",0,1


In [20]:
import re
import nltk

In [21]:
reviews['word_count'] = reviews['review'].apply(lambda x: len(str(x).split(" ")))
reviews[['review','word_count']].head()

Unnamed: 0,review,word_count
0,bio-active anti-aging serum i do love this moi...,34
1,"This product is ok, I'm use Baby Kabuki in mom...",44
2,I love this set I love this set. Great buy for...,31
3,"Nice Moisturizer A nice moisturizer, all natur...",35
4,Fake MAC Please research the MAC Hello Kitty c...,45
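
One caveat on the count above: split(" ") splits on every single space, so consecutive or trailing spaces produce empty strings that are counted as words. A small sketch of the difference on a made-up string:

```python
# Hypothetical text with doubled and trailing spaces.
text = "Great  buy  for the price "

# Splitting on a literal space keeps the empty strings between repeated spaces.
naive_count = len(text.split(" "))

# Splitting with no argument treats any run of whitespace as one separator.
whitespace_count = len(text.split())

print(naive_count, whitespace_count)
```

If exact word counts matter, split() with no argument is usually the safer choice.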


Transform review text to lowercase to avoid having multiple versions of the same word

In [22]:
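reviews_new = reviews.dropna(subset=['review']).copy()  # .copy() gives an independent frame and avoids a possible SettingWithCopyWarning on the assignments below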
reviews_new.shape

(2023070, 7)

In [19]:
reviews_new['review'] = reviews_new['review'].apply(lambda text: " ".join(text.lower().split()))
##reviews_new.head()

In [20]:
reviews_new.info(show_counts=True)  # show_counts replaces null_counts, which was removed in pandas 2.0

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2023070 entries, 0 to 2023069
Data columns (total 7 columns):
reviewerID    2023070 non-null object
asin          2023070 non-null object
overall       2023070 non-null int64
review        2023070 non-null object
upvotes       2023070 non-null object
downvotes     2023070 non-null object
word_count    2023070 non-null int64
dtypes: int64(2), object(5)
memory usage: 123.5+ MB


Remove punctuation, since it adds little value when processing the text

In [21]:
reviews_new['review'] = reviews_new['review'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True is required in pandas 2.0+, where the default is a literal match
reviews_new['review'].head()

0    bioactive antiaging serum i do love this moist...
1    this product is ok im use baby kabuki in momen...
2    i love this set i love this set great buy for ...
3    nice moisturizer a nice moisturizer all natura...
4    fake mac please research the mac hello kitty c...
Name: review, dtype: object
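
Deleting punctuation outright also glues together words that were separated only by punctuation, such as "deadline.I" or "bio-active". The sketch below contrasts that with replacing punctuation by a space; this is an alternative shown for comparison, not what the notebook does, and the sample sentence is assembled for illustration.

```python
import re

text = "I received this product before the deadline.I love the bio-active serum!"

# Deleting punctuation, as in the cell above: neighbouring words are fused.
deleted = re.sub(r"[^\w\s]", "", text.lower())

# Replacing punctuation with a space, then collapsing runs of whitespace.
spaced = " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())

print(deleted)
print(spaced)
```

Which variant is preferable depends on the downstream task: the space-based version keeps "deadline" and "i" as separate tokens, but it also splits hyphenated terms into two words.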

Remove the stopwords

In [22]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes the membership test below much faster than a list
reviews_new['review'] = reviews_new['review'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))
reviews_new['review'].head()

0    bioactive antiaging serum love moisturizer wou...
1    product ok im use baby kabuki moment received ...
2    love set love set great buy price dont wear ma...
3    nice moisturizer nice moisturizer natural ingr...
4    fake mac please research mac hello kitty colle...
Name: review, dtype: object
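
The output above still contains "dont". A likely reason is that punctuation was removed first, so a contraction no longer matches a stopword entry that keeps its apostrophe. A minimal sketch of the filtering step, using a small hand-written stopword set as a hypothetical stand-in for the NLTK list (remove_stopwords is an illustrative helper, not part of the notebook):

```python
# Hypothetical, hand-written subset standing in for stopwords.words('english').
stop_set = {"i", "do", "this", "and", "would", "don't", "is", "the"}

def remove_stopwords(text, stop):
    """Return text without the whitespace-separated tokens found in `stop`."""
    return " ".join(word for word in text.split() if word not in stop)

# "dont" is kept because only the apostrophe form "don't" is in the set.
cleaned = remove_stopwords("i do love this moisturizer and dont mind the price", stop_set)

print(cleaned)
```

If such contractions should be dropped as well, one option is to strip the apostrophes from the stopword entries before filtering, or to filter stopwords before removing punctuation.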

Remove the 10 most common words (this step is left commented out and was not run)

In [None]:
##freq = pd.Series(' '.join(reviews_new['review']).split()).value_counts()[:10]
##freq

In [None]:
##freq = list(freq.index)
##reviews_new['review'] = reviews_new['review'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
##reviews_new['review'].head()

Remove the 50 rarest words

In [23]:
freq = pd.Series(' '.join(reviews_new['review']).split()).value_counts()[-50:]
freq

nothrills                    1
34spiruilna34                1
tubeparticles                1
crustylikeat                 1
wwnt                         1
unabsorbent                  1
colorinexpensiveeasy         1
1175l                        1
skinconswhen                 1
firmergranted                1
mafrketplac                  1
wrinkleseh                   1
backingstands                1
soothesour                   1
belowactually                1
productslesson               1
buckscan                     1
skintom                      1
excitedapplication           1
beginner5                    1
reesewitherspooncirca2001    1
historymessage               1
gotomy                       1
amazonjill                   1
ponky                        1
deviceugh                    1
scentsmake                   1
ecolovely                    1
skindissolves                1
differencr                   1
presentthey                  1
legshavingfool               1
holeon  

In [24]:
freq = set(freq.index)  # set for fast membership tests
reviews_new['review'] = reviews_new['review'].apply(lambda text: " ".join(word for word in text.split() if word not in freq))
reviews_new['review'].head()

0    bioactive antiaging serum love moisturizer wou...
1    product ok im use baby kabuki moment received ...
2    love set love set great buy price dont wear ma...
3    nice moisturizer nice moisturizer natural ingr...
4    fake mac please research mac hello kitty colle...
Name: review, dtype: object
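
In the listing above, every visible entry among the last 50 of value_counts() occurs exactly once, so removing just those 50 leaves any other one-off words in place. A sketch of a threshold-based alternative using collections.Counter; the documents and the min_count value are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned reviews.
docs = [
    "love set love set great buy",
    "great buy mafrketplac",
    "love great price",
]

# Count every token across all documents.
counts = Counter(word for doc in docs for word in doc.split())

# Treat every word seen fewer than min_count times as rare (assumed threshold).
min_count = 2
rare = {word for word, count in counts.items() if count < min_count}

# Drop the rare words from each document.
filtered = [" ".join(w for w in doc.split() if w not in rare) for doc in docs]

print(sorted(rare))
print(filtered)
```

A frequency threshold removes all words below the cutoff in one pass, at the cost of having to choose the cutoff.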

In [23]:
from textblob import TextBlob

In [24]:
reviews_new['review'][:10].apply(lambda x: str(TextBlob(x).correct()))

0    big-active anti-raging serum i do love this mo...
1    His product is ok, I'm use Baby Kabuki in mome...
2    I love this set I love this set. Great buy for...
3    Vice Moisturizer A nice moisturizer, all natur...
4    Take MAC Please research the MAC Hello Witty c...
5    Mute girl compact mirror Single sided mirror. ...
6    I'd say one of the best lip pencil I've tried....
7    real product, same i bought in a mac store MAC...
8    Benefit Automatic Eyeliner Men His is by far t...
9    I really like this stuff. Dark circles under t...
Name: review, dtype: object

In [28]:
sample_reviews = reviews_new[['overall', 'review']].sample(10000)
def detect_polarity(text):
    return TextBlob(text).sentiment.polarity
sample_reviews['polarity'] = sample_reviews.review.apply(detect_polarity)
sample_reviews.head()

Unnamed: 0,overall,review,polarity
32019,4,good good reason 5 sorta oily put takes little...,0.404167
1965600,5,skin barrier used another bran many years ship...,0.261111
381639,5,love stuff got free sample styling cream love ...,0.28287
610983,5,far good really impressed exfoliator leaves sk...,0.358712
643924,2,good product poor delivery product works wonde...,0.186667


In [30]:
print(reviews.iloc[381639,3])

I love this stuff I got a free sample of the styling cream and I LOVE it. I've been shopping around for it and Amazon has the best prices.My hair is so soft and smooth. I apply it when mey hair is wet and give it a little blow dry.


In [31]:
reviews_new['polarity'] = reviews_new.review.apply(detect_polarity)
reviews_new.head()

Unnamed: 0,reviewerID,asin,overall,review,upvotes,downvotes,word_count,polarity
0,A39HTATAQ9V7YF,205616461,5,bioactive antiaging serum love moisturizer wou...,0,0,34,0.283333
1,A3JM6GV9MNOF9X,558925278,3,product ok im use baby kabuki moment received ...,0,1,44,0.52
2,A1Z513UWSAAO0F,558925278,5,love set love set great buy price dont wear ma...,0,0,31,0.575
3,A1WMRR494NWEWV,733001998,4,nice moisturizer nice moisturizer natural ingr...,0,0,35,0.375
4,A3IAAVS479H7M7,737104473,1,fake mac please research mac hello kitty colle...,2,2,45,-0.125


In [32]:
reviews_new.to_csv('cleaned_reviews.csv',index=False)