<h1>Table of contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#load-data" data-toc-modified-id="load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>load data</a></span></li><li><span><a href="#clean-corpus" data-toc-modified-id="clean-corpus-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>clean corpus</a></span><ul class="toc-item"><li><span><a href="#dataframe-cleaning" data-toc-modified-id="dataframe-cleaning-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>dataframe cleaning</a></span></li><li><span><a href="#text-cleaning" data-toc-modified-id="text-cleaning-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>text cleaning</a></span></li></ul></li><li><span><a href="#downsampling" data-toc-modified-id="downsampling-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>downsampling</a></span></li></ul></div>

**Dataset**:<br>
Amazon Reviews: https://nijianmo.github.io/amazon/index.html

In [1]:
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen

from utils import clean_text, random_downsampling

# load data

In [9]:
""" very time consuming

data = []
with gzip.open('../corpora/Electronics_5.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))

df = pd.DataFrame.from_dict(data)

big_corpus = df[["overall", "reviewerName", "reviewText", 
                  "summary", "verified", "vote", "reviewTime"]]
                  
big_corpus.to_csv("../corpora/amazon_reviews_electronic.csv", index=False)
"""
print("Done loading.")

Done loading.
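The disabled loader above parses every full record before selecting columns, which is part of why it is slow and memory-hungry. A minimal sketch of a leaner variant, assuming the same gzipped JSON-lines layout, keeps only the needed fields while reading. `load_reviews` and `KEEP` are illustrative names, not part of this notebook or of `utils`.

```python
import gzip
import json

import pandas as pd

# Columns the notebook keeps from each review record.
KEEP = ["overall", "reviewerName", "reviewText",
        "summary", "verified", "vote", "reviewTime"]


def load_reviews(path, columns=KEEP):
    """Read a gzipped JSON-lines file, keeping only `columns` per record.

    Hypothetical helper: it mirrors the loop above but drops unused
    fields as each line is parsed.
    """
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Keys missing from a record (e.g. "vote") become None.
            rows.append({col: record.get(col) for col in columns})
    return pd.DataFrame(rows, columns=columns)
```

Calling `load_reviews('../corpora/Electronics_5.json.gz')` would then replace the loop and the column selection in one step.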


# clean corpus

In [2]:
PATH = "../corpora/amazon_reviews_electronic.csv"
MIN_YEAR = 2018
COMBINE_SUMMARY = False

In [3]:
%%time
corpus = pd.read_csv(PATH)



CPU times: user 54.7 s, sys: 12.1 s, total: 1min 6s
Wall time: 1min 14s


In [12]:
corpus.head()

Unnamed: 0,overall,reviewerName,reviewText,summary,verified,vote,reviewTime
0,5.0,D. C. Carrad,This is the best novel I have read in 2 or 3 y...,A star is born,True,67,"09 18, 1999"
1,3.0,Evy,"Pages and pages of introspection, in the style...",A stream of consciousness novel,True,5,"10 23, 2013"
2,5.0,Kcorn,This is the kind of novel to read when you hav...,I'm a huge fan of the author and this one did ...,False,4,"09 2, 2008"
3,5.0,Caf Girl Writes,What gorgeous language! What an incredible wri...,The most beautiful book I have ever read!,False,13,"09 4, 2000"
4,3.0,W. Shane Schmidt,I was taken in by reviews that compared this b...,A dissenting view--In part.,True,8,"02 4, 2000"


## dataframe cleaning

In [13]:
%%time
corpus.vote = corpus.vote.fillna(0)
corpus = corpus.fillna("")


# rename columns

corpus = corpus.rename(columns={"overall": "rating", 
                                "reviewerName": "name",
                                "reviewText": "review",
                                "reviewTime": "date"})

# no empty ratings, reviews & dates

corpus = corpus[corpus.rating != ""]
corpus = corpus[corpus.review != ""]
corpus = corpus[corpus.date != ""]


# change date

def change_date(date):
    """Convert a raw date such as "09 2, 2008" into "02.09.2008" (DD.MM.YYYY)."""
    month_day, year = str(date).split(",")
    month, day = month_day.split(" ")

    # zero-pad single-digit days
    day = day.zfill(2)

    return f"{day}.{month}.{year.strip()}"

# only reviews since a specific date

def remove_dates(date, min_year):
    """Return the date unchanged, or "" if its year is before min_year."""
    year = int(str(date)[-4:])
    if year < min_year:
        return ""
    return date

# blank out dates before MIN_YEAR, then drop those rows
corpus.date = corpus.date.apply(lambda x: remove_dates(x, MIN_YEAR))
corpus = corpus[corpus.date != ""]

corpus.date = corpus.date.apply(change_date)


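# combine summary and review text (the summary column is dropped either way)

if COMBINE_SUMMARY:
    # skip empty or missing parts so the joined text has no stray spaces
    corpus['review'] = corpus[['summary', 'review']].apply(
        lambda x: " ".join(str(y) for y in x if str(y) not in ("", "nan")), axis=1)
del corpus["summary"]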



CPU times: user 24.3 s, sys: 6.12 s, total: 30.4 s
Wall time: 34.7 s
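The date filtering and reformatting above go row by row with `apply` on strings. As a hedged alternative, the same two steps can be sketched with `pd.to_datetime`, assuming every date follows the raw "MM D, YYYY" pattern seen in `corpus.head()`. The `demo` frame is made up for illustration.

```python
import pandas as pd

MIN_YEAR = 2018  # same threshold as in the notebook

# Toy frame with dates in the raw format "MM D, YYYY" (illustrative data).
demo = pd.DataFrame({"date": ["09 18, 1999", "01 8, 2018", "02 14, 2018"]})

# Parse once into real timestamps instead of slicing strings.
parsed = pd.to_datetime(demo["date"], format="%m %d, %Y")

# Keep only reviews from MIN_YEAR onwards, then format as DD.MM.YYYY.
recent = demo[parsed.dt.year >= MIN_YEAR].copy()
recent["date"] = parsed.loc[recent.index].dt.strftime("%d.%m.%Y")
print(recent["date"].tolist())
```

Parsing to timestamps also makes malformed dates fail loudly instead of silently producing odd strings.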


In [14]:
corpus.sample(10)

Unnamed: 0,rating,name,review,verified,vote,date
4402338,5.0,KEVIN HENDERSON,Works great.,True,0,08.01.2018
4668412,5.0,Amazon Customer,Functions as expected-- never gives me any pro...,True,0,18.02.2018
980357,5.0,Robert W Vincent,"Great mount, no problems.",True,0,14.03.2018
5644328,5.0,Birdman,Holds my Alien well and doesn't let my compute...,True,0,20.01.2018
2899963,5.0,R. Chung,"The bright yellow sockets are a plus, other po...",True,0,04.06.2018
4459477,5.0,southman,I used this to house a 500gb hard drive salvag...,True,0,31.05.2018
4434343,5.0,Lindsay,I recently started doing some music videos and...,False,0,03.01.2018
6240646,1.0,Jason Drushel,Card literally died at the exact one year mark...,True,0,14.02.2018
3826691,5.0,Mary S. Murray,Everyone compliments me on this case. Second o...,True,0,17.03.2018
6712322,3.0,Aaron &#034;Adub&#034; Woodwell,Im honestly having a hard time deciding if the...,True,0,20.09.2018


## text cleaning

In [15]:
%%time
corpus.review = corpus.review.apply(clean_text)
corpus = corpus[corpus.review != ""]

CPU times: user 1min 13s, sys: 6.99 s, total: 1min 20s
Wall time: 1min 22s
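`clean_text` is imported from the local `utils` module and its implementation is not shown here. Purely as an assumption, a cleaner of this kind might unescape HTML entities (the names above contain `&#034;`), strip tags, and collapse whitespace, returning an empty string when nothing is left, which is what the `!= ""` filter relies on. `clean_text_sketch` is a hypothetical stand-in, not the actual function.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag matcher
SPACE_RE = re.compile(r"\s+")     # any run of whitespace


def clean_text_sketch(text):
    """Hypothetical stand-in for utils.clean_text (real behavior unknown)."""
    text = html.unescape(str(text))       # turn entities into characters
    text = TAG_RE.sub(" ", text)          # replace tags with a space
    return SPACE_RE.sub(" ", text).strip()  # collapse and trim whitespace
```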


In [16]:
corpus.head()

Unnamed: 0,rating,name,review,verified,vote,date
217,5.0,Problematic1963,I made a photo album for a senior friend who w...,True,0,27.01.2018
842,5.0,Tazman32,Great addition to our new Galaxy Ss which by t...,True,0,01.04.2018
843,5.0,Brian D. Carrico,Perfect,True,0,30.03.2018
844,4.0,Cici Ciconia,As described,True,0,30.03.2018
845,5.0,AJ,Great little card made my device better,True,0,27.03.2018


In [17]:
corpus.shape

(377057, 6)

In [18]:
corpus.rating.value_counts()

5.0    253232
4.0     52737
1.0     28443
3.0     25828
2.0     16817
Name: rating, dtype: int64

# downsampling

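Reduce the corpus size by randomly downsampling each rating class to at most 15,000 reviews (`max_value`). Every class above has more than 15,000 reviews, so the result is a balanced corpus of 75,000 reviews.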

In [21]:
s_corpus = random_downsampling(corpus, class_col="rating", max_value = 15000)

In [25]:
#s_corpus.to_csv("../corpora/small_amazon_reviews_electronic.csv", index=False)

In [23]:
s_corpus.shape

(75000, 6)
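`random_downsampling` also lives in `utils` and is not shown. A minimal sketch of what such a helper might do, assuming it caps every class at `max_value` rows by sampling without replacement; the function name, the `seed` parameter, and the final shuffle are assumptions for illustration.

```python
import pandas as pd


def random_downsampling_sketch(df, class_col, max_value, seed=0):
    """Hypothetical stand-in for utils.random_downsampling.

    Caps each class in `class_col` at `max_value` randomly chosen rows;
    classes that are already smaller are kept whole.
    """
    parts = []
    for _, group in df.groupby(class_col):
        n = min(len(group), max_value)
        parts.append(group.sample(n=n, random_state=seed))
    # Shuffle so the classes are not stored in contiguous blocks.
    combined = pd.concat(parts).sample(frac=1, random_state=seed)
    return combined.reset_index(drop=True)


# Toy usage: five 5-star rows and two 1-star rows, capped at three per class.
demo = pd.DataFrame({"rating": [5.0] * 5 + [1.0] * 2,
                     "review": list("abcdefg")})
small = random_downsampling_sketch(demo, class_col="rating", max_value=3)
```

With five rating classes that each exceed the cap of 15,000, a helper like this would return the 75,000 rows reported above.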