# Analysing Amazon Reviews Part I: Reading the data
(2017-01-19 (C) Wouter van Atteveldt CC-BY-SA)

We will analyse the Amazon reviews from http://jmcauley.ucsd.edu/data/amazon/links.html. You can download any of the 5-core subsets for a product category, for example the Grocery and Gourmet data (also found on canvas).

This data is composed of review data and product (meta) data, and can be used to answer a lot of questions.
For example, what product got the most reviews? What product got the best reviews? 
What was the most controversial product, i.e. the one with the most divergent reviews?

In this handout, we will do the analysis in pure Python, i.e. without pandas or numpy. 
Some things would be easier or quicker with pandas, but working in standard Python is a good exercise for getting comfortable with the language.

Before we start with a substantive analysis, we want to get the data and read it into python data structures. 
This handout explains how to read the data into python lists and dicts and save them as json files.
The next handout (Part II: Controversial reviews?) will do a substantive analysis of the reviews.

# Review data

Let's have a look at the raw data first:

In [1]:
!head -1 reviews_Grocery_and_Gourmet_Food_5.json

{"reviewerID": "A1VEELTKS8NLZB", "asin": "616719923X", "reviewerName": "Amazon Customer", "helpful": [0, 0], "reviewText": "Just another flavor of Kit Kat but the taste is unique and a bit different.  The only thing that is bothersome is the price.  I thought it was a bit expensive....", "overall": 4.0, "summary": "Good Taste", "unixReviewTime": 1370044800, "reviewTime": "06 1, 2013"}


So by the looks of it, each line is a valid Python dictionary literal. 
We can use `eval` to evaluate each line. 
(Note that executing code from untrusted sources is not a good idea, since `eval` will run whatever code the line contains, but I'm sure McAuley is a pretty nice guy.)
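
If you would rather not execute arbitrary code, the standard library's `ast.literal_eval` is a safer alternative: it only accepts Python literals (strings, numbers, lists, dicts and so on) and raises an error for anything else. A minimal sketch, using a shortened, made-up line in the same format rather than the real file:

```python
import ast

# A shortened example line in the same format as the reviews file (illustrative only)
line = '{"asin": "616719923X", "helpful": [0, 0], "overall": 4.0}\n'

# literal_eval parses literals only; it never calls functions or runs code
review = ast.literal_eval(line.strip())
print(review["asin"], review["overall"])

# Anything that is not a plain literal is rejected instead of executed
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError):
    rejected = True
print("Rejected non-literal input:", rejected)
```

Whether this works for every line of the real files is an assumption; it should, as long as each line contains only plain literals.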

Let's start by turning the file into a list of dictionaries, one per review:

In [2]:
fn = "reviews_Grocery_and_Gourmet_Food_5.json"
reviews = []
for line in open(fn):
    review = eval(line)
    reviews.append(review)
   
print(reviews[0])

{'reviewTime': '06 1, 2013', 'helpful': [0, 0], 'overall': 4.0, 'asin': '616719923X', 'reviewerID': 'A1VEELTKS8NLZB', 'unixReviewTime': 1370044800, 'reviewerName': 'Amazon Customer', 'reviewText': 'Just another flavor of Kit Kat but the taste is unique and a bit different.  The only thing that is bothersome is the price.  I thought it was a bit expensive....', 'summary': 'Good Taste'}


In [3]:
print(len(reviews))

151254


Each product can be reviewed by multiple users. 
To figure out the number of distinct products, we can create a *set* of product ids:

In [4]:
product_ids = set()
for review in reviews:
    product_id = review['asin']
    product_ids.add(product_id)
print(len(product_ids))
print(list(product_ids)[:5])

8713
['B005Q8C2WK', 'B002GWMA6C', 'B0007R9L5Q', 'B000EDG3TU', 'B006M98T1A']


We can also create the reviews list and products set using comprehensions, which leads to much shorter code. 
I also find this code more readable, but that really depends on the complexity of processing. 

In [5]:
reviews2 = [eval(line) for line in open(fn)]
product_ids2 = {review['asin'] for review in reviews2}


# check whether results are the same:
print("Identical reviews?", reviews == reviews2)
print("Identical product IDs?", product_ids == product_ids2)

Identical reviews? True
Identical product IDs? True


# Metadata

To see which product is which, we need the product metadata. Let's have a look at the metadata file:

In [5]:
!head -1 meta_Grocery_and_Gourmet_Food.json

{'asin': '0657745316', 'description': 'This is real vanilla extract made with only 3 premium ingredients. GMO free, no fillers you find in store bought "vanilla extract." \n\nThe taste will knock your socks off. Everyone will notice a difference in your baking and cooking and they\'ll want to know your secret. I also use this for a special homemade coffee creamer that\'s out of this world and I use it for tea and black coffee as well as espresso drinks. You can add this to make a vanilla latte and skip the sugary syrup for a healthier latte with more flavor! \n\nWhen this item arrives, there will also be instructions to refill the product as its used so that you won\'t have to age it or repurchase it. I\'ve been using the same vanilla for 2 years now and have friends who\'ve had theirs for 5 years. It\'s just as tasteful, just as sweet, strong, and yummy. \n\nI use only top shelf liquor and the product is aged a minimum of 4 months. \n\nThese also make great gifts. I currently have ple

Similar to above, we start by reading the data into python.
Since each product has a unique ID (asin), it makes sense to create a python dictionary, with the asin as key. 
Note that we only need the products for which we have reviews, so we skip any metadata entry whose asin is not in `product_ids`.

In [6]:
products = {}
fn = "meta_Grocery_and_Gourmet_Food.json"
for line in open(fn):
    d = eval(line)
    if d['asin'] in product_ids:  
        products[d['asin']] = d
print(len(products))
print(products['B001EQ5O5U']['title'])

8713
Starbucks Sumatra Coffee, Whole Bean, 12-Ounce Bags (Pack of 3)


Same example with comprehensions:

In [11]:
fn = "meta_Grocery_and_Gourmet_Food.json"
lines = (eval(line) for line in open(fn))
products = {line['asin'] : line for line in lines 
          if line['asin'] in product_ids}
print(len(products))

8713


In [13]:
print(products['B001EQ5428'])

{'title': 'Starbucks Caffe Verona Coffee, Whole Bean,Dark Cocoa, Roasty Sweet,12-Ounce Bags (Pack of 3)', 'description': 'Smooth, sweet and pleasant, this is one of Starbucks perennial favourite coffees. In the United States, it&#8217;s become especially popular as a romantic gift for Valentine&#8217;s Day. First we create a full-bodied blend of Latin American and Indonesian coffees. Then we add 20% of our Italian Roast coffee to give Caff&#xE8; Verona its added depth and sweetness. The result is heavenly.', 'asin': 'B001EQ5428', 'price': 41.95, 'salesRank': {'Grocery & Gourmet Food': 21463}, 'imUrl': 'http://ecx.images-amazon.com/images/I/411qmN5DJpL._SY300_.jpg', 'categories': [['Grocery & Gourmet Food']], 'related': {'bought_together': ['B004LL5O46', 'B00004SPEU'], 'also_viewed': ['B004LL5O46', 'B0029K11EI', 'B00E017PCG', 'B000TYEUXU', 'B004UM2IZE', 'B0078P1QI0', 'B00IIV4BRG', 'B000AAJQQO', 'B001EQ53WE', 'B004X2O3TA', 'B004X2N0MQ', 'B007PSZCX0', 'B001EQ522A', 'B006WA2H9Y', 'B00GR6HP

# Check the data

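Are all products included in the metadata? Note that `any` only tells us whether at least one product ID is present in `products`; to confirm that every ID is present, we need `all`.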

In [15]:
print(any([id in products for id in product_ids]))

True


In [9]:
print(all(id in products for id in product_ids))

True


Excellent, all product IDs from the reviews are included as keys in the products dictionary.
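
If the check had returned False, we would want to know which IDs are missing. A set difference gives that directly. A small sketch with made-up IDs (the names `example_ids` and `example_products` are only for illustration and stand in for `product_ids` and `products`):

```python
# Made-up data for illustration; not taken from the Amazon files
example_ids = {"A1", "B2", "C3"}
example_products = {"A1": {"title": "Tea"}, "B2": {"title": "Coffee"}}

# difference() accepts any iterable; iterating over a dict yields its keys
missing = example_ids.difference(example_products)
print(missing)
```

Here `missing` contains only `"C3"`, the one ID without a metadata entry.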

# Saving the data

To save the data for future use we have multiple options. For example, we could use 'pickle', python's built-in object serializer.
However, for this kind of data (dictionaries), I prefer to store the data as json, as that facilitates reading it from other programs (e.g. from R). 
So, I store the data as two json files:

In [None]:
import json
# 'with' makes sure each file is flushed and closed after writing
with open("gourmet_products.json", "w") as f:
    json.dump(products, f, indent=4)
with open("gourmet_reviews.json", "w") as f:
    json.dump(reviews, f, indent=4)
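
To read the data back in a later session, `json.load` does the reverse. A minimal sketch that writes a tiny made-up products dictionary to a temporary file and checks that it survives the round trip (the file name and contents are illustrative, not the real data):

```python
import json
import os
import tempfile

# Made-up example data in the same shape as the products dictionary
example_products = {"B001EQ5O5U": {"title": "Example coffee", "price": 41.95}}

# Write to a temporary directory so we do not touch the real output files
path = os.path.join(tempfile.mkdtemp(), "example_products.json")
with open(path, "w") as f:
    json.dump(example_products, f, indent=4)

# Read the file back and compare with the original
with open(path) as f:
    loaded = json.load(f)

print(loaded == example_products)
```

Keep in mind that json only knows strings, numbers, booleans, null, lists and objects: dictionary keys are always stored as strings, and tuples come back as lists.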

In [10]:
!head gourmet_reviews.json

[
    {
        "reviewerName": "Amazon Customer",
        "reviewText": "Just another flavor of Kit Kat but the taste is unique and a bit different.  The only thing that is bothersome is the price.  I thought it was a bit expensive....",
        "helpful": [
            0,
            0
        ],
        "asin": "616719923X",
        "overall": 4.0,
