# Analysing Amazon Reviews with Pandas Part I: Reading the data
(2017-02-03 (C) Wouter van Atteveldt CC-BY-SA)

As in the previous handout, we will analyse the Amazon reviews from http://jmcauley.ucsd.edu/data/amazon. You can download any of the 5-core subsets for a product category, for example the Grocery and Gourmet Food data, using the link found on Canvas.

This data consists of review data and product (meta) data, and can be used to answer a lot of questions.
For example, which product got the most reviews? Which product got the best reviews?
Which product was the most controversial, i.e. the one with the most divergent reviews?

In this handout, we will be doing the analysis in pandas. In comparison with the previous exercises, this showcases how pandas can be used to easily manage and analyse data.

Before we start the substantive analysis, we read the data into pandas data frames and save them as pickle files.
The next handout (Part II: Controversial reviews?) will do a substantive analysis of the reviews.

# Reviews data

To read the reviews data, we can reuse the code from the previous exercise.
The code `eval`s every line, yielding one dictionary per review with the column names as keys.
Since `pd.DataFrame` accepts a list of dictionaries as input, we can use this directly to create the data frame.
Note that we use a 'generator expression' to eval each line in a single expression, without first storing the full result as a Python list.
Keep in mind that `eval` executes whatever code a line contains, so only use it on files you trust.

In [1]:
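import pandas as pd

fn = "reviews_Grocery_and_Gourmet_Food_5.json"
# Lazily eval each line into a dict (one dict per review)
reviews = (eval(line) for line in open(fn))
# DataFrame consumes the generator: one row per dict, keys become columns
reviews = pd.DataFrame(reviews)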
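
As a possible safer variant of the cell above, the sketch below parses each line with `ast.literal_eval` from the standard library, which only accepts Python literals and therefore cannot run arbitrary code. This is a swapped-in technique, not the code from the previous exercise. The two sample lines and the name `demo_reviews` are made up for illustration, and whether every line of the real files parses this way is an assumption.

```python
import ast
import io

import pandas as pd

# Two made-up lines in a one-dict-per-line layout (not real reviews)
sample = io.StringIO(
    "{'asin': 'A1', 'overall': 4.0, 'helpful': [0, 0]}\n"
    "{'asin': 'A2', 'overall': 5.0, 'helpful': [1, 2]}\n"
)

# literal_eval parses literals only (dicts, lists, strings, numbers, ...)
rows = (ast.literal_eval(line) for line in sample)
demo_reviews = pd.DataFrame(rows)

print(demo_reviews.shape)  # (2, 3)
```

To try this on the real data, `sample` could be replaced by `open(fn)`.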

In [2]:
reviews.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 616719923X | [0, 0] | 4.0 | Just another flavor of Kit Kat but the taste i... | 06 1, 2013 | A1VEELTKS8NLZB | Amazon Customer | Good Taste | 1370044800 |
| 1 | 616719923X | [0, 1] | 3.0 | I bought this on impulse and it comes from Jap... | 05 19, 2014 | A14R9XMZVJ6INB | amf0001 | 3.5 stars, sadly not as wonderful as I had hoped | 1400457600 |
| 2 | 616719923X | [3, 4] | 4.0 | Really good. Great gift for any fan of green t... | 10 8, 2013 | A27IQHDZFQFNGG | Caitlin | Yum! | 1381190400 |
| 3 | 616719923X | [0, 0] | 5.0 | I had never had it before, was curious to see ... | 05 20, 2013 | A31QY5TASILE89 | DebraDownSth | Unexpected flavor meld | 1369008000 |
| 4 | 616719923X | [1, 2] | 4.0 | I've been looking forward to trying these afte... | 05 26, 2013 | A2LWK003FFMCI5 | Diana X. | Not a very strong tea flavor, but still yummy ... | 1369526400 |


# Reading products data

For the products data, we can in principle do the same thing, with two changes:
(1) we only use those lines where the product (asin) is in the set of reviewed products, and
(2) we set the asin to be the 'index' of the data frame (since it is unique).
Note that we need to split the processing into two generator expressions (or a single for loop),
since the condition can only be checked after the eval() has been performed.
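
The single for loop alternative mentioned above could look roughly like the sketch below. The helper `reviewed_products`, the three sample lines, and the names `demo_asins` and `demo_products` are hypothetical and only serve the illustration. It uses `ast.literal_eval` rather than `eval` so that the example is safe to run on its own.

```python
import ast

import pandas as pd

# Made-up stand-ins for the metadata lines and the set of reviewed asins
lines = [
    "{'asin': 'A1', 'title': 'Green tea'}",
    "{'asin': 'A2', 'title': 'Curry paste'}",
    "{'asin': 'A3', 'title': 'Food coloring'}",
]
demo_asins = {"A1", "A3"}


def reviewed_products(lines, asins):
    """Yield only the parsed products whose asin occurs in asins."""
    for line in lines:
        product = ast.literal_eval(line)  # parse first ...
        if product["asin"] in asins:      # ... then filter on the parsed dict
            yield product


demo_products = pd.DataFrame(reviewed_products(lines, demo_asins))
demo_products = demo_products.set_index("asin")

print(list(demo_products.index))  # ['A1', 'A3']
```

The loop makes the order of operations explicit: each line is parsed once, and the filter then works on the resulting dictionary.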

In [3]:
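fn = "meta_Grocery_and_Gourmet_Food.json"
# Set of product ids that occur in the reviews (fast membership tests)
asins = set(reviews.asin.unique())
# First generator: parse each line into a dict
products = (eval(line) for line in open(fn))
# Second generator: keep only the products that were reviewed
products = (p for p in products if p['asin'] in asins)
products = pd.DataFrame(products)

# Use the (unique) asin as the row index
products = products.set_index("asin")
products.head()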

| asin | brand | categories | description | imUrl | price | related | salesRank | title |
|---|---|---|---|---|---|---|---|---|
| 616719923X | | [[Grocery & Gourmet Food]] | Green Tea Flavor Kit Kat have quickly become t... | http://ecx.images-amazon.com/images/I/51LdEao6... | | {'also_viewed': ['B0047YG5UY', 'B00FKQ6X46', '... | {'Grocery & Gourmet Food': 37305} | Japanese Kit Kat Maccha Green Tea Flavor (5 Ba... |
| 9742356831 | Mae Ploy | [[Grocery & Gourmet Food]] | Used to make various curry soups and stir fry ... | http://ecx.images-amazon.com/images/I/41pQp67A... | 7.23 | {'also_viewed': ['B000EI2LLO', 'B0035AV5II', '... | {'Grocery & Gourmet Food': 3434} | Mae Ploy Thai Green Curry Paste - 14 oz jar |
| B00004S1C5 | HIC Harold Import Co. | [[Grocery & Gourmet Food, Cooking & Baking, Fo... | From Easter eggs to colorful cookies, Spectrum... | http://ecx.images-amazon.com/images/I/41F75K9F... | 9.76 | {'also_viewed': ['B000FNM5PU', 'B0029YDR82', '... | {'Kitchen & Dining': 4494} | Ateco Food Coloring Kit, 6 colors |
| B0000531B7 | Powerbar | [[Grocery & Gourmet Food]] | | http://ecx.images-amazon.com/images/I/519SuVj1... | 24.75 | {'also_viewed': ['B009VV7G60', 'B00DZGEY44', '... | {'Grocery & Gourmet Food': 2858} | PowerBar Harvest Energy Bars, Double Chocolate... |
| B00005344V | Traditional Medicinals | [[Grocery & Gourmet Food]] | For nearly forty years, we&#x2019;ve been pass... | http://ecx.images-amazon.com/images/I/51H54cd-... | 21.74 | {'also_viewed': ['B00028N5Q6', 'B000CMIYWC', '... | {'Grocery & Gourmet Food': 5034} | Traditional Medicinals Breathe Easy, 16-Count ... |


# Saving the data

The easiest way to save a data frame is with its `to_pickle` method.
One caveat is that pickle is unsafe (loading a pickle file can trigger code execution), so never unpickle files from the Internet or from another source you do not trust.
You can also use e.g. json or csv to save and load data frames, but columns that contain Python objects (such as the lists in `helpful`) may not survive the round trip unchanged.
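
To illustrate the caveat about Python objects, the sketch below writes a tiny made-up data frame with a list column to CSV (in memory) and reads it back. The frame `demo` and its values are invented for the example. After the round trip the list comes back as a plain string.

```python
import io

import pandas as pd

# Made-up frame with a list-valued column, similar to 'helpful'
demo = pd.DataFrame({"asin": ["A1"], "helpful": [[0, 1]]})

# Write to an in-memory CSV buffer and read it back
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(type(demo["helpful"][0]).__name__)      # list
print(type(from_csv["helpful"][0]).__name__)  # str
```

The values are still there after reading the CSV, but they would need to be parsed again before they can be used as lists.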

In [4]:
products.to_pickle("products.pickle")
reviews.to_pickle("reviews.pickle")
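
Reading the data back is the mirror image of saving it. A minimal sketch with `pd.read_pickle`, using a made-up frame and a temporary directory so that it does not touch the files written above (the names `demo` and `restored` are only for illustration):

```python
import os
import tempfile

import pandas as pd

# Made-up frame with an index and a list-valued column
demo = pd.DataFrame(
    {"asin": ["A1", "A2"], "helpful": [[0, 0], [1, 2]]}
).set_index("asin")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pickle")
    demo.to_pickle(path)             # save
    restored = pd.read_pickle(path)  # load (only for files you trust)

print(restored.index.name)            # asin
print(restored.loc["A2", "helpful"])  # [1, 2]
```

For the files saved above, the analogous call would be `pd.read_pickle("reviews.pickle")`; unlike CSV, the index and the list column are restored as they were.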