# Intro to Python Data Processing 

In this notebook, we look at how to read text stored as JSON and CSV files into Python objects, and at what we can do with the data once the files are open. We also introduce a few useful libraries along the way.
In this step we transform the raw data into structured (or derived) data.

In [1]:
# Disable jedi autocompleter
%config Completer.use_jedi = False

## String.split()

In [2]:
x = "marketplace customer_id review_id product_id product_parent"
x.split()

['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent']

In [3]:
x = "marketplace; customer_id review_id product_id product_parent"
x.split(';')

['marketplace', ' customer_id review_id product_id product_parent']
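
Passing `';'` splits only on that character, so the rest of the string stays in one piece and keeps its leading space. A minimal sketch of two ways to clean this up, assuming it is acceptable to treat the semicolon like whitespace in the first variant:

```python
x = "marketplace; customer_id review_id product_id product_parent"

# Variant 1: treat the semicolon as whitespace, then split on runs of whitespace.
fields = x.replace(";", " ").split()
print(fields)

# Variant 2: keep splitting on ';' only, but strip the spaces around each piece.
pieces = [part.strip() for part in x.split(";")]
print(pieces)
```

The first variant yields five separate field names; the second still yields two pieces, just without the stray leading space.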

## eval()

In [4]:
import ast
path = "../datasets/example.json"
f = open(path)

In [5]:
line = f.readline()
line

'{"_id":"5c1a010ae61b49b43c4b4864","index":0,"age":35,"eyeColor":"green","name":"Wiggins Holman","address":"247 Thatford Avenue, Oneida,Puerto Rico, 7233","friends":[{"id":0,"name":"Carmela Hampton"},{"id":1,"name":"Lynda Pittman"},{"id":2,"name":"Cleveland Noble"}]}\n'

In [6]:
# To read the line as a JSON object, we could use the eval() function
# (this works here because the line also happens to be a valid Python literal)
d = eval(line)
d

{'_id': '5c1a010ae61b49b43c4b4864',
 'index': 0,
 'age': 35,
 'eyeColor': 'green',
 'name': 'Wiggins Holman',
 'address': '247 Thatford Avenue, Oneida,Puerto Rico, 7233',
 'friends': [{'id': 0, 'name': 'Carmela Hampton'},
  {'id': 1, 'name': 'Lynda Pittman'},
  {'id': 2, 'name': 'Cleveland Noble'}]}

In [7]:
d['_id']

'5c1a010ae61b49b43c4b4864'

In [8]:
# However, we need to be careful when using eval(),
# since it executes arbitrary strings as Python code
eval("print(2+6)")

8


In [9]:
# To prevent this undesired behaviour, we should use the ast or json library instead.
# Note that ast.literal_eval() only accepts Python literals: it works here because
# this JSON line contains no true, false or null values
ast.literal_eval(line)

{'_id': '5c1a010ae61b49b43c4b4864',
 'index': 0,
 'age': 35,
 'eyeColor': 'green',
 'name': 'Wiggins Holman',
 'address': '247 Thatford Avenue, Oneida,Puerto Rico, 7233',
 'friends': [{'id': 0, 'name': 'Carmela Hampton'},
  {'id': 1, 'name': 'Lynda Pittman'},
  {'id': 2, 'name': 'Cleveland Noble'}]}

In [10]:
# We could also use the json library
import json
json.loads(line)

{'_id': '5c1a010ae61b49b43c4b4864',
 'index': 0,
 'age': 35,
 'eyeColor': 'green',
 'name': 'Wiggins Holman',
 'address': '247 Thatford Avenue, Oneida,Puerto Rico, 7233',
 'friends': [{'id': 0, 'name': 'Carmela Hampton'},
  {'id': 1, 'name': 'Lynda Pittman'},
  {'id': 2, 'name': 'Cleveland Noble'}]}
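
The three approaches agree on this particular line, but they stop agreeing as soon as the JSON contains `true`, `false` or `null`, which are not Python literals. A small sketch of the difference; the `isActive` and `company` fields are made up for illustration and are not part of `example.json`:

```python
import ast
import json

# Hypothetical record: "isActive" and "company" are invented fields.
record = '{"name": "Wiggins Holman", "isActive": true, "company": null}'

# json.loads understands JSON's true/false/null and maps them to True/False/None.
parsed = json.loads(record)
print(parsed)

# ast.literal_eval only accepts Python literals, so it rejects this string.
try:
    ast.literal_eval(record)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False
print("literal_eval succeeded:", literal_eval_ok)
```

For data that is known to be JSON, `json.loads` is therefore the safer default.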

## Dealing with Large Files

### Gzip

Often we'll be dealing with very large datasets, of which only a small part is relevant to us. With gzip we can read the compressed data directly in its native gzipped format, without first extracting an uncompressed copy to disk.

In [11]:
import gzip
import csv

In [12]:
# Open the gzipped data file in text mode ('rt')
path = "../datasets/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
f = gzip.open(path, 'rt')

In [13]:
# Initiate a reader object
reader = csv.reader(f, delimiter='\t')

# read one line at a time
# First line = header
header = next(reader)

In [14]:
header

['marketplace',
 'customer_id',
 'review_id',
 'product_id',
 'product_parent',
 'product_title',
 'product_category',
 'star_rating',
 'helpful_votes',
 'total_votes',
 'vine',
 'verified_purchase',
 'review_headline',
 'review_body',
 'review_date']
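
The same pattern can be tried without the Amazon file. The sketch below writes a tiny gzipped TSV to a temporary directory and streams it back; the file name `sample.tsv.gz` and its two rows are made up for illustration. It also uses `with` blocks so the file handles are closed automatically.

```python
import csv
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file, created only for this demonstration.
    sample_path = os.path.join(tmp_dir, "sample.tsv.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as out:
        out.write("star_rating\treview_headline\n")
        out.write("5\tFive Stars\n")
        out.write("1\tNot as described\n")

    # 'rt' decompresses on the fly and yields text lines, one at a time.
    with gzip.open(sample_path, "rt", encoding="utf-8") as sample_file:
        sample_reader = csv.reader(sample_file, delimiter="\t")
        sample_header = next(sample_reader)  # first line is the header
        sample_rows = list(sample_reader)    # fine for a tiny file

print(sample_header)
print(sample_rows)
```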

### Reading and Filtering Files Line by Line

How can we read and filter a dataset line by line? A very large file may be gzipped on disk, but that does not help if we then read the entire file into memory in one go: we will simply run out of memory. So the next idea is to build a data structure that contains only the reduced subset of the file we actually want to work with. For the Amazon dataset, for example, we might ignore the text fields because we only want to work with the ratings, the votes, or the user data. That's what we'll do in this example.

In [15]:
dataset = []

In [16]:
for line in reader:
    line = line[:-3] # drop the last 3 entries (review_headline, review_body, review_date)
    if line[-1] == 'Y': # keep only verified purchases (last remaining field is verified_purchase)
        dataset.append(line)

In [17]:
dataset[10]

['US',
 '3559726',
 'R6JH7A117FHFA',
 'B004LLIKVU',
 '473048287',
 'Amazon.com eGift Cards',
 'Gift Card',
 '5',
 '0',
 '0',
 'N',
 'Y']

In [18]:
dataset[3][5]

'Amazon.com Gift Card Balance Reload'
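
Positional indices such as `[5]` or `[:-3]` are easy to get wrong when the column order changes. One possible alternative, sketched here on a made-up two-row sample that reuses a few of the dataset's column names, is to look up the positions from the header once and select columns by name:

```python
import csv
import io

# Made-up sample rows; only the column names come from the dataset header.
sample_tsv = (
    "customer_id\tstar_rating\tverified_purchase\treview_body\n"
    "111\t5\tY\tGreat gift\n"
    "222\t1\tN\tNever arrived\n"
)
sample_reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t")
sample_header = next(sample_reader)

# Resolve column names to positions once, before the loop.
keep = ["customer_id", "star_rating", "verified_purchase"]
keep_idx = [sample_header.index(name) for name in keep]
verified_idx = sample_header.index("verified_purchase")

# Keep only the chosen columns of verified reviews, one row at a time.
subset = [
    [row[i] for i in keep_idx]
    for row in sample_reader
    if row[verified_idx] == "Y"
]
print(subset)
```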

In [19]:
header

['marketplace',
 'customer_id',
 'review_id',
 'product_id',
 'product_parent',
 'product_title',
 'product_category',
 'star_rating',
 'helpful_votes',
 'total_votes',
 'vine',
 'verified_purchase',
 'review_headline',
 'review_body',
 'review_date']

In [20]:
# Rewind the file to the beginning and create a fresh reader
f.seek(0)
reader = csv.reader(f, delimiter='\t')
# First line as header - as above
header = next(reader)

dataset = []
for line in reader:
    d = dict(zip(header, line))
    # convert string to int
    for field in ['helpful_votes', 'star_rating', 'total_votes']:
        d[field] = int(d[field])
    # convert 'Y'/'N' strings to booleans
    for field in ['verified_purchase', 'vine']:
        d[field] = d[field] == "Y"
    dataset.append(d)

In [21]:
dataset[4]

{'marketplace': 'US',
 'customer_id': '397970',
 'review_id': 'RNYLPX611NB7Q',
 'product_id': 'B005ESMGV4',
 'product_parent': '379368939',
 'product_title': 'Amazon.com Gift Cards, Pack of 3 (Various Designs)',
 'product_category': 'Gift Card',
 'star_rating': 5,
 'helpful_votes': 0,
 'total_votes': 0,
 'vine': False,
 'verified_purchase': True,
 'review_headline': 'Five Stars',
 'review_body': "I can't believe how quickly Amazon can get these into my hands!!  Thank you!",
 'review_date': '2015-08-31'}
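
The `dict(zip(header, line))` step can also be delegated to `csv.DictReader`, which reads the header itself and yields one dictionary per row. A minimal sketch on made-up sample rows; `parse_row` is a hypothetical helper name, not something defined elsewhere in this notebook:

```python
import csv
import io

# Made-up sample rows using a subset of the dataset's column names.
sample_tsv = (
    "product_id\tstar_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase\n"
    "P1\t5\t0\t0\tN\tY\n"
    "P2\t2\t3\t4\tN\tN\n"
)

def parse_row(row):
    """Return a copy of row with count fields as int and Y/N fields as bool."""
    parsed = dict(row)
    for field in ("star_rating", "helpful_votes", "total_votes"):
        parsed[field] = int(parsed[field])
    for field in ("vine", "verified_purchase"):
        parsed[field] = parsed[field] == "Y"
    return parsed

# DictReader takes the field names from the first line of the input.
records = [
    parse_row(row)
    for row in csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
]
print(records[1])
```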

## Summary Statistics

### Average rating

In [25]:
ratings = [d['star_rating'] for d in dataset]
sum(ratings) / len(ratings)

4.731333018677096

### Rating Score Distribution


In [26]:
ratingCounts = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
for d in dataset:
    ratingCounts[d['star_rating']] += 1

In [27]:
ratingCounts

{1: 4766, 2: 1560, 3: 3147, 4: 9808, 5: 129029}
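
As a consistency check, the average rating can be recovered from this distribution alone, as a count-weighted mean of the star values. The sketch below copies the counts from the output above into a separate variable so that it does not touch `ratingCounts`:

```python
# Counts copied from the rating distribution shown above.
rating_counts = {1: 4766, 2: 1560, 3: 3147, 4: 9808, 5: 129029}

n_reviews = sum(rating_counts.values())
total_stars = sum(stars * count for stars, count in rating_counts.items())

# Weighted mean: each star value contributes in proportion to its count.
mean_rating = total_stars / n_reviews
print(n_reviews, mean_rating)
```

The result agrees with the average computed directly from the list of ratings.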

In [34]:
# rating score distribution using defaultdict
from collections import defaultdict

# delete ratingCounts
del ratingCounts

# With defaultdict there is no need to define the keys up front,
# which is very handy when there are many keys.
# For the star rating counts it was fine to define the keys up front
# and set the values to zero. However, when we have many more keys,
# such as product IDs, listing them all in advance is impractical.
ratingCounts = defaultdict(int)
for d in dataset:
    ratingCounts[d['star_rating']] += 1

ratingCounts

defaultdict(int, {5: 129029, 1: 4766, 4: 9808, 2: 1560, 3: 3147})
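
For plain counting, the standard library also offers `collections.Counter`, which can be used in place of `defaultdict(int)`. A small sketch on a made-up list of ratings:

```python
from collections import Counter

# Made-up ratings, for illustration only.
sample_ratings = [5, 5, 4, 1, 5, 3, 4]

# Counter tallies the items of any iterable in one call.
rating_counter = Counter(sample_ratings)
print(rating_counter)

# Looking up a missing key returns 0 and does not add the key.
print(rating_counter[2], 2 in rating_counter)
```

One difference worth knowing: reading a missing key from a `defaultdict` inserts it, whereas a `Counter` leaves its keys unchanged.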

### Verified Purchases

In [32]:
verifiedCounts = defaultdict(int)
for d in dataset:
    verifiedCounts[d['verified_purchase']] += 1
verifiedCounts

defaultdict(int, {True: 135289, False: 13021})
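
Raw counts are often easier to compare as shares of the total. A minimal sketch, with the two counts copied from the output above into a separate variable:

```python
# Counts copied from the verified-purchase output above.
verified_counts = {True: 135289, False: 13021}

total_reviews = sum(verified_counts.values())

# Divide each count by the total to get its share.
shares = {key: count / total_reviews for key, count in verified_counts.items()}
print(f"verified: {shares[True]:.1%}, unverified: {shares[False]:.1%}")
```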

### Most Popular Products

In [35]:
productCounts = defaultdict(int)
for d in dataset:
    productCounts[d['product_id']] += 1

In [43]:
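# convert productCounts to a list of (count, product_id) tuples
# (named product_list to avoid shadowing the built-in list)
product_list = [(productCounts[p], p) for p in productCounts]
product_list.sort(reverse=True)

In [45]:
# display the 10 most popular products
product_list[:10]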

[(28705, 'B004LLIKVU'),
 (6037, 'B00A48G0D4'),
 (5034, 'BT00DDVMVQ'),
 (4283, 'B00IX1I3G6'),
 (3440, 'BT00CTOUNS'),
 (3407, 'BT00DDC7BK'),
 (2643, 'B004LLIKY2'),
 (2630, 'BT00DDC7CE'),
 (2173, 'B0066AZGD4'),
 (2038, 'B004KNWWO0')]
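
Sorting every product just to read off the top few does more work than needed. Two standard-library alternatives are `Counter.most_common` and `heapq.nlargest`; the sketch below uses the first four counts from the output above as sample data:

```python
import heapq
from collections import Counter

# Counts copied from the first four rows of the output above.
product_counter = Counter({
    "B004LLIKVU": 28705,
    "B00A48G0D4": 6037,
    "BT00DDVMVQ": 5034,
    "B00IX1I3G6": 4283,
})

# most_common(n) returns the n highest (key, count) pairs, largest first.
top_two = product_counter.most_common(2)
print(top_two)

# heapq.nlargest does the same for any mapping, given a key function.
top_two_heap = heapq.nlargest(2, product_counter.items(), key=lambda item: item[1])
print(top_two_heap)
```

Note that both return `(product_id, count)` pairs, the reverse of the `(count, product_id)` order used in the sorted list.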