# Milestone 2 - Data collection and description
The second task is to intimately acquaint yourself with the data, preprocess it and complete all the necessary descriptive statistics tasks. We expect you to have a pipeline in place, fully documented in a notebook, and show us that you’ve advanced with your understanding of the project goals by updating its README description.

# Gaining insights into the Amazon product network 

## Overview

The Amazon dataset contains relations among products, such as "also viewed", "also bought", "bought together", "bought after viewing". These links can be used to create a graph that represents products with similar characteristics, that is, products that are viewed together but not bought together.
Our idea is to exploit the dataset to create clusters of competing products. These clusters may be used not only to identify the best product within a group in terms of rating and sales, but also to investigate how brands influence the sales and the price of similar products.

The dataset will be transformed into a graph of relations between products, where the vertices represent products and the edges represent competition between products. For instance, if two products are viewed together (people who viewed product A also viewed product B, and vice versa) but not bought together, they are competitors. On the other hand, two products that are viewed together and bought together are not competitors (e.g. a user buys a smartphone and a cover). More formally, a group of mutually competing products is a maximal clique of this graph, that is, a set of vertices that are all connected to each other and that cannot be extended with another vertex.
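
The competition rule and the clique formulation above can be sketched on a toy example. This is only an illustration under stated assumptions, not the final pipeline: the key names `also_viewed`, `also_bought` and `bought_together`, the product IDs and the helper names are hypothetical, and the clique enumeration uses the basic Bron-Kerbosch recursion, which may not scale to the full graph.

```python
from collections import defaultdict

def competition_edges(products):
    """Return undirected edges between products that are viewed together
    (in both directions) but never bought together.

    `products` maps a product ID to its `related` dict; the key names
    are assumptions made for this sketch.
    """
    viewed = {p: set(r.get('also_viewed', [])) for p, r in products.items()}
    bought = {p: set(r.get('also_bought', [])) | set(r.get('bought_together', []))
              for p, r in products.items()}
    edges = set()
    for a in products:
        for b in viewed[a]:
            if b == a or b not in products:
                continue
            mutual_view = a in viewed[b]
            bought_with = b in bought[a] or a in bought[b]
            if mutual_view and not bought_with:
                edges.add(frozenset((a, b)))
    return edges

def maximal_cliques(edges):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    adj = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already explored vertices
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(frozenset(), frozenset(adj), frozenset())
    return cliques

# Toy data: three phones viewed together, and a cover bought with one of them
toy_products = {
    'phone_a': {'also_viewed': ['phone_b', 'phone_c', 'cover_d'], 'also_bought': ['cover_d']},
    'phone_b': {'also_viewed': ['phone_a', 'phone_c']},
    'phone_c': {'also_viewed': ['phone_a', 'phone_b']},
    'cover_d': {'also_viewed': ['phone_a'], 'also_bought': ['phone_a']},
}
toy_edges = competition_edges(toy_products)
toy_cliques = maximal_cliques(toy_edges)
```

In this toy example the three phones form a single clique of competitors, while the cover is left out because it is bought together with one of the phones.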

## Dataset description

**@show** 
- That you can handle the data in its size.

The Amazon dataset consists of two JSON files: 
- *metadata.json*: contains information about the products, such as their unique ID, description and price. The size of the dataset is 9.81 GB.
- *reviews.json*: contains reviews and ratings associated to each product. The size of the dataset is approximately 100 GB.

To handle the files on our local machines, we decided to install and use *PySpark*. However, the cluster is still necessary to process the reviews dataset.

### Metadata
The dataset contains a list of entries of products with the following fields (some may be missing):
- **asin**: unique ID of the product.
- **title**: name of the product.
- **price**: price in US dollars.
- **imUrl**: url of the product image.
- **related**: related products, which contains the sub-lists: *also bought, also viewed, bought together, buy after viewing*.
- **salesRank**: sales rank information, i.e. the position of the product in the sales ranking of its category (a lower rank means more sales).
- **brand**: brand name.
- **categories**: list of the categories to which the product belongs.

These fields are already sufficient to build our graph, since they contain the above-mentioned relations between products, as well as their IDs and names. <br>
Due to the large number of products in the dataset, we decided to process only those within a small set of categories. The *categories* field contains one or more lists, each representing a hierarchy of categories to which the product has been assigned, i.e. the first element of a list is the macro category, and the last element is the most specific sub category.
We collected all the macro categories, which are relatively numerous, and inspected which ones might be suitable for our project. <br>
First, we performed a qualitative inspection, choosing macro categories containing products that can be objectively compared in terms of features and characteristics, such as *Electronics* or *Cell phones*. On the other hand, categories of products for which the purchase decision is subjective (e.g. clothes and books) have been discarded. <br>
Second, we counted the number of products associated with each macro category, and selected categories with a relatively large number of products.
Based on this analysis, we decided to process the following macro categories: *Electronics, Cell Phones & Accessories, Automotive, Tools & Home Improvement, Musical Instruments*. <br>
**@todo(improve algo description)** For each macro category, we built a tree of sub categories. Each category path is first converted into a single-branch tree, in which every node stores its children and its cumulative number of products (its own plus those of its descendants). These trees are then merged recursively into larger trees, summing the counts of shared nodes. The code of the algorithm is shown below.

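In [None]:
import ast

# Turn one category path (a list of names, from macro category to leaf)
# into a single-branch trie of the form {name: (children, product_count)}
def convert_to_trie(elements):
    root = {}
    node = root
    for element in elements:
        node[element] = ({}, 1)
        node = node[element][0]
    return root

# Merge trie b into trie a in place, summing the counts of shared nodes
def merge_tries(a, b):
    for key in b:
        if key in a:
            a[key] = (a[key][0], a[key][1] + b[key][1])
            merge_tries(a[key][0], b[key][0])
        else:
            a[key] = b[key]
    return a

# Build the category tree; `sc` is the SparkContext of the notebook session.
# Each line of the file is parsed as a Python literal with ast.literal_eval.
category_tree = sc.textFile(r'C:\Spinn3r\amazon\metadata.json')\
    .map(lambda x: ast.literal_eval(x))\
    .filter(lambda x: 'categories' in x)\
    .flatMap(lambda x: x['categories'])\
    .map(convert_to_trie)\
    .reduce(merge_tries)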
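
To make the merging step easier to follow, the two functions above can be run without Spark on a handful of category paths. The sketch below repeats them so that it is self-contained; the three paths are made-up toy data and `functools.reduce` stands in for the Spark `reduce`.

```python
from functools import reduce

def convert_to_trie(elements):
    # One category path becomes a single-branch trie with count 1 per node
    root = {}
    node = root
    for element in elements:
        node[element] = ({}, 1)
        node = node[element][0]
    return root

def merge_tries(a, b):
    # Merge b into a in place, summing the counts of shared nodes
    for key in b:
        if key in a:
            a[key] = (a[key][0], a[key][1] + b[key][1])
            merge_tries(a[key][0], b[key][0])
        else:
            a[key] = b[key]
    return a

# Toy category paths (hypothetical, one per product)
paths = [
    ['Cell Phones & Accessories', 'Cases', 'Basic Cases'],
    ['Cell Phones & Accessories', 'Cases', 'Sleeves'],
    ['Cell Phones & Accessories', 'Cell Phones'],
]
toy_tree = reduce(merge_tries, map(convert_to_trie, paths))
children, total = toy_tree['Cell Phones & Accessories']
```

Here the root counts all three products, *Cases* counts the two products below it, and each leaf counts one.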

In the example below, the tree of *Cell Phones & Accessories* is shown, with the product count of each sub category. As can be seen, sub categories may differ significantly in the number of products they contain. Therefore, some heuristics may be necessary to group categories that contain a small number of products.

In [None]:
category_tree['Cell Phones & Accessories'][0]

```
{'Accessories': ({'Accessory Kits': ({}, 26545),
   'Audio Adapters': ({}, 497),
   'Batteries': ({'Battery Charger Cases': ({}, 555),
     'External Battery Packs': ({}, 2053),
     'Internal Batteries': ({}, 6645)},
    9842),
   'Bluetooth Speakers': ({}, 779),
   'Car Accessories': ({'Car Cradles & Mounts': ({'Car Cradles': ({}, 424),
       'Car Mounts': ({}, 4183)},
      4699),
     'Car Kits': ({}, 841),
     'Car Speakerphones': ({}, 297)},
    5837),
   'Chargers': ({'Car Chargers': ({}, 7623),
     'Cell Phone Docks': ({}, 1886),
     'International Chargers': ({}, 161),
     'Solar Chargers': ({}, 275),
     'Travel Chargers': ({}, 6845)},
    17111),
   'Cradles, Mounts & Stands': ({'Stands': ({}, 44)}, 44),
   'Data Cables': ({}, 6647),
   'Headsets': ({'Bluetooth Headsets': ({}, 5033),
     'Wired Headsets': ({}, 5015)},
    10148),
   'Phone Charms': ({}, 3073),
   'Replacement Parts': ({}, 6583),
   'SIM Cards & Tools': ({}, 506),
   'Screen Protectors': ({}, 15865),
   'Signal Boosters': ({}, 586),
   'Smart Watches & Accessories': ({}, 147),
   'Stylus Pens': ({}, 3581)},
  109235),
 'Cases': ({'Armbands': ({}, 1521),
   'Basic Cases': ({}, 222345),
   'Customizable Cases': ({}, 2),
   'Holsters & Clips': ({}, 4224),
   'Sleeves': ({}, 21),
   'Wallet Cases': ({}, 954),
   'Waterproof Cases': ({}, 133)},
  229207),
 'Cell Phones': ({'Contract Cell Phones': ({}, 618),
   'No-Contract Cell Phones': ({'Minutes': ({}, 52), 'Phones': ({}, 697)},
    750),
   'Unlocked Cell Phones': ({}, 6287)},
  7693),
 'Connected Devices': ({'Mobile Broadband': ({'Data Cards': ({}, 1),
     'Mobile Hotspots': ({}, 34),
     'USB Modems': ({}, 9)},
    52),
   'Tablets': ({}, 9)},
  62)}
```
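
One possible heuristic for grouping the small sub categories mentioned above is to fold every sibling below a product-count threshold into a single bucket. The sketch below is only an assumption about how this could be done: the function name, the `min_count` threshold and the `'Other'` bucket name are hypothetical, and the input reuses the *Cases* counts from the output above.

```python
def group_small_categories(children, min_count):
    """Fold sibling categories with fewer than `min_count` products
    into a single 'Other' bucket, recursing into the kept categories."""
    grouped = {}
    other_count = 0
    for name, (sub_children, count) in children.items():
        if count < min_count:
            other_count += count
        else:
            grouped[name] = (group_small_categories(sub_children, min_count), count)
    if other_count:
        # Hypothetical bucket name; it would clash with a real 'Other' category
        grouped['Other'] = ({}, other_count)
    return grouped

# Children of the 'Cases' node, copied from the tree shown above
cases = {
    'Armbands': ({}, 1521),
    'Basic Cases': ({}, 222345),
    'Customizable Cases': ({}, 2),
    'Holsters & Clips': ({}, 4224),
    'Sleeves': ({}, 21),
    'Wallet Cases': ({}, 954),
    'Waterproof Cases': ({}, 133),
}
grouped_cases = group_small_categories(cases, min_count=1000)
```

With a threshold of 1000, the four smallest sub categories of *Cases* end up in one bucket of 1110 products, while the parent count is unchanged because products are only moved between siblings.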

### Reviews

The dataset contains a list of entries of reviews with the following fields:
- **reviewerID**: unique ID associated to each user.
- **asin**: unique ID associated to each product.
- **reviewerName**: name of the user.
- **helpful**: helpfulness rating of the review.
- **reviewText**: text of the review.
- **overall**: rating of the product.
- **summary**: summary of the review.
- **unixReviewTime**: unix timestamp of the review.
- **reviewTime**: raw timestamp of the review.


Since our project focuses mainly on products, we consider most of these fields less relevant. However, the *overall* field can be exploited to infer additional information, e.g. the average rating could indicate the quality of a product.

## Preliminary processing

**@show**
- That you considered ways to enrich, filter, transform the data according to your needs.

#### Reduce the Amazon dataset

Due to the large size of the Amazon dataset, we decided to create a custom dataset prior to performing any further analysis. The custom dataset, which has been named *reduced*, contains only products belonging to the macro categories selected in the previous paragraph. To further reduce the size, the image URL of every product has been removed. In addition, the review ratings of each product are averaged and merged with the product metadata. As a result, we obtain a smaller *metadata* dataset (1.71 GB) that is enriched with the average product rating. <br>

##### Aggregate ratings
The average product rating is computed from the data in the *reviews* dataset. For each entry, the product ID and the rating are stored in the *asin* and *overall* fields, respectively. To compute the rating, entries are grouped by product ID and then averaged on the *overall* field. <br>
The average product ratings are stored in the *aggregate_ratings* dataset. The code of the processing is shown below.

In [None]:
import json
from pyspark.sql import functions as func

# Per-product average rating, number of reviews and helpfulness ratio.
# Assumption: 'helpful' is a pair [helpful votes, total votes].
sc.textFile(r'C:\Spinn3r\amazon\reviews_sample.json.gz')\
    .map(lambda x: json.loads(x))\
    .map(lambda x: (x['asin'], x['overall'], x['helpful'][0], x['helpful'][1]))\
    .toDF(['asin', 'overall', 'helpful_votes', 'total_votes'])\
    .groupBy('asin')\
    .agg(
        func.mean('overall').alias('average_rating'),
        func.count('overall').alias('num_reviews'),
        (func.sum('helpful_votes') / func.sum('total_votes')).alias('helpful_fraction')
    )\
    .toJSON()\
    .coalesce(1)\
    .saveAsTextFile('aggregate_ratings.json')
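
As a sanity check, the same aggregation can be written in plain Python and compared against the Spark output on a small sample. This is a minimal sketch: the function name and the toy reviews are hypothetical, and it assumes that the second element of *helpful* counts all votes, so that the ratio is a fraction between 0 and 1.

```python
from collections import defaultdict

def aggregate_ratings(reviews):
    """Group reviews by product ID and compute the per-product statistics."""
    totals = defaultdict(lambda: {'sum': 0.0, 'n': 0, 'helpful': 0, 'votes': 0})
    for review in reviews:
        t = totals[review['asin']]
        t['sum'] += review['overall']
        t['n'] += 1
        t['helpful'] += review['helpful'][0]
        t['votes'] += review['helpful'][1]  # assumed to be the total number of votes
    return {
        asin: {
            'average_rating': t['sum'] / t['n'],
            'num_reviews': t['n'],
            # No votes at all: the fraction is undefined, so we store None
            'helpful_fraction': t['helpful'] / t['votes'] if t['votes'] else None,
        }
        for asin, t in totals.items()
    }

# Toy reviews (hypothetical product IDs)
toy_reviews = [
    {'asin': 'A1', 'overall': 5.0, 'helpful': [2, 3]},
    {'asin': 'A1', 'overall': 3.0, 'helpful': [1, 1]},
    {'asin': 'B2', 'overall': 4.0, 'helpful': [0, 0]},
]
toy_aggregate = aggregate_ratings(toy_reviews)
```

Note that the helpfulness ratio is computed over the summed votes of a product, not as the mean of per-review ratios, which mirrors the `sum / sum` expression used above.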

##### Merge the datasets
*Metadata* is filtered to keep only the products belonging to the macro categories of interest, which are then merged with *aggregate_ratings*. The code that generates the *reduced* dataset is shown below.

In [None]:
import ast
import json

# The macro categories that we want to extract
categories_to_extract = {'Electronics', 'Cell Phones & Accessories', 'Automotive', 'Tools & Home Improvement', 'Musical Instruments'}

# Keep only the first category path (macro category first) and delete img url to reduce size
def extract_category(x):
    x['category'] = x['categories'][0]
    del x['categories']
    x['num_reviews'] = 0  # default value, overwritten by the join for reviewed products
    if 'imUrl' in x:
        del x['imUrl']
    return x

# Load the aggregate ratings, keyed by product ID
ratings = sc.textFile(r'C:\Spinn3r\amazon\aggregate_ratings.json')\
    .map(lambda x: json.loads(x))\
    .map(lambda x: (x['asin'], x))

# Filter products and merge datasets
sc.textFile(r'C:\Spinn3r\amazon\metadata.json')\
    .map(lambda x: ast.literal_eval(x))\
    .filter(lambda x: 'categories' in x)\
    .map(extract_category)\
    .filter(lambda x: x['category'][0] in categories_to_extract)\
    .map(lambda x: (x['asin'], x))\
    .leftOuterJoin(ratings)\
    .map(lambda x: x[1])\
    .map(lambda x: x[0] if x[1] is None else {**x[0], **x[1]})\
    .map(lambda x: json.dumps(x))\
    .saveAsTextFile('reduced.json')
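
The effect of the left outer join on individual records can be illustrated without Spark. In this sketch the helper name and the two toy products are hypothetical; a plain dictionary lookup plays the role of the join.

```python
def merge_product(product, rating):
    """Mimic the merge step: keep the product as is when it has no
    rating entry, otherwise let the rating fields overwrite the defaults."""
    return product if rating is None else {**product, **rating}

# Toy records (hypothetical): one reviewed product and one without reviews
toy_metadata = [
    {'asin': 'A1', 'num_reviews': 0, 'price': 9.99},
    {'asin': 'B2', 'num_reviews': 0},
]
toy_ratings = {
    'A1': {'asin': 'A1', 'average_rating': 4.5, 'num_reviews': 12},
}
toy_reduced = [merge_product(p, toy_ratings.get(p['asin'])) for p in toy_metadata]
```

A product without reviews keeps `num_reviews` equal to 0 and has no `average_rating` field at all, so any later step that reads `average_rating` has to filter on `num_reviews` first.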

#### Build a light sandbox dataset

To run tests on the data quickly, we decided to further reduce the Amazon dataset. Specifically, from *reduced* we built a lighter dataset containing only a single macro category, namely *Musical Instruments*. The code that generates the dataset and the category tree are shown below.

In [None]:
sc.textFile('reduced.json')\
    .filter(lambda x: json.loads(x)['category'][0] == 'Musical Instruments')\
    .coalesce(1)\
    .saveAsTextFile('musical_instruments.json')

category_tree['Musical Instruments'][0].keys()

##  Exploratory analysis

**@show**
- That you understand what’s into the data (formats, distributions, missing values, correlations, etc.).
- That you have updated your plan in a reasonable way, reflecting your improved knowledge after data acquaintance. In particular, discuss how your data suits your project needs and discuss the methods you’re going to use, giving their essential mathematical details in the notebook.


**@todo**
- show some correlation between the variables
- show some cliques and draw some conclusions. Are the cliques well constructed? Can we claim that a product is better than the others within the clique? If yes, with what metrics? (analyze correlations)
- Are the cliques meaningful? We have seen that some cliques are composed of the same object in different colors, or even of the exact same product from different vendors. It may be necessary to add some considerations on how to deal with these extreme cases.
- How do we merge cliques? What heuristic do we choose? How do we deal with single-product-size cliques?

#### Correlation analysis
We performed some analyses on the variables of the *musical_instruments* dataset. In detail, we investigated the correlations among price, review rating and sales rank.

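In [None]:
import json
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# Keep products with more than 10 reviews, a price and a sales rank in the category
all_ratings = sc.textFile('musical_instruments.json')\
    .map(lambda x: json.loads(x))\
    .filter(lambda x: x['num_reviews'] > 10 and 'price' in x and 'salesRank' in x and 'Musical Instruments' in x['salesRank'])\
    .map(lambda x: (x['price'], x['average_rating'], x['salesRank']['Musical Instruments']))\
    .collect()

df = pd.DataFrame(all_ratings, columns=['price', 'rating', 'rank'])
print(len(df))
corr = df.corr()
display(df.head())

# Heatmap of the pairwise correlations
plt.figure(figsize=(10,10))
_ = sns.heatmap(corr, annot=True,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

# Scatter plots and distributions of each pair of variables
sns.pairplot(df)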

From the graphs above, we draw the following observations on musical instruments:
- As the price increases, the ratings tend to have a lower variance and a higher mean. In other words, more expensive products have higher ratings on average, and are less likely to be unpopular.
- As the price increases, the variance of the sales rank tends to be lower. Therefore, the sales are less likely to be low.
- As the sales rank decreases, ratings tend to be higher and less spread over the range.

**@todo**: However, these results may vary among different categories.
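
Since the sales rank is an ordinal quantity and prices are heavily skewed, a rank-based (Spearman) correlation may be a useful complement to the default Pearson coefficient used above. The sketch below is an alternative we have not applied to the data: the function names and the toy values are hypothetical, and it is written in plain Python only to show the computation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Toy values (hypothetical): the sales rank falls as the price rises,
# with one very expensive outlier
toy_prices = [10, 20, 40, 80, 1000]
toy_sales_ranks = [50000, 30000, 20000, 9000, 100]
toy_pearson = pearson(toy_prices, toy_sales_ranks)
toy_spearman = spearman(toy_prices, toy_sales_ranks)
```

On this toy input the relation is perfectly monotonic, so the Spearman coefficient is close to -1, whereas the Pearson coefficient is noticeably weaker because of the outlier. With pandas, the same measure is available through `df.corr(method='spearman')`.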


## Conclusion

**@show**
- That your plan for analysis and communication is now reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.

**@todo**
- discuss the feasibility of the project
- define some further internal steps before milestone 3

# Reminder: Internal steps before milestone 2

- Define the rules for creating the graph (i.e. the influence of each relation type).
- Devise an efficient algorithm for extracting cliques or highly connected subgraphs, and, possibly, merging them into clusters.
- Find useful insights into the structure of these clusters, apart from obvious ones (the best product in a cluster). For example:
  - Do people always choose the cheapest product among related products?
  - Conversely, does the best product cost more than the others?
  - Are the best products sold only by well-known brands?