## Analysis on Dataset from Amazon

For this exercise, you will analyze a dataset from Amazon. The data format and a
sample entry are shown on the next page.

A. (Suggested duration: 90 mins)
With the given data for 548552 products, perform exploratory analysis and make
suggestions for further analysis on the following aspects.
1. Trustworthiness of ratings<br><br/>
Ratings are susceptible to manipulation, bias etc. What can you say (quantitatively
speaking) about the ratings in this dataset?
2. Category bloat<br><br/>
Consider the product group named 'Books'. Each product in this group is associated with
categories. Naturally, with categorization, there are tradeoffs between how broad or
specific the categories must be.
For this dataset, quantify the following:<br><br/>
a. Is there redundancy in the categorization? How can it be identified/removed?<br><br/>
b. Is it possible to reduce the number of categories drastically (say to 10% of existing
categories) by sacrificing relatively few category entries (say close to 10%)?<br><br/>

B. (Suggested duration: 30 mins)
Give the number crunching a rest! Just think about these problems.
1. Algorithm thinking<br><br/>
How would you build the product categorization from scratch, using similar/co-purchased
information?
2. Product thinking<br><br/>
Now, put on your 'product thinking' hat.<br><br/>
a. Is it a good idea to show users the categorization hierarchy for items?<br><br/>
b. Is it a good idea to show users similar/co-purchased items?<br><br/>
c. Is it a good idea to show users reviews and ratings for items?<br><br/>
d. For each of the above, why? How will you establish the same?

Data Source: http://snap.stanford.edu/data/amazon-meta.html

### Part A 1.

In [14]:
import pandas as pd
import numpy as np

In [7]:
with open('amazon-meta.txt', 'r', encoding="utf8") as f:
    file_content = f.read()

In [11]:
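# Parse the file into one list of lines per product.
# The first three lines are skipped as the file header; products are separated by blank lines.
grouped_products = []
info = []

for string in file_content.split('\n')[3:]:
    if string != '':
        info.append(string)
    elif info:
        grouped_products.append(info)
        info = []

# Keep the last product if the file does not end with a blank line
if info:
    grouped_products.append(info)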

In [12]:
grouped_products[0:2]

[['Id:   0', 'ASIN: 0771044445', '  discontinued product'],
 ['Id:   1',
  'ASIN: 0827229534',
  '  title: Patterns of Preaching: A Sermon Sampler',
  '  group: Book',
  '  salesrank: 396585',
  '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  '  reviews: total: 2  downloaded: 2  avg rating: 5',
  '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5']]

In [16]:
# Build a dataframe with extracted rating info of products
rating_dict = {}

for product in grouped_products:
    idn, total, downloaded, avg_rating = '', '', '', ''
    skip = False  # reset for every product so the flag is always defined
    for item in product:
        if item.startswith('Id:'):
            idn = item.split()[-1]
        elif item.startswith('  reviews:'):
            # e.g. '  reviews: total: 2  downloaded: 2  avg rating: 5'
            fields = item.split()
            total, downloaded, avg_rating = fields[2], fields[4], fields[7]
        elif item.startswith('  discontinued product'):
            skip = True
    # Discontinued products carry no rating information, so leave them out
    if not skip:
        rating_dict[idn] = [total, downloaded, avg_rating]
    
rating_df = pd.DataFrame.from_dict(rating_dict)
rating_df = rating_df.T.reset_index(drop=False)
rating_df.columns = ['id', 'total', 'downloaded', 'avg_rating']
rating_df = rating_df.iloc[1:]
rating_df['id']  = rating_df['id'].astype(int)
rating_df['total']  = rating_df['total'].astype(int)
rating_df['downloaded']  = rating_df['downloaded'].astype(int)
rating_df['avg_rating']  = rating_df['avg_rating'].astype(float)
rating_df = rating_df.sort_values('id')
rating_df.head()

| | id | total | downloaded | avg_rating |
|---|---|---|---|---|
| 1 | 1 | 2 | 2 | 5.0 |
| 109902 | 2 | 12 | 12 | 4.5 |
| 219674 | 3 | 1 | 1 | 5.0 |
| 329477 | 4 | 1 | 1 | 4.0 |
| 439371 | 5 | 0 | 0 | 0.0 |


In [31]:
rating_df[rating_df.total > 0].describe()

| | id | total | downloaded | avg_rating |
|---|---|---|---|---|
| count | 402735.0 | 402735.0 | 402735.0 | 402735.0 |
| mean | 279228.680966 | 19.322855 | 18.854194 | 4.324836 |
| std | 161089.943216 | 86.235955 | 82.92161 | 0.739279 |
| min | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 139238.5 | 2.0 | 2.0 | 4.0 |
| 50% | 278536.0 | 4.0 | 4.0 | 4.5 |
| 75% | 419909.0 | 11.0 | 11.0 | 5.0 |
| max | 548551.0 | 5545.0 | 4995.0 | 5.0 |


In [37]:
percent_no_review = len(rating_df[rating_df.total == 0])/len(rating_df) * 100
print('There are {:.3}% products that do not have any review.'.format(percent_no_review))

There are 25.8% products that do not have any review.


In [26]:
unusual = len(rating_df[rating_df['total'] > rating_df['downloaded']])
max_unusual = max(rating_df['total'] - rating_df['downloaded'])
print('There are %d products with the number of total reviews larger than the number of downloads and the largest difference is %d.' %(unusual, max_unusual)) 

There are 8615 products with the number of total reviews larger than the number of downloads and the largest difference is 5029.


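#### The summary of products with at least one review shows that at least 75% of them have an average rating of 4 or higher (the 25th percentile is 4.0), and 25.8% of all products have no reviews at all. Both findings suggest that the ratings may be biased toward positive scores. In addition, 8615 products have a `total` review count larger than their `downloaded` count, with a maximum difference of 5029. This is uncommon and may indicate manipulation. However, if `downloaded` counts the reviews captured in the crawl rather than product downloads (which the field names suggest, but which should be checked against the data source), the gap may instead reflect incomplete crawling.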
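One way to look beyond the per-product averages is to tally the individual review lines. The sketch below assumes the review line layout shown in the sample entry (including the file's own `cutomer` spelling). The names `REVIEW_RE`, `parse_reviews`, and `rating_histogram` are hypothetical helpers introduced for illustration, not part of the analysis above.

```python
import re
from collections import Counter

# Matches one review line in the layout shown in the sample entry, e.g.
# '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9'
REVIEW_RE = re.compile(
    r"^\s*(\d{4}-\d{1,2}-\d{1,2})\s+cutomer:\s+(\S+)\s+rating:\s+(\d)"
    r"\s+votes:\s+(\d+)\s+helpful:\s+(\d+)\s*$"
)


def parse_reviews(product_lines):
    """Return (customer, rating, votes, helpful) tuples for one product."""
    reviews = []
    for line in product_lines:
        match = REVIEW_RE.match(line)
        if match:
            _, customer, rating, votes, helpful = match.groups()
            reviews.append((customer, int(rating), int(votes), int(helpful)))
    return reviews


def rating_histogram(products):
    """Count individual review ratings across all products."""
    histogram = Counter()
    for product in products:
        for _, rating, _, _ in parse_reviews(product):
            histogram[rating] += 1
    return histogram


# Toy input copied from the sample entry shown earlier
sample = [[
    'Id:   1',
    '  reviews: total: 2  downloaded: 2  avg rating: 5',
    '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
    '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5',
]]
print(rating_histogram(sample))  # Counter({5: 2})
```

Applied to `grouped_products`, the same histogram would show how strongly individual ratings cluster at 4 and 5. The parsed customer IDs could also be counted to spot accounts with unusually many reviews.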

### Part A 2.

In [38]:
# Extract the category list of books and number of books in all products
category_list = []
book_count = 0
for product in grouped_products:
    group = ''
    for item in product:
        if item.startswith('  group:'):
            group = item.split()[-1]
        if group == 'Book':
            if item.startswith('   |'):
                category_list.append(item.strip())
    if group == 'Book':
        book_count += 1

sub_category_list = []
for branch in category_list:
    for sub_cat in branch.split('|')[1:]:
        sub_category_list.append(sub_cat)

In [46]:
print('Number of books among all products is {}.'.format(book_count))
print('Total number of categories for books is {}.'.format(len(set(category_list))))
print('Total number of sub-categories is {}.'.format(len(set(sub_category_list))))

Number of books among all products is 393561.
Total number of categories for books is 12853.
Total number of sub-categories is 14923.


In [49]:
pd.Series(sub_category_list).value_counts().head(5)

Books[283155]                1286848
Subjects[1000]               1222638
Children's Books[4]           134263
[265523]                      123925
Amazon.com Stores[285080]     123925
dtype: int64

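#### a. There appears to be some redundancy in the categorization. The books use 12853 unique category paths and 14923 unique sub-categories, and the most frequent sub-categories are very generic ones such as `Books[283155]` and `Subjects[1000]`. In addition, `[265523]` and `Amazon.com Stores[285080]` have exactly the same count (123925), which suggests that they always occur together. Redundancy of this kind can be identified by comparing sub-category counts, and it can be reduced by merging very small sub-categories into larger, related ones.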
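A minimal sketch of how redundant levels might be identified, assuming category paths in the `|Name[id]|Name[id]|...` form shown in the sample entry. The helper `find_redundant_links` is hypothetical. It flags a parent and child when every path through the parent continues to that same child, so the two levels carry the same information.

```python
from collections import Counter


def find_redundant_links(category_paths):
    """Return (parent, child) pairs where the parent never appears without the child."""
    node_counts = Counter()
    edge_counts = Counter()
    for path in category_paths:
        nodes = path.strip().split('|')[1:]
        node_counts.update(nodes)
        # Count each consecutive parent -> child step along the path
        edge_counts.update(zip(nodes, nodes[1:]))
    return sorted(
        (parent, child)
        for (parent, child), count in edge_counts.items()
        if count == node_counts[parent]
    )


# Toy input: the two category paths from the sample entry
sample_paths = [
    '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
    '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
]
links = find_redundant_links(sample_paths)
for parent, child in links:
    print(parent, '->', child)
```

With only two paths almost every level looks redundant. On the full `category_list`, only levels that always co-occur would be flagged, and `Clergy[12360]` is already left out here because it branches into two children.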

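#### b. It should be possible to reduce the number of categories drastically while sacrificing relatively few category entries. `Books[283155]` and `Subjects[1000]` have counts roughly 9 times higher than the next most frequent sub-category, so they add little discriminating information. If these very broad levels are dropped in favor of the more specific categories beneath them, and rarely used categories are merged, the number of categories could likely be cut substantially at the cost of relatively few entries. Confirming this would require computing how many entries the most frequent categories cover.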
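The trade-off in question b can be checked directly by keeping only the most frequent categories and measuring the share of entries they cover. The sketch below uses a hypothetical helper, `coverage_of_top_categories`, and made-up toy data. It assumes one list element per category entry, as in `category_list`.

```python
from collections import Counter


def coverage_of_top_categories(category_entries, keep_fraction=0.10):
    """Keep the most frequent categories and report the share of entries they cover.

    Returns (number of categories kept, fraction of entries covered).
    """
    if not category_entries:
        return 0, 0.0
    counts = Counter(category_entries)
    n_keep = max(1, int(len(counts) * keep_fraction))
    covered = sum(count for _, count in counts.most_common(n_keep))
    return n_keep, covered / len(category_entries)


# Toy data: 10 distinct categories, one of which dominates
entries = ['A'] * 91 + ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
n_keep, covered = coverage_of_top_categories(entries)
print(n_keep, covered)
```

If the covered fraction for `category_list` at `keep_fraction=0.10` is close to 0.9, then roughly 10% of the entries would be sacrificed to remove 90% of the categories.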

### Part B 1.

To build the product categorization from scratch, I would collect the words in the category information of each product's similar/co-purchased products and then use Naive Bayes to classify the product's category.
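The idea can be sketched as a small multinomial Naive Bayes classifier with add-one smoothing. This is an illustrative sketch only: `train_naive_bayes` and `predict_category` are hypothetical names, and the words and labels in the toy data are made up. In practice the words would come from the categories of each product's similar/co-purchased items.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (words, label) pairs. Returns the fitted counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def predict_category(model, words):
    """Return the label with the highest log-posterior for the given words."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_examples in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_examples / total_examples)  # log prior
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data: words taken from neighbors' categories, with a known label
examples = [
    (['religion', 'christianity', 'clergy', 'sermons'], 'Religion & Spirituality'),
    (['religion', 'christianity', 'bible'], 'Religion & Spirituality'),
    (['cooking', 'baking', 'desserts'], 'Cooking'),
    (['cooking', 'vegetarian'], 'Cooking'),
]
model = train_naive_bayes(examples)
print(predict_category(model, ['christianity', 'clergy']))  # Religion & Spirituality
```

Because Naive Bayes needs labeled examples, this approach assumes that at least some products already have a category. A fully unsupervised build would need a different technique, such as clustering the co-purchase graph.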

### Part B 2.

a. Yes, it is a good idea. Showing users the categorization hierarchy for each item helps them find more items in the same category as the one they are looking at. One thing to keep in mind is that the hierarchy should not be too deep: if it starts from a very general category when a more specific one would do, customers will take more time to find what they want.

b. Yes, it is a good idea to show users similar/co-purchased items, since it is a quick way to surface products that the user could be interested in. It could increase sales by making the shopping experience more personalized and encouraging customers to check out more products.

c. Yes, it is a good idea to show users reviews and ratings for items, because customers who have limited information about an item will base their decisions on the experience of other customers who bought and used the product.