# KeepUp Inc. Take-Home Challenge

KeepUp is an edTech startup based in San Francisco. Founded in 2012 by Rick Sindhwani, it curates educational resources to help companies rapidly train their employees and keep up with changing times and trends.

## A. (Suggested duration: 90 mins)

With the given data for 548552 products, perform exploratory analysis and make suggestions for further analysis on the following aspects.

### 1. Trustworthiness of ratings

Ratings are susceptible to manipulation, bias etc. What can you say (quantitatively speaking) about the ratings in this dataset?

In [1]:
import pandas as pd

In [2]:
with open('data/amazon-meta.txt') as f:
    txt = f.read().splitlines()

In [3]:
#create list of items
data = []
space = []

for string in txt[3:]:
    if string != '':
        space.append(string)
    else:
        data.append(space)
        space = []

In [4]:
#check size of data
len(data)

548552

In [5]:
#sample data entry
data[2:3]

[['Id:   2',
  'ASIN: 0738700797',
  '  title: Candlemas: Feast of Flames',
  '  group: Book',
  '  salesrank: 168596',
  '  similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]',
  '  reviews: total: 12  downloaded: 12  avg rating: 4.5',
  '    2001-12-16  cutomer: A11NCO6YTE4BTJ  rating: 5  votes:   5  helpful:   4',
  '    2002-1-7  cutomer:  A9CQ3PLRNIR83  rating: 4  votes:   5  helpful:   5',
  '    2002-1-24  cutomer: A13SG9ACZ9O5IM  rating: 5  votes:   8  helpful:   8',
  '    2002-1-28  cutomer: A1BDAI6VEYMAZA  rating: 5  votes:   4  helpful:   4',
  '    2002-2-6  cutomer: A2P6KAWXJ16234  rating: 4  votes:  16  helpful:  16',
  '    2002-2-14  cutomer:  AMACWC3M7PQFR  rating: 4  votes:   5  helpful:   5',
  '    2002-3-23

In [6]:
#extract ratings data into a dictionary, skipping discontinued products
rating = {}

for d in data:
    id_, total, downloaded, avg_rating = '', '', '', ''
    skip = False  # reset for every product so the flag is always defined
    for item in d:
        if item.startswith('Id:'):
            id_ = item.split()[-1]
        elif item.startswith('  reviews:'):
            # e.g. 'reviews: total: 12  downloaded: 12  avg rating: 4.5'
            fields = item.split()
            total, downloaded, avg_rating = fields[2], fields[4], fields[7]
        elif item.startswith('  discontinued product'):
            skip = True
    if not skip:
        rating[id_] = [total, downloaded, avg_rating]

In [7]:
#create ratings dataframe
ratings = pd.DataFrame.from_dict(rating)
ratings = ratings.T.reset_index(drop=False)
ratings.columns = ['id', 'total', 'downloaded', 'avg_rating']
ratings['id']  = ratings['id'].astype(int)
ratings['total']  = ratings['total'].astype(int)
ratings['downloaded']  = ratings['downloaded'].astype(int)
ratings['avg_rating']  = ratings['avg_rating'].astype(float)
ratings = ratings.sort_values('id')
ratings.head()

Unnamed: 0,id,total,downloaded,avg_rating
0,1,2,2,5.0
109901,2,12,12,4.5
219673,3,1,1,5.0
329476,4,1,1,4.0
439370,5,0,0,0.0


In [8]:
#remaining products that are still available
len(ratings)

542684

In [9]:
#sum of totals, download and mean of all ratings
sum(ratings.total), sum(ratings.downloaded), ratings.avg_rating.mean()

(7781990, 7593244, 3.2095344620442097)

In [10]:
#total reviews compared to total downloads
sum(ratings.total)/float(sum(ratings.downloaded))

1.0248570966506543

Assuming that every review is written by someone who downloaded the product, the ratings already look problematic: there are about 2.5% more reviews in total than downloads. This needs to be investigated further.

In [11]:
#are there products that have more reviews than downloads?
r = ratings[ratings['total'] > ratings['downloaded']]
r.head()

Unnamed: 0,id,total,downloaded,avg_rating
102214,193,261,260,3.0
292160,366,10,5,4.5
295450,369,416,5,5.0
297636,371,416,415,5.0
300933,374,7,5,3.5


In [12]:
#how many had more reviews than downloads?
len(r)

8615

Some products show a huge difference between total reviews and downloads, as can be seen for item id 369 (416 reviews against 5 downloads).

In [13]:
#create a column to show the difference and sort
ratings['difference'] = ratings['total'] - ratings['downloaded']
ratings = ratings.sort_values('difference', ascending=False)
ratings.head()

Unnamed: 0,id,total,downloaded,avg_rating,difference
52947,148185,5034,5,5.0,5029
542124,99487,5033,5,5.0,5028
31509,128673,4922,5,5.0,4917
307143,379661,2925,5,4.5,2920
166425,251503,2925,5,4.5,2920


The product with the biggest difference had a total of only 5 downloads. This points to a problem: many reviews appear to have been written by people who never obtained the actual product, which puts the review system's credibility in question.

In [14]:
#lets get an example
data[148185]

['Id:   148185',
 'ASIN: 059035342X',
 "  title: Harry Potter and the Sorcerer's Stone (Book 1)",
 '  group: Book',
 '  salesrank: 746',
 '  similar: 5  0439064864  0439136350  0439139600  043935806X  B00005JMAH',
 '  categories: 7',
 "   |Books[283155]|Subjects[1000]|Children's Books[4]|Literature[2966]|Humorous[3003]",
 "   |Books[283155]|Subjects[1000]|Children's Books[4]|Literature[2966]|Science Fiction, Fantasy, Mystery & Horror[3013]|Science Fiction, Fantasy, & Magic[3017]",
 "   |Books[283155]|Subjects[1000]|Children's Books[4]|Ages 9-12[2786]|General[170063]",
 "   |Books[283155]|Subjects[1000]|Children's Books[4]|Authors & Illustrators, A-Z[170540]|( R )[170558]|Rowling, J.K.[285272]|General[285273]",
 "   |Books[283155]|Subjects[1000]|Children's Books[4]|Series[3302]|Fantasy & Adventure[3328]|Harry Potter Books[281785]|General[285696]",
 "   |Books[283155]|Subjects[1000]|Children's Books[4]|Authors & Illustrators, A-Z[170540]|( R )[170558]|Rowling, J.K.[285272]|Paperback[2864

This particular id has the biggest difference between reviews and downloads. It is no doubt a best seller, and people reviewing it may have bought it elsewhere, but it is also possible that some reviews were written by people or bots that have not read the book.

In [15]:
#lets look at another example
data[81]

['Id:   81',
 'ASIN: 6304286961',
 '  title: The Doors',
 '  group: Video',
 '  salesrank: 10217',
 '  similar: 5  0783233485  0679726225  6305603847  1592400647  0679734627',
 '  categories: 20',
 '   |[139452]|VHS[404272]|Actors & Actresses[140]|( B )[140748]|Burkley, Dennis[142653]',
 '   |[139452]|VHS[404272]|Actors & Actresses[140]|( D )[144551]|Dillon, Kevin[145163]',
 '   |[139452]|VHS[404272]|Actors & Actresses[140]|( I )[149429]|Idol, Billy[149446]',
 '   |[139452]|VHS[404272]|Actors & Actresses[140]|( K )[150127]|Kilmer, Val[150602]',
 '   |[139452]|VHS[404272]|Actors & Actresses[140]|( M )[152234]|Maberly to Mazzello[769018]|Maclachlan, Kyle[152313]',
 '   |[139452]|VHS[404272]|Actors & Actresses[140]|( M )[152234]|Maberly to Mazzello[769018]|Madsen, Michael[152359]',
 '   |[139452]|VHS[404272]|Actors & Actresses[140]|( Q )[156257]|Quinlan, Kathleen[156298]',
 '   |[139452]|VHS[404272]|Actors & Actresses[140]|( R )[156323]|Ryan, Meg[157461]',
 '   |[139452]|VHS[404272]|Actor

Looking at the reviews in this example, it can be seen that some came from the same customer but carried different ratings.
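
A check like this can be automated rather than done by eye. The following is a minimal sketch, assuming the review-line layout shown in the sample entries (including the raw file's `cutomer:` spelling). The function name `repeat_reviewers` and the `example` record are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

def repeat_reviewers(record):
    """Return {customer_id: [ratings]} for customers with more than one review.

    `record` is one product entry: a list of lines in the raw file's format.
    """
    ratings_by_customer = defaultdict(list)
    for line in record:
        tokens = line.split()
        # Review lines split into 9 tokens:
        # date, 'cutomer:', id, 'rating:', n, 'votes:', n, 'helpful:', n
        if len(tokens) == 9 and tokens[1] == 'cutomer:':
            ratings_by_customer[tokens[2]].append(int(tokens[4]))
    return {c: r for c, r in ratings_by_customer.items() if len(r) > 1}

# Hypothetical record, for illustration only.
example = [
    'Id:   0',
    '  reviews: total: 3  downloaded: 3  avg rating: 3.5',
    '    2001-1-1  cutomer: AAA  rating: 5  votes:   2  helpful:   1',
    '    2001-1-2  cutomer: BBB  rating: 4  votes:   1  helpful:   1',
    '    2001-1-3  cutomer: AAA  rating: 1  votes:   3  helpful:   0',
]
print(repeat_reviewers(example))
```

Running it over every entry in `data` would give a rough count of products with repeat reviewers, which could then be compared against the rating spread of those reviews.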

In [16]:
#percent of products with an average rating of 0
round(len(ratings[ratings['avg_rating']==0])/float(len(ratings)) * 100, 2)

25.79

In [17]:
#percent of products with an average rating of 5
round(len(ratings[ratings['avg_rating']==5])/float(len(ratings)) * 100, 2)

26.87

### Answer to A1

Looking at just the review totals and the download totals, there appear to be more reviews than actual downloads in the data. This puts the rating system's credibility into question, since ratings are assumed to come only from individuals who downloaded the product. This matters because many reviewers may be reviewing the product in general rather than the product as it came from this merchant.

Also, some customers may be making questionable reviews, as in the example above. It doesn't make sense for one customer to post multiple reviews with different ratings on one product. This might be a sign of rating manipulation.

Suggestions for further analysis:
 - An in-depth analysis of customers can be done to check for anomalous behavior among customers who write reviews.
 - Given more data about the actual reviews, Natural Language Processing (NLP) can be used to go beyond the ratings.
 - The votes and helpful counts of each review can also be examined to determine how much weight that review should carry in the rating.
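
The last suggestion could be sketched as a helpfulness-weighted average, where each review counts in proportion to the share of voters who found it helpful. This is only one possible weighting, not something the dataset prescribes. The function name `helpfulness_weighted_rating` is hypothetical, and the tuples are the first six reviews of product Id 2 from the sample entry shown earlier.

```python
def helpfulness_weighted_rating(reviews):
    """Average rating weighted by helpful/votes.

    `reviews` is a list of (rating, votes, helpful) tuples.
    Reviews with no votes get zero weight; returns None if no review has weight.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for rating, votes, helpful in reviews:
        weight = helpful / votes if votes else 0.0
        weighted_sum += rating * weight
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None

# (rating, votes, helpful) for the first six reviews of product Id 2
id2_reviews = [(5, 5, 4), (4, 5, 5), (5, 8, 8), (5, 4, 4), (4, 16, 16), (4, 5, 5)]
print(helpfulness_weighted_rating(id2_reviews))
```

A large gap between this figure and the plain average for a product would mark it as a candidate for closer inspection. Giving unvoted reviews zero weight is a design choice that may be too harsh for products with few votes.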

### 2. Category bloat

Consider the product group named 'Books'. Each product in this group is associated with categories. Naturally, with categorization, there are tradeoffs between how broad or specific the categories must be.

For this dataset, quantify the following:

    a. Is there redundancy in the categorization? How can it be identified/removed?
    b. Is it possible to reduce the number of categories drastically (say to 10% of existing categories) by sacrificing relatively few category entries (say close to 10%)?

In [18]:
#sample data to check categories
data[1]

['Id:   1',
 'ASIN: 0827229534',
 '  title: Patterns of Preaching: A Sermon Sampler',
 '  group: Book',
 '  salesrank: 396585',
 '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
 '  categories: 2',
 '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
 '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
 '  reviews: total: 2  downloaded: 2  avg rating: 5',
 '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
 '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5']

Looking at this sample, we can already reduce category redundancy by removing just the last sub-category. For this product, that immediately cuts the number of categories in half, because its two category paths collapse into one.
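
As a small illustration of that idea, the sketch below drops the last level from the two category paths of product Id 1 and shows that they become identical. The helper name `truncate` is hypothetical.

```python
def truncate(path, levels=1):
    """Drop the last `levels` sub-categories from a '|'-separated category path."""
    parts = path.strip().split('|')[1:]  # skip the empty string before the leading '|'
    kept = parts[:-levels] if levels else parts
    return '|' + '|'.join(kept)

paths = [
    '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
    '|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
]
merged = {truncate(p) for p in paths}
print(len(paths), '->', len(merged))
print(merged)
```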

In [19]:
#extract category data into lists
category = []
ids = []
book_count = 0

for d in data:
    group, id_ = '', ''
    for item in d:
        if item.startswith('  group:'):
            group = item.split()[-1]
        if item.startswith('Id:'):
            id_ = item.split()[-1]
        if group == 'Book':
            if item.startswith('   |'):
                category.append(item.strip())
                ids.append(id_)
    if group == 'Book':
        book_count += 1

In [20]:
#check totals
len(ids), len(category), book_count

(1440329, 1440329, 393561)

In [21]:
round(len(category)/float(book_count), 2)

3.66

This further emphasizes that we have several categories for each book.

In [22]:
#create categories dataframe
categories = pd.DataFrame({'id': ids,'category': category})
categories = categories[['id', 'category']]
categories['sub_cat'] = [len(branch.split('|')[1:]) for branch in categories['category']]
categories.head()

Unnamed: 0,id,category,sub_cat
0,1,|Books[283155]|Subjects[1000]|Religion & Spiri...,6
1,1,|Books[283155]|Subjects[1000]|Religion & Spiri...,6
2,2,|Books[283155]|Subjects[1000]|Religion & Spiri...,5
3,2,|Books[283155]|Subjects[1000]|Religion & Spiri...,5
4,3,|Books[283155]|Subjects[1000]|Home & Garden[48...,5


In [23]:
#totals of books and category branches, and categories per book
book_count, len(category), round(len(category)/float(book_count), 2)

(393561, 1440329, 3.66)

Judging by the average number of categories per book, a single item is already assigned to a lot of categories.

In [24]:
#number of unique categories
len(set(category))

12853

In [25]:
#create a list of ids with more than 1 category
counts = categories['id'].value_counts()
cat_list = list(counts[counts > 1].index)

In [26]:
#dataframe of ids with more than 1 category (copied so new columns can be added safely)
cats = categories[categories['id'].isin(cat_list)].copy()
#number of unique categories
len(set(cats['category']))

12833

In [27]:
#list of categories that appear only on single-category books, for later use
single = list(set(category) - set(cats['category']))

In [28]:
#create new columns with the last one (cat_1) and last two (cat_2) sub-categories removed
cats['cat_1'] = ['|'.join(branch.split('|')[1:-1]) for branch in cats['category']]
cats['cat_2'] = ['|'.join(branch.split('|')[1:-2]) for branch in cats['category']]
cats.head()


Unnamed: 0,id,category,sub_cat,cat_1,cat_2
0,1,|Books[283155]|Subjects[1000]|Religion & Spiri...,6,Books[283155]|Subjects[1000]|Religion & Spirit...,Books[283155]|Subjects[1000]|Religion & Spirit...
1,1,|Books[283155]|Subjects[1000]|Religion & Spiri...,6,Books[283155]|Subjects[1000]|Religion & Spirit...,Books[283155]|Subjects[1000]|Religion & Spirit...
2,2,|Books[283155]|Subjects[1000]|Religion & Spiri...,5,Books[283155]|Subjects[1000]|Religion & Spirit...,Books[283155]|Subjects[1000]|Religion & Spirit...
3,2,|Books[283155]|Subjects[1000]|Religion & Spiri...,5,Books[283155]|Subjects[1000]|Religion & Spirit...,Books[283155]|Subjects[1000]|Religion & Spirit...
5,4,|Books[283155]|Subjects[1000]|Religion & Spiri...,7,Books[283155]|Subjects[1000]|Religion & Spirit...,Books[283155]|Subjects[1000]|Religion & Spirit...


In [29]:
#minimum number of sub-categories per book
min(categories['sub_cat'])

3

In [30]:
#remaining unique categories and percentage from total unique categories if one sub category was removed
len(set(list(cats['cat_1']) + single)), round(len(set(list(cats['cat_1']) + single))/float(len(set(categories['category'])))*100, 2)

(2024, 15.75)

In [31]:
#remaining unique categories and percentage from total unique categories if two sub categories were removed
len(set(list(cats['cat_2']) + single)), round(len(set(list(cats['cat_2']) + single))/float(len(set(categories['category'])))*100, 2)

(544, 4.23)

If we simply remove the last sub-category of all categories, we drastically reduce the number of categories to 15.75% of the original number. Furthermore, if the last two sub-categories are removed, we can reduce the total to 4.23% of the original number of categories.

### Answer to A2

There is evidence of redundancy in the categorization, as can be seen in the sub-categories of the very first product id (Preaching and Sermons). By favoring a broader categorization and removing the last 2 sub-category levels, we can reduce the total number of categories to 4.23% of the original number.
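
Question (b) can also be approached by frequency instead of truncation: keep only the most common categories and measure what share of category entries they cover. The sketch below is a different technique from the one used above, shown on made-up data. The function name `coverage_of_top_categories` and the `toy_entries` list are hypothetical.

```python
from collections import Counter

def coverage_of_top_categories(category_entries, keep_fraction=0.10):
    """Keep the most frequent `keep_fraction` of unique categories.

    Returns (number of categories kept, share of entries they cover).
    """
    counts = Counter(category_entries)
    n_keep = max(1, int(len(counts) * keep_fraction))
    covered = sum(count for _, count in counts.most_common(n_keep))
    return n_keep, covered / len(category_entries)

# Made-up data: 10 unique categories, 100 entries, one dominant category.
toy_entries = ['A'] * 91 + list('BCDEFGHIJ')
n_kept, share = coverage_of_top_categories(toy_entries)
print(n_kept, share)
```

Applied to the `category` list built earlier, one minus the returned share would be the fraction of category entries sacrificed by keeping 10% of the categories.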

## B. (Suggested duration: 30 mins)

Give the number crunching a rest! Just think about these problems.

### 1. Algorithm thinking

How would you build the product categorization from scratch, using similar/co-purchased information?

I would use K-Means clustering on the incomplete vectors of co-purchase information, trying several values of k to settle on an appropriate number of categories. Each resulting cluster of items that belong together would then be given a category label.
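
To make the idea concrete, here is a minimal sketch of K-Means on binary co-purchase vectors, written in plain Python so it runs without extra libraries. It assumes each item is represented by a 0/1 vector marking itself and the items in its `similar` list. The `similar` dictionary below is made up, the function names are hypothetical, and the farthest-point initialisation is a simplification chosen to keep the example deterministic.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    # Deterministic farthest-point initialisation: start from the first vector,
    # then repeatedly add the vector farthest from the centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up co-purchase lists (item -> similar items), for illustration only.
similar = {
    'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b'],
    'x': ['y', 'z'], 'y': ['x', 'z'], 'z': ['x', 'y'],
}
items = sorted(similar)
vectors = [[1 if (j == i or j in similar[i]) else 0 for j in items] for i in items]
labels = kmeans(vectors, k=2)
print(dict(zip(items, labels)))
```

On the real data the vectors would be very sparse and high-dimensional, so a sparse representation and a library implementation would likely be needed. K-Means also does not pick k by itself; k would have to be chosen by comparing cluster quality across several values.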

### 2. Product thinking

Now, put on your 'product thinking' hat.

    a. Is it a good idea to show users the categorization hierarchy for items?
    b. Is it a good idea to show users similar/co-purchased items?
    c. Is it a good idea to show users reviews and ratings for items?
    d. For each of the above, why? How will you establish the same?

It's a good idea to show categorization hierarchies to users as long as they don't get too specific. They are a good way to learn about the user's preferences, which helps create better recommendations based on what the user chooses.

It's a good idea to include similar/co-purchased items, as they are a good way of providing recommendations to users. As long as they are not the main focus of a page, they will be very effective in showing users things they would otherwise not have thought about.

Reviews and ratings have become a really powerful source of information for users. A lot of users nowadays rely on reviews and ratings in decision making, so it is a good idea to include them, with the caveat that reviews must be consistently monitored for manipulation.

In building a product, it is best to think about the user experience. I would put a lot of priority on building a great recommendation system so that users need not bother too much with looking through specific categories for certain products. This would be built using similar or co-purchased items. To help users with decision making, the reviews and ratings system would be thoroughly checked using algorithms that detect faked reviews and ratings. This should build users' confidence that the reviews they're looking at are genuine.