# Metadata Recommendation Module
### Recommendations via products "also bought" and "bought after viewing"

This module generates recommendations based on Amazon metadata generously provided by Julian McAuley of UCSD. The metadata includes descriptions, price, sales rank, brand info, and co-purchasing links based on products sold on Amazon. The metadata file for Kindle books contains 434,702 products. This module also uses a second dataset, a collection of 982,619 reviews of Amazon Kindle products.

Our goal remains to recommend a list of ranked items to purchase given only a user_id. Here, we implemented the recommendation engine according to the following process:
1. Import and clean both the reviews dataset and the metadata dataset.
2. Given a user_id, extract all products that the user reviewed and filter them by products that the user actually likes (here, we consider this to be products for which the user's review rating was greater than or equal to 4/5).
3. Initialize a pandas Series called running_list which will store all products related to all products that the user has positively reviewed. Iterate through the list of positively reviewed products. For each:
    * Look up the product's row in the metadata dataset and read its 'related' column, which maps relationship types to lists of product IDs.
    * For the also_bought and buy_after_viewing lists, compute each related product's mean review score from the reviews dataset, keep the products whose mean score is at least 4.0, and append the results to running_list.
4. Clean up running_list by dropping duplicates and products that have already been reviewed by the user.
5. Sort the list by its mean review scores and return the final list.
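
The steps above can be sketched compactly on toy data. This is a minimal illustration, not the module's implementation: the function name `recommend_sketch`, its `min_rating` parameter, and all data below are invented, and it is written for Python 3 with a recent pandas. Unlike the step-by-step description, it collects the related product IDs into a set first, so each candidate is scored exactly once, and it excludes every product the user has reviewed, positively or not.

```python
import pandas as pd

def recommend_sketch(user_id, reviews, meta, min_rating=4):
    """Rank related products for one user (toy restatement of steps 2-5)."""
    # Step 2: products the user rated at least `min_rating`.
    mine = reviews[reviews['reviewerID'] == user_id]
    liked = set(mine.loc[mine['overall'] >= min_rating, 'asin'])

    # Step 3: gather related product IDs from the two co-purchase lists.
    candidates = set()
    for related in meta.loc[meta['asin'].isin(liked), 'related']:
        if not isinstance(related, dict):  # a missing 'related' cell is NaN
            continue
        for key in ('also_bought', 'buy_after_viewing'):
            candidates.update(related.get(key, []))

    # Step 4: never recommend something the user has already reviewed.
    candidates -= set(mine['asin'])

    # Steps 3 and 5: mean score per candidate, keep well-rated ones, sort.
    scores = (reviews[reviews['asin'].isin(candidates)]
              .groupby('asin')['overall'].mean())
    return scores[scores >= 4.0].sort_values(ascending=False)

# Toy data, invented for illustration only.
reviews = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2', 'u3', 'u2', 'u3'],
    'asin':       ['A',  'B',  'C',  'C',  'D',  'E'],
    'overall':    [5,    2,    5,    4,    3,    5],
})
meta = pd.DataFrame({
    'asin': ['A', 'B'],
    'related': [{'also_bought': ['C', 'D'], 'buy_after_viewing': ['E', 'A']},
                {'also_bought': ['D']}],
})
result = recommend_sketch('u1', reviews, meta)
```

Here user `u1` liked only product `A`, so the candidates are `C`, `D`, and `E`; `D` is filtered out because its mean score is below 4.0.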

In [10]:
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cmx
import matplotlib.colors as colors
import statsmodels.formula.api as sm
import pandas as pd
from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.decomposition import PCA
import json
import gzip
%matplotlib inline

In [13]:
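def parse(path):
    # Read the gzipped file line by line; each line is evaluated as a
    # Python dict literal.
    g = gzip.open(path, 'rb')
    for l in g:
        # eval executes arbitrary code, so only use it on a trusted file.
        yield eval(l)

def getDF(path):
    # Key each record by its line position, then build one row per record.
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')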

df = getDF('meta_Kindle_Store.json.gz')
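
Because `eval` will run whatever code a line contains, a more defensive variant may be worth considering. The sketch below swaps in the standard library's `ast.literal_eval`, which only accepts Python literals. It assumes Python 3 (for text-mode `gzip.open`) and that every line of the file is a plain dict literal; the name `parse_literal` and the demo file are invented for illustration.

```python
import ast
import gzip
import os
import tempfile

def parse_literal(path):
    """Yield one dict per non-empty line, accepting only Python literals."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield ast.literal_eval(line)

# Demo on a tiny invented file in the same one-dict-per-line layout.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_meta.json.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as handle:
    handle.write("{'asin': 'X1', 'related': {'also_bought': ['X2']}}\n")

records = list(parse_literal(demo_path))
```

A line that contains anything other than a literal (for example a function call) raises `ValueError` instead of being executed.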

In [101]:
print len(df)
df.head()

434702


| | asin | description | price | imUrl | related | categories | title | salesRank | brand |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1603420304 | In less time and for less money than it takes ... | 7.69 | http://ecx.images-amazon.com/images/I/51IEqPrF... | {u'also_viewed': [u'B001OLRKLQ', u'B004J35JIC'... | [[Books, Cookbooks, Food & Wine, Quick & Easy]... | | | |
| 1 | B0002IQ15S | This universal DC adapter powers/charges porta... | 19.99 | http://ecx.images-amazon.com/images/I/21QFJM28... | {u'also_viewed': [u'B00511PS3C', u'B000PI17MM'... | [[Kindle Store, Kindle Accessories, Power Adap... | Mobility IGO AUTOPOWER 3000 SERIES ( PS0221-10 ) | {} | |
| 2 | B000F83SZQ | | 0.0 | http://ecx.images-amazon.com/images/I/51yLqHe%... | {u'also_bought': [u'B0080H1C0W', u'B00LK4ZKOG'... | [[Books, Literature & Fiction], [Books, Myster... | | | |
| 3 | B000F83TEQ | | | http://ecx.images-amazon.com/images/I/2136NBNV... | {u'also_bought': [u'B00IS81LFO', u'B000FA5T6A'... | [[Books, Literature & Fiction], [Books, Myster... | | | |
| 4 | B000F83STC | | | http://g-ecx.images-amazon.com/images/G/01/x-s... | | [[Books, Literature & Fiction, Erotica], [Kind... | | | |


In [17]:
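# reviews.json is read line by line, one JSON object per line; wrap the
# lines in brackets and join them with commas to form a single JSON array.
with open('reviews.json', 'rb') as f:
    data = f.readlines()

data = map(lambda x: x.rstrip(), data)

data_json_str = "[" + ','.join(data) + "]"

data_df = pd.read_json(data_json_str)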
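
When the input really is one JSON object per line, as the join above implies, `pd.read_json` can also read that layout directly through its `lines=True` option, which avoids building one large string in memory. A minimal sketch on two invented records, assuming a recent pandas:

```python
import io
import pandas as pd

# Two invented records in the one-JSON-object-per-line layout.
sample = io.StringIO(
    '{"asin": "X1", "overall": 5, "reviewerID": "u1"}\n'
    '{"asin": "X2", "overall": 3, "reviewerID": "u2"}\n'
)
sample_df = pd.read_json(sample, lines=True)
```

For the real file, the equivalent call would be along the lines of `pd.read_json('reviews.json', lines=True)`.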

In [103]:
print len(data_df)
data_df.head()

982619


| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | B000F83SZQ | [0, 0] | 5 | I enjoy vintage books and movies so I enjoyed ... | 05 5, 2014 | A1F6404F1VG29J | Avidreader | Nice vintage story | 1399248000 |
| 1 | B000F83SZQ | [2, 2] | 4 | This book is a reissue of an old one; the auth... | 01 6, 2014 | AN0N05A9LIJEQ | critters | Different... | 1388966400 |
| 2 | B000F83SZQ | [2, 2] | 4 | This was a fairly interesting read. It had ol... | 04 4, 2014 | A795DMNCJILA6 | dot | Oldie | 1396569600 |
| 3 | B000F83SZQ | [1, 1] | 5 | I'd never read any of the Amy Brewster mysteri... | 02 19, 2014 | A1FV0SX13TWVXQ | Elaine H. Turley "Montana Songbird" | I really liked it. | 1392768000 |
| 4 | B000F83SZQ | [0, 1] | 4 | If you like period pieces - clothing, lingo, y... | 03 19, 2014 | A3SPTOKDG7WBLN | Father Dowling Fan | Period Mystery | 1395187200 |


In [104]:
def recommend_products(user_id):

    # Extract products positively reviewed by that user (rating >= 4)
    products_reviewed = data_df[data_df['reviewerID'] == user_id]
    positive_reviews = products_reviewed[products_reviewed['overall'] >= 4]
    positive_products = list(positive_reviews['asin'])

    # Mean scores of all related products, indexed by asin
    running_list = pd.Series()

    for product in positive_products:
        related_products = df[df['asin'] == product]['related']
        for entry in related_products:
            # Products without related links hold NaN here instead of a dict
            if not isinstance(entry, dict):
                continue
            for key in entry.keys():

                # Find review data for products also bought / bought after viewing,
                # keep those with a mean rating of at least 4.0, and store them
                # in running_list
                if key in ('also_bought', 'buy_after_viewing'):
                    product_ids = list(entry[key])
                    related_reviews = data_df[data_df['asin'].isin(product_ids)]
                    mean_scores = related_reviews.groupby('asin')['overall'].mean()
                    mean_scores = mean_scores[mean_scores >= 4.0]
                    running_list = running_list.append(mean_scores)

    # Delete duplicates from running_list.
    # NOTE: Series.drop_duplicates() compares values (the mean scores), not the
    # asin index, so products that share a mean score are collapsed as well.
    running_list = running_list.drop_duplicates()

    # Drop products the user has already positively reviewed
    for product in positive_products:
        if product in running_list.index:
            running_list = running_list.drop(product)

    final_list = running_list.sort_values(ascending = False)

    return final_list
    
print recommend_products('A1F6404F1VG29J')

B00LK30NEY    5.000000
B00IPY385M    4.916667
B00JLJFJ4S    4.909091
B00K1NURL8    4.900000
B00LMSRVAQ    4.888889
B00K1RB5JW    4.875000
B00GNICK4M    4.857143
B008KFO0AI    4.846154
B00L7BV2F8    4.837209
B00KFRDWMQ    4.833333
B00EI3E0T2    4.818182
B0089PKMHO    4.812500
B00HNSIOHS    4.806452
B00K6H2D5W    4.800000
B00H92XFA4    4.777778
B00F1G8L5O    4.769231
B008RZQYN2    4.750000
B006I69YPC    4.745763
B00KL13WXK    4.727273
B00KVMWJRY    4.714286
B00F4BWD46    4.710526
B00H9WPHXM    4.700000
B00A9HITFM    4.697674
B00FXUEUBW    4.687500
B008TA811I    4.684211
B00FGJVK1I    4.681818
B00K00LEOG    4.666667
B009ZJRF00    4.650000
B008V4FRSM    4.645833
B00FKEES0Y    4.631579
                ...   
B00GWTWW5O    4.185185
B00IR2QR3M    4.181818
B00CPVEPSU    4.178571
B00HRIJVFS    4.172414
B006HCTWVS    4.169231
B00ICP5JLK    4.166667
B007U7WMI4    4.164384
B00D1CPG5I    4.151515
B0097HAPDY    4.146341
B00DPKRINE    4.142857
B005HAWAZG    4.136000
B0071DWXDQ    4.133333
B00B60R6W8 
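
Every score in the output above appears exactly once. A likely reason is that `Series.drop_duplicates()` compares values rather than index labels, so two different products with the same mean score count as duplicates. The sketch below contrasts that with de-duplicating on the product ID, using `Index.duplicated()`; the scores and IDs are invented for illustration.

```python
import pandas as pd

# Invented scores: P1 was collected twice; P2 and P3 merely share a mean.
scores = pd.Series([4.5, 4.5, 4.8, 4.8], index=['P1', 'P1', 'P2', 'P3'])

by_value = scores.drop_duplicates()              # compares the ratings
by_product = scores[~scores.index.duplicated()]  # compares the product IDs
```

`by_value` keeps only `P1` and `P2`, silently losing `P3`, whereas `by_product` keeps one row for each of the three products.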

## Model Evaluation

To evaluate this model, we walk through the steps of the recommendation engine for one example user, A1F6404F1VG29J, and check whether the resulting recommendations make intuitive sense.

In [105]:
products_reviewed = data_df[data_df['reviewerID'] == 'A1F6404F1VG29J']
positive_reviews = products_reviewed[products_reviewed['overall'] >= 4]
positive_reviews

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | B000F83SZQ | [0, 0] | 5 | I enjoy vintage books and movies so I enjoyed ... | 05 5, 2014 | A1F6404F1VG29J | Avidreader | Nice vintage story | 1399248000 |
| 57149 | B004TS8P7O | [0, 0] | 4 | This book take place years ago. I like vintag... | 06 7, 2014 | A1F6404F1VG29J | Avidreader | Vintage western | 1402099200 |
| 163614 | B006WAZRZK | [0, 0] | 5 | I enjoyed this book. I downloaded it because ... | 12 29, 2013 | A1F6404F1VG29J | Avidreader | good story | 1388275200 |
| 222667 | B0080S6OKE | [0, 0] | 5 | I liked this book because Saucy is a unique pe... | 12 29, 2013 | A1F6404F1VG29J | Avidreader | Funky character | 1388275200 |
| 278339 | B008VFI5D0 | [0, 0] | 5 | This vintage mystery. The characters were wel... | 05 24, 2014 | A1F6404F1VG29J | Avidreader | Really enjoy vintage novels | 1400889600 |
| 596437 | B00E5JMQP4 | [1, 1] | 5 | Good story. Enjoyed learning about a group I ... | 06 28, 2014 | A1F6404F1VG29J | Avidreader | Good story. Enjoyed learning about a group I n... | 1403913600 |
| 601364 | B00E8JE8DE | [0, 0] | 5 | Cute story about 2 orphans. Nice characters. ... | 06 28, 2014 | A1F6404F1VG29J | Avidreader | Nice characters. Exciting for young people | 1403913600 |
| 863592 | B00J16SQFU | [0, 0] | 5 | I enjoyed the book. The characters were belie... | 04 29, 2014 | A1F6404F1VG29J | Avidreader | Really enjoyed the book. | 1398729600 |
| 957476 | B00KWHKH9K | [2, 2] | 5 | Interesting characters. Good story line. Lik... | 07 3, 2014 | A1F6404F1VG29J | Avidreader | Good story line | 1404345600 |


This user has written 9 positive reviews. Most of them have received no helpfulness votes; the two that have ([1, 1] and [2, 2]) were rated helpful by every voter. Let's find out a bit more about the products this user has reviewed.
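
The `helpful` column can be turned into a per-review ratio to make statements about helpfulness checkable. A minimal sketch, assuming each cell is a two-element list of the form [helpful votes, total votes]; the column names `helpful_votes`, `total_votes`, and `ratio`, and the sample rows, are invented for illustration.

```python
import pandas as pd

# Invented rows; each 'helpful' cell is assumed to be [helpful, total].
sample = pd.DataFrame({
    'asin': ['X1', 'X2', 'X3'],
    'helpful': [[0, 0], [2, 2], [1, 4]],
})

votes = pd.DataFrame(sample['helpful'].tolist(),
                     columns=['helpful_votes', 'total_votes'],
                     index=sample.index)

# Reviews nobody voted on get NaN rather than a misleading 0 or 1.
voted = votes['total_votes'].where(votes['total_votes'] > 0)
votes['ratio'] = votes['helpful_votes'] / voted
```

Reviews with no votes are kept apart from reviews that were voted unhelpful, which matters here because most of this user's reviews are `[0, 0]`.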

In [108]:
positive_products = list(positive_reviews['asin'])
for i in df[df['asin'].isin(positive_products)]['categories']:
    print i

[['Books', 'Literature & Fiction'], ['Books', 'Mystery, Thriller & Suspense', 'Thrillers & Suspense', 'Suspense'], ['Kindle Store', 'Kindle eBooks', 'Mystery, Thriller & Suspense', 'Suspense']]
[['Books', 'Mystery, Thriller & Suspense', 'Mystery', 'Hard-Boiled'], ['Kindle Store', 'Kindle eBooks', 'Nonfiction']]
[['Books', 'Literature & Fiction'], ['Books', 'Mystery, Thriller & Suspense', 'Mystery', 'Hard-Boiled'], ['Books', 'Mystery, Thriller & Suspense', 'Thrillers & Suspense', 'Crime'], ['Kindle Store', 'Kindle eBooks', 'Mystery, Thriller & Suspense', 'Crime Fiction'], ['Kindle Store', 'Kindle eBooks', 'Mystery, Thriller & Suspense', 'Mystery', 'Hard-Boiled']]
[['Books', 'Literature & Fiction'], ['Books', 'Mystery, Thriller & Suspense', 'Mystery', 'Cozy'], ['Books', 'Mystery, Thriller & Suspense', 'Mystery', 'Women Sleuths'], ['Books', 'Mystery, Thriller & Suspense', 'Thrillers & Suspense', 'Suspense'], ['Kindle Store', 'Kindle eBooks', 'Mystery, Thriller & Suspense', 'Mystery', 'Wom

The products positively reviewed by Avidreader mostly fall into the categories "Literature & Fiction," "Mystery, Thriller & Suspense," "Women Sleuths," "Romance," and "Crime Fiction." This user appears to have a particular taste in books, which is useful to keep in mind when reviewing the products we recommend.
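
Rather than reading the nested lists by eye, the category labels can be tallied. The sketch below counts, for each label, how many products carry it. The helper `count_labels` and the sample cells are invented for illustration; the cells only mimic the nested-list shape of the 'categories' column.

```python
from collections import Counter

def count_labels(category_cells):
    """Count how many products carry each category label."""
    counts = Counter()
    for paths in category_cells:  # one cell per product
        labels = {label for path in paths for label in path}
        counts.update(labels)     # a label counts once per product
    return counts

# Invented cells in the same nested-list shape as the 'categories' column.
cells = [
    [['Books', 'Mystery, Thriller & Suspense', 'Mystery', 'Cozy'],
     ['Kindle Store', 'Kindle eBooks', 'Mystery, Thriller & Suspense']],
    [['Books', 'Literature & Fiction'],
     ['Books', 'Mystery, Thriller & Suspense', 'Mystery', 'Hard-Boiled']],
]
label_counts = count_labels(cells)
```

On the real data, the input would be something like `df[df['asin'].isin(positive_products)]['categories']`, and `label_counts.most_common(10)` would list the dominant labels.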

In [110]:
running_list = pd.Series()

for product in positive_products:
    related_products = df[df['asin'] == product]['related']
    for entry in related_products:
        # Products without related links hold NaN here instead of a dict
        if not isinstance(entry, dict):
            continue
        for key in entry.keys():

            # Find review data for products also bought / bought after viewing,
            # keep those with a mean rating of at least 4.0, and store them
            # in running_list
            if key in ('also_bought', 'buy_after_viewing'):
                product_ids = list(entry[key])
                related_reviews = data_df[data_df['asin'].isin(product_ids)]
                mean_scores = related_reviews.groupby('asin')['overall'].mean()
                mean_scores = mean_scores[mean_scores >= 4.0]
                running_list = running_list.append(mean_scores)

# Delete duplicates from running_list.
# NOTE: Series.drop_duplicates() compares values (the mean scores), not the
# asin index, so products that share a mean score are collapsed as well.
running_list = running_list.drop_duplicates()

# Drop products the user has already positively reviewed
for product in positive_products:
    if product in running_list.index:
        running_list = running_list.drop(product)

final_list = running_list.sort_values(ascending = False)

Now that we have our final list of recommendations, let us check whether the recommended books fall into categories similar to those of the products Avidreader originally reviewed.

In [115]:
book_list = final_list.index
top_20 = book_list[:20]
print top_20

Index([u'B00LK30NEY', u'B00IPY385M', u'B00JLJFJ4S', u'B00K1NURL8',
       u'B00LMSRVAQ', u'B00K1RB5JW', u'B00GNICK4M', u'B008KFO0AI',
       u'B00L7BV2F8', u'B00KFRDWMQ', u'B00EI3E0T2', u'B0089PKMHO',
       u'B00HNSIOHS', u'B00K6H2D5W', u'B00H92XFA4', u'B00F1G8L5O',
       u'B008RZQYN2', u'B006I69YPC', u'B00KL13WXK', u'B00KVMWJRY'],
      dtype='object')


In [116]:
for i in df[df['asin'].isin(top_20)]['categories']:
    print i

[['Books', 'Literature & Fiction', 'Humor'], ['Books', 'Mystery, Thriller & Suspense', 'Mystery'], ['Kindle Store', 'Kindle eBooks', 'Literature & Fiction', 'Humor & Satire', 'General Humor'], ['Kindle Store', 'Kindle eBooks', 'Mystery, Thriller & Suspense', 'Mystery']]
[['Books', 'Literature & Fiction'], ['Books', 'Science Fiction & Fantasy', 'Fantasy', 'Epic'], ['Books', 'Science Fiction & Fantasy', 'Fantasy', 'Sword & Sorcery'], ['Kindle Store', 'Kindle eBooks', 'Science Fiction & Fantasy', 'Fantasy', 'Coming of Age'], ['Kindle Store', 'Kindle eBooks', 'Science Fiction & Fantasy', 'Fantasy', 'Epic'], ['Kindle Store', 'Kindle eBooks', 'Science Fiction & Fantasy', 'Fantasy', 'Sword & Sorcery']]
[['Books', 'Literature & Fiction', 'Genre Fiction', 'Action & Adventure'], ['Books', 'Mystery, Thriller & Suspense', 'Thrillers & Suspense', 'Spies & Politics', 'Espionage'], ['Kindle Store', 'Kindle Short Reads', 'Two hours or more (65-100 pages)', 'Literature & Fiction'], ['Kindle Store', 'Ki

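The category lists printed above are encouraging: two of the three visible entries include "Mystery, Thriller & Suspense," which matches Avidreader's history. The match is not uniform, though, since one entry is epic fantasy, so we suspect that many, but not all, of these recommendations will fall within the interest and taste of Avidreader. Note that this check is qualitative and covers a single user.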
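
To put a number on this kind of comparison, one option is to measure what share of the recommended products has at least one category label in common with the products the user liked. The sketch below is one possible way to do that, not part of the module: the helpers `labels_of` and `share_overlapping`, the list of generic labels to ignore, and the sample cells are all assumptions made for illustration.

```python
def labels_of(paths, ignore=('Books', 'Kindle Store', 'Kindle eBooks')):
    """Flatten one product's category paths, dropping generic labels."""
    return {label for path in paths for label in path} - set(ignore)

def share_overlapping(liked_cells, recommended_cells):
    """Fraction of recommended products sharing a label with liked ones."""
    liked_labels = set()
    for paths in liked_cells:
        liked_labels |= labels_of(paths)
    hits = [bool(labels_of(paths) & liked_labels)
            for paths in recommended_cells]
    return sum(hits) / float(len(hits)) if hits else 0.0

# Invented cells in the nested-list shape of the 'categories' column.
liked = [[['Books', 'Mystery, Thriller & Suspense', 'Mystery']]]
recommended = [
    [['Books', 'Mystery, Thriller & Suspense', 'Thrillers & Suspense']],
    [['Books', 'Science Fiction & Fantasy', 'Fantasy', 'Epic']],
]
overlap = share_overlapping(liked, recommended)
```

Broad labels such as "Literature & Fiction" match almost any novel, so adding them to `ignore` may give a stricter and more informative score.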
