In [4]:
import gzip
import json
import pandas as pd
from pprint import pprint
from tqdm import tqdm
import numpy as np

import gensim
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess

import pickle

In [11]:
# Filter products with at least 15 reviews. Use the processed CSV dumped earlier rather than reading the whole dataset again.
df = pd.read_csv('office.csv', index_col=0)

In [166]:
df.shape[0], df.user_id.nunique(), df.asin.nunique()

(588734, 90502, 10007)

Over 90K users and 10K products, with about 0.59M ratings. We will filter our metadata to these 10K asins only.

In [17]:
asins_of_interest = set(df.asin.unique())

I will use the product description text as a basis for establishing product profiles. No user/rating data is used at this stage.

In [33]:
with gzip.open(r"F:\work\is590ml_final\data\meta_Office_Products.json.gz", 'rt', encoding='utf-8') as f:
    corpus = {}
    n_empty = 0
    for line in f:
        prod = json.loads(line)
        desc = ' '.join(prod.get('description', '')).strip()
        if desc:
            if prod['asin'] in asins_of_interest:
                corpus[prod['asin']] = desc
        else:
            n_empty += 1


The above cell loads all descriptions into the corpus dictionary, keyed by asin. Note that some products do not have a description. These products are ignored for recommendations in the rest of the analysis.

In [172]:
len(corpus)/len(asins_of_interest)

0.9179574297991406

Only about 8% of the products of interest do not have a description.

In [193]:
def iter_file():
    with gzip.open(r"F:\work\is590ml_final\data\meta_Office_Products.json.gz", 'rt', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

In [194]:
a = iter_file()

In [199]:
with gzip.open(r"F:\work\is590ml_final\data\meta_Office_Products.json.gz", 'rt', encoding='utf-8') as f:
    titles = {}
    also_buy = {}
    n_empty = 0
    for line in f:
        prod = json.loads(line)
        desc = ' '.join(prod.get('description', '')).strip()
        if desc:
            if prod['asin'] in asins_of_interest:
                titles[prod['asin']] = prod['title']
                also_buy[prod['asin']] = prod.get('also_buy', [])
        else:
            n_empty += 1

In [40]:
# with open('review_corpus.pickle', 'wb') as f:
#     pickle.dump(corpus, f)

In [173]:
def read_corpus(corpus):
    """Tokenize each product description and tag it with its asin."""
    for asin, desc in corpus.items():
        tokens = gensim.utils.simple_preprocess(desc)
        yield gensim.models.doc2vec.TaggedDocument(tokens, [asin])

In [52]:
train_corpus = list(read_corpus(corpus))

In [176]:
train_corpus[:2]

[TaggedDocument(words=['this', 'rugged', 'covers', 'is', 'ideal', 'for', 'young', 'explorers', 'ages', 'the', 'rubber', 'adventure', 'bible', 'logo', 'durable', 'nylon', 'construction', 'and', 'multiple', 'pockets', 'will', 'encourage', 'kids', 'to', 'take', 'their', 'niv', 'adventure', 'bible', 'with', 'them', 'wherever', 'they', 'go'], tags=['0310802636']),
 TaggedDocument(words=['featuring', 'metal', 'accents', 'purse', 'style', 'handles', 'and', 'removable', 'key', 'chain', 'this', 'is', 'practical', 'and', 'fashionable', 'bible', 'cover', 'this', 'cover', 'will', 'fit', 'the', 'zondervan', 'niv', 'study', 'bible', 'large', 'print', 'niv', 'life', 'application', 'study', 'bible', 'archaeological', 'study', 'bible', 'as', 'well', 'as', 'many', 'other', 'books', 'and', 'bibles', 'up', 'to', 'mm', 'mm'], tags=['0310821800'])]

We want an embedding for each product, treating each product description as a document. The traditional way of doing this is TF-IDF or LSI. Since we are dealing with products, I have instead tried Doc2Vec here (an offshoot of Word2Vec). Unlike Word2Vec, Doc2Vec yields a single embedding for the whole document, with the useful property that documents on the same topic have embeddings close to each other (just as related words do in Word2Vec). This way, the profiles of similar products will lie close together. As a first pass, I choose 50 dimensions for the embedding and ignore words that do not appear at least twice in the corpus.

Since Doc2Vec is based on Word2Vec, it is actually important that stopwords are not removed.

In [54]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=20)

In [55]:
model.build_vocab(train_corpus)

Training the doc2vec model. This should not take long if BLAS is installed. We have around 9K documents with around 50 words each.

In [56]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Sanity check. I now have a model that maps a product to an embedding. One would expect the embedding inferred for a document (product) to be closest to that same document's trained embedding rather than to any other document's. However, given how doc2vec builds the model, this might not always be the case. As a sanity check, I measure how often it holds.

In [None]:
ranks = []
first_ranks = []
for doc in tqdm(train_corpus):
    inferred_vector = model.infer_vector(doc.words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [asin for asin, sim in sims].index(doc.tags[0])
    ranks.append(rank)
    first_ranks.append([doc.tags[0], *sims[0]])

In [177]:
first_ranks[:10]

[['0310802636', '0310802636', 0.9360512495040894],
 ['0310821800', '0310821800', 0.921605110168457],
 ['0439893577', '0439893577', 0.9230896830558777],
 ['0486256006', 'B01DN8T948', 0.9315845966339111],
 ['0528959948', '0528959948', 0.9450106620788574],
 ['0545114780', '0545114780', 0.9184267520904541],
 ['0545114829', '0545114829', 0.9529480934143066],
 ['0545115000', '0545115000', 0.9598956108093262],
 ['0545114985', '0545114985', 0.9652101993560791],
 ['0641678584', '0641678584', 0.9008191823959351]]

In [179]:
sum(1 for i in first_ranks if i[0] != i[1])/len(first_ranks)

0.15588939690833878

Our embedding model matches a document to itself about 84% of the time. Frankly, this is way better than I expected given how sparse the description is for many products. Also, I might need to do some CV to figure out the ideal embedding dimension and number of training epochs. For now, we collect the product profiles, i.e. their 50-dimensional embeddings.
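Comparing embedding dimensions or epoch counts needs one score per configuration. A minimal sketch of such a score follows, assuming the trained and inferred vectors have already been collected into two dictionaries. The function name `self_match_rate` and the toy vectors are hypothetical; the sketch does not train any model and only mirrors the self-similarity check above with plain NumPy.

```python
import numpy as np

def self_match_rate(trained, inferred):
    """Fraction of tags whose inferred vector is closest (by cosine) to its own trained vector.

    trained, inferred: dicts mapping a tag to a 1-D vector (hypothetical inputs).
    """
    tags = list(trained)
    # Stack the trained vectors and L2-normalise them so a dot product equals cosine similarity.
    T = np.array([trained[t] for t in tags], dtype=float)
    T /= np.linalg.norm(T, axis=1, keepdims=True)
    hits = 0
    for tag in tags:
        v = np.asarray(inferred[tag], dtype=float)
        v = v / np.linalg.norm(v)
        best = tags[int(np.argmax(T @ v))]
        hits += int(best == tag)
    return hits / len(tags)

# Toy 2-dimensional vectors, purely illustrative.
trained = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
inferred = {"a": [0.9, 0.1], "b": [0.6, 0.5], "c": [0.8, 0.9]}

rate = self_match_rate(trained, inferred)
print(rate)
```

In this toy data the inferred vector for "b" lands closer to "c" than to itself, so two of the three documents match themselves. The same function could be evaluated once per candidate configuration to pick between them.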

In [None]:
product_profiles = {}
for doc in train_corpus:
    product_profiles[doc.tags[0]] = model.infer_vector(doc.words)

We have the product profiles. Now we need to represent individual user preferences or the user profile. Since we do not have any background data on the user, we will model the user based on the ratings given.

A rating greater than 3 implies that the user liked the product, so the user profile is oriented towards the product.
A rating less than 3 implies that the user disliked the product, so the user profile is oriented away from the product.
A rating of 3 expresses no particular preference and does not influence the user profile.

With these assumptions, I can model the user preferences in the same vector space as the product embeddings. A user's profile is the weighted sum of the profiles of the products they rated, with the centered ratings (rating minus 3) as the weights. No normalization is done, because cosine similarity is used to align products with user preferences and it does not depend on the length of the vectors.
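As a small worked illustration of this construction, the sketch below builds one user profile from three made-up products and checks that scaling the profile leaves the cosine similarity unchanged. The product names, the 3-dimensional vectors and the ratings are all hypothetical; the notebook itself uses 50-dimensional Doc2Vec embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional product profiles.
toy_profiles = {
    "pen": np.array([1.0, 0.0, 0.0]),
    "printer": np.array([0.0, 1.0, 0.0]),
    "stapler": np.array([0.0, 0.0, 1.0]),
}

# Hypothetical ratings from a single user: (product, rating on a 1-5 scale).
toy_ratings = [("pen", 5), ("printer", 1), ("stapler", 3)]

# Weighted sum of product profiles, with centered ratings (rating - 3) as weights.
profile = np.zeros(3)
for product, rating in toy_ratings:
    profile += (rating - 3) * toy_profiles[product]

liked = cosine(profile, toy_profiles["pen"])         # positive: rated 5
disliked = cosine(profile, toy_profiles["printer"])  # negative: rated 1
scaled = cosine(10 * profile, toy_profiles["pen"])   # same as `liked`
print(profile, liked, disliked, scaled)
```

The liked product pulls the profile towards it, the disliked one pushes it away, and the rating of 3 contributes nothing. Multiplying the profile by 10 does not change its cosine similarity to any product, which is why the sum is left unnormalized.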


In [123]:
# center the ratings to use as weight
df.loc[:, 'rating_weight'] = df.rating - 3

In [180]:
# loop through each user, asin and rating tuple and update user profiles as you go. 
# If a product does not have a description, it does not get a product profile and does not contribute
# to user profiles

user_profiles = {}
for row in df.itertuples():
    user_profiles[row.user_id] = user_profiles.get(row.user_id, np.zeros(50)) + product_profiles.get(row.asin, np.zeros(50)) * row.rating_weight

In [181]:
user_profiles['A398INYG0ZBUZB']

array([ -2.08724699,   2.93460173,   1.71699869,  -9.46520349,
        -0.63525108,   8.20420562,  18.98161112,  10.81906489,
         1.92793886,   9.27051598,  -9.65438822,   8.67588541,
        -6.70450363,  -0.11881249,   0.92907609,  -3.24585358,
        -2.32580936,   3.32132585,  -3.2365552 ,  -0.77785248,
         8.38153526,  -0.69689466,   7.11275415,   3.24274399,
        -1.27049851,   6.34725926,  13.54728842,   2.56984111,
        -4.36694187,   8.99892992,   6.84929928, -14.89513856,
         6.92706026, -10.42026725,  -0.99373901,  -2.78128792,
        -3.20345693,  -7.76043923,  17.36399991,  -6.84747966,
         6.0444157 ,  -0.40112205,  -7.14231681,   8.26383809,
         6.61044567,  -4.13041114, -11.27173559,  -4.66862772,
        -7.16969907,   1.34080695])

# manual testing

In [187]:
# user under test
user_id = 'A1NK4TLIMODCTN'


In [188]:
# print the user's ratings for products whose title is known
for row in df.loc[df.user_id == user_id].itertuples():
    try:
        print(row.asin, titles[row.asin], row.rating)
    except KeyError:
        # no title was collected for this asin (it has no description), so skip it
        pass

B00006IFEU Sharpie Permanent Markers, Fine Point, Purple, 12-Pack (30008) 5.0
B002CKHH8O Uni Jetstream Multi Function 4 Color Ballpoint Pen, White Barrel (MSXE510007.1) 5.0
B00F9ZQ0HI Brother HL-2280DW Wireless Monochrome Multifunction Laser Printer 5.0
B00G4CJ8GK Sharpie 1884739 Permanent Markers Fine Point Black - 36 Pieces 5.0
B00L95MIJQ 8 PCS Jinhao 599 Fountain Pens Diversity Set Transparent and Unique Style 3.0
B015EXS130 Pilot MR Retro Pop Collection Fountain Pen, Red Barrel with Wave Accent, Fine Nib, Black Ink (91432) 5.0
B01CWM4E5A Kaisercraft CL101 KaiserColour Gel Pens (24 Pack), 12 Pastel & 12 Glitter Colors, Assorted 5.0


In [189]:
# print the top 25 recommendations from our model.

# user profile
u = user_profiles['A1NK4TLIMODCTN']

for sim in model.docvecs.most_similar([u], topn=25):
    
    # some titles are not clean text but raw HTML. To avoid cluttering the output, suppress them using a simple length check.
    if len(titles[sim[0]]) > 250:
        continue
    
    print(sim[0], titles[sim[0]], sim[1])

B001XM9BV8 Brother HL-5370DW Laser Printer 0.8484295606613159
B00AP6T05K Brother MFC7460DN Ethernet Monochrome Printer with Scanner, Copier & Fax 0.8423100709915161
B00F9ZQ0HI Brother HL-2280DW Wireless Monochrome Multifunction Laser Printer 0.8359221816062927
B001XUQP9G Brother HL-5340D High Speed Laser Printer with Duplex 0.8325595855712891
B00450DVDY Brother HL-2270DW Compact Laser Printer with Wireless Networking and Duplex 0.827316164970398
B016AT5WES Brother Wireless Digital Color Printer with Convenience Copying and Scanning (HL-3180CDW), Amazon Dash Replenishment Enabled 0.8210573792457581
B00MFG58N6 Brother MFCL2700DW All-In One Laser Printer with Wireless Networking and Duplex Printing, Amazon Dash Replenishment Enabled 0.818708062171936
B0038ZRAES NEW YORK TONER COMPATIBLE WITH BROTHER TN-460 TONER CARTRIDGE (6,000 PAGE YIELD) FOR HL 1435, HL 1440, HL 1450, HL 1470N - BLACK 0.816688597202301
B00LEA5EJM Brother HL-L2360DW Compact Laser Printer with Wireless Networking and Dup

This has been more successful than I expected it to be. The user liked a Brother wireless printer, and our recommender has successfully pointed out related printers (even trying to upsell higher-end models). More impressively, it has recommended toner as well. Similarly, I see a lot of stationery recommendations based on the user's purchases. Especially impressive is the Noodler's ink recommendation, since the user has only bought one fountain pen.

This is impressive for a content-based recommender, because all the product semantics were derived from the description only. One can certainly see how this avoids the cold start problem: a product needs only a description, not any ratings, to be recommended. If the description is detailed enough, this recommender can certainly pick it up.
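To make the cold start point concrete, here is a minimal sketch with made-up vectors. It assumes a brand-new product already has an embedding derived from its description (in this notebook that would come from `model.infer_vector`) and ranks it against older products for one user. The helper `rank_by_cosine`, the product ids and all vectors are hypothetical.

```python
import numpy as np

def rank_by_cosine(user_vec, candidates):
    """Return candidate ids sorted by cosine similarity to user_vec, best first."""
    u = user_vec / np.linalg.norm(user_vec)
    scores = {
        pid: float(np.dot(u, vec / np.linalg.norm(vec)))
        for pid, vec in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical user profile built from past ratings.
user_vec = np.array([2.0, -2.0, 0.0])

# Hypothetical candidates. "new_pen" has no ratings at all, only a vector
# that would be inferred from its description text.
candidates = {
    "old_printer": np.array([0.1, 0.9, 0.0]),
    "old_stapler": np.array([0.0, 0.1, 1.0]),
    "new_pen": np.array([0.9, 0.1, 0.2]),
}

ranking = rank_by_cosine(user_vec, candidates)
print(ranking)
```

The unrated product can still come out on top, because the ranking depends only on where its description places it in the embedding space.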