In [None]:
!pip install gensim
!pip install pandas
!pip install pandarallel
!pip install numpy
!pip install tqdm
!pip install ipywidgets
!pip install annoy

In [1]:
import logging
import numpy as np
import pandas as pd
import warnings

from gensim.models.doc2vec import Doc2Vec
from gensim.similarities.annoy import AnnoyIndexer
from pandarallel import pandarallel
from pathlib import Path
from pprint import pprint
from src.features import preprocessing
from tqdm import tqdm

pandarallel.initialize(progress_bar=True)
tqdm.pandas()
warnings.filterwarnings('ignore')

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


# 1. Load data

Let's load the train/test data that we processed previously. We will do a quick check on the shape of both datasets and visually confirm that `processedReviewText` is still stored as `str`, which means we need to tokenize it before passing it to the `Doc2Vec` model.

In [2]:
# global variables
DATA_PATH = Path('data/processed/')
CATEGORY = 'Clothing_Shoes_and_Jewelry'

train = pd.read_csv(f"{DATA_PATH}/{CATEGORY}_train.csv")
test = pd.read_csv(f"{DATA_PATH}/{CATEGORY}_test.csv")

In [3]:
print(f"Train: {train.shape}, unique users: {train.reviewerID.nunique()}, unique items: {train.asin.nunique()}")
print(f"Test: {test.shape}, unique users: {test.reviewerID.nunique()}, unique items: {test.asin.nunique()}")

Train: (231491, 5), unique users: 39387, unique items: 23033
Test: (47145, 5), unique users: 39380, unique items: 17949


In [4]:
# check train
pd.concat([train.head(), train.tail()])

Unnamed: 0,overall,reviewerID,asin,reviewText,processedReviewText
0,5.0,A1KLRMWW2FWPL4,0000031887,This is a great tutu and at a really great pri...,this great tutu great price it look cheap glad...
1,5.0,A2G5TCU2WDFZ65,0000031887,I bought this for my 4 yr old daughter for dan...,buy yr old daughter dance class wore today tim...
2,5.0,A1RLQXYNCMWRWN,0000031887,What can I say... my daughters have it in oran...,what daughters orange black white pink think b...
3,4.0,A27UF1MSF3DB2,0000031887,I received this today and I'm not a fan of it ...,receive today fan daughter think puffier look ...
4,5.0,A16GFPNVF4Y816,0000031887,Bought this as a backup to the regular ballet ...,bought backup regular ballet outfit daughter w...
231486,5.0,ACJT8MUC0LRF0,B00KKXCJQU,When I pack it looks like a disaster area in a...,when pack look like disaster area suitcase pac...
231487,5.0,A2DG63DN704LOI,B00KKXCJQU,I don't normally go ga-ga over a product very ...,normally ga ga product cub awesome help review...
231488,5.0,A1UQBFCERIP7VJ,B00KKXCJQU,These are very nice packing cubes and the 18 x...,these nice packing cube laundry storage bag ni...
231489,5.0,A22CW0ZHY3NJH8,B00KKXCJQU,I am on vacation with my family of four and th...,vacation family shacke pak set wonderful excep...
231490,5.0,A30VWT3R25QAVD,B00KKXCJQU,When I signed up to receive a free set of Shac...,when sign receive free set shacke pak review t...


In [5]:
# check test
pd.concat([test.head(), test.tail()])

Unnamed: 0,overall,reviewerID,asin,reviewText,processedReviewText
0,5.0,A8U3FAMSJVHS5,0000031887,"We bought several tutus at once, and they are ...",we buy tutu get high review sturdy seemingly t...
1,5.0,A3GEOILWLK86XM,0000031887,Thank you Halo Heaven great product for Little...,thank halo heaven great product little girls m...
2,5.0,A2A2WZYLU528RO,0000031887,My daughter has worn this skirt almost every d...,my daughter worn skirt day receive washer clot...
3,5.0,A34ATJR9KFIXL9,0000031887,Full and well stitched. This tutu is a beauti...,full stitch this tutu beautiful purple color l...
4,5.0,A1MXJVYXE2QU6H,0000031887,Perfect for my budding grand daughter ballerin...,perfect bud grand daughter ballerina beautiful...
47140,5.0,A2XX2A4OJCDNLZ,B00KF9180W,While balaclavas can be used for a variety of ...,while balaclavas variety thing use mainly late...
47141,2.0,A34BZM6S9L7QI4,B00KGCLROK,These were a free sample for review. I was ex...,these free sample review excite try unfortunat...
47142,5.0,A25C2M3QF9G7OQ,B00KGCLROK,These socks are very nicely made and quite com...,these sock nicely comfortable wear the grip do...
47143,5.0,AEL6CQNQXONBX,B00KKXCJQU,This set of travel organizers includes four pi...,this set travel organizer include piece total ...
47144,5.0,A1EVV74UQYVKRY,B00KKXCJQU,I've been traveling back and forth to England ...,travel forth england pack way suitcases some p...


# 2. Doc2Vec model

## 2.1 Preparing `TaggedDocument` for Doc2Vec model

In this section, we generate the tagged documents that will be fed into the `Doc2Vec` model. Each document is 'tagged' with the `asin` of the product that the review addresses. This lets the Doc2Vec model associate documents with each `asin` in our training dataset and learn one document vector per `asin` from its separate review documents.

The intuition behind this preparation is that each `asin` is represented by all the reviews customers have left after purchasing it. If customers like an aspect of the product, their reviews should contain relevant positive feedback on that aspect, e.g., "the boots are comfortable", which tells us that this particular pair of boots is comfortable. If a product (`asin`) has many such reviews, we can build an item profile that semantically associates these boots with comfort. As the embeddings live in an *n*-dimensional vector space, we can then look for products in the neighbourhood with a similar profile, whether that is another pair of boots or another product that is comfortable.

We build this at the item level rather than the user level because a user's interests may vary greatly depending on their needs at the time of purchase. Also, some users are more negative by nature, so their reviews tend to be more critical, which would produce a user profile that is critical in nature. If we placed such a user profile vector into the *n*-dimensional vector space, we would likely recommend products that were also critically (or negatively) reviewed. For this user, we might then generate only poor recommendations, because the profile's neighbourhood is associated with negative semantics. At the item level, however, it is highly unlikely that a user has made only poor purchases on the site, so the item profiles should carry a mix of positive and possibly negative semantics. We can then generate positive recommendations by applying a threshold of some sort, so that the average (or weighted-average) rating of a product meets an initial criterion before it is recommended.

In [None]:
%%time
# tokenizing the `processedReviewText`
train['tokenizedReviewText'] = train['processedReviewText'].parallel_apply(lambda x: x.split())

In [None]:
# check train
pd.concat([train.head(), train.tail()])

## 2.2 Preparing `item-level` tagged documents

In [None]:
_, item_documents = preprocessing.prepare_tagged_documents(users='reviewerID', asins='asin', reviews='tokenizedReviewText', df=train)

In [None]:
pprint(item_documents[:10])
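The helper `preprocessing.prepare_tagged_documents` lives in `src/features` and is not shown in this notebook. As a rough illustration only, the sketch below shows one way item-level tagging could be done with gensim's `TaggedDocument`, assuming one document per review with the `asin` as its single tag. The function name `tag_reviews_by_item` and the toy rows are hypothetical and are not the project's actual implementation.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, so the sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def tag_reviews_by_item(rows):
    """Hypothetical helper: one TaggedDocument per review, tagged with its asin."""
    return [TaggedDocument(words=tokens, tags=[asin]) for asin, tokens in rows]


# Toy (asin, tokenizedReviewText) pairs, invented for illustration.
rows = [
    ("0000031887", ["great", "tutu", "great", "price"]),
    ("0000031887", ["daughter", "dance", "class"]),
    ("B00KKXCJQU", ["nice", "packing", "cube"]),
]
item_docs = tag_reviews_by_item(rows)

# Three documents, but only two distinct tags.
print(len(item_docs), sorted({doc.tags[0] for doc in item_docs}))
```

Because Doc2Vec learns one vector per unique tag, reviews that share an `asin` all contribute to the same item vector under this assumption.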

## 2.3 Training `Doc2Vec` model

We will train a `Doc2Vec` model with initial hyperparameters based on a study by [Caselles-Dupré, Lesaint, and Royo-Letelier (2018)](https://arxiv.org/abs/1804.04212), who observed that tuning the `negative sampling distribution`, `number of epochs`, `subsampling parameter` and `window size` can significantly improve performance on recommendation tasks. Hence, we start with values similar to those presented in the study and use them as the basis for improvement during this project.
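To make the `negative sampling distribution` hyperparameter more concrete: as we understand gensim's documentation, `ns_exponent` raises each word count to that power before normalising, so `1.0` samples in proportion to frequency and `0.0` samples all words equally. The sketch below reproduces that arithmetic on invented counts; `negative_sampling_probs` is a hypothetical helper, not a gensim function.

```python
from collections import Counter


def negative_sampling_probs(counts, ns_exponent):
    """Normalise count ** ns_exponent into a sampling distribution."""
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Invented word counts, for illustration only.
counts = Counter({"great": 900, "comfortable": 90, "balaclava": 10})

for exponent in (1.0, 0.75, 0.5):
    probs = negative_sampling_probs(counts, exponent)
    print(exponent, {word: round(p, 3) for word, p in probs.items()})
```

Lowering the exponent flattens the distribution, so rare words are drawn as negative samples more often than their raw frequency would suggest.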

In [6]:
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(message)s')

# model parameters
VECTOR_SIZE = 150
MIN_COUNT = 10
NEGATIVE = 5
NS_EXPONENT = 0.5
SAMPLE = 1e-05
DM = 1
WORKERS = 8
EPOCHS = 50

In [None]:
model = Doc2Vec(
    vector_size=VECTOR_SIZE,
    min_count=MIN_COUNT,
    negative=NEGATIVE,
    sample=SAMPLE,
    ns_exponent=NS_EXPONENT,
    dm=DM,
    workers=WORKERS,
)

# building vocab
model.build_vocab(item_documents)

# training model
model.train(item_documents, total_examples=model.corpus_count, epochs=EPOCHS)

In [None]:
# saving model
MODEL_PATH = Path("models/d2v/")

model.save(f"{MODEL_PATH}/{CATEGORY}_{VECTOR_SIZE}_{SAMPLE}_{EPOCHS}_d2v.model")

## 2.4 Verifying Doc2Vec model

In [7]:
MODEL_PATH = Path("models/d2v/")

model = Doc2Vec.load(f"{MODEL_PATH}/{CATEGORY}_{VECTOR_SIZE}_{SAMPLE}_{EPOCHS}_d2v.model")

2021-07-21 00:01:52,447 loading Doc2Vec object from models/d2v/Clothing_Shoes_and_Jewelry_150_1e-05_50_d2v.model
2021-07-21 00:01:52,448 {'uri': 'models/d2v/Clothing_Shoes_and_Jewelry_150_1e-05_50_d2v.model', 'mode': 'rb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'compression': None, 'transport_params': None}
2021-07-21 00:01:52,484 loading dv recursively from models/d2v/Clothing_Shoes_and_Jewelry_150_1e-05_50_d2v.model.dv.* with mmap=None
2021-07-21 00:01:52,485 loading wv recursively from models/d2v/Clothing_Shoes_and_Jewelry_150_1e-05_50_d2v.model.wv.* with mmap=None
2021-07-21 00:01:52,486 setting ignored attribute cum_table to None
2021-07-21 00:01:52,627 Doc2Vec lifecycle event {'fname': 'models/d2v/Clothing_Shoes_and_Jewelry_150_1e-05_50_d2v.model', 'datetime': '2021-07-21T00:01:52.600508', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10

### 2.4.1 Testing retrieval of vectors via index

In [8]:
model.dv[0] # '0000031887'

array([-0.55455124, -0.06204286,  0.12651655,  0.17184411, -0.41832566,
        0.43929744,  0.5989178 , -0.47313836, -0.26175705,  0.23614489,
       -1.3901784 ,  0.6646123 ,  0.06347404,  0.08615484, -0.6717607 ,
       -0.3768335 ,  1.0026171 ,  0.25654572,  0.5956967 ,  0.09548705,
       -0.5090417 , -0.28958946,  1.0680248 , -0.2967359 ,  0.05868936,
       -0.27846545, -0.45245284, -0.39223024, -0.13383815, -0.5927535 ,
       -0.6065034 ,  0.11480772, -0.24187134,  0.89880323, -0.4563438 ,
       -0.07274003, -0.25703123, -0.67628574, -0.23870592,  0.03952418,
       -0.43918917, -0.37131128, -0.3459309 ,  0.22690648,  0.44771397,
       -0.46793684, -0.09896857, -0.5261713 , -0.02965886,  0.6870407 ,
       -0.49471295, -0.55824864,  0.14912887,  0.19684845,  0.06110121,
        0.08721235, -0.67826223,  0.18145782,  0.49971262, -0.26642862,
        0.00974132,  0.02288771, -0.05008493, -0.4627209 , -0.58579177,
       -0.63324004,  0.05645867, -0.5750356 ,  0.32855687, -0.35

### 2.4.2 Testing retrieval of vectors via tags

In [9]:
# `item_documents` only exists if section 2.2 was run in this session,
# so read the first few tags from the loaded model instead
print(model.dv.index_to_key[:5])

In [10]:
model.dv['0000031887']

array([-0.55455124, -0.06204286,  0.12651655,  0.17184411, -0.41832566,
        0.43929744,  0.5989178 , -0.47313836, -0.26175705,  0.23614489,
       -1.3901784 ,  0.6646123 ,  0.06347404,  0.08615484, -0.6717607 ,
       -0.3768335 ,  1.0026171 ,  0.25654572,  0.5956967 ,  0.09548705,
       -0.5090417 , -0.28958946,  1.0680248 , -0.2967359 ,  0.05868936,
       -0.27846545, -0.45245284, -0.39223024, -0.13383815, -0.5927535 ,
       -0.6065034 ,  0.11480772, -0.24187134,  0.89880323, -0.4563438 ,
       -0.07274003, -0.25703123, -0.67628574, -0.23870592,  0.03952418,
       -0.43918917, -0.37131128, -0.3459309 ,  0.22690648,  0.44771397,
       -0.46793684, -0.09896857, -0.5261713 , -0.02965886,  0.6870407 ,
       -0.49471295, -0.55824864,  0.14912887,  0.19684845,  0.06110121,
        0.08721235, -0.67826223,  0.18145782,  0.49971262, -0.26642862,
        0.00974132,  0.02288771, -0.05008493, -0.4627209 , -0.58579177,
       -0.63324004,  0.05645867, -0.5750356 ,  0.32855687, -0.35

### 2.4.3 Checking number of document vectors

In [11]:
print(f"Number of document vectors: {len(model.dv.index_to_key)}")

Number of document vectors: 23033


We observe that retrieving the document vector by index and by tag returns the same vector. The model also holds `23033` document vectors, which matches the number of unique items in the training data.

## 2.5 Examining Doc2Vec results

### 2.5.1 Are inferred vectors close to the precalculated ones?

In [12]:
# let's pick a random item id, infer its vector and check whether we get similar items back
random_asin = np.random.choice(list(train['asin'].unique()), 1)[0]

# combining all the words from all the reviews of this item
# (tokenizing `processedReviewText` here, since `tokenizedReviewText` is only created in section 2.1)
all_review = []
for review in train[train['asin'] == random_asin]["processedReviewText"]:
    all_review.extend(review.split())

# inferring vector
print(f"For item {random_asin}...\n")
print(f'Most similar D2V vectors: {model.dv.most_similar([model.infer_vector(all_review, epochs=5)], topn=5)}')

Using the same tokens that were used for training, we were able to verify that the model still infers accurate vectors.

### 2.5.2 Generating top-N recommendations based on aggregated item history

As mentioned previously, we generate recommendations at the item level instead of the user level because a user's preferences are dynamic: what a user likes now might not be what they like a few days or weeks later. From an item perspective, however, if an item is good, the people who purchased it should generally leave positive feedback, so an item profile should not vary as much as a user profile would.

Hence, instead of building a user profile from the reviews the user has written, we build an *aggregated* item history profile from the user's past purchases. A simplified version of the algorithm is:

1. Identify the list of items previously purchased by a user.
2. Using the Doc2Vec model trained with `asin` as the unique tags, mean-aggregate the document vectors of those items to produce an aggregated item purchase history vector.
3. Using the aggregated item purchase history vector, find the top-N (N = 10) recommendations, excluding previously purchased items and applying a threshold on the mean or weighted-mean rating of the recommended items to ensure that they are also popular among other users.
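The three steps above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: `item_vectors` stands in for `model.dv`, `mean_ratings` is a hypothetical per-item average rating lookup, and similarity is computed with a plain cosine instead of `most_similar`.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(purchased, item_vectors, mean_ratings, n=10, threshold=4.0):
    # Step 2: mean-aggregate the vectors of the purchased items.
    profile = mean_vector([item_vectors[asin] for asin in purchased])
    # Step 3: skip purchased items and items below the rating threshold.
    candidates = [
        (asin, cosine(profile, vector))
        for asin, vector in item_vectors.items()
        if asin not in purchased and mean_ratings.get(asin, 0.0) >= threshold
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [asin for asin, _ in candidates[:n]]


# Toy vectors and ratings, invented for illustration.
item_vectors = {
    "boot_a": [1.0, 0.0],
    "boot_b": [0.9, 0.1],
    "boot_c": [0.8, 0.2],
    "scarf": [0.0, 1.0],
}
mean_ratings = {"boot_a": 4.5, "boot_b": 4.6, "boot_c": 3.0, "scarf": 4.8}

print(recommend({"boot_a"}, item_vectors, mean_ratings, n=2))
```

Here `boot_c` is close to the profile but is dropped by the rating threshold, and `boot_a` is dropped because it was already purchased.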

In [14]:
# TODO: need to figure out how to define a threshold
def get_top_n(user, n=10, threshold=4):
    # NOTE: `threshold` is not applied yet, and previously purchased items are not excluded

    # retrieving user's purchase history
    purchase_history = train.groupby(['reviewerID'])['asin'].apply(list)[user]

    # summing the document vectors of the purchased items
    purchase_history_vec = np.zeros(model.vector_size)
    for item in purchase_history:
        purchase_history_vec += model.dv[item]
    # mean aggregation
    purchase_history_vec /= len(purchase_history)

    return [i[0] for i in model.dv.most_similar([purchase_history_vec], topn=n)]

In [15]:
# generating a random user
random_user = np.random.choice(list(train['reviewerID'].unique()), 1)[0]
# looking at user records
pprint(train[train['reviewerID'] == random_user][["asin", "reviewText"]])

print(f"\nFor user {random_user}...\n")
print(f'Most similar item D2V vectors: {get_top_n(random_user)}')

              asin                                         reviewText
141747  B0067GUM2W  This necklace is actually a lot nicer than i e...
164511  B007WNWEFC  No closure but the chain is quite long. The oc...
229819  B00H2SU5ZI  I bought this in eggplant and dark brown and I...
230705  B00ICP47CC  I figured out I'm not a huge fan of the poly-s...
231362  B00JH0RWEG  I love this skirt! I'm 5'8&#34; and I typicall...

For user A2GS3YF3AZ8CVT...

Most similar item D2V vectors: ['B0067GUM2W', 'B007WNWEFC', 'B006WXSY3E', 'B00A73WBZW', 'B006WXW16U', 'B00DCL33QM', 'B00CAMCKJ0', 'B00EU7EKGE', 'B00AV4OR3Q', 'B00GY4V38O']


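The output above is an example of the recommendations generated for a randomly selected user. Note that the first two recommended items, `B0067GUM2W` and `B007WNWEFC`, are already in this user's purchase history, because `get_top_n` does not yet exclude previously purchased items.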

# 3. Computing metrics

Now that we can generate recommendations for users, we compute metrics to evaluate our model so that it can be compared against existing recommendation techniques.

We will be using the following metrics:

1. `Precision@K`: the proportion of the top-K recommendations that are relevant, where a relevant item is one that the user purchased in the test set.
2. `Recall@K`: the proportion of the user's relevant items that appear in the top-K recommendations.
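Written out, the two metrics for a single user $u$ can be stated as below. The symbols are our own notation and are not taken from a reference: $R_u^K$ is the set of top-K recommended items and $T_u$ is the set of items the user purchased in the test set.

```latex
\mathrm{Precision@}K(u) = \frac{\lvert R_u^K \cap T_u \rvert}{K},
\qquad
\mathrm{Recall@}K(u) = \frac{\lvert R_u^K \cap T_u \rvert}{\lvert T_u \rvert}
```

For example, with $K = 10$ and a user who has a single test item that does appear in the recommendations, precision is $1/10 = 0.1$ and recall is $1/1 = 1.0$. The reported figures are the means of these per-user values.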


In [16]:
# let's take a look at our testing set
pd.concat([test.head(), test.tail()])

Unnamed: 0,overall,reviewerID,asin,reviewText,processedReviewText
0,5.0,A8U3FAMSJVHS5,0000031887,"We bought several tutus at once, and they are ...",we buy tutu get high review sturdy seemingly t...
1,5.0,A3GEOILWLK86XM,0000031887,Thank you Halo Heaven great product for Little...,thank halo heaven great product little girls m...
2,5.0,A2A2WZYLU528RO,0000031887,My daughter has worn this skirt almost every d...,my daughter worn skirt day receive washer clot...
3,5.0,A34ATJR9KFIXL9,0000031887,Full and well stitched. This tutu is a beauti...,full stitch this tutu beautiful purple color l...
4,5.0,A1MXJVYXE2QU6H,0000031887,Perfect for my budding grand daughter ballerin...,perfect bud grand daughter ballerina beautiful...
47140,5.0,A2XX2A4OJCDNLZ,B00KF9180W,While balaclavas can be used for a variety of ...,while balaclavas variety thing use mainly late...
47141,2.0,A34BZM6S9L7QI4,B00KGCLROK,These were a free sample for review. I was ex...,these free sample review excite try unfortunat...
47142,5.0,A25C2M3QF9G7OQ,B00KGCLROK,These socks are very nicely made and quite com...,these sock nicely comfortable wear the grip do...
47143,5.0,AEL6CQNQXONBX,B00KKXCJQU,This set of travel organizers includes four pi...,this set travel organizer include piece total ...
47144,5.0,A1EVV74UQYVKRY,B00KKXCJQU,I've been traveling back and forth to England ...,travel forth england pack way suitcases some p...


In [17]:
# creating the purchase history of users in the testing set
test_purchase_history = test.groupby(['reviewerID'])['asin'].apply(list).to_frame().reset_index()

pprint(test_purchase_history.iloc[:5])

              reviewerID          asin
0  A001114613O3F18Q5NVR6  [B000J6ZYL0]
1  A00146182PNM90WNNAZ5Q  [B00823Y41S]
2  A00165422B2GAUE3EL6Z0  [B008G51WHQ]
3  A00338282E99B8OR2JYTZ  [B00DVFNNQE]
4  A00354001GE099Q1FL0TU  [B00BTWAZ0I]


In [18]:
# let's randomly sample 5000 rows to make predictions
sampled_test_purchase_history = test_purchase_history.sample(n=5000, random_state=42)

pprint(sampled_test_purchase_history)

           reviewerID                                               asin
109    A104QGECCAFCI9                                       [B00592VMNI]
15112  A2G5OW0UIBAUIT                           [B008SCM0AU, B00AOCV6OI]
13118  A29BPMJI0ZYH4H  [B0058XH5D4, B007BZ5CUU, B00A0SXLOO, B00AVPHH4...
37097   ARQZEE0LA1PBB                                       [B000A2KC7O]
31660   A8VSC4N8D63MJ                                       [B007ZRS0ZI]
...               ...                                                ...
14667  A2EP4PMBS78D5F                                       [B001SN8DHK]
16387  A2KNB31SNXN0MR                                       [B008MMJ27K]
33258   AENTXUFIYPSMZ                                       [B003NX8C2O]
7029    A1OJHJSWH0F4K                                       [B0087SX5YA]
26273  A3IKG99RBVQDMK                                       [B00CJ6YMES]

[5000 rows x 2 columns]


In [19]:
%%time
# generating predictions
sampled_test_purchase_history['asin_predictions'] = sampled_test_purchase_history['reviewerID'].parallel_apply(get_top_n)

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=625), Label(value='0 / 625'))), HB…

CPU times: user 11.4 s, sys: 1.41 s, total: 12.8 s
Wall time: 18min 58s


In [20]:
# checking the dataframe
pd.concat([sampled_test_purchase_history.head(), sampled_test_purchase_history.tail()])

Unnamed: 0,reviewerID,asin,asin_predictions
109,A104QGECCAFCI9,[B00592VMNI],"[B000MM8I5U, B001KVQM96, B005FDWNKM, B00FSB5SB..."
15112,A2G5OW0UIBAUIT,"[B008SCM0AU, B00AOCV6OI]","[B008SCHXGQ, B007JPJD6G, B00CVQ16S6, B0094FYRM..."
13118,A29BPMJI0ZYH4H,"[B0058XH5D4, B007BZ5CUU, B00A0SXLOO, B00AVPHH4...","[B008KKSJYQ, B00AKSCOMO, B00BXXX3PM, B008KKSKJ..."
37097,ARQZEE0LA1PBB,[B000A2KC7O],"[B006H9S4LU, B009V7Q8YK, B0030BELDS, B0018OFU9..."
31660,A8VSC4N8D63MJ,[B007ZRS0ZI],"[B0006LMBJ6, B000S6ICUQ, B0018OHOB0, B0018OLPQ..."
14667,A2EP4PMBS78D5F,[B001SN8DHK],"[B004A9FVDS, B005AFKQFY, B002AHW2P2, B002463U8..."
16387,A2KNB31SNXN0MR,[B008MMJ27K],"[B009WQRQ64, B004Y6R5EA, B00CYNAUE2, B00J6OOYI..."
33258,AENTXUFIYPSMZ,[B003NX8C2O],"[B0007YR8WW, B000T8EN8I, B001124KQG, B004A7XXI..."
7029,A1OJHJSWH0F4K,[B0087SX5YA],"[B00BU17JP2, B007MU5HUE, B004WJBZ54, B009TYOL8..."
26273,A3IKG99RBVQDMK,[B00CJ6YMES],"[B007VU1H2W, B008KKOEA4, B00BM27X46, B005TEO7V..."


## 3.1 Defining utility metrics functions

In [21]:
def precision_at_k(asins, predicted_asins, k=10):
    # number of relevant items among the top-k predictions
    set_actual = set(asins)
    set_preds = set(predicted_asins[:k])
    num_relevant = len(set_actual.intersection(set_preds))

    # precision@K - relevant / total recommended
    return num_relevant / k

def recall_at_k(asins, predicted_asins, k=10):
    # number of relevant items among the top-k predictions
    set_actual = set(asins)
    set_preds = set(predicted_asins[:k])
    num_relevant = len(set_actual.intersection(set_preds))

    # recall@K - relevant / total relevant (unique) items
    return num_relevant / len(set_actual)

In [22]:
# computing the metrics
sampled_test_purchase_history['precision@K'] = sampled_test_purchase_history.progress_apply(lambda x: precision_at_k(x.asin, x.asin_predictions), axis=1)
sampled_test_purchase_history['recall@K'] = sampled_test_purchase_history.progress_apply(lambda x: recall_at_k(x.asin, x.asin_predictions), axis=1)

100%|██████████| 5000/5000 [00:00<00:00, 20988.54it/s]
100%|██████████| 5000/5000 [00:00<00:00, 30353.83it/s]


In [23]:
# checking the dataframe
pd.concat([sampled_test_purchase_history.head(), sampled_test_purchase_history.tail()])

Unnamed: 0,reviewerID,asin,asin_predictions,precision@K,recall@K
109,A104QGECCAFCI9,[B00592VMNI],"[B000MM8I5U, B001KVQM96, B005FDWNKM, B00FSB5SB...",0.0,0.0
15112,A2G5OW0UIBAUIT,"[B008SCM0AU, B00AOCV6OI]","[B008SCHXGQ, B007JPJD6G, B00CVQ16S6, B0094FYRM...",0.0,0.0
13118,A29BPMJI0ZYH4H,"[B0058XH5D4, B007BZ5CUU, B00A0SXLOO, B00AVPHH4...","[B008KKSJYQ, B00AKSCOMO, B00BXXX3PM, B008KKSKJ...",0.0,0.0
37097,ARQZEE0LA1PBB,[B000A2KC7O],"[B006H9S4LU, B009V7Q8YK, B0030BELDS, B0018OFU9...",0.0,0.0
31660,A8VSC4N8D63MJ,[B007ZRS0ZI],"[B0006LMBJ6, B000S6ICUQ, B0018OHOB0, B0018OLPQ...",0.0,0.0
14667,A2EP4PMBS78D5F,[B001SN8DHK],"[B004A9FVDS, B005AFKQFY, B002AHW2P2, B002463U8...",0.0,0.0
16387,A2KNB31SNXN0MR,[B008MMJ27K],"[B009WQRQ64, B004Y6R5EA, B00CYNAUE2, B00J6OOYI...",0.0,0.0
33258,AENTXUFIYPSMZ,[B003NX8C2O],"[B0007YR8WW, B000T8EN8I, B001124KQG, B004A7XXI...",0.0,0.0
7029,A1OJHJSWH0F4K,[B0087SX5YA],"[B00BU17JP2, B007MU5HUE, B004WJBZ54, B009TYOL8...",0.0,0.0
26273,A3IKG99RBVQDMK,[B00CJ6YMES],"[B007VU1H2W, B008KKOEA4, B00BM27X46, B005TEO7V...",0.0,0.0


## 3.2 Retrieving the average metrics

In [24]:
average_precision_at_k = sampled_test_purchase_history["precision@K"].mean()
average_recall_at_k = sampled_test_purchase_history["recall@K"].mean()

print(f"The model has an average precision@K: {average_precision_at_k:.5f}, average recall@K: {average_recall_at_k:.5f}.")

The model has an average precision@K: 0.00136, average recall@K: 0.01214.


## 3.3 Looking at the correct recommendations

In [25]:
sampled_test_purchase_history[sampled_test_purchase_history['recall@K'] == 1]

Unnamed: 0,reviewerID,asin,asin_predictions,precision@K,recall@K
37689,ATQVNXUU2N1GG,[B0009WXTX4],"[B000ZPMYCC, B000B5MI3Q, B0032FOSI0, B009KYJAJ...",0.1,1.0
39337,AZV969S41XUYF,[B009S3HPTE],"[B009S3HYQ8, B006NU5Z60, B006LFF850, B009S3HL9...",0.1,1.0
36648,AQA5BVA14WTJQ,[B000KPP352],"[B000KKTPD8, B000XY3XX4, B000KPPICA, B000KPP35...",0.1,1.0
37385,ASQFKYFTM1A3G,[B002YIPCJA],"[B002WUVOBA, B002VS8H3G, B001WWWAA8, B002YM52L...",0.1,1.0
2158,A17OB8ULOJ5U50,[B000UECV3U],"[B000CC3OMC, B000BVYQ9O, B000BVYQ9Y, B0006GYLO...",0.1,1.0
35465,AM74DCJX3UI7X,[B002JKZU4A],"[B002JL1ZUC, B004874ZHK, B009DKKVFW, B004DYUAX...",0.1,1.0
29431,A3TF05A315UM97,[B006GDARO4],"[B0055X1NDK, B002RS26LE, B0085U2YTC, B0064YY0D...",0.1,1.0
22453,A35ARB435GXRMH,[B006B3AOJC],"[B005UVM368, B0062WL55E, B008IZKAP4, B00EU7OYY...",0.1,1.0
29035,A3S4V8NPKKDDUL,[B001SN8BLS],"[B001N0MSI8, B002RL87OQ, B003FSPWAC, B001SN8AD...",0.1,1.0
30780,A5SYJBHLU748Z,[B008Q0E61U],"[B001PUJH3A, B001NODU36, B001B9XI3A, B001DNFAO...",0.1,1.0


So far, we have computed the metrics for the model, which come to an average precision@K of `0.00136` and an average recall@K of `0.01214`. As we don't have a baseline to compare against at the moment, we cannot yet say whether these numbers are good or bad.

The next step is to use an `approximate nearest neighbour` method to query similarities, as opposed to the current brute-force search in the `Doc2Vec` model. Theoretically, we can achieve up to an *11*x performance increase, although this depends on the number of trees built during the indexing phase.

## 3.4 Improving query time using `Annoy`

In [26]:
# building index using 100 trees (requires `pip install annoy`)
annoy_index = AnnoyIndexer(model, 100)

# TODO: need to figure out how to define a threshold
def get_top_n_annoy(user, n=10, threshold=4, indexer=annoy_index):
    # NOTE: `threshold` is not applied yet, and previously purchased items are not excluded

    # retrieving user's purchase history
    purchase_history = train.groupby(['reviewerID'])['asin'].apply(list)[user]

    # summing the document vectors of the purchased items
    purchase_history_vec = np.zeros(model.vector_size)
    for item in purchase_history:
        purchase_history_vec += model.dv[item]
    # mean aggregation
    purchase_history_vec /= len(purchase_history)

    return [i[0] for i in model.dv.most_similar([purchase_history_vec], topn=n, indexer=indexer)]

ImportError: Annoy not installed. To use the Annoy indexer, please run `pip install annoy`.

In [None]:
# generating predictions
sampled_test_purchase_history['asin_predictions_annoy'] = sampled_test_purchase_history['reviewerID'].progress_apply(get_top_n_annoy)

In [None]:
# checking the dataframe
pd.concat([sampled_test_purchase_history.head(), sampled_test_purchase_history.tail()])

In [None]:
# computing the metrics
sampled_test_purchase_history['precision@K_annoy'] = sampled_test_purchase_history.progress_apply(lambda x: precision_at_k(x.asin, x.asin_predictions_annoy), axis=1)
sampled_test_purchase_history['recall@K_annoy'] = sampled_test_purchase_history.progress_apply(lambda x: recall_at_k(x.asin, x.asin_predictions_annoy), axis=1)

In [None]:
average_precision_at_k_annoy = sampled_test_purchase_history["precision@K_annoy"].mean()
average_recall_at_k_annoy = sampled_test_purchase_history["recall@K_annoy"].mean()

print(f"The model has an average precision@K: {average_precision_at_k_annoy:.5f}, average recall@K: {average_recall_at_k_annoy:.5f}.")

Based on initial observation, the speed-up was by no means significant. Either the current Annoy indexer setup is inappropriate, or the existing function design is flawed, considering that each call regroups the training data and runs an inner `for` loop within the retrieval function. A design improvement would be to pre-compute each user's purchase history vector and store it in a hashmap (dictionary), allowing O(1) lookup at query time.
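The pre-computation idea can be sketched as follows: group the interactions once, average each user's item vectors once, and keep the result in a dictionary keyed by user. This is a hedged, toy-data illustration; `build_user_profiles`, `interactions` and `item_vectors` are hypothetical names standing in for the training frame and `model.dv`.

```python
from collections import defaultdict


def build_user_profiles(interactions, item_vectors):
    """Return {user: mean item vector}, computed in a single pass over the data."""
    histories = defaultdict(list)
    for user, asin in interactions:
        histories[user].append(asin)

    profiles = {}
    for user, asins in histories.items():
        vectors = [item_vectors[asin] for asin in asins]
        profiles[user] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return profiles


# Toy (reviewerID, asin) pairs and item vectors, invented for illustration.
interactions = [("u1", "a"), ("u1", "b"), ("u2", "b")]
item_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}

profiles = build_user_profiles(interactions, item_vectors)
print(profiles["u1"])  # dictionary lookup at query time, no regrouping
```

With the profiles cached, the retrieval function would only need a dictionary lookup followed by the similarity query, which should make any difference between brute-force search and the Annoy index easier to measure.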

# 4. Misc

In [None]:
# save the test users
sampled_test_purchase_history['reviewerID'].to_csv("test_users_id.csv", index=False)