## Introduction 

The following notebook contains data loading and pre-processing steps for the [Amazon Review Dataset](https://nijianmo.github.io/amazon/index.html). This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). This version provides additional reviews (233.1 million in total) and additional metadata. The reviews are distributed across 26 high-level groups, as illustrated below:

<img width="600" alt="Amazon categories" src="https://user-images.githubusercontent.com/34798787/177803504-89909b59-a2cd-497b-b892-64f40a9a9e29.png">

The goal of this notebook is to provide an in-depth walkthrough of how the train, validation and test datasets for a single high-level group are generated from the review and metadata files. In particular, we will be using the Movies and TV group, which is often used as a benchmark in recommender system research.

In [1]:
import os
import json
import pandas as pd

from recommenders.datasets.amazon_reviews import (
    _reviews_preprocessing, _meta_preprocessing, _create_instance,
    _create_item2cate, _get_sampled_data, _data_processing, _data_generating,
    _create_vocab, _negative_sampling_offline, download_and_extract,
)

In [2]:
DATA_PATH = "data"
REVIEWS_FILE = 'reviews_Movies_and_TV_5.json'
META_FILE = 'meta_Movies_and_TV.json'

In [3]:
# Directories to store train, validation and test splits
train_path = os.path.join(DATA_PATH, r'train_data')
valid_path = os.path.join(DATA_PATH, r'valid_data')
test_path = os.path.join(DATA_PATH, r'test_data')

# File paths to store the list of existing ids for user, item and item category
user_vocab_path = os.path.join(DATA_PATH, r'user_vocab.pkl')
item_vocab_path = os.path.join(DATA_PATH, r'item_vocab.pkl')
cate_vocab_path = os.path.join(DATA_PATH, r'category_vocab.pkl')
output_file_path = os.path.join(DATA_PATH, r'output.txt')

# File paths to store reviews and associated metadata
reviews_path = os.path.join(DATA_PATH, REVIEWS_FILE)
meta_path = os.path.join(DATA_PATH, META_FILE)

valid_num_ngs = 4  # number of negative instances sampled per positive instance for validation
test_num_ngs = 9  # number of negative instances sampled per positive instance for testing

## Data Loading

In the data loading stage, the review data and associated metadata specified by `REVIEWS_FILE` and `META_FILE` are downloaded from a remote [index](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/) of review and metadata files for each category in the Amazon dataset. The resulting review data and metadata are extracted to `reviews_path` and `meta_path`, respectively.

In [4]:
# Create base directory to store amazon dataset and run information
os.makedirs(DATA_PATH, exist_ok=True)

The raw review data is stored in the file at `reviews_path`, where each row corresponds to a review of a product by a user at a specific time. Each row takes the form of a dictionary whose keys of interest include the user id `reviewerID`, the item id `asin` (Amazon Standard Identification Number) and the timestamp of the review `unixReviewTime`.

In [5]:
# Download and extract review data
if not os.path.exists(reviews_path):
    download_and_extract(REVIEWS_FILE, reviews_path)

# Print the first review record (reads one line instead of the whole file)
with open(reviews_path, "r") as reviews_r:
    print(reviews_r.readline())

{"reviewerID": "ADZPIG9QOCDG5", "asin": "0005019281", "reviewerName": "Alice L. Larson \"alice-loves-books\"", "helpful": [0, 0], "reviewText": "This is a charming version of the classic Dicken's tale.  Henry Winkler makes a good showing as the \"Scrooge\" character.  Even though you know what will happen this version has enough of a change to make it better that average.  If you love A Christmas Carol in any version, then you will love this.", "overall": 4.0, "summary": "good version of a classic", "unixReviewTime": 1203984000, "reviewTime": "02 26, 2008"}
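
To make the record structure concrete, here is a minimal sketch of how the three keys of interest could be pulled out of one review line. The helper name `extract_interaction` and the shortened record are illustrative assumptions, not part of the recommenders package.

```python
import json

# A shortened review record using the same keys as the sample line above.
raw_line = (
    '{"reviewerID": "ADZPIG9QOCDG5", "asin": "0005019281", '
    '"overall": 4.0, "unixReviewTime": 1203984000}'
)


def extract_interaction(line):
    """Return (user id, item id, timestamp) from one JSON review line."""
    record = json.loads(line)
    return record["reviewerID"], record["asin"], int(record["unixReviewTime"])


interaction = extract_interaction(raw_line)
# Tab-separated, one interaction per line
print("\t".join(str(field) for field in interaction))
```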



The raw metadata is stored in the file at `meta_path`, where each row corresponds to a product. Each row takes the form of a dictionary whose keys of interest include the categories of the item `categories`, a description of the item `description` and the price of the item `price`.

In [6]:
# Download and extract metadata 
if not os.path.exists(meta_path):
    download_and_extract(META_FILE, meta_path)
    
# Print the first metadata record (reads one line instead of the whole file)
with open(meta_path, "r") as meta_r:
    print(meta_r.readline())

{'asin': '0000143561', 'categories': [['Movies & TV', 'Movies']], 'description': '3Pack DVD set - Italian Classics, Parties and Holidays.', 'title': 'Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays', 'price': 12.99, 'salesRank': {'Movies & TV': 376041}, 'imUrl': 'http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif', 'related': {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC', 'B002I5GNVU', 'B000RBU4BM'], 'buy_after_viewing': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC']}}
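
Note that the metadata line printed above uses single-quoted strings, so it is a Python literal rather than strict JSON. The sketch below shows one way such a line could be parsed safely with `ast.literal_eval`. The helper `extract_item_category` is hypothetical, and taking the last entry of the first category path is an assumption made for illustration.

```python
import ast

# A shortened metadata record using the same keys as the sample line above.
raw_meta = (
    "{'asin': '0000143561', "
    "'categories': [['Movies & TV', 'Movies']], 'price': 12.99}"
)


def extract_item_category(line):
    """Return (item id, category) from one single-quoted metadata line."""
    record = ast.literal_eval(line)  # evaluates literals only, never code
    # Assumption: use the most specific entry of the first category path.
    return record["asin"], record["categories"][0][-1]


item_category = extract_item_category(raw_meta)
print("\t".join(item_category))
```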



## Data Preprocessing

In the data preprocessing step, the raw review data and associated metadata are processed to generate train, validation and test datasets. The generated datasets are in a form that can be fed into a sequential recommender model from the [Microsoft Recommenders package](https://github.com/microsoft/recommenders), such as [SLi-Rec](https://www.microsoft.com/en-us/research/uploads/prod/2019/07/IJCAI19-ready_v1.pdf).

In [7]:
# Extract relevant information from reviews_path file
reviews_writefile = _reviews_preprocessing(reviews_path)

# Print the first extracted review row
with open(reviews_writefile, "r") as reviews_r:
    print(reviews_r.readline())

ADZPIG9QOCDG5	0005019281	1203984000



In [8]:
# Extract relevant information from meta_path file
meta_writefile = _meta_preprocessing(meta_path)

# Print the first extracted metadata row
with open(meta_writefile, "r") as meta_r:
    print(meta_r.readline())

0000143561	Movies



The `_create_instance` function takes the previously generated review file `reviews_writefile` and metadata file `meta_writefile` and generates a new file in which each line carries a binary label `label` indicating that there was a positive interaction between user `reviewerID` and item `asin`, with category `categories`, at time `unixReviewTime`.

In [9]:
# Merge review and meta files into instance_output file
instance_output_path = _create_instance(reviews_writefile, meta_writefile)
ns_df = pd.read_csv(instance_output_path, sep="\t", names=["label", "user_id", "item_id", "timestamp", "cate_id"])
ns_df

Unnamed: 0,label,user_id,item_id,timestamp,cate_id
0,1,ADZPIG9QOCDG5,0780623746,1138752000,Movies
1,1,ADZPIG9QOCDG5,6300251004,1138752000,Movies
2,1,ADZPIG9QOCDG5,6302595916,1138752000,Movies
3,1,ADZPIG9QOCDG5,0005019281,1203984000,Movies
4,1,ADZPIG9QOCDG5,B001FB4VXU,1359763200,TV
...,...,...,...,...,...
1697528,1,A3JZUZPZQEXIAB,B00H9L26AA,1397779200,Movies
1697529,1,A3JZUZPZQEXIAB,B00H7LINKE,1398902400,Movies
1697530,1,A3JZUZPZQEXIAB,B00GMV8LIO,1399852800,Movies
1697531,1,A3JZUZPZQEXIAB,B00H9HZGQ0,1400544000,Movies
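
As a rough illustration of the merge described above, the following sketch joins review rows with an item-to-category mapping and attaches a positive label. It is not the library's implementation: the function name `create_instances`, the `default_category` placeholder for items missing from the metadata, and the ordering by user and timestamp are all assumptions.

```python
# Toy (user_id, item_id, timestamp) rows standing in for reviews_writefile.
reviews = [("u1", "i2", 200), ("u1", "i1", 100), ("u2", "i1", 150)]

# Toy item -> category mapping standing in for meta_writefile.
item_to_category = {"i1": "Movies", "i2": "TV"}


def create_instances(reviews, item_to_category, default_category="default_cat"):
    """Attach label 1 and a category to every observed interaction."""
    instances = []
    # Assumption: order each user's interactions chronologically.
    for user, item, ts in sorted(reviews, key=lambda r: (r[0], r[2])):
        category = item_to_category.get(item, default_category)
        instances.append((1, user, item, ts, category))
    return instances


instances = create_instances(reviews, item_to_category)
for row in instances:
    print("\t".join(str(field) for field in row))
```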


Following the generation of the file that contains user-item interactions and their associated metadata, we randomly sample a subset of item ids without replacement, with the size of the subset controlled by `sample_rate`. These item ids are used to filter the interactions in `ns_df` so that only interactions with sampled items remain. The subsampled data is stored in a tab-separated file at path `sampled_instance_path`.

In [10]:
# Establish global item to category dictionary
_create_item2cate(instance_output_path)

# Sample subset of interactions and store in csv file 
sampled_instance_path = _get_sampled_data(instance_output_path, sample_rate=.01)

# Load csv into dataframe to visualize
ss_ns_df = pd.read_csv(sampled_instance_path, sep="\t", names=["label", "user_id", "item_id", "timestamp", "cate_id"])
ss_ns_df

Unnamed: 0,label,user_id,item_id,timestamp,cate_id
0,1,A1VKW06X1O2X7V,B009AMAO46,1396656000,Movies
1,1,A3R27T4HADWFFJ,B000J10EQU,1387756800,Movies
2,1,A3R27T4HADWFFJ,B0000AZT3R,1389657600,Movies
3,1,AWF2S3UNW9UA0,B005LAIHQS,1361232000,Movies
4,1,AWF2S3UNW9UA0,B008220C38,1362441600,Movies
...,...,...,...,...,...
89579,1,A3GB1MQ9XK8CUA,B00ESLAIYK,1388448000,TV
89580,1,AGAWDSE1J20RI,B00H7KJTCG,1405468800,Movies
89581,1,AGAWDSE1J20RI,B00JAQJMJ0,1405468800,Movies
89582,1,A2E1WM75696X67,B00JAQJMJ0,1403481600,Movies
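
The item-level subsampling can be sketched as follows. This is a simplified illustration under stated assumptions: the helper `sample_by_item` is hypothetical, it draws uniformly over unique item ids, and it uses a fixed seed for reproducibility. The library function may weight or size the sample differently.

```python
import random

# Toy (user_id, item_id) interactions standing in for the rows of ns_df.
interactions = [
    ("u1", "i1"), ("u1", "i2"), ("u2", "i1"),
    ("u2", "i3"), ("u3", "i2"), ("u3", "i4"),
]


def sample_by_item(rows, sample_rate, seed=0):
    """Keep only interactions whose item falls in a random subset of item ids."""
    rng = random.Random(seed)
    unique_items = sorted({item for _, item in rows})
    n_keep = max(1, int(len(unique_items) * sample_rate))
    kept_items = set(rng.sample(unique_items, n_keep))  # without replacement
    return [row for row in rows if row[1] in kept_items]


sampled = sample_by_item(interactions, sample_rate=0.5)
print(sampled)
```

Sampling items rather than rows keeps every remaining item's full set of interactions, at the cost of shortening each user's history.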


The next step in the data preprocessing pipeline is to break the subsampled dataset at `sampled_instance_path` into train, validation and test sets. This is accomplished in part by the `_data_processing` function, which extends the file at `sampled_instance_path` with a column that indicates whether each sample belongs to the train, validation or test partition. For each user, the most recent interaction is allocated to the test set, the second most recent interaction is allocated to the validation set and the remaining interactions are allocated to the train set.

In [11]:
# Preprocess output by splitting into train, validation and test
preprocessed_output = _data_processing(sampled_instance_path)

pp_df = pd.read_csv(preprocessed_output, sep="\t", names=["set", "label", "user_id", "item_id", "timestamp", "cate_id"])
pp_df

Unnamed: 0,set,label,user_id,item_id,timestamp,cate_id
0,test,1,A1VKW06X1O2X7V,B009AMAO46,1396656000,Movies
1,valid,1,A3R27T4HADWFFJ,B000J10EQU,1387756800,Movies
2,test,1,A3R27T4HADWFFJ,B0000AZT3R,1389657600,Movies
3,train,1,AWF2S3UNW9UA0,B005LAIHQS,1361232000,Movies
4,train,1,AWF2S3UNW9UA0,B008220C38,1362441600,Movies
...,...,...,...,...,...,...
89579,test,1,A3GB1MQ9XK8CUA,B00ESLAIYK,1388448000,TV
89580,valid,1,AGAWDSE1J20RI,B00H7KJTCG,1405468800,Movies
89581,test,1,AGAWDSE1J20RI,B00JAQJMJ0,1405468800,Movies
89582,test,1,A2E1WM75696X67,B00JAQJMJ0,1403481600,Movies
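
The per-user split rule described above (latest interaction to test, second latest to validation, the rest to train) can be sketched in a few lines. The function `assign_splits` and the toy data are illustrative assumptions, not the library's code.

```python
from collections import defaultdict

# Toy (user_id, item_id, timestamp) interactions.
rows = [
    ("u1", "i1", 100), ("u1", "i2", 200), ("u1", "i3", 300), ("u1", "i4", 400),
    ("u2", "i1", 150), ("u2", "i5", 250),
]


def assign_splits(rows):
    """Label each interaction as train, valid or test by recency per user."""
    by_user = defaultdict(list)
    for user, item, ts in rows:
        by_user[user].append((ts, item))

    labelled = []
    for user, events in by_user.items():
        events.sort()  # oldest first
        last = len(events) - 1
        for pos, (ts, item) in enumerate(events):
            if pos == last:
                split = "test"
            elif pos == last - 1:
                split = "valid"
            else:
                split = "train"
            labelled.append((split, user, item, ts))
    return labelled


labelled = assign_splits(rows)
for row in labelled:
    print("\t".join(str(field) for field in row))
```

Under this rule a user with only two interactions contributes one validation and one test row and nothing to the train set.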


The `_data_generating` function expands each user's interactions in the train set into additional samples built from subsequences of that user's history. For example, given a user's interaction sequence `12345`, the subsequences `1`, `12`, `123`, `1234` and `12345` would be added to the train file. In addition, columns are added to the file at `train_path` that encode the items the user interacted with before the current interaction: the previous item ids `prev_item_ids`, their category ids `prev_cate_ids` and their timestamps `prev_timestamps`.

In [12]:
_data_generating(preprocessed_output, train_path, valid_path, test_path)

train_df = pd.read_csv(train_path, sep="\t", index_col=False, names=["label", "user_id", "item_id", "cate_id", "timestamp", "prev_item_ids", "prev_cate_ids", "prev_timestamps"])
train_df

Unnamed: 0,label,user_id,item_id,cate_id,timestamp,prev_item_ids,prev_cate_ids,prev_timestamps
0,1,AWF2S3UNW9UA0,B008220C38,Movies,1362441600,B005LAIHQS,Movies,1361232000
1,1,AWF2S3UNW9UA0,B009AMANBA,Movies,1365033600,"B005LAIHQS,B008220C38","Movies,Movies","1361232000,1362441600"
2,1,AWF2S3UNW9UA0,B00B74MJOS,Movies,1367625600,"B005LAIHQS,B008220C38,B009AMANBA","Movies,Movies,Movies","1361232000,1362441600,1365033600"
3,1,AWF2S3UNW9UA0,B0067EKYL8,Movies,1371686400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS","Movies,Movies,Movies,Movies","1361232000,1362441600,1365033600,1367625600"
4,1,AWF2S3UNW9UA0,0792839072,Movies,1372982400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies","1361232000,1362441600,1365033600,1367625600,13..."
...,...,...,...,...,...,...,...,...
16630,1,A1WZZDWYPVST2M,B008JFUUIA,Movies,1365552000,B005S9ELM6,Movies,1365552000
16631,1,A37K6TJ94ZFXVQ,B008JFUOWM,Movies,1390262400,B00B74MJOS,Movies,1368144000
16632,1,A16342W88H5YWK,B0090SI3ZW,Movies,1364256000,B007R6D74G,Movies,1348185600
16633,1,AA3UZRM4EFLK2,B0067EKYL8,Movies,1365465600,B005S9ELM6,Movies,1365465600
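
The prefix expansion from the `12345` example can be sketched as below. The helper `expand_history` is hypothetical. It also assumes that a sample is only emitted when at least one previous item exists, which is consistent with the train rows shown above but is not confirmed by the text.

```python
def expand_history(items):
    """Return (target item, previous items) for every prefix with a history."""
    samples = []
    for pos in range(1, len(items)):
        # The item at `pos` is the target; everything before it is the history.
        samples.append((items[pos], items[:pos]))
    return samples


sequence = ["1", "2", "3", "4", "5"]
samples = expand_history(sequence)
for target, history in samples:
    # History is comma-joined, mirroring the prev_item_ids column.
    print(target, ",".join(history))
```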


The next step is to generate the user, item and category vocabulary files `user_vocab`, `item_vocab` and `cate_vocab` for the train set using the `_create_vocab` function. These files simply contain all of the unique ids for users, items and categories.

In [13]:
# Create user, item and category vocabulary files
_create_vocab(train_path, user_vocab_path, item_vocab_path, cate_vocab_path)
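
For intuition, a vocabulary can be built by collecting the unique ids in the train set and serializing them with `pickle`. The sketch below additionally maps each id to an integer index ordered by frequency. That mapping, the helper `build_vocab` and the toy rows are assumptions made for illustration and may not match the layout of the files written by `_create_vocab`.

```python
import pickle
from collections import Counter

# Toy (user_id, item_id, cate_id) rows standing in for the train file.
train_rows = [("u1", "i1", "Movies"), ("u1", "i2", "TV"), ("u2", "i1", "Movies")]


def build_vocab(values):
    """Map each unique id to an index, most frequent first (ties by id)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: index for index, value in enumerate(ordered)}


user_vocab = build_vocab(user for user, _, _ in train_rows)
item_vocab = build_vocab(item for _, item, _ in train_rows)
cate_vocab = build_vocab(cate for _, _, cate in train_rows)

# Round-trip through pickle, as would happen when writing a .pkl file.
restored = pickle.loads(pickle.dumps(item_vocab))
print(user_vocab, item_vocab, cate_vocab)
```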

The final step of the data preprocessing is to randomly draw negative samples to include in the validation and test sets using the `_negative_sampling_offline` function. Negative samples are interactions between users and items that did not happen. They are needed to evaluate a recommender system, because the model must be able to rank the item the user actually interacted with above items the user did not interact with. Negative samples are denoted by a label of 0 and are only present in the validation and test sets.

In [14]:
# Add negative samples to the validation and test sets
_negative_sampling_offline(
    sampled_instance_path, valid_path, test_path, valid_num_ngs, test_num_ngs
)

In [15]:
# Visualize validation dataset dataframe
valid_df = pd.read_csv(valid_path, sep="\t", index_col=False, names=["label", "user_id", "item_id", "cate_id", "timestamp", "prev_item_ids", "prev_cate_ids", "prev_timestamps"])
valid_df

Unnamed: 0,label,user_id,item_id,cate_id,timestamp,prev_item_ids,prev_cate_ids,prev_timestamps
0,1,AWF2S3UNW9UA0,B00005K3OT,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
1,0,AWF2S3UNW9UA0,B0090SI3ZW,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
2,0,AWF2S3UNW9UA0,B00E8RK5OC,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
3,0,AWF2S3UNW9UA0,6305171769,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
4,0,AWF2S3UNW9UA0,B00005JPFX,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
...,...,...,...,...,...,...,...,...
34360,1,A173F44ZGP878J,B00E8RK5OC,Movies,1383264000,B009AMANBA,Movies,1365811200
34361,0,A173F44ZGP878J,B00005JPS8,Movies,1383264000,B009AMANBA,Movies,1365811200
34362,0,A173F44ZGP878J,B009934S5M,Movies,1383264000,B009AMANBA,Movies,1365811200
34363,0,A173F44ZGP878J,B000E1MTYK,Movies,1383264000,B009AMANBA,Movies,1365811200
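
The layout above, one positive row followed by a fixed number of negative rows for the same user, can be reproduced with a short sketch. The function `add_negatives` is hypothetical. It excludes only the positive item from the candidate pool and uses a fixed seed, both simplifying assumptions rather than the library's behavior.

```python
import random

all_items = ["i1", "i2", "i3", "i4", "i5", "i6"]
positives = [("u1", "i4"), ("u2", "i5")]  # (user_id, positive item_id)


def add_negatives(positives, all_items, num_negatives, seed=0):
    """Follow each positive row (label 1) with sampled negative rows (label 0)."""
    rng = random.Random(seed)
    rows = []
    for user, pos_item in positives:
        rows.append((1, user, pos_item))
        # Assumption: any item other than the positive one is a valid negative.
        candidates = [item for item in all_items if item != pos_item]
        for neg_item in rng.sample(candidates, num_negatives):
            rows.append((0, user, neg_item))
    return rows


valid_rows = add_negatives(positives, all_items, num_negatives=4)
for row in valid_rows:
    print("\t".join(str(field) for field in row))
```

With `num_negatives=4` each user contributes five rows, matching the one-positive-to-four-negatives pattern that `valid_num_ngs = 4` produces in the validation set.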


In [16]:
# Visualize test dataset dataframe
test_df = pd.read_csv(test_path, sep="\t", index_col=False, names=["label", "user_id", "item_id", "cate_id", "timestamp", "prev_item_ids", "prev_cate_ids", "prev_timestamps"])
test_df

Unnamed: 0,label,user_id,item_id,cate_id,timestamp,prev_item_ids,prev_cate_ids,prev_timestamps
0,1,A3R27T4HADWFFJ,B0000AZT3R,Movies,1389657600,B000J10EQU,Movies,1387756800
1,0,A3R27T4HADWFFJ,B0000VD02Y,Movies,1389657600,B000J10EQU,Movies,1387756800
2,0,A3R27T4HADWFFJ,B00005JPS8,Movies,1389657600,B000J10EQU,Movies,1387756800
3,0,A3R27T4HADWFFJ,B00003CXXO,Movies,1389657600,B000J10EQU,Movies,1387756800
4,0,A3R27T4HADWFFJ,B000C3L27K,Movies,1389657600,B000J10EQU,Movies,1387756800
...,...,...,...,...,...,...,...,...
169165,0,AGAWDSE1J20RI,B002ZG98R8,Movies,1405468800,B00H7KJTCG,Movies,1405468800
169166,0,AGAWDSE1J20RI,B00005JPFX,Movies,1405468800,B00H7KJTCG,Movies,1405468800
169167,0,AGAWDSE1J20RI,B000AE4QD8,TV,1405468800,B00H7KJTCG,Movies,1405468800
169168,0,AGAWDSE1J20RI,B000BTJDG2,Movies,1405468800,B00H7KJTCG,Movies,1405468800
