## Introduction

This notebook presents dataloading and preprocssing on [Amazon Review Dataset](https://nijianmo.github.io/amazon/index.html). This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs) across 26 high level groups. In this notebook we use the Movies and TV group including reviews from 123960 users on 50052 items.

This notebook consists of three main sections:
1. Data Loading
2. Data Preprocessing
3. Negative Sampling

In [8]:
import os 
import pickle
import random
import tempfile
import zipfile
from collections import Counter
from pathlib import Path
from typing import Tuple

import numpy as np
import pandas as pd
from dotmap import DotMap
from recommenders.datasets.amazon_reviews import (_reviews_preprocessing, 
                                                  _meta_preprocessing, 
                                                  _create_instance, 
                                                  _get_sampled_data,
                                                  _create_item2cate,
                                                  download_and_extract)
from tqdm import trange

Below are the list of arguments and parameters used in the process of loading and preprocessing the dataset. Feel free to alter the values according to your experiment.

In [9]:
REVIEWS_FILE = 'reviews_Movies_and_TV_5.json'
META_FILE = 'meta_Movies_and_TV.json'

args = DotMap()

args.dataset_root = './Data'
args.dataset_name = 'amazon'
args.dataset_path = os.path.join(args.dataset_root, args.dataset_name)
args.reviews_path = os.path.join(args.dataset_path, REVIEWS_FILE)
args.meta_path =  os.path.join(args.dataset_path, META_FILE)

args.preprocessed = Path(os.path.join('./Data', args.dataset_name, 'preprocessed'))
args.sample_rate = 0.1
args.min_uc = 5
args.min_ic = 10

args.train_batch_size = 128
args.val_batch_size = 128
args.test_batch_size = 128

args.negative_sample_size = 100

args.data_seed = 98765
random.seed(args.data_seed)
np.random.seed(args.data_seed)

## Data Loading

There are two main files included in the dataset: 
1. reviews file: The raw review data exists in a file `reviews_path` where each row corresponds to the review of product by a user at a specific time. The data takes the form of a dictionairy that contains keys of interest such as the user id `reviewer_id`, the item id (Amazon Standard Identification Number) `asin` and a timestamp of the review `unixReviewTime`. 


In [10]:
os.makedirs(args.dataset_root, exist_ok='True')    
os.makedirs(args.preprocessed, exist_ok='True')
    
# Download and extract review data
if not os.path.exists(args.reviews_path):
    download_and_extract(REVIEWS_FILE, args.reviews_path)

# Visualize review data
with open(args.reviews_path, "r") as f:
    for line in list(f)[:1]: 
        print(line)    

100%|██████████| 692k/692k [00:20<00:00, 33.2kKB/s]


{"reviewerID": "ADZPIG9QOCDG5", "asin": "0005019281", "reviewerName": "Alice L. Larson \"alice-loves-books\"", "helpful": [0, 0], "reviewText": "This is a charming version of the classic Dicken's tale.  Henry Winkler makes a good showing as the \"Scrooge\" character.  Even though you know what will happen this version has enough of a change to make it better that average.  If you love A Christmas Carol in any version, then you will love this.", "overall": 4.0, "summary": "good version of a classic", "unixReviewTime": 1203984000, "reviewTime": "02 26, 2008"}



2. metadata file: The raw metadata exists in a file `meta_path` where each row corresponds to a product. The data the form of a dictionairy that contains keys of interest such as the categories of the item `categories`, a description of the item `description` and the price of an item `price`.

In [11]:
# Download and extract metadata 
if not os.path.exists(args.meta_path):
    download_and_extract(META_FILE, args.meta_path)
    
# Visualize metadata
with open(args.meta_path, "r") as f:
    for line in list(f)[:1]: 
        print(line)

100%|██████████| 97.5k/97.5k [00:04<00:00, 22.5kKB/s]


{'asin': '0000143561', 'categories': [['Movies & TV', 'Movies']], 'description': '3Pack DVD set - Italian Classics, Parties and Holidays.', 'title': 'Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays', 'price': 12.99, 'salesRank': {'Movies & TV': 376041}, 'imUrl': 'http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif', 'related': {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC', 'B002I5GNVU', 'B000RBU4BM'], 'buy_after_viewing': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC']}}



The relevent information is extracted from the reviews file including the user ids, item ids, and the timestamp of the reviews.

In [12]:
# Extract relevant information from reviews_path file
reviews_writefile = _reviews_preprocessing(str(args.reviews_path))

# Visualize extracted review data
with open(reviews_writefile, "r") as f:
    for line in list(f)[:5]: 
        print(line)

ADZPIG9QOCDG5	0005019281	1203984000

A35947ZP82G7JH	0005019281	1388361600

A3UORV8A9D5L2E	0005019281	1388361600

A1VKW06X1O2X7V	0005019281	1202860800

A3R27T4HADWFFJ	0005019281	1387670400



The relevent information is extracted from the metadata file including the item ids, and their category.

In [13]:
# Extract relevant information from meta_path file
meta_writefile = _meta_preprocessing(args.meta_path)

# Visualize extracted review data
with open(meta_writefile, "r") as f:
    for line in list(f)[:5]: 
        print(line)

0000143561	Movies

0000589012	Movies

0000695009	Movies

000107461X	Movies

0000143529	Movies



The extracted information from meta and reviews file are merged to shape the main dataframe.

In [14]:
# Merge review and meta files into instance_output file
instance_output_path = _create_instance(reviews_writefile, meta_writefile)
df = pd.read_csv(
    instance_output_path, sep="\t", 
    names=["label", "user_id", "item_id", "timestamp", "category"]
)
df = df.sort_values(by=['user_id', 'timestamp'], ignore_index=True)
df = df.drop(df[df.label == 0].index)
df

Unnamed: 0,label,user_id,item_id,timestamp,category
0,1,A00295401U6S2UG3RAQSZ,0767015533,1353196800,TV
1,1,A00295401U6S2UG3RAQSZ,0792838084,1353196800,Movies
2,1,A00295401U6S2UG3RAQSZ,6304484054,1353196800,Movies
3,1,A00295401U6S2UG3RAQSZ,6305182205,1353196800,TV
4,1,A00295401U6S2UG3RAQSZ,B00004W22I,1353196800,TV
...,...,...,...,...,...
1697528,1,AZZZMSZI9LKE6,B00009K77X,1361318400,TV
1697529,1,AZZZMSZI9LKE6,B0090XUARQ,1361577600,TV
1697530,1,AZZZMSZI9LKE6,B00AY2DL78,1364256000,TV
1697531,1,AZZZMSZI9LKE6,B00BC5FN2C,1397606400,TV


## Data Preprocessing

The goal of data preprocessing is to make the dataset compatible with BERT4Rec. The preprocessing steps are mainly based on the Pytorch implementation of BERT4Rec provided [here](https://github.com/jaywonchung/BERT4Rec-VAE-Pytorch).


As the first step, we randomly sample `user_count * sample_rate` users to reduce the size of dataset if necessary.

In [15]:
def sample_users(df, sample_rate):
    user_count = df["user_id"].nunique()
    users = list(df["user_id"])
    users_sample, count = set(), 0
    random_users = np.random.choice(users, 
                                    size=int(user_count * args.sample_rate), 
                                    replace=False)

    df = df[df["user_id"].isin(random_users)]
    return df

Then the data is filtered based on: 
1. Number of reviews per item: removing the items that have been reviewed fewer times than `args.min_ic`.
2. Number of reviews per user: removing the users who have reviewed items fewer times than `args.min_uc`.

In [16]:
def filter_min_ic(df, min_ic):
    """removes the movie records that have been reviewed less frequent than minimum

    Args:
        df (pd.Dataframe): a pandas dataframe including label, user_id, item_id, timestamp, 
        category)

    Returns:
        df (pd.Dataframe): the updated dataframe
    """
    if min_ic > 0:
            item_sizes = df.groupby('item_id').size()
            good_items = item_sizes.index[item_sizes >= min_ic]
            df = df[df['item_id'].isin(good_items)]
    return df

def filter_min_uc(df, min_uc):
    """removes the user records that have reviewed less frequent than minimum

    Args:
        df (pd.Dataframe): a pandas dataframe including label, user_id, item_id, timestamp, 
        category)

    Returns:
        df (pd.Dataframe): the updated dataframe
    """
    if min_uc > 0:
            user_sizes = df.groupby('user_id').size()
            good_users = user_sizes.index[user_sizes >= min_uc]
            df = df[df['user_id'].isin(good_users)]
    return df

After filtering some of the records in the dataset, the item ids and the user ids are mapped to integer values.

In [17]:
def densify_index(df):
    """reassigns the user and movie ids to remove the gaps caused by deletions

    Args:
        df (pd.Dataframe): a pandas dataframe including label, user_id, item_id, timestamp, 
        category)

    Returns:
        df (pd.Dataframe): the updated dataframe
    """
    umap = {u: i for i, u in enumerate(set(df['user_id']))}
    smap = {s: i for i, s in enumerate(set(df['item_id']))}
    df['user_id'] = df['user_id'].map(umap)
    df['item_id'] = df['item_id'].map(smap)
    return df, umap, smap

Finally the dataset is split into three subsets for training, validation, and testing. Since BERT4Rec adopts leave-one-out evaluation method, the dataset is split in a way that for each user, the last item of the sequence is held as the test data, the item just before the last is held as the validation set, and the remaining items are used for training.

In [18]:
def split_df(df, user_count):
    """splits dataset to train, validation, and test sets

    Args:
        df (pd.Dataframe): the preprocessed dataframe
        user_count (int): number of all users in the dataset

    Returns:
        Tuple: a tuple of data splits
    """
    user_group = df.groupby('user_id')
    user2items = user_group.apply(lambda d: list(d.sort_values(by='timestamp')['item_id']))
    train, val, test = {}, {}, {}
    for user in range(user_count):
        items = user2items[user]
        train[user], val[user], test[user] = items[:-2], items[-2:-1], items[-1:]
    return train, val, test

All the preprocessing functions are applied to the data as below, and the final dataframe and the data splits are stored in the `preprocessed` directory.

In [19]:
df = sample_users(df, args.sample_rate)
df = filter_min_ic(df, args.min_ic)
df = filter_min_uc(df, args.min_uc)

df, umap, smap = densify_index(df)

user_count = len(umap)
item_count = len(smap)

df = df.sort_values(by=['user_id', 'timestamp', 'item_id'], ignore_index=True)

train, val, test = split_df(df, user_count)

dataset = {'train': train,
            'val': val,
            'test': test,
            'umap': umap,
            'smap': smap}

dataset_path = args.preprocessed.joinpath('dataset.pkl')
with open(dataset_path, 'wb') as f:
    pickle.dump(dataset, f)

df_path = args.preprocessed.joinpath('preprocessed.csv')
df.to_csv(df_path, index=False)


In [20]:
df

Unnamed: 0,label,user_id,item_id,timestamp,category
0,1,0,4294,1121644800,TV
1,1,0,5295,1137801600,Movies
2,1,0,1207,1143763200,Movies
3,1,0,2517,1144022400,Movies
4,1,0,7735,1152316800,Movies
...,...,...,...,...,...
361019,1,8197,1944,1353110400,Movies
361020,1,8197,2532,1353110400,Movies
361021,1,8197,11569,1353110400,Movies
361022,1,8197,9278,1355529600,TV


To see how the data is split, let's take a look at the reviews by user 0 that are sorted based on the timestamp, and then see which partitions of the sequence are included in the train, validation, and test sets.

In [22]:
df[df['user_id'].isin([0])]

Unnamed: 0,label,user_id,item_id,timestamp,category
0,1,0,4294,1121644800,TV
1,1,0,5295,1137801600,Movies
2,1,0,1207,1143763200,Movies
3,1,0,2517,1144022400,Movies
4,1,0,7735,1152316800,Movies
5,1,0,10079,1152403200,Movies
6,1,0,925,1153958400,Movies
7,1,0,1714,1153958400,Movies
8,1,0,4518,1153958400,Movies
9,1,0,6151,1153958400,Movies


The train data includes all the above records except the last two.

In [23]:
train[0]

[4294,
 5295,
 1207,
 2517,
 7735,
 10079,
 925,
 1714,
 4518,
 6151,
 7524,
 1798,
 9523,
 10314,
 230,
 2020,
 5499,
 9096,
 8926,
 5213,
 6392,
 9519,
 1240,
 9004,
 10499,
 10700,
 6138,
 11605,
 4211,
 10281,
 2557,
 4324,
 4209,
 6275,
 6155,
 6035,
 10813,
 5504,
 6735,
 2462,
 4025,
 7548,
 1494,
 3768,
 11457,
 9535,
 2925,
 1087,
 805]

The last but one is used for validation.

In [24]:
val[0]

[3437]

The last record is used as the test data.

In [25]:
test[0]

[10062]

## Negative Sampling

To evaluate BERT4Rec, each ground truth item in the validation and test sets is paired with a number of sampled negative items that the user has not interacted with. The evaluation mertrics are then calculated using these values.

Below two negative samplers are presented:
1. Negative sampling based on popularity: The popular items are those ones that are reviewed more frequently. In this case, the negative samples are selected from a sorted list of popular items that the user has not interacted with.
2. Random negative sampling: The items are randomly sampled.

In [26]:

def items_by_popularity(train, val, test, user_count):
    """sorts the items based on their popularity

    Args:
        train (dict): train dataset
        val (dict): validation dataset
        test (dict): test dataset
        user_count (int): total number of users

    Returns:
        list: list of items sorted based on their popularity
    """
    popularity = Counter()
    for user in range(user_count):
        popularity.update(train[user])
        popularity.update(val[user])
        popularity.update(test[user])
    popular_items = sorted(popularity, key=popularity.get, reverse=True)
    return popular_items

def popular_negative_samples(train, val, test, user_count):
    """generates negative samples from the popular items

    Args:
        train (dict): train dataset
        val (dict): validation dataset
        test (dict): test dataset
        user_count (int): total number of users

    Returns:
        dict: negatives samples for all users
    """
    popular_items = items_by_popularity(train, val, test, user_count)

    negative_samples = {}
    for user in trange(user_count):
        seen = set(train[user])
        seen.update(val[user])
        seen.update(test[user])

        samples = []
        for item in popular_items:
            if len(samples) == args.negative_sample_size:
                break
            if item in seen:
                continue
            samples.append(item)

        negative_samples[user] = samples
    
    negatives_path = args.preprocessed.joinpath('popular_negatives.pkl')
    with negatives_path.open('wb') as f:
        pickle.dump(negative_samples, f)

    return negative_samples




In [27]:
popular_samples = popular_negative_samples(train, val, test, user_count)
popular_samples[0]

100%|██████████| 8198/8198 [00:02<00:00, 3300.15it/s]


[11197,
 4041,
 3558,
 10139,
 2017,
 11051,
 3217,
 5014,
 1284,
 7093,
 2890,
 7197,
 5018,
 7206,
 3418,
 1275,
 9724,
 8598,
 3268,
 6958,
 1132,
 11583,
 1010,
 5211,
 1403,
 11615,
 3186,
 10826,
 185,
 2223,
 6108,
 9854,
 11559,
 4121,
 1553,
 7419,
 467,
 8255,
 1856,
 9956,
 7401,
 6220,
 7135,
 9233,
 11524,
 9149,
 6657,
 4146,
 4972,
 4488,
 10841,
 357,
 4089,
 2025,
 7591,
 11503,
 8977,
 10726,
 6302,
 10460,
 5817,
 8281,
 5452,
 1375,
 3554,
 9182,
 9923,
 3549,
 3933,
 1241,
 6595,
 11426,
 8158,
 2867,
 110,
 3219,
 4864,
 3173,
 417,
 11606,
 9616,
 366,
 9474,
 10372,
 8451,
 3181,
 2207,
 11463,
 6320,
 6357,
 6809,
 4817,
 4788,
 7123,
 1448,
 3018,
 2201,
 10331,
 7633,
 5793]

In [28]:
def random_negative_samples(train, val, test, user_count, item_count):
    """generates negative samples randomly

    Args:
        train (dict): train dataset
        val (dict): validation dataset
        test (dict): test dataset
        user_count (int): total number of users
        item_count (int): total number of items

    Returns:
        dict: negatives samples for all users
    """
    np.random.seed(args.data_seed)
    negative_samples = {}
    print('Sampling negative items')
    for user in trange(user_count):
        seen = set(train[user])
        seen.update(val[user])
        seen.update(test[user])

        samples = []
        for _ in range(args.negative_sample_size):
            item = np.random.choice(item_count) + 1
            while item in seen or item in samples:
                item = np.random.choice(item_count) + 1
            samples.append(item)

        negative_samples[user] = samples

    negatives_path = args.preprocessed.joinpath('random_negatives.pkl')
    with negatives_path.open('wb') as f:
        pickle.dump(negative_samples, f)

    return negative_samples

In [29]:
random_samples = random_negative_samples(train, val, test, user_count, item_count)
random_samples[0]

Sampling negative items


100%|██████████| 8198/8198 [00:15<00:00, 525.30it/s]


[3862,
 3572,
 9581,
 5041,
 7719,
 4828,
 2058,
 9605,
 922,
 4480,
 2035,
 4171,
 3584,
 5597,
 3576,
 10721,
 5456,
 6251,
 2835,
 11672,
 10047,
 8362,
 895,
 1036,
 5018,
 141,
 5174,
 8086,
 1736,
 5070,
 1647,
 657,
 11311,
 274,
 4533,
 3792,
 733,
 7785,
 9063,
 3257,
 3621,
 2435,
 11000,
 1352,
 6791,
 6839,
 3803,
 8296,
 8010,
 7126,
 2471,
 9046,
 2125,
 10687,
 3937,
 2447,
 6760,
 2398,
 2890,
 8196,
 2570,
 9139,
 9644,
 9115,
 2790,
 9444,
 4230,
 9852,
 9245,
 1911,
 8604,
 10752,
 7350,
 6809,
 2688,
 10966,
 9087,
 3823,
 8660,
 8785,
 9104,
 8978,
 1106,
 9933,
 10310,
 2227,
 9557,
 4807,
 10379,
 1195,
 428,
 10000,
 4082,
 1429,
 8514,
 2452,
 726,
 359,
 4665,
 2178]