## Introduction

This notebook presents data loading and preprocessing for the [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) dataset, one of the benchmark datasets used for training and evaluating the BERT4Rec model. The dataset includes about 1 million movie ratings from roughly 6,000 users on roughly 4,000 movies. Since each user's ratings were collected alongside their timestamps, the sequence of ratings can be used for sequential recommendation.

This notebook consists of three main sections:
1. Data Loading
2. Data Preprocessing
3. Negative Sampling

In [53]:
import os 
import pickle
import tempfile
import wget
import zipfile
from collections import Counter
from pathlib import Path
from typing import Tuple

import numpy as np
import pandas as pd
from tqdm import trange

from dotmap import DotMap

Below is the list of arguments and parameters used for loading and preprocessing the dataset. Feel free to alter the values according to your experiment.

In [54]:
FILES = ['movies.dat', 'ratings.dat']

args = DotMap()

args.dataset_root = Path('/ssd003/projects/aieng/public/recsys_datasets/movielens')
args.dataset_name = 'ml-1m'
args.dataset_url = 'https://files.grouplens.org/datasets/movielens/ml-1m.zip'
args.dataset_path = args.dataset_root.joinpath(args.dataset_name)

args.preprocessed = Path(os.path.join('./Data', args.dataset_name, 'preprocessed'))
args.min_rating = 0
args.min_uc = 5
args.min_mc = 0

args.train_batch_size = 128
args.val_batch_size = 128
args.test_batch_size = 128

args.negative_sample_size = 100

args.data_seed = 98765

## Data Loading

There are three main files included in the dataset: 
1. users file: includes user-specific information such as age and gender, with `user_id` as the identifier.
2. movies file: includes title and genre of the movies, with `item_id` as the identifier.
3. ratings file: includes the `rating`s on the movie `item_id` by the user `user_id`, and the corresponding time stamps.

Below, the dataset is downloaded if it doesn't already exist.



In [55]:
os.makedirs(args.dataset_root, exist_ok=True)
os.makedirs(args.preprocessed, exist_ok=True)

if all(args.dataset_path.joinpath(filename).is_file() for filename in FILES):
    print('Raw data already exists. Skip downloading')
else:
    wget.download(args.dataset_url, f'{args.dataset_root}/ml-1m.zip')
    # A context manager closes the archive and avoids shadowing the built-in `zip`
    with zipfile.ZipFile(args.dataset_root / 'ml-1m.zip') as archive:
        archive.extractall(args.dataset_root)
    print(f'Downloaded dataset into {os.path.abspath(args.dataset_root)}')

Raw data already exists. Skip downloading


Below, the data from the movies file and the ratings file are merged into a single dataframe. The same dataframe is later preprocessed and used to form the train, validation, and test sets.

In [56]:
def load_ratings_df():
    """loads the ratings by user alongside the movie details

    Returns:
        pd.DataFrame: a pandas dataframe including user_id, item_id, rating, timestamp,
        title, and genre
    """
    folder_path = args.dataset_path

    file_path_r = folder_path.joinpath('ratings.dat')
    df_r = pd.read_csv(file_path_r, sep='::', header=None, engine='python')
    df_r.columns = ['user_id', 'item_id', 'rating', 'timestamp']

    file_path_m = folder_path.joinpath('movies.dat')
    df_m = pd.read_csv(
        file_path_m, 
        sep='::', 
        header=None, 
        engine='python', 
        encoding='ISO-8859-1'
    )
    df_m.columns = ['item_id', 'title', 'genre']

    df_r = df_r.join(df_m.set_index('item_id'), on='item_id')
    return df_r.sort_values(by=['user_id', 'timestamp'], ignore_index=True)

load_ratings_df()

Unnamed: 0,user_id,item_id,rating,timestamp,title,genre
0,1,3186,4,978300019,"Girl, Interrupted (1999)",Drama
1,1,1270,5,978300055,Back to the Future (1985),Comedy|Sci-Fi
2,1,1721,4,978300055,Titanic (1997),Drama|Romance
3,1,1022,5,978300055,Cinderella (1950),Animation|Children's|Musical
4,1,2340,3,978300103,Meet Joe Black (1998),Romance
...,...,...,...,...,...,...
1000204,6040,2917,4,997454429,Body Heat (1981),Crime|Thriller
1000205,6040,1921,4,997454464,Pi (1998),Sci-Fi|Thriller
1000206,6040,1784,3,997454464,As Good As It Gets (1997),Comedy|Drama
1000207,6040,161,3,997454486,Crimson Tide (1995),Drama|Thriller|War
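
The join performed in `load_ratings_df` can be illustrated on a toy example. The sketch below uses made-up values in place of the real files and also shows that `merge(how='left')` is an equivalent way to write the same join; it is an illustration under those assumptions, not part of the pipeline.

```python
import pandas as pd

# Toy stand-ins for ratings.dat and movies.dat (all values are made up).
ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'item_id': [10, 20, 10],
    'rating': [4, 5, 3],
    'timestamp': [300, 100, 200],
})
movies = pd.DataFrame({
    'item_id': [10, 20],
    'title': ['Movie A', 'Movie B'],
    'genre': ['Drama', 'Comedy'],
})

# Left-join the movie details onto each rating, as load_ratings_df does.
joined = ratings.join(movies.set_index('item_id'), on='item_id')

# merge(how='left') is another spelling of the same join.
merged = ratings.merge(movies, on='item_id', how='left')

# Order each user's ratings chronologically.
joined = joined.sort_values(by=['user_id', 'timestamp'], ignore_index=True)
merged = merged.sort_values(by=['user_id', 'timestamp'], ignore_index=True)
print(joined)
```

Every rating keeps its row, and the title and genre are repeated for each rating of the same movie.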


## Data Preprocessing

The goal of data preprocessing is to make the dataset compatible with BERT4Rec. The preprocessing steps are based on the PyTorch implementation of BERT4Rec provided [here](https://github.com/jaywonchung/BERT4Rec-VAE-Pytorch).

The first step of preprocessing is to filter the data based on:
1. Rating values: removing the records with a rating lower than `args.min_rating`.
2. Number of ratings per movie: removing the movies that have been rated fewer times than `args.min_mc`.
3. Number of ratings per user: removing the users who have rated movies fewer times than `args.min_uc`.

In [57]:
def filter_min_rating(df, min_rating):
    """removes the records that have ratings lower than the minimum

    Args:
        df (pd.DataFrame): a pandas dataframe including user_id, item_id, rating,
        timestamp, title, and genre
        min_rating (int): the minimum rating value to keep

    Returns:
        pd.DataFrame: the filtered dataframe
    """
    return df[df['rating'] >= min_rating]

def filter_min_mc(df, min_mc):
    """removes the movies that have been rated fewer times than the minimum

    Args:
        df (pd.DataFrame): a pandas dataframe including user_id, item_id, rating,
        timestamp, title, and genre
        min_mc (int): the minimum number of ratings per movie (0 disables the filter)

    Returns:
        pd.DataFrame: the filtered dataframe
    """
    if min_mc > 0:
        item_sizes = df.groupby('item_id').size()
        good_items = item_sizes.index[item_sizes >= min_mc]
        df = df[df['item_id'].isin(good_items)]
    return df

def filter_min_uc(df, min_uc):
    """removes the users who have rated fewer movies than the minimum

    Args:
        df (pd.DataFrame): a pandas dataframe including user_id, item_id, rating,
        timestamp, title, and genre
        min_uc (int): the minimum number of ratings per user (0 disables the filter)

    Returns:
        pd.DataFrame: the filtered dataframe
    """
    if min_uc > 0:
        user_sizes = df.groupby('user_id').size()
        good_users = user_sizes.index[user_sizes >= min_uc]
        df = df[df['user_id'].isin(good_users)]
    return df
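
To make the effect of the three filters concrete, here is a minimal sketch on a made-up dataframe. The helper `keep_frequent` is hypothetical (it is not defined in this notebook) and simply generalizes the count-based filters; the thresholds are chosen only for illustration.

```python
import pandas as pd

# Made-up ratings: user 2 has a single, low rating; item 7 is rated only once.
toy = pd.DataFrame({
    'user_id': [0, 0, 0, 1, 1, 2],
    'item_id': [5, 6, 7, 5, 6, 5],
    'rating':  [4, 2, 5, 3, 4, 1],
})

def keep_frequent(df, column, min_count):
    """Keeps the rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].map(df[column].value_counts())
    return df[counts >= min_count]

step1 = toy[toy['rating'] >= 2]             # rating filter: drops user 2's rating of 1
step2 = keep_frequent(step1, 'item_id', 2)  # movie filter: drops item 7 (rated once)
step3 = keep_frequent(step2, 'user_id', 2)  # user filter: both remaining users pass
print(step3)
```

Note that the filters run in sequence, so the counts used by each step are computed on the output of the previous one.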

After filtering some of the records in the dataset, the movie ids and the user ids are densified so that there is no missing id value in the sequence of all ids.

In [58]:
def densify_index(df):
    """reassigns the user and movie ids to remove the gaps caused by deletions

    Args:
        df (pd.DataFrame): a pandas dataframe including user_id, item_id, rating,
        timestamp, title, and genre

    Returns:
        Tuple: the updated dataframe, the user id mapping, and the item id mapping
    """
    # Work on a copy so that a filtered slice is not modified in place
    df = df.copy()
    umap = {u: i for i, u in enumerate(set(df['user_id']))}
    smap = {s: i for i, s in enumerate(set(df['item_id']))}
    df['user_id'] = df['user_id'].map(umap)
    df['item_id'] = df['item_id'].map(smap)
    return df, umap, smap
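
As a small illustration of densification, the sketch below remaps made-up ids to a gap-free range starting at 0. Unlike `densify_index`, it sorts the unique ids first so the mapping does not depend on set iteration order; that choice is an assumption of this example, not what the notebook does.

```python
import pandas as pd

# Made-up ids with gaps, as they might look after filtering.
toy = pd.DataFrame({'user_id': [3, 3, 9], 'item_id': [40, 7, 40]})

# Map each distinct raw id to its position in sorted order.
umap = {u: i for i, u in enumerate(sorted(toy['user_id'].unique()))}
smap = {s: i for i, s in enumerate(sorted(toy['item_id'].unique()))}

# assign() returns a new dataframe and leaves `toy` untouched.
dense = toy.assign(
    user_id=toy['user_id'].map(umap),
    item_id=toy['item_id'].map(smap),
)
print(dense)
```

Here users 3 and 9 become 0 and 1, and items 7 and 40 become 0 and 1.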

Finally, the dataset is split into three subsets for training, validation, and testing. Since BERT4Rec adopts the leave-one-out evaluation method, the split is done per user: the last item of the rating sequence is held out as test data, the item just before it is held out as validation data, and the remaining items are used for training.

In [59]:
def split_df(df, user_count):
    """splits dataset to train, validation, and test sets

    Args:
        df (pd.Dataframe): the preprocessed dataframe
        user_count (int): number of all users in the dataset

    Returns:
        Tuple: a tuple of data splits
    """
    user_group = df.groupby('user_id')
    user2items = user_group.apply(lambda d: list(d.sort_values(by='timestamp')['item_id']))
    train, val, test = {}, {}, {}
    for user in range(user_count):
        items = user2items[user]
        train[user], val[user], test[user] = items[:-2], items[-2:-1], items[-1:]
    return train, val, test
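
The slicing used in `split_df` can be checked on a single made-up sequence. The helper name `leave_one_out` is hypothetical and only wraps the same three slices.

```python
def leave_one_out(items):
    """Splits one chronologically ordered item list into train/val/test parts."""
    return items[:-2], items[-2:-1], items[-1:]

train_part, val_part, test_part = leave_one_out([11, 22, 33, 44, 55])
print(train_part, val_part, test_part)  # [11, 22, 33] [44] [55]

# Edge case: with a single rating, the train and validation parts are empty.
short_train, short_val, short_test = leave_one_out([11])
print(short_train, short_val, short_test)  # [] [] [11]
```

The edge case shows why a minimum number of ratings per user matters: with `args.min_uc = 5`, every user keeps at least three training items.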

All the preprocessing functions are applied to the data as below, and the final dataframe and the data splits are stored in the `preprocessed` directory.

In [60]:
df = load_ratings_df()
df = filter_min_rating(df, args.min_rating)
df = filter_min_mc(df, args.min_mc)
df = filter_min_uc(df, args.min_uc)
df, umap, smap = densify_index(df)

user_count = len(umap)
item_count = len(smap)

train, val, test = split_df(df, user_count)

dataset = {
    'train': train,
    'val': val,
    'test': test,
    'umap': umap,
    'smap': smap,
}

dataset_path = args.preprocessed.joinpath('dataset.pkl')
with dataset_path.open('wb') as f:
    pickle.dump(dataset, f)

df_path = args.preprocessed.joinpath('preprocessed.csv')
df.to_csv(df_path, index=False)
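
For reference, a saved `dataset.pkl` with this key layout can be read back and its dense item ids translated to raw MovieLens ids by inverting `smap`. The sketch below is self-contained: it writes a tiny made-up dictionary to a temporary directory instead of touching the real `preprocessed` folder.

```python
import pickle
import tempfile
from pathlib import Path

# Made-up stand-in with the same keys as the saved dataset dictionary.
toy_dataset = {
    'train': {0: [3, 1]},
    'val': {0: [2]},
    'test': {0: [0]},
    'umap': {17: 0},
    'smap': {50: 0, 60: 1, 70: 2, 80: 3},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir).joinpath('dataset.pkl')
    with toy_path.open('wb') as f:
        pickle.dump(toy_dataset, f)
    with toy_path.open('rb') as f:
        restored = pickle.load(f)

# smap maps raw id -> dense id, so invert it to go back to raw ids.
inverse_smap = {dense: raw for raw, dense in restored['smap'].items()}
raw_train_items = [inverse_smap[i] for i in restored['train'][0]]
print(raw_train_items)  # [80, 60]
```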


In [61]:
df

Unnamed: 0,user_id,item_id,rating,timestamp,title,genre
0,0,2969,4,978300019,"Girl, Interrupted (1999)",Drama
1,0,1178,5,978300055,Back to the Future (1985),Comedy|Sci-Fi
2,0,1574,4,978300055,Titanic (1997),Drama|Romance
3,0,957,5,978300055,Cinderella (1950),Animation|Children's|Musical
4,0,2147,3,978300103,Meet Joe Black (1998),Romance
...,...,...,...,...,...,...
1000204,6039,2709,4,997454429,Body Heat (1981),Crime|Thriller
1000205,6039,1741,4,997454464,Pi (1998),Sci-Fi|Thriller
1000206,6039,1618,3,997454464,As Good As It Gets (1997),Comedy|Drama
1000207,6039,155,3,997454486,Crimson Tide (1995),Drama|Thriller|War


To see how the data is split, let's take a look at the ratings by user 0 that are sorted based on the timestamp, and then see which partitions of the sequence are included in the train, validation, and test sets.

In [62]:
df[df['user_id'].isin([0])]

Unnamed: 0,user_id,item_id,rating,timestamp,title,genre
0,0,2969,4,978300019,"Girl, Interrupted (1999)",Drama
1,0,1178,5,978300055,Back to the Future (1985),Comedy|Sci-Fi
2,0,1574,4,978300055,Titanic (1997),Drama|Romance
3,0,957,5,978300055,Cinderella (1950),Animation|Children's|Musical
4,0,2147,3,978300103,Meet Joe Black (1998),Romance
5,0,1658,5,978300172,"Last Days of Disco, The (1998)",Drama
6,0,3177,4,978300275,Erin Brockovich (2000),Drama
7,0,2599,5,978300719,"Christmas Story, A (1983)",Comedy|Drama
8,0,1117,4,978300719,To Kill a Mockingbird (1962),Drama
9,0,1104,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama


The training data includes all of user 0's ratings except the last two.

In [63]:
train[0]

[2969,
 1178,
 1574,
 957,
 2147,
 1658,
 3177,
 2599,
 1117,
 1104,
 689,
 253,
 858,
 593,
 2488,
 1781,
 1848,
 2889,
 877,
 970,
 1782,
 1838,
 144,
 963,
 1025,
 853,
 2592,
 1195,
 2557,
 1154,
 639,
 2710,
 517,
 2898,
 2586,
 2128,
 964,
 1107,
 580,
 2205,
 1421,
 513,
 581,
 2483,
 708,
 574,
 0,
 2162,
 2102,
 740,
 1439]

The second-to-last rating is used for validation.

In [64]:
val[0]

[1727]

The last record is used as the test data.

In [65]:
test[0]

[47]

## Negative Sampling

To evaluate BERT4Rec, each ground truth item in the validation and test sets is paired with a number of sampled negative items that the user has not interacted with. The evaluation metrics are then calculated using these values.
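
As a rough sketch of how such a candidate list can be scored, the example below ranks one ground truth item against its negatives and computes Hit Rate@k and NDCG@k. These two metrics are a common choice for this protocol, but which metrics are actually used downstream is not stated here, so treat the metric choice, the function names, and the scores as assumptions.

```python
import math

def rank_of_positive(scores):
    """Returns the 0-based rank of candidate 0 (the ground truth) among all candidates."""
    positive_score = scores[0]
    # Count the negatives that are scored strictly higher than the ground truth.
    return sum(1 for s in scores[1:] if s > positive_score)

def hit_and_ndcg_at_k(scores, k):
    """Returns (hit, ndcg) for a single user with one ground truth item."""
    rank = rank_of_positive(scores)
    if rank >= k:
        return 0.0, 0.0
    # With a single relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 2).
    return 1.0, 1.0 / math.log2(rank + 2)

# Hypothetical model scores; index 0 is the ground truth, the rest are negatives.
scores = [0.7, 0.9, 0.1, 0.4]
hit, ndcg = hit_and_ndcg_at_k(scores, k=2)
print(hit, round(ndcg, 4))
```

In this example one negative outranks the ground truth, so the item is a hit at k=2 but would be a miss at k=1.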

Two negative samplers are presented below:
1. Negative sampling based on popularity: Popular items are those that are rated most frequently. In this case, the negative samples for a user are the most popular items that the user has not interacted with.
2. Random negative sampling: The items are randomly sampled from those the user has not interacted with.

In [37]:

def items_by_popularity(train, val, test, user_count):
    """sorts the items based on their popularity

    Args:
        train (dict): train dataset
        val (dict): validation dataset
        test (dict): test dataset
        user_count (int): total number of users

    Returns:
        list: list of items sorted based on their popularity
    """
    popularity = Counter()
    for user in range(user_count):
        popularity.update(train[user])
        popularity.update(val[user])
        popularity.update(test[user])
    popular_items = sorted(popularity, key=popularity.get, reverse=True)
    return popular_items

def popular_negative_samples(train, val, test, user_count):
    """generates negative samples from the popular items

    Args:
        train (dict): train dataset
        val (dict): validation dataset
        test (dict): test dataset
        user_count (int): total number of users

    Returns:
        dict: negative samples for all users
    """
    popular_items = items_by_popularity(train, val, test, user_count)

    negative_samples = {}
    for user in trange(user_count):
        seen = set(train[user])
        seen.update(val[user])
        seen.update(test[user])

        samples = []
        for item in popular_items:
            if len(samples) == args.negative_sample_size:
                break
            if item in seen:
                continue
            samples.append(item)

        negative_samples[user] = samples
    
    negatives_path = args.preprocessed.joinpath('popular_negatives.pkl')
    with negatives_path.open('wb') as f:
        pickle.dump(negative_samples, f)

    return negative_samples

In [38]:
popular_samples = popular_negative_samples(train, val, test, user_count)
popular_samples[0]

100%|██████████| 6040/6040 [00:03<00:00, 1646.70it/s]


[2651,
 1106,
 1120,
 466,
 575,
 2374,
 579,
 1449,
 1108,
 106,
 2203,
 1485,
 1173,
 2426,
 2785,
 309,
 802,
 346,
 2511,
 287,
 1148,
 1124,
 2708,
 443,
 3341,
 1110,
 527,
 2775,
 1167,
 49,
 33,
 737,
 2958,
 863,
 1050,
 1288,
 1131,
 851,
 971,
 1123,
 1478,
 367,
 1199,
 1820,
 1059,
 1215,
 1563,
 1788,
 627,
 2400,
 3550,
 31,
 1993,
 2099,
 3238,
 576,
 2748,
 1275,
 1295,
 2480,
 578,
 1618,
 1445,
 1212,
 3186,
 216,
 370,
 2501,
 1135,
 1453,
 1406,
 3248,
 1743,
 2494,
 713,
 38,
 20,
 1294,
 2213,
 1130,
 699,
 1139,
 2959,
 3508,
 1007,
 847,
 3383,
 1003,
 1258,
 3294,
 2495,
 1129,
 1155,
 1017,
 1420,
 1132,
 3510,
 107,
 2505,
 347]

In [39]:
def random_negative_samples(train, val, test, user_count, item_count):
    """generates negative samples randomly

    Args:
        train (dict): train dataset
        val (dict): validation dataset
        test (dict): test dataset
        user_count (int): total number of users
        item_count (int): total number of items

    Returns:
        dict: negative samples for all users
    """
    np.random.seed(args.data_seed)
    negative_samples = {}
    print('Sampling negative items')
    for user in trange(user_count):
        seen = set(train[user])
        seen.update(val[user])
        seen.update(test[user])

        samples = []
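        for _ in range(args.negative_sample_size):
            # NOTE: `+ 1` draws ids from 1..item_count, whereas densify_index
            # assigns ids 0..item_count-1; drop the `+ 1` if ids stay 0-based.
            item = np.random.choice(item_count) + 1
            # Resample until the item is unseen and not already selected
            while item in seen or item in samples:
                item = np.random.choice(item_count) + 1
            samples.append(item)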

        negative_samples[user] = samples

    negatives_path = args.preprocessed.joinpath('random_negatives.pkl')
    with negatives_path.open('wb') as f:
        pickle.dump(negative_samples, f)

    return negative_samples

In [40]:
random_samples = random_negative_samples(train, val, test, user_count, item_count)
random_samples[0]

Sampling negative items


100%|██████████| 6040/6040 [00:14<00:00, 405.19it/s]


[3572,
 1389,
 2014,
 945,
 3623,
 732,
 2058,
 1413,
 922,
 384,
 2035,
 75,
 3584,
 1501,
 3576,
 2529,
 1360,
 2155,
 2835,
 1225,
 3480,
 1855,
 170,
 426,
 3484,
 895,
 1036,
 2193,
 141,
 1078,
 1736,
 3644,
 974,
 1647,
 657,
 2960,
 3119,
 274,
 437,
 2607,
 733,
 1700,
 3689,
 871,
 3257,
 3621,
 2435,
 854,
 2808,
 3434,
 1352,
 2695,
 1582,
 2743,
 1738,
 104,
 2331,
 1821,
 3030,
 2471,
 1696,
 2125,
 2495,
 2447,
 2664,
 1866,
 2398,
 2323,
 1381,
 346,
 2890,
 4,
 2570,
 947,
 1452,
 923,
 2790,
 3680,
 1252,
 493,
 134,
 1660,
 1053,
 1911,
 412,
 579,
 2560,
 3254,
 2713,
 2688,
 2774,
 468,
 2709,
 912,
 786,
 2828,
 1106,
 1741,
 2476,
 2118]
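
As an alternative to the resampling loop, the unseen items can be computed once per user and sampled without replacement. The sketch below is one possible variant, not the sampler used above: the function name `sample_unseen` is hypothetical, it assumes 0-based item ids as produced by `densify_index`, and it uses NumPy's `Generator` API, so its samples will differ from those of `random_negative_samples`.

```python
import numpy as np

def sample_unseen(seen, item_count, sample_size, rng):
    """Draws `sample_size` distinct ids from 0..item_count-1 that are not in `seen`."""
    seen_ids = np.fromiter(seen, dtype=np.int64)
    # All item ids minus the ones the user has interacted with.
    candidates = np.setdiff1d(np.arange(item_count), seen_ids)
    return rng.choice(candidates, size=sample_size, replace=False).tolist()

rng = np.random.default_rng(98765)
seen_items = {0, 2, 4}  # made-up interaction history
negatives = sample_unseen(seen_items, item_count=10, sample_size=3, rng=rng)
print(negatives)
```

Sampling without replacement guarantees distinct negatives by construction, at the cost of building the candidate array for every user.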