# Preparing MovieLens+IMDB dataset

## Downloading raw datasets

Dataset descriptions are available at
* http://files.grouplens.org/datasets/movielens/ml-20m-README.html for MovieLens-20M
* https://www.imdb.com/interfaces/ for IMDB

In [1]:
from utils import DATASETS_ROOT_DIR
# Directory where those datasets will be downloaded:
DATASETS_ROOT_DIR


'/home/jupyter/recsys-multi-atrribute-benchmark/datasets'

In [None]:
!wget https://files.grouplens.org/datasets/movielens/ml-20m.zip -P $DATASETS_ROOT_DIR/raw
!unzip $DATASETS_ROOT_DIR/raw/ml-20m.zip -d $DATASETS_ROOT_DIR/raw

In [None]:
# Downloading only some IMDB files needed for our experiment
for file_name in [
    'name.basics.tsv.gz',
    # 'title.akas.tsv.gz',
    'title.basics.tsv.gz',
    'title.crew.tsv.gz',
    # 'title.episode.tsv.gz',
    'title.principals.tsv.gz',
    # 'title.ratings.tsv.gz',
]:
    !wget https://datasets.imdbws.com/$file_name -P $DATASETS_ROOT_DIR/raw/imdb

## Choosing features from IMDB dataset & joining datasets together

Here we load the IMDB dataset and keep the attributes that we will later use as film features:
* `primaryTitle` (name of the film)
* `genres` (list of up to 3 genres associated with the film)
* `actors` (list of the film's actors)
* `directors` (list of the film's directors)
* `titleType` (e.g. movie, short, tvseries, tvepisode, video, etc.)
* `startYear` (year of the film's release, or of the initial release for a TV series)
* `runtimeMinutes` (the film's duration in minutes)

In [2]:
%%time

import os
import gzip
import pandas as pd

with gzip.open(os.path.join(DATASETS_ROOT_DIR, 'raw/imdb/title.basics.tsv.gz'), 'rb') as f:
    imdb_titles = pd.read_csv(f, sep='\t')

with gzip.open(os.path.join(DATASETS_ROOT_DIR, 'raw/imdb/title.principals.tsv.gz'), 'rb') as f:
    imdb_principals = pd.read_csv(f, sep='\t')

with gzip.open(os.path.join(DATASETS_ROOT_DIR, 'raw/imdb/title.crew.tsv.gz'), 'rb') as f:
    imdb_crew = pd.read_csv(f, sep='\t')

# For each dataframe, cast the film id (`tconst`) to an integer and use it as the index
for dataframe in [imdb_titles, imdb_principals, imdb_crew]:
    dataframe['tconst'] = dataframe['tconst'].map(lambda x: int(x[2:]))
    dataframe.set_index('tconst', inplace=True)



CPU times: user 3min 6s, sys: 10.6 s, total: 3min 16s
Wall time: 3min 20s
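
The integer cast in the cell above is what makes the later join with MovieLens possible: `links.csv` stores `imdbId` as a plain number, while IMDB title ids carry a `tt` prefix and leading zeros. A minimal sketch of that conversion, where the helper name `tconst_to_int` is hypothetical and only used for illustration:

```python
def tconst_to_int(tconst: str) -> int:
    """Strip the two-letter 'tt' prefix and parse the rest as an integer."""
    if not tconst.startswith("tt"):
        raise ValueError(f"unexpected title id: {tconst!r}")
    return int(tconst[2:])  # int() also drops the leading zeros


example_ids = ["tt0114709", "tt0000001"]  # toy ids in the IMDB format
converted = [tconst_to_int(t) for t in example_ids]
print(converted)  # [114709, 1]
```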


In [3]:
%%time

import numpy as np

# getting list of actors for each film
imdb_titles['actors'] = imdb_principals[np.isin(imdb_principals.category, ['actor', 'actress'])]\
    .groupby('tconst')['nconst'].agg(','.join)

# getting list of directors for each film
imdb_titles['directors'] = imdb_crew['directors']

del imdb_principals
del imdb_crew

CPU times: user 1min 4s, sys: 1.21 s, total: 1min 5s
Wall time: 1min 5s


In [4]:
ml_links = pd.read_csv(os.path.join(DATASETS_ROOT_DIR, 'raw/ml-20m/links.csv'))
ml_ratings = pd.read_csv(os.path.join(DATASETS_ROOT_DIR, 'raw/ml-20m/ratings.csv'))

ml_ratings = pd.merge(ml_ratings, ml_links[['movieId', 'imdbId']], on='movieId', how='left')

del ml_links

In [5]:
# Joining MovieLens dataset with IMDB attributes
imdb_features = ['titleType', 'primaryTitle', 'startYear', 'runtimeMinutes', 'genres', 'actors', 'directors']
ratings = pd.merge(ml_ratings, imdb_titles[imdb_features],
                   how='left', left_on='imdbId', right_index=True)

del imdb_titles
del ml_ratings

In [6]:
# Correcting dtype
ratings['date'] = ratings['timestamp'].astype('datetime64[s]').astype('datetime64[D]')

# Set correct null values
ratings['directors'] = ratings['directors'].replace('\\N', np.nan)
ratings['genres'] = ratings['genres'].replace('\\N', np.nan)
ratings['runtimeMinutes'] = ratings['runtimeMinutes'].replace('\\N', np.nan).astype(float)

# Null stats
print('Null values count')
print(ratings[['actors', 'directors', 'genres', 'startYear', 'runtimeMinutes', 'primaryTitle']].isnull().sum())
print(f'from {len(ratings)}')

# Choose median values to replace nulls
median_year = int(ratings['startYear'].dropna().astype(int).median())
median_runtime = int(ratings['runtimeMinutes'].dropna().median())
print(f'Replacing null startYear by {median_year}')
print(f'Replacing null runtimeMinutes by {median_runtime}')

# Filling null values
ratings['actors'] = ratings['actors'].fillna('')
ratings['directors'] = ratings['directors'].replace('\\N', '').fillna('')
ratings['genres'] = ratings['genres'].replace('\\N', '').fillna('')
ratings['startYear'] = ratings['startYear'].fillna(median_year).astype(int)
ratings['runtimeMinutes'] = ratings['runtimeMinutes'].fillna(median_runtime).astype(int)
ratings['primaryTitle'] = ratings['primaryTitle'].fillna('')

Null values count
actors            236605
directors          40687
genres             19400
startYear          18190
runtimeMinutes     18969
primaryTitle       18190
dtype: int64
from 20000263
Replacing null startYear by 1995
Replacing null runtimeMinutes by 112
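
The null handling in this cell follows three steps: turn the IMDB missing-value marker `\N` into a real null, compute the median over known values, then fill the gaps. Below is a pandas-free sketch of the same idea on made-up values (the list `raw_runtimes` is an assumption for illustration, not data from the notebook):

```python
from statistics import median

IMDB_NULL = "\\N"  # IMDB TSV files use the two characters \N for missing fields

raw_runtimes = ["81", IMDB_NULL, "104", "112", None, "130"]  # toy values

# Step 1: map the IMDB marker (and genuinely missing values) to None
parsed = [None if v in (IMDB_NULL, None) else float(v) for v in raw_runtimes]

# Step 2: compute the median over known values only
known = [v for v in parsed if v is not None]
fill_value = int(median(known))

# Step 3: fill the gaps and cast everything to int
filled = [int(v) if v is not None else fill_value for v in parsed]
print(fill_value, filled)  # 108 [81, 108, 104, 112, 108, 130]
```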


Some films share the same title, so we will use `imdbId` as a feature rather than `primaryTitle`:

In [7]:
films_by_title = ratings.groupby('primaryTitle')['imdbId'].agg(lambda x: len(set(x)))
films_by_title[films_by_title != 1].head()

primaryTitle
                                69
12 Angry Men                     2
13                               2
1984                             2
20,000 Leagues Under the Sea     4
Name: imdbId, dtype: int64

## Feature engineering

Here we do some simple feature engineering on the IMDB attributes:
* cluster the films' release years
* cluster the films' durations

In [8]:
from sklearn.cluster import KMeans

kmeans_runtime = KMeans(n_clusters=25, random_state=42)
kmeans_runtime.fit(ratings['runtimeMinutes'].sample(5 * 10 ** 5, random_state=43).values.reshape(-1, 1))
cluster_labels = np.round(kmeans_runtime.cluster_centers_[:, 0], 0).astype(int)
assert len(np.unique(cluster_labels)) == 25
ratings['runtimeMinutesCluster'] = cluster_labels[kmeans_runtime.predict(ratings['runtimeMinutes'].values.reshape(-1, 1))]

kmeans_year = KMeans(n_clusters=30, random_state=42)
kmeans_year.fit(ratings['startYear'].sample(5 * 10 ** 5, random_state=43).values.reshape(-1, 1))
cluster_labels = np.round(kmeans_year.cluster_centers_[:, 0], 0).astype(int)
assert len(np.unique(cluster_labels)) == 30
ratings['startYearCluster'] = cluster_labels[kmeans_year.predict(ratings['startYear'].values.reshape(-1, 1))]
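
The labelling step can be pictured without scikit-learn: once the centers are fitted, every value is replaced by the rounded center of its nearest cluster, and the `assert` guards against two centers collapsing to the same integer after rounding. A minimal sketch, assuming made-up centers (they are not the ones fitted in the notebook):

```python
# Hypothetical fitted centers for a toy runtime feature
centers = [24.6, 91.3, 118.8, 162.2]
labels = [round(c) for c in centers]        # readable integer labels
assert len(set(labels)) == len(centers)     # no two centers collapse after rounding


def nearest_center_label(value: float) -> int:
    """Return the rounded center of the cluster closest to `value`."""
    best = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
    return labels[best]


runtimes = [22, 95, 100, 110, 150, 240]
clustered = [nearest_center_label(r) for r in runtimes]
print(clustered)  # [25, 91, 91, 119, 162, 162]
```

Using the rounded center as the label, rather than an arbitrary cluster number, keeps the categorical feature interpretable.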

To keep the script simple, we will also
* keep one random genre tag per film
* keep the most popular director instead of the list of directors
* keep the most popular actor of each film instead of the list of actors


That will allow us to work only with categorical features.
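
Here "most popular" is measured by ratings: a person's popularity is the total number of ratings received by the films they took part in. A toy sketch of that selection rule, with hypothetical film and actor ids:

```python
from collections import Counter

# Toy data: number of ratings per film and comma-separated cast per film
film_ratings = {"f1": 100, "f2": 10, "f3": 5}
film_actors = {"f1": "a1,a2", "f2": "a2,a3", "f3": "a3"}

# An actor's popularity is the total number of ratings of their films
actor_popularity = Counter()
for film, cast in film_actors.items():
    for actor in cast.split(","):
        actor_popularity[actor] += film_ratings[film]

# Keep a single actor per film: the most popular one among its cast
top_actor = {
    film: max(cast.split(","), key=actor_popularity.__getitem__)
    for film, cast in film_actors.items()
}
print(dict(actor_popularity))  # {'a1': 100, 'a2': 110, 'a3': 15}
print(top_actor)               # {'f1': 'a2', 'f2': 'a2', 'f3': 'a3'}
```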

In [9]:
from collections import defaultdict

unique_films = ratings.drop_duplicates(['imdbId'])
film_popularity = ratings['imdbId'].value_counts().to_dict()

actor_popularity, director_popularity = defaultdict(int), defaultdict(int)
for film, actors, directors in zip(unique_films['imdbId'], unique_films['actors'], unique_films['directors']):
    for actor in actors.split(','):
        actor_popularity[actor] += film_popularity[film]
    for director in directors.split(','):
        director_popularity[director] += film_popularity[film]
        
popular_actors, popular_directors, random_genres = [], [], []
np.random.seed(1729)
for actors, directors, genres in zip(unique_films['actors'], unique_films['directors'], unique_films['genres']):
    popular_actors.append(max(actors.split(','), key=actor_popularity.__getitem__))
    popular_directors.append(max(directors.split(','), key=director_popularity.__getitem__))
    random_genres.append(np.random.choice(genres.split(','), size=1)[0])
np.random.seed(None)

# `assign` returns a new dataframe, which avoids pandas' SettingWithCopyWarning
unique_films = unique_films.assign(actor_id=popular_actors,
                                   director_id=popular_directors,
                                   genre=random_genres)


In [10]:
# replace actors and directors by their name for more readability
with gzip.open(os.path.join(DATASETS_ROOT_DIR, 'raw/imdb/name.basics.tsv.gz'), 'rb') as f:
    imdb_names = pd.read_csv(f, sep='\t')
imdb_names.set_index('nconst', inplace=True)
imdb_names['actor'] = imdb_names['director'] = imdb_names['primaryName']

unique_films = pd.merge(unique_films, imdb_names[['actor']],
         how='left', left_on='actor_id', right_index=True)
unique_films = pd.merge(unique_films, imdb_names[['director']],
         how='left', left_on='director_id', right_index=True)

unique_films.loc[unique_films['director'].isnull(), 'director'] =\
    unique_films.loc[unique_films['director'].isnull(), 'director_id']

# merging everything back to ratings table
ratings = pd.merge(ratings, unique_films[['genre', 'actor', 'director', 'imdbId']],
                   how='left', left_on='imdbId', right_on='imdbId')

## Keep only ratings >= 4
To treat the data as implicit feedback, we keep only the interactions with an explicit rating of four or higher.

In [11]:
implicit_ratings = ratings[ratings['rating'] >= 4.]
print(f'Keep {len(implicit_ratings)} lines from {len(ratings)}')

Keep 9995410 lines from 20000263


In [12]:
# getting train/val/test split same as in https://github.com/zombak79/vasp/blob/main/MovieLens%20-%20preprocessing.ipynb
# `assign` returns a new dataframe, which avoids pandas' SettingWithCopyWarning
implicit_ratings = implicit_ratings.assign(
    nb_ratings=implicit_ratings.groupby('userId')['imdbId'].transform('count'))
_users = implicit_ratings[implicit_ratings['nb_ratings'] >= 5].sort_values(['userId','timestamp'])['userId'].drop_duplicates()
_users.index = _users.index.astype(str)
_users = _users.sort_index().sample(frac=1., random_state=42)
test_users = _users.iloc[:10000].values
val_users = _users.iloc[10000:20000].values
del _users
  


## Encoding features and converting them into `tf.Tensor`

Now let's encode the categorical features as ordinals using `tf.keras.layers.StringLookup` and transform our dataset into a dictionary with one `tf.Tensor` per column. The result is kept in `tf_tensors`.

We also keep track of the inverse transformation, in case we want to see the value corresponding to some label. It is stored in the `inverse_lookups` dictionary.
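
To make the encoding concrete, here is a pure-Python analogue of a lookup and its inverse. It is a sketch, not the `get_tensorflow_dataset` helper: the `min_count` threshold is a hypothetical stand-in for whatever rule decides which categories get their own label, and index 0 is reserved for unknown values, as `StringLookup` does with its default settings.

```python
from collections import Counter


def build_vocabulary(values, min_count=2, oov_token="[UNK]"):
    """Return a vocabulary list; index 0 is reserved for out-of-vocabulary values."""
    counts = Counter(values)
    frequent = [v for v, c in counts.most_common() if c >= min_count]
    return [oov_token] + frequent


def encode(values, vocabulary):
    """Map each value to its label; unknown values fall back to index 0."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index.get(v, 0) for v in values]


def decode(labels, vocabulary):
    """Inverse lookup: map labels back to their string values."""
    return [vocabulary[i] for i in labels]


genres = ["Drama", "Comedy", "Drama", "Western", "Comedy", "Drama"]
vocab = build_vocabulary(genres)
codes = encode(genres, vocab)
print(vocab)                 # ['[UNK]', 'Drama', 'Comedy']
print(codes)                 # [1, 2, 1, 0, 2, 1]
print(decode(codes, vocab))  # ['Drama', 'Comedy', 'Drama', '[UNK]', 'Comedy', 'Drama']
```

Note that the round trip is lossy for rare values: "Western" comes back as `[UNK]`.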

In [13]:
%%time

from utils import get_tensorflow_dataset

item_features = ['imdbId', 'titleType', 'genre', 'actor', 'director',
                 'runtimeMinutesCluster', 'startYearCluster']

tf_tensors, inverse_lookups = get_tensorflow_dataset(implicit_ratings, item_features,
                                                     user_id_column='userId', date_column='date')


Encoding imdbId column
Reserving labels for 7884 categories out of 20720
Encoding titleType column
Reserving labels for 10 categories out of 10
Encoding genre column
Reserving labels for 26 categories out of 27
Encoding actor column
Reserving labels for 2497 categories out of 6597
Encoding director column
Reserving labels for 3094 categories out of 8166
Encoding runtimeMinutesCluster column
Reserving labels for 25 categories out of 25
Encoding startYearCluster column
Reserving labels for 30 categories out of 30
CPU times: user 55.5 s, sys: 2.93 s, total: 58.4 s
Wall time: 58.8 s


In [14]:
from utils import gather_structure
gather_structure(tf_tensors, [0, 1])

{'userId': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 1])>,
 'date': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([12671, 12875], dtype=int32)>,
 'imdbId': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([355,  94], dtype=int32)>,
 'titleType': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([10, 10], dtype=int32)>,
 'genre': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([20, 11], dtype=int32)>,
 'actor': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([ 31, 140], dtype=int32)>,
 'director': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([258,  55], dtype=int32)>,
 'runtimeMinutesCluster': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([20, 16], dtype=int32)>,
 'startYearCluster': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([12, 15], dtype=int32)>}
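
The `date` values in this output look like day counts since the Unix epoch, which would be consistent with the `datetime64[D]` cast applied to `timestamp` earlier. This is an inference from the values shown, not documented behaviour of the `utils` helpers. Under that assumption they can be decoded as follows:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def days_to_date(days: int) -> date:
    """Convert a day count since 1970-01-01 into a calendar date."""
    return EPOCH + timedelta(days=days)


decoded = [days_to_date(d).isoformat() for d in (12671, 12875)]
print(decoded)  # ['2004-09-10', '2005-04-02']
```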

In [15]:
inverse_lookups['actor']

<keras.layers.preprocessing.string_lookup.StringLookup at 0x7fad00699310>

In [16]:
inverse_lookups['actor'](tf_tensors['actor'][:2])

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'Liam Neeson', b'Jeff Anderson'], dtype=object)>

For film ids, we can replace the reverse mapping with film names for better readability if needed:

In [17]:
import tensorflow as tf

from utils import enforce_unique_values

film_names = enforce_unique_values(implicit_ratings.groupby('imdbId')['primaryTitle'].first().to_dict())

new_vocab = list(map(lambda x: film_names[int(x)] if x != '[UNK]' else x,
                     inverse_lookups['imdbId'].get_vocabulary()))
inverse_lookups['imdbId'] = tf.keras.layers.StringLookup(vocabulary=new_vocab,
                                                         invert=True,
                                                         name=inverse_lookups['imdbId'].name,
                                                         num_oov_indices=inverse_lookups['imdbId'].num_oov_indices)

In [18]:
inverse_lookups['imdbId'](tf_tensors['imdbId'][:2])

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'Rob Roy', b'Clerks'], dtype=object)>

In [19]:
from utils import save_inverse_lookups

save_inverse_lookups(inverse_lookups, os.path.join(DATASETS_ROOT_DIR, 'movielens_imdb/inverse_lookups.pickle'))

In [20]:
del inverse_lookups

## Changing format to event sequences by user

From this point on we no longer need `pd.DataFrame` and will work with TensorFlow objects.

In [21]:
del implicit_ratings

First we apply train/val/test split defined above:

In [22]:
val_mask = np.isin(tf_tensors['userId'], val_users)
test_mask = np.isin(tf_tensors['userId'], test_users)
train_mask = (~val_mask) & (~test_mask)

tensors = {}

tensors['train'] = gather_structure(tf_tensors, np.where(train_mask)[0])
tensors['val'] = gather_structure(tf_tensors, np.where(val_mask)[0])
tensors['test'] = gather_structure(tf_tensors, np.where(test_mask)[0])
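
The split is made by user, not by event: every event of a held-out user goes to the same split. A small pure-Python sketch of the masking logic above, using made-up user ids:

```python
# Toy event table: one user id per event row
event_user_ids = [1, 1, 2, 3, 3, 3, 4, 5, 5]
val_users = {2, 5}   # hypothetical held-out users
test_users = {3}

val_mask = [u in val_users for u in event_user_ids]
test_mask = [u in test_users for u in event_user_ids]
train_mask = [not v and not t for v, t in zip(val_mask, test_mask)]


def rows(mask):
    """Row numbers selected by a boolean mask (the role of np.where(mask)[0])."""
    return [i for i, keep in enumerate(mask) if keep]


split_rows = {"train": rows(train_mask), "val": rows(val_mask), "test": rows(test_mask)}
print(split_rows)
# {'train': [0, 1, 6], 'val': [2, 7, 8], 'test': [3, 4, 5]}
```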

Then, for each user and each type of event (here we only have ratings), let's define the sequence of corresponding events:

In [23]:
from utils import get_user_sequences

for split_name, tensors_dict in tensors.items():
    tensors[split_name] = get_user_sequences({'ratings': tensors_dict}, 'ratings', 'userId')

del tensors_dict

Now our dict contains an additional technical key, `_user_index`, that encodes which events correspond to which user. To get sequences of events, one can simply gather a feature's values using this index:

In [24]:
type(tensors['train']['ratings']['_user_index'])

tensorflow.python.ops.ragged.ragged_tensor.RaggedTensor

The first dimension of this tensor corresponds to unique users, the second to the row numbers of the events of a given user. The bounding shape gives the number of unique users and the maximal number of ratings made by a single user in the train dataset:

In [25]:
tensors['train']['ratings']['_user_index'].bounding_shape()

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([118287,   3177], dtype=int32)>

Each split has its own independent index, with local indices (this property will be kept for each batch) starting from 0:

In [26]:
tensors['train']['ratings']['_user_index'][:3, :5]

<tf.RaggedTensor [[0, 1, 2, 3, 4],
 [88, 89, 90, 91, 92],
 [131, 132, 133, 134, 135]]>

In [27]:
tensors['test']['ratings']['_user_index'].bounding_shape()

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([10000,  1738], dtype=int32)>

In [28]:
tensors['test']['ratings']['_user_index'][:3, :5]

<tf.RaggedTensor [[0, 1, 2, 3, 4],
 [348, 349, 350, 351, 352],
 [383, 384, 385, 386, 387]]>

To get all films seen by a user, it is enough to gather on the corresponding tensor. In this example we limit the output to 3 users and 5 films:

In [29]:
tf.gather(tensors['train']['ratings']['imdbId'], tensors['train']['ratings']['_user_index'][:3, :5])

<tf.RaggedTensor [[355, 94, 138, 14, 67],
 [612, 170, 463, 18, 14],
 [26, 32, 17, 973, 94]]>
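
The idea behind `_user_index` can be reproduced with plain lists: the index is a ragged list of row numbers, one inner list per user, and gathering picks feature values row by row. This is a sketch of the concept only; how `get_user_sequences` actually builds the index is not shown in this notebook, and the function names below are hypothetical.

```python
def build_user_index(user_ids):
    """Group row numbers by user, keeping users in order of first appearance."""
    index = {}
    for row, user in enumerate(user_ids):
        index.setdefault(user, []).append(row)
    return list(index.values())  # ragged: one list of rows per user


def gather(values, ragged_index):
    """Pick feature values row by row, mirroring tf.gather on a ragged index."""
    return [[values[row] for row in user_rows] for user_rows in ragged_index]


user_ids = [7, 7, 7, 9, 9, 12]          # toy events
film_ids = [355, 94, 138, 612, 170, 26]  # one film per event

user_index = build_user_index(user_ids)
films_per_user = gather(film_ids, user_index)
print(user_index)      # [[0, 1, 2], [3, 4], [5]]
print(films_per_user)  # [[355, 94, 138], [612, 170], [26]]
```

Storing the index once and gathering each feature on demand avoids duplicating every feature column in ragged form.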

## Batching by users

For further processing, let's transform the dicts into a `tf.data.Dataset` batched by users:

In [30]:
from utils import batch_by_user

datasets = {}
for split_name, tensors_dict_by_event in tensors.items():
    datasets[split_name] = batch_by_user(tensors_dict_by_event, 'ratings', 5 * 10 ** 3, seed=12345)
    
del tensors

Now we have `tf.data.Dataset`

In [31]:
type(datasets['train'])

tensorflow.python.data.ops.dataset_ops.ConcatenateDataset

where each batch contains 5000 unique users and the batch value is a dict with the event type as key:

In [32]:
first_batch = next(iter(datasets['train']))

In [33]:
first_batch.keys()

dict_keys(['ratings'])

In [34]:
first_batch['ratings'].keys()

dict_keys(['_user_index', 'userId', 'date', 'imdbId', 'titleType', 'genre', 'actor', 'director', 'runtimeMinutesCluster', 'startYearCluster'])

In [35]:
first_batch['ratings']['_user_index'].bounding_shape()

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([5000, 2655], dtype=int32)>

For now, the last batch has fewer users:

In [36]:
for last_batch in datasets['train']:
    pass
last_batch['ratings']['_user_index'].bounding_shape()

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([3287, 1040], dtype=int32)>

Note that we kept local batch indexing:

In [37]:
last_batch['ratings']['_user_index'][:3, :5]

<tf.RaggedTensor [[0, 1, 2, 3, 4],
 [12, 13, 14, 15, 16],
 [102, 103, 104, 105, 106]]>
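
Local indexing means that, inside each batch, row numbers are re-based so that they point into that batch's own feature arrays. A minimal sketch under simplifying assumptions: `batch_by_user_sketch` is a hypothetical stand-in for `batch_by_user`, and it skips any shuffling that the real helper may do with its `seed` argument.

```python
def batch_by_user_sketch(user_index, features, users_per_batch):
    """Yield batches of users; row numbers restart from 0 inside each batch."""
    for start in range(0, len(user_index), users_per_batch):
        chunk = user_index[start:start + users_per_batch]
        rows = [row for user_rows in chunk for row in user_rows]
        local = {row: i for i, row in enumerate(rows)}  # global -> local row number
        yield {
            "_user_index": [[local[r] for r in user_rows] for user_rows in chunk],
            **{name: [col[r] for r in rows] for name, col in features.items()},
        }


user_index = [[0, 1, 2], [3, 4], [5], [6, 7]]  # 4 toy users
features = {"imdbId": [355, 94, 138, 612, 170, 26, 32, 17]}
batches = list(batch_by_user_sketch(user_index, features, users_per_batch=3))

print(batches[0])  # {'_user_index': [[0, 1, 2], [3, 4], [5]], 'imdbId': [355, 94, 138, 612, 170, 26]}
print(batches[1])  # {'_user_index': [[0, 1]], 'imdbId': [32, 17]}
```

As in the notebook, the last batch holds fewer users, and its index starts again from 0.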

## Saving raw dataset

In [38]:
%%time
for split_name, dataset in datasets.items():
    tf.data.experimental.save(dataset, os.path.join(DATASETS_ROOT_DIR, f'movielens_imdb/raw_{split_name}_dataset.tf'),
                              compression='GZIP')

CPU times: user 35.2 s, sys: 300 ms, total: 35.5 s
Wall time: 35.6 s


## Aggregate preceding events

So far we only have features describing items. To describe users, let's consider the following features:
* for each `userId` and `date`, we look at the events that happened on previous dates
* independently for each item feature, we construct the list of that feature's values over those preceding events

So if a user has rated the following films

| **film** | **genre**   | **date** |
|----------|-------------|----------|
| `1`      | Comedy      | `20/01`  |
| `9`      | Drama       | `25/01`  |
| `8`      | Comedy      | `25/01`  |
| `3`      | Documentary | `30/01`  |

we will construct `aggregated_film` and `aggregated_genre` features as

| **aggregated_film** | **aggregated_genre**    | **date** |
|---------------------|-------------------------|----------|
| []                  | []                      | `20/01`  |
| [`1`]               | [Comedy]                | `25/01`  |
| [`1`]               | [Comedy]                | `25/01`  |
| [`1`, `9`, `8`]     | [Comedy, Drama, Comedy] | `30/01`  |

and likewise for each user and each date on which there are events we want to predict (`ratings` here).
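
The table above can be reproduced for a single user with a few lines of plain Python. This is a sketch of the rule, not of the `aggregate_preceding_events` helper: the function name and the optional `max_history` cap are assumptions added for illustration.

```python
def aggregate_preceding(events, feature, max_history=None):
    """For each event, list `feature` over the events on strictly earlier dates.

    `events` must be sorted by date. `max_history` (hypothetical) keeps only
    the most recent entries of each history.
    """
    aggregated = []
    for current in events:
        history = [e[feature] for e in events if e["date"] < current["date"]]
        if max_history is not None:
            history = history[max(0, len(history) - max_history):]
        aggregated.append(history)
    return aggregated


# Dates are (month, day) tuples so that they compare chronologically
events = [
    {"film": 1, "genre": "Comedy", "date": (1, 20)},
    {"film": 9, "genre": "Drama", "date": (1, 25)},
    {"film": 8, "genre": "Comedy", "date": (1, 25)},
    {"film": 3, "genre": "Documentary", "date": (1, 30)},
]

aggregated_film = aggregate_preceding(events, "film")
aggregated_genre = aggregate_preceding(events, "genre")
print(aggregated_film)   # [[], [1], [1], [1, 9, 8]]
print(aggregated_genre)  # [[], ['Comedy'], ['Comedy'], ['Comedy', 'Drama', 'Comedy']]
```

The strict comparison on dates is what keeps films `9` and `8`, rated on the same day, out of each other's history.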

In [39]:
from functools import partial

from utils import aggregate_preceding_events

aggregated_datesets = {}
for split_name, dataset in datasets.items():
    agg_func = partial(aggregate_preceding_events, target='ratings', item_features=item_features,
                       user_id_column='userId', date_column='date')
    aggregated_datesets[split_name] = dataset.map(agg_func, num_parallel_calls=2, deterministic=True)

In [40]:
first_batch = next(iter(aggregated_datesets['train']))
first_batch.keys()

dict_keys(['aggregated_ratings_imdbId', 'aggregated_ratings_titleType', 'aggregated_ratings_genre', 'aggregated_ratings_actor', 'aggregated_ratings_director', 'aggregated_ratings_runtimeMinutesCluster', 'aggregated_ratings_startYearCluster', 'imdbId', 'titleType', 'genre', 'actor', 'director', 'runtimeMinutesCluster', 'startYearCluster', 'userId', 'date'])

So the resulting structure contains
* aggregated historical features: for each user (1st dim) and each target event (2nd dim), we have a list of the previous events' attributes (3rd dim)
* raw item features, user id, date: for each user (1st dim), we have a list of target events (2nd dim)

In [41]:
first_batch['aggregated_ratings_genre'].bounding_shape()

<tf.Tensor: shape=(3,), dtype=int32, numpy=array([5000, 2655,  100], dtype=int32)>

In [42]:
first_batch['genre'].bounding_shape()

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([5000, 2655], dtype=int32)>

## Saving resulting datasets

In [43]:
%%time
for split_name, dataset in aggregated_datesets.items():
    tf.data.experimental.save(dataset, os.path.join(DATASETS_ROOT_DIR, f'movielens_imdb/aggregated_{split_name}_dataset.tf'),
                              compression='GZIP')

CPU times: user 2min 44s, sys: 6.09 s, total: 2min 50s
Wall time: 1min 15s
