In [1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">


# MovieLens Data Enrichment

In this notebook, we will enrich the MovieLens 100k dataset with movie posters obtained from IMDb.

In [4]:
# External dependencies
import os

from merlin.core.utils import download_file

# Get dataframe library - cudf or pandas
from merlin.core.dispatch import get_lib

df_lib = get_lib()

In [5]:
INPUT_DATA_DIR = os.environ.get(
    "INPUT_DATA_DIR", os.path.expanduser("~/nvt-examples/movielens_100k/data/")
)

The following file maps movies in the MovieLens 100k dataset to poster URLs.

In [13]:
!wget https://raw.githubusercontent.com/babu-thomas/movielens-posters/master/movie_poster.csv

--2022-05-25 03:57:18--  https://raw.githubusercontent.com/babu-thomas/movielens-posters/master/movie_poster.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202458 (198K) [text/plain]
Saving to: ‘movie_poster.csv’


2022-05-25 03:57:34 (4.00 MB/s) - ‘movie_poster.csv’ saved [202458/202458]



In [65]:
posters = df_lib.read_csv('movie_poster.csv', header=None, names=['movie_id', 'poster_url'])

In [66]:
posters.tail()

| | movie_id | poster_url |
|---|---|---|
| 1587 | 1678 | https://images-na.ssl-images-amazon.com/images... |
| 1588 | 1679 | https://images-na.ssl-images-amazon.com/images... |
| 1589 | 1680 | https://images-na.ssl-images-amazon.com/images... |
| 1590 | 1681 | https://images-na.ssl-images-amazon.com/images... |
| 1591 | 1682 | https://images-na.ssl-images-amazon.com/images... |
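The row index in the output above ends at 1591 while `movie_id` ends at 1682, which suggests that some movie IDs have no row in the mapping. A minimal sketch for listing those IDs follows. It assumes a pandas dataframe (convert a cuDF one with `to_pandas()` first) and assumes that movie IDs run contiguously from 1 to the number of movies. The helper name `missing_movie_ids` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd


def missing_movie_ids(posters, n_movies):
    """Return the sorted movie IDs in 1..n_movies that have no row in `posters`."""
    expected = set(range(1, n_movies + 1))
    present = set(posters["movie_id"].tolist())
    return sorted(expected - present)


# Toy stand-in for the real mapping (hypothetical data): IDs 3 and 5 are absent.
toy = pd.DataFrame({"movie_id": [1, 2, 4], "poster_url": ["a", "b", "d"]})
missing = missing_movie_ids(toy, 5)
print(missing)
```

On the real data, the call would presumably look like `missing_movie_ids(posters, 1682)`, assuming the dataset's IDs cover 1 through 1682.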


Unfortunately, not all of the movies have posters available. When a movie has no poster, and therefore no poster embedding, we will use the average embedding across the entire dataset in its place.
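The fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual implementation: it assumes embeddings are stored as rows of a NumPy array, that a movie without a poster is marked by a row of NaNs, and the helper name `fill_missing_with_mean` is hypothetical.

```python
import numpy as np


def fill_missing_with_mean(embeddings):
    """Replace all-NaN rows with the mean of the rows that are present."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    # A row counts as missing when every component is NaN (an assumed convention).
    missing = np.isnan(embeddings).all(axis=1)
    # Average only over the movies that do have an embedding.
    mean_vec = embeddings[~missing].mean(axis=0)
    filled = embeddings.copy()
    filled[missing] = mean_vec
    return filled


# Toy example (hypothetical values): the second movie has no poster embedding.
emb = np.array([[1.0, 3.0], [np.nan, np.nan], [3.0, 5.0]])
filled = fill_missing_with_mean(emb)
print(filled)
```

Computing the mean only over the rows that are present keeps the missing entries from influencing the average they are later filled with.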

Let's download and store the posters.

Depending on your connection, this can take a while, but we can speed things up by downloading the files in parallel.

In [77]:
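%%time

import multiprocessing

import requests

!mkdir -p {INPUT_DATA_DIR}/posters

def download_poster(movie_id, poster_url):
    # Fetch the raw image bytes and save them as <movie_id>.jpg
    img_data = requests.get(poster_url).content
    with open(f'{INPUT_DATA_DIR}/posters/{movie_id}.jpg', 'wb') as handler:
        handler.write(img_data)

# cuDF dataframes need converting to pandas for row iteration; pandas ones do not
posters_pd = posters.to_pandas() if hasattr(posters, 'to_pandas') else posters

# One worker process per CPU core, each downloading a share of the posters
with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
    p.starmap(download_poster, [(row.movie_id, row.poster_url) for _, row in posters_pd.iterrows()])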

CPU times: user 175 ms, sys: 355 ms, total: 530 ms
Wall time: 6min 21s
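Because downloading is network-bound rather than CPU-bound, a thread pool is a common alternative to a process pool for this kind of work. The sketch below is not the method used in this notebook: it swaps `multiprocessing.Pool` for `concurrent.futures.ThreadPoolExecutor` and `requests` for the standard-library `urllib.request`, adds a timeout, and returns the IDs whose download failed instead of aborting the batch. The names `fetch_bytes` and `download_posters`, and the `fetch` and `max_workers` parameters, are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    # Default fetcher (an assumption): standard-library HTTP GET with a timeout.
    # urlopen raises an exception on HTTP error statuses such as 404.
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_posters(pairs, out_dir, fetch=fetch_bytes, max_workers=16):
    """Save each (movie_id, url) pair as <movie_id>.jpg; return the IDs that failed."""
    os.makedirs(out_dir, exist_ok=True)

    def _one(pair):
        movie_id, url = pair
        try:
            data = fetch(url)
        except Exception:
            return movie_id  # record the failure and keep going
        with open(os.path.join(out_dir, f"{movie_id}.jpg"), "wb") as fh:
            fh.write(data)
        return None

    # Threads suit this workload because each task mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(_one, pairs))
    return [r for r in results if r is not None]
```

With the dataframe from this notebook converted to pandas, a call might look like `failed = download_posters(zip(posters_pd.movie_id, posters_pd.poster_url), os.path.join(INPUT_DATA_DIR, "posters"))`. The returned list could then identify the movies that fall back to the average embedding.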
