# Standard Practices for Data Processing and Multimodal Feature Extraction in Recommendation with DataRec and Ducho (D&D4Rec) (2nd Hands-On Session)

⭐ **The 19th ACM Conference on Recommender Systems** ⭐

*Prague (Czech Republic), September 26th, 2025*

<div>
  <img src="http://github.com/sisinflab/DnD4Rec-tutorial/blob/main/images/DD4Rec-logo.jpg?raw=true" alt="d&d4Rec" width="108">
  <img src="https://recsys.acm.org/wp-content/uploads/2024/10/RecSys2025_website_header.jpg" alt="SisInfLab" width="600">
  <img src="https://recsys.acm.org/wp-content/uploads/2024/10/RecSys2025_logo_transparent.png" alt="recsys" width="200">
</div>

🧑 Speaker: [Matteo Attimonelli](https://matteoattimonelli.github.io/)

If you use this code for your experiments, please cite our recent work (on arXiv) 🙏

![GitHub Repo stars](https://img.shields.io/github/stars/sisinflab/Ducho-meets-Elliot)
 [![arXiv](https://img.shields.io/badge/arXiv-2409.15857-b31b1b.svg)](https://arxiv.org/abs/2409.15857)

 <img src="https://github.com/sisinflab/Ducho-meets-Elliot/blob/master/framework.png?raw=true"  width="700">

```
@article{DBLP:journals/corr/abs-2409-15857,
  author       = {Matteo Attimonelli and
                  Danilo Danese and
                  Angela Di Fazio and
                  Daniele Malitesta and
                  Claudio Pomo and
                  Tommaso Di Noia},
  title        = {Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation},
  journal      = {CoRR},
  volume       = {abs/2409.15857},
  year         = {2024}
}
```

## Clone the repository

First, we clone the repository to exploit the Ducho + Elliot experimental environment 🐑

In [None]:
%cd /Users/matteoattimonelli/Desktop/Tutorial
!git clone --recursive https://github.com/sisinflab/Ducho-meets-Elliot.git
%cd Ducho-meets-Elliot/
!git pull --recurse-submodules
!git submodule update --remote --merge

## Set up the working environment

Now, we set up the environment to run the experiments. We provide a file listing all the dependencies to make environment creation easier! 😎

In [None]:
!conda env create -f ducho_env.yml

## Download and visualize the multimodal recommendation datasets

We're now set to download the multimodal recommendation dataset. For the sake of this lecture, we consider the popular **[Amazon Product Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews)** dataset 🛒

Specifically, we consider the following product categories:

| **Datasets**     | **# Users** | **# Items** | **# Interactions** | **Sparsity (%)** |
|------------------|-------------|-------------|--------------------|------------------|
| Office Products  | 4,471       | 1,703       | 20,608             | 99.73%           |
| Digital Music    | 5,082       | 2,338       | 30,623             | 99.74%           |
| Baby             | 19,100      | 6,283       | 80,931             | 99.93%           |
| Toys & Games     | 19,241      | 11,101      | 89,558             | 99.96%           |
| Beauty           | 21,752      | 11,145      | 100,834            | 99.96%           |

For the sake of this hands-on session, we will consider only the "Office Products" dataset!
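
The sparsity column follows from the other three columns: it is the share of user-item pairs with no recorded interaction. As a minimal sketch (the `sparsity` helper is an illustrative name of ours, not a DataRec or Ducho function), the "Office Products" value can be re-derived from the table:

```python
def sparsity(n_users, n_items, n_interactions):
    """Percentage of user-item pairs without an interaction."""
    return 100 * (1 - n_interactions / (n_users * n_items))


# Statistics of "Office Products" taken from the table above
office_sparsity = sparsity(4471, 1703, 20608)
print(f"{office_sparsity:.2f}%")
```

Rounded to two decimals, this gives the 99.73% reported in the table.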

In [None]:
# RUN THIS CELL ONLY IF YOU HAVE TIME TO WASTE :-)

%cd Ducho
!mkdir -p ./local/data/demo_office
%cp ../../reviews_Office_Products_5.tsv ./local/data/demo_office

# data taken from: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html
import os.path
import time
from tqdm import tqdm
import pandas as pd
import gzip
import requests


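def download_image(url, image, save_path, max_retries=3):
    # Return the item id on success, None otherwise
    for _ in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
        except requests.RequestException:
            return None
        if response.status_code == 429:
            # Rate-limited: wait, then retry the same URL
            print('Too many requests, waiting...')
            time.sleep(60)
            continue
        if response.status_code == 200:
            with open(save_path, 'wb') as file:
                file.write(response.content)
            return image
        return None
    return None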


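def parse(path):
    import ast
    # Each line is a Python dict literal; literal_eval parses it without executing code
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield ast.literal_eval(l.decode('utf-8'))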


def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')


def download_dataset(file_name, save_path):
    base_url = 'http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/'
    response = requests.get(f'{base_url}{file_name}', stream=True)
    if response.status_code == 200:
        with open(f'{save_path}/{file_name}', 'wb') as file:
            bar = tqdm(total=int(response.headers.get('content-length')))
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    file.write(chunk)
                    bar.update(len(chunk))


folder = './local/data/demo_office'
name = 'Office_Products'

core = 5

if not os.path.exists(folder):
    os.makedirs(folder)

file_name = f'meta_{name}.json.gz'
if not os.path.exists(f'{folder}/{file_name}'):
    download_dataset(file_name, folder)

reviews = pd.read_csv(f'{folder}/reviews_{name}_{core}.tsv', sep='\t', names=['reviewerID', 'asin', 'overall'])

# check reviewerID, asin, and overall are not nan and remove duplicates
reviews = reviews[reviews['reviewerID'].notna()]
reviews = reviews[reviews['asin'].notna()]
reviews = reviews[reviews['overall'].notna()]
reviews = reviews.drop_duplicates()

meta = getDF(f'{folder}/meta_{name}.json.gz')[['asin', 'description', 'imUrl']]

# check asin, description, and imUrl are not nan
meta = meta[meta['asin'].notna()]
meta = meta[meta['description'].notna()]
meta = meta[meta['description'].str.len() > 0]
meta = meta[meta['imUrl'].notna()]
meta = meta[meta['imUrl'].str.len() > 0]
meta = meta.drop_duplicates()

# merge meta and reviews for 5-core
meta_reviews = pd.merge(reviews, meta, how='inner', on='asin')
meta_reviews.drop_duplicates(inplace=True)
meta = meta_reviews[['asin', 'description', 'imUrl']].drop_duplicates()
reviews = meta_reviews[['reviewerID', 'asin', 'overall']]

# get unavailable items
all_items = set(meta['asin'].tolist())
nan_pattern = r'nan|Nan|NaN|naN|n/a|N/A'  # textual placeholders for missing values


def items_matching(column, condition):
    # ASINs whose non-null `column` value satisfies `condition`
    values = meta[meta[column].notna()]
    return set(values[condition(values[column])]['asin'].tolist())


items_nan_description = set(meta[meta['description'].isna()]['asin'].tolist()).union(
    items_matching('description', lambda s: s.str.contains(nan_pattern, regex=True)))
items_nan_url = set(meta[meta['imUrl'].isna()]['asin'].tolist()).union(
    items_matching('imUrl', lambda s: s.str.contains(nan_pattern, regex=True)))
items_empty_description = items_matching('description', lambda s: s.str.len() == 0)
items_empty_url = items_matching('imUrl', lambda s: s.str.len() == 0)
remaining_items = all_items.difference(items_nan_description).difference(items_nan_url).difference(
    items_empty_description).difference(items_empty_url)
print(f'All items: {len(all_items)}')
print(f'All users: {reviews["reviewerID"].nunique()}')
print(f'All interactions: {len(reviews)}')
print(f'Nan description: {len(items_nan_description)}')
print(f'Nan url: {len(items_nan_url)}')
print(f'Empty description: {len(items_empty_description)}')
print(f'Empty url: {len(items_empty_url)}')

missing_visual = items_nan_url.union(items_empty_url)
missing_textual = items_nan_description.union(items_empty_description)

# save original df
meta.to_csv(f'{folder}/original_meta.tsv', sep='\t', index=None)
reviews.to_csv(f'{folder}/original_reviews.tsv', sep='\t', index=None, header=None)

meta = meta[meta['asin'].isin(all_items.difference(missing_textual))]
reviews = reviews[reviews['asin'].isin(remaining_items)]

print(len(meta[meta['description'].isna()]))
print(len(meta[meta['description'].str.len() == 0]))
print(len(meta[meta['description'].str.contains('nan')]))
print(len(meta[meta['description'].str.contains('Nan')]))
print(len(meta[meta['description'].str.contains('NaN')]))
print(len(meta[meta['description'].str.contains('n/a')]))
print(len(meta[meta['description'].str.contains('N/A')]))

print(f'Remaining users: {reviews["reviewerID"].nunique()}')
print(f'Remaining interactions: {len(reviews)}')

if not os.path.exists(f'{folder}/images/'):
    os.makedirs(f'{folder}/images/')

images = []
with tqdm(total=len(meta)) as t:
    for index, row in meta.iterrows():
        if pd.notna(row['imUrl']):
            output = download_image(row['imUrl'], row['asin'], f'{folder}/images/{row["asin"]}.jpg')
            if output is None:
                missing_visual.add(row['asin'])
            images.append(output)
            t.update()

broken_urls = [im for im in images if im is None]  # one entry per failed download
print(f'Broken url: {len(broken_urls)}')
remaining_items = remaining_items.intersection(set([im for im in images if im]))
print(f'Remaining items: {len(remaining_items)}')

if len(broken_urls) > 0:
    meta = meta[~meta['asin'].isin(missing_visual)]
    reviews = reviews[~reviews['asin'].isin(missing_visual)]

meta.to_csv(f'{folder}/meta.tsv', sep='\t', index=None)
reviews.to_csv(f'{folder}/reviews.tsv', sep='\t', index=None, header=None)

# save missing items
pd.DataFrame(list(missing_visual)).to_csv(f'{folder}/missing_visual.tsv', sep='\t',
                                        index=None, header=None)
pd.DataFrame(list(missing_textual)).to_csv(f'{folder}/missing_textual.tsv', sep='\t',
                                        index=None,header=None)
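
The script sets `core = 5` and starts from a review file whose name suggests a 5-core dataset, but dropping items with missing images or descriptions can leave some users or items with fewer than five interactions. If you want to restore the k-core property afterwards, a minimal sketch could look like the following. The `k_core` helper and the toy ids are hypothetical and are not part of the original pipeline.

```python
import pandas as pd


def k_core(df, user_col, item_col, k):
    """Iteratively drop users and items with fewer than k interactions."""
    while True:
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        df = df[keep]


# Toy interactions (made-up ids) to illustrate the filtering
toy = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2', 'u2', 'u3'],
    'asin': ['i1', 'i2', 'i1', 'i2', 'i3'],
    'overall': [5.0, 4.0, 3.0, 5.0, 2.0],
})
toy_core = k_core(toy, 'reviewerID', 'asin', k=2)
print(toy_core)
```

The loop is needed because removing an item can push a user below the threshold, and vice versa, so a single pass is not always enough.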


In [None]:
# OTHERWISE, HERE'S AN ALREADY-PREPARED VERSION OF THE DATASET

import gdown

%cd Ducho
!mkdir -p local/data
gdown.download(f'https://drive.google.com/uc?id=1shg7SJJqTLGzjyBC8z-KAnRhcK7cRGEp', 'demo_office.zip', quiet=False)
!mv demo_office.zip local/data/
!unzip local/data/demo_office.zip -d local/data/
%rm local/data/demo_office.zip

We can visualize one random item from the dataset 👓

In [None]:
import pandas as pd
import random
from matplotlib import pyplot as plt
from PIL import Image
from IPython.display import display, HTML

meta = pd.read_csv('local/data/demo_office/meta.tsv', sep='\t')
random_item = random.choice(meta['asin'].tolist())
description = meta[meta["asin"]==random_item]["description"].values[0]
display(HTML(f"ASIN: {random_item}\n<div style='white-space: pre-wrap; width: 100%;'>{description}</div>"))
img = Image.open(f'local/data/demo_office/images/{random_item}.jpg')
plt.imshow(img)
plt.axis('off')
plt.show()

from prettytable import PrettyTable
table = PrettyTable()

table.field_names = ["USER", "Rating"]
reviews = pd.read_csv('local/data/demo_office/reviews.tsv', sep='\t', header=None)
current_reviews = reviews[reviews[1]==random_item]

for idx, row in current_reviews.iterrows():
  table.add_row([row[0], row[2]])

display(HTML('Clicked by:'))
print(table)

## Check if GPU is available

Before running any GPU-bound process, let's check if the GPU is available:

In [None]:
!nvidia-smi
!nvcc --version
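
The same check can be done from Python, which is handy when the shell tools are not on the path. This is a minimal sketch: `gpu_report` is an illustrative helper of ours, and it falls back gracefully when PyTorch is not installed.

```python
import shutil


def gpu_report():
    """Summarize GPU visibility without failing when PyTorch is missing."""
    report = {
        'nvidia_smi_on_path': shutil.which('nvidia-smi') is not None,
        'torch_installed': False,
        'cuda_available': False,
        'devices': [],
    }
    try:
        import torch
    except ImportError:
        return report
    report['torch_installed'] = True
    report['cuda_available'] = torch.cuda.is_available()
    if report['cuda_available']:
        # List the name of every CUDA device PyTorch can see
        report['devices'] = [torch.cuda.get_device_name(i)
                             for i in range(torch.cuda.device_count())]
    return report


report = gpu_report()
print(report)
```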

## Extract multimodal features with Ducho

![GitHub Repo stars](https://img.shields.io/github/stars/sisinflab/Ducho)
 [![arXiv](https://img.shields.io/badge/arXiv-2403.04503-b31b1b.svg)](https://arxiv.org/abs/2403.04503)

 If you use Ducho for your experiments, please cite our papers 🙏

<div>
  <img src="https://github.com/sisinflab/Ducho/raw/main/docs/source/img/ducho_v2_overview.png" alt="duccio" width="800">
</div>

```
@inproceedings{DBLP:conf/www/AttimonelliDMPG24,
  author       = {Matteo Attimonelli and
                  Danilo Danese and
                  Daniele Malitesta and
                  Claudio Pomo and
                  Giuseppe Gassi and
                  Tommaso Di Noia},
  title        = {Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction
                  of Multimodal Features in Recommendation},
  booktitle    = {{WWW} (Companion Volume)},
  pages        = {1075--1078},
  publisher    = {{ACM}},
  year         = {2024}
}
```

```
@inproceedings{DBLP:conf/mm/MalitestaGPN23,
  author       = {Daniele Malitesta and
                  Giuseppe Gassi and
                  Claudio Pomo and
                  Tommaso Di Noia},
  title        = {Ducho: {A} Unified Framework for the Extraction of Multimodal Features
                  in Recommendation},
  booktitle    = {{ACM} Multimedia},
  pages        = {9668--9671},
  publisher    = {{ACM}},
  year         = {2023}
}
```

Now, we are all set to extract multimodal product features through Ducho 🦾

In [None]:
# CREATE THE CONFIGURATION FILE FOR DUCHO
def create_config():
  import yaml

  config_ducho = """dataset_path: ./local/data/demo_office
gpu list: 0
visual:
    items:
        input_path: images
        output_path: visual_embeddings_32
        model: [
            { model_name: ResNet50,
              output_layers: avgpool,
              reshape: [224, 224],
              preprocessing: zscore,
              backend: torch,
              batch_size: 32
            }
        ]

textual:
    items:
        input_path: meta.tsv
        item_column: asin
        text_column: description
        output_path: textual_embeddings_32
        model: [
          { model_name: sentence-transformers/all-mpnet-base-v2,
              output_layers: 1,
              clear_text: False,
              backend: sentence_transformers,
              batch_size: 32
          }
        ]

visual_textual:
    items:
        input_path: {visual: images, textual: meta.tsv}
        item_column: asin
        text_column: description
        output_path: {visual: visual_embeddings_32, textual: textual_embeddings_32}
        model: [
          { model_name: openai/clip-vit-base-patch16,
              backend: transformers,
              output_layers: 1,
              batch_size: 32
          }
        ]

  """

  import os
  ducho_dir = "demos/demo_office/config.yml"
  os.makedirs(os.path.dirname(ducho_dir), exist_ok=True)  # make sure the demo folder exists
  with open(ducho_dir, 'w') as conf_file:
      conf_file.write(config_ducho)

# RUN THE EXTRACTION WITH DUCHO

from ducho.runner.Runner import MultimodalFeatureExtractor
import torch
import os
import numpy as np
import random

def set_seed(seed=42):
    """Set all seeds to make results reproducible (deterministic mode).
    :param seed: an integer of your choosing
    """
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ":16:8"


def main():
    set_seed()
    extractor_obj = MultimodalFeatureExtractor(config_file_path='./demos/demo_office/config.yml')
    extractor_obj.execute_extractions()


if __name__ == '__main__':
    create_config()
    main()

## Dataset splitting and features mapping

![GitHub Repo stars](https://img.shields.io/github/stars/sisinflab/Formal-Multimod-Rec)
 [![arXiv](https://img.shields.io/badge/arXiv-2309.05273-b31b1b.svg)](https://arxiv.org/abs/2309.05273)

If the feature extraction went smoothly, we can now: (i) split the original dataset into train/validation/test sets ✂ and (ii) map the multimodal item features to ids aligned with the training set 🗾

To this end, we will use DataRec for the splitting or, alternatively, Elliot, our framework for rigorous and reproducible recommender systems evaluation.

If you find it useful for your research, please cite our works 🙏



```
@article{10.1145/3662738,
author = {Malitesta, Daniele and Cornacchia, Giandomenico and Pomo, Claudio and Merra, Felice Antonio and Di Noia, Tommaso and Di Sciascio, Eugenio},
title = {Formalizing Multimedia Recommendation through Multimodal Deep Learning},
year = {2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3662738},
doi = {10.1145/3662738},
note = {Just Accepted},
journal = {ACM Trans. Recomm. Syst.},
month = {apr},
keywords = {Multimodal Deep Learning, Multimedia Recommender Systems, Benchmarking}
}
```



```
@inproceedings{DBLP:conf/sigir/AnelliBFMMPDN21,
  author       = {Vito Walter Anelli and
                  Alejandro Bellog{\'{\i}}n and
                  Antonio Ferrara and
                  Daniele Malitesta and
                  Felice Antonio Merra and
                  Claudio Pomo and
                  Francesco Maria Donini and
                  Tommaso Di Noia},
  title        = {Elliot: {A} Comprehensive and Rigorous Framework for Reproducible
                  Recommender Systems Evaluation},
  booktitle    = {{SIGIR}},
  pages        = {2405--2414},
  publisher    = {{ACM}},
  year         = {2021}
}
```



**Dataset splitting**

* Train = 80% dataset

* Test = 20% dataset

* Valid = 10% train set
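
Since the validation set is taken from the training portion, the effective proportions over the whole dataset are 72% train, 8% validation, and 20% test. A minimal sketch of this arithmetic follows; `split_sizes` is an illustrative helper, and the exact rounding applied by DataRec or Elliot may differ.

```python
def split_sizes(n_interactions, test_ratio=0.2, val_ratio=0.1):
    """Sizes of train/val/test when validation is carved out of the training part."""
    n_test = round(n_interactions * test_ratio)
    n_val = round((n_interactions - n_test) * val_ratio)
    n_train = n_interactions - n_test - n_val
    return {'train': n_train, 'val': n_val, 'test': n_test}


sizes = split_sizes(1000)
print(sizes)
```

For 1,000 interactions this yields 720 for training, 80 for validation, and 200 for testing.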

In [None]:
import os
import pandas as pd
import numpy as np

In [None]:
%cd ..
!git clone https://github.com/sisinflab/DataRec.git
%cd DataRec

from datarec.io import read_tabular
from datarec.data.dataset import DataRec
from datarec.splitters import RandomHoldOut
from datarec.io import write_tabular

 
path="../Ducho/local/data/demo_office/reviews.tsv"
raw = read_tabular(
    filepath=path,
    sep="\t",
    user_col=0,
    item_col=1,
    rating_col=2,
    header=None
)
 
data = DataRec(rawdata=raw, dataset_name="AmazonOffice", version_name="ducho")
  
splitter = RandomHoldOut(test_ratio=0.2, val_ratio=0.1, seed=42)
split_result = splitter.run(data)
 
train_data = split_result['train']
val_data = split_result['val']
test_data = split_result['test']
 

file_path = '../data/office'
write_tabular(train_data.to_rawdata(), path=f"{file_path}/train.tsv", sep='\t', header=False, timestamp=False) 
write_tabular(val_data.to_rawdata(), path=f"{file_path}/val.tsv", sep='\t', header=False, timestamp=False) 
write_tabular(test_data.to_rawdata(), path=f"{file_path}/test.tsv", sep='\t', header=False, timestamp=False)  

%cd /Users/matteoattimonelli/Desktop/Tutorial/Ducho-meets-Elliot

%mv ./Ducho/local/data/demo_office/visual_embeddings_32 ./data/office
%mv ./Ducho/local/data/demo_office/textual_embeddings_32 ./data/office

%cd ./data/office

!pwd

%env CUBLAS_WORKSPACE_CONFIG=:16:8

# MAP ITEMS TO NUMERICAL IDS
train = pd.read_csv('train.tsv', sep='\t', header=None)
val = pd.read_csv('val.tsv', sep='\t', header=None)
test = pd.read_csv('test.tsv', sep='\t', header=None)

df = pd.concat([train, val, test], axis=0)

users = df[0].unique()
items = df[1].unique()

users_map = {u: idx for idx, u in enumerate(users)}
items_map = {i: idx for idx, i in enumerate(items)}

train[0] = train[0].map(users_map)
train[1] = train[1].map(items_map)

val[0] = val[0].map(users_map)
val[1] = val[1].map(items_map)

test[0] = test[0].map(users_map)
test[1] = test[1].map(items_map)

train.to_csv('train_indexed.tsv', sep='\t', index=False, header=None)
val.to_csv('val_indexed.tsv', sep='\t', index=False, header=None)
test.to_csv('test_indexed.tsv', sep='\t', index=False, header=None)

visual_embeddings_folder = f'visual_embeddings_32/torch/ResNet50/avgpool'
textual_embeddings_folder = f'textual_embeddings_32/sentence_transformers/sentence-transformers/all-mpnet-base-v2/1'

visual_embeddings_folder_indexed = f'visual_embeddings_indexed_32/torch/ResNet50/avgpool'
textual_embeddings_folder_indexed = f'textual_embeddings_indexed_32/sentence_transformers/sentence-transformers/all-mpnet-base-v2/1'

if not os.path.exists(visual_embeddings_folder_indexed):
    os.makedirs(visual_embeddings_folder_indexed)

if not os.path.exists(textual_embeddings_folder_indexed):
    os.makedirs(textual_embeddings_folder_indexed)

for key, value in items_map.items():
    np.save(f'{visual_embeddings_folder_indexed}/{value}.npy', np.load(f'{visual_embeddings_folder}/{key}.npy'))
    np.save(f'{textual_embeddings_folder_indexed}/{value}.npy', np.load(f'{textual_embeddings_folder}/{key}.npy'))


visual_embeddings_folder = f'visual_embeddings_32/transformers/openai/clip-vit-base-patch16/1'
textual_embeddings_folder = f'textual_embeddings_32/transformers/openai/clip-vit-base-patch16/1'

visual_embeddings_folder_indexed = f'visual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'
textual_embeddings_folder_indexed = f'textual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'

if not os.path.exists(visual_embeddings_folder_indexed):
    os.makedirs(visual_embeddings_folder_indexed)

if not os.path.exists(textual_embeddings_folder_indexed):
    os.makedirs(textual_embeddings_folder_indexed)

for key, value in items_map.items():
    np.save(f'{visual_embeddings_folder_indexed}/{value}.npy', np.load(f'{visual_embeddings_folder}/{key}.npy'))
    np.save(f'{textual_embeddings_folder_indexed}/{value}.npy', np.load(f'{textual_embeddings_folder}/{key}.npy'))


%cd ../../
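
The copy loops above call `np.load` for every key of `items_map`, so they stop at the first item without an embedding file. A quick check before the loops can list such items up front. This is a minimal sketch: `missing_embeddings` is a hypothetical helper, demonstrated here on a temporary folder with made-up ids.

```python
import os
import tempfile


def missing_embeddings(items_map, folder):
    """Return the original item ids that have no '<id>.npy' file in folder."""
    return sorted(item for item in items_map
                  if not os.path.isfile(os.path.join(folder, f'{item}.npy')))


# Self-contained demo on a temporary folder with made-up ids
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'B0001.npy'), 'wb').close()  # empty placeholder file
    demo_map = {'B0001': 0, 'B0002': 1}
    demo_missing = missing_embeddings(demo_map, tmp)

print(demo_missing)
```

In the notebook, the call would be `missing_embeddings(items_map, visual_embeddings_folder)`, and likewise for the textual folder.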

In [None]:
#### OR IF YOU WANT TO PERFORM THE SPLITTING WITH ELLIOT :-)

%cd /Users/matteoattimonelli/Desktop/Tutorial/Ducho-meets-Elliot

%mv ./Ducho/local/data/demo_office/visual_embeddings_32 ./data/office
%mv ./Ducho/local/data/demo_office/textual_embeddings_32 ./data/office

%cp ./Ducho/local/data/demo_office/reviews.tsv ./data/office

split_config = '''
experiment:
  backend: pytorch
  data_config:
    strategy: dataset
    dataset_path: ../data/office/reviews.tsv
  splitting:
    save_on_disk: True
    save_folder: ../data/office_splits/
    test_splitting:
      strategy: random_subsampling
      test_ratio: 0.2
    validation_splitting:
      strategy: random_subsampling
      test_ratio: 0.1
  dataset: office
  top_k: 20
  evaluation:
    cutoffs: [ 10, 20 ]
    simple_metrics: [ Recall, nDCG ]
  gpu: 0
  external_models_path: ../external/models/__init__.py
  models:
    MostPop:
      meta:
        verbose: True
        save_recs: False
'''

split_dir = f"./config_files/split_office.yml"
with open(split_dir, 'w') as conf_file:
    conf_file.write(split_config)

# SPLIT INTO TRAIN/VAL/TEST
%env CUBLAS_WORKSPACE_CONFIG=:16:8
!python3 run_split.py --dataset office

%cd ./data/office

%env CUBLAS_WORKSPACE_CONFIG=:16:8

# MAP ITEMS TO NUMERICAL IDS
train = pd.read_csv('train.tsv', sep='\t', header=None)
val = pd.read_csv('val.tsv', sep='\t', header=None)
test = pd.read_csv('test.tsv', sep='\t', header=None)

df = pd.concat([train, val, test], axis=0)

users = df[0].unique()
items = df[1].unique()

users_map = {u: idx for idx, u in enumerate(users)}
items_map = {i: idx for idx, i in enumerate(items)}

train[0] = train[0].map(users_map)
train[1] = train[1].map(items_map)

val[0] = val[0].map(users_map)
val[1] = val[1].map(items_map)

test[0] = test[0].map(users_map)
test[1] = test[1].map(items_map)

train.to_csv('train_indexed.tsv', sep='\t', index=False, header=None)
val.to_csv('val_indexed.tsv', sep='\t', index=False, header=None)
test.to_csv('test_indexed.tsv', sep='\t', index=False, header=None)

visual_embeddings_folder = f'visual_embeddings_32/torch/ResNet50/avgpool'
textual_embeddings_folder = f'textual_embeddings_32/sentence_transformers/sentence-transformers/all-mpnet-base-v2/1'

visual_embeddings_folder_indexed = f'visual_embeddings_indexed_32/torch/ResNet50/avgpool'
textual_embeddings_folder_indexed = f'textual_embeddings_indexed_32/sentence_transformers/sentence-transformers/all-mpnet-base-v2/1'

if not os.path.exists(visual_embeddings_folder_indexed):
    os.makedirs(visual_embeddings_folder_indexed)

if not os.path.exists(textual_embeddings_folder_indexed):
    os.makedirs(textual_embeddings_folder_indexed)

for key, value in items_map.items():
    np.save(f'{visual_embeddings_folder_indexed}/{value}.npy', np.load(f'{visual_embeddings_folder}/{key}.npy'))
    np.save(f'{textual_embeddings_folder_indexed}/{value}.npy', np.load(f'{textual_embeddings_folder}/{key}.npy'))


visual_embeddings_folder = f'visual_embeddings_32/transformers/openai/clip-vit-base-patch16/1'
textual_embeddings_folder = f'textual_embeddings_32/transformers/openai/clip-vit-base-patch16/1'

visual_embeddings_folder_indexed = f'visual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'
textual_embeddings_folder_indexed = f'textual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'

if not os.path.exists(visual_embeddings_folder_indexed):
    os.makedirs(visual_embeddings_folder_indexed)

if not os.path.exists(textual_embeddings_folder_indexed):
    os.makedirs(textual_embeddings_folder_indexed)

for key, value in items_map.items():
    np.save(f'{visual_embeddings_folder_indexed}/{value}.npy', np.load(f'{visual_embeddings_folder}/{key}.npy'))
    np.save(f'{textual_embeddings_folder_indexed}/{value}.npy', np.load(f'{textual_embeddings_folder}/{key}.npy'))


%cd ../../

After these steps, the multimodal dataset has the following structure:

```
├── office
│   ├── visual_embeddings_indexed_32
│   │   ├── torch
│   │   │   ├── ResNet50
│   │   │   │   ├── avgpool
│   │   │   │   │   ├── 0.npy
│   │   │   │   │   ├── 1.npy
│   │   │   │   │   ├── ...
│   │   ├── transformers
│   │   │   ├── openai
│   │   │   │   ├── clip-vit-base-patch16
│   │   │   │   │   ├── 1
│   │   │   │   │   │   ├── 0.npy
│   │   │   │   │   │   ├── 1.npy
│   │   │   │   │   │   ├── ...
│   ├── textual_embeddings_indexed_32
│   │   ├── sentence_transformers
│   │   │   ├── sentence-transformers
│   │   │   │   ├── all-mpnet-base-v2
│   │   │   │   │   ├── 1
│   │   │   │   │   │   ├── 0.npy
│   │   │   │   │   │   ├── 1.npy
│   │   │   │   │   │   ├── ...
│   │   ├── transformers
│   │   │   ├── openai
│   │   │   │   ├── clip-vit-base-patch16
│   │   │   │   │   ├── 1
│   │   │   │   │   │   ├── 0.npy
│   │   │   │   │   │   ├── 1.npy
│   │   │   │   │   │   ├── ...
│   ├── train_indexed.tsv
│   ├── val_indexed.tsv
│   ├── test_indexed.tsv
```
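
Because the files are named after the numerical item ids, a model can read them in id order and obtain a feature matrix whose row `i` belongs to item `i`. The sketch below illustrates the idea. It is not Elliot's actual loader: `load_feature_matrix` is a hypothetical helper, random vectors stand in for real embeddings, and `reshape(-1)` assumes each file holds a single vector, possibly with a leading singleton dimension.

```python
import os
import tempfile

import numpy as np


def load_feature_matrix(folder, n_items):
    """Stack '<id>.npy' files for id = 0..n_items-1 into an (n_items, dim) array."""
    rows = [np.load(os.path.join(folder, f'{idx}.npy')).reshape(-1)
            for idx in range(n_items)]
    return np.stack(rows)


# Demo with random vectors standing in for real embeddings
with tempfile.TemporaryDirectory() as tmp:
    rng = np.random.default_rng(0)
    for idx in range(3):
        np.save(os.path.join(tmp, f'{idx}.npy'), rng.normal(size=(1, 8)))
    features = load_feature_matrix(tmp, n_items=3)

print(features.shape)
```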



## Configure and run the experiments
Let's set the hyper-parameters for the model to be trained and tested. We focus on a modified version of VBPR that uses multimodal (visual and textual) features, and we train and evaluate it on Amazon Office.

### First multimodal features configuration

We start with the ResNet50 (visual) + SentenceBert (textual) configuration, the most common one in the literature.

In [None]:
import yaml

config_filename = 'hands-on_resnet50_sentencebert'
elliot_config_1 = {
  'experiment': {
    'backend': 'pytorch',
    'data_config': {
      'strategy': 'fixed',
      'train_path': '../data/{0}/train_indexed.tsv',
      'validation_path': '../data/{0}/val_indexed.tsv',
      'test_path': '../data/{0}/test_indexed.tsv',
      'side_information': [
        {
            'dataloader': 'VisualAttribute',
            'visual_features': '../data/{0}/visual_embeddings_indexed_32/torch/ResNet50/avgpool'
        },
        {
            'dataloader': 'TextualAttribute',
            'textual_features': '../data/{0}/textual_embeddings_indexed_32/sentence_transformers/sentence-transformers/all-mpnet-base-v2/1'
        }
      ]
    },
    'dataset': 'office',
    'top_k': 20,
    'evaluation': {
      'cutoffs': [20],
      'simple_metrics': ['Recall', 'Precision', 'nDCG', 'HR']
    },
    'gpu': 0,
    'external_models_path': '../external/models/__init__.py',
    'models': {
      'external.VBPR': {
        'meta': {
          'hyper_opt_alg': 'grid',
          'verbose': True,
          'save_weights': False,
          'save_recs': False,
          'validation_rate': 10,
          'validation_metric': 'Recall@20',
          'restore': False
        },
        'epochs': 200,
        'batch_size': 1024,
        'factors': 64,
        'lr': 0.005,
        'l_w': 1e-5,
        'n_layers': 1,
        'comb_mod': 'concat',
        'modalities': "('visual','textual')",
        'loaders': "('VisualAttribute','TextualAttribute')",
        'seed': 123
      }
    }
  }
}

with open(f'config_files/{config_filename}.yml', 'w') as file:
    documents = yaml.dump(elliot_config_1, file)
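
The paths in the configuration contain a `{0}` placeholder. To our understanding, Elliot fills it with the value of the `dataset` key, so the same template can be reused across datasets. The snippet below only mimics that substitution with plain `str.format` as an illustration; it does not call Elliot.

```python
dataset = 'office'  # value of the 'dataset' key in the configuration
path_templates = {
    'train_path': '../data/{0}/train_indexed.tsv',
    'validation_path': '../data/{0}/val_indexed.tsv',
    'test_path': '../data/{0}/test_indexed.tsv',
}

# Replace the positional placeholder with the dataset name
resolved = {key: template.format(dataset) for key, template in path_templates.items()}
for key, path in resolved.items():
    print(f'{key}: {path}')
```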

### Run Elliot

Now we are all set to run an experiment with VBPR on Amazon Office with the first multimodal configuration.

In [None]:
from elliot.run import run_experiment


run_experiment(f"config_files/hands-on_resnet50_sentencebert.yml")

### Second multimodal features configuration

Next, we prepare and run the second configuration of multimodal feature extractors. Here we use CLIP, a multimodal model that is popular in the deep learning literature but largely overlooked in the recommendation community.



In [None]:
%cd /Users/matteoattimonelli/Desktop/Tutorial/Ducho-meets-Elliot
import yaml
config_filename = 'hands-on_clip'
elliot_config_2 = {
  'experiment': {
    'backend': 'pytorch',
    'data_config': {
      'strategy': 'fixed',
      'train_path': '../data/{0}/train_indexed.tsv',
      'validation_path': '../data/{0}/val_indexed.tsv',
      'test_path': '../data/{0}/test_indexed.tsv',
      'side_information': [
        {
            'dataloader': 'VisualAttribute',
            'visual_features': '../data/{0}/visual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'
        },
        {
            'dataloader': 'TextualAttribute',
            'textual_features': '../data/{0}/textual_embeddings_indexed_32/transformers/openai/clip-vit-base-patch16/1'
        }
      ]
    },
    'dataset': 'office',
    'top_k': 20,
    'evaluation': {
      'cutoffs': [20],
      'simple_metrics': ['Recall', 'Precision', 'nDCG', 'HR']
    },
    'gpu': 0,
    'external_models_path': '../external/models/__init__.py',
    'models': {
      'external.VBPR': {
        'meta': {
          'hyper_opt_alg': 'grid',
          'verbose': True,
          'save_weights': False,
          'save_recs': False,
          'validation_rate': 10,
          'validation_metric': 'Recall@20',
          'restore': False
        },
        'epochs': 200,
        'batch_size': 1024,
        'factors': 64,
        'lr': 0.005,
        'l_w': 1e-5,
        'n_layers': 1,
        'comb_mod': 'concat',
        'modalities': "('visual','textual')",
        'loaders': "('VisualAttribute','TextualAttribute')",
        'seed': 123
      }
    }
  }
}

with open(f'config_files/{config_filename}.yml', 'w') as file:
    documents = yaml.dump(elliot_config_2, file)

### Run Elliot

Now we are all set to run an experiment with VBPR on Amazon Office with the second multimodal configuration.

In [None]:
from elliot.run import run_experiment

run_experiment(f"config_files/hands-on_clip.yml")

## Final comments

Different multimodal feature extractors lead to different results. With CLIP, recommendation performance improves on most metrics over the usual ResNet50 + SentenceBert setup: Recall, Precision, and HR at cutoff 20 are slightly higher, while nDCG is slightly lower. This happens even without exploring a wide hyper-parameter space.





```
# First configuration (ResNet50 + SentenceBert)

{20: {'Recall': 0.08349616427404198, 'Precision': 0.009457434052757795, 'nDCG': 0.039479006837110385, 'HR': 0.1591726618705036}}
```



```
# Second configuration (CLIP)

{20: {'Recall': 0.0843764665257471, 'Precision': 0.009637290167865707, 'nDCG': 0.039173725010405266, 'HR': 0.16007194244604317}}
```
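
To make the comparison easier to read, the relative change of each metric can be computed directly from the two result dictionaries reported above. This is a small sketch over those numbers only; the variable names are ours.

```python
resnet_sbert = {'Recall': 0.08349616427404198, 'Precision': 0.009457434052757795,
                'nDCG': 0.039479006837110385, 'HR': 0.1591726618705036}
clip = {'Recall': 0.0843764665257471, 'Precision': 0.009637290167865707,
        'nDCG': 0.039173725010405266, 'HR': 0.16007194244604317}

# Relative change (in %) of CLIP with respect to ResNet50 + SentenceBert at cutoff 20
relative_change = {metric: 100 * (clip[metric] - resnet_sbert[metric]) / resnet_sbert[metric]
                   for metric in resnet_sbert}
for metric, change in relative_change.items():
    print(f'{metric}: {change:+.2f}%')
```

A positive value means CLIP scores higher on that metric, a negative one means it scores lower.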

