# Multimodal Recommender Systems Using User Reviews and Product Images
## By: Om Choudhary

### Motivation

In an era where online shopping dominates, building recommender systems that go beyond simple numeric ratings or keyword matching has become essential. I chose this topic to explore how integrating **user-generated reviews** with **product images** can enhance rating prediction, which is foundational to effective personalized recommendations. This is particularly useful for fashion e-commerce, where visual cues are as important as textual sentiment.

### Historical Context: Multimodal Learning in Recommendation

Recommender systems have evolved from:

- **Collaborative filtering** (Matrix Factorization)

- To **content-based filtering** (text reviews, metadata)

- And now to **multimodal learning**, where multiple data types (e.g., text, image, audio) are combined.

Recent research (e.g., Chen et al., 2020; McAuley et al., 2015) has shown that fusing reviews and product images significantly improves predictive performance, especially in the cold-start scenario. Multimodal learning allows systems to learn richer product representations, capturing both style and sentiment.

### Summary of Our Workflow

#### Step 1: Data Loading & Cleaning

- Downloaded and loaded Amazon_Fashion.jsonl (reviews) and meta_Amazon_Fashion.jsonl (product metadata).

- Loaded the first 25,000 lines of each file for efficiency (a head slice, not a random sample).

- Merged on parent_asin.

- Extracted the first available image URL (hi-res > large > thumb).

#### Step 2: Image Download

- Downloaded images to local disk using requests and PIL.

- Limited to the first 2,000 merged rows to conserve resources.

#### Step 3: Feature Extraction

- **Text**: Used pretrained **DistilBERT** (`distilbert-base-uncased`, no fine-tuning) and took the 768-dimensional `[CLS]` token embedding of each review.

- **Image**: Used **ResNet50** to extract 2048-dimensional visual embeddings.

- Concatenated both for a 2816-dimensional multimodal representation.
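
The fusion step is plain feature concatenation, which a small sketch can make concrete. The helper name `fuse` and the optional `l2_normalize` flag are assumptions for illustration; the notebook itself concatenates the raw vectors with `np.hstack` and does not normalize.

```python
import numpy as np

TEXT_DIM = 768    # DistilBERT hidden size
IMAGE_DIM = 2048  # ResNet50 output size after global average pooling

def fuse(text_feats, image_feats, l2_normalize=False):
    """Concatenate per-item text and image vectors along the feature axis."""
    text_feats = np.asarray(text_feats, dtype=np.float32)
    image_feats = np.asarray(image_feats, dtype=np.float32)
    if text_feats.shape[0] != image_feats.shape[0]:
        raise ValueError("text and image features must describe the same items")
    if l2_normalize:
        # Optional, not part of the original pipeline: scale each block to unit
        # length so neither modality dominates purely by magnitude.
        text_feats = text_feats / (np.linalg.norm(text_feats, axis=1, keepdims=True) + 1e-12)
        image_feats = image_feats / (np.linalg.norm(image_feats, axis=1, keepdims=True) + 1e-12)
    return np.hstack((text_feats, image_feats))

rng = np.random.default_rng(0)
fused = fuse(rng.normal(size=(4, TEXT_DIM)), rng.normal(size=(4, IMAGE_DIM)))
print(fused.shape)  # (4, 2816)
```

Row `i` of the result holds the text vector in columns 0 to 767 and the image vector in columns 768 to 2815, so the two blocks can be separated again later.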

#### Step 4: Model Training

- Used a **deep neural network classifier** with:

  - 3 hidden layers (1024, 512, 128)

  - BatchNorm + Dropout for regularization

  - `Adam` optimizer with `ReduceLROnPlateau` learning-rate scheduling and early stopping

- Handled **class imbalance** with `class_weight`
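
For reference, the "balanced" heuristic weights each class by `n_samples / (n_classes * class_count)`, so rare ratings count more in the loss. The sketch below reproduces that formula in plain Python and keys the result by zero-based class index (rating minus one), which is what Keras expects with one-hot labels. The function name `balanced_class_weights` is hypothetical, and the ratings are toy values.

```python
from collections import Counter

def balanced_class_weights(ratings):
    """Return {zero_based_class_index: weight} using n / (k * count)."""
    counts = Counter(int(r) for r in ratings)
    n_samples = sum(counts.values())
    n_classes = len(counts)  # only classes that actually occur
    return {rating - 1: n_samples / (n_classes * count)
            for rating, count in sorted(counts.items())}

weights = balanced_class_weights([5, 5, 5, 5, 4, 1])
print(weights)  # {0: 2.0, 3: 2.0, 4: 0.5}
```

Keying by `rating - 1` keeps the mapping correct even when a rating value is missing from the training split.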

#### Step 5: Evaluation

- Metrics: Accuracy, Confusion Matrix, Classification Report

- Final Accuracy: ~**60**% on the 20% test split (the same split is used as validation data during training)
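
Because star ratings are usually imbalanced, an accuracy figure is easier to interpret next to a majority-class baseline. The following is a minimal sketch with toy labels (not results from the notebook); `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / len(test_labels)

# Toy labels for illustration only.
train = [5, 5, 5, 5, 5, 5, 4, 3, 2, 1]
test = [5, 5, 5, 4]
label, acc = majority_baseline_accuracy(train, test)
print(label, acc)  # 5 0.75
```

In the notebook the same function could be called with `y_train` and `y_test` to see how far the classifier is above this floor.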

## Key Learning Outcomes

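- **Multimodal fusion** is expected to outperform text-only or image-only models, in line with the cited work; the notebook in this repository trains only the fused model, so that comparison is not measured here.
- **BERT-style embeddings** capture word order and context that bag-of-words TF-IDF features discard, which helps with sentiment.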
- Image features complement text, especially for visual domains like fashion.
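
One way to test the fusion claim is an ablation: train the same classifier on each modality's block of columns and compare against the fused model on the same split. The sketch below only shows the slicing step; `split_modalities` is a hypothetical helper, and it assumes the text block comes first in the fused matrix, as it does when text and image features are stacked in that order.

```python
import numpy as np

def split_modalities(X, text_dim=768):
    """Split fused features back into (text_block, image_block) views."""
    X = np.asarray(X)
    if X.ndim != 2 or X.shape[1] <= text_dim:
        raise ValueError("expected a 2-D array wider than text_dim")
    return X[:, :text_dim], X[:, text_dim:]

X_demo = np.zeros((10, 2816), dtype=np.float32)  # placeholder for the fused matrix
X_text, X_image = split_modalities(X_demo)
print(X_text.shape, X_image.shape)  # (10, 768) (10, 2048)
```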

## Reflections
### What Surprised Me:
- DistilBERT was extremely effective even without fine-tuning.
- Combining two very different modalities was relatively smooth using pretrained models.
### Scope for Improvement:
- Add **user personalization** (user IDs, embeddings)
- Use **fine-tuned BERT** or **CLIP** for joint vision-language learning
- Scale up to full Amazon dataset with distributed processing

## References
- [Chen et al., 2020] Multimodal Recommender Systems
- [McAuley et al., 2015] Image-based Recommendations on Styles and Substitutes
- Hugging Face Transformers: https://huggingface.co/transformers/
- TensorFlow & Keras Documentation
- https://nijianmo.github.io/amazon/index.html
- ChatGPT

## Code & Visualizations
You can run the entire experiment using the notebook provided in this repository:
- **Downloads data** from Google Drive
- **Extracts features** using DistilBERT + ResNet50
- **Trains classifier** with class balancing and early stopping
- **Visualizes** training curves and confusion matrix

The pipeline is not specific to fashion and can be adapted to other multimodal rating-prediction tasks that pair text with images.

In [None]:
# Install dependencies (uncomment if needed)
# !pip install tensorflow numpy pandas scikit-learn matplotlib seaborn pillow tqdm requests transformers

import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Concatenate, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import matplotlib.pyplot as plt
import pickle

In [None]:
# STEP 1: Install gdown to download from Google Drive
!pip install -q gdown

# STEP 2: Download files using their IDs
import gdown

# Google Drive file IDs for the reviews and metadata files
review_file_id = "1s3H_MI9GXw6TZ4cnQBNu2nzYuqYgIyox"
meta_file_id = "1KYzuJznVDbZeAigQKlDzNFm-JNBkDdQI"

# Output paths
review_path = "Amazon_Fashion.jsonl"
meta_path = "meta_Amazon_Fashion.jsonl"

# Download from Google Drive
gdown.download(id=review_file_id, output=review_path, quiet=False)
gdown.download(id=meta_file_id, output=meta_path, quiet=False)

# STEP 3: Read limited number of lines from .jsonl files
import json
import pandas as pd

def load_jsonl_clean(path, max_lines=25000):
    records = []
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Skipping line {i}: {e}")
    return pd.DataFrame(records)

# Load only first 25000 lines from each file
reviews_df = load_jsonl_clean(review_path, max_lines=25000)
meta_df = load_jsonl_clean(meta_path, max_lines=25000)

# Preview structure
print("✅ Reviews columns:", reviews_df.columns)
print("✅ Metadata columns:", meta_df.columns)


In [None]:
# Keep relevant fields
reviews_df = reviews_df[['parent_asin', 'text', 'rating']].dropna()
meta_df = meta_df[['parent_asin', 'images']].dropna()

# Only keep entries with at least one image
meta_df = meta_df[meta_df['images'].apply(lambda x: isinstance(x, list) and len(x) > 0)]

# Merge on ASIN
merged_df = pd.merge(reviews_df, meta_df, on='parent_asin')

# Extract image URL from the first image entry (prefer hi_res, then large, then thumb)
def extract_first_image_url(image_list):
    if isinstance(image_list, list) and len(image_list) > 0:
        img_entry = image_list[0]
        for key in ['hi_res', 'large', 'thumb']:
            if key in img_entry and img_entry[key]:
                return img_entry[key]
    return None

merged_df['image_url'] = merged_df['images'].apply(extract_first_image_url)
merged_df = merged_df.dropna(subset=['image_url'])

# Final clean columns
merged_df = merged_df[['parent_asin', 'text', 'rating', 'image_url']]
merged_df = merged_df.rename(columns={'text': 'reviewText', 'rating': 'overall'})

print("Final merged dataset shape:", merged_df.shape)
merged_df.head()
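
The extraction above only inspects the first image entry, so a product whose first entry has no usable URL is dropped. A possible variant, sketched here under the assumption that `images` is a list of dicts with optional `hi_res`, `large`, and `thumb` keys, falls back to later entries. The name `first_available_url` and the example URLs are placeholders.

```python
def first_available_url(image_list, keys=("hi_res", "large", "thumb")):
    """Return the first usable URL, scanning entries in order and
    preferring higher-resolution keys within each entry."""
    if not isinstance(image_list, list):
        return None
    for entry in image_list:
        if not isinstance(entry, dict):
            continue
        for key in keys:
            if entry.get(key):  # skips missing keys, None, and empty strings
                return entry[key]
    return None

# Toy records shaped like the `images` field.
images = [
    {"hi_res": None, "large": "", "thumb": None},
    {"hi_res": None, "large": "https://example.com/large.jpg",
     "thumb": "https://example.com/thumb.jpg"},
]
print(first_available_url(images))  # https://example.com/large.jpg
```

When the first entry does contain a URL, this returns the same result as the original function.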


In [None]:
import os
import requests
from PIL import Image
from io import BytesIO
from tqdm import tqdm

# Directory to save images
image_dir = 'product_images'
os.makedirs(image_dir, exist_ok=True)

# Limit to first 2000 entries with non-null image_url
subset_df = merged_df.dropna(subset=['image_url']).head(2000).copy()


# Function to download and save an image
def download_image(url, image_id, save_dir):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise HTTPError for bad responses
        img = Image.open(BytesIO(response.content)).convert('RGB')
        save_path = os.path.join(save_dir, f'{image_id}.jpg')
        img.save(save_path)
        return save_path
    except Exception as e:
        print(f"❌ Failed to download {url}: {e}")
        return None

# Download images and track paths
local_paths = []

for idx, row in tqdm(subset_df.iterrows(), total=len(subset_df), desc="📥 Downloading images (limit 2000)"):
    url = row['image_url']
    image_id = row['parent_asin']  # Or 'asin' if that's what you're using
    save_path = download_image(url, image_id, image_dir)
    local_paths.append(save_path)

# Attach image paths and clean up
subset_df['image_path'] = local_paths
subset_df = subset_df.dropna(subset=['image_path']).reset_index(drop=True)

print("✅ Images successfully downloaded:", len(subset_df))

# You can now use `subset_df` for feature extraction and training
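
Since images are saved under `parent_asin`, several reviews of the same product trigger the same download more than once. A minimal sketch of fetching each product only once follows; `download_unique` is a hypothetical helper and `fake_fetch` stands in for the real download function so the example runs offline.

```python
def download_unique(rows, fetch):
    """Call fetch(url, image_id) once per image_id and reuse the result.

    rows  : iterable of (image_id, url) pairs
    fetch : callable returning a local path, or None on failure
    """
    cache = {}
    paths = []
    for image_id, url in rows:
        if image_id not in cache:
            cache[image_id] = fetch(url, image_id)
        paths.append(cache[image_id])
    return paths

calls = []

def fake_fetch(url, image_id):
    # Records each call instead of touching the network.
    calls.append(image_id)
    return f"product_images/{image_id}.jpg"

rows = [
    ("A1", "https://example.com/a.jpg"),
    ("A1", "https://example.com/a.jpg"),
    ("B2", "https://example.com/b.jpg"),
]
paths = download_unique(rows, fake_fetch)
print(len(calls), len(paths))  # 2 3
```

In the notebook, `rows` could be `zip(subset_df['parent_asin'], subset_df['image_url'])` and `fetch` a small lambda around `download_image` that also passes `image_dir`.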


In [None]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from transformers import DistilBertTokenizer, TFDistilBertModel


!pip install -q huggingface_hub
from huggingface_hub import login

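# Read the Hugging Face token from the HF_TOKEN environment variable.
# distilbert-base-uncased is a public model, so login is optional.
token = os.environ.get("HF_TOKEN")
if token:
    login(token)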


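# Requires 'reviewText', 'parent_asin', 'overall' in merged_df.
# Images are stored per parent_asin, so every merged row whose product image
# exists on disk is kept (this can be more rows than subset_df).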
merged_df['image_path'] = merged_df['parent_asin'].apply(lambda x: os.path.join('product_images', f'{x}.jpg'))
merged_df = merged_df[merged_df['image_path'].apply(os.path.exists)].reset_index(drop=True)

# ------------------------------
# 1. DistilBERT for Text Features
# ------------------------------
print("🔤 Extracting DistilBERT embeddings...")

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
bert_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='tf', padding='max_length', truncation=True, max_length=128)
    outputs = bert_model(inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # CLS token
    return cls_embedding.numpy().flatten()

text_features = np.array([get_bert_embedding(text) for text in tqdm(merged_df['reviewText'].fillna(""))])
np.save("text_features_bert.npy", text_features)

# ------------------------------
# 2. ResNet50 for Image Features
# ------------------------------
print("🖼️ Extracting ResNet50 image features...")

resnet = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def extract_image_feature(img_path):
    try:
        img = image.load_img(img_path, target_size=(224, 224))
        x = image.img_to_array(img)
        x = np.expand_dims(x, axis=0)
        x = preprocess_input(x)
        features = resnet.predict(x, verbose=0)
        return features.flatten()
    except Exception as e:
        print(f"⚠️ Error processing {img_path}: {e}")
        return np.zeros((2048,))

image_features = np.array([extract_image_feature(path) for path in tqdm(merged_df['image_path'])])
np.save("image_features_resnet.npy", image_features)

# ------------------------------
# 3. Combine Features + Labels
# ------------------------------
print("🔗 Combining features...")

combined_features = np.hstack((text_features, image_features))
ratings = merged_df['overall'].values

print("Combined features shape:", combined_features.shape)
print("Ratings shape:", ratings.shape)

np.save("combined_features_bert_resnet.npy", combined_features)
np.save("ratings.npy", ratings)

print("✅ Feature extraction complete using DistilBERT + ResNet50.")
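
The text loop above runs one forward pass per review. Hugging Face tokenizers also accept a list of strings, so reviews can be grouped into batches to cut overhead. The sketch below shows only the batching logic; `embed_all` and `batched` are hypothetical helpers, and `toy_encoder` is a stand-in for a function that would tokenize a list of texts, call the model, and return one `[CLS]` vector per text.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(texts, encode_batch, batch_size=32):
    """Apply encode_batch (list of str -> list of vectors) batch by batch."""
    vectors = []
    for chunk in batched(texts, batch_size):
        vectors.extend(encode_batch(chunk))
    return vectors

def toy_encoder(chunk):
    # Stand-in encoder: a 2-number "vector" (length, space count) per text.
    return [[len(t), t.count(" ")] for t in chunk]

texts = ["great fit", "too small", "ok", "love the color"]
out = embed_all(texts, toy_encoder, batch_size=3)
print(len(out))  # 4
```

Output order matches input order, so the resulting rows still line up with the rows of the DataFrame.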


In [None]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

# -------------------------------
# 1. Load Features and Labels
# -------------------------------
X = np.load("combined_features_bert_resnet.npy")
y = np.load("ratings.npy")

# Ensure labels are within [1, 5] and integers
y = np.clip(np.round(y), 1, 5).astype(int)

# -------------------------------
# 2. Train-Test Split & One-Hot Labels
# -------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

num_classes = 5
y_train_cat = to_categorical(y_train - 1, num_classes)
y_test_cat = to_categorical(y_test - 1, num_classes)

# -------------------------------
# 3. Compute Class Weights
# -------------------------------
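classes = np.unique(y_train)
cw = class_weight.compute_class_weight('balanced', classes=classes, y=y_train)
# Keras expects zero-based class indices, matching the one-hot labels (rating - 1)
class_weights = {int(c) - 1: float(w) for c, w in zip(classes, cw)}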

# -------------------------------
# 4. Build the Classification Model
# -------------------------------
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(1024, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.4),

    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),

    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),

    tf.keras.layers.Dense(num_classes, activation='softmax')
])

# -------------------------------
# 5. Optimizer and Compilation
# -------------------------------
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=1e-3)

model.compile(
    optimizer=optimizer,
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

# -------------------------------
# 6. Callbacks
# -------------------------------
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=12,
    restore_best_weights=True,
    verbose=1
)

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-6,
    verbose=1
)

# -------------------------------
# 7. Train the Model
# -------------------------------
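# Note: the test split doubles as the validation set for early stopping and
# LR scheduling, so the reported test accuracy is likely optimistic.
history = model.fit(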
    X_train, y_train_cat,
    validation_data=(X_test, y_test_cat),
    epochs=50,
    batch_size=64,
    class_weight=class_weights,
    callbacks=[early_stop, reduce_lr],
    verbose=1
)

# -------------------------------
# 8. Plot Training Curves
# -------------------------------
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title("Loss Over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()

plt.subplot(1,2,2)
plt.plot(history.history['accuracy'], label='Train Acc')
plt.plot(history.history['val_accuracy'], label='Val Acc')
plt.title("Accuracy Over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.tight_layout()
plt.show()
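
Training above validates on the same split that is later reported as test accuracy. A common alternative is a three-way split, so that early stopping sees only a validation set. The following is a sketch using NumPy only; `stratified_three_way_split` is a hypothetical helper, the fractions are assumptions, and calling scikit-learn's `train_test_split` twice would achieve much the same.

```python
import numpy as np

def stratified_three_way_split(y, val_frac=0.1, test_frac=0.2, seed=42):
    """Return (train_idx, val_idx, test_idx), split separately within each class."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(len(idx) * test_frac))
        n_val = int(round(len(idx) * val_frac))
        test.extend(idx[:n_test])
        val.extend(idx[n_test:n_test + n_val])
        train.extend(idx[n_test + n_val:])
    return np.array(train), np.array(val), np.array(test)

y_demo = np.array([5] * 70 + [4] * 20 + [1] * 10)  # toy, imbalanced labels
train_idx, val_idx, test_idx = stratified_three_way_split(y_demo)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 10 20
```

The validation indices would then feed `validation_data`, leaving the test indices untouched until the final evaluation.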


In [None]:
# Predict classes
y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1) + 1  # +1 to match rating scale

# Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"\n✅ Test Accuracy: {acc * 100:.2f}%")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred, labels=[1,2,3,4,5])
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[1,2,3,4,5], yticklabels=[1,2,3,4,5])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

# Classification Report
print("\n📊 Classification Report:\n")
print(classification_report(y_test, y_pred, digits=3))

# Actual vs Predicted Plot
plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.3)
plt.plot([1, 5], [1, 5], 'r--', label="Perfect Match")
plt.xlabel("Actual Ratings")
plt.ylabel("Predicted Ratings")
plt.title("Actual vs Predicted Ratings")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
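
Star ratings are ordinal, so predicting 4 for a true 5 is a smaller error than predicting 1, which plain accuracy does not capture. Two complementary metrics, sketched here with toy values, are mean absolute error in stars and the share of predictions within one star. The helper name `ordinal_metrics` is hypothetical.

```python
def ordinal_metrics(y_true, y_pred):
    """Return (mean absolute error, share of predictions within one star)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(int(t) - int(p)) for t, p in zip(y_true, y_pred)]
    mae = sum(errors) / len(errors)
    within_one = sum(e <= 1 for e in errors) / len(errors)
    return mae, within_one

mae, within_one = ordinal_metrics([5, 4, 1, 3], [5, 5, 4, 3])
print(mae, within_one)  # 1.0 0.75
```

With the notebook's variables, the call would be `ordinal_metrics(y_test, y_pred)`.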
