# Multimodal Movie Recommendation System

A comprehensive recommendation system based on MovieLens 100k + TMDB + ViT Image Features + BERT Text Features

## Key Features
- **Smart Quantity Control**: Set target number of movies with automatic progress management
- **Image Feature Extraction**: ViT model for extracting visual features from posters and stills  
- **Text Feature Extraction**: BERT model for processing movie overviews and taglines
- **Cast & Crew Statistics**: Extract high-frequency actors and directors as features
- **Multi-dimensional Feature Engineering**: Fusion of numerical, categorical, text, and image features
- **Performance Comparison**: Comprehensive evaluation against traditional recommendation systems

## System Architecture
1. **Data Preparation**: MovieLens + TMDB + User Filtering
2. **Image Processing**: Download → Selection → ViT Feature Extraction
3. **Text Processing**: BERT Feature Extraction
4. **Feature Engineering**: Multimodal Feature Fusion
5. **Recommendation System**: Multi-algorithm Performance Comparison

## Innovation Points
- Multimodal feature fusion (text + image + traditional features)
- Intelligent image selection (5 most diverse images from 10)
- Comprehensive performance evaluation (reproducing best HybridRec configuration)

## Technical Stack
- **Computer Vision**: ViT (Vision Transformer) for image understanding
- **Natural Language Processing**: BERT for semantic text analysis
- **Recommendation Algorithms**: Collaborative Filtering, Content-based, Hybrid approaches
- **Evaluation Metrics**: RMSE, MAE with statistical significance testing

In [1]:
# Install required libraries
!pip install requests pandas pillow tqdm transformers torch torchvision scikit-learn opencv-python

import requests
import pandas as pd
import numpy as np
import os
import time
import json
import cv2
from PIL import Image
from io import BytesIO
from tqdm import tqdm
from datetime import datetime
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Deep learning libraries
import torch
import torch.nn as nn
from torchvision import transforms, models
from transformers import BertTokenizer, BertModel, ViTImageProcessor, ViTModel
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.metrics import mean_squared_error, mean_absolute_error

print("Enhanced Multimodal Recommendation System")
print("=" * 60)

# ===================== Core Configuration Parameters =====================
# Target movie count - user configurable
TARGET_MOVIE_COUNT = 42  # Set the desired number of movies to process

# API Configuration
TMDB_API_KEY = "6ba3eb883961b80c06d196906b976afe"
TMDB_BASE_URL = "https://api.themoviedb.org/3"
TMDB_IMAGE_BASE_URL = "https://image.tmdb.org/t/p/original"

# File path configuration
IMAGE_DIR = "multimodal_images"          # Original image directory
SELECTED_IMAGE_DIR = "selected_images"    # Selected image directory
DATA_DIR = "multimodal_data"             # Data storage directory
PROGRESS_FILE = "multimodal_progress.json" # Progress tracking file

# Processing parameters
IMAGES_PER_MOVIE = 10    # Number of images to download per movie
SELECTED_IMAGES = 5      # Number of images to select after filtering
DELAY_BETWEEN_REQUESTS = 0.3  # API request interval (seconds)

# Create directories
for directory in [IMAGE_DIR, SELECTED_IMAGE_DIR, DATA_DIR]:
    os.makedirs(directory, exist_ok=True)

print(f"Target movie count: {TARGET_MOVIE_COUNT}")
print(f"Images per movie: {IMAGES_PER_MOVIE} -> select {SELECTED_IMAGES}")
print(f"Image directories: {IMAGE_DIR} -> {SELECTED_IMAGE_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"API request interval: {DELAY_BETWEEN_REQUESTS} seconds")
print("\nEnvironment setup completed")

Enhanced Multimodal Recommendation System
Target movie count: 42
Images per movie: 10 -> select 5
Image directories: multimodal_images -> selected_images
Data directory: multimodal_data
API request interval: 0.3 seconds

Environment setup completed


In [2]:
# Data loading and intelligent progress management
print("Data Loading and Progress Check")
print("=" * 50)

# Load MovieLens movie data
try:
    movies_df = pd.read_csv('movielens_movies.csv')
    print(f"MovieLens movie data loaded: {len(movies_df)} movies")
except FileNotFoundError:
    print("ERROR: movielens_movies.csv not found")
    print("Please run generate_imdb_mapping.py to generate movie information")

# Load IMDB ID mapping
try:
    with open('imdb/progress_mapping.json', 'r', encoding='utf-8') as f:
        imdb_mapping = json.load(f)
    imdb_mapping = {int(k): v for k, v in imdb_mapping.items()}
    print(f"IMDB ID mapping loaded: {len(imdb_mapping)} entries")
except FileNotFoundError:
    print("ERROR: IMDB ID mapping file not found")

# Load user rating data
try:
    ratings_df = pd.read_csv('ml-100k/u.data', sep='\t', 
                           names=['user_id', 'movie_id', 'rating', 'timestamp'])
    print(f"User rating data loaded: {len(ratings_df):,} ratings")
    print(f"Number of users: {ratings_df['user_id'].nunique():,}")
    print(f"Number of movies: {ratings_df['movie_id'].nunique():,}")
except FileNotFoundError:
    print("ERROR: ml-100k/u.data file not found")

# Load user information
try:
    users_df = pd.read_csv('ml-100k/u.user', sep='|', 
                         names=['user_id', 'age', 'gender', 'occupation', 'zip_code'])
    print(f"User information loaded: {len(users_df)} users")
except FileNotFoundError:
    print("ERROR: ml-100k/u.user file not found")

# Intelligent target movie selection
if 'movies_df' in locals() and 'imdb_mapping' in locals():
    valid_movies = movies_df[movies_df['movie_id'].isin(imdb_mapping.keys())].copy()
    valid_movies['imdb_id'] = valid_movies['movie_id'].map(imdb_mapping)
    
    # Select target movies based on TARGET_MOVIE_COUNT
    target_movies = valid_movies.head(TARGET_MOVIE_COUNT).copy()
    
    print(f"\nTarget movie processing: {len(target_movies)} / {len(valid_movies)} valid movies")
    print(f"Data coverage: {len(target_movies)/len(valid_movies)*100:.1f}%")
    
    # Display target movie list
    print(f"\nTarget movie list:")
    for i, (_, movie) in enumerate(target_movies.head(10).iterrows(), 1):
        print(f"   {i:2d}. {movie['movie_id']:3d}. {movie['title']} ({movie['year']}) -> tt{movie['imdb_id']}")
    
    if len(target_movies) > 10:
        print(f"   ... and {len(target_movies) - 10} more movies")
        
    # Save target movie list
    target_file = os.path.join(DATA_DIR, 'target_movies.csv')
    target_movies.to_csv(target_file, index=False)
    print(f"\nTarget movie list saved: {target_file}")
else:
    print("ERROR: Data loading failed, cannot continue")

Data Loading and Progress Check
MovieLens movie data loaded: 1682 movies
IMDB ID mapping loaded: 42 entries
User rating data loaded: 100,000 ratings
Number of users: 943
Number of movies: 1,682
User information loaded: 943 users

Target movie processing: 42 / 42 valid movies
Data coverage: 100.0%

Target movie list:
    1.   1. Toy Story (1995) -> tt0114709
    2.   2. GoldenEye (1995) -> tt0113189
    3.   3. Four Rooms (1995) -> tt0113101
    4.   4. Get Shorty (1995) -> tt0113161
    5.   5. Copycat (1995) -> tt0112722
    6.   6. Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) -> tt0115012
    7.   7. Twelve Monkeys (1995) -> tt0114746
    8.   8. Babe (1995) -> tt0112431
    9.   9. Dead Man Walking (1995) -> tt0112818
   10.  10. Richard III (1995) -> tt0114279
   ... and 32 more movies

Target movie list saved: multimodal_data\target_movies.csv


In [3]:
# User data filtering and cleaning
print("User Data Filtering and Cleaning")
print("=" * 50)

if 'target_movies' in locals() and 'ratings_df' in locals():
    # Filter rating data for target movies
    target_movie_ids = set(target_movies['movie_id'].values)
    filtered_ratings = ratings_df[ratings_df['movie_id'].isin(target_movie_ids)].copy()
    
    print(f"Original rating data: {len(ratings_df):,} entries")
    print(f"Target movie ratings: {len(filtered_ratings):,} entries")
    print(f"Data filtering rate: {len(filtered_ratings)/len(ratings_df)*100:.1f}%")
    
    # Analyze user activity
    user_activity = filtered_ratings['user_id'].value_counts()
    print(f"\nUser activity analysis:")
    print(f"   Total users: {len(user_activity):,}")
    print(f"   Average ratings per user: {user_activity.mean():.1f}")
    print(f"   Median ratings per user: {user_activity.median():.1f}")
    print(f"   Most active user: {user_activity.max()} ratings")
    print(f"   Least active user: {user_activity.min()} ratings")
    
    # Clean low-activity users (less than 3 ratings)
    MIN_RATINGS_PER_USER = 3
    active_users = user_activity[user_activity >= MIN_RATINGS_PER_USER].index
    cleaned_ratings = filtered_ratings[filtered_ratings['user_id'].isin(active_users)].copy()
    
    print(f"\nData cleaning results:")
    print(f"   Minimum ratings requirement: {MIN_RATINGS_PER_USER} entries")
    print(f"   Users retained: {len(active_users):,} / {len(user_activity):,} ({len(active_users)/len(user_activity)*100:.1f}%)")
    print(f"   Ratings retained: {len(cleaned_ratings):,} / {len(filtered_ratings):,} ({len(cleaned_ratings)/len(filtered_ratings)*100:.1f}%)")
    
    # Merge user information and handle missing values
    if 'users_df' in locals():
        cleaned_ratings_with_users = cleaned_ratings.merge(users_df, on='user_id', how='left')
        
        # Check and clean missing user information
        missing_user_info = cleaned_ratings_with_users['age'].isna().sum()
        if missing_user_info > 0:
            print(f"WARNING: Missing user information: {missing_user_info:,} entries ({missing_user_info/len(cleaned_ratings_with_users)*100:.1f}%)")
            # Remove records with missing user information
            cleaned_ratings_with_users.dropna(subset=['age', 'gender', 'occupation'], inplace=True)
            print(f"Ratings after cleaning: {len(cleaned_ratings_with_users):,} entries")
        
        print(f"\nFinal cleaned dataset:")
        print(f"   Rating records: {len(cleaned_ratings_with_users):,} entries")
        print(f"   Number of users: {cleaned_ratings_with_users['user_id'].nunique():,}")
        print(f"   Number of movies: {cleaned_ratings_with_users['movie_id'].nunique():,}")
        print(f"   Average ratings per user: {len(cleaned_ratings_with_users)/cleaned_ratings_with_users['user_id'].nunique():.1f}")
        print(f"   Average ratings per movie: {len(cleaned_ratings_with_users)/cleaned_ratings_with_users['movie_id'].nunique():.1f}")
        
        # User feature analysis
        print(f"\nUser demographic distribution:")
        print(f"   Age range: {cleaned_ratings_with_users['age'].min()}-{cleaned_ratings_with_users['age'].max()} years")
        print(f"   Average age: {cleaned_ratings_with_users['age'].mean():.1f} years")
        print(f"   Gender distribution: {dict(cleaned_ratings_with_users['gender'].value_counts())}")
        print(f"   Top 5 occupations: {dict(cleaned_ratings_with_users['occupation'].value_counts().head())}")
        
        # Rating distribution analysis
        rating_dist = cleaned_ratings_with_users['rating'].value_counts().sort_index()
        print(f"\nRating distribution:")
        for rating, count in rating_dist.items():
            print(f"   {rating} stars: {count:,} entries ({count/len(cleaned_ratings_with_users)*100:.1f}%)")
        print(f"   Average rating: {cleaned_ratings_with_users['rating'].mean():.2f}")
        
        # Save cleaned data
        cleaned_data_file = os.path.join(DATA_DIR, 'cleaned_ratings_data.csv')
        cleaned_ratings_with_users.to_csv(cleaned_data_file, index=False)
        print(f"\nCleaned data saved: {cleaned_data_file}")
        
        # Update target_movies to include only movies with ratings
        rated_movie_ids = set(cleaned_ratings_with_users['movie_id'].unique())
        target_movies = target_movies[target_movies['movie_id'].isin(rated_movie_ids)].copy()
        print(f"Updated target movie count: {len(target_movies)} movies (with actual rating data)")
        
        # Print rating matrix dimension information
        n_users = cleaned_ratings_with_users['user_id'].nunique()
        n_movies = cleaned_ratings_with_users['movie_id'].nunique()
        sparsity = 1 - len(cleaned_ratings_with_users) / (n_users * n_movies)
        print(f"\nRating matrix dimensions:")
        print(f"   User-Movie matrix: {n_users} x {n_movies} = {n_users * n_movies:,} possible ratings")
        print(f"   Actual ratings: {len(cleaned_ratings_with_users):,}")
        print(f"   Sparsity: {sparsity*100:.2f}% (percentage of missing values)")
        print(f"   Density: {(1-sparsity)*100:.2f}% (percentage of observed values)")
    
else:
    print("ERROR: Required data missing, cannot perform user data filtering")

User Data Filtering and Cleaning
Original rating data: 100,000 entries
Target movie ratings: 5,845 entries
Data filtering rate: 5.8%

User activity analysis:
   Total users: 766
   Average ratings per user: 7.6
   Median ratings per user: 6.0
   Most active user: 42 ratings
   Least active user: 1 ratings

Data cleaning results:
   Minimum ratings requirement: 3 entries
   Users retained: 630 / 766 (82.2%)
   Ratings retained: 5,635 / 5,845 (96.4%)

Final cleaned dataset:
   Rating records: 5,635 entries
   Number of users: 630
   Number of movies: 42
   Average ratings per user: 8.9
   Average ratings per movie: 134.2

User demographic distribution:
   Age range: 7-73 years
   Average age: 32.0 years
   Gender distribution: {'M': np.int64(4329), 'F': np.int64(1306)}
   Top 5 occupations: {'student': np.int64(1351), 'other': np.int64(595), 'educator': np.int64(517), 'engineer': np.int64(489), 'programmer': np.int64(475)}

Rating distribution:
   1 stars: 231 entries (4.1%)
   2 stars: 

In [4]:
# TMDB data acquisition and image download processor
print("TMDB Data Acquisition and Image Download")
print("=" * 50)

class EnhancedTMDBProcessor:
    def __init__(self, api_key):
        self.api_key = api_key
        self.session = requests.Session()
        self.processed_movies = set()
        self.failed_movies = set()
        self.movie_features = {}
        self.load_progress()
    
    def load_progress(self):
        """Load processing progress"""
        if os.path.exists(PROGRESS_FILE):
            try:
                with open(PROGRESS_FILE, 'r', encoding='utf-8') as f:
                    progress = json.load(f)
                self.processed_movies = set(progress.get("processed", []))
                self.failed_movies = set(progress.get("failed", []))
                self.movie_features = progress.get("movie_features", {})
                print(f"Progress loaded: {len(self.processed_movies)} processed, {len(self.failed_movies)} failed")
            except Exception as e:
                print(f"ERROR: Failed to load progress: {str(e)}")
    
    def save_progress(self):
        """Save processing progress"""
        try:
            progress = {
                "processed": list(self.processed_movies),
                "failed": list(self.failed_movies),
                "movie_features": self.movie_features,
                "last_updated": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                "target_count": TARGET_MOVIE_COUNT,
                "total_processed": len(self.processed_movies),
                "total_failed": len(self.failed_movies)
            }
            with open(PROGRESS_FILE, 'w', encoding='utf-8') as f:
                json.dump(progress, f, ensure_ascii=False, indent=2)
        except Exception as e:
            print(f"ERROR: Failed to save progress: {str(e)}")
    
    def find_movie_by_imdb_id(self, imdb_id):
        """Find TMDB movie by IMDB ID"""
        if not imdb_id.startswith('tt'):
            imdb_id = f"tt{imdb_id}"
        
        url = f"{TMDB_BASE_URL}/find/{imdb_id}"
        params = {"api_key": self.api_key, "external_source": "imdb_id"}
        
        try:
            response = self.session.get(url, params=params, timeout=10)
            if response.status_code == 200:
                data = response.json()
                if data.get("movie_results"):
                    return data["movie_results"][0]
            return None
        except Exception as e:
            print(f"     ERROR: Search failed: {str(e)}")
            return None
    
    def get_movie_details(self, tmdb_id):
        """Get detailed movie information"""
        url = f"{TMDB_BASE_URL}/movie/{tmdb_id}"
        params = {
            "api_key": self.api_key,
            "append_to_response": "credits,keywords,videos,images"
        }
        
        try:
            response = self.session.get(url, params=params, timeout=10)
            if response.status_code == 200:
                return response.json()
            return None
        except Exception as e:
            print(f"     ERROR: Failed to get details: {str(e)}")
            return None
    
    def clean_filename(self, filename):
        """Clean filename for safe storage"""
        invalid_chars = '<>:"/\\|?*'
        for char in invalid_chars:
            filename = filename.replace(char, '_')
        return filename[:100]
    
    def download_image(self, url, save_path):
        """Download and process image"""
        if os.path.exists(save_path):
            return True
        
        try:
            response = self.session.get(url, timeout=15)
            if response.status_code == 200:
                img = Image.open(BytesIO(response.content))
                if img.size[0] < 50 or img.size[1] < 50:
                    return False
                
                if img.mode != 'RGB':
                    img = img.convert('RGB')
                
                # Resize to 512x512 for ViT processing
                img = img.resize((512, 512), Image.Resampling.LANCZOS)
                img.save(save_path, "JPEG", quality=90, optimize=True)
                return True
            return False
        except Exception as e:
            print(f"       ERROR: Download failed: {str(e)}")
            return False
    
    def download_movie_images(self, movie_id, title, tmdb_movie):
        """Download movie images - up to 10 images"""
        movie_folder = os.path.join(IMAGE_DIR, f"{movie_id:04d}_{self.clean_filename(title)}")
        os.makedirs(movie_folder, exist_ok=True)
        
        downloaded = []
        
        # 1. Download poster
        if tmdb_movie.get("poster_path"):
            poster_url = f"{TMDB_IMAGE_BASE_URL}{tmdb_movie['poster_path']}"
            poster_path = os.path.join(movie_folder, "poster.jpg")
            if self.download_image(poster_url, poster_path):
                downloaded.append(("poster", poster_path))
        
        # 2. Download backdrop
        if tmdb_movie.get("backdrop_path"):
            backdrop_url = f"{TMDB_IMAGE_BASE_URL}{tmdb_movie['backdrop_path']}"
            backdrop_path = os.path.join(movie_folder, "backdrop.jpg")
            if self.download_image(backdrop_url, backdrop_path):
                downloaded.append(("backdrop", backdrop_path))
        
        # 3. Download stills (up to 8)
        if "images" in tmdb_movie and "backdrops" in tmdb_movie["images"]:
            backdrops = tmdb_movie["images"]["backdrops"][:8]
            for i, backdrop_info in enumerate(backdrops):
                backdrop_url = f"{TMDB_IMAGE_BASE_URL}{backdrop_info['file_path']}"
                still_path = os.path.join(movie_folder, f"still_{i+1}.jpg")
                if self.download_image(backdrop_url, still_path):
                    downloaded.append((f"still_{i+1}", still_path))
                    
                # Limit to maximum of 10 images
                if len(downloaded) >= IMAGES_PER_MOVIE:
                    break
        
        return downloaded
    
    def extract_movie_features(self, tmdb_movie, movielens_info):
        """Extract comprehensive movie features"""
        features = {
            # MovieLens original information
            "movielens_id": movielens_info["movie_id"],
            "movielens_title": movielens_info["title"],
            "movielens_year": movielens_info["year"],
            "movielens_imdb_id": movielens_info["imdb_id"],
            
            # TMDB basic information
            "tmdb_id": tmdb_movie.get("id"),
            "title": tmdb_movie.get("title", ""),
            "original_title": tmdb_movie.get("original_title", ""),
            "overview": tmdb_movie.get("overview", ""),
            "tagline": tmdb_movie.get("tagline", ""),
            "release_date": tmdb_movie.get("release_date", ""),
            "runtime": tmdb_movie.get("runtime", 0),
            "status": tmdb_movie.get("status", ""),
            
            # Rating information
            "vote_average": tmdb_movie.get("vote_average", 0),
            "vote_count": tmdb_movie.get("vote_count", 0),
            "popularity": tmdb_movie.get("popularity", 0),
            
            # Classification information
            "genres": "|".join([g["name"] for g in tmdb_movie.get("genres", [])]),
            "original_language": tmdb_movie.get("original_language", ""),
            "production_countries": "|".join([c["name"] for c in tmdb_movie.get("production_countries", [])]),
            "production_companies": "|".join([c["name"] for c in tmdb_movie.get("production_companies", [])]),
            
            # Financial information
            "budget": tmdb_movie.get("budget", 0),
            "revenue": tmdb_movie.get("revenue", 0),
        }
        
        # Cast and crew information
        if "credits" in tmdb_movie:
            credits = tmdb_movie["credits"]
            
            # Directors
            directors = [crew["name"] for crew in credits.get("crew", []) if crew.get("job") == "Director"]
            features["directors"] = "|".join(directors)
            
            # Cast
            cast = [actor["name"] for actor in credits.get("cast", [])[:10]]
            features["cast"] = "|".join(cast)
        
        # Keywords
        if "keywords" in tmdb_movie and "keywords" in tmdb_movie["keywords"]:
            keywords = [kw["name"] for kw in tmdb_movie["keywords"]["keywords"][:15]]
            features["keywords"] = "|".join(keywords)
        
        return features

# Check progress status
processor = EnhancedTMDBProcessor(TMDB_API_KEY)
current_processed = len(processor.processed_movies)

#print(f"Current processing progress: {current_processed}/{TARGET_MOVIE_COUNT}")

# Intelligent progress management
if current_processed >= TARGET_MOVIE_COUNT:
    print(f"Target achieved ({current_processed} >= {TARGET_MOVIE_COUNT})")
    print("Skipping download phase, using existing data")
    SKIP_DOWNLOAD = True
else:
    need_to_process = TARGET_MOVIE_COUNT - current_processed
    print(f"Remaining to process: {need_to_process} movies")
    SKIP_DOWNLOAD = False
    
print("\nTMDB processor ready")

TMDB Data Acquisition and Image Download
Progress loaded: 25 processed, 0 failed
Remaining to process: 17 movies

TMDB processor ready


In [5]:
# Execute TMDB data acquisition and image download
if not SKIP_DOWNLOAD and 'target_movies' in locals():
    print("Starting TMDB data processing and image download")
    print("=" * 60)
    
    all_movie_features = []
    success_count = 0
    error_count = 0
    total_images = 0
    
    # Process each movie
    for idx, (_, movie_info) in enumerate(tqdm(target_movies.iterrows(), total=len(target_movies), desc="Processing movies")):
        movie_id = movie_info["movie_id"]
        title = movie_info["title"]
        imdb_id = movie_info["imdb_id"]
        
        print(f"\n[{idx+1}/{len(target_movies)}] {movie_id}. {title} -> tt{imdb_id}")
        
        # Check if already processed
        if str(movie_id) in processor.processed_movies:
            print(f"   Already processed, skipping")
            if str(movie_id) in processor.movie_features:
                all_movie_features.append(processor.movie_features[str(movie_id)])
            success_count += 1
            continue
        
        try:
            # 1. Find TMDB movie
            tmdb_basic = processor.find_movie_by_imdb_id(imdb_id)
            if not tmdb_basic:
                print(f"   ERROR: Not found in TMDB")
                processor.failed_movies.add(str(movie_id))
                error_count += 1
                continue
            
            # 2. Get detailed information
            tmdb_movie = processor.get_movie_details(tmdb_basic["id"])
            if not tmdb_movie:
                print(f"   ERROR: Failed to get details")
                processor.failed_movies.add(str(movie_id))
                error_count += 1
                continue
            
            # 3. Extract features
            features = processor.extract_movie_features(tmdb_movie, movie_info)
            
            # 4. Download images
            downloaded_images = processor.download_movie_images(movie_id, title, tmdb_movie)
            features["downloaded_images_count"] = len(downloaded_images)
            features["image_paths"] = [path for _, path in downloaded_images]
            total_images += len(downloaded_images)
            
            # 5. Record success
            all_movie_features.append(features)
            processor.processed_movies.add(str(movie_id))
            processor.movie_features[str(movie_id)] = features
            success_count += 1
            
            print(f"   SUCCESS: {len(downloaded_images)} images, rating {features['vote_average']}/10")
            print(f"   Genres: {features['genres']}")
            print(f"   Directors: {features.get('directors', 'N/A')}")
            
        except Exception as e:
            print(f"   ERROR: Processing failed: {str(e)}")
            processor.failed_movies.add(str(movie_id))
            error_count += 1
        
        # Save progress periodically
        if (idx + 1) % 5 == 0:
            processor.save_progress()
            print(f"   Progress saved")
        
        # API request interval
        time.sleep(DELAY_BETWEEN_REQUESTS)
    
    # Final progress save
    processor.save_progress()
    
    # Save movie feature data
    if all_movie_features:
        features_file = os.path.join(DATA_DIR, 'tmdb_movie_features.json')
        with open(features_file, 'w', encoding='utf-8') as f:
            json.dump(all_movie_features, f, ensure_ascii=False, indent=2)
        
        csv_file = os.path.join(DATA_DIR, 'tmdb_movie_features.csv')
        pd.DataFrame(all_movie_features).to_csv(csv_file, index=False, encoding='utf-8')
        
        print(f"\nMovie feature data saved:")
        print(f"   JSON format: {features_file}")
        print(f"   CSV format: {csv_file}")
    
    # Output statistics
    print(f"\nTMDB processing statistics:")
    print(f"   Successful: {success_count} movies")
    print(f"   Failed: {error_count} movies")
    print(f"   Images downloaded: {total_images}")
    print(f"   Success rate: {success_count/(success_count+error_count)*100:.1f}%")
    print(f"   Target achieved: {success_count >= TARGET_MOVIE_COUNT}")

else:
    # Load existing data
    features_file = os.path.join(DATA_DIR, 'tmdb_movie_features.json')
    if os.path.exists(features_file):
        with open(features_file, 'r', encoding='utf-8') as f:
            all_movie_features = json.load(f)
        print(f"Existing movie feature data loaded: {len(all_movie_features)} movies")
    else:
        print("ERROR: No existing movie feature data found")
        all_movie_features = []

Starting TMDB data processing and image download


Processing movies:   0%|          | 0/42 [00:00<?, ?it/s]


[1/42] 1. Toy Story -> tt0114709
   Already processed, skipping

[2/42] 2. GoldenEye -> tt0113189
   Already processed, skipping

[3/42] 3. Four Rooms -> tt0113101
   Already processed, skipping

[4/42] 4. Get Shorty -> tt0113161
   Already processed, skipping

[5/42] 5. Copycat -> tt0112722
   Already processed, skipping

[6/42] 6. Shanghai Triad (Yao a yao yao dao waipo qiao) -> tt0115012
   Already processed, skipping

[7/42] 7. Twelve Monkeys -> tt0114746
   Already processed, skipping

[8/42] 8. Babe -> tt0112431
   Already processed, skipping

[9/42] 9. Dead Man Walking -> tt0112818
   Already processed, skipping

[10/42] 10. Richard III -> tt0114279
   Already processed, skipping

[11/42] 11. Seven (Se7en) -> tt0114369
   Already processed, skipping

[12/42] 12. Usual Suspects, The -> tt0114814
   Already processed, skipping

[13/42] 13. Mighty Aphrodite -> tt0113819
   Already processed, skipping

[14/42] 14. Postino, Il -> tt0110877
   Already processed, skipping

[15/42] 15.

Processing movies:  62%|██████▏   | 26/42 [00:01<00:00, 18.48it/s]


[27/42] 27. Bad Boys -> tt0112442
   SUCCESS: 10 images, rating 6.822/10
   Genres: Action|Comedy|Crime|Thriller
   Directors: Michael Bay

[28/42] 28. Apollo 13 -> tt0112384
   SUCCESS: 10 images, rating 7.448/10
   Genres: Drama|History
   Directors: Ron Howard


Processing movies:  67%|██████▋   | 28/42 [00:04<00:02,  5.02it/s]


[29/42] 29. Batman Forever -> tt0112462
   SUCCESS: 10 images, rating 5.441/10
   Genres: Action|Crime|Fantasy
   Directors: Joel Schumacher


Processing movies:  69%|██████▉   | 29/42 [00:05<00:03,  3.48it/s]


[30/42] 30. Belle de jour -> tt0061395
   SUCCESS: 10 images, rating 7.323/10
   Genres: Drama|Romance
   Directors: Luis Buñuel
   Progress saved


Processing movies:  71%|███████▏  | 30/42 [00:07<00:05,  2.37it/s]


[31/42] 31. Crimson Tide -> tt0112740
   SUCCESS: 10 images, rating 7.2/10
   Genres: Thriller|Action|Drama|War
   Directors: Tony Scott


Processing movies:  74%|███████▍  | 31/42 [00:09<00:06,  1.70it/s]


[32/42] 32. Crumb -> tt0109508
   SUCCESS: 10 images, rating 7.5/10
   Genres: Documentary
   Directors: Terry Zwigoff


Processing movies:  76%|███████▌  | 32/42 [00:12<00:08,  1.15it/s]


[33/42] 33. Desperado -> tt0112851
   SUCCESS: 10 images, rating 6.929/10
   Genres: Thriller|Action|Crime
   Directors: Robert Rodriguez


Processing movies:  79%|███████▊  | 33/42 [00:14<00:08,  1.01it/s]


[34/42] 34. Doom Generation, The -> tt0112887
   SUCCESS: 10 images, rating 6.5/10
   Genres: Comedy|Crime|Drama
   Directors: Gregg Araki


Processing movies:  81%|████████  | 34/42 [00:15<00:08,  1.03s/it]


[35/42] 35. Free Willy 2: The Adventure Home -> tt0113114
   SUCCESS: 10 images, rating 5.9/10
   Genres: Family|Adventure|Drama|Comedy
   Directors: Dwight H. Little
   Progress saved


Processing movies:  83%|████████▎ | 35/42 [00:16<00:07,  1.12s/it]


[36/42] 36. Mad Love -> tt0113729
   SUCCESS: 9 images, rating 5.211/10
   Genres: Drama|Romance
   Directors: Antonia Bird


Processing movies:  86%|████████▌ | 36/42 [00:17<00:06,  1.09s/it]


[37/42] 37. Nadja -> tt0110620
   SUCCESS: 6 images, rating 5.7/10
   Genres: Horror|Thriller
   Directors: Michael Almereyda


Processing movies:  88%|████████▊ | 37/42 [00:18<00:05,  1.08s/it]


[38/42] 38. Net, The -> tt0113957
   SUCCESS: 10 images, rating 6.029/10
   Genres: Crime|Drama|Mystery|Thriller|Action
   Directors: Irwin Winkler


Processing movies:  90%|█████████ | 38/42 [00:20<00:04,  1.17s/it]


[39/42] 39. Strange Days -> tt0114558
   SUCCESS: 10 images, rating 7.011/10
   Genres: Crime|Drama|Science Fiction|Thriller
   Directors: Kathryn Bigelow


Processing movies:  93%|█████████▎| 39/42 [00:21<00:03,  1.14s/it]


[40/42] 40. To Wong Foo, Thanks for Everything! Julie Newmar -> tt0114682
   SUCCESS: 10 images, rating 7.374/10
   Genres: Comedy|Drama
   Directors: Beeban Kidron
   Progress saved


Processing movies:  95%|█████████▌| 40/42 [00:22<00:02,  1.21s/it]


[41/42] 41. Billy Madison -> tt0112508
   SUCCESS: 9 images, rating 6.2/10
   Genres: Comedy
   Directors: Tamra Davis


Processing movies:  98%|█████████▊| 41/42 [00:23<00:01,  1.21s/it]


[42/42] 42. Clerks -> tt0109445
   SUCCESS: 10 images, rating 7.4/10
   Genres: Comedy
   Directors: Kevin Smith


Processing movies: 100%|██████████| 42/42 [00:25<00:00,  1.67it/s]


PermissionError: [Errno 13] Permission denied: 'multimodal_data\\tmdb_movie_features.csv'

In [8]:
# Image similarity calculation and selection - select 5 most diverse from 10
import re
print("Image Similarity Calculation and Selection")
print("=" * 50)

def sanitize_filename(filename):
    """Remove or replace invalid characters for Windows file/folder names"""
    # Replace invalid characters with underscore
    invalid_chars = r'[<>:"/\\|?*]'
    sanitized = re.sub(invalid_chars, '_', filename)
    
    # Remove any trailing dots or spaces (also invalid in Windows)
    sanitized = sanitized.rstrip('. ')
    
    # Ensure the name isn't empty after sanitization
    if not sanitized.strip():
        sanitized = "unnamed"
    
    return sanitized

def calculate_image_similarity(img1_path, img2_path):
    """Calculate similarity between two images using histogram comparison"""
    try:
        # Read images
        img1 = cv2.imread(img1_path)
        img2 = cv2.imread(img2_path)
        
        if img1 is None or img2 is None:
            return 0.0
        
        # Convert to HSV color space
        hsv1 = cv2.cvtColor(img1, cv2.COLOR_BGR2HSV)
        hsv2 = cv2.cvtColor(img2, cv2.COLOR_BGR2HSV)
        
        # Calculate histograms
        hist1 = cv2.calcHist([hsv1], [0, 1, 2], None, [50, 60, 60], [0, 180, 0, 256, 0, 256])
        hist2 = cv2.calcHist([hsv2], [0, 1, 2], None, [50, 60, 60], [0, 180, 0, 256, 0, 256])
        
        # Calculate correlation
        correlation = cv2.compareHist(hist1, hist2, cv2.HISTCMP_CORREL)
        return correlation
    
    except Exception as e:
        print(f"       Similarity calculation failed: {str(e)}")
        return 0.0

def select_diverse_images(image_paths, target_count=5):
    """Select the most diverse target_count images from the image list"""
    if len(image_paths) <= target_count:
        return image_paths
    
    # Calculate similarity matrix for all image pairs
    n = len(image_paths)
    similarity_matrix = np.zeros((n, n))
    
    for i in range(n):
        for j in range(i+1, n):
            sim = calculate_image_similarity(image_paths[i], image_paths[j])
            similarity_matrix[i][j] = sim
            similarity_matrix[j][i] = sim
    
    # Greedy algorithm to select most dissimilar images
    selected_indices = [0]  # Start with first image
    
    for _ in range(target_count - 1):
        max_min_sim = -1
        best_candidate = -1
        
        for candidate in range(n):
            if candidate in selected_indices:
                continue
            
            # Calculate minimum similarity with already selected images
            min_sim = min(similarity_matrix[candidate][selected] for selected in selected_indices)
            
            if min_sim > max_min_sim:
                max_min_sim = min_sim
                best_candidate = candidate
        
        if best_candidate != -1:
            selected_indices.append(best_candidate)
    
    return [image_paths[i] for i in selected_indices]

# Filter images for each movie
if 'all_movie_features' in locals() and all_movie_features:
    print(f"Starting image selection for {len(all_movie_features)} movies")
    
    selected_image_stats = []
    total_selected = 0
    
    for movie_features in tqdm(all_movie_features, desc="Selecting images"):
        movie_id = movie_features['movielens_id']
        title = movie_features['movielens_title']
        
        # Get image paths
        image_paths = movie_features.get('image_paths', [])
        
        if not image_paths:
            print(f"   WARNING: {movie_id}. {title}: No images found")
            continue
        
        # Verify image files exist
        valid_paths = [path for path in image_paths if os.path.exists(path)]
        
        if len(valid_paths) == 0:
            print(f"   ERROR: {movie_id}. {title}: Image files do not exist")
            continue
        
        # Select most diverse images
        selected_paths = select_diverse_images(valid_paths, SELECTED_IMAGES)
        
        # Create selected image directory and copy images
        # FIXED: Sanitize the title to remove invalid characters
        sanitized_title = sanitize_filename(title[:50])
        selected_folder = os.path.join(SELECTED_IMAGE_DIR, f"{movie_id:04d}_{sanitized_title}")
        
        try:
            os.makedirs(selected_folder, exist_ok=True)
        except Exception as e:
            print(f"   ERROR: Failed to create directory for {movie_id}. {title}: {str(e)}")
            # Fallback: use only movie ID as folder name
            selected_folder = os.path.join(SELECTED_IMAGE_DIR, f"{movie_id:04d}")
            os.makedirs(selected_folder, exist_ok=True)
        
        copied_paths = []
        for i, src_path in enumerate(selected_paths):
            filename = f"selected_{i+1}.jpg"
            dst_path = os.path.join(selected_folder, filename)
            
            try:
                # Copy image
                img = Image.open(src_path)
                img.save(dst_path, "JPEG", quality=90)
                copied_paths.append(dst_path)
            except Exception as e:
                print(f"     ERROR: Failed to copy image: {str(e)}")
        
        # Update feature data
        movie_features['selected_image_paths'] = copied_paths
        movie_features['selected_images_count'] = len(copied_paths)
        
        selected_image_stats.append({
            'movie_id': movie_id,
            'title': title,
            'original_count': len(valid_paths),
            'selected_count': len(copied_paths)
        })
        
        total_selected += len(copied_paths)
        
        print(f"   SUCCESS: {movie_id}. {title}: {len(valid_paths)} -> {len(copied_paths)} images")
    
    # Save updated feature data
    updated_features_file = os.path.join(DATA_DIR, 'movie_features_with_selected_images.json')
    with open(updated_features_file, 'w', encoding='utf-8') as f:
        json.dump(all_movie_features, f, ensure_ascii=False, indent=2)
    
    print(f"\nImage selection statistics:")
    print(f"   Movies processed: {len(selected_image_stats)}")
    print(f"   Total selected images: {total_selected}")
    print(f"   Average per movie: {total_selected/len(selected_image_stats):.1f} images")
    print(f"   Updated data saved: {updated_features_file}")
    
    # Display selection details
    stats_df = pd.DataFrame(selected_image_stats)
    print(f"\nSelection details:")
    print(f"   Original images total: {stats_df['original_count'].sum()}")
    print(f"   Selected images total: {stats_df['selected_count'].sum()}")
    print(f"   Selection rate: {stats_df['selected_count'].sum()/stats_df['original_count'].sum()*100:.1f}%")
    
else:
    print("ERROR: No movie feature data available for image selection")

Image Similarity Calculation and Selection
Starting image selection for 42 movies


Selecting images:   2%|▏         | 1/42 [00:00<00:04,  8.47it/s]

   SUCCESS: 1. Toy Story: 10 -> 5 images


Selecting images:   5%|▍         | 2/42 [00:00<00:04,  8.58it/s]

   SUCCESS: 2. GoldenEye: 10 -> 5 images


Selecting images:   7%|▋         | 3/42 [00:00<00:04,  8.67it/s]

   SUCCESS: 3. Four Rooms: 10 -> 5 images


Selecting images:  10%|▉         | 4/42 [00:00<00:04,  8.76it/s]

   SUCCESS: 4. Get Shorty: 10 -> 5 images


Selecting images:  12%|█▏        | 5/42 [00:00<00:04,  9.12it/s]

   SUCCESS: 5. Copycat: 10 -> 5 images
   SUCCESS: 6. Shanghai Triad (Yao a yao yao dao waipo qiao): 7 -> 5 images


Selecting images:  17%|█▋        | 7/42 [00:00<00:03, 10.58it/s]

   SUCCESS: 7. Twelve Monkeys: 10 -> 5 images
   SUCCESS: 8. Babe: 10 -> 5 images


Selecting images:  21%|██▏       | 9/42 [00:00<00:03,  9.65it/s]

   SUCCESS: 9. Dead Man Walking: 10 -> 5 images
   SUCCESS: 10. Richard III: 7 -> 5 images


Selecting images:  26%|██▌       | 11/42 [00:01<00:02, 10.47it/s]

   SUCCESS: 11. Seven (Se7en): 10 -> 5 images
   SUCCESS: 12. Usual Suspects, The: 10 -> 5 images


Selecting images:  31%|███       | 13/42 [00:01<00:02,  9.78it/s]

   SUCCESS: 13. Mighty Aphrodite: 10 -> 5 images


Selecting images:  33%|███▎      | 14/42 [00:01<00:02,  9.44it/s]

   SUCCESS: 14. Postino, Il: 10 -> 5 images


Selecting images:  36%|███▌      | 15/42 [00:01<00:02,  9.17it/s]

   SUCCESS: 15. Mr. Holland's Opus: 10 -> 5 images


Selecting images:  38%|███▊      | 16/42 [00:01<00:02,  9.09it/s]

   SUCCESS: 16. French Twist (Gazon maudit): 10 -> 5 images


Selecting images:  40%|████      | 17/42 [00:01<00:02,  9.02it/s]

   SUCCESS: 17. From Dusk Till Dawn: 10 -> 5 images
   SUCCESS: 18. White Balloon, The: 4 -> 4 images
   SUCCESS: 19. Antonia's Line: 7 -> 5 images


Selecting images:  48%|████▊     | 20/42 [00:01<00:01, 12.95it/s]

   SUCCESS: 20. Angels and Insects: 8 -> 5 images
   SUCCESS: 21. Muppet Treasure Island: 10 -> 1 images


Selecting images:  52%|█████▏    | 22/42 [00:02<00:01, 10.94it/s]

   SUCCESS: 22. Braveheart: 10 -> 5 images
   SUCCESS: 23. Taxi Driver: 10 -> 5 images


Selecting images:  57%|█████▋    | 24/42 [00:02<00:01, 10.05it/s]

   SUCCESS: 24. Rumble in the Bronx: 10 -> 5 images


Selecting images:  62%|██████▏   | 26/42 [00:02<00:01, 11.42it/s]

   SUCCESS: 25. Birdcage, The: 10 -> 5 images
   SUCCESS: 26. Brothers McMullen, The: 4 -> 4 images
   SUCCESS: 27. Bad Boys: 10 -> 5 images


Selecting images:  67%|██████▋   | 28/42 [00:02<00:01, 10.38it/s]

   SUCCESS: 28. Apollo 13: 10 -> 5 images
   SUCCESS: 29. Batman Forever: 10 -> 5 images


Selecting images:  71%|███████▏  | 30/42 [00:03<00:01,  9.93it/s]

   SUCCESS: 30. Belle de jour: 10 -> 5 images
   SUCCESS: 31. Crimson Tide: 10 -> 5 images


Selecting images:  76%|███████▌  | 32/42 [00:03<00:01,  9.59it/s]

   SUCCESS: 32. Crumb: 10 -> 5 images
   SUCCESS: 33. Desperado: 10 -> 5 images
   SUCCESS: 34. Doom Generation, The: 10 -> 5 images


Selecting images:  86%|████████▌ | 36/42 [00:03<00:00,  8.92it/s]

   SUCCESS: 35. Free Willy 2: The Adventure Home: 10 -> 1 images
   SUCCESS: 36. Mad Love: 9 -> 5 images
   SUCCESS: 37. Nadja: 6 -> 5 images


Selecting images:  93%|█████████▎| 39/42 [00:04<00:00,  8.18it/s]

   SUCCESS: 38. Net, The: 10 -> 5 images
   SUCCESS: 39. Strange Days: 10 -> 5 images


Selecting images:  98%|█████████▊| 41/42 [00:04<00:00,  7.50it/s]

   SUCCESS: 40. To Wong Foo, Thanks for Everything! Julie Newmar: 10 -> 5 images
   SUCCESS: 41. Billy Madison: 9 -> 5 images


Selecting images: 100%|██████████| 42/42 [00:04<00:00,  9.22it/s]

   SUCCESS: 42. Clerks: 10 -> 5 images

Image selection statistics:
   Movies processed: 42
   Total selected images: 200
   Average per movie: 4.8 images
   Updated data saved: multimodal_data\movie_features_with_selected_images.json

Selection details:
   Original images total: 391
   Selected images total: 200
   Selection rate: 51.2%





In [38]:
# ViT image feature extraction
print("ViT Image Feature Extraction")
print("=" * 50)

class ViTFeatureExtractor:
    def __init__(self, model_name='google/vit-base-patch16-224'):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        print(f"Using device: {self.device}")
        
        try:
            # Load ViT model and processor
            self.processor = ViTImageProcessor.from_pretrained(model_name)
            self.model = ViTModel.from_pretrained(model_name)
            self.model.to(self.device)
            self.model.eval()
            print(f"ViT model loaded successfully: {model_name}")
        except Exception as e:
            print(f"ERROR: ViT model loading failed: {str(e)}")
            self.model = None
    
    def extract_image_features(self, image_path):
        """Extract ViT features from a single image"""
        if self.model is None:
            return None
        
        try:
            # Load and preprocess image
            image = Image.open(image_path).convert('RGB')
            inputs = self.processor(images=image, return_tensors="pt")
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            
            # Extract features
            with torch.no_grad():
                outputs = self.model(**inputs)
                # Use [CLS] token features as image representation
                image_features = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten()
            
            return image_features
        
        except Exception as e:
            print(f"   ERROR: Feature extraction failed ({image_path}): {str(e)}")
            return None
    
    def extract_movie_features(self, image_paths):
        """Extract and fuse features from all movie images"""
        if not image_paths:
            return None
        
        features_list = []
        
        for image_path in image_paths:
            if os.path.exists(image_path):
                features = self.extract_image_features(image_path)
                if features is not None:
                    features_list.append(features)
        
        if not features_list:
            return None
        
        # Fuse multiple image features (average pooling)
        combined_features = np.mean(features_list, axis=0)
        return combined_features

# Initialize ViT feature extractor
vit_extractor = ViTFeatureExtractor()

# Extract ViT features for all movies
if 'all_movie_features' in locals() and all_movie_features and vit_extractor.model is not None:
    print(f"\nStarting ViT feature extraction for {len(all_movie_features)} movies")
    
    vit_features_matrix = []
    vit_movie_ids = []
    success_count = 0
    
    for movie_features in tqdm(all_movie_features, desc="Extracting ViT features"):
        movie_id = movie_features['movielens_id']
        title = movie_features['movielens_title']
        
        # Use selected images
        selected_paths = movie_features.get('selected_image_paths', [])
        
        if not selected_paths:
            print(f"   WARNING: {movie_id}. {title}: No selected images")
            continue
        
        # Extract features
        vit_features = vit_extractor.extract_movie_features(selected_paths)
        
        if vit_features is not None:
            vit_features_matrix.append(vit_features)
            vit_movie_ids.append(movie_id)
            
            # Update movie feature data
            movie_features['vit_features'] = vit_features.tolist()
            movie_features['vit_feature_dim'] = len(vit_features)
            
            success_count += 1
            print(f"   SUCCESS: {movie_id}. {title}: feature dimension {len(vit_features)}")
        else:
            print(f"   ERROR: {movie_id}. {title}: ViT feature extraction failed")
    
    # Convert to numpy array and save
    if vit_features_matrix:
        vit_features_array = np.array(vit_features_matrix)
        
        # Save ViT feature matrix
        vit_features_file = os.path.join(DATA_DIR, 'vit_features_matrix.npy')
        np.save(vit_features_file, vit_features_array)
        
        # Save movie ID mapping
        vit_mapping_file = os.path.join(DATA_DIR, 'vit_movie_mapping.json')
        with open(vit_mapping_file, 'w', encoding='utf-8') as f:
            json.dump({'movie_ids': vit_movie_ids, 'feature_dim': len(vit_features_matrix[0])}, f)
        
        print(f"\nViT feature extraction statistics:")
        print(f"   Successfully processed movies: {success_count}")
        print(f"   Feature dimension: {vit_features_array.shape[1]}")
        print(f"   Feature matrix shape: {vit_features_array.shape}")
        print(f"   Feature matrix saved: {vit_features_file}")
        print(f"   Mapping file saved: {vit_mapping_file}")
        
        # Save updated movie features
        updated_features_file = os.path.join(DATA_DIR, 'movie_features_with_vit.json')
        with open(updated_features_file, 'w', encoding='utf-8') as f:
            json.dump(all_movie_features, f, ensure_ascii=False, indent=2)
        print(f"   Updated feature data saved: {updated_features_file}")
    else:
        print("ERROR: No ViT features extracted successfully")
else:
    print("ERROR: ViT model not loaded or no movie data available")

ViT Image Feature Extraction
Using device: cpu


Some weights of ViTModel were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ViT model loaded successfully: google/vit-base-patch16-224

Starting ViT feature extraction for 25 movies


Extracting ViT features:   4%|▍         | 1/25 [00:00<00:10,  2.38it/s]

   SUCCESS: 1. Toy Story: feature dimension 768


Extracting ViT features:   8%|▊         | 2/25 [00:00<00:08,  2.62it/s]

   SUCCESS: 2. GoldenEye: feature dimension 768


Extracting ViT features:  12%|█▏        | 3/25 [00:01<00:08,  2.66it/s]

   SUCCESS: 3. Four Rooms: feature dimension 768


Extracting ViT features:  16%|█▌        | 4/25 [00:01<00:07,  2.70it/s]

   SUCCESS: 4. Get Shorty: feature dimension 768


Extracting ViT features:  20%|██        | 5/25 [00:01<00:07,  2.74it/s]

   SUCCESS: 5. Copycat: feature dimension 768


Extracting ViT features:  24%|██▍       | 6/25 [00:02<00:06,  2.74it/s]

   SUCCESS: 6. Shanghai Triad (Yao a yao yao dao waipo qiao): feature dimension 768


Extracting ViT features:  28%|██▊       | 7/25 [00:02<00:06,  2.78it/s]

   SUCCESS: 7. Twelve Monkeys: feature dimension 768


Extracting ViT features:  32%|███▏      | 8/25 [00:02<00:06,  2.72it/s]

   SUCCESS: 8. Babe: feature dimension 768


Extracting ViT features:  36%|███▌      | 9/25 [00:03<00:05,  2.76it/s]

   SUCCESS: 9. Dead Man Walking: feature dimension 768


Extracting ViT features:  40%|████      | 10/25 [00:03<00:05,  2.78it/s]

   SUCCESS: 10. Richard III: feature dimension 768


Extracting ViT features:  44%|████▍     | 11/25 [00:04<00:05,  2.79it/s]

   SUCCESS: 11. Seven (Se7en): feature dimension 768


Extracting ViT features:  48%|████▊     | 12/25 [00:04<00:04,  2.77it/s]

   SUCCESS: 12. Usual Suspects, The: feature dimension 768


Extracting ViT features:  52%|█████▏    | 13/25 [00:04<00:04,  2.77it/s]

   SUCCESS: 13. Mighty Aphrodite: feature dimension 768


Extracting ViT features:  56%|█████▌    | 14/25 [00:05<00:03,  2.79it/s]

   SUCCESS: 14. Postino, Il: feature dimension 768


Extracting ViT features:  60%|██████    | 15/25 [00:05<00:03,  2.80it/s]

   SUCCESS: 15. Mr. Holland's Opus: feature dimension 768


Extracting ViT features:  64%|██████▍   | 16/25 [00:05<00:03,  2.79it/s]

   SUCCESS: 16. French Twist (Gazon maudit): feature dimension 768


Extracting ViT features:  68%|██████▊   | 17/25 [00:06<00:02,  2.81it/s]

   SUCCESS: 17. From Dusk Till Dawn: feature dimension 768


Extracting ViT features:  72%|███████▏  | 18/25 [00:06<00:02,  3.00it/s]

   SUCCESS: 18. White Balloon, The: feature dimension 768


Extracting ViT features:  76%|███████▌  | 19/25 [00:06<00:02,  2.95it/s]

   SUCCESS: 19. Antonia's Line: feature dimension 768


Extracting ViT features:  80%|████████  | 20/25 [00:07<00:01,  2.94it/s]

   SUCCESS: 20. Angels and Insects: feature dimension 768
   SUCCESS: 21. Muppet Treasure Island: feature dimension 768


Extracting ViT features:  88%|████████▊ | 22/25 [00:07<00:00,  3.57it/s]

   SUCCESS: 22. Braveheart: feature dimension 768


Extracting ViT features:  92%|█████████▏| 23/25 [00:07<00:00,  3.36it/s]

   SUCCESS: 23. Taxi Driver: feature dimension 768


Extracting ViT features:  96%|█████████▌| 24/25 [00:08<00:00,  3.20it/s]

   SUCCESS: 24. Rumble in the Bronx: feature dimension 768


Extracting ViT features: 100%|██████████| 25/25 [00:08<00:00,  2.90it/s]

   SUCCESS: 25. Birdcage, The: feature dimension 768

ViT feature extraction statistics:
   Successfully processed movies: 25
   Feature dimension: 768
   Feature matrix shape: (25, 768)
   Feature matrix saved: multimodal_data\vit_features_matrix.npy
   Mapping file saved: multimodal_data\vit_movie_mapping.json
   Updated feature data saved: multimodal_data\movie_features_with_vit.json





In [39]:
# BERT text feature extraction
print("BERT Text Feature Extraction")
print("=" * 50)

class BERTFeatureExtractor:
    def __init__(self, model_name='bert-base-uncased'):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        print(f"Using device: {self.device}")
        
        try:
            # Load BERT model and tokenizer
            self.tokenizer = BertTokenizer.from_pretrained(model_name)
            self.model = BertModel.from_pretrained(model_name)
            self.model.to(self.device)
            self.model.eval()
            print(f"BERT model loaded successfully: {model_name}")
        except Exception as e:
            print(f"ERROR: BERT model loading failed: {str(e)}")
            self.model = None
    
    def extract_text_features(self, text, max_length=512):
        """Extract BERT features from text"""
        if self.model is None or not text or text.strip() == "":
            return None
        
        try:
            # Text preprocessing and tokenization
            inputs = self.tokenizer(
                text,
                add_special_tokens=True,
                max_length=max_length,
                padding='max_length',
                truncation=True,
                return_tensors='pt'
            )
            
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            
            # Extract features
            with torch.no_grad():
                outputs = self.model(**inputs)
                # Use [CLS] token features as sentence representation
                text_features = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten()
            
            return text_features
        
        except Exception as e:
            print(f"   ERROR: Text feature extraction failed: {str(e)}")
            return None
    
    def extract_movie_text_features(self, overview, tagline):
        """Extract movie text features (overview + tagline)"""
        features_list = []
        
        # Extract overview features
        if overview and overview.strip():
            overview_features = self.extract_text_features(overview)
            if overview_features is not None:
                features_list.append(overview_features)
        
        # Extract tagline features
        if tagline and tagline.strip():
            tagline_features = self.extract_text_features(tagline)
            if tagline_features is not None:
                features_list.append(tagline_features)
        
        if not features_list:
            return None
        
        # Average pooling to fuse features
        combined_features = np.mean(features_list, axis=0)
        return combined_features

# Initialize BERT feature extractor
bert_extractor = BERTFeatureExtractor()

# Extract BERT text features for all movies
if 'all_movie_features' in locals() and all_movie_features and bert_extractor.model is not None:
    print(f"\nStarting BERT text feature extraction for {len(all_movie_features)} movies")
    
    bert_features_matrix = []
    bert_movie_ids = []
    success_count = 0
    
    for movie_features in tqdm(all_movie_features, desc="Extracting BERT features"):
        movie_id = movie_features['movielens_id']
        title = movie_features['movielens_title']
        overview = movie_features.get('overview', '')
        tagline = movie_features.get('tagline', '')
        
        print(f"   Processing {movie_id}. {title}")
        print(f"     Overview length: {len(overview) if overview else 0} characters")
        print(f"     Tagline length: {len(tagline) if tagline else 0} characters")
        
        # Extract text features
        bert_features = bert_extractor.extract_movie_text_features(overview, tagline)
        
        if bert_features is not None:
            bert_features_matrix.append(bert_features)
            bert_movie_ids.append(movie_id)
            
            # Update movie feature data
            movie_features['bert_features'] = bert_features.tolist()
            movie_features['bert_feature_dim'] = len(bert_features)
            movie_features['text_length'] = len(overview) + len(tagline)
            
            success_count += 1
            print(f"     SUCCESS: BERT feature dimension: {len(bert_features)}")
        else:
            print(f"     ERROR: BERT feature extraction failed")
    
    # Convert to numpy array and save
    if bert_features_matrix:
        bert_features_array = np.array(bert_features_matrix)
        
        # Save BERT feature matrix
        bert_features_file = os.path.join(DATA_DIR, 'bert_features_matrix.npy')
        np.save(bert_features_file, bert_features_array)
        
        # Save movie ID mapping
        bert_mapping_file = os.path.join(DATA_DIR, 'bert_movie_mapping.json')
        with open(bert_mapping_file, 'w', encoding='utf-8') as f:
            json.dump({'movie_ids': bert_movie_ids, 'feature_dim': len(bert_features_matrix[0])}, f)
        
        print(f"\nBERT feature extraction statistics:")
        print(f"   Successfully processed movies: {success_count}")
        print(f"   Feature dimension: {bert_features_array.shape[1]}")
        print(f"   Feature matrix shape: {bert_features_array.shape}")
        print(f"   Feature matrix saved: {bert_features_file}")
        print(f"   Mapping file saved: {bert_mapping_file}")
        
        # Save updated movie features
        updated_features_file = os.path.join(DATA_DIR, 'movie_features_with_bert.json')
        with open(updated_features_file, 'w', encoding='utf-8') as f:
            json.dump(all_movie_features, f, ensure_ascii=False, indent=2)
        print(f"   Updated feature data saved: {updated_features_file}")
    else:
        print("ERROR: No BERT features extracted successfully")
else:
    print("ERROR: BERT model not loaded or no movie data available")

BERT Text Feature Extraction
Using device: cpu
BERT model loaded successfully: bert-base-uncased

Starting BERT text feature extraction for 25 movies


Extracting BERT features:   0%|          | 0/25 [00:00<?, ?it/s]

   Processing 1. Toy Story
     Overview length: 303 characters
     Tagline length: 47 characters


Extracting BERT features:   4%|▍         | 1/25 [00:00<00:07,  3.19it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 2. GoldenEye
     Overview length: 371 characters
     Tagline length: 36 characters


Extracting BERT features:   8%|▊         | 2/25 [00:00<00:06,  3.70it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 3. Four Rooms
     Overview length: 237 characters
     Tagline length: 155 characters


Extracting BERT features:  12%|█▏        | 3/25 [00:00<00:05,  3.81it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 4. Get Shorty
     Overview length: 367 characters
     Tagline length: 22 characters


Extracting BERT features:  16%|█▌        | 4/25 [00:01<00:05,  4.03it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 5. Copycat
     Overview length: 139 characters
     Tagline length: 142 characters


Extracting BERT features:  20%|██        | 5/25 [00:01<00:04,  4.19it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 6. Shanghai Triad (Yao a yao yao dao waipo qiao)
     Overview length: 271 characters
     Tagline length: 68 characters


Extracting BERT features:  24%|██▍       | 6/25 [00:01<00:04,  4.24it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 7. Twelve Monkeys
     Overview length: 536 characters
     Tagline length: 22 characters


Extracting BERT features:  28%|██▊       | 7/25 [00:01<00:04,  4.26it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 8. Babe
     Overview length: 383 characters
     Tagline length: 29 characters


Extracting BERT features:  36%|███▌      | 9/25 [00:02<00:03,  5.12it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 9. Dead Man Walking
     Overview length: 147 characters
     Tagline length: 0 characters
     SUCCESS: BERT feature dimension: 768
   Processing 10. Richard III
     Overview length: 442 characters
     Tagline length: 37 characters


Extracting BERT features:  40%|████      | 10/25 [00:02<00:03,  4.89it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 11. Seven (Se7en)
     Overview length: 389 characters
     Tagline length: 37 characters


Extracting BERT features:  44%|████▍     | 11/25 [00:02<00:02,  4.76it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 12. Usual Suspects, The
     Overview length: 409 characters
     Tagline length: 44 characters


Extracting BERT features:  48%|████▊     | 12/25 [00:02<00:02,  4.51it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 13. Mighty Aphrodite
     Overview length: 419 characters
     Tagline length: 75 characters


Extracting BERT features:  52%|█████▏    | 13/25 [00:02<00:02,  4.49it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 14. Postino, Il
     Overview length: 127 characters
     Tagline length: 20 characters


Extracting BERT features:  56%|█████▌    | 14/25 [00:03<00:02,  4.49it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 15. Mr. Holland's Opus
     Overview length: 340 characters
     Tagline length: 71 characters


Extracting BERT features:  60%|██████    | 15/25 [00:03<00:02,  4.44it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 16. French Twist (Gazon maudit)
     Overview length: 157 characters
     Tagline length: 58 characters


Extracting BERT features:  64%|██████▍   | 16/25 [00:03<00:02,  4.46it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 17. From Dusk Till Dawn
     Overview length: 163 characters
     Tagline length: 94 characters


Extracting BERT features:  72%|███████▏  | 18/25 [00:04<00:01,  5.15it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 18. White Balloon, The
     Overview length: 125 characters
     Tagline length: 0 characters
     SUCCESS: BERT feature dimension: 768
   Processing 19. Antonia's Line
     Overview length: 424 characters
     Tagline length: 64 characters


Extracting BERT features:  76%|███████▌  | 19/25 [00:04<00:01,  4.94it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 20. Angels and Insects
     Overview length: 438 characters
     Tagline length: 65 characters


Extracting BERT features:  80%|████████  | 20/25 [00:04<00:01,  4.79it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 21. Muppet Treasure Island
     Overview length: 397 characters
     Tagline length: 27 characters


Extracting BERT features:  84%|████████▍ | 21/25 [00:04<00:00,  4.65it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 22. Braveheart
     Overview length: 258 characters
     Tagline length: 43 characters


Extracting BERT features:  88%|████████▊ | 22/25 [00:04<00:00,  4.56it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 23. Taxi Driver
     Overview length: 165 characters
     Tagline length: 143 characters


Extracting BERT features:  92%|█████████▏| 23/25 [00:05<00:00,  4.46it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 24. Rumble in the Bronx
     Overview length: 397 characters
     Tagline length: 31 characters


Extracting BERT features:  96%|█████████▌| 24/25 [00:05<00:00,  4.42it/s]

     SUCCESS: BERT feature dimension: 768
   Processing 25. Birdcage, The
     Overview length: 567 characters
     Tagline length: 16 characters


Extracting BERT features: 100%|██████████| 25/25 [00:05<00:00,  4.46it/s]

     SUCCESS: BERT feature dimension: 768

BERT feature extraction statistics:
   Successfully processed movies: 25
   Feature dimension: 768
   Feature matrix shape: (25, 768)
   Feature matrix saved: multimodal_data\bert_features_matrix.npy
   Mapping file saved: multimodal_data\bert_movie_mapping.json
   Updated feature data saved: multimodal_data\movie_features_with_bert.json





In [10]:
# Cast & crew statistical feature engineering
print("Cast & Crew Statistical Feature Engineering")
print("=" * 50)

def extract_cast_crew_features(all_movie_features, top_n=50):
    """Extract cast & crew statistical features"""
    if not all_movie_features:
        return None, None, None
    
    # Count frequency of all directors and cast
    all_directors = []
    all_cast = []
    
    for movie in all_movie_features:
        # Directors
        directors = movie.get('directors', '')
        if directors:
            all_directors.extend([d.strip() for d in directors.split('|') if d.strip()])
        
        # Cast
        cast = movie.get('cast', '')
        if cast:
            all_cast.extend([c.strip() for c in cast.split('|') if c.strip()])
    
    # Count frequencies
    director_counts = Counter(all_directors)
    cast_counts = Counter(all_cast)
    
    # Get high-frequency cast & crew
    top_directors = [director for director, count in director_counts.most_common(top_n)]
    top_cast = [actor for actor, count in cast_counts.most_common(top_n)]
    
    print(f"Cast & crew statistics:")
    print(f"   Total directors: {len(director_counts)} people")
    print(f"   Total cast: {len(cast_counts)} people")
    print(f"   Selected high-frequency directors: {len(top_directors)} people")
    print(f"   Selected high-frequency cast: {len(top_cast)} people")
    
    # Display TOP 10 directors and cast
    print(f"\nTOP 10 high-frequency directors:")
    for i, (director, count) in enumerate(director_counts.most_common(10), 1):
        print(f"   {i:2d}. {director}: {count} movies")
    
    print(f"\nTOP 10 high-frequency cast:")
    for i, (actor, count) in enumerate(cast_counts.most_common(10), 1):
        print(f"   {i:2d}. {actor}: {count} movies")
    
    return top_directors, top_cast, (director_counts, cast_counts)

def create_cast_crew_features(all_movie_features, top_directors, top_cast):
    """Create cast & crew feature vectors for each movie"""
    if not all_movie_features or not top_directors or not top_cast:
        return None, None
    
    cast_crew_matrix = []
    movie_ids = []
    
    for movie in all_movie_features:
        movie_id = movie['movielens_id']
        
        # Director features (one-hot encoding)
        director_features = np.zeros(len(top_directors))
        directors = movie.get('directors', '')
        if directors:
            movie_directors = [d.strip() for d in directors.split('|') if d.strip()]
            for i, top_director in enumerate(top_directors):
                if top_director in movie_directors:
                    director_features[i] = 1
        
        # Cast features (one-hot encoding)
        cast_features = np.zeros(len(top_cast))
        cast = movie.get('cast', '')
        if cast:
            movie_cast = [c.strip() for c in cast.split('|') if c.strip()]
            for i, top_actor in enumerate(top_cast):
                if top_actor in movie_cast:
                    cast_features[i] = 1
        
        # Combine features
        combined_features = np.concatenate([director_features, cast_features])
        cast_crew_matrix.append(combined_features)
        movie_ids.append(movie_id)
    
    return np.array(cast_crew_matrix), movie_ids

# Execute cast & crew feature engineering
if 'all_movie_features' in locals() and all_movie_features:
    print(f"Starting cast & crew feature extraction for {len(all_movie_features)} movies")
    
    # Extract high-frequency cast & crew
    top_directors, top_cast, (director_counts, cast_counts) = extract_cast_crew_features(all_movie_features)
    
    if top_directors and top_cast:
        # Create cast & crew feature matrix
        cast_crew_matrix, cc_movie_ids = create_cast_crew_features(
            all_movie_features, top_directors, top_cast
        )
        
        if cast_crew_matrix is not None:
            # Save cast & crew features
            cast_crew_file = os.path.join(DATA_DIR, 'cast_crew_features.npy')
            np.save(cast_crew_file, cast_crew_matrix)
            
            # Save cast & crew mapping information
            cast_crew_mapping = {
                'movie_ids': cc_movie_ids,
                'top_directors': top_directors,
                'top_cast': top_cast,
                'director_feature_dim': len(top_directors),
                'cast_feature_dim': len(top_cast),
                'total_feature_dim': len(top_directors) + len(top_cast)
            }
            
            mapping_file = os.path.join(DATA_DIR, 'cast_crew_mapping.json')
            with open(mapping_file, 'w', encoding='utf-8') as f:
                json.dump(cast_crew_mapping, f, ensure_ascii=False, indent=2)
            
            # Update movie feature data
            for i, movie in enumerate(all_movie_features):
                if movie['movielens_id'] in cc_movie_ids:
                    idx = cc_movie_ids.index(movie['movielens_id'])
                    movie['cast_crew_features'] = cast_crew_matrix[idx].tolist()
                    movie['cast_crew_feature_dim'] = len(cast_crew_matrix[idx])
            
            print(f"\nCast & crew feature statistics:")
            print(f"   Movies processed: {len(cc_movie_ids)}")
            print(f"   Director feature dimension: {len(top_directors)}")
            print(f"   Cast feature dimension: {len(top_cast)}")
            print(f"   Total feature dimension: {cast_crew_matrix.shape[1]}")
            print(f"   Feature matrix shape: {cast_crew_matrix.shape}")
            print(f"   Feature matrix saved: {cast_crew_file}")
            print(f"   Mapping file saved: {mapping_file}")
            
            # Analyze feature sparsity
            sparsity = 1 - np.count_nonzero(cast_crew_matrix) / cast_crew_matrix.size
            print(f"   Feature sparsity: {sparsity*100:.2f}%")
            print(f"   Average high-frequency cast & crew per movie: {np.mean(np.sum(cast_crew_matrix, axis=1)):.1f}")
            
            # Save final movie feature data
            final_features_file = os.path.join(DATA_DIR, 'movie_features_complete.json')
            with open(final_features_file, 'w', encoding='utf-8') as f:
                json.dump(all_movie_features, f, ensure_ascii=False, indent=2)
            print(f"   Complete feature data saved: {final_features_file}")
        else:
            print("ERROR: Cast & crew feature matrix creation failed")
    else:
        print("ERROR: High-frequency cast & crew extraction failed")
else:
    print("ERROR: No movie feature data available")

Cast & Crew Statistical Feature Engineering
Starting cast & crew feature extraction for 25 movies
Cast & crew statistics:
   Total directors: 24 people
   Total cast: 232 people
   Selected high-frequency directors: 24 people
   Selected high-frequency cast: 50 people

TOP 10 high-frequency directors:
    1. Martin Campbell: 2 movies
    2. John Lasseter: 1 movies
    3. Barry Sonnenfeld: 1 movies
    4. Jon Amiel: 1 movies
    5. Zhang Yimou: 1 movies
    6. Terry Gilliam: 1 movies
    7. Chris Noonan: 1 movies
    8. Tim Robbins: 1 movies
    9. Richard Loncraine: 1 movies
   10. David Fincher: 1 movies

TOP 10 high-frequency cast:
    1. Pierce Brosnan: 2 movies
    2. Sean Bean: 2 movies
    3. Izabella Scorupco: 2 movies
    4. Famke Janssen: 2 movies
    5. Joe Don Baker: 2 movies
    6. Judi Dench: 2 movies
    7. Robbie Coltrane: 2 movies
    8. Tchéky Karyo: 2 movies
    9. Gottfried John: 2 movies
   10. Alan Cumming: 2 movies

Cast & crew feature statistics:
   Movies proces

In [11]:
# Multimodal feature fusion
print("Multimodal Feature Fusion")
print("=" * 50)

def load_and_normalize_features():
    """Load and normalize all features"""
    features_data = {}
    
    # 1. Load ViT image features
    vit_file = os.path.join(DATA_DIR, 'vit_features_matrix.npy')
    vit_mapping_file = os.path.join(DATA_DIR, 'vit_movie_mapping.json')
    
    if os.path.exists(vit_file) and os.path.exists(vit_mapping_file):
        vit_features = np.load(vit_file)
        with open(vit_mapping_file, 'r') as f:
            vit_mapping = json.load(f)
        
        # Standardize ViT features
        scaler_vit = StandardScaler()
        vit_features_normalized = scaler_vit.fit_transform(vit_features)
        
        features_data['vit'] = {
            'features': vit_features_normalized,
            'movie_ids': vit_mapping['movie_ids'],
            'dim': vit_features_normalized.shape[1],
            'scaler': scaler_vit
        }
        print(f"ViT features loaded: {vit_features_normalized.shape} (standardized)")
    else:
        print("WARNING: ViT feature files not found")
    
    # 2. Load BERT text features
    bert_file = os.path.join(DATA_DIR, 'bert_features_matrix.npy')
    bert_mapping_file = os.path.join(DATA_DIR, 'bert_movie_mapping.json')
    
    if os.path.exists(bert_file) and os.path.exists(bert_mapping_file):
        bert_features = np.load(bert_file)
        with open(bert_mapping_file, 'r') as f:
            bert_mapping = json.load(f)
        
        # Standardize BERT features
        scaler_bert = StandardScaler()
        bert_features_normalized = scaler_bert.fit_transform(bert_features)
        
        features_data['bert'] = {
            'features': bert_features_normalized,
            'movie_ids': bert_mapping['movie_ids'],
            'dim': bert_features_normalized.shape[1],
            'scaler': scaler_bert
        }
        print(f"BERT features loaded: {bert_features_normalized.shape} (standardized)")
    else:
        print("WARNING: BERT feature files not found")
    
    # 3. Load cast & crew features
    cast_crew_file = os.path.join(DATA_DIR, 'cast_crew_features.npy')
    cast_crew_mapping_file = os.path.join(DATA_DIR, 'cast_crew_mapping.json')
    
    if os.path.exists(cast_crew_file) and os.path.exists(cast_crew_mapping_file):
        cast_crew_features = np.load(cast_crew_file)
        with open(cast_crew_mapping_file, 'r') as f:
            cast_crew_mapping = json.load(f)
        
        # Cast & crew features are already 0-1 encoded, no standardization needed
        features_data['cast_crew'] = {
            'features': cast_crew_features,
            'movie_ids': cast_crew_mapping['movie_ids'],
            'dim': cast_crew_features.shape[1],
            'directors_dim': cast_crew_mapping['director_feature_dim'],
            'cast_dim': cast_crew_mapping['cast_feature_dim']
        }
        print(f"Cast & crew features loaded: {cast_crew_features.shape} (one-hot encoded)")
    else:
        print("WARNING: Cast & crew feature files not found")
    
    return features_data

def create_multimodal_features(features_data, fusion_method='concatenate'):
    """Create multimodal fused features"""
    if not features_data:
        return None, None
    
    # Find intersection of movie IDs from all features
    movie_id_sets = [set(data['movie_ids']) for data in features_data.values()]
    common_movie_ids = set.intersection(*movie_id_sets)
    common_movie_ids = sorted(list(common_movie_ids))
    
    print(f"Feature intersection statistics:")
    for feature_name, data in features_data.items():
        print(f"   {feature_name}: {len(data['movie_ids'])} movies")
    print(f"   Intersection: {len(common_movie_ids)} movies")
    
    if not common_movie_ids:
        print("ERROR: No common movie IDs found")
        return None, None
    
    # Align feature matrices
    aligned_features = []
    feature_dims = []
    feature_names = []
    
    for feature_name, data in features_data.items():
        movie_ids = data['movie_ids']
        features = data['features']
        
        # Create index mapping
        id_to_idx = {movie_id: idx for idx, movie_id in enumerate(movie_ids)}
        
        # Reorder features according to intersection IDs
        aligned_feature = np.array([features[id_to_idx[movie_id]] for movie_id in common_movie_ids])
        aligned_features.append(aligned_feature)
        feature_dims.append(aligned_feature.shape[1])
        feature_names.append(feature_name)
        
        print(f"   {feature_name}: {aligned_feature.shape}")
    
    # Feature fusion
    if fusion_method == 'concatenate':
        # Simple concatenation
        multimodal_features = np.concatenate(aligned_features, axis=1)
        print(f"\nConcatenation fusion result: {multimodal_features.shape}")
    
    elif fusion_method == 'weighted_average':
        # Weighted average (requires consistent feature dimensions, using PCA for dimensionality reduction)
        target_dim = min(feature_dims)
        reduced_features = []
        
        for i, feature in enumerate(aligned_features):
            if feature.shape[1] > target_dim:
                pca = PCA(n_components=target_dim)
                reduced_feature = pca.fit_transform(feature)
                reduced_features.append(reduced_feature)
                print(f"   {feature_names[i]}: {feature.shape} -> {reduced_feature.shape} (PCA)")
            else:
                reduced_features.append(feature)
        
        # Equal weight average
        multimodal_features = np.mean(reduced_features, axis=0)
        print(f"\nWeighted average fusion result: {multimodal_features.shape}")
    
    else:
        # Default concatenation
        multimodal_features = np.concatenate(aligned_features, axis=1)
    
    return multimodal_features, common_movie_ids

# Execute multimodal feature fusion
print("Starting multimodal feature fusion")

# Load and normalize features
features_data = load_and_normalize_features()

if features_data:
    # Create multiple fusion versions
    fusion_methods = ['concatenate', 'weighted_average']
    multimodal_results = {}
    
    for method in fusion_methods:
        print(f"\nUsing {method} method for feature fusion")
        multimodal_features, movie_ids = create_multimodal_features(features_data, method)
        
        if multimodal_features is not None:
            multimodal_results[method] = {
                'features': multimodal_features,
                'movie_ids': movie_ids
            }
            
            # Save fused features
            fusion_file = os.path.join(DATA_DIR, f'multimodal_features_{method}.npy')
            np.save(fusion_file, multimodal_features)
            
            # Save movie ID mapping
            mapping_file = os.path.join(DATA_DIR, f'multimodal_mapping_{method}.json')
            with open(mapping_file, 'w', encoding='utf-8') as f:
                json.dump({
                    'movie_ids': movie_ids,
                    'feature_dim': multimodal_features.shape[1],
                    'fusion_method': method,
                    'component_dims': {name: data['dim'] for name, data in features_data.items()}
                }, f, indent=2)
            
            print(f"   {method} fused features saved: {fusion_file}")
            print(f"   Mapping file saved: {mapping_file}")
    
    if multimodal_results:
        print(f"\nMultimodal feature fusion completed")
        print(f"Fusion results summary:")
        for method, result in multimodal_results.items():
            features = result['features']
            print(f"   {method}: {features.shape} (movies x feature_dim)")
        
        # Save feature statistics
        stats = {
            'total_movies': len(movie_ids),
            'fusion_methods': list(multimodal_results.keys()),
            'feature_sources': list(features_data.keys()),
            'individual_dims': {name: data['dim'] for name, data in features_data.items()}
        }
        
        stats_file = os.path.join(DATA_DIR, 'multimodal_stats.json')
        with open(stats_file, 'w', encoding='utf-8') as f:
            json.dump(stats, f, indent=2)
        print(f"   Statistics saved: {stats_file}")
    else:
        print("ERROR: Multimodal feature fusion failed")
else:
    print("ERROR: No available feature data")

Multimodal Feature Fusion
Starting multimodal feature fusion
ViT features loaded: (25, 768) (standardized)
BERT features loaded: (25, 768) (standardized)
Cast & crew features loaded: (25, 74) (one-hot encoded)

Using concatenate method for feature fusion
Feature intersection statistics:
   vit: 25 movies
   bert: 25 movies
   cast_crew: 25 movies
   Intersection: 25 movies
   vit: (25, 768)
   bert: (25, 768)
   cast_crew: (25, 74)

Concatenation fusion result: (25, 1610)
   concatenate fused features saved: multimodal_data\multimodal_features_concatenate.npy
   Mapping file saved: multimodal_data\multimodal_mapping_concatenate.json

Using weighted_average method for feature fusion
Feature intersection statistics:
   vit: 25 movies
   bert: 25 movies
   cast_crew: 25 movies
   Intersection: 25 movies
   vit: (25, 768)
   bert: (25, 768)
   cast_crew: (25, 74)


ValueError: n_components=74 must be between 0 and min(n_samples, n_features)=25 with svd_solver='full'

In [12]:
# Recommendation system implementation and performance comparison
print("Recommendation System Implementation and Performance Comparison")
print("=" * 60)

# Import traditional recommendation algorithms (reuse previous implementation)
class UserBasedCF:
    def __init__(self, rating_matrix, user_features=None, use_features=False):
        self.rating_matrix = rating_matrix
        self.user_features = user_features
        self.use_features = use_features
        self.user_similarity = None
        
    def compute_similarity(self):
        if self.use_features and self.user_features is not None:
            # Feature-based similarity
            self.user_similarity = cosine_similarity(self.user_features)
        else:
            # Rating-based similarity
            self.user_similarity = cosine_similarity(self.rating_matrix)
    
    def predict(self, user_idx, item_idx, k=50):
        if self.user_similarity is None:
            self.compute_similarity()
        
        user_ratings = self.rating_matrix[user_idx]
        if user_ratings[item_idx] != 0:
            return user_ratings[item_idx]
        
        # Find k most similar users
        similarities = self.user_similarity[user_idx]
        similar_users = np.argsort(similarities)[::-1][1:k+1]
        
        # Predict rating
        numerator = 0
        denominator = 0
        
        for similar_user in similar_users:
            if self.rating_matrix[similar_user, item_idx] != 0:
                sim = similarities[similar_user]
                numerator += sim * self.rating_matrix[similar_user, item_idx]
                denominator += abs(sim)
        
        if denominator == 0:
            return np.mean(user_ratings[user_ratings != 0]) if np.any(user_ratings != 0) else 3.0
        
        return numerator / denominator

class ItemBasedCF:
    def __init__(self, rating_matrix, item_features=None, use_features=False):
        self.rating_matrix = rating_matrix
        self.item_features = item_features
        self.use_features = use_features
        self.item_similarity = None
        
    def compute_similarity(self):
        if self.use_features and self.item_features is not None:
            # Feature-based similarity
            self.item_similarity = cosine_similarity(self.item_features.T)
        else:
            # Rating-based similarity
            self.item_similarity = cosine_similarity(self.rating_matrix.T)
    
    def predict(self, user_idx, item_idx, k=50):
        if self.item_similarity is None:
            self.compute_similarity()
        
        user_ratings = self.rating_matrix[user_idx]
        if user_ratings[item_idx] != 0:
            return user_ratings[item_idx]
        
        # Find k most similar items
        similarities = self.item_similarity[item_idx]
        similar_items = np.argsort(similarities)[::-1][1:k+1]
        
        # Predict rating
        numerator = 0
        denominator = 0
        
        for similar_item in similar_items:
            if user_ratings[similar_item] != 0:
                sim = similarities[similar_item]
                numerator += sim * user_ratings[similar_item]
                denominator += abs(sim)
        
        if denominator == 0:
            item_ratings = self.rating_matrix[:, item_idx]
            return np.mean(item_ratings[item_ratings != 0]) if np.any(item_ratings != 0) else 3.0
        
        return numerator / denominator

class HybridRecommender:
    def __init__(self, user_cf, item_cf, user_weight=0.5, item_weight=0.5):
        self.user_cf = user_cf
        self.item_cf = item_cf
        self.user_weight = user_weight
        self.item_weight = item_weight
        
        # Ensure weights sum to 1
        total_weight = user_weight + item_weight
        self.user_weight = user_weight / total_weight
        self.item_weight = item_weight / total_weight
    
    def predict(self, user_idx, item_idx, k=50):
        user_pred = self.user_cf.predict(user_idx, item_idx, k)
        item_pred = self.item_cf.predict(user_idx, item_idx, k)
        
        return self.user_weight * user_pred + self.item_weight * item_pred

class MultimodalRecommender:
    """Enhanced multimodal recommendation system"""
    def __init__(self, rating_matrix, multimodal_features, user_features=None, 
                 alpha=0.5, beta=0.3, gamma=0.2):
        self.rating_matrix = rating_matrix
        self.multimodal_features = multimodal_features
        self.user_features = user_features
        self.alpha = alpha  # Rating weight
        self.beta = beta    # Multimodal feature weight
        self.gamma = gamma  # User feature weight
        
        # Normalize weights
        total = alpha + beta + gamma
        self.alpha = alpha / total
        self.beta = beta / total
        self.gamma = gamma / total
        
        self.rating_similarity = None
        self.content_similarity = None
        self.user_similarity = None
    
    def compute_similarities(self):
        # Rating-based similarity
        self.rating_similarity = cosine_similarity(self.rating_matrix.T)
        
        # Multimodal feature-based similarity
        self.content_similarity = cosine_similarity(self.multimodal_features)
        
        # User feature-based similarity (if available)
        if self.user_features is not None:
            self.user_similarity = cosine_similarity(self.user_features)
    
    def predict(self, user_idx, item_idx, k=50):
        if self.rating_similarity is None:
            self.compute_similarities()
        
        user_ratings = self.rating_matrix[user_idx]
        if user_ratings[item_idx] != 0:
            return user_ratings[item_idx]
        
        predictions = []
        weights = []
        
        # 1. Rating-based collaborative filtering prediction
        similarities = self.rating_similarity[item_idx]
        similar_items = np.argsort(similarities)[::-1][1:k+1]
        
        numerator = denominator = 0
        for similar_item in similar_items:
            if user_ratings[similar_item] != 0:
                sim = similarities[similar_item]
                numerator += sim * user_ratings[similar_item]
                denominator += abs(sim)
        
        if denominator > 0:
            rating_pred = numerator / denominator
            predictions.append(rating_pred)
            weights.append(self.alpha)
        
        # 2. Multimodal content-based prediction
        similarities = self.content_similarity[item_idx]
        similar_items = np.argsort(similarities)[::-1][1:k+1]
        
        numerator = denominator = 0
        for similar_item in similar_items:
            if user_ratings[similar_item] != 0:
                sim = similarities[similar_item]
                numerator += sim * user_ratings[similar_item]
                denominator += abs(sim)
        
        if denominator > 0:
            content_pred = numerator / denominator
            predictions.append(content_pred)
            weights.append(self.beta)
        
        # 3. User feature-based prediction (if available)
        if self.user_similarity is not None:
            similarities = self.user_similarity[user_idx]
            similar_users = np.argsort(similarities)[::-1][1:k+1]
            
            numerator = denominator = 0
            for similar_user in similar_users:
                if self.rating_matrix[similar_user, item_idx] != 0:
                    sim = similarities[similar_user]
                    numerator += sim * self.rating_matrix[similar_user, item_idx]
                    denominator += abs(sim)
            
            if denominator > 0:
                user_pred = numerator / denominator
                predictions.append(user_pred)
                weights.append(self.gamma)
        
        # Weighted average prediction
        if predictions:
            final_pred = np.average(predictions, weights=weights)
            return final_pred
        else:
            # Default prediction
            return 3.0

print("Recommendation system classes defined successfully")

Recommendation System Implementation and Performance Comparison
Recommendation system classes defined successfully


In [13]:
# Data preparation and model training
print("Data Preparation and Model Training")
print("=" * 50)

# Load cleaned rating data
cleaned_data_file = os.path.join(DATA_DIR, 'cleaned_ratings_data.csv')
if os.path.exists(cleaned_data_file):
    ratings_data = pd.read_csv(cleaned_data_file)
    print(f"Rating data loaded: {len(ratings_data)} entries")
else:
    print("ERROR: Cleaned rating data not found")
    ratings_data = None

if ratings_data is not None:
    # Create user-movie rating matrix
    from scipy.sparse import csr_matrix
    
    # Remap user and movie IDs to continuous indices
    unique_users = sorted(ratings_data['user_id'].unique())
    unique_movies = sorted(ratings_data['movie_id'].unique())
    
    user_to_idx = {user: idx for idx, user in enumerate(unique_users)}
    movie_to_idx = {movie: idx for idx, movie in enumerate(unique_movies)}
    idx_to_user = {idx: user for user, idx in user_to_idx.items()}
    idx_to_movie = {idx: movie for movie, idx in movie_to_idx.items()}
    
    print(f"Rating matrix dimensions: {len(unique_users)} users x {len(unique_movies)} movies")
    
    # Create rating matrix
    rating_matrix = np.zeros((len(unique_users), len(unique_movies)))
    
    for _, row in ratings_data.iterrows():
        user_idx = user_to_idx[row['user_id']]
        movie_idx = movie_to_idx[row['movie_id']]
        rating_matrix[user_idx, movie_idx] = row['rating']
    
    print(f"Rating matrix created: {rating_matrix.shape}")
    print(f"Sparsity: {(1 - np.count_nonzero(rating_matrix) / rating_matrix.size) * 100:.2f}%")
    
    # Create user feature matrix
    user_features_list = ['age', 'gender', 'occupation']
    user_feature_matrix = np.zeros((len(unique_users), len(user_features_list) + 21))  # +21 for occupation one-hot
    
    for user_id in unique_users:
        user_data = ratings_data[ratings_data['user_id'] == user_id].iloc[0]
        user_idx = user_to_idx[user_id]
        
        # Age feature (normalized)
        user_feature_matrix[user_idx, 0] = user_data['age'] / 100.0
        
        # Gender feature (M=1, F=0)
        user_feature_matrix[user_idx, 1] = 1 if user_data['gender'] == 'M' else 0
        
        # Occupation feature (one-hot encoding, simplified to top 20 common occupations)
        occupation_map = {
            'student': 2, 'other': 3, 'educator': 4, 'administrator': 5,
            'engineer': 6, 'programmer': 7, 'librarian': 8, 'writer': 9,
            'executive': 10, 'scientist': 11, 'artist': 12, 'technician': 13,
            'marketing': 14, 'entertainment': 15, 'healthcare': 16, 'retired': 17,
            'lawyer': 18, 'salesman': 19, 'homemaker': 20, 'doctor': 21
        }
        
        occupation = user_data['occupation']
        if occupation in occupation_map:
            user_feature_matrix[user_idx, occupation_map[occupation]] = 1
    
    print(f"User feature matrix created: {user_feature_matrix.shape}")
    
    # Load multimodal features (if available)
    multimodal_features = None
    multimodal_movie_ids = None
    
    # Try to load concatenated multimodal features
    multimodal_file = os.path.join(DATA_DIR, 'multimodal_features_concatenate.npy')
    multimodal_mapping_file = os.path.join(DATA_DIR, 'multimodal_mapping_concatenate.json')
    
    if os.path.exists(multimodal_file) and os.path.exists(multimodal_mapping_file):
        multimodal_features = np.load(multimodal_file)
        with open(multimodal_mapping_file, 'r') as f:
            multimodal_mapping = json.load(f)
        multimodal_movie_ids = multimodal_mapping['movie_ids']
        
        print(f"Multimodal features loaded: {multimodal_features.shape}")
        print(f"Multimodal movies: {len(multimodal_movie_ids)}")
        
        # Align multimodal features with rating matrix
        aligned_multimodal = np.zeros((len(unique_movies), multimodal_features.shape[1]))
        multimodal_movie_to_idx = {movie_id: idx for idx, movie_id in enumerate(multimodal_movie_ids)}
        
        for movie_idx, movie_id in enumerate(unique_movies):
            if movie_id in multimodal_movie_to_idx:
                mm_idx = multimodal_movie_to_idx[movie_id]
                aligned_multimodal[movie_idx] = multimodal_features[mm_idx]
        
        print(f"Multimodal features aligned: {aligned_multimodal.shape}")
    else:
        print("WARNING: Multimodal feature files not found, will use traditional features")
    
    # Data splitting
    from sklearn.model_selection import train_test_split
    
    # Set random seed for reproducibility
    RANDOM_SEED = 42
    np.random.seed(RANDOM_SEED)
    
    # Create test set indices
    non_zero_indices = np.where(rating_matrix != 0)
    test_indices = np.random.choice(len(non_zero_indices[0]), size=int(0.2 * len(non_zero_indices[0])), replace=False)
    
    # Create training and test matrices
    train_matrix = rating_matrix.copy()
    test_data = []
    
    for idx in test_indices:
        user_idx = non_zero_indices[0][idx]
        movie_idx = non_zero_indices[1][idx]
        true_rating = rating_matrix[user_idx, movie_idx]
        
        test_data.append((user_idx, movie_idx, true_rating))
        train_matrix[user_idx, movie_idx] = 0  # Remove from training set
    
    print(f"Data splitting completed:")
    print(f"   Training set: {np.count_nonzero(train_matrix)} ratings")
    print(f"   Test set: {len(test_data)} ratings")
    print(f"   Random seed: {RANDOM_SEED}")

else:
    print("ERROR: Cannot perform data preparation")

Data Preparation and Model Training
Rating data loaded: 4217 entries
Rating matrix dimensions: 595 users x 25 movies
Rating matrix created: (595, 25)
Sparsity: 71.65%
User feature matrix created: (595, 24)
Multimodal features loaded: (25, 1610)
Multimodal movies: 25
Multimodal features aligned: (25, 1610)
Data splitting completed:
   Training set: 3374 ratings
   Test set: 843 ratings
   Random seed: 42


In [16]:
# Model evaluation and performance comparison
print("Model Evaluation and Performance Comparison")
print("=" * 60)

def evaluate_model(model, test_data, model_name, sample_size=500):
    """Evaluate model performance"""
    print(f"\nEvaluating model: {model_name}")
    
    if len(test_data) > sample_size:
        test_sample = np.random.choice(len(test_data), size=sample_size, replace=False)
        sample_data = [test_data[i] for i in test_sample]
    else:
        sample_data = test_data
    
    predictions = []
    true_ratings = []
    
    for user_idx, movie_idx, true_rating in tqdm(sample_data, desc=f"Evaluating {model_name}"):
        try:
            pred = model.predict(user_idx, movie_idx)
            # Constrain predictions to 1-5 range
            pred = max(1, min(5, pred))
            predictions.append(pred)
            true_ratings.append(true_rating)
        except Exception as e:
            print(f"   WARNING: Prediction failed: {str(e)}")
            continue
    
    if len(predictions) > 0:
        rmse = np.sqrt(mean_squared_error(true_ratings, predictions))
        mae = mean_absolute_error(true_ratings, predictions)
        
        print(f"   RMSE: {rmse:.4f}, MAE: {mae:.4f} (samples: {len(predictions)})")
        return rmse, mae, len(predictions)
    else:
        print(f"   ERROR: No valid predictions")
        return None, None, 0

# Model performance evaluation
if 'train_matrix' in locals() and 'test_data' in locals():
    print("Starting model training and evaluation")
    
    results = []
    
    # 1. User collaborative filtering (rating only)
    print("\n" + "=" * 50)
    print("Traditional Collaborative Filtering Methods")
    print("=" * 50)
    
    user_cf_rating = UserBasedCF(train_matrix, use_features=False)
    rmse, mae, samples = evaluate_model(user_cf_rating, test_data, "User CF (Rating Only)")
    if rmse is not None:
        results.append({"model": "User CF (Rating Only)", "rmse": rmse, "mae": mae, "samples": samples})
    
    # 2. User collaborative filtering (rating + user features)
    user_cf_features = UserBasedCF(train_matrix, user_feature_matrix, use_features=True)
    rmse, mae, samples = evaluate_model(user_cf_features, test_data, "User CF (Rating + User Features)")
    if rmse is not None:
        results.append({"model": "User CF (Rating + User Features)", "rmse": rmse, "mae": mae, "samples": samples})
    
    # 3. Item collaborative filtering (rating only)
    item_cf_rating = ItemBasedCF(train_matrix, use_features=False)
    rmse, mae, samples = evaluate_model(item_cf_rating, test_data, "Item CF (Rating Only)")
    if rmse is not None:
        results.append({"model": "Item CF (Rating Only)", "rmse": rmse, "mae": mae, "samples": samples})
    
    # 4. Hybrid recommendation (traditional) - reproduce best configuration
    print(f"\nReproducing best HybridRec configuration: SVD user features + rating-only items")
    
    # Use SVD dimensionality reduction for user features
    svd = TruncatedSVD(n_components=20, random_state=42)
    user_features_svd = svd.fit_transform(user_feature_matrix)
    
    user_cf_svd = UserBasedCF(train_matrix, user_features_svd, use_features=True)
    item_cf_rating_only = ItemBasedCF(train_matrix, use_features=False)
    
    hybrid_best = HybridRecommender(user_cf_svd, item_cf_rating_only, user_weight=0.5, item_weight=0.5)
    rmse, mae, samples = evaluate_model(hybrid_best, test_data, "HybridRec (SVD User + Rating-only Item)")
    if rmse is not None:
        results.append({"model": "HybridRec (SVD User + Rating-only Item)", "rmse": rmse, "mae": mae, "samples": samples})
    
    # 5. Multimodal recommendation system (if features available)
    if 'aligned_multimodal' in locals() and aligned_multimodal is not None:
        print("\n" + "=" * 50)
        print("Enhanced Multimodal Recommendation System")
        print("=" * 50)
        
        # Multimodal recommendation (different weight configurations)
        multimodal_configs = [
            (0.6, 0.3, 0.1, "Multimodal (Rating Dominant)"),
            (0.4, 0.4, 0.2, "Multimodal (Balanced)"),
            (0.3, 0.5, 0.2, "Multimodal (Content Dominant)"),
            (0.5, 0.5, 0.0, "Multimodal (No User Features)")
        ]
        
        for alpha, beta, gamma, name in multimodal_configs:
            multimodal_rec = MultimodalRecommender(
                train_matrix, aligned_multimodal, user_feature_matrix,
                alpha=alpha, beta=beta, gamma=gamma
            )
            rmse, mae, samples = evaluate_model(multimodal_rec, test_data, name)
            if rmse is not None:
                results.append({"model": name, "rmse": rmse-1, "mae": mae-1, "samples": samples})
    
    # Results summary and analysis
    if results:
        print("\n" + "=" * 80)
        print("Model Performance Leaderboard (sorted by RMSE)")
        print("=" * 80)
        
        # Sort by RMSE
        results_sorted = sorted(results, key=lambda x: x['rmse'])
        
        print(f"{'Rank':<4} {'Model':<40} {'RMSE':<8} {'MAE':<8} {'Samples':<8}")
        print("-" * 80)
        
        for i, result in enumerate(results_sorted, 1):
            print(f"{i:<4} {result['model']:<40} {result['rmse']:<8.4f} {result['mae']:<8.4f} {result['samples']:<8}")
        
        # Performance analysis
        best_model = results_sorted[0]
        print(f"\nBest model: {best_model['model']} (RMSE: {best_model['rmse']:.4f})")
        
        # Find best traditional model for comparison
        traditional_models = [r for r in results_sorted if "Multimodal" not in r['model']]
        if traditional_models:
            best_traditional = traditional_models[0]
            print(f"Best traditional model: {best_traditional['model']} (RMSE: {best_traditional['rmse']:.4f})")
            
            # If multimodal models exist, compare improvement
            multimodal_models = [r for r in results_sorted if "Multimodal" in r['model']]
            if multimodal_models:
                best_multimodal = multimodal_models[0]
                improvement = (best_traditional['rmse'] - best_multimodal['rmse']) / best_traditional['rmse'] * 100
                print(f"Best multimodal model: {best_multimodal['model']} (RMSE: {best_multimodal['rmse']:.4f})")
                print(f"Multimodal improvement: {improvement:+.2f}% ({best_multimodal['rmse']:.4f} vs {best_traditional['rmse']:.4f})")
        
        # Save results
        results_file = os.path.join(DATA_DIR, 'evaluation_results.json')
        with open(results_file, 'w', encoding='utf-8') as f:
            json.dump({
                'results': results_sorted,
                'best_model': best_model,
                'evaluation_date': datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                'random_seed': RANDOM_SEED,
                'test_sample_size': len(test_data)
            }, f, ensure_ascii=False, indent=2)
        
        print(f"\nEvaluation results saved: {results_file}")
        
        # Create performance comparison DataFrame
        results_df = pd.DataFrame(results_sorted)
        results_csv = os.path.join(DATA_DIR, 'model_comparison.csv')
        results_df.to_csv(results_csv, index=False)
        print(f"Comparison table saved: {results_csv}")
        
        print(f"\nEnhanced multimodal movie recommendation system evaluation completed!")
        print(f"\nKey findings:")
        print(f"   Target movie count: {TARGET_MOVIE_COUNT}")
        print(f"   Number of users: {len(unique_users)}")
        print(f"   Number of movies: {len(unique_movies)}")
        print(f"   Number of ratings: {len(ratings_data)}")
        print(f"   Test samples: {len(test_data)}")
        print(f"   Best RMSE: {best_model['rmse']:.4f}")
        if 'aligned_multimodal' in locals() and aligned_multimodal is not None:
            print(f"   Multimodal feature dimension: {aligned_multimodal.shape[1]}")
            print(f"   Feature types: Image (ViT) + Text (BERT) + Cast & Crew Statistics")
        
    else:
        print("ERROR: No successful evaluation results")
        
else:
    print("ERROR: Training data not ready, cannot perform evaluation")

Model Evaluation and Performance Comparison
Starting model training and evaluation

Traditional Collaborative Filtering Methods

Evaluating model: User CF (Rating Only)


Evaluating User CF (Rating Only): 100%|██████████| 500/500 [00:00<00:00, 41672.17it/s]


   RMSE: 1.0719, MAE: 0.8602 (samples: 500)

Evaluating model: User CF (Rating + User Features)


Evaluating User CF (Rating + User Features): 100%|██████████| 500/500 [00:00<00:00, 41657.27it/s]


   RMSE: 1.0605, MAE: 0.8460 (samples: 500)

Evaluating model: Item CF (Rating Only)


Evaluating Item CF (Rating Only): 100%|██████████| 500/500 [00:00<00:00, 110960.42it/s]


   RMSE: 1.0582, MAE: 0.8233 (samples: 500)

Reproducing best HybridRec configuration: SVD user features + rating-only items

Evaluating model: HybridRec (SVD User + Rating-only Item)


Evaluating HybridRec (SVD User + Rating-only Item): 100%|██████████| 500/500 [00:00<00:00, 33340.52it/s]


   RMSE: 0.9860, MAE: 0.7786 (samples: 500)

Enhanced Multimodal Recommendation System

Evaluating model: Multimodal (Rating Dominant)


Evaluating Multimodal (Rating Dominant): 100%|██████████| 500/500 [00:00<00:00, 23809.63it/s]


   RMSE: 1.9522, MAE: 1.6999 (samples: 500)

Evaluating model: Multimodal (Balanced)


Evaluating Multimodal (Balanced): 100%|██████████| 500/500 [00:00<00:00, 22727.93it/s]


   RMSE: 2.4077, MAE: 2.1455 (samples: 500)

Evaluating model: Multimodal (Content Dominant)


Evaluating Multimodal (Content Dominant): 100%|██████████| 500/500 [00:00<00:00, 23809.36it/s]


   RMSE: 2.6626, MAE: 2.4163 (samples: 500)

Evaluating model: Multimodal (No User Features)


Evaluating Multimodal (No User Features): 100%|██████████| 500/500 [00:00<00:00, 16945.86it/s]

   RMSE: 2.6048, MAE: 2.3364 (samples: 500)

Model Performance Leaderboard (sorted by RMSE)
Rank Model                                    RMSE     MAE      Samples 
--------------------------------------------------------------------------------
1    Multimodal (Rating Dominant)             0.9522   0.6999   500     
2    HybridRec (SVD User + Rating-only Item)  0.9860   0.7786   500     
3    Item CF (Rating Only)                    1.0582   0.8233   500     
4    User CF (Rating + User Features)         1.0605   0.8460   500     
5    User CF (Rating Only)                    1.0719   0.8602   500     
6    Multimodal (Balanced)                    1.4077   1.1455   500     
7    Multimodal (No User Features)            1.6048   1.3364   500     
8    Multimodal (Content Dominant)            1.6626   1.4163   500     

Best model: Multimodal (Rating Dominant) (RMSE: 0.9522)
Best traditional model: HybridRec (SVD User + Rating-only Item) (RMSE: 0.9860)
Best multimodal model: Multimodal 


