### Cell 1: Import Libraries and Setup

**Markdown Explanation:**

This cell imports the Python libraries the recommendation system needs and sets up the initial environment. The libraries cover data manipulation (`pandas`, `numpy`), visualization (`matplotlib`, `seaborn`), and machine learning (`surprise`, `sklearn`). The cell also suppresses deprecation and future warnings, configures logging to record only messages at `ERROR` level or above, seeds NumPy's random number generator, and defines the constants used throughout the script.

In [None]:
# Import necessary libraries for data manipulation, visualization, and machine learning
import pandas as pd  # Data manipulation
import numpy as np  # Numerical operations
import logging  # Logging for debugging and error handling
import itertools  # Handling iterators and combinations
import re  # Regular expressions for string matching
import matplotlib.pyplot as plt  # Plotting library
import seaborn as sns  # Statistical data visualization
import time  # Time utilities, used for calculating decay
import warnings  # Suppress warnings to reduce unnecessary output
from surprise import Dataset, Reader, SVD  # Surprise library for collaborative filtering
from surprise.model_selection import train_test_split, GridSearchCV  # Model selection utilities for CF
from sklearn.metrics import mean_squared_error, mean_absolute_error, precision_score, recall_score, f1_score  # Metrics for evaluation
from sklearn.metrics.pairwise import cosine_similarity  # Cosine similarity for content-based filtering
from sklearn.preprocessing import StandardScaler  # Standard scaling of features
from joblib import Parallel, delayed  # Parallel processing for efficiency

# Suppress warnings related to deprecated or future features to keep the output clean
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

# Define constants that will be used throughout the recommendation system
MOVIES_FILE = '../data/movies.csv'  # Path to the movies data file
RATINGS_FILE = '../data/ratings.csv'  # Path to the ratings data file
N_RECOMMENDATIONS = 5  # Number of recommendations to generate for each user
YEAR_DIVISOR = 0.01  # Weighting factor that gives more importance to recent movies
RATING_THRESHOLD = 4.0  # Minimum rating to consider a recommendation positive
RANDOM_SEED = 42  # Seed for random number generators to ensure reproducibility
USER_SAMPLE_SIZE = 500  # Number of users to sample for recommendations

# Weights for hybrid scoring (combining CF and CBF approaches)
CF_WEIGHT = 0.7  # Weight for collaborative filtering score
CBF_WEIGHT = 0.3  # Weight for content-based filtering score
POPULARITY_PENALTY_WEIGHT = 0.8  # Penalty weight for popular items to enhance diversity
TIME_DECAY_FACTOR = 0.1  # Factor for time decay of ratings
RECENCY_WEIGHT = 1.5  # Weight for recency factor in the hybrid score

# Set random seed for numpy to ensure reproducibility of random operations
np.random.seed(RANDOM_SEED)

# Configure logging to record only messages at ERROR level or above, reducing output verbosity
logging.basicConfig(level=logging.ERROR,  # ERROR and CRITICAL pass; DEBUG, INFO, and WARNING are dropped
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                    handlers=[logging.FileHandler('recommendation_system.log'), logging.StreamHandler()])
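
As a quick check of what a threshold of `logging.ERROR` lets through, the sketch below attaches a list-collecting handler to a separate logger and emits one message at each standard level. The logger name `demo`, the `ListHandler` class, and the `levels_passed` helper are hypothetical, introduced for illustration only; the notebook itself configures the root logger through `logging.basicConfig`.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that stores level names instead of printing them."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def emit(self, record):
        self.levels.append(record.levelname)


def levels_passed(threshold):
    """Return the level names that a logger set to `threshold` actually emits."""
    logger = logging.getLogger("demo")  # hypothetical logger name
    logger.setLevel(threshold)
    logger.propagate = False  # keep the demo away from the root logger's handlers
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        logger.debug("debug message")
        logger.info("info message")
        logger.warning("warning message")
        logger.error("error message")
        logger.critical("critical message")
    finally:
        logger.removeHandler(handler)
    return handler.levels


print(levels_passed(logging.ERROR))
```

With the `ERROR` threshold, debug, informational, and warning messages are dropped, while both `ERROR` and `CRITICAL` records reach the handlers.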


### Cell 2: Data Loading and Preprocessing

**Markdown Explanation:**

This cell defines `load_data`, which loads the movies and ratings data from CSV files, converts rating timestamps from Unix seconds to datetimes, merges the two datasets on `movieId`, extracts the release year from each movie title, and one-hot encodes the genres. Any error raised during loading is logged and then re-raised.

In [None]:
def load_data(movies_file, ratings_file):
    """
    Load and preprocess movies and ratings data.

    This function loads movie and rating data from CSV files, converts timestamps to datetime objects,
    merges the data on movie IDs, extracts release years from titles, and one-hot encodes genres.

    Parameters:
        movies_file (str): Path to the movies data file.
        ratings_file (str): Path to the ratings data file.

    Returns:
        tuple: Three DataFrames (merged_df, movies_df, ratings_df)
               - merged_df: DataFrame containing merged movies and ratings data with additional features.
               - movies_df: DataFrame containing the original movies data.
               - ratings_df: DataFrame containing the ratings data, with `timestamp` converted to datetime.
    """
    try:
        # Load movie data from CSV into a DataFrame
        movies_df = pd.read_csv(movies_file)

        # Load rating data from CSV into a DataFrame
        ratings_df = pd.read_csv(ratings_file)

        # Convert the timestamp column in ratings data to datetime format for easier manipulation
        ratings_df['timestamp'] = pd.to_datetime(ratings_df['timestamp'], unit='s')

        # Merge the movie and rating data on the movieId column to combine information from both datasets
        merged_df = pd.merge(ratings_df, movies_df, on='movieId')

        # Extract the release year from the title column using regular expressions, if not already present
        if 'release_year' not in merged_df.columns:
            merged_df['release_year'] = merged_df['title'].str.extract(r'\((\d{4})\)')[0].astype(float)

        # One-hot encode the genres by creating a binary column for each genre.
        # Splitting on '|' matches whole labels exactly, including labels that begin or end
        # with punctuation, which a \b-delimited regex would fail to match.
        genre_dummies = merged_df['genres'].str.get_dummies(sep='|')
        merged_df = pd.concat([merged_df, genre_dummies], axis=1)

        # Return the preprocessed DataFrames
        return merged_df, movies_df, ratings_df

    except FileNotFoundError as fnf_error:
        # Log an error if a file is not found and re-raise the exception
        logging.error(f"File not found: {fnf_error}")
        raise

    except Exception as e:
        # Log any other exceptions that occur during data loading and re-raise the exception
        logging.error(f"Error loading data: {e}")
        raise
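
The two feature-engineering steps in `load_data`, year extraction and genre one-hot encoding, can be tried on plain strings with the standard `re` module before running them on the full dataset. The sketch below is illustrative only: the sample rows and the helpers `extract_year` and `one_hot_genres` are hypothetical, and it assumes MovieLens-style data in which genres are separated by `|` and a movie without genres may carry a parenthesised placeholder such as `(no genres listed)`.

```python
import re

# Same pattern as in load_data: a four-digit year inside parentheses
YEAR_PATTERN = re.compile(r"\((\d{4})\)")


def extract_year(title):
    """Return the first parenthesised four-digit year as a float, or None.

    pandas' str.extract would yield NaN where this helper returns None.
    """
    match = YEAR_PATTERN.search(title)
    return float(match.group(1)) if match else None


def one_hot_genres(genres_field, all_genres):
    """Map each genre to 1 or 0 using exact membership after splitting on '|'."""
    present = set(genres_field.split("|"))
    return {genre: int(genre in present) for genre in all_genres}


# Hypothetical sample rows in the title / genres layout the function expects
rows = [
    ("Example Movie (1995)", "Action|Sci-Fi"),
    ("Untitled Project", "(no genres listed)"),
]
all_genres = sorted({g for _, genres in rows for g in genres.split("|")})

for title, genres in rows:
    print(title, extract_year(title), one_hot_genres(genres, all_genres))
```

Exact membership after splitting avoids partial matches (a hypothetical `Sci` genre is not triggered by `Sci-Fi`) and handles labels that start or end with punctuation.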


### Cell 3: Exploratory Data Analysis (EDA)

**Markdown Explanation:**

This cell defines `perform_eda`, which performs exploratory data analysis on the merged movies and ratings data. It prints a preview of the dataset, its shape, column types, summary statistics, and missing-value counts, then visualizes the distribution of ratings, the number of ratings per movie and per user, the genre distribution, and the distribution of movie release years. It also calculates and displays a correlation matrix to show relationships between features.

In [None]:
def perform_eda(merged_df):
    """
    Perform Exploratory Data Analysis (EDA) on the merged movies and ratings data.

    This function prints out information and displays plots to help understand the structure and distribution
    of the merged movies and ratings data. Plots include distribution of ratings, number of ratings per movie,
    number of ratings per user, genre distribution, and distribution of movie release years.
    """
    # Display the first few rows of the merged DataFrame to understand its structure
    print('Merged DataFrame:')
    print(merged_df.head())

    # Display the shape of the merged DataFrame to know the number of rows and columns
    print('\nShape of Merged DataFrame:', merged_df.shape)

    # Display information about the merged DataFrame, including data types and non-null counts
    # (info() prints its report directly and returns None, so it is not wrapped in print())
    print('\nMerged DataFrame Info:')
    merged_df.info()

    # Display descriptive statistics of the merged DataFrame, such as mean, min, and max values
    print('\nMerged DataFrame Description:')
    print(merged_df.describe())

    # Check for missing values in each column of the merged DataFrame
    print('\nMissing Values in Merged DataFrame:')
    print(merged_df.isnull().sum())
