# Table of contents
1. [Introduction](#intro)
2. [Libraries](#libraries)
3. [Strategies](#strategies)
4. [Evaluation](#evaluation)



<h1 id = "intro"> 1. Introduction </h1>

This notebook tests several strategies for imputing missing values in the numerical columns and evaluates which one is the most appropriate to apply in the preprocessing stage.

<h1 id = "libraries"> 2. Libraries </h1>

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import seaborn as sns
from scipy.stats import sem

from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


In [2]:
raw_data = pd.read_csv('../Data/retyped_data.csv')

<div id = "strategies"> <h1>3. Strategies </h1> </div>

- As concluded in the data exploration section, the missing data in all numerical columns is **MAR** (missing at random), so we need an appropriate approach to fill it in.
- We therefore test three strategies, a **similarity-based strategy**, **KNN Imputer**, and **Decision Tree Regressor**, and compare their metrics to choose the most suitable one.
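
For reference, the three metrics used to compare the strategies (MAE, MSE, and R²) can be computed by hand. The sketch below uses made-up toy values, not values from the dataset, and checks plain NumPy formulas against the scikit-learn functions imported above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only (not taken from the dataset)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(errors))
# MSE: mean of the squared errors
mse_manual = np.mean(errors ** 2)
# R2: 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Lower MAE and MSE are better, while R² is better the closer it gets to 1.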

In [3]:
# Create lists to store the metrics of each model
similarity_model = []
knn_model = []
decision_tree_model = []

<div id = "similarity"> <h2> Similarity based strategy </h2> </div>

- **Similarity-Based Imputation:** Missing values in the ratings matrix are filled by leveraging similarity scores, which are calculated based on the inverse of the mean absolute difference between users and critics' ratings. This ensures that missing values are imputed based on the preferences of users with similar rating patterns.
- **Handling Missing Values:** The use of np.nanmean and logical operations effectively handles NaN values during computation, ensuring accurate similarity and weight calculations. Furthermore, adding a small epsilon (0.001) prevents division by zero errors, enhancing the robustness of the similarity computation.
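
To make the two points above concrete, here is a minimal sketch on a tiny made-up matrix (the values are illustrative, not from the dataset). It computes the inverse mean-absolute-difference similarity with `np.nanmean`, then fills the missing entry of the first row with a similarity-weighted average of the rows that do have a value in that column.

```python
import numpy as np

# Toy score matrix: rows are the items being compared, columns are score sources.
# Values are made up for illustration.
ratings = np.array([
    [80.0, 70.0, np.nan],
    [82.0, 72.0, 90.0],
    [20.0, 30.0, 40.0],
])

eps = 0.001  # small constant that avoids division by zero

# Pairwise absolute differences, shape (3, 3, 3)
abs_diff = np.abs(ratings[:, None, :] - ratings[None, :, :])

# Mean absolute difference over the columns both rows have a value for
mean_diff = np.nanmean(abs_diff, axis=2)

# Similarity is the inverse of the mean absolute difference
similarities = 1 / (mean_diff + eps)

# Weights for row 0: similarity to every row, zeroed where that row's value is missing
observed = ~np.isnan(ratings)
weights = observed * similarities[0][:, None]
weights = weights / weights.sum(axis=0, keepdims=True)

# Similarity-weighted average per column; NaN entries are skipped by nansum
filled_row0 = np.nansum(ratings * weights, axis=0)

print(similarities[0])  # row 0 is far more similar to row 1 than to row 2
print(filled_row0[2])   # imputed value for the missing entry, close to row 1's 90
```

Because row 0 closely matches row 1 on the shared columns, the imputed value lands near row 1's score rather than row 2's.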

In [4]:

batch_size = 32
start = 0
end = 32

# Define the function to calculate similarities
def calculate_similarities(ratings, batch_start, batch_end):
    # Select the batch of users
    batch_ratings = ratings[batch_start:batch_end]
    
    # Calculate the absolute difference between the batch and all users
    abs_diff = np.abs(ratings - batch_ratings.reshape(batch_end - batch_start, 1, ratings.shape[1]))
    
    # Calculate the mean absolute difference across movies, ignoring NaN values
    mean_diff = np.nanmean(abs_diff, axis=2)
    
    # Compute similarity as the inverse of the mean absolute difference
    similarities = 1 / (mean_diff + 0.001)  # Adding a small epsilon to avoid division by zero
    similarities[np.isnan(similarities)] = 0
    return similarities

def fill_missing(data, batch_size = 32):
    n_movies = data.shape[0]
    filled_ratings = np.empty_like(data)
    num_batches = int(np.ceil(n_movies / batch_size))

    for i in range(num_batches):
        start = i * batch_size
        end = min((i + 1) * batch_size, n_movies)

        similarities = calculate_similarities(data, start, end)
        
        weights = ~np.isnan(data) * similarities.reshape(end - start, -1, 1)
        weights /= weights.sum(axis=1, keepdims=True)

        filled_ratings[start:end] = np.nansum(data * weights, axis=1)

    return filled_ratings



+ Evaluate each column

In [5]:
columns = ['Tomatoes CriticScore', 'Tomatoes UserScore', 'Metascore', 'Meta UserScore']

test_size = 0.3

#Evaluate each column separately
for test_col in columns:
    #Get a copy of data but remove all null value for testing
    raw_data_copy = raw_data.copy()
    raw_data_copy.dropna(inplace=True)
    raw_data_copy.reset_index(drop=True, inplace=True)

    #Get sample with size of 30%
    test_rows = raw_data_copy.sample(frac=test_size, random_state=42).index

    #Get y true value from the dataset
    y_test = raw_data_copy.loc[test_rows, test_col].copy()

    #Set the test values to NaN so they can be imputed
    raw_data_copy.loc[test_rows, test_col] = np.nan

    #Perform imputing missing value using similarity
    tmp_data = raw_data_copy.copy()
    tmp_data['id'] = tmp_data.index
    tmp_data['Meta UserScore'] = tmp_data['Meta UserScore'] * 10

    tmp_data = tmp_data[['id','Tomatoes CriticScore', 'Tomatoes UserScore', 'Metascore', 'Meta UserScore']].to_numpy()

    filled_ratings = fill_missing(tmp_data)
    filled_nanvals = filled_ratings[np.isnan(tmp_data)]

    tmp_data[np.isnan(tmp_data)] = filled_nanvals

    filled_df = pd.DataFrame(
        filled_ratings[:, 1:],
        columns=['Tomatoes CriticScore', 'Tomatoes UserScore', 'Metascore', 'Meta UserScore']
    )

    filled_df['Meta UserScore'] /= 10
    tmp_data_2 = raw_data_copy.copy()

    for col in filled_df.columns:
        #Assign the result back instead of calling fillna(inplace=True) on a column selection
        tmp_data_2[col] = tmp_data_2[col].fillna(filled_df[col])

    #Get y predicted (column after imputing)
    y_pred = tmp_data_2.loc[test_rows, test_col]

    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Store the metrics of each model
    similarity_model.append({
        'Column': test_col,
        'MAE': mae,
        'MSE': mse,
        'R2': r2
    })

<div id = "knn"> <h2> KNN imputer strategy </h2> </div>

- Using 20 neighbors with uniform weighting to predict missing values.
- The imputation process fills the NaN values in the masked test column based on the nearest neighbors' values in the dataset.
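
As a small illustration of how uniform weighting works, the sketch below runs `KNNImputer` on a tiny made-up matrix with 2 neighbors (chosen only because the toy data has 4 rows; the notebook itself uses 20). Each missing entry is replaced by the plain average of that column over the nearest rows that have a value there.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix for illustration only (not from the dataset)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# n_neighbors=2 is an assumption for this toy example
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed = imputer.fit_transform(X)

print(imputed)
# Row 0, column 2: the two nearest rows with a value are rows 1 and 2 -> (3 + 5) / 2 = 4
# Row 2, column 0: the two nearest rows with a value are rows 1 and 3 -> (3 + 8) / 2 = 5.5
```

Distances are computed only over the columns both rows have values for, so rows with missing entries can still serve as neighbors.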

In [6]:

# Prepare the data (numeric columns only)
columns = ['Tomatoes CriticScore', 'Tomatoes UserScore', 'Metascore', 'Meta UserScore']
test_size = 0.3

for test_col in columns:
    # Step 1: Prepare the test dataset
    raw_data_copy = raw_data.copy()
    raw_data_copy.dropna(inplace=True)
    raw_data_copy.reset_index(drop=True, inplace=True)

    # Select test rows (30% of the data)
    test_rows = raw_data_copy.sample(frac=test_size, random_state=42).index
    y_test = raw_data_copy.loc[test_rows, test_col].copy()

    # Mask test column values (set them to NaN for imputation)
    raw_data_copy.loc[test_rows, test_col] = np.nan

    # Step 2: Apply KNN Imputation
    knn_imputer = KNNImputer(n_neighbors=20, weights="uniform")
    imputed_data = knn_imputer.fit_transform(raw_data_copy[columns])

    # Reconstruct the imputed DataFrame
    imputed_df = pd.DataFrame(imputed_data, columns=columns)

    # Step 3: Evaluate the imputed values
    y_pred = imputed_df.loc[test_rows, test_col]

    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Store the metrics of each model
    knn_model.append({
        'Column': test_col,
        'MAE': mae,
        'MSE': mse,
        'R2': r2
    })

<div id = "decision"> <h2> Decision Tree Regressor strategy </h2> </div>

- Uses all the other score columns as features and trains a tree-structured model to predict the scores of the target column.
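
The evaluation cell below only scores the tree on held-out rows. As a rough sketch of how such a model could then be used to fill in real gaps, the example here trains on the rows where the target is known and predicts the rows where it is missing. The data is synthetic and the column names (`score_a`, `score_b`, `target`) are hypothetical placeholders, not columns from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data; column names are hypothetical
n = 200
score_a = rng.uniform(0, 100, n)
toy = pd.DataFrame({
    "score_a": score_a,
    "score_b": score_a + rng.normal(0, 5, n),
    "target": 0.8 * score_a + rng.normal(0, 5, n),
})

# Hide 40 target values to simulate missing data
missing_idx = rng.choice(n, size=40, replace=False)
toy.loc[missing_idx, "target"] = np.nan

features = ["score_a", "score_b"]
known = toy["target"].notna()

# Train on rows where the target is known (same hyperparameters as the notebook)
tree = DecisionTreeRegressor(max_depth=5, random_state=42)
tree.fit(toy.loc[known, features], toy.loc[known, "target"])

# Fill the missing targets with the tree's predictions
toy.loc[~known, "target"] = tree.predict(toy.loc[~known, features])

print(toy["target"].isna().sum())
```

Note that this approach needs complete feature values for every row being filled, whereas KNN imputation can work with rows that have several missing columns.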


In [7]:

# Prepare the data (numeric columns only)
columns = ['Tomatoes CriticScore', 'Tomatoes UserScore', 'Metascore', 'Meta UserScore']
test_size = 0.3

for test_col in columns:
    # Step 1: Prepare the test dataset
    raw_data_copy = raw_data.copy()
    raw_data_copy.dropna(inplace=True)
    raw_data_copy.reset_index(drop=True, inplace=True)

    #Train Test split
    train_col = columns.copy()
    train_col.remove(test_col)

    X = raw_data_copy[train_col].values
    y = raw_data_copy[test_col].values
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

    #Train model
    decision_tree = DecisionTreeRegressor(max_depth=5, random_state=42)
    decision_tree.fit(X_train, y_train)

    y_pred = decision_tree.predict(X_test)
        
    #Evaluate
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Store the metrics of each model
    decision_tree_model.append({
        'Column': test_col,
        'MAE': mae,
        'MSE': mse,
        'R2': r2
    })

<div id = "evaluation"> <h1>4. Evaluation </h1> </div>

In [8]:
# Convert the metrics of each model to DataFrames
similarity_model = pd.DataFrame(similarity_model)
knn_model = pd.DataFrame(knn_model)
decision_tree_model = pd.DataFrame(decision_tree_model)

# Make sure that similarity_model, knn_model, and decision_tree_model have columns 'MAE', 'MSE', 'R2' and 'Model'
similarity_model['Model'] = 'Similarity'
knn_model['Model'] = 'KNN'
decision_tree_model['Model'] = 'Decision Tree'

# Combine the metrics of all models
all_models = pd.concat([similarity_model, knn_model, decision_tree_model], ignore_index=True)

# Create a table to summarize the metrics of all models
summary_table = all_models.groupby(['Model']).agg({
    'MAE': ['mean'],
    'MSE': ['mean'],
    'R2': ['mean']
}).round(2)

print(summary_table)


                 MAE     MSE    R2
                mean    mean  mean
Model                             
Decision Tree   6.12   88.12  0.70
KNN             6.23   88.30  0.71
Similarity     11.83  262.07  0.32


- KNN and Decision Tree perform almost identically on MAE, MSE, and R² (mean MAE 6.23 vs. 6.12, mean R² 0.71 vs. 0.70), and both are clearly better than the Similarity model (mean MAE 11.83, mean R² 0.32).
- However, KNN is likely the better choice because:
    - Movie ratings are typically continuous, with values influenced by the preferences of users and critics (the neighbors). KNN suits this kind of problem because it relies on similarity between rating patterns, which is particularly useful in collaborative-filtering tasks such as movie rating prediction.
    - KNN predicts a missing rating from the ratings of similar users or movies, which is a natural fit for this type of task.
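
If KNN is adopted for the preprocessing stage, applying it could look roughly like the sketch below. The helper name `impute_scores`, the `Title` column, and the toy values are assumptions made for illustration; only the score column names and the `KNNImputer` settings come from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def impute_scores(df, columns, n_neighbors=20):
    """Return a copy of df with NaNs in `columns` filled by KNNImputer (hypothetical helper)."""
    result = df.copy()
    imputer = KNNImputer(n_neighbors=n_neighbors, weights="uniform")
    result[columns] = imputer.fit_transform(result[columns])
    return result

score_columns = ['Tomatoes CriticScore', 'Tomatoes UserScore', 'Metascore', 'Meta UserScore']

# Made-up rows for illustration; 'Title' is a hypothetical non-score column
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E'],
    'Tomatoes CriticScore': [90.0, 85.0, np.nan, 40.0, 35.0],
    'Tomatoes UserScore': [88.0, 80.0, 82.0, 45.0, np.nan],
    'Metascore': [85.0, np.nan, 80.0, 42.0, 38.0],
    'Meta UserScore': [8.5, 8.0, 8.1, np.nan, 4.0],
})

# n_neighbors=2 only because the toy frame has 5 rows; the notebook evaluated 20
filled = impute_scores(toy, score_columns, n_neighbors=2)
print(filled)
```

Observed values are left unchanged and non-score columns pass through untouched; only the missing score entries are replaced.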
