# Personalized Practice
### *Leveraging Collaborative Filtering for Personalized Practice in Computer-Based Assessments*
This notebook builds multiple **recommender systems** based on **six different collaborative filtering (CF) techniques** and compares these models against each other based on their performance on a dataset containing student performance data (i.e., a dataset containing the scores of students on assessment questions). Our chosen CF models include **three latent factor-based models** (Singular Value Decomposition, Singular Value Decomposition Plus, Non-Negative Matrix Factorization) and **three neighborhood-based models** (k-Nearest Neighbors, k-Nearest Neighbors with Means, k-Nearest Neighbors with Z-Scores).

To evaluate whether CF-based recommender systems can effectively predict students' performance scores on new, unseen questions based on their past performance, this notebook conducts **[Dietterich's 5x2 CV paired t-test](https://pubmed.ncbi.nlm.nih.gov/9744903/)** on each model. We compare the models across two metrics, **Mean Absolute Error (MAE)** and **Root Mean Squared Error (RMSE)**, using two different datasets. For each model, we compute the average MAE and RMSE values from the 10 trials of the 5x2 CV process. To assess the statistical significance of the performance differences between each CF model and the baseline, we perform **two-sided paired t-tests** and calculate **Bonferroni-corrected p-values**, along with **Cohen's d** to measure effect sizes.
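
As a reminder of what the two metrics measure, here is a minimal, dependency-free sketch of MAE and RMSE on a handful of made-up predictions. The function names `mae` and `rmse` and the toy scores are illustrative assumptions; the notebook itself relies on the metrics computed by Surprise.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: like MAE, but penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# hypothetical normalized scores and predictions (not taken from the dataset)
actual = [1.0, 0.0, 0.5, 0.25]
predicted = [0.75, 0.25, 0.5, 0.75]

print(f"MAE:  {mae(actual, predicted):.4f}")
print(f"RMSE: {rmse(actual, predicted):.4f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, and the gap widens when a few predictions are far off.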

## 1) Set dataset-specific variables
After setting the following dataset-specific variables, you should be able to run this notebook without any additional changes.

**NOTE**: This notebook assumes that the student performance dataset is stored as a CSV file with one column for the (anonymized) student ID, one column for the question ID, and one column for the **normalized** score (i.e., a score falling between 0.0 and 1.0). If your dataset does not follow these specifications, you will also need to adapt the implementation of the ``load_and_preprocess_data`` function to the shape of your dataset.
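
If your raw data stores points earned rather than normalized scores, one possible way to bring them into the expected range is to divide by each question's maximum points. The sketch below is only an assumption about how such data might look; `normalize_score` and its `max_points` argument are hypothetical and are not part of this notebook.

```python
def normalize_score(points_earned, max_points):
    """Maps a raw score onto [0.0, 1.0], clamping values that fall outside the range."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return min(1.0, max(0.0, points_earned / max_points))

# hypothetical raw records: (student ID, question ID, points earned, maximum points)
raw_records = [("S1", "Q1", 1, 8), ("S1", "Q2", 1, 3), ("S2", "Q1", 5, 5)]
normalized = [(s, q, normalize_score(p, m)) for s, q, p, m in raw_records]
print(normalized)
```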

In [21]:
# TODO: Fill in the values for these variables before running the remaining cells of this notebook!

# a string that specifies the path to the performance dataset from the current directory
dataset_path = './datasets/dataset_2022.csv'
# a string that specifies the NAME of the column containing the (anonymized) student IDs 
student_id_col_name = 'User ID'
# a string that specifies the NAME of the column containing the question IDs 
question_id_col_name = 'Question ID'
# a string that specifies the NAME of the column containing the normalized performance scores
score_col_name = 'Score'

# set to False if you want to disable status messages during model evaluation
include_status_messages = True 

## 2) Import packages
We use the [Surprise](https://surpriselib.com/) package, a Python scikit for building and analyzing CF-based recommender systems, to build and evaluate our CF models.

In [22]:
import random
import numpy as np
import pandas as pd
from scipy.stats import t
from surprise import Reader, Dataset
from surprise import AlgoBase, SVD, SVDpp, NMF, KNNBasic, KNNWithMeans, KNNWithZScore
from surprise.model_selection import KFold, cross_validate

## 3) Load and preprocess raw data

In [23]:
def load_and_preprocess_data(path):
    """
    Loads the performance dataset from the CSV file located at path.
    
    @param path: path to the performance dataset from the current directory
    @return data_df: Pandas dataframe containing the loaded performance dataset
    @return data_wrapped: Surprise dataframe containing the loaded performance dataset
    """
    data_df = pd.read_csv(path, keep_default_na=False)
    
    # rename columns to match the names expected by the functions in the Surprise package
    data_df = data_df.rename(
        columns={question_id_col_name:'itemID', student_id_col_name:'userID', score_col_name:'rating'})
    data_df = data_df[['itemID', 'userID', 'rating']]
    data_df['rating'] = pd.to_numeric(data_df['rating']).fillna(0)
    
    # functions in the Surprise package require the data to be wrapped by a Surprise wrapper class
    reader = Reader(rating_scale=(0.0, 1.0))
    data_wrapped = Dataset.load_from_df(data_df[['userID', 'itemID', 'rating']], reader)
    
    return data_df, data_wrapped

In [24]:
data_df, data_wrapped = load_and_preprocess_data(dataset_path)

In [25]:
# verify that the dataset was loaded properly
data_df.head()

,itemID,userID,rating
0,Q1,S1,1.0
1,Q2,S1,1.0
2,Q3,S1,0.0
3,Q4,S1,0.125
4,Q5,S1,0.333333


In [26]:
# report the number of students and questions in the dataset
num_students = len(set(data_df['userID']))
num_questions = len(set(data_df['itemID']))
print('Number of distinct  students in dataset: %d' % num_students)
print('Number of distinct questions in dataset: %d' % num_questions)

Number of distinct  students in dataset: 679
Number of distinct questions in dataset: 165


In [27]:
records_per_student = data_df.groupby('userID').size().reset_index(name='Record Count')
records_per_student = records_per_student.sort_values('Record Count', ascending=False)
print(f"Average records per student: {records_per_student['Record Count'].mean():.0f}")
print(f"Median  records per student: {records_per_student['Record Count'].median():.0f}")

Average records per student: 96
Median  records per student: 101


## 4) Implement a baseline model
We compare the performance of our CF models against an **average-based baseline model**, a standard benchmark in recommender system evaluations. For a given student and a new question, the baseline model predicts a performance score based on the average of three means: the overall mean score, the mean score of the student, and the mean score of the question.
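
To make the "average of three means" concrete, here is a small sketch in plain Python on made-up records. The helper `predict_avg_baseline` and the toy data are illustrative assumptions. Like the class implemented in this section, it falls back to averaging fewer means when the student or the question does not appear in the training data.

```python
from statistics import mean

# hypothetical training records: (student ID, question ID, normalized score)
train = [
    ("S1", "Q1", 1.0), ("S1", "Q2", 0.5),
    ("S2", "Q1", 0.0), ("S2", "Q3", 0.5),
]

def predict_avg_baseline(train, student, question):
    """Averages the global mean with the student and question means, when known."""
    means = [mean(r for _, _, r in train)]
    student_scores = [r for s, _, r in train if s == student]
    question_scores = [r for _, q, r in train if q == question]
    if student_scores:
        means.append(mean(student_scores))
    if question_scores:
        means.append(mean(question_scores))
    return mean(means)

print(predict_avg_baseline(train, "S1", "Q3"))  # known student and known question
print(predict_avg_baseline(train, "S9", "Q1"))  # unseen student: only two means are averaged
```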

In [28]:
class AvgBaseline(AlgoBase):
    def __init__(self):
        AlgoBase.__init__(self, random_state=0)

    def fit(self, trainset):
        """
        Fits the average-based model to the provided training set.

        @param trainset: training set (wrapped by Surprise wrapper class) 
        """
        AlgoBase.fit(self, trainset)
        self.avg_rating = np.mean([r for (_, _, r) in self.trainset.all_ratings()])

        return self

    def estimate(self, u, i):
        """
        Predicts the score of user/student u on item/question i.

        @param u: ID of the user/student
        @param i: ID of the item/question
        @return: the predicted score of user/student u on item/question i
        """
        sum_means = self.avg_rating 
        div = 1
        if self.trainset.knows_user(u):
            sum_means += np.mean([r for (_, r) in self.trainset.ur[u]])
            div += 1
        if self.trainset.knows_item(i):
            sum_means += np.mean([r for (_, r) in self.trainset.ir[i]])
            div += 1

        return sum_means / div

## 5) Build and evaluate models
We follow [Dietterich's 5x2 CV technique](https://pubmed.ncbi.nlm.nih.gov/9744903/) to evaluate each of our models across the two benchmarking metrics of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). This results in 5 iterations of 2-fold cross-validation for each model, giving us a total of 10 trials per model.
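
The structure of the 10 trials can be sketched without any library, as below. This is an illustration only: `five_by_two_splits` is a hypothetical helper, and the notebook itself obtains its splits from Surprise's `KFold`, whose shuffling will differ from this sketch.

```python
import random

def five_by_two_splits(n_records, n_iterations=5):
    """Yields (iteration, fold, train_idx, test_idx) for repeated 2-fold CV."""
    for i in range(n_iterations):
        indices = list(range(n_records))
        random.Random(i).shuffle(indices)  # a fresh, seeded shuffle per iteration
        half = n_records // 2
        halves = (indices[:half], indices[half:])
        for fold in range(2):
            # each half serves as the test set exactly once per iteration
            yield i, fold, halves[1 - fold], halves[fold]

trials = list(five_by_two_splits(n_records=10))
print(len(trials))  # number of (iteration, fold) trials
```

Seeding each iteration separately keeps the splits reproducible, so every model can be evaluated on identical train/test partitions, which the paired test requires.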

In [12]:
model_names = ['baseline', 'svd', 'svdpp', 'nmf', 'knn_basic', 'knn_means', 'knn_zscore']
num_cf_models  = 6 # set to 6 for Bonferroni correction of p-values
num_iterations = 5 # set to 5 for Dietterich's 5x2 CV test 
num_splits = 2     # set to 2 for Dietterich's 5x2 CV test 
num_trials = 10    # set to 10 since we have 5 iterations of 2-fold CV

In [13]:
def evaluate_model(model, data_wrapped, kfold):
    """
    Performs cross-validation on the specified model instance using MAE and RMSE as its measures.
    
    @param model: model instance to which cross-validation is applied
    @param data_wrapped: dataset that will be used for cross-validation (wrapped by a Surprise class)
    @param kfold: KFold object specifying the number of splits to use for cross-validation
    @return results: dictionary containing the MAE and RMSE values for each testset from cross-validation.
    """
    cv_results = cross_validate(model, data_wrapped, measures=['MAE', 'RMSE'], cv=kfold, verbose=include_status_messages)
    results = {
        'MAE': cv_results['test_mae'],
        'RMSE': cv_results['test_rmse']
    }
    return results

In [None]:
# evaluate all the models using Dietterich's 5x2 CV technique 
models_results = {
    'baseline': [],
    'svd': [],
    'svdpp': [],
    'nmf': [],
    'knn_basic': [],
    'knn_means': [],
    'knn_zscore': []
}
for i in range(num_iterations):
    print('\n** ITERATION ROUND %d **' %(i+1))
    random.seed(i)                                                                    
    np.random.seed(i)
    kfold = KFold(n_splits=num_splits, random_state=i)
    
    models_results['baseline'].append(evaluate_model(AvgBaseline(), data_wrapped, kfold))
    models_results['svd'].append(evaluate_model(SVD(random_state=i), data_wrapped, kfold))
    models_results['svdpp'].append(evaluate_model(SVDpp(random_state=i), data_wrapped, kfold))
    models_results['nmf'].append(evaluate_model(NMF(random_state=i), data_wrapped, kfold))
    models_results['knn_basic'].append(evaluate_model(KNNBasic(), data_wrapped, kfold))
    models_results['knn_means'].append(evaluate_model(KNNWithMeans(), data_wrapped, kfold))
    models_results['knn_zscore'].append(evaluate_model(KNNWithZScore(), data_wrapped, kfold))

## 6) Compare models
We report the mean MAE and RMSE values (from the 10 trials of the 5x2 CV process) for each model.

In [None]:
def compute_means(models_results, model_name):
    """
    Computes and prints the mean MAE and RMSE values (from the 10 trials of the 5x2 CV process) for the given model.
    
    @param models_results: the dictionary generated in the evaluation step containing the MAE and RMSE values
                           for all the models across all iterations and splits
    @param model_name: name of the model for which the mean MAE and RMSE will be computed
    """
    print(f'\n** RESULTS FOR {model_name.upper()} **')
    for metric_name in ['MAE', 'RMSE']:
        vals = np.array([])
        for i in range(num_iterations):
            vals = np.append(vals, models_results[model_name][i][metric_name])
        mean = np.mean(vals)
        std = np.std(vals)
        print(f'Mean of {metric_name}: {mean} +/- {std}')

In [None]:
for model_name in model_names:
    compute_means(models_results, model_name)

To assess the statistical significance of the performance differences between each CF model and the average-based baseline, we perform a two-sided paired t-test using the 5x2 CV approach (as outlined in [Dietterich's paper](https://pubmed.ncbi.nlm.nih.gov/9744903/) under **Section 3.5 - The 5x2cv paired t-test**), with the assumption that the t-statistic approximately follows a t-distribution with 5 degrees of freedom and a null hypothesis that the two models being compared perform equally well. In addition to reporting the raw p-values, we also report the **Bonferroni-corrected p-values**. The Bonferroni correction, which multiplies each raw p-value by the number of tests conducted (six here, one per CF model), is widely used when conducting multiple statistical tests, as it reduces the risk of false positives.
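
The computation can be sketched in plain Python as follows. The helper names and the difference values are hypothetical. The closed-form tail probability is specific to 5 degrees of freedom and is used here only to keep the sketch free of dependencies; capping the corrected p-value at 1 is an added convention, not something the text above prescribes.

```python
import math

def five_by_two_cv_t_stat(diffs):
    """
    diffs: 5 pairs (p1, p2) holding the per-fold metric differences between two
    models, one pair per iteration of 2-fold CV.
    """
    if len(diffs) != 5:
        raise ValueError("the 5x2cv test expects exactly 5 iterations")
    variance_sum = 0.0
    for p1, p2 in diffs:
        p_mean = (p1 + p2) / 2
        variance_sum += (p1 - p_mean) ** 2 + (p2 - p_mean) ** 2
    # numerator: the difference observed on the first fold of the first iteration
    return diffs[0][0] / math.sqrt(variance_sum / 5)

def two_sided_p_df5(t_stat):
    """Two-sided p-value under a t-distribution with exactly 5 degrees of freedom."""
    theta = math.atan(abs(t_stat) / math.sqrt(5))
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    central_mass = (2 / math.pi) * (theta + sin_t * (cos_t + (2 / 3) * cos_t ** 3))
    return 1 - central_mass

def bonferroni(p_val, num_tests):
    """Multiplies a raw p-value by the number of tests, capping the result at 1."""
    return min(1.0, p_val * num_tests)

# hypothetical MAE differences (CF model minus baseline), 5 iterations x 2 folds
diffs = [(-0.030, -0.028), (-0.031, -0.027), (-0.029, -0.030),
         (-0.026, -0.032), (-0.030, -0.029)]
t_stat = five_by_two_cv_t_stat(diffs)
p_val = two_sided_p_df5(t_stat)
print(f"t = {t_stat:.3f}, raw p = {p_val:.5f}, corrected p = {bonferroni(p_val, 6):.5f}")
```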

In [None]:
def paired_ttest(models_results, model1_name, model2_name):
    """
    Computes and prints the p-values resulting from the comparison of models 1 and 2 across MAE and RMSE.
    
    @param models_results: the dictionary generated in the evaluation step containing the MAE and RMSE values
                           for all the models across all iterations and splits
    @param model1_name: name of the 1st model to be used in the paired t-test
    @param model2_name: name of the 2nd model to be used in the paired t-test
    """
    print(f'\n** RESULTS FOR COMPARING {model1_name.upper()} AND {model2_name.upper()} **')
    for metric_name in ['MAE','RMSE']:
        perf_diff_var_sum = 0
        for i in range(num_iterations):
            perf_diff = models_results[model1_name][i][metric_name] - models_results[model2_name][i][metric_name]
            perf_diff_mean = np.mean(perf_diff)
            perf_diff_var = np.sum((perf_diff - perf_diff_mean)**2)
            perf_diff_var_sum += perf_diff_var

        perf_diff_first = models_results[model1_name][0][metric_name] - models_results[model2_name][0][metric_name]
        t_stat = perf_diff_first[0] / np.sqrt(1/num_iterations*perf_diff_var_sum)
        p_val = 2*(1 - t.cdf(abs(t_stat), num_iterations))

        print(f'\nRaw p-value based on {metric_name}: {p_val}')
        # a corrected p-value is a probability, so cap it at 1
        print(f'Bonferroni-adjusted p-value based on {metric_name}: {min(1.0, p_val*num_cf_models)}')

In [None]:
for model_name in model_names:
    if model_name != 'baseline':
        paired_ttest(models_results, model_name, 'baseline')

Finally, to quantify the effect sizes of the performance differences between each CF model and the baseline, we calculate Cohen's d for each test.
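
As a worked illustration, the sketch below computes Cohen's d under one common definition for two equally sized groups, where the pooled standard deviation is the square root of the average of the two variances. The function name `cohens_d_sketch` and the sample values are hypothetical, and the population standard deviation (`pstdev`) is an assumption chosen to match NumPy's default `np.std`.

```python
import math
from statistics import mean, pstdev

def cohens_d_sketch(vals1, vals2):
    """Absolute difference of the means divided by the pooled standard deviation."""
    pooled_std = math.sqrt((pstdev(vals1) ** 2 + pstdev(vals2) ** 2) / 2)
    return abs(mean(vals1) - mean(vals2)) / pooled_std

# hypothetical MAE values from a few CV trials of two models
model_mae = [0.27, 0.29, 0.27, 0.29]
baseline_mae = [0.30, 0.32, 0.30, 0.32]
print(f"Cohen's d: {cohens_d_sketch(model_mae, baseline_mae):.2f}")
```

Here the means differ by 0.03 and both groups have a standard deviation of 0.01, so the effect size works out to about 3 pooled standard deviations.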

In [None]:
def cohen_d(models_results, model1_name, model2_name):
    """
    Computes and prints the Cohen's d resulting from the comparison of models 1 and 2 across MAE and RMSE.
    
    @param models_results: the dictionary generated in the evaluation step containing the MAE and RMSE values
                           for all the models across all iterations and splits
    @param model1_name: name of the 1st model to be used in the effect size computation
    @param model2_name: name of the 2nd model to be used in the effect size computation
    """
    print(f'\n** RESULTS FOR COMPARING {model1_name.upper()} AND {model2_name.upper()} **')
    for metric_name in ['MAE','RMSE']:
        vals1 = np.array([])
        vals2 = np.array([])
        for i in range(num_iterations):
            vals1 = np.append(vals1, models_results[model1_name][i][metric_name])
            vals2 = np.append(vals2, models_results[model2_name][i][metric_name])
        
        mean1 = np.mean(vals1)
        mean2 = np.mean(vals2)
        std1 = np.std(vals1)
        std2 = np.std(vals2)
        
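        # pooled standard deviation of two equally sized samples
        # (the original expression std1 + std2 / 2 halved only std2 due to operator precedence)
        pooled_std = np.sqrt((std1**2 + std2**2) / 2)
        effect_size = np.abs(mean1 - mean2) / pooled_std

        print(f"Cohen's d based on {metric_name}: {effect_size}")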

In [None]:
for model_name in model_names:
    if model_name != 'baseline':
        cohen_d(models_results, model_name, 'baseline')

## 7) Evaluate Models Across Data Availability Levels

In this section, we evaluate all six collaborative filtering models and the baseline model across different data availability levels (25%, 50%, and 75%) using the same 5x2 CV technique from Dietterich. This allows us to assess how model performance compares when different amounts of data are available.
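
The per-student subsampling idea can be sketched in plain Python as below. The helper `subsample_per_student` and the toy records are illustrative assumptions; the notebook's own implementation works on the Pandas dataframe instead.

```python
import random
from collections import defaultdict

def subsample_per_student(records, availability_percentage, seed=0):
    """Keeps roughly the given percentage of each student's records (at least one)."""
    by_student = defaultdict(list)
    for record in records:
        by_student[record[0]].append(record)

    rng = random.Random(seed)
    subset = []
    for student_records in by_student.values():
        num_to_keep = max(1, int(len(student_records) * availability_percentage / 100))
        subset.extend(rng.sample(student_records, num_to_keep))
    return subset

# hypothetical records: (student ID, question ID, normalized score)
records = ([("S1", f"Q{q}", 0.5) for q in range(8)]
           + [("S2", f"Q{q}", 1.0) for q in range(2)])
for pct in (25, 50, 75):
    print(pct, len(subsample_per_student(records, pct)))
```

Sampling within each student, rather than across the whole dataset, keeps every student represented at every availability level, including students with very few records.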

In [29]:
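def create_data_subset(data_df, availability_percentage):
    """
    Samples a fixed percentage of each student's records (at least one record per student).
    
    @param data_df: Pandas dataframe containing the full performance dataset
    @param availability_percentage: percentage (0-100) of each student's records to keep
    @return subset_df: Pandas dataframe containing the sampled records
    """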
    subset_data = []
    
    for student_id in data_df['userID'].unique():
        student_records = data_df[data_df['userID'] == student_id]
        num_records = len(student_records)
        num_to_keep = max(1, int(num_records * availability_percentage / 100))
        
        sampled_records = student_records.sample(n=num_to_keep, random_state=0)
        subset_data.append(sampled_records)
    
    subset_df = pd.concat(subset_data, ignore_index=True)
    return subset_df

data_availability_levels = {
    '25%': create_data_subset(data_df, 25),
    '50%': create_data_subset(data_df, 50),
    '75%': create_data_subset(data_df, 75)
}

In [30]:
availability_levels = [25, 50, 75]

availability_results = {
    '25%': {'baseline': [], 'svd': [], 'svdpp': [], 'nmf': [], 'knn_basic': [], 'knn_means': [], 'knn_zscore': []},
    '50%': {'baseline': [], 'svd': [], 'svdpp': [], 'nmf': [], 'knn_basic': [], 'knn_means': [], 'knn_zscore': []},
    '75%': {'baseline': [], 'svd': [], 'svdpp': [], 'nmf': [], 'knn_basic': [], 'knn_means': [], 'knn_zscore': []}
}

availability_data = {}
for availability in availability_levels:
    subset_df = create_data_subset(data_df, availability)
    availability_key = f'{availability}%'
    
    reader = Reader(rating_scale=(0.0, 1.0))
    availability_data[availability_key] = Dataset.load_from_df(
        subset_df[['userID', 'itemID', 'rating']], reader)

In [31]:
for availability_key in ['25%', '50%', '75%']:
    print(f'\n\n===== EVALUATING MODELS WITH {availability_key} DATA AVAILABILITY =====')
    
    for i in range(num_iterations):
        print(f'\n** ITERATION ROUND {i+1} **')
        random.seed(i)                                                                    
        np.random.seed(i)
        kfold = KFold(n_splits=num_splits, random_state=i)
        
        # Evaluate all models on this availability level
        availability_results[availability_key]['baseline'].append(
            evaluate_model(AvgBaseline(), availability_data[availability_key], kfold))
        availability_results[availability_key]['svd'].append(
            evaluate_model(SVD(random_state=i), availability_data[availability_key], kfold))
        availability_results[availability_key]['svdpp'].append(
            evaluate_model(SVDpp(random_state=i), availability_data[availability_key], kfold))
        availability_results[availability_key]['nmf'].append(
            evaluate_model(NMF(random_state=i), availability_data[availability_key], kfold))
        availability_results[availability_key]['knn_basic'].append(
            evaluate_model(KNNBasic(), availability_data[availability_key], kfold))
        availability_results[availability_key]['knn_means'].append(
            evaluate_model(KNNWithMeans(), availability_data[availability_key], kfold))
        availability_results[availability_key]['knn_zscore'].append(
            evaluate_model(KNNWithZScore(), availability_data[availability_key], kfold))




===== EVALUATING MODELS WITH 25% DATA AVAILABILITY =====

** ITERATION ROUND 1 **
Evaluating MAE, RMSE of algorithm AvgBaseline on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
MAE (testset)     0.2984  0.2995  0.2989  0.0005  
RMSE (testset)    0.3512  0.3519  0.3516  0.0004  
Fit time          0.01    0.00    0.00    0.00    
Test time         0.36    0.31    0.33    0.02    
Evaluating MAE, RMSE of algorithm SVD on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
MAE (testset)     0.2713  0.2720  0.2717  0.0003  
RMSE (testset)    0.3418  0.3449  0.3434  0.0016  
Fit time          0.06    0.05    0.05    0.00    
Test time         0.04    0.03    0.03    0.00    
Evaluating MAE, RMSE of algorithm SVDpp on 2 split(s).

                  Fold 1  Fold 2  Mean    Std     
MAE (testset)     0.2654  0.2671  0.2663  0.0009  
RMSE (testset)    0.3354  0.3391  0.3372  0.0018  
Fit time          0.13    0.12    0.13    0.00    
Test time         0.17    0.28

### Results Summary by Data Availability Level

We report the mean MAE and RMSE values for each model at each data availability level (25%, 50%, and 75%).


In [32]:
def compute_means_by_availability(availability_results, availability_key):
    """
    Computes and prints the mean MAE and RMSE values for all models at a given data availability level.
    
    @param availability_results: dictionary containing results for each availability level and model
    @param availability_key: the data availability level
    """
    print(f'\n\n===== RESULTS FOR {availability_key} DATA AVAILABILITY =====')
    
    for model_name in model_names:
        print(f'\n** {model_name.upper()} **')
        for metric_name in ['MAE', 'RMSE']:
            vals = np.array([])
            for i in range(num_iterations):
                vals = np.append(vals, availability_results[availability_key][model_name][i][metric_name])
            mean = np.mean(vals)
            std = np.std(vals)
            print(f'Mean {metric_name}: {mean:.6f} +/- {std:.6f}')

# Compute results for each availability level
for availability_key in ['25%', '50%', '75%']:
    compute_means_by_availability(availability_results, availability_key)




===== RESULTS FOR 25% DATA AVAILABILITY =====

** BASELINE **
Mean MAE: 0.298994 +/- 0.000556
Mean RMSE: 0.351460 +/- 0.001166

** SVD **
Mean MAE: 0.271773 +/- 0.000986
Mean RMSE: 0.343176 +/- 0.001981

** SVDPP **
Mean MAE: 0.265604 +/- 0.000845
Mean RMSE: 0.336178 +/- 0.001602

** NMF **
Mean MAE: 0.286119 +/- 0.001512
Mean RMSE: 0.348131 +/- 0.002419

** KNN_BASIC **
Mean MAE: 0.269204 +/- 0.001597
Mean RMSE: 0.346416 +/- 0.001894

** KNN_MEANS **
Mean MAE: 0.266173 +/- 0.000992
Mean RMSE: 0.338817 +/- 0.001453

** KNN_ZSCORE **
Mean MAE: 0.263688 +/- 0.001409
Mean RMSE: 0.341255 +/- 0.001973


===== RESULTS FOR 50% DATA AVAILABILITY =====

** BASELINE **
Mean MAE: 0.299327 +/- 0.000501
Mean RMSE: 0.351266 +/- 0.001543

** SVD **
Mean MAE: 0.261639 +/- 0.000840
Mean RMSE: 0.335154 +/- 0.001224

** SVDPP **
Mean MAE: 0.256702 +/- 0.000698
Mean RMSE: 0.329281 +/- 0.001178

** NMF **
Mean MAE: 0.274220 +/- 0.000652
Mean RMSE: 0.333973 +/- 0.001013

** KNN_BASIC **
Mean MAE: 0.262935