# Personalized Practice Quizzes
### *Leveraging Collaborative Filtering for Personalized Practice in Computer-Based Assessments*
This notebook builds **recommender systems** based on **six collaborative filtering (CF) techniques** and compares their performance on a student performance dataset (i.e., a dataset containing students' scores on assessment questions). Our chosen CF models include **three latent factor-based models** (Singular Value Decomposition (SVD), SVD++, and Non-Negative Matrix Factorization) and **three neighborhood-based models** (k-Nearest Neighbors, k-Nearest Neighbors with Means, and k-Nearest Neighbors with Z-Scores).

To evaluate whether CF-based recommender systems can effectively predict students' performance scores on new, unseen questions based on their past performance, this notebook conducts **[Dietterich's 5x2 CV paired t-test](https://pubmed.ncbi.nlm.nih.gov/9744903/)** on each model. The average **Mean Absolute Error (MAE)** and **Root Mean Squared Error (RMSE)** are first reported and then a **paired t-test** is performed to assess whether the performance difference between each CF model and an average-based baseline model is statistically significant.

## 1) Set dataset-specific variables
After setting the following dataset-specific variables, you should be able to run this notebook without any additional changes.

**NOTE**: This notebook assumes that the student performance dataset is stored as a CSV file with one column for the (anonymized) student ID, one column for the question ID, and one column for the **normalized** score (i.e., a score between 0.0 and 1.0). If your dataset does not follow these specifications, you will also need to adapt the implementation of the ``load_and_preprocess_data`` function to the shape of your dataset.
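
If your raw data stores points earned rather than normalized scores, one possible way to bring them into the expected range is to divide by the maximum attainable points. The sketch below is only an illustration: the record layout, the sample values, and the helper name `normalize_score` are hypothetical and not part of this notebook's pipeline.

```python
# Hypothetical raw records: (student ID, question ID, points earned, points possible)
raw_records = [
    ("s1", "q1", 3.0, 4.0),
    ("s1", "q2", 0.0, 2.0),
    ("s2", "q1", 4.0, 4.0),
]

def normalize_score(points_earned, points_possible):
    """Scale a raw score into [0.0, 1.0] by dividing by the maximum attainable points."""
    if points_possible <= 0:
        raise ValueError("points_possible must be positive")
    # clamp the ratio to guard against bonus points or data-entry errors
    return min(max(points_earned / points_possible, 0.0), 1.0)

# one (student ID, question ID, normalized score) triple per record
normalized = [(s, q, normalize_score(e, p)) for (s, q, e, p) in raw_records]
print(normalized)
```

The resulting triples correspond to the three columns this notebook expects in the CSV file.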

In [None]:
# TODO: Fill in the values for these variables before running the remaining cells of this notebook!

# a string that specifies the path to the performance dataset from the current directory
dataset_path = ''
# a string that specifies the NAME of the column containing the (anonymized) student IDs 
student_id_col_name = ''
# a string that specifies the NAME of the column containing the question IDs 
question_id_col_name = ''
# a string that specifies the NAME of the column containing the normalized performance scores
score_col_name = ''

# set to False if you want to disable status messages during model evaluation
include_status_messages = True 

## 2) Import packages
We use the [Surprise](https://surpriselib.com/) package, a Python scikit for building and analyzing CF-based recommender systems, to build and evaluate our CF models.

In [None]:
import random
import numpy as np
import pandas as pd
from scipy.stats import t
from surprise import Reader, Dataset
from surprise import AlgoBase, SVD, SVDpp, NMF, KNNBasic, KNNWithMeans, KNNWithZScore
from surprise.model_selection import KFold, cross_validate

## 3) Load and preprocess raw data

In [None]:
def load_and_preprocess_data(path):
    """
    Loads the performance dataset from the CSV file located at path.
    
    @param path: path to the performance dataset from the current directory
    @return data_df: Pandas dataframe containing the loaded performance dataset
    @return data_wrapped: Surprise Dataset object wrapping the loaded performance dataset
    """
    data_df = pd.read_csv(path, keep_default_na=False)
    
    # rename columns to match the names expected by the functions in the Surprise package
    data_df = data_df.rename(
        columns={question_id_col_name:'itemID', student_id_col_name:'userID', score_col_name:'rating'})
    data_df = data_df[['itemID', 'userID', 'rating']]
    data_df['rating'] = pd.to_numeric(data_df['rating']).fillna(0)
    
    # functions in the Surprise package require the data to be wrapped by a Surprise wrapper class
    reader = Reader(rating_scale=(0.0, 1.0))
    data_wrapped = Dataset.load_from_df(data_df[['userID', 'itemID', 'rating']], reader)
    
    return data_df, data_wrapped

In [None]:
data_df, data_wrapped = load_and_preprocess_data(dataset_path)

In [None]:
# verify that the dataset was loaded properly
data_df.head()

In [None]:
# report the number of students and questions in the dataset
num_students = len(set(data_df['userID']))
num_questions = len(set(data_df['itemID']))
print('Number of distinct  students in dataset: %d' % num_students)
print('Number of distinct questions in dataset: %d' % num_questions)

## 4) Implement a baseline model
We compare the performance of our CF models against an **average-based baseline model**, a standard benchmark in recommender system evaluations. For a given student and a new question, the baseline model predicts a performance score based on the average of three means: the overall mean score, the mean score of the student, and the mean score of the question.
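
To make the averaging concrete, here is a small plain-Python sketch of the same idea, independent of Surprise. The training triples and the function name `baseline_predict` are hypothetical and serve only as an illustration. When a student or question does not appear in the training data, its mean is left out of the average.

```python
from statistics import mean

# Hypothetical training triples: (student ID, question ID, normalized score)
train = [
    ("s1", "q1", 1.0),
    ("s1", "q2", 0.5),
    ("s2", "q1", 0.0),
]

def baseline_predict(train, student, question):
    """Average the overall mean with the student and question means, when available."""
    means = [mean(r for (_, _, r) in train)]  # overall mean is always available
    student_scores = [r for (s, _, r) in train if s == student]
    question_scores = [r for (_, q, r) in train if q == question]
    if student_scores:
        means.append(mean(student_scores))
    if question_scores:
        means.append(mean(question_scores))
    return mean(means)

# s2 has not answered q2: overall mean 0.5, student mean 0.0, question mean 0.5
print(baseline_predict(train, "s2", "q2"))
# neither s3 nor q3 is known, so the prediction falls back to the overall mean
print(baseline_predict(train, "s3", "q3"))
```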

In [None]:
class AvgBaseline(AlgoBase):
    def __init__(self):
        AlgoBase.__init__(self, random_state=0)

    def fit(self, trainset):
        """
        Fits the average-based model to the provided training set.

        @param trainset: training set (wrapped by Surprise wrapper class) 
        """
        AlgoBase.fit(self, trainset)
        self.avg_rating = np.mean([r for (_, _, r) in self.trainset.all_ratings()])

        return self

    def estimate(self, u, i):
        """
        Predicts the score of user/student u on item/question i.

        @param u: inner ID of the user/student
        @param i: inner ID of the item/question
        @return: the predicted score of user/student u on item/question i
        """
        sum_means = self.avg_rating 
        div = 1
        if self.trainset.knows_user(u):
            sum_means += np.mean([r for (_, r) in self.trainset.ur[u]])
            div += 1
        if self.trainset.knows_item(i):
            sum_means += np.mean([r for (_, r) in self.trainset.ir[i]])
            div += 1

        return sum_means / div

## 5) Build and evaluate models
We follow [Dietterich's 5x2 CV technique](https://pubmed.ncbi.nlm.nih.gov/9744903/) to evaluate each of our models across the two benchmarking metrics of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). This results in 5 iterations of 2-fold cross-validation for each model, giving us a total of 10 trials per model.
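
The splitting scheme itself can be sketched in plain Python as follows. This is a minimal illustration under our own assumptions, not how Surprise's `KFold` is implemented; the function name `five_by_two_splits` and the index-based representation are hypothetical.

```python
import random

def five_by_two_splits(n_samples, n_iterations=5):
    """Yield (iteration, fold, train_idx, test_idx) for repeated 2-fold cross-validation."""
    for it in range(n_iterations):
        indices = list(range(n_samples))
        random.Random(it).shuffle(indices)  # seeded per iteration for reproducibility
        half = n_samples // 2
        first, second = indices[:half], indices[half:]
        yield it, 0, first, second  # fold 0: train on the first half, test on the second
        yield it, 1, second, first  # fold 1: swap the roles of the two halves

trials = list(five_by_two_splits(10))
print(len(trials))  # 5 iterations x 2 folds = 10 trials
```

Within each iteration the two folds are disjoint and together cover every sample, so every sample is used for testing exactly once per iteration.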

In [None]:
model_names = ['baseline', 'svd', 'svdpp', 'nmf', 'knn_basic', 'knn_means', 'knn_zscore']
num_iterations = 5 # set to 5 for Dietterich's 5x2 CV test 
num_splits = 2     # set to 2 for Dietterich's 5x2 CV test 

In [None]:
def evaluate_model(model, data_wrapped, kfold):
    """
    Performs cross-validation on the specified model instance using MAE and RMSE as its measures.
    
    @param model: model instance to which cross-validation is applied
    @param data_wrapped: dataset that will be used for cross-validation (wrapped by a Surprise class)
    @param kfold: KFold object specifying the number of splits to use for cross-validation
    @return results: dictionary containing the MAE and RMSE values for each testset from cross-validation.
    """
    cv_results = cross_validate(model, data_wrapped, measures=['MAE', 'RMSE'], cv=kfold, verbose=include_status_messages)
    results = {
        'MAE': cv_results['test_mae'],
        'RMSE': cv_results['test_rmse']
    }
    return results

In [None]:
# evaluate all the models using Dietterich's 5x2 CV technique
models_results = {
    'baseline': [],
    'svd': [],
    'svdpp': [],
    'nmf': [],
    'knn_basic': [],
    'knn_means': [],
    'knn_zscore': []
}
for i in range(num_iterations):
    print('\n** ITERATION ROUND %d **' %(i+1))
    random.seed(i)                                                                    
    np.random.seed(i)
    kfold = KFold(n_splits=num_splits, random_state=i)
    
    models_results['baseline'].append(evaluate_model(AvgBaseline(), data_wrapped, kfold))
    models_results['svd'].append(evaluate_model(SVD(random_state=i), data_wrapped, kfold))
    models_results['svdpp'].append(evaluate_model(SVDpp(random_state=i), data_wrapped, kfold))
    models_results['nmf'].append(evaluate_model(NMF(random_state=i), data_wrapped, kfold))
    models_results['knn_basic'].append(evaluate_model(KNNBasic(), data_wrapped, kfold))
    models_results['knn_means'].append(evaluate_model(KNNWithMeans(), data_wrapped, kfold))
    models_results['knn_zscore'].append(evaluate_model(KNNWithZScore(), data_wrapped, kfold))

## 6) Compare models
We report the mean MAE and RMSE values (from the 10 trials of the 5x2 CV process) for each model.
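
As a reminder of what the two metrics measure, the sketch below computes them by hand for a handful of made-up scores. The helper names `mae` and `rmse` and the sample values are hypothetical; in this notebook the metrics are computed by Surprise.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: the average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical normalized scores and predictions
actual = [1.0, 0.0, 0.5, 1.0]
predicted = [0.5, 0.0, 0.5, 0.0]

print(mae(actual, predicted))
print(rmse(actual, predicted))
```

For both metrics, lower values indicate more accurate predictions.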

In [None]:
def compute_means(models_results, model_name):
    """
    Computes and prints the mean and standard deviation of the MAE and RMSE values (from the 10 trials of the 5x2 CV process) for the specified model.
    
    @param models_results: the dictionary generated in the evaluation step containing the MAE and RMSE values
                          for all the models across all iterations and splits
    @param model_name: name of the model for which the mean MAE and RMSE will be computed
    """
    print(f'\n** RESULTS FOR {model_name.upper()} **')
    for metric_name in ['MAE', 'RMSE']:
        metric_vals = np.array([])
        for i in range(num_iterations):
            metric_vals = np.append(metric_vals, models_results[model_name][i][metric_name])
        metric_mean = np.mean(metric_vals)
        metric_std = np.std(metric_vals)
        print(f'Mean of {metric_name}: {metric_mean} +/- {metric_std}')

In [None]:
for model_name in model_names:
    compute_means(models_results, model_name)

To assess whether the performance difference between each CF model and the average-based baseline is statistically significant, we perform the 5x2 CV paired t-test (as outlined in [Dietterich's paper](https://pubmed.ncbi.nlm.nih.gov/9744903/) under **Section 3.5 - The 5x2cv paired t-test**). The null hypothesis is that the two models being compared perform equally well; under this hypothesis, the t-statistic approximately follows a t-distribution with 5 degrees of freedom.
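
For reference, the statistic computed by `paired_ttest` can be written out as follows. The notation is our own reading of Section 3.5 and is introduced here only for illustration: $p_i^{(j)}$ denotes the difference in the metric (MAE or RMSE) between the two models on fold $j$ of iteration $i$, and $F_5$ denotes the cumulative distribution function of a t-distribution with 5 degrees of freedom.

$$
\bar{p}_i = \frac{p_i^{(1)} + p_i^{(2)}}{2}, \qquad
s_i^2 = \left(p_i^{(1)} - \bar{p}_i\right)^2 + \left(p_i^{(2)} - \bar{p}_i\right)^2
$$

$$
\tilde{t} = \frac{p_1^{(1)}}{\sqrt{\frac{1}{5}\sum_{i=1}^{5} s_i^2}}, \qquad
p\text{-value} = 2\left(1 - F_5\left(\left|\tilde{t}\right|\right)\right)
$$

Only the first difference $p_1^{(1)}$ appears in the numerator, while all five iterations contribute to the variance estimate in the denominator.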

In [None]:
def paired_ttest(models_results, model1_name, model2_name):
    """
    Computes and prints the p-values resulting from the comparison of models 1 and 2 across MAE and RMSE.
    
    @param models_results: the dictionary generated in the evaluation step containing the MAE and RMSE values
                          for all the models across all iterations and splits
    @param model1_name: name of the 1st model to be used in the paired t-test
    @param model2_name: name of the 2nd model to be used in the paired t-test
    """
    print(f'\n** RESULTS FOR COMPARING {model1_name.upper()} AND {model2_name.upper()} **')
    for metric_name in ['MAE','RMSE']:
        perf_diff_var_sum = 0
        for i in range(num_iterations):
            perf_diff = models_results[model1_name][i][metric_name] - models_results[model2_name][i][metric_name]
            perf_diff_mean = np.mean(perf_diff)
            perf_diff_var = np.sum((perf_diff - perf_diff_mean)**2)
            perf_diff_var_sum += perf_diff_var

        perf_diff_first = models_results[model1_name][0][metric_name] - models_results[model2_name][0][metric_name]
        t_stat = perf_diff_first[0] / np.sqrt(1/num_iterations*perf_diff_var_sum)
        p_val = 2*(1 - t.cdf(abs(t_stat), num_iterations))

        print(f'P-value based on {metric_name}: {p_val}')

In [None]:
for model_name in model_names:
    if model_name != 'baseline':
        paired_ttest(models_results, model_name, 'baseline')