# Evaluation of High Order Collaborative Filtering

An evaluation of an implementation of Higher Order Collaborative Filtering, based on [this paper](https://dl.acm.org/doi/10.1145/3460231.3474273).

**Author:** Hai Hung Nguyen  
**Course:** Recommender Systems (NSWI166)  
**Semester:** LS 2023/2024

## Prepare data

In [193]:
import pandas as pd
import numpy as np

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Load and Preprocess participation data

In [194]:
df_participation = pd.read_csv("./data/participation.csv", encoding='utf-8')

df_completed_participation = df_participation.drop(["participant_email","extra_data"], axis=1)

df_completed_participation

Unnamed: 0,id,age_group,gender,education,ml_familiar,user_study_id,time_joined,time_finished,uuid,language
0,1,16,0,2,0,1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,Ltj9Hpeo1M6ub-XzuJ_AkA,en
1,2,21,1,2,0,1,2024-06-15 14:27:45.994813,2024-06-15 14:38:41.402801,Ltj9Hpeo1M6ub-XzuJ_AkA,en
2,3,21,0,2,1,1,2024-06-15 14:39:58.079518,2024-06-15 14:42:25.906710,Ltj9Hpeo1M6ub-XzuJ_AkA,en
3,4,21,0,2,1,1,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,Ltj9Hpeo1M6ub-XzuJ_AkA,en
4,5,21,1,2,0,2,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,Ltj9Hpeo1M6ub-XzuJ_AkA,en
5,6,16,0,2,0,2,2024-06-15 15:01:02.444312,2024-06-15 15:12:23.715402,Ltj9Hpeo1M6ub-XzuJ_AkA,en
6,7,21,0,2,1,2,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,Ltj9Hpeo1M6ub-XzuJ_AkA,en


### Load and Preprocess interaction data

In [195]:
df_interaction = pd.read_csv("./data/interaction.csv", encoding='utf-8')
df_interaction = df_interaction.merge(df_completed_participation[['id', 'time_joined', 'time_finished']], 
                                      left_on='participation', right_on='id', suffixes=('', '_part'))

df_interaction['time_joined'] = pd.to_datetime(df_interaction['time_joined'])
df_interaction['time_finished'] = pd.to_datetime(df_interaction['time_finished'])

df_interaction['duration'] = (df_interaction['time_finished'] - df_interaction['time_joined']).dt.total_seconds()

df_interaction

Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration
0,1,1,loaded-page,2024-06-15 14:12:41.711140,"{""page"": ""preference_elicitation"", ""context"": ...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155
1,2,1,changed-viewport,2024-06-15 14:12:46.717015,"{""viewport"": {""left"": 0, ""top"": 0, ""width"": 16...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155
2,3,1,selected-item,2024-06-15 14:12:47.274506,"{""selected_item"": {""movieName"": ""Hercules (199...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155
3,4,1,deselected-item,2024-06-15 14:13:05.598520,"{""deselected_item"": {""movieName"": ""Hercules (1...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155
4,5,1,on-input,2024-06-15 14:13:12.192231,"{""search_text_box_value"": ""kimi no na wa"", ""co...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155
...,...,...,...,...,...,...,...,...,...
1112,1113,7,study-ended,2024-06-15 15:16:13.656151,"{""iteration"": 5}",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625
1113,1114,7,loaded-page,2024-06-15 15:16:13.771284,"{""page"": ""finished_user_study"", ""context"": {""u...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625
1114,1115,7,changed-viewport,2024-06-15 15:16:13.772407,"{""viewport"": {""left"": 0, ""top"": 0, ""width"": 16...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625
1115,1116,7,loaded-page,2024-06-15 15:57:32.982331,"{""page"": ""finished_user_study"", ""context"": {""u...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625


In [196]:
df_interaction.interaction_type.unique()

array(['loaded-page', 'changed-viewport', 'selected-item',
       'deselected-item', 'on-input', 'elicitation-ended',
       'iteration-started', 'iteration-ended', 'study-ended'],
      dtype=object)

### Enrich data

In [197]:
import json

def set_iteration(row):
    if row['interaction_type'] in ["iteration-started", "iteration-ended"]:
        return json.loads(row['data'])["iteration"]
    else:
        return np.nan

def set_result_layout(row):
    if row['interaction_type'] == "iteration-started":
        return json.loads(row['data'])["result_layout"]
    else:
        return np.nan

def set_mapping(row, df):
    if row['interaction_type'] == "iteration-started":
        dat = json.loads(row['data'])["algorithm_assignment"]
        for key, value in dat.items():
            col_name = value["name"]
            if col_name not in df.columns:
                df[col_name] = np.nan
            df.at[row.name, col_name] = value["order"]

df_interaction.loc[:, 'iteration'] = df_interaction.apply(set_iteration, axis=1)
df_interaction.loc[:, 'result_layout'] = df_interaction.apply(set_result_layout, axis=1)

for index, row in df_interaction.iterrows():
    set_mapping(row, df_interaction)

def forward_fill(df, col):
    df.loc[:, col] = df[col].ffill()

forward_fill(df_interaction, 'iteration')
forward_fill(df_interaction, 'result_layout')

fixed_columns = ['iteration', 'result_layout', 'interaction_type', 'time', 'data', 'participation', 'id', 'time_joined', 'time_finished', 'duration']
for col_name in df_interaction.columns:
    if col_name not in fixed_columns:
        forward_fill(df_interaction, col_name)

df_interaction = df_interaction.dropna(subset=['iteration'])

df_interaction.head()


Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration,iteration,result_layout,EASE,Higher Order CF,HigherOrderCFWithNovelty
42,43,1,iteration-started,2024-06-15 14:20:39.792726,"{""iteration"": 1, ""movies"": {""EASE"": {""movies"":...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,
43,44,1,loaded-page,2024-06-15 14:20:39.920353,"{""page"": ""compare_algorithms"", ""context"": {""ur...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,
44,45,1,changed-viewport,2024-06-15 14:20:39.975237,"{""viewport"": {""left"": 0, ""top"": 0, ""width"": 16...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,
45,46,1,selected-item,2024-06-15 14:20:45.242660,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,
46,47,1,selected-item,2024-06-15 14:20:47.106634,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,


In [198]:
df_interaction.loc[42]

id                                                                         43
participation                                                               1
interaction_type                                            iteration-started
time                                               2024-06-15 14:20:39.792726
data                        {"iteration": 1, "movies": {"EASE": {"movies":...
id_part                                                                     1
time_joined                                        2024-06-15 14:12:28.237252
time_finished                                      2024-06-15 14:26:32.426407
duration                                                           844.189155
iteration                                                                 1.0
result_layout                                                     max-columns
EASE                                                                      1.0
Higher Order CF                                                 

In [199]:
json.loads(df_interaction.loc[42]['data'])

{'iteration': 1,
 'movies': {'EASE': {'movies': [{'movie': 'Everything Everywhere All at Once (2022)',
     'url': '/static/datasets/ml-latest/img/270698.jpg',
     'movie_idx': '1631',
     'movie_id': 270698,
     'genres': ['Action', 'Comedy', 'Sci-Fi']},
    {'movie': 'Guardians of the Galaxy Volume 3 (2023)',
     'url': '/static/datasets/ml-latest/img/285593.jpg',
     'movie_idx': '1782',
     'movie_id': 285593,
     'genres': ['Action', 'Adventure', 'Sci-Fi']},
    {'movie': "Howl's Moving Castle (Hauru no ugoku shiro) (2004)",
     'url': '/static/datasets/ml-latest/img/31658.jpg',
     'movie_idx': '460',
     'movie_id': 31658,
     'genres': ['Adventure', 'Animation', 'Fantasy', 'Romance']},
    {'movie': 'Kubo and the Two Strings (2016)',
     'url': '/static/datasets/ml-latest/img/162578.jpg',
     'movie_idx': '1278',
     'movie_id': 162578,
     'genres': ['Adventure', 'Animation', 'Children', 'Fantasy']},
    {'movie': 'Moneyball (2011)',
     'url': '/static/dataset

### Filter out needed data

In [200]:
df_interaction = df_interaction.copy()
df_interaction["variant"] = -1
print(df_interaction.shape)
is_selected = df_interaction["interaction_type"] == "selected-item"
df_interaction.loc[is_selected, "variant"] = (
    df_interaction.loc[is_selected, "data"]
    .map(lambda x: json.loads(x)["selected_item"].get("variant", -1))
)
df_interaction = df_interaction.loc[df_interaction.iteration <= 8]
print(df_interaction.shape)

(1075, 15)
(1075, 15)


In [201]:
selected_item_interactions = df_interaction[df_interaction.variant >= 0].copy()
selected_item_interactions.head()

Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration,iteration,result_layout,EASE,Higher Order CF,HigherOrderCFWithNovelty,variant
45,46,1,selected-item,2024-06-15 14:20:45.242660,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0
46,47,1,selected-item,2024-06-15 14:20:47.106634,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0
48,49,1,selected-item,2024-06-15 14:20:51.291526,"{""selected_item"": {""genres"": [""Comedy"", ""Drama...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,1
51,52,1,selected-item,2024-06-15 14:20:56.138772,"{""selected_item"": {""genres"": [""Action"", ""Adven...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0
75,76,1,selected-item,2024-06-15 14:22:10.115838,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,2.0,max-columns,0.0,1.0,,1


In [202]:
iteration_started_data = df_interaction[df_interaction["interaction_type"] == "iteration-started"].copy()
iteration_started_data

Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration,iteration,result_layout,EASE,Higher Order CF,HigherOrderCFWithNovelty,variant
42,43,1,iteration-started,2024-06-15 14:20:39.792726,"{""iteration"": 1, ""movies"": {""EASE"": {""movies"":...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,-1
72,73,1,iteration-started,2024-06-15 14:22:02.046805,"{""iteration"": 2, ""movies"": {""EASE"": {""movies"":...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,2.0,max-columns,0.0,1.0,,-1
99,100,1,iteration-started,2024-06-15 14:23:27.225373,"{""iteration"": 3, ""movies"": {""EASE"": {""movies"":...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,3.0,max-columns,0.0,1.0,,-1
122,123,1,iteration-started,2024-06-15 14:25:07.837637,"{""iteration"": 4, ""movies"": {""EASE"": {""movies"":...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,4.0,max-columns,1.0,0.0,,-1
148,149,1,iteration-started,2024-06-15 14:25:51.502531,"{""iteration"": 5, ""movies"": {""EASE"": {""movies"":...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,5.0,max-columns,0.0,1.0,,-1
201,202,2,iteration-started,2024-06-15 14:29:29.922026,"{""iteration"": 1, ""movies"": {""EASE"": {""movies"":...",2,2024-06-15 14:27:45.994813,2024-06-15 14:38:41.402801,655.407988,1.0,max-columns,0.0,1.0,,-1
229,230,2,iteration-started,2024-06-15 14:30:18.977867,"{""iteration"": 2, ""movies"": {""EASE"": {""movies"":...",2,2024-06-15 14:27:45.994813,2024-06-15 14:38:41.402801,655.407988,2.0,max-columns,0.0,1.0,,-1
260,261,2,iteration-started,2024-06-15 14:32:09.867129,"{""iteration"": 3, ""movies"": {""EASE"": {""movies"":...",2,2024-06-15 14:27:45.994813,2024-06-15 14:38:41.402801,655.407988,3.0,max-columns,1.0,0.0,,-1
313,314,2,iteration-started,2024-06-15 14:35:17.469743,"{""iteration"": 4, ""movies"": {""EASE"": {""movies"":...",2,2024-06-15 14:27:45.994813,2024-06-15 14:38:41.402801,655.407988,4.0,max-columns,0.0,1.0,,-1
356,357,2,iteration-started,2024-06-15 14:37:09.316843,"{""iteration"": 5, ""movies"": {""EASE"": {""movies"":...",2,2024-06-15 14:27:45.994813,2024-06-15 14:38:41.402801,655.407988,5.0,max-columns,0.0,1.0,,-1


## EASE vs High Order Collaborative Filtering

Participations with ids *1, 2, 4* compared EASE with High Order Collaborative Filtering, so we keep only the interactions from these participations.

In [203]:
df_interaction_ease_vs_hocf = selected_item_interactions[selected_item_interactions['participation'].isin([1, 2, 4])]
df_interaction_ease_vs_hocf

Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration,iteration,result_layout,EASE,Higher Order CF,HigherOrderCFWithNovelty,variant
45,46,1,selected-item,2024-06-15 14:20:45.242660,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0
46,47,1,selected-item,2024-06-15 14:20:47.106634,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0
48,49,1,selected-item,2024-06-15 14:20:51.291526,"{""selected_item"": {""genres"": [""Comedy"", ""Drama...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,1
51,52,1,selected-item,2024-06-15 14:20:56.138772,"{""selected_item"": {""genres"": [""Action"", ""Adven...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0
75,76,1,selected-item,2024-06-15 14:22:10.115838,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,2.0,max-columns,0.0,1.0,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586,587,4,selected-item,2024-06-15 14:47:43.560253,"{""selected_item"": {""genres"": [""Action"", ""Adven...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0
587,588,4,selected-item,2024-06-15 14:47:44.878296,"{""selected_item"": {""genres"": [""Adventure"", ""Co...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0
588,589,4,selected-item,2024-06-15 14:47:45.680278,"{""selected_item"": {""genres"": [""Action"", ""Sci-F...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0
590,591,4,selected-item,2024-06-15 14:47:47.228796,"{""selected_item"": {""genres"": [""Action"", ""Adven...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0


In [204]:
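# Label each selection with the algorithm that produced the selected item:
# `variant` is compared against the list index stored in the `EASE` column,
# and every non-matching selection is attributed to High Order CF.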
df_interaction_ease_vs_hocf = df_interaction_ease_vs_hocf.copy()
df_interaction_ease_vs_hocf["selected_algorithm"] = "HIGH ORDER CF"
df_interaction_ease_vs_hocf.loc[df_interaction_ease_vs_hocf.variant == df_interaction_ease_vs_hocf.EASE, "selected_algorithm"] = "EASE"

df_interaction_ease_vs_hocf

Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration,iteration,result_layout,EASE,Higher Order CF,HigherOrderCFWithNovelty,variant,selected_algorithm
45,46,1,selected-item,2024-06-15 14:20:45.242660,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0,HIGH ORDER CF
46,47,1,selected-item,2024-06-15 14:20:47.106634,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0,HIGH ORDER CF
48,49,1,selected-item,2024-06-15 14:20:51.291526,"{""selected_item"": {""genres"": [""Comedy"", ""Drama...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,1,EASE
51,52,1,selected-item,2024-06-15 14:20:56.138772,"{""selected_item"": {""genres"": [""Action"", ""Adven...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0,HIGH ORDER CF
75,76,1,selected-item,2024-06-15 14:22:10.115838,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,2.0,max-columns,0.0,1.0,,1,HIGH ORDER CF
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586,587,4,selected-item,2024-06-15 14:47:43.560253,"{""selected_item"": {""genres"": [""Action"", ""Adven...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0,HIGH ORDER CF
587,588,4,selected-item,2024-06-15 14:47:44.878296,"{""selected_item"": {""genres"": [""Adventure"", ""Co...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0,HIGH ORDER CF
588,589,4,selected-item,2024-06-15 14:47:45.680278,"{""selected_item"": {""genres"": [""Action"", ""Sci-F...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0,HIGH ORDER CF
590,591,4,selected-item,2024-06-15 14:47:47.228796,"{""selected_item"": {""genres"": [""Action"", ""Adven...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0,HIGH ORDER CF


### Metrics

#### Which algorithm attracted more selections?
We will use a chi-square test to check whether the difference is statistically significant.

In [205]:
num_participants = 3 # I gave the survey only to three people

num_iterations = df_interaction_ease_vs_hocf['iteration'].nunique()
num_items = 10  # 10 items per iteration per algorithm

total_attempts = num_participants * num_iterations * num_items

selection_counts = df_interaction_ease_vs_hocf['selected_algorithm'].value_counts()


selection_counts_df = pd.DataFrame({
    'Algorithm': selection_counts.index,
    'Selections': selection_counts.values,
    'Failures': total_attempts - selection_counts.values
})

selection_counts_df.set_index('Algorithm', inplace=True)
print(selection_counts_df)

               Selections  Failures
Algorithm                          
HIGH ORDER CF          73        77
EASE                   54        96


High Order Collaborative Filtering attracted more selections than EASE (73 vs. 54, each out of 150 displayed items). We will now perform a chi-square test to determine whether this difference is statistically significant.

In [206]:
from scipy.stats import chi2_contingency

contingency_table = selection_counts_df.values

chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

Chi-Square Statistic: 4.424013472304401
P-value: 0.03543659255495137
Degrees of Freedom: 1
Expected Frequencies:
[[63.5 86.5]
 [63.5 86.5]]


The chi-square statistic of 4.424 measures how far the observed frequencies deviate from the expected ones (for a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default). The p-value of 0.035 is below the commonly used significance level of 0.05, so we reject the null hypothesis that both algorithms have the same selection rate. With only three participants the individual selections are not fully independent, so the result should be read with caution.
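
As a cross-check, the statistic and p-value reported above can be reproduced by hand. The sketch below is a minimal, standard-library-only illustration that assumes the 2×2 table of selections and failures shown earlier. The helper name `chi_square_2x2` is hypothetical and not part of the notebook. It applies Yates' continuity correction and uses the identity that, for one degree of freedom, the chi-square survival function equals `erfc(sqrt(x / 2))`.

```python
import math

def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table (hypothetical helper)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(table[i][j] - expected)
            if yates:
                # Yates' correction shrinks each deviation by at most 0.5.
                diff -= min(0.5, diff)
            statistic += diff ** 2 / expected
    # For 1 degree of freedom: P(X >= x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Rows: HIGH ORDER CF, EASE; columns: selections, failures (from the table above).
observed = [[73, 77], [54, 96]]
stat_corrected, p_corrected = chi_square_2x2(observed, yates=True)
stat_plain, p_plain = chi_square_2x2(observed, yates=False)
print(f"with Yates correction:    chi2={stat_corrected:.3f}, p={p_corrected:.4f}")
print(f"without Yates correction: chi2={stat_plain:.3f}, p={p_plain:.4f}")
```

The corrected variant is the more conservative of the two, which matters here because the p-value is close to the 0.05 threshold.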

#### Average DCG score

We need to link each selection to the display position (`order`) assigned to its algorithm in the corresponding iteration. The mapping is built from the `iteration-started` events:

In [207]:
# Filter iteration-started interactions
df_iteration_started_ease_vs_hocf = iteration_started_data[iteration_started_data['participation'].isin([1, 2, 4])].copy()
selected_item_interactions_ease_vs_hocf = df_interaction_ease_vs_hocf.copy()


def extract_ordering(data_json):
    data = json.loads(data_json)
    return data.get('algorithm_assignment', {})

df_iteration_started_ease_vs_hocf['ordering'] = df_iteration_started_ease_vs_hocf['data'].apply(extract_ordering)

# Link
def extract_algorithm_positions(json_data):
    interaction_data = json.loads(json_data)
    algorithm_data = interaction_data['algorithm_assignment']
    positions = {v['name']: v['order'] for v in algorithm_data.values()}
    return positions

df_iteration_started_ease_vs_hocf['positions'] = df_iteration_started_ease_vs_hocf['data'].apply(extract_algorithm_positions)


# Ensure that the DataFrame is multi-indexed with 'participation' and 'iteration'
position_mapping = df_iteration_started_ease_vs_hocf[['participation', 'iteration', 'positions']].set_index(['participation', 'iteration'])

def find_position(row):
    try:
        positions = position_mapping.loc[(row['participation'], row['iteration']), 'positions']
        selected_algorithm_upper = row['selected_algorithm'].upper()
        # NOTE: the keys of `positions` are the assignment names (e.g. "EASE", "Higher Order CF"),
        # so "HIGH ORDER CF" is not found and falls back to the default position 1.
        return positions.get(selected_algorithm_upper, 1)
    except KeyError:
        return -1  # Default position if no entry found

selected_item_interactions_ease_vs_hocf['display_position'] = selected_item_interactions_ease_vs_hocf.apply(find_position, axis=1)
selected_item_interactions_ease_vs_hocf
# position_mapping

Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration,iteration,result_layout,EASE,Higher Order CF,HigherOrderCFWithNovelty,variant,selected_algorithm,display_position
45,46,1,selected-item,2024-06-15 14:20:45.242660,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0,HIGH ORDER CF,1
46,47,1,selected-item,2024-06-15 14:20:47.106634,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0,HIGH ORDER CF,1
48,49,1,selected-item,2024-06-15 14:20:51.291526,"{""selected_item"": {""genres"": [""Comedy"", ""Drama...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,1,EASE,1
51,52,1,selected-item,2024-06-15 14:20:56.138772,"{""selected_item"": {""genres"": [""Action"", ""Adven...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,1.0,max-columns,1.0,0.0,,0,HIGH ORDER CF,1
75,76,1,selected-item,2024-06-15 14:22:10.115838,"{""selected_item"": {""genres"": [""Animation"", ""Dr...",1,2024-06-15 14:12:28.237252,2024-06-15 14:26:32.426407,844.189155,2.0,max-columns,0.0,1.0,,1,HIGH ORDER CF,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586,587,4,selected-item,2024-06-15 14:47:43.560253,"{""selected_item"": {""genres"": [""Action"", ""Adven...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0,HIGH ORDER CF,1
587,588,4,selected-item,2024-06-15 14:47:44.878296,"{""selected_item"": {""genres"": [""Adventure"", ""Co...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0,HIGH ORDER CF,1
588,589,4,selected-item,2024-06-15 14:47:45.680278,"{""selected_item"": {""genres"": [""Action"", ""Sci-F...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0,HIGH ORDER CF,1
590,591,4,selected-item,2024-06-15 14:47:47.228796,"{""selected_item"": {""genres"": [""Action"", ""Adven...",4,2024-06-15 14:42:53.758059,2024-06-15 14:48:03.921291,310.163232,5.0,max-columns,1.0,0.0,,0,HIGH ORDER CF,1
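
One detail of the lookup above is worth making explicit: the keys of `positions` are the assignment names from the log (for example `EASE` and `Higher Order CF`), while `selected_algorithm` uses the label `HIGH ORDER CF`, so `positions.get(...)` returns its default for that label. The sketch below shows one possible way to translate labels into assignment names before the lookup. The names `LABEL_TO_ASSIGNMENT_NAME` and `lookup_order`, and the example dictionary, are hypothetical.

```python
# Hypothetical mapping from the labels used in `selected_algorithm`
# to the assignment names that appear in the logged data.
LABEL_TO_ASSIGNMENT_NAME = {
    "EASE": "EASE",
    "HIGH ORDER CF": "Higher Order CF",
}

def lookup_order(positions, label, missing=-1):
    """Return the display order for `label`, or `missing` if it is unknown."""
    name = LABEL_TO_ASSIGNMENT_NAME.get(label)
    return positions.get(name, missing)

# Example dictionary in the shape built by extract_algorithm_positions.
example_positions = {"EASE": 1, "Higher Order CF": 0}

direct = example_positions.get("HIGH ORDER CF".upper(), 1)  # falls back to the default
mapped = lookup_order(example_positions, "HIGH ORDER CF")   # resolves to the logged order
print(direct, mapped)
```

Returning `missing` instead of a valid position for unknown labels keeps lookup failures visible in the resulting column.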


In [208]:
def calculate_dcg(position_list):
    position_list = [x + 1 for x in position_list if x >= 0]  # ignore -1 (not found)
    if not position_list:
        return 0
    dcg = sum((2**1 - 1) / np.log2(p + 1) for p in position_list)
    return dcg

# Group the data by participation ID and algorithm, then calculate DCG
dcg_scores = selected_item_interactions_ease_vs_hocf[selected_item_interactions_ease_vs_hocf['display_position'] >= 0].groupby(['participation', 'selected_algorithm']).agg({
    'display_position': lambda positions: calculate_dcg(list(positions))
}).reset_index()

dcg_scores

Unnamed: 0,participation,selected_algorithm,display_position
0,1,EASE,6.26186
1,1,HIGH ORDER CF,11.356736
2,2,EASE,23.047438
3,2,HIGH ORDER CF,13.249525
4,4,EASE,15.463946
5,4,HIGH ORDER CF,21.451612


In [209]:
average_dcg_scores = dcg_scores.groupby('selected_algorithm')['display_position'].mean()
average_dcg_scores

selected_algorithm
EASE             14.924415
HIGH ORDER CF    15.352624
Name: display_position, dtype: float64

1. **DCG score:**
   - `display_position` holds the display order (0 or 1) looked up for the algorithm behind each selection, and `calculate_dcg` turns these positions into one DCG score per participation and algorithm. A higher DCG is better, since selections at earlier positions contribute more.
   - The mean DCG for EASE is approximately 14.92, while the mean DCG for Higher Order CF is approximately 15.35.

2. **Ranking Quality:**
   - Higher Order CF scores slightly higher than EASE on average, but the ordering is not consistent across participants: EASE scores higher for participation 2, Higher Order CF for participations 1 and 4.
   - Because `find_position` falls back to position 1 for every `HIGH ORDER CF` selection (the label does not match the assignment name `Higher Order CF`), the Higher Order CF scores mainly reflect the number of selections rather than where the items were displayed.

Overall, the difference between the two averages is too small to be considered meaningful.
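
To make the scores above easier to interpret, here is a minimal sketch of the formula used in `calculate_dcg`, assuming binary relevance (every selection counts as relevant) and 0-based display positions. The function name `dcg_binary` is hypothetical. Each selection at 0-based position `p` contributes `1 / log2(p + 2)`, so the score grows with the number of selections as well as with how early they appear.

```python
import math

def dcg_binary(positions):
    """DCG with binary relevance over 0-based positions; negatives are skipped."""
    return sum(1.0 / math.log2(p + 2) for p in positions if p >= 0)

# A selection at position 0 contributes 1.0; at position 1 about 0.631.
print(dcg_binary([0]))
print(dcg_binary([1]))

# The score is a sum, so it scales with the number of selections:
# here 18 selections, all at position 1.
print(dcg_binary([1] * 18))
```

With 18 selections at position 1 the sketch gives about 11.357, which equals the score shown above for HIGH ORDER CF in participation 1.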

#### Novelty and Diversity

Now let's analyze the novelty and diversity of the items selected from each algorithm, using the release year and genres of the items.

##### Extracting and preparing data

In [210]:
def parse_movie_details(data_json):
    details = json.loads(data_json)
    selected_item = details.get('selected_item', {})
    genres = selected_item.get('genres', [])
    movie_details = selected_item.get('movie', "")
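    year = None
    # Parse the text inside the first pair of parentheses as the release year.
    # Titles whose first parentheses hold something else (e.g. an alternative
    # title) leave `year` as None.
    if '(' in movie_details and ')' in movie_details:
        year = movie_details[movie_details.find('(')+1:movie_details.find(')')]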
        try:
            year = int(year)  
        except ValueError:
            year = None
    return genres, year

selected_item_interactions_ease_vs_hocf['genres'], selected_item_interactions_ease_vs_hocf['year'] = zip(
    *selected_item_interactions_ease_vs_hocf['data'].map(parse_movie_details)
)

selected_item_interactions_ease_vs_hocf[['genres', 'year']]

Unnamed: 0,genres,year
45,"[Animation, Drama, Fantasy, Romance]",2019.0
46,"[Animation, Drama, Romance]",2016.0
48,"[Comedy, Drama]",2019.0
51,"[Action, Adventure, Sci-Fi]",2019.0
75,"[Animation, Drama, Romance]",
...,...,...
586,"[Action, Adventure, Sci-Fi]",2016.0
587,"[Adventure, Comedy, Sci-Fi]",2022.0
588,"[Action, Sci-Fi]",2021.0
590,"[Action, Adventure, Thriller]",2019.0
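
The slicing approach above reads the first parenthesised group in the title, so titles that carry an alternative name before the year do not yield a year. A minimal sketch of a more tolerant variant, assuming the release year is a four-digit number in the last parenthesised group of the title (the function name `extract_year` is hypothetical):

```python
import re

# Four digits inside parentheses at the end of the title, e.g. "... (2004)".
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the trailing release year as an int, or None if there is none."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else None

print(extract_year("Moneyball (2011)"))
print(extract_year("Howl's Moving Castle (Hauru no ugoku shiro) (2004)"))
print(extract_year("Untitled"))
```

Anchoring the pattern at the end of the string keeps alternative titles in earlier parentheses from being mistaken for the year.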


##### Diversity

In [211]:
selected_item_interactions_ease_vs_hocf['diversity_score'] = selected_item_interactions_ease_vs_hocf['genres'].apply(lambda x: len(set(x)))

average_diversity = selected_item_interactions_ease_vs_hocf.groupby('selected_algorithm')['diversity_score'].mean()

average_diversity

selected_algorithm
EASE             3.277778
HIGH ORDER CF    3.068493
Name: diversity_score, dtype: float64

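Items selected from **EASE** lists have slightly more genres on average (3.28) than items selected from **Higher Order CF** lists (3.07). Since the score counts the genres of each selected item, this suggests that EASE may lead to somewhat more genre-diverse selections.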
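
The score above counts genres per selected item. A complementary view, sketched below under the assumption that each selection carries a list of genre strings, is to measure how different the selected items are from one another using the mean pairwise Jaccard distance between their genre sets. The function names and the sample genre lists are hypothetical and only illustrate the computation.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two genre collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def mean_pairwise_distance(genre_lists):
    """Average Jaccard distance over all pairs of items (0.0 for fewer than two)."""
    pairs = list(combinations(genre_lists, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical selections, not taken from the study data.
similar = [["Action", "Sci-Fi"], ["Action", "Sci-Fi"], ["Action", "Adventure", "Sci-Fi"]]
varied = [["Action", "Sci-Fi"], ["Comedy", "Drama"], ["Animation", "Romance"]]
print(mean_pairwise_distance(similar))
print(mean_pairwise_distance(varied))
```

A set of selections with many genres per item can still score low on this measure if all items share the same genres, which is why the two views can disagree.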

##### Novelty

In [212]:
selected_item_interactions_ease_vs_hocf.dropna(subset=['year'], inplace=True)

average_year = selected_item_interactions_ease_vs_hocf.groupby('selected_algorithm')['year'].mean()

average_year

selected_algorithm
EASE             2012.481481
HIGH ORDER CF    2018.333333
Name: year, dtype: float64

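Items selected from **HIGH ORDER CF** lists are newer on average (mean release year 2018.3) than items selected from **EASE** lists (2012.5). Using release year as a proxy for novelty, this suggests that HIGH ORDER CF may provide more novel recommendations to users.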


## High Order Collaborative Filtering vs High Order Collaborative Filtering with Novelty
The two algorithms differ only in a final re-ranking step based on novelty: the novelty variant re-ranks items using item popularity and the average release year of the items the user rated positively.
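
The exact re-ranking formula is not given here, so the following is only an illustrative sketch of how a novelty-aware re-ranking of this kind could look. Everything in it is an assumption: the `Candidate` fields, the weights `popularity_weight` and `year_weight`, and the way the two signals are combined with the base score.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    title: str
    score: float        # relevance score from the base recommender
    popularity: float   # assumed to be normalised to [0, 1]
    year: int

def rerank_by_novelty(candidates, user_mean_year,
                      popularity_weight=0.5, year_weight=0.01):
    """Re-rank candidates: penalise popular items and items far from the
    mean release year of the user's positively rated items (assumed scheme)."""
    def adjusted(c):
        return (c.score
                - popularity_weight * c.popularity
                - year_weight * abs(c.year - user_mean_year))
    return sorted(candidates, key=adjusted, reverse=True)

# Hypothetical candidates, not taken from the study data.
candidates = [
    Candidate("Blockbuster", score=0.90, popularity=0.95, year=1999),
    Candidate("Niche recent", score=0.80, popularity=0.10, year=2019),
    Candidate("Mid", score=0.85, popularity=0.50, year=2015),
]
reranked = rerank_by_novelty(candidates, user_mean_year=2018)
print([c.title for c in reranked])
```

In this toy setting the item with the highest base score drops to the end of the list because it is both very popular and far from the user's mean year.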

Participations with ids 5, 6, 7 compared High Order Collaborative Filtering with High Order Collaborative Filtering with Novelty, so we keep only the interactions from these participations.

In [213]:
df_interaction_hocf_vs_hocfn = selected_item_interactions[selected_item_interactions['participation'].isin([5, 6, 7])].copy()
df_interaction_hocf_vs_hocfn.drop(['EASE'], axis=1, inplace=True)
df_interaction_hocf_vs_hocfn

Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration,iteration,result_layout,Higher Order CF,HigherOrderCFWithNovelty,variant
629,630,5,selected-item,2024-06-15 14:54:35.111571,"{""selected_item"": {""genres"": [""Comedy"", ""Roman...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0
630,631,5,selected-item,2024-06-15 14:54:36.156283,"{""selected_item"": {""genres"": [""Drama"", ""Horror...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0
632,633,5,selected-item,2024-06-15 14:54:38.736736,"{""selected_item"": {""genres"": [""Action"", ""Comed...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0
634,635,5,selected-item,2024-06-15 14:54:42.139504,"{""selected_item"": {""genres"": [""Action"", ""Thril...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0
641,642,5,selected-item,2024-06-15 14:55:13.913135,"{""selected_item"": {""genres"": [""Comedy"", ""Crime...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1086,1087,7,selected-item,2024-06-15 15:15:50.942784,"{""selected_item"": {""genres"": [""Action"", ""Child...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1
1088,1089,7,selected-item,2024-06-15 15:15:52.948022,"{""selected_item"": {""genres"": [""Action"", ""Adven...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1
1090,1091,7,selected-item,2024-06-15 15:15:54.860855,"{""selected_item"": {""genres"": [""Action"", ""Adven...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1
1091,1092,7,selected-item,2024-06-15 15:15:55.612278,"{""selected_item"": {""genres"": [""Adventure"", ""Co...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1


In [214]:
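# Label each selection with the algorithm that produced the selected item:
# `variant` is compared against the list index stored in the `Higher Order CF` column,
# and every non-matching selection is attributed to the novelty variant.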
df_interaction_hocf_vs_hocfn = df_interaction_hocf_vs_hocfn.copy()
df_interaction_hocf_vs_hocfn["selected_algorithm"] = "HigherOrderCFWithNovelty"
df_interaction_hocf_vs_hocfn.loc[df_interaction_hocf_vs_hocfn.variant == df_interaction_hocf_vs_hocfn["Higher Order CF"], "selected_algorithm"] = "HIGH ORDER CF"

df_interaction_hocf_vs_hocfn

Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration,iteration,result_layout,Higher Order CF,HigherOrderCFWithNovelty,variant,selected_algorithm
629,630,5,selected-item,2024-06-15 14:54:35.111571,"{""selected_item"": {""genres"": [""Comedy"", ""Roman...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0,HIGH ORDER CF
630,631,5,selected-item,2024-06-15 14:54:36.156283,"{""selected_item"": {""genres"": [""Drama"", ""Horror...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0,HIGH ORDER CF
632,633,5,selected-item,2024-06-15 14:54:38.736736,"{""selected_item"": {""genres"": [""Action"", ""Comed...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0,HIGH ORDER CF
634,635,5,selected-item,2024-06-15 14:54:42.139504,"{""selected_item"": {""genres"": [""Action"", ""Thril...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0,HIGH ORDER CF
641,642,5,selected-item,2024-06-15 14:55:13.913135,"{""selected_item"": {""genres"": [""Comedy"", ""Crime...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1086,1087,7,selected-item,2024-06-15 15:15:50.942784,"{""selected_item"": {""genres"": [""Action"", ""Child...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty
1088,1089,7,selected-item,2024-06-15 15:15:52.948022,"{""selected_item"": {""genres"": [""Action"", ""Adven...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty
1090,1091,7,selected-item,2024-06-15 15:15:54.860855,"{""selected_item"": {""genres"": [""Action"", ""Adven...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty
1091,1092,7,selected-item,2024-06-15 15:15:55.612278,"{""selected_item"": {""genres"": [""Adventure"", ""Co...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty


### Metrics

#### Which algorithm attracted more selections?
For this we will use a chi-square test to see whether the difference is statistically significant.

In [215]:
num_participants = 3 # I gave the survey only to three people

num_iterations = df_interaction_hocf_vs_hocfn['iteration'].nunique()
num_items = 10  # 10 items per iteration per algorithm

total_attempts = num_participants * num_iterations * num_items

selection_counts = df_interaction_hocf_vs_hocfn['selected_algorithm'].value_counts()


selection_counts_df = pd.DataFrame({
    'Algorithm': selection_counts.index,
    'Selections': selection_counts.values,
    'Failures': total_attempts - selection_counts.values
})

selection_counts_df.set_index('Algorithm', inplace=True)
print(selection_counts_df)

                          Selections  Failures
Algorithm                                     
HigherOrderCFWithNovelty          61        89
HIGH ORDER CF                     31       119


From this we can see that users selected items from the High Order Collaborative Filtering algorithm with Novelty about twice as often as items from plain High Order Collaborative Filtering (61 vs. 31 selections, out of 150 displayed items each). We will now perform a chi-square test to determine whether this difference is statistically significant.

In [216]:
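from scipy.stats import chi2_contingency  # imported here in case an earlier cell has not done so

contingency_table = selection_counts_df.values

# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default
chi2, p, dof, expected = chi2_contingency(contingency_table)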

print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

Chi-Square Statistic: 13.184573578595318
P-value: 0.00028226303497656607
Degrees of Freedom: 1
Expected Frequencies:
[[ 46. 104.]
 [ 46. 104.]]


The chi-square test shows a statistically significant difference in selection rates between the "Higher Order CF with Novelty" and "Higher Order CF" algorithms (χ² ≈ 13.18, p ≈ 0.0003, well below 0.05). Together with the counts above, this suggests that the novelty variant is more effective at attracting user selections.
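
To make the reported numbers easier to check, here is a small sketch that recomputes the expected frequencies, the statistic and the p-value by hand from the counts in the table above. It assumes that the reported statistic includes Yates' continuity correction (the SciPy default for 2x2 tables); the variable names are only illustrative.

```python
import math

# Observed counts from the table above: [selections, failures] per algorithm
observed = [[61, 89],   # HigherOrderCFWithNovelty
            [31, 119]]  # HIGH ORDER CF

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Yates' continuity correction subtracts 0.5 from each |observed - expected|
chi2 = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# With one degree of freedom, the chi-square survival function is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(expected)
print(chi2, p_value)
```

The expected count of 46 selections per algorithm is simply the 92 total selections split evenly, because both algorithms had the same number of displayed items.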

#### Average DCG score

We need to link selections to the positions of displayed items. We will use the following mapping:

In [217]:
# Filter iteration-started interactions
df_iteration_started_hocf_vs_hocfn = iteration_started_data[iteration_started_data['participation'].isin([5, 6, 7])].copy()
selected_item_interactions_hocf_vs_hocfn = df_interaction_hocf_vs_hocfn.copy()

df_iteration_started_hocf_vs_hocfn['ordering'] = df_iteration_started_hocf_vs_hocfn['data'].apply(extract_ordering)

df_iteration_started_hocf_vs_hocfn['positions'] = df_iteration_started_hocf_vs_hocfn['data'].apply(extract_algorithm_positions)

# Ensure that the DataFrame is multi-indexed with 'participation' and 'iteration'
position_mapping = df_iteration_started_hocf_vs_hocfn[['participation', 'iteration', 'positions']].set_index(['participation', 'iteration'])

selected_item_interactions_hocf_vs_hocfn['display_position'] = selected_item_interactions_hocf_vs_hocfn.apply(find_position, axis=1)
selected_item_interactions_hocf_vs_hocfn

Unnamed: 0,id,participation,interaction_type,time,data,id_part,time_joined,time_finished,duration,iteration,result_layout,Higher Order CF,HigherOrderCFWithNovelty,variant,selected_algorithm,display_position
629,630,5,selected-item,2024-06-15 14:54:35.111571,"{""selected_item"": {""genres"": [""Comedy"", ""Roman...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0,HIGH ORDER CF,1
630,631,5,selected-item,2024-06-15 14:54:36.156283,"{""selected_item"": {""genres"": [""Drama"", ""Horror...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0,HIGH ORDER CF,1
632,633,5,selected-item,2024-06-15 14:54:38.736736,"{""selected_item"": {""genres"": [""Action"", ""Comed...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0,HIGH ORDER CF,1
634,635,5,selected-item,2024-06-15 14:54:42.139504,"{""selected_item"": {""genres"": [""Action"", ""Thril...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,0,HIGH ORDER CF,1
641,642,5,selected-item,2024-06-15 14:55:13.913135,"{""selected_item"": {""genres"": [""Comedy"", ""Crime...",5,2024-06-15 14:51:19.561055,2024-06-15 15:00:27.865696,548.304641,1.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1086,1087,7,selected-item,2024-06-15 15:15:50.942784,"{""selected_item"": {""genres"": [""Action"", ""Child...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty,1
1088,1089,7,selected-item,2024-06-15 15:15:52.948022,"{""selected_item"": {""genres"": [""Action"", ""Adven...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty,1
1090,1091,7,selected-item,2024-06-15 15:15:54.860855,"{""selected_item"": {""genres"": [""Action"", ""Adven...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty,1
1091,1092,7,selected-item,2024-06-15 15:15:55.612278,"{""selected_item"": {""genres"": [""Adventure"", ""Co...",7,2024-06-15 15:13:01.125449,2024-06-15 15:16:13.657074,192.531625,5.0,max-columns,0.0,1.0,1,HigherOrderCFWithNovelty,1


In [218]:
dcg_scores = selected_item_interactions_hocf_vs_hocfn[selected_item_interactions_hocf_vs_hocfn['display_position'] >= 0].groupby(['participation', 'selected_algorithm']).agg({
    'display_position': lambda positions: calculate_dcg(list(positions))
}).reset_index()

dcg_scores

Unnamed: 0,participation,selected_algorithm,display_position
0,5,HIGH ORDER CF,8.202087
1,5,HigherOrderCFWithNovelty,10.725806
2,6,HIGH ORDER CF,6.309298
3,6,HigherOrderCFWithNovelty,4.416508
4,7,HIGH ORDER CF,5.047438
5,7,HigherOrderCFWithNovelty,23.344401


In [219]:
average_dcg_scores = dcg_scores.groupby('selected_algorithm')['display_position'].mean()
average_dcg_scores

selected_algorithm
HIGH ORDER CF                6.519607
HigherOrderCFWithNovelty    12.828905
Name: display_position, dtype: float64

Averaged over participants, "Higher Order CF with Novelty" has the higher DCG (12.83 vs. 6.52), although participant 6 is an exception. These scores should be read with care: in the rows shown above every selection has `display_position` 1, and each DCG value equals a whole number of selections multiplied by 1/log2(3) ≈ 0.631 (for example, 6.309298 = 10 × 0.6309298). The DCG here therefore mirrors the selection counts and does not show that either algorithm ranks relevant items higher in the list. It is consistent with the selection result: users preferred "Higher Order CF with Novelty" over "HIGH ORDER CF", possibly because of the novelty of its items.
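
The `calculate_dcg` helper is defined outside this section. As a rough sketch only, assuming binary relevance and a discount of 1/log2(position + 2) for 0-based display positions, a DCG over a list of selected positions could look like the code below. The name `dcg_sketch` and the formula are assumptions, not the notebook's own code; under this assumption, ten selections at position 1 give 6.309298, which matches one of the values in the table.

```python
import math

def dcg_sketch(positions):
    """Sum a 1 / log2(position + 2) discount over 0-based display positions.

    Hypothetical stand-in for calculate_dcg: every selected item counts as
    relevance 1, and negative positions (item not found) are skipped.
    """
    return sum(1.0 / math.log2(pos + 2) for pos in positions if pos >= 0)

# Ten selections, all at display position 1
ten_at_position_one = dcg_sketch([1] * 10)
print(round(ten_at_position_one, 6))
```

With this form, a score grows with every additional selection, so an algorithm with more selections gets a higher DCG even when the positions are identical.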

#### Novelty and Diversity
Now let's look at the novelty and diversity of the items selected from the two algorithms, using the release year and the genres of the items.

In [220]:
selected_item_interactions_hocf_vs_hocfn['genres'], selected_item_interactions_hocf_vs_hocfn['year'] = zip(
    *selected_item_interactions_hocf_vs_hocfn['data'].map(parse_movie_details)
)

selected_item_interactions_hocf_vs_hocfn[['genres', 'year']]

Unnamed: 0,genres,year
629,"[Comedy, Romance]",1995
630,"[Drama, Horror, Mystery, Sci-Fi, Thriller]",2019
632,"[Action, Comedy, Sci-Fi]",2022
634,"[Action, Thriller]",2022
641,"[Comedy, Crime]",2000
...,...,...
1086,"[Action, Children, Comedy, Sci-Fi]",2022
1088,"[Action, Adventure, Fantasy]",2022
1090,"[Action, Adventure, Fantasy, Sci-Fi]",2021
1091,"[Adventure, Comedy, Sci-Fi]",2022


##### Diversity

In [221]:
selected_item_interactions_hocf_vs_hocfn['diversity_score'] = selected_item_interactions_hocf_vs_hocfn['genres'].apply(lambda x: len(set(x)))

average_diversity = selected_item_interactions_hocf_vs_hocfn.groupby('selected_algorithm')['diversity_score'].mean()

average_diversity

selected_algorithm
HIGH ORDER CF               3.000000
HigherOrderCFWithNovelty    2.918033
Name: diversity_score, dtype: float64

The average number of genres per selected item is almost the same for both algorithms (3.00 vs. 2.92). This is no surprise, since novelty is the only difference between the two algorithms.
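
To illustrate what this score measures, the sketch below repeats the per-item genre count in plain Python, using only the nine rows visible in the truncated output above. Because it uses that small subset, its averages are not the full-data values; the variable names are illustrative.

```python
from collections import defaultdict

# (selected_algorithm, genres) for the rows visible in the truncated output
rows = [
    ("HIGH ORDER CF", ["Comedy", "Romance"]),
    ("HIGH ORDER CF", ["Drama", "Horror", "Mystery", "Sci-Fi", "Thriller"]),
    ("HIGH ORDER CF", ["Action", "Comedy", "Sci-Fi"]),
    ("HIGH ORDER CF", ["Action", "Thriller"]),
    ("HigherOrderCFWithNovelty", ["Comedy", "Crime"]),
    ("HigherOrderCFWithNovelty", ["Action", "Children", "Comedy", "Sci-Fi"]),
    ("HigherOrderCFWithNovelty", ["Action", "Adventure", "Fantasy"]),
    ("HigherOrderCFWithNovelty", ["Action", "Adventure", "Fantasy", "Sci-Fi"]),
    ("HigherOrderCFWithNovelty", ["Adventure", "Comedy", "Sci-Fi"]),
]

# Diversity score of one item = number of distinct genres it carries
genre_counts = defaultdict(list)
for algorithm, genres in rows:
    genre_counts[algorithm].append(len(set(genres)))

# Average the per-item scores for each algorithm
average_genres = {
    algorithm: sum(counts) / len(counts)
    for algorithm, counts in genre_counts.items()
}
print(average_genres)
```

Note that this score describes how many genres a single selected item has, not how different the selected items are from each other.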

##### Novelty

Now let's calculate the average release year of the items selected from each of the two algorithms.

In [222]:
selected_item_interactions_hocf_vs_hocfn.dropna(subset=['year'], inplace=True)

average_year = selected_item_interactions_hocf_vs_hocfn.groupby('selected_algorithm')['year'].mean()

average_year

selected_algorithm
HIGH ORDER CF               2017.064516
HigherOrderCFWithNovelty    2021.213115
Name: year, dtype: float64

My test users told me that they prefer newer movies, and the results reflect this. The items selected from "Higher Order CF with Novelty" are newer on average (2021.2 vs. 2017.1) than those selected from "HIGH ORDER CF", because, as mentioned, the novelty variant takes into account the average year of the items that the user rated positively.

We could also try to calculate *IP* and *INV*, but since my three users have different preferences and the number of test users is small, these metrics are hard to calculate reliably.