**This is the resit assignment for the Recommender Systems group project.**

**Name & Surname:** Selin YAZICI

**Student ID:** i6205952

The research question this assignment aims to answer is: 'Does the performance of a group recommender that uses the least misery strategy differ with the size of the groups?'

In [1]:
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser

# The Datasets

In [2]:
movies = pd.read_csv('preprocessed_dataset/movies.csv')
ratings = pd.read_csv('preprocessed_dataset/ratings.csv')

Because the ratings dataset is large, I filter out outlier users. First, the number of distinct items each user rated is computed. Then users whose count falls outside the interquartile-range (IQR) bounds are removed.

In [3]:
user_evaluation_counts = ratings.groupby('user')['item'].nunique()
print("Number of Items Evaluated by Each User:")
print(user_evaluation_counts)

Number of Items Evaluated by Each User:
user
1      154
2       16
3       29
4      130
5       30
      ... 
606    579
607    120
608    578
609     25
610    730
Name: item, Length: 610, dtype: int64


In [4]:
Q1 = user_evaluation_counts.quantile(0.25)
Q3 = user_evaluation_counts.quantile(0.75)

# IQR (Interquartile Range)
IQR = Q3 - Q1

# upper and lower bounds to identify outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# remove the outliers
outliers = (user_evaluation_counts < lower_bound) | (user_evaluation_counts > upper_bound)

# filtering the ratings 
filtered_ratings = ratings[ratings['user'].isin(user_evaluation_counts[~outliers].index)]

print("Lower Bound for Outliers:", lower_bound)
print("Upper Bound for Outliers:", upper_bound)
print("Outliers (user IDs):")
print(user_evaluation_counts[outliers])
print("Filtered DataFrame without Outliers:")
print(filtered_ratings)


Lower Bound for Outliers: -109.0
Upper Bound for Outliers: 243.0
Outliers (user IDs):
user
18     320
19     492
21     280
28     352
42     312
      ... 
600    493
603    570
606    579
608    578
610    730
Name: item, Length: 67, dtype: int64
Filtered DataFrame without Outliers:
       user  item  rating  timestamp
0         1     1     4.0  964982703
1         1     3     4.0  964981247
2         1     6     4.0  964982224
3         1    70     3.0  964982400
4         1   101     5.0  964980868
...     ...   ...     ...        ...
63703   609   786     3.0  847221025
63704   609   833     3.0  847221080
63705   609   892     3.0  847221080
63706   609  1056     3.0  847221080
63707   609  1059     3.0  847221054

[32516 rows x 4 columns]


# Methods We Need

The function below generates random groups of sizes 2, 3, and 5.

In [5]:
def generate_user_groups(df, column_name, group_sizes):
    unique_users = df[column_name].unique()

    random_groups = {}
    
    for size in group_sizes:
        # Randomly shuffle the unique user IDs
        np.random.shuffle(unique_users)

        # Calculate the number of groups for the given size
        num_groups = len(unique_users) // size

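        # Generate random groups of the specified size based on the specified column.
        # Note: keys restart at 0 for every size, so a later size overwrites any
        # group already stored under the same index
        for i in range(num_groups):
            group_users = unique_users[i * size: (i + 1) * size]
            group = df[df[column_name].isin(group_users)]
            random_groups[i] = group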

    return random_groups
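
Because the function above stores each group under the index `i` alone, groups of different sizes share keys. A minimal sketch of one alternative is shown here, assuming it is acceptable to key groups by a `(size, index)` pair and to fix the random seed. The name `generate_groups_by_size` and the `seed` parameter are hypothetical and are not part of the notebook.

```python
import random

def generate_groups_by_size(user_ids, group_sizes, seed=0):
    """Partition user_ids into random groups, once per requested size.

    Groups are keyed by (size, index), so groups of different sizes
    never overwrite one another.
    """
    rng = random.Random(seed)  # seeded so the grouping is reproducible
    groups = {}
    for size in group_sizes:
        shuffled = list(user_ids)
        rng.shuffle(shuffled)
        # Users left over after the last full group are dropped
        for i in range(len(shuffled) // size):
            groups[(size, i)] = shuffled[i * size:(i + 1) * size]
    return groups

groups = generate_groups_by_size(range(1, 11), [2, 3, 5])
print(len(groups))  # 5 groups of 2, 3 groups of 3, 2 groups of 5
```

With this layout the group size can be read directly from the key, so no separate count per group is needed later.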


# Holdout Validation Strategy
The dataset is split with a holdout strategy: 80% of the ratings are used for training and 20% for testing.

In [6]:
def holdout_method(df, test_size=0.2, random_state=None):
    # Split the given DataFrame into training and testing sets
    train_set, test_set = train_test_split(df, test_size=test_size, random_state=random_state)

    return train_set, test_set

In [7]:
train_set, test_set = holdout_method(filtered_ratings, test_size=0.2, random_state=42)

# Display the training and testing sets
print("Training Set:")
print(train_set)

print("\nTesting Set:")
print(test_set)

Training Set:
       user   item  rating   timestamp
48790   479   2694     3.0  1039362630
58596   591    356     4.0   970524486
3653     41   2712     5.0  1459368540
62996   607    292     4.0   963080256
21766   230  50872     3.5  1196305180
...     ...    ...     ...         ...
55969   562   6870     5.0  1368894017
9534     95   1374     4.5  1105400929
860      11   1917     4.0   901200037
27843   290     24     3.0   975032355
42385   425   3408     3.5  1085476542

[26012 rows x 4 columns]

Testing Set:
       user   item  rating   timestamp
18426   199   2671     4.0  1021178968
821      11    150     5.0   902154266
7855     74  30707     4.0  1207502554
17764   187   7360     4.5  1161849723
37971   385    589     5.0   834691845
...     ...    ...     ...         ...
75        1   1954     5.0   964982176
4         1    101     5.0   964980868
13278   137   1136     5.0  1204863777
24393   257   2406     3.5  1141625649
1108     17   1090     4.5  1322629080

[6504 row

# Collaborative Filtering

For the recommender type, I chose user-based collaborative filtering.

In [8]:
user_user = UserUser(15, min_nbrs=3)

# Wrap the algorithm as a recommender and fit it once on the training set
recsys = Recommender.adapt(user_user)
recsys.fit(train_set)

user_recommendations_list = []

for user_id in train_set['user'].unique():
    # Generate the top-5 recommendations for the current user
    user_recommendations = recsys.recommend(user_id, 5)

    # Add the 'user' column to the recommendations DataFrame
    user_recommendations['user'] = user_id

    # Append the recommendations DataFrame to the list
    user_recommendations_list.append(user_recommendations)

# Concatenate all DataFrames in the list
all_user_recommendations_train_set = pd.concat(user_recommendations_list, ignore_index=True)

# Join with the 'title' column from 'movies'
all_user_recommendations_train_set = all_user_recommendations_train_set.join(movies['title'], on='item')

# Display the recommendations for all users
display(all_user_recommendations_train_set)


could not load LIBBLAS: Could not find module 'libblas' (or one of its dependencies). Try using the full path with constructor syntax.


Unnamed: 0,item,score,user,title
0,1217,4.970144,479,assassination
1,5747,4.745396,479,
2,1178,4.733695,479,mommie dearest
3,27831,4.723562,479,
4,101,4.686537,479,mallrats
...,...,...,...,...
2705,562,5.589985,189,akira
2706,785,5.326710,189,wide awake
2707,177593,5.191265,189,
2708,56782,5.153706,189,


In [9]:
test_set['predicted_rating'] = recsys.predict(test_set)
test_set['relevant'] = test_set['rating'].apply(lambda x: 1 if x>3 else 0)
test_set['predicted_relevant'] = test_set['predicted_rating'].apply(lambda x: 1 if x>3 else 0)
test_set

Unnamed: 0,user,item,rating,timestamp,predicted_rating,relevant,predicted_relevant
18426,199,2671,4.0,1021178968,2.535208,1,0
821,11,150,5.0,902154266,4.216202,1,1
7855,74,30707,4.0,1207502554,4.341387,1,1
17764,187,7360,4.5,1161849723,,1,0
37971,385,589,5.0,834691845,3.523322,1,1
...,...,...,...,...,...,...,...
75,1,1954,5.0,964982176,4.223276,1,1
4,1,101,5.0,964980868,3.411987,1,1
13278,137,1136,5.0,1204863777,4.324568,1,1
24393,257,2406,3.5,1141625649,3.297843,1,1


In [10]:
# Here, we are creating our random groups and displaying the individual recommendations. 

group_sizes = [2, 3, 5]
random_user_groups3= generate_user_groups(all_user_recommendations_train_set, 'user', group_sizes)

for group_index, group_df in random_user_groups3.items():
    print(f"Group {group_index} Members:\n{group_df}\n")


Group 0 Members:
        item     score  user                              title
70    177593  5.039818   104                                NaN
71       171  4.875160   104              village of the damned
72       299  4.829658   104                            chasers
73      2318  4.695980   104                ernest goes to jail
74     86781  4.688648   104                                NaN
265   177593  4.390355    50                                NaN
266     5747  4.039469    50                                NaN
267   180031  4.000722    50                                NaN
268     1178  3.930307    50                     mommie dearest
269     3740  3.912079    50  cloudy with a chance of meatballs
870   112290  4.454057    54                                NaN
871     3451  4.383871    54                         maniac cop
872      911  4.301518    54                 lady and the tramp
873     3035  4.269597    54                 unfaithfully yours
874      175  4.235159 

Group 161 Members:
        item     score  user                              title
640      101  4.658233   528                           mallrats
641     3740  4.632466   528  cloudy with a chance of meatballs
642      562  4.583647   528                              akira
643     3489  4.505672   528                              gotti
644   136020  4.489592   528                                NaN
830   177593  6.068909   122                                NaN
831    56715  5.910993   122                                NaN
832     5747  5.893648   122                                NaN
833     1178  5.745178   122                     mommie dearest
834     2138  5.730167   122                            topkapi
1135    5135  5.664679   119                                NaN
1136    1217  5.652091   119                      assassination
1137    1248  5.234781   119                       perfect blue
1138     218  5.161606   119                         blue chips
1139   55442  5.15638

In [11]:
# here we are converting the group dictionary into a dataframe for the sake of easy reading 
rows = []
for group, group_data in random_user_groups3.items():
    group_df = group_data.copy()
    group_df['Group'] = group
    rows.append(group_df)

group_dataframe = pd.concat(rows)
print(group_dataframe)

        item     score  user                    title  Group
70    177593  5.039818   104                      NaN      0
71       171  4.875160   104    village of the damned      0
72       299  4.829658   104                  chasers      0
73      2318  4.695980   104      ernest goes to jail      0
74     86781  4.688648   104                      NaN      0
...      ...       ...   ...                      ...    ...
1375    2390  4.680769   559        friday after next    270
1376     175  4.513448   559               virtuosity    270
1377     930  4.287799   559           needful things    270
1378     527  4.257125   559                   brazil    270
1379     176  4.245953   559  while you were sleeping    270

[4690 rows x 5 columns]


In [12]:
user_counts = group_dataframe.groupby('Group')['user'].nunique()
print(user_counts)

Group
0      5
1      5
2      5
3      5
4      5
      ..
266    2
267    2
268    2
269    2
270    2
Name: user, Length: 271, dtype: int64


In [13]:
# Here, for the sake of readability, we list the group indices by number of members.
group_info_by_members = {count: [] for count in user_counts.unique()}

for group, count in user_counts.items():
    group_info_by_members[count].append(group)

for count, groups in group_info_by_members.items():
    print(f"Groups with {count} members are: {groups}")


Groups with 5 members are: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107]
Groups with 3 members are: [108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179]
Groups with 2 members are: [180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 2

# Least Misery Aggregation Strategy

In [14]:
def least_misery(group_ratings, recommendations_number):
    # Aggregate using the least misery strategy: an item's group score is the
    # lowest score found for it among the members' recommendation lists
    aggregated_df = group_ratings.groupby('item').min()
    aggregated_df = aggregated_df.sort_values(by="score", ascending=False).reset_index()[['item', 'score']]

    # Recommendation list based on LMS
    recommendation_list = list(aggregated_df.head(recommendations_number)['item'])

    # Scores of every row whose item was recommended. An item that appears in
    # several members' lists contributes one score per member, so this list
    # follows row order and can be longer than recommendation_list
    relevance_scores = group_ratings[group_ratings['item'].isin(recommendation_list)]['score']

    # Normalize by the highest rating in the global test_set. Predicted scores
    # can exceed that maximum, so normalized values above 1 are possible
    max_score = test_set['rating'].max()
    relevance_scores_normalized = relevance_scores / max_score
    
    return {"LMS": {"recommendations": recommendation_list, "relevance_scores": list(relevance_scores_normalized)}}
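
The function above takes the minimum only over the members whose top-5 list contains an item. Least misery is usually described as scoring each candidate item by the lowest predicted rating among all group members. The sketch below illustrates that formulation in plain Python, assuming a predicted score is available for every member and item. The function name `least_misery_top_k` and the toy scores are hypothetical.

```python
def least_misery_top_k(member_scores, k):
    """Rank items by the lowest score any group member gives them.

    member_scores: {user: {item: predicted_score}}.
    Only items scored for every member are considered, so the minimum
    reflects the least satisfied member of the whole group.
    """
    members = list(member_scores.values())
    common_items = set(members[0]).intersection(*members[1:])
    group_scores = {
        item: min(scores[item] for scores in members) for item in common_items
    }
    ranked = sorted(group_scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up predicted scores for a group of three users
toy_scores = {
    'u1': {'A': 4.5, 'B': 3.0, 'C': 4.0},
    'u2': {'A': 2.0, 'B': 3.5, 'C': 4.2},
    'u3': {'A': 5.0, 'B': 3.2, 'C': 3.9},
}
top = least_misery_top_k(toy_scores, 2)
print(top)  # [('C', 3.9), ('B', 3.0)]
```

Item A is liked by two members but ranks last because one member scores it 2.0, which is the behaviour the strategy is named after.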


In [15]:
# We are applying least misery aggregation strategy for 5 recommendations for each group.
group_recommendations_dict = {}

for group_index, group_df in random_user_groups3.items():
    group_recommendations = least_misery(group_df, 5)  
    group_recommendations_dict[group_index] = group_recommendations

# Display recommendations for each group
for group_index, recommendations in group_recommendations_dict.items():
    print(f"Group {group_index} Recommendations:\n{recommendations}\n")

    

Group 0 Recommendations:
{'LMS': {'recommendations': [1228, 57669, 3424, 1235, 101], 'relevance_scores': [1.134297014592106, 1.1272708829376326, 1.1101872120556155, 1.1097353083384174, 1.092577842085203]}}

Group 1 Recommendations:
{'LMS': {'recommendations': [2138, 140174, 1204, 562, 175], 'relevance_scores': [1.0322866557088557, 1.096179165323558, 1.0455326311620605, 1.0148393712580062, 0.9838593630260759]}}

Group 2 Recommendations:
{'LMS': {'recommendations': [1248, 8368, 46972, 2390, 750], 'relevance_scores': [1.1736856103162387, 1.1522138856827282, 1.146638727605033, 1.1657936288952773, 1.1344852178005482]}}

Group 3 Recommendations:
{'LMS': {'recommendations': [95441, 140174, 171, 177593, 8869], 'relevance_scores': [1.1359470359848254, 1.0878109728867558, 1.0796152618882418, 1.0834098256700966, 1.1226157584570804, 1.1065631468292845]}}

Group 4 Recommendations:
{'LMS': {'recommendations': [5528, 299, 562, 2138, 95441], 'relevance_scores': [1.0507945291127765, 1.1575572082455623,

In [16]:
# For the sake of readability, we are converting our results to a dataframe. 
flattened_data = []

for group_id, group_data in group_recommendations_dict.items():
    group_name = list(group_data.keys())[0]  
    recommendations = group_data[group_name]['recommendations']
    relevance_scores = group_data[group_name]['relevance_scores']
    
    row_data = {'Group': group_id, 'Group Name': group_name, 'Recommendations': recommendations, 'Relevance Scores': relevance_scores}
    flattened_data.append(row_data)

df = pd.DataFrame(flattened_data)

print(df)

     Group Group Name                     Recommendations  \
0        0        LMS      [1228, 57669, 3424, 1235, 101]   
1        1        LMS      [2138, 140174, 1204, 562, 175]   
2        2        LMS      [1248, 8368, 46972, 2390, 750]   
3        3        LMS  [95441, 140174, 171, 177593, 8869]   
4        4        LMS       [5528, 299, 562, 2138, 95441]   
..     ...        ...                                 ...   
266    266        LMS    [2138, 3740, 2135, 5135, 177593]   
267    267        LMS      [319, 95441, 4446, 1242, 1266]   
268    268        LMS       [2138, 5747, 3740, 1204, 899]   
269    269        LMS     [95441, 101, 112290, 562, 8869]   
270    270        LMS    [6807, 104879, 4848, 2640, 1204]   

                                      Relevance Scores  
0    [1.134297014592106, 1.1272708829376326, 1.1101...  
1    [1.0322866557088557, 1.096179165323558, 1.0455...  
2    [1.1736856103162387, 1.1522138856827282, 1.146...  
3    [1.1359470359848254, 1.08781097288

# nDCG

In [17]:
import numpy as np

def dcg(relevance_scores):
    return np.sum((2**relevance_scores - 1) / np.log2(np.arange(2, len(relevance_scores) + 2)))

def ndcg(relevance_scores):
    ideal_scores = np.sort(relevance_scores)[::-1]
    ideal_dcg = dcg(ideal_scores)
    if ideal_dcg == 0:
        return 0  # Avoid division by zero
    return dcg(relevance_scores) / ideal_dcg

def calculate_ndcg_for_group(df_row):
    # Convert the stored list to an array so the arithmetic below is element-wise
    relevance_scores = np.asarray(df_row['Relevance Scores'], dtype=float)

    # Rescale so the largest score in the list equals 1
    max_score = np.max(relevance_scores)
    relevance_scores_normalized = relevance_scores / max_score

    # Calculate nDCG for the group
    ndcg_value = ndcg(relevance_scores_normalized)

    return ndcg_value

# Calculate nDCG for each group
df['nDCG'] = df.apply(calculate_ndcg_for_group, axis=1)

# Display the DataFrame with nDCG values
print(df[['Group', 'Group Name', 'nDCG']])


     Group Group Name      nDCG
0        0        LMS  1.000000
1        1        LMS  0.988697
2        2        LMS  0.998746
3        3        LMS  0.995314
4        4        LMS  0.981625
..     ...        ...       ...
266    266        LMS  1.000000
267    267        LMS  1.000000
268    268        LMS  0.981719
269    269        LMS  0.999988
270    270        LMS  1.000000

[271 rows x 3 columns]
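
To make the formula concrete, here is a small worked example of the same DCG and nDCG definitions in plain Python. The relevance values are made up for illustration, and the helper names `dcg_list` and `ndcg_list` are hypothetical. The sketch shows that nDCG depends only on how far the order of the relevance values is from descending order, so a list that is already sorted from high to low scores exactly 1.

```python
import math

def dcg_list(relevances):
    # Gain 2^rel - 1, discounted by log2(rank + 1) with ranks starting at 1
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )

def ndcg_list(relevances):
    # The ideal ordering places the most relevant items first
    ideal = dcg_list(sorted(relevances, reverse=True))
    return dcg_list(relevances) / ideal if ideal > 0 else 0.0

best_order = ndcg_list([3, 2, 1])   # already in ideal order
worst_order = ndcg_list([1, 2, 3])  # most relevant item ranked last
print(best_order, worst_order)
```

In this toy case the reversed list scores roughly 0.68, while the sorted list scores 1.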


In [18]:
# displaying each nDCG value for each group. 
df

Unnamed: 0,Group,Group Name,Recommendations,Relevance Scores,nDCG
0,0,LMS,"[1228, 57669, 3424, 1235, 101]","[1.134297014592106, 1.1272708829376326, 1.1101...",1.000000
1,1,LMS,"[2138, 140174, 1204, 562, 175]","[1.0322866557088557, 1.096179165323558, 1.0455...",0.988697
2,2,LMS,"[1248, 8368, 46972, 2390, 750]","[1.1736856103162387, 1.1522138856827282, 1.146...",0.998746
3,3,LMS,"[95441, 140174, 171, 177593, 8869]","[1.1359470359848254, 1.0878109728867558, 1.079...",0.995314
4,4,LMS,"[5528, 299, 562, 2138, 95441]","[1.0507945291127765, 1.1575572082455623, 1.091...",0.981625
...,...,...,...,...,...
266,266,LMS,"[2138, 3740, 2135, 5135, 177593]","[1.0507945291127765, 1.0182559531146524, 1.016...",1.000000
267,267,LMS,"[319, 95441, 4446, 1242, 1266]","[1.1946810297706891, 1.1250054423566511, 1.086...",1.000000
268,268,LMS,"[2138, 5747, 3740, 1204, 899]","[1.1974779023161282, 1.1435468231820334, 1.143...",0.981719
269,269,LMS,"[95441, 101, 112290, 562, 8869]","[1.0481541174307862, 1.0411096046555537, 1.018...",0.999988


# Answering the Research Question 

In [19]:
df_merged = df.merge(group_dataframe, on='Group')
ndcg_by_members = {count: set() for count in user_counts.unique()}

for count, groups in group_info_by_members.items():
    ndcg_values = set(df_merged[df_merged['Group'].isin(groups)]['nDCG'])
    ndcg_by_members[count].update(ndcg_values)

for count, ndcg_values in ndcg_by_members.items():
    print(f"Groups with {count} members have unique nDCG values: {ndcg_values}")


Groups with 5 members have unique nDCG values: {0.9816248612025689, 1.0, 0.9892939893289305, 0.9953561609286075, 0.9908300686086475, 0.9673568361644291, 0.9862436121201696, 0.9996264499112544, 0.9904865012642932, 0.9923045174983438, 0.996628731333857, 0.9996605085909175, 0.9953141602897947, 0.9950850047665586, 0.9944184526808687, 0.9983367363388926, 0.9962430448894541, 0.9934123518193912, 0.9998014132372013, 0.9854762542787945, 0.9780664337472913, 0.9920295737660724, 0.9726586158613956, 0.9759789545876573, 0.9996116621718024, 0.9399517343925874, 0.9996954173956789, 0.9973990464543937, 0.9672332165904126, 0.9865337739374784, 0.9954614090936549, 0.9944939444032859, 0.9976507039226318, 0.9971864369402622, 0.996874966997646, 0.9460706232112109, 0.9986572407246307, 0.9970625552362066, 0.9974442878211193, 0.9796790655864507, 0.99482329801606, 0.9506486507334432, 0.9982765662328549, 0.9963766870236537, 0.9810214529223541, 0.9975861183163791, 0.9652847722267969, 0.9902830902147689, 0.960327747

- Comparing the nDCG values of groups of sizes 2, 3, and 5 suggests that group size may influence the performance of the least misery strategy in a group recommender.

- Larger groups combine a broader range of preferences, which may lead to both higher and lower nDCG values.

- Smaller groups may find it harder to reach consensus, which may result in more varied nDCG scores.
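
Sets of unique values are hard to compare by eye. One way to put the three group sizes side by side is to summarize the nDCG values per size. The sketch below does this in plain Python, assuming the values have been collected into a dictionary keyed by group size. The function name `summarize_by_size` and the numbers are hypothetical and are not results from this notebook.

```python
from statistics import mean, stdev

def summarize_by_size(ndcg_by_size):
    """ndcg_by_size: {group_size: [nDCG of each group of that size]}."""
    return {
        size: {
            'n': len(values),
            'mean': mean(values),
            # stdev needs at least two values
            'std': stdev(values) if len(values) > 1 else 0.0,
        }
        for size, values in ndcg_by_size.items()
    }

# Made-up nDCG values, for illustration only
toy_ndcg = {2: [1.0, 0.98, 1.0], 3: [0.99, 0.97, 1.0], 5: [0.95, 0.98, 0.99]}
summary = summarize_by_size(toy_ndcg)
for size, stats in summary.items():
    print(size, stats)
```

Comparing the mean and the spread per size gives a more direct basis for the statements above than the raw sets do.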