**This is the resit assignment for the Recommender Systems group project.**

**Name & Surname:** Selin YAZICI

**Student ID:** i6205952

The research question this assignment aims to answer is: 'Does the performance of a group recommender that uses the least misery strategy differ with the size of the group?'

In [1]:
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser

# The datasets

In [2]:
movies = pd.read_csv('preprocessed_dataset/movies.csv')
ratings = pd.read_csv('preprocessed_dataset/ratings.csv')

Because the ratings dataset is large, I filtered out outlier users. First, the number of distinct items each user rated is computed; then users whose count falls outside the 1.5 × IQR bounds are removed.

In [3]:
user_evaluation_counts = ratings.groupby('user')['item'].nunique()
print("Number of Items Evaluated by Each User:")
print(user_evaluation_counts)

Number of Items Evaluated by Each User:
user
1      154
2       16
3       29
4      130
5       30
      ... 
606    579
607    120
608    578
609     25
610    730
Name: item, Length: 610, dtype: int64


In [4]:
Q1 = user_evaluation_counts.quantile(0.25)
Q3 = user_evaluation_counts.quantile(0.75)

# IQR (Interquartile Range)
IQR = Q3 - Q1

# upper and lower bounds to identify outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# flag the users whose rating count falls outside the bounds
outliers = (user_evaluation_counts < lower_bound) | (user_evaluation_counts > upper_bound)

# keep only the ratings of the non-outlier users
filtered_ratings = ratings[ratings['user'].isin(user_evaluation_counts[~outliers].index)]

print("Lower Bound for Outliers:", lower_bound)
print("Upper Bound for Outliers:", upper_bound)
print("Outliers (user IDs):")
print(user_evaluation_counts[outliers])
print("Filtered DataFrame without Outliers:")
print(filtered_ratings)


Lower Bound for Outliers: -109.0
Upper Bound for Outliers: 243.0
Outliers (user IDs):
user
18     320
19     492
21     280
28     352
42     312
      ... 
600    493
603    570
606    579
608    578
610    730
Name: item, Length: 67, dtype: int64
Filtered DataFrame without Outliers:
       user  item  rating  timestamp
0         1     1     4.0  964982703
1         1     3     4.0  964981247
2         1     6     4.0  964982224
3         1    70     3.0  964982400
4         1   101     5.0  964980868
...     ...   ...     ...        ...
63703   609   786     3.0  847221025
63704   609   833     3.0  847221080
63705   609   892     3.0  847221080
63706   609  1056     3.0  847221080
63707   609  1059     3.0  847221054

[32516 rows x 4 columns]
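
The bounds printed above follow the 1.5 × IQR rule. Because the lower bound is negative (-109.0) and a count cannot be negative, the filter can only remove heavy raters, that is, the 67 users with more than 243 rated items. The sketch below applies the same rule to made-up counts; the helper name `iqr_bounds` and the toy numbers are illustrative and not part of the assignment, and `np.percentile` stands in for the pandas `quantile` call (both interpolate linearly by default).

```python
import numpy as np

def iqr_bounds(counts, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR outlier rule."""
    q1, q3 = np.percentile(counts, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data: seven users with ordinary activity and one very heavy rater.
toy_counts = np.array([10, 20, 30, 40, 50, 60, 70, 500])

lower, upper = iqr_bounds(toy_counts)
kept = toy_counts[(toy_counts >= lower) & (toy_counts <= upper)]
print(lower, upper)
print(kept)
```

As in the real data, the lower bound of the toy example is negative, so only the heavy rater is dropped.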


# Methods We Need

Generating random groups of sizes 2, 3 and 5.

In [5]:
def generate_user_groups(df, column_name, group_sizes):
    """Split the users in df[column_name] into random groups of each given size."""
    unique_users = df[column_name].unique()

    random_groups = {}

    for size in group_sizes:
        # Randomly shuffle the unique user IDs (in place)
        np.random.shuffle(unique_users)

        # Calculate the number of complete groups for the given size
        num_groups = len(unique_users) // size

        # Generate random groups of the specified size based on the specified column
        for i in range(num_groups):
            group_users = unique_users[i * size: (i + 1) * size]
            group = df[df[column_name].isin(group_users)]
            # NOTE: the key i restarts at 0 for every size, so a later size
            # replaces any group an earlier size stored under the same key
            random_groups[i] = group

    return random_groups
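
In `generate_user_groups` the dictionary key is the per-size loop index, so groups of different sizes can share a key and only the last one written survives. One possible variant, sketched here under the assumption that every size should keep all of its groups, uses a running counter as the key and stores the size next to the members. The name `generate_user_groups_unique`, the `seed` parameter and the use of `np.random.default_rng` are my own choices and not part of the original code.

```python
import numpy as np

def generate_user_groups_unique(user_ids, group_sizes, seed=0):
    """Partition the user IDs into random groups once per size, with unique keys."""
    rng = np.random.default_rng(seed)
    groups = {}
    next_key = 0
    for size in group_sizes:
        shuffled = rng.permutation(user_ids)  # fresh shuffle for every size
        # Step through the shuffled IDs in blocks of `size`, dropping the remainder
        for start in range(0, len(shuffled) - size + 1, size):
            members = shuffled[start:start + size]
            groups[next_key] = {"size": size, "members": members.tolist()}
            next_key += 1
    return groups

toy_users = np.arange(1, 11)  # ten hypothetical user IDs
toy_groups = generate_user_groups_unique(toy_users, [2, 3, 5])
sizes = [g["size"] for g in toy_groups.values()]
print(len(toy_groups), sizes)
```

With ten users this yields five pairs, three triples and two groups of five, each under its own key.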


# Holdout Validation Strategy 
The holdout validation strategy splits the dataset so that 80% is used for training and 20% for testing.

In [6]:
def holdout_method(df, test_size=0.2, random_state=None):
    # Split the given dataset (not the global one) into training and testing sets
    train_set, test_set = train_test_split(df, test_size=test_size, random_state=random_state)

    return train_set, test_set

In [7]:
train_set, test_set = holdout_method(filtered_ratings, test_size=0.2, random_state=42)

# Display the training and testing sets
print("Training Set:")
print(train_set)

print("\nTesting Set:")
print(test_set)

Training Set:
       user   item  rating   timestamp
48790   479   2694     3.0  1039362630
58596   591    356     4.0   970524486
3653     41   2712     5.0  1459368540
62996   607    292     4.0   963080256
21766   230  50872     3.5  1196305180
...     ...    ...     ...         ...
55969   562   6870     5.0  1368894017
9534     95   1374     4.5  1105400929
860      11   1917     4.0   901200037
27843   290     24     3.0   975032355
42385   425   3408     3.5  1085476542

[26012 rows x 4 columns]

Testing Set:
       user   item  rating   timestamp
18426   199   2671     4.0  1021178968
821      11    150     5.0   902154266
7855     74  30707     4.0  1207502554
17764   187   7360     4.5  1161849723
37971   385    589     5.0   834691845
...     ...    ...     ...         ...
75        1   1954     5.0   964982176
4         1    101     5.0   964980868
13278   137   1136     5.0  1204863777
24393   257   2406     3.5  1141625649
1108     17   1090     4.5  1322629080

[6504 row

# Collaborative Filtering

For the recommender, I chose user-based collaborative filtering.

In [8]:
user_user = UserUser(15, min_nbrs=3)  # up to 15 neighbours, at least 3 needed to score an item

# Fit once on the whole training set; refitting inside the loop would produce
# the same model for every user and only slow the cell down
recsys = Recommender.adapt(user_user)
recsys.fit(train_set)

user_recommendations_list = []

for user_id in train_set['user'].unique():
    # Generate the top-5 recommendations for the current user
    user_recommendations = recsys.recommend(user_id, 5)  # Adjust the number of recommendations as needed

    # Add the 'user' column to the recommendations DataFrame
    user_recommendations['user'] = user_id

    # Append the recommendations DataFrame to the list
    user_recommendations_list.append(user_recommendations)

# Concatenate all DataFrames in the list
all_user_recommendations_train_set = pd.concat(user_recommendations_list, ignore_index=True)

# Join with the 'title' column from 'movies' ('item' is matched against the index of 'movies')
all_user_recommendations_train_set = all_user_recommendations_train_set.join(movies['title'], on='item')

# Display the recommendations for all users
display(all_user_recommendations_train_set)


could not load LIBBLAS: Could not find module 'libblas' (or one of its dependencies). Try using the full path with constructor syntax.


Unnamed: 0,item,score,user,title
0,1217,4.970144,479,assassination
1,5747,4.745396,479,
2,1178,4.733695,479,mommie dearest
3,27831,4.723562,479,
4,101,4.686537,479,mallrats
...,...,...,...,...
2705,562,5.589985,189,akira
2706,785,5.326710,189,wide awake
2707,177593,5.191265,189,
2708,56782,5.153706,189,


In [9]:
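# Predict a rating for every (user, item) pair in the test set
test_set['predicted_rating'] = recsys.predict(test_set)
# A rating above 3 counts as relevant; a missing (NaN) prediction compares as False and becomes 0
test_set['relevant'] = test_set['rating'].apply(lambda x: 1 if x > 3 else 0)
test_set['predicted_relevant'] = test_set['predicted_rating'].apply(lambda x: 1 if x > 3 else 0)
test_set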

Unnamed: 0,user,item,rating,timestamp,predicted_rating,relevant,predicted_relevant
18426,199,2671,4.0,1021178968,2.535208,1,0
821,11,150,5.0,902154266,4.216202,1,1
7855,74,30707,4.0,1207502554,4.341387,1,1
17764,187,7360,4.5,1161849723,,1,0
37971,385,589,5.0,834691845,3.523322,1,1
...,...,...,...,...,...,...,...
75,1,1954,5.0,964982176,4.223276,1,1
4,1,101,5.0,964980868,3.411987,1,1
13278,137,1136,5.0,1204863777,4.324568,1,1
24393,257,2406,3.5,1141625649,3.297843,1,1


In [10]:
# Create the random groups from the individual recommendations.

group_sizes = [2, 3, 5]
random_user_groups3 = generate_user_groups(all_user_recommendations_train_set, 'user', group_sizes)

# To print each group and its members, uncomment the lines below.

# for group_index, group_df in random_user_groups3.items():
#     print(f"Group {group_index} Members:\n{group_df}\n")


In [11]:
# here we are converting the group dictionary into a dataframe for the sake of easy reading 
rows = []
for group, group_data in random_user_groups3.items():
    group_df = group_data.copy()
    group_df['Group'] = group
    rows.append(group_df)

group_dataframe = pd.concat(rows)
print(group_dataframe)

       item     score  user            title  Group
1990    968  5.286746   146               54      0
1991   3072  4.668852   146         cry-baby      0
1992  87869  4.575827   146              NaN      0
1993   1274  4.368329   146  double jeopardy      0
1994   2232  4.357857   146           enigma      0
...     ...       ...   ...              ...    ...
2380    562  5.497089    56            akira    270
2381    332  5.307850    56    moll flanders    270
2382   1272  5.086116    56            tommy    270
2383  86781  5.052951    56              NaN    270
2384   3451  5.046001    56       maniac cop    270

[4690 rows x 5 columns]


In [12]:
user_counts = group_dataframe.groupby('Group')['user'].nunique()
print(user_counts)

Group
0      5
1      5
2      5
3      5
4      5
      ..
266    2
267    2
268    2
269    2
270    2
Name: user, Length: 271, dtype: int64


In [13]:
# For readability, list which groups have which number of members.
group_info_by_members = {count: [] for count in user_counts.unique()}

for group, count in user_counts.items():
    group_info_by_members[count].append(group)

for count, groups in group_info_by_members.items():
    print(f"Groups with {count} members are: {groups}")


Groups with 5 members are: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107]
Groups with 3 members are: [108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179]
Groups with 2 members are: [180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 2

# Least Misery Aggregation Strategy

In [14]:
def least_misery(group_ratings, recommendations_number):
    # Aggregate using the least misery strategy: an item's group score is the
    # lowest score any group member has for it
    aggregated_df = group_ratings.groupby('item').min()
    aggregated_df = aggregated_df.sort_values(by="score", ascending=False).reset_index()[['item', 'score']]

    # Recommendation list based on LMS
    recommendation_list = list(aggregated_df.head(recommendations_number)['item'])

    # Individual scores of the recommended items: one entry per (user, item) row,
    # in the row order of group_ratings, so this can hold more entries than the
    # recommendation list when several members share an item
    relevance_scores = group_ratings[group_ratings['item'].isin(recommendation_list)]['score']

    # The 'relevant' and 'predicted_relevant' columns of test_set were already
    # computed in an earlier cell, so they are not recomputed here

    # Normalize by the highest rating in the test set; predicted scores can
    # exceed that rating, which is why some normalized values are above 1
    max_score = test_set['rating'].max()
    relevance_scores_normalized = relevance_scores / max_score

    return {"LMS": {"recommendations": recommendation_list, "relevance_scores": list(relevance_scores_normalized)}}
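
Least misery scores each item by the lowest score among the group members, so a single unhappy member is enough to push an item down the list. The toy illustration below assumes (my assumption, not the behaviour of the function above) that only items with a score for every member are considered and that each (user, item) pair appears once; the data and the helper name `least_misery_toy` are made up.

```python
import pandas as pd

def least_misery_toy(scores, n):
    """Top-n items ranked by the minimum score across all group members."""
    n_members = scores["user"].nunique()
    per_item = scores.groupby("item")["score"].agg(["min", "count"])
    # Keep only the items that every member has a score for (assumption)
    complete = per_item[per_item["count"] == n_members]
    return complete["min"].sort_values(ascending=False).head(n)

toy_scores = pd.DataFrame({
    "user":  [1, 1, 1, 2, 2, 2],
    "item":  [10, 20, 30, 10, 20, 30],
    "score": [4.5, 3.0, 4.0, 2.0, 3.5, 4.0],
})
top = least_misery_toy(toy_scores, 2)
print(top)
```

Item 10 has the highest single score (4.5) but is ranked last because user 2 scores it 2.0, which is the behaviour the strategy is named after.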


In [15]:
# We are applying least misery aggregation strategy for 5 recommendations for each group.
group_recommendations_dict = {}

for group_index, group_df in random_user_groups3.items():
    group_recommendations = least_misery(group_df, 5)  
    group_recommendations_dict[group_index] = group_recommendations

# Display recommendations for each group
for group_index, recommendations in group_recommendations_dict.items():
    print(f"Group {group_index} Recommendations:\n{recommendations}\n")

    

Group 0 Recommendations:
{'LMS': {'recommendations': [2138, 968, 56715, 541, 177593], 'relevance_scores': [1.0573492253434664, 1.0601680375522746, 1.0573413092570687, 1.0340136205401942, 1.0693982750606579, 1.0373183652745248]}}

Group 1 Recommendations:
{'LMS': {'recommendations': [1178, 2138, 3347, 104879, 1217], 'relevance_scores': [1.0848936308490391, 1.0289913368992198, 1.0176955886344916, 1.1547877412293237, 1.1546874829112928, 1.1337588114952042]}}

Group 2 Recommendations:
{'LMS': {'recommendations': [177593, 3451, 104879, 56715, 3436], 'relevance_scores': [1.1516812877856215, 1.088973033143494, 1.0789565179159333, 1.1081587941780144, 1.092833149280328]}}

Group 3 Recommendations:
{'LMS': {'recommendations': [968, 8957, 1407, 111362, 122892], 'relevance_scores': [1.1322383621194443, 1.047190472848184, 1.0431191327244043, 1.0361544922148191, 1.0282443872978302]}}

Group 4 Recommendations:
{'LMS': {'recommendations': [1178, 1217, 177593, 299, 5747], 'relevance_scores': [1.0705074

In [16]:
# For the sake of readability, we are converting our results to a dataframe. 
flattened_data = []

for group_id, group_data in group_recommendations_dict.items():
    group_name = list(group_data.keys())[0]  
    recommendations = group_data[group_name]['recommendations']
    relevance_scores = group_data[group_name]['relevance_scores']
    
    row_data = {'Group': group_id, 'Group Name': group_name, 'Recommendations': recommendations, 'Relevance Scores': relevance_scores}
    flattened_data.append(row_data)

df = pd.DataFrame(flattened_data)

print(df)

     Group Group Name                      Recommendations  \
0        0        LMS      [2138, 968, 56715, 541, 177593]   
1        1        LMS     [1178, 2138, 3347, 104879, 1217]   
2        2        LMS  [177593, 3451, 104879, 56715, 3436]   
3        3        LMS    [968, 8957, 1407, 111362, 122892]   
4        4        LMS      [1178, 1217, 177593, 299, 5747]   
..     ...        ...                                  ...   
266    266        LMS  [140174, 104879, 177593, 101, 2318]   
267    267        LMS   [2138, 56715, 555, 177593, 104879]   
268    268        LMS     [56782, 562, 8984, 1203, 168252]   
269    269        LMS       [101, 104879, 1178, 538, 5747]   
270    270        LMS        [562, 332, 1272, 86781, 3451]   

                                      Relevance Scores  
0    [1.0573492253434664, 1.0601680375522746, 1.057...  
1    [1.0848936308490391, 1.0289913368992198, 1.017...  
2    [1.1516812877856215, 1.088973033143494, 1.0789...  
3    [1.1322383621194443, 1

# nDCG

In [17]:
import numpy as np

def dcg(relevance_scores):
    # Discounted cumulative gain: the gain 2^rel - 1 at rank i is divided by log2(i + 1)
    return np.sum((2**relevance_scores - 1) / np.log2(np.arange(2, len(relevance_scores) + 2)))

def ndcg(relevance_scores):
    # DCG of the list as given, divided by the DCG of the same scores sorted in descending order
    ideal_scores = np.sort(relevance_scores)[::-1]
    ideal_dcg = dcg(ideal_scores)
    if ideal_dcg == 0:
        return 0  # Avoid division by zero
    return dcg(relevance_scores) / ideal_dcg

def calculate_ndcg_for_group(df_row):
    # Convert the stored list to an array so the arithmetic below is element-wise
    relevance_scores = np.asarray(df_row['Relevance Scores'], dtype=float)

    # Normalize relevance scores by the largest score in the group's list
    max_score = np.max(relevance_scores)
    relevance_scores_normalized = relevance_scores / max_score

    # Calculate nDCG for the group
    ndcg_value = ndcg(relevance_scores_normalized)

    return ndcg_value

# Calculate nDCG for each group
df['nDCG'] = df.apply(calculate_ndcg_for_group, axis=1)

# Display the DataFrame with nDCG values
print(df[['Group', 'Group Name', 'nDCG']])


     Group Group Name      nDCG
0        0        LMS  0.996624
1        1        LMS  0.967761
2        2        LMS  0.997764
3        3        LMS  1.000000
4        4        LMS  0.981690
..     ...        ...       ...
266    266        LMS  0.989508
267    267        LMS  0.995388
268    268        LMS  1.000000
269    269        LMS  0.942808
270    270        LMS  1.000000

[271 rows x 3 columns]
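
For reference, nDCG is commonly computed on ground-truth relevance values listed in the order in which the system ranked the items, and then divided by the DCG of the ideal ordering of those values. The sketch below shows that calculation on made-up binary relevance; `dcg_at_k` and `ndcg_at_k` are hypothetical helpers and are not the functions used in the cell above.

```python
import numpy as np

def dcg_at_k(rels, k):
    """DCG of the first k relevance values, with gain 2^rel - 1 and a log2 discount."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(np.sum((2.0 ** rels - 1.0) / discounts))

def ndcg_at_k(rels, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Relevance of the five recommended items, in ranked order (toy values)
perfect = ndcg_at_k([1, 1, 0, 0, 0], 5)  # relevant items ranked first
swapped = ndcg_at_k([0, 1, 1, 0, 0], 5)  # a non-relevant item ranked first
print(perfect, swapped)
```

A ranking that places all relevant items first scores exactly 1, and any other ordering of the same items scores less.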


In [18]:
# displaying each nDCG value for each group. 
df

Unnamed: 0,Group,Group Name,Recommendations,Relevance Scores,nDCG
0,0,LMS,"[2138, 968, 56715, 541, 177593]","[1.0573492253434664, 1.0601680375522746, 1.057...",0.996624
1,1,LMS,"[1178, 2138, 3347, 104879, 1217]","[1.0848936308490391, 1.0289913368992198, 1.017...",0.967761
2,2,LMS,"[177593, 3451, 104879, 56715, 3436]","[1.1516812877856215, 1.088973033143494, 1.0789...",0.997764
3,3,LMS,"[968, 8957, 1407, 111362, 122892]","[1.1322383621194443, 1.047190472848184, 1.0431...",1.000000
4,4,LMS,"[1178, 1217, 177593, 299, 5747]","[1.070507499549206, 0.9783316625359595, 1.1042...",0.981690
...,...,...,...,...,...
266,266,LMS,"[140174, 104879, 177593, 101, 2318]","[1.1226157584570804, 1.1065631468292845, 1.174...",0.989508
267,267,LMS,"[2138, 56715, 555, 177593, 104879]","[1.0493828337625595, 1.0329069423584119, 1.060...",0.995388
268,268,LMS,"[56782, 562, 8984, 1203, 168252]","[1.252909415539992, 1.1413821952073278, 1.1118...",1.000000
269,269,LMS,"[101, 104879, 1178, 538, 5747]","[0.8327166545051089, 1.083252497220007, 1.0512...",0.942808


# Answering the Research Question 

In [19]:
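# Merge the nDCG table with the group members, then collect the distinct nDCG values per group size
df_merged = df.merge(group_dataframe, on='Group')
ndcg_by_members = {count: set() for count in user_counts.unique()}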

for count, groups in group_info_by_members.items():
    ndcg_values = set(df_merged[df_merged['Group'].isin(groups)]['nDCG'])
    ndcg_by_members[count].update(ndcg_values)

for count, ndcg_values in ndcg_by_members.items():
    print(f"Groups with {count} members have unique nDCG values: {ndcg_values}")


Groups with 5 members have unique nDCG values: {0.9677605391580111, 0.9816896499484835, 1.0, 0.9996817349345901, 0.9886121991141041, 0.9792175095200957, 0.9903797992873093, 0.9884576401683758, 0.9934288561121909, 0.998476592396184, 0.9747586962957265, 0.9998124870908641, 0.9962577863613974, 0.9793389763183918, 0.9990584368596026, 0.9892783390500386, 0.9969646480711243, 0.9884996108120669, 0.9779205326980114, 0.9986009711768457, 0.9824764173738747, 0.9826912957674209, 0.9872255085020399, 0.9624234748385526, 0.9850820982399505, 0.9940074946980646, 0.9690361420510288, 0.999493338205956, 0.9960931890464347, 0.9909227721193794, 0.9742113851566867, 0.9997682154855145, 0.9989086043758219, 0.9990888826630323, 0.9865772194286974, 0.9969899656231056, 0.9997389934308578, 0.9977642780943031, 0.9930393277555377, 0.983789140998363, 0.9995335601750259, 0.9874688403867697, 0.973230329102643, 0.9769107178267031, 0.9703492742981422, 0.9629239493970606, 0.9990934254417445, 0.9846936711461913, 0.997935676

- Looking at the nDCG values for groups of sizes 2, 3 and 5, the analysis suggests that group size may influence the performance of the least misery strategy in group recommender systems, although the sets of unique values printed above do not by themselves quantify the difference.

- Larger groups may bring together a broader range of preferences, which may lead to both higher and lower nDCG values.

- Smaller groups may find it harder to reach consensus, which may result in more varied nDCG scores.
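
One way to make this comparison concrete is to summarise nDCG per group size with its mean and standard deviation rather than listing sets of unique values. The sketch below shows the idea on made-up numbers; the column name `size` and all values are illustrative only and are not results from this notebook.

```python
import pandas as pd

# Hypothetical per-group results: one row per group
toy_results = pd.DataFrame({
    "size": [2, 2, 2, 3, 3, 3, 5, 5, 5],
    "nDCG": [1.00, 0.98, 0.99, 0.97, 1.00, 0.96, 0.95, 0.99, 0.97],
})

# Number of groups, mean nDCG and standard deviation for each group size
summary = toy_results.groupby("size")["nDCG"].agg(["count", "mean", "std"])
print(summary.round(4))
```

With the real results, a size column could probably be obtained with something like `df['Group'].map(user_counts)` before grouping, since `user_counts` is indexed by group.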