## Experiments on Community Detection in RecSys

This Jupyter notebook runs a series of experiments to evaluate how the performance of selected recommendation algorithms changes when a community detection step is added. The experiments use the MovieLens 100k and Jester datasets and the scikit-surprise library.

The main goal of these experiments is to verify whether integrating community detection into a recommender system improves recommendation accuracy. Each algorithm is evaluated with the RMSE, MSE, and MAE metrics, and the results are saved as a CSV file for further analysis.
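
For reference, the three metrics can be sketched in plain Python as follows. This only illustrates the definitions; the notebook itself relies on `surprise.accuracy`, and the function name `error_metrics` and the sample numbers are made up for the example.

```python
import math


def error_metrics(true_ratings, predicted_ratings):
    """Return (rmse, mse, mae) for two equal-length sequences of ratings."""
    if len(true_ratings) != len(predicted_ratings) or not true_ratings:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [t - p for t, p in zip(true_ratings, predicted_ratings)]
    mse = sum(e * e for e in errors) / len(errors)   # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
    return math.sqrt(mse), mse, mae                  # RMSE is the square root of MSE


# Toy values, not taken from either dataset.
rmse, mse, mae = error_metrics([4.0, 3.0, 5.0], [3.0, 3.0, 3.0])
```

Because RMSE and MSE square each error, they penalize large mistakes more heavily than MAE does, which is one reason to report all three.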

### Setting up the environment

In [1]:
"""
    Importing needed libs
"""


In [3]:
import os
import numpy as np
import pandas as pd
from typing import List
from surprise import (
    accuracy,
    Reader,
    Dataset,
    CoClustering,
    KNNBasic,
    NMF,
    SVD
)
from surprise.model_selection.split import ShuffleSplit
from surprise.trainset import Trainset
import networkx as nx
from cdlib import algorithms 
from tqdm.notebook import tqdm
from sklearn.metrics.pairwise import pairwise_distances
import warnings
warnings.filterwarnings("ignore")

In [4]:
"""
    Setting up functions
"""


In [5]:
def uncouple(train_set: Trainset, test_set: list):
    """
    Description:
        It takes in a Surprise Trainset object and 
        a list of test set data, and returns two pandas 
        dataframes: one containing the training set 
        data, and the other containing the test set 
        data.
    Input: 
        train_set: a Trainset object containing the 
        training set data. 
        test_set: a list of (uid, iid, rating) tuples 
        containing the test set data
    Output:
        df_train: a pandas dataframe containing the 
        training set data, with columns 'uid', 'iid' 
        and 'rating'. The ids are Surprise inner ids 
        (integers), as yielded by all_ratings().
        df_test: a pandas dataframe containing the test 
        set data, with columns 'uid', 'iid', and 'rating'.
        The ids are the raw ids of the dataset.
    """
    # Build the frame in one pass: appending row by row with .loc is very slow.
    df_train = pd.DataFrame(list(train_set.all_ratings()), columns=['uid', 'iid', 'rating'])
    df_test = pd.DataFrame.from_records(test_set, columns=['uid', 'iid', 'rating'])

    return df_train, df_test



def get_similarity_matrix(data: pd.DataFrame, index: List[str], columns: List[str], values: str, metric: str):
    """
    Description:
        It takes the ratings data and returns a
        user-user matrix of pairwise distances.
        Null data is filled with zero.
    Input: 
        data: pandas DataFrame containing the data 
        to be transformed into a rating matrix
        index: a list of strings representing the 
        column names that will be used as the index 
        (rows) of the rating matrix.
        columns: a list of strings representing the 
        column names that will be used as the columns 
        of the rating matrix.
        values: a string representing the column name 
        that holds the ratings.
        metric: the metric to use when calculating 
        distance between instances
    Output:
        similarity_matrix: a NumPy array of user-user
        pairwise distances, with rows in sorted index 
        order. Despite the name, lower values mean 
        more similar users.
    """
    # scikit-learn only knows 'euclidean'; the misspelling is accepted as an alias.
    if metric == 'euclidian':
        metric = 'euclidean'
    if metric not in ['cosine', 'euclidean', 'l1', 'l2']:
        raise ValueError('Invalid metric. Please choose one of the following: cosine, euclidean, l1 or l2')
    rating_matrix = data.pivot_table(index=index, columns=columns, values=values).fillna(0)
    similarity_matrix = pairwise_distances(rating_matrix, metric=metric)
    
    return similarity_matrix
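
Since `pairwise_distances` returns distances, smaller values mean more similar users. If a true similarity is wanted as an edge weight, one possible conversion is sketched below in plain Python. The helper names and the `1 / (1 + d)` mapping are assumptions chosen for illustration; they are not part of the notebook.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two rating vectors (1 - cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention used here: an all-zero vector is similar to nothing
    return 1.0 - dot / (norm_u * norm_v)


def distance_to_similarity(distance, metric):
    """Map a distance to a similarity where larger means closer (assumed mapping)."""
    if metric == 'cosine':
        return 1.0 - distance        # recovers the cosine similarity
    return 1.0 / (1.0 + distance)    # unbounded distances mapped into (0, 1]


# Toy rating vectors; unrated items are 0, as in the rating matrix above.
alice, bob, carol = [5, 3, 0], [5, 3, 0], [0, 0, 4]
same_taste = distance_to_similarity(cosine_distance(alice, bob), 'cosine')
no_overlap = distance_to_similarity(cosine_distance(alice, carol), 'cosine')
```

Whether distances or similarities are the better edge weight depends on the community detector, so this choice should be made deliberately.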

In [6]:
"""
    Setting up experiment algorithms 
"""


In [7]:
algos_recommendation = {
    'SVD': SVD(),
    'k-NN': KNNBasic(), 
    'NMF': NMF(), 
    'Co-Clustering': CoClustering()
}

communities_detectors = {
    'Not-Applicable': None,
    'Louvain': algorithms.louvain,
    'Paris': algorithms.paris,
    'Surprise': algorithms.surprise_communities 
}

### Running experiment

In [8]:
results = []
problematic_execs = []
for dataset in tqdm(['ml-100k', 'jester'], desc='General Progress', leave=True):
    data = Dataset.load_builtin(dataset)
    for test_size in tqdm([0.25, 0.1, 0.01], desc='Test Size Progress', leave=False):
        shuffle_split = ShuffleSplit(n_splits=100, test_size=test_size)
        # One split_id per train/test split, starting at 1.
        for split_id, (trainset, testset) in enumerate(
                tqdm(shuffle_split.split(data), desc='Splits Progress', leave=False), start=1):
            for similarity_metric in tqdm(['cosine', 'euclidean', 'l1', 'l2'], desc='Similarity Metric Progress', leave=False):
                for detector_name, community_detector in tqdm(communities_detectors.items(), desc='Community Detector Progress', leave=False):
                    for algo_name, algo in tqdm(algos_recommendation.items(), desc='Algorithm Progress', leave=False):
                        # Context stored with every failure entry.
                        exec_info = [dataset, similarity_metric, detector_name, algo_name, test_size, split_id]
                        if community_detector is not None:
                            trainpd, testpd = uncouple(trainset, testset)
                            similarity_matrix = get_similarity_matrix(trainpd, index=['uid'], columns=['iid'],
                                                                      values='rating', metric=similarity_metric)
                            # from_numpy_array stores each matrix entry as the edge 'weight'
                            # (from_numpy_matrix was removed in NetworkX 3.0).
                            G = nx.from_numpy_array(similarity_matrix)
                            try:
                                if detector_name == 'Paris':
                                    coms = community_detector(G)
                                elif detector_name == 'Louvain':
                                    # cdlib's louvain takes `weight`, not `weights`.
                                    coms = community_detector(G, weight='weight')
                                else:
                                    coms = community_detector(G, weights='weight')
                            except Exception as e:
                                problematic_execs.append(['community detection', *exec_info])
                                continue
                            # Node i is the i-th user id in sorted order (pivot_table sorts its index).
                            inner_uids = np.sort(trainpd['uid'].unique())
                            # Use the dataset's own rating scale (Jester is not on a 1-5 scale).
                            reader = Reader(rating_scale=trainset.rating_scale)
                            all_predictions = []
                            for community in tqdm(coms.communities, desc='Communities progress', leave=False):
                                community_uids = [int(inner_uids[node]) for node in community]
                                raw_uids = [trainset.to_raw_uid(u) for u in community_uids]
                                # trainpd holds inner ids and testpd raw ids, so convert the training
                                # rows to raw ids: the refitted model must recognise the test users/items.
                                train_community = trainpd[trainpd['uid'].isin(community_uids)].copy()
                                train_community['uid'] = train_community['uid'].map(lambda u: trainset.to_raw_uid(int(u)))
                                train_community['iid'] = train_community['iid'].map(lambda i: trainset.to_raw_iid(int(i)))
                                test_community = testpd[testpd['uid'].isin(raw_uids)]

                                train_surprise = Dataset.load_from_df(train_community[['uid', 'iid', 'rating']],
                                                                      reader)
                                train_surprise = train_surprise.build_full_trainset()
                                test_surprise = list(test_community.itertuples(index=False, name=None))

                                try:
                                    algo.fit(train_surprise)
                                except Exception as e:
                                    problematic_execs.append(['model fitting', *exec_info])
                                    continue
                                try:
                                    predictions = algo.test(test_surprise)
                                except Exception as e:
                                    problematic_execs.append(['model prediction', *exec_info])
                                    continue
                                # Pool the predictions of all communities before scoring.
                                all_predictions.extend(predictions)
                        else:
                            try:
                                algo.fit(trainset)
                            except Exception as e:
                                problematic_execs.append(['model fitting', *exec_info])
                                continue
                            try:
                                all_predictions = algo.test(testset)
                            except Exception as e:
                                problematic_execs.append(['model prediction', *exec_info])
                                continue
                        if not all_predictions:
                            # The accuracy functions raise on an empty prediction list.
                            problematic_execs.append(['no predictions', *exec_info])
                            continue
                        result_dict = {
                            'dataset': dataset,
                            'similarity_metric': similarity_metric,
                            'community_detector': detector_name,
                            'algorithm_rec': algo_name,
                            'test_size': test_size,
                            'split_id': split_id,
                            'rmse': accuracy.rmse(all_predictions, verbose=False),
                            'mse': accuracy.mse(all_predictions, verbose=False),
                            'mae': accuracy.mae(all_predictions, verbose=False)
                        }
                        results.append(result_dict)


# Saving results as .csv file
df_results = pd.DataFrame(results)
file_name = 'results.csv'
notebook_dir = os.getcwd()
outputs_dir = notebook_dir.replace('notebooks', 'outputs')
# Create the output directory if it does not exist yet.
os.makedirs(outputs_dir, exist_ok=True)
file_path = os.path.join(outputs_dir, file_name)
df_results.to_csv(file_path, index=False)
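
Once `results.csv` exists, a first comparison could average RMSE per community detector and algorithm. The sketch below uses only the standard library and the column names written above. The function name `mean_rmse_by_group` is hypothetical, and the three sample rows are placeholders rather than experimental results.

```python
import csv
import io
from collections import defaultdict


def mean_rmse_by_group(csv_file):
    """Average the 'rmse' column per (community_detector, algorithm_rec) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of rmse, row count]
    for row in csv.DictReader(csv_file):
        key = (row['community_detector'], row['algorithm_rec'])
        totals[key][0] += float(row['rmse'])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Placeholder rows that only demonstrate the file layout.
sample = io.StringIO(
    "community_detector,algorithm_rec,rmse\n"
    "Not-Applicable,SVD,1.0\n"
    "Not-Applicable,SVD,3.0\n"
    "Louvain,SVD,2.0\n"
)
summary = mean_rmse_by_group(sample)
```

With the real file, the same function can be called on `open(file_path, newline='')`. Comparing each detector against the `Not-Applicable` baseline for the same algorithm is the comparison the experiment is designed around.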


In [None]:
problematic_execs