# Recommendation System



## Installing Surprise and mounting Drive

In [1]:
!pip install surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3095456 sha256=84be6304afced77972ca5a6c961e2b74dc4aeed8bf4feb132fe05434ef3f071e
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Libraries

Here we first import all the necessary libraries. We will also load our dataset for the recommendation system; I decided to use the **Automotive dataset**.

In [3]:
from surprise import KNNWithMeans
from surprise import prediction_algorithms
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import train_test_split
from surprise.model_selection import KFold
from surprise import CoClustering
from surprise import SVD
from surprise import SlopeOne

In [4]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from math import sqrt
from collections import defaultdict

## Data Preprocessing

### Selecting first million datapoints

Colab kept crashing when I tried to load the entire JSON file, so here I select only the first million entries from the dataset. I then drop the unused columns and save the result as a CSV file, which is later split into training and testing sets. The cells that do this are commented out below; the later cells read the saved CSV.
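
If even a capped read is too heavy, one possible alternative is to stream the JSON-lines file record by record and keep only the three fields needed later. The sketch below is not what this notebook ran; it uses only the standard library, the helper name `jsonl_to_csv` is hypothetical, and the sample records are made up.

```python
import csv
import io
import json
from itertools import islice


def jsonl_to_csv(lines, out, fields=("reviewerID", "asin", "overall"), limit=None):
    """Stream JSON-lines records into CSV, keeping only `fields`.

    `lines` is any iterable of JSON strings (for example an open file),
    `out` is a writable text stream, and `limit` caps the number of records.
    """
    writer = csv.writer(out)
    writer.writerow(fields)
    count = 0
    for line in islice(lines, limit):
        record = json.loads(line)
        writer.writerow([record.get(name) for name in fields])
        count += 1
    return count


# Tiny in-memory stand-in for the review file (made-up records).
sample = io.StringIO(
    '{"overall": 5, "reviewerID": "U1", "asin": "I1", "reviewText": "great"}\n'
    '{"overall": 3, "reviewerID": "U2", "asin": "I1", "reviewText": "ok"}\n'
    '{"overall": 4, "reviewerID": "U1", "asin": "I2", "reviewText": "good"}\n'
)
out = io.StringIO()
n_written = jsonl_to_csv(sample, out, limit=2)
print(n_written)
print(out.getvalue())
```

Because only one record is held in memory at a time, memory use stays roughly constant regardless of file size.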

In [5]:
#df = pd.read_json('/content/drive/MyDrive/Automotive.json', lines = True, nrows=1000000)

In [6]:
#df.drop(['verified', 'reviewTime', 'style', 'reviewerName', 'summary', 'unixReviewTime', 'vote', 'image'], axis=1, inplace=True)

In [7]:
#df.to_csv('/content/drive/MyDrive/23 - Spring Quarter/CSE 272 - IR/HW 2/dataset.csv', index=False)

### Splitting the data

In [8]:
dataset = pd.read_csv('/content/drive/MyDrive/23 - Spring Quarter/CSE 272 - IR/HW 2/dataset.csv')

In [9]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   overall     1000000 non-null  int64 
 1   reviewerID  1000000 non-null  object
 2   asin        1000000 non-null  object
 3   reviewText  999566 non-null   object
dtypes: int64(1), object(3)
memory usage: 30.5+ MB


In [10]:
dataset = dataset[['reviewerID','asin', 'overall']]

In [11]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   reviewerID  1000000 non-null  object
 1   asin        1000000 non-null  object
 2   overall     1000000 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 22.9+ MB


In [12]:
reader = Reader(rating_scale=(1, 5))


In [13]:
dataset = Dataset.load_from_df(dataset, reader)

In [14]:
trainset, testset = train_test_split(dataset, test_size=0.2, random_state=10)

## Co-Clustering

Co-Clustering is a collaborative filtering algorithm that simultaneously clusters users and items based on their rating patterns. It identifies groups of users and items that share similar rating behaviors. 
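
To make the idea concrete, here is a minimal sketch of how a co-clustering estimate can be assembled once users and items have been assigned to clusters: the co-cluster average, plus the user's offset from their cluster average, plus the item's offset from its cluster average. The ratings and cluster assignments are made up, and the real algorithm learns the assignments iteratively rather than taking them as given.

```python
from collections import defaultdict
from statistics import mean

# Made-up ratings and cluster assignments, for illustration only.
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 4),
    ("u2", "i1", 4), ("u2", "i3", 2),
    ("u3", "i2", 3), ("u3", "i3", 1),
]
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
item_cluster = {"i1": 0, "i2": 0, "i3": 1}


def group_means(key):
    """Average rating per group, where key(u, i) names the group."""
    groups = defaultdict(list)
    for u, i, r in ratings:
        groups[key(u, i)].append(r)
    return {g: mean(rs) for g, rs in groups.items()}


user_mean = group_means(lambda u, i: u)
item_mean = group_means(lambda u, i: i)
user_cluster_mean = group_means(lambda u, i: user_cluster[u])
item_cluster_mean = group_means(lambda u, i: item_cluster[i])
cocluster_mean = group_means(lambda u, i: (user_cluster[u], item_cluster[i]))


def predict(u, i):
    """Co-cluster average plus user and item offsets from their cluster averages."""
    cu, ci = user_cluster[u], item_cluster[i]
    return (cocluster_mean[(cu, ci)]
            + (user_mean[u] - user_cluster_mean[cu])
            + (item_mean[i] - item_cluster_mean[ci]))


print(predict("u1", "i3"))  # 2 + (4.5 - 3.75) + (1.5 - 1.5) = 2.75
```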

In [15]:
coclustering = CoClustering(n_cltr_u=3, n_cltr_i=3, n_epochs=10)

In [16]:
coclustering.fit(trainset)

<surprise.prediction_algorithms.co_clustering.CoClustering at 0x7f1632b7f160>

In [17]:
test_predictions_coclustering = coclustering.test(testset)

In [18]:
rmse_coclusterting = accuracy.rmse(test_predictions_coclustering)

RMSE: 1.2113


In [19]:
mae_coclusterting = accuracy.mae(test_predictions_coclustering)

MAE:  0.8236


In [20]:
print("The RMSE and MAE values for coclustering are:")
print("RMSE - " + str(rmse_coclusterting))
print("MAE - " + str(mae_coclusterting))

The RMSE and MAE values for coclustering are:
RMSE - 1.2113069092157847
MAE - 0.82355637704684
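
For reference, both metrics can be computed by hand from (true, estimated) rating pairs. The sketch below uses made-up pairs rather than the notebook's predictions; it shows why RMSE is never smaller than MAE, since squaring weights large errors more heavily.

```python
from math import sqrt

# Made-up (true rating, estimated rating) pairs.
pairs = [(5, 4.5), (3, 4.0), (4, 4.0), (1, 3.0)]

errors = [true_r - est for true_r, est in pairs]

# RMSE: square root of the mean squared error.
rmse = sqrt(sum(e * e for e in errors) / len(errors))

# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

print("RMSE:", rmse)
print("MAE:", mae)
```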


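## Collaborative Filtering - Matrix Factorization (SVD)

SVD is a matrix factorization-based collaborative filtering algorithm that decomposes the user-item rating matrix into low-rank matrices to capture latent factors. It is a model-based method rather than an item-based neighborhood method. To use SVD in Surprise, create an instance of the `SVD` class and customize its parameters as needed; here we use 100 latent factors, 20 training epochs, and bias terms.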
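
As a rough illustration of what the fitted model does at prediction time, the biased estimate is the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The parameter values below are hypothetical (two latent factors for readability), and the clipping to the rating scale is included on the assumption that estimates are bounded to the 1 to 5 range.

```python
# Hypothetical learned parameters for one user and one item (two latent factors).
global_mean = 4.3
user_bias = 0.2
item_bias = -0.1
user_factors = [0.5, -0.3]
item_factors = [0.4, 0.2]


def predict_svd(mu, b_u, b_i, p_u, q_i, scale=(1, 5)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    est = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = scale
    return min(high, max(low, est))


estimate = predict_svd(global_mean, user_bias, item_bias, user_factors, item_factors)
print(estimate)  # 4.3 + 0.2 - 0.1 + (0.5*0.4 - 0.3*0.2), about 4.54
```

Setting `biased=False` would correspond to dropping the mean and bias terms and keeping only the dot product.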

In [21]:
svd = SVD(n_factors=100, n_epochs=20, biased=True)


In [22]:
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f161b545210>

In [23]:
test_predictions_CF = svd.test(testset)

In [24]:
rmse_CF = accuracy.rmse(test_predictions_CF)

RMSE: 1.1268


In [25]:
mae_CF = accuracy.mae(test_predictions_CF)

MAE:  0.8204


In [26]:
print("The RMSE and MAE values for Collaborative Filtering are:")
print("RMSE - " + str(rmse_CF))
print("MAE - " + str(mae_CF))

The RMSE and MAE values for Collaborative Filtering are:
RMSE - 1.126801957095781
MAE - 0.8203837086664989


## SlopeOne

Slope One is a simple but effective collaborative filtering algorithm that calculates the average difference in ratings between items and uses this information to make predictions.
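
A minimal sketch of that idea, on made-up ratings: the deviation between two items is the average rating difference over users who rated both, and the estimate for a user is their mean rating plus the average deviation between the target item and the items they have rated. The function names here are illustrative, not Surprise's API.

```python
# Made-up ratings: user -> {item: rating}.
ratings = {
    "alice": {"i1": 5, "i2": 3},
    "bob": {"i1": 4, "i2": 2, "i3": 3},
    "carol": {"i2": 3, "i3": 5},
}


def deviation(i, j):
    """Average of (rating of i - rating of j) over users who rated both items."""
    diffs = [r[i] - r[j] for r in ratings.values() if i in r and j in r]
    return sum(diffs) / len(diffs) if diffs else None


def predict_slope_one(user, item):
    """User mean plus the average deviation of `item` from the user's rated items."""
    rated = ratings[user]
    user_mean = sum(rated.values()) / len(rated)
    devs = [deviation(item, j) for j in rated]
    devs = [d for d in devs if d is not None]
    if not devs:
        # No item in common with any other user: fall back to the user's mean.
        return user_mean
    return user_mean + sum(devs) / len(devs)


print(predict_slope_one("alice", "i3"))  # 4 + (-1 + 1.5) / 2 = 4.25
```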

In [27]:
slope_one = SlopeOne()

In [28]:
slope_one.fit(trainset)

<surprise.prediction_algorithms.slope_one.SlopeOne at 0x7f1632b7f3a0>

In [29]:
test_predictions_Slope = slope_one.test(testset)

In [30]:
rmse_slope = accuracy.rmse(test_predictions_Slope)

RMSE: 1.2413


In [31]:
mae_slope = accuracy.mae(test_predictions_Slope)

MAE:  0.8360


In [32]:
print("The RMSE and MAE values for Slope One are:")
print("RMSE - " + str(rmse_slope))
print("MAE - " + str(mae_slope))

The RMSE and MAE values for Slope One are:
RMSE - 1.2413344992878401
MAE - 0.8360409321569288


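## Ranking

Here we create a ranking function that returns the top 10 recommendations per user for each of the algorithms. Because it is applied to the test-set predictions, each user's list contains only the held-out items that user actually rated, so many lists have fewer than 10 entries.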

In [33]:
def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
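
To see the ranking step in isolation, the sketch below runs a compact variant on made-up predictions. It is an assumption-laden illustration rather than the function above: the name `top_n_per_user` is hypothetical, it uses `heapq.nlargest` instead of a full sort, and the tuples only mimic the five-field layout of Surprise's predictions.

```python
import heapq
from collections import defaultdict


def top_n_per_user(predictions, n=10):
    """Group (item, estimate) pairs by user and keep the n highest estimates."""
    by_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        by_user[uid].append((iid, est))
    return {
        uid: heapq.nlargest(n, items, key=lambda x: x[1])
        for uid, items in by_user.items()
    }


# Made-up predictions: (user id, item id, true rating, estimate, details).
fake_predictions = [
    ("u1", "i1", 5, 4.6, {}),
    ("u1", "i2", 3, 3.1, {}),
    ("u1", "i3", 4, 4.9, {}),
    ("u2", "i1", 2, 2.7, {}),
]
ranked = top_n_per_user(fake_predictions, n=2)
print(ranked)
```

A user with fewer than `n` predictions, like `u2` here, simply gets a shorter list.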

### CoClustering

In [34]:
top_10_coclustering = get_top_n(test_predictions_coclustering, n=10)

In [35]:
top_10_coclustering

defaultdict(list,
            {'A2OS2IYIDJUQM9': [('B000GZAUQ8', 4.3286675)],
             'A2SH845F7XUXHF': [('B000CPI5XM', 3.439879121666549),
              ('B000C57Z42', 2.9717093603933398)],
             'A3QDOLSOTH1WCI': [('B00068XCQU', 4.3286675)],
             'AXVYF61GIL20T': [('B000182F22', 4.3286675)],
             'A5DRCP8BSDK55': [('B000RBILHQ', 4.3286675)],
             'A30NC2G306RUN2': [('B0009N0W4W', 4.3286675)],
             'A2JEWOJVFNWMG7': [('B000CB9558', 4.3286675)],
             'A13OUXV88B50TF': [('B000P6CJ9O', 4.825721495567647)],
             'AIDFDD8YRUDO6': [('B0000AYFTH', 5)],
             'A2GE28FBAQBP8J': [('B000TCCWS2', 4.3286675)],
             'ABSV4LC32VJVL': [('B000H7CKAY', 4.839161067418434),
              ('B000EDUTXG', 4.582844002690154),
              ('B000NMDFNY', 4.442492405467514),
              ('B0002MAILC', 4.280710433343492)],
             'A3U0X4SDP7UMGS': [('B0009XDITS', 4.3286675)],
             'A30EPX5W35UTIR': [('B000TK5TLG', 4.3286

### Collaborative Filtering

In [36]:
top_10_CF = get_top_n(test_predictions_CF, n=10)

In [37]:
top_10_CF

defaultdict(list,
            {'A2OS2IYIDJUQM9': [('B000GZAUQ8', 4.761810927346904)],
             'A2SH845F7XUXHF': [('B000CPI5XM', 4.385892448563616),
              ('B000C57Z42', 3.8771563221734704)],
             'A3QDOLSOTH1WCI': [('B00068XCQU', 4.699996970400121)],
             'AXVYF61GIL20T': [('B000182F22', 4.692557104598859)],
             'A5DRCP8BSDK55': [('B000RBILHQ', 3.499075385921005)],
             'A30NC2G306RUN2': [('B0009N0W4W', 4.090773553667076)],
             'A2JEWOJVFNWMG7': [('B000CB9558', 4.495178861228623)],
             'A13OUXV88B50TF': [('B000P6CJ9O', 4.304203788634777)],
             'AIDFDD8YRUDO6': [('B0000AYFTH', 4.551643645748433)],
             'A2GE28FBAQBP8J': [('B000TCCWS2', 4.374527198285722)],
             'ABSV4LC32VJVL': [('B000H7CKAY', 4.900782996437155),
              ('B000EDUTXG', 4.8736597387285725),
              ('B000NMDFNY', 4.30810945564845),
              ('B0002MAILC', 3.79380040237621)],
             'A3U0X4SDP7UMGS': [('B0009XDI

### SlopeOne

In [38]:
top_10_slope = get_top_n(test_predictions_Slope, n=10)

In [39]:
top_10_slope

defaultdict(list,
            {'A2OS2IYIDJUQM9': [('B000GZAUQ8', 4.3286675)],
             'A2SH845F7XUXHF': [('B000C57Z42', 3.0), ('B000CPI5XM', 3.0)],
             'A3QDOLSOTH1WCI': [('B00068XCQU', 4.3286675)],
             'AXVYF61GIL20T': [('B000182F22', 4.3286675)],
             'A5DRCP8BSDK55': [('B000RBILHQ', 4.3286675)],
             'A30NC2G306RUN2': [('B0009N0W4W', 4.3286675)],
             'A2JEWOJVFNWMG7': [('B000CB9558', 4.3286675)],
             'A13OUXV88B50TF': [('B000P6CJ9O', 5)],
             'AIDFDD8YRUDO6': [('B0000AYFTH', 5)],
             'A2GE28FBAQBP8J': [('B000TCCWS2', 4.3286675)],
             'ABSV4LC32VJVL': [('B000EDUTXG', 4.6923076923076925),
              ('B000H7CKAY', 4.1923076923076925),
              ('B000NMDFNY', 4.17948717948718),
              ('B0002MAILC', 4.0)],
             'A3U0X4SDP7UMGS': [('B0009XDITS', 4.3286675)],
             'A30EPX5W35UTIR': [('B000TK5TLG', 4.3286675)],
             'A13OEU7THWACPA': [('B000TK1TRY', 4.3286675)],
     

### Precision and Recall

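Here we create a function to calculate precision and recall at k for the algorithms. An item counts as relevant when its true rating is at or above the threshold, and as recommended when its estimated rating is at or above the threshold and it is among the user's top k. The function is taken from the [Surprise documentation FAQ](https://surprise.readthedocs.io/en/stable/FAQ.html).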
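
To make the definitions concrete, here is a small worked example for a single user with made-up (estimated, true) ratings, using k=3 and a threshold of 4. It also shows the F-measure as the harmonic mean of precision and recall.

```python
# Made-up (estimated rating, true rating) pairs for a single user.
user_ratings = [(4.8, 5), (4.2, 3), (4.1, 4), (3.0, 5), (2.5, 4)]
k, threshold = 3, 4

# Rank by estimated rating and keep the top k.
top_k = sorted(user_ratings, key=lambda x: x[0], reverse=True)[:k]

# Relevant items: true rating at or above the threshold (over all of the user's items).
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)

# Recommended items: estimate at or above the threshold, within the top k.
n_rec_k = sum(est >= threshold for est, _ in top_k)

# Items in the top k that are both relevant and recommended.
n_rel_and_rec_k = sum(
    est >= threshold and true_r >= threshold for est, true_r in top_k
)

precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
recall = n_rel_and_rec_k / n_rel if n_rel else 0
f_measure = (
    2 * precision * recall / (precision + recall) if precision + recall else 0
)

print(precision, recall, f_measure)  # 2/3, 1/2, 4/7
```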

In [40]:
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls

In [41]:
precisions, recalls = precision_recall_at_k(test_predictions_CF, k=5, threshold=4)

# Precision and recall can then be averaged over all users

prec_CF = sum(prec for prec in precisions.values()) / len(precisions)

rec_CF = sum(rec for rec in recalls.values()) / len(recalls)

fmeasure_CF = (2*prec_CF*rec_CF)/(prec_CF+rec_CF)

print("The precision, recall and f-measure values for Collaborative Filtering are: ")
print("Precision: "+ str(prec_CF))
print("Recall: "+ str(rec_CF))
print("F-Measure: "+ str(fmeasure_CF))

The precision, recall and f-measure values for Collaborative Filtering are: 
Precision: 0.7081351205790348
Recall: 0.705411824693868
F-Measure: 0.7067708493275818


In [42]:
precisions1, recalls1 = precision_recall_at_k(test_predictions_Slope, k=5, threshold=4)

# Precision and recall can then be averaged over all users

prec_slope = sum(prec for prec in precisions1.values()) / len(precisions1)

rec1_slope = sum(rec for rec in recalls1.values()) / len(recalls1)

fmeasure_slope = (2*prec_slope*rec1_slope)/(prec_slope+rec1_slope)

print("The precision, recall and f-measure values for Slope One are: ")
print("Precision: "+ str(prec_slope))
print("Recall: "+ str(rec1_slope))
print("F-Measure: "+ str(fmeasure_slope))

The precision, recall and f-measure values for Slope One are: 
Precision: 0.7832268379600184
Recall: 0.7865871022071589
F-Measure: 0.7849033736778229


In [43]:
precisions2, recall2 = precision_recall_at_k(test_predictions_coclustering, k=5, threshold=4)


prec_CC = sum(prec for prec in precisions2.values()) / len(precisions2)

rec_CC = sum(rec for rec in recall2.values()) / len(recall2)

fmeasure_CC = (2*prec_CC*rec_CC)/(prec_CC+rec_CC)

print("The precision, recall and f-measure values for Co-clustering are: ")
print("Precision: "+ str(prec_CC))
print("Recall: "+ str(rec_CC))
print("F-Measure: "+ str(fmeasure_CC))

The precision, recall and f-measure values for Co-clustering are: 
Precision: 0.7772044542276323
Recall: 0.7797728323478627
F-Measure: 0.7784865248990512


## Final Results

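Here we print the final results for all three algorithms together. SVD (labelled Collaborative Filtering in the output) has the lowest RMSE and MAE, while Slope One and Co-clustering score higher on precision, recall, and F-measure at k=5 with a relevance threshold of 4.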

In [44]:
print("The final results for the Co-clustering are: ")
print("RMSE - " + str(rmse_coclusterting))
print("MAE - " + str(mae_coclusterting))
print("Precision - "+ str(prec_CC))
print("Recall - "+ str(rec_CC))
print("F-Measure - "+ str(fmeasure_CC))
print()
print("The final results for the Collaborative Filtering are: ")
print("RMSE - " + str(rmse_CF))
print("MAE - " + str(mae_CF))
print("Precision - "+ str(prec_CF))
print("Recall - "+ str(rec_CF))
print("F-Measure - "+ str(fmeasure_CF))
print()
print("The final results for the Slope One are: ")
print("RMSE - " + str(rmse_slope))
print("MAE - " + str(mae_slope))
print("Precision - "+ str(prec_slope))
print("Recall - "+ str(rec1_slope))
print("F-Measure - "+ str(fmeasure_slope))

The final results for the Co-clustering are: 
RMSE - 1.2113069092157847
MAE - 0.82355637704684
Precision - 0.7772044542276323
Recall - 0.7797728323478627
F-Measure - 0.7784865248990512

The final results for the Collaborative Filtering are: 
RMSE - 1.126801957095781
MAE - 0.8203837086664989
Precision - 0.7081351205790348
Recall - 0.705411824693868
F-Measure - 0.7067708493275818

The final results for the Slope One are: 
RMSE - 1.2413344992878401
MAE - 0.8360409321569288
Precision - 0.7832268379600184
Recall - 0.7865871022071589
F-Measure - 0.7849033736778229
