<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Apply Diversity Metrics  
## -- Compare ALS and Random Recommenders on MovieLens (PySpark)

In this notebook, we demonstrate how to evaluate a recommender using metrics other than commonly used rating/ranking metrics.

Such metrics include:
- Coverage - We use the following two metrics defined by \[Shani and Gunawardana\]:
 
    - (1) catalog_coverage, which measures the proportion of items that get recommended from the item catalog; 
    - (2) distributional_coverage, which measures how equally different items are recommended in the recommendations to all users.

- Novelty - A more novel item is a less popular one, i.e. one that users have interacted with less often in the historical data.
We use the definition of novelty from \[Castells et al.\].

- Diversity - The dissimilarity of items being recommended.
We use a definition based on _intra-list similarity_ by \[Zhang et al.\].

- Serendipity - The "unusualness" or "surprise" of recommendations to a user.
We use a definition based on cosine similarity by \[Zhang et al.\].

We evaluate the results obtained with two approaches: using the ALS recommender algorithm vs. a baseline of random recommendations. 
 - Matrix factorization by [ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS) (Alternating Least Squares) is a well-known collaborative filtering algorithm.
 - We also define a process which randomly recommends unseen items to each user. 
 - We show two options to calculate item-item similarity: (1) based on item co-occurrence count; and (2) based on item feature vectors.
 
The comparison results show that the ALS recommender outperforms the random recommender on ranking metrics (Precision@k, Recall@k, NDCG@k, and Mean average precision), while the random recommender outperforms the ALS recommender on diversity metrics. This is because ALS is optimized to estimate item ratings as accurately as possible, so it performs well on accuracy metrics, including rating and ranking metrics. As a side effect, the recommended items tend to be popular ones, that is, the items most often sold or viewed. This leaves the [long-tail items](https://github.com/microsoft/recommenders/blob/main/GLOSSARY.md) with less chance of being introduced to users, which is why ALS does not perform as well as a random recommender on diversity metrics.

From an algorithmic point of view, items in the tail suffer from the cold-start problem, which makes them hard for recommendation systems to use. From a business point of view, however, tail items can often be highly profitable since, depending on supply, a business can apply a higher margin to them. Recommendation systems that optimize metrics like novelty and diversity can help find users willing to get these long-tail items. There is usually a trade-off between one type of metric and another, so one should decide which set of metrics to optimize based on the business scenario.

**Coverage**

We define _catalog coverage_ as the proportion of catalog items that appear in at least one user's recommendations:
$$
\textrm{CatalogCoverage} = \frac{|N_r|}{|N_t|}
$$
where $N_r$ denotes the set of items in the recommendations (`reco_df` in the code below) and $N_t$ the set of items in the historical data (`train_df`).

_Distributional coverage_ measures how equally different items are recommended to users when a particular recommender system is used.
If  $p(i|R)$ denotes the probability that item $i$ is observed among all recommendation lists, we define distributional coverage as
$$
\textrm{DistributionalCoverage} = -\sum_{i \in N_t} p(i|R) \log_2 p(i|R)
$$
where 
$$
p(i|R) = \frac{|M_r (i)|}{|\textrm{reco_df}|}
$$
and $M_r (i)$ denotes the set of users who are recommended item $i$, while $|\textrm{reco_df}|$ is the number of rows (user-item pairs) in the recommendations.
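
As a rough illustration of the two coverage definitions, the following plain-Python sketch computes them on a tiny, made-up set of user-item pairs. The data and the function names `catalog_coverage` and `distributional_coverage` are hypothetical and only mirror the formulas; this is not the `SparkDiversityEvaluation` implementation. Distributional coverage is computed here as the base-2 entropy of $p(i|R)$.

```python
import math
from collections import Counter

# Hypothetical toy data (not MovieLens): lists of (user, item) pairs.
train_pairs = [(1, "a"), (1, "b"), (2, "b"), (2, "c"), (3, "a"), (3, "d")]
reco_pairs = [(1, "c"), (1, "d"), (2, "a"), (2, "d"), (3, "b"), (3, "c")]

def catalog_coverage(train_pairs, reco_pairs):
    """Share of catalog items (items in the training data) that get recommended."""
    catalog = {item for _, item in train_pairs}
    recommended = {item for _, item in reco_pairs}
    return len(recommended) / len(catalog)

def distributional_coverage(reco_pairs):
    """Base-2 entropy of the item distribution over all recommendation rows."""
    counts = Counter(item for _, item in reco_pairs)
    total = len(reco_pairs)
    # Items that are never recommended have probability 0 and contribute nothing.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(catalog_coverage(train_pairs, reco_pairs))
print(distributional_coverage(reco_pairs))
```

A recommender that spreads its recommendations evenly over many items has a higher entropy, and therefore a higher distributional coverage, than one that keeps recommending the same few items.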



**Diversity**

Diversity represents the variety present in a list of recommendations.
_Intra-List Similarity_ aggregates the pairwise similarity of all items in a set. A recommendation list with groups of very similar items will score a high intra-list similarity. Lower intra-list similarity indicates higher diversity.
To measure similarity between any two items we use _cosine similarity_:
$$
\textrm{Cosine Similarity}(i,j)=  \frac{|M_t^{l(i,j)}|} {\sqrt{|M_t^{l(i)}|} \sqrt{|M_t^{l(j)}|} }
$$
where $M_t^{l(i)}$ denotes the set of users who liked item $i$ and $M_t^{l(i,j)}$ the users who liked both $i$ and $j$.
Intra-list similarity is then defined as 
$$
\textrm{IL} = \frac{1}{|M|} \sum_{u \in M} \frac{1}{\binom{|N_r(u)|}{2}} \sum_{i,j \in N_r (u),\, i<j} \textrm{Cosine Similarity}(i,j)
$$
where $M$ is the set of users and $N_r(u)$ the set of recommendations for user $u$. Finally, diversity is defined as
$$
\textrm{diversity} = 1 - \textrm{IL}
$$
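
To make the co-occurrence-based definition concrete, here is a minimal plain-Python sketch on made-up data. The dictionaries `liked_by` and `recos` and the helper names are hypothetical; the code follows the formulas above and assumes every recommendation list has at least two items.

```python
import math
from itertools import combinations

# Hypothetical toy data: item -> set of users who liked it in the training data.
liked_by = {"a": {1, 2, 3}, "b": {1, 2}, "c": {3}}
# Hypothetical recommendation lists: user -> recommended items.
recos = {10: ["a", "b", "c"], 11: ["a", "b"]}

def cooccurrence_cosine(i, j):
    """Users who liked both items, normalized by the items' individual popularity."""
    both = len(liked_by[i] & liked_by[j])
    return both / (math.sqrt(len(liked_by[i])) * math.sqrt(len(liked_by[j])))

def intra_list_similarity(recos):
    """Average pairwise similarity within each list, then averaged over users."""
    per_user = []
    for items in recos.values():
        pairs = list(combinations(items, 2))
        per_user.append(sum(cooccurrence_cosine(i, j) for i, j in pairs) / len(pairs))
    return sum(per_user) / len(per_user)

diversity = 1 - intra_list_similarity(recos)
print(round(diversity, 4))
```

User 11 only receives the two items that are often liked together, so that list is less diverse than the list for user 10, which also contains the rarely co-liked item `c`.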



**Novelty**

The novelty of an item is inversely related to its _popularity_. If $p(i)$ represents the probability that item $i$ is observed (known, interacted with, etc.) by users, then
$$
p(i) = \frac{|M_t (i)|} {|\textrm{train_df}|}
$$
where $M_t (i)$ is the set of users who have interacted with item $i$ in the historical data. 

The novelty of an item is then defined as
$$
\textrm{novelty}(i) = -\log_2 p(i)
$$
and the novelty of the recommendations across all users is defined as
$$
\textrm{novelty} = \sum_{i \in N_r} \frac{|M_r (i)|}{|\textrm{reco_df}|} \textrm{novelty}(i)
$$
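
The novelty formulas can be checked by hand on a small example. The sketch below uses made-up user-item pairs and hypothetical helper names; it assumes each (user, item) pair appears at most once, so counting rows per item equals counting users per item.

```python
import math
from collections import Counter

# Hypothetical toy data (not MovieLens): lists of (user, item) pairs.
train_pairs = [(1, "a"), (2, "a"), (3, "a"), (4, "a"),
               (1, "b"), (2, "b"), (3, "c"), (4, "d")]
reco_pairs = [(1, "c"), (2, "d"), (3, "b"), (4, "b")]

def item_novelty(train_pairs):
    """novelty(i) = -log2 p(i), with p(i) the share of training rows for item i."""
    counts = Counter(item for _, item in train_pairs)
    total = len(train_pairs)
    return {item: -math.log2(c / total) for item, c in counts.items()}

def overall_novelty(train_pairs, reco_pairs):
    """Item novelty weighted by how often each item is recommended."""
    per_item = item_novelty(train_pairs)
    reco_counts = Counter(item for _, item in reco_pairs)
    total = len(reco_pairs)
    return sum((c / total) * per_item[item] for item, c in reco_counts.items())

print(overall_novelty(train_pairs, reco_pairs))  # 2.5
```

Here item `a` accounts for half of the training rows and has novelty 1, while the rare items `c` and `d` have novelty 3. Because the recommendations avoid `a`, the overall novelty (2.5) lies well above that of the most popular item.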


**Serendipity**

Serendipity represents the “unusualness” or “surprise” of recommendations. Unlike novelty, serendipity encompasses the semantic content of items and can be imagined as the distance between recommended items and their expected contents (Zhang et al.). Lower cosine similarity indicates lower expectedness and higher serendipity.
We define the expectedness of an unseen item $i$ for user $u$ as the average similarity between every already seen item $j$ in the historical data and $i$:
$$
\textrm{expectedness}(i|u) = \frac{1}{|N_t (u)|} \sum_{j \in N_t (u)} \textrm{Cosine Similarity}(i,j)
$$
The serendipity of item $i$ is (1 - expectedness) multiplied by _relevance_, where relevance indicates whether the item turns out to be liked by the user or not. For example, in a binary scenario, if an item in `reco_df` is liked (purchased, clicked) in `test_df`, its relevance equals one, otherwise it equals zero. Aggregating over all users and items, the overall serendipity is defined as
$$
\textrm{serendipity} = \frac{1}{|M|} \sum_{u \in M}
\frac{1}{|N_r (u)|} \sum_{i \in N_r (u)} \big(1 - \textrm{expectedness}(i|u) \big) \, \textrm{relevance}(i)
$$
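
The following plain-Python sketch walks through the serendipity formula on made-up data, using the co-occurrence cosine similarity and binary relevance taken from a toy test set. All names and values are hypothetical, and the code mirrors the formulas above rather than the library's Spark implementation.

```python
import math

# Hypothetical toy data (not MovieLens): user -> items.
train = {1: ["a", "b"], 2: ["a", "c"], 3: ["b"]}   # items seen in training
test = {1: ["c"], 2: ["b"], 3: ["a"]}              # items liked in the test set
recos = {1: ["c"], 2: ["b"], 3: ["a", "c"]}        # unseen items recommended

# Invert the training data: item -> set of users who interacted with it.
liked_by = {}
for user, items in train.items():
    for item in items:
        liked_by.setdefault(item, set()).add(user)

def cosine(i, j):
    """Co-occurrence cosine similarity between two items."""
    both = len(liked_by[i] & liked_by[j])
    return both / math.sqrt(len(liked_by[i]) * len(liked_by[j]))

def expectedness(item, user):
    """Average similarity between a candidate item and the user's history."""
    history = train[user]
    return sum(cosine(item, seen) for seen in history) / len(history)

def serendipity(recos):
    """Mean over users of the mean of (1 - expectedness) * relevance per item."""
    per_user = []
    for user, items in recos.items():
        scores = [
            (1 - expectedness(item, user)) * (1.0 if item in test.get(user, []) else 0.0)
            for item in items
        ]
        per_user.append(sum(scores) / len(scores))
    return sum(per_user) / len(per_user)

print(round(serendipity(recos), 4))
```

Item `c` recommended to user 3 is completely unexpected (similarity 0 to the user's history) but contributes nothing, because it is not relevant in the toy test set. Surprise only counts when the item is also liked.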


**Note**: This notebook requires a PySpark environment to run properly. Please follow the steps in [SETUP.md](https://github.com/Microsoft/Recommenders/blob/master/SETUP.md#dependencies-setup) to install the PySpark environment.

In [1]:
# set the environment path to find Recommenders
%load_ext autoreload
%autoreload 2

import sys

import pyspark
from pyspark.ml.recommendation import ALS
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, FloatType, IntegerType, LongType, StructType, StructField
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover
from pyspark.ml.feature import HashingTF, CountVectorizer, VectorAssembler

from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.utils.notebook_utils import is_jupyter
from recommenders.datasets.spark_splitters import spark_random_split
from recommenders.evaluation.spark_evaluation import SparkRatingEvaluation, SparkRankingEvaluation, SparkDiversityEvaluation
from recommenders.utils.spark_utils import start_or_get_spark

from pyspark.sql.window import Window

import numpy as np
import pandas as pd

print("System version: {}".format(sys.version))
print("Spark version: {}".format(pyspark.__version__))


System version: 3.6.13 |Anaconda, Inc.| (default, Jun  4 2021, 14:25:59) 
[GCC 7.5.0]
Spark version: 2.4.8



Set the default parameters.

In [2]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

# user, item column names
COL_USER="UserId"
COL_ITEM="MovieId"
COL_RATING="Rating"

### 1. Set up Spark context

The following settings work well for debugging locally on a VM; change them when running on a cluster. We set up a single large executor with many threads and specify a memory cap.

In [3]:
# the following settings work well for debugging locally on VM - change when running on a cluster
# set up a giant single executor with many threads and specify memory cap

spark = start_or_get_spark("ALS PySpark", memory="16g")

spark.conf.set("spark.sql.crossJoin.enabled", "true")

### 2. Download the MovieLens dataset

In [4]:
# Note: The DataFrame-based API for ALS currently only supports integers for user and item ids.
schema = StructType(
    (
        StructField(COL_USER, IntegerType()),
        StructField(COL_ITEM, IntegerType()),
        StructField(COL_RATING, FloatType()),
        StructField("Timestamp", LongType()),
    )
)

data = movielens.load_spark_df(spark, size=MOVIELENS_DATA_SIZE, schema=schema, title_col="title", genres_col="genres")
data.show()

100%|██████████| 4.81k/4.81k [00:00<00:00, 17.1kKB/s]


+-------+------+------+---------+--------------------+------+
|MovieId|UserId|Rating|Timestamp|               title|genres|
+-------+------+------+---------+--------------------+------+
|     26|   138|   5.0|879024232|Brothers McMullen...|Comedy|
|     26|   224|   3.0|888104153|Brothers McMullen...|Comedy|
|     26|    18|   4.0|880129731|Brothers McMullen...|Comedy|
|     26|   222|   3.0|878183043|Brothers McMullen...|Comedy|
|     26|    43|   5.0|883954901|Brothers McMullen...|Comedy|
|     26|   201|   4.0|884111927|Brothers McMullen...|Comedy|
|     26|   299|   4.0|878192601|Brothers McMullen...|Comedy|
|     26|    95|   3.0|880571951|Brothers McMullen...|Comedy|
|     26|    89|   3.0|879459909|Brothers McMullen...|Comedy|
|     26|   361|   3.0|879440941|Brothers McMullen...|Comedy|
|     26|   194|   3.0|879522240|Brothers McMullen...|Comedy|
|     26|   391|   5.0|877399745|Brothers McMullen...|Comedy|
|     26|   345|   3.0|884993555|Brothers McMullen...|Comedy|
|     26

#### Split the data using the Spark random splitter provided in utilities

In [5]:
train_df, test_df = spark_random_split(data.select(COL_USER, COL_ITEM, COL_RATING), ratio=0.75, seed=123)
print("N train_df", train_df.cache().count())
print("N test_df", test_df.cache().count())

N train_df 75066
N test_df 24934


#### Get all possible user-item pairs

Note: We assume that training data contains all users and all catalog items. 

In [6]:
users = train_df.select(COL_USER).distinct()
items = train_df.select(COL_ITEM).distinct()
user_item = users.crossJoin(items)

### 3. Train the ALS model on the training data, and get the top-k recommendations for our testing data

To predict movie ratings, we use the rating data in the training set as users' explicit feedback. The hyperparameters used to build the model are taken from [here](http://mymedialite.net/examples/datasets.html). We do not constrain the latent factors to be nonnegative (`nonnegative=False`), which allows for both positive and negative preferences towards movies.
Timing will vary depending on the machine being used to train.

In [7]:
header = {
    "userCol": COL_USER,
    "itemCol": COL_ITEM,
    "ratingCol": COL_RATING,
}


als = ALS(
    rank=10,
    maxIter=15,
    implicitPrefs=False,
    regParam=0.05,
    coldStartStrategy='drop',
    nonnegative=False,
    seed=42,
    **header
)

In [8]:
with Timer() as train_time:
    model = als.fit(train_df)

print("Took {} seconds for training.".format(train_time.interval))

Took 4.012367556002573 seconds for training.


In the movie recommendation use case, recommending movies that have been rated by the users does not make sense. Therefore, the rated movies are removed from the recommended items.

In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training dataset.
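
The logic of "score everything, drop what was seen, keep the best k" can be sketched in plain Python on a handful of made-up scores. The names `scores`, `seen`, and `top_k_unseen` are hypothetical and only illustrate the idea; the notebook itself does this with a Spark join and a window function.

```python
import heapq

# Hypothetical predicted scores for every (user, item) pair.
scores = {
    (1, "a"): 4.5, (1, "b"): 3.0, (1, "c"): 4.0, (1, "d"): 2.0,
    (2, "a"): 1.0, (2, "b"): 4.8, (2, "c"): 3.5, (2, "d"): 3.9,
}
seen = {(1, "a"), (2, "b")}  # pairs that exist in the training data
K = 2

def top_k_unseen(scores, seen, k):
    """Drop seen pairs, then keep the k highest-scored items per user."""
    by_user = {}
    for (user, item), score in scores.items():
        if (user, item) not in seen:
            by_user.setdefault(user, []).append((score, item))
    return {
        user: [item for _, item in heapq.nlargest(k, candidates)]
        for user, candidates in by_user.items()
    }

print(top_k_unseen(scores, seen, K))  # {1: ['c', 'b'], 2: ['d', 'c']}
```

The two counts printed by the cell that follows are consistent with this logic, assuming every user has at least 10 unseen items: 9,430 = 943 users × 10, and 1,464,853 = 943 × 1,633 − 75,066, that is, all user-item pairs minus the training rows.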

In [9]:
# Score all user-item pairs
dfs_pred = model.transform(user_item)

# Remove seen items.
dfs_pred_exclude_train = dfs_pred.alias("pred").join(
    train_df.alias("train"),
    (dfs_pred[COL_USER] == train_df[COL_USER]) & (dfs_pred[COL_ITEM] == train_df[COL_ITEM]),
    how='outer'
)

top_all = dfs_pred_exclude_train.filter(dfs_pred_exclude_train["train.Rating"].isNull()) \
    .select('pred.' + COL_USER, 'pred.' + COL_ITEM, 'pred.' + "prediction")

print(top_all.count())
    
window = Window.partitionBy(COL_USER).orderBy(F.col("prediction").desc())    
top_k_reco = top_all.select("*", F.row_number().over(window).alias("rank")).filter(F.col("rank") <= TOP_K).drop("rank")
 
print(top_k_reco.count())

1464853
9430


### 4. Random Recommender

We define a recommender which randomly recommends unseen items to each user. 
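
Conceptually, the random baseline samples k items per user from the items that user has not seen. A minimal plain-Python sketch of that idea, with a made-up catalog and hypothetical names, could look like this (the notebook's own implementation uses a Spark window ordered by `F.rand()`):

```python
import random

# Hypothetical toy data: the item catalog and each user's seen items.
catalog = ["a", "b", "c", "d", "e"]
seen = {1: {"a", "b"}, 2: {"c"}}
K = 2

def random_recommendations(catalog, seen, k, rng):
    """Sample up to k unseen items per user, uniformly at random."""
    recos = {}
    for user, seen_items in seen.items():
        unseen = [item for item in catalog if item not in seen_items]
        recos[user] = rng.sample(unseen, min(k, len(unseen)))
    return recos

# A seeded generator keeps the sketch reproducible.
recos = random_recommendations(catalog, seen, K, random.Random(0))
print(recos)
```

Because every unseen item is equally likely to be picked, such a baseline ignores popularity entirely, which is what makes it a useful reference point for the coverage and novelty metrics.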

In [10]:
# random recommender
window = Window.partitionBy(COL_USER).orderBy(F.rand())

# randomly generated recommendations for each user
pred_df = (
  train_df
  # join training data with all possible user-item pairs (seen in training)
  .join(user_item,
        on=[COL_USER, COL_ITEM],
        how="right"
  )
  # get user-item pairs that were not seen in the training data
  .filter(F.col(COL_RATING).isNull())
  # count items for each user (randomly sorting them)
  .withColumn("score", F.row_number().over(window))
  # get the top k items per user
  .filter(F.col("score") <= TOP_K)
  .drop(COL_RATING)
)

### 5. ALS vs Random Recommenders Performance Comparison

In [11]:
def get_ranking_results(ranking_eval):
    metrics = {
        "Precision@k": ranking_eval.precision_at_k(),
        "Recall@k": ranking_eval.recall_at_k(),
        "NDCG@k": ranking_eval.ndcg_at_k(),
        "Mean average precision": ranking_eval.map_at_k()
      
    }
    return metrics   

def get_diversity_results(diversity_eval):
    metrics = {
        "catalog_coverage":diversity_eval.catalog_coverage(),
        "distributional_coverage":diversity_eval.distributional_coverage(), 
        "novelty": diversity_eval.novelty(), 
        "diversity": diversity_eval.diversity(), 
        "serendipity": diversity_eval.serendipity()
    }
    return metrics 

In [12]:
def generate_summary(data, algo, k, ranking_metrics, diversity_metrics):
    summary = {"Data": data, "Algo": algo, "K": k}

    if ranking_metrics is None:
        # keys must match those returned by get_ranking_results
        ranking_metrics = {
            "Precision@k": np.nan,
            "Recall@k": np.nan,
            "NDCG@k": np.nan,
            "Mean average precision": np.nan,
        }
    summary.update(ranking_metrics)
    summary.update(diversity_metrics)
    return summary

#### ALS Recommender Performance Results

In [13]:
als_ranking_eval = SparkRankingEvaluation(
    test_df, 
    top_all, 
    k = TOP_K, 
    col_user=COL_USER, 
    col_item=COL_ITEM,
    col_rating=COL_RATING, 
    col_prediction="prediction",
    relevancy_method="top_k"
)

als_ranking_metrics = get_ranking_results(als_ranking_eval)

In [17]:
als_diversity_eval = SparkDiversityEvaluation(
    train_df = train_df, 
    reco_df = top_k_reco,
    col_user = COL_USER, 
    col_item = COL_ITEM
)

als_diversity_metrics = get_diversity_results(als_diversity_eval)

In [18]:
als_results = generate_summary(MOVIELENS_DATA_SIZE, "als", TOP_K, als_ranking_metrics, als_diversity_metrics)

#### Random Recommender Performance Results

In [19]:
random_ranking_eval = SparkRankingEvaluation(
    test_df,
    pred_df,
    col_user=COL_USER,
    col_item=COL_ITEM,
    col_rating=COL_RATING,
    col_prediction="score",
    k=TOP_K,
)

random_ranking_metrics = get_ranking_results(random_ranking_eval)

In [20]:
random_diversity_eval = SparkDiversityEvaluation(
    train_df = train_df, 
    reco_df = pred_df, 
    col_user = COL_USER, 
    col_item = COL_ITEM
)
  
random_diversity_metrics = get_diversity_results(random_diversity_eval)

In [21]:
random_results = generate_summary(MOVIELENS_DATA_SIZE, "random", TOP_K, random_ranking_metrics, random_diversity_metrics)

#### Result Comparison

In [22]:
cols = ["Data", "Algo", "K", "Precision@k", "Recall@k", "NDCG@k", "Mean average precision","catalog_coverage", "distributional_coverage","novelty", "diversity", "serendipity" ]
df_results = pd.DataFrame(columns=cols)

df_results.loc[1] = als_results 
df_results.loc[2] = random_results 

In [23]:
df_results

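|   | Data | Algo | K | Precision@k | Recall@k | NDCG@k | Mean average precision | catalog_coverage | distributional_coverage | novelty | diversity | serendipity |
|---|------|------|---|-------------|----------|--------|------------------------|------------------|-------------------------|---------|-----------|-------------|
| 1 | 100k | als | 10 | 0.047296 | 0.016015 | 0.043097 | 0.004579 | 0.385793 | 7.967257 | 11.659776 | 0.892277 | 0.878733 |
| 2 | 100k | random | 10 | 0.016755 | 0.005883 | 0.017849 | 0.00189 | 0.996326 | 10.540834 | 12.133664 | 0.922288 | 0.893001 |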


#### Conclusion
The comparison results show that the ALS recommender outperforms the random recommender on ranking metrics (Precision@k, Recall@k, NDCG@k, and Mean average precision), while the random recommender outperforms the ALS recommender on diversity metrics. This is because ALS is optimized to estimate item ratings as accurately as possible, so it performs well on accuracy metrics, including rating and ranking metrics. As a side effect, the recommended items tend to be popular ones, that is, the items most often sold or viewed. This leaves the long-tail, less popular items with less chance of being introduced to users, which is why ALS does not perform as well as a random recommender on diversity metrics.
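
To put rough numbers on this trade-off, the short snippet below computes two ratios from the result table above. The values are copied from that table; the variable names are only illustrative.

```python
# Values copied from the result table above (MovieLens 100k, K = 10).
als = {"Precision@k": 0.047296, "catalog_coverage": 0.385793}
rnd = {"Precision@k": 0.016755, "catalog_coverage": 0.996326}

precision_ratio = als["Precision@k"] / rnd["Precision@k"]
coverage_ratio = rnd["catalog_coverage"] / als["catalog_coverage"]

print(f"ALS Precision@k is {precision_ratio:.1f}x that of the random recommender")
print(f"Random catalog coverage is {coverage_ratio:.1f}x that of ALS")
```

In this run, ALS reaches roughly 2.8 times the Precision@k of the random baseline, while the random baseline covers about 2.6 times as much of the catalog.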

### 6.  Calculate diversity metrics using item feature vector based item-item similarity
In the section above, we calculated diversity metrics using item-item similarity based on item co-occurrence counts. When item features are available, we may instead want to calculate item-item similarity from item feature vectors. This section shows how to calculate diversity metrics with feature-vector-based item-item similarity.

In [24]:
# Get movie features "title" and "genres"
movies = (
    data.groupBy("MovieId", "title", "genres").count()
    .na.drop()  # remove rows with null values
    .withColumn("genres", F.split(F.col("genres"), r"\|"))  # convert to array of genres
    .withColumn("title", F.regexp_replace(F.col("title"), r"[\(),:^0-9]", ""))  # strip parentheses, punctuation, and digits (including the year) from the title
    .drop("count")  # remove unused columns
)

In [25]:
# tokenize "title" column
title_tokenizer = Tokenizer(inputCol="title", outputCol="title_words")
tokenized_data = title_tokenizer.transform(movies)

# remove stop words
remover = StopWordsRemover(inputCol="title_words", outputCol="text")
clean_data = remover.transform(tokenized_data).drop("title", "title_words")

In [26]:
# convert text input into feature vectors

# step 1: perform HashingTF on column "text"
text_hasher = HashingTF(inputCol="text", outputCol="text_features", numFeatures=1024)
hashed_data = text_hasher.transform(clean_data)

# step 2: fit a CountVectorizerModel from column "genres".
count_vectorizer = CountVectorizer(inputCol="genres", outputCol="genres_features")
count_vectorizer_model = count_vectorizer.fit(hashed_data)
vectorized_data = count_vectorizer_model.transform(hashed_data)

# step 3: assemble features into a single vector
assembler = VectorAssembler(
    inputCols=["text_features", "genres_features"],
    outputCol="features",
)
feature_data = assembler.transform(vectorized_data).select("MovieId", "features")

feature_data.show(10, False)

+-------+---------------------------------------------+
|MovieId|features                                     |
+-------+---------------------------------------------+
|167    |(1043,[128,544,1025],[1.0,1.0,1.0])          |
|1343   |(1043,[38,300,1024],[1.0,1.0,1.0])           |
|1607   |(1043,[592,821,1024],[1.0,1.0,1.0])          |
|966    |(1043,[389,502,1028],[1.0,1.0,1.0])          |
|9      |(1043,[11,342,1014,1024],[1.0,1.0,1.0,1.0])  |
|1230   |(1043,[597,740,902,1025],[1.0,1.0,1.0,1.0])  |
|1118   |(1043,[702,1025],[1.0,1.0])                  |
|673    |(1043,[169,690,1027,1040],[1.0,1.0,1.0,1.0]) |
|879    |(1043,[909,1026,1027,1034],[1.0,1.0,1.0,1.0])|
|66     |(1043,[256,1025,1028],[1.0,1.0,1.0])         |
+-------+---------------------------------------------+
only showing top 10 rows



The *features* column holds `SparseVector` objects. For example, in the feature vector (1043,[128,544,1025],[1.0,1.0,1.0]), 1043 is the vector length, meaning each item is described by 1043 features (the 1024 hashed title features followed by 19 genre features). The values at index positions 128, 544, and 1025 are 1.0, and the values at all other positions are 0.
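
To see how two such sparse vectors can be compared, the sketch below computes a cosine similarity directly from the index and value lists, using the vectors shown above for movies 167 and 1118. The helper `sparse_cosine` is hypothetical, and the use of cosine similarity is an assumption made for illustration rather than a statement about the library's internals.

```python
import math

def sparse_cosine(indices_a, values_a, indices_b, values_b):
    """Cosine similarity of two sparse vectors given as index and value lists."""
    a = dict(zip(indices_a, values_a))
    b = dict(zip(indices_b, values_b))
    # Only positions that are non-zero in both vectors contribute to the dot product.
    dot = sum(value * b[index] for index, value in a.items() if index in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)

# Index and value lists copied from the feature table above.
movie_167 = ([128, 544, 1025], [1.0, 1.0, 1.0])
movie_1118 = ([702, 1025], [1.0, 1.0])

print(round(sparse_cosine(*movie_167, *movie_1118), 4))
```

The two movies share only position 1025, which lies in the genre part of the vector, so their similarity is 1 / (√3 · √2) ≈ 0.41.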

In [27]:
als_eval = SparkDiversityEvaluation(
    train_df = train_df, 
    reco_df = top_k_reco,
    item_feature_df = feature_data, 
    item_sim_measure="item_feature_vector",
    col_user = COL_USER, 
    col_item = COL_ITEM
)

als_diversity=als_eval.diversity()
als_serendipity=als_eval.serendipity()
print(als_diversity)
print(als_serendipity)

0.8738984131037538
0.8873467159479473


In [28]:
random_eval = SparkDiversityEvaluation(
    train_df = train_df, 
    reco_df = pred_df, 
    item_feature_df = feature_data, 
    item_sim_measure="item_feature_vector",    
    col_user = COL_USER, 
    col_item = COL_ITEM
)
  
random_diversity=random_eval.diversity()
random_serendipity=random_eval.serendipity()
print(random_diversity)
print(random_serendipity)

0.8978120851519519
0.8937850286817351


Interestingly, the diversity and serendipity values change when a different item-item similarity calculation is used, for both the ALS algorithm and the random recommender. The diversity and serendipity of the random recommender remain higher than those of ALS.

### References
The metric definitions / formulations are based on the following references:
- P. Castells, S. Vargas, and J. Wang, Novelty and diversity metrics for recommender systems: choice, discovery and relevance, ECIR 2011.
- G. Shani and A. Gunawardana, Evaluating recommendation systems, Recommender Systems Handbook, pp. 257-297, 2010.
- E. Yan, Serendipity: Accuracy’s unpopular best friend in recommender systems, eugeneyan.com, April 2020.
- Y.C. Zhang, D.Ó. Séaghdha, D. Quercia, and T. Jambor, Auralist: introducing serendipity into music recommendation, WSDM 2012.


In [None]:
# cleanup spark instance
spark.stop()