# Reranking for diversity improvement

Prepare the dataset as introduced before:

In [1]:
import rsdiv as rs

loader = rs.MovieLens1MDownLoader()
ratings = loader.read_ratings()
ratings['rating'] = 1  # keep only implicit feedback: every observed rating becomes 1
items = loader.read_items()

Besides categorical labels, **rsdiv** also supports item embeddings.

For example, the pre-trained 300-dimensional `fastText` embedding trained on `wiki_en` can be imported as:

In [2]:
emb = rs.FastTextEmbedder()
items['embedding'] = items['genres'].apply(emb.embedding_list)

In [3]:
items

| | itemId | title | genres | release_date | embedding |
|---|---|---|---|---|---|
| 0 | 1 | Toy Story | [Animation, Children's, Comedy] | 1995 | [-0.030589849, 0.05325674, 0.019193454, -0.050... |
| 1 | 2 | Jumanji | [Adventure, Children's, Fantasy] | 1995 | [-0.015678799, 0.042902038, -0.035489853, -0.0... |
| 2 | 3 | Grumpier Old Men | [Comedy, Romance] | 1995 | [-0.020618143, 0.06264187, 0.007298471, -0.043... |
| 3 | 4 | Waiting to Exhale | [Comedy, Drama] | 1995 | [-0.012459491, 0.066781715, 0.005510467, -0.04... |
| 4 | 5 | Father of the Bride Part II | [Comedy] | 1995 | [-0.050720982, 0.05634493, 0.026702933, -0.043... |
| ... | ... | ... | ... | ... | ... |
| 3878 | 3948 | Meet the Parents | [Comedy] | 2000 | [-0.050720982, 0.05634493, 0.026702933, -0.043... |
| 3879 | 3949 | Requiem for a Dream | [Drama] | 2000 | [0.025802, 0.077218495, -0.015681999, -0.05331... |
| 3880 | 3950 | Tigerland | [Drama] | 2000 | [0.025802, 0.077218495, -0.015681999, -0.05331... |
| 3881 | 3951 | Two Family House | [Drama] | 2000 | [0.025802, 0.077218495, -0.015681999, -0.05331... |


Train an iALS recommender (based on `implicit`):

In [4]:
rc = rs.IALSRecommender(ratings, items, test_size=50000, random_split=True, iterations=10, factors=300).fit()

Evaluate the recommender:

In [5]:
rc.auc_score(top_k=300)

0.8211658057177699

Prepare the relevance scores and similarity scores for `user_id=1024`:

In [6]:
org_select, category, relevance, similarity = rc.rerank_preprocess(
    user_id=1024, 
    truncate_at=500, 
    category_col='genres',
    embedding_col='embedding'
)

**rsdiv** supports various kinds of diversifying algorithms:

- [Maximal Marginal Relevance](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf) (MMR) diversification algorithm:

$$MMR\stackrel{\text{def}}{=}\mathop{\text{argmax}}\limits_{D_i\in R\backslash S}\left[\underbrace{\lambda \text{Sim}_1\left(D_i,Q\right)}_\text{relevance}-\left(1-\lambda\right)\underbrace{\max\limits_{D_j\in S}\text{Sim}_2\left(D_i,D_j\right)}_\text{diversity}\right]$$
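The formula above can be read as a greedy loop: at each step, pick the remaining candidate with the best trade-off between its own relevance and its maximum similarity to the items already selected. The following is a minimal NumPy sketch of that loop, not **rsdiv**'s implementation. The function name `mmr_rerank` and its parameters are hypothetical, and the penalty for the first pick is assumed to be zero because the selected set is still empty.

```python
import numpy as np


def mmr_rerank(relevance, similarity, k, lbd=0.5):
    """Greedy MMR sketch (hypothetical helper, not the rsdiv API).

    relevance:  shape (n,), Sim_1(D_i, Q) for each candidate
    similarity: shape (n, n), Sim_2(D_i, D_j) between candidates
    Returns the indices of the selected candidates, in selection order.
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = len(relevance)
    candidates = list(range(n))
    selected = []
    # Running max similarity of every item to the selected set S.
    max_sim = np.full(n, -np.inf)
    while candidates and len(selected) < k:
        # Assumption: with S empty there is nothing to be redundant with.
        penalty = max_sim[candidates] if selected else 0.0
        scores = lbd * relevance[candidates] - (1 - lbd) * penalty
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
        max_sim = np.maximum(max_sim, similarity[best])
    return selected


# Items 0 and 1 are near-duplicates; item 2 is less relevant but different.
rel = [1.0, 0.9, 0.5]
sim = [[1.0, 0.95, 0.1],
       [0.95, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(mmr_rerank(rel, sim, k=2, lbd=0.5))
print(mmr_rerank(rel, sim, k=2, lbd=1.0))
```

With `lbd=1.0` the loop reduces to sorting by relevance. Smaller values put more weight on the redundancy penalty, so the near-duplicate item is pushed down the list.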

Rerank the top 500 candidates and compare the new top 100 with the original top 100:

In [7]:
mmr = rs.MaximalMarginalRelevance(lbd=0.1)
rerank_scale = 100

In [8]:
new_orders = mmr.rerank(relevance, k=rerank_scale, similarity_scores=similarity)
new_select = [org_select[order] for order in new_orders]
new_genres = [category[order] for order in new_select]
org_genres = [category[order] for order in org_select]

Check the Gini coefficients before and after reranking. A lower value means the genres are more evenly distributed, so a notable improvement in diversity can be observed:

In [9]:
metrics = rs.DiversityMetrics()
metrics.gini_coefficient(org_genres[:rerank_scale]), metrics.gini_coefficient(new_genres)

(0.4971910112359551, 0.3769173213617658)

In [10]:
metrics.effective_catalog_size(org_genres[:rerank_scale]), metrics.effective_catalog_size(new_genres)

(8.04494382022472, 11.215488215488216)
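To make the two metrics concrete, here is a rough sketch of how they could be computed from lists of genre labels. It uses the textbook Gini formula over genre counts and the effective catalog size defined as $2\sum_i p_i\, i - 1$, with genre shares $p_i$ sorted from most to least frequent. These helpers are hypothetical and only count genres that actually appear, so their values need not match the output of `rs.DiversityMetrics`.

```python
from collections import Counter

import numpy as np


def category_counts(genre_lists):
    """Count each genre over all items; returns counts sorted ascending."""
    counts = Counter(g for genres in genre_lists for g in genres)
    return np.array(sorted(counts.values()), dtype=float)


def gini_coefficient(genre_lists):
    """Gini coefficient of the genre counts: 0 means perfectly even."""
    x = category_counts(genre_lists)
    n = len(x)
    ranks = np.arange(1, n + 1)
    return float(2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n)


def effective_catalog_size(genre_lists):
    """Roughly how many genres account for most of the recommendations."""
    p = category_counts(genre_lists)[::-1]  # most frequent first
    p = p / p.sum()
    ranks = np.arange(1, len(p) + 1)
    return float(2 * np.sum(ranks * p) - 1)


even = [["Comedy"], ["Drama"], ["Horror"]]
skewed = [["Comedy"], ["Comedy"], ["Comedy"], ["Drama"]]
print(gini_coefficient(even), effective_catalog_size(even))
print(gini_coefficient(skewed), effective_catalog_size(skewed))
```

A more even genre distribution gives a lower Gini coefficient and a larger effective catalog size, which is the direction of change seen in the outputs above.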

- Modified Gram-Schmidt (MGS) diversification algorithm, also known as SSD ([Sliding Spectrum Decomposition](https://arxiv.org/pdf/2107.05204.pdf)):

The objective can be written as:

$$\max\limits_{j\in\mathcal{Y}\backslash Y}\left[r_j+\lambda\left\|P_{\perp q_j}\right\|\prod\limits_{i\in Y}\left\|P_{\perp q_i}\right\|\right]$$
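
One way to read this objective is as a greedy selection in which every candidate embedding is kept orthogonal to the items already chosen, so the norm of what remains measures how much new direction the candidate adds. The sketch below is a simplified illustration under stated assumptions: it omits the sliding window of the SSD paper, and `ssd_greedy` and its parameters are hypothetical names rather than the **rsdiv** API.

```python
import numpy as np


def ssd_greedy(relevance, embeddings, k, lbd=1.0):
    """Greedy selection with modified Gram-Schmidt residuals (sketch only).

    relevance:  shape (n,), relevance score r_j per candidate
    embeddings: shape (n, d), item embeddings q_j
    Returns the indices of the selected candidates, in selection order.
    """
    relevance = np.asarray(relevance, dtype=float)
    residual = np.array(embeddings, dtype=float)  # copy, updated in place
    n = len(relevance)
    available = np.ones(n, dtype=bool)
    selected = []
    volume = 1.0  # product of residual norms of the items selected so far
    for _ in range(min(k, n)):
        norms = np.linalg.norm(residual, axis=1)
        scores = relevance + lbd * volume * norms
        scores[~available] = -np.inf
        best = int(np.argmax(scores))
        selected.append(best)
        available[best] = False
        volume *= norms[best]
        if norms[best] > 1e-12:
            # Modified Gram-Schmidt step: remove the newly selected
            # direction from every residual.
            basis = residual[best] / norms[best]
            residual -= np.outer(residual @ basis, basis)
    return selected


rel = [1.0, 0.9, 0.5]
emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # items 0 and 1 are identical
print(ssd_greedy(rel, emb, k=2, lbd=1.0))
print(ssd_greedy(rel, emb, k=2, lbd=0.0))
```

In this toy example the duplicate embedding has a zero residual after the first pick, so with `lbd=1.0` the orthogonal item is chosen second even though its relevance is lower.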