## Why do we build Recommender Systems?

Recommender Systems are part of almost every web platform offering items to users. These systems are build to help people find items they like but might not discover on their own, such as movies, products, or articles. Recommenders analyze patterns of user behavior to suggest items similar to what users have liked in the past or what similar users have enjoyed. Evidence supporting the effectiveness of recommender systems includes increased sales, higher user satisfaction, and improved engagement on platforms like Amazon, Netflix, and Spotify, where personalized recommendations significantly contribute to user experience and business success.

Below we will show how to build a very simple recommender system that will provide news articles to users. The system consists of two parts:
    
   1. a content-based recommender - the model operates via item similarity, meaning that only items similar to the context item are recommended. The recommended list would be shown to the user under the title "Similar Articles", and the objective is to motivate the readers to read more content.
   2. a collaborative-filtering recommender - in this case the user's past interactions are taken into account. The model first identifies users who have similar interaction history with the current user, and then collects items seen by these similar users, which where not yet seen by the current user. The recommended list would be shown under the tile "Others also read" or "Personalized Recommendations" to indicate that this list is specifically generated for this particular user. The objective is obviously to keep the users engaged with the platform
    
   The dataset below can be downloaded [here](https://www.kaggle.com/datasets/yazansalameh/news-category-dataset-v2). It is a news dataset, using which we will build the recommenders. We will provide user-article interaction data later, once we move on to the collaborative-filtering model.

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics.pairwise import cosine_similarity  

In [2]:
pd.options.display.max_columns = None

In [3]:
from sentence_transformers import SentenceTransformer

  from .autonotebook import tqdm as notebook_tqdm


In [4]:
BERT_SENT = 'sentence-transformers/distiluse-base-multilingual-cased-v1'

In [5]:
news_articles = pd.read_json("News_Category_Dataset_v2.json", lines = True)

In [6]:
news_articles.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26


We see that the dataset has columns such as `category`, `headline`, `short_description`, `date`. We will extract 
instances that were published after 2018-01-01.

In [7]:
news_articles.shape

(200853, 6)

In [8]:
news_articles = news_articles[news_articles['date'] >= pd.Timestamp(2018, 1, 1)]

In [9]:
news_articles.shape

(8583, 6)

We will also remove news with short headlines (shorter than 7 words).

In [10]:
news_articles = news_articles[news_articles['headline'].apply(lambda x: len(x.split()) > 6)]
news_articles.shape[0]

8429

In [11]:
# drop duplicates
news_articles = news_articles.sort_values('headline', ascending=False).drop_duplicates('headline', keep=False)
print(f"Total number of articles after removing duplicates: {news_articles.shape[0]}")

Total number of articles after removing duplicates: 8384


In [12]:
print("Total number of articles : ", news_articles.shape[0])
print("Total number of authors : ", news_articles["authors"].nunique())
print("Total number of categories : ", news_articles["category"].nunique())

Total number of articles :  8384
Total number of authors :  876
Total number of categories :  26


## 1. Content-based recommender

Below we will implement our content-based recommended. The recommended list would be same for all users, and would be displayed under the title "Similar Articles".

To identify which news articles are similar to a given article, we will obtain embedding of the text associated 
with all articles. Once we have the embeddings, we will use cosine similarity to retrieve the most similar
articles. The model we chose is from the Sentence Transformers family, often used for text-embedding tasks.

In [13]:
def get_text_and_mappings(df):

    corpus = df['headline_description'].tolist()
    
    # generate mappings
    ids_count_map = {row_id: index for index, row_id in enumerate(df['article_id'])}
    count_ids_map = {index: row_id for index, row_id in enumerate(df['article_id'])}
    
    return corpus, ids_count_map, count_ids_map

In [14]:
def compute_vectors(corpus, model):

    print('Calculating Embeddings of articles...')
    vectors = model.encode(corpus)
    print('Embeddings calculated!')
    
    return vectors

We will add one more column to the dataset, the values of which would be a concatenation of the headline and the 
news article description. We intend to use the text corpus of this column for embedding, i.e., calculating vector
representation of the articles

In [15]:
news_articles["headline_description"] = news_articles['headline'] + ' ' + news_articles['short_description']
news_articles.head()

Unnamed: 0,category,headline,authors,link,short_description,date,headline_description
2932,QUEER VOICES,‘Will & Grace’ Creator To Donate Gay Bunny Boo...,Elyse Wanshel,https://www.huffingtonpost.com/entry/will-grac...,It's about to be a lot easier for kids in Mike...,2018-04-02,‘Will & Grace’ Creator To Donate Gay Bunny Boo...
4487,QUEER VOICES,‘The Voice’ Blind Auditions Make History With ...,"Lyndsey Parker, Yahoo Entertainment",https://www.huffingtonpost.com/entry/the-voice...,"Austin Giorgio, 21: “How Sweet It Is (To Be Lo...",2018-03-06,‘The Voice’ Blind Auditions Make History With ...
8255,QUEER VOICES,‘The Penumbra’ Is The Queer Audio Drama You Di...,"Sarah Emily Baum, ContributorFreelance Writer",https://www.huffingtonpost.com/entry/the-penum...,"Young, fun, fantastical and, most notably, inc...",2018-01-05,‘The Penumbra’ Is The Queer Audio Drama You Di...
744,COMEDY,‘The Opposition’ Gives Trump A Hot Lawyer Of H...,Ed Mazza,https://www.huffingtonpost.com/entry/trump-hot...,"He's here to make a ""strong case"" for the pres...",2018-05-11,‘The Opposition’ Gives Trump A Hot Lawyer Of H...
2893,ENTERTAINMENT,‘Stranger Things’ Fans Will Be Able To Visit T...,Elyse Wanshel,https://www.huffingtonpost.com/entry/stranger-...,"Hawkins is headed to Hollywood, Orlando and Si...",2018-04-03,‘Stranger Things’ Fans Will Be Able To Visit T...


In [16]:
news_articles['article_id'] = news_articles.index
#news_articles.reset_index(drop=True)
news_articles.head()

Unnamed: 0,category,headline,authors,link,short_description,date,headline_description,article_id
2932,QUEER VOICES,‘Will & Grace’ Creator To Donate Gay Bunny Boo...,Elyse Wanshel,https://www.huffingtonpost.com/entry/will-grac...,It's about to be a lot easier for kids in Mike...,2018-04-02,‘Will & Grace’ Creator To Donate Gay Bunny Boo...,2932
4487,QUEER VOICES,‘The Voice’ Blind Auditions Make History With ...,"Lyndsey Parker, Yahoo Entertainment",https://www.huffingtonpost.com/entry/the-voice...,"Austin Giorgio, 21: “How Sweet It Is (To Be Lo...",2018-03-06,‘The Voice’ Blind Auditions Make History With ...,4487
8255,QUEER VOICES,‘The Penumbra’ Is The Queer Audio Drama You Di...,"Sarah Emily Baum, ContributorFreelance Writer",https://www.huffingtonpost.com/entry/the-penum...,"Young, fun, fantastical and, most notably, inc...",2018-01-05,‘The Penumbra’ Is The Queer Audio Drama You Di...,8255
744,COMEDY,‘The Opposition’ Gives Trump A Hot Lawyer Of H...,Ed Mazza,https://www.huffingtonpost.com/entry/trump-hot...,"He's here to make a ""strong case"" for the pres...",2018-05-11,‘The Opposition’ Gives Trump A Hot Lawyer Of H...,744
2893,ENTERTAINMENT,‘Stranger Things’ Fans Will Be Able To Visit T...,Elyse Wanshel,https://www.huffingtonpost.com/entry/stranger-...,"Hawkins is headed to Hollywood, Orlando and Si...",2018-04-03,‘Stranger Things’ Fans Will Be Able To Visit T...,2893


In [17]:
articles_simple = news_articles[['article_id', 'headline_description']]
articles_simple.head()

Unnamed: 0,article_id,headline_description
2932,2932,‘Will & Grace’ Creator To Donate Gay Bunny Boo...
4487,4487,‘The Voice’ Blind Auditions Make History With ...
8255,8255,‘The Penumbra’ Is The Queer Audio Drama You Di...
744,744,‘The Opposition’ Gives Trump A Hot Lawyer Of H...
2893,2893,‘Stranger Things’ Fans Will Be Able To Visit T...


In [18]:
articles_sample = articles_simple.head(500)

In [19]:
corpus, ids_count_map, count_ids_map = get_text_and_mappings(articles_sample)

In [20]:
print("Loading the model...")
model_bert = SentenceTransformer(BERT_SENT)

Loading the model...


  return torch._C._cuda_getDeviceCount() > 0


In [21]:
vectors = compute_vectors(corpus, model_bert)

Calculating Embeddings of articles...
Embeddings calculated!


In [45]:
import operator
from numpy import dot
from numpy.linalg import norm

def get_cosine_sim(count_id, a, vectors):
    
    ids_scores = []
    for count in range(len(vectors)):
        if count == count_id:
            continue
        b = vectors[count]
        cos_sim = dot(a, b) / (norm(a) * norm(b))
        ids_scores.append((count, cos_sim))

    ids_scores.sort(key=operator.itemgetter(1), reverse=True)

    return ids_scores

def get_similar_vectors(count_id, vectors, N):

    a = vectors[count_id]
    
    ids_scores = get_cosine_sim(count_id, a, vectors)
    
    return ids_scores[:N]

def get_similar_articles(count_id, vectors, N, count_ids_map):

    ids_scores = get_similar_vectors(count_id, vectors, N)
    original_ids = [count_ids_map[a[0]] for a in ids_scores]

    return original_ids

In [24]:
def print_article_text(corpus, ids_count_map, similar_article_ids):
    
    #print(f'{N} movies/shows similar to\n {corpus[count_id]}:\n')
    for article_id in similar_article_ids:
        print(corpus[article_id])
        print('-' * 30)

Below we will chose a context article, for which we will then search similar articles. We hope that the short artcile description would be enough to evaluate if the recommended similar articles make sense.

In [29]:
original_id = 744
count_id = ids_count_map[original_id]
print("Context Article: ", count_id, corpus[count_id])

Context Article:  3 ‘The Opposition’ Gives Trump A Hot Lawyer Of His Own He's here to make a "strong case" for the president.


In [30]:
k = 3 # how many similar articles to retrieve

In [33]:
similar_articles = get_similar_articles(count_id, vectors, k, count_ids_map)
similar_articles

[3705, 8208, 7878]

In [34]:
similar_article_ids = [ids_count_map[id_] for id_ in similar_articles]
similar_article_ids

[277, 276, 367]

In [35]:
corpus[count_id] # context article

'‘The Opposition’ Gives Trump A Hot Lawyer Of His Own He\'s here to make a "strong case" for the president.'

In [36]:
print_article_text(corpus, ids_count_map, similar_article_ids) # similar articles

White House Lawyer Insists Trump Isn't Considering Firing Mueller Republican lawmakers have insisted that Trump let the special counsel to do his job.
------------------------------
White House Lawyer Misled Trump To Prevent James Comey's Dismissal: Report The deputy counsel was reportedly trying to prevent an obstruction investigation.
------------------------------
Wednesday's Morning Email: Judge Halts Trump Administration's Plan To Kill DACA While a lawsuit proceeds.
------------------------------


As we see in the above, the context article was about Donald Trump. Our recommendered articles also mention
Mr. Trump - it makes sense to assume that people who read the context article would also be interested in reading
our recommendations.

In [72]:
news_articles.head(70)

Unnamed: 0,category,headline,authors,link,short_description,date,headline_description,article_id
2932,QUEER VOICES,‘Will & Grace’ Creator To Donate Gay Bunny Boo...,Elyse Wanshel,https://www.huffingtonpost.com/entry/will-grac...,It's about to be a lot easier for kids in Mike...,2018-04-02,‘Will & Grace’ Creator To Donate Gay Bunny Boo...,2932
4487,QUEER VOICES,‘The Voice’ Blind Auditions Make History With ...,"Lyndsey Parker, Yahoo Entertainment",https://www.huffingtonpost.com/entry/the-voice...,"Austin Giorgio, 21: “How Sweet It Is (To Be Lo...",2018-03-06,‘The Voice’ Blind Auditions Make History With ...,4487
8255,QUEER VOICES,‘The Penumbra’ Is The Queer Audio Drama You Di...,"Sarah Emily Baum, ContributorFreelance Writer",https://www.huffingtonpost.com/entry/the-penum...,"Young, fun, fantastical and, most notably, inc...",2018-01-05,‘The Penumbra’ Is The Queer Audio Drama You Di...,8255
744,COMEDY,‘The Opposition’ Gives Trump A Hot Lawyer Of H...,Ed Mazza,https://www.huffingtonpost.com/entry/trump-hot...,"He's here to make a ""strong case"" for the pres...",2018-05-11,‘The Opposition’ Gives Trump A Hot Lawyer Of H...,744
2893,ENTERTAINMENT,‘Stranger Things’ Fans Will Be Able To Visit T...,Elyse Wanshel,https://www.huffingtonpost.com/entry/stranger-...,"Hawkins is headed to Hollywood, Orlando and Si...",2018-04-03,‘Stranger Things’ Fans Will Be Able To Visit T...,2893
...,...,...,...,...,...,...,...,...
5938,ENTERTAINMENT,YouTube Suspends Ads From Logan Paul Videos Af...,Jenna Amatulli,https://www.huffingtonpost.com/entry/youtube-l...,"The streaming service says Paul's channel is ""...",2018-02-09,YouTube Suspends Ads From Logan Paul Videos Af...,5938
8515,ENTERTAINMENT,YouTube Star Logan Paul Sparks Outrage With Di...,Lee Moran,https://www.huffingtonpost.com/entry/logan-pau...,Actors Sophie Turner and Aaron Paul were among...,2018-01-02,YouTube Star Logan Paul Sparks Outrage With Di...,8515
6180,ENTERTAINMENT,YouTube Star Kian Lawley Fired From 'The Hate ...,Cole Delbyck,https://www.huffingtonpost.com/entry/youtube-s...,"The actor has apologized, and said he agrees w...",2018-02-06,YouTube Star Kian Lawley Fired From 'The Hate ...,6180
2894,CRIME,"YouTube Shooter Was 'Upset' With Company, Poli...","Carla Herreria, Nick Visser, and Hayley Miller",https://www.huffingtonpost.com/entry/youtube-o...,Four people were injured in the attack at the ...,2018-04-03,"YouTube Shooter Was 'Upset' With Company, Poli...",2894


We will now try to identify two different articles, and find articles similar/relavent to both. This is done by simple vector averaging before cosine similarity search. We will use articles from Entertainment and Business section.

In [80]:
bv_ind = 2893 # entertainment
bv_count_id = ids_count_map[bv_ind]
bv_vect = vectors[bv_count_id]

print("1st Context Article: ", bv_count_id, corpus[bv_count_id])

1st Context Article:  4 ‘Stranger Things’ Fans Will Be Able To Visit The Upside Down IRL Hawkins is headed to Hollywood, Orlando and Singapore this fall.


In [81]:
en_ind = 3510 # business
en_count_id = ids_count_map[en_ind]
en_vect = vectors[en_count_id]

print("2nd Context Article: ", en_count_id, corpus[en_count_id])

2nd Context Article:  69 YouTube Quietly Escalates Crackdown On Firearm Videos The video site is expanding restrictions following the Florida massacre.


In [82]:
avg_vect = average_array = (bv_vect + en_vect) / 2

similar_articles_ = get_cosine_sim(count_id, avg_vect, vectors)

In [83]:
similar_article_ids_ = [ids_count_map[id_[0]] for id_ in similar_articles_ if id_[0] in ids_count_map][:k]

In [84]:
print_article_text(corpus, ids_count_map, similar_article_ids_)

Whistleblower Leaked Michael Cohen's Financials Over Potential Cover-Up: Report The whistleblower said two files about Cohen's business dealings are missing from a government database.
------------------------------
What You Missed About The Saddest Death In 'Avengers: Infinity War' Directors Joe and Anthony Russo answer our most pressing questions.
------------------------------
Will Ferrell And Molly Shannon Cover The Royal Wedding As 'Cord And Tish' They should cover everything.
------------------------------


#### Evaluation

One way to evaluate a content-based recommender system is by 'manual' inspection of results, as we have demonstrated above. In case of a news platform, someone from the editorial team can check if - given a context article - the recommended articles make sense.

However, a golden standard in evaluating/comparing impact of two or more recommenders (be it content-based, or with user interactions) is AB-testing. This simply means launching the models, assigning fair amount of traffic 
to each, then watching how they behave: which one has a higher click-through-rate, basically.

## 2. Collaborative-filtering

Below we will provide implementations of two collaborative filtering approaches that give personalized recommendations to the users. The cold-start problem will also be tackled, and in the end we will implement some basic evaluation metrics that would tell us which model is to be preferred.

The list of generated recommendations would be under the title "Recommendations for you", "Others also read" or "Personalized Recommendations".

#### Generating user-item interactions

We'll create a simulated user-article interaction dataset with the following assumptions for simplicity:

Users have specific interests (e.g., "politics", "entertainment", "comedy").
Articles are already categorized, so we will simply 'match' the users to their preferred category. 
We also assign a rating to the interaction: a rating 3-5 would be given in a case of a taste match, 1-2 otherwise.
The low ratings will be filtered out, but we leave the option for further exploration/

In [32]:
import random

def create_users(num_users, categories):
    interests = categories # + ['other']  # include 'other' for users with general interests
    return [{'user_id': i + 1, 'interest': random.choice(interests)} for i in range(num_users)]

# generate user-article interactions
def generate_interactions(users, articles_df):
    interactions = []
    for user in users:
        user_interest = user['interest']
        for _, article in articles_df.iterrows():
            # bias: Higher probability of higher rating for interest match
            # if article['category'] == user_interest or user_interest == 'other':
            if article['category'] == user_interest:
                rating = random.randint(3, 5)
            else:
                rating = random.randint(1, 2) # (1, 3)
            interactions.append({
                'user_id': user['user_id'],
                'article_id': article['article_id'],
                'rating': rating
            })
    return pd.DataFrame(interactions)

In [33]:
articles = news_articles.head(3000)

In [137]:
# create 30 users
num_users = 300
categories = articles['category'].unique().tolist()
users_dynamic = create_users(num_users, categories)

In [138]:
# cenerate the user-article interactions dataset
interactions = generate_interactions(users_dynamic, articles)
print(interactions.head())

   user_id  article_id  rating
0        1        2932       1
1        1        4487       1
2        1        8255       1
3        1         744       1
4        1        2893       2


In [139]:
interactions.shape

(900000, 3)

In [140]:
users_dynamic

[{'user_id': 1, 'interest': 'LATINO VOICES'},
 {'user_id': 2, 'interest': 'ARTS & CULTURE'},
 {'user_id': 3, 'interest': 'WORLD NEWS'},
 {'user_id': 4, 'interest': 'LATINO VOICES'},
 {'user_id': 5, 'interest': 'TRAVEL'},
 {'user_id': 6, 'interest': 'STYLE'},
 {'user_id': 7, 'interest': 'WOMEN'},
 {'user_id': 8, 'interest': 'RELIGION'},
 {'user_id': 9, 'interest': 'QUEER VOICES'},
 {'user_id': 10, 'interest': 'TRAVEL'},
 {'user_id': 11, 'interest': 'TECH'},
 {'user_id': 12, 'interest': 'WEIRD NEWS'},
 {'user_id': 13, 'interest': 'GREEN'},
 {'user_id': 14, 'interest': 'LATINO VOICES'},
 {'user_id': 15, 'interest': 'EDUCATION'},
 {'user_id': 16, 'interest': 'BLACK VOICES'},
 {'user_id': 17, 'interest': 'GREEN'},
 {'user_id': 18, 'interest': 'CRIME'},
 {'user_id': 19, 'interest': 'LATINO VOICES'},
 {'user_id': 20, 'interest': 'ENTERTAINMENT'},
 {'user_id': 21, 'interest': 'RELIGION'},
 {'user_id': 22, 'interest': 'STYLE'},
 {'user_id': 23, 'interest': 'RELIGION'},
 {'user_id': 24, 'interes

In [141]:
# collecting interactions with ratings greater than or equal to 3
interactions = interactions[interactions['rating'] >= 3]
print(interactions.shape)

(25768, 3)


In [142]:
interactions = interactions.sample(frac=1)
interactions.to_csv('interactions_02-01-12-34.csv')

In [None]:
interactions = pd.read_csv('interactions_02-01-12-34.csv')

In [143]:
user_id = 74

We see above that the user number 18 has interests in sports. Let's see if we have matched him with
appropriate articles.

In [144]:
specific_articles = interactions[interactions['user_id'] == user_id]['article_id'].unique().tolist()

In [145]:
def print_articles_from_list(articles):
    
    for id_ in articles:
        print(news_articles.loc[id_]['headline_description'] + "\n")

In [146]:
print_articles_from_list(specific_articles)

United Airlines Mistakenly Flies Family's Dog To Japan Instead Of Kansas City The mix-up came just a day after a puppy died aboard a United flight.

The 10 Best Hotels In The US In 2018, Revealed BRB, booking a trip to San Antonio immediately ✈️

Rogue Cat Rescued After Hiding Out In New York Airport For Over A Week Pepper is safe and sound!

Yelp Users Are Dragging Trump Hotels By Leaving ‘S**thole’ Reviews “Perhaps the Trump brand could take some lessons from Norway, where they have the BEST hotels."

You Can Fly Around The World For Less Than $1,200 See four cities across four continents in 16 days ✈️

Take A Virtual Disney Vacation With Stunning New Google Street View Maps Visit Disneyland and Disney World on the same day without leaving home.

These Gorgeous Secret Lagoons Exist For Only Three Months A Year Lençóis might give off Saharan vibes, but the park is not technically a desert.

The 5 Best (And Most Affordable) Places To Travel In April If you’re smart about where you plac

In [147]:
set([news_articles.loc[id_]['category'] for id_ in specific_articles])

{'TRAVEL'}

It seems we have done it correctly.

In [45]:
#interactions = interactions.sample(n=600)
#interactions.shape

In [46]:
#interactions.head()

Let's create train and test data. The train data would be used for models training. 

We will provide two models. For the first, the train data would be 
used to create a vector, for each user, where this vector would be populated by ratings given to news articles.
Once we have created the user vectors, we can retrieve the most similar users via, e.g., cosine similarity. 
And once similar users are identified, we can easily collect items they have seen that the context users has not
yet seen. Let's call the first model "Similar Vectors".

The second model is a Matrix Factorization model presented in [this paper](http://yifanhu.net/PUB/cf.pdf), with an efficient implementation in `implicit` package available [here](https://github.com/benfred/implicit).

### Cold-start problem

The models can recommend items only to users that were part of training. This means that for new users for whom we don't have any interactions yet, we have to recommend items via a different strategy. One common way is to present these users with a list of most popular items.

In [148]:
most_popular_articles = train['article_id'].value_counts().index.tolist()[:5]
most_popular_articles

[6622, 1005, 2561, 3622, 1861]

In [149]:
print_articles_from_list(most_popular_articles)

These 'No Promo Homo' Laws Are Hurting LGBTQ Students Across America Seven states still have laws that specifically target gay students.

Trump Defends Gina Haspel, His Nominee For CIA Director, And Her Record Of Torture “We have the most qualified person, a woman, who Democrats want Out because she is too tough on terror,” Trump tweeted Monday.

Trump Condemns 'Sick' Syria Disaster Yet Slams The Door On Countless Refugees The U.S. has resettled only 44 Syrian refugees since October.

Trump Congratulates Putin On Totally Expected Victory In Russian Election It seems the U.S. president at least has no hard feelings toward Russia.

Trump Considering 'Full Pardon' Of Late Boxing Champion Jack Johnson Johnson, the first black heavyweight champion, was arrested for driving his girlfriend across state lines.



In general, below is the recipe on which type of recommendations to provide to a particular user-type (where user-type is determined by user activity on the platform):

- no interactions -> most popular items
- some interactions -> content-based items
- more interactions -> collaborative filtering

#### Train-test split

In [150]:
train = interactions.head(20000)
test = interactions.tail(interactions.shape[0] - train.shape[0])

#### Similar Vectors model

In [151]:
# function to recommend articles for a given user
def recommend_articles_sv(user_id, user_article_matrix, user_similarity_df, top_n=5):
    
    # identify similar users
    n_similar_users = 3
    similar_users = user_similarity_df[user_id].sort_values(ascending=False)[1:n_similar_users+1].index
    
    # get articles interacted by similar users with ratings > 0
    articles_seen_by_similar = user_article_matrix.loc[similar_users]
    articles_seen_by_similar = articles_seen_by_similar[articles_seen_by_similar > 0].stack().index.tolist()
    
    # get articles interacted by the target user with ratings > 0
    articles_seen_by_user = user_article_matrix.loc[user_id]
    articles_seen_by_user = articles_seen_by_user[articles_seen_by_user > 0].index.tolist()
    
    #print(len(articles_seen_by_similar), len(articles_seen_by_user))
    # filter out articles the user has already seen
    recommended_articles = [article for user, article in articles_seen_by_similar if article not in articles_seen_by_user]
    
    # select unique articles and limit the number of recommendations
    recommended_articles = list(set(recommended_articles))[:top_n]
    
    return recommended_articles

In [152]:
user_article_matrix = train.pivot_table(index='user_id', columns='article_id', values='rating', fill_value=0)

# compute cosine similarity between users
user_similarity = cosine_similarity(user_article_matrix)
user_similarity_df = pd.DataFrame(user_similarity, index=user_article_matrix.index, columns=user_article_matrix.index)

In [153]:
user_id = 74

In [154]:
recommended_articles_sv = recommend_articles_sv(user_id, user_article_matrix, user_similarity_df, top_n=5)
recommended_articles_sv

63 18


[5129, 5199, 4146, 5395, 1299]

#### Matrix Factorization

A general MF model takes a list of triplets (user, item, rating), and then tries to create vector representaitons for both users and items, such that the inner product of user-vector and item-vector is as close to the rating as possible. 

The specific model we use actually integrates a weight component to the innner product, and restricts the ratings to a constant value of 1.

In [155]:
from typing import Iterable

import itertools
import os
import threadpoolctl
import implicit

from tqdm import tqdm
from scipy.sparse import csr_matrix

# as recommended by the implicit package warning
threadpoolctl.threadpool_limits(1, "blas")

<threadpoolctl.threadpool_limits at 0x7f905564f400>

In [156]:
def build_user_and_item_mappings(users, items):
    users_map = {user: idx for idx, user in enumerate(users)}
    items_map = {game: idx for idx, game in enumerate(items)}
    return users_map, items_map

In [157]:
def build_matrix(data, rating_col, users_map, items_map,
    user_col = "user_id",
    item_col= "article_id"):
    
    records = (
        data
        .loc[:, [user_col, item_col, rating_col]]
        .groupby(by=[user_col, item_col])
        .agg({rating_col: "sum"})
        .reset_index()
        .assign(**{
            user_col: lambda x: x[user_col].map(users_map),
            item_col: lambda x: x[item_col].map(items_map),
        })
    )
    return csr_matrix(
        (records[rating_col], (records[user_col], records[item_col])),
        shape=(len(users_map), len(items_map))
    )

#### Model parameters

The MF model has several parameters:

- alpha: a float the magnitude of which basically indicates the separation of the items the users have interactions with versus those with which they haven't
- factors: this is the lenght (dimensionality) of the vectors generated, both for users and for items
- iterations: the number of iterations involved in numerical optimization

In [158]:
def train_model(
    train_matrix,
    test_matrix,
    alpha: float = 40,
    factors: int = 50,
    iterations: int = 30,
    show_progress: bool = False
):
    model = implicit.als.AlternatingLeastSquares(
        alpha=alpha,
        factors=factors,
        iterations=iterations,
    )
    model.fit(train_matrix, show_progress=show_progress)
    
    return model

In [159]:
train.head()

Unnamed: 0,user_id,article_id,rating
379822,127,1047,3
621877,208,57,5
857185,286,1958,5
894788,299,37,5
255164,86,8436,5


In [160]:
train_data = train.groupby(by=["user_id", "article_id"]).agg("sum").reset_index()

In [161]:
# building player and bet mappings
users_map, items_map = build_user_and_item_mappings(train_data["user_id"].unique(), train_data["article_id"].unique())
print(f"Number of users: {len(users_map)}")
print(f"Number of articles: {len(items_map)}")

Number of users: 299
Number of articles: 3000


In [162]:
model = implicit.als.AlternatingLeastSquares()

In [163]:
train_matrix = build_matrix(train_data, rating_col='rating', users_map=users_map, items_map=items_map)

In [164]:
model.fit(train_matrix, show_progress=True)

100%|███████████████████████████████████████████| 15/15 [00:00<00:00, 89.22it/s]


In [165]:
items_map_inv = {val:key for key, val in items_map.items()}

In [166]:
def recommend_articles_mf(model, original_user_id, users_map, items_map, k=5):
    
    user_id = users_map[original_user_id]
    recos = set(model.recommend(user_id, train_matrix[user_id], N=k)[0])
    
    return [items_map[r] for r in recos]

In [167]:
user_id = 74

In [168]:
recommended_articles_mf = recommend_articles_mf(model, user_id, users_map, items_map_inv)
recommended_articles_mf

[1299, 2524, 5129, 5199, 5395]

In [169]:
train.head()

Unnamed: 0,user_id,article_id,rating
379822,127,1047,3
621877,208,57,5
857185,286,1958,5
894788,299,37,5
255164,86,8436,5


### 'Manual' evaluation

Below we will list the articles that the context user has actually read, and compare this list to the lists generated by our two models

In [170]:
def get_seen_articles_by_user(df, user_id):
    
    return df[df['user_id'] == user_id]['article_id'].tolist()

In [171]:
seen_articles = get_seen_articles_by_user(test, user_id)
len(seen_articles)

7

In [172]:
set([news_articles.loc[id_]['category'] for id_ in seen_articles])

{'TRAVEL'}

In [173]:
set([news_articles.loc[id_]['category'] for id_ in recommended_articles_sv])

{'TRAVEL'}

In [174]:
set([news_articles.loc[id_]['category'] for id_ in recommended_articles_mf])

{'TRAVEL'}

We see that both of our models recommend items that belong to the 'travel' category, which means that both lists are relevant. Below we introduce evaluation metrics that can help us compare the models in a numerical fashion.

### Evaluation Metrics

Another, probably more effective way to measure the quality of a recommender is by measuring metrics values, such
as precision and recall. These metrics evaluate the relevancy of the recommendations. Precision measures the proportion of recommended items that are relevant, while recall assesses the proportion of relevant items that are recommended. High precision means that most of the recommended items are relevant, and high recall means that most of the relevant items are recommended.

In [175]:
def precision_and_recall_at_k(train, test, recommended, k):
    
    train_user_ids = train['user_id'].unique().tolist()
    test_user_ids = test['user_id'].unique().tolist()
    
    common_user_ids = list(set(train_user_ids).intersection(set(test_user_ids)))
    print('Number of common users: ', len(common_user_ids))

    precision = 0
    recall = 0
    
    for user_id in common_user_ids:
        #recommended_articles = recommend_articles(user_id, user_article_matrix, user_similarity_df, top_n=k)
        recommended_articles = recommended[user_id]
        test_articles = get_seen_articles_by_user(test, user_id)
        
        intersection_count = len(list(set(recommended_articles).intersection(set(test_articles))))
        
        # precision
        if k > 0:
            precision += intersection_count / k
        
        # recall
        if len(test_articles) > 0:
            recall += intersection_count / len(test_articles)
    
    # division by zero is handled
    if len(common_user_ids) > 0:
        average_precision = precision / len(common_user_ids)
        average_recall = recall / len(common_user_ids)
    else:
        average_precision = 0
        average_recall = 0
    
    return average_precision, average_recall

We will first extract a list of users which appear both in the train and in the test set, because - as discussed in the Cold-start section - our models can generate recommendations only to users whose interactions they have
been trained on.

In [176]:
train_user_ids = train['user_id'].unique().tolist()
test_user_ids = test['user_id'].unique().tolist()
    
common_user_ids = list(set(train_user_ids).intersection(set(test_user_ids)))
len(common_user_ids)

284

In [177]:
common_user_ids

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,
 185,
 186,
 187,
 188,
 189,
 190,
 191,
 192,

In [178]:
recos_sv = {user_id: recommend_articles_sv(user_id, user_article_matrix, user_similarity_df, top_n=5) 
                         for user_id in common_user_ids}

52 14
10 3
430 136
58 17
66 23
19 6
158 47
36 8
329 104
67 20
40 15
177 56
23 8
57 18
36 12
293 100
111 36
63 17
1031 346
39 13
25 8
40 10
10 1
201 65
179 57
430 133
86 18
88 22
329 105
325 113
22 6
470 157
40 11
44 12
36 10
60 19
156 53
301 91
311 94
114 36
197 54
325 97
67 19
44 15
87 27
14 4
105 26
200 66
62 20
61 17
326 110
437 143
42 13
20 6
469 160
103 29
42 12
56 16
43 15
38 10
60 21
10 3
35 10
9 3
114 36
92 24
307 99
183 57
1017 345
38 12
180 62
63 18
28 7
39 13
25 8
14 4
333 110
36 12
10 2
473 150
40 13
24 5
37 11
34 10
478 152
40 10
430 138
21 6
55 17
36 11
64 22
65 17
41 13
36 13
114 39
23 7
38 11
53 17
66 20
60 17
39 11
11 3
298 103
181 56
67 22
35 11
36 10
60 22
1017 341
318 99
35 11
26 6
155 46
290 99
183 59
183 61
41 10
327 105
34 8
35 11
321 88
182 55
332 108
182 59
87 27
39 10
37 12
25 6
3055 1000
67 20
323 110
301 96
477 138
11 4
114 39
97 34
102 28
1027 335
35 11
101 36
1036 336
293 90
159 50
87 23
10 2
38 11
4 1
67 22
37 12
42 14
12 4
3022 1043
426 138
423 133
329 1

In [179]:
recos_mf = {user_id: recommend_articles_mf(model, user_id, users_map, items_map_inv) 
                         for user_id in common_user_ids}

In [180]:
recos_sv

{1: [1472, 6913, 4834, 6793, 7121],
 2: [8310],
 3: [4098, 3075, 5891, 7686, 3078],
 4: [7809, 6793, 6699, 6357, 1693],
 5: [3457, 4420],
 6: [7962, 8287],
 7: [6920, 8201, 4754, 275, 4505],
 8: [774, 7756, 1202, 146, 7032],
 9: [8577, 7810, 5763, 2437, 5904],
 10: [5344, 8448, 999, 4146, 2715],
 11: [568],
 12: [3264, 7618, 8451, 7302, 3527],
 13: [8066],
 14: [4834, 6699, 3115, 5579, 8505],
 15: [4792, 2666, 3029],
 16: [7812, 4742, 6407, 1032, 649],
 18: [7648, 7073, 835, 7144, 6999],
 19: [6252, 8084, 8505, 1693, 6110],
 20: [1, 1027, 4615, 1545, 2570],
 21: [7756],
 22: [7755],
 23: [7032, 958, 774, 7756],
 24: [8064, 8148, 8310],
 25: [2436, 3845, 2696, 6154, 6176],
 26: [5050, 673, 8450, 6403, 2082],
 27: [8326, 3078, 2314, 6796, 6415],
 28: [4667, 1572, 8452, 3110, 7112],
 29: [1156, 3110, 7112, 361, 7346],
 30: [5762, 5898, 5132, 1939, 5909],
 31: [4737, 131, 2437, 4361, 6672],
 32: [8066, 7766, 8166],
 33: [7301, 4231, 6540, 6158, 7055],
 34: [1416, 138, 4882, 499, 7380],
 35

In [181]:
precision_and_recall_at_k(train, test, recos_sv, k=5)

Number of common users:  284


(0.784507042253521, 0.6387401993297399)

In [182]:
precision_and_recall_at_k(train, test, recos_mf, k=5)

Number of common users:  284


(0.6542253521126759, 0.635305934716874)

#### Position-based metrics

However, beside considering only the number of common items between the recommendations and the test set, there are also metrics that measure the position of the common items. Mean Reciprocal Rank (MRR) is one of these
metrics.

MRR is calculated as the average of the reciprocal ranks of the first correct answer for a set of queries or users. The reciprocal rank is the inverse of the rank at which the first relevant item appears; for example, if the first relevant item appears in the third position, the reciprocal rank is 1/3.

MRR is particularly useful when the position of the first relevant recommendation is more significant than the presence of other relevant items in the list. It's a way to quantify how effectively a system can provide the most relevant result as quickly as possible. High MRR values indicate that the system often ranks the most relevant items higher, which is crucial for user satisfaction in scenarios where users are likely to consider only the top few recommendations or answers.

In [187]:
def calculate_mrr(common_user_ids, recommended, k):
    
    reciprocal_ranks = []
    
    for user_id in common_user_ids:

        #recommended_articles = recommend_articles(user_id, user_article_matrix, user_similarity_df, top_n=k) #recommend_to_user(user_id)
        recommended_articles = recommended[user_id]
        actual_articles = test[test['user_id'] == user_id]['article_id'].tolist()
        
        # find the rank of the first relevant (actual) article in the recommendations
        rank = None
        for i, article_id in enumerate(recommended_articles):
            if article_id in actual_articles:
                rank = i + 1  # adding 1 because index starts at 0, but ranks start at 1
                break
        
        # if a relevant article was found in the recommendations, calculate its reciprocal rank
        if rank:
            reciprocal_ranks.append(1 / rank)
    
    #print(reciprocal_ranks)
    # calculate MRR
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0
    return mrr

In [188]:
calculate_mrr(common_user_ids, recos_sv, k)

1.0

In [189]:
calculate_mrr(common_user_ids, recos_mf, k)

0.7596692111959288

Both precision-recall and the MRR results indicate that the Similar Vector approach gives better recommendations. However, note that the results could be different with real-world data.