<a href="https://colab.research.google.com/github/svenheins/article-recommender-system/blob/main/Recommender_system_news_articles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommender system: recommend news articles to users
This notebook can be used to get a good starting point for a recommender system which is based on news data. The example data is coming from a kaggle challenge. In order to run the notebook you have to

1. get the data from kaggle: https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles/
2. upload to google drive
3. mount your google drive and define the respective path forther down

### Install dependencies



In [1]:
## just run if you are in colab. Locally you should create a respective environment (i.e. conda environment)
!pip install -U -q textblob plotly pandas scikit-learn matplotlib sentence-transformers surprise

# Read the data

In [2]:
import pandas as pd
## Adjust the line accordingly, depending where you uploaded the csv
df = pd.read_csv("/content/drive/MyDrive/data/Articles.csv", encoding='latin-1')

In [3]:
df.columns = ['article', 'date', 'heading', 'news_type']

# Feature engineering:
Date features, text features (tfidf, sentiment)

In [5]:
df['datetime'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# Assuming 'Date' is in datetime format
df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month
df['day'] = df['datetime'].dt.day

df['weekday'] = df['datetime'].dt.weekday

## Calculate and add age to the data

In [6]:
max_datetime = df['datetime'].max()
def compute_age(datetime_value):
  return (max_datetime - datetime_value).days
df['age'] = df['datetime'].apply(compute_age)

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

df['article_length'] = df['article'].apply(len)
df['heading_length'] = df['heading'].apply(len)

def count_unique_words(text):
    words = word_tokenize(text.lower())
    unique_words = set(words) - stop_words
    return len(unique_words)

df['article_unique_words'] = df['article'].apply(count_unique_words)
df['heading_unique_words'] = df['heading'].apply(count_unique_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
# Define the mapping from category to binary values
category_mapping = {'business': 0, 'sports': 1}

# Create a new column 'News Type Encoded' based on the mapping
df['news_type_encoded'] = df['news_type'].map(category_mapping)


## Add tfidf features to the dataframe

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df['article'])

# Add TF-IDF features to the DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns="tfidf_" + tfidf_vectorizer.get_feature_names_out())
df = pd.concat([df, tfidf_df], axis=1)

## Add sentiment features based on article content

In [10]:
from textblob import TextBlob

# Function to calculate sentiment score using TextBlob
def calculate_sentiment(text):
    analysis = TextBlob(text)
    # Get the sentiment polarity (-1 to 1, where -1 is negative, 1 is positive)
    sentiment_score = analysis.sentiment.polarity
    return sentiment_score

# Apply sentiment analysis function to the 'Article' column
df['sentiment_score'] = df['article'].apply(calculate_sentiment)

## Identify topics
Based on some keywords mentioned in the text you can define topics, add more
or a combination of keywords to get more specific or cover different aspects

In [11]:
# Function to calculate sentiment score using TextBlob
def topic_identification(text_original):
    text = text_original.lower()
    topic_value = 'no'
    if 'oil price' in text:
      topic_value = 'oil price'#'|'.join([topic_value, 'oil price'])
    elif ' pia ' in text:
      topic_value = 'pia'
    elif 'nepra' in text:
      topic_value = 'nepra'
    elif 'gold price' in text:
      topic_value = 'gold price'
    elif 'united states' in text:
      topic_value = 'united states' #'|'.join([topic_value, 'US'])
    elif 'pakistan' in text:
      topic_value = 'pakistan'#'|'.join([topic_value, 'Pakistan'])
    elif 'healthcare' in text:
      topic_value = 'healthcare' #'|'.join([topic_value, 'healthcare'])
    elif 'india' in text:
      topic_value = 'india' #'|'.join([topic_value, 'India'])
    elif 'emergency' in text:
      topic_value = 'emergency' # '|'.join([topic_value, 'emergency'])
    elif 'peace' in text:
      topic_value = 'peace' #'|'.join([topic_value, 'peace'])
    elif 'weather' in text:
      topic_value = 'weather' #'|'.join([topic_value, 'weather'])
    return topic_value



## Tfidf embedding
To get a first impression how informative the tfidf is, we are computing the
t-SNE embedding for some features combined with the tfidf-matrix

In [12]:
## try different feature combination to see other clustering

#df_features = df[['article_length', 'weekday', 'sentiment_score', 'news_type_encoded']]
df_features = df[['weekday', 'sentiment_score', 'age']]

## combine the above features with the tfidf-compartment of the dataframe
## (columns 15-5015)
df_features = pd.concat([df_features, df.iloc[:,15:5015]], axis=1)

In [13]:
from sklearn.preprocessing import StandardScaler

# Standardize your numerical features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_features))

In [14]:
from sklearn.manifold import TSNE
import numpy as np

tsne = TSNE(n_components=2, random_state=42, n_iter=2000, perplexity=8)

# Fit t-SNE to your standardized numerical features
tsne_tfidf = tsne.fit_transform(df_scaled)
df_tsne_tfidf = pd.DataFrame(data=tsne_tfidf, columns=['tsne1', 'tsne2'])


df_tsne_tfidf['article'] = df['article']
df_tsne_tfidf['heading'] = df['heading']
df_tsne_tfidf['sentiment_score'] = df['sentiment_score']
df_tsne_tfidf['weekday'] = df['weekday']
df_tsne_tfidf['year'] = df['year']
df_tsne_tfidf['month'] = df['month']
df_tsne_tfidf['day'] = df['day']
df_tsne_tfidf['age'] = df['age']
df_tsne_tfidf['heading_length'] = df['heading_length']
df_tsne_tfidf['article_length'] = df['article_length'].apply(np.log10)
df_tsne_tfidf['news_type_encoded'] = df['news_type_encoded']
df_tsne_tfidf['topic'] = df['article'].apply(topic_identification)

### Plot Tfidf embedding
In order to get a first impression about the structure or clusters, we plot the tsne embedding

In [15]:
import plotly.graph_objs as go
import plotly.express as px


fig = px.scatter(df_tsne_tfidf,
                 x="tsne1",
                 y="tsne2",
                 color='topic',
                 hover_data=['sentiment_score',
                             'news_type_encoded',
                             'heading',
                             'topic',
                             'weekday',
                             'year',
                             'month',
                             'day',
                             'age',
                             'article_length',
                             'heading_length',
                             ],
                 opacity = 1,
                 )
fig.show()

## Sentence Transformer embedding
To get a more informative latent representation we are now using the sentence transformer model.

In [16]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

#Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']

#Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)


In [17]:
# Function to calculate sentiment score using TextBlob
def sentence_embedding(text):
    sentence_embedding = model.encode(text)
    return sentence_embedding

# Apply sentiment analysis function to the 'Article' column
df['sentence_embedding'] = df['article'].apply(sentence_embedding)

In [18]:
sentence_embedding_df = df['sentence_embedding'].apply(pd.Series)


In [19]:
# Define the prefix
prefix = 'se_'

# Add the prefix to column names
sentence_embedding_df.columns = [prefix + str(col) for col in sentence_embedding_df.columns]

In [20]:
## try different alternative feature combination to see the resulting clustering

#df_features_se = df[['weekday', 'sentiment_score', 'news_type_encoded', 'year', 'month', 'day', 'age']]
#df_features_se = pd.concat([sentence_embedding_df], axis=1)
df_features_se = df[['weekday', 'sentiment_score', 'age']]
df_features_se = pd.concat([df_features_se, sentence_embedding_df], axis=1)



In [21]:
# Standardize your numerical features
scaler = StandardScaler()
df_features_se_scaled = pd.DataFrame(scaler.fit_transform(df_features_se))

In [22]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42, n_iter=2000, perplexity=8)

# Fit t-SNE to your standardized numerical features
tsne_sentence_embedding = tsne.fit_transform(df_features_se_scaled)

In [23]:
df_tsne_sentence_embedding = pd.DataFrame(data=tsne_sentence_embedding, columns=['tsne1', 'tsne2'])

In [24]:
## enrich the dataframe with metadata
df_tsne_sentence_embedding['article'] = df['article']
df_tsne_sentence_embedding['heading'] = df['heading']
df_tsne_sentence_embedding['sentiment_score'] = df['sentiment_score']
df_tsne_sentence_embedding['weekday'] = df['weekday']
df_tsne_sentence_embedding['year'] = df['year']
df_tsne_sentence_embedding['month'] = df['month']
df_tsne_sentence_embedding['day'] = df['day']
df_tsne_sentence_embedding['age'] = df['age']
df_tsne_sentence_embedding['heading_length'] = df['heading_length']
df_tsne_sentence_embedding['article_length'] = df['article_length'].apply(np.log10)
df_tsne_sentence_embedding['news_type_encoded'] = df['news_type_encoded']
df_tsne_sentence_embedding['topic'] = df['article'].apply(topic_identification)


### Plot sentence transformer embedding

In [25]:
## plot the tsne-embedding
fig = px.scatter(df_tsne_sentence_embedding,
                 x="tsne1",
                 y="tsne2",
                 color='topic',
                 hover_data=['sentiment_score',
                             'news_type_encoded',
                             'heading',
                             'topic',
                             'weekday',
                             'year',
                             'month',
                             'day',
                             'age',
                             'article_length',
                             'heading_length',
                             ],
                 #hover_data=['heading', 'topic'],
                 opacity = 1,
                 )
fig.show()

# User and Article classes
Now we are modeling real users and some articles. For test purposes we are just taking a small subset of the articles

In [26]:
## user class
class User:
    def __init__(self, user_id):
        self.user_id = user_id
        self.read_history = []
        self.retention_times = {}
        self.interaction_scores = {}
        self.scores = {}

    def read_article(self, article, retention_time, interaction_score):
        article_id = article.article_id
        self.read_history.append(article_id)
        self.retention_times[article_id] = retention_time
        self.interaction_scores[article_id] = interaction_score
        self.scores[article_id] = article.calculate_score(
            retention_time=retention_time, interaction_score=interaction_score)



In [27]:
## article class
class Article:
    def __init__(self, article_id, content):
        self.article_id = article_id
        self.content = content

    def calculate_score(self, retention_time, interaction_score):
        # You can define a scoring function based on retention time
        # For example, a simple scoring function could be based on the inverse of retention time
        article_length = len(self.content)
        return (np.log(retention_time) + interaction_score) / article_length

## Simple example

In [28]:
# Create user and article instances
user_dict = {0 : User(user_id=1)}
dummy_article_dict = {0: Article(article_id=0, content="Article 0"),
                1: Article(article_id=1, content="Article 1"),
                }

# Simulate user reading articles and capturing retention times
user_dict[0].read_article(dummy_article_dict[0], retention_time=10, interaction_score=1)
user_dict[0].read_article(dummy_article_dict[1], retention_time=15, interaction_score=1)

In [29]:
# Sort articles by score (descending order)
sorted_articles = sorted(user_dict[0].scores.items(), key=lambda x: x[1], reverse=True)

# Print the articles and their scores
for article_id, score in sorted_articles:
    print(f"Article {article_id}: Score = {score}")

Article 1: Score = 0.41200557790024556
Article 0: Score = 0.36695389922156063


## Full User example
1. users and article are created / retrieved from the article dataframe
2. users read articles and are plotted based on their reading profile

In [30]:
## define users and articles
count_users = 10
user_dict = {index: User(index) for index in range(count_users)}
article_dict = {id: Article(id, df.loc[id, 'article']) for id in df.index
                }

In [31]:
## user reads random articles
import random

subsample_size = 15
subsample = random.choices(list(article_dict.keys()), k=subsample_size)

## Users read some articles
not everybody reads every article, so the table also contains Nan **values**

In [32]:
## simulate users reading parts of the subsample of articles , while 2 clusters share similar preferences
for index in subsample:
  retention_time_class_1 = random.randint(10, 100)
  retention_time_class_2 = random.randint(10, 100)
  for user_id in range(count_users):
    bool_read_article = random.randint(0,100)
    if bool_read_article < 50:
      if user_id < 3:
        user_dict[user_id].read_article(article_dict[index], retention_time = retention_time_class_1 + user_id, interaction_score = 2)
      elif user_id < 6:
        user_dict[user_id].read_article(article_dict[index], retention_time = retention_time_class_2 + user_id, interaction_score = 2)
      else:
        user_dict[user_id].read_article(article_dict[index], retention_time = random.randint(10, 100), interaction_score = 1)

In [33]:
combined_dict = {index: [user_dict[user_id].scores[index] if index in user_dict[user_id].scores else None for user_id in list(user_dict.keys()) ] for index in subsample} # for user_id in list(user_dict.keys())}

In [34]:
df_user_scores = pd.DataFrame(combined_dict)

In [35]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)

imputer.fit(df_user_scores)
df_user_scores_imputed = pd.DataFrame(imputer.transform(df_user_scores))
df_user_scores_imputed.set_index(df_user_scores.index, inplace = True)
df_user_scores_imputed.columns = df_user_scores.columns

### Plot user embedding

In [36]:
tsne = TSNE(n_components=2, random_state=42, n_iter=1000, perplexity=3)

# Fit t-SNE to your standardized numerical features
tsne_user_scores = tsne.fit_transform(df_user_scores_imputed)
df_tsne_user_scores = pd.DataFrame(data=tsne_user_scores, columns=['tsne1', 'tsne2'])

df_tsne_user_scores['user_id'] = df_tsne_user_scores.index

fig = px.scatter(df_tsne_user_scores,
                 x="tsne1",
                 y="tsne2",
                 #color='topic',
                 color_discrete_sequence=px.colors.qualitative.Safe,
                 hover_data=['user_id',
                             ],
                 #hover_data=['heading', 'topic'],
                 opacity = 1,
                 )
fig.show()

## New user is introduced
1. provide top 3 article(s)
2. provide a recommendation based on the users interaction with the initial article

In [37]:
## how many top articles should be displayed?
top_n = 3
list_top_articles = list(df_user_scores.median().sort_values( ascending = False)[0:top_n].index)

In [38]:
list_top_articles

[162, 124, 2195]

In [39]:
new_user_id = count_users
user_dict[new_user_id] = User(new_user_id)

In [40]:
## user reads top 3 articles
for index in list_top_articles:
  retention_time_new_user = random.randint(10, 100)
  interaction_score = random.randint(1, 2)
  user_dict[new_user_id].read_article(article_dict[index],
                                      retention_time = retention_time_new_user,
                                      interaction_score = interaction_score)


# Recommender system
First extend the existing user score dataframe. Then impute missing values for the first approach (TruncatedSVD). Later we will recommend articles based on the matrix that still contains missing values and compare the recommendation to the imputed approach.

In [41]:
new_combined_dict = {index: [user_dict[user_id].scores[index]
                             if index in user_dict[user_id].scores
                             else None
                             for user_id in list(user_dict.keys()) ]
                      for index in subsample}
new_df_user_scores = pd.DataFrame(new_combined_dict)

imputer.fit(new_df_user_scores)
new_df_user_scores_imputed = pd.DataFrame(imputer.transform(new_df_user_scores))
new_df_user_scores_imputed.set_index(new_df_user_scores.index, inplace = True)
new_df_user_scores_imputed.columns = new_df_user_scores.columns


## Method: TruncatedSVD (Matrix factorization based on imputed matrix)


In [42]:
from sklearn.decomposition import TruncatedSVD

# Apply matrix factorization (SVD)
svd = TruncatedSVD(n_components=5)  # Choose the number of latent factors
user_factors = svd.fit_transform(new_df_user_scores_imputed)
article_factors = svd.components_

In [43]:
user_id = new_user_id
user_scores = np.dot(user_factors[user_id], article_factors)

recommendation_list = df_user_scores.columns[np.argsort(user_scores)[::-1]]

top_n_recommendations = [article_id for article_id in recommendation_list ]

print("Top n recommendations for user " + str(user_id) + ":\t" + str(top_n_recommendations))
print("Sorted imputed list (no Nan values): \t"
      + str(list(new_df_user_scores_imputed.loc[user_id,].sort_values(ascending = False).index)))
print("Sorted original read list (Nan values are listed last): "
      + str(list(new_df_user_scores.loc[user_id,].sort_values(ascending = False).index)))


Top n recommendations for user 10:	[162, 124, 2195, 230, 1344, 433, 1197, 1764, 1110, 2332, 401, 2312, 2132, 2685, 1312]
Sorted imputed list (no Nan values): 	[162, 124, 2195, 230, 1344, 433, 1764, 1197, 1110, 2332, 401, 2132, 2312, 2685, 1312]
Sorted original read list (Nan values are listed last): [162, 124, 2195, 2332, 2312, 2132, 433, 1110, 1764, 401, 2685, 1344, 1312, 1197, 230]


### Plot user embedding

In [44]:
tsne = TSNE(n_components=2, random_state=42, n_iter=1000, perplexity=3)

# Fit t-SNE to your standardized numerical features
tsne_user_scores = tsne.fit_transform(new_df_user_scores_imputed)
df_tsne_user_scores = pd.DataFrame(data=tsne_user_scores, columns=['tsne1', 'tsne2'])

df_tsne_user_scores['is_user_new'] = 'no'
df_tsne_user_scores.loc[user_id, 'is_user_new'] = 'yes'

df_tsne_user_scores['user_id'] = df_tsne_user_scores.index

fig = px.scatter(df_tsne_user_scores,
                 x="tsne1",
                 y="tsne2",
                 color='is_user_new',
                 color_discrete_sequence=px.colors.qualitative.Safe,
                 hover_data=['user_id'],
                 opacity = 1,
                 )
fig.show()

## Alternative Least Squares (ALS)
Alternating Least Squares (ALS) is one method that naturally deals with missing data

In [45]:
new_df_user_scores['user_id'] = new_df_user_scores.index

In [46]:
from surprise import SVD
from surprise import Dataset
from surprise import Reader

In [47]:
melted_new_df_user_scores = pd.melt(new_df_user_scores, id_vars=['user_id'],
                                    var_name='article', value_name='rating')

max_rating_value = new_df_user_scores.max()[:-1].max() # ignore user_id column
sur_reader = Reader(rating_scale=(0, max_rating_value))  # Define a rating scale
sur_data = Dataset.load_from_df(melted_new_df_user_scores, sur_reader)

In [48]:
# Create and train an SVD model with ALS
model = SVD(biased=False, n_factors=50, lr_all=0.005, reg_all=0.02, n_epochs=20)
model.fit(sur_data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7da5c69071f0>

In [49]:
# Recommend articles for the target user
top_n_eval = []
articles_read_by_target_user = [item for item in new_df_user_scores.columns
                                if (not np.isnan(new_df_user_scores.loc[user_id,item])
                                and item != 'user_id')]

In [50]:
article_list = melted_new_df_user_scores['article'].unique()
count_articles = len(article_list)
for article in article_list:
    predicted_rating = model.predict(user_id, article)
    top_n_eval.append((article, predicted_rating.est))

# Sort the recommendations by predicted rating
top_n_eval.sort(key=lambda x: x[1], reverse=True)
top_n_recommendations_svd = [article for article, _ in top_n_eval[:count_articles]]

print("Top recommendations for user", user_id, ":", top_n_recommendations_svd)

Top recommendations for user 10 : [2332, 2195, 2312, 2132, 433, 162, 1110, 1764, 124, 401, 2685, 1344, 1312, 1197, 230]


In [51]:
print("Top recommendations for user " + str(user_id) + ": " + str(top_n_recommendations))

Top recommendations for user 10: [162, 124, 2195, 230, 1344, 433, 1197, 1764, 1110, 2332, 401, 2312, 2132, 2685, 1312]


## Evaluation
Compare the predictions in order to see if they are pointing in the same direction (Spearman's Rank Correlation Coefficient, Kendall's Tau);

(-1: perfect negative correlation (inverse order), 0: no correlation, 1: perfect positive correlation (exact same order))

In [52]:
## stats: Spearman's Rank Correlation Coefficient, Kendall's Tau:
from scipy.stats import spearmanr, kendalltau

correlation, p_value = spearmanr(top_n_recommendations_svd, top_n_recommendations)
print("Spearman's rank correlation coefficient: :", correlation)
tau, p_value = kendalltau(top_n_recommendations_svd, top_n_recommendations)
print("Kendall's tau::", tau)

Spearman's rank correlation coefficient: : -0.2785714285714285
Kendall's tau:: -0.14285714285714288


### Interpretation
The evaluation is slighly negative, so there is a clear difference between the imputed method and the ALS method. Therefore you are advised to further analyse different methods and assess the quality of recommendation based on some test sets, where you leave out parts of the ratings but this is beyond the scope of this tutorial.