In [1]:
from normative_diversity.RADio.metric import DiversityMetric
from functions import *  # This may take a while to load first time
from visualize import *
from tqdm.auto import tqdm

tqdm.pandas()



# Preparation

Ensure that the data folder contains the following files:
- news.tsv <br>
    Tab-separated file as supplied in the MIND dataset. To be usable, entities need to have a Label and a Type. Example: <br>
    N12733	news	politics	Title		url	[{"Label": "Entity Label", "Type": "O"}]	[] <br>
- recommendations.json  <br>
    JSON file containing the generated recommendations. It should contain the fields impr_index, userid, date and history (optionally ordered by recency). All other columns are interpreted as recommendation columns, and their names as the names of the algorithms used. Example:
     impr_index userid  date        history         lstur           random
     34         U1234   13-08-1991  [N5, N4, N3],   [N1, N2, N3],   [N3, N2, N1]
<br>    
The news.tsv file follows the same format as in the MIND dataset. The recommendations can be constructed by generating MIND's prediction files and merging those with the relevant information from the MIND behavior file. The repo currently contains a sample of 100 rows and predictions, which correspond to the news.tsv in MINDsmall_dev. Reach out to us for an example of the full file. For more details about the MIND format, see https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md
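
As a rough illustration of the merge described above, the sketch below combines one line of a MIND behaviors file with the matching line of a prediction file into a single recommendation record. It assumes the usual MIND layouts (behaviors: impression id, user id, time, history and impressions, tab-separated; predictions: impression id followed by a list of ranks, one per candidate). The function name `build_record` and the algorithm name `my_algorithm` are hypothetical and not part of this repository.

```python
import json


def build_record(behavior_line, prediction_line, algorithm="my_algorithm"):
    """Merge one behaviors line with one prediction line (assumed MIND formats)."""
    impr_index, userid, date, history, impressions = behavior_line.rstrip("\n").split("\t")
    # Candidates look like "N55689-1" (with click label) or "N55689" (without).
    candidates = [c.split("-")[0] for c in impressions.split()]

    pred_index, ranks_text = prediction_line.strip().split(" ", 1)
    if pred_index != impr_index:
        raise ValueError("behavior and prediction lines refer to different impressions")
    ranks = json.loads(ranks_text)  # ranks[i] is the rank of candidates[i]; 1 = top

    # Order the candidates by their predicted rank, best first.
    ranked = [article for _, article in sorted(zip(ranks, candidates))]
    return {
        "impr_index": int(impr_index),
        "userid": userid,
        "date": date,
        "history": history.split(),
        algorithm: ranked,
    }


behavior = "34\tU1234\t11/15/2019 10:22:32 AM\tN5 N4 N3\tN1-0 N2-1 N3-0"
prediction = "34 [2,1,3]"
record = build_record(behavior, prediction)
print(record["my_algorithm"])
```

Repeating this for every impression and every algorithm, and collecting the records in one table, gives a file with the columns listed above.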

In [2]:
# the number of unique users that will be sampled for analysis. Set to 0 if no sampling should take place.
sample_size = 0

# Folder where the necessary files can be found
data_folder = "data"

In [3]:
# Preprocess the articles
articles = process_articles(data_folder + "/news.tsv")
articles.head()

article_id,category,subcategory,title,subtitle,url,entities_title,entities_subtitle,absolute_sentiment_score,persons
N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[],0.0516,"[Prince Philip, Duke of Edinburgh, Charles, Pr..."
N18955,health,medical,Dispose of unwanted prescription drugs during ...,,https://assets.msn.com/labs/mind/AAISxPN.html,"[{""Label"": ""Drug Enforcement Administration"", ...",[],0.2263,[]
N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId...",0.5719,[]
N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ...",0.1531,[]
N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...",0.0,[]


In [4]:
algorithms, predictions = process_recommendations(
    data_folder + "/recommendations_sample.json", sample_size
)
predictions.head()

Unnamed: 0,impr_index,userid,date,history,lstur,nrms,pop,random
757,758,U27694,2019-11-15 05:07:18,"[N29929, N26727, N57737, N32312, N31801, N6271...","[N48487, N6916, N37204, N30290, N53615, N5940,...","[N37204, N48487, N53615, N30290, N20036, N2351...","[N36786, N11930, N36940, N63421, N53242, N1968...","[N60939, N36786, N6916, N30290, N60975, N15719..."
1666,1667,U19405,2019-11-15 10:26:09,"[N30353, N59359, N5905, N41298, N4526, N37131,...","[N30290, N19990, N31958, N42844, N5472, N60724...","[N19990, N60724, N30290, N16327, N31958, N5940...","[N37352, N58098, N55237, N60724, N55913, N5077...","[N5472, N31958, N19990, N36779, N42844, N19990..."
1768,1769,U37816,2019-11-15 12:45:53,"[N21383, N250, N5102, N54575, N58267, N45954, ...","[N6950, N35815, N38620, N58748, N61053, N28072...","[N35815, N63342, N54593, N38324, N24802, N6072...","[N6074, N11930, N38324, N29862, N496, N23336, ...","[N58188, N31141, N6916, N24802, N39928, N48740..."
2278,2279,U63512,2019-11-15 07:55:54,"[N56586, N63842, N36739, N29177, N13137, N3016...","[N30290, N42844, N19990, N20187, N5940, N36779...","[N23513, N30290, N5940, N19990, N36779, N42844...","[N36786, N58098, N36940, N31958, N46976, N4284...","[N42844, N23355, N30290, N42844, N60757, N2003..."
3074,3075,U76977,2019-11-15 12:42:00,"[N42299, N47069, N47069, N6956, N16233, N51706...","[N28682, N31141, N20477, N61811, N59933, N2333...","[N5472, N20477, N61811, N31141, N6400, N11150,...","[N16344, N38311, N18708, N11930, N29862, N5809...","[N59933, N13601, N19990, N53201, N44621, N5472..."


# Configuring the Diversity metrics

In this section we configure the normative diversity metrics. Following the RADio framework, we conceptualize diversity as **a rank-aware divergence score between a recommendation and a context**. 

$$
D^*_f(P,Q) = \sum_x Q^*(x) \, f\left(\frac{P^*(x)}{Q^*(x)}\right)
$$

where *x* is the feature under consideration, *P* the recommendation, and *Q* the context. Both the recommendation *P* and the context *Q* can be made *rank-aware*, meaning that articles lower in the list are discounted. <br>
<br>
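
To make the formula concrete, here is a minimal sketch of the two divergences, KL and JSD, written directly over dictionaries that map a feature value *x* to its probability. KL corresponds to choosing f(t) = t log t in the formula above. The sketch uses base-2 logarithms so that JSD falls between 0 and 1. The helper names `kl_divergence` and `js_divergence` are illustrative and are not claimed to match the library's implementation, which may for instance treat zero probabilities differently.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in bits; terms with P(x) = 0 contribute nothing."""
    total = 0.0
    for x, px in p.items():
        if px > 0:
            qx = q.get(x, 0.0)
            if qx == 0:
                return math.inf  # P puts mass where Q has none
            total += px * math.log2(px / qx)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and at most 1 with log base 2."""
    keys = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


recommendation = {"news": 0.5, "sports": 0.5}
context = {"news": 0.5, "sports": 0.25, "health": 0.25}
print(js_divergence(recommendation, recommendation))  # identical distributions
print(js_divergence(recommendation, context))  # partial overlap
print(js_divergence({"news": 1.0}, {"sports": 1.0}))  # no overlap at all
```

Identical distributions score 0 and fully disjoint ones score 1. KL becomes infinite as soon as the first distribution puts mass on a value the second never produces, whereas JSD stays finite.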

Which feature and which context to consider depends heavily on the domain and the application. For example, one could compare the article categories (*x*) in the recommendation (*P*) to the categories in a user's reading history (*Q*), or compare the mentions of political parties (*x*) in the recommendation to the distribution of parties in government (*Q*). In this example we configure the metrics as in DART, but note that this is not necessarily the right configuration for *your* application.    <br>
<br>
For each metric, we need to configure the following things: 
-   *feature_type*: cat/cat_m/cont; whether the feature type is single-value categorical (cat), multi-value categorical (cat_m) or continuous (cont)
-   *rank_aware_recommendation*: True/False; whether the recommendation should be rank-aware
-   *rank_aware_context*: True/False; whether the context should be rank-aware
-   *divergence*: JSD/KL; which divergence metric to use
-   *context*: dynamic/static; for efficiency, whether the context distribution is stable or is expected to be different for every recommendation
<br>
<br>Both the recommendation and the context should be expressed as a *list*. If it is meant to be rank-aware, it is important that the first item in the list should be counted the strongest, and the last one the least.
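
The effect of list order on a rank-aware distribution can be sketched as follows. The sketch assumes a simple 1/rank discount purely for illustration; the weighting that RADio actually applies may differ, and the function name `to_distribution` is hypothetical.

```python
from collections import defaultdict


def to_distribution(features, rank_aware=True):
    """Turn an ordered list of feature values into a probability distribution.

    Assumption: position i (1-based) gets weight 1/i when rank_aware is True,
    and weight 1 otherwise. This is an illustrative discount only.
    """
    weights = defaultdict(float)
    for rank, value in enumerate(features, start=1):
        weights[value] += 1.0 / rank if rank_aware else 1.0
    total = sum(weights.values())
    return {value: weight / total for value, weight in weights.items()}


categories = ["news", "sports", "news"]
print(to_distribution(categories, rank_aware=False))
print(to_distribution(categories, rank_aware=True))
```

Without discounting, "news" gets 2/3 of the mass. With the 1/rank discount it gets (1 + 1/3) / (1 + 1/2 + 1/3) = 8/11, because the first position counts most.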


## Calibration

Calibration measures the extent to which a recommendation is tailored to a user's preferences. In this initialization, we compare the article categories in the recommendation to what a user has consumed in the past. 

**Feature**: article category <br>
**Context**: User history <br>
**Feature type**: categorical, here single but could be multi <br>
**Rank-aware**: both recommendation and context <br>
**Desired value**: Low divergence if you want to tailor to a user's preferences; higher if you want to help them encounter new things


In [5]:
Calibration = DiversityMetric(
    feature_type="cat",
    rank_aware_recommendation=True,
    rank_aware_context=True,
    divergence="JSD",
    context="dynamic",
)


def calculate_calibration(recommendations, history):
    scores = []
    context_features = make_list(history, "category")
    for recommendation in recommendations:
        recommendation_features = make_list(recommendation, "category")
        if context_features and recommendation_features:
            scores.append(
                # same argument order as the other metrics: recommendation first, context second
                Calibration.compute(recommendation_features, context_features)
            )
        else:
            scores.append(None)
    return scores

## Fragmentation

Fragmentation measures the extent to which users who received recommendations have a shared understanding, in other words the amount of overlap between their recommendations. Ideally, we would want individual articles to be clustered into stories, as users can read slightly different articles about the same topic but still have a shared understanding. However, as this information is not present in this dataset, we use article subcategory as an approximation. See also "Improving and Evaluating the Detection of Fragmentation in News Recommendations with the Clustering of News Story Chains" by Polimeno et al. for inspiration on how story clustering could work. 


Fragmentation is conceptually related to the filter bubble. When there is low Fragmentation, users see similar topics and thus have a shared understanding of the system. On the other hand, with high Fragmentation what people receive does not overlap: they exist in a personal 'bubble'. This can still be desirable, for example when the goal is to help users specialize in their domain of choice. <br>
As we need to compare recommendations to other users' recommendations, Fragmentation can be computationally heavy. Its performance is highly dependent on the number of other users we compare to. 
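
The averaging over other users can be sketched as follows. For brevity the sketch uses plain (not rank-aware) distributions and its own small JSD helper; all function names are hypothetical and the numbers are not claimed to match the library's output.

```python
import math
from collections import Counter


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms (range 0 to 1)."""
    score = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        mx = 0.5 * (px + qx)
        if px > 0:
            score += 0.5 * px * math.log2(px / mx)
        if qx > 0:
            score += 0.5 * qx * math.log2(qx / mx)
    return score


def fragmentation(own_subcategories, others_subcategories):
    """Average divergence between one user's list and each other user's list."""
    own = distribution(own_subcategories)
    scores = [js_divergence(own, distribution(other)) for other in others_subcategories]
    return sum(scores) / len(scores)


own = ["newsworld", "medical"]
same_bubble = [["medical", "newsworld"], ["newsworld", "medical"]]
other_bubble = [["soccer", "golf"], ["movies", "tv"]]
print(fragmentation(own, same_bubble))
print(fragmentation(own, other_bubble))
```

Each score needs one divergence per other user, so the total work grows with the number of users times the number of users compared against, for every algorithm.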

**Feature**: article subcategory <br>
**Context**: Other users' recommendations <br>
**Feature type**: categorical, single <br>
**Rank-aware**: both recommendation and context <br>
**Desired value**: low when we want a shared understanding, high when we want specialization

In [6]:
Fragmentation = DiversityMetric(
    feature_type="cat",
    rank_aware_recommendation=True,
    rank_aware_context=True,
    divergence="JSD",
    context="dynamic",
)


def calculate_fragmentation(recommendations, user_sample):
    scores = []
    for index, recommendation in enumerate(recommendations):
        alg = recommendations.index[index]
        recommendation_features = make_list(recommendation, "subcategory")
        score_per_user = 0
        recommendations_to_other_users = user_sample[alg]
        for x in recommendations_to_other_users:
            context_features = make_list(x, "subcategory")
            score_per_user += Fragmentation.compute(
                recommendation_features, context_features
            )
        average = score_per_user / len(user_sample)
        scores.append(average)
    return scores

## Activation

The goal of Activation is to express the affect of the content in the recommendation. We approximate affect by the absolute sentiment score: we care about the strength of the emotion, not about its valence (though there could be a lot more discussion about this). Mostly, we want to know whether the recommendation is very different from the overall content on the platform. Therefore, we set the feature type to continuous (the sentiment score is between 0 and 1) and the context to static. Since the sentiment score will be binned, we also specify the desired number of bins.

**Feature**: Absolute sentiment scores <br>
**Context**: All articles published <br>
**Feature type**: continuous <br>
**Rank-aware**: Only the recommendation <br>
**Bins**: 10 <br>
**Desired value**: Up to interpretation, and dependent on a) whether the general tone of articles published is emotional or neutral, and b) whether we want this recommendation to reflect that tone or not.
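
How a continuous score becomes a distribution can be sketched with a simple histogram. The sketch assumes equal-width bins over the range 0 to 1, which may not be how the library bins internally; the function name `binned_distribution` is hypothetical. The example scores are taken from the article preview shown earlier.

```python
def binned_distribution(scores, bins=10, low=0.0, high=1.0):
    """Histogram of scores over equal-width bins, normalized to sum to 1."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * bins
    width = (high - low) / bins
    for score in scores:
        # Clamp so that score == high lands in the last bin.
        index = min(int((score - low) / width), bins - 1)
        counts[index] += 1
    total = len(scores)
    return [count / total for count in counts]


all_articles = [0.0516, 0.2263, 0.5719, 0.1531, 0.0]
recommended = [0.5719, 0.2263]
print(binned_distribution(all_articles))
print(binned_distribution(recommended))
```

The two histograms are then compared like any other pair of categorical distributions, with the bin index playing the role of the feature value.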



In [7]:
Activation = DiversityMetric(
    feature_type="cont",
    rank_aware_recommendation=True,
    rank_aware_context=False,
    divergence="JSD",
    bins=10,
    context="static",
)

# This variable helps the metrics to be calculated more efficiently, as we don't need to retrieve the sentiment scores of all articles every time.
all_article_sentiments = list(articles["absolute_sentiment_score"])


def calculate_activation(recommendations):
    scores = []
    context_features = all_article_sentiments
    for recommendation in recommendations:
        recommendation_features = make_list(recommendation, "absolute_sentiment_score")
        if context_features and recommendation_features:
            activation = Activation.compute(recommendation_features, context_features)
            scores.append(activation)
        else:
            scores.append(None)
    return scores

## Representation

Representation aims to express to what extent the recommendation is representative of different people/groups in society. Usually we conceptualize this in the context of politics; meaning, the relevant features are the politicians mentioned in the texts. However, this information is not available in MIND. To still show how this metric would work theoretically we use *all* entities of type 'P' as an approximation. 'Representative' can also be conceptualized in different ways. We could want all groups to be represented equally; inversely to give a larger platform to marginalized groups; or reflective of society, for example by comparing to political party distribution in government. Here, we compare to the people mentioned in the dataset. We also apply the Representation metric **only** to articles from the 'news' category.  

**Feature**: People mentioned <br>
**Context**: All articles published <br>
**Feature type**: categorical, multi <br>
**Rank-aware**: Only the recommendation <br>
**Desired value**: Low if we want the recommendation to reflect the chosen context distribution
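
For a multi-value feature such as people mentioned, each article contributes a list of values rather than a single one. A minimal sketch of one way to turn such lists into a distribution is given below. It assumes a 1/rank article weight that is split evenly over the people in the article; this is an illustrative choice, not necessarily what `cat_m` does internally, and the function name and the placeholder people are hypothetical.

```python
from collections import defaultdict


def multi_value_distribution(persons_per_article, rank_aware=True):
    """Distribution over people from an ordered list of per-article person lists.

    Assumptions (illustrative only): article i (1-based) carries weight 1/i
    when rank_aware is True, and that weight is split evenly over the people
    the article mentions. Articles mentioning nobody contribute nothing.
    """
    weights = defaultdict(float)
    for rank, persons in enumerate(persons_per_article, start=1):
        if not persons:
            continue
        article_weight = 1.0 / rank if rank_aware else 1.0
        for person in persons:
            weights[person] += article_weight / len(persons)
    total = sum(weights.values())
    return {person: weight / total for person, weight in weights.items()}


recommendation = [["Person A", "Person B"], [], ["Person A"]]
print(multi_value_distribution(recommendation))
```

Here Person A receives 1/2 from the first article and 1/3 from the third, Person B receives 1/2, so after normalization the shares are 5/8 and 3/8.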

In [8]:
Representation = DiversityMetric(
    feature_type="cat_m",
    # parameter names follow the configuration options listed above
    rank_aware_recommendation=True,
    rank_aware_context=False,
    divergence="JSD",
    context="static",
)

# Only apply Representation to hard news
news = articles[articles["category"] == "news"]
# This variable helps the metrics to be calculated more efficiently, as we don't need to retrieve the persons mentioned in all news articles every time.
all_news_persons = list(news[news["persons"].map(len) > 0].persons)


def calculate_representation(recommendations):
    scores = []
    context_features = all_news_persons
    for recommendation in recommendations:
        # only consider the articles in the recommendation of the 'news' category,
        # keeping the recommendation's rank order (Index.intersection does not guarantee it)
        news_in_recommendation = [a for a in recommendation if a in news.index]
        # the column 'persons' has already been constructed during preprocessing
        recommendation_features = make_list(news_in_recommendation, "persons")
        if context_features and recommendation_features:
            representation = Representation.compute(
                recommendation_features, context_features
            )
            scores.append(representation)
        else:
            scores.append(None)
    return scores

## Alternative Voices

With Alternative Voices, we aim to express whether the recommendation exposes readers to people different from themselves. Here, the focus should lie on people from marginalized groups. However, this information is not present in the MIND dataset, so, as with Representation, we configure the metric to consider the people mentioned in the recommendation. Unlike Representation, the context distribution consists of the people mentioned in the user's reading history. Note that it is very likely that there is zero overlap between the people in the reading history and the people in the recommendation; in the current setup, the score is expected to mostly be 1. 

**Feature**: People mentioned <br>
**Context**: The user's history <br>
**Feature type**: categorical, multi <br>
**Rank-aware**: Both the recommendation and the context<br>
**Desired value**: Higher divergence if we want the user to encounter new perspectives; lower if we want them to find their niche.

In [9]:
AlternativeVoices = DiversityMetric(
    # MIND has no information on marginalized groups, so all people mentioned are used as a proxy
    feature_type="cat_m",
    # parameter names follow the configuration options listed above
    rank_aware_recommendation=True,
    rank_aware_context=True,
    divergence="JSD",
    context="dynamic",
)


def calculate_alternative_voices(recommendations, history):
    scores = []
    context_features = make_list(history, "persons")
    for recommendation in recommendations:
        recommendation_features = make_list(recommendation, "persons")
        if context_features and recommendation_features:
            alternative_voices = AlternativeVoices.compute(
                recommendation_features, context_features
            )
            scores.append(alternative_voices)
        else:
            scores.append(None)
    return scores

# Calculating the metrics

In [10]:
def metrics(row):
    recommendations = row[algorithms]
    history = row["history"]

    calibration = calculate_calibration(recommendations, history)

    user_sample = predictions.sample(5)[algorithms]
    fragmentation = calculate_fragmentation(recommendations, user_sample)

    activation = calculate_activation(recommendations)

    representation = calculate_representation(recommendations)

    alternative_voices = calculate_alternative_voices(recommendations, history)

    return calibration, fragmentation, activation, representation, alternative_voices

In [11]:
# Calculate the metrics
predictions["metrics"] = predictions.progress_apply(metrics, axis=1)
# Make each metric go to its own column
(
    predictions["calibration"],
    predictions["fragmentation"],
    predictions["activation"],
    predictions["representation"],
    predictions["alternative_voices"],
) = zip(*predictions.metrics)

100%|██████████| 137/137 [00:05<00:00, 25.32it/s]


# Visualizing results
Change the value of the variable `metric` to visualize any of the following metrics:
- calibration
- fragmentation
- activation
- representation
- alternative_voices

In [12]:
metric = "calibration"

In [13]:
visualize_metric(predictions, metric, algorithms)

                                date       lstur        nrms         pop      random
count                            137  134.000000  134.000000  134.000000  134.000000
mean   2019-11-15 10:13:08.124087808    0.549152    0.540320    0.643823    0.640882
min              2019-11-15 00:37:33    0.000000    0.000000    0.000000    0.000000
25%              2019-11-15 07:15:16    0.442211    0.435870    0.564227    0.540583
50%              2019-11-15 10:24:31    0.528804    0.538144    0.656484    0.648558
75%              2019-11-15 12:53:55    0.644875    0.650099    0.728508    0.756698
max              2019-11-15 21:14:50    0.994280    0.994280    0.994280    0.994280
std                              NaN    0.168798    0.175798    0.142447    0.165859

