# ValueMonitor - Create your own topic model

This page is a visualisation of the ValueMonitor prototype. In case you would like to use the notebook, click on the icon ‘**Run in Google Colab**’ hereunder:

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tristandewildt/ValueMonitor_Prototype/blob/main/ValueMonitor_Prototype_create_own_model.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tristandewildt/ValueMonitor_Prototype/blob/main/ValueMonitor_Prototype_create_own_model.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Table of content:
* [1. Import dataset and packages](#import_dataset_and_packages)
* [2. Creating the topic model](#creating_the_topic_model)
* [3. Verifying the topic model](#verifying_the_topic_model)
* [4. Values in different realms](#values_in_different_realms)
* [5. Values over time](#values_over_time)
* [6. Gap assessment](#gap_assessment)


## 1. Import dataset and packages  <a name="import_dataset_and_packages"></a>

In this step, the dataset and relavant python packages are imported

In [1]:
''' Packages'''

!pip install corextopic
!pip install joblib
!pip install tabulate
!pip install simple_colors
!pip install ipyfilechooser

import os, sys, importlib
import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, Button
import pickle
from ipyfilechooser import FileChooser
from tkinter import Tk, filedialog
from IPython.display import clear_output, display
from google.colab import files
import nltk
import io
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('vader_lexicon')

''' Source code'''

user = "tristandewildt"
repo = "ValueMonitor_Workshops"
src_dir = "code"
pyfile_1 = "make_topic_model.py"
pyfile_2 = "create_visualisation.py"
token = "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"

if os.path.isdir(repo):
  !rm -rf {repo}

!git clone https://{token}@github.com/{user}/{repo}.git

from ValueMonitor_Workshops.code.make_topic_model import *
from ValueMonitor_Workshops.code.create_visualisation import *

''' Datasets'''

!wget -q --show-progress --no-check-certificate 'https://docs.google.com/uc?export=download&id=12ZyryF8MbMYKuhIBEhUUvnvx43_cna56' -O dataset_ValueMonitor_prototype

with open('dataset_ValueMonitor_prototype', "rb") as fh:
    df = pickle.load(fh)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Cloning into 'ValueMonitor_Workshops'...
remote: Enumerating objects: 137, done.[K
remote: Counting objects: 100% (84/84), done.[K
remote: Compressing objects: 100% (40/40), done.[K
remote: Total 137 (delta 50), reused 70 (delta 44), pack-reused 53[K
Receiving objects: 100% (137/137), 1008.63 KiB | 16.01 MiB/s, done.
Resolving deltas: 100% (77/77), done.


## 2. Creating the topic model <a name="creating_the_topic_model"></a>

In this step, we create a topic model in which some of the topics refer to values. The creation of topics that reflect values is done by means of so-called 'anchor' words. These words guide the algorithm in the creation of topics that reflect values.

Anchor words are typically words that people use to refer to (the idea of) a value, such as synonyms. After adding some anchor words and running the model, the algorithm will automatically pick up other words that refer to the value. This is because the algorithm has observed that these words are often mentionned in the same documents as the anchor words.

Finding the right anchor words is typically an iterative process, by observing the new topic model created by the algorithm. Some anchor words need to be added to ensure that some aspect of the value are not left behind (to be placed in *dict_anchor_words* in the cell below). Other words need to be removed since they do not refer to the value (in *list_rejected_words* in the cell below).

We have prefilled an number of anchor words for each value.

In [2]:
dict_anchor_words = {
"Justice and Fairness" : ["justice", "fairness", "fair", "equality", "unfair"],
"Privacy" : ["privacy", "personal data", "personal sphere", "data privacy", "privacy protection", "privacy concerns", 
             "confidentiality"],
"Cyber-security" : ["cyber", "security", "cybersecurity", "malicious", "attacks"],
"Environmnental Sustainability" : ["sustainability", "sustainable", "renewable", "durable", "durability",
                                  "sustainable development", "environmental"],
"Transparency" : ["transparency", "transparent", "transparently", "explainability", "interpretability", "explainable",
                 "opaque", "interpretable"],
"Accountability" : ["accountable", "accountability", "accountable", "traceability", "traceable"],
"Autonomy" : ["autonomy", "self-determination", "autonomy human", "personal autonomy"], 
"Democracy" : ["democracy", "democratic", "human rights", "freedom speech", "equal representation",
              "political"], 
"Reliability" : ["reliability", "reliable", "robustness", "robust", "predictability"],
"Trust" : ["trust", "trustworthy", "trustworthiness", "confidence", "honesty"],
"Well-being" : ["well being", "well-being", "wellbeing", "quality life",
               "good life", "qol", "life satisfaction", "welfare"],
"Inclusiveness" : ["inclusiveness", "inclusive", "inclusivity", "discrimination", "diversity"]
}

list_rejected_words = ["iop", "iop publishing", "publishing ltd", "publishing", "licence iop",
                       "mdpi basel", "basel switzerland", "mdpi", "basel", "licensee mdpi", "licensee", "authors licensee", 
                       "switzerland", "authors", "publishing limited", "emerald", "emerald publishing", ]

list_anchor_words_other_topics = [
        ["internet of things", "iot", "internet things", "iot devices", "things iot"],
        ["artificial intelligence", "ai", "artificial"],
]



In [3]:
number_of_topics_to_find = 20
number_of_documents_in_analysis = 200

number_of_words_per_topic_to_show = 10
number_of_words_per_topic = 10

'''--------------------------------------------------------------------------''' 

model_and_vectorized_data = make_anchored_topic_model(df, number_of_topics_to_find, min(number_of_documents_in_analysis, len(df)), dict_anchor_words, list_anchor_words_other_topics, list_rejected_words)
topics = report_topics(model_and_vectorized_data[0], dict_anchor_words,number_of_words_per_topic)
df_with_topics = create_df_with_topics(df, model_and_vectorized_data[0], model_and_vectorized_data[1], number_of_topics_to_find)
topics_weights = report_topics_words_and_weights(model_and_vectorized_data[0], dict_anchor_words, number_of_words_per_topic)

Number of articles used to build the topic model: 200




Topic #0 (Justice and Fairness): fair, justice, way, something, few, early, case, nothing, power, hand
Topic #1 (Privacy): privacy, news, threat, facebook, congress, street, apple, many people, action, sen
Topic #2 (Cyber-security): security, washington, cyber, house, trump, white house, defense, american, attacks, bill
Topic #3 (Environmnental Sustainability): company, amazon, able, many, big, today, long, thing, different, clear
Topic #4 (Transparency): transparency, transparent, state, public, twitter, mike, law, team, didn, part
Topic #5 (Accountability): accountability, accountable, decisions, op, moral, ed, arguments, ethics, application, interaction
Topic #6 (Autonomy): country, millions, man, old, history, job, personal, days, men, half
Topic #7 (Democracy): political, democratic, democracy, president, campaign, officials, presidential, politics, voters, court
Topic #8 (Reliability): content, social media, post, robust, social, standards, free speech, speech, law enforcement, e

## 3. Verifying the topic model   <a name="verifying_the_topic_model"></a>

To verify whether topics sufficiently refer to values, the code hereunder can be used to evaluate whether documents indeed address the value in question.

In [6]:
for topic, words in topics_weights.items():
  print(str(topic)+": "+str(words))

Topic #0# (Justice and Fairness): {'fair': 0.528, 'justice': 0.455, 'way': 0.304, 'something': 0.23, 'few': 0.186, 'early': 0.181, 'case': 0.18, 'nothing': 0.18, 'power': 0.17, 'hand': 0.169}
Topic #1# (Privacy): {'privacy': 0.473, 'news': 0.242, 'threat': 0.233, 'facebook': 0.224, 'congress': 0.204, 'street': 0.191, 'apple': 0.171, 'many people': 0.163, 'action': 0.159, 'sen': 0.141}
Topic #2# (Cyber-security): {'security': 0.431, 'washington': 0.325, 'cyber': 0.288, 'house': 0.265, 'trump': 0.256, 'white house': 0.23, 'defense': 0.189, 'american': 0.188, 'attacks': 0.175, 'bill': 0.166}
Topic #3# (Environmnental Sustainability): {'company': 0.221, 'amazon': 0.217, 'able': 0.201, 'many': 0.193, 'big': 0.181, 'today': 0.178, 'long': 0.171, 'thing': 0.17, 'different': 0.169, 'clear': 0.16}
Topic #4# (Transparency): {'transparency': 0.254, 'transparent': 0.178, 'state': 0.171, 'public': 0.168, 'twitter': 0.162, 'mike': 0.154, 'law': 0.147, 'team': 0.142, 'didn': 0.129, 'part': 0.127}
Top

In [7]:
topics_to_remove_int = []

def plot_top_topics_on_values(selected_value, top_topics_to_show):
  top_topics_on_values(df_with_topics, selected_value, dict_anchor_words, topics_weights, topics_to_remove_int, top_topics_to_show)

interact(plot_top_topics_on_values, top_topics_to_show = (3, 25, 1), selected_value=[*dict_anchor_words])

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_top_topics_on_values(selected_value, top_topics_to_show)>

In [8]:
def plot_print_sample_articles_topic(selected_value, selected_topic, show_full_text, window, size_sample):
    show_extracts = True # True, False
    df_to_evaluate = df_with_topics
    if selected_topic == "":
      selected_topic = 0
    df_to_evaluate = df_to_evaluate.loc[(df_to_evaluate[int(selected_topic)] == 1)]
    print_sample_articles_topic(df_to_evaluate, dict_anchor_words, topics, selected_value, size_sample, window, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_topic=widgets.Text(), size_sample =(5,20, 5), window =(5,100, 5), show_full_text = widgets.Checkbox(value=False))

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, selected_topic, show_full_text, window, size_sample)>

## 4. Values in different realms <a name="values_in_different_realms"></a>

ValueMonitor can be used to evaluate which values different societal groups tend to discuss.

In [9]:
def plot_values_in_different_groups(selected_dataset):
    values_in_different_groups(df_with_topics, dict_anchor_words, selected_dataset)

interact(plot_values_in_different_groups, selected_dataset = ['NEWS', 'ETHICS', 'TECH'])

interactive(children=(Dropdown(description='selected_dataset', options=('NEWS', 'ETHICS', 'TECH'), value='NEWS…

<function __main__.plot_values_in_different_groups(selected_dataset)>

In [10]:
def plot_print_sample_articles_topic(selected_value, selected_dataset, show_full_text, window, size_sample):
    show_extracts = True # True, False
    df_with_topics_selected_technology_dataset = df_with_topics[df_with_topics['dataset'] == selected_dataset]
    print_sample_articles_topic(df_with_topics_selected_technology_dataset, dict_anchor_words, topics, selected_value, size_sample, window, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_dataset = ["TECH", "NEWS", "ETHICS", ], size_sample =(5,20, 5), window =(5,100, 5), show_full_text = widgets.Checkbox(value=False))


interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, selected_dataset, show_full_text, window, size_sample)>

## 5. Values over time <a name="values_over_time"></a>

The occurence of values can be traced over time.

In [11]:
def plot_create_vis_values_over_time (selected_technology, selected_dataset, resampling, smoothing, max_value_y):
    values_to_include_in_visualisation = []   
    resampling_dict = {"Year": "Y", "Month": "M", "Day": "D"}
    resampling = resampling_dict[resampling]
    df_with_topics_selected_technology = df_with_topics[df_with_topics[selected_technology] == True]
    df_with_topics_selected_technology_dataset = df_with_topics_selected_technology[df_with_topics_selected_technology['dataset'] == selected_dataset]
    create_vis_values_over_time(df_with_topics_selected_technology_dataset, dict_anchor_words, resampling, values_to_include_in_visualisation, smoothing, max_value_y)  
    
interact(plot_create_vis_values_over_time, selected_technology=["AI", "IoT"], selected_dataset = ["TECH", "NEWS", "ETHICS",], smoothing = (0.25,3, 0.25), max_value_y = (5,100, 5), resampling = ["Year", "Month", "Day"])

interactive(children=(Dropdown(description='selected_technology', options=('AI', 'IoT'), value='AI'), Dropdown…

<function __main__.plot_create_vis_values_over_time(selected_technology, selected_dataset, resampling, smoothing, max_value_y)>

In [12]:
def plot_words_over_time (selected_value, selected_dataset, smoothing, max_value_y, resampling):
    list_words = []
    selected_df_with_topics = df_with_topics
    if selected_dataset != "All_datasets":
      selected_df_with_topics = selected_df_with_topics.loc[(selected_df_with_topics["dataset"] == selected_dataset)]
    top_words = 10
    list_words = topics[selected_value][:top_words]
    print(list_words)
    resampling_dict = {"Year": "Y", "Month": "M", "Day": "D"}
    inspect_words_over_time(df_with_topics = selected_df_with_topics, selected_value = selected_value, dict_anchor_words = dict_anchor_words, topics = topics, list_words = list_words, resampling = resampling_dict[resampling], smoothing = smoothing, max_value_y = max_value_y)

my_interact_manual = interact_manual.options(manual_name="Plot words over time")
my_interact_manual(plot_words_over_time, selected_value=[*dict_anchor_words], selected_dataset=["All_datasets", "TECH", "NEWS", "ETHICS", ], smoothing = (0.1,3, 0.25), max_value_y = (5,100, 5), resampling = ["Year", "Month", "Day"])

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_words_over_time(selected_value, selected_dataset, smoothing, max_value_y, resampling)>

In [13]:
topics_to_remove_int = []

def plot_top_topics_over_time(selected_value, selected_dataset, top_topics_to_show, smoothing, max_value_y, resampling):
  resampling_dict = {"Year": "Y", "Month": "M", "Day": "D"}
  resampling = resampling_dict[resampling]
  df_to_evaluate = df_with_topics
  if selected_dataset != "All_datasets":
    df_to_evaluate = df_to_evaluate.loc[(df_to_evaluate["dataset"] == selected_dataset)]
  top_topics_on_values_over_time(df_to_evaluate, selected_value, selected_dataset, dict_anchor_words, topics_weights, top_topics_to_show, topics_to_remove_int, smoothing, max_value_y, resampling)

interact(plot_top_topics_over_time, top_topics_to_show = (3, 25, 1), selected_value=[*dict_anchor_words], selected_dataset = ["All_datasets","TECH", "NEWS", "ETHICS",], smoothing = (0.25,3, 0.25), max_value_y = (5,100, 5), resampling = ["Year", "Month", "Day"])

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_top_topics_over_time(selected_value, selected_dataset, top_topics_to_show, smoothing, max_value_y, resampling)>

In [14]:
def plot_print_sample_articles_topic(selected_value, selected_dataset, selected_topic, show_full_text, window, size_sample):
    show_extracts = True # True, False
    '''--------------------------------------------------------------------------''' 
    selected_dataframe = df_with_topics
    if selected_dataset != "All_datasets":
      selected_dataframe = selected_dataframe.loc[(selected_dataset["dataset"] == selected_dataset)]
    if selected_topic == "":
      selected_topic = 0
    selected_dataframe = selected_dataframe[selected_dataframe[int(selected_topic)] == 1]
    print_sample_articles_topic(selected_dataframe, dict_anchor_words, topics, selected_value, size_sample, window, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_dataset = ["All_datasets", "TECH", "NEWS", "ETHICS", ], selected_topic=widgets.Text(), size_sample =(5,20, 5), window =(5,100, 5), show_full_text = widgets.Checkbox(value=False))


interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, selected_dataset, selected_topic, show_full_text, window, size_sample)>

## 6. Gap assessment <a name="gap_assessment"></a>

It takes time before a good topic model is build in which topics adequately represent values. The code in the next cell can be used to import an existing topic model.

In [15]:
def plot_values_in_different_datasets(Selected_technology):
  selected_df = df_with_topics
  selected_df = selected_df[selected_df[Selected_technology] == True]
  values_in_different_datasets(selected_df, dict_anchor_words)

interact(plot_values_in_different_datasets, Selected_technology=["AI", "IoT"])

interactive(children=(Dropdown(description='Selected_technology', options=('AI', 'IoT'), value='AI'), Output()…

<function __main__.plot_values_in_different_datasets(Selected_technology)>

In [16]:
def plot_print_sample_articles_topic(selected_value, selected_dataset, selected_technology, show_full_text, window, size_sample):
    show_extracts = True # True, False
    '''--------------------------------------------------------------------------''' 
    selected_dataframe = df_with_topics
    selected_dataframe = selected_dataframe[selected_dataframe[selected_technology] == True]
    if selected_dataset != "All_datasets":
      selected_dataframe = selected_dataframe.loc[(selected_dataset["dataset"] == selected_dataset)]
    print_sample_articles_topic(selected_dataframe, dict_anchor_words, topics, selected_value, size_sample, window, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_dataset = ["All_datasets", "TECH", "NEWS", "ETHICS", ], selected_technology=["AI", "IoT"], size_sample =(5,20, 5), window =(5,100, 5), show_full_text = widgets.Checkbox(value=False))


interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, selected_dataset, selected_technology, show_full_text, window, size_sample)>