# ValueMonitor - Create your own topic model

This page is a visualisation of the ValueMonitor prototype. To run the notebook yourself, click the ‘**Run in Google Colab**’ icon below:

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tristandewildt/ValueMonitor/blob/main/ValueMonitor_create_own_model.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tristandewildt/ValueMonitor/blob/main/ValueMonitor_create_own_model.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Table of contents:
* [1. Imports and preparation of the dataset](#import_dataset_and_packages)
* [2. Creating the topic model](#creating_the_topic_model)
* [3. Verifying the topic model](#verifying_the_topic_model)
* [4. Values over time](#values_over_time)
* [5. Values in different realms](#values_in_different_realms)
* [6. Gap assessment](#gap_assessment)


## 1. Imports and preparation of the dataset <a name="import_dataset_and_packages"></a>

### 1.1. Import packages

In [1]:
''' Packages'''

!pip install corextopic
!pip install joblib
!pip install tabulate
!pip install simple_colors
!pip install ipyfilechooser
!pip install colorama

import os, sys, importlib
import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, Button
import pickle
from ipyfilechooser import FileChooser
from tkinter import Tk, filedialog
from IPython.display import clear_output, display
from google.colab import files
import nltk
import io
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('vader_lexicon')

''' Source code'''

user = "tristandewildt"
repo = "ValueMonitor"
src_dir = "code"
pyfile_1 = "make_topic_model.py"
pyfile_2 = "create_visualisation.py"
token = "YOUR_GITHUB_TOKEN"  # placeholder: insert your own personal access token and never publish a real one

if os.path.isdir(repo):
  !rm -rf {repo}

!git clone https://{token}@github.com/{user}/{repo}.git

from ValueMonitor.code.make_topic_model import *
from ValueMonitor.code.create_visualisation import *

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Cloning into 'ValueMonitor'...
remote: Enumerating objects: 632, done.
remote: Counting objects: 100% (128/128), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 632 (delta 73), reused 110 (delta 62), pack-reused 504
Receiving objects: 100% (632/632), 26.67 MiB | 17.26 MiB/s, done.
Resolving deltas: 100% (401/401), done.


### 1.2. Import your dataset

There are two ways to import your dataset. Option 1 is the easiest but slow for large files. Option 2 (recommended) takes a few more steps but is much faster.

**Option 1**: Import using Colab import module (easy but probably not adequate for files > 10MB)

In [97]:
csv = files.upload()
data = io.BytesIO(csv[list(csv.keys())[0]])
df = pd.read_csv(data)
df.info()

Saving data_twitter_hydrogen_2018.csv to data_twitter_hydrogen_2018.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5867 entries, 0 to 5866
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   5867 non-null   int64 
 1   author_id    5867 non-null   int64 
 2   created_at   5867 non-null   object
 3   like_count   5867 non-null   int64 
 4   quote_count  5867 non-null   int64 
 5   reply_count  5867 non-null   int64 
 6   source       5867 non-null   object
 7   text         5867 non-null   object
 8   date         5867 non-null   object
dtypes: int64(5), object(4)
memory usage: 412.6+ KB


**Option 2**: Import file from Google Drive (a bit more complicated the first time but very fast upload)

Instructions: upload your CSV file to Google Drive using your web browser (https://drive.google.com/drive/u/0/my-drive). Once uploaded:

*   Right-click on the file
*   Click on 'Share'
*   Under 'General access', select 'Anyone with the link'
*   Click on 'Copy link'
*   Paste the link in the code below (replace the current link after 'google_drive_link', and make sure that the link is surrounded by quotation marks).
*   Run the code

In [2]:
google_drive_link = "https://drive.google.com/file/d/1f5MD486ENALDrVC5Ws7MzOq-9zQLUbPz/view?usp=sharing"
FILE_KEY = google_drive_link.split('drive.google.com/file/d/')[1].split('/view?usp=')[0]
print(FILE_KEY)

1f5MD486ENALDrVC5Ws7MzOq-9zQLUbPz
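
The `split`-based extraction above only works when the link ends in `/view?usp=...`. As a minimal sketch, assuming sharing links always contain `/file/d/<ID>`, a regular expression tolerates other link endings as well. The helper name `extract_drive_file_key` is hypothetical and not part of ValueMonitor.

```python
import re

def extract_drive_file_key(link: str) -> str:
    """Return the file ID from a Google Drive sharing link (hypothetical helper)."""
    # Assumption: sharing links look like
    # https://drive.google.com/file/d/<ID>/view?usp=sharing
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", link)
    if match is None:
        raise ValueError(f"No file ID found in link: {link!r}")
    return match.group(1)

link = "https://drive.google.com/file/d/1f5MD486ENALDrVC5Ws7MzOq-9zQLUbPz/view?usp=sharing"
print(extract_drive_file_key(link))  # prints 1f5MD486ENALDrVC5Ws7MzOq-9zQLUbPz
```

Raising an error on an unexpected link makes a wrong paste visible immediately instead of producing a failed download later.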


*   Copy the file key that was printed and paste it into the code below (everywhere you see FILE_KEY)
*   Run the code



In [2]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILE_KEY' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILE_KEY" -O imported_dataset && rm -rf /tmp/cookies.txt
df = pd.read_csv('imported_dataset')
df.info()

--2023-02-21 15:30:32--  https://docs.google.com/uc?export=download&confirm=&id=1f5MD486ENALDrVC5Ws7MzOq-9zQLUbPz
Resolving docs.google.com (docs.google.com)... 142.250.141.100, 142.250.141.113, 142.250.141.139, ...
Connecting to docs.google.com (docs.google.com)|142.250.141.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0k-2c-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/ucr2fsjpslv9i0nvgkmb5aogeeu8c93l/1676993400000/12635936161789443610/*/1f5MD486ENALDrVC5Ws7MzOq-9zQLUbPz?e=download&uuid=88f946a2-d34c-44f1-acab-db126f77ef7b [following]
--2023-02-21 15:30:54--  https://doc-0k-2c-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/ucr2fsjpslv9i0nvgkmb5aogeeu8c93l/1676993400000/12635936161789443610/*/1f5MD486ENALDrVC5Ws7MzOq-9zQLUbPz?e=download&uuid=88f946a2-d34c-44f1-acab-db126f77ef7b
Resolving doc-0k-2c-docs.googleusercontent.com (doc-0k-2c-docs.googleusercontent.com)... 74.125.137.13

### 1.3. Preparation of the dataset

Select which columns of the dataset you want to use as text (put the columns in 'columns_to_select_as_text'), and which as date (put the columns in 'column_as_date').

You can set wordtagging to 'True' if you want the topic model to also be created based on nouns and adjectives (the part-of-speech tags listed in 'types_of_words_to_use'). Selecting these words takes some time, but it generally improves the quality of the topic model.
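
To illustrate what selecting word types means, here is a minimal sketch that filters already-tagged words by their part-of-speech tag. It is an assumption about the general idea, not the actual implementation of `clean_df`; the helper `keep_selected_word_types` and the hand-tagged sentence are hypothetical.

```python
# Penn Treebank tags: NN and NNS are common nouns, NNP is a proper noun,
# JJ is an adjective.
types_of_words_to_use = ["NN", "NNP", "NNS", "JJ"]

def keep_selected_word_types(tagged_words, allowed_tags):
    """Keep only the words whose part-of-speech tag is in allowed_tags."""
    allowed = set(allowed_tags)
    return [word for word, tag in tagged_words if tag in allowed]

# Hand-tagged example sentence (hypothetical data); in practice a tagger
# such as nltk.pos_tag would produce these (word, tag) pairs.
tagged = [("Smart", "JJ"), ("meters", "NNS"), ("quickly", "RB"),
          ("raise", "VBP"), ("privacy", "NN"), ("concerns", "NNS")]

print(keep_selected_word_types(tagged, types_of_words_to_use))
# prints ['Smart', 'meters', 'privacy', 'concerns']
```

The adverb ("quickly", tag RB) and the verb ("raise", tag VBP) are dropped because their tags are not in the list.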

In [3]:
''' Preparation of the dataset  '''

columns_to_select_as_text = ["Title", "Author Keywords", "Abstract"]
column_as_date = ["Year"]
other_columns_to_keep = []

wordtagging = False # True, False
types_of_words_to_use = ['NN', 'NNP', 'NNS', 'JJ']

pd.options.mode.chained_assignment = None  # default='warn'

df = clean_df(df, columns_to_select_as_text, column_as_date, other_columns_to_keep, wordtagging, types_of_words_to_use)

## 2. Creating the topic model <a name="creating_the_topic_model"></a>

In this step, we create a topic model in which some of the topics refer to values. The creation of topics that reflect values is done by means of so-called 'anchor' words. These words guide the algorithm in the creation of topics that reflect values.

Anchor words are typically words that people use to refer to (the idea of) a value, such as synonyms. After adding some anchor words and running the model, the algorithm will automatically pick up other words that refer to the value. This is because the algorithm has observed that these words are often mentioned in the same documents as the anchor words.

Finding the right anchor words is typically an iterative process, guided by inspecting each new topic model created by the algorithm. Some anchor words need to be added to ensure that some aspects of the value are not left out (to be placed in *dict_anchor_words* in the cell below). Other words need to be removed since they do not refer to the value (in *list_rejected_words* in the cell below).

We have prefilled a number of anchor words for each value.
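
The intuition behind anchor words can be sketched with plain document co-occurrence counts. This is only an illustration under simplifying assumptions (whitespace tokenisation, single-word anchors); it is not the algorithm used by `make_anchored_topic_model`, and the function name and example documents are hypothetical.

```python
from collections import Counter

def words_cooccurring_with_anchors(documents, anchor_words, rejected_words=()):
    """Count, per word, the number of documents it shares with any anchor word."""
    anchors = set(anchor_words)
    rejected = set(rejected_words)
    counts = Counter()
    for text in documents:
        tokens = set(text.lower().split())
        if tokens & anchors:  # the document mentions at least one anchor word
            counts.update(tokens - anchors - rejected)
    return counts

# Hypothetical mini-corpus.
docs = [
    "privacy concerns about surveillance cameras",
    "surveillance threatens privacy",
    "confidentiality and surveillance in hospitals",
    "solar panels and wind turbines",
]
counts = words_cooccurring_with_anchors(docs, ["privacy", "confidentiality"], ["and"])
print(counts.most_common(1))  # prints [('surveillance', 3)]
```

Here "surveillance" surfaces as a candidate word for the Privacy topic because it appears in every document that contains an anchor word, while the rejected word "and" is ignored.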

In [4]:
dict_anchor_words = {
"Justice and Fairness" : ["justice", "fairness", "fair", "equality", "unfair"],
"Privacy" : ["privacy", "personal data", "personal sphere", "data privacy", "privacy protection", "privacy concerns", 
             "confidentiality"],
"Cyber-security" : ["cyber", "security", "cybersecurity", "malicious", "attacks"],
"Environmnental Sustainability" : ["sustainability", "sustainable", "renewable", "durable", "durability",
                                  "sustainable development", "environmental"],
"Transparency" : ["transparency", "transparent", "transparently", "explainability", "interpretability", "explainable",
                 "opaque", "interpretable"],
"Accountability" : ["accountable", "accountability", "traceability", "traceable"],
"Autonomy" : ["autonomy", "self-determination", "autonomy human", "personal autonomy"], 
"Democracy" : ["democracy", "democratic", "human rights", "freedom speech", "equal representation",
              "political"], 
"Reliability" : ["reliability", "reliable", "robustness", "robust", "predictability"],
"Trust" : ["trust", "trustworthy", "trustworthiness", "confidence", "honesty"],
"Well-being" : ["well being", "well-being", "wellbeing", "quality life",
               "good life", "qol", "life satisfaction", "welfare"],
"Inclusiveness" : ["inclusiveness", "inclusive", "inclusivity", "discrimination", "diversity"]
}

list_rejected_words = ["iop", "iop publishing", "publishing ltd", "publishing", "licence iop",
                       "mdpi basel", "basel switzerland", "mdpi", "basel", "licensee mdpi", "licensee", "authors licensee", 
                       "switzerland", "authors", "publishing limited", "emerald", "emerald publishing", ]

list_anchor_words_other_topics = [
        ["internet of things", "iot", "internet things", "iot devices", "things iot"],
        ["artificial intelligence", "ai", "artificial"],
]

In [16]:
number_of_topics_to_find = 50
number_of_documents_in_analysis = 2000

number_of_words_per_topic_to_show = 10
number_of_words_per_topic = 10

'''--------------------------------------------------------------------------''' 

model_and_vectorized_data = make_anchored_topic_model(df, number_of_topics_to_find, min(number_of_documents_in_analysis, len(df)), dict_anchor_words, list_anchor_words_other_topics, list_rejected_words)
topics = report_topics(model_and_vectorized_data[0], dict_anchor_words,number_of_words_per_topic)
df_with_topics = create_df_with_topics(df, model_and_vectorized_data[0], model_and_vectorized_data[1], number_of_topics_to_find)
topics_weights = report_topics_words_and_weights(model_and_vectorized_data[0], dict_anchor_words, number_of_words_per_topic)

Number of articles used to build the topic model: 2000




Topic #0 (Justice and Fairness): justice, fair, white, the white, horse, white horse, horse press, ethics, fairness, equality
Topic #1 (Privacy): privacy, sage, sage publications, publications, perceived, psychological, satisfaction, perception, attitudes, perceptions
Topic #2 (Cyber-security): security, food, food security, agriculture, agricultural, crop, production, security and, and food, energy security
Topic #3 (Environmnental Sustainability): sustainable, sustainability, sustainable development, of sustainability, for sustainable, of sustainable, sustainability of, the sustainability, and sustainable, sustainability and
Topic #4 (Transparency): transparency, transparent, singapore, pte, nature singapore, singapore pte, pte ltd, the purpose, opaque, purpose of
Topic #5 (Accountability): accountability, management, land, land use, accountable, management and, and management, management of, of land, traceability
Topic #6 (Autonomy): autonomy, design, the design, design and, buildin

## 3. Verifying the topic model   <a name="verifying_the_topic_model"></a>

To verify whether the topics adequately capture the values, the code below can be used to check whether documents assigned to a topic indeed address the value in question.
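
One way to check a document by hand is to read short extracts around the anchor words. The sketch below shows such a keyword-in-context extraction; it assumes that a `window` counts words on each side of a hit, which may differ from how `print_sample_articles_topic` interprets its `window` parameter. The helper `extract_windows` is hypothetical and handles single-word keywords only.

```python
def extract_windows(text, keywords, window=5):
    """Return snippets of `window` words on each side of every keyword hit."""
    words = text.split()
    keywords = {k.lower() for k in keywords}
    snippets = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation before comparing with the keywords.
        if word.lower().strip(".,;:!?()\"'") in keywords:
            start = max(0, i - window)
            snippets.append(" ".join(words[start:i + window + 1]))
    return snippets

# Hypothetical example text.
text = ("The new regulation improves transparency of algorithms, "
        "but critics say opaque systems remain common.")
for snippet in extract_windows(text, ["transparency", "opaque"], window=2):
    print(snippet)
# prints:
# regulation improves transparency of algorithms,
# critics say opaque systems remain
```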

In [17]:
for topic, words in topics_weights.items():
  print(str(topic)+": "+str(words))

Topic #0# (Justice and Fairness): {'justice': 0.385, 'fair': 0.156, 'white': 0.102, 'the white': 0.1, 'horse': 0.099, 'white horse': 0.097, 'horse press': 0.094, 'ethics': 0.089, 'fairness': 0.079, 'equality': 0.063}
Topic #1# (Privacy): {'privacy': 0.333, 'sage': 0.101, 'sage publications': 0.101, 'publications': 0.082, 'perceived': 0.074, 'psychological': 0.066, 'satisfaction': 0.065, 'perception': 0.062, 'attitudes': 0.052, 'perceptions': 0.049}
Topic #2# (Cyber-security): {'security': 0.993, 'food': 0.153, 'food security': 0.108, 'agriculture': 0.079, 'agricultural': 0.078, 'crop': 0.07, 'production': 0.068, 'security and': 0.061, 'and food': 0.041, 'energy security': 0.038}
Topic #3# (Environmnental Sustainability): {'sustainable': 1.204, 'sustainability': 1.035, 'sustainable development': 0.306, 'of sustainability': 0.065, 'for sustainable': 0.063, 'of sustainable': 0.061, 'sustainability of': 0.054, 'the sustainability': 0.041, 'and sustainable': 0.036, 'sustainability and': 0.0

In [18]:
topics_to_remove_int = []

def plot_top_topics_on_values(selected_value, top_topics_to_show):
  top_topics_on_values(df_with_topics, selected_value, dict_anchor_words, topics_weights, topics_to_remove_int, top_topics_to_show)

interact(plot_top_topics_on_values, top_topics_to_show = (3, 25, 1), selected_value=[*dict_anchor_words])

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_top_topics_on_values(selected_value, top_topics_to_show)>

In [8]:
def plot_print_sample_articles_topic(selected_value, selected_topic, show_full_text, window, size_sample):
    show_extracts = True # True, False
    df_to_evaluate = df_with_topics
    if selected_topic == "":
      selected_topic = 0
    df_to_evaluate = df_to_evaluate.loc[(df_to_evaluate[int(selected_topic)] == 1)]
    print_sample_articles_topic(df_to_evaluate, dict_anchor_words, topics, selected_value, size_sample, window, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_topic=widgets.Text(), size_sample =(5,20, 5), window =(5,100, 5), show_full_text = widgets.Checkbox(value=False))

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, selected_topic, show_full_text, window, size_sample)>

## 4. Values over time <a name="values_over_time"></a>

The occurrence of values can be traced over time.
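
As a rough sketch of the idea, the share of documents addressing a value can be computed per year and then smoothed. The simple moving average used here is a stand-in chosen for illustration; how `create_vis_values_over_time` actually resamples and smooths is not shown in this notebook. The function names and records are hypothetical.

```python
from collections import defaultdict

def share_per_year(records):
    """records: iterable of (year, addresses_value) -> {year: percentage}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, addresses_value in records:
        totals[year] += 1
        hits[year] += int(addresses_value)
    return {year: 100 * hits[year] / totals[year] for year in sorted(totals)}

def moving_average(values, width=3):
    """Centered moving average; the window shrinks at both ends."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical documents: (publication year, does it address the value?)
records = [(2018, True), (2018, False), (2019, True), (2019, True),
           (2020, False), (2020, False), (2020, True), (2020, True)]
shares = share_per_year(records)
print(shares)  # prints {2018: 50.0, 2019: 100.0, 2020: 50.0}
print(moving_average(list(shares.values())))
```

Working with percentages rather than raw counts keeps years with many documents from dominating the trend.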

In [10]:
def plot_create_vis_values_over_time (#selected_technology, selected_dataset, 
                                      resampling, smoothing, max_value_y):
    values_to_include_in_visualisation = []   
    resampling_dict = {"Year": "Y", "Month": "M", "Day": "D"}
    resampling = resampling_dict[resampling]
    selected_df_with_topics = df_with_topics
    #selected_df_with_topics = selected_df_with_topics[selected_df_with_topics[selected_technology] == True]
    #selected_df_with_topics = selected_df_with_topics[selected_df_with_topics['dataset'] == selected_dataset]
    create_vis_values_over_time(selected_df_with_topics, dict_anchor_words, resampling, values_to_include_in_visualisation, smoothing, max_value_y)  
    
interact(plot_create_vis_values_over_time, #selected_technology=["AI", "IoT"], selected_dataset = ["TECH", "NEWS", "ETHICS",], 
         smoothing = (0.25,3, 0.25), max_value_y = (5,100, 5), resampling = ["Year", "Month", "Day"])

interactive(children=(Dropdown(description='resampling', options=('Year', 'Month', 'Day'), value='Year'), Floa…

<function __main__.plot_create_vis_values_over_time(resampling, smoothing, max_value_y)>

In [11]:
def plot_words_over_time (selected_value, #selected_dataset, 
                          smoothing, max_value_y, resampling):
    list_words = []
    selected_df_with_topics = df_with_topics
    #if selected_dataset != "All_datasets":
    #  selected_df_with_topics = selected_df_with_topics.loc[(selected_df_with_topics["dataset"] == selected_dataset)]
    top_words = 10
    list_words = topics[selected_value][:top_words]
    print(list_words)
    resampling_dict = {"Year": "Y", "Month": "M", "Day": "D"}
    inspect_words_over_time(df_with_topics = selected_df_with_topics, selected_value = selected_value, dict_anchor_words = dict_anchor_words, topics = topics, list_words = list_words, resampling = resampling_dict[resampling], smoothing = smoothing, max_value_y = max_value_y)

my_interact_manual = interact_manual.options(manual_name="Plot words over time")
my_interact_manual(plot_words_over_time, selected_value=[*dict_anchor_words], #selected_dataset=["All_datasets", "TECH", "NEWS", "ETHICS", ], 
                   smoothing = (0.1,3, 0.25), max_value_y = (5,100, 5), resampling = ["Year", "Month", "Day"])

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_words_over_time(selected_value, smoothing, max_value_y, resampling)>

In [13]:
topics_to_remove_int = []

def plot_top_topics_over_time(selected_value, #selected_dataset, 
                              top_topics_to_show, smoothing, max_value_y, resampling):
  resampling_dict = {"Year": "Y", "Month": "M", "Day": "D"}
  resampling = resampling_dict[resampling]
  df_to_evaluate = df_with_topics
  #if selected_dataset != "All_datasets":
  #  df_to_evaluate = df_to_evaluate.loc[(df_to_evaluate["dataset"] == selected_dataset)]
  top_topics_on_values_over_time(df_to_evaluate, selected_value, dict_anchor_words, topics_weights, top_topics_to_show, topics_to_remove_int, smoothing, max_value_y, resampling)

interact(plot_top_topics_over_time, top_topics_to_show = (3, 25, 1), selected_value=[*dict_anchor_words], #selected_dataset = ["All_datasets","TECH", "NEWS", "ETHICS",], 
         smoothing = (0.25,3, 0.25), max_value_y = (5,100, 5), resampling = ["Year", "Month", "Day"])

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_top_topics_over_time(selected_value, top_topics_to_show, smoothing, max_value_y, resampling)>

In [15]:
def plot_print_sample_articles_topic(selected_value, #selected_dataset, 
                                     selected_topic, show_full_text, window, size_sample):
    show_extracts = True # True, False
    '''--------------------------------------------------------------------------''' 
    selected_dataframe = df_with_topics
    #if selected_dataset != "All_datasets":
    #  selected_dataframe = selected_dataframe.loc[(selected_dataframe["dataset"] == selected_dataset)]
    if selected_topic == "":
      selected_topic = 0
    selected_dataframe = selected_dataframe[selected_dataframe[int(selected_topic)] == 1]
    print_sample_articles_topic(selected_dataframe, dict_anchor_words, topics, selected_value, size_sample, window, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], #selected_dataset = ["All_datasets", "TECH", "NEWS", "ETHICS", ], 
         selected_topic=widgets.Text(), size_sample =(5,20, 5), window =(5,100, 5), show_full_text = widgets.Checkbox(value=False))


interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, selected_topic, show_full_text, window, size_sample)>

## 5. Values in different realms <a name="values_in_different_realms"></a>

ValueMonitor can be used to evaluate which values different societal groups tend to discuss.
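
The comparison between groups can be sketched as the percentage of documents in each group that address each value. This is an illustrative assumption about the kind of statistic involved, not the implementation of `values_in_different_groups`; the function name and the example documents are hypothetical.

```python
from collections import defaultdict

def value_shares_by_group(documents):
    """documents: iterable of (group, set_of_values) -> {group: {value: percentage}}."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for group, values in documents:
        totals[group] += 1
        for value in values:
            hits[group][value] += 1
    return {
        group: {value: 100 * n / totals[group]
                for value, n in sorted(hits[group].items())}
        for group in sorted(totals)
    }

# Hypothetical documents: (dataset label, values the document addresses)
documents = [
    ("NEWS", {"Privacy"}),
    ("NEWS", {"Privacy", "Trust"}),
    ("TECH", {"Reliability"}),
    ("TECH", set()),
]
result = value_shares_by_group(documents)
print(result)
# prints {'NEWS': {'Privacy': 100.0, 'Trust': 50.0}, 'TECH': {'Reliability': 50.0}}
```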

In [9]:
def plot_values_in_different_groups(selected_dataset):
    values_in_different_groups(df_with_topics, dict_anchor_words, selected_dataset)

interact(plot_values_in_different_groups, selected_dataset = ['NEWS', 'ETHICS', 'TECH'])

interactive(children=(Dropdown(description='selected_dataset', options=('NEWS', 'ETHICS', 'TECH'), value='NEWS…

<function __main__.plot_values_in_different_groups(selected_dataset)>

In [None]:
def plot_print_sample_articles_topic(selected_value, selected_dataset, show_full_text, window, size_sample):
    show_extracts = True # True, False
    df_with_topics_selected_technology_dataset = df_with_topics[df_with_topics['dataset'] == selected_dataset]
    print_sample_articles_topic(df_with_topics_selected_technology_dataset, dict_anchor_words, topics, selected_value, size_sample, window, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_dataset = ["TECH", "NEWS", "ETHICS", ], size_sample =(5,20, 5), window =(5,100, 5), show_full_text = widgets.Checkbox(value=False))


interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, selected_dataset, show_full_text, window, size_sample)>

## 6. Gap assessment <a name="gap_assessment"></a>

It takes time before a good topic model is built in which topics adequately represent values. The code in the next cell can be used to import an existing topic model.
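
Saving a model once and re-importing it later could look like the sketch below, which uses the standard `pickle` module. The helper names, the file name and the stand-in object are assumptions made for illustration; ValueMonitor may store its models differently.

```python
import os
import pickle
import tempfile

def save_model(obj, path):
    """Write any picklable object to disk."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)

def load_model(path):
    """Read an object back from disk."""
    # Only unpickle files you created yourself: pickle can run code on load.
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Stand-in object; in the notebook this could be model_and_vectorized_data.
stand_in = {"topics": ["privacy", "trust"], "n_topics": 50}
path = os.path.join(tempfile.gettempdir(), "valuemonitor_model_example.pkl")

save_model(stand_in, path)
restored = load_model(path)
print(restored == stand_in)  # prints True
os.remove(path)  # clean up the example file
```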

In [None]:
def plot_values_in_different_datasets(Selected_technology):
  selected_df = df_with_topics
  selected_df = selected_df[selected_df[Selected_technology] == True]
  values_in_different_datasets(selected_df, dict_anchor_words)

interact(plot_values_in_different_datasets, Selected_technology=["AI", "IoT"])

interactive(children=(Dropdown(description='Selected_technology', options=('AI', 'IoT'), value='AI'), Output()…

<function __main__.plot_values_in_different_datasets(Selected_technology)>

In [None]:
def plot_print_sample_articles_topic(selected_value, selected_dataset, selected_technology, show_full_text, window, size_sample):
    show_extracts = True # True, False
    '''--------------------------------------------------------------------------''' 
    selected_dataframe = df_with_topics
    selected_dataframe = selected_dataframe[selected_dataframe[selected_technology] == True]
    if selected_dataset != "All_datasets":
      selected_dataframe = selected_dataframe.loc[(selected_dataframe["dataset"] == selected_dataset)]
    print_sample_articles_topic(selected_dataframe, dict_anchor_words, topics, selected_value, size_sample, window, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_dataset = ["All_datasets", "TECH", "NEWS", "ETHICS", ], selected_technology=["AI", "IoT"], size_sample =(5,20, 5), window =(5,100, 5), show_full_text = widgets.Checkbox(value=False))


interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, selected_dataset, selected_technology, show_full_text, window, size_sample)>