# ValueMonitor - Use an existing topic model

This page shows the ValueMonitor prototype notebook. To run the notebook yourself, click the ‘**Run in Google Colab**’ icon below:

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tristandewildt/ValueMonitor_Prototype/blob/main/ValueMonitor_Prototype_use_existing_model.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tristandewildt/ValueMonitor_Prototype/blob/main/ValueMonitor_Prototype_use_existing_model.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Table of contents:
* [1. Import dataset and packages](#import_dataset_and_packages)
* [2. Gap assessment](#gap_assessment)
* [3. Values over time](#impact_assessment)
* [4. Values in different realms](#values_in_different_realms)

## 1. Import dataset and packages  <a name="import_dataset_and_packages"></a>

In this step, the dataset and the relevant Python packages are imported.
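When a module name is derived from a file name, as the import cell below does, it helps to remember that `str.rstrip` removes a set of characters rather than a suffix. The following is a minimal sketch of a safer alternative; the helper name `module_name_from_file` and the file name `happy.py` are hypothetical and not part of the notebook.

```python
import os

def module_name_from_file(filename: str) -> str:
    """Return the file name without directory and extension (hypothetical helper)."""
    stem, _extension = os.path.splitext(os.path.basename(filename))
    return stem

# str.rstrip treats ".py" as the character set {'.', 'p', 'y'}, so it can also
# eat trailing letters that belong to the module name itself.
print("happy.py".rstrip(".py"))           # ha
print(module_name_from_file("happy.py"))  # happy
```

The two file names used in the notebook happen to end in letters outside that character set, so both approaches give the same result for them.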

In [1]:
''' Packages'''

!pip install corextopic
!pip install joblib
!pip install tabulate
!pip install simple_colors

import os, sys, importlib
import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import pickle

''' Source code'''

user = "tristandewildt"
repo = "ValueMonitor_Prototype"
src_dir = "code"
pyfile_1 = "make_topic_model.py"
pyfile_2 = "create_visualisation.py"

if os.path.isdir(repo):
    !rm -rf {repo}
    
!git clone https://github.com/{user}/{repo}.git

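path = f"{repo}/{src_dir}"
if path not in sys.path:
    sys.path.insert(1, path)

# Drop the ".py" extension with splitext; str.rstrip(".py") strips a set of characters, not a suffix.
make_topic_model = importlib.import_module(os.path.splitext(pyfile_1)[0])
create_visualisation = importlib.import_module(os.path.splitext(pyfile_2)[0])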

from make_topic_model import *
from create_visualisation import *

''' Datasets'''

!wget -q --show-progress --no-check-certificate 'https://docs.google.com/uc?export=download&id=12ZyryF8MbMYKuhIBEhUUvnvx43_cna56' -O dataset_ValueMonitor_prototype
!wget -q --show-progress --no-check-certificate 'https://docs.google.com/uc?export=download&id=1-tg19xpRGShyICoyujfGq-SJTccqa4l0' -O list_topics
!wget -q --show-progress --no-check-certificate 'https://docs.google.com/uc?export=download&id=1-ua31tRTmXsseitqQMR1G68AWLmhEO9E' -O topics
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=12_EoLJLL_wjc8n1Az3wudsvaTgA605aK' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=12_EoLJLL_wjc8n1Az3wudsvaTgA605aK" -O combined_STOA_technologies_saved_topic_model && rm -rf /tmp/cookies.txt


#https://drive.google.com/file/d/1-ua31tRTmXsseitqQMR1G68AWLmhEO9E/view?usp=sharing
with open('dataset_ValueMonitor_prototype', "rb") as fh:
    df = pickle.load(fh)
with open('list_topics', "rb") as fh:
    list_topics = pickle.load(fh)
with open('topics', "rb") as fh:
    topics = pickle.load(fh)
with open('combined_STOA_technologies_saved_topic_model', "rb") as fh:
    combined_STOA_technologies_saved_topic_model = pickle.load(fh)
    
results_import = import_topic_model(combined_STOA_technologies_saved_topic_model, df)
if len(results_import):
    df_with_topics = results_import[0]
    #topics = results_import[1]
    dict_anchor_words = results_import[2]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Cloning into 'ValueMonitor_Prototype'...
remote: Enumerating objects: 145, done.
remote: Counting objects: 100% (82/82), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 145 (delta 39), reused 74 (delta 34), pack-reused 63
Receiving objects: 100% (145/145), 16.69 MiB | 16.29 MiB/s, done.
Resolving deltas: 100% (75/75), done.
--2023-02-09 15:27:22--  https://docs.google.com/uc?export=download&confirm=t&id=12_EoLJLL_wjc8n1Az3wudsvaTgA605aK
Resolving docs.google.com (docs.google.com)... 74.125.204.100, 74.125.204.102, 74.125.204.113, ...
Connecting to docs.google.com (

https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
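The four `with open(...)` blocks in the cell above all follow the same pattern, which could be wrapped in one small helper. This is only a sketch: `load_pickle` and the example data are hypothetical names introduced here. Unpickling can execute arbitrary code, so it should only be applied to files from a trusted source, and pickled models may fail to load under library versions other than the ones they were saved with.

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Load a single pickled object from `path` (trusted files only)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Round trip with a throwaway file and made-up data, to show the helper in use.
example = {0: "justice, fair, fairness", "Justice and Fairness": "justice, fair, fairness"}
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "topics_example")
    with open(example_path, "wb") as handle:
        pickle.dump(example, handle)
    restored = load_pickle(example_path)

print(restored == example)  # True
```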


## 2. Overview of topics

In [2]:
for topic in list_topics:
  print(topic)

Topic #0 (Justice and Fairness): justice, 0.722, fair, 0.558, fairness, 0.237, equality, 0.133, unfair, 0.13, unequal, 0.057, equitable, 0.045, unjust, 0.036, criminal justice, 0.035, social justice, 0.025
Topic #1 (Privacy): privacy, 2.01, personal data, 0.215, data privacy, 0.2, privacy protection, 0.092, privacy concerns, 0.086, privacy data, 0.071, user privacy, 0.064, privacy issues, 0.048, security privacy, 0.045, privacy security, 0.042
Topic #2 (Cyber-security): security, 2.037, attacks, 0.349, cybersecurity, 0.19, cyber, 0.086, threats, 0.036, malicious, 0.032, encryption, 0.022, social security, 0.018, safety security, 0.016, security issues, 0.015
Topic #3 (Environmnental Sustainability): environmental, 0.756, sustainable, 0.428, sustainability, 0.262, renewable, 0.121, sustainable development, 0.062, durable, 0.047, renewable energy, 0.032, carbon, 0.026, emissions, 0.023, environmental protection, 0.012
Topic #4 (Transparency): transparency, 0.879, transparent, 0.489, opaq

In [3]:
topics

{0: 'justice, fair, fairness, equality, unfair, unequal, equitable, unjust, criminal justice, social justice, justice system, department justice, gender equality, free fair, egalitarianism, distributive justice, distributive, inequalities, vanity fair, sentencing',
 'Justice and Fairness': 'justice, fair, fairness, equality, unfair, unequal, equitable, unjust, criminal justice, social justice, justice system, department justice, gender equality, free fair, egalitarianism, distributive justice, distributive, inequalities, vanity fair, sentencing',
 1: 'privacy, personal data, data privacy, privacy protection, privacy concerns, privacy data, user privacy, privacy issues, security privacy, privacy security, personal information, privacy preserving, confidentiality, privacy preservation, privacy law, consumer privacy, facebook privacy, privacy information, privacy policies, issues privacy',
 'Privacy': 'privacy, personal data, data privacy, privacy protection, privacy concerns, privacy dat

In [4]:
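# Map each value name (the keys of dict_anchor_words) to its topic index.
dict_values = {key: index for index, key in enumerate(dict_anchor_words)}

# Give every integer-keyed topic a short preview label.
dict_topics = {}
counter = 0
for key, value in topics.items():
    if isinstance(key, int):
        # NOTE: `value` is a comma-separated string, so value[:5] yields its first
        # five characters (hence labels such as 'i, n, t, e, r' in the output of
        # dict_topics); value.split(', ')[:5] would give the first five words.
        reduced_value = value[:5]
        dict_topics[counter] = ', '.join(reduced_value)
        counter += 1

# Replace the preview label of each anchored topic with its value name.
for key, value in dict_values.items():
    dict_topics[value] = key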
        
#dict_values = {"Environmental Sustainability": 0,"Safety": 1,"Economic viability": 2,"Efficiency": 3,"Affordability": 4}
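The topic strings printed by `topics` are comma-separated word lists. A minimal sketch of building a five-word preview label from such a string is shown here; `preview_label` is a hypothetical helper, and the example string is shortened from the first topic shown earlier.

```python
def preview_label(topic_words: str, n_words: int = 5) -> str:
    """Return the first `n_words` words of a comma-separated topic string."""
    words = [word.strip() for word in topic_words.split(",") if word.strip()]
    return ", ".join(words[:n_words])

topic_string = "justice, fair, fairness, equality, unfair, unequal, equitable"

print(preview_label(topic_string))  # justice, fair, fairness, equality, unfair
# Slicing the string directly takes characters instead of words:
print(", ".join(topic_string[:5]))  # j, u, s, t, i
```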

In [6]:
dict_topics

{0: 'Justice and Fairness',
 1: 'Privacy',
 2: 'Cyber-security',
 3: 'Environmnental Sustainability',
 4: 'Transparency',
 5: 'Accountability',
 6: 'Autonomy',
 7: 'Democracy',
 8: 'Reliability',
 9: 'Trust',
 10: 'Well-being',
 11: 'Inclusiveness',
 12: 'i, n, t, e, r',
 13: 'a, r, t, i, f',
 14: 'y, e, a, r, ,',
 15: 'b, l, o, c, k',
 16: 's, t, r, e, e',
 17: 'q, u, a, n, t',
 18: '5, g, ,,  , 5',
 19: 'd, r, i, v, e',
 20: 'p, e, o, p, l',
 21: 'a, u, g, m, e',
 22: 'v, i, r, t, u',
 23: 'v, o, i, c, e',
 24: 'f, a, c, i, a',
 25: 'c, o, m, p, u',
 26: 'f, r, i, e, n',
 27: 'a, g, e, n, t',
 28: 'r, o, b, o, t',
 29: 'c, l, o, u, d',
 30: 'e, d, u, c, a',
 31: 'p, r, e, s, i',
 32: 's, u, p, p, o',
 33: 'i, n, d, u, s',
 34: 'i, n, d, u, s',
 35: 'f, a, x,  , c',
 36: 'c, i, t, y, ,',
 37: 'a, n, i, m, a',
 38: 'h, u, m, a, n',
 39: 'a, l, l, o, c',
 40: 'a, g, e, n, c',
 41: 'e, t, h, i, c',
 42: 'h, e, a, l, t',
 43: 'f, o, o, d, ,',
 44: 'a, d, ,,  , h',
 45: 'g, e, n, e, t',
 4

In [29]:
import math

In [5]:
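# Explicit imports for the names used in this cell, in case the star imports in step 1 do not provide them.
import dateutil.parser
import numpy as np
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta
from matplotlib import cm
from scipy.ndimage import gaussian_filter1d

top_topics_to_show = 10     # number of topics plotted
readjust_colors = 11        # number of times the colour list is repeated
max_value_y = 100           # upper limit of the percentage axis
T0 = "1990"                 # start of the period
T1 = "2022"                 # end of the period
smoothing = 1               # multiplier for the Gaussian smoothing width
resampling = "Y"            # pandas resampling frequency (yearly)
selected_value = "Privacy"  # any key of dict_values, e.g. 'Justice and Fairness'
#selected_domain = "all_domains"   # "power generation", "mobility", "industry", "all_domains"
# Topic indices to exclude from the plot, e.g. 79, 185, 189, 309, 219, 347, 394, 264, 371, 217, 65, 21, 283, 146, 14, 302, 163, 323
topics_to_remove_int = []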


df_to_evaluate = df_with_topics
df_to_evaluate = df_to_evaluate.loc[(df_to_evaluate[dict_values[selected_value]] == 1)]
#if selected_domain == "all_domains":
#    df_to_evaluate = df_to_evaluate.loc[(df_to_evaluate[5] == 1) | (df_to_evaluate[6] == 1) | (df_to_evaluate[7] == 1)]
#if selected_domain != "all_domains":
#    df_to_evaluate = df_to_evaluate.loc[(df_to_evaluate[dict_domains[selected_domain]] == 1)]
df_to_evaluate = df_to_evaluate.loc[(df_to_evaluate['date'] >= dateutil.parser.parse(str(T0))) & (df_to_evaluate['date'] <= dateutil.parser.parse(str(T1)))]
df_to_evaluate = df_to_evaluate.set_index('date')  

        
df_with_topics_freq = df_to_evaluate.resample(resampling).size().reset_index(name="count")
df_with_topics_freq = df_with_topics_freq.set_index('date')
      
df_to_evaluate = df_to_evaluate.fillna("")


df_to_evaluate = df_to_evaluate.rename(columns=dict_topics)

df_to_evaluate = df_to_evaluate[list(dict_topics.values())]
print(df_to_evaluate)
topics_to_remove_str = []
for i in topics_to_remove_int:
    topics_to_remove_str.append(dict_topics[i])
for i in list(dict_values.values()):
    topics_to_remove_str.append(dict_topics[i])
topics_to_remove_str.append(selected_value)
df_to_evaluate = df_to_evaluate.drop(columns=topics_to_remove_str)

df_to_evaluate = df_to_evaluate.resample(resampling).sum()
count_df_to_evaluate = df_to_evaluate.sum()
initial_number_topics = len(count_df_to_evaluate)

count_df_to_evaluate = count_df_to_evaluate.sort_values(ascending=False)
count_df_to_evaluate = count_df_to_evaluate[:top_topics_to_show]

percentage_df_to_evaluate = count_df_to_evaluate.divide(count_df_to_evaluate.sum(), fill_value=0)
percentage_df_to_evaluate = percentage_df_to_evaluate * 100
list_topics_above_threshold = list(count_df_to_evaluate.index.values)

#print(list_topics_above_threshold)
for topic in list_topics_above_threshold:
    print("Topic "+str(list(dict_topics.values()).index(topic))+": "+str(topic))


df_to_evaluate = df_to_evaluate[list_topics_above_threshold]
    
df_to_evaluate = df_to_evaluate.div(df_with_topics_freq["count"], axis=0)
df_to_evaluate = df_to_evaluate.fillna(0)
    
     
x = pd.Series(df_to_evaluate.index.values)
x = x.dt.to_pydatetime().tolist()
    
x = [ z - relativedelta(years=1) for z in x]
      
df_to_evaluate = df_to_evaluate * 100

    
sigma = (np.log(len(x)) - 1.25) * 1.2 * smoothing
        
#n_colors = initial_number_topics
#colours = cm.tab20(np.linspace(0, 1, math.ceil(n_colors / readjust_colors)))
colours = cm.tab20(np.linspace(0, 1, math.ceil(len(list_topics_above_threshold))))
colours_long = []
for i in range(readjust_colors):
    for y in colours:
        colours_long.append(y)

dict_colors = {}
counter = 0
for word in list_topics_above_threshold:
    dict_colors[word] = colours_long[counter]
    counter = counter + 1

counter = 0
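# Set the size at creation; changing rcParams afterwards does not resize an existing figure.
fig, ax1 = plt.subplots(figsize=(12, 6))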
for word in df_to_evaluate:
    ysmoothed = gaussian_filter1d(df_to_evaluate[word].tolist(), sigma=sigma)
    ax1.plot(x, ysmoothed, label=word, linewidth=2, color = dict_colors[word])
    counter = counter + 1
        
ax1.set_xlabel('Time', fontsize=12, fontweight="bold")
ax1.set_ylabel('Percentage of articles', fontsize=12, fontweight="bold")
ax1.legend(prop={'size': 10})
    
timestamp_0 = x[0]
timestamp_1 = x[1]
    

#width = (time.mktime(timestamp_1.timetuple()) - time.mktime(timestamp_0.timetuple())) / 86400 *.8
width = (timestamp_1 - timestamp_0).total_seconds() / 86400 * 0.8
    
df_to_evaluate["count"]=df_with_topics_freq["count"]
    
ax2 = ax1.twinx()
ax2.bar(x, df_to_evaluate["count"].tolist(), width=width, color='gainsboro')
ax2.set_ylabel('Number of documents in the selected dataset (bars)', fontsize=12, fontweight="bold")
    
ax1.set_zorder(ax2.get_zorder()+1)
ax1.patch.set_visible(False)

        
ax1.set_ylim([0,max_value_y])
#ax1.legend(bbox_to_anchor=(1.2, -0.15), prop={'size': 16})
ax1.legend(prop={'size': 8})
    
plt.rcParams["figure.figsize"] = [12,6]
plt.title("Top "+str(top_topics_to_show)+" topics discussed in relation to the value "+str(selected_value), fontsize=14, fontweight="bold")
plt.show()

            Justice and Fairness  Privacy  Cyber-security  \
date                                                        
2008-05-17                   0.0      1.0             0.0   
2019-05-17                   0.0      1.0             0.0   
2020-05-17                   0.0      1.0             0.0   
2021-05-17                   0.0      1.0             1.0   
2021-05-17                   1.0      1.0             0.0   
...                          ...      ...             ...   
2021-05-17                   0.0      1.0             1.0   
2020-05-17                   0.0      1.0             0.0   
2021-05-17                   0.0      1.0             1.0   
2009-05-17                   0.0      1.0             1.0   
2015-05-17                   0.0      1.0             1.0   

            Environmnental Sustainability  Transparency  Accountability  \
date                                                                      
2008-05-17                            0.0           0.0 

NameError: ignored
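The smoothing width in the cell above depends on the number of resampled time points through `sigma = (np.log(len(x)) - 1.25) * 1.2 * smoothing`. The sketch below evaluates the same formula with the standard library for a few series lengths; the function name `smoothing_sigma` and the chosen lengths are assumptions made for illustration.

```python
import math

def smoothing_sigma(n_points: int, smoothing: float = 1.0) -> float:
    """Evaluate (ln(n_points) - 1.25) * 1.2 * smoothing, as in the plotting cell."""
    return (math.log(n_points) - 1.25) * 1.2 * smoothing

# Print sigma for a very short, a medium and a longer yearly series.
for n_points in (3, 10, 33):
    print(n_points, round(smoothing_sigma(n_points), 3))
```

With 33 yearly points the width is roughly 2.7, while a series of three or fewer points gives a negative value, which is not a meaningful kernel width. Very short date ranges may therefore need a lower bound on sigma before smoothing.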

## 2. Gap assessment <a name="gap_assessment"></a>

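Building a topic model whose topics adequately represent values takes time, so this notebook reuses the existing model imported in step 1. The cells below show how values are distributed across datasets for a selected technology and print sample articles for a chosen value.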

In [2]:
def plot_values_in_different_datasets(Selected_technology):
    values_in_different_datasets(df_with_topics, Selected_technology, dict_anchor_words)

interact(plot_values_in_different_datasets, Selected_technology=["AI", "IoT"])

interactive(children=(Dropdown(description='Selected_technology', options=('AI', 'IoT'), value='AI'), Output()…

<function __main__.plot_values_in_different_datasets(Selected_technology)>

In [3]:
def plot_print_sample_articles_topic(selected_technology, selected_value, selected_dataset, size_sample):
    show_extracts = True # True, False
    show_full_text  = False # True, False
    df_with_topics_selected_technology = df_with_topics[df_with_topics[selected_technology] == True]
    df_with_topics_selected_technology_dataset = df_with_topics_selected_technology[df_with_topics_selected_technology['dataset'] == selected_dataset]
    print_sample_articles_topic(df_with_topics_selected_technology_dataset, dict_anchor_words, topics, selected_value, size_sample, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_dataset = ["TECH", "NEWS", "ETHICS", ], selected_technology=["AI", "IoT"], size_sample =(5,50, 5))

interactive(children=(Dropdown(description='selected_technology', options=('AI', 'IoT'), value='AI'), Dropdown…

<function __main__.plot_print_sample_articles_topic(selected_technology, selected_value, selected_dataset, size_sample)>
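The sampling cell above first narrows the corpus to one technology and one dataset and then prints a sample of articles for the selected value. The sketch below mimics that filter-then-sample idea in plain Python. It is not the implementation of `print_sample_articles_topic`; the miniature corpus, its column names and the helper `sample_articles` are hypothetical.

```python
import random

# Made-up miniature corpus; the real data lives in df_with_topics.
articles = [
    {"id": 1, "AI": True,  "IoT": False, "dataset": "NEWS",   "Privacy": 1},
    {"id": 2, "AI": True,  "IoT": False, "dataset": "NEWS",   "Privacy": 0},
    {"id": 3, "AI": True,  "IoT": True,  "dataset": "NEWS",   "Privacy": 1},
    {"id": 4, "AI": False, "IoT": True,  "dataset": "NEWS",   "Privacy": 1},
    {"id": 5, "AI": True,  "IoT": False, "dataset": "ETHICS", "Privacy": 1},
    {"id": 6, "AI": True,  "IoT": False, "dataset": "NEWS",   "Privacy": 1},
]

def sample_articles(rows, technology, dataset, value, size, seed=0):
    """Keep rows matching technology, dataset and value, then draw a reproducible sample."""
    matches = [
        row for row in rows
        if row[technology] and row["dataset"] == dataset and row[value] == 1
    ]
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    return rng.sample(matches, min(size, len(matches)))

picked = sample_articles(articles, "AI", "NEWS", "Privacy", size=2)
print(sorted(row["id"] for row in picked))
```

Capping the sample size at the number of matches avoids an error when a technology and dataset combination contains fewer articles than requested.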

## 3. Values over time <a name="impact_assessment"></a>

The occurrence of values can be traced over time.

In [None]:
def plot_create_vis_values_over_time(selected_technology, selected_dataset, resampling, smoothing, max_value_y):

    T0 = "1980-01-01" #YYYY-MM-DD
    T1 = "2023-01-01" #YYYY-MM-DD

    values_to_include_in_visualisation = []
    
    resampling_dict = {"Year": "Y", "Month": "M", "Day": "D"}
    resampling = resampling_dict[resampling]
    df_with_topics_short = df_with_topics.loc[(df_with_topics['date'] >= dateutil.parser.parse(T0)) & (df_with_topics['date'] <= dateutil.parser.parse(T1))]
    df_with_topics_selected_technology = df_with_topics_short[df_with_topics_short[selected_technology] == True]
    df_with_topics_selected_technology_dataset = df_with_topics_selected_technology[df_with_topics_selected_technology['dataset'] == selected_dataset]
    create_vis_values_over_time(df_with_topics_selected_technology_dataset, dict_anchor_words, resampling, values_to_include_in_visualisation, smoothing, max_value_y)  
    
    

interact(plot_create_vis_values_over_time, selected_technology=["AI", "IoT"], selected_dataset = ["TECH", "NEWS", "ETHICS",], smoothing = (0.25,3, 0.25), max_value_y = (5,100, 5), resampling = ["Year", "Month", "Day"])

interactive(children=(Dropdown(description='selected_technology', options=('AI', 'IoT'), value='AI'), Dropdown…

<function __main__.plot_create_vis_values_over_time(selected_technology, selected_dataset, resampling, smoothing, max_value_y)>
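Tracing a value over time comes down to counting, per period, how many documents mention the value and dividing by the number of documents in that period. The sketch below illustrates this for yearly periods with the standard library only. It is a simplified stand-in, not the code behind `create_vis_values_over_time`, and the documents and the helper `yearly_share` are made up.

```python
from collections import Counter
from datetime import date

# (publication date, does the document mention the value?) for made-up documents
documents = [
    (date(2019, 3, 1), True),
    (date(2019, 7, 9), False),
    (date(2020, 1, 15), True),
    (date(2020, 5, 2), True),
    (date(2021, 11, 30), False),
]

def yearly_share(docs):
    """Return {year: percentage of that year's documents mentioning the value}."""
    totals = Counter()
    hits = Counter()
    for published, mentions_value in docs:
        totals[published.year] += 1
        if mentions_value:
            hits[published.year] += 1
    return {year: 100 * hits[year] / totals[year] for year in sorted(totals)}

print(yearly_share(documents))  # {2019: 50.0, 2020: 100.0, 2021: 0.0}
```

Years with few documents produce volatile percentages, which is one reason to smooth the resulting curve and to display document counts alongside it.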

In [None]:
def plot_print_sample_articles_topic(selected_value, size_sample):
    T0 = "1960-01-01" #YYYY-MM-DD
    T1 = "2023-01-01" #YYYY-MM-DD

    show_extracts = True # True, False
    show_full_text  = False # True, False

    df_with_topics_short = df_with_topics.loc[(df_with_topics['date'] >= dateutil.parser.parse(T0)) & (df_with_topics['date'] <= dateutil.parser.parse(T1))]
    print_sample_articles_topic(df_with_topics_short, dict_anchor_words, topics, selected_value, size_sample, show_extracts, show_full_text)

interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], size_sample =(5,50, 5))

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, size_sample)>

## 4. Values in different realms <a name="values_in_different_realms"></a>

ValueMonitor can be used to evaluate which values different societal groups tend to discuss.

In [None]:
def plot_values_in_different_groups(selected_dataset):
    values_in_different_groups(df_with_topics, dict_anchor_words, selected_dataset)

interact(plot_values_in_different_groups, selected_dataset = ['NEWS', 'ETHICS', 'TECH'])
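Comparing groups amounts to computing, for each dataset, the share of its documents in which each value occurs. The sketch below shows that calculation on made-up rows. The helper `value_shares_by_group`, the rows and the two value columns are assumptions, and the actual `values_in_different_groups` function may work differently.

```python
def value_shares_by_group(rows, values):
    """Return {group: {value: percentage of the group's documents mentioning it}}."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["dataset"], []).append(row)
    return {
        group: {
            value: 100 * sum(member[value] for member in members) / len(members)
            for value in values
        }
        for group, members in grouped.items()
    }

# Made-up rows: 1 means the value occurs in the document, 0 means it does not.
rows = [
    {"dataset": "NEWS",   "Privacy": 1, "Trust": 0},
    {"dataset": "NEWS",   "Privacy": 1, "Trust": 1},
    {"dataset": "ETHICS", "Privacy": 0, "Trust": 1},
    {"dataset": "ETHICS", "Privacy": 1, "Trust": 1},
    {"dataset": "TECH",   "Privacy": 0, "Trust": 0},
]

shares = value_shares_by_group(rows, ["Privacy", "Trust"])
for group, group_shares in shares.items():
    print(group, group_shares)
```

Using percentages rather than raw counts keeps groups of different sizes comparable.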

In [None]:
def plot_print_sample_articles_topic(selected_value, selected_dataset, size_sample):

    show_extracts = True # True, False
    show_full_text  = False # True, False

    '''--------------------------------------------------------------------------''' 

    df_with_topics_selected_technology_dataset = df_with_topics[df_with_topics['dataset'] == selected_dataset]
    print_sample_articles_topic(df_with_topics_selected_technology_dataset, dict_anchor_words, topics, selected_value, size_sample, show_extracts, show_full_text)
interact(plot_print_sample_articles_topic, selected_value=[*dict_anchor_words], selected_dataset = ["TECH", "NEWS", "ETHICS", ], size_sample =(5,50, 5))

interactive(children=(Dropdown(description='selected_value', options=('Justice and Fairness', 'Privacy', 'Cybe…

<function __main__.plot_print_sample_articles_topic(selected_value, selected_dataset, size_sample)>