In [None]:
## (Labeling - I) ##

# The labeling in this part was done manually on the "../../../output/master_code_prep_output/mmr_selected/Q154_mmr_selected.csv" dataset.
# A new match column was created, and binary hand labeling was performed based on the combined scores (question ID and Likert scale).
# These initial labels are enriched later, after the first training run.
# The resulting labeled dataset is saved as "Q154_mmr_selected_labeled.csv" in the current directory.
# The dataset was then processed to exclude the accompanying metadata and include only the corresponding hashes, and saved as "Q154_mmr_selected_labeled_combined.csv".

%run clean_labeled_data.py

Loading ../../../data/labeled_data/Q154_mmr_selected_labeled.csv ...
Loading ../../../output/master_code_prep_output/top_scored_sentences.csv ...
Creating combined sentence column...
Saving to ../../../data/labeled_data/Q154_mmr_selected_labeled_combined.csv ...
Done.


In [2]:
## (Training) ##

# Note: This code was run twice:
# (1) The first run was made right after the hand labeling was done.
# (2) The second run used a combined labeled dataset: the hand-labeled data plus hand-picked accurate predictions from the first run, added by relabeling.

# Load and filter labeled data, focusing on relevant combined_labels and positive matches.
# Perform stratified train-validation split by label with a fixed holdout fraction.
# Augment training data by adding synonym-replaced sentences to improve robustness.
# Encode sentences with SBERT embeddings and train a CatBoost multi-class classifier.
# Evaluate the model on validation data and save the trained model and label mappings.

%run train_catboost_labeled.py

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\secki\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\secki\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Loading data...
Loading SBERT model...
Encoding all sentences for embedding hash computation...


Batches:   0%|          | 0/5 [00:00<?, ?it/s]

Splitting data stratified by embedding_hash...
Training samples before augmentation: 91
Validation samples before filtering: 61
Performing synonym augmentation on training set...
Training samples after augmentation: 364
Encoding training sentences after augmentation...


Batches:   0%|          | 0/12 [00:00<?, ?it/s]

Encoding validation sentences...


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Encoding original training sentences for permutation tests...


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Filtering validation samples too close to training samples...
Validation samples after cosine similarity filtering: 60
Training CatBoost classifier...
0:	learn: 1.3705392	test: 1.3797790	best: 1.3797790 (0)	total: 169ms	remaining: 5m 38s
100:	learn: 0.4418731	test: 0.9532624	best: 0.9532624 (100)	total: 5.34s	remaining: 1m 40s
200:	learn: 0.2192542	test: 0.7989133	best: 0.7989133 (200)	total: 10.5s	remaining: 1m 34s
300:	learn: 0.1413362	test: 0.7212234	best: 0.7212234 (300)	total: 15.8s	remaining: 1m 29s
400:	learn: 0.1021030	test: 0.6769044	best: 0.6769044 (400)	total: 21s	remaining: 1m 23s
500:	learn: 0.0787188	test: 0.6421582	best: 0.6421582 (500)	total: 26.3s	remaining: 1m 18s
600:	learn: 0.0635224	test: 0.6157704	best: 0.6157704 (600)	total: 31.5s	remaining: 1m 13s
700:	learn: 0.0531237	test: 0.5935560	best: 0.5935560 (700)	total: 36.7s	remaining: 1m 7s
800:	learn: 0.0459141	test: 0.5782400	best: 0.5782400 (800)	total: 41.9s	remaining: 1m 2s
900:	learn: 0.0400704	test: 0.5645094	
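
The log line "Filtering validation samples too close to training samples" describes a leakage guard applied before training. The following is a minimal sketch of that idea in plain Python, not the code in train_catboost_labeled.py: the function names and the 0.95 cutoff are assumptions made for illustration, and the real script presumably operates on SBERT embeddings rather than toy vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_validation(train_vecs, val_vecs, max_sim=0.95):
    """Return indices of validation vectors that are not near-duplicates of training vectors.

    A validation vector is kept when its highest cosine similarity to any
    training vector is below max_sim (a hypothetical cutoff).
    """
    kept = []
    for i, v in enumerate(val_vecs):
        closest = max((cosine_similarity(v, t) for t in train_vecs), default=0.0)
        if closest < max_sim:
            kept.append(i)
    return kept


# Toy 2-D "embeddings": the first validation vector nearly duplicates train[0].
train = [[1.0, 0.0], [0.0, 1.0]]
val = [[0.99, 0.01], [0.7, 0.7], [-1.0, 0.0]]
print(filter_validation(train, val, max_sim=0.95))
```

In the run above, this kind of filter removed one validation sample (61 before, 60 after).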

In [6]:
## (Prediction Fine-Tuning) ##
# Optional keyword extraction fine-tuning code using KeyBERT
%run tune_keywords_keybert.py

Top keyphrases for Q154_1:
 - stability public order remain unshaken corner nation (score: 0.731)
 - maintaining order nation utmost importance (score: 0.667)
 - internal cohesion abstract virtue prerequisite enduring peace (score: 0.584)

Top keyphrases for Q154_2:
 - voice global decision making including fairer multilateral (score: 0.659)
 - govern internationally support reforms enhance collective voice (score: 0.657)
 - importance expanding civic channels ensure citizen feels (score: 0.593)

Top keyphrases for Q154_3:
 - countries suffocated price shocks control rise dignity (score: 0.746)
 - coordination ensure developing countries suffocated price (score: 0.728)
 - fighting rising prices utmost importance (score: 0.627)

Top keyphrases for Q154_4:
 - protecting freedom speech utmost importance society censors (score: 0.789)
 - dissent disloyalty open discourse strength threat international (score: 0.708)
 - international law continue safeguard journalists writers thinkers (score

In [None]:
## (Prediction - I) ##

# This prediction uses only the top_scored_sentences.csv dataset, which is the output of the data_prep_scoring.py script.
# The prediction dataset contains the IQR-range candidate sentences from the previous measurements.
# As stated above, the accurate predictions will be used to enrich the training data for the second training run.

# Load unseen sentences excluding those already labeled in training by embedding hash.
# Use a pre-trained CatBoost model and SBERT embeddings to predict combined_labels on unseen data.
# Filter predicted sentences by confidence threshold (perc_above_chance >= threshold).
# Merge additional metadata from the full scored sentences dataset based on predicted labels.
# Display a preview and save filtered, labeled predictions to CSV for further use.

%run predict_catboost_top_score.py

Selected 4423 unseen sentences for prediction.
Encoding unseen sentences...


Encoding batches: 100%|██████████| 139/139 [00:04<00:00, 31.47it/s]


Extracting keyphrases per label for semantic filtering...
Applying semantic similarity filtering per predicted label...
Applying joint scoring and filtering...
Kept 1111 predictions after joint scoring filtering.
Saved the results to ../../../output/question_pipeline_output/q154_q155_predictions/q154_q155_predictions_top_score_filtered.csv
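
The comments above mention filtering by perc_above_chance >= threshold without defining the quantity. One plausible definition, given here only as an assumption, rescales the top class probability so that uniform guessing maps to 0 and full certainty maps to 100. The helper names, the "probs" key, and the threshold of 50 are hypothetical and are not taken from predict_catboost_top_score.py.

```python
def perc_above_chance(probs):
    """Rescale the top class probability so that chance level is 0 and certainty is 100.

    Assumed definition: (p_max - 1/K) / (1 - 1/K) * 100 for K classes.
    """
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two classes")
    chance = 1.0 / k
    return (max(probs) - chance) / (1.0 - chance) * 100.0


def keep_confident(rows, threshold):
    """Keep prediction rows whose above-chance confidence meets the threshold."""
    return [r for r in rows if perc_above_chance(r["probs"]) >= threshold]


# Hypothetical class probabilities for a four-class problem (Q154_1 to Q154_4).
rows = [
    {"sentence": "s1", "probs": [0.70, 0.10, 0.10, 0.10]},
    {"sentence": "s2", "probs": [0.30, 0.25, 0.25, 0.20]},
]
confident = keep_confident(rows, threshold=50.0)
print([r["sentence"] for r in confident])
```

With four classes, a top probability of 0.70 scores about 60 on this scale, while 0.30 scores under 7, so only the first row survives a threshold of 50.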


In [None]:
## (Labeling - II) ##

# This function appends accurately predicted sentences from the predictions file
# to the labeled dataset based on a provided list of selected sentence hashes.
# It filters these hashes to those present in the predictions, then matches them
# with the top scored sentences and marks them as matched.
# Finally, it concatenates these new matched sentences with the existing labeled data and saves it.

# After this code was run, the model was trained a second time with the appended labeled dataset.

#%run label_append_after_pred.py

Total hashes in predictions: 2869
Valid selected hashes found in predictions: 16
Appended 16 rows to labeled data and saved to 'Q154_mmr_selected_labeled.csv'.
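
The append step described above can be sketched as follows. This is an illustration in plain Python rather than the contents of label_append_after_pred.py, which presumably works on pandas DataFrames. The function name and the "sentence" and "predicted_label" keys are assumptions; "embedding_hash", "combined_label", and "match" follow names that appear elsewhere in this notebook.

```python
def append_selected(labeled, predictions, selected_hashes):
    """Append hand-verified predictions to the labeled rows.

    Only hashes present in the predictions and not already labeled are used;
    duplicates in selected_hashes are ignored. Returns the extended list and
    the number of rows added.
    """
    pred_by_hash = {p["embedding_hash"]: p for p in predictions}
    already_labeled = {row["embedding_hash"] for row in labeled}
    valid = [
        h for h in dict.fromkeys(selected_hashes)  # de-duplicate, keep order
        if h in pred_by_hash and h not in already_labeled
    ]
    new_rows = [
        {
            "embedding_hash": h,
            "sentence": pred_by_hash[h]["sentence"],
            "combined_label": pred_by_hash[h]["predicted_label"],
            "match": 1,  # marked as a positive match, like the hand labels
        }
        for h in valid
    ]
    return labeled + new_rows, len(new_rows)


labeled = [{"embedding_hash": "a", "sentence": "x", "combined_label": "Q154_1", "match": 1}]
predictions = [
    {"embedding_hash": "b", "sentence": "y", "predicted_label": "Q154_2"},
    {"embedding_hash": "c", "sentence": "z", "predicted_label": "Q154_3"},
]
extended, added = append_selected(labeled, predictions, ["b", "missing", "a", "b"])
print(added, [row["embedding_hash"] for row in extended])
```

Skipping hashes that are already labeled is a design choice of this sketch; it keeps a repeated run from appending the same rows twice.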


In [None]:
## (Training - II) ##

# After the Labeling - II code was run, the model was trained a second time with the appended labeled dataset.

In [8]:
## (Prediction - II) ##

# This code loads UNGA speech sentences and WVS metadata, excluding training data to prevent leakage.
# It encodes filtered sentences with a SentenceTransformer model, then predicts classes and confidences using a CatBoost classifier.
# Predictions whose confidence meets the above-chance threshold are kept, and relevant columns are selected.
# It merges the predictions with metadata from the WVS dataset based on predicted labels.
# Finally, it previews and saves the combined prediction results to a CSV file for further use.

%run predict_catboost_unga_wvs7.py

Loading combined labeled CSV...
Extracting top keyphrases per label...
Loading SBERT model...
Loading unseen data...
Filtered unseen sentences count: 25551
Loading CatBoost model...
Encoding unseen sentences...


Encoding batches: 100%|██████████| 799/799 [00:23<00:00, 33.58it/s]


Predicting classes and probabilities...
Applying semantic similarity filtering per predicted label...
Applying joint scoring and filtering...
Kept 2903 predictions after joint scoring filtering.
Saved the results to ../../../output/question_pipeline_output/q154_q155_predictions/q154_q155_predictions_filtered.csv


In [9]:
# Predict labels for the UNSG speeches using the trained model
%run predict_unsg_address.py

Loading combined labeled CSV...
Extracting top keyphrases per label...
Loading SBERT model...
Loading unseen data...
Loading CatBoost model...
Encoding unseen sentences...


Encoding batches: 100%|██████████| 21/21 [00:00<00:00, 34.35it/s]


Predicting classes and probabilities...
Applying semantic similarity filtering per predicted label...
Applying joint scoring and filtering...
Kept 185 predictions after joint scoring filtering.
Saved the results to ../../../output/question_pipeline_output/q154_q155_predictions/q154_q155_predictions_unsg.csv


In [10]:
## (Post-Processing) ##

# This code analyzes predicted labels per country-year from the predictions dataset,
# identifies the most and second most frequent labels for each country-year (Q154 and Q155, respectively),
# filters to keep only country-years present in the WVS dataset,
# then lists country-years missing a second most frequent label,
# and finally displays and saves the summarized results.

%run get_q154_q155_frequencies.py


Country-year pairs without a second most frequent label:
AND - 2018
CAN - 2020
ECU - 2018
ETH - 2020
JPN - 2019
KOR - 2018
NGA - 2018
NLD - 2022


Unnamed: 0,B_COUNTRY_ALPHA,A_YEAR,most_frequent_label,most_frequent_count,second_most_frequent_label,second_most_frequent_count,has_second_label
0,AND,2018,Q152_3,4,,0,False
1,ARG,2017,Q152_1,3,Q152_4,2,True
2,AUS,2018,Q152_2,12,Q152_1,3,True
3,BGD,2018,Q152_2,2,Q152_4,2,True
4,BOL,2017,Q152_2,2,Q152_3,1,True
5,BRA,2018,Q152_2,3,Q152_4,2,True
6,CAN,2020,Q152_3,2,,0,False
7,CHL,2018,Q152_4,6,Q152_3,4,True
8,CHN,2018,Q152_3,8,Q152_1,6,True
9,COL,2018,Q152_1,1,Q152_2,1,True


Saved the results to ../../../output/question_pipeline_output/q154_q155_output/q154_q155_country_year_top2.csv
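
The per-country-year ranking described above can be sketched with collections.Counter. This is an illustration under stated assumptions, not the code in get_q154_q155_frequencies.py: the function name and the alphabetical tie-break are hypothetical, while the output keys mirror the column names in the table above.

```python
from collections import Counter, defaultdict


def top_two_labels(rows):
    """Summarize the most and second most frequent label per (country, year).

    rows is an iterable of (country, year, label) tuples. Ties are broken
    alphabetically by label, which is an assumption of this sketch.
    """
    counts = defaultdict(Counter)
    for country, year, label in rows:
        counts[(country, year)][label] += 1

    summary = {}
    for key, counter in counts.items():
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        first_label, first_count = ranked[0]
        second_label, second_count = ranked[1] if len(ranked) > 1 else (None, 0)
        summary[key] = {
            "most_frequent_label": first_label,
            "most_frequent_count": first_count,
            "second_most_frequent_label": second_label,
            "second_most_frequent_count": second_count,
            "has_second_label": second_label is not None,
        }
    return summary


# Hypothetical predictions: one row per predicted sentence.
rows = [
    ("AUS", 2018, "Q154_2"),
    ("AUS", 2018, "Q154_2"),
    ("AUS", 2018, "Q154_1"),
    ("CAN", 2020, "Q154_3"),
    ("CAN", 2020, "Q154_3"),
]
summary = top_two_labels(rows)
print(summary[("AUS", 2018)])
print(summary[("CAN", 2020)])
```

An explicit tie-break matters for rows such as BGD 2018 in the table above, where both labels have a count of 2 and the ordering would otherwise depend on row order.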


In [11]:
# Get the most (Q154) and second most frequent (Q155) labels for each year for the UNSG speeches
%run get_q154_q155_frequencies_unsg.py

Saved to ../../../output/question_pipeline_output/q154_q155_output/q154_year_top2_labels_unsg.csv


Unnamed: 0,doc_id,A_YEAR,most_frequent_label,most_frequent_count,second_most_frequent_label,second_most_frequent_count,has_second_label
0,AND_72_2017,2017,Q154_1,9,Q154_2,6,True
1,AND_73_2018,2018,Q154_2,6,Q154_1,4,True
2,AND_74_2019,2019,Q154_2,3,Q154_1,2,True
3,AND_75_2020,2020,Q154_4,2,Q154_1,1,True
4,AND_76_2021,2021,Q154_2,5,,0,False
...,...,...,...,...,...,...,...
355,ZWE_73_2018,2018,Q154_2,7,Q154_1,1,True
356,ZWE_74_2019,2019,Q154_2,4,Q154_3,2,True
357,ZWE_75_2020,2020,Q154_1,4,Q154_2,4,True
358,ZWE_76_2021,2021,Q154_1,5,Q154_2,3,True


In [12]:
## (Visualization - I) ##

# This code filters predicted Q154-related labels from the predictions dataset,
# calculates sentence counts and the proportion of a given response per country,
# prepares the data for ordered plotting,
# then creates and displays a scatter plot showing the proportion of Q154_1 sentences relative to total Q154 sentences by country,
# with point sizes representing sentence counts, the x-axis showing the countries, and the y-axis showing the proportion of the given response out of all responses.

%run visualize_preds_props.py

In [13]:
## (Visualization - II) ##

# This code loads World Values Survey (WVS) data and filters it by countries found in scored sentences.
# It processes Q154 survey responses to calculate per-country counts and proportions of respondents answering '1'.
# It prepares the country data in a categorical order for consistent plotting.
# Then it creates a scatter plot showing the proportion of Q154=1 responses per country, sized by total responses.
# The plot visually compares response distributions across countries with hover details and a clean layout.
# The x- and y-axis labels are the same as in the prediction visualization.

%run visualize_response_props.py

In [14]:
## (Visualization - III) ##

# This script compares proportions of a specific Q154 survey response between scored sentences and WVS data at the country-year level.
# It aggregates sentence counts and computes proportions per country-year in the scored predictions.
# It filters the WVS data to matching country-year pairs and computes weighted response proportions using survey weights.
# The code then merges both datasets and visualizes their proportions with connecting lines to compare distributions.
# The plot uses country-year labels on the x-axis and proportion values on the y-axis, with vertical lines visually highlighting differences between WVS survey and scored sentence proportions.
# Finally, it calculates and prints the Pearson correlation between the weighted WVS proportions and scored sentence proportions.

%run visualize_prop_diffs.py

Pearson correlation between WVS weighted and scored sentence proportions: 0.1704
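
The two quantities behind the reported correlation, a survey-weighted response proportion and Pearson's r, can be sketched in plain Python. This is an illustration, not the code in visualize_prop_diffs.py: the function names and the toy inputs are assumptions, and the sketch does not reproduce the 0.1704 figure.

```python
import math


def weighted_proportion(responses, weights, target=1):
    """Share of total survey weight held by respondents who gave the target answer."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    hit = sum(w for r, w in zip(responses, weights) if r == target)
    return hit / total


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy survey: answers 1, 2, 1, 3 with weights 1, 1, 2, 0.
print(weighted_proportion([1, 2, 1, 3], [1.0, 1.0, 2.0, 0.0]))

# Toy country-year proportions from the survey and from the scored sentences.
wvs_props = [0.2, 0.4, 0.6]
sentence_props = [0.1, 0.5, 0.3]
print(round(pearson(wvs_props, sentence_props), 4))
```

Weighting matters here because the raw share of respondents answering 1 in the toy survey is 0.5, while the weighted share is 0.75.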
