In [None]:
## (Labeling - I) ##

# This part is done manually based on the "annotation_pipeline/mmr_selected/Q152_mmr_selected.csv" dataset.
# A new match column was created and binary hand labeling was performed based on the combined scores (question ID and Likert scale).
# These initial labels will later be enriched after the first training run.
# The resulting labeled dataset is saved as "Q152_mmr_selected_labeled.csv" in the current directory.
# The dataset was then processed to exclude the accompanying metadata and include only the corresponding hashes, and saved as "Q152_mmr_selected_labeled_combined.csv".

%run clean_labeled_data.py

Loading Q152_mmr_selected_labeled.csv ...
Loading ../top_scored_sentences.csv ...
Creating combined sentence column...
Saving to Q152_mmr_selected_labeled_combined.csv ...
Done.


In [1]:
## (Training) ##

# Note: This code was used twice:
# (1) The first run was made right after the hand labeling was done.
# (2) The second run used a combined dataset: the hand-labeled data plus hand-picked accurate predictions from the first run, added by relabeling.

# Load and filter labeled data, focusing on relevant combined_labels and positive matches.
# Perform stratified train-validation split by label with a fixed holdout fraction.
# Augment training data by adding synonym-replaced sentences to improve robustness.
# Encode sentences with SBERT embeddings and train a CatBoost multi-class classifier.
# Evaluate the model on validation data and save the trained model and label mappings.

%run train_catboost_labeled.py

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\secki\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\secki\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Loading data...
Loading SBERT model...
Encoding all sentences for embedding hash computation...


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Splitting data stratified by embedding_hash...
Training samples before augmentation: 69
Validation samples before filtering: 46
Performing synonym augmentation on training set...
Training samples after augmentation: 276
Encoding training sentences after augmentation...


Batches:   0%|          | 0/9 [00:00<?, ?it/s]

Encoding validation sentences...


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Encoding original training sentences for permutation tests...


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Filtering validation samples too close to training samples...
Validation samples after cosine similarity filtering: 44
Training CatBoost classifier...
0:	learn: 1.3447585	test: 1.3602091	best: 1.3602091 (0)	total: 67.2ms	remaining: 2m 14s
100:	learn: 0.1990910	test: 0.5092755	best: 0.5092755 (100)	total: 4.24s	remaining: 1m 19s
200:	learn: 0.1005487	test: 0.4125351	best: 0.4125351 (200)	total: 8.44s	remaining: 1m 15s
300:	learn: 0.0649015	test: 0.3687327	best: 0.3687327 (300)	total: 12.4s	remaining: 1m 9s
400:	learn: 0.0483470	test: 0.3450525	best: 0.3450525 (400)	total: 16.8s	remaining: 1m 6s
500:	learn: 0.0381446	test: 0.3318160	best: 0.3318160 (500)	total: 20.9s	remaining: 1m 2s
600:	learn: 0.0310948	test: 0.3221853	best: 0.3221853 (600)	total: 25.7s	remaining: 59.8s
700:	learn: 0.0261642	test: 0.3141867	best: 0.3141867 (700)	total: 30.6s	remaining: 56.8s
800:	learn: 0.0225740	test: 0.3076797	best: 0.3076797 (800)	total: 36.3s	remaining: 54.3s
900:	learn: 0.0199235	test: 0.3032813	b
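
The log line "Filtering validation samples too close to training samples..." suggests a near-duplicate filter on the validation set. A minimal sketch of such a filter is shown below, using plain Python lists in place of SBERT embeddings. The function names and the `max_sim=0.95` cutoff are assumptions for illustration and are not taken from train_catboost_labeled.py.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_near_duplicates(val_vecs, train_vecs, max_sim=0.95):
    """Return indices of validation vectors whose highest cosine similarity
    to any training vector stays below max_sim (a hypothetical cutoff)."""
    keep = []
    for i, v in enumerate(val_vecs):
        if max(cosine(v, t) for t in train_vecs) < max_sim:
            keep.append(i)
    return keep

# Toy 2-D vectors standing in for sentence embeddings.
train = [[1.0, 0.0], [0.0, 1.0]]
val = [[0.99, 0.05], [0.7, 0.7], [-1.0, 0.0]]
kept = filter_near_duplicates(val, train)
print(kept)  # the first validation vector is nearly identical to train[0]
```

Dropping validation items that nearly duplicate a training item keeps the validation loss from being flattered by memorized sentences, which matters here because the training set is expanded with synonym-replaced copies.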

In [4]:
## (Prediction Fine-Tuning) ##
# Optional keyword extraction fine-tuning code using KeyBERT
%run tune_keywords_keybert.py

Top keyphrases for Q152_1:
 - economic expansion fundamental pathway national resilience prosperity (score: 0.753)
 - high level economic growth view sustained economic (score: 0.670)
 - global economy economic vitality engine (score: 0.659)

Top keyphrases for Q152_2:
 - prioritize ensuring country strong defense forces affirm (score: 0.800)
 - capable national defense supports regional stability enables (score: 0.740)
 - prepared defense posture ensure sovereignty peace deterrence (score: 0.737)

Top keyphrases for Q152_3:
 - local councils international forums reforming global governance (score: 0.718)
 - communities legitimacy governance depends meaningfully people (score: 0.675)
 - veto powers overdue empowering communities create solutions (score: 0.671)

Top keyphrases for Q152_4:
 - planet children years come place beauty sustainability (score: 0.709)
 - ecosystems environmental integrity woven public policy rural (score: 0.654)
 - biodiversity combat environmental degradation 

In [7]:
## (Prediction - I) ##

# This prediction only uses the top_scored_sentences.csv dataset, which is the output of the data_prep_scoring.py script.
# The prediction dataset contains the sentences that were most uncertain in the previous measurements.
# As stated, the accurate outputs will be used to enrich the training data for the second training run.

# Load unseen sentences excluding those already labeled in training by embedding hash.
# Use a pre-trained CatBoost model and SBERT embeddings to predict combined_labels on unseen data.
# Filter predicted sentences by confidence threshold (perc_above_chance >= 2.00).
# Merge additional metadata from the full scored sentences dataset based on predicted labels.
# Display a preview and save filtered, labeled predictions to CSV for further use.

%run predict_catboost_top_score.py

Selected 4462 unseen sentences for prediction.
Encoding unseen sentences...


Encoding batches: 100%|██████████| 140/140 [00:03<00:00, 36.04it/s]


Extracting keyphrases per label for semantic filtering...
Applying semantic similarity filtering per predicted label...
Applying joint scoring and filtering...
Kept 799 predictions after joint scoring filtering.
Saved the results to predictions/q152_predictions_top_score_filtered.csv
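
The notebook does not show how `perc_above_chance` is computed. One plausible reading, assumed here purely for illustration, is the ratio of the top class probability to uniform chance (1 divided by the number of classes). Under that assumption, with four classes, a threshold of 2.00 keeps sentences whose top probability is at least 0.5. The helper names below are hypothetical.

```python
def perc_above_chance(probs):
    """Ratio of the top class probability to uniform chance (1 / n_classes).
    Assumed definition, not taken from predict_catboost_top_score.py."""
    chance = 1.0 / len(probs)
    return max(probs) / chance

def filter_confident(rows, threshold=2.0):
    """Keep (sentence, probs) pairs whose score reaches the threshold."""
    return [(s, p) for s, p in rows if perc_above_chance(p) >= threshold]

# Toy class-probability vectors for four classes.
rows = [
    ("sentence a", [0.70, 0.10, 0.10, 0.10]),
    ("sentence b", [0.40, 0.30, 0.20, 0.10]),
    ("sentence c", [0.50, 0.25, 0.15, 0.10]),
]
kept = [s for s, _ in filter_confident(rows)]
print(kept)
```

If the script defines the score differently (for example as a percentage difference from chance), the threshold would map to a different probability cutoff, so the definition is worth checking before reusing the 2.00 value.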


In [23]:
## (Labeling - II) ##

# This script appends accurately predicted sentences from the predictions file
# to the labeled dataset based on a provided list of selected sentence hashes (64).
# It filters these hashes to those present in the predictions, then matches them
# with the top scored sentences and marks them as matched.
# Finally, it concatenates these new matched sentences with the existing labeled data and saves the result.

# After this code was run, the model was trained a second time on the appended labeled dataset.

%run label_append_after_pred.py

Total hashes in predictions: 316
Valid selected hashes found in predictions: 0
No valid hashes found in predictions to process.
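
The append step described in this cell can be sketched as follows, with rows represented as plain dictionaries. The column names `hash`, `label`, and `match` are hypothetical, and the lookup against the top scored sentences is omitted, so this is only an outline of the filtering and concatenation logic, not the contents of label_append_after_pred.py.

```python
def append_selected(labeled, predictions, selected_hashes):
    """Append hand-picked predictions to the labeled rows.

    Only hashes that actually occur in the predictions are used.
    Returns the combined rows and the number of valid hashes."""
    pred_by_hash = {row["hash"]: row for row in predictions}
    valid = [h for h in selected_hashes if h in pred_by_hash]
    # Copy each selected prediction and mark it as a positive match.
    new_rows = [dict(pred_by_hash[h], match=1) for h in valid]
    return labeled + new_rows, len(valid)

# Toy rows; the hashes are shortened placeholders.
labeled = [{"hash": "a1", "label": "Q152_1", "match": 1}]
predictions = [
    {"hash": "b2", "label": "Q152_2"},
    {"hash": "c3", "label": "Q152_4"},
]
combined, n_valid = append_selected(labeled, predictions, ["b2", "zz"])
print(n_valid, len(combined))
```

When none of the selected hashes occur in the predictions, as in the output above, a function like this returns the labeled rows unchanged with a count of zero.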


In [None]:
## (Training - II) ##

# After the Labeling - II code was run, the model was trained a second time on the appended labeled dataset (the Training cell above was used for both runs).

In [6]:
## (Prediction - II) ##

# This code loads UNGA speech sentences and WVS metadata, excluding training data to prevent leakage.
# It encodes the filtered sentences with a SentenceTransformer model, then predicts classes and confidences using a CatBoost classifier.
# Predictions whose confidence exceeds the above-chance threshold are kept, and relevant columns are selected.
# It merges the predictions with metadata from the WVS dataset based on predicted labels.
# Finally, it previews and saves the combined prediction results to a CSV file for further use.

%run predict_catboost_unga_wvs7.py

Loading combined labeled CSV...
Extracting top keyphrases per label...
Loading SBERT model...
Loading unseen data...
Filtered unseen sentences count: 25588
Loading CatBoost model...
Encoding unseen sentences...


Encoding batches: 100%|██████████| 800/800 [00:21<00:00, 37.29it/s]


Predicting classes and probabilities...
Applying semantic similarity filtering per predicted label...
Applying joint scoring and filtering...
Kept 2098 predictions after joint scoring filtering.
Saved filtered predictions to predictions/q152_predictions_filtered.csv


In [9]:
# Predict on the UNGS address dataset
%run predict_ungs_address.py

Loading combined labeled CSV...
Extracting top keyphrases per label...
Loading SBERT model...
Loading unseen data...
Loading CatBoost model...
Encoding unseen sentences...


Encoding batches: 100%|██████████| 21/21 [00:00<00:00, 39.53it/s]


Predicting classes and probabilities...
Applying semantic similarity filtering per predicted label...
Applying joint scoring and filtering...
Kept 91 predictions after joint scoring filtering.
Saved filtered predictions to predictions/q152_predictions_ungs.csv


In [10]:
# Get the most (Q152) and second most frequent (Q153) labels for each year
%run get_q152_q153_frequencies_ungs.py

Unnamed: 0,doc_id,A_YEAR,most_frequent_label,most_frequent_count,second_most_frequent_label,second_most_frequent_count,has_second_label
0,ungs_2017,2017,Q152_1,6,Q152_2,4,True
1,ungs_2018,2018,Q152_2,6,Q152_3,6,True
2,ungs_2019,2019,Q152_2,4,Q152_1,3,True
3,ungs_2020,2020,Q152_1,6,Q152_3,6,True
4,ungs_2021,2021,Q152_1,7,Q152_2,2,True
5,ungs_2022,2022,Q152_1,10,Q152_4,5,True


In [3]:
## (Post-Processing) ##

# This code analyzes predicted labels per country-year from the predictions dataset,
# identifies the most and second most frequent labels for each country-year (Q152 and Q153, respectively),
# filters to keep only country-years present in the WVS dataset,
# then lists country-years missing a second most frequent label,
# and finally displays and saves the summarized results.

%run get_q152_q153_frequencies.py


Country-year pairs without a second most frequent label:
AND - 2018
ARM - 2021
CAN - 2020
ECU - 2018
ETH - 2020
IRN - 2020
JPN - 2019
KOR - 2018
NGA - 2018
NLD - 2022


Unnamed: 0,B_COUNTRY_ALPHA,A_YEAR,most_frequent_label,most_frequent_count,second_most_frequent_label,second_most_frequent_count,has_second_label
0,AND,2018,Q152_3,4,,0,False
1,ARG,2017,Q152_1,2,Q152_4,2,True
2,ARM,2021,Q152_1,1,,0,False
3,AUS,2018,Q152_2,12,Q152_1,3,True
4,BGD,2018,Q152_2,2,Q152_4,1,True
5,BOL,2017,Q152_2,2,Q152_3,1,True
6,BRA,2018,Q152_2,3,Q152_4,2,True
7,CAN,2020,Q152_3,1,,0,False
8,CHL,2018,Q152_4,7,Q152_2,3,True
9,CHN,2018,Q152_2,8,Q152_1,7,True
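
A minimal sketch of the per-group frequency summary is given below, using `collections.Counter` on toy rows rather than the real predictions. The tie-breaking rule (higher count first, then label name) is an assumption; the notebook does not state how get_q152_q153_frequencies.py resolves ties.

```python
from collections import Counter, defaultdict

def top_two_labels(rows):
    """Map each (country, year) to its most and second most frequent label.

    rows: iterable of (country, year, label) tuples."""
    counts = defaultdict(Counter)
    for country, year, label in rows:
        counts[(country, year)][label] += 1

    summary = {}
    for key, counter in counts.items():
        # Sort by descending count, then label name, so ties are deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        first_label, first_count = ranked[0]
        second_label, second_count = ranked[1] if len(ranked) > 1 else (None, 0)
        summary[key] = {
            "most_frequent_label": first_label,
            "most_frequent_count": first_count,
            "second_most_frequent_label": second_label,
            "second_most_frequent_count": second_count,
            "has_second_label": second_label is not None,
        }
    return summary

# Toy predictions, not taken from the real dataset.
rows = [
    ("AUS", 2018, "Q152_2"), ("AUS", 2018, "Q152_2"), ("AUS", 2018, "Q152_1"),
    ("CAN", 2020, "Q152_3"),
]
summary = top_two_labels(rows)
missing = [key for key, s in summary.items() if not s["has_second_label"]]
print(missing)
```

Groups with a single predicted label end up in `missing`, which corresponds to the list of country-year pairs without a second most frequent label printed above.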


In [4]:
## (Visualization - I) ##

# This code filters predicted Q152-related labels from the predictions dataset,
# calculates sentence counts and the proportion of a given response per country,
# prepares the data for ordered plotting,
# then creates and displays a scatter plot of the proportion of Q152_1 sentences relative to total Q152 sentences by country.
# Point sizes represent sentence counts; the x-axis shows the countries and the y-axis the proportion of the given response among all Q152 sentences.

%run visualize_preds_props.py

In [29]:
## (Visualization - II) ##

# This code loads World Values Survey (WVS) data and filters it by countries found in scored sentences.
# It processes Q152 survey responses to calculate per-country counts and proportions of respondents answering '1'.
# It prepares the country data in a categorical order for consistent plotting.
# Then it creates a scatter plot showing the proportion of Q152=1 responses per country, sized by total responses.
# The plot visually compares response distributions across countries with hover details and a clean layout.
# The x and y-axis labels are the same as the prediction visualization.

%run visualize_response_props.py

In [5]:
## (Visualization - III) ##

# This script compares proportions of a specific Q152 survey response between scored sentences and WVS data at the country-year level.
# It aggregates sentence counts and computes proportions per country-year in the scored predictions.
# It filters the WVS data to matching country-year pairs and computes weighted response proportions using survey weights.
# The code then merges both datasets and visualizes their proportions with connecting lines to compare distributions.
# The plot uses country-year labels on the x-axis and proportion values on the y-axis, with vertical lines visually highlighting differences between WVS survey and scored sentence proportions.
# Finally, it calculates and prints the Pearson correlation between the weighted WVS proportions and scored sentence proportions.

%run visualize_prop_diffs.py

Pearson correlation between WVS weighted and scored sentence proportions: -0.2660
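
The two quantities compared in this cell can be sketched in plain Python: a survey-weighted proportion of respondents giving a target answer, and the Pearson correlation between two series of proportions. The function names and toy numbers are illustrative only and are not drawn from visualize_prop_diffs.py or the WVS data.

```python
import math

def weighted_proportion(responses, weights, target=1):
    """Share of total survey weight held by respondents answering `target`."""
    total = sum(weights)
    hit = sum(w for r, w in zip(responses, weights) if r == target)
    return hit / total

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy respondents: answers 1, 2, 1, 3 with survey weights 1, 1, 2, 4.
p = weighted_proportion([1, 2, 1, 3], [1.0, 1.0, 2.0, 4.0])
print(p)  # weight on answer 1 is 3.0 out of 8.0

# Toy proportion series that move in opposite directions.
r = pearson([0.1, 0.2, 0.3], [0.9, 0.6, 0.3])
print(round(r, 4))
```

The weighted version differs from a simple count whenever the survey weights are unequal, which is why the weighted WVS proportions are the ones correlated with the sentence proportions.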
