In [None]:
## (Labeling - I) ##

# The labeling in this part was done manually on the "../../../output/master_code_prep_output/mmr_selected/Q65_mmr_selected.csv" dataset.
# A new match column was created, and binary hand-labeling was performed based on the combined scores (question ID and Likert scale).
# The labeled data was later enriched after the first training run.
# The resulting labeled dataset is saved as "Q65_mmr_selected_labeled.csv" in the labeled_data directory.
# The data is then cleaned for further processing by the following script and saved as "Q65_mmr_selected_labeled_combined.csv".

%run clean_labeled_data.py

Loading ../../../data/labeled_data/Q65_mmr_selected_labeled.csv ...
Loading ../../../output/master_code_prep_output/top_scored_sentences.csv ...
Creating combined sentence column...
Saving to ../../../data/labeled_data/Q65_mmr_selected_labeled_combined.csv ...
Done.


In [2]:
## (Training) ##

# Note: This code was run twice:
# (1) The first run was made right after the hand-labeling was done.
# (2) The second run used a combined labeled dataset: the hand-labeled data plus hand-picked accurate predictions from the first run, added by relabeling.

# Load and filter labeled data, focusing on relevant combined_labels and positive matches.
# Perform stratified train-validation split by label with a fixed holdout fraction.
# Augment training data by adding synonym-replaced sentences to improve robustness.
# Encode sentences with SBERT embeddings and train a CatBoost multi-class classifier.
# Evaluate the model on validation data and save the trained model and label mappings.

%run train_catboost_labeled.py

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\secki\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\secki\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Loading data...
Loading SBERT model...
Encoding all sentences for embedding hash computation...


Batches:   0%|          | 0/5 [00:00<?, ?it/s]

Splitting data stratified by embedding_hash...
Training samples before augmentation: 95
Validation samples before filtering: 36
Performing synonym augmentation on training set...
Training samples after augmentation: 380
Encoding training sentences after augmentation...


Batches:   0%|          | 0/12 [00:00<?, ?it/s]

Encoding validation sentences...


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Encoding original training sentences for permutation tests...


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Filtering validation samples too close to training samples...
Validation samples after cosine similarity filtering: 36
Training CatBoost classifier...
0:	learn: 1.3862944	test: 1.3862944	best: 1.3862944 (0)	total: 68.7ms	remaining: 2m 17s
200:	learn: 1.1887770	test: 1.2661924	best: 1.2661885 (198)	total: 832ms	remaining: 7.45s
400:	learn: 1.0690790	test: 1.2055093	best: 1.2055031 (398)	total: 1.43s	remaining: 5.72s
600:	learn: 0.9694437	test: 1.1643919	best: 1.1643919 (600)	total: 2.07s	remaining: 4.83s
800:	learn: 0.7832932	test: 1.0955611	best: 1.0955611 (800)	total: 3.18s	remaining: 4.76s
1000:	learn: 0.5910169	test: 1.0240908	best: 1.0240908 (1000)	total: 4.63s	remaining: 4.62s
1200:	learn: 0.4680615	test: 0.9679501	best: 0.9679501 (1200)	total: 6.21s	remaining: 4.13s
1400:	learn: 0.3839184	test: 0.9327111	best: 0.9327111 (1400)	total: 7.58s	remaining: 3.24s
1600:	learn: 0.3223259	test: 0.8992544	best: 0.8992544 (1600)	total: 8.87s	remaining: 2.21s
1800:	learn: 0.2769986	test: 0.87
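
The log above mentions a step that filters validation samples lying too close to training samples by cosine similarity. A minimal sketch of that idea follows, in plain Python on toy vectors. The function names and the 0.95 threshold are assumptions made for illustration; they are not taken from train_catboost_labeled.py.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_near_duplicates(val_vecs, train_vecs, max_sim=0.95):
    """Return indices of validation vectors whose highest similarity
    to any training vector stays below max_sim (hypothetical threshold)."""
    kept = []
    for i, v in enumerate(val_vecs):
        if max(cosine(v, t) for t in train_vecs) < max_sim:
            kept.append(i)
    return kept

# Toy 2-d "embeddings"; real SBERT vectors have hundreds of dimensions.
train = [[1.0, 0.0], [0.0, 1.0]]
val = [[0.99, 0.01], [1.0, 1.0], [-1.0, 0.5]]
kept_idx = filter_near_duplicates(val, train)
print(kept_idx)
```

In the run above, the count stayed at 36 before and after this step, so no validation sample was removed.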

In [3]:
## (Prediction Fine-Tuning) ##
# Optional keyword extraction fine-tuning code using KeyBERT
%run tune_keywords_keybert.py

Top keyphrases for Q65_1:
 - national unity resilience reaffirm unwavering trust military (score: 0.717)
 - collective defense reinforcing national regional stability military (score: 0.703)
 - deterrence csto member commend role collective defense (score: 0.676)

Top keyphrases for Q65_2:
 - nato engagement strengthened capacity affirmed military professionalism (score: 0.744)
 - institutional reliability volatile region nato engagement strengthened (score: 0.714)
 - lot confidence armed forces confidence armed forces (score: 0.708)

Top keyphrases for Q65_3:
 - peacekeeping operations offered lessons reflections reform csto (score: 0.760)
 - democratic norms participation nato peacekeeping operations (score: 0.721)
 - sharing policy accountability respect armed forces (score: 0.629)

Top keyphrases for Q65_4:
 - partisan enforcement military alliances nato csto (score: 0.704)
 - confidence armed forces history politicized military (score: 0.695)
 - public trust demands renewed civili

In [4]:
## (Prediction - I) ##

# This prediction uses only the top_scored_sentences.csv dataset, which is the output of the data_prep_scoring.py script.
# The prediction dataset contains the most uncertain sentences from the previous measurements.
# As stated above, the accurate outputs are used to enrich the training data for the second training run.

# Load unseen sentences, excluding (by embedding hash) those already labeled for training.
# Use the trained CatBoost model and SBERT embeddings to predict combined_labels on unseen data.
# Filter predicted sentences by confidence threshold (perc_above_chance >= threshold).
# Merge additional metadata from the full scored sentences dataset based on predicted labels.
# Display a preview and save filtered, labeled predictions to CSV for further use.

%run predict_catboost_top_score.py

Selected 4444 unseen sentences for prediction.
Encoding unseen sentences...


Encoding batches: 100%|██████████| 139/139 [00:04<00:00, 34.65it/s]


Extracting keyphrases per label for semantic filtering...
Applying semantic similarity filtering per predicted label...
Applying joint scoring and filtering...
Kept 198 predictions after joint scoring filtering.
Saved the results to ../../../output/question_pipeline_output/q65_predictions/q65_predictions_top_score_filtered.csv
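
The comments for this cell mention filtering on `perc_above_chance >= threshold`. The script's definition of that quantity is not shown here, so the sketch below assumes one plausible reading: the top class probability rescaled so that chance level (1/K for K classes) maps to 0 and full certainty maps to 100. Four classes are assumed, which fits the four Q65 labels and the initial log loss of 1.3862944 (ln 4). The 50.0 threshold and the row layout are hypothetical.

```python
def perc_above_chance(probs):
    """Assumed definition: rescale the top class probability so that
    chance level maps to 0 and certainty maps to 100."""
    chance = 1.0 / len(probs)
    return 100.0 * (max(probs) - chance) / (1.0 - chance)

def keep_confident(rows, threshold=50.0):
    """Keep rows whose rescaled confidence reaches the (hypothetical) threshold."""
    return [r for r in rows if perc_above_chance(r["probs"]) >= threshold]

# Toy class-probability rows for a four-class model.
rows = [
    {"sentence": "a", "probs": [0.25, 0.25, 0.25, 0.25]},  # exactly at chance
    {"sentence": "b", "probs": [0.70, 0.10, 0.10, 0.10]},  # well above chance
    {"sentence": "c", "probs": [0.40, 0.30, 0.20, 0.10]},  # only slightly above
]
kept = keep_confident(rows)
print([r["sentence"] for r in kept])
```

The log also reports semantic similarity filtering and joint scoring; this sketch covers only the confidence part.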


In [None]:
## (Labeling - II) ##

# This script appends accurately predicted sentences from the predictions file
# to the labeled dataset, based on a provided list of selected sentence hashes.
# It filters these hashes to those present in the predictions, then matches them
# with the top scored sentences and marks them as matched.
# Finally, it concatenates these new matched sentences with the existing labeled data and saves the result.

# After this code was run, the model was trained a second time with the appended labeled dataset.
# Note: Because the hand-labeling changed the data, the corresponding response and adapted hypotheses can be misleading
# in the current state.

#%run label_append_after_pred.py

Appended 86 rows to labeled data and saved to 'Q65_mmr_selected_labeled.csv'.
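
The append step described in this cell can be sketched as follows, using plain lists of dicts in place of the CSV files. The function name and the `embedding_hash` and `sentence` keys are assumptions for illustration; only the `match` column is named in the notes above, and label_append_after_pred.py may differ in detail.

```python
def append_selected(labeled, predictions, top_scored, selected_hashes):
    """Append hand-picked predictions to the labeled rows.

    1. Keep only selected hashes that occur in the predictions.
    2. Look each one up in the top scored sentences.
    3. Mark the row as matched and concatenate with the labeled data.
    """
    predicted = {p["embedding_hash"] for p in predictions}
    wanted = [h for h in selected_hashes if h in predicted]
    scored_by_hash = {s["embedding_hash"]: s for s in top_scored}
    new_rows = [dict(scored_by_hash[h], match=1)
                for h in wanted if h in scored_by_hash]
    return labeled + new_rows

# Toy data: "h3" is selected but was never predicted, so it is skipped.
labeled = [{"embedding_hash": "h0", "sentence": "s0", "match": 1}]
predictions = [{"embedding_hash": "h1"}, {"embedding_hash": "h2"}]
top_scored = [
    {"embedding_hash": "h1", "sentence": "s1"},
    {"embedding_hash": "h2", "sentence": "s2"},
    {"embedding_hash": "h3", "sentence": "s3"},
]
updated = append_selected(labeled, predictions, top_scored, ["h1", "h3"])
print(len(updated), updated[-1])
```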


In [15]:
## (Training - II) ##

# After the Labeling - II code was run, the model was trained a second time with the appended labeled dataset by re-running the Training cell above.

In [5]:
## (Prediction - II) ##

# This code loads UNGA speech sentences and WVS metadata, excluding training data to prevent leakage.
# It encodes the filtered sentences with a SentenceTransformer model, then predicts classes and confidences using a CatBoost classifier.
# Predictions whose confidence exceeds the above-chance threshold are kept, and relevant columns are selected.
# It merges the predictions with metadata from the WVS dataset based on predicted labels.
# Finally, it previews and saves the combined prediction results to a CSV file for further use.

%run predict_catboost_unga_wvs7.py

Loading combined labeled CSV...
Extracting top keyphrases per label...
Loading SBERT model...
Loading unseen data...
Filtered unseen sentences count: 25572
Loading CatBoost model...
Encoding unseen sentences...


Encoding batches: 100%|██████████| 800/800 [00:25<00:00, 31.21it/s]


Predicting classes and probabilities...
Applying semantic similarity filtering per predicted label...
Applying joint scoring and filtering...
Kept 487 predictions after joint scoring filtering.
Saved the results to ../../../output/question_pipeline_output/q65_predictions/q65_predictions_filtered.csv


In [6]:
# Predict labels for UNSG speeches using the trained model
%run predict_unsg_address.py

Loading combined labeled CSV...
Extracting top keyphrases per label...
Loading SBERT model...
Loading unseen data...
Loading CatBoost model...
Encoding unseen sentences...


Encoding batches: 100%|██████████| 21/21 [00:00<00:00, 29.48it/s]


Predicting classes and probabilities...
Applying semantic similarity filtering per predicted label...
Applying joint scoring and filtering...
Kept 27 predictions after joint scoring filtering.
Saved the results to ../../../output/question_pipeline_output/q65_predictions/q65_predictions_unsg.csv


In [8]:
## (Post-Processing) ##

# This code analyzes predicted labels per country-year from the predictions dataset,
# identifies the most frequent label for each country-year,
# filters to keep only country-years present in the WVS dataset,
# and finally displays and saves the summarized results.

%run get_q65_frequencies.py

Unnamed: 0,B_COUNTRY_ALPHA,A_YEAR,most_frequent_label,most_frequent_count
0,ARM,2021,Q65_1,2
1,AUS,2018,Q65_1,1
2,BGD,2018,Q65_3,1
3,BOL,2017,Q65_3,2
4,BRA,2018,Q65_3,1
5,CAN,2020,Q65_3,2
6,CHL,2018,Q65_3,2
7,CHN,2018,Q65_1,1
8,COL,2018,Q65_1,3
9,CYP,2019,Q65_3,1


Saved to ../../../output/question_pipeline_output/q65_output/q65_country_year_top_labels.csv
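
The aggregation described in this cell can be sketched with the standard library alone. The keys `B_COUNTRY_ALPHA` and `A_YEAR` follow the output header above; the `predicted_label` key, the function name, and the toy rows are assumptions for illustration and do not come from get_q65_frequencies.py.

```python
from collections import Counter, defaultdict

def top_label_per_country_year(predictions, wvs_country_years):
    """Count labels per (country, year), keep only pairs present in WVS,
    and return (country, year, most_frequent_label, count) tuples."""
    counts = defaultdict(Counter)
    for p in predictions:
        key = (p["B_COUNTRY_ALPHA"], p["A_YEAR"])
        counts[key][p["predicted_label"]] += 1
    result = []
    for key in sorted(counts):
        if key not in wvs_country_years:
            continue
        label, n = counts[key].most_common(1)[0]
        result.append((key[0], key[1], label, n))
    return result

# Toy rows, not taken from the real predictions file.
predictions = [
    {"B_COUNTRY_ALPHA": "AAA", "A_YEAR": 2021, "predicted_label": "Q65_1"},
    {"B_COUNTRY_ALPHA": "AAA", "A_YEAR": 2021, "predicted_label": "Q65_1"},
    {"B_COUNTRY_ALPHA": "AAA", "A_YEAR": 2021, "predicted_label": "Q65_3"},
    {"B_COUNTRY_ALPHA": "BBB", "A_YEAR": 2019, "predicted_label": "Q65_2"},
]
wvs_pairs = {("AAA", 2021)}
summary = top_label_per_country_year(predictions, wvs_pairs)
print(summary)
```

When two labels tie, `Counter.most_common` returns the one counted first, so a real script may want an explicit tie-breaking rule.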


In [9]:
## (Visualization - I) ##

# This code filters predicted Q65-related labels from the predictions dataset,
# calculates sentence counts and the proportion of the given response per country,
# prepares the data for ordered plotting,
# then creates and displays a scatter plot of the proportion of Q65_1 sentences relative to total Q65 sentences by country.
# Point sizes represent sentence counts; the x-axis shows the countries and the y-axis the proportion of the given response among all counted responses.

%run visualize_preds_props.py
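
The per-country quantities behind this plot (the proportion of Q65_1 sentences and the total sentence count) can be sketched as below. The function name and the input layout of (country, label) pairs are assumptions for illustration; the plotting itself is omitted.

```python
from collections import Counter

def response_proportions(pairs, target="Q65_1"):
    """Map each country to (proportion of target label, total sentence count).

    pairs: iterable of (country, predicted_label) tuples.
    """
    totals, hits = Counter(), Counter()
    for country, label in pairs:
        totals[country] += 1
        if label == target:
            hits[country] += 1
    return {c: (hits[c] / totals[c], totals[c]) for c in sorted(totals)}

# Toy pairs, not taken from the real predictions file.
pairs = [
    ("AAA", "Q65_1"), ("AAA", "Q65_3"),
    ("BBB", "Q65_1"), ("BBB", "Q65_1"), ("BBB", "Q65_1"), ("BBB", "Q65_2"),
]
props = response_proportions(pairs)
print(props)
```

The proportion would go on the y-axis and the count would set the point size.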

In [10]:
## (Visualization - II) ##

# This code loads World Values Survey (WVS) data and filters it by countries found in scored sentences.
# It processes Q65 survey responses to calculate per-country counts and proportions of respondents answering '1'.
# It prepares the country data in a categorical order for consistent plotting.
# Then it creates a scatter plot showing the proportion of Q65=1 responses per country, sized by total responses.
# The plot visually compares response distributions across countries with hover details and a clean layout.
# The x- and y-axis labels are the same as in the prediction visualization.

%run visualize_response_props.py

In [11]:
## (Visualization - III) ##

# This script compares proportions of a specific Q65 survey response between scored sentences and WVS data at the country-year level.
# It aggregates sentence counts and computes proportions per country-year in the scored predictions.
# It filters the WVS data to matching country-year pairs and computes weighted response proportions using survey weights.
# The code then merges both datasets and visualizes their proportions with connecting lines to compare distributions.
# The plot uses country-year labels on the x-axis and proportion values on the y-axis, with vertical lines visually highlighting differences between WVS survey and scored sentence proportions.
# Finally, it calculates and prints the Pearson correlation between the weighted WVS proportions and scored sentence proportions.

%run visualize_prop_diffs.py

Pearson correlation between WVS weighted and scored sentence proportions: 0.0977
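
The two computations named in this cell, a survey-weighted response proportion and a Pearson correlation, can be sketched in plain Python. The function names and toy inputs are assumptions for illustration; visualize_prop_diffs.py may compute them differently, for example with pandas or SciPy.

```python
import math

def weighted_proportion(responses, weights, target=1):
    """Share of total survey weight carried by respondents answering `target`."""
    total = sum(weights)
    hit = sum(w for r, w in zip(responses, weights) if r == target)
    return hit / total

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Toy respondents: answers on the Q65 scale with survey weights.
wp = weighted_proportion([1, 2, 1, 3], [1.0, 1.0, 2.0, 4.0])
print(wp)  # weight on answer 1 is 3.0 out of 8.0

# Toy per-country-year proportions from the two sources.
r = pearson([0.1, 0.2, 0.3], [0.2, 0.4, 0.6])
print(round(r, 6))
```

A coefficient near 0, such as the 0.0977 reported above, indicates almost no linear relationship between the two sets of proportions.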
