# Data Preparation for Model Training

This notebook integrates governance indicators, economic indicators, and sentiment analysis into a unified dataset for machine learning model training. The preparation process addresses a fundamental methodological challenge: avoiding circular reasoning by using different indicator types for labeling versus prediction.

Governance indicators from the World Bank measure institutional quality and are used to create corruption risk labels through a threshold-based system. However, these same indicators are excluded from the predictive feature set to prevent the model from simply memorizing the labeling rule. Instead, economic indicators and sentiment scores serve as predictive features, testing whether these measures can function as leading indicators that deteriorate before governance metrics reflect institutional weaknesses.

The dataset spans 19 countries from 2010-2023 (266 country-year observations), combining quantitative economic measures with qualitative sentiment signals to enable early warning detection of corruption risk environments.


In [89]:
# imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import os

# set working directory to project root
current_dir = os.getcwd()
if current_dir.endswith('notebooks'):
    os.chdir('..')
elif 'notebooks' in current_dir:
    project_root = current_dir.split('notebooks')[0].rstrip('/')
    if os.path.exists(project_root):
        os.chdir(project_root)

print(f"Working directory: {os.getcwd()}")

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 100)


Working directory: /Users/snitil/Documents/innovationshowcase


## Governance Indicators

The six World Bank governance indicators measure institutional quality and serve as the basis for corruption risk labeling. Countries with four or more indicators below their respective thresholds are classified as high-risk. These indicators are retained in the dataset for validation purposes but excluded from the predictive feature set to avoid circular reasoning, where the model would simply learn the labeling rule rather than identifying leading indicators.
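
The labeling rule described above can be sketched as follows. This is a minimal illustration, not the notebook's labeling code: the function name `label_high_risk` and the placeholder cutoff of 0.0 for every indicator are assumptions, since the actual thresholds are not shown in this notebook.

```python
import pandas as pd

GOVERNANCE_COLS = [
    'Voice_Accountability', 'Political_Stability', 'Government_Effectiveness',
    'Regulatory_Quality', 'Rule_of_Law', 'Control_of_Corruption',
]

def label_high_risk(df, thresholds, min_flags=4):
    """Return 1 where at least `min_flags` indicators fall below their thresholds."""
    below = pd.DataFrame({col: df[col] < thresholds[col] for col in GOVERNANCE_COLS})
    return (below.sum(axis=1) >= min_flags).astype(int)

# ASSUMPTION: placeholder cutoffs of 0.0; the real thresholds are not given here
example_thresholds = {col: 0.0 for col in GOVERNANCE_COLS}

toy = pd.DataFrame(
    [[-0.5, -1.0, -0.2, -0.8, 0.3, 0.1],   # four indicators below 0.0
     [0.9, -0.4, 1.2, 1.1, 1.0, 1.3]],     # one indicator below 0.0
    columns=GOVERNANCE_COLS,
)
toy_labels = label_high_risk(toy, example_thresholds)
print(toy_labels.tolist())
```

Because the label is a deterministic function of these six columns, any model given them as inputs could reproduce the rule exactly, which is why they are kept out of the feature set.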


In [90]:
# load main dataset with governance and economic indicators
main_df = pd.read_csv('data/processed/corruption_data_expanded_labeled.csv')

# identify governance indicators
governance_cols = [
    'Voice_Accountability', 'Political_Stability', 'Government_Effectiveness',
    'Regulatory_Quality', 'Rule_of_Law', 'Control_of_Corruption'
]

print(f"Dataset: {len(main_df)} country-years, {main_df['Country'].nunique()} countries ({main_df['Year'].min()}-{main_df['Year'].max()})")
print(f"Governance indicators: {len(governance_cols)} (complete coverage)")


Dataset: 266 country-years, 19 countries (2010-2023)
Governance indicators: 6 (complete coverage)


In [91]:
# class balance of the governance-derived risk labels
print("Risk distribution:")
print(f"  High-risk: {(main_df['corruption_risk'] == 1).sum()} ({(main_df['corruption_risk'] == 1).mean():.1%})")
print(f"  Low-risk: {(main_df['corruption_risk'] == 0).sum()} ({(main_df['corruption_risk'] == 0).mean():.1%})")


Risk distribution:
  High-risk: 154 (57.9%)
  Low-risk: 112 (42.1%)


In [92]:
# summary statistics for governance indicators across all country-years
# standardized scores: positive = better governance, negative = worse governance
gov_stats = main_df[governance_cols].describe().T[['mean', 'std', 'min', 'max']]
print("Governance indicators summary:")
print(gov_stats.round(3))


Governance indicators summary:
                           mean    std    min    max
Voice_Accountability      0.353  1.034 -1.618  1.781
Political_Stability      -0.011  1.111 -2.609  1.599
Government_Effectiveness  0.475  1.274 -1.889  2.317
Regulatory_Quality        0.427  1.305 -2.387  2.309
Rule_of_Law               0.319  1.385 -2.332  2.101
Control_of_Corruption     0.426  1.433 -1.690  2.403


## Economic Indicators

Five economic indicators serve as primary predictive features, testing whether economic conditions function as leading indicators that deteriorate before governance metrics reflect institutional weaknesses. Missing values (about 27% of economic data points) are imputed in two steps: forward-fill within each country, then the pooled column median for any gaps that forward-fill cannot reach, such as missing values at the start of a country's series.


In [93]:
# identify economic indicators
economic_cols = [
    'GDP_Growth_annual_perc',
    'External_Debt_perc_GNI',
    'Govt_Expenditure_perc_GDP',
    'FDI_Inflows_perc_GDP',
    'Poverty_Headcount_Ratio'
]

# check missing values before imputation
missing_econ = main_df[economic_cols].isnull().sum()
print(f"Economic indicators: {len(economic_cols)}")
print(f"Missing values: {missing_econ.sum()} ({missing_econ.sum() / (len(main_df) * len(economic_cols)) * 100:.1f}%)")


Economic indicators: 5
Missing values: 358 (26.9%)


In [94]:
# create working copy and handle missing values
merged_df = main_df.copy()
merged_df = merged_df.sort_values(['Country', 'Year'])

# forward-fill within countries, then median imputation for remaining gaps
for col in economic_cols:
    merged_df[col] = merged_df.groupby('Country')[col].ffill()
    if merged_df[col].isna().any():
        merged_df[col] = merged_df[col].fillna(merged_df[col].median())

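# confirm no gaps remain before reporting completion
assert not merged_df[economic_cols].isna().any().any(), "economic indicators still contain missing values"
print("Imputation complete: all economic indicators have complete coverage")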


Imputation complete: all economic indicators have complete coverage
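
To make the two imputation steps concrete, here is a small sketch on a toy frame (the values are invented for illustration). It shows which gaps forward-fill resolves and which fall through to the pooled median.

```python
import numpy as np
import pandas as pd

col = 'GDP_Growth_annual_perc'
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2010, 2011, 2012, 2010, 2011, 2012],
    col: [np.nan, 2.0, np.nan, 4.0, np.nan, 6.0],   # illustrative values only
}).sort_values(['Country', 'Year'])

# step 1: carry the last observed value forward within each country
after_ffill = toy.groupby('Country')[col].ffill()

# step 2: gaps with no earlier observation (country A in 2010) get the pooled median
filled = after_ffill.fillna(after_ffill.median())

print(after_ffill.tolist())
print(filled.tolist())
```

Country A's first year has nothing to carry forward, so it receives the median computed across both countries rather than a value from its own series.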


In [95]:
# economic indicators summary statistics
econ_stats = merged_df[economic_cols].describe().T[['mean', 'std', 'min', 'max']]
print("Economic indicators summary:")
print(econ_stats.round(3))



Economic indicators summary:
                             mean     std     min      max
GDP_Growth_annual_perc      2.612   4.514 -28.759   19.675
External_Debt_perc_GNI     52.581  61.909   7.955  420.572
Govt_Expenditure_perc_GDP  25.172   8.910   9.783   65.560
FDI_Inflows_perc_GDP        3.743   8.296 -32.547   38.943
Poverty_Headcount_Ratio     9.750  18.717   0.000   81.600


## Sentiment Analysis

Sentiment scores from corruption-related news articles provide a qualitative early warning signal, testing whether shifts in public sentiment reflected in media coverage can serve as leading indicators that deteriorate before governance metrics. As validated in the previous analysis, sentiment captures both corruption visibility and transparency patterns, where countries with free press show more negative sentiment (corruption gets exposed) while countries with media suppression show less negative sentiment (corruption is hidden).


In [96]:
# load sentiment scores
sentiment_df = pd.read_csv('data/sentiment/sentiment_scores.csv')

print(f"Sentiment data: {len(sentiment_df)} country-years ({len(sentiment_df) / len(main_df) * 100:.1f}% coverage)")
print(f"Missing sentiment: {len(main_df) - len(sentiment_df)} country-years (filled with neutral 0.0)")


Sentiment data: 234 country-years (88.0% coverage)
Missing sentiment: 32 country-years (filled with neutral 0.0)


In [97]:
# merge sentiment scores with main dataset
merged_df = merged_df.merge(
    sentiment_df,
    left_on=['Country', 'Year'],
    right_on=['country', 'year'],
    how='left'
)

# drop duplicate columns and fill missing sentiment with neutral (0.0)
merged_df = merged_df.drop(columns=['country', 'year'], errors='ignore')
merged_df['sentiment_score'] = merged_df['sentiment_score'].fillna(0.0)
merged_df['article_count'] = merged_df['article_count'].fillna(0.0)

print(f"Merged dataset: {len(merged_df)} records")


Merged dataset: 266 records
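
A left merge keeps the row count stable only if the join keys are unique on both sides. The sketch below, on toy frames with invented values, uses the `validate` and `indicator` parameters of `DataFrame.merge` to check this and to count unmatched country-years directly instead of inferring them from row counts.

```python
import pandas as pd

main_toy = pd.DataFrame({'Country': ['A', 'A', 'B'], 'Year': [2010, 2011, 2010]})
sentiment_toy = pd.DataFrame({
    'country': ['A', 'B'], 'year': [2010, 2010],
    'sentiment_score': [-0.2, -0.1], 'article_count': [12, 5],   # illustrative values
})

checked = main_toy.merge(
    sentiment_toy,
    left_on=['Country', 'Year'],
    right_on=['country', 'year'],
    how='left',
    validate='one_to_one',   # raises MergeError if keys repeat in either frame
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# country-years with no sentiment row, i.e. those that would be filled with 0.0
n_unmatched = int((checked['_merge'] == 'left_only').sum())
print(f"{len(checked)} rows, {n_unmatched} without sentiment")
```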


### Sentiment and Transparency Relationship

Sentiment and transparency exhibit a complementary relationship: higher transparency (free press) enables corruption exposure, resulting in more negative sentiment in news coverage, while lower transparency (media suppression) conceals corruption, resulting in less negative sentiment. This pattern indicates that negative sentiment reflects both corruption severity and corruption visibility. Countries with higher Voice_Accountability scores can expose corruption more effectively, leading to more negative sentiment that complements quantitative governance indicators.


In [98]:
# sentiment summary by risk category
# both categories show negative sentiment (corruption news is inherently negative)
low_risk_sentiment = merged_df[merged_df['corruption_risk'] == 0]['sentiment_score'].mean()
high_risk_sentiment = merged_df[merged_df['corruption_risk'] == 1]['sentiment_score'].mean()

print(f"Sentiment by risk category:")
print(f"  Low-risk: {low_risk_sentiment:.4f}")
print(f"  High-risk: {high_risk_sentiment:.4f}")
print(f"  Overall mean: {merged_df['sentiment_score'].mean():.4f}")


Sentiment by risk category:
  Low-risk: -0.1004
  High-risk: -0.0694
  Overall mean: -0.0825
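
The means above include the country-years whose sentiment was filled with 0.0, which pulls each class average toward neutral. A possible companion check, sketched here on toy data, recomputes the means on observed rows only. The helper name `sentiment_by_risk` is hypothetical, and treating `article_count == 0` as the marker for filled rows is an assumption.

```python
import pandas as pd

def sentiment_by_risk(df, observed_only=True):
    """Mean sentiment per risk class, optionally skipping rows with no articles."""
    if observed_only:
        # ASSUMPTION: article_count == 0 marks country-years filled with neutral 0.0
        df = df[df['article_count'] > 0]
    return df.groupby('corruption_risk')['sentiment_score'].mean()

toy = pd.DataFrame({
    'corruption_risk': [0, 0, 1, 1, 1],
    'sentiment_score': [-0.10, -0.20, -0.30, 0.0, 0.0],   # illustrative values
    'article_count': [8, 4, 6, 0, 0],
})

with_fill = sentiment_by_risk(toy, observed_only=False)
observed = sentiment_by_risk(toy, observed_only=True)
print(with_fill.round(3).to_dict())
print(observed.round(3).to_dict())
```

If missing coverage is concentrated in one risk class, the two versions can diverge noticeably, which matters when interpreting the gap between low-risk and high-risk sentiment.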


## Feature Definition

The predictive feature set consists of six features: five economic indicators and one sentiment score. Governance indicators are excluded from the predictive feature set to avoid circular reasoning, where the model would learn the labeling rule rather than identify leading indicators.


### Feature Matrix and Target Variable

The **feature matrix (X)** contains the input data that the model will use to make predictions. Think of it as a table where each row represents one country-year observation (e.g., "Malaysia in 2013"), and each column represents one predictive feature (e.g., GDP growth, poverty rate, sentiment score). The model learns patterns from these features to predict corruption risk.

The **target variable (y)** contains the labels the model is trained to predict: whether each country-year is classified as high-risk (1) or low-risk (0) for corruption. The target variable is created using governance indicators, but the model only sees the feature matrix (economic and sentiment indicators) when making predictions. This tests whether economic and sentiment indicators can predict corruption risk before governance metrics reflect institutional weaknesses.


In [99]:
# define feature sets
governance_features = [
    'Voice_Accountability', 'Political_Stability', 'Government_Effectiveness',
    'Regulatory_Quality', 'Rule_of_Law', 'Control_of_Corruption'
]

economic_features = [
    'GDP_Growth_annual_perc', 'External_Debt_perc_GNI', 'Govt_Expenditure_perc_GDP',
    'FDI_Inflows_perc_GDP', 'Poverty_Headcount_Ratio'
]

sentiment_features = ['sentiment_score']
predictive_features = economic_features + sentiment_features

print(f"Predictive features: {len(predictive_features)} (economic: {len(economic_features)}, sentiment: {len(sentiment_features)})")
print(f"Validation features: {len(governance_features)} (governance indicators, not used in model)")


Predictive features: 6 (economic: 5, sentiment: 1)
Validation features: 6 (governance indicators, not used in model)
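
Since the separation between label-defining and predictive columns is the core methodological safeguard, it may be worth enforcing it in code. A minimal sketch, using the feature lists defined above and a hypothetical helper `check_no_label_leakage`:

```python
governance_features = [
    'Voice_Accountability', 'Political_Stability', 'Government_Effectiveness',
    'Regulatory_Quality', 'Rule_of_Law', 'Control_of_Corruption',
]
predictive_features = [
    'GDP_Growth_annual_perc', 'External_Debt_perc_GNI', 'Govt_Expenditure_perc_GDP',
    'FDI_Inflows_perc_GDP', 'Poverty_Headcount_Ratio', 'sentiment_score',
]

def check_no_label_leakage(predictive, label_sources):
    """Raise if any column used to build the labels also appears as a feature."""
    overlap = sorted(set(predictive) & set(label_sources))
    if overlap:
        raise ValueError(f"label-defining columns in feature set: {overlap}")
    return True

check_no_label_leakage(predictive_features, governance_features)
```

A guard like this fails loudly if a governance column is later added to the feature list by mistake.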


In [100]:
# extract feature matrix and target variable
X = merged_df[predictive_features].copy()
y = merged_df['corruption_risk'].copy()

print(f"Feature matrix: {X.shape}")
print(f"Target distribution: {(y == 0).sum()} low-risk ({(y == 0).mean():.1%}), {(y == 1).sum()} high-risk ({(y == 1).mean():.1%})")


Feature matrix: (266, 6)
Target distribution: 112 low-risk (42.1%), 154 high-risk (57.9%)


In [101]:
# data quality verification
year_numeric = pd.to_numeric(merged_df['Year'], errors='coerce')
print(f"Data quality check:")
print(f"  No duplicates: {not merged_df.duplicated(subset=['Country', 'Year']).any()}")
print(f"  Year range: {int(year_numeric.min())}-{int(year_numeric.max())}")
print(f"  Countries: {merged_df['Country'].nunique()}")
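# check for gaps explicitly rather than asserting it in the message only
assert not X.isna().any().any() and not y.isna().any(), "missing values in features or target"
print("  No missing values in features or target")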


Data quality check:
  No duplicates: True
  Year range: 2010-2023
  Countries: 19
  No missing values in features or target


## Train-Test Split

A stratified 80/20 train-test split preserves the class proportions (about 58% high-risk) in both sets, so the model is trained and evaluated on representative shares of high-risk and low-risk cases.


In [102]:
# create stratified train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Train set: {X_train.shape[0]} samples ({(y_train == 1).mean():.1%} high-risk)")
print(f"Test set: {X_test.shape[0]} samples ({(y_test == 1).mean():.1%} high-risk)")


Train set: 212 samples (58.0% high-risk)
Test set: 54 samples (57.4% high-risk)
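
The split sizes and class shares can be reproduced without the real data. The sketch below uses synthetic labels with the same class counts as the target (154 high-risk, 112 low-risk) and a placeholder feature column. With `test_size=0.2`, the test set is rounded up from 53.2 to 54 rows, leaving 212 for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic labels matching the target's class counts; features are placeholders
y_demo = np.array([1] * 154 + [0] * 112)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

overall_rate = y_demo.mean()
train_rate, test_rate = y_tr.mean(), y_te.mean()
print(f"sizes: {len(y_tr)}/{len(y_te)}")
print(f"high-risk share: overall {overall_rate:.1%}, train {train_rate:.1%}, test {test_rate:.1%}")
```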


In [103]:
# save prepared datasets
os.makedirs('data/processed', exist_ok=True)
os.makedirs('models', exist_ok=True)

merged_df.to_csv('data/processed/final_training_data.csv', index=False)
pd.concat([X_train, y_train], axis=1).to_csv('data/processed/train_set.csv', index=False)
pd.concat([X_test, y_test], axis=1).to_csv('data/processed/test_set.csv', index=False)

with open('models/feature_names.txt', 'w') as f:
    f.write('\n'.join(predictive_features))

print("Saved datasets and feature names")


Saved datasets and feature names
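
A downstream notebook needs to read these files back into a feature matrix and target. One way to do that is sketched below. The helper `load_split` is hypothetical, and it assumes the target column is named `corruption_risk`, as in the saved CSVs above.

```python
import pandas as pd

def load_split(csv_path, feature_names_path, target='corruption_risk'):
    """Load a saved split and return (X, y) using the stored feature order."""
    with open(feature_names_path) as f:
        features = [line.strip() for line in f if line.strip()]
    df = pd.read_csv(csv_path)
    missing = [c for c in features + [target] if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found in {csv_path}: {missing}")
    return df[features], df[target]
```

For example, `X_train, y_train = load_split('data/processed/train_set.csv', 'models/feature_names.txt')` would restore the training set with columns in the order written to `feature_names.txt`.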


## Summary

The prepared dataset integrates governance indicators, economic indicators, and sentiment scores into a unified structure for model training. The final dataset contains 266 country-year observations across 19 countries from 2010-2023, with complete coverage for all features after imputation of missing economic data.

**Methodological approach**: Governance indicators determine the target variable (corruption risk labels) but are excluded from the predictive feature set to avoid circular reasoning. Instead, five economic indicators and one sentiment score serve as predictive features, testing whether these measures can function as leading indicators that deteriorate before governance metrics reflect institutional weaknesses.

**Data quality**: Missing economic values have been imputed, missing sentiment scores filled with a neutral 0.0, and the dataset split into training (212 samples) and testing (54 samples) sets with class proportions preserved. The case study check in the final cell shows that the country-years covering documented corruption scandals (Malaysia 1MDB, Mozambique hidden debt) carry high-risk labels, with mean sentiment of -0.1772 for Malaysia and a near-neutral 0.0030 for Mozambique, while Canada serves as a low-risk control.


In [104]:
# final dataset summary
print(f"Final dataset: {len(merged_df)} country-years, {merged_df['Country'].nunique()} countries ({int(year_numeric.min())}-{int(year_numeric.max())})")
print(f"Predictive features: {len(predictive_features)} | Validation features: {len(governance_features)}")
print(f"Train/test: {len(X_train)}/{len(X_test)} samples")


Final dataset: 266 country-years, 19 countries (2010-2023)
Predictive features: 6 | Validation features: 6
Train/test: 212/54 samples


In [105]:
# case study validation
malaysia_scandal = merged_df[(merged_df['Country'] == 'Malaysia') & (merged_df['Year'].between(2013, 2015))]
mozambique_scandal = merged_df[(merged_df['Country'] == 'Mozambique') & (merged_df['Year'].between(2013, 2016))]
canada = merged_df[merged_df['Country'] == 'Canada']

print("Case study validation:")
print(f"  Malaysia 1MDB (2013-2015): risk={malaysia_scandal['corruption_risk'].unique()[0]}, sentiment={malaysia_scandal['sentiment_score'].mean():.4f}")
print(f"  Mozambique hidden debt (2013-2016): risk={mozambique_scandal['corruption_risk'].unique()[0]}, sentiment={mozambique_scandal['sentiment_score'].mean():.4f}")
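# verify the control claim instead of hardcoding it in the message
assert canada['corruption_risk'].eq(0).all(), "Canada has at least one high-risk year"
print(f"  Canada (control): risk={canada['corruption_risk'].unique()[0]}, all years low-risk")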


Case study validation:
  Malaysia 1MDB (2013-2015): risk=1, sentiment=-0.1772
  Mozambique hidden debt (2013-2016): risk=1, sentiment=0.0030
  Canada (control): risk=0, all years low-risk
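
Note that `unique()[0]` reports only the first label found in each subset, so a period with mixed labels would still print a single value. A small helper that returns every distinct label makes the check stricter. This is a sketch: the name `risk_labels` is hypothetical and the toy rows are invented.

```python
import pandas as pd

def risk_labels(df, country, start=None, end=None):
    """Return the sorted distinct risk labels for a country, optionally within a year range."""
    subset = df[df['Country'] == country]
    if start is not None and end is not None:
        subset = subset[subset['Year'].between(start, end)]
    return sorted(subset['corruption_risk'].unique().tolist())

toy = pd.DataFrame({
    'Country': ['X', 'X', 'X', 'Y', 'Y'],
    'Year': [2013, 2014, 2015, 2013, 2014],
    'corruption_risk': [1, 1, 0, 0, 0],   # illustrative labels only
})

print(risk_labels(toy, 'X', 2013, 2014))   # single label within the window
print(risk_labels(toy, 'X'))               # mixed labels across all years
print(risk_labels(toy, 'Y'))
```

Applied to `merged_df`, a result with more than one element for a case study window would indicate that the label changed during the scandal period.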
