# World Bank API Exploration
## Data Collection and Baseline Setup

Corruption thrives where accountability is limited, enforcement is weak, and governance institutions are fragile. The World Bank's Worldwide Governance Indicators (WGI) provide standardized measures of six dimensions that capture these structural vulnerabilities: voice and accountability, political stability, government effectiveness, regulatory quality, rule of law, and control of corruption.

This notebook establishes the foundational dataset by collecting these governance indicators alongside complementary economic metrics for three countries: Malaysia (1MDB scandal, 2015), Mozambique (hidden debt crisis, 2013-2016), and Canada (control country with strong governance institutions). The 2010-2023 timeframe captures the periods before, during, and after the documented corruption cases, enabling analysis of how governance indicators signal corruption risk.

In [1]:
import wbdata
import pandas as pd
import datetime
import os

## Query Parameters

In [None]:
# world bank country codes for case study countries
countries = ["CAN", "MYS", "MOZ"]

# date range covering pre-scandal, during, and post-scandal periods
# 2024 excluded due to incomplete governance data availability
data_range = (datetime.datetime(2010, 1, 1), datetime.datetime(2023, 12, 31))

In [None]:
# world bank indicator codes mapped to descriptive names
# governance indicators measure structural weaknesses that enable corruption
# economic indicators provide context for financial vulnerability patterns

indicators = {
    # six worldwide governance indicators (wgi)
    # these capture institutional quality and accountability mechanisms
    'VA.EST': 'Voice_Accountability',  # citizen participation and freedom of expression
    'PV.EST': 'Political_Stability',  # likelihood of political instability or violence
    'GE.EST': 'Government_Effectiveness',  # quality of public services and policy implementation
    'RQ.EST': 'Regulatory_Quality',  # ability to formulate and implement sound policies
    'RL.EST': 'Rule_of_Law',  # extent to which agents have confidence in and abide by rules
    'CC.EST': 'Control_of_Corruption',  # perceived extent to which public power is exercised for private gain (higher = better control)
    
    # economic indicators for contextual analysis
    # these help identify financial stress patterns associated with corruption
    'DT.DOD.DECT.GN.ZS': 'External_Debt_perc_GNI',  # external debt as percentage of gni
    'NY.GDP.MKTP.KD.ZG': 'GDP_Growth_annual_perc',  # annual gdp growth rate
    'GC.XPN.TOTL.GD.ZS': 'Govt_Expenditure_perc_GDP',  # government spending as percentage of gdp
    'BX.KLT.DINV.WD.GD.ZS': 'FDI_Inflows_perc_GDP',  # foreign direct investment inflows
    'SI.POV.DDAY': 'Poverty_Headcount_Ratio'  # poverty headcount ratio at $2.15 per day
}
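
Later cells treat the governance and economic indicators as separate groups. The following is a minimal sketch of one way to derive both groups from the mapping itself, assuming every WGI estimate code ends in `.EST` as the six codes above do. `split_indicators` is a hypothetical helper, not part of `wbdata`, and only a subset of the mapping is repeated here so the block runs on its own.

```python
def split_indicators(indicators):
    """Split a code -> name mapping into (governance, economic) name lists.

    Assumption: governance (WGI) codes are exactly those ending in '.EST'.
    """
    governance = [name for code, name in indicators.items() if code.endswith('.EST')]
    economic = [name for code, name in indicators.items() if not code.endswith('.EST')]
    return governance, economic


# subset of the indicator mapping defined above
sample = {
    'VA.EST': 'Voice_Accountability',
    'CC.EST': 'Control_of_Corruption',
    'NY.GDP.MKTP.KD.ZG': 'GDP_Growth_annual_perc',
    'SI.POV.DDAY': 'Poverty_Headcount_Ratio',
}

governance_names, economic_names = split_indicators(sample)
print(governance_names)
print(economic_names)
```

Deriving the lists from the mapping avoids retyping column names in later cells, at the cost of relying on the `.EST` naming pattern.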

## Data Retrieval

In [None]:
# retrieve all indicators for specified countries and date range
# parse_dates=False preserves year values as strings for consistent formatting
df = wbdata.get_dataframe(indicators, 
                          country=countries, 
                          date=data_range,
                          parse_dates=False)


## Data Cleaning and Formatting

In [None]:
# convert index columns to regular columns for analysis
df = df.reset_index()
df = df.rename(columns={'date': 'Year', 'country': 'Country'})

# reorder columns with country and year first, followed by indicators
column_order = ['Country', 'Year'] + list(indicators.values())
existing_columns = [col for col in column_order if col in df.columns]
df = df[existing_columns]

# sort chronologically by country and year for time series analysis
df = df.sort_values(by=['Country', 'Year']).reset_index(drop=True)
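
The cleaning steps can be tried without an API call. The sketch below uses a small stand-in frame, assuming the retrieved frame is indexed by `country` and `date` (as the rename above implies) with years stored as strings. The four values are copied from outputs shown later in this notebook, and the `Year_int` column is an optional addition, not something the original cell creates.

```python
import pandas as pd

# stand-in for the retrieved frame: (country, date) MultiIndex, string years,
# deliberately unsorted so the effect of sort_values is visible
index = pd.MultiIndex.from_tuples(
    [('Malaysia', '2018'), ('Malaysia', '2013'), ('Canada', '2018'), ('Canada', '2013')],
    names=['country', 'date'],
)
raw = pd.DataFrame(
    {'Voice_Accountability': [-0.099501, -0.339791, 1.502411, 1.453440]},
    index=index,
)

# same steps as the cleaning cell: flatten the index, rename, sort
clean = raw.reset_index().rename(columns={'date': 'Year', 'country': 'Country'})
clean = clean.sort_values(by=['Country', 'Year']).reset_index(drop=True)

# optional: numeric year for arithmetic such as year-over-year differences
clean['Year_int'] = clean['Year'].astype(int)
print(clean)
```

Sorting string years works here because every year has four digits. The integer column is only needed where years are used in calculations.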

## Data Quality Assessment

In [None]:
# dataset dimensions and temporal coverage
print(f"dataset shape: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"temporal coverage: {df['Year'].min()} to {df['Year'].max()}")
print(f"\nvariables collected:")
print(df.columns.tolist())

dataset shape: 42 rows, 13 columns
temporal coverage: 2010 to 2023

variables collected:
['Country', 'Year', 'Voice_Accountability', 'Political_Stability', 'Government_Effectiveness', 'Regulatory_Quality', 'Rule_of_Law', 'Control_of_Corruption', 'External_Debt_perc_GNI', 'GDP_Growth_annual_perc', 'Govt_Expenditure_perc_GDP', 'FDI_Inflows_perc_GDP', 'Poverty_Headcount_Ratio']


In [None]:
# assess data completeness across all indicators
# governance indicators should have complete coverage for all countries
print("missing values per variable:")
print(df.isnull().sum())
print("\nmissing data percentage:")
print((df.isnull().mean() * 100).round(2))

missing values per variable:
Country                       0
Year                          0
Voice_Accountability          0
Political_Stability           0
Government_Effectiveness      0
Regulatory_Quality            0
Rule_of_Law                   0
Control_of_Corruption         0
External_Debt_perc_GNI       28
GDP_Growth_annual_perc        0
Govt_Expenditure_perc_GDP     4
FDI_Inflows_perc_GDP          0
Poverty_Headcount_Ratio      22
dtype: int64

missing data percentage:
Country                       0.00
Year                          0.00
Voice_Accountability          0.00
Political_Stability           0.00
Government_Effectiveness      0.00
Regulatory_Quality            0.00
Rule_of_Law                   0.00
Control_of_Corruption         0.00
External_Debt_perc_GNI       66.67
GDP_Growth_annual_perc        0.00
Govt_Expenditure_perc_GDP     9.52
FDI_Inflows_perc_GDP          0.00
Poverty_Headcount_Ratio      52.38
dtype: float64
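
The totals above do not show which countries account for the gaps. A minimal sketch of a per-country breakdown follows, using a toy frame so it runs on its own. The Mozambique values are placeholders and only the pattern of missing entries matters; it is not a statement about which countries are actually missing data.

```python
import numpy as np
import pandas as pd

# toy frame: placeholder values, NaNs placed by hand for illustration
toy = pd.DataFrame({
    'Country': ['Canada', 'Canada', 'Mozambique', 'Mozambique'],
    'Year': ['2010', '2011', '2010', '2011'],
    'External_Debt_perc_GNI': [np.nan, np.nan, 1.0, 2.0],
    'Poverty_Headcount_Ratio': [0.2, 0.2, np.nan, 3.0],
})

# count missing values per indicator within each country
value_cols = toy.columns.difference(['Country', 'Year'])
missing_by_country = toy[value_cols].isnull().groupby(toy['Country']).sum()
print(missing_by_country)
```

Applied to `df`, the same two lines would show whether a sparse indicator is missing evenly across countries or entirely for some of them.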


first 15 rows:


index,Country,Year,Voice_Accountability,Political_Stability,Government_Effectiveness,Regulatory_Quality,Rule_of_Law,Control_of_Corruption,External_Debt_perc_GNI,GDP_Growth_annual_perc,Govt_Expenditure_perc_GDP,FDI_Inflows_perc_GDP,Poverty_Headcount_Ratio
0,Canada,2010,1.352659,0.936318,1.777827,1.69343,1.79859,2.061873,,3.090806,19.084707,1.837256,0.2
1,Canada,2011,1.380145,1.077176,1.772545,1.68484,1.72712,1.971133,,3.137194,17.850268,2.137833,0.2
2,Canada,2012,1.437505,1.113016,1.75697,1.707195,1.756421,1.918904,,1.755661,17.51752,2.700169,0.2
3,Canada,2013,1.45344,1.061422,1.780741,1.729891,1.747508,1.879378,,2.325814,17.084882,3.629804,0.5
4,Canada,2014,1.412332,1.175504,1.753718,1.838725,1.886297,1.832193,,2.873467,16.40205,3.553903,0.2
5,Canada,2015,1.467299,1.262337,1.730935,1.706058,1.807141,1.84565,,0.649971,17.059779,3.853895,0.5
6,Canada,2016,1.445611,1.240412,1.744541,1.727414,1.800915,1.944466,,1.038551,17.498604,2.23835,0.5
7,Canada,2017,1.478084,1.089681,1.815573,1.879656,1.763439,1.881446,,3.033835,17.606595,1.537521,0.5
8,Canada,2018,1.502411,0.963971,1.675134,1.69942,1.715142,1.790208,,2.742963,17.5409,2.469312,0.2
9,Canada,2019,1.430308,0.994934,1.697311,1.710002,1.719776,1.729897,,1.908432,18.10571,2.806767,0.2


## Data Export

In [None]:
# export cleaned dataset for downstream analysis
os.makedirs('../data/raw', exist_ok=True)
output_path = '../data/raw/corruption_data_baseline.csv'
df.to_csv(output_path, index=False)
print(f"dataset exported to: {output_path}")

dataset exported to: ../data/raw/corruption_data_baseline.csv
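
Downstream notebooks that reload this file should note that `pd.read_csv` infers `Year` as an integer column by default, so string filters like `isin(['2013'])` would match nothing. The sketch below shows the round trip on a small sample written to a temporary directory. The file name `roundtrip.csv` is illustrative only.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({
    'Country': ['Canada', 'Canada'],
    'Year': ['2013', '2018'],  # string years, as in the cleaned dataset
    'Voice_Accountability': [1.453440, 1.502411],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'roundtrip.csv')
    sample.to_csv(path, index=False)
    default_read = pd.read_csv(path)                      # Year inferred as integer
    string_read = pd.read_csv(path, dtype={'Year': str})  # Year kept as string

print(default_read['Year'].dtype)
print(string_read['Year'].tolist())
print(default_read['Year'].isin(['2013']).sum())  # string filter on integer years
```

Passing `dtype={'Year': str}` on reload keeps the column consistent with the string years used in this notebook.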


## Baseline Governance Comparison

The six Worldwide Governance Indicators provide standardized measures of institutional quality. Each estimate ranges from roughly -2.5 to 2.5, with higher values indicating better governance. Lower scores across these dimensions indicate structural weaknesses that create environments where corruption can thrive: limited accountability, weak enforcement systems, and poor transparency. This analysis compares average governance scores across the case study countries to establish baseline differences between high-risk environments (Malaysia, Mozambique) and the control country (Canada).

In [None]:
# extract governance indicators for comparative analysis
governance_cols = ['Country', 'Year', 'Voice_Accountability', 'Political_Stability', 
                   'Government_Effectiveness', 'Regulatory_Quality', 'Rule_of_Law', 
                   'Control_of_Corruption']

gov_df = df[governance_cols]

# calculate mean governance scores by country across study period
print("average governance scores by country (2010-2023):")
print(gov_df.groupby('Country')[governance_cols[2:]].mean().round(2))

average governance scores by country (2010-2023):
            Voice_Accountability  Political_Stability  \
Country                                                 
Canada                      1.44                 1.04   
Malaysia                   -0.26                 0.14   
Mozambique                 -0.40                -0.62   

            Government_Effectiveness  Regulatory_Quality  Rule_of_Law  \
Country                                                                 
Canada                          1.70                1.71         1.71   
Malaysia                        0.96                0.63         0.46   
Mozambique                     -0.78               -0.64        -0.90   

            Control_of_Corruption  
Country                            
Canada                       1.81  
Malaysia                     0.19  
Mozambique                  -0.73  
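
One way to quantify the baseline difference is to express each country's mean as a gap relative to the control country. The sketch below copies two of the six columns from the table above so it runs on its own; `gap_vs_control` is a name introduced here for illustration.

```python
import pandas as pd

# country means copied from the output above (two of the six indicators)
means = pd.DataFrame(
    {
        'Rule_of_Law': [1.71, 0.46, -0.90],
        'Control_of_Corruption': [1.81, 0.19, -0.73],
    },
    index=pd.Index(['Canada', 'Malaysia', 'Mozambique'], name='Country'),
)

# subtract the control country's row from every row, column by column
gap_vs_control = means.sub(means.loc['Canada'], axis='columns').round(2)
print(gap_vs_control)
```

Negative values indicate weaker scores than Canada. Because the inputs are already rounded to two decimals, the gaps carry that rounding as well.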


In [None]:
# examine governance scores at critical time points
# 2013: baseline before either scandal became public
# 2018: post-scandal period capturing institutional response
# 2023: most recent data showing long-term governance trajectory
key_years = ['2013', '2018', '2023']
print("governance scores at key time points:")
print(gov_df[gov_df['Year'].isin(key_years)].sort_values(['Year', 'Country']))

governance scores for key years (matching table 1):
       Country  Year  Voice_Accountability  Political_Stability  \
3       Canada  2013              1.453440             1.061422   
17    Malaysia  2013             -0.339791             0.051792   
31  Mozambique  2013             -0.256402            -0.226966   
8       Canada  2018              1.502411             0.963971   
22    Malaysia  2018             -0.099501             0.248114   
36  Mozambique  2018             -0.484721            -0.833230   
13      Canada  2023              1.479646             0.822421   
27    Malaysia  2023              0.087619             0.168515   
41  Mozambique  2023             -0.593393            -1.268691   

    Government_Effectiveness  Regulatory_Quality  Rule_of_Law  \
3                   1.780741            1.729891     1.747508   
17                  0.993432            0.567808     0.341338   
31                 -0.635946           -0.417519    -0.820203   
8                