# Economics Countries Master Dataset

**Purpose**: Create a consolidated country-year economic structure dataset for merging with conflict data.

**Output**: `economics-countries-master.csv`

**Coverage**: 
- 220 countries
- 1970-2023 (54 years)
- Sector percentages: Primary, Secondary, Tertiary
- Tourism % (2008-2023)
- GDP in USD and Population

**Sources**:
1. World Bank GDP sectoral breakdown (sectoral composition)
2. UN Tourism SDG 8.9.1 (tourism %)
3. World Bank Development Indicators (GDP in USD and population)

## Setup: Import Libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("="*80)
print("ECONOMICS COUNTRIES MASTER DATASET GENERATOR")
print("="*80)
print("\nLibraries loaded successfully")

ECONOMICS COUNTRIES MASTER DATASET GENERATOR

Libraries loaded successfully


## Step 1: Load World Bank GDP Sectoral Data

Load the GDP breakdown by sector from a CSV file.

In [2]:
print("[1/8] Loading World Bank GDP sectoral data...\n")

# Load World Bank GDP sectoral data from CSV
df_gdp_sectoral = pd.read_csv(
    '../raw-data/World_Bank/Download-GDPcurrent-NCU-countries.csv'
)

print(f"Loaded {len(df_gdp_sectoral):,} rows")
print(f"Covering {df_gdp_sectoral['Country'].nunique()} countries")
print(f"\nIndicators found: {len(df_gdp_sectoral['IndicatorName'].unique())}")

# Display sample indicators
print("\nSample indicators:")
for ind in list(df_gdp_sectoral['IndicatorName'].unique())[:5]:
    # Count distinct countries rather than rows, so the label stays accurate
    count = df_gdp_sectoral.loc[df_gdp_sectoral['IndicatorName'] == ind, 'Country'].nunique()
    print(f"  - {ind}: {count} countries")

df_gdp_sectoral.head()

[1/8] Loading World Bank GDP sectoral data...

Loaded 3,714 rows
Covering 220 countries

Indicators found: 17

Sample indicators:
  - Final consumption expenditure: 220 countries
  - Household consumption expenditure (including Non-profit institutions serving households): 219 countries
  - General government final consumption expenditure: 219 countries
  - Gross capital formation: 219 countries
  - Gross fixed capital formation (including Acquisitions less disposals of valuables): 219 countries


| | CountryID | Country | Currency | IndicatorName | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | ... | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | Afghanistan | Afghani | Final consumption expenditure | 76097244.0 | 82197023.0 | 69497573.0 | 73697424.0 | 89296879.0 | 97796585.0 | ... | 1154152000000.0 | 1165048000000.0 | 1249789000000.0 | 1265019000000.0 | 1501280000000.0 | 1662108000000.0 | 1756194000000.0 | 1373045000000.0 | 1533160000000.0 | 1609344000000.0 |
| 1 | 4 | Afghanistan | Afghani | Household consumption expenditure (including N... | 70967106.0 | 76655665.0 | 64812356.0 | 68729071.0 | 83276881.0 | 91203574.0 | ... | 905379500000.0 | 915260400000.0 | 970202500000.0 | 993728400000.0 | 1217000000000.0 | 1338629000000.0 | 1416826000000.0 | 1107016000000.0 | 1252786000000.0 | 1321735000000.0 |
| 2 | 4 | Afghanistan | Afghani | General government final consumption expenditure | 5130138.0 | 5541358.0 | 4685217.0 | 4968352.0 | 6019998.0 | 6593011.0 | ... | 248772300000.0 | 249787800000.0 | 279587000000.0 | 271290800000.0 | 284279800000.0 | 323478400000.0 | 339369000000.0 | 266028800000.0 | 280373600000.0 | 287609000000.0 |
| 3 | 4 | Afghanistan | Afghani | Gross capital formation | 4299850.0 | 4499842.0 | 4699836.0 | 5699800.0 | 8499703.0 | 10399636.0 | ... | 160056100000.0 | 165788200000.0 | 158244100000.0 | 179625800000.0 | 193869300000.0 | 189245600000.0 | 175840000000.0 | 162485900000.0 | 213929800000.0 | 205645000000.0 |
| 4 | 4 | Afghanistan | Afghani | Gross fixed capital formation (including Acqui... | 4299850.0 | 4499842.0 | 4699836.0 | 5699800.0 | 8499703.0 | 10399636.0 | ... | 160056100000.0 | 165788200000.0 | 158244100000.0 | 179625800000.0 | 193869300000.0 | 189245600000.0 | 175840000000.0 | 162485900000.0 | 213929800000.0 | 205645000000.0 |
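
The per-indicator counts printed above are only country counts if each (Country, IndicatorName) pair appears once in the file. A minimal sketch of that check, run on a made-up frame (`toy` and its values are hypothetical, not taken from the World Bank file):

```python
import pandas as pd

# Hypothetical toy frame mimicking the wide layout (values are made up)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'IndicatorName': ['GDP', 'Total Value Added', 'GDP', 'Total Value Added'],
    '1970': [1.0, 0.9, 2.0, 1.8],
})

# True for any row whose (Country, IndicatorName) pair was already seen
dup_mask = toy.duplicated(subset=['Country', 'IndicatorName'])
n_duplicates = int(dup_mask.sum())

# Rows per indicator vs. distinct countries per indicator
rows_per_indicator = toy.groupby('IndicatorName').size()
countries_per_indicator = toy.groupby('IndicatorName')['Country'].nunique()

print(f"Duplicate (Country, IndicatorName) pairs: {n_duplicates}")
print((rows_per_indicator == countries_per_indicator).all())
```

If the same check on `df_gdp_sectoral` reported duplicates, the later `aggfunc='first'` pivot would silently keep only one of them.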


## Step 2: Reshape from Wide to Long Format

Convert year columns (1970-2023) into rows for easier processing.

In [3]:
print("[2/8] Reshaping data from wide to long format...\n")

# Get year columns (all numeric columns from 1970 onwards)
year_columns = [col for col in df_gdp_sectoral.columns if str(col).isdigit() and int(col) >= 1970]

print(f"Year range: {min(year_columns)} to {max(year_columns)}")
print(f"Total years: {len(year_columns)}")

# Reshape to long format
df_long = df_gdp_sectoral.melt(
    id_vars=['CountryID', 'Country', 'Currency', 'IndicatorName'],
    value_vars=year_columns,
    var_name='Year',
    value_name='Value'
)

# Convert Year to integer
df_long['Year'] = df_long['Year'].astype(int)

print(f"\nReshaped to {len(df_long):,} rows\n")
df_long.head(10)

[2/8] Reshaping data from wide to long format...

Year range: 1970 to 2023
Total years: 54

Reshaped to 200,556 rows



| | CountryID | Country | Currency | IndicatorName | Year | Value |
|---|---|---|---|---|---|---|
| 0 | 4 | Afghanistan | Afghani | Final consumption expenditure | 1970 | 76097244.0 |
| 1 | 4 | Afghanistan | Afghani | Household consumption expenditure (including N... | 1970 | 70967106.0 |
| 2 | 4 | Afghanistan | Afghani | General government final consumption expenditure | 1970 | 5130138.0 |
| 3 | 4 | Afghanistan | Afghani | Gross capital formation | 1970 | 4299850.0 |
| 4 | 4 | Afghanistan | Afghani | Gross fixed capital formation (including Acqui... | 1970 | 4299850.0 |
| 5 | 4 | Afghanistan | Afghani | Exports of goods and services | 1970 | 7699731.0 |
| 6 | 4 | Afghanistan | Afghani | Imports of goods and services | 1970 | 9399672.0 |
| 7 | 4 | Afghanistan | Afghani | Gross Domestic Product (GDP) | 1970 | 78697146.0 |
| 8 | 4 | Afghanistan | Afghani | Agriculture, hunting, forestry, fishing (ISIC ... | 1970 | 39539568.0 |
| 9 | 4 | Afghanistan | Afghani | Mining, Manufacturing, Utilities (ISIC C-E) | 1970 | 17121342.0 |
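
The row count after reshaping follows directly from the input shape: 3,714 rows × 54 year columns = 200,556 rows. The sketch below reproduces the same `melt` call on a small made-up frame (`wide` is hypothetical) to show that relationship:

```python
import pandas as pd

# Hypothetical wide frame: 2 rows, 3 year columns (values are made up)
wide = pd.DataFrame({
    'Country': ['A', 'B'],
    'IndicatorName': ['GDP', 'GDP'],
    '1970': [1.0, 4.0],
    '1971': [2.0, 5.0],
    '1972': [3.0, 6.0],
})
year_cols = [c for c in wide.columns if c.isdigit()]

long = wide.melt(
    id_vars=['Country', 'IndicatorName'],
    value_vars=year_cols,
    var_name='Year',
    value_name='Value',
)
long['Year'] = long['Year'].astype(int)

# melt emits one row per (input row, year column) pair
expected_rows = len(wide) * len(year_cols)
print(len(long), expected_rows)
print(3714 * 54)
```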


## Step 3: Pivot Indicators to Columns

Each indicator becomes a column for easier calculation.

In [4]:
print("[3/8] Pivoting indicators to columns...\n")

# Pivot so each indicator is a column
df_pivot = df_long.pivot_table(
    index=['CountryID', 'Country', 'Currency', 'Year'],
    columns='IndicatorName',
    values='Value',
    aggfunc='first'
).reset_index()

print(f"Created {len(df_pivot):,} country-year records")
print(f"Countries: {df_pivot['Country'].nunique()}")
print(f"Year range: {df_pivot['Year'].min()} to {df_pivot['Year'].max()}\n")

df_pivot.head()

[3/8] Pivoting indicators to columns...

Created 10,936 country-year records
Countries: 220
Year range: 1970 to 2023



| | CountryID | Country | Currency | Year | Agriculture, hunting, forestry, fishing (ISIC A-B) | Changes in inventories | Construction (ISIC F) | Exports of goods and services | Final consumption expenditure | General government final consumption expenditure | ... | Gross capital formation | Gross fixed capital formation (including Acquisitions less disposals of valuables) | Household consumption expenditure (including Non-profit institutions serving households) | Imports of goods and services | Manufacturing (ISIC D) | Mining, Manufacturing, Utilities (ISIC C-E) | Other Activities (ISIC J-P) | Total Value Added | Transport, storage and communication (ISIC I) | Wholesale, retail trade, restaurants and hotels (ISIC G-H) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | Afghanistan | Afghani | 1970 | 39539568.0 | NaN | 2126877.0 | 7699731.0 | 76097244.0 | 5130138.0 | ... | 4299850.0 | 4299850.0 | 70967106.0 | 9399672.0 | 16823905.0 | 17121342.0 | 5806393.0 | 78697995.0 | 3814182.0 | 10289633.0 |
| 1 | 4 | Afghanistan | Afghani | 1971 | 41399045.0 | NaN | 2226881.0 | 8999685.0 | 82197023.0 | 5541358.0 | ... | 4499842.0 | 4499842.0 | 76655665.0 | 13299535.0 | 17614891.0 | 17926313.0 | 6079390.0 | 82397913.0 | 3993406.0 | 10772878.0 |
| 2 | 4 | Afghanistan | Afghani | 1972 | 36072594.0 | NaN | 1940340.0 | 10599629.0 | 69497573.0 | 4685217.0 | ... | 4699836.0 | 4699836.0 | 64812356.0 | 12999545.0 | 15348652.0 | 15620008.0 | 5297214.0 | 71798259.0 | 3479600.0 | 9388503.0 |
| 3 | 4 | Afghanistan | Afghani | 1973 | 39187835.0 | NaN | 2108025.0 | 10099647.0 | 73697424.0 | 4968352.0 | ... | 5699800.0 | 5699800.0 | 68729071.0 | 11499598.0 | 16674526.0 | 16969323.0 | 5754861.0 | 77998114.0 | 3780546.0 | 10197524.0 |
| 4 | 4 | Afghanistan | Afghani | 1974 | 48736341.0 | NaN | 2621489.0 | 13599525.0 | 89296879.0 | 6019998.0 | ... | 8499703.0 | 8499703.0 | 83276881.0 | 14399497.0 | 20736085.0 | 21102687.0 | 7156623.0 | 96997652.0 | 4700644.0 | 12679868.0 |
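
A full grid of 220 countries × 54 years would have 11,880 rows, but the pivot produced 10,936. One likely reason, assuming pandas' default `dropna=True`, is that `pivot_table` discards country-years in which every indicator is missing. The sketch below illustrates that behaviour on a made-up frame (`toy_long` is hypothetical); it does not prove this is the only cause of the gap in the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical long frame: country A has no data at all for 1970
toy_long = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A'],
    'Year': [1970, 1970, 1971, 1971],
    'IndicatorName': ['X', 'Y', 'X', 'Y'],
    'Value': [np.nan, np.nan, 1.0, 2.0],
})

toy_pivot = toy_long.pivot_table(
    index=['Country', 'Year'],
    columns='IndicatorName',
    values='Value',
    aggfunc='first',
).reset_index()

# The all-NaN country-year (A, 1970) is absent from the result
print(toy_pivot)
print(f"Full grid would be {220 * 54:,} rows")
```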


## Step 4: Calculate Sector Percentages

**Sector Definitions**:
- **Primary**: Agriculture + Mining (extractive/agricultural). Mining is derived as ISIC C-E minus ISIC D, so it also includes utilities (ISIC E).
- **Secondary**: Manufacturing + Construction (industrial)
- **Tertiary**: Trade/Retail + Transport + Other Services (services)

Each sector is calculated as a percentage of Total Value Added.

In [5]:
print("[4/8] Calculating sector percentages...\n")

# Clean column names
df_pivot.columns = df_pivot.columns.str.strip()

# Map long indicator names to short names
col_map = {
    'Agriculture, hunting, forestry, fishing (ISIC A-B)': 'Agriculture',
    'Mining, Manufacturing, Utilities (ISIC C-E)': 'Mining_Manuf_Util',
    'Manufacturing (ISIC D)': 'Manufacturing',
    'Construction (ISIC F)': 'Construction',
    'Wholesale, retail trade, restaurants and hotels (ISIC G-H)': 'Trade_Retail',
    'Transport, storage and communication (ISIC I)': 'Transport',
    'Other Activities (ISIC J-P)': 'Other_Activities',
    'Total Value Added': 'Total_Value_Added',
    'Gross Domestic Product (GDP)': 'GDP'
}

df_pivot = df_pivot.rename(columns=col_map)

# Derive Mining as (Mining + Manufacturing + Utilities, ISIC C-E) minus Manufacturing (ISIC D).
# Note: the remainder also includes Utilities (ISIC E).
df_pivot['Mining'] = df_pivot['Mining_Manuf_Util'] - df_pivot['Manufacturing']

# Calculate sector components
df_pivot['Primary_Value'] = df_pivot['Agriculture'] + df_pivot['Mining']
df_pivot['Secondary_Value'] = df_pivot['Manufacturing'] + df_pivot['Construction']
df_pivot['Tertiary_Value'] = df_pivot['Trade_Retail'] + df_pivot['Transport'] + df_pivot['Other_Activities']

# Calculate percentages (relative to Total Value Added)
df_pivot['Primary_%'] = (df_pivot['Primary_Value'] / df_pivot['Total_Value_Added']) * 100
df_pivot['Secondary_%'] = (df_pivot['Secondary_Value'] / df_pivot['Total_Value_Added']) * 100
df_pivot['Tertiary_%'] = (df_pivot['Tertiary_Value'] / df_pivot['Total_Value_Added']) * 100

print("Calculated sector percentages:")
print("  - Primary % (Agriculture + Mining)")
print("  - Secondary % (Manufacturing + Construction)")
print("  - Tertiary % (Services)\n")

# Validation: check that percentages sum to ~100%
df_pivot['Total_%'] = df_pivot['Primary_%'] + df_pivot['Secondary_%'] + df_pivot['Tertiary_%']

print("Sample calculations:")
df_pivot[['Country', 'Year', 'Primary_%', 'Secondary_%', 'Tertiary_%', 'Total_%']].head(10)

[4/8] Calculating sector percentages...

Calculated sector percentages:
  - Primary % (Agriculture + Mining)
  - Secondary % (Manufacturing + Construction)
  - Tertiary % (Services)

Sample calculations:


| | Country | Year | Primary_% | Secondary_% | Tertiary_% | Total_% |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1970 | 50.620102 | 24.080387 | 25.299511 | 100.0 |
| 1 | Afghanistan | 1971 | 50.620781 | 24.08043 | 25.298789 | 100.0 |
| 2 | Afghanistan | 1972 | 50.619542 | 24.07996 | 25.300498 | 100.0 |
| 3 | Afghanistan | 1973 | 50.619983 | 24.080776 | 25.299241 | 100.0 |
| 4 | Afghanistan | 1974 | 50.622816 | 24.080556 | 25.296628 | 100.0 |
| 5 | Afghanistan | 1975 | 50.615827 | 24.078547 | 25.305626 | 100.0 |
| 6 | Afghanistan | 1976 | 50.621305 | 24.083222 | 25.295472 | 100.0 |
| 7 | Afghanistan | 1977 | 50.631319 | 24.0799 | 25.288781 | 100.0 |
| 8 | Afghanistan | 1978 | 50.594856 | 24.072517 | 25.332627 | 100.0 |
| 9 | Afghanistan | 1979 | 50.637742 | 24.097251 | 25.265008 | 100.0 |
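
As a worked check of the sector formulas, the sketch below recomputes the Afghanistan 1970 shares by hand from the values shown in the Step 3 output. It uses plain Python rather than the DataFrame, and the variable names are chosen here for illustration.

```python
# Afghanistan 1970 values (local currency), copied from the Step 3 output
agriculture = 39_539_568.0
mining_manuf_util = 17_121_342.0   # ISIC C-E
manufacturing = 16_823_905.0       # ISIC D
construction = 2_126_877.0
trade_retail = 10_289_633.0
transport = 3_814_182.0
other_activities = 5_806_393.0
total_value_added = 78_697_995.0

# ISIC C-E minus ISIC D leaves mining (C) plus utilities (E)
mining = mining_manuf_util - manufacturing

primary_pct = (agriculture + mining) / total_value_added * 100
secondary_pct = (manufacturing + construction) / total_value_added * 100
tertiary_pct = (trade_retail + transport + other_activities) / total_value_added * 100

# Rounded to two decimals this gives 50.62, 24.08 and 25.3
print(round(primary_pct, 2), round(secondary_pct, 2), round(tertiary_pct, 2))
```

These agree with the first row of the sample output, and the three components add up to Total Value Added exactly for this row.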


## Step 5: Load UN Tourism Data

Tourism direct GDP as a percentage of total GDP (SDG indicator 8.9.1), available for 2008-2023 and 125 countries.

In [6]:
print("[5/8] Loading UN Tourism data...\n")

# Load UN Tourism data from CSV
df_tourism = pd.read_csv(
    '../raw-data/UN_Tourism/UN_Tourism_8_9_1_TDGDP_04_2025.csv'
)

print(f"Loaded {len(df_tourism):,} tourism records")
print(f"Countries: {df_tourism['GeoAreaName'].nunique()}")
print(f"Year range: {df_tourism['TimePeriod'].min()} to {df_tourism['TimePeriod'].max()}\n")

# Select relevant columns and rename
df_tourism_clean = df_tourism[['GeoAreaName', 'TimePeriod', 'Value']].copy()
df_tourism_clean.columns = ['Country', 'Year', 'Tourism_%']

print("Sample tourism data:")
df_tourism_clean.head(10)

[5/8] Loading UN Tourism data...

Loaded 1,243 tourism records
Countries: 125
Year range: 2008 to 2023

Sample tourism data:


| | Country | Year | Tourism_% |
|---|---|---|---|
| 0 | Albania | 2008 | 2.75707 |
| 1 | Albania | 2009 | 2.66869 |
| 2 | Albania | 2010 | 2.81234 |
| 3 | Albania | 2011 | 2.53477 |
| 4 | Albania | 2012 | 2.35697 |
| 5 | Albania | 2013 | 2.32137 |
| 6 | Albania | 2014 | 2.32641 |
| 7 | Albania | 2015 | 2.44379 |
| 8 | Albania | 2016 | 2.63697 |
| 9 | Albania | 2017 | 2.83579 |


## Step 6: Merge Tourism Data with Main Dataset

In [7]:
print("[6/8] Merging tourism data...\n")

# Standardize tourism country names to match main dataset (if needed)
tourism_country_name_map = {
    'United States of America': 'United States',
    # Add other mappings if needed
}

df_tourism_clean['Country'] = df_tourism_clean['Country'].replace(tourism_country_name_map)

# Merge tourism data with main dataset
df_master = df_pivot.merge(
    df_tourism_clean,
    on=['Country', 'Year'],
    how='left'
)

tourism_coverage = df_master['Tourism_%'].notna().sum()
countries_with_tourism = df_master[df_master['Tourism_%'].notna()]['Country'].nunique()

print(f"Merged: {tourism_coverage:,} records now have tourism data")
print(f"Countries with tourism data: {countries_with_tourism}\n")

[6/8] Merging tourism data...

Merged: 1,143 records now have tourism data
Countries with tourism data: 116
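
The tourism file covers 125 countries and 1,243 records, but only 116 countries and 1,143 records survive the merge, so nine countries found no match, possibly because of naming differences. A minimal sketch of how the unmatched names could be listed so the mapping can be extended; the name lists below are hypothetical stand-ins for the real `Country` columns.

```python
import pandas as pd

# Hypothetical name lists; the real ones would come from df_pivot and df_tourism_clean
main_countries = pd.Series(['Albania', 'United States', 'Turkey'])
tourism_countries = pd.Series(['Albania', 'United States of America', 'Turkiye'])

name_map = {'United States of America': 'United States'}
tourism_std = tourism_countries.replace(name_map)

# Names present in the tourism data but absent from the main dataset
unmatched = sorted(set(tourism_std) - set(main_countries))
print(unmatched)
```

Each name this prints is a candidate for a new entry in `tourism_country_name_map`, after confirming by hand that it refers to the same country.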



## Step 7: Add GDP in USD and Population

GDP in current USD and population come from the World Bank Development Indicators and allow cross-country comparisons.

In [8]:
print("[7/8] Loading and merging GDP in USD and Population...\n")

# Load World Bank Development Indicators
df_dev_ind = pd.read_csv('../raw-data/World_Bank/world_bank_development_indicators.csv')

# Extract year from date column
df_dev_ind['Year'] = pd.to_datetime(df_dev_ind['date']).dt.year

# Select GDP in USD and population columns
df_gdp_usd = df_dev_ind[['country', 'Year', 'GDP_current_US', 'population']].copy()
df_gdp_usd.columns = ['Country', 'Year', 'GDP_USD', 'Population']

# Standardize country names to match ACLED naming conventions
country_name_map = {
    'Yemen, Rep.': 'Yemen',
    'Egypt, Arab Rep.': 'Egypt',
    'Congo, Dem. Rep.': 'Democratic Republic of Congo',
    'Congo, Rep.': 'Republic of Congo',
    'Bahamas, The': 'Bahamas',
    'Gambia, The': 'Gambia',
    'Korea, Rep.': 'South Korea',
    'Korea, Dem. People\'s Rep.': 'North Korea',
    'Kyrgyz Republic': 'Kyrgyzstan',
    'Lao PDR': 'Laos',
    'Russian Federation': 'Russia',
    'Syrian Arab Republic': 'Syria',
    'Turkiye': 'Turkey',
    'Venezuela, RB': 'Venezuela',
    'West Bank and Gaza': 'Palestine',
    'Slovak Republic': 'Slovakia'
}

df_gdp_usd['Country'] = df_gdp_usd['Country'].replace(country_name_map)

# Merge GDP USD and Population
df_master = df_master.merge(
    df_gdp_usd,
    on=['Country', 'Year'],
    how='left'
)

gdp_coverage = df_master['GDP_USD'].notna().sum()
population_coverage = df_master['Population'].notna().sum()

print(f"Merged: {gdp_coverage:,} records now have GDP in USD")
print(f"Merged: {population_coverage:,} records now have Population\n")

[7/8] Loading and merging GDP in USD and Population...

Merged: 8,332 records now have GDP in USD
Merged: 9,281 records now have Population
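
A left merge keeps every row of the main dataset, but it would duplicate rows if the right-hand table had repeated (Country, Year) keys. One way to guard against that, sketched here on made-up frames (`left` and `right` are hypothetical stand-ins for `df_master` and `df_gdp_usd`), is pandas' `validate` and `indicator` options:

```python
import pandas as pd

# Hypothetical frames standing in for df_master and df_gdp_usd
left = pd.DataFrame({'Country': ['A', 'A', 'B'], 'Year': [2000, 2001, 2000]})
right = pd.DataFrame({
    'Country': ['A', 'A'],
    'Year': [2000, 2001],
    'GDP_USD': [1.0e9, 1.1e9],
})

merged = left.merge(
    right,
    on=['Country', 'Year'],
    how='left',
    validate='many_to_one',  # raises MergeError if right has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Row count is unchanged; country B has no match on the right
print(len(merged) == len(left))
print(merged['_merge'].value_counts())
```

Rows marked `left_only` are the ones that end up with missing GDP or population, which makes unmatched country names easier to trace.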



## Step 8: Create Final Master Dataset

Select and order columns for the final output.

In [9]:
print("[8/8] Creating final dataset...\n")

# Select final columns in desired order
df_final = df_master[[
    'Country',
    'Year',
    'Primary_%',
    'Secondary_%',
    'Tertiary_%',
    'Tourism_%',
    'GDP_USD',
    'Population'
]].copy()

# Round percentages to 2 decimal places
df_final['Primary_%'] = df_final['Primary_%'].round(2)
df_final['Secondary_%'] = df_final['Secondary_%'].round(2)
df_final['Tertiary_%'] = df_final['Tertiary_%'].round(2)
df_final['Tourism_%'] = df_final['Tourism_%'].round(2)

# Ensure all numeric columns use proper float64 with NaN for missing
numeric_cols = ['Primary_%', 'Secondary_%', 'Tertiary_%', 'Tourism_%', 'GDP_USD', 'Population']
for col in numeric_cols:
    df_final[col] = pd.to_numeric(df_final[col], errors='coerce')

# Sort by Country and Year
df_final = df_final.sort_values(['Country', 'Year']).reset_index(drop=True)

print(f"Final dataset: {len(df_final):,} rows × {len(df_final.columns)} columns")
print(f"Countries: {df_final['Country'].nunique()}")
print(f"Year range: {df_final['Year'].min()} - {df_final['Year'].max()}\n")

df_final.head(20)

[8/8] Creating final dataset...

Final dataset: 10,936 rows × 8 columns
Countries: 220
Year range: 1970 - 2023



| | Country | Year | Primary_% | Secondary_% | Tertiary_% | Tourism_% | GDP_USD | Population |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1970 | 50.62 | 24.08 | 25.3 | NaN | 1748887000.0 | 10752971.0 |
| 1 | Afghanistan | 1971 | 50.62 | 24.08 | 25.3 | NaN | 1831109000.0 | 11015857.0 |
| 2 | Afghanistan | 1972 | 50.62 | 24.08 | 25.3 | NaN | 1595555000.0 | 11286753.0 |
| 3 | Afghanistan | 1973 | 50.62 | 24.08 | 25.3 | NaN | 1733333000.0 | 11575305.0 |
| 4 | Afghanistan | 1974 | 50.62 | 24.08 | 25.3 | NaN | 2155555000.0 | 11869879.0 |
| 5 | Afghanistan | 1975 | 50.62 | 24.08 | 25.31 | NaN | 2366667000.0 | 12157386.0 |
| 6 | Afghanistan | 1976 | 50.62 | 24.08 | 25.3 | NaN | 2555556000.0 | 12425267.0 |
| 7 | Afghanistan | 1977 | 50.63 | 24.08 | 25.29 | NaN | 2953333000.0 | 12687301.0 |
| 8 | Afghanistan | 1978 | 50.59 | 24.07 | 25.33 | NaN | 3300000000.0 | 12938862.0 |
| 9 | Afghanistan | 1979 | 50.64 | 24.1 | 25.27 | NaN | 3697940000.0 | 12986369.0 |
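
Because each share is rounded to two decimals independently, the three published columns need not sum to exactly 100. The sketch below shows this with the unrounded Afghanistan 1975 values from the Step 4 output:

```python
# Unrounded Afghanistan 1975 shares, copied from the Step 4 output
primary, secondary, tertiary = 50.615827, 24.078547, 25.305626

exact_total = primary + secondary + tertiary
rounded_total = round(primary, 2) + round(secondary, 2) + round(tertiary, 2)

# The unrounded shares sum to 100; the rounded ones (50.62, 24.08, 25.31) do not
print(round(exact_total, 6))
print(round(rounded_total, 2))
```

Deviations of this kind stay within a few hundredths of a point, so they are far smaller than a ±1 tolerance around 100.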


## Validation: Data Quality Checks

Check missing values, sector totals, and tourism coverage before export.

In [10]:
print("="*80)
print("DATA QUALITY REPORT")
print("="*80)

print("\n1. Missing Data Summary:")
missing_counts = df_final.isnull().sum()
missing_pct = (df_final.isnull().sum() / len(df_final) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing %': missing_pct
})
print(missing_df)

print("\n2. Sector Percentage Validation:")
df_final['Total_%'] = df_final['Primary_%'] + df_final['Secondary_%'] + df_final['Tertiary_%']
print(df_final['Total_%'].describe())

# Check for rows that don't sum to ~100%
invalid_rows = df_final[(df_final['Total_%'] < 99) | (df_final['Total_%'] > 101)]
valid_rows = df_final[(df_final['Total_%'] >= 99) & (df_final['Total_%'] <= 101)].shape[0]
total_rows_with_sectors = df_final['Total_%'].notna().sum()

print(f"\n   Rows with sectors summing to 100% (±1%): {valid_rows:,}/{total_rows_with_sectors:,}")
print(f"   Rows outside 100% (±1%): {len(invalid_rows):,}")

print("\n3. Sample Countries - Recent Years (2015-2023):")
for country in ['Afghanistan', 'Syria', 'United States', 'Germany', 'China']:
    sample = df_final[(df_final['Country'] == country) & (df_final['Year'] >= 2015)]
    if len(sample) > 0:
        print(f"\n{country}:")
        print(sample[['Year', 'Primary_%', 'Secondary_%', 'Tertiary_%', 'Tourism_%']].to_string(index=False))

print("\n4. Tourism Coverage by Year:")
tourism_by_year = df_final[df_final['Tourism_%'].notna()].groupby('Year').size()
print(tourism_by_year.to_string())

# Drop the temporary Total_% column
df_final = df_final.drop(columns=['Total_%'])

print("\n" + "="*80)

DATA QUALITY REPORT

1. Missing Data Summary:
             Missing Count  Missing %
Country                  0       0.00
Year                     0       0.00
Primary_%              170       1.55
Secondary_%             43       0.39
Tertiary_%              49       0.45
Tourism_%             9793      89.55
GDP_USD               2604      23.81
Population            1655      15.13

2. Sector Percentage Validation:
count    10717.000000
mean       100.024970
std          0.673156
min         79.390000
25%        100.000000
50%        100.000000
75%        100.000000
max        125.050000
Name: Total_%, dtype: float64

   Rows with sectors summing to 100% (±1%): 10,679/10,717
   Rows outside 100% (±1%): 38

3. Sample Countries - Recent Years (2015-2023):

Afghanistan:
 Year  Primary_%  Secondary_%  Tertiary_%  Tourism_%
 2015      24.33         7.33       68.35        NaN
 2016      30.53         7.59       61.88        NaN
 2017      30.43         7.78       61.79        NaN
 2018  
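
The validation above reports 38 rows outside the ±1 tolerance but does not show which country-years they are. A minimal sketch of how they could be listed and grouped by country; the frame `toy` is hypothetical, and its two out-of-range totals simply reuse the minimum and maximum from the summary statistics for illustration.

```python
import pandas as pd

# Hypothetical rows; in the notebook this would be df_final while Total_% is still present
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'C'],
    'Year': [2000, 2001, 2000, 2000],
    'Total_%': [100.0, 79.39, 100.01, 125.05],
})

# Same tolerance as the check above: more than 1 point away from 100
out_of_range = toy[(toy['Total_%'] - 100).abs() > 1]
per_country = out_of_range.groupby('Country').size().sort_values(ascending=False)

print(out_of_range)
print(per_country)
```
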

## Export: Save Master Dataset

Save to `processed-data/economics-countries-master.csv`

In [12]:
# Export to CSV
output_path = '../processed-data/economics-countries-master.csv'
df_final.to_csv(output_path, index=False, na_rep='')

print("="*80)
print("EXPORT COMPLETED")
print("="*80)

print(f"\nFile: {output_path}")
print(f"Size: {len(df_final):,} records")
print(f"Countries: {df_final['Country'].nunique()}")
print(f"Years: {df_final['Year'].min()}-{df_final['Year'].max()}")

print(f"\nColumns:")
print(f"  - Country (text - standardized World Bank name)")
print(f"  - Year (1970-2023)")
print(f"  - Primary_% (Agriculture + Mining)")
print(f"  - Secondary_% (Manufacturing + Construction)")
print(f"  - Tertiary_% (Services)")
print(f"  - Tourism_% (2008-2023; 125 countries in source, 116 after merge)")
print(f"  - GDP_USD (current US dollars)")
print(f"  - Population (total population)")

print(f"\nNOTE: Missing values are represented as empty cells in CSV (NULL)")
print(f"\nReady to merge with conflict data using Country + Year keys!")
print("="*80)

EXPORT COMPLETED

File: ../processed-data/economics-countries-master.csv
Size: 10,936 records
Countries: 220
Years: 1970-2023

Columns:
  - Country (text - standardized World Bank name)
  - Year (1970-2023)
  - Primary_% (Agriculture + Mining)
  - Secondary_% (Manufacturing + Construction)
  - Tertiary_% (Services)
  - Tourism_% (2008-2023; 125 countries in source, 116 after merge)
  - GDP_USD (current US dollars)
  - Population (total population)

NOTE: Missing values are represented as empty cells in CSV (NULL)

Ready to merge with conflict data using Country + Year keys!
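
Since missing values are written as empty cells, anyone loading the file should get them back as `NaN`. The sketch below checks that round trip on a made-up two-row frame (`toy_final` is hypothetical), writing to an in-memory buffer instead of the real output path:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for df_final
toy_final = pd.DataFrame({
    'Country': ['A', 'A'],
    'Year': [2007, 2008],
    'Tourism_%': [np.nan, 2.76],
})

# Write to an in-memory buffer with the same options as the export
buffer = io.StringIO()
toy_final.to_csv(buffer, index=False, na_rep='')
buffer.seek(0)

# Empty cells are parsed back as NaN by default
reloaded = pd.read_csv(buffer)
print(reloaded)
```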
