# 🔧 Feature Engineering: GDP Per Capita Analysis (1990-2023)

**Author**: Portfolio Project for Global Analysis  
**Dataset**: World Bank GDP Per Capita Data (1990-2023)  
**Objective**: Create advanced features for economic analysis including growth rates, volatility measures, and economic cycle indicators

---

## 📋 Feature Engineering Plan

1. **Growth Rate Analysis** - YoY, compound annual growth rates
2. **Moving Averages & Trends** - 3, 5, 10-year moving averages
3. **Volatility Measures** - Standard deviation, coefficient of variation
4. **Economic Cycles** - Recession detection, recovery periods
5. **Relative Performance** - Country vs world average
6. **Advanced Metrics** - GDP momentum, acceleration indicators (planned; not implemented in the cells below)

In [10]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set style for visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✅ Libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")

✅ Libraries imported successfully!
📊 Pandas version: 2.3.2
🔢 NumPy version: 2.3.3


## 📁 Data Loading and Preparation

In [11]:
# Load the dataset
df = pd.read_csv('../data/gdp-per-capita-worldbank.csv')

# Add continent mapping
import sys
sys.path.append('../src')
from utils import get_continent_mapping

continent_mapping = get_continent_mapping()
df['Continent'] = df['Entity'].map(continent_mapping)

# Basic info
gdp_column = 'GDP per capita, PPP (constant 2021 international $)'
print(f"📈 Dataset shape: {df.shape}")
print(f"📅 Year range: {df['Year'].min()} - {df['Year'].max()}")
print(f"🌍 Number of countries: {df['Entity'].nunique()}")
print(f"🌎 Rows with continent mapping: {df['Continent'].notna().sum()}")

📈 Dataset shape: (7063, 5)
📅 Year range: 1990 - 2023
🌍 Number of countries: 213
🌎 Rows with continent mapping: 6160
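
The 6160 reported above is a row count, because `notna().sum()` counts rows rather than distinct entities. A minimal sketch on made-up data (the toy frame and its values are hypothetical, not taken from the World Bank file) shows how the two counts differ:

```python
import pandas as pd

# Hypothetical toy data: two mapped countries and one entity without a continent
toy = pd.DataFrame({
    "Entity": ["France", "France", "Japan", "World"],
    "Year": [2022, 2023, 2023, 2023],
    "Continent": ["Europe", "Europe", "Asia", None],
})

has_continent = toy["Continent"].notna()

# Number of rows that received a continent label
mapped_rows = int(has_continent.sum())

# Number of distinct entities that received a continent label
mapped_entities = toy.loc[has_continent, "Entity"].nunique()

print(f"Rows with continent mapping: {mapped_rows}")
print(f"Entities with continent mapping: {mapped_entities}")
```

On the real dataset, the second expression would probably be the more informative coverage check.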


In [12]:
# Check the exact column names
print("🔍 Column names:")
print(df.columns.tolist())
print(f"\n💰 GDP column being used: '{gdp_column}'")

🔍 Column names:
['Entity', 'Code', 'Year', 'GDP per capita, PPP (constant 2021 international $)', 'Continent']

💰 GDP column being used: 'GDP per capita, PPP (constant 2021 international $)'


## 🚀 Feature Engineering Implementation

### 1. Growth Rate Features

In [13]:
def calculate_growth_features(df, entity_col='Entity', year_col='Year', value_col=None):
    """
    Calculate various growth rate features for time series data
    """
    if value_col is None:
        value_col = gdp_column
    
    df_features = df.copy()
    df_features = df_features.sort_values([entity_col, year_col])
    
    # Year-over-Year Growth Rate
    df_features['yoy_growth'] = df_features.groupby(entity_col)[value_col].pct_change() * 100
    
    # 3-Year Growth Rate
    df_features['growth_3y'] = df_features.groupby(entity_col)[value_col].pct_change(periods=3) * 100
    
    # 5-Year Growth Rate
    df_features['growth_5y'] = df_features.groupby(entity_col)[value_col].pct_change(periods=5) * 100
    
    # 10-Year Growth Rate
    df_features['growth_10y'] = df_features.groupby(entity_col)[value_col].pct_change(periods=10) * 100
    
    # Simplified CAGR calculation
    def calculate_cagr_simple(group, periods=5):
        cagr_values = []
        for i in range(len(group)):
            if i < periods:
                cagr_values.append(np.nan)
            else:
                start_value = group.iloc[i - periods]
                end_value = group.iloc[i]
                if start_value > 0 and end_value > 0:
                    cagr = ((end_value / start_value) ** (1/periods) - 1) * 100
                    cagr_values.append(cagr)
                else:
                    cagr_values.append(np.nan)
        return cagr_values
    
    # Apply CAGR calculation by group
    cagr_results = []
    for entity in df_features[entity_col].unique():
        entity_data = df_features[df_features[entity_col] == entity].copy()
        entity_data = entity_data.sort_values(year_col)
        cagr_vals = calculate_cagr_simple(entity_data[value_col], periods=5)
        
        for idx, val in zip(entity_data.index, cagr_vals):
            cagr_results.append({'index': idx, 'cagr_5y': val})
    
    # Create CAGR dataframe and merge
    cagr_df = pd.DataFrame(cagr_results).set_index('index')
    df_features['cagr_5y'] = cagr_df['cagr_5y']
    
    return df_features

# Apply growth feature calculation
df_with_growth = calculate_growth_features(df)

print("✅ Growth rate features calculated!")
print(f"📊 New features added: yoy_growth, growth_3y, growth_5y, growth_10y, cagr_5y")
print(f"📈 Sample data shape: {df_with_growth.shape}")

# Display sample
sample_country = 'United States'
sample_data = df_with_growth[df_with_growth['Entity'] == sample_country].tail(10)
print(f"\n📋 Sample growth features for {sample_country} (last 10 years):")
display_cols = ['Year', gdp_column, 'yoy_growth', 'growth_5y', 'cagr_5y']
print(sample_data[display_cols].round(2))

✅ Growth rate features calculated!
📊 New features added: yoy_growth, growth_3y, growth_5y, growth_10y, cagr_5y
📈 Sample data shape: (7063, 10)

📋 Sample growth features for United States (last 10 years):
      Year  GDP per capita, PPP (constant 2021 international $)  yoy_growth  \
6760  2014                                           63191.25          1.77   
6761  2015                                           64575.41          2.19   
6762  2016                                           65275.57          1.08   
6763  2017                                           66458.02          1.81   
6764  2018                                           68070.21          2.43   
6765  2019                                           69511.77          2.12   
6766  2020                                           67352.39         -3.11   
6767  2021                                           71318.30          5.89   
6768  2022                                           72841.92          2.14   
6769  
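
The loop in `calculate_cagr_simple` applies CAGR = ((end / start)^(1/n) - 1) × 100 one row at a time. As a hedged alternative, the same quantity can be sketched with a grouped `shift`. The function name `cagr_vectorized`, the column name `gdp`, and the toy data below are assumptions made for illustration only.

```python
import pandas as pd

def cagr_vectorized(df, entity_col="Entity", year_col="Year", value_col="gdp", periods=5):
    """CAGR in percent over `periods` rows, computed per entity with shift()."""
    ordered = df.sort_values([entity_col, year_col])
    # Value `periods` rows earlier within the same entity (NaN if unavailable)
    start = ordered.groupby(entity_col)[value_col].shift(periods)
    end = ordered[value_col]
    # CAGR is only defined for positive start and end values
    ratio = (end / start).where((start > 0) & (end > 0))
    cagr = (ratio ** (1.0 / periods) - 1) * 100
    # Return in the caller's original row order
    return cagr.reindex(df.index)

# Hypothetical toy data: A grows 2% a year, B grows 10% a year
toy = pd.DataFrame({
    "Entity": ["A"] * 6 + ["B"] * 6,
    "Year": list(range(2000, 2006)) * 2,
    "gdp": [100 * 1.02 ** i for i in range(6)] + [50 * 1.10 ** i for i in range(6)],
})
toy["cagr_5y"] = cagr_vectorized(toy)
print(toy.round(2))
```

Like the original loop, this counts rows rather than calendar years, so it assumes each entity has one row per consecutive year.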

### 2. Moving Averages and Trend Features

In [17]:
def calculate_trend_features(df, entity_col='Entity', year_col='Year', value_col=None):
    """
    Calculate moving averages and trend indicators
    """
    if value_col is None:
        value_col = gdp_column
    
    df_trend = df.copy()
    df_trend = df_trend.sort_values([entity_col, year_col])
    
    # Moving Averages
    df_trend['ma_3y'] = df_trend.groupby(entity_col)[value_col].rolling(window=3, min_periods=1).mean().reset_index(0, drop=True)
    df_trend['ma_5y'] = df_trend.groupby(entity_col)[value_col].rolling(window=5, min_periods=1).mean().reset_index(0, drop=True)
    df_trend['ma_10y'] = df_trend.groupby(entity_col)[value_col].rolling(window=10, min_periods=1).mean().reset_index(0, drop=True)
    
    # Trend indicators (current value vs moving average)
    df_trend['trend_vs_ma3'] = ((df_trend[value_col] - df_trend['ma_3y']) / df_trend['ma_3y']) * 100
    df_trend['trend_vs_ma5'] = ((df_trend[value_col] - df_trend['ma_5y']) / df_trend['ma_5y']) * 100
    df_trend['trend_vs_ma10'] = ((df_trend[value_col] - df_trend['ma_10y']) / df_trend['ma_10y']) * 100
    
    # Simple slope calculation (simplified to avoid complex index issues)
    def calculate_simple_slope(group):
        slopes = []
        for i in range(len(group)):
            if i < 2:  # Need at least 3 points for slope
                slopes.append(np.nan)
            else:
                # Calculate slope using last 3 points
                y_vals = group.iloc[i-2:i+1].values
                x_vals = np.array([0, 1, 2])
                if not np.isnan(y_vals).any():
                    slope = np.polyfit(x_vals, y_vals, 1)[0]
                    slopes.append(slope)
                else:
                    slopes.append(np.nan)
        return slopes
    
    # Apply slope calculation to each entity
    slope_results = []
    for entity in df_trend[entity_col].unique():
        entity_data = df_trend[df_trend[entity_col] == entity].copy()
        entity_data = entity_data.sort_values(year_col)
        slopes = calculate_simple_slope(entity_data['ma_5y'])
        
        for idx, slope in zip(entity_data.index, slopes):
            slope_results.append({'index': idx, 'ma5_slope': slope})
    
    # Create slope dataframe and merge
    slope_df = pd.DataFrame(slope_results).set_index('index')
    df_trend['ma5_slope'] = slope_df['ma5_slope']
    
    return df_trend

# Apply trend feature calculation
df_with_trends = calculate_trend_features(df_with_growth)

print("✅ Trend features calculated!")
print(f"📊 New features: ma_3y, ma_5y, ma_10y, trend_vs_ma3, trend_vs_ma5, trend_vs_ma10, ma5_slope")

# Sample visualization
sample_country = 'Germany'
sample_data = df_with_trends[(df_with_trends['Entity'] == sample_country) & (df_with_trends['Year'] >= 2000)]

fig = go.Figure()

# Original GDP
fig.add_trace(go.Scatter(
    x=sample_data['Year'],
    y=sample_data[gdp_column],
    mode='lines+markers',
    name='GDP Per Capita',
    line=dict(color='#2E86AB', width=3)
))

# Moving averages
fig.add_trace(go.Scatter(
    x=sample_data['Year'],
    y=sample_data['ma_5y'],
    mode='lines',
    name='5-Year Moving Average',
    line=dict(color='#A23B72', width=2, dash='dash')
))

fig.add_trace(go.Scatter(
    x=sample_data['Year'],
    y=sample_data['ma_10y'],
    mode='lines',
    name='10-Year Moving Average',
    line=dict(color='#F18F01', width=2, dash='dot')
))

fig.update_layout(
    title=f'GDP Per Capita Trends - {sample_country} (2000-2023)',
    xaxis_title='Year',
    yaxis_title='GDP Per Capita (PPP, constant 2021 international $)',
    height=500,
    hovermode='x unified'
)

fig.show()

✅ Trend features calculated!
📊 New features: ma_3y, ma_5y, ma_10y, trend_vs_ma3, trend_vs_ma5, trend_vs_ma10, ma5_slope
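
For three equally spaced points, the least-squares slope that `np.polyfit` returns reduces to (y[2] - y[0]) / 2, because the middle point gets zero weight once x is centered. A small check on made-up numbers (the series `y` is hypothetical) illustrates this, which suggests `ma5_slope` could likely also be obtained with a grouped `diff(2) / 2`:

```python
import numpy as np
import pandas as pd

# Hypothetical smoothed series for a single entity
y = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])
x = np.array([0, 1, 2])

# Slope of a straight line fitted to each trailing 3-point window
polyfit_slopes = [
    np.polyfit(x, y.iloc[i - 2:i + 1].to_numpy(), 1)[0]
    for i in range(2, len(y))
]

# Closed form for three equally spaced points: (last - first) / 2
diff_slopes = (y.diff(2) / 2).iloc[2:].to_numpy()

print(np.round(polyfit_slopes, 6))
print(diff_slopes)
```

One caveat: the two approaches differ when only the middle point of a window is missing, since the loop returns NaN there while the difference does not.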


### 3. Volatility and Risk Features

In [19]:
def calculate_volatility_features(df, entity_col='Entity', year_col='Year', value_col=None, window=5):
    """
    Calculate volatility and risk measures
    """
    if value_col is None:
        value_col = gdp_column
    
    df_vol = df.copy()
    df_vol = df_vol.sort_values([entity_col, year_col])
    
    # Rolling standard deviation of growth rates
    df_vol['volatility_5y'] = df_vol.groupby(entity_col)['yoy_growth'].rolling(window=window, min_periods=2).std().reset_index(0, drop=True)
    
    # Coefficient of variation (CV) - volatility relative to mean
    def calculate_cv(group, window=5):
        rolling_mean = group.rolling(window=window, min_periods=2).mean()
        rolling_std = group.rolling(window=window, min_periods=2).std()
        cv = (rolling_std / rolling_mean.abs()) * 100
        return cv
    
    df_vol['cv_growth_5y'] = df_vol.groupby(entity_col)['yoy_growth'].apply(calculate_cv).reset_index(0, drop=True)
    
    # GDP level volatility (rolling standard deviation of GDP levels, in the same units as GDP)
    df_vol['gdp_volatility_5y'] = df_vol.groupby(entity_col)[value_col].rolling(window=window, min_periods=2).std().reset_index(0, drop=True)
    
    # Risk-adjusted performance (average growth / volatility)
    avg_growth_5y = df_vol.groupby(entity_col)['yoy_growth'].rolling(window=window, min_periods=2).mean().reset_index(0, drop=True)
    df_vol['risk_adjusted_performance'] = avg_growth_5y / (df_vol['volatility_5y'] + 0.001)  # Add small constant to avoid division by zero
    
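    # Peak-to-trough range within the window (simplified: ignores whether the peak
    # comes before the trough, so it is not a true maximum drawdown)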
    def calculate_simple_drawdown(group, window=5):
        drawdowns = []
        for i in range(len(group)):
            if i < window - 1:
                drawdowns.append(np.nan)
            else:
                window_data = group.iloc[i-window+1:i+1]
                peak = window_data.max()
                trough = window_data.min()
                if peak > 0:
                    drawdown = ((trough - peak) / peak) * 100
                    drawdowns.append(drawdown)
                else:
                    drawdowns.append(np.nan)
        return drawdowns
    
    # Apply drawdown calculation to each entity
    drawdown_results = []
    for entity in df_vol[entity_col].unique():
        entity_data = df_vol[df_vol[entity_col] == entity].copy()
        entity_data = entity_data.sort_values(year_col)
        drawdowns = calculate_simple_drawdown(entity_data[value_col], window)
        
        for idx, drawdown in zip(entity_data.index, drawdowns):
            drawdown_results.append({'index': idx, 'max_drawdown_5y': drawdown})
    
    # Create drawdown dataframe and merge
    drawdown_df = pd.DataFrame(drawdown_results).set_index('index')
    df_vol['max_drawdown_5y'] = drawdown_df['max_drawdown_5y']
    
    return df_vol

# Apply volatility feature calculation
df_with_volatility = calculate_volatility_features(df_with_trends)

print("✅ Volatility features calculated!")
print(f"📊 New features: volatility_5y, cv_growth_5y, gdp_volatility_5y, risk_adjusted_performance, max_drawdown_5y")

# Volatility analysis for major economies
major_economies = ['United States', 'China', 'Japan', 'Germany', 'United Kingdom', 'France', 'Italy', 'Canada']
latest_year = df_with_volatility['Year'].max()
volatility_analysis = df_with_volatility[
    (df_with_volatility['Entity'].isin(major_economies)) & 
    (df_with_volatility['Year'] == latest_year)
][['Entity', 'volatility_5y', 'cv_growth_5y', 'risk_adjusted_performance', 'max_drawdown_5y']]

volatility_analysis = volatility_analysis.dropna().round(2)
volatility_analysis = volatility_analysis.sort_values('volatility_5y')

print(f"\n📊 Volatility Analysis for Major Economies ({latest_year}):")
print(volatility_analysis)

✅ Volatility features calculated!
📊 New features: volatility_5y, cv_growth_5y, gdp_volatility_5y, risk_adjusted_performance, max_drawdown_5y

📊 Volatility Analysis for Major Economies (2023):
              Entity  volatility_5y  cv_growth_5y  risk_adjusted_performance  \
1298           China           2.49         51.30                       1.95   
3149           Japan           2.72        548.01                       0.18   
2369         Germany           2.80       1158.94                       0.09   
6769   United States           3.21        170.59                       0.59   
1110          Canada           4.05       3122.79                      -0.03   
2233          France           5.20        789.76                       0.13   
3081           Italy           6.60        399.00                       0.25   
6735  United Kingdom           7.10       1564.47                       0.06   

      max_drawdown_5y  
1298           -16.59  
3149            -6.32  
2369           
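
`calculate_simple_drawdown` compares each window's minimum with its maximum regardless of which came first, so a steadily growing series still receives a negative value. One possible refinement, sketched here under the assumption that all values are strictly positive, tracks the running peak so that only declines following a peak count. The helper name `rolling_max_drawdown` and both toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def rolling_max_drawdown(values, window=5):
    """Worst decline (%) from a running peak inside each trailing window."""
    def _mdd(w):
        running_peak = np.maximum.accumulate(w)        # highest value seen so far
        return ((w - running_peak) / running_peak).min() * 100
    return values.rolling(window=window, min_periods=window).apply(_mdd, raw=True)

# Hypothetical series: one that only rises, one that dips after a peak
rising = pd.Series([100.0, 105.0, 110.0, 116.0, 120.0])
dipping = pd.Series([100.0, 110.0, 99.0, 104.0, 108.0])

rising_mdd = rolling_max_drawdown(rising).iloc[-1]
dipping_mdd = rolling_max_drawdown(dipping).iloc[-1]

# The min-vs-max formula, applied to the rising series for comparison
range_based = (rising.min() - rising.max()) / rising.max() * 100

print(f"Rising series, running-peak drawdown: {rising_mdd:.2f}%")
print(f"Rising series, min-vs-max formula:    {range_based:.2f}%")
print(f"Dipping series, running-peak drawdown: {dipping_mdd:.2f}%")
```

With this definition a series that never falls has a drawdown of zero, whereas the min-vs-max formula reports the size of its rise with a negative sign.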

### 4. Economic Cycle and Crisis Detection Features

In [None]:
def calculate_cycle_features(df, entity_col='Entity', year_col='Year', value_col=None):
    """
    Calculate economic cycle and crisis detection features
    """
    if value_col is None:
        value_col = gdp_column
    
    df_cycle = df.copy()
    df_cycle = df_cycle.sort_values([entity_col, year_col])
    
    # Recession detection (2 consecutive years of negative growth)
    def detect_recession(group):
        recession_flags = []
        for i in range(len(group)):
            if i == 0:
                recession_flags.append(False)
            else:
                current_growth = group.iloc[i]
                prev_growth = group.iloc[i-1]
                
                # Recession: current year negative AND previous year negative
                if current_growth < 0 and prev_growth < 0:
                    recession_flags.append(True)
                else:
                    recession_flags.append(False)
        
        return pd.Series(recession_flags, index=group.index)
    
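    # group_keys=False keeps the original row index so the result aligns with df_cycle
    df_cycle['in_recession'] = df_cycle.groupby(entity_col, group_keys=False)['yoy_growth'].apply(detect_recession)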
    
    # Recovery detection (first positive-growth year after a recession or any negative-growth year)
    def detect_recovery(group, recession_group):
        recovery_flags = []
        in_recession_sequence = False
        
        for i in range(len(group)):
            current_growth = group.iloc[i]
            current_recession = recession_group.iloc[i]
            
            # Mark recovery if: was in recession and now positive growth
            if in_recession_sequence and current_growth > 0 and not current_recession:
                recovery_flags.append(True)
                in_recession_sequence = False
            else:
                recovery_flags.append(False)
                
            # Update recession sequence tracker
            if current_recession or current_growth < 0:
                in_recession_sequence = True
        
        return pd.Series(recovery_flags, index=group.index)
    
    # Apply recovery detection by group
    recovery_list = []
    for entity in df_cycle[entity_col].unique():
        entity_data = df_cycle[df_cycle[entity_col] == entity].copy()
        entity_data = entity_data.sort_values(year_col)
        
        recovery_flags = detect_recovery(
            entity_data['yoy_growth'], 
            entity_data['in_recession']
        )
        
        for idx, flag in zip(entity_data.index, recovery_flags):
            recovery_list.append({'index': idx, 'in_recovery': flag})
    
    recovery_df = pd.DataFrame(recovery_list).set_index('index')
    df_cycle['in_recovery'] = recovery_df['in_recovery']
    
    # Economic phase classification
    def classify_economic_phase(row):
        if pd.isna(row['yoy_growth']):  # No growth value (e.g., first year of a series)
            return 'Unknown'
        elif row['in_recession']:
            return 'Recession'
        elif row['in_recovery']:
            return 'Recovery'
        elif row['yoy_growth'] > 3:  # Strong growth threshold
            return 'Expansion'
        elif row['yoy_growth'] > 0:
            return 'Moderate Growth'
        else:
            return 'Contraction'
    
    df_cycle['economic_phase'] = df_cycle.apply(classify_economic_phase, axis=1)
    
    # Crisis year flags (specific years)
    crisis_years = [2008, 2009, 2020]  # Financial crisis and COVID
    df_cycle['crisis_year'] = df_cycle[year_col].isin(crisis_years)
    
    # Years since last recession
    def years_since_recession(group, year_group):
        years_since = []
        last_recession_year = None
        
        for i in range(len(group)):
            current_year = year_group.iloc[i]
            current_recession = group.iloc[i]
            
            if current_recession:
                last_recession_year = current_year
                years_since.append(0)
            else:
                if last_recession_year is not None:
                    years_since.append(current_year - last_recession_year)
                else:
                    years_since.append(np.nan)
        
        return pd.Series(years_since, index=group.index)
    
    # Apply years since recession calculation
    years_since_list = []
    for entity in df_cycle[entity_col].unique():
        entity_data = df_cycle[df_cycle[entity_col] == entity].copy()
        entity_data = entity_data.sort_values(year_col)
        
        years_since = years_since_recession(
            entity_data['in_recession'],
            entity_data[year_col]
        )
        
        for idx, years in zip(entity_data.index, years_since):
            years_since_list.append({'index': idx, 'years_since_recession': years})
    
    years_since_df = pd.DataFrame(years_since_list).set_index('index')
    df_cycle['years_since_recession'] = years_since_df['years_since_recession']
    
    return df_cycle

# Apply cycle feature calculation
df_with_cycles = calculate_cycle_features(df_with_volatility)

print("✅ Economic cycle features calculated!")
print(f"📊 New features: in_recession, in_recovery, economic_phase, crisis_year, years_since_recession")

# Economic phase distribution for latest year
latest_year = df_with_cycles['Year'].max()
phase_distribution = df_with_cycles[
    (df_with_cycles['Year'] == latest_year) & 
    (df_with_cycles['Continent'].notna())
]['economic_phase'].value_counts()

print(f"\n📊 Economic Phase Distribution ({latest_year}):")
for phase, count in phase_distribution.items():
    percentage = (count / phase_distribution.sum()) * 100
    print(f"{phase}: {count} countries ({percentage:.1f}%)")
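
The recession rule used here (a year is flagged when both it and the previous year show negative growth) can also be written without an explicit loop. The following is a rough sketch on made-up growth rates; the toy frame is hypothetical, and the ungrouped variant is included only to show why the shift has to stay within each entity.

```python
import pandas as pd

# Hypothetical growth rates for two entities, already sorted by entity and year
toy = pd.DataFrame({
    "Entity": ["A"] * 5 + ["B"] * 3,
    "Year": [2007, 2008, 2009, 2010, 2011, 2009, 2010, 2011],
    "yoy_growth": [2.0, -1.0, -3.0, 1.5, -0.5, -1.0, -2.0, 1.0],
})

negative = toy["yoy_growth"] < 0

# Previous year's growth within the same entity (NaN for each entity's first row)
prev_growth = toy.groupby("Entity")["yoy_growth"].shift(1)
toy["in_recession"] = negative & (prev_growth < 0)

# Ungrouped shift for comparison: B's first year wrongly inherits A's last year
naive = negative & (toy["yoy_growth"].shift(1) < 0)

print(toy.assign(naive=naive))
```

Comparisons against NaN evaluate to False, so the first year of every entity is never flagged, matching the loop's behavior.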

### 5. Relative Performance Features

In [None]:
def calculate_relative_features(df, entity_col='Entity', year_col='Year', value_col=None):
    """
    Calculate relative performance features vs world and continent averages
    """
    if value_col is None:
        value_col = gdp_column
    
    df_rel = df.copy()
    
    # World average by year (unweighted mean across all entities in the dataset)
    world_avg = df_rel.groupby(year_col)[value_col].mean().to_dict()
    df_rel['world_avg_gdp'] = df_rel[year_col].map(world_avg)
    
    # Continent average by year (rows without a continent mapping stay NaN)
    df_rel['continent_avg_gdp'] = df_rel.groupby([year_col, 'Continent'])[value_col].transform('mean')
    
    # Relative performance vs world
    df_rel['gdp_vs_world'] = ((df_rel[value_col] - df_rel['world_avg_gdp']) / df_rel['world_avg_gdp']) * 100
    
    # Relative performance vs continent
    df_rel['gdp_vs_continent'] = ((df_rel[value_col] - df_rel['continent_avg_gdp']) / df_rel['continent_avg_gdp']) * 100
    
    # World growth rate
    world_growth = df_rel.groupby(year_col)['yoy_growth'].mean().to_dict()
    df_rel['world_avg_growth'] = df_rel[year_col].map(world_growth)
    
    # Relative growth performance
    df_rel['growth_vs_world'] = df_rel['yoy_growth'] - df_rel['world_avg_growth']
    
    # GDP per capita percentile rank within world
    def calculate_percentile_rank(group):
        return group.rank(pct=True) * 100
    
    df_rel['world_percentile'] = df_rel.groupby(year_col)[value_col].apply(calculate_percentile_rank).reset_index(0, drop=True)
    
    # Continent percentile rank
    df_rel['continent_percentile'] = df_rel.groupby([year_col, 'Continent'])[value_col].apply(calculate_percentile_rank).reset_index([0,1], drop=True)
    
    # Income classification using illustrative GDP per capita (PPP) cut-offs
    # (not the official World Bank income groups, which are defined on GNI per capita)
    def classify_income_level(gdp):
        if pd.isna(gdp):
            return 'Unknown'
        elif gdp >= 50000:
            return 'Very High Income'
        elif gdp >= 25000:
            return 'High Income'
        elif gdp >= 10000:
            return 'Upper Middle Income'
        elif gdp >= 3000:
            return 'Lower Middle Income'
        else:
            return 'Low Income'
    
    df_rel['income_classification'] = df_rel[value_col].apply(classify_income_level)
    
    return df_rel

# Apply relative performance calculation
df_final = calculate_relative_features(df_with_cycles)

print("✅ Relative performance features calculated!")
print(f"📊 New features: world_avg_gdp, continent_avg_gdp, gdp_vs_world, gdp_vs_continent, growth_vs_world, world_percentile, continent_percentile, income_classification")

# Final dataset summary
print(f"\n📈 Final Dataset Summary:")
print(f"📊 Total rows: {len(df_final):,}")
print(f"📋 Total features: {len(df_final.columns)}")
print(f"🌍 Countries: {df_final['Entity'].nunique()}")
print(f"📅 Years: {df_final['Year'].min()} - {df_final['Year'].max()}")

# Feature list
all_features = [
    'yoy_growth', 'growth_3y', 'growth_5y', 'growth_10y', 'cagr_5y',
    'ma_3y', 'ma_5y', 'ma_10y', 'trend_vs_ma3', 'trend_vs_ma5', 'trend_vs_ma10', 'ma5_slope',
    'volatility_5y', 'cv_growth_5y', 'gdp_volatility_5y', 'risk_adjusted_performance', 'max_drawdown_5y',
    'in_recession', 'in_recovery', 'economic_phase', 'crisis_year', 'years_since_recession',
    'gdp_vs_world', 'gdp_vs_continent', 'growth_vs_world', 'world_percentile', 'continent_percentile', 'income_classification'
]

print(f"\n🔧 Engineered Features ({len(all_features)} total):")
for i, feature in enumerate(all_features, 1):
    print(f"{i:2}. {feature}")
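
To make the relative measures concrete, here is a small worked example on made-up numbers. The toy frame and the column name `gdp` are assumptions for illustration; the formulas mirror `gdp_vs_world` (percent deviation from the unweighted yearly mean) and `world_percentile` (percentile rank within the year).

```python
import pandas as pd

# Hypothetical GDP per capita values for four entities in one year
toy = pd.DataFrame({
    "Entity": ["A", "B", "C", "D"],
    "Year": [2023] * 4,
    "gdp": [1000.0, 2000.0, 3000.0, 4000.0],
})

# Unweighted mean of all entities in the same year, broadcast back to each row
year_mean = toy.groupby("Year")["gdp"].transform("mean")

# Percent deviation from that mean
toy["gdp_vs_world"] = (toy["gdp"] - year_mean) / year_mean * 100

# Percentile rank within the year (the top entity gets 100)
toy["world_percentile"] = toy.groupby("Year")["gdp"].rank(pct=True) * 100

print(toy)
```

Because the mean is unweighted, a small entity influences the benchmark as much as a large one; a population-weighted mean would be a different design choice.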

## 📊 Feature Engineering Validation and Insights

In [None]:
# Create comprehensive feature validation dashboard
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        'Growth Rate Distribution (2023)',
        'Volatility vs Performance',
        'Economic Phases Over Time',
        'Average World GDP Percentile by Continent'
    ],
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# 1. Growth rate distribution for latest year
latest_data = df_final[(df_final['Year'] == df_final['Year'].max()) & (df_final['yoy_growth'].notna())]

fig.add_trace(
    go.Histogram(
        x=latest_data['yoy_growth'],
        nbinsx=20,
        name='Growth Rate Distribution',
        marker_color='lightblue',
        opacity=0.7
    ),
    row=1, col=1
)

# 2. Volatility vs Performance scatter
scatter_data = latest_data[(latest_data['volatility_5y'].notna()) & (latest_data['cagr_5y'].notna())]

fig.add_trace(
    go.Scatter(
        x=scatter_data['volatility_5y'],
        y=scatter_data['cagr_5y'],
        mode='markers',
        text=scatter_data['Entity'],
        name='Countries',
        marker=dict(
            size=8,
            color=scatter_data['world_percentile'],
            colorscale='Viridis',
            showscale=True,
            colorbar=dict(title="World Percentile")
        )
    ),
    row=1, col=2
)

# 3. Economic phases over time (world aggregate)
phase_timeline = df_final.groupby(['Year', 'economic_phase']).size().unstack(fill_value=0)
phase_timeline_pct = phase_timeline.div(phase_timeline.sum(axis=1), axis=0) * 100

for phase in phase_timeline_pct.columns:
    fig.add_trace(
        go.Scatter(
            x=phase_timeline_pct.index,
            y=phase_timeline_pct[phase],
            mode='lines',
            name=f'{phase}',
            stackgroup='one'
        ),
        row=2, col=1
    )

# 4. Average world percentile by continent
continent_percentiles = latest_data.groupby('Continent')['world_percentile'].mean().sort_values(ascending=True)

fig.add_trace(
    go.Bar(
        x=continent_percentiles.values,
        y=continent_percentiles.index,
        orientation='h',
        name='Avg World Percentile',
        marker_color='lightcoral'
    ),
    row=2, col=2
)

fig.update_layout(
    height=800,
    title_text="Feature Engineering Validation Dashboard",
    showlegend=False
)

# Update axis titles for each subplot
fig.update_xaxes(title_text="Growth Rate (%)", row=1, col=1)
fig.update_xaxes(title_text="Volatility (5Y)", row=1, col=2)
fig.update_xaxes(title_text="Year", row=2, col=1)
fig.update_xaxes(title_text="World Percentile", row=2, col=2)

fig.update_yaxes(title_text="Frequency", row=1, col=1)
fig.update_yaxes(title_text="CAGR 5Y (%)", row=1, col=2)
fig.update_yaxes(title_text="Percentage of Countries", row=2, col=1)
fig.update_yaxes(title_text="Continent", row=2, col=2)

fig.show()

## 💾 Export Enhanced Dataset

In [None]:
# Clean and prepare final dataset for export
df_export = df_final.copy()

# Round numerical columns for cleaner output
numeric_columns = df_export.select_dtypes(include=[np.number]).columns
df_export[numeric_columns] = df_export[numeric_columns].round(3)

# Export to CSV
output_path = '../outputs/gdp_with_features.csv'
df_export.to_csv(output_path, index=False)

print(f"✅ Enhanced dataset exported to: {output_path}")
print(f"📊 Dataset dimensions: {df_export.shape}")

# Create feature summary
feature_summary = {
    'Growth Features': ['yoy_growth', 'growth_3y', 'growth_5y', 'growth_10y', 'cagr_5y'],
    'Trend Features': ['ma_3y', 'ma_5y', 'ma_10y', 'trend_vs_ma3', 'trend_vs_ma5', 'trend_vs_ma10', 'ma5_slope'],
    'Volatility Features': ['volatility_5y', 'cv_growth_5y', 'gdp_volatility_5y', 'risk_adjusted_performance', 'max_drawdown_5y'],
    'Cycle Features': ['in_recession', 'in_recovery', 'economic_phase', 'crisis_year', 'years_since_recession'],
    'Relative Features': ['gdp_vs_world', 'gdp_vs_continent', 'growth_vs_world', 'world_percentile', 'continent_percentile', 'income_classification']
}

# Export feature documentation
feature_doc = []
for category, features in feature_summary.items():
    for feature in features:
        feature_doc.append({
            'Category': category,
            'Feature': feature,
            'Description': f'Engineered feature in {category.lower()}'
        })

feature_df = pd.DataFrame(feature_doc)
feature_df.to_csv('../outputs/feature_documentation.csv', index=False)

print(f"✅ Feature documentation exported to: ../outputs/feature_documentation.csv")

# Sample of final dataset
print(f"\n📋 Sample of Enhanced Dataset:")
sample_cols = ['Entity', 'Year', gdp_column, 'yoy_growth', 'volatility_5y', 'economic_phase', 'world_percentile']
print(df_export[sample_cols].tail(10))
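
Before relying on the exported files, it may be worth confirming that every feature named in the documentation actually exists as a column. A minimal sketch follows; the helper `missing_features` and the toy inputs are hypothetical and stand in for `df_export` and `feature_summary`.

```python
import pandas as pd

def missing_features(frame, feature_summary):
    """Return documented feature names that are not columns of `frame`."""
    expected = [name for names in feature_summary.values() for name in names]
    return [name for name in expected if name not in frame.columns]

# Hypothetical stand-ins for feature_summary and df_export
toy_summary = {
    "Growth Features": ["yoy_growth", "cagr_5y"],
    "Cycle Features": ["in_recession"],
}
toy_frame = pd.DataFrame({"Entity": ["A"], "yoy_growth": [1.0], "cagr_5y": [2.0]})

missing = missing_features(toy_frame, toy_summary)
total_documented = sum(len(names) for names in toy_summary.values())

print(f"Documented features: {total_documented}")
print(f"Missing from frame: {missing}")
```

An empty list would indicate that the documentation and the dataset agree on feature names.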

## 📈 Feature Engineering Summary

### 🎯 **Objectives Achieved**

✅ **Growth Analysis**: Created YoY, 3Y, 5Y, 10Y growth rates and CAGR metrics  
✅ **Trend Detection**: Implemented moving averages and trend indicators  
✅ **Risk Assessment**: Calculated volatility, max drawdown, and risk-adjusted performance  
✅ **Economic Cycles**: Built recession detection and economic phase classification  
✅ **Relative Performance**: Created world and continent benchmarking features  

### 📊 **Key Features Created (28 total)**

1. **Growth Features (5)**: YoY growth, multi-year growth rates, CAGR
2. **Trend Features (7)**: Moving averages, trend indicators, slope analysis
3. **Volatility Features (5)**: Risk measures, drawdown analysis, performance ratios
4. **Cycle Features (5)**: Recession detection, economic phases, crisis indicators
5. **Relative Features (6)**: World/continent benchmarks, percentile rankings

### 🔍 **Key Insights from Feature Engineering**

- **Economic Phases**: Automated classification of expansion, recession, recovery periods
- **Risk-Return Profile**: Countries can be ranked by risk-adjusted performance
- **Relative Positioning**: Each country's performance vs global and regional benchmarks
- **Crisis Flags**: Fixed indicator for the predefined crisis years 2008, 2009, and 2020
- **Trend Analysis**: Multi-timeframe trend detection and momentum indicators

### 📁 **Outputs**

- `../outputs/gdp_with_features.csv` - Enhanced dataset with all engineered features
- `../outputs/feature_documentation.csv` - Complete feature documentation

---

**🚀 Next Steps**: Use these features for advanced economic analysis, country clustering, predictive modeling, and investment decision-making frameworks.