# Emerging Markets Factor Model: Data Acquisition

## 📈 Project Overview

This notebook is the first step in building a comprehensive **factor model for Emerging Markets (EM) equity performance**. We will collect and prepare high-quality financial data from Bloomberg's BQL API to understand how macroeconomic variables drive EM equity returns across different regions and market cycles.

## 🎯 Objectives

1. **Data Collection**: Extract 10 years of daily data for 9 equity indices (8 EM plus 1 US benchmark) and 8 macro factors (7 Bloomberg series plus 1 derived term spread)
2. **Data Quality**: Ensure robust data handling with forward-filling and validation
3. **Preparation**: Structure data for subsequent factor modeling analysis
4. **Documentation**: Maintain clear data lineage and transformation logic

## 🗺️ Methodology

- **EM Universe**: 8 emerging market indices covering Latin America, Asia, and Africa, plus MSCI USA as a developed-market benchmark
- **Macro Factors**: 7 Bloomberg series plus a derived term spread, covering rates, currency, commodities, volatility, and credit
- **Time Horizon**: 10-year lookback for statistical robustness
- **Data Source**: Bloomberg Professional via BQL API for institutional-grade data quality

## 📊 Expected Outputs

- **Combined Dataset**: Single CSV file with aligned EM and macro time series
- **Data Validation**: Quality metrics and completeness statistics
- **Foundation**: Clean dataset ready for PCA factor modeling and regression analysis

## 📦 Import Required Libraries

We need the following libraries for our data acquisition pipeline:
- **pandas**: Data manipulation and analysis framework
- **bql**: Bloomberg Query Language API for institutional data access
- **os**: Operating system interface for file management

In [None]:
# =============================================================================
# LIBRARY IMPORTS
# =============================================================================
import pandas as pd  # Data manipulation and analysis
import os           # Operating system interface
import bql          # Bloomberg Query Language API

In [None]:
# =============================================================================
# BLOOMBERG BQL HELPER FUNCTION
# =============================================================================
def fetch_bql_series(assets, bql_service, date_range):
    """
    Fetches time series data for given assets using Bloomberg Query Language (BQL).
    
    This function provides a streamlined interface to Bloomberg's institutional data,
    handling multiple asset queries efficiently and returning clean pandas DataFrames.

    Parameters:
    -----------
    assets : dict
        Dictionary mapping asset names to Bloomberg tickers
        Example: {'Brazil': 'MXBR Index', 'USD_Index': 'DXY Curncy'}
    bql_service : bql.Service
        Authenticated Bloomberg Query Language service instance
    date_range : bql.func.range
        Bloomberg date range object (e.g., bq.func.range('-10Y', '0D'))

    Returns:
    --------
    pd.DataFrame
        DataFrame with datetime index and columns for each asset
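        Missing observations are forward-filled by passing fill='prev' to the query
        
    Notes:
    ------
    - Uses PX_LAST field for consistent daily closing values
    - Forward-fill is requested explicitly in the data item (fill='prev')
    - Follows the common BQL request pattern (data item -> bql.Request -> execute);
      verify the response column names against your BQL version
    """
    data = {}
    
    # Last price over the requested date range, carrying the previous value forward
    px_last = bql_service.data.px_last(dates=date_range, fill='prev')
    
    # Iterate through each asset and fetch its time series
    for asset_name, ticker in assets.items():
        # Build and execute the BQL request for this ticker
        request = bql.Request(ticker, {'PX_LAST': px_last})
        response = bql_service.execute(request)
        
        # The response frame is indexed by ticker with a DATE column; re-index by date
        frame = response[0].df()
        data[asset_name] = frame.set_index('DATE')['PX_LAST']
    
    # Combine all series into a single DataFrame aligned on the date index
    return pd.DataFrame(data)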

## 🌏 Emerging Markets Index Universe

### Strategic Selection Criteria

Our EM index selection follows institutional best practices for factor modeling:

| **Region** | **Country** | **Index** | **Economic Profile** | **Key Sectors** |
|------------|-------------|-----------|---------------------|-----------------|
| **Latin America** | Brazil | MXBR Index | Commodity superpower, largest LatAm economy | Mining, agriculture, energy, financials |
| | Mexico | MXMX Index | USMCA integration, manufacturing hub | Industrials, consumer goods, telecom |
| **Asia Pacific** | China | MXCN Index | Global manufacturing leader, 2nd largest economy | Technology, real estate, financials |
| | India | MXIN Index | Demographic dividend, services powerhouse | IT services, pharmaceuticals, financials |
| | Taiwan | TAMSCI Index | Semiconductor manufacturing capital | Technology hardware, semiconductors |
| | Korea | MXKR Index | Advanced EM, export-oriented economy | Technology, automotive, shipbuilding |
| | Indonesia | MXID Index | Southeast Asian commodity exporter | Mining, palm oil, banking |
| **Africa** | South Africa | MXZA Index | African financial hub, resource-rich | Mining, precious metals, banking |
| **Benchmark** | United States | MXUS Index | Developed market reference point | Technology, healthcare, financials |

### Diversification Benefits
- **Geographic**: 4 continents, 8 time zones, multiple currency exposures
- **Economic Structure**: From commodity exporters to technology leaders
- **Development Spectrum**: Traditional EM to advanced emerging economies
- **Market Cap**: Large, liquid markets with institutional accessibility
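
Before any Bloomberg query is sent, the universe definition itself can be sanity-checked offline. The sketch below is a hypothetical pre-flight helper (`validate_universe` and `YELLOW_KEYS` are illustrative names, not part of this notebook or of BQL); it only checks that each ticker ends in a Bloomberg yellow-key suffix and that no ticker is listed twice.

```python
# Hypothetical pre-flight check; the names below are illustrative, not from the notebook.
YELLOW_KEYS = {"Index", "Curncy", "Comdty", "Equity", "Govt", "Corp"}


def validate_universe(assets):
    """Return a list of human-readable problems in a name -> ticker mapping."""
    problems = []
    first_owner = {}  # ticker -> first asset name that used it
    for name, ticker in assets.items():
        parts = ticker.rsplit(" ", 1)
        if len(parts) != 2 or parts[1] not in YELLOW_KEYS:
            problems.append(f"{name}: '{ticker}' has no recognised yellow-key suffix")
        if ticker in first_owner:
            problems.append(f"{name}: same ticker as {first_owner[ticker]}")
        first_owner.setdefault(ticker, name)
    return problems


# Same tickers as the table above
em_universe = {
    "Brazil": "MXBR Index",
    "Mexico": "MXMX Index",
    "China": "MXCN Index",
    "India": "MXIN Index",
    "Taiwan": "TAMSCI Index",
    "Korea": "MXKR Index",
    "Indonesia": "MXID Index",
    "SouthAfrica": "MXZA Index",
    "US": "MXUS Index",
}

problems = validate_universe(em_universe)
em_only = [name for name in em_universe if name != "US"]
print(f"{len(em_only)} EM indices + 1 benchmark, {len(problems)} problems")
```

A check like this does not confirm that a ticker exists on Bloomberg; it only catches formatting slips and accidental duplicates before the data pull.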

In [None]:
# =============================================================================
# DATA EXPORT CONFIGURATION & ASSET UNIVERSE DEFINITION
# =============================================================================

# Define output path for the combined dataset
output_path = '../data/combined_em_macro_data.csv'

print("🔧 Configuration Summary:")
print(f"   📂 Output Path: {output_path}")
print(f"   🌍 Setting up comprehensive EM and macro factor universe...")


# =============================================================================
# EMERGING MARKETS EQUITY INDICES (MSCI)
# =============================================================================
# Using MSCI indices for consistency and institutional benchmarking
em_assets = {
    # LATIN AMERICA - Commodity-driven economies with US trade linkages
    'Brazil': 'MXBR Index',        # MSCI Brazil - LatAm largest economy
    'Mexico': 'MXMX Index',        # MSCI Mexico - USMCA integration
    
    # ASIA PACIFIC - Technology and manufacturing powerhouses
    'India': 'MXIN Index',         # MSCI India - South Asian growth market
    'China': 'MXCN Index',         # MSCI China - East Asian manufacturing hub
    'Taiwan': 'TAMSCI Index',      # Taiwan MSCI - Technology manufacturing
    'Korea': 'MXKR Index',         # MSCI Korea - Advanced EM market
    'Indonesia': 'MXID Index',     # MSCI Indonesia - SE Asian commodity exporter
    
    # AFRICA - Resource-rich market (no Middle East index is included)
    'SouthAfrica': 'MXZA Index',   # MSCI South Africa - African gateway
    
    # DEVELOPED MARKET BENCHMARK - For relative performance analysis
    'US': 'MXUS Index'             # MSCI USA - Developed market benchmark
}

## 📈 Macroeconomic Factor Universe

### Factor Selection Framework

Our macro factor selection captures the **primary drivers of EM performance** based on academic research and practitioner insights:

| **Category** | **Factor** | **Ticker** | **EM Transmission Mechanism** | **Expected Relationship** |
|--------------|------------|------------|--------------------------------|---------------------------|
| **Monetary Policy** | US 2Y Yield | USGG2YR Index | Fed policy rate proxy, affects carry trades | **Negative**: Higher rates → EM capital outflow |
| | US 10Y Yield | USGG10YR Index | Risk-free rate benchmark, discount factor | **Negative**: Higher yields → EM underperformance |
| | Term Spread | (10Y - 2Y) | Yield curve slope, growth expectations | **Mixed**: Steepening can signal growth or inflation |
| **Risk Appetite** | VIX | VIX Index | Global risk sentiment and volatility | **Negative**: High volatility → EM risk-off |
| | Credit Spreads | CSI BB Index | Corporate credit risk premium | **Negative**: Wide spreads → EM funding stress |
| **Currency** | USD Index | DXY Curncy | Dollar strength vs. major currencies | **Negative**: Strong USD → EM competitiveness loss |
| **Commodities** | Brent Oil | CO1 Comdty | Global energy prices | **Mixed**: Positive for exporters, negative for importers |
| | Copper | LMCADY Comdty | Industrial demand and growth proxy | **Positive**: Higher copper → stronger EM growth |
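
The "Expected Relationship" column can be turned into a simple sign check once the data is available. The sketch below shows the mechanics on synthetic data only: the factor changes are random draws, and the EM return is constructed with loadings that match the table, so the result says nothing about real markets. The variable names (`expected_sign`, `sign_matches`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
dates = pd.bdate_range("2022-01-03", periods=n)

# Synthetic daily factor changes (illustrative only, not market data)
factors = pd.DataFrame(
    rng.standard_normal((n, 3)),
    index=dates,
    columns=["VIX", "USD_Index", "Copper"],
)

# Synthetic EM return built with loadings that follow the table's expected signs
em_return = (
    -0.5 * factors["VIX"]
    - 0.3 * factors["USD_Index"]
    + 0.4 * factors["Copper"]
    + 0.2 * rng.standard_normal(n)
)

# Expected direction of each relationship, taken from the table above
expected_sign = {"VIX": -1, "USD_Index": -1, "Copper": 1}

# Correlation of each factor column with the EM return series
correlations = factors.corrwith(em_return)
sign_matches = {
    name: bool(np.sign(correlations[name]) == sign)
    for name, sign in expected_sign.items()
}
print(correlations.round(2))
print(sign_matches)
```

On the real dataset the inputs would be returns or first differences rather than price levels, since correlations between trending levels are not meaningful.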

### Academic Foundation
- **Bekaert & Harvey (2017)**: EM factor models and global integration
- **Miranda-Agrippino & Rey (2020)**: Global financial cycle and EM flows
- **Passari & Rey (2015)**: Financial flows and the international monetary system

In [None]:
# =============================================================================
# MACROECONOMIC FACTORS - MULTI-ASSET CLASS APPROACH
# =============================================================================
# Comprehensive factor set covering monetary policy, risk sentiment, 
# currency dynamics, and commodity cycles
macro_assets = {
    # CURRENCY
    'USD_Index': 'DXY Curncy',         # US Dollar Index - currency strength
    
    # MONETARY POLICY & RATES
    'US_10Y_Yield': 'USGG10YR Index', # 10Y Treasury - long-term risk-free rate
    'US_2Y_Yield': 'USGG2YR Index',   # 2Y Treasury - Fed policy proxy
    
    # RISK SENTIMENT & VOLATILITY  
    'VIX': 'VIX Index',                # CBOE Volatility Index - fear gauge
    'BAA_spread': 'CSI BB Index',      # Credit spread - risk premium (ticker is a BB series; key name kept as-is for downstream use)
    
    # COMMODITIES & REAL ASSETS
    'Oil_Brent': 'CO1 Comdty',        # Brent crude oil - energy prices
    'Copper': 'LMCADY Comdty'         # LME copper - industrial demand proxy
}

# =============================================================================
# FINAL CONFIGURATION SUMMARY
# =============================================================================
print("✅ Asset Universe Configuration Complete:")
print(f"   📂 Output Path: {output_path}")
print(f"   🌍 EM Markets: {len(em_assets)} total ({len(em_assets)-1} EM + 1 DM benchmark)")
print(f"   📊 Macro Factors: {len(macro_assets)} base factors")
print(f"   🔍 Derived Factors: 1 (Term Spread = 10Y - 2Y)")
print(f"   📈 Total Variables: {len(em_assets) + len(macro_assets) + 1}")
print(f"   🎯 Ready for Bloomberg data collection pipeline...")

In [None]:
# =============================================================================
# BLOOMBERG DATA ACQUISITION PIPELINE
# =============================================================================

print("🔗 Establishing Bloomberg BQL Connection...")
print("   • Authenticating with Bloomberg Professional services")
print("   • Initializing BQL query interface")
bq = bql.Service()  # Create authenticated Bloomberg service instance

# Configure 10-year lookback period for robust statistical analysis
print("\n📅 Setting Data Collection Parameters...")
date_range = bq.func.range('-10Y', '0D')  # 10 years back from today
print(f"   • Time Horizon: 10-year lookback for statistical robustness")
print(f"   • Frequency: Daily closing prices (business days)")
print(f"   • Data Quality: Bloomberg native forward-fill for missing values")

# =============================================================================
# STEP 1: EMERGING MARKETS EQUITY DATA COLLECTION
# =============================================================================
print("\n🌍 Step 1: Fetching EM Equity Index Data...")
print(f"   • Markets: {list(em_assets.keys())}")
print(f"   • Data Field: PX_LAST (daily closing levels)")
em_df = fetch_bql_series(em_assets, bq, date_range)

# Validate EM data collection
print(f"\n✅ EM Data Collection Complete:")
print(f"   📊 Shape: {em_df.shape[0]} observations × {em_df.shape[1]} indices")
print(f"   📅 Period: {em_df.index.min().strftime('%Y-%m-%d')} to {em_df.index.max().strftime('%Y-%m-%d')}")
print(f"   🕰️ Trading Days: {len(em_df)} business days")

# =============================================================================
# STEP 2: MACROECONOMIC FACTORS DATA COLLECTION
# =============================================================================
print(f"\n📈 Step 2: Fetching Macroeconomic Factor Data...")
print(f"   • Base Factors: {list(macro_assets.keys())}")
print(f"   • Asset Classes: Rates, FX, Commodities, Volatility, Credit")
macro_df = fetch_bql_series(macro_assets, bq, date_range)

# =============================================================================
# STEP 3: FEATURE ENGINEERING - TERM SPREAD CALCULATION
# =============================================================================
print(f"\n🔧 Step 3: Engineering Derived Factors...")
# Calculate yield curve term spread as growth/policy indicator
macro_df['Term_Spread'] = macro_df['US_10Y_Yield'] - macro_df['US_2Y_Yield']
print(f"   • Term Spread: US 10Y - 2Y yields (yield curve slope)")
print(f"   • Economic Significance: Growth expectations vs. policy stance")

# Validate macro data collection
print(f"\n✅ Macro Data Collection Complete:")
print(f"   📊 Shape: {macro_df.shape[0]} observations × {macro_df.shape[1]} factors")
print(f"   📈 Total Factors: {len(macro_assets)} base + 1 derived = {macro_df.shape[1]}")

# =============================================================================
# STEP 4: DATA INTEGRATION & QUALITY ASSURANCE
# =============================================================================
print(f"\n🔄 Step 4: Data Integration & Quality Control...")
print(f"   • Aligning EM and macro datasets on trading calendar")
print(f"   • Performing inner join to ensure data completeness")

# Merge datasets on datetime index with an explicit inner join for data integrity
combined_df = pd.merge(em_df, macro_df, left_index=True, right_index=True, how='inner')
rows_before_dropna = len(combined_df)
missing_before_dropna = int(combined_df.isnull().sum().sum())
combined_df = combined_df.sort_index().dropna()  # Remove rows with any remaining missing values
rows_dropped = rows_before_dropna - len(combined_df)

# =============================================================================
# STEP 5: COMPREHENSIVE DATA VALIDATION
# =============================================================================
print(f"\n📊 Final Dataset Validation:")
print(f"   📈 Combined Shape: {combined_df.shape[0]} observations × {combined_df.shape[1]} variables")
print(f"   📅 Final Period: {combined_df.index.min().strftime('%Y-%m-%d')} to {combined_df.index.max().strftime('%Y-%m-%d')}")
print(f"   🌍 Equity Indices: {len(em_assets) - 1} EM + 1 DM benchmark")
print(f"   📊 Macro Factors: {len(macro_assets) + 1} total (inc. term spread)")
# Completeness is measured against the pre-dropna() frame; afterwards it is 100% by construction
print(f"   🔍 Row Completeness: {(len(combined_df) / max(rows_before_dropna, 1) * 100):.2f}% ({rows_dropped} rows dropped)")
print(f"   ❌ Missing Values Before Cleaning: {missing_before_dropna} total")

# =============================================================================
# STEP 6: DATA EXPORT & PRESERVATION
# =============================================================================
print(f"\n💾 Step 6: Data Export & Preservation...")
# Ensure output directory exists
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Export to CSV with datetime index preservation
combined_df.to_csv(output_path)
print(f"✅ Dataset successfully exported to: {output_path}")
print(f"   📁 File Format: CSV with datetime index")
print(f"   🔗 Ready for: Factor modeling, PCA analysis, regression modeling")

# Final summary statistics
print(f"\n📋 Dataset Summary Statistics:")
print(f"   • Memory Usage: {combined_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"   • Index Type: {type(combined_df.index).__name__}")
print(f"   • Data Types: {dict(combined_df.dtypes.value_counts())}")
print(f"\n🎯 Data Pipeline Complete - Ready for Factor Analysis!")

## 🔄 Data Collection & Processing Workflow

### Bloomberg BQL Integration

Our data pipeline leverages **Bloomberg Query Language (BQL)** for institutional-grade data quality:

1. **Connection**: Establish authenticated session with Bloomberg Professional
2. **Date Range**: 10-year lookback for statistical robustness (`-10Y` to `0D`)
3. **Field Selection**: `PX_LAST` (last price) for consistent daily closing values
4. **Data Quality**: Bloomberg's native forward-fill handles missing values

### Processing Steps

| **Step** | **Action** | **Purpose** | **Quality Control** |
|----------|------------|-------------|---------------------|
| 1️⃣ | **EM Data Fetch** | Retrieve 9 equity index time series | Validate date alignment |
| 2️⃣ | **Macro Data Fetch** | Retrieve 7 base macro factor series | Check data completeness |
| 3️⃣ | **Feature Engineering** | Calculate Term Spread (10Y - 2Y) | Validate spread calculation |
| 4️⃣ | **Data Alignment** | Merge EM and macro datasets on dates | Ensure consistent calendar |
| 5️⃣ | **Quality Assurance** | Remove incomplete observations | Final validation checks |
| 6️⃣ | **Export** | Save to CSV for downstream analysis | Preserve data lineage |
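
Steps 3 to 6 of the table can be exercised without a Bloomberg connection. The sketch below is a minimal illustration on made-up numbers: the two small frames stand in for the fetched EM and macro data, and the temporary output path is an assumption used only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd

dates = pd.bdate_range("2024-01-01", periods=6)

# Stand-ins for the fetched frames (values are made up)
em_df = pd.DataFrame(
    {"Brazil": [100.0, 101.0, np.nan, 103.0, 104.0, 105.0]},
    index=dates,
)
macro_df = pd.DataFrame(
    {
        "US_10Y_Yield": [4.1, 4.2, 4.1, 4.3, 4.2],
        "US_2Y_Yield": [4.4, 4.4, 4.3, 4.4, 4.2],
    },
    index=dates[1:],  # macro history starts one day later
)

# Step 3: derived factor
macro_df["Term_Spread"] = macro_df["US_10Y_Yield"] - macro_df["US_2Y_Yield"]

# Step 4: inner join keeps only dates present in both frames
combined = pd.merge(em_df, macro_df, left_index=True, right_index=True, how="inner")
rows_before = len(combined)

# Step 5: drop rows that still contain a missing value
combined = combined.sort_index().dropna()

# Step 6: export with the datetime index as the first CSV column
path = os.path.join(tempfile.mkdtemp(), "combined_sample.csv")
combined.to_csv(path)

print(f"{rows_before} rows after join, {len(combined)} rows after dropna")
```

Recording the row count before `dropna()` makes it visible how many observations the quality-assurance step removes, which a post-cleaning missing-value count cannot show.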

### Technical Specifications
- **Data Frequency**: Daily (business days only)
- **Missing Value Treatment**: Forward-fill via Bloomberg
- **Date Alignment**: Inner join on trading calendar
- **Output Format**: CSV with datetime index
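
For readers without Bloomberg access, the forward-fill and CSV conventions listed above can be reproduced in pandas. This is a sketch under assumptions: `ffill(limit=1)` is a pandas stand-in for a source-side fill, the one-day limit is an illustrative choice rather than something this notebook specifies, and the `DATE` index name is hypothetical.

```python
import io

import numpy as np
import pandas as pd

dates = pd.bdate_range("2024-03-04", periods=5)
prices = pd.DataFrame(
    {"Korea": [100.0, np.nan, np.nan, 103.0, 104.0]},
    index=dates,
)
prices.index.name = "DATE"

# Forward-fill, carrying a stale value for at most one business day
filled = prices.ffill(limit=1)

# Round-trip through CSV (in memory) and restore the datetime index on load
buffer = io.StringIO()
filled.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col="DATE", parse_dates=True)

print(filled)
print(type(reloaded.index).__name__)
```

Downstream notebooks that read the exported file need `index_col` and `parse_dates` (or an equivalent conversion), because a CSV stores the index as plain text.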