# Rate-Limited Market Cap Filtered Data Collection (Year by Year)

This notebook collects financial data for **US stocks with market cap > $1B** with strict rate limiting and saves each year separately.

**Key features:**
1. **Market cap filter**: Only collects data for stocks with market cap > $1B
2. **Rate limiting**: 750 API calls per minute
3. **Year-by-year collection**: Each year saved to separate CSV
4. **Deduplication**: Ensures only one entry per quarter (Q1-Q4)
5. **Enhanced metrics**: Includes book-to-market and earnings yield
6. **Error tracking**: Detailed logs for debugging

**Target columns:** `quarter`, `ticker`, `industry`, `sector`, `debt_to_assets`, `mkt_cap`, `stock_price`, `book_to_market`, `earnings_yield`, `mkt_cap_rank`

**Time estimate:** ~70 minutes per year (vs 2.7 hours without filtering)

## 🚀 Optimizations Applied

This notebook has been optimized with three key improvements:

### 1. **Removed Extra Sleep Delays**
- The `get_json()` function already enforces rate limiting with sleep delays
- Removed redundant sleep in `collect_year_data()` to avoid double-waiting
- **Impact**: ~8% faster execution per ticker

### 2. **Batch Processing with Bulk Endpoints**
- Uses `/api/v3/profile/{ticker1,ticker2,...}` to fetch up to 50 company profiles in one API call
- Processes tickers in batches of 50 instead of one-by-one
- **Impact**: Reduces profile API calls by 98% (from 50 calls to 1 call per batch)

### 3. **Year-Specific Ticker Lists**
- Fetches ticker lists from the end of the previous year (e.g., Dec 31, 2022 for 2023 data)
- Ensures we only process stocks that actually existed in each historical year
- Falls back to current ticker list if historical data unavailable
- **Impact**: More accurate historical analysis, avoids processing stocks that didn't exist yet

### 📊 Expected Performance Improvements:
- **Time savings**: ~30-40% faster overall
- **API efficiency**: Better utilization of rate limits
- **Data accuracy**: Historical ticker lists match the actual market composition


# ⚠️ MARKET CAP FILTERED COLLECTION

**This notebook filters for stocks with market cap > $1B**

**Time requirements:**
- **Per year:** ~70 minutes (with market cap filtering)
- **All 6 years (2019-2024):** ~7 hours total
- **API calls:** ~21,000 per year (vs 48,000 without filtering)
- **Expected companies:** ~1,800 per year (15-20% of all tickers)

**To start with a smaller test:**
1. Change `MAX_TICKERS = None` to `MAX_TICKERS = 100` in any collection cell
2. Run one year first to verify everything works
3. Then change back to `MAX_TICKERS = None` for full collection

---

## Helper Functions with Rate Limiting

In [28]:
import requests
import pandas as pd
import time
from typing import Optional, List, Dict, Any, Tuple
from datetime import datetime, timedelta
import json
import os
from dotenv import load_dotenv
import os
load_dotenv(".env")
API = os.getenv("API")  

# Rate limiting configuration
API_CALLS_PER_MINUTE = 750
SECONDS_PER_CALL = 60 / API_CALLS_PER_MINUTE  # 0.08 seconds per call

# Session and timer for rate limiting
session = requests.Session()
LAST_API_CALL = 0.0

# Market cap threshold (1 billion)
MARKET_CAP_THRESHOLD = 1e9

print(f"Rate limit configured: {API_CALLS_PER_MINUTE} calls/minute ({SECONDS_PER_CALL:.2f} seconds/call)")
print(f"Market cap filter: > ${MARKET_CAP_THRESHOLD/1e9:.0f}B")

Rate limit configured: 750 calls/minute (0.08 seconds/call)
Market cap filter: > $1B


In [29]:
API

'7cNMpVzb43GKtm05iRTDWJtyJXSylX8J'

In [30]:
def get_json(url: str, params: Dict[str, Any] = {}) -> Optional[Any]:
    """Safely get JSON data from API with error handling and rate limit retry"""
    global LAST_API_CALL, session
    try:
        params['apikey'] = API
        elapsed = time.time() - LAST_API_CALL
        if elapsed < SECONDS_PER_CALL:
            time.sleep(SECONDS_PER_CALL - elapsed)
        response = session.get(url, params=params, timeout=10)
        LAST_API_CALL = time.time()
        if response.status_code == 429:
            print('⚠️  Rate limit hit! Waiting 30 seconds...')
            time.sleep(30)
            return get_json(url, params)
        response.raise_for_status()
        js = response.json()
        if isinstance(js, dict) and 'historical' in js:
            return js['historical']
        elif isinstance(js, list):
            return js
        else:
            return js
    except requests.exceptions.HTTPError as e:
        print(f'HTTP Error {e.response.status_code}: {e}')
        return None
    except Exception as e:
        print(f'Error fetching data: {e}')
        return None


In [31]:
def check_market_cap(ticker: str, year: int, precomputed: Optional[float] = None) -> Tuple[bool, Optional[float]]:
    """Check if ticker had market cap above threshold in given year"""
    if precomputed is not None:
        return precomputed > MARKET_CAP_THRESHOLD, precomputed
    try:
        start_date = f'{year}-01-01'
        end_date = f'{year}-12-31'
        mc_data = get_json(
            f'https://financialmodelingprep.com/api/v3/historical-market-capitalization/{ticker}',
            {'from': start_date, 'to': end_date}
        )
        if not mc_data:
            return False, None
        mc_df = pd.DataFrame(mc_data)
        avg_market_cap = mc_df['marketCap'].mean()
        return avg_market_cap > MARKET_CAP_THRESHOLD, avg_market_cap
    except Exception as e:
        print(f'Error checking market cap for {ticker}: {e}')
        return False, None


In [32]:
def get_bulk_profiles(tickers: List[str]) -> Dict[str, Any]:
    """Fetch company profiles in bulk."""
    data = get_json(f'https://financialmodelingprep.com/api/v3/profile/{','.join(tickers)}')
    profiles = {}
    if isinstance(data, list):
        for item in data:
            symbol = item.get('symbol')
            profiles[symbol] = item
    return profiles

def get_bulk_market_caps(tickers: List[str], year: int) -> Dict[str, float]:
    """Fetch average market cap for a list of tickers."""
    start_date = f'{year}-01-01'
    end_date = f'{year}-12-31'
    data = get_json(
        f'https://financialmodelingprep.com/api/v3/historical-market-capitalization/{','.join(tickers)}',
        {'from': start_date, 'to': end_date}
    )
    caps = {}
    if isinstance(data, dict):
        for symbol, hist in data.items():
            df = pd.DataFrame(hist)
            caps[symbol] = df['marketCap'].mean()
    return caps


In [33]:
def process_ticker_year(ticker: str, year: int, profile_data: Optional[Dict[str, Any]] = None, avg_market_cap: Optional[float] = None) -> Tuple[Optional[pd.DataFrame], Dict[str, Any], int]:
    """Process data for a single ticker for a specific year"""
    error_log = {'ticker': ticker, 'year': year, 'errors': []}
    api_calls = 0
    try:
        is_large_cap, avg_market_cap = check_market_cap(ticker, year, precomputed=avg_market_cap)
        if avg_market_cap is None:
            api_calls += 1
        if not is_large_cap:
            error_log['errors'].append(f'Market cap below threshold (avg: ${avg_market_cap:,.0f})')
            return None, error_log, api_calls
        start_date = datetime(year, 1, 1)
        end_date = datetime(year, 12, 31)
        bs = get_json(f'https://financialmodelingprep.com/api/v3/balance-sheet-statement/{ticker}', {'period': 'quarter', 'limit': 20})
        api_calls += 1
        inc = get_json(f'https://financialmodelingprep.com/api/v3/income-statement/{ticker}', {'period': 'quarter', 'limit': 20})
        api_calls += 1
        mc = get_json(f'https://financialmodelingprep.com/api/v3/historical-market-capitalization/{ticker}', {'from': start_date.strftime('%Y-%m-%d'), 'to': end_date.strftime('%Y-%m-%d')})
        api_calls += 1
        px = get_json(f'https://financialmodelingprep.com/api/v3/historical-price-full/{ticker}', {'from': start_date.strftime('%Y-%m-%d'), 'to': end_date.strftime('%Y-%m-%d')})
        api_calls += 1
        if profile_data is None:
            profile = get_json(f'https://financialmodelingprep.com/api/v3/profile/{ticker}')
            api_calls += 1
        else:
            profile = [profile_data] if isinstance(profile_data, dict) else profile_data
        if not all([bs, inc, mc, px, profile]):
            if not bs: error_log['errors'].append('No balance sheet data')
            if not inc: error_log['errors'].append('No income statement data')
            if not mc: error_log['errors'].append('No market cap data')
            if not px: error_log['errors'].append('No price data')
            if not profile: error_log['errors'].append('No profile data')
            return None, error_log, api_calls
        industry = profile[0].get('industry', 'Unknown') if profile and len(profile) > 0 else 'Unknown'
        sector = profile[0].get('sector', 'Unknown') if profile and len(profile) > 0 else 'Unknown'
        bs_df = pd.DataFrame(bs)
        bs_df['date'] = pd.to_datetime(bs_df['date'])
        bs_df = bs_df[bs_df['date'].dt.year == year]
        if len(bs_df) == 0:
            error_log['errors'].append(f'No balance sheet data for year {year}')
            return None, error_log, api_calls
        bs_df = (bs_df[['date', 'shortTermDebt', 'longTermDebt', 'totalAssets', 'totalStockholdersEquity', 'commonStock']]
            .assign(quarter=lambda d: d.date.dt.to_period('Q'),
                    debt_to_assets=lambda d: ((d.shortTermDebt.fillna(0) + d.longTermDebt.fillna(0)) / d.totalAssets.replace(0, pd.NA)),
                    book_value=lambda d: d.totalStockholdersEquity)
            .dropna(subset=['debt_to_assets'])
            .sort_values('date')
            .drop_duplicates('quarter', keep='last'))
        inc_df = pd.DataFrame(inc)
        inc_df['date'] = pd.to_datetime(inc_df['date'])
        inc_df = inc_df[inc_df['date'].dt.year == year]
        inc_df = (inc_df[['date', 'eps', 'weightedAverageShsOut']]
            .assign(quarter=lambda d: d.date.dt.to_period('Q'))
            .rename(columns={'weightedAverageShsOut': 'shares_outstanding'})
            .sort_values('date')
            .drop_duplicates('quarter', keep='last'))
        mc_df = (pd.DataFrame(mc)
            .assign(date=lambda d: pd.to_datetime(d.date), quarter=lambda d: d.date.dt.to_period('Q'))
            .sort_values('date')
            .drop_duplicates('quarter', keep='last')
            .rename(columns={'marketCap': 'mkt_cap'})[['quarter', 'mkt_cap']])
        px_df = (pd.DataFrame(px)
            .assign(date=lambda d: pd.to_datetime(d.date), quarter=lambda d: d.date.dt.to_period('Q'))
            .sort_values('date')
            .drop_duplicates('quarter', keep='last')
            .rename(columns={'adjClose': 'stock_price'})[['quarter', 'stock_price']])
        merged = (bs_df.merge(inc_df, on='quarter', how='left')
                     .merge(mc_df, on='quarter', how='left')
                     .merge(px_df, on='quarter', how='left'))
        merged = merged.assign(ticker=ticker, industry=industry, sector=sector,
                               book_to_market=lambda d: (d.book_value / d.shares_outstanding) / d.stock_price,
                               earnings_yield=lambda d: d.eps / d.stock_price)
        merged = merged[['quarter', 'ticker', 'industry', 'sector', 'debt_to_assets', 'mkt_cap', 'stock_price', 'book_to_market', 'earnings_yield']].dropna()
        valid_quarters = [f'{year}Q1', f'{year}Q2', f'{year}Q3', f'{year}Q4']
        merged = merged[merged['quarter'].astype(str).isin(valid_quarters)]
        if len(merged) == 0:
            error_log['errors'].append('No valid data after merging')
            return None, error_log, api_calls
        return merged, error_log, api_calls
    except Exception as e:
        error_log['errors'].append(f'Exception: {str(e)}')
        return None, error_log, api_calls


In [34]:
def collect_year_data(tickers: List[str], year: int, max_tickers: Optional[int] = None, 
                     save_progress: bool = True, progress_interval: int = 100, batch_size: int = 50) -> Tuple[pd.DataFrame, List[Dict]]:
    """Collect data for multiple tickers for a specific year with optimized batch processing"""
    all_data = []
    all_errors = []
    successful_tickers = []
    failed_tickers = []
    skipped_tickers = []
    total_api_calls = 0
    
    tickers_to_process = tickers[:max_tickers] if max_tickers else tickers
    total_tickers = len(tickers_to_process)
    
    print(f"\n{'='*70}")
    print(f"  COLLECTING DATA FOR YEAR {year}")
    print(f"{'='*70}")
    print(f"Total tickers to check: {total_tickers}")
    print(f"Market cap filter: >${MARKET_CAP_THRESHOLD/1e9:.0f}B")
    print(f"API rate limit: {API_CALLS_PER_MINUTE} calls/minute")
    print(f"Batch size: {batch_size} tickers")
    print(f"Progress saves: Every {progress_interval} tickers")
    print(f"{'='*70}\n")
    
    start_time = time.time()
    
    # Process tickers in batches
    for batch_start in range(0, total_tickers, batch_size):
        batch_end = min(batch_start + batch_size, total_tickers)
        batch_tickers = tickers_to_process[batch_start:batch_end]
        
        # Progress update
        if batch_start > 0:
            elapsed = time.time() - start_time
            avg_time = elapsed / batch_start
            remaining = (total_tickers - batch_start) * avg_time
            
            print(f"\n[Progress: {batch_start}/{total_tickers} ({batch_start/total_tickers*100:.1f}%)]")
            print(f"  Time: {elapsed/60:.1f}min elapsed, ~{remaining/60:.1f}min remaining")
            print(f"  Success: {len(successful_tickers)}, Failed: {len(failed_tickers)}, Skipped (small cap): {len(skipped_tickers)}")
            print(f"  API calls: {total_api_calls} ({total_api_calls/elapsed*60:.0f}/minute avg)")
        
        print(f"\n  Processing batch {batch_start//batch_size + 1}: tickers {batch_start+1}-{batch_end}")
        
        # Get bulk profiles for the batch (1 API call for up to 50 tickers)
        profiles = get_bulk_profiles(batch_tickers)
        total_api_calls += 1
        
        # Process each ticker in the batch
        for i, ticker in enumerate(batch_tickers):
            profile_data = profiles.get(ticker)
            
            # Process ticker with pre-fetched profile
            ticker_data, error_log, api_calls = process_ticker_year(ticker, year, profile_data=profile_data)
            total_api_calls += api_calls
            
            if ticker_data is not None and len(ticker_data) > 0:
                all_data.append(ticker_data)
                successful_tickers.append(ticker)
                print("✓", end="", flush=True)
            elif any("Market cap below threshold" in err for err in error_log.get("errors", [])):
                skipped_tickers.append(ticker)
                print("○", end="", flush=True)
            else:
                failed_tickers.append(ticker)
                all_errors.append(error_log)
                print("✗", end="", flush=True)
        
        # Save progress periodically
        if save_progress and (batch_end % progress_interval == 0 or batch_end == total_tickers) and all_data:
            temp_df = pd.concat(all_data, ignore_index=True)
            temp_df['mkt_cap_rank'] = temp_df.groupby('quarter')['mkt_cap'].rank(method='dense', ascending=False).astype(int)
            progress_filename = f"progress_{year}_tickers_{batch_end}.csv"
            temp_df.to_csv(progress_filename, index=False)
            print(f"\n  💾 Progress saved: {progress_filename} ({len(temp_df)} rows)")
    
    # Final summary
    total_time = time.time() - start_time
    
    print(f"\n\n{'='*70}")
    print(f"  YEAR {year} COLLECTION COMPLETE")
    print(f"{'='*70}")
    print(f"Total time: {total_time/60:.1f} minutes ({total_time/3600:.2f} hours)")
    print(f"Successful: {len(successful_tickers)} tickers")
    print(f"Failed: {len(failed_tickers)} tickers")
    print(f"Skipped (small cap): {len(skipped_tickers)} tickers")
    print(f"Total API calls: {total_api_calls:,} ({total_api_calls/total_time*60:.0f}/minute avg)")
    
    if len(all_data) == 0:
        print("\n⚠️  No data collected!")
        return pd.DataFrame(columns=["quarter", "ticker", "industry", "sector", 
                                    "debt_to_assets", "mkt_cap", "stock_price", 
                                    "book_to_market", "earnings_yield", "mkt_cap_rank"]), all_errors
    
    # Combine all data
    final_df = pd.concat(all_data, ignore_index=True)
    
    # Final deduplication - ensure only one entry per ticker-quarter
    final_df = final_df.sort_values(['ticker', 'quarter']).drop_duplicates(['ticker', 'quarter'], keep='last')
    
    # Add market cap ranking
    final_df['mkt_cap_rank'] = final_df.groupby('quarter')['mkt_cap'].rank(method='dense', ascending=False).astype(int)
    
    # Sort by ticker and quarter
    final_df = final_df.sort_values(['ticker', 'quarter']).reset_index(drop=True)
    
    print(f"\n📊 Final dataset: {len(final_df)} rows, {final_df['ticker'].nunique()} tickers")
    print(f"   Quarters: {sorted(final_df['quarter'].unique())}")
    
    # Verify quarter coverage
    expected_quarters = {f"{year}Q1", f"{year}Q2", f"{year}Q3", f"{year}Q4"}
    actual_quarters = set(final_df['quarter'].astype(str).unique())
    missing_quarters = expected_quarters - actual_quarters
    if missing_quarters:
        print(f"   ⚠️  Missing quarters: {sorted(missing_quarters)}")
    
    # Save error log
    if all_errors:
        error_filename = f"errors_{year}.json"
        with open(error_filename, 'w') as f:
            json.dump(all_errors, f, indent=2, default=str)
        print(f"\n📝 Error log saved: {error_filename} ({len(all_errors)} errors)")
    
    # Clean up progress files
    if save_progress:
        for progress_file in [f for f in os.listdir('.') if f.startswith(f'progress_{year}_')]:
            os.remove(progress_file)
        print(f"🧹 Cleaned up progress files")
    
    return final_df, all_errors

In [35]:
def get_historical_tickers(year: int) -> List[str]:
    """Get list of US tickers that existed in a specific year"""
    print(f"Fetching ticker list for year {year}...")
    
    # Try to get historical ticker list from end of previous year
    date = f"{year-1}-12-31"
    
    # First try to get available stocks for that date
    available_stocks = get_json(
        f"https://financialmodelingprep.com/api/v3/available-traded/list",
        {"date": date}
    )
    
    if available_stocks:
        # Filter for US exchanges
        us_tickers = [
            stock["symbol"] for stock in available_stocks 
            if stock.get("exchangeShortName") in ["NYSE", "NASDAQ", "AMEX"]
            and len(stock["symbol"]) <= 5
            and "." not in stock["symbol"]
        ]
        print(f"✅ Found {len(us_tickers)} US tickers for {year}")
        return us_tickers
    
    # Fallback: use current ticker list with a warning
    print(f"⚠️  Could not get historical ticker list for {year}, using current list")
    tickers_data = get_json("https://financialmodelingprep.com/api/v3/stock/list")
    
    if tickers_data:
        # Filter for US exchanges and remove penny stocks
        us_tickers = [
            d["symbol"] for d in tickers_data 
            if d["exchangeShortName"] in ["NYSE", "NASDAQ"] 
            and (d.get("price") is not None and d.get("price", 0) > 5)
            and len(d["symbol"]) <= 5
            and "." not in d["symbol"]
        ]
        
        print(f"✅ Found {len(us_tickers)} current US tickers")
        return us_tickers
    else:
        print("❌ Failed to fetch ticker list. Using sample tickers.")
        return ["AAPL", "MSFT", "GOOGL", "AMZN", "TSLA", "META", "NVDA", "JPM", "JNJ", "V"]


## Step 1: Get List of US Tickers

In [36]:
# Note: We'll fetch year-specific ticker lists when collecting data for each year
print("📌 Ticker lists will be fetched for each specific year during collection")
print("   This ensures we only process stocks that existed in that year")
print("\n💡 The new approach:")
print("   1. Fetches historical ticker list for each year")
print("   2. Uses batch processing to reduce API calls")
print("   3. Removes redundant sleep delays")
print("\n⚡ Expected performance improvements:")
print("   - ~30% faster with batch profile fetching")
print("   - More accurate historical data with year-specific tickers")
print("   - Better API rate limit utilization")

📌 Ticker lists will be fetched for each specific year during collection
   This ensures we only process stocks that existed in that year

💡 The new approach:
   1. Fetches historical ticker list for each year
   2. Uses batch processing to reduce API calls
   3. Removes redundant sleep delays

⚡ Expected performance improvements:
   - ~30% faster with batch profile fetching
   - More accurate historical data with year-specific tickers
   - Better API rate limit utilization


## Step 2: Test with Single Ticker

In [37]:
# Test with AAPL for 2023
print("Testing with AAPL for year 2023...")
test_start = time.time()

test_data, test_errors, test_api_calls = process_ticker_year("AAPL", 2023)

test_time = time.time() - test_start
print(f"\nTest completed in {test_time:.2f} seconds with {test_api_calls} API calls")

if test_data is not None:
    print("\n✅ Test successful!")
    print(test_data)
    print(f"\nQuarters found: {sorted(test_data['quarter'].unique())}")
    print(f"\nSample metrics:")
    print(f"  Book-to-Market: {test_data['book_to_market'].mean():.3f}")
    print(f"  Earnings Yield: {test_data['earnings_yield'].mean():.3f}")
else:
    print("\n❌ Test failed!")
    print("Errors:", test_errors)

Testing with AAPL for year 2023...

Test completed in 1.69 seconds with 5 API calls

✅ Test successful!
  quarter ticker              industry      sector  debt_to_assets  \
0  2023Q2   AAPL  Consumer Electronics  Technology        0.330007   
1  2023Q3   AAPL  Consumer Electronics  Technology        0.351492   
2  2023Q4   AAPL  Consumer Electronics  Technology        0.305617   

         mkt_cap  stock_price  book_to_market  earnings_yield  
0  3044866187580       192.05        0.020501        0.007967  
1  2670779095140       169.74        0.023470        0.008660  
2  2986094670390       191.13        0.024997        0.011458  

Quarters found: [Period('2023Q2', 'Q-DEC'), Period('2023Q3', 'Q-DEC'), Period('2023Q4', 'Q-DEC')]

Sample metrics:
  Book-to-Market: 0.023
  Earnings Yield: 0.009


## Step 3: Collect Data for 2024

Collect data for US stocks with market cap > $1B for 2024. This will take approximately **70 minutes** with filtering.

**Note:** Since 2024 is ongoing, you may have partial data (Q1-Q3 or Q1-Q4 depending on current date).

**Time estimate:** ~21,000 API calls @ 300/minute = 70 minutes

To test with fewer tickers first, change `MAX_TICKERS = None` to `MAX_TICKERS = 100`

In [None]:
# Collect 2024 data
YEAR = 2024
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2024

# Get historical ticker list for 2024
us_tickers_2024 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2024, errors_2024 = collect_year_data(us_tickers_2024, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2024) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2024.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")
    
    # Show summary statistics
    print(f"\n📈 Summary Statistics:")
    print(f"   Debt/Assets - Mean: {data_2024['debt_to_assets'].mean():.3f}, Median: {data_2024['debt_to_assets'].median():.3f}")
    print(f"   Book/Market - Mean: {data_2024['book_to_market'].mean():.3f}, Median: {data_2024['book_to_market'].median():.3f}")
    print(f"   Earnings Yield - Mean: {data_2024['earnings_yield'].mean():.3f}, Median: {data_2024['earnings_yield'].median():.3f}")
    
    # Show top companies (use latest quarter available)
    latest_quarter = data_2024['quarter'].max()
    latest_data = data_2024[data_2024['quarter'] == latest_quarter]
    if len(latest_data) > 0:
        print(f"\n🏆 Top 10 companies by market cap ({latest_quarter}):")
        top_10 = latest_data.nsmallest(10, 'mkt_cap_rank')[['ticker', 'mkt_cap_rank', 'mkt_cap', 'book_to_market', 'earnings_yield', 'industry']]
        top_10['mkt_cap'] = top_10['mkt_cap'].apply(lambda x: f"${x/1e9:.1f}B")
        print(top_10.to_string(index=False))
    
    # Show available quarters
    print(f"\n📅 Available quarters for 2024: {sorted(data_2024['quarter'].unique())}")

## Step 4: Collect Data for 2023

In [None]:
# Collect 2023 data
YEAR = 2023
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2023

# Get historical ticker list for 2023
us_tickers_2023 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2023, errors_2023 = collect_year_data(us_tickers_2023, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2023) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2023.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")
    
    # Show summary statistics
    print(f"\n📈 Summary Statistics:")
    print(f"   Debt/Assets - Mean: {data_2023['debt_to_assets'].mean():.3f}, Median: {data_2023['debt_to_assets'].median():.3f}")
    print(f"   Book/Market - Mean: {data_2023['book_to_market'].mean():.3f}, Median: {data_2023['book_to_market'].median():.3f}")
    print(f"   Earnings Yield - Mean: {data_2023['earnings_yield'].mean():.3f}, Median: {data_2023['earnings_yield'].median():.3f}")
    
    # Show top companies
    q4_data = data_2023[data_2023['quarter'] == f'{YEAR}Q4']
    if len(q4_data) > 0:
        print(f"\n🏆 Top 10 companies by market cap (Q4 {YEAR}):")
        top_10 = q4_data.nsmallest(10, 'mkt_cap_rank')[['ticker', 'mkt_cap_rank', 'mkt_cap', 'book_to_market', 'earnings_yield', 'industry']]
        top_10['mkt_cap'] = top_10['mkt_cap'].apply(lambda x: f"${x/1e9:.1f}B")
        print(top_10.to_string(index=False))

## Template for Remaining Years

Use this template for years 2022 and earlier:

```python
# Collect YEAR data
YEAR = 20XX  # Change this
MAX_TICKERS = None  # Collect ALL US tickers that existed in YEAR

# Get historical ticker list for YEAR
us_tickers_YEAR = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_YEAR, errors_YEAR = collect_year_data(us_tickers_YEAR, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_YEAR) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_YEAR.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")
```


## Step 5: Collect Data for 2022

In [None]:
# Collect 2022 data
YEAR = 2022
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2022

# Get historical ticker list for 2022
us_tickers_2022 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2022, errors_2022 = collect_year_data(us_tickers_2022, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2022) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2022.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

## Step 6: Collect Data for 2021

In [None]:
# Collect 2021 data
YEAR = 2021
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2021

# Get historical ticker list for 2021
us_tickers_2021 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2021, errors_2021 = collect_year_data(us_tickers_2021, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2021) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2021.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

## Step 7: Collect Data for 2020

In [None]:
# Collect 2020 data
YEAR = 2020
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2020

# Get historical ticker list for 2020
us_tickers_2020 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2020, errors_2020 = collect_year_data(us_tickers_2020, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2020) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2020.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

## Step 8: Collect Data for 2019

In [None]:
# Collect 2019 data
YEAR = 2019
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2019

# Get historical ticker list for 2019
us_tickers_2019 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2019, errors_2019 = collect_year_data(us_tickers_2019, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2019) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2019.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2019.csv errors_2019.json
!git commit -m 'Add data for 2019'
!git push origin testing


## Step 9: Collect Data for 2018

In [None]:
# Collect 2018 data
YEAR = 2018
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2018

# Get historical ticker list for 2018
us_tickers_2018 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2018, errors_2018 = collect_year_data(us_tickers_2018, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2018) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2018.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2018.csv errors_2018.json
!git commit -m 'Add data for 2018'
!git push origin testing


## Step 10: Collect Data for 2017

In [None]:
# Collect 2017 data
YEAR = 2017
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2017

# Get historical ticker list for 2017
us_tickers_2017 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2017, errors_2017 = collect_year_data(us_tickers_2017, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2017) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2017.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2017.csv errors_2017.json
!git commit -m 'Add data for 2017'
!git push origin testing


## Step 11: Collect Data for 2016

In [None]:
# Collect 2016 data
YEAR = 2016
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2016

# Get historical ticker list for 2016
us_tickers_2016 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2016, errors_2016 = collect_year_data(us_tickers_2016, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2016) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2016.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2016.csv errors_2016.json
!git commit -m 'Add data for 2016'
!git push origin testing


## Step 12: Collect Data for 2015

In [None]:
# Collect 2015 data
YEAR = 2015
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2015

# Get historical ticker list for 2015
us_tickers_2015 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2015, errors_2015 = collect_year_data(us_tickers_2015, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2015) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2015.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2015.csv errors_2015.json
!git commit -m 'Add data for 2015'
!git push origin testing


## Step 13: Collect Data for 2014

In [None]:
# Collect 2014 data
YEAR = 2014
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2014

# Get historical ticker list for 2014
us_tickers_2014 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2014, errors_2014 = collect_year_data(us_tickers_2014, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2014) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2014.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2014.csv errors_2014.json
!git commit -m 'Add data for 2014'
!git push origin testing


## Step 14: Collect Data for 2013

In [None]:
# Collect 2013 data
YEAR = 2013
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2013

# Get historical ticker list for 2013
us_tickers_2013 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2013, errors_2013 = collect_year_data(us_tickers_2013, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2013) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2013.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2013.csv errors_2013.json
!git commit -m 'Add data for 2013'
!git push origin testing


## Step 15: Collect Data for 2012

In [None]:
# Collect 2012 data
YEAR = 2012
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2012

# Get historical ticker list for 2012
us_tickers_2012 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2012, errors_2012 = collect_year_data(us_tickers_2012, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2012) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2012.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2012.csv errors_2012.json
!git commit -m 'Add data for 2012'
!git push origin testing


## Step 16: Collect Data for 2011

In [None]:
# Collect 2011 data
YEAR = 2011
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2011

# Get historical ticker list for 2011
us_tickers_2011 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2011, errors_2011 = collect_year_data(us_tickers_2011, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2011) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2011.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2011.csv errors_2011.json
!git commit -m 'Add data for 2011'
!git push origin testing


## Step 17: Collect Data for 2010

In [None]:
# Collect 2010 data
YEAR = 2010
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2010

# Get historical ticker list for 2010
us_tickers_2010 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2010, errors_2010 = collect_year_data(us_tickers_2010, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2010) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2010.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2010.csv errors_2010.json
!git commit -m 'Add data for 2010'
!git push origin testing


## Step 18: Collect Data for 2009

In [None]:
# Collect 2009 data
YEAR = 2009
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2009

# Get historical ticker list for 2009
us_tickers_2009 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2009, errors_2009 = collect_year_data(us_tickers_2009, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2009) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2009.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2009.csv errors_2009.json
!git commit -m 'Add data for 2009'
!git push origin testing


## Step 19: Collect Data for 2008

In [None]:
# Collect 2008 data
YEAR = 2008
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2008

# Get historical ticker list for 2008
us_tickers_2008 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2008, errors_2008 = collect_year_data(us_tickers_2008, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2008) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2008.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2008.csv errors_2008.json
!git commit -m 'Add data for 2008'
!git push origin testing


## Step 20: Collect Data for 2007

In [None]:
# Collect 2007 data
YEAR = 2007
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2007

# Get historical ticker list for 2007
us_tickers_2007 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2007, errors_2007 = collect_year_data(us_tickers_2007, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2007) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2007.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2007.csv errors_2007.json
!git commit -m 'Add data for 2007'
!git push origin testing


## Step 21: Collect Data for 2006

In [None]:
# Collect 2006 data
YEAR = 2006
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2006

# Get historical ticker list for 2006
us_tickers_2006 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2006, errors_2006 = collect_year_data(us_tickers_2006, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2006) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2006.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2006.csv errors_2006.json
!git commit -m 'Add data for 2006'
!git push origin testing


## Step 22: Collect Data for 2005

In [None]:
# Collect 2005 data
YEAR = 2005
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2005

# Get historical ticker list for 2005
us_tickers_2005 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2005, errors_2005 = collect_year_data(us_tickers_2005, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2005) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2005.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2005.csv errors_2005.json
!git commit -m 'Add data for 2005'
!git push origin testing


## Step 23: Collect Data for 2004

In [None]:
# Collect 2004 data
YEAR = 2004
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2004

# Get historical ticker list for 2004
us_tickers_2004 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2004, errors_2004 = collect_year_data(us_tickers_2004, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2004) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2004.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2004.csv errors_2004.json
!git commit -m 'Add data for 2004'
!git push origin testing


## Step 24: Collect Data for 2003

In [None]:
# Collect 2003 data
YEAR = 2003
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2003

# Get historical ticker list for 2003
us_tickers_2003 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2003, errors_2003 = collect_year_data(us_tickers_2003, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2003) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2003.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2003.csv errors_2003.json
!git commit -m 'Add data for 2003'
!git push origin testing


## Step 25: Collect Data for 2002

In [None]:
# Collect 2002 data
YEAR = 2002
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2002

# Get historical ticker list for 2002
us_tickers_2002 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2002, errors_2002 = collect_year_data(us_tickers_2002, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2002) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2002.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2002.csv errors_2002.json
!git commit -m 'Add data for 2002'
!git push origin testing


## Step 26: Collect Data for 2001

In [None]:
# Collect 2001 data
YEAR = 2001
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2001

# Get historical ticker list for 2001
us_tickers_2001 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2001, errors_2001 = collect_year_data(us_tickers_2001, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2001) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2001.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2001.csv errors_2001.json
!git commit -m 'Add data for 2001'
!git push origin testing


## Step 27: Collect Data for 2000

In [None]:
# Collect 2000 data
YEAR = 2000
MAX_TICKERS = None  # Collect ALL US tickers that existed in 2000

# Get historical ticker list for 2000
us_tickers_2000 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_2000, errors_2000 = collect_year_data(us_tickers_2000, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_2000) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_2000.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_2000.csv errors_2000.json
!git commit -m 'Add data for 2000'
!git push origin testing


## Step 28: Collect Data for 1999

In [None]:
# Collect 1999 data
YEAR = 1999
MAX_TICKERS = None  # Collect ALL US tickers that existed in 1999

# Get historical ticker list for 1999
us_tickers_1999 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_1999, errors_1999 = collect_year_data(us_tickers_1999, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_1999) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_1999.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_1999.csv errors_1999.json
!git commit -m 'Add data for 1999'
!git push origin testing


## Step 29: Collect Data for 1998

In [None]:
# Collect 1998 data
YEAR = 1998
MAX_TICKERS = None  # Collect ALL US tickers that existed in 1998

# Get historical ticker list for 1998
us_tickers_1998 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_1998, errors_1998 = collect_year_data(us_tickers_1998, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_1998) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_1998.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_1998.csv errors_1998.json
!git commit -m 'Add data for 1998'
!git push origin testing


## Step 30: Collect Data for 1997

In [None]:
# Collect 1997 data
YEAR = 1997
MAX_TICKERS = None  # Collect ALL US tickers that existed in 1997

# Get historical ticker list for 1997
us_tickers_1997 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_1997, errors_1997 = collect_year_data(us_tickers_1997, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_1997) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_1997.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_1997.csv errors_1997.json
!git commit -m 'Add data for 1997'
!git push origin testing


## Step 31: Collect Data for 1996

In [None]:
# Collect 1996 data
YEAR = 1996
MAX_TICKERS = None  # Collect ALL US tickers that existed in 1996

# Get historical ticker list for 1996
us_tickers_1996 = get_historical_tickers(YEAR)

# Collect data with optimized batch processing
data_1996, errors_1996 = collect_year_data(us_tickers_1996, year=YEAR, max_tickers=MAX_TICKERS)

if len(data_1996) > 0:
    filename = f"stock_data_{YEAR}.csv"
    data_1996.to_csv(filename, index=False)
    print(f"\n✅ Data saved to '{filename}'")

In [None]:
!git add stock_data_1996.csv errors_1996.json
!git commit -m 'Add data for 1996'
!git push origin testing


## Analysis: Review Collected Data

In [None]:
# Review what we've collected
import glob

print("📁 Available data files:")
data_files = sorted(glob.glob("stock_data_*.csv"))

total_rows = 0
for file in data_files:
    df = pd.read_csv(file)
    total_rows += len(df)
    print(f"  {file}: {len(df):,} rows, {df['ticker'].nunique()} tickers")
    # Check for duplicates
    duplicates = df.groupby(['ticker', 'quarter']).size()
    if (duplicates > 1).any():
        print(f"    ⚠️  Found {(duplicates > 1).sum()} duplicate ticker-quarter combinations")

print(f"\n📊 Total: {total_rows:,} rows across {len(data_files)} files")

In [None]:
# Analyze errors
error_files = sorted(glob.glob("errors_*.json"))

if error_files:
    print("📝 Error analysis:")
    
    for error_file in error_files:
        with open(error_file, 'r') as f:
            errors = json.load(f)
        
        # Count error types
        error_types = {}
        for error in errors:
            for err_msg in error.get('errors', []):
                err_type = err_msg.split(':')[0] if ':' in err_msg else err_msg
                error_types[err_type] = error_types.get(err_type, 0) + 1
        
        print(f"\n{error_file}: {len(errors)} failed tickers")
        for err_type, count in sorted(error_types.items(), key=lambda x: x[1], reverse=True)[:5]:
            print(f"  - {err_type}: {count}")
else:
    print("No error files found.")

## Optional: Combine All Years

In [None]:
# Combine all years into one file
years = [2019, 2020, 2021, 2022, 2023, 2024]
all_data = []

for year in years:
    filename = f"stock_data_{year}.csv"
    if os.path.exists(filename):
        df = pd.read_csv(filename)
        df['quarter'] = pd.PeriodIndex(df['quarter'], freq='Q')
        all_data.append(df)
        print(f"✓ Loaded {year}: {len(df)} rows")
    else:
        print(f"✗ {filename} not found")

if all_data:
    combined = pd.concat(all_data, ignore_index=True)
    
    # Final deduplication across all years
    combined = combined.sort_values(['ticker', 'quarter']).drop_duplicates(['ticker', 'quarter'], keep='last')
    
    # Recalculate rankings
    combined['mkt_cap_rank'] = combined.groupby('quarter')['mkt_cap'].rank(method='dense', ascending=False).astype(int)
    
    combined.to_csv("stock_data_combined_6years.csv", index=False)
    
    print(f"\n✅ Combined dataset saved!")
    print(f"   Total: {len(combined):,} rows")
    print(f"   Tickers: {combined['ticker'].nunique()}")
    print(f"   Period: {combined['quarter'].min()} to {combined['quarter'].max()}")
    
    # Show metric distributions
    print(f"\n📊 Metric distributions:")
    print(f"   Debt/Assets: {combined['debt_to_assets'].describe()[["mean", "50%", "std"]].round(3).to_dict()}")
    print(f"   Book/Market: {combined['book_to_market'].describe()[["mean", "50%", "std"]].round(3).to_dict()}")
    print(f"   Earnings Yield: {combined['earnings_yield'].describe()[["mean", "50%", "std"]].round(3).to_dict()}")