# Separated Data Collection - Stock Prices and Financial Statements

This notebook collects the **SAME DATA** as yearly.ipynb but splits it into two separate tables:

1. **Stock Price Data** (`stock_prices_YYYY.csv`):
   - Aligned to calendar quarter ends (March 31, June 30, Sept 30, Dec 31)
   - Contains: ticker, company_name, quarter_end_date, stock_price, market_cap, mkt_cap_rank, industry, sector, isETF, isFund

2. **Financial Statement Data** (`financial_statements_YYYY.csv`):
   - Aligned to company fiscal quarters with calendar date mapping
   - Contains: ticker, company_name, fiscal_quarter, fiscal_year, calendar_date, debt_to_assets, book_to_market, earnings_yield, industry, sector
   - Captures up to 4 fiscal quarters per ticker, drawn from a window running from September 1 of the previous year through March 31 of the following year
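
The calendar-quarter alignment described above comes down to picking the trading day nearest each quarter end, within a tolerance. The sketch below illustrates that idea on a toy price table. The helper name `closest_row_within` and the sample prices are hypothetical; the 7-day window mirrors the tolerance used in the processing cell further down.

```python
import pandas as pd

def closest_row_within(df, target, max_days=7):
    """Return the row whose 'date' is nearest to target, or None if no row
    falls within max_days of it. (Hypothetical helper for illustration.)"""
    target = pd.to_datetime(target)
    distance = (df["date"] - target).abs()
    within = distance[distance <= pd.Timedelta(days=max_days)]
    if within.empty:
        return None
    return df.loc[within.idxmin()]

# Toy data: March 31 is missing, so the nearest trading day should be used
prices = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-27", "2020-03-30", "2020-04-02"]),
    "adjClose": [10.0, 11.0, 12.0],
})

row = closest_row_within(prices, "2020-03-31")
print(row["date"].date(), row["adjClose"])  # 2020-03-30 11.0
```

Returning `None` when nothing falls inside the window lets the caller skip that quarter instead of silently using a stale price.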

**Key features (SAME as yearly.ipynb):**
- Market cap filter: Only collects data for stocks with market cap > $1B
- Rate limiting: 750 API calls per minute
- Year-by-year collection with historical ticker lists
- Batch processing for efficiency
- Error tracking and progress saves
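
As a rough illustration of the rate-limiting feature, 750 calls per minute works out to a minimum spacing of 60 / 750 = 0.08 seconds between calls. The sketch below shows one way to enforce such a spacing. The `Throttle` class is a hypothetical stand-in, not the notebook's implementation (which uses a module-level timestamp inside `get_json`); the injectable `clock` and `sleep` arguments are there only to make it easy to test.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive calls (illustrative sketch)."""

    def __init__(self, calls_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / calls_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(750)
print(f"{throttle.min_interval:.2f} seconds between calls")  # 0.08 seconds between calls
```

Calling `throttle.wait()` immediately before each request would keep the request rate at or below the configured limit.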


## Helper Functions

In [30]:
import requests
import pandas as pd
import time
from typing import Optional, List, Dict, Any, Tuple
from datetime import datetime, timedelta
import json
import os
from dotenv import load_dotenv

# Load API key from .env file
load_dotenv(".env")
API = os.getenv("API")  

# Rate limiting configuration (SAME as yearly.ipynb)
API_CALLS_PER_MINUTE = 750
SECONDS_PER_CALL = 60 / API_CALLS_PER_MINUTE  # 0.08 seconds per call

# Session and timer for rate limiting
session = requests.Session()
LAST_API_CALL = 0.0

# Market cap threshold (1 billion) - SAME as yearly.ipynb
MARKET_CAP_THRESHOLD = 1e9

print(f"Rate limit configured: {API_CALLS_PER_MINUTE} calls/minute ({SECONDS_PER_CALL:.2f} seconds/call)")
print(f"Market cap filter: > ${MARKET_CAP_THRESHOLD/1e9:.0f}B")


Rate limit configured: 750 calls/minute (0.08 seconds/call)
Market cap filter: > $1B


In [31]:
# Core helper functions (same logic as yearly.ipynb)
def get_json(url: str, params: Optional[Dict[str, Any]] = None) -> Optional[Any]:
    """Safely get JSON data from API with error handling and rate limit retry"""
    global LAST_API_CALL
    # Copy params so neither the caller's dict nor a shared default is mutated
    params = dict(params) if params else {}
    params['apikey'] = API
    try:
        # Throttle: wait until SECONDS_PER_CALL has passed since the last call
        elapsed = time.time() - LAST_API_CALL
        if elapsed < SECONDS_PER_CALL:
            time.sleep(SECONDS_PER_CALL - elapsed)
        response = session.get(url, params=params, timeout=10)
        LAST_API_CALL = time.time()
        if response.status_code == 429:
            print('⚠️  Rate limit hit! Waiting 30 seconds...')
            time.sleep(30)
            return get_json(url, params)
        response.raise_for_status()
        js = response.json()
        # Historical endpoints wrap their rows in a 'historical' key; unwrap it
        if isinstance(js, dict) and 'historical' in js:
            return js['historical']
        return js
    except requests.exceptions.HTTPError as e:
        print(f'HTTP Error {e.response.status_code}: {e}')
        return None
    except Exception as e:
        print(f'Error fetching data: {e}')
        return None

def check_market_cap(ticker: str, year: int, precomputed: Optional[float] = None) -> Tuple[bool, Optional[float]]:
    """Check if ticker had market cap above threshold in given year"""
    if precomputed is not None:
        return precomputed > MARKET_CAP_THRESHOLD, precomputed
    try:
        start_date = f'{year}-01-01'
        end_date = f'{year}-12-31'
        mc_data = get_json(
            f'https://financialmodelingprep.com/api/v3/historical-market-capitalization/{ticker}',
            {'from': start_date, 'to': end_date}
        )
        if not mc_data:
            return False, None
        mc_df = pd.DataFrame(mc_data)
        avg_market_cap = mc_df['marketCap'].mean()
        return avg_market_cap > MARKET_CAP_THRESHOLD, avg_market_cap
    except Exception as e:
        print(f'Error checking market cap for {ticker}: {e}')
        return False, None

def get_bulk_profiles(tickers: List[str]) -> Dict[str, Any]:
    """Fetch company profiles in bulk."""
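    # Join outside the f-string: reusing the same quote type inside an f-string
    # is a syntax error before Python 3.12
    symbols = ','.join(tickers)
    data = get_json(f'https://financialmodelingprep.com/api/v3/profile/{symbols}')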
    profiles = {}
    if isinstance(data, list):
        for item in data:
            symbol = item.get('symbol')
            profiles[symbol] = item
    return profiles


In [32]:
# CRITICAL FUNCTION - EXACT SAME LOGIC AS yearly.ipynb
def get_historical_tickers(year: int) -> List[str]:
    """Get list of US tickers that existed in a specific year"""
    print(f"Fetching ticker list for year {year}...")
    
    # Try to get historical ticker list from end of previous year
    date = f"{year-1}-12-31"
    
    # First try to get available stocks for that date
    available_stocks = get_json(
        "https://financialmodelingprep.com/api/v3/available-traded/list",
        {"date": date}
    )
    
    if available_stocks:
        # Filter for US exchanges
        us_tickers = [
            stock["symbol"] for stock in available_stocks 
            if stock.get("exchangeShortName") in ["NYSE", "NASDAQ", "AMEX"]
            and len(stock["symbol"]) <= 5
            and "." not in stock["symbol"]
        ]
        print(f"✅ Found {len(us_tickers)} US tickers for {year}")
        return us_tickers
    
    # Fallback: use current ticker list with a warning
    print(f"⚠️  Could not get historical ticker list for {year}, using current list")
    tickers_data = get_json("https://financialmodelingprep.com/api/v3/stock/list")
    
    if tickers_data:
        # Filter for US exchanges and remove penny stocks
        us_tickers = [
            d["symbol"] for d in tickers_data 
            if d.get("exchangeShortName") in ["NYSE", "NASDAQ"] 
            and (d.get("price") or 0) > 5
            and len(d["symbol"]) <= 5
            and "." not in d["symbol"]
        ]
        
        print(f"✅ Found {len(us_tickers)} current US tickers")
        return us_tickers
    else:
        print("❌ Failed to fetch ticker list. Using sample tickers.")
        return ["AAPL", "MSFT", "GOOGL", "AMZN", "TSLA", "META", "NVDA", "JPM", "JNJ", "V"]


In [33]:
# FIXED process_ticker_year to return separated data with proper fiscal quarter handling
def process_ticker_year_separated(ticker: str, year: int, profile_data: Optional[Dict[str, Any]] = None, 
                                 avg_market_cap: Optional[float] = None) -> Tuple[Optional[pd.DataFrame], Optional[pd.DataFrame], Dict[str, Any], int]:
    """Process data for a single ticker for a specific year - returns separated price and statement data"""
    error_log = {'ticker': ticker, 'year': year, 'errors': []}
    api_calls = 0
    
    try:
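        # Check market cap (SAME as yearly.ipynb)
        # check_market_cap calls the API only when no precomputed value is passed
        if avg_market_cap is None:
            api_calls += 1
        is_large_cap, avg_market_cap = check_market_cap(ticker, year, precomputed=avg_market_cap)
        
        if avg_market_cap is None:
            error_log['errors'].append('No market cap data')
            return None, None, error_log, api_calls
        
        if not is_large_cap:
            error_log['errors'].append(f'Market cap below threshold (avg: ${avg_market_cap:,.0f})')
            return None, None, error_log, api_calls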
        
        start_date = datetime(year, 1, 1)
        end_date = datetime(year, 12, 31)
        
        # Get all the same data as yearly.ipynb
        bs = get_json(f'https://financialmodelingprep.com/api/v3/balance-sheet-statement/{ticker}', 
                     {'period': 'quarter', 'limit': 20})
        api_calls += 1
        
        inc = get_json(f'https://financialmodelingprep.com/api/v3/income-statement/{ticker}', 
                      {'period': 'quarter', 'limit': 20})
        api_calls += 1
        
        mc = get_json(f'https://financialmodelingprep.com/api/v3/historical-market-capitalization/{ticker}', 
                     {'from': start_date.strftime('%Y-%m-%d'), 'to': end_date.strftime('%Y-%m-%d')})
        api_calls += 1
        
        px = get_json(f'https://financialmodelingprep.com/api/v3/historical-price-full/{ticker}', 
                     {'from': start_date.strftime('%Y-%m-%d'), 'to': end_date.strftime('%Y-%m-%d')})
        api_calls += 1
        
        if profile_data is None:
            profile = get_json(f'https://financialmodelingprep.com/api/v3/profile/{ticker}')
            api_calls += 1
        else:
            profile = [profile_data] if isinstance(profile_data, dict) else profile_data
        
        if not all([bs, inc, mc, px, profile]):
            if not bs: error_log['errors'].append('No balance sheet data')
            if not inc: error_log['errors'].append('No income statement data')
            if not mc: error_log['errors'].append('No market cap data')
            if not px: error_log['errors'].append('No price data')
            if not profile: error_log['errors'].append('No profile data')
            return None, None, error_log, api_calls
        
        # Extract profile info
        profile_info = profile[0] if profile and len(profile) > 0 else {}
        company_name = profile_info.get('companyName', '')
        industry = profile_info.get('industry', 'Unknown')
        sector = profile_info.get('sector', 'Unknown')
        is_etf = profile_info.get('isEtf', False)
        is_fund = profile_info.get('isFund', False)
        
        # Process balance sheet data - NO YEAR FILTER for financial statements
        bs_df = pd.DataFrame(bs)
        bs_df['date'] = pd.to_datetime(bs_df['date'])
        
        # Process income statement data - NO YEAR FILTER for financial statements
        inc_df = pd.DataFrame(inc)
        inc_df['date'] = pd.to_datetime(inc_df['date'])
        
        # Process market cap data
        mc_df = pd.DataFrame(mc)
        mc_df['date'] = pd.to_datetime(mc_df['date'])
        
        # Process price data
        px_df = pd.DataFrame(px)
        px_df['date'] = pd.to_datetime(px_df['date'])
        
        # Create calendar quarter end dates
        quarter_dates = [
            f"{year}-03-31",
            f"{year}-06-30",
            f"{year}-09-30",
            f"{year}-12-31"
        ]
        
        # 1. Create Stock Price Data (aligned to calendar quarters)
        price_data_list = []
        for quarter_date in quarter_dates:
            # Find closest price to quarter end
            quarter_dt = pd.to_datetime(quarter_date)
            px_quarter = px_df[abs(px_df['date'] - quarter_dt) <= pd.Timedelta(days=7)]
            
            if len(px_quarter) > 0:
                # Get closest date
                closest_idx = abs(px_quarter['date'] - quarter_dt).idxmin()
                price_row = px_quarter.loc[closest_idx]
                
                # Get market cap for this date
                mc_quarter = mc_df[abs(mc_df['date'] - quarter_dt) <= pd.Timedelta(days=7)]
                if len(mc_quarter) > 0:
                    closest_mc_idx = abs(mc_quarter['date'] - quarter_dt).idxmin()
                    market_cap = mc_quarter.loc[closest_mc_idx, 'marketCap']
                else:
                    market_cap = None
                
                if market_cap and market_cap >= MARKET_CAP_THRESHOLD:
                    price_data_list.append({
                        'ticker': ticker,
                        'company_name': company_name,
                        'quarter_end_date': quarter_date,
                        'stock_price': price_row['adjClose'],
                        'market_cap': market_cap,
                        'industry': industry,
                        'sector': sector,
                        'isETF': is_etf,
                        'isFund': is_fund
                    })
        
        # 2. Create Financial Statement Data (aligned to fiscal quarters)
        statement_data_list = []
        
        # Filter for statements that overlap with the calendar year
        # Include statements from Sept 1 of previous year to March 31 of next year
        date_start = pd.to_datetime(f'{year-1}-09-01')
        date_end = pd.to_datetime(f'{year+1}-03-31')
        
        bs_filtered = bs_df[(bs_df['date'] >= date_start) & (bs_df['date'] <= date_end)]
        inc_filtered = inc_df[(inc_df['date'] >= date_start) & (inc_df['date'] <= date_end)]
        
        if len(bs_filtered) == 0 or len(inc_filtered) == 0:
            # If no data in extended range, just return price data
            price_df = pd.DataFrame(price_data_list) if price_data_list else None
            return price_df, None, error_log, api_calls
        
        # Merge balance sheet and income statement by date
        bs_quarters = bs_filtered[['date', 'shortTermDebt', 'longTermDebt', 'totalAssets', 
                                   'totalStockholdersEquity', 'commonStock']].copy()
        inc_quarters = inc_filtered[['date', 'eps', 'weightedAverageShsOut', 'period', 
                                    'calendarYear', 'netIncome']].copy()
        
        # Join on date
        merged_statements = pd.merge(bs_quarters, inc_quarters, on='date', how='inner')
        
        # Sort by date and take the most recent 4 quarters that best represent the calendar year
        merged_statements = merged_statements.sort_values('date')
        
        # Find the 4 quarters that best represent the calendar year
        # This typically includes Q1-Q4 of the fiscal year that mostly overlaps with the calendar year
        year_statements = []
        for _, row in merged_statements.iterrows():
            # Check if this quarter is relevant to our calendar year
            quarter_date = row['date']
            # A quarter is relevant if it ends within the calendar year OR
            # if it's Q1 of the following fiscal year (for companies with Sept/Oct fiscal year ends)
            if (quarter_date.year == year) or \
               (quarter_date.year == year + 1 and quarter_date.month <= 3) or \
               (quarter_date.year == year - 1 and quarter_date.month >= 9):
                year_statements.append(row)
        
        # Take the most recent 4 quarters from our filtered list
        year_statements = pd.DataFrame(year_statements)
        if len(year_statements) > 4:
            year_statements = year_statements.tail(4)
        
        for _, row in year_statements.iterrows():
            fiscal_date = row['date'].strftime('%Y-%m-%d')
            
            # Get market cap for this fiscal date
            mc_fiscal = mc_df[abs(mc_df['date'] - row['date']) <= pd.Timedelta(days=7)]
            if len(mc_fiscal) > 0:
                closest_mc_idx = abs(mc_fiscal['date'] - row['date']).idxmin()
                market_cap = mc_fiscal.loc[closest_mc_idx, 'marketCap']
            else:
                # Try to get market cap from the broader dataset
                mc_extended = get_json(
                    f'https://financialmodelingprep.com/api/v3/historical-market-capitalization/{ticker}',
                    {'from': (row['date'] - pd.Timedelta(days=7)).strftime('%Y-%m-%d'),
                     'to': (row['date'] + pd.Timedelta(days=7)).strftime('%Y-%m-%d')}
                )
                api_calls += 1
                if mc_extended:
                    mc_extended_df = pd.DataFrame(mc_extended)
                    mc_extended_df['date'] = pd.to_datetime(mc_extended_df['date'])
                    if len(mc_extended_df) > 0:
                        market_cap = mc_extended_df['marketCap'].iloc[0]
                    else:
                        continue
                else:
                    continue
            
            if market_cap < MARKET_CAP_THRESHOLD:
                continue
            
            # Get stock price for this date
            px_fiscal = px_df[abs(px_df['date'] - row['date']) <= pd.Timedelta(days=7)]
            if len(px_fiscal) > 0:
                closest_px_idx = abs(px_fiscal['date'] - row['date']).idxmin()
                stock_price = px_fiscal.loc[closest_px_idx, 'adjClose']
            else:
                # Try to get price from the broader dataset
                px_extended = get_json(
                    f'https://financialmodelingprep.com/api/v3/historical-price-full/{ticker}',
                    {'from': (row['date'] - pd.Timedelta(days=7)).strftime('%Y-%m-%d'),
                     'to': (row['date'] + pd.Timedelta(days=7)).strftime('%Y-%m-%d')}
                )
                api_calls += 1
                if px_extended:
                    px_extended_df = pd.DataFrame(px_extended)
                    px_extended_df['date'] = pd.to_datetime(px_extended_df['date'])
                    if len(px_extended_df) > 0:
                        stock_price = px_extended_df['adjClose'].iloc[0]
                    else:
                        stock_price = None
                else:
                    stock_price = None
            
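            # Calculate ratios (missing debt fields count as 0; `x or 0` does not catch NaN)
            short_debt = row['shortTermDebt'] if pd.notna(row['shortTermDebt']) else 0
            long_debt = row['longTermDebt'] if pd.notna(row['longTermDebt']) else 0
            total_debt = short_debt + long_debt
            debt_to_assets = total_debt / row['totalAssets'] if row['totalAssets'] > 0 else None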
            
            if stock_price and row['weightedAverageShsOut'] > 0:
                book_to_market = (row['totalStockholdersEquity'] / row['weightedAverageShsOut']) / stock_price
                earnings_yield = row['eps'] / stock_price
            else:
                book_to_market = None
                earnings_yield = None
            
            statement_data_list.append({
                'ticker': ticker,
                'company_name': company_name,
                'fiscal_quarter': row['period'],
                'fiscal_year': row['calendarYear'],
                'calendar_date': fiscal_date,
                'debt_to_assets': debt_to_assets,
                'book_to_market': book_to_market,
                'earnings_yield': earnings_yield,
                'industry': industry,
                'sector': sector
            })
        
        # Convert to DataFrames
        price_df = pd.DataFrame(price_data_list) if price_data_list else None
        statement_df = pd.DataFrame(statement_data_list) if statement_data_list else None
        
        if price_df is None and statement_df is None:
            error_log['errors'].append('No valid data after processing')
            return None, None, error_log, api_calls
        
        return price_df, statement_df, error_log, api_calls
        
    except Exception as e:
        error_log['errors'].append(f'Exception: {str(e)}')
        return None, None, error_log, api_calls


In [34]:
# Main collection function - modified from yearly.ipynb to handle separated data
def collect_year_data_separated(tickers: List[str], year: int, max_tickers: Optional[int] = None, 
                               save_progress: bool = True, progress_interval: int = 100, 
                               batch_size: int = 50) -> Tuple[pd.DataFrame, pd.DataFrame, List[Dict]]:
    """Collect separated price and statement data for multiple tickers for a specific year"""
    all_price_data = []
    all_statement_data = []
    all_errors = []
    successful_tickers = []
    failed_tickers = []
    skipped_tickers = []
    total_api_calls = 0
    
    tickers_to_process = tickers[:max_tickers] if max_tickers else tickers
    total_tickers = len(tickers_to_process)
    
    print(f"\n{'='*70}")
    print(f"  COLLECTING SEPARATED DATA FOR YEAR {year}")
    print(f"{'='*70}")
    print(f"Total tickers to check: {total_tickers}")
    print(f"Market cap filter: >${MARKET_CAP_THRESHOLD/1e9:.0f}B")
    print(f"API rate limit: {API_CALLS_PER_MINUTE} calls/minute")
    print(f"Batch size: {batch_size} tickers")
    print(f"Progress saves: Every {progress_interval} tickers")
    print(f"{'='*70}\n")
    
    start_time = time.time()
    
    # Process tickers in batches (SAME as yearly.ipynb)
    for batch_start in range(0, total_tickers, batch_size):
        batch_end = min(batch_start + batch_size, total_tickers)
        batch_tickers = tickers_to_process[batch_start:batch_end]
        
        # Progress update
        if batch_start > 0:
            elapsed = time.time() - start_time
            avg_time = elapsed / batch_start
            remaining = (total_tickers - batch_start) * avg_time
            
            print(f"\n[Progress: {batch_start}/{total_tickers} ({batch_start/total_tickers*100:.1f}%)]")
            print(f"  Time: {elapsed/60:.1f}min elapsed, ~{remaining/60:.1f}min remaining")
            print(f"  Success: {len(successful_tickers)}, Failed: {len(failed_tickers)}, Skipped (small cap): {len(skipped_tickers)}")
            print(f"  API calls: {total_api_calls} ({total_api_calls/elapsed*60:.0f}/minute avg)")
        
        print(f"\n  Processing batch {batch_start//batch_size + 1}: tickers {batch_start+1}-{batch_end}")
        
        # Get bulk profiles for the batch (1 API call for up to 50 tickers)
        profiles = get_bulk_profiles(batch_tickers)
        total_api_calls += 1
        
        # Process each ticker in the batch
        for i, ticker in enumerate(batch_tickers):
            profile_data = profiles.get(ticker)
            
            # Process ticker with pre-fetched profile
            price_data, statement_data, error_log, api_calls = process_ticker_year_separated(
                ticker, year, profile_data=profile_data
            )
            total_api_calls += api_calls
            
            if (price_data is not None and len(price_data) > 0) or (statement_data is not None and len(statement_data) > 0):
                if price_data is not None:
                    all_price_data.append(price_data)
                if statement_data is not None:
                    all_statement_data.append(statement_data)
                successful_tickers.append(ticker)
                print("✓", end="", flush=True)
            elif any("Market cap below threshold" in err for err in error_log.get("errors", [])):
                skipped_tickers.append(ticker)
                print("○", end="", flush=True)
            else:
                failed_tickers.append(ticker)
                all_errors.append(error_log)
                print("✗", end="", flush=True)
        
        # Save progress periodically
        if save_progress and (batch_end % progress_interval == 0 or batch_end == total_tickers):
            if all_price_data:
                temp_price_df = pd.concat(all_price_data, ignore_index=True)
                temp_price_df['mkt_cap_rank'] = temp_price_df.groupby('quarter_end_date')['market_cap'].rank(
                    method='dense', ascending=False).astype(int)
                progress_price_filename = f"progress_prices_{year}_tickers_{batch_end}.csv"
                temp_price_df.to_csv(progress_price_filename, index=False)
                print(f"\n  💾 Price progress saved: {progress_price_filename} ({len(temp_price_df)} rows)")
            
            if all_statement_data:
                temp_statement_df = pd.concat(all_statement_data, ignore_index=True)
                progress_statement_filename = f"progress_statements_{year}_tickers_{batch_end}.csv"
                temp_statement_df.to_csv(progress_statement_filename, index=False)
                print(f"  💾 Statement progress saved: {progress_statement_filename} ({len(temp_statement_df)} rows)")
    
    # Final summary
    total_time = time.time() - start_time
    
    print(f"\n\n{'='*70}")
    print(f"  YEAR {year} COLLECTION COMPLETE")
    print(f"{'='*70}")
    print(f"Total time: {total_time/60:.1f} minutes ({total_time/3600:.2f} hours)")
    print(f"Successful: {len(successful_tickers)} tickers")
    print(f"Failed: {len(failed_tickers)} tickers")
    print(f"Skipped (small cap): {len(skipped_tickers)} tickers")
    print(f"Total API calls: {total_api_calls:,} ({total_api_calls/total_time*60:.0f}/minute avg)")
    
    # Combine all data
    if all_price_data:
        final_price_df = pd.concat(all_price_data, ignore_index=True)
        # Add market cap ranking
        final_price_df['mkt_cap_rank'] = final_price_df.groupby('quarter_end_date')['market_cap'].rank(
            method='dense', ascending=False).astype(int)
        # Sort by ticker and quarter
        final_price_df = final_price_df.sort_values(['ticker', 'quarter_end_date']).reset_index(drop=True)
    else:
        final_price_df = pd.DataFrame()
    
    if all_statement_data:
        final_statement_df = pd.concat(all_statement_data, ignore_index=True)
        # Sort by ticker and date
        final_statement_df = final_statement_df.sort_values(['ticker', 'calendar_date']).reset_index(drop=True)
    else:
        final_statement_df = pd.DataFrame()
    
    print(f"\n📊 Final datasets:")
    print(f"   Price data: {len(final_price_df)} rows, {final_price_df['ticker'].nunique() if len(final_price_df) > 0 else 0} tickers")
    print(f"   Statement data: {len(final_statement_df)} rows, {final_statement_df['ticker'].nunique() if len(final_statement_df) > 0 else 0} tickers")
    
    # Save error log
    if all_errors:
        error_filename = f"errors_{year}.json"
        with open(error_filename, 'w') as f:
            json.dump(all_errors, f, indent=2, default=str)
        print(f"\n📝 Error log saved: {error_filename} ({len(all_errors)} errors)")
    
    # Clean up progress files
    if save_progress:
        for progress_file in [f for f in os.listdir('.') if f.startswith(f'progress_prices_{year}_') or f.startswith(f'progress_statements_{year}_')]:
            os.remove(progress_file)
        print(f"🧹 Cleaned up progress files")
    
    return final_price_df, final_statement_df, all_errors


## Test with Single Ticker


In [36]:
# Test with a single ticker (SAME approach as yearly.ipynb)
def test_single_ticker(ticker: str, year: int):
    """Test data collection for a single ticker"""
    print(f"Testing with {ticker} for year {year}...")
    test_start = time.time()
    
    # Get profile first
    profile_data = get_bulk_profiles([ticker]).get(ticker)
    
    # Process ticker
    price_data, statement_data, error_log, api_calls = process_ticker_year_separated(ticker, year, profile_data)
    
    test_time = time.time() - test_start
    print(f"\nTest completed in {test_time:.2f} seconds with {api_calls} API calls")
    
    if price_data is not None:
        print(f"\n✅ Price data collected: {len(price_data)} records")
        print(price_data)
    else:
        print("\n❌ No price data collected")
    
    if statement_data is not None:
        print(f"\n✅ Statement data collected: {len(statement_data)} records")
        print(statement_data)
    else:
        print("\n❌ No statement data collected")
    
    if error_log['errors']:
        print(f"\nErrors: {error_log}")

# Test with AAPL for 2020 to see if we get all 4 quarters
test_single_ticker("AAPL", 2020)


Testing with AAPL for year 2020...

Test completed in 1.14 seconds with 6 API calls

✅ Price data collected: 4 records
  ticker company_name quarter_end_date  stock_price     market_cap  \
0   AAPL   Apple Inc.       2020-03-31        61.63  1096601062440   
1   AAPL   Apple Inc.       2020-06-30        88.65  1555655126400   
2   AAPL   Apple Inc.       2020-09-30       112.78  1961256131390   
3   AAPL   Apple Inc.       2020-12-31       129.44  2223018730440   

               industry      sector  isETF  isFund  
0  Consumer Electronics  Technology  False   False  
1  Consumer Electronics  Technology  False   False  
2  Consumer Electronics  Technology  False   False  
3  Consumer Electronics  Technology  False   False  

✅ Statement data collected: 4 records
  ticker company_name fiscal_quarter fiscal_year calendar_date  \
0   AAPL   Apple Inc.             Q3        2020    2020-06-27   
1   AAPL   Apple Inc.             Q4        2020    2020-09-26   
2   AAPL   Apple Inc.       

## Collect Data for Years


In [37]:
# Collect data for a specific year (SAME approach as yearly.ipynb)
def collect_and_save_year(year: int, max_tickers: Optional[int] = None):
    """Collect and save separated data for a specific year"""
    
    # Get historical ticker list for the year (EXACT SAME as yearly.ipynb)
    us_tickers = get_historical_tickers(year)
    
    # Collect data with optimized batch processing
    price_df, statement_df, errors = collect_year_data_separated(
        us_tickers, year=year, max_tickers=max_tickers
    )
    
    # Save the data
    if len(price_df) > 0:
        price_filename = f"stock_prices_{year}.csv"
        price_df.to_csv(price_filename, index=False)
        print(f"\n✅ Price data saved to '{price_filename}'")
        
        # Show summary statistics
        print(f"\n📈 Price Data Summary:")
        print(f"   Records: {len(price_df)}")
        print(f"   Unique tickers: {price_df['ticker'].nunique()}")
        print(f"   Date range: {price_df['quarter_end_date'].min()} to {price_df['quarter_end_date'].max()}")
        
        # Show top companies by market cap
        latest_quarter = price_df['quarter_end_date'].max()
        latest_data = price_df[price_df['quarter_end_date'] == latest_quarter]
        if len(latest_data) > 0:
            print(f"\n🏆 Top 10 companies by market cap ({latest_quarter}):")
            top_10 = latest_data.nsmallest(10, 'mkt_cap_rank')[['ticker', 'company_name', 'mkt_cap_rank', 'market_cap', 'isETF', 'isFund']]
            top_10['market_cap'] = top_10['market_cap'].apply(lambda x: f"${x/1e9:.1f}B")
            print(top_10.to_string(index=False))
    
    if len(statement_df) > 0:
        statement_filename = f"financial_statements_{year}.csv"
        statement_df.to_csv(statement_filename, index=False)
        print(f"\n✅ Statement data saved to '{statement_filename}'")
        
        # Show summary statistics
        print(f"\n📊 Statement Data Summary:")
        print(f"   Records: {len(statement_df)}")
        print(f"   Unique tickers: {statement_df['ticker'].nunique()}")
        print(f"   Date range: {statement_df['calendar_date'].min()} to {statement_df['calendar_date'].max()}")
        print(f"   Fiscal years included: {sorted(statement_df['fiscal_year'].unique())}")
        print(f"   Debt/Assets - Mean: {statement_df['debt_to_assets'].mean():.3f}, Median: {statement_df['debt_to_assets'].median():.3f}")
        print(f"   Book/Market - Mean: {statement_df['book_to_market'].mean():.3f}, Median: {statement_df['book_to_market'].median():.3f}")
        print(f"   Earnings Yield - Mean: {statement_df['earnings_yield'].mean():.3f}, Median: {statement_df['earnings_yield'].median():.3f}")

# Example: Collect 2024 data
# To test with fewer tickers first, use max_tickers parameter
# collect_and_save_year(2024, max_tickers=100)  # Test with 100 tickers
# collect_and_save_year(2024)  # Full collection


In [38]:
# Collect data for multiple years
def collect_multiple_years(start_year: int, end_year: int, max_tickers: Optional[int] = None):
    """Collect data for a range of years"""
    for year in range(start_year, end_year + 1):
        print(f"\n{'='*80}")
        print(f"{'='*80}")
        print(f"  STARTING COLLECTION FOR YEAR {year}")
        print(f"{'='*80}")
        print(f"{'='*80}")
        
        try:
            collect_and_save_year(year, max_tickers=max_tickers)
        except Exception as e:
            print(f"❌ Failed to collect data for {year}: {e}")
            continue

# Example: Collect data for years 2020-2024
# collect_multiple_years(2020, 2024, max_tickers=100)  # Test with 100 tickers per year
# collect_multiple_years(2020, 2024)  # Full collection


## Individual Year Collection Cells

Run these cells one by one to collect data for each year. Each cell is independent.
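
Because each year is written to its own file, a later analysis step will probably want to stitch the yearly files back together. The sketch below shows one possible way to do that. The helper `combine_yearly_files` is hypothetical and not part of the notebook; it assumes the CSVs follow the `stock_prices_YYYY.csv` and `financial_statements_YYYY.csv` naming described at the top.

```python
import glob
import os
import pandas as pd

def combine_yearly_files(pattern, sort_cols):
    """Concatenate every CSV matching pattern into one DataFrame sorted by
    sort_cols. Returns an empty DataFrame when nothing matches."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        return pd.DataFrame()
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    return combined.sort_values(sort_cols).reset_index(drop=True)

# Usage, assuming the yearly cells have already written their CSVs
# to the current working directory:
# all_prices = combine_yearly_files("stock_prices_*.csv", ["ticker", "quarter_end_date"])
# all_statements = combine_yearly_files("financial_statements_*.csv", ["ticker", "calendar_date"])
```

Note that `mkt_cap_rank` is computed per quarter within each year's run, so it stays meaningful after concatenation.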


In [43]:
# Collect 2024 data
YEAR = 2024
MAX_TICKERS = None  # Set to a number like 100 to test with fewer tickers

collect_and_save_year(YEAR, max_tickers=MAX_TICKERS)


Fetching ticker list for year 2024...
✅ Found 14302 US tickers for 2024

  COLLECTING SEPARATED DATA FOR YEAR 2024
Total tickers to check: 14302
Market cap filter: >$1B
API rate limit: 750 calls/minute
Batch size: 50 tickers
Progress saves: Every 100 tickers


  Processing batch 1: tickers 1-50
○✗○✗○○○○○○○✗○✗○✓○○✓○○○✗○○○○○✗○○○✓○○○✓○✓✗○○✓○○✗○○✓✗
  Time: 0.3min elapsed, ~77.1min remaining
  Success: 7, Failed: 9, Skipped (small cap): 34
  API calls: 67 (248/minute avg)

  Processing batch 2: tickers 51-100
○○○○✓✗○○○✗○✓○○○○✗○○✓○✓✗○○○○✓○○✗○✓○○✗✗✗○○○○✓✓○○○○✗✓
  💾 Price progress saved: progress_prices_2024_tickers_100.csv (63 rows)
  💾 Statement progress saved: progress_statements_2024_tickers_100.csv (63 rows)

  Time: 0.6min elapsed, ~81.5min remaining
  Success: 16, Failed: 18, Skipped (small cap): 66
  API calls: 151 (263/minute avg)

  Processing batch 3: tickers 101-150
○✗✗○✓○○○○✓✗○✗○✗○○✗○○✓○○○○○○○○○○○✗○○✓✗✓○○✗○✗○○✓✓○✓○
  Time: 0.9min elapsed, ~83.6min remaining
  Success: 24, Failed: 

In [42]:
# Collect 2023 data
YEAR = 2023
MAX_TICKERS = None

collect_and_save_year(YEAR, max_tickers=MAX_TICKERS)


Fetching ticker list for year 2023...
✅ Found 14302 US tickers for 2023

  COLLECTING SEPARATED DATA FOR YEAR 2023
Total tickers to check: 14302
Market cap filter: >$1B
API rate limit: 750 calls/minute
Batch size: 50 tickers
Progress saves: Every 100 tickers


  Processing batch 1: tickers 1-50
✗✗○✗○○○○○○○✗○✗○✓○○✓○○○✗○○○○○✗○○○✓○○○✓○✓✗○✓○○○✗○○✓✗
  Time: 0.3min elapsed, ~73.4min remaining
  Success: 7, Failed: 10, Skipped (small cap): 33
  API calls: 68 (264/minute avg)

  Processing batch 2: tickers 51-100
○✗○○✓✗○○○✗○✓○○○○✗○○✓○✓✗○○○✗✓✗○✗○✓○✗✗✗✗○○○○✓○✓○○○✗✓
  💾 Price progress saved: progress_prices_2023_tickers_100.csv (64 rows)
  💾 Statement progress saved: progress_statements_2023_tickers_100.csv (63 rows)

  Time: 0.6min elapsed, ~80.1min remaining
  Success: 16, Failed: 23, Skipped (small cap): 61
  API calls: 156 (277/minute avg)

  Processing batch 3: tickers 101-150
○✗✗○✓○○○○✓✗○✗○✗✗✗✗○○✓○○○○○✗○○✗○○✗○○✓✗✓○○✗○✗○○✓✓○✓✓
  Time: 0.9min elapsed, ~82.8min remaining
  Success: 25, Failed:

In [41]:
# Collect 2022 data
YEAR = 2022
MAX_TICKERS = None

collect_and_save_year(YEAR, max_tickers=MAX_TICKERS)


Fetching ticker list for year 2022...
✅ Found 14302 US tickers for 2022

  COLLECTING SEPARATED DATA FOR YEAR 2022
Total tickers to check: 14302
Market cap filter: >$1B
API rate limit: 750 calls/minute
Batch size: 50 tickers
Progress saves: Every 100 tickers


  Processing batch 1: tickers 1-50
✗✗○✗○○○○○○○✗○✗○✓○○✓○○○✗○○○○○✗○○○✓○○○✓○✓✗○✓○○○✗○○✓✗
  Time: 0.2min elapsed, ~70.3min remaining
  Success: 7, Failed: 10, Skipped (small cap): 33
  API calls: 66 (268/minute avg)

  Processing batch 2: tickers 51-100
○✗○○✓✗○○✓✗○✓○○○○✗○○✓○✓✗○○○✗✓✗○✗○✓○✗✗✗✗○○○○✓○✓○○○✗✓
  💾 Price progress saved: progress_prices_2022_tickers_100.csv (68 rows)
  💾 Statement progress saved: progress_statements_2022_tickers_100.csv (67 rows)

  Time: 0.6min elapsed, ~79.5min remaining
  Success: 17, Failed: 23, Skipped (small cap): 60
  API calls: 160 (286/minute avg)

  Processing batch 3: tickers 101-150
○✗✗✗✓✗○○○✓✗○✗○✗✗○✗✗○○✓○○○○○✗○✓○○○✗✗○✓✗✓○○✗○✗○○○✓✓○
  Time: 0.9min elapsed, ~82.5min remaining
  Success: 25, Failed:

In [40]:
# Collect 2021 data
YEAR = 2021
MAX_TICKERS = None

collect_and_save_year(YEAR, max_tickers=MAX_TICKERS)


Fetching ticker list for year 2021...
✅ Found 14302 US tickers for 2021

  COLLECTING SEPARATED DATA FOR YEAR 2021
Total tickers to check: 14302
Market cap filter: >$1B
API rate limit: 750 calls/minute
Batch size: 50 tickers
Progress saves: Every 100 tickers


  Processing batch 1: tickers 1-50
✗✗○✗○○○○✓✓✓○○✗○✗✗○✓○○○✗✗○○✓✓✓○✗○○✗✓○○✗✓○✓✗○○✓✓○✓○✓
  Time: 0.4min elapsed, ~100.3min remaining
  Success: 14, Failed: 12, Skipped (small cap): 24
  API calls: 108 (307/minute avg)

  Processing batch 2: tickers 51-100
✗○✗✓✗○✓✗○○✓✓✗○○✓✗○✓○✓○○○✗○○✓○✓✗○○○✗✓✗○✗○✓○✗○✗✗✗○○○
  💾 Price progress saved: progress_prices_2021_tickers_100.csv (94 rows)
  💾 Statement progress saved: progress_statements_2021_tickers_100.csv (94 rows)

  Time: 0.7min elapsed, ~98.6min remaining
  Success: 25, Failed: 27, Skipped (small cap): 48
  API calls: 211 (304/minute avg)

  Processing batch 3: tickers 101-150
○✓✓✓✓○○✗✓○✗✗✗✓✗○○○✓✗○○✗○✗✗○✗✗○○✓○✓✗○○✓✓○✗○✓✓✗○✓✓✗✗
  Time: 1.1min elapsed, ~100.1min remaining
  Success: 40, Fai

In [39]:
# Collect 2020 data
YEAR = 2020
MAX_TICKERS = None

collect_and_save_year(YEAR, max_tickers=MAX_TICKERS)


Fetching ticker list for year 2020...
✅ Found 14302 US tickers for 2020

  COLLECTING SEPARATED DATA FOR YEAR 2020
Total tickers to check: 14302
Market cap filter: >$1B
API rate limit: 750 calls/minute
Batch size: 50 tickers
Progress saves: Every 100 tickers


  Processing batch 1: tickers 1-50
✗✗✗✗○○✗○✗✓✓○○✓✗✓✗✗○✓○○○✗✗○○✗○✓○○✗○○✗✓○✓✗✓✗✓✗○○✓✓○✓
  Time: 0.3min elapsed, ~93.1min remaining
  Success: 13, Failed: 17, Skipped (small cap): 20
  API calls: 105 (321/minute avg)

  Processing batch 2: tickers 51-100
○✓✗○✗✓✗○✓✗○○✓✓✗○○○✗○✓○✓○✓○○✗○○✓○✓○○○○✗✓✗○✗○✓✗✗○○✗✗
  💾 Price progress saved: progress_prices_2020_tickers_100.csv (92 rows)
  💾 Statement progress saved: progress_statements_2020_tickers_100.csv (91 rows)

  Time: 0.7min elapsed, ~95.0min remaining
  Success: 25, Failed: 31, Skipped (small cap): 44
  API calls: 209 (313/minute avg)

  Processing batch 3: tickers 101-150
○○✗○✓✓✓✓○○✗○○✗✗○○✗○○○✓✗○○✗○✗✗○✗✗○○✓○✓✗○○○✓○✗○✓✓✗○✗
  Time: 1.0min elapsed, ~91.4min remaining
  Success: 35, Faile