# EDA #
Goal: use the agentic ChatGPT script to download financial data from Stooq for the relevant S&P500 constituents, without duplicates, and explore whether the Close prices are adjusted (i.e., free of jumps from dividends, stock splits, etc.).

facts and observations:
* The official S&P500 index data, which includes constituents' tickers and weightings, requires a paid subscription. The Wikipedia list of S&P500 constituents was last updated on July 23, 2025. Since Wikipedia is more-or-less trustworthy and its content was updated recently, we will use Wikipedia to get the tickers.
* Function **`fetch_stooq_csv()`** returns the price history as far back as Stooq provides it (e.g., back to the IPO). Some price histories are shorter because the constituents went public recently. All prices (i.e., Open, High, Low, Close) seem to be adjusted for stock splits and dividends. I checked and found no jumps around Apple's 7-for-1 split on June 9, 2014.
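
The observation that some histories are shorter can be checked by summarizing the first and last date per symbol. The following is a minimal sketch, assuming a long-format frame with `Symbol` and `Date` columns like the one this notebook builds; `history_span` and the toy rows are hypothetical names and made-up values, not real Stooq data.

```python
import pandas as pd


def history_span(prices: pd.DataFrame) -> pd.DataFrame:
    """Return first date, last date, and row count per symbol."""
    return (
        prices.groupby("Symbol")["Date"]
        .agg(first_date="min", last_date="max", n_rows="count")
        .sort_values("first_date")
    )


# Toy rows, made up for illustration only
toy = pd.DataFrame(
    {
        "Symbol": ["OLD", "OLD", "OLD", "NEW", "NEW"],
        "Date": pd.to_datetime(
            ["1990-01-02", "1990-01-03", "2025-07-30", "2023-05-01", "2025-07-30"]
        ),
        "Close": [1.0, 1.1, 50.0, 20.0, 25.0],
    }
)

span = history_span(toy)
print(span)
```

Sorting by `first_date` puts the longest histories first, so recently listed constituents end up at the bottom of the table.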

In [27]:
import io
import os
import time
import gzip
import requests
import pandas as pd
from tqdm import tqdm

In [2]:
# ----------------------------------------------------------------------
# CONFIGURABLE PARAMETERS
# ----------------------------------------------------------------------
WIKI_SP500_URL = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
STOOQ_URL_TEMPLATE = "https://stooq.com/q/d/l/?s={symbol}.us&i=d"
OUT_PATH = "../data/raw/sp500_daily_prices.csv.gz"
REQUEST_TIMEOUT = 20           # seconds per HTTP request
SLEEP_BETWEEN_CALLS = 0.3      # polite delay to avoid hitting Stooq rate limit
# ----------------------------------------------------------------------

In [9]:
def get_sp500_tickers() -> list[str]:
    """Return a list of S&P 500 tickers from Wikipedia (strings, upper-case)."""
    tables = pd.read_html(WIKI_SP500_URL)
    tickers = tables[0]["Symbol"].str.replace(".", "-", regex=False).str.upper()
    return tickers.tolist()
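
To illustrate what the clean-up in `get_sp500_tickers()` does, here is a small sketch of the same normalization applied to plain strings. The helper name `normalize_ticker` is hypothetical; the example inputs assume that share-class tickers are written with a dot (e.g. `BRK.B`) in the source list.

```python
def normalize_ticker(symbol: str) -> str:
    """Mirror the clean-up in get_sp500_tickers(): '.' -> '-' and upper-case."""
    return symbol.replace(".", "-").upper()


examples = ["BRK.B", "BF.B", "aapl"]
normalized = [normalize_ticker(s) for s in examples]
print(normalized)
```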

In [24]:
def fetch_stooq_csv(symbol: str) -> pd.DataFrame | None:
    """
    Download one ticker from Stooq.
    Returns a DataFrame or None if the request fails.
    """
    url = STOOQ_URL_TEMPLATE.format(symbol=symbol.lower())
    try:
        r = requests.get(url, timeout=REQUEST_TIMEOUT)
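        if r.status_code != 200 or "Exceeded the daily hits limit" in r.text:
            return None
        df = pd.read_csv(io.StringIO(r.text))
        # Guard against a 200 response whose body is not the expected CSV
        if "Date" not in df.columns:
            return None
        df["Symbol"] = symbol

        # Parse 'Date' so that date filtering and sorting work as expected
        df["Date"] = pd.to_datetime(df["Date"])
        return df
    except (requests.RequestException, ValueError):
        # ValueError also covers pandas' empty-data and parser errors
        return None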

The cell below shows that prices in the dataset are adjusted for Apple's 7-for-1 split on June 9, 2014: the rows around that date show no jump.

In [28]:
df = fetch_stooq_csv("AAPL")

# Define center date and range
center_date = pd.Timestamp("2014-06-09")
delta = pd.Timedelta(days=5)

# Filter rows within ±5 days
df[(df['Date'] >= center_date - delta) & (df['Date'] <= center_date + delta)]


Unnamed: 0,Date,Open,High,Low,Close,Volume,Symbol
7497,2014-06-04,20.0687,20.4007,20.0264,20.3023,380429518,AAPL
7498,2014-06-05,20.3466,20.4473,20.2324,20.3833,344413327,AAPL
7499,2014-06-06,20.4631,20.5043,20.2915,20.326,397215840,AAPL
7500,2014-06-09,20.4323,20.6887,20.2215,20.6491,341695389,AAPL
7501,2014-06-10,20.8788,20.9497,20.6216,20.7704,284620670,AAPL
7502,2014-06-11,20.7449,20.8848,20.6,20.6848,207062657,AAPL
7503,2014-06-12,20.722,20.7429,20.2539,20.3416,248177955,AAPL
7504,2014-06-13,20.322,20.3744,20.0284,20.1159,247322534,AAPL
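
The "no jump" reading of the rows displayed above can be made numeric. The sketch below copies the eight Close values from that output, computes day-over-day returns, and compares the largest move with the one-day drop an unadjusted 7-for-1 split would produce (about 1 - 1/7). The variable names are illustrative.

```python
# Close prices copied from the rows displayed above (2014-06-04 to 2014-06-13)
closes = [20.3023, 20.3833, 20.3260, 20.6491, 20.7704, 20.6848, 20.3416, 20.1159]

# Day-over-day simple returns
returns = [(b - a) / a for a, b in zip(closes, closes[1:])]
max_abs_move = max(abs(r) for r in returns)

# An unadjusted 7-for-1 split would appear as a one-day drop of about 1 - 1/7
unadjusted_split_drop = 1 - 1 / 7

print(f"largest one-day move: {max_abs_move:.2%}")
print(f"expected drop if unadjusted: {unadjusted_split_drop:.2%}")
```

The largest one-day move in this window is under 2%, far from the roughly 86% drop an unadjusted series would show, which is consistent with the prices being split-adjusted.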


In [30]:
def build_sp500_dataset() -> pd.DataFrame:
    """Download and concatenate all tickers into a single DataFrame."""
    tickers = get_sp500_tickers()
    frames = []

    for sym in tqdm(tickers, desc="Downloading", unit="stock"):
        df = fetch_stooq_csv(sym)
        if df is not None and not df.empty:
            frames.append(df)
        time.sleep(SLEEP_BETWEEN_CALLS)

    if not frames:
        raise RuntimeError("No data downloaded — check connectivity or rate limits.")

    # Concatenate and drop exact duplicate rows (if any)
    full_df = (
        pd.concat(frames, ignore_index=True)
        .drop_duplicates()
        .sort_values(["Symbol", "Date"])
        .reset_index(drop=True)
    )
    return full_df


Downloading:   3%|▎         | 13/503 [00:08<05:37,  1.45stock/s]


KeyboardInterrupt: 
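
Note that `drop_duplicates()` with no arguments removes only rows that are identical in every column. Since the goal is a dataset without duplicates, it may be worth also checking for repeated `(Symbol, Date)` pairs whose prices differ. A minimal sketch, with a hypothetical helper `count_duplicate_keys` and made-up toy rows:

```python
import pandas as pd


def count_duplicate_keys(prices: pd.DataFrame) -> int:
    """Number of rows whose (Symbol, Date) pair already appeared earlier."""
    return int(prices.duplicated(subset=["Symbol", "Date"]).sum())


# Toy rows, made up for illustration: AAA has two different closes on one day
toy = pd.DataFrame(
    {
        "Symbol": ["AAA", "AAA", "AAA", "BBB"],
        "Date": pd.to_datetime(
            ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-02"]
        ),
        "Close": [10.0, 10.5, 11.0, 20.0],
    }
)

exact_dupes_removed = toy.drop_duplicates()  # removes nothing here
n_key_dupes = count_duplicate_keys(exact_dupes_removed)
print(n_key_dupes)
```

If the count is non-zero on the real data, the conflicting rows would need a closer look before deciding which one to keep.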

In [31]:
def save_compressed_csv(df: pd.DataFrame, out_path: str = OUT_PATH) -> None:
    """Save DataFrame to a gzip-compressed CSV."""
    # pandas could also write gzip directly (compression="gzip"); gzip.open is used here explicitly.
    with gzip.open(out_path, "wt", newline="") as gz:
        df.to_csv(gz, index=False)
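
As an alternative to wrapping the file handle in `gzip.open`, pandas can compress on write and decompress on read by itself. The sketch below is a round trip on a made-up two-row frame in a temporary directory; it is only meant to show the `compression` argument, not to replace `save_compressed_csv()`.

```python
import os
import tempfile

import pandas as pd

# Toy frame, made up for illustration
toy = pd.DataFrame({"Symbol": ["AAA", "BBB"], "Close": [1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv.gz")
    # Write a gzip-compressed CSV directly from pandas
    toy.to_csv(path, index=False, compression="gzip")
    # On read, compression is inferred from the .gz extension
    restored = pd.read_csv(path)

print(restored)
```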

In [32]:
def main() -> None:
    print("Fetching data …")
    df = build_sp500_dataset()
    print(f"Rows downloaded: {len(df):,}  |  Unique tickers: {df['Symbol'].nunique()}")
    print("Saving compressed CSV …")
    save_compressed_csv(df)
    size_mb = os.path.getsize(OUT_PATH) / (1024 * 1024)
    print(f"Done. File saved to '{OUT_PATH}' ({size_mb:.1f} MB).")

In [None]:
tickers1 = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]

In [46]:
# Import packages
import yfinance as yf
import pandas as pd

# Read the stock tickers that make up the S&P500
tickers = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0].Symbol

# Get auto-adjusted Close prices for these tickers from Yahoo Finance
data = yf.download(
    tickers.str.replace(".", "-", regex=False).to_list(),
    start='1957-03-04',
    auto_adjust=True,
)['Close']
print(data.head())

[*********************100%***********************]  503 of 503 completed

1 Failed download:
['AFL']: Timeout('Failed to perform, curl: (28) Connection timed out after 10002 milliseconds. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.')


Ticker       A  AAPL  ABBV  ABNB  ABT  ACGL  ACN  ADBE  ADI  ADM  ...  WY  \
Date                                                              ...       
1962-01-02 NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN  NaN  NaN  ... NaN   
1962-01-03 NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN  NaN  NaN  ... NaN   
1962-01-04 NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN  NaN  NaN  ... NaN   
1962-01-05 NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN  NaN  NaN  ... NaN   
1962-01-08 NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN  NaN  NaN  ... NaN   

Ticker      WYNN  XEL       XOM  XYL  XYZ  YUM  ZBH  ZBRA  ZTS  
Date                                                            
1962-01-02   NaN  NaN  0.090965  NaN  NaN  NaN  NaN   NaN  NaN  
1962-01-03   NaN  NaN  0.092316  NaN  NaN  NaN  NaN   NaN  NaN  
1962-01-04   NaN  NaN  0.092542  NaN  NaN  NaN  NaN   NaN  NaN  
1962-01-05   NaN  NaN  0.090515  NaN  NaN  NaN  NaN   NaN  NaN  
1962-01-08   NaN  NaN  0.090290  NaN  NaN  NaN  NaN   NaN  NaN  

[5 r

In [43]:
# Check that AEE is in the normalized ticker list
"AEE" in tickers.str.replace(".", "-", regex=False).to_list()

True

In [44]:
yf.download("AEE", start = '1957-03-04', auto_adjust=True)['Close']

[*********************100%***********************]  1 of 1 completed


Ticker,AEE
Date,Unnamed: 1_level_1
1998-01-02,11.929947
1998-01-05,11.842873
1998-01-06,11.668709
1998-01-07,11.564217
1998-01-08,11.424885
...,...
2025-07-25,100.099998
2025-07-28,98.320000
2025-07-29,99.820000
2025-07-30,99.839996
