# Data exploration #
Consider the gzip-compressed file **`sp500_daily_prices.csv.gz`**. A ChatGPT agent downloaded the list of 503 S&P 500 tickers from Wikipedia and built the file by pulling daily data for each ticker from Stooq via https://stooq.com/q/d/l/?s={ticker}.us&i=d.

Facts:
* The file has columns Date, Open, High, Low, Close, Volume, and Symbol.
* Each row gives the open, high, low, and close prices of the asset identified by Symbol on the given Date.
* The close price is presumably adjusted for dividends, stock splits, etc.
* The data run from January 2, 1970 through July 30, 2025.
* The symbols probably represent tickers. The file has 503 unique symbols.
* The file contains duplicate rows.
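The facts above can be checked programmatically. The following is a minimal sketch, not the original script: `summarize_prices` is a hypothetical helper, and the small `sample` table is made-up data that only mimics the file's columns.

```python
import pandas as pd


def summarize_prices(df: pd.DataFrame) -> dict:
    """Collect basic facts (columns, date range, symbols, duplicates) from a price table."""
    dates = pd.to_datetime(df["Date"])
    return {
        "columns": list(df.columns),
        "first_date": dates.min().date().isoformat(),
        "last_date": dates.max().date().isoformat(),
        "n_symbols": int(df["Symbol"].nunique()),
        # Rows that are exact copies of an earlier row
        "n_duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up sample with the same columns as the file (values are illustrative only)
sample = pd.DataFrame(
    {
        "Date": ["2025-07-14", "2025-07-15", "2025-07-15", "2025-07-15"],
        "Open": [10.0, 10.5, 10.5, 20.0],
        "High": [11.0, 11.5, 11.5, 21.0],
        "Low": [9.5, 10.0, 10.0, 19.5],
        "Close": [10.5, 11.0, 11.0, 20.5],
        "Volume": [1000, 1200, 1200, 500],
        "Symbol": ["AAA", "AAA", "AAA", "BBB"],
    }
)

summary = summarize_prices(sample)
print(summary)
```

Calling `summarize_prices(df)` on the loaded file should reproduce the date range and symbol count listed above, if those facts hold.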

Conclusion: the script written by the ChatGPT agent seems to be of high quality, but I have not verified whether the prices are adjusted, why duplicates appear, or whether the symbols are actual tickers. The next step is to run the script myself to see exactly what the data are and how they are assembled, so that duplicates can be prevented.
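One possible way to rebuild the file without duplicates is sketched below. This is not the agent's script: `stooq_url`, `fetch_ticker`, and `assemble` are hypothetical helpers, and the sketch assumes that the Stooq CSV contains a `Date` column and that a `Symbol` column has to be added per ticker. Only the URL pattern comes from the notes above. The demo at the end uses made-up frames, so no network access is needed.

```python
import pandas as pd

# URL pattern quoted in the notes above
STOOQ_URL = "https://stooq.com/q/d/l/?s={ticker}.us&i=d"


def stooq_url(ticker: str) -> str:
    """Build the download URL for one ticker."""
    return STOOQ_URL.format(ticker=ticker)


def fetch_ticker(ticker: str) -> pd.DataFrame:
    """Download one ticker's daily history (requires network access)."""
    frame = pd.read_csv(stooq_url(ticker))
    frame["Symbol"] = ticker  # assumption: the raw CSV has no symbol column
    return frame


def assemble(frames: list) -> pd.DataFrame:
    """Concatenate per-ticker frames, keeping one row per (Date, Symbol)."""
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.drop_duplicates(subset=["Date", "Symbol"], keep="first")
    return combined.sort_values(["Symbol", "Date"]).reset_index(drop=True)


# Offline demo with made-up frames; the first frame is passed twice on purpose
a = pd.DataFrame({"Date": ["2025-07-14", "2025-07-15"], "Close": [10.5, 11.0], "Symbol": "AAA"})
b = pd.DataFrame({"Date": ["2025-07-15"], "Close": [20.5], "Symbol": "BBB"})
rebuilt = assemble([a, b, a])
print(rebuilt)
```

With network access, something like `assemble([fetch_ticker(t) for t in tickers])` would rebuild the full table, where `tickers` is the list taken from Wikipedia.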

In [1]:
import pandas as pd


In [44]:
# Load the compressed CSV directly; Pandas will decompress it for you
df = pd.read_csv('../data/archive/sp500_daily_prices.csv.gz', compression='gzip')


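In [45]:
# Keep only the rows for a single trading day
df_filtered = df[df['Date'] == '2025-07-15']


In [46]:
# Symbols that appear more than once on that day, i.e. duplicated rows
duplicates = df_filtered[df_filtered['Symbol'].duplicated(keep=False)].sort_values('Symbol')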
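The single-date check can be extended to the whole table. The sketch below uses a hypothetical helper, `split_duplicates`, and made-up sample data. It separates rows that are exact copies from rows that share a (Date, Symbol) pair but disagree in some other column, which may help explain why duplicates appear.

```python
import pandas as pd


def split_duplicates(df: pd.DataFrame):
    """Return (exact, conflicting) duplicates keyed on (Date, Symbol)."""
    key = ["Date", "Symbol"]
    # All rows whose (Date, Symbol) pair occurs more than once
    repeated = df[df.duplicated(subset=key, keep=False)]
    # Exact duplicates: every column matches at least one other row
    exact = repeated[repeated.duplicated(keep=False)]
    # Conflicting: the key is shared, but the row is not an exact copy of any other row
    conflicting = repeated.drop(exact.index)
    return exact, conflicting


# Made-up sample: AAA is repeated exactly, BBB has two different close prices
sample = pd.DataFrame(
    {
        "Date": ["2025-07-15"] * 5,
        "Close": [11.0, 11.0, 20.5, 20.6, 5.0],
        "Symbol": ["AAA", "AAA", "BBB", "BBB", "CCC"],
    }
)

exact, conflicting = split_duplicates(sample)
print(len(exact), "exact duplicate rows;", len(conflicting), "conflicting rows")
```

Exact copies can be removed safely with `drop_duplicates()`, whereas conflicting rows need a decision about which value to keep.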


In [31]:
# Collect the distinct symbols (503 in this file)
Symbols = df['Symbol'].unique()


In [None]:
# Inspect the structure
print(df.head())
# Output columns: Date, Open, High, Low, Close, Volume, Symbol

# Example 1 – filter by symbol (e.g., Apple)
apple_data = df[df['Symbol'] == 'AAPL']
print(apple_data.head())

# Example 2 – filter by date range
mask = (df['Date'] >= '2010-01-01') & (df['Date'] <= '2020-12-31')
period_data = df[mask]

# Example 3 – get closing prices for a specific stock and date range
msft_mask = (df['Symbol'] == 'MSFT') & df['Date'].between('2015-01-01', '2015-12-31')
msft_close = df.loc[msft_mask, ['Date', 'Close']]
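The date filters above compare strings, which gives correct results only while `Date` is stored as ISO `YYYY-MM-DD` text. A minimal sketch, using made-up sample data, converts the column to datetimes first so that the comparison no longer depends on the text format:

```python
import pandas as pd

# Made-up sample rows (values are illustrative only)
sample = pd.DataFrame(
    {
        "Date": ["2014-12-31", "2015-01-02", "2015-06-30", "2016-01-04"],
        "Close": [40.0, 40.5, 44.0, 54.0],
        "Symbol": ["MSFT"] * 4,
    }
)

# Parse the text dates once; later comparisons then operate on real timestamps
sample["Date"] = pd.to_datetime(sample["Date"])

start, end = pd.Timestamp("2015-01-01"), pd.Timestamp("2015-12-31")
in_2015 = sample[sample["Date"].between(start, end)]
print(in_2015)
```

For the real file, the same effect can be had at load time by passing `parse_dates=['Date']` to `pd.read_csv`.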