# 01 - Data Exploration

This notebook demonstrates how to load cryptocurrency and financial data using the
`fatcrash.data` module, compute log returns, and perform initial exploratory analysis.

We will:
1. Fetch BTC/USD price history from Yahoo Finance and CoinGecko
2. Cache the data for reuse in later notebooks
3. Compute log returns and log prices
4. Plot price, volume, and return series
5. Examine basic statistics (mean, std, skewness, kurtosis)
6. Inspect the distribution shape of returns
7. Resample the daily data to weekly OHLCV

## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

from fatcrash.data.ingest import from_yahoo, from_coingecko
from fatcrash.data.transforms import log_returns, log_prices, time_index, resample_ohlcv
from fatcrash.data.cache import load_cached, save_cache

plt.style.use("seaborn-v0_8-whitegrid")
plt.rcParams["figure.figsize"] = (14, 5)

## 1. Load BTC data from Yahoo Finance

In [None]:
# Fetch daily BTC/USD from Yahoo Finance
btc_yahoo = from_yahoo("BTC-USD", start="2015-01-01", end="2025-12-31")
btc_yahoo = time_index(btc_yahoo)
print(f"Yahoo shape: {btc_yahoo.shape}")
btc_yahoo.head()

## 2. Load BTC data from CoinGecko (alternative source)

In [None]:
# CoinGecko provides a different data source for comparison
btc_cg = from_coingecko("bitcoin", vs_currency="usd", days=365 * 5)
btc_cg = time_index(btc_cg)
print(f"CoinGecko shape: {btc_cg.shape}")
btc_cg.head()
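
Having two sources is most useful when they are actually compared. The sketch below shows one simple check: align two close-price series on their common dates and report the largest relative gap. The helper `max_relative_gap` and the toy numbers are hypothetical and are not part of `fatcrash`. With the DataFrames above, the same idea would apply to the `close` columns on the intersection of the two indexes.

```python
def max_relative_gap(series_a, series_b):
    """Return (date, gap, n_common) for the largest relative close-price gap.

    Both inputs map a date key to a close price. Only dates present in
    both sources are compared; the gap is measured relative to series_b.
    """
    common = sorted(set(series_a) & set(series_b))
    if not common:
        raise ValueError("the two sources share no dates")
    gaps = {d: abs(series_a[d] - series_b[d]) / series_b[d] for d in common}
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst], len(common)


# Made-up prices, for illustration only
source_a = {"2024-01-01": 100.0, "2024-01-02": 102.0, "2024-01-03": 101.0}
source_b = {"2024-01-02": 100.0, "2024-01-03": 101.0, "2024-01-04": 103.0}

worst_date, worst_gap, n_common = max_relative_gap(source_a, source_b)
print(f"{n_common} common dates, largest gap {worst_gap:.2%} on {worst_date}")
```

Small gaps are expected because the providers aggregate different exchanges. Large or persistent gaps are worth investigating before trusting either series.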

## 3. Cache the data for later reuse

In [None]:
# Save to local cache so subsequent notebooks can load instantly
save_cache(btc_yahoo, "btc_yahoo_daily")

# To reload later:
# btc_yahoo = load_cached("btc_yahoo_daily")

## 4. Compute log returns and log prices

In [None]:
# Use the fatcrash transforms module
df = btc_yahoo.copy()
df["log_price"] = log_prices(df["close"].values)
df["log_return"] = log_returns(df["close"].values)

print(f"Number of observations: {len(df)}")
print(f"Date range: {df.index.min()} to {df.index.max()}")
df[["close", "log_price", "log_return"]].head(10)
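
For reference, the log return is r_t = ln(P_t / P_{t-1}), which is the first difference of the log price. The sketch below illustrates that definition with the standard library only. The `_sketch` helpers are hypothetical stand-ins, not the `fatcrash` implementations. In particular, the sketch returns n - 1 values for n prices; whether `fatcrash.data.transforms.log_returns` pads the first element is not shown in this notebook.

```python
import math


def log_prices_sketch(prices):
    """Natural log of each price."""
    return [math.log(p) for p in prices]


def log_returns_sketch(prices):
    """First difference of log prices: ln(P_t) - ln(P_{t-1})."""
    lp = log_prices_sketch(prices)
    return [curr - prev for prev, curr in zip(lp, lp[1:])]


prices = [100.0, 110.0, 99.0, 99.0]  # toy values
rets = log_returns_sketch(prices)

# Log returns are additive: their sum is the log return over the whole period
total = sum(rets)
print(rets)
print(total, math.log(prices[-1] / prices[0]))
```

Additivity over time is the main reason to prefer log returns over simple returns when aggregating across horizons.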

## 5. Plot price and volume

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Price
axes[0].plot(df.index, df["close"], color="steelblue", linewidth=0.8)
axes[0].set_ylabel("Price (USD)")
axes[0].set_title("BTC/USD Daily Close Price")
axes[0].set_yscale("log")

# Volume
if "volume" in df.columns:
    axes[1].bar(df.index, df["volume"], color="gray", alpha=0.5, width=1)
    axes[1].set_ylabel("Volume")
    axes[1].set_title("Daily Volume")

# Log returns
axes[2].plot(df.index, df["log_return"], color="tomato", linewidth=0.5, alpha=0.7)
axes[2].set_ylabel("Log Return")
axes[2].set_title("Daily Log Returns")
axes[2].axhline(0, color="black", linewidth=0.5)

plt.tight_layout()
plt.show()

## 6. Basic return statistics

In [None]:
returns = df["log_return"].dropna()

summary = {
    "count": len(returns),
    "mean (daily)": returns.mean(),
    "std (daily)": returns.std(),
    # BTC trades every calendar day, so annualize with 365 rather than 252
    "annualized mean": returns.mean() * 365,
    "annualized vol": returns.std() * np.sqrt(365),
    "skewness": stats.skew(returns),
    "excess kurtosis": stats.kurtosis(returns),
    "min": returns.min(),
    "max": returns.max(),
    "1st percentile": np.percentile(returns, 1),
    "5th percentile": np.percentile(returns, 5),
}

for k, v in summary.items():
    print(f"{k:>20s}: {v:>12.6f}")
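
To make the quantities in the table explicit, here is a minimal sketch of the same statistics computed from raw moments. It assumes the conventions used by the calls above: the pandas `std()` is the sample standard deviation (ddof=1), while `scipy.stats.skew` and `scipy.stats.kurtosis` default to the biased moment estimators, with kurtosis reported as excess (Fisher) kurtosis. The function name `moments_summary` is hypothetical.

```python
import math


def moments_summary(returns, periods_per_year=365):
    """Mean, volatility, skewness and excess kurtosis from central moments."""
    n = len(returns)
    mean = sum(returns) / n
    # Biased central moments (divide by n)
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m3 = sum((r - mean) ** 3 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    sample_std = math.sqrt(m2 * n / (n - 1))  # ddof=1 correction
    return {
        "annualized mean": mean * periods_per_year,
        # Volatility scales with the square root of time
        "annualized vol": sample_std * math.sqrt(periods_per_year),
        "skewness": m3 / m2 ** 1.5,
        "excess kurtosis": m4 / m2 ** 2 - 3.0,
    }


toy = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric toy data, not real returns
result = moments_summary(toy)
for name, value in result.items():
    print(f"{name:>20s}: {value:>12.6f}")
```

The square-root-of-time rule for volatility assumes returns are uncorrelated across days, which is only an approximation for real price series.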

## 7. Distribution shape

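A Gaussian distribution has an excess kurtosis of 0. Financial returns typically show
excess kurtosis well above 0 ("fat tails"): extreme moves occur far more often than a
normal model predicts. This is the property fatcrash is designed to measure and exploit.
The plots in this section compare the empirical returns with a fitted normal distribution
and inspect the left tail on log-log axes.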
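
One way to make "fat tails" concrete is to ask how rare a k-sigma daily move would be if returns really were Gaussian. The sketch below computes the two-sided tail probability P(|Z| > k) with `math.erfc` and converts it to an expected waiting time, assuming one observation per calendar day. The function name is hypothetical. Counting how many standardized returns in the data exceed the same thresholds gives a direct comparison.

```python
import math


def gaussian_two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z, using erfc(k / sqrt(2))."""
    return math.erfc(k / math.sqrt(2.0))


p3 = gaussian_two_sided_tail(3)
p5 = gaussian_two_sided_tail(5)

for k in (3, 5, 10):
    p = gaussian_two_sided_tail(k)
    years = 1.0 / p / 365.0  # expected wait, one observation per day
    print(f"{k:>2d} sigma: p = {p:.3e}, roughly one day in {years:,.1f} years")
```

If the empirical frequency of such moves is orders of magnitude higher than these benchmarks, a Gaussian model badly understates tail risk.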

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Histogram vs normal
axes[0].hist(returns, bins=200, density=True, alpha=0.6, color="steelblue", label="Empirical")
x = np.linspace(returns.min(), returns.max(), 500)
axes[0].plot(x, stats.norm.pdf(x, returns.mean(), returns.std()), "r-", linewidth=1.5, label="Normal")
axes[0].set_title("Return Distribution vs Normal")
axes[0].set_xlabel("Log Return")
axes[0].legend()
axes[0].set_xlim(-0.2, 0.2)

# QQ plot
stats.probplot(returns, dist="norm", plot=axes[1])
axes[1].set_title("QQ Plot (Normal)")

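# Log-log complementary CDF of losses (left-tail behavior)
sorted_neg = np.sort(-returns[returns < 0])
# P(X >= x) for the i-th smallest loss (0-based) is (n - i) / n. Unlike 1 - rank / n,
# this never reaches 0, so the largest loss is not lost on the log axis.
survival = 1.0 - np.arange(len(sorted_neg)) / len(sorted_neg)
axes[2].loglog(sorted_neg, survival, ".", markersize=1, alpha=0.5)
axes[2].set_xlabel("|Negative Return|")
axes[2].set_ylabel("P(X >= x)")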
axes[2].set_title("Left Tail (Log-Log CCDF)")

plt.tight_layout()
plt.show()
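
The reason for plotting the CCDF on log-log axes is that a power law P(X >= x) = x^(-alpha) becomes a straight line with slope -alpha. The sketch below checks this on exact Pareto quantiles, with an arbitrary alpha = 3 chosen purely for illustration (it is not an estimate for BTC). The helper `empirical_ccdf` is hypothetical and uses the (n - i) / n convention so that no probability is zero.

```python
import math


def empirical_ccdf(values):
    """Return sorted values and P(X >= x) for each, using (n - i) / n."""
    xs = sorted(values)
    n = len(xs)
    probs = [(n - i) / n for i in range(n)]
    return xs, probs


alpha = 3.0  # arbitrary tail exponent for the illustration
n = 1000
# Exact quantiles of a Pareto law with minimum 1: x = p ** (-1 / alpha)
pareto_quantiles = [((n - i) / n) ** (-1.0 / alpha) for i in range(n)]

xs, probs = empirical_ccdf(pareto_quantiles)

# Slope between the first and last points in log-log coordinates
slope = (math.log(probs[-1]) - math.log(probs[0])) / (
    math.log(xs[-1]) - math.log(xs[0])
)
print(f"log-log slope: {slope:.4f} (expected {-alpha})")
```

A roughly straight tail is consistent with a power law but does not prove one, since other heavy-tailed distributions can look similar over a limited range.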

## 8. Resample to weekly OHLCV

Some analyses (e.g., fitting the Log-Periodic Power Law Singularity (LPPLS) model) benefit
from weekly data, which smooths out daily noise.

In [None]:
btc_weekly = resample_ohlcv(btc_yahoo, freq="W")
btc_weekly["log_return"] = log_returns(btc_weekly["close"].values)

print(f"Weekly observations: {len(btc_weekly)}")
print(f"Weekly return std: {btc_weekly['log_return'].std():.4f}")
print(f"Weekly excess kurtosis: {stats.kurtosis(btc_weekly['log_return'].dropna()):.2f}")
btc_weekly.head()
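
For readers unfamiliar with OHLCV resampling, the aggregation rule is: first open, highest high, lowest low, last close, and summed volume within each week. The sketch below shows that rule with the standard library, grouping by ISO week (Monday to Sunday). It is an illustration under those assumptions, not the `resample_ohlcv` implementation; in pandas the usual route is `DataFrame.resample(...).agg(...)` with the same per-column rules.

```python
from datetime import date, timedelta


def resample_weekly(rows):
    """Aggregate daily (date, open, high, low, close, volume) rows by ISO week.

    Rows must be sorted by date.
    """
    bars = {}
    for day, open_, high, low, close, volume in rows:
        key = tuple(day.isocalendar()[:2])  # (ISO year, ISO week number)
        if key not in bars:
            # First row of the week sets the open
            bars[key] = {"open": open_, "high": high, "low": low,
                         "close": close, "volume": volume}
        else:
            bar = bars[key]
            bar["high"] = max(bar["high"], high)
            bar["low"] = min(bar["low"], low)
            bar["close"] = close  # last close seen in the week
            bar["volume"] += volume
    return bars


# Toy data: 14 consecutive days starting on Monday 2024-01-01
start = date(2024, 1, 1)
daily = [
    (start + timedelta(days=i), 100.0 + i, 101.0 + i, 99.0 + i, 100.5 + i, 10.0)
    for i in range(14)
]

weekly = resample_weekly(daily)
for week, bar in weekly.items():
    print(week, bar)
```

Weekly log returns should be computed from the weekly closes rather than by averaging daily returns, which is what the cell above does with `btc_weekly["close"]`.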

## Summary

Key findings from this exploration:
- BTC returns exhibit significant **fat tails** (high excess kurtosis)
- The QQ plot shows clear deviation from normality in both tails
- The log-log CCDF of losses suggests **power-law** behavior in the left tail (only negative returns are plotted here)
- These properties motivate the use of EVT and tail-index estimators in subsequent notebooks