# Cryptocurrency Volatility Prediction

**Single-file submission**: this notebook contains the full pipeline — data ingestion, preprocessing, feature engineering, EDA, model training, evaluation, artifact saving, and an optional Streamlit app. 

**Dataset source options:**
- Use the uploaded `dataset.csv` in the same directory.
- Or (reproducible) download from the provided Google Drive link.

**How to use:** run the notebook top-to-bottom (Kernel ▶ Restart & Run All).

In [None]:
%%bash
python -V


In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pickle

# Paths
OUT_DIR = 'ml_submission_outputs'  # relative to the notebook working directory
os.makedirs(OUT_DIR, exist_ok=True)
DATA_CSV = 'dataset.csv'  # expected to be in the notebook working directory
DRIVE_LINK = 'https://drive.google.com/file/d/1iVhJKnfAR-Vm4JHC-TY4-kXEcfH5C_ky/view?usp=drive_link'

print('Output folder:', OUT_DIR)

## Option A — Use uploaded `dataset.csv` (already provided)

## Option B — Download dataset from Google Drive (reproducible)

If you want the notebook to download the dataset automatically, uncomment and run the cell below. It uses `gdown` to fetch the file from Google Drive. If `gdown` is not installed, the cell will install it automatically.

In [None]:
# Uncomment to download dataset from Google Drive
# !pip install --quiet gdown
# import gdown
# url = 'https://drive.google.com/uc?id=1iVhJKnfAR-Vm4JHC-TY4-kXEcfH5C_ky&export=download'
# gdown.download(url, DATA_CSV, quiet=False)
# print('Downloaded to', DATA_CSV)

# If you already uploaded dataset.csv to the notebook folder, no need to run the downloader.

## Load dataset — robust to column name variations
The dataset used in the PDF had columns like `crypto_name`, `marketCap`, `timestamp`, `date`, etc. This code will attempt to load and normalize those columns.
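
As a rough illustration of the normalization idea, the sketch below maps a few header variants onto canonical names. The `ALIASES` table and the `normalize_columns` helper are hypothetical names used only for this example, and unlike the loader cell, the sketch lowercases every unmatched column as well.

```python
import pandas as pd

# Hypothetical alias table: lowercase source name -> canonical name.
ALIASES = {
    "crypto_name": "symbol",
    "name": "symbol",
    "marketcap": "market_cap",
    "market_cap": "market_cap",
}


def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names are stripped, lowercased, and aliased."""
    rename = {}
    for col in frame.columns:
        key = col.strip().lower()
        rename[col] = ALIASES.get(key, key)
    return frame.rename(columns=rename)


# Toy frame with stray whitespace and mixed-case headers.
raw = pd.DataFrame({" crypto_name": ["BTC"], "marketCap": [1.0], "Close ": [2.0]})
normalized = normalize_columns(raw)
print(normalized.columns.tolist())
```

Lowercasing before the lookup is what lets a header such as `marketCap` match an alias key.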

In [None]:

# Try to read dataset
if not os.path.exists(DATA_CSV):
    raise FileNotFoundError(f"{DATA_CSV} not found in working directory. Place the CSV next to this notebook or use the downloader cell.")

df = pd.read_csv(DATA_CSV)
print('Initial shape:', df.shape)
print('Columns:', df.columns.tolist())

# Normalize column names
df.columns = [c.strip() for c in df.columns]
col_map = {}
# Common variants mapping (keys must be lowercase: lookups use the lowercased column name)
variants = {
    'crypto_name': 'symbol',
    'name': 'symbol',
    'marketcap': 'market_cap',
    'market_cap': 'market_cap',
    'timestamp': 'timestamp',
    'date': 'date'
}
for c in df.columns:
    lc = c.lower()
    if lc in variants:
        col_map[c] = variants[lc]
    elif lc in ['open','high','low','close','volume']:
        col_map[c] = lc
    else:
        # try fuzzy match
        if 'name' in lc and 'crypto' in lc:
            col_map[c] = 'symbol'
        elif 'market' in lc and 'cap' in lc:
            col_map[c] = 'market_cap'
        elif 'time' in lc:
            col_map[c] = 'timestamp'

df = df.rename(columns=col_map)
print('After rename - columns:', df.columns.tolist())

# Ensure required columns exist
required = ['date','symbol','open','high','low','close','volume','market_cap']
missing = [c for c in required if c not in df.columns]
if missing:
    print('Warning - missing columns (features that depend on them may be unavailable):', missing)

# If date column exists but is not datetime, coerce it
if 'date' in df.columns:
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
else:
    # try to derive date from timestamp
    if 'timestamp' in df.columns:
        df['date'] = pd.to_datetime(df['timestamp'], errors='coerce').dt.date
        df['date'] = pd.to_datetime(df['date'])
    else:
        raise ValueError('No date or timestamp column found. Cannot proceed without a date.')

# Fill or coerce numeric cols
for c in ['open','high','low','close','volume','market_cap']:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')

# Quick look
display(df.head())
df.info()  # info() prints its summary and returns None, so it is not wrapped in display()


## Data cleaning, preprocessing, and feature engineering
- Fill missing numeric values per symbol (forward/backfill)
- Create returns, intraday_range, rolling stats, liquidity, ATR proxy
- Define target `next_day_vol` = next day's intraday_range (you can change target later)
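
To make the target definition concrete, here is a minimal sketch on a made-up five-row frame (the symbols and prices are invented for illustration). It computes the intraday range and shifts it back by one day within each symbol, so the last row of every symbol has no target.

```python
import pandas as pd

# Invented toy data: two symbols, consecutive days.
toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
    ),
    "open": [100.0, 110.0, 120.0, 10.0, 20.0],
    "high": [110.0, 121.0, 126.0, 12.0, 21.0],
    "low": [90.0, 99.0, 114.0, 9.0, 19.0],
})

# Sort first so that shift(-1) really means "the next day of the same symbol".
toy = toy.sort_values(["symbol", "date"]).reset_index(drop=True)
toy["intraday_range"] = (toy["high"] - toy["low"]) / toy["open"]
toy["next_day_vol"] = toy.groupby("symbol")["intraday_range"].shift(-1)

print(toy[["symbol", "date", "intraday_range", "next_day_vol"]])
```

Shifting inside the `groupby` keeps one symbol's first day from becoming the target for another symbol's last day.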


In [None]:

# Drop rows without a date; drop rows without a symbol when that column exists
df = df.dropna(subset=['date']).copy()
if 'symbol' in df.columns:
    df = df.dropna(subset=['symbol'])
else:
    # No symbol column: treat the whole dataset as a single series
    df['symbol'] = 'ALL'

# Sort and fill numeric values per symbol
df = df.sort_values(['symbol','date']).reset_index(drop=True)

numeric_cols = [c for c in ['open','high','low','close','volume','market_cap'] if c in df.columns]
if len(numeric_cols)==0:
    raise ValueError('No numeric OHLC/volume/market_cap columns found. Please check your dataset.')

# Group-wise forward/backfill for numeric cols; transform() keeps the original row index
df[numeric_cols] = df.groupby('symbol')[numeric_cols].transform(lambda s: s.ffill().bfill())

# Drop rows still missing core numeric data
df = df.dropna(subset=['open','high','low','close']).reset_index(drop=True)

# Feature engineering
df['return'] = (df['close'] - df['open']) / df['open']
df['intraday_range'] = (df['high'] - df['low']) / df['open']  # volatility proxy
df['next_day_vol'] = df.groupby('symbol')['intraday_range'].shift(-1)

# Rolling features
df['ma_7'] = df.groupby('symbol')['close'].rolling(window=7, min_periods=1).mean().reset_index(level=0, drop=True)
df['ma_30'] = df.groupby('symbol')['close'].rolling(window=30, min_periods=1).mean().reset_index(level=0, drop=True)
df['vol_7'] = df.groupby('symbol')['return'].rolling(window=7, min_periods=1).std().reset_index(level=0, drop=True)
df['vol_30'] = df.groupby('symbol')['return'].rolling(window=30, min_periods=1).std().reset_index(level=0, drop=True)
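if 'volume' in df.columns and 'market_cap' in df.columns:
    df['liquidity'] = df['volume'] / df['market_cap'].replace({0: np.nan})
# Simplified ATR: 14-day mean of the high-low range per symbol (ignores gaps from the previous close)
df['atr_14'] = (df['high'] - df['low']).groupby(df['symbol']).transform(lambda s: s.rolling(14, min_periods=1).mean())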

# Filter rows with target
df_model = df.dropna(subset=['next_day_vol']).copy()
df_model['symbol_code'] = df_model['symbol'].astype('category').cat.codes

print('Final rows for modeling:', df_model.shape[0])
display(df_model.head())


## Exploratory Data Analysis (EDA)
Plots: target distribution, correlation heatmap, sample time-series for top symbols.

In [None]:

import matplotlib.dates as mdates

# Distribution of target
plt.figure(figsize=(8,4))
plt.hist(df_model['next_day_vol'].dropna(), bins=80)
plt.title('Distribution of Next-Day Volatility (intraday range proxy)')
plt.xlabel('next_day_vol')
plt.ylabel('count')
plt.tight_layout()
plt.savefig(os.path.join(OUT_DIR,'eda_nextdayvol_dist.png'))
plt.show()

# Correlation heatmap (select numeric features)
num_feats = ['open','high','low','close','volume','market_cap','return','intraday_range','ma_7','ma_30','vol_7','vol_30','liquidity']
num_feats = [c for c in num_feats if c in df_model.columns]
corr = df_model[num_feats].corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='vlag')
plt.title('Feature Correlation')
plt.tight_layout()
plt.savefig(os.path.join(OUT_DIR,'eda_correlation.png'))
plt.show()

# Time series for top 3 symbols
top_syms = df_model['symbol'].value_counts().nlargest(3).index.tolist()
for sym in top_syms:
    sub = df_model[df_model['symbol']==sym].sort_values('date')
    plt.figure(figsize=(10,3))
    plt.plot(sub['date'], sub['intraday_range'])
    plt.title(f'Intraday Range over Time: {sym}')
    plt.xlabel('date')
    plt.ylabel('intraday_range')
    # Share one locator between the axis and the formatter so ticks and labels agree
    locator = mdates.AutoDateLocator()
    plt.gca().xaxis.set_major_locator(locator)
    plt.gca().xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
    plt.tight_layout()
    fname = os.path.join(OUT_DIR, f'ts_{sym}.png')
    plt.savefig(fname)
    plt.show()

print('EDA plots saved to', OUT_DIR)


## Model Training & Evaluation
We'll train a RandomForest and a GradientBoosting regressor and keep the one with the lower RMSE on the held-out set. The notebook uses a global time-based split: roughly the last 20% of dates are held out, across all symbols.
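
The split logic can be sketched on invented data as follows. This assumes the same rule as the training cell (rows on or before the cutoff date go to training), and the 10-day, two-symbol frame is made up for the example.

```python
import pandas as pd

# Invented toy panel: two symbols observed on the same 10 days.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "date": list(dates) * 2,
    "symbol": ["AAA"] * 10 + ["BBB"] * 10,
})

# Pick the cutoff from the sorted unique dates, not from row positions,
# so every symbol is split at the same calendar date.
unique_dates = toy["date"].drop_duplicates().sort_values().reset_index(drop=True)
split_date = unique_dates.iloc[int(len(unique_dates) * 0.8)]

train = toy[toy["date"] <= split_date]
test = toy[toy["date"] > split_date]

print("cutoff:", split_date.date(), "train rows:", len(train), "test rows:", len(test))
```

Because every training date precedes every test date, the model never sees rows from the evaluation period during training.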

In [None]:

from sklearn.model_selection import train_test_split

feature_cols = ['open','high','low','close','volume','market_cap','return','intraday_range',
                'ma_7','ma_30','vol_7','vol_30','liquidity','symbol_code']
feature_cols = [c for c in feature_cols if c in df_model.columns]
X = df_model[feature_cols].fillna(0)
y = df_model['next_day_vol']

# Time-based split by date (global)
unique_dates = df_model['date'].sort_values().unique()
split_date = unique_dates[int(len(unique_dates)*(1-0.2))] if len(unique_dates)>1 else unique_dates[0]
train_mask = df_model['date'] <= split_date
X_train, X_test = X[train_mask], X[~train_mask]
y_train, y_test = y[train_mask], y[~train_mask]

# Fallback to random split if needed
if len(X_test)==0 or len(X_train)==0:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

# Train models (small defaults)
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=42)

rf.fit(X_train, y_train)
gbr.fit(X_train, y_train)

models = {'RandomForest': rf, 'GradientBoosting': gbr}
results = []
for name, model in models.items():
    preds = model.predict(X_test)
    # Take the square root explicitly: newer scikit-learn releases no longer accept squared=False
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    results.append({'model': name, 'rmse': float(rmse), 'mae': float(mae), 'r2': float(r2)})
results_df = pd.DataFrame(results).sort_values('rmse')
display(results_df)

best_model_name = results_df.iloc[0]['model']
best_model = models[best_model_name]
print('Best model:', best_model_name)


## Save artifacts
Saves: cleaned dataset, feature dataset, model pickle, evaluation metrics, and EDA plots into `ml_submission_outputs/`. These files are suitable to include in a GitHub repo (or you can keep the notebook alone).
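
As a quick check of the pickle round trip, the sketch below fits a small stand-in model on random data, writes it to a temporary folder, loads it back, and compares predictions. The data, the model size, and the temporary path are placeholders for illustration, not the notebook's actual artifacts.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data and model standing in for the trained best_model.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = X_demo[:, 0] + 0.1 * rng.random(50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = bool(np.allclose(model.predict(X_demo), restored.predict(X_demo)))
print("round trip OK:", same)
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, and only from sources you trust.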

In [None]:

# Save datasets and model
cleaned_path = os.path.join(OUT_DIR, 'cleaned_dataset.csv')
features_path = os.path.join(OUT_DIR, 'features_dataset.csv')
model_path = os.path.join(OUT_DIR, 'best_model.pkl')
metrics_path = os.path.join(OUT_DIR, 'evaluation_metrics.csv')

df.to_csv(cleaned_path, index=False)
df_model.to_csv(features_path, index=False)
with open(model_path, 'wb') as f:
    pickle.dump(best_model, f)

results_df.to_csv(metrics_path, index=False)

print('Saved:', cleaned_path)
print('Saved:', features_path)
print('Saved:', model_path)
print('Saved:', metrics_path)
print('Other plots saved to', OUT_DIR)


## Auto-generate README.md and requirements.txt for GitHub
The next cell will write a `README.md` and `requirements.txt` tailored to this notebook. Run it to create those files in the output folder.

In [None]:

readme = f"""# Cryptocurrency Volatility Prediction

This repository contains a single Jupyter Notebook that performs the full pipeline for predicting next-day cryptocurrency volatility using daily OHLC, volume, and market cap data.

## Contents
- `Crypto_Volatility_Prediction_Project.ipynb` — this notebook (primary)
- `ml_submission_outputs/` — output folder with cleaned data, feature data, trained model, evaluation metrics and plots

## How to run
1. Place `dataset.csv` in the same folder as the notebook or use the Google Drive downloader cell.
2. Open the notebook and run all cells (Kernel ▶ Restart & Run All).

## Files produced (examples)
- `ml_submission_outputs/cleaned_dataset.csv`
- `ml_submission_outputs/features_dataset.csv`
- `ml_submission_outputs/best_model.pkl`
- `ml_submission_outputs/evaluation_metrics.csv`
- `ml_submission_outputs/eda_nextdayvol_dist.png`

## Notes
- Target definition: next-day intraday range `(high-low)/open` used as a volatility proxy. You can modify the target to use a rolling standard deviation if preferred.
- The notebook includes a Streamlit app snippet you can use for local deployment (see the bottom of the notebook).

"""
with open(os.path.join(OUT_DIR,'README.md'),'w') as f:
    f.write(readme)

reqs = [
    'numpy', 'pandas', 'matplotlib', 'seaborn', 'scikit-learn', 'joblib', 'streamlit', 'gdown'
]
with open(os.path.join(OUT_DIR,'requirements.txt'),'w') as f:
    f.write('\n'.join(reqs))

print('README.md and requirements.txt created in', OUT_DIR)


## Streamlit app (optional)
This cell writes an `app.py` Streamlit application that loads the pickled model and predicts from an uploaded CSV of feature columns. Run it to create `app.py` in the output folder. To run locally: `streamlit run app.py`

In [None]:

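app_code = """import os
import pickle

import pandas as pd
import streamlit as st

st.title('Cryptocurrency Volatility Prediction')
# Resolve the model next to this file so the app works from any working directory
model_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'best_model.pkl')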

@st.cache_resource
def load_model():
    with open(model_path,'rb') as f:
        return pickle.load(f)

model = load_model()

st.write('Model loaded from', model_path)

st.markdown('Upload a CSV with the same feature columns used during training.')
uploaded = st.file_uploader('Upload CSV', type=['csv'])
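if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write('Preview', df.head())
    # Select the training features in training order when the model recorded them
    cols = getattr(model, 'feature_names_in_', None)
    X = df if cols is None else df.reindex(columns=list(cols)).fillna(0)
    preds = model.predict(X)
    st.write('Predictions (next-day volatility proxy):')
    st.write(preds)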
"""

with open(os.path.join(OUT_DIR,'app.py'),'w') as f:
    f.write(app_code)
print('Streamlit app written to', os.path.join(OUT_DIR,'app.py'))


## Final notes
- This notebook is ready to be uploaded to GitHub as the single-file submission. The `ml_submission_outputs/` folder will contain all derived files after running.