# Fungal Signal Vocabulary Discovery Pipeline
**EE297B Research Project — SJSU**  
Anthony Contreras & Alex Wong

Run this notebook on **Google Colab (GPU runtime)** to:
1. Download all datasets (Adamatzky, Buffi, ECG)
2. Discover fungal signal vocabulary (k-means clustering)
3. Train TCN Phase 1 (ECG pre-training)
4. Train TCN Phase 2 (vocabulary classification)
5. Train TCN Phase 3 (stimulus response decoding)
6. Build word→stimulus dictionary

## Cell 1 — Setup: Clone repo + install deps
Run this first. Takes ~1 min.

In [None]:
# Clone the repo
!git clone https://github.com/theeanthony/EE298.git /content/EE298 2>/dev/null || (cd /content/EE298 && git pull)

# Install dependencies
!pip install -q scikit-learn joblib pandas numpy scipy matplotlib torch h5py gdown psutil

# Verify GPU
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Set working directory
import os
os.chdir('/content/EE298/software/ml')
print(f"\nWorking dir: {os.getcwd()}")
print("Setup complete!")

## Cell 2 — Download all datasets
Downloads Adamatzky (Zenodo), Buffi (Mendeley), and ECG (Kaggle).  
~1.5 GB total. Takes ~3-5 min on Colab.

**Note:** The ECG download requires a Kaggle API token. If you don't have one, follow the printed instructions.
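
Before running the download cell, it can help to check whether Kaggle credentials are already in place. The sketch below assumes the standard Kaggle CLI conventions (a token file at `~/.kaggle/kaggle.json`, or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables). The helper name `kaggle_credentials_available` is hypothetical and not part of this repo, and the check ignores a custom `KAGGLE_CONFIG_DIR`.

```python
import os
from pathlib import Path


def kaggle_credentials_available(home=None, environ=None):
    """Return True if a Kaggle token file or credential env vars are present.

    Hypothetical helper: `home` and `environ` are injectable for testing and
    default to the current user's home directory and os.environ.
    """
    home = Path(home) if home is not None else Path.home()
    environ = os.environ if environ is None else environ

    # Default location where the Kaggle CLI looks for its API token.
    token_file = home / ".kaggle" / "kaggle.json"

    # The CLI also accepts credentials through these two variables.
    has_env = bool(environ.get("KAGGLE_USERNAME")) and bool(environ.get("KAGGLE_KEY"))

    return token_file.is_file() or has_env


if not kaggle_credentials_available():
    print("No Kaggle credentials found; see Option B in the next cell.")
```

If this prints the warning, the Option A command is likely to fail and Phase 1 will be skipped unless the ECG files are fetched another way.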

In [None]:
os.chdir('/content/EE298/software/ml')

# Download Adamatzky + Buffi
!python download_data.py

# Download ECG (for Phase 1 pre-training)
# Option A: If you have kaggle CLI configured
!python download_pretrain_data.py --ecg

# Option B: If Kaggle CLI fails, uncomment these lines to download manually:
# !pip install -q kaggle
# !mkdir -p ~/.kaggle
# # Upload your kaggle.json to Colab first, then:
# # from google.colab import files; uploaded = files.upload()  # upload kaggle.json
# # !mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
# !kaggle datasets download -d shayanfazeli/heartbeat -p ../../data/external/ecg_heartbeat --unzip

## Cell 3 — Verify data is present

In [None]:
import os

data_root = '/content/EE298/data/external'

# Adamatzky
adam_dir = os.path.join(data_root, 'adamatzky')
adam_files = []
for root, dirs, files in os.walk(adam_dir):
    for f in files:
        if f.endswith('.txt') and '__MACOSX' not in root:
            adam_files.append(f)
print(f"Adamatzky: {len(adam_files)} .txt files")
for f in adam_files:
    print(f"  {f}")

# Buffi
buffi_dir = os.path.join(data_root, 'buffi')
buffi_files = [f for f in os.listdir(buffi_dir) if f.endswith('.hdf5')] if os.path.exists(buffi_dir) else []
print(f"\nBuffi: {len(buffi_files)} .hdf5 files")
for f in sorted(buffi_files):
    size_mb = os.path.getsize(os.path.join(buffi_dir, f)) / (1024*1024)
    print(f"  {f} ({size_mb:.0f} MB)")

# ECG
ecg_dir = os.path.join(data_root, 'ecg_heartbeat')
ecg_ok = os.path.exists(os.path.join(ecg_dir, 'mitbih_train.csv'))
print(f"\nECG: {'FOUND' if ecg_ok else 'MISSING — Phase 1 will be skipped'}")

# Synthetic
syn_dir = '/content/EE298/data/synthetic'
syn_ok = os.path.exists(os.path.join(syn_dir, 'manifest.csv'))
print(f"Synthetic: {'FOUND' if syn_ok else 'MISSING — run synthetic_data.py if needed'}")

## Cell 4 — Generate synthetic data (if missing)
Only needed if Cell 3 shows synthetic as MISSING.

In [None]:
os.chdir('/content/EE298/software/ml')

if not os.path.exists('/content/EE298/data/synthetic/manifest.csv'):
    !python synthetic_data.py
else:
    print("Synthetic data already exists, skipping.")

## Cell 5 — Discover Fungal Vocabulary
Runs Adamatzky's methodology: detect spikes → group into words → cluster into vocabulary.  
Takes ~2-5 min depending on data size.
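
To make the "group into words" step concrete, here is a minimal sketch that splits spike times into words wherever the gap between consecutive spikes exceeds a threshold. It assumes the threshold is a multiple of the mean inter-spike interval. The function name `group_spikes_into_words`, the `gap_scale` parameter, and the toy spike times are illustrative and are not taken from `spike_vocabulary.py`, which may use different rules.

```python
from statistics import mean


def group_spikes_into_words(spike_times, gap_scale=1.0):
    """Split sorted spike times into words (runs of closely spaced spikes).

    Two consecutive spikes stay in the same word when their gap is at most
    gap_scale * (mean inter-spike interval). This threshold is an assumption.
    """
    if not spike_times:
        return []
    if len(spike_times) < 2:
        return [list(spike_times)]

    gaps = [b - a for a, b in zip(spike_times, spike_times[1:])]
    theta = gap_scale * mean(gaps)

    words = [[spike_times[0]]]
    for t, gap in zip(spike_times[1:], gaps):
        if gap <= theta:
            words[-1].append(t)   # close enough: extend the current word
        else:
            words.append([t])     # long silence: start a new word
    return words


# Toy spike times in arbitrary units (not real recordings).
spikes = [0, 10, 20, 200, 210, 500]
words = group_spikes_into_words(spikes)
word_lengths = [len(w) for w in words]
print(words)
print(word_lengths)
```

Per-word features such as spike count or duration could then be standardized and passed to k-means to form the vocabulary. That clustering step is not shown here.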

In [None]:
os.chdir('/content/EE298/software/ml')

# Discover vocabulary with k=50 clusters (Adamatzky reported a lexicon of up to ~50 words)
!python spike_vocabulary.py --n-clusters 50 --max-rows 0

## Cell 5b (optional) — Elbow plot to find optimal k
Run this to check whether k=50 is a reasonable choice.
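
Reading an elbow off a plot by eye is subjective. One common heuristic, sketched below, picks the k whose point lies farthest from the straight line joining the first and last points of the inertia curve. The function name `elbow_k` and the inertia values are made up for illustration. They are not output from `spike_vocabulary.py`, and the script itself may not apply any automatic selection.

```python
def elbow_k(ks, inertias):
    """Return the k farthest from the chord joining the curve's endpoints.

    Both axes are rescaled to [0, 1] first so that k and inertia contribute
    on comparable scales. Assumes inertia decreases from first to last point.
    """
    k_first, k_last = ks[0], ks[-1]
    i_first, i_last = inertias[0], inertias[-1]

    best_k, best_dist = ks[0], -1.0
    for k, inertia in zip(ks, inertias):
        x = (k - k_first) / (k_last - k_first)
        y = (inertia - i_last) / (i_first - i_last)
        # In rescaled coordinates the chord runs from (0, 1) to (1, 0),
        # i.e. the line x + y = 1; this is the point-to-line distance.
        dist = abs(x + y - 1) / 2 ** 0.5
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


# Illustrative values only.
ks = [5, 10, 15, 20, 25, 30]
inertias = [1000, 600, 400, 350, 320, 300]
print(elbow_k(ks, inertias))
```

A heuristic like this is only a cross-check on the plot; a flat or noisy curve can produce an elbow with little meaning.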

In [None]:
os.chdir('/content/EE298/software/ml')

# Elbow method — tests k=5,10,15,...,80
!python spike_vocabulary.py --elbow --max-rows 0

# Display the elbow plot
from IPython.display import Image, display
elbow_path = '/content/EE298/data/ml_results/vocabulary_elbow.png'
if os.path.exists(elbow_path):
    display(Image(filename=elbow_path, width=600))

## Cell 5c — View vocabulary analysis

In [None]:
os.chdir('/content/EE298/software/ml')

# Show vocabulary stats
!python spike_vocabulary.py --analyze

# Display the analysis plots
from IPython.display import Image, display
plot_path = '/content/EE298/data/ml_results/vocabulary_analysis.png'
if os.path.exists(plot_path):
    display(Image(filename=plot_path, width=800))

## Cell 6 — Train TCN (Vocabulary Mode, All 3 Phases)
- **Phase 1:** ECG pre-training (5-class heartbeat, ~5 min)
- **Phase 2:** Vocabulary classification (k-class word types, ~5 min)
- **Phase 3:** Stimulus response decoding (5-class: 4 stimuli + baseline, ~5 min)

Total: ~15-20 min on a T4 GPU with `--full`.
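
All three phases train a TCN, so it is useful to know how many input samples a single output can see when choosing window lengths. The sketch below computes the receptive field under assumptions that may not match `train_tcn.py`: a standard TCN with two dilated causal convolutions per residual block and dilations that double at each level (1, 2, 4, ...). The function name and the example kernel size are illustrative.

```python
def tcn_receptive_field(kernel_size, num_levels, convs_per_block=2):
    """Receptive field (in samples) of a stack of dilated causal conv blocks.

    Assumes level i uses dilation 2**i and each block applies
    `convs_per_block` convolutions with the same kernel size and dilation.
    Each such convolution adds (kernel_size - 1) * dilation samples.
    """
    dilations = [2 ** i for i in range(num_levels)]
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)


# Example: kernel size 3 with increasing depth.
for levels in (4, 6, 8):
    print(levels, tcn_receptive_field(3, levels))
```

If the input window is much longer than the receptive field, the final time step cannot draw on the start of the window, which is one reason to add levels or widen the kernel.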

In [None]:
os.chdir('/content/EE298/software/ml')

# Run all 3 phases with Colab-optimized settings
!python train_tcn.py --mode vocabulary --phase all --full

### If you want to run phases individually:
Uncomment the phase you want to run.

In [None]:
# os.chdir('/content/EE298/software/ml')

# Phase 1 only (ECG pre-training)
# !python train_tcn.py --mode vocabulary --phase 1 --full

# Phase 2 only (vocabulary classification)
# !python train_tcn.py --mode vocabulary --phase 2 --full

# Phase 3 only (stimulus response)
# !python train_tcn.py --mode vocabulary --phase 3 --full

## Cell 7 — Build Fungal Dictionary
Maps discovered word types to stimulus meanings.
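
One simple way to build such a mapping is to count how often each word type occurs under each stimulus and take the most frequent stimulus as its meaning. The sketch below does that and defines confidence as the fraction of a word's occurrences that fall under its primary stimulus. That definition is an assumption, as are the function name `build_word_dictionary` and the placeholder stimulus labels. `build_dictionary.py` may compute these fields differently.

```python
from collections import Counter, defaultdict


def build_word_dictionary(observations):
    """Map each word type to its most frequent stimulus.

    `observations` is an iterable of (word_type, stimulus) pairs.
    Confidence here is the share of a word's occurrences under its
    primary stimulus (an assumed definition).
    """
    counts = defaultdict(Counter)
    for word_type, stimulus in observations:
        counts[word_type][stimulus] += 1

    dictionary = {}
    for word_type, stim_counts in counts.items():
        total = sum(stim_counts.values())
        primary, n_primary = stim_counts.most_common(1)[0]
        dictionary[word_type] = {
            "total_occurrences": total,
            "primary_stimulus": primary,
            "confidence": n_primary / total,
        }
    return dictionary


# Placeholder labels, not the project's actual stimuli.
observations = [
    (3, "stimulus_a"),
    (3, "stimulus_a"),
    (3, "baseline"),
    (7, "stimulus_b"),
]
word_dict = build_word_dictionary(observations)
for word_type, info in sorted(word_dict.items()):
    print(word_type, info)
```

Words with few occurrences can reach a confidence of 1.0 by chance, so confidence is best read together with `total_occurrences`.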

In [None]:
os.chdir('/content/EE298/software/ml')

!python build_dictionary.py

## Cell 7b — View the dictionary

In [None]:
import json
import os

dict_path = '/content/EE298/software/ml/models/fungal_dictionary.json'
if os.path.exists(dict_path):
    with open(dict_path) as f:
        d = json.load(f)
    print(f"Vocabulary size: {d['vocabulary_size']} word types")
    print(f"Stimuli: {d['stimuli']}")
    print()
    
    # Top 15 most confident word→stimulus mappings
    words = sorted(d['words'].items(), key=lambda x: x[1]['confidence'], reverse=True)
    print(f"{'Word':<10} {'Occurrences':<14} {'Primary Stimulus':<20} {'Confidence':<12} {'Interpretation'}")
    print('-' * 90)
    for wtype, info in words[:15]:
        print(f"word_{wtype:<5} {info['total_occurrences']:<14} {info['primary_stimulus']:<20} "
              f"{info['confidence']:<12.3f} {info['interpretation']}")
else:
    print("Dictionary not found — run Cell 7 first")

## Cell 8 — Download results to your local machine
Downloads the trained models and dictionary so you can use them locally.

In [None]:
import os

zip_path = '/content/vocabulary_results.zip'

# Zip the trained models, vocabulary artifacts, dictionary, and plots.
# Warnings about missing files are suppressed, so a partial run still produces a zip.
!cd /content/EE298 && zip -r /content/vocabulary_results.zip \
    software/ml/models/tcn_phase1_ecg.pt \
    software/ml/models/tcn_phase2_vocabulary.pt \
    software/ml/models/tcn_phase3_stimulus.pt \
    software/ml/models/vocabulary_kmeans.joblib \
    software/ml/models/vocabulary_scaler.joblib \
    software/ml/models/vocabulary_labels.npz \
    software/ml/models/fungal_dictionary.json \
    data/ml_results/ \
    2>/dev/null

if not os.path.exists(zip_path):
    # zip creates no archive when none of the listed files exist
    print("No results zip was created. Run Cells 5-7 first.")
else:
    zip_size = os.path.getsize(zip_path) / (1024*1024)
    print(f"\nResults zip: {zip_size:.1f} MB")
    print("\nFiles included:")
    !zipinfo -1 /content/vocabulary_results.zip

    # Auto-download in the Colab browser session
    try:
        from google.colab import files
        files.download(zip_path)
    except ImportError:
        print("\nNot running in Colab. Download manually from:")
        print(f"  {zip_path}")

## Cell 9 (optional) — Also run the binary baseline for comparison

In [None]:
os.chdir('/content/EE298/software/ml')

# Binary mode (original smoke detector)
# !python train_tcn.py --mode binary --phase all --full

# Classical ML baseline
# !python train.py --full