# 🎸 Guitar Strum Generator - Dataset Construction

**Notebook 02: Build Training Dataset**

This notebook builds the thesis training dataset by combining two sources:
1. **Synthetic samples (70%)** - Generated using the rule-based system
2. **Real progressions (30%)** - Sampled from Chordonomicon dataset

---

**Author:** Rohan Rajendra Dhanawade  
**Thesis:** A Conversational AI System for Symbolic Guitar Strumming Pattern and Chord Progression Generation

---

## 1. Setup Environment

First, clone the repository and install the dependencies.

In [None]:
# Clone repository (if running in Colab)
import os

if not os.path.exists('guitar-strum-gen'):
    !git clone https://github.com/rohand575/guitar-strum-gen.git
    %cd guitar-strum-gen
else:
    %cd guitar-strum-gen
    !git pull

print("\n✅ Repository ready!")

In [None]:
# Install required packages
!pip install -q pydantic datasets pandas tqdm

# Install the project in development mode
!pip install -q -e .

print("\n✅ Dependencies installed!")

In [None]:
# Verify imports work
from src.data.schema import GuitarSample, VALID_GENRES, VALID_EMOTIONS
from src.data.build_dataset import (
    DatasetConfig, 
    build_dataset,
    generate_synthetic_sample,
    load_chordonomicon_huggingface,
    PROMPT_TEMPLATES
)

print("✅ All imports successful!")
print(f"   - {len(VALID_GENRES)} genres available")
print(f"   - {len(VALID_EMOTIONS)} emotions available")
print(f"   - {len(PROMPT_TEMPLATES)} prompt templates")

## 2. Explore Chordonomicon Dataset

Let's load and explore the Chordonomicon dataset from Hugging Face.

In [None]:
# Load Chordonomicon from Hugging Face
from datasets import load_dataset

print("📥 Loading Chordonomicon dataset from Hugging Face...")
print("   (This may take a few minutes on first run)\n")

chordonomicon = load_dataset("ailsntua/Chordonomicon", split="train")

print(f"✅ Loaded {len(chordonomicon):,} chord progressions!")
print(f"\nColumns: {chordonomicon.column_names}")

In [None]:
# Convert to pandas for easier exploration
import pandas as pd

df = chordonomicon.to_pandas()

print("📊 Dataset Overview:")
print(f"   Total entries: {len(df):,}")
print(f"   Memory usage: {df.memory_usage().sum() / 1024**2:.1f} MB")
print("\n📋 First few entries:")
df.head(3)

In [None]:
# Let's look at a few example chord progressions
print("🎵 Example chord progressions from Chordonomicon:\n")

for i in range(5):
    row = df.iloc[i]
    chords_text = str(row['chords'])
    chords = chords_text[:200] + "..." if len(chords_text) > 200 else chords_text
    print(f"Example {i+1}:")
    print(f"  Genre: {row.get('main_genre', 'N/A')}")
    print(f"  Chords: {chords}")
    print()

In [None]:
# Genre distribution in Chordonomicon
if 'main_genre' in df.columns:
    print("📊 Genre distribution in Chordonomicon:")
    genre_counts = df['main_genre'].value_counts().head(15)
    for genre, count in genre_counts.items():
        pct = count / len(df) * 100
        print(f"   {genre:20} {count:>8,} ({pct:5.1f}%)")

## 3. Test Chord Parsing

Let's test our chord parsing functions on real Chordonomicon data.
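
To make the parsing step concrete, here is a minimal sketch of the general idea. It is not the repository's `parse_chordonomicon_chord_string`. It assumes the chord string is whitespace-separated and that section markers are written in angle brackets, such as `<verse_1>`; the helper name `split_sections_and_chords` and the sample string are hypothetical.

```python
import re

# Hypothetical helper for illustration only. The project's real parser lives in
# src/data/build_dataset.py and may behave differently.
# Assumed marker format: <name> or <name_N>, e.g. <verse_1>.
SECTION_TAG = re.compile(r"^<([a-z]+)(?:_\d+)?>$")


def split_sections_and_chords(chord_string):
    """Return (chords, first_section) from a whitespace-separated chord string.

    Tokens that look like section markers are skipped; every other token is
    kept unchanged as a chord symbol.
    """
    chords = []
    first_section = None
    for token in chord_string.split():
        match = SECTION_TAG.match(token)
        if match:
            # Remember only the first section name that appears.
            if first_section is None:
                first_section = match.group(1)
            continue
        chords.append(token)
    return chords, first_section


# Made-up input in the assumed format.
example = "<intro_1> C G <verse_1> Amin F C G"
parsed_chords, parsed_section = split_sections_and_chords(example)
print(parsed_chords, parsed_section)
```

The next cell runs the project's own parser on real rows, which is the authoritative behaviour.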

In [None]:
from src.data.build_dataset import parse_chordonomicon_chord_string, normalize_chord

# Test parsing on a few examples
print("🔧 Testing chord parsing on Chordonomicon samples:\n")

for i in range(5):
    row = df.iloc[i * 1000]  # Sample every 1000th row for variety
    original = str(row['chords'])[:100]
    chords, section = parse_chordonomicon_chord_string(str(row['chords']))
    
    print(f"Example {i+1}:")
    print(f"  Original: {original}...")
    print(f"  Parsed ({len(chords)} chords): {chords[:8]}")
    print(f"  Section: {section}")
    print()

## 4. Generate Training Dataset

Now let's build the complete training dataset!
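
The ratios used in this section translate into concrete sample counts. The sketch below works them out with integer arithmetic for 250 samples. The rounding rule (floor each part, give the remainder to the last part) is an assumption, so `build_dataset` may produce slightly different validation and test sizes; `allocate` is a hypothetical helper, not part of the project.

```python
def allocate(total, percents):
    """Split `total` into integer parts by whole-number percentages.

    Every part except the last is floored; the last part takes the remainder,
    so the parts always sum to `total`.
    """
    if sum(percents) != 100:
        raise ValueError("percentages must sum to 100")
    parts = [total * p // 100 for p in percents[:-1]]
    parts.append(total - sum(parts))
    return parts


# 70% synthetic / 30% real
num_synthetic, num_real = allocate(250, [70, 30])
# 70% train / 15% val / 15% test
train_size, val_size, test_size = allocate(250, [70, 15, 15])

print(num_synthetic, num_real)          # 175 75
print(train_size, val_size, test_size)  # 175 37 38
```

Because 15% of 250 is 37.5, the validation and test splits cannot both be exact; some rounding choice is unavoidable.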

In [None]:
# Configuration for dataset generation
config = DatasetConfig(
    total_samples=250,        # Total samples to generate
    synthetic_ratio=0.70,     # 70% synthetic, 30% from Chordonomicon
    train_ratio=0.70,         # 70% train, 15% val, 15% test
    val_ratio=0.15,
    test_ratio=0.15,
    random_seed=42
)

print("📊 Dataset Configuration:")
print(f"   Total samples: {config.total_samples}")
print(f"   Synthetic: {config.num_synthetic} ({config.synthetic_ratio*100:.0f}%)")
print(f"   From Chordonomicon: {config.num_real} ({(1-config.synthetic_ratio)*100:.0f}%)")
print(f"   Train/Val/Test: {config.train_ratio*100:.0f}%/{config.val_ratio*100:.0f}%/{config.test_ratio*100:.0f}%")

In [None]:
# Build the dataset!
import random
random.seed(config.random_seed)

splits = build_dataset(
    config=config,
    chordonomicon_path=None,  # We'll use Hugging Face
    use_huggingface=True,
    output_dir="data/processed"
)

In [None]:
# View generated files
!ls -la data/processed/

## 5. Explore Generated Dataset

In [None]:
# Load and explore the generated dataset
import json

# Load stats
with open('data/processed/stats.json', 'r') as f:
    stats = json.load(f)

print("📊 Generated Dataset Statistics:")
print(f"\n   Total samples: {stats['total']}")
print(f"   Unique progressions: {stats['unique_progressions']}")
print(f"   Unique patterns: {stats['unique_patterns']}")
print(f"   Tempo range: {stats['tempo_range'][0]} - {stats['tempo_range'][1]} BPM")
print(f"   Average tempo: {stats['avg_tempo']:.1f} BPM")

print("\n📁 Splits:")
for split, count in stats['splits'].items():
    print(f"   {split}: {count} samples")

In [None]:
# Genre distribution
print("🎵 Genre Distribution:")
for genre, count in sorted(stats['genres'].items(), key=lambda x: x[1], reverse=True):
    pct = count / stats['total'] * 100
    bar = '█' * int(pct / 2)
    print(f"   {genre:12} {count:3} ({pct:5.1f}%) {bar}")

In [None]:
# Emotion distribution
print("💭 Emotion Distribution:")
for emotion, count in sorted(stats['emotions'].items(), key=lambda x: x[1], reverse=True):
    pct = count / stats['total'] * 100
    bar = '█' * int(pct / 2)
    print(f"   {emotion:12} {count:3} ({pct:5.1f}%) {bar}")

In [None]:
# View some sample entries
print("📝 Sample Entries from Dataset:\n")

with open('data/processed/dataset.jsonl', 'r') as f:
    samples = [json.loads(line) for line in f.readlines()[:10]]

for i, sample in enumerate(samples[:5]):
    print(f"--- Sample {i+1} ({sample['id']}) ---")
    print(f"Prompt: \"{sample['prompt']}\"")
    print(f"Chords: {sample['chords']}")
    print(f"Pattern: {sample['strum_pattern']}")
    print(f"Key: {sample['key']} {sample['mode']} | Tempo: {sample['tempo']} BPM")
    print(f"Genre: {sample['genre']} | Emotion: {sample['emotion']}")
    print()

## 6. Visualize Dataset Distribution

In [None]:
import matplotlib.pyplot as plt

# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Genre distribution
ax1 = axes[0, 0]
genres = list(stats['genres'].keys())
genre_counts = list(stats['genres'].values())
ax1.barh(genres, genre_counts, color='steelblue')
ax1.set_xlabel('Count')
ax1.set_title('Genre Distribution')

# Emotion distribution
ax2 = axes[0, 1]
emotions = list(stats['emotions'].keys())
emotion_counts = list(stats['emotions'].values())
ax2.barh(emotions, emotion_counts, color='coral')
ax2.set_xlabel('Count')
ax2.set_title('Emotion Distribution')

# Mode distribution (pie)
ax3 = axes[1, 0]
modes = list(stats['modes'].keys())
mode_counts = list(stats['modes'].values())
ax3.pie(mode_counts, labels=modes, autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'])
ax3.set_title('Mode Distribution (Major vs Minor)')

# Key distribution
ax4 = axes[1, 1]
keys = list(stats['keys'].keys())
key_counts = list(stats['keys'].values())
ax4.bar(keys, key_counts, color='purple', alpha=0.7)
ax4.set_xlabel('Key')
ax4.set_ylabel('Count')
ax4.set_title('Key Distribution')
ax4.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('data/processed/dataset_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✅ Visualization saved to data/processed/dataset_distribution.png")

## 7. Download Dataset Files

Download the generated files to your local machine.
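
If the `zip` command is not available (for example on a local Windows machine), a similar archive can be built with Python's standard `zipfile` module. This is a sketch of an alternative, not what the cell below does; `zip_outputs` is a hypothetical helper, and the demo uses a temporary directory instead of `data/processed`.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_outputs(source_dir, archive_path, suffixes=(".jsonl", ".json", ".png")):
    """Archive files in `source_dir` whose suffix is in `suffixes`.

    Returns the list of archived file names in sorted order.
    """
    source_dir = Path(source_dir)
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.iterdir()):
            if path.is_file() and path.suffix in suffixes:
                # Store files flat, without the directory prefix.
                archive.write(path, arcname=path.name)
                names.append(path.name)
    return names


# Demo on a throwaway directory so no project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "processed"
    demo_dir.mkdir()
    (demo_dir / "train.jsonl").write_text("{}\n")
    (demo_dir / "stats.json").write_text("{}")
    (demo_dir / "notes.txt").write_text("not archived")
    archived = zip_outputs(demo_dir, Path(tmp) / "guitar_dataset.zip")
    print(archived)
```

For the real outputs, the call would be `zip_outputs("data/processed", "guitar_dataset.zip")`.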

In [None]:
# Create a zip file for easy download
!cd data/processed && zip -r ../../guitar_dataset.zip *.jsonl *.json *.png 2>/dev/null

print("📦 Created guitar_dataset.zip")
print("\nContents:")
!unzip -l guitar_dataset.zip

In [None]:
# Download (only works in Colab)
try:
    from google.colab import files
    files.download('guitar_dataset.zip')
    print("\n✅ Download started!")
except ImportError:
    print("\n📁 Not running in Colab. Files are in data/processed/")
    print("   You can manually download:")
    print("   - train.jsonl")
    print("   - val.jsonl")
    print("   - test.jsonl")
    print("   - stats.json")

## 8. Summary

### What We Built

✅ **Training dataset** with 250 samples:
- 175 synthetic samples (70%) from rule-based system
- 75 real progressions (30%) from Chordonomicon

✅ **Rich prompt diversity** with 68 unique templates

✅ **Balanced distribution** across:
- 9 genres (pop, rock, folk, ballad, country, blues, jazz, indie, acoustic)
- 8 emotions (upbeat, melancholic, mellow, energetic, peaceful, dramatic, hopeful, nostalgic)
- Major and minor keys
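
One simple way to quantify how balanced a label distribution is: compare the most common label count with the rarest. The sketch below uses made-up counts, not values from `stats.json`, and `imbalance_ratio` is a hypothetical helper.

```python
def imbalance_ratio(counts):
    """Return max_count / min_count for a label -> count mapping.

    A value of 1.0 means every label appears equally often.
    """
    values = list(counts.values())
    if not values or min(values) <= 0:
        raise ValueError("counts must be non-empty and strictly positive")
    return max(values) / min(values)


# Illustrative counts only; substitute stats['genres'] or stats['emotions'].
example_counts = {"pop": 30, "rock": 30, "folk": 20}
print(imbalance_ratio(example_counts))  # 1.5
```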

### Output Files

| File | Description |
|------|-------------|
| `train.jsonl` | Training set (70%) |
| `val.jsonl` | Validation set (15%) |
| `test.jsonl` | Test set (15%) |
| `dataset.jsonl` | Complete dataset |
| `stats.json` | Dataset statistics |
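
Before passing these files to a training script, a quick sanity check can confirm that every record carries the fields printed in Section 5. This is a minimal sketch, assuming one JSON object per line. `REQUIRED_FIELDS` is taken from the keys accessed earlier in this notebook and may not be the full `GuitarSample` schema; `find_incomplete_records` is a hypothetical helper.

```python
import json
import os
import tempfile

# Keys read in the "Sample Entries" cell; the real schema may define more.
REQUIRED_FIELDS = (
    "id", "prompt", "chords", "strum_pattern",
    "key", "mode", "tempo", "genre", "emotion",
)


def find_incomplete_records(path):
    """Return (line_number, missing_fields) for each record lacking a field."""
    problems = []
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [name for name in REQUIRED_FIELDS if name not in record]
            if missing:
                problems.append((line_number, missing))
    return problems


# Demo with two made-up records: one complete, one partial.
complete = {name: "x" for name in REQUIRED_FIELDS}
partial = {"id": "demo", "prompt": "x"}
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(json.dumps(complete) + "\n" + json.dumps(partial) + "\n")

demo_problems = find_incomplete_records(tmp.name)
os.remove(tmp.name)
print(demo_problems)
```

On the generated data, `find_incomplete_records("data/processed/train.jsonl")` should return an empty list if every sample is complete.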

### Next Steps

Continue to **Notebook 03: Training** to train the neural sequence model!

In [None]:
print("🎉 Dataset construction complete!")
print("\n📊 Final Stats:")
print(f"   - {stats['total']} total samples")
print(f"   - {stats['unique_progressions']} unique chord progressions")
print(f"   - {stats['unique_patterns']} unique strumming patterns")
print(f"   - Train/Val/Test: {stats['splits']['train']}/{stats['splits']['val']}/{stats['splits']['test']}")
print("\n🚀 Ready for neural model training!")