# 🎤 Dataset Preparation for Piper TTS

This notebook prepares the LJSpeech-Italian dataset for training with Piper TTS.

## 📊 Available Datasets:

### 1. z-uo/female-LJSpeech-italian (RECOMMENDED for call centers)
- 🎤 Female voice
- ⏱️ Duration: 8h 23m
- 🔊 Sample rate: 16 kHz
- 📦 ~600 MB

### 2. z-uo/male-LJSpeech-italian
- 🎤 Male voice
- ⏱️ Duration: 31h 45m
- 🔊 Sample rate: 16 kHz
- 📦 ~2.5 GB
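
The sizes listed above are presumably download sizes. The WAV files this notebook writes to Drive are uncompressed, so it can help to estimate their footprint from the stated duration and sample rate before starting. The sketch below assumes 16-bit mono PCM (the default WAV subtype in `soundfile`) and ignores WAV header overhead; `pcm_wav_bytes` is a hypothetical helper, not part of any library.

```python
def pcm_wav_bytes(hours: int, minutes: int, sample_rate: int = 16_000,
                  bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Approximate size of uncompressed PCM audio (WAV headers ignored)."""
    seconds = hours * 3600 + minutes * 60
    return seconds * sample_rate * bytes_per_sample * channels

# Durations taken from the dataset list above.
for name, (h, m) in {"female": (8, 23), "male": (31, 45)}.items():
    size_gb = pcm_wav_bytes(h, m) / 1e9
    print(f"{name}: ~{size_gb:.2f} GB as 16-bit mono WAV")
# female: ~0.97 GB as 16-bit mono WAV
# male: ~3.66 GB as 16-bit mono WAV
```

Under these assumptions, the male dataset needs several gigabytes of free space on Drive.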

## ⚙️ Setup:
1. Runtime → Change runtime type → **GPU T4**
2. Mount Google Drive
3. Choose the dataset (female/male) in the download cell
4. Run the cells in order

## 1️⃣ Environment Setup

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Install dependencies
!pip install -q datasets soundfile tqdm torchcodec
print("✅ Installation complete!")

## 2️⃣ Download LJSpeech-Italian

In [None]:
# ============================================================
#  DOWNLOAD LJSPEECH-IT - FIXED VERSION
# ============================================================

print("="*60)
print("  DOWNLOAD LJSPEECH-IT")
print("="*60)
print()

import os
from datasets import load_dataset
from tqdm import tqdm
import soundfile as sf

# Install torchcodec if it is missing
try:
    import torchcodec
except ImportError:
    print("📦 Installing torchcodec...")
    !pip install -q torchcodec
    print("✅ torchcodec installed")

# Choose the dataset (female or male)
DATASET_NAME = "z-uo/female-LJSpeech-italian"  # 8h 23m, female voice
# DATASET_NAME = "z-uo/male-LJSpeech-italian"  # 31h 45m, male voice (comment out the line above and uncomment this one)

print(f"📥 Downloading dataset: {DATASET_NAME}...")
dataset = load_dataset(DATASET_NAME, split="train")
print(f"✅ Dataset loaded: {len(dataset)} samples")

# Create the output directory on Drive
dataset_dir = "/content/drive/MyDrive/piper_training/dataset/ljspeech_italian"
wavs_dir = os.path.join(dataset_dir, "wavs")
os.makedirs(wavs_dir, exist_ok=True)

print(f"\n💾 Saving to Google Drive...")
print(f"📁 Path: {dataset_dir}")

# Collect one metadata line per sample
metadata_lines = []
total = len(dataset)

for idx in tqdm(range(total), desc="Saving audio"):
    item = dataset[idx]

    # Build the output path for this sample
    audio_filename = f"LJ_{idx:06d}.wav"
    audio_path = os.path.join(wavs_dir, audio_filename)

    # Extract the audio array and sample rate, then write the WAV file
    audio_data = item['audio']
    sf.write(audio_path, audio_data['array'], audio_data['sampling_rate'])

    # Append to metadata
    text = item['text']  # Note: the field is 'text', not 'sentence'
    metadata_lines.append(f"{audio_filename}|{text}")

# Save metadata.csv
metadata_path = os.path.join(dataset_dir, "metadata.csv")
with open(metadata_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(metadata_lines))

# Measure the real size on disk instead of assuming a fixed size per file
total_bytes = sum(
    os.path.getsize(os.path.join(wavs_dir, f"LJ_{i:06d}.wav")) for i in range(total)
)

print(f"\n✅ Done!")
print(f"   📊 {total} audio files saved")
print(f"   📄 metadata.csv created")
print(f"   💾 Total: ~{total_bytes / 1e6:.1f} MB")
print(f"\n📋 Dataset used: {DATASET_NAME}")
print(f"   🎤 Type: {'Female voice' if 'female' in DATASET_NAME else 'Male voice'}")
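
Because `metadata.csv` is pipe-delimited with one sample per line, a transcript containing `|` or a line break would corrupt the row it is written to. Whether this dataset contains such transcripts is not known here, so the following is a defensive sketch. `clean_transcript` is a hypothetical helper name.

```python
import re

def clean_transcript(text: str) -> str:
    """Make a transcript safe for one line of a pipe-delimited metadata file."""
    text = text.replace("|", " ")     # the pipe is the field separator
    text = re.sub(r"\s+", " ", text)  # collapse newlines, tabs and repeated spaces
    return text.strip()

print(clean_transcript("Buongiorno |\n come  va?"))
# Buongiorno come va?
```

In the save loop, it could be applied as `text = clean_transcript(item['text'])` before the metadata line is appended.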

## 3️⃣ Verify Dataset

In [None]:
# Verify the files that were created
import csv
import os
import pandas as pd
from IPython.display import Audio, display

dataset_dir = "/content/drive/MyDrive/piper_training/dataset/ljspeech_italian"
metadata_path = os.path.join(dataset_dir, "metadata.csv")

# Load metadata; QUOTE_NONE keeps double quotes inside transcripts from being parsed as CSV quoting
df = pd.read_csv(metadata_path, sep='|', header=None, names=['filename', 'text'],
                 quoting=csv.QUOTE_NONE)
print(f"📊 Dataset info:")
print(f"   Audio files: {len(df)}")
print(f"   Total text: {df['text'].str.len().sum()} characters")
print(f"   Average text: {df['text'].str.len().mean():.1f} characters/sample")

# Play back one sample (index 0; change sample_idx to hear another)
sample_idx = 0
sample_row = df.iloc[sample_idx]
sample_audio = os.path.join(dataset_dir, "wavs", sample_row['filename'])

print(f"\n🎤 Test sample #{sample_idx}:")
print(f"   Text: {sample_row['text']}")
print(f"   File: {sample_row['filename']}")

display(Audio(sample_audio))
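
Listening to one sample does not confirm that every metadata row has a matching audio file. As a rough sketch using only the standard library, the check below compares `metadata.csv` against the contents of `wavs/`. `check_dataset` and the keys it returns are hypothetical names chosen for illustration.

```python
import csv
import os

def check_dataset(dataset_dir: str) -> dict:
    """Compare metadata.csv entries against the files present in wavs/."""
    wavs_dir = os.path.join(dataset_dir, "wavs")
    listed, empty_text = set(), []
    metadata_path = os.path.join(dataset_dir, "metadata.csv")
    with open(metadata_path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if not row:
                continue  # skip blank lines
            filename, text = row[0], row[-1]
            listed.add(filename)
            # A row with no separator or only whitespace has no usable transcript
            if len(row) < 2 or not text.strip():
                empty_text.append(filename)
    on_disk = {n for n in os.listdir(wavs_dir) if n.endswith(".wav")}
    return {
        "missing_audio": sorted(listed - on_disk),   # in metadata, not on disk
        "unlisted_audio": sorted(on_disk - listed),  # on disk, not in metadata
        "empty_text": sorted(empty_text),
    }
```

Calling `check_dataset(dataset_dir)` after the download cell should return three empty lists if the export completed cleanly. Leftover files from an earlier run with the other voice would show up under `unlisted_audio`.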

## 4️⃣ Dataset Statistics

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Text lengths
text_lengths = df['text'].str.len()

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(text_lengths, bins=50, edgecolor='black')
plt.xlabel('Text length (characters)')
plt.ylabel('Frequency')
plt.title('Text length distribution')
plt.axvline(text_lengths.mean(), color='r', linestyle='--', label=f'Mean: {text_lengths.mean():.1f}')
plt.legend()

plt.subplot(1, 2, 2)
word_counts = df['text'].str.split().str.len()
plt.hist(word_counts, bins=50, edgecolor='black')
plt.xlabel('Word count')
plt.ylabel('Frequency')
plt.title('Word count distribution')
plt.axvline(word_counts.mean(), color='r', linestyle='--', label=f'Mean: {word_counts.mean():.1f}')
plt.legend()

plt.tight_layout()
plt.show()

print("📈 Statistics:")
print(f"   Text length: {text_lengths.min()}-{text_lengths.max()} characters")
print(f"   Words per sample: {word_counts.min()}-{word_counts.max()} words")
print(f"   Total words: {word_counts.sum():.0f}")
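
The statistics above cover only the text. To compare the exported audio with the duration stated for the chosen dataset, the total length can be read from the WAV headers. This sketch uses the standard-library `wave` module and assumes the files are plain PCM WAV, which is what `soundfile` writes by default. Both function names are hypothetical.

```python
import os
import wave

def wav_duration_seconds(path: str) -> float:
    """Duration of one PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def total_duration_hours(wavs_dir: str) -> float:
    """Sum the durations of all .wav files in a directory, in hours."""
    total_seconds = sum(
        wav_duration_seconds(os.path.join(wavs_dir, name))
        for name in os.listdir(wavs_dir)
        if name.endswith(".wav")
    )
    return total_seconds / 3600
```

For example, `total_duration_hours(os.path.join(dataset_dir, "wavs"))` should land near 8.4 for the female dataset if all samples were written.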

## ✅ Dataset Ready!

The dataset is now ready for training with Piper TTS.

### Next steps:
1. Follow the official Piper guide: https://github.com/rhasspy/piper
2. Configure training with the prepared dataset
3. Start training (takes several hours on a GPU)

### Notes:
- Dataset: `/content/drive/MyDrive/piper_training/dataset/ljspeech_italian`
- Metadata: `metadata.csv` (format: filename|text)
- Audio: `wavs/` folder (WAV format)