# Voice Cloning Assignment (Colab)

**Reference tutorial**: https://www.youtube.com/watch?v=3iqvBEGS2So  
**Paths**: (A) TTS fine‑tune (XTTS‑v2), (B) Voice conversion (RVC)

⚠️ Ethics: Only use voices you have rights to. Label synthetic audio.


In [ ]:
# GPU check
import torch  # only torch is needed for the GPU check
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
    print('Capability:', torch.cuda.get_device_capability(0))


## 1) Environment setup
Installs the core dependencies. Re-run this cell at the start of every new Colab session, because installed packages do not persist between sessions.

In [ ]:
!pip -q install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip -q install unsloth transformers datasets accelerate peft bitsandbytes
!pip -q install librosa soundfile pydub faster-whisper jiwer speechbrain
!pip -q install TTS  # Coqui TTS (XTTS‑v2)
import os
os.makedirs('/content/outputs/samples', exist_ok=True)
os.makedirs('/content/datasets', exist_ok=True)
os.makedirs('/content/logs', exist_ok=True)
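
After the install cell finishes, a quick import check can confirm that nothing failed silently (the `-q` flag hides most pip output). The sketch below is one possible way to do this; `missing_modules` and the `REQUIRED` list are hypothetical names, and the entries are import names, which can differ from pip package names (for example `faster-whisper` is imported as `faster_whisper`).

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this runtime."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Hypothetical checklist of import names for the packages installed above
REQUIRED = ["torch", "torchaudio", "librosa", "soundfile",
            "pydub", "faster_whisper", "jiwer", "speechbrain", "TTS"]

print("Missing modules:", missing_modules(REQUIRED) or "none")
```

If the list is not empty, re-running the install cell without `-q` usually shows which package failed.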


## 2) Data upload / mount
Upload WAV/MP3 recordings of your voice (anywhere from a few minutes to several hours). Optionally, mount Google Drive instead.

In [ ]:
from google.colab import files
print('Upload one or more long recordings (zip a folder first to upload it as a single file).')
uploads = files.upload()  # choose audio files
list(uploads.keys())

In [ ]:
# Optional: mount Drive if your data lives there
from google.colab import drive
drive.mount('/content/drive')

## 3) Preprocess (resample + split by silence)
Resamples the recordings to 22,050 Hz and writes 10–20 s chunks to `/content/datasets/wavs`.

In [ ]:
import os
os.makedirs('/content/datasets/wavs', exist_ok=True)
!python scripts/prepare_data.py --src /content --out /content/datasets/wavs --sr 22050
len(os.listdir('/content/datasets/wavs'))
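
The contents of `scripts/prepare_data.py` are not shown here, so the following is not its implementation. It is a minimal, dependency-free sketch of the general idea behind "split by silence": compute the RMS energy of short frames, end a chunk after a long enough quiet run, and force a cut when a chunk reaches a maximum length. The function names and every threshold (`frame_ms`, `silence_thresh`, `min_silence_ms`, `max_chunk_s`) are assumptions chosen for illustration, and the samples are assumed to be floats in [-1, 1].

```python
import math

def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames."""
    rms = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return rms

def split_on_silence(samples, sr, frame_ms=20, silence_thresh=0.01,
                     min_silence_ms=300, max_chunk_s=20.0):
    """Return (start, end) sample indices of voiced chunks.

    A chunk is closed once `min_silence_ms` of consecutive quiet frames
    has been seen, or once it reaches `max_chunk_s` seconds.
    """
    frame_len = max(1, int(sr * frame_ms / 1000))
    min_silent_frames = max(1, int(min_silence_ms / frame_ms))
    max_frames = int(max_chunk_s * 1000 / frame_ms)

    chunks = []
    chunk_start = None  # frame index where the current chunk began
    last_voiced = None  # frame index of the most recent voiced frame
    for i, energy in enumerate(frame_rms(samples, frame_len)):
        if energy >= silence_thresh:
            if chunk_start is None:
                chunk_start = i
            last_voiced = i
        if chunk_start is None:
            continue  # still in leading silence
        silent_run = i - last_voiced
        too_long = (i - chunk_start + 1) >= max_frames
        if silent_run >= min_silent_frames or too_long:
            end = min((last_voiced + 1) * frame_len, len(samples))
            chunks.append((chunk_start * frame_len, end))
            chunk_start = None
    if chunk_start is not None:  # flush the final open chunk
        end = min((last_voiced + 1) * frame_len, len(samples))
        chunks.append((chunk_start * frame_len, end))
    return chunks
```

This version only enforces an upper bound on chunk length. Reaching the 10–20 s target would also need a step that merges short neighbouring chunks, which is left out here.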

## 4) (Optional) Transcribe with Whisper → CSV
Creates `/content/datasets/train.csv` with the columns `audio_path,text`, which the TTS fine-tune (Path A) uses as training data.

In [ ]:
!python scripts/transcribe_with_whisper.py --wav_dir /content/datasets/wavs --out_csv /content/datasets/train.csv --model_size small
import pandas as pd
pd.read_csv('/content/datasets/train.csv').head()
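
Transcripts often contain commas, quotes, and stray newlines, so the CSV is safest written with a real CSV writer rather than by joining strings. The sketch below shows one way to produce an `audio_path,text` file from transcription results. It is not the code of `scripts/transcribe_with_whisper.py`: `write_manifest` and the `chunk_0001.wav` style file names are hypothetical, and the transcripts are hard-coded stand-ins for Whisper output.

```python
import csv
import io

def write_manifest(rows, fileobj):
    """Write (audio_path, text) pairs as CSV with an `audio_path,text` header.

    Returns the number of data rows written. Empty transcripts are skipped.
    """
    writer = csv.writer(fileobj)
    writer.writerow(["audio_path", "text"])
    kept = 0
    for audio_path, text in rows:
        text = " ".join(text.split())  # collapse whitespace and newlines
        if not text:
            continue  # no speech recognised for this clip
        writer.writerow([audio_path, text])
        kept += 1
    return kept

# Stand-in transcription results (hypothetical file names and text)
rows = [
    ("/content/datasets/wavs/chunk_0001.wav", "Hello, this is a test."),
    ("/content/datasets/wavs/chunk_0002.wav", "   "),
]
buf = io.StringIO()
n_written = write_manifest(rows, buf)
print(n_written, "row(s) written")
```

To write to disk instead of an in-memory buffer, open the target with `open(path, "w", newline="", encoding="utf-8")` and pass the file object in.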

## 5) Baseline: Zero‑shot cloning (XTTS‑v2)
Generate speech using a short **reference** audio of your target speaker.

In [ ]:
from TTS.api import TTS
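import glob, os, torch
ref = sorted(glob.glob('/content/datasets/wavs/*.wav'))[:5]  # first five clips as reference
assert ref, 'No WAV chunks found; run the preprocessing step first.'
# Use the GPU when available; CPU synthesis works but is much slower
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2").to(device)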
text = "This is a baseline zero-shot voice clone generated with XTTS version two."
out_path = "/content/outputs/samples/xtts_zeroshot.wav"
tts.tts_to_file(text=text, file_path=out_path, speaker_wav=ref, language="en")
out_path
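
Before listening, it can help to confirm that the generated file is a valid WAV of plausible length. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files; it assumes the synthesized output is written in that format. `wav_info` is a hypothetical helper name.

```python
import wave

def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        sr = wf.getframerate()
        return sr, wf.getnchannels(), wf.getnframes() / sr
```

For example, `wav_info("/content/outputs/samples/xtts_zeroshot.wav")` should report a duration of a few seconds for the one-sentence prompt above. A duration near zero suggests that synthesis failed.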

## 6A) Path A — XTTS‑v2 fine‑tuning (example)
Fine-tuning is more compute-intensive than zero-shot cloning and benefits from **2–3 hours** of clean speech with transcripts.
A **config skeleton** is provided at `config/xtts_finetune_config.yaml`. Update its file paths, then run the cell below.

In [ ]:
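# Example training command (adjust for your dataset).
# The training entry point can differ between Coqui TTS versions;
# if this invocation is rejected, check `tts --help` for your install.
!tts --config_path config/xtts_finetune_config.yaml --train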


## 6B) Path B — RVC (Retrieval‑based Voice Conversion)
Clone the official WebUI and train a small model. Works without transcripts.

In [ ]:
!git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI /content/rvc
%cd /content/rvc
!pip -q install -r requirements.txt
# Minimal dataset prep: copy your clips
!mkdir -p /content/rvc/datasets/wavs && cp -r /content/datasets/wavs/* /content/rvc/datasets/wavs/
print('Next, run preprocessing and training from the WebUI. See the repo README for instructions.')
%cd /content


## 7) Evaluation: 30s A/B and similarity score
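Pick one original chunk and one cloned output. The cell builds a 30 s A/B comparison clip and prints an ECAPA speaker-similarity score for the pair.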

In [ ]:
import glob
orig = sorted(glob.glob('/content/datasets/wavs/*.wav'))[0]
cloned = '/content/outputs/samples/xtts_zeroshot.wav'  # or your fine‑tuned/RVC output
# Quote the expanded variables so paths containing spaces survive the shell
!python scripts/generate_ab_sample.py --original "$orig" --cloned "$cloned" --out /content/outputs/samples/ab_30s.wav
!python scripts/similarity_ecapa.py --a "$orig" --b "$cloned"
print('A/B written to: /content/outputs/samples/ab_30s.wav')
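
Speaker embeddings such as those from an ECAPA-TDNN model are commonly compared with cosine similarity, where values closer to 1 mean the two clips sound more like the same speaker. The source of `scripts/similarity_ecapa.py` is not shown, so that it scores this way is an assumption. The sketch below covers only the scoring step on two embedding vectors that have already been extracted; extraction itself is omitted.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

The score depends on clip length and recording conditions, so it is most useful for comparing several cloned outputs against the same original chunk.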


## 8) Logging (fill templates)
Open and edit the templates under `reports/templates/`.
