### Speech Embedding Practice by Model

The code below extracts speaker embeddings (feature vectors) from a single input audio file using three different deep learning models (ECAPA-TDNN, WavLM (Transformer-based), and ResNet) and checks the dimensionality of the vector each model outputs.

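- **Audio loading and preprocessing**: read the WAV file with `librosa` and convert it to a 1×T tensor, the input shape the models expect
- **ECAPA-TDNN**: extracts a 192-dimensional embedding specialized for speaker recognition
- **WavLM (Transformer-based)**: mean-pools the 768-dimensional (depending on the model variant) sequence representation from the self-supervised model and uses the result as the embedding
- **ResNet (an image classification model)**: converts the speech spectrogram into a 3-channel image, feeds it to an ImageNet-pretrained ResNet34, and uses the default class output (1000 dimensions) as a stand-in for an embedding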

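This lets you

1. understand the input/output shapes and dimensions of each architecture,
2. build a speaker-embedding extraction pipeline end to end, and
3. compare and extend the models to see which one actually extracts features that help distinguish speakers.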
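
The mean pooling mentioned in the WavLM item can be illustrated without loading any model. The following is a minimal sketch, assuming the frame-level output is a list of equal-length vectors; `mean_pool` is a hypothetical helper name, and the toy input of 3 frames with 4 dimensions stands in for the real (frames, 768) tensor.

```python
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level vectors over the time axis.

    frames: T vectors of dimension H (a stand-in for last_hidden_state[0]).
    Returns a single H-dimensional utterance-level vector.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimension")
    num_frames = len(frames)
    return [sum(f[d] for f in frames) / num_frames for d in range(dim)]


# Toy input: 3 frames, 4 dimensions (a real WavLM output would be T x 768).
toy_frames = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 0.0, 2.0, 1.0],
    [2.0, 3.0, 2.0, 0.0],
]
pooled = mean_pool(toy_frames)
print(len(pooled), pooled)
```

Whatever the number of frames, the pooled vector keeps the hidden dimension, which is why utterances of different lengths become comparable.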

In [5]:
!unzip -o recordings_any.zip -d /content

Archive:  recordings_any.zip
   creating: /content/recordings/
  inflating: /content/recordings/utt_이랑교.wav  
  inflating: /content/__MACOSX/recordings/._utt_이랑교.wav  
  inflating: /content/recordings/utt_백현일.wav  
  inflating: /content/__MACOSX/recordings/._utt_백현일.wav  
  inflating: /content/recordings/utt_조민주.wav  
  inflating: /content/__MACOSX/recordings/._utt_조민주.wav  
  inflating: /content/recordings/utt_강사.wav  
  inflating: /content/__MACOSX/recordings/._utt_강사.wav  
  inflating: /content/recordings/utt_김희건.wav  
  inflating: /content/__MACOSX/recordings/._utt_김희건.wav  


In [8]:
!pip install speechbrain


In [13]:
import torch
import librosa

# 1) Load audio
def load_audio(path, sample_rate=16000):
    wav, sr = librosa.load(path, sr=sample_rate)
    return torch.tensor(wav).unsqueeze(0)  # (1, time)

wav_path = "/content/recordings/utt_강사.wav"
# Load the waveform at this path, resampled to 16 kHz.
signal = load_audio(wav_path)  # (1, T)


# --- ECAPA-TDNN ---
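# speechbrain 1.0 moved the pretrained interfaces to speechbrain.inference
# (the older speechbrain.pretrained path is deprecated).
from speechbrain.inference.speaker import EncoderClassifier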
ecapa = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    run_opts={"device": "cuda" if torch.cuda.is_available() else "cpu"}
)
emb_ecapa = ecapa.encode_batch(signal).squeeze().cpu().numpy()
print("ECAPA embedding shape:", emb_ecapa.shape)
print(emb_ecapa)

# --- WavLM ---
from transformers import WavLMModel, Wav2Vec2FeatureExtractor
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, return_tensors="pt"
)
model_wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").to(
    "cuda" if torch.cuda.is_available() else "cpu"
)
inputs = feature_extractor(signal.numpy(), sampling_rate=16000, return_tensors="pt")
inputs = {k: v.to(model_wavlm.device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model_wavlm(**inputs)
    emb_wavlm = outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()
print("WavLM embedding shape:", emb_wavlm.shape)
print(emb_wavlm)

# --- ResNet-based embedding (using torchvision) ---
from torchvision.models import resnet34, ResNet34_Weights
from torchvision import transforms
from torchaudio.transforms import MelSpectrogram, AmplitudeToDB

# Spectrogram transforms
mel_spec = MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
)
to_db = AmplitudeToDB()

# (1, T) -> (1, 80, frames) -> (3, 80, frames)
spec = mel_spec(signal)        # (1, 80, frames)
spec_db = to_db(spec)          # (1, 80, frames)
spec_img = spec_db.expand(3, -1, -1)  # (3, 80, frames)

# torchvision preprocessing: resize and normalize
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

x = preprocess(spec_img).unsqueeze(0)  # (1, 3, 224, 224)

# Load ResNet with ImageNet weights (`weights=` replaces the deprecated `pretrained=True`)
resnet = resnet34(weights=ResNet34_Weights.IMAGENET1K_V1).eval().to(x.device)

with torch.no_grad():
    feats = resnet(x)  # (1, 1000)
    emb_resnet = feats.squeeze().cpu().numpy()
print("ResNet embedding shape:", emb_resnet.shape)
print(emb_resnet)


ECAPA embedding shape: (192,)
[ 19.384323    -0.37681118 -17.656822   -11.370897   -16.301619
 -42.641064    10.685973    34.42394     -9.930648    11.776552
 -23.340721    -4.8756437   15.176208     6.7057915    3.6570086
  45.813286    -9.955842    -5.028708    -5.127738    15.043763
 -12.963876   -34.82991     -5.6800175   39.98239    -28.41942
  15.985863   -27.375307     5.2691083  -13.039081    -0.54056036
 -13.717533     3.835286    22.955496   -14.48932     21.213041
   9.946405    20.273426    17.069809     8.065852     3.4499838
 -20.887508   -34.506405    -4.2774057  -18.504356     6.5426755
 -37.29016     32.01841    -19.722013    12.191779   -23.843788
  -1.9829111   -8.9974785   -0.9034279   30.226555   -35.184155
 -25.224447    10.663141    17.752588    19.127544   -29.493834
  21.129465   -13.560528    21.345762    37.468636     1.8676853
 -10.474776   -35.553787    21.378418    39.39375    -39.041985
 -19.177391    29.013227    13.5155735   -8.492853   -14.537387
  11.
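
To sanity-check the tensor shapes annotated in the code, the frame counts can be estimated from the signal length alone. The sketch below assumes that `MelSpectrogram` uses its default centered framing (so frames = 1 + T // hop_length) and that WavLM's convolutional front end uses kernel sizes (10, 3, 3, 3, 3, 2, 2) with strides (5, 2, 2, 2, 2, 2, 2), the usual base-model settings. Both are assumptions to verify against the installed library versions, and the function names are hypothetical.

```python
def mel_num_frames(num_samples: int, hop_length: int = 256) -> int:
    """Frame count of a centered STFT: one frame per full hop, plus one."""
    return 1 + num_samples // hop_length


def conv_encoder_num_frames(
    num_samples: int,
    kernels=(10, 3, 3, 3, 3, 2, 2),
    strides=(5, 2, 2, 2, 2, 2, 2),
) -> int:
    """Output length of a stack of unpadded 1-D convolutions."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length


one_second = 16000  # samples at 16 kHz
print("mel frames:", mel_num_frames(one_second))
print("encoder frames:", conv_encoder_num_frames(one_second))
```

Under these assumptions, one second of 16 kHz audio gives a (1, 80, 63) mel spectrogram and 49 hidden-state frames before mean pooling.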

### AI Hub Everyday-Noise Embeddings

In [1]:
import os
import zipfile
import tarfile
import shutil

def extract_archive_auto(path: str, dest_dir: str = None):
    """
    Detect the archive format automatically and extract it.
    If extraction yields a single top-level subfolder, move its files/folders
    up into dest_dir and delete the now-empty folder.
    Supported formats: .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz
    """
    if dest_dir is None:
        base, _ = os.path.splitext(os.path.basename(path))
        dest_dir = os.path.join(os.path.dirname(path), base)
    os.makedirs(dest_dir, exist_ok=True)

    # 1) Extract the archive
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path, 'r') as z:
            z.extractall(dest_dir)
        print(f'Extracted as ZIP: "{path}" → "{dest_dir}"')
    elif tarfile.is_tarfile(path):
        with tarfile.open(path, 'r:*') as t:
            t.extractall(dest_dir)
        print(f'Extracted as TAR: "{path}" → "{dest_dir}"')
    else:
        raise ValueError(f'Unsupported or corrupted archive: {path}')

    # 2) Detect a single top-level subfolder
    entries = os.listdir(dest_dir)
    if len(entries) == 1:
        sub = os.path.join(dest_dir, entries[0])
        if os.path.isdir(sub):
            # 3) Move every item in the subfolder up into dest_dir
            for name in os.listdir(sub):
                src = os.path.join(sub, name)
                dst = os.path.join(dest_dir, name)
                shutil.move(src, dst)
            # 4) Delete the empty subfolder
            os.rmdir(sub)
            print(f'Moved the contents of subfolder "{entries[0]}" up one level and deleted the folder.')

# Usage example
archive_path = '/content/Sample.zip'
try:
    extract_archive_auto(archive_path, dest_dir='/content/background')
except ValueError as e:
    print(e)


Extracted as ZIP: "/content/Sample.zip" → "/content/background"
Moved the contents of subfolder "Sample" up one level and deleted the folder.
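
One detail worth noting: when `dest_dir` is omitted, the folder name comes from `os.path.splitext`, which strips only the last suffix, so an archive named `noise.tar.gz` would be extracted into a folder called `noise.tar`. A possible refinement, sketched here with a hypothetical helper `archive_stem`, strips the known multi-part suffixes first.

```python
import os

# Suffixes from the function's supported-format list, longest first
# so that ".tar.gz" is matched before any shorter suffix.
_ARCHIVE_SUFFIXES = (
    ".tar.gz", ".tar.bz2", ".tar.xz",
    ".tgz", ".tbz2", ".txz", ".tar", ".zip",
)


def archive_stem(path: str) -> str:
    """Return the file name without its (possibly multi-part) archive suffix."""
    name = os.path.basename(path)
    lowered = name.lower()
    for suffix in _ARCHIVE_SUFFIXES:
        if lowered.endswith(suffix):
            return name[: -len(suffix)]
    # Unknown suffix: fall back to stripping only the last extension.
    return os.path.splitext(name)[0]


print(archive_stem("/content/Sample.zip"))
print(archive_stem("/content/noise.tar.gz"))
```

The result could replace the `base` variable when building the default destination path.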


In [2]:
!pip install transformers librosa torch soundfile pandas

import os
import glob
import json

import numpy as np
import pandas as pd
import soundfile as sf
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

# ──────────────────────────────────────────────
# 1) Path settings
JSON_ROOT  = "/content/background/2.라벨링데이터"
AUDIO_ROOT = "/content/background/1.원천데이터"
# ──────────────────────────────────────────────

# 2) Recursively search for JSON files
json_paths = glob.glob(os.path.join(JSON_ROOT, "**", "*.json"), recursive=True)
print(f"Number of JSON files found: {len(json_paths)}")

# 3) JSON → list of records
records = []
for jp in json_paths:
    with open(jp, "r", encoding="utf-8") as f:
        data = json.load(f)
    md = data["metaData"]

    # Read the metaData fields
    # samplingRate: "48khz" → 48000
    orig_sr = int(md["samplingRate"].lower().replace("khz","")) * 1000
    # channel: "1ch" → 1
    nch     = int(md["channel"].lower().replace("ch",""))

    # src_path + JSON basename → WAV path
    rel_dir = md["src_path"]  # e.g. "A.층간소음/1.중량충격음/b.아이들발걸음소리"
    wav_name = os.path.splitext(os.path.basename(jp))[0] + ".wav"
    wav_path = os.path.join(AUDIO_ROOT, rel_dir, wav_name)
    if not os.path.isfile(wav_path):
        raise FileNotFoundError(f"WAV file not found: {wav_path}")

    # One record per annotation
    for ann in data["annotation"]:
        records.append({
            "wav_path": wav_path,
            "orig_sr":  orig_sr,
            "n_channel":nch,
            "start":    ann["startTime"],
            "end":      ann["endTime"],
            "label":    ann["labelText"],
        })

# 4) Build the DataFrame
df = pd.DataFrame(records)
print("Total number of segments:", len(df))
df.head()

# 5) Prepare the WavLM model & feature extractor
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
fe     = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, return_tensors="pt")
model  = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").to(DEVICE)
model.eval()

# 6) Embedding extraction function
def extract_emb(row):
    # (a) Load the original file
    y, file_sr = sf.read(row.wav_path, dtype="float32")  # (T,) or (T, nch)
    # Down-mix to mono based on the array actually read, not only the metadata
    if y.ndim > 1:
        y = y.mean(axis=1)

    # (b) Resample to 16 kHz for the model input
    target_sr = fe.sampling_rate  # 16000
    if file_sr != target_sr:
        y = librosa.resample(y, orig_sr=file_sr, target_sr=target_sr)

    # (c) Cut out the annotated segment
    s = int(row.start * target_sr)
    e = int(row.end   * target_sr)
    seg = y[s:e]

    # (d) Feature extractor + model forward pass
    inputs = fe(seg, sampling_rate=target_sr, return_tensors="pt", padding=True)
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, L, H)
        emb = hidden.mean(dim=1).squeeze().cpu().numpy()

    return emb

# 7) Add the embedding column (this takes a while)
df["embedding"] = df.apply(extract_emb, axis=1)

# 8) Check the results
print(df[["wav_path","orig_sr","start","end","label"]].head())
print("Example embedding shape:", df.embedding.iloc[0].shape)

                                            wav_path  orig_sr  start    end  \
0  /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기...    48000   0.00  14.99   
1  /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기...    48000   0.04   1.94   
2  /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기...    48000   1.98   6.58   
3  /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기...    48000   6.70   7.52   
4  /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기...    48000   7.68  10.91   

        label  
0  항발기의파일뽑는소리  
1  항발기의파일뽑는소리  
2  항발기의파일뽑는소리  
3  항발기의파일뽑는소리  
4  항발기의파일뽑는소리  
Example embedding shape: (768,)
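
The metadata parsing and segment slicing in this cell reduce to a few lines of arithmetic. The sketch below isolates them; `parse_khz`, `parse_channels`, and `segment_bounds` are hypothetical helper names. Parsing through `float` (so that a value such as "44.1khz" would also work) and clamping the indices to the file length are assumptions beyond what the sample data shows.

```python
def parse_khz(text: str) -> int:
    """Convert a string such as "48khz" to a rate in Hz (48000)."""
    value = float(text.strip().lower().replace("khz", ""))
    return int(round(value * 1000))


def parse_channels(text: str) -> int:
    """Convert a string such as "1ch" to a channel count (1)."""
    return int(text.strip().lower().replace("ch", ""))


def segment_bounds(start_s: float, end_s: float, sr: int, num_samples: int):
    """Map a [start, end] interval in seconds to clamped sample indices."""
    first = max(0, int(start_s * sr))
    last = min(num_samples, int(end_s * sr))
    if last <= first:
        raise ValueError(f"empty segment: {start_s}-{end_s} s")
    return first, last


sr = parse_khz("48khz")
print(sr, parse_channels("1ch"))
# A 15-second file at 16 kHz and an annotation from 0.5 s to 2.0 s:
print(segment_bounds(0.5, 2.0, 16000, 15 * 16000))
```

Raising on an empty segment makes a bad annotation fail early instead of passing a zero-length array to the model.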


In [3]:
display(df.head(100))

|  | wav_path | orig_sr | n_channel | start | end | label | embedding |
|---|---|---|---|---|---|---|---|
| 0 | /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기... | 48000 | 1 | 0.00 | 14.99 | 항발기의파일뽑는소리 | [-0.0723154, 0.036309905, 0.0054642493, 0.1298... |
| 1 | /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기... | 48000 | 1 | 0.04 | 1.94 | 항발기의파일뽑는소리 | [0.044309452, 0.037799414, -0.025701614, 0.022... |
| 2 | /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기... | 48000 | 1 | 1.98 | 6.58 | 항발기의파일뽑는소리 | [0.026657918, 0.033878867, 0.032889437, 0.0352... |
| 3 | /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기... | 48000 | 1 | 6.70 | 7.52 | 항발기의파일뽑는소리 | [0.15044974, 0.111423396, -0.1496896, -0.14628... |
| 4 | /content/background/1.원천데이터/B.공사장/1.건설장비/b.항발기... | 48000 | 1 | 7.68 | 10.91 | 항발기의파일뽑는소리 | [0.09609235, 0.03303562, -0.09052447, -0.02499... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | /content/background/1.원천데이터/B.공사장/2.차량/a.덤프트럭의... | 48000 | 1 | 0.02 | 14.96 | 덤프트럭의엔진소리 | [-0.05190432, 0.011147436, -0.03881551, 0.0858... |
| 96 | /content/background/1.원천데이터/B.공사장/2.차량/a.덤프트럭의... | 48000 | 1 | 0.01 | 14.98 | 덤프트럭의엔진소리 | [0.075957224, -0.021362187, -0.023035014, 0.02... |
| 97 | /content/background/1.원천데이터/A.층간소음/4.악기/b.피아노연... | 48000 | 1 | 0.00 | 14.98 | 피아노연주소리 | [-0.036473136, 0.012619633, -0.031627446, 0.02... |
| 98 | /content/background/1.원천데이터/A.층간소음/4.악기/b.피아노연... | 48000 | 1 | 0.03 | 14.90 | 피아노연주소리 | [0.0027560212, 0.013097054, -0.056754634, 0.00... |


In [4]:
print(df.embedding.iloc[0])

[-7.23154023e-02  3.63099054e-02  5.46424929e-03  1.29823163e-01
 -3.32779847e-02 -3.16181369e-02  1.92172840e-01 -1.85729321e-02
 -1.01506196e-01 -5.95288947e-02  1.62026621e-02 -1.31282926e-01
 -1.14370964e-01 -7.70616636e-04  9.79651362e-02 -2.90642828e-01
 -1.05803214e-01 -2.44143233e-02 -1.93070471e-01 -1.50752395e-01
  1.55709922e-01  1.19314887e-01 -1.37791783e-01  2.63521764e-02
  9.45588127e-02 -7.12560713e-02 -9.88079980e-02 -4.11068201e-02
  3.36998224e-01  4.12565982e-03  1.80485286e-02 -2.40791999e-02
 -3.06458175e-02  1.35396169e-02 -1.10074291e-02  4.13395949e-02
  2.08893299e-01 -9.80775505e-02 -2.98504601e-03  1.12745658e-01
  4.14891839e-02  1.97418164e-02  8.87379944e-02 -3.87264825e-02
 -2.50806212e-02 -1.04892112e-01  4.56126481e-02 -9.07683838e-03
 -9.83326659e-02 -2.53292043e-02  1.45227239e-01  1.22843586e-01
 -9.11552981e-02  1.23300321e-01  8.15088004e-02  3.09418533e-02
  2.26633161e-01  8.55813622e-01 -1.56813145e-01  1.19661517e-01
 -8.90380815e-02 -8.84752

In [14]:
# Install the required libraries
!pip install scikit-learn matplotlib seaborn

# ──────────────────────────────────────────────
# 1) Modules needed for visualization
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# ──────────────────────────────────────────────
# 2) Prepare the embedding matrix and the labels
embeddings = np.stack(df["embedding"].values)  # (N, H)
labels     = df["label"].values                # (N,)

print("Embedding shape:", embeddings.shape)
print("Number of distinct labels:", len(set(labels)))

# ──────────────────────────────────────────────
# 3) t-SNE dimensionality reduction
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
emb_2d = tsne.fit_transform(embeddings)

Embedding shape: (1023, 768)
Number of distinct labels: 14


In [16]:
!pip install plotly

import plotly.express as px
import pandas as pd

# 1) Build a DataFrame for visualization
df_vis = pd.DataFrame(emb_2d, columns=["x", "y"])
df_vis["label"]   = df["label"].values
df_vis["wavfile"] = df["wav_path"].apply(lambda p: os.path.basename(p))

# 2) Plotly visualization
fig = px.scatter(
    df_vis,
    x="x", y="y",
    color="label",
    hover_data=["wavfile", "label"],
    title="Embedding comparison",
    width=1000, height=800
)
fig.show()
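
A t-SNE scatter plot is qualitative, so it can help to pair it with a number. One option, sketched below on toy 2-D vectors rather than the real 768-dimensional embeddings, is to compare the mean cosine similarity of pairs that share a label with that of pairs that do not. The helper names and the toy data are illustrative assumptions, not part of the original notebook.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def within_between(vectors, labels):
    """Mean cosine similarity of same-label pairs and of different-label pairs."""
    same, diff = [], []
    for i, j in combinations(range(len(vectors)), 2):
        bucket = same if labels[i] == labels[j] else diff
        bucket.append(cosine(vectors[i], vectors[j]))
    return sum(same) / len(same), sum(diff) / len(diff)


# Toy data: two labels pointing in clearly different directions.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = ["engine", "engine", "piano", "piano"]
within, between = within_between(toy_vectors, toy_labels)
print(f"within={within:.3f} between={between:.3f}")
```

A large gap between the two means suggests the labels are well separated in the embedding space. For the full set of segments a vectorized implementation would be preferable, since this pure-Python loop visits every pair.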




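### Speech Embedding Practice

1. Record “무궁화 꽃이 피었습니다” five times, changing your intonation and tone each time, save the recordings with names of the form “utt_홍길동1.wav”, and upload them to the Discord channel for lecture notes and learning materials.

2. Use the code below to inspect how each speaker's embedding values vary with tone and intonation.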
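
Because the file names encode both the speaker and the take number, they can be parsed to group recordings per speaker. This is a minimal sketch, assuming every file follows the `utt_<name><number>.wav` pattern described in step 1; `parse_utt_name` and `group_by_speaker` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Speaker name (lazy match) followed by the take number and the extension.
_UTT_PATTERN = re.compile(r"^utt_(?P<speaker>.+?)(?P<take>\d+)\.wav$")


def parse_utt_name(filename: str):
    """Split a name such as "utt_홍길동1.wav" into ("홍길동", 1)."""
    match = _UTT_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match.group("speaker"), int(match.group("take"))


def group_by_speaker(filenames):
    """Map each speaker to the sorted list of take numbers found."""
    groups = defaultdict(list)
    for name in filenames:
        speaker, take = parse_utt_name(name)
        groups[speaker].append(take)
    return {speaker: sorted(takes) for speaker, takes in groups.items()}


files = ["utt_이랑교3.wav", "utt_김성헌2.wav", "utt_김성헌3.wav", "utt_이랑교2.wav"]
print(group_by_speaker(files))
```

Grouping like this makes it easy to spot a speaker with missing takes before any embeddings are computed.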

In [1]:
!pip install speechbrain librosa numpy scikit-learn


In [4]:
!unzip -o speech_dataset.zip -d /content

Archive:  speech_dataset.zip
   creating: /content/speech_dataset/
  inflating: /content/speech_dataset/utt_이랑교3.wav  
  inflating: /content/__MACOSX/speech_dataset/._utt_이랑교3.wav  
  inflating: /content/speech_dataset/utt_김성헌2.wav  
  inflating: /content/__MACOSX/speech_dataset/._utt_김성헌2.wav  
  inflating: /content/speech_dataset/utt_김성헌3.wav  
  inflating: /content/__MACOSX/speech_dataset/._utt_김성헌3.wav  
  inflating: /content/speech_dataset/utt_이랑교2.wav  
  inflating: /content/__MACOSX/speech_dataset/._utt_이랑교2.wav  
  inflating: /content/speech_dataset/utt_김성헌1.wav  
  inflating: /content/__MACOSX/speech_dataset/._utt_김성헌1.wav  
  inflating: /content/speech_dataset/utt_이랑교1.wav  
  inflating: /content/__MACOSX/speech_dataset/._utt_이랑교1.wav  
  inflating: /content/speech_dataset/utt_이랑교5.wav  
  inflating: /content/__MACOSX/speech_dataset/._utt_이랑교5.wav  
  inflating: /content/speech_dataset/utt_김성헌4.wav  
  

In [2]:
import torch
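# speechbrain 1.0 moved the pretrained interfaces to speechbrain.inference
# (the older speechbrain.pretrained path is deprecated).
from speechbrain.inference.speaker import EncoderClassifier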

device = "cuda" if torch.cuda.is_available() else "cpu"
ecapa = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    run_opts={"device": device}
)


In [12]:
import librosa
import numpy as np
from pathlib import Path

# Collect the enrollment recordings (five takes per speaker)
enroll_dir = Path("/content/speech_dataset/")
wav_paths = sorted(enroll_dir.glob("*.wav"))

embs = []
for path in wav_paths:
    wav, sr = librosa.load(str(path), sr=16000)
    sig = torch.tensor(wav).unsqueeze(0).to(device)         # (1, T)
    emb = ecapa.encode_batch(sig).squeeze().cpu().numpy()  # (D,)
    embs.append(emb)



[[  4.9192986   7.2168646 -13.992505  ... -26.719744  -13.6325245
    9.321775 ]
 [ -4.72594    12.781632   10.37529   ... -41.7932    -10.451805
   -4.0384984]
 [-23.384117   26.791363  -30.991255  ... -32.665302  -26.279493
  -13.812485 ]
 ...
 [-10.134351   37.595943   -1.8481559 ...  31.406166   32.250298
   10.013362 ]
 [ -5.564302   19.389181   -9.696294  ...  27.659466   19.532112
   -4.7419095]
 [-44.202827   35.6189    -10.783229  ...  20.716846   13.432128
  -11.942573 ]]




In [26]:
embs = np.stack(embs)            # (num_files, num_dims)
centroid = embs.mean(axis=0)     # (D,)
print("Number of audio files:", embs.shape[0])
print("Embedding dimension:", embs.shape[1])
print("Mean over all vectors for each of the 192 elements:", centroid)

Number of audio files: 45
Embedding dimension: 192
[ -1.1504345   11.501112   -12.543171     3.687162     2.9890313
   7.45251    -10.792027     1.5940958   -5.510286     6.576842
  23.76109     17.022873    -1.7088255   -6.0163627   -0.08653963
  16.10494     -3.6276164   15.645878    -2.3697035   13.818193
   4.943083   -26.87482      5.390035    -0.7825509    8.8328085
   6.6787624   -0.3892924    1.2734869   -4.5549397  -22.184282
 -15.243655     5.83697    -20.360844     0.22351685   3.134573
   9.446013     5.499341     3.6244771    4.313066    11.479939
  -5.127348    -1.1664166  -13.058071    -6.2484374   -2.8559175
 -21.689894     1.5474415    5.441242    -0.89361614 -10.064001
   0.21606877  12.5701685  -15.007276    20.737877     7.974016
   2.3868704    8.516787    11.014472     1.031257    -8.907963
  -1.3009992   11.461645     0.8661101    3.515755     2.1125648
   1.0356863    1.0999558   -9.903671   -13.838866   -10.757701
 -14.65113     -2.9657447   12.5786      -1.6178797   18.019812
  -4.1978
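
The centroid above averages every file in the folder, and the file listing shows recordings from several speakers, so it represents a mix of voices. For speaker verification one would more commonly keep one centroid per speaker. The sketch below shows that grouping on toy 3-dimensional vectors; `speaker_centroids` is a hypothetical helper and the numbers are invented for illustration.

```python
from collections import defaultdict


def speaker_centroids(embeddings, speakers):
    """Average the embeddings of each speaker separately.

    embeddings: list of equal-length vectors, one per recording.
    speakers:   speaker name for each recording, in the same order.
    """
    grouped = defaultdict(list)
    for vector, speaker in zip(embeddings, speakers):
        grouped[speaker].append(vector)
    centroids = {}
    for speaker, vectors in grouped.items():
        dim = len(vectors[0])
        centroids[speaker] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return centroids


# Toy stand-ins for 192-dimensional ECAPA embeddings (values are invented).
toy_embs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.0, 4.0, 8.0]]
toy_speakers = ["A", "A", "B"]
print(speaker_centroids(toy_embs, toy_speakers))
```

A test recording would then be scored against the centroid of the claimed speaker only.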

In [37]:
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity between each of the 45 vectors and the centroid
sims = cosine_similarity(embs, centroid[None, :]).flatten()
print("Enrollment sims:", sims)
# sims.min() is 0.19870143, the lowest enrollment similarity.
# Candidate threshold: that minimum minus a 0.05 margin.
threshold = sims.min()-0.05
# The final threshold is the larger of the candidate and a floor of 0.5.
# Here the candidate (about 0.149) is below the floor, so the threshold becomes 0.5.
# A test score at or above the threshold counts as a successful verification.
threshold = float(max(threshold, 0.5))
print("Verification threshold:", threshold)


Enrollment sims: [0.45963192 0.54873616 0.5981504  0.53620607 0.5329009  0.5969749
 0.55335486 0.5918969  0.45793402 0.56785256 0.32966977 0.30797106
 0.4160208  0.4103333  0.36906186 0.39023176 0.542364   0.5560234
 0.5125534  0.5772192  0.52262634 0.33708978 0.50906676 0.3905707
 0.5086746  0.3455995  0.40165222 0.36727214 0.44643617 0.3806298
 0.42750186 0.29823607 0.29361993 0.42125693 0.19870143 0.57022965
 0.5339527  0.5098499  0.5489682  0.5136579  0.45623893 0.455687
 0.36967087 0.39498398 0.45942515]
Verification threshold: 0.5
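
The threshold rule in this cell is simple enough to check by hand. The sketch below restates it as a function; `choose_threshold` and its `margin` and `floor` parameter names are hypothetical, and the similarity values are a handful copied from the printed array.

```python
def choose_threshold(sims, margin=0.05, floor=0.5):
    """Lowest enrollment similarity minus a margin, but never below `floor`."""
    candidate = min(sims) - margin
    return max(candidate, floor)


# A few enrollment similarities taken from the printed array.
sample_sims = [0.45963192, 0.54873616, 0.5981504, 0.19870143, 0.57022965]

threshold = choose_threshold(sample_sims)
rejected = [s for s in sample_sims if s < threshold]
print("threshold:", threshold)
print("enrollment scores below threshold:", rejected)
```

With the floor in effect, enrollment recordings that scored below 0.5 would themselves be rejected, which suggests that either the floor or the mixed-speaker centroid deserves a second look.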


In [36]:
def verify_speaker(test_wav_path, thr=threshold):
    wav, sr = librosa.load(test_wav_path, sr=16000)
    sig = torch.tensor(wav).unsqueeze(0).to(device)
    emb = ecapa.encode_batch(sig).squeeze().cpu().numpy()
    sim = cosine_similarity(emb[None, :], centroid[None, :])[0,0]
    is_same = sim >= thr
    return is_same, sim

# Example: a test recording of the same sentence spoken in a different tone
test_path = "/content/speech_dataset/utt_백현일4.wav"
ok, score = verify_speaker(test_path)
print(f"similarity={score:.3f} ->", "verification succeeded" if ok else "verification failed")

similarity=0.446 -> verification failed
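
For reference, the decision made by `verify_speaker` boils down to one cosine similarity and one comparison. The sketch below reproduces that logic with the standard library only, on invented 3-dimensional vectors; `cosine_similarity_1d` and `is_same_speaker` are hypothetical names, and the zero-vector check is an added assumption.

```python
import math


def cosine_similarity_1d(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_same_speaker(test_emb, centroid, threshold=0.5):
    """Accept when the similarity reaches the threshold (>=, as in verify_speaker)."""
    score = cosine_similarity_1d(test_emb, centroid)
    return score >= threshold, score


toy_centroid = [1.0, 0.0, 0.0]
accepted, score = is_same_speaker([3.0, 4.0, 0.0], toy_centroid)
print(accepted, round(score, 3))
```

Because cosine similarity ignores vector length, only the direction of the test embedding relative to the centroid affects the outcome.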
