
GigaAM: the family of open-source acoustic models for speech processing


GigaAM

GigaAM (Giga Acoustic Model) is a Conformer-based, wav2vec2-style foundation model with around 240M parameters. We pre-trained GigaAM on roughly 50 thousand hours of diverse Russian speech audio.
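
Below is a minimal sketch of how frame-level embeddings could be extracted from such an encoder through NeMo's standard preprocessor/encoder pipeline. The model class, config layout, and file names here are placeholders and assumptions, not the exact loading procedure shipped with this repository:

```python
# Sketch: extracting frame-level embeddings from a pre-trained acoustic encoder.
# Assumptions: the checkpoint is instantiable via NeMo's SpeechEncDecSelfSupervisedModel,
# and the config/checkpoint paths below are placeholders.
import torch
import torchaudio
from omegaconf import OmegaConf
from nemo.collections.asr.models import SpeechEncDecSelfSupervisedModel

model_cfg = OmegaConf.load("encoder_config.yaml")          # placeholder path
model = SpeechEncDecSelfSupervisedModel.from_config_dict(model_cfg)
ckpt = torch.load("encoder_weights.ckpt", map_location="cpu")  # placeholder path
model.load_state_dict(ckpt, strict=False)
model.eval()

wav, sr = torchaudio.load("example.wav")                   # mono audio expected
wav = torchaudio.functional.resample(wav, sr, 16000)
length = torch.tensor([wav.shape[-1]])

with torch.no_grad():
    # Standard NeMo pipeline: log-mel preprocessor -> Conformer encoder
    feats, feat_len = model.preprocessor(input_signal=wav, length=length)
    encoded, encoded_len = model.encoder(audio_signal=feats, length=feat_len)

print(encoded.shape)  # (batch, hidden_dim, frames): frame-level embeddings
```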


GigaAM-CTC

GigaAM-CTC is an Automatic Speech Recognition (ASR) model. We fine-tuned the GigaAM encoder with a Connectionist Temporal Classification (CTC) objective using the NeMo toolkit on publicly available labeled Russian data (a short CTC sketch follows the table below):

| dataset | size, hours | weight |
|---|---|---|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |
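
To make the CTC objective concrete, here is a self-contained sketch of a single CTC training step on top of encoder outputs. All shapes and the vocabulary size are illustrative, not taken from the released model config:

```python
# Minimal CTC training step over encoder outputs (illustrative shapes only).
import torch
import torch.nn as nn

batch, frames, hidden, vocab = 4, 200, 768, 34             # illustrative sizes; blank id = 0

encoder_out = torch.randn(batch, frames, hidden)            # (B, T, D) from the acoustic encoder
ctc_head = nn.Linear(hidden, vocab)                         # projects each frame to token logits
log_probs = ctc_head(encoder_out).log_softmax(dim=-1)       # (B, T, V)

# nn.CTCLoss expects (T, B, V) log-probabilities plus per-sample lengths.
targets = torch.randint(1, vocab, (batch, 30))              # dummy token ids; 0 is reserved for blank
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), 30, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```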


The following table summarizes the performance of different models in terms of Word Error Rate (WER, %) on open Russian datasets:

| model | parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone calls | OpenSTT Audiobooks | Mozilla Common Voice | Russian LibriSpeech |
|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 | 1.5B | 17.4 | 14.5 | 11.1 | 31.2 | 17.0 | 5.3 | 9.0 |
| NeMo Conformer-RNNT | 120M | 2.6 | 7.2 | 24.0 | 33.8 | 17.0 | 2.8 | 13.5 |
| GigaAM-CTC | 242M | 3.1 | 5.7 | 18.4 | 25.6 | 15.1 | 1.7 | 8.1 |
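
For reference, Word Error Rate is the word-level edit distance between the hypothesis and the reference transcript, normalized by the number of reference words. A minimal implementation (equivalent in spirit to what packages such as jiwer compute):

```python
# Word Error Rate: word-level edit distance divided by reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(wer("привет как дела", "привет как дела у тебя"))  # 2 insertions / 3 words ≈ 0.667
```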

GigaAM-Emo

GigaAM-Emo is an acoustic model for emotion recognition. We fine-tuned the GigaAM encoder on the Dusha dataset.
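
A common recipe for turning a frame-level acoustic encoder into an utterance-level emotion classifier is mean pooling over time followed by a linear head. The sketch below illustrates that pattern only; the label set and head layout are assumptions, not the released GigaAM-Emo architecture:

```python
# Illustrative emotion-classification head on top of pooled encoder features.
import torch
import torch.nn as nn

EMOTIONS = ["angry", "sad", "neutral", "positive"]     # Dusha-style label set (assumed here)

class EmotionHead(nn.Module):
    def __init__(self, hidden_dim: int = 768, num_classes: int = len(EMOTIONS)):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, frames, hidden) -> mean-pool over time -> class logits
        pooled = encoder_out.mean(dim=1)
        return self.classifier(pooled)

head = EmotionHead()
dummy_encoder_out = torch.randn(2, 200, 768)           # stand-in for encoder output
probs = head(dummy_encoder_out).softmax(dim=-1)
for p in probs:
    print(EMOTIONS[int(p.argmax())])
```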


The following table summarizes the performance of different models on the Dusha dataset:

| model | Crowd: Unweighted Acc. | Crowd: Weighted Acc. | Crowd: Macro F1 | Podcast: Unweighted Acc. | Podcast: Weighted Acc. | Podcast: Macro F1 |
|---|---|---|---|---|---|---|
| DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |
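
In the speech-emotion-recognition literature, unweighted accuracy usually denotes the mean of per-class recalls (balanced accuracy), while weighted accuracy is the plain accuracy. Assuming that convention for the table above, the metrics can be reproduced with scikit-learn:

```python
# Metric sketch, assuming UA = balanced accuracy and WA = overall accuracy.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = [0, 0, 1, 1, 2, 2, 3, 3]   # toy labels over 4 emotion classes
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]

unweighted_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recalls
weighted_acc = accuracy_score(y_true, y_pred)             # plain accuracy
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"UA={unweighted_acc:.2f} WA={weighted_acc:.2f} macro-F1={macro_f1:.2f}")
```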

Links