GitHub - splinter21/whisper-phoneme-asr

Introduction

This is a phoneme ASR model based on the whisper-small model. Features are taken from the Whisper encoder output, and the MFA (Montreal Forced Aligner) tool is used to align phonemes with the frame-level Whisper features for training.

The dictionary used is opencpop-strict, and the model is trained on a portion of the Aishell3 dataset.
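
To illustrate what aligning phonemes with frame-level features involves, here is a minimal sketch that turns forced-alignment intervals into one label per encoder frame. It relies on the Whisper encoder producing one frame per 20 ms of audio. The function name, the interval format, and the "SP" padding label are assumptions made for illustration; they are not taken from this repository's code.

```python
FRAME_SECONDS = 0.02  # the Whisper encoder emits one frame per 20 ms of audio


def intervals_to_frame_labels(intervals, num_frames, pad_label="SP"):
    """Turn (start_sec, end_sec, phoneme) intervals into one label per frame.

    Frames not covered by any interval keep the hypothetical pad_label.
    """
    labels = [pad_label] * num_frames
    for start, end, phoneme in intervals:
        first = int(round(start / FRAME_SECONDS))
        last = min(int(round(end / FRAME_SECONDS)), num_frames)
        for i in range(first, last):
            labels[i] = phoneme
    return labels


# Hypothetical alignment for the syllable "ni", followed by uncovered frames.
intervals = [(0.00, 0.06, "n"), (0.06, 0.20, "i")]
frame_labels = intervals_to_frame_labels(intervals, num_frames=12)
```

Each frame then has a target phoneme, so a frame-level classifier can be trained on top of the encoder features.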

Features

Advantages

Recognizes phonemes together with their tones
Directly predicts each phoneme and its duration
Performs well on clearly pronounced audio
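
One way phonemes and durations can be read directly from frame-level predictions is to merge runs of identical labels. The sketch below assumes one predicted label per 20 ms Whisper encoder frame; the function name and output format are hypothetical and may differ from what infer.py actually produces.

```python
from itertools import groupby

FRAME_MS = 20  # one Whisper encoder frame covers 20 ms of audio


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (phoneme, duration_ms) segments."""
    segments = []
    for phoneme, run in groupby(frame_labels):
        num_frames = sum(1 for _ in run)
        segments.append((phoneme, num_frames * FRAME_MS))
    return segments


# Hypothetical per-frame predictions for the syllable "ni".
frame_labels = ["n", "n", "n", "i", "i", "i", "i", "i"]
segments = frames_to_segments(frame_labels)
```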

Disadvantages

Supports Chinese only
Generalizes poorly because the training dataset is small, and performs poorly on audio with unclear pronunciation
May output illegal phoneme sequences, such as "d ang ong"
(Don't set your expectations too high; much of the time the recognition output is garbage.)
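
Illegal sequences of this kind could be filtered after recognition by checking whether the output splits cleanly into dictionary syllables. The following is a rough sketch under stated assumptions: LEGAL_SYLLABLES is a tiny hypothetical subset written for illustration, not the actual opencpop-strict dictionary, and the repository is not known to include such a check.

```python
# Hypothetical, tiny subset of a syllable-to-phoneme dictionary; a real
# dictionary such as opencpop-strict covers many more syllables.
LEGAL_SYLLABLES = {
    ("d", "ang"),  # dang
    ("d", "ong"),  # dong
    ("ang",),      # ang
    ("n", "i"),    # ni
}
MAX_SYLLABLE_LEN = max(len(s) for s in LEGAL_SYLLABLES)


def is_legal(phonemes):
    """Return True if the sequence splits entirely into legal syllables."""
    n = len(phonemes)
    # reachable[i] is True when phonemes[:i] can be split into legal syllables.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for i in range(n):
        if not reachable[i]:
            continue
        for k in range(1, MAX_SYLLABLE_LEN + 1):
            if i + k <= n and tuple(phonemes[i:i + k]) in LEGAL_SYLLABLES:
                reachable[i + k] = True
    return reachable[n]


bad = is_legal(["d", "ang", "ong"])        # "ong" is left without an initial
good = is_legal(["d", "ang", "d", "ong"])  # splits into "dang" + "dong"
```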

Usage

Download recognition_model.pth and place it in the assets directory. Place the audio files in the dataset_raw directory, grouped by speaker, and run batch_annotate.py.

To recognize a single audio file, use infer.py.
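
Before running batch annotation, it can help to confirm that the audio is grouped by speaker. This sketch assumes a layout of dataset_raw/<speaker>/<file>.wav, which is a guess at what "by speaker" means here; the helper below is hypothetical and not part of the repository.

```python
from collections import defaultdict
from pathlib import Path


def group_audio_by_speaker(root, suffixes=(".wav",)):
    """Map each speaker sub-directory name to its sorted audio file names."""
    grouped = defaultdict(list)
    for path in sorted(Path(root).glob("*/*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            grouped[path.parent.name].append(path.name)
    return dict(grouped)


# Assumed layout: dataset_raw/<speaker>/<file>.wav
files_by_speaker = group_audio_by_speaker("dataset_raw")
for speaker, names in files_by_speaker.items():
    print(f"{speaker}: {len(names)} file(s)")
```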

Repository files:

assets
filelists
modules
whisper_enc
.gitignore
LICENSE
README.md
batch_annotate.py
config.py
data_utils.py
extract_whisper.py
infer.py
pack_model.py
post_mfa.py
preprocess_aishell.py
split_data.py
train.py
utils.py
