
splinter21/whisper-phoneme-asr

 
 


Introduction

This is a phoneme-level ASR model built on whisper-small. The Whisper encoder output serves as frame-level features, and the training targets come from MFA (Montreal Forced Aligner), which aligns each phoneme to those frames.

The phoneme dictionary is opencpop-strict, and the model is trained on a portion of the Aishell3 dataset.
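
To make the frame-level alignment concrete, here is a minimal sketch of how aligner intervals could be turned into one label per encoder frame. It is not taken from this repository: the function name, the `SP` filler label, and the interval format are assumptions for illustration. The only fixed fact used is that the Whisper encoder emits one frame per 20 ms of audio.

```python
FRAME_SECONDS = 0.02  # the Whisper encoder emits one frame per 20 ms of audio


def intervals_to_frame_labels(intervals, num_frames, pad_label="SP"):
    """Turn (phoneme, start_sec, end_sec) intervals into one label per frame.

    `pad_label` is a hypothetical filler for frames no interval covers.
    """
    labels = [pad_label] * num_frames
    for phoneme, start, end in intervals:
        first = int(round(start / FRAME_SECONDS))
        last = min(int(round(end / FRAME_SECONDS)), num_frames)
        for i in range(first, last):
            labels[i] = phoneme
    return labels


# Hypothetical alignment for a short clip covered by 12 encoder frames.
alignment = [("n", 0.00, 0.06), ("i", 0.06, 0.20)]
frame_labels = intervals_to_frame_labels(alignment, num_frames=12)
```

Frames that fall outside every interval keep the filler label, so silence at the end of a clip still receives a training target.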

Features

Advantages

  • Recognizes tones as well as phonemes
  • Directly outputs each phoneme together with its duration
  • Performs well on clearly pronounced speech
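
As an illustration of how phonemes and durations can come out of a frame-level model, the sketch below merges consecutive identical frame labels into segments. This is an assumed post-processing step, not the repository's actual code; `frames_to_segments` is a hypothetical name, and the 20 ms frame length is the Whisper encoder's frame spacing.

```python
from itertools import groupby

FRAME_SECONDS = 0.02  # one Whisper encoder frame covers 20 ms


def frames_to_segments(frame_labels):
    """Merge runs of equal labels into (label, start_frame, num_frames)."""
    segments = []
    position = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, position, length))
        position += length
    return segments


# Hypothetical per-frame predictions for a short clip.
predicted = ["n", "n", "n", "i", "i", "i", "i", "i", "i", "i"]
segments = frames_to_segments(predicted)

# Convert frame counts to seconds for display.
durations = [(label, round(n * FRAME_SECONDS, 2)) for label, _, n in segments]
```

Keeping the segments in frame units and converting to seconds only at the end avoids accumulating floating-point error across long files.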

Disadvantages

  • Supports Chinese only
  • Generalizes poorly because the training dataset is small, and performs badly on unclear pronunciation
  • May output illegal phoneme sequences, such as "d ang ong"
  • (Don't expect too much; a lot of the time the output is complete garbage.)
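
One way to catch illegal sequences after recognition is to check whether the output can be split entirely into syllables from the dictionary. The sketch below assumes that "illegal" means "cannot be segmented into dictionary syllables" and uses a tiny hypothetical set of legal syllables rather than the real opencpop-strict file; `is_legal` is an illustrative name, not part of this repository.

```python
def is_legal(phonemes, legal_syllables):
    """Return True if the sequence splits entirely into legal syllables."""
    max_len = max(len(s) for s in legal_syllables)
    # reachable[i] is True when phonemes[:i] splits cleanly into legal syllables
    reachable = [True] + [False] * len(phonemes)
    for end in range(1, len(phonemes) + 1):
        for size in range(1, min(max_len, end) + 1):
            chunk = tuple(phonemes[end - size:end])
            if reachable[end - size] and chunk in legal_syllables:
                reachable[end] = True
                break
    return reachable[-1]


# Toy dictionary (hypothetical): each entry is the phoneme tuple of one syllable.
LEGAL = {("d", "ang"), ("d", "ong"), ("ang",), ("n", "i")}

bad = is_legal(["d", "ang", "ong"], LEGAL)
good = is_legal(["d", "ang", "d", "ong"], LEGAL)
```

A check like this only flags suspicious output; deciding how to repair a flagged sequence is a separate problem.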

Usage

Download recognition_model.pth and place it in the assets directory. Place the audio files in the dataset_raw directory, grouped by speaker, then run batch_annotate.py.

To recognize a single audio file, use infer.py.
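
Before running the scripts, it can help to confirm the files are where they are expected. The helper below is a hypothetical pre-flight check, not part of this repository. It assumes one subdirectory per speaker under dataset_raw and `.wav` audio files; adjust both assumptions to match your data.

```python
from pathlib import Path


def summarize_layout(root="."):
    """Check the checkpoint exists and count audio files per speaker."""
    root = Path(root)
    model = root / "assets" / "recognition_model.pth"
    if not model.is_file():
        raise FileNotFoundError(f"missing checkpoint: {model}")
    counts = {}
    speaker_dirs = sorted(p for p in (root / "dataset_raw").iterdir() if p.is_dir())
    for speaker_dir in speaker_dirs:
        # Assumption: audio is stored as .wav files directly under each speaker.
        counts[speaker_dir.name] = len(list(speaker_dir.glob("*.wav")))
    return counts


if __name__ == "__main__":
    for speaker, count in summarize_layout().items():
        print(f"{speaker}: {count} file(s)")
```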
