From af9090cf8ea0542713b191cc5e8c5b0034625333 Mon Sep 17 00:00:00 2001 From: Your Name Date: Thu, 5 Jun 2025 20:52:25 -0400 Subject: [PATCH 01/34] Moved the sources to the right --- docs/source/en/model_doc/moshi.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md index 9302a9461959..357f326bc1f5 100644 --- a/docs/source/en/model_doc/moshi.md +++ b/docs/source/en/model_doc/moshi.md @@ -16,10 +16,14 @@ rendered properly in your Markdown viewer. # Moshi -
-PyTorch
-FlashAttention
-SDPA
+
+
+
+
+    PyTorch
+    FlashAttention
+    SDPA
+
## Overview From f1630afe20c4d6f52ececd6d47c30d3bf7c30be3 Mon Sep 17 00:00:00 2001 From: Your Name Date: Thu, 5 Jun 2025 20:56:59 -0400 Subject: [PATCH 02/34] small Changes --- docs/source/en/model_doc/moshi.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md index 357f326bc1f5..ed6529526aa8 100644 --- a/docs/source/en/model_doc/moshi.md +++ b/docs/source/en/model_doc/moshi.md @@ -14,10 +14,6 @@ rendered properly in your Markdown viewer. --> -# Moshi - - -
PyTorch
FlashAttention
SDPA
+# Moshi + + ## Overview The Moshi model was proposed in [Moshi: a speech-text foundation model for real-time dialogue](https://kyutai.org/Moshi.pdf) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. From abc9cf3aa07f08ebd0fb53ccdc996982d05313e6 Mon Sep 17 00:00:00 2001 From: Your Name Date: Thu, 5 Jun 2025 23:20:07 -0400 Subject: [PATCH 03/34] Some Changes to moonshine --- docs/source/en/model_doc/moshi.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md index ed6529526aa8..357f326bc1f5 100644 --- a/docs/source/en/model_doc/moshi.md +++ b/docs/source/en/model_doc/moshi.md @@ -14,6 +14,10 @@ rendered properly in your Markdown viewer. --> +# Moshi + + +
PyTorch
FlashAttention
SDPA
-# Moshi - - ## Overview The Moshi model was proposed in [Moshi: a speech-text foundation model for real-time dialogue](https://kyutai.org/Moshi.pdf) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. From 4d513ebf8e9d954a091914028dcf791e8a27b8d7 Mon Sep 17 00:00:00 2001 From: Your Name Date: Sat, 7 Jun 2025 17:02:02 -0400 Subject: [PATCH 04/34] Added the install to pipline --- docs/source/en/model_doc/moonshine.md | 87 +++++++++------------------ 1 file changed, 27 insertions(+), 60 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 4cd2eec774d4..939f3d5a6984 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -24,66 +24,33 @@ rendered properly in your Markdown viewer. # Moonshine -[Moonshine](https://huggingface.co/papers/2410.15608) is an encoder-decoder speech recognition model optimized for real-time transcription and recognizing voice command. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE) to handle speech with varying lengths without using padding. This improves efficiency during inference, making it ideal for resource-constrained devices. - -You can find all the original Moonshine checkpoints under the [Useful Sensors](https://huggingface.co/UsefulSensors) organization. - -> [!TIP] -> Click on the Moonshine models in the right sidebar for more examples of how to apply Moonshine to different speech recognition tasks. - -The example below demonstrates how to transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class. - - - - -```py -import torch -from transformers import pipeline - -pipeline = pipeline( - task="automatic-speech-recognition", - model="UsefulSensors/moonshine-base", - torch_dtype=torch.float16, - device=0 -) -pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") -``` - - - - -```py -# pip install datasets -import torch -from datasets import load_dataset -from transformers import AutoProcessor, MoonshineForConditionalGeneration - -processor = AutoProcessor.from_pretrained( - "UsefulSensors/moonshine-base", -) -model = MoonshineForConditionalGeneration.from_pretrained( - "UsefulSensors/moonshine-base", - torch_dtype=torch.float16, - device_map="auto", - attn_implementation="sdpa" -).to("cuda") - -ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", split="validation") -audio_sample = ds[0]["audio"] - -input_features = processor( - audio_sample["array"], - sampling_rate=audio_sample["sampling_rate"], - return_tensors="pt" -) -input_features = input_features.to("cuda", dtype=torch.float16) - -predicted_ids = model.generate(**input_features, cache_implementation="static") -transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) -transcription[0] -``` - - +[Moonshine](https://huggingface.co/papers/2410.15608) + + + + +You can find all the Moonshine checkpoints on the [Hub](https://huggingface.co/models?search=moonshine). + + +The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands +](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden. + +The abstract from the paper is the following: + +*This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. 
Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI's Whisper tiny-en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.* + +Tips: + +- Moonshine improves upon Whisper's architecture: + 1. It uses SwiGLU activation instead of GELU in the decoder layers + 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. + +This model was contributed by [Eustache Le Bihan (eustlb)](https://huggingface.co/eustlb). +The original code can be found [here](https://github.com/usefulsensors/moonshine). + +## Resources + +- [Automatic speech recognition task guide](../tasks/asr) ## MoonshineConfig From 63dc460c7298ead8965c2fa03f13c837849cdc2b Mon Sep 17 00:00:00 2001 From: Your Name Date: Mon, 9 Jun 2025 22:12:49 -0400 Subject: [PATCH 05/34] updated the monshine model card --- docs/source/en/model_doc/moonshine.md | 65 ++++++++++++++++++++++++--- 1 file changed, 59 insertions(+), 6 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 939f3d5a6984..f2b8efb3a840 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -31,6 +31,64 @@ rendered properly in your Markdown viewer. You can find all the Moonshine checkpoints on the [Hub](https://huggingface.co/models?search=moonshine). +> [!TIP] +> Click on the Moonshine models in the right sidebar for more examples of how to apply Moonshine to different speech recognition tasks. + +The example below demonstrates how to generate a transcription based on an audio file with [`Pipeline`] or the [`AutoModel`] class. 
+ + + + + + +```py +# uncomment to install ffmpeg which is needed to decode the audio file +# !brew install ffmpeg + +from transformers import pipeline + +asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-base") + +result = asr("path_to_audio_file") + +#Prints the transcription from the audio file +print(result["text"]) +``` + + + + +```py +# uncomment to install rjieba which is needed for the tokenizer +# !pip install rjieba +import torch +from transformers import AutoModelForMaskedLM, AutoTokenizer + +model = AutoModelForMaskedLM.from_pretrained( + "junnyu/roformer_chinese_base", torch_dtype=torch.float16 +) +tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_base") + +input_ids = tokenizer("水在零度时会[MASK]", return_tensors="pt").to(model.device) +outputs = model(**input_ids) +decoded = tokenizer.batch_decode(outputs.logits.argmax(-1), skip_special_tokens=True) +print(decoded) +``` + + + + +```bash +echo -e "水在零度时会[MASK]" | transformers-cli run --task fill-mask --model junnyu/roformer_chinese_base --device 0 +``` + + + + + + + + The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands ](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden. @@ -45,12 +103,7 @@ Tips: 1. It uses SwiGLU activation instead of GELU in the decoder layers 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. -This model was contributed by [Eustache Le Bihan (eustlb)](https://huggingface.co/eustlb). -The original code can be found [here](https://github.com/usefulsensors/moonshine). - -## Resources - -- [Automatic speech recognition task guide](../tasks/asr) +- A guide for automatic speech recognition can be found [here](../tasks/asr) ## MoonshineConfig From b20e589b7ca51e97f6bc177c9c6b976fedbfc511 Mon Sep 17 00:00:00 2001 From: SohamPrabhu <62270341+SohamPrabhu@users.noreply.github.com> Date: Tue, 10 Jun 2025 12:40:23 -0400 Subject: [PATCH 06/34] Update docs/source/en/model_doc/moonshine.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/moonshine.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index f2b8efb3a840..542201a17b48 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -24,9 +24,7 @@ rendered properly in your Markdown viewer. # Moonshine -[Moonshine](https://huggingface.co/papers/2410.15608) - - +[Moonshine](https://huggingface.co/papers/2410.15608) is a speech recognition model that is optimized for real-time transcription and voice command. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE). You can find all the Moonshine checkpoints on the [Hub](https://huggingface.co/models?search=moonshine). 
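The RoPE mechanism referenced above rotates each query/key channel pair by a position-dependent angle, so attention scores depend only on relative offsets and no fixed-size position table (or padding to a fixed window) is required. A minimal illustrative sketch — generic reference code with assumed tensor shapes, not Moonshine's actual implementation:

```py
import torch

def rope(x, base=10000.0):
    # x: (seq_len, dim) query or key vectors; dim assumed even
    seq_len, dim = x.shape
    # one rotation frequency per channel pair, from fast to slow
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    # rotate the pair (x1, x2) at position t by angle t * inv_freq
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(16, 64)  # 16 positions, one 64-dim attention head
print(rope(q).shape)  # torch.Size([16, 64])
```

Because the rotation is applied per position rather than looked up in a learned table, the same function serves any sequence length.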
From 5b5ff7dcdef09701809f4b76ae7f45c5e2974615 Mon Sep 17 00:00:00 2001 From: SohamPrabhu <62270341+SohamPrabhu@users.noreply.github.com> Date: Tue, 10 Jun 2025 12:41:40 -0400 Subject: [PATCH 07/34] Update docs/source/en/model_doc/moonshine.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/moonshine.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 542201a17b48..31288ce6f567 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -27,7 +27,7 @@ rendered properly in your Markdown viewer. [Moonshine](https://huggingface.co/papers/2410.15608) is a speech recognition model that is optimized for real-time transcription and voice command. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE). -You can find all the Moonshine checkpoints on the [Hub](https://huggingface.co/models?search=moonshine). +You can find all the original Moonshine checkpoints under the [Useful Sensors](https://huggingface.co/UsefulSensors) organization. > [!TIP] > Click on the Moonshine models in the right sidebar for more examples of how to apply Moonshine to different speech recognition tasks. From 520ff685d19d50ee479bb97fc8aae30c9c3d39f8 Mon Sep 17 00:00:00 2001 From: SohamPrabhu <62270341+SohamPrabhu@users.noreply.github.com> Date: Tue, 10 Jun 2025 12:41:53 -0400 Subject: [PATCH 08/34] Update docs/source/en/model_doc/moonshine.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/moonshine.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 31288ce6f567..d5a7c334aea5 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -32,7 +32,7 @@ You can find all the original Moonshine checkpoints under the [Useful Sensors](h > [!TIP] > Click on the Moonshine models in the right sidebar for more examples of how to apply Moonshine to different speech recognition tasks. -The example below demonstrates how to generate a transcription based on an audio file with [`Pipeline`] or the [`AutoModel`] class. +The example below demonstrates how to transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class. From 615cd5087fbf393421ecffcf905c09bf0184dc5a Mon Sep 17 00:00:00 2001 From: SohamPrabhu <62270341+SohamPrabhu@users.noreply.github.com> Date: Tue, 10 Jun 2025 12:42:28 -0400 Subject: [PATCH 09/34] Update docs/source/en/model_doc/moonshine.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/moonshine.md | 85 +++++++++++---------------- 1 file changed, 33 insertions(+), 52 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index d5a7c334aea5..810f923451d9 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -24,8 +24,7 @@ rendered properly in your Markdown viewer. # Moonshine -[Moonshine](https://huggingface.co/papers/2410.15608) is a speech recognition model that is optimized for real-time transcription and voice command. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE). 
- +[Moonshine](https://huggingface.co/papers/2410.15608) is an encoder-decoder speech recognition model optimized for real-time transcription and recognizing voice command. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE) to handle speech with varying lengths without using padding. This improves efficiency during inference, making it ideal for resource-constrained devices. You can find all the original Moonshine checkpoints under the [Useful Sensors](https://huggingface.co/UsefulSensors) organization. @@ -34,75 +33,58 @@ You can find all the original Moonshine checkpoints under the [Useful Sensors](h The example below demonstrates how to transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class. - - ```py -# uncomment to install ffmpeg which is needed to decode the audio file -# !brew install ffmpeg - +import torch from transformers import pipeline -asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-base") - -result = asr("path_to_audio_file") - -#Prints the transcription from the audio file -print(result["text"]) +pipeline = pipeline( + task="automatic-speech-recognition", + model="UsefulSensors/moonshine-base", + torch_dtype=torch.float16, + device=0 +) +pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") ``` ```py -# uncomment to install rjieba which is needed for the tokenizer -# !pip install rjieba +# pip install datasets import torch -from transformers import AutoModelForMaskedLM, AutoTokenizer +from datasets import load_dataset +from transformers import AutoProcessor, MoonshineForConditionalGeneration -model = AutoModelForMaskedLM.from_pretrained( - "junnyu/roformer_chinese_base", torch_dtype=torch.float16 +processor = AutoProcessor.from_pretrained( + "UsefulSensors/moonshine-base", ) -tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_base") - -input_ids = tokenizer("水在零度时会[MASK]", return_tensors="pt").to(model.device) -outputs = model(**input_ids) -decoded = tokenizer.batch_decode(outputs.logits.argmax(-1), skip_special_tokens=True) -print(decoded) -``` - - - +model = MoonshineForConditionalGeneration.from_pretrained( + "UsefulSensors/moonshine-base", + torch_dtype=torch.float16, + device_map="auto", + attn_implementation="sdpa" +).to("cuda") + +ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", split="validation") +audio_sample = ds[0]["audio"] + +input_features = processor( + audio_sample["array"], + sampling_rate=audio_sample["sampling_rate"], + return_tensors="pt" +) +input_features = input_features.to("cuda", dtype=torch.float16) -```bash -echo -e "水在零度时会[MASK]" | transformers-cli run --task fill-mask --model junnyu/roformer_chinese_base --device 0 +predicted_ids = model.generate(**input_features, cache_implementation="static") +transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) +transcription[0] ``` - - - - - - -The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands -](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden. - -The abstract from the paper is the following: - -*This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. 
The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI's Whisper tiny-en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.* - -Tips: - -- Moonshine improves upon Whisper's architecture: - 1. It uses SwiGLU activation instead of GELU in the decoder layers - 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. - -- A guide for automatic speech recognition can be found [here](../tasks/asr) - ## MoonshineConfig [[autodoc]] MoonshineConfig @@ -118,4 +100,3 @@ Tips: [[autodoc]] MoonshineForConditionalGeneration - forward - generate - From 4ea0a6fa5a0d6733c1935000b5725c5918e88a19 Mon Sep 17 00:00:00 2001 From: Your Name Date: Tue, 10 Jun 2025 13:42:40 -0400 Subject: [PATCH 10/34] Updated Documentation According to changes --- docs/source/en/model_doc/moonshine.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 810f923451d9..a62f0a19bbc9 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -85,6 +85,13 @@ transcription[0] + +- Moonshine improves upon Whisper's architecture: + 1. It uses SwiGLU activation instead of GELU in the decoder layers + 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. + +- A guide for automatic speech recognition can be found [here](../tasks/asr) + ## MoonshineConfig [[autodoc]] MoonshineConfig From f30014f61868e1b38c17fa7979736ae63e7f42fb Mon Sep 17 00:00:00 2001 From: Your Name Date: Wed, 11 Jun 2025 19:05:09 -0400 Subject: [PATCH 11/34] Fixed the model with the commits --- docs/source/en/model_doc/moonshine.md | 2 +- docs/source/en/model_doc/moshi.md | 10 ++++------ 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index a62f0a19bbc9..a0cb46ef60a2 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -90,7 +90,7 @@ transcription[0] 1. It uses SwiGLU activation instead of GELU in the decoder layers 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. -- A guide for automatic speech recognition can be found [here](../tasks/asr) +-- A guide for automatic speech recognition can be found [here](../tasks/asr) ## MoonshineConfig diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md index 357f326bc1f5..e70286ebf2e1 100644 --- a/docs/source/en/model_doc/moshi.md +++ b/docs/source/en/model_doc/moshi.md @@ -18,12 +18,10 @@ rendered properly in your Markdown viewer. -
-
-    PyTorch
-    FlashAttention
-    SDPA
-
+
+PyTorch
+FlashAttention
+SDPA
## Overview From d61a4e1cc931f3f1e4f8b214331cfe63ce76a2cb Mon Sep 17 00:00:00 2001 From: Your Name Date: Thu, 12 Jun 2025 23:09:08 -0400 Subject: [PATCH 12/34] Changes to the roc_bert --- docs/source/en/model_doc/roc_bert.md | 52 ++++++++++++++++++++++++++-- 1 file changed, 49 insertions(+), 3 deletions(-) diff --git a/docs/source/en/model_doc/roc_bert.md b/docs/source/en/model_doc/roc_bert.md index f3797663ff70..023f2690d379 100644 --- a/docs/source/en/model_doc/roc_bert.md +++ b/docs/source/en/model_doc/roc_bert.md @@ -14,11 +14,57 @@ rendered properly in your Markdown viewer. --> +
+
+    PyTorch
+
+
+ # RoCBert -
-PyTorch
-
+[RoCBert] is a BERT model that is specifically designed for the Chinese language. It is built to resist tricks and attacks, like misspellings or similar-looking words, that usually confuse language models. + +You can find all the original [Model name] checkpoints under the [RoCBert](link) collection. + +> [!TIP] +> Click on the RoCBert models in the right sidebar for more examples of how to apply RoCBert to different chinese NLP tasks. + + + + +```py +from transformers import pipeline + +pipeline = pipeline( + task="text-classification", + model="hfl/rocbert-base" +) +pipeline("称呼") #Example Chinese input +``` + + + + +```py +# pip install datasets +import torch +from datasets import load_dataset +from transformers import AutoProcessor, AutoTokenizer + +model_name = "hfl/rocbert-base" +tokenizer = AutoTokenizer.from_pretrained(model_name) +model = AutoModel.from_pretrained(model_name) + +text = "大家好,无论谁正在阅读这篇文章" + +inputs = tokenizer + + +``` + + + + ## Overview From 6d00cd92e6b943377a964f6558516ffb60d2cbfb Mon Sep 17 00:00:00 2001 From: Your Name Date: Fri, 13 Jun 2025 22:52:22 -0400 Subject: [PATCH 13/34] Final Update to the branch --- docs/source/en/model_doc/roc_bert.md | 30 ++++++---------------------- 1 file changed, 6 insertions(+), 24 deletions(-) diff --git a/docs/source/en/model_doc/roc_bert.md b/docs/source/en/model_doc/roc_bert.md index 023f2690d379..a77d5a0b5c08 100644 --- a/docs/source/en/model_doc/roc_bert.md +++ b/docs/source/en/model_doc/roc_bert.md @@ -54,37 +54,19 @@ from transformers import AutoProcessor, AutoTokenizer model_name = "hfl/rocbert-base" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModel.from_pretrained(model_name) - text = "大家好,无论谁正在阅读这篇文章" +``` -inputs = tokenizer - + + +```bash +echo -e "水在零度时会[MASK]" | transformers-cli run --task fill-mask --model junnyu/roformer_chinese_base --device 0 ``` + - - -## Overview - -The RoCBert model was proposed in [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou. -It's a pretrained Chinese language model that is robust under various forms of adversarial attacks. - -The abstract from the paper is the following: - -*Large-scale pretrained language models have achieved SOTA results on NLP tasks. However, they have been shown -vulnerable to adversarial attacks especially for logographic languages like Chinese. In this work, we propose -ROCBERT: a pretrained Chinese Bert that is robust to various forms of adversarial attacks like word perturbation, -synonyms, typos, etc. It is pretrained with the contrastive learning objective which maximizes the label consistency -under different synthesized adversarial examples. The model takes as input multimodal information including the -semantic, phonetic and visual features. We show all these features are important to the model robustness since the -attack can be performed in all the three forms. Across 5 Chinese NLU tasks, ROCBERT outperforms strong baselines under -three blackbox adversarial algorithms without sacrificing the performance on clean testset. It also performs the best -in the toxic content detection task under human-made attacks.* - -This model was contributed by [weiweishi](https://huggingface.co/weiweishi). 
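The multimodal inputs described in the overview above are concrete tensors in the transformers API: alongside the usual token ids, the RoCBert tokenizer also emits visual (glyph shape) and phonetic (pronunciation) id sequences for each character. A short sketch — the key names follow the RoCBert classes in transformers, but verify them against your installed version:

```py
from transformers import RoCBertTokenizer

tokenizer = RoCBertTokenizer.from_pretrained("weiweishi/roc-bert-base-zh")

# One id sequence per modality: semantic (input_ids), visual (input_shape_ids)
# and phonetic (input_pronunciation_ids).
inputs = tokenizer("这家餐厅的拉面很好吃", return_tensors="pt")
print(sorted(inputs.keys()))
# expected to include: attention_mask, input_ids, input_pronunciation_ids,
# input_shape_ids, token_type_ids
```

Attacks that substitute look-alike or same-sounding characters still produce nearby visual or phonetic features, which is what the contrastive pretraining objective exploits.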
- ## Resources - [Text classification task guide](../tasks/sequence_classification) From 19478779986789fae802e3b1c24e13f7ab3722ab Mon Sep 17 00:00:00 2001 From: Your Name Date: Fri, 13 Jun 2025 23:33:59 -0400 Subject: [PATCH 14/34] Adds Quantizaiton to the model --- docs/source/en/model_doc/roc_bert.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/source/en/model_doc/roc_bert.md b/docs/source/en/model_doc/roc_bert.md index a77d5a0b5c08..f9aebfaf5968 100644 --- a/docs/source/en/model_doc/roc_bert.md +++ b/docs/source/en/model_doc/roc_bert.md @@ -67,6 +67,13 @@ echo -e "水在零度时会[MASK]" | transformers-cli run --task fill-mask --mod +Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. + +The example below uses [bitsandbytes](link to quantization method) to only quantize the weights to __. + +```py + +``` ## Resources - [Text classification task guide](../tasks/sequence_classification) From 072770e2d44d20cb523544d7829de7510f095d30 Mon Sep 17 00:00:00 2001 From: Your Name Date: Sun, 15 Jun 2025 20:37:55 -0400 Subject: [PATCH 15/34] Finsihed Fixing the Roc_bert docs --- docs/source/en/model_doc/roc_bert.md | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/roc_bert.md b/docs/source/en/model_doc/roc_bert.md index f9aebfaf5968..466f3b193368 100644 --- a/docs/source/en/model_doc/roc_bert.md +++ b/docs/source/en/model_doc/roc_bert.md @@ -69,11 +69,22 @@ echo -e "水在零度时会[MASK]" | transformers-cli run --task fill-mask --mod Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. -The example below uses [bitsandbytes](link to quantization method) to only quantize the weights to __. +The example below uses [bitsandbytes](https://huggingface.co/docs/bitsandbytes/en/index) to only quantize the weights to 8 bits. ```py +from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig +model_id = "hfl/rocbert-base" # or checkpoint for RoCBERT +bnb_config = BitsAndBytesConfig(load_in_8bit=True) # Or load_in_4bit=True + +model = AutoModelForSequenceClassification.from_pretrained( + model_id, + quantization_config=bnb_config, + device_map="auto" +) +tokenizer = AutoTokenizer.from_pretrained(model_id) ``` + ## Resources - [Text classification task guide](../tasks/sequence_classification) From 9d1ece551082587096c5ea4dceabe5673755eb03 Mon Sep 17 00:00:00 2001 From: Your Name Date: Sun, 15 Jun 2025 21:00:20 -0400 Subject: [PATCH 16/34] Fixed Moshi --- docs/source/en/model_doc/moshi.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md index e70286ebf2e1..9302a9461959 100644 --- a/docs/source/en/model_doc/moshi.md +++ b/docs/source/en/model_doc/moshi.md @@ -16,8 +16,6 @@ rendered properly in your Markdown viewer. # Moshi - -
PyTorch FlashAttention From ce1aabd56bc2adf3f0b27bc2ec7e880f757ad94b Mon Sep 17 00:00:00 2001 From: Your Name Date: Mon, 16 Jun 2025 19:17:06 -0400 Subject: [PATCH 17/34] Fixed Problems --- docs/source/en/model_doc/moonshine.md | 7 +-- docs/source/en/model_doc/roc_bert.md | 74 ++++++++++++--------------- 2 files changed, 35 insertions(+), 46 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index a0cb46ef60a2..43407ae77784 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -85,12 +85,9 @@ transcription[0] +## Resources -- Moonshine improves upon Whisper's architecture: - 1. It uses SwiGLU activation instead of GELU in the decoder layers - 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. - --- A guide for automatic speech recognition can be found [here](../tasks/asr) +- [Automatic speech recognition task guide](../tasks/asr) ## MoonshineConfig diff --git a/docs/source/en/model_doc/roc_bert.md b/docs/source/en/model_doc/roc_bert.md index 466f3b193368..0a3cabdd265a 100644 --- a/docs/source/en/model_doc/roc_bert.md +++ b/docs/source/en/model_doc/roc_bert.md @@ -22,78 +22,70 @@ rendered properly in your Markdown viewer. # RoCBert -[RoCBert] is a BERT model that is specifically designed for the Chinese language. It is built to resist tricks and attacks, like misspellings or similar-looking words, that usually confuse language models. +[RoCBert](https://aclanthology.org/2022.acl-long.65.pdf) is a pretrained Chinese [BERT](./bert) model designed against adversarial attacks like typos and synonyms. It is pretrained with a contrastive learning objective to align normal and adversarial text examples. The examples include different semantic, phonetic, and visual features of Chinese. This makes RoCBert more robust against manipulation. -You can find all the original [Model name] checkpoints under the [RoCBert](link) collection. +You can find all the original RoCBert checkpoints under the [weiweishi](https://huggingface.co/weiweishi) profile. > [!TIP] -> Click on the RoCBert models in the right sidebar for more examples of how to apply RoCBert to different chinese NLP tasks. +> This model was contributed by [weiweishi](https://huggingface.co/weiweishi). +> +> Click on the RoCBert models in the right sidebar for more examples of how to apply RoCBert to different Chinese language tasks. 
```py +import torch from transformers import pipeline pipeline = pipeline( - task="text-classification", - model="hfl/rocbert-base" + task="fill-mask", + model="weiweishi/roc-bert-base-zh", + torch_dtype=torch.float16, + device=0 ) -pipeline("称呼") #Example Chinese input +pipeline("這家餐廳的拉麵是我[MASK]過的最好的拉麵之") ``` ```py -# pip install datasets import torch -from datasets import load_dataset -from transformers import AutoProcessor, AutoTokenizer +from transformers import AutoModelForMaskedLM, AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained( + "weiweishi/roc-bert-base-zh", +) +model = AutoModelForMaskedLM.from_pretrained( + "weiweishi/roc-bert-base-zh", + torch_dtype=torch.float16, + device_map="auto", +) +tokenizer = AutoTokenizer.from_pretrained(model_id) +inputs = tokenizer("這家餐廳的拉麵是我[MASK]過的最好的拉麵之", return_tensors="pt").to("cuda") -model_name = "hfl/rocbert-base" -tokenizer = AutoTokenizer.from_pretrained(model_name) -model = AutoModel.from_pretrained(model_name) -text = "大家好,无论谁正在阅读这篇文章" +with torch.no_grad(): + outputs = model(**inputs) + predictions = outputs.logits + +masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1] +predicted_token_id = predictions[0, masked_index].argmax(dim=-1) +predicted_token = tokenizer.decode(predicted_token_id) + +print(f"The predicted token is: {predicted_token}") ``` ```bash -echo -e "水在零度时会[MASK]" | transformers-cli run --task fill-mask --model junnyu/roformer_chinese_base --device 0 +echo -e "這家餐廳的拉麵是我[MASK]過的最好的拉麵之" | transformers-cli run --task fill-mask --model weiweishi/roc-bert-base-zh --device 0 ``` -Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. - -The example below uses [bitsandbytes](https://huggingface.co/docs/bitsandbytes/en/index) to only quantize the weights to 8 bits. - -```py -from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig - -model_id = "hfl/rocbert-base" # or checkpoint for RoCBERT -bnb_config = BitsAndBytesConfig(load_in_8bit=True) # Or load_in_4bit=True - -model = AutoModelForSequenceClassification.from_pretrained( - model_id, - quantization_config=bnb_config, - device_map="auto" -) -tokenizer = AutoTokenizer.from_pretrained(model_id) -``` - -## Resources - -- [Text classification task guide](../tasks/sequence_classification) -- [Token classification task guide](../tasks/token_classification) -- [Question answering task guide](../tasks/question_answering) -- [Causal language modeling task guide](../tasks/language_modeling) -- [Masked language modeling task guide](../tasks/masked_language_modeling) -- [Multiple choice task guide](../tasks/multiple_choice) - ## RoCBertConfig [[autodoc]] RoCBertConfig From 1a2d93da2ac86cd6fdff7beb49d2ec03fea455ed Mon Sep 17 00:00:00 2001 From: Your Name Date: Mon, 16 Jun 2025 20:39:43 -0400 Subject: [PATCH 18/34] Fixed Problems --- docs/source/en/model_doc/moonshine.md | 93 +++++++-------------------- docs/source/en/model_doc/roc_bert.md | 2 - 2 files changed, 25 insertions(+), 70 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 43407ae77784..daec513c9f35 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -14,76 +14,33 @@ rendered properly in your Markdown viewer. --> -
-
-    PyTorch
-    FlashAttention
-    SDPA
-
+
+PyTorch
+FlashAttention
+SDPA
-# Moonshine - -[Moonshine](https://huggingface.co/papers/2410.15608) is an encoder-decoder speech recognition model optimized for real-time transcription and recognizing voice command. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE) to handle speech with varying lengths without using padding. This improves efficiency during inference, making it ideal for resource-constrained devices. - -You can find all the original Moonshine checkpoints under the [Useful Sensors](https://huggingface.co/UsefulSensors) organization. - -> [!TIP] -> Click on the Moonshine models in the right sidebar for more examples of how to apply Moonshine to different speech recognition tasks. - -The example below demonstrates how to transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class. - - - - -```py -import torch -from transformers import pipeline - -pipeline = pipeline( - task="automatic-speech-recognition", - model="UsefulSensors/moonshine-base", - torch_dtype=torch.float16, - device=0 -) -pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") -``` - - - - -```py -# pip install datasets -import torch -from datasets import load_dataset -from transformers import AutoProcessor, MoonshineForConditionalGeneration - -processor = AutoProcessor.from_pretrained( - "UsefulSensors/moonshine-base", -) -model = MoonshineForConditionalGeneration.from_pretrained( - "UsefulSensors/moonshine-base", - torch_dtype=torch.float16, - device_map="auto", - attn_implementation="sdpa" -).to("cuda") - -ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", split="validation") -audio_sample = ds[0]["audio"] - -input_features = processor( - audio_sample["array"], - sampling_rate=audio_sample["sampling_rate"], - return_tensors="pt" -) -input_features = input_features.to("cuda", dtype=torch.float16) - -predicted_ids = model.generate(**input_features, cache_implementation="static") -transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) -transcription[0] -``` - - +## Overview + +The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands +](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden. + +The abstract from the paper is the following: + +*This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI's Whisper tiny-en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.* + +Tips: + +- Moonshine improves upon Whisper's architecture: + 1. It uses SwiGLU activation instead of GELU in the decoder layers + 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. 
+ +This model was contributed by [Eustache Le Bihan (eustlb)](https://huggingface.co/eustlb). +The original code can be found [here](https://github.com/usefulsensors/moonshine). + +## Resources + +- [Automatic speech recognition task guide](../tasks/asr) ## Resources diff --git a/docs/source/en/model_doc/roc_bert.md b/docs/source/en/model_doc/roc_bert.md index 0a3cabdd265a..91c31c61ef9f 100644 --- a/docs/source/en/model_doc/roc_bert.md +++ b/docs/source/en/model_doc/roc_bert.md @@ -27,8 +27,6 @@ rendered properly in your Markdown viewer. You can find all the original RoCBert checkpoints under the [weiweishi](https://huggingface.co/weiweishi) profile. > [!TIP] -> This model was contributed by [weiweishi](https://huggingface.co/weiweishi). -> > Click on the RoCBert models in the right sidebar for more examples of how to apply RoCBert to different Chinese language tasks. From b94b95d7b0e4f86e5ee4be42ffdce3a47ec05232 Mon Sep 17 00:00:00 2001 From: Your Name Date: Mon, 16 Jun 2025 20:41:55 -0400 Subject: [PATCH 19/34] Fixed Problems --- docs/source/en/model_doc/moonshine.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index daec513c9f35..710d46de696c 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -14,6 +14,8 @@ rendered properly in your Markdown viewer. --> +## Moonshine +
PyTorch FlashAttention From 5edb67903db0dd7c0f694e945283dfe5ec934b37 Mon Sep 17 00:00:00 2001 From: Your Name Date: Mon, 16 Jun 2025 20:42:43 -0400 Subject: [PATCH 20/34] Fixed Problems --- docs/source/en/model_doc/moonshine.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 710d46de696c..c7b46883a433 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -14,7 +14,7 @@ rendered properly in your Markdown viewer. --> -## Moonshine +# Moonshine
PyTorch @@ -62,4 +62,4 @@ The original code can be found [here](https://github.com/usefulsensors/moonshine [[autodoc]] MoonshineForConditionalGeneration - forward - - generate + - generate \ No newline at end of file From e1864a271600b87182753d86851e7098f78e125a Mon Sep 17 00:00:00 2001 From: Your Name Date: Mon, 16 Jun 2025 20:43:45 -0400 Subject: [PATCH 21/34] Fixed Problems --- docs/source/en/model_doc/moonshine.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index c7b46883a433..66af9b998c41 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -62,4 +62,4 @@ The original code can be found [here](https://github.com/usefulsensors/moonshine [[autodoc]] MoonshineForConditionalGeneration - forward - - generate \ No newline at end of file + - generate From ff2b582b8810b8ad24254ed403585259bc90bfd3 Mon Sep 17 00:00:00 2001 From: Your Name Date: Mon, 16 Jun 2025 20:45:32 -0400 Subject: [PATCH 22/34] Fixed Problems --- docs/source/en/model_doc/moonshine.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 66af9b998c41..c7b46883a433 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -62,4 +62,4 @@ The original code can be found [here](https://github.com/usefulsensors/moonshine [[autodoc]] MoonshineForConditionalGeneration - forward - - generate + - generate \ No newline at end of file From 794ed87719af141b3a37c63a6af219b1ee07edc0 Mon Sep 17 00:00:00 2001 From: Your Name Date: Sat, 7 Jun 2025 17:02:02 -0400 Subject: [PATCH 23/34] Added the install to pipline --- docs/source/en/model_doc/moonshine.md | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index c7b46883a433..65f87bab69d9 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -16,13 +16,13 @@ rendered properly in your Markdown viewer. # Moonshine -
-PyTorch
-FlashAttention
-SDPA
-
+[Moonshine](https://huggingface.co/papers/2410.15608) + + + + +You can find all the Moonshine checkpoints on the [Hub](https://huggingface.co/models?search=moonshine). -## Overview The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands ](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden. @@ -44,10 +44,6 @@ The original code can be found [here](https://github.com/usefulsensors/moonshine - [Automatic speech recognition task guide](../tasks/asr) -## Resources - -- [Automatic speech recognition task guide](../tasks/asr) - ## MoonshineConfig [[autodoc]] MoonshineConfig From 113ad251e54d539b02da891d4c6b745d542b3bfb Mon Sep 17 00:00:00 2001 From: Your Name Date: Mon, 9 Jun 2025 22:12:49 -0400 Subject: [PATCH 24/34] updated the monshine model card --- docs/source/en/model_doc/moonshine.md | 65 ++++++++++++++++++++++++--- 1 file changed, 59 insertions(+), 6 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 65f87bab69d9..94ce0e66c986 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -23,6 +23,64 @@ rendered properly in your Markdown viewer. You can find all the Moonshine checkpoints on the [Hub](https://huggingface.co/models?search=moonshine). +> [!TIP] +> Click on the Moonshine models in the right sidebar for more examples of how to apply Moonshine to different speech recognition tasks. + +The example below demonstrates how to generate a transcription based on an audio file with [`Pipeline`] or the [`AutoModel`] class. + + + + + + +```py +# uncomment to install ffmpeg which is needed to decode the audio file +# !brew install ffmpeg + +from transformers import pipeline + +asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-base") + +result = asr("path_to_audio_file") + +#Prints the transcription from the audio file +print(result["text"]) +``` + + + + +```py +# uncomment to install rjieba which is needed for the tokenizer +# !pip install rjieba +import torch +from transformers import AutoModelForMaskedLM, AutoTokenizer + +model = AutoModelForMaskedLM.from_pretrained( + "junnyu/roformer_chinese_base", torch_dtype=torch.float16 +) +tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_base") + +input_ids = tokenizer("水在零度时会[MASK]", return_tensors="pt").to(model.device) +outputs = model(**input_ids) +decoded = tokenizer.batch_decode(outputs.logits.argmax(-1), skip_special_tokens=True) +print(decoded) +``` + + + + +```bash +echo -e "水在零度时会[MASK]" | transformers-cli run --task fill-mask --model junnyu/roformer_chinese_base --device 0 +``` + + + + + + + + The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands ](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden. @@ -37,12 +95,7 @@ Tips: 1. It uses SwiGLU activation instead of GELU in the decoder layers 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. -This model was contributed by [Eustache Le Bihan (eustlb)](https://huggingface.co/eustlb). -The original code can be found [here](https://github.com/usefulsensors/moonshine). 
- -## Resources - -- [Automatic speech recognition task guide](../tasks/asr) +- A guide for automatic speech recognition can be found [here](../tasks/asr) ## MoonshineConfig From feb86002e443a7a0a755048227413b299ddb3261 Mon Sep 17 00:00:00 2001 From: SohamPrabhu <62270341+SohamPrabhu@users.noreply.github.com> Date: Tue, 10 Jun 2025 12:40:23 -0400 Subject: [PATCH 25/34] Update docs/source/en/model_doc/moonshine.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/moonshine.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 94ce0e66c986..70146ffa56da 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -16,9 +16,7 @@ rendered properly in your Markdown viewer. # Moonshine -[Moonshine](https://huggingface.co/papers/2410.15608) - - +[Moonshine](https://huggingface.co/papers/2410.15608) is a speech recognition model that is optimized for real-time transcription and voice command. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE). You can find all the Moonshine checkpoints on the [Hub](https://huggingface.co/models?search=moonshine). From ae85f9216328ee7b213c73feed323e3efd966858 Mon Sep 17 00:00:00 2001 From: SohamPrabhu <62270341+SohamPrabhu@users.noreply.github.com> Date: Tue, 10 Jun 2025 12:41:40 -0400 Subject: [PATCH 26/34] Update docs/source/en/model_doc/moonshine.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/moonshine.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 70146ffa56da..15b9bbf36a58 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -19,7 +19,7 @@ rendered properly in your Markdown viewer. [Moonshine](https://huggingface.co/papers/2410.15608) is a speech recognition model that is optimized for real-time transcription and voice command. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE). -You can find all the Moonshine checkpoints on the [Hub](https://huggingface.co/models?search=moonshine). +You can find all the original Moonshine checkpoints under the [Useful Sensors](https://huggingface.co/UsefulSensors) organization. > [!TIP] > Click on the Moonshine models in the right sidebar for more examples of how to apply Moonshine to different speech recognition tasks. From 67c3c364614dbbd09e47e3dfd9fd8c03d2680a18 Mon Sep 17 00:00:00 2001 From: SohamPrabhu <62270341+SohamPrabhu@users.noreply.github.com> Date: Tue, 10 Jun 2025 12:42:28 -0400 Subject: [PATCH 27/34] Update docs/source/en/model_doc/moonshine.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/moonshine.md | 87 +++++++++++---------------- 1 file changed, 36 insertions(+), 51 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 15b9bbf36a58..f4095c24536f 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -16,15 +16,14 @@ rendered properly in your Markdown viewer. # Moonshine -[Moonshine](https://huggingface.co/papers/2410.15608) is a speech recognition model that is optimized for real-time transcription and voice command. 
Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE). - +[Moonshine](https://huggingface.co/papers/2410.15608) is an encoder-decoder speech recognition model optimized for real-time transcription and recognizing voice command. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE) to handle speech with varying lengths without using padding. This improves efficiency during inference, making it ideal for resource-constrained devices. You can find all the original Moonshine checkpoints under the [Useful Sensors](https://huggingface.co/UsefulSensors) organization. > [!TIP] > Click on the Moonshine models in the right sidebar for more examples of how to apply Moonshine to different speech recognition tasks. -The example below demonstrates how to generate a transcription based on an audio file with [`Pipeline`] or the [`AutoModel`] class. +The example below demonstrates how to transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class. @@ -32,69 +31,54 @@ The example below demonstrates how to generate a transcription based on an audio ```py -# uncomment to install ffmpeg which is needed to decode the audio file -# !brew install ffmpeg - +import torch from transformers import pipeline -asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-base") - -result = asr("path_to_audio_file") - -#Prints the transcription from the audio file -print(result["text"]) +pipeline = pipeline( + task="automatic-speech-recognition", + model="UsefulSensors/moonshine-base", + torch_dtype=torch.float16, + device=0 +) +pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") ``` ```py -# uncomment to install rjieba which is needed for the tokenizer -# !pip install rjieba +# pip install datasets import torch -from transformers import AutoModelForMaskedLM, AutoTokenizer +from datasets import load_dataset +from transformers import AutoProcessor, MoonshineForConditionalGeneration -model = AutoModelForMaskedLM.from_pretrained( - "junnyu/roformer_chinese_base", torch_dtype=torch.float16 +processor = AutoProcessor.from_pretrained( + "UsefulSensors/moonshine-base", ) -tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_base") - -input_ids = tokenizer("水在零度时会[MASK]", return_tensors="pt").to(model.device) -outputs = model(**input_ids) -decoded = tokenizer.batch_decode(outputs.logits.argmax(-1), skip_special_tokens=True) -print(decoded) -``` - - - +model = MoonshineForConditionalGeneration.from_pretrained( + "UsefulSensors/moonshine-base", + torch_dtype=torch.float16, + device_map="auto", + attn_implementation="sdpa" +).to("cuda") + +ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", split="validation") +audio_sample = ds[0]["audio"] + +input_features = processor( + audio_sample["array"], + sampling_rate=audio_sample["sampling_rate"], + return_tensors="pt" +) +input_features = input_features.to("cuda", dtype=torch.float16) -```bash -echo -e "水在零度时会[MASK]" | transformers-cli run --task fill-mask --model junnyu/roformer_chinese_base --device 0 +predicted_ids = model.generate(**input_features, cache_implementation="static") +transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) +transcription[0] ``` - - - - - - -The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands -](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, 
James Wang, Pete Warden. - -The abstract from the paper is the following: - -*This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI's Whisper tiny-en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.* - -Tips: - -- Moonshine improves upon Whisper's architecture: - 1. It uses SwiGLU activation instead of GELU in the decoder layers - 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. - -- A guide for automatic speech recognition can be found [here](../tasks/asr) - ## MoonshineConfig [[autodoc]] MoonshineConfig @@ -109,4 +93,5 @@ Tips: [[autodoc]] MoonshineForConditionalGeneration - forward - - generate \ No newline at end of file + - generate + From 6ead6f869a395348efb2cd8762a610dbd6ded607 Mon Sep 17 00:00:00 2001 From: Your Name Date: Tue, 10 Jun 2025 13:42:40 -0400 Subject: [PATCH 28/34] Updated Documentation According to changes --- docs/source/en/model_doc/moonshine.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index f4095c24536f..e912186fd12a 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -79,6 +79,13 @@ transcription[0] + +- Moonshine improves upon Whisper's architecture: + 1. It uses SwiGLU activation instead of GELU in the decoder layers + 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. + +- A guide for automatic speech recognition can be found [here](../tasks/asr) + ## MoonshineConfig [[autodoc]] MoonshineConfig From bcc72abf4cbf7c0d6b8ad467fb4e83f4ba1bca98 Mon Sep 17 00:00:00 2001 From: Your Name Date: Wed, 11 Jun 2025 19:05:09 -0400 Subject: [PATCH 29/34] Fixed the model with the commits --- docs/source/en/model_doc/moonshine.md | 2 +- docs/source/en/model_doc/moshi.md | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index e912186fd12a..eb233fa53985 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -84,7 +84,7 @@ transcription[0] 1. It uses SwiGLU activation instead of GELU in the decoder layers 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows. 
-- A guide for automatic speech recognition can be found [here](../tasks/asr) +-- A guide for automatic speech recognition can be found [here](../tasks/asr) ## MoonshineConfig diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md index 9302a9461959..e70286ebf2e1 100644 --- a/docs/source/en/model_doc/moshi.md +++ b/docs/source/en/model_doc/moshi.md @@ -16,6 +16,8 @@ rendered properly in your Markdown viewer. # Moshi + +
PyTorch
FlashAttention
SDPA

From 0c027981f18f92a08657cada02cc085402d1e05c Mon Sep 17 00:00:00 2001
From: Your Name
Date: Tue, 17 Jun 2025 12:48:43 -0400
Subject: [PATCH 30/34] Fixed the problems

---
 docs/source/en/model_doc/moonshine.md | 7 ++-----
 docs/source/en/model_doc/roc_bert.md  | 5 +++++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md
index eb233fa53985..0acbc9009dcd 100644
--- a/docs/source/en/model_doc/moonshine.md
+++ b/docs/source/en/model_doc/moonshine.md
@@ -79,12 +79,9 @@ transcription[0]
</hfoption>
</hfoptions>

-
-- Moonshine improves upon Whisper's architecture:
- 1. It uses SwiGLU activation instead of GELU in the decoder layers
- 2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows.
-
--- A guide for automatic speech recognition can be found [here](../tasks/asr)
+## Resources
+
+- [Automatic speech recognition task guide](../tasks/asr)

## MoonshineConfig

diff --git a/docs/source/en/model_doc/roc_bert.md b/docs/source/en/model_doc/roc_bert.md
index 91c31c61ef9f..8214d2acd6d7 100644
--- a/docs/source/en/model_doc/roc_bert.md
+++ b/docs/source/en/model_doc/roc_bert.md
@@ -27,6 +27,11 @@ rendered properly in your Markdown viewer.
You can find all the original RoCBert checkpoints under the [weiweishi](https://huggingface.co/weiweishi) profile.

> [!TIP]
+<<<<<<< HEAD
+=======
+> This model was contributed by [weiweishi](https://huggingface.co/weiweishi).
+>
+>>>>>>> ce1aabd56b (Fixed Problems)
> Click on the RoCBert models in the right sidebar for more examples of how to apply RoCBert to different Chinese language tasks.


From 29edf009dbbf31a0ac51415c43509fb023f7ce29 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Tue, 17 Jun 2025 12:59:46 -0400
Subject: [PATCH 31/34] Final Fix

---
 docs/source/en/model_doc/roc_bert.md | 31 ++++++++++++----------------
 1 file changed, 13 insertions(+), 18 deletions(-)

diff --git a/docs/source/en/model_doc/roc_bert.md b/docs/source/en/model_doc/roc_bert.md
index 8214d2acd6d7..ab24b9303093 100644
--- a/docs/source/en/model_doc/roc_bert.md
+++ b/docs/source/en/model_doc/roc_bert.md
@@ -15,9 +15,9 @@ rendered properly in your Markdown viewer.
-->

-	<div class="flex flex-wrap space-x-1">
-		PyTorch
-	</div>
+    <div class="flex flex-wrap space-x-1">
+        PyTorch
+    </div>
# RoCBert @@ -27,11 +27,6 @@ rendered properly in your Markdown viewer. You can find all the original RoCBert checkpoints under the [weiweishi](https://huggingface.co/weiweishi) profile. > [!TIP] -<<<<<<< HEAD -======= -> This model was contributed by [weiweishi](https://huggingface.co/weiweishi). -> ->>>>>>> ce1aabd56b (Fixed Problems) > Click on the RoCBert models in the right sidebar for more examples of how to apply RoCBert to different Chinese language tasks. @@ -42,10 +37,10 @@ import torch from transformers import pipeline pipeline = pipeline( - task="fill-mask", - model="weiweishi/roc-bert-base-zh", - torch_dtype=torch.float16, - device=0 + task="fill-mask", + model="weiweishi/roc-bert-base-zh", + torch_dtype=torch.float16, + device=0 ) pipeline("這家餐廳的拉麵是我[MASK]過的最好的拉麵之") ``` @@ -58,19 +53,19 @@ import torch from transformers import AutoModelForMaskedLM, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained( - "weiweishi/roc-bert-base-zh", + "weiweishi/roc-bert-base-zh", ) model = AutoModelForMaskedLM.from_pretrained( - "weiweishi/roc-bert-base-zh", - torch_dtype=torch.float16, - device_map="auto", + "weiweishi/roc-bert-base-zh", + torch_dtype=torch.float16, + device_map="auto", ) tokenizer = AutoTokenizer.from_pretrained(model_id) inputs = tokenizer("這家餐廳的拉麵是我[MASK]過的最好的拉麵之", return_tensors="pt").to("cuda") with torch.no_grad(): - outputs = model(**inputs) - predictions = outputs.logits + outputs = model(**inputs) + predictions = outputs.logits masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1] predicted_token_id = predictions[0, masked_index].argmax(dim=-1) From f07135df6551399e6d71372012e0dca2d96816ac Mon Sep 17 00:00:00 2001 From: Your Name Date: Tue, 17 Jun 2025 13:02:34 -0400 Subject: [PATCH 32/34] Final Fix --- docs/source/en/model_doc/moonshine.md | 14 ++++++++------ docs/source/en/model_doc/moshi.md | 2 -- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 0acbc9009dcd..1cf727af5b95 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -16,6 +16,14 @@ rendered properly in your Markdown viewer. # Moonshine +
+<div style="float: right;">
+	<div class="flex flex-wrap space-x-1">
+		PyTorch
+		FlashAttention
+		SDPA
+	</div>
+</div>
+
[Moonshine](https://huggingface.co/papers/2410.15608) is an encoder-decoder speech recognition model optimized for real-time transcription and voice command recognition. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE) to handle speech with varying lengths without using padding. This improves efficiency during inference, making it ideal for resource-constrained devices.

You can find all the original Moonshine checkpoints under the [Useful Sensors](https://huggingface.co/UsefulSensors) organization.
@@ -25,8 +33,6 @@ You can find all the original Moonshine checkpoints under the [Useful Sensors](h

The example below demonstrates how to transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class.

-
-
<hfoptions id="usage">
<hfoption id="Pipeline">

@@ -79,10 +85,6 @@ transcription[0]
</hfoption>
</hfoptions>

-## Resources
-
-- [Automatic speech recognition task guide](../tasks/asr)
-
## MoonshineConfig

[[autodoc]] MoonshineConfig

diff --git a/docs/source/en/model_doc/moshi.md b/docs/source/en/model_doc/moshi.md
index e70286ebf2e1..9302a9461959 100644
--- a/docs/source/en/model_doc/moshi.md
+++ b/docs/source/en/model_doc/moshi.md
@@ -16,8 +16,6 @@ rendered properly in your Markdown viewer.

# Moshi

-
-
<div class="flex flex-wrap space-x-1">
PyTorch FlashAttention From 5665dd15b5eb102f77c34d36b298493df7e2f671 Mon Sep 17 00:00:00 2001 From: Your Name Date: Tue, 17 Jun 2025 13:03:13 -0400 Subject: [PATCH 33/34] Final Fix --- docs/source/en/model_doc/moonshine.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/model_doc/moonshine.md b/docs/source/en/model_doc/moonshine.md index 1cf727af5b95..4cd2eec774d4 100644 --- a/docs/source/en/model_doc/moonshine.md +++ b/docs/source/en/model_doc/moonshine.md @@ -14,8 +14,6 @@ rendered properly in your Markdown viewer. --> -# Moonshine -
PyTorch @@ -24,6 +22,8 @@ rendered properly in your Markdown viewer.

+# Moonshine
+
[Moonshine](https://huggingface.co/papers/2410.15608) is an encoder-decoder speech recognition model optimized for real-time transcription and voice command recognition. Instead of using traditional absolute position embeddings, Moonshine uses Rotary Position Embedding (RoPE) to handle speech with varying lengths without using padding. This improves efficiency during inference, making it ideal for resource-constrained devices.

You can find all the original Moonshine checkpoints under the [Useful Sensors](https://huggingface.co/UsefulSensors) organization.

From 29819272f7a7ab2854341bd274750c7889a80052 Mon Sep 17 00:00:00 2001
From: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Date: Tue, 17 Jun 2025 10:45:41 -0700
Subject: [PATCH 34/34] Update roc_bert.md

---
 docs/source/en/model_doc/roc_bert.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/roc_bert.md b/docs/source/en/model_doc/roc_bert.md
index ab24b9303093..90373085a133 100644
--- a/docs/source/en/model_doc/roc_bert.md
+++ b/docs/source/en/model_doc/roc_bert.md
@@ -27,8 +27,12 @@ rendered properly in your Markdown viewer.
You can find all the original RoCBert checkpoints under the [weiweishi](https://huggingface.co/weiweishi) profile.

> [!TIP]
+> This model was contributed by [weiweishi](https://huggingface.co/weiweishi).
+>
> Click on the RoCBert models in the right sidebar for more examples of how to apply RoCBert to different Chinese language tasks.

+The example below demonstrates how to predict the [MASK] token with [`Pipeline`], [`AutoModel`], and from the command line.
+
<hfoptions id="usage">
<hfoption id="Pipeline">

@@ -60,7 +64,6 @@ model = AutoModelForMaskedLM.from_pretrained(
    torch_dtype=torch.float16,
    device_map="auto",
)
-tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("這家餐廳的拉麵是我[MASK]過的最好的拉麵之", return_tensors="pt").to("cuda")
with torch.no_grad():
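    outputs = model(**inputs)
    predictions = outputs.logits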