This command fails:
time mlx_audio.tts.generate --file "audio1" --text "The Memex allows a human user to do more conveniently (less energy, more quickly) what he could have done with relatively ordinary photographic equipment and filing systems, but he would have had to spend so much time in the lower-level processes of manipulation that his mental time constants of memory and patience would have rendered the system unusable in the detailed and intimate sense which Bush illustrates." --model mlx-community/Dia-1.6B --ref_audio audio-dia/male-dangerous-phrase.wav --ref_text "[S1]The most dangerous phrase in the language is: We've always done it this way." --speed 0.9
This shorter sentence also fails:
time mlx_audio.tts.generate --file "audio2" --text "[S1] The associative trails whose establishment and use within the files he describes at some length provide a beautiful example of a new capsbility in symbol structuring that derives from new artifactprocess capability." --model mlx-community/Dia-1.6B --ref_audio audio-dia/male-dangerous-phrase.wav --ref_text "[S1]The most dangerous phrase in the language is: We've always done it this way." --speed 0.9
Error loading model:
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.10/site-packages/mlx_audio/tts/generate.py", line 319, in generate_audio
    for i, result in enumerate(results):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.10/site-packages/mlx_audio/tts/models/dia/dia.py", line 256, in generate
    audio, token_count = self._generate(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.10/site-packages/mlx_audio/tts/models/dia/dia.py", line 484, in _generate
    logits_Bx1xCxV = decode_step(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.10/site-packages/mlx_audio/tts/models/dia/layers.py", line 727, in decode_step
    x = layer(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.10/site-packages/mlx_audio/tts/models/dia/layers.py", line 595, in __call__
    sa_out = self.self_attention(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.10/site-packages/mlx_audio/tts/models/dia/layers.py", line 372, in __call__
    attn_k, attn_v = cache.update_and_fetch(Xk_BxNxSxH, Xv_BxNxSxH)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.10/site-packages/mlx_audio/tts/models/dia/layers.py", line 195, in update_and_fetch
    assert self.current_idx < self.max_len
AssertionError
Splitting the shorter sentence into two commands works (the outputs are 6 seconds and 4 seconds long):
time mlx_audio.tts.generate --file "audio2" --text "[S1] The associative trails whose establishment and use within the files he describes at some length provide a beautiful example" --model mlx-community/Dia-1.6B --ref_audio audio-dia/male-dangerous-phrase.wav --ref_text "[S1]The most dangerous phrase in the language is: We've always done it this way." --speed 0.9
time mlx_audio.tts.generate --file "audio2_001" --text "[S1] of a new capsbility in symbol structuring that derives from new artifactprocess capability." --model mlx-community/Dia-1.6B --ref_audio audio-dia/male-dangerous-phrase.wav --ref_text "[S1]The most dangerous phrase in the language is: We've always done it this way." --speed 0.9
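Until the underlying limit is fixed, the manual split above could be automated. The following is only a rough sketch of that workaround, not part of mlx_audio: `split_text`, `build_commands`, and the `max_words=20` threshold are hypothetical names and values chosen for illustration (the 19-word chunk above happened to work; the real limit is unknown and probably depends on the length of the generated audio rather than on word count). The command line reuses only the flags and values already shown in this report.

```python
import shlex

# Reference text copied from the commands in this report.
REF_TEXT = ("[S1]The most dangerous phrase in the language is: "
            "We've always done it this way.")


def split_text(text, max_words=20, speaker_tag="[S1]"):
    """Split text into chunks of at most max_words words.

    Each chunk is prefixed with speaker_tag. max_words=20 is an
    assumption for illustration, not a documented limit.
    """
    body = text.strip()
    if body.startswith(speaker_tag):
        body = body[len(speaker_tag):]
    words = body.split()
    return [
        f"{speaker_tag} " + " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def build_commands(text, file_prefix, max_words=20):
    """Return one shell command string per chunk, using the flags from this report."""
    commands = []
    for i, chunk in enumerate(split_text(text, max_words)):
        cmd = [
            "mlx_audio.tts.generate",
            "--file", f"{file_prefix}_{i:03d}",
            "--text", chunk,
            "--model", "mlx-community/Dia-1.6B",
            "--ref_audio", "audio-dia/male-dangerous-phrase.wav",
            "--ref_text", REF_TEXT,
            "--speed", "0.9",
        ]
        # shlex.join quotes each argument so the line can be pasted into a shell.
        commands.append(shlex.join(cmd))
    return commands


if __name__ == "__main__":
    sentence = ("[S1] The associative trails whose establishment and use within "
                "the files he describes at some length provide a beautiful "
                "example of a new capsbility in symbol structuring that derives "
                "from new artifactprocess capability.")
    for line in build_commands(sentence, "audio2"):
        print(line)
```

Splitting on a fixed word count can cut a sentence mid-phrase, so the joined audio may sound less natural than a split at punctuation. This only sidesteps the assertion and does not address why the cache runs out of room.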