Group the Speech and the Transcription docs under common Audio parent.
tzolov committed Mar 21, 2024
1 parent b3f2516 commit b8f773c
Showing 6 changed files with 282 additions and 10 deletions.
9 changes: 5 additions & 4 deletions spring-ai-docs/src/main/antora/modules/ROOT/nav.adoc
@@ -38,10 +38,11 @@
** xref:api/imageclient.adoc[]
*** xref:api/image/openai-image.adoc[OpenAI]
*** xref:api/image/stabilityai-image.adoc[Stability]
-** xref:api/transcriptions.adoc[]
-*** xref:api/transcriptions/openai-transcriptions.adoc[OpenAI]
-** xref:api/speech.adoc[]
-*** xref:api/speech/openai-speech.adoc[OpenAI]
+** xref:api/audio[Audio API]
+*** xref:api/audio/transcriptions.adoc[]
+**** xref:api/audio/transcriptions/openai-transcriptions.adoc[OpenAI]
+*** xref:api/audio/speech.adoc[]
+**** xref:api/audio/speech/openai-speech.adoc[OpenAI]
** xref:api/vectordbs.adoc[]
*** xref:api/vectordbs/azure.adoc[]
*** xref:api/vectordbs/chroma.adoc[]
@@ -0,0 +1,5 @@
[[Speech]]
= Text-To-Speech (TTS) API

Spring AI provides support for OpenAI's Speech API.
When additional providers for Speech are implemented, a common `SpeechClient` and `StreamingSpeechClient` interface will be extracted.
@@ -0,0 +1,144 @@
= OpenAI Text-to-Speech (TTS) Integration

== Introduction

The Audio API provides a speech endpoint based on OpenAI's TTS (text-to-speech) model, enabling users to:

- Narrate a written blog post.
- Produce spoken audio in multiple languages.
- Stream real-time audio output.

== Prerequisites

. Create an OpenAI account and obtain an API key. You can sign up at the https://platform.openai.com/signup[OpenAI signup page] and generate an API key on the https://platform.openai.com/account/api-keys[API Keys page].
. Add the `spring-ai-openai` dependency to your project's build file. For more information, refer to the xref:getting-started.adoc#dependency-management[Dependency Management] section.

== Auto-configuration

Spring AI provides Spring Boot auto-configuration for the OpenAI Text-to-Speech Client.
To enable it, add the following dependency to your project's Maven `pom.xml` file:

[source,xml]
----
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
----

or to your Gradle `build.gradle` build file:

[source,groovy]
----
dependencies {
    implementation 'org.springframework.ai:spring-ai-openai-spring-boot-starter'
}
----

TIP: Refer to the xref:getting-started.adoc#dependency-management[Dependency Management] section to add the Spring AI BOM to your build file.
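
With the starter on the classpath, the speech client can be injected into your own components. Below is a minimal sketch; the `NarrationService` class is hypothetical, and it assumes the `OpenAiAudioSpeechClient` API demonstrated later on this page:

[source,java]
----
@Service
public class NarrationService {

    private final OpenAiAudioSpeechClient speechClient;

    public NarrationService(OpenAiAudioSpeechClient speechClient) {
        this.speechClient = speechClient;
    }

    // Convert the given text to spoken audio and return the raw audio bytes
    public byte[] narrate(String text) {
        SpeechPrompt prompt = new SpeechPrompt(text,
                OpenAiAudioSpeechOptions.builder().withModel("tts-1").build());
        return this.speechClient.call(prompt).getResult().getOutput();
    }
}
----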

=== TTS Properties

The prefix `spring.ai.openai.audio.speech` lets you configure the OpenAI Text-to-Speech client.

[cols="3,5,2"]
|====
| Property | Description | Default

| spring.ai.openai.audio.speech.options.model | ID of the model to use. Only tts-1 is currently available. | tts-1
| spring.ai.openai.audio.speech.options.voice | The voice to use for the TTS output. Available options are: alloy, echo, fable, onyx, nova, and shimmer. | alloy
| spring.ai.openai.audio.speech.options.response-format | The format of the audio output. Supported formats are mp3, opus, aac, flac, wav, and pcm. | mp3
| spring.ai.openai.audio.speech.options.speed | The speed of the voice synthesis. The acceptable range is from 0.25 (slowest) to 4.0 (fastest). | 1.0
|====
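
For example, to pick a different voice and a slower delivery you might set the following in `application.properties` (the values shown are illustrative):

[source,properties]
----
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.audio.speech.options.voice=nova
spring.ai.openai.audio.speech.options.response-format=mp3
spring.ai.openai.audio.speech.options.speed=0.75
----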

== Runtime Options [[speech-options]]

The `OpenAiAudioSpeechOptions` class provides the options to use when making a text-to-speech request.
On start-up, the options configured via `spring.ai.openai.audio.speech` are used, but you can override them at runtime.

For example:

[source,java]
----
OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
    .withModel("tts-1")
    .withVoice(OpenAiAudioApi.SpeechRequest.Voice.ALLOY)
    .withResponseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
    .withSpeed(1.0f)
    .build();

SpeechPrompt speechPrompt = new SpeechPrompt("Hello, this is a text-to-speech example.", speechOptions);
SpeechResponse response = openAiAudioSpeechClient.call(speechPrompt);
----

== Manual Configuration

Add the `spring-ai-openai` dependency to your project's Maven `pom.xml` file:

[source,xml]
----
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-openai</artifactId>
</dependency>
----

or to your Gradle `build.gradle` build file:

[source,groovy]
----
dependencies {
implementation 'org.springframework.ai:spring-ai-openai'
}
----

TIP: Refer to the xref:getting-started.adoc#dependency-management[Dependency Management] section to add the Spring AI BOM to your build file.

Next, create an `OpenAiAudioSpeechClient`:

[source,java]
----
var openAiAudioApi = new OpenAiAudioApi(System.getenv("OPENAI_API_KEY"));
var openAiAudioSpeechClient = new OpenAiAudioSpeechClient(openAiAudioApi);

var speechOptions = OpenAiAudioSpeechOptions.builder()
    .withResponseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
    .withSpeed(1.0f)
    .withModel(OpenAiAudioApi.TtsModel.TTS_1.value)
    .build();

var speechPrompt = new SpeechPrompt("Hello, this is a text-to-speech example.", speechOptions);
SpeechResponse response = openAiAudioSpeechClient.call(speechPrompt);

// Accessing metadata (rate limit info)
OpenAiAudioSpeechResponseMetadata metadata = response.getMetadata();

byte[] responseAsBytes = response.getResult().getOutput();
----
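
The returned bytes are the encoded audio (MP3 in the example above), so they can be written straight to a file. A minimal sketch using standard `java.nio` (the output path is illustrative):

[source,java]
----
// Persist the synthesized audio; the bytes are already MP3-encoded.
// Files.write declares IOException, so handle or declare it in real code.
Files.write(Paths.get("speech.mp3"), responseAsBytes);
----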

== Streaming Real-time Audio

The Speech API supports real-time audio streaming using chunked transfer encoding, so the audio can be played back before the full file has been generated.

[source,java]
----
var openAiAudioApi = new OpenAiAudioApi(System.getenv("OPENAI_API_KEY"));
var openAiAudioSpeechClient = new OpenAiAudioSpeechClient(openAiAudioApi);
OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
.withVoice(OpenAiAudioApi.SpeechRequest.Voice.ALLOY)
.withSpeed(1.0f)
.withResponseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
.withModel(OpenAiAudioApi.TtsModel.TTS_1.value)
.build();
SpeechPrompt speechPrompt = new SpeechPrompt("Today is a wonderful day to build something people love!", speechOptions);
Flux<SpeechResponse> responseStream = openAiAudioSpeechClient.stream(speechPrompt);
----
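
Each element of the `Flux` carries one chunk of the audio. A sketch of consuming the stream, assuming the same `getResult().getOutput()` accessor used in the blocking example above:

[source,java]
----
// Collect the audio chunks as they arrive; a real player could begin
// playback as soon as the first chunk is available.
ByteArrayOutputStream audio = new ByteArrayOutputStream();
responseStream
    .map(chunk -> chunk.getResult().getOutput())
    .doOnNext(bytes -> audio.write(bytes, 0, bytes.length))
    .blockLast();
byte[] fullAudio = audio.toByteArray();
----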

== Example Code

* The link:https://github.com/spring-projects/spring-ai/blob/main/models/spring-ai-openai/src/test/java/org/springframework/ai/openai/audio/speech/OpenAiSpeechClientIT.java[OpenAiSpeechClientIT.java] test provides some general examples of how to use the library.
@@ -0,0 +1,5 @@
[[Transcription]]
= Transcription API

Spring AI provides support for OpenAI's Transcription API.
When additional providers for Transcription are implemented, a common `AudioTranscriptionClient` interface will be extracted.
@@ -0,0 +1,118 @@
= OpenAI Transcriptions

Spring AI supports https://platform.openai.com/docs/api-reference/audio/createTranscription[OpenAI's Transcription model].

== Prerequisites


You will need to create an API key with OpenAI to access its models.
Create an account at https://platform.openai.com/signup[OpenAI signup page] and generate the token on the https://platform.openai.com/account/api-keys[API Keys page].
The Spring AI project defines a configuration property named `spring.ai.openai.api-key` that you should set to the value of the `API Key` obtained from openai.com.
Exporting an environment variable is one way to set that configuration property:
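
[source,shell]
----
export SPRING_AI_OPENAI_API_KEY=<INSERT KEY HERE>
----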


== Auto-configuration

Spring AI provides Spring Boot auto-configuration for the OpenAI Transcription Client.
To enable it, add the following dependency to your project's Maven `pom.xml` file:

[source,xml]
----
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
----

or to your Gradle `build.gradle` build file:

[source,groovy]
----
dependencies {
    implementation 'org.springframework.ai:spring-ai-openai-spring-boot-starter'
}
----

TIP: Refer to the xref:getting-started.adoc#dependency-management[Dependency Management] section to add the Spring AI BOM to your build file.

=== Transcription Properties

The prefix `spring.ai.openai.audio.transcription` lets you configure the OpenAI Transcription client.

[cols="3,5,2"]
|====
| Property | Description | Default

| spring.ai.openai.audio.transcription.options.model | ID of the model to use. Only whisper-1 (which is powered by OpenAI's open source Whisper V2 model) is currently available. | whisper-1
| spring.ai.openai.audio.transcription.options.response-format | The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt. | json
| spring.ai.openai.audio.transcription.options.prompt | An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. |
| spring.ai.openai.audio.transcription.options.language | The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency. |
| spring.ai.openai.audio.transcription.options.temperature | The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit. | 0
| spring.ai.openai.audio.transcription.options.timestamp_granularities | The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word and segment. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency. | segment
|====
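
For example, word-level timestamps require the verbose_json response format, so the two options must be set together; an illustrative `application.properties` snippet:

[source,properties]
----
spring.ai.openai.audio.transcription.options.response-format=verbose_json
spring.ai.openai.audio.transcription.options.timestamp_granularities=word
----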

== Runtime Options [[transcription-options]]

The `OpenAiAudioTranscriptionOptions` class provides the options to use when making a transcription.
On start-up, the options configured via `spring.ai.openai.audio.transcription` are used, but you can override them at runtime.

For example:

[source,java]
----
OpenAiAudioApi.TranscriptResponseFormat responseFormat = OpenAiAudioApi.TranscriptResponseFormat.VTT;

OpenAiAudioTranscriptionOptions transcriptionOptions = OpenAiAudioTranscriptionOptions.builder()
    .withLanguage("en")
    .withPrompt("Ask not this, but ask that")
    .withTemperature(0f)
    .withResponseFormat(responseFormat)
    .build();

// audioFile is a Spring Resource pointing at the audio to transcribe,
// e.g. a FileSystemResource (see the Manual Configuration section below)
AudioTranscriptionPrompt transcriptionRequest = new AudioTranscriptionPrompt(audioFile, transcriptionOptions);
AudioTranscriptionResponse response = openAiTranscriptionClient.call(transcriptionRequest);
----

== Manual Configuration

Add the `spring-ai-openai` dependency to your project's Maven `pom.xml` file:

[source,xml]
----
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai</artifactId>
</dependency>
----

or to your Gradle `build.gradle` build file:

[source,groovy]
----
dependencies {
    implementation 'org.springframework.ai:spring-ai-openai'
}
----

TIP: Refer to the xref:getting-started.adoc#dependency-management[Dependency Management] section to add the Spring AI BOM to your build file.

Next, create an `OpenAiAudioTranscriptionClient`:

[source,java]
----
var openAiAudioApi = new OpenAiAudioApi(System.getenv("OPENAI_API_KEY"));
var openAiAudioTranscriptionClient = new OpenAiAudioTranscriptionClient(openAiAudioApi);

var transcriptionOptions = OpenAiAudioTranscriptionOptions.builder()
    .withResponseFormat(TranscriptResponseFormat.TEXT)
    .withTemperature(0f)
    .build();

var audioFile = new FileSystemResource("/path/to/your/resource/speech/jfk.flac");

AudioTranscriptionPrompt transcriptionRequest = new AudioTranscriptionPrompt(audioFile, transcriptionOptions);
AudioTranscriptionResponse response = openAiAudioTranscriptionClient.call(transcriptionRequest);
----
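
The transcript itself can then be read from the response; assuming the accessor pattern used elsewhere on this page, `response.getResult().getOutput()` returns the transcribed text.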

== Example Code

* The link:https://github.com/spring-projects/spring-ai/blob/main/models/spring-ai-openai/src/test/java/org/springframework/ai/openai/audio/transcription/OpenAiTranscriptionClientIT.java[OpenAiTranscriptionClientIT.java] test provides some general examples of how to use the library.
@@ -158,13 +158,12 @@ Each of the following sections in the documentation shows which dependencies you
** xref:api/image/openai-image.adoc[OpenAI Image Generation]
** xref:api/image/stabilityai-image.adoc[StabilityAI Image Generation]

-=== Transcription Models
-* xref:api/transcriptions.adoc[]
-** xref:api/transcriptions/openai-transcriptions.adoc[OpenAI Transcriptions]
+=== Audio Models

-=== Text-To-Speech (TTS) Models
-* xref:api/speech.adoc[]
-** xref:api/speech/openai-speech.adoc[OpenAI Text-To-Speech]
+* xref:api/audio/transcriptions.adoc[Transcription Models]
+** xref:api/audio/transcriptions/openai-transcriptions.adoc[OpenAI Transcriptions]
+* xref:api/audio/speech.adoc[Text-To-Speech (TTS) Models]
+** xref:api/audio/speech/openai-speech.adoc[OpenAI Text-To-Speech]

=== Vector Databases
* xref:api/vectordbs.adoc[Vector Database API]
