# Working with Audio - Whisper (Speech-to-Text) 
# Transcription

## Universal Code Used for the Entire Notebook

Let's set up our libraries and client.

In [None]:
from openai import OpenAI  # Client class for the OpenAI API

In [None]:
# Initialize the OpenAI client
client = OpenAI()  

### Create Transcription
#### Request Body

**file** `file`  **Required**  
The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

**model** `string`  **Required**  
ID of the model to use. Only `whisper-1` (which is powered by OpenAI's open source Whisper V2 model) is currently available.

**language** `string`  **Optional**  
The language of the input audio. Supplying the input language in **ISO-639-1** format will improve accuracy and latency.

**prompt** `string`  **Optional**  
An optional text to guide the model's style or continue a previous audio segment. The **prompt** should match the audio language.

**response_format** `string`  **Optional** Defaults to json  
The format of the transcript output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.

**temperature** `number`  **Optional** Defaults to 0  
The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use **log probability** to automatically increase the temperature until certain thresholds are hit.

**timestamp_granularities[]** `array`  **Optional** Defaults to segment  
The timestamp granularities to populate for this transcription. **response_format** must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. **Note**: Segment timestamps add no latency, but generating word timestamps incurs additional latency.
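
The constraints listed above can be checked locally before a request is sent. The following is a minimal sketch, not part of the OpenAI SDK: `check_transcription_args` is a hypothetical helper, and it only encodes the rules stated in the parameter descriptions (supported file formats, response formats, the 0 to 1 temperature range, and the `verbose_json` requirement for timestamp granularities). How the API itself reacts to each violation is not covered here.

```python
from pathlib import Path

# Values copied from the parameter descriptions above.
SUPPORTED_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
GRANULARITIES = {"word", "segment"}


def check_transcription_args(path, response_format="json", temperature=0.0,
                             timestamp_granularities=None):
    """Hypothetical helper: return a list of warnings (empty if nothing looks wrong)."""
    warnings = []

    # The file must be in one of the documented audio formats.
    extension = Path(path).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        warnings.append(f"unsupported file extension: {extension!r}")

    if response_format not in RESPONSE_FORMATS:
        warnings.append(f"unknown response_format: {response_format!r}")

    # The sampling temperature is documented as lying between 0 and 1.
    if not 0 <= temperature <= 1:
        warnings.append(f"temperature out of range: {temperature}")

    if timestamp_granularities:
        unknown = set(timestamp_granularities) - GRANULARITIES
        if unknown:
            warnings.append(f"unknown timestamp granularities: {sorted(unknown)}")
        # Timestamp granularities are documented as requiring verbose_json.
        if response_format != "verbose_json":
            warnings.append("timestamp_granularities requires response_format='verbose_json'")

    return warnings


print(check_transcription_args("./artifacts/fdr_speech.mp3"))
print(check_transcription_args("notes.txt", response_format="srt",
                               timestamp_granularities=["word"]))
```

The helper only inspects the file name, so it cannot tell whether the file's contents really match its extension.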


In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/fdr_speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)

print(transcript)
print("\n\n")
print(transcript.text)

## All Parameters

### Defaults

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/fdr_speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="json",
    temperature=0.0,
    timestamp_granularities=["segment"],
)

print(transcript)
print("\n\n")
print(transcript.text)

### Response Formats

#### verbose_json with segment timestamp granularity

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/fdr_speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="verbose_json",
    temperature=0.0,
    timestamp_granularities=["segment"],
)

print(transcript)

#### verbose_json
This transcription is a detailed output from a speech-to-text model that transcribed President Franklin D. Roosevelt's speech delivered on December 8, 1941, following the attack on Pearl Harbor. Here’s a breakdown of the key components in the transcription data:

### Key Components

1. **Text**:
   - This is the main body of the transcribed speech text. It captures the entirety of the speech delivered by Roosevelt, describing the events of the Japanese attack on Pearl Harbor, the resulting damage, and the United States' response.

2. **Task**:
   - The task is labeled as `transcribe`, indicating the primary function was to convert spoken words from an audio file into text.

3. **Language**:
   - The transcription was identified as being in `english`.

4. **Duration**:
   - The duration of the audio file is given as `520.4299926757812` seconds, indicating the length of the audio that was transcribed.

5. **Segments**:
   - The transcription is broken down into segments, each of which includes:
     - **ID**: A unique identifier for each segment.
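     - **Seek**: The offset into the audio of the processing window from which this segment was decoded (a frame index in the open source Whisper code). The segment's own position is given by **Start** and **End**.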
     - **Start** and **End**: The timestamps for the beginning and end of each segment.
     - **Text**: The transcribed text for that particular segment.
     - **Tokens**: Encoded representations of the segment text used by the model.
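     - **Temperature**: The sampling temperature used to decode this segment. Because a requested temperature of 0 lets the model raise the temperature automatically, this can be higher than the value passed in the request.
     - **Avg_logprob**: The average log probability of the tokens, indicating the confidence of the transcription. Values closer to 0 mean higher confidence.
     - **Compression_ratio**: How well the segment text compresses. Unusually high values suggest repetitive text, which is a common sign of a failed transcription.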
     - **No_speech_prob**: The probability that no speech was detected in this segment.

### Detailed Explanation

1. **Main Text**:
   - The transcribed text is a historical speech by Roosevelt, informing Congress about the attack on Pearl Harbor by Japan. It details the events, the diplomatic context, and calls for action against Japan.

2. **Task and Language**:
   - The model's task was to transcribe spoken English from an audio file, accurately converting it into written text.

3. **Segments**:
   - Each segment represents a portion of the speech. Breaking down the text into segments helps manage large transcriptions and allows for detailed analysis of each part.

4. **Metadata in Segments**:
   - **ID** and **Seek** help in tracking and locating specific parts of the audio.
   - **Start** and **End** times define the exact timing for each segment in the audio.
   - **Tokens** are used internally by the model to process and generate text.
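   - **Temperature** controls the randomness; a reported value of `0.0` means the segment was decoded at the most deterministic setting, without the automatic temperature increase.
   - **Avg_logprob** provides insight into the model's confidence; log probabilities are at most 0, so values closer to 0 indicate higher confidence and strongly negative values indicate lower confidence.
   - **Compression_ratio** measures how repetitive the segment text is; a high ratio means the text compresses well, which usually points to repeated phrases rather than dense information.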
   - **No_speech_prob** indicates the likelihood that the segment contains no speech, which is useful for identifying silent or non-speech segments.
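
The segment metadata described above can be used to flag segments that deserve a manual check. The sketch below makes several assumptions: the segments are plain dictionaries with the field names listed above (SDK response objects may need attribute access instead), the sample values are invented rather than real API output, and the three thresholds are borrowed from the defaults in the open source Whisper code and are not documented API behavior.

```python
# Hypothetical segments shaped like the `segments` entries described above.
# The numbers are invented for illustration; they are not real API output.
sample_segments = [
    {"id": 0, "start": 0.0, "end": 7.5, "text": " Mr. Vice President, Mr. Speaker,",
     "avg_logprob": -0.21, "compression_ratio": 1.4, "no_speech_prob": 0.02},
    {"id": 1, "start": 7.5, "end": 12.0, "text": " (hypothetical uncertain text)",
     "avg_logprob": -1.35, "compression_ratio": 1.1, "no_speech_prob": 0.10},
    {"id": 2, "start": 12.0, "end": 15.0, "text": "",
     "avg_logprob": -0.50, "compression_ratio": 0.8, "no_speech_prob": 0.91},
]

# Assumed thresholds, taken from the open source Whisper defaults.
LOGPROB_THRESHOLD = -1.0           # below this, confidence is considered low
NO_SPEECH_THRESHOLD = 0.6          # above this, the segment is probably silence
COMPRESSION_RATIO_THRESHOLD = 2.4  # above this, the text is probably repetitive


def flag_segment(segment):
    """Return the reasons (possibly none) why a segment looks unreliable."""
    reasons = []
    if segment["avg_logprob"] < LOGPROB_THRESHOLD:
        reasons.append("low confidence")
    if segment["no_speech_prob"] > NO_SPEECH_THRESHOLD:
        reasons.append("probably no speech")
    if segment["compression_ratio"] > COMPRESSION_RATIO_THRESHOLD:
        reasons.append("repetitive text")
    return reasons


for segment in sample_segments:
    reasons = flag_segment(segment)
    status = ", ".join(reasons) if reasons else "ok"
    print(f"segment {segment['id']} [{segment['start']:.1f}s to {segment['end']:.1f}s]: {status}")
```

The thresholds are a starting point only; sensible values depend on the audio quality and on how costly a missed error is.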



#### verbose_json with word timestamp granularity

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/fdr_speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="verbose_json",
    temperature=0.0,
    timestamp_granularities=["word"],
)

print(transcript)

#### Comparing segment and word timestamp output

1. **Segments vs. Words Metadata**:
   - **First Transcription**: Contains metadata broken into `segments`, each with attributes like `id`, `seek`, `start`, `end`, `text`, `tokens`, `temperature`, `avg_logprob`, `compression_ratio`, and `no_speech_prob`.
   - **Second Transcription**: Contains metadata broken into individual `words`, each with `word`, `start`, and `end` attributes.

2. **Segment Attributes (First Transcription)**:
   - **id**: Unique identifier for each segment.
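   - **seek**: The offset into the audio of the processing window from which the segment was decoded; the segment's own position is given by **start** and **end**.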
   - **start** and **end**: Timestamps for the beginning and end of each segment.
   - **text**: The transcribed text for the segment.
   - **tokens**: Encoded representations of the segment text used by the model.
   - **temperature**: The sampling temperature used to decode the segment, which affects randomness in text generation.
   - **avg_logprob**: The average log probability of the tokens, indicating the confidence of the transcription. Values closer to 0 mean higher confidence.
   - **compression_ratio**: How well the segment text compresses. Unusually high values suggest repetitive text.
   - **no_speech_prob**: Probability that no speech was detected in the segment.

3. **Word Attributes (Second Transcription)**:
   - **word**: The individual word transcribed.
   - **start** and **end**: Timestamps for the start and end of each word.
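
Word-level entries are often regrouped into short lines, for example for captions. The following is a minimal sketch under stated assumptions: the words are plain dictionaries with the `word`, `start`, and `end` attributes described above, the sample timings are invented, and `group_words` with its `max_gap` and `max_words` parameters is a hypothetical helper rather than anything provided by the API.

```python
# Hypothetical word entries with invented timings (in seconds).
sample_words = [
    {"word": "Yesterday", "start": 0.0, "end": 0.6},
    {"word": "December", "start": 0.7, "end": 1.2},
    {"word": "7th", "start": 1.25, "end": 1.7},
    {"word": "a", "start": 2.9, "end": 3.0},
    {"word": "date", "start": 3.05, "end": 3.4},
]


def group_words(words, max_gap=0.6, max_words=8):
    """Group words into lines, breaking after a long pause or when a line is full."""
    lines = []
    current = []
    for entry in words:
        if current:
            pause = entry["start"] - current[-1]["end"]
            if pause > max_gap or len(current) >= max_words:
                lines.append(current)
                current = []
        current.append(entry)
    if current:
        lines.append(current)

    # Collapse each group into one record spanning its first and last word.
    return [
        {
            "start": line[0]["start"],
            "end": line[-1]["end"],
            "text": " ".join(entry["word"] for entry in line),
        }
        for line in lines
    ]


for line in group_words(sample_words):
    print(f"{line['start']:.2f} to {line['end']:.2f}: {line['text']}")
```

Breaking on pauses tends to keep phrases together, while `max_words` keeps a line from growing too long when the speaker does not pause.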


## Other Response Formats

#### Text

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/fdr_speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="text",
    temperature=0.0,
)

print(transcript)

#### srt (SubRip Subtitle)

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/fdr_speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="srt",
    temperature=0.0,
)

print(transcript)
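
An `srt` transcript is plain text made of numbered cues, each with a `start --> end` timestamp line followed by the cue text. The sketch below parses such text into Python dictionaries. It assumes the standard SubRip layout; `sample_srt` is a hand-written illustration with invented timings, not output captured from the API, and `parse_srt` is a hypothetical helper.

```python
import re

# SubRip timestamps look like HH:MM:SS,mmm
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")

# Hand-written sample with invented timings, for illustration only.
sample_srt = """1
00:00:00,000 --> 00:00:07,500
Mr. Vice President, Mr. Speaker,

2
00:00:07,500 --> 00:00:12,250
members of the Senate and the House of Representatives,
"""


def to_seconds(stamp):
    """Convert an HH:MM:SS,mmm timestamp to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text):
    """Parse SRT text into a list of cue dictionaries."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip blocks without an index, a timing line, and text
        start, end = lines[1].split("-->")
        cues.append({
            "index": int(lines[0]),
            "start": to_seconds(start),
            "end": to_seconds(end),
            "text": " ".join(lines[2:]),
        })
    return cues


for cue in parse_srt(sample_srt):
    print(cue)
```

With a real response, the string returned for `response_format="srt"` would take the place of `sample_srt`.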

#### vtt (Web Video Text Tracks)

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/fdr_speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="vtt",
    temperature=0.0,
)

print(transcript)

## Prompting

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/acronym_audio.mp4", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="json",
    temperature=0.0,
)

print(transcript.text)

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/acronym_audio.mp4", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="GPT-4o, GPT-4o mini",
    response_format="json",
    temperature=0.0,
)

print(transcript.text)

## Temperature

#### Dynamic Temperature

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/temperature_audio_test.mp4", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="json",
    temperature=0.0,
)

print(transcript.text)

#### High Temperature

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/temperature_audio_test.mp4", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="json",
    temperature=0.8,  # a high value makes the output more random
)

print(transcript.text)

#### Low Temperature

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/temperature_audio_test.mp4", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="json",
    temperature=0.1,
)

print(transcript.text)

## Prompting with Temperature

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/temperature_audio_test.mp4", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Umm, let me think like, hmm... Okay, here's what I'm, like, thinking.",
    response_format="json",
    temperature=0.0,
)

print(transcript.text)

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/temperature_audio_test.mp4", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Umm, I think we should ahhh, do this thing.",
    response_format="json",
    temperature=0.9,
)

print(transcript.text)

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/temperature_audio_test.mp4", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Let us discuss the topics of the day.",
    response_format="json",
    temperature=0.9,
)

print(transcript.text)

## Passing the Output

In [None]:
# Create a transcription of the audio file
audio_file = open("./artifacts/fdr_speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="Transcribe the following audio file.",
    response_format="text",
    temperature=0.0,
)

print(transcript)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=1,
    messages=[
        {
            "role": "system",
            "content": "You will be given a transcript of an audio file. Your task is to summarize the text and tell me who is speaking."
        },
        {
            "role": "user",
            "content": transcript
        }
    ]
)

print(response.choices[0].message.content)

Mr. Vice President, Mr. Speaker, members of the Senate and the House of Representatives, yesterday, December 7th, 1941, a date which will live in infamy, the United States of America was suddenly and deliberately attacked by naval and air forces of the Empire of Japan. The United States was at peace with that nation, and at the solicitation of Japan, was still in conversation with its government and its emperor, looking toward the maintenance of peace in the Pacific. Indeed, one hour after Japanese air squadrons had commenced bombing in the American island of Oahu, the Japanese ambassador to the United States and his colleagues delivered to our Secretary of State a formal reply to a recent American message. While this reply stated that it seemed useless to continue the existing diplomatic negotiations, it contained no threat or hint of war or of armed attack. It will be recorded that the distance of Hawaii from Japan makes it obvious that the attack was deliberately planned many days o