# Speech-to-Text and Summarization Workflow

In this notebook, we will explore a practical workflow for converting speech to text and summarizing the generated transcripts using HuggingFace models. This process combines automatic speech recognition (ASR) and text summarization, demonstrating how Generative AI can handle audio-to-text workflows efficiently.

### Outline

In this walkthrough, we will:

1. **Read in an Audio File:** Load an audio file for transcription. We recommend using a publicly available audio sample, such as one from [Open Speech and Language Resources](https://www.openslr.org/12/). HuggingFace also provides some common benchmarks in its `datasets` package.
2. **Stand Up an Automatic Speech Recognition (ASR) Pipeline:** Use HuggingFace's `automatic-speech-recognition` pipeline for transcription.
3. **Generate and Save the Transcript:** Transcribe the audio file and save the output as a text file.
4. **Read and Explore the Transcript:** Load the transcript, read a sample, and prepare it for summarization.
5. **Summarize the Transcript:** Stand up a text-generation pipeline with the `tiiuae/Falcon3-1B-Instruct` model and summarize the transcript. Feel free to change the models to see how the output quality varies.
6. **Evaluate the Summary:** Compute the BERTScore to evaluate the quality of the generated summary.

By the end of this notebook, you'll learn how to integrate ASR and summarization into an efficient workflow.

## Configure the Environment

In [1]:
! pip install transformers
! pip install datasets
! pip install bert-score
! pip install soundfile


## Read in an Audio File

__Prompt__: Provide Python code to read in an audio file `"apple.mp3"` using the `soundfile` package, and play the file within a Jupyter notebook.

In [5]:
import soundfile as sf

audio, sampling_rate = sf.read("/content/apple.mp3")

In [6]:
from IPython.display import Audio
# Play the audio
Audio(data=audio, rate=sampling_rate)

## Stand Up an Automatic Speech Recognition (ASR) Pipeline

__Prompt__: Provide Python code to set up an ASR pipeline using HuggingFace.


In [9]:
from transformers import pipeline

# Initialize the ASR pipeline
asr_pipeline = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


## Generate and Save the Transcript

__Prompt__: Provide Python code to transcribe the audio file read in by this code: `audio, sampling_rate = sf.read("2902-9008-0000.flac")`

In [10]:
# Transcribe the audio file
# (a raw array is assumed to be mono and already at the model's 16 kHz sampling rate)
transcription = asr_pipeline(audio)["text"]

print(transcription)


ORANGES ARE SITROUS FRUITS KNOWN FOR THEIR VIBRANT COLOR AND REFRESHING FLAVOUR THEIR PACKED WITH VITEM AND SEA WHICH BOOSGERMIAN SYSTEM OFTEN EATEN FRESHER JUISED THEYARE ALSO USED IN DESERTS AND MARINADES THEIRE PEEL CONTAINS FRAGRANT OILS AND ZEST THAT'S GREAT FOR COOKING SWEET TANGY AND JUICY ORANGES ARE PURE SUNSHINE AN FRUIT FORM
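The transcript above contains visible recognition errors (for example `SITROUS` and `VITEM AND SEA`). One common way to quantify them is the word error rate (WER): the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal pure-Python version. The function name `word_error_rate` and the reference sentence are assumptions made for illustration, since this notebook does not include ground-truth transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # At the start of iteration i, previous[j] is the edit distance
    # between ref[:i-1] and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical reference text (assumed for illustration, not taken from the dataset)
reference = "oranges are citrus fruits known for their vibrant color"
hypothesis = "ORANGES ARE SITROUS FRUITS KNOWN FOR THEIR VIBRANT COLOR"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A lower WER means a closer match. Comparing this value across ASR models is one way to decide whether a larger model is worth the extra compute.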


In [6]:
# unzipping the fruits.zip into folder fruits
!unzip "/content/fruits.zip" -d fruits

Archive:  /content/fruits.zip
  inflating: fruits/apple.mp3        
  inflating: fruits/grapes.mp3       
  inflating: fruits/mango.mp3        
  inflating: fruits/oranges.mp3      


__Prompt__: Provide Python code to transcribe all `.mp3` files in the current directory, append their transcriptions, and save them to a single text file.

In [11]:
import os

# Initialize the transcript file
transcript_file = "transcript.txt"

# Open the transcript file for writing
with open(transcript_file, "w") as f:
    # Iterate through all .mp3 files in the extracted fruits directory
    for file_name in os.listdir("/content/fruits/"):
        if file_name.endswith(".mp3"):
            print(f"Processing: {file_name}")
            # Transcribe the audio file
            audio, sampling_rate = sf.read(f"/content/fruits/{file_name}")
            transcription = asr_pipeline(audio)["text"]
            # Append the transcription to the file with a newline
            f.write(transcription + "\n")

print(f"All transcriptions saved to {transcript_file}")

Processing: oranges.mp3
Processing: apple.mp3
Processing: grapes.mp3
Processing: mango.mp3
All transcriptions saved to transcript.txt
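
Note that `os.listdir` returns entries in arbitrary order, which is why the files above are processed as oranges, apple, grapes, mango. If a reproducible transcript order matters, one option is to sort the paths before transcribing. The sketch below uses a hypothetical helper named `list_audio_files`, and the directory path is simply the folder used in this notebook.

```python
from pathlib import Path


def list_audio_files(directory: str, suffix: str = ".mp3") -> list[Path]:
    """Return the audio files in `directory`, sorted by name for a stable order."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == suffix
    )


# Example usage with the folder from this notebook (only runs if it exists)
fruits_dir = Path("/content/fruits")
if fruits_dir.is_dir():
    for path in list_audio_files(str(fruits_dir)):
        print(f"Would process: {path.name}")
```

Each returned `Path` can be passed to `sf.read(str(path))` in place of the f-string path used in the loop above.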


In [12]:
transcription

'MANGO IS KNOWN AS THE KING OF FRUITS FOR ITS RICH SWEET AND TROPICAL FLAVOUR IT IS SMOOTH GOLDEN FLESH AND A LARGE SEET AT ITS CENTRE LOVED AN SMOOTHIES DESSERTS CHUTNEYS OR JUST ON ITS OWN PACKED WITH VIDAMENS AAN C ITS GREAT FOR SKIN AND IMMUNITY JUICY FRAGRANT AND DILICIOUS MANGOS ARE A TRUE TASTE OF SUMMER'

## Read and Explore the Transcript

__Prompt__: Provide Python code to read a transcript file called "transcript.txt" and print some sentences.

In [4]:
with open(transcript_file, "r") as f:
    transcript_text = f.read()

# Print the first few transcripts (one per line; the ASR output has no punctuation to split on)
print("Sample from the transcript:")
print("\n".join(transcript_text.splitlines()[:5]))

Sample from the transcript:
GRAPES GROW IN CLUSTERS AND COMIN COLORS LIKE GREEN RED AND PURPLE THEIR JUICY SWEET AND OFTEN ENJOYED FRESH DRIED AS RAISINS OR MADEN TO WINE RICH AND ANTIOCCIDENS LIKE ROSBERITROLL THEIR GRAT FOR HART HEALTH EASY TO SNAP ON AND MESPRIE THERE ARE GOA TO FRUIT FOR ALL AGES FROM ANCIENT FEASTS TO MODERN PECNICS GRAPES HAVE ALWAYS BEEN A FAVORITE
ORANGES ARE SITROUS FRUITS KNOWN FOR THEIR VIBRANT COLOR AND REFRESHING FLAVOUR THEIR PACKED WITH VITEM AND SEA WHICH BOOSGERMIAN SYSTEM OFTEN EATEN FRESHER JUISED THEYARE ALSO USED IN DESERTS AND MARINADES THEIRE PEEL CONTAINS FRAGRANT OILS AND ZEST THAT'S GREAT FOR COOKING SWEET TANGY AND JUICY ORANGES ARE PURE SUNSHINE AN FRUIT FORM
MANGO IS KNOWN AS THE KING OF FRUITS FOR ITS RICH SWEET AND TROPICAL FLAVOUR IT IS SMOOTH GOLDEN FLESH AND A LARGE SEET AT ITS CENTRE LOVED AN SMOOTHIES DESSERTS CHUTNEYS OR JUST ON ITS OWN PACKED WITH VIDAMENS AAN C ITS GREAT FOR SKIN AND IMMUNITY JUICY FRAGRANT AND DILICIOUS MANGOS AR
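
The transcripts shown above are upper-case with no punctuation, and each line of `transcript.txt` holds one recording. As one possible preparation step before summarization, the sketch below splits the text into per-recording transcripts and converts each to sentence case. The helper name `normalize_transcript` is an assumption for illustration, and whether this actually improves the summary depends on the model.

```python
def normalize_transcript(text: str) -> list[str]:
    """Split a transcript into non-empty lines, sentence-case each, and add a period."""
    normalized = []
    for line in text.splitlines():
        line = " ".join(line.split())  # collapse repeated whitespace
        if not line:
            continue
        # Capitalize the first character, lower-case the rest, and end with a period
        normalized.append(line[0].upper() + line[1:].lower() + ".")
    return normalized


sample = "GRAPES GROW IN CLUSTERS\n\nMANGO IS KNOWN AS THE KING OF FRUITS\n"
for sentence in normalize_transcript(sample):
    print(sentence)
```

Joining the result with spaces gives a single block of text that can be placed into the summarization prompt in place of the raw transcript.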

## Summarize the Transcript

__Prompt__: Provide Python code to create a text-generation pipeline using the "tiiuae/Falcon3-1B-Instruct" model from HuggingFace on the 0th GPU. When creating the pipeline, specify "max_new_tokens = 500" and make sure the pipeline outputs only the generated text, not the prompt.

<hr>

__Note__: You can always switch models to observe differences in output performance.



In [2]:
from transformers import pipeline

# Define the model name
model_name = "tiiuae/Falcon3-1B-Instruct"

# Create a text-generation pipeline
text_gen_pipeline = pipeline(
    "text-generation",
    model=model_name,
    device=0,  # Use the 0th GPU
    max_new_tokens=500,
    return_full_text=False  # Ensure the output only includes the generated text
)

Device set to use cuda:0


__Prompt__: Using a HuggingFace text-generation pipeline called "text_gen_pipeline", create a prompt using an f-string to summarize the text in a variable called "transcript_text", and run that prompt.

In [6]:
prompt = f"Summarize the following transcript in a single, concise sentence that captures the key point clearly and accurately:\n\n{transcript_text}"
summary = text_gen_pipeline(prompt)[0]["generated_text"]

print("Summary:")
print(summary)

Summary:
<|assistant|>
GRAPES GROW IN CLUSTERS AND COME IN VARIOUS COLORS, ENJOYED FRESHLY OR MADE INTO VARIOUS DISHES, AND ARE ALSO USED IN MEDICINE FOR HEALTH BENEFITS.
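
The generated text above begins with the chat-template marker `<|assistant|>`, which is not part of the summary itself. A small post-processing step can remove it before the summary is evaluated. The sketch below assumes markers have the form `<|...|>` and strips them only from the start of the string; the helper name `strip_chat_markers` is hypothetical.

```python
import re

# Matches one or more leading markers such as "<|assistant|>", plus surrounding whitespace
_LEADING_MARKERS = re.compile(r"^(?:\s*<\|[^|>]*\|>)+\s*")


def strip_chat_markers(generated_text: str) -> str:
    """Remove leading chat-template markers and trim surrounding whitespace."""
    return _LEADING_MARKERS.sub("", generated_text).strip()


# Shortened copy of the output above, used only as an example input
raw_summary = "<|assistant|>\nGRAPES GROW IN CLUSTERS AND COME IN VARIOUS COLORS."
print(strip_chat_markers(raw_summary))
```

Text without a leading marker is returned unchanged apart from whitespace trimming, so the helper is safe to apply to every generated summary.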


## Evaluate the Summary

__Prompt__: Provide Python code to calculate the BERTScore of a text called "summary". The original text is in a variable called "transcript_text". Please use the CPU for the embedding model.

In [15]:
from bert_score import score

# Calculate BERTScore
P, R, F1 = score(
    [summary],  # Candidate (e.g., summary)
    [transcript_text],  # Reference (e.g., transcript)
    lang="en",  # Specify language
    device="cpu",
    verbose=True  # Optional: enable verbose logging for debugging
)

# Display results
print(f"BERTScore Precision: {P.item():.4f}")
print(f"BERTScore Recall: {R.item():.4f}")
print(f"BERTScore F1: {F1.item():.4f}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.
computing greedy matching.
done in 7.49 seconds, 0.13 sentences/sec
BERTScore Precision: 0.9048
BERTScore Recall: 0.8927
BERTScore F1: 0.8987
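
BERTScore's F1 is the harmonic mean of its precision and recall, so the three numbers above can be sanity-checked by hand. The short sketch below recomputes F1 from the printed precision and recall. The function name `harmonic_f1` is an assumption for illustration, and small differences can arise because the printed values are rounded.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the BERTScore output above
precision, recall = 0.9048, 0.8927
print(f"Recomputed F1: {harmonic_f1(precision, recall):.4f}")
```

Here precision reflects how much of the summary is supported by the transcript, while recall reflects how much of the transcript is covered by the summary. A one-sentence summary of a four-paragraph transcript should therefore be read with both numbers in mind, not F1 alone.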
