# Audio File Summarizer

## Acknowledgment
I would like to thank the following individuals and organisations that made this project possible.
* Groq, for providing free access to their API, which let me gain hands-on experience in making API calls without having to constantly worry about token limits.

## Audio Credits
Audio content from [Polyglot speaking in 7 languages](https://www.youtube.com/watch?v=esaXXVD0PTc), licensed under Creative Commons Attribution License (CC BY). Used for non-commercial, educational purposes.

## Abstract
Audio File Summarizer produces a bullet-point summary of an audio file in the language of the user's choice. It is useful for extracting meeting minutes from the recording of a business meeting. The program can handle audio files containing multiple languages. For example, in a business meeting between two companies where one party speaks Japanese, the other speaks Hebrew, and the common language is English, the program transcribes the utterances in all three languages and produces meeting minutes in the language specified by the user.
<br>
The program works in the following sequence:
1. Ask the user to upload an audio file to their My Drive on Google Drive.
2. Transcribe the audio in all the languages heard in the audio file.
3. Extract the key points.
4. Summarise each point in a bullet list.
5. Translate the bullet list into the language of the user's choice.

## Prerequisite
To run the program, you need to store your Groq API key in Google Colab's Secrets under the name `GROQ_API_KEY`.
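
A minimal sketch of reading the key, assuming it is stored under the name `GROQ_API_KEY` as in the notebook cells further down. The helper `load_groq_api_key` and its environment-variable fallback are illustrative additions, not part of the original notebook.

```python
import os


def load_groq_api_key(name="GROQ_API_KEY"):
    """Return the API key from Colab Secrets, or from the environment as a fallback."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Outside Colab, or when the secret is missing or not shared with the notebook
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab Secrets or the environment")
    return key
```

Failing early with a clear message is a design choice here: without it, a missing key would only surface later as an authentication error from the first API call.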

## Points of Consideration
We summarize in the original languages first and then translate the summary, because this order is generally more efficient and more practical:
* Translation has higher token-level cost (especially for long texts) in both time and API usage.
* Summarization reduces content size, which speeds up and simplifies the translation task.
* Summarizing in the original languages also preserves contextual and cultural nuances, which often get muddled if you translate first.
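
The cost argument above can be illustrated with a rough back-of-the-envelope estimate. The function below is a sketch under two loud assumptions that do not come from measurements in this project: a translation is about as long as its input, and a summary is a fixed fraction (`summary_ratio`) of its input. The figures 10,000 and 0.1 are hypothetical.

```python
def tokens_processed(transcript_tokens, summary_ratio):
    """Estimate total tokens (input + output) for both orderings.

    Assumes a translation is about as long as its input and a summary is
    summary_ratio times the length of its input. Both are rough assumptions.
    """
    n = transcript_tokens
    s = round(n * summary_ratio)  # assumed summary length

    # Translate first: translate n -> n, then summarise n -> s
    translate_first = (n + n) + (n + s)
    # Summarise first: summarise n -> s, then translate s -> s
    summarise_first = (n + s) + (s + s)
    return translate_first, summarise_first


translate_first, summarise_first = tokens_processed(10_000, 0.1)
print(translate_first, summarise_first)  # 31000 13000
```

Under these assumptions, summarising first handles well under half the tokens, because the expensive translation step only ever sees the short summary.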

## Challenges
* It took a long time to find a suitable multilingual audio file with a license that permits its use in this project. An hour-long audio recording of a business meeting involving Japanese, Hebrew and English produced the best result. The model did not detect all the languages successfully in a couple of YouTube videos, including the one shown in the result.
* The "large" Whisper model processes audio files best. The "medium", "small" and "base" models all have difficulty detecting non-English languages.
* Running Whisper without a GPU hurts both its speed and its language detection. On the CPU, it takes about 15 minutes to process an 8-minute audio file. On the GPU, the same file takes about half that time, with a better language detection rate. If your PC is not equipped with an up-to-date GPU that is compatible with the GPU-enabled version of PyTorch, it is preferable to run the program in Google Colab on a T4. The .ipynb file is provided in the repo folder (no Streamlit section).
* I had to experiment iteratively, adjusting the chunk size in milliseconds, until I found a size that captures conversational segments of various lengths. If a short conversational exchange in one language lands in the same chunk as part of the following exchange in a different language, Whisper fails to identify the language of the shorter segment. As a result, each chunk is small (30 seconds), the overlap is comparatively large (10 seconds), and the number of chunks is high, but this configuration copes with frequent switches between languages.
* Streamlit seems to be incompatible with certain versions of torch (e.g. 2.6.0 CPU version). Upgrading the torch installation did not help.
* Streamlit also seems to struggle with seven states. Up to two states work fine, but with more (I have not tested the exact threshold at which it stops working correctly) it cannot return to the beginning of the loop to process another audio file, even though the starting state is clearly labelled and specified. Claude could not find a workaround either, hence the missing Start Over button in the program. Streamlit does work correctly with three variables.
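
The chunk size and overlap trade-off described above can be sketched as follows. This is a simplified illustration of overlapping windows, not the exact position logic used in the notebook; `chunk_windows` is a hypothetical helper, and its defaults mirror the 30-second chunk and 10-second overlap mentioned above.

```python
def chunk_windows(duration_ms, chunk_ms=30_000, overlap_ms=10_000):
    """Return (start, end) pairs in milliseconds covering the whole recording."""
    if not 0 <= overlap_ms < chunk_ms:
        raise ValueError("overlap_ms must be non-negative and smaller than chunk_ms")

    step = chunk_ms - overlap_ms  # each window starts this far after the previous one
    windows = []
    start = 0
    while start < duration_ms:
        end = min(start + chunk_ms, duration_ms)  # clip the last window
        windows.append((start, end))
        if end == duration_ms:
            break  # the recording is fully covered
        start += step
    return windows


print(chunk_windows(70_000))
# [(0, 30000), (20000, 50000), (40000, 70000)]
```

With these settings, any exchange shorter than the overlap falls entirely inside at least one window, which is why a large overlap helps with frequent language switches. The price is more chunks to transcribe.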

## Setup

* Set the text wrap format in the Jupyter Notebook file for enhanced readability
* Install the packages
* Access the LLM API key
* Initialise the model

In [1]:
# Activate text wrapping in the output cells
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [2]:
!pip install groq langchain langchain_groq -q


In [3]:
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-brpnhlu0
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-brpnhlu0
  Resolved https://github.com/openai/whisper.git to commit 517a43ecd132a2089d85f4ebc044728a71d49f6e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20240930)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->openai-whisper==20240930)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->openai-whisper==20240930)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-

In [None]:
# Required by Whisper
!apt-get update && apt-get install -y ffmpeg

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [75.2 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,604 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Hit:12 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:13 https://r2u.stat.il

In [None]:
# For manipulating the audio file
!pip install -q pydub

In [None]:
# Import the packages
import os
import re
import torch
import whisper
import numpy as np
from tqdm.notebook import tqdm
from pydub import AudioSegment
from google.colab import drive, userdata
from langchain.prompts import PromptTemplate
from langchain_groq import ChatGroq

In [None]:
# Mount Google Drive
print("Mounting Google Drive...")
drive.mount('/content/drive/')

Mounting Google Drive...
Mounted at /content/drive/


In [None]:
# Access the Groq API key
os.environ['GROQ_API_KEY'] = userdata.get('GROQ_API_KEY')

# Initialize the language model
llm = ChatGroq(model="llama-3.3-70b-versatile")

## Get the user input

Ask the user to specify:
* the name of the audio file (must be uploaded to the `Colab_Notebooks` folder in My Drive on Google Drive)
* the language to write the meeting minutes in

In [None]:
# Get the name of the audio file from the user
def get_filename():
    """Ask the user to enter the name of the audio file to summarise."""
    filename = input("\nPlease enter the name of the audio file. Make sure the audio file is in My Drive on Google Drive.\n")
    audio_file = f"/content/drive/MyDrive/Colab_Notebooks/{filename}"
    print(f"\nThe audio file path is: {audio_file}")
    return audio_file

In [None]:
# Ask the user which language to summarise the audio file into
def get_target_language():
    """Ask the user which language to summarize the audio file into."""
    target_language = input("\nIn which language would you like to receive the summary?\n")
    print(f"\nThe target language is: {target_language}")
    return target_language
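
Because transcription is slow, it may be worth checking that the file exists before starting. The sketch below is an optional addition, not part of the original notebook: `resolve_audio_path` is a hypothetical helper, and its default `base_dir` simply repeats the folder used in `get_filename()`.

```python
import os


def resolve_audio_path(filename, base_dir="/content/drive/MyDrive/Colab_Notebooks"):
    """Build the full path and fail early if the file is not there."""
    audio_file = os.path.join(base_dir, filename.strip())
    if not os.path.isfile(audio_file):
        raise FileNotFoundError(f"Audio file not found: {audio_file}")
    return audio_file
```

A mistyped file name then fails immediately with a clear message instead of after the Whisper model has been downloaded and loaded.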

## Make a transcript of the audio file
Transcribe the audio file with Whisper, one chunk at a time. The recording of a business meeting is about an hour long, and processing it in one go causes Whisper to crash because it runs out of memory.

In [None]:
# Make a transcript of an audio file in chunks of 30 seconds
# Introduce a 10-second overlap between two adjacent chunks to maintain context continuity

import pydub

def make_transcript_in_chunks(audio_file_path, chunk_duration_ms=30000, overlap_ms=10000):
    """
    Transcribe an audio file by splitting it into overlapping chunks to avoid memory issues.

    Args:
        audio_file_path: Path to the audio file
        chunk_duration_ms: Duration of each chunk in milliseconds (30 seconds)
        overlap_ms: Overlap between chunks in milliseconds (10 seconds)

    Returns:
        Full transcript of the audio file
    """
    print(f"Loading audio file: {audio_file_path}")

    audio = AudioSegment.from_file(audio_file_path)

    # Get audio duration in milliseconds
    audio_duration = len(audio)
    print(f"Audio duration: {audio_duration/1000:.2f} seconds")

    # Create a temporary directory for chunks if it doesn't exist
    if not os.path.exists("temp_chunks"):
        os.makedirs("temp_chunks")

    # Calculate positions for chunking (with overlap)
    chunk_positions = list(range(0, audio_duration, chunk_duration_ms - overlap_ms))

    # Ensure last chunk doesn't exceed audio length
    if chunk_positions and chunk_positions[-1] + chunk_duration_ms > audio_duration:
        chunk_positions[-1] = max(0, audio_duration - chunk_duration_ms)

    # Add the final chunk if needed
    if chunk_positions and chunk_positions[-1] + chunk_duration_ms < audio_duration:
        chunk_positions.append(audio_duration - chunk_duration_ms)

    # Load the Whisper model (the size is chosen below based on the available device)
    print("Loading Whisper model...")

    # Try to use GPU if available, otherwise fall back to CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    # Use the large model on GPU; fall back to the medium model on CPU to improve speed
    # The base and small models are inadequate and do not detect non-English languages
    model_size = "large" if device == "cuda" else "medium"
    print(f"Using {model_size} model")

    model = whisper.load_model(model_size, device=device)

    # Initialize an empty list to store all transcriptions
    all_transcripts = []

    # Process each chunk
    print(f"Processing {len(chunk_positions)} chunks...")
    for i, start_pos in enumerate(tqdm(chunk_positions)):
        # Extract chunk
        end_pos = min(start_pos + chunk_duration_ms, audio_duration)
        chunk = audio[start_pos:end_pos]

        # Save chunk temporarily
        chunk_path = f"temp_chunks/chunk_{i}.wav"
        chunk.export(chunk_path, format="wav")

        # Clear CUDA cache to prevent memory issues
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        # Transcribe chunk
        result = model.transcribe(chunk_path, task="transcribe")
        all_transcripts.append({
            "start": start_pos / 1000,  # Convert to seconds
            "end": end_pos / 1000,
            "text": result["text"].strip()
        })
        print(f"Chunk {i+1}/{len(chunk_positions)} transcribed")

        # Clean up temporary chunk file
        os.remove(chunk_path)

    # Clean up the temporary directory
    os.rmdir("temp_chunks")

    # Merge transcripts with overlap handling
    merged_transcript = merge_transcripts_with_overlap_handling(all_transcripts)

    return merged_transcript

In [None]:
# Merge the transcript chunks and remove the overlaps
def merge_transcripts_with_overlap_handling(transcripts):
    """
    Merge transcripts handling duplicated content in overlapping portions.

    Args:
        transcripts: List of dictionaries with start, end, and text fields

    Returns:
        Merged transcript
    """
    if not transcripts:
        return ""

    # Sort transcripts by start time
    transcripts = sorted(transcripts, key=lambda x: x["start"])

    # Initialize with the first transcript
    merged_text = transcripts[0]["text"]

    for i in range(1, len(transcripts)):
        current_text = transcripts[i]["text"]
        previous_text = merged_text

        # Find potential overlap in text
        overlap_found = False
        min_overlap_length = 5  # Minimum characters to consider as overlap

        for overlap_length in range(min(len(previous_text), len(current_text)), min_overlap_length - 1, -1):
            if previous_text[-overlap_length:].lower() == current_text[:overlap_length].lower():
                merged_text = previous_text + current_text[overlap_length:]
                overlap_found = True
                break

        # If no text overlap found, simply append with a space
        if not overlap_found:
            merged_text += " " + current_text

    # Clean up any multiple spaces, newlines, etc.
    merged_text = re.sub(r'\s+', ' ', merged_text).strip()

    return merged_text
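
To see how the suffix/prefix matching above behaves, here is a small standalone sketch of the same idea applied to two strings. `merge_pair` is a hypothetical helper written for illustration, and the sample sentences are made up.

```python
def merge_pair(previous_text, current_text, min_overlap=5):
    """Join two strings, dropping text duplicated at the seam (case-insensitive)."""
    longest = min(len(previous_text), len(current_text))
    # Try the longest possible overlap first, then shrink towards min_overlap
    for n in range(longest, min_overlap - 1, -1):
        if previous_text[-n:].lower() == current_text[:n].lower():
            return previous_text + current_text[n:]
    # No shared seam found: keep both texts, separated by a space
    return previous_text + " " + current_text


print(merge_pair("the quick brown fox", "brown fox jumps"))
# the quick brown fox jumps
print(merge_pair("hello world", "goodbye"))
# hello world goodbye
```

Note that this only removes a duplicate when the end of one chunk and the start of the next match character for character. If Whisper transcribes the overlapping audio slightly differently in the two chunks, the repeated sentence stays in the merged transcript, as can be seen in the sample output at the end of this document.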

In [None]:
# Define prompts with instructions for multilingual content
summarisation_prompt = PromptTemplate(
    input_variables=["transcript"],
    template="""You are a professional transcription analyst skilled in multiple languages including Japanese, Hebrew, and English.

Here is a transcript of a multilingual audio recording:

{transcript}

Please carefully analyze this transcript and:

1. Identify all key points regardless of which language they appear in
2. Create a comprehensive yet concise bullet-point list of these key points
3. For each key point, maintain the original language it was spoken in
4. Ensure you don't miss important information in any language
5. Format your response as a clean, well-structured bullet list

Your output should be a multilingual summary that captures the essential content from all languages present in the recording."""
)

translation_prompt = PromptTemplate(
    input_variables=["summary", "language"],
    template="""You are an expert multilingual translator with deep cultural knowledge.

Here is a multilingual summary of key points from an audio recording:

{summary}

Please translate this entire summary into {language}, following these guidelines:

1. Translate ALL points into {language} only
2. Preserve the original meaning, tone, and nuance of each point
3. Pay special attention to cultural references, idioms, and specialised terminology
4. Maintain the bullet-point format for clarity
5. Ensure your translation is natural and fluent in {language}

Your goal is to create a translation that feels native to {language} speakers while accurately representing the original content."""
)

In [None]:
# Make a transcript of the audio file
def transcribe_audio_file(audio_file):
    """
    Transcribe audio file using chunking to avoid memory issues.

    Args:
        audio_file: Path to the audio file

    Returns:
        The transcript text
    """
    print("Starting transcription process...")
    transcript = make_transcript_in_chunks(audio_file)

    # Save the transcript
    with open("/content/drive/MyDrive/Colab_Notebooks/transcript.txt", "w", encoding="utf-8") as f:
        f.write(transcript)

    print("\nTranscript saved to transcript.txt")
    print("\nFirst 500 characters of transcript:")
    print(transcript[:500] + "...")

    return transcript

In [None]:
# Summarise the content of the audio file
def summarise_and_translate(transcript, language):
    """
    Take a transcript of an audio file, summarise the content into a list of key points,
    and translate it into a language of user's choice.

    Args:
        transcript: The transcript of the audio file
        language: Target language for translation

    Returns:
        A translated list of key points in the target language
    """
    print("\nGenerating multilingual summary of key points...")
    summary = (summarisation_prompt | llm).invoke({"transcript": transcript}).content

    print("\nSummary generated. Now translating to", language)
    translated_summary = (translation_prompt | llm).invoke({"summary": summary, "language": language}).content

    # Save both versions
    with open("summary_original.txt", "w", encoding="utf-8") as f:
        f.write(summary)

    with open(f"summary_{language}.txt", "w", encoding="utf-8") as f:
        f.write(translated_summary)

    return translated_summary
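
The target language is typed in freely by the user and ends up in a file name (`summary_{language}.txt`). One possible safeguard, shown here as a sketch rather than as part of the original notebook, is to clean the text first. `safe_filename_part` is a hypothetical helper.

```python
import re


def safe_filename_part(text, fallback="unknown"):
    """Reduce free-form user input to characters that are safe in a file name."""
    # Replace every run of characters other than letters, digits, "_" or "-" with "_"
    cleaned = re.sub(r"[^\w\-]+", "_", text.strip()).strip("_")
    return cleaned or fallback


print(safe_filename_part("Brazilian Portuguese"))  # Brazilian_Portuguese
print(safe_filename_part("../English"))            # English
```

It could be used as `f"summary_{safe_filename_part(language)}.txt"`, so that spaces or path separators in the input cannot produce an unexpected path.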

In [None]:
# Main execution flow
def main():
    print("\n===== Multilingual Audio File Summariser =====")

    # Get the user input
    audio_file = get_filename()
    target_language = get_target_language()

    # Check available GPU memory
    if torch.cuda.is_available():
        print(f"\nGPU available: {torch.cuda.get_device_name(0)}")
        print(f"Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    else:
        print("\nNo GPU available, will use CPU (slower)")

    # Produce a transcript of the audio file
    print("\n----- Starting Transcription -----")
    transcript = transcribe_audio_file(audio_file)

    # Summarise and translate
    print("\n----- Starting Summarization and Translation -----")
    final_summary = summarise_and_translate(transcript, target_language)

    print("\n===== Final Result =====")
    print(final_summary)

    return final_summary

In [None]:
# Run the main function
if __name__ == "__main__":
    main()


===== Multilingual Audio File Summariser =====

Please enter the name of the audio file. Make sure the audio file is in My Drive on Google Drive.
7_langs_mix.mp3

The audio file path is: /content/drive/MyDrive/Colab_Notebooks/7_langs_mix.mp3

In which language would you like to receive the summary?
English

The target language is: English

GPU available: Tesla T4
Available GPU memory: 15.83 GB

----- Starting Transcription -----
Starting transcription process...
Loading audio file: /content/drive/MyDrive/Colab_Notebooks/7_langs_mix.mp3
Audio duration: 472.41 seconds
Loading Whisper model...
Using device: cuda
Using large model


100%|█████████████████████████████████████| 2.88G/2.88G [00:47<00:00, 64.7MiB/s]


Processing 24 chunks...


Chunk 1/24 transcribed
Chunk 2/24 transcribed
Chunk 3/24 transcribed
Chunk 4/24 transcribed
Chunk 5/24 transcribed
Chunk 6/24 transcribed
Chunk 7/24 transcribed
Chunk 8/24 transcribed
Chunk 9/24 transcribed
Chunk 10/24 transcribed
Chunk 11/24 transcribed
Chunk 12/24 transcribed
Chunk 13/24 transcribed
Chunk 14/24 transcribed
Chunk 15/24 transcribed
Chunk 16/24 transcribed
Chunk 17/24 transcribed
Chunk 18/24 transcribed
Chunk 19/24 transcribed
Chunk 20/24 transcribed
Chunk 21/24 transcribed
Chunk 22/24 transcribed
Chunk 23/24 transcribed
Chunk 24/24 transcribed

Transcript saved to transcript.txt

First 500 characters of transcript:
Bonjour mes amis, comment ça va? Je suis Marcos et je suis costarricense. Aujourd'hui je parlerai de mes langues. J'aime apprendre des langues étrangères. Je ne parle pas beaucoup de français, excusez-moi, mais je crois que le français est une langue très intéressante. Excusez-moi. Mais je crois que le français c'est une langue très intéressante. J'ai étudié