# Whisper for Automatic Speech Recognition (ASR)

In this notebook, we will learn about Automatic Speech Recognition and how to implement it in Python with OpenAI's Whisper model.

## What is Automatic Speech Recognition (ASR)?

Automatic Speech Recognition (ASR) is a technology that allows computers to interpret and transcribe human speech into text. It is a field of study that intersects computer science, linguistics, and electrical engineering. The primary goal of ASR systems is to enable efficient and accurate conversion of spoken language into a written format, which can be used for various applications such as voice-enabled user interfaces, dictation software, and automated transcription services.

### Core Components of ASR

ASR systems typically involve several core components. Firstly, they include an acoustic model, which recognizes the basic sounds in a language (phonemes). Secondly, there's a language model, which predicts the likelihood of a particular sequence of words. These models are often trained on vast datasets of spoken and written language to improve their accuracy. The integration of these models allows the ASR system to interpret a wide range of speech, even in the presence of background noise, different accents, or dialects.
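To make the interplay between the two models concrete, here is a minimal sketch in which a language model breaks a tie that the acoustic model cannot resolve. All scores, the bigram table, and the names `acoustic_log_probs`, `bigram_probs`, `language_log_prob`, and `best_transcription` are made up for illustration; real systems learn these values from data.

```python
import math

# Hypothetical acoustic scores: log-probability that the audio matches each
# candidate transcription. "their" and "there" sound the same, so they tie.
acoustic_log_probs = {
    "their house is big": math.log(0.40),
    "there house is big": math.log(0.40),
    "they're house is big": math.log(0.20),
}

# Hypothetical bigram language model: probability of a word given the previous word.
bigram_probs = {
    ("<s>", "their"): 0.05, ("their", "house"): 0.10,
    ("<s>", "there"): 0.05, ("there", "house"): 0.001,
    ("<s>", "they're"): 0.03, ("they're", "house"): 0.0005,
    ("house", "is"): 0.20, ("is", "big"): 0.05,
}
UNSEEN = 1e-6  # fallback probability for word pairs missing from the table


def language_log_prob(sentence):
    """Sum the log-probabilities of each consecutive word pair in the sentence."""
    words = ["<s>"] + sentence.split()
    return sum(math.log(bigram_probs.get(pair, UNSEEN)) for pair in zip(words, words[1:]))


def best_transcription(candidates, lm_weight=1.0):
    """Pick the candidate with the highest combined acoustic and language score."""
    def score(sentence):
        return candidates[sentence] + lm_weight * language_log_prob(sentence)
    return max(candidates, key=score)


print(best_transcription(acoustic_log_probs))  # their house is big
```

The acoustic scores alone cannot separate "their" from "there"; the language model tips the balance because "their house" is a far more likely word pair in this toy table.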

### Applications and Uses

The applications of ASR are diverse and widespread. In everyday life, ASR is used in voice-activated GPS systems, virtual assistants like Siri or Alexa, and in smartphones for voice-to-text messaging. In the professional sphere, ASR aids in transcribing meetings, lectures, and interviews, making it a valuable tool in fields like journalism, law, and healthcare. Additionally, ASR technology is pivotal in accessibility, providing assistance to those who have difficulties with typing or using traditional computer interfaces, including individuals with disabilities.

### Challenges and Development

Despite significant advancements, ASR technology still faces challenges. Accurately recognizing speech in noisy environments, understanding heavily accented speech, or deciphering homophones (words that sound the same but have different meanings) are areas where ASR can struggle. Continuous research and development are focused on improving the robustness and accuracy of ASR systems. This involves training on more diverse datasets and employing advanced algorithms like deep learning to enhance the system's ability to understand context and nuance in spoken language.
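One common way to put a number on transcription accuracy is the word error rate (WER): the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. The sketch below is a minimal illustration, assuming simple lowercase, whitespace-based tokenization; the function name `word_error_rate` and the example sentences are our own.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A homophone error: one substituted word out of five.
print(word_error_rate("they went to their house", "they went to there house"))  # 0.2
```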


## Introduction to Whisper

Whisper is an advanced machine learning model developed by OpenAI, specializing in speech-to-text transcription. It demonstrates remarkable accuracy in transcribing spoken words into written text, which is a significant breakthrough in the field of natural language processing. This technology is particularly relevant for humanists and archivists, as it bridges the gap between oral and written records, ensuring that spoken language, an essential part of human culture and history, is accurately preserved and accessible.

## Benefits for Humanists

For humanists, Whisper presents a unique opportunity to delve into oral histories, interviews, and cultural narratives that were previously difficult to document or analyze due to the labor-intensive process of transcription. With Whisper's high accuracy, even in handling various accents and dialects, humanists can now transcribe and analyze large volumes of audio recordings more efficiently. This capability enables a deeper and more nuanced understanding of cultural and historical contexts, as it allows for the preservation and study of diverse linguistic nuances and oral traditions that are often lost in written translation.

## Application in Archival Work

In the field of archival work, Whisper stands out as a transformative tool. Archivists often deal with vast amounts of audio material, ranging from historical speeches to personal memoirs. The traditional process of transcribing these materials is time-consuming and prone to human error. Whisper automates this process with high accuracy, thus significantly reducing the time and resources needed for transcription. This efficiency not only aids in the preservation of historical audio records but also makes them more accessible to researchers and the public, facilitating a broader engagement with historical material.

## Future Implications

The integration of Whisper into the workflow of humanists and archivists marks a significant step forward in the digital humanities and archival science. By streamlining the transcription process, Whisper not only conserves resources but also opens up new possibilities for research and preservation. It enables a more inclusive approach to historical and cultural documentation, ensuring that the voices of the past are not only heard but also accurately recorded and analyzed for future generations. As technology continues to advance, Whisper stands as a testament to the potential of machine learning in enhancing our understanding and preservation of human history and culture.

This brief guide explains how to install OpenAI's Whisper. It assumes a basic understanding of Python and its package manager, pip.

---

## Installing OpenAI Whisper

### Prerequisites

Before installing Whisper, ensure you have the following prerequisites:

1. **Python**: Whisper requires Python. If you don't have Python installed, you can download it from [python.org](https://www.python.org/downloads/).

2. **pip**: pip is Python's package installer. It's typically included with Python. To check if you have pip installed, run `pip --version` in your command line.

3. **Virtual Environment (Optional but recommended)**: It's a good practice to use a virtual environment for Python projects. This keeps your project's dependencies separate from your global Python installation. You can use `venv` (built into Python) or a third-party tool like `virtualenv`.

### Installation Steps

1. **Open your command line**: This could be Terminal on macOS/Linux or Command Prompt/PowerShell on Windows.

2. **(Optional) Create and activate a virtual environment**:
   - Create: `python -m venv myenv` (Replace `myenv` with your preferred environment name)
   - Activate: 
     - Windows: `myenv\Scripts\activate`
     - macOS/Linux: `source myenv/bin/activate`
3. **Install FFmpeg**: Whisper requires FFmpeg, a command-line tool for reading audio and video files. Install it with the command for your operating system:
   - **On macOS**: Use Homebrew by running `brew install ffmpeg` in the terminal.
   - **On Windows**: Use Chocolatey with `choco install ffmpeg` or Scoop with `scoop install ffmpeg`.
   - **On Ubuntu or Debian Linux**: Run `sudo apt update && sudo apt install ffmpeg`.
   - **On Arch Linux**: Run `sudo pacman -S ffmpeg`.

   For more details and updates, visit the [Whisper GitHub page](https://github.com/openai/whisper).

4. **Install Whisper**:
   - Run `pip install openai-whisper`.

5. **Verify Installation**:
   - Once the installation is complete, you can verify it by running `pip show openai-whisper`, which prints the installed version, or `whisper --help`, which prints the usage text if the command-line tool is on your path.
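
The same checks can also be run from Python, which is convenient inside a notebook. This is a minimal sketch using only the standard library; the function name `check_whisper_environment` and the report keys are our own, not part of Whisper.

```python
import importlib.metadata
import importlib.util
import shutil


def check_whisper_environment():
    """Report whether FFmpeg and the Whisper package can be found."""
    report = {
        # FFmpeg must be on the PATH so Whisper can decode audio files.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # The package installs a module named "whisper".
        "whisper_importable": importlib.util.find_spec("whisper") is not None,
        # Version of the "openai-whisper" distribution, if pip installed it.
        "openai_whisper_version": None,
    }
    try:
        report["openai_whisper_version"] = importlib.metadata.version("openai-whisper")
    except importlib.metadata.PackageNotFoundError:
        pass
    return report


for name, value in check_whisper_environment().items():
    print(f"{name}: {value}")
```

If `ffmpeg_on_path` is `False`, revisit the FFmpeg step; if `whisper_importable` is `False`, check that the virtual environment you installed into is the one that is active.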

### Post-Installation

- After installing Whisper, you can start using it to transcribe audio files. The basic command is `whisper your_audio_file.mp3`, where `your_audio_file.mp3` is the path to the audio file you want to transcribe.

- Whisper also offers various options and configurations for transcription, which you can explore through its documentation or by using `whisper --help` in the command line.
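
As an illustration of those options, the sketch below assembles a command line with a few commonly used flags (`--model`, `--language`, `--output_format`, `--output_dir`) and can run it through `subprocess`. The helper names `build_whisper_command` and `run_whisper` and the default values are assumptions made for this example; check `whisper --help` for the options your installed version supports.

```python
import subprocess


def build_whisper_command(audio_path, model="small", language=None,
                          output_format="txt", output_dir="."):
    """Assemble the argument list for a call to the whisper command-line tool."""
    command = [
        "whisper", str(audio_path),
        "--model", model,
        "--output_format", output_format,
        "--output_dir", str(output_dir),
    ]
    # Without --language, Whisper tries to detect the spoken language itself.
    if language is not None:
        command += ["--language", language]
    return command


def run_whisper(audio_path, **options):
    """Run the command; raises CalledProcessError if whisper exits with an error."""
    return subprocess.run(build_whisper_command(audio_path, **options), check=True)


# Show the command without running it.
print(" ".join(build_whisper_command("your_audio_file.mp3", language="en", output_format="srt")))
```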

In [5]:
import whisper

In [6]:
model = whisper.load_model("small")

In [7]:
result = model.transcribe("test.mp4")
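
For archival collections, the same call can be applied to every recording in a folder. The following is a minimal sketch, not a finished tool: `transcribe_folder`, `AUDIO_EXTENSIONS`, and the folder names are hypothetical, and the function takes the transcription function as an argument so it can be paired with `model.transcribe` or tried out with a stand-in.

```python
from pathlib import Path

# File types to pick up; an assumed list, extend it as needed.
AUDIO_EXTENSIONS = {".mp3", ".mp4", ".wav", ".m4a", ".flac"}


def transcribe_folder(folder, transcribe_fn, output_dir=None):
    """Transcribe each audio file in `folder` and save the text next to it
    (or in `output_dir`). Returns the list of text files written."""
    folder = Path(folder)
    output_dir = Path(output_dir) if output_dir else folder
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # sorted() lists the folder up front, so files written below are not revisited.
    for audio_path in sorted(folder.iterdir()):
        if audio_path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        result = transcribe_fn(str(audio_path))  # expected to return {"text": ...}
        text_path = output_dir / (audio_path.stem + ".txt")
        text_path.write_text(result["text"].strip() + "\n", encoding="utf-8")
        written.append(text_path)
    return written


# Usage with the model loaded above (folder names are placeholders):
# transcribe_folder("recordings", model.transcribe, output_dir="transcripts")
```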




In [10]:
len(result["segments"])

74
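
Each entry in `result["segments"]` is a dictionary that includes `start` and `end` times in seconds and the segment's `text`. As a sketch of what can be done with them, the helpers below turn a list of segments into SubRip (`.srt`) subtitle text. The names `format_timestamp` and `segments_to_srt` and the sample segments are made up for illustration.

```python
def format_timestamp(seconds):
    """Convert seconds to the HH:MM:SS,mmm form used in SRT files."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


def segments_to_srt(segments):
    """Build SRT text: a counter, a time range, and the text for each segment."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Made-up segments with the same keys Whisper returns.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 61.25, "text": " This is a made-up segment."},
]
print(segments_to_srt(sample_segments))
```

With a real transcription, `segments_to_srt(result["segments"])` would produce one subtitle block per segment.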

In [14]:
def dict2schema(d, style="string", indent=0):
    """
    Generates a schema of a dictionary's organization, showing each key and the type of its value.
    The schema can be returned as a string or as a nested dictionary.

    :param d: The dictionary to analyze.
    :param style: The output format, either "string" or "dict".
    :param indent: The indentation level (used for recursive calls and string formatting).
    :return: The schema as a string or a nested dictionary, depending on the 'style' parameter.
    """
    schema = ""
    schema_dict = {}

    for key, value in d.items():
        key_type_info = f"{key} ({type(value).__name__})"
        # String style formatting
        if style == "string":
            schema += ' ' * indent + key_type_info + "\n"
        # Dictionary style formatting
        else:
            schema_dict[key] = {'type': type(value).__name__}

        # Recursively handle nested dictionaries
        if isinstance(value, dict):
            nested_schema = dict2schema(value, style, indent + 4)
            if style == "string":
                schema += nested_schema
            else:
                schema_dict[key]['nested'] = nested_schema

    return schema if style == "string" else schema_dict

# Example dictionary for trying the function on nested data
# (the calls below use the Whisper result instead)
example_dict = {
    'key1': 'value1',
    'key2': {
        'subkey1': 'subvalue1',
        'subkey2': {
            'subsubkey1': 'subsubvalue1'
        },
        'subkey3': 123,
        'subkey4': [1, 2, 3]
    },
    'key3': True
}

# Get the schema of the Whisper result as a string
print(dict2schema(result))

# Get the schema of the Whisper result as a dictionary
print(dict2schema(result, style="dict"))


text (str)
segments (list)
language (str)

{'text': {'type': 'str'}, 'segments': {'type': 'list'}, 'language': {'type': 'str'}}
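
The schema above stops at `segments (list)` because `dict2schema` only descends into dictionaries, not lists. One possible variation, sketched below, also looks at the first element of each non-empty list so the fields of a segment become visible. The function name `schema_with_lists` and the `fake_result` data are our own; the sketch assumes all items in a list share the same structure.

```python
def schema_with_lists(value, indent=0):
    """Return schema lines for nested dicts, sampling the first item of each list."""
    pad = " " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            lines.append(f"{pad}{key} ({type(item).__name__})")
            if isinstance(item, (dict, list)):
                lines.extend(schema_with_lists(item, indent + 4))
    elif isinstance(value, list) and value:
        # Assume every element looks like the first one.
        first = value[0]
        lines.append(f"{pad}[0] ({type(first).__name__})")
        if isinstance(first, (dict, list)):
            lines.extend(schema_with_lists(first, indent + 4))
    return lines


# Made-up data shaped like a small Whisper result.
fake_result = {
    "text": " hi",
    "segments": [{"id": 0, "start": 0.0, "end": 1.0, "text": " hi", "tokens": [1, 2]}],
    "language": "en",
}
print("\n".join(schema_with_lists(fake_result)))
```

Calling `schema_with_lists(result)` on a real transcription would list the keys of the first segment in the same way.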
