# Whisper for Automatic Speech Recognition (ASR)

In this notebook, we will learn about Automatic Speech Recognition (ASR) and how to implement it in Python with OpenAI's Whisper model.

## What is Automatic Speech Recognition (ASR)?

Automatic Speech Recognition (ASR) is a technology that allows computers to interpret and transcribe human speech into text. It is a field of study that intersects computer science, linguistics, and electrical engineering. The primary goal of ASR systems is to enable efficient and accurate conversion of spoken language into a written format, which can be used for various applications such as voice-enabled user interfaces, dictation software, and automated transcription services.

### Core Components of ASR

ASR systems typically involve several core components. Firstly, they include an acoustic model, which recognizes the basic sounds in a language (phonemes). Secondly, there's a language model, which predicts the likelihood of a particular sequence of words. These models are often trained on vast datasets of spoken and written language to improve their accuracy. The integration of these models allows the ASR system to interpret a wide range of speech, even in the presence of background noise, different accents, or dialects.
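To make the interplay between the two models concrete, here is a toy sketch of how a classical decoder might combine them. Every score below is invented for illustration, and `best_transcription` and `lm_weight` are hypothetical names; real systems score far more hypotheses using trained models.

```python
# Toy illustration: pick the transcription with the best combined score.
# All log-probabilities below are made up for demonstration purposes.

# Acoustic model: how well each candidate matches the audio, log P(audio | words).
acoustic_logprob = {
    "recognize speech": -12.1,
    "wreck a nice beach": -11.8,  # sounds almost identical, scores slightly better
}

# Language model: how plausible each word sequence is, log P(words).
language_logprob = {
    "recognize speech": -4.2,
    "wreck a nice beach": -9.7,  # grammatical but unlikely
}


def best_transcription(acoustic, language, lm_weight=1.0):
    """Return the candidate maximizing acoustic score + lm_weight * language score."""
    return max(acoustic, key=lambda words: acoustic[words] + lm_weight * language[words])


print(best_transcription(acoustic_logprob, language_logprob))
```

With `lm_weight=0.0` the acoustically closer but implausible phrase wins instead, which shows why the language model matters for near-homophones.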

### Applications and Uses

The applications of ASR are diverse and widespread. In everyday life, ASR is used in voice-activated GPS systems, virtual assistants like Siri or Alexa, and in smartphones for voice-to-text messaging. In the professional sphere, ASR aids in transcribing meetings, lectures, and interviews, making it a valuable tool in fields like journalism, law, and healthcare. Additionally, ASR technology is pivotal in accessibility, providing assistance to those who have difficulties with typing or using traditional computer interfaces, including individuals with disabilities.

### Challenges and Development

Despite significant advancements, ASR technology still faces challenges. Accurately recognizing speech in noisy environments, understanding heavily accented speech, or deciphering homophones (words that sound the same but have different meanings) are areas where ASR can struggle. Continuous research and development are focused on improving the robustness and accuracy of ASR systems. This involves training on more diverse datasets and employing advanced algorithms like deep learning to enhance the system's ability to understand context and nuance in spoken language.


## Introduction to Whisper

Whisper is an advanced machine learning model developed by OpenAI, specializing in speech-to-text transcription. It demonstrates remarkable accuracy in transcribing spoken words into written text, which is a significant breakthrough in the field of natural language processing. This technology is particularly relevant for humanists and archivists, as it bridges the gap between oral and written records, ensuring that spoken language, an essential part of human culture and history, is accurately preserved and accessible.

## Benefits for Humanists

For humanists, Whisper presents a unique opportunity to delve into oral histories, interviews, and cultural narratives that were previously difficult to document or analyze due to the labor-intensive process of transcription. With Whisper's high accuracy, even in handling various accents and dialects, humanists can now transcribe and analyze large volumes of audio recordings more efficiently. This capability enables a deeper and more nuanced understanding of cultural and historical contexts, as it allows for the preservation and study of diverse linguistic nuances and oral traditions that are often lost in written translation.

## Application in Archival Work

In the field of archival work, Whisper stands out as a transformative tool. Archivists often deal with vast amounts of audio material, ranging from historical speeches to personal memoirs. The traditional process of transcribing these materials is time-consuming and prone to human error. Whisper automates this process with high accuracy, thus significantly reducing the time and resources needed for transcription. This efficiency not only aids in the preservation of historical audio records but also makes them more accessible to researchers and the public, facilitating a broader engagement with historical material.

## Future Implications

The integration of Whisper into the workflow of humanists and archivists marks a significant step forward in the digital humanities and archival science. By streamlining the transcription process, Whisper not only conserves resources but also opens up new possibilities for research and preservation. It enables a more inclusive approach to historical and cultural documentation, ensuring that the voices of the past are not only heard but also accurately recorded and analyzed for future generations. As technology continues to advance, Whisper stands as a testament to the potential of machine learning in enhancing our understanding and preservation of human history and culture.

## Installing OpenAI Whisper

The steps below assume a basic understanding of Python and its package manager, pip.

### Prerequisites

Before installing Whisper, ensure you have the following prerequisites:

1. **Python**: Whisper requires Python. If you don't have Python installed, you can download it from [python.org](https://www.python.org/downloads/).

2. **pip**: pip is Python's package installer. It's typically included with Python. To check if you have pip installed, run `pip --version` in your command line.

3. **Virtual Environment (Optional but recommended)**: It's a good practice to use a virtual environment for Python projects. This keeps your project's dependencies separate from your global Python installation. You can use `venv` (built into Python) or a third-party tool like `virtualenv`.

### Installation Steps

1. **Open your command line**: This could be Terminal on macOS/Linux or Command Prompt/PowerShell on Windows.

2. **(Optional) Create and activate a virtual environment**:
   - Create: `python -m venv myenv` (Replace `myenv` with your preferred environment name)
   - Activate: 
     - Windows: `myenv\Scripts\activate`
     - macOS/Linux: `source myenv/bin/activate`
3. **Install FFmpeg**: Whisper relies on FFmpeg to read audio files. Install it with the command for your operating system:
   - **macOS**: Use Homebrew by running `brew install ffmpeg` in the terminal.
   - **Windows**: Use Chocolatey with `choco install ffmpeg` or Scoop with `scoop install ffmpeg`.
   - **Ubuntu or Debian Linux**: Run `sudo apt update && sudo apt install ffmpeg`.
   - **Arch Linux**: Run `sudo pacman -S ffmpeg`.

   For more details and updates, see the [Whisper GitHub page](https://github.com/openai/whisper).

4. **Install Whisper**:
   - Run `pip install openai-whisper`.

5. **Verify Installation**:
   - Once the installation is complete, you can verify it by running `pip show openai-whisper`, which prints the installed version. Running `whisper --help` should also print the command-line usage if the `whisper` command is on your path.
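The same check can be done from inside Python, which is handy in a notebook. The following is a minimal sketch using only the standard library; `check_whisper_environment` is a hypothetical helper name, not part of Whisper.

```python
import shutil
from importlib import metadata


def check_whisper_environment():
    """Report whether FFmpeg and the openai-whisper package can be found."""
    try:
        whisper_version = metadata.version("openai-whisper")
    except metadata.PackageNotFoundError:
        whisper_version = None  # package is not installed in this environment
    return {
        "ffmpeg_path": shutil.which("ffmpeg"),  # None if FFmpeg is not on PATH
        "whisper_version": whisper_version,
    }


status = check_whisper_environment()
for name, value in status.items():
    print(f"{name}: {value if value is not None else 'NOT FOUND'}")
```

If either entry prints `NOT FOUND`, revisit the corresponding installation step before continuing.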

### Post-Installation

- After installing Whisper, you can start using it to transcribe audio files. The basic command is `whisper your_audio_file.mp3`, where `your_audio_file.mp3` is the path to the audio file you want to transcribe.

- Whisper also offers various options and configurations for transcription, which you can explore through its documentation or by using `whisper --help` in the command line.
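As an illustration of those options, the sketch below assembles a command line using a few flags that `whisper --help` lists (`--model`, `--language`, `--output_format`, `--output_dir`). The helper `build_whisper_command` and its defaults are assumptions made for this example; the actual run is left commented out because it needs Whisper, FFmpeg, and a real audio file.

```python
import subprocess


def build_whisper_command(audio_path, model="small", language=None,
                          output_format="txt", output_dir="."):
    """Assemble the argument list for a `whisper` command-line call."""
    command = [
        "whisper", audio_path,
        "--model", model,
        "--output_format", output_format,
        "--output_dir", output_dir,
    ]
    if language is not None:
        # Leaving this flag out lets Whisper detect the language itself.
        command += ["--language", language]
    return command


command = build_whisper_command("your_audio_file.mp3", language="en", output_format="srt")
print(" ".join(command))

# To actually run it (requires Whisper and FFmpeg to be installed):
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems with file names that contain spaces.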

## Using Whisper in Python

### Step 1: Importing the Whisper Module

To begin using Whisper for speech recognition, import the Whisper module into your Python script. The import statement makes the library's functions and models available for speech-to-text transcription.

In [2]:
import whisper

### Step 2: Loading a Whisper Model

The `whisper.load_model("small")` line loads a specific model size from OpenAI's Whisper. Whisper offers several model sizes, each with different memory requirements and processing speeds: `tiny`, `base`, `small`, `medium`, and `large`. English-only versions such as `tiny.en` and `base.en` are available for every size except `large`. Smaller models like `tiny` and `base` are faster and require less memory, making them suitable for less resource-intensive applications. Larger models are slower and require more memory but provide better accuracy, especially in challenging audio conditions. The right choice depends on the balance you need between speed and accuracy and on the computational resources available.

In [3]:
model = whisper.load_model("small")
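One way to make the size choice explicit is to pick the largest model that fits a memory budget. The sketch below is only a rough guide: the memory figures are the approximate values listed in the Whisper README at the time of writing and may change, and `pick_model_size` is a hypothetical helper, not part of Whisper.

```python
# Approximate memory needs in GB, taken from the Whisper README at the time of
# writing; treat them as rough guidance and check the README for current values.
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model_size(available_vram_gb, english_only=False):
    """Return the largest model name that fits in the given memory budget."""
    fitting = [name for name, need in APPROX_VRAM_GB.items() if need <= available_vram_gb]
    if not fitting:
        raise ValueError("Not enough memory for even the 'tiny' model")
    name = fitting[-1]  # dict keeps insertion order, smallest to largest
    # English-only checkpoints exist for every size except `large`.
    if english_only and name != "large":
        name += ".en"
    return name


print(pick_model_size(4))                     # small
print(pick_model_size(8, english_only=True))  # medium.en

# The chosen name can then be passed to Whisper:
# import whisper
# model = whisper.load_model(pick_model_size(4))
```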

### Step 3: Transcribing Audio with Whisper

After loading the Whisper model, the next step is to transcribe audio. This is done using the `transcribe` method of the model.

In the line below, `model.transcribe` is called with the path to your audio file (`"test.mp4"` in this example). The model processes the file and returns the result as a dictionary, stored here in the `result` variable, whose `text` key holds the transcribed text. This single call handles the entire workflow of converting the speech in your audio file to text.

In [4]:
result = model.transcribe("test.mp4")
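`transcribe` also accepts optional keyword arguments. The sketch below wraps the load and transcribe steps in a hypothetical helper, `transcribe_file`, and passes two commonly used options: `language`, to skip automatic language detection, and `fp16`, which is usually set to `False` on CPU-only machines. The example call is commented out because it needs Whisper installed and a real file.

```python
def transcribe_file(path, model_name="small", language=None, use_fp16=False):
    """Transcribe one audio file and return Whisper's result dictionary."""
    import whisper  # imported on first call, so defining the helper needs no Whisper

    model = whisper.load_model(model_name)
    return model.transcribe(
        path,
        language=language,  # None lets Whisper detect the language
        fp16=use_fp16,      # half precision is meant for GPUs; False suits CPU runs
    )


# Example call (assumes "test.mp4" exists in the working directory):
# result = transcribe_file("test.mp4", language="en")
# print(result["text"])
```

Note that this helper reloads the model on every call; when transcribing many files, load the model once and reuse it.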



### Step 4: Analyze the Result

The result of the transcription using Whisper's `transcribe` method is a dictionary with three keys: `'text'`, `'segments'`, and `'language'`.

- The `'text'` key holds the entire transcribed text from the audio file. In this example, it contains the narration of a video tutorial about creating custom GPTs, discussing various features and instructions related to the process.

- The `'segments'` key is a list of dictionaries, where each dictionary represents a segment of the audio. Each segment includes details like the segment's ID, the start and end times of the segment in the audio, the transcribed text for that segment, and other metadata such as `avg_logprob`, `compression_ratio`, and `no_speech_prob`. This detailed breakdown can be useful for understanding the transcription at a more granular level, especially for long audio files.

- The `'language'` key indicates the detected language of the audio, which in this case is English (`'en'`).

This structured output allows for a comprehensive understanding of the audio content, making it suitable for further analysis or processing.
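As a sketch of such further processing, the code below prints each segment with readable timestamps and flags segments whose `no_speech_prob` is high. The `sample_result` dictionary is a hypothetical stand-in shaped like Whisper's output, and the 0.5 threshold and helper names are illustrative assumptions rather than recommended settings.

```python
def format_timestamp(seconds):
    """Convert seconds (float) to an HH:MM:SS.mmm string."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def describe_segments(result, max_no_speech_prob=0.5):
    """Return one line per segment, flagging segments that may contain no speech."""
    lines = []
    for seg in result["segments"]:
        flag = "  [possibly no speech]" if seg["no_speech_prob"] > max_no_speech_prob else ""
        lines.append(
            f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}]"
            f"{seg['text']}{flag}"
        )
    return lines


# Hypothetical stand-in for the dictionary returned by model.transcribe().
sample_result = {
    "text": " Hi and welcome back. Thanks for watching.",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.5, "text": " Hi and welcome back.", "no_speech_prob": 0.02},
        {"id": 1, "start": 2.5, "end": 4.0, "text": " Thanks for watching.", "no_speech_prob": 0.71},
    ],
}

for line in describe_segments(sample_result):
    print(line)
```

With a real transcription, pass `result` in place of `sample_result`.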

In [5]:
result

{'text': " Hi and welcome back to this channel. In this video, we're going to be looking at the new chat GPT feature that allows for you to create custom GPTs. When you load up chat.openai.com, your new window looks something like this. On the left-hand side, you'll see this explore button. If we click explore, we'll be taken to a new prompt where we can go ahead and start creating a new GPT. In this video, we're going to try to create a new GPT that can classify specific kinds of entities, namely vehicles and buildings. To do this, we're going to hit create a GPT and we'll be taken to this prompt here. This page allows for us to pass natural language instructions to create a set of custom rules for this GPT model to follow. What's really cool about this is that a open AI will take care of creating everything for us automatically with natural language. So what do we want this GPT to really do? Well, what we wanted to do is we wanted to be able to take in an input text and automatically