# Lesson 2: Splitting and Processing Large Files with FFmpeg

### Splitting and Processing Large Files

Welcome back! In our previous lessons, we explored basic transcription with OpenAI's Whisper API and calculated media duration using FFmpeg. Today, we'll shift our focus to transcribing large files with OpenAI Whisper and FFmpeg. Splitting large audio or video files into smaller pieces lets tasks like transcription run efficiently and without errors. This lesson shows you how to handle these files smoothly using FFmpeg.

### Understanding Transcribing Large Files

OpenAI Whisper has a file size limitation of 25 MB, which poses a challenge when attempting to transcribe large audio or video files. To work around this constraint, we need a method to divide these large files into smaller, manageable chunks that can be processed sequentially. Our strategy involves leveraging FFmpeg's capabilities to split the files into segments that fall within the permissible size limit. This will ensure compatibility with OpenAI Whisper while maintaining the quality and integrity of the original content. By breaking down large files, we facilitate efficient transcription, allowing for smooth and accurate processing of each smaller segment.
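
Before splitting anything, it can help to check whether a file exceeds the limit at all. The following is a minimal sketch: the `needs_splitting` name and the `limit_mb` parameter are illustrative assumptions (they are not part of the Whisper API), and the 25 MB figure from the text above is interpreted here as 25 × 1024 × 1024 bytes.

```python
import os

WHISPER_LIMIT_MB = 25  # limit quoted in this lesson


def needs_splitting(file_path, limit_mb=WHISPER_LIMIT_MB):
    """Return True when the file on disk is larger than the upload limit."""
    size_bytes = os.path.getsize(file_path)
    return size_bytes > limit_mb * 1024 * 1024
```

A file that passes this check can be sent to the API directly; only larger files need the chunking approach described below.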

### Using FFmpeg to Split Media Files: Media Duration

Let's consider Python code to achieve this, ensuring all steps are easily comprehensible. First, let's revisit how we retrieve the media's length using FFmpeg:

```python
import math
import os
import subprocess
import tempfile

def get_audio_duration(file_path):
    """Get the duration of an audio file using ffprobe"""
    cmd = [
        'ffprobe', 
        '-v', 'quiet',
        '-show_entries', 'format=duration',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        file_path
    ]
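    try:
        output = subprocess.check_output(cmd)
        return float(output)
    except (subprocess.CalledProcessError, OSError, ValueError):
        # ffprobe failed, is not installed, or printed a non-numeric duration
        return None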
```

This section of the code employs `ffprobe` to determine an audio file's duration. `ffprobe` is a component of FFmpeg that fetches file data without altering it. The command is carefully structured to extract only the duration, allowing us to calculate how to split the file accordingly.
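
The conversion from `ffprobe`'s raw output to a number is the fragile step: the output arrives as bytes and may be empty or non-numeric (for example, a value such as `N/A` when a container does not report a duration, which is an assumption about what you might encounter). A minimal sketch of a defensive parser, where `parse_duration` is a hypothetical helper name:

```python
import math


def parse_duration(raw_output):
    """Convert ffprobe's raw duration output to seconds, or None if unusable."""
    if isinstance(raw_output, bytes):
        raw_output = raw_output.decode("utf-8", errors="replace")
    try:
        seconds = float(raw_output.strip())
    except ValueError:
        # Empty or non-numeric output
        return None
    # Reject NaN, infinity, zero, and negative values
    if not math.isfinite(seconds) or seconds <= 0:
        return None
    return seconds
```

Returning `None` for every unusable value keeps the caller's check simple: a single `if duration is None` covers all failure cases.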

### Using FFmpeg to Split Media Files: Streaming FFmpeg's Output

Now, let's implement one more helper function. Splitting a media file into chunks takes time, and FFmpeg emits its logs as a stream: new lines appear as it works through the file. To follow that progress, we need a way to stream these logs to the console from Python:

```python
def run_command_with_output(cmd, desc=None):
    """Run a command and stream its output in real-time"""
    if desc:
        print(f"\n{desc}")
    
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True
    )
    
    for line in iter(process.stdout.readline, ''):
        print(line, end='')
    
    process.stdout.close()
    return_code = process.wait()
    
    if return_code != 0:
        raise subprocess.CalledProcessError(return_code, cmd)
```

This helper function allows us to run commands and stream outputs in real time. By setting up a subprocess, it captures output line-by-line, ensuring you keep track of the progress during long operations, a critical feature when managing large files.
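
To see the streaming pattern in isolation, here is a small sketch that uses a short Python one-liner as a stand-in for FFmpeg. The `stream_lines` name is hypothetical; unlike the helper above, it also returns the captured lines, which can be handy for logging. FFmpeg writes its logs to stderr, which is why stderr is merged into stdout.

```python
import subprocess
import sys


def stream_lines(cmd):
    """Run cmd, echo each output line as it arrives, and return the lines."""
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
    ) as process:
        for line in process.stdout:
            print(line, end="")
            lines.append(line.rstrip("\n"))
    # Leaving the with-block closes the pipe and waits for the process
    if process.returncode != 0:
        raise subprocess.CalledProcessError(process.returncode, cmd)
    return lines


if __name__ == "__main__":
    # Stand-in command; replace with an FFmpeg command list in practice
    stream_lines([sys.executable, "-c", "print('step 1'); print('step 2')"])
```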

### Using FFmpeg to Split Media Files: Splitting Files into Chunks

Splitting a media file into smaller chunks relies on a few FFmpeg options that together extract segments without re-encoding. Let's break down the code to see how it operates:

```python
def split_media(file_path, chunk_size_mb=20):
    """Split media file into chunks smaller than the API limit"""
    print("\nSplitting media into chunks...")
    
    duration = get_audio_duration(file_path)
    if not duration:
        raise Exception("Could not determine audio duration")
    
    file_size = os.path.getsize(file_path)
    chunk_duration = duration * (chunk_size_mb * 1024 * 1024) / file_size
    num_chunks = math.ceil(duration / chunk_duration)
    
    chunks = []
    for i in range(num_chunks):
        start_time = i * chunk_duration
        temp_file = tempfile.NamedTemporaryFile(
            delete=False,
            suffix=os.path.splitext(file_path)[1]
        )
        temp_file.close()  # Release the handle so FFmpeg can write to this path
        
        cmd = [
            'ffmpeg',
            '-i', file_path,    # Specify the input file to process
            '-ss', str(start_time),  # Set the start time of the chunk
            '-t', str(chunk_duration),  # Define the chunk's duration
            '-c', 'copy',   # Copy streams without re-encoding for efficiency
            '-y',   # Overwrite output files without confirmation
            temp_file.name
        ]
        
        run_command_with_output(
            cmd, 
            f"Extracting chunk {i+1}/{num_chunks}"
        )
        chunks.append(temp_file.name)
    print(f"Split media into {len(chunks)} chunk(s): {chunks}")
    return chunks
```

**Code Explanation:**

- **Initialize Variables:** We first determine the duration of the media file using the helper `get_audio_duration` function. The `file_size` is retrieved to calculate the proper chunk duration that fits within the specified `chunk_size_mb` limit (which is by default 20 MB).

- **Calculate Chunks:** `chunk_duration` scales the total duration by the ratio of the target chunk size (`chunk_size_mb` converted to bytes) to `file_size`. This assumes the file's size is spread roughly evenly over its length. `num_chunks` is the full duration divided by `chunk_duration`, rounded up.

- **Create Each Chunk:** A loop iterates over each chunk, calculating the `start_time` for each segment. A temporary file is created to store the chunk. It keeps the original file's extension so FFmpeg writes the same container format.

- **FFmpeg Command:** 
  - `-i` specifies the input file.
  - `-ss` sets the start time for each chunk.
  - `-t` sets the duration for each chunk.
  - `-c copy` ensures content is copied directly without re-encoding, preserving quality and improving efficiency.
  - `-y` automatically overwrites existing output files without user confirmation.

- **Run Command and Store Chunks:** `run_command_with_output` executes the FFmpeg command, streaming progress to keep the user informed. Each generated temporary file is appended to the `chunks` list, which is later returned for further processing. This approach systematically breaks down large files into smaller, manageable pieces using FFmpeg's powerful media handling capabilities.
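
The chunk arithmetic described above can be checked without running FFmpeg. The sketch below repeats the same calculation as `split_media`; `plan_chunks` is a hypothetical helper, and clamping the last chunk's length to the remaining duration is an addition for readability rather than something `split_media` does.

```python
import math


def plan_chunks(duration_s, file_size_bytes, chunk_size_mb=20):
    """Return (start, length) pairs in seconds, mirroring split_media's arithmetic."""
    chunk_bytes = chunk_size_mb * 1024 * 1024
    chunk_duration = duration_s * chunk_bytes / file_size_bytes
    num_chunks = math.ceil(duration_s / chunk_duration)
    plan = []
    for i in range(num_chunks):
        start = i * chunk_duration
        # The last chunk may be shorter than chunk_duration
        length = min(chunk_duration, duration_s - start)
        plan.append((start, length))
    return plan


# A 120-second, 2 MiB file split at 1 MB per chunk:
print(plan_chunks(120.0, 2 * 1024 * 1024, chunk_size_mb=1))
# → [(0.0, 60.0), (60.0, 60.0)]
```

Because the estimate treats size as proportional to time, individual chunks of a variable-bitrate file can come out larger or smaller than the target. The 20 MB default leaves some headroom below the 25 MB limit for that reason.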

### Checking Yourself: Executing the Media File Split

Running the code (e.g., `split_media('resources/sample_video.mp4', 1)`) will print something like this:

```
Splitting media into chunks...

Extracting chunk 1/2
<ffmpeg output for chunk 1>

Extracting chunk 2/2
<ffmpeg output for chunk 2>

Split media into 2 chunk(s): ['/tmp/tmprgsjob1j.mp4', '/tmp/tmpr2iqj_ll.mp4']
```

The `sample_video.mp4` file is around 2 MB, so splitting it with `chunk_size_mb=1` produces 2 chunks of roughly 1 MB each. FFmpeg extracts both and saves them as separate temporary files.
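
Because `split_media` creates its temporary files with `delete=False`, the chunks stay on disk until something removes them. A minimal sketch of two follow-up helpers, both hypothetical names: one reports chunks that still exceed a size limit, and one deletes the chunks after use.

```python
import os


def oversized_chunks(chunk_paths, limit_mb=25):
    """Return the paths whose size on disk exceeds limit_mb."""
    limit_bytes = limit_mb * 1024 * 1024
    return [p for p in chunk_paths if os.path.getsize(p) > limit_bytes]


def remove_chunks(chunk_paths):
    """Delete temporary chunk files, ignoring ones that are already gone."""
    for path in chunk_paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass
```

A typical flow would call `oversized_chunks(chunks)` right after splitting, and `remove_chunks(chunks)` in a `finally` block once transcription is done.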

### Lesson Summary

Congratulations on working through the process of splitting large media files with FFmpeg! In this lesson, you learned how to read a file's duration with `ffprobe`, stream FFmpeg's logs from Python, and split a large file into chunks that fit under Whisper's 25 MB limit without re-encoding. You're now well-equipped to prepare large media files for transcription.

## Observe FFmpeg Power

FFmpeg is a very exciting and powerful tool! Now, let's put your understanding to the test. In the IDE, you can see the `split_into_chunks` function, which extracts a chunk with a given start time and duration from the media file.

Run the code, and feel free to adjust the function's input parameters to see how the result changes. After you run the code, click the "refresh" button in the preview, and the newly generated chunk should appear in the dropdown. Try playing the chunk to see if the split worked!

```python
import subprocess
import sys
import shutil
import os

from openai import OpenAI

client = OpenAI()


def check_ffmpeg_installed():
    """Check if FFmpeg is installed and accessible."""
    if shutil.which("ffmpeg") is None:
        print("FFmpeg is not installed or not in the system path.", file=sys.stderr)
        sys.exit(1)


def transcribe(file_path):
    """
    Transcribe an audio file using OpenAI's Whisper API.
    """
    try:
        with open(file_path, 'rb') as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                timeout=60
            )
            return transcript.text
    except Exception as e:
        raise Exception(f"Transcription failed: {str(e)}")


def get_audio_duration(file_path):
    """Get the duration of an audio file using ffprobe"""
    cmd = [
        'ffprobe', 
        '-v', 'quiet',
        '-show_entries', 'format=duration',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        file_path
    ]
    try:
        output = subprocess.check_output(cmd)
        return float(output)
    except (subprocess.CalledProcessError, OSError, ValueError):
        return None


# Function to split media into chunks and log the process
def split_into_chunks(file_path, start_time, duration, chunk_number):
    temp_file_path = f"resources/chunk_{chunk_number}.mp3"
    
    # Modify the command to use the full path to ffmpeg if necessary
    cmd = [
        'ffmpeg',
        '-i', file_path,
        '-ss', str(start_time),
        '-t', str(duration),
        '-c', 'copy',
        '-y',
        temp_file_path
    ]
    
    # Redirect stderr to stdout
    return_code = subprocess.call(cmd, stderr=subprocess.STDOUT)
    
    if return_code != 0:
        print(f"Encoding failed for chunk {chunk_number}.", file=sys.stderr)
        sys.exit(1)

    return temp_file_path

# Check if FFmpeg is installed before attempting to split
check_ffmpeg_installed()

# Extract a 10-second chunk starting at the 5-second mark as an example
print(f"Chunk saved at: {split_into_chunks('resources/sample_audio.mp3', 5, 10, 1)}")

```

## Retrieve Media File Chunk Length

Let's practice your FFmpeg skills and knowledge now! Fill in the missing parts of the code corresponding to proper FFmpeg commands to retrieve the media file length and to retrieve a proper chunk from the given media file.


```python
import subprocess
import sys
import shutil
import os

from openai import OpenAI

client = OpenAI()


def check_ffmpeg_installed():
    """Check if FFmpeg is installed and accessible."""
    if shutil.which("ffmpeg") is None:
        print("FFmpeg is not installed or not in the system path.", file=sys.stderr)
        sys.exit(1)


def transcribe(file_path):
    """
    Transcribe an audio file using OpenAI's Whisper API.
    """
    try:
        with open(file_path, 'rb') as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                timeout=60
            )
            return transcript.text
    except Exception as e:
        raise Exception(f"Transcription failed: {str(e)}")


def get_audio_duration(file_path):
    """Get the duration of an audio file using ffprobe"""

    # TODO: Add FFmpeg command
    cmd = ['ls']
    try:
        output = subprocess.check_output(cmd)
        return float(output)
    except (subprocess.CalledProcessError, OSError, ValueError):
        return None


# Function to split media into chunks and log the process
def split_into_chunks(file_path, start_time, duration, chunk_number):
    temp_file_path = f"resources/chunk_{chunk_number}.mp3"
    
    # TODO: Add FFmpeg command
    cmd = ['ls']
    
    # Redirect stderr to stdout
    return_code = subprocess.call(cmd, stderr=subprocess.STDOUT)
    
    if return_code != 0:
        print(f"Encoding failed for chunk {chunk_number}.", file=sys.stderr)
        sys.exit(1)

    return temp_file_path

# Check if FFmpeg is installed before attempting to split
check_ffmpeg_installed()

# Extract a 10-second chunk starting at the 5-second mark as an example
print(f"Chunk saved at: {split_into_chunks('resources/sample_audio.mp3', 5, 10, 1)}")

```

You're on the right track! To complete this code, we need to:

1. Use `ffprobe` to get the duration of the media file.
2. Use `ffmpeg` to extract a specific chunk of the audio.

---

### Here is the corrected and completed version of the missing parts:

```python
def get_audio_duration(file_path):
    """Get the duration of an audio file using ffprobe"""
    cmd = [
        'ffprobe',
        '-v', 'error',
        '-show_entries', 'format=duration',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        file_path
    ]
    try:
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return float(output)
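    except (subprocess.CalledProcessError, ValueError) as e:
        # CalledProcessError carries ffprobe's output; ValueError means non-numeric output
        print(f"Error getting duration: {getattr(e, 'output', e)}", file=sys.stderr)
        return None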


def split_into_chunks(file_path, start_time, duration, chunk_number):
    temp_file_path = f"resources/chunk_{chunk_number}.mp3"
    
    cmd = [
        'ffmpeg',
        '-y',  # Overwrite output file if exists
        '-ss', str(start_time),  # Start time
        '-t', str(duration),     # Duration
        '-i', file_path,         # Input file
        '-acodec', 'copy',       # Copy audio codec to avoid re-encoding
        temp_file_path
    ]
    
    # Redirect stderr to stdout
    return_code = subprocess.call(cmd, stderr=subprocess.STDOUT)
    
    if return_code != 0:
        print(f"Encoding failed for chunk {chunk_number}.", file=sys.stderr)
        sys.exit(1)

    return temp_file_path
```

---

## Summary of Key Commands:

| Command | Purpose |
|---------|---------|
| `ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 {input_file}` | Get media duration in seconds |
| `ffmpeg -ss {start_time} -t {duration} -i {input_file} -acodec copy {output_file}` | Extract audio chunk without re-encoding |
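
The lesson's examples place `-ss` and `-t` both after `-i` (as output options) and before it (as input options). The sketch below builds either form so the two can be compared side by side; `build_extract_cmd` and its `seek_before_input` parameter are hypothetical names. As a hedged note, seeking before `-i` is generally the faster form, and with stream copy the cut points may land on nearby packet boundaries rather than the exact timestamps.

```python
def build_extract_cmd(input_file, output_file, start_time, duration,
                      seek_before_input=True):
    """Build an FFmpeg argument list that copies one chunk without re-encoding."""
    seek = ['-ss', str(start_time), '-t', str(duration)]
    if seek_before_input:
        # Input options: seek and limit while reading the input
        cmd = ['ffmpeg', '-y', *seek, '-i', input_file]
    else:
        # Output options: seek and limit while writing the output
        cmd = ['ffmpeg', '-y', '-i', input_file, *seek]
    return cmd + ['-c', 'copy', output_file]


print(build_extract_cmd('resources/sample_audio.mp3', 'resources/chunk_1.mp3', 5, 10))
```

Passing the list to `subprocess.call` works the same way as in the solution code.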

---

## Example Output:

```bash
Chunk saved at: resources/chunk_1.mp3
```


## Split and Transcribe Your Media

Let's practice your FFmpeg skills and knowledge even more! Fill in the missing parts of the code corresponding to proper FFmpeg commands to split the given media file into 5 equal-duration chunks. Each chunk should be saved in the resources folder.

After implementing the solution, you can test and play these generated chunks by refreshing the preview and selecting the chunks from the dropdown menu.

Once the chunks are generated, try transcribing these smaller chunks. Does it work now?

Hint: To convert the output chunk to mp3 when the input file is an mp4, pass the `-acodec libmp3lame` parameter to FFmpeg so that the chunk's audio is encoded as mp3.

```python
import subprocess
import sys
import shutil
import os

from openai import OpenAI

client = OpenAI()


def check_ffmpeg_installed():
    """Check if FFmpeg is installed and accessible."""
    if shutil.which("ffmpeg") is None:
        print("FFmpeg is not installed or not in the system path.", file=sys.stderr)
        sys.exit(1)


def transcribe(file_path):
    """
    Transcribe an audio file using OpenAI's Whisper API.
    """
    try:
        with open(file_path, 'rb') as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                timeout=60
            )
            return transcript.text
    except Exception as e:
        raise Exception(f"Transcription failed: {str(e)}")


def get_audio_duration(file_path):
    """Get the duration of an audio file using ffprobe"""
    cmd = [
        'ffprobe', 
        '-v', 'quiet',
        '-show_entries', 'format=duration',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        file_path
    ]
    try:
        output = subprocess.check_output(cmd)
        return float(output)
    except (subprocess.CalledProcessError, OSError, ValueError):
        return None


def split_into_chunks(file_path):
    """Split the media file into 5 equal-duration chunks."""
    # TODO: Implement chunking logic


if __name__ == "__main__":
    check_ffmpeg_installed()
    split_into_chunks('resources/codesignal_beyond.mp4')


```

Excellent exercise! Here's how you should complete the `split_into_chunks` function:

### Steps:
1. Get the total duration of the file using `get_audio_duration`.
2. Divide the duration into 5 equal chunks.
3. Use `ffmpeg` to extract each chunk.
4. Convert the output to `.mp3` using `-acodec libmp3lame`.
5. Save the chunks in the `resources` directory.

---

## Final Code:

```python
def split_into_chunks(file_path):
    """Split the media file into 5 equal-duration chunks."""
    
    duration = get_audio_duration(file_path)
    
    if duration is None:
        print("Could not retrieve duration.", file=sys.stderr)
        sys.exit(1)

    chunk_duration = duration / 5  # Split into 5 chunks

    for i in range(5):
        start_time = i * chunk_duration
        output_file = f'resources/chunk_{i+1}.mp3'

        cmd = [
            'ffmpeg',
            '-y',  # Overwrite if exists
            '-ss', str(start_time),
            '-t', str(chunk_duration),
            '-i', file_path,
            '-acodec', 'libmp3lame',  # Convert to mp3
            output_file
        ]

        print(f"Creating chunk {i+1} from {start_time}s to {start_time + chunk_duration}s")

        return_code = subprocess.call(cmd, stderr=subprocess.STDOUT)

        if return_code != 0:
            print(f"Failed to create chunk {i+1}", file=sys.stderr)
            sys.exit(1)

    print("All chunks created successfully!")
```

---

## Usage:

```bash
python3 your_script.py
```

This will create:

```sh
resources/chunk_1.mp3
resources/chunk_2.mp3
resources/chunk_3.mp3
resources/chunk_4.mp3
resources/chunk_5.mp3
```

---

## Next Step:

Once your chunks are ready, you can transcribe them like this:

```python
for i in range(1, 6):
    transcript = transcribe(f'resources/chunk_{i}.mp3')
    print(f"Transcript for chunk {i}:\n{transcript}\n{'-'*50}")
```
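
To end up with one transcript instead of five printed pieces, the chunk texts can be joined in order. The sketch below is one possible approach under stated assumptions: `transcribe_chunks` is a hypothetical helper, and it takes the transcription function as a parameter so it can be tried with a stand-in before calling the real API.

```python
def transcribe_chunks(chunk_paths, transcribe_fn):
    """Transcribe chunks in order and join the pieces into one transcript."""
    pieces = []
    for index, path in enumerate(chunk_paths, start=1):
        print(f"Transcribing chunk {index}/{len(chunk_paths)}: {path}")
        text = transcribe_fn(path)
        pieces.append(text.strip())
    # Skip empty pieces (e.g. silent chunks) when joining
    return " ".join(piece for piece in pieces if piece)
```

With the functions from this exercise, usage could look like `transcribe_chunks([f'resources/chunk_{i}.mp3' for i in range(1, 6)], transcribe)`. Words that straddle a chunk boundary may be transcribed imperfectly, so the joined text is worth a quick read-through.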
