<a href="https://colab.research.google.com/github/rpast/AudioTrans/blob/main/AudioTrans_v_0_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AudioTrans v.0.4
## Open AI, Whisper, Audio->Text

---

This notebook transcribes any number of audio files stored on Google Drive into written text. It can also download the transcript of a YouTube video.

### Getting started:
1. Create an OpenAI account (go to this page and click 'Get started' in the text: https://openai.com/blog/introducing-chatgpt-and-whisper-apis).
2. Generate an API key in your account settings under 'View API keys' and do not share it with anyone.
3. The trial gives you a few dollars of credit to experiment with. To keep going after that, link a credit card to your account under account settings > billing.
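
Because the key should stay private, one option is to avoid pasting it into the notebook at all. The following is a minimal sketch, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (a common convention, not something this notebook sets up), with an interactive prompt as a fallback. `load_api_key` is a hypothetical helper, not part of the notebook.

```python
import os
from getpass import getpass

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or prompt for it without echoing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed characters, so the key does not end up in cell output
        key = getpass("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No API key provided")
    return key
```

The returned value could then be assigned to `api_key` in the parameters cell instead of a hard-coded string.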

### API documentation:
https://github.com/openai/whisper

### ROADMAP:
1. [x] Load all files in given directory
2. [x] Chop audio by its size (24 MB per piece by default), pass to Whisper, bind together
3. [ ] Join transcript files together (in Pandas dataframe)
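
Roadmap item 3 is not implemented in this notebook. As a rough sketch of what it could look like without pandas, the hypothetical helper `join_transcripts` below groups transcript files by source recording and joins the parts in numeric order. It assumes the `<name>_partN.txt` / `<name>_main.txt` naming that the chopping and saving functions in this notebook produce.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches "<base>_part<N>" or "<base>_main" (file stem, without the .txt extension)
PART_RE = re.compile(r"^(?P<base>.+?)_(?:part(?P<num>\d+)|main)$")

def join_transcripts(folder):
    """Return {base name: joined text} for all transcript files in `folder`."""
    groups = defaultdict(list)
    for path in Path(folder).glob("*.txt"):
        match = PART_RE.match(path.stem)
        if match is None:
            continue  # file was not produced by the chopping step
        num = int(match.group("num") or 0)  # "_main" files have no part number
        groups[match.group("base")].append((num, path))
    return {
        base: " ".join(p.read_text(encoding="utf-8").strip() for _, p in sorted(parts))
        for base, parts in groups.items()
    }
```

Sorting on the integer part number keeps `_part10` after `_part2`, which a plain alphabetical sort of file names would not.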

# 0. Script setup

In [None]:
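# Install the OpenAI library and the other dependencies.
# The cells below use the pre-1.0 openai interface (openai.Audio,
# openai.ChatCompletion), so the package is pinned below 1.0.
!pip install --upgrade pip
!pip install "openai<1.0"
!pip install langchain
!pip install youtube-transcript-api
!pip install pydub
!pip install mutagen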

In [33]:
# Import libraries
import openai
import datetime
import unicodedata
import os

from multiprocessing import Pool
from pathlib import Path
from mutagen.mp3 import MP3
from google.colab import drive
from pydub import AudioSegment
from tqdm.notebook import tqdm
from langchain.document_loaders import YoutubeLoader

In [13]:
# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [36]:
## FUNCTIONS

# The functions below determine the size of an audio file and chop it into
# smaller pieces if it is larger than 24 MB. The reason is that the Whisper API
# does not accept files larger than about 25 MB.
# They are intended for batch processing and work only with mp3 files.
#
def save_audio_file(piece, file_name, output_folder, file_num='', suffix='_part'):
    """
    Saves an audio segment as a new file in the specified output folder.
    
    Args:
        piece (pydub.AudioSegment): The audio segment to be saved.
        file_name (str): The base file name without an extension (e.g., "audiofile").
        output_folder (str): The path to the folder where the file should be saved.
        file_num (str or int, optional): A numeric identifier to append to the file name.
                                         Defaults to an empty string.
        suffix (str, optional): A suffix to append to the file name, before the file number.
                                Defaults to '_part'.

    Returns:
        None
    """
    
    file_name_part = f"{file_name}{suffix}{file_num}.mp3"
    output_file = os.path.join(output_folder, file_name_part)
    piece.export(output_file, format="mp3")
#
# Uses save_audio_file defined above
def chop_audio(input_file, output_folder, max_size_mb=24):
    """
    Splits an audio file into smaller pieces of approximately equal size, and 
    saves them as separate files in the specified output folder. If the input 
    audio file is smaller than the target size, it is exported whole with 
    the suffix '_main'.

    Args:
        input_file (str): The path to the input audio file.
        output_folder (str): The path to the folder where the chopped audio 
                             pieces should be saved.
        max_size_mb (int, optional): The maximum size of each chopped audio 
                                     piece in megabytes. Defaults to 24 MB.

    Returns:
        None
    """

    # TODO: rewrite with pathlib
    file_name = os.path.splitext(os.path.basename(input_file))[0]

    # Load audio file
    audio = AudioSegment.from_file(input_file)
    
    # Get the duration of the input file using mutagen
    audio_info = MP3(input_file)
    duration = audio_info.info.length

    # Calculate the duration of each piece in milliseconds, assuming the file's
    # bytes are spread evenly over its duration
    input_file_size = os.path.getsize(input_file)
    target_piece_size_bytes = max_size_mb * 1024 * 1024
    target_piece_duration_seconds = (target_piece_size_bytes / input_file_size) * duration
    target_piece_duration_ms = target_piece_duration_seconds * 1000

    if len(audio) <= target_piece_duration_ms:
        # The file is small enough: save it whole with the '_main' suffix
        save_audio_file(audio, file_name, output_folder, suffix='_main')
    else:
        # Chop the audio and save the pieces
        current_ms = 0
        file_num = 1
        while current_ms < len(audio):
            # Calculate end time for the current piece
            end_ms = min(current_ms + target_piece_duration_ms, len(audio))

            # Extract the piece
            piece = audio[current_ms:end_ms]

            # Save the piece as a new file
            save_audio_file(piece, file_name, output_folder, file_num=str(file_num), suffix='_part')

            # Move on to the next piece
            current_ms = end_ms
            file_num += 1


## Function below is used for saving the transcription into the txt file
def save_from_file(f, spth, txt, suffix):
    """
    Saves the given text to a file with a specified suffix in the 
    specified folder.

    Args:
        f (pathlib.Path): The input file path, used to derive the base file name.
        spth (str): The path to the folder where the output text file should be saved.
        txt (str): The text content to be written to the output file.
        suffix (str): The suffix to append to the base file name for the output file.

    Returns:
        None
    """

    # Write the text file
    fname = f.stem + suffix

    fpth = Path(spth) / fname
    # Use a separate name for the file handle so it does not shadow the `f` argument
    with open(fpth, 'w', encoding='utf-8') as out:
        out.write(txt)
    print('Text saved to: ', fpth)
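
To make the piece-length arithmetic in `chop_audio` concrete, here is a small worked sketch. `piece_duration_ms` is a hypothetical stand-alone helper that repeats the same proportion (target bytes divided by file bytes, times duration); the 60 MB, one-hour recording is an invented example, not one of the files processed here.

```python
import math

def piece_duration_ms(file_size_bytes, duration_s, max_size_mb=24):
    """Duration of one piece in ms, assuming bytes are spread evenly over time."""
    target_bytes = max_size_mb * 1024 * 1024
    return (target_bytes / file_size_bytes) * duration_s * 1000

# Invented example: a 60 MB mp3 that lasts one hour
size_bytes = 60 * 1024 * 1024
duration_s = 3600

ms = piece_duration_ms(size_bytes, duration_s)
pieces = math.ceil(duration_s * 1000 / ms)

# 24/60 of an hour is 24 minutes per piece, so the hour splits into 3 pieces
print(round(ms), pieces)
```

The estimate relies on the source bitrate being roughly constant. Because each piece is re-encoded on export, the resulting files can come out somewhat larger or smaller than the target if the export bitrate differs from the source.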

# 1. Parameters

In [38]:
# Set script parameters 

#@markdown ## Enter your API key
api_key = ''#@param {type:"string"}

#@markdown ---
#@markdown ## Choose whether to transcribe mp3 files or a YouTube video

#@markdown ### If from files:
#@markdown Paste the path to the folder containing the mp3 files
input_dir = "/content/drive/MyDrive/AI/Whisper/Burbea-in" #@param {type:"string"}
#@markdown Paste the path to the folder where the chopped audio files should be saved
interim_dir = "/content/drive/MyDrive/AI/Whisper/Burbea-interim" #@param {type:"string"}
#

#@markdown ### If from YouTube:
#@markdown Paste the video URL
yt_url = ''#@param {type:"string"}
#@markdown ---

#@markdown ## Choose where to save the transcript
#@markdown Paste the path to the Google Drive folder where the transcript should be saved.
output_dir = "/content/drive/MyDrive/AI/Whisper/Burbea-out" #@param {type:"string"}

## FLAGS
# input_dir takes precedence: yt_url is used only when input_dir is left empty
from_file = False
from_yt = False

if len(input_dir) > 1:
    from_file = True
elif len(yt_url) > 1:
    from_yt = True
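
The precedence between the two sources can be stated as a small function. This is only a sketch: `select_source` is a hypothetical helper, and raising an error when both parameters are empty is an assumption (the notebook itself only prints a warning later).

```python
def select_source(input_dir, yt_url):
    """Return 'file' or 'youtube'; raise if neither parameter is filled in."""
    if input_dir.strip():
        return "file"      # a folder path wins even if a URL is also given
    if yt_url.strip():
        return "youtube"
    raise ValueError("Set either input_dir or yt_url")
```

With this, `from_file` would correspond to `select_source(input_dir, yt_url) == "file"` and `from_yt` to the `"youtube"` case.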

# 2. Pre-processing audio

In [34]:
if from_file:
    # The script should be pointed to a dir with mp3 files.
    # The pipeline is as follows: for each mp3 file in directory >
    # read file > chop if necessary > save in the interim folder
    # the interim folder will serve as a source for whisper model

    # chop_audio handles mp3 only, so skip any other files in the folder
    input_files = sorted(Path(input_dir).glob('*.mp3'))

    # Set the number of processes to the number of CPU cores available
    num_processes = os.cpu_count()

    def process_file(input_file):
        """Wrapper with a single argument, needed for parallel processing."""
        chop_audio(input_file, interim_dir)

    # Use a multiprocessing Pool to process files in parallel
    with Pool(num_processes) as p:
        for _ in tqdm(
            p.imap_unordered(process_file, input_files), 
            total=len(input_files)
            ):
            pass

    # Sequential alternative:
    # for f_ in tqdm(input_files):
    #     chop_audio(f_, interim_dir)
elif from_yt:
    # A YouTube transcript does not require any audio file manipulation
    pass
else:
    print('Warning: at least one flag should be True')


# 3. Transcription

In [39]:
if from_yt:
    loader = YoutubeLoader.from_youtube_url(yt_url, add_video_info=False)
    text = loader.load()
    text = text[0].page_content
    # Normalize Unicode characters and flatten line breaks
    normalized_text = unicodedata.normalize('NFKC', text)
    new_decoded_text = normalized_text.replace('\n', ' ').strip()
    

#@markdown ## If transcribing from files:
if from_file:
    #@markdown Enter the language of the audio (ISO 639-1 code, e.g. "en")
    lang_selected = "en" #@param {type:"string"}
    openai.api_key = api_key
    to_transcribe = list(Path(interim_dir).glob('*'))
    print(f'Transcribing {len(to_transcribe)} files.')

    for f_ in tqdm(to_transcribe):
        # Open each piece in binary mode and close it once the request returns
        with open(f_, "rb") as audio_file:
            transcript = openai.Audio.transcribe(
                "whisper-1", 
                audio_file, 
                language=lang_selected
            )
        text = transcript['text']
        # Normalize Unicode characters and flatten line breaks
        normalized_text = unicodedata.normalize('NFKC', text)
        new_decoded_text = normalized_text.replace('\n', ' ').strip()

        save_from_file(f_, output_dir, new_decoded_text, '.txt')

#@markdown ---
#@markdown ## If transcribing from YouTube
#@markdown the transcript is available in the variable _new_decoded_text_

Transcribing 6 files.



Text saved to:  /content/drive/MyDrive/AI/Whisper/Burbea-out/20051105-Rob_Burbea-GAIA-mindfulness_of_mind_states-21010_part1.txt
Text saved to:  /content/drive/MyDrive/AI/Whisper/Burbea-out/20051112-Rob_Burbea-GAIA-contemplating_the_3_characteristics_pt_1_impermanence_and_dukkha-21008_part1.txt
Text saved to:  /content/drive/MyDrive/AI/Whisper/Burbea-out/20051105-Rob_Burbea-GAIA-mindfulness_of_mind_states-21010_part2.txt
Text saved to:  /content/drive/MyDrive/AI/Whisper/Burbea-out/20051112-Rob_Burbea-GAIA-contemplating_the_3_characteristics_pt_1_impermanence_and_dukkha-21008_part2.txt
Text saved to:  /content/drive/MyDrive/AI/Whisper/Burbea-out/20061104-Rob_Burbea-GAIA-from_feelings_to_freedom_exploring_vedana-12482_main.txt
Text saved to:  /content/drive/MyDrive/AI/Whisper/Burbea-out/20051112-Rob_Burbea-GAIA-contemplating_the_3_characteristics_pt_1_impermanence_and_dukkha-21008_part3.txt
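
The clean-up applied to each transcript (NFKC normalization, then flattening line breaks) can be tried in isolation. A minimal sketch, with `clean_transcript` as a hypothetical wrapper around the same two steps and an invented sample string:

```python
import unicodedata

def clean_transcript(text):
    """Apply NFKC normalization, replace newlines with spaces, trim the ends."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.replace("\n", " ").strip()

# Invented sample: a ligature (U+FB01), a no-break space (U+00A0) and a newline
sample = "\ufb01rst line\u00a0here\nsecond line "
print(clean_transcript(sample))  # prints: first line here second line
```

NFKC maps compatibility characters such as ligatures and no-break spaces to their plain equivalents, which is why the output contains only ordinary letters and spaces.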


# 5. Translation (not working)

In [None]:
#@markdown Run this cell to translate the transcribed text

#@markdown Translation instruction
trans_inst = 'Translate from English to Polish'#@param {type:"string"}

def chat_completion_response(instr, text):
    """Calls OpenAI's chat completion endpoint and returns the reply text."""
    api_response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Translate according to instruction given in user message. Return only translated text. Nothing more."},
            {"role": "user", "content": f'{instr}: {text}'}
        ]
    )
    # Return only the text of the first choice, not the whole response object
    return api_response['choices'][0]['message']['content']


# Note: when transcribing from files, new_decoded_text holds only the last
# transcribed piece, so only that piece is translated here.
new_translated_text = chat_completion_response(
    trans_inst, 
    new_decoded_text
)



In [None]:
new_translated_text

In [None]:
#@markdown Run to save and display the translation
if len(output_dir) > 1:
    # Name the output after the last transcribed file. For YouTube there is no
    # source file, so a placeholder stem is used (an assumption, rename as needed).
    source = f_ if from_file else Path('youtube_transcript')
    save_from_file(source, output_dir, new_translated_text, '_translation.txt')

print('Translation:')
new_translated_text