<a href="https://colab.research.google.com/github/AS-AIGC/AS-AIGDMS/blob/main/colab_notebook_AS_AIGDMS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Generated Discussion Minutes and Summarization by Academia Sinica (AS-AIGDMS)

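## Current state of cloud transcription platforms
- ✅ High recognition accuracy
- ✅ Simple, intuitive interface
- ✅ Multilingual support
- ✅ Multiple output formats

- ⚠️ Limits on file length
- ⚠️ Limits on the number of files per month

- ⛔️ Audio files must be uploaded to the cloud

## Our approach
- Use OpenAI Whisper to transcribe speech locally
- Use pyannote.audio to identify speakers and split the transcript by speaker
- (Use the ChatGPT API to summarize each speaker's content)

## Notes

1. Items in parentheses are not provided in this Colab notebook.
2. The code explanations and comments below were generated by ChatGPT and lightly edited by hand.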

This cell installs the Python packages the notebook depends on, using pip (Python's package manager): whisper-timestamped from GitHub, the development version of pyannote-audio, pytube for downloading YouTube videos, pydub for processing audio files, and pysrt for handling subtitle files.

In [1]:
# Install the whisper-timestamped library from GitHub
!pip3 install git+https://github.com/linto-ai/whisper-timestamped

# Install the development version of the pyannote-audio library
!pip install -qq https://github.com/pyannote/pyannote-audio/archive/develop.zip

# Install the Pytube library for downloading YouTube videos
!pip install -q --upgrade pytube

# Install the Pydub library for working with audio files
!pip install -q --upgrade pydub

# Install the pysrt library for handling subtitles
!pip install pysrt

Collecting git+https://github.com/linto-ai/whisper-timestamped
  Cloning https://github.com/linto-ai/whisper-timestamped to /tmp/pip-req-build-qrrcrgic
  Running command git clone --filter=blob:none --quiet https://github.com/linto-ai/whisper-timestamped /tmp/pip-req-build-qrrcrgic
  Resolved https://github.com/linto-ai/whisper-timestamped to commit 732865ce9c0c1027c67f964e7200c7db6542b142
  Preparing metadata (setup.py) ... done

DEPRECATION: torchsde 0.2.5 has a non-standard dependency specifier numpy>=1.19.*; python_version >= "3.7". pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of torchsde or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063

The next cells check the installed numpy version and then import the modules used in the rest of the notebook. `os` and `json` are Python standard-library modules; `pytube`, `pydub`, `whisper_timestamped`, `pyannote.audio`, and `pysrt` are the packages installed above.

In [1]:
!conda list | grep "numpy"

numpy                     1.23.0                   pypi_0    pypi


In [6]:
# Standard-library modules: file system access and JSON serialization
import os
import json
# YouTube class from pytube, used to download YouTube videos
from pytube import YouTube
# AudioSegment class from pydub, used to load, slice, and export audio files
from pydub import AudioSegment
# whisper_timestamped, used below to load the Whisper model for transcription
import whisper_timestamped
# Pipeline class from pyannote.audio, used to load the pretrained speaker-diarization pipeline
from pyannote.audio import Pipeline
# pysrt, used for subtitle (SRT) operations
import pysrt

This cell defines `download_Youtube_video_audio_file`, which downloads the audio track of a YouTube video. It takes three arguments: the URL of the video, the output directory, and the filename for the audio file. It uses `pytube` to select the audio-only stream and downloads that stream to the specified directory.

In [7]:
def download_Youtube_video_audio_file(url, output_directory, audio_filename):
  # Display the URL being downloaded
  print(f"Download url: {url}")
  # Create a YouTube object with the provided URL
  yt = YouTube(use_oauth=True, url=url)
  # Get the audio-only stream of the video
  audio_stream = yt.streams.get_audio_only()
  
  # Download the audio stream to the specified directory with the specified filename
  audio_stream.download(output_path=output_directory, filename=audio_filename)

This cell defines `convert_audio_file_to_mp3_format`, which converts an audio file to MP3. It takes two arguments: the path of the original audio file and the output path for the converted file. It uses `AudioSegment` from `pydub` to load the audio and export it.

In [8]:
def convert_audio_file_to_mp3_format(audio_filepath, export_path):
  # Load the audio file into an AudioSegment object
  print("Load the audio file")
  audio_file = AudioSegment.from_file(audio_filepath)

  # Convert the audio file to MP3 format and save it to the specified path
  print("Convert the audio file to MP3 format\n")
  mp3_file = audio_file.export(export_path, format="mp3")
  # At this point, the audio file is converted to MP3 format and saved at the given export path
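
pydub relies on an external encoder binary (ffmpeg, or avconv) to read and write compressed formats such as MP3, so the conversion cannot succeed if neither is installed. The following is a minimal sketch of a pre-flight check using only the standard library; the helper name `has_audio_backend` is hypothetical and not part of the notebook.

```python
import shutil


def has_audio_backend(candidates=("ffmpeg", "avconv")):
    """Return True if at least one candidate encoder binary is on PATH."""
    return any(shutil.which(name) is not None for name in candidates)


if not has_audio_backend():
    print("Neither ffmpeg nor avconv was found; install one before converting to MP3.")
```

Running a check like this before the conversion step gives a clearer message than the error raised deep inside the export call.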

This cell defines `slice_audio`, which exports a single one-minute segment of an audio file. It takes three arguments: the audio file (a `pydub` `AudioSegment`), the output filename, and the offset, which is the index of the minute to extract. Note that `pydub` measures time in milliseconds, and that the last segment of a recording may be shorter than one minute.

In [9]:
def slice_audio(audio_file, filename, offset):
  # pydub works in milliseconds; len() returns the total duration in ms
  total_ms = len(audio_file)
  # One minute expressed in milliseconds
  one_minute = 60 * 1000
  # Start of the slice: `offset` whole minutes into the audio
  start = offset * one_minute
  # End of the slice: one minute later, capped at the end of the audio
  # (the last slice may be shorter than one minute)
  end = min(start + one_minute, total_ms)
  # Slice the audio file from the start timestamp to the end timestamp
  sliced_audio = audio_file[start:end]
  # Export the sliced audio file in MP3 format with the specified filename
  sliced_audio.export(filename, format="mp3")
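
Because `slice_audio` handles one minute per call, covering a whole recording means calling it once per offset. The sketch below shows only the boundary arithmetic, in milliseconds, without touching any audio; the helper name `minute_boundaries` is hypothetical and not part of the notebook.

```python
def minute_boundaries(total_ms, chunk_ms=60_000):
    """Return (start_ms, end_ms) pairs covering total_ms in chunk_ms steps."""
    boundaries = []
    start = 0
    while start < total_ms:
        # Cap the final chunk at the end of the recording
        end = min(start + chunk_ms, total_ms)
        boundaries.append((start, end))
        start = end
    return boundaries


# A 150-second recording: two full minutes plus a 30-second remainder
print(minute_boundaries(150_000))
# [(0, 60000), (60000, 120000), (120000, 150000)]
```

The position of each pair in the list corresponds to the `offset` argument of `slice_audio`.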

This cell defines `create_srt_files`, which creates empty SRT files: one for the original captions and one for each requested subtitle language. It takes two arguments: the audio filename (used as the base name of the SRT files) and the list of languages. The files are created with Python's built-in `open` function.

In [10]:
def create_srt_files(audio_filename, languages):
  # Create a .srt file for the original caption
  with open(f'{audio_filename}.srt', 'w') as fp:
    pass
  # Create .srt files for each language in the languages list
  for language in languages:
    with open(f'{audio_filename}_{language}.srt', "w") as fp:
      pass
  # At this point, the .srt files for the original caption and all specified languages are created
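
The files created here start out empty. For reference, an SRT file is a sequence of cues, each made of an index line, a time range of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, the caption text, and a blank line. The sketch below formats one cue with the standard library only; both helper names are hypothetical, and the notebook itself imports `pysrt` for subtitle handling.

```python
def format_srt_timestamp(seconds):
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def format_srt_cue(index, start, end, text):
    """Build one SRT cue: index, time range, caption text, blank line."""
    return (
        f"{index}\n"
        f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}\n"
        f"{text}\n\n"
    )


print(format_srt_cue(1, 0.0, 2.5, "Hello"))
```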

This cell defines `assign_speakers`, which assigns caption segments to speakers. It takes two arguments: the caption segments and the diarization result. It iterates over the diarization result with `itertracks` to collect the start and end time of every turn for each speaker, then assigns each caption to the closest speaker. The assignment step itself is omitted from the listing.

In [11]:
def assign_speakers(caption_segments, diarization_result):
  # Initialize a dictionary to store speakers' timestamps and captions
  speakers = {}
  # Iterate over each turn of speech and its speaker
  for turn, track, speaker in diarization_result.itertracks(yield_label=True):
    # Store the start and end timestamp of each turn
    timestamp = {"start": turn.start, "end": turn.end}
    # If the speaker already exists in the dictionary, append the new timestamp
    # Otherwise, create a new entry for the speaker in the dictionary
    if speakers.get(speaker):
      speakers[speaker]['timestamp'].append(timestamp)
    else:
      speakers.update({speaker: {'timestamp': [timestamp], "captions": []}})
  # Iterate over each caption segment
  # Assign captions to the closest speaker based on the timestamps
  # ... (skipped for brevity)
  # After all captions are assigned, create a new dictionary to store captions for each speaker
  speaker_captions = {}
  for speaker, value in speakers.items():
    speaker_captions.update({speaker: value['captions']})
  # Return the final dictionary
  return speaker_captions
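
The assignment step is omitted from the listing above. One possible way to do it, shown here only as a sketch, is to give each caption to the speaker whose turns overlap it the most in time. This is an assumption and may differ from the notebook's own "closest speaker" rule. The turn structure mirrors `speakers[speaker]['timestamp']`; the caption shape (dicts with `start`, `end`, `text`, in seconds) and the function names are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_by_overlap(caption_segments, speaker_turns):
    """Map each speaker to the caption texts that overlap their turns most.

    caption_segments: list of {"start", "end", "text"} dicts (assumed shape).
    speaker_turns: {speaker: [{"start", "end"}, ...]}.
    Captions that overlap no turn at all are left unassigned.
    """
    result = {speaker: [] for speaker in speaker_turns}
    for seg in caption_segments:
        best_speaker, best_overlap = None, 0.0
        for speaker, turns in speaker_turns.items():
            total = sum(
                overlap(seg["start"], seg["end"], t["start"], t["end"])
                for t in turns
            )
            if total > best_overlap:
                best_speaker, best_overlap = speaker, total
        if best_speaker is not None:
            result[best_speaker].append(seg["text"])
    return result


turns = {
    "SPEAKER_00": [{"start": 0.0, "end": 5.0}],
    "SPEAKER_01": [{"start": 5.0, "end": 10.0}],
}
captions = [
    {"start": 0.5, "end": 4.0, "text": "Hello"},
    {"start": 4.5, "end": 8.0, "text": "Hi"},
]
print(assign_by_overlap(captions, turns))
# {'SPEAKER_00': ['Hello'], 'SPEAKER_01': ['Hi']}
```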

This cell runs the whole pipeline: it downloads the audio of each YouTube video, converts it to MP3, creates the subtitle files, splits the audio into chunks and transcribes them, merges the subtitle files, runs speaker diarization, assigns captions to speakers, and saves the result to a JSON file. The chunking, transcription, and merging steps are omitted from the listing.

In [None]:
youtube_url = "https://www.youtube.com/watch?v="  
# The key is the YouTube video's id
youtube_video = {
    "test1": "qeeA40t4MJY"
}
languages = ['chinese (traditional)', 'english', 'japanese'] 
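# Hugging Face access token, read from the environment instead of being hard-coded
# (the variable name HF_TOKEN is an assumption; set it before running this cell)
access_token = os.environ.get("HF_TOKEN", "")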
mp3_directory = './mp3/'
audio_directory = "./audio/"
srt_directory = "./"

# Create the mp3_directory if it does not exist yet
os.makedirs(mp3_directory, exist_ok=True)

# Load the whisper model
model = whisper_timestamped.load_model("small")  
caption_segments = []
# Iterate over each video in the youtube_video dictionary
for key, video_id in youtube_video.items():  
  audio_filename = "audio_" + key
  # Download the audio from the YouTube video
  download_Youtube_video_audio_file(youtube_url+video_id, audio_directory, audio_filename)  

  # Convert the downloaded audio file to MP3 format
  audio_filepath = audio_directory + audio_filename
  mp3_filepath = mp3_directory + audio_filename + ".mp3"
  convert_audio_file_to_mp3_format(audio_filepath, mp3_filepath)  
  print("convert mp3")
  # Load the converted MP3 file
  mp3_file = AudioSegment.from_file(mp3_filepath, 'mp3')
  print("AudioSegment mp3")
  # Create subtitle files for each specified language
  create_srt_files(audio_filename, languages)  
  print("create_srt_files mp3")
  # Transcribe the MP3 file into text
  # ... (skipped for brevity)

  # Perform speaker diarization on the MP3 file
  diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                            use_auth_token=access_token)
  diarization_result = diarization_pipeline(mp3_filepath)

  # Assign captions to speakers
  speakers_caption = assign_speakers(caption_segments, diarization_result) 

  # Save the results to a JSON file
  with open('./dirization_result.json', 'w') as fp: 
    json.dump(speakers_caption, fp, ensure_ascii=False, indent=2)  

Download url: https://www.youtube.com/watch?v=qeeA40t4MJY
Load the audio file
Convert the audio file to MP3 format

convert mp3
AudioSegment mp3
create_srt_files mp3


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../../../../home/ytl0623/.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
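
The transcription step is omitted from the pipeline cell. If the audio is transcribed one chunk at a time, the timestamps in each chunk presumably start at zero, so they need to be shifted by the chunk's position before they can be compared with the diarization turns, which refer to the full recording. The sketch below shows that shift under those assumptions; the function name and the segment shape (dicts with `start`, `end`, `text`, in seconds) are hypothetical.

```python
def merge_chunk_segments(chunks, chunk_seconds=60.0):
    """Merge per-chunk segments into one list with absolute timestamps.

    chunks: list with one entry per chunk, in order; each entry is a list
    of {"start", "end", "text"} dicts timed relative to the chunk start.
    """
    merged = []
    for offset, segments in enumerate(chunks):
        # Chunk number `offset` begins this many seconds into the recording
        shift = offset * chunk_seconds
        for seg in segments:
            merged.append({
                "start": seg["start"] + shift,
                "end": seg["end"] + shift,
                "text": seg["text"],
            })
    return merged


chunks = [
    [{"start": 1.0, "end": 3.0, "text": "first minute"}],
    [{"start": 2.0, "end": 4.0, "text": "second minute"}],
]
for seg in merge_chunk_segments(chunks):
    print(seg)
```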
