<a href="https://colab.research.google.com/github/yubinvip/Hung-yi-Lee-AI2024/blob/main/GenAI_HW9_2024_Spring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GenAI HW9: Quick Summary of Lecture Video (演講影片快速摘要)
## Objectives
- ### Learn to quickly build applications related to speech recognition using existing APIs. (學習以現成的API快速搭建語音辨識相關的應用。)


#### If you have any questions, please contact the TAs via TA hours, NTU COOL, or email to ntu-gen-ai-2024-spring-ta@googlegroups.com

# Part1 - Preparation

## The lecture video provided for this assignment

(1) For ease of processing, it has already been converted to an MP3 file.

(2) If you would like to view the original video, the link is here:

- 李琳山教授 信號與人生 (2023)

  - https://www.youtube.com/watch?v=MxoQV4M0jY8


(3) Since the original lecture video is quite long, we extracted the segment from 1:43:24 to 2:00:49 for this assignment.

## Install all necessary packages and import them

The following code block takes about **150** seconds to run, but it may vary slightly depending on the condition of Colab.

In [None]:
# Install packages.
!pip install git+https://github.com/openai/whisper.git
!pip install srt
# datetime is part of the Python standard library, so it does not need to be installed.
!pip install opencc
!pip install datasets
!pip install numpy
!pip install soundfile
!pip install IPython
!pip install openai
!pip install -q -U google-generativeai
!pip install anthropic


The following code block takes about **5** seconds to run, but it may vary slightly depending on the condition of Colab.

In [None]:
# Import packages.

import whisper
import srt
import datetime
import time
import os
import re
import pathlib
import textwrap
import numpy as np
import soundfile as sf
from opencc import OpenCC
from tqdm import tqdm
from datasets import load_dataset
from openai import OpenAI
import google.generativeai as genai
import anthropic

## Download data

The code block below takes about **10** seconds to run, although there might be some slight variation depending on the state of Colab.

In [None]:
# Load dataset.
dataset_name = "kuanhuggingface/NTU-GenAI-2024-HW9"
dataset = load_dataset(dataset_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



The code block below takes about **15** seconds to run, although there might be some slight variation depending on the state of Colab.

In [None]:
# Prepare audio.
input_audio = dataset["test"]["audio"][0]
input_audio_name = input_audio["path"]
input_audio_array = input_audio["array"].astype(np.float32)
sampling_rate = input_audio["sampling_rate"]

print(f"Now, we are going to transcribe the audio: 李琳山教授 信號與人生 (2023) ({input_audio_name}).")

Now, we are going to transcribe the audio: 李琳山教授 信號與人生 (2023) (ntu-gen-ai-2024-hw9-16k.mp3).
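
As a quick sanity check on the prepared audio, the clip boundaries given above (1:43:24 to 2:00:49) tell you how long the audio should be. The sketch below is only arithmetic; the 16 kHz sampling rate is an assumption inferred from the "16k" in the file name, and `to_seconds` is a hypothetical helper, not part of the assignment code.

```python
def to_seconds(timestamp: str) -> int:
    """Convert an 'h:mm:ss' timestamp into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Clip boundaries stated in the assignment text (positions in the original video).
clip_seconds = to_seconds("2:00:49") - to_seconds("1:43:24")

# Assumption: 16 kHz sampling rate, inferred from the "16k" in the file name.
assumed_sampling_rate = 16_000
expected_samples = clip_seconds * assumed_sampling_rate

print(f"Clip length: {clip_seconds} s ({clip_seconds // 60} min {clip_seconds % 60} s)")
print(f"Expected samples at {assumed_sampling_rate} Hz: {expected_samples}")
```

In the notebook you can compare these numbers with `len(input_audio_array) / sampling_rate`. A clip of 1045 seconds would also be consistent with the 104500 frames shown in the Whisper progress bar later on, if each frame covers 10 ms.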


# Part2 - Automatic Speech Recognition (ASR)
The function "speech_recognition" transcribes the audio with Whisper and saves the result as an SRT subtitle file.

In [None]:
def speech_recognition(model_name, input_audio, output_subtitle_path, decode_options, cache_dir="./"):
    '''
        (1) Objective:
            - This function aims to convert audio to subtitle.

        (2) Arguments:

            - model_name (str):
                The name of the model. There are five model sizes, including tiny, base, small, medium, large-v3.
                For example, you can use 'tiny', 'base', 'small', 'medium', 'large-v3' to specify the model name.
                You can see 'https://github.com/openai/whisper' for more details.

            - input_audio (Union[str, np.ndarray, torch.Tensor]):
                The path to the audio file to open, or the audio waveform
                - For example, if your input audio path is 'input.wav', you can use 'input.wav' to specify the input audio path.
                - For example, if your input audio array is 'audio_array', you can use 'audio_array' to specify the input audio array.

            - output_subtitle_path (str):
                The path of the output subtitle file.
                For example, if you want to save the subtitle file as 'output.srt', you can use 'output.srt' to specify the output subtitle path.

            - decode_options (dict):
                The options for decoding the audio file. This function reads the keys 'language', 'initial_prompt', and 'temperature'.
                - language (str):
                    The language spoken in the audio, e.g. 'zh'.

                - initial_prompt (str):
                    Optional text to provide as a prompt for the first window. This can be used to provide, or
                    "prompt-engineer" a context for transcription, e.g. custom vocabularies or proper nouns
                    to make it more likely to predict those words correctly.
                    Default: None.

                - temperature (float):
                    The temperature for sampling from the model. Higher values mean more randomness.
                    Default: 0.0

                You can see "https://github.com/openai/whisper/blob/main/whisper/decoding.py" and "https://github.com/openai/whisper/blob/main/whisper/transcribe.py"
                for more details.

            - cache_dir (str):
                The path of the cache directory for saving the model.
                For example, if you want to save the cache files in 'cache' directory, you can use 'cache' to specify the cache directory.
                Default: './'

        (3) Example:

            - If you want to use the 'base' model to convert 'input.wav' to 'output.srt' and save the cache files in 'cache' directory,
            you can call this function as follows:

                speech_recognition(model_name='base', input_audio='input.wav', output_subtitle_path='output.srt',
                                   decode_options={'language': 'zh', 'initial_prompt': None, 'temperature': 0.0}, cache_dir='cache')
    '''

    # Record the start time.
    start_time = time.time()

    print(f"=============== Loading Whisper-{model_name} ===============")

    # Load the model.
    model = whisper.load_model(name=model_name, download_root=cache_dir)

    print(f"Begin to utilize Whisper-{model_name} to transcribe the audio.")

    # Transcribe the audio.
    transcription = model.transcribe(audio=input_audio, language=decode_options["language"], verbose=False,
                                     initial_prompt=decode_options["initial_prompt"], temperature=decode_options["temperature"])

    # Record the end time.
    end_time = time.time()

    print(f"The process of speech recognition costs {end_time - start_time} seconds.")

    subtitles = []
    # Convert the transcription to subtitle and iterate over the segments.
    for i, segment in tqdm(enumerate(transcription["segments"])):

        # Convert the start time to subtitle format.
        start_time = datetime.timedelta(seconds=segment["start"])

        # Convert the end time to subtitle format.
        end_time = datetime.timedelta(seconds=segment["end"])

        # Get the subtitle text.
        text = segment["text"]

        # Append the subtitle to the subtitle list.
        subtitles.append(srt.Subtitle(index=i, start=start_time, end=end_time, content=text))

    # Convert the subtitle list to subtitle content.
    srt_content = srt.compose(subtitles)

    print(f"\n=============== Saving the subtitle to {output_subtitle_path} ===============")

    # Save the subtitle content to the subtitle file.
    with open(output_subtitle_path, "w", encoding="utf-8") as file:
        file.write(srt_content)

In the following block, you can set the Whisper parameters and the paths of the output files.

In [None]:
# @title Parameter Setting of Whisper { run: "auto" }

''' In this block, you can modify your desired parameters and the path of input file. '''

# The name of the model you want to use.
# For example, you can use 'tiny', 'base', 'small', 'medium', 'large-v3' to specify the model name.
# @markdown **model_name**: The name of the model you want to use.
model_name = "medium" # @param ["tiny", "base", "small", "medium", "large-v3"]

# Define the suffix of the output file.
# @markdown **suffix**: The output file name is "output-{suffix}.* ", where .* is the file extension (.txt or .srt)
suffix = "信號與人生" # @param {type: "string"}

# Path to the output file.
output_subtitle_path = f"./output-{suffix}.srt"

# Path of the output raw text file from the SRT file.
output_raw_text_path = f"./output-{suffix}.txt"

# Path to the directory where the model and dataset will be cached.
cache_dir = "./"

# The language of the lecture video.
# @markdown **language**: The language of the lecture video.
language = "zh" # @param {type:"string"}

# Optional text to provide as a prompt for the first window.
# @markdown **initial_prompt**: Optional text to provide as a prompt for the first window.
initial_prompt = "請用繁體中文" #@param {type:"string"}

# The temperature for sampling from the model. Higher values mean more randomness.
# @markdown  **temperature**: The temperature for sampling from the model. Higher values mean more randomness.
temperature = 0 # @param {type:"slider", min:0, max:1, step:0.1}

In [None]:
# Construct DecodingOptions
decode_options = {
    "language": language,
    "initial_prompt": initial_prompt,
    "temperature": temperature
}

In [None]:
# Print the settings.
message = "Transcribe 李琳山教授 信號與人生 (2023)"
print(f"Setting: (1) Model: whisper-{model_name} (2) Language: {language} (3) Initial Prompt: {initial_prompt} (4) Temperature: {temperature}")
print(message)

Setting: (1) Model: whisper-medium (2) Language: zh (3) Initial Prompt: 請用繁體中文 (4) Temperature: 0
Transcribe 李琳山教授 信號與人生 (2023)


The code block below takes about **90 (240)** seconds to run when using the **base (medium)** model and **a T4 GPU**, although there might be some slight variation depending on the state of Colab.

In [None]:
# Running ASR.
speech_recognition(model_name=model_name, input_audio=input_audio_array, output_subtitle_path=output_subtitle_path, decode_options=decode_options, cache_dir=cache_dir)



100%|██████████████████████████████████████| 1.42G/1.42G [00:12<00:00, 122MiB/s]
Begin to utilize Whisper-medium to transcribe the audio.
100%|██████████| 104500/104500 [02:45<00:00, 631.16frames/s]
The process of speech recognition costs 196.35759162902832 seconds.
370it [00:00, 169069.89it/s]

You can check the result of automatic speech recognition.

In [None]:
''' Open the SRT file and read its content.
The format of SRT is:

[Index]
[Begin time] (hour:minute:second,millisecond) --> [End time] (hour:minute:second,millisecond)
[Transcription]

'''

with open(output_subtitle_path, 'r', encoding='utf-8') as file:
    content = file.read()

print(content)
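
To make the SRT layout concrete, here is a minimal standard-library sketch that builds the same kind of content by hand. The notebook itself relies on the `srt` package for this; `format_srt_timestamp` and `compose_srt` are hypothetical helper names used only for illustration, and the two segments are made-up data.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def compose_srt(segments) -> str:
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'text' keys."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Made-up segments in the same shape as the entries of transcription["segments"].
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " first line"},
    {"start": 2.5, "end": 5.04, "text": " second line"},
]
print(compose_srt(example_segments))
```

Each block consists of an index, a time range, and the transcribed text, which is the structure the regular expression in Part3 relies on.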

# Part3 - Preprocess the results of automatic speech recognition

In [None]:
def extract_and_save_text(srt_filename, output_filename):

    '''
    (1) Objective:
        - This function extracts the text from an SRT file and saves it to a new text file.
        - It also converts the Simplified Chinese to Traditional Chinese.

    (2) Arguments:

        - srt_filename: The path to the SRT file.

        - output_filename: The name of the output text file.

    (3) Example:
        - If your SRT file is named 'subtitle.srt' and you want to save the extracted text to a file named 'output.txt', you can use the function like this:
            extract_and_save_text('subtitle.srt', 'output.txt')

    '''

    # Open the SRT file and read its content.
    with open(srt_filename, 'r', encoding='utf-8') as file:
        content = file.read()

    # Use regular expression to remove the timecode.
    pure_text = re.sub(r'\d+\n\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\n', '', content)

    # Remove the empty lines.
    pure_text = re.sub(r'\n\n+', '\n', pure_text)

    # Creating an instance of OpenCC for Simplified to Traditional Chinese conversion.
    cc = OpenCC('s2t')
    pure_text_conversion = cc.convert(pure_text)

    # Write the extracted text to a new file.
    with open(output_filename, 'w', encoding='utf-8') as output_file:
        output_file.write(pure_text_conversion)

    print(f'Extracted text has been saved to {output_filename}.\n\n')

    return pure_text_conversion
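
The two regular-expression steps in `extract_and_save_text` can be tried on a small in-memory SRT string. This sketch reuses the same two patterns but skips the OpenCC conversion and the file I/O; the sample subtitles are made up.

```python
import re

# A made-up SRT snippet with two subtitle blocks.
sample_srt = (
    "1\n00:00:00,000 --> 00:00:02,500\nfirst line\n\n"
    "2\n00:00:02,500 --> 00:00:05,040\nsecond line\n\n"
)

# Same pattern as in extract_and_save_text: an index line followed by a timecode line.
timecode_pattern = r'\d+\n\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\n'
without_timecodes = re.sub(timecode_pattern, '', sample_srt)

# Collapse the blank lines left between subtitle blocks.
pure_text = re.sub(r'\n\n+', '\n', without_timecodes)

print(repr(pure_text))  # 'first line\nsecond line\n'
```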

In [None]:
def chunk_text(text, max_length):
    """
    (1) Objective:
        - This function is used to split a long string into smaller strings of a specified length.

    (2) Arguments:
        - text: str, the long string to be split.
        - max_length: int, the maximum length of each smaller string.

    (3) Returns:
        - split_text: list, a list of smaller strings.

    (4) Example:
        - If you want to split a string named "long_string" into smaller strings of length 100, you can use the function like this:
            chunk_text(long_string, 100)

    """

    return textwrap.wrap(text, max_length)
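
It helps to know how `textwrap.wrap` behaves on Chinese text, which usually contains no spaces. The small sketch below (with made-up strings) shows that a run of characters without whitespace is cut every `max_length` characters, and that newlines are treated as whitespace and replaced by spaces.

```python
import textwrap

# Text without any whitespace is cut into pieces of at most 4 characters.
text_without_spaces = "一二三四五六七八九十"
pieces = textwrap.wrap(text_without_spaces, 4)
print(pieces)  # ['一二三四', '五六七八', '九十']

# A newline counts as whitespace: it becomes a space when the lines are joined.
text_with_newline = "一二三\n四五六"
joined = textwrap.wrap(text_with_newline, 10)
print(joined)  # ['一二三 四五六']
```

Because the length is measured in characters rather than words or tokens, a chunk may end in the middle of a sentence.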

In [None]:
''' In this block, you can modify your desired parameters and the path of input file. '''

# The maximum length of each text chunk (in characters).
chunk_length = 512

In [None]:
# Extract the text from the SRT file and save it to a new text file.
pure_text = extract_and_save_text(srt_filename=output_subtitle_path, output_filename=output_raw_text_path)

# Split the long document into smaller chunks of at most chunk_length characters.
chunks = chunk_text(text=pure_text, max_length=chunk_length)

# Show the length (in characters) and the content of each chunk.
print("Review the results of splitting the long text into several short texts.\n")
ordinal_suffixes = {1: "st", 2: "nd", 3: "rd"}
for index, chunk in enumerate(chunks):
    # Use "st", "nd", "rd" for the first three segments and "th" for the rest.
    ordinal = ordinal_suffixes.get(index + 1, "th")
    print(f"\n========== The {index + 1}-{ordinal} segment of the split ({len(chunk)} characters) ==========\n\n")
    for text in textwrap.wrap(chunk, 80):
        print(f"{text}\n")

# Part4 - Summarization


## **You only need to choose one of the following parts.**

## **If you want to use ChatGPT, begin with this part.**
##### (1) You can refer to https://shorturl.at/X0NDY (Page 44) for obtaining ChatGPT API key.
##### (2) You can refer to https://platform.openai.com/docs/models/overview for more details about models you can use.

In [None]:
def summarization(client, summarization_prompt, model_name="gpt-3.5-turbo", temperature=0.0, top_p=1.0, max_tokens=512):
    """
    (1) Objective:
        - Use the OpenAI Chat API to summarize a given text.

    (2) Arguments:
        - client: OpenAI Chat API client.
        - summarization_prompt: The summarization prompt, including the text that needs to be summarized.
        - model_name: The model name, default is "gpt-3.5-turbo". You can refer to "https://platform.openai.com/docs/models/overview" for more details.
        - temperature: Controls randomness in the response. Lower values make responses more deterministic, default is 0.0.
        - top_p: Controls diversity via nucleus sampling. Higher values lead to more diverse responses, default is 1.0.
        - max_tokens: The maximum number of tokens to generate in the completion, default is 512.

    (3) Return:
        - The summarized text.

    (4) Example:
        - If the summarization prompt is "DEF", model_name is "gpt-3.5-turbo", temperature is 0.0, top_p is 1.0,
          and max_tokens is 512, then you can call the function like this:

              summarization(client=client, summarization_prompt="DEF", model_name="gpt-3.5-turbo", temperature=0.0, top_p=1.0, max_tokens=512)

    """

    # The summarization prompt already contains the text to be summarized, so it is used directly as the user prompt.
    user_prompt = summarization_prompt

    while True:

        try:
            # Use the OpenAI Chat API to summarize the text.
            chat_completion = client.chat.completions.create(
                messages=[
                    {
                        "role": "user",
                        "content": user_prompt,
                    }
                ],
                model=model_name,
                temperature=temperature,
                top_p=top_p,
                max_tokens=max_tokens
            )

            break

        except Exception as error:
            # If the API call fails, report the error, wait for 1 second, and try again.
            # Catching Exception (instead of a bare except) keeps the cell interruptible.
            print(f"The API call failed ({error}); waiting 1 second before trying again.")
            time.sleep(1)

    return chat_completion.choices[0].message.content
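
The function above retries forever, which can hang the notebook if the failure is permanent (for example, an invalid API key). One possible alternative is a bounded retry with exponential backoff. The sketch below is not part of the assignment code: `call_with_retries` and `flaky_request` are hypothetical names, and the flaky function only simulates an API call that fails twice before succeeding.

```python
import time


def call_with_retries(request_fn, max_attempts=5, initial_delay=1.0, sleep_fn=time.sleep):
    """Call request_fn until it succeeds, waiting longer after each failure.

    Re-raises the last error once max_attempts calls have failed.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception as error:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({error}); retrying in {delay} s.")
            sleep_fn(delay)
            delay *= 2  # Double the waiting time after every failure.


# Hypothetical stand-in for an API call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "summary text"


# sleep_fn is replaced by a no-op so that the demonstration runs instantly.
result = call_with_retries(flaky_request, sleep_fn=lambda seconds: None)
print(result)
```

In the notebook, `request_fn` could be a small lambda wrapping the `client.chat.completions.create(...)` call.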

In [None]:
# @title Parameter Setting of ChatGPT { run: "auto" }
''' ===== In this block, you can modify your desired parameters and set your OpenAI API key ===== '''

# Your OpenAI API key.
# @markdown **openai_api_key**: Your OpenAI API key.
openai_api_key = "YOUR_OPENAI_API_KEY" # @param {type:"string"}

# The model name, default is "gpt-3.5-turbo". You can refer to "https://platform.openai.com/docs/models/overview" for more details.
# @markdown **model_name**: The model name, default is "gpt-3.5-turbo". You can refer to "https://platform.openai.com/docs/models/overview" for more details.
model_name = "gpt-3.5-turbo" # @param {type: "string"}

# Controls randomness in the response. Lower values make responses more deterministic.
# @markdown **temperature**: Controls randomness in the response. Lower values make responses more deterministic.
temperature = 0 # @param {type:"slider", min:0, max:1, step:0.1}

# Controls diversity via nucleus sampling. Higher values lead to more diverse responses.
# @markdown **top_p**: Controls diversity via nucleus sampling. Higher values lead to more diverse responses.
top_p = 0 # @param {type:"slider", min:0, max:1, step:0.1}

In [None]:
# Construct openai client.
client = OpenAI(api_key=openai_api_key)
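
If you prefer not to paste the key into the notebook, one option is to read it from an environment variable and fail early when it is missing. This is a minimal sketch under assumptions: `read_api_key` is a hypothetical helper, the variable name `OPENAI_API_KEY` is a convention rather than something this assignment requires, and the key in the demonstration is fake.

```python
import os

# The placeholder value used in the parameter cell of this notebook.
PLACEHOLDER = "YOUR_OPENAI_API_KEY"


def read_api_key(variable_name="OPENAI_API_KEY", environ=None):
    """Return the API key stored in an environment variable, or raise if it is unset."""
    environ = os.environ if environ is None else environ
    api_key = environ.get(variable_name, "").strip()
    if not api_key or api_key == PLACEHOLDER:
        raise ValueError(f"Set the {variable_name} environment variable to your API key.")
    return api_key


# Demonstration with a fake environment, so no real key is needed.
fake_environment = {"OPENAI_API_KEY": "sk-example-not-a-real-key"}
print(read_api_key(environ=fake_environment))
```

The returned string can then be passed to `OpenAI(api_key=...)` in place of the hard-coded value.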

The code block below takes about **30** seconds to run when using the **gpt-3.5-turbo** model, but the actual time may vary depending on the condition of Colab and the status of the OpenAI API.

### We offer the following two methods for summarization.
Reference: https://reurl.cc/VzagLA

#### **If you want to use the method of Multi-Stage Summarization, begin with this part.**


In [None]:
# @title Prompt Setting of ChatGPT Multi-Stage Summarization: Paragraph { run: "auto" }
''' You can modify the summarization prompt and maximum number of tokens. '''
''' However, DO NOT modify the part of <text>.'''

# The maximum number of tokens to generate in the completion.
# @markdown **max_tokens**: The maximum number of tokens to generate in the completion.
max_tokens = 350 # @param {type:"integer"}

# @markdown #### Changing **summarization_prompt_template**
# @markdown You can modify the summarization prompt and maximum number of tokens. However, **DO NOT** modify the part of `<text>`.
summarization_prompt_template = "用 300 個字內寫出這段文字的摘要，其中包括要點和所有重要細節：<text>" # @param {type:"string"}

##### Step1: Split the long text into multiple smaller pieces and obtain summaries for each smaller text piece separately

The code block below takes about **80** seconds to run when (1) the **gpt-3.5-turbo** model is used, (2) the chunk length is 512, and (3) the maximum number of tokens is 250, but the actual time may vary depending on the condition of Colab and the status of the OpenAI API.

In [None]:
paragraph_summarizations = []

# First, we summarize each section that has been split up separately.
for index, chunk in enumerate(chunks):

    # Record the start time.
    start = time.time()

    # Construct summarization prompt.
    summarization_prompt = summarization_prompt_template.replace("<text>", chunk)

    # We summarize each section that has been split up separately.
    response = summarization(client=client, summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

    # Calculate the execution time and round it to 2 decimal places.
    cost_time = round(time.time() - start, 2)

    # Print the summary and its length.
    print(f"----------------------------Summary of Segment {index + 1}----------------------------\n")
    for text in textwrap.wrap(response, 80):
        print(f"{text}\n")
    print(f"Length of summary for segment {index + 1}: {len(response)}")
    print(f"Time taken to generate summary for segment {index + 1}: {cost_time} sec.\n")

    # Record the result.
    paragraph_summarizations.append(response)

----------------------------Summary of Segment 1----------------------------

學問是需要透過實際做來獲得的。單純地聽或閱讀可能無法真正吸收知識。透過完成作業或專案，將所學知識整合並應用，才能真正理解和掌握。舉例來說，完成一個專案或寫一個程

式可以幫助我們將學到的知識融會貫通。因此，我們應該多動手多做，讓學問真正成為自己的。成績單上的分數並不一定能完全反映我們對知識的掌握程度。

Length of summary for segment 1: 149
Time taken to generate summary for segment 1: 4.82 sec.

----------------------------Summary of Segment 2----------------------------

許多人認為某門課不重要，甚至選擇不修或不做 final project，因為覺得太累或不必要。然而，這些看似無用的事實際上是學習的機會，能夠讓你全面成長。即使做

了很多辛苦的事情可能沒有具體成果，但對於全面學習和思考能力的培養卻是很有幫助的。重要的是要不漏掉這些機會，並在讀書時思考數學式子和概念，培養自己的思考能力和習慣

。這樣才能真正理解所學知識，並提升自己的思考能力。

Length of summary for segment 2: 185
Time taken to generate summary for segment 2: 5.35 sec.

----------------------------Summary of Segment 3----------------------------

學習不僅僅存在於課業之中，課業外也有很多可以學習的地方。學習是一種增長、進步和獲得快樂的過程。例如，打球可以提升健康、手腦協調、團隊精神和人際互動能力；爬山可以

增加見識和學習到很多東西；旅行也可以擴展視野和增加經驗。無論是什麼活動，只要讓你感到快樂並有所增長和進步，都值得花時間和精力去學習。所以，把這些課外活動看成是學

習的機會，都是值得努力去做的事情。

Length of summary for segment 3: 177
Time taken to 

In [None]:
# First, we collect all the summarizations obtained before and print them.

collected_summarization = ""
for index, paragraph_summarization in enumerate(paragraph_summarizations):
    collected_summarization += f"Summary of segment {index + 1}: {paragraph_summarization}\n"

print(collected_summarization)

Summary of segment 1: 學問是需要透過實際做來獲得的。單純地聽或閱讀可能無法真正吸收知識。透過完成作業或專案，將所學知識整合並應用，才能真正理解和掌握。舉例來說，完成一個專案或寫一個程式可以幫助我們將學到的知識融會貫通。因此，我們應該多動手多做，讓學問真正成為自己的。成績單上的分數並不一定能完全反映我們對知識的掌握程度。
Summary of segment 2: 許多人認為某門課不重要，甚至選擇不修或不做 final project，因為覺得太累或不必要。然而，這些看似無用的事實際上是學習的機會，能夠讓你全面成長。即使做了很多辛苦的事情可能沒有具體成果，但對於全面學習和思考能力的培養卻是很有幫助的。重要的是要不漏掉這些機會，並在讀書時思考數學式子和概念，培養自己的思考能力和習慣。這樣才能真正理解所學知識，並提升自己的思考能力。
Summary of segment 3: 學習不僅僅存在於課業之中，課業外也有很多可以學習的地方。學習是一種增長、進步和獲得快樂的過程。例如，打球可以提升健康、手腦協調、團隊精神和人際互動能力；爬山可以增加見識和學習到很多東西；旅行也可以擴展視野和增加經驗。無論是什麼活動，只要讓你感到快樂並有所增長和進步，都值得花時間和精力去學習。所以，把這些課外活動看成是學習的機會，都是值得努力去做的事情。
Summary of segment 4: 談戀愛和交朋友都是學習人際互動和溝通的好機會，即使沒有緣分也可以透過交朋友來學習。在大學裡，有很多機會參加各種活動，例如戲劇社或舞蹈社，這些活動可以幫助我們成長和進步。即使不是表演者，幕後工作也同樣重要，例如規劃或軟體組。在電機系裡，有很多優秀的同學可以成為好朋友，所以努力交朋友是有幫助的。透過參加各種活動，我們可以學到很多人與人之間的互動和溝通技巧，這對我們的成長和發展都是很重要的。
Summary of segment 5: 這段文字談論到不僅在電機工程領域，其他活動也能帶來成長和進步，包括戲劇學會和校內外的各種活動。儘管這些活動沒有考試或成績，但它們對個人發展非常重要。在今天的電機工程中，成功很少是一個人完成的，需要與他人合作。學習如何進入團隊、成為領導者並推動想做的事情是必要的。這些被稱為軟實力，是除了硬實力外的重要技能。
Summary of segment 6: 成功與否不

##### Step2: After obtaining summaries for each smaller text piece separately, process these summaries to generate the final summary.

In [None]:
# @title Prompt Setting of ChatGPT Multi-Stage Summarization: Total { run: "auto" }
''' You can modify the summarization prompt and maximum number of tokens. '''
''' However, DO NOT modify the part of <text>.'''

# We set the maximum number of tokens to ensure that the final summary does not exceed 550 tokens.
# @markdown **max_tokens**: We set the maximum number of tokens to ensure that the final summary does not exceed 550 tokens.
max_tokens = 550 # @param {type:"integer"}

# @markdown ### Changing **summarization_prompt_template**
# @markdown You can modify the summarization prompt and maximum number of tokens. However, **DO NOT** modify the part of `<text>`.
summarization_prompt_template = "在 500 字以內寫出以下文字的簡潔摘要：<text>" # @param {type:"string"}

The code block below takes about **10** seconds to run when (1) the **gpt-3.5-turbo** model is used and (2) the maximum number of tokens is 500, but the actual time may vary depending on the condition of Colab and the status of the OpenAI API.

In [None]:
# Finally, we compile a final summary from the summaries of each section.

# Record the start time.
start = time.time()

# Run final summarization.
summarization_prompt = summarization_prompt_template.replace("<text>", collected_summarization)
final_summarization = summarization(client=client, summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

# Calculate the execution time and round it to 2 decimal places.
cost_time = round(time.time() - start, 2)

# Print the summary and its length.
print("----------------------------Final Summary----------------------------\n")
for text in textwrap.wrap(final_summarization, 80):
    print(f"{text}")
print(f"\nLength of final summary: {len(final_summarization)}")
print(f"Time taken to generate the final summary: {cost_time} sec.")

In [None]:
''' In this block, you can modify your desired output path of final summary. '''

output_path = f"./final-summary-{suffix}-chatgpt-multi-stage.txt"

# If you need to convert Simplified Chinese to Traditional Chinese, please set this option to True; otherwise, set it to False.
convert_to_tradition_chinese = False

if convert_to_tradition_chinese:
    # Create an OpenCC instance for Simplified-to-Traditional Chinese conversion.
    cc = OpenCC('s2t')
    final_summarization = cc.convert(final_summarization)

# Write the final summary to the output file (UTF-8, since the summary is in Chinese).
with open(output_path, "w", encoding="utf-8") as fp:
    fp.write(final_summarization)

# Show the result.
print(f"Final summary has been saved to {output_path}")
print(f"\n===== Below is the final summary ({len(final_summarization)} characters) =====\n")
for text in textwrap.wrap(final_summarization, 64):
    print(text)
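
The multi-stage method above can be condensed into one small function: summarize every chunk separately, collect the partial summaries, and summarize the collection once more. The sketch below replaces the LLM call with a hypothetical stub (`stub_summarize`) so that it runs without an API key; `multi_stage_summarize` and the English prompt templates are also illustrative names, not part of the assignment code.

```python
def multi_stage_summarize(chunks, summarize_fn, paragraph_template, final_template):
    """Summarize each chunk on its own, then summarize the collected summaries."""
    # Stage 1: one summarization call per chunk.
    paragraph_summaries = [
        summarize_fn(paragraph_template.replace("<text>", chunk)) for chunk in chunks
    ]

    # Collect the partial summaries in the same format as the notebook.
    collected = "".join(
        f"Summary of segment {index + 1}: {summary}\n"
        for index, summary in enumerate(paragraph_summaries)
    )

    # Stage 2: a single call that condenses all partial summaries.
    return summarize_fn(final_template.replace("<text>", collected))


# Hypothetical stand-in for the LLM: records each prompt and returns a numbered tag.
calls = []


def stub_summarize(prompt):
    calls.append(prompt)
    return f"<summary {len(calls)}>"


final = multi_stage_summarize(
    ["chunk A", "chunk B", "chunk C"],
    stub_summarize,
    paragraph_template="Summarize: <text>",
    final_template="Combine: <text>",
)
print(final)       # <summary 4>
print(len(calls))  # 4 calls: one per chunk plus one final call
```

The number of API calls is the number of chunks plus one, and the chunk-level calls are independent of each other.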

#### **If you want to use the method of Refinement, begin with this part.**



In [None]:
# @title Prompt Setting of ChatGPT Refinement { run: "auto" }
''' You can modify the summarization prompt and maximum number of tokens. '''
''' However, DO NOT modify the part of <text>.'''

# We set the maximum number of tokens.
# @markdown **max_tokens**: We set the maximum number of tokens.
max_tokens = 550 # @param {type:"integer"}

# @markdown ### Changing **summarization_prompt_template** and **summarization_prompt_refine_template**
# @markdown You can modify the summarization prompt and maximum number of tokens. However, **DO NOT** modify the part of `<text>`.

# Initial prompt.
# @markdown **summarization_prompt_template**: Initial prompt.
summarization_prompt_template = "請在 300 字以內，提供以下文字的簡潔摘要:<text>" # @param {type:"string"}

# Refinement prompt.
# @markdown **summarization_prompt_refinement_template**: Refinement prompt.
summarization_prompt_refinement_template = "請在 500 字以內，結合原先的摘要和新的內容，提供簡潔的摘要:<text>" # @param {type:"string"}

The code block below takes about **200** seconds to run when (1) the **gpt-3.5-turbo** model is used and (2) the maximum number of tokens is 500, but the actual time may vary depending on the condition of Colab and the status of the OpenAI API.

Pipeline of the method of Refinement.

Step1: It starts by running a prompt on a small portion of the data, generating initial output.

Step2: For each following document, the previous output is fed in along with the new document.

Step3: The LLM is instructed to refine the output based on the new document's information.

Step4: This process continues iteratively until all documents have been processed.
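
The four steps above can be sketched as a short loop that carries a running summary from one chunk to the next. As an illustration only, the LLM call is replaced by a hypothetical stub (`stub_summarize`) so that the example runs without an API key; `refine_summarize` and the English templates are assumed names, not part of the assignment code.

```python
def refine_summarize(chunks, summarize_fn, initial_template, refinement_template):
    """Iteratively refine one running summary as each new chunk is read."""
    running_summary = None
    for index, chunk in enumerate(chunks):
        if index == 0:
            # Step 1: summarize the first chunk on its own.
            prompt = initial_template.replace("<text>", chunk)
        else:
            # Step 2: feed in the previous summary together with the new chunk.
            combined = (
                f"Summary of the first {index} segments: {running_summary}\n"
                f"Content of segment {index + 1}: {chunk}"
            )
            prompt = refinement_template.replace("<text>", combined)
        # Step 3: ask the model for a new (refined) summary.
        running_summary = summarize_fn(prompt)
    # Step 4: after the last chunk, the running summary is the final summary.
    return running_summary


# Hypothetical stand-in for the LLM: records each prompt and returns a numbered tag.
prompts = []


def stub_summarize(prompt):
    prompts.append(prompt)
    return f"<summary {len(prompts)}>"


final = refine_summarize(
    ["chunk A", "chunk B", "chunk C"],
    stub_summarize,
    initial_template="Summarize: <text>",
    refinement_template="Refine: <text>",
)
print(final)       # <summary 3>
print(prompts[1])  # The second prompt contains <summary 1> and chunk B.
```

Unlike the multi-stage method, each call here depends on the previous one, so the calls must run in order.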

In [None]:
paragraph_summarizations = []

# Process the chunks in order, refining a running summary each time a new chunk is read.
for index, chunk in enumerate(chunks):

    if index == 0:
        # Record the start time.
        start = time.time()

        # Construct summarization prompt.
        summarization_prompt = summarization_prompt_template.replace("<text>", chunk)

        # Step1: It starts by running a prompt on a small portion of the data, generating initial output.
        first_paragraph_summarization = summarization(client=client, summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

        # Record the result.
        paragraph_summarizations.append(first_paragraph_summarization)

        # Calculate the execution time and round it to 2 decimal places.
        cost_time = round(time.time() - start, 2)

        # Print the summary and its length.
        print(f"----------------------------Summary of Segment {index + 1}----------------------------\n")
        for text in textwrap.wrap(first_paragraph_summarization, 80):
            print(f"{text}\n")
        print(f"Length of summary for segment {index + 1}: {len(first_paragraph_summarization)}")
        print(f"Time taken to generate summary for segment {index + 1}: {cost_time} sec.\n")


    else:
        # Record the start time.
        start = time.time()

        # Step2: For each following document, the previous output is fed in along with the new document.
        # Note: this variable must not be named chunk_text, or it would overwrite the chunk_text() function defined earlier.
        refinement_input = f"""前 {index} 段的摘要: {paragraph_summarizations[-1]}\n第 {index + 1} 段的內容: {chunk}"""

        # Construct refinement prompt for summarization.
        summarization_prompt = summarization_prompt_refinement_template.replace("<text>", refinement_input)

        # Step3: The LLM is instructed to refine the output based on the new document's information.
        paragraph_summarization = summarization(client=client, summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

        # Record the result.
        paragraph_summarizations.append(paragraph_summarization)

        # Calculate the execution time and round it to 2 decimal places.
        cost_time = round(time.time() - start, 2)

        # print results.
        print(f"----------------------------Summary of the First {index + 1} Segments----------------------------\n")
        for text in textwrap.wrap(paragraph_summarization, 80):
            print(f"{text}\n")
        print(f"Length of summary for the first {index + 1} segments: {len(paragraph_summarization)}")
        print(f"Time taken to generate summary for the first {index + 1} segments: {cost_time} sec.\n")

    # Step4: This process continues iteratively until all documents have been processed.

In [None]:
''' In this block, you can modify your desired output path of final summary. '''

output_path = f"./final-summary-{suffix}-chatgpt-refinement.txt"

# If you need to convert Simplified Chinese to Traditional Chinese, please set this option to True; otherwise, set it to False.
convert_to_tradition_chinese = False

if convert_to_tradition_chinese == True:
    # Creating an instance of OpenCC for Simplified to Traditional Chinese conversion.
    cc = OpenCC('s2t')
    paragraph_summarizations[-1] = cc.convert(paragraph_summarizations[-1])

# Output your final summary
with open(output_path, "w", encoding="utf-8") as fp:
    fp.write(paragraph_summarizations[-1])

# Show the result.
print(f"Final summary has been saved to {output_path}")
print(f"\n===== Below is the final summary ({len(paragraph_summarizations[-1])} characters) =====\n")
for text in textwrap.wrap(paragraph_summarizations[-1], 80):
    print(text)

## **If you want to use Gemini, begin with this part.**
##### (1) You can refer to https://shorturl.at/X0NDY (Page 35) for how to obtain a Gemini API key.
##### (2) You can refer to https://ai.google.dev/models/gemini for more details about which models you can use.

In [None]:
def summarization(summarization_prompt, model_name="gemini-pro", temperature=0.0, top_p=1.0, max_tokens=512):
    """
    (1) Objective:
        - Use the Gemini API to summarize a given text.

    (2) Arguments:
        - summarization_prompt: The summarization prompt.
        - model_name: The model name, default is "gemini-pro". You can refer to "https://ai.google.dev/models/gemini" for more details.
        - temperature: Controls randomness in the response. Lower values make responses more deterministic, default is 0.0.
        - top_p: Controls diversity via nucleus sampling. Higher values lead to more diverse responses, default is 1.0.
        - max_tokens: The maximum number of tokens to generate in the completion, default is 512.

    (3) Return:
        - The summarized text.

    (4) Example:
        - If the summarization prompt is "DEF" (with the text to summarize already inserted), model_name is "gemini-pro",
          temperature is 0.0, top_p is 1.0, and max_tokens is 512, then you can call the function like this:

              summarization(summarization_prompt="DEF", model_name="gemini-pro", temperature=0.0, top_p=1.0, max_tokens=512)

    """

    # The summarization prompt already contains the text to summarize, so it is used as the user prompt directly.
    user_prompt = summarization_prompt

    # Load the generative model.
    model = genai.GenerativeModel(model_name)

    # Set the generation configuration.
    generation_config = genai.GenerationConfig(temperature=temperature, top_p=top_p, max_output_tokens=max_tokens)

    while True:

        try:
            # Use the Gemini API to summarize the text.
            response = model.generate_content(contents=user_prompt, generation_config=generation_config)

            break

        except Exception as error:
            # If the API call fails, wait for 1 second and try again.
            # Catching Exception (rather than using a bare except) still lets you interrupt the cell manually.
            print(f"The API call failed ({error}); waiting 1 second before retrying.")
            time.sleep(1)

    return response.text

In [None]:
# @title Parameter Setting of Gemini { run: "auto" }
''' In this block, you can modify your desired parameters and set your api key. '''

# Your google api key.
# @markdown **google_api_key**: Your google api key.
google_api_key = "YOUR_GOOGLE_API_KEY" # @param {type:"string"}

# The model name. You can refer to "https://ai.google.dev/models/gemini" for more details.
# @markdown **model_name**: The model name. You can refer to "https://ai.google.dev/models/gemini" for more details.
model_name = "gemini-pro" # @param {type:"string"}

# Controls randomness in the response. Lower values make responses more deterministic
# @markdown **temperature**: Controls randomness in the response. Lower values make responses more deterministic.
temperature = 0.0 # @param {type:"slider", min:0, max:1, step:0.1}

# Controls diversity via nucleus sampling. Higher values lead to more diverse responses
# @markdown **top_p**: Controls diversity via nucleus sampling. Higher values lead to more diverse responses.
top_p = 1.0 # @param {type:"slider", min:0, max:1, step:0.1}

In [None]:
# Set Google API key.
genai.configure(api_key=google_api_key)

### We offer the following two methods for summarization.
Reference: https://reurl.cc/VzagLA

#### **If you want to use the method of Multi-Stage Summarization, begin with this part.**


In [None]:
# @title Prompt Setting of Gemini Multi-Stage Summarization: Paragraph { run: "auto" }
''' You can modify the summarization prompt and maximum number of tokens. '''
''' However, DO NOT modify the part of <text>.'''

# The maximum number of tokens to generate in the completion.
# @markdown **max_tokens**: The maximum number of tokens to generate in the completion.
max_tokens = 350 # @param {type:"integer"}

# @markdown #### Changing **summarization_prompt_template**
# @markdown You can modify the summarization prompt and maximum number of tokens. However, **DO NOT** modify the part of `<text>`.
summarization_prompt_template = "用 300 個字內寫出這段文字的摘要，其中包括要點和所有重要細節：<text>" # @param {type:"string"}

##### Step1: Split the long text into multiple smaller pieces and obtain summaries for each smaller text piece separately


The code block below takes about **40** seconds to run when using the (1) **gemini-pro** model, (2) length of chunks is 512 and (3) maximum number of tokens is 350, but the actual time may vary depending on the condition of Colab and the status of the Google API.

In [None]:
paragraph_summarizations = []

# First, we summarize each section that has been split up separately.
for index, chunk in enumerate(chunks):

    # Record the start time.
    start = time.time()

    # Construct summarization prompt.
    summarization_prompt = summarization_prompt_template.replace("<text>", chunk)

    # We summarize each section that has been split up separately.
    response = summarization(summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

    # Calculate the execution time and round it to 2 decimal places.
    cost_time = round(time.time() - start, 2)

    # Print the summary and its length.
    print(f"----------------------------Summary of Segment {index + 1}----------------------------\n")
    for text in textwrap.wrap(response, 80):
        print(f"{text}\n")
    print(f"Length of summary for segment {index + 1}: {len(response)}")
    print(f"Time taken to generate summary for segment {index + 1}: {cost_time} sec.\n")

    # Record the result.
    paragraph_summarizations.append(response)

In [None]:
# First, we collect all the summarizations obtained before and print them.

collected_summarization = ""
for index, paragraph_summarization in enumerate(paragraph_summarizations):
    collected_summarization += f"Summary of segment {index + 1}: {paragraph_summarization}\n"

print(collected_summarization)

##### Step2: After obtaining summaries for each smaller text piece separately, process these summaries to generate the final summary.
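
Taken together, Step1 and Step2 amount to a summarize-then-combine pattern. The following is a minimal sketch of that pattern, not the notebook's implementation: `multi_stage_summarize`, `fake_summarize`, and the short templates are hypothetical names introduced for illustration, and the stub stands in for a real API call so that the control flow can be run without an API key.

```python
from typing import Callable, List


def multi_stage_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],
    paragraph_template: str,
    total_template: str,
) -> str:
    """Summarize each chunk separately, then summarize the collected summaries."""
    # Step1: one call per chunk, each independent of the others.
    partial_summaries = [
        summarize(paragraph_template.replace("<text>", chunk)) for chunk in chunks
    ]

    # Label each partial summary with its segment number and join them.
    collected = "".join(
        f"Summary of segment {i + 1}: {summary}\n"
        for i, summary in enumerate(partial_summaries)
    )

    # Step2: a single final call over the collected summaries.
    return summarize(total_template.replace("<text>", collected))


# Hypothetical stub standing in for an LLM call; it records each prompt it receives.
received_prompts: List[str] = []


def fake_summarize(prompt: str) -> str:
    received_prompts.append(prompt)
    return f"S{len(received_prompts)}"


final = multi_stage_summarize(
    chunks=["aaa", "bbb"],
    summarize=fake_summarize,
    paragraph_template="P:<text>",
    total_template="T:<text>",
)
print(final)
print(len(received_prompts))
```

With N chunks this pattern makes N + 1 calls, and the per-chunk calls do not depend on one another.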

In [None]:
# @title Prompt Setting of Gemini Multi-Stage Summarization: Total { run: "auto" }
''' You can modify the summarization prompt and maximum number of tokens. '''
''' However, DO NOT modify the part of <text>.'''

# We set the maximum number of tokens to ensure that the final summary does not exceed 550 tokens.
# @markdown **max_tokens**: We set the maximum number of tokens to ensure that the final summary does not exceed 550 tokens.
max_tokens = 550 # @param {type:"integer"}

# @markdown ### Changing **summarization_prompt_template**
# @markdown You can modify the summarization prompt and maximum number of tokens. However, **DO NOT** modify the part of `<text>`.
summarization_prompt_template = "在 500 字以內寫出以下文字的簡潔摘要：<text>" # @param {type:"string"}

The code block below takes about **20** seconds to run when using the (1) **gemini-pro** model, (2) length of chunks is 512 and (3) maximum number of tokens is 550, but the actual time may vary depending on the condition of Colab and the status of the Google API.

In [None]:
# Finally, we compile a final summary from the summaries of each section.

# Record the start time.
start = time.time()

# Run final summarization.
summarization_prompt = summarization_prompt_template.replace("<text>", collected_summarization)
final_summarization = summarization(summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

# Calculate the execution time and round it to 2 decimal places.
cost_time = round(time.time() - start, 2)

# Print the summary and its length.
print(f"----------------------------Final Summary----------------------------\n")
for text in textwrap.wrap(final_summarization, 80):
    print(f"{text}")
print(f"\nLength of final summary: {len(final_summarization)}")
print(f"Time taken to generate the final summary: {cost_time} sec.")

In [None]:
''' In this block, you can modify your desired output path of final summary. '''

output_path = f"./final-summary-{suffix}-gemini-multi-stage.txt"

# If you need to convert Simplified Chinese to Traditional Chinese, please set this option to True; otherwise, set it to False.
convert_to_tradition_chinese = False

if convert_to_tradition_chinese == True:
    # Creating an instance of OpenCC for Simplified to Traditional Chinese conversion.
    cc = OpenCC('s2t')
    final_summarization = cc.convert(final_summarization)

# Output your final summary
with open(output_path, "w", encoding="utf-8") as fp:
    fp.write(final_summarization)

print(f"Final summary has been saved to {output_path}")

#### **If you want to use the method of Refinement, begin with this part.**



In [None]:
# @title Prompt Setting of Gemini Refinement { run: "auto" }
''' You can modify the summarization prompt and maximum number of tokens. '''
''' However, DO NOT modify the part of <text>.'''

# We set the maximum number of tokens.
# @markdown **max_tokens**: We set the maximum number of tokens.
max_tokens = 550 # @param {type:"integer"}

# @markdown ### Changing **summarization_prompt_template** and **summarization_prompt_refinement_template**
# @markdown You can modify the summarization prompt and maximum number of tokens. However, **DO NOT** modify the part of `<text>`.

# Initial prompt.
# @markdown **summarization_prompt_template**: Initial prompt.
summarization_prompt_template = "請在 300 字以內，提供以下文字的簡潔摘要:<text>" # @param {type:"string"}

# Refinement prompt.
# @markdown **summarization_prompt_refinement_template**: Refinement prompt.
summarization_prompt_refinement_template = "請在 500 字以內，結合原先的摘要和新的內容，提供簡潔的摘要:<text>" # @param {type:"string"}

Pipeline of the method of Refinement.

Step1: It starts by running a prompt on a small portion of the data, generating initial output.

Step2: For each following document, the previous output is fed in along with the new document.

Step3: The LLM is instructed to refine the output based on the new document's information.

Step4: This process continues iteratively until all documents have been processed.
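
The four steps above can be sketched independently of any particular API. This is an illustrative sketch under stated assumptions rather than the notebook's implementation: `refine_summarize`, `recording_stub`, and the short English templates are hypothetical, and the stub replaces the real model call so the loop can be run offline.

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],
    initial_template: str,
    refinement_template: str,
) -> List[str]:
    """Return the running summary after each chunk; the last item is the final summary."""
    running_summaries: List[str] = []
    for index, chunk in enumerate(chunks):
        if index == 0:
            # Step1: summarize the first chunk on its own.
            prompt = initial_template.replace("<text>", chunk)
        else:
            # Step2: feed the previous summary in along with the new chunk.
            combined = (
                f"Summary of the first {index} segments: {running_summaries[-1]}\n"
                f"Content of segment {index + 1}: {chunk}"
            )
            # Step3: ask for a refined summary that covers both.
            prompt = refinement_template.replace("<text>", combined)
        running_summaries.append(summarize(prompt))
    # Step4: the loop ends once every chunk has been processed.
    return running_summaries


# Hypothetical stub standing in for an LLM call; it records each prompt it receives.
seen_prompts: List[str] = []


def recording_stub(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"s{len(seen_prompts)}"


summaries = refine_summarize(
    chunks=["x", "y", "z"],
    summarize=recording_stub,
    initial_template="I:<text>",
    refinement_template="R:<text>",
)
print(summaries[-1])
```

Each call after the first depends on the previous result, so the calls have to run in order.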

The code block below takes about **45** seconds to run when using the (1) **gemini-pro** model and (2) maximum number of tokens is 550, but the actual time may vary depending on the condition of Colab and the status of the Google API.

In [None]:
paragraph_summarizations = []

# Process the chunks in order, carrying a running summary forward from one chunk to the next.
for index, chunk in enumerate(chunks):

    if index == 0:
        # Record the start time.
        start = time.time()

        # Construct summarization prompt.
        summarization_prompt = summarization_prompt_template.replace("<text>", chunk)

        # Step1: It starts by running a prompt on a small portion of the data, generating initial output.
        first_paragraph_summarization = summarization(summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

        # Record the result.
        paragraph_summarizations.append(first_paragraph_summarization)

        # Calculate the execution time and round it to 2 decimal places.
        cost_time = round(time.time() - start, 2)

        # Print the summary and its length.
        print(f"----------------------------Summary of Segment {index + 1}----------------------------\n")
        for text in textwrap.wrap(first_paragraph_summarization, 80):
            print(f"{text}\n")
        print(f"Length of summary for segment {index + 1}: {len(first_paragraph_summarization)}")
        print(f"Time taken to generate summary for segment {index + 1}: {cost_time} sec.\n")

    else:
        # Record the start time.
        start = time.time()

        # Step2: For each following document, the previous output is fed in along with the new document.
        chunk_text = f"""前 {index} 段的摘要: {paragraph_summarizations[-1]}\n第 {index + 1} 段的內容: {chunk}"""

        # Construct refinement prompt for summarization.
        summarization_prompt = summarization_prompt_refinement_template.replace("<text>", chunk_text)

        # Step3: The LLM is instructed to refine the output based on the new document's information.
        paragraph_summarization = summarization(summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

        # Record the result.
        paragraph_summarizations.append(paragraph_summarization)

        # Calculate the execution time and round it to 2 decimal places.
        cost_time = round(time.time() - start, 2)

        # print results.
        print(f"----------------------------Summary of the First {index + 1} Segments----------------------------\n")
        for text in textwrap.wrap(paragraph_summarization, 80):
            print(f"{text}\n")
        print(f"Length of summary for the first {index + 1} segments: {len(paragraph_summarization)}")
        print(f"Time taken to generate summary for the first {index + 1} segments: {cost_time} sec.\n")

    # Step4: This process continues iteratively until all documents have been processed.

In [None]:
''' In this block, you can modify your desired output path of final summary. '''

output_path = f"./final-summary-{suffix}-gemini-refinement.txt"

# If you need to convert Simplified Chinese to Traditional Chinese, please set this option to True; otherwise, set it to False.
convert_to_tradition_chinese = False

if convert_to_tradition_chinese == True:
    # Creating an instance of OpenCC for Simplified to Traditional Chinese conversion.
    cc = OpenCC('s2t')
    paragraph_summarizations[-1] = cc.convert(paragraph_summarizations[-1])

# Output your final summary
with open(output_path, "w", encoding="utf-8") as fp:
    fp.write(paragraph_summarizations[-1])

# Show the result.
print(f"Final summary has been saved to {output_path}")
print(f"\n===== Below is the final summary ({len(paragraph_summarizations[-1])} characters) =====\n")
for text in textwrap.wrap(paragraph_summarizations[-1], 64):
    print(text)

## **If you want to use Claude, begin with this part.**
##### (1) You can refer to https://reurl.cc/yLy06D for how to obtain a Claude API key.
##### (2) You can refer to https://docs.anthropic.com/claude/docs/models-overview for more details about which models you can use.

In [None]:
def summarization(client, summarization_prompt, model_name="claude-3-sonnet-20240229", temperature=0.0, top_p=1.0, max_tokens=512):
    """
    (1) Objective:
        - Use the Claude API to summarize a given text.

    (2) Arguments:
        - client: Claude API client.
        - summarization_prompt: The summarization prompt, with the text to be summarized already inserted.
        - model_name: The model name, default is "claude-3-sonnet-20240229". You can refer to "https://docs.anthropic.com/claude/docs/models-overview#model-comparison" for more details.
        - temperature: Controls randomness in the response. Lower values make responses more deterministic, default is 0.0.
        - top_p: Controls diversity via nucleus sampling. Higher values lead to more diverse responses, default is 1.0. Note that this function accepts top_p for interface consistency but does not forward it to the API call.
        - max_tokens: The maximum number of tokens to generate in the completion, default is 512.

    (3) Return:
        - The summarized text.

    (4) Example:
        - If the summarization prompt is "DEF" (with the text to summarize already inserted), model_name is "claude-3-sonnet-20240229",
          temperature is 0.0, top_p is 1.0, and max_tokens is 512, then you can call the function like this:

              summarization(client=client, summarization_prompt="DEF", model_name="claude-3-sonnet-20240229", temperature=0.0, top_p=1.0, max_tokens=512)

    """

    user_prompt = summarization_prompt

    while True:

        try:
            # Use the Claude API to summarize the text.
            message = client.messages.create(
                model=model_name,
                max_tokens=max_tokens,
                temperature=temperature,
                messages=[
                    {"role": "user", "content": user_prompt}
                ]
            )

            break

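        except Exception as error:
            # If the API call fails, wait for 1 second and try again.
            # Catching Exception (rather than using a bare except) still lets you interrupt the cell manually.
            print(f"The API call failed ({error}); waiting 1 second before retrying.")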
            time.sleep(1)

    return message.content[0].text

In [None]:
# @title Parameter Setting of Claude { run: "auto" }
''' ===== In this block, you can modify your desired parameters and set your Claude API key ===== '''

# Your Claude API key.
# @markdown **claude_api_key**: Your Claude api key.
claude_api_key = "YOUR_CLAUDE_API_KEY" # @param {type:"string"}

# The model name, default is "claude-3-opus-20240229". You can refer to "https://docs.anthropic.com/claude/docs/models-overview#model-comparison" for more details.
# @markdown **model_name**: The model name, default is "claude-3-opus-20240229". You can refer to "https://docs.anthropic.com/claude/docs/models-overview#model-comparison" for more details.
model_name = "claude-3-opus-20240229" # @param {type:"string"}

# Controls randomness in the response. Lower values make responses more deterministic.
# @markdown **temperature**: Controls randomness in the response. Lower values make responses more deterministic.
temperature = 1 # @param {type:"slider", min:0, max:1, step:0.1}

# Controls diversity via nucleus sampling. Higher values lead to more diverse responses.
# @markdown **top_p**: Controls diversity via nucleus sampling. Higher values lead to more diverse responses.
top_p = 1.0 # @param {type:"slider", min:0, max:1, step:0.1}

In [None]:
# Construct Claude client.
client = anthropic.Anthropic(api_key=claude_api_key)

### We offer the following two methods for summarization.
Reference: https://reurl.cc/VzagLA

#### **If you want to use the method of Multi-Stage Summarization, begin with this part.**


In [None]:
# @title Prompt Setting of Claude Multi-Stage Summarization: Paragraph { run: "auto" }
''' You can modify the summarization prompt and maximum number of tokens. '''
''' However, DO NOT modify the part of <text>.'''

# The maximum number of tokens to generate in the completion.
# @markdown **max_tokens**: The maximum number of tokens to generate in the completion.
max_tokens = 350 # @param {type:"integer"}

# @markdown #### Changing **summarization_prompt_template**
# @markdown You can modify the summarization prompt and maximum number of tokens. However, **DO NOT** modify the part of `<text>`.
summarization_prompt_template = "用 300 個字內寫出這段文字的摘要，其中包括要點和所有重要細節：<text>" # @param {type:"string"}

##### Step1: Split the long text into multiple smaller pieces and obtain summaries for each smaller text piece separately

The code block below takes about **120** seconds to run when using the (1) **claude-3-opus-20240229** model, (2) length of chunks is 512 and (3) maximum number of tokens is 350, but the actual time may vary depending on the condition of Colab and the status of the Claude API.

In [None]:
paragraph_summarizations = []

# First, we summarize each section that has been split up separately.
for index, chunk in enumerate(chunks):

    # Record the start time.
    start = time.time()

    # Construct summarization prompt.
    summarization_prompt = summarization_prompt_template.replace("<text>", chunk)

    # We summarize each section that has been split up separately.
    response = summarization(client=client, summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

    # Calculate the execution time and round it to 2 decimal places.
    cost_time = round(time.time() - start, 2)

    # Print the summary and its length.
    print(f"----------------------------Summary of Segment {index + 1}----------------------------\n")
    for text in textwrap.wrap(response, 80):
        print(f"{text}\n")
    print(f"Length of summary for segment {index + 1}: {len(response)}")
    print(f"Time taken to generate summary for segment {index + 1}: {cost_time} sec.\n")

    # Record the result.
    paragraph_summarizations.append(response)

In [None]:
# First, we collect all the summarizations obtained before and print them.

collected_summarization = ""
for index, paragraph_summarization in enumerate(paragraph_summarizations):
    collected_summarization += f"Summary of segment {index + 1}: {paragraph_summarization}\n\n"

print(collected_summarization)

##### Step2: After obtaining summaries for each smaller text piece separately, process these summaries to generate the final summary.

In [None]:
# @title Prompt Setting of Claude Multi-Stage Summarization: Total { run: "auto" }
''' You can modify the summarization prompt and maximum number of tokens. '''
''' However, DO NOT modify the part of <text>.'''

# We set the maximum number of tokens to ensure that the final summary does not exceed 550 tokens.
# @markdown **max_tokens**: We set the maximum number of tokens to ensure that the final summary does not exceed 550 tokens.
max_tokens = 550 # @param {type:"integer"}

# @markdown ### Changing **summarization_prompt_template**
# @markdown You can modify the summarization prompt and maximum number of tokens. However, **DO NOT** modify the part of `<text>`.
summarization_prompt_template = "在 500 字以內寫出以下文字的簡潔摘要：<text>" # @param {type:"string"}

The code block below takes about **25** seconds to run when using the (1) **claude-3-opus-20240229** model, (2) length of chunks is 512 and (3) maximum number of tokens is 550, but the actual time may vary depending on the condition of Colab and the status of the Claude API.

In [None]:
# Finally, we compile a final summary from the summaries of each section.

# Record the start time.
start = time.time()

summarization_prompt = summarization_prompt_template.replace("<text>", collected_summarization)

# Run final summarization.
final_summarization = summarization(client=client, summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

# Calculate the execution time and round it to 2 decimal places.
cost_time = round(time.time() - start, 2)

# Print the summary and its length.
print(f"----------------------------Final Summary----------------------------\n")
for text in textwrap.wrap(final_summarization, 80):
    print(f"{text}")
print(f"\nLength of final summary: {len(final_summarization)}")
print(f"Time taken to generate the final summary: {cost_time} sec.")

In [None]:
''' In this block, you can modify your desired output path of final summary. '''

output_path = f"./final-summary-{suffix}-claude-multi-stage.txt"

# If you need to convert Simplified Chinese to Traditional Chinese, please set this option to True; otherwise, set it to False.
convert_to_tradition_chinese = False

if convert_to_tradition_chinese == True:
    # Creating an instance of OpenCC for Simplified to Traditional Chinese conversion.
    cc = OpenCC('s2t')
    final_summarization = cc.convert(final_summarization)

# Output your final summary
with open(output_path, "w", encoding="utf-8") as fp:
    fp.write(final_summarization)

# Show the result.
print(f"Final summary has been saved to {output_path}")
print(f"\n===== Below is the final summary ({len(final_summarization)} characters) =====\n")
for text in textwrap.wrap(final_summarization, 64):
    print(text)

#### **If you want to use the method of Refinement, begin with this part.**



In [None]:
# @title Prompt Setting of Claude Refinement { run: "auto" }
''' You can modify the summarization prompt and maximum number of tokens. '''
''' However, DO NOT modify the part of <text>.'''

# We set the maximum number of tokens.
# @markdown **max_tokens**: We set the maximum number of tokens.
max_tokens = 550 # @param {type:"integer"}

# @markdown ### Changing **summarization_prompt_template** and **summarization_prompt_refinement_template**
# @markdown You can modify the summarization prompt and maximum number of tokens. However, **DO NOT** modify the part of `<text>`.

# Initial prompt.
# @markdown **summarization_prompt_template**: Initial prompt.
summarization_prompt_template = "請在 300 字以內，提供以下文字的簡潔摘要:<text>" # @param {type:"string"}

# Refinement prompt.
# @markdown **summarization_prompt_refinement_template**: Refinement prompt.
summarization_prompt_refinement_template = "請在 500 字以內，結合原先的摘要和新的內容，提供簡潔的摘要:<text>" # @param {type:"string"}

Pipeline of the method of Refinement.

Step1: It starts by running a prompt on a small portion of the data, generating initial output.

Step2: For each following document, the previous output is fed in along with the new document.

Step3: The LLM is instructed to refine the output based on the new document's information.

Step4: This process continues iteratively until all documents have been processed.

The code block below takes about **150** seconds to run when using the (1) **claude-3-opus-20240229** model and (2) maximum number of tokens is 550, but the actual time may vary depending on the condition of Colab and the status of the Claude API.

In [None]:
paragraph_summarizations = []

# Process the chunks in order, carrying a running summary forward from one chunk to the next.
for index, chunk in enumerate(chunks):

    if index == 0:
        # Record the start time.
        start = time.time()

        # Construct summarization prompt.
        summarization_prompt = summarization_prompt_template.replace("<text>", chunk)

        # Step1: It starts by running a prompt on a small portion of the data, generating initial output.
        first_paragraph_summarization = summarization(client=client, summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

        # Record the result.
        paragraph_summarizations.append(first_paragraph_summarization)

        # Calculate the execution time and round it to 2 decimal places.
        cost_time = round(time.time() - start, 2)

        # Print the summary and its length.
        print(f"----------------------------Summary of Segment {index + 1}----------------------------\n")
        for text in textwrap.wrap(first_paragraph_summarization, 80):
            print(f"{text}\n")
        print(f"Length of summary for segment {index + 1}: {len(first_paragraph_summarization)}")
        print(f"Time taken to generate summary for segment {index + 1}: {cost_time} sec.\n")


    else:
        # Record the start time.
        start = time.time()

        # Step2: For each following document, the previous output is fed in along with the new document.
        chunk_text = f"""前 {index} 段的摘要: {paragraph_summarizations[-1]}\n第 {index + 1} 段的內容: {chunk}"""

        # Construct refinement prompt for summarization.
        summarization_prompt = summarization_prompt_refinement_template.replace("<text>", chunk_text)

        # Step3: The LLM is instructed to refine the output based on the new document's information.
        paragraph_summarization = summarization(client=client, summarization_prompt=summarization_prompt, model_name=model_name, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

        # Record the result.
        paragraph_summarizations.append(paragraph_summarization)

        # Calculate the execution time and round it to 2 decimal places.
        cost_time = round(time.time() - start, 2)

        # print results.
        print(f"----------------------------Summary of the First {index + 1} Segments----------------------------\n")
        for text in textwrap.wrap(paragraph_summarization, 80):
            print(f"{text}\n")
        print(f"Length of summary for the first {index + 1} segments: {len(paragraph_summarization)}")
        print(f"Time taken to generate summary for the first {index + 1} segments: {cost_time} sec.\n")

    # Step4: This process continues iteratively until all documents have been processed.

In [None]:
''' In this block, you can modify your desired output path of final summary. '''

output_path = f"./final-summary-{suffix}-claude-refinement.txt"

# If you need to convert Simplified Chinese to Traditional Chinese, please set this option to True; otherwise, set it to False.
convert_to_tradition_chinese = False

if convert_to_tradition_chinese == True:
    # Creating an instance of OpenCC for Simplified to Traditional Chinese conversion.
    cc = OpenCC('s2t')
    paragraph_summarizations[-1] = cc.convert(paragraph_summarizations[-1])

# Output your final summary
with open(output_path, "w", encoding="utf-8") as fp:
    fp.write(paragraph_summarizations[-1])

# Show the result.
print(f"Final summary has been saved to {output_path}")
print(f"\n===== Below is the final summary ({len(paragraph_summarizations[-1])} characters) =====\n")
for text in textwrap.wrap(paragraph_summarizations[-1], 80):
    print(text)

# Part5 - Check the correctness of the submission file
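
The checker in this part reads `history[0]["messages"][1]["content"]` from the submission file and looks for `總分：` (with a full-width colon) followed by digits. The sketch below illustrates that shape with made-up data: the `role` keys, the sample evaluation text, and the helper name `extract_score` are assumptions for illustration, and only the keys the checker actually reads are known from this notebook.

```python
import json
import re

# Hypothetical submission content; only the keys read by the checker are included.
sample_submission = {
    "history": [
        {
            "messages": [
                {"role": "user", "content": "(evaluation prompt and summary)"},
                {"role": "assistant", "content": "評語：摘要完整。\n總分：97"},
            ]
        }
    ]
}


def extract_score(submission: dict):
    """Return the last '總分：<digits>' score as a string, or None if absent."""
    content = submission["history"][0]["messages"][1]["content"].strip()
    matches = re.findall(r"總分：(\d+)", content)
    return matches[-1] if matches else None


# Round-trip through JSON to mimic loading a saved submission file.
loaded = json.loads(json.dumps(sample_submission, ensure_ascii=False))
print(extract_score(loaded))

# A half-width colon does not match the pattern, so no score is found.
half_width = {"history": [{"messages": [{}, {"content": "總分: 97"}]}]}
print(extract_score(half_width))
```

If the evaluation result uses a half-width colon, the score is not detected, so the colon in `總分：` matters.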


In [None]:
# Check the correctness of the submission file.
import json
import re

your_submission_path = "YOUR_SUBMISSION_PATH"

def check_format(your_submission_path):

    final_score = 0

    # check the extension of the file.
    if not your_submission_path.endswith(".json"):
        print("Please save your submission file in JSON format.")
        return False, final_score
    else:
        try:
            with open(your_submission_path, "r", encoding="utf-8") as fp:
                your_submission = json.load(fp)

            evaluation_result = your_submission["history"][0]["messages"][1]["content"]

            if "總分：" not in evaluation_result:
                # Correct format: 總分：<你的分數> (note the full-width colon)
                print("Please make sure that the correct format of final score is included in the evaluation result.")
                print("The correct format is 總分：<你的分數> (with a full-width colon). For example, 總分：97")
                return False, final_score

            evaluation_result = evaluation_result.strip()
            score_pattern = r"總分：\d+"
            score = re.findall(score_pattern, evaluation_result)

            if score:
                final_score = score[-1].replace("總分：", "")
                if "/100" in final_score:
                    final_score = final_score.replace("/100", "")
            else:
                print("Please make sure that the final score is included in the evaluation result.")
                return False, final_score

        except Exception:
            print("Failed to open or parse the file. Please check the file path and make sure the submission file is saved as valid JSON in the expected format.")
            return False, final_score

    return True, final_score

format_correctness, final_score = check_format(your_submission_path)
if format_correctness == True:
    print("The format of your submission file is correct.")
    print(f"Your final score is {final_score}.")
else:
    print("The format of your submission file is wrong.")
    print("Please check the format of your submission file.")

The format of your submission file is correct.
Your final score is 0.
