<a href="https://colab.research.google.com/github/shigenogoro/YouTube-Video-Summarization/blob/kyle/youtube_video_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YouTube Video Summarization

# Environment Setup

## Google Drive Setup

In [1]:
from google.colab import drive
import os

# Mount the drive to Colab
drive.mount('/content/drive')

# Path to the private GitHub token
token_path = '/content/drive/MyDrive/CS_685/Final_Project/.gh_token'

# Load the token
with open(token_path) as f:
    os.environ['GH_TOKEN'] = f.read().strip()

# Check that the token was loaded
print('GH_TOKEN' in os.environ)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
True


## GitHub Setup

In [2]:
!git config --global user.email "kyle990987@gmail.com"
!git config --global user.name "shigenogoro"

In [3]:
# Set repo URL
username = "shigenogoro"
reponame = "YouTube-Video-Summarization"
token = os.environ['GH_TOKEN']

repo_url = f"https://{token}@github.com/{username}/{reponame}.git"

!git clone {repo_url}
%cd {reponame}

fatal: destination path 'YouTube-Video-Summarization' already exists and is not an empty directory.
/content/YouTube-Video-Summarization


In [None]:
# Pull the changes from the repo
!git pull {repo_url} main

## Package Setup

In [None]:
!pip install numpy
!pip install pandas
!pip install tqdm
!pip install matplotlib
!pip install seaborn

!pip install torch
!pip install transformers
!pip install sentence-transformers

!pip install spacy
!pip install nltk

!pip install rouge-score
!pip install bert-score

!pip install datasets
!pip install jiwer

#### Download Language Resources

In [4]:
!python -m spacy download en_core_web_sm

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

# Phase 1: Experimentation on MeetingBank

## Preprocessing

### Data Acquisition - MeetingBank

Download the MeetingBank dataset and verify the transcript, segment, and summary formats.

In [5]:
from datasets import load_dataset
meetingbank = load_dataset("huuuyeah/meetingbank")

train_data = meetingbank['train']
test_data = meetingbank['test']
val_data = meetingbank['validation']

def generator(data_split):
  for instance in data_split:
    yield instance['id'], instance['summary'], instance['transcript']

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

train.json:   0%|          | 0.00/88.4M [00:00<?, ?B/s]

validation.json:   0%|          | 0.00/13.2M [00:00<?, ?B/s]

test.json:   0%|          | 0.00/13.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5169 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/861 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/862 [00:00<?, ? examples/s]

In [8]:
# Show an example
sample_data = next(generator(train_data))

# Print the sample data with attribute and content
for attribute, content in zip(['id', 'summary', 'transcript'], sample_data):
  print(f"{attribute}: {content}")

id: 0
summary: AS AMENDED a bill for an ordinance amending the Denver Zoning Code to revise parking exemptions for pre-existing small zone lots. Approves a text amendment to the Denver Zoning Code to revise the Pre-Existing Small Zone Lot parking exemption. The Committee approved filing this bill at its meeting on 2-14-17. On 2-27-17, Council held this item in Committee to 3-20-17. Amended 3-20-17 to ensure that the parking exemption is applied for all uses. Some parking requirements are calculated based on gross floor area while others are on number of units and not explicitly for gross floor area, to further clarify the legislative intent of the proposed bill to emphasize the city’s commitment to more comprehensively address transportation demand management strategies in the short term, and to require a Zoning Permit with Informational Notice for all new buildings on Pre-Existing Small Zone Lots that request to use the small lot parking exemption; Enables all expansions to existing b
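
The format check mentioned at the start of this section could be sketched as below. This is a minimal sketch, not part of the original notebook: `check_instance` is a hypothetical helper, and it only assumes the three fields used by `generator` above (`id`, `summary`, `transcript`). No assumption is made about the type of `id`.

```python
REQUIRED_FIELDS = ("id", "summary", "transcript")
TEXT_FIELDS = ("summary", "transcript")

def check_instance(instance, required=REQUIRED_FIELDS, text_fields=TEXT_FIELDS):
    """Return a list of problems found in one record (empty list if none)."""
    problems = []
    for field in required:
        if field not in instance:
            problems.append(f"missing field: {field}")
    for field in text_fields:
        if field in instance:
            value = instance[field]
            # Text fields are expected to be non-empty strings.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"empty or non-string field: {field}")
    return problems

# Hand-written records stand in for dataset rows here.
print(check_instance({"id": 0, "summary": "A bill.", "transcript": "Thank you."}))
print(check_instance({"id": 1, "summary": "", "transcript": "Thank you."}))
```

In the notebook, the same helper could be applied to each record of `train_data`, `val_data`, and `test_data` to count records that report any problem.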

### Transcript Preprocessing

Description: Implement tokenization, sentence splitting, and normalization using `spaCy` and `nltk`.

Deliverable: Preprocessed transcript files

- Step 1: Filter out the unnecessary words

- Step 2: Split sentences by `spaCy`

- Step 3: Chunk sentences into paragraphs of 10-15 sentences

    - Summarization models like `BART/T5` usually have input limits of 512-1024 tokens, so we chunk the transcript into groups of 10-15 sentences.

- Step 4: Integrate the above steps into a pipeline

- Step 5: Save the preprocessed transcript
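
Step 3 uses a fixed sentence count as a proxy for the model's token limit. A possible alternative, sketched below, packs sentences greedily against an explicit token budget. `chunk_by_token_budget` and its parameters are hypothetical and not part of this notebook, and the default whitespace word count is only a rough stand-in for a real model tokenizer.

```python
def chunk_by_token_budget(sentences, max_tokens=512, count_tokens=None):
    """Greedily pack sentences into chunks of at most max_tokens tokens."""
    if count_tokens is None:
        # Assumption: whitespace word count approximates the model's token count.
        count_tokens = lambda s: len(s.split())

    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy input: a sentence longer than the budget would still form its own chunk.
print(chunk_by_token_budget(["a b c", "d e", "f g h i", "j"], max_tokens=5))
```

Passing a real tokenizer's length function as `count_tokens` would make the budget match the model more closely; subword tokenizers typically produce more tokens than a whitespace split.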

In [9]:
# Process transcript
import spacy
import nltk
import re

# Filter out the unnecessary words
def clean_text(text):
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\.{2,}', '.', text)
    text = re.sub(r'\b(uh|um|you know|like)\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Split sentences by spaCy
nlp = spacy.load("en_core_web_sm")

def split_sentences(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 5]

# Chunk sentences into groups of at most max_len sentences (default: 10)
def chunk_sentences(sentences, max_len=10):
    chunks = []
    for i in range(0, len(sentences), max_len):
        chunk = " ".join(sentences[i:i+max_len])
        chunks.append(chunk)
    return chunks


In [10]:
# Integrate the above steps into a pipeline
def preprocess_transcript(text):
    text = clean_text(text)
    sentences = split_sentences(text)
    chunks = chunk_sentences(sentences)

    return chunks
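
To see what the cleaning and chunking steps do on a small input, the sketch below repeats `clean_text` and `chunk_sentences` so it runs on its own. The sample sentence and the 23 placeholder sentences are made up for illustration and stand in for real transcript text and spaCy output.

```python
import re

def clean_text(text):
    # Same steps as clean_text above, repeated so this block is self-contained.
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\.{2,}', '.', text)
    text = re.sub(r'\b(uh|um|you know|like)\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def chunk_sentences(sentences, max_len=10):
    # Same grouping as chunk_sentences above, written as a comprehension.
    return [" ".join(sentences[i:i + max_len]) for i in range(0, len(sentences), max_len)]

sample = "Um, I would like to... uh, move\nthe item."
cleaned = clean_text(sample)
print(cleaned)  # , I would to. , move the item.

# 23 placeholder sentences produce chunks of 10, 10, and 3 sentences.
sentences = [f"Sentence {i}." for i in range(23)]
chunks = chunk_sentences(sentences)
print(len(chunks))  # 3
```

The sample shows two side effects of the filler filter: punctuation that followed a removed filler stays behind, and "like" is removed even when it is used as a verb. The last chunk can also be much shorter than the others.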

In [14]:
# Preprocess the dataset and save it into a CSV file
import csv
import pandas as pd
import os
from tqdm import tqdm

def preprocess_dataset(dataset, output_dir="preprocessed_data", max_rows_per_file=10000):
    """
    Preprocesses the dataset and saves the preprocessed transcripts to CSV file(s).

    Args:
        dataset: The dataset split (e.g., train_data, test_data, val_data).
        output_dir: The directory to save the preprocessed data.
        max_rows_per_file: The maximum number of rows per CSV file.
    """
    os.makedirs(output_dir, exist_ok=True)
    preprocessed_data_list = []
    file_index = 0

    for instance in tqdm(dataset, desc="Preprocessing dataset"):
        id = instance['id']
        transcript = instance['transcript']
        summary = instance['summary']

        preprocessed_chunks = preprocess_transcript(transcript)

        for chunk in preprocessed_chunks:
            preprocessed_data_list.append({'id': id, 'transcript': chunk, 'summary': summary})

        # Check if we should write to a file
        if len(preprocessed_data_list) >= max_rows_per_file:
            df = pd.DataFrame(preprocessed_data_list)
            output_path = os.path.join(output_dir, f"preprocessed_data_{file_index}.csv")
            df.to_csv(output_path, index=False)
            print(f"Saved {len(df)} rows to {output_path}")
            preprocessed_data_list = []
            file_index += 1

    # Save any remaining data
    if preprocessed_data_list:
        df = pd.DataFrame(preprocessed_data_list)
        output_path = os.path.join(output_dir, f"preprocessed_data_{file_index}.csv")
        df.to_csv(output_path, index=False)
        print(f"Saved {len(df)} rows to {output_path}")
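
Because `preprocess_dataset` may split its output across several `preprocessed_data_<n>.csv` files, later steps need to read the shards back in order. The sketch below is one way to do that with the standard library; `iter_preprocessed_rows` is a hypothetical helper, not part of the notebook, and it assumes the file naming and the `id`, `transcript`, `summary` columns written above.

```python
import csv
import glob
import os

def iter_preprocessed_rows(output_dir):
    """Yield row dicts from every preprocessed_data_<n>.csv shard, in numeric order."""
    pattern = os.path.join(output_dir, "preprocessed_data_*.csv")

    def shard_index(path):
        # "preprocessed_data_12.csv" -> 12, so shard 10 sorts after shard 2.
        stem = os.path.splitext(os.path.basename(path))[0]
        return int(stem.rsplit("_", 1)[1])

    for path in sorted(glob.glob(pattern), key=shard_index):
        with open(path, newline="", encoding="utf-8") as f:
            yield from csv.DictReader(f)
```

With pandas, the same ordering of paths could be passed to `pd.read_csv` and the results combined with `pd.concat`. Note that every value read through `csv.DictReader` is a string, including `id`.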

### Execute the preprocessing

In [None]:
# Check whether the preprocessed data already exists
if not os.path.exists("/content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_train"):
    preprocess_dataset(train_data, output_dir="/content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_train")
else:
    print("Training set has already been preprocessed.")

In [16]:
# Check whether the preprocessed data already exists
if not os.path.exists("/content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_test"):
    preprocess_dataset(test_data, output_dir="/content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_test")
else:
    print("Testing set has already been preprocessed.")

Preprocessing dataset:  52%|█████▏    | 444/862 [03:21<03:53,  1.79it/s]

Saved 10000 rows to /content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_test/preprocessed_data_0.csv


Preprocessing dataset: 100%|██████████| 862/862 [05:57<00:00,  2.41it/s]

Saved 7898 rows to /content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_test/preprocessed_data_1.csv





In [17]:
# Check whether the preprocessed data already exists
if not os.path.exists("/content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_val"):
    preprocess_dataset(val_data, output_dir="/content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_val")
else:
    print("Validation set has already been preprocessed.")

Preprocessing dataset:  60%|█████▉    | 514/861 [03:22<07:56,  1.37s/it]

Saved 10045 rows to /content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_val/preprocessed_data_0.csv


Preprocessing dataset: 100%|██████████| 861/861 [05:51<00:00,  2.45it/s]


Saved 7552 rows to /content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_val/preprocessed_data_1.csv


In [26]:
# Print a preprocessed sample data
import pandas as pd
import re
import spacy

# Load the first preprocessed file
preprocessed_data = pd.read_csv("/content/drive/MyDrive/CS_685/Final_Project/data/preprocessed_train/preprocessed_data_0.csv")

# Get the ID of the first entry in the preprocessed data
sample_id = preprocessed_data.iloc[0]['id']

# Find the original data for the sample ID
original_data_sample = None
for instance in train_data:
    if instance['id'] == sample_id:
        original_data_sample = instance
        break

if original_data_sample:
    original_transcript = original_data_sample['transcript']
    preprocessed_transcript_chunks = preprocessed_data[preprocessed_data['id'] == sample_id]['transcript'].tolist()

    print(f"Comparing Original Transcript and Preprocessed Chunks for ID: {sample_id}\n")

    # Find the unnecessary words in the original_data_sample
    unnecessary_words = re.findall(r'\b(uh|um|you know|like)\b', original_transcript)
    print(f"Original Transcript Unnecessary Words: {unnecessary_words}")

    # Find the unnecessary words remaining across all preprocessed chunks
    # (collect matches from every chunk instead of keeping only the last one)
    unnecessary_words_chunks = []
    for chunk in preprocessed_transcript_chunks:
        unnecessary_words_chunks.extend(re.findall(r'\b(uh|um|you know|like)\b', chunk))
    print(f"Preprocessed Chunks Unnecessary Words: {unnecessary_words_chunks}")


else:
    print(f"Original data with ID {sample_id} not found.")

Comparing Original Transcript and Preprocessed Chunks for ID: 0

Original Transcript Unnecessary Words: ['like', 'like', 'you know', 'you know', 'you know', 'you know', 'like', 'like', 'you know', 'like', 'you know', 'you know', 'like', 'like', 'like', 'you know', 'like', 'like', 'you know', 'you know', 'like', 'like', 'like', 'like', 'like', 'you know', 'you know', 'you know', 'like', 'like', 'you know', 'you know', 'you know', 'you know', 'you know', 'like', 'you know', 'you know', 'you know', 'you know', 'you know', 'like', 'you know', 'like', 'you know', 'like', 'you know', 'like', 'you know', 'like', 'you know', 'you know', 'like', 'you know', 'you know', 'you know', 'like', 'like', 'you know', 'you know', 'like', 'you know', 'like', 'you know', 'you know', 'like', 'like', 'you know', 'you know', 'you know', 'you know', 'you know', 'you know', 'you know', 'you know', 'you know', 'you know', 'you know', 'you know', 'like', 'you know', 'you know', 'you know', 'you know', 'you know',

## Baseline Implementation

### Baseline Summarization Models

### Evaluation Metrics Setup

### Trivial Baseline Evaluation