<a href="https://colab.research.google.com/github/therohitdas/Youtube-Transcript-Generator/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YouTube Transcript Extraction and Processing

## Overview

This notebook extracts and processes transcripts from YouTube videos. It uses [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api) to fetch the raw transcript, which may come from either auto-generated or user-added subtitles. For the full list of features and options, refer to its [documentation](https://github.com/jdepoix/youtube-transcript-api).

Once the raw transcript is obtained, the script restores punctuation using [oliverguhr/fullstop-punctuation-multilang-large](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large), a model that supports multiple languages. Punctuation restoration may take some time, depending on the length of the video.

For reference, generating the raw transcript and adding punctuation for a 1-hour-38-minute video took approximately 5 minutes and 17 seconds.

## Requirements

- [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api)
- [deepmultilingualpunctuation](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large)
- nltk
- tqdm

## Usage

1. Open the [Google Colab notebook](https://colab.research.google.com/).
2. Click on **File > Save a copy in Drive** to create your own version.
3. Adjust the script parameters as needed.
4. Execute the script cell to process the YouTube video transcript.

## Script Parameters

- `url`: YouTube video URL.
- `language`: Language of the transcript (default: en).
- `raw`: Generate raw transcript (default: True).
- `punctuated`: Generate punctuated transcript.
- `output`: Output directory for the transcript.
- `filename`: Filename for the transcript file (excluding extension).
- `batch_size`: Batch size for parallel processing (default: 0, auto-detect based on CPU cores).
- `verbose`: Enable verbose mode for detailed output (default: True).
- `punctuation_model`: Name of the punctuation model to load (default: '', which uses the library's default model).
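
The `batch_size` rule can be illustrated in isolation. The sketch below mirrors the expression the script uses when it splits sentences into batches; the helper name `resolve_batch_size` is hypothetical and does not exist in the notebook.

```python
import math
import os


def resolve_batch_size(batch_size: int) -> int:
    """Hypothetical helper mirroring the script's batch-size rule."""
    if batch_size:
        # Non-zero values are rounded down to the nearest power of two.
        return 2 ** int(math.log2(batch_size))
    # Zero means "auto": fall back to the CPU core count (at least 1).
    return os.cpu_count() or 1


print(resolve_batch_size(10))  # 8
print(resolve_batch_size(16))  # 16
```

Note that a negative `batch_size` would make `math.log2` raise a `ValueError`, so only zero or positive values are meaningful here.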

## Examples

```python
url = 'https://www.youtube.com/watch?v=YOUR_VIDEO_ID'
language = 'en'
raw = True
punctuated = False
output_dir = '/content'
filename = 'transcript_notes'
batch_size = 0
verbose = True
punctuation_model = ''

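video_id = parse_youtube_url(url)
# Fetch the title and chapters; requires a GOOGLE_API_KEY secret in Colab
video_info = getVideoInfo(video_id)
process_and_save_transcript(video_id, video_info, language, punctuated, output_dir, filename, batch_size, verbose, punctuation_model)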
```

## Acknowledgments
This script utilizes the [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api) and [deepmultilingualpunctuation](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large) libraries. Special thanks to their contributors.

Feel free to adapt and use the script based on your requirements. Enjoy the convenience of YouTube transcript processing!

## Connect with me
I am new to the AI world and would love to connect with other people who share this interest.
- [x/therohitdas](https://x.com/therohitdas)
- [github/therohitdas](https://github.com/therohitdas)

In [None]:
!pip install youtube-transcript-api deepmultilingualpunctuation nltk tqdm google-api-python-client google-auth-oauthlib

In [None]:
from google.colab import drive, userdata

drive.mount("/content/drive")

In [6]:
import os
import youtube_transcript_api
from deepmultilingualpunctuation import PunctuationModel
from nltk import sent_tokenize
from multiprocessing import Pool
import time
import logging
from tqdm import tqdm
import re
import math
import nltk
import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

In [7]:
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
logging.basicConfig(level=logging.INFO)

In [212]:
def remove_music_tags(text):
    # Remove [Music] or [music]
    updated_text = re.sub(r'\[music\]', '', text, flags=re.IGNORECASE)
    return updated_text

def get_transcript(video_id, language, video_info, verbose=True):
    transcript_list = youtube_transcript_api.YouTubeTranscriptApi.get_transcript(video_id, languages=[language])
    # Start with an empty transcript so the variable exists even without a title
    transcript = ''
    if video_info["title"] != "":
        transcript = f'# {video_info["title"]}\n\n'
    current_chapter_index = 0
    chapters = video_info["chapters"]

    with tqdm(total=len(transcript_list), desc='Processing Transcript', unit='line', disable=not verbose) as pbar:
        for line in transcript_list:
            start_time = int(math.floor(line['start']))  # Floor and convert to integer

            # Check if current_chapter_index is within the valid range
            if 0 <= current_chapter_index < len(chapters):
                chapter_time = chapters[current_chapter_index]['timestamp']

                try:
                    # Extract start time from the chapter timestamp
                    chapter_start = chapter_time.strip()
                    chapter_start_seconds = sum(int(x) * 60 ** i for i, x in enumerate(reversed(chapter_start.split(':'))))
                    chapters[current_chapter_index]["title"] = chapters[current_chapter_index]["title"].strip()
                    buffer_time = 2
                    if start_time >= chapter_start_seconds - buffer_time:
                        transcript += f'\n\n## {chapters[current_chapter_index]["title"]}\n'
                        current_chapter_index += 1
                except Exception as e:
                    print(f"Error processing chapter timestamp: {chapter_time}")
                    print(f"Error details: {e}")
            line['text'] = remove_music_tags(line['text'])
            transcript += line['text'].strip() + ' '
            pbar.update(1)

    return transcript

def remove_period_after_hashes(text):
    # Remove . after ##, considering newline characters
    updated_text = re.sub(r'(?<=##)[.\n]+', '', text)
    return updated_text

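def add_punctuation(text, punctuation_model):
    # Load the requested model, or the library default when none is given
    if punctuation_model != "":
        model = PunctuationModel(model=punctuation_model)
    else:
        model = PunctuationModel()
    # Restore punctuation regardless of which model was loaded
    punctuated_text = model.restore_punctuation(text)
    return punctuated_text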

def capitalize_sentences_batch(sentences):
    # Capitalize the first letter of each sentence in a batch;
    # slicing keeps empty strings from raising an IndexError
    capitalized_sentences = [sentence[:1].upper() + sentence[1:] for sentence in sentences]
    return capitalized_sentences

def process_and_save_transcript(video_id, video_info, language, generate_punctuated, output_dir, filename, batch_size, verbose, punctuation_model):
    try:
        raw_transcript = get_transcript(video_id, language, video_info, verbose)

        if generate_punctuated:
            with_punctuation = add_punctuation(raw_transcript, punctuation_model)
            with_punctuation = remove_period_after_hashes(with_punctuation)
            print(with_punctuation)
            sentences = sent_tokenize(with_punctuation)
            num_processes = os.cpu_count() or 1
            batch_size = 2 ** int(math.log2(batch_size)) if batch_size else num_processes

            batches = [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]
            with Pool() as pool:
                # imap yields one result per batch, so the progress total is the batch count
                capitalized_sentences = list(
                    tqdm(pool.imap(capitalize_sentences_batch, batches),
                         total=len(batches), desc='Processing', disable=not verbose))
            capitalized_transcript =  os.linesep.join([sentence for batch in capitalized_sentences for sentence in batch])
            output_path = os.path.join(output_dir, f'{filename}.md')
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(capitalized_transcript)
            logging.info(f'Punctuated transcript saved to {output_path}')
        else:
            sentences = sent_tokenize(raw_transcript)
            print (sentences)
            num_processes = os.cpu_count() or 1
            batch_size = 2 ** int(math.log2(batch_size)) if batch_size else num_processes

            batches = [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]
            with Pool() as pool:
                # imap yields one result per batch, so the progress total is the batch count
                capitalized_sentences = list(
                    tqdm(pool.imap(capitalize_sentences_batch, batches),
                         total=len(batches), desc='Processing', disable=not verbose))
            capitalized_transcript =  os.linesep.join([sentence for batch in capitalized_sentences for sentence in batch])
            output_path = os.path.join(output_dir, f'{filename}.md')
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(capitalized_transcript)
            logging.info(f'Raw transcript saved to {output_path}')

    except Exception as e:
        logging.error(f'Error: {e}')


def parse_youtube_url(url):
    video_id_match = re.search(r'(?:youtube\.com\/.*?[?&]v=|youtu\.be\/)([^"&?\/\s]{11})', url)
    if video_id_match:
        return video_id_match.group(1)
    else:
        raise ValueError('Invalid YouTube URL')

def parse_chapters(description):
    lines = description.split("\n")
    regex = re.compile(r"(\d{0,2}:?\d{1,2}:\d{2})")
    chapters = []

    for line in lines:
        matches = regex.findall(line)
        if matches:
            ts = matches[0]
            title = line.replace(ts, "").strip()

            # Check if the title contains another timestamp and remove it
            title = re.sub(r'\d{0,2}:?\d{1,2}:\d{2}', '', title).strip().strip('-').strip().strip('-').strip()

            chapters.append({
                "timestamp": ts,
                "title": title,
            })

    return chapters

def getVideoInfo(video_id):
    try:
        # Read the YouTube Data API key from the Colab secret named GOOGLE_API_KEY
        api_key = userdata.get('GOOGLE_API_KEY')
        youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=api_key)
        request = youtube.videos().list(part="id,snippet", id=video_id)
        response = request.execute()
        snippet = response['items'][0]['snippet']
        # Chapters are parsed from timestamps in the video description
        return {"title": snippet['title'], "chapters": parse_chapters(snippet['description'])}
    except Exception as e:
        logging.error(f'Error: {e}')
        return {"title": "", "chapters": []}
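
The chapter handling above relies on two small steps: finding a timestamp in each description line and converting it to seconds. The following is a minimal sketch of both steps using the same regular expression and arithmetic as the code above. The helper name `timestamp_to_seconds` and the sample description are illustrative assumptions, not part of the notebook.

```python
import re

# Same timestamp pattern as parse_chapters in this notebook.
TIMESTAMP = re.compile(r"(\d{0,2}:?\d{1,2}:\d{2})")


def timestamp_to_seconds(ts: str) -> int:
    """Hypothetical helper: convert 'MM:SS' or 'HH:MM:SS' to seconds."""
    parts = reversed(ts.strip().split(":"))
    # Rightmost field is seconds (60**0), then minutes (60**1), then hours (60**2).
    return sum(int(x) * 60 ** i for i, x in enumerate(parts))


# Made-up description text for illustration only.
description = "0:00 Intro\n1:35 - What are Antibiotics?\n1:02:03 Wrap-up\nNo timestamp here"

for line in description.split("\n"):
    found = TIMESTAMP.findall(line)
    if found:
        print(found[0], "->", timestamp_to_seconds(found[0]))
```

Lines without a timestamp are skipped, which is why only descriptions that list chapters as `timestamp title` pairs produce chapter headings in the transcript.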

## Example Usage:
```python
url = 'https://www.youtube.com/watch?v=YOUR_VIDEO_ID'
video_id = parse_youtube_url(url)
language = 'en'
punctuated = True
output_dir = '.'
filename = 'output' # Or set it to video_id
batch_size = 0
verbose = True
punctuation_model = ''
```
`language`: the language code of the transcript to fetch. If both a manually created and an automatically generated transcript are available in the requested language, the manually created one is always preferred.

`punctuation_model`: supported values are listed at https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large#languages
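
The example above starts from `parse_youtube_url`. As a rough illustration of which URL shapes that function accepts, the sketch below applies the same regular expression to a few sample URLs. The sample list is an assumption chosen for illustration.

```python
import re

# Same pattern as parse_youtube_url in this notebook.
VIDEO_ID = re.compile(r'(?:youtube\.com\/.*?[?&]v=|youtu\.be\/)([^"&?\/\s]{11})')

# Illustrative sample URLs, not taken from the notebook.
samples = [
    "https://www.youtube.com/watch?v=zd_2xpcQuPw",
    "https://youtu.be/zd_2xpcQuPw?t=30",
    "https://www.youtube.com/watch?list=PL123&v=zd_2xpcQuPw",
    "https://example.com/watch?v=zd_2xpcQuPw",
]

for url in samples:
    match = VIDEO_ID.search(url)
    # The first three print the 11-character ID; the last prints None.
    print(url, "->", match.group(1) if match else None)
```

URLs on `youtube.com` that carry no `v=` query parameter, such as `/shorts/` paths, are not matched by this pattern, so `parse_youtube_url` would raise `ValueError` for them.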

In [213]:
url = 'https://www.youtube.com/watch?v=zd_2xpcQuPw'
video_id = parse_youtube_url(url)
video_info = getVideoInfo(video_id)

language = 'en'
punctuated = True
output_dir = '.'
filename = video_info["title"] if video_info["title"] else f'{video_id}_raw'
batch_size = 0
verbose = False
punctuation_model = ''

In [214]:
process_and_save_transcript(video_id, video_info, language, punctuated, output_dir, filename, batch_size, verbose, punctuation_model)

Processing Transcript: 100%|██████████| 373/373 [00:00<00:00, 70912.67line/s]


# Why Antibiotics Don't Work Like They Used To. ## Intro. antibiotics are very commonly used medications, and they have been an incredible tool for curing infectious diseases and saving lives, but there are also some potential negative consequences that can come with taking antibiotics, some of which can be quite serious. so in today's video, we're going to talk about how antibiotics work, the good things they can do for the human body, but also some of those potential negative effects, and, of course, we'll discuss some concerns around their future use and antibiotic resistance. it's definitely going to be a fun one, so let's do this. so what are antibiotics and how do? ## What are Antibiotics? Killing Bacteria But Not Own Cells? they work. antibiotics are substances that kill or inhibit the growth of bacteria, and one of the coolest and Most Fascinating Concepts that I remember learning about as a microbiology student- and this concept's very important to how antibiotics actually wor

Processing:  51%|█████     | 49/97 [00:00<00:00, 2565.16it/s]


In [216]:
with open(os.path.join(output_dir, f'{filename}.md'), "r", encoding="utf-8") as f:
    content = f.read()
    print(content)


# Why Antibiotics Don't Work Like They Used To.
## Intro.
Antibiotics are very commonly used medications, and they have been an incredible tool for curing infectious diseases and saving lives, but there are also some potential negative consequences that can come with taking antibiotics, some of which can be quite serious.
So in today's video, we're going to talk about how antibiotics work, the good things they can do for the human body, but also some of those potential negative effects, and, of course, we'll discuss some concerns around their future use and antibiotic resistance.
It's definitely going to be a fun one, so let's do this.
So what are antibiotics and how do?
## What are Antibiotics?
Killing Bacteria But Not Own Cells?
They work.
Antibiotics are substances that kill or inhibit the growth of bacteria, and one of the coolest and Most Fascinating Concepts that I remember learning about as a microbiology student- and this concept's very important to how antibiotics actually wor