<a href="https://colab.research.google.com/github/therohitdas/Youtube-Transcript-Generator/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YouTube Transcript Extraction and Processing

## Overview

This script extracts and processes transcripts from YouTube videos. It uses the [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api) to fetch the raw transcript, which can come from either auto-generated or manually created subtitles. For detailed features and options, refer to the [documentation](https://github.com/jdepoix/youtube-transcript-api).

Once the raw transcript is obtained, the script can add punctuation to it using [oliverguhr/fullstop-punctuation-multilang-large](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large), a model that supports multiple languages. Note that adding punctuation may take some time, depending on the length of the video.

For reference, generating the raw transcript and adding punctuation for a video 1 hour and 38 minutes long took approximately 5 minutes and 17 seconds.

## Requirements

- [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api)
- [deepmultilingualpunctuation](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large)
- nltk
- tqdm

## Usage

1. Open the [Google Colab notebook](https://colab.research.google.com/).
2. Click on **File > Save a copy in Drive** to create your own version.
3. Adjust the script parameters as needed.
4. Execute the script cell to process the YouTube video transcript.

## Script Parameters

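- `url`: YouTube video URL.
- `language`: Language code of the transcript (default: en).
- `raw`: Generate raw transcript (default: True). The processing function does not take this flag; the raw transcript is written whenever `punctuated` is False.
- `punctuated`: Generate punctuated transcript. When True, the output is saved as `<filename>_punctuated.txt`; otherwise as `<filename>_raw.txt`.
- `output_dir`: Output directory for the transcript.
- `filename`: Filename for the transcript file (excluding extension).
- `batch_size`: Number of sentences per batch for parallel processing, rounded down to a power of 2 (default: 0, which uses the number of CPU cores).
- `verbose`: Enable verbose mode, which shows the progress bar (default: True).
- `punctuation_model`: Name of the punctuation model to load (default: '', which uses the library's default model).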
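
The `batch_size` behaviour can be illustrated in isolation. The following is a minimal sketch that mirrors the expression used in the code cell of this notebook; `resolve_batch_size` and `make_batches` are hypothetical helper names introduced only for this illustration and are not part of the script.

```python
import math
import os


def resolve_batch_size(batch_size):
    """Round a positive batch size down to a power of two.

    A value of 0 falls back to the number of CPU cores (hypothetical helper).
    """
    if batch_size:
        return 2 ** int(math.log2(batch_size))
    return os.cpu_count() or 1


def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


print(resolve_batch_size(10))  # 10 is rounded down to 8
print(make_batches(['a', 'b', 'c', 'd', 'e'], 2))
```

Note that the number of batches, not the number of sentences, is what the worker pool iterates over, so a progress bar over the pool results advances once per batch.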

## Examples

```python
url = 'https://www.youtube.com/watch?v=YOUR_VIDEO_ID'
language = 'en'
raw = True
punctuated = False
output_dir = '/content'
filename = 'transcript_notes'
batch_size = 0
verbose = True
punctuation_model = ''

video_id = parse_youtube_url(url)
process_and_save_transcript(video_id, language, punctuated, output_dir, filename, batch_size, verbose, punctuation_model)
```
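
The `parse_youtube_url` call above relies on a regular expression to pull the 11-character video ID out of the URL. As a rough sketch of which URL shapes that pattern accepts, the block below reuses the same regular expression under a hypothetical name, `extract_video_id`, so that it can run on its own; the video ID is the one used in this notebook's own run.

```python
import re

# Same pattern as the notebook's parse_youtube_url: an 11-character ID after
# "v=" on youtube.com, or directly after "youtu.be/".
VIDEO_ID_PATTERN = re.compile(r'(?:youtube\.com\/.*?[?&]v=|youtu\.be\/)([^"&?\/\s]{11})')


def extract_video_id(url):
    """Return the video ID, or raise ValueError if the URL does not match."""
    match = VIDEO_ID_PATTERN.search(url)
    if match is None:
        raise ValueError('Invalid YouTube URL')
    return match.group(1)


# Standard watch URL
print(extract_video_id('https://www.youtube.com/watch?v=4Ff2ZrhVkp0'))
# Short link with a trailing query string
print(extract_video_id('https://youtu.be/4Ff2ZrhVkp0?t=30'))
```

URLs that carry the ID in the path rather than in a `v=` parameter, such as `/embed/` or `/shorts/` links, do not match this pattern and raise `ValueError`.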

## Acknowledgments
This script utilizes the [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api) and [deepmultilingualpunctuation](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large) libraries. Special thanks to their contributors.

Feel free to adapt and use the script based on your requirements. Enjoy the convenience of YouTube transcript processing!

## Connect with me
I am new to the AI world and would love to connect with other people who share this interest.
- [x/therohitdas](https://x.com/therohitdas)
- [github/therohitdas](https://github.com/therohitdas)

In [None]:
!pip install youtube-transcript-api deepmultilingualpunctuation nltk tqdm



In [None]:
import logging
import math
import os
import re
from multiprocessing import Pool

import nltk
import youtube_transcript_api
from deepmultilingualpunctuation import PunctuationModel
from nltk import sent_tokenize
from tqdm import tqdm

In [None]:
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

In [None]:
logging.basicConfig(level=logging.INFO)

def get_transcript(video_id, language):
    transcript_list = youtube_transcript_api.YouTubeTranscriptApi.get_transcript(video_id, languages=[language])
    transcript = ''
    for line in transcript_list:
        if "[music]" not in line['text'].lower():
            transcript += line['text'] + ' '
    return transcript

def add_punctuation(text, punctuation_model):
    # Use the named model if one is given, otherwise the library's default
    if punctuation_model != "":
        model = PunctuationModel(model=punctuation_model)
    else:
        model = PunctuationModel()
    return model.restore_punctuation(text)

def capitalize_sentences_batch(sentences):
    # Capitalize the first letter of each sentence in a batch.
    # Slicing with [:1] avoids an IndexError on an empty sentence.
    capitalized_sentences = [sentence[:1].upper() + sentence[1:] for sentence in sentences]
    return capitalized_sentences

def process_and_save_transcript(video_id, language, generate_punctuated, output_dir, filename, batch_size, verbose, punctuation_model):
    try:
        raw_transcript = get_transcript(video_id, language)

        # Punctuate first if requested; the remaining steps are the same for both outputs
        if generate_punctuated:
            text = add_punctuation(raw_transcript, punctuation_model)
            suffix, label = 'punctuated', 'Punctuated'
        else:
            text = raw_transcript
            suffix, label = 'raw', 'Raw'

        sentences = sent_tokenize(text)
        num_processes = os.cpu_count() or 1  # Number of available CPU cores
        # Restrict batch size to a power of 2; default to the number of cores
        batch_size = 2**int(math.log2(batch_size)) if batch_size else num_processes
        batches = [sentences[i:i+batch_size] for i in range(0, len(sentences), batch_size)]
        with Pool() as pool:
            # total counts batches, since imap yields one result per batch
            capitalized_batches = list(tqdm(pool.imap(capitalize_sentences_batch, batches), total=len(batches), desc='Processing', disable=not verbose))
        capitalized_transcript = ' '.join(sentence for batch in capitalized_batches for sentence in batch)

        output_path = os.path.join(output_dir, f'{filename}_{suffix}.txt')
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(capitalized_transcript)
        logging.info(f'{label} transcript saved to {output_path}')

    except Exception as e:
        logging.error(f'Error: {e}')

def parse_youtube_url(url):
    video_id_match = re.search(r'(?:youtube\.com\/.*?[?&]v=|youtu\.be\/)([^"&?\/\s]{11})', url)
    if video_id_match:
        return video_id_match.group(1)
    else:
        raise ValueError('Invalid YouTube URL')

## Example Usage:
```python
url = 'https://www.youtube.com/watch?v=YOUR_VIDEO_ID'
video_id = parse_youtube_url(url)
language = 'en'
punctuated = True
output_dir = '.'
filename = 'output' # Or set it to video_id
batch_size = 0
verbose = True
punctuation_model = ''
```
`language`: the language code of the transcript to fetch. By default, the youtube-transcript-api module picks a manually created transcript over an automatically generated one when both are available in the requested language.

`punctuation_model` values can be found at https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large#languages

In [None]:
url = 'https://www.youtube.com/watch?v=4Ff2ZrhVkp0'
video_id = parse_youtube_url(url)
language = 'en'
punctuated = True
output_dir = '.'
filename = 'new'
batch_size = 0
verbose = True
punctuation_model = ''

In [None]:
process_and_save_transcript(video_id, language, punctuated, output_dir, filename, batch_size, verbose, punctuation_model)

Processing:  51%|█████     | 22/43 [00:00<00:00, 62729.22it/s]
