# YouTube Video Summarization with Hugging Face ASR

This script summarizes a YouTube video in three steps: it downloads the video's audio, transcribes the audio with a wav2vec2 Automatic Speech Recognition (ASR) model from the Hugging Face Hub, and condenses the transcript with a summarization pipeline from the transformers library.

# Dependencies

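* pytube: A Python library for downloading YouTube videos.
* huggingsound: A toolkit for speech recognition built on Hugging Face's tools.
* librosa: A library for audio and music analysis.
* soundfile: A library for reading and writing sound files.
* transformers: A library for natural language processing tasks, including text summarization.
* torch: The PyTorch library, used here to check whether a GPU is available.
* ffmpeg: A command-line tool (not a Python package) used to convert the audio to WAV.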

# Usage
* Set the VIDEO_URL variable to the URL of the YouTube video you want to summarize.
* Run the script in an environment with the required dependencies installed.
* The script will download the audio, transcribe it, and generate a summarized text output.

# Download YouTube Video's Audio
The code begins by installing the necessary dependencies and downloading the audio from a specified YouTube video using pytube. The video URL is provided as input.

In [2]:
! pip install pytube -q


In [3]:
from pytube import YouTube

In [None]:
#VIDEO_URL = "https://www.youtube.com/watch?v=hWLf6JFbZoo" #obama

In [4]:
VIDEO_URL = 'https://www.youtube.com/watch?v=h-JVjs9AAmQ' #batman

In [None]:
#VIDEO_URL = 'https://youtu.be/qNJRGHk7sN8'

In [5]:
yt = YouTube(VIDEO_URL)

In [7]:
yt.streams \
  .filter(only_audio=True, file_extension='mp4') \
  .first() \
  .download(filename='ytaudio.mp4')


'/content/ytaudio.mp4'

# Convert Audio to WAV Format
The downloaded audio file is converted with ffmpeg to a 16-bit PCM WAV file sampled at 16 kHz, the sample rate the ASR model expects.

In [8]:
! ffmpeg -i ytaudio.mp4 -acodec pcm_s16le -ar 16000 ytaudio.wav

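The same conversion can be driven from Python instead of a notebook shell escape. The following is a minimal sketch, assuming `ffmpeg` is available on the `PATH`. The helper names `build_ffmpeg_command` and `convert_to_wav` are hypothetical and not part of the original script, and the optional `-y` overwrite flag is an addition to the command shown above.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000, overwrite=False):
    """Return the ffmpeg argument list for a 16-bit PCM WAV conversion."""
    cmd = ["ffmpeg"]
    if overwrite:
        # Overwrite dst without prompting (not in the original command)
        cmd.append("-y")
    cmd += ["-i", src, "-acodec", "pcm_s16le", "-ar", str(sample_rate), dst]
    return cmd


def convert_to_wav(src, dst, sample_rate=16000, overwrite=False):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate, overwrite), check=True)


# Show the command that would be executed for the files used in this script
print(build_ffmpeg_command("ytaudio.mp4", "ytaudio.wav"))
```

Calling `convert_to_wav("ytaudio.mp4", "ytaudio.wav")` would then run the conversion. Passing the arguments as a list avoids shell quoting problems with unusual file names.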

# English ASR with HuggingSound
The speech in the audio is transcribed to text with jonatasgrosman/wav2vec2-large-xlsr-53-english, a wav2vec2 ASR model hosted on the Hugging Face Hub and loaded through HuggingSound.

In [9]:
!pip install huggingsound -q


In [10]:
from huggingsound import SpeechRecognitionModel


In [11]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

In [12]:
device

'cuda'

In [13]:
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english", device = device)



Note: transcribing the full recording in a single pass can end in an out-of-memory (OOM) error.

# Audio Chunking
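To avoid out-of-memory errors, the audio file is split into 30-second chunks with librosa. With frame_length and hop_length both set to 16000 samples (one second at 16 kHz), a block_length of 30 frames gives 30 seconds of audio per chunk.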

In [14]:
import librosa

In [15]:
input_file = '/content/ytaudio.wav'

In [16]:
print(librosa.get_samplerate(input_file))

# Stream the file in 30-second chunks rather than loading it all at once
stream = librosa.stream(
    input_file,
    block_length=30,
    frame_length=16000,
    hop_length=16000
)

16000
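The chunk length implied by the stream parameters can be checked with a little arithmetic. The sketch below does this in plain Python. The helpers `samples_per_block` and `expected_chunks` are hypothetical names, and the 250-second duration is an illustrative value, not a measurement of the video.

```python
import math


def samples_per_block(block_length, frame_length, hop_length):
    """Samples covered by one block of `block_length` frames."""
    # The first frame contributes frame_length samples; each later frame
    # advances the window by hop_length samples.
    return frame_length + (block_length - 1) * hop_length


def expected_chunks(total_samples, block_samples):
    """Blocks needed to cover the file, assuming non-overlapping blocks."""
    # The final block may be shorter than the others.
    return math.ceil(total_samples / block_samples)


SAMPLE_RATE = 16000
block = samples_per_block(block_length=30, frame_length=16000, hop_length=16000)
print(block / SAMPLE_RATE)                        # seconds per chunk
print(expected_chunks(250 * SAMPLE_RATE, block))  # chunks for a 250-second file
```

Because `frame_length` equals `hop_length` here, consecutive frames do not overlap, so the chunk count is simply the file length divided by the block size, rounded up.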


In [17]:
import soundfile as sf


In [18]:
for i,speech in enumerate(stream):
  sf.write(f'{i}.wav', speech, 16000)

In [19]:
i

8

# Audio Transcription / ASR / Speech to Text
Each chunk of the audio file is transcribed using the ASR model, resulting in a list of transcriptions.

In [20]:
# Paths of the chunk files written above: 0.wav through {i}.wav
audio_path = [f'/content/{a}.wav' for a in range(i + 1)]

In [21]:
audio_path

['/content/0.wav',
 '/content/1.wav',
 '/content/2.wav',
 '/content/3.wav',
 '/content/4.wav',
 '/content/5.wav',
 '/content/6.wav',
 '/content/7.wav',
 '/content/8.wav']
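The list above is rebuilt from the loop counter `i`, which keeps the chunks in order. If the chunk files were instead discovered on disk, a plain lexicographic sort would place `10.wav` before `2.wav`. The following sketch sorts by the numeric file stem; `sort_chunk_paths` is a hypothetical helper and the sample paths are illustrative.

```python
from pathlib import Path


def sort_chunk_paths(paths):
    """Order chunk files such as '0.wav', '1.wav', ... by their numeric stem."""
    return sorted(paths, key=lambda p: int(Path(p).stem))


# Illustrative paths in the same naming scheme as the chunks above
paths = ["/content/10.wav", "/content/2.wav", "/content/0.wav"]
print(sort_chunk_paths(paths))
```

Keeping the chunks in playback order matters because the transcriptions are concatenated in list order.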

In [22]:
transcriptions = model.transcribe(audio_path)

100%|██████████| 9/9 [00:06<00:00,  1.39it/s]


In [23]:
full_transcript = ' '

In [24]:
# Concatenate the per-chunk transcriptions (no separator is added between chunks)
for item in transcriptions:
  full_transcript += item['transcription']

In [25]:
len(full_transcript)

3091
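Because the chunk boundaries fall at fixed 30-second offsets, concatenating the transcriptions with no separator fuses the last word of one chunk to the first word of the next. A minimal sketch of a space-separated join follows. The `transcriptions` list here is a stand-in with made-up text that mirrors the `{'transcription': ...}` dictionaries used above, and `join_transcriptions` is a hypothetical helper.

```python
def join_transcriptions(transcriptions, sep=" "):
    """Join per-chunk transcription strings, skipping empty chunks."""
    parts = (item["transcription"].strip() for item in transcriptions)
    return sep.join(part for part in parts if part)


# Stand-in data shaped like the transcription results (text is illustrative)
transcriptions = [
    {"transcription": "the suit makes it real"},
    {"transcription": ""},
    {"transcription": "it fits so well"},
]
print(join_transcriptions(transcriptions))
```

A separator does not repair a word that was itself cut in half at a chunk boundary. It only prevents two whole words from being merged.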

# Text Summarization
The transcribed text is then summarized using the summarization pipeline from the transformers library. This pipeline utilizes a pre-trained model for summarizing text.

In [26]:
from transformers import pipeline

In [27]:
summarization = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
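The warning above recommends naming the model explicitly. A minimal sketch of doing so follows, assuming the same checkpoint and revision that the log reports as the default. `make_summarizer` is a hypothetical helper name, and the import is deferred so that defining the function does not load transformers.

```python
SUMMARIZER_MODEL = "sshleifer/distilbart-cnn-12-6"
SUMMARIZER_REVISION = "a4f8f3e"  # revision reported in the log above


def make_summarizer(model=SUMMARIZER_MODEL, revision=SUMMARIZER_REVISION):
    """Build a summarization pipeline with the model and revision pinned."""
    from transformers import pipeline  # deferred: heavy import

    return pipeline("summarization", model=model, revision=revision)
```

With this helper, `summarization = make_summarizer()` would stand in for `pipeline('summarization')`, so later runs keep using the same checkpoint even if the library's default changes.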



In [28]:
summarized_text = summarization(full_transcript)

In [29]:
summarized_text[0]['summary_text']

" The role of cat woman has been played by litedary access arears from rto kit and shell fifer . Batman is about etwo sides of drauma batman's born of trama andtis film maybes the ridler and tat is kind of the seed from which everything else grew ."

# Text Chunking before Summarization
The summarization model accepts only a limited input length, so the full transcript is split into segments of 1000 characters, and each segment is summarized separately.

In [31]:
CHUNK_SIZE = 1000  # characters per chunk
summarized_text = []
# Step through the transcript in fixed-size windows; stepping by CHUNK_SIZE
# avoids an empty final chunk when the length is an exact multiple of it
for start in range(0, len(full_transcript), CHUNK_SIZE):
  chunk = full_transcript[start:start + CHUNK_SIZE]
  print("input text \n" + chunk)
  out = summarization(chunk, min_length=5, max_length=20)
  out = out[0]['summary_text']
  print("Summarized text\n" + out)
  summarized_text.append(out)

print(summarized_text)

input text 
 someone has been very vocal about being a batman then for prety muture whole lifeh described the filling a putt on the batsy for the first time waempowering emotionalyfrom getting costs to putting it on the first timeproly months so you gone through holting with a  reactionyour own reactioreactio eventryprepare foritmreading the most untold number of graphic novelsandwhen you findlly get to put on the suit its just suddenind of itmakes it real but isuddenyyou suddenly think you can dosomething totally different withte character its wed thatyouput it heistthat you put it on the first amnual andits so well-designed i fit so perfitlyan you ca movein it so well andyou just looki theycied powerful n you can really that so much history is imbutsuit andcnography y feel it when you put it onthothe onscrean the role of cat woman has been played by litedary access arears from rto kit and shell fifer how does it fal to take on the mantle of such a powerful female characterits scarry 
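Slicing at fixed character offsets, as the loop above does, can cut a word in half at each boundary. One alternative, sketched here and not used by the original script, is the standard library's `textwrap.wrap`, which breaks on whitespace. The `chunk_text` helper name and the sample string are hypothetical.

```python
import textwrap


def chunk_text(text, width=1000):
    """Split text on whitespace into chunks of at most `width` characters.

    A single token longer than `width` is kept whole, so its chunk
    can exceed the limit.
    """
    return textwrap.wrap(text, width=width, break_long_words=False)


sample = "so much history is embedded in the suit " * 60  # illustrative text
chunks = chunk_text(sample, width=1000)
print(len(chunks), [len(c) for c in chunks])
```

Each element of `chunks` could then be passed to the summarization pipeline in place of the fixed-offset slices.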

# Example Output
The script outputs a short summary for each chunk of the transcript, which together give a concise representation of the video's content. Along the way it also prints intermediate results: each input text chunk followed by its summary.

