# Downloading audio and subtitles

We download the audio and the automatically generated subtitles of this [video](https://www.youtube.com/watch?v=hvjfkk0y3gU) using the **yt_dlp** library.

In [9]:
# !pip install yt_dlp

In [3]:
import yt_dlp as youtube_dl

In [4]:
video_url = "https://www.youtube.com/watch?v=hvjfkk0y3gU"

In [5]:
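options = {
    'format': 'bestaudio/best',
    'writeautomaticsub': True,  # also save the auto-generated subtitles
    'outtmpl': 'audio_bbt.%(ext)s',  # output file name template for the audio
    'postprocessors': [{
        # convert the downloaded audio to WAV so that audio_bbt.wav exists for the later cells (requires ffmpeg)
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav',
    }, {
        'key': 'FFmpegSubtitlesConvertor',
        'format': 'vtt',
    }],
    'verbose': True,
}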

In [6]:
def download_audio_with_subtitles(url, options):
    # download the audio and subtitles
    with youtube_dl.YoutubeDL(options) as ydl:
        ydl.download([url])

In [None]:
download_audio_with_subtitles(video_url, options)
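The downloaded captions arrive as a WebVTT file with timing lines, inline word-timing tags, and rolling repeats, so they need flattening before they can be compared with ASR output. Below is a minimal sketch of one way to do that; `vtt_to_text` is a hypothetical helper, and it assumes YouTube-style cues without numeric cue identifiers.

```python
import re

TAG = re.compile(r"<[^>]+>")  # inline tags such as <c> or <00:00:01.000>


def vtt_to_text(vtt: str) -> str:
    """Collapse the contents of a WebVTT caption file into one plain-text string."""
    lines = []
    for raw in vtt.splitlines():
        line = TAG.sub("", raw).strip()
        # Skip blanks, the file header, cue timing lines and metadata lines.
        if not line or line.startswith("WEBVTT") or "-->" in line:
            continue
        if line.startswith(("Kind:", "Language:", "NOTE")):
            continue
        # Rolling auto-captions repeat the previous line; keep only new text.
        if lines and lines[-1] == line:
            continue
        lines.append(line)
    return " ".join(lines)
```

A possible call would be `vtt_to_text(open("audio_bbt.en.vtt", encoding="utf-8").read())`; the file name here is an assumption, since the actual name depends on `outtmpl` and on the caption language.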

# ASR: SpeechRecognition, HuggingFace transformers, ...

### Our file is too long, so we need to trim it

In [22]:
# ! pip install pydub

In [5]:
from pydub import AudioSegment

In [23]:
input_audio_path = 'audio_bbt.wav'

In [24]:
audio = AudioSegment.from_wav(input_audio_path)

In [25]:
duration = 120 * 1000    # set the clip length to 2 minutes (pydub slices in milliseconds)
cropped_audio = audio[:duration]

In [26]:
output_audio_path = 'cropped_audio_bbt.wav'

In [27]:
# Save the trimmed audio file
cropped_audio.export(output_audio_path, format="wav")

<_io.BufferedRandom name='cropped_audio_bbt.wav'>
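For an uncompressed PCM WAV file, the same trim can also be done without pydub by copying the first `seconds * framerate` frames with the standard-library `wave` module. This is a sketch of that alternative, not what the notebook uses; `trim_wav` is a hypothetical helper.

```python
import wave


def trim_wav(src_path: str, dst_path: str, seconds: float) -> float:
    """Copy the first `seconds` of a PCM WAV file; return the new duration in seconds."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Never request more frames than the source actually has.
        n_frames = min(int(seconds * src.getframerate()), src.getnframes())
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # the frame count in the header is fixed up on close
        dst.writeframes(frames)
    return n_frames / params.framerate
```

Under these assumptions, `trim_wav('audio_bbt.wav', 'cropped_audio_bbt.wav', 120)` would produce a two-minute clip comparable to the pydub one.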

### SpeechRecognition

In [28]:
# !pip install SpeechRecognition

In [29]:
# !pip install git+https://github.com/openai/whisper.git soundfile

In [30]:
import speech_recognition as sr

In [31]:
r = sr.Recognizer()
with sr.AudioFile('cropped_audio_bbt.wav') as source:
    audio = r.record(source)

In [32]:
# Google Web Speech API (with the default API key)
google_generated = r.recognize_google(audio, language='en')

In [33]:
google_generated

'hey Google'

It is unclear why, but Google apparently failed to recognize the audio: it returned only 'hey Google'.
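One possible explanation is that the free endpoint copes poorly with a two-minute clip; this is an assumption, not something the notebook establishes. If it holds, recognizing shorter windows one at a time might help. The sketch below only computes the window boundaries in milliseconds; `chunk_bounds` and the 30-second window are hypothetical choices.

```python
def chunk_bounds(total_ms: int, chunk_ms: int = 30_000) -> list[tuple[int, int]]:
    """Split [0, total_ms) into consecutive windows of at most chunk_ms each."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


# Boundaries for a 2-minute clip cut into 30-second windows.
bounds = chunk_bounds(120_000)
print(bounds)
```

With pydub, each window could be cut as `cropped_audio[start:end]`, exported to its own WAV file, and passed to `r.recognize_google`. That call raises `sr.UnknownValueError` when nothing is recognized, so each chunk would need its own `try`/`except`.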

In [34]:
# OpenAI Whisper library
whisper_generated = r.recognize_whisper(audio)

In [35]:
whisper_generated

" Let's go! Let's go! Let's go! Let's go! To the planetarium! Let's go! The world's far ever! Penny, I just wanted to say good luck and hope there's no hard feelings. Hey, Romeo, repair your relationship on your own time! Oh, in the face. Oh my god, cinnamon are you okay? What is- I can't believe you do! Thank you. Action! Oh, I'm sorry from where? So, how are you gonna re-throw it in the DVD if, uh, sorry, I'm sorry, I didn't get it. Okay. Still rolling in? It's radio, Fey, it's fine. Amy, I could use your help. Well, let me guess, there's an undergrad in a leather jacket. Check it. And if it's a clear night, I'm gonna lay some romantic astronomy on her. Really? Is that my line? Oh my god, cinnamon are you okay? Why can't I believe you do? You do whatever it takes to save her life if she needs new organs, I'll bite down and- RUN!"

### HuggingFace Hub: faster-whisper (WhisperModel)

[model link](https://huggingface.co/Systran/faster-whisper-large-v3)

In [13]:
# !pip install faster_whisper

In [37]:
from faster_whisper import WhisperModel

model = WhisperModel("large-v3")

In [41]:
segments, info = model.transcribe('cropped_audio_bbt.wav')

In [42]:
faster_whisper_generated = ''
for segment in segments:
    faster_whisper_generated += segment.text
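Besides `text`, each segment yielded by `model.transcribe` also carries `start` and `end` times in seconds, and `segments` itself is a generator, so the transcription only runs while it is being iterated. The sketch below uses a stand-in `Segment` type with made-up timings so that it runs without the model; `format_segments` is a hypothetical helper.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """Stand-in for faster_whisper's segment objects (only the fields used here)."""
    start: float
    end: float
    text: str


def format_segments(segments: Iterable[Segment]) -> tuple[str, list[str]]:
    """Return the full transcript plus one timestamped line per segment."""
    segments = list(segments)  # a generator can be consumed only once
    transcript = "".join(s.text for s in segments).strip()
    timed = [f"[{s.start:6.2f}s -> {s.end:6.2f}s]{s.text}" for s in segments]
    return transcript, timed


# Made-up timings, for illustration only.
demo = [Segment(0.0, 1.8, " To the Tar Pits!"), Segment(1.8, 2.6, " Let's go!")]
transcript, timed = format_segments(demo)
print(transcript)
print("\n".join(timed))
```

Joining with `"".join(...)` avoids repeated string concatenation, and the timestamped lines make it easier to line a hypothesis up against the subtitles by time.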

In [43]:
faster_whisper_generated

" To the Tar Pits! Let's go! To the Planetarium! Let's go! I have the cheapest car. The worst car ever! Penny, I just wanted to say good luck, and hope there's no hard feelings. Hey, Romeo! Repair your relationship on your own time! Oh. In the face. Oh my God, Cinnamon, are you okay? What is... I can't believe you two! I like it. I like it. Action. Oh, I'm sorry, from where? So, how you gonna return the DVD, if, uh, so-so-so-so-right again? Okay. Still rolling, guys. Same spot, here we go. It's radio, Faye. It's fine. Amy, I could use your help. Oh, let me guess. There's an undergrad in a leather jacket. Jacket. And if it's a clear night, I'm going to lay some romantic astronomy on her. Really? Is that my line? Oh my God, Cinnamon, are you okay? I can't believe you two. You do whatever it takes to save her life. If she needs new organs, I'll bite down and... No!"

### Another Transformer

[model link](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english)

In [85]:
# !pip install huggingsound

In [92]:
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")

INFO:huggingsound.speech_recognition.model:Loading model...
Some weights of the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-english were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-english and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encod

In [None]:
model.transcribe(['/content/cropped_audio_bbt.wav'])

# Evaluating the selected systems and YouTube's auto-generated captions

In [95]:
# !pip install transformers datasets evaluate jiwer

In [96]:
results_dict = {
    "YouTube": " to the tar pits let's go the planetarium let's go i just wanted to say good luck and penny i just wanted to say good luck and hope there's no hard feelings hey romeo repair your relationship on your own time in the face my god cinnamon are you okay so how you gonna return to the dvd if uh again okay still rolling guys same spot here we go it's radio fay it's fine amy i could use your help oh let me guess there's an undergrad and a leather jacket jacket and if it's a clear night i'm gonna lay some romantic astronomy on her oh my god cinnamon are you okay well i can't believe you do you do whatever it takes to save her life if she needs new organs i'll back down",
    "SpeechRecognition": " Let's go! Let's go! Let's go! Let's go! To the planetarium! Let's go! The world's far ever! Penny, I just wanted to say good luck and hope there's no hard feelings. Hey, Romeo, repair your relationship on your own time! Oh, in the face. Oh my god, cinnamon are you okay? What is- I can't believe you do! Thank you. Action! Oh, I'm sorry from where? So, how are you gonna re-throw it in the DVD if, uh, sorry, I'm sorry, I didn't get it. Okay. Still rolling in? It's radio, Fey, it's fine. Amy, I could use your help. Well, let me guess, there's an undergrad in a leather jacket. Check it. And if it's a clear night, I'm gonna lay some romantic astronomy on her. Really? Is that my line? Oh my god, cinnamon are you okay? Why can't I believe you do? You do whatever it takes to save her life if she needs new organs, I'll bite down and- RUN!",
    "Faster_Whisper": " To the Tar Pits! Let's go! To the Planetarium! Let's go! I have the cheapest car. The worst car ever! Penny, I just wanted to say good luck, and hope there's no hard feelings. Hey, Romeo! Repair your relationship on your own time! Oh. In the face. Oh my God, Cinnamon, are you okay? What is... I can't believe you two! I like it. I like it. Action. Oh, I'm sorry, from where? So, how you gonna return the DVD, if, uh, so-so-so-so-right again? Okay. Still rolling, guys. Same spot, here we go. It's radio, Faye. It's fine. Amy, I could use your help. Oh, let me guess. There's an undergrad in a leather jacket. Jacket. And if it's a clear night, I'm going to lay some romantic astronomy on her. Really? Is that my line? Oh my God, Cinnamon, are you okay? I can't believe you two. You do whatever it takes to save her life. If she needs new organs, I'll bite down and... No!",
    "Wav2Vec": "carpet planetary eli just fun good lock and ers no hardfeelingyrepare you mention cian your ond timein the facemy od cineu what ii't believe you dosorry yso i can a return to the deputee if sucsit's radiouse your helphomigas theres an underground leather jacket jacketand cllygonay ome romantic astronomy oreallyaylinemy gd cnemonaykwhy gon't believe you o you do whatever take to save mor life if she lu's new organs i by"
}

In [97]:
import json
import jiwer

In [98]:
reference = "To the tar pits! Let's go! To the Planetarium! Let's go! We have the cheapest car... The worst car ever! Penny, I just wanted to say good luck and I hope there's no hard feelings... Hey! Romeo, repair your relationship on your own time! Oh... In the face... Oh my God, Cinnamon, are you okay? Although... What is... I can't believe you two! Action! Oh, I'm sorry, from where? I'm sorry! So, how are you going to return a DVD if... Oh, sorry, right, again! Okay, still rolling guys! Same spot, here we go! It's radio fay, it's fine... Amy, I could use your help? Oh, let me guess, there's an undergrad and a leather jacket? Jacket... And if it's a clear night I'm going lay some romantic astronomy on her. Really? Is that my  line? Oh my God, Cinnamon, are you okay? Why! I can't believe you two! You do whatever it takes to save her lifeif she needs new organs I'll buy du.."

In [99]:
from evaluate import load
wer = load("wer")
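For orientation, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words: WER = (S + D + I) / N. The sketch below computes it from scratch with dynamic programming; it illustrates the definition and is not the code that `evaluate` or `jiwer` run internally.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("to the tar pits", "to the car pits"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.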

In [103]:
import pandas as pd
wer_results = {}
for key, value in results_dict.items():
    wer_score = wer.compute(predictions=[value], references=[reference])
    wer_results[key] = wer_score

df = pd.DataFrame(list(wer_results.items()), columns=['ASR Model', 'WER'])
df

| | ASR Model | WER |
|---|---|---|
| 0 | YouTube | 0.575758 |
| 1 | SpeechRecognition | 0.436364 |
| 2 | Faster_Whisper | 0.351515 |
| 3 | Wav2Vec | 0.866667 |
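The strings being compared are not formatted alike: the YouTube and Wav2Vec outputs are lowercase with no punctuation, while the reference and both Whisper outputs are capitalized and punctuated. As far as I can tell the metric compares whitespace-separated tokens literally, so `go!` and `go` count as different words and the scores mix recognition errors with formatting differences. Below is a minimal sketch of a normalizer that could be applied to both sides first; `normalize` is a hypothetical helper, and keeping apostrophes is a design choice.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation (except apostrophes) with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


print(normalize("To the Tar Pits! Let's go!"))
```

It could be plugged into the loop as `wer.compute(predictions=[normalize(value)], references=[normalize(reference)])`; the resulting scores would differ from the table above and are not reported here.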


### Conclusion:
Faster_Whisper produced the best result (WER 0.35), followed by regular Whisper run through SpeechRecognition (0.44). YouTube's auto-generated captions scored 0.58, and wav2vec performed worst (0.87). Google, for reasons that remain unclear, produced no usable transcript at all.