# 1. Downloading the data
youtube-dl did not work for me because of a version conflict, so I used the [pytube](https://github.com/pytube/pytube) library to download the video and the subtitles.

In [1]:
%%capture
! pip install pytube

In [2]:
from pytube import YouTube

The video is a [short makeup tutorial](https://youtu.be/052P7l-p38A?si=OcMXfjZKo1ITUc2m) in English. It has both auto-generated subtitles and subtitles embedded by the author; the latter are used as the reference (`etalon`).

In [3]:
link = 'https://youtu.be/052P7l-p38A?si=OcMXfjZKo1ITUc2m'

In [4]:
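video = YouTube(link)
# Note: by default get_audio_only() returns an MP4 audio stream, so despite its
# name 'audio.wav' is not a PCM WAV file; it is converted in section 2.2.
video.streams.get_audio_only().download(filename='audio.wav')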

'/content/audio.wav'

In [5]:
print('Subtitle list: ', video.captions)

Subtitle list:  {'en': <Caption lang="English" code="en">, 'a.en': <Caption lang="English (auto-generated)" code="a.en">, 'fil': <Caption lang="Filipino" code="fil">, 'fr': <Caption lang="French" code="fr">, 'de': <Caption lang="German" code="de">, 'hi': <Caption lang="Hindi" code="hi">, 'id': <Caption lang="Indonesian" code="id">, 'ja': <Caption lang="Japanese" code="ja">, 'ko': <Caption lang="Korean" code="ko">, 'pl': <Caption lang="Polish" code="pl">, 'ru': <Caption lang="Russian" code="ru">, 'es': <Caption lang="Spanish" code="es">, 'th': <Caption lang="Thai" code="th">, 'vi': <Caption lang="Vietnamese" code="vi">}


I save the subtitles I need in XML format.

In [6]:
video.captions["a.en"].download(title="youtube_generated", srt=False)

'/content/youtube_generated (a.en).xml'

In [7]:
video.captions["en"].download(title="etalon", srt=False)

'/content/etalon (en).xml'

I convert the XML files to strings and write the reference text to a text file.

In [28]:
import xml.etree.ElementTree as ET

def extract_text_from_xml_en(filename):
    """Join the text of all <p> elements (subtitles embedded by the author)."""
    tree = ET.parse(filename)
    root = tree.getroot()

    text = ''
    for p in root.findall('.//p'):
        if p.text is not None:
            text += p.text.strip() + ' '

    return text.strip().replace('\xa0', ' ').replace('\ufeff', ' ').replace('\n', ' ')

def extract_text_from_xml_a_en(filename):
    """Join the text of all <s> elements (auto-generated subtitles)."""
    tree = ET.parse(filename)
    root = tree.getroot()

    text = ''
    for s in root.findall('.//s'):
        if s.text is not None:
            text += s.text.strip() + ' '

    return text.strip().replace('\xa0', ' ').replace('\ufeff', ' ').replace('\n', ' ')

def save_text_to_file(text, filename):
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(text)

def rewrite(filename_in, filename_out, rewrite=True):
    """rewrite=True: parse the author's subtitles and save them to filename_out.
    rewrite=False: parse the auto-generated subtitles without saving."""
    if rewrite:
        text = extract_text_from_xml_en(filename_in)
        save_text_to_file(text, filename_out)
    else:
        text = extract_text_from_xml_a_en(filename_in)
    return text
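
The two extraction functions above differ only in the tag they search for, so they could be folded into one parametrized helper. The following is a minimal sketch, not the code used in this notebook: `extract_caption_text` and the sample XML strings are hypothetical, and the samples only assume the shape implied by the functions above (text directly inside `<p>` for the author's subtitles, and inside `<s>` children for the auto-generated ones).

```python
import xml.etree.ElementTree as ET


def extract_caption_text(xml_string, tag):
    """Join the text of every <tag> element into one space-separated string."""
    root = ET.fromstring(xml_string)
    pieces = []
    for element in root.iter(tag):
        if element.text is not None:
            # Turn non-breaking spaces, BOMs and newlines into plain spaces,
            # then split so that runs of whitespace collapse.
            cleaned = (element.text.replace('\xa0', ' ')
                                   .replace('\ufeff', ' ')
                                   .replace('\n', ' '))
            pieces.extend(cleaned.split())
    return ' '.join(pieces)


# Hypothetical samples that mimic the assumed structure of the two files.
manual_xml = ('<timedtext><body><p t="0">Welcome back\nto the channel</p>'
              '<p t="900">Today:\xa0makeup</p></body></timedtext>')
auto_xml = ('<timedtext><body><p t="0"><s>welcome</s><s> back</s></p>'
            '<p t="900"><s>today</s></p></body></timedtext>')

manual_text = extract_caption_text(manual_xml, 'p')
auto_text = extract_caption_text(auto_xml, 's')
print(manual_text)
print(auto_text)
```

Unlike the original functions, this variant also collapses repeated spaces, which does not affect a word-level comparison.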

In [29]:
text_etalon = rewrite('etalon (en).xml', 'etalon.txt')

In [30]:
text_youtube = rewrite('youtube_generated (a.en).xml', None, rewrite=False)

In [31]:
text_youtube

"foreign welcome back to Dear peachy in our video today we are going to show you a quick tutorial for the stoyan dolly soft Glam look this beauty Guru here named siautar and she has gained more than 200 000 followers on her CL hongshu if you love to try dogin makeup but do not know how to start this look is perfect for you to begin first Todd is going to prep her skin before applying her base she is moisturizing the dry patches on her face using the serum bomb stick she is applying at the sides of nose mouth corners and under eye to avoid her base from creasing later if you have the same issue you can apply any lightweight moisturizer that you have on those dry patches let the moisturizer set for around 15 minutes before you apply any makeup so that the moisturizer gets absorbed by the skin once the skincare is set into your skin layer a thin coat of foundation to even out your skin tone Pat them out with a cushioned sponge matte highlighter is dabbed the inner cheeks to brighten the c

# 2. ASR
## 2.1 HuggingFace wav2vec2

In [32]:
from transformers import pipeline

In [33]:
transcriber = pipeline("automatic-speech-recognition", "facebook/wav2vec2-base-960h")
text_wav2vec2 = transcriber('audio.wav')['text']

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You sho

## 2.2 SpeechRecognition
First I convert the audio file so that the library can work with it.

In [34]:
%%capture
! pip install SpeechRecognition
! pip install git+https://github.com/openai/whisper.git soundfile

In [35]:
import speech_recognition as sr
import librosa
import soundfile as sf

In [36]:
%%capture
# Resample to 16 kHz mono and re-save as a WAV file that SpeechRecognition can read
x, _ = librosa.load('audio.wav', sr=16000)
sf.write('audio_transformed.wav', x, 16000)
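
SpeechRecognition's `AudioFile` reads PCM WAV, AIFF and FLAC files, which is why the downloaded audio is re-saved here. One way to confirm that a converted file has the intended channel count and sample rate is to read its header. Below is a minimal sketch using only the standard library; `describe_wav` is a hypothetical helper, and a synthetic one-second tone stands in for `audio_transformed.wav` so that the example runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) from a WAV header."""
    with wave.open(path, 'rb') as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration


# Stand-in for 'audio_transformed.wav': one second of a 440 Hz tone,
# mono, 16-bit PCM, 16 kHz.
sample_rate = 16000
samples = [int(10000 * math.sin(2 * math.pi * 440 * n / sample_rate))
           for n in range(sample_rate)]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
with wave.open(demo_path, 'wb') as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav_file.setframerate(sample_rate)
    wav_file.writeframes(struct.pack('<%dh' % len(samples), *samples))

wav_info = describe_wav(demo_path)
print(wav_info)
```

Calling `describe_wav('audio_transformed.wav')` after the conversion cell should work the same way, assuming the file was written as integer PCM, which the `wave` module can read.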

In [37]:
r = sr.Recognizer()
with sr.AudioFile('audio_transformed.wav') as source:
    audio = r.record(source)

In [38]:
# OpenAI Whisper library
text_whisper = r.recognize_whisper(audio)

In [39]:
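# Google Speech Recognition (Web Speech API) with the library's default key;
# the separate Google Cloud Speech API would be r.recognize_google_cloud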
text_cloud_speech = r.recognize_google(audio)

I write the results to a separate JSON file.

In [40]:
import json

In [41]:
with open('asr.json', 'w', encoding='utf-8') as f:
  json.dump({'youtube': text_youtube,
             'wav2vec2': text_wav2vec2,
             'whisper': text_whisper,
             'cloud_speech': text_cloud_speech},
            f, ensure_ascii=False, indent='\t')


# 3. WER

In [42]:
%%capture
! pip install jiwer

In [47]:
from jiwer import wer
import re

Alternatively, the files published on GitHub can be downloaded.

In [None]:
%%capture
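# Raw URLs are used here: the /blob/ links return the GitHub HTML page, not the file itself
#! wget "https://raw.githubusercontent.com/ssakk/avtobreya/main/4_year/hw2/asr.json"
#! wget "https://raw.githubusercontent.com/ssakk/avtobreya/main/4_year/hw2/etalon.txt"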

In [45]:
with open('etalon.txt', 'r', encoding='utf-8') as f:
  etalon = f.read()

with open('asr.json', 'r', encoding='utf-8') as f:
  asr = json.load(f)

In [48]:
for key, text in asr.items():
  # Lowercase the ASR output and drop punctuation; the reference is passed as is
  text_clean = ' '.join(re.findall(r'\w+', text.lower()))
  print('WER for', key, ':', wer(etalon, text_clean))

WER for youtube : 0.2629032258064516
WER for wav2vec2 : 0.3919354838709677
WER for whisper : 0.2645161290322581
WER for cloud_speech : 0.2967741935483871
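
WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. In the loop above only the ASR output is lowercased and stripped of punctuation, while `etalon` is compared as is, so any capitalization or punctuation in the reference is counted as an error. The sketch below illustrates the definition in plain Python with the same normalization applied to both sides. `normalize` and `word_error_rate` are hypothetical helpers, not part of jiwer, and the sentences are made up; the figures reported above were not recomputed this way.

```python
import re


def normalize(text):
    """Lowercase the text and keep only word tokens (drops punctuation)."""
    return re.findall(r'\w+', text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = normalize(reference)
    hyp_words = normalize(hypothesis)
    if not ref_words:
        raise ValueError('reference contains no words')
    # previous_row[j]: distance between the reference words processed so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Made-up example: one substituted word and one missing word out of five.
reference = 'Welcome back, to Dear Peachy!'
hypothesis = 'welcome back to deer'
score = word_error_rate(reference, hypothesis)
print(score)
```

With both texts normalized, the comparison measures recognition errors only, which is usually what a comparison between ASR systems is meant to capture.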


The subtitles auto-generated by YouTube scored best (WER ≈ 0.263), with Whisper very close behind (≈ 0.265). The `recognize_google` output (`cloud_speech`) is somewhat worse (≈ 0.297), and wav2vec2 is the worst (≈ 0.392).