# Speech Recognition

## Installing Transformers Library

In [1]:
!pip install -q transformers

In [2]:
from transformers import pipeline

In [3]:
# Initialize a speech recognition pipeline (falls back to the task's default model)
pipe = pipeline("automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 55bb623 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Ignored unknown kwarg option normalize



## Installing and Loading the Dataset

In [4]:
# Install the "datasets" library, which is used to load the MINDS-14 dataset
!pip install -q datasets

In [5]:
from datasets import load_dataset

In [6]:
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")


Downloading and preparing dataset minds14/en-US to /root/.cache/huggingface/datasets/PolyAI___minds14/en-US/1.0.0/65c7e0f3be79e18a6ffaf879a083daf706312d421ac90d25718459cbf3c42696...



Dataset minds14 downloaded and prepared to /root/.cache/huggingface/datasets/PolyAI___minds14/en-US/1.0.0/65c7e0f3be79e18a6ffaf879a083daf706312d421ac90d25718459cbf3c42696. Subsequent calls will reuse this data.


* Each example in this dataset contains an audio recording together with its transcription, intent label, and language ID.

In [7]:
dataset.features

{'path': Value(dtype='string', id=None),
 'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None),
 'transcription': Value(dtype='string', id=None),
 'english_transcription': Value(dtype='string', id=None),
 'intent_class': ClassLabel(num_classes=14, names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None),
 'lang_id': ClassLabel(num_classes=14, names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None)}
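
The `intent_class` and `lang_id` columns are `ClassLabel` features, so each example stores an integer index rather than the label text. The sketch below mimics that mapping in plain Python, using the intent names printed above. The helper names `intent_to_str` and `intent_to_int` are hypothetical stand-ins for the idea, not library calls.

```python
# Intent names copied from the `dataset.features` output above.
INTENT_NAMES = [
    "abroad", "address", "app_error", "atm_limit", "balance",
    "business_loan", "card_issues", "cash_deposit", "direct_debit",
    "freeze", "high_value_payment", "joint_account",
    "latest_transactions", "pay_bill",
]

def intent_to_str(index):
    """Map a stored integer label to its human-readable name (hypothetical helper)."""
    return INTENT_NAMES[index]

def intent_to_int(name):
    """Map a label name back to the integer stored in the dataset (hypothetical helper)."""
    return INTENT_NAMES.index(name)

print(intent_to_str(11))
print(intent_to_int("pay_bill"))
```

With the real dataset, `dataset.features["intent_class"].int2str(...)` should perform the same lookup.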

## Data Preprocessing

In [12]:
from datasets import Audio

In [14]:
# Resample the audio to the sampling rate the pipeline's feature extractor expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=pipe.feature_extractor.sampling_rate))
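
The audio is stored at 8 kHz (see the `dataset.features` output), while wav2vec2-base-960h expects 16 kHz input, so casting the column makes each clip get resampled when it is decoded. The sketch below only illustrates what resampling does to the number of samples. It uses plain linear interpolation, which is an assumption made for simplicity and not the resampler that `datasets` uses internally. The function name `resample_linear` is hypothetical.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (illustration only)."""
    if not samples:
        return []
    # Number of output samples that cover the same duration.
    n_out = int(round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

clip_8k = [0.0, 1.0, 0.0, -1.0]  # four toy samples at 8 kHz
clip_16k = resample_linear(clip_8k, 8000, 16000)
print(len(clip_8k), len(clip_16k))
```

Doubling the rate doubles the sample count while the clip duration stays the same.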

## Speech Recognition

In [15]:
# Transcribe the first four audio samples
data = dataset[:4]["audio"]
result = pipe(data)

In [16]:
print([d["text"] for d in result])

['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT']
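
The transcriptions above contain visible errors, such as the last item, 'HOW DO I FURN A JOINA COUT'. A common way to quantify this is the word error rate (WER): the word-level edit distance between a reference and the model output, divided by the number of reference words. The following is a minimal pure-Python sketch. The function name `word_error_rate` and the reference sentence are assumptions made for illustration, not values taken from the dataset.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical reference text; the hypothesis is the last pipeline output above.
reference = "how do i start a joint account"
hypothesis = "HOW DO I FURN A JOINA COUT"
wer = word_error_rate(reference, hypothesis)
print(round(wer, 3))
```

In the notebook, the references could instead come from `dataset[:4]["transcription"]`, compared item by item with the pipeline results.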


## Using a Custom Model in the Pipeline
* Instead of relying on the pipeline's default model, we can pass in a model of our choice. Here we use one to perform sentiment analysis.

In [17]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment" 

In [18]:
from transformers import AutoModelForSequenceClassification

In [19]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)


* Deep learning models cannot take raw text as input, so the text must be preprocessed first. A tokenizer does this, and Transformers lets us download the tokenizer that matches the model.
* **Tokenizer:** Converts text into a numerical form the model can understand.
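
To make the tokenizer's job concrete, here is a minimal sketch of the greedy longest-match sub-word splitting that BERT-style (WordPiece) tokenizers typically use. The vocabulary is a toy one invented for this example, and the functions `wordpiece_tokenize` and `encode` are hypothetical, not part of Transformers.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (hypothetical); a real one has many thousands of entries.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "not": 3,
         "good": 4, "un": 5, "##like": 6, "##ly": 7}

def encode(text, vocab):
    """Lowercase, split on whitespace, and map each piece to its integer ID."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("Not good unlikely", vocab)
print(tokens)
print(ids)
```

The real tokenizer loaded in the next cells does considerably more (normalization, punctuation handling, attention masks), but the output has the same shape: a list of integer IDs.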

In [20]:
from transformers import AutoTokenizer

In [21]:
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [22]:
# Combine the model and its tokenizer in a pipeline
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

## Prediction

In [23]:
# Spanish for: "This movie is not good at all. I did not like it at all"
text = "Esta película no es nada buena. no me gustó nada"

In [24]:
classifier(text)

[{'label': '1 star', 'score': 0.6990681886672974}]
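
The `score` is a probability over the model's five star ratings. For a multi-class classifier like this one, the pipeline by default turns the model's raw outputs (logits) into probabilities with a softmax and reports the highest one. The sketch below reproduces that step in plain Python. The logits are hypothetical values chosen for illustration, not the model's actual output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
logits = [2.9, 2.0, 0.3, -1.5, -2.4]  # hypothetical values, not real model output

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top probability
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```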

## Saving the Model

In [25]:
save_directory = "./save_pretrained"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

## Using the Previously Saved Model

In [26]:
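# Reload the model and tokenizer from the directory saved above
model = AutoModelForSequenceClassification.from_pretrained(save_directory)
tokenizer = AutoTokenizer.from_pretrained(save_directory)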