# 1 Audio applications

- **Audio classification**: sort audio clips into categories. You can identify whether a recording is of a barking dog or a meowing cat, or what music genre a song belongs to.
- **Automatic speech recognition**: transform audio clips into text by transcribing them automatically. You can get a text representation of a recording of someone speaking, like “How are you doing today?”. Rather useful for note taking!
- **Speaker diarization**: Ever wondered who’s speaking in a recording? With 🤗 Transformers, you can identify which speaker is talking at any given time in an audio clip. Imagine being able to differentiate between “Alice” and “Bob” in a recording of them having a conversation.
- **Text to speech**: create a narrated version of a text that can be used to produce an audio book, help with accessibility, or give a voice to an NPC in a game. With 🤗 Transformers, you can easily do that!

# 2 Some demos

## 2.1 demo: pipeline: audio classification

### 2.1.1 load dataset

```python
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
```

### 2.1.2 build transformer

To classify an audio recording into a set of classes, we can use the audio-classification pipeline from 🤗 Transformers. 

In our case, we need a model that’s been fine-tuned for intent classification, and specifically on the MINDS-14 dataset. Luckily for us, the Hub has a model that does just that! Let’s load it by using the pipeline() function:

```python
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)
```

### 2.1.3 predict

```python
example = minds[0]

classifier(example["audio"]["array"])
"""
[
    {"score": 0.9631525278091431, "label": "pay_bill"},
    {"score": 0.02819698303937912, "label": "freeze"},
    {"score": 0.0032787492964416742, "label": "card_issues"},
    {"score": 0.0019414445850998163, "label": "abroad"},
    {"score": 0.0008378693601116538, "label": "high_value_payment"},
]
"""
```

### 2.1.4 evaluate

```python
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])
# "pay_bill"
```
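
The prediction and the ground-truth lookup above can be combined into a small comparison. The following is a minimal sketch rather than part of the original demo: `top_label` and `accuracy` are hypothetical helper names, and the `predictions` list is copied from the classifier output shown in 2.1.3 so that the snippet runs without downloading a model.

```python
# Classifier output for minds[0], copied from section 2.1.3.
predictions = [
    {"score": 0.9631525278091431, "label": "pay_bill"},
    {"score": 0.02819698303937912, "label": "freeze"},
    {"score": 0.0032787492964416742, "label": "card_issues"},
    {"score": 0.0019414445850998163, "label": "abroad"},
    {"score": 0.0008378693601116538, "label": "high_value_payment"},
]


def top_label(predictions):
    """Return the label with the highest score (hypothetical helper)."""
    return max(predictions, key=lambda p: p["score"])["label"]


def accuracy(predicted_labels, true_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("both lists must have the same length")
    if not true_labels:
        raise ValueError("at least one example is required")
    matches = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return matches / len(true_labels)


predicted = top_label(predictions)
print(predicted)                # pay_bill
print(predicted == "pay_bill")  # True: agrees with the id2label result
```

To score more than one example, one option is to collect `top_label(classifier(ex["audio"]["array"]))` and `id2label(ex["intent_class"])` for each example and pass the two lists to `accuracy`.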

## 2.2 demo: pipeline: automatic speech recognition

Demo for Australian English (`en-AU`):

```python
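# 1 build transformer
# No model is specified, so the pipeline falls back to its default checkpoint.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")

# 2 predict
# `minds` is the en-AU split loaded in section 2.1.1, resampled to 16 kHz.
example = minds[0]
asr(example["audio"]["array"])

# 3 evaluate
# Compare the prediction above with the reference transcription.
example["english_transcription"]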
```

Demo for German (`de-DE`):

```python
# 1 prepare dataset

from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

# 2 build transformer

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")

# 3 predict

example = minds[0]
asr(example["audio"]["array"])

# 4 evaluate

example["transcription"]
```
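
The "evaluate" steps above only display the reference transcription next to the prediction. One common way to turn that comparison into a number is the word error rate (WER): the word-level edit distance between the two texts divided by the number of reference words. The sketch below is an illustration under stated assumptions, not part of the original demos: `word_error_rate` is a hypothetical helper, the example sentences are made up, and lowercasing is the only normalization applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up strings for illustration; one of seven words differs.
reference = "I would like to pay my bill"
hypothesis = "I WOULD LIKE TO PAY MY BELL"
print(f"{word_error_rate(reference, hypothesis):.2f}")  # 0.14
```

With the demos above, the call would look like `word_error_rate(example["transcription"], asr(example["audio"]["array"])["text"])`, assuming the pipeline returns a dict with a `"text"` key as it does for a single input. Punctuation is not stripped here, so real transcripts may need extra normalization first.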

## 2.3 demo: pipeline: audio generation

### 2.3.1 Text-To-Speech: english

```python
from transformers import pipeline

# Note: this import shadows `datasets.Audio` if that was imported earlier.
from IPython.display import Audio

pipe = pipeline("text-to-speech", model="suno/bark-small")

text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
output = pipe(text)

# Play the generated waveform inline in the notebook.
Audio(output["audio"], rate=output["sampling_rate"])
```
### 2.3.2 Text-To-Speech: French

The model that we’re using with the pipeline, Bark, is actually multilingual, so we can easily substitute the initial text with a text in, say, French, and use the pipeline in the exact same way.

```python
fr_text = "Contrairement à une idée répandue, le nombre de points sur les élytres d'une coccinelle ne correspond pas à son âge, ni en nombre d'années, ni en nombre de mois. "
output = pipe(fr_text)
Audio(output["audio"], rate=output["sampling_rate"])
```

### 2.3.3 Text-To-Speech: Singing

Not only is this model multilingual, it can also generate audio with non-verbal communications and singing. Here’s how you can make it sing:

```python
song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ "
output = pipe(song)
Audio(output["audio"], rate=output["sampling_rate"])
```

### 2.3.4 Generate music

```python
music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")

text = "90s rock song with electric guitar and heavy drums"

forward_params = {"max_new_tokens": 512}

output = music_pipe(text, forward_params=forward_params)
Audio(output["audio"][0], rate=output["sampling_rate"])
```
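
The `max_new_tokens` value passed through `forward_params` bounds how long the generated clip can be. As a rough sketch, the duration can be estimated by dividing the token count by the audio encoder's frame rate. The default of 50 frames per second used below is an assumption about `facebook/musicgen-small`, and `estimated_duration_s` is a hypothetical helper; the actual rate should be read from the loaded model's configuration (in Transformers' MusicGen this is exposed as `config.audio_encoder.frame_rate`) rather than taken from this snippet.

```python
def estimated_duration_s(max_new_tokens: int, frame_rate_hz: float = 50.0) -> float:
    """Approximate clip length in seconds for a given token budget.

    frame_rate_hz is an assumed value; check the model config before relying on it.
    """
    if frame_rate_hz <= 0:
        raise ValueError("frame_rate_hz must be positive")
    return max_new_tokens / frame_rate_hz


print(estimated_duration_s(512))  # 10.24
```

Under that assumption, doubling `max_new_tokens` roughly doubles the clip length, at the cost of proportionally longer generation time.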