To train our model, we’ll use the GTZAN dataset, a popular dataset of 1,000 songs for music genre classification. Each song is a 30-second clip from one of 10 genres of music, spanning disco to metal. We can get the audio files and their corresponding labels from the Hugging Face Hub with the load_dataset() function from 🤗 Datasets (note that the version loaded below contains 999 clips rather than the full 1,000):

In [3]:
from datasets import load_dataset

gtzan = load_dataset("marsyas/gtzan", "all")
gtzan

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 999
    })
})

GTZAN doesn’t provide a predefined validation set, so we’ll have to create one ourselves. The dataset is balanced across genres, so we can use the train_test_split() method to quickly create a 90/10 split as follows:

In [4]:
gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)
gtzan

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 899
    })
    test: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 100
    })
})
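
Because the split above is random rather than stratified, the per-genre counts in each split are only approximately equal. A minimal sketch of how one might check this is shown below. The helper `genre_counts` and the toy labels are hypothetical stand-ins for the integer labels in `gtzan["train"]["genre"]`:

```python
from collections import Counter


def genre_counts(labels, names):
    """Return {genre name: number of examples} for a list of integer labels."""
    counts = Counter(labels)
    return {name: counts.get(i, 0) for i, name in enumerate(names)}


# Toy stand-in for gtzan["train"]["genre"]; real labels would come from the dataset
toy_names = ["blues", "classical", "country"]
toy_labels = [0, 1, 1, 2, 0, 1]

print(genre_counts(toy_labels, toy_names))
# {'blues': 2, 'classical': 3, 'country': 1}
```

If the counts drift too far apart, recent versions of 🤗 Datasets also let you pass stratify_by_column="genre" to train_test_split(), which keeps the class proportions the same in both splits.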

In [5]:
# Let's take a look at one of the audio files
gtzan["train"][0]

{'file': '/home/raj/.cache/huggingface/datasets/downloads/extracted/fa0c0173870969fd11c975895603e8608b7325bb180f3aeb805320cfa0922824/genres/pop/pop.00098.wav',
 'audio': {'path': '/home/raj/.cache/huggingface/datasets/downloads/extracted/fa0c0173870969fd11c975895603e8608b7325bb180f3aeb805320cfa0922824/genres/pop/pop.00098.wav',
  'array': array([ 0.10720825,  0.16122437,  0.28585815, ..., -0.22924805,
         -0.20629883, -0.11334229]),
  'sampling_rate': 22050},
 'genre': 7}

The audio is stored as a one-dimensional NumPy array with a sampling rate of 22,050 Hz. We can also see that the genre is represented as an integer, or class label, which is the format the model will make its predictions in. Let’s use the int2str() method of the genre feature to map these integers to human-readable names:

In [6]:
id2label_fn = gtzan["train"].features["genre"].int2str
id2label_fn(gtzan["train"][0]["genre"])

'pop'

This label looks correct, since it matches the filename of the audio file. Let’s now listen to a few more examples by using Gradio to create a simple interface with the Blocks API:

In [5]:
# import gradio as gr
# def generate_audio():
#     example = gtzan["train"].shuffle()[0]
#     audio = example["audio"]
#     return (
#         audio["sampling_rate"],
#         audio["array"],
#     ), id2label_fn(example["genre"])

# with gr.Blocks() as demo:
#     with gr.Column():
#         for _ in range(4):
#             audio, label = generate_audio()
#             output = gr.Audio(audio, label=label)

# demo.launch(debug=True)

**Picking a pretrained model for audio classification**

Although models like Wav2Vec2 and HuBERT are very popular, we’ll use a model called DistilHuBERT. This is a much smaller (or distilled) version of the HuBERT model, which trains around 73% faster, yet preserves most of the performance.

**Preprocessing the data**

Similar to tokenization in NLP, audio and speech models require the input to be encoded in a format that the model can process. In 🤗 Transformers, the conversion from audio to the input format is handled by the feature extractor of the model. Similar to tokenizers, 🤗 Transformers provides a convenient AutoFeatureExtractor class that can automatically select the correct feature extractor for a given model. To see how we can process our audio files, let’s begin by instantiating the feature extractor for DistilHuBERT from the pre-trained checkpoint:

In [5]:
from transformers import AutoFeatureExtractor

model_id = "ntu-spml/distilhubert"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)

In [6]:
sampling_rate = feature_extractor.sampling_rate
sampling_rate

16000

Since the sampling rate of the model (16,000 Hz) differs from that of the dataset (22,050 Hz), we’ll have to resample the audio files to 16,000 Hz before passing them to the feature extractor.

In [7]:
# Resample the dataset using the cast_column() method and the Audio feature from 🤗 Datasets
from datasets import Audio
gtzan = gtzan.cast_column("audio", Audio(sampling_rate=sampling_rate))

In [8]:
# Check the first sample of the train split to verify that it is indeed at 16,000 Hz
gtzan["train"][0]

{'file': '/home/raj/.cache/huggingface/datasets/downloads/extracted/fa0c0173870969fd11c975895603e8608b7325bb180f3aeb805320cfa0922824/genres/pop/pop.00098.wav',
 'audio': {'path': '/home/raj/.cache/huggingface/datasets/downloads/extracted/fa0c0173870969fd11c975895603e8608b7325bb180f3aeb805320cfa0922824/genres/pop/pop.00098.wav',
  'array': array([ 0.0873509 ,  0.20183384,  0.4790867 , ..., -0.18743178,
         -0.23294401, -0.13517427]),
  'sampling_rate': 16000},
 'genre': 7}
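
To build some intuition for what resampling does to the array, here is a simplified sketch that resamples a signal by linear interpolation. This is an illustrative stand-in, not the method 🤗 Datasets uses (production resamplers apply band-limited filtering to avoid aliasing). The function name `resample_linear` and the synthetic tone are assumptions made for the example:

```python
import numpy as np


def resample_linear(array, orig_sr, target_sr):
    """Resample a 1-D signal by linear interpolation (illustrative only)."""
    array = np.asarray(array, dtype=np.float64)
    # The duration stays the same, so the sample count scales with the rate
    n_target = int(round(len(array) * target_sr / orig_sr))
    orig_times = np.arange(len(array)) / orig_sr
    target_times = np.arange(n_target) / target_sr
    return np.interp(target_times, orig_times, array)


# One second of a 440 Hz tone at the dataset's native sampling rate
orig_sr, target_sr = 22050, 16000
t = np.arange(orig_sr) / orig_sr
tone = np.sin(2 * np.pi * 440 * t)

resampled = resample_linear(tone, orig_sr, target_sr)
print(len(tone), len(resampled))  # 22050 16000
```

The same reasoning explains why the array values printed above changed after the cast: a 30-second clip goes from 661,500 samples at 22,050 Hz to 480,000 samples at 16,000 Hz.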

Next, we normalize the audio data through feature scaling, so that each sample has zero mean and unit variance.

In [9]:
# First let's compute the mean and variance of our raw audio data
import numpy as np
sample = gtzan["train"][0]["audio"]
print(f"Mean: {np.mean(sample['array']):.3}, Variance: {np.var(sample['array']):.3}")

Mean: 0.000185, Variance: 0.0493


We can see that the mean is already close to zero, but the variance is around 0.05, far from the unit variance we are aiming for. A sample with an even smaller variance could cause problems for our model, since the dynamic range of the audio data would be very small and thus difficult to separate. Let’s apply the feature extractor and see what the outputs look like:

In [10]:
inputs = feature_extractor(
    sample["array"], sampling_rate=sample["sampling_rate"])
print(f"inputs keys: {inputs.keys()}")
print(f"Mean: {np.mean(inputs['input_values']):.3}, Variance: {np.var(inputs['input_values']):.3}")

inputs keys: dict_keys(['input_values', 'attention_mask'])
Mean: -7.45e-09, Variance: 1.0


We can see that the mean is now much closer to zero and the variance is almost exactly one. This is exactly the form we want our audio samples in before feeding them to the DistilHuBERT model.
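
The scaling applied by the feature extractor can be approximated in a few lines of NumPy. The sketch below is a simplified version under stated assumptions: the function name `zero_mean_unit_var`, the small `eps` term for numerical stability, and the synthetic input are all illustrative, and the library's implementation may differ in detail:

```python
import numpy as np


def zero_mean_unit_var(x, eps=1e-7):
    """Shift a 1-D signal to zero mean and scale it to (approximately) unit variance."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / np.sqrt(x.var() + eps)


# Synthetic low-variance signal standing in for a raw audio array
rng = np.random.default_rng(0)
raw = 0.2 * rng.standard_normal(16000) + 0.01

normalized = zero_mean_unit_var(raw)
print(f"Mean: {normalized.mean():.3}, Variance: {normalized.var():.3}")
```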

In [11]:
# create a preprocess function that will truncate longer clips to 30 seconds
max_duration = 30.0

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, 
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
        padding=True,
    )
    return inputs

With this function defined, we can now apply it to the dataset using the map() method.

In [12]:
gtzan_encoded = gtzan.map(
    preprocess_function,
    remove_columns=["file", "audio"],
    batched=True,
    num_proc=1,
)

gtzan_encoded

DatasetDict({
    train: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 899
    })
    test: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 100
    })
})
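
To make the truncation, padding and attention mask in preprocess_function more concrete, here is a small NumPy sketch of the same idea. It assumes that each clip is first cut to max_length and the batch is then padded with zeros up to its longest clip. The helper `truncate_and_pad` is hypothetical, and it deliberately leaves out the normalization step:

```python
import numpy as np


def truncate_and_pad(arrays, max_length, padding_value=0.0):
    """Truncate each array to max_length, then pad to the longest array in the batch."""
    truncated = [np.asarray(a, dtype=np.float32)[:max_length] for a in arrays]
    batch_length = max(len(a) for a in truncated)
    input_values = np.full((len(truncated), batch_length), padding_value, dtype=np.float32)
    attention_mask = np.zeros((len(truncated), batch_length), dtype=np.int32)
    for i, a in enumerate(truncated):
        input_values[i, : len(a)] = a
        attention_mask[i, : len(a)] = 1  # 1 marks real samples, 0 marks padding
    return {"input_values": input_values, "attention_mask": attention_mask}


batch = truncate_and_pad([[0.1, 0.2, 0.3, 0.4, 0.5], [0.6, 0.7]], max_length=4)
print(batch["input_values"].shape)       # (2, 4)
print(batch["attention_mask"].tolist())  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

With max_duration = 30.0 and a sampling rate of 16,000 Hz, max_length works out to 480,000 samples per clip.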

In [13]:
# To enable the Trainer to process the class labels, we need to rename the genre column to label
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")

# Note: no explicit cast of the labels to long is performed in this notebook;
# the integer class labels are passed to the Trainer as they are

Finally, we need to obtain the label mappings from the dataset. This mapping will take us from integer ids (e.g. 7) to human-readable class labels (e.g. "pop") and back again. In doing so, we can convert our model’s integer id prediction into human-readable format, enabling us to use the model in any downstream application. We can do this by using the int2str() method as follows:

In [14]:
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}

label2id = {v: k for k, v in id2label.items()}

id2label["7"]

'pop'
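
The round trip between ids and names can be sketched without the dataset. The genre list below is an assumption (the ten GTZAN genres in alphabetical order, which is consistent with id 7 mapping to "pop" above). In the notebook the names come from the dataset's label feature instead:

```python
# Assumed genre order; in the notebook it comes from the dataset's "label" feature
genre_names = ["blues", "classical", "country", "disco", "hiphop",
               "jazz", "metal", "pop", "reggae", "rock"]

# Same construction as in the cell above: string ids map to names, and back
id2label = {str(i): name for i, name in enumerate(genre_names)}
label2id = {name: idx for idx, name in id2label.items()}

predicted_id = 7  # e.g. the argmax of the model's output
predicted_label = id2label[str(predicted_id)]
print(predicted_label)            # pop
print(label2id[predicted_label])  # 7 (stored as the string "7")
```

Note that the keys of id2label are strings here, so an integer prediction has to be converted with str() before the lookup.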

OK, we’ve now got a dataset that’s ready for training! Let’s take a look at how we can train a model on this dataset.

**Fine-tuning the model**

In [15]:
# Use the 🤗 Trainer to fine-tune the DistilHuBERT model on the GTZAN dataset
from transformers import AutoModelForAudioClassification

num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['projector.bias', 'classifier.weight', 'classifier.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We strongly advise you to upload model checkpoints directly to the Hugging Face Hub while training. The Hub provides:

* Integrated version control: you can be sure that no model checkpoint is lost during training.
* Tensorboard logs: track important metrics over the course of training.
* Model cards: document what a model does and its intended use cases.
* Community: an easy way to share and collaborate with the community! 🤗

Linking the notebook to the Hub is straightforward: it simply requires entering your Hub authentication token when prompted by the cell below. You can find the token in the settings of your Hugging Face account.

In [23]:
from huggingface_hub import notebook_login
notebook_login()


In [17]:
# Next define the training arguments
from transformers import TrainingArguments

model_name = model_id.split("/")[-1]
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10

training_args = TrainingArguments(
    f"{model_name}-finetuned-gtzan-1",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    push_to_hub=True,
)
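
It can help to work out what these settings mean in terms of optimizer steps. The sketch below is a back-of-the-envelope calculation, assuming a single device, the 899 training examples from the split above, and a linear warmup followed by linear decay (the Trainer's default schedule, as far as we know). The function `linear_schedule_lr` is a hypothetical helper, not part of 🤗 Transformers:

```python
import math

num_train_examples = 899  # size of the train split created earlier
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10
warmup_ratio = 0.1
peak_lr = 5e-5

# One optimizer step consumes batch_size * gradient_accumulation_steps examples
steps_per_epoch = math.ceil(num_train_examples / (batch_size * gradient_accumulation_steps))
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(steps_per_epoch, total_steps)  # 113 1130
```

The total of 1,130 steps should match the global_step value that the trainer reports once training has finished.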


In [18]:
model_name

'distilhubert'

The last thing we need to do is define the metrics. Since the dataset is balanced, we’ll use accuracy as our metric and load it using the 🤗 Evaluate library:

In [20]:
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
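
As a sanity check on what compute_metrics does, the same calculation can be reproduced with plain NumPy on a toy batch. The function `accuracy_from_logits` and the toy values are illustrative assumptions. The notebook itself relies on the accuracy metric from 🤗 Evaluate:

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the reference labels."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == np.asarray(labels)))


# Four examples, three classes; the last example is misclassified (predicted 0, label 1)
toy_logits = np.array([[2.0, 1.0, 0.0],
                       [0.0, 3.0, 1.0],
                       [1.0, 0.0, 5.0],
                       [4.0, 0.0, 1.0]])
toy_labels = [0, 1, 2, 1]

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```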

In [21]:
gtzan_encoded["train"].features['attention_mask']

Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None)

We now have all the pieces we need to train the model:

In [22]:
# instantiate the trainer

from transformers import Trainer

trainer = Trainer(
    model,
    args=training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()

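| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 1.7533        | 1.792695        | 0.47     |
| 2     | 1.2555        | 1.279248        | 0.6      |
| 3     | 1.0209        | 1.027561        | 0.7      |
| 4     | 0.6703        | 0.818121        | 0.75     |
| 5     | 0.5152        | 0.73955         | 0.77     |
| 6     | 0.2763        | 0.649834        | 0.81     |
| 7     | 0.2386        | 0.677489        | 0.79     |
| 8     | 0.3162        | 0.629078        | 0.81     |
| 9     | 0.155         | 0.612056        | 0.83     |
| 10    | 0.0894        | 0.666008        | 0.81     |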


TrainOutput(global_step=1130, training_loss=0.7463425720687461, metrics={'train_runtime': 5950.2699, 'train_samples_per_second': 1.511, 'train_steps_per_second': 0.19, 'total_flos': 6.133988274624e+17, 'train_loss': 0.7463425720687461, 'epoch': 10.0})
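
Since the training arguments set load_best_model_at_end=True with metric_for_best_model="accuracy", the checkpoint that is kept at the end is the one with the highest validation accuracy, not necessarily the last one. The sketch below simply replays that selection on the numbers from the training log above. It is a simplified illustration, not the Trainer's internal logic:

```python
# (epoch, validation loss, accuracy) copied from the training log above
eval_history = [
    (1, 1.792695, 0.47), (2, 1.279248, 0.60), (3, 1.027561, 0.70),
    (4, 0.818121, 0.75), (5, 0.739550, 0.77), (6, 0.649834, 0.81),
    (7, 0.677489, 0.79), (8, 0.629078, 0.81), (9, 0.612056, 0.83),
    (10, 0.666008, 0.81),
]

# Select the epoch with the highest accuracy
best_epoch, best_loss, best_accuracy = max(eval_history, key=lambda row: row[2])
print(best_epoch, best_accuracy)  # 9 0.83
```

In this run the best accuracy (0.83) is reached at epoch 9. In the final epoch the training loss keeps falling while the validation loss rises again, which hints at the onset of overfitting.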

To submit the checkpoint to the leaderboard, we push the training results to the Hub. First, we set the appropriate keyword arguments (kwargs):

In [24]:
kwargs = {
    "dataset_tags": "marsyas/gtzan",
    "dataset": "GTZAN",
    "model_name": f"{model_name}-finetuned-gtzan-1",
    "finetuned_from": model_id,
    "tasks": "audio-classification",
}

We can now upload the training results to the Hub with the push_to_hub() method:

In [25]:
trainer.push_to_hub(**kwargs)

To https://huggingface.co/RajkNakka/distilhubert-finetuned-gtzan-1
   a536ab6..77a7a27  main -> main

To https://huggingface.co/RajkNakka/distilhubert-finetuned-gtzan-1
   77a7a27..6819480  main -> main



'https://huggingface.co/RajkNakka/distilhubert-finetuned-gtzan-1/commit/77a7a27d081d8a28b8871eb36b94a2e82f698ac6'