# Preface
Ref: https://huggingface.co/learn/audio-course/en/chapter4/hands_on <br>
This notebook was originally run on Colab.

# Roadmap
I will fine-tune an AST (Audio Spectrogram Transformer) model, following a workflow similar to the one in the previous article [Unit4.4-finetuning a model for music classification](https://huggingface.co/learn/audio-course/en/chapter4/fine-tuning).

In [None]:
# pip install datasets evaluate transformers[torch]


In [1]:
# load data
import numpy as np
from datasets import load_dataset
gtzan = load_dataset("marsyas/gtzan", "all", trust_remote_code=True)

# split data
gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)
print("gtzan data:")
display(gtzan)

print("\nfirst train example:")
display(gtzan["train"][0])


gtzan data:


DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 899
    })
    test: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 100
    })
})


first train example:


{'file': 'C:\\Users\\wkaic\\.cache\\huggingface\\datasets\\downloads\\extracted\\9ed4a21ac6961a49c9616c314280eb18647ab65b44c9b7ef92be0606077ef3e1\\genres\\pop\\pop.00098.wav',
 'audio': {'path': 'C:\\Users\\wkaic\\.cache\\huggingface\\datasets\\downloads\\extracted\\9ed4a21ac6961a49c9616c314280eb18647ab65b44c9b7ef92be0606077ef3e1\\genres\\pop\\pop.00098.wav',
  'array': array([ 0.10720825,  0.16122437,  0.28585815, ..., -0.22924805,
         -0.20629883, -0.11334229]),
  'sampling_rate': 22050},
 'genre': 7}

In [2]:
# prompt: calculate the mean and std over the arrays stored under gtzan['train'] data

import numpy as np
import torch
# Assuming 'gtzan' and its structure from the previous code
# Calculate mean and std for each array in gtzan['train']

means = []
stds = []

for example in gtzan['train']:
  audio_array = np.array(example['audio']['array'])
  means.append(np.mean(audio_array))
  stds.append(np.std(audio_array))

print(f"Means: {means}")
print(f"Standard Deviations: {stds}")

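# Aggregate across clips. Note that overall_std is the mean of the per-clip
# standard deviations, which only approximates the std of the pooled samples.
overall_mean = np.mean(means)
overall_std = np.mean(stds)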

print(f"Overall Mean: {overall_mean}")
print(f"Overall Standard Deviation: {overall_std}")

Means: [np.float64(0.0001845862197433094), np.float64(-2.317558542221031e-05), np.float64(0.00027470519268989386), np.float64(1.8392195278926485e-05), np.float64(-2.2358501908614122e-05), np.float64(-1.4545009601226902e-05), np.float64(9.176900881362773e-06), np.float64(-0.0014838552794238674), np.float64(-0.0019330928038274811), np.float64(-0.0019527396763451882), np.float64(-0.013363821557506), np.float64(-0.0007651073828741223), np.float64(-3.857691103533539e-05), np.float64(4.868469384921894e-05), np.float64(0.00499524245321197), np.float64(-0.00029533743909134207), np.float64(-1.8135712491939523e-05), np.float64(1.4550776304475294e-05), np.float64(1.0837296214266089e-05), np.float64(-0.00025438556545659115), np.float64(-0.0001499879704500428), np.float64(-3.0478272275658954e-05), np.float64(-1.7654380717295527e-05), np.float64(-3.663830327465004e-05), np.float64(-0.000400590928741728), np.float64(-5.421844140168145e-05), np.float64(-5.869425509958537e-05), np.float64(5.02002589842
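
The cell above averages the per-clip standard deviations, which is not the same as the standard deviation of all samples pooled together. The sketch below shows one way to compute the pooled statistics from running sums. It uses synthetic clips as a stand-in for `gtzan['train']` (an assumption made so the example runs on its own), and the names `pooled_mean` and `pooled_std` are illustrative.

```python
import numpy as np

# Synthetic stand-ins for audio clips (assumption: real clips would come
# from gtzan["train"], one array per example).
rng = np.random.default_rng(0)
clips = [rng.normal(loc=0.0, scale=s, size=1000) for s in (0.05, 0.2, 0.5)]

# What the cell above computes: the average of per-clip statistics.
mean_of_means = np.mean([c.mean() for c in clips])
mean_of_stds = np.mean([c.std() for c in clips])

# Pooled statistics from running sums, so clips never need to be concatenated.
n = sum(c.size for c in clips)
total = sum(c.sum() for c in clips)
total_sq = sum(np.square(c).sum() for c in clips)
pooled_mean = total / n
pooled_std = np.sqrt(total_sq / n - pooled_mean**2)

print(f"mean of per-clip stds: {mean_of_stds:.4f}")
print(f"pooled std:            {pooled_std:.4f}")
```

When clips differ a lot in loudness, the pooled value is larger than the mean of per-clip values, so the two are not interchangeable as normalization constants.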

The following code was generated from the prompt below and then modified by hand:
> load the AST model from Hugging face, and train on the dataset gtzan in the previous code. Official documentation for the model can be found at: https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#audio-spectrogram-transformer


In [40]:
import evaluate
from transformers import (
    AutoFeatureExtractor,
    ASTFeatureExtractor,
    ASTForAudioClassification,
    TrainingArguments,
    Trainer
)

# load the feature extractor
model_id = "MIT/ast-finetuned-audioset-10-10-0.4593"
model_name = model_id.split("/")[-1]
feature_extractor = ASTFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=False,
)
# feature_extractor = AutoFeatureExtractor.from_pretrained(
#     model_id, do_normalize=True,
#     mean=overall_mean, std=overall_std
# )
# Comment: return_attention_mask=True has no effect with either ASTFeatureExtractor
# or AutoFeatureExtractor for this model_id. Feeding sample = gtzan['train'][0]
# into feature_extractor returns only the key input_values, never attention_mask.

In [35]:
# cast to same sampling_rate as AST model
from datasets import Audio
gtzan = gtzan.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
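
To make the effect of the resampling concrete, here is a rough arithmetic sketch. It assumes 30-second clips, a target rate of 16 kHz (which I believe is the `ASTFeatureExtractor` default), Kaldi-style framing with a 25 ms window and 10 ms hop, and a spectrogram capped at 1024 frames. Treat all of these numbers as assumptions rather than values read from this notebook's extractor.

```python
# Assumed values; check feature_extractor.sampling_rate and
# feature_extractor.max_length on the real object.
source_rate = 22_050
target_rate = 16_000
clip_seconds = 30
max_frames = 1024

samples_before = source_rate * clip_seconds
samples_after = target_rate * clip_seconds

# 25 ms window and 10 ms hop, expressed in samples at the target rate.
window = target_rate * 25 // 1000
hop = target_rate * 10 // 1000

# Number of full frames that fit in the resampled clip.
frames = 1 + (samples_after - window) // hop

# Audio duration covered by the first max_frames frames.
seconds_kept = (window + (max_frames - 1) * hop) / target_rate

print(f"samples: {samples_before} -> {samples_after}")
print(f"frames available: {frames}, frames kept: {min(frames, max_frames)}")
print(f"seconds covered by kept frames: {seconds_kept:.3f}")
```

Under these assumptions only roughly the first ten seconds of each clip reach the model, which is worth keeping in mind when comparing against models that consume the full 30 seconds.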

In [44]:
# Preprocessing function
def preprocess_function(examples, max_duration=30):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
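        sampling_rate=feature_extractor.sampling_rate,  # must match the rate the audio column was cast to
        # Note: the output stays at 1024 frames (see below), so the call-time max_length,
        # padding and truncation arguments do not seem to change the result here.
        max_length=int(feature_extractor.sampling_rate * max_duration),
        return_tensors='pt',  # returns (batch_size, 1024, 128) instead of a list of (1024, 128) arrays
        padding=True,
        truncation=True,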
    )
    # Cast input_values to float16
    inputs['input_values'] = inputs['input_values'].type(torch.float16)
    return inputs

# example
sample = gtzan["train"][0]["audio"]
print(f"Mean: {np.mean(sample['array']):.3}, Variance: {np.var(sample['array']):.3}")
inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
print(f"inputs keys: {list(inputs.keys())}")
print(
    f"Mean: {np.mean(inputs['input_values']):.3}, Variance: {np.var(inputs['input_values']):.3}"
)

# apply preproc to data (remove columns first)
gtzan_encoded = gtzan.map(preprocess_function, remove_columns=["audio", "file"], batched=True, batch_size=100, num_proc=1)


# prepare id2label and label2id dicts for later usage.
id2label_fn = gtzan["train"].features["genre"].int2str
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}

Mean: 0.000185, Variance: 0.0493
inputs keys: ['input_values']
Mean: 0.305, Variance: 0.124
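
The second line of statistics above is measured after the feature extractor's normalization. As far as I understand the AST implementation, it shifts the log-mel features by a dataset mean and divides by twice the dataset std, using AudioSet statistics by default. The sketch below reproduces that formula on synthetic data; the constants and the helper name `ast_normalize` are assumptions for illustration, not values read from this notebook.

```python
import numpy as np

# AudioSet statistics that ASTFeatureExtractor is documented to use by
# default (assumption: check feature_extractor.mean and feature_extractor.std).
AUDIOSET_MEAN = -4.2677393
AUDIOSET_STD = 4.5689974

def ast_normalize(fbank, mean=AUDIOSET_MEAN, std=AUDIOSET_STD):
    """Shift by the mean and divide by twice the std."""
    return (fbank - mean) / (2.0 * std)

# Synthetic log-mel features drawn with exactly the AudioSet statistics.
rng = np.random.default_rng(0)
fbank = rng.normal(AUDIOSET_MEAN, AUDIOSET_STD, size=(1024, 128))
normalized = ast_normalize(fbank)

print(f"mean={normalized.mean():.3f}, std={normalized.std():.3f}")
```

Data that matches the reference statistics lands near mean 0 and std 0.5. The GTZAN sample above comes out at mean 0.305 and variance 0.124, so its features are distributed somewhat differently from AudioSet.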



In [37]:
gtzan_encoded['train'].features

{'label': ClassLabel(names=['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock'], id=None),
 'input_values': Sequence(feature=Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None), length=-1, id=None)}
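
The `id2label` and `label2id` dictionaries built earlier map between class indices and the genre names listed above. A minimal standalone sketch of the same idea follows, using integer keys instead of the string keys in the notebook (an illustrative choice, not the notebook's).

```python
# Genre names as listed in the ClassLabel feature above.
names = ["blues", "classical", "country", "disco", "hiphop",
         "jazz", "metal", "pop", "reggae", "rock"]

# Index -> name and name -> index lookups.
id2label = {i: name for i, name in enumerate(names)}
label2id = {name: i for i, name in id2label.items()}

# The first training example shown earlier has genre 7 and a file named pop.00098.wav.
print(id2label[7])
print(label2id["pop"])
```

The audio course passes such mappings to `from_pretrained` via the `id2label` and `label2id` arguments so that predictions show genre names instead of generic labels. This notebook builds the dictionaries but does not pass them when loading the model.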

In [58]:
# Load the pre-trained model
model = ASTForAudioClassification.from_pretrained(
    model_id, num_labels=len(id2label),
    attn_implementation="sdpa", torch_dtype=torch.float32,
    ignore_mismatched_sizes=True
)


# Define evaluation metric
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions = eval_pred.predictions.argmax(axis=-1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

# Define training arguments
training_args = TrainingArguments(
    f"{model_name}-finetuned-gtzan",
    seed=42,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=2, #Adjust based on your GPU memory
    per_device_eval_batch_size=2, #Adjust based on your GPU memory
    num_train_epochs=15, #Adjust as needed
    gradient_accumulation_steps=4, #Adjust based on your GPU memory
    warmup_ratio=0.1,
    # weight_decay=0.01,
    fp16=True,
    metric_for_best_model="accuracy",
    load_best_model_at_end=True,
    logging_strategy="epoch", # "steps"
    # logging_steps=500,
    report_to="tensorboard",
    push_to_hub=True, # Set to True if you want to push the model to the Hugging Face Hub
)

Some weights of ASTForAudioClassification were not initialized from the model checkpoint at MIT/ast-finetuned-audioset-10-10-0.4593 and are newly initialized because the shapes did not match:
- classifier.dense.bias: found shape torch.Size([527]) in the checkpoint and torch.Size([10]) in the model instantiated
- classifier.dense.weight: found shape torch.Size([527, 768]) in the checkpoint and torch.Size([10, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
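
`compute_metrics` above takes the arg-max over the logits and compares it with the reference labels. The sketch below does the same computation with plain NumPy on made-up logits, so it runs without downloading the `evaluate` metric. The helper name `accuracy_from_logits` and the toy data are hypothetical.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == labels))

# Toy batch: 4 examples, 3 classes (made-up numbers).
logits = np.array([
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 3.0],
    [2.0, 1.0, 0.5],
])
labels = np.array([1, 0, 2, 1])

print(accuracy_from_logits(logits, labels))
```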


In [59]:
# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feature_extractor, #Use feature_extractor here
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()



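| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 1.3017 | 0.617958 | 0.78 |
| 2 | 0.5478 | 0.803098 | 0.77 |
| 3 | 0.3357 | 0.651124 | 0.87 |
| 4 | 0.1565 | 0.685823 | 0.87 |
| 5 | 0.0628 | 0.563838 | 0.86 |
| 6 | 0.0466 | 0.439901 | 0.91 |
| 7 | 0.0108 | 0.511951 | 0.88 |
| 8 | 0.0094 | 0.485442 | 0.89 |
| 9 | 0.0069 | 0.486495 | 0.91 |
| 10 | 0.0061 | 0.467438 | 0.91 |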


TrainOutput(global_step=1680, training_loss=0.16824370729071753, metrics={'train_runtime': 2333.2206, 'train_samples_per_second': 5.78, 'train_steps_per_second': 0.72, 'total_flos': 9.063212551766016e+17, 'train_loss': 0.16824370729071753, 'epoch': 14.87111111111111})
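
The `global_step=1680` in the output can be reproduced from the training arguments with a little arithmetic. The sketch assumes a single device and that the number of optimizer updates per epoch is the number of batches divided by `gradient_accumulation_steps`, rounded down. This is my reading of how `Trainer` counts steps, not something stated in the notebook.

```python
import math

num_examples = 899      # training rows after the 90/10 split
per_device_batch = 2    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
epochs = 15             # num_train_epochs

# Batches the dataloader yields per epoch (the last batch may be partial).
batches_per_epoch = math.ceil(num_examples / per_device_batch)

# Optimizer updates per epoch, assuming floor division by the accumulation steps.
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs

# Number of examples contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum

print(batches_per_epoch, updates_per_epoch, total_updates, effective_batch)
```

With these settings each optimizer update sees 8 examples, which is the number to keep constant when trading batch size against accumulation steps on a smaller GPU.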

In [60]:
kwargs = {
    "dataset_tags": "marsyas/gtzan",
    "dataset": "GTZAN",
    "model_name": f"{model_name}-finetuned-gtzan",
    "finetuned_from": model_id,
    "tasks": "audio-classification",
}

# This modifies the model card in the first page.
# The training results can now be uploaded to the Hub. To do so, execute the `.push_to_hub` command:
trainer.push_to_hub(**kwargs)



CommitInfo(commit_url='https://huggingface.co/wkCircle/ast-finetuned-audioset-10-10-0.4593-finetuned-gtzan/commit/381bc94f5a311b9c75cf74811706646e1fa2f649', commit_message='End of training', commit_description='', oid='381bc94f5a311b9c75cf74811706646e1fa2f649', pr_url=None, repo_url=RepoUrl('https://huggingface.co/wkCircle/ast-finetuned-audioset-10-10-0.4593-finetuned-gtzan', endpoint='https://huggingface.co', repo_type='model', repo_id='wkCircle/ast-finetuned-audioset-10-10-0.4593-finetuned-gtzan'), pr_revision=None, pr_num=None)

## Experiment Results

1. Computing the training-set mean and std and using them to normalize the data in the feature extractor performs much worse than using the default values (accuracy of only 0.75-0.8).
2. When loading the model with `ASTForAudioClassification.from_pretrained(...)`, passing `attn_implementation="sdpa", torch_dtype=torch.float32` seems to perform slightly better than omitting them (about 0.01 higher accuracy).
3. `TrainingArguments(fp16=True)` and the cast `inputs['input_values'] = inputs['input_values'].type(torch.float16)` in `preprocess_function` seem to have no significant effect on final accuracy.

The final published model and TensorBoard reports are available on the Hugging Face Hub under the id [wkCircle/ast-finetuned-audioset-10-10-0.4593-finetuned-gtzan](https://huggingface.co/wkCircle/ast-finetuned-audioset-10-10-0.4593-finetuned-gtzan).

In [55]:
from transformers import pipeline

pipe = pipeline(
    "audio-classification", model="wkCircle/distilhubert-finetuned-gtzan"
)



Device set to use cuda:0


## Q&A

Q1: RuntimeError: Input type (float) and bias type (c10::Half) should be the same. <br>

A1: The error indicates a dtype mismatch between the input data and the model's parameters.

This happens because:

- The pre-trained AST model was loaded with `torch_dtype=torch.float16`, so its weights and biases are stored in float16 (Half).
- The input data from the dataset is float32 (Float), which is the default for PyTorch tensors.
- When the model performs the convolution operation, the input type (float32) does not match the bias type (float16), and PyTorch raises the runtime error.

Suggested change: add the following code to `preprocess_function` after the `inputs` variable has been created:
```python
# Cast input_values to float16
inputs['input_values'] = inputs['input_values'].type(torch.float16)
```

Q2: What is the difference between using `fp16=True` in `TrainingArguments`, versus using `torch_dtype=torch.float16` when loading the pretrained AST model?

A2:
`fp16=True` in `TrainingArguments`:

- **Scope**: This setting affects the training process itself, including forward and backward passes, optimizer updates, and gradient calculations.
- **Functionality**: It enables Automatic Mixed Precision (AMP), a technique that leverages both float16 (for faster computations) and float32 (for maintaining numerical stability) during training.
- **Benefits**:
  - Speed: Training is usually faster on GPUs with float16 support, because many operations run in half precision.
  - Memory: Activations stored in float16 take less memory, which can allow larger batch sizes. The model weights themselves stay in float32.
- **How it works**: AMP automatically decides when to use float16 or float32 based on the operations being performed, minimizing the risk of numerical instability.

`torch_dtype=torch.float16` when loading the model:

- **Scope**: This setting primarily affects the model's weights and biases.
- **Functionality**: It loads the model with weights and biases in float16 (half-precision) format.
- **Benefits**:
  - Memory: Reduces the model's memory footprint, enabling it to fit on devices with limited memory.
  - Inference speed: Can potentially speed up inference (making predictions) due to faster computations with float16.
- **Considerations**: The input data must also be float16. PyTorch does not convert it automatically; a float32 input to a float16 model raises the type mismatch error described in Q1.

**Key Differences and Interactions**:

1. Focus: `fp16=True` focuses on the training process, while `torch_dtype=torch.float16` focuses on the model's data type.
2. AMP: `fp16=True` leverages AMP for a more nuanced and robust approach to mixed-precision training, while `torch_dtype=torch.float16` simply loads the model in float16.
3. Interaction: When training with `fp16=True`, keep the model weights in float32 (the default, or `torch_dtype=torch.float32` as in this notebook). AMP then casts individual operations to float16 where that is safe. Loading the model with `torch_dtype=torch.float16` and also setting `fp16=True` typically fails, because the gradient scaler cannot unscale float16 gradients. `torch_dtype=torch.float16` is better suited to inference.
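
The numerical-stability point behind AMP can be illustrated without a GPU. The sketch below uses NumPy's `float16` to show a small value underflowing to zero in half precision, and how multiplying by a scale factor before the cast (the idea behind loss scaling) preserves it. The gradient magnitude and scale factor are made-up values for illustration, and this is not the code path `Trainer` runs.

```python
import numpy as np

tiny_grad = 1e-8  # hypothetical gradient magnitude

# Cast directly to float16: the value is below the smallest representable
# float16 number and becomes zero.
direct = np.float16(tiny_grad)

# Loss-scaling idea: scale up before the cast, then unscale in float32.
scale = 2.0 ** 16
scaled = np.float16(tiny_grad * scale)
recovered = np.float32(scaled) / np.float32(scale)

print(f"direct cast:    {float(direct)}")
print(f"after unscale:  {float(recovered):.3e}")
print(f"float16 max:    {float(np.finfo(np.float16).max)}")
```

The narrow range of float16 is why mixed-precision training keeps float32 where precision matters instead of converting everything to half precision.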


