# GTZAN Dataset Evaluation

## Introduction
### Overview
A dataset of 1,000 songs across 10 genres (100 songs per genre), along with auxiliary information on each song.

### Problem Statement
Creating a classifier for music genres.

### Data Source
Our main data source is the GTZAN dataset on Kaggle, though this may be extended via the Free Music Archive or other such projects.

## Review of PreML Checklist

In [1]:
import kagglehub
from pathlib import Path

In [2]:
# Download the GTZAN dataset via kagglehub and keep the path to its local root folder
file_path = Path(kagglehub.dataset_download('andradaolteanu/gtzan-dataset-music-genre-classification'))

Downloading from https://www.kaggle.com/api/v1/datasets/download/andradaolteanu/gtzan-dataset-music-genre-classification?dataset_version_number=1...

100%|██████████| 1.21G/1.21G [00:18<00:00, 69.5MB/s]

Extracting files...

In [3]:
genres = list(file_path.glob('Data/genres_original/*/'))
[genre.name for genre in genres]

['metal',
 'blues',
 'pop',
 'disco',
 'rock',
 'jazz',
 'hiphop',
 'country',
 'reggae',
 'classical']

In [4]:
music_files = {
    genre.name: list(genre.glob('*.wav'))
    for genre in genres
}

{key : len(value) for key, value in music_files.items()}

{'metal': 100,
 'blues': 100,
 'pop': 100,
 'disco': 100,
 'rock': 100,
 'jazz': 100,
 'hiphop': 100,
 'country': 100,
 'reggae': 100,
 'classical': 100}

In [5]:
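import wave

def resilient_load(path):
    """Open a WAV file; print the path and error and return None if it cannot be read."""
    try:
        return wave.open(str(path))
    except Exception as inst:
        print(path)
        print(inst)
        return None

# Map each genre to its opened WAV files; unreadable files are stored as None
music = {name: [resilient_load(file) for file in files] for name, files in music_files.items()}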

/root/.cache/kagglehub/datasets/andradaolteanu/gtzan-dataset-music-genre-classification/versions/1/Data/genres_original/jazz/jazz.00054.wav
file does not start with RIFF id


> Unfortunately, `jazz.00054.wav` appears to be corrupted: it does not start with a RIFF header, and it also won't load in music players.

Therefore, we now have a slight class imbalance: the jazz genre has 99 readable files instead of 100.
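A minimal sketch of how unreadable files could be separated from readable ones, so that per-genre counts reflect only usable audio. The helper name `split_readable` is hypothetical, and closing each file right after the check is an assumption made here to avoid holding many open handles; the cell above keeps them open instead.

```python
import wave
from pathlib import Path


def split_readable(paths):
    """Return (readable, failures) for an iterable of WAV paths.

    `failures` maps each unreadable path to the error message from `wave.open`.
    """
    readable, failures = [], {}
    for path in paths:
        try:
            # Open only to validate the header, then close immediately.
            with wave.open(str(path), "rb") as wav:
                wav.getnframes()
            readable.append(Path(path))
        except (wave.Error, EOFError, OSError) as err:
            failures[Path(path)] = str(err)
    return readable, failures
```

Applied to the `music_files` dictionary built earlier, this would be called once per genre, for example `split_readable(music_files['jazz'])`.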

In [6]:
import pandas as pd

# features_30_sec.csv: mean and variance of the features extracted from each 30-second audio file
df = pd.read_csv(file_path / "Data/features_30_sec.csv")
df.describe()

| | length | chroma_stft_mean | chroma_stft_var | rms_mean | rms_var | spectral_centroid_mean | spectral_centroid_var | spectral_bandwidth_mean | spectral_bandwidth_var | rolloff_mean | ... | mfcc16_mean | mfcc16_var | mfcc17_mean | mfcc17_var | mfcc18_mean | mfcc18_var | mfcc19_mean | mfcc19_var | mfcc20_mean | mfcc20_var |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | ... | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 662030.846 | 0.378682 | 0.08634 | 0.13093 | 0.003051 | 2201.780898 | 469691.6 | 2242.54107 | 137079.155165 | 4571.549304 | ... | 1.148144 | 60.730958 | -3.966028 | 62.633624 | 0.507696 | 63.712586 | -2.328761 | 66.23193 | -1.095348 | 70.126096 |
| std | 1784.073992 | 0.081705 | 0.007735 | 0.065683 | 0.003634 | 715.9606 | 400899.5 | 526.316473 | 96455.666326 | 1574.791602 | ... | 4.578948 | 33.781951 | 4.549697 | 33.479172 | 3.869105 | 34.401977 | 3.755957 | 37.174631 | 3.837007 | 45.228512 |
| min | 660000.0 | 0.171939 | 0.044555 | 0.005276 | 4e-06 | 570.040355 | 7911.251 | 898.066208 | 10787.185064 | 749.140636 | ... | -15.693844 | 9.169314 | -17.234728 | 13.931521 | -11.963694 | 15.420555 | -18.501955 | 13.487622 | -19.929634 | 7.956583 |
| 25% | 661504.0 | 0.319562 | 0.082298 | 0.086657 | 0.000942 | 1627.697311 | 184350.5 | 1907.240605 | 67376.554428 | 3380.069642 | ... | -1.86328 | 40.376442 | -7.207225 | 40.830875 | -2.007015 | 41.88424 | -4.662925 | 41.710184 | -3.368996 | 42.372865 |
| 50% | 661794.0 | 0.383148 | 0.086615 | 0.122443 | 0.001816 | 2209.26309 | 338486.2 | 2221.392843 | 111977.548036 | 4658.524473 | ... | 1.212809 | 52.325077 | -4.065605 | 54.717674 | 0.669643 | 54.80489 | -2.393862 | 57.423059 | -1.166289 | 59.186117 |
| 75% | 661794.0 | 0.435942 | 0.091256 | 0.175682 | 0.003577 | 2691.294667 | 612147.9 | 2578.469836 | 182371.576801 | 5533.81046 | ... | 4.359662 | 71.691755 | -0.838737 | 75.040838 | 3.119212 | 75.385832 | 0.150573 | 78.626444 | 1.312615 | 85.375374 |
| max | 675808.0 | 0.663685 | 0.108111 | 0.397973 | 0.027679 | 4435.243901 | 3036843.0 | 3509.646417 | 694784.811549 | 8677.672688 | ... | 13.45715 | 392.932373 | 11.482946 | 406.058868 | 15.38839 | 332.905426 | 14.694924 | 393.161987 | 15.369627 | 506.065155 |


In [7]:
# Check missing
df.isnull().values.any()

np.False_

In [8]:
# Check duplicates
df[df.duplicated()]

Empty DataFrame (0 rows; columns `filename`, `length`, ..., `label`): no duplicated rows were found.


In [9]:
df.groupby("label")["length"].describe()

| label | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| blues | 100.0 | 661794.0 | 0.0 | 661794.0 | 661794.0 | 661794.0 | 661794.0 | 661794.0 |
| classical | 100.0 | 662116.1 | 1583.648195 | 661344.0 | 661794.0 | 661794.0 | 661794.0 | 672282.0 |
| country | 100.0 | 662027.76 | 1189.911745 | 661100.0 | 661794.0 | 661794.0 | 661794.0 | 669680.0 |
| disco | 100.0 | 661934.8 | 1257.321058 | 661344.0 | 661504.0 | 661504.0 | 661794.0 | 668140.0 |
| hiphop | 100.0 | 663468.24 | 4380.745071 | 660000.0 | 661794.0 | 661794.0 | 661794.0 | 675808.0 |
| jazz | 100.0 | 662233.28 | 1696.357369 | 661676.0 | 661794.0 | 661794.0 | 661794.0 | 672100.0 |
| metal | 100.0 | 661596.8 | 135.95959 | 661504.0 | 661504.0 | 661504.0 | 661794.0 | 661794.0 |
| pop | 100.0 | 661504.0 | 0.0 | 661504.0 | 661504.0 | 661504.0 | 661504.0 | 661504.0 |
| reggae | 100.0 | 661622.9 | 143.35021 | 661504.0 | 661504.0 | 661504.0 | 661794.0 | 661794.0 |
| rock | 100.0 | 662010.58 | 1290.502566 | 661408.0 | 661794.0 | 661794.0 | 661794.0 | 670340.0 |


All of the files have a similar length (between 660,000 and 675,808 samples), and the feature table lists exactly 100 rows for every genre, so the class distribution is balanced.
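The `length` values appear to be sample counts rather than durations. As a rough sketch, they can be converted to seconds under the assumption that the audio uses a 22,050 Hz sample rate. That rate is the one usually quoted for GTZAN and has not been verified in this notebook; the helper name is hypothetical.

```python
SAMPLE_RATE_HZ = 22_050  # assumption: usual GTZAN sample rate, not verified here


def samples_to_seconds(n_samples, sample_rate=SAMPLE_RATE_HZ):
    """Convert a number of audio samples into a duration in seconds."""
    return n_samples / sample_rate


# Minimum, median, and maximum of the `length` column from df.describe() above
lengths = {"min": 660_000, "median": 661_794, "max": 675_808}
durations = {name: round(samples_to_seconds(n), 2) for name, n in lengths.items()}
print(durations)
```

Under that assumption, every file is close to 30 seconds long, which matches the name of the feature file.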

### Privacy Considerations
Since this is freely available music, there are no privacy considerations, though there are copyright considerations if improperly licensed audio files are in the dataset.

### Labeling Consistency
Each file's label is determined by the genre folder it sits in, so every file has exactly one label and there are no labeling inconsistencies.
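A small sketch of how this could be checked automatically, assuming every file follows the `<genre>.<index>.wav` naming seen in `jazz/jazz.00054.wav` above. The function name `label_mismatches` is hypothetical.

```python
from pathlib import Path


def label_mismatches(paths):
    """Return the paths whose filename prefix differs from their parent folder name."""
    mismatches = []
    for path in map(Path, paths):
        folder_label = path.parent.name          # e.g. 'jazz'
        filename_label = path.name.split(".")[0]  # e.g. 'jazz' from 'jazz.00054.wav'
        if folder_label != filename_label:
            mismatches.append(path)
    return mismatches
```

An empty result over all files in `music_files` would support the claim that folder and filename labels agree.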

## Full Checklist

### Essential checks

- [x] Does the data include information that can predict the target?
> Yes, the dataset contains the full audio file, which should be enough to classify different genres (though some sub-genres will be difficult).

- [x] Does the granularity of training and prediction match?
> Yes, we want to classify songs, and each data point is a single song. Because the audio files differ slightly in length, it may be necessary to use an N-to-one RNN-style neural network.

- [x] Do you already have labeled data?
> Yes, all data is categorized into one of 10 categories.

- [x] Is your data correct/accurate?
> Yes. While this is hard to fully verify, it is regarded as a good dataset (though we did not listen to all 1,000 songs).

- [x] Do you have enough data?
> Using the common rule of thumb of 10 examples per feature, yes: we have 1,000 examples and 10 features, so we exceed the threshold 10-fold.
> We are still cautious at this stage, since RNN-style networks usually require significantly more data than simple classifiers.

- [x] Is the data easily accessible by the team and machines performing the ML?
> Yes, it is easily accessible from Kaggle, as shown above.

- [x] Can you read the data fast enough?
> This is not really an issue: the base dataset loads in less than 1 second on Google Colab.

- [x] Do you have documentation for each field of data?
> Yes and no: the raw data has no real fields beyond the audio file and its genre, so there is little to document.

- [x] Are the missing values a small percentage of the fields of interest?
> Yes, only a single audio file is missing, which accounts for just 1% of the data in that category.
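The sample-size estimate in the "enough data" check above can be written out as a small calculation. This is only a sketch of the 10-examples-per-feature rule of thumb with the numbers quoted in that answer; the function name and the `factor` parameter are hypothetical.

```python
def meets_rule_of_thumb(n_samples, n_features, factor=10):
    """Return (is_enough, margin), where margin = n_samples / (factor * n_features)."""
    required = factor * n_features
    return n_samples >= required, n_samples / required


# Numbers quoted in the checklist answer: 1,000 examples and 10 features
ok, margin = meets_rule_of_thumb(n_samples=1_000, n_features=10)
print(ok, margin)  # True 10.0
```

If the feature count were instead taken from the columns of `features_30_sec.csv`, the margin would be smaller, so it is worth re-running the check once the final feature set is fixed.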

#### Forecasting
- [ ] If your data is periodic, do you have data for 3 ✕ period?
> Not Applicable
- [ ] If you want to forecast n periods in advance, do you have n + 2 periods of data?
> Not Applicable
- [ ] Do you know the timestamp at which each data value was obtained or updated?
> Not Applicable

### Additional Checks
- [x] Is your data unbiased?
> It is almost unbiased: jazz has 1% fewer usable files than the other genres. This is likely negligible, and can be mitigated by holding out one fewer jazz song when creating the train/test split.

- [x] If there are missing values, do you know the causes?
> Yes, the file will not load in any of the software we have tried, leading to the conclusion that it is data corruption.

- [x] If there are missing values, do they occur at random?
> Yes, it's a single file

- [x] For each field (input or target), does the data have the same unit?
> Yes, all inputs are WAV files.

- [x] For each field (input or target), is the meaning of the data consistent?
> Yes, there is only an audio file and genre.

- [x] Is the same value recorded in the same way everywhere?
> Yes, there is only an audio file and genre.
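As an alternative to adjusting the split, the jazz imbalance noted above could be removed by downsampling every genre to the size of the smallest one. The sketch below shows that different technique, not the mitigation described in the checklist; the function name and the fixed seed are assumptions.

```python
import random


def downsample_to_smallest(files_by_genre, seed=0):
    """Randomly keep the same number of files per genre (the smallest genre's count)."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    target = min(len(files) for files in files_by_genre.values())
    return {
        genre: sorted(rng.sample(files, target))
        for genre, files in files_by_genre.items()
    }
```

The trade-off is that one file is discarded from each of the other nine genres, so with an imbalance this small it may be preferable to keep all files and use a stratified split instead.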

## Preparing the data
- [ ] Integrate data from diverse input sources.
> This will become relevant if we add data from FMA; we are currently checking copyrights and looking into downloading that data programmatically.

- [ ] If your data is scattered, identify and consolidate it.
> Most of our data is already consolidated in the Kaggle dataset. If we supplement it, we will need to consolidate the new data, though this should not be much of a challenge as we only need the audio files and genres.

- [x] Identify and impute missing values.
> Done, as above (though, unfortunately, we cannot recover the file yet).

- [ ] Remove all sources of noise from your data.
> This will likely be part of the model, rather than preprocessing.

- [ ] Create new features that improve predicting the target.
> We are looking into creating Mel spectrograms and using pre-trained speech-to-text systems to identify features.

- [x] Look for new sources of information to complement your data.
> We have identified FMA as a potential extra source of information.

- [ ] Identify and remove all sources of data leakage.
> We are currently looking into potential data leakage, but our primary validation will be on the FMA dataset.

- [x] Integrate all the features of an instance into one object.
> We already have this object, as shown above, though it may need to change for new data sources.

- [ ] Convert data to formats that can be read fast for training the ML model.
> This is currently in the works.

- [ ] For a forecasting problem, build a pipeline to easily re-create a snapshot of the data at an arbitrary time in the past.
> Not applicable

- [x] Implement data quality tests.
> Done, as above.
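For the open data leakage item above, one simple source of leakage is the same audio file ending up in both the training and the evaluation data. The sketch below groups byte-identical files by their SHA-256 digest. It is only a first check under that narrow definition: re-encoded or trimmed copies of a song would not be detected. The function name and `chunk_size` parameter are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(paths, chunk_size=1 << 20):
    """Group files with identical bytes; return only groups with more than one file."""
    by_digest = defaultdict(list)
    for path in map(Path, paths):
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in chunks so large audio files are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        by_digest[digest.hexdigest()].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

Any group returned here should be kept on one side of the train/test split, or reduced to a single file.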