# Dataset description

The dataset used is GTZAN, a widely used benchmark for music genre classification that is often called "the MNIST of sounds".

GTZAN contains 1,000 audio files spread evenly over 10 genres, with 100 files per genre:

1. Blues

2. Classical

3. Country

4. Disco

5. Hip-hop

6. Jazz

7. Metal

8. Pop

9. Reggae

10. Rock

**Genres original**

> A collection of 10 genres with 100 audio files each, every file 30 seconds long.

**Images original**

> A visual representation of each audio file. One way to classify the data is with neural networks, which usually take some form of image as input.

**CSV files**

> Two files containing features of the audio files. One file holds, for each 30-second song, the mean and variance of several features extracted from the audio. The other file has the same structure, but each song is first split into 3-second segments.
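
To make the two CSV layouts concrete, the sketch below splits a synthetic 30-second signal into ten 3-second segments and summarises one frame-level feature (RMS energy) by its mean and variance. It uses NumPy only. The helper `frame_rms`, the frame sizes and the noise signal are illustrative assumptions, not the code that produced the original CSV files.

```python
import numpy as np

SAMPLE_RATE = 22050      # assumed sample rate (the notebook loads audio at this rate)
CLIP_SECONDS = 30
NUM_SEGMENTS = 10        # ten 3-second segments per clip

def frame_rms(signal, frame_length=2048, hop_length=512):
    """RMS energy of each frame (hypothetical helper, no padding)."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    return np.array([
        np.sqrt(np.mean(signal[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])

rng = np.random.default_rng(0)
clip = rng.standard_normal(SAMPLE_RATE * CLIP_SECONDS)  # stand-in for a real song

samples_per_segment = len(clip) // NUM_SEGMENTS
rows = []
for n in range(NUM_SEGMENTS):
    segment = clip[n * samples_per_segment:(n + 1) * samples_per_segment]
    rms = frame_rms(segment)
    # One CSV row per segment: the frame-level values collapse to two numbers
    rows.append({"segment": n, "rms_mean": rms.mean(), "rms_var": rms.var()})

print(len(rows), samples_per_segment)
```

The 30-second file corresponds to running the same summary once over the whole clip instead of once per segment.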

### ***Let's dig into the data! The dataset has already done some of the heavy lifting by extracting features for us, but we want to understand what these features are and how to obtain them. So let's delete the pre-computed image folder and CSV files and extract the features ourselves.***

In [None]:
# Run after the download cell below: removes the pre-computed images and feature CSVs
!rm -r /content/GTZAN/images_original
!rm /content/GTZAN/features_30_sec.csv
!rm /content/GTZAN/features_3_sec.csv

In [6]:
!gdown 1MGhyeMngD6P9Kz9zJpL68ylQaIQvW7Zx
!unzip GTZAN.zip -d /content/GTZAN
!rm GTZAN.zip

Downloading...
From: https://drive.google.com/uc?id=1MGhyeMngD6P9Kz9zJpL68ylQaIQvW7Zx
To: /content/GTZAN.zip
100% 1.30G/1.30G [00:15<00:00, 84.2MB/s]
Archive:  GTZAN.zip
  inflating: /content/GTZAN/features_30_sec.csv  
  inflating: /content/GTZAN/features_3_sec.csv  
  inflating: /content/GTZAN/genres_original/blues/blues.00000.wav  
  inflating: /content/GTZAN/genres_original/blues/blues.00001.wav  
  inflating: /content/GTZAN/genres_original/blues/blues.00002.wav  
  inflating: /content/GTZAN/genres_original/blues/blues.00003.wav  
  inflating: /content/GTZAN/genres_original/blues/blues.00004.wav  
  inflating: /content/GTZAN/genres_original/blues/blues.00005.wav  
  inflating: /content/GTZAN/genres_original/blues/blues.00006.wav  
  inflating: /content/GTZAN/genres_original/blues/blues.00007.wav  
  inflating: /content/GTZAN/genres_original/blues/blues.00008.wav  
  inflating: /content/GTZAN/genres_original/blues/blues.00009.wav  
  inflating: /content/GTZAN/genres_original/blues/b

In [7]:
from glob import glob
import pandas as pd

num_segment=10
num_mfcc=20
sample_rate=22050
n_fft=2048
hop_length=512

my_csv={"filename":[], "chroma_stft_mean": [], "chroma_stft_var": [], "rms_mean": [], "rms_var": [], "spectral_centroid_mean": [],
        "spectral_centroid_var": [], "spectral_bandwidth_mean": [], "spectral_bandwidth_var": [], "rolloff_mean": [], "rolloff_var": [],
        "zero_crossing_rate_mean": [], "zero_crossing_rate_var": [], "harmony_mean": [], "harmony_var": [], "perceptr_mean": [],
        "perceptr_var": [], "tempo": [], "mfcc1_mean": [], "mfcc1_var" : [], "mfcc2_mean" : [], "mfcc2_var" : [],
        "mfcc3_mean" : [], "mfcc3_var" : [], "mfcc4_mean" : [], "mfcc4_var" : [], "mfcc5_mean" : [],
        "mfcc5_var" : [], "mfcc6_mean" : [], "mfcc6_var" : [], "mfcc7_mean" : [], "mfcc7_var" : [],
        "mfcc8_mean" : [], "mfcc8_var" : [], "mfcc9_mean" : [], "mfcc9_var" : [], "mfcc10_mean" : [],
        "mfcc10_var" : [], "mfcc11_mean" : [], "mfcc11_var" : [], "mfcc12_mean" : [], "mfcc12_var" : [],
        "mfcc13_mean" : [], "mfcc13_var" : [], "mfcc14_mean" : [], "mfcc14_var" : [], "mfcc15_mean" : [],
        "mfcc15_var" : [], "mfcc16_mean" : [], "mfcc16_var" : [], "mfcc17_mean" : [], "mfcc17_var" : [],
        "mfcc18_mean" : [], "mfcc18_var" : [], "mfcc19_mean" : [], "mfcc19_var" : [], "mfcc20_mean" : [],
        "mfcc20_var":[], "label":[]}

In [8]:
dataset_path="/content/GTZAN/genres_original"
audio_files = glob(dataset_path + "/*/*")
genre = glob(dataset_path + "/*")
n_genres=len(genre)
genre=[genre[x].split('/')[-1] for x in range(n_genres)]
print(genre)

['hiphop', 'jazz', 'disco', 'classical', 'country', 'reggae', 'rock', 'metal', 'pop', 'blues']
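
The order of the printed list is whatever the file system returns, since `glob` does not sort its results. If a reproducible order is wanted, sorting the directory names is enough. The sketch below builds a throwaway directory tree so that it runs anywhere; the temporary path is a placeholder for `dataset_path`, and only four genre folders are created for brevity.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder for /content/GTZAN/genres_original
    root = Path(tmp) / "genres_original"
    for name in ["hiphop", "jazz", "disco", "blues"]:
        (root / name).mkdir(parents=True)

    # Sorted directory names give the same genre order on every run
    genres = sorted(p.name for p in root.iterdir() if p.is_dir())

print(genres)
```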


In [9]:
import librosa


In [None]:
samples_per_segment = int(sample_rate*30/num_segment)

genre=""
for f in sorted(audio_files):
    if genre!=f.split('/')[-2]:
        genre=f.split('/')[-2]
        print("Processing " + genre + "...")
    fname=f.split('/')[-1]
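    try:
        y, sr = librosa.load(f, sr=sample_rate)
    except Exception as err:
        # Skip files that cannot be decoded (for example a corrupt WAV) instead of aborting the run
        print(f"Skipping {fname}: {err}")
        continue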

    for n in range(num_segment):
        y_seg = y[samples_per_segment*n: samples_per_segment*(n+1)]
        #Chromagram
        chroma_hop_length = 512
        chromagram = librosa.feature.chroma_stft(y=y_seg, sr=sample_rate, hop_length=chroma_hop_length)
        my_csv["chroma_stft_mean"].append(chromagram.mean())
        my_csv["chroma_stft_var"].append(chromagram.var())

        #Root Mean Square Energy
        RMSEn= librosa.feature.rms(y=y_seg)
        my_csv["rms_mean"].append(RMSEn.mean())
        my_csv["rms_var"].append(RMSEn.var())

        #Spectral Centroid
        spec_cent=librosa.feature.spectral_centroid(y=y_seg, sr=sample_rate)
        my_csv["spectral_centroid_mean"].append(spec_cent.mean())
        my_csv["spectral_centroid_var"].append(spec_cent.var())

        #Spectral Bandwidth
        spec_band=librosa.feature.spectral_bandwidth(y=y_seg,sr=sample_rate)
        my_csv["spectral_bandwidth_mean"].append(spec_band.mean())
        my_csv["spectral_bandwidth_var"].append(spec_band.var())

        #Rolloff
        spec_roll=librosa.feature.spectral_rolloff(y=y_seg,sr=sample_rate)
        my_csv["rolloff_mean"].append(spec_roll.mean())
        my_csv["rolloff_var"].append(spec_roll.var())

        #Zero Crossing Rate
        zero_crossing=librosa.feature.zero_crossing_rate(y=y_seg)
        my_csv["zero_crossing_rate_mean"].append(zero_crossing.mean())
        my_csv["zero_crossing_rate_var"].append(zero_crossing.var())

        #Harmonic and percussive components (stored under the harmony/perceptr column names)
        harmony, perceptr = librosa.effects.hpss(y=y_seg)
        my_csv["harmony_mean"].append(harmony.mean())
        my_csv["harmony_var"].append(harmony.var())
        my_csv["perceptr_mean"].append(perceptr.mean())
        my_csv["perceptr_var"].append(perceptr.var())

        #Tempo
        tempo, _ = librosa.beat.beat_track(y=y_seg, sr=sample_rate)
        my_csv["tempo"].append(tempo)

        #MFCC
        mfcc=librosa.feature.mfcc(y=y_seg,sr=sample_rate, n_mfcc=num_mfcc, n_fft=n_fft, hop_length=hop_length)
        mfcc=mfcc.T


        fseg_name='.'.join(fname.split('.')[:2])+f'.{n}.wav'
        my_csv["filename"].append(fseg_name)
        my_csv["label"].append(genre)
        for x in range(num_mfcc):
            feat1 = "mfcc" + str(x+1) + "_mean"
            feat2 = "mfcc" + str(x+1) + "_var"
            my_csv[feat1].append(mfcc[:,x].mean())
            my_csv[feat2].append(mfcc[:,x].var())
    print(fname)

df = pd.DataFrame(my_csv)
df.to_csv('/content/GTZAN/features_3_sec.csv', index=False)
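
To see what two of these features measure, the sketch below computes a spectral centroid and a zero-crossing rate by hand with NumPy on a pure 440 Hz tone. It treats the whole signal as a single frame, whereas librosa computes the features frame by frame, so the numbers are not expected to match librosa's output exactly. The function names and the test tone are illustrative.

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency of the spectrum, in Hz."""
    windowed = signal * np.hanning(len(signal))
    magnitude = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * magnitude) / np.sum(magnitude))

def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.signbit(signal)
    return float(np.mean(signs[1:] != signs[:-1]))

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate          # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)

centroid = spectral_centroid(tone, sample_rate)
zcr = zero_crossing_rate(tone)
print(centroid, zcr)
```

For a pure tone the centroid sits at the tone's frequency, and the zero-crossing rate is roughly two crossings per cycle divided by the sample rate. Brighter or noisier audio pushes both values up.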

In [12]:
import pandas as pd

In [16]:
df = pd.read_csv("/content/GTZAN/features_3_sec.csv")
df.head()

| | filename | chroma_stft_mean | chroma_stft_var | rms_mean | rms_var | spectral_centroid_mean | spectral_centroid_var | spectral_bandwidth_mean | spectral_bandwidth_var | rolloff_mean | ... | mfcc16_var | mfcc17_mean | mfcc17_var | mfcc18_mean | mfcc18_var | mfcc19_mean | mfcc19_var | mfcc20_mean | mfcc20_var | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | blues.00000.0.wav | 0.335555 | 0.090997 | 0.130189 | 0.003559 | 1773.358004 | 169450.82952 | 1972.334258 | 117272.640189 | 3714.063439 | ... | 39.547073 | -3.230046 | 36.606853 | 0.696385 | 37.766136 | -5.035945 | 33.66855 | -0.239585 | 43.81888 | blues |
| 1 | blues.00000.1.wav | 0.343523 | 0.086782 | 0.112119 | 0.001491 | 1817.244034 | 90766.297254 | 2010.751494 | 65940.666243 | 3870.510442 | ... | 64.819786 | -6.025472 | 40.548813 | 0.127131 | 51.048935 | -2.808956 | 97.2215 | 5.771881 | 60.360348 | blues |
| 2 | blues.00000.2.wav | 0.347746 | 0.092495 | 0.130895 | 0.004552 | 1790.722357 | 110071.206973 | 2088.18475 | 73391.498001 | 4000.206581 | ... | 68.30679 | -1.714475 | 28.136944 | 2.329553 | 47.211426 | -1.925621 | 52.922432 | 2.466996 | 33.164 | blues |
| 3 | blues.00000.3.wav | 0.363863 | 0.087207 | 0.131349 | 0.002338 | 1660.545231 | 109496.936296 | 1967.920582 | 79805.901351 | 3579.149639 | ... | 48.5432 | -3.786987 | 28.419546 | 1.153315 | 35.6827 | -3.501979 | 50.610344 | 3.580637 | 32.32587 | blues |
| 4 | blues.00000.4.wav | 0.335481 | 0.088482 | 0.14237 | 0.001734 | 1634.465077 | 77425.419232 | 1954.633566 | 57359.695604 | 3480.096905 | ... | 30.829542 | 0.635797 | 44.645554 | 1.591108 | 51.415863 | -3.364908 | 26.421085 | 0.501505 | 29.109531 | blues |


In [17]:
df.shape

(9990, 59)
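
The shape can be checked against the extraction settings. Ten segments per clip and 9,990 rows imply that 999 of the 1,000 clips were processed, which suggests that one file failed to load and was skipped by the `try`/`except` in the extraction loop (an inference from the numbers; the notebook does not say which file). The 59 columns follow from the naming pattern of the feature dictionary, which the sketch below regenerates; `base_features` is an illustrative name.

```python
NUM_MFCC = 20
NUM_SEGMENTS = 10
NUM_CLIPS = 1000
ROWS_OBSERVED = 9990

# Eight features stored as mean/variance pairs, in the dictionary's order
base_features = [
    "chroma_stft", "rms", "spectral_centroid", "spectral_bandwidth",
    "rolloff", "zero_crossing_rate", "harmony", "perceptr",
]

columns = ["filename"]
for name in base_features:
    columns += [f"{name}_mean", f"{name}_var"]
columns.append("tempo")
for i in range(1, NUM_MFCC + 1):
    columns += [f"mfcc{i}_mean", f"mfcc{i}_var"]
columns.append("label")

clips_processed = ROWS_OBSERVED // NUM_SEGMENTS
clips_skipped = NUM_CLIPS - clips_processed

print(len(columns), clips_processed, clips_skipped)   # 59 999 1
```

Building the dictionary as `{name: [] for name in columns}` would give the same keys as the hand-written version and avoids typos in the 40 MFCC names.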

In [18]:
# Drop the column filename as it is no longer required for training
df=df.drop(labels="filename",axis=1)

In [19]:
X, y =  df.iloc[:,:-1], df.iloc[:,-1]

In [20]:
from sklearn.preprocessing import LabelEncoder

# Label encoding - encode the categorical classes as integers for training

# Blues - 0
# Classical - 1
# Country - 2
# Disco - 3
# Hip-hop - 4
# Jazz - 5
# Metal - 6
# Pop - 7
# Reggae - 8
# Rock - 9

encoder=LabelEncoder()
y=encoder.fit_transform(y)
y

array([0, 0, 0, ..., 9, 9, 9])
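
`LabelEncoder` assigns integers in sorted order of the class names, which is why the alphabetical list in the comments holds. The same mapping can be reproduced with `np.unique`, as in this small sketch on a made-up label list. On the fitted encoder above, `encoder.classes_` gives the class name for each integer.

```python
import numpy as np

# Made-up labels for illustration; the real ones come from the "label" column
labels = np.array(["rock", "blues", "jazz", "blues", "hiphop", "rock"])

# classes: sorted unique names; encoded: index of each label in classes
classes, encoded = np.unique(labels, return_inverse=True)

# Mapping integers back to names is a plain array lookup
decoded = classes[encoded]

print(list(classes))
print(list(encoded))
```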

In [21]:
#scaling
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X=scaler.fit_transform(X)

In [22]:
X

array([[-0.49043671,  0.63257679,  0.00171698, ..., -0.50915477,
         0.13137976, -0.28722956],
       [-0.40249208,  0.19689083, -0.26339646, ...,  1.02256007,
         1.27738203,  0.06984314],
       [-0.35588034,  0.78742291,  0.01207543, ..., -0.04510933,
         0.64735176, -0.51723132],
       ...,
       [-0.3551243 ,  0.44063833, -1.14679445, ..., -0.17031837,
         0.13609158, -0.33992808],
       [ 0.07516607, -0.0236235 , -0.94049459, ..., -0.71974425,
         0.30405673, -0.96162813],
       [-0.12437789,  0.17614602, -1.17119235, ..., -0.38363456,
        -0.4733399 , -0.54924434]])
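
`StandardScaler` subtracts each column's mean and divides by its standard deviation: z = (x - mean) / std. In the cell above the mean and standard deviation are computed from all rows, so the rows that later become the test set influence the scaling. A stricter variant estimates the statistics on the training rows only and reuses them for the test rows. The sketch below shows that variant with plain NumPy on random stand-in data; the array names and the 70/30 cut are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1234)
features = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))   # stand-in feature matrix

n_train = int(0.7 * len(features))
train, test = features[:n_train], features[n_train:]

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)          # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # test rows reuse the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

With scikit-learn the equivalent is to call `fit_transform` on the training split and `transform` on the test split.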

In [23]:
from sklearn.model_selection import train_test_split

# splitting 70% data into training set and the remaining 30% to test set
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3, random_state=1234)

In [24]:
# test data size
len(y_test)

2997

In [25]:
# size of training data
len(y_train)

6993
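
One caveat about this split: each song contributes ten 3-second rows, and a random row-level split places segments of the same song on both sides, which tends to inflate test accuracy. A song-level split avoids this. The sketch below derives a song identifier from filenames of the form `blues.00000.3.wav` and assigns whole songs to one side. The filenames are generated for illustration and `song_id` is a hypothetical helper. scikit-learn's `GroupShuffleSplit` offers the same idea ready-made.

```python
import numpy as np

def song_id(segment_name):
    """'blues.00000.3.wav' -> 'blues.00000' (drops segment index and extension)."""
    return ".".join(segment_name.split(".")[:2])

# Illustrative filenames: 2 genres x 5 songs x 10 segments
filenames = [f"{g}.{s:05d}.{n}.wav"
             for g in ("blues", "jazz") for s in range(5) for n in range(10)]
groups = np.array([song_id(f) for f in filenames])

rng = np.random.default_rng(1234)
songs = np.unique(groups)
rng.shuffle(songs)                                  # shuffle songs, not rows
n_test_songs = int(round(0.3 * len(songs)))
test_songs = set(songs[:n_test_songs])

test_mask = np.array([g in test_songs for g in groups])
train_idx = np.flatnonzero(~test_mask)
test_idx = np.flatnonzero(test_mask)

print(len(train_idx), len(test_idx))
```

In the notebook this would require keeping the `filename` column until after the split, since the song identifier is derived from it.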

In [26]:
import os
import numpy as np

import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

In [27]:
class MLP(nn.Module):
    def __init__(self, input_size):
        super(MLP, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(input_size, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 64)
        self.fc5 = nn.Linear(64, 32)
        self.fc6 = nn.Linear(32, 10)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = F.relu(self.fc3(x))
        x = self.dropout(x)
        x = F.relu(self.fc4(x))
        x = self.dropout(x)
        x = F.relu(self.fc5(x))
        x = self.dropout(x)
        x = self.fc6(x)
        return x
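
As a quick size check, the number of trainable parameters follows from the layer widths: a `Linear(in, out)` layer holds `in * out` weights plus `out` biases. The sketch assumes 57 input features, which is what remains of the 59 CSV columns once the filename and label are removed.

```python
layer_sizes = [57, 512, 256, 128, 64, 32, 10]   # input width assumed from the CSV

# Weights plus biases for each consecutive pair of layer widths
per_layer = [n_in * n_out + n_out
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)   # [29696, 131328, 32896, 8256, 2080, 330]
print(total)       # 204586
```

`nn.Dropout` and `nn.Flatten` add no parameters. With 2-D input of shape `(batch, features)`, `nn.Flatten` leaves the tensor unchanged.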

In [28]:
input_size = X_train.shape[1]
model = MLP(input_size)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.000146)
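
`nn.CrossEntropyLoss` takes the raw outputs of `fc6` (logits) and applies log-softmax internally, which is why `forward` has no softmax. A useful sanity check: a model that assigns equal probability to all ten genres has a loss of ln(10) ≈ 2.303, close to the `Step 0` training loss of 2.3127 in the training log. The pure-Python sketch below computes that value; `cross_entropy` is an illustrative helper, not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    shift = max(logits)                               # for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

uniform_logits = [0.0] * 10          # equal scores for all ten genres
loss = cross_entropy(uniform_logits, target=3)

print(round(loss, 4))   # 2.3026
```

An initial loss far above this value would usually point to a problem with the labels or the output layer.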

In [29]:
num_epochs = 300
batch_size = 256

train_dataset = TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

val_dataset = TensorDataset(torch.tensor(X_test), torch.tensor(y_test))
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

step = 0

for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs.float())
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        if step % 100 == 0:
            print(f"Step {step}, Train Loss: {loss.item():.4f}")
        step += 1

    # Validation
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            outputs = model(inputs.float())
            loss = criterion(outputs, labels)
            val_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

    val_loss /= len(val_loader)
    val_accuracy = 100 * correct / total
    print(f"Epoch {epoch+1}/{num_epochs}, Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_accuracy:.2f}%")

Step 0, Train Loss: 2.3127
Epoch 1/300, Validation Loss: 2.2902, Validation Accuracy: 14.05%
Epoch 2/300, Validation Loss: 2.2259, Validation Accuracy: 16.08%
Epoch 3/300, Validation Loss: 2.0345, Validation Accuracy: 32.00%
Step 100, Train Loss: 1.9515
Epoch 4/300, Validation Loss: 1.7810, Validation Accuracy: 36.40%
Epoch 5/300, Validation Loss: 1.6042, Validation Accuracy: 40.21%
Epoch 6/300, Validation Loss: 1.4998, Validation Accuracy: 43.84%
Epoch 7/300, Validation Loss: 1.4288, Validation Accuracy: 46.18%
Step 200, Train Loss: 1.5575
Epoch 8/300, Validation Loss: 1.3690, Validation Accuracy: 49.12%
Epoch 9/300, Validation Loss: 1.3139, Validation Accuracy: 51.75%
Epoch 10/300, Validation Loss: 1.2596, Validation Accuracy: 54.19%
Step 300, Train Loss: 1.3635
Epoch 11/300, Validation Loss: 1.2091, Validation Accuracy: 56.49%
Epoch 12/300, Validation Loss: 1.1840, Validation Accuracy: 58.53%
Epoch 13/300, Validation Loss: 1.1427, Validation Accuracy: 59.59%
Epoch 14/300, Validation