<a href="https://colab.research.google.com/github/szilaard/AIT_project/blob/main/AitProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIT Deep Learning Project

Péter Czumbel

Szilárd Horváth


In [None]:
import tensorflow as tf
import librosa
import librosa.display  # explicit import; needed for waveshow/specshow on older librosa versions
import pandas as pd
from glob import glob
import IPython
import IPython.display as ipd
import numpy as np
import matplotlib.pyplot as plt
import math

Downloading the GTZAN dataset from TensorFlow Datasets does not work because the URL times out.<br>
See: https://github.com/tensorflow/datasets/issues/4090 <br>
We use [this](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification) version of the dataset from Kaggle instead.



We collect the paths of all audio files from the genre directories.

In [None]:
audio_files = glob("Data/genres_original/*/*.wav")
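The genre of each file is the name of the directory that contains it. As a minimal sketch (the helper name `genre_from_path` is hypothetical and the paths are only illustrative), the label can be extracted in a way that tolerates both `/` and `\` separators:

```python
from pathlib import PureWindowsPath


def genre_from_path(path: str) -> str:
    """Return the name of the directory that directly contains the file.

    PureWindowsPath accepts both "/" and "\\" as separators, so the same
    helper handles paths produced by glob on Windows and on Linux/macOS.
    """
    return PureWindowsPath(path).parent.name


# Illustrative paths only; they are not read from disk.
print(genre_from_path("Data/genres_original/blues/blues.00000.wav"))
print(genre_from_path("Data/genres_original\\jazz\\jazz.00001.wav"))
```

Deriving the label from the parent directory avoids depending on the separator that the operating system happens to use.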

In [None]:
# These parameters are defined in one place so they can be tuned later.
n_fft = 2048
n_mfcc = 13
hop_length = 512
sample_rate = 22050
number_of_segments = 5
duration = 30
samples_per_track = sample_rate * duration
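These parameters fix the time and frequency resolution of the analysis. As a rough worked example (plain arithmetic on the values above, nothing dataset-specific), the window length, hop size, and frequency spacing can be derived as follows:

```python
# Parameter values copied from the cell above.
n_fft = 2048
hop_length = 512
sample_rate = 22050

window_ms = 1000 * n_fft / sample_rate    # length of one analysis window
hop_ms = 1000 * hop_length / sample_rate  # step between consecutive windows
bin_hz = sample_rate / n_fft              # spacing between FFT frequency bins
overlap = 1 - hop_length / n_fft          # fraction shared by adjacent windows

print(f"window: {window_ms:.1f} ms, hop: {hop_ms:.1f} ms")
print(f"frequency resolution: {bin_hz:.2f} Hz per bin, overlap: {overlap:.0%}")
```

With these values each window spans roughly 93 ms and adjacent windows overlap by 75%.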

Example audio:

In [None]:
ipd.Audio(audio_files[0])

We load the raw samples of the first audio file together with its sampling rate.

In [None]:
signal, sr = librosa.load(audio_files[0], sr=sample_rate)
print("signal is a NumPy array:", signal)
print("Shape of signal:", signal.shape)



In [None]:
librosa.display.waveshow(signal, sr=sample_rate)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform")
plt.show()

Mel-frequency cepstral coefficients (MFCCs), computed from a short-time Fourier transform of the signal

In [None]:

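# Compute the MFCCs; the result has shape (n_mfcc, n_frames).
mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)

# specshow expects features on the first axis and time frames on the second,
# so the matrix is plotted without transposing it.
librosa.display.specshow(mfcc, sr=sample_rate, hop_length=hop_length, x_axis="time")
plt.xlabel("Time (s)")
plt.ylabel("MFCC coefficient")
plt.colorbar(format="%+2.0f")
plt.title("MFCCs")
plt.show()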
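MFCCs are computed on the mel scale, which spaces frequencies closer to how pitch is perceived than a linear Hz axis does. The sketch below uses the common HTK-style formula purely for intuition. This is an assumption for illustration: librosa's default mel filter bank uses the Slaney variant, so the exact numbers differ. The helper name `hz_to_mel_htk` is hypothetical.

```python
import math


def hz_to_mel_htk(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# At the high end the mel value grows much more slowly than the frequency in Hz.
for freq in (500, 1000, 2000, 4000, 8000):
    print(f"{freq:>5} Hz -> {hz_to_mel_htk(freq):7.1f} mel")
```

Doubling the frequency from 4000 Hz to 8000 Hz adds far fewer mels than the first 4000 Hz do, which is the compression of high frequencies that the MFCC features inherit.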

In [None]:
data = {
    "mapping": [], 
    "mfcc": [],
    "labels": []
}
samples_per_segment = samples_per_track // number_of_segments
num_mfcc_vectors_per_segment = math.ceil(samples_per_segment / hop_length)
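The expected number of MFCC frames per segment follows directly from the parameters. A small worked check, assuming librosa's default centered framing (which, to our understanding, yields `1 + n_samples // hop_length` frames):

```python
import math

# Parameter values copied from the parameter cell.
sample_rate = 22050
duration = 30
number_of_segments = 5
hop_length = 512

samples_per_track = sample_rate * duration
samples_per_segment = samples_per_track // number_of_segments

# Value used by the notebook as the expected frame count.
expected_frames = math.ceil(samples_per_segment / hop_length)

# Frame count under centered framing (assumed librosa default).
centered_frames = 1 + samples_per_segment // hop_length

print(samples_per_segment, expected_frames, centered_frames)
```

The two counts agree here. They would differ by one only if the segment length were an exact multiple of `hop_length`, which is worth rechecking whenever the parameters are changed.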

Next, we split each track into segments and compute mel-frequency cepstral coefficients (MFCCs) for every segment. MFCCs summarize the spectrum on the mel scale, which approximates how humans perceive pitch, so the features are closer to what a listener would notice in the music.

In [None]:
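import os

for audio_file in audio_files:
    # The genre is the name of the directory containing the file;
    # os.path handles the path separator of the current OS.
    label = os.path.basename(os.path.dirname(audio_file))
    if label not in data["mapping"]:
        data["mapping"].append(label)
    try:
        signal, sr = librosa.load(audio_file, sr=sample_rate)
    except Exception:
        # Some files are corrupted or unreadable, so we skip them.
        continue

    for i in range(number_of_segments):
        start = samples_per_segment * i
        end = start + samples_per_segment

        # Compute the MFCCs of this segment first, then check their length.
        mfcc = librosa.feature.mfcc(y=signal[start:end], sr=sample_rate, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
        mfcc = mfcc.T

        # Keep only segments with the expected number of frames, so that
        # all samples have the same shape.
        if len(mfcc) == num_mfcc_vectors_per_segment:
            data["mfcc"].append(mfcc)
            data["labels"].append(data["mapping"].index(label))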
    
    
data["mapping"]
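The segmentation step can also be looked at in isolation. The sketch below (with a hypothetical helper `segment_bounds` and a made-up short track) lists the sample ranges that the loop slices out, and shows why a track shorter than the nominal 30 seconds produces an incomplete final segment, which is what the length check is meant to catch.

```python
def segment_bounds(n_samples, samples_per_segment, number_of_segments):
    """Return (start, end) sample indices per segment, clipped to the signal."""
    bounds = []
    for i in range(number_of_segments):
        start = samples_per_segment * i
        end = min(start + samples_per_segment, n_samples)
        bounds.append((start, max(start, end)))
    return bounds


samples_per_segment = 132300  # 22050 Hz * 30 s / 5 segments

# A full-length track fills every segment.
full = segment_bounds(661500, samples_per_segment, 5)

# A hypothetical track that is 2 seconds short leaves the last segment incomplete.
short = segment_bounds(661500 - 2 * 22050, samples_per_segment, 5)

for start, end in short:
    complete = (end - start) == samples_per_segment
    print(start, end, "complete" if complete else "incomplete")
```

Clipping is written out explicitly here only for illustration; NumPy slicing clips out-of-range indices in the same way.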

We convert the lists into NumPy arrays so they are easier to handle.

In [None]:
data["mfcc"] = np.asarray(data["mfcc"])
data["labels"] = np.asarray(data["labels"])

We split the data into training, validation, and test sets. The ratios are defined as variables so they can be tuned later.

In [None]:
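data_length = len(data["mfcc"])
train_ratio = 0.7
valid_ratio = 0.2
test_ratio = 0.1

train_size = int(train_ratio * data_length)
valid_size = int(valid_ratio * data_length)
test_size = data_length - train_size - valid_size  # the test set takes the remainder

# The samples are stored genre by genre, so we shuffle them (with a fixed seed
# for reproducibility) before slicing; otherwise the validation and test sets
# would contain only the last genres.
shuffled_indices = np.random.default_rng(42).permutation(data_length)
X = data["mfcc"][shuffled_indices]
Y = data["labels"][shuffled_indices]

X_train = X[:train_size]
Y_train = Y[:train_size]
X_valid = X[train_size:train_size + valid_size]
Y_valid = Y[train_size:train_size + valid_size]
X_test = X[train_size + valid_size:]
Y_test = Y[train_size + valid_size:]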
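Because the files are most likely read genre by genre, it is worth checking how the classes are distributed across the splits. A minimal sketch with toy labels (not the real dataset; `class_counts` is a hypothetical helper) shows how a sequential slice and a shuffled slice differ:

```python
import random
from collections import Counter


def class_counts(labels):
    """Count how many samples of each class appear in a split."""
    return dict(sorted(Counter(labels).items()))


# Toy labels: 3 classes with 10 samples each, stored class by class.
labels = [0] * 10 + [1] * 10 + [2] * 10
train_size = int(0.7 * len(labels))

# Sequential slice: the held-out part contains only the last class.
sequential_rest = labels[train_size:]

# Shuffled slice: a fixed seed keeps the toy example reproducible.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
shuffled_rest = shuffled[train_size:]

print("sequential:", class_counts(sequential_rest))
print("shuffled:  ", class_counts(shuffled_rest))
```

The same check can be run on `Y_train`, `Y_valid`, and `Y_test` to confirm that every genre appears in every split.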
