# Music Genre Analysis with LVQ on Spotify

Application of machine learning concepts to analyze and classify music genres using the Spotify Tracks dataset (https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset).

The focus algorithm is LVQ (Learning Vector Quantization). The goal is to understand and apply data preprocessing, hyperparameter tuning, and performance analysis techniques for classification models.
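
For reference, the update rule that `train_codebooks` implements further down in this notebook can be written roughly as follows. The notation is an assumption made here for illustration: $x$ is a training row with class $y$, $w^{*}$ is its best matching codebook vector with class $c(w^{*})$, $\alpha_0$ is the initial learning rate (`lrate`), $t$ is the epoch index starting at 0, and $T$ is the number of epochs.

```latex
\alpha_t = \alpha_0 \left(1 - \frac{t}{T}\right), \qquad
w^{*} = \arg\min_{w_j} \lVert x - w_j \rVert_2, \qquad
w^{*} \leftarrow
\begin{cases}
w^{*} + \alpha_t \,(x - w^{*}) & \text{if } c(w^{*}) = y \\
w^{*} - \alpha_t \,(x - w^{*}) & \text{otherwise}
\end{cases}
```

The closest codebook vector is pulled toward rows of its own class and pushed away from rows of other classes, with a step size that shrinks linearly over the epochs.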

## Data Preparation

### Data Selection:

- Use 50% of the records in the Spotify Tracks dataset, with the `track_genre` class as the target variable.

- Remove records with missing data.
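
The two selection steps above can be sketched in plain Python as follows. This is only an illustration: the function name `select_records`, the seed, and the toy records are hypothetical and not taken from the dataset.

```python
import random

def select_records(rows, fraction=0.5, seed=42):
    """Drop rows that contain a missing value, then keep a random fraction."""
    complete = [row for row in rows
                if all(value is not None for value in row.values())]
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    n_keep = int(len(complete) * fraction)
    return rng.sample(complete, n_keep)

# Hypothetical toy records; the second one has a missing value.
rows = [
    {"danceability": 0.68, "energy": 0.46, "track_genre": "acoustic"},
    {"danceability": None, "energy": 0.17, "track_genre": "acoustic"},
    {"danceability": 0.44, "energy": 0.36, "track_genre": "rock"},
    {"danceability": 0.27, "energy": 0.60, "track_genre": "rock"},
    {"danceability": 0.62, "energy": 0.44, "track_genre": "jazz"},
]
selected = select_records(rows)
print(len(selected))  # 2 of the 4 complete rows
```

With a pandas DataFrame, the equivalent would be along the lines of `dataset.dropna().sample(frac=0.5, random_state=42)`.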

### Data Cleaning:

- Remove categorical attributes and track identifiers that are irrelevant to the modeling.

- Missing-data handling: identify and treat missing data in the remaining numeric attributes.
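
A minimal sketch of these two cleaning steps, assuming records are held as dictionaries and that missing numeric values are replaced by the column mean (one common choice among several). The helper names and the toy data are hypothetical.

```python
def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]

def impute_mean(rows, numeric_columns):
    """Replace None in each numeric column with the mean of the observed values."""
    filled = [dict(row) for row in rows]
    for col in numeric_columns:
        observed = [row[col] for row in rows if row[col] is not None]
        mean = sum(observed) / len(observed)
        for row in filled:
            if row[col] is None:
                row[col] = mean
    return filled

# Hypothetical toy records with one identifier column and one missing value.
rows = [
    {"track_id": "a1", "energy": 0.25, "track_genre": "rock"},
    {"track_id": "b2", "energy": None, "track_genre": "jazz"},
    {"track_id": "c3", "energy": 0.75, "track_genre": "rock"},
]
cleaned = impute_mean(drop_columns(rows, {"track_id"}), ["energy"])
print(cleaned[1])  # {'energy': 0.5, 'track_genre': 'jazz'}
```

Note that the target column `track_genre` is kept through the cleaning so that it is still available as the class label afterwards.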

### Scaling:

- Standardize or normalize the numeric values to ensure a uniform scale across attributes.
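
As an illustration of normalization, min-max scaling maps each value to `(x - min) / (max - min)` per column, which is also what scikit-learn's `MinMaxScaler` computes with its default range. The sketch below assumes rows stored as lists with the class label in the last position, and it leaves that label untouched; the function name and the sample values are hypothetical.

```python
def min_max_scale(rows, label_index=-1):
    """Scale every feature column to [0, 1]; leave the label column as-is."""
    n_columns = len(rows[0])
    label_index %= n_columns           # allow negative indices such as -1
    scaled = [list(row) for row in rows]
    for col in range(n_columns):
        if col == label_index:
            continue
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for row in scaled:
            # A constant column has no spread, so map it to 0.0.
            row[col] = (row[col] - lo) / span if span else 0.0
    return scaled

# Hypothetical rows: tempo, loudness, genre label.
rows = [
    [120.0, -5.0, "rock"],
    [90.0, -10.0, "jazz"],
    [150.0, -7.5, "rock"],
]
print(min_max_scale(rows))
# [[0.5, 1.0, 'rock'], [0.0, 0.0, 'jazz'], [1.0, 0.5, 'rock']]
```

Scaling matters for LVQ because the Euclidean distance would otherwise be dominated by attributes with large numeric ranges.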

### Data Splitting:

- Split the dataset into training, validation, and test sets, using 60% for training, 20% for validation, and 20% for testing.
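
One simple way to obtain a 60/20/20 split is to shuffle once and slice. The sketch below is an assumption about how this could be done, not the procedure used later in the notebook; the function name and seed are hypothetical.

```python
import random

def train_valid_test_split(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle a copy of the rows and slice it into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains goes to the test set
    return train, valid, test

rows = list(range(100))
train, valid, test = train_valid_test_split(rows)
print(len(train), len(valid), len(test))  # 60 20 20
```

The `evaluate_algorithm` function in the code cell below takes a different route: it trains on four of five folds and halves the held-out fold, which gives roughly 80/10/10 per iteration.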

### Modeling and Evaluation:

Explore different values of k and different distance measures, evaluating the model on the training and validation sets.
Use evaluation metrics such as precision, recall, and F1-score to identify the best configuration of k.
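
For a multi-class target, these metrics are usually computed per class and then averaged. The sketch below shows the support-weighted average, which is the averaging the metric helpers in the code cell request from scikit-learn via `average='weighted'`. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_scores(actual, predicted):
    """Support-weighted precision, recall and F1 over the classes in `actual`."""
    support = Counter(actual)
    predicted_counts = Counter(predicted)
    true_positives = Counter(a for a, p in zip(actual, predicted) if a == p)
    total = len(actual)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        tp = true_positives[label]
        # A class that is never predicted gets precision 0.
        p = tp / predicted_counts[label] if predicted_counts[label] else 0.0
        r = tp / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

actual = [0, 0, 0, 1, 1, 2]
predicted = [0, 0, 1, 1, 1, 0]
precision, recall, f1 = weighted_scores(actual, predicted)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # 0.556 0.667 0.600
```

Support-weighted recall is mathematically equal to overall accuracy (here 4 correct out of 6), so reporting both adds no information; macro averaging is an alternative when rare classes should count as much as frequent ones.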

Parameter variation strategy: for the LVQ model, adopt a systematic hyperparameter search strategy, such as grid search or random search, to explore the hyperparameter space.
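
A grid search simply evaluates every combination of candidate values and keeps the one that scores best on the validation set. The sketch below pairs it with a compact LVQ written only for this illustration: unlike the notebook's `random_codebook`, it initializes each codebook from one whole training row. The grid values, the seeds, and the synthetic two-cluster data are all hypothetical.

```python
import itertools
import random
from math import dist

def train_lvq(train, n_codebooks, lrate, epochs, rng):
    """Minimal LVQ1: rows are feature lists with the class label last."""
    codebooks = [list(rng.choice(train)) for _ in range(n_codebooks)]
    for epoch in range(epochs):
        rate = lrate * (1.0 - epoch / epochs)      # linear decay
        for row in train:
            bmu = min(codebooks, key=lambda c: dist(c[:-1], row[:-1]))
            sign = 1.0 if bmu[-1] == row[-1] else -1.0
            for i in range(len(row) - 1):
                bmu[i] += sign * rate * (row[i] - bmu[i])
    return codebooks

def predict(codebooks, features):
    return min(codebooks, key=lambda c: dist(c[:-1], features))[-1]

def accuracy(codebooks, rows):
    hits = sum(predict(codebooks, row[:-1]) == row[-1] for row in rows)
    return hits / len(rows)

def grid_search(train, valid, grid, seed=0):
    """Try every combination in `grid`; return (best validation accuracy, params)."""
    best_score, best_params = -1.0, None
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = train_lvq(train, rng=random.Random(seed), **params)
        score = accuracy(model, valid)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Synthetic data: two well-separated clusters with labels 0 and 1.
rng = random.Random(1)

def blob(cx, cy, label, n):
    return [[cx + rng.uniform(-0.1, 0.1), cy + rng.uniform(-0.1, 0.1), label]
            for _ in range(n)]

data = blob(0.2, 0.2, 0, 40) + blob(0.8, 0.8, 1, 40)
rng.shuffle(data)
train, valid = data[:60], data[60:]

grid = {"n_codebooks": [2, 4], "lrate": [0.1, 0.3], "epochs": [5, 10]}
best_score, best_params = grid_search(train, valid, grid)
print(best_score, best_params)
```

A random search would replace the `itertools.product` loop with a fixed number of randomly drawn combinations, which scales better when the grid is large. In either case the test set should only be used once, on the configuration chosen via validation.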

### Deliverables:

- Submit the best configurations of the LVQ model, with their performance on the training, validation, and test sets.

- Present a critical analysis of the results.

- In addition, present the main difficulties encountered.

In [1]:
!pip install datasets


In [2]:
import pandas as pd
import numpy as np

from datasets import load_dataset
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import precision_score, recall_score, f1_score
from csv import reader
from random import randrange
from math import sqrt

'''
  Data selection
'''

# Load the Spotify Tracks dataset from Hugging Face.
spotify_tracks_dataset = load_dataset("maharshipandya/spotify-tracks-dataset")

# Get the dataset as a pandas DataFrame.
dataframe = spotify_tracks_dataset['train'].to_pandas()

# Drop identifier and categorical columns; errors="ignore" skips any that are absent.
columns_to_drop = ["Unnamed: 0", "track_id", "artists", "track_name", "album_name", "explicit"]
dataset = dataframe.drop(columns=columns_to_drop, errors="ignore")

# Remove records with missing data.
dataset.dropna(inplace=True)

# Select 50% of the records at random.
# NOTE: this step is commented out (and set to frac=0.1), so all records are used below.
#dataset_sampled = dataset.sample(frac=0.1, random_state=42)

# Missing-data handling: fill any remaining gaps in the numeric attributes with the column mean.
# After the dropna() above this has no effect; it is kept as a safeguard.
numeric_cols = dataset.select_dtypes(include=[np.number]).columns.tolist()
dataset[numeric_cols] = dataset[numeric_cols].fillna(dataset[numeric_cols].mean())

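# Normalize the numeric data to the [0, 1] range.
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(dataset[numeric_cols])
scaled_df = pd.DataFrame(scaled_data, columns=numeric_cols)

# Continue with the scaled numeric columns only.
# NOTE: track_genre is not numeric, so it is not part of scaled_df. The loader below
# treats the last CSV column as the class label, which is therefore the last numeric
# attribute (in this dataset that appears to be time_signature), not the genre.
dataset = scaled_df

# Save the DataFrame as a CSV file.
dataset.to_csv('/content/spotify_dataset.csv', index=False, header=False)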

'''
  Modeling and evaluation

  Use evaluation metrics such as precision, recall, and F1-score to
  identify the best configuration.
'''

# Load a CSV file
def load_csv(filename):
	dataset = list()

	with open(filename, 'r') as file:
		csv_reader = reader(file)
		for row in csv_reader:
			if not row:
				continue
			dataset.append(row)
	return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
	for row in dataset:
		row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
	class_values = [row[column] for row in dataset]
	unique = set(class_values)
	lookup = dict()
	for i, value in enumerate(unique):
		lookup[value] = i
	for row in dataset:
		row[column] = lookup[row[column]]
	return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)

    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)

    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = {'accuracy': [], 'precision': [], 'recall': [], 'f1_score': []}

    for i in range(n_folds):

        # Train on the remaining folds; split the held-out fold in half for
        # validation and test (with 5 folds, roughly 80/10/10 per iteration)
        train_set = sum(folds[:i] + folds[i+1:], [])
        valid_set = folds[i][:len(folds[i]) // 2]
        test_set = folds[i][len(folds[i]) // 2:]

        # Train the model
        model = algorithm(train_set, valid_set, *args)

        # Predict on train, validation and test sets
        train_actual = [row[-1] for row in train_set]
        train_predicted = [predict(model, row[:-1]) for row in train_set]

        valid_actual = [row[-1] for row in valid_set]
        valid_predicted = [predict(model, row[:-1]) for row in valid_set]

        test_actual = [row[-1] for row in test_set]
        test_predicted = [predict(model, row[:-1]) for row in test_set]

        # Calculate metrics for train, validation and test sets
        train_accuracy = accuracy_metric(train_actual, train_predicted)
        train_precision = precision_metric(train_actual, train_predicted)
        train_recall = recall_metric(train_actual, train_predicted)
        train_f1 = f1_score_metric(train_actual, train_predicted)

        valid_accuracy = accuracy_metric(valid_actual, valid_predicted)
        valid_precision = precision_metric(valid_actual, valid_predicted)
        valid_recall = recall_metric(valid_actual, valid_predicted)
        valid_f1 = f1_score_metric(valid_actual, valid_predicted)

        test_accuracy = accuracy_metric(test_actual, test_predicted)
        test_precision = precision_metric(test_actual, test_predicted)
        test_recall = recall_metric(test_actual, test_predicted)
        test_f1 = f1_score_metric(test_actual, test_predicted)

        scores['accuracy'].append({'train': train_accuracy, 'valid': valid_accuracy, 'test': test_accuracy})
        scores['precision'].append({'train': train_precision, 'valid': valid_precision, 'test': test_precision})
        scores['recall'].append({'train': train_recall, 'valid': valid_recall, 'test': test_recall})
        scores['f1_score'].append({'train': train_f1, 'valid': valid_f1, 'test': test_f1})

    return scores


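# Calculate the Euclidean distance between two vectors.
# The last element of row1 (the class label) is skipped, so row2 may be
# passed with or without its label.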
def euclidean_distance(row1, row2):
	distance = 0.0
	for i in range(len(row1)-1):
		distance += (row1[i] - row2[i])**2
	return sqrt(distance)

# Locate the best matching unit (the codebook vector closest to test_row)
def get_best_matching_unit(codebooks, test_row):
	return min(codebooks, key=lambda codebook: euclidean_distance(codebook, test_row))

# Make a prediction with codebook vectors
def predict(codebooks, test_row):
	bmu = get_best_matching_unit(codebooks, test_row)
	return bmu[-1]

# Create a random codebook vector: each position (including the label)
# is copied from an independently chosen random training row
def random_codebook(train):
	n_records = len(train)
	n_features = len(train[0])
	codebook = [train[randrange(n_records)][i] for i in range(n_features)]
	return codebook

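# Train a set of codebook vectors: pull the best matching unit toward rows of
# the same class and push it away otherwise, with a linearly decaying rate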
def train_codebooks(train, n_codebooks, lrate, epochs):
	codebooks = [random_codebook(train) for _ in range(n_codebooks)]
	for epoch in range(epochs):
		rate = lrate * (1.0-(epoch/float(epochs)))
		for row in train:
			bmu = get_best_matching_unit(codebooks, row)
			for i in range(len(row)-1):
				error = row[i] - bmu[i]
				if bmu[-1] == row[-1]:
					bmu[i] += rate * error
				else:
					bmu[i] -= rate * error
	return codebooks

# Precision averaged over classes, weighted by class support
def precision_metric(actual, predicted):
    return precision_score(actual, predicted, average='weighted')

# Recall averaged over classes, weighted by class support
def recall_metric(actual, predicted):
    return recall_score(actual, predicted, average='weighted')

# F1 score averaged over classes, weighted by class support
def f1_score_metric(actual, predicted):
    return f1_score(actual, predicted, average='weighted')

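# LVQ algorithm: train the codebook vectors and return them as the model.
# `valid` is unused; it is kept so the signature matches the call in evaluate_algorithm.
def learning_vector_quantization(train, valid, n_codebooks, lrate, epochs):
    return train_codebooks(train, n_codebooks, lrate, epochs)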



# load and prepare data
filename = "/content/spotify_dataset.csv"
dataset = load_csv(filename)

for i in range(len(dataset[0])-1):
	str_column_to_float(dataset, i)

# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)

# Evaluate algorithm
n_folds = 5
learn_rate = 0.3
n_epochs = 50
n_codebooks = 15

scores = evaluate_algorithm(dataset, learning_vector_quantization, n_folds, n_codebooks, learn_rate, n_epochs)

metrics = ['accuracy', 'precision', 'recall', 'f1_score']

for metric in metrics:
    print(f"{metric.capitalize()} scores:")
    mean_train = sum(score['train'] for score in scores[metric]) / n_folds
    mean_valid = sum(score['valid'] for score in scores[metric]) / n_folds
    mean_test = sum(score['test'] for score in scores[metric]) / n_folds
    print(f"Mean Train: {mean_train:.3f}, Mean Valid: {mean_valid:.3f}, Mean Test: {mean_test:.3f}")

  _warn_prf(average, modifier, msg_start, len(result))

Accuracy scores:
Mean Train: 89.336, Mean Valid: 89.198, Mean Test: 89.474
Precision scores:
Mean Train: 0.806, Mean Valid: 0.796, Mean Test: 0.801
Recall scores:
Mean Train: 0.893, Mean Valid: 0.892, Mean Test: 0.895
F1_score scores:
Mean Train: 0.843, Mean Valid: 0.841, Mean Test: 0.845
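
When reading the figures above, it can help to compare them with a trivial baseline that always predicts the most frequent class. If the model's accuracy is close to that baseline, and scikit-learn emits undefined-metric warnings such as the `_warn_prf` lines in the output, the model may be predicting only a few classes. The sketch below is a hedged illustration; the function name and the label values are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy (in %) of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return 100.0 * hits / len(eval_labels)

# Hypothetical integer class labels.
train_labels = [4, 4, 4, 3, 4, 5, 4, 4]
eval_labels = [4, 3, 4, 4]
print(majority_baseline_accuracy(train_labels, eval_labels))  # 75.0
```

In the notebook this could be applied to the last column of each fold (`[row[-1] for row in train_set]`), which is the column the evaluation treats as the class label.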


