<h3>Using pandas and numpy for data processing, sklearn for the train/test split, and a custom package called learn</h3>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from learn.classifier import NaiveBayes
from learn.decomposition import PCA
from learn.metrics import accuracy
from learn.preprocessing import MinMaxScaler

<br/>
The original data contains 1000 audio tracks across 10 genres (100 tracks per genre) and 91 columns: 90 extracted features and 1 label column, Target, which holds the genre.

In [2]:
df = pd.read_csv('./data/genres.csv')
print(df.shape)

(1000, 91)
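The claim of 100 tracks per genre can be checked with `value_counts`. The snippet below is a minimal sketch on a synthetic stand-in table, since `./data/genres.csv` is not available here; the `feature_i` column names and the random values are assumptions made for illustration, not the real features.

```python
import numpy as np
import pandas as pd

genres = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

# Synthetic stand-in with the same shape as the described table (1000 x 91).
# Column names and values are hypothetical.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.random((1000, 90)),
    columns=[f"feature_{i}" for i in range(1, 91)],
)
demo['Target'] = np.repeat(genres, 100)

# Number of rows per genre; a balanced dataset has the same count everywhere.
counts = demo['Target'].value_counts()
print(demo.shape)
print(counts)
```

On the real data, the same check would be `df['Target'].value_counts()`.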


In [3]:
df['Target'].unique()

array(['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz',
       'metal', 'pop', 'reggae', 'rock'], dtype=object)

In [4]:
df.columns

Index(['chroma_stft_mean_1', 'chroma_stft_mean_2', 'chroma_stft_mean_3',
       'chroma_stft_mean_4', 'chroma_stft_mean_5', 'chroma_stft_mean_6',
       'chroma_stft_mean_7', 'chroma_stft_mean_8', 'chroma_stft_mean_9',
       'chroma_stft_mean_10', 'chroma_stft_mean_11', 'chroma_stft_mean_12',
       'chroma_stft_std_1', 'chroma_stft_std_2', 'chroma_stft_std_3',
       'chroma_stft_std_4', 'chroma_stft_std_5', 'chroma_stft_std_6',
       'chroma_stft_std_7', 'chroma_stft_std_8', 'chroma_stft_std_9',
       'chroma_stft_std_10', 'chroma_stft_std_11', 'chroma_stft_std_12',
       'chroma_cqt_mean_1', 'chroma_cqt_mean_2', 'chroma_cqt_mean_3',
       'chroma_cqt_mean_4', 'chroma_cqt_mean_5', 'chroma_cqt_mean_6',
       'chroma_cqt_mean_7', 'chroma_cqt_mean_8', 'chroma_cqt_mean_9',
       'chroma_cqt_mean_10', 'chroma_cqt_mean_11', 'chroma_cqt_mean_12',
       'chroma_cqt_std_1', 'chroma_cqt_std_2', 'chroma_cqt_std_3',
       'chroma_cqt_std_4', 'chroma_cqt_std_5', 'chroma_cqt_std_6',
     

<br/>
We want to find the best combination of genres, because using all ten genres gives poor accuracy. For each combination of genres, we measure accuracy on a held-out 30% test split. Accuracy is computed with a KNN classifier (note that the cell below instantiates the NaiveBayes class from the learn package and passes it k). Before fitting the model, we normalize the data with MinMaxScaler (implemented in the learn package) and transform it with PCA, so the transformed data contains only 3 columns/features.
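As a rough illustration of the normalization step, min-max scaling maps each feature to the range [0, 1] using the minimum and maximum seen in the training data. The class below is a hypothetical stand-in written for this example; it is not the code of `learn.preprocessing.MinMaxScaler`, whose internals are not shown here.

```python
import numpy as np


class SimpleMinMaxScaler:
    """Hypothetical stand-in illustrating min-max scaling (not the learn class)."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.min_ = X.min(axis=0)
        span = X.max(axis=0) - self.min_
        # Constant columns have zero range; use 1 to avoid division by zero.
        self.span_ = np.where(span == 0, 1.0, span)
        return self

    def transform(self, X):
        # Reuse the statistics learned in fit(); nothing is recomputed here.
        return (np.asarray(X, dtype=float) - self.min_) / self.span_

    def fit_transform(self, X):
        return self.fit(X).transform(X)


train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
test = np.array([[2.0, 50.0]])

scaler = SimpleMinMaxScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)
print(train_scaled)
print(test_scaled)
```

Because the test rows are scaled with the training minimum and range, test values can fall outside [0, 1], as the second feature of `test` does here.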

In [5]:
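def test_accuracies(genres):
    """Print the test accuracy obtained when only the given genres are kept."""
    # Keep only the rows whose label is one of the selected genres.
    df_selected = df[df['Target'].isin(genres)]
    X = df_selected.drop(columns="Target")
    y = df_selected["Target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Fit the scaler on the training split only, then apply it to both splits.
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Reduce the 90 scaled features to 3 principal components.
    pca = PCA(n_components=3)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)

    # k is set to the number of selected genres.
    clf = NaiveBayes(k=len(genres))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc = round(accuracy(y_test, y_pred) * 100, 3)
    print("Accuracy : {0}%,\tgenres : {1}".format(acc, ", ".join(genres)))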
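For comparison, the same scale, reduce, classify sequence can be sketched with scikit-learn's own classes. This is not the `learn` package: `MinMaxScaler`, `PCA`, and `KNeighborsClassifier` below are the scikit-learn implementations, the data is synthetic (`make_blobs`), and `n_neighbors=5` and `stratify=y` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a four-genre subset: 400 rows, 90 features, 4 classes.
X, y = make_blobs(n_samples=400, n_features=90, centers=4, random_state=42)

# stratify=y keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The pipeline fits the scaler and PCA on the training split only,
# then reuses those fitted transforms when predicting on the test split.
model = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=3),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
print("Accuracy : {0}%".format(round(acc * 100, 3)))
```

Wrapping the steps in a pipeline is one way to make sure test data never influences the scaler or the PCA projection.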

<br/>
Experiments testing a hand-picked set of genre combinations

In [6]:
all_genres = np.array([
    'blues', 
    'classical', 
    'country', 
    'disco', 
    'hiphop', 
    'jazz',
    'metal', 
    'pop', 
    'reggae', 
    'rock'
])

In [7]:
genres_combination = [
    list(all_genres[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]),
    list(all_genres[[0, 1, 2, 3, 4, 5, 6, 7, 8]]),
    list(all_genres[[0, 1, 2, 3, 4, 5, 6, 7]]),
    list(all_genres[[0, 1, 2, 3, 4, 5, 6]]),
    list(all_genres[[0, 1, 2, 3, 4, 5]]),
    list(all_genres[[0, 1, 2, 3, 5, 6]]),
    list(all_genres[[1, 2, 4, 5, 6]]),
    list(all_genres[[1, 2, 4, 6]]),
    list(all_genres[[0, 1, 2, 5, 6, 7]]),
    list(all_genres[[3, 4, 5, 7, 8, 9]]),
    list(all_genres[[2, 3, 4, 5, 7, 8, 9]]),
]

for genres in genres_combination:
    test_accuracies(genres)

TypeError: __init__() takes 1 positional argument but 2 were given
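The combinations in the cell above are hand-picked. If an exhaustive search over subsets of a fixed size were wanted instead, `itertools.combinations` could enumerate them. The sketch below only builds the list; the subset size of 4 is an assumption chosen for illustration, and `four_genre_subsets` is a hypothetical name.

```python
from itertools import combinations

all_genres = ['blues', 'classical', 'country', 'disco', 'hiphop',
              'jazz', 'metal', 'pop', 'reggae', 'rock']

# Every subset of exactly four genres, in the order itertools yields them.
four_genre_subsets = [list(c) for c in combinations(all_genres, 4)]

print(len(four_genre_subsets))
print(four_genre_subsets[0])
```

Each element is a plain list of genre names, so it could be passed to a function such as `test_accuracies` in the same way as the hand-picked lists.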

<br/>
The best accuracy is obtained when using only classical, country, hiphop, and metal.

In [None]:
df_final = df[df['Target'].isin(['classical', 'country', 'hiphop', 'metal'])]
df_final.shape

In [None]:
df_final.to_csv('./data/final_genres.csv', index=False)