In this notebook we classify songs into genres (trap, rap, pop, etc.) based on features such as danceability, acousticness, and other metrics that are easy to obtain from Spotify's organized playlists. The datasets come from https://www.kaggle.com/iamsumat/spotify-top-2000s-mega-dataset and https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year?select=top10s.csv
Two datasets are used to increase the amount of available data. Both use the same metrics, which are available from Spotify's playlist organization feature. The metrics are as follows:
1. Genre - the genre of the track.
2. Year - the release year of the recording.
3. Added - the earliest date you added the song to your collection.
4. Beats Per Minute (BPM) - the tempo of the song.
5. Energy - the energy of a song; the higher the value, the more energetic the song.
6. Danceability - the higher the value, the easier it is to dance to the song.
7. Loudness (dB) - the higher the value, the louder the song.
8. Liveness - the higher the value, the more likely the song is a live recording.
9. Valence - the higher the value, the more positive the mood of the song.
10. Length - the duration of the song.
11. Acousticness - the higher the value, the more acoustic the song.
12. Speechiness - the higher the value, the more spoken words the song contains.
13. Popularity - the higher the value, the more popular the song.
14. Duration - the duration of the song.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import numpy as np

In [None]:
df_2000 = pd.read_csv("../input/spotify-top-2000s-mega-dataset/Spotify-2000.csv")
df_top10s = pd.read_csv("../input/top-spotify-songs-from-20102019-by-year/top10s.csv", engine='python') # the engine needs to be changed otherwise UTF-8 error occurs
df_2000.head()

In [None]:
df_top10s.head()

In [None]:
df_2000.info()

In [None]:
df_top10s.info()

In [None]:
len(df_2000["Top Genre"].unique()), len(df_top10s["top genre"].unique())

In [None]:
df_2000["Top Genre"].value_counts(), df_top10s["top genre"].value_counts()

**Data Preparation**

Dropping the columns that are not needed

In [None]:
df_top10s.info()

In [None]:
df_2000.drop(columns = ['Index', 'Title', 'Artist', 'Year'], inplace = True)
df_top10s.drop(columns = ['Unnamed: 0', 'title', 'artist', 'year'], inplace = True)

In [None]:
df_top10s.columns = df_2000.columns # align column names so the two frames can be concatenated
df = pd.concat([df_2000, df_top10s], ignore_index = True) # DataFrame.append was removed in pandas 2.0
df

In [None]:
# Some feature values are stored as strings containing thousands separators;
# strip the commas and convert every feature column to float
attributes = df.columns[1:]
for attribute in attributes:
    cleaned = df[attribute].astype(str).str.replace(',', '', regex = False)
    df[attribute] = pd.to_numeric(cleaned).astype(float)
# check data types using df.dtypes

**Splitting Genres**

Method: every song from a related subgenre is placed in a broader category (e.g. celtic pop and indie pop both go under the larger pop category). The main assumption behind this method is that similar genres differ only minimally from one another. This makes the task a multiclass classification problem.
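
A minimal sketch of this grouping, assuming the broad category is simply the last word of the genre name. The sample genre strings below are illustrative and not taken from the datasets:

```python
import pandas as pd

# Illustrative genre names (assumed for this sketch)
sample = pd.Series(["celtic pop", "indie pop", "dutch hip hop", "rock"])

# Keep only the last whitespace-separated word of each name
broad = sample.str.split().str[-1]
print(broad.tolist())
```

One caveat of a last-word rule: a multi-word category such as "hip hop" is reduced to "hop", so those songs end up under a truncated label.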

In [None]:

# first extract the genre column
# strip surrounding whitespace and convert everything to lowercase
genre = (df["Top Genre"].str.strip()).str.lower()

**Method**

In [None]:
# function to split the genre column: drops the first word of every multi-word genre
def genre_splitter(genre):
    # n is passed by keyword because recent pandas versions reject a positional n;
    # single-word genres yield a one-element list and are left unchanged
    return genre.str.split(" ", n = 1).str[-1]

In [None]:
# loop until the genre cannot be split any further
genre_m1 = genre.copy()
while(max((genre_m1.str.split(" ", n = 1)).str.len()) > 1):
    genre_m1 = genre_splitter(genre_m1)

In [None]:
len(genre_m1.unique())

In [None]:
genre_m1.value_counts()

In [None]:
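unique = genre_m1.unique()
counts = genre_m1.value_counts() # computed once instead of on every iteration
to_remove = []

# genres with fewer than 20 examples are collected in to_remove
for name in unique: # 'name' avoids shadowing the 'genre' Series defined earlier
    if counts[name] < 20:
        to_remove += [name]
len(to_remove)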

In [None]:
df['Top Genre'] = genre_m1
df

In [None]:
df.set_index(["Top Genre"],drop = False, inplace = True)
# drop every row whose genre label is in to_remove
df.drop(index = to_remove, inplace = True)


In [None]:
df["Top Genre"].value_counts()

**Model Building**

The following classifiers are compared:

1. Random Forest
2. Naive Bayes
3. Stochastic Gradient Descent
4. Logistic regression

In [None]:
train_set, test_set = train_test_split(df, test_size = 0.2, random_state = 42)
# training set
X_train = train_set.values[:,1:]
y_train = train_set.values[:,0]

# test set
X_test = test_set.values[:,1:]
y_test = test_set.values[:,0]
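
Because the genre counts are uneven, a plain random split can leave rare genres under-represented in the test set. A hedged sketch of a stratified split using the `stratify` parameter of `train_test_split`, on a toy frame (`toy`, its columns, and its values are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 "pop" rows and 20 "rock" rows (assumed values)
toy = pd.DataFrame({
    "genre": ["pop"] * 80 + ["rock"] * 20,
    "energy": range(100),
})

# stratify keeps the genre proportions the same in both splits
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["genre"]
)
print(test_toy["genre"].value_counts().to_dict())
```

In the notebook itself this would correspond to passing `stratify = df["Top Genre"]`, which changes the split and therefore the reported scores.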

In [None]:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler().fit(X_train)

# Standard Scaler
X_train_ST = standard_scaler.transform(X_train)
X_test_ST = standard_scaler.transform(X_test)
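
Because the scaler above is fit on the whole training set, any later cross-validation on `X_train_ST` uses scaling statistics computed partly from its own validation folds. One way to avoid that, sketched here on synthetic data from `make_classification` (the data and the `_demo` names are assumptions for illustration), is to wrap the scaler and the model in a pipeline so the scaler is refit on each training fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the song features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# The scaler is fit inside each fold, on that fold's training part only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```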

In [None]:
# get all unique classes
unique = np.unique(y_train)

In [None]:
from sklearn.preprocessing import label_binarize
from sklearn.preprocessing import LabelEncoder
# 1 hot encoding
y_test_1hot = label_binarize(y_test, classes = unique)
y_train_1hot = label_binarize(y_train, classes = unique)

# labelling
y_test_label = LabelEncoder()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier

models = []
models += [['Naive Bayes', GaussianNB()]]
models += [['SGD', OneVsOneClassifier(SGDClassifier())]]
models += [['Logistic', LogisticRegression(multi_class = 'ovr')]]
rand_forest = RandomForestClassifier(random_state = 42, min_samples_split = 5)

In [None]:
result_ST =[]
kfold = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)

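# Random Forest is evaluated separately on the one-hot encoded labels, so 'accuracy' here is
# exact-match accuracy across all label columns (RandomForestClassifier also accepts y_train directly)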
RF_cross_val_score = cross_val_score(rand_forest, X_train_ST, y_train_1hot, cv = 10, scoring = 'accuracy')
print('%s: %f (%f)' % ('Random Forest', RF_cross_val_score.mean(), RF_cross_val_score.std()))

for name, model in models:
    cv_score = cross_val_score(model, X_train_ST, y_train, cv = kfold, scoring = 'accuracy')
    result_ST.append(cv_score)
    print('%s: %f (%f)' % (name,cv_score.mean(), cv_score.std()))

In [None]:
from sklearn.metrics import precision_score, recall_score

result_precision_recall = []

# Random Forest is again scored separately on the one-hot encoded labels
y_temp_randforest = cross_val_predict(rand_forest, X_train_ST, y_train_1hot, cv = 10)
rf_precision = precision_score(y_train_1hot, y_temp_randforest, average = "micro")
rf_recall = recall_score(y_train_1hot, y_temp_randforest, average = "micro")
result_precision_recall += [['Random Forest', rf_precision, rf_recall]]

print('%s | Precision: %f, Recall: %f' % ('Random Forest', rf_precision, rf_recall))

for name, model in models:
    y_pred = cross_val_predict(model, X_train_ST, y_train, cv = kfold)
    precision = precision_score(y_train, y_pred, average = "micro")
    recall = recall_score(y_train, y_pred, average = "micro")
    # storing the precision and recall values
    result_precision_recall += [[name , precision, recall]]
    print('%s | Precision: %f, Recall: %f' % (name, precision, recall))

In [None]:
from sklearn.metrics import f1_score

for name, precision, recall in result_precision_recall:
    print("%s: %f" % (name, 2 * (precision * recall) / (precision + recall)))
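
For the models scored on plain class labels, micro-averaged precision and recall both equal ordinary accuracy in a single-label multiclass setting, so the F1 value derived from them matches accuracy as well. A small check on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 4 of the 6 predictions are correct
y_true_demo = ["pop", "pop", "rock", "rap", "rock", "pop"]
y_pred_demo = ["pop", "rock", "rock", "rap", "pop", "pop"]

p = precision_score(y_true_demo, y_pred_demo, average="micro")
r = recall_score(y_true_demo, y_pred_demo, average="micro")
f1 = f1_score(y_true_demo, y_pred_demo, average="micro")
acc = accuracy_score(y_true_demo, y_pred_demo)
print(p, r, f1, acc)
```

A macro average (`average="macro"`) weights every genre equally and can be more informative when class sizes differ.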

**Evaluation**

The model we chose is logistic regression, so now let us evaluate the trained model on the test set.

In [None]:
# training the models
model_method1 = LogisticRegression(multi_class = 'ovr').fit(X_train_ST, y_train)

# getting predictions
predictions_method1 = model_method1.predict(X_test_ST)

In [None]:
from sklearn.metrics import confusion_matrix
print(f1_score(y_test, predictions_method1, labels = unique, average = 'micro' ))

After training, the logistic regression classifier reaches a micro-averaged F1 score of about 0.55 (55%) on the test set.
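
A single aggregate score hides which genres are confused with each other. A hedged sketch of a per-class breakdown using `classification_report` and `confusion_matrix` (both imported at the top of the notebook), shown on made-up labels rather than the real predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels and predictions (assumed for this sketch)
labels_demo = ["pop", "rap", "rock"]
y_true_demo = ["pop", "pop", "rock", "rap", "rock", "pop"]
y_pred_demo = ["pop", "rock", "rock", "rap", "pop", "pop"]

# Rows are true genres, columns are predicted genres, in labels_demo order
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels_demo)
print(cm)

# Per-genre precision, recall, and F1
print(classification_report(y_true_demo, y_pred_demo, labels=labels_demo, zero_division=0))
```

With the notebook's own variables, the same calls would take `y_test`, `predictions_method1`, and `labels = unique`.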