# Predicting Song Genres Using Spotify Data

## Description

This project builds a machine learning model that predicts the genre of a song from the audio metrics Spotify provides, such as danceability, energy, and tempo. It also uses the Spotify API to retrieve the same metrics for any new track, so the trained model can make predictions on songs outside the dataset.

### Workflow

1. Collect Data:

    Build a dataset of tracks and their audio features from Spotify.

2. Preprocess Data:

    Clean and preprocess the dataset for model training.

3. Train Models:

    Train models using the audio metrics as features and genre as the target.

4. Evaluate Model Performance:

    Evaluate each model using cross-validation and metrics (accuracy, F1-score), then analyze its predictions.

5. Integrate Spotify API:

    Retrieve the audio features of new tracks through the Spotify API.

6. Make Predictions on New Songs:

    Use the trained machine learning model to predict the genre of any new song based on its Spotify audio features.
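
The workflow calls for cross-validation. A minimal sketch of that step, assuming scikit-learn and a synthetic stand-in for the Spotify features (the data, the five folds, and the `f1_macro` scoring are illustrative assumptions, not settings taken from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the audio-feature table: 200 tracks, 13 features, 4 genres.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# Stratified folds keep the genre proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="f1_macro")

print(f"macro F1 per fold: {cv_scores.round(3)}")
print(f"mean macro F1: {cv_scores.mean():.3f}")
```

Averaging over folds gives a steadier estimate than a single 70/30 split, which matters when each genre has only a few dozen tracks.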

## Import Libraries

In [887]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA

from sklearn.metrics import classification_report

from sklearn.model_selection import GridSearchCV
import time
import numpy as np

## Spotify API Setup

In [1174]:
# load_dotenv is provided by the python-dotenv package, not by the unrelated dotenv package.
!pip install python-dotenv

In [1464]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.exceptions import SpotifyException
from dotenv import load_dotenv
import os


# Read client_id and client_secret from a local .env file into the environment.
load_dotenv()
client_id = os.environ.get('client_id')
client_secret = os.environ.get('client_secret')

# Authenticate with Spotify API
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=client_id, client_secret=client_secret))

# Test
result = sp.search(q='breath away', type='track', limit=1)
print(result)

### Retrieve Audio Features

In [1392]:
def get_audio_features(track_id):
    """Return the audio-features dict for one track, or None if nothing is returned."""
    features = sp.audio_features([track_id])
    return features[0] if features else None

track_id = result['tracks']['items'][0]['id']
audio_features = get_audio_features(track_id)
print(audio_features)

{'danceability': 0.694, 'energy': 0.712, 'key': 11, 'loudness': -6.522, 'mode': 0, 'speechiness': 0.0759, 'acousticness': 0.707, 'instrumentalness': 0.0202, 'liveness': 0.263, 'valence': 0.233, 'tempo': 146.015, 'type': 'audio_features', 'id': '1oic0Wedm3XeHxwaxmwO91', 'uri': 'spotify:track:1oic0Wedm3XeHxwaxmwO91', 'track_href': 'https://api.spotify.com/v1/tracks/1oic0Wedm3XeHxwaxmwO91', 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/1oic0Wedm3XeHxwaxmwO91', 'duration_ms': 166849, 'time_signature': 4}
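
The dict above mixes numeric audio metrics with identifiers and URLs, and only the numeric fields are useful as model inputs. The sketch below shows one way to turn such a dict into a single-row frame for prediction. `FEATURE_COLUMNS` and `features_to_row` are hypothetical names, and the column order is an assumption that must match whatever columns the model was trained on.

```python
import pandas as pd

# Hypothetical column order; it must match the columns used for training.
FEATURE_COLUMNS = [
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
    "duration_ms", "time_signature",
]

def features_to_row(audio_features, columns=FEATURE_COLUMNS):
    """Keep only the numeric model inputs from one audio-features dict."""
    missing = [c for c in columns if c not in audio_features]
    if missing:
        raise KeyError(f"audio features missing: {missing}")
    return pd.DataFrame([{c: audio_features[c] for c in columns}], columns=columns)

# The dict printed above, abbreviated to the fields used here plus two identifiers.
example = {
    'danceability': 0.694, 'energy': 0.712, 'key': 11, 'loudness': -6.522, 'mode': 0,
    'speechiness': 0.0759, 'acousticness': 0.707, 'instrumentalness': 0.0202,
    'liveness': 0.263, 'valence': 0.233, 'tempo': 146.015, 'type': 'audio_features',
    'id': '1oic0Wedm3XeHxwaxmwO91', 'duration_ms': 166849, 'time_signature': 4,
}

row = features_to_row(example)
print(row.shape)  # (1, 13)
```

A row built this way would still need the same scaling that was applied to the training data before it is passed to a model.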


## Building a Dataset

In [1395]:
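# Page counter shared by every call. Resetting it here keeps offset = count * 50
# at 950 or below across the 20 genres; the request logged below was rejected at offset=1000.
count = 0
def search_songs_by_genre(genre, limit=10):
    """Return audio features, tagged with `genre`, for one page of search results."""
    global count
    songs_data = []
    results = sp.search(q=f'genre:{genre}', type='track', limit=limit, offset=count*50)
    count += 1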
    
    for track in results['tracks']['items']:
        track_id = track['id']
        audio_features = get_audio_features(track_id)
        if audio_features:
            audio_features['genre'] = genre
            songs_data.append(audio_features)
    
    return songs_data

# List of 20 genres
genres = [
    'pop', 'rock', 'jazz', 'classical', 'hip-hop', 'metal', 'reggae', 'blues',
    'country', 'edm', 'latin', 'soul', 'punk', 'folk', 'funk', 'indie', 'disco',
    'r&b', 'gospel', 'alternative'
]

all_songs_data = []

for genre in genres:
    print(f"Collecting songs for genre: {genre}")
    genre_songs = search_songs_by_genre(genre, limit=25)  
    all_songs_data.extend(genre_songs)
    time.sleep(15)

df = pd.DataFrame(all_songs_data)

print(df.shape)
df.info()

HTTP Error for GET to https://api.spotify.com/v1/search with Params: {'q': 'genre:pop', 'limit': 25, 'offset': 1000, 'type': 'track', 'market': None} returned 400 due to Bad request.


Collecting songs for genre: pop


SpotifyException: http status: 400, code:-1 - https://api.spotify.com/v1/search?q=genre%3Apop&limit=25&offset=1000&type=track:
 Bad request., reason: None
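
The request above was rejected once the offset reached 1000. The sketch below is one way to generate page offsets that stay under such a ceiling. `page_offsets` is a hypothetical helper, and `max_offset=1000` is an assumption inferred from this error rather than a limit confirmed here.

```python
def page_offsets(limit, pages, max_offset=1000):
    """Yield search offsets for up to `pages` pages.

    Stops as soon as a page would extend past `max_offset`, which is an
    assumed ceiling based on the rejected request shown above.
    """
    for page in range(pages):
        offset = page * limit
        if offset + limit > max_offset:
            break
        yield offset

# Ask for 50 pages of 25 tracks; the generator stops before the ceiling.
offsets = list(page_offsets(limit=25, pages=50))
print(len(offsets), offsets[:3], offsets[-1])
```

Passing each offset explicitly, instead of relying on a counter that keeps growing between runs, makes it easier to see which page a request is asking for.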

In [1397]:
df['genre'].value_counts()

genre
14    25
18    25
8     25
16    25
4     25
10    25
7     25
6     25
15    25
19    25
12    25
5     25
3     25
1     25
17    25
13    25
9     25
2     25
11    25
0     25
Name: count, dtype: int64

In [1399]:
genres = [
    'pop', 'rock', 'jazz', 'classical', 'hip-hop', 'metal', 'reggae', 'blues',
    'country', 'edm', 'latin', 'soul', 'punk', 'folk', 'funk', 'indie', 'disco',
    'r&b', 'gospel', 'alternative'
]

In [1401]:
# Append the new API results to the dataset file. A frame that was read back from
# the CSV may carry a stray 'Unnamed: 0' index column, so drop it when present.
(df.query("genre in @genres")
   .drop(columns='Unnamed: 0', errors='ignore')
   .reset_index(drop=True)
   .to_csv('clean_spotify_set.csv', mode='a', header=False, index=True))
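
Appending with `header=False` only works once the file already has a header row. A hedged sketch of a variant that writes the header when the file is first created follows. `append_tracks` and the demo file name are hypothetical, and unlike the cell above this version leaves the index column out of the file.

```python
import os
import tempfile

import pandas as pd

def append_tracks(new_rows, path):
    """Append rows to a CSV, writing the header only when the file is created."""
    file_exists = os.path.exists(path)
    new_rows.to_csv(path, mode="a", header=not file_exists, index=False)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo_spotify_set.csv")  # hypothetical file name
    batch_1 = pd.DataFrame({"id": ["a", "b"], "tempo": [120.0, 98.5], "genre": ["pop", "rock"]})
    batch_2 = pd.DataFrame({"id": ["c"], "tempo": [140.2], "genre": ["jazz"]})

    append_tracks(batch_1, csv_path)  # creates the file and writes the header
    append_tracks(batch_2, csv_path)  # appends rows only
    combined = pd.read_csv(csv_path)

print(combined)
```

Leaving the index out of the file avoids the `Unnamed: 0` column that appears when a saved index is read back as data.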

In [1403]:
# I read the Spotify song dataset I've collected.
df = pd.read_csv('clean_spotify_set.csv', index_col=0, header='infer').reset_index(drop=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      500 non-null    float64
 1   energy            500 non-null    float64
 2   key               500 non-null    float64
 3   loudness          500 non-null    float64
 4   mode              500 non-null    float64
 5   speechiness       500 non-null    float64
 6   acousticness      500 non-null    float64
 7   instrumentalness  500 non-null    float64
 8   liveness          500 non-null    float64
 9   valence           500 non-null    float64
 10  tempo             500 non-null    float64
 11  type              500 non-null    object 
 12  id                500 non-null    object 
 13  uri               500 non-null    object 
 14  track_href        500 non-null    object 
 15  analysis_url      500 non-null    object 
 16  duration_ms       500 non-null    int64  
 1

In [1405]:
# I check if repeated API calls added duplicate tracks in a temporary dataframe.
print(df.duplicated().sum())
df1 = df.drop_duplicates().reset_index(drop=True)
df1.info()

0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      500 non-null    float64
 1   energy            500 non-null    float64
 2   key               500 non-null    float64
 3   loudness          500 non-null    float64
 4   mode              500 non-null    float64
 5   speechiness       500 non-null    float64
 6   acousticness      500 non-null    float64
 7   instrumentalness  500 non-null    float64
 8   liveness          500 non-null    float64
 9   valence           500 non-null    float64
 10  tempo             500 non-null    float64
 11  type              500 non-null    object 
 12  id                500 non-null    object 
 13  uri               500 non-null    object 
 14  track_href        500 non-null    object 
 15  analysis_url      500 non-null    object 
 16  duration_ms       500 non-null    int64  
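
`df.duplicated()` compares whole rows, so a track returned by two different genre searches would not be counted because its `genre` value differs. A small sketch of checking on the track `id` instead, using made-up rows:

```python
import pandas as pd

tracks = pd.DataFrame({
    "id": ["t1", "t2", "t1", "t3"],
    "tempo": [120.0, 98.5, 120.0, 140.2],
    "genre": ["pop", "rock", "indie", "jazz"],  # t1 was returned for two genre queries
})

whole_row_dupes = tracks.duplicated().sum()       # 0: the genre column differs
id_dupes = tracks.duplicated(subset="id").sum()   # 1: t1 appears twice

# Keep the first occurrence of each track id.
deduped = tracks.drop_duplicates(subset="id", keep="first").reset_index(drop=True)
print(whole_row_dupes, id_dupes, len(deduped))
```

Whether to keep the first genre or drop such tracks entirely is a labeling decision; the sketch keeps the first only for illustration.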


In [1407]:
df = df1.copy()

In [1409]:
df.head(10)

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,genre
0,0.7,0.582,11.0,-5.96,0.0,0.0356,0.0502,0.0,0.0881,0.785,116.712,audio_features,0WbMK4wrZ1wFSty9F7FCgu,spotify:track:0WbMK4wrZ1wFSty9F7FCgu,https://api.spotify.com/v1/tracks/0WbMK4wrZ1wF...,https://api.spotify.com/v1/audio-analysis/0WbM...,218424,4,pop
1,0.747,0.507,2.0,-10.171,1.0,0.0358,0.2,0.0608,0.117,0.438,104.978,audio_features,6dOtVTDdiauQNBQEDOtlAB,spotify:track:6dOtVTDdiauQNBQEDOtlAB,https://api.spotify.com/v1/tracks/6dOtVTDdiauQ...,https://api.spotify.com/v1/audio-analysis/6dOt...,210373,4,pop
2,0.521,0.592,6.0,-7.777,0.0,0.0304,0.308,0.0,0.122,0.535,157.969,audio_features,2plbrEY59IikOBgBGLjaoe,spotify:track:2plbrEY59IikOBgBGLjaoe,https://api.spotify.com/v1/tracks/2plbrEY59Iik...,https://api.spotify.com/v1/audio-analysis/2plb...,251668,3,pop
3,0.674,0.907,3.0,-4.086,1.0,0.064,0.101,0.0,0.297,0.721,112.964,audio_features,5G2f63n7IPVPPjfNIGih7Q,spotify:track:5G2f63n7IPVPPjfNIGih7Q,https://api.spotify.com/v1/tracks/5G2f63n7IPVP...,https://api.spotify.com/v1/audio-analysis/5G2f...,157280,4,pop
4,0.669,0.586,9.0,-6.073,1.0,0.054,0.274,0.0,0.104,0.579,107.071,audio_features,5N3hjp1WNayUPZrA8kJmJP,spotify:track:5N3hjp1WNayUPZrA8kJmJP,https://api.spotify.com/v1/tracks/5N3hjp1WNayU...,https://api.spotify.com/v1/audio-analysis/5N3h...,186365,4,pop
5,0.701,0.76,0.0,-5.478,1.0,0.0285,0.107,6.5e-05,0.185,0.69,103.969,audio_features,2qSkIjg1o9h3YT9RAgYN75,spotify:track:2qSkIjg1o9h3YT9RAgYN75,https://api.spotify.com/v1/tracks/2qSkIjg1o9h3...,https://api.spotify.com/v1/audio-analysis/2qSk...,175459,4,pop
6,0.742,0.757,6.0,-4.981,1.0,0.0421,0.0187,0.0,0.305,0.957,139.982,audio_features,4xdBrk0nFZaP54vvZj0yx7,spotify:track:4xdBrk0nFZaP54vvZj0yx7,https://api.spotify.com/v1/tracks/4xdBrk0nFZaP...,https://api.spotify.com/v1/audio-analysis/4xdB...,184841,4,pop
7,0.739,0.727,11.0,-5.968,0.0,0.0426,0.0678,0.0,0.104,0.676,94.99,audio_features,1UHS8Rf6h5Ar3CDWRd3wjF,spotify:track:1UHS8Rf6h5Ar3CDWRd3wjF,https://api.spotify.com/v1/tracks/1UHS8Rf6h5Ar...,https://api.spotify.com/v1/audio-analysis/1UHS...,171870,4,pop
8,0.61,0.65,6.0,-6.199,1.0,0.0474,0.399,0.0,0.11,0.507,106.719,audio_features,1k2pQc5i348DCHwbn5KTdc,spotify:track:1k2pQc5i348DCHwbn5KTdc,https://api.spotify.com/v1/tracks/1k2pQc5i348D...,https://api.spotify.com/v1/audio-analysis/1k2p...,258035,4,pop
9,0.638,0.855,7.0,-4.86,1.0,0.0264,0.00757,0.0,0.245,0.731,127.986,audio_features,7221xIgOnuakPdLqT0F3nP,spotify:track:7221xIgOnuakPdLqT0F3nP,https://api.spotify.com/v1/tracks/7221xIgOnuak...,https://api.spotify.com/v1/audio-analysis/7221...,178206,4,pop


In [1411]:
df.sample(10)

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,genre
259,0.911,0.712,1.0,-5.105,0.0,0.0817,0.0901,2.7e-05,0.0933,0.425,92.005,audio_features,6Sq7ltF9Qa7SNFBsV5Cogx,spotify:track:6Sq7ltF9Qa7SNFBsV5Cogx,https://api.spotify.com/v1/tracks/6Sq7ltF9Qa7S...,https://api.spotify.com/v1/audio-analysis/6Sq7...,178567,4,latin
108,0.64,0.552,2.0,-5.679,1.0,0.129,0.0215,0.0,0.119,0.112,144.941,audio_features,28drn6tQo95MRvO0jQEo5C,spotify:track:28drn6tQo95MRvO0jQEo5C,https://api.spotify.com/v1/tracks/28drn6tQo95M...,https://api.spotify.com/v1/audio-analysis/28dr...,228267,4,hip-hop
430,0.559,0.551,5.0,-7.231,1.0,0.132,0.141,0.0,0.11,0.392,143.008,audio_features,4iZ4pt7kvcaH6Yo8UoZ4s2,spotify:track:4iZ4pt7kvcaH6Yo8UoZ4s2,https://api.spotify.com/v1/tracks/4iZ4pt7kvcaH...,https://api.spotify.com/v1/audio-analysis/4iZ4...,201800,4,r&b
34,0.636,0.676,2.0,-3.442,1.0,0.0263,0.0807,0.0,0.0831,0.273,113.98,audio_features,2tznHmp70DxMyr2XhWLOW0,spotify:track:2tznHmp70DxMyr2XhWLOW0,https://api.spotify.com/v1/tracks/2tznHmp70DxM...,https://api.spotify.com/v1/audio-analysis/2tzn...,208760,4,rock
68,0.592,0.355,9.0,-14.051,1.0,0.0352,0.478,0.0,0.0585,0.499,133.032,audio_features,3NfxSdJnVdon1axzloJgba,spotify:track:3NfxSdJnVdon1axzloJgba,https://api.spotify.com/v1/tracks/3NfxSdJnVdon...,https://api.spotify.com/v1/audio-analysis/3Nfx...,216773,4,jazz
284,0.692,0.651,9.0,-8.267,1.0,0.0324,0.292,0.00241,0.105,0.706,97.923,audio_features,0bRXwKfigvpKZUurwqAlEh,spotify:track:0bRXwKfigvpKZUurwqAlEh,https://api.spotify.com/v1/tracks/0bRXwKfigvpK...,https://api.spotify.com/v1/audio-analysis/0bRX...,254560,4,soul
111,0.69,0.521,10.0,-8.492,0.0,0.339,0.324,0.0,0.0534,0.494,100.028,audio_features,68Dni7IE4VyPkTOH9mRWHr,spotify:track:68Dni7IE4VyPkTOH9mRWHr,https://api.spotify.com/v1/tracks/68Dni7IE4VyP...,https://api.spotify.com/v1/audio-analysis/68Dn...,292799,4,hip-hop
27,0.52,0.852,0.0,-5.866,1.0,0.0543,0.00237,5.8e-05,0.0733,0.234,140.267,audio_features,58ge6dfP91o9oXMzq3XkIS,spotify:track:58ge6dfP91o9oXMzq3XkIS,https://api.spotify.com/v1/tracks/58ge6dfP91o9...,https://api.spotify.com/v1/audio-analysis/58ge...,253587,4,rock
55,0.274,0.348,5.0,-8.631,1.0,0.0293,0.547,0.0133,0.334,0.328,87.43,audio_features,4Hhv2vrOTy89HFRcjU3QOx,spotify:track:4Hhv2vrOTy89HFRcjU3QOx,https://api.spotify.com/v1/tracks/4Hhv2vrOTy89...,https://api.spotify.com/v1/audio-analysis/4Hhv...,179693,3,jazz
222,0.597,0.658,7.0,-4.38,1.0,0.044,0.11,0.0,0.131,0.38,134.545,audio_features,3xOi0YhDREKRURFHoNaAOQ,spotify:track:3xOi0YhDREKRURFHoNaAOQ,https://api.spotify.com/v1/tracks/3xOi0YhDREKR...,https://api.spotify.com/v1/audio-analysis/3xOi...,169941,4,country


In [1413]:
df['type'].value_counts(normalize=True)

type
audio_features    1.0
Name: proportion, dtype: float64

In [1415]:
df['genre'].value_counts()

genre
pop            25
rock           25
gospel         25
r&b            25
disco          25
indie          25
funk           25
folk           25
punk           25
soul           25
latin          25
edm            25
country        25
blues          25
reggae         25
metal          25
hip-hop        25
classical      25
jazz           25
alternative    25
Name: count, dtype: int64

In [1417]:
df = pd.read_csv('clean_spotify_set.csv', index_col=0, header='infer').reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      500 non-null    float64
 1   energy            500 non-null    float64
 2   key               500 non-null    float64
 3   loudness          500 non-null    float64
 4   mode              500 non-null    float64
 5   speechiness       500 non-null    float64
 6   acousticness      500 non-null    float64
 7   instrumentalness  500 non-null    float64
 8   liveness          500 non-null    float64
 9   valence           500 non-null    float64
 10  tempo             500 non-null    float64
 11  type              500 non-null    object 
 12  id                500 non-null    object 
 13  uri               500 non-null    object 
 14  track_href        500 non-null    object 
 15  analysis_url      500 non-null    object 
 16  duration_ms       500 non-null    int64  
 1

## Data Preprocessing

In [1420]:
f1_list, auc_roc_list, accuracy_list, model_list = [], [], [], []

In [1422]:
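from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
def record_metrics(y_test, predictions, model, X_test):
    """Append macro F1, accuracy, and one-vs-rest AUC-ROC for one model to the result lists."""
    f1_list.append(f1_score(y_test, predictions, average='macro'))
    accuracy_list.append(accuracy_score(y_test, predictions))
    try:
        auc_roc_list.append(roc_auc_score(y_test, model.predict_proba(X_test), multi_class='ovr'))
    except (AttributeError, ValueError):
        # Models without usable class probabilities (e.g. SVC without probability=True) get NaN.
        auc_roc_list.append(np.nan)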
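
`record_metrics` stores the macro-averaged F1, while some later cells print the weighted F1, so the two numbers for the same model differ. A toy example with made-up labels shows why: macro averaging treats every class equally, whereas weighted averaging scales each class by its support.

```python
from sklearn.metrics import f1_score

# Toy labels: class 0 is common; class 2 is rare and never predicted.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_hat = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]

# Per-class F1 here is 12/13 for class 0, 2/3 for class 1, and 0 for class 2.
macro = f1_score(y_true, y_hat, average="macro", zero_division=0)
weighted = f1_score(y_true, y_hat, average="weighted", zero_division=0)

print(f"macro F1:    {macro:.3f}")
print(f"weighted F1: {weighted:.3f}")
```

The weighted score is higher because the well-predicted class dominates the support, while the macro score is pulled down by the class that is never predicted.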

In [1424]:
def preprocess_data(df):
    df = df.drop(['id', 'uri', 'track_href', 'analysis_url', 'type'], axis=1)
    
    df = df.dropna()
    
    # Label encode the genre column
    label_encoder = LabelEncoder()
    df['genre'] = label_encoder.fit_transform(df['genre'])
    
    
    X = df.drop(['genre'], axis=1)
    y = df['genre']
    
    # Standardize feature values (zero mean, unit variance)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    return X_scaled, y, label_encoder

X, y, label_encoder = preprocess_data(df)
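
`preprocess_data` fits the `StandardScaler` on every row before the train/test split, so statistics from the test rows influence the scaling. A common alternative, sketched here on synthetic data with assumed variable names, is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_raw = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # synthetic stand-in features
y_raw = np.repeat(np.arange(4), 50)                     # 4 balanced classes

# stratify keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42, stratify=y_raw
)

scaler = StandardScaler().fit(X_tr)    # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # test rows reuse the training mean and std

print(np.bincount(y_tr), np.bincount(y_te))
```

Stratifying also evens out the per-genre test support, which varies between 5 and 11 tracks in the reports in this notebook.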


## Train Machine Learning Model


In [1427]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Support Vector Machine (SVM)": SVC(kernel='linear'),  # You can also try 'rbf' kernel
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42)
}

model_performance = {}

for model_name, model in models.items():
    print(f"Training {model_name}...")
    model.fit(X_train, y_train)  
    y_pred = model.predict(X_test)  
    
    # Generate classification report
    print(f"Classification Report for {model_name}:")
    report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
    print(report)
    
    model_performance[model_name] = report
    record_metrics(y_test, y_pred, model, X_test)
    model_list.append(model)

print("Class Names:", label_encoder.classes_)
class_names = label_encoder.classes_

Training Random Forest...
Classification Report for Random Forest:
              precision    recall  f1-score   support

 alternative       0.00      0.00      0.00        10
       blues       0.00      0.00      0.00         7
   classical       0.73      1.00      0.85        11
     country       0.18      0.29      0.22         7
       disco       0.33      0.25      0.29         8
         edm       0.23      0.43      0.30         7
        folk       0.40      0.22      0.29         9
        funk       0.33      0.33      0.33         6
      gospel       0.43      0.38      0.40         8
     hip-hop       0.38      0.60      0.46         5
       indie       0.00      0.00      0.00         6
        jazz       0.40      0.20      0.27        10
       latin       0.20      0.14      0.17         7
       metal       0.11      0.33      0.17         6
         pop       0.12      0.09      0.11        11
        punk       0.25      0.10      0.14        10
         r&b  

In [1428]:
model_list

[RandomForestClassifier(random_state=42),
 SVC(kernel='linear'),
 GradientBoostingClassifier(random_state=42)]
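
For context when reading these reports, it helps to know what a model that ignores the features would score. The sketch below uses scikit-learn's `DummyClassifier` on labels shaped like this dataset (20 genres, 25 tracks each); the placeholder feature matrix is an assumption, since the dummy model never looks at it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y_balanced = np.repeat(np.arange(20), 25)        # 20 genres, 25 tracks each
X_placeholder = np.zeros((len(y_balanced), 1))   # features are ignored by the dummy

# Always predicts a single genre, so it is right for exactly one genre's tracks.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_balanced)
baseline_accuracy = baseline.score(X_placeholder, y_balanced)

print(f"baseline accuracy: {baseline_accuracy:.2f}")  # 0.05
```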

In [1429]:
# Random Forest
model_rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, target_names=class_names))

# Gradient Boosting
model_gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
model_gb.fit(X_train, y_train)
y_pred_gb = model_gb.predict(X_test)
print("Gradient Boosting Classification Report:")
print(classification_report(y_test, y_pred_gb, target_names=class_names))


Random Forest Classification Report:
              precision    recall  f1-score   support

 alternative       0.00      0.00      0.00        10
       blues       0.00      0.00      0.00         7
   classical       0.73      1.00      0.85        11
     country       0.18      0.29      0.22         7
       disco       0.33      0.25      0.29         8
         edm       0.23      0.43      0.30         7
        folk       0.40      0.22      0.29         9
        funk       0.33      0.33      0.33         6
      gospel       0.43      0.38      0.40         8
     hip-hop       0.38      0.60      0.46         5
       indie       0.00      0.00      0.00         6
        jazz       0.40      0.20      0.27        10
       latin       0.20      0.14      0.17         7
       metal       0.11      0.33      0.17         6
         pop       0.12      0.09      0.11        11
        punk       0.25      0.10      0.14        10
         r&b       0.17      0.20      0.18 
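
`GridSearchCV` is imported at the top of the notebook but not used in the cells shown. A minimal sketch of how it could tune the random forest follows, run on synthetic data; the parameter grid, the three folds, and the `f1_macro` scoring are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the audio-feature table.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# Illustrative grid; every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1_macro",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```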

In [1362]:
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

model_svm_pca = SVC(kernel='linear')
model_svm_pca.fit(X_train_pca, y_train)
y_pred_svm_pca = model_svm_pca.predict(X_test_pca)

print("SVM with PCA Classification Report:")
print(classification_report(y_test, y_pred_svm_pca, target_names=class_names))


SVM with PCA Classification Report:
              precision    recall  f1-score   support

 alternative       0.23      0.30      0.26        10
       blues       0.00      0.00      0.00         7
   classical       0.82      0.82      0.82        11
     country       0.07      0.14      0.10         7
       disco       0.42      0.62      0.50         8
         edm       0.12      0.14      0.13         7
        folk       0.14      0.11      0.12         9
        funk       0.17      0.33      0.22         6
      gospel       1.00      0.25      0.40         8
     hip-hop       0.33      0.60      0.43         5
       indie       0.00      0.00      0.00         6
        jazz       0.00      0.00      0.00        10
       latin       0.00      0.00      0.00         7
       metal       0.17      0.33      0.22         6
         pop       0.67      0.18      0.29        11
        punk       0.50      0.10      0.17        10
         r&b       0.00      0.00      0.00  
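
The cell above fixes `n_components=10` by hand. One way to choose the number of components is to look at the cumulative explained variance, sketched here on synthetic data; the 95% threshold is an assumption, not a value used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 13 features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(300, 3))
X_synth = latent @ rng.normal(size=(3, 13)) + 0.05 * rng.normal(size=(300, 13))

# Fit all components to inspect how variance accumulates.
pca_full = PCA().fit(X_synth)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# A float n_components keeps the fewest components that reach that share of variance.
pca_95 = PCA(n_components=0.95).fit(X_synth)
print(pca_95.n_components_, cumulative.round(3))
```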

In [1365]:
# Random Forest with class weights
model_rf_weighted = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
model_rf_weighted.fit(X_train, y_train)
y_pred_rf_weighted = model_rf_weighted.predict(X_test)

print("Random Forest with Class Weights Classification Report:")
print(classification_report(y_test, y_pred_rf_weighted, target_names=class_names))


Random Forest with Class Weights Classification Report:
              precision    recall  f1-score   support

 alternative       0.00      0.00      0.00        10
       blues       0.00      0.00      0.00         7
   classical       0.79      1.00      0.88        11
     country       0.27      0.43      0.33         7
       disco       0.44      0.50      0.47         8
         edm       0.22      0.29      0.25         7
        folk       0.25      0.11      0.15         9
        funk       0.33      0.17      0.22         6
      gospel       0.50      0.50      0.50         8
     hip-hop       0.38      0.60      0.46         5
       indie       0.00      0.00      0.00         6
        jazz       0.25      0.10      0.14        10
       latin       0.00      0.00      0.00         7
       metal       0.07      0.17      0.10         6
         pop       0.17      0.18      0.17        11
        punk       0.20      0.10      0.13        10
         r&b       0.17  
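
`class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`. The full dataset has 25 tracks per genre, so these weights stay close to 1 even after the split, which may explain why the weighted forest behaves much like the unweighted one. The sketch below compares evenly distributed labels with a hypothetical imbalanced set.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def balanced_weights(y):
    """Return the 'balanced' weight of each class, in sorted class order."""
    classes = np.unique(y)
    return compute_class_weight(class_weight="balanced", classes=classes, y=y)

y_even = np.repeat(np.arange(20), 25)      # 25 tracks per genre, as in this dataset
y_skewed = np.array([0] * 80 + [1] * 20)   # hypothetical imbalanced labels

even_weights = balanced_weights(y_even)      # 500 / (20 * 25) = 1.0 for every class
skewed_weights = balanced_weights(y_skewed)  # 100 / (2 * 80) and 100 / (2 * 20)

print(even_weights[:3], skewed_weights)
```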

In [1366]:
!pip install catboost



### CatBoostClassifier

In [1368]:
from catboost import CatBoostClassifier

label_encoder = LabelEncoder()
df['genre'] = label_encoder.fit_transform(df['genre'])

X = df.drop(['genre'], axis=1)
y = df['genre']
categorical_features = ['id', 'uri', 'track_href', 'analysis_url', 'type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

catboost_model = CatBoostClassifier(iterations=500,
                          learning_rate=0.1,
                          depth=6,
                          eval_metric='Accuracy',
                          random_seed=42,
                          verbose=50, 
                          cat_features=categorical_features)

catboost_model.fit(X_train, y_train, eval_set=(X_test, y_test), early_stopping_rounds=100)

predictions = catboost_model.predict(X_test)

print("CatBoostClassifier Classification Report:")
print(classification_report(y_test, predictions, target_names=class_names))
record_metrics(y_test, predictions, catboost_model, X_test)
model_list.append("CatBoostClassifier")

0:	learn: 0.2771429	test: 0.1466667	best: 0.1466667 (0)	total: 57.7ms	remaining: 28.8s
50:	learn: 0.6428571	test: 0.2866667	best: 0.2866667 (20)	total: 1.54s	remaining: 13.5s
100:	learn: 0.8771429	test: 0.2933333	best: 0.3066667 (74)	total: 2.91s	remaining: 11.5s
150:	learn: 0.9514286	test: 0.2866667	best: 0.3066667 (74)	total: 4.25s	remaining: 9.82s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.3066666667
bestIteration = 74

Shrink model to first 75 iterations.
CatBoostClassifier Classification Report:
              precision    recall  f1-score   support

 alternative       0.00      0.00      0.00        10
       blues       0.00      0.00      0.00         7
   classical       0.85      1.00      0.92        11
     country       0.14      0.29      0.19         7
       disco       0.40      0.50      0.44         8
         edm       0.22      0.29      0.25         7
        folk       0.25      0.11      0.15         9
        funk    
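
The CatBoost cell passes `id`, `uri`, `track_href`, `analysis_url`, and `type` as categorical features. Columns that take a different value on every row, or the same value on every row, carry no information that generalizes to new tracks. The sketch below shows a quick pandas check for both cases on made-up rows; the column list is an assumption.

```python
import pandas as pd

demo = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "uri": ["spotify:track:t1", "spotify:track:t2", "spotify:track:t3", "spotify:track:t4"],
    "type": ["audio_features"] * 4,
    "tempo": [120.0, 98.5, 140.2, 101.3],
    "genre": ["pop", "rock", "jazz", "pop"],
})

candidate_cols = ["id", "uri", "type"]  # assumed non-numeric, non-target columns
n_unique = demo[candidate_cols].nunique()

identifier_cols = n_unique[n_unique == len(demo)].index.tolist()  # one value per row
constant_cols = n_unique[n_unique == 1].index.tolist()            # same value everywhere

# Drop both kinds before training.
features = demo.drop(columns=identifier_cols + constant_cols + ["genre"])
print(identifier_cols, constant_cols, list(features.columns))
```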

### LogisticRegression

In [1370]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

X, y, label_encoder = preprocess_data(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lr = LogisticRegression(max_iter=1000, multi_class="multinomial")
lr.fit(X_train, y_train)
lr_predictions = lr.predict(X_test)

print("LogisticRegression Classification Report:")
print(f1_score(y_test, lr_predictions, average='weighted'))
print(classification_report(y_test, lr_predictions, target_names=class_names))
record_metrics(y_test, lr_predictions, lr, X_test)
model_list.append("LogisticRegression")

LogisticRegression Classification Report:
0.2534204896298718
              precision    recall  f1-score   support

 alternative       0.22      0.20      0.21        10
       blues       0.00      0.00      0.00         7
   classical       0.92      1.00      0.96        11
     country       0.11      0.14      0.12         7
       disco       0.36      0.50      0.42         8
         edm       0.38      0.43      0.40         7
        folk       0.29      0.22      0.25         9
        funk       0.25      0.33      0.29         6
      gospel       0.33      0.25      0.29         8
     hip-hop       0.43      0.60      0.50         5
       indie       0.00      0.00      0.00         6
        jazz       0.23      0.30      0.26        10
       latin       0.00      0.00      0.00         7
       metal       0.12      0.33      0.17         6
         pop       1.00      0.09      0.17        11
        punk       0.50      0.20      0.29        10
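
Several genres in the report above have a precision of 0.00 because the model never predicts them. Precision is undefined in that case, and scikit-learn warns about it unless told what value to report. A short sketch with made-up labels, using the `zero_division` argument of `classification_report`:

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 0, 1, 0]   # class 2 ("jazz") is never predicted

report = classification_report(
    y_true, y_hat,
    target_names=["pop", "rock", "jazz"],
    zero_division=0,      # report 0 for undefined precision instead of warning
    output_dict=True,     # return a dict so individual values can be inspected
)
print(report["jazz"])
```

Setting `zero_division=0` leaves the scores as printed above and only silences the warning.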
 



### KNeighborsClassifier

In [1372]:
from sklearn.neighbors import KNeighborsClassifier

X, y, label_encoder = preprocess_data(df)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

kn = KNeighborsClassifier(n_neighbors=5, weights='uniform', 
                          algorithm='auto', leaf_size=10, p=2, 
                          metric='minkowski', metric_params=None, n_jobs=None)
kn.fit(X_train, y_train)
kn_predictions = kn.predict(X_test)

# Print classification report
print("KNeighborsClassifier Classification Report:")
print(f"F1 Score (Weighted): {f1_score(y_test, kn_predictions, average='weighted')}")
print(classification_report(y_test, kn_predictions, target_names=class_names))
record_metrics(y_test, kn_predictions, kn, X_test)
model_list.append("KNeighborsClassifier")


KNeighborsClassifier Classification Report:
F1 Score (Weighted): 0.1822434232434232
              precision    recall  f1-score   support

 alternative       0.00      0.00      0.00        10
       blues       0.00      0.00      0.00         7
   classical       0.85      1.00      0.92        11
     country       0.23      0.43      0.30         7
       disco       0.25      0.25      0.25         8
         edm       0.00      0.00      0.00         7
        folk       0.20      0.11      0.14         9
        funk       0.20      0.17      0.18         6
      gospel       0.75      0.38      0.50         8
     hip-hop       0.40      0.40      0.40         5
       indie       0.00      0.00      0.00         6
        jazz       0.00      0.00      0.00        10
       latin       0.00      0.00      0.00         7
       metal       0.08      0.17      0.11         6
         pop       0.33      0.09      0.14        11
        punk       0.33      0.1



## Conclusions

In [1374]:
model_list

[RandomForestClassifier(random_state=42),
 SVC(kernel='linear'),
 GradientBoostingClassifier(random_state=42),
 'CatBoostClassifier',
 'LogisticRegression',
 'KNeighborsClassifier']

In [1377]:
results_df = pd.DataFrame(columns=['model', 'f1', 'AUC_ROC', 'Accuracy'])

results_df['model'] = model_list
results_df['f1'] = f1_list
results_df['AUC_ROC'] = auc_roc_list
results_df['Accuracy'] = accuracy_list

results_df.head(10)

Unnamed: 0,model,f1,AUC_ROC,Accuracy
0,"(DecisionTreeClassifier(max_features='sqrt', r...",0.223615,0.75388,0.253333
1,SVC(kernel='linear'),0.238958,,0.273333
2,([DecisionTreeRegressor(criterion='friedman_ms...,0.184147,0.709702,0.2
3,CatBoostClassifier,0.279068,0.818404,0.306667
4,LogisticRegression,0.228585,0.794083,0.266667
5,KNeighborsClassifier,0.164958,0.620729,0.18
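
The `model` column above mixes estimator objects, which print as long reprs, with plain strings. A small sketch of a helper that yields a uniform label for either kind follows; `model_label` is a hypothetical name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def model_label(model):
    """Return a short, uniform name for an estimator object or a plain string."""
    return model if isinstance(model, str) else type(model).__name__

mixed = [RandomForestClassifier(random_state=42), SVC(kernel='linear'), 'CatBoostClassifier']
labels = [model_label(m) for m in mixed]
print(labels)  # ['RandomForestClassifier', 'SVC', 'CatBoostClassifier']
```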


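- The CatBoostClassifier scores highest on all three metrics (macro F1 0.279, AUC-ROC 0.818, accuracy 0.307). With 20 genres of 25 tracks each, always guessing one genre would be right about 5% of the time, so every model beats that baseline, but none classifies most tracks correctly.
- The most direct way to improve the models is to add more data to the 'clean_spotify_set.csv' file. The collection code appends each new batch of API results to that file, and the notebook checks for duplicate rows after reloading it.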