# Predicting Song Genres Using Spotify Data

## Description

This project builds a machine learning model that predicts the genre of a song from the audio metrics Spotify provides, such as danceability, energy, and tempo. It also uses the Spotify API to retrieve the same metrics for any new track, so the model can make predictions on songs outside the training set.

### Workflow

1. Collect Data:

    Build a dataset of tracks and their audio features from Spotify.

2. Preprocess Data:

    Clean and preprocess the dataset for model training.

3. Train Models:

    Train models using the audio metrics as features and genre as the target.

4. Evaluate Model Performance:

    Evaluate each model using cross-validation and metrics (accuracy, F1-score), and analyze its predictions.

5. Integrate Spotify API:

    Retrieve the audio features of a new track through the Spotify API.

6. Make Predictions on New Songs:

    Use the trained machine learning model to predict the genre of any new song based on its Spotify audio features.
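
Steps 5 and 6 can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the notebook's trained model: `FEATURE_COLUMNS`, `predict_genre`, and all feature values are assumptions made for the example, and the `new_track` dictionary stands in for what the Spotify API would return for a track.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical subset of the audio features used as model inputs.
FEATURE_COLUMNS = ["danceability", "energy", "tempo"]

# Synthetic stand-in for the collected dataset (two well-separated genres).
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "danceability": np.r_[rng.uniform(0.6, 0.9, 50), rng.uniform(0.1, 0.4, 50)],
    "energy": np.r_[rng.uniform(0.6, 0.9, 50), rng.uniform(0.1, 0.3, 50)],
    "tempo": np.r_[rng.uniform(110, 130, 50), rng.uniform(60, 90, 50)],
    "genre": ["pop"] * 50 + ["classical"] * 50,
})

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(train["genre"])

# Scaling and the classifier live in one pipeline, so a new track is
# scaled with the statistics learned from the training data.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipeline.fit(train[FEATURE_COLUMNS], y)

def predict_genre(audio_features: dict) -> str:
    """Map one track's audio-feature dictionary to a genre name."""
    row = pd.DataFrame([audio_features])[FEATURE_COLUMNS]
    encoded = pipeline.predict(row)
    return label_encoder.inverse_transform(encoded)[0]

# Stand-in for an API response; extra keys such as 'id' are ignored.
new_track = {"id": "example", "danceability": 0.75, "energy": 0.8, "tempo": 120.0}
print(predict_genre(new_track))
```

Keeping the scaler and the classifier in a single pipeline means the same fitted object can be reused for every new track without repeating the preprocessing by hand.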

## Import Libraries

In [360]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA

from sklearn.metrics import classification_report

from sklearn.model_selection import GridSearchCV
import time
import numpy as np

## Spotify API Setup

In [533]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.exceptions import SpotifyException
from dotenv import load_dotenv
import os


load_dotenv()
client_id = os.environ.get('client_id')
client_secret = os.environ.get('client_secret')

# Authenticate with Spotify API
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=client_id, client_secret=client_secret))

# Test
result = sp.search(q='breath away', type='track', limit=1)
print(result)

{'tracks': {'href': 'https://api.spotify.com/v1/search?query=breath+away&type=track&offset=0&limit=1', 'items': [{'album': {'album_type': 'album', 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/0PCCGZ0wGLizHt2KZ7hhA2'}, 'href': 'https://api.spotify.com/v1/artists/0PCCGZ0wGLizHt2KZ7hhA2', 'id': '0PCCGZ0wGLizHt2KZ7hhA2', 'name': 'Artemas', 'type': 'artist', 'uri': 'spotify:artist:0PCCGZ0wGLizHt2KZ7hhA2'}], 'available_markets': ['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA', 'CL', 'CO', 'CR', 'CY', 'CZ', 'DK', 'DO', 'DE', 'EC', 'EE', 'SV', 'FI', 'FR', 'GR', 'GT', 'HN', 'HK', 'HU', 'IS', 'IE', 'IT', 'LV', 'LT', 'LU', 'MY', 'MT', 'MX', 'NL', 'NZ', 'NI', 'NO', 'PA', 'PY', 'PE', 'PH', 'PL', 'PT', 'SG', 'SK', 'ES', 'SE', 'CH', 'TW', 'TR', 'UY', 'US', 'GB', 'AD', 'LI', 'MC', 'ID', 'JP', 'TH', 'VN', 'RO', 'IL', 'ZA', 'SA', 'AE', 'BH', 'QA', 'OM', 'KW', 'EG', 'MA', 'DZ', 'TN', 'LB', 'JO', 'PS', 'IN', 'KZ', 'MD', 'UA', 'AL', 'BA', 'HR', 'ME', 'MK', 'RS', 'SI', 'KR', 'B

### Retrieve Audio Features

In [536]:
def get_audio_features(track_id):
    # Get the audio features for a single track.
    # The API returns a list with one entry per requested id.
    features = sp.audio_features([track_id])
    return features[0]

track_id = result['tracks']['items'][0]['id']
audio_features = get_audio_features(track_id)
print(audio_features)
count = 1  # page counter used to offset later search requests

Max Retries reached


SpotifyException: http status: 429, code:-1 - /v1/audio-features/?ids=1oic0Wedm3XeHxwaxmwO91:
 Max Retries, reason: too many 429 error responses
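
The 429 status above means the API rate limit was exceeded. One common way to handle this is to wait and retry with an increasing delay. The sketch below is generic and does not call Spotify: `call_with_backoff`, `RateLimitError`, and `flaky` are hypothetical names for illustration. In the notebook, the wrapped call would be `get_audio_features` and the caught exception would be `SpotifyException`.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by an API client."""

def call_with_backoff(fn, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on a rate-limit error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)

# Demonstration with a function that fails twice, then succeeds.
calls = {"n": 0}

def flaky(track_id):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimitError("429")
    return {"id": track_id, "tempo": 120.0}

delays = []
# Record the delays instead of actually sleeping.
features = call_with_backoff(flaky, "abc", sleep=delays.append)
print(features, delays)
```

The `sleep` argument is injectable only so the example runs instantly; in real use the default `time.sleep` applies.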

## Building a Dataset

In [539]:
def search_songs_by_genre(genre, limit=10):
    songs_data = []
    results = sp.search(q=f'genre:{genre}', type='track', limit=limit, offset=count*50)
    count += 1
    
    for track in results['tracks']['items']:
        track_id = track['id']
        audio_features = get_audio_features(track_id)
        if audio_features:
            audio_features['genre'] = genre
            songs_data.append(audio_features)
    
    return songs_data

# List of 20 genres
genres = [
    'pop', 'rock', 'jazz', 'classical', 'hip-hop', 'metal', 'reggae', 'blues',
    'country', 'edm', 'latin', 'soul', 'punk', 'folk', 'funk', 'indie', 'disco',
    'r&b', 'gospel', 'alternative'
]

all_songs_data = []

for genre in genres:
    print(f"Collecting songs for genre: {genre}")
    genre_songs = search_songs_by_genre(genre, limit=25)  
    all_songs_data.extend(genre_songs)
    time.sleep(15)

df = pd.DataFrame(all_songs_data)

print(df.shape)
df.info()

Collecting songs for genre: pop


UnboundLocalError: cannot access local variable 'count' where it is not associated with a value
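
The `UnboundLocalError` above is raised because `count += 1` inside `search_songs_by_genre` makes `count` a local variable of the function, so the earlier read in `count*50` refers to a local that has not been assigned yet. The module-level `count = 1` is never consulted. One possible fix, sketched below, is to pass the page number in as an argument instead of mutating a global. `search_tracks_page` is a hypothetical helper, and `fake_search` is a stand-in for `sp.search` so the example runs without network access.

```python
def search_tracks_page(search, genre, limit=10, page=0):
    """Return the track ids on one page of a genre search.

    `search` is any callable with the keyword interface used in the notebook
    (q, type, limit, offset); the offset is derived from the explicit `page`.
    """
    results = search(q=f"genre:{genre}", type="track", limit=limit, offset=page * limit)
    return [track["id"] for track in results["tracks"]["items"]]

def fake_search(q, type, limit, offset):
    """Hypothetical stand-in for sp.search that fabricates sequential ids."""
    items = [{"id": f"{q}-{i}"} for i in range(offset, offset + limit)]
    return {"tracks": {"items": items}}

first = search_tracks_page(fake_search, "pop", limit=3, page=0)
second = search_tracks_page(fake_search, "pop", limit=3, page=1)
print(first, second)
```

Here the offset is `page * limit`, so consecutive pages are contiguous. The notebook's `count*50` with `limit=25` would skip 25 results between pages, which may or may not be intended.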

In [None]:
df['genre'].value_counts()

In [None]:
genres = [
    'pop', 'rock', 'jazz', 'classical', 'hip-hop', 'metal', 'reggae', 'blues',
    'country', 'edm', 'latin', 'soul', 'punk', 'folk', 'funk', 'indie', 'disco',
    'r&b', 'gospel', 'alternative'
]

In [None]:
# Append the new API results to the dataset file. The stray 'Unnamed: 0' index column
# is dropped when present, so the appended rows match the file's existing columns.
(df.query("genre in @genres")
   .drop(columns='Unnamed: 0', errors='ignore')
   .reset_index(drop=True)
   .to_csv('clean_spotify_set.csv', mode='a', header=False, index=True))

In [493]:
# I read the Spotify song dataset I've collected.
df = pd.read_csv('clean_spotify_set.csv', index_col=0, header='infer').reset_index(drop=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      2500 non-null   float64
 1   energy            2500 non-null   float64
 2   key               2500 non-null   float64
 3   loudness          2500 non-null   float64
 4   mode              2500 non-null   float64
 5   speechiness       2500 non-null   float64
 6   acousticness      2500 non-null   float64
 7   instrumentalness  2500 non-null   float64
 8   liveness          2500 non-null   float64
 9   valence           2500 non-null   float64
 10  tempo             2500 non-null   float64
 11  type              2500 non-null   object 
 12  id                2500 non-null   object 
 13  uri               2500 non-null   object 
 14  track_href        2500 non-null   object 
 15  analysis_url      2500 non-null   object 
 16  duration_ms       2500 non-null   int64  


In [495]:
# I check if repeated API calls added duplicate tracks in a temporary dataframe.
print(df.duplicated().sum())
df1 = df.drop_duplicates().reset_index(drop=True)
df1.info()

1990
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 510 entries, 0 to 509
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      510 non-null    float64
 1   energy            510 non-null    float64
 2   key               510 non-null    float64
 3   loudness          510 non-null    float64
 4   mode              510 non-null    float64
 5   speechiness       510 non-null    float64
 6   acousticness      510 non-null    float64
 7   instrumentalness  510 non-null    float64
 8   liveness          510 non-null    float64
 9   valence           510 non-null    float64
 10  tempo             510 non-null    float64
 11  type              510 non-null    object 
 12  id                510 non-null    object 
 13  uri               510 non-null    object 
 14  track_href        510 non-null    object 
 15  analysis_url      510 non-null    object 
 16  duration_ms       510 non-null    int64
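
`df.duplicated()` only flags rows that match in every column. If the same track can be returned for more than one genre query (an assumption, not something verified here), those rows differ in `genre` and survive `drop_duplicates()`. A small sketch with made-up rows shows how deduplicating on `id` differs:

```python
import pandas as pd

# Made-up rows: track 'a' appears twice identically and once under another genre.
tracks = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "tempo": [120.0, 120.0, 120.0, 95.0],
    "genre": ["pop", "pop", "indie", "rock"],
})

full_row_dedup = tracks.drop_duplicates().reset_index(drop=True)
by_id_dedup = tracks.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

print(len(full_row_dedup))  # rows left when every column must match
print(len(by_id_dedup))     # rows left when only the track id must match
```

Which variant is appropriate depends on whether a track should be allowed to carry more than one genre label in the training set.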

In [505]:
df = df1.copy()

In [507]:
df.head(10)

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,genre
0,0.7,0.582,11.0,-5.96,0.0,0.0356,0.0502,0.0,0.0881,0.785,116.712,audio_features,0WbMK4wrZ1wFSty9F7FCgu,spotify:track:0WbMK4wrZ1wFSty9F7FCgu,https://api.spotify.com/v1/tracks/0WbMK4wrZ1wF...,https://api.spotify.com/v1/audio-analysis/0WbM...,218424,4,pop
1,0.747,0.507,2.0,-10.171,1.0,0.0358,0.2,0.0608,0.117,0.438,104.978,audio_features,6dOtVTDdiauQNBQEDOtlAB,spotify:track:6dOtVTDdiauQNBQEDOtlAB,https://api.spotify.com/v1/tracks/6dOtVTDdiauQ...,https://api.spotify.com/v1/audio-analysis/6dOt...,210373,4,pop
2,0.521,0.592,6.0,-7.777,0.0,0.0304,0.308,0.0,0.122,0.535,157.969,audio_features,2plbrEY59IikOBgBGLjaoe,spotify:track:2plbrEY59IikOBgBGLjaoe,https://api.spotify.com/v1/tracks/2plbrEY59Iik...,https://api.spotify.com/v1/audio-analysis/2plb...,251668,3,pop
3,0.674,0.907,3.0,-4.086,1.0,0.064,0.101,0.0,0.297,0.721,112.964,audio_features,5G2f63n7IPVPPjfNIGih7Q,spotify:track:5G2f63n7IPVPPjfNIGih7Q,https://api.spotify.com/v1/tracks/5G2f63n7IPVP...,https://api.spotify.com/v1/audio-analysis/5G2f...,157280,4,pop
4,0.669,0.586,9.0,-6.073,1.0,0.054,0.274,0.0,0.104,0.579,107.071,audio_features,5N3hjp1WNayUPZrA8kJmJP,spotify:track:5N3hjp1WNayUPZrA8kJmJP,https://api.spotify.com/v1/tracks/5N3hjp1WNayU...,https://api.spotify.com/v1/audio-analysis/5N3h...,186365,4,pop
5,0.701,0.76,0.0,-5.478,1.0,0.0285,0.107,6.5e-05,0.185,0.69,103.969,audio_features,2qSkIjg1o9h3YT9RAgYN75,spotify:track:2qSkIjg1o9h3YT9RAgYN75,https://api.spotify.com/v1/tracks/2qSkIjg1o9h3...,https://api.spotify.com/v1/audio-analysis/2qSk...,175459,4,pop
6,0.742,0.757,6.0,-4.981,1.0,0.0421,0.0187,0.0,0.305,0.957,139.982,audio_features,4xdBrk0nFZaP54vvZj0yx7,spotify:track:4xdBrk0nFZaP54vvZj0yx7,https://api.spotify.com/v1/tracks/4xdBrk0nFZaP...,https://api.spotify.com/v1/audio-analysis/4xdB...,184841,4,pop
7,0.739,0.727,11.0,-5.968,0.0,0.0426,0.0678,0.0,0.104,0.676,94.99,audio_features,1UHS8Rf6h5Ar3CDWRd3wjF,spotify:track:1UHS8Rf6h5Ar3CDWRd3wjF,https://api.spotify.com/v1/tracks/1UHS8Rf6h5Ar...,https://api.spotify.com/v1/audio-analysis/1UHS...,171870,4,pop
8,0.61,0.65,6.0,-6.199,1.0,0.0474,0.399,0.0,0.11,0.507,106.719,audio_features,1k2pQc5i348DCHwbn5KTdc,spotify:track:1k2pQc5i348DCHwbn5KTdc,https://api.spotify.com/v1/tracks/1k2pQc5i348D...,https://api.spotify.com/v1/audio-analysis/1k2p...,258035,4,pop
9,0.638,0.855,7.0,-4.86,1.0,0.0264,0.00757,0.0,0.245,0.731,127.986,audio_features,7221xIgOnuakPdLqT0F3nP,spotify:track:7221xIgOnuakPdLqT0F3nP,https://api.spotify.com/v1/tracks/7221xIgOnuak...,https://api.spotify.com/v1/audio-analysis/7221...,178206,4,pop


In [509]:
df.sample(10)

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,genre
504,0.582,0.962,4.0,-3.037,1.0,0.0542,0.0096,0.128,0.283,0.506,123.971,audio_features,7aOor99o8NNLZYElOXlBG1,spotify:track:7aOor99o8NNLZYElOXlBG1,https://api.spotify.com/v1/tracks/7aOor99o8NNL...,https://api.spotify.com/v1/audio-analysis/7aOo...,180640,4,blues
479,0.466,0.872,7.0,-3.344,1.0,0.0336,0.0156,0.0,0.121,0.806,184.115,audio_features,2PnlsTsOTLE5jnBnNe2K0A,spotify:track:2PnlsTsOTLE5jnBnNe2K0A,https://api.spotify.com/v1/tracks/2PnlsTsOTLE5...,https://api.spotify.com/v1/audio-analysis/2Pnl...,190428,4,alternative
383,0.504,0.308,9.0,-14.958,1.0,0.0321,0.868,0.135,0.158,0.121,113.95,audio_features,3vkCueOmm7xQDoJ17W1Pm3,spotify:track:3vkCueOmm7xQDoJ17W1Pm3,https://api.spotify.com/v1/tracks/3vkCueOmm7xQ...,https://api.spotify.com/v1/audio-analysis/3vkC...,137773,4,indie
335,0.303,0.187,2.0,-16.757,1.0,0.0356,0.989,0.499,0.102,0.212,132.679,audio_features,44A0o4jA8F2ZF03Zacwlwx,spotify:track:44A0o4jA8F2ZF03Zacwlwx,https://api.spotify.com/v1/tracks/44A0o4jA8F2Z...,https://api.spotify.com/v1/audio-analysis/44A0...,160853,4,folk
156,0.812,0.479,2.0,-5.678,0.0,0.333,0.213,1e-06,0.0756,0.559,169.922,audio_features,2UW7JaomAMuX9pZrjVpHAU,spotify:track:2UW7JaomAMuX9pZrjVpHAU,https://api.spotify.com/v1/tracks/2UW7JaomAMuX...,https://api.spotify.com/v1/audio-analysis/2UW7...,234353,4,reggae
152,0.682,0.765,1.0,-5.021,0.0,0.0395,0.0268,3.4e-05,0.188,0.567,90.807,audio_features,2hnMS47jN0etwvFPzYk11f,spotify:track:2hnMS47jN0etwvFPzYk11f,https://api.spotify.com/v1/tracks/2hnMS47jN0et...,https://api.spotify.com/v1/audio-analysis/2hnM...,182747,4,reggae
107,0.915,0.453,10.0,-4.589,0.0,0.27,0.0872,0.000163,0.104,0.287,139.943,audio_features,4Na2HfNSr58chvfX69fy36,spotify:track:4Na2HfNSr58chvfX69fy36,https://api.spotify.com/v1/tracks/4Na2HfNSr58c...,https://api.spotify.com/v1/audio-analysis/4Na2...,144000,4,hip-hop
414,0.633,0.357,5.0,-9.366,0.0,0.0264,0.107,0.0,0.133,0.672,104.938,audio_features,2JoZzpdeP2G6Csfdq5aLXP,spotify:track:2JoZzpdeP2G6Csfdq5aLXP,https://api.spotify.com/v1/tracks/2JoZzpdeP2G6...,https://api.spotify.com/v1/audio-analysis/2JoZ...,245200,4,disco
22,0.432,0.583,8.0,-4.682,1.0,0.0687,0.174,0.0,0.0933,0.544,181.489,audio_features,51eSHglvG1RJXtL3qI5trr,spotify:track:51eSHglvG1RJXtL3qI5trr,https://api.spotify.com/v1/tracks/51eSHglvG1RJ...,https://api.spotify.com/v1/audio-analysis/51eS...,161831,4,pop
52,0.401,0.498,4.0,-10.682,1.0,0.0757,0.394,0.0,0.13,0.816,181.701,audio_features,7odHgoLFi3GQ90E9PeraI3,spotify:track:7odHgoLFi3GQ90E9PeraI3,https://api.spotify.com/v1/tracks/7odHgoLFi3GQ...,https://api.spotify.com/v1/audio-analysis/7odH...,149160,4,jazz


In [511]:
df['type'].value_counts(normalize=True)

type
audio_features    1.0
Name: proportion, dtype: float64

In [517]:
df['genre'].value_counts()

genre
pop            25
rock           25
gospel         25
r&b            25
disco          25
indie          25
funk           25
folk           25
punk           25
soul           25
latin          25
edm            25
country        25
blues          25
reggae         25
metal          25
hip-hop        25
classical      25
jazz           25
alternative    25
Name: count, dtype: int64

In [521]:
df = pd.read_csv('clean_spotify_set.csv', index_col=0, header='infer').reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      500 non-null    float64
 1   energy            500 non-null    float64
 2   key               500 non-null    float64
 3   loudness          500 non-null    float64
 4   mode              500 non-null    float64
 5   speechiness       500 non-null    float64
 6   acousticness      500 non-null    float64
 7   instrumentalness  500 non-null    float64
 8   liveness          500 non-null    float64
 9   valence           500 non-null    float64
 10  tempo             500 non-null    float64
 11  type              500 non-null    object 
 12  id                500 non-null    object 
 13  uri               500 non-null    object 
 14  track_href        500 non-null    object 
 15  analysis_url      500 non-null    object 
 16  duration_ms       500 non-null    int64  
 1

## Data Preprocessing

In [391]:
def preprocess_data(df):
    # Drop identifier and metadata columns that carry no audio information
    df = df.drop(['id', 'uri', 'track_href', 'analysis_url', 'type'], axis=1)

    # Drop rows with missing values
    df = df.dropna()

    # Label encode the genre column
    label_encoder = LabelEncoder()
    df['genre'] = label_encoder.fit_transform(df['genre'])

    X = df.drop(['genre'], axis=1)
    y = df['genre']

    # Standardize feature values (zero mean, unit variance)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    return X_scaled, y, label_encoder

X, y, label_encoder = preprocess_data(df)
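
`preprocess_data` fits the `StandardScaler` on the full dataset before the train/test split, so the test rows influence the scaling statistics, and the fitted scaler is not returned for reuse on new songs. A hedged alternative is to split first and fit the scaler on the training portion only. The synthetic arrays below are an assumption standing in for the audio features and genre labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=100.0, scale=20.0, size=(200, 3))  # synthetic features
y_raw = rng.integers(0, 4, size=200)                      # synthetic labels

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # statistics come from training rows only
X_test = scaler.transform(X_test_raw)        # test rows reuse those statistics

print(X_train.mean(axis=0).round(6), X_train.std(axis=0).round(6))
```

The fitted `scaler` can then be kept alongside the model and applied to the audio features of any new track.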


## Train Machine Learning Model


In [396]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Support Vector Machine (SVM)": SVC(kernel='linear'),  # You can also try 'rbf' kernel
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42)
}

model_performance = {}

for model_name, model in models.items():
    print(f"Training {model_name}...")
    model.fit(X_train, y_train)  
    y_pred = model.predict(X_test)  
    
    # Generate classification report
    print(f"Classification Report for {model_name}:")
    report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
    print(report)
    
    model_performance[model_name] = report

print("Class Names:", label_encoder.classes_)
class_names = label_encoder.classes_

Training Random Forest...
Classification Report for Random Forest:
              precision    recall  f1-score   support

 alternative       0.58      0.25      0.35        28
       blues       0.80      0.59      0.68        27
   classical       1.00      0.88      0.94        25
     country       1.00      0.87      0.93        23
       disco       0.86      0.79      0.83        24
         edm       0.97      1.00      0.98        28
        folk       0.76      0.84      0.80        19
        funk       0.68      0.54      0.60        24
      gospel       0.88      1.00      0.93        21
     hip-hop       0.85      0.85      0.85        26
       indie       0.53      0.59      0.56        17
        jazz       0.67      0.87      0.75        23
       latin       0.33      0.30      0.32        20
       metal       0.26      0.53      0.35        17
         pop       0.47      0.80      0.59        10
        punk       0.52      0.57      0.54        23
         r&b  
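
The workflow mentions cross-validation, while the training cell above scores each model on a single 70/30 split. A minimal sketch of stratified k-fold cross-validation with `cross_val_score` follows. The data comes from `make_classification` and is an assumption used only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic multi-class data standing in for the audio features and genres.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=8, n_classes=4, random_state=42
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_demo, y_demo, cv=cv, scoring="f1_weighted",
)
print(f"Weighted F1 per fold: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a single-split score could vary by chance.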

In [398]:
# Random Forest
model_rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, target_names=label_encoder.classes_))

# Gradient Boosting
model_gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
model_gb.fit(X_train, y_train)
y_pred_gb = model_gb.predict(X_test)
print("Gradient Boosting Classification Report:")
print(classification_report(y_test, y_pred_gb, target_names=label_encoder.classes_))


Random Forest Classification Report:
              precision    recall  f1-score   support

 alternative       0.58      0.25      0.35        28
       blues       0.80      0.59      0.68        27
   classical       1.00      0.88      0.94        25
     country       1.00      0.87      0.93        23
       disco       0.86      0.79      0.83        24
         edm       0.97      1.00      0.98        28
        folk       0.76      0.84      0.80        19
        funk       0.68      0.54      0.60        24
      gospel       0.88      1.00      0.93        21
     hip-hop       0.85      0.85      0.85        26
       indie       0.53      0.59      0.56        17
        jazz       0.67      0.87      0.75        23
       latin       0.33      0.30      0.32        20
       metal       0.26      0.53      0.35        17
         pop       0.47      0.80      0.59        10
        punk       0.52      0.57      0.54        23
         r&b       0.91      0.77      0.83 

In [399]:
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

model_svm_pca = SVC(kernel='linear')
model_svm_pca.fit(X_train_pca, y_train)
y_pred_svm_pca = model_svm_pca.predict(X_test_pca)

print("SVM with PCA Classification Report:")
print(classification_report(y_test, y_pred_svm_pca, target_names=label_encoder.classes_))


SVM with PCA Classification Report:
              precision    recall  f1-score   support

 alternative       0.12      0.07      0.09        28
       blues       0.67      0.15      0.24        27
   classical       1.00      1.00      1.00        25
     country       0.33      0.48      0.39        23
       disco       0.56      0.75      0.64        24
         edm       0.50      0.14      0.22        28
        folk       0.12      0.21      0.15        19
        funk       0.26      0.33      0.29        24
      gospel       0.78      0.33      0.47        21
     hip-hop       0.41      0.35      0.38        26
       indie       0.25      0.24      0.24        17
        jazz       0.50      0.52      0.51        23
       latin       0.24      0.40      0.30        20
       metal       0.14      0.41      0.21        17
         pop       0.21      0.60      0.32        10
        punk       0.35      0.30      0.33        23
         r&b       0.14      0.12      0.13  
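
Before fixing `n_components=10`, it can help to check how much variance the components retain. The sketch below uses synthetic data (an assumption) and scikit-learn's `explained_variance_ratio_` attribute. Passing a float such as `0.95` as `n_components` asks PCA to keep the smallest number of components that exceeds that share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
base = rng.normal(size=(300, 5))
# 13 synthetic features: 5 independent signals plus 8 noisy linear mixes of them.
mixed = base @ rng.normal(size=(5, 8)) + 0.1 * rng.normal(size=(300, 8))
X_demo = StandardScaler().fit_transform(np.hstack([base, mixed]))

# Cumulative share of variance retained by the first k components.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative.round(3))

# Let PCA choose the number of components for a 95% variance target.
pca_95 = PCA(n_components=0.95).fit(X_demo)
print(pca_95.n_components_)
```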

In [400]:
# Random Forest with class weights
model_rf_weighted = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
model_rf_weighted.fit(X_train, y_train)
y_pred_rf_weighted = model_rf_weighted.predict(X_test)

print("Random Forest with Class Weights Classification Report:")
print(classification_report(y_test, y_pred_rf_weighted, target_names=label_encoder.classes_))


Random Forest with Class Weights Classification Report:
              precision    recall  f1-score   support

 alternative       0.50      0.29      0.36        28
       blues       0.80      0.59      0.68        27
   classical       1.00      1.00      1.00        25
     country       1.00      1.00      1.00        23
       disco       0.86      0.79      0.83        24
         edm       0.88      1.00      0.93        28
        folk       0.79      0.79      0.79        19
        funk       0.68      0.54      0.60        24
      gospel       1.00      1.00      1.00        21
     hip-hop       0.82      0.88      0.85        26
       indie       0.53      0.53      0.53        17
        jazz       0.67      0.87      0.75        23
       latin       0.33      0.30      0.32        20
       metal       0.29      0.47      0.36        17
         pop       0.73      0.80      0.76        10
        punk       0.52      0.57      0.54        23
         r&b       0.80  

In [401]:
!pip install catboost



### CatBoostClassifier

In [402]:
from catboost import CatBoostClassifier

label_encoder = LabelEncoder()
df['genre'] = label_encoder.fit_transform(df['genre'])

X = df.drop(['genre'], axis=1)
y = df['genre']
categorical_features = ['id', 'uri', 'track_href', 'analysis_url', 'type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = CatBoostClassifier(iterations=500,
                          learning_rate=0.1,
                          depth=6,
                          eval_metric='Accuracy',
                          random_seed=42,
                          verbose=50, 
                          cat_features=categorical_features)

model.fit(X_train, y_train, eval_set=(X_test, y_test), early_stopping_rounds=100)

predictions = model.predict(X_test)

print("CatBoostClassifier Classification Report:")
print(classification_report(y_test, predictions, target_names=label_encoder.classes_))


0:	learn: 0.2800000	test: 0.1977778	best: 0.1977778 (0)	total: 67.6ms	remaining: 33.7s
50:	learn: 0.6761905	test: 0.6888889	best: 0.6888889 (48)	total: 2.46s	remaining: 21.6s
100:	learn: 0.7647619	test: 0.7111111	best: 0.7155556 (70)	total: 4.85s	remaining: 19.2s
150:	learn: 0.8095238	test: 0.7155556	best: 0.7155556 (70)	total: 7.24s	remaining: 16.7s
200:	learn: 0.8514286	test: 0.7222222	best: 0.7266667 (183)	total: 9.79s	remaining: 14.6s
250:	learn: 0.8942857	test: 0.7333333	best: 0.7333333 (237)	total: 12.2s	remaining: 12.1s
300:	learn: 0.9190476	test: 0.7311111	best: 0.7377778 (284)	total: 14.6s	remaining: 9.67s
350:	learn: 0.9495238	test: 0.7400000	best: 0.7466667 (334)	total: 17.2s	remaining: 7.31s
400:	learn: 0.9609524	test: 0.7444444	best: 0.7466667 (334)	total: 19.7s	remaining: 4.85s
450:	learn: 0.9761905	test: 0.7466667	best: 0.7488889 (419)	total: 22.4s	remaining: 2.44s
499:	learn: 0.9923810	test: 0.7444444	best: 0.7488889 (419)	total: 25s	remaining: 0us

bestTest = 0.7488888

### LogisticRegression

In [408]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

X, y, label_encoder = preprocess_data(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lr = LogisticRegression(max_iter=1000, multi_class="multinomial")
lr.fit(X_train, y_train)
lr_predictions = lr.predict(X_test)

print("LogisticRegression Classification Report:")
print(f1_score(y_test, lr_predictions, average='weighted'))
print(classification_report(y_test, lr_predictions, target_names=class_names))


LogisticRegression Classification Report:
0.31576540442865986
              precision    recall  f1-score   support

 alternative       0.17      0.04      0.06        28
       blues       0.00      0.00      0.00        27
   classical       1.00      1.00      1.00        25
     country       0.33      0.35      0.34        23
       disco       0.42      0.62      0.50        24
         edm       0.50      0.29      0.36        28
        folk       0.24      0.32      0.27        19
        funk       0.27      0.29      0.28        24
      gospel       0.62      0.38      0.47        21
     hip-hop       0.43      0.50      0.46        26
       indie       0.33      0.12      0.17        17
        jazz       0.36      0.52      0.43        23
       latin       0.23      0.25      0.24        20
       metal       0.20      0.59      0.30        17
         pop       0.16      0.50      0.24        10
        punk       0.33      0.39      0.36        23


### KNeighborsClassifier

In [410]:
from sklearn.neighbors import KNeighborsClassifier

X, y, label_encoder = preprocess_data(df)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

kn = KNeighborsClassifier(n_neighbors=5, weights='uniform', 
                          algorithm='auto', leaf_size=10, p=2, 
                          metric='minkowski', metric_params=None, n_jobs=None)
kn.fit(X_train, y_train)
kn_predictions = kn.predict(X_test)

# Ensure predictions are a 1D array
kn_predictions = kn_predictions.flatten()  # Flatten if needed

# Print classification report
print("KNeighborsClassifier Classification Report:")
print(f"F1 Score (Weighted): {f1_score(y_test, kn_predictions, average='weighted')}")
print(classification_report(y_test, kn_predictions, target_names=class_names))

KNeighborsClassifier Classification Report:
F1 Score (Weighted): 0.35737792042257854
              precision    recall  f1-score   support

 alternative       0.19      0.18      0.19        28
       blues       0.35      0.26      0.30        27
   classical       1.00      0.96      0.98        25
     country       0.32      0.52      0.40        23
       disco       0.69      0.46      0.55        24
         edm       0.62      0.46      0.53        28
        folk       0.25      0.26      0.26        19
        funk       0.37      0.46      0.41        24
      gospel       0.61      0.52      0.56        21
     hip-hop       0.50      0.50      0.50        26
       indie       0.19      0.18      0.18        17
        jazz       0.30      0.26      0.28        23
       latin       0.28      0.40      0.33        20
       metal       0.10      0.18      0.13        17
         pop       0.06      0.10      0.07        10
        punk       0.70      0.
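
`GridSearchCV` is imported at the top of the notebook but not used. As a sketch of how `n_neighbors` and `weights` could be tuned rather than fixed, the example below searches a small grid on synthetic data. The grid values and the data are assumptions, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=8, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Scaling sits inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [3, 5, 9, 15],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_weighted")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Held-out weighted F1: {search.score(X_te, y_te):.3f}")
```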

## Conclusions