# Predicting Song Genres Using Spotify Data

## Description

This project builds a machine learning model that predicts the genre of a song from the audio metrics Spotify provides, such as danceability, energy, and tempo. It also uses the Spotify API to retrieve these metrics for any new track, so the trained model can make predictions on songs outside the training set.

### Workflow

1. Collect Data:

    Build a dataset from Spotify search results and audio features.

2. Preprocess Data:

    Clean and preprocess the dataset for model training.

3. Train Models:

    Train models using the audio metrics as features and genre as the target.

4. Evaluate Model Performance:

    Evaluate each model using cross-validation and metrics (accuracy, F1-score), then analyze its predictions.

5. Integrate Spotify API:

    Retrieve the audio metrics for any new track through the Spotify API.

6. Make Predictions on New Songs:

    Use the trained machine learning model to predict the genre of any new song based on its Spotify audio features.
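
Steps 5 and 6 could look roughly like the sketch below. It assumes the fitted scaler, classifier, and label encoder are all kept after training, and that the feature order matches the training columns. `predict_genre`, `FEATURE_COLUMNS`, and the toy training rows are hypothetical names and made-up values used only for illustration; they are not part of this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assumed feature order; it must match the columns used at training time.
FEATURE_COLUMNS = ["danceability", "energy", "tempo"]

def predict_genre(audio_features, model, scaler, label_encoder, columns=FEATURE_COLUMNS):
    """Map one Spotify audio-features dict to a genre name."""
    row = np.array([[audio_features[c] for c in columns]], dtype=float)
    encoded = model.predict(scaler.transform(row))
    return label_encoder.inverse_transform(encoded)[0]

# Toy training data standing in for the real dataset (values are made up).
X_toy = np.array([[0.9, 0.8, 128.0], [0.8, 0.9, 126.0],
                  [0.2, 0.1, 70.0], [0.3, 0.2, 65.0]])
genres_toy = ["edm", "edm", "classical", "classical"]

toy_encoder = LabelEncoder().fit(genres_toy)
toy_scaler = StandardScaler().fit(X_toy)
toy_model = RandomForestClassifier(n_estimators=20, random_state=0)
toy_model.fit(toy_scaler.transform(X_toy), toy_encoder.transform(genres_toy))

# A dict shaped like the audio-features response; extra keys are ignored.
new_track = {"danceability": 0.85, "energy": 0.9, "tempo": 127.0, "id": "hypothetical"}
predicted = predict_genre(new_track, toy_model, toy_scaler, toy_encoder)
print(predicted)
```

The new track has to be transformed with the same scaler that was fitted on the training data, so the preprocessing step needs to return that scaler as well as the scaled matrix.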

## Import Libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA

from sklearn.metrics import classification_report

from sklearn.model_selection import GridSearchCV




## Spotify API Setup

In [2]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.exceptions import SpotifyException


import os

# Read the credentials from the environment instead of hard-coding them
client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]

# Authenticate with Spotify API
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=client_id, client_secret=client_secret))

# Test
result = sp.search(q='breath away', type='track', limit=1)
print(result)


{'tracks': {'href': 'https://api.spotify.com/v1/search?query=breath+away&type=track&offset=0&limit=1', 'items': [{'album': {'album_type': 'album', 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/0PCCGZ0wGLizHt2KZ7hhA2'}, 'href': 'https://api.spotify.com/v1/artists/0PCCGZ0wGLizHt2KZ7hhA2', 'id': '0PCCGZ0wGLizHt2KZ7hhA2', 'name': 'Artemas', 'type': 'artist', 'uri': 'spotify:artist:0PCCGZ0wGLizHt2KZ7hhA2'}], 'available_markets': ['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA', 'CL', 'CO', 'CR', 'CY', 'CZ', 'DK', 'DO', 'DE', 'EC', 'EE', 'SV', 'FI', 'FR', 'GR', 'GT', 'HN', 'HK', 'HU', 'IS', 'IE', 'IT', 'LV', 'LT', 'LU', 'MY', 'MT', 'MX', 'NL', 'NZ', 'NI', 'NO', 'PA', 'PY', 'PE', 'PH', 'PL', 'PT', 'SG', 'SK', 'ES', 'SE', 'CH', 'TW', 'TR', 'UY', 'US', 'GB', 'AD', 'LI', 'MC', 'ID', 'JP', 'TH', 'VN', 'RO', 'IL', 'ZA', 'SA', 'AE', 'BH', 'QA', 'OM', 'KW', 'EG', 'MA', 'DZ', 'TN', 'LB', 'JO', 'PS', 'IN', 'KZ', 'MD', 'UA', 'AL', 'BA', 'HR', 'ME', 'MK', 'RS', 'SI', 'KR', 'B

### Retrieve Audio Features

In [3]:
def get_audio_features(track_id):
    # get audio features for a specific track
    features = sp.audio_features([track_id])
    return features[0] 

track_id = result['tracks']['items'][0]['id']
audio_features = get_audio_features(track_id)
print(audio_features)


{'danceability': 0.694, 'energy': 0.712, 'key': 11, 'loudness': -6.522, 'mode': 0, 'speechiness': 0.0759, 'acousticness': 0.707, 'instrumentalness': 0.0202, 'liveness': 0.263, 'valence': 0.233, 'tempo': 146.015, 'type': 'audio_features', 'id': '1oic0Wedm3XeHxwaxmwO91', 'uri': 'spotify:track:1oic0Wedm3XeHxwaxmwO91', 'track_href': 'https://api.spotify.com/v1/tracks/1oic0Wedm3XeHxwaxmwO91', 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/1oic0Wedm3XeHxwaxmwO91', 'duration_ms': 166849, 'time_signature': 4}
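
The response mixes numeric audio metrics with identifiers and URLs. A minimal sketch of separating the two is shown below, using the response printed above as input. `METADATA_KEYS` and `numeric_features` are hypothetical helper names, and the key list mirrors the columns the preprocessing step drops later.

```python
# Keys that identify the track rather than describe its sound.
METADATA_KEYS = {"id", "uri", "track_href", "analysis_url", "type"}

def numeric_features(audio_features):
    """Return only the numeric metrics from an audio-features dict."""
    return {k: v for k, v in audio_features.items() if k not in METADATA_KEYS}

sample = {
    "danceability": 0.694, "energy": 0.712, "key": 11, "loudness": -6.522,
    "mode": 0, "speechiness": 0.0759, "acousticness": 0.707,
    "instrumentalness": 0.0202, "liveness": 0.263, "valence": 0.233,
    "tempo": 146.015, "type": "audio_features", "id": "1oic0Wedm3XeHxwaxmwO91",
    "uri": "spotify:track:1oic0Wedm3XeHxwaxmwO91",
    "track_href": "https://api.spotify.com/v1/tracks/1oic0Wedm3XeHxwaxmwO91",
    "analysis_url": "https://api.spotify.com/v1/audio-analysis/1oic0Wedm3XeHxwaxmwO91",
    "duration_ms": 166849, "time_signature": 4,
}

features_only = numeric_features(sample)
print(len(features_only), sorted(features_only))
```

For this track, 13 numeric values remain, which is the feature count the models are trained on after the metadata columns are dropped.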


## Building a Dataset

In [5]:
def search_songs_by_genre(genre, limit=10):
    songs_data = []
    results = sp.search(q=f'genre:{genre}', type='track', limit=limit)
    
    for track in results['tracks']['items']:
        track_id = track['id']
        audio_features = get_audio_features(track_id)
        if audio_features:
            audio_features['genre'] = genre
            songs_data.append(audio_features)
    
    return songs_data

# List of 20 genres
genres = [
    'pop', 'rock', 'jazz', 'classical', 'hip-hop', 'metal', 'reggae', 'blues',
    'country', 'edm', 'latin', 'soul', 'punk', 'folk', 'funk', 'indie', 'disco',
    'r&b', 'gospel', 'alternative'
]

all_songs_data = []

for genre in genres:
    print(f"Collecting songs for genre: {genre}")
    genre_songs = search_songs_by_genre(genre, limit=50)  
    all_songs_data.extend(genre_songs)

df = pd.DataFrame(all_songs_data)

print(df.head())  

Collecting songs for genre: pop
Collecting songs for genre: rock
Collecting songs for genre: jazz
Collecting songs for genre: classical
Collecting songs for genre: hip-hop
Collecting songs for genre: metal
Collecting songs for genre: reggae
Collecting songs for genre: blues
Collecting songs for genre: country
Collecting songs for genre: edm
Collecting songs for genre: latin
Collecting songs for genre: soul
Collecting songs for genre: punk
Collecting songs for genre: folk
Collecting songs for genre: funk
Collecting songs for genre: indie
Collecting songs for genre: disco
Collecting songs for genre: r&b
Collecting songs for genre: gospel
Collecting songs for genre: alternative
   danceability  energy  key  loudness  mode  speechiness  acousticness  \
0         0.700   0.582   11    -5.960     0       0.0356        0.0502   
1         0.747   0.507    2   -10.171     1       0.0358        0.2000   
2         0.521   0.592    6    -7.777     0       0.0304        0.3080   
3         0.674 

In [11]:
df.to_csv('genres.csv', index=False)
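
`search_songs_by_genre` issues one audio-features request per track, which adds up to roughly a thousand calls for 20 genres of 50 songs. The endpoint accepts a list of IDs, so the requests can be grouped. The sketch below assumes a limit of 100 IDs per call, which is what the Spotify documentation states as far as I know. `chunked`, `fetch_features_in_batches`, and `fake_fetch` are hypothetical names; `fake_fetch` stands in for `sp.audio_features` so the example runs offline.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_features_in_batches(track_ids, fetch, size=100):
    """Call `fetch` once per batch and drop tracks with no features (None)."""
    features = []
    for batch in chunked(track_ids, size):
        features.extend(f for f in fetch(batch) if f is not None)
    return features

# Stand-in for sp.audio_features: records batch sizes, returns one dict per ID.
calls = []
def fake_fetch(batch):
    calls.append(len(batch))
    return [{"id": track_id} for track_id in batch]

rows = fetch_features_in_batches([f"track{i}" for i in range(250)], fake_fetch)
print(calls, len(rows))  # [100, 100, 50] 250
```

With the real client, `fetch` would be `sp.audio_features`, and the genre label would be attached to each returned dict afterwards.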

## Data Preprocessing

In [12]:
def preprocess_data(df):
    df = df.drop(['id', 'uri', 'track_href', 'analysis_url', 'type'], axis=1)
    
    df = df.dropna()
    
    # Label encode the genre column
    label_encoder = LabelEncoder()
    df['genre'] = label_encoder.fit_transform(df['genre'])
    
    
    X = df.drop(['genre'], axis=1)
    y = df['genre']
    
    # Normalize  feature values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    return X_scaled, y, label_encoder

X, y, label_encoder = preprocess_data(df)
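
`preprocess_data` fits the `StandardScaler` on every row before the train/test split, so the test rows influence the scaling statistics. One possible alternative, sketched below on synthetic data, splits first and fits the scaler on the training rows only. `split_then_scale` is a hypothetical helper, and the `stratify=y` argument is an added assumption that keeps the genre proportions equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_then_scale(X, y, test_size=0.3, random_state=42):
    """Split first, then fit the scaler on the training rows only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    scaler = StandardScaler().fit(X_train)  # statistics come from train only
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test, scaler

# Synthetic stand-in for the audio-feature matrix (values are made up).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = np.repeat([0, 1], 50)

X_tr, X_te, y_tr, y_te, fitted_scaler = split_then_scale(X_demo, y_demo)
print(X_tr.shape, X_te.shape, X_tr.mean(axis=0).round(6))
```

Returning the fitted scaler also makes it available later for scaling the features of a new track.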


## Train Machine Learning Model


In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Support Vector Machine (SVM)": SVC(kernel='linear'),  # You can also try 'rbf' kernel
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42)
}

model_performance = {}

for model_name, model in models.items():
    print(f"Training {model_name}...")
    model.fit(X_train, y_train)  
    y_pred = model.predict(X_test)  
    
    # Generate classification report
    print(f"Classification Report for {model_name}:")
    report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
    print(report)
    
    model_performance[model_name] = report


Training Random Forest...
Classification Report for Random Forest:
              precision    recall  f1-score   support

 alternative       0.00      0.00      0.00        15
       blues       0.00      0.00      0.00        16
   classical       0.64      0.70      0.67        10
     country       0.11      0.27      0.16        11
       disco       0.54      0.44      0.48        16
         edm       0.12      0.15      0.14        13
        folk       0.18      0.12      0.14        17
        funk       0.15      0.25      0.19        12
      gospel       0.65      0.55      0.59        20
     hip-hop       0.25      0.18      0.21        17
       indie       0.00      0.00      0.00        11
        jazz       0.13      0.20      0.16        10
       latin       0.16      0.16      0.16        19
       metal       0.09      0.05      0.06        20
         pop       0.25      0.20      0.22        15
        punk       0.25      0.38      0.30        13
         r&b  
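
With 20 genres and about 50 tracks each, a single 70/30 split leaves only 10 to 20 test tracks per genre, so per-class scores are noisy. For reference, guessing uniformly among 20 roughly balanced genres would give about 5% accuracy. The workflow mentions cross-validation; a minimal sketch of it follows, run on synthetic data from `make_classification` rather than the Spotify dataset. The fold count and the macro-F1 scoring choice are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 13 features like the audio metrics, 4 made-up classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, n_classes=4, random_state=42
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# Macro F1 averages the per-class F1 scores with equal weight.
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1_macro")
print(f"macro F1: {scores.mean():.2f} +/- {scores.std():.2f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by the feature matrix and the encoded genre labels.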

In [14]:
# Random Forest
model_rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, target_names=label_encoder.classes_))

# Gradient Boosting
model_gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
model_gb.fit(X_train, y_train)
y_pred_gb = model_gb.predict(X_test)
print("Gradient Boosting Classification Report:")
print(classification_report(y_test, y_pred_gb, target_names=label_encoder.classes_))


Random Forest Classification Report:
              precision    recall  f1-score   support

 alternative       0.00      0.00      0.00        15
       blues       0.00      0.00      0.00        16
   classical       0.64      0.70      0.67        10
     country       0.11      0.27      0.16        11
       disco       0.54      0.44      0.48        16
         edm       0.12      0.15      0.14        13
        folk       0.18      0.12      0.14        17
        funk       0.15      0.25      0.19        12
      gospel       0.65      0.55      0.59        20
     hip-hop       0.25      0.18      0.21        17
       indie       0.00      0.00      0.00        11
        jazz       0.13      0.20      0.16        10
       latin       0.16      0.16      0.16        19
       metal       0.09      0.05      0.06        20
         pop       0.25      0.20      0.22        15
        punk       0.25      0.38      0.30        13
         r&b       0.11      0.08      0.09 

In [15]:
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

model_svm_pca = SVC(kernel='linear')
model_svm_pca.fit(X_train_pca, y_train)
y_pred_svm_pca = model_svm_pca.predict(X_test_pca)

print("SVM with PCA Classification Report:")
print(classification_report(y_test, y_pred_svm_pca, target_names=label_encoder.classes_))


SVM with PCA Classification Report:
              precision    recall  f1-score   support

 alternative       0.18      0.20      0.19        15
       blues       0.00      0.00      0.00        16
   classical       0.75      0.60      0.67        10
     country       0.14      0.45      0.21        11
       disco       0.33      0.25      0.29        16
         edm       0.11      0.15      0.12        13
        folk       0.40      0.24      0.30        17
        funk       0.11      0.33      0.17        12
      gospel       0.89      0.40      0.55        20
     hip-hop       0.24      0.24      0.24        17
       indie       0.00      0.00      0.00        11
        jazz       0.11      0.20      0.14        10
       latin       0.14      0.11      0.12        19
       metal       0.20      0.10      0.13        20
         pop       0.25      0.27      0.26        15
        punk       0.22      0.46      0.30        13
         r&b       0.33      0.15      0.21  
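
The cell above fixes `n_components=10` without checking how much variance those components keep. One way to inspect this, sketched on synthetic data, is to pass a float to `PCA` so it keeps just enough components to reach a target share of the variance. The 95% target and the synthetic matrix are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic data: 13 columns built from 4 latent factors plus a little noise.
latent = rng.normal(size=(200, 4))
X_demo = latent @ rng.normal(size=(4, 13)) + 0.1 * rng.normal(size=(200, 13))
X_demo = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) keeps the fewest components whose cumulative
# explained variance reaches that fraction.
pca_demo = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_demo.fit_transform(X_demo)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(pca_demo.n_components_, cumulative.round(3))
```

Running the same check on the scaled audio features would show whether 10 components is a reasonable choice or whether fewer would do.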

In [16]:
# Random Forest with class weights
model_rf_weighted = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
model_rf_weighted.fit(X_train, y_train)
y_pred_rf_weighted = model_rf_weighted.predict(X_test)

print("Random Forest with Class Weights Classification Report:")
print(classification_report(y_test, y_pred_rf_weighted, target_names=label_encoder.classes_))


Random Forest with Class Weights Classification Report:
              precision    recall  f1-score   support

 alternative       0.00      0.00      0.00        15
       blues       0.00      0.00      0.00        16
   classical       0.75      0.90      0.82        10
     country       0.08      0.18      0.11        11
       disco       0.58      0.44      0.50        16
         edm       0.21      0.38      0.27        13
        folk       0.33      0.18      0.23        17
        funk       0.27      0.33      0.30        12
      gospel       0.69      0.55      0.61        20
     hip-hop       0.33      0.24      0.28        17
       indie       0.00      0.00      0.00        11
        jazz       0.14      0.20      0.17        10
       latin       0.21      0.26      0.23        19
       metal       0.15      0.10      0.12        20
         pop       0.25      0.20      0.22        15
        punk       0.21      0.31      0.25        13
         r&b       0.11  
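
`class_weight='balanced'` weights each class by `n_samples / (n_classes * count)`, so it mainly helps when class sizes differ a lot. The dataset here requests 50 tracks per genre, so the weights should stay close to 1. The sketch below shows the formula on a small, deliberately imbalanced label vector (made-up values) using scikit-learn's `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Three samples of class 0 and one sample of class 1.
y_demo = np.array([0, 0, 0, 1])
classes = np.unique(y_demo)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Class 0: 4 / (2 * 3) = 0.667, class 1: 4 / (2 * 1) = 2.0
weight_by_class = dict(zip(classes.tolist(), weights.round(3).tolist()))
print(weight_by_class)  # {0: 0.667, 1: 2.0}
```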

# Conclusions

In [8]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Spotify API credentials, read from the environment rather than hard-coded
import os

client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]

# Authenticate with Spotify API
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# API rate limits
batch_size = 50  # Adjust batch size based on API limits
max_requests = 1000  # Adjust max requests based on API limits

# Function to retrieve song features in batches
def get_song_features_batch(song_ids):
    features = []
    for i in range(0, len(song_ids), batch_size):
        batch = song_ids[i:i+batch_size]
        batch_features = sp.audio_features(batch)
        features.extend(batch_features)
    return features

# Function to retrieve song genres in batches
def get_song_genres_batch(artist_ids):
    genres = []
    for i in range(0, len(artist_ids), batch_size):
        batch = artist_ids[i:i+batch_size]
        results = sp.artists(batch)
        # sp.artists returns a dict; the artist objects are under "artists"
        for artist in results["artists"]:
            if artist and artist.get("genres"):
                genres.append(artist["genres"][0])
            else:
                genres.append(None)
    return genres

# Load song IDs and artist IDs
song_ids = pd.read_csv("song_ids.csv")["song_id"].tolist()
artist_ids = pd.read_csv("artist_ids.csv")["artist_id"].tolist()

# Load data in batches
batches = []
for i in range(0, len(song_ids), batch_size):
    batch_song_ids = song_ids[i:i+batch_size]
    batch_artist_ids = artist_ids[i:i+batch_size]
    
    batch_features = get_song_features_batch(batch_song_ids)
    batch_genres = get_song_genres_batch(batch_artist_ids)
    
    batch_data = pd.DataFrame({
        "song_id": batch_song_ids,
        "artist_id": batch_artist_ids,
        "features": batch_features,
        "genre": batch_genres
    })
    
    batches.append(batch_data)

# Concatenate batches
df = pd.concat(batches, ignore_index=True)


# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectorizer for song titles
vectorizer = TfidfVectorizer()

# Fit vectorizer to training data
vectorizer.fit(X_train)

# Transform training and testing data
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Random Forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train_tfidf, y_train)

# Make predictions on testing data
y_pred = rfc.predict(X_test_tfidf)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))

Max Retries reached


SpotifyException: http status: 429, code:-1 - /v1/audio-features/?ids=6xsEAm6w9oMQYYg3jkEkMT,466s1BacUmiRdR3ISvNjyx,22VdIZQfgXJea34mQxlt81,6wDviYDtmSDZ0S6TVMM9Vc,1OOtq8tRnDM8kG2gqUPjAj,3S2R0EVwBSAVMd5UMgKTL0,4uLU6hMCjMI75M1A2tKUQC,30UFKKWSOC2Xr6KfWcyvsI,0pdfN7nOHMYmKykzu1cyfm,3whrwq4DtvucphBPUogRuJ,3bsycjdQtbcJeR6822SBvd,1ve0SgTZkv3wdggJLqtBYU,5vlEg2fT4cFWAqU5QptIpQ,2F1fnE1a8zQCogM6jJifHH,57bgtoPSgt236HzfBOd8kj,0EMmVUYs9ZZRHtlADB88uz,0q21FNwES2bbtcduB6kjEU,7tlcsqahVxD2kkTMzKBVXD,4aapF01SjrSPovA6vjU1JW,7iDa6hUg2VgEL1o1HjmfBn,4hTZNimQzSOpFI1NljSFEA,6Ex1as5AIibDGYpVJe18QR,4Sib57MmYGJzSvkW84jTwh,2eQKwLqZ0t1hIBsYwsAedh,6P4d1NWBCNIYZjzF9k1mVN,1hGy2eLcmC8eKx7qr1tOqx,2yPoXCs7BSIUrucMdK5PzV,4DvhkX2ic4zWkQeWMwQ2qf,5ChkMS8OtdzJeqyybCc9R5,7EFVJfuaqhIIvzNHZpEpth,5zA8vzDGqPl2AzZkEYQGKh,1YaK2hxBcOHFQXKfeSA3Oh,4NtUY5IGzHCaqfZemmAu56,1TwLKNsCnhi1HxbIi4bAW0,4w0ezrWpegj1GHJ45y0kXc,2QJnTfMpNG05KFf2E3gVIJ,4VdgVYzrI5lmh6zC9BzNOO,1jgmL1VK9U7XyKIOfVBbqJ,0RJWhctsc1G1Hg3Ov2th7x,20I6sIOMTCkB6w7ryavxtO,3oTlkzk1OtrhH8wBAduVEi,1G391cbiT3v3Cywg8T7DM1,4PrMtqNCkVtMrD8Rzp4OmN,0wagV3icLdCE7uP7rjyOfY,131OLY5J8XyfGuSjXRiTRM,6nCDnzErqalOaIY3EJM8NK,6tXjP6xgPJ7Xr1igrO6bOE,2Iib2MV3ECFJAourgP9dlY,7rcF9Sx3vjKE2UBvNx2Ml1,14EgW52HVqnLHd30bgbPxg:
 Max Retries, reason: too many 429 error responses
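
HTTP 429 means the API rate limit was exceeded, and the client gave up after its built-in retries. A common mitigation is to make fewer, larger requests and to wait progressively longer between attempts. The sketch below shows exponential backoff as a generic helper. `call_with_backoff`, `FakeRateLimit`, and `flaky_request` are hypothetical names, and the check on an `http_status` attribute assumes the exception exposes the status code the way the message above suggests.

```python
import time

def call_with_backoff(fn, is_rate_limited, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a rate-limit error wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or the final failure.
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Offline demo: a fake request that fails twice before succeeding.
class FakeRateLimit(Exception):
    http_status = 429

attempts = {"count": 0}
def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimit()
    return "ok"

delays = []  # record the waits instead of actually sleeping
outcome = call_with_backoff(
    flaky_request,
    lambda exc: getattr(exc, "http_status", None) == 429,
    sleep=delays.append,
)
print(outcome, delays)  # ok [1.0, 2.0]
```

With the real client, `fn` could be something like `lambda: sp.audio_features(batch)`. If the response carries a `Retry-After` header, honoring that value is preferable to a fixed schedule.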

In [6]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report



# Prepare data for modeling
df = pd.read_csv("spotify_features_revised.csv")
X = df.drop("genre", axis=1)
y = df["genre"]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectorizer for song titles
vectorizer = TfidfVectorizer()

# Fit vectorizer to training data
vectorizer.fit(X_train)

# Transform training and testing data
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Random Forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train_tfidf, y_train)

# Make predictions on testing data
y_pred = rfc.predict(X_test_tfidf)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))

ValueError: Found input variables with inconsistent numbers of samples: [15, 3432]
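
The mismatch `[15, 3432]` most likely comes from passing a whole DataFrame to `TfidfVectorizer`. Iterating a DataFrame yields its column labels, so the vectorizer treats the 15 column names as 15 documents, while `y_train` still has 3432 rows. The small made-up frame below reproduces the effect.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Five rows, three numeric columns (values are made up).
X_demo = pd.DataFrame({
    "danceability": [0.7, 0.5, 0.6, 0.4, 0.8],
    "energy": [0.6, 0.9, 0.3, 0.5, 0.7],
    "tempo": [120.0, 98.0, 140.0, 110.0, 100.0],
})

# The vectorizer iterates over the DataFrame, which yields column names,
# so it builds one row per column instead of one row per song.
tfidf = TfidfVectorizer().fit_transform(X_demo)
print(tfidf.shape[0], len(X_demo))  # 3 5
```

TF-IDF is meant for a single column of text, such as song titles. The numeric audio features can go to the classifier directly, as in the earlier cells.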

In [4]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Spotify API credentials, read from the environment rather than hard-coded
import os

client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]

# Authenticate with Spotify API
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=client_id, client_secret=client_secret))

# Retrieve song IDs and artist IDs from playlist
playlist_id = "7qn7AkdvLCmTnYNgCbv9Wo"
results = sp.playlist_tracks(playlist_id)

song_ids = []
artist_ids = []
for item in results["items"]:
    song_ids.append(item["track"]["id"])
    artist_ids.append(item["track"]["artists"][0]["id"])

# Save song IDs and artist IDs to CSV
import pandas as pd

df_song_ids = pd.DataFrame({"song_id": song_ids})
df_artist_ids = pd.DataFrame({"artist_id": artist_ids})

df_song_ids.to_csv("song_ids.csv", index=False)
df_artist_ids.to_csv("artist_ids.csv", index=False)
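
`sp.playlist_tracks` returns one page of results at a time, so the loop above only sees the first page of a long playlist. The sketch below follows the `next` link of each page until it runs out and skips entries with no track. `collect_playlist_ids` and the fake pages are hypothetical; with spotipy, the `next_page` argument would be `sp.next`.

```python
def collect_playlist_ids(first_page, next_page):
    """Walk every page of a playlist and collect track and first-artist IDs."""
    song_ids, artist_ids = [], []
    page = first_page
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track or not track.get("id"):
                continue  # skip removed or local tracks without an ID
            song_ids.append(track["id"])
            artist_ids.append(track["artists"][0]["id"])
        # Follow the "next" link only when the page has one.
        page = next_page(page) if page.get("next") else None
    return song_ids, artist_ids

# Offline demo with two fake pages shaped like the playlist response.
page2 = {"items": [{"track": {"id": "t3", "artists": [{"id": "a3"}]}}], "next": None}
page1 = {
    "items": [
        {"track": {"id": "t1", "artists": [{"id": "a1"}]}},
        {"track": None},
        {"track": {"id": "t2", "artists": [{"id": "a2"}]}},
    ],
    "next": "page2-url",
}

songs, artists = collect_playlist_ids(page1, lambda page: page2)
print(songs, artists)  # ['t1', 't2', 't3'] ['a1', 'a2', 'a3']
```

Appending both IDs in the same loop iteration keeps the two lists aligned, which matters because the batch-loading cell pairs `song_ids` and `artist_ids` by position.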