### Song Recommender System: Bita & Sindhu
- This project implements a content-based song recommender system using audio features extracted from two datasets of high- and low-popularity songs. The main goal of the system is to suggest songs that are musically similar to a given input track based on its acoustic characteristics.


In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler


In [7]:
high_popularity_df = pd.read_csv("high_popularity_spotify_data.csv")
low_popularity_df = pd.read_csv("low_popularity_spotify_data.csv")

In [9]:
high_popularity_df.head()

Unnamed: 0,energy,tempo,danceability,playlist_genre,loudness,liveness,valence,track_artist,time_signature,speechiness,...,instrumentalness,track_album_id,mode,key,duration_ms,acousticness,id,playlist_subgenre,type,playlist_id
0,0.592,157.969,0.521,pop,-7.777,0.122,0.535,"Lady Gaga, Bruno Mars",3,0.0304,...,0.0,10FLjwfpbxLmW8c25Xyc2N,0,6,251668,0.308,2plbrEY59IikOBgBGLjaoe,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
1,0.507,104.978,0.747,pop,-10.171,0.117,0.438,Billie Eilish,4,0.0358,...,0.0608,7aJuG4TFXa2hmE4z1yxc3n,1,2,210373,0.2,6dOtVTDdiauQNBQEDOtlAB,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
2,0.808,108.548,0.554,pop,-4.169,0.159,0.372,Gracie Abrams,4,0.0368,...,0.0,0hBRqPYPXhr1RkTDG3n4Mk,1,1,166300,0.214,7ne4VBA60CxGM75vw0EYad,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
3,0.91,112.966,0.67,pop,-4.07,0.304,0.786,Sabrina Carpenter,4,0.0634,...,0.0,4B4Elma4nNDUyl6D5PvQkj,0,0,157280,0.0939,1d7Ptw3qYcfpdLNL5REhtJ,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M
4,0.783,149.027,0.777,pop,-4.477,0.355,0.939,"ROSÉ, Bruno Mars",4,0.26,...,0.0,2IYQwwgxgOIn7t3iF6ufFD,0,0,169917,0.0283,5vNRhkKd0yEAg8suGBpjeY,mainstream,audio_features,37i9dQZF1DXcBWIGoYBM5M


In [11]:
low_popularity_df.head()

Unnamed: 0,time_signature,track_popularity,speechiness,danceability,playlist_name,track_artist,duration_ms,energy,playlist_genre,playlist_subgenre,...,track_album_id,playlist_id,track_id,valence,key,tempo,loudness,acousticness,liveness,track_album_name
0,4.0,23,0.0393,0.636,Rock Classics,Creedence Clearwater Revival,138053.0,0.746,rock,classic,...,4A8gFwqd9jTtnsNwUu3OQx,37i9dQZF1DWXRqgorJj26U,5e6x5YRnMJIKvYpZxLqdpH,0.432,0.0,132.31,-3.785,0.0648,0.173,The Long Road Home - The Ultimate John Fogerty...
1,4.0,53,0.0317,0.572,Rock Classics,Van Halen,241600.0,0.835,rock,classic,...,2c965LEDRNrXXCeBOAAwns,37i9dQZF1DWXRqgorJj26U,5FqYA8KfiwsQvyBI4IamnY,0.795,0.0,129.981,-6.219,0.171,0.0702,The Collection
2,4.0,55,0.0454,0.591,Rock Classics,Stevie Nicks,329413.0,0.804,rock,classic,...,3S404OgKoVQSJ3xXrDVlp8,37i9dQZF1DWXRqgorJj26U,5LNiqEqpDc8TuqPy79kDBu,0.658,0.0,111.457,-7.299,0.327,0.0818,Bella Donna (Deluxe Edition)
3,4.0,64,0.101,0.443,Jazz Classics,"Ella Fitzgerald, Louis Armstrong",185160.0,0.104,jazz,classic,...,1y5KGkUKO0NG32MhIIagCA,37i9dQZF1DXbITWG1ZJKYt,78MI7mu1LV1k4IA2HzKmHe,0.394,0.0,76.474,-17.042,0.913,0.191,"Love, Ella"
4,4.0,62,0.0298,0.685,Jazz Classics,Galt MacDermot,205720.0,0.472,jazz,classic,...,6f4b9wVTkKAf096k4XG6x5,37i9dQZF1DXbITWG1ZJKYt,6MN6yRVriszuyAVlyF8ndB,0.475,9.0,80.487,-9.691,0.785,0.224,Shapes of Rhythm/Woman Is Sweeter


In [13]:
combined_df = pd.concat([high_popularity_df, low_popularity_df])

In [15]:
print(combined_df.columns)

Index(['energy', 'tempo', 'danceability', 'playlist_genre', 'loudness',
       'liveness', 'valence', 'track_artist', 'time_signature', 'speechiness',
       'track_popularity', 'track_href', 'uri', 'track_album_name',
       'playlist_name', 'analysis_url', 'track_id', 'track_name',
       'track_album_release_date', 'instrumentalness', 'track_album_id',
       'mode', 'key', 'duration_ms', 'acousticness', 'id', 'playlist_subgenre',
       'type', 'playlist_id'],
      dtype='object')


In [17]:
print(combined_df.shape)

(4831, 29)


- The combined dataset has 4,831 rows (one per track entry) and 29 columns (attributes).

### How the Recommender Works
- The system focuses on audio-based features that influence the musical "feel" of a song, such as:
1. Energy
2. Tempo
3. Danceability
4. Loudness
5. Liveness
6. Valence
7. Instrumentalness
8. Acousticness
9. Mode
10. Key


- These features are selected because they are intrinsic to the audio signal and directly shape the listening experience, making them suitable for content-based recommendation.


### Preprocessing Steps
- The dataset is first cleaned by removing metadata columns (e.g., artist names, album IDs, playlist names) that are not useful for computing similarity.


- The selected audio features are standardized with StandardScaler (zero mean, unit variance per feature) so that features on larger numeric scales, such as tempo in BPM or loudness in dB, do not dominate the similarity score.


- A cosine similarity matrix is computed from the standardized feature vectors. Cosine similarity measures the angle between two songs' vectors in the multi-dimensional feature space, so it captures how similar their feature profiles are regardless of vector magnitude.
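
The standardization and similarity steps can be sketched on a toy example. The three tracks and their feature values below are made up for illustration and are not taken from the dataset. The snippet uses plain NumPy to spell out what the two steps compute: per-column standardization, then the dot product of two vectors divided by the product of their norms.

```python
import numpy as np

# Hypothetical feature rows: [energy, tempo (BPM), loudness (dB)]
toy_features = np.array([
    [0.90, 150.0, -4.0],
    [0.85, 145.0, -5.0],
    [0.20, 80.0, -17.0],
])

# Standardize each column to zero mean and unit variance
# (population standard deviation, which is what StandardScaler uses)
toy_scaled = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)

# Cosine similarity: dot product divided by the product of the vector norms
norms = np.linalg.norm(toy_scaled, axis=1, keepdims=True)
unit_rows = toy_scaled / norms
toy_sim = unit_rows @ unit_rows.T

print(np.round(toy_sim, 2))
```

In this toy data, tracks 0 and 1 have nearly the same profile, so their similarity is close to 1, while track 2 points in the opposite direction after standardization and gets a negative score against both.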


In [22]:
# Drop columns that aren't relevant for the recommendation system
columns_to_drop = ['playlist_genre', 'track_artist', 'track_popularity', 'track_href', 
                  'uri', 'track_album_name', 'track_album_id', 'track_album_release_date', 
                  'analysis_url', 'time_signature', 'playlist_name', 'playlist_subgenre', 'playlist_id']
combined_df = combined_df.drop(columns=columns_to_drop)


In [24]:
# Select the audio-related features
features = ['energy', 'tempo', 'danceability', 'loudness', 'liveness', 'valence', 
            'instrumentalness', 'acousticness', 'mode', 'key']

# Standardize these features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(combined_df[features])

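# Build a new DataFrame from the scaled features (it gets a fresh 0..n-1 index)
scaled_df = pd.DataFrame(scaled_features, columns=features)
scaled_df['track_name'] = combined_df['track_name'].values  # Keep track name for reference

# Drop rows with any NaNs in the selected features (StandardScaler passes NaNs through),
# then reset the index so row labels match row positions in the similarity matrix
scaled_df = scaled_df.dropna(subset=features).reset_index(drop=True)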


### Calculate Cosine Similarities

##### Recommendation
- When a user inputs a song name:
1. The system looks up the song's position in the dataset.
2. It retrieves the similarity scores between that song and all others from the cosine similarity matrix.
3. It ranks the songs by similarity and skips the top-ranked entry, which is normally the input song itself.
4. It returns the top N most similar songs.
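
The steps above can be sketched with NumPy's `argsort`. The helper `top_similar`, the toy titles, and the 4x4 similarity matrix are hypothetical and only serve as an illustration. Instead of skipping just the single top hit, this variant filters out every row that shares the input title, which also covers a track that appears more than once in the data.

```python
import numpy as np

def top_similar(song_name, titles, sim_matrix, top_n=5):
    """Return (title, score) pairs for the top_n rows most similar to song_name.

    Hypothetical helper: titles is a sequence of track names aligned
    row-for-row with sim_matrix.
    """
    titles = np.asarray(titles)
    matches = np.flatnonzero(titles == song_name)
    if matches.size == 0:
        return []

    # Step 1: position of the first matching row
    idx = matches[0]
    # Step 2: similarity of that row to every other row
    scores = sim_matrix[idx]
    # Step 3: rank from most to least similar, skipping rows with the input title
    order = np.argsort(scores)[::-1]
    ranked = [i for i in order if titles[i] != song_name]
    # Step 4: keep the top N
    return [(str(titles[i]), float(scores[i])) for i in ranked[:top_n]]

# Made-up data: "Song A" appears twice, as a duplicate entry would
toy_titles = ["Song A", "Song B", "Song A", "Song C"]
toy_sim = np.array([
    [1.00, 0.80, 1.00, 0.10],
    [0.80, 1.00, 0.80, 0.30],
    [1.00, 0.80, 1.00, 0.10],
    [0.10, 0.30, 0.10, 1.00],
])

print(top_similar("Song A", toy_titles, toy_sim, top_n=2))
```

Returning a list rather than printing keeps the ranking logic easy to test and reuse. Filtering by title is one possible design choice; it would also hide a different song that happens to share the same name.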


In [27]:
# Step 1: Compute cosine similarity matrix
similarity_matrix = cosine_similarity(scaled_df[features])

# Step 2: Recommender function
def recommend_song(song_name, df, sim_matrix, top_n=5):
    """Print the top_n tracks most similar to song_name."""
    matches = (df['track_name'] == song_name).to_numpy()
    if not matches.any():
        print("Song not found.")
        return

    # Positional index of the first match; sim_matrix rows are positional,
    # so this stays correct even if df's index labels have gaps
    idx = matches.argmax()

    # Get similarity scores for the song
    sim_scores = list(enumerate(sim_matrix[idx]))

    # Sort by similarity score and skip the top hit, which is normally the song itself.
    # Note: a duplicate entry of the same track also scores 1.00 and can still be listed.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]

    # Print top recommendations
    print(f"\nTop {top_n} songs similar to '{song_name}':")
    for i, (song_idx, score) in enumerate(sim_scores, start=1):
        similar_song = df.iloc[song_idx]['track_name']
        print(f"{i}. {similar_song} (Similarity Score: {score:.2f})")

In [29]:
recommend_song("Shape of You", scaled_df, similarity_matrix)


Top 5 songs similar to 'Shape of You':
1. Shape of You (Similarity Score: 1.00)
2. Because Of You (Similarity Score: 0.96)
3. Matuto Aperreado (Similarity Score: 0.95)
4. Baby (Similarity Score: 0.94)
5. Personal (Similarity Score: 0.94)


### Example Result
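- For the input song "Shape of You", the first entry in the printed output is "Shape of You" itself (score 1.00), most likely because the track appears more than once in the combined dataset. That leaves four distinct recommendations:
1. Because Of You (0.96)
2. Matuto Aperreado (0.95)
3. Baby (0.94)
4. Personal (0.94)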


- These results suggest that the system retrieves songs with similar audio feature profiles (for example, energy, tempo, and valence), regardless of artist or genre. Because the ranking uses cosine similarity on standardized audio features only, popularity plays no role in the score, and each recommendation can be traced back to the features that drive it. The approach could serve as a foundation for hybrid systems that add collaborative filtering or deep learning in future iterations.
