# Spotify Soul Recommender Jupyter Notebook

## Introduction

The Spotify Soul Recommender is a machine learning project designed to curate a personalized playlist of soulful songs based on a user's existing favorite tracks. By analyzing the genres and features of the songs a user enjoys, the project identifies and recommends soulful tracks that align with their musical preferences. The goal is to enhance the user's listening experience by introducing them to new songs that fit the vibe of contemporary R&B, lofi, soul, jazz, and similar genres.

## Setup

### Requirements
- Python 3.8+
- Spotipy
- Pandas
- Scikit-learn (for the machine learning model)
- Your Spotify API credentials

This notebook provides a comprehensive overview and guide for setting up and running the Spotify Soul Recommender project. It combines explanations with executable code snippets for a hands-on experience. Remember to replace placeholders with actual data and credentials before running the code.

### 1. Clone the Repository

In [None]:
git clone https://github.com/tasha-2000/Spotify-Soul-Recommender.git
cd Spotify-Soul-Recommender

### 2. Install Dependencies

Make sure you have Python 3.8 or higher installed. Then run each install command one by one:

In [None]:
pip install spotipy
pip install pandas
pip install scikit-learn
pip install imbalanced-learn

### 3. Configure Spotify API Credentials

To use the Spotify API, you need to set up your Spotify API credentials. Replace the placeholders in the file `credentials.py` with your actual credentials:

In [None]:
clientId = 'your_spotify_client_id'
clientSecret = 'your_spotify_client_secret'
redirectUri = 'your_app_redirect_uri'
username = 'your_spotify_username'

## Data Collection

Data is collected from Spotify using the Spotipy library. Here is a snippet of code to initialize the Spotify client:

In [6]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import credentials

def SetUpSpotifyClient():
    """
    Initializes and returns a Spotify client using OAuth.
    """
    # Space-separated list of the permissions the app requests from the user
    scope = 'user-top-read playlist-modify-private playlist-modify-public user-library-read'
    return spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=credentials.clientId,
                                                     client_secret=credentials.clientSecret,
                                                     redirect_uri=credentials.redirectUri,
                                                     scope=scope))

There are two sets of data collected:

1. Soulful tracks taken directly from official Spotify playlists
2. The user's favorite soulful tracks

In the context of my project, a track is considered 'soulful' if its genres contain any one of these keywords:

In [19]:
keywords = {'contemporary', 'r&b', 'lofi', 'soul', 'jazz', 'neo soul', 'blues', 'chillwave', 'lo-fi', 'chill'}
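
The matching rule can be illustrated with a small standalone helper. This is a minimal sketch that assumes a plain substring test against lowercase genre labels; `is_soulful` is a hypothetical name, not a function from the project.

```python
# Hypothetical helper: substring match of keywords against genre labels.
def is_soulful(genres, keywords):
    """Return True if any keyword appears inside any genre string."""
    return any(keyword in genre.lower() for genre in genres for keyword in keywords)


keywords = {'contemporary', 'r&b', 'lofi', 'soul', 'jazz', 'neo soul',
            'blues', 'chillwave', 'lo-fi', 'chill'}

print(is_soulful(['neo soul', 'alternative r&b'], keywords))  # True
print(is_soulful(['death metal'], keywords))                  # False
```

Because this is a substring test, a short keyword such as 'chill' also matches longer genre names that contain it.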

### Soulful tracks from Spotify official playlist

The process involves searching for up to 50 playlists per keyword (i.e., playlists whose names match the soulful keywords above) and extracting the first 50 track IDs from each of these playlists. This code snippet demonstrates how that is done:

In [17]:
# Function to find Spotify official playlists with soulful genres
def FindSoulfulPlaylists(sp, keywords):
    playlistIds = set()
    for keyword in keywords:
        results = sp.search(q=f'"{keyword}"', type='playlist', limit=50)
        for playlist in results['playlists']['items']:
            # Skip empty entries and keep only playlists owned by a Spotify account
            if playlist and 'spotify' in playlist['owner']['id']:
                playlistIds.add(playlist['id'])
    return playlistIds

# Function to get tracks from playlists
def GetTracksFromPlaylist(sp, playlistIds):
    trackIds = []
    for playlistId in playlistIds:
        results = sp.playlist_items(playlistId, limit=50, fields='items.track.id')
        trackIds.extend([item['track']['id'] for item in results['items'] if item['track']])
    return trackIds

### The user's favorite soulful tracks

This works by searching for tracks that match the specified genres within the user's top 50 tracks:

In [18]:
# Function to find soulful tracks in the user's library
def FindSoulfulTracks(sp):
    keywords = {'contemporary', 'r&b', 'lofi', 'soul', 'jazz', 'neo', 'blue', 'chillwave', 'lo-fi', 'chill'}
    results = sp.current_user_top_tracks(limit=50, time_range='long_term')
    trackIds = []

    for item in results['items']:
        artistIds = [artist['id'] for artist in item['artists']]
        isMatch = False
        for artistId in artistIds:
            artistInfo = sp.artist(artistId)
            if 'genres' in artistInfo and any(keyword in genre for keyword in keywords for genre in artistInfo['genres']):
                isMatch = True
                break
        if isMatch:
            trackIds.append(item['id'])
    return trackIds

## Data Processing

After collecting the data, both datasets are processed to extract relevant features from each track. This includes audio features like danceability, energy, loudness, and more. This information is used to build two datasets, `reccomendations_library.csv` and `user_favs.csv`, for model training.

In [None]:
# Function to get track features
def GetTrackFeatures(sp, trackId):
    meta = sp.track(trackId)
    features = sp.audio_features(trackId)
    
    # audio_features returns a list; its entry is None when a track has no features
    if features and features[0]:
        trackInfo = {
            'trackId': trackId,
            'name': meta['name'],
            'album': meta['album']['name'],
            'artist': meta['album']['artists'][0]['name'],
            'releaseDate': meta['album']['release_date'],
            'length': meta['duration_ms'],
            'popularity': meta['popularity'],
            'acousticness': features[0]['acousticness'],
            'danceability': features[0]['danceability'],
            'energy': features[0]['energy'],
            'instrumentalness': features[0]['instrumentalness'],
            'liveness': features[0]['liveness'],
            'loudness': features[0]['loudness'],
            'speechiness': features[0]['speechiness'],
            'tempo': features[0]['tempo'],
            'timeSignature': features[0]['time_signature']
        }
        return trackInfo
    else:
        print(f"Features not found for track ID: {trackId}")
        return None
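
Calling the function above once per track costs two API requests per track. Spotipy's `audio_features` also accepts a list of IDs (up to 100 per call), so the feature lookups could be batched. The sketch below shows one way to do that; `chunked` and `get_features_in_batches` are hypothetical helpers, not part of the project.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_features_in_batches(sp, track_ids, batch_size=100):
    """Fetch audio features for many tracks, one request per batch (hypothetical helper)."""
    features = []
    for batch in chunked(track_ids, batch_size):
        # sp.audio_features takes a list of IDs and returns one entry per ID;
        # an entry can be None when no features exist for that track.
        features.extend(f for f in sp.audio_features(batch) if f)
    return features


print([len(batch) for batch in chunked(list(range(250)))])  # [100, 100, 50]
```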


## EDA

Once I had all the features, I did a quick exploratory data analysis (EDA). I created box plots to compare the distribution of each feature between user favorites and non-favorites.

### Acousticness
![Acousticness Comparison](./ImagesJupyterNotebook/Acousticness.png)

### Energy
![Energy Comparison](./ImagesJupyterNotebook/Energy.png)

### Instrumentalness
![Instrumental Comparison](./ImagesJupyterNotebook/Instrumentals.png)

### Liveness
![Liveness Comparison](./ImagesJupyterNotebook/liveliness.png)

### Loudness
![Loudness Comparison](./ImagesJupyterNotebook/loudness.png)

### Speechiness
![Speechiness Comparison](./ImagesJupyterNotebook/speechiness.png)

### Tempo
![Tempo Comparison](./ImagesJupyterNotebook/Tempo.png)


Once this was done, I added a `favorite` field that represents whether or not the user liked the song: 1 = favorite, 0 = non-favorite.
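
A minimal sketch of the labelling step, followed by stacking the two tables into one, might look like this. The rows are made up for illustration, and the rule for duplicates (keep the favorite row) is an assumption rather than the project's confirmed behavior.

```python
import pandas as pd

# Tiny made-up stand-ins for the two datasets (values are illustrative only).
userFavsDf = pd.DataFrame({'trackId': ['a1', 'a2'], 'energy': [0.41, 0.52]})
libraryDf = pd.DataFrame({'trackId': ['b1', 'b2', 'b3'], 'energy': [0.70, 0.33, 0.58]})

# Label each source: 1 = favorite, 0 = non-favorite.
userFavsDf['favorite'] = 1
libraryDf['favorite'] = 0

# Stack both sets into one table; if a track appears in both, keep the favorite row.
combinedDf = (
    pd.concat([userFavsDf, libraryDf], ignore_index=True)
    .drop_duplicates(subset='trackId', keep='first')
)

print(combinedDf['favorite'].value_counts().to_dict())
```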

I combined both datasets into one and addressed the imbalance between favorites and non-favorites using SMOTE. The initial ratio of favorites to non-favorites was 28:7436.
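
To show what SMOTE does conceptually, here is a simplified pure-Python sketch of its core idea: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the `imblearn` implementation; in practice the `SMOTE` class from `imblearn.over_sampling` and its `fit_resample(X, y)` method would do this work. The function name, `k`, and the data points are all assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up (energy, acousticness) pairs standing in for the favorite class.
favorites = [(0.40, 0.50), (0.45, 0.55), (0.50, 0.40), (0.55, 0.60)]
newPoints = smote_sketch(favorites, n_new=6)
print(len(favorites) + len(newPoints))  # 10
```

Because every synthetic point is an interpolation, the new samples stay inside the region already covered by the real favorites.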

I also removed all of the non-numerical fields and focused on using these fields to predict the recommendations:

In [28]:
numericFields = ['length', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'timeSignature', 'favorite']

Then I split this dataset into 80% training data and 20% test data, and removed the `favorite` field from the test data.
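
One common way to produce such a split is scikit-learn's `train_test_split`. The sketch below uses made-up rows; the `stratify` argument and the `random_state` value are assumptions, not settings confirmed by the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the balanced numeric dataset.
dataDf = pd.DataFrame({
    'energy': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'tempo': [80, 90, 100, 110, 120, 85, 95, 105, 115, 125],
    'favorite': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# 80/20 split; stratify keeps the favorite ratio similar in both parts,
# and random_state (an arbitrary choice here) makes the split repeatable.
trainingDf, testDf = train_test_split(
    dataDf, test_size=0.2, stratify=dataDf['favorite'], random_state=42
)

# Drop the label from the test features.
testDataDf = testDf.drop(columns=['favorite'])

print(len(trainingDf), len(testDataDf))  # 8 2
```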

Here is a summary of the datasets:

In [23]:
import pandas as pd
testDataDf = pd.read_csv('test_data.csv')
print(testDataDf.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1497 entries, 0 to 1496
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   length            1497 non-null   int64  
 1   popularity        1497 non-null   int64  
 2   danceability      1497 non-null   float64
 3   acousticness      1497 non-null   float64
 4   energy            1497 non-null   float64
 5   instrumentalness  1497 non-null   float64
 6   liveness          1497 non-null   float64
 7   loudness          1497 non-null   float64
 8   speechiness       1497 non-null   float64
 9   tempo             1497 non-null   float64
 10  timeSignature     1497 non-null   int64  
dtypes: float64(8), int64(3)
memory usage: 128.8 KB
None


In [24]:
import pandas as pd
trainingDf = pd.read_csv('training_data.csv')
print(trainingDf.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11938 entries, 0 to 11937
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   length            11938 non-null  int64  
 1   popularity        11938 non-null  int64  
 2   danceability      11938 non-null  float64
 3   acousticness      11938 non-null  float64
 4   energy            11938 non-null  float64
 5   instrumentalness  11938 non-null  float64
 6   liveness          11938 non-null  float64
 7   loudness          11938 non-null  float64
 8   speechiness       11938 non-null  float64
 9   tempo             11938 non-null  float64
 10  timeSignature     11938 non-null  int64  
 11  favorite          11938 non-null  int64  
dtypes: float64(8), int64(4)
memory usage: 1.1 MB
None


In [25]:
import pandas as pd
trainingDf = pd.read_csv('training_data.csv')
print(trainingDf.describe())

              length    popularity  danceability  acousticness        energy  \
count   11938.000000  11938.000000  11938.000000  11938.000000  11938.000000   
mean   206002.036438     57.272826      0.557962      0.492555      0.456370   
std     65758.982874     15.300882      0.263681      0.218605      0.185388   
min     42640.000000      0.000000      0.000008      0.000000      0.000072   
25%    167374.500000     49.000000      0.424077      0.311000      0.373000   
50%    192659.000000     60.000000      0.598890      0.509145      0.461505   
75%    239024.500000     68.000000      0.727636      0.667488      0.557781   
max    732553.000000    100.000000      0.996000      0.981000      0.990000   

       instrumentalness      liveness      loudness   speechiness  \
count      11938.000000  11938.000000  11938.000000  11938.000000   
mean           0.171020      0.163073    -10.326944      0.107826   
std            0.312735      0.109306      5.519599      0.097050   
min

## Model Building

After researching the performance of different models on similar projects, I concluded that a classification model, specifically a decision tree, would be the best approach. It had the highest F1 score compared to a Linear Regression Model and a Random Classifier Model. I also attempted to approach this as a clustering problem using KMeans; however, that model had very low accuracy. The `model.py` file includes the code for training the model with the scikit-learn library.
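
The contents of `model.py` are not reproduced here. As a rough illustration of training a decision tree and scoring it with F1 in scikit-learn, the following sketch uses synthetic data; the hyperparameters (`max_depth`, `random_state`) are illustrative choices, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 11 numeric features, binary label.
X, y = make_classification(n_samples=400, n_features=11, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.2, random_state=0)

# max_depth and random_state are illustrative choices, not the project's settings.
clfModel = DecisionTreeClassifier(max_depth=5, random_state=0)
clfModel.fit(XTrain, yTrain)

# F1 is the harmonic mean of precision and recall on the positive class.
score = f1_score(yTest, clfModel.predict(XTest))
print(f'F1 on held-out data: {score:.2f}')
```

F1 is a more informative metric than plain accuracy when one class is rare, as the favorites are in this project.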

Here is some data on the accuracy of the model after training:

## Recommendation Logic

The recommendation logic uses the trained model to predict whether or not the user will like each song in a new set of songs. Based on the predicted probabilities, it recommends the 10 songs that the user is most likely to enjoy.

In [33]:
# Function to recommend based on the trained model
def Recommend(testDataDf, clfModel):
    # Probability of the positive class (favorite = 1) for every candidate track
    probabilities = clfModel.predict_proba(testDataDf)[:, 1]
    testDataDf['predictedFavoriteProbability'] = probabilities
    # Rank candidates from most to least likely favorite and keep the top 10
    sortedTestDataDf = testDataDf.sort_values(by='predictedFavoriteProbability', ascending=False)
    top10Recommendations = sortedTestDataDf.head(10)
    top10Recommendations.to_csv('recs.csv')
    # DeleteFirstFieldAndSave is a project helper not shown in this notebook
    DeleteFirstFieldAndSave('recs.csv', 'recs.csv')
    return top10Recommendations

In [None]:
      length  popularity  danceability  acousticness  energy  instrumentalness  liveness  loudness  speechiness    tempo  timeSignature  predictedFavoriteProbability
301   201233          54         0.831        0.5280   0.401          0.001690    0.1400    -9.662       0.1000   77.430              4                           1.0
1201  219945          71         0.620        0.7290   0.456          0.000133    0.0919    -6.795       0.0528  110.999              4                           1.0
1427  247133          48         0.633        0.3920   0.550          0.000052    0.1130    -7.396       0.0937   87.501              4                           1.0
881   233107          45         0.687        0.5270   0.391          0.235000    0.3970   -10.210       0.1530  146.754              4                           1.0
682   114001          63         0.624        0.3770   0.396          0.041500    0.0710   -10.194       0.0728   92.001              3                           1.0
368   185617          59         0.601        0.2730   0.554          0.014500    0.1840   -10.849       0.0982  151.971              4                           1.0
997    69741          46         0.880        0.4710   0.229          0.022800    0.2570   -11.711       0.0350   80.990              4                           1.0
1174  295530          38         0.794        0.3900   0.277          0.000017    0.2070   -12.048       0.0402  109.965              4                           1.0
942   200760          69         0.681        0.0477   0.608          0.000000    0.2250    -6.093       0.2070   97.934              4                           1.0

This is cross-referenced with the combined dataset `combined_data.csv` to find the track IDs and create the playlist.

In [None]:
trackIDs = model.FindMatch('recs.csv', 'combined_data.csv')
print(trackIDs)

createPlaylist(sp, 'Spotify Soul', 'Testing Out My Recommender', trackIDs)

Image of the results:
![Spotify Results](./ImagesJupyterNotebook/SpotifyResults.PNG "Spotify Soul Playlist")
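
The internals of `model.FindMatch` are not shown in this notebook. One way such a cross-reference could work is an inner join on the shared numeric columns, sketched below with made-up rows. The function name `find_match_sketch` and the choice of key columns are assumptions, and exact matching on float columns only works when both files store identical values.

```python
import pandas as pd

# Made-up rows; the real inputs would be recs.csv and combined_data.csv.
combinedDf = pd.DataFrame({
    'trackId': ['t1', 't2', 't3'],
    'length': [201233, 219945, 180000],
    'tempo': [77.430, 110.999, 120.000],
})
recsDf = pd.DataFrame({'length': [219945, 201233], 'tempo': [110.999, 77.430]})


def find_match_sketch(recsDf, combinedDf, keys=('length', 'tempo')):
    """Return track IDs whose key columns equal those of a recommended row."""
    merged = recsDf.merge(combinedDf, on=list(keys), how='inner')
    return merged['trackId'].drop_duplicates().tolist()


print(sorted(find_match_sketch(recsDf, combinedDf)))  # ['t1', 't2']
```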

## Conclusion

The Spotify Soul Recommender project offers a personalized way to discover new soulful music tailored to a user's taste. Future improvements could include refining the model for better accuracy, expanding the dataset for a wider variety of songs, and incorporating user feedback to improve recommendations.