# Million Playlist Dataset Data Preprocessing

This Jupyter Notebook preprocesses the data retrieved from the Million Playlist Dataset and from Spotify's API, where each track is looked up by its URI identifier.

---

### Importing the required libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer

### Loading the data from the saved CSV file

In [2]:
# Forward slashes work on Windows, macOS and Linux, and match the path used when saving below
df = pd.read_csv('Data/tracks_features.csv')
df

Unnamed: 0,artist_name,track_name,album_uri,album_name,danceability,energy,key,loudness,mode,speechiness,...,valence,tempo,duration_ms,time_signature,genres,artist_popularity,artist_uri,track_popularity,track_uri,release_date
0,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,The Cookbook,0.904,0.813,4,-7.105,0,0.1210,...,0.810,125.461,226864,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,spotify:artist:2wIVse2owClT7go1WT98tk,69,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,2005-07-04
1,Missy Elliott,Work It,spotify:album:6DeU398qrJ1bLuryetSmup,Under Construction,0.884,0.677,1,-5.603,1,0.2830,...,0.584,101.868,263227,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,spotify:artist:2wIVse2owClT7go1WT98tk,71,spotify:track:3jagJCUbdqhDSPuxP8cAqF,2002-11-11
2,Missy Elliott,Get Ur Freak On,spotify:album:6epR3D622KWsnuHye7ApOl,Respect M.E.,0.794,0.805,0,-6.554,1,0.2300,...,0.658,177.799,236933,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,spotify:artist:2wIVse2owClT7go1WT98tk,45,spotify:track:3XplJgPz8VjbDzbGwGgZdq,2006-09-04
3,Missy Elliott,One Minute Man (feat. Ludacris),spotify:album:20t54K6C80QQH7vbcpfJcP,Miss E...So Addictive,0.622,0.669,9,-8.419,1,0.3290,...,0.570,93.839,252987,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,spotify:artist:2wIVse2owClT7go1WT98tk,58,spotify:track:0jG92AlXau21qgCQRxGLic,2001-05-14
4,Missy Elliott,Get Ur Freak On,spotify:album:20t54K6C80QQH7vbcpfJcP,Miss E...So Addictive,0.797,0.750,0,-9.369,1,0.2470,...,0.740,177.870,211120,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,spotify:artist:2wIVse2owClT7go1WT98tk,71,spotify:track:6zsk6uF3MxfIeHPlubKBvR,2001-05-14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34316,Layla,Oh My Love,spotify:album:4eTl12dc7uQXvgDhtMgW5p,Yellow Circles EP,0.434,0.279,8,-11.947,1,0.0465,...,0.157,145.264,203676,3,['indie anthem-folk'],23,spotify:artist:04BsVprJtIhl2C4fgPEz4W,30,spotify:track:0KMrYUEfexgam36li6d9F0,2013-12-02
34317,Aayushi,Diamond Child,spotify:album:5bWtDTwS9llWcnhmgRkav3,Diamond Child,0.416,0.394,11,-9.269,1,0.0641,...,0.131,81.988,237008,4,[],22,spotify:artist:1r2kTJ27zuaEoXasQT5NDd,0,spotify:track:1msfqzqHggvi1mlCT4Z7O5,2015-06-16
34318,Jon D,I Don't Know,spotify:album:2KEQtuVl1cYsTYtVRUrNVi,Roots,0.669,0.228,2,-12.119,1,0.0690,...,0.402,83.024,189184,4,['channel pop'],41,spotify:artist:5HCypjplgh5uQezvBpOfXN,22,spotify:track:3uCHI1gfOUL5j5swEh0TcH,2015-03-28
34319,Big Words,The Answer,spotify:album:5jrsRHRAmetu5e7RRBoxj7,"Hollywood, a Beautiful Coincidence",0.493,0.727,1,-5.031,1,0.2170,...,0.289,73.259,263680,4,['australian r&b'],41,spotify:artist:0sHN89qak07mnug3LVVjzP,37,spotify:track:0P1oO2gREMYUCoOkzYAyFu,2017-09-22


### Initial Data Exploration and handling missing values

**Checking the column dtypes**

In [3]:
df.dtypes

artist_name           object
track_name            object
album_uri             object
album_name            object
danceability         float64
energy               float64
key                    int64
loudness             float64
mode                   int64
speechiness          float64
acousticness         float64
instrumentalness     float64
liveness             float64
valence              float64
tempo                float64
duration_ms            int64
time_signature         int64
genres                object
artist_popularity      int64
artist_uri            object
track_popularity       int64
track_uri             object
release_date          object
dtype: object

**Checking for missing values. No column contains NaN values, so no imputation is needed and we move on.**

In [4]:
df.isna().sum()

artist_name          0
track_name           0
album_uri            0
album_name           0
danceability         0
energy               0
key                  0
loudness             0
mode                 0
speechiness          0
acousticness         0
instrumentalness     0
liveness             0
valence              0
tempo                0
duration_ms          0
time_signature       0
genres               0
artist_popularity    0
artist_uri           0
track_popularity     0
track_uri            0
release_date         0
dtype: int64
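
One caveat is worth a quick check: `isna()` only counts NaN entries, while the "genres" column stores lists as strings, so an artist without genres appears as the string `[]` (as in row 34317 above) and is not reported as missing. The sketch below counts such rows on a small made-up frame; the URIs and the names `toy`, `n_missing` and `n_empty` are placeholders for illustration, not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the 'genres' column, which stores lists as strings.
# The track URIs are invented placeholders.
toy = pd.DataFrame({
    "track_uri": ["spotify:track:a", "spotify:track:b", "spotify:track:c"],
    "genres": ["['dance pop', 'hip hop']", "[]", "['channel pop']"],
})

# isna() reports nothing because "[]" is a regular, non-null string.
n_missing = int(toy["genres"].isna().sum())

# Count the rows whose genre list is empty.
n_empty = int((toy["genres"].str.strip() == "[]").sum())

print(n_missing, n_empty)  # 0 1
```

Running the same comparison on `df['genres']` would show how many tracks end up with an all-zero genre vector after the encoding step.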

**To work with the numerical data, we now drop the name and URI columns. Note that we keep track_uri, so that recommendations can be mapped back to actual tracks.**

In [5]:
dropped_df = df.drop(['artist_name', 'artist_uri', 'track_name', 'album_uri', 'album_name'], axis = 1)
dropped_df

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,genres,artist_popularity,track_popularity,track_uri,release_date
0,0.904,0.813,4,-7.105,0,0.1210,0.0311,0.006970,0.0471,0.810,125.461,226864,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,69,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,2005-07-04
1,0.884,0.677,1,-5.603,1,0.2830,0.0778,0.000000,0.0732,0.584,101.868,263227,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,71,spotify:track:3jagJCUbdqhDSPuxP8cAqF,2002-11-11
2,0.794,0.805,0,-6.554,1,0.2300,0.5380,0.122000,0.0952,0.658,177.799,236933,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,45,spotify:track:3XplJgPz8VjbDzbGwGgZdq,2006-09-04
3,0.622,0.669,9,-8.419,1,0.3290,0.0266,0.000003,0.1520,0.570,93.839,252987,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,58,spotify:track:0jG92AlXau21qgCQRxGLic,2001-05-14
4,0.797,0.750,0,-9.369,1,0.2470,0.5330,0.108000,0.0950,0.740,177.870,211120,4,"['dance pop', 'hip hop', 'hip pop', 'neo soul'...",72,71,spotify:track:6zsk6uF3MxfIeHPlubKBvR,2001-05-14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34316,0.434,0.279,8,-11.947,1,0.0465,0.7700,0.042400,0.1330,0.157,145.264,203676,3,['indie anthem-folk'],23,30,spotify:track:0KMrYUEfexgam36li6d9F0,2013-12-02
34317,0.416,0.394,11,-9.269,1,0.0641,0.5130,0.001550,0.0988,0.131,81.988,237008,4,[],22,0,spotify:track:1msfqzqHggvi1mlCT4Z7O5,2015-06-16
34318,0.669,0.228,2,-12.119,1,0.0690,0.7920,0.065000,0.0944,0.402,83.024,189184,4,['channel pop'],41,22,spotify:track:3uCHI1gfOUL5j5swEh0TcH,2015-03-28
34319,0.493,0.727,1,-5.031,1,0.2170,0.0873,0.000000,0.1290,0.289,73.259,263680,4,['australian r&b'],41,37,spotify:track:0P1oO2gREMYUCoOkzYAyFu,2017-09-22


### Categorical Data Encoding

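Each artist's musical genres are stored as text in the "genres" column. To use them as a model feature, we need to encode them numerically. For this purpose, we use TfidfVectorizer from scikit-learn. Because its default tokenizer splits text into individual words, a multi-word genre such as "hip hop" contributes one feature column per word ("genre|hip" and "genre|hop") rather than a single column.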

In [6]:
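def genres_preprocess(df):
    # Each 'genres' value is a list stored as a string, e.g. "['dance pop', 'hip hop']".
    # Strip the surrounding brackets and split on commas. Quotes and spaces are left
    # in place; TfidfVectorizer's default tokenizer ignores them later.
    genres = df['genres'].apply(lambda x: x.strip("[]").split(","))
    return genres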


In [7]:
def vectorize_genres(data_df):
    # Create a copy of the input dataframe
    df = data_df.copy()

    # Preprocess the 'genres' column
    df['genres'] = genres_preprocess(df)

    # Create an instance of TfidfVectorizer
    tfidf = TfidfVectorizer()

    # Apply TF-IDF vectorization on the 'genres' column
    tfidf_matrix = tfidf.fit_transform(df['genres'].apply(lambda x: " ".join(x)))

    # Convert the TF-IDF matrix to a DataFrame that shares the input's index,
    # so the horizontal concat below aligns rows even if the index is not 0..n-1
    genre_df = pd.DataFrame(
        tfidf_matrix.toarray(),
        index=df.index,
        columns=['genre' + "|" + i for i in tfidf.get_feature_names_out()],
    )

    # Concatenate the original dataframe and the genre DataFrame horizontally
    final_df = pd.concat([df, genre_df], axis=1)
    
    # Return the final dataframe
    return final_df

final_df = vectorize_genres(dropped_df)
final_df

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,genre|ye,genre|yodeling,genre|york,genre|youth,genre|zambian,genre|zhongguo,genre|zilizopendwa,genre|zolo,genre|zouk,genre|zuliana
0,0.904,0.813,4,-7.105,0,0.1210,0.0311,0.006970,0.0471,0.810,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.884,0.677,1,-5.603,1,0.2830,0.0778,0.000000,0.0732,0.584,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.794,0.805,0,-6.554,1,0.2300,0.5380,0.122000,0.0952,0.658,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.622,0.669,9,-8.419,1,0.3290,0.0266,0.000003,0.1520,0.570,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.797,0.750,0,-9.369,1,0.2470,0.5330,0.108000,0.0950,0.740,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34316,0.434,0.279,8,-11.947,1,0.0465,0.7700,0.042400,0.1330,0.157,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34317,0.416,0.394,11,-9.269,1,0.0641,0.5130,0.001550,0.0988,0.131,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34318,0.669,0.228,2,-12.119,1,0.0690,0.7920,0.065000,0.0944,0.402,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34319,0.493,0.727,1,-5.031,1,0.2170,0.0873,0.000000,0.1290,0.289,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Since the genres are now vectorized, we drop the original "genres" column**

In [8]:
final_df = final_df.drop('genres', axis = 1)

### Release Date management

The "release_date" column records when each track was released. So that recommendations can be restricted to tracks from the same era, we keep only the decade.

In [9]:
def extract_decade(date_str):
    year = int(date_str.split('-')[0])
    decade = (year // 10) * 10
    return int(decade)

def preprocess_decade(data_df):
    # Make a copy of the original DataFrame
    df = data_df.copy()
    
    # Apply the function to the "release_date" column
    df['release_date'] = df['release_date'].apply(extract_decade)
    
    return df
    
# Replace each release date with its decade and inspect the result
final_df = preprocess_decade(final_df)
final_df['release_date']


0        2000
1        2000
2        2000
3        2000
4        2000
         ... 
34316    2010
34317    2010
34318    2010
34319    2010
34320    2010
Name: release_date, Length: 34321, dtype: int64
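
As a quick worked example of the integer-division step, the sketch below applies the same logic to a few date strings. The first two dates appear in the table above. The last two (year-only and year-month strings) are assumptions about other formats a release date might take; they are included to show that splitting on `-` still works for them.

```python
def extract_decade(date_str):
    # Take the year part before the first '-' and round it down to the decade.
    year = int(date_str.split('-')[0])
    return (year // 10) * 10

# Full dates from the data, plus two assumed shorter formats.
samples = ["2005-07-04", "2013-12-02", "1999", "2017-09"]
decades = [extract_decade(s) for s in samples]

print(decades)  # [2000, 2010, 1990, 2010]
```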

### Data Normalization

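**Since the metric we are going to use is sensitive to differences in scale, some columns must be normalized before modelling: duration_ms and time_signature. The helper below applies min-max scaling to the selected columns; note that it is only defined here and is not called before the data is saved.**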

In [10]:
def data_normalization(data_df, columns_to_scale):
    df = data_df.copy()
    # Create a MinMaxScaler object
    scaler = MinMaxScaler()

    # Apply the MinMaxScaler to the selected columns
    df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
    
    return df
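
To illustrate what the helper does, here is a small sketch on a made-up three-row frame (the values and the name `toy` are invented for the example). `MinMaxScaler` rescales each selected column independently with `(x - min) / (max - min)`, so the smallest value maps to 0 and the largest to 1, while columns that are not listed stay unchanged.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def data_normalization(data_df, columns_to_scale):
    # Work on a copy so the caller's DataFrame is not modified.
    df = data_df.copy()
    scaler = MinMaxScaler()
    df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
    return df

# Invented values, chosen only to make the scaling easy to follow.
toy = pd.DataFrame({
    "duration_ms": [200000, 250000, 300000],
    "time_signature": [3, 4, 5],
    "energy": [0.2, 0.5, 0.9],
})

scaled = data_normalization(toy, ["duration_ms", "time_signature"])

# Each scaled column now spans 0 to 1 (up to floating-point rounding);
# 'energy' and the original 'toy' frame are left as they were.
print(scaled)
```

One design point to keep in mind: the scaler is fitted inside the function and then discarded, so new tracks scaled later in a separate call would use their own minimum and maximum rather than the ones learned here.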
    

**Final dtype verification**

Since "track_uri" is the only object-type column left, and it serves only as an identifier, the data is ready to be used in the model.

In [11]:
object_columns = final_df.select_dtypes(include=['object'])
object_columns

Unnamed: 0,track_uri
0,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI
1,spotify:track:3jagJCUbdqhDSPuxP8cAqF
2,spotify:track:3XplJgPz8VjbDzbGwGgZdq
3,spotify:track:0jG92AlXau21qgCQRxGLic
4,spotify:track:6zsk6uF3MxfIeHPlubKBvR
...,...
34316,spotify:track:0KMrYUEfexgam36li6d9F0
34317,spotify:track:1msfqzqHggvi1mlCT4Z7O5
34318,spotify:track:3uCHI1gfOUL5j5swEh0TcH
34319,spotify:track:0P1oO2gREMYUCoOkzYAyFu


**Our data is now ready for the model implementation. We save it to a new CSV file**

In [12]:
# Save final_df to a CSV file
final_df.to_csv(r'Data/model_data.csv', index = False)  